
What Do Developers Need to Know About Kubernetes, Anyway?

By Austin Parker  |   Last modified on November 10, 2023

This article was originally published on Spiceworks.

Stop me if you’ve heard this one before: you just pushed and deployed your latest change to production, and it’s rolling out to your Kubernetes cluster. You sip your coffee as you wrap up some documentation when a ping in the ops channel catches your eye—a sales engineer is complaining that the demo environment is slow. Probably nothing to worry about, not like your changes had anything to do with that… but, minutes later, more alerts start to fire off. 

You pull up a dashboard, and it’s a Christmas tree of red and green indicators flashing at you. What’s the problem? Unavailable replicas, unknown pods, unsuccessful jobs: it’s a lot to take in, and the clamors of the sales engineers are picking up because they’re gonna be demoing in half an hour.

Kubernetes offers developers an appealing story—that it can do most of the heavy lifting associated with running a distributed application. The nice thing about that story is that it’s true! Vertical or horizontal autoscaling, setting and enforcing resource limits, or even basic workload scheduling used to be extremely challenging to accomplish on your own, requiring specialized tooling or cost-intensive IT infrastructure.

Too often, though, developers aren’t given the right tools to understand what’s happening in a cluster, or how it impacts their applications and code. Kubernetes monitoring remains focused on low-level node or pod metrics, with a healthy dash of time-consuming log searches on the side. It’s difficult to discover correlations between changes in cluster health and application behavior. It’s even harder to find the inverse, to understand how application behavior influences cluster health.

Running with Kubernetes vs. running on Kubernetes

Much of this tooling gap can be distilled down to how Kubernetes is managed and offered by platform teams. While Kubernetes itself is an increasingly popular deployment target (97% of organizations surveyed were using or evaluating Kubernetes as of 2022), we don’t see a lot of developers building applications that specifically leverage Kubernetes APIs.

This isn’t a bad thing at all, though! Kubernetes itself is an abstraction layer over compute, storage, memory, and networking. You don’t necessarily need to build applications that hook into the API to get value out of it. However, this is where the pain begins for developers; even if you aren’t building operators or using Kubernetes-native frameworks (like Quarkus), you’re going to rely on the underlying machinery of Kubernetes in order to handle things like service discovery, routing, storage, resource limits, scaling, and more.

In either case, you have a problem. Kubernetes itself can influence your application health, and your application can influence the cluster state, but the telemetry that you need to correlate and diagnose these problems is often disjoint.

Consider a relatively uncomplicated service running on Kubernetes. Changes to your load profile can have deleterious effects on other pods scheduled on your node. New deployments can lead to load spikes on stateful services, like databases—especially as you change queries and add features. Trying to track down intermittent bugs across thousands of pods is an exercise in frustration. These challenges only multiply when you start to make deeper integrations into the Kubernetes API—for instance, if your service starts new jobs or is otherwise modifying cluster resources.

Impedance mismatches

As a developer, there are two main challenges that you need to tackle when deciding how to understand your Kubernetes applications. The first is getting the right telemetry data, at the right resolution, in the right place, in order to ask questions about it. The second is to filter out all the data that’s less important, focusing on the things that provide the most value.

These are challenges that existing tools have a hard time addressing. For example, it’s very popular to use tools like Prometheus and the kube-state-metrics service to turn object-level information into metrics data. Unfortunately, this data tends to become very high cardinality over time: attributes on a single measurement, like k8s.pod.ready, will change frequently as pods move through their lifecycle. You can end up in a state where you might know how many pods are failing, but not which ones. Worse, entire series might wind up not being exported at all. Things like secret or service creation, which can be helpful for understanding why pods may have incorrect or missing configuration values or aren’t accessible, are often dropped.
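To make the cardinality tradeoff concrete, here is a minimal Python sketch. The pod names, the `ready` flag, and both helper functions are made up for illustration; this is not how Prometheus or kube-state-metrics store data, only a toy model of what grouping by attributes does.

```python
from collections import Counter

# Hypothetical samples: one per pod, with attributes stored as a plain dict.
samples = [
    {"k8s.pod.name": "checkout-7d9f-abcde", "k8s.namespace.name": "shop", "ready": False},
    {"k8s.pod.name": "checkout-7d9f-fghij", "k8s.namespace.name": "shop", "ready": True},
    {"k8s.pod.name": "cart-5c4b-klmno", "k8s.namespace.name": "shop", "ready": False},
    {"k8s.pod.name": "checkout-7d9f-pqrst", "k8s.namespace.name": "shop", "ready": True},
]


def series_count(samples, keys):
    """Number of distinct series produced when grouping by `keys`."""
    return len({tuple(s[k] for k in keys) for s in samples})


def not_ready_by(samples, keys):
    """Count not-ready pods per group, dropping every attribute not in `keys`."""
    return Counter(tuple(s[k] for k in keys) for s in samples if not s["ready"])


# One series per pod: precise, but grows with every rollout.
per_pod = series_count(samples, ["k8s.pod.name", "k8s.namespace.name"])

# One series per namespace: cheap, but pod identity is gone.
per_namespace = series_count(samples, ["k8s.namespace.name"])
failing = not_ready_by(samples, ["k8s.namespace.name"])
```

Grouping by pod name keeps one series per pod, and that count keeps growing because replacement pods get new names. Grouping by namespace alone collapses everything into a single series, but the result only says that two pods in `shop` are not ready, not which two.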

The fundamental problem isn’t just that “it’s hard to get the right data out,” though. It’s that the people who are most often responsible for collecting and managing telemetry aren’t the people who need to use it in order to understand their systems. This doesn’t just set you up to fail on a technical level, but on a very human one as well.

What do developers need to understand Kubernetes?

I think developers need three main things to understand Kubernetes-based applications:

  • Opinionated and optimized telemetry about Kubernetes events and objects.
  • A stream of highly annotated and contextually relevant application telemetry.
  • Analysis tools that can not only quickly identify hotspots and places to start looking for problems, but can also assist in deep dives into the system.

OpenTelemetry is the answer for all three of these points. Out of the box, the OpenTelemetry Collector can capture a wealth of data about the health of a Kubernetes cluster and its workloads. You can then transform that data using the Collector to re-aggregate the events, reduce the number of metrics emitted, transform their attributes, and more. 

OpenTelemetry also allows you to easily correlate and enhance application and service telemetry with essential Kubernetes metadata, by using the processors available to the Collector. You can ensure that your traces, metrics, and logs are all annotated with accurate and consistent attributes for later correlation.
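As a rough picture of what this kind of enrichment does, here is a small Python sketch that attaches pod metadata to a telemetry record by looking up the sender's IP address. It is not the Collector's implementation: the `enrich` function, the `POD_INDEX` table, the `source_ip` field, and the sample values are all assumptions made for illustration. The attribute keys follow the `k8s.*` naming used by OpenTelemetry's semantic conventions.

```python
# Hypothetical lookup table: pod IP -> Kubernetes metadata for that pod.
POD_INDEX = {
    "10.0.0.12": {
        "k8s.pod.name": "cart-5c4b-klmno",
        "k8s.namespace.name": "shop",
        "k8s.deployment.name": "cart",
    },
}


def enrich(record, pod_index):
    """Return a copy of `record` with Kubernetes metadata merged into its attributes.

    Attributes already set by the application win over looked-up values,
    and records from unknown senders pass through unchanged.
    """
    attrs = dict(record.get("attributes", {}))
    pod = pod_index.get(record.get("source_ip"))
    if pod is not None:
        for key, value in pod.items():
            attrs.setdefault(key, value)
    return {**record, "attributes": attrs}


span = {
    "name": "GET /cart",
    "source_ip": "10.0.0.12",
    "attributes": {"service.name": "cart"},
}
enriched = enrich(span, POD_INDEX)
```

Because every signal passes through the same lookup, a trace, a metric, and a log line from the same pod end up carrying the same `k8s.pod.name`, which is what makes later correlation possible.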

Once you’ve got this data, what do you do with it? OpenTelemetry to the rescue again! Almost every commercial and open source monitoring and observability tool supports OpenTelemetry data. Rather than leaving you to face vendor lock-in and mounting monitoring bills, OpenTelemetry allows you to customize your observability pipelines at a deep level. Use the Collector to split out high-priority customer-facing telemetry into near-real-time analysis and alerting tools, while sending everything else to cheap and efficient blob storage.
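The split described above comes down to a routing decision per record. Here is a minimal Python sketch of that idea, assuming a hypothetical `customer_facing` attribute on each record; the `route` function and the sample records are illustrative only and do not reflect how the Collector is configured or implemented.

```python
def route(records, is_high_priority):
    """Split records into (realtime, archive) lists using a predicate."""
    realtime, archive = [], []
    for record in records:
        target = realtime if is_high_priority(record) else archive
        target.append(record)
    return realtime, archive


def customer_facing(record):
    """Hypothetical rule: a record is high priority if it is tagged customer-facing."""
    return record.get("attributes", {}).get("customer_facing", False)


records = [
    {"name": "POST /checkout", "attributes": {"customer_facing": True}},
    {"name": "nightly-report", "attributes": {"customer_facing": False}},
    {"name": "cache-warmup", "attributes": {}},
]

realtime, archive = route(records, customer_facing)
```

Only the checkout record lands in `realtime`; the other two go to `archive`. Keeping the rule in a separate predicate means the definition of "high priority" can change without touching the routing itself.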

What’s next?

If you’re feeling overwhelmed by Kubernetes, you’re not alone. I’ve spoken to hundreds of developers who feel frustrated and stymied by mismatches between what they’re responsible for and what they can actually control. While OpenTelemetry is the first step in creating and collecting actionable telemetry data, it’s just that—a step. You still need to analyze that data, and you need to develop a practice for using it.

At Honeycomb, we’ve recently announced a new suite of integrations and features to help you, the application developer, understand your Kubernetes-based applications and architecture. You can try it out here and see the difference for yourself. If you want to get hands-on and you’ll be at KubeCon in Chicago from November 6th to the 9th, come visit the Honeycomb booth to grab a demo and ask us anything!

 
