Resolving High CPU Usage in Kubernetes With Honeycomb

By: Travis Redman | March 15th, 2018

Dogfooding Observability Software Engineering

3 Min. Read

At Honeycomb, we’re excited about Kubernetes. In fact, we’re in the early stages of moving some of our services to k8s. Tools like kops have made getting started with k8s easier than ever. But building clusters is only the beginning – before long you might find yourself with a large number of deployments, pods, and services, and new things coming on line every week. Observability is critical to cluster operations. Fortunately, Honeycomb provides multiple Kubernetes integrations to help you get started exploring your cluster’s events and metrics.

What can we do with Kubernetes data in Honeycomb? Let’s look at a recent event that we experienced at Honeycomb. It started when we noticed node CPU utilization on our new cluster was higher than anticipated.

get node cpu usage in kubernetes

We maintain a team board with a high-level overview of our cluster in Honeycomb. One of the graphs displays CPU utilization and reservation by node. It was the higher than expected values for cpu/node_utilization that seemed odd.

We had deployed a number of pods at this point, and while we had reserved a lot of capacity, actual cpu utilization by these deployed pods was low. A stacked graph of pod cpu utilization showed around 1000 millicores used over all of our application pods.

another graph showing 1000 millicores in use

As illustrated in the stacked-graph below, our 10 node cluster has a CPU capacity of 20K millicores and our deployments were consuming nowhere near that! Obviously there is some overhead associated with running core kubernetes services, but we didn’t expect it to explain the difference.

graph showing total cpu capacity of the cluster

So what was consuming CPU? A breakdown of cpu usage by namespace told us that the cpu utilization was in kube-system. We use SUM instead of AVG here because kube-system has a lot of metrics with very small values for usage_rate that can skew the average. SUM provides a better way to determine what is consuming the most of a finite resource.

graph showing the kube-system is the culprit

Filtering by namespace and breaking down further by pod names revealed the culprit to be our very own Honeycomb Kubernetes Agent! The Honeycomb Agent runs as a k8s DaemonSet, tailing and parsing container logs, and sending events to Honeycomb.

it was our own agent! dun Dun DUNNN

A misconfiguration in the agent was causing it to consume logs from all pods rather than a specific set of pods we were interested in. As we added pods, cpu usage grew, and because it runs in the kube-system namespace the usage crept up on us. A quick update to the config made a dramatic difference that we were easily able to verify in Honeycomb.

graph showing the overall difference post-config change

graph showing the per-node difference post-config change

the scooby gang says it's solved!

It might be nice to detect a similar situation in the future. To do this, we used a Honeycomb Trigger to look at the total CPU usage by kube-system on a per-node basis. This approach scales with the cluster.

honeycomb trigger config for watching the cpu consumption

Want to learn more about your Kubernetes cluster? Sign Up for free and take a look at our Kubernetes Integrations!

Don’t forget to share!

Travis Redman

Winston Hearn | Oct 02, 2024

Using Honeycomb for Frontend Observability to Improve Honeycomb

Recently, we announced the launch of Honeycomb for Frontend Observability, our new solution that helps frontend developers move from traditional monitoring to observability. What this means in practice is that frontend developers are no longer limited to a metrics view of their app that can only be disaggregated in a few dimensions. Now, they can enjoy the full power of observability, where their app collects a broad set of data as traces to enable much richer analysis of the state of a web service.

Dogfooding Frontend

Lex Neva | Aug 26, 2024

Always. Enable. Keepalives.

As part of our recent failure testing project, we ran into an interesting failure mode involving the OpenTelemetry SDK for Go. In this post, we’ll show you why our apps stopped sending telemetry for over 15 minutes and how we enabled keepalives to prevent this kind of failure from happening in the future.

Debugging Dogfooding Software Engineering

Fred Hebert | Jul 29, 2024

Making Room for Some Lint

It’s one of my strongly held beliefs that errors are constructed, not discovered. However we frame an incident’s causes, contributing factors, and context ends up influencing the shape of the corrective items (if any) that get created. I’ll cover these ideas by using our June 3rd incident where a database migration caused a large outage by locking up a shared database and making it run out of connections.

Dogfooding Incident Response Software Engineering

All-in-one Observability

Why Honeycomb

Looking for something?

Our mission

Resolving High CPU Usage in Kubernetes With Honeycomb

Travis Redman

Related posts

Using Honeycomb for Frontend Observability to Improve Honeycomb

Always. Enable. Keepalives.

Making Room for Some Lint

Ready to get started?