
Honeycomb Metrics

Observability is great for applications, but metrics can be useful for debugging system issues. See how those two debugging workflows integrate in Honeycomb.

Transcript

Michael Sickles [Solution Architect|Honeycomb]: 

Hello. My name is Michael Sickles, I’m a Solutions Architect here at Honeycomb.io, and today we’re talking metrics. Specifically, how do I get metrics into Honeycomb, and how do we make them useful? How do we make them actionable? So let’s begin with getting that data in. There are multiple types of metrics sources that we can accept in Honeycomb. The main one is the OpenTelemetry Protocol format. That’s OTLP. You can set up an OpenTelemetry Collector to send host metrics to Honeycomb so you get things like CPU utilization, network, and memory. You might also set up an OpenTelemetry Collector to scrape your Prometheus clients to get those Prometheus metrics into Honeycomb. Or you could use AWS. They have a way to export CloudWatch metrics using OTLP. You can get OpenTelemetry SDKs to emit application metrics as well. And finally, we have Honeycomb’s own Kubernetes agent, which will allow you to get Kubernetes information into Honeycomb as well.

The big key thing here, that big star message: send metrics to a separate metrics dataset. I’m going to walk through how to set up each of these different metrics sources, and whenever I talk about a dataset in any of them, send it to a separate one. You’ll see why later.

So let’s begin with host metrics. You will use an OpenTelemetry Collector, and it will sit on the host itself. It has a way to scrape system metrics and transform that data. Setting this up, you’ll use what’s called a host metrics receiver on that collector, and there are different settings you might utilize. For example, you can change the collection interval: collect once a minute, once every 15 seconds, whatever makes sense for your organization. And it offers tons of different host metrics that you might be interested in asking questions about: things like CPU, disk usage, memory, network, et cetera.
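As a rough sketch, the receivers section of a Collector config file could look something like this; the interval and the set of scrapers shown here are example choices, not required values:

```yaml
receivers:
  hostmetrics:
    # How often to collect: once a minute here, but tune this to your needs
    collection_interval: 60s
    # Enable only the categories of host metrics you care about
    scrapers:
      cpu:
      disk:
      memory:
      network:
```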

Next, we’re going to talk about Prometheus metrics. With Prometheus metrics, it’s once again going to be the OpenTelemetry Collector. You’ll use a Prometheus receiver, and it scrapes a Prometheus metrics endpoint. So if you already have a whole bunch of Prometheus clients set up, Prometheus endpoints with interesting metrics you want to ask questions about, you can utilize this collector to set up what is basically a scrape job. Once again, set the interval, and then choose where to scrape those metrics from. If it’s on the host itself with those metrics, you use 0.0.0.0 and whatever port, but you could also use an OpenTelemetry Collector in gateway mode to hit those targets, so long as it can communicate with those machines.
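As a sketch, a Prometheus receiver with one scrape job might look like the following; the job name and port are hypothetical placeholders:

```yaml
receivers:
  prometheus:
    config:
      # Standard Prometheus scrape configuration, embedded in the Collector
      scrape_configs:
        - job_name: my-app-metrics        # hypothetical job name
          scrape_interval: 60s
          static_configs:
            # Scraping a client on the same host; in gateway mode,
            # list the remote targets the Collector can reach instead
            - targets: ["0.0.0.0:9090"]
```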

Then you can see here that you’re going to use your Honeycomb team header, that’s your API key, and your dataset (a metrics dataset, please), going to api.honeycomb.io. The other key thing here is that you’re going to set up a pipeline. What a pipeline allows you to do, in this case a metrics pipeline, is receive data, process that data, and export that data. In this case we have a Prometheus receiver, but you might also have that host metrics receiver, right? Maybe you process it, maybe package it more efficiently, and then export it off to Honeycomb.
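Putting it together, the exporter and pipeline portion of the config could look roughly like this; the API key and dataset name are placeholders to substitute with your own:

```yaml
processors:
  # Batches metrics together so they ship to Honeycomb more efficiently
  batch:

exporters:
  otlp:
    endpoint: "api.honeycomb.io:443"
    headers:
      "x-honeycomb-team": "YOUR_API_KEY"              # placeholder API key
      "x-honeycomb-dataset": "your-metrics-dataset"   # a separate metrics dataset

service:
  pipelines:
    metrics:
      # Receive, process, export: wire in whichever receivers you configured
      receivers: [hostmetrics, prometheus]
      processors: [batch]
      exporters: [otlp]
```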

Next, we’re going to talk about AWS CloudWatch. AWS allows you to export your CloudWatch metrics using a Kinesis Firehose in OTLP data format. Setting that up, you’re going to point the Firehose at Honeycomb’s Kinesis event endpoint, going to a metrics dataset of your choice, and give it an access key. Finally, from an OpenTelemetry standpoint, you have application metrics. Different SDKs have different ways to set up metrics, so just look them up. But you can get interesting things like, say, my JVM heap information. How much memory is my Java application using? That’s an application metric we might want to ask a question about: package it up in OTLP format and send it to Honeycomb.
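Many OpenTelemetry SDKs can be pointed at Honeycomb without code changes through the standard OTLP exporter environment variables. As a sketch, with placeholder key and dataset values:

```sh
# Standard OpenTelemetry OTLP exporter environment variables; values are placeholders
export OTEL_EXPORTER_OTLP_ENDPOINT="https://api.honeycomb.io"
export OTEL_EXPORTER_OTLP_HEADERS="x-honeycomb-team=YOUR_API_KEY,x-honeycomb-dataset=your-metrics-dataset"
```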

Once again, you’re pointing at api.honeycomb.io, and you’re going to need your API key and that metrics dataset. Finally, rounding up all the metrics sources, we have the Honeycomb agent. The Honeycomb agent sits in your Kubernetes cluster and uses the Kubernetes API to hit those individual pods. It’ll get things like CPU and memory utilization so you can ask interesting questions about that. Getting started is really easy. It’s two lines: one adds the Honeycomb Helm chart repo, and one installs that Honeycomb agent to your cluster.
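Those two lines would look something like the following; the repo URL, chart name, and API key setting here are taken from Honeycomb’s public Helm charts as I understand them, so verify them against the current docs:

```sh
# Add the Honeycomb Helm chart repo, then install the agent into the cluster
helm repo add honeycomb https://honeycombio.github.io/helm-charts
helm install honeycomb honeycomb/honeycomb --set honeycomb.apiKey=YOUR_API_KEY
```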


So great, we have all this metrics data now that we want to ask questions about. How do we make it useful? Well, you can query it. It’s just pure events in Honeycomb, honestly. Everything in Honeycomb is an event, and we package those metrics as events. You can use our full query builder to ask interesting questions. For example, here you can see I was wondering: what was my average memory utilization for my different pods? That’s an interesting question. So then I’m going to take this query and add it to a board. A board allows you to package multiple queries together in one spot, so you can see the overall health of, in this case, a Kubernetes cluster, or whatever metrics you need to see. Maybe your AWS CloudWatch metrics, right? It’s a starting point to start asking questions. And that’s okay. Metrics have their uses, and we definitely need to see them. But wouldn’t it be better if we could also see them in context?
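The query builder and boards are UI features, but the same question can also be expressed against Honeycomb’s Query API. Here is a sketch of a query spec for average memory utilization grouped by pod; the column names are assumptions and will depend on what your metrics source actually sends:

```json
{
  "time_range": 7200,
  "calculations": [{ "op": "AVG", "column": "metrics.memory.usage" }],
  "breakdowns": ["k8s.pod.name"]
}
```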

And so we can start asking more interesting questions, right? Not only do we have operators like MAX, we also have things like rates. In this example, the context is that we want to understand how Honeycomb customers add columns to their datasets, the different attributes of their tracing data. By utilizing metrics, I was able to see not only how many columns they have, but when they are sending new ones. You see that spike in the rate. And that’s interesting. Then we can take it and put it all together. Let’s get it in the context of an application.

Here I have my microservices demo. This is a distributed service: ten different services running in a Kubernetes cluster. And here I see that there’s this spike in latency. I could use Honeycomb to go into that spike and see the tracing data. But maybe this time I want to ask questions about the underlying infrastructure. So I’ll go to the Honeycomb metrics tab, which allows me to show my metrics boards in context, together with my application tracing data. For example, here I can see CPU, I can see memory going up, I can see there was a restart. And that’s potentially interesting. But then there’s something unique to Honeycomb: this is not just a single pane of glass. This is a single pane of glass that shares information.

For example, with this Apply Queries Filter button, I can click down, and as I hover over my different pods, it highlights the different graphs above. But if I filter on only one pod, that filter is applied across everything. I’m not manually trying to correlate it together; I’m now able to share my filters and my group-bys between my application tracing datasets and my metrics datasets. So, for example, I can now see clearly that my front-end pod probably has a memory leak of sorts, and that it roughly correlates to that increase in latency. And then eventually I can see there was a crash.

Let me just jump over to another of my interesting metrics boards. I have something tracking memory, and in this case I can clearly see that, yeah, I’ve done something wrong. My cache is going crazy: it’s adding a whole bunch of items until I run out of memory, and then it crashes and restarts. And that is affecting my performance. This is where I would be able to jump into a Honeycomb trace and understand: where in my code is it doing this? Where is it adding these cache items? And why is it affecting the latency?

So we started from the high-level view of getting metrics data in, and then we saw how to make it actionable, usable, queryable. But ultimately, we want to see it together with the application. How do my underlying infrastructure metrics affect my application? How can we query both of them together? Thank you so much for attending today.

If you see any typos in this text or have any questions, reach out to team@honeycomb.io.
