Max Edmands [Staff Product Engineer|Honeycomb]:
Hi. My name is Max Edmands, and I’m a Staff Product Engineer at Honeycomb. I’m excited to tell you about our newest beta launch, Honeycomb Metrics.
Honeycomb has always been a state-of-the-art tool for observing what our services are doing in production. But sometimes it’s important to go one level deeper and get insight into what’s happening inside the infrastructure that your applications rely on.
Wouldn’t it be great if we were able to look at infrastructure metrics on this screen? Since you’ve already seen the title of this talk, it probably won’t be a huge surprise to you, but now you can do that.
Today, we’re unveiling a new collection of features for enterprise accounts that will give you and your team a window into what’s happening in your infrastructure. Using this new Metrics correlations tab, at a glance, we can see crucial contextual information that lets us understand how our infrastructure is doing and potentially diagnose or rule out issues that could be contributing to what we’re seeing in production.
In this example, we’re looking at a graph that is giving us a latency distribution of one of our services in production. At the same time, we’re seeing information about CPU utilization and memory usage of that service, broken out by hosts. And, since this service makes use of AWS Lambda, we also have a graph of what’s going on with Lambda executions happening during this time span.
So now that you’ve seen a preview of this new feature, I would like to give you a quick overview of how to set this up for your own applications and how you can get the most out of it. For that, I’m going to show you how I’ve instrumented my own tiny demo application that I’ve named Polyhedron. My aim was to make this simple but to make it reminiscent of the kinds of applications you might be working on in your day to day.
Polyhedron is a simple application. I’ve written this in Golang. This is a virtual HTTP server that rolls the dice for you. You tell it what kind, how many, and it rolls the dice for you and responds with a total.
Here, in this example, I will be sending a request via curl and asking it to roll 2 six-sided dice in a row. And getting back… six as a result.
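To make the shape of the demo app concrete, here is a minimal sketch of what a dice-rolling HTTP handler like Polyhedron’s might look like in Go. The endpoint path and the `count`/`sides` query parameters are assumptions for illustration; the actual Polyhedron code is on GitHub, as mentioned later in the talk.

```go
package main

import (
	"fmt"
	"math/rand"
	"net/http"
	"net/http/httptest"
	"strconv"
)

// roll returns the total of n rolls of an s-sided die.
func roll(n, s int) int {
	total := 0
	for i := 0; i < n; i++ {
		total += rand.Intn(s) + 1
	}
	return total
}

// rollHandler parses hypothetical ?count= and ?sides= query
// parameters and responds with the rolled total.
func rollHandler(w http.ResponseWriter, r *http.Request) {
	n, err := strconv.Atoi(r.URL.Query().Get("count"))
	if err != nil || n < 1 {
		http.Error(w, "bad count", http.StatusBadRequest)
		return
	}
	s, err := strconv.Atoi(r.URL.Query().Get("sides"))
	if err != nil || s < 1 {
		http.Error(w, "bad sides", http.StatusBadRequest)
		return
	}
	fmt.Fprintf(w, "%d\n", roll(n, s))
}

func main() {
	// Exercise the handler in-process rather than binding a port,
	// simulating the curl request from the demo.
	req := httptest.NewRequest("GET", "/roll?count=2&sides=6", nil)
	rec := httptest.NewRecorder()
	rollHandler(rec, req)
	fmt.Print(rec.Body.String())
}
```

In the demo, a `curl` against an endpoint like this returns the total of two six-sided dice, so any value from 2 through 12.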
Rolling a few more times.
I’ve already instrumented this app with OpenTelemetry traces. It’s sending data to Honeycomb, into this Polyhedron tracing dataset. The server looks pretty healthy, with a spread of quite low latencies. You will notice it’s steadily receiving traffic. That’s because I have a load generator running over here in this terminal.
If I head into this total request graph, and group by host name, you can see that requests are being handled by two different hosts running on two VMs that I have going on this computer. That’s basically the sum total of how this app is architected. It’s just an Nginx load balancer distributing requests across two servers, each running on different machines. And I can click on any point in this graph and see a trace corresponding with that time. Since this server is only doing one thing and it’s not networked, the traces only have a couple of spans.
Back in the query builder, I’m excited to show off this fancy new Metrics tab. If I click it, it’s hiding all sorts of useful stuff regarding what’s going on with my infrastructure. Data about memory usage, CPU usage, process count, uptime, network utilization, and language-specific stuff like the amount of concurrency in my Go processes. They’re all broken out by host, so if I hover over a specific host’s data in my main query, it’ll highlight in the Metrics queries.
One neat thing about this metrics data is it’s all stored in the same columnar data store as everything else we receive. There are a couple of big benefits to this.
First, you can run arbitrary queries against your metrics data in the same way you would any other data in Honeycomb. Here, I’ve clicked on one of these Metrics graphs, and you can see how we’re actually just on a different Honeycomb query results page with all the same features you get from any other query. For example, we already saw that we’re grouping by hosts. But you will also notice that we’re filtering for just established connections. What other states are there?
So now we’re seeing that, across both of our hosts, most of the TCP sockets we’re using are in the time_wait state. That’s interesting and could be potentially useful to know if we’re running out of TCP sockets. Thankfully, we’re not.
We’ll get a more interesting view of the data if we look at it in log scale. Now we see a time series of all existing TCP connections, for each host, grouped by wherever they may be in the state machine. Let’s filter to the single host for a moment.
Looking at the raw data tab for this host might give a clearer picture of what’s happening under the hood.
Now, we’re able to see the stream of data that Honeycomb has received in order to render these time series. For any given minute, we’ve got 12 rows stored, each tracking one measurement: the number of TCP connections this host has in a given state.
If we filter down to the single state again, for instance, established, we can see we’re looking at a very regular time series with data points arriving more or less every minute. This is more clear when we filter the query to just the last 10 minutes. We see 10 data points. This is probably the most fundamental difference between metrics data and what we were collecting back in our traces dataset.
Here, we have a row stored for every HTTP request whenever it happens. This could be multiple times in a millisecond, with some milliseconds skipped. It’s showing you a one-for-one picture of what users hitting your website are doing. And each row is super wide, giving you as much context as we can about that request. This model makes tons of sense for event-driven data: one row with many high-cardinality fields.
But infrastructure data doesn’t quite fit this model. Infrastructure is ambient. There’s a finite amount of network bandwidth available. And at any given point in time, we’re using a certain amount of it. It’s always important to know how much of it we’re using. There’s always an answer to this question. At minute one, we may be using a lot of this resource. By minute two, we may be using none of it. Any number, including zero, is relevant. So instead, we track measurements at a constant capture interval, here configured to be one minute.
At Honeycomb, we talk a lot about cardinality, that is, how many distinct values a particular field can take. For example, in our traces dataset, we’re capturing an http.target field, which is basically different for every request. It’s about as high-cardinality a field as can be found. That’s absolutely fine, and Honeycomb handles this great, because in an event-based model, in our columnar data store, high cardinality is very cheap.
Metrics, on the other hand, are a bit different in this regard. You will notice we’re capturing data from multiple hosts and multiple states. For every new host we add to this list, we’re capturing one more datapoint at that time. And for every new TCP state we track, we’re capturing another datapoint at that time. Twelve states and two hosts means we’re capturing 24 network connection data points every minute, or 240 data points over this 10-minute span.
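The arithmetic in the paragraph above can be sketched as a tiny helper: the number of series a metric fans out into is the product of the cardinalities of its attributes, and each series emits one point per capture interval. This is an illustrative sketch, not anything from the Polyhedron codebase.

```go
package main

import "fmt"

// datapoints estimates how many metric rows one measurement produces
// over a window: (TCP states) x (hosts) series, one point per series
// per capture interval.
func datapoints(states, hosts, intervals int) int {
	return states * hosts * intervals
}

func main() {
	fmt.Println(datapoints(12, 2, 1))  // per minute: 24
	fmt.Println(datapoints(12, 2, 10)) // over a 10-minute span: 240
}
```

Adding a third host, or tracking a new attribute, multiplies the series count rather than adding to it, which is why cardinality matters differently for metrics than for events.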
Honeycomb can handle event volume like this just fine, but it’s worth noting that cardinality works differently when you’re capturing ambient data than it does in a tracing dataset. So I want to talk you through how I instrumented my system to send this data.
There are actually two categories of metrics I’m collecting. There are metrics about the server runtime itself: how much of the heap are we using, when are we garbage collecting, how many goroutines are we running, how long has it been since the server started up? And there are metrics about the system: how much CPU is used versus available, how much memory is used versus available, what does network or disk throughput look like?
Honeycomb accepts metrics via OpenTelemetry, an open standard supported by the Cloud Native Computing Foundation. It’s growing, and major players, including AWS, are starting to build in native compatibility for the standard. It’s also tightly integrated with OpenTelemetry traces, which we’ve been offering for the past little while.
I’ve got a process running on my servers called the OpenTelemetry Collector. This process collects metrics from the system it’s running on and can stream them to different external services, like Honeycomb. It will also listen for metrics and traces coming from other processes on the host and can forward those along too. In my HTTP server processes, I’ve installed the OpenTelemetry SDK for Go, which is responsible for capturing runtime metrics. I can also use this SDK to instrument and capture any other metrics I care about in my app with code. The SDK forwards its metrics along to the collector, which aggregates them and passes them along to Honeycomb.
Let’s take a deeper look at the OpenTelemetry Collector. It’s configured with pipelines. Pipelines are made up of receivers, which accept or collect data from various sources, processors, which can modify the data they see in various ways, and exporters, which package data up and send it to various places. Some common metrics receivers include the host metrics receiver, which is what we were using to collect those CPU and memory graphs we saw earlier, or the OTLP receiver, which can receive OpenTelemetry metrics from other sources, like runtime metrics from an instrumented app.
There are also receivers for other common protocols. For example, OpenTelemetry can scrape Prometheus endpoints, and it can act as a receiver for statsd or influxdb metrics.
Common processors include the resource processor which can add additional data to metrics, for example, stamping each metric with data about the host it came from. Or the batching processor which will hold data and release it in regular bursts, resulting in fewer requests downstream. There are many exporters allowing you to send OpenTelemetry metrics to a variety of endpoints.
For the purposes of this demo, I’m sending metrics to Honeycomb, but, hopefully, you can see how a tool architected this way helps you avoid vendor lock-in. All of this is configured in a YAML file provided to the OpenTelemetry Collector. Here, you can see two pipelines configured, one for metrics and one for traces. The Metrics pipeline is receiving host metrics, and I will show you the sources they’re coming from. OTLP data is being received by this collector on port 4317. We then have a resource processor that’s adding host.name and service.name fields to all of the metrics and traces, and a batch processor that sends out metrics in 200ms batches.
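A collector config along the lines described might look roughly like the sketch below. The exact scraper list, attribute values, and Honeycomb endpoint and header names are assumptions based on the standard collector distribution, not taken from the demo’s actual file.

```yaml
receivers:
  hostmetrics:
    collection_interval: 60s
    scrapers:
      cpu:
      memory:
      network:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  resource:
    attributes:
      - key: host.name
        value: host-1          # hypothetical host identifier
        action: upsert
      - key: service.name
        value: polyhedron
        action: upsert
  batch:
    timeout: 200ms

exporters:
  logging:
  otlp:
    endpoint: api.honeycomb.io:443
    headers:
      x-honeycomb-team: ${HONEYCOMB_API_KEY}

service:
  pipelines:
    metrics:
      receivers: [hostmetrics, otlp]
      processors: [resource, batch]
      exporters: [logging, otlp]
    traces:
      receivers: [otlp]
      processors: [resource, batch]
      exporters: [logging, otlp]
```

Each pipeline names its receivers, processors, and exporters explicitly, which is what makes it easy to mix and match sources and destinations.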
Finally, you can see that we have a logging exporter that can provide us a bit of context regarding what’s going on with the pipelines when there are issues, and an OTLP exporter is set up to send metrics and traces along to Honeycomb. Additionally, in our application, we’ve imported the Go OpenTelemetry SDK. We’re configuring it to send metrics and traces to localhost port 4317, which is the same port we had configured in the OpenTelemetry Collector.
And here we’re configuring a new metrics controller, which is basically another Metrics pipeline that lives inside of this server process. It has a capture interval of one minute and will receive runtime metrics from this process and send them out to OpenTelemetry Collector.
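The push model the metrics controller uses can be illustrated with a small self-contained sketch: sample runtime metrics on a fixed capture interval and hand each batch to an exporter callback. This is not the OpenTelemetry SDK itself, just a toy stand-in for the interval-driven pipeline it runs inside the process; the real SDK exports OTLP to the collector on localhost:4317.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// sample is a tiny stand-in for one batch of runtime metrics.
type sample struct {
	goroutines int
	heapBytes  uint64
}

// capture reads current runtime metrics, much like the SDK's
// runtime instrumentation does on each tick.
func capture() sample {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return sample{goroutines: runtime.NumGoroutine(), heapBytes: m.HeapAlloc}
}

// run samples on a fixed capture interval and hands each batch to an
// exporter callback, mirroring the controller's push model.
func run(interval time.Duration, rounds int, export func(sample)) {
	t := time.NewTicker(interval)
	defer t.Stop()
	for i := 0; i < rounds; i++ {
		<-t.C
		export(capture())
	}
}

func main() {
	// Short interval so this demo finishes quickly; the talk's
	// demo app uses a one-minute capture interval.
	run(10*time.Millisecond, 3, func(s sample) {
		fmt.Printf("goroutines=%d heap_bytes=%d\n", s.goroutines, s.heapBytes)
	})
}
```

The key property is that measurements arrive at a regular cadence regardless of whether the server is handling any requests, which is exactly the ambient-data shape described earlier.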
Note, by the way, all this code is available on GitHub. I will make sure there’s a link in the chat for anyone interested in looking at this code in more detail. Also, pointers for the things I’m showing on here can be found in our documentation. So, once this is all set up, all this data can be found in the Polyhedron metrics set.
You can see that I’ve created a board called “host metrics” where I’ve picked out some representative metrics queries. The queries from this board are what we were seeing earlier when we were looking at the Metrics tab from a traces query. But you can also switch to different boards or, additionally, choose from a list of suggested metrics in the new Metrics section in dataset settings, here.
There’s something I should mention before I get too far ahead of myself. You will notice, when we make a query, that some metrics show up as mostly straight up-and-to-the-right lines. For example, look at this system.cpu.time graph on the left. What is going on with that? Does that mean that CPU utilization is steadily increasing over time?
No. What we’re seeing here is a special kind of metric called a sum. Basically, every data point in this chart represents the total number of CPU cycles spent on some activity since the server most recently started up. To get value out of this data, we need to be able to graph the rate of change of this number, which is basically the slope of the line we’re looking at here. This is the next thing our metrics product team is working on, and we’re super excited to get it out to you.
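The rate-of-change computation described above can be sketched in a few lines. This is an illustrative implementation, not Honeycomb’s: convert cumulative counter samples taken at a fixed interval into per-interval deltas, treating a drop in the value as a counter reset (e.g., the process restarted).

```go
package main

import "fmt"

// rate converts cumulative counter samples (like system.cpu.time)
// into per-interval deltas. A decrease between samples signals a
// counter reset, so the delta restarts from the new sample.
func rate(samples []float64) []float64 {
	deltas := make([]float64, 0, len(samples))
	for i := 1; i < len(samples); i++ {
		d := samples[i] - samples[i-1]
		if d < 0 { // counter reset: process restarted
			d = samples[i]
		}
		deltas = append(deltas, d)
	}
	return deltas
}

func main() {
	// Cumulative CPU seconds sampled each minute, with a restart
	// between the third and fourth samples.
	fmt.Println(rate([]float64{100, 160, 230, 5, 65})) // [60 70 5 60]
}
```

Graphing these deltas instead of the raw counter turns the up-and-to-the-right line into a flat series showing actual per-minute consumption.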
There’s one final thing I would like to show you. If you will remember at the beginning of this presentation, we saw this graph that showed the number of concurrent Lambda executions. These are also metrics, and they’re coming to Honeycomb via CloudWatch. If you, like us, are hosting your application on AWS, you will be excited to learn it’s now possible to send metrics to Honeycomb directly from CloudWatch. Here’s what the setup looks like in the AWS console.
You can create a metrics stream that’s sending whatever metrics you want to send us. Here we have this configured to send everything and point it to a Kinesis Data Firehose, with the Honeycomb API as its destination. This allows you to pull up all sorts of useful data from Honeycomb. Right now, we’re looking at a board with a collection of metrics showing our various SQL databases hosted in RDS, with information regarding CPU and memory usage, but also showing you query load and other important database characteristics that might be useful to know about.
That about wraps up my demo.
Before I end, I want to leave you with a couple of takeaways, things we think are a unique strength of our metrics offering. First, we’re using the same infrastructure to store and query metrics data as we use for the events you already send us. That means it’s the same Honeycomb you know and love. There’s not much additional surface area to learn, and metrics data will benefit from all the same capabilities we’re building for our events.
Second, Honeycomb is embracing the OpenTelemetry metrics protocol. This is an open standard that’s already tightly integrated into the cloud ecosystem. This means you can avoid vendor lock-in, and it means we’re compatible with almost any metrics tooling you can add to your systems. And also it’ll be pluggable with tools you’re using externally.
We’re really excited to share these beta features and to see how you use them. We’ve set up a channel in our Pollinators Slack called #discuss-metrics. Join us there if you have questions, comments, concerns.
Thank you so much.
Max, that was amazing. Thank you for walking us through Metrics. It’s really exciting to see.
Are we going to talk about this new feature?
It’s really cool. We have a few questions. I know we built it with certain metrics in mind, but the project evolved, and we’ve seen a few customers use it. So I’m curious if you want to talk about how the team’s thinking has evolved during the course of the project.
Totally. When we started out, I thought of metrics as covering your basic infrastructure things: CPU, memory, that type of stuff. But throughout this project, we’ve developed a more generalized understanding of what metrics are for, and of the way I like to think about it.
I’ve alluded to this in my talk as well. Metrics are for ambient data. So when there’s a thing that your system relies on, for example, the host itself, or some external component like a database or whatever else, it generally has state, and that state persists whether or not there are individual things happening.
With your traces, you’re keeping track of individual requests that have responses, and there’s a bunch of data contained within the scope of that request. But the ambient data that sits beneath that is always true. And metrics are perfect for that type of data because they live regardless of whether or not anything is happening.
If I want to know how many open connections there are, there’s always going to be an answer to that question, even if the answer is zero. Whereas with an event, asking about the number of connections doesn’t actually tell you much about the event itself. It tells you about the underlying system that the event happened during.
I think that’s the biggest thing, is that we sort of widened the scope of what metrics are for to sort of think about it in that way.
I love the framing of ambient data. It makes a lot of sense to me.
I want to ask you a question that came out of yesterday, out of o11ycon. There was this really interesting conversation in the closing discussion about dashboards and observability.
And somebody said in Slack, if dashboards are bad does everything start with a blank query? And I thought this was so interesting because it’s something the team thought a lot about while building Metrics. I wanted to hear your thoughts on that. Does everything start with a blank query? How does Metrics fit into that workflow?
Yeah. One of the really cool things about the blank query is that it is kind of perfect for situations where you don’t know what it is that you need to know, but you do know something. For example, you have a question. You want to answer that question. With a dashboard, if you have a question, you can scroll through the dashboard and, hopefully, there’s an answer to it somewhere.
With Query Builder, the sky’s the limit. You can ask it anything you want to ask. So that’s really cool. On the other hand, sometimes it’s also useful to have collections of data that you’ve looked at in the past. The thing about dashboards is that they’re static. They can’t change.
That’s the reason I was using the boards feature at the end of our demo just now. It’s a great way to see what data is in the dataset, what kinds of stuff are there. And all the things shown in that board were at one point a query that someone made in the past. I went through the process of discovery. I went through and found interesting things that I thought were cool and wanted to bookmark for later.
So the board turns out to be just a collection of bookmarks. It’s probably not the right place to go if you’re solving an acute problem. But a great place to go if you’re like “oh, what have other people looked at recently?” Or I care about things in this category. I’m going to look through that stuff.
We also have our new Metrics tab that shows up in the query builder. Our thought here is that for infrastructure metrics, for ambient data that has a specific bearing on the system you rely on, there are probably things that are always going to matter. So, for example, if your app uses a message queue, and you have big problems if that queue runs out of space, sometimes it’s useful to be able to see how the message queue is doing right now. You just pull that up and it gives you a really quick answer.
Oh, there’s nothing going on here. Now I can not think about that message queue anymore. But it could also be that there’s a problem, and that short circuits the discovery phase. It’s one of the cool things about this new Metrics tab. You can open that, quickly answer a question, close it again. Or you can open it and know you need to dig more into this.
Yeah, so boards as bookmarks is what I heard and a memory tool. And the Metrics tab for investigating and quick reference and also diving in deeper. That’s really rad.
I also wanted to ask you about OTel. The team chose to build ingest and support for OpenTelemetry with Metrics. Can you tell us about that choice and how you think that shapes the way people use the feature?
Good question. So I’m stoked about OTel. Also, I really appreciate Alolita’s deep dive into OTel and what’s going on with it right now. OTel Metrics is very, very new. It’s still in early beta. We’re building our Metrics support specifically around OTel Metrics, which means we’re kind of very reliant on that project continuing.
That’s a little bit risky, but we think it’s an amazing bet to make. For a couple of reasons. The OTel project is an open standard. So anyone can instrument with their OTel Metrics. They’re not just stuck with Honeycomb. We hope you stay with Honeycomb. We think our tool is awesome, but it’s a really good selling point that you don’t have to put Honeycomb specifically in your application. You can put OpenTelemetry in your application and then, hey, if you actually want to connect a Prometheus server to your app or you want to, I don’t know, send your data to multiple places at the same time, all that stuff is super easy to do because of this open standard. So that’s a huge one.
And then, also, we’re just really excited about where the project is going, and we think that there’s a ton of amazing sort of extensibility and there’s a whole lot of really cool future places we’re excited to build into as well.
One example there, that I’m personally quite excited about, is exemplars. Individual metric streams can have links to traces in them. So if you’re looking at a Metrics graph, and that Metrics graph is related to your traces in some way, maybe there’s something about your HTTP server and how it’s doing, then in the future I’m envisioning a world where, just like you can click a traces graph in the Query Builder and get sent to a view of what your distributed trace is doing, you should be able to do that for metrics as well. And OTel’s exemplar support will eventually allow that. We’re excited about that. OpenTelemetry is really exciting.
And then OpenTelemetry Collector is this amazing sort of middle piece that enables so much stuff. Because it’s already widely deployed and there are connectors in there on both the export and ingest side, they’re able to speak to a whole bunch of protocols—influxdb, statsd, collectd, Prometheus, all of these things—and it’s easy to mix and match where you want to pull data from and where you want to send it to. It sort of makes things feel extensible and pluggable in a cool way.
Yeah. I love this theme around OpenTelemetry and extensibility and interoperability. I think it’s really, really exciting. Well, Max, thank you so much for this deep dive. It was super wonderful.
I’m excited to see what people do with Metrics. I’m over the moon about it.