Pierre Tessier [Sales Engineer|Honeycomb]:
Hi, I’m Pierre with Honeycomb, here to go over the Honeycomb observability platform.
When you start sending distributed tracing data into Honeycomb, you’re going to get a screen that looks like this: “Total requests” that come in, “Latency” and “Error rates”. You can also break this down by various popular dimensions, like “Status Code”, the various microservices that make up your application, “Route”, “Error” and even very high-cardinality things like “User”.
Honeycomb encourages our customers to send things like “user_id”, “product_id”, even “transaction_id”. Because when you send these types of things, you never know when you’ll find really interesting patterns, and Honeycomb’s platform works great on high-cardinality data.
Let’s go back to the “Overall” tab. Now, when we’re looking at this, we can see a spike in “Latency”, but let’s talk about this a little bit.
In Honeycomb, when it comes to observability, what matters is the user’s experience. Your users don’t care if your CPUs are running hot. What they care about is how long it took the web page to load, and whether or not that page gave them an error.
Nines don’t matter if your users aren’t happy, and in Honeycomb, we want to bring that to the forefront. What you see is “Latency” and “Error rate”, the things that matter to a user. Now, you can see I have a spike in errors that coincides with that spike in latency.
Let’s click on this chart and dig in a little bit more. Now when I do that, we get a glorious heatmap and it’s full of detail. This query here is for four hours, and every single block on this heatmap is about 30 seconds’ worth of time. So if you will, all those 30-second slices look like a bunch of 30-second histograms stacked next to each other sideways to produce a heatmap. I can see my increased distributions down here and the anomalous ones over there. Let me get a little legend to help you understand the color of the block and how many requests each one represents.
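The “histograms stacked sideways” idea can be sketched in a few lines of Python. This is only an illustration, not Honeycomb’s implementation: the function name `heatmap_cells`, the 100 ms latency band, and the sample events are all assumptions made up for the example.

```python
from collections import Counter

def heatmap_cells(events, time_bucket_s=30, latency_bucket_ms=100):
    """Count requests per (time slice, latency band) cell.

    events: iterable of (timestamp_seconds, duration_ms) pairs.
    Each time slice is one column. The counts down a column form a
    histogram of durations, and the columns side by side form the heatmap.
    """
    cells = Counter()
    for ts, duration_ms in events:
        column = int(ts // time_bucket_s)          # which 30-second slice
        row = int(duration_ms // latency_bucket_ms)  # which latency band
        cells[(column, row)] += 1
    return cells

# Hypothetical sample data: (timestamp in seconds, duration in ms).
events = [(0, 40), (12, 55), (29, 950), (31, 60), (45, 1020), (59, 1080)]
cells = heatmap_cells(events)

for (column, row), count in sorted(cells.items()):
    print(f"slice {column}, band {row}: {count} request(s)")
```

The count in each cell is what the legend maps to a color. At roughly 30 seconds per column, a four-hour query would span about 480 columns.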
Now, clearly I found something that’s interesting: this. And I want to dig in and learn some more about what that is. Honeycomb offers a unique feature called “BubbleUp”, which allows you to learn more. What BubbleUp does is, you as a human tell Honeycomb, the machine, what you’re interested in. And by making that selection right here, Honeycomb went off and went through all of the data on this chart and told me what’s different between my selection and the baseline.
What’s interesting is the endpoint. 100% of the requests in that selection are for the exact same endpoint. That’s certainly very telling.
The next one is a name. But this just happens to be the endpoint prefixed by the HTTP verb, such as GET, POST or PUT, and in this case here, it’s all GET on that endpoint. The next one is “user_id”, one of those high-cardinality fields.
And when I hover over the big yellow line, I can see right here, I’ve got a single user, 20109, who’s responsible for nearly half of all the requests in my selection. This is also very telling.
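To make the selection-versus-baseline comparison concrete, here is a minimal sketch of the general idea. It is not how BubbleUp is actually implemented; the function `compare_selection`, the endpoint paths, and the sample events are hypothetical. For each field, it finds the value whose share of events differs most between the two groups.

```python
from collections import Counter

def compare_selection(selection, baseline, fields):
    """For each field, return (value, selection_share, baseline_share, diff)
    for the value whose share differs most between the two groups.
    Both groups must be non-empty lists of dict-like events."""
    results = {}
    for field in fields:
        sel_counts = Counter(e.get(field) for e in selection)
        base_counts = Counter(e.get(field) for e in baseline)
        best = None
        for value in set(sel_counts) | set(base_counts):
            sel_share = sel_counts[value] / len(selection)
            base_share = base_counts[value] / len(baseline)
            diff = sel_share - base_share
            if best is None or abs(diff) > abs(best[3]):
                best = (value, sel_share, base_share, diff)
        results[field] = best
    return results

# Hypothetical events: the selection is the slow spike, the baseline is the rest.
selection = [
    {"endpoint": "/export", "user_id": 20109},
    {"endpoint": "/export", "user_id": 20109},
    {"endpoint": "/export", "user_id": 1},
    {"endpoint": "/export", "user_id": 2},
]
baseline = [
    {"endpoint": "/home", "user_id": 3},
    {"endpoint": "/home", "user_id": 4},
    {"endpoint": "/export", "user_id": 5},
    {"endpoint": "/cart", "user_id": 6},
]
results = compare_selection(selection, baseline, ["endpoint", "user_id"])
```

In this made-up data, one endpoint accounts for every event in the selection and one user accounts for half of them, which mirrors the pattern seen in the demo.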
So by making a simple BubbleUp in Honeycomb, I’ve been given a lot of information, kind of telling me what’s interesting, what’s going on inside that spike in latency. And I’m going to want to continue that. What we call this is the core analysis loop, where you ask a system a question, you get some answers, you form a thesis and you continue asking more questions to validate that thesis. So let’s go ahead and do just that. Now, because not all of my selection is from one user, I’m going to want to group on this right here. Because I want to group, I’m going to add a COUNT to all this. Let’s go ahead and add a COUNT to our query. And we’re going to add user_id as one of those group fields.
Now this endpoint over here, let’s go ahead and filter on this endpoint. So we’re just going to click on it and say, “Show only where field is value”. You can see my query kind of built up with those values. Let’s go ahead and run this. And when I do that, I get a really different view. I’m going to continue that investigation into what’s going on. My heatmap is a little different; it’s filtered on that endpoint. So I only see how we hit it normally, and then that big spike in latency again. Down below, I have a COUNT by user_id, and all these little spikes, these are all the other users doing one request each. And right here, I’ve got that same user “20109”, who’s making quite a few requests, which seems to correlate with that spike in latency. I’m really getting to something here.
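The query built up here (filter on one endpoint, then COUNT grouped by user_id) can be sketched over plain Python dictionaries. The helper `count_by`, the endpoint paths, and the sample events are assumptions for illustration only; this is not Honeycomb’s query engine or API.

```python
from collections import Counter

def count_by(events, group_field, where_field, where_value):
    """COUNT of events grouped by group_field, keeping only events
    where where_field equals where_value."""
    return Counter(
        e[group_field] for e in events if e.get(where_field) == where_value
    )

# Hypothetical events; only user 20109 comes from the demo.
events = [
    {"endpoint": "/export", "user_id": 20109},
    {"endpoint": "/export", "user_id": 20109},
    {"endpoint": "/export", "user_id": 20109},
    {"endpoint": "/export", "user_id": 7},
    {"endpoint": "/home", "user_id": 20109},
    {"endpoint": "/home", "user_id": 8},
]
counts = count_by(events, "user_id", "endpoint", "/export")
top_user, top_count = counts.most_common(1)[0]
print(top_user, top_count)
```

Sorting the groups by count is what makes the one heavy user stand out against everyone else making one request each.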
And furthermore, Honeycomb will give you a data table between the charts and you can hover over that data table to really validate what it is. And I can see that same user is responsible for all those requests, and the charts kind of highlight just what matters. You go over the other entries and you can see that this user here only made four requests over the entire time span.
Now, the next thing I might want to do is dig into those individual requests and learn more. And you can go into the “Traces” tab. Like other platforms, you’ll get a list of the top traces that make up your view. But maybe I want to go to this one right there and go ahead and click on it.
And Honeycomb will pull the trace visually for that right there. And here we are with the Honeycomb tracing view. Now, inside here, I can see what’s going on with that entire request. As I click on different spans, we’re going to update some details on the right. And I can see here how this request for the ticket backend service compares with its peers, and it’s really high up on that map. But when I look at the individual DB requests, I can see they’re really right in the middle, nothing really particular there. If all I did was look at the DB service, I wouldn’t find anything wrong. You really need to look at it as a whole and the entire transaction. And we can see here that clearly something’s happening because we’re calling this DB service quite a few times. That’s probably what’s causing our problems.
And I know not long ago, we did change this application to help our smaller customers. It looks like maybe we introduced some weird regression for larger ones. And when you click on each individual span, we can also find a query behind each one. And continuing that core analysis loop, I can go ahead and click on this and group by this field. And what that’ll do is bring me right back to the same query view, allowing me to dig in some more.
Now, because we’re just looking at queries, I’m going to go ahead and just say, “Hey, let’s look at just the spans that have a query in them.” And while we’re at it, let’s add a SUM to the duration. When I run this, we’re going to be looking at just the queries.
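As a rough sketch of that query (keep only spans that carry a query, group by the query text, and SUM the duration), assuming hypothetical field names `db.query` and `duration_ms` and made-up span data:

```python
from collections import defaultdict

def sum_duration_by_query(spans):
    """SUM(duration_ms) grouped by query text, skipping spans with no query."""
    totals = defaultdict(float)
    for span in spans:
        query = span.get("db.query")  # hypothetical field name
        if query is None:
            continue  # "only spans that have a query in them"
        totals[query] += span["duration_ms"]
    return dict(totals)

# Hypothetical spans from a single trace.
spans = [
    {"name": "GET /export", "duration_ms": 1400},
    {"name": "db", "db.query": "SELECT * FROM tickets WHERE id = ?", "duration_ms": 5},
    {"name": "db", "db.query": "SELECT * FROM tickets WHERE id = ?", "duration_ms": 7},
    {"name": "db", "db.query": "SELECT * FROM users WHERE id = ?", "duration_ms": 3},
]
totals = sum_duration_by_query(spans)
```

A query that is individually fast can still dominate the total when it is called many times, which is why a SUM can show something that looking at single spans does not.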
What I see here, probably isn’t that interesting. And maybe I didn’t really go down a path I wanted to go down, because debugging is not something you’re going to do with five mouse clicks. You need to ask questions of your system and you get there. And sometimes you just go down the wrong path and I just need to take two or three steps back and continue my analysis. Was that two steps or three steps? I’m not sure.
In Honeycomb, we track your history, so every single query you ran is right here, available for you to look at and see again. And you can go ahead and run them any other time and kick your observability journey back off from there.
But it’s not just your query history. We also track your team’s activity. So I can see how my team members are doing their observability as well. Maybe I’m coming in late to help with an incident. I could see what my team is doing and piggyback on their debugging journey, or certainly when you’re a new employee and you’re told you have to shadow somebody, what better way to shadow than to see how you go ahead and debug production. And I can go and click on any of your queries and continue my journey from there as well. Now all of this is possible because this URL right here is persisted forever with the query results. Even if your data ages out, you’ll still get this exact view.
And that’s just one way to share your query, right? You could also add it to a “Board”. You could also “Share” it onto Slack. And when you do so on Slack, we’re going to go ahead and unfurl the images or even the tracing view. Or if you really want, you could make this a “Trigger”. A trigger is analogous to an alarm or a monitor in whatever platform you’re talking about. And these are great ways to make sure you’re not going into the wrong place and to the wrong world. But when you start adding all the triggers, you might get into something also known as “alert fatigue”. Alerts keep on going off. And we’re not sure if “the sheep’s crying wolf”.
For this, Honeycomb wants you to think about SLOs. SLOs help you focus on what really matters for your business. Again, going back to user happiness.
When you build a system, you’ve created some form of agreement with your customers, whether internal, external, or even within your own team. And when you say you’re going to deliver a level of service, we call that the SLA. Maybe that service is all requests being served up in under 1200 milliseconds or something like that. Let’s go in and look at our SLO around that SLA. Now to start measuring all this, we need to think about the service level indicator, or that KPI we’re going to use, and how you measure it.
In this case here, our KPI is pretty simple. If the duration of a request is under 1100 milliseconds, that’s a pass; otherwise, it’s a fail. And the SLO takes that and says over a period of seven days, we want 99.98% of all requests to pass. And from this, we form an “Error Budget”. And I can see here, I’ve got kind of a tight error budget, only about 17.7% left.
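The arithmetic behind an error budget can be sketched as follows. The 1100 ms threshold and the 99.98% target come from the demo; the function names and the traffic numbers are assumptions, and this is one common way to compute a remaining budget, not necessarily the exact formula Honeycomb uses.

```python
def sli_passes(duration_ms, threshold_ms=1100):
    """SLI from the demo: a request passes if it took under 1100 ms."""
    return duration_ms < threshold_ms

def error_budget_remaining(durations_ms, target=0.9998):
    """Fraction of the error budget still unspent (negative if overspent)."""
    total = len(durations_ms)
    if total == 0:
        return 1.0  # no traffic in the window, so nothing has been burned
    failed = sum(1 for d in durations_ms if not sli_passes(d))
    allowed_failures = total * (1 - target)  # the budget, in failed requests
    return 1 - failed / allowed_failures

# Hypothetical seven-day window: 1,000,000 requests, 150 of them too slow.
durations = [200] * 999_850 + [1500] * 150
remaining = error_budget_remaining(durations)
print(f"{remaining:.1%} of the error budget left")
```

With a 99.98% target, a million requests allow roughly 200 failures, so 150 slow requests leave about a quarter of the budget. A value below zero means the SLO has been missed for the window.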
Maybe I want to do something. Maybe I don’t; maybe that’s good, and we can work with this, it’s fine. If your error budget goes below 0%, you might ask your engineering organization to perhaps pause on new feature development and spend some time on stability instead. Because after all, if your users aren’t happy, new features are not going to get them there. Beyond the error budget, you see historical compliance against the SLO. We can see we’re always well above our orange line down here of 99.98. Just above that, we see what we call “Exhaustion time” or “Burn alerts”. What these are is, Honeycomb is constantly monitoring your SLO’s budget. And if it takes the current trend and extrapolates it out by 24 hours or four hours or whatever time you tell us, and it finds out that you’re going to go negative in that time, it’ll go ahead and alert you.
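A burn alert of this kind can be approximated with a simple linear extrapolation. The transcript does not say how Honeycomb extrapolates the trend, so the straight-line projection, the function names, and the 1% per hour burn rate below are all assumptions; only the 17.7% remaining budget and the 24-hour and four-hour windows come from the demo.

```python
def hours_until_exhausted(budget_remaining, budget_burned, window_hours):
    """Project when the budget hits zero if the recent burn rate continues.

    budget_remaining and budget_burned are fractions of the total budget.
    budget_burned is how much was spent over the last window_hours.
    """
    if budget_burned <= 0:
        return float("inf")  # nothing is burning, so no exhaustion in sight
    burn_per_hour = budget_burned / window_hours
    return budget_remaining / burn_per_hour

def burn_alert_fires(budget_remaining, budget_burned, window_hours, lookahead_hours):
    """True if the projected exhaustion falls within the lookahead window."""
    projected = hours_until_exhausted(budget_remaining, budget_burned, window_hours)
    return projected <= lookahead_hours

remaining = 0.177        # 17.7% of the budget left, as in the demo
burned_last_hour = 0.01  # hypothetical: 1% of the budget burned in the last hour

fires_24h = burn_alert_fires(remaining, burned_last_hour, 1, 24)
fires_4h = burn_alert_fires(remaining, burned_last_hour, 1, 4)
```

At that hypothetical rate the budget would run out in under 18 hours, so the 24-hour alert would fire while the four-hour one would stay quiet.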
So in our world here, this 24-hour exhaustion alert will go to Slack, and the four-hour one goes to PagerDuty. Now below that, we’ve got a heatmap, and this heatmap here is made up of 24 hours of data. Now we’ve introduced one additional color to this heatmap, this yellow color you see right here. And what these are is all the requests that failed our SLI. And it’s as if I did a multi-select, grabbing all the yellow ones, and did a BubbleUp: Honeycomb now tells me automatically why I’m burning that budget. And why I’m burning the budget is because of that same endpoint that we looked at earlier. Again, that same name… Now before, we were just looking at the front door, so we see service name right here. You see the front door is really responsible for much of it anyhow, and that same user_id.
So I can do the analysis on that spike, or I can just look at my SLO and Honeycomb would have told me the reasons why I’m burning my budget. Now, when we talk about observability, one thing you’ll often hear people talk about is unknown unknowns, and that’s vastly different from monitoring. When you build a new service, you’re going to make a dashboard, you’re going to put six charts on there and that’s how you monitor it. Something goes wrong and none of your six charts explains it. Two more charts go on the dashboard.
Now you have eight. Another incident, something else, more charts. And eventually your six-chart dashboard has about 28 charts on it, and you’ve got about 20 of these dashboards. Because these are all the things that you now know about, and you’re monitoring them for going into an unknown state. Those are the known unknowns. At Honeycomb, we’re all about the unknown unknowns. I don’t need to know to look at endpoint, or service name, or this user. Honeycomb told me these are the dimensions. These are the things that I need to focus on. And this is my dashboard of unknown unknowns: always evolving, always changing, based on what’s happening inside my application and the user’s experience.
I hope you enjoyed this demo. Please give Honeycomb a try for yourself.