Pierre Tessier [Director, Solution Architects|Honeycomb]:
Hello! Let’s take a walkthrough of the Honeycomb Observability Platform. When you start sending data to Honeycomb, you’re going to get a screen that looks a lot like this, with your three key indicators for any application out there: requests, latency, and error rate. Now, Honeycomb will take this data and break it down by a lot of very common dimensions, things like status code and the services that make up your stack behind the scenes. But then we get into some other things like route and error, and even high-cardinality data, fields that could have tens of thousands of unique values, like user, and bring that up here as well. We can see here that we have one particular user that may be a little bit different than the others; more on that one later. Let’s go back to the overall view.
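As a quick aside, those three key indicators are easy to picture as computations over raw events. Here’s a minimal Python sketch, assuming a hypothetical list of event dicts with made-up `duration_ms` and `status_code` fields (not Honeycomb’s actual schema):

```python
# Hypothetical raw events standing in for what a service sends; the
# field names are illustrative, not Honeycomb's actual schema.
events = [
    {"duration_ms": 120, "status_code": 200},
    {"duration_ms": 340, "status_code": 200},
    {"duration_ms": 90,  "status_code": 500},
    {"duration_ms": 210, "status_code": 200},
]

request_count = len(events)
error_rate = sum(e["status_code"] >= 500 for e in events) / request_count
durations = sorted(e["duration_ms"] for e in events)
# Median latency: midpoint of the two middle values for an even count.
p50 = (durations[len(durations) // 2 - 1] + durations[len(durations) // 2]) / 2

print(request_count, error_rate, p50)  # 4 0.25 165.0
```

The real platform computes these continuously over the event stream; this just shows the shape of the arithmetic.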
Now here, when I’m looking at my latency chart, I can see something happened: I had an increase in latency, and then it just kind of fell back off and came back to normal operations over there. And just before that, we had a deployment that took place right in this area, and nothing on the request or error charts seems to indicate anything about it. So I’m going to go ahead and drill in on latency to understand what really happened to create this. When we get here, first I want to talk about the Honeycomb heatmap. At Honeycomb, we love our heatmaps. And the reason why is that you can get an additional level of dimensionality inside each one of your charts.
Right here, I can see this dark blue band. This is where the majority of my traffic is. And in the lighter-colored ones, I have some requests, just not quite as many. And I’ve got a scale over here on my right to help me understand that. And when you can do this kind of heatmap with the level of detail that Honeycomb does, like 30-second slices over a four-hour period, you can really understand what’s going on inside your application.
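To make that concrete, a heatmap like this is essentially a two-dimensional histogram: events are bucketed by time slice and by latency band, and each cell’s count drives its color. A rough sketch, with hypothetical `(timestamp, duration)` events:

```python
from collections import Counter

# Hypothetical events: (timestamp_seconds, duration_ms); values are made up.
events = [(5, 120), (12, 130), (31, 900), (35, 110), (61, 125)]

TIME_BUCKET_S = 30      # 30-second slices, as in the demo
LATENCY_BUCKET_MS = 250  # height of each latency band

# Each heatmap cell counts the events that fall in one
# (time slice, latency band) pair; the count drives the cell's color.
heatmap = Counter(
    (ts // TIME_BUCKET_S, dur // LATENCY_BUCKET_MS) for ts, dur in events
)

print(heatmap[(0, 0)])  # 2 events in the first slice's lowest latency band
```

The dark band in the demo is just the row of cells where most counts land; the spike is a cell high up the latency axis.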
Now, right here, we’ve got a spike, and the next question is, what’s next? And this is where Honeycomb really starts to set itself apart from other tools. We don’t want you to look at a dashboard and hope that what’s next is another chart in your dashboard. We want you to ask that question, what’s next? And we have that button right here to help you do that; we call it BubbleUp. In BubbleUp mode, as a user, you draw a yellow box around a thing that’s interesting, and Honeycomb, the machine, will tell you why that’s interesting.
Honeycomb looked at every single piece of data, every single event inside my chart, and compared every single attribute of what’s inside my yellow box versus the rest, the baseline. And it surfaces what’s different. That’s what these charts down here are. The first one is name, and it turns out that the cart checkout endpoint is 99% of what’s inside my yellow box. So, note to self, we’re going to filter on that endpoint next. Target, right here: we’ve got a few more target values, but again, in this case, it’s still cart checkout, which happens to be the same as the name. And cart checkout gets a lot of POSTs. This probably makes a lot of sense right here.
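If it helps to picture what BubbleUp is doing, here’s a deliberately simplified sketch: for each attribute, compare how often each value appears inside the selection versus in the baseline, and surface the value with the biggest gap. Honeycomb’s actual statistics are more sophisticated than this; the events and field names below are made up.

```python
from collections import Counter

# Hypothetical events inside the yellow box vs. everything else.
selection = [
    {"name": "cart_checkout", "user_id": "20109"},
    {"name": "cart_checkout", "user_id": "20109"},
    {"name": "cart_checkout", "user_id": "31337"},
]
baseline = [
    {"name": "get_product", "user_id": "11111"},
    {"name": "list_cart", "user_id": "22222"},
    {"name": "cart_checkout", "user_id": "33333"},
]

def frequencies(events, field):
    """Share of events carrying each value of the given field."""
    counts = Counter(e[field] for e in events)
    return {v: n / len(events) for v, n in counts.items()}

def bubble_up(field):
    """Value whose share in the selection most exceeds its share in the baseline."""
    sel, base = frequencies(selection, field), frequencies(baseline, field)
    return max(sel, key=lambda v: sel[v] - base.get(v, 0.0))

print(bubble_up("name"))     # cart_checkout
print(bubble_up("user_id"))  # 20109
```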
And then we get to this one: user ID. And when I hover over user ID, 55% is user 20109. Now, remember, we’ve got tens of thousands of unique users on this platform, and I pulled out one, across all of them, that’s responsible for a little more than half of my problem requests. That’s really telling. And I’m probably ready now to ask my next question, which I think is: I want a distribution of users on this endpoint. A distribution of users, that’s a COUNT. So I’m going to go ahead and click COUNT right there, then click anywhere in this chart and say group by that field. And then over here on name, we’re going to go ahead and click on the bar that matters and say show only where field equals value. Honeycomb went ahead and built up a query for me here at the top; let’s go ahead and rerun that query.
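The query the UI just built, filter on the endpoint, group by user, count, boils down to something like this sketch over hypothetical events:

```python
from collections import Counter

# Hypothetical events standing in for the dataset; field names are illustrative.
events = [
    {"name": "cart_checkout", "user_id": "20109"},
    {"name": "cart_checkout", "user_id": "20109"},
    {"name": "cart_checkout", "user_id": "55555"},
    {"name": "get_product",   "user_id": "20109"},
]

# WHERE name = cart_checkout, GROUP BY user_id, COUNT.
filtered = [e for e in events if e["name"] == "cart_checkout"]
counts = Counter(e["user_id"] for e in filtered)

print(counts.most_common())  # [('20109', 2), ('55555', 1)]
```

The data table under the chart in the demo is just this grouped result, one row per user.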
Now our view is vastly different. We’re only looking at the cart checkout endpoint. And I can see that, generally speaking, it’s under a second. I get a couple of random high ones, but it’s about a second. And then a thundering-herd kind of event happened that really caused a lot of pain. Now, below this, I’ve got my chart of users, or my distribution of users. A couple of users make two requests and then go off and do other things. But most of them just do one request and go away, except for, yeah, user 20109.
Now, below this, I’ve got a data table, because we’re grouping by user ID. So I can hover over any of the items in my data table and really highlight which ones are which, and it’s clear here: user 20109 is causing me problems. And I’m probably now ready to look at what user 20109 did. What was that transaction like? I want to see the trace. And truth be told, we’ve been looking at distributed tracing data this entire time, just in an aggregate form to help us see patterns in what’s going on with our data.
I can click on the traces tab and pull up the raw traces anytime I want. And here are the top 10 traces that Honeycomb surfaces for you. That’s great. But what if I wanted the trace behind this data point right here? We can go ahead and click on it; it’s a way to do a visual search, and Honeycomb will pull up that trace for you right there. And this is the waterfall view, how Honeycomb renders a distributed trace: the request came into the frontend service, went to checkout cart, and was serviced by a lot of other things. And really sticking out to me here is this long bar associated with get discounts, probably where I’m going to want to focus.
And I remember we did some work recently around caching for get discounts. It was really problematic for us, and we implemented a new caching layer to help out with it. Just from a quick, cursory look, with all these calls underneath, I’m not sure that’s really what we intended. But let’s check out a few more details over here on the right, starting off with another heatmap. And we put crosshairs here, which say: amongst its peers, amongst the other checkout calls, this one is kind of in the middle of the pack, and yeah, each one of these is probably just like that. But when I go look at its parent, get discounts itself, yeah, this is right on that tail. This is probably where our problem lies. And we get all these other attributes underneath it. And what’s great about these attributes is we even get things like the actual query that’s being issued, and we can use this to do more with the data.
And in fact, in Honeycomb, if you see data, you can interact with it. For example, I’ve got a pod name down here; I can go ahead and say, “Hey, let’s group by this field right here,” and we go right back to my query. All of a sudden, I’ve grouped by pod name, and Honeycomb took care of writing that query for me. Now, I’m probably not filtered to the view I want to look at here. What I really care about is looking at, perhaps, the cache size, or the effect of what’s going on behind the scenes, against my application and its latency across the board. So I’m just going to remove a couple of these filters here, and we’re going to say: across the board, let’s just look at all of our latency by pod name.
I’m going to go ahead and click on that. And what I get is a view that looks kind of like before, but with a table below it listing all my pods, and I can hover over each one of these like we were doing earlier. Oh yeah, the checkout service. Aha, I’m really going to lean in on this one. I’m getting somewhere. I’m thinking in the back of my head: does this have to do with our cache size? Because this spike happened right after our deployment. Did we break it? Well, we’ve got metrics to help us understand that, and I can go ahead and overlay those metrics onto my view. And just like this, again with that same kind of look, I can hover over the top of my chart, and I can see: oh, memory utilization falls hard on the checkout service right around when latency spiked. Let’s go look at that memory board, and yeah, okay.
So cache size tanks at the same time. So we’ve got a caching problem: we were increasing the cache, and we must have run out of memory. This makes a lot of sense. Let’s go take another look at that deployment right here, and let’s make sure that we do this properly instead. And I can take this and collaborate with my colleagues and my peers. I can take this URL and throw it over to them and say, “Hey, look what I found.” Or I can add it to a dashboard. I can throw this in Slack; we’ve got great integration with Slack. And the point here is that you want to be able to collaborate with your team when you’re investigating items, and at Honeycomb, we really want to take this to the extreme for you.
We want to really push that collaboration aspect. And another way we do that is by helping you understand your own team’s activity, because ultimately, observability is a team sport. It’s not an individual thing. And when I come here, I can see what my colleagues are doing. I can see a query that my colleague Michael Sickles ran. It seems to be very similar to what I’m doing as well, kind of ironic, right? And it’s not just Michael’s or my team’s queries; it’s also my own, because maybe in the heat of the moment, as I’m going through and investigating, I might have taken a wrong turn. I just want to take a couple of steps back, and everything is remembered here for me. At Honeycomb, we’re really about collaboration and helping you be better as a team.
I can come here and click on any of these queries from before; we’ve seen this one, the query I ran earlier. Now, if you’ve found something and you want to institutionalize that knowledge, you can make it a trigger, also known as an alert in other tools. Any Honeycomb query can become a trigger: you set a threshold and who the recipients are, and it works. But the problem with triggers is that we create too many of them, and we end up with alert fatigue quite often. Well, at Honeycomb, we don’t want you to have alert fatigue. We want you to think differently about your services: stop creating alerts for every single symptom, and start measuring the service itself. And that’s where SLOs come in.
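Conceptually, a trigger is just a query result checked against a threshold on a schedule. A hypothetical sketch, not Honeycomb’s actual API:

```python
# Minimal sketch of trigger evaluation; the function shape and the
# "#on-call" recipient are made up for illustration.

def evaluate_trigger(query_result, threshold, recipients):
    """Fire a notification when the query result crosses the threshold."""
    if query_result > threshold:
        return {"fired": True, "notify": recipients}
    return {"fired": False, "notify": []}

# e.g. alert on-call when the query's latency result exceeds 1700 ms
print(evaluate_trigger(2100, 1700, ["#on-call"]))
```

The alert-fatigue problem is that every symptom you care about ends up as one of these, each paging someone independently.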
SLOs allow you to set a measurement for your service, the things that are really important for your service to stay alive. For example, in this case, I care about every request to the front end service being under 1700 milliseconds; that makes a lot of sense. And we’re going to be allowed to have a couple that go beyond that, so we’re going to hold ourselves to this: 99% of all requests over a rolling seven-day period will meet this threshold. And every request that comes in above that threshold chews away at an error budget. That’s what this represents right here: our error budget is going down as bad requests come in. But we’re still healthy-ish; we’ve still got about 30% of our error budget left against that 99% target. So nothing’s really going wrong, nothing’s alerting, and nobody is concerned about it.
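The error-budget arithmetic here is simple enough to sketch. With a 99% target, 1% of requests in the window are allowed to miss the threshold; every miss beyond that eats into the budget. The request counts below are made up to land on the 30% figure mentioned above:

```python
# Error-budget arithmetic for the SLO described above: 99% of requests
# under 1700 ms over a rolling window. The traffic numbers are invented.

target = 0.99
total_requests = 1_000_000
slow_requests = 7_000  # requests over the 1700 ms threshold

allowed_failures = total_requests * (1 - target)      # 1% of traffic may fail
budget_remaining = 1 - slow_requests / allowed_failures

print(f"{budget_remaining:.0%} of the error budget left")  # 30% ...
```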
Now, Honeycomb will take this trend that you’re seeing right here in your error budget, that burn rate itself, and it will forecast out by 24 hours, or four hours, or whatever you set it to. And we use these to notify you, should something go wrong. So at 24 hours out, it’s probably not as urgent; we’ll throw that in Slack. Four hours out, that’s really urgent: “Get in here right away.” We’re going to use PagerDuty to wake you up, so you can come in here and assess the problem.
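The forecast itself can be pictured as a linear projection: given how fast the budget is burning, how long until it hits zero? A hypothetical sketch:

```python
# Burn-rate forecasting sketch: project when the error budget runs out
# if it keeps draining at the rate seen recently. Numbers are made up.

def hours_until_exhausted(budget_remaining, burn_per_hour):
    """Linear projection; burn_per_hour is the budget fraction spent per hour."""
    if burn_per_hour <= 0:
        return float("inf")  # not burning: the budget never exhausts
    return budget_remaining / burn_per_hour

# 30% of the budget left, burning 7.5% per hour: exhausted in ~4 hours,
# which is the "page someone now" tier described above.
print(hours_until_exhausted(0.30, 0.075))
```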
Another area where you could really use this: if you’re an engineering manager and you’re about to release new software or new features, and it’s kind of risky and your error budget is really low or really tight, you might say, “We’re going to pause that new feature release, and we’re going to work on stability first. We’re going to clean up some items we might have with configuring our auto-scaling groups.” But ultimately, at Honeycomb, we want your SLOs to be actionable, and that’s what this heatmap right here does for you. We’ve introduced a new color in the heatmap, these little yellow pieces right here. These are the events that actually failed your SLI measurement.
And earlier, you saw me draw a yellow box in Honeycomb, and Honeycomb kind of magically told me what was different about those events. Here, we put all these yellow blocks in that yellow box and did the exact same operation. And it came back with the cart checkout name, that same target, and, oh, it’s the same user ID as well: 20109.
When we think about dashboards and monitoring: you have a dashboard with 20 charts on it, and you have 20 of these dashboards, and your hope is that when something goes wrong, one of your charts shows that problem. That’s looking at a board of known unknowns, things that gave you a problem before, so now you’re looking at them. I didn’t tell Honeycomb that looking at data by user ID or language pack or region or whatever it is was important. Honeycomb looked at my data and said, “Here’s a thing unknown to you, going in an unknown direction. Please look deeper.” This is my observability dashboard based on an SLO, something actionable, where I can go from here and do more. And this is the Honeycomb Observability Platform. I hope you enjoyed this video. Thank you very much.