Product Video

Honeycomb SLO Feature Walkthrough



Danyel Fisher [Principal Design Researcher]:

I’d like to demonstrate Honeycomb’s SLO feature. SLO stands for Service Level Objective. A Service Level Objective is a way of describing your expectations about the reliability of your system. Let’s start with this simple test data set. It’s showing us the use of a web-based API service, with a number of different endpoints exposed. For each request, we’re tracking how long it takes to be served. We’ve written an SLO on it that requires 99.9% of requests to be resolved in less than 1100 milliseconds, every seven days. At the top left, we can see that we’re not actually using anything close to that. We could compute an error budget by looking at the total number of events we’ve seen and multiplying that by the share of events our expected reliability allows to fail.
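The error budget arithmetic described above can be sketched in a few lines. This is a minimal illustration, not Honeycomb's implementation: the function names and the event counts are assumptions made up for the example, while the 1100 millisecond threshold and the 99.9% target come from the walkthrough.

```python
THRESHOLD_MS = 1100   # from the walkthrough: requests must resolve in under 1100 ms
TARGET = 0.999        # from the walkthrough: 99.9% of requests must pass


def is_good(duration_ms: float) -> bool:
    """An event passes the SLO when it is served in under the threshold."""
    return duration_ms < THRESHOLD_MS


def error_budget(total_events: int, target: float) -> float:
    """Number of events that may fail in the window without missing the target."""
    return total_events * (1.0 - target)


def budget_remaining(total_events: int, failed_events: int, target: float) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    budget = error_budget(total_events, target)
    if budget == 0:
        return 0.0
    return (budget - failed_events) / budget


# Hypothetical counts: 1,000,000 events in the window, 180 of them too slow.
print(f"{budget_remaining(1_000_000, 180, TARGET):.0%}")  # prints 82%
```

With these made-up counts the budget is roughly 1,000 failing events, and 180 failures leave about 82% of it unspent.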

In this dataset, we’ve got 82% of our monthly budget remaining, so we’re doing pretty well. Since it’s not burning quickly, we don’t expect the budget to be exhausted anytime soon. But if it were burning rapidly, the top right corner shows that we’d get alerted four hours before it burns down, enough time to hop in and try to fix whatever’s gone wrong. Historically, we’re running at a little over 99.98%. So this looks really good. Let’s take a look at what sort of events are actually contributing to the burn rate. This heat map view helps us see every single event that qualified for the SLO. In blue are events that pass, in yellow are the events that fail. Those top few rows, where duration is greater than 1100 milliseconds, are all the failures across here. From there, we can dig further down into the BubbleUp. A BubbleUp is Honeycomb’s way of separating out what events are in the good versus the bad regions and seeing how they’re different. In blue are the events that pass, in yellow are the ones that fail. This can help narrow down what’s affected, what’s happening, and why.
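One way to think about the burn alert mentioned above is as a forecast: given how fast the budget has been burning recently, when would it run out? The sketch below uses a simple linear extrapolation. That is an assumption for illustration only; the walkthrough does not say how Honeycomb computes its forecast, and the function names and numbers here are hypothetical. Only the four-hour lookahead comes from the demo.

```python
def hours_until_exhaustion(remaining_budget: float,
                           burned_in_window: float,
                           window_hours: float) -> float:
    """Project the time until the budget hits zero, assuming the recent
    burn rate stays constant (a simplifying assumption)."""
    if remaining_budget <= 0:
        return 0.0                      # already exhausted
    if burned_in_window <= 0:
        return float("inf")             # not burning, so never exhausted
    rate_per_hour = burned_in_window / window_hours
    return remaining_budget / rate_per_hour


def should_alert(remaining_budget: float,
                 burned_in_window: float,
                 window_hours: float,
                 lookahead_hours: float = 4.0) -> bool:
    """Alert when the projected exhaustion falls inside the lookahead."""
    projected = hours_until_exhaustion(remaining_budget, burned_in_window, window_hours)
    return projected <= lookahead_hours


# Hypothetical numbers: 820 failing events of budget left.
print(should_alert(820, burned_in_window=10, window_hours=1))    # slow burn
print(should_alert(820, burned_in_window=300, window_hours=1))   # rapid burn
```

In the slow case the budget would last about 82 hours, so nothing fires; in the rapid case it would be gone in under three hours, which falls inside the four-hour window.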

The BubbleUp draws a histogram for every dimension of the data. We quickly see that failures have a very specific endpoint shape: it’s API v2 tickets. They also, therefore, have very specific names and service names, because they’re speaking to that particular part of that failed service. So now we know what’s going on here: a very specific endpoint is taking too long, but everything else seems okay. Even more interesting though, here on the User ID column, we can see that it’s one particular user, 20109, who’s having a bad experience. This is great: we can handle this in part as a customer service issue, and go get in touch with that user. Of course, as engineers, we also want to know what’s causing this. Fortunately, this data set is annotated with tracing data, and so we’re able to look at the actual traces of what went wrong and how.
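The idea of comparing failing and passing events one dimension at a time can be sketched as follows. This is not how BubbleUp is implemented; it is a small stand-in that counts values per dimension in each group and ranks them by how much their shares differ. The field names (`endpoint`, `user_id`, `duration_ms`), the endpoint strings, and the sample events are all hypothetical; only user 20109 and the 1100 millisecond threshold come from the walkthrough.

```python
from collections import Counter


def compare_dimension(events, dimension, is_bad):
    """Return (value, share_among_bad, share_among_good) rows for one
    dimension, sorted so the values that differ most come first."""
    bad, good = Counter(), Counter()
    for event in events:
        bucket = bad if is_bad(event) else good
        bucket[event.get(dimension)] += 1
    n_bad = sum(bad.values()) or 1      # avoid dividing by zero
    n_good = sum(good.values()) or 1
    rows = [
        (value, bad[value] / n_bad, good[value] / n_good)
        for value in set(bad) | set(good)
    ]
    rows.sort(key=lambda row: abs(row[1] - row[2]), reverse=True)
    return rows


def too_slow(event):
    """Failing events are the ones at or above the 1100 ms threshold."""
    return event["duration_ms"] >= 1100


# Hypothetical sample: three slow requests from one user on one endpoint.
events = [
    {"endpoint": "/api/v2/tickets", "user_id": 20109, "duration_ms": 1400},
    {"endpoint": "/api/v2/tickets", "user_id": 20109, "duration_ms": 1350},
    {"endpoint": "/api/v2/tickets", "user_id": 20109, "duration_ms": 1420},
    {"endpoint": "/api/v2/users", "user_id": 1, "duration_ms": 80},
    {"endpoint": "/api/v2/users", "user_id": 2, "duration_ms": 95},
    {"endpoint": "/api/v2/tickets", "user_id": 3, "duration_ms": 120},
    {"endpoint": "/api/v2/export", "user_id": 4, "duration_ms": 300},
]

for dimension in ("endpoint", "user_id"):
    value, bad_share, good_share = compare_dimension(events, dimension, too_slow)[0]
    print(dimension, value, bad_share, good_share)
```

In this toy data the tickets endpoint and user 20109 rise to the top of their dimensions, which mirrors the pattern visible in the demo's histograms.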

Here’s a selection of some trace IDs that have experienced the failure. I’m going to grab one and go to its trace. In the trace, we can see, this is a query that took 1.42 seconds. Underneath, the reason for that seems to be this call to fetch tickets for export, which took 1.15 of those seconds. And as we look at what took that up, it turns out to be 29 different serial requests to MySQL, one at a time. Now, none of those requests took very long, but the fact that we did that many of them is rather surprising. That might give us a pretty good idea of where to debug, to go look for this serial call and find out just why it’s running this many times in a row. Using this SLO with BubbleUp, we’ve been able to see how badly something has gone wrong, who’s affected, and what we can do to fix it. Honeycomb uses many SLOs in production today. I’d like to show you a few that illustrate different ways that an SLO budget can work.
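The pattern found in the trace, many short database calls issued one after another under a single parent, can also be spotted programmatically. The sketch below is an illustration under assumptions: the `Span` shape, the span names, and the timings are hypothetical and do not reflect Honeycomb's trace schema. Only the count of 29 serial requests comes from the walkthrough.

```python
from dataclasses import dataclass


@dataclass
class Span:
    name: str          # hypothetical span name, e.g. "mysql_query"
    parent: str        # name of the parent span
    start_ms: float    # offset from the start of the trace
    duration_ms: float


def serial_repeats(spans, parent, min_count=10):
    """Find child span names under `parent` that repeat at least
    `min_count` times with no overlap, i.e. strictly one at a time.
    Returns {name: (count, total_duration_ms)}."""
    by_name = {}
    for span in spans:
        if span.parent == parent:
            by_name.setdefault(span.name, []).append(span)

    findings = {}
    for name, group in by_name.items():
        group.sort(key=lambda s: s.start_ms)
        # Serial means each call ends before the next one starts.
        serial = all(
            a.start_ms + a.duration_ms <= b.start_ms
            for a, b in zip(group, group[1:])
        )
        if serial and len(group) >= min_count:
            findings[name] = (len(group), sum(s.duration_ms for s in group))
    return findings


# Hypothetical trace: 29 short queries, each starting after the last ends.
spans = [
    Span("mysql_query", "fetch_tickets_for_export", start_ms=40 * i, duration_ms=35)
    for i in range(29)
]
print(serial_repeats(spans, "fetch_tickets_for_export"))
```

No single query is slow here, but the check surfaces that 29 of them ran back to back, which is the kind of lead the trace view gives for debugging.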

This is the SLO for an internal tool called Retriever. As we can see, it’s on a gradual and slow downward burn that uses about two-thirds of its error budget in a steady state. There are a couple of places that we might choose to investigate, but overall we’ve been exceeding our goals, and it seems like it’s pretty comfortable. There is a small smattering of errors that happen from time to time, including some where the system’s running a little bit slow. In a very different pattern, we’ve been experimenting with a new build system. We’ve been trying to track how often we keep our CI builds under 15 minutes. The curve here shows that we had some bad experiences a few days ago that burned through all of our budget, and then some. But things seem a little bit more stable now with this nice horizontal line. Hopefully, that means that over time, we’re going to begin to reset and come back. Last, the SLO subsystem has its own SLOs. We actually had an incident a few days ago, where we burned through a lot of budget. But now we’ve gotten back to stable. While our budget is pretty low, we’re just barely above that expected level. Our steady state is pretty flat, and so the alert up here hasn’t warned us and isn’t worried at all. Thank you for joining me for this demo.
