Debuggable Service Level Objectives
Honeycomb’s Service Level Objectives (SLOs) offer more actionable alerts with less noise. They’re also integrated right into your debugging workflows.
Adam Hicks [Senior Solutions Architect|Honeycomb]:
Hello, all. This is Adam Hicks, Senior Solutions Architect with Honeycomb, and I’m here today to talk to you about SLOs, or service level objectives. Today, we want to get an understanding of how we can use service level objectives to sift signals from noise.
So SLOs, what is this really about? I’m going to show you the product in a second and talk about how we do this, but I want to take a moment and level set on a few key concepts before we do. It’s about measuring impact. So we have the ability, inside of Honeycomb, to define, monitor, and enforce agreements that you have, say, between your application development organization and their customers. It could be internal, it could be external, or otherwise.
A few bullet points here that help us understand the outcomes that we’re really trying to achieve.
A clear tracking of performance goals: Is our application meeting the needs of those customers?
How can we keep alerts useful and prioritize workload? A lot of us in the industry are familiar with alerts, triggers that we have on metrics, different measures that we have inside of our application, and some of us get a little bit tired of them or have been tired of them if we’ve been woken up over the night. How do we reduce burnout from that? How do we make sure that the alerts we’re being alerted on are meaningful and leading us into something that is actually important for us to be concerned with? We live in a world where applications are written to autoscale, autoheal. So when is it that we’re having problems that are worth us spending time on, what’s going on?
And then, also, knowing exactly what’s going on with your error budget, and this is where we start really getting into some of the Honeycomb differentiators. Because, not only can we tell you when your service level objectives are being impacted, we can tell you why because of the way we allow you to quickly and easily understand and query differences in the attributes inside of your data.
So there are a few definitions. As I start showing the product, you’ll find that it’s very terminology-heavy. I want to make sure that we understand what it is we’re talking about. An SLI, let’s start with that, is a Service Level Indicator, and this could be a KPI, a key performance indicator. That’s another term that some of us are familiar with. Now, this could be something as simple as, I need my API to return a response in under 1,100 milliseconds. It could be an amount of acceptable error rates of a certain type, whatever those may be. And you’ll find that in Honeycomb, you can formulaically define these with includes for certain attributes, and even excludes.
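To make the idea of an SLI concrete, here is a minimal sketch in plain Python. It is not Honeycomb's own expression syntax. It treats an SLI as a function that answers, per event, "did this meet the goal?", with a third outcome for events that are excluded from the measurement. The field names duration_ms and is_health_check are assumptions made up for this illustration; only the 1,100 millisecond threshold comes from the talk.

```python
from typing import Optional

LATENCY_THRESHOLD_MS = 1100  # threshold from the example in the talk


def latency_sli(event: dict) -> Optional[bool]:
    """Evaluate one event against the latency SLI.

    Returns True if the event met the goal, False if it missed it,
    and None if the event does not count toward the SLI at all.
    """
    # Exclude: hypothetical health-check traffic should not count either way.
    if event.get("is_health_check", False):
        return None

    # Include: only events that actually carry a duration are measurable.
    duration = event.get("duration_ms")
    if duration is None:
        return None

    return duration < LATENCY_THRESHOLD_MS
```

The three-way result matters: an excluded event neither helps nor hurts the objective, which is what the includes and excludes are for.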
What is the SLO itself? This is our objective: how often we’re trying to meet that key performance indicator. Obviously, the goal is always, if it’s latency, I’d like for it to return under 1,100 milliseconds 100% of the time. That’s not necessarily realistic. We understand that it’s not going to, and I don’t necessarily need to go wake my engineers up just because one or two queries against it went a little bit beyond that. So much of our infrastructure is out of our control. It’s going through public links. How do you solve for that? I don’t want to burn my engineers out just because somebody had a slow query once out of a thousand.
The service level objective is a reasonable estimation of how we think we’re actually going to commit ourselves to that. We’ll talk about burn alert last. Move on over to the right. This is an error budget. So this is an understanding of how much error we’re allowed to have. How frequently can we actually have that error? How much of those errors can we allow before we actually start alerting on it? It’s not an alert directly, but if we start burning that budget itself, if we’re not meeting our SLOs inside of a certain time frame or we’re afraid we’re going to run out, that’s when we start working on alerting somebody. And that is our burn down. How quickly are we burning it down? And that then turns into our burn alert. From there, we can move on to the demo.
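The error budget arithmetic described here can be sketched in a few lines of Python. This is an illustration of the general idea, not Honeycomb's implementation, and the function names are hypothetical. The budget is the number of qualifying events allowed to miss the SLI over the window, given the target.

```python
def error_budget(total_events: int, target: float) -> float:
    """Number of events allowed to fail the SLI for a given target (e.g. 0.999)."""
    return total_events * (1.0 - target)


def budget_remaining_fraction(total_events: int, failed_events: int, target: float) -> float:
    """Fraction of the error budget still unspent; negative once it is overspent."""
    budget = error_budget(total_events, target)
    if budget <= 0:
        # A 100% target leaves no budget to spend at all.
        return 0.0
    return (budget - failed_events) / budget
```

For example, with a target of 99.9% and one million qualifying events in the window, the budget is about a thousand failing events. If 250 have failed so far, roughly three quarters of the budget remains.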
As I pull up Honeycomb for you to see, I have dropped us directly into our SLO feature set. Now, I came in by way of SLOs, and I selected one called API latency to help us dig in and understand some of the core concepts that we were talking about before. This is what it looks like inside of here. We have a few operative features. We talked about KPIs or SLIs before. Here, if I hover over this fx, this function, this shows me what it is. It’s a relatively simple one. It says, duration, less than 1,100 milliseconds. It’s exactly that, and we have defined the service level objective for that SLI, as we’re going to meet it 99.98% of the time over a seven-day period.
From that, Honeycomb instrumented my budget. It is now tracking budget burn down. How quickly are we actually exhausting the budget that we’ve allowed ourselves? And it looks like, you know, it’s happening, unfortunately, pretty quickly. We do have another feature over here which tells us our historical compliance. This is very, very powerful. This helps steer us, as engineering organizations, in terms of prioritizing those workloads, as we talked about in the intro. You know, do we need to spend more time on stability, or can we continue on with our focus on new features and trying to win customers over?
Up above, I have my exhaustion times. These are configured through my burn alerts. If I drop in here, you can see that what I have done is I have created a four-hour alert, and this is going to go through PagerDuty. So what this means is: If it’s going to exhaust my budget in under four hours, wake somebody up versus a less steep error burn, 24 hours. If we’re going to exhaust this in 24 hours, we’re going to notify via Slack, and we’re going to put it into a specific channel. We have every notification system under the sun, anything that supports a webhook. And you’re off to the races.
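The two burn alerts described here can be sketched as a small routing rule. This is a simplified illustration that assumes the recent burn rate simply continues; how Honeycomb actually projects exhaustion time is not covered in the talk. The function names and the "page" and "slack" labels are hypothetical stand-ins for the PagerDuty and Slack notifications.

```python
from typing import Optional


def hours_to_exhaustion(remaining_budget: float, burned_last_hour: float) -> float:
    """Project hours until the budget runs out, assuming the last hour's burn continues."""
    if remaining_budget <= 0:
        return 0.0  # already exhausted
    if burned_last_hour <= 0:
        return float("inf")  # nothing is burning, so no projected exhaustion
    return remaining_budget / burned_last_hour


def route_burn_alert(hours_left: float) -> Optional[str]:
    """Pick a notification channel from the projected exhaustion time."""
    if hours_left < 4:
        return "page"   # steep burn: wake somebody up
    if hours_left < 24:
        return "slack"  # slower burn: post to a channel
    return None         # no alert needed
```

The point of keying alerts to exhaustion time rather than to a raw error count is that a brief spike that barely dents the budget stays quiet, while a sustained burn escalates.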
But the best part is that it tells me very quickly what is burning my budget. That’s because down below, there’s a heatmap, and it has already instrumented my heatmap with color-coding to tell me, if I look at these sparklines over to the right, it’s going to tell me what events are actually burning my budget. This yellow and these peaks are the events burning my budget. They may not be at the top. They could be at the bottom. They could be in the middle. Depending on your heatmap, depending on what your SLI is measuring, it’s going to dictate the shape of your chart. The point is that the yellow events are always going to be those that are burning your budget. Those are the outliers. Those are the events that we’re interested in understanding. What’s going on with them? Why am I burning my budget?
I didn’t have to go into a tool to find out. The SLO feature inside of Honeycomb is telling me, and, if you’ve explored BubbleUp before, then this will look very familiar; but instead of me having to select a region of interest and say, Tell me, Honeycomb, about that region of interest, it is doing it for me automatically. It is telling me over here that that app endpoint accounts for 93% of those budget-burning events. Name? It’s a GET action against that same endpoint. Which service? Well, it turns out that that endpoint is on the gateway service. Even user ID, high cardinality fields. And it’s telling me that something like user 20109 is the particular problem here.
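The idea of comparing budget-burning events against everything else can be sketched roughly as follows. This is not BubbleUp's actual algorithm, only a simple illustration: for one attribute, it computes each value's share among failing events and among passing events, then ranks values by how over-represented they are in the failures. The function name and event shape are assumptions for this example.

```python
from collections import Counter
from typing import Callable, Iterable, List, Optional, Tuple


def compare_attribute(
    events: Iterable[dict],
    sli: Callable[[dict], Optional[bool]],
    attribute: str,
) -> List[Tuple[object, float, float]]:
    """Return (value, share_of_failing, share_of_passing), most suspicious first."""
    failing: Counter = Counter()
    passing: Counter = Counter()
    for event in events:
        result = sli(event)
        if result is None:
            continue  # event is excluded from the SLI
        bucket = passing if result else failing
        bucket[event.get(attribute)] += 1

    n_fail = sum(failing.values())
    n_pass = sum(passing.values())
    rows = []
    for value in set(failing) | set(passing):
        fail_share = failing[value] / n_fail if n_fail else 0.0
        pass_share = passing[value] / n_pass if n_pass else 0.0
        rows.append((value, fail_share, pass_share))

    # Values far more common among failures than among passes rise to the top.
    rows.sort(key=lambda row: row[1] - row[2], reverse=True)
    return rows
```

Running this once per attribute (endpoint, name, service, user ID) gives the kind of ranking described above, where a single endpoint or user stands out among the budget-burning events.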
Now, before we move on, I would like to show you a real-world use case. I mean, we don’t always do this, but I like it. This is Honeycomb’s usage of its own SLO function. We have a front-end service called Shepherd where we’re corralling and just. The reason I want to bring you here is just to show you the complexity of queries this supports for defining your SLIs. So with something like this, we’re looking to include and exclude the appropriate errors. That way we can maintain uptime for you, and we ourselves are notified only when we mean to be: when there’s an ingest problem with your data. You can do this inside of Honeycomb yourself. Thanks for spending a moment with me to learn a little bit about SLOs.