Alayshia Knighten [Sr. Implementation Engineer]:
Hello everyone. My name’s Alayshia Knighten, and today we’ll be discussing Honeycomb SLOs. Now let’s beeline in. The million-dollar question on this slide is: what are SLOs? SLO stands for Service Level Objective: a business agreement expressed in technical terms, or an API for your engineering team that business leaders can speak. Looked at from that perspective, it’s a negotiation between engineering, product, and business leaders, who come together to talk about your service, its reliability, and what is valuable to your customers. Let’s go through some additional terminology. There are SLIs, or Service Level Indicators, which define a specific, granular measure of success per event. In Honeycomb, we achieve this with a derived column you create that adds a true or false boolean value to every single event, indicating whether it met the SLI or not.
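As a mental model (not Honeycomb's implementation), an SLI behaves like a predicate evaluated once per event; the field name below is an illustrative assumption, not a fixed schema:

```python
# An SLI yields a boolean per event: True if the event met the
# objective, False otherwise. "status_code" is a hypothetical field.

def request_succeeded(event: dict) -> bool:
    """A simple availability SLI: any non-5xx response counts as success."""
    return event.get("status_code", 0) < 500

events = [{"status_code": 200}, {"status_code": 503}]
print([request_succeeded(e) for e in events])  # [True, False]
```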
The SLO states how often an SLI must succeed over a given period of time. Your error budget is the remaining number of failures tolerated by an SLO. In Honeycomb, you can set up burn alerts to signal when you’re burning through your budget too rapidly. Here’s a fun tip: define an SLO per team, per piece of infrastructure, while keeping the list reasonable and actionable. What does that mean? I’ll give you an example. Let’s say API latency is an SLO for your engineering team. If the SLO’s error budget is exhausted, you would want a business agreement in place that the engineering team has the permission and the mandate to stop feature development and switch to reliability work for the next sprint. This becomes a great tool for customer success, support, and sales to dial in on what’s important.
The way we set up burn alerts is at two thresholds: a 24-hour alert and a four-hour alert. The 24-hour alert is a non-paging alert; at Honeycomb, we send it to a Slack channel, so an engineer can pick it up if they have time to work on it. No one needs to be woken up, and no one needs to stop the world, as it is not an emergency. So here’s a thought for you: what would life be like if you turned off all other alerts and pages for a week and operated only off of SLO burn alerts? If it’s feasible, how much noise would you be able to cut out? Chances are, a lot.
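A burn alert essentially asks: at the current burn rate, will the budget be gone within the alert window? A rough sketch of that idea (not Honeycomb's actual algorithm):

```python
# Sketch of burn-alert logic: project when the error budget
# runs out and compare against the alert window.

def projected_hours_to_exhaustion(budget_remaining: float, burn_per_hour: float) -> float:
    """Hours until the error budget hits zero at the current burn rate."""
    if burn_per_hour <= 0:
        return float("inf")
    return budget_remaining / burn_per_hour

def should_alert(budget_remaining: float, burn_per_hour: float, window_hours: float) -> bool:
    """Fire when the budget would be exhausted within the alert window."""
    return projected_hours_to_exhaustion(budget_remaining, burn_per_hour) <= window_hours

# 17% of budget left, burning 1% per hour: the 24-hour (Slack) alert
# fires, while the 4-hour (paging) alert stays quiet.
print(should_alert(17.0, 1.0, 24))  # True
print(should_alert(17.0, 1.0, 4))   # False
```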
With that being said, let’s go through a few SLI examples, or rather the derived columns that drive SLIs. Let’s look at the first one: Honeycomb’s page-load SLI. Let’s digest this for a second. On the outside, there’s an if statement, which gives us the boolean we were referring to: it returns a true or false value. The type here is a page load, and we require the duration to be less than or equal to 5,000 milliseconds. There is also a “not contains” clause, and there’s a story behind that. An individual user was accessing our platform from a Raspberry Pi, and that one user was burning a significant amount of our budget. So we went to the drawing board, meaning we had a business conversation, and decided to exclude that traffic from our SLI. That’s where the “not contains” comes from.
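In Python terms, the page-load SLI just described might look like the sketch below. The field names are assumptions for illustration; in Honeycomb this is a derived column, and events where it evaluates to null are excluded from the SLO entirely rather than counted as failures:

```python
from typing import Optional

def page_load_sli(event: dict) -> Optional[bool]:
    """True = success, False = failure, None = excluded from the SLI.

    Field names ("type", "duration_ms", "user_agent") are hypothetical.
    """
    if event.get("type") != "load-page":
        return None                       # not a page load: out of scope
    if "Raspberry Pi" in event.get("user_agent", ""):
        return None                       # excluded after a business decision
    return event.get("duration_ms", 0) <= 5000

print(page_load_sli({"type": "load-page", "duration_ms": 1200,
                     "user_agent": "Mozilla"}))  # True
```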
The second example is an if statement that says: if data received by shepherd isn’t ingested and retrievable within this timeframe, we’re going to start paging people like crazy. We have an operational range to account for the fact that the event processor may just be having a bad day, while still meeting business and customer expectations.
This last one, on the right-hand side, is one of the more complicated SLIs we have, as we have had to make several business decisions to exclude cases that are not applicable. For example, we exclude statuses and cases where there’s nothing shepherd can do, such as when the fault is on the sender’s side because the data hasn’t been calculated properly, or when there are too many columns, which for us means over 10,000. We also take out requests dropped due to rate limiting.
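The exclusion pattern scales up naturally: each business decision becomes another case that returns null before the success check. A hedged sketch with hypothetical field names:

```python
from typing import Optional

def ingest_sli(event: dict) -> Optional[bool]:
    """Sketch of an SLI with business-driven exclusions (fields assumed)."""
    if event.get("rate_limited"):
        return None           # dropped by rate limiting: excluded
    if event.get("column_count", 0) > 10_000:
        return None           # too many columns: sender's problem, excluded
    if event.get("client_error"):
        return None           # nothing the ingest service could do
    return event.get("status_code", 500) < 500
```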
So let’s dive into defining an SLO and what that looks like. Before I begin showing you around SLOs, I want to point out something. People ask all the time about an executive dashboard. Showing an SLO is a perfect example of an executive dashboard. First off, you’re not seeing CPU usage, disk utilization, and things like that. What you’re really seeing is who is having a bad user experience based on the business agreement you made. You see how the budget is being burned down, as well as historical compliance. And those are the things that typically matter.
So I wanted to actually show you an SLO in our UI. Let’s begin here. We have a simple test SLO. To create an SLO, you go to the SLO tab and create one there. You name it, give it a description, select an SLI column, and then put in the time period as well as your target percentage. For the sake of this demo, we already have one created, called API Latency. We have given it the API latency SLI, said that we want our duration to be at most 1,100 milliseconds over a time period of seven days, and set our target percentage to 99.98.
Keep in mind, as a reminder, that our target percentage reflects what we value. Nines of availability, not necessarily a hundred percent; no one can be perfect. At any point, you can go back into the SLO and edit it. Once it’s created, you’ll see the percentage of qualifying events in the API calls dataset, which is where the API latency SLI column lives. The cool thing is you actually get to see what you’ve set here: the duration in milliseconds, over a period of seven days.
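To make a 99.98% target concrete, here's the back-of-the-envelope arithmetic for a seven-day window:

```python
# What a 99.98% target over a 7-day window means in practice.
window_minutes = 7 * 24 * 60        # 10,080 minutes in the window
error_fraction = 1 - 99.98 / 100    # 0.02% of events may fail
budget_minutes = window_minutes * error_fraction
print(round(budget_minutes, 2))     # ~2.02 minutes of total failure
```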
Here we have the burndown chart. In this dataset, we have 17% of our budget remaining. In my opinion, we’re doing pretty well, as there are not many days remaining in the window and we’re meeting our target. Since the budget is not burning quickly, we don’t expect it to be exhausted anytime soon; but if it were burning rapidly, the top right corner shows how we would be alerted. A four-hour burndown would alert us here; a 24-hour burndown would alert us there. Historically, we’re running at a little better than 99.9, roughly 99.98. Let’s take a look at what sort of events are actually contributing to this burn rate. Here we’re looking at the last 24 hours, which is also reflected in the graph. If I wanted to see just the last 10 minutes, the graph would adjust accordingly. I’ll point out that in the last 10 minutes we’ve had only successful events, but over the last 24 hours we have had a few failed events here.
This particular view, if you’re not aware, is a BubbleUp. BubbleUp is Honeycomb’s way of separating the events in the good region from those in the bad region and seeing how they differ. In blue are events that passed; in yellow are events that failed. This can help narrow down what’s affected, what’s happening, and why. BubbleUp draws a histogram of every dimension of the data. Looking at the app endpoints, we can quickly see that the failures are very specific: they come from the V2 tickets endpoint. Now we know what’s going on: one very specific endpoint is taking too long, but everything else seems to be okay. What’s even more interesting is that if we go to the user ID column, we can tell that one specific user is having a bad experience. This is great. Okay, a user having a bad experience is not great, but the great part is that we can clearly see who’s having a bad experience versus the baseline. We can handle this very quickly as a customer service issue, get in touch with that user, and verify their experience. Or in reverse: if they alert us, we’re able to pinpoint exactly what’s happening.
We also want to see what’s causing it. Fortunately, this data is annotated with tracing data, so we’re able to look at the actual traces of what went wrong and how. Here you’re seeing the trace ID. We can click on one of the values and go to the trace. In this trace, we see that it took 1.183 seconds. Underneath, the reason appears to be this GET to the ticket backend service, which takes 1.125 seconds. Deep-diving into this, it turns out we have a series of serial requests to the database service, one at a time, each doing the same thing. None of those requests took very long, but the fact that we’re making so many of them is rather surprising. As an engineer, this may be something that needs re-engineering.
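The shape of that re-engineering is often batching: one request for N ids instead of N serial requests, so the per-request round-trip cost is paid once. The functions below are hypothetical stand-ins to illustrate the idea, not Honeycomb's code:

```python
# Serial requests pay one round trip per id; a batched request
# pays one round trip for all of them.

def fetch_ticket(db: dict, ticket_id: int) -> str:
    """One round trip per call."""
    return db[ticket_id]

def fetch_tickets_batch(db: dict, ticket_ids: list) -> list:
    """One round trip for the whole list of ids."""
    return [db[t] for t in ticket_ids]

db = {1: "a", 2: "b", 3: "c"}
serial = [fetch_ticket(db, t) for t in [1, 2, 3]]  # 3 round trips
batched = fetch_tickets_batch(db, [1, 2, 3])       # 1 round trip
assert serial == batched                           # same result, fewer trips
```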
Using this SLO BubbleUp, we’ve been able to see how badly something has gone wrong, who’s affected, and what we can do to fix it. Okay, now that you’ve seen SLOs, let’s look at burn alerts really quickly. If we go to SLOs, I already have two burn alerts configured, so I’d like to show them to you. To create a new burn alert, you simply click New Burn Alert. You put in your hours and your notification mechanisms: you can notify by email, Slack, PagerDuty, or webhooks. It’s totally up to you. We have two set up: a four-hour alert that goes to PagerDuty and a 24-hour alert that goes to a specific Slack channel. You can edit or delete those. My recommendation is to set up alerts for the things you care about, on the timeframes you care about. This has been Honeycomb SLOs with Alayshia Knighten. As always, go beeline in.