Conference Talk

Pitfalls in Measuring SLOs

March 2, 2020

 

Transcript

Danyel Fisher [Principal Design Researcher]: 

I’m Danyel, thank you so much for coming out today. I’m from Honeycomb.io. We’re going to talk today about SLOs, or Service Level Objectives. Now, I strongly believe every good talk should start off with a disaster, so then it’s just an improvement from there. I’m going to tell you about an outage that we had at Honeycomb. What was it? July 9th of this year. This is looking at one of the Honeycomb data graphs, showing the data we processed over time, and something happened at about three o’clock. I would love to say that it’s because every one of our users simultaneously went out on a coffee break. I would love to tell you that it’s because the internet simply became profoundly uninteresting for those minutes, but that, unfortunately, is not true. If I zoom in a little bit further, we can see a little more of the truth: around 3:50, some of our servers started collapsing and traffic started falling away. By 3:55, everything was down, and it stayed that way for 10 minutes.

The question that I’m posing by putting these few starting slides up is: how broken is too broken? I mean, that was bad, I think. We were down for 10 minutes. I like to work through this sort of debate the way everybody does: through chopper memes. Chopper memes are the way that we solve problems. We were down for 12 minutes, and objectively 12 minutes isn’t very long. But in internet time, 12 minutes is like two years. And this led, of course, to internal debates at the company about how bad this sort of thing is, right? There are some people who believe strongly in stability, and they really want to make sure that performance and stability are up. They want to improve quality.

On the other hand, there’s sales and management too who want to build new features to sell and so the question is, how do you resolve this sort of debate? What we need is a common language that allows us to describe the trade-offs between improving quality and improving features. We want something that can be understood by management and engineering. Something that we can communicate to our clients and users so we can quantify how bad an experience was. We need to be able to answer questions like, how broken is too broken? So we can talk about how bad that incident was. Conversely, we want to be able to say good enough with enough confidence that we can say, “Yeah, our system is doing good enough.” And of course, we all live in a world where we have lots of systems that are chirping their distress at us kind of nonstop. So it’d be really great if this was also an abstraction that would allow us to combat alert fatigue.

The notion of Service Level Objectives, or SLOs, is an abstraction that lets us express this idea. If we can talk about how good our system is quantitatively, then we can decide where we are sitting on that scale of good-enough-ness. Roughly speaking, the way I’m going to summarize the idea of Service Level Objectives is to say that your telemetry system produces or consumes events, and that those correspond to real-world use: users’ actual interest in what the system does and what data is being collected. Now, of course, the world is filled with events that our systems are producing. We’re only going to care about some of them, the eligible ones.

Of them, we hope the vast majority are good. This is pretty great because now we’ve defined the idea of a Service Level Indicator: given an event, is it eligible? And is it good? For example, take this massive stream of data. Some of the events might refer to, say, requests served by our web service. Those have an HTTP status code, and if they were served in under 500 milliseconds, we decide that’s quick enough; if they came back with a 200, that says we served the page successfully. So there we go: a good event is one that got a 200 and was served in under 500 milliseconds. This is great because now we have the ability to define the idea of quality.
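
To make that concrete, here’s a minimal sketch, in Python with made-up field names, of what an event-based “eligible and good” check might look like; the real definition depends entirely on how your own events are shaped.

```python
# Minimal sketch of an event-based SLI: decide, per event, whether it is
# eligible for the SLO and whether it counts as "good". The field names
# (status_code, duration_ms, name) are illustrative, not a real schema.

def is_eligible(event: dict) -> bool:
    # Only HTTP-serving events count toward this SLI.
    return "status_code" in event and "duration_ms" in event

def is_good(event: dict) -> bool:
    # A good event was served successfully (200) and quickly (< 500 ms).
    return event["status_code"] == 200 and event["duration_ms"] < 500

events = [
    {"status_code": 200, "duration_ms": 120},   # good
    {"status_code": 200, "duration_ms": 900},   # eligible, but too slow
    {"status_code": 500, "duration_ms": 80},    # eligible, but errored
    {"name": "background_job"},                 # not eligible at all
]

eligible = [e for e in events if is_eligible(e)]
good = [e for e in eligible if is_good(e)]
print(f"{len(good)} good out of {len(eligible)} eligible events")
```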

I’m going to have to do some math and I want you all to brace yourselves. Quality is the number of good events divided by the total number of eligible events, expressed as a percentage. Now, of course, to be able to define that, we’re going to have to pick a time range over which we’re defining it. An SLO, or Service Level Objective, is a minimum quality ratio over some period of time. The cool thing about this now is that we’ve got a notion of an error budget. An error budget is whatever’s left over. That is to say, if you are expecting, I don’t know, a million events this week and you promised yourself 99% of them are going to work, then you’re allowed 10,000 things to go wrong. And if only 200 of them have failed, you’ve actually got room to play with. And that’s kind of cool, because now that you’ve quantified what good enough means and you’ve picked a bar your users are comfortable with, you can do all sorts of great things with this leftover budget.
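
As a back-of-the-envelope sketch of that arithmetic (the volumes here are made up):

```python
# Error budget arithmetic for an event-based SLO over a fixed window.
# All numbers here are illustrative.

expected_events = 1_000_000   # events expected this window
target = 0.99                 # SLO: 99% of eligible events must be good

error_budget = round(expected_events * (1 - target))   # 10,000 allowed failures
failed_so_far = 200

remaining = error_budget - failed_so_far
print(f"budget: {error_budget}, used: {failed_so_far}, remaining: {remaining}")
# budget: 10000, used: 200, remaining: 9800
```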

5:35

You can start to deploy faster. You’ve got room to experiment, go off and set off your chaos monkey and go see what it successfully tears down without being worried about failing your users’ objectives. Or, perhaps you might realize that in fact, you weren’t generous enough. That 99% you could do better. You could promise your users better than 99%. You could tighten your SLO. Having a leftover budget is great. Being out of error budget conversely tells you that it’s time to improve quality and to get yourself up to the standards that you’ve set. Now, going back to that incident I showed you, in fact, our company had worked out SLOs.

Honeycomb is a telemetry company. We take in people’s events, we process them, and we let people run computations over them afterward to generate graphs and charts and interact with them. I’ll show you more about that later, but to some extent, that almost doesn’t matter as much. What I really want to bring out is that users are sending us data, and it’s really important to us that we keep that user data. So in fact, one of our SLOs is that we always store user data, and we interpret that as 99.99% of events. For those doing the math at home, that does mean that one in 10,000 events can fall on the floor, and that would be unfortunate. But really, on a grand scale, we’ve decided that’s an acceptable level, and so have our users.

Now, we have a set of default dashboards that load on the home page when you get started. We’d really like that page to load in less than a second. But sometimes this page runs a little slow. It does depend on a whole bunch of components coming together, and our users seem to be fairly tolerant around that one, so we’re placing it at one in a thousand, 99.9%. Sitting behind all this is our grand data store. Many of the requests that it gets are reasonable, and sometimes the requests are unreasonable, and sometimes traffic is bad, and sometimes Amazon S3 is quirky. So the target there is that queries return in less than 10 seconds, and we have a 99% on that one. By the way, I know that we’re talking in terms of events.

You can also sort of translate this in your head to time ranges. In which case we can say for example that, 99% means that in the course of one month, we’re allowed to have seven and a half hours of downtime. Ninety-nine point nine percent would mean that we’re allowed to have 45 minutes of failure total. 99.99% would mean that we’re down to 4.3 minutes. So we might ask ourselves, when we looked at that gap, what happened here? Was that 4.3 minutes, two years in internet time or was it just a passing blip? Well, that depends on which SLO we’re talking about.
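
The event-to-time conversion is just the failure fraction times the window. A quick sketch, assuming a 30-day month (the figures above are rounded a little differently):

```python
# Convert an availability target into allowed downtime per 30-day month.
minutes_per_month = 30 * 24 * 60  # 43,200 minutes

for target in (0.99, 0.999, 0.9999):
    allowed = minutes_per_month * (1 - target)
    print(f"{target * 100:g}% -> {allowed:.1f} minutes (~{allowed / 60:.1f} hours)")
# prints roughly 432 minutes, 43 minutes, and 4.3 minutes respectively
```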

Here’s the bad news. This is user data throughput, which means that, in fact, we blew through three months’ error budget in those 12 minutes. This was devastatingly bad. By the way, when you’re a telemetry company and you lose user data like this, it means that every single one of our users, for the next few weeks, as they looked at the graphs of their own systems, got to see that gap where we dropped their data. It was incredibly embarrassing. So what do you do in that situation? Well, as I said, we dropped customer data and it was really embarrassing. We rolled it back, though. We rolled back the bad change that had caused this to happen, and that’s actually what took about six of the ten minutes. We communicated the outage to our customers and apologized profusely. We halted deploys. We stopped making changes until we had figured out what was going on and why.

I’d actually like to walk you through some of the history of what we learned from that experience, too. The very first thing that happened was that a developer checked in code that didn’t compile to our repository and attempted to push it into our continuous integration system. We fired that developer and took care of the situation. Thank you all so much for your time; that’s the end of my talk. We’ve heard today already, actually, about the importance of blameless deploys, about how human error is always linked to other sources of error. Because let’s be honest, that wouldn’t have made a difference if our Continuous Integration system, our CI, had been working as it was supposed to. But it turns out that we were in the middle of experimenting with our CI system.

A bug had recently been checked into that system which, unfortunately, swallowed error status codes. Now, most of the time that doesn’t matter and isn’t noticeable, because most of our developers were mostly checking in code that mostly worked. But in this case, it didn’t. Of course, that’s not a big deal either. I mean, we wrote a zero-byte file and nothing was produced, which shouldn’t be a big deal either, except that our deploy system very joyfully took this empty binary and spread it across every server we had. Which still should be fine, because of course our pulse tests, the health checks, should have noticed that. We didn’t have health checks at that time. That was bad.

10:39

Our infrastructure engineering team simply stopped work on everything else that they were doing and they focused on these points. Over the course of the next two weeks, we built ourselves a health check system with an automatic rollback. We made sure that our deploy system could not possibly deploy empty binaries and we fixed our CI system so that it actually produced errors when a user checked in erroneous code. The notion of having Service Level Objectives allowed us to characterize what went wrong, how badly it went wrong and allowed us to prioritize repair. To know that this was a big enough issue, that it was actually worth repairing.

Today what I want to do, having introduced this idea of SLOs, is talk a little more broadly about our process of building an SLO feature in Honeycomb. I’m going to talk a little bit about how we used it internally and how we figured out what to do. Then talk through some of the challenges and pieces of user feedback that we got around it. I’m going to walk you through one other embarrassing incident that perhaps can help color some stories about the cultural change that happens around these SLOs.

Now, I’d love to say that you don’t actually have to listen to this talk, because you can just pick up these books, which Google has available for free on their website. They each have a chapter on SLOs, they’re fantastically well written, and they’re great introductions. We tried that, and it wasn’t enough, because it turns out there are a lot of subtleties. I’m going to talk a little bit about design thinking and how I approach this kind of problem, and a little bit about the experience of creating SLOs, expressing them, and viewing them. Then I’m going to switch over to talking about how you respond to SLOs, and, like I said, I’ll wrap up with some of the lessons from those experiences.

Now, my background and training is in human-computer interaction and data visualization. I profoundly care about how people visualize, analyze, and interact with data. I’ve been trained and I love spending hours looking over people’s shoulders, seeing them play with data sets, and figuring out sort of both what we do well in those systems and what parts we could improve. In the process of doing this, I was delighted to not only be able to go out and interview and study people, but we had internal users, we had experts in the creation of SLOs, we had internal teams and we were able to pull things from our customers and get a small group of early responders to play with this system and be able to give us feedback. This meant that this is going to be really a conversation about design. About understanding users’ tasks and understanding users’ questions and talking about some of the approaches that we use to these.

I’d like to start off, as I said, by talking about the idea of just displaying an SLO. I’ve given you this nice overview of “here’s a failure, and we did the math by hand to compute how many minutes of failure we had.” Our real goal here, though, is not only to see that we are burning through our error budget; we want to see where the burndown is happening. We want to explain why it happened, and if at all possible, we want to give cues that allow remediation to happen as quickly as possible.

For those of you who have looked at SLOs in competitor products, the Google SRE books, or various other sources, you’re going to notice that there are actually two different sets of conversations around SLOs. I can give you the definition that’s event-based: just the percentage of events that had, say, a duration less than 500 milliseconds. There’s a whole second school of thought, which you’ll see in some other products, talking about how many five-minute periods had a P95 duration less than 500 milliseconds. If you’re working with a system that’s built around collecting metric data, that second approach works much better, because you’ve already got that P95 floating around. Honeycomb is based on individual events. We really care about keeping those individual events, and so for us the event-based definition works better. It’s a vendor talk, I’m biased. But I know how to do the event-based math in my head, and I still haven’t been able to get my head around the time-bucketed percentile version statistically. For the rest of this conversation, I’m going to be talking about the event-based genre of SLOs.

Our first question was how to express SLOs. I have a little bit of a background as a designer, and my first take was that we should make this super straightforward for our users: build them a little wizard and let them fill out a form to express what to do. I showed this to our engineering team and our engineering team said, “Really?” And then I showed it to some first users and they sort of blinked at me and said, “You’re kind of overthinking this.” They pointed out that anyone we were targeting SLOs at was actually pretty good at using our internal expression language and pretty good at expressing questions. For example, if you really care about the event whose name is run_trigger_detailed, and you want good events to be those where an error does not exist, you can just write this as an if statement. This is our own internal mini-language, and it was enough to get us through that first bit and get something into users’ hands. Presto: five screens’ worth of development and a lot of prototype went away, which was great.
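
I won’t reproduce the exact expression syntax here, but the logic of that if statement, written as a Python stand-in with the field names from the example, is roughly:

```python
# The SLI from the example, as a plain conditional rather than a wizard.
# Field names (name, error) mirror the talk; the real Honeycomb expression
# language looks different, so treat this as a Python stand-in.

def sli(event: dict):
    # Eligible: only events for the run_trigger_detailed handler.
    if event.get("name") != "run_trigger_detailed":
        return None          # not eligible; doesn't count either way
    # Good: no error field present on the event.
    return "error" not in event

print(sli({"name": "run_trigger_detailed"}))                   # True  (good)
print(sli({"name": "run_trigger_detailed", "error": "boom"}))  # False (bad)
print(sli({"name": "something_else"}))                         # None  (ignored)
```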

16:13

Now, as users have continued using this, their SLIs are getting more and more complicated. This is actually one of our current active SLOs that we use internally at Honeycomb. It says that the events we care about are the ones that aren’t from internal Honeycomb users (because we have weird usage patterns), that aren’t from us sudo-ing into someone else’s data, and that haven’t spit out the message saying this dataset has been deleted. And then if they’ve got a status code of 403… it’s long, it’s complex. We’re beginning to rethink that decision not to build wizards, precisely because we’re finding that the expressions people are putting in are getting longer and longer. And that’s great; this is the user feedback that we actually need and want, and so we’ve been monitoring the SLOs that our users are creating. Once you’ve written this expression, though, the rest is pretty straightforward. You need to specify a time period that it’s going to act over, in days, and a target percentage. You fill out those two fields, give yourself some notes about why you created this thing, and you’re pretty much set.

What we’re able to do then is show you how your error budget has burned down over the last few days. In this case, for example, we have an SLO that 30 days ago started at 100% of its budget and has been gradually burning its way down; it’s now a month later and we’re down to 57.5%. By the way, again, the value of user feedback: when we first created this, we just had a chart that ended right there. Someone said, “The only part of this chart I really care about is the number at the end.” We said, “We can make that pixel really big.” And they said, “Or you could just put the number on it.” And we said, “Oh, right, we could.” That was one of those duh moments that made everything better.
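
Under the hood, that burndown line is just running budget consumption plotted over the window. A sketch with synthetic daily counts:

```python
# Sketch of a 30-day error-budget burndown line from daily counts.
# The daily numbers are synthetic, just to show the shape of the math.

target = 0.999
daily_total = [100_000] * 30              # eligible events per day
daily_bad = [40] * 10 + [55] * 20         # failed events per day

budget = sum(daily_total) * (1 - target)  # total allowed failures (~3,000)
used = 0
for day, bad in enumerate(daily_bad, start=1):
    used += bad
    remaining_pct = 100 * (budget - used) / budget
    if day % 10 == 0:
        print(f"day {day}: {remaining_pct:.1f}% of error budget left")
# The chart in the talk is this line drawn continuously; it can go negative
# once you have spent more budget than you had.
```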

We also realized that tracking how well you’ve been doing over the last period is really useful. This, for example, is the history of compliance, which says we really want this SLO to be four nines. Holy crap, we’re actually at four and a half nines. We’ve been wiggling around for a bit. We’re even higher there. We’re getting dangerously close to 99.996%. Isn’t this exciting? Let’s look at some more. We’ve actually got these two charts next to each other; it’s kind of nice to be able to compare them side by side. Here’s a more interesting example, the “homepage, no error” SLO. It was doing pretty well and burning down at a perfectly reasonable rate till… what is it? February 13th or so, when it plummeted. Went along, plummeted again, kept plummeting, and has now stayed stable at -176%. You can also see, by the way, that we have not made our SLO target on this one for quite a long time. For a while, it was looking like we were at least in the general vicinity of 99.2, and recently we’re down below 99%. This one’s embarrassing.

Let’s talk about how we diagnose this sort of thing, how we understand and learn from it. I mean, seeing this drop is nice, but what caused the drop? What happened there? The next thing we added is a visual presentation that shows you a heatmap of precisely where those errors were. Now, remember I’ve been talking about the idea of SLOs as events; we’re actually able to draw the individual events that failed on this chart. This particular SLO is based on both latency and errors, so we can see that everything above a certain latency threshold is just being painted yellow: those events fail because they’re too slow, that’s what failing means for them. But then we can also see that there’s a whole smattering down here of fast failures, things that returned an error message. So just seeing this chart is not a bad start for understanding: are we failing because of latency, or are we failing because of an error message?

We can take another look, for example, at that -176%. We can jump back through time to 14 days ago to get ourselves right back here. We can see that this was a huge batch of events that all happened at once. That’s kind of interesting: this is apparently not a systemic problem. It’s not our system failing. It’s that, for some reason, a bunch of things generated error messages very quickly, all at once. Now we’d like, of course, to understand why they went wrong. Going back to our notion of events again, this idea that we’ve got individual, discrete events that we can look at and say “these are good events or bad events” means that we can do all sorts of wonderful things. For example, each of these events has many, many dimensions on it, ranging from the request URL to the user ID and dataset ID, and various fields associated with the response status that it sent back and what error message it was producing. This is pretty great. What we’ve now done is, for each dimension in this dataset, we’ve drawn a little histogram of the events that succeed and the events that fail: success in blue, failure in yellow.
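
Conceptually, that per-dimension view is simple: group succeeding and failing events by each attribute and look for values that are overrepresented among the failures. A rough sketch:

```python
# Sketch: for each dimension, count how often each value appears among
# failing vs. succeeding events. Dimensions where the failures pile onto
# one or two values (a user ID, a team, an error message) are the leads.

from collections import Counter, defaultdict

events = [
    {"user_email": "a@example.com", "status": 500, "good": False},
    {"user_email": "a@example.com", "status": 500, "good": False},
    {"user_email": "b@example.com", "status": 200, "good": True},
    {"user_email": "c@example.com", "status": 200, "good": True},
]

histograms = defaultdict(lambda: {"good": Counter(), "bad": Counter()})
for event in events:
    bucket = "good" if event["good"] else "bad"
    for dim, value in event.items():
        if dim != "good":
            histograms[dim][bucket][value] += 1

for dim, counts in histograms.items():
    print(dim, "bad:", dict(counts["bad"]), "good:", dict(counts["good"]))
# Every failure here comes from a@example.com, which is exactly the signal
# that turns "the system is unreliable" into "one user is having a bad day".
```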

21:23

There are some things that are completely unsurprising, right? We said that we’re monitoring for errors and, hey, what do you know? The errors are 400s and 500s and the 200s are successes. That’s what we’d expect. Similarly, the yellow bars all line up with the error messages because, yes, that’s what an error means. But then there are some other fun things that show up. For example, all of those errors are apparently due to two specific user email addresses, and they’re due to two specific user teams. So suddenly this isn’t a story about the whole system being highly unreliable; it’s a story about two users having a very bad experience. I’m not trying to be cavalier, I don’t want two users to have a bad experience. But this calls for a rather different sort of response.

And in fact, when we looked at the error message specifically to see what happened, it was “could not find team.” Okay, that’s really interesting. So what this means is that we’ve got users who logged in, got an error message, and the error message was “could not find team.” What this actually means is that we’ve somehow built into our code a mechanism that allows you to log in even after your data has been removed, or after you’ve probably stopped being a customer. That’s kind of interesting. We’ve now got a customer service issue: find out why these people are still logging in and what they’re hoping to find. And we’ve got a technical response, right? Our dev teams should go out and fix that bug.

If your data has been deleted, you shouldn’t be logging into our system anymore. This is touching on that goal of seeing where the burndown is happening, explaining why it’s happening, and remediating. So we rolled this out as a beta and we were very proud of ourselves because look, we had these cool charts and we had these graphs and we got user responses like, “Wow, this is great. It’s allowing us to figure out what’s contributing the most to missing our SLOs.” And we said, “Yes, we are succeeding.” “To the millisecond we knew what our percentage of success was versus failure.” And we said, “Yes, we are showing them what we need to.” And someone said, “This confirms a fix for a performance issue.” And we said, “We are getting there.”

And then someone said, “We don’t have anything to draw us in. It would be great to have a sense of where the budget’s going.” And we realized that we had this beautiful page and no one had any reason to look at it. I mean, they look at it the first time they create it, and at some point someone goes, “Oh yeah, how is that SLO doing?” So we also really needed burn alerts. We needed the ability to surface that something was going wrong. Going back again to the user goal, we want to be able to figure out how long it will be before I run out of budget. The reason we want to do that is that we’re really interested in human-digestible units. That is to say, if you’re going to run out of budget in, say, 24 hours, then you should probably get a full night’s sleep, wake up in the morning, and go check it out then. If you’re going to run out in four hours, you should probably wake up some other support folks, get the team together, and figure out what’s going wrong and why it’s going so badly that you’re going to burn through your budget. And in fact, we’ve specifically built in a system that lets you back this with a pager call, or happily post it to Slack or something, so you can pick your level of urgency and your type of response.

Our prediction mechanism is as simple as: look at the wiggly line of the recent past, pick a lookback period, and extrapolate forward to figure out when you’re going to exhaust your budget. If that moment falls inside your alert window, you alarm; if it doesn’t, you don’t. That was pretty straightforward, and we were very proud of ourselves for having built our first burn alerts. The mechanism is quite simple: if you’ve got a 30-day SLO, you run a 30-day query. You do so at about a five-minute resolution, so that you can make sure you’re getting all the latest data, and you probably do this every minute.
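
A minimal version of that extrapolation, assuming you already have recent samples of how much budget is left, might look like this:

```python
# Sketch of a burn alert: linearly extrapolate the recent budget-remaining
# trend forward and alert if exhaustion lands inside the alert window.

def hours_until_exhaustion(samples, lookback=4):
    """samples: list of (hour, fraction_of_budget_remaining), newest last."""
    recent = samples[-lookback:]
    (t0, b0), (t1, b1) = recent[0], recent[-1]
    burn_rate = (b0 - b1) / (t1 - t0)     # budget consumed per hour
    if burn_rate <= 0:
        return float("inf")               # not burning; never exhausts
    return b1 / burn_rate

history = [(0, 0.40), (1, 0.35), (2, 0.31), (3, 0.26)]  # synthetic samples
eta = hours_until_exhaustion(history)
alert_window_hours = 24
if eta < alert_window_hours:
    print(f"burn alert: budget exhausted in roughly {eta:.1f} hours")
```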

25:28

Now, you know how they say premature optimization is the root of all evil? Sometimes post-mature optimization is a bit of a problem too, because we managed to burn through $10,000 of Amazon bills. There was the first five and a half thousand dollars that day, and here are the next four and a half thousand, before someone said, “Ooh, a 30-day query running at a five-minute resolution over the last month, repeated for every SLO that we have running, repeated every minute, begins to burn through a lot of resources.” Fortunately, as computer scientists, we understand the right answer to this one. It’s straightforward: you just have to implement a cache. There’s absolutely no problem with that. All we have to do is cache the results… except not incomplete results, and not… This actually turned out to be more of a challenge than we thought.

Our back-end servers had never been designed for building reliable, high-speed caches. We had always been okay with the idea that some percentage of queries might be off by 5% or 6%, and that’s okay because users press refresh and rerun queries; almost all our results were pretty transient. Now we were moving to a space where results were not only important, they were meant to be actionable. They needed to be precise. So we actually had to re-engineer some of our query engines to make absolutely sure that the data we were producing was high enough quality that the front end could trust it and cache it for an entire month at a time.

Once we resolved that, we suddenly started having alerts flapping on us. What was happening was that the system would predict, “Hey, it’s going to fail in three hours and 55 minutes, throw an alert.” Then a good event or two would show up and it would say, “Oh, well, we’re fine. We’re out of the woods. It’s four hours and one minute.” Then a bad event would show up and it’s 3:55 again. It would keep flipping back and forth, and the alerts would sit there flapping. And we were like, “Wait, the whole point of SLOs was to get rid of flappy alerts. What have we done wrong?” It turns out that when we stared at the data a little bit and looked at the examples of alerts that were flapping, we threw in a 10% buffer and it all went away, because by and large a four-hour alert was almost always right-ish, and if it doesn’t reset until the prediction is back out to four hours and 20 minutes, you’re fine. The newest feature that we’re now working on is the idea of recovering from bankruptcy.
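
The anti-flap fix is a small hysteresis rule: fire when the projection crosses the threshold, but don’t clear until it has moved comfortably back past it. A sketch using the 10% buffer mentioned above:

```python
# Sketch of the anti-flap buffer: an alert fires when projected time to
# exhaustion drops below the window, and only clears once the projection
# rises about 10% above the window again.

class BurnAlert:
    def __init__(self, window_hours=4.0, buffer=0.10):
        self.window = window_hours
        self.clear_at = window_hours * (1 + buffer)   # roughly 4.4 hours
        self.firing = False

    def update(self, hours_to_exhaustion: float) -> bool:
        if not self.firing and hours_to_exhaustion < self.window:
            self.firing = True
        elif self.firing and hours_to_exhaustion > self.clear_at:
            self.firing = False
        return self.firing

alert = BurnAlert()
for eta in (3.9, 4.05, 3.95, 4.1, 4.5):   # a prediction wobbling around 4h
    print(eta, "->", "FIRING" if alert.update(eta) else "ok")
# Without the buffer this would flap on every wobble; with it, the alert
# stays up until the prediction is clearly back above the window.
```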

Do you remember that example I showed you of the SLO sitting at -176%? At that point, in theory, no alert is going to fire, and no alert is going to keep firing, because we’ve already burned through that entire budget. We would just have to come back to that page every couple of days, hit refresh, and watch that number crawl its way back up as we regained error budget. So we’re building a bankruptcy mechanism that will let you reset things and get new alerts after you’ve fixed the problem. Interestingly, as we were talking about this, we went out and chatted with some of our active users, and it turns out that their workaround was to delete the SLO and then recreate it, which blew away the cache and caused it to start getting fresh data. Which works, but is kind of embarrassing from a design point of view. I don’t want our users having to give each other that sort of horrible advice; I’d like to actually provide a system that works for them.

In the process of building these out, we’ve learned a lot about the experience of working with and building SLOs. The first lesson was that volume turns out to be really important. No matter what level of nines you have, your SLO needs to tolerate on the order of dozens of bad events a day. If your data is so sparse, or your target so very high, that two or three bad events throw you over, then what you start seeing is, again, flappiness: one or two bad events come in and all your alarms go off, and then it takes a while to recover from that. If you’re really at the “one or two bad events are going to sink my company” level, you’re probably pretty happy with the classic alarms, the classic triggers model of “a thing went wrong at least once,” and most systems are pretty good at that. SLOs are a really good way of expressing the idea that your system is beginning to degrade, and you’d like to be able to catch that. So again, you need to be able to tolerate at least dozens of bad events a day.

More interesting to me is that we had to start distinguishing between our fault and our users’ fault in our error messages. If you sent us bad data, our system would very happily return a 500: you sent us bad data, that’s a 500, it can’t be processed. Also, if our back-end server had failed and we couldn’t process it, we’d send you back a 500 too. From a user’s point of view, that was pretty much right; our 500s came with a textual description of what had gone wrong, and either way, we wanted you as our client to cache that event, or somehow absorb it, figure out what was going wrong, and perhaps resend it to us later.

In the SLO world, that became a really important distinction. In fact, we had a day when all of our SLO alarms went off simultaneously, and it turned out to be because one user had tried to send us terabytes of badly formatted data that we didn’t know what to do with, so we kept returning 500s for them. A number of our systems have now started annotating responses as our fault or their fault; I’m actually showing you snippets of the actual Honeycomb source, where we annotate it with “our fault” and “their fault.” Another interesting learning for me has been that SLOs are also a way of supporting your customers and finding out what their experience is like. Again, I gave you that example earlier of looking at the dataset where we saw that it was one particular user having a bad experience. Probably two-thirds to three-quarters of the SLO incidents that we’ve seen have been based around one particular user or one particular instance having that kind of problem. The ability to flag “hey, this is the user who’s having a bad experience, you can reach out to them” has meant that the Honeycomb customer success team is now also looking at SLOs, and sometimes writing their own SLOs, to know which users to reach out to and where challenges have been coming up.
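
One way to make that distinction usable in an SLI is to annotate each failing response with whose fault it was and only count our-fault failures against the budget. This is a hypothetical sketch, not the actual Honeycomb annotations:

```python
# Hypothetical sketch: tag each failure as "ours" or "theirs" and let the
# SLI count only our-fault failures against the error budget.

def classify_response(event: dict):
    if event["status_code"] == 200:
        return True                    # served successfully: good
    # A 500 caused by malformed client input shouldn't burn our budget.
    if event.get("fault") == "theirs":
        return None                    # their fault: not eligible here
    return False                       # our fault: bad, burns budget

print(classify_response({"status_code": 200}))                     # True
print(classify_response({"status_code": 500, "fault": "theirs"}))  # None
print(classify_response({"status_code": 500, "fault": "ours"}))    # False
```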

32:22

The last important realization, I think, is that blackouts are super easy. That first incident that I showed you: we didn’t need SLOs to discover it. Literally every alarm bell at Honeycomb was going off. We were losing customer data; that’s alarmed to the gills. Every system that could possibly notice that customer data was supposed to be coming in and wasn’t, was firing. Blackouts are easy. What’s really interesting about SLOs is that they notice gradual degradation. They notice slight failures and partial issues. A couple of months ago, at 1:29 AM, our new SLO system, which was barely approved (in fact, we had its alerts in a temporary Slack channel because we didn’t really trust them yet), went off and said that our ALB response codes were not matching what our internal systems were reporting. Now, this really just shouldn’t happen at all, so we had that one set at four and a half nines. What is that? One in 20,000 events. It really should be super rare. It alerted, and our on-call person glanced at it and said, “Oh, that’s probably just a blip. I don’t know what’s going on.” Later on, we’d analyze it and find out that we actually lost 1.5% of events for about 20 minutes, which is embarrassing. Not necessarily devastating, but embarrassing.

Here it is afterward. You can kind of see it, and to give away a little bit of the punchline: the distinct number of hostnames, the number of computers that were running, plummets, then climbs back up and everything’s fine. It blipped again at 4:21 AM. Now, one of our techs happened to know some folks on the Amazon on-call cycle, and they reached out to each other and found out that Amazon ALB had been having some problems, so this was probably just an AWS issue. The reason we were seeing those hosts falling off was, maybe, that the Amazon ALB system was dropping some of our servers; it’s probably not our problem. At 6:25 AM, we’re pretty sure it’s Amazon, and now our team’s getting really annoyed that Amazon’s throwing us over. “Don’t they know we’re trying to run a reliable service?”

It wasn’t until 9:55 AM that a more awake on-call tech took a look and said, “Whenever the ALB seems to be dropping us, our system uptime is dropping to zero. Something’s running out of memory.” We had actually stopped alerting on out-of-memory, because again, we have hundreds of spot instances that are temporarily running, and they go up and down all the time. It wasn’t considered a huge deal, because sometimes systems run out of memory. All of them running out of memory simultaneously and then falling over, though, that’s not so good. This was happening on a roughly four-hour cycle: the machines would all slowly, gradually run out of memory and fail, recover, and then slowly, gradually run out of memory and fail again. At that point we finally figured out what was going on: we had dropped in a change to our code that had unbounded memory growth, and within a few minutes of figuring out that that’s what we were doing, we were able to stop the next cycle from happening, and everything was fine. In fact, if you go look at the global memory-in-use chart, here’s where the fix goes in, and you can see it propagating across the machines: now they’re not using as much memory, and here they are doing just fine, using almost no memory, as we expect them to, and everyone was happy. By the way, we can of course view this in our SLO view, and you can see that up until this incident things were going great, and then they weren’t. We get these nice, big, broad yellow bands where the systems were failing and all the events were turning bad.

This slide should be familiar. How did we fix it? We stopped writing new features. We prioritized stability. We mitigated risks. We apologized to our customers. Oh, and one last thing: we promoted our SLO burn alerts. Because this was a failure that the SLO system had caught super early and was fully aware of, while the dev team, who didn’t quite trust SLOs yet, were saying, “Well, we’re not seeing our other alerts go off. This is probably okay.” And to me, that says there’s also a piece of cultural change here. It’s hard to move from “I know why this alert went off; it went off because we have seen 15 errors on Wednesday” to the idea of “Are we at the level of reliability that we expect to be at?”

Ironically, having this really clear incident actually helped a lot, because people were able to go back and say, “Wow, if we had really trusted the SLO, we probably wouldn’t have said blip. We would have said, this isn’t the story that we want to be telling.” The positive effect that’s come out of this SLO experience, by the way, is that we’ve been able to reduce alert fatigue a lot. A couple of months ago, we were more often in a situation where we had alarms on things like running out of memory or systems crashing. We’re moving more toward a world of user-facing and user-affecting SLOs: the system has failed in a way that users can potentially experience, that’s affecting downstream clients, and we’re moving to actionable alarms that say, this is the thing that we’re seeing, and this is how we’re going to resolve it.

I’m going to wrap up by saying that, as I said earlier, SLOs gave us a tool to characterize what went wrong, how badly it went wrong, and how we wanted to prioritize repair. Now that you’ve got these tools, I’m just trying to spread the gospel: I want you to do it too. It doesn’t have to be with our tool; there are a lot of options out there. But the ability to move toward alarms that are specific, actionable, and effective, that speak to your users’ challenges and not to your understanding of your computer architecture, will hopefully allow you to avoid our mistakes. That said, this is a vendor talk, so I’m going to say: you should go upstairs to the fifth floor. We’ve got a booth, we’ve got demos. Come kick the tires, come check things out, ask me questions. If you just want a copy of the slides, they are at hny.co/danyel or at the QR code. Thank you all so much for your time and attention.

