Conference Talk

How Honeycomb Manages Incident Response

June 10, 2021

 

Transcript

Fred Hebert [Site Reliability Engineer|Honeycomb]:

Alright, everyone! I’m Fred Hebert, I’m a Site Reliability Engineer here at Honeycomb, and today I’m talking to you about how Honeycomb incidents. I’m using the term “incident” as a verb here because this is something we do actively during and between outages. This presentation is broken down into two sections: the first is how an incident unfolds, and the second is how we ensure our tools are used effectively during incidents.

How an incident unfolds is a generative thing. There’s no recipe for it, but since joining the company I have seen patterns that repeat over and over again, and I think they’re shared across a lot of different places in how things go. All the incidents that we have sort of start with an alert. Internally we use both the SLOs and Triggers features of the platform to get told when something goes wrong. There are odd ones out that sometimes come from customers, from the Pollinators channels, or from customer success, but for most of them we try to carry everything we need from the SLOs and Triggers. Whatever alert we get is something that we consider to be the entry point of the incident. There shouldn’t be a need to go to a different dashboard page or something like that. The moment we get the alert, we can click on the link and get into the data that lets us figure out what might be going on.

To do that, we use SLOs really for the success rate and performance values. Is something working or not? It’s what we use as a proxy for our users’ experience and satisfaction. The Triggers we use for different types of alerting, usually something like thresholds. That could be the connections to a database. If you have a limit of 500 of them, you want to know at 400 so that you can take some kind of action and change how the system works before you reach that threshold where everything decompensates and starts breaking very violently. SLOs are not great for these; you want to know before there’s a problem. Triggers are great for that.
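The threshold idea above can be sketched in a few lines. This is an editorial illustration, not how Honeycomb Triggers are implemented; the function name and the 80 percent warning line are assumptions chosen to match the 400-of-500 example in the talk.

```python
def connection_trigger(current: int, hard_limit: int, warn_percent: int = 80) -> bool:
    """Return True once usage reaches the warning line, before the hard limit."""
    if hard_limit <= 0:
        raise ValueError("hard_limit must be positive")
    # Integer arithmetic avoids floating-point surprises at the boundary.
    return current * 100 >= hard_limit * warn_percent


# With a limit of 500 connections the warning line sits at 400,
# which leaves headroom to act before the system saturates.
quiet = connection_trigger(399, 500)
firing = connection_trigger(400, 500)
```

The point of the sketch is only that the alert fires on headroom running out, not on users already seeing failures.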

Another use for that is going to be non-events. The SLOs that we have are based on SLIs: there is a success or a failure, and a non-event won’t be there. If you stop receiving any kind of traffic, an SLO that looks for successful queries has nothing that will tell you the traffic is not there. We use the SLOs for end-user satisfaction, and the Triggers for this kind of operational view, as people maintaining a platform who need to know how it works. We also have third-party providers for a few things. We use Sentry for assertions, stack traces, and stuff like that, and while we are developing the OTLP metrics endpoint and things, we still use already-established metrics providers for some components that don’t give us the rich events we would like to have, for example, with Kafka.
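To make the non-event problem concrete, here is a minimal sketch under stated assumptions: the SLI is modeled as a list of success/failure booleans, and the function names and the `minimum_expected` parameter are hypothetical. It is not Honeycomb’s implementation.

```python
from typing import Optional, Sequence


def sli_success_ratio(outcomes: Sequence[bool]) -> Optional[float]:
    """Fraction of successful events, or None when there were no events."""
    if not outcomes:
        return None  # nothing to measure: the SLI is silent, not failing
    return sum(outcomes) / len(outcomes)


def traffic_trigger(event_count: int, minimum_expected: int) -> bool:
    """Fire when fewer events arrived in the window than we expect."""
    return event_count < minimum_expected


window = []  # traffic stopped entirely during this window
ratio = sli_success_ratio(window)
no_traffic = traffic_trigger(len(window), minimum_expected=1)
```

With an empty window the success ratio has nothing to report, while the count-based check still fires, which is why the two kinds of alert complement each other.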

The alert is the beginning of the workflow. The thing that happens right after you get it is that you get on Slack. One of the key things that makes the difference between a near-incident and an outage is whether it happens during business hours or off hours. Something off hours is always more exhausting. You are going to be slower, it’s going to be riskier, and you’re going to have far fewer resources available, usually other people, to help you with things. Off-hours incidents come with more context switching, more tiredness, and all that.

4:02

In general, everyone will try to increase the autonomy of the operators so that more people can go it alone, right? So that one person can handle all kinds of incidents all the time. But really, the other thing we have to admit is that our systems are too large for one person to know everything about them, and relying on each other is a basic fact that we can’t hope to get rid of. Instead, we need to optimize for the social aspects of things so incidents unfold better.

For us, a big part of that is to get into Slack. We have an #ops channel for people and an #alerts channel for all the alarms and stuff like that, and we try really hard not to post in #alerts and only post in #ops. We have this sort of escalation where the non-urgent stuff is just a notification. The really urgent stuff is a notification on Slack on top of a PagerDuty call. Most of the time there are people hanging out in the #ops channel who are almost always willing to help. Especially since we’re a distributed company, there are people in many time zones. Something you can see here on the slide is Ben talking to someone with the eng on-call alias, which is dynamically set every week to match the people in our three on-call rotations. We have one for the platform, one for the product (meaning product engineers), and one for the integrations, telemetry, and tooling that is a bit more external. Every time we ping someone with that alias, all three people on call get the notification instantly.

For us, part of that is that escalation is always acceptable, right? If there is this idea that the system is too complex to understand, there’s this acceptance that you’re going to have to escalate. It’s kind of normal. More eyes mean you get a more diverse approach, and you might get a better and quicker resolution. If things need more coordination, say you started alone, roped in someone to help you, and then there’s a need for a higher-density conversation, we start a video chat, usually with Zoom, and talk about it directly.

All the key elements, whether they’re evidence or reporting or actions that were taken, are reported back on Slack so that people looking on the sidelines can get up to speed without interrupting anyone but also because it helps the investigations that we can have after the fact. And we can see another practice that’s interesting that Ian is doing here: “Let me know if you need more eyes on this.” There’s this awareness that there are multiple people working on the team, on the issue, and if there’s any cause for concern or need for more hands, someone is on the sidelines — not getting involved for no reason — but they are making it known that they are available. That tends to be very helpful to have and to know about.

Then comes the section of figuring out, you know, what are the symptoms? What is going on right now? What’s the problem? There is, frankly, no one-size-fits-all solution for this. People will go from whatever hypothesis they have at the time and dig from there. Whatever seems the most likely before the incident even unfolds, and even before there’s an understanding, is going to orient the investigation. Usually, the frequent one is, “Oh, we had a bad deploy. Let’s roll it back.”

Sometimes that won’t work and sometimes it will. Something may have changed or accumulated, and if you’re lucky it’s that simple: you roll it back, and that’s it. In general, we rely heavily on something like BubbleUp, which we have on each of the SLO pages, to guide us through this sort of investigation. It does the quick discrimination early on of what is likely and what is unlikely, and it narrows the space down as tightly as possible. That can drastically reduce your response time, and it’s something that is not in all the SLO offerings out there.

8:02

I’m relatively new on this team. I joined less than six months ago, and the closest experience I can compare it to was absolutely slow. Here things are tons faster and nicer. What I can probably say is that it’s one of the few places where I’ve been able to be productive on call without even having to read and understand the code. I was able to reverse-engineer and create an understanding of how things work or not based purely on observations of the data we have, which is something that, as good as I have been at handling dashboards, would never happen that rapidly and with that amount of fidelity.

Sometimes what we have is something that looks like, you know, a user is abusing the system because the correlation is high and the people seeing the problems are people actively using the product a whole lot. That can happen as well if your feature usage is asymmetric. Like 90% of people use one feature; 10% of people use another feature. Then it looks like the problem with that feature is this user. That turns out to really, really be a sort of problem. Really, this dynamic of the things that you observe but can’t explain right away is where it makes sense to require judgment and skill. It’s not something that you can just automate away, that a correlation engine or something with AI could figure out easily, because a correlation is going to be extremely solid but not necessarily meaningful. So whenever folks reach the limits of what they understand in their spelunking, we sort of revert to doing analysis from the ground up, whether it is from the code or by SSHing into a server and figuring out what is happening.
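A small worked example shows how a solid correlation can point at the wrong thing. The events below are entirely made up for illustration (the user and feature names are hypothetical): one user happens to be the only one on a broken feature, so grouping by user alone makes that user look like the problem.

```python
from collections import defaultdict

# Hypothetical events: (user, feature, failed).
events = [(f"user-{n}", "feature-a", False) for n in range(1, 10)]
events += [
    ("user-10", "feature-a", False),  # the same user is fine on feature-a
    ("user-10", "feature-b", True),   # feature-b is what is actually broken
    ("user-10", "feature-b", True),
]


def failure_rate_by(key_fn, rows):
    """Group rows with key_fn and return the failure rate per group."""
    totals, failed = defaultdict(int), defaultdict(int)
    for row in rows:
        key = key_fn(row)
        totals[key] += 1
        failed[key] += row[2]  # True counts as 1, False as 0
    return {key: failed[key] / totals[key] for key in totals}


by_user = failure_rate_by(lambda row: row[0], events)
by_feature = failure_rate_by(lambda row: row[1], events)
```

Grouped by user, every failure lands on `user-10`; grouped by feature, every failure lands on `feature-b` while `feature-a` is clean, including for that same user. Deciding which grouping is meaningful is the judgment call described above.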

To put it another way, people are going to use the thing they believe is useful, and it doesn’t make sense for us, for example, to keep ourselves away from SSH just because we should be using Honeycomb. We make extensive use of it for business intelligence and our operations, but we know and admit that we’re going to meet our match at some point, where we need a different perspective that is not being covered. The thing we try to do is feed that information back into the platform, because the advantages of having it tied to the alerting, the reporting, and the first view we get in the incident turn out to pay dividends over time. We know we’re going to have to break out of it at some point, but it’s always an important step in the post-incident process to actually feed that back into the platform, both because we dogfood it for everyone using it and for our own effectiveness of response.

Then the fourth step is going to be stabilizing the patient. This is where the tools help a bit less, or at least the observability tools. Something like strong CI/CD and the ability to deploy rapidly is going to be helpful there. Once we identify what is going wrong, the question becomes: What actions can we take to bring balance back to things? Is there going to be a need for a more permanent solution? If so, how can we keep things afloat until then? If your ship has hull damage, you’ve got to work the pumps until you make it to port. You can’t fix everything live. With some things you can only make do until you get into a more stable area. And this becomes a game for the adepts, the people you have on your team who understand how to bend the system in and out of shape to keep it on target. It’s possible that you’re going to have to trade safety and reliability in one component to help another one. There’s going to be a need for compensation across them in multiple places. You share the burdens and stuff like that.

We have had cases where we had to drop data retention. We wanted to have two days in case another component was corrupting data, and we dropped that to, I believe, 12 hours to free up disk space because there was a problem somewhere else. There’s a need to understand this idea that if you have spare capacity in one place, you can tune it down and it’s going to help something else. We have that as well in some places where we do calculations or analysis and we can lower the accuracy to make it cheaper to do. You can pay for extra capacity with bigger instances until some behavior normalizes.
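As one way to picture trading accuracy for cost, here is a minimal sketch. The `degraded` switch, the stride of 10, and the stand-in data are all assumptions made for illustration; the talk does not describe which calculations Honeycomb degrades or how.

```python
def mean_estimate(values, degraded: bool = False, stride: int = 10) -> float:
    """Average values; in degraded mode only look at every stride-th one."""
    if not values:
        raise ValueError("values must not be empty")
    sample = values[::stride] if degraded else values
    return sum(sample) / len(sample)


latencies = list(range(1, 1001))  # stand-in data: the integers 1 to 1000
exact = mean_estimate(latencies)
cheap = mean_estimate(latencies, degraded=True)  # reads 100 values instead of 1000
```

The degraded answer is slightly off but costs a tenth of the reads, which is the kind of dial an operator can turn while the underlying problem is being fixed.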

There are feature flags that let you turn on or off features that could be costlier or have different tradeoffs in correctness and everything like that. So really, this is where you need to have that sort of good awareness of how things work to be able to do something. You can find the issue rapidly; solving it requires this sort of expertise most of the time. Sometimes you’ll be in a situation where a proper fix would take you three or four hours but you only have 30 minutes before everything collapses. This is where these trade-offs become extremely useful because if you have a peak use period, you can delay it by eight times and gain a lot of time. If it’s only a problem at the peak time of the day, being able to defer the work and survive one peak can buy you like an entire day of work; and if you can repeat it the day after that, you can buy weeks of work to actually fix the underlying issue if it’s not something that can be done rapidly. And those are some of the things that, you know, let us take a near-incident and prevent it from turning into a full-blown outage. I think internally we had a count of something like ten near-incidents for every outage we have, and that’s probably where a lot of that comes from.

13:36

And then you have to fix the issue. You know, fixing the issue is sort of simple enough. You change the code; you do something like that. It’s not something that tools help a lot with. By this point, you have understood most of everything and usually it’s rather straightforward. Before you even get to an incident review, your engineers probably have an idea of how to fix the very specific case of the problem they had. If your incident review is a run-of-the-mill kind of thing, this is what will get repeated; it was already decided before everyone joined. It’s a part of incident handling, it requires involvement from engineers who know the stack, and it is generally straightforward and done right away. If you only really care about the small fixes, your work is done. If you want to do the bigger work that has to do with the organization and everything, it might just be beginning at this point.

So all of these operations I have seen are usually worth optimizing on their own. On the other hand, you can’t really force a flower to bloom. You can only hope to give it all the best conditions to make it grow well. So while it’s true that having the best tools can give the best results, their effectiveness is mostly defined by the work you do before and after the incident.

Another analogy I love for that is to compare things to competitive sports, all right? Of course, it’s going to be during the competition itself that things matter the most. This is where you absolutely have to deliver, but it’s all the work done outside of the competition that tends to define how well the event unfolds: The training, the preparation, the resting you do. All of that stuff has an effect on how the actual high-stakes event unfolds.

So the basic mindset that I think everyone needs to have around that is that incidents are normal in systems and must be treated as opportunities to reevaluate our model of our own organization. “These small failures or vulnerabilities are present in the organization or operational system long before an incident is triggered.” The causes behind an error in an incident are not something you discover; they’re something we construct and interpret from what was already in place and already happening. So reliability is not something you have; it’s something you do. Adaptability is the same. The changes we make don’t cause outages. They highlight existing misalignments between how we imagine things to be and how they turn out to actually be. And the changes are only a light you shine on these cracks. They are not the cause themselves. They are just how you discover them.

You will not get good results if the thing you try to do is prevent all the incidents because by definition they’re unplanned. They already happen by surprise. Some of the best work we can do in that case is to turn things around and focus on adapting to this misalignment rather than trying to prevent it entirely or in some cases highlighting the misalignment before it becomes a big problem. Once we admit the incidents are going to happen and can’t be prevented at all or not all of them at the very least, we must do work explicitly to support the high-stakes, high-stress situation that comes out of these events.

One of the really, really basic things is asking the question: How do you prepare people for surprise? One of the things to do is to give them a broad enough understanding of the things that are in the system, right, and to keep it up to date, so that it’s possible for them to reason about new things they haven’t seen before without feeling completely lost. This hinges on people having a mental model and keeping that mental model up to date. So one of the things you want to do is provide information at the source, not interpretations. That means you want to expose the data that has to do with something that happened, not necessarily the cause or what might be wrong in the component. I don’t want the routing component to tell me why it thinks something is going wrong in another part of the system. I want it to give me information. What you do in doing that is provide context, and your people and your operators interpret and frame that context with the richer capability they have compared to any component or program.

17:36

One thing you can do is to try to align the structure of your observability with the product, not the implementation. What I mean by this is that it’s really, really cool when we can turn on something like OpenTelemetry and all of a sudden we get metrics and traces about everything; but a lot of these are going to be focused on what the code is doing, and they are going to make the most sense when you have the code open on the side and can read everything that’s going on and trace it through. If you invest in doing some sort of manual instrumentation that ties the ebbs and flows you have to what your product represents, then understanding the product instantly guides you into a better understanding of the observability you have, and it ties the code-related stuff into a richer context that lets people form a mental model from the understanding they already have of your product.
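To illustrate what product-aligned instrumentation might look like, here is a sketch that uses plain dictionaries rather than any real telemetry SDK. Every field name, the `app.` prefix, and the `build_event` helper are hypothetical; the only idea taken from the talk is that product concepts sit alongside code-level details in the same event.

```python
import json


def build_event(code_fields: dict, product_fields: dict) -> dict:
    """Merge code-level and product-level attributes into one wide event."""
    event = dict(code_fields)
    # Prefix product attributes so they are easy to find and group by.
    event.update({f"app.{key}": value for key, value in product_fields.items()})
    return event


event = build_event(
    code_fields={"name": "handle_query", "duration_ms": 182},
    product_fields={"team": "acme", "dataset": "prod-api", "feature": "query-builder"},
)
print(json.dumps(event, sort_keys=True))
```

Someone who knows the product but not the code can then ask questions in product terms, such as which feature or dataset is slow, and still land on the relevant code-level fields.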

This is the difference between data availability and observability. Observability caters to the requirement of making predictions and understanding what is going on, whereas data availability is just having numbers out there. I think aligning with the structure of your product tends to help a lot because usually there are already a lot of things in place for onboarding people onto it.

Then you have to encourage organizational awareness. Sometimes you will solve incidents because you know something changed recently or you have an idea of who knows what in the system and who to ask questions to. So demos, demo days, incident reviews, onboarding people with architecture sessions and reviews, all that kind of stuff, having discussions in general where you can compare and contrast your mental models, are going to be really good at creating that organizational awareness.

You want to make it safe to ask questions, both of your tools and of people. For tools, that means you don’t want the exploration to be costly. You don’t want to make a query to something and know it is going to cost you a lot of money or stall the work of other people. You want the cost to be so low, conceptually, that you can ask as many questions as you want and get a good signal back. For people, it’s kind of the same thing. You want psychological safety to ask people questions and get explanations, and to do it in a way that doesn’t have dire consequences for you. You don’t want to be in a situation where asking the wrong question to the wrong person about the system you’re maintaining lands you in trouble. This is going to create all sorts of isolation and erode the trust structure that makes good incident response possible.

For alerting, there’s something really interesting. We’re using service-level objectives, where the thing we want is to have the fewest SLOs possible because they’re covering user journeys. We have something like BubbleUp, so if we can have three alerts that cover 80% of the platform, we know that our alerting is always going to be as clear and as low-noise as possible.

On the other hand, when you set up an on-call rotation, you tend to structure it around who knows what. Who has the expertise to run the thing as fast as possible? If you page the right person, you get the faster response. These two things are in conflict with each other. Say you have only three SLOs for the entire company, which is not the case for us; we have about 15. If you have only three of them but you have 10,000 people on call, you need to be extremely accurate to page the right person and all of that.

21:07

Honeycomb is sort of lucky that we’re still small enough, we have this three-team rotation that I mentioned earlier, and so we’re able to get away with a few SLOs that cover a lot and have the general expectation people can handle most of everything and are safe to escalate. Otherwise, it’s serving us well. But a much, much bigger company is going to run into that problem with a bit more pain; so a few options that we have in this case are going to be some things like, one, redefine who the users of a shared component are to mean other engineering teams. If I’m maintaining a database then my users are going to be the other teams that rely on the database to be there and to work well. So that can create this sort of proliferation of SLOs based on domain-specific requirements that we have in the technical stack.

Another one is to change the on-call rotation to cover broader bases in a single team. If you have a call graph of something like 15 services to do something, it may make sense for some of these 15 services to share call rotation, knowing full well that the person being paged might have to escalate from time to time. It’s going to give them a bit of a break; it’s going to require these social changes about being safer and doing the escalations; but having fewer SLOs tends to mean if one of the shared components kind of goes haywire, you page a lot fewer people as well. So it’s that tension between how many people you page and how often you page, essentially.
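The tension between how many people get paged and how rotations are structured can be shown with a toy model. The service names, rotation names, and the even three-way split are assumptions for illustration only; they do not describe Honeycomb’s actual setup.

```python
def rotations_to_page(failing_services, rotation_for):
    """Return the distinct on-call rotations that own the failing services."""
    return {rotation_for[service] for service in failing_services}


services = [f"svc-{n}" for n in range(1, 16)]  # hypothetical 15-service call graph

# One rotation per service versus three shared rotations.
per_service = {service: f"oncall-{service}" for service in services}
shared = {service: f"oncall-team-{i % 3}" for i, service in enumerate(services)}

# A shared dependency misbehaves and all 15 services start failing at once.
pages_split = rotations_to_page(services, per_service)
pages_shared = rotations_to_page(services, shared)
```

In this model the per-service layout pages 15 rotations for a single shared failure while the shared layout pages 3, at the price of each responder knowing less about any given service and escalating more often.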

There are ways as well to do higher-level analysis, which is not in our product today. You could decide, for example, that when you see a burn alert, something like BubbleUp automatically runs on it and redistributes the page to someone else. I don’t know if anyone does that. I figure that the very, very large corporations often don’t have a choice but to do something like that. I have seen stuff about multi-alert management where it mutes some of them and groups them. At some point, if you’re too large to get a good signal, you have to sort of alert based on which alerts you see; that creates a step in there and adds a bit of latency, but helps you keep the signal. I think it’s a good thing to keep these two, what you alert on and your on-call rotations, conceptually disjoint, as two things that are in a state of tension, because it lets you keep in mind the trade-off between high coverage, high signal, low disruption, and the hygiene you have to have around the alerting.

Then finally the thing you have to do is keep the operators going. To reuse the analogy with sports, right, if you work your athletes so much that they’re always injured, you are not going to get results. Something we do is take time off after our incidents. You have to rest. If you have worked at night through something like that, then you don’t have to work for the day or part of the day that follows. We want to carry context through on-call handoffs between the rotations, where we mention what happened and what was challenging.

We test that our alerting and escalation policies are in place, and there’s this little bit of a ritual to hand things off from one to another so the context is not lost between each rotation. You want to keep a tempo. Too many incidents and you exhaust people, but too few and you fall out of practice. You get rusty. You’re bent out of shape. It’s not going to go super well. If you don’t have many incidents, good for you; but you should think about things like simulator hours, game days, chaos engineering, these kinds of things where you create controlled experiments so people can still practice procedures that happen rarely because things are super reliable or you have been lucky lately.

25:00

You should not treat on-call or incident-related tasks as “sadly necessary.” These should be opportunities for gaining better insights. They’re a good role to have. If you treat them like punishment, it’s going to be conceptualized as such and not going to be valued properly. You should be ready to renegotiate SLOs according to the capacity that you have. If you are in a situation where you can’t promise what you want to deliver, working people longer hours is not going to fix that. Sometimes you have to take a break in order to do more again. The SLOs should be negotiable and adjusted as people see fit. They are ideally a discussion tool to prioritize various types of work in the organization, not a contractual obligation. 
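For a sense of what renegotiating an SLO changes in practice, here is a small arithmetic sketch. It assumes a simple time-based budget over a 30-day window, which is a simplification chosen for illustration; the SLOs described in this talk are based on successful and failed events, and the targets shown are made up.

```python
def error_budget_minutes(target_percent: float, window_days: int = 30) -> float:
    """Minutes of allowed failure in the window for a time-based SLO target."""
    if not 0 < target_percent <= 100:
        raise ValueError("target_percent must be in (0, 100]")
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - target_percent) / 100


strict = error_budget_minutes(99.9)   # roughly 43 minutes over 30 days
relaxed = error_budget_minutes(99.5)  # roughly 216 minutes over 30 days
```

Moving the target from 99.9 to 99.5 percent gives about five times as much room, which is the kind of concrete number that makes the prioritization discussion easier to have.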

Finally, blame awareness is important. The incidents are the consequences of our organizational structures and shifts that happen over time. The people handling the incidents are those coping with the end results of weeks, months, or even years of upstream work. They weren’t the most visible parts of it, but the incidents are to be organizationally owned, and there’s nothing I would say more saddening than blaming an employee for not following procedures when the employee is usually breaking the procedures to save things; the procedures don’t count, for example. This blame awareness has to be top of mind in everything we do, and incidents are to be owned by the organization, not just the people who actively tried to solve them. This is it for me. Thank you for listening to my talk.

Yeesheen Yang [Senior Product Manager|Honeycomb]:

Fred, that was really cool! I never thought I would say that incident reviews are one of the most interesting things to me, but at Honeycomb that’s the case.

Fred Hebert:

Yeah. I enjoy them a whole lot, even if I didn’t necessarily give incident reviews a huge part in the presentation itself.

Yeesheen Yang:

Yesterday Nora Jones said something that was really interesting to me. She said stories are everything. They’re how folks get better and actually learn the history. It takes time to get good at storytelling about incidents. How does this storytelling happen at Honeycomb? What are the tools and practices you use to teach or socialize or reinforce it?

Fred Hebert:

There are two dimensions to that one. The first one I would say is during the incident review itself. Yesterday we linked into the kind of timelines that we show when we build the incidents, and in that case, this is an interesting one to give that perspective of what has been going on and getting back into the present tense and figuring out like there’s a hole of 30 minutes where we don’t know what was going on. What are the struggles at this time? It’s the hero’s journey and putting you in the perspective you had “back then” helps you have these better discussions.

The other part of the story, then, is dissemination in terms of what you tell to other people in the organization who were not there for the review, or not there for the incident. We’re still experimenting with that one, whether it’s a presentation in engineering all-hands about the sort of operational story we’re having; or having a written report about that. We’re actively experimenting with ways in which we can take the learnings we have and share them as effectively as possible both in terms of time but also with attention and requirements we have there.

Yeesheen Yang:

Yeah. Really, really cool. I’m always very excited to see that. I have never seen something like that before. Very neat. I really liked also how you have so many great metaphors in your talk, but one that keeps standing out to me is where you say that it’s about kind of bending the system and having the knowledge and experience and I guess context to bend the system out of shape to get to safety. I don’t know that I have a question around this but wondered if you could say a little bit more about that sort of mindset.

Fred Hebert:

Right. Yeah. There’s always this idea that we have a given amount of capacity to do something, whether it’s knowledge or energy and stuff like that, and there are common patterns in what people do under pressure. Some of them are going to, like, drop courtesy, drop some tasks, focus differently on some of them. Knowing the system and how it works is really about being able to make that trade-off, and knowing some components well enough to, you know, tweak the dials and turn the knobs in the ways that let you get more out of the system given the current pressure, which is not usual because you’re already in an incident. And this is the part in a sociotechnical system where the technical aspect is usually fixed, or configurable, but picking the right set of presets and everything is a dynamic adjustment by the humans, who can understand the direction things are going in a way you don’t see in just the numbers. There has to be anticipation, a sense of the direction things are going in, and acting on these. That’s why the human aspect is so important in making technical systems work.

Yeesheen Yang:

Absolutely. I think there’s so much in what you said, too, about trust and team that really is compelling there. Cool! Fred, thank you so much! That was wonderful!

Fred Hebert:

All right. Thanks!
