
The Incident Retrospective Ground Rules

By Lex Neva | Last modified on June 6, 2023

I joined Honeycomb as a Staff Site Reliability Engineer (SRE) midway through September, and it’s been a wild ride so far. One thing I was especially excited about was the opportunity to see Honeycomb’s incident retrospective process from the inside. I wasn’t disappointed!

The first retrospective I took part in was for our ingestion delays incident on September 8th. Our preliminary report promised that we’d post more about what happened after our investigation concluded, and the retrospective meeting that I attended was part of that work. Later on, we posted our full analysis.

Right at the start of the retrospective meeting, Fred Hebert blew my mind by reading out the Ground Rules, which I’ll paraphrase here:

  • Our main objective is to learn and get a better understanding of what happened and what it could mean. We strongly believe that coming up with useful practical improvements is difficult without a good understanding of the aspects of our system that challenged us, and this is what we want to focus on here.
  • If you have improvement ideas or action items, I would encourage you to note them down for later rather than raising them during this meeting. Once we have a better understanding of this incident, action items usually surface organically.
  • We're going for blame-aware incident reviews; we are here to assume people wanted to do a good job, and that they did the best they could to meet objectives. When questions arise about why someone did something, we prefer to focus on why it made sense at the time to take that action.
  • We'd like to avoid thinking about "what we could have done differently" and instead re-frame that into "what can we do next time to get a better outcome?" It's a minor shift in perspective, but it helps us be more constructive in our viewpoints.
  • Ask questions! We'll maintain a steady progression through the meeting, but there should be room for questions.
  • If you think something is obvious to others but not to you, ask about it. People tend to have similar questions, and these can highlight unspoken assumptions about how we do work. You can message me privately in Zoom's chat function if you want your name to remain confidential.
  • If you have feedback about how we ran the session, we're happy to receive it.

There’s so much to love in this intro! I’ve been learning about these concepts for years and trying to slowly incorporate them into the incident retrospective culture around me. I was pleasantly surprised to hear that these ideas were already firmly instilled in Honeycomb’s culture.

Let’s look at the ground rules in a little more detail to find out why.

Learning vs. action items

I first came across this concept in the Etsy Debriefing Facilitation Guide, and since its publication, I’ve watched long-standing best practices shift toward emphasizing learning over action items. The Howie guide to post-incident analysis from Jeli is another incident analysis framework that embodies this idea.

I have to admit, my thinking on this topic has changed over the past few years. Heck, I co-led an entire conference session on running incident retrospectives that treated remediation items as the main goal. However, I now see that we learn so much more when learning is the focus. Searching for remediation items actively gets in the way of that learning.

Blame vs. context for decisions

“Why did it make sense to make that decision?” Ask this question in an incident review and you’ll learn more about your sociotechnical system. This one question sets the tone, making those involved in the incident feel safer because they know that everyone is assuming they made the best choice they could at the time based on the information they had.

It’s worth noting that we don’t say “blameless” directly. Instead, we use “blame-aware.” It’s okay to talk about who did something, provided that the discussion is sanctionless; no one is going to be punished for decisions they made in good faith.

Avoid counterfactuals

In an incident review, a counterfactual question asks, “What should we have done?” This kind of question is dangerous because it conjures up a reality that did not exist. In the process, it brings undertones of blame that will engender defensiveness and stifle the investigation. By phrasing our questions in the form of how we can act in the future, we acknowledge the reality that everyone did the best they could during the incident.

Ask questions

Finally, the ground rules encourage asking questions, even if the answer seems obvious. An incident review is about finding out where our mental models of the system broke down—and bringing those models closer into alignment with the way the system actually works. Everyone’s model is an approximation, and a different one at that. Your question helps you improve your mental model, and almost certainly will help someone else too. Ask it!

Using the ground rules

Creating and publishing the ground rules for incident investigations is the first part, but that’s not enough. I experienced firsthand how important it is to read them aloud before every single retrospective meeting. 

In any meeting, chances are there’s someone new who hasn’t heard the rules before. For those who have heard them, it’s an important reminder that tone and mindset are critical to learning as much as we can from each incident. The end result is an inviting learning environment where everyone feels safe contributing and we all learn together as a group.

I’ll end this article by telling you that we’re hiring! If the culture at Honeycomb sounds like it’d be a good fit for you, check out our careers page and see if there’s a match.
