The Incident Retrospective Ground Rules
By Lex Neva | Last modified on June 6, 2023
I joined Honeycomb as a Staff Site Reliability Engineer (SRE) midway through September, and it's been a wild ride so far. One thing I was especially excited about was the opportunity to see Honeycomb's incident retrospective process from the inside. I wasn't disappointed!
The first retrospective I took part in was for our ingestion delays incident on September 8th. Our preliminary report promised that we'd post more about what happened after our investigation concluded, and the retrospective meeting that I attended was part of that work. Later on, we posted our full analysis.
Right at the start of the retrospective meeting, Fred Hebert blew my mind by reading out the Ground Rules, which I'll paraphrase here:
- Our main objective is to learn and get a better understanding of what happened and what it could mean. We strongly believe that coming up with useful practical improvements is difficult without a good understanding of the aspects of our system that challenged us, and this is what we want to focus on here.
- If you have improvement ideas or action items, I would encourage you to note them down for later rather than raising them during this meeting. Once we have a better understanding of this incident, action items usually surface organically.
- We're going for blame-aware incident reviews; we are here to assume people wanted to do a good job, and that they did the best they could to meet objectives. When questions arise about why someone did something, we prefer to focus on why it made sense at the time to take that action.
- We'd like to avoid thinking about "what we could have done differently" and instead re-frame that into "what can we do next time to get a better outcome?" It's a minor shift in perspective, but it helps us be more constructive in our viewpoints.
- Ask questions! We'll maintain a steady progression through the meeting, but there should be room for questions.
- If you think something is obvious to others but not to you, ask about it. People tend to have similar questions, and these can highlight unspoken assumptions about how we do work. You can message me privately in Zoom's chat function if you want your name to remain confidential.
- If you have feedback about how we ran the session, we're happy to receive it.
There's so much to love in this intro! I've been learning about these concepts for years and trying to slowly incorporate them into the incident retrospective culture around me. I was pleasantly surprised to hear that these ideas were already firmly instilled in Honeycomb's culture.
Let's look at the ground rules in a little more detail to find out why.
Learning vs. action items
I first came across this concept in the Etsy Debriefing Facilitation Guide, and since its publication, I've watched long-standing best practices shift toward an emphasis on learning over action items. The Howie guide for post-incident analysis by Jeli is another example of an incident analysis framework that embodies this idea.
I have to admit, my thinking on this topic has changed over the past few years. Heck, I co-led an entire conference session on running incident retrospectives that held remediation items as the main goal. However, I now see that we learn so much more when learning is the focus. Searching for remediation items actively gets in the way.
Blame vs. context for decisions
"Why did it make sense to make that decision?" Ask this question in an incident review and you'll learn more about your sociotechnical system. This one question sets the tone, making those involved in the incident feel safer because they know that everyone is assuming they made the best choice they could at the time based on the information they had.
It's worth noting that we don't say "blameless" directly. Instead, we use "blame-aware." It's okay to talk about who did something, provided that the discussion is sanctionless; no one is going to be punished for decisions they made in good faith.
Avoid counterfactuals
In an incident review, a counterfactual question asks, "What should we have done?" This kind of question is dangerous because it conjures up a reality that did not exist. In the process, it brings undertones of blame that will engender defensiveness and stifle the investigation. By phrasing our questions in the form of how we can act in the future, we acknowledge the reality that everyone did the best they could during the incident.
Ask questions
Finally, the ground rules encourage asking questions, even if the answer seems obvious. An incident review is about finding out where our mental models of the system broke down, and about bringing those models closer into alignment with the way the system actually works. Everyone's model is an approximation, and a different one at that. Your question helps you improve your mental model, and almost certainly will help someone else too. Ask it!
Using the ground rules
Creating and publishing the ground rules for incident investigations is the first step, but that's not enough. I experienced firsthand how important it is to read them aloud before every single retrospective meeting.
In any meeting, chances are there's someone new who hasn't heard the rules before. For those who've heard them before, the reading is an important reminder that tone and mindset are critical to learning as much as we can from each incident. The end result is an inviting learning environment where everyone feels safe contributing and we all get to learn together as a group.
I'll end this article by telling you that we're hiring! If the culture at Honeycomb sounds like it'd be a good fit for you, check out our careers page and see if there's a match for you.