Should Every Incident Get a Retro?

By Lex Neva  |   Last modified on June 1, 2023

At a recent training session, Jeli spent a great deal of time covering incident retrospectives and what makes an incident worthy of studying. My colleague Ben Hartshorne asked a fascinating question, which I’ll paraphrase here:

We’ve been talking about what makes an incident interesting, but what about the reverse? Are there aspects of an incident that would make you say, “We probably shouldn’t bother doing a retro on this one?”

That caught me by surprise. We had a great discussion, and it made me consider approaches I hadn’t before.

Tell me more!

I’ve investigated tons of incidents, written incident reports and even led training sessions. I’ve never met an incident that I didn’t want to know more about. Even the simplest, most clear-cut case of “human error” has so much to teach us. I want to explore all of them.

Even so, Ben had a good point: We can’t study every incident, for several reasons, the foremost being time. Learning from incidents is a labor-intensive, time-consuming process. It takes a lot of work to assemble a timeline, interview participants, develop themes, hold learning reviews, and write reports.

Try to run a retro for every incident, and you’ll quickly run into practical limitations. Each incident can take tens of person-hours to study, and meanwhile more incidents may stack up. At some point, there won’t be enough hours in the day, incident analysts to run investigations, or money to spend.

You’ll also soon exhaust your organization’s goodwill. Earlier in my career, I learned this lesson the hard way. My SRE team had a keen interest in getting a handle on the growing number of production incidents in our cloud infrastructure. We learned from experience that memories tend to fade a few days after an incident, making learning difficult, so we set an ambitious goal: We would run a retro on every incident within one week, and preferably within a couple of days.

At first, things went well. Our team of five SREs scheduled and facilitated learning reviews after each incident. Some engineers participated actively, and we came up with great action items from each meeting.

After a year or so, we saw problems. It became difficult to get participants to join retrospective meetings. During that time, the number of incidents increased for unrelated reasons. We ran three to five retrospectives per week, and sometimes per day! Incident participants were tired, and so were we. Some incidents fell through the cracks. Incident reports were delayed by two weeks or more.

Trust in the SRE team waned significantly, and we were seen as obstructive time-wasters getting in the way of progress. Folks hesitated to declare incidents since they knew they’d be dragged into a retrospective meeting. And what about all those great action items coming out of the learning reviews? The overly abundant work tickets languished in team backlogs, and meanwhile, incidents recurred. In short, we had entirely exhausted our company’s appetite for learning. This is a pattern I’ve seen repeated more recently in my career.

How to choose which incidents get a retro

We have to pick and choose which incidents to study, but how? Show me any boring incident you like, and I can tell you five reasons I want to learn more about it. Incidents are my addiction. Stubbed my toe? Let’s hold a retro! I truly believe that, as a whole, our industry needs to do more retrospectives, not fewer—but there is clearly a balance.

I knew I needed help on this one, so I reached out to friends and colleagues, and I even posted a survey.

Going into this, I expected to get answers relating to technical aspects of an incident. I got a few of those, including:

  • Short blips where the system came back up automatically as expected
  • Known issues or inherent capacity risks that were accepted in advance
  • Repeat incidents that were already reviewed
  • Too much time passed, and memories and log files are sparse
  • This incident probably can’t teach us as much as others

These may be useful general rules, but I can think of counterexamples for each scenario. For example, I may want to run another retrospective for a repeat incident because the surrounding circumstances have changed. After all, no two incidents are exactly the same.

Things get more interesting when we step back a level. John Allspaw of Adaptive Capacity Labs shared this insight (emphasis added):

“Events that garner widespread public attention, especially those that are deemed significant enough that draw in legal (civil or criminal) processes. When incidents like these are perceived to have existential consequences for the organization, then an analyst needs to consider whether or not a learning review is actually possible given the myriad of agenda-hijacking influences in play. In these cases, the narrative is NOT in the hands of even the best incident analysts… it's already been constructed by those whose employment (or career) depends on it.

“Run from those situations. The potential to permanently damage your reputation as an incident analyst is too high.”

John Allspaw

We can’t treat each incident in isolation. Just as I discovered that running too many retrospectives damaged my team’s reputation, so too can running the wrong retro. In these cases, we have to balance the potential to learn from one incident versus our ongoing ability to learn from incidents at all.

Courtney Eckhardt shared similar advice. She cautioned that security incidents require extra care due to legal concerns, especially when questions of public disclosure come into play.

What do we do with sensitive matters?

In general, when incidents involve legal matters, tread lightly. For example, consider whether your incident report itself might become the subject of a subpoena in a civil or criminal case. Do you actually have the ability to protect the confidentiality of the interviews you’re doing in the investigation? Or will you be compelled to release them? Perhaps let that incident go, or at a minimum, check with executives and your company’s legal counsel before proceeding.

High-profile incidents as a whole should give you pause. These are exactly the kinds of incidents that we can learn the most from. However, tensions will run high, and even a company that normally embraces a blameless approach to retrospectives might look for someone to take the fall. Does your incident retrospective have the potential to harm incident participants, or perhaps even lead to someone’s termination? It might be best to focus on incidents that are less emotionally charged, especially at companies that are still early on the path toward blameless retrospectives.

That said, this kind of situation is extremely rare. In over a decade of learning from incidents, I’ve never seen an incident in which the stakes were this high. It’s worth keeping in mind, but this advice may never come into play, and in general, we should lean toward running more retrospectives.

TL;DR: It depends

In the end, which incidents should we skip over? The technical criteria mentioned earlier are a good starting place, but there are exceptions to each. The really interesting answers come when we consider the broader-ranging impact of the retro itself. Does the learning potential justify the effort required to analyze this incident? Will it damage your credibility as an analyst? Are there concerning potential ramifications for individuals or the company as a whole? Keep all of these questions in mind as you choose which incidents to learn from.

Thanks to the following people for their input: Ben Hartshorne, Courtney Eckhardt, John Allspaw, Chad Todd, John Paris, Varun Pal, Jamie, and several anonymous contributors.

 
