Negotiating Priorities Around Incident Investigations

By Fred Hebert | Last modified on February 26, 2024

There are countless challenges around incident investigations and reports. Aside from sensitive situations revolving around blame and corrections, tricky problems come up when having discussions with multiple stakeholders. The problems I’ll explore in this post, from the SRE perspective, are time pressure (when to ship the investigation) and the type of report people expect.

Investigation types

Incident investigations, reviews, and reports play multiple roles. The list below is inspired by Sidney Dekker's The Psychology of Incident Investigations (short annotated version), where he breaks down the four roles they are assigned:

  1. Moral: explain transgressions, and reinforce moral and regulatory boundaries. These tend to refer to norms (what you should or shouldn’t do), and deviations from norms and processes are often seen as contributing factors.
  2. Existential: explain the suffering that occurred. These assume that incidents are not supposed to happen, and seek ways to reassure people that they do not have to happen either.
  3. Preventative: explain how to avoid recurrence, ask for alterations. These seek explanations that can identify variables on which we can act to prevent similar incidents from happening again.
  4. Epistemological: explain what happened, causes, and effects. This approach works best when it represents multiple viewpoints to paint the richest picture possible, including even contradictions where truth can’t be found.

There’s tension here because the Moral and Existential approaches can clash with the others. They may search for transgressions or improper behavior, which can make the Preventative approach more challenging by obscuring or hindering investigation paths. People are less likely to contribute to investigations when they fear reprimand, for example.

The Epistemological approach can be in tension with the Moral and Existential types for similar reasons, but can also clash with the Preventative type. The objective of preventing recurrence may force you to put on blinders when everything worked as designed and the whole situation might have been an inevitable, or even acceptable, tradeoff in the system. Some fixes, done because you need to fix something, anything, might be ineffective, misleading, or harmful.

The approach I personally favor is always the one that centers on learning (Epistemological), with the belief that when you have good explanations, you can surface preventative approaches as well.

If you find yourself with management, users, customers, or peers looking for a Moral outcome, you should ready yourself to see them consider your reviews a failure for not properly reinforcing expectations on professionalism or ownership. Shaping these expectations becomes the groundwork for doing Epistemological or Preventative work properly, and it differs for internal and external stakeholders.

Deadlines and public relations

Customers sometimes demand quick analyses after incidents: a post-mortem, a root cause analysis (RCA), or other public report. In theory, this aligns well with incident investigations: the longer you wait, the more likely it is that participants will forget key details. Ideally, you want to start as soon as possible. A good in-depth investigation that truly tries to understand what was going on will, however, take far more than two business days: anything you promise within that window is bound to be superficial and not that useful.

Public reports have their own purposes, and distinct audiences. It is quite possible that while you want Epistemological investigations internally, public reports will be Moral by showing you’re taking the situation seriously, or Existential by acknowledging the pain customers feel.

If your public report is also expected to be a source of preventative measures or explanations for users and industry peers, then these objectives might once again clash. A report produced rapidly can fill the public relations role of apologizing and appeasing your users, but is unlikely to do a decent job for learning.

These use cases, while conflicting, are not all invalid. In fact, at Honeycomb, we’ve sometimes opted to publish multiple reports. Here are some things we’ve tried for minor incidents and serious outages:

  1. The status page, which describes all public-facing incidents that hit a significant portion of users. For minor incidents, this may be the only report written.
  2. A preliminary report, which is written within the first two to three days after a major incident. It provides a quick description of what we think happened. If the incident is particularly interesting to us or to our customers (often due to its severity), we note that a follow-up in-depth investigation will take place.
  3. An in-depth internal review (often with its own report), which may take weeks of on-and-off time to prepare and write. 
  4. An in-depth public report, which is based on our internal report. We redact names, implementation details, some bits of history, project roadmaps, social elements, and other similar content. The criterion here is, “Do we think our customers, or people elsewhere in the industry, could learn something useful from this?”
  5. A short blog post, which is a whittled-down version of the aforementioned report.

This distillation of information into multiple formats hits the mark for different stakeholders. We expect this balance to keep shifting as we grow and as our user base gets more diverse.

What we do in the shadows

We try to encourage learning from our incidents. A lot of groundwork was laid to make that possible (some of it before my time; this isn’t something that started with me). To “protect” that ability, we’ve accepted that we need to write different reports for different audiences, which, luckily, we can alter without losing our internal approach and its benefits.

Our focus on learning also has an interesting rule of thumb attached to it: we don’t review all incidents. The guideline is that we prefer a few in-depth reviews over surface coverage of all incidents. Pick and choose the incidents in which you’re going to dive deeper:

  • Choose incidents where folks are surprised, or even say out loud “I want to review this” or “this is a really weird one.” They’re strong signals that these incidents are good learning opportunities.
  • Rare occurrences of weird incidents are worth jumping on at a higher priority; common incidents are probably going to happen again and we can learn from them next time. This is a bit counter-intuitive because we tend to think in terms of clearing up the most common elements first, but aiming for a qualitative deep dive flips this idea around.
  • Large incidents with public-facing impact are generally worth reviewing. If external stakeholders want to know what happened, we should try to learn something from the incident as well.
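If it helps to see the rule of thumb written down, here is a minimal sketch of that triage in Python. It is not a tool we run: the `Incident` fields, the weights, and the `limit` of two reviews are all assumptions made up for illustration, and a real decision would stay a human judgment call.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    # Hypothetical fields for illustration only, not a real schema.
    name: str
    surprised_responders: bool  # someone said "this is a really weird one"
    seen_before: bool           # a common kind of incident that will likely recur
    public_impact: bool         # large incident with public-facing impact


def review_priority(incident: Incident) -> int:
    """Return a rough score; higher means a stronger candidate for review.

    The weights are arbitrary assumptions that only encode the ordering
    of the three guidelines above.
    """
    score = 0
    if incident.surprised_responders:
        score += 2  # surprise is a strong signal of a learning opportunity
    if not incident.seen_before:
        score += 2  # rare incidents may not offer a second chance to learn
    if incident.public_impact:
        score += 1  # external stakeholders will want an explanation anyway
    return score


def pick_for_review(incidents: list[Incident], limit: int = 2) -> list[Incident]:
    """Keep only a few incidents, favoring depth over coverage."""
    ranked = sorted(incidents, key=review_priority, reverse=True)
    return [i for i in ranked[:limit] if review_priority(i) > 0]
```

The part worth noticing is the `limit`: the function deliberately drops most incidents rather than trying to cover them all.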

But what if there are too many to choose from?

Let’s hope you’re never in this situation, but if you have too many incidents to choose from, conduct a meta-review where you consider all of those incidents to be an extended outage period:

  • How did this high-intensity period feel for your people? 
  • Are there patterns? 
  • What can you learn from these high-level patterns without necessarily digging deeper into the individual outages?

Answering these questions might help you refine how you handle incidents. 
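For the "are there patterns?" question, a first pass can be as simple as counting themes that show up in more than one incident. The sketch below assumes you have already tagged each incident by hand; the incident IDs, the tag names, and the `recurring_themes` helper are hypothetical, and counting is only a starting point for the conversation, not a replacement for it.

```python
from collections import Counter


def recurring_themes(
    incident_tags: dict[str, list[str]], min_count: int = 2
) -> list[tuple[str, int]]:
    """Return (tag, count) pairs for tags seen in at least min_count incidents.

    Each tag is counted once per incident, and results are ordered from
    most to least frequent.
    """
    counts = Counter(
        tag
        for tags in incident_tags.values()
        for tag in dict.fromkeys(tags)  # de-duplicate within one incident
    )
    return [(tag, n) for tag, n in counts.most_common() if n >= min_count]


# Hypothetical hand-tagged incidents from one high-intensity period.
period = {
    "inc-1": ["deploy", "paging overload"],
    "inc-2": ["deploy", "database"],
    "inc-3": ["deploy", "paging overload"],
}
themes = recurring_themes(period)
```

Here `themes` surfaces "deploy" and "paging overload" as shared across incidents, which is the kind of high-level pattern a meta-review can discuss without reopening each outage.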

We’re curious: what’s your current approach like? If you had a magic wand and you could fix one thing immediately, what would it be? Where would you find the most impact? Join the conversation in Pollinators, our Slack community.

 
