Incident Management Steps and Best Practices

By Valerie Silverthorne  |   Last modified on September 27, 2023

According to the Uptime Institute’s 2022 Outage Analysis report, one out of every five companies has experienced a “serious” or “severe” incident over the past three years, and that share is increasing. Those incidents are expensive: over 60% cost more than $100,000, while 15% set their companies back close to $1 million. To put this in perspective, in 2019 only 39% of incidents cost more than $100,000, so the trend lines aren’t moving in the right direction.

A well-thought-out incident management plan that’s created and practiced before it’s needed can lessen these risks. Here’s everything you need to understand to get the most out of incident management.

What is incident management?

Put simply, incident management is the way an organization reacts to any kind of outage, whether the cause is a security breach, broken code, severe weather, or anything else that disrupts customer service. Incidents are inherently fraught, not just because they’re time-consuming and costly, but because they can poison the well with customers, investors, and even partners.

Incident management requires companies to think through even unlikely scenarios and create plans for rapid discovery and resolution, as well as a robust (but nuanced) communications plan.

A solid incident management plan should keep the following factors in mind:

  • Engineering will usually be key to finding and fixing an incident, but they’re not the only group who will be involved. Plan to include many stakeholders, from the C-suite to lawyers, public relations, partner marketing, and more.
  • Write the plans down and actively practice them.
  • Don’t forget about compliance requirements and state and federal laws.
  • Metrics can be invaluable in detecting incidents, so plan to incorporate service level agreements and service level objectives at a minimum.
  • Time is of the essence when it comes to incident management, so the more a company can build observability into the development process, the faster incidents can be found and resolved. 

Incident management is not a “nice to have” but a “must have” for organizations of any size. An upfront investment in a comprehensive incident management plan has concrete benefits:

  • For starters, an incident management plan will make it easier to handle a small problem before it spirals out of control. 
  • At the heart of any incident management effort is communication, which can make the difference between keeping customers and losing them, and which also keeps other stakeholders up to date.

Incident Management Lifecycle 

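A thoughtful incident management response plan doesn’t require organizations to reinvent the wheel. The National Institute of Standards and Technology (NIST) describes a four-phase incident response life cycle in its Computer Security Incident Handling Guide (SP 800-61 Rev. 2): preparation; detection and analysis; containment, eradication, and recovery; and post-incident activity. It is suitable for companies of all sizes. Although it was created with cybersecurity in mind, the basic steps are a good starting point for incidents of any type.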

Incident Management Process & Best Practices

Incident prevention

Even with all the prevention in the world, incidents will happen. But the more preparation teams undertake in advance, the better the outcome.

Start by building a culture of observability and establishing observability-driven development principles. Observability practices, from distributed tracing to establishing service level objectives and service level indicators, can actually allow teams to find problems before customers do, which is close enough to “incident prevention” to count.
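As a rough illustration of how a service level objective can surface a problem before customers report it, here is a minimal sketch of an error-budget burn-rate check. The `SloWindow` shape, the function names, and the alert threshold of 10 are all assumptions made for this example, not part of any particular platform’s API.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """Request counts for one evaluation window (hypothetical shape)."""
    total_requests: int
    failed_requests: int


def burn_rate(window: SloWindow, slo_target: float) -> float:
    """Return how fast the error budget is being consumed.

    A value of 1.0 means failures arrive exactly as fast as the SLO allows;
    anything above 1.0 means the budget will run out early.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if window.total_requests == 0:
        return 0.0  # no traffic, so no budget was spent
    error_budget = 1.0 - slo_target  # allowed failure ratio
    observed = window.failed_requests / window.total_requests
    return observed / error_budget


def should_alert(window: SloWindow, slo_target: float,
                 threshold: float = 10.0) -> bool:
    """Flag the window when the burn rate reaches the (assumed) threshold."""
    return burn_rate(window, slo_target) >= threshold


# Example: a 99.9% availability target over a window of 10,000 requests.
quiet = SloWindow(total_requests=10_000, failed_requests=50)
noisy = SloWindow(total_requests=10_000, failed_requests=200)
print(should_alert(quiet, 0.999), should_alert(noisy, 0.999))
```

In practice the threshold and window length are tuning decisions; a shorter window with a higher threshold catches fast-moving failures, while a longer window with a lower threshold catches slow degradation.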

Incident identification

The other observability superpower is incident identification. Code that’s been instrumented with distributed tracing emits telemetry that can be sent to an observability platform like Honeycomb for near-instant data analysis and anomaly detection. Speed is everything when an incident is happening, so the more quickly a team can pinpoint the exact cause of a problem, the more quickly it can be fixed. Also, truly observable code provides context around the data, which means anyone on a team can step into the role of troubleshooter.

Incident communication

Having clear, dedicated communication channels—not to mention an up-to-date list of people/roles necessary to include—is perhaps the best antidote to incident management chaos and confusion. No one should be surprised by an ongoing incident, but no one should experience pager fatigue either. Organizations need to find the right balance to create the most effective communication possible.

Incident reporting

Incident reporting and communication are closely related, but there can be significant differences in the “need to know” timing. Those involved with finding and resolving the incident should be looped in immediately, while those who have to deal with potential fallout (customer success, legal, public relations, etc.) form the second tier when it comes to incident reporting. This is another concrete example of why it’s so critical to have an incident management plan.
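One way to make the two reporting tiers concrete is to write them down as data in the plan itself. The sketch below is only an illustration: the role names in the first tier and the `customer_impact_confirmed` trigger for the second tier are assumptions, and a real plan would choose its own roles and triggers.

```python
from enum import Enum


class Tier(Enum):
    RESPONDERS = 1  # people finding and resolving the incident
    FALLOUT = 2     # people handling potential fallout


# Hypothetical role lists; replace with the roles in your own plan.
NOTIFY_PLAN = {
    Tier.RESPONDERS: ["on-call engineer", "incident lead"],
    Tier.FALLOUT: ["customer success", "legal", "public relations"],
}


def recipients(customer_impact_confirmed: bool) -> list[str]:
    """Return the roles to notify right now.

    Responders are always notified. The fallout tier is added once
    customer impact is confirmed (an assumed trigger for this sketch).
    """
    roles = list(NOTIFY_PLAN[Tier.RESPONDERS])
    if customer_impact_confirmed:
        roles += NOTIFY_PLAN[Tier.FALLOUT]
    return roles


print(recipients(customer_impact_confirmed=False))
print(recipients(customer_impact_confirmed=True))
```

Keeping the mapping in one place makes it easy to review during retrospectives and to keep the list of people and roles up to date.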

Incident retrospective

The best way to know if an incident management plan is working is through an incident retrospective. The entire team needs to have a detailed discussion of what went well, what failed, and what might be done differently next time. Take those findings and bake them into the incident management plan. But it’s equally important to be realistic. For many organizations, lack of time or other resources may make it impossible to “retro” everything. If that’s the case, be sure to establish guidelines around what incidents should take priority.
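Prioritization guidelines can be as simple as a written scoring rule. The following sketch ranks incidents for retrospectives under limited capacity; the criteria (customer-facing, lasting an hour or more, not seen before) and their weights are purely hypothetical and should be replaced with whatever your team agrees matters.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    name: str
    customer_facing: bool
    duration_minutes: int
    seen_before: bool


def retro_priority(incident: Incident) -> int:
    """Score an incident using assumed, illustrative weights."""
    score = 0
    if incident.customer_facing:
        score += 3  # customers noticed
    if incident.duration_minutes >= 60:
        score += 2  # long-running
    if not incident.seen_before:
        score += 1  # novel failure mode, likely more to learn
    return score


def pick_for_retro(incidents: list[Incident], capacity: int) -> list[Incident]:
    """Return the highest-scoring incidents, up to the team's capacity."""
    ranked = sorted(incidents, key=retro_priority, reverse=True)
    return ranked[:capacity]


backlog = [
    Incident("checkout errors", customer_facing=True,
             duration_minutes=90, seen_before=False),
    Incident("flaky internal job", customer_facing=False,
             duration_minutes=10, seen_before=True),
    Incident("slow batch export", customer_facing=False,
             duration_minutes=120, seen_before=True),
]
for incident in pick_for_retro(backlog, capacity=2):
    print(incident.name)
```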

Practice

Even with the best incident management plan in place, incidents can be stressful, and that stress can make remembering the details of the plan difficult. You want your incident responders to execute it automatically, and a great way to make that more likely is by practicing it in advance. You can schedule mock incidents, called Game Days, in which a team responds to a fictional incident using your plan. Not only will this familiarize them with the plan, but it will also help you find rough spots and smooth them out before the plan is needed in a real incident.

Incident management: can you depend on tools? 

Let’s be clear: there is no silver bullet tool for incident management. In fact, it’s actually the opposite: incident management is a tricky mix of observability and communication tools, best practices, and a thoughtful plan that’s rehearsed regularly. 

Teams hoping to take incident management to the next level must be sure they can find and fix incidents quickly (that’s where an observability platform like Honeycomb comes in) and have a way to communicate the outage, its resolution, and any possible fallout.

Make sure tools are regularly reevaluated as part of the incident management plan, but don’t rely on them alone. We recommend looking into statuspage.io, ServiceNow, or other ticketing systems like Jira, and choosing tools that will work for you in the long term.

At Honeycomb, we use PagerDuty to alert on-call engineers and Jeli to streamline the incident process. We recently did a webinar with Jeli, and it’s worth a watch if you’re learning about the incident process.

Conclusion

Incident management is a fact of modern software development life. Even so, we like to see the benefit of incidents: they are learning opportunities, and as you exercise your incident response plan, they can become less stressful. Refine your plan, embrace a culture of observability, and put it all into practice and under review as needed, and you’ll be able to regain some control in the face of unpredictability. There’s never a downside to being prepared.

Go deeper:

Here’s how we manage incident response at Honeycomb

Get more out of incident retrospectives

Understand what incidents have to teach us

 
