How Observability Helps With Incident Response

By Rox Williams | Last modified on 2023.03.01

No one knows what they don’t know.

That rather elusive statement is at the heart of one of the biggest challenges in software development today: how teams respond when something breaks. “Incident response” is the term for how companies handle a problem, but while it’s critical to get it right, incident response on its own is not enough in the complex, distributed world of modern software development.

When something breaks, teams may literally have no idea where to look or how to make sense of the data; an observability solution surfaces everything, making it easier to see, and thus resolve, incidents. Today’s DevOps teams work on complex systems and need that extra problem-solving power to resolve incidents, and that is exactly what observability provides.

But observability’s true superpower is speed, and that’s why it’s so critical to successful incident response. The longer it takes to resolve an incident, the more costly it is; observability brings the data, removes the blind guesswork, and drastically minimizes the time it takes to find and fix the issue.

What is observability?

Observability is a term that is broadly used in the software development space, and it’s sometimes conflated with monitoring. In reality, though, they’re very different, both in the nitty-gritty details and in philosophy. Observability is more like a choose-your-own-adventure story: the plot unfolds, there are twists and turns, and teams can ask questions and explore every option to find what some call the “unknown-unknowns.” Monitoring, by contrast, uses alerts to give a team snapshots of things it already knew to watch for, the so-called “known-unknowns.”

To put it another way, monitoring is a specific practice of keeping tabs on known parts of the system, while observability is the richer ability to understand a system’s internal state from the signals it emits. In an ideal world, observability is baked in during development, giving organizations the power to diagnose what is happening inside the system from the outside.

It’s possible to have incident response without observability, of course, but we don’t recommend it. Here’s a detailed look at modern incident response strategies and best practices, and how observability fits in.

What is incident response?

Whether it’s a security breach or a cloud outage (or anything in between), problems that require diagnostics and resolution are bound to happen, which is why virtually every organization practices some form of incident response. In small companies, an incident can mean all hands on deck to resolve, but in medium and large companies, there are often incident response teams (with detailed playbooks) who are called in to deal with the situation.

No matter the size of the organization, incident response efforts share some universal characteristics:

  • Incident response may require a team of people stretching from the c-suite to engineering and incorporating public relations, legal, customer advocates, and more.
  • The incident response plan is codified and, in many cases, contains very detailed steps and processes. Incident response is often “practiced” ahead of time, so teams feel comfortable with the steps and the level of collaboration.
  • Organizations may have important metrics to track, including those tied to service level agreements (SLAs), service level objectives (SLOs), or mean time to resolution (MTTR); a sketch of how such metrics can be computed follows this list. Compliance requirements can also come into play during incident response.
  • Incident response requires lots of data analysis to discover and take corrective action. Teams benefit tremendously from observability solutions (to find problems more quickly) and retrospectives of individual incidents (to promote understanding of the process and any pitfalls).
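
To make the metrics piece concrete, here is a minimal sketch, in Python, of how a team might compute MTTR and an SLO error budget from its incident records. The incident timestamps, the 99.9% target, and the 30-day window are all hypothetical and only illustrate the arithmetic, not any particular tool.

```python
from datetime import datetime, timedelta

# Hypothetical incident records: when each incident was detected and resolved.
incidents = [
    {"detected": datetime(2023, 1, 3, 9, 15), "resolved": datetime(2023, 1, 3, 10, 5)},
    {"detected": datetime(2023, 1, 17, 22, 40), "resolved": datetime(2023, 1, 18, 0, 10)},
]

# Mean time to resolution (MTTR): the average of (resolved - detected).
durations = [i["resolved"] - i["detected"] for i in incidents]
mttr = sum(durations, timedelta()) / len(durations)
print(f"MTTR: {mttr}")

# A 99.9% availability SLO over a 30-day window leaves roughly 43 minutes of error budget.
slo_target = 0.999
window = timedelta(days=30)
error_budget = window * (1 - slo_target)
downtime = sum(durations, timedelta())
print(f"Error budget: {error_budget}, consumed so far: {downtime}")
```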

What is an incident response plan?

Incident response is only as good as an incident response plan because trouble isn’t very discriminating. Malware, a tornado, or even ever-changing compliance regulations can throw a company into chaos. The solution is a well-thought-out incident response plan.

In most cases, the plan is literally documentation that details, step by step, how an organization should respond. The National Institute of Standards and Technology (NIST) suggests a thorough incident response plan will cover four areas: preparation; detection and analysis of problems; containment, eradication, and recovery; and post-incident activity once everything is over.

To drill down further, experts say a solid incident response plan should begin with possible scenarios and the steps to resolution. It should also include roles and responsibilities, methods of communication, key metrics, fallback contact information, a plan to regularly review vulnerable areas like backups, tokens, and asset management rules, and, of course, a process to track and monitor all incidents. In regulated industries, an incident response plan will likely also include compliance-related requirements, SLOs, and anything else that must be kept top of mind during an event.
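
As one illustration of what “codified” can look like, here is a minimal sketch of an incident response plan captured as structured data in Python. Every class, field, role, and contact below is hypothetical and exists only to show the shape such a document might take; real plans are usually richer and often live in runbook tooling rather than code.

```python
from dataclasses import dataclass, field

@dataclass
class ResponseStep:
    description: str
    owner_role: str  # e.g., "incident commander" or "on-call SRE"

@dataclass
class IncidentResponsePlan:
    scenario: str  # e.g., "primary database outage"
    steps: list[ResponseStep] = field(default_factory=list)
    communication_channels: list[str] = field(default_factory=list)
    key_metrics: list[str] = field(default_factory=list)
    fallback_contacts: dict[str, str] = field(default_factory=dict)

# A toy plan for a single scenario, with placeholder values throughout.
plan = IncidentResponsePlan(
    scenario="primary database outage",
    steps=[
        ResponseStep("Declare the incident and page the on-call engineer", "incident commander"),
        ResponseStep("Fail over to the replica database", "on-call SRE"),
    ],
    communication_channels=["#incident-bridge", "status page"],
    key_metrics=["MTTR", "SLO error budget remaining"],
    fallback_contacts={"incident commander": "+1-555-0100"},
)
```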

It’s important that an incident response plan be familiar to all potential participants, as the last thing a team needs is for participants to be unprepared. Also, an incident response plan should be a living document that’s continually updated to reflect the current state of the organization.

Finally, an incident response plan must be crafted to account for both known threats and the ones no one has imagined yet. Getting creative with disaster planning is encouraged.

Why incident response planning is important

In the fast-paced world of modern software development, problems will crop up, and responses must be swift and accurate.

Incident response plans benefit companies in the following concrete ways:

  1. Control the spread: an incident response plan can help teams stop a small problem from becoming a bigger one.
  2. Keep a reputation: data from market research firm IDC found that 80% of customers are willing to switch to a competitor when faced with a data breach. A thorough incident response plan helps ensure effective public relations and customer communication (and, of course, a swifter resolution to any issues).
  3. Save some money: downtime—no matter the reason—is costly, so the more quickly a problem is observed and remediated, the less expensive it will be.
  4. Fortify data security: a comprehensive incident response plan will include regular “maintenance” updates of key parts of the system (like backups, tokens, and asset and ID management), all of which help keep an organization’s data safer.
  5. Stay compliant: for companies in regulated industries, downtime can sometimes result in fines. An incident response plan will not only speed resolution, but will contain key compliance metrics all in one place and make it easy to loop in an expert as needed.
  6. Lighten the DevOps load: creating safe software quickly is a stressful job, and incidents can take it over the top. An incident response plan helps ensure the right steps are taken by the right people at the right time, easing the burden on the DevOps team.

Who handles incident response?

To take a bit of a contrarian view, incident response really should be the responsibility of everyone involved in building the product or service. Without that mindset of ownership, teams are unlikely to keep incident response top of mind, which means they won’t build in the magic wand of observability from the beginning.

Philosophies aside, most organizations have dedicated incident response teams made up of a wide variety of job titles, including:

  • Support from the C-suite
  • The relevant product/dev teams
  • Members of the ops, DevOps, or site reliability engineering teams who own the infrastructure
  • Support team members
  • On-call incident response members
  • One or more representatives from legal, compliance, public relations, and customer success/communications, as needed

All that said, we do stick with our belief that if you build it, you own it—especially when it comes to incident response.

How to build an incident response team

The best incident response teams start from the bottom up as well as the top down. Although it sounds paradoxical, it’s not: top management is required to highlight the importance of the team and its role, recruit key members, and make funding available in certain cases—but development teams who build the products must also take ownership both through their coding efforts and their participation.

Generally speaking, the process begins by naming an “incident commander” who will oversee the team and be the directly responsible individual when an incident occurs. In most cases, the incident commander will be a highly technical individual.

After that, an organization needs to bring in people who can diagnose, find, and fix the problems—and that generally means those who’ve built the product, are responsible for the service, and support the infrastructure. On-call service and support staff should also be added.

Depending on the organization, it’s vital to cast the net broadly: some will want experts on threat assessments, compliance officers, lawyers, human resource professionals, public relations team members, customer advocates, etc. Every team will look and operate differently based on industry and company requirements.

Incident commander responsibilities

The incident commander plays a pivotal role during a crisis, of course, but before and after as well. The role requires planning, coordination, a thorough understanding of all systems, processes, and compliance requirements, top-notch communication skills, and the ability to lead a cross-functional group of people.

An incident response team should expect the following from a commander:

  • Preparation – From setting up communication channels to training teams, the incident commander is responsible for creating the structure and all the steps.
  • Commander in chief – The incident commander owns the process from start to finish, but needs to also be able to delegate, take the bird’s eye view as needed, and pull in other teams when required.
  • Planner in chief – The best incident commanders have thought of everything: backups, escalation, extra resources, panicked team members, history, and industry best practices. And when all is said and done, the incident commander should lead the retrospective so the next crisis can be resolved more quickly and painlessly.

Incident response best practices

If you are getting started or looking for guidance, there are plenty of well-accepted practices in the industry that can act as a solid foundation to build on. Every organization should have an incident response plan that’s tailored to its needs, but there are key elements they often share, including:

  1. Engage with peers and trade ideas on what worked, what didn’t, and what might make sense to experiment with in a particular industry.
  2. Make incident response a key pillar within the company’s KPIs or objectives. This not only keeps everyone thinking about the importance of incident response, but can create interest and enthusiasm for serving on the team.
  3. Be ruthless about distractions. In a crisis, time and focus are two of the biggest challenges. Plan ahead to ensure the incident response team has streamlined communication channels, avoids pager fatigue, and can truly be heads down on problem-solving rather than calming down upper management or worried colleagues.
  4. Lean into incident reviews. Fact-finding will serve as an invaluable record and learning resource.
  5. Evangelize to the dev team (and other teams who build/create). Software that is truly observable means incident response is faster, cheaper, and frankly less stressful and aggravating for everyone! It makes sense for dev and ops to work together to build observability into the software (see the instrumentation sketch after this list). Observability-driven development is coming to a DevOps/SRE team near you, so it won’t hurt to talk it up.
  6. Have a public relations communications plan ready to roll. The faster customers know what’s going on, the less likely they are to become frustrated or bolt.
  7. Practice, practice, practice. Drill the team, but also think outside the box by trying out chaos engineering, which promotes the idea of breaking things to better learn how they respond in the real world.
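
To illustrate point 5, here is a minimal sketch of what “building observability in” can look like, using the OpenTelemetry Python API to wrap a request handler in a trace span with rich attributes. The service name, attribute names, and `charge_card` helper are hypothetical, and configuring a tracer provider and exporter (for Honeycomb or anywhere else) is omitted; this is a sketch assuming a recent version of the SDK, not a prescribed setup.

```python
from opentelemetry import trace

# Assumes a TracerProvider and exporter have been configured elsewhere;
# without one, the default no-op tracer is used and the code still runs.
tracer = trace.get_tracer("checkout-service")  # hypothetical service name

def charge_card(cart_id: str) -> None:
    """Placeholder for real payment logic (hypothetical)."""

def handle_checkout(cart_id: str, user_id: str) -> None:
    # Each request becomes a span carrying the context an on-call engineer
    # will want during an incident: who, what, and how long it took.
    with tracer.start_as_current_span("handle_checkout") as span:
        span.set_attribute("cart.id", cart_id)
        span.set_attribute("user.id", user_id)
        try:
            charge_card(cart_id)
        except Exception as exc:
            span.record_exception(exc)
            span.set_status(trace.StatusCode.ERROR)
            raise
```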

Do know, however, that all plans have their limits. Once you’ve reached a certain level of maturity and your incident response is well exercised, you will encounter new challenges that demand you push boundaries and innovate. Improving on best practices requires experimentation and a willingness to bend the rules to see whether a new approach works better than what you were doing.

The good news is that having access to information about your system is useful no matter how well established your practices are, or how far ahead of the pack you want to be. Observability mechanisms, such as those we provide, are useful for the entire path forward.

Incident response + observability = success

Incidents happen all the time and are costly in monetary terms but also in reputation, time spent, and the impact on employee burnout. Every organization needs a well-thought-out incident response plan and a carefully chosen team to implement it. But companies need to take one further step to ensure incident response is everything it could be: an observability solution will bring sanity to the chaos of finding and fixing problems, and it will do so with speed. In the end, teams don’t just need to “fix” an incident—they need a fast fix, which is what observability provides. And as a bonus, observability and incident response will keep customers and developers happier. A win/win for certain.

The Honeycomb difference

Other monitoring and debugging tools rely on engineers being able to guess which attributes, metrics, or behaviors will impact their users’ experience based on historical trends. They often rely on opaque dashboards that do a poor job of revealing their systems’ true state. In practice, these dashboards are often dead ends for engineers who are alerted to an issue; they provide a thousand-foot view of a predefined set of metrics, but don’t support responsive querying and organic exploration. This means that engineering teams often discover issues only after customers report them.

Honeycomb’s approach is fundamentally different from other tools that claim observability, and is built to help teams answer novel questions about their ever-evolving cloud applications. That’s because Honeycomb unifies all data sources (logs, metrics and traces) into a single type, backed by a powerful query engine built for highly contextual telemetry data.

This single source of data enables engineers to investigate from one UI to get definitive answers, regardless of data type. Every dashboard in Honeycomb is interactive, enabling teams to investigate iteratively with full visibility over their systems. Ask any question on the fly, slicing and dicing by any dimension—and, with BubbleUp, quickly spot outliers that point to the cause of hidden problems. The result: you can resolve incidents faster when they happen, and focus on high-value work.
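
As a rough sketch of how those dimensions get into Honeycomb in the first place, the snippet below sends a single wide, structured event using the libhoney Python library. The write key, dataset, and field names are placeholders, and many teams send the same kind of context through OpenTelemetry instead; treat this as an illustration of the wide-event idea rather than a recommended setup.

```python
import libhoney

# Placeholder credentials; a real write key and dataset name are required.
libhoney.init(writekey="YOUR_API_KEY", dataset="checkout-service")

event = libhoney.new_event()
# One wide event per unit of work, carrying every dimension you might later
# want to slice by, or feed into BubbleUp, during an incident.
event.add({
    "request.path": "/checkout",
    "duration_ms": 231,
    "user.id": "u-123",
    "build.id": "2023-02-28.4",
    "error": False,
})
event.send()
libhoney.close()  # flush any pending events before the process exits
```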

Curious how we do it?

Get an insider look at how Honeycomb manages incident response.

Additional resources

  • Blog: The Incident Retrospective Ground Rules
  • Blog: Touching Grass With SLOs
  • Case Study: Honeycomb at Tapjoy: Faster Time to Confidence with Observability
  • Case Study: LaunchDarkly Guesses Less, Knows More With Next-Gen APM