Framework for an Observability Maturity Model

Introduction and goals

We are professionals in systems engineering and observability, each of us having devoted the past 15 years to crafting successful, sustainable systems. While we have the fortune today of working full-time on observability together, these lessons are drawn from our time working with Honeycomb customers, from the teams we were on before joining Honeycomb, and from the larger observability community.

The goals of observability

We developed this model based on the following engineering organization goals:

  • Sustainable systems and engineer happiness
    This goal may seem aspirational to some, but the reality is that engineer happiness and the sustainability of systems are closely entwined. Systems that are observable are easier to own and maintain, which means it’s easier to be an engineer who owns said systems. In turn, happier engineers mean lower turnover and less time and money spent ramping up new engineers.

  • Meeting business needs and customer happiness
    Ultimately, observability is about operating your business successfully. Having the visibility into your systems that observability offers means your organization can better understand what your customer base wants as well as the most efficient way to deliver it, in terms of performance, stability, and functionality.

The goals of this model

Everyone is talking about “observability”, but many don’t know what it is, what it’s for, or what benefits it offers. With this framing of observability in terms of goals instead of tools, we hope teams will have better language for improving what their organization delivers and how they deliver it.

For more context on observability, review our e-guide “Achieving Observability.”

The framework we describe here is a starting point. With it, we aim to give organizations the structure and tools to begin asking questions of themselves, and the context to interpret and describe their own situation–both where they are now, and where they could be.

The future of this model includes everyone’s input

Observability is evolving as a discipline, so the endpoint of “the very best o11y” will always be shifting. We welcome feedback and input. Our observations are guided by our experience and intuition, and are not yet quantitative or statistically representative in the way that the Accelerate State of DevOps surveys are. As more people review this model and give us feedback, we’ll evolve the maturity model. After all, a good practitioner of observability should always be open to understanding how new data affects their original model and hypothesis.

The Model

The following is a list of capabilities that are directly impacted by the quality of your observability practice. It’s not an exhaustive list, but is intended to represent the breadth of potential areas of the business. For each of these capabilities, we’ve provided its definition, some examples of what your world looks like when you’re doing that thing well, and some examples of what it looks like when you’re not doing it well. Lastly, we’ve included some thoughts on how that capability fundamentally requires observability–how improving your level of observability can help your organization achieve its business objectives.

The quality of one’s observability practice depends upon both technical and social factors. Observability is not a property of the computer system alone or the people alone. Too often, discussions of observability are focused only on the technicalities of instrumentation, storage, and querying, and not upon how a system is used in practice.

If teams feel uncomfortable or unsafe applying their tooling to solve problems, then they won’t be able to achieve results. Tooling quality depends upon factors such as whether it’s easy enough to add instrumentation, whether it can ingest the data in sufficient granularity, and whether it can answer the questions humans pose. The same tooling need not be used to address each capability, nor does strength of tooling for one capability necessarily translate to success with all the suggested capabilities.
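As a concrete illustration of the tooling side, here is a minimal sketch of what “easy enough to add instrumentation” can look like, using the OpenTelemetry Python API. The service, span, and attribute names are our own illustrative assumptions, not anything prescribed by this model.

    from opentelemetry import trace

    tracer = trace.get_tracer("checkout-service")  # hypothetical service name

    def handle_checkout(customer_id: str, items: list, build_id: str) -> float:
        with tracer.start_as_current_span("handle_checkout") as span:
            # Wide, context-rich attributes preserve the granularity needed to
            # answer later questions, rather than pre-aggregating into counters.
            span.set_attribute("customer.id", customer_id)
            span.set_attribute("cart.item_count", len(items))
            span.set_attribute("build.id", build_id)
            total = sum(price for _, price in items)
            span.set_attribute("cart.total", total)
            return total

Whether such a span can then be stored at full granularity and queried flexibly is the other half of tooling quality.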

If you’re familiar with the concept of production excellence, you’ll notice a lot of overlap in both this list of relevant capabilities and in their business outcomes.

There is no one right order or prescriptive way of doing these things.
Instead, you face an array of potential journeys. Focus at each step on what you’re hoping to achieve. Make sure you will get appropriate business impact from making progress in that area right now, as opposed to doing it later. And you’re never “done” with a capability unless it becomes a default, systematically supported part of your culture. We (hopefully) wouldn’t think of checking in code without tests, so let’s make o11y something we live and breathe.

Respond to system failure with resilience

Definition

Resilience is the adaptive capacity of a team, together with the system it supports, that enables it to restore service and minimize impact to users. Resilience doesn’t refer only to the capabilities of an isolated operations team or to the amount of robustness and fault tolerance built into the software. Therefore, we need to measure both the technical outcomes and the people outcomes of your emergency response process in order to measure its maturity.

To measure technical outcomes, we might ask: if your system experiences a failure, how long does it take to restore service, and how many people have to get involved? For example, the 2018 Accelerate State of DevOps Report defines Elite performers as those whose average MTTR is less than one hour, and Low performers as those whose average MTTR is between one week and one month.

Emergency response is a necessary part of running a scalable, reliable service, but emergency response may have different meanings to different teams. One team might consider satisfactory emergency response to mean “power cycle the box”, while another might understand it to mean “understand exactly how the automation to restore redundancy in data striped across disks broke, and mitigate it.” There are three distinct goals to consider: how long does it take to detect issues, how long does it take to initially mitigate them, and how long does it take to fully understand what happened and decide what to do next?
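As a rough sketch of how those three durations can be tracked separately rather than collapsed into a single MTTR number, the following assumes each incident record carries a few timestamps; the field names and values are hypothetical, not a prescribed schema.

    from datetime import datetime
    from statistics import mean

    incidents = [
        {
            "started":    datetime(2019, 3, 1, 9, 0),
            "detected":   datetime(2019, 3, 1, 9, 7),   # alert fired or a human noticed
            "mitigated":  datetime(2019, 3, 1, 9, 40),  # user impact stopped
            "understood": datetime(2019, 3, 4, 15, 0),  # review done, next steps decided
        },
        # ... more incident records
    ]

    def hours(delta):
        return delta.total_seconds() / 3600

    time_to_detect   = mean(hours(i["detected"] - i["started"]) for i in incidents)
    time_to_mitigate = mean(hours(i["mitigated"] - i["detected"]) for i in incidents)
    time_to_learn    = mean(hours(i["understood"] - i["mitigated"]) for i in incidents)

    print(f"detect: {time_to_detect:.1f}h, mitigate: {time_to_mitigate:.1f}h, "
          f"understand: {time_to_learn:.1f}h")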

But the more important dimension for a team’s managers is the set of people operating the service. Is oncall sustainable for your team, so that staff remain attentive, engaged, and retained? Is there a systematic plan to educate and involve everyone in production in an orderly, safe way, or is it all hands on deck in an emergency, no matter the experience level? If your product requires many different people to be oncall or doing break-fix work, that’s time and energy not spent generating value. And over time, assigning too much break-fix work will impair your team’s morale.

If you’re doing well:

  • System uptime meets your business goals, and is improving.
  • Oncall response to alerts is efficient, and alerts are not ignored.
  • Oncall is not excessively stressful, and people volunteer to take each others’ shifts.
  • Staff turnover is low, and people don’t leave due to ‘burnout’.

If you’re doing poorly:

  • The organization is spending a lot of money staffing oncall rotations.
  • Outages are frequent.
  • Those on call get spurious alerts and suffer from alert fatigue, or don’t learn about failures.
  • Troubleshooters cannot easily diagnose issues.
  • It takes your team a long time to repair issues.
  • Some critical members get pulled into emergencies over and over.

How observability is related

Skills are distributed across the team so all members can handle issues as they come up.

Context-rich events make it possible for alerts to be relevant, focused, and actionable, taking much of the stress and drudgery out of oncall rotations. Similarly, the ability to drill into high-cardinality data with its accompanying context supports fast resolution of issues.
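As an illustration of what “drilling into” high-cardinality data can mean in practice, the sketch below groups error events by a per-customer field. The event shape and field names are assumed for the example, not prescribed by any particular tool.

    from collections import Counter

    # One wide, context-rich event per request (illustrative fields only).
    events = [
        {"endpoint": "/export", "status": 500, "customer.id": "cust-4812", "duration_ms": 9120},
        {"endpoint": "/export", "status": 200, "customer.id": "cust-0031", "duration_ms": 180},
        {"endpoint": "/export", "status": 500, "customer.id": "cust-4812", "duration_ms": 8770},
    ]

    # Group errors by a high-cardinality field to see whether an alert reflects
    # a broad outage or a problem confined to a single customer.
    errors_by_customer = Counter(
        e["customer.id"] for e in events if e["status"] >= 500
    )
    print(errors_by_customer.most_common(5))  # e.g. [('cust-4812', 2)]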

Deliver high quality code

Definition

High quality code is code that is well-understood, well-maintained, and (obviously) low in bugs. Understanding of code is typically driven by the level and quality of instrumentation. Code that is of high quality can be reliably reused or reapplied in different scenarios. It’s well-structured, and can be added to easily.

If you’re doing well:

  • Code is stable, with fewer bugs and outages.
  • The emphasis post-deployment is on customer solutions rather than support.
  • Engineers find it intuitive to debug problems at any stage, from writing code to full release at scale.
  • Issues that come up can be fixed without triggering cascading failures.

If you’re doing poorly:

  • Customer support costs are high.
  • A high percentage of engineering time is spent fixing bugs versus working on new functionality.
  • People are often concerned about deploying new modules because of increased risk.
  • It takes a long time to find an issue, construct a repro, and repair it.
  • Devs have low confidence in their code once shipped.

How observability is related

Well-monitored and tracked code makes it easy to see when and how a process is failing, and easy to identify and fix vulnerable spots. High quality observability allows using the same tooling to debug code on one machine as on 10,000. A high level of relevant, context-rich telemetry means engineers can watch code in action during deploys.
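One hedged sketch of what “watching code in action during deploys” can look like: if every event records the build that served it (field names assumed for illustration), error rates can be compared across the old and new builds while a rollout is still in progress.

    from collections import defaultdict

    # Illustrative events tagged with the build that served each request.
    events = [
        {"build.id": "build-341", "status": 200},
        {"build.id": "build-341", "status": 200},
        {"build.id": "build-342", "status": 500},
        {"build.id": "build-342", "status": 200},
    ]

    counts = defaultdict(lambda: {"total": 0, "errors": 0})
    for e in events:
        bucket = counts[e["build.id"]]
        bucket["total"] += 1
        bucket["errors"] += e["status"] >= 500

    for build, c in counts.items():
        print(f"{build}: {c['errors'] / c['total']:.0%} errors over {c['total']} requests")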

 
