Why Intuitive Troubleshooting Has Stopped Working for YouBy Pete Hodgson | Last modified on February 7, 2022
It’s harder to understand and operate production systems in 2021 than it was in 2001. Why is that? Shouldn’t we have gotten better at this in the past two decades?
There are valid reasons why it’s harder: The architecture of our systems has gotten a lot more sophisticated and complex over the past 20 years. We’re not running monoliths on a few beefy servers these days.
We’re operating distributed microservice ecosystems on top of a deep stack of frameworks, abstractions and runtimes that are all running on other people’s servers (aka “the cloud”). The days of naming servers after Greek gods and sshing into a box to run
top are long gone for most of us.
The move to these modern architectures is for good reason. Engineering organizations are under constant pressure to deliver more value, in less time, at a faster and faster pace.
Towering monoliths and artisanally handcrafted server configurations simply can’t compete with the scalability and flexibility of small, independently deployable services, managed by a multitude of teams, and running on top of elastically scaling infrastructure.
However, this shift has come at a cost. Our systems moved from the realm of the complicated into the realm of the complex; and with that shift, we have discovered that traditional approaches to understanding and troubleshooting production environments simply will not work in this new world.
From Complicated to Complex
With complicated and complex, I’m using specific terminology from the Cynefin model. Cynefin (pronounced kuh-NEV-in) is a well-regarded system management framework that categorizes different types of systems in terms of how understandable they are.
It also lays out how best to operate within those different categories — what works in one context won’t work as well in another — and it turns out that these operating models are extremely relevant to engineers operating today’s production software.
Broadly, Cynefin describes four categories of system: obvious, complicated, complex, and chaotic. From the naming, you can probably guess that this categorization ranges from systems that are more predictable and understandable, to those that are less — where predictability is defined by how clear the relationship is between cause and effect.
Obvious systems are the most predictable; the relationship between cause and effect is clear to anyone looking at the system. Complicated systems have a cause-and-effect relationship that is well understood, but only to those with system expertise. Complex systems have cause-and-effect relationships that are not intuitive at all, even to experts, and can only be understood by experimentation. Chaotic systems seem to have no discernable cause/effect relationship at all.
When we apply the Cynefin categorization to software architecture, we see that the more traditional monolithic systems tend to fall into the complicated category.
While the reasons behind an increase in request latency or error rates may not be obvious to a newcomer, someone who has operated the system for a while tends to know where to look when they see these effects cropping up in a production system. They can then use this expertise to reason their way through to an understanding of cause and effect.
In contrast, a modern distributed system is complex; even an experienced operator has only limited intuition as to what might be causing a production issue, at least initially. Engineers operating these systems have a tendency to compare incidents to a murder mystery or a medical drama. They puzzle through various clues in order to understand underlying causes.
Significant portions of an incident are spent trying to understand cause and effect in the system. This cycle should be familiar to many of us today, and we shouldn’t feel bad about it. It’s an unavoidable outcome of modern system complexity.
In the worst case, some distributed systems can fall into the chaotic category. The causes behind certain production behaviors are permanently shrouded in mystery, with engineers reduced to incantations of operational voodoo by redeploying and restarting things in the same magical sequence that fixed things in the past.
The Known Unknowns of Complicated Systems
Understanding these system categories, we are able to take advantage of Cynefin’s guidance for operating within each category. Making decisions in dynamic systems is all about connecting cause and effect, and Cynefin tells us that the appropriate way to understand these different systems is based largely on how easy it is to understand cause and effect.
When operating complicated systems, an expert will often intuitively know where to look in order to understand the cause of a problem. Put another way, the complicated domain is a world of “known unknowns.” When trying to understand the system’s behavior, we know what questions to ask, although the answers to those questions are initially unknown.
Cynefin defines the best process for understanding a complicated system as “sense-analyze-respond.” We look at — or “sense” — a set of predefined system characteristics, analyze what we see, then decide how to respond based on our analysis.
Engineers intuitively apply this sense-analyze-respond approach when dealing with a production incident in a complicated software system, for example a monolithic web application.
Imagine that an operator for just such a web app is responding to increased API error rates. From experience, they know — or are using a playbook that says — that elevated error rates are often either due to an overloaded database server or a specific third-party service that sometimes (too often!) goes down for unscheduled maintenance.
The operator already knows what questions to ask. The first thing they do is look at pre-configured dashboards to check on DB load and third-party error rates. Based on what they see — perhaps high error rate from the third-party service — the operator responds by putting the system into a partially degraded state, which bypasses that service, then watches to see if error rates decrease.
This is the sense-analyze-respond cycle in action: sense some predefined key metrics, analyze for the cause of the errors, then respond by bypassing the problematic service.
The problem is that this approach no longer works in modern, complex systems.
Surviving with Complex Systems
Complex systems require a different approach. Understanding the behavior of a complex system means confronting “unknown unknowns.” In other words, at first we don’t even know what questions we should be asking, let alone what the answers might be.
Cynefin tells us that our best option in this situation is to “probe-sense-respond.” Rather than sensing in a few standard areas of the system, as we would with a complicated system, we instead start by probing the current behavior that we’re seeing in the system. Probing allows us to hunt for patterns or clues to figure out what questions to ask. It helps us to look deeper at what exactly is happening in the system, come up with some hypotheses on what might be happening, and then formulate questions to ask that can confirm or deny our hypotheses.
After a few iterations of probing and sensing, we start to grasp an understanding of the cause and effect we’re seeing. As we connect cause and effect, we begin to formulate a response.
This time, imagine that we are the on-call engineer for a large web app with a complex architecture consisting of hundreds of independent microservices. As in the previous example, we’re responding to increased API error rates. Despite our previous experience operating this production system for years, we still can’t initially tell what might be causing the errors. This system is too complex, with too many moving parts. So our first reaction is to probe for better understanding.
This is where the role of open-ended and exploratory observability tools come in. Observability tools let us inspect the responses that are failing in more depth. They help us look for commonality or patterns across various dimensions.
We probe and we notice one pattern: Most of the errors are coming from a specific endpoint. We probe further, and there’s a subset of requests that seem to have a much higher latency than others.
Probing further, we look at one of the slow requests in detail to see where it’s spending its time. It seems to be hanging up in a caching subsystem. Probing further still, all the slow caching calls seem to be referring to the same object ID.
Now that we have probed for unknowns, we have enough information to sense the situation. Chatting with another engineer who understands that caching system better, we develop a hypothesis that a specific cached object has become corrupted somehow. We can then test that hypothesis by looking at the payload of those objects. Our hypothesis is confirmed: The object is corrupted.
Next, we respond. We execute a command to flush that cached object and watch for the effects. Our error rates start to drop back to baseline levels! After a short period of increased latency, our system settles back to a regular hum of baseline activity.
In observability, this is what’s known as the core analysis loop. Throughout this example, you can see how much we rely on rapid, ad-hoc exploration before we get anywhere near a reasonable understanding of what was causing our issue. That exploration helps us form a hypothesis we can test by formulating a response (clearing the cache) and validating the results.
It’s worth noting that the core analysis loop is essentially a variation of the OODA loop, a military-strategy framework developed to make decisions in the uncertain and highly dynamic environments encountered in combat operations.
Effective in several situations across military and civilian industries, it turns out the OODA loop is also effective when it comes to understanding your fancy-pants microservice architecture.
Operating a Complex System Requires a Different Tool Kit
In the past, we could understand our complicated systems by troubleshooting based on experience and known unknowns: What’s the CPU load, how many successful logins have we had in the last hour, what’s the average latency of each API endpoint?
We primarily relied on pre-configured dashboards that could answer those standard questions. Maybe sometimes we dug a bit deeper, with logs or some additional ad hoc queries, but the primary tools for understanding the behavior of our system was oriented toward fixed, aggregate analysis.
Today, tooling that only provides a pre-formed, aggregated view is no longer sufficient. Understanding complex systems requires probing them in exploratory and open-ended ways, formulating a series of ad-hoc and very specific questions about system behavior, looking at the results from various dimensions, and then formulating new questions — all within a tight feedback loop.
This need for ad hoc exploration and dissection has led to the rise of a new class of tools: observability. Observability allows us to probe deep into our systems to understand behavior, down to the level of individual requests between services.
It lets you roll up those individual behaviors into aggregate trends across arbitrary dimensions or break down those trends at any resolution, down to a single customer ID. Observability tools provide the capabilities necessary to move through multiple turns of an OODA loop extremely rapidly, building understanding as we go.
Augment Your Hunches with Observability
Software in 2021 is harder to understand than it was in 2001, and for valid reasons. Modern architectures are fundamentally more complex, and that’s not going to change any time soon.
Debugging by intuition and experience alone simply doesn’t work for today’s complex application systems. We need to augment our hunches with an iterative approach, exploring various facets of the system’s behavior to understand the relationship between cause and effect.
The good news is that the tools we have at our disposal have also evolved. The new breed of observability technology allows us to embrace this complexity and dive in deep, solving new mysteries every time.
As Charity Majors succinctly put it, “…If you can’t predict all the questions you’ll need to ask in advance, or if you don’t know what you’re looking for, then you’re in [observability] territory.”
Intercom’s mission is to build better communication between businesses and their customers. With that in mind, they began their journey away from metrics alone and...
In the last few years, the usage of databases that charge by request, query, or insert—rather than by provisioned compute infrastructure (e.g., CPU, RAM, etc.)—has...
As long as humans have written software, we’ve needed to understand why our expectations (the logic we thought we wrote) don’t match reality (the logic...