This ebook presents leading-edge distributed tracing practices, as described by Honeycomb customers. During Honeycomb’s annual conference, o11ycon+hnycon, presenters offered a wide range of insights into how they use distributed tracing to understand complex systems and uncover elusive mysteries. While their stories come from very different contexts, many common themes emerged. In this ebook, you’ll learn how teams at Slack, CircleCI, eero, and Intelligent Medical Objects (IMO) are using new and emerging distributed tracing practices to their advantage.
Since the advent of cloud computing, most teams now work with distributed systems by default. Engineering teams cobble together a variety of virtual infrastructure and service endpoints to create scalable, resilient, and performant applications. But they are rarely ready for the work of understanding the intricacies of the distributed architectures they’ve created.
Distributed tracing is a potentially powerful tool for any team operating modern distributed software. But tracing by itself does not provide enough context to help you understand the many hidden complexities of distributed architectures. This ebook shows you how various customers pair scientific methodologies with Honeycomb’s implementation of distributed tracing to make sense of their complex modern systems.
Applying the Scientific Method to Production Systems
Written by Pete Hodgson
Teams using Honeycomb are able to experiment quickly, with tight feedback loops that enable them to move faster. At o11ycon+hnycon, we saw that engineers find Honeycomb valuable as a way to sense and respond their way through complex environments where no one person could possibly fully understand every cause and effect. In other words, observability with Honeycomb provides a superpower: the ability to explore how your code actually behaves in the real world by applying the principles of the scientific method. Teams can form a hypothesis based on initial observations, make a small change to an environment based upon that hypothesis, and then observe again to validate or invalidate their hypothesis.
Teams that are succeeding with observability describe its adoption as a journey. It starts with a small, easy experiment, and then evolves incrementally as lessons learned are incorporated into development cycles. The theme of feedback loops and iteration continually cropped up at the conference as successful users presented their stories.
Sensing in a complex environment
Both Michael Ericksen’s “The Curious Case of the Latency Spike” and Glen Mailer’s “The Unreasonable Effectiveness of a Single Wide Event” presentations highlight the classic experience when observability first seems to click for many engineers—the moment they first exclaim, “My application is doing what?!” as they start using Honeycomb to analyze their systems.
That jolt of disorientation that happens when you first see what your system is really doing (as opposed to what you think it’s doing) demonstrates both the challenge of understanding modern software systems and the ability of tracing and observability to shed light on these environments in powerful ways. Modern software systems have become incredibly difficult to understand due to their distributed nature. In order to make sense of them, we need processes to build up that understanding.
Dave Snowden’s Cynefin model provides a useful framework for developing that understanding. This model places systems into four main categories: obvious, complicated, complex, and chaotic. It then describes how the ability to both understand and make changes to a system varies, depending on which category it is in. The Cynefin categorization of “complex” systems is an appropriate one for modern distributed software: such software has moved beyond the point where understanding cause and effect requires only light analysis or expertise. Instead, the relationship between cause and effect in complex systems can only be deduced in retrospect.
In “The Curious Case of the Latency Spike,” Michael eloquently evokes the feeling of what it’s like to operate within these complex systems. He likened investigating production incidents to solving a murder mystery. Your intuition about how your own system is behaving can turn out to be dead wrong, even if you are one of the engineers who understands that system better than anyone else.
The Cynefin model also provides guidance on how you can make progress in these complex domains. When working in a system where cause and effect are hard to understand, you should proceed by probing, sensing, and responding. In other words, you should explore the environment, inspect interesting things that you find, make small adjustments, and then observe what effect they have. That is the only way to develop an accurate and meaningful understanding of a complex system.
When you view the systems you work in as “complex,” in the Cynefin sense, the value of observability becomes clear. Honeycomb is designed to enable that sort of probing and sensing. In the Q&A after his talk, “How Tracing Uncovers Half-truths in Slack’s CI Infrastructure,” Frank Chen described this process, quite delightfully, as “let’s observe around and find out!” Similarly, in “Conditional Distributed Tracing,” Will Sargent explained how software engineers often add extra spans to a Honeycomb trace to get a deeper sense of what’s happening in a sensitive or complicated area of the system. In Cynefin terms, they are adding probes to make more sense of their complex code.
However, probing and sensing alone are not enough if you want to truly grok a complex system, let alone make changes to that system to fix a bug, respond to an outage, or improve performance. You also need to take action based on what you’re seeing. In a complex system, you’ll often be surprised to see that reality does not match your expectations of system behavior. Therefore, you must adjust what you’re probing within the system, and eventually adjust the system itself. Several of our o11ycon+hnycon speakers devoted a good amount of their presentation time to describing their processes for doing that.
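To make the idea of "adding a probe" concrete, here is a minimal sketch of wrapping a suspect region of code in an extra span. It uses only the Python standard library. The `span` helper, the `FINISHED_SPANS` sink, and the `handle_request` function are all hypothetical names invented for illustration, not Honeycomb's API or the approach Will Sargent presented. A real service would use a tracing SDK to create spans and send them to a backend.

```python
import time
import uuid
from contextlib import contextmanager

# Hypothetical in-memory sink; a real setup would export spans to a tracing backend.
FINISHED_SPANS = []
# Currently open spans, innermost last. Not thread-safe; fine for a sketch.
_open_spans = []


@contextmanager
def span(name, **fields):
    """Record a timed span, nested under whichever span is currently open."""
    parent = _open_spans[-1] if _open_spans else None
    record = {
        "name": name,
        "span_id": uuid.uuid4().hex[:16],
        "parent_id": parent["span_id"] if parent else None,
        "trace_id": parent["trace_id"] if parent else uuid.uuid4().hex,
        **fields,
    }
    _open_spans.append(record)
    start = time.perf_counter()
    try:
        yield record
    finally:
        record["duration_ms"] = (time.perf_counter() - start) * 1000
        _open_spans.pop()
        FINISHED_SPANS.append(record)


def handle_request(order_id):
    """Hypothetical request handler with one extra probe around a suspect step."""
    with span("handle_request", order_id=order_id):
        with span("load_order"):
            time.sleep(0.001)  # stand-in for real work
        # The extra probe: a span around the region we want to understand better.
        with span("price_order") as probe:
            probe["line_items"] = 3  # attach context that may explain slowness
            time.sleep(0.001)  # stand-in for real work


handle_request(order_id=42)
for finished in FINISHED_SPANS:
    print(finished["name"], round(finished["duration_ms"], 2), "ms")
```

Child spans finish before their parent, so the sink records `load_order` and `price_order` ahead of `handle_request`. Attaching fields such as `line_items` to the probe is what lets you later ask whether the slow requests share some property.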
Iterating to understanding
Frank at Slack, Glen at CircleCI, and Michael at IMO all described kicking off their Honeycomb adoption by starting small, gaining some insights into the system, and then using those insights to make incremental enhancements. No one simply “added observability.” Achieving observability was an iterative process, with each step delivering insights that guide the next step in the journey.
Similarly, several speakers demonstrated that making changes to a complex system is best done incrementally, by following the process laid out in the Cynefin model to probe, sense, and respond. That process was a central theme of Michael’s “The Curious Case of the Latency Spike.” The theme also emerged as both Frank at Slack and Glen at CircleCI described pairing Honeycomb with progressive delivery techniques (such as feature flagging and canary launches) in order to make smaller, safer changes guided by feedback from their observability systems.
In fact, the concept of tight feedback loops comes up over and over again whenever engineers talk about how they practice observability. Simply observing a static system misses much of the value observability can provide. The true value of observability is in getting rapid, detailed feedback on the impact of changes to that system. The feedback loop that enables a continuous series of small, incremental changes is sometimes referred to as a plan-do-check-act (PDCA) cycle, or an observe-orient-decide-act (OODA) loop. It’s a philosophy that the Lean community has used with great success to drive continuous improvement in complex systems.
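One way to picture pairing a feature flag with observability feedback is sketched below. It assumes a percentage-based rollout in which each request emits one event tagged with the flag state, so the old and new code paths can be compared side by side. The names (`flag_enabled`, `EVENTS`, `flag.new_path`) are hypothetical and the latencies are synthetic values invented for the example. This is not a description of how Slack or CircleCI implemented their rollouts.

```python
import random
from statistics import median

# Hypothetical in-memory sink standing in for an observability backend.
EVENTS = []


def flag_enabled(user_id, rollout_percent):
    """Deterministically place a user in the rollout by bucketing on their ID."""
    return user_id % 100 < rollout_percent


def handle_request(user_id, rollout_percent, rng):
    """Serve one request and emit one event tagged with the flag state."""
    use_new_path = flag_enabled(user_id, rollout_percent)
    # Synthetic latencies, invented purely for illustration.
    duration_ms = rng.uniform(40, 60) if use_new_path else rng.uniform(90, 110)
    EVENTS.append(
        {
            "user_id": user_id,
            "flag.new_path": use_new_path,
            "duration_ms": duration_ms,
        }
    )


def median_duration_by_flag(events):
    """Group events by flag state and return the median duration of each group."""
    groups = {}
    for event in events:
        groups.setdefault(event["flag.new_path"], []).append(event["duration_ms"])
    return {flag: median(durations) for flag, durations in groups.items()}


rng = random.Random(0)  # seeded so repeated runs produce the same synthetic data
for user_id in range(1000):
    handle_request(user_id, rollout_percent=10, rng=rng)

print(median_duration_by_flag(EVENTS))
```

Because every event carries the flag state, the comparison becomes a simple group-by. If the new path looks worse, the rollout percentage can be dialed back to zero without a deploy, which is what keeps each change small and reversible.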
It’s interesting to note that our o11ycon+hnycon speakers described two distinct flavors of this continuous improvement loop. First, it’s used to improve the systems they are operating. Second, the same approach of small, iterative changes was also held up as the best way to improve the observability system itself (how meta!).
For example, Frank at Slack described how feedback from observability tooling was used to drive a series of changes that cut their testing “flake rate” from 50% to 5%. In addition, he described how Honeycomb itself was adopted incrementally. They explicitly did not start with a goal of “tracing all the things.” Instead, they took the approach of asking a question, figuring out what data was needed in Honeycomb to answer that question, and adding only what was necessary to get results.
Similarly, Michael described how IMO worked through their performance conundrums via an iterative cycle of