Guide

Conditional Distributed Tracing: One Way to Adapt Tracing to Your Needs

 
Want a copy of the Guide for yourself? Download the PDF
 

Distributed tracing isn’t great. It’s good, but it has room to grow before it becomes truly great. Will Sargent, a software engineer at eero, encapsulated this reality perfectly in his talk at the 2021 o11ycon, where he described his proof of concept for conditional distributed tracing.

Distributed tracing is a method for monitoring application performance across a distributed system of microservices. As it is today, distributed tracing provides a number of solutions, which we will summarize as:

  1. Monitoring system health
  2. Latency trend and outliers
  3. Control flow graph
  4. Asynchronous process visualization
  5. Debugging microservices

But this last solution—debugging—is the most fraught with misunderstanding and competing priorities. Distributed tracing is a core component of observability. When a company adopts observability (and therefore tracing), many people need to use it—chiefly the software developers and the site reliability engineers (SREs)—in order for teams to fully realize the benefits.

Distributed tracing, up to this point in its history, has suited the debugging needs of SREs rather than developers. And forcing developers to adopt distributed tracing to debug can cause internal friction because they simply don’t need it.

The good news is that distributed tracing is relatively young. Enterprising developers such as Will are building on top of it using tools like Honeycomb to create new solutions. Will’s solution is conditional distributed tracing, which gives developers the flexibility to change the way tracing works based on behavior defined in the application’s own code.

The difference in perspectives when it comes to debugging

Debugging is a clear example of how tracing is still in its infancy and has significant growth potential. When it comes to debugging, developers are more like scientists who dig into hypotheses, testing out the logic behind the code. Site reliability engineers (SREs), on the other hand, are like firefighters trying to quickly locate the source of a fire, put it out, and apply what they learn to future fires. This difference in perspective leads to many of the issues around tracing adoption.

When debugging, developers look at the code. They put their effort into trying to understand the logic behind the issue that’s occurred. Will explains that their debugging process usually follows three general steps:

  1. Create a pool of data from statements.
  2. Fish around that pool with hooks and queries.
  3. Keep the most useful statements for later.

The goal of this process is to resolve actual versus expected behavior. When something unexpected happens, developers want to understand why it happened. This process requires foresight on their part to predict the expected behavior. The way they compare their predictions versus what happened is usually with logs, which provide granular, but limited insight.

In this workflow, tracing spans aren’t very helpful. Will describes them as the new printfs, where there is no priority system. The result is that using spans creates more data, more sampling, and more work for the developer. They know what they want to test, and spans provide more than is necessary.

Will summarized the prevailing attitude of developers to tracing: “Yeah, [tracing] is cool, but have you actually tried using it?” He pulled out a lot of examples in his talk, but one pull quote from Cindy Sridharan’s article, “Distributed Tracing—we’ve been doing it wrong,” stood out.

Being able to quickly and cheaply test hypotheses and refine one’s mental model accordingly is the cornerstone of debugging. Any tool that aims to assist in the process of debugging needs to be an interactive tool that helps to either whittle down the search space or, in the case of a red herring, help the user backtrack and refocus on a different area of the system.

Cindy Sridharan, “Distributed Tracing—we’ve been doing it wrong”

Developers need a tool that helps them experiment with the code and its underpinning logic. Distributed tracing cannot accommodate this easily as it exists today because it can’t easily prioritize data, and it often provides too much data for what’s needed.

SREs, on the other hand, look at the whole system when they go about debugging. Their priority is to use debugging to locate the source of the problem that’s occurred. The process SREs follow often centers around calling on multiple services to identify patterns and isolate the issue.

The important difference here is that, while developers are looking into why an issue is happening, SREs try to identify that an issue is happening. It’s the classic case of the unknown unknown, where something could be going wrong, and because you don’t know to look for it, you can’t fix it. You don’t know that you don’t know.

In this workflow, Will says that SREs “only use logs if they’re unsure about the mitigation strategy.” Logs only come into play when they’re not sure how to solve the issue. Instead, SREs rely on distributed tracing and sampling to understand the production environment to locate issues.

Both Charity Majors, CTO at Honeycomb, and Will agree towards the end of Will’s talk that locating the issue is 90% of the work of debugging. Despite the resistance of developers, adopting distributed tracing is essential if your business wants to effectively implement observability and reap its benefits.

So how do you strike a balance and adapt tracing to fit the needs of both your developers and SREs? Will’s answer is conditional distributed tracing.

Conditional distributed tracing can address both perspectives

Conditional distributed tracing is a proof of concept Will is working on that changes how tracing works based on behavior defined in the application’s own code. The application itself decides when and where it should produce a trace and when and where it should sample one. It’s not perfect, but it’s an iteration toward making distributed tracing more useful for both developers and SREs.

For developers, conditional distributed tracing brings them a step closer to being able to turn granular logging data on and off. Researchers at Microsoft described the need for this capability in their paper “The Bones of the System: A Case Study of Logging and Telemetry at Microsoft.” With this ability, enabled via distributed tracing, developers can leverage tracing to test their hypotheses. They can try a solution, turn tracing on or off to see if it worked, and keep iterating.

For SREs, distributed tracing already does its job well, and conditional distributed tracing will help them get developers on board for their observability initiatives.

As he set about building his conditional distributed tracing proof of concept, Will defined three goals for himself:

  • Allow behavior to depend on application-specific state.
  • Augment spans and traces with additional information.
  • Let developers take the wheel.

In achieving these goals, he hopes to make tracing more useful for everyone involved, developers and SREs alike. Let’s get into what he came up with.

Conditional tracing proof of concept

Will admitted his proof of concept isn’t perfect, but it’s a good foundation that anyone can build upon.

Will’s solution uses OpenTelemetry SDK hooks driven by Groovy scripts. There are two systems: a conditional sampler and a conditional span builder.

  • The conditional sampler allows for script-driven sampling, where you determine whether or not you want to sample a specific span.
  • The conditional span builder allows for script-driven span creation, where the script determines whether to create a new span or return the current span via Span.current().
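To make the two systems concrete, here is a minimal sketch in plain Python. It is not Will’s implementation and it does not use the OpenTelemetry API: the Span, ConditionalSampler, and ConditionalSpanBuilder classes, the example conditions, and the attribute names are all hypothetical, and ordinary Python functions stand in for the Groovy scripts.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional

# Hypothetical stand-in for a tracing span; not the OpenTelemetry Span type.
@dataclass
class Span:
    name: str
    attributes: Dict[str, object] = field(default_factory=dict)
    parent: Optional["Span"] = None

# A "script" is modeled as a callable that inspects application state
# (span name plus attributes) and returns a yes/no decision.
Script = Callable[[str, Dict[str, object]], bool]

class ConditionalSampler:
    """Asks a script whether a given span should be sampled."""
    def __init__(self, script: Script) -> None:
        self.script = script

    def should_sample(self, name: str, attributes: Dict[str, object]) -> bool:
        return bool(self.script(name, attributes))

class ConditionalSpanBuilder:
    """Asks a script whether to create a new span or reuse the current one."""
    def __init__(self, script: Script) -> None:
        self.script = script

    def start_span(self, name: str, attributes: Dict[str, object], current: Span) -> Span:
        if self.script(name, attributes):
            return Span(name, dict(attributes), parent=current)
        return current  # no new span: work is attributed to the current span

# Example conditions (assumptions for illustration only).
def sample_slow_checkout(name: str, attrs: Dict[str, object]) -> bool:
    return name == "checkout" and attrs.get("duration_ms", 0) > 500

def new_span_for_debug_users(name: str, attrs: Dict[str, object]) -> bool:
    return attrs.get("user_tier") == "debug"

sampler = ConditionalSampler(sample_slow_checkout)
builder = ConditionalSpanBuilder(new_span_for_debug_users)

root = Span("request")
child = builder.start_span("load_cart", {"user_tier": "debug"}, root)
same = builder.start_span("load_cart", {"user_tier": "free"}, root)
```

The point of the sketch is that both decisions are data-driven: swapping the function passed to either class changes tracing behavior without touching the instrumented code.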

Will goes on to explain that the proof of concept isn’t particularly fast, and he hasn’t done much work on making it secure. For that reason, the best option is typically to put it behind a targeted feature flag, so you can turn it on and off dynamically just as you might for other features.
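One way to picture the feature-flag approach is the sketch below. It assumes a hypothetical in-memory TargetedFlag class and a gated_should_sample helper; neither comes from Will’s repository or from any real feature-flag product. The script runs only for targets where the flag is on, and any script failure falls back to a default decision.

```python
from typing import Callable, Dict, Set

class TargetedFlag:
    """Hypothetical in-memory flag that is enabled per target (e.g. per service)."""
    def __init__(self) -> None:
        self._targets: Set[str] = set()

    def enable_for(self, target: str) -> None:
        self._targets.add(target)

    def disable_for(self, target: str) -> None:
        self._targets.discard(target)

    def is_enabled(self, target: str) -> bool:
        return target in self._targets

def gated_should_sample(
    flag: TargetedFlag,
    target: str,
    script: Callable[[Dict[str, object]], bool],
    attributes: Dict[str, object],
    default: bool = False,
) -> bool:
    """Run the sampling script only when the flag is on for this target."""
    if not flag.is_enabled(target):
        return default  # flag off: skip the script entirely
    try:
        return bool(script(attributes))
    except Exception:
        return default  # a failing script must not break the request path

# Example scripts (assumptions for illustration only).
def always_sample(attrs: Dict[str, object]) -> bool:
    return True

def broken_script(attrs: Dict[str, object]) -> bool:
    return attrs["missing_key"] == 1  # raises KeyError

flag = TargetedFlag()
flag.enable_for("checkout-service")
decision = gated_should_sample(flag, "checkout-service", always_sample, {})
```

Because the flag is checked before the script runs, turning it off removes both the performance cost and the exposure of script execution for that target.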

Once implemented, Will uses Honeycomb to keep an eye on how conditional distributed tracing is working. Overall, his solution for conditional tracing allows for:

  • Flexibility on a different axis
  • Application-based span control
  • Exploration of the OpenTelemetry SDK
  • An answer to the debugging problem for multiple audiences (SREs and developers)

Try Will’s solution for yourself on his GitHub: https://github.com/wsargent/conditional-tracing. You can also watch his session, where he walks through the proof of concept, how it works, where it can improve, and what’s next.

Tracing is still in its infancy

Conditional distributed tracing is just one example of the future of distributed tracing. We’re still in the infancy of this technology, and you can have a very real hand in shaping that future.

Build on Will’s work or create your own solution for another problem. Along the way, you can use Honeycomb to test, experiment on, and prove your ideas.
