
BeginSpan is the New Printf: Observability in the Developer Workflow

 

Observability is a great way to understand how your system is operating in production, diagnose errors when systems fail, and find subtle gaps in logic. But there’s no reason to wait until your code is in production—it turns out that observability can change the process of developing code. (Our CTO Charity Majors calls this observability-driven development.) 

In this post, I’m going to talk about how observability can help you during development by replacing log statements with a far more powerful tool: distributed traces. We’ll take a look at traditional “printf debugging” to see why it can be extremely difficult to make sense of in a distributed system. We’ll also look at distributed tracing as an alternative and show how useful it can be to help understand what your code is actually doing—whether it’s code you wrote or code you’re trying to understand.

We’re going to look at the same code three times. First, we’ll use “printf debugging”—unstructured logs—to understand code on a distributed system. Then, we’ll use OpenTelemetry automatic instrumentation and the Honeycomb trace view to see how we can instantly get much richer information. Finally, we’ll add custom spans and attributes to capture our own business logic.

beginSpan vs printf

Printf debugging and limitations of unstructured logs

Thinking about developing your code from an observability perspective encourages you to think about instrumentation. It also helps you understand how your code performs and what it does—not just at deployment time and in production, but also while you’re writing it.

When writing or trying to understand code, I often sprinkle it with lines that report progress. My code in development often sprays the console with notes like “made it to the processing section” and “value is now 23.” That practice is referred to as “printf debugging.” If you keep those in long enough, they gradually mutate into unstructured logs.

Of course, these logs are limited. They don’t necessarily understand the complexities of the call graph and can’t convey things like call hierarchy. They can get jumbled when code executes concurrently. Printing to stdout may be great for a single process but can get complex when you’re dealing with multiple interacting services or multi-container dev environments. With multiple coordinating services, making sense of the log (or logs!) can become its own interpretation challenge.

As an example for this post, I used the dotnet version of our Example Greeting Service sample, which uses four different services to offer simple greetings. The “frontend” service calls a “name” service to decide who to greet and a “message” service to decide what to say; the “name” service picks from the top 10 baby names of the year that the “year” service offers. (The year service, in turn, chooses a random recent year.)

It produces a profound output—something like this:

A browser window showing the output “Hello jackson, what’s up?”

If I’m coming to this service as a new developer, I can scan the code and see the four services. But I can’t see how they interact. 

Enhancing development with trace observability

One way to understand what’s happening here is to sprinkle in a few printf statements.

Let’s take a look at the frontend service’s core logic to figure out how well the name and message services are operating.

Moving on to the core serving function, I can see that there are two async calls: one to get a name and one to get a message. It then assembles those into our greeting, “Hello jackson, what’s up?”

[HttpGet]
public async Task<string> GetAsync()
{
    var httpClient = _clientFactory.CreateClient();
    var name = await GetNameAsync(httpClient);
    var message = await GetMessageAsync(httpClient);

    return $"Hello {name}, {message}";
}

To figure out how this code works, I added a few Console.WriteLine calls (the dotnet equivalent of printf) here and similar calls in the other services this one calls—these will help me see what gets called and in what sequence. I’m also worried about the name coming back blank, so I threw in a log message for that, too.

[HttpGet]
public async Task<string> GetAsync()
{
    Console.WriteLine("Starting GetMessage!");
    var httpClient = _clientFactory.CreateClient();
    var name = await GetNameAsync(httpClient);
    var message = await GetMessageAsync(httpClient);

    if (name.Length == 0) {
        Console.WriteLine("Zero length name should never happen");
    }

    var rv = $"Hello {name}, {message}";
    Console.WriteLine("Finishing GetMessage " + rv);
    return rv;
}

When I run a test, this is what appears on the console:

frontend │ Starting GetMessage!
    name │ Enter Name::GetAsync
    name │ Enter Name::GetYear 2017
    name │ Exit Name::GetYear
    name │ Leave Name::GetAsync olivia
 message │ DetermineMessage

We see here the different services’ requests and responses. 

When I read this log, I mentally group these statements and match each enter with its corresponding exit. It takes a little mental work—and if my system had gotten any more complex, extracting the call sequence this way could become very hard.

I could pick this apart with yet more print statements, making my logs even richer (and therefore more complex), but instead, I’d like to see the nesting directly in my data.

Creating development traces with automatic instrumentation

With contemporary instrumentation options like OpenTelemetry, we can get a lot farther, a lot faster. Automatic instrumentation can provide the call hierarchy for free, across the entire distributed system, and we can build a skeleton of all the calls to see what our services look like.

Automatic instrumentation means that I follow the instructions on the Quickstart docs page to configure my project to use the OpenTelemetry libraries, but only minimally modify my source code. 
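To make this concrete, here is a minimal sketch of what that configuration looks like in an ASP.NET Core service. The service name "frontend" is illustrative, and the exact packages and builder calls may vary with your OpenTelemetry version—the Quickstart docs are the authoritative reference.

```csharp
// Program.cs — sketch of wiring up OpenTelemetry automatic instrumentation.
// Assumes the OpenTelemetry.Extensions.Hosting, instrumentation, and OTLP
// exporter NuGet packages are installed.
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;

var builder = WebApplication.CreateBuilder(args);
builder.Services.AddControllers();

builder.Services.AddOpenTelemetry()
    .ConfigureResource(r => r.AddService("frontend"))  // names this service in traces
    .WithTracing(tracing => tracing
        .AddAspNetCoreInstrumentation()   // spans for incoming HTTP requests
        .AddHttpClientInstrumentation()   // spans for outgoing HttpClient calls
        .AddOtlpExporter());              // ships spans over OTLP (e.g., to Honeycomb)

var app = builder.Build();
app.MapControllers();
app.Run();
```

With this in place, every incoming request and outgoing call gets a span automatically—no changes to the controller code itself.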

With OpenTelemetry automatic instrumentation added, cross-service gRPC and HTTP calls provide additional information about when and how they ran, and how long they took.* Right after I run the code, I can go to Honeycomb’s Home Page, flip over to my development dataset, and see a list of recent traces.

Figure 1: Recent Traces listing showing recent executions of the code. The bottom row represents a recent run.

See that bottom one? I ran it just a moment ago. I can click on it and, because I had already added automatic instrumentation, instantly see the trace of my code.

In the waterfall view below, I can see the name of each service and the name of the span. The tree structure on the left shows me what is calling and being called. The bars on the right show when each span started and how long it took to run.

I can see easily that the frontend calls the name service, which calls the year service; the frontend also calls the message service. These calls happened over HTTP GET.

I can also see that the services are running one after another—the span for the message service doesn’t start until the name and year are resolved.

And I can see all this using only automatic instrumentation for my code.

Figure 2: The Trace view shows each span in the execution, colored by service.

Enhance traces with your own logic

Automatic instrumentation is great, but to get real visibility into what’s going on, you need to create your own spans that track your internal business logic. Automatic instrumentation gets you the entrances and exits of services, but custom spans can also show you what logic got called along the way. This can help debug subtle problems by letting you see precisely how your code executes and match that against your expectations.

If I weren’t using Honeycomb, these are the cases where I’d previously have turned to a printf statement to dump internal values and arguments.

Using instrumentation, you can (and should) create your own attributes instead. I’ll enhance my automatic instrumentation with custom attributes and spans that give me the additional information I need to see. If your process is about to carry out a sophisticated set of steps, you can construct additional spans in your trace. This can help isolate what else got called and any additional arguments you need to debug.

Our goal here is that every important unit of code gets its own spans. Of course, what makes an “important unit of code” depends on what needs examining.

In this case, I added another few spans to help look at the logic: the Preparing Greeting and Render Greeting steps. This flow makes it clear that Render Greeting waits until all the other tasks have finished before it begins. In dotnet, the command to create a new span is called “StartActiveSpan.” I’ve added a few attributes to the spans to help track the name and the message that were returned by the other services.

By calling using var span = _tracer.StartActiveSpan("span name");, the new span gets added in the correct place in the hierarchy. We can then call span.SetAttribute(attribute, value) to add additional attributes.

[HttpGet]
public async Task<string> GetAsync()
{
    using var span = _tracer.StartActiveSpan("Preparing Greeting");
    var httpClient = _clientFactory.CreateClient();
    var name = await GetNameAsync(httpClient);
    var message = await GetMessageAsync(httpClient);

    using var renderSpan = _tracer.StartActiveSpan("🎨 render greeting ✨");
    renderSpan.SetAttribute("app.name", name);
    renderSpan.SetAttribute("app.message", message);
    return $"Hello {name}, {message}";
}
Figure 3: The Trace view, with additional custom spans, decorated with emoji.

Because this results in a hierarchical, structured trace view, you can focus only on the parts you need. 

As I mentioned earlier, our goal is that every important unit of code comes with a relevant span. These aren’t necessarily the same things you might have chosen to log, nor are they identical to the function stack. Rather, they construct the flow of program execution as you expect it to behave.

When you run your code, you can then verify whether it’s working the way you planned. For instance, am I getting the tree I expected? Are the right spans being created and in the order I expected? 

The view shown in Figure 3 reinforces that even though the message and the name calls are both marked as “async,” we execute them in sequence. (I’m not sure I would have noticed this misuse of async by stepping through with a debugger or reading a printed log.)

In this case, we’ve realized we could do better by making some calls concurrent rather than serial. That gives us an obvious place to launch from as we continue improving our code.

Verifying expected behavior

I know this is just sample code, but my engineer soul couldn’t let me leave this code with the unnecessary serial call.

[HttpGet]
public async Task<string> GetAsync()
{
    using var span = _tracer.StartActiveSpan("Preparing Greeting");
    var httpClient = _clientFactory.CreateClient();
    var nameTask = GetNameAsync(httpClient);
    var messageTask = GetMessageAsync(httpClient);

    await Task.WhenAll(nameTask, messageTask);

    var name = nameTask.Result;
    var message = messageTask.Result;

    using var renderSpan = _tracer.StartActiveSpan("🎨 render greeting ✨");
    renderSpan.SetAttribute("app.name", name);
    renderSpan.SetAttribute("app.message", message);
    return $"Hello {name}, {message}";
}

And, again, the trace view shows me that I’ve fixed it! I can now see that the two calls out of frontend-dotnet—one to name-dotnet and one to message-dotnet—do in fact run concurrently. It’s nice to be able to verify that the call runs the way I expected it to. And I don’t think I’d have figured the same things out with printf debugging alone.

Figure 4: The Trace view, showing that the name-dotnet and frontend-dotnet services are now running concurrently.

Keep the spans in production

When you deploy your code to production, it makes sense to leave these spans and annotations in place. The logic that you used to debug the process during the development phase might be helpful in production, and these are useful cues showing how the system works.

Adding observability into your process

Building observability into your process can help even when you’re just working in development. But that’s the tip of the iceberg—why limit yourself to development time?

In an upcoming post, we’ll talk about how to change your coding process to develop with observability in mind. (Spoiler: We’re going to change our definition of “done” from “it works and passes tests” to an observability-centric approach). 

My thanks to Vera Reynolds and Rob Kidd for the greeting card example, and to Jessica Kerr and Philip Carter for working through this example with me.


*Automatic instrumentation varies across OpenTelemetry language SDKs. You can get started with OpenTelemetry automatic instrumentation with a free Honeycomb account. Simply sign up for an account and follow our Instrumenting with OpenTelemetry docs to get started.

If you’re already running Honeycomb in production on this code, I encourage you to send the data to a separate dataset with a distinct API key. An easy way to do this is to set a local environment variable on your development machine, or to put the API key into a file that you don’t check into source control.
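As a sketch of the environment-variable approach, assuming you export over OTLP with the standard OpenTelemetry environment variables (the API key value below is a placeholder for a development key, distinct from your production key):

```shell
# Point the OTLP exporter at Honeycomb and authenticate with a dev-only key.
export OTEL_EXPORTER_OTLP_ENDPOINT="https://api.honeycomb.io"
export OTEL_EXPORTER_OTLP_HEADERS="x-honeycomb-team=YOUR_DEV_API_KEY"
export OTEL_SERVICE_NAME="frontend-dotnet"
```

Because these are set only in your local shell, your development traces land in a separate dataset and the key never gets checked into source control.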