Distributed tracing is generally a binary affair: it’s either off or on. Either a trace is sampled or it isn’t, according to a flag. Span placement is likewise assumed to be an “always-on” system, where spans are always added if the trace is active. For general availability and service level objectives, this is usually good enough. But when we encounter problems, we need more. In this talk, we’ll show you how to “turn up the dial” with detailed diagnostic spans and span events that are inserted using dynamic conditions.
Will Sargent [Software Engineer|eero]:
Hi, I’m Will Sargent, a software engineer at eero. I work with distributed systems and logging, and I’m here to talk about conditional distributed tracing. I think it’s an interesting topic and I hope you like it.
So the interesting thing about conditional distributed tracing is why do it? Why is it there? And to begin with that, we have to ask questions about what distributed tracing is for and what kind of experience it gives to people.
And when you talk to people about what tracing is and what problems tracing solves, it’s usually roughly the same list: the health of the overall system, having visibility and observability across the system, and, interestingly enough, the concept of time and the breakdown of time comes up a lot.
So you get into latency trends, how long a particular thing takes. You get into long tail and outliers. You get into a control flow graph, because you can actually have hierarchy in tracing, and that gives you an indication of what’s being called and, more subtly, what isn’t being called. And by its nature, because tracing can happen concurrently, you end up with a nice asynchronous process visualization as well, which is something you typically don’t get with straight graphs. You can have multiple things happening, you can have things stop and then start again, and you can see that kind of view.
But one thing shows up again and again in distributed tracing, which is: it’s used as a tool for debugging microservices. I’ve found this to be an interesting phrasing, because debugging services doesn’t always mean what you think it does.
So conditional tracing is the flexibility to change the way that tracing works. And in particular, it’s a way to change the way that tracing works based on code written by the application.
That is, it’s not being done by an OpenTelemetry collector or any kind of exporter outside of the system. It’s the application itself deciding when and where it should be actually producing the trace and when and where it should be sampling one.
And in some ways, this is an exploration of what you can do with OpenTelemetry, because I think there’s an emphasis on what vendors provide rather than what applications can provide to these systems. But conditional tracing is also there because debugging is a different concept, and means different things, to different people.
And this actually comes up almost immediately when you start looking at the people who complain about tracing. Matt Klein, in particular — this is actually from 2018, which kind of dates it — says that looking at a pretty trace page is fun, but how often does a user find a trace that is relevant to their debugging needs? He uses the phrase “debugging needs” and goes on to say that traces don’t get a lot of real use, primarily because developers are used to logging and stats, and it’s too hard to find relevant traces.
And this is something I find really interesting because this is somebody very clearly coming at it from a certain perspective. Yes, this looks cool, but have you tried using it? He specifically relates it to the UI, which I think is a good start but I think there’s more to the problem than this.
And Cindy Sridharan follows this up in a widely known blog post called “We’ve Been Doing This Wrong,” which also talks about debugging: debugging is being able to quickly and easily test hypotheses and refine one’s mental model.
A trace is inherently static, and looking at a portion of a trace where something interesting and unusual is happening is not necessarily conducive to debugging. It’s not necessarily what a trace provides you, and it’s not conducive to debugging in the sense of the word that they are using.
And Abby Bangser says the key to observability is to get both a high-level view and detailed debugging and if you use two different tools for this, each is not independently an observability tool.
So if you are using a high-level tool and you’re seeing a trace, but that trace doesn’t let you drill down to a lower level and get detailed debugging, and you have to go over to another tool to get it, then you are not looking at something that lets you observe. You are not looking at a tool that lets you zoom in, and you are not looking at a tool that lets you explore.
And I knew about these critiques before, but it wasn’t until I actually read this paper, “Debugging Incidents In Google’s Distributed Systems,” that it really clicked for me that software engineers have a different mental model than site reliability engineers. The model is an indication of how they view debugging, what kinds of things they view as bugs, and how they interact with them.
And what it comes down to is that distributed tracing is very good at debugging the kind of incidents that site reliability engineers focus on: incidents in production. Distributed tracing is very good at letting you see where the long tail is and which system you have to focus on. But for a software engineer, debugging is a different thing entirely. It’s an error in logic; it can be an error in internal state. And the way they resolve debugging incidents reflects their mental model: they check logs, they try to find an error that can indicate a failure, whereas SREs are used to looking at the entire system and aren’t focused on a particular log or on finding the exact underlying cause.
So with that in mind, knowing that software engineers are talking about something very different when they talk about debugging, we can see that the way software engineers try to debug using distributed tracing is also different. Namely, when in doubt, software engineers add more spans. They start off with something that’s not quite fine-grained enough for them, so they add a span to the trace to get more detail, and then they look again. They do another compilation, another deploy, see that they need more spans, and keep going.
Developers are very well aware that when it comes to debugging, we really only have one tool: printf debugging. We’ve dressed it up in various formats and called it logger.trace and logger.debug, et cetera, but that’s really just the ability to turn a printf statement on and off in some form or another. Developers are well informed about printf. They know it works in situations, especially microservices, where you don’t have access to a debugger and can’t stop an operation or inspect its threads. But they are also aware of its shortcomings: printf leaves you halfway towards either telemetry or tracing.
And I swear to God, this is just a dead-on statement here: yeah, spans are the new printfs.
And when it comes to what developers use printfs for, well, they tend to use them for a lot, and if you turn on detailed debugging in production, it will almost immediately overwhelm your system. Debugging statements are typically a state dump. You’ve got execution tracing, which spans actually helpfully give you for free, and you’ve got event reporting, which is closer to the Honeycomb sense of events: there’s been a state transition, a request terminated, et cetera.
But developers have very specific debugging needs. They create a pool of data: they throw in a bunch of spans or printf statements or what have you, they dump internal state, and they look for state transitions that indicate particular branches in logic. Then they start fishing around that pool with hooks and queries and what have you, until they have a fairly good idea of what piece of code is responsible. And usually they keep the most useful statements around as diagnostic logging for later. This works for logs because you can keep debug logging in production as long as you have a priority system.
You can ask “is logging at debug?”, it returns false, and there’s no operational effect. But spans, because they’re printfs with no priority system, behave differently: if you add more spans, down to the method level, say, you get more data, and it gets far too detailed to see anything, because you don’t want to see everything at once. You only want to enable debugging spans, tracing, et cetera, for a particular branch. But if you do it unconditionally, you get more data and more sampling, and the overall usefulness of the system goes down, because you are trying to reconcile the sheer volume of the data with what comes out.
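The log-level analogy above can be sketched in code. This is a hypothetical model, not anything from OpenTelemetry or the demo repo: a level threshold for spans, checked the same way `logger.isDebugEnabled()` is checked, so that debug-level spans cost nothing until someone turns the dial up.

```java
// Hypothetical sketch: a log-level-style priority guard for spans.
// None of these names come from OpenTelemetry; they only model the idea
// that a "debug" span should be a cheap no-op when the level is too low.
enum SpanLevel { INFO, DEBUG }

final class LeveledTracer {
    private volatile SpanLevel threshold = SpanLevel.INFO;

    void setThreshold(SpanLevel level) { this.threshold = level; }

    // True when a span at the given level should actually be created.
    boolean isEnabled(SpanLevel level) {
        return level.ordinal() <= threshold.ordinal();
    }
}

public class SpanLevelDemo {
    public static void main(String[] args) {
        LeveledTracer tracer = new LeveledTracer();
        System.out.println(tracer.isEnabled(SpanLevel.DEBUG)); // false: debug spans elided
        tracer.setThreshold(SpanLevel.DEBUG);                  // "turn up the dial"
        System.out.println(tracer.isEnabled(SpanLevel.DEBUG)); // true: debug spans emitted
    }
}
```

The `volatile` threshold is what makes the runtime flip safe to do from another thread, which is the property the talk keeps coming back to.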
And this is something that, again, various organizations know about. There’s a very good paper by Microsoft going over the Windows 10 telemetry system as a case study. They mention that what’s really important for them is being able to log at a pretty granular level and turn it on and off at realtime, and then get the performance back when it’s turned off. They want to be able to turn on debug logging for a particular branch, for a particular subsystem, see exactly what is going on, and then turn it off again afterward and get the performance back.
And we don’t really have a way to do that, in distributed tracing, but I’m going to show you a proof of concept and we are going to walk through that.
So the goals for the conditional tracing that I’ve got here: we want to allow behavior to depend on application-specific state, and we want to be able to add more information to spans. We can’t really add information to traces as a whole, but we can add extra information to spans, and we can give the application and the developers the ability to change that on the fly without having to restart the system. We can give them the wheel.
And it’s available on GitHub as wsargent/conditional-tracing. The tracing is an OpenTelemetry SDK hook that is driven by Groovy statements. There are two systems I’ve implemented. One is a conditional sampler, which allows you to say, yes, I want to sample this span, or no, I don’t. And then there’s a conditional span builder, which allows a script to determine whether you are going to create a new span or just return Span.current(). These aren’t particularly fast, and I have not spent a lot of time trying to make them secure. You can add an HMAC, but what I think is going to be the best option here is to expose this as a targeted feature flag.
That is, you can always add code that says when you turn this feature on — when you turn this flag on, then we actually also evaluate a Groovy script in response to this. So that you have the ability to turn this feature on and off dynamically, just as you do others, and I will demonstrate how that works in a second.
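A minimal sketch of that decision loop, with the Groovy script and the LaunchDarkly flag lookup both stood in by plain Java (every name below is hypothetical; the real project evaluates an actual Groovy script inside the OpenTelemetry SDK):

```java
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch of a conditional sampling decision. The "script"
// is a predicate over span attributes, standing in for the reloadable
// Groovy scripts used in wsargent/conditional-tracing.
final class ConditionalDecision {
    // Swappable at runtime, like re-reading a Groovy script from disk.
    private volatile Predicate<Map<String, Object>> script = attrs -> true;

    void reload(Predicate<Map<String, Object>> newScript) { this.script = newScript; }

    boolean shouldSample(Map<String, Object> spanAttributes) {
        return script.test(spanAttributes);
    }
}

public class ConditionalSamplerDemo {
    public static void main(String[] args) {
        ConditionalDecision decision = new ConditionalDecision();
        // Initially everything is sampled.
        System.out.println(decision.shouldSample(Map.of("debug", true))); // true

        // "Reload the script": drop debug-attributed spans unless a
        // (hypothetical) feature flag is on for this user.
        boolean flagOnForUser = false;
        decision.reload(attrs ->
            !Boolean.TRUE.equals(attrs.get("debug")) || flagOnForUser);
        System.out.println(decision.shouldSample(Map.of("debug", true)));  // false: dropped
        System.out.println(decision.shouldSample(Map.of("debug", false))); // true: kept
    }
}
```

The point of the shape is that the condition lives behind a single swappable reference, so turning the feature flag on or off changes sampling behavior without a restart.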
So what we start off with here is a conditional tracing demo, which starts off with level one, two, three, four. And in a minute, I am going to start up IntelliJ and I’m going to show you how this starts running. I’m going to do the demo thing here and I hope the demo Gods smile on me.
Let’s present the window of IntelliJ.
And let’s go to IntelliJ, and you can see here that — oh, let’s start running it.
So this is a simple Gradle project, and if you look at the main project, you can see that I’ve got a fair amount of OpenTelemetry, and I’m doing manual instrumentation here, because I am passing these two Groovy scripts into the OpenTelemetry SDK. I’ve also got LaunchDarkly implemented here with some context keys, and I’m setting this up with the actual context. So LaunchDarkly keys and values are being passed into the context and used here, so that the spans are aware of them.
And then it just goes through level one, level two, level three, and level four, and each of these comes with a wrap, which uses the span builder here. It calls makeCurrent, we add enter and exit events, and we accept the span in between. So when we’re calling this in wrap, a span is automatically wrapped around the block, with enter and exit events provided. That’s the base-level functionality. And if you look here, with this returning always, you can see it’s ignoring debug spans two and four, and that’s because we actually have the debug attributes turned on.
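The wrap helper described above can be sketched like this. This is a hypothetical stand-in, not the demo’s actual code: `SimpleSpan` here plays the role of a real tracer span, and `try`-with-resources plays the role of `makeCurrent()` plus ending the span.

```java
import java.util.function.Consumer;

// Hypothetical sketch of a "wrap" helper: open a span, record enter/exit
// events around the body, and close the span afterward. SimpleSpan stands
// in for a real tracer's span; only the shape of the helper matters.
final class SimpleSpan implements AutoCloseable {
    private final String name;
    SimpleSpan(String name) { this.name = name; }
    void addEvent(String event) { System.out.println(name + ": " + event); }
    @Override public void close() { System.out.println(name + ": closed"); }
}

public class WrapDemo {
    static void wrap(String name, Consumer<SimpleSpan> body) {
        try (SimpleSpan span = new SimpleSpan(name)) { // makeCurrent() in a real SDK
            span.addEvent("enter");
            body.accept(span);   // the caller's code runs inside the span
            span.addEvent("exit");
        }
    }

    public static void main(String[] args) {
        wrap("levelOne", span -> span.addEvent("doing level one work"));
    }
}
```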
So if we scroll down here, you can see it’s doing one, two, three, four; one, two, three, four. And if we change out this feature here and say debug by feature flag, then we see one, two, three, four; one, two, three, four again, because it’s running through. But looking at the feature flag here, we’ve got an LD client here and a user here, and we can say that if this is not the particular user we are expecting, if the user’s test flag is set to false, then we simply ignore it.
So let’s look at the conditional sampler now. The conditional sampler is using the SDK. So let’s go back to the Chrome tab and the conditional tracing.
So the conditional sampling demo allows you to make three decisions: drop, record, or record and sample. In practice, what you’ve got is essentially yes or no rather than three options, because the exporter tends to get confused if you drop a span and then have a child span which is not dropped; it doesn’t know what the child attaches to.
So if you write code that drops spans at random, say by random nanos, dropping one out of every three, you end up with a system that has a missing span in the middle of a trace.
In practice, that means if you are going to drop a span, you have to drop everything below it as well. So in the conditional span builder demo here, when a span has the debug attribute, you can see that we can look at the attributes on the span. This is something you typically can’t do with spans otherwise: the span is normally a write-only API. You can set attributes, but you can never get them back out. Because we are in the SDK, though, we can ignore that. So what we do here is: if the span has the debug flag on it, we know we can ignore it, and the script returns false. And if it returns false, we just call Span.current() and return the already existing span.
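That fold-into-the-parent behavior can be sketched as follows. This is a hypothetical model, not the SDK code: a stack of current span names stands in for the context, and returning the top of the stack stands in for returning `Span.current()` instead of building a child.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of the conditional span builder: when the debug
// condition is off, startSpan() hands back the already-current span
// instead of creating a child, so debug spans fold into their parent.
public class ConditionalSpanBuilderDemo {
    static final Deque<String> currentSpans = new ArrayDeque<>();
    static boolean debugEnabled = false;

    static String startSpan(String name, boolean isDebugSpan) {
        if (isDebugSpan && !debugEnabled) {
            return currentSpans.peek(); // like returning Span.current()
        }
        currentSpans.push(name);        // a real builder would start a child span
        return name;
    }

    public static void main(String[] args) {
        startSpan("levelOne", false);
        System.out.println(startSpan("debugDetail", true)); // prints "levelOne": elided
        debugEnabled = true;                                // turn up the dial
        System.out.println(startSpan("debugDetail", true)); // prints "debugDetail"
    }
}
```

Because the elided case never pushes a new span, there is never a dropped parent with recorded children, which is exactly the exporter-confusion problem the sampler runs into.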
Right now, I don’t actually attach the attributes to the existing span, so it slightly breaks the span builder semantics, but it’s easy enough to change that if you want.
And then what we end up with is this here. We end up with the span builder demo.
And we end up with span level one, DoStuff level two, and DoStuff level three, and the punch line here is that level four is also here at the end, so close by that you can’t really see it.
You can see that you end up with a span system which allows you to roll up the fine-grained debugging detail, so that when you are querying the system, producing normal amounts of logs and normal amounts of spans, you only see the high-level stuff. But if you want to drill down and get a more detailed picture, you can change the Groovy script on the fly, start producing spans with the detail that you want, and then turn it back down again.
And, yeah, you can do that by using LaunchDarkly, and you can also, of course, use internal state of the system. So here we’ve got an example where you can condition by deadline: you can say this time plus ten minutes, et cetera, and if it’s after that, turn it off.
You can also specify conditions where, if you are under a resource constraint (for example, high CPU, low memory, or a lot of memory allocation pressure), you can say, let’s just not produce spans at this detail anymore. And of course, because it’s all a Groovy script that is passed in, you can source the script from anywhere. I have it coming from the file system, which is easiest for me, but you could have it come from Redis, or a central database, et cetera. So you can have that centrally distributed and pushed out to all of your applications at once, to any application that can read Groovy and knows how to work with it.
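The by-deadline condition above is easy to sketch. Again, this is a hypothetical illustration rather than the repo’s Groovy script: a `BooleanSupplier` that stays true until a deadline passes, so detailed tracing switches itself off after, say, ten minutes. The same shape works for CPU or free-memory checks.

```java
import java.time.Instant;
import java.util.function.BooleanSupplier;

// Hypothetical sketch of the by-deadline condition: detailed tracing is
// enabled until the deadline passes, then the condition turns itself off.
public class DeadlineConditionDemo {
    static BooleanSupplier until(Instant deadline) {
        return () -> Instant.now().isBefore(deadline);
    }

    public static void main(String[] args) {
        // Detailed spans for the next ten minutes, then back to normal.
        BooleanSupplier detailed = until(Instant.now().plusSeconds(600));
        System.out.println(detailed.getAsBoolean()); // true while the window is open

        // An already-expired window: the condition is off.
        BooleanSupplier expired = until(Instant.now().minusSeconds(1));
        System.out.println(expired.getAsBoolean()); // false
    }
}
```

A deadline like this is a nice safety valve: even if an operator forgets to turn detailed tracing back off, the condition expires on its own.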
And, yeah, so you can break out and fold spans at runtime. It elides spans; it does require some SDK mashing, unfortunately, because the OpenTelemetry SDK is not really set up for this out of the box. The thing I’m most sad about is that there doesn’t seem to be a way to reconstruct an elided span afterward. It’s not recorded, and you can’t send it off to another channel. You are stuck with whatever 10,000-foot view you have of that span, rather than being able to query a local system or investigate a local cache and fill it in later.
So it does have some drawbacks, but I think that, if nothing else, it’s a useful way to open up the question: developers and SREs conceptualize debugging differently. When they’re solving problems, they are doing different kinds of debugging: debugging logic flows and trying to drill down, versus debugging incidents and trying to resolve the issue.
Reconciling debug traces with production is something that, in logging, has traditionally involved conditional logic. In order to add that conditional logic into the SDK, we need to do a few interesting things, but this is, I think, something that can be expanded on, and something that can be used to think about using tracing in different ways and considering different audiences. I think that’s about it. So this is a good time for questions. Let me know what you think and how this works for you.
Charity Majors [CTO & Co-founder|Honeycomb]:
That was super fun! You know, watching your talk here, it really hits home for me that we are in the infancy of tracing and what tracing is going to do for our systems. You know, what we think of as tracing today is so new to so many people, but we’re just beginning to be able to do things like inspect within events and do cross-span queries and all that stuff.
What led you to start investing so much time into this, where you are working?
I mean, distributed tracing is an outgrowth of my interest in logging.
Basically, if I wanted to look at a particular trace, then there were times that I wanted a lot of detail and I wanted to see absolutely every single feature, and then there were times when I wanted to just see, like, what is the overall, like, you know — came into database, went out through the HTTP client, et cetera. And I wanted to be able to switch that around at runtime, and I couldn’t do that.
By the way, “an outgrowth of my interest in logging” is one of the nerdiest things anyone has ever said. You should get better hobbies. Logs are terrible.
I really don’t. And, again, I think it is a difference in perspective. I’m a software engineer. I can look at debug logs and traces all day long, and even better if I can actually extract useful information out of them without having to look at an individual log entry myself, letting the machine do the processing.
I’m right there with you and I think it’s terrific.
Yeah, I don’t need better hobbies. I just need more people with my hobbies. That’s basically where I’m coming from.
More people need to have higher standards for their hobbies.
I’m proselytizing zealously at this point.
Yes, you are. And I’m so glad that you are, because I think we proselytize similar things but in different ways. I think what you said about SREs and software engineers and the language you used is very different. I know we only have a minute left. I feel like we could fill an hour ranting about this, which we should do sometime. You haven’t been on o11ycast yet, have you?
o11ycast, our podcast. Yeah, we should do a whole episode about this stuff, because, you know, when you are talking about SREs and incidents and everything, I’m like: I guess we do. I think of it so much more as software engineers looking at code and SREs looking at systems: where in the system is the code that needs to be debugged, as opposed to what is the logic that needs to be debugged. I often find that the hardest part of resolving an incident, as you say, is figuring out where it is. Once you know where it is, everything else is easy, but that’s 90% of the battle right there.
Yep, absolutely. And it gets even worse when, like, it doesn’t seem to cause a failure in the system.
And instead corrupts data.
Oh, yes, my favorite!
If it’s not request-response oriented and you end up with internal state that keeps building up, then the longer the delay between inputs and results, the more likely things are to explode. And I —
And I think that’s where things go off the rails, the dichotomy there.
Absolutely true. And for all the people who are cringing out there in horror and going, ‘I never lost data,’ I have to tell you, you have, and just don’t know it. So on that cheery note, thanks, Will.