Brian Langbecker [Solution Architect|Honeycomb]:
Hey, everyone. My name is Brian Langbecker. And I’m one of the solution architects here at Honeycomb. And we’re going to talk about distributed tracing. And what I want, as part of these slides and the demonstration, is for you to walk away with a basic understanding so you can answer the questions: What the heck is distributed tracing? What are the various questions it tends to solve? How does Honeycomb go about doing it? And why is it important?
So the first thing is, let’s talk about the definition of what distributed tracing is. Now, the first definition is a pretty academic one. Understanding the flow and the lifecycle of a unit of work, that’s one of the key things, performed in multiple pieces or steps across components in a distributed system. Kind of a complicated definition, right?
A simpler one is: It tells the story of a complete unit of work in your system. And what are three of the most common questions that you’re going to get when you have a distributed trace going on?
The first one is: Why is this thing slow? What do we mean by “this”? This unit of work. Say we’re at the bottom. Earlier in the day we’re going along great. And all of a sudden, things get really, really slow. We cross over our P90 threshold. Really bad performance. People are complaining. That’s one of the common questions. We’re going to cover that in the demo.
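To make the P90 threshold concrete, here is a minimal sketch. It assumes a nearest-rank percentile and made-up durations in milliseconds; the `percentile` helper and all values except the 3230 ms request are illustrative, not anything Honeycomb exposes.

```python
def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    # Integer ceiling division avoids floating-point rounding surprises.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


# Hypothetical baseline latencies (ms) from earlier in the day.
baseline_ms = [110, 95, 130, 120, 105, 140, 100, 115, 125, 900]
p90 = percentile(baseline_ms, 90)

# Hypothetical new requests; anything above the baseline P90 counts as slow.
new_requests_ms = [118, 102, 3230]
slow = [d for d in new_requests_ms if d > p90]
print(p90, slow)
```

Requests that land above the baseline P90 are the ones worth opening as traces.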
The next one is, “I’m a new developer. I’ve joined the company. I don’t really understand where everything is going on, you know, I can look at documentation, or I can look at a live trace.” This sort of gray box. You’ve probably heard the term white box where I can see everything inside it. Black where I understand nothing about it. I’ve got this gray one. I know it’s an API. I know I get a response. But what does this thing do? Distributed tracing is ideal for understanding that.
And the last really typical, common question is: When things are failing big time and error rates are spiking through the roof, what is going on? We want to dive into traces that have these errors and understand what’s going on in the system. So what do we cover here? Honeycomb helps you answer, so does distributed tracing, why is this thing slow, this unit of work? What does this thing even do? And then why are things failing? So let’s look at a few slides that have distributed traces in them, and then we’re going to jump into a live demo.
I don’t want you to get overwhelmed at looking at this. If you’ve never seen this before, it can be quite overwhelming, but let’s look at this from a high-level perspective. There are four or five services. You can see this in the second column from the left.
The first one, commonly known as the root span, is taking about 3.23 seconds. And if you look over at the side—and I’ll orient you around Heatmaps in Honeycomb—the bar at the bottom is the baseline, and we’re way out of the baseline that’s going on. And as we go through here, in these three seconds, we start out placing an order. We get a cart. We get a product. We go check some things out across multiple services in the system. This is a distributed trace. This is what we’re going to do in the demo. But before we jump in, I want to orient you on some terminology here.
The first terminology is the root span. It’s the very first portion of a trace. It’s a span. It’s that row of information. Every other row in the system is just a span, but this is the root. This is where everything starts.
Next, we have the service name, the individual services that make up this tracing journey. We have the name of the components. What is the name, the function; like a SELECT statement or a place order that’s going on? And then over here, to identify each and every one of these rows, we have a span ID. And we have a trace ID.
Now, we have more data that’s common to it. We got to know when the thing started. We have to know how long the thing took. The top root span took about 3.24 seconds. You know, we need to know where we are within the position of this graph. There’s an ID not showing here, but you’ll see it later called the parent ID.
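The fields just listed are enough to rebuild the trace view. Below is a minimal sketch, assuming a simple in-memory `Span` record. The field names, service names, and timings are illustrative and are not Honeycomb's actual schema.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    # Illustrative field names, not a real vendor schema.
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    service_name: str
    name: str
    start_ms: float  # offset from the start of the trace
    duration_ms: float
    children: list[Span] = field(default_factory=list)


def build_tree(spans: list[Span]) -> Span:
    """Attach every span to its parent and return the root span."""
    by_id = {s.span_id: s for s in spans}
    root = None
    for s in spans:
        if s.parent_id is None:
            root = s
        else:
            by_id[s.parent_id].children.append(s)
    if root is None:
        raise ValueError("trace has no root span")
    return root


def render(span: Span, depth: int = 0) -> list[str]:
    """Return one indented row per span, children ordered by start time."""
    rows = [f"{'  ' * depth}{span.service_name}: {span.name} ({span.duration_ms:.0f} ms)"]
    for child in sorted(span.children, key=lambda c: c.start_ms):
        rows.extend(render(child, depth + 1))
    return rows


# Hypothetical spans that all share one trace ID.
spans = [
    Span("t1", "a", None, "frontend", "place_order", 0, 3230),
    Span("t1", "b", "a", "cart", "get_cart", 10, 400),
    Span("t1", "c", "a", "catalog", "get_product", 420, 2700),
    Span("t1", "d", "c", "catalog", "SELECT product", 430, 2600),
]
tree_rows = render(build_tree(spans))
print("\n".join(tree_rows))
```

The parent ID is what turns a flat list of rows into a graph; the start time and duration are what position each row on the timeline.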
But these are key things to understand. Who is my service? What’s my name? What’s my root span? Trace ID? All those things, you keep hearing those things, that helps me constitute and build a graph in the system. Let’s jump into a live demo.
So when you instrument your systems with Honeycomb—what do I mean by instrumentation? If that hasn’t been covered already, it’s typically an SDK or a library you add into your code that then automatically sends over the spans of information—these traces. And with that, you’re able to understand things like the total requests that are going on in the system. The error rate. And the latency, aka, how fast I’m going on, what kind of errors I’m dealing with, and how slow it’s going on.
Now in this demo that I’m running here, you’ll notice this little bar going here. This is a build. And the minute I released a build recently, shortly thereafter I got a spike in latency. And that’s why we’re going to look at a trace to discover what the heck is going on in my system. So I’m going to click on this.
Again, as I oriented you, we love our Heatmaps throughout this system. We believe this really describes what’s going on. With this one, I can see as I go up higher, things are getting slower. If things are farther down, they’re faster. I can see where the activity is busy.
Now let’s jump into a trace. If I click on tracing, I’m going to see a list of my traces from the fastest ones to the slowest ones. I can see my root service. This is where it entered the system. I can see a name of it. This is a fictitious ticket booking website. I can see the duration. I can see the number of spans. I can click on this, it’s the top one right there or what I can do is I can jump into this one that’s right here within the system here.
And oh my gosh, this is an excellent one, because not only are we having latency problems, we even see an error rate on this one. So let’s look at the very top of this one that’s going on in the system that I picked here. I can see based on the Heatmap, this is normally really, really fast. I’m running really, really slow. I can see within the system itself is going down in the system here.
These are all running really, really fast, very normal. You’re kind of like, I know exactly where the problem is, Brian, the error is calling it out clearly. But let’s jump down into this area here. And I can see that this normally runs really, really fast. But what is going on down here in the system is running really slow. Why is that the case?
If you’re looking at the screen, you’re like, it’s obvious, Brian. You keep calling the SQL statements all the time. Why am I calling them? It’s a simple SQL statement. Let’s look at, is it the SQL? Everybody likes to blame databases for the problem. But if we look over the Heatmap here, guess what? Every one of these calls is having an issue within the system here. Now we have an error up here—you know, my deadline expired on the thing. That’s not really the big source of it. Then why is this thing slow? It’s not the error message why it’s slow. That’s just an error message.
But that was our third scenario, which is: what’s going on, I’m getting deadlines expired. You know, I’m making, if I look down here, way too many SQL calls. Why am I doing that? Well, because in this ticket booking service, I’ve introduced the famous N+1 problem. A fancy name for an antipattern.
Where I built this system out assuming that we’d only buy two tickets at a time, four tickets at a time, six tickets at a time—and we recently introduced a feature in the system to allow people to do bulk purchases. This user over here, 20109, is one of our bulk users. Now, how did I notice that here? Well, let’s look at the data that’s here. We’ve got time stamps, database query, this is all out-of-the-box information. But because I decorated it with additional attribute information, a user ID, maybe an email address, whatever I want to add in, Honeycomb can take it. We have a concept of wide events.
I’m able to clearly see this and I’m like, what the heck is with user 20109? Well, I happen to know user 20109 is one of our heavy users and they’re using a new set of features. And clearly, when we added the new feature, nobody bothered to realize that the system wasn’t built for bulk ticket orders. So as he’s looking through, he’s getting his tickets and doing activities throughout the system.
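The N+1 pattern is easier to see in code. This is a minimal sketch using an in-memory SQLite database. The `tickets` table, the 200-ticket order, and the function names are assumptions for illustration; the talk does not show the demo's actual queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tickets (id INTEGER PRIMARY KEY, order_id INTEGER, seat TEXT)")
# One hypothetical bulk order holding 200 tickets.
conn.executemany(
    "INSERT INTO tickets (id, order_id, seat) VALUES (?, ?, ?)",
    [(i, 1, f"A{i}") for i in range(1, 201)],
)

stats = {"queries": 0}


def run(sql, params=()):
    """Run a query and count it, the way each query would show up as a span."""
    stats["queries"] += 1
    return conn.execute(sql, params).fetchall()


def seats_n_plus_one(order_id):
    # One query for the list of tickets, then one more query per ticket.
    ids = run("SELECT id FROM tickets WHERE order_id = ? ORDER BY id", (order_id,))
    return [run("SELECT seat FROM tickets WHERE id = ?", (tid,))[0][0] for (tid,) in ids]


def seats_batched(order_id):
    # A single query returns every seat for the order.
    rows = run("SELECT seat FROM tickets WHERE order_id = ? ORDER BY id", (order_id,))
    return [seat for (seat,) in rows]


stats["queries"] = 0
slow_result = seats_n_plus_one(1)
n_plus_one_queries = stats["queries"]

stats["queries"] = 0
fast_result = seats_batched(1)
batched_queries = stats["queries"]

print(n_plus_one_queries, batched_queries)
```

With two or four tickets the extra queries go unnoticed. With a bulk order, each extra query adds another database span to the trace, which is the long run of SQL calls visible in the demo.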
So this is the power, all in one UI, in Honeycomb. We handled three questions here.
We handled one, where is it slow at? Well, it’s in this backend service. It’s not the gateway. Gateway looks like it’s slow, no, the gateway is dependent on the backend service.
We handled the other question, which is: If I’m a brand-new developer and I had never seen this before, guess what? I now understand this is composed of a whole bunch of services in the system here. And this one. And we can clearly see that there’s an error going on. You know, this error happens once in a while in the system if I were to pull out here, just for the sake, and maybe pick a different one here. I don’t have an error there. So the error is really not a critical error. Notice I didn’t have an error, a set of spiking errors at the time I was looking at things. So we answered those three questions.
Now without distributed tracing, how would I answer this? Well, I’d probably look at that and say, okay, that thing is really slow, maybe it’s that fault. I wouldn’t know it’s this fault unless I understood the code fairly well, and then I would look over, saying, “What’s going on with my backend services?” And I would have to construct all these things together.
So that is the power of distributed tracing. So what is distributed tracing? Looking at this complete unit of work.
What are some questions it solves? Slow, understanding the system, my errors that are going on.
Why is it important? Because if you don’t have something like this, how do you know what’s going on with your system? You’re stuck looking at metrics behind the scenes and guessing. And the cool part is, we isolated and said, in the case of this one, certain users are having very, very bad performance.
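Isolating "certain users" depends on each span carrying extra attributes such as a user ID. Here is a minimal sketch of that kind of grouping, assuming spans are plain dictionaries. The event shape, the `db.query` name, and all values other than user 20109 are made up for illustration.

```python
from collections import defaultdict

# Hypothetical span events decorated with a user_id attribute.
events = (
    [{"name": "db.query", "duration_ms": 15.0, "user_id": 20109} for _ in range(150)]
    + [{"name": "db.query", "duration_ms": 12.0, "user_id": 7} for _ in range(4)]
    + [{"name": "db.query", "duration_ms": 14.0, "user_id": 42} for _ in range(2)]
)


def summarize_by(events, attribute):
    """Group events by one attribute and rank the groups by total time spent."""
    totals = defaultdict(lambda: {"count": 0, "total_ms": 0.0})
    for event in events:
        key = event.get(attribute)
        totals[key]["count"] += 1
        totals[key]["total_ms"] += event["duration_ms"]
    return sorted(totals.items(), key=lambda kv: kv[1]["total_ms"], reverse=True)


by_user = summarize_by(events, "user_id")
worst_user, worst_stats = by_user[0]
print(worst_user, worst_stats)
```

A metrics-only view would show the latency spike but not which user drove it, because the user ID would have been aggregated away.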
I want to thank you for listening to my discussion. Hopefully, you now have a basic understanding, maybe a little more than a basic one. You’re able to understand, hey, I’ve got trace fields, I’ve got trace IDs that are going, let’s go down this one. I’ve got parent IDs. This is how it’s built. I know about durations. I know when it starts, that’s how I can see this visualization. I wanted to thank you for your time, and I look forward to meeting you in the future.