Molly Stamos [Customer Success Engineer|Honeycomb]:
Welcome everyone to the “See the Trace?” webinar. We’re going to start in a couple of minutes. We just want to let a few more folks join and get settled in.
Okay, thank you for joining for the “See the Trace?” webinar. Before we dive into the presentation, I’m going to go over a few housekeeping items. If you have questions at any time during this webinar, just use the “Ask a Question” tab, located below the player, and then we’re going to answer questions at the end of the presentation. If you have any technical difficulties, go to the bottom of the page and click on “Support for Viewers,” and someone will help you. Then, finally, at the end of the webinar, please do take a moment to rate the presentation and provide feedback, using the “Rate This” tab, below your player. We will be posting a recorded version of this presentation, and it’ll be available at this same URL, so feel free to share it with your friends and colleagues.
Okay, so this is our third in the Honeycomb Learn series. We’ll have two more in May and June. You can check out the previous series and register for the next series at our Honeycomb website, honeycomb.io, under the “Webinars” tab.
I’m Molly Stamos. I work in the Customer Success Organization. My background is in networking, operations, and data analytics. I love working with customers. I work with customers every day, troubleshooting issues, solving problems, and I use Honeycomb every day to investigate issues on Honeycomb, as you can imagine. Ben, do you want to introduce yourself?
Ben Hartshorne [Senior Engineer|Honeycomb]:
Good morning, everyone. Happy to be here. My name is Ben Hartshorne. I am an engineer here at Honeycomb. My background is also in ops, and I really do believe in the practice of DevOps. I’ve spent a lot of my career running systems, and seeing how difficult it is to really get good information out of these opaque applications that I was supposed to run. The way that the industry is moving towards sharing that responsibility so that the developers are also working on their systems in production, the operators are also building software in order to help do all that, it’s making everything so much better.
Coming from that background of more operating software than building it, I definitely spend a lot of time thinking about instrumentation and about how to understand the flow of traffic. That’s why I enjoy working with tracing quite so much. I look forward to getting to talk about this a bit more with all of you.
Good, good, so today, we’ll be talking about distributed tracing, as Ben just said, the benefits of it, and how to use it within your organization. We’ve worked very closely with our customers over the past few years, developing these capabilities. The information we’re going to share with you today is based on all of that experience. Ben and I will talk for about 30 minutes, and then we’ll answer questions at the end, but feel free to type in your question at any time.
There are a number of different steps on the path to observability in your production systems. We’ve already focused on instrumenting your code in episode one. Episode one was focused on how to better create telemetry, so you can get context for the code, which helps everyone maintain a well-performing service.
Then second in the series was how to run queries and conduct incident response, as well as how to proactively watch how production behaves, especially when there’s new code deployed. We covered how to configure triggers so that you can be proactive and get ahead of any problems before they impact a wide variety of users. I definitely encourage you to go back and listen to those two episodes, and share them with your team members.
Today, we’re going to focus on distributed tracing. We believe distributed tracing is a very important tool on the path to observability, and to detect difficult to diagnose issues. It is also very useful for watching how the system responds to new input, so whether that’s a user, who is very heavily using the system, or new code that’s been deployed, that’s where tracing really shines. Let’s dive into that.
We, at Honeycomb, have a very holistic approach to managing production. We firmly believe that the best approach for engineering, for developing production systems, is for engineering, DevOps, and customer support to have the same tools, so that we get complete visibility across all team members. I use Honeycomb every day to study and understand what’s happening in a customer’s environment, to help them diagnose issues and solve problems.
For us, observability, reaching an observability state, really starts with development. This little diagram here, basically what it’s saying is we develop the code. We test it locally. We watch proactively what happens. We deploy it into production, validate that it’s doing what we expect, fix issues as they come up, and then iterate, learn from that, and iterate over time.
We use Honeycomb for a variety of different things: feature development, solving difficult issues, bugs, customer support, what I work on, performance analytics, identifying performance problems, and then feature deployment confirmation. This dog, Coco, is our office dog, a wonderful member of the team. She’s still trying to figure out how to do incident response, but I can tell you that she delivers great emotional support, while we’re going through troubleshooting a tricky issue.
Why trace? Ben, I’ll tell you. It really seems like, lately, tracing is really popular. A lot of people are interested in it. A lot of people are talking about it. Why do you think that is?
Well, it’s interesting that you say it’s come up in popularity recently. Tracing isn’t really new. Folks have been doing tracing for a very long time. It’s never been particularly easy. I think the thing that’s really bringing it to the forefront now, and why so many more people are talking about it and writing to us about it and developing around it, is because the industry has spent enough time understanding that it is valuable, but it’s too difficult.
Everybody has been saying, “I want tracing, but it’s too much work. I’m not going to do it right now.” The fact that a number of different companies and open source projects and working groups have been really focusing on unifying the various data models of tracing and providing better libraries and integrations has allowed people to start using tracing without spending months and months going through everything, in order to get their code instrumented. Yes, it has definitely been rising in its popularity and awareness, but I think that that’s really because it’s getting so much easier to get in on it, and not because people are only now realizing how valuable it is.
I see. Yeah, it’s interesting. I get a lot of questions from people who are interested in it, but haven’t yet started doing it, haven’t started instrumenting their code. I’m just curious, in your experiences working with all of our customers, what are the… why aren’t more people… if it’s easy to get started, or easier to get started now, why aren’t more people doing it?
Well, I think there’s certainly a lot of leftover angst about that difficulty from the time before, when it was really difficult. We need to keep giving more examples and having more people do it, and realizing there is a little bit of extra lift at the beginning. I remember, in the instrumentation, episode one, Nathan was talking about how it sometimes does take one person on a team making that extra effort and spending some time to really get in there and build a scaffold, on which everyone else can then improve and add extra data, without a lot of work. It’s a question of just being able to get the ball rolling, I think. There’s a lot of leftover fear about how much work it’s going to be, from the times before.
Yeah, that makes sense. In our environment, we have a pretty complex sort of… the architecture is, I would call it, complex. Do you feel like you need to have a specific kind of architecture to get value from tracing?
You know, that’s interesting, because I think that is another answer to your previous question, too. People think about tracing as only being relevant for enormous microservice architectures, or distributed architectures, even if they’re not microservices, and haven’t really used tracing as a tool so much, when building monolithic services, thinking that I have what I need because everything’s in this one process. I really believe that tracing excels in those environments. You don’t need a distributed infrastructure in order to start using tracing, because instrumenting even just a single service, using the tracing paradigm, can give you a fantastic view into the flow of execution through that service, in a way that’s just very difficult to visualize with other instrumentation methods.
Actually, later on in this, we’ll go through some examples. The one example I’ll start with is actually a single service, so we’ll talk more about that then. It’s not required to have a distributed infrastructure in order to start using tracing, but if you have a distributed infrastructure, tracing really is required. It can be easily applied in both environments, but it’s very, very difficult to get a good understanding of a distributed environment without something like tracing.
Okay, okay, that makes a lot of sense. When you’ve seen people instrument and get started with tracing, do you see that they’re trying to instrument everything at once, or what would you say is the most successful way to get started?
I like that question, because it echoes some of the fear around what it will take to get started with tracing. In the old days, it was that tracing is not valuable until you’ve instrumented everything, so that all of the pieces talking together are aware of the tracing, and that’s just too much. That’s not how anybody builds software these days.
You hear the stories of the mythical and failed rewrites, because somebody was like, we need to do it all at once. That’s definitely the wrong approach. The easiest way to get started with tracing is to put a shell around your service that starts the trace and makes sure it gets sent at the end. Even just that single step.
When a request comes into the service, start up a trace. Set up the context, so that other parts of your service can use it if they need to. They don’t have to. With that first shell enabled around even just one service, you start getting the bare bones of what will be a trace.
Once that’s there, from within the service, you can then add spans and add context and add color and add little bits, as you need them, or as you identify spots in your service that you need to see better. Finally, even beyond that, by adding a shell around one service and making sure that it emits the tracing IDs and other bits of shared information, as it calls other services, each team working on a given service, or a collection of services, can then opt in. Every one that you add, adds value, but it’s definitely not the case that you have to have them all before you get any value.
That’s definitely something that put people off in the past. I definitely encourage people to just start with one, and start with a single process. Build your single process traces. As you need to add more, you do, and you gain value from that, incremental value, along the way.
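A minimal Go sketch of the "shell around your service" idea, assuming a `net/http` service. The `Span` fields, the two header names, and the `emit` callback are illustrative assumptions made for this example, not the API of any particular tracing library. The shell starts a span when a request arrives, puts it in the request context so inner code can opt in, and emits it when the handler returns.

```go
package main

import (
	"context"
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// Span is one unit of work. All field and header names here are illustrative.
type Span struct {
	TraceID, SpanID, ParentID, Name string
	Start                           time.Time
	Duration                        time.Duration
}

// Hypothetical header names for passing trace context between services.
const (
	traceHeader  = "X-Example-Trace-Id"
	parentHeader = "X-Example-Parent-Id"
)

type ctxKey struct{}

func newID() string {
	b := make([]byte, 8)
	_, _ = rand.Read(b)
	return hex.EncodeToString(b)
}

// TraceShell wraps a handler: it starts a span when a request arrives, stores
// it in the request context, and emits it when the handler returns.
func TraceShell(next http.Handler, emit func(Span)) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		traceID := r.Header.Get(traceHeader) // join the caller's trace if one was sent
		if traceID == "" {
			traceID = newID() // otherwise this request starts a new trace
		}
		span := Span{TraceID: traceID, SpanID: newID(), ParentID: r.Header.Get(parentHeader),
			Name: r.URL.Path, Start: time.Now()}
		defer func() {
			span.Duration = time.Since(span.Start)
			emit(span)
		}()
		ctx := context.WithValue(r.Context(), ctxKey{}, span)
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

// ChildSpan lets code inside the service opt in: it times fn as a child of the
// span found in ctx. If there is no span in ctx, it simply runs fn.
func ChildSpan(ctx context.Context, name string, emit func(Span), fn func()) {
	parent, ok := ctx.Value(ctxKey{}).(Span)
	if !ok {
		fn()
		return
	}
	s := Span{TraceID: parent.TraceID, SpanID: newID(), ParentID: parent.SpanID,
		Name: name, Start: time.Now()}
	fn()
	s.Duration = time.Since(s.Start)
	emit(s)
}

// serveOnce pushes one fake request through the shell and returns the spans emitted.
func serveOnce(incomingTraceID string) []Span {
	var spans []Span
	emit := func(s Span) { spans = append(spans, s) }
	h := TraceShell(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ChildSpan(r.Context(), "load_user", emit, func() {})
		fmt.Fprint(w, "ok")
	}), emit)
	req := httptest.NewRequest("GET", "/home", nil)
	if incomingTraceID != "" {
		req.Header.Set(traceHeader, incomingTraceID)
	}
	h.ServeHTTP(httptest.NewRecorder(), req)
	return spans
}

func main() {
	for _, s := range serveOnce("") {
		fmt.Printf("%s trace=%s parent=%q\n", s.Name, s.TraceID, s.ParentID)
	}
}
```

The handler works whether or not it calls `ChildSpan`, which is the incremental property described above: the shell alone yields the bare bones of a trace, and each added span contributes more detail.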
Okay, yeah, I think that’s really good advice. Maybe you could talk to us a little bit about how to read a trace view.
All right, yeah, let’s talk about looking at traces. The standard view into a trace is a waterfall diagram. That’s what we’re looking at here. The time axis goes across the top, and all of the different bits coming down the page are individual spans.
Let’s go to look at this waterfall a little bit more closely, with a couple of concepts. When thinking of a trace, before we talk about how to read the picture, I want to give you the bones that make a trace, a trace, really different from, say, a single event. A trace is a collection of spans. This is the vocabulary we’re dealing with. A span represents a single unit of work, more or less.
These spans are collected together in a tree structure, so a span starts, being called from a previous span, so that’s a parent/child relationship. The parent span, when it needs to make a call that is instrumented, creates a child span to represent the work done in that instrumented section. There’s a special span that is the first one. It’s called the root span. It doesn’t have to be, actually, super special. It’s just saying that this is where the request came into the service, or this is where I started this job, from the customer, or from a batch, or whatever it is that gets streamed.
The most common features of a span are, first off, an identifier that joins it to the trace. That’s the trace ID. You see that up at the top here. The trace ID is consistent across all spans. Within each span, there is a span ID that identifies that span specifically. If that span is a child of another span, it also includes the ID of its parent. You’ll see the span ID in a parent turn into the parent ID in that span’s children.
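To make the three identifiers concrete, here is a small Go sketch. The `Span` struct, its field names, and the sample span names are assumptions made for illustration. It indexes spans by parent ID and walks them as a tree, which is the structure a waterfall view lays out.

```go
package main

import (
	"fmt"
	"strings"
)

// Span carries the identifiers described above. Field names are illustrative.
type Span struct {
	TraceID  string // shared by every span in the trace
	SpanID   string // unique to this span
	ParentID string // SpanID of the parent; empty for the root span
	Name     string
}

// childrenByParent indexes spans by parent ID; the key "" holds the root span.
func childrenByParent(spans []Span) map[string][]Span {
	idx := make(map[string][]Span)
	for _, s := range spans {
		idx[s.ParentID] = append(idx[s.ParentID], s)
	}
	return idx
}

// render walks the tree depth-first and returns one indented line per span.
func render(idx map[string][]Span, parentID string, depth int) []string {
	var lines []string
	for _, s := range idx[parentID] {
		lines = append(lines, strings.Repeat("  ", depth)+s.Name)
		lines = append(lines, render(idx, s.SpanID, depth+1)...)
	}
	return lines
}

func main() {
	spans := []Span{
		{TraceID: "t1", SpanID: "a", Name: "handle_request"},
		{TraceID: "t1", SpanID: "b", ParentID: "a", Name: "authenticate"},
		{TraceID: "t1", SpanID: "c", ParentID: "a", Name: "fetch_data"},
		{TraceID: "t1", SpanID: "d", ParentID: "c", Name: "sql_select"},
	}
	for _, line := range render(childrenByParent(spans), "", 0) {
		fmt.Println(line)
	}
}
```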
The relationship is that one parent can have many children. Each child comes from a single parent. The last bit about a span is that they have a duration. They start at a specific time, and they last for a certain amount of time. These are the main characteristics that are required in order to build up a trace. There are-
I have a question about this if… I’m just curious. I’ve always thought that the root span, that top span, encapsulates the entire request, but this one gets cut off. It doesn’t cover the entire request. You have these longer bars at the bottom. What’s happening there?
Ah, that’s a really interesting question. You can think of spans as… If you think of a job that needs to do a web request, fulfilling a webpage, a client requests a page, the server collects all of the resources necessary to build that page, and then hands it back to the client, and its work is done, right? In that circumstance, if the root span starts when the client request is received, all of the child spans are representing the work done, that root span is finished when it hands back the final result to the client.
In this example, and this example is coming from our UI, our web server, the root span here is going long enough to hand back a result to the client. Now, what this represents is Honeycomb building a result to hand back to the customer, but the way that we built our UI server is that launching a query… when you build a query in Honeycomb, and you hit “Run Query,” that is actually launching an asynchronous request, so the roundtrip of the web browser asking the server to please start running a query gets a response when the server says, great, I have started running your query. It doesn’t wait until it says, I have finished running your query here at the page.
This is the model of a lot of web services these days, that the webpage will issue small requests and get back responses quickly, even when those requests actually take a long time. What we see in this visualization is that, when the server got this request to answer a query, there were a number of things it had to do before it could hand back a result, even saying, great, I have started your query. It authenticates the user. It builds an internal model of the query to validate it, and makes sure that it’s a reasonable question, checks that columns exist, finds the identifiers for the database, finds a schema, makes sure that… builds up the query structure itself, hands it off to our back-end storage engine.
When it has handed off that query to the storage engine, and the storage engine has accepted it, that’s when it responds to the client, saying, great, your query is now in flight. That’s when the root span ends, but the trace continues, because the trace also is watching all of the stuff that’s going on from the storage engine, and doesn’t finish until that query itself actually finishes. This is super valuable, in that you can see both when did the browser get a response back and how long did the actual query take? And then connect that back to the customer’s final result.
The thicker bars, down towards the bottom, where it says, execute query, retriever client fetch, persist, S3 put, mark as done… That final span there, mark as done, that’s when it’s created an entry in the database saying, this query is complete. Your results are available. The way that our UI does its polling, it’s actually fetching those results from where they’re cached in S3, rather than handing them back to the browser live. Does that make sense?
Yeah, it does. It does. It makes a lot of sense.
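The point that a root span can end before the trace does can be shown with a little arithmetic. In this Go sketch the span names loosely follow the ones mentioned above, but the parent/child links and every timing are made-up numbers for illustration, not measurements from the trace on screen.

```go
package main

import "fmt"

// Span holds just enough to compare end times. Times are in milliseconds
// from the start of the trace; all values below are invented for illustration.
type Span struct {
	SpanID, ParentID, Name string
	StartMs, DurationMs    int64
}

// EndMs is the moment the span finished.
func (s Span) EndMs() int64 { return s.StartMs + s.DurationMs }

// rootAndTraceEnd returns when the root span ended (the client got its
// response) and when the last span in the trace ended (the work finished).
func rootAndTraceEnd(spans []Span) (rootEnd, traceEnd int64) {
	for _, s := range spans {
		if s.ParentID == "" {
			rootEnd = s.EndMs()
		}
		if s.EndMs() > traceEnd {
			traceEnd = s.EndMs()
		}
	}
	return rootEnd, traceEnd
}

// sampleTrace models an asynchronous request: the root span ends once the
// query is handed off, while the storage-side spans keep running.
func sampleTrace() []Span {
	return []Span{
		{SpanID: "1", Name: "run_query_request", StartMs: 0, DurationMs: 120},
		{SpanID: "2", ParentID: "1", Name: "authenticate", StartMs: 5, DurationMs: 15},
		{SpanID: "3", ParentID: "1", Name: "build_query", StartMs: 25, DurationMs: 40},
		{SpanID: "4", ParentID: "1", Name: "hand_off_to_storage", StartMs: 70, DurationMs: 45},
		{SpanID: "5", ParentID: "4", Name: "execute_query", StartMs: 110, DurationMs: 900},
		{SpanID: "6", ParentID: "5", Name: "mark_as_done", StartMs: 1010, DurationMs: 15},
	}
}

func main() {
	rootEnd, traceEnd := rootAndTraceEnd(sampleTrace())
	fmt.Printf("browser got a response at %dms; query finished at %dms\n", rootEnd, traceEnd)
}
```

Keeping both numbers in one trace is what lets you answer the two separate questions raised above: when did the browser hear back, and how long did the query itself take.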
The last little bit of this page is the field list on the right. Now, an interesting aspect of Honeycomb… We started as an eventing service, where you could send in wide events with many fields filled with context, strings, numbers, objects, whatever you needed, in order to understand what’s going on within the unit of work that that event represents.
The tracing view is no different. Every one of these spans is a complete event. Every one of them has a number of fields that describe what’s going on in that span. Some of them are relatively uninteresting, like the instance type. Definitely interesting when you’re doing an instance type migration. Perhaps not so interesting the rest of the time. That’s an AWS instance type. That’s my big one, but user IDs, all of these extra bits of information, the span is recording a unit of work.
A number of these are database queries. Those include the actual SQL query issued. It’s tremendously valuable to have every one of these spans annotated with large amounts of extra content.
Yes, I completely agree. Especially for me, when I’m in support, trying to figure out what’s going on, the extra context is huge. Okay, so having just walked through the anatomy of a trace, and how you would use it, we’re going to cover a couple of trace use-cases, tracing use-cases. There are many different ways you can use tracing, and the obvious one is incident response. Ben’s going to walk through a real outage that we had, an incident that we had at Honeycomb, and show how we worked through that using tracing.
The other areas are whenever new code ships into production, so understanding how it’s behaving, making sure it’s doing what you think versus unexpected behavior that could impact users. Then, in development, a lot of people don’t think about using tracing, and even observability, in development, but it is so valuable, because you can really see what’s happening and understand that your system is behaving as you expect. Ben Hartshorne, do you want to take us through the end-to-end incidents you went through?
Sure. One more comment about this in development that’s especially relevant for the end-to-end testing that I’m going to talk about, it’s really fun and interesting and exciting to be able to use the same instrumentation while you’re building your service, as you are going to have when you deploy that service to production. It gives you practice understanding what’s normal, what’s abnormal. It gives you a mental model of what you’re going to see when it’s really running in production, so that you can better understand what you see when you’re actually looking at that.
There’s a lot of talk about automatically finding the parts that are interesting for you. Well, people are really good at building mental models of things, and then very, very quickly comparing their mental model against a piece of input. Machines are not as good. You can think of the dashboard that you see up on the wall in so many NOCs and operation centers.
Everybody in that room has a deep, intimate understanding of what good looks like on that screen. As you walk in the room, you glance up and know everything’s good or, oh, that looks weird. Part of building that is using the same tool set as you’re building the code, as you will use when it’s finally deployed. That’s what I like best about using tracing in development, aside from the fact that it can actually correct your mental model from time to time.
Just two weeks ago, I was working on a service, where I had understood… It’s a batching service, so a piece of data comes in the door, and then it gets processed by the first section, and then goes through a channel to get processed by the second section, and through a channel to get processed by the third section, and so on. I had a misunderstanding about which of those sections were serialized, and which were going to run in parallel. Looking at a trace just immediately broke that mental model and proved, no, these ones are in parallel. You thought they were happening serially, but they’re not. That’s incredibly valuable. Even though I was working in that code, having the incorrect mental model of what it’s doing is the easiest way to push in bugs.
That’s my plug for using tracing in development. It’s, I think, not something I hear about so often, but I really enjoy it. With that, on to incident response.
As we’re running Honeycomb, we have a number of checks that continually validate that our production service is working correctly. One of those sits outside our infrastructure’s boundary, in order to simulate being an actual client of the service. This is the spectrum of black box or gray box or white box testing, depending on how much the tester, the E2E check, in this case, knows about the service, but it’s different from the type of instrumentation, where you’re seeing what the server is doing, right? This is looking at a client’s perspective of how the service is working.
This end-to-end check has a simple job. It inserts an event through our API, and then polls, via the UI, to find out whether that event made it into the dataset. If it successfully retrieves the event that it sent in, then it knows that the entire service, from the API accepting it and parsing it, and handing it off to the queue that gets it ready for the storage engine, to writing it to disk on the storage server and, on the other side, the reading section that our UI is working, that it is able to query the storage engine, that the storage engine is able to respond to that query. All of that, that entire process, is validated by this end-to-end check.
Now, because it does such a comprehensive job of checking everything, it’s a very high signal alert. When the end-to-end check reports that something is going wrong, we jump. This is one of the few things we have that will page somebody in the middle of the night, because it is so complete in its check, right?
It doesn’t matter which service is broken. If a customer is unable to get back the data that they put in, our service is broken, and we need to do something about it. That’s what we’re looking at. Molly Stamos, does all that make sense? Anything to add on the description of what the end-to-end service does?
No, I thought that was phenomenal.
Great, okay, so I want to start by showing what normal looks like. This is a visualization of the end-to-end check running successfully. There’s a lot going on here, so I’m going to spend a little bit of time unpacking what we’re looking at.
Now, the reality of our storage engine is that it is sharded. Our customer data comes in the front door and lands on one of many shards in the background. This is how we grow and scale and keep everything fast, while accepting both small and large volumes of data.
This is one aspect where the end-to-end check actually knows a little bit more about the implementation of the service than one of our customers. In order to verify that not just the service is working in general, but that every single one of the back-end shards is working, the end-to-end check has a special configuration: a set of different datasets, each of which is pinned to a specific shard, and it submits an event to every one of those datasets and checks that it got every one of them back. That’s how it verifies that all of the shards supporting our service are functioning correctly.
Now, there’s a question that came in earlier about concurrent child spans. This is a really good visualization of how that looks. The end-to-end check runs all of these probes in parallel, so it submits all of the probes to all of the shards at once, and then has a separate goroutine. We use Go for most of our services, a separate goroutine that watches the success of the submission and then polls to try and get that data back.
The way that concurrency is visualized… On the left of the screen, you see what we call the bus stop diagram. The main bar, vertical bar, indicates the connection back to the root span. Each of these stops on that bar represents one of those goroutines. Each of those goroutines has a whole subtrace flowing out from below it. You can see that there are three visible on the screen right now, three goroutines running in parallel. The first was the quickest. The second took a little bit longer. The third is longest. All three of them succeeded eventually, but they took different amounts of time to finish.
One of the characteristics of the way that this check works… When the end-to-end test client requests the data that it inserted back, it’s supposed to get back the 10 most recent probes, and it examines them and includes data from the three previous samples, as part of the metadata attached to this span. This is what I expect the trace to look like. They are variable lengths. They all generally succeed, after a couple of checks. The vocabulary here is that a probe is the goroutine that’s checking a dataset. It submits a probe, and then it, later on, checks the probe. The check probe span has a number of check probe once spans underneath it. Each of the check probe once spans checks a single time. The check probe span is running that in a loop, with a maximum and a timeout and other guards against it running forever. Okay, so this is normal. This is what good stuff looks like.
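A hedged Go sketch of the probe structure just described: one goroutine per dataset, each running a check loop guarded by a maximum attempt count and a timeout, and recording the sample length of every single check. The type names, the `fetchRecent` function type, and the fake read path in `demo` are all assumptions made for this example; this is not Honeycomb's end-to-end checker.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// CheckResult summarizes one probe, with one entry per single check attempt.
type CheckResult struct {
	Dataset       string
	Found         bool
	SampleLengths []int // how many recent events each attempt got back
}

// fetchRecent stands in for the read path: it returns recent probe IDs for a dataset.
type fetchRecent func(ctx context.Context, dataset string) []string

// checkProbe polls until probeID shows up, the attempts run out, or ctx expires.
func checkProbe(ctx context.Context, dataset, probeID string, fetch fetchRecent,
	maxAttempts int, wait time.Duration) CheckResult {
	res := CheckResult{Dataset: dataset}
	for i := 0; i < maxAttempts; i++ {
		recent := fetch(ctx, dataset) // one single check
		res.SampleLengths = append(res.SampleLengths, len(recent))
		for _, id := range recent {
			if id == probeID {
				res.Found = true
				return res
			}
		}
		select {
		case <-ctx.Done(): // overall timeout guard
			return res
		case <-time.After(wait):
		}
	}
	return res
}

// runProbes checks every dataset in its own goroutine and collects the results.
func runProbes(ctx context.Context, datasets []string, fetch fetchRecent,
	maxAttempts int, wait time.Duration) map[string]CheckResult {
	var mu sync.Mutex
	var wg sync.WaitGroup
	out := make(map[string]CheckResult)
	for _, ds := range datasets {
		wg.Add(1)
		go func(ds string) {
			defer wg.Done()
			r := checkProbe(ctx, ds, "probe-"+ds, fetch, maxAttempts, wait)
			mu.Lock()
			out[ds] = r
			mu.Unlock()
		}(ds)
	}
	wg.Wait()
	return out
}

// demo uses a fake read path: shard-1 returns its probe, shard-2 returns nothing.
func demo() map[string]CheckResult {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	fetch := func(_ context.Context, dataset string) []string {
		if dataset == "shard-2" {
			return nil // simulate a failing read: zero samples come back
		}
		return []string{"probe-" + dataset}
	}
	return runProbes(ctx, []string{"shard-1", "shard-2"}, fetch, 3, time.Millisecond)
}

func main() {
	results := demo()
	for _, ds := range []string{"shard-1", "shard-2"} {
		r := results[ds]
		fmt.Printf("%s found=%v sample_lengths=%v\n", ds, r.Found, r.SampleLengths)
	}
}
```

Recording the sample length on each attempt, rather than only the overall pass or fail, is the kind of per-span detail that separates "the probe was not in the results yet" from "no results came back at all".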
When we got an alert, instead it looked like this. What we’re seeing is an enormous number of check probe once spans. The interesting part here is that the sample’s length is zero. Now, every time that that check probe was supposed to go in, it was supposed to get back 10 results. Now, if it didn’t find its probe in those 10 results, great. It would say, I failed, and then it would try again a little bit later.
This time, it’s getting none back, and so all of them are failing. That’s pretty clear. Now, the interesting part about looking at this, as a trace… With normal, metrics-based instrumentation, we might record how long the check probe thing took, might record how many iterations it included, and that’s about it.
What we would see is that these checks were failing. They were taking 50 seconds, which we know is the maximum that they’re allowed to run before they report it as an overall failure. That’s about it. We’d have to really start digging in order to find out what’s going wrong.
By using tracing to instrument this client and have it include such detailed data in every span, not only how long did it take to make this request? But what was the status code it got back? How many results did it get back? What were some of those sample results? We immediately see what’s going on here. There’s almost no digging necessary. If we understand, if we’re getting zero results back, clearly the query to get those 10 results is failing. We immediately know where to look, to start to look at this code.
But, there’s one more question that we haven’t answered. I’m going to switch over to an actual Honeycomb screen share here, because I’m afraid I don’t have a slide for this, so give me just one moment while I pull that up. This is going to be the same window we were just looking at, but this one is live. This is the same trace we were looking at before. We selected the same one. We see sample length is zero here.
My question is now, okay, is that representative of just this one trace, or is that actually the problem that we’re looking at? Now, I got into this trace… I didn’t show you how we got into this trace, but when the issue is happening, this is the query that we were looking at. This is a regular Honeycomb query, not a tracing, but just a regular query, saying, show me all of the successes and failures for probes in production, so the filter is on, environment is production, the name is probe, and we can see there’s this rough patch in here. The number of successful probes is dramatically reduced.
Then we clicked through to one of them, got to this trace, looked at the span, saw, hey, sample length is zero. That’s really weird. Also, the last three values are missing. Well, I mean, that makes sense, because we didn’t get any values left, but is this actually representative of that same span of time?
Here I’m issuing a Honeycomb query that is now restricted to the check probe once and shows me what are the sample lengths that I get back. Here in the table down below, we can see we got 10 samples back on failed attempts. We got 10 samples back on true attempts. The case where we got zero samples back is this green line. That time range is exactly the same. That’s confirmation that the individual trace I was looking at is actually representative of the entire span of that outage. That’s what can give me confidence that I’m not chasing a rabbit hole. I’m not going down some strange thing that just happened that once, in that one trace, without going back and just spot-checking traces.
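The step from one trace back to the aggregate view amounts to grouping spans by a field over time. The Go sketch below does that in memory for illustration only; it is not Honeycomb's query API, and the `Attempt` type, its fields, and the sample data are assumptions.

```go
package main

import "fmt"

// Attempt is one single-check span reduced to the fields we group on.
type Attempt struct {
	Minute       int  // minutes since the start of the query window
	SampleLength int  // how many recent probes the read returned
	Found        bool // whether this attempt found its own probe
}

// Key identifies one cell of the grouped result.
type Key struct{ Minute, SampleLength int }

// countBySampleLength counts attempts per (minute, sample length) pair.
func countBySampleLength(attempts []Attempt) map[Key]int {
	counts := make(map[Key]int)
	for _, a := range attempts {
		counts[Key{a.Minute, a.SampleLength}]++
	}
	return counts
}

// zeroShareByMinute returns, per minute, the fraction of attempts that got
// zero samples back. A share near 1 during the failure window suggests the
// single trace we looked at was representative rather than a one-off.
func zeroShareByMinute(attempts []Attempt) map[int]float64 {
	total := make(map[int]int)
	zero := make(map[int]int)
	for _, a := range attempts {
		total[a.Minute]++
		if a.SampleLength == 0 {
			zero[a.Minute]++
		}
	}
	share := make(map[int]float64)
	for m, n := range total {
		share[m] = float64(zero[m]) / float64(n)
	}
	return share
}

func main() {
	attempts := []Attempt{
		{Minute: 0, SampleLength: 10, Found: false},
		{Minute: 0, SampleLength: 10, Found: true},
		{Minute: 1, SampleLength: 0, Found: false},
		{Minute: 1, SampleLength: 0, Found: false},
	}
	fmt.Println(countBySampleLength(attempts))
	fmt.Println(zeroShareByMinute(attempts))
}
```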
There are a lot of traces that are included in this window, but the ability for Honeycomb to take the deep view of a trace, this is what exactly one request going through the system looked like, and spin it back into, okay, let me take something I learned there, and look at it broadly across all of the traffic that’s coming in, and then go back and forth, that’s what we sometimes talk about, the core analysis loop, building hypotheses, looking at them, backing up, and seeing the same problem from very different perspectives. That’s where tracing, as one view into the problem, really lets us speed up that loop. Okay, Molly Stamos, I’ve run way longer than I meant to. I want to hand it back to you.
That was awesome.
I had a lot of fun talking about that. I really do enjoy incident analysis, because of my background in operations, but also tracing. Yeah, it was a really fun problem to work through. Turned out-
It was actually a pushed, a bit of pushed code that had broken just at one endpoint and didn’t represent a full outage, but-
A full outage, yeah.
That happens sometimes.
Yeah, I love how articulate you are about it.
Oh, thank you.
Well, okay, so now we’re going to speed up a little bit. Go back to the slides, so a very interesting walk-through of a real world situation. New code shipping is another area that tracing can be very valuable in. I, in customer support, do use tracing. We recently shipped a trigger enhancement. Triggers are how we alert in our system, and we shipped an enhancement to our users that would time out long-running queries. I used traces to watch the runs of the trigger service, and validate that we weren’t seeing clusters of timeouts or anomalous number of timeouts, to make sure that the feature was working as expected.
With that, I’m going to give a quick demo. Ben Hartshorne already showed you some live Honeycomb. I have some demo data that I’m just going to run through, just to give you a flavor of how you move in and out of a trace and leverage the insight that Honeycomb can give you, so I am also going to share my screen. Give me a second.
While she’s setting that up, thank you for the questions that came in. Please let us know if the question about child spans and concurrent child spans is not answered fully. We can talk about that at the end. Ah, you’ve got your screen up. Great, let’s keep going.
Yeah, okay, so what we’re seeing here. This is the main query interface of Honeycomb. We’re looking at endpoint calls. We have an API, so we’re looking at the endpoint calls. Of course, you see this very large spike at the end. We’re filtered just to errors that are happening in the endpoint.
The first thing to do is ask, okay, well, what’s happening here? One way we can start to explore this is to start with the Traces tab, so I have my query with my data. Then, what we provide you in the Traces tab is a list of the top 10 traces that have the longest duration. This spike in API calls is actually increasing latency in our service, and this is giving the top traces that show that latency.
We also give you a span summary that’s very handy, because it allows you to see, at a glance, if any of the traces are anomalous. For example, I threw this one in. This one looks very different than the others, and so you might start there. For this demo, though, I want to look at long query run times, so I’m just going to pick one of the traces that has one of those long run times.
Ben walked a little bit through the trace. You can see each span, and its nested position in the waterfall graph. We can see the start of the trace, which is that API call, and the duration, with that high latency. This trace… Basically, the request came in, we checked rate limits, we then fetched some user info, did some authentication, and then started our fetch to the back-end.
One thing I see right away is that all of these queries are running sequentially, which is probably part of why it’s taking so long. They really should be running parallel. But, I also have access to all of the context within each span. It’s not aggregated away, and so I can see, for example, what host it was running on, who the customer was, what build ID this particular request used. It’s very powerful.
In this demo data, I only have, I don’t know, maybe 50 spans, but in our customers’ traces, most of our customers have traces that are in the thousands of spans and, for some, even 100,000 spans. Finding information in the trace, when you have that many spans, is very challenging. So we have some enhancements that will help you navigate through this trace and get to your answer sooner.
First of all, collapsing and expanding spans. You need to get… I want you to see all the spans that are of this type. Expand them all, so I can see them, or collapse everything at the second level of depth, so getting very quickly to see what’s important. Furthermore, you can then search all of the spans to figure out and find important information.
For example, I might be interested in searching the spans for which ones contain the word error. The only one I get is this fetch tickets for export. I can look over here and see, ah, the error it has is “deadline expired,” so these queries ran for too long. I can also look for a particular query, what the query was, and say, are all my query spans running this query, and I can see, yes they are. They’re all matching. That’s some nice ways to really get at the details of the trace and find information quickly, especially when you have thousands of spans, like most of our customers do.
Then, finally, we’re looking at one trace. This is what Ben talked a little bit about. We’re looking at one trace, and that’s really useful. It’s one request I can study, one request. Here’s the thing. A lot of times you want to know, is that indicative of a bigger problem? You want to zoom out. We’re at a very granular level here. Now you want to zoom out. Honeycomb also gives you the ability to do that.
One thing I can do is like, is this limited to a particular user? I can specify, break down by this field, user ID, and Honeycomb is going to rerun the query, splitting all the calls up into which users made those calls. We call it breakdown. I can say break down by field, and right here, we can see, in fact, it’s just one user responsible for this increase in the number of requests and the increasing latency, so I can reach out to that user.
I’ve looked at very close detail of the request, and then zoomed out to that global view to get my answers, and that’s one of the really nice benefits of Honeycomb. I didn’t have to go to another tool or anything like that. I could just answer my questions, follow my line of thought to get to the right answer.
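The breakdown step described above amounts to a group-by over the events. Here is a minimal sketch of that idea in plain Python, not Honeycomb's query engine; the field names `user_id` and `duration_ms` and the function `break_down` are assumptions made up for illustration.

```python
from collections import defaultdict

def break_down(events, field):
    """Group events by one field; return {value: (count, average duration_ms)}."""
    groups = defaultdict(list)
    for event in events:
        groups[event.get(field)].append(event["duration_ms"])
    return {
        value: (len(durations), sum(durations) / len(durations))
        for value, durations in groups.items()
    }

# Hypothetical events: one user is responsible for most of the slow calls.
events = [
    {"user_id": "u1", "duration_ms": 120.0},
    {"user_id": "u2", "duration_ms": 80.0},
    {"user_id": "u1", "duration_ms": 900.0},
    {"user_id": "u1", "duration_ms": 1500.0},
]
summary = break_down(events, "user_id")
```

Here `summary` shows at a glance which user accounts for the request count and the latency, which is the question the breakdown is meant to answer.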
I think you’re really hitting something important there, Molly. When understanding what your production service is doing, there are many different types of questions you can ask, and each of those questions has different ways to best represent the data, in order to answer it. If you’re looking very deep, the waterfall is great. If you’re looking very wide, understanding the relationships between those fields and their values is a far better way of looking at what’s going on. The tracing is important, but I really like how you’ve touched on that it’s just one of the views that you’re going to use, as you’re exploring what your production service is doing.
Right, right, so I want to get to questions. Just very quickly, what did we cover today? Well, we talked about the importance of tracing, and how valuable it is to get context. There are many use cases, besides incident response. I just so encourage you to try it while you’re developing. I do that. I find it very useful, as well as watching how new features behave in production, and that it’s very powerful to switch back and forth between this global view or more aggregate view versus just looking at one single trace.
You don’t have to boil the ocean. Start slow. Instrument one trace, just as Ben Hartshorne said. Get started there. You’d be shocked at how just instrumenting one thing can even get other people on your team excited, and then they start instrumenting, as well. It has become a viral spread. With that, let’s go to some questions.
Okay, we’ve got a few showing up. Please continue to ask them. We have, I think, about 12 minutes left, so I’d love to spend some time focusing on what you would like to hear. Molly, you want to take one of them first?
Sure. Yeah, I can… We’re getting the question, “Is there a limit on the number of traces you can handle or manage?” The answer is no. We have an incredibly high throughput ingest pipeline, so we can handle lots of events coming in, and then our storage engine, it scales in parallel, and all of that, and so that is not the problem. I think, Ben, if you were to think… are there any problems you could think of that would happen if you had a large number of traces? I assume the question is about traces, and not spans, which are the individual bars inside the trace.
Yeah, well Honeycomb’s model is that events or spans come in and are stored in the dataset. That dataset has a capacity that is part of our pricing structure. The trade-off between storage and the amount of time that represents basically is a function of the throughput. If you send an enormous number of traces in and have a small retention, a small dataset, then you will have tracing data for a small amount of time, and large volume going into a large dataset, you will have a longer amount of time it represents.
None of the steps along the way are actually specific to traces. Honeycomb’s model was built on these complex events, and a trace is really a collection of events that have a unifying identifier, a field, a trace ID. That trace ID field is a field just like any other in Honeycomb. There really is no limit on the number of traces, independent from the amount of storage it takes to retain those traces on disk, and then the amount of time that represents for the size of that dataset. Does that make sense?
That makes sense to me. I think that was a good answer.
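The trade-off in that answer is simple arithmetic: for a fixed dataset capacity, the time window you can look back over shrinks as throughput grows. A rough sketch, assuming capacity is measured in bytes and events have a roughly constant size; the function name and all of the numbers are hypothetical, not Honeycomb pricing figures.

```python
def retention_seconds(capacity_bytes, events_per_second, avg_event_bytes):
    """Approximate time window a fixed-size dataset can hold at a given throughput."""
    return capacity_bytes / (events_per_second * avg_event_bytes)

# Same hypothetical dataset size, two different ingest rates.
low_volume = retention_seconds(1_000_000_000, 1_000, 500)
high_volume = retention_seconds(1_000_000_000, 10_000, 500)
```

Sending ten times the events into the same dataset leaves one tenth of the time window, whether or not those events happen to be spans in a trace.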
Another question: “I have a lot of fields in my events that I’m sending in. Will that cause the traces to take a long time to generate?” No, it doesn’t. The number of fields present on events is independent from what it takes to generate a trace. Thinking about the trace IDs and the parent IDs and so on, in order to build that waterfall representation, you really need four fields: the span ID, the parent ID, the trace ID, and the timestamp… five fields, and the duration.
Honeycomb’s query engine is optimized for selecting a small number of fields that are present across a group of events, so all of the rest of the fields that are part of that data, which is great… Highly contextual data is the best kind of data. They don’t influence the amount of time necessary to build that waterfall. Once it’s built, you can see all of the rest of the fields on the side. In terms of performance, they’re all very, very fast to put together.
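To make the point about the five fields concrete, here is a minimal sketch of building a waterfall from nothing but the trace ID, span ID, parent ID, timestamp, and duration. It illustrates the idea and is not Honeycomb's implementation; the field names and the function `build_waterfall` are assumptions. Any other fields on the events are simply ignored.

```python
from collections import defaultdict

def build_waterfall(events, trace_id):
    """Return (depth, span_id, timestamp, duration_ms) rows in waterfall order."""
    # Select only the events that belong to this trace.
    spans = [e for e in events if e["trace_id"] == trace_id]

    # Index spans by their parent; the root span has no parent (None).
    children = defaultdict(list)
    for span in spans:
        children[span.get("parent_id")].append(span)
    for siblings in children.values():
        siblings.sort(key=lambda s: s["timestamp"])

    rows = []

    def walk(parent_id, depth):
        # Depth-first walk: each span is followed by its own children.
        for span in children.get(parent_id, []):
            rows.append((depth, span["span_id"], span["timestamp"], span["duration_ms"]))
            walk(span["span_id"], depth + 1)

    walk(None, 0)
    return rows

# Hypothetical events; "host" stands in for extra context that is never read here.
events = [
    {"trace_id": "t1", "span_id": "c", "parent_id": "a", "timestamp": 15, "duration_ms": 30, "host": "app-2"},
    {"trace_id": "t1", "span_id": "a", "parent_id": None, "timestamp": 0, "duration_ms": 50, "host": "app-1"},
    {"trace_id": "t1", "span_id": "b", "parent_id": "a", "timestamp": 5, "duration_ms": 8, "host": "app-1"},
    {"trace_id": "t2", "span_id": "x", "parent_id": None, "timestamp": 1, "duration_ms": 2, "host": "app-3"},
]
rows = build_waterfall(events, "t1")
```

Each row's depth gives the indentation of the bar, and the timestamp and duration give its position and width.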
The next question coming in is about instrumenting traces. What’s the best way to do that for Honeycomb? There are a couple of ways. We have our own SDKs that… Libhoney is our raw SDK, and the Beelines are the Honeycomb SDKs that are tracing enabled, so they have traces as first class concepts within the SDK. They have things like start span, finish span. That’s the one that we’ve written to really help people get started, when they are getting started with Honeycomb.
We do have some compatibility with other tracing SDKs, specifically if you use the OpenTracing, Zipkin SDKs, or Jaeger SDKs configured to use the Zipkin wire format. We can accept those. We have some exporters for OpenCensus. It’s important to play well with others, in terms of accepting both other SDKs, as well as other tracing formats, although we have found, at least in our customer base, there’s a big value in concentrating tracing to a single standard across an organization.
I’d spoken before about not trying to do everything at once, but it is good to at least choose one standard to use across an organization, so that as you continue to instrument additional pieces, they can play well together. The short answer, the best way to do instrumentation for Honeycomb? Use a Beeline. Do we support open standards, too? Yes, we absolutely do. Molly, you got something to add?
No, I was just going to comment that we have the Beelines, which do auto-instrumentation, and that’s very useful for getting started. Even if you decide you don’t want to do it that way in the long-term, it allows you to get started running traces in a few minutes versus days and days.
Mm-hmm (affirmative), yeah, that’s a really good point. I didn’t talk about the automatic instrumentation part, but more that the Beelines have a native tracing API. An interesting part of the work instrumenting is deciding what should be instrumented. This is one thing that people love about some of the bigger APM products is that they just magically choose it for you, but a lot of them also have a… Well, I’m not going to get into that part now.
The Beelines do some automatic instrumentation. For example, if you use the wrapper that is aware that you’re running an HTTP server, it will automatically add things like the client IP address, and whether you had a load balancer that forwarded the request, and the HTTP status code on the way back. All of the things that are relevant for HTTP will be automatically instrumented, if you use the HTTP wrappers in the Beeline. The-
Beeline is both the tracing API and that automatic stuff. Yeah, Molly.
You’ve got a really interesting question here coming in.
Oh, yes, okay, so the question is, “I would like to use a service mesh, specifically Envoy, to instrument our microservices. Is that supported, and what are the advantages and disadvantages to using a service mesh versus instrumenting in your code?”
This is a question with a couple of parts, and we’ve got five minutes, so I will do my best, but please forgive me if we need to cut off a bit. Yes, service meshes are great. Envoy can emit events in the Zipkin wire format, which Honeycomb can ingest, so you’re good on that part. It is supported by Honeycomb.
Advantages and disadvantages to using meshes versus code instrumentation, or the third option, why not both? Understanding the relationship between services is a key part of tracing. This can be handled either by the mesh or by the code itself, so when a service is calling out to a dependent service… Let’s say, I need to authenticate a user. I’m going to call out to the authentication service. Well, my code definitely knows that, so I can create a span in code that says I’m calling the auth service. The mesh also knows that, so it can identify that the application handler has made a call to the auth service.
Where the mesh falls down is that it doesn’t have any facility to augment those spans with additional information that’s relevant to the service. As a service is trying to complete its job, some of its calls out to third-party services are straightforward, ask questions, get answers. Others are a little bit more subtle. It might check a cache first, and then call the database, or it might check a couple of back-ends, and combine those results in some interesting way.
All of that interesting stuff that the service is doing is not visible to the mesh, so adding instrumentation about the actual behavior of the service itself, from within the code of that service, is incredibly valuable. That is the real benefit of doing instrumentation from within a service. Getting instrumentation from the mesh is valuable, because it relieves some of the duties of each different service maintainer. The mesh can see all of that traffic going back and forth between all of the services, without the need of each service maintainer adding instrumentation within it, though there are definitely benefits to both.
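As a sketch of what instrumentation from within a service can capture that a mesh cannot, the example below records whether a lookup was served from a cache or had to go to the database. The `span` helper, the `finished_spans` list, and `load_user` are hypothetical stand-ins written for this illustration; they are not the Beeline API.

```python
import time
from contextlib import contextmanager

finished_spans = []  # stand-in for sending spans to a tracing backend

@contextmanager
def span(name, **fields):
    """Time a block of work and record it, with any context fields, as one span."""
    record = {"name": name, **fields}
    start = time.monotonic()
    try:
        yield record
    finally:
        record["duration_ms"] = (time.monotonic() - start) * 1000
        finished_spans.append(record)

def load_user(user_id, cache, database):
    """Check the cache first, then fall back to the database."""
    with span("load_user", user_id=user_id) as current:
        if user_id in cache:
            current["cache_hit"] = True
            return cache[user_id]
        current["cache_hit"] = False
        user = database[user_id]
        cache[user_id] = user
        return user
```

A mesh would see at most the call out to the database. The `cache_hit` field exists only because the service's own code recorded it.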
You can do both, and combine that data to get both the mesh’s view of which services are being called, as well as additional spans from within each service, augmenting that trace with extra data. There’s one challenge, in that the mesh sees calls between services. If that service, making a call out to another service, has passed through the trace propagation, from the person that called it, and it includes that in a way that the mesh can see it, well, that is the only case in which the mesh can actually combine the different calls into a trace that can then be included with the application.
It actually does require some work on the application service author’s side, in order to allow the mesh to build this complete trace. Otherwise, all it sees is there were 50 requests that came into application service, and there were 30 requests that went between application service and auth service, but being able to tie those back together without that identifier that needs to be threaded through is very difficult.
I hope that answers the question. It is complicated, but both is definitely a good choice, with caveats.
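The identifier that needs to be threaded through usually travels as request headers. Below is a minimal sketch of the work on the application side, assuming a setup where Envoy uses Zipkin-style B3 headers; the header list reflects that assumption and is not taken from the webinar, and `propagation_headers` is a hypothetical helper.

```python
# Headers a service copies from the incoming request onto its outbound calls,
# so the mesh can tie both hops to the same trace (assumed B3-style setup).
TRACE_HEADERS = (
    "x-request-id",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
    "x-b3-flags",
)

def propagation_headers(incoming_headers):
    """Pick out the trace headers, matching names case-insensitively."""
    lowered = {name.lower(): value for name, value in incoming_headers.items()}
    return {name: lowered[name] for name in TRACE_HEADERS if name in lowered}

# Hypothetical incoming request headers.
incoming = {
    "X-B3-TraceId": "abc",
    "X-B3-SpanId": "def",
    "Content-Type": "application/json",
}
outgoing = propagation_headers(incoming)
```

The service would attach `outgoing` to its request to the auth service. Without that step, the mesh sees two unrelated sets of requests.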
We’re running out of time.
Yeah, we can follow up, certainly follow up with additional questions. Just send them into firstname.lastname@example.org. Thank you so much for attending. We have a number of additional resources provided here. Please join our next episode, where we’ll talk about outlier analysis, very useful.
Thanks, and please give feedback.
I think we’re about to get cut off.
Please give feedback. Let us know what you thought of the session. Thank you very much for your time.
Yeah, this has been a lot of fun. Have a great day, or night, depending on your time zone. Bye.