Molly Stamos [Customer Support Engineer|Honeycomb]:
I’m going to show you the power of tracing to troubleshoot and understand how your distributed systems behave in production. Remember that Honeycomb delivers real-time analytics for DevOps and SRE teams to better understand production systems. One challenge we regularly hear from our customers is that with a complex distributed system, it’s nearly impossible to simulate production behavior in a lab environment, and as such, they can’t possibly predict all of the problems they’re going to see ahead of time.
Being able to observe the system under real load with real users is critical to quickly identify the source of problems as well as identify brewing problems before they turn into fires. One way Honeycomb gives you the ability to observe your system is with distributed tracing. Tracing allows you to see the time and resources taken to service any request. With the waterfall view, it’s easy to diagnose bottlenecks, optimize performance, and understand how your system processes requests. Let me show you how this works.
In this scenario, we have a user who is overusing our tickets export endpoint and driving up latency for the rest of our customers. User 20109 is responsible for this spike right here. They have a high count of hits against this tickets export endpoint, and they’re all showing up as failures, and of course, the latency on that endpoint is now really high.
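As a rough illustration of the breakdown described above, the sketch below counts hits per user on one endpoint. It is not Honeycomb's query engine. The field names (`user_id`, `endpoint`, `status`, `duration_ms`), the path `/tickets/export`, and every event except the user ID 20109 are assumptions made up for the example.

```python
from collections import Counter

# Hypothetical request events; field names and values are illustrative only.
events = [
    {"user_id": 20109, "endpoint": "/tickets/export", "status": 500, "duration_ms": 1300},
    {"user_id": 20109, "endpoint": "/tickets/export", "status": 500, "duration_ms": 1250},
    {"user_id": 20109, "endpoint": "/tickets/export", "status": 500, "duration_ms": 1100},
    {"user_id": 20110, "endpoint": "/tickets/export", "status": 200, "duration_ms": 140},
    {"user_id": 20109, "endpoint": "/tickets", "status": 200, "duration_ms": 35},
]


def hits_by_user(events, endpoint):
    """Count requests per user for a single endpoint."""
    return Counter(e["user_id"] for e in events if e["endpoint"] == endpoint)


# The user with the most hits on the export endpoint, and how many.
top_user, top_hits = hits_by_user(events, "/tickets/export").most_common(1)[0]
```

Grouping by a high-cardinality field such as a user ID is what makes a single heavy user stand out from the aggregate.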
We found the cause of this latency problem within a matter of seconds, but we want to better understand how our system is handling requests to see if there are things we could do to avoid this problem altogether. The way we can do that is by taking a look at a distributed trace. The traces feature in Honeycomb will show us the distributed traces across this time range that have a particularly high duration.
We can examine any number of traces that appear here. Here, we see a summary of the spans that are being called. That’s nice because it can highlight any anomalies that give us a good clue which trace to look at first. All these traces look roughly the same, so let’s choose the trace that corresponds to the highest point of latency in our graph. This waterfall diagram shows a distributed trace and all of the steps our system took to fulfill this particular endpoint request.
As we can see for the tickets export endpoint, we have that high latency of 1.3 seconds. On the right-hand side, we see the fields that were attached to this particular request: that 500 error, that particular user, and even the mobile platform that they are using. We have access to all of the rich detail. It isn’t aggregated away. Here is how this request flows through our system: the request hits the endpoint, a rate limiter is engaged, an authorization service call is made, and those steps all happen pretty quickly.
Then the ticket backend function starts, and there’s a bulk ticket export that happens as a part of this request. As we can see, queries are running sequentially, one after the other. We can see the query that’s being called for each run and the time that it took. So, what should we do? Well, with this insight, we could send this to our backend development team.
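To make the "sequential queries" observation concrete, here is a minimal sketch of how one might check whether sibling spans in a trace run one after the other. The `Span` class, its fields, and the timings are hypothetical and do not reflect Honeycomb's data model.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical span: a name, a start offset, and a duration (milliseconds)."""
    name: str
    start_ms: int
    duration_ms: int

    @property
    def end_ms(self) -> int:
        return self.start_ms + self.duration_ms


def runs_sequentially(spans):
    """Return True when no span starts before the previous one has finished."""
    ordered = sorted(spans, key=lambda s: s.start_ms)
    return all(prev.end_ms <= nxt.start_ms for prev, nxt in zip(ordered, ordered[1:]))


# Illustrative timings: three queries issued back to back.
sequential_queries = [
    Span("query", 100, 300),
    Span("query", 400, 300),
    Span("query", 700, 300),
]

# The same three queries issued at once would overlap in the waterfall.
overlapping_queries = [
    Span("query", 100, 300),
    Span("query", 100, 300),
    Span("query", 100, 300),
]
```

In a waterfall view, the first case appears as a staircase of bars and the second as bars stacked under the same start time.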
Perhaps they could improve the fetch-ticket-for-export step, or at least parallelize some of the database calls. Or perhaps this user is using the export endpoint in a way that was not intended, in which case engineering could improve the rate limiter. These are some examples of what you could do with the insight gathered from the trace. You can see how tracing gives you a powerful ability to observe production under real load with real users, especially when you have a microservices-driven architecture.
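One way the parallelization idea might look in practice is sketched below with Python's `asyncio`. The `run_query` coroutine is a stand-in that only sleeps, and the function names and delay are assumptions. Whether real queries can safely run concurrently depends on the database, the driver, and connection limits.

```python
import asyncio
import time


async def run_query(query_id, delay_s=0.05):
    """Hypothetical stand-in for one database round trip."""
    await asyncio.sleep(delay_s)
    return f"rows-for-{query_id}"


async def export_sequential(query_ids):
    # Each query waits for the previous one, as in the trace described above.
    return [await run_query(q) for q in query_ids]


async def export_parallel(query_ids):
    # All queries are started together; gather preserves the input order.
    return await asyncio.gather(*(run_query(q) for q in query_ids))


def timed(coro_fn, query_ids):
    """Run a coroutine function and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro_fn(query_ids))
    return result, time.perf_counter() - start
```

With five simulated queries, the sequential version takes roughly five times the single-query delay, while the concurrent version takes roughly one. A real trace of the concurrent version would show the query spans overlapping rather than stepping down the waterfall.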
At Honeycomb, we believe tracing is one insightful view of what’s happening, and it’s extremely powerful when used in conjunction with other query views, such as histograms and heat maps. Tracing can be used when on call during an incident, but it’s just as powerful for proactively watching how production responds to new code shipping, or even during development. For more information on tracing, please see our blog at blog.honeycomb.io.