Molly Stamos [Customer Support Engineer|Honeycomb]:
I’m going to show you the power of BubbleUp for outlier analysis. Remember that Honeycomb delivers real-time analytics for DevOps and SRE teams to better understand production systems. We encourage you to send in events with as much detail as possible. This means that a single event, which let’s say represents a front-end service call, could have hundreds of fields attached to it, reporting everything from the specific SQL query that was executed, to the User Id of the caller, to which node handled the request. But here’s the problem: when you have hundreds of fields on each event, it can be hard to know which specific fields to look at when there’s a problem. This is where BubbleUp comes in. BubbleUp does the analysis behind the scenes to provide you with the most likely fields that tell you what is causing the outlier behavior.
Let me show you how this works. What I have here is a graph of the total API calls over time, and a heat map showing the distribution of API call durations. I’ve broken it down by status code, so I can see the count and heat map for each status code, and obviously here we’ve got a problem of increasing duration. But interestingly enough, the count isn’t increasing, so what’s happening?
Well first, let’s look at each status code in the graph. I can see here that there’s growing latency with the 200s, the 400s look fine, the 403s look fine and the 500s, well that’s a really interesting cluster of outliers with high latency there. Let’s filter to just the 500s, so we can take a closer look at those. Now the first thing you could do is start breaking down by additional fields. Is this behavior affecting a certain set of customers? Or maybe we should break down by host or platform to see if that’s related. I only have about 12 fields here to choose from, but in a real production environment, people typically have tens to hundreds of fields. Finding the right field to drill on can be a real challenge.
This is where BubbleUp’s power comes in. BubbleUp is going to tell you which fields are most likely related to the increasing latency and help you pinpoint the problem. All I need to do is select the population of outliers. If you haven’t heard the term outliers, Webster’s dictionary defines it as a set of things situated away from or detached from the main body. So you can see here that this selected region of points is definitely different from the overall points in the graph, at least in terms of duration. So with that select action, BubbleUp will take the outliers and determine which fields have the biggest difference between the selection and the rest of the population. Those fields are the most likely to tell us something about the anomaly. What this is telling us is that 100% of the endpoints in our outliers are this ticket exports endpoint, and only 18% of the unselected population is this endpoint. That indicates strongly that there is a problem with this specific endpoint. Let’s break down by that field.
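The comparison described above can be sketched in a few lines of Python. This is only an illustration of the idea (compare how often each field value appears in the selection versus the rest, then rank fields by the largest gap), not Honeycomb's actual implementation. The function names, field names, and sample events are all hypothetical.

```python
from collections import Counter

def value_shares(events, field):
    """Return the fraction of events carrying each value of `field`."""
    counts = Counter(e[field] for e in events if field in e)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()} if total else {}

def rank_fields(selected, baseline, fields):
    """Rank fields by the largest share gap between selection and baseline."""
    ranked = []
    for field in fields:
        sel = value_shares(selected, field)
        base = value_shares(baseline, field)
        best_value, best_gap = None, 0.0
        for value in set(sel) | set(base):
            gap = abs(sel.get(value, 0.0) - base.get(value, 0.0))
            if gap > best_gap:
                best_value, best_gap = value, gap
        ranked.append((field, best_value, best_gap))
    ranked.sort(key=lambda row: row[2], reverse=True)
    return ranked

# Hypothetical events: `selected` is the outlier region, `baseline` is the rest.
selected = [
    {"endpoint": "/tickets/export", "user_id": 20109, "host": "app-1"},
    {"endpoint": "/tickets/export", "user_id": 20109, "host": "app-2"},
    {"endpoint": "/tickets/export", "user_id": 20109, "host": "app-1"},
    {"endpoint": "/tickets/export", "user_id": 20042, "host": "app-2"},
]
baseline = [
    {"endpoint": "/tickets/export", "user_id": 20042, "host": "app-1"},
    {"endpoint": "/tickets", "user_id": 20311, "host": "app-2"},
    {"endpoint": "/tickets", "user_id": 20577, "host": "app-1"},
    {"endpoint": "/login", "user_id": 20042, "host": "app-2"},
    {"endpoint": "/login", "user_id": 20311, "host": "app-1"},
]

ranked = rank_fields(selected, baseline, ["host", "user_id", "endpoint"])
for field, value, gap in ranked:
    print(f"{field}: {value!r} differs by {gap:.0%}")
```

With this made-up data, the endpoint and user fields rank at the top while host barely differs between the two groups, which mirrors the kind of signal the demo relies on.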
We can also look at this User Id field, which shows a very big difference. In fact, it looks like this single user, 20109, is responsible for 75% of the outliers. Recall that these outliers are the set of selected high latency events. Let’s add User Id to the breakdown as well. Let’s remove Status Code so we see all the traffic and run the query. Okay, looking at the table, we can see that this individual user is just hammering the ticket export endpoint and driving up the latency for everyone.
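The breakdown step can be approximated the same way: group events by endpoint and user, then count them and average the duration. This is a rough sketch under assumed field names (`endpoint`, `user_id`, `duration_ms`) and invented sample data, not the query Honeycomb runs.

```python
from collections import defaultdict

def breakdown(events, fields):
    """Group events by the given fields; return (key, count, avg duration) rows."""
    durations = defaultdict(list)
    for event in events:
        key = tuple(event.get(f) for f in fields)
        durations[key].append(event["duration_ms"])
    rows = [(key, len(d), sum(d) / len(d)) for key, d in durations.items()]
    # Busiest groups first, like sorting the results table by COUNT.
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows

# Hypothetical traffic sample.
events = [
    {"endpoint": "/tickets/export", "user_id": 20109, "duration_ms": 900},
    {"endpoint": "/tickets/export", "user_id": 20109, "duration_ms": 1100},
    {"endpoint": "/tickets/export", "user_id": 20109, "duration_ms": 1000},
    {"endpoint": "/tickets/export", "user_id": 20109, "duration_ms": 1000},
    {"endpoint": "/tickets/export", "user_id": 20042, "duration_ms": 400},
    {"endpoint": "/tickets", "user_id": 20311, "duration_ms": 50},
    {"endpoint": "/tickets", "user_id": 20311, "duration_ms": 70},
]

rows = breakdown(events, ["endpoint", "user_id"])
for key, count, avg in rows:
    print(key, count, f"{avg:.0f} ms")
```

In this sample, the top row is the single user calling the export endpoint, which is the pattern the results table shows in the demo.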
So take a step back here and review what’s just happened. One, we were notified of an anomaly. Two, we saw outliers. Three, we ran BubbleUp. And four, we got our answer. BubbleUp saves tremendous time when doing outlier analysis. Simply put, you get to the answer faster and can fix the issue quickly. Unlike other tools that only tell you you have outliers, BubbleUp automatically does the outlier root cause analysis, so you know why the outliers are occurring and can solve the problem quickly. For more information on BubbleUp, check out our docs at docs.honeycomb.io.
If you see any typos in this text or have any questions, reach out to email@example.com.