Finding Outliers With BubbleUp
Learn how using BubbleUp in your debugging workflow helps you analyze billions of rows of data across thousands of high-cardinality fields to quickly spot outliers that may be the source of hidden issues.
Adam Hicks [Senior Solutions Architect|Honeycomb]:
Hello, everybody. Welcome. I’m Adam Hicks. I’m Senior Solutions Architect here at Honeycomb, and I’d like to talk to you about BubbleUp or how we can bring statistical outliers in your event data to the forefront. So what is BubbleUp? What are we talking about here? Really, the core thing we’re after is how we can identify anomalies quickly, areas of interest.
We operate from the principle that visual data gives us cues. We can see spikes in things like error rates and latency. So, we’ve introduced a feature that allows us to select a region of interest from the data and ask a pretty generic question. You know, “What’s different about this data?” The great thing is Honeycomb is going to tell us. It can go and find the differences in that selection. This is exactly what I’m going to show you in the demo in just a few minutes.
The other thing that I want to spend a second on is that it’s fast. It’s very speedy. And that matters a lot. It matters when we’re debugging. It matters because downtime costs you money. Outages cost money. You know, anytime you need to be finding what’s wrong, you want to be able to get to those answers as quickly as humanly possible. So it shouldn’t require a wait for you to distinguish that signal from noise. In many cases, it does. This is a hard technology problem, and we at Honeycomb have all but completely solved it for you.
Before I launch into the demo, I do want to level set on a few different things. So, you know, what we really are doing here, like I said, is allowing you to select a section, a region of data, to get those data points and ask that general question: What is different about it? This operates on what we call a heatmap. You’re probably familiar with heatmaps. Of course, we have our heatmaps as well. So that’s definitely something to level set on. So you go and select these sections inside of your heatmaps. We do have other forms of visual, graphical notation to help you understand what is going on in your event data. The heatmaps specifically are where BubbleUp functions.
We use color-coding. Colors are going to drive those areas of interest. Again, we lean into the visual element quite heavily. So those colors are going to draw you into understanding what is different. The yellow is the regions of interest. You will see that blue is an area of what we call baseline data. That’s sort of your normal functioning data. You’re going to see that quite a bit. So enough with the definitions. Enough with the level set. Let’s move on, then, to the demo itself.
So get my Honeycomb up in front of me, and I’m actually going to return just to a home screen because I want to help you understand how we go through a typical user journey in understanding BubbleUp. This right here is a home for a data set that we call API calls. Pretty generic. A lot of our customers and you probably have an API call service in and of itself. So this should be very familiar. And I do have, on my home page, a heatmap over here around latency. It’s one of those that we talk about. Very common when you want to measure latency. You can see the colors, the rich colors. And, of course, right here, we’ve instrumented a smoking gun. There’s something going on, and it leads me to want to ask those questions.
I can pop into BubbleUp directly from here or I can click directly on the heatmap itself, and it will take me into the query view, expanded query view. This is interactive. I can further refine what I’m seeing, but today we just want to start by bubbling up and asking that very high-level question about what’s different. As an engineer, there are two areas that are actually really drawing my eye immediately. I have this spike that we talked about, and I also have down here this big, dark spot. We have a legend that’s telling us what’s going on. I have fewer events in the lighter colors, more events in the darker colors.
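To make the legend concrete, here is a minimal sketch of how a latency heatmap could be built: bucket each event by time and by duration, then count the events that land in each cell. The function name, the tuple layout, and the bucket sizes are all assumptions for illustration; this is not Honeycomb's implementation.

```python
from collections import Counter

def heatmap_counts(events, time_bucket_s, duration_bucket_ms):
    """Count events per (time bucket, duration bucket) cell.

    `events` is an iterable of (timestamp_s, duration_ms) pairs.
    A higher count means a darker cell in the rendered heatmap.
    """
    cells = Counter()
    for timestamp_s, duration_ms in events:
        cell = (int(timestamp_s // time_bucket_s),
                int(duration_ms // duration_bucket_ms))
        cells[cell] += 1
    return cells

# Synthetic sample: two fast requests in the first minute, one slow one later.
sample = [(0, 10.0), (5, 20.0), (65, 900.0)]
cells = heatmap_counts(sample, time_bucket_s=60, duration_bucket_ms=100)
for cell, count in sorted(cells.items()):
    print(cell, count)
```

A spike like the one in the demo would show up as a few populated cells high on the duration axis, while the big, dark spot is a cell with a large count.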
So I might want to ask questions of both of them. For now, let’s start with that smoking gun, the spike in latency. BubbleUp mode is very easy for me to enable. All I had to do was click on a tab, and it takes me to BubbleUp. And now I can draw a selection around the data that is interesting to me, and it is asking again that very high-level question: Tell me what’s different. What it did below is it bubbled up attributes and the values of the event in those attributes.
A little definition time. What is an attribute? Events that come into Honeycomb are structured. I think one of the easiest ways for a lot of us engineers to think of them is like JSON. For all intents and purposes, that is sort of what it is. So I have JSON. I have a lot of keys. Those keys are attributes, or, in our database, they are columns. We were talking about speed before. One of the key reasons that we can allow you to do BubbleUp with such speed, or queries with such speed, is our proprietary columnar database. Each of these attributes is creating a column dynamically, allowing me to expand it and ask these questions and get these speedy returns. That is why.
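As a rough illustration of the "keys become columns" idea, the sketch below parses a couple of JSON events and pivots them into one list per attribute. The attribute names and values are hypothetical, and this in-memory pivot only illustrates the data shape; it says nothing about how Honeycomb's columnar store is actually built.

```python
import json

# Hypothetical events; the second one carries an attribute the first lacks.
raw_events = [
    '{"app.endpoint": "/tickets/export", "app.user_id": 20109, "duration_ms": 1250.0}',
    '{"app.endpoint": "/tickets", "app.user_id": 7, "duration_ms": 42.0, "status_code": 200}',
]

def to_columns(events):
    """Pivot row-shaped events into one list per attribute (column).

    Any key seen on any event gets its own column; events that lack
    the key contribute None, so all columns have the same length.
    """
    keys = sorted({key for event in events for key in event})
    return {key: [event.get(key) for event in events] for key in keys}

events = [json.loads(line) for line in raw_events]
columns = to_columns(events)
print(columns["app.user_id"])
print(columns["status_code"])
```

Reading one attribute across all events then touches only that attribute's column, which is the property that makes per-attribute comparisons cheap.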
Honeycomb does some statistical analysis for me, and it goes and tells me which attributes are the most interesting, starting with app endpoint, on to name, app user ID. And, again, it’s visual. With those colors that I talked about, yellow is telling me the representation of these attributes’ values in the selection right here, whereas this is what they are in the baseline. It’s blue. So what do I mean?
App endpoint is an attribute that is present inside of this dataset. As I hover over this golden histogram, I see that the /api/v2/tickets/export value is 100% of this attribute inside of the selection whereas it’s only 7% inside of my baseline. Now, this leads me to believe that there’s something going on there. And so what I’m able to do from here is start entering into a core analysis loop. I ask that generic question, and now I’m starting to gather some ideas about what’s going on, and I’m starting to form a theory that perhaps there is something happening with that endpoint. As I move on to “name” and hover over its column, I see it’s really the same thing. It’s the endpoint prepended with the HTTP verb. In this case, it’s a GET action happening on that endpoint.
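The selection-versus-baseline comparison can be sketched in a few lines: for one attribute, compute each value's share of the selected events and of the baseline events, then report the value whose share differs most. This is a simplified stand-in that assumes events are plain dictionaries; the function names are hypothetical, and Honeycomb's actual statistical analysis is not described in this talk.

```python
from collections import Counter

def value_shares(events, attribute):
    """Return each value's fraction of the events that carry `attribute`."""
    counts = Counter(e[attribute] for e in events if attribute in e)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()} if total else {}

def biggest_difference(selection, baseline, attribute):
    """Find the value whose share differs most between the two sets.

    Returns (value, share_in_selection, share_in_baseline), or None
    when neither set has the attribute at all.
    """
    sel = value_shares(selection, attribute)
    base = value_shares(baseline, attribute)
    diffs = {v: sel.get(v, 0.0) - base.get(v, 0.0) for v in set(sel) | set(base)}
    if not diffs:
        return None
    value = max(diffs, key=lambda v: abs(diffs[v]))
    return value, sel.get(value, 0.0), base.get(value, 0.0)

# Synthetic data shaped like the demo: 100% in the selection, 7% in the baseline.
selection = [{"endpoint": "/export"}] * 4
baseline = ([{"endpoint": "/export"}] * 7
            + [{"endpoint": "/list"}] * 50
            + [{"endpoint": "/get"}] * 43)
value, in_selection, in_baseline = biggest_difference(selection, baseline, "endpoint")
print(value, f"{in_selection:.0%}", f"{in_baseline:.0%}")
```

Running the same comparison over every attribute and sorting by the size of the difference gives a left-to-right ranking like the one on screen.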
Move over to app user ID. I really like this one because this allows me to talk about the value of high cardinality data. This is an extremely unique piece of data. There is one user with this user ID, yet, without indexing, I was able to return this query and find this statistical difference in this data at speed. And 20109 is 52% of the events in that selection whereas they’re zero percent inside the baseline.
Now, it seems like there’s something troubling happening with user 20109. I do like to clarify why it’s number three on the left to right popularity. And that is because it’s only 52%. We have some other high latency users, perhaps not as many, but I’m going to put a pin in this, but I do wonder if it’s user 20109, or could there be another user. I don’t see another spike, another histogram spike.
But before I move on, I want to talk about different attributes we might have and different attribute types. So we’re talking now about categorical attributes. We have other types, too, things like ordinal attributes. As I look through here, I have status_code: 200, 400, 403, 500. In this case, there’s not necessarily anything interesting happening there. It’s not like I have an outlandish amount of 400s or 500s.
We also have measures, and measures might be something like this. It’s a timestamp. It could be a duration, just different types of attributes that can lead me on. I talked about this core analysis loop, this idea that I formed a thesis, and I want to further refine this thesis by digging into the data. Honeycomb makes this very easy to do. Just like many screens in Honeycomb, this one is interactive too.
So I might want to take this and click and say, Hey, let’s group by that user ID. Let’s find out if there really is something going on. And when I clicked on this, it went ahead and changed my query. Maybe I want to further refine it and say, Let’s only show where this field is this value and see if there’s something that they particularly are doing. Let’s go ahead and add a COUNT and see how outlandish this really is and run this query. Go back to results. And in pretty fast form, I have gone from BubbleUp into understanding that, yes, indeed, I do have a problem with this user 20109 against that endpoint creating high latency. In fact, there are 411 events for that user in this selection of data that I’m seeing in front of me whereas every other user has only three.
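The refinement step (filter to one endpoint, group by user ID, add a COUNT) looks roughly like this if you express it over plain dictionaries. The helper name and the event fields are assumptions, and the data is synthetic, built only to mirror the 411-versus-three split seen in the demo.

```python
from collections import Counter

def count_by(events, group_key, where=None):
    """COUNT events per `group_key`, keeping only events that match `where`.

    `where` maps attribute names to the exact value they must have.
    """
    where = where or {}
    matching = (e for e in events
                if all(e.get(k) == v for k, v in where.items()))
    return Counter(e[group_key] for e in matching if group_key in e)

# Synthetic events: one heavy user on the export endpoint, two light ones,
# plus unrelated traffic that the filter should drop.
events = (
    [{"user_id": 20109, "endpoint": "/export"}] * 411
    + [{"user_id": 1, "endpoint": "/export"}] * 3
    + [{"user_id": 2, "endpoint": "/export"}] * 3
    + [{"user_id": 20109, "endpoint": "/list"}] * 10
)
counts = count_by(events, "user_id", where={"endpoint": "/export"})
for user_id, n in counts.most_common():
    print(user_id, n)
```

Seeing one group tower over the rest is what turns the BubbleUp hint into a confirmed theory about a single user.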
Even though they’re just barely over half representation of the selection that I was interested in before, the rest of it is a blend of so many users combined that I know, in particular, that it is user 20109. So, again, we have found that I’m able to go pretty quickly from smoking gun to the problem with extreme speed.