Honeycomb Queries: Traces & BubbleUp
Transcript
Alayshia Knighten [Sr. Implementation Engineer]:
Hello everyone. My name is Alayshia Knighten, and today we will be discussing Honeycomb Queries, Tracing, and BubbleUp. Now let’s beeline in. When we talk about using Honeycomb to debug our systems, we talk a lot about asking questions. Questions like, what’s the average latency for that endpoint? Or how many unique user agents have gotten a 404 response today? Queries are our questions. You define a query, run it on a data set, and you get graphs back. Let’s take a look at how to construct them and how they work. Visualize is where you identify what types of graphs you want Honeycomb to create. Then there’s Where, which holds the filtering clauses. Group By is something that you use to create groupings by different values. Order By and Limit apply to the result columns below that. So at this point, let’s run a query. In the Query Builder, if we leave all fields blank and hit run query, we will get the raw data by default. It automatically sends you to the raw data tab. The raw data view is useful when interacting with a data set you don’t know and you’re trying to figure out things like, what kind of stuff is in here? Let’s review a few of the fields. There are trace ID and parent ID fields, which indicate that these are trace spans.
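Before digging into each clause, here is a rough sketch of how those five clauses fit together as a single query, loosely modeled on the shape of a Honeycomb query; the field names below are illustrative assumptions rather than the real API schema:

```python
# A hedged sketch of a query: each key maps to one Query Builder clause.
# Field names are illustrative and may not match Honeycomb's real API exactly.
query = {
    "time_range": 7200,                          # last two hours, in seconds
    "calculations": [                            # Visualize
        {"op": "AVG", "column": "duration_ms"},
    ],
    "filters": [                                 # Where
        {"column": "app.status_code", "op": "exists"},
    ],
    "breakdowns": ["app.status_code"],           # Group By
    "orders": [                                  # Order By
        {"op": "AVG", "column": "duration_ms", "order": "descending"},
    ],
    "limit": 100,                                # Limit
}
```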
Just as an FYI, the root span is a span without a parent span. There’s also high-cardinality data. We have the user ID fields, for example; Honeycomb is quite fast at dealing with high-cardinality data, and it is good at letting you do groupings or filtering by it. Some of these spans are DB queries, so they will have the SQL query shape in there. Did you notice that some of these events are sparse in a sense? Sparse in the sense that not every event has a database query, and not every event has an availability zone or a hostname, so with these sparse events, you may see empty fields. But that’s okay. It means you can do things like filter only on events with a database query or with a hostname, and you may want to do some grouping based on those fields. In the raw view, if you ever need to export fewer than a thousand rows in JSON or CSV, you can do so.
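As a tiny illustration of that sparseness (the events and field names below are made up), an "exists" filter keeps only the events that actually carry the field:

```python
# Hypothetical events from a raw data view: not every event has every field.
events = [
    {"trace.trace_id": "a1", "name": "GET /tickets", "duration_ms": 212.4},
    {"trace.trace_id": "a1", "trace.parent_id": "s1", "name": "db.query",
     "db.query": "SELECT * FROM tickets WHERE id = ?", "duration_ms": 48.7},
]

# "Where db.query exists" keeps only the events that carry that field.
db_events = [e for e in events if "db.query" in e]
print(len(db_events))  # 1
```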
Now let’s zoom in on the Visualize clause. Visualize aggregates a bunch of events together and provides both numerical and graphical plotted outputs. Let’s click to see the options. There’s Count, Sum, Average, Min, Max, and different Percentile queries. So let’s run COUNT. A count calculation just counts the number of requests in the data set. Along the plotline, you will see the value in each time bucket, so depending on the time granularity you can see how the value is calculated for each bucket. In this COUNT query I just ran, we’re getting a count of everything in this specific data set. As a reminder, we can change the timeframe, so here it’s two hours, but I can change it to the last 10 minutes if I like. Notice that as we change the minutes, the results themselves adjust. If I select Average, there are different options here. Let’s just say I select the average of duration: it calculates the average duration in milliseconds across our requests. There are also Min, Max, and Count Distinct, which should be very self-explanatory. I would like to take a moment to talk about the Percentile queries because they’re quite interesting. The Percentile queries start with a P; P95 is commonly used to chop off the top 5% of values if you need to eliminate those distracting outliers. Our favorite, of course, is the heatmap. This shows a statistical distribution of values over time, and the darkest blue indicates the highest density. In a heatmap, you can select outliers. So I’ll run a heatmap query just to show you what that would look like.
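As a quick aside on those percentiles, here is a hedged sketch of what P95 does to a set of durations; the numbers are synthetic and the field is just an example:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic request durations in ms: mostly fast, plus a few slow outliers.
durations_ms = np.concatenate([rng.normal(20, 5, 950), rng.normal(900, 100, 50)])

p95 = np.percentile(durations_ms, 95)  # 95% of requests are faster than this value
print(f"MAX ≈ {durations_ms.max():.0f} ms, P95 ≈ {p95:.0f} ms")
# MAX is dominated by the slowest single request; P95 chops off the top 5% of values.
```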
The cool thing about a heatmap is that we can look at an outlier and zoom in on that time. If we click and drag to select a region of time and then click the magnifying glass, the query reruns and shows us just the time range we highlighted. The other cool thing is that the raw data and the traces themselves change, so you can see the results for exactly what you selected. We briefly talked about the time picker, but I actually want to go into more detail on that. The time picker auto-selects granularity based on the time range, but you can boost granularity up to one second if that’s your thing, or reduce it down to 10 minutes. Another feature is the custom time range. The cool thing about that is we support natural language inputs. So if I put in, for example, five days ago, at the bottom it’ll show you exactly what the timeframe is, and you can apply it. You can also do things like noon yesterday.
This saves us all from doing calendar math. We also have graph settings. There’s an option that doesn’t change the visualizations themselves, but stacks the groupings in the graph. By default, graphs display on a linear scale, but you can also display a log scale if you would like; if I selected that, the graph here would be drawn on a log scale. There’s also the UTC X-axis option, which displays the X-axis in UTC. There’s also an omit missing values option, which shows no line instead of a zero line. And there’s the hide markers option; trust me, you’ll need this when you really get into markers. At Honeycomb, we do so many deployments that over a longer time range, the graph is just covered in markers, so we end up hiding a lot of these. Then there are the options to download CSV or JSON, up to 1000 results or 1000 rows.
5:57
Group By clauses let you take a single calculation, like Count, and break it apart into groups based on the value of a given column. For example, I can group by app status code and run, and the results will show me counts for each status code during a given duration. If you’re comfortable with SQL, then this is like the GROUP BY clause. These types of breakdowns are useful for comparing behavior between distinct groups. So in this case, is my service returning more 400 codes or 200 codes in a given time frame? Or a different problem you may be trying to solve is, which customer is experiencing the slowest requests?
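If the SQL analogy helps, here is a toy sketch, with made-up events, of what COUNT grouped by status code boils down to:

```python
from collections import Counter

# Made-up events: COUNT grouped by app.status_code is just a tally per group.
events = [
    {"app.status_code": 200}, {"app.status_code": 200}, {"app.status_code": 200},
    {"app.status_code": 404}, {"app.status_code": 404}, {"app.status_code": 500},
]

counts = Counter(e["app.status_code"] for e in events)
for status_code, count in counts.most_common():
    print(status_code, count)   # 200 3, then 404 2, then 500 1
```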
Breakdowns should be columns with discrete values, like customer ID or status code, not continuously varying fields like response time and size. Examples of good Group Bys would be database query, service name, and team ID. Now let’s take a look at the Where clause. Where clauses let you refine which events you want to include. By default, we’ll process all events in a data set within a certain timeframe. By adding Where clauses, we’ll only process the events that pass the specified filter. With the Where clause, you need to watch out for whether the filters are combined with AND or OR. Filtering is especially powerful when paired with the knowledge gained from a Group By. These filters let you isolate a particular group in order to continue drilling down into a particular problem.
In our particular query, I have this miscellaneous data that I want to omit. So here, I’m going to add a Where on my app status code. The cool thing is there are a few operators: starts-with, does-not-start-with, contains, does-not-contain, exists, and does-not-exist. And then there’s the in operator, which allows you to match multiple values. In this case, I’m just going to say, does this status code exist, and I’m going to rerun the query. Now the results that didn’t have a status code are omitted. Just as an FYI, not every span will have every field, so you will want to practice adding and removing filters on whether fields exist.
For instance, trace.parent_id does not exist for root spans. You may want “trace.parent_id does-not-exist” for root spans or “trace.parent_id exists” for descendant spans. Essentially, root spans are important. You’ll want to do a lot of queries on root spans only, and therefore you’ll end up setting this up a lot. And by this, I’m literally referring to “trace.parent_id does-not-exist”. Order By is used for ordering the groups that appear in the summary table. We only draw graphs for the first 50 rows in the table, so defining your order can be useful for making sure that you get graphs for the calculations you actually care about. Limits are useful if you’d like to constrain the number of rows that appear in the summary table. In the case of ordering, I’ll change the ordering to app status code ascending, and in regards to the Limit, the default is a hundred. We only have four rows here, so what I’ll do is change that limit to one, and then I’ll change it to two.
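Since that root-span filter comes up so often, here is the same illustrative query shape from earlier, sketching a root-spans-only query with an order and a limit (the field names are still assumptions, not the real schema):

```python
# Illustrative only: count root spans per status code, ordered and limited.
root_span_query = {
    "calculations": [{"op": "COUNT"}],
    "filters": [
        {"column": "trace.parent_id", "op": "does-not-exist"},  # root spans only
    ],
    "breakdowns": ["app.status_code"],
    "orders": [{"column": "app.status_code", "order": "ascending"}],
    "limit": 2,
}
```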
Okay, I don’t have as much data, so I will show you this slide here. So basically, this is the meat-and-potatoes beauty of a heatmap combined with a COUNT and a GROUP BY. Hovering your pointer over the COUNT gives you a handy rate of events per second for each of the groupings, while hovering over the heatmap gives you a time-bucket histogram that lets you compare against the entire time range; it shows you the distribution of values over the time range, and the darkest blues are the highest densities. Hovering over the result list, which in this case is the endpoints, highlights that result in all the visualizations. So what have we been talking about so far? Well, we have been essentially talking about the Core Analysis Loop, which is what it looks like when you answer questions about your production systems as you analyze data.
10:23
First, we start off with an interesting question, like increased latency, and formulate a hypothesis about why that might be happening. For example, there may be a particular customer or API endpoint that could be creating very high latency. So we test our hypothesis by grouping on that field, then we compare the relative latency of the different groups. After we test, we should be looking to see whether that helped us refine or narrow our search. Meaning, does one thing or a subset of things stand out in the results, and can we control for other factors? The Core Analysis Loop allows you to keep zooming in and investigating, to define and refine problems. This really gets you to the heart of Honeycomb, which is asking questions about the system: starting at a high-level graph, then digging in to investigate, all the way down, with as few tools, jumps, and context switches as possible.
As part of core analysis, there’s something that’s easy to miss, which is the three little dots. They really come in handy, as they are a shortcut to drill down into results. So for example, if I click on a database query, I can then modify my query to filter to that value or exclude it. I can also say, hey, just GROUP BY the database query. So if I add this here, it will dynamically update my raw data, which will then update my results. Let’s talk about Tracing. The Tracing view stitches together all of the related spans in a single waterfall graph, as long as they are in the same data set. This can be incredibly useful for understanding performance bottlenecks in your application.
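To show what stitching spans into a waterfall means structurally, here is a small hypothetical sketch: spans that share a trace ID form a tree via their parent IDs, and the waterfall is just that tree drawn with durations. The span names and timings below are invented.

```python
# Hypothetical spans from one trace; parent_id links build the waterfall tree.
spans = [
    {"id": "a", "parent_id": None, "name": "gateway: GET /tickets",     "duration_ms": 1430},
    {"id": "b", "parent_id": "a",  "name": "auth-service: check token", "duration_ms": 210},
    {"id": "c", "parent_id": "a",  "name": "ticket-service: list",      "duration_ms": 1100},
    {"id": "d", "parent_id": "c",  "name": "database: SELECT tickets",  "duration_ms": 48},
]

children = {}
for span in spans:
    children.setdefault(span["parent_id"], []).append(span)

def print_waterfall(parent_id=None, depth=0):
    """Print each span indented under its parent, roughly like the trace view."""
    for span in children.get(parent_id, []):
        print("  " * depth + f"{span['name']}  ({span['duration_ms']} ms)")
        print_waterfall(span["id"], depth + 1)

print_waterfall()
```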
When we say distributed tracing, what we really mean is multiple services that have all participated in the same trace ID. Beelines automatically propagate trace headers to other services. Let’s talk about the trace that I’ve opened up. I’ll minimize it first. We see that there are 33 events. So I’ll go ahead and expand it; the complete trace lasts 1.43 seconds. And here we get to see all the services associated with this trace ID, for example, gateway service, database, Redis, Auth Service, and we can see what’s happening there. If I click on get rate limit user ID, I see a timestamp, I see that endpoint, I see its status, how long it took, its name, its parent, and its error if one exists. And I can also search in here.
I can do the same for tickets. Also, as you click on each span, you can see the heatmap for that corresponding selection change. It’s also important to know that there are a number of different trace header propagation formats. The new one that’s preferred by OpenTelemetry is the W3C header format. There’s also the B3 format, which comes from the Zipkin project, and then Amazon has its own trace headers that are generated by, for example, Application Load Balancers. And now in Beelines, we’ve added support for all of these, inbound and outbound; they’re disabled by default, but you can turn them on. For example, if you have a gateway service that’s right behind a load balancer, you can turn on the Amazon header support, and then it will understand that the root span comes from the ALB, and then you could take those load balancer logs, pull them right in, and have a complete trace view that starts at the ALB.
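For reference, here are examples of what those propagation headers look like on the wire; the IDs are made up, only the shapes matter:

```python
# Example trace propagation headers (illustrative IDs).
trace_headers = {
    # W3C Trace Context, preferred by OpenTelemetry: version-traceid-parentid-flags
    "traceparent": "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01",
    # B3 headers from the Zipkin project (multi-header style)
    "X-B3-TraceId": "0af7651916cd43dd8448eb211c80319c",
    "X-B3-SpanId": "b7ad6b7169203331",
    "X-B3-Sampled": "1",
    # Amazon's format, generated by e.g. an Application Load Balancer
    "X-Amzn-Trace-Id": "Root=1-67891233-abcdef012345678912345678;Parent=463ac35c9f6413ad;Sampled=1",
}
```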
The one thing that we get asked quite often when we’re looking at traces like this one is, is this a good trace or a bad trace? The short answer is that only you as the developer or operator of the system can answer that. When you first implement tracing and start looking at your traces, you’ll have a lot of aha or WTF moments. Those will lead to code changes that modify the trace shape: from fixing that horrible SQL loop you see on the screen, to adding spans for important bits that auto-instrumentation didn’t catch, or even removing noisy bits from auto-instrumentation. Those are all okay. A good trace is one that provides you the highest signal-to-noise ratio. Now that you have all the basics, BubbleUp will make them look like magic. Here’s how it works. First, you go to the BubbleUp tab. If you do not have a heatmap, you will not be able to use BubbleUp.
15:14
Select your outlier. Once your outlier is selected, your dimensions will appear. Your dimensions are the difference between the baseline and the selection, so what normalcy looks like in comparison to what we’ve selected. In this case, I see that there’s an abnormality with the 200 codes, which are different in this area for this particular endpoint, which is tickets. I do see that the duration is much higher than it normally is as well. So the cool thing is, I can literally step into that trace. I’ll click the trace. Glancing at this particular trace, we’re actually looping through this command multiple times. To fix it as a developer, I would do some analysis to determine what strategies I can take to get the results I’m looking for, so I’m not looping through the same queries a million zillion times.
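Conceptually, BubbleUp compares how often each value of a field shows up in your selection versus the baseline, and surfaces the values that are disproportionately common in the selection. Here is a toy sketch of that idea; it is not Honeycomb’s actual algorithm:

```python
from collections import Counter

def bubbleup(baseline_events, selected_events, field):
    """Rank values of `field` by how much more common they are in the selection."""
    def frequencies(events):
        counts = Counter(e.get(field) for e in events)
        total = sum(counts.values()) or 1
        return {value: n / total for value, n in counts.items()}

    base = frequencies(baseline_events)
    sel = frequencies(selected_events)
    deltas = {value: freq - base.get(value, 0.0) for value, freq in sel.items()}
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)
```

Run with a hypothetical field like "request.endpoint", a sketch like this would rank the tickets endpoint near the top if it dominates the slow selection but not the baseline.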
Tracing and BubbleUp give you a much faster way to iterate through the Core Analysis Loop. By clicking one of the tall yellow columns, that is, your actual selections from BubbleUp, you can quickly add filters and GROUP BYs to refine your investigation. Once again, I’m Alayshia Knighten with Honeycomb Queries, Tracing, and BubbleUp. As always, go and beeline in.