In this blog miniseries, I’d like to talk about how to think about doing data analysis “the Honeycomb way.” Welcome to part 1, where I cover what a heatmap is and how using heatmaps can really level up your ability to understand what’s going on with distributed software.
Heatmaps are a vital tool for software owners: if you’re going to look at a lot of data, then you need to be able to summarize it without losing detail. When I’m dealing with data, I find it really helpful to be able to see precisely what data points make up the result of a query, and to zoom down to see individual events. Heatmaps, in their way, best represent what Honeycomb is really about: re-displaying events back at you.
I’m not the first to say that: Chris Toshok wrote a great blog entry on heatmaps a year ago when we first launched them. Heatmaps are no longer the New Hotness. They’re just a critical part of understanding your system with Honeycomb.
Unfortunately, the word hasn’t spread as widely as it could yet. Heatmaps can be difficult to understand. I want to talk about the real power of what makes a heatmap special, and how to use it — and why devops really shouldn’t walk out the door without one.
We all, I think, know what a scatterplot is. Given a number of multi-dimensional events, pick two continuous fields and place them on the X and Y axes. A reasonable choice might be to pick, say, the time of day on the X axis, and the duration of a given event on Y. With a few hundred points, it’s easy enough to read, but noise in the data can hide the underlying pattern.
We can learn more by drawing more and more points. But sometimes a scatterplot gets crowded. Draw a few thousand points, and they start to run into each other. I think this image shows that there’s an upper group and a lower group, but it’s hard to tell how they are different.
There are a few options to address this problem of overplotting. One classic strategy, especially for timelines, is to summarize and aggregate the data. By grouping the x-axis into bins, the visualization can compute an aggregate of the measure for each bin and draw it as a continuous line. In this chart, the average, minimum, and maximum show the general trends of the data, but at great cost: there’s almost no detail visible.
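To make the binning idea concrete, here is a minimal Python sketch of that aggregation. It is not how Honeycomb computes its charts. The function name `aggregate_by_bin`, the 60-second bin width, and the synthetic two-group data are all assumptions made up for illustration.

```python
import random
from collections import defaultdict


def aggregate_by_bin(events, bin_width):
    """Group (timestamp, duration) pairs into fixed-width time bins.

    Returns {bin_start: (minimum, mean, maximum)} with bins in time order.
    """
    bins = defaultdict(list)
    for timestamp, duration in events:
        # Every timestamp in [bin_start, bin_start + bin_width) shares a bin.
        bin_start = (timestamp // bin_width) * bin_width
        bins[bin_start].append(duration)
    return {
        start: (min(durations), sum(durations) / len(durations), max(durations))
        for start, durations in sorted(bins.items())
    }


# Hypothetical data: most events are fast (around 50 ms), some are slow
# (around 400 ms), spread over ten minutes.
random.seed(0)
events = []
for _ in range(2000):
    timestamp = random.uniform(0, 600)
    center = 50 if random.random() < 0.7 else 400
    events.append((timestamp, random.gauss(center, 10)))

summary = aggregate_by_bin(events, bin_width=60)
for start, (low, mean, high) in summary.items():
    print(f"{start:5.0f}s  min={low:6.1f}  mean={mean:6.1f}  max={high:6.1f}")
```

Each bin collapses to three numbers. When a bin contains events from both groups, the mean lands somewhere between them, at a duration where few or no actual events sit, so the two-group structure that the scatterplot hinted at disappears from the line chart.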