Troubleshooting in Honeycomb: Choose Your Own Adventure

When debugging, there’s often not a single right answer or path to follow. More often than not, two teammates will find themselves producing two different sets of Honeycomb queries, following different routes, yet still arriving at the same conclusion. With real, full-stack observability, you don’t have to walk the same path as everyone else to get to the truth.

[Back to the Future gif]

At Honeycomb, we have a common dataset we all use for demoing – and all of us know, ultimately, what anomaly in the data we’re trying to show to the demo-ee – but we’ve also evolved multiple different approaches to telling this particular story.

Note: The videos below are set to play at 2x speed, but you can slow them down using the little cog icon that shows when you hover over them.

Look for the latency

Some days, Ben’s wearing his “500s smell like timeouts, so we should check for latency spikes” hat, so the series of queries might look something like:


  • Start off with a basic COUNT of served HTTP requests
  • Compare traffic against distinct endpoint_shapes, and filter to status_code = 500 requests only
  • Add an endpoint_shape = "/login" filter to exclude non-/login requests
  • Compare the AVG and P99 request duration against the COUNT of requests
  • The total P99 request duration does seem to be doing something unusual here, so drop COUNT and AVG(roundtrip_dur) and look at the P99s of some sub-timers
  • Inspect just the P99(fraud_dur), which seems to be the cause of the overall latency increase
  • Break down by build_id to identify that a single build is responsible for the increased latency (while the others are fine).
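If you prefer to drive this kind of investigation from code rather than the UI, here's a rough sketch of Ben's sequence expressed as query-spec-style Python dicts. The column names (status_code, endpoint_shape, roundtrip_dur, fraud_dur, build_id) come straight from the steps above; the dict shape only loosely follows Honeycomb's Query Specification, so treat the exact field names as illustrative rather than an exact API payload.

```python
# Sketch: Ben's latency-first sequence as Honeycomb-style query specs.
# Column names come from the demo dataset described above; the spec shape
# is approximate and illustrative, not a verified API payload.

queries = [
    # 1. Basic COUNT of served HTTP requests
    {"calculations": [{"op": "COUNT"}]},

    # 2. Break down by endpoint shape, filter to 500s only
    {"calculations": [{"op": "COUNT"}],
     "breakdowns": ["endpoint_shape"],
     "filters": [{"column": "status_code", "op": "=", "value": 500}]},

    # 3. Narrow to the /login endpoint
    {"calculations": [{"op": "COUNT"}],
     "filters": [{"column": "status_code", "op": "=", "value": 500},
                 {"column": "endpoint_shape", "op": "=", "value": "/login"}]},

    # 4. Compare AVG and P99 request duration against request volume
    {"calculations": [{"op": "COUNT"},
                      {"op": "AVG", "column": "roundtrip_dur"},
                      {"op": "P99", "column": "roundtrip_dur"}],
     "filters": [{"column": "endpoint_shape", "op": "=", "value": "/login"}]},

    # 5. Drop COUNT/AVG and look at P99s of sub-timers, e.g. the fraud check
    {"calculations": [{"op": "P99", "column": "fraud_dur"}],
     "filters": [{"column": "endpoint_shape", "op": "=", "value": "/login"}]},

    # 6. Break down by build_id to isolate the one bad build
    {"calculations": [{"op": "P99", "column": "fraud_dur"}],
     "filters": [{"column": "endpoint_shape", "op": "=", "value": "/login"}],
     "breakdowns": ["build_id"]},
]
```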

Count your distinct chickens

Or sometimes Charity has a “breakdowns are cool, but let’s throw in a COUNT_DISTINCT to find further correlations between attributes associated with anomalous behavior” vibe going on:


  • Start off with a basic COUNT of HTTP requests broken down by distinct status_codes
  • Filter down to just requests hitting "/login"
  • Add a build_id breakdown to see request volume by build
  • Calculate the number of distinct hostnames serving requests per build/status_code group
  • Remove the status_code breakdown to see the number of distinct hosts running each build
  • Generate a HEATMAP(roundtrip_dur) to show the distribution of request latencies, and compare said distribution across builds.
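Charity's sequence translates the same way. In this rough, illustrative sketch, the hostname column name is assumed from "distinct hostnames" above, and the dict shape is again approximate rather than an exact payload.

```python
# Sketch: Charity's COUNT_DISTINCT sequence as Honeycomb-style query specs.
# Column names mirror the steps above; "hostname" is an assumed column name.

queries = [
    # 1. COUNT of HTTP requests, broken down by status_code
    {"calculations": [{"op": "COUNT"}], "breakdowns": ["status_code"]},

    # 2. Filter to requests hitting /login
    {"calculations": [{"op": "COUNT"}],
     "breakdowns": ["status_code"],
     "filters": [{"column": "endpoint_shape", "op": "=", "value": "/login"}]},

    # 3. Add build_id to see request volume per build
    {"calculations": [{"op": "COUNT"}],
     "breakdowns": ["status_code", "build_id"],
     "filters": [{"column": "endpoint_shape", "op": "=", "value": "/login"}]},

    # 4. How many distinct hosts serve each build/status_code group?
    {"calculations": [{"op": "COUNT"},
                      {"op": "COUNT_DISTINCT", "column": "hostname"}],
     "breakdowns": ["status_code", "build_id"],
     "filters": [{"column": "endpoint_shape", "op": "=", "value": "/login"}]},

    # 5. Drop status_code: distinct hosts running each build
    {"calculations": [{"op": "COUNT_DISTINCT", "column": "hostname"}],
     "breakdowns": ["build_id"],
     "filters": [{"column": "endpoint_shape", "op": "=", "value": "/login"}]},

    # 6. HEATMAP of request latency, compared across builds
    {"calculations": [{"op": "HEATMAP", "column": "roundtrip_dur"}],
     "breakdowns": ["build_id"],
     "filters": [{"column": "endpoint_shape", "op": "=", "value": "/login"}]},
]
```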

Isolate outliers

Or Toshok might go with a “heatmaps and distributions are rad, so let’s start there and use it to isolate the outliers and see if we can eyeball common attributes and potential causes” approach:

  • Start by visualizing a HEATMAP of the distribution of served HTTP request times
  • Filter to only “slow” request times (we’re picking 600ms because it seems like it’d help isolate the outliers we can immediately see)
  • Flip to raw data mode to eyeball the events comprising those slow requests
  • Add a breakdown on build_id, to compare the latency distribution across builds
  • Add a COUNT to make it obvious that the increase in failed logins is correlated with requests being served by the bad build.
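And here's a rough sketch of Toshok's sequence in the same illustrative format. The 600ms threshold mirrors the step above; "raw data mode" is a UI view rather than a calculation, so it shows up only as a comment.

```python
# Sketch: Toshok's heatmap-first sequence as Honeycomb-style query specs.
# Column names and the 600ms cutoff come from the steps above; the spec
# shape is approximate and illustrative.

queries = [
    # 1. HEATMAP of the distribution of served HTTP request times
    {"calculations": [{"op": "HEATMAP", "column": "roundtrip_dur"}]},

    # 2. Filter to "slow" requests (> 600ms) to isolate the visible outliers
    {"calculations": [{"op": "HEATMAP", "column": "roundtrip_dur"}],
     "filters": [{"column": "roundtrip_dur", "op": ">", "value": 600}]},

    # 3. (UI step) flip to raw data mode to eyeball the slow events themselves

    # 4. Break down by build_id to compare latency distributions across builds
    {"calculations": [{"op": "HEATMAP", "column": "roundtrip_dur"}],
     "filters": [{"column": "roundtrip_dur", "op": ">", "value": 600}],
     "breakdowns": ["build_id"]},

    # 5. Add COUNT to tie the increase in failed logins to the bad build
    {"calculations": [{"op": "HEATMAP", "column": "roundtrip_dur"},
                      {"op": "COUNT"}],
     "filters": [{"column": "roundtrip_dur", "op": ">", "value": 600}],
     "breakdowns": ["build_id"]},
]
```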

You can go your own way (go your own wayyyy o/~)

Honeycomb is built to let you explore strange paths, back up to useful waypoints, and try again. And as a team, we love to learn from each other’s thought processes and workflows. 🙂

To start your own explorations, check out our Quick Start guide or sign up for a free Honeycomb trial!