How Honeycomb Uses Honeycomb, Part 3: End-to-End Failures
By Christine Yen | Last modified on August 3, 2022

At Honeycomb, one of our foremost concerns (in our product as well as our customers') is reliability. To that end, we have end-to-end (e2e) checks that run each minute, write a single data point to Honeycomb, then expect that specific data point to be available for reads within a certain time threshold. If this fails, the check will retry up to 30 times before giving up.
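To make the shape of that check concrete, here's a minimal sketch in Go of a write-then-poll probe under those constraints. The dataset name, event field, polling interval, and the dataPointVisible helper are hypothetical stand-ins rather than our actual checker; only the "write one marked data point, retry the read up to 30 times, then give up" structure mirrors the check described above.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

const maxReadAttempts = 30 // the real check also gives up after 30 retries

// runCheck writes one uniquely-marked data point, then polls the read path
// until that point is visible or we run out of attempts.
func runCheck(apiHost, writeKey string) error {
	marker := fmt.Sprintf("e2e-%d", time.Now().UnixNano())

	// Write a single data point. The dataset name and field are placeholders.
	body, _ := json.Marshal(map[string]string{"e2e_marker": marker})
	req, err := http.NewRequest("POST", apiHost+"/1/events/e2e-checks", bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("X-Honeycomb-Team", writeKey)
	req.Header.Set("Content-Type", "application/json")

	writeStart := time.Now()
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return fmt.Errorf("write failed: %w", err)
	}
	resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("write returned %s", resp.Status)
	}
	writeDur := time.Since(writeStart)

	// Poll the read path until the data point shows up, or give up.
	readStart := time.Now()
	for attempt := 1; attempt <= maxReadAttempts; attempt++ {
		if dataPointVisible(apiHost, marker) {
			fmt.Printf("ok: write=%v read=%v\n", writeDur, time.Since(readStart))
			return nil
		}
		time.Sleep(2 * time.Second)
	}
	return fmt.Errorf("data point %s not readable after %d attempts", marker, maxReadAttempts)
}

// dataPointVisible stands in for whatever read-path query the real check
// issues; here it is a placeholder that always reports "not yet visible."
func dataPointVisible(apiHost, marker string) bool {
	return false
}

func main() {
	if err := runCheck("https://api.honeycomb.io", "YOUR_WRITE_KEY"); err != nil {
		fmt.Println("e2e check failed:", err)
	}
}
```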
Monday morning, we were notified of some intermittent, then persistent, errors in these automated checks.
We quickly verified that only one of the e2e checks was failing by checking the COUNT (as a sanity check, and to show where failures were occurring) and the AVG write/read durations (just to get a head start on identifying the root cause of the e2e failures), broken down by partition and success:
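For a sense of the query shape behind that graph, here's a rough sketch expressed as a Honeycomb query spec, built and printed from Go to keep the examples in one language. The calculations/breakdowns structure follows the Query API, but the duration column names are stand-ins; the real column names aren't shown in this post.

```go
package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// Roughly: COUNT plus AVG write/read durations, broken down by partition
	// and success. The duration column names are hypothetical stand-ins.
	query := map[string]interface{}{
		"calculations": []map[string]string{
			{"op": "COUNT"},                         // where are failures happening?
			{"op": "AVG", "column": "write_dur_ms"}, // client-observed write duration
			{"op": "AVG", "column": "read_dur_ms"},  // client-observed read duration
		},
		"breakdowns": []string{"partition", "success"},
		"time_range": 7200, // e.g. the last two hours
	}
	out, _ := json.MarshalIndent(query, "", "  ")
	fmt.Println(string(out))
}
```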
So we've confirmed that only partition 5 seems to be having trouble, and that, from a client's perspective, the write to our API server looks perfectly normal. It's the read (which relies on the write successfully passing through Honeycomb and being made available for reads) that seems to be the problem here.
While debugging this, we ran into an issue that plagues many an incident-response team: lots of our query results were showing turbulence around the time in question, but which came first? What was the cause?
As we sifted through our data, graphing metrics and breaking them apart by attributes we hoped might be interesting, we eventually placed cum_write_time and latency_api_msec next to each other: two timers that represent two slightly different things (the cumulative time spent just writing to disk for a set of payloads, and the time since API server handoff to Kafka for the last payload). We also isolated the partition we knew was problematic:
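Sketched the same way, that comparison is just a different pair of calculations plus a filter; this query value can be dropped into the previous example in place of the one shown there. The cum_write_time and latency_api_msec columns are the ones named above, but kafka_partition and the exact filter shape are best-effort guesses, not our production schema.

```go
	// The two timers from the post, averaged next to each other and filtered
	// to the suspect partition. "kafka_partition" is a hypothetical column name.
	query := map[string]interface{}{
		"calculations": []map[string]string{
			{"op": "AVG", "column": "cum_write_time"},   // storage: cumulative time writing payloads to disk
			{"op": "AVG", "column": "latency_api_msec"}, // time since API server handoff to Kafka
		},
		"filters": []map[string]interface{}{
			{"column": "kafka_partition", "op": "=", "value": 5},
		},
		"time_range": 7200,
	}
```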
This is enough to tell us that the problem isn't in our API server writing to Kafka (the e2e check's write duration would reflect that) or our storage server writing to disk (the cum_write_time would reflect that), but somewhere in between, and that we should take a hard look at the moving parts related to partition 5 in Kafka.
There are some fundamental problems that are difficult or impossible to automate away (why was something named poorly? is this correlation or causation?), but Honeycomb let us quickly slice and compare metrics against each other, along different dimensions, rule out the factors that didn't actually matter (dataset, server hostname), and home in on the metrics that really showed what was wrong.
This is another installment in a series of dogfooding posts showing how and why we think Honeycomb is the future of systems observability. Stay tuned for more!
Curious? Give us a try!
Honeycomb is a next-generation tool for understanding systems, combining the speed and simplicity of pre-aggregated time series metrics with the flexibility and raw context of log aggregators. Sign up and give us a shot today!