From "Secondary Storage" To Just "Storage": A Tale of Lambdas, LZ4, and Garbage Collection
When we introduced Secondary Storage two years ago, it was a deliberate compromise between economy and performance. Compared to Honeycomb’s primary NVMe storage attached to...
Incident Report: Running Dry on Memory Without Noticing
On November 6, 2019, we intermittently rejected 1-3% of customer telemetry data at ingest for four periods of 20 minutes each. The trigger of the...
Working Toward Service Level Objectives (SLOs), Part 1
In theory, Honeycomb is always up. Our servers run without hiccups, our user interface loads rapidly and is highly responsive, and our query engine is...
Never Alone On Call
Does your organization have an on-call rotation? Several members of the Honeycomb engineering team recently hosted a live webcast about why they never feel alone...
All Together Now: Better Debugging With Multiple Visualizations
"Nines don't matter when users aren't happy" is something you may have heard a time or two from folks here at Honeycomb. We often emphasize...
Understand Your AWS Cost & Usage with Honeycomb
First published in August 2019. AWS bills are notoriously complicated, and the Amazon Cost Explorer doesn’t always make it easy to understand exactly where your...
Treading in Haunted Graveyards
Part 1: CI/CD for Infrastructure as Code. At Honeycomb, we've often discussed the value of making software deployments early and often, and being able to...
Incident Review: You Can't Deploy Binaries That Don't Exist
Between 22:50 and 22:54 UTC on July 9, our capacity to accept traffic to api.honeycomb.io gradually diminished until all incoming requests started to fail. 8...
Automating Collection of Troubleshooting Data with Triggers: a How-To Guide
Everyone wants to be more efficient -- to spend less time on the tedious things, and more time on the things that move the needle....
Stop Your Database From Hating You With This One Weird Trick
Let's not bury the lede here: we use Observability-Driven Development at Honeycomb to identify and prevent DB load issues. Like every online service, we experience...
Anatomy of a Cascading Failure
In "Caches Are Good, Except When They Are Bad," we identified four separate problems that combined together to cause a cascading failure in our API...
When In Doubt, Add More Spans: A Tale of Tracing and Testing In Production
Recently, Toshok was telling a story about the kind of thing he talks about a lot—improving the performance of some endpoint or page or other....
Incident Review: Caches are Good, Except When They Are Bad
Between Wednesday, April 17th and Friday, April 26th, Honeycomb had four separate periods of downtime affecting the Honeycomb API, resulting in approximately 38 minutes of...
A New Bee's First Oncall
I'm Honeycomb's newest engineer, now on my eighth week at Honeycomb. Excitingly, I did my first week of oncall two weeks ago! Almost every engineer...
Tracing and Observability for Background Jobs
Illuminating the under-loved with Honeycomb. Most modern web apps end up sprouting some subset of tasks that happen in the “background”, i.e., when a user...