BLOG

Anatomy of a Cascading Failure

In Caches Are Good, Except When They Are Bad, we identified four separate problems that combined to cause a cascading failure in our API servers. This follow-up post goes over them in detail,…

A New Bee’s First Oncall

I’m Honeycomb’s newest engineer, now in my eighth week here. Excitingly, I did my first week of oncall two weeks ago! Almost every engineer at Honeycomb participates in oncall, and I chose to…

Heatmaps Make Ops Better

In this blog miniseries, I’d like to talk about how to think about doing data analysis “the Honeycomb way.” Welcome to part 1, where I cover what heatmaps are and how using them can…

Postmortem: RDS Clogs & Cache-Refresh Crash Loops

On Thursday, October 4, we experienced a partial API outage from 21:02-21:56 UTC (14:02-14:56 PDT). Despite some remediation work, we saw a similar (though less serious) incident again on Thursday, October 11 from 15:00-16:02 UTC (8:00-9:02 PDT). To implement a more permanent fix, we scheduled an emergency maintenance window which completely interrupted service on Friday, October 12 for approximately two minutes, from 4:38-4:40 UTC (Thursday, October 11, 21:38-21:40 PDT).