Blog

Category: Incident Response

Incident Response  

Negotiating Priorities Around Incident Investigations

There are countless challenges around incident investigations and reports. Aside from sensitive situations revolving around blame and corrections, tricky problems come up when having discussions...

Service Level Objectives, Incident Response

Alerts Are Fundamentally Messy

Good alerting hygiene consists of a few components: chasing down alert conditions, reflecting on incidents, and thinking of what makes a signal good or bad....

Incident Response  

Incident Review: What Comes Up Must First Go Down

On July 25th, 2023, we experienced a total Honeycomb outage. It impacted all user-facing components from 1:40 p.m. UTC to 2:48 p.m. UTC, during which...

Incident Response  

Incident Management Steps and Best Practices

Incident management is the way an organization reacts to any kind of outage (security, broken code, severe weather, or anything that’s disruptive to customer service)....

Incident Response  

There Are No Repeat Incidents

People seem to struggle with the idea that there are no repeat incidents. It is very easy and natural to see two distinct outages, with...

Incident Response  

Should Every Incident Get a Retro?

At a recent training session, Jeli spent a great deal of time covering incident retrospectives and what makes an incident worthy of studying. My colleague...

Incident Response  

How We Manage Incident Response at Honeycomb

When I joined Honeycomb two years ago, we were entering a phase of growth where we could no longer expect to have the time to...

Incident Response  

Counting Forest Fires: Incident Response Metrics

There are limits to what individuals or teams on the ground can do, and while counting fires or their acreage can be useful to know...

Incident Response, Debugging

Solving a Murder Mystery

Bugs can remain dormant in a system for a long time, until they suddenly manifest themselves in weird and unexpected ways. The deeper in the...

Software Engineering, Operations, Incident Response, Debugging

Incident Report: The Missing Trigger Notification Emails

On November 18, between 00:50 and 00:56 UTC, an update was deployed that improved Honeycomb’s business intelligence (BI) telemetry available from our production operations environment....

Operations, Incident Response, Dogfooding, Debugging

Incident Report: Investigating an Incident That's Already Resolved

Summary: On the 23rd of April, we discovered that an incident had occurred approximately one week earlier. On April 16, for approximately 1.5 hours we...

Software Engineering, Incident Response, Dogfooding, Debugging

Incident Report: Running Dry on Memory Without Noticing

On November 6, 2019, we intermittently rejected 1-3% of customer telemetry data at ingest for four periods of 20 minutes each. The trigger of the...