How We Define SRE Work, as a Team
The SRE team is now four engineers and a manager, and we are involved in all sorts of things across the organization, across all sorts...
How We Manage Incident Response at Honeycomb
When I joined Honeycomb two years ago, we were entering a phase of growth where we could no longer expect to have the time to...
Counting Forest Fires: Incident Response Metrics
There are limits to what individuals or teams on the ground can do, and while counting fires or their acreage can be useful to know...
Incident Review: Shepherd Cache Delays
In this incident review, we’ll cover the outage from September 8th, 2022, where our ingest system went down repeatedly and caused interruptions for over eight...
Incident Review: Working as Designed, But Still Failing
A few weeks ago, we had a couple of incidents that ended up impacting query performance and alerting via triggers and SLOs. These incidents were...
On Counting Alerts
A while ago, I wrote about how we track on-call health, and I heard from various people about how “expecting to be woken up” can...
Tracking On-Call Health
If you have an on-call rotation, you want it to be a healthy one. But this is sort of hard to measure because it has...
OnCallogy Sessions
Being on call is challenging. It means signing up to operate complex services in a totally interruptible manner, at all hours of the day or...
On the Brittleness of Dashboards
Dashboards are one of the most basic and popular tools software engineers use to operate their systems. In this post, I’ll make the argument that...
How We Define SRE Work
At the time of writing this post, I have officially been at Honeycomb for one year as a site reliability engineer (SRE). I had shared...
Incident Resolution: Do You Remember, the Twenty Fires of September?
From September to early October, Honeycomb declared five public incidents. Internally, the whole month was part of a broader operational burden, where over 20 different...
Data Availability Isn’t Observability
But it’s better than nothing... Most of the industry is racing to adopt better observability practices, and they’re discovering lots of power in being able...
Lessons Learned From the Migration to Confluent Kafka
Over the last few months, Honeycomb’s platform team migrated to a new iteration of our ingest pipeline for customer events. Our migration to this newer...
On Not Being a Cog in the Machine
This is my first week here as the first dedicated SRE for Honeycomb, and in a welcoming gesture, I was asked if I wanted to...