Posts by Fred Hebert

Fred Hebert

Staff Site Reliability Engineer

Fred is a Staff Site Reliability Engineer (SRE) who has worked as a software engineer for over a decade and ended up with a healthy dislike of computers and clumsy automation. He’s a published technical author who loves distributed systems, systems engineering, and has a strong interest in resilience engineering and human factors.

Incident Response  

Negotiating Priorities Around Incident Investigations

There are countless challenges around incident investigations and reports. Aside from sensitive situations revolving around blame and corrections, tricky problems come up when having discussions...

Service Level Objectives   Incident Response  

Alerts Are Fundamentally Messy

Good alerting hygiene consists of a few components: chasing down alert conditions, reflecting on incidents, and thinking of what makes a signal good or bad....

Service Level Objectives   Product Updates  

From Oops to Ops: SLOs Get Budget Rate Alerts

As someone living the Honeycomb ops life for a while, SLOs have been the bread and butter of our most critical and useful alerting. However,...

Incident Response  

Incident Review: What Comes Up Must First Go Down

On July 25th, 2023, we experienced a total Honeycomb outage. It impacted all user-facing components from 1:40 p.m. UTC to 2:48 p.m. UTC, during which...

Incident Response  

There Are No Repeat Incidents

People seem to struggle with the idea that there are no repeat incidents. It is very easy and natural to see two distinct outages, with...

Software Engineering  

How We Define SRE Work, as a Team

The SRE team is now four engineers and a manager, and we are involved in all sorts of things across the organization, across all sorts...

Incident Response  

How We Manage Incident Response at Honeycomb

When I joined Honeycomb two years ago, we were entering a phase of growth where we could no longer expect to have the time to...

Incident Response  

Counting Forest Fires: Incident Response Metrics

There are limits to what individuals or teams on the ground can do, and while counting fires or their acreage can be useful to know...


Incident Review: Shepherd Cache Delays

In this incident review, we’ll cover the outage from September 8th, 2022, where our ingest system went down repeatedly and caused interruptions for over eight...


Incident Review: Working as Designed, But Still Failing

A few weeks ago, we had a couple of incidents that ended up impacting query performance and alerting via triggers and SLOs. These incidents were...

Service Level Objectives   Culture  

On Counting Alerts

A while ago, I wrote about how we track on-call health, and I heard from various people about how “expecting to be woken up” can...

Best Practices  

Tracking On-Call Health

If you have an on-call rotation, you want it to be a healthy one. But this is sort of hard to measure because it has...

Best Practices  

OnCallogy Sessions

Being on call is challenging. It’s signing up to be operating complex services in a totally interruptible manner, at all hours of the day or...

Software Engineering  

On the Brittleness of Dashboards

Dashboards are one of the most basic and popular tools software engineers use to operate their systems. In this post, I'll make the argument that...

Software Engineering  

How We Define SRE Work

At the time of writing this post, I have officially been at Honeycomb for one year as a site reliability engineer (SRE). I had shared...

1 2