The Incident Retrospective Ground Rules
I joined Honeycomb as a Staff Site Reliability Engineer (SRE) midway through September, and it’s been a wild ride so far. One thing I was...
On Building a Platform Team
It may surprise you to hear, but Honeycomb doesn’t currently have a platform team. We have a platform org, and my title is Director of...
The Future of Ops Is Platform Engineering
In the beginning, there were people who wrote and ran software. At some point, we spun away ops skills from dev skills into two different...
An Engineering Manager's Bill of Rights (and Responsibilities)
So many of the best and most promising managers I know have left management roles for senior IC roles since 2018, and as someone who...
An Engineer’s Bill of Rights and Responsibilities
If you let all the power drift over to the engineering managers, pretty soon it doesn’t look so great to be an engineer. Now you...
Engineers New to Honeycomb, What Did You First Notice About How We Do Things Here?
We’ve wondered, in the past, what new engineers think about how we do things at Honeycomb. This time, we asked! Meet Elliott and Reid, two...
“Why Are My Tests So Slow?” A List of Likely Suspects, Anti-Patterns, and Unresolved Personal Trauma
If you get CI/CD right, a lot of other critical functions, behaviors, and intuitions align to be comfortably successful and correct with minimal effort. If...
Exploring AWS Costs Beyond the Service Level
This post will talk about using a derived column to directly connect individual customer experiences to the cost of providing that service with AWS Lambda....
The power of asking questions
This is a guest post by Vlad Ionescu. Vlad Ionescu jokingly describes himself as a "Professional mistake avoider," which is a better way of saying...
On the Brittleness of Dashboards
Dashboards are one of the most basic and popular tools software engineers use to operate their systems. In this post, I'll make the argument that...
How We Define SRE Work
At the time of writing this post, I have officially been at Honeycomb for one year as a site reliability engineer (SRE). I had shared...
Incident Report: The Missing Trigger Notification Emails
On November 18, between 00:50 and 00:56 UTC, an update was deployed that improved Honeycomb’s business intelligence (BI) telemetry available from our production operations environment....
Incident Resolution: Do You Remember, the Twenty Fires of September?
From September to early October, Honeycomb declared five public incidents. Internally, the whole month was part of a broader operational burden, where over 20 different...
Game Launches Should Be Exciting for Your Players, Not for Your LiveOps Team
This blog was co-authored by Amy Davis. The moment of launching something new at a game studio (titles, experiences, features, subscriptions) is a blockbuster moment...
Lessons Learned From the Migration to Confluent Kafka
Over the last few months, Honeycomb’s platform team migrated to a new iteration of our ingest pipeline for customer events. Our migration to this newer...