Incident Review: You Can’t Deploy Binaries That Don’t Exist

By: Ben Hartshorne | July 12th, 2019

Dogfooding Operations

Between 22:50 and 22:54 UTC on July 9, our capacity to accept traffic to api.honeycomb.io gradually diminished until all incoming requests started to fail. Eight minutes later, at 23:02, the API server was once again running at full capacity. During the initial 4 minutes some requests were accepted and others refused; for the 8 minutes that followed, all traffic bounced and we lost the data sent to us.

What caused this to happen?

As with many outages, it was a combination of factors that came together to trigger the incident.

We’ve been doing a lot of work on the system that instruments our build process so that we can use Honeycomb to understand our CI pipeline. As part of that work, refactoring some code led to a regression where the buildevents tool lost a feature – passing through the exit code of the commands that it runs.
We committed some code that didn’t compile, but because of that regression its test failure wasn’t detected, so the build was not marked as failed. This code passed review and made it into the master branch.
The build artifact from the non-compiling code was missing one executable, and got deployed anyway.
When the service restarted, there was no code to run and one by one servers got pulled from rotation until the outage was complete.
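
To make the first factor concrete, here is a minimal sketch of what “passing through the exit code” means for a wrapper that instruments build steps. It is not the buildevents implementation; the names run_instrumented and main are hypothetical, and the print call stands in for whatever telemetry a real tool would send.

```python
import subprocess
import sys
import time


def run_instrumented(cmd):
    """Run cmd, time it, and return (exit_code, duration_seconds).

    Hypothetical helper: a real tool would emit a span or event where
    this sketch prints a line to stderr.
    """
    start = time.monotonic()
    completed = subprocess.run(cmd)
    duration = time.monotonic() - start
    print(
        f"step={cmd[0]!r} exit_code={completed.returncode} "
        f"duration_s={duration:.3f}",
        file=sys.stderr,
    )
    return completed.returncode, duration


def main(argv):
    """Return the wrapped command's exit code so the caller can exit with it."""
    if not argv:
        print("usage: wrapper CMD [ARGS...]", file=sys.stderr)
        return 2
    exit_code, _ = run_instrumented(argv)
    # The crucial line: hand the child's status back instead of always
    # returning 0. Dropping this is the kind of regression described above.
    return exit_code


# As a script entry point this would be: sys.exit(main(sys.argv[1:]))
```

If main returned 0 unconditionally, a CI system would see every wrapped step as successful, including a compile or test step that failed.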

How have we addressed this?

As we’d spent a large portion of our error budget during this outage, we first instituted a release freeze on the API server. Then, we conducted a retrospective to identify what steps we’d need to take in order to unfreeze. As of today, we have added safeguards in several places to mitigate the chances that any of the steps above could cause a similar outage.
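
As one illustration of a safeguard for the missing-executable step, a deploy could refuse to ship an artifact unless every expected binary is present. This is a sketch under assumptions, not a description of the checks we actually added: the function name missing_executables, the directory layout, and the list of expected names are all hypothetical.

```python
import os


def missing_executables(artifact_dir, expected):
    """Return the names from `expected` that are absent from artifact_dir
    or are not executable regular files.

    Hypothetical pre-deploy check; an empty list means the artifact is
    complete with respect to `expected`.
    """
    missing = []
    for name in expected:
        path = os.path.join(artifact_dir, name)
        if not (os.path.isfile(path) and os.access(path, os.X_OK)):
            missing.append(name)
    return missing
```

A deploy script could call this on the unpacked artifact and abort when the returned list is non-empty, so an incomplete build never reaches a server.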

Much of the speed at which we iterate and deploy code depends on trusting that our build system is protecting us from errors like this. It’s paramount that we retain this trust in order to be able to use automated deploy systems, so the first step was to fix the regression in our build instrumentation tool and work towards rebuilding the belief that if the build is green it’s okay to deploy.

We can strengthen that belief in our automated build system through real-world use: we can trust that the builds are reliable, and verify that this is indeed the case with health checks that trigger automatic rollback and with staggered deploys that let production lag our internal dogfooding cluster by one build. These are common practices and had long been on our list of things to do without being prioritized. Now that work has happened.
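
The two practices named above can be sketched in a few lines. This is an illustrative outline, not our deploy system: production_build, rolling_deploy, and the install and is_healthy callbacks are hypothetical names, and a real system would also re-check health after rolling back.

```python
def production_build(builds, dogfood_build):
    """Pick the build production should run: the one just before the build
    currently on the dogfooding cluster.

    `builds` is ordered oldest to newest. Returns None when there is no
    earlier build to fall back on.
    """
    index = builds.index(dogfood_build)
    return builds[index - 1] if index > 0 else None


def rolling_deploy(hosts, new_build, previous_build, install, is_healthy):
    """Deploy host by host. On the first unhealthy host, reinstall the
    previous build on that host and stop.

    Returns the hosts left running new_build. `install(host, build)` and
    `is_healthy(host)` are supplied by the caller.
    """
    upgraded = []
    for host in hosts:
        install(host, new_build)
        if not is_healthy(host):
            install(host, previous_build)  # automatic rollback
            break  # leave the remaining hosts on the old build
        upgraded.append(host)
    return upgraded
```

With a build that cannot start, the first host fails its health check, is rolled back, and the rest of the fleet keeps serving traffic on the old build rather than dropping out of rotation one by one.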

Is it time to outsource our deploys?

Modern development in the SaaS world is a combination of internal development and reliance on other external tools and SaaS products so we can run less software and focus on what we do best. We’ve been using a home-built deployment system because our needs have been light, and we’re re-evaluating whether now is the right time to outsource that part of our process to something like Kubernetes. There has been a lot of progress in the last few years around processes for managing canaries, circuit breakers, and other confidence-building techniques for deploys, and we’re interested in which might be able to help us continue to move quickly with confidence.

Want your observability tooling made by folks who know what it means to run a modern production service? Try out Honeycomb for free.

Ben Hartshorne

Principal Software Engineer

Ben has spent much of his career setting up monitoring systems for startups and now is thrilled to help the industry see a better way. He is always eager to find the right graph to understand a service and will look for every excuse to include a whiteboard in the discussion.
