Between 22:50 and 22:54 UTC on July 9, our capacity to accept traffic to api.honeycomb.io gradually diminished until all incoming requests started to fail. Eight minutes later, at 23:02, the API server was once again running at full capacity. During the initial 4 minutes some requests were accepted and others refused, but for the following 8 minutes all traffic bounced and we lost the data sent to us.
What caused this to happen?
As with many outages, it was a combination of factors that came together to trigger the incident.
- We’ve been doing a lot of work on the system that instruments our build process so that we can use Honeycomb to understand our CI pipeline. As part of that work, a refactor introduced a regression: the buildevents tool lost a feature, namely passing through the exit code of the commands that it runs.
- We committed some code that didn’t compile, but the resulting test failure wasn’t correctly detected, so the build wasn’t marked as failed. This code passed review and made it into the master branch.
- The build artifact from the non-compiling code was missing one executable, and got deployed anyway.
- When the service restarted, there was no code to run and one by one servers got pulled from rotation until the outage was complete.
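To make the first factor concrete, here is a minimal sketch of a build-instrumentation wrapper that preserves the exit code of the command it runs. It is not the code of our actual tool; the function name `run_instrumented` and the log format are assumptions made for illustration.

```python
import subprocess
import sys
import time


def run_instrumented(cmd):
    """Run cmd, record how long it took, and return its exit code.

    Hypothetical helper: a real tool would send a telemetry event
    instead of printing to stderr.
    """
    start = time.monotonic()
    # No check=True here: we want the raw exit code, not an exception.
    completed = subprocess.run(cmd)
    duration = time.monotonic() - start
    print(
        f"cmd={cmd!r} exit={completed.returncode} duration_s={duration:.3f}",
        file=sys.stderr,
    )
    # Returning the code is the step that was lost in the regression:
    # without it, the caller cannot tell a failed command from a good one.
    return completed.returncode
```

A command-line entry point built on this would end with `sys.exit(run_instrumented(sys.argv[1:]))`, so that the CI system sees a non-zero status whenever the wrapped command fails.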
How have we addressed this?
As we’d spent a large portion of our error budget during this outage, we first instituted a release freeze on the API server. Then, we conducted a retrospective to identify what steps we’d need to take in order to unfreeze. As of today, we have added safeguards in several places to reduce the chances that any of the steps above could cause a similar outage.
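As an illustration of one kind of safeguard against deploying an incomplete artifact, the sketch below checks that every expected executable is present before a deploy proceeds. It is an assumption about what such a check could look like, not a description of the safeguards we added; `missing_executables`, `verify_artifact`, and the idea of an expected-names list are all hypothetical.

```python
import os


def missing_executables(artifact_dir, expected):
    """Return the expected names that are absent or not executable files."""
    missing = []
    for name in expected:
        path = os.path.join(artifact_dir, name)
        if not (os.path.isfile(path) and os.access(path, os.X_OK)):
            missing.append(name)
    return missing


def verify_artifact(artifact_dir, expected):
    """Raise if the artifact is incomplete, so the deploy step can stop."""
    missing = missing_executables(artifact_dir, expected)
    if missing:
        raise RuntimeError(f"artifact is missing executables: {missing}")
```

Running `verify_artifact` as the last build step or the first deploy step would turn a silently incomplete artifact into a loud failure.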
Much of the speed at which we iterate and deploy code depends on trusting that our build system is protecting us from errors like this. It’s paramount that we retain this trust in order to be able to use automated deploy systems, so the first step was to fix the regression in our build instrumentation tool and work towards rebuilding the belief that if the build is green it’s okay to deploy.
We can strengthen that belief in our automated build system through real-world use: we can trust that the builds are reliable and also verify that this is indeed the case, by health-checking tasks with automatic rollback and by staggering deploys so that production lags our internal dogfooding cluster by one build. This is common practice and had long been on our list of things to do without being prioritized. Now it has happened.
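The health-check-with-rollback idea can be sketched as follows. This is a simplified illustration under stated assumptions, not our deploy system: `deploy` and `health_check` are caller-supplied callables, and the retry count and delay are arbitrary defaults.

```python
import time


def deploy_with_rollback(deploy, health_check, new_build, previous_build,
                         attempts=5, delay_s=2.0):
    """Deploy new_build; roll back to previous_build if it never turns healthy.

    deploy(build) and health_check() are hypothetical hooks supplied by
    the caller. Returns the build that is left running.
    """
    deploy(new_build)
    for _ in range(attempts):
        if health_check():
            return new_build
        time.sleep(delay_s)
    # The new build never passed its health check: restore the old one.
    deploy(previous_build)
    return previous_build
```

Staggering fits the same shape: the build handed to production as `new_build` would be one that has already passed this check on the dogfooding cluster.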
Is it time to outsource our deploys?
Modern development in the SaaS world is a combination of internal development and reliance on other external tools and SaaS products so we can run less software and focus on what we do best. We’ve been using a home-built deployment system because our needs have been light, and we’re re-evaluating whether now is the right time to outsource that part of our process to something like Kubernetes. There has been a lot of progress in the last few years around processes for managing canaries, circuit breakers, and other confidence-building techniques for deploys, and we’re interested in which might be able to help us continue to move quickly with confidence.
Want your observability tooling made by folks who know what it means to run a modern production service? Try out Honeycomb for free.