In this talk, you’ll learn about the architecture and operational practices used by an engineering team of fewer than a dozen people to run a real-time event analytics platform that persists billions of events per day, with search over the telemetry performed in near-real time.
Liz Fong-Jones [Principal Developer Advocate|Honeycomb]:
Hi, I’m Liz Fong-Jones, and I’m a Principal Developer Advocate and Site Reliability Engineer at Honeycomb.
Danyel Fisher [Principal Design Researcher]:
And I’m Danyel Fisher. I’m the Design Researcher at Honeycomb.
And today we want to share with you how we observe our code at Honeycomb in order to ensure that we’re able to deploy quickly and simply. So what does Honeycomb do? Honeycomb is a company that provides observability as a service, and the field of observability is evolving very, very quickly. That means that we, as the people who provide the service, need to make sure we can keep up with the demands for reliability and feature velocity. We need to be able both to ship features quickly that enable customers to do better with their own systems, and to provide the reliability and confidence that lets people know they can trust us. There are fewer than a dozen engineers building Honeycomb, and we’re competing against companies that are 10 or 100 times our size.
And our goal overall is to help our customers make their systems humane to run, so that their engineers don’t burn out and can understand what’s happening in production. We ingest telemetry data from their production systems so that they can get the insights they need, ask questions, and really explore that data, so they’re empowered to debug their production environments and figure out: how do we debug this, how do we get back to running normally, and what’s going on inside of my production infrastructure? As part of this, we have a culture of continuous delivery. We ourselves practice observability-driven development, and it’s really empowered us to deploy every single day of the week, rather than sacrificing 20% of our production velocity by not deploying on Fridays.
In fact, not only do we deploy every single day of the week, we deploy up to 14 times every single day, and we don’t wind up needing to touch production on emergencies on weekends, nor do we wind up needing to disrupt our holidays by needing to push production code. Last year, we shipped over 3000 changes to production, and that says a lot about how we’re able to move quickly and nimbly. So how did we get there? We needed to invest in continuous delivery and in feature flag driven development and observability driven development. It took us investing in it from the beginning of our company to get where we are today. That meant that we had to think about things like, what does our roadmap look like? What are the right set of things to build? We invested in tooling, expecting it to pay off not just immediately, but instead over the longer term. We had to think very consciously about what are the things that we’re going to adopt.
We didn’t need to adopt all of the alphabet soup. For instance, we’re not production Kubernetes users, because we’re an infrastructure service that needs to utilize whole machines. There is no benefit to us in adopting containerization and packing different workloads together. In addition to not adopting things that we didn’t need, we needed to be thoughtful about build-versus-buy trade-offs. What are the things that we absolutely need to build in-house, and what can we outsource? For instance, we didn’t wind up building our own feature flag system; we engaged LaunchDarkly to help us with that. But above all, we needed to think about what our cultural processes were: our systems are sociotechnical systems, not just made up of tools. And it’s really putting humans first, and then providing them with the right tools, that enabled us to succeed.
We also had to continuously evaluate and pay down our technical debt. Although we are a startup that needs to move very quickly, we also need to make sure that we’re not just doing the most expedient thing today, but also making sure that we’re paying down that complexity and making our systems better and easier to run for future engineers. And we really need the right metrics to measure and improve where it matters rather than mistargeting our efforts. So knowing what’s critical to fix is half the battle. Our goal, if you remember, is to speed up the development of our product so that we can compete against companies that are 10 or 100 times our size. And we also have to make sure that as we grow, as our revenue and our amount of traffic grows, that we’re able to safely deploy that infrastructure to keep up with that demand as well as shipping brand new services.
We don’t get there by saying, we’re not going to tolerate any risk at all. We get there by embracing risk, by saying that we have to accept that there will be some amount of risk, but that our job as people who practice DevOps, is to mitigate the risks and make them as small in blast radius as possible so that we can still move safely, even if there is a mistake. And we never stop improving. We always focus on making sure that we’re building the best system for the road ahead. Danyel is a product engineer and he’s going to tell us a little bit about what his journey looks like to ship a code change into production.
Absolutely. To build the sort of reliability that Liz was just talking about, we’ve got a fairly well-developed process, a recipe that I think is one that we use, but also one that you can pick up and use for your own purposes. The first is that as an observability company, we instrument our code as we write it. It’s important to us that every major feature be something that we can interrogate, find out how it’s operating, find out how it’s being used, and understand how it’s working. We use that to complement a full suite of both functional testing, as well as visual testing, which means that every component that we build in every piece of UI that we deploy, we have tests that make sure that they’re continuing to render as we expect them to and make sure that the user interface continues to work consistently and persistently.
As we do this, we’ve been designing around feature flag deployment. Using LaunchDarkly’s feature flags, we instrument virtually every piece of our code so that we can turn on or off major suites of features. At startup, we’ll check dozens of flags to decide which sorts of features are being turned on or off for any given run. And we use that to even turn on and off pieces of interface. For example, we had a feature called refinery, controlled by this flag called sampling settings. Only users who are inside the sampling settings flag would see refinery, while everyone else didn’t. That not only allowed us to turn on and off those features but also allowed us to track how they were being used.
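As a rough illustration of what flag-gated code like this looks like, here’s a minimal sketch in Python. The `FlagClient` is an in-memory stand-in rather than the actual LaunchDarkly SDK, and the flag name `sampling-settings` and the `render_settings_sections` helper are hypothetical:

```python
class FlagClient:
    """In-memory stand-in for a feature flag service (illustrative,
    not the LaunchDarkly SDK)."""

    def __init__(self, flags):
        # flag name -> set of user ids enrolled in that flag
        self._flags = flags

    def bool_variation(self, flag_name, user_id, default=False):
        """Return whether the flag is on for this user."""
        enrolled = self._flags.get(flag_name)
        if enrolled is None:
            return default
        return user_id in enrolled


def render_settings_sections(flags, user_id):
    """Show the Refinery sampling UI only to users inside the flag."""
    sections = ["general", "billing"]
    if flags.bool_variation("sampling-settings", user_id):
        sections.append("refinery-sampling")
    return sections
```

So a user enrolled in `sampling-settings` gets the extra section, everyone else sees the default interface, and flipping the flag off removes the feature without a deploy.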
We monitor feature flags inside Honeycomb. So what you’re seeing here for example is a graph of those users who are in that flag of sampling settings, and therefore saw the refinery versus those who weren’t. As you can see, we had something like 10% of our hits touching that flag. We also use this to track things like how the error rate is doing. For example, in this case, we’ve turned off a flag that had been running for some time and transitioned it over sometime around June 13th. When we transitioned it, you can also track both the error rate as well as the adoption.
One really nice feature that’s come out recently is that in addition to Honeycomb being able to track flags within its instrumentation, we’re also using it to track with markers. Due to a recently released integration through LaunchDarkly, Honeycomb flag changes actually become markers. Now the goal of a marker in our system is that it drops as a vertical line that you can see on the graph. In this case, for example, you can see a piece of code that was deployed as the gray marker. And then we’ve actually labeled a second marker to show where PagerDuty set off an alarm. This can help understand both what changes have happened and how users interacted with them.
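For the curious, creating a marker boils down to an authenticated POST of a small JSON payload to Honeycomb’s Markers API. The sketch below only builds that payload; the endpoint and field names (`message`, `type`, `start_time`, `url`) follow the public API documentation as we understand it, so double-check against the current docs before relying on this:

```python
import time

# Sketch of a marker payload for Honeycomb's Markers API
# (POST /1/markers/{dataset}, authenticated with an X-Honeycomb-Team
# header). Field names are our reading of the public docs.
def build_marker(message, marker_type="deploy", url=None, when=None):
    payload = {
        "message": message,   # label shown on the graph
        "type": marker_type,  # groups markers, e.g. "deploy" or "feature-flag"
        "start_time": int(when if when is not None else time.time()),
    }
    if url is not None:
        payload["url"] = url  # e.g. a link back to the flag change
    return payload
```

An integration like the LaunchDarkly one described above would call something like this on every flag change, so each vertical line on the graph links back to the change that caused it.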
Once our code is built, we use automated integration systems to decide whether or not it’s successfully built and to build the master version. Using CircleCI, for example, we’re able to watch our continuous integration process go. Now while this is the view that CircleCI presents to us, with a network of what happened, we can also use Honeycomb’s own tracing view. This for example, allows us to track how our build system is evolving and shows us where timing was spent.
Being able to look specifically at the pieces of the build and understand how the build actually happened gives us tremendous visibility into all parts of our CI system. Once we’re through automated integration, we also go through human PR review. As one of those cultural pieces, we make sure that literally every PR at Honeycomb is checked by another person, even if the check is just a brief comment, like one I put in the other day. That helps us make sure there’s no notion that some PRs are too small to check, or some reviewers are too important to check. We just make sure that everything gets looked at, which lets everybody know how the code is changing and helps catch all our dumb mistakes.
In addition to human PR systems, that robot also worked pretty hard, let’s give them a high five too. Once both the robot and the person have approved your code, we’re able to go ahead with the green button merge. Any engineer on Honeycomb who’s deployed a PR is welcome to press that green button and send the code out into the automatic deploy. It automatically updates, although we then have mechanisms for easy rollbacks and pins, which we’ll talk about in just a little bit. Last, once code has been deployed, we’re able to observe its behavior and go see how it’s being adopted and make sure it’s working. And we can watch it in prod to decide whether we’re happy with the way that it’s going. It’s interesting to note, for our prod system, that’s the way that our customers observe their data. But we need to be able to watch how that system’s working to make sure that our code’s reliable.
So we’ve built a subsystem called Dogfood, which we use to observe production. Here, for example, is the adoption of one of our newer features where we’re able to track how users are using it, watch how many people are playing with it. And we can even watch, based on some of these spikes, when we’ve introduced the feature to different groups and how it’s being adopted and how much it gets used over time. Virtually every feature in Honeycomb can be tracked in this way, through our Dogfood system. To understand how Dogfood is running, of course, we have a subsystem which we’ve named Kibble. So Kibble watches Dogfood and Dogfood watches production. It’s that mechanism, these steps of feeling reliable and safe about our code that allows 12 engineers to deploy up to 12 times a day.
When thinking about this and reflecting on what lessons you might want to bring back to your organization, we’d refer you to the Accelerate State of DevOps Report, which really talks about the feedback loops involved in continuous integration and continuous delivery. And we tick off all four of those boxes. The DORA data says that you really need to focus on lead time, and we’ve emphasized lead time such that we can push a code change end to end in under three hours. Every time I kick off a build, it winds up getting automatically done in less than 10 minutes, which means I can experiment very quickly with commits and then assemble them into a pull request, which I can send to someone for peer review, which will take less than an hour. And then after that, it automatically gets deployed in the next hourly push.
End to end, it’s only three hours to deploy a change, which means that I don’t have to drop context between when I write code and when I observe it running in production. We’ve also emphasized deployment frequency, using the fact that we’re able to do builds in under 10 minutes. That means it’s really worth it to push things out once an hour, and that means we push no more than one or two distinct changes per artifact, so there are not a hundred different things that all get rolled back at once if there’s a problem. So our change fail rate has gone down, because very few things catastrophically fail. We have really good confidence because we had testing and human review. And in the worst case, we can do a flag flip or a fix forward, rather than needing to do a giant emergency rollback while the system is down. Fewer than three of our 3000 changes last year resulted in the need for an emergency rollback or some kind of giant downtime for our users.
And overall, our time to restore has really, really improved thanks to that investment in feature flags and in observability. If we notice something’s wrong, we’re able to see it very quickly and roll back the feature via a flag flip in LaunchDarkly, taking less than 30 seconds. If we need to pin to an older build, that takes less than 10 minutes. And in the worst case, if we need to actually write new code, that just takes 20 minutes, and it’s still a perfectly routine procedure. So our high-productivity product engineering really is a combination of lead time and deploy frequency, as well as reducing the change fail rate and our time to restore service in the event of a problem. But all of this also needs to be built on a bedrock of stable and solid infrastructure. So I, as an infrastructure engineer, don’t care necessarily about my resume, right?
What I care about is: is Honeycomb, as a company, able to ship the product that our customers need, and be as reliable as our customers need it to be? So Kubernetes is not my goal, right? I’m not practicing resume-driven development. Instead, I need to make sure that our systems are responsive and scalable in response both to an increase in the number of features being pushed out and to an increase in the amount of data that my service is ingesting. So we need to prioritize reliability and simplicity above all else. That means we need to make sure there isn’t config drift, that we don’t have things creeping in that are causing complexity no one is aware of. That means we need repeatable and reliable infrastructure as code that’s pushed automatically from our main branch in GitHub. That enables us to have a synchronized and central state that we can diff and release straight from our browsers inside of GitHub.
I’m able to do things like edit the percentage of instances that are running on demand versus spot, right inside of my browser. And then I can preview that change and make sure that it will apply successfully, with the expected delta against the APIs that Amazon provides us to change and control things. And we have automated unit and integration tests that verify that the behavior is correct. Once all of the checks pass, we’re able to remotely run from git and have a history of what ran when, so that our on-call engineer doesn’t have to guess what changes went out to production. Instead, they can look at the deploy history and find out. So we can see here that when I changed the percentage of instances that are running on demand, it’s just a browser button to push it forward, as well as a browser button to roll it right back.
So that easy rollback, and automatic deployment of changes to Kibble and Dogfood before they reach production, allows us to move quickly and keep things in sync while still maintaining control. And the rollback mechanism is simple: it’s just a git revert, followed by re-pushing the state and synchronizing the main branch into our production AWS environment. We also think about the idea of feature flags for our infrastructure: we don’t have to deploy everything all at once to all our infrastructure. Instead, we can control and compartmentalize the blast radius of any change that we might make. For example, we have ephemeral catch-up fleets that stand up when instances need more resources to get caught up with the current state of data coming in. And it’s a simple feature flag to turn on the catch-up fleet, or a simple feature flag to turn on automatic scaling in AWS, in order to ensure that our systems are able to keep up with that demand.
Similarly, we’re able to automatically detect which VM type we should be using, based on whether a particular environment uses ARM64 or AMD64 machines. These are all feature flags set in our variables file, so we don’t need to track, change, and rip out code every single time we want to turn something on or off.
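A minimal sketch of what flag-driven infrastructure selection like this might look like; the flag names, instance types, and `plan_fleet` helper here are all illustrative, not Honeycomb’s actual configuration:

```python
# Hypothetical variables-file flags controlling infrastructure behavior.
INFRA_FLAGS = {
    "enable_catchup_fleet": False,
    "arch": "arm64",  # or "amd64"
}

# Illustrative mapping from CPU architecture to an instance family.
INSTANCE_TYPE_BY_ARCH = {
    "arm64": "c6g.2xlarge",
    "amd64": "c5.2xlarge",
}


def plan_fleet(flags):
    """Derive a fleet plan from the flag file: which instance type to
    use, and whether an ephemeral catch-up fleet should be stood up."""
    instance_type = INSTANCE_TYPE_BY_ARCH[flags["arch"]]
    fleets = ["ingest"]
    if flags["enable_catchup_fleet"]:
        fleets.append("catchup")
    return {"instance_type": instance_type, "fleets": fleets}
```

The point of keeping these as flags in one variables file is that toggling a behavior is a one-line diff in config rather than code that has to be written, deployed, and later ripped out.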
Similarly, we can also quarantine bad traffic. If one user’s traffic is anomalously slow, or is crashing or causing issues for other users, we can segment it to its own set of servers, which we’ve spun up with a simple feature flag. And that feature flag controls which paths are routed, as well as the number of servers allocated, and even which build IDs are assigned. Therefore, we can set up a special debug instance for the traffic that’s causing problems, so that we can really investigate and get to the bottom of it without impacting other users. We also use the continuous integration of our infrastructure code to have confidence that what’s running in production actually matches what’s in our config. So we can feel free to delete things that are not in our config as code, as well as remove any unused bits of config as code, knowing that there are no hidden dependencies. But sometimes this doesn’t go entirely according to plan. So Danyel’s going to tell us really quickly about an outage that we had and what lessons we learned from it.
Back in July of this year, well, here’s a graph of Honeycomb’s performance on July 9th of 2019. As you can see, there was some sort of blip shortly after 15:00, but that can’t have been a very big deal, right? Well, let’s zoom in a little and see how that actually looks. It looks like at about 3:50 PM, things started going badly, and by 3:55, whatever we’re measuring here was down to zero traffic. It stayed down for a good 10 minutes until about 4:05, when we were finally able to bring it back up. This is clearly bad, but how bad was it? Was this just a few-minute blip, or a terrible company-wide disaster? It’s worth evaluating the notion of how broken is too broken.
We quantify that with the idea of service level objectives. Service level objectives are a way of defining what it means for a system to be as successful as you want it to be. They’re a common language that engineers can share with managers. For example, management might set a goal for what they want the reliability of the system to be, and engineers can figure out how to deploy their effort to make sure that they maintain that level of reliability. 100% is an unrealistic number; no system will be able to stay there. And so if you can come to an agreement on how close you need to get, then you can build a much more powerful and successful system. SLO math in the end is actually super simple. We count the number of eligible events we’ve seen: how many things we’re interested in.
For example, we might decide that we’re really interested in how our system serves HTTP requests. So we’ll filter ourselves to only looking at HTTP requests. And then of them, we’ll define successful events. For example, we want events that were served with a code of 200, and in less than 100 milliseconds. Once you’ve got a pool of successful events and a pool of eligible events, then you can simply compute the ratio. We define availability as the ratio of good to eligible events. That’s fairly straightforward math. And the wonderful thing about that, we’re able to use a time window and a target percentage to be able to describe how well we’re doing. So for example, our target was 99% over the last month. Now the wonderful thing about combining those two is that it gives us the idea of an error budget.
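The math described above really is just a ratio and a comparison; here is a minimal sketch in Python (the function names are ours, for illustration):

```python
def availability(good_events, eligible_events):
    """Availability as the ratio of good to eligible events."""
    if eligible_events == 0:
        return 1.0  # no eligible traffic yet, so nothing has failed
    return good_events / eligible_events


def meets_slo(good_events, eligible_events, target):
    """Check a window of events against a target like 0.99 (99%)."""
    return availability(good_events, eligible_events) >= target
```

For example, 990 good events out of 1000 eligible events gives 99.0% availability, which just meets a 99% target; 989 does not.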
We can subtract the number of events that we’re allowed to have gone wrong, that is to say, the number of unsuccessful events divided by the total number that we’ve had, to figure out how much flexibility we have. Sometimes we might be very close to or over our budget, in which case we really should prioritize stability and making sure that systems are reliable. But sometimes we have some error budget left over, and that actually allows us to experiment, or to have a higher velocity of change. Because when you’ve got an error budget, you can actually describe how acceptable it is for your system to not quite always succeed. We use the notion of SLOs to drive alerting on our system. When we see that an SLO is just about to burn down, we’ll page engineers so they can act before we run out of budget.
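The error budget bookkeeping is equally mechanical; a hedged Python illustration (again, the function names are ours, not Honeycomb’s):

```python
def error_budget(eligible_events, target):
    """How many events are allowed to fail in the window under the
    SLO target, e.g. 1% of them for a 99% target."""
    return eligible_events * (1.0 - target)


def budget_remaining(good_events, eligible_events, target):
    """Fraction of the error budget still unspent. Negative means
    over budget, which is the signal to prioritize stability over
    shipping new features."""
    allowed = error_budget(eligible_events, target)
    if allowed == 0:
        return 0.0
    failed = eligible_events - good_events
    return (allowed - failed) / allowed
```

With a 99% target over a window of 1,000,000 eligible events, the budget is 10,000 failures; 3,000 failures so far leaves 70% of the budget, which is room to experiment.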
Now at Honeycomb, we’d done an exercise where we actually estimated out what we wanted our SLOs to be. And we realized that we have three major sets of features. We want to store all user incoming telemetry. We in fact have a 99.99% ratio on that because it’s really important to us that user data not be lost. In contrast, we want our UI to be responsive, but we’re going to be a little bit more forgiving about that. Our default dashboards should load in one second. We’re even more forgiving of our query engine because sometimes users do execute particularly complex or difficult questions. So we’ll place that at 99%. So now to evaluate how that 12-minute outage looks, we really need to understand what sort of data we were seeing. Unfortunately, this is a graph of user data throughput, which means that this 12-minute gap was not only a gap for us, but it actually shows on every one of our user’s dashboards. Because for those 12 minutes, we didn’t accept their data.
We dropped customer data. We were able to catch that this had happened and we rolled back. Liz said earlier, that takes about 10 minutes for a rollback, and that’s precisely what it took here. During that time, we communicated with our customers, both first to notify them that there was an outage and then that it had been repaired. Over that time, we burned triple our error budget. What do you do in this sort of situation? Well first, we halted deploys. We stopped making any more changes until we felt that we were reliable. And then we stepped back to look at how it had happened. It turns out when we trace this down, an engineer had checked in code that didn’t build. Having successfully found the root cause, we fired them on the spot and washed our hands.
Okay, fine. It turns out that checking in broken code shouldn’t, of course, be a big deal, because of the CI system I’ve talked to you about. Unfortunately, at the time, we were playing with experimental CI build wiring, which happened to be willing to show a green button even for code that crashed. Of course, that shouldn’t have been a big deal either, because it was generating zero-byte binaries, which of course should have been stopped. It turns out that our scripts weren’t watching for that condition and were very happy to deploy empty binaries. And at the time, we didn’t have a health check or an automatic rollback, so when this happened, our system just very happily went down. That put us on a mission to reprioritize stability. Over the next few weeks, we mitigated every one of those key risks: making sure that our CI system was catching these situations, making sure that it never deployed zero-byte files, and making sure that end-to-end checks would succeed before the deployment would continue. Feeling secure and reliable, we were able to resume building.
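The missing zero-byte check is a tiny guard once you know to write it; a minimal sketch (the path handling and size threshold are illustrative, not Honeycomb’s actual deploy script):

```python
import os


def safe_to_deploy(binary_path, min_size_bytes=1):
    """Refuse to deploy a missing or empty build artifact: the exact
    failure mode described above, where CI shipped zero-byte binaries.
    A real deploy script would likely also run the binary's own
    health check before proceeding."""
    try:
        return os.path.getsize(binary_path) >= min_size_bytes
    except OSError:
        # Missing or unreadable artifact is just as bad as an empty one.
        return False
```

A deploy pipeline would call this right before promoting the artifact, and abort (rather than roll forward) on failure.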
So what’s ahead for us? Because that clearly isn’t the end of our mission. What’s ahead is continuing to be reliable and scalable, and to lead the observability industry, by being able to give customers high confidence in us and give them the features that they need. That means that we need to launch services easily. For instance, for that refinery service that Danyel talked about, we needed not just to scale up existing microservices, but to provision new microservices while maintaining confidence in our systems. We also needed to spend less money in order to pass savings on to our customers with a new pricing model. That meant adopting spot instances in order to scale up without increasing cost dramatically, as well as introducing ARM64 instances, which offer a lower cost and therefore enable us to offer a good service to our users at reduced prices.
We’re also going to continue modernizing and refactoring, because continuous integration and delivery are things that all of us are learning, and new best practices emerge all the time. But above all else, what we prioritize at Honeycomb is our employees. We want our employees to be able to sleep easily at night, and that means doing retrospectives every time we wake someone up, to make sure that it’s not going to happen again in the exact same way.
This isn’t just something that startups can do. You can do this too, step by step, if you start measuring the right things and improving them where they matter. We’d encourage you to read more on our blog at Honeycomb.io/blog, where we talk about many of these things, including some of the lessons that we learned, and give you peeks behind the scenes at how Honeycomb runs and what our engineering practices are. So do what we do: understand and control your production environment so you can go faster while maintaining stable infrastructure. Don’t eschew risk; instead, manage it and iterate. And always make your systems better: learn from the past and make your future better. If you’re interested in learning more, you can go to Honeycomb.io/Liz and get a copy of these slides. And as always, thank you for your attention.
Thank you for joining us.