Learn how Honeycomb improved the reliability of our Zookeeper, Kafka, and stateful storage systems through terminating nodes on purpose.
Liz Fong-Jones [Principal Developer Advocate|Honeycomb]:
Hello and welcome to SREcon Americas. I’m Liz Fong-Jones and I’m the Principal Developer Advocate at Honeycomb, where I advocate for the site reliability engineering and observability engineering communities. So what I want to share with you today is our journey at Honeycomb towards having services that are reliable by implementing chaos engineering principles. So let’s talk a little bit about the context in which I operate and my team operates.
We operate in a world of processing big data. We ingest telemetry data that is very, very voluminous and creates a lot of operational complexity. We are not a stateless service, and that causes us to have constraints that other organizations do not. We also have the constraint of needing to evolve observability very, very quickly. Our feature set is always growing because it’s what the field demands. There are always some new best practices that we want to make sure that our customers have access to.
And we’re not a large team either. We’re a team of just about two dozen engineers at this point. But during the period in which most of the work that I’m describing today was performed, we had less than a dozen engineers working full-time on product development and infrastructure engineering. Overall, our goal is to make systems of our customers humane to run by ingesting telemetry data emitted by those applications, and then enabling engineers developing the applications to explore their data, and empowering them to understand what’s happening so that they can better control and understand production.
So part of what makes us successful as an engineering organization is the ability to deploy with confidence, knowing that we can safely perform changes to our production environment without being afraid of the kind of haunted graveyards in production. We deploy over a dozen times per day, including deploying every Friday. We also try to avoid needing to deploy in emergencies, such as on weekends. And you can see here that we’ve got a pretty good track record. And then in the six months between December and June of 2019 into 2020, we hit about 700 deploys in that six-month window.
We need to invest both in continuous delivery, as well as in continuity of operations and testing our contingency plans in order to achieve our goals and hit those product philosophy milestones. My work as an infrastructure engineer at Honeycomb is really based on the idea of creating a stable enough platform, such that we can empower the product team to innovate as well as being able to replace pieces of the platform as we need.
So, part of what makes us unique is that we are a storage service, fundamentally. We are a data storage and analytics service. And that can be really scary for people who have worked only in an environment of stateless services where your persistence is taken care of by someone else. This is what it took for us to become fully confident in our persistent stack. First of all, we needed to think about developing service-level objectives in order to solve the problem of balancing velocity and reliability. So we decided to quantify reliability by actually explicitly setting those SLOs using SRE methodology.
Second, we needed to identify potential areas of risk that might impair our ability to deliver reliable and durable service. Third, we designed experiments to progress where we found it in order to validate that our mitigation strategies were working correctly. And then fourth, when we did uncover problems, we prioritized addressing them if there were potentially SLO impacts. Let’s start with that first bit about measuring our reliability using service level objectives. Let’s recap because I know that for many of you, this is your first SREcon. And therefore I want to make sure that we’re all level set on what a service level objective is, why it matters, and how to construct one.
A service level objective is a mechanism of measuring how broken our services are and validating that our services are not too broken. We don’t necessarily aim for ensuring that there are zero errors because that would create the wrong trade-off too far towards reliability and not enough towards innovation. The SLO helps us understand and measure whether our customers are having success with the goals that they’re trying to achieve. SLOs are common language because they enable us not just as infrastructure engineers, but as a wider engineering organization to understand what is it that the users are trying to do, and are they succeeding, and measuring that so the product managers have access to the data, customer success has access to the data, and so that everyone can really get on the same page about what the reliability requirement of our product is.
So in order to measure a good service level objective, we need a service level indicator. And the service level indicator is composed of measuring individual events corresponding to critical user journeys in the production context. This also means that we need to capture critical data about those requests; things like which endpoints were hit, maybe some kind of tracing or request ID, the user agent, and other things that might help us debug it later. We also need rules for classifying events as good or bad so that we can understand how many good and bad events that we have without a human having to look at the system. For instance, it’s relatively simple to set a service level indicator for a user interactive service that someone is accessing in a browser. For instance, we might expect to see HTTP Code 200 and latency less than a hundred milliseconds is a good request. And then we could do something like setting a target service level objective based on measuring the results of many individual service level indicator evaluations over time.
And that requires us to have a window and a target percentage because it would be nonsense for me to claim Honeycomb was a hundred percent down yesterday, but it’s a hundred percent up today, so everything’s fine, right? So we need something like measuring 99.9% of requests must succeed over a 30 or 90-day window. Why not a hundred percent? As I said earlier, a good SLO barely keeps users happy and provides them with the reliability that they expect, no more, no less, so that we can also innovate and provide features as quickly as we can within that reliability constraint. It doesn’t make sense to make the service five femtoseconds down less per year if users aren’t actually going to notice that. As Charity Majors says, nines don’t matter unless users are happy.
So these are our SLOs at Honeycomb. First of all, we need to store incoming telemetry data admitted by customer applications that are sent to us. And because we only get one shot at it, because we don’t want customer applications to have to buffer that data and retransmit, we only get one shot and we have to store it correctly 99.99% of the time. So only one out of every 10,000 events that are sent to Honeycomb are allowed to drop. We also have to make sure that that data is freshly indexed and therefore that we don’t have any unnecessary stalls in our indexing pipeline. Unfortunately, it’s a little bit more tricky to describe exactly how this SLO is constructed.
The next criteria that we have is that once we have relatively fresh data, we need to be able to evaluate alerting on that data every minute. That way, if your service is having a problem, we can tell you about it reasonably quickly, rather than you having to wait minutes or hours to find out that something’s wrong. So we expect that 99.9% of the time that the alert evaluator successfully runs across fresh enough data. And then our least stringent SLO is that if you’re issuing a query of arbitrary complexity against your data stored in Honeycomb, that that data will come back in less than 10 seconds and will have scanned all of the data that’s available. So that’s a 99% SLO.
So our error budget then is one minus the reliability target. It’s the amount of allowed unavailability that we can have, and enables us to decide, should we invest in more reliability, or are we okay to perform experiments and innovate to go faster? And in particular, I want to talk about this idea for a moment, because if you are already having a lot of outages, chaos engineering is not right for you. You have your sources of chaos that are in the system already. You should be learning from those sources of chaos. But in our case, we were over-delivering on our reliability targets. Therefore we thought that it was important to invest in performing experiments to structurally test that our redundancy mechanisms were working, rather than having to test them in an unexpected situation.
So let’s talk for a second about why data persistence is tricky and is different from a stateless service. We need to think about how we actually validate. Not just can I survive deploying the service, but can I also think about how we validate that if I restart too many workers, am I able to catch up? Am I able to successfully store that incoming telemetry in a safe manner? Do I have my indexes continue to remain fresh? Do I serve stale data? These are important things that we need to test and validate using that leftover error budget if we don’t have things that are straining our system right now. So that brings us next to the architecture discussion. I want to talk for a moment about how Honeycomb is architected so that you can understand what bits of our experimentation might be directly applicable to you, for instance, if you use Kafka, and which bits you might need to think a little bit harder about how to adapt to your situation.
This is the Honeycomb architecture, and I know that it’s a lot of boxes. So I’m just going to broadly say that we have a number of services that are stateless, that are just passing requests onto a stateful backend. And then we’ve got a number of services that are stateful, that they store state regarding either customer data or metadata, and therefore that they have different reliability requirements and different reliability techniques that apply. Let’s zoom in on just the stateful services for a second. What we have here is that when a request comes in from the stateless ingest worker, it gets stored into Kafka and replicated to two other Kafka workers. And then Kafka relies on Zookeeper for the leader election for any given partition. And then that data winds up being read out of Kafka, ensuring a homegrown indexing pipeline. And then our indexing pipeline shoves that data into AWSS3 as well as consulting with MySQL to discover information about metadata.
So that’s basically the core of our ingest workload that is stateful. And we need to make absolutely sure that this process does not break because if it breaks, then we wind up dropping the customer telemetry that we’ve already committed to ingesting. Let’s zoom in a little bit on what the journey is of a single event that comes into Honeycomb. A single event that comes into Honeycomb comes in as part of the batch of events associated with a trace or just a collection of data coming from one particular customer. We assign each event coming in at random to a different Kafka partition in order to make sure that we have some degree of load balancing and spreading across all of our Kafka brokers. So we use a Kafka producer plus to send to a variety of different partitions. And then the Kafka brokers will transmit from the leader to any followers in order to make sure that that event is truly durable. At that point, Kafka returns success and our stateless ingest worker can return success to the client.
In terms of consuming from the Kafka queue, we then need to read out of the Kafka queue individual events, which have a number of different fields in them. And we batch together all of the fields from more than one event in order to form a field index. And then we write those field indexes and serialize them periodically to AWSS3. So that’s what our architecture looks like. The shepherds, which ingest the data, produce into Kafka. Kafka handles durability and replication. And then we have pairs of indexing workers that consume from Kafka, do the indexing, and then persist over to S3. So in theory, this should work pretty well, but there’s a couple of challenges associated with this, which is that we don’t actually change Kafka that much. In fact, AWSEC2 makes it so that in general, our Kafka nodes don’t frequently restart. Our Zookeepers almost never restart. And we do keep our retrievers running. Although we do sometimes deploy new software, we don’t necessarily do a clean boot of a new storage node, an indexing node that often. So that means that we had a lot of hidden risk in our system because we weren’t periodically testing that these redundancy mechanisms were working correctly.
We also have long-running processes, which introduce their own trouble. What happens if there’s a memory leak? What happens if there is a system update or patch that we need to apply? Because we can verify that our stateless services restart all the time, but that’s not necessarily true of our stateful services. And above all else, we need to maintain data integrity and consistency. We need to make sure that we don’t have inconsistent copies of that data floating around. What happens if two indexing nodes disagree about what the content of a segment should be? What happens if Kafka commits an individual set of events, and then it actually eats it silently? What do we do then? Those are important considerations for us to consider as a stateful service. There’s a lot of delicate failover dances that we used originally when the authors of Honeycomb were designing the storage architecture. For instance, there is a delicate failover dance involving rsync, to rsync between one live node to a bootstrapping node.
Let’s talk about a couple of those failover dances. So one failover dance is that if you lose a Zookeeper node, then, in theory, the other two Zookeeper nodes still have a quorum, and you’re able to continue doing Kafka transactions. Or what happens if you lose a Kafka node? Well, if you lose a Kafka broker, in theory, a different Kafka node should become your leader, and then we’ll replicate it to a Kafka node that stands back up. And what happens if we lose an indexing worker? Well, in theory, you should be able to have the parallel indexing worker for that same partition continue processing, and then be able to replay and catch up that processing node that stood up to replace the broken one, either by bootstrapping off of the existing one that’s working correctly or by bootstrapping off of a Snapchat on S3 and replaying from Kafka until you’re caught up.
So that was what we thought would work in theory, but we wanted to validate and verify it because we actually hadn’t tested a lot of these things in production, let alone doing so continuously. That brings us to doing chaos engineering. Now, I want to stop here for a second and define what chaos engineering is. Chaos engineering is the practice of performing controlled experiments in a production or otherwise meaningful environment in order to validate a hypothesis about what you think is going to happen. Chaos engineering does not mean introducing chaos into your systems willy nilly. It really is actually a science that you’re performing or a form of engineering in order to validate your assumptions.
Our goal here was that if we thought that Honeycomb would be able to survive losing a Zookeeper node or a Kafka node, or a retriever indexing node, we needed to be able to be sure that that was actually going to be the case. So we initially started by doing manual experimentation, by restarting one server or one service at a time. And we also wanted to do it in a controlled fashion at 3:00 PM on a workday, not at 3:00 AM when people are super, super groggy. As it is said in the open-source world, bugs are shallow with more eyes, so why would you expose your system knowingly to potential bugs in a situation where people are not available, where there are not extra eyes available on deck in event of problems?
At Honeycomb, we really try to minimize the number of midnight wake-ups our engineers incur, so therefore it makes sense to do controlled experiments during the day, rather than risk waking someone up in the middle of the night. When we perform a change to our production systems, we also measure and monitor for changes to availability and durability using our service level indicators. That’s our first sign that something has potentially gone wrong and that we need to hit the emergency revert button should a chaos engineering change impact more users than we expected. And then after that, we also make sure that we can not just measure the impact, but also that we can debug and understand and replay what happened inside of the system, and ask questions about how our system behaved during the chaos engineering experiment. And that’s why observability is important, because observability enables us to explore and understand unknown unknowns, to ask new questions about our services using the telemetry that we admit.
Honeycomb is a provider of observability, but we also have internal-facing observability by deploying a second copy of Honeycomb that observes the first copy of Honeycomb, so that we can make sure that we understand what’s happening when we experiment on our real-life production systems. So that’s what we mean when we say test the telemetry too. Now, you may not have your own observability platform and meta observability platform. That’s fine. But I think the important thing that you should be thinking about as a practitioner is when you’re designing a chaos engineering experiment, are you actually testing what hypotheses should be true about your telemetry data as well? Not just did it break or did it not break, but could I understand the service was working correctly? Could I get a trace through to understand that the failover worked correctly? If you’ve known the mechanism of validating that your experiment is successful and that your failover is working the way that it intended, then you don’t really have an experiment. You’re just introducing chaos without having something that you’re measuring and controlling. And once we had restarted individual services, we validated those fixes by repeating those restarts until they became buttery smooth.
Let’s talk about what this looked like for a few elements of our pipeline. First of all, let’s talk about the case of losing a Kafka node. So in the case that we lose a Kafka node, in theory, we should have the Kafka node be automatically terminated. And a new Kafka node should be provisioned and catch up from the current, newly elected leader. Additionally, if for some reason, Kafka gets slow, we expect that a shepherd worker that is ingesting events statelessly, should be able to pick at random, a different eligible partition in Kafka to write to, rather than waiting indefinitely and filling up its own internal buffer, waiting to write to Kafka. And similarly, if an indexing worker fails to catch up, it should not crash. It should just wait until Kafka is available again. What we found was that in the event that a Kafka worker crashed and then we terminated it, if our underlying dependencies had changed in our Chef bootstrap process, that our Kafka workers were not actually starting back up correctly. So this was a problem because if it had happened at 3:00 AM, we would have had impaired redundancy.
We needed to make sure that we understood what factors were causing us to have unreliable node clean restarts. In this case, it was a breaking change that someone had happened to make to the AWS Ruby libraries, but it could have been anything really. And we just happened to be lucky by performing that chaos engineering experiment about a week after that particular change landed. This was a situation where we decided that it was important to validate that the clean restart workload was able to succeed and do so continuously. Another thing that we observed during the course of doing experiments with restarting Kafka nodes was that Kafka failovers are not instant. That when you do have a Kafka node that fails in an obscure way, it can sometimes cause rights to the current Kafka leader to become stuck to that current Kafka leader and induce excessive latency, which then causes event queues on the producer to back up. So we had to validate and reconfigure our shepherd behavior in order to make sure that that was working as we planned.
There’s another element here, which is that we also do leader election, not just for Kafka brokers, we also do leader election for things like deciding which of two redundant nodes is going to execute our learning workload on any given minute. So in theory, this leader election has been done through acquiring a Zookeeper lock. And one of the two alerting workers will run successfully. And fortunately, when I tried restarting Zookeeper nodes, I discovered that you could safely restart Zookeeper node two or Zookeeper node three. But if you restarted Zookeeper node one, alerting would fail to run until the new Zookeeper node was back running, which was about 10 months. Oops.
It turns out that our learning workers were acquiring a Zookeeper lock, but they were only trying to acquire the Zookeeper lock from Zookeeper zero. If for some reason Zookeeper zero were down, you wouldn’t actually be able to acquire the lock, and neither worker would be able to run alerts. Obviously, this is a threat to our SLO, so we decided to address it. So we made sure that the alerting workers contacted more than one node in the Zookeeper. That way they could always get to whichever Zookeeper nodes were in the quorum and able to elect which worker was going to run alerts. I want to talk for a minute about how we decided to act in response to some of this stuff. One thing that we needed to think about in terms of de-risking was doing this more often, for sure. To make sure that we’re continuously uncovering issues where our dependencies had broken upon us. But there’s another thing that we decided was a little bit too fragile and too risky.
Let’s talk about the situation of losing an indexing worker, the thing that consumes out of Kafka. Let’s assume that for the most part, Kafka is now reliable. We have pairs of indexing workers that run that are reading out of Kafka. But what happens if you lose one indexing worker? Well, under the previous design, in theory, the indexing worker would perform a replay from the other peer indexing worker that was running successfully. And then it would catch up based on reading off of Kafka and then continue working. But that has a problem to it, which is what happens if you lose both indexing workers responsible for the same partition? Then no worker is caught up-to-date and we have a severe problem.
So what we needed to do was twofold. First of all, we needed to make extra, extra sure that no two indexing workers were running in the same availability zone for one partition. We hadn’t necessarily fully guaranteed this in our previous auto-scaling group and locking designed to choose and claim a partition to subscribe to. But we decided that this was really important as our needs scaled and as our customers started to depend upon Honeycomb more and more.
The other thing that we decided to do was to implement, instead of requiring rsync from a peer indexing worker that was caught up, we needed to say that we instead wanted to replay from S3 to snapshot indexing workers once an hour, no matter what, and write new snapshots to S3, and always replay off of S3 rather than relying on there being a live indexing worker peer.
We also implemented something to do continuous rolling restarts so that we could perform restarts in a safe manner of indexing workers without taking out two indexing workers for the same partition. So these were three redundancy mechanisms that my team implemented in order to make sure that our systems were going to be stable. And then as I said, we continuously began verifying to stop regressions once one-off tests were succeeding repeatedly. We wanted to make absolutely sure that we were able to deploy our systems and have them survive incidents, and do not need human intervention to do that, but instead, for instance, every Monday at 3:00 PM, to restart a Kafka node or restart a Zookeeper node or restart an indexing note, just to make sure that the systems are working so that we don’t have regressions that pop up at 3:00 AM.
But this isn’t just about reliability to me. I think it’s also a mechanism of being able to introduce more capacity to adapt within your system and within your people so that you can use cheaper instances, so that you can really achieve your goals of reliability and durability with fewer resources. So Honeycomb is a small series A, series B-ish company, right? And therefore we care a great deal about what our AWS bill is. And one thing that we were able to do by being absolutely positive that we had catch-up and replay mechanisms, who was that we were able to deploy some of our workloads into preemptable or spot-bidding instances that would potentially restart more often knowing that it would be safe to restart our instances because we have validation of it.
So for instance, right now, all of our stateless ingest is running on spot instances because we know that those can survive restarts and that all the state is stored safely somewhere else, rather than stored within each individual worker. For our stateful workers, we do have one stateful worker that is able to tolerate replaying from Kafka upon demand without introducing reliability risks. So we moved that over to spot instances as well. But there are more significant cost savings that we were able to achieve by knowing that we could safely roll our entire fleet upon demand without worrying about whether or not it was going to cause a reliability or durability incident. And that is our Kafka and retriever stack. ARM64 is a processor architecture that is different from Intel architecture and has fewer licensing costs and lower power consumption. And when Amazon introduced ARM64 Graviton2 instances, we were able to safely migrate not just our stateless workloads, but our stateful workloads over to it by performing rolling restarts using the durability techniques that we’d thoroughly tested in advance. So that’s really enabled us to save tens of thousands of dollars per month as a series A, series B startup.
So what’s ahead for us, and can you learn from our experiences? First of all, one thing that you can learn is that it’s important to have this feedback loop in your systems, of hypothesizing about things that could go wrong, testing those assumptions, and when something is not behaving the way that you expect, learning from it and then patching that bug. And then going back to the beginning, hypothesize, test, and learn. The second thing that you can learn is that we need to celebrate things that are successful experiments, as well as failures. There is no such thing as a failed chaos experiment. If you had something that succeeded flawlessly and the redundancy mechanisms kicked in, great, celebrate the development of those redundancy mechanisms. If things didn’t work as you planned, celebrate that too, because you found a weakness in your system that you’re now going to go back and patch.
Third, you can be both reliable and scalable and have velocity. You just need to develop service-level objectives, so you measure what those targets are. And then you can either respond when those targets are broken or if they’re not broken, that gives you the slack and permission in terms of your product organization to perform experiments that might cause there to be momentary risk reduction stability so that you don’t have catastrophic and controlled production instability in the future. And that brings me to the next point, which is that you need to prioritize the health of your engineering teams above all else. Make sure that you are able to sleep easily at night. And make sure that every single engineer on your team gets to sleep at night. Even if you have only a single site on-call, make sure that people are not getting woken up individually more often than a few times per year. And if you have a sufficiently large on-call rotation of six or eight people, that means that you should be only seeing at most, one page during nighttime or weekend hours per two weeks or so.
You can do this too, step-by-step. Your architecture may not look exactly like ours, but I definitely think that there are common patterns that you can perform. With stateless services, just make sure that you’re continuously restarting them, that you’re verifying that they scale up and scale down, according to demand. And then for stateful services, make sure that you have the capability to bootstrap workers without there needing to be a peer that’s alive. Make sure you’re automatically spreading across availability zones. And make sure that you are actually testing that you can bootstrap nodes successfully. And then after you’ve performed those steps, go back and make sure that you’re performing that process continuously rather than requiring a human to do them manually. If you’re interested in more information about anything that I’ve told you today, we’ve detailed a lot of these experiments on our blog, which is at hny.co/blog.
Overall, my message to you today is to understand and control production by enabling your product teams to go faster with having a sufficiently stable infrastructure that you can change upon demand. So, manage your risk by heeding service-level objectives, and then iterate on that risk by validating and verifying your assumptions. You can find the slides at honeycomb.io/Liz. And you can find me @lizthegrey. Have a really great SREcon everyone and hope that you enjoy the rest of the conference.
If you see any typos in this text or have any questions, reach out to firstname.lastname@example.org.