Danyel Fisher [User Data Expert|Honeycomb]:
Hello. I’m Danyel Fisher, and I’m the User Data Expert here at Honeycomb. This morning, I would like to start by talking a little bit about some of the new analytics features in Honeycomb. To do that, I find that showing off features is best done as part of telling a story and actually engaging with a real incident. Fortunately, Honeycomb’s had no shortage of incidents recently. You can read about them in a blog entry called “The 20 Fires of September.” But I’d like to tell you about an exciting one that started in early November of this year.
During this talk, I’m going to be using a number of technologies that you’ve probably already learned about, and that you’ll get a chance to review and see in action. So we’re going to start with SLOs, and we’ll use some BubbleUp and Trace View, and we’ll, of course, spend some time in the Query Builder. To start off, this is an incident that happened on our Shepherd system. Shepherd is our guardian, as the name might suggest, and it makes sure that events that come in get dispatched to the right places. Many of our users wait on Shepherd synchronously, so it’s pretty important that we process each batch they send us in under five milliseconds.
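As a rough sketch, the SLI behind an SLO like this is essentially a per-event pass/fail check. The field name and threshold logic below are illustrative stand-ins, not Honeycomb’s actual derived-column code:

```python
# Illustrative sketch of the SLI behind this SLO (not Honeycomb's
# actual derived column): an event is "good" when its batch was
# processed in under five milliseconds. The field name is assumed.
def sli_fast_batch(event: dict) -> bool:
    return event.get("duration_ms", float("inf")) < 5.0

events = [{"duration_ms": 1.2}, {"duration_ms": 7.5}, {"duration_ms": 4.9}]
good = sum(sli_fast_batch(e) for e in events)
print(f"{good}/{len(events)} events met the SLI")  # 2/3 events met the SLI
```

The SLO itself is then just the fraction of good events over a rolling window, compared against a target like 99.9%.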
So we try to keep that really quick. Now up until about November 15th, we were doing great. As you can see, we were maintaining just about 100% compliance. And then something terrible happened on November 16th that burned through half of our Error Budget for the month in just a couple of hours. Together, we’re going to explore what happened and find out how. The Heatmap gives us a pretty good clue of what was going on. In this short period of time, right around 6:00 p.m., we can see that while most events continued to be processed super quickly, with a long tail running up to a little bit more than that, there’s suddenly this spike of events that were taking up to, here you go, a couple of hundred seconds. Now one important question to ask is: Who was affected? Was it kind of everyone, or was it a very specific subset of people? When I look down here, I can see this field called app.dataset.id. This tells us which user was affected.
And I can see, in fact, there’s one specific user, on one particular team, who ran into this challenge. That’s really interesting, because it means this isn’t something that’s burning everyone. It’s affecting just a few users. First, of course, we got customer support on the case to reach out to them and make sure everything was okay. But we were also curious how that was happening. This doesn’t seem to be a particularly high-volume team. We can tell that by looking at things like the pricing plan and seeing that they’re on our Pro plan. So they are standard monthly users working with Honeycomb. Let’s dig in a little deeper and find out what’s up with those events.
The first thing I’m going to do is actually jump into the query. On this screen, we can first see that the count of events wasn’t too different during that time. So it’s not like we were getting particularly hammered or overwhelmed. Let’s zoom in a little bit on that time range and go see what we can learn about what is happening. By the way, as soon as I hit zoom, you saw that we ran a new query and made sure that we got higher-resolution data. That’s one of the new features that we’ve been able to roll out. Just a simple quality-of-life improvement that, much to my surprise, has made everything feel a lot better. Now all these markers along the line here are showing things like when we spun up a new Shepherd host, when we deployed new code, that kind of thing. And you can also see up here a couple of orange dots, which are actually things that our SRE team put in when they were trying to debug the incident.
For the moment, I don’t think we need to see those, so I’m going to filter out the incident-type markers. The ability to filter markers, by the way, is yet another new feature that I’m excited about. And glancing at this, we can see that, let’s see, we did bring up a new Shepherd host, but it doesn’t look like a code deploy was what caused this. So let’s dive in specifically to one trace right now, and go see if we can get a sense of what this function’s doing. Why were things running slow?
When the trace comes up, what we’re going to see is, well, here’s the trace ID. We see this was a function called ChainUnaryServer. And I don’t know precisely what it does, except that it certainly calls handle batched event a great many times. In fact, it calls it something like 1,800 times, which took 65 seconds. We can also glance at this and see that an awful lot of the handle batched event calls were running seven microseconds, eight microseconds, two milliseconds. Fairly quick. And then there’s this number of them that are taking a lot longer than that: 1.1 seconds, 1.1 seconds again. We can see also that there are 35 of these spans that have errors in them, and that the error pops right up to the surface. Also an exciting new feature that has, again, made quality of life a little bit simpler for finding errors and figuring out what’s going wrong. Now when I look at the error, I see that it’s got a protobuf encoding error from Kafka.
I don’t know precisely what that means yet, but it gives us a clue that Kafka might be relevant. Now Shepherd, when it sends data to the backend database, does so through Kafka, which is a message delivery service. So you have to make sure that Kafka is allocating the right number of pipelines to be sending messages between these different services.
Seeing that there are these one-second-long events, 35 of them, goes a fairly long way to explaining why this was running 65 seconds. We can also see a number of cases that are 400 or 500 milliseconds, and even some that run a little bit slower than that, though none are popping up right now. The last thing that I’d like to see is that on the top right, we can see this distribution of the span duration. So this is giving us a histogram of every handle batched event call. And what we can see from this chart is that most of the time, these run really quick, and right at this moment, we just got this huge spike of them running over a second.
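Just to sanity-check that claim with some back-of-the-envelope arithmetic, using the approximate per-call timings read off the trace:

```python
# Rough check: 35 slow spans at roughly 1.1 s each, plus the
# remaining ~1,765 calls at a generous 2 ms apiece, already
# accounts for most of the 65-second trace.
slow_calls = 35 * 1.1              # seconds spent in the slow spans
fast_calls = (1800 - 35) * 0.002   # seconds spent in the fast ones
print(round(slow_calls + fast_calls, 1))  # 42.0 of the 65 seconds
```

The 400-to-500-millisecond stragglers mentioned above plausibly cover much of the remainder.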
Let’s go look at the handle batched event calls in Shepherd, and look at the distribution of how long those took. We can see that this period of the incident is when we got this huge spike in the duration of handle batched events. One interesting question might be: Is this normal? Do we sometimes just get these to be slow, but we, I don’t know, somehow papered over them before? So I would like to compare this to a week earlier and take a look at how the data looked then.
What you’re going to see here is a comparison of the current handle batched events to those from a week before. It’s a fairly arbitrary time, but it gives a sense that, no, this is definitely an unusual period. Sometimes we have a little bit of fluff there from a few things running slow. But this is different and kind of worrisome. Now it’s useful again to go ask: were we perhaps handling more batched events than usual? Maybe these events are overall longer. And doing the count again shows, nope, traffic was pretty much looking like it usually does. So that’s not the distinguishing factor.
There’s one other cool feature that we’ve brought out recently that can help when you’re analyzing a system where things have started taking more time or the count’s changed. That’s this operator called Concurrency. What Concurrency does is, for every event, look at the duration of the event and count that event at every moment its duration spans. So while a count only counts an event at its very start time, concurrency tracks it across its entire duration.
It’s really good for estimating things like the length of a queue, or the number of events that are waiting, or the depth of the stack, the depth of a trace that’s happening at a time. We’ve got some great blog entries and a really good docs page that talk about how to use Concurrency. Right now I would just like you to see that, as I look at this, concurrency is running in the hundreds or so most of the time. And then very suddenly, during this incident, we watch it spike to 5,000, to 7,000, all the way up here; the maximum actually gets to 15,000. We have a lot of events running concurrently now. So that suggests to me that the queue is getting long, that a lot of things are being processed.
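The idea behind Concurrency can be sketched in a few lines. This is a simplified model of the operator, not Honeycomb’s implementation, assuming events arrive as (start, duration) pairs in milliseconds:

```python
from collections import Counter

def concurrency(events, bucket_ms=1):
    # Each event counts toward every time bucket its duration spans,
    # whereas a plain COUNT would credit only the start bucket.
    counts = Counter()
    for start_ms, duration_ms in events:
        bucket = start_ms - start_ms % bucket_ms
        while bucket < start_ms + duration_ms:
            counts[bucket] += 1
            bucket += bucket_ms
    return counts

# Three overlapping 3 ms events plus one short, isolated one:
events = [(0, 3), (1, 3), (2, 3), (10, 1)]
print(concurrency(events))  # peaks at 3 at t=2, while COUNT never exceeds 1
```

That difference is exactly why the operator is good at surfacing queue buildup: overlapping slow events stack up in the concurrency view even when the event count looks flat.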
Let’s try to verify that and learn a little bit more about how this is happening. Maybe a BubbleUp on these events will give us some insights. When I do the BubbleUp, we’re now going to look at what makes these events different. The first thing that really pops out to me, well, let’s see. We see that most of the errors are this protobuf encoding error, precisely the one that we were seeing before. And scrolling a little further, we see this one field called app.chosen_partition. And there are very few vertical lines in there: a huge number of events on 13 and 19, but a whole bunch of other partitions that aren’t getting traffic. So there’s something going on with app.chosen_partition.
I’d like to view Grouped By that field, to go see whether app.chosen_partition helps explain what’s going on. What we’re now going to do is look at whether that Heatmap and that concurrency break down differently by chosen_partition. Once that’s done, we’ll jump back to the Results tab. And what you can see here is that concurrency is super high for partition 13 and for events that are going to partition 19. On the other hand, for many of the events here that are going to partitions 5, 27, and 28, concurrency is very low. So whatever app.chosen_partition is, that gives us a pretty good sense that only some of the partitions are really hurting. In fact, this one’s hurting the most.
Now one technique that can be useful is just jumping straight into the code, so I’ll look at our codebase. And because we’ve instrumented this with the term chosen_partition, let’s check the repository. We can see that in our Kafka producer code, right here, we have the code that records the chosen partition inside the function called write. So here we are. We’re trying to write. We choose the partition to write to. We record which partition it is. And what we’re seeing is that one of those partitions is doing really poorly. Now we’ve got a pretty good sense of what’s going on. There’s a bad partition that we’re trying to write to. It’s getting overloaded and taking too much time to come back with an answer. And for some reason, we haven’t been able to load balance away from it, to, you know, stop sending to it and distribute our effort to other things.
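As a hedged sketch of what that instrumentation looks like, the span API and the partitioning scheme below are illustrative stand-ins, not our actual producer code:

```python
# Illustrative only: record which Kafka partition a write chose as a
# span field, so BubbleUp can later group slow events by partition.
class Span:
    def __init__(self):
        self.fields = {}

    def set_field(self, name, value):
        self.fields[name] = value

def choose_partition(key: str, num_partitions: int) -> int:
    # Simple deterministic keyed partitioning, standing in for the
    # real client library's partitioner.
    return sum(key.encode()) % num_partitions

def write(span: Span, key: str, num_partitions: int = 32) -> int:
    partition = choose_partition(key, num_partitions)
    span.set_field("app.chosen_partition", partition)
    # ... producer.send(topic, payload, partition=partition) ...
    return partition

span = Span()
partition = write(span, "dataset-1234")
print(span.fields)  # the chosen partition, attached to the span
```

The point is simply that one extra field recorded at write time is what made the BubbleUp breakdown above possible.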
A metrics view might begin to give us some insight into what’s going on there. When I bring up the metrics, we see that it’s not one of the Shepherd machines that was overloaded. We can see that those are ticking along just fine. Sure, we can see that some of them started and stopped, but most of them are ticking along at 60 to 70% CPU utilization. But this sudden spike in memory usage and the spike in network connections definitely suggest that something was going on with trying to maintain a big queue of data and send it out over the network. Fortunately, we also have a set of Kafka metrics dashboards. And this gets really interesting. The CPU utilization here is all over the chart. Let’s see if we can learn something from this. When I click in, I can start looking at, let’s see, the CPU utilization for the Kafka group. And let’s scroll through these a little bit. Oh, that’s interesting. So just before this incident that we’ve been looking at, this Kafka machine stopped operating. I think we shut it down, because we do rotate through our Kafka Brokers from time to time.
And we spun up another one. But weirdly, that one didn’t get any traffic. Instead, the traffic seems to have gone initially to this machine, which was fairly loaded, and then increasingly, let’s see, to this one, which got all the work, even though it was pretty loaded too. So somehow our load balancer had gone kind of wild. It had made a bad decision about where our efforts should go. Instead of loading up that other machine, the one that was able to take the load, it loaded up this one. That’s unfortunate, and it also gives us a pretty good explanation of at least the sort of thing that was going on here. We have a second chart here of the number of under-replicated partitions. We try to make sure that our data is stored in three different replicas at once. And the fact that we can see the sudden spike in under-replicated partitions is, again, watching that one machine crash and then the others begin to take that load.
Now these all come from the Kafka dashboard, so I would like to jump over to that dashboard for a moment and see what else we can learn from it. We’ve taken a look at CPU utilization. And we can also see here that the amount of free disk is showing kind of the same thing. We have a new machine; it’s beginning to come up. Meanwhile, one of the already loaded machines kind of got overwhelmed. And then you can see that our SRE team managed to successfully kick things back into a stable state toward the end.
The topic throughput here is the number of messages that we’re sending to Retriever. And the fact that these curves stay fairly stable, even as everything else is going nuts, suggests to us that we pretty much didn’t lose any real data. You do see that there was a spike as we caught up with some of the stuff that we had to buffer. So things got slow. But I’m glad to say that we didn’t actually lose any data for our users.
There’s one last chart that I would like to look at, which is memory use on the Kafka side. To do this, I’d like to start off by taking a look at a metric called GC Collection Time. That shows you how much time the garbage collector is running on these Kafka Broker machines. So for each Kafka Broker and for each machine, we are looking at the amount of time it has spent on garbage collection. Now, this is a classic counter. As you can see, the number is increasing slowly. We’re looking at, like, this machine OE2, which is jumping from, you know, 10,000 to 10,900 to 10,970. Right? It’s ticking up very slowly. Trying to eyeball what it’s doing is pretty tricky.
I’d also like you to notice that, right here, the orange one had a sudden drop down to zero, which is going to get in the way a little bit. So we have a second operator here called RATE_MAX. What RATE_MAX does is look at the change in the value over time. Unfortunately, RATE_MAX currently isn’t very good at handling counter resets, and so we get this huge negative period where it just drops far below zero. That’s not particularly useful, but if you drop this onto a log scale, that actually makes the effect go away a little bit by hiding those negative numbers. And now we can see a little bit more of what’s going on.
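What a rate operator is doing, and why the counter reset causes that negative dip, can be sketched roughly like this. This is a simplified model, not Honeycomb’s implementation; here the reset interval is reported as None rather than as a negative spike:

```python
def rate(samples):
    # Per-interval rate of a cumulative counter. A broker restart
    # resets the counter to zero, which would produce a large
    # negative delta; report that interval as None instead.
    rates = []
    for prev, cur in zip(samples, samples[1:]):
        delta = cur - prev
        rates.append(delta if delta >= 0 else None)
    return rates

gc_ms = [10000, 10900, 10970, 0, 40]  # counter resets after the third sample
print(rate(gc_ms))  # [900, 70, None, 40]
```

That single negative delta is exactly the below-zero plunge on the chart, which is why switching to a log scale, which drops non-positive values, cleans up the view.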
The rate of garbage collection for all these machines was doing pretty well until this incident starts. We spin up one new machine, and it gets started with garbage collection. Another machine comes up and gets going with its garbage collection. But what’s really interesting is that we now see this one machine here, the one we were worried about, the one that’s been overwhelmed, also start spinning up really heavily. And it’s going nonstop on garbage collection, which probably explains, let’s see, that was O2OO, that explains why its CPU utilization was so high. And now, if we pop off the log scale and come back to linear, we can also see that all these machines are running pretty close to the maximum amount of memory they’ve got available. They seem to be configured for about six gigabytes. So one option could also be giving some of our machines a little more headroom.
Let’s step back now and take a look at what we’ve done. We started with an SLO alert. And we used that to find out that there was one customer who was being affected. We used the trace view to go examine what was happening with them and what had gone wrong. And looked into the underlying events to go find out what the characteristics of the traffic were. That showed us that it was one particular Kafka partition that was acting badly. And the metrics verified that it was Kafka, which machine it was, and even began to give us some cues as to the fact that the load balancer wasn’t acting correctly.
It turns out it wasn’t that user’s fault. They were just unlucky. They had been allocated to a machine that was acting poorly, and we’d made some mistakes with our configuration. Honeycomb’s managed to learn a lot from this event, and I hope that you’ve had a chance to learn something about some of these new features. Because we just went through, well, we started talking about OpenTelemetry ingest. We saw the Concurrency and rate operators. We filtered some markers. We had a chance to see auto-query on zoom. We did a time-over-time comparison. I showed you that you can share spans and traces by their URLs. And we got a chance to see the error-state traces. I would like to thank you so much for your time and attention. And I would particularly like to thank Liz Fong-Jones for helping me reconstruct this story and, of course, my empathy and thanks to the ops team that actually had to handle that incident while user data was in peril. Thank y’all so much.
If you see any typos in this text or have any questions, reach out to firstname.lastname@example.org.