Software Engineering   Debugging  

Incident Resolution: Do You Remember, the Twenty Fires of September?

By Fred Hebert  |   Last modified on August 29, 2023

From September to early October, Honeycomb declared five public incidents. Internally, the whole month was part of a broader operational burden, where over 20 different issues interrupted normal work. A fraction of them had noticeable public impact, but most of the operational work was invisible. Because we’re all about helping everyone learn from our experiences, we decided to share the behind-the-scenes look of what happened.  

Most of the challenges come from a pattern of accelerated growth and scale (the amount of data ingested by Honeycomb grew by 40% over a few weeks), when multiple components hit inflection points, at the same time, on multiple dimensions. 

For brevity’s sake, I’ll give an overview of trickier incidents in this blog for context and then jump to the lessons learned. If you want a more in-depth view, see our long form report.

Earth Wind & Fire

Incidents and near misses

A grab bag of issues

We encountered a lot of small issues that sometimes crossed paths with underlying scale issues. In all cases, they ended up playing a role by being part of a heavy operational burden:

We are moving services to containers running in EKS. We were first migrating our dogfood environment—internal instances of Honeycomb—to monitor production. These caused multiple disruptions.

  • Our main RDS instance got too small for our workload and needed to be scaled up.

Most of our significant issues, however, were related to increased load.

Overloading our dogfood environment

Through the summer, various optimizations were added and limits raised in our stack, which increased our capacity for queries. Since then, we had a few cases where multiple components would alert across environments, but with no clear reason.

Diagram showing the relative spike of ingest volume related to S3 logs in our Dogfood environment

On September 16, some large customer issued multiple costly queries, which read billions of files (petabytes) in a short time. Each of these in turn generated S3 access logs.

Our dogfood ingestion pipeline was under-provisioned for this spike. The back-pressure caused ingestion delays and crashed production Kafka metrics reporters, making it look like a production outage to our redundant alerting systems.

Beagle processing delays

Beagle analyzes input streams for SLO data and auto-scales based on CPU usage. Partition imbalance, service-level objective (SLO) definitions imbalance, and network throughput are all different things that can contribute to CPU being a poor proxy metric for its workload.

Previously, when beagle would warn of falling behind, we’d manually scale it up, which fixed the problem with minimal effort. Through the month, we ended up scaling up more aggressively than usual, and it kept improving at a rate below our expectations. 

Kafka scale-up

Some of the beagle processing delays were due to the rebalancer failing without us noticing, but others were hard to explain. We discovered that our Kafka brokers were being throttled over network allowances by AWS:

The command and output showing the outgoing bandwidth allowance being exceeded, using 'ethtool'

We needed to scale our cluster vertically to get more network capacity. As beagle delays kept happening, we fast-tracked the migration of the most impacted partitions by shifting them to bigger instances and then move the rest slowly over days.

We had a small hiccup over the long transition, where the data Honeycomb users see was delayed. We posted a public status for this since it was customer-impacting, but kept the delay at around 5 minutes for the worst of it. 

Understanding beagle processing delays

Despite our Kafka scale-up, beagle delays happened again. These weren’t caused by Kafka being overloaded since we had added 50% extra capacity.

Diagram showing unstable processing times for Beagle partitions

We tried some manual balancing—consumers started to catch up, but surprise! Delays got worse again.

We eventually found out that the choice of partitions beagle would consume could cause problems. Whenever the same leading broker got some of its partitions assigned to a single beagle two or three times, its consuming performance faltered. This explained why scaling up would often fix problems—but would also sometimes make things worse.

A comparative view of how Sarama's round-robin strategy would allocate partitions and brokers to consumers, both with 2 and 3 consumers in the consumer group.

In the Sarama library, Kafka consumer groups open only a single connection to any leader, regardless of how many partitions are going to be fed from it. This can create contention where the connection buffers bottleneck traffic. We tried various settings to increase buffers and throughput over a few days, which at this point seemed to hold up.

Scaling up retrievers

We found retrievers themselves were also having scaling problems due to new extra traffic going through Kafka. We planned a scale-up, adding roughly a quarter extra capacity by scaling horizontally.

We followed a runbook written when beagle did not exist. The steps for it were added later and never tried in production. We saw crashes and delays. We guessed this was because the new partitions were empty and, consequently, rushed things. We made mistakes, one of which was that the runbook was telling us to introduce records in the beagle SLO database:

("beagle", "honeycomb-prod.retriever_mutation", <N>, 0, "manual")

The actual topic was "honeycomb-prod.retriever-mutation" with a -, not a _. We didn’t notice this when writing the runbook, crafting the queries, reviewing them, applying them, and when doing the first audits of the table after things were going bad.

Beagle struggled and crashed in a loop, causing an SLO processing delay outage.

Diagram showing the accumulating Beagle processing delay (going past 1 hour) for our SLO processing.

Once we caught up, we backfilled the SLO data and all customers’ service were reestablished properly.

About a week later, we scaled up retrievers again to buy extra room as ingest volume kept growing. That time, it went fine.

Lessons learned and things to keep in mind

Scaling of individual components

We had major ingest growth in a short time. It felt like we hadn’t been proactive enough, but while some of us had ideas about what some of our limits were, nobody had a clear, well-defined understanding of them all.

We ran into scale issues around:

  • Network limits that were previously unknown and invisible to us
  • Limitations in the abilities and stability of automation
  • Surprising capacity of our production cluster to overload our dogfood cluster and vice versa
  • Bottlenecks in various components
  • Manual and tool-assisted rebalancing of partitions reaching toil acceptability levels

We got in a situation where various types of pressures just showed up at once and disguised themselves as each other.

line diagram showing the ingest traffic growth in abstract terms.

doodle showing the shape of dataset growth over two points in time.

Traffic does not scale uniformly across customers—their datasets aren’t evenly distributed. Implications vary per service. Beagle consumes all messages of all datasets, but the count and costs of SLOs means the scaling shape is distinct from what retriever needs from the same feeds. 

Distinct scaling patterns are needed for various components. It’s sometimes unclear if bottlenecks would be better solved by scaling vertically or horizontally (or some other dimension). Our scaling profile over the last months looks a bit like this:

doodle showing the stepwise nature of our lambda capacity and scaling rates.

doodle showing the retriever growth over time, mostly stable with two scale ups late in time.

doodle showing the kafka cluster size over time. The cluster size drops abruptly and raises back again at the end.

doodle showing the beagle cluster size over time. Auto-scales horizontally until we pin it to a fixed size near the end.

Growth depends on costs, awareness of bottlenecks, scaling models, and expectations. Lambda capacity grew step-wise over time, which changed traffic downstream in different environments, and retriever’s boundaries were fuzzy so we needed to experiment to find effective approaches. 

Additionally, our Kafka cluster grows vertically due to licensing structures. Beagle is now fixed in size because that seems stable, but could be smaller vertically and wider horizontally.

Combinatorial scaling

The upcoming challenge we’re about to face is having to consider when our future scaling plans run into each other and reach limits. For example: 

  • Scaling retrievers horizontally for writes make reads more likely to hit high 99th percentile values
  • Vertically scaling retrievers does not address load issues that beagle could see, and scaling horizontally dilutes beagle’s auto-scaling metrics.
  • Increasing scale in one environment can cause ripple effects in other ones
  • Our own tooling is running into new hurdles as GROUP BY limits in Honeycomb queries mean we can’t see the work of all our partitions or all hosts at once.

September and early October have sent us the clearest signals yet about where a lot of our limits lie and where we need to plan growth adaptation in a combinatorial approach.

Experience and tempo

Often, one of the new incidents highlighted something we did not understand in a previous one or a previous incident held the keys to solving one of the new ones. These examples reinforce the idea that all incidents are learning opportunities.

The ability to adapt to production challenges comes from sustainable pacing—not too active, not too sparse, a bit like exercising to stay healthy. For example, over-scaling to ensure we avoid issues for a long time gives more time to forget about essential operational challenges. It’s useful to keep ourselves familiar with the signals and ways the system bends.

If we operate too far from the edge, we lose sight of it and can’t anticipate when corrective work should be emphasized. If we operate too close to it, we are constantly in high-stakes situations and firefighting. We then lose the capacity, both in terms of time and cognitive space, to be able study, tweak, and adjust the behavior of our system. Keeping that balance is essential to long-term system (and our team’s mental) health.

Do you have any scaling experiences or better practices for operational pacing you’d like to share? We’d love to hear from you! Give us a shout on Twitter or, if you’re already a Honeycomb user, send us a note in our community Slack group, Pollinators.


Related Posts

Teams & Collaboration   Software Engineering   Culture  

What Makes for a 'Good' Pair Programming Session?

Software changes so rapidly that developing on the cutting edge of it cannot fall to a single person. When it comes to asynchronously disseminating information...

Software Engineering   Dogfooding  

Deploy on Friday? How About Destroy on Friday! A Chaos Engineering Experiment - Part 1

We recently took a daring step to test and improve the reliability of the Honeycomb service: we abruptly destroyed one third of the infrastructure in...

Software Engineering   Culture  

Staffing Up Your CoPE

Getting the right people working in the CoPE is crucial to success because these change agents must limber up the organization and promote the flexibility...