Graviton5 in Production at Honeycomb: Per-service Results From the m8g to m9g Migration

This is the fourth installment in the Graviton retrospective series we've been writing since 2021. The methodology is the same one I always reach for: hold the workload constant, run both generations concurrently in the same Kubernetes namespace, and let the per-pod numbers speak.

By: Liz Fong-Jones | June 10, 2026

Dogfooding

Technical Deep Dives

TL;DR

Honeycomb migrated a portion of our production shared compute pool from m8g.8xlarge (AWS Graviton4) to m9g.8xlarge (AWS Graviton5). Over 60 days running both generations side by side, every measured service used 11-26% less CPU on Graviton5 for identical work. Ingest P99 latency dropped 28%, the metrics ingest pipeline P99 halved, and tail-based sampling queues ran 44-73% shorter at P95. Nothing in the fleet regressed.

Play with the data yourself

About this post

The Canvas embedded above is the actual investigation I ran to evaluate the m8g/m9g A/B test, made read-only and public so you can click through the queries and interact with the data yourself rather than reading numbers off screenshots. Hit the make fullscreen button in the corner for the proper view, and try the light/dark mode toggle while you're there.

The Graviton4 edition of this series was put together by hand, with screenshots glued to prose. This time, I had Canvas reproduce the shepherd ingest latency portion agentically, replaying the methodology and graphs from the Graviton4 post. It took about three minutes to fetch the blog post, parse the images, and recreate and rerun the queries. That speed-up is what Canvas being agentic actually buys you. The numbers and conclusions are mine; a chunk of the legwork that previously took me an afternoon is now an artifact of the tooling.

What was measured

Five core stateless services run on this fleet: shepherd (OpenTelemetry ingest), refinery (tail-based sampling), beagle (SLO evaluation), newf (service maps), and kelpie (anomaly detection). The comparison window ran from February 20 to April 21, with both m8g.8xlarge and m9g.8xlarge pods running concurrently in the same namespace, so every comparison is per-pod on identical workloads.

Honeycomb's poodle (API/query) service isn't in the comparison: it isn't very CPU-intensive, so we preferentially schedule it on older generations and leave the newest silicon for services that actually benefit from it. That's how we extract value from a fleet that spans multiple Graviton generations rather than rip-and-replacing on every launch.

Per-service CPU efficiency (60 days, honeycomb-production namespace)

newf was bursting past its CPU request allocation at P99 on Graviton4 (142%); Graviton5 brings that to 114%, still over request but with a lot less pressure. beagle's P99 dropping from 94% to 65% on identical throughput is the cleanest demonstration in the fleet that the same work is finishing faster, not that less work happens to be arriving.
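The relative drops implied by those utilization figures can be checked with a few lines of arithmetic. The following is a minimal Python sketch, not part of the original investigation. The helper name `relative_reduction` is an illustrative choice, and the inputs are the P99 CPU-versus-request percentages quoted in the paragraph above.

```python
def relative_reduction(before: float, after: float) -> float:
    """Return the fractional drop from `before` to `after` (0.25 means 25% lower)."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


# P99 CPU as a percentage of the pod's CPU request, taken from the text above.
# Tuples are (Graviton4, Graviton5).
p99_cpu_pct_of_request = {
    "newf": (142.0, 114.0),
    "beagle": (94.0, 65.0),
}

for service, (g4, g5) in p99_cpu_pct_of_request.items():
    drop = relative_reduction(g4, g5)
    still_over_request = g5 > 100.0
    print(f"{service}: {drop:.1%} lower P99 CPU, still over request: {still_over_request}")
```

For beagle this works out to roughly 31%, and for newf to roughly 20%, with newf still above 100% of its request on Graviton5.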

Ingest pipeline: where the latency came from

Network receive is flat across both generations (8.18 ms vs 8.02 ms AVG on traces; effectively identical for logs and metrics). Network latency is dominated by client-side variance we don't control, which is why those numbers look identical across silicon generations. The gains are entirely in CPU-bound processing, which is what you'd expect when only the silicon changed.

Total root-span duration P99 by pipeline:

Inside handle_batched_event (the write-to-Kafka stages, per-pipeline P99):

The metrics pipeline benefits the most: total write-to-Kafka P99 drops from 880 µs to 440 µs, a 50% reduction. Metrics were already the hottest pipeline in CPU terms because of how aggressively we compact and normalize events that share target fields; Graviton5 takes a chunk out of exactly the spot that hurt most.

Tail-based sampling: queue depths are the clearest signal

Refinery throughput is bound by how fast pods drain buffered work. The comparison here covers April 18-20, 2026, with three m9g pods running alongside roughly fifteen m8g pods, both groups running Refinery 3.2.0 and receiving an identical, round-robined workload:

Same workload, same stress, dramatically shorter queues. If you operate Refinery, queue depth is the metric that determines whether you're about to drop spans on the floor under burst load. Cutting the incoming queue P95 by nearly three-quarters is real headroom, not a vanity number.
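For readers who want to run the same kind of comparison on their own queue-depth exports, here is a minimal Python sketch of a P95 comparison between two pod groups. It uses the nearest-rank percentile definition, which may differ from how your telemetry backend computes percentiles. The sample values are invented for illustration and are not Honeycomb's data, and `percentile_nearest_rank` is a hypothetical helper name.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank into the sorted list
    return ordered[rank - 1]


# Hypothetical queue-depth samples per instance generation (illustration only).
queue_depth = {
    "m8g": [120, 150, 180, 200, 240, 260, 300, 340, 400, 520],
    "m9g": [60, 70, 85, 95, 110, 130, 150, 170, 200, 260],
}

p95 = {gen: percentile_nearest_rank(depths, 95) for gen, depths in queue_depth.items()}
reduction = (p95["m8g"] - p95["m9g"]) / p95["m8g"]
print(f"P95 m8g={p95['m8g']} m9g={p95['m9g']} reduction={reduction:.0%}")
```

Because the two groups have different pod counts, collect samples per pod and compare percentiles rather than sums, so the larger group does not dominate the result.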

Streaming services: same work, less CPU

beagle, newf, and kelpie are all Kafka consumers, so throughput is partition-bound, not CPU-bound. Span duration on these services reflects tick intervals or partition assignment more than raw processing speed.

beagle (SLO evaluation, ticks once per minute): 2,347 vs 2,322 datasets per tick, 6,113 vs 6,147 SLOs evaluated per tick. Identical throughput. AVG CPU drops 11% and P99 CPU drops 31%.

newf (service maps): 92.4 vs 91.6 spans processed per run. Identical throughput. AVG CPU drops 23%, and AVG processing lag P50 halves from 0.04s to 0.02s.

kelpie (anomaly detection): 26% lower AVG and P99 CPU; per-operation latencies for the small local operations (loadDatasetEntry, Make Schemas List) drop 25-55% at P99, which lines up with the CPU efficiency story.
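Before attributing a CPU drop to the silicon, it helps to confirm that both generations really did the same amount of work. The sketch below, in Python, checks the throughput pairs quoted above against a tolerance. The 2% threshold and the names `relative_difference` and `TOLERANCE` are assumptions chosen for illustration, not values from the original analysis.

```python
def relative_difference(a: float, b: float) -> float:
    """Symmetric relative difference between two positive measurements."""
    return abs(a - b) / ((a + b) / 2)


# Throughput pairs as quoted in the text, one value per generation.
throughput = {
    "beagle datasets per tick": (2347, 2322),
    "beagle SLOs evaluated per tick": (6113, 6147),
    "newf spans processed per run": (92.4, 91.6),
}

TOLERANCE = 0.02  # assumption: differences under 2% count as "the same work"

for name, (first, second) in throughput.items():
    diff = relative_difference(first, second)
    verdict = "comparable" if diff <= TOLERANCE else "NOT comparable"
    print(f"{name}: {diff:.2%} apart -> {verdict}")
```

All three pairs land within about 1% of each other, which is what makes the CPU comparison a like-for-like one.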

A floor—not a ceiling—on savings

These numbers measure performance at a held-equal pod count. We ran a mixed fleet doing the same work on both generations; we did not remove instances, raise CPU requests, or push Graviton5 hotter to find the point where latency starts to degrade. The 11-26% CPU efficiency gains are therefore a floor on the instance-count reduction you could expect for an equivalent workload, not a ceiling. A capacity-tuning pass that actually packs more work onto each Graviton5 box should yield bigger savings on top of the performance wins shown here.
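To see what an efficiency figure would mean in capacity terms, here is a back-of-the-envelope Python sketch. It assumes a deliberately naive model that the experiment did not test: CPU is the binding resource, work scales linearly with CPU, and pods are kept at the same utilization. The function names and the 100-pod fleet are hypothetical.

```python
import math


def throughput_gain(cpu_reduction: float) -> float:
    """Extra work per core if the same work needs `cpu_reduction` less CPU."""
    if not 0 <= cpu_reduction < 1:
        raise ValueError("cpu_reduction must be in [0, 1)")
    return 1 / (1 - cpu_reduction) - 1


def pods_needed(current_pods: int, cpu_reduction: float) -> int:
    """Pods required for the same work at the same per-pod CPU utilization."""
    if not 0 <= cpu_reduction < 1:
        raise ValueError("cpu_reduction must be in [0, 1)")
    # round() guards against float noise pushing an exact value over the next integer.
    return math.ceil(round(current_pods * (1 - cpu_reduction), 9))


FLEET = 100  # hypothetical pod count, for illustration only

for reduction in (0.11, 0.26):  # the range of CPU reductions reported in this post
    print(
        f"{reduction:.0%} less CPU -> {throughput_gain(reduction):.1%} more work per core, "
        f"{pods_needed(FLEET, reduction)} pods instead of {FLEET}"
    )
```

Under that model, 11% less CPU is about 12% more work per core and 26% less CPU is about 35% more. Treat the pod counts as a starting hypothesis for a capacity-tuning pass, not a prediction.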

We did such a pass on shepherd alone in the week before re:Invent. That's where the 36% per-core throughput number came from. The numbers in this post come from a longer-lived passive experiment that covers the full mix of workloads rather than a single service tuned to the edge.

For context: when we did the equivalent exercise on Graviton2 four years ago, we eventually ran 20% fewer workers per service, and the cost story compounded from there across successive Enterprise Discount Program renewals and Compute Savings Plans. Graviton5 is the same shape of opportunity, with each generation stacking on top of the savings from the original arm64 move.

Bottom line

Compute-bound services see 16-22% lower CPU and 16-28% lower P99 latency on Graviton5 compared to Graviton4. Sampling queue depths drop 44-73% at P95 for the same workload. Kafka-bound streaming services do identical work with 11-26% less CPU. Nothing in the fleet regressed.

Graviton5 is attractive whether you're using an earlier Graviton generation or contemplating a switch from x86.

How to evaluate this for your own fleet

The per-service gains are more interesting than any single headline number. Where you sit on the compute-vs-input-queue-bound spectrum determines whether the win shows up as latency, headroom, or queue depth, and the right way to measure it follows from there. Don't lead with "what's the average improvement?" Instrument each service for the bottleneck it has, run a mixed fleet for long enough that workload variance evens out, and let the per-pod data show you which services to migrate first.
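One way to turn that advice into a first pass is to group per-pod samples by service and instance generation, then rank services by relative improvement. The Python sketch below assumes you have exported samples as `(service, generation, value)` tuples, where `value` is whichever lower-is-better metric matches that service's bottleneck (CPU, P99 latency, or queue depth). The function name, service names, and numbers are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def rank_services(samples, old="m8g", new="m9g"):
    """Rank services by fractional improvement of `new` over `old` (largest first)."""
    by_key = defaultdict(list)
    for service, generation, value in samples:
        by_key[(service, generation)].append(value)

    results = []
    for service in {svc for svc, _ in by_key}:
        old_vals = by_key.get((service, old))
        new_vals = by_key.get((service, new))
        if not old_vals or not new_vals:
            continue  # skip services that did not run on both generations
        old_mean = mean(old_vals)
        results.append((service, (old_mean - mean(new_vals)) / old_mean))
    return sorted(results, key=lambda item: item[1], reverse=True)


# Hypothetical per-pod samples (illustration only).
samples = [
    ("svc-a", "m8g", 4.0), ("svc-a", "m8g", 4.2), ("svc-a", "m9g", 3.1),
    ("svc-b", "m8g", 2.0), ("svc-b", "m9g", 1.8), ("svc-b", "m9g", 1.9),
    ("svc-c", "m8g", 1.0),  # only on the old generation, so it is skipped
]

for service, improvement in rank_services(samples):
    print(f"{service}: {improvement:.1%} lower on m9g")
```

A mean is used here for brevity; for latency or queue depth a percentile per generation is usually the more honest summary.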