How Our Love of Dogfooding Led to a Full-Scale Kubernetes Migration

When considering a migration to Kubernetes, as with any major tech upgrade or change, it’s imperative to understand the motivation for doing so. The engineering time and labor to execute a complex migration will take away from other priorities, making it crucial to have org-wide alignment on why the change makes sense.

By: Ian Smith

June 14, 2023

The benefits of going cloud-native are far-reaching: faster scaling, increased flexibility, and reduced infrastructure costs. According to Gartner®, “by 2027, more than 90% of global organizations will be running containerized applications in production, which is a significant increase from fewer than 40% in 2021.” Yet, while the adoption of containers and Kubernetes is growing, it comes with increased operational complexity, especially around monitoring and visibility.

Speaking from our own experience, we knew the move to Kubernetes made sense for us—not because it was a trendy thing a lot of our customers did, but because we lived between two hard-to-manage systems.

We completed our migration from Amazon Elastic Compute Cloud (EC2), orchestrated via Terraform, Chef, and homebrew scripts, to Kubernetes on Amazon EKS in March 2023. In this post, we’ll detail lessons learned, recommendations for success, and the benefits we’ve experienced from migrating to Kubernetes.

TL;DR: we dogfood our own software so we can apply observability to do things our customers use Honeycomb for, like easing complex migrations.

Three reasons we migrated to Kubernetes

While we had success with our previous EC2/Terraform/Chef/bash scripts approach, we also felt that it wasn’t what we needed for the next five years of Honeycomb. In the end, we had three big reasons for migrating to Kubernetes:

Homegrown systems are great—they’re tailor-made for you. But you also have to maintain them and train new users and operators.
Kubernetes would enable us to scale our own system to provide more reliable and performant services for our users. On more than one occasion, we felt the pain of infrastructure built around a company at a smaller scale—with fewer machines, fewer developers, fewer services.
A lot of our customers asked us about Kubernetes. We knew it was a system they clearly care about and use, so we thought, “Wouldn’t it be great if we knew more about Kubernetes?” We already had some folks with k8s experience on our team, but we believe in learning by doing. Dogfooding on its own isn’t a reason to undertake a big core-infrastructure-tools migration, but it was a nice bonus win for us.

Seamless migrations using the power of Honeycomb

Throughout the entire Kubernetes migration, Honeycomb helped us answer the questions we didn’t know we needed to ask. Because all our data is already in Honeycomb, all we had to do was create a derived column ‘is_kubernetes’ and we had an efficient way to A/B test hypotheses. We confirmed our Shepherd API was just as performant in Amazon EKS as it was in Amazon EC2, and we compared response latency for two deployment mechanisms to decide on the right one.

Honeycomb is designed to help answer unpredictable new questions, and it was critical to the success of our migration. Whatever question we had, we created a query or a derived column to find the answer. We had real-time, interactive introspection to understand what was going on, which is something that would drown—or bankrupt—other tools.
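
To make the derived-column comparison concrete, here is a minimal sketch of the idea in plain Python. It is not Honeycomb's derived column syntax or our actual query: the event fields (`duration_ms`, `k8s.pod.name`) and the sample values are assumptions made up for illustration. It tags each event with an `is_kubernetes` flag and compares latency between the two groups.

```python
from statistics import median

# Hypothetical request events. Field names and values are illustrative only.
events = [
    {"service": "shepherd", "duration_ms": 12.0, "k8s.pod.name": "shepherd-a"},
    {"service": "shepherd", "duration_ms": 15.0, "k8s.pod.name": "shepherd-b"},
    {"service": "shepherd", "duration_ms": 13.0, "k8s.pod.name": "shepherd-c"},
    {"service": "shepherd", "duration_ms": 11.0},
    {"service": "shepherd", "duration_ms": 14.0},
    {"service": "shepherd", "duration_ms": 16.0},
]


def is_kubernetes(event: dict) -> bool:
    """Derived field: True when the event carries a pod name (assumed marker)."""
    return "k8s.pod.name" in event


def latency_by_platform(events: list[dict]) -> dict:
    """Group durations by the derived flag and summarize each group."""
    groups: dict[bool, list[float]] = {True: [], False: []}
    for event in events:
        groups[is_kubernetes(event)].append(event["duration_ms"])
    return {
        ("eks" if on_k8s else "ec2"): {
            "count": len(durations),
            "median_ms": median(durations),
        }
        for on_k8s, durations in groups.items()
        if durations  # skip a platform that has no events
    }


summary = latency_by_platform(events)
```

With both platforms serving traffic at once, `summary` holds one row per platform, which is the shape of comparison an A/B check like this needs.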

Tackling the hardest parts first and aiming for quick wins

Initially, we had three goals for our migration:

We needed an EKS cluster to try stuff in. But really, we needed three clusters, one for each Honeycomb environment: prod, dogfood, and kibble. Dogfood is our very own private Honeycomb that receives telemetry from prod, which we use to observe and operate Honeycomb for customers. Kibble, in turn, receives telemetry from dogfood, allowing us to observe and operate dogfood.
We wanted to migrate the simplest service for an easy early POC. This was Shepherd, which is the API where we ingest customers’ data and telemetry. It’s stateless, it scales nicely in response to load, and it is the archetypal Kubernetes-shaped service.
We wanted to migrate the hardest service, because if we couldn’t make that work, we wouldn’t migrate anything else; it was going to be all or nothing. That was our storage engine, Retriever. It’s stateful and scaling it up is infrequent, high-touch, and done when a human looks holistically at a variety of performance indicators and traffic projections.

Bringing in outside experts (yes, asking for help is okay!)

While our migration team had Kubernetes experience, we didn’t have experience specifically with EKS, and EKS managed by Terraform. And unfortunately, that was a prereq for all the other validation work we needed.

To help us out, we decided to hire contractors with a very specific goal: do the initial hard setup so we could start tinkering, which would give us an easier learning curve. We also set expectations with them: we wanted to learn from them by reviewing their PRs and asking questions, creating a please-mentor-our-engineers dynamic.

It was an expensive, month-long engagement, but we got immense value from it! We did redo a lot of bits, especially for Retriever. We had very specific goals for it, and we concluded that our spec hadn’t carried all the necessary context. Even where the result wasn’t right, exploring the problem space with the contractors left us comfortable continuing that exploration on our own after the contract ended. Focusing on the hardest parts first helped us make sure that everything in between the easy and the hard was doable. We gained a ton of experience that helped us migrate all our other services, and now we can apply our learnings to help customers on similar journeys.

Solving the Monday morning scramble with containers

As for our quick wins, those happened with Shepherd. Before our Kubernetes migration, Shepherd experienced repeated, similar scale-up issues on Monday mornings. We used Chef for orchestration, and each week, like clockwork, some dependencies or external packages would break, usually on Thursdays or Fridays. We didn’t notice because the system scales down then.

By Monday morning, we’d have a peak of activity when our business customers kicked off their new week and sent us more traffic. Scaling up Shepherd was prone to trigger one of these failures, which left us short on capacity because the system couldn’t scale up as intended. That’s when we decided to containerize Shepherd so all its dependencies would be assembled once at build time instead of at runtime.

Another effect of moving dependency assembly to build time was speed. Before the containerization, our ramp-up time was 15 minutes. We set a target: to cut that in half. We said, “Let’s measure this: once it’s in production, what will Honeycomb tell us happened?” Surprise! We didn’t go from 15 minutes to 7.5. We went to a two-minute delay from traffic increase to scale-up.

Mistakes will happen; this is a sociotechnical system

Another way we de-risked the process came about after we accidentally deleted the namespace in dogfood (“We.” Ok, no, it was totally me.). We got lucky, in that we only deleted pods and ingresses, but the hosts were still there. AWS just spun up our old images and we ran a manual recovery with the config and scripts in GitHub.

It was a minor setback, but it reminded us to think about how to build safety barriers. Simply saying “be more careful” wasn’t enough: we had to make the system itself safe. Now, we manually create the cluster namespace, and we created a wrapper for the command line that compares the destination environment and the cluster in question. If they’re not the same, it bails out. On the one hand, it’s a manual, not automated, task in setting up a new cluster. On the other hand, Helm won’t accidentally delete our apps if we point it at the wrong cluster. We made the system safer for human operators.
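
A guard like the one described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not our actual wrapper: the context names in `EXPECTED_CONTEXT` and the function names are hypothetical. It only relies on `kubectl config current-context` to read the active cluster and on the `helm` binary being on the path.

```python
import subprocess
import sys

# Hypothetical mapping: environment name -> kubectl context expected for it.
EXPECTED_CONTEXT = {
    "prod": "eks-prod",
    "dogfood": "eks-dogfood",
    "kibble": "eks-kibble",
}


class WrongClusterError(RuntimeError):
    """Raised when the requested environment and the active cluster disagree."""


def check_target(environment: str, current_context: str) -> None:
    """Bail out unless current_context is the one expected for environment."""
    expected = EXPECTED_CONTEXT.get(environment)
    if expected is None:
        raise WrongClusterError(f"unknown environment {environment!r}")
    if current_context != expected:
        raise WrongClusterError(
            f"environment {environment!r} expects context {expected!r}, "
            f"but kubectl points at {current_context!r}"
        )


def current_kube_context() -> str:
    """Ask kubectl which context is currently active."""
    result = subprocess.run(
        ["kubectl", "config", "current-context"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


def run_helm(environment: str, helm_args: list[str]) -> int:
    """Run helm only after the environment/cluster check passes."""
    try:
        check_target(environment, current_kube_context())
    except WrongClusterError as err:
        print(f"refusing to run helm: {err}", file=sys.stderr)
        return 1
    return subprocess.run(["helm", *helm_args]).returncode
```

Calling something like `run_helm("dogfood", ["upgrade", "shepherd", "./chart"])` (release and chart path are placeholders) would refuse to proceed when kubectl points at any cluster other than the one mapped to dogfood. Keeping the comparison in a pure function (`check_target`) makes the safety rule easy to test without a live cluster.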

In the end, we gained a ton of experience during our Kubernetes migration, and the scaling problems of 2020-2021 are old news (to be replaced, of course, with the next phase of growth and scaling). Thanks for learning with us. If you’re still hungry for more dogfood, watch the full webinar: How we (begrudgingly) moved some of our services to Kubernetes.
