HelloFresh Improves Organization-Wide Performance With Honeycomb

Company Overview

Customer: HelloFresh
Company Stage: Enterprise
Tech stack/integrations: Graylog, Prometheus, StatsD, InfluxDB, Grafana, Jaeger, Cassandra, AWS, Amazon Elastic Load Balancing (ELB)
Industry: Meal Kit
Location: HQ in Berlin, Germany

About HelloFresh

HelloFresh is the world’s leading meal kit company and aims to provide every household with wholesome, homemade meals, with no shopping and no hassle. The company serves customers across 14 countries, including the U.S., the UK, and Germany. Everything required for delicious meals is carefully planned, carefully sourced, and delivered to each customer’s front door at the time most convenient for them. For this enterprise company, managing a global customer base requires coordination across many different teams.

HelloFresh practices a service-ownership model where engineering teams (organized into squads, tribes, and alliances) own running their code in production. The Platform & Payment Alliance, led by VP of Engineering Renato Todorov, provides centralized platform and payment services to internal teams, ensures they adopt current best practices, and builds an attractive workplace for engineers. Its mission is to help service teams focus on creating business value through experimentation and new features by reducing the friction of toiling against infrastructure and production issues in AWS.

Shifting toward observability

The platform team had been supporting a combination of several tools to help their teams troubleshoot production issues. However, they wanted to reduce both the burden of maintaining and the cognitive load of using those solutions, while also encouraging adoption of modern practices.

“I didn’t want to introduce another APM tool or go down the path of AIOps because both of those models encourage the wrong mindset,” explained Renato. “For example, with APM tools, you have to fiddle with logs and metrics across multiple windows, and it’s hard to not get lost. You also need to learn a proprietary query language, and logs can only be retained for short periods of time because it’s too expensive to keep them. It adds friction that is the exact opposite of our mission.”

“We’d already replaced New Relic for APM and had realized we wouldn’t need most of our metrics and logs if we were using events and distributed traces,” he continued. “The shift toward an observability mindset allowed us to reduce the number of tools we use and it helped reduce cognitive load. But we were then managing our own observability stack using Graylog, Prometheus, StatsD, InfluxDB, Grafana, Jaeger, and Cassandra. That took up a lot of time and processing became too expensive if we wanted all of our data to be queryable.”

Renato realized he would have to introduce Honeycomb’s observability approach gradually to the organization—and the first step would be understanding how to meet people where they are. He would start identifying their needs by asking questions to discover any current pain points that observability could help address. 

The benefit that resonated most often was reduced cognitive load: making it easier for people to find the answers they were looking for. He would then identify each team’s preferred medium for learning, such as reading books or articles versus watching videos. Another tactic was to consider whether certain teams would be early adopters, laggards, or even champions. For instance, a key strategy in accelerating enterprise-wide adoption at HelloFresh was lowering barriers to entry with the help of champions, who created organizational examples of what it takes to integrate observability practices into current processes or demonstrated what effective instrumentation means for their org’s applications.

Faster incident resolution with Honeycomb 

A key part of reducing friction for developers supporting production services is providing tools for fast incident resolution. Early in their evaluation of Honeycomb, the platform team saw faster incident resolution for themselves.

“It was a Friday afternoon, and we were dealing with an incident when the Honeycomb team pointed out that they were already ingesting logs for the affected service,” said Renato. “So we all jumped on a call and, while the incident was happening, we ran a few queries in Honeycomb and quickly isolated the problem. At the same time, our incident management team had been working on the issue with our existing tools. They’d been working on it for a while and weren’t even close to finding an answer—meanwhile, Honeycomb had already helped us identify the cause. That’s what convinced us.”

Autonomy is a key part of the squad model at HelloFresh. Squads make their own decisions about how best to support their production services. The platform team advocates use of recommended practices and technologies, but they still must support many different approaches to managing production. Observability with Honeycomb is just one of the many services the platform team provides.

“The ‘you build it, you run it’ philosophy is well established at HelloFresh,” said Renato. “But that doesn’t mean everyone is necessarily keen on managing production. So we put a lot of focus on our incident management program. Honeycomb is a critical part of that. We still have some traditional logging and metrics usage, but our main goal is to shift everyone toward an observability mindset. We want our engineers to use distributed tracing and structured logs with Honeycomb as their one and only tool for getting feedback from systems in production.”

Optimizing for performance

With Honeycomb, the HelloFresh team now spends less time on maintenance and uses fewer tools.

“Honeycomb has benefited us in two different ways,” shared Renato. “From a platform perspective, we got rid of several pieces of our infrastructure. We were able to save money and tangible engineering effort on maintenance. We now have more capacity as a team to deal with other challenges that come up, which is priceless. From a product development perspective, they have been able to reduce the number of tools needed, and that simplicity lets us move faster.” 

Next steps

Currently, the team uses alerts to help monitor service availability, but Renato plans to continue driving adoption of Honeycomb’s service-level objectives (SLOs) so business objectives will be better aligned with engineering objectives.

“SLO adoption is driven by the platform team,” explained Renato. “The SLOs that we use in Honeycomb are based on Amazon Elastic Load Balancing (ELB) logs. We send unsampled ELB logs to Honeycomb, and then our teams can create SLOs based on performance that customers actually see.” 
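To make the idea of an SLO built on load balancer logs concrete, here is a minimal sketch of the underlying arithmetic. It is illustrative only and is not Honeycomb's API or HelloFresh's actual configuration: the `ElbRequest` fields, the 1000 ms latency threshold, and the compliance targets are all assumptions chosen for the example. Each request is classified as good or bad from the customer's point of view, and compliance is the share of good requests.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ElbRequest:
    """Hypothetical, simplified view of one ELB access log entry."""
    status_code: int      # HTTP status returned to the client
    duration_ms: float    # total time the client waited for a response


def is_good(req: ElbRequest, max_duration_ms: float = 1000.0) -> bool:
    """Assumed SLI: no server error and a response within the threshold."""
    return req.status_code < 500 and req.duration_ms <= max_duration_ms


def slo_compliance(requests: Sequence[ElbRequest],
                   max_duration_ms: float = 1000.0) -> float:
    """Fraction of requests that met the SLI (1.0 when there is no traffic)."""
    if not requests:
        return 1.0
    good = sum(1 for r in requests if is_good(r, max_duration_ms))
    return good / len(requests)


def error_budget_remaining(requests: Sequence[ElbRequest],
                           target: float,
                           max_duration_ms: float = 1000.0) -> float:
    """Share of the error budget still unspent; negative means overspent."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    total = len(requests)
    if total == 0:
        return 1.0
    bad = total - sum(1 for r in requests if is_good(r, max_duration_ms))
    allowed_bad = (1.0 - target) * total
    return 1.0 - bad / allowed_bad


# Ten sample requests: eight healthy, one server error, one too slow.
sample = (
    [ElbRequest(200, 120.0)] * 8
    + [ElbRequest(503, 80.0)]
    + [ElbRequest(200, 1500.0)]
)
print(slo_compliance(sample))
print(error_budget_remaining(sample, target=0.75))
```

Because the input is unsampled traffic as seen at the load balancer, every customer-facing request counts toward the ratio, which is what lets an SLO like this reflect the performance customers actually see.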

“Right now, use of SLOs is a newer concept,” Renato shared. “Our goal is to alert on issues that affect the user experience. Honeycomb SLOs directly measure impact on users—in other words, measuring the user or customer experience. At some point, we want to review all alerts and remove the unactionable ones in favor of SLOs. But those discussions are still happening. The SLO targets that the business wants for user experience aren’t the same targets that engineering supports—we’re negotiating what’s an acceptable level of service.” 

“Our goal, and we are succeeding in some places with it right now, is to make Honeycomb the first thing teams look at whenever anything goes wrong,” said Renato. “Honeycomb reduces the number of dashboards and the amount of context switching that engineering teams are used to because they don’t need to log into many different systems and switch through a bunch of windows to then correlate all that troubleshooting data in their minds. They can see it all, in just this one place, in Honeycomb.” 

“Keep in mind that observability isn’t something you dump onto a new team and move on,” said Renato. “You need to keep helping your people. Over time, they will want to do more complex investigations. They will want to start using end-to-end user journey SLOs, and you will need to help them with that. That’s a sign you’ve succeeded.”