Outreach Engages Their Production Code with Honeycomb
By George Miranda | October 16, 2020
Outreach is the number one sales engagement platform with the largest customer base and industry-leading usage. Outreach helps companies dramatically increase productivity and drive smarter, more insightful engagement with their customers. Outreach is a privately held company based in Seattle, Washington.
Tech Environment & Tools:
- Ruby and Go. Some NodeJS
- MySQL, Kafka
- Kubernetes containerized on AWS
- Datadog for log management
- Wavefront for Metrics
- Honeycomb Observability
Millions of Logs and Limited View Metrics
When Richard Laroque joined the Outreach engineering team in 2015, they mostly relied on querying logs with Elasticsearch as a way to pinpoint issues in production. Since then, the engineering team and its footprint steadily grew. By mid-2019, Outreach’s booming customer adoption meant that their log volume and size had become unmanageable, even with a two-week retention period.
In response, the team decided to migrate their logs from Elasticsearch to Datadog. At a smaller scale, the team had been able to search through their logs by general criteria, such as grouping by ID, to understand which customers might be experiencing an issue. As traffic grew, their approach of creating one log entry per HTTP request resulted in millions upon millions of log entries that rendered their previous search methods unusable.
The team considered a metrics-based approach for discovering production issues. But a key requirement was having the ability to pinpoint performance for specific users and groups, rather than analyzing performance in aggregate. The team needed to quickly sift through high-cardinality data to find performance bottlenecks.
Amritansh Raghav, VP of Product & Engineering, joined Outreach in mid-2019 and recommended a new approach. To more proactively understand issues in production, the team should adopt observability practices and specifically take a closer look at Honeycomb.
Crossing the bridge to observability
The Outreach team eventually settled on Wavefront for general metrics. To pinpoint specific performance bottlenecks, they explored distributed tracing with Jaeger. Trace views proved incredibly helpful during an incident by providing necessary insights, but the Jaeger user experience was challenging. After a deeper evaluation and side-by-side comparison between Jaeger and Honeycomb, they chose Honeycomb because its UX was better, more intuitive, and easier to adopt.
The SRE team at Outreach predominantly focuses on infrastructure rather than writing application code. But Outreach believes in creating a culture of service ownership across the engineering organization. All of their engineers are in on-call rotations with escalation policies. That level of service ownership in production was a key success factor in their observability journey. Newer team members learned how to get started with Honeycomb over internal lunch-and-learn sessions to create a smooth onboarding process.
“I’m a big Honeycomb fan and I think everyone in engineering should use the product. Early on, after an incident, I wrote a step-by-step guide that really helped the team understand how to better use the tool.” Richard Laroque, Software Engineer, Outreach
Once the team narrowed down their toolchain and began migrating away from Elasticsearch, they also started the process of instrumenting their applications. Having opted to use both Datadog and Honeycomb, the team faced a choice for how and where to send their data. The team followed Datadog’s documentation for creating structured logs, then set up a double-write process to feed that structured log data to both Honeycomb and Datadog.
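The double-write idea can be sketched in a few lines. This is a minimal illustration, not Outreach's pipeline: the field names, the `make_event` and `double_write` helpers, and the in-memory lists standing in for the two destinations are all assumptions made for the example. A real setup would replace each sink with whatever shipper or client delivers data to that vendor.

```python
import json

# A "sink" here is any callable that accepts one serialized log line.
# Real sinks (log shippers, HTTP clients) are deliberately not shown.


def make_event(method, path, status, duration_ms, customer_id):
    """Build one structured event per HTTP request (hypothetical field names)."""
    return {
        "http.method": method,
        "http.path": path,
        "http.status": status,
        "duration_ms": duration_ms,
        "customer_id": customer_id,
    }


def double_write(event, sinks):
    """Serialize the event once and hand the identical line to every sink."""
    line = json.dumps(event, sort_keys=True)
    for sink in sinks:
        sink(line)
    return line


# Two in-memory lists stand in for the two destinations.
datadog_lines = []
honeycomb_lines = []
event = make_event("GET", "/example", 200, 41.7, "cust-123")
double_write(event, [datadog_lines.append, honeycomb_lines.append])
```

Serializing once and fanning out keeps both tools looking at the same fields, so a query in one can be cross-checked against the other.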
Primary use-case: Performance Optimization
Today, the team monitors their AWS RDS database performance by exporting slow query logs to CloudWatch. CloudWatch picks up any query that runs longer than a specified threshold. Naturally, when investigating API endpoint query bottlenecks, the team’s inclination was then to look at the slow query logs in CloudWatch.
However, CloudWatch logs only identify the slow SQL query and don’t provide any context around why latency might be occurring. It’s easy to see that a particular query is running slow and perhaps where it came from, such as a particular pod, but not why it is slow or whom it may be affecting.
Now, the Outreach team uses Honeycomb for continuous performance tuning. Wavefront may, for example, detect request latency on an API endpoint and trigger an alert. But the team doesn’t rely on Wavefront metrics to diagnose issues. Instead, they immediately turn to Honeycomb to examine traces for the affected API endpoint. They group-by relevant dimensions to understand where bottlenecks may be occurring, like Redis or MySQL, and introspect from there.
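To make the group-by step concrete, here is a small sketch of the same idea over plain Python data. It is not Honeycomb's query engine or API; the span fields (`endpoint`, `component`, `duration_ms`) and the sample values are invented for illustration.

```python
from collections import defaultdict

# Hypothetical spans; field names and values are illustrative only.
spans = [
    {"endpoint": "/example", "component": "mysql", "duration_ms": 120.0},
    {"endpoint": "/example", "component": "redis", "duration_ms": 4.0},
    {"endpoint": "/example", "component": "mysql", "duration_ms": 310.0},
    {"endpoint": "/example", "component": "redis", "duration_ms": 6.0},
    {"endpoint": "/other", "component": "mysql", "duration_ms": 9.0},
]


def group_by(spans, endpoint, key):
    """Return {key value: (count, total_ms, max_ms)} for spans on one endpoint."""
    stats = defaultdict(lambda: [0, 0.0, 0.0])
    for span in spans:
        if span["endpoint"] != endpoint:
            continue
        entry = stats[span[key]]
        entry[0] += 1
        entry[1] += span["duration_ms"]
        entry[2] = max(entry[2], span["duration_ms"])
    return {k: tuple(v) for k, v in stats.items()}


# Group the affected endpoint's spans by component, then pick the one
# that accounts for the most total time.
by_component = group_by(spans, "/example", "component")
slowest = max(by_component, key=lambda k: by_component[k][1])
```

Swapping `"component"` for another dimension, such as a customer or pod identifier, is the same operation, which is why high-cardinality fields are worth keeping on every event.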
“Honeycomb has definitely helped us when the thing that's going wrong is new. I like not being blind anymore. With Honeycomb we’re generally not running around wondering where the problem is. We can always figure out what component is causing the problem.” Richard Laroque.
Onboarding new team members
Today, some Outreach engineers are well-versed in using observability to approach any problem as new. They leave behind any prior assumptions and use Honeycomb to objectively guide their investigations. Starting with any hypothesis, they test it by running new queries that provide clues to the next investigation point: they ask questions whose answers reveal the next question, again and again.
“Once you become familiar with how to find the trace parent ID, then isolate trace spans from leaf-level spans, you are on your way to locating an issue or root cause.” Richard Laroque.
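One reading of that advice, sketched under assumptions: each span carries its own ID and its parent's ID, a root span has no parent, and a leaf-level span is one that no other span names as its parent. The `span_id` and `parent_id` field names and the sample trace are hypothetical.

```python
def classify_spans(spans):
    """Split one trace's spans into root spans and leaf-level spans.

    A root span has no parent ID; a leaf span is never named as a parent.
    """
    parent_ids = {s["parent_id"] for s in spans if s["parent_id"] is not None}
    roots = [s["span_id"] for s in spans if s["parent_id"] is None]
    leaves = [s["span_id"] for s in spans if s["span_id"] not in parent_ids]
    return roots, leaves


# A made-up trace: one request that calls a loader (which queries MySQL)
# and also reads from Redis.
trace = [
    {"span_id": "a", "parent_id": None, "name": "GET /example"},
    {"span_id": "b", "parent_id": "a", "name": "load_record"},
    {"span_id": "c", "parent_id": "b", "name": "mysql.query"},
    {"span_id": "d", "parent_id": "a", "name": "redis.get"},
]
roots, leaves = classify_spans(trace)
```

Leaf spans are usually where the actual work happens (a query, a cache read), so they are a natural place to look once the root span shows that a request was slow.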
But not all engineers know where to start. To help the team learn from one another, Outreach uses Honeycomb boards to provide useful starting points for investigation.
“A handful of boards have been set up to help new team members navigate to existing queries and use that as a jumping-off point. If someone is new to an on-call rotation, debugging can take a little longer, especially if it’s a Monday morning traffic surge. Our DevOps channel is filled with links to Honeycomb queries and teams are picking it up pretty quickly.”
Beginning to implement SLOs
Outreach is starting to use Honeycomb’s Service Level Objectives (SLO) feature. They’re taking time to thoughtfully consider the implication of measuring Service Level Indicators (SLIs) across time frames wider than what just happened in the last 10 minutes.
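The effect of the time frame is easy to see with the standard error-budget arithmetic. The sketch below is generic SLO math, not Honeycomb's implementation, and the targets, windows, and event counts are made-up numbers.

```python
def error_budget_remaining(good, total, target):
    """Fraction of the error budget left for one window.

    target is the SLO as a fraction, e.g. 0.999 means 99.9% of events
    must be good. Returns 1.0 when nothing is spent, about 0.0 when the
    budget is exactly spent, and a negative value when it is overspent.
    """
    if total == 0:
        return 1.0
    allowed_bad = (1.0 - target) * total
    actual_bad = total - good
    if allowed_bad == 0:
        # A 100% target leaves no budget at all.
        return 1.0 if actual_bad == 0 else 0.0
    return 1.0 - actual_bad / allowed_bad


# The same SLI judged over two hypothetical windows.
last_10_minutes = error_budget_remaining(good=990, total=1_000, target=0.999)
last_30_days = error_budget_remaining(good=999_500, total=1_000_000, target=0.999)
```

In this example the short window is far over budget while the 30-day window still has roughly half its budget left, which is the kind of difference the team is weighing when choosing SLO time frames.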
“Our SLOs are now hooked up to a Slack channel and are working just fine. Honeycomb got SLOs right. You have followed the practices in the Google SRE book really well. I haven’t seen anyone else in the market doing this the right way.” Richard Laroque.
Honeycomb and Debugging go hand-in-hand
Once the team started using distributed tracing, Honeycomb outpaced the usefulness of their previous Elasticsearch solution, which has now been essentially retired.
“For me, Honeycomb is the most useful debugging tool. I would struggle to get through incidents without it. Now we are at the best stage we have ever been. I love working with Honeycomb and I highly recommend it.” Richard Laroque
Hungry for more? Read the Outreach Case Study.