A New Bee’s First Oncall

By: Liz Fong-Jones | April 18th, 2019



I’m Honeycomb’s newest engineer, now on my eighth week at Honeycomb. Excitingly, I did my first week of oncall two weeks ago! Almost every engineer at Honeycomb participates in oncall, and I chose to join in the tradition. This may seem unconventional for a Developer Advocate — surely my time might be better spent holding more meetings with customers and giving more talks? Yet, I found that being oncall was the right decision for me.

As the Developer Advocate of a company focused on engineering operations, my mission is to make Honeycomb work better for those relying upon us to understand their services and tame complexity. I need to know the product inside out, and I need to represent our users’ experience internally. Our users are software and systems engineers who care deeply about production excellence. When I walk firsthand in the shoes of a Site Reliability Engineer, I make myself a better advocate for the SRE & DevOps communities’ product needs. No amount of listening to others and embedding as a trainer can substitute for being a hands-on practitioner of what I espouse. I hope to share more about the customer and product sides of my work in future blogs, but this week I wanted to share my experiences with engineering and oncall!

gif of cartoon bees moving as a group

Opportunities to improve the user experience

Charity has previously written about how Honeycomb aims to make oncall and production ownership humane and support engineers in contributing to production excellence. We use Honeycomb to observe Honeycomb, which means that we experience the same joys and pains that our paying customers do. Honeycomb lets us make long-term improvements to service reliability, do acute incident response, and also make product management decisions. Thus, my engineering time is spent making improvements to our infrastructure and being jointly accountable for its performance, as well as sanding down rough edges I encounter in the wild.

One rough edge came to my attention while I was working through the onboarding example queries that Honeycomb asks every new engineer to complete. The query editor had swallowed my incomplete Calculate() clause while I switched tabs to consult the documentation :(. My instinct was to fix the bug and leverage the experience to learn how the company uses JavaScript and React. My teammates were extremely helpful in giving me a guided tour of the code, and conducting thorough code reviews. I emerged confident that I could fix other usability issues in the future!

But I also needed to learn the production environment and infrastructure, which I did by shadowing a week of oncall alongside Emily, Alaina, and Alyson from the engineering team. So, how was my week oncall? I saw 4 total high-urgency pages & 3 low-urgency tickets across two incidents, each of which was a real end-user visible failure, and 12 low-urgency tickets related to servers flapping and requiring investigation. Although this was an unusually noisy week, all of the alerts ultimately were actionable and useful and resulted in service improvements!

A busy week oncall: the details

On Monday of the oncall week, we experienced a simultaneous watchdog timeout of both redundant nodes responsible for serving one shard of indexed data, potentially impacting about 3% of our customers. Emily and I had our phones buzz simultaneously with 3 urgent paging notifications and 2 sub-critical tickets; I happened to be on the NYC subway, and found a convenient station stop to work from (they all have wifi!). Emily focused on restoring both nodes to service and got them caught up with Kafka minutes later, while I did an evaluation of the damage. Our Honeycomb instrumentation of the query frontend revealed that only we had noticed the outage — nobody else had tried to query the index for that shard and seen errors besides our end to end blackbox monitors. No data was lost, because Kafka queued writes during the outage. Had any real users been impacted, Molly from Support would have been able to personally reach out to each impacted customer. High cardinality instrumentation for the win!

my little ponies celebrating
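For readers curious what that kind of impact check looks like, here is a minimal sketch in Python. It is not the query we actually ran in Honeycomb: the `QueryEvent` fields (`team_id`, `shard`, `error`, `synthetic`), the shard number, and the sample events are all assumptions made up for illustration. The idea is simply that when every event carries a per-customer field, finding out who was affected is a filter and a distinct count.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class QueryEvent:
    """One hypothetical query-frontend event; every field name is an assumption."""
    timestamp: datetime
    team_id: str      # high-cardinality field: one value per customer
    shard: int        # which index shard served the query
    error: bool       # whether the query failed
    synthetic: bool   # True for end-to-end blackbox monitor traffic


def impacted_teams(events, shard, start, end):
    """Return the set of customer teams that saw a query error on `shard`
    between `start` and `end`, ignoring synthetic monitor traffic."""
    return {
        e.team_id
        for e in events
        if e.shard == shard
        and e.error
        and not e.synthetic
        and start <= e.timestamp <= end
    }


# Illustrative data only, not real incident telemetry.
outage_start = datetime(2000, 1, 3, 9, 0)
outage_end = datetime(2000, 1, 3, 9, 10)

events = [
    # Our own blackbox monitor hit the broken shard and saw an error.
    QueryEvent(datetime(2000, 1, 3, 9, 2), "blackbox-monitor", 7, True, True),
    # A customer queried a different, healthy shard during the outage.
    QueryEvent(datetime(2000, 1, 3, 9, 3), "team-a", 2, False, False),
    # A customer queried the affected shard after it had recovered.
    QueryEvent(datetime(2000, 1, 3, 9, 30), "team-b", 7, False, False),
]

impacted = impacted_teams(events, shard=7, start=outage_start, end=outage_end)
print(sorted(impacted))
```

With this sample data the result is empty, which mirrors what we found: only the monitors noticed. If a real customer had hit an error on that shard during the window, their `team_id` would appear in the set, which is the list a support engineer would need for personal outreach.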

The second paging incident, on Friday, arose from a conversation Molly had with a customer whose traces were no longer loading in the UI. Being on the East Coast, I was the first oncaller awake and opened a PagerDuty incident to get Alyson’s attention as well. We quickly found the error in Sentry and identified the code that would throw errors on edge cases not present in our own dogfood datasets. Robust CI/CD enabled us to work around this problem quickly, even without rolling back and pinning previously released binaries. When you deploy every hour, rather than only releasing every day or week, it’s easy to use Git to revert a commit that caused a regression in user experience and re-deploy from the tip. That’s one piece of “always roll back the binary, never roll forward” muscle memory to unlearn from my time at Google! Time to resolve was less than an hour once we knew, but detection was slower than we’d have liked (12+ hours after release, found by customers rather than our own instrumentation).

The slew of non-urgent tickets coming in throughout the week stemmed from a kernel memory management bug in the older Linux kernel and 14.04 LTS Trusty Ubuntu release we were running. The bug would periodically cause our serving process to freeze, or worse, cause the entire VM to hang; every time a machine in our fleet got stuck, we’d have to intervene to restart it. Rather than writing an automatic machine kicker to make the restarts less toilsome, we chose to stop the hangs entirely by upgrading the kernel major version and the base VM to 18.04 LTS Bionic. This work didn’t finish during the oncall week, but the operational pain was definitely a motivation for Alaina and me to finally qualify and roll out the upgrade this week!

The value of real-world context

Being oncall has given me more empathy for the other engineers on the team, and ensures that I am considerate and careful about deploying changes. Now that I better understand prod, I feel safer making improvements even on the weeks that I’m not oncall. I know that my colleagues see me as part of the team in the trenches with them, rather than shoveling pain across the wall at them. And, I’d hope that you, our customers, feel safer having my SRE expertise and skills on the team looking after your observability systems.

If you’d like to join us at Honeycomb, we’re hiring a product marketing manager and product designer! Or perhaps you’d like to try using Honeycomb yourself! If so, dive right into your data with a Honeycomb trial, or drop us a line at sales@honeycomb.io!

Until next time!


Liz Fong-Jones

Field CTO

Liz is a developer advocate, labor and ethics organizer, and Site Reliability Engineer (SRE) with over two decades of experience. She is currently the Field CTO at Honeycomb, and previously was an SRE working on products ranging from the Google Cloud Load Balancer to Google Flights.
