I’m Honeycomb’s newest engineer, now in my eighth week here. Excitingly, I did my first week of oncall two weeks ago! Almost every engineer at Honeycomb participates in oncall, and I chose to join in the tradition. This may seem unconventional for a Developer Advocate: surely my time might be better spent holding more meetings with customers and giving more talks? Yet I found that being oncall was the right decision for me.
As the Developer Advocate of a company focused on engineering operations, my mission is to make Honeycomb work better for those relying upon us to understand their services and tame complexity. I need to know the product inside out, and I need to represent our users’ experience internally. Our users are software and systems engineers who care deeply about production excellence. When I walk firsthand in the shoes of a Site Reliability Engineer, I make myself a better advocate for the SRE & DevOps communities’ product needs. No amount of listening to others and embedding as a trainer can substitute for being a hands-on practitioner of what I espouse. I hope to share more about the customer and product sides of my work in future blogs, but this week I wanted to share my experiences with engineering and oncall!
Opportunities to improve the user experience
Charity has previously written about how Honeycomb aims to make oncall and production ownership humane and support engineers in contributing to production excellence. We use Honeycomb to observe Honeycomb, which means that we experience the same joys and pains that our paying customers do. Honeycomb lets us make long-term improvements to service reliability, do acute incident response, and also make product management decisions. Thus, my engineering time is spent making improvements to our infrastructure and being jointly accountable for its performance, as well as sanding down rough edges I encounter in the wild.
But I also needed to learn the production environment and infrastructure, which I did by shadowing a week of oncall alongside Emily, Alaina, and Alyson from the engineering team. So, how was my week oncall? I saw a total of 4 high-urgency pages and 3 low-urgency tickets across two incidents, each of which was a real end-user visible failure, plus 12 low-urgency tickets for flapping servers that required investigation. Although this was an unusually noisy week, all of the alerts were ultimately actionable and useful, and they resulted in service improvements!
A busy week oncall: the details
On Monday of the oncall week, we experienced a simultaneous watchdog timeout of both redundant nodes responsible for serving one shard of indexed data, potentially impacting about 3% of our customers. Emily and I had our phones buzz simultaneously with 3 urgent paging notifications and 2 sub-critical tickets; I happened to be on the NYC subway, and found a convenient station stop to work from (they all have wifi!). Emily focused on restoring both nodes to service and got them caught up with Kafka minutes later, while I did an evaluation of the damage. Our Honeycomb instrumentation of the query frontend revealed that only we had noticed the outage — nobody else had tried to query the index for that shard and seen errors besides our end to end blackbox monitors. No data was lost, because Kafka queued writes during the outage. Had any real users been impacted, Molly from Support would have been able to personally reach out to each impacted customer. High cardinality instrumentation for the win!
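To make the damage evaluation concrete, here is a minimal sketch of the kind of question we were asking of our query frontend events: which customers, other than our own monitors, saw errors on the affected shard? The field names (`customer_id`, `shard`, `error`), the shard number, and the sample events are all hypothetical and for illustration only; this is plain Python over a list of dicts, not Honeycomb's query API.

```python
from collections import Counter


def impacted_customers(events, shard, monitor_ids=frozenset()):
    """Count failed queries per customer on one shard, skipping synthetic monitors."""
    errors = Counter()
    for event in events:
        # Only failed queries against the affected shard matter here.
        if event["shard"] != shard or not event["error"]:
            continue
        # Ignore our own end-to-end blackbox monitors.
        if event["customer_id"] in monitor_ids:
            continue
        errors[event["customer_id"]] += 1
    return dict(errors)


# Hypothetical events: each one carries a high-cardinality customer_id field.
events = [
    {"customer_id": "blackbox-monitor", "shard": 7, "error": True},
    {"customer_id": "blackbox-monitor", "shard": 7, "error": True},
    {"customer_id": "acme", "shard": 3, "error": False},
    {"customer_id": "globex", "shard": 3, "error": False},
]

print(impacted_customers(events, shard=7, monitor_ids={"blackbox-monitor"}))
```

An empty result corresponds to what we saw: only the monitors hit the broken shard. The breakdown is only possible because every event records who made the query, which is the point of keeping high-cardinality fields.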
The second paging incident, on Friday, arose from a conversation Molly had with a customer whose traces were no longer loading in the UI. Being on the East Coast, I was the first oncaller awake and opened a PagerDuty incident to raise Alyson’s attention as well. We quickly found the error in Sentry and identified the code that would throw errors on edge cases not present in our own dogfood datasets. Robust CI/CD enabled us to work around this problem quickly, even without rolling back and pinning previously released binaries. When you deploy every hour, rather than only releasing every day or week, it’s easy to revert in Git a commit that caused a regression in user experience and re-deploy from the tip. One piece of “always roll back the binary, never roll forward” muscle memory to unlearn from my time at Google! Time to resolve was less than an hour once we knew, but detection was slower than we’d have liked (12+ hours after release, found by customers rather than our own instrumentation).
The slew of non-urgent tickets coming in throughout the week stemmed from a kernel memory management bug in the older Linux kernel and 14.04 LTS Trusty Ubuntu release we were running. The bug would periodically cause our serving process to freeze, or worse, cause the entire VM to hang; every time a machine in our fleet got stuck, we’d have to intervene to restart it. Rather than writing an automatic machine kicker to reduce the toil, we chose to stop the hangs entirely by upgrading the kernel major version and the base VM to 18.04 LTS Bionic. This work didn’t finish during the oncall week, but the operational pain was definitely a motivation for Alaina and me to finally qualify and roll out the upgrade this week!
The value of real-world context
Being oncall has given me more empathy for the other engineers on the team, and ensures that I am considerate and careful about deploying changes. Now that I better understand prod, I feel safer making improvements even on the weeks that I’m not oncall. I know that my colleagues see me as part of the team in the trenches with them, rather than shoveling pain across the wall at them. And, I’d hope that you, our customers, feel safer having my SRE expertise and skills on the team looking after your observability systems.
If you’d like to join us at Honeycomb, we’re hiring a product marketing manager and product designer! Or perhaps you’d like to try using Honeycomb yourself! If so, dive right into your data with a Honeycomb trial, or drop us a line at email@example.com!
Until next time!