What happens when a seasoned engineer goes on vacation?
By Deirdre Mahon | Last modified on April 1, 2021
Have you ever experienced a time when someone on your team takes off to recharge, or takes unplanned downtime away from work? It can feel a bit scary as workloads shift, priorities change, and the team has to pick up the slack. We all need to recharge, and in some orgs it's imperative. Covering for each other builds trust, which is invaluable when you're in the trenches daily, working hard.
SEDNA, a new Honeycomb customer, had this very experience. When Grace and Ammar needed time away, resolving issues and handling customer tickets became harder. Because much of the critical system information was bound up in the heads of a few key team members, newer folks didn't have the knowledge, or a place to go, to really understand what was going on. This was most pronounced during incidents. The same issues also came popping up again and again (what we call known unknowns), and when that information isn't captured and shared across the team, no one learns.
The SEDNA product team decided they needed a platform to capture all their system information and expose it to everyone across the team. The ability to run new queries means everyone learns and comes up to speed much faster. Second, they needed to be able to pinpoint exactly what's happening with their service when faced with unknown unknowns.
Human knowledge + data-driven analysis
When you combine team knowledge with the right tools, you get a powerful result. In the words of Ammar: "I started playing around with Honeycomb and realized I knew enough to be able to narrow it down based on request path, user ID, and some other fields. We had originally thought the back end was returning bad data, but we figured it out immediately with Honeycomb: unexpected web sockets were the cause... instrumentation/traceability is what solved the problem. I didn't have the institutional knowledge to tie the request parts together, but tracing got me there!"
This is a common pattern: an individual on the team needs to validate what's happening, then evaluate the extent of the issue before remediation. It usually starts with a hunch or a hypothesis. Then they follow a trail to determine the cause, figuring out whether it has occurred before or is something new. The trail may lead down the wrong path, but with the right investigation tools and ample instrumentation, they can get answers quickly. Sharing results via Slack, or giving others access to the tool, lets everyone level up and start from a place of strength the next time an incident occurs.
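The kind of instrumentation that makes this possible is emitting one wide, structured event per request, carrying fields like request path and user ID that an investigator can later filter and group by. Here is a minimal, hypothetical sketch (field names and the `handle_request` helper are illustrative, not SEDNA's actual schema or Honeycomb's SDK):

```python
import json
import time


def handle_request(path, user_id, emit=print):
    """Toy request handler that emits one wide, structured event per request.

    `emit` stands in for shipping the event to your observability backend;
    in a real service you would send it via your vendor's SDK or an agent.
    """
    start = time.monotonic()
    fields = {
        "request.path": path,  # the kind of field used to narrow an investigation
        "user.id": user_id,
        "error": None,
    }
    try:
        # ... real request handling would happen here ...
        result = {"status": 200}
    except Exception as exc:
        fields["error"] = str(exc)
        result = {"status": 500}
    fields["duration_ms"] = (time.monotonic() - start) * 1000
    fields["response.status"] = result["status"]
    emit(json.dumps(fields))
    return result
```

Because every request carries the same rich fields, a teammate without institutional knowledge can still slice by `request.path` or `user.id` and follow the trail.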
Avoid Burnout and Strive for Efficiency
The team at SEDNA can now relax, feeling less reliant on only the seasoned team members. Honeycomb also gives the team more time to innovate, with less frustration over tickets that could otherwise sit unresolved for months. Noisy alerting is a prime frustration point for many teams. Figuring out the SLOs that matter to your users helps everyone prioritize workloads by alerting only when necessary. Find out how Honeycomb uniquely addresses SLOs so you can avoid over-alerting and team burnout. Pick from the 3-part webcast series for more on SLO theory, how Clover Health picked their SLOs, and how to get started. Listen to experts Liz Fong-Jones and Google's Kristina Bennett.
Observability is a journey, and it has a net positive impact across many parts of the software lifecycle. The team at SEDNA saw results in just days. Begin with basic instrumentation, such as ELB logs, and then start resolving outstanding tickets or performance issues. You'll be surprised how quickly you can get answers.
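Starting from ELB logs can be as simple as parsing each line into structured fields you can query. A minimal sketch, assuming the Classic Load Balancer access-log layout (verify the field order against your own logs; the regex covers only a simplified prefix of the line):

```python
import re

# Simplified Classic ELB access-log fields, in documented order:
# time, elb name, client:port, backend:port, three processing times,
# ELB status, backend status, byte counts, then the quoted request.
ELB_LINE = re.compile(
    r'(?P<time>\S+) (?P<elb>\S+) (?P<client>\S+) (?P<backend>\S+) '
    r'(?P<request_processing_time>\S+) (?P<backend_processing_time>\S+) '
    r'(?P<response_processing_time>\S+) (?P<elb_status_code>\d{3}) '
    r'(?P<backend_status_code>\S+) (?P<received_bytes>\d+) (?P<sent_bytes>\d+) '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)"'
)


def parse_elb_line(line):
    """Turn one access-log line into a dict of queryable fields, or None."""
    m = ELB_LINE.match(line)
    if not m:
        return None
    fields = m.groupdict()
    # Numeric conversion so you can aggregate latency, not just grep it.
    fields["backend_processing_time"] = float(fields["backend_processing_time"])
    return fields
```

Once each line is a dict, you can ship it as an event and immediately ask questions like "which URLs have the slowest backend processing time?" instead of eyeballing raw log files.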