Take huge leaps with Honeycomb for Incident Response

By: Guest Blogger | April 30, 2020

Debugging

Observability

As engineering teams shift from delivering services on monolithic architectures to microservices and even serverless environments, developers are no longer responsible only for creating and maintaining their code. Shared ownership has become the new normal (or is at least trending that way), so developers now respond to production incidents and, in some cases, join the on-call rotation.

Incidents vary in impact, but every one of them takes time away from innovation and from building new capabilities. That lost time hurts the productivity of the whole team.

The old rules and tools don’t apply.

With customers, revenue, and reputation at risk, speed and collaboration are critical to effective incident response. Today, the monitoring and investigation tools most teams rely on, such as APM, are siloed: they have different user interfaces and query languages, and each generates its own alerts with limited context. Fixed data models, data aggregation, and limited dimensions or cardinality leave teams struggling to piece together the full story, leading to guesswork and delays.

“There was no real way to find possible culprits with our classic APM. We had to know what we needed to find before we could find it—a dead end.”

–David Laperle | Technical Producer | Behaviour Honeycomb Customer

Too many incidents look like this:

Detection (aka Something unusual is going on with production!)

A high volume of alerts with little context that are not actionable
Alert storms page you out of bed, only to find that it’s not critical and can wait until morning
Engineers and teams develop alert fatigue and potentially miss real issues

Triage (aka Let’s find out what exactly is going on and who’s impacted)

Is anything really broken? Multiple engineers and/or SRE/Ops try to validate issues
Standalone or bolted-together tools make investigation and collaboration difficult
What’s the magnitude of impact? Every user, a segment, or just one? All services or something more isolated? Back-end or front-end? Very difficult to pinpoint with limited visibility into high-cardinality data
Redirected team resources, shifting priorities, wasted cycles & burnout

Fix (aka how can we resolve this & get back to a desired state?)

Piecing together the full picture from separate tools leads to guesswork and more delay.
Hard to collaborate. Handoffs break down due to siloed tools & lack of ongoing learning.
Larger issues become tech debt, which causes a shift in priorities
If affected customers can’t be easily identified, customer support becomes impacted

Retrospective (aka what did we learn, how can we improve)

Can’t easily piece together analysis and restoration steps to review & improve alerts and playbooks
Insufficient detail in bugfix tickets due to lack of full-fidelity granularity across components and functions
Limited access to historical reports, so critical details get lost when developers need them

Analysis Speed is critical with high cardinality, full-fidelity data

Honeycomb’s underlying column-oriented data store ingests rich events that give you the highest-cardinality, full-fidelity (non-aggregated) data at extremely fast ingest and query speeds.

With an intuitive UI and integrated query builder, ask new questions across all your data in one place. View end-to-end transactions across every customer, service, and component in your service/app. Investigate and troubleshoot at unprecedented speed.

Every query by every Honeycomb user is saved forever and searchable in Query History, so you don’t have to ask anyone to stop, save and send when they’re in the middle of an incident. Retrospectives are easy with Query History! Plus, bug-fixes have all the critical details to help developers focus – less toil. Teammates level up by reviewing each other’s investigations and saved queries.

“I keep thinking back to older problems–many took days or weeks to understand. We could have solved them in moments with Honeycomb.”

–Grace | Developer | Sedna Honeycomb Customer

Incident response with Honeycomb leaps the whole team forward:

Detection

SLOs, configured for your business, tell you immediately what caused the alert and, more importantly, which events may be causing the burndown
Everyone has visibility into SLO charts and error budget burndown
BubbleUp homes in to highlight where the problem is
Fewer unnecessary alerts, reduced fatigue and burnout
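
To make the idea of an error budget burndown concrete, here is a minimal sketch of the arithmetic behind it. This is not Honeycomb's implementation or API; the `SloWindow` class, its field names, and the sample numbers are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """Hypothetical summary of one SLO window (names are illustrative)."""
    target: float       # e.g. 0.999 means 99.9% of events must succeed
    total_events: int   # events that counted toward the SLO in the window
    failed_events: int  # events that did not meet the SLO


def error_budget_remaining(window: SloWindow) -> float:
    """Return the unspent fraction of the error budget (negative if overspent)."""
    if not 0.0 < window.target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    if window.total_events == 0:
        return 1.0  # nothing has happened yet, so nothing has been spent
    # The budget is the number of failures the target tolerates.
    allowed_failures = (1.0 - window.target) * window.total_events
    return 1.0 - window.failed_events / allowed_failures


# Assumed example: 1,000,000 events against a 99.9% target allows about
# 1,000 failures; 250 failures so far leaves three quarters of the budget.
window = SloWindow(target=0.999, total_events=1_000_000, failed_events=250)
remaining = error_budget_remaining(window)
print(f"{remaining:.0%} of the error budget remains")
```

An alert tied to how quickly this number falls, rather than to every individual failure, is one way to page people only when the budget is actually at risk.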

Triage

Views across unified data types (events, logs, traces) give highly-actionable context, helping with quick triage
The highest cardinality data makes issue validation fast and reliable
Single UI makes investigation easier
Honeycomb SLO enables prioritization based on business needs

Fix

Shared reports put all eyes on the same data, which reduces guesswork and simplifies discussion of issues and next steps
Effortless collaboration with Honeycomb Query History
Transparent and effective handoffs using Slack or email
Determine incident contributors and triggers in minutes instead of hours (see our customer case studies and our own transparent incident reviews)
Identify affected customers and failed transactions in real-time for immediate customer communications
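
Identifying affected customers depends on keeping a high-cardinality field such as a customer ID on every event instead of aggregating it away. The sketch below shows that idea on plain Python dictionaries; it does not use Honeycomb's query API, and the field names (`customer_id`, `endpoint`, `status`) and sample events are assumptions chosen for illustration.

```python
from collections import Counter

# Hypothetical raw events, one per request, with un-aggregated fields.
events = [
    {"customer_id": "cust-17", "endpoint": "/checkout", "status": 500},
    {"customer_id": "cust-17", "endpoint": "/checkout", "status": 502},
    {"customer_id": "cust-42", "endpoint": "/checkout", "status": 500},
    {"customer_id": "cust-42", "endpoint": "/cart", "status": 200},
    {"customer_id": "cust-99", "endpoint": "/cart", "status": 200},
]


def is_server_error(event):
    """Treat any 5xx status as a failed transaction (an assumed definition)."""
    return event["status"] >= 500


def affected_customers(events, is_failure=is_server_error):
    """Count failed events per customer, most affected customer first."""
    failures = Counter(e["customer_id"] for e in events if is_failure(e))
    return failures.most_common()


for customer, count in affected_customers(events):
    print(f"{customer}: {count} failed transactions")
```

Because each event still carries its own customer ID, the list of customers to contact falls out of a single group-and-count rather than a search across separate tools.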

Retrospective

Infinite cardinality and full-fidelity data make retrospectives more accurate and valuable
Automatic Query History of all steps enables improved alerts and playbooks
Permanent history, so critical details never age out while developers still need them for bugfixes

“It’s amazing going from other tools to Honeycomb, Honeycomb just shows me the information that is relevant to me immediately. Ridiculously powerful.”

–Matt Button | Infrastructure Engineer | Geckoboard Honeycomb Customer

Honeycomb’s intuitive user interface gets you started on the right foot. From there, Heatmaps, Tracing, BubbleUp, and Markers do much of the heavy lifting for you, quickly identifying outliers such as affected users and systems, and showing when code and system changes occurred.
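
One way to picture outlier identification is to compare how often each field value appears among the slow or failing events against how often it appears in everything else. The sketch below illustrates that general comparison only; it is not Honeycomb's BubbleUp algorithm, and the function names, field names, and sample events are hypothetical.

```python
from collections import Counter


def value_shares(events, field):
    """Return the share of events carrying each value of the given field."""
    counts = Counter(e.get(field) for e in events)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()} if total else {}


def rank_differences(outliers, baseline, fields):
    """Rank (field, value) pairs by how over-represented they are in outliers."""
    ranked = []
    for field in fields:
        outlier_shares = value_shares(outliers, field)
        baseline_shares = value_shares(baseline, field)
        for value, share in outlier_shares.items():
            diff = share - baseline_shares.get(value, 0.0)
            ranked.append((field, value, diff))
    return sorted(ranked, key=lambda item: item[2], reverse=True)


# Assumed sample data: the outliers all come from build "b2".
baseline = [
    {"build_id": "b1", "region": "us-east"},
    {"build_id": "b1", "region": "eu-west"},
    {"build_id": "b2", "region": "us-east"},
    {"build_id": "b1", "region": "eu-west"},
]
outliers = [
    {"build_id": "b2", "region": "us-east"},
    {"build_id": "b2", "region": "eu-west"},
]

top = rank_differences(outliers, baseline, ["build_id", "region"])[0]
print(top)
```

In this made-up data the region split is identical in both groups, so the build ID is the field that stands out, which is the kind of pointer that tells an engineer where to look first.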

Honeycomb is the fastest way to visualize, understand, and debug software

Ready to dive in and see what’s special about your data? Sign up for a free trial!

Not ready to dive in yet? “Play with Honeycomb” to learn more.

Want to know more?

Talk to our team to arrange a custom demo or for help finding the right plan.