As engineering teams shift from delivering services on monolithic architectures to microservices and even serverless environments, developers are no longer responsible only for creating and maintaining their code. Shared ownership has become the new normal (or is at least trending that way), and developers now respond to production incidents and, in some cases, join the on-call rotation.
Incidents vary in impact, of course, but they all take time away from innovation and building new capabilities, and that time suck drags down the productivity of the whole team.
The old rules and tools don’t apply.
With customers, revenue, and reputation at risk, speed and collaboration are critical to effective incident response. Yet most teams rely on monitoring and investigation tools, such as APM, that are siloed, use different user interfaces and query languages, and each generate their own alerts with limited context. Fixed data models, pre-aggregated data, and limited dimensions or cardinality leave teams struggling to piece together the full story, leading to guesswork and delays.
“There was no real way to find possible culprits with our classic APM. We had to know what we needed to find before we could find it—a dead end.”
–David Laperle | Technical Producer | Behaviour | Honeycomb Customer
Too many incidents look like this:
Detection (aka Something unusual is going on with production!)
- A high volume of alerts with little context that are not actionable
- Alert storms page you out of bed, only to find that it’s not critical and can wait until morning
- Engineers and teams develop alert fatigue and potentially miss real issues
Triage (aka Let’s find out what exactly is going on and who’s impacted)
- Is anything really broken? Multiple engineers and/or SRE/Ops try to validate issues
- Standalone or bolted-together tools make investigation and collaboration difficult
- What’s the magnitude of impact? Every user, a segment, or just one? All services or something more isolated? Back end or front end? These questions are very difficult to answer with limited visibility into high-cardinality data
- Redirected team resources, shifting priorities, wasted cycles & burnout
Fix (aka How can we resolve this & get back to a desired state?)
- Piecing together the full picture from separate tools leads to guesswork and more delay.
- Hard to collaborate. Handoffs break down due to siloed tools & a lack of ongoing learning.
- Larger issues become tech debt, which forces a shift in priorities
- If affected customers can’t be easily identified, customer support is impacted as well
Retrospective (aka What did we learn, and how can we improve?)
- Can’t easily piece together analysis and restoration steps to review & improve alerts and playbooks
- Insufficient detail in bugfix tickets due to lack of full-fidelity granularity across components and functions
- Limited access to historical reports, so critical details get lost when developers need them
Analysis speed is critical, and it takes high-cardinality, full-fidelity data
Honeycomb’s underlying column-oriented data store ingests rich events, giving you full-fidelity (non-aggregated), high-cardinality data at extremely fast ingest and query speeds.
With an intuitive UI and integrated query builder, ask new questions across all your data in one place. View end-to-end transactions across every customer, service, and component in your service/app. Investigate and troubleshoot at unprecedented speed.
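To make “rich events” concrete, here is a minimal sketch of what a single wide event for one request might look like. The field names are hypothetical illustrations, not a required schema or Honeycomb’s API; the point is that fields like a per-request trace ID or a per-customer user ID have very high cardinality, which is exactly what lets you slice an incident down to specific customers and requests instead of guessing from pre-aggregated dashboards.

```python
# Illustrative sketch: one "wide event" capturing a single request end-to-end.
# All field names below are hypothetical examples, not a prescribed schema.
event = {
    "timestamp": "2024-05-01T12:00:00Z",
    "service.name": "checkout",
    "trace.trace_id": "7f2a9c0d41b8",  # unique per request: very high cardinality
    "user.id": "cust_48213",           # one value per customer: high cardinality
    "http.route": "/cart/checkout",
    "http.status_code": 500,
    "duration_ms": 1834.2,
    "db.query_count": 14,
    "build.id": "2024-05-01.3",
}

# The high-cardinality fields are the ones worth grouping and filtering on
# during an incident: "which customers, which requests" rather than averages.
high_cardinality_fields = [k for k in event
                           if k in ("trace.trace_id", "user.id")]
print(high_cardinality_fields)
```

In a pre-aggregated metrics system, fields like `user.id` would typically be dropped or bucketed; keeping them on every event is what makes questions like “is this one customer or everyone?” answerable in a single query.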
Every query by every Honeycomb user is saved forever and searchable in Query History, so you don’t have to ask anyone to stop, save, and send while they’re in the middle of an incident. Retrospectives are easy with Query History! Plus, bug fixes carry all the critical details developers need to stay focused, with less toil. Teammates level up by reviewing each other’s investigations and saved queries.
“I keep thinking back to older problems–many took days or weeks to understand. We could have solved them in moments with Honeycomb.”
–Grace | Developer | Sedna | Honeycomb Customer
Incident response with Honeycomb leaps the whole team forward:
- SLOs, configured for your business, tell you immediately what caused the alert and, more importantly, which events may be causing the burn-down
- Everyone has visibility into SLO charts and error budget burndown
- BubbleUp hones in on the problem, highlighting exactly where it is
- Fewer unnecessary alerts, reduced fatigue and burnout
- Views across unified data types (events, logs, traces) give highly-actionable context, helping with quick triage
- The highest cardinality data makes issue validation fast and reliable
- Single UI makes investigation easier
- Honeycomb SLO enables prioritization based on business needs
- Shared reports put all eyes on the same data, reducing guesswork and simplifying discussion of issues and next steps
- Effortless collaboration with Honeycomb Query History
- Transparent and effective handoffs using Slack or email
- Determine incident contributors and triggers in minutes instead of hours (see our customer case studies and our own transparent incident reviews)
- Identify affected customers and failed transactions in real time for immediate customer communications
- Infinite cardinality and full-fidelity data make retrospectives more accurate and valuable
- Automatic Query History of all steps enables improved alerts and playbooks
- Permanent history, so critical details never age out while developers still need them for bugfixes
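The error-budget burndown mentioned above follows standard SRE arithmetic: an SLO target like 99.9% success implicitly allows 0.1% of requests to fail in a window, and that allowance is the budget an incident spends. The sketch below illustrates that arithmetic only; it is not Honeycomb’s implementation, and the function name and numbers are made up for the example.

```python
# Minimal sketch of standard SLO error-budget arithmetic
# (illustrative only, not Honeycomb's internal implementation).
def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Fraction of the error budget still unspent (negative once blown)."""
    allowed_failures = (1.0 - slo_target) * total  # budget for this window
    return (allowed_failures - failed) / allowed_failures

# Example: 1,000,000 requests at a 99.9% target -> 1,000 allowed failures.
# An incident that failed 250 requests has spent a quarter of the budget.
remaining = error_budget_remaining(0.999, 1_000_000, 250)
print(f"{remaining:.0%} of the error budget remains")
```

Alerting on the budget’s burn rate, rather than on raw error counts, is what lets a team distinguish “wake someone up” from “this can wait until morning.”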
“It’s amazing going from other tools to Honeycomb, Honeycomb just shows me the information that is relevant to me immediately. Ridiculously powerful.”
–Matt Button | Infrastructure Engineer | Geckoboard | Honeycomb Customer
Honeycomb’s intuitive user interface gets you started off on the right foot; then Heatmaps, Tracing, BubbleUp, and Markers do much of the heavy lifting for you, quickly identifying outliers such as affected users and systems and showing when code and system changes occurred.
Honeycomb is the fastest way to visualize, understand, and debug software
Ready to dive in and see what’s special about your data? Sign up for a free trial!
Not ready to dive in yet? “Play with Honeycomb” to learn more.