Part 1/5: Asking Better Questions
By Charity Majors | Last modified on April 18, 2022
Any mature production system is likely to have hundreds of thousands if not millions of metrics, most of which never get looked at by a human.
Metrics and logs are no longer human-scale; they are machine-scale.
We have typically coped with this madness by crafting lots of human-scale dashboards — artisanal, bespoke, handcrafted dashboards made of metrics. And then dashboards of dashboards to track your dashboards … sigh.
Dashboards all the way down
Hey, dashboards are awesome. We totally need dashboards. They tell us lots of important things at a glance, and they help us cognitively process lots of extremely detailed information about the state of the world.
But most dashboards are fixed, static views that were built to surface particular failures. You predict a component is going to break, so you do what? You make a dashboard to help you visualize it or debug it when it does. Every postmortem you’ve ever been in has probably had an action item called “create a dashboard to find this problem faster.”
Dashboards are a terrific view on reality, but they are not a good debugging tool, because they lock you into a set of assumptions.
So what happens when you can no longer predict all (or even most) of the failures? Well … stop trying. Instead, get used to asking questions about your systems. Interactivity is no longer an optional, nice-to-have feature.
Start asking questions
Instead of scrolling through static dashboards, get used to interacting with your systems — asking questions, refining them, and treating it like an interactive service instead of a flat view on a TV screen.
Microservices, containers, ephemeral instances, schedulers, serverless models, functions as a service, third-party hosted databases, polyglot persistence, platforms connected to other platforms with a variety of glue and balancers: today’s infrastructure is exponentially more complex than yesterday’s.
Honeycomb addresses this by inviting you to play around and explore your data with wide, rich events. We want you to toss in as many attributes and as much context as possible for every dataset. You don’t pay a performance penalty for more details, even tens of thousands per event, whether sparse or full.
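To make "wide, rich events" a little more concrete, here is a minimal Python sketch of what one such event might look like. The `build_event` helper and every field name in it are hypothetical, chosen for illustration; this is not Honeycomb's API or schema. The idea it shows is simply one flat record per request, carrying whatever context happens to be available, with missing attributes left out rather than padded.

```python
import json
import time


def build_event(request, response, context):
    """Assemble one wide event (a flat dict) for a single request.

    All field names are hypothetical, for illustration only.
    """
    event = {
        "timestamp": time.time(),
        "service": context.get("service"),
        "hostname": context.get("hostname"),
        "build_id": context.get("build_id"),
        "http.method": request.get("method"),
        "http.path": request.get("path"),
        "http.status": response.get("status"),
        "duration_ms": response.get("duration_ms"),
        "user_id": request.get("user_id"),
        "db.query_count": context.get("db_query_count"),
    }
    # Sparse is fine: drop attributes that were not available for this request.
    return {k: v for k, v in event.items() if v is not None}


event = build_event(
    request={"method": "GET", "path": "/export", "user_id": "u-123"},
    response={"status": 200, "duration_ms": 48.2},
    context={"service": "api", "hostname": "web-7"},
)
print(json.dumps(event, sort_keys=True))
```

Adding another attribute later is just another key in the dict, which is why it costs so little to capture context you are not yet sure you will need.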
If you can’t predict what convergence of problems will cause a user-impacting event, you definitely can’t predict what you’re going to need in order to diagnose it and solve it either. So just store everything! Sample vertically to control costs. And get used to asking questions.
For example, let’s contrast the processes:
Old Way: You scroll down and skim page after page of auto-generated dashboards or handwritten dashboards … or, best possible case, copy/paste a custom query using the vendor’s proprietary query language, or maybe type in a bunch of dot-delimited metric names to construct a dashboard or graph.
New Way: Using Honeycomb, start with simple entry points (like req/sec) and start adding more attributes to aggregate on, perform calculations, sort, limit, and filter. You can always get back to the raw events for the current query and eyeball the results, looking for any correlative patterns to explore visually.
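Here is a rough Python sketch of that "start simple, then refine" loop, run against a handful of in-memory events. The `query` function, its parameters, and the sample data are all assumptions made up for this example; it is a toy stand-in for the idea, not how Honeycomb's query engine works.

```python
from collections import defaultdict

# Hypothetical raw events; in practice these would come from your event store.
EVENTS = [
    {"endpoint": "/home", "status": 200, "duration_ms": 12, "user_id": "a"},
    {"endpoint": "/export", "status": 200, "duration_ms": 950, "user_id": "b"},
    {"endpoint": "/export", "status": 500, "duration_ms": 1800, "user_id": "b"},
    {"endpoint": "/home", "status": 200, "duration_ms": 15, "user_id": "c"},
    {"endpoint": "/export", "status": 200, "duration_ms": 40, "user_id": "a"},
]


def query(events, group_by=(), where=None, order_by="count", limit=10):
    """Filter events, group them, compute count and average duration, sort, limit."""
    groups = defaultdict(list)
    for e in events:
        if where is None or where(e):
            key = tuple(e.get(field) for field in group_by)
            groups[key].append(e)
    rows = []
    for key, members in groups.items():
        row = dict(zip(group_by, key))
        row["count"] = len(members)
        row["avg_duration_ms"] = sum(m["duration_ms"] for m in members) / len(members)
        row["events"] = members  # keep the raw events so you can eyeball them
        rows.append(row)
    rows.sort(key=lambda r: r[order_by], reverse=True)
    return rows[:limit]


# Start with a simple entry point: how many requests are there overall?
overall = query(EVENTS)

# Refine: break down by endpoint, slowest first.
by_endpoint = query(EVENTS, group_by=("endpoint",), order_by="avg_duration_ms")

# Refine again: only the slow endpoint, broken down by user.
by_user = query(
    EVENTS,
    group_by=("user_id",),
    where=lambda e: e["endpoint"] == "/export",
    order_by="avg_duration_ms",
)

for row in by_user:
    print(row["user_id"], row["count"], row["avg_duration_ms"])
```

Each step reuses the same raw events and only changes the question, and every result row still carries the underlying events, so you can drop back to them whenever an aggregate looks odd.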
You might be groaning and thinking this sounds harder, but OMG no, it is not! It may not be what you’re used to, but it’s not harder, and it saves SO much time and energy once you’re used to it. It prevents dashboard blindness, where we tend to forget a thing exists if we didn’t visualize it.
We have optimized Honeycomb for speed, for rapid iteration, for explorability. Waiting even 10-15 seconds for a view to load will cut your focus, will take you out of the moment, will break the spell of exuberant creative flow.
Honeycomb can make you a better engineer.
Interacting with your systems this way will make you a better engineer. It builds your spidey-sense about how your complex systems are going to interact and behave. And this is why you should start thinking this way from the beginning.
Yeah, you can go back and reinstrument your code and build pipelines for your logs to Honeycomb later. You can convince your teams to learn a new way of interacting with their systems later. But starting out this way is SO MUCH EASIER than constructing a massive edifice of an ELK stack or a Graphite install or OpenTSDB or configuring tons of plugins.
We get so wrapped up sometimes in telling you why this is more powerful for complex systems that sometimes we forget to tell you, it’s easier to start out this way from the beginning too! SO much easier than maintaining an ELK stack, or a Graphite install, or OpenTSDB and complicated static dashboards or some other great-great descendant of RRD.
Get used to interrogating your systems from day one. Everyone who joins your team will have a rich set of reference points to start composing harder questions and looking for tricky black swan events later on.