Part 1/5: Asking Better Questions

By Charity Majors  |   Last modified on April 18, 2022

Any mature production system is likely to have hundreds of thousands if not millions of metrics, most of which never get looked at by a human.
Metrics and logs are no longer human-scale; they are machine-scale.

We have typically coped with this madness by crafting lots of human-scale dashboards — artisanal, bespoke, handcrafted dashboards made of metrics. And then dashboards of dashboards to track your dashboards … sigh.

Dashboards all the way down

Hey, dashboards are awesome. We totally need dashboards. They tell us lots of important things at a glance, and they help us cognitively process lots of extremely detailed information about the state of the world.

But most dashboards are fixed, static views that were built to surface particular failures. You predict a component is going to break, so you do what? You make a dashboard to help you visualize it or debug it when it does. Every postmortem you’ve ever been in has probably had an action item called “create a dashboard to find this problem faster.”

Dashboards are a terrific view on reality, but they are not a good debugging tool, because they lock you into a set of assumptions.
So what happens when you can no longer predict all (or even most) of the failures? Well … stop trying. Instead, get used to asking questions about your systems. Interactivity is no longer an optional, nice-to-have feature.

Start asking questions

Instead of scrolling through static dashboards, get used to interacting with your systems: asking questions, refining them, and treating your tooling like an interactive service instead of a flat view on a TV screen.

Microservices, containers, ephemeral instances, schedulers, serverless models, functions as a service, third-party hosted databases, polyglot persistence, platforms connected to other platforms with a variety of glue and load balancers: today’s infrastructure is exponentially more complex than yesterday’s.

Honeycomb addresses this by inviting you to play around and explore your data with wide, rich events. We want you to toss in as many attributes and as much context as possible for every dataset. You don’t pay a performance penalty for more details, even tens of thousands per event, whether sparse or full.
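To make "wide, rich events" a little more concrete, here is a minimal sketch in plain Python. It does not use any Honeycomb SDK; the function name `build_event` and every field name in it (`service`, `endpoint`, `build_id`, `db_shard`, and so on) are hypothetical, chosen only for illustration. The idea is simply one flat record per unit of work, carrying every attribute you happen to know at the time.

```python
import json
import time


def build_event(request, response, duration_ms, extra=None):
    """Assemble one wide event: a flat dict holding all the context we have.

    All field names here are illustrative assumptions, not a required schema.
    """
    event = {
        "timestamp": time.time(),
        "service": "api",                    # hypothetical service name
        "endpoint": request["path"],
        "method": request["method"],
        "user_id": request.get("user_id"),   # may be None; sparse is fine
        "status": response["status"],
        "duration_ms": duration_ms,
    }
    # Extra context is added only when present, so events can be sparse.
    event.update(extra or {})
    return event


event = build_event(
    {"path": "/export", "method": "GET", "user_id": "u-42"},
    {"status": 200},
    183.5,
    extra={"build_id": "abc123", "db_shard": "shard-7"},
)
print(json.dumps(event, sort_keys=True))
```

Nothing about the record is decided in advance by a dashboard. Any attribute on it can become something to filter or group on later.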

If you can’t predict what convergence of problems will cause a user-impacting event, you definitely can’t predict what you’re going to need in order to diagnose it and solve it either. So just store everything! Sample vertically to control costs. And get used to asking questions.
For example, let’s contrast the processes:

Old Way: You scroll down and skim page after page of auto-generated dashboards or handwritten dashboards … or, best possible case, copy/paste a custom query using the vendor’s proprietary query language, or maybe type in a bunch of dot-delimited metric names to construct a dashboard or graph.

New Way: Using Honeycomb, start with simple entry points (like req/sec) and start adding more attributes to aggregate on, perform calculations, sort, limit, and filter. You can always get back to the raw events for the current query and eyeball the results, looking for any correlative patterns to explore visually.
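The loop of starting broad and then refining can be sketched over a handful of in-memory events. This is not Honeycomb's query engine or API, just a toy stand-in in plain Python; the `query` helper, the sample `EVENTS`, and their field names are all assumptions made for illustration, and a simple event count stands in for req/sec.

```python
from collections import defaultdict

# Hypothetical sample events; the field names are illustrative only.
EVENTS = [
    {"endpoint": "/home", "status": 200, "duration_ms": 12, "user_id": "a"},
    {"endpoint": "/export", "status": 200, "duration_ms": 950, "user_id": "b"},
    {"endpoint": "/export", "status": 500, "duration_ms": 2100, "user_id": "b"},
    {"endpoint": "/home", "status": 200, "duration_ms": 15, "user_id": "c"},
    {"endpoint": "/export", "status": 200, "duration_ms": 40, "user_id": "a"},
]


def query(events, group_by, where=lambda e: True, limit=None):
    """Filter events, group them, then compute count and average duration.

    Rows come back sorted by average duration, slowest group first.
    """
    groups = defaultdict(list)
    for e in events:
        if where(e):
            key = tuple(e.get(attr) for attr in group_by)
            groups[key].append(e["duration_ms"])
    rows = [
        {"group": key, "count": len(ds), "avg_ms": sum(ds) / len(ds)}
        for key, ds in groups.items()
    ]
    rows.sort(key=lambda r: r["avg_ms"], reverse=True)
    return rows[:limit] if limit else rows


# 1. Simple entry point: how much traffic is there overall?
overall = query(EVENTS, group_by=[])

# 2. Add an attribute to aggregate on: which endpoint is slow?
by_endpoint = query(EVENTS, group_by=["endpoint"])

# 3. Filter, break down further, and limit: who is hitting the slow endpoint?
slowest_user = query(
    EVENTS,
    group_by=["endpoint", "user_id"],
    where=lambda e: e["endpoint"] == "/export",
    limit=1,
)

for row in (overall, by_endpoint, slowest_user):
    print(row)
```

Each step reuses the same raw events and only changes the question, which is the point: no new dashboard has to exist before you can ask it.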

You might be groaning and thinking this sounds harder, but OMG no, it is not! It may not be what you’re used to, but it’s not harder, and it saves SO much time and energy once you’re used to it. It prevents dashboard blindness, where we tend to forget a thing exists if we didn’t visualize it.

We have optimized Honeycomb for speed, for rapid iteration, for explorability. Waiting even 10-15 seconds for a view to load will cut your focus, will take you out of the moment, will break the spell of exuberant creative flow.

Honeycomb can make you a better engineer.

Interacting with your systems this way will make you a better engineer. It builds your spidey-sense about how your complex systems are going to interact and behave. And this is why you should start thinking this way from the beginning.

Yeah, you can go back and reinstrument your code and build pipelines for your logs to Honeycomb later. You can convince your teams to learn a new way of interacting with their systems later. But it is SO MUCH EASIER to start out this way than to construct a massive edifice of an ELK stack or a Graphite install or OpenTSDB, or to configure tons of plugins.

We get so wrapped up in telling you why this is more powerful for complex systems that sometimes we forget to tell you it’s easier to start out this way from the beginning, too! SO much easier than maintaining an ELK stack, or a Graphite install, or OpenTSDB and complicated static dashboards, or some other great-great descendant of RRD.

Get used to interrogating your systems from day one. Everyone who joins your team will have a rich set of reference points to start composing harder questions and looking for tricky black swan events later on.

