So You Want To Build An Observability Tool…

animation of 3 pillars blowing up

I’ve said this before, but I’m saying it again: observability is not a synonym for monitoring, and there are no three pillars. The pillars are bullshit.

Briefly: monitoring is how you manage your known-unknowns, which involves checking values for predefined thresholds, creating actionable alerts and runbooks and so forth. Observability is how you handle unknown-unknowns, by instrumenting your code and capturing the right level of detail that lets you answer any question, understand any state your system has gotten itself into, without shipping new code to handle that state (since that would imply that you could have predicted it or understood it in advance, which would make it a known-unknown).

Ok. If you buy my story about how systems are changing and why we need observability tooling, then here are the characteristics I believe those tools must have in order to deliver on observability. Honeycomb was built to this spec.
It is possible to kludge together an open source version of most of the pieces. The best one I know of was done with just feature flags and structured logging.

So, what do I need to include?

It may not be as catchy as “three pillars”, but in order to claim your tool delivers observability, it should support/deliver the following:

  1. Arbitrarily-wide structured raw events
  2. Context persisted through the execution path
  3. Without indexes or schemas
  4. High-cardinality, high-dimensionality
  5. Ordered dimensions for traceability
  6. Client-side dynamic sampling
  7. An exploratory visual interface that lets you slice and dice and combine dimensions
  8. In close to real-time

If you’re looking to buy from a vendor who says they provide observability tooling, do not take that statement at face value. Ask them closely about the technical merits of their solution. If it is metrics based, aggregates at write time, etc — that’s a nonstarter. It does not deliver as promised. You will not be able to introspect your system and understand unknown-unknown system states.

And if you’re looking to develop a competitive product or an open source solution … here are my notes, godspeed. I believe in sharing, I think execution is the true differentiator.

Observability is based upon events

Why? Because observability is about understanding performance from the perspective of the user (nines don’t matter if users aren’t happy). This means that what you actually care about is how well each request is able to execute from end to end. It doesn’t matter if your backend metrics are reporting 99.99% availability, if your DNS record has expired. It doesn’t matter if your database is practically idle and your apps have lots of capacity…if there’s a logic problem in one service and all your requests are writing to /dev/null. Metrics don’t matter, the request is what matters. So you need to be collecting telemetry from the perspective of the request, and aggregating it by request.

Here’s how Honeycomb does it. When a request enters a service, we initialize an empty Honeycomb blob and pre-populate it with everything we know or can infer about the request: parameters passed in, environment, language internals, container/host stats, etc. While the request is executing, you can stuff in any further information that might be valuable: user ID, shopping cart ID — anything that might help you find and identify this request in the future, stuff it all in the Honeycomb blob. When the request is ready to exit or error, we ship it off to Honeycomb in one arbitrarily wide structured event, typically 300-400 dimensions per event for a mature instrumented service.
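To make the lifecycle above concrete, here is a minimal sketch using only the Python standard library. It is not Honeycomb's client code: `WideEvent`, `handle_request`, the field names, and the in-memory `sink` list are all hypothetical stand-ins chosen for illustration.

```python
import json
import os
import socket
import time


class WideEvent:
    """One structured event per request per service (hypothetical helper)."""

    def __init__(self, service, request_params):
        self._start = time.monotonic()
        # Pre-populate with what is known or inferable when the request enters.
        self.fields = {
            "service": service,
            "hostname": socket.gethostname(),
            "pid": os.getpid(),
        }
        for key, value in request_params.items():
            self.fields[f"request.{key}"] = value

    def add(self, **fields):
        # No schema: any key can be stuffed in at any point during execution.
        self.fields.update(fields)

    def finish(self, error=None):
        # Called exactly once, when the request exits or errors.
        self.fields["duration_ms"] = (time.monotonic() - self._start) * 1000
        self.fields["error"] = None if error is None else repr(error)
        return json.dumps(self.fields, default=str)


def handle_request(params, sink):
    event = WideEvent("checkout", params)
    error = None
    try:
        # Business logic would run here, adding context as it learns things.
        event.add(user_id=params.get("user_id"), cart_id="cart-123")
        event.add(items_in_cart=3)
    except Exception as exc:
        error = exc
        raise
    finally:
        # Ship one wide event per request, whether it succeeded or failed.
        sink.append(event.finish(error))


sink = []  # stand-in for whatever transport ships events to the backend
handle_request({"path": "/checkout", "user_id": "u-42"}, sink)
```

The point of the shape is that everything learned during the request ends up in a single record, so no later join is needed to work out which details belong together.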

So, so much of debugging is about spotting outliers and correlating them, or finding patterns. There’s a spike of errors — what do they all have in common? Which brings us to the closely related point:

Observability is not metrics-based

The word “metrics” has two different meanings. One is a generic synonym for telemetry (“All the metrics are up and to the right, boss”). The other is the statsd-style metric that whole industries of time-series databases and monitoring tools have been built around. A statsd-style metric consists of a single number, with a bunch of tags appended to it so you can find it and group it with other metrics.

animation of generic metrics dashboard

Observability tooling is never based on statsd-style metrics. A request might fire off hundreds of metrics over the course of its execution, but those metrics are all disconnected from each other. You’ve lost the connective tissue of the event, so you can never reconstruct exactly which metrics belong to the same request.

So if you’re looking at the spike of errors, and you want to know which requests those errors happened on, and if those requests were all bound for db101, or if they happened to be on hosts that just ran chef, etc — you cannot ask these kinds of iterative questions of a system built on metrics. You can only ask predefined questions, never follow a trail of breadcrumbs.
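Here is a small sketch of what following that trail of breadcrumbs looks like when the data is events rather than disconnected counters. The events and field names (`db_host`, `chef_ran_recently`) are made up for illustration; the helper `count_by` is hypothetical.

```python
from collections import Counter

# Five raw events, one per request. With statsd-style counters you would
# only have totals such as errors=3 and db101_queries=3, with no way to
# tell which error belonged to which database or host.
events = [
    {"request_id": "r1", "status": 200, "db_host": "db101", "chef_ran_recently": False},
    {"request_id": "r2", "status": 500, "db_host": "db101", "chef_ran_recently": True},
    {"request_id": "r3", "status": 500, "db_host": "db102", "chef_ran_recently": True},
    {"request_id": "r4", "status": 500, "db_host": "db101", "chef_ran_recently": True},
    {"request_id": "r5", "status": 200, "db_host": "db102", "chef_ran_recently": False},
]


def count_by(rows, key):
    """Group rows by the value of one field and count each group."""
    return Counter(row.get(key) for row in rows)


# Question 1: which requests did the errors happen on?
errors = [e for e in events if e["status"] >= 500]

# Question 2, prompted by the first answer: were they all bound for db101?
errors_by_db = count_by(errors, "db_host")

# Question 3, because the answer to 2 was "no": had those hosts just run chef?
errors_by_chef = count_by(errors, "chef_ran_recently")
```

None of the three questions had to be declared before the data was written; each one was chosen after looking at the previous answer.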

Observability requires high cardinality and high dimensionality

In fact, every publicly available metrics solution or TSDB that I’m aware of only supports low cardinality dimensions in their metrics tags.

Cardinality refers to the number of unique elements in a set. So any unique ID will be the highest possible cardinality for that dimension. (Social security numbers: high cardinality. First/last names: slightly lower, but still high cardinality. Gender: low cardinality. Species … well species presumably only has one value, making it lowest cardinality of all.)
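As a quick illustration of that definition, the sketch below counts the distinct values per dimension over a tiny, invented data set. The `cardinality` helper and the records are assumptions made up for the example.

```python
def cardinality(rows, key):
    """Number of unique values seen for one dimension."""
    return len({row[key] for row in rows if key in row})


users = [
    {"user_id": "u1", "last_name": "Nguyen", "gender": "f", "species": "human"},
    {"user_id": "u2", "last_name": "Smith", "gender": "m", "species": "human"},
    {"user_id": "u3", "last_name": "Smith", "gender": "f", "species": "human"},
    {"user_id": "u4", "last_name": "Okafor", "gender": "m", "species": "human"},
]

by_cardinality = {
    key: cardinality(users, key)
    for key in ("user_id", "last_name", "gender", "species")
}
```

Even with four rows the ordering matches the prose: the unique ID has as many values as there are rows, and species has exactly one.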

Once upon a time, this wasn’t such a big deal. You had THE database, THE app tier, THE web tier. But now, just about every dimension you care about is high cardinality, and you care about being able to string a lot of them along together too (high dimensionality)! So much of debugging is about finding needles in the haystack, and high cardinality is what allows you to track down very fine-grained needles, like “find me all the Canadian users on iOS11 version 11.0.4 using the French language pack, who installed the app last Tuesday, running firmware version 1.4.101, and who are storing photos on shard3 in region us-west-1”. Every single one of those constraints is a high cardinality dimension.

Observability requires arbitrarily-wide structured events

It’s actually very important that these events be arbitrarily wide. Why? Because otherwise you are limiting the dimensionality of the data.

Schemas are the antithesis of observability. Schemas say, “I can predict in advance which details I will need, and it will be these.” Bullshit. Unexpected data needs to be captured too. People need to be incentivized to stuff in any random detail that pops up, at any time. This means schemas and other strict limitations on data types or shapes are also anathema to observability.
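One way to accept arbitrarily wide events without a schema is to let a column come into existence the first time any event mentions it. The toy below sketches that idea in memory. It is an assumption-laden illustration, not a description of how Honeycomb's storage engine works, and `TinyColumnStore` and its methods are hypothetical names.

```python
class TinyColumnStore:
    """Toy schemaless, index-free column layout (illustrative only)."""

    def __init__(self):
        self.columns = {}  # column name -> {row_id: value}, sparse per column
        self.row_count = 0

    def append(self, event):
        # Unknown keys simply create a new column; nothing is rejected.
        row_id = self.row_count
        for key, value in event.items():
            self.columns.setdefault(key, {})[row_id] = value
        self.row_count += 1
        return row_id

    def scan(self, column):
        # Reads touch only the named column, never whole rows.
        return self.columns.get(column, {})

    def count_where(self, filters, group_by):
        """Count rows matching equality filters, grouped by one column."""
        matching = None
        for column, wanted in filters.items():
            ids = {rid for rid, value in self.scan(column).items() if value == wanted}
            matching = ids if matching is None else matching & ids
        if matching is None:
            matching = set(range(self.row_count))
        groups = {}
        group_column = self.scan(group_by)
        for rid in matching:
            key = group_column.get(rid)  # None when the event lacked the field
            groups[key] = groups.get(key, 0) + 1
        return groups


store = TinyColumnStore()
store.append({"status": 500, "shard": "shard3", "os": "iOS"})
store.append({"status": 200, "shard": "shard3"})
store.append({"status": 500, "shard": "shard1", "firmware": "1.4.101"})
store.append({"status": 500, "shard": "shard3", "firmware": "1.4.101"})

errors_by_shard = store.count_where({"status": 500}, "shard")
errors_by_firmware = store.count_where({"status": 500}, "firmware")
```

Events that never set `firmware` cost nothing in that column, which is what makes it cheap for people to stuff in any random detail.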

Observability ordering and traceability

If you’re capturing one wide event per request per service, congratulations. You have recreated much of the ability to trace and debug a request that you once had in your monolithic app. Yay! Now all you need to add is some incrementing fields for ordering and tracing. Honeycomb appends these automagically if you use our Beelines.

animation of a robot searching a library

It’s not hard to see why this matters: you need to be able to determine the path of a request to reconstruct what has happened.
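As a rough sketch of how a few extra fields let you reconstruct that path, the example below stitches per-service events back into an ordered call tree. The field names (`trace_id`, `span_id`, `parent_id`, `start_ms`) and the `request_path` helper are assumptions for illustration; they are not necessarily the fields the Beelines append.

```python
from collections import defaultdict

# One wide event per request per service, deliberately out of order.
events = [
    {"trace_id": "t1", "span_id": "c", "parent_id": "b", "service": "db-proxy", "start_ms": 12},
    {"trace_id": "t1", "span_id": "a", "parent_id": None, "service": "api-gateway", "start_ms": 0},
    {"trace_id": "t1", "span_id": "b", "parent_id": "a", "service": "checkout", "start_ms": 5},
    {"trace_id": "t1", "span_id": "d", "parent_id": "a", "service": "auth", "start_ms": 2},
    {"trace_id": "t2", "span_id": "x", "parent_id": None, "service": "api-gateway", "start_ms": 1},
]


def request_path(rows, trace_id):
    """Return (depth, service) pairs in execution order for one trace."""
    children = defaultdict(list)
    for row in rows:
        if row["trace_id"] == trace_id:
            children[row["parent_id"]].append(row)

    path = []

    def walk(parent_id, depth):
        # Siblings are ordered by start time; children follow their parent.
        for row in sorted(children[parent_id], key=lambda r: r["start_ms"]):
            path.append((depth, row["service"]))
            walk(row["span_id"], depth + 1)

    walk(None, 0)
    return path


t1_path = request_path(events, "t1")
```

The events themselves are unchanged wide events; tracing is just a view over them once the ordering fields are present.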

From a storage perspective, Events + Tracing = Observability

What users of your observability tool need

And now we come to the second half, which is about the user side: querying and understanding this data.

Observability requires that you not have to predefine the questions you will need to ask, or optimize those questions in advance

People will protest, “I can get those questions out of my metrics!” And sure. You can ask any of these questions with metrics or monitoring tools … if you define them in advance. Which is how we’ve limped along this far. But that only works if you predicted in advance that you would need to ask that exact question. Otherwise, you’ll need to add those metrics and restart just to ask that question. This violates our need to ask any question we need to understand our systems, without predicting that we would need to ask it in advance.

We should be gathering data in a way that lets us ask any question, without having to predict in advance. In practice this means:

Observability requires raw events and read-time aggregation

Raw events preserve the environment and characteristics of each request, and read-time aggregation is just a fancy way of saying “scan it and ask a new question every time”. So you don’t need to predict and lock yourself into a set of questions in advance.
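Here is a minimal sketch of read-time aggregation, assuming the raw events are available as a list of dicts. The `query` and `percentile` helpers and the sample data are hypothetical; a real system would scan columns on disk rather than a Python list.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile over raw values."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def query(rows, group_by, measure, pct, where=lambda row: True):
    """Scan the raw events and aggregate now, at read time."""
    groups = defaultdict(list)
    for row in rows:
        if where(row) and measure in row:
            groups[row.get(group_by)].append(row[measure])
    return {key: percentile(values, pct) for key, values in groups.items()}


raw_events = [
    {"endpoint": "/payment", "region": "us-west-1", "duration_ms": 120},
    {"endpoint": "/payment", "region": "us-east-1", "duration_ms": 80},
    {"endpoint": "/payment", "region": "us-west-1", "duration_ms": 900},
    {"endpoint": "/status", "region": "us-west-1", "duration_ms": 3},
    {"endpoint": "/status", "region": "us-east-1", "duration_ms": 4},
]

# Two different questions over the same stored data, neither planned ahead.
p50_by_endpoint = query(raw_events, "endpoint", "duration_ms", 50)
p95_west_by_endpoint = query(
    raw_events, "endpoint", "duration_ms", 95,
    where=lambda row: row["region"] == "us-west-1",
)
```

The grouping key, the filter, and the percentile are all arguments to the read, so changing the question never requires changing what was written.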

Observability precludes write-time aggregation or indexing

If you’re aggregating when you write out to disk, you may get impressive perf numbers but you will have lost the raw event for ever and ever. You will never be able to ask another new question about its data.

This is how metrics get those “averages of averages” percentile buckets for which they have such a wretched reputation. Deservedly so: they are not just meaningless, but actively misleading to any engineer trying to understand how users are experiencing their system.
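A small worked example shows how far an average of averages can drift from reality. The host names and latencies are invented for illustration.

```python
from statistics import mean

# Raw per-request latencies, before any aggregation.
latencies_ms = {
    "host-a": [10] * 99,  # busy host, every request is fast
    "host-b": [1000],     # nearly idle host, one very slow request
}

# What a write-time aggregator keeps: one number per host.
per_host_avg = {host: mean(values) for host, values in latencies_ms.items()}

# What a dashboard then computes from those stored numbers.
average_of_averages = mean(list(per_host_avg.values()))

# What users actually experienced, computable only from the raw events.
all_requests = [ms for values in latencies_ms.values() for ms in values]
true_average = mean(all_requests)
```

Here `average_of_averages` comes out at 505 ms while `true_average` is 19.9 ms, because the pre-aggregated numbers no longer carry how many requests each one represents. The same loss applies to percentiles computed per host and then averaged.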

Likewise, indexes lock you into a constrained set of queries you can run efficiently — the rest will be full scans of the raw data. For this reason, you pretty much need to use a distributed column store in order to have observability. At small scale, of course, you can get away with scanning lots of raw unindexed data. For a while. But to function effectively at larger scales:

Observability demands dynamic sampling

Why does everybody pre-aggregate? To save costs, of course. Lots of people will insist that it’s absolutely necessary, because nobody can save ALL the detail about ALL the requests forever — it’s simply cost-prohibitive.

They’re right.

That’s why you must support dynamic sampling, with rich controls at the client side.

Dynamic sampling isn’t the dumb sampling that most people think of when they hear the term. It means consciously retaining ALL the events that we know to be high-signal (e.g. HTTP 5xx errors), keeping only a small fraction of the common and frequent events we know to be low-signal (e.g. health checks to /healthz), and sampling moderately for everything in between (e.g. keep lots of HTTP 200 requests to /payment or /admin, few of HTTP 200 requests to /status or /). Rich dynamic sampling is what lets you have your cake and eat it too.
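A client-side sampler along these lines can be sketched in a few lines. The specific rates, the function names, and the choice to record a `sample_rate` field on each kept event (so counts can be re-weighted at read time) are assumptions made for this example, not a statement of what any particular client library does.

```python
import random


def sample_rate_for(event):
    """Return N, meaning 'keep 1 in N'. The rates are illustrative guesses."""
    status = event.get("status", 200)
    path = event.get("path", "")
    if status >= 500:
        return 1      # high-signal: keep every error
    if path == "/healthz":
        return 1000   # low-signal: keep very few health checks
    if path in ("/payment", "/admin"):
        return 2      # interesting successes: keep lots
    if path in ("/status", "/"):
        return 100    # boring successes: keep few
    return 10         # everything in between


def maybe_keep(event, rng=random):
    """Return the event tagged with its sample rate, or None if dropped."""
    rate = sample_rate_for(event)
    if rng.random() < 1.0 / rate:
        return {**event, "sample_rate": rate}
    return None


def estimated_count(kept_events):
    """Each kept event stands in for sample_rate original events."""
    return sum(event["sample_rate"] for event in kept_events)


always_kept = maybe_keep({"status": 503, "path": "/healthz"})
```

Because the decision is made per event at the client, the rules can look at any field in the event, and because the rate travels with the event, aggregate counts can still be estimated after most of the low-signal traffic has been dropped.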

Observability requires interactivity, open-ended exploration

Maybe you’ve seen me ranting about dashboards. Sigh. Dashboards definitely have their place. But they are not a tool for debugging.

Every dashboard is the answer to a question. When you go flipping through a bunch of answers, looking for the right one to explain the complex behaviors in front of you, it’s fundamentally the wrong approach. You’re flipping to the end of the book. You’re assuming you know what lies between the book’s covers.

Instead, you should start by asking a question. Then inspect the answer, and based on that answer, ask another question. Follow the trail of breadcrumbs til you find the answer. This is debugging. This is science.

animation of a pony disappearing into a cornfield

I can’t count the number of times I have flipped to a dashboard, pronounced it the answer, then found out long afterwards that I had guessed only a tiny portion of the picture.

Observability requires methodical, iterative exploration of the evidence. You can’t just use your gut and a dashboard and leap to a conclusion. The system is too complicated, too messy, too unpredictable. Your gut remembers what caused yesterday’s outages; it cannot predict the cause of tomorrow’s.

Observability must be fast

Queries should return in under a second, because when you are debugging it is important not to break your state of flow. It can’t take so long that you issue a query, wait for the results to come back, and lose your train of thought. Iterative, step-by-step debugging only works if you can take one step quickly after the next.

In summary

In order to deliver on any observability claims, I believe tooling must have the following technical specifications.

  1. Arbitrarily-wide structured raw events
  2. Context persisted through the execution path
  3. Without indexes or schemas (columnar store)
  4. High-cardinality, high-dimensionality
  5. Ordered dimensions for traceability
  6. Client-side dynamic sampling
  7. An exploratory visual interface that lets you slice and dice and combine dimensions
  8. In close to real-time
  9. BONUS: pre-compute important data and sift it to the top

I welcome others developing more observability tooling and shipping it to help users understand their complex modern infrastructures. I can’t WAIT to have competitors. I would love for there to be a compelling open source alternative as well. I am always available to talk through my reasoning, the design constraints, and why we chose to do what we do.

But you cannot claim your tool has observability if it doesn’t check these technical boxes, because this is what you need in order to answer any question, understand any system state, without shipping new code. This is what it takes to deal with unknown-unknowns on the order of dozens, hundreds of times a day. So if what you’re actually doing is log aggregation or monitoring or APM, just say that. Be proud of what you’ve built! But please stop calling it observability. You’re not helping.



Charity Majors

CTO

Charity is an ops engineer and accidental startup founder at honeycomb.io. Before this she worked at Parse, Facebook, and Linden Lab on infrastructure and developer tools, and always seemed to wind up running the databases. She is the co-author of O’Reilly’s Database Reliability Engineering, and loves free speech, free software, and single malt scotch.
