Ask Miss O11y: Observability vs BI Tools & Data Warehouses
By Charity Majors | Last modified on June 17, 2022
You probably have already answered this before, but do you have a good rule of thumb for where o11y [observability] ends and BI [business intelligence]/data warehouses begin?
Yes! While data is data (tools exist on a continuum, and can be, and often are, reused or repurposed to answer questions outside their natural domain), observability and BI/data warehouses typically sit at opposite ends of the spectrum in terms of time, speed, and accuracy, among other dimensions.
It can be really hard to generalize about “business intelligence tools”—a quick glance on the Internet turns up everything from online analytical processing (OLAP), mobile BI, real-time BI, operational BI, collaborative BI, location intelligence, data visualization and chart mapping, tools for building dashboards, billing systems, ad hoc analysis and querying, enterprise reporting ... you name the problem, there’s a tool somewhere optimized to analyze it. (It is only somewhat easier to generalize about the data warehouses that power them, but at least we can say those are non-volatile and time-variant, and contain raw data, metadata, and summary data.)
So anything we say to generalize is only going to be 90% true. But that never stopped Miss O11y! Let’s kick it.
Query execution time
Observability tools need to be fast, with queries ranging from sub-second to low-seconds. A key tenet of observability is explorability—the fact that you don’t always know what you’re looking for. You spend less time running the same queries over and over, and more time following a trail of breadcrumbs. When you’re in a state of flow, trying to understand and explore the consequences of your code in production, it’s incredibly disruptive to have to sit there and wait for a minute or longer to get results. You can lose your whole train of thought!
BI tools, on the other hand, are often about running reports, or crafting a complex query that will be used again and again. It’s okay if these take longer to run, because you aren’t trying to use this data to react in real time, but rather to feed it into other tools or systems. You typically make decisions about steering the business over days, weeks, months, or years, not minutes or seconds. And if you’re updating those decisions every few seconds, something has gone terribly wrong.
(Please note that one of the umpteen categories of BI tools is called “Exploratory Data Analysis” [EDA], which specializes in flexible, rapid exploration over sampled data—much like observability. The difference between observability and EDA tooling is that the latter typically focuses on helping you join across multiple tables, while observability tools are highly opinionated about data structures like traces.)
For observability tools, “fast and close to right is better than perfect” is the law of the highway (as well as being one of our company values 🙃). You would almost always rather get a result that scans 99.5% of the events in one second than a result that scans 100% in one minute. Which is a very real, very common tradeoff that you have to make with massively parallelized distributed systems across flaky networks.
Also, some form of dynamic sampling is often employed to achieve observability at scale: it keeps costs manageable while still capturing enormously detailed traces of important code paths. Sampling and “close to right” are verboten for data warehouses and BI tools. When it comes to billing, for example, you will always want the exact result, no matter how long it takes.
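To make the sampling point concrete, here is a minimal sketch of one common dynamic-sampling rule (keep every error, sample down the boring successes). The function name, event shape, and rate are all invented for illustration; real samplers are considerably more sophisticated:

```python
import random

def should_sample(event, base_rate=100):
    """Hypothetical dynamic-sampling rule.

    Keep every error in full, but only 1-in-base_rate of successful
    requests. Returning the rate alongside the keep/drop decision lets
    the backend reweight each kept event so aggregates stay accurate.
    """
    if event.get("status_code", 200) >= 500:
        return True, 1  # errors are always kept, at full fidelity
    # each kept success "stands for" base_rate real events
    return random.random() < 1.0 / base_rate, base_rate

keep, rate = should_sample({"status_code": 503})
# errors come back as (True, 1): always kept, never downweighted
```

The key design point is that the sample rate travels with the event, which is what makes the “close to right” counts recoverable at query time.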
Recency of data
The questions you answer with observability tools have a strong recency bias, and the most important data is often the freshest. A delay of more than a few seconds between when something happened in production and when you can query for those results is unacceptable, especially when you’re dealing with an incident.
As data fades into months past, you tend to care about what happened more in terms of aggregates and trends than specific requests, and when you do care about specific requests, it’s fine for it to take a bit longer to find them. But when data is fresh, you need those results to be raw, rich, and up-to-the-second current.
BI tools typically exist on the other end of the continuum, on the “it’s fine for it to take a bit longer” side. While there is often some ability to cache more recent results, and pre-process, index or aggregate older data, you want to retain the full fidelity of the data forever. You would never use an observability tool to find something that happened five years ago, or even two years ago, while warehouses are designed to store that data forever (and grow infinitely).
Structure and schemas
True observability is built out of arbitrarily wide structured data blobs, one event per request per service (or per polling interval in long-running batch processes). In order to answer any question about what’s happening at any time, you need to incentivize developers to append more details to the event anytime they spy something that might be relevant in the future. Defining a schema upfront would defeat that purpose, so schemas can only be inferred after the fact (and changed on the fly: just start or stop sending a dimension at any time). Indexes are similarly unhelpful. Indexing means picking and choosing in advance which questions you can ask efficiently, when the answer has to be “any of them.”
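A wide event like this is really just a bag of key/value pairs that grows as the request is handled. A toy sketch, with a made-up service and fields (in practice you would emit the event to your observability backend rather than return it):

```python
import time

def handle_request(user_id, cart):
    # One wide event per request: start with the baseline fields...
    event = {
        "timestamp": time.time(),
        "service": "checkout",  # hypothetical service name
        "user_id": user_id,
    }
    # ...then append anything that might matter later. There is no
    # schema to migrate: new dimensions simply start showing up.
    event["cart_items"] = len(cart)
    event["cart_value_cents"] = sum(cart)
    if event["cart_value_cents"] > 100_000:
        event["high_value_order"] = True
    return event

evt = handle_request("u-42", [2500, 105000])
```

Adding `high_value_order` required no migration, no ticket, and no coordination, which is exactly the incentive structure the paragraph above describes.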
BI tools, on the other hand, often collect and process large amounts of unstructured data into structured, queryable form, while data warehouses would be an ungovernable mess without structures and schemas. You need consistent schemas in order to perform any kind of useful analysis over time. And you tend to ask similar questions in repeatable ways to power dashboards and the like, so you can optimize them with indexes, compound indexes, summaries, etc.
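By contrast, the warehouse side rewards committing to a schema and indexing for the questions you ask repeatedly. A small sketch using SQLite as a stand-in (the table, index, and dashboard query are all hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        region      TEXT NOT NULL,      -- fixed schema, agreed upfront
        day         TEXT NOT NULL,
        total_cents INTEGER NOT NULL
    )
""")
# The same dashboard query runs every day, so it pays to build a
# compound index matching its WHERE clause.
conn.execute("CREATE INDEX idx_orders_region_day ON orders (region, day)")
conn.executemany(
    "INSERT INTO orders (region, day, total_cents) VALUES (?, ?, ?)",
    [("eu", "2022-06-01", 1200),
     ("eu", "2022-06-01", 800),
     ("us", "2022-06-01", 500)],
)
row = conn.execute(
    "SELECT SUM(total_cents) FROM orders WHERE region = ? AND day = ?",
    ("eu", "2022-06-01"),
).fetchone()
# row[0] holds the repeatable, exact aggregate the dashboard shows
```

This is the opposite tradeoff from the wide-event world: the questions are known in advance, so you optimize the storage for them.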
Because data warehouses grow forever, it is very important that they have predefined schemas and grow at a predictable rate. Observability, on the other hand, is all about rapid feedback loops and flexibility. It matters most in times of stress or duress, when predictability goes out the window.
Ephemerality of data
Related to the last couple of points: debug data is inherently more ephemeral than business data. You might very well need to retrieve a specific transaction record or billing record from two years ago with total precision, whereas you are unlikely to need to know if the latency between service1 and service2 was high for a particular user request two years ago.
You may, however, want to know if the latency between service1 and service2 has increased over the last year or two, or if the 95th percentile has gone up over that time. This type of question is a very common one, and it is best served not by BI/warehouses or observability tools, but by our good old pal monitoring.
Monitoring tools don’t store raw request data, like observability tools do, but they do allow you to quickly and cheaply compute aggregates and counters on the fly. Monitoring tools (from rrdtool to Prometheus) are also excellent at aging out detail so that historical data can accumulate while only ever occupying a fixed amount of storage: high-level aggregates by the year, somewhat more detailed aggregates by the month, week, and day. That’s literally what they’re designed for, and what they’re best at.
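The aging-out trick is just repeated consolidation: collapse fine-grained points into coarse fixed-size buckets, keeping only summary statistics. A toy sketch of that roll-up step (function name and bucket shape are invented; rrdtool and Prometheus do this far more carefully):

```python
from collections import defaultdict

def roll_up(points, bucket_seconds=3600):
    """Collapse raw (timestamp, value) samples into fixed-size buckets,
    keeping only count/sum/max per bucket. The raw points can then be
    discarded, so storage stays bounded no matter how old the data is."""
    buckets = defaultdict(lambda: {"count": 0, "sum": 0.0, "max": float("-inf")})
    for ts, value in points:
        bucket = buckets[ts - ts % bucket_seconds]  # align to bucket start
        bucket["count"] += 1
        bucket["sum"] += value
        bucket["max"] = max(bucket["max"], value)
    return dict(buckets)

# three per-minute latency samples from the same hour collapse to one bucket
hourly = roll_up([(3600, 120.0), (3660, 80.0), (5400, 200.0)])
```

Run the same step again with a larger `bucket_seconds` and hourly buckets become daily ones, which is exactly how a fixed storage budget can hold years of history.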
Observability is a specialized use case for people who write and ship software to understand that software in production. It requires you to ingest telemetry in a specialized way; to make a set of tradeoffs on the storage side that are unlike those of any other use case; and to optimize the user interface for explorability, rich context correlation, and outlier detection. And it’s fast.
But the best way to tell if what you’re using is observability or BI is this. If you want to know with great precision what is happening to your users with your code, in production, right now, and you can reliably answer your own questions even if the scenarios are new to you? Then congratulations, you have excellent observability.
If you have to wait a few minutes, or an hour, or days, weeks, or months to find out? Then you’re bending a BI or logging tool towards observability purposes. (At best.)
But I’m guessing you knew that. ☺️