Observability: What's in a Name?
By Charity Majors | Last modified on January 31, 2019
"Is observability just monitoring with another name?"
"Observability: we changed the word because developers don't like monitoring."
There's been a lot of hilarious snark about this lately. Which is great, who doesn't love A+ snark? Figured I'd take the time to answer, at least once.
Yes, in practice, the tools and practices for monitoring vs observability will overlap a whole lot ... for now. But philosophically there are some subtle distinctions, and these are only going to grow over time.*
"Monitoring", to anyone who's been in the game a while, carries certain connotations that observability repudiates. It suggests that you first build a system, then "monitor" it for known problems. You write Nagios checks to verify that a bunch of things are within known good-ish thresholds. You build dashboards with Graphite or Ganglia to group sets of useful graphs. All of these are terrific tools for understanding the known-unknowns about your system.
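As a rough illustration of that threshold style of check, here is a minimal Python sketch. It assumes the Nagios plugin convention of returning 0 for OK, 1 for WARNING, and 2 for CRITICAL; the function name, the metric, and the threshold values are hypothetical.

```python
# Nagios plugin convention: exit status 0 = OK, 1 = WARNING, 2 = CRITICAL.
OK, WARNING, CRITICAL = 0, 1, 2


def check_threshold(value, warn, crit):
    """Classify a metric against thresholds that were chosen in advance."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


# Hypothetical reading: disk usage as a percentage.
status = check_threshold(value=91.0, warn=80.0, crit=90.0)
print(status)
```

A check like this can only answer the question it was written for, which is what makes it a known-unknown tool.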
But what happens when you're experiencing a serious problem ... but you didn't know for hours, until it trickled up to you from user reports? What happens when users are complaining, but your dashboards are all green? What happens when something new happens and you don't know where to start looking? In other words, how do you deal with unknown-unknowns?
Known-unknowns are (relatively) easy (or at least the paths are well-trodden). Unknown-unknowns are hard.
But here's the thing: in distributed systems, or in any mature, complex application of scale built by good engineers ... the majority of your questions trend towards the unknown-unknown.
Debugging distributed systems looks like a long, skinny tail of almost-impossible things rarely happening. You can't predict them all; you shouldn't even try. You should focus your energy on instrumentation, resilience to failure, and making it fast and safe to deploy and roll back (via automated canaries, gradual rollouts, feature flags, etc.).
The same goes for large apps that have been in production a while. No good engineering team should be getting a sustained barrage of pages for problems they can immediately identify. If you know how to fix something, you should fix it so it doesn't page you. Fix the bug, auto-remediate the problem, or hell, just disable paging alerts in off-hours and make the system resilient enough to wait 'til morning. (Please!)
In the end, the result is the same: engineering teams should mostly get paged only about novel and undiagnosed issues. Which means debugging unknown-unknowns is more and more critical.
You can't predict what information you're going to need to know to answer a question you also couldn't predict. So you should gather absolutely as much context as possible, all the time. Any API request that enters your system can legitimately generate 50-100 events over its lifetime, so you'll need to sample heavily. (See our sampling docs for more best practices.)
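To make "sample heavily" a little more concrete, here is a minimal sketch of uniform random sampling in Python. It is an illustration under stated assumptions, not a description of any particular product's sampler: the function names, the `sample_rate` field, and the event shape are all hypothetical. The idea is to keep roughly 1 in N events and record N on each kept event, so that counts can be scaled back up later.

```python
import random


def sample_event(event, sample_rate, rng=random):
    """Keep roughly 1 in `sample_rate` events; return None for dropped ones.

    Kept events carry the rate so that later counts can be re-weighted.
    """
    if sample_rate < 1:
        raise ValueError("sample_rate must be >= 1")
    if rng.random() < 1.0 / sample_rate:
        return {**event, "sample_rate": sample_rate}
    return None


def estimated_total(kept_events):
    """Estimate the original event count from the sampled events."""
    return sum(e["sample_rate"] for e in kept_events)


# Hypothetical usage: sample 10,000 events at 1 in 10.
rng = random.Random(0)
candidates = (sample_event({"request_id": i}, 10, rng) for i in range(10_000))
kept = [e for e in candidates if e is not None]
print(len(kept), estimated_total(kept))
```

Storing the rate on each event is a design choice that lets different kinds of traffic be sampled at different rates while still producing comparable totals.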
"Observability" is a term that comes from control theory. From Wikipedia:
"In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs. The observability and controllability of a system are mathematical duals."
In ordinary English, what this means is that you have the instrumentation you need to understand what's happening in your software. Observability focuses on the development of the application, and the rich instrumentation you need, not to poll and monitor it for thresholds or defined health checks, but to ask any arbitrary question about how the software works.
An observable system is one you can fully interrogate. Given a pile of millions of needles, one or two of which have problems, can you slice and dice and sort finely enough to quickly locate literally any given needle?
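One way to picture that kind of slicing and dicing is the Python sketch below. It assumes each request emits one wide event as a dictionary; the `breakdown` helper and every field name and value in the sample data are hypothetical. The point is that the grouping and the filter are chosen at question time, not when the instrumentation was written.

```python
from collections import Counter


def breakdown(events, group_by, where=lambda e: True):
    """Count events matching `where`, grouped by any combination of fields."""
    counts = Counter()
    for event in events:
        if where(event):
            key = tuple(event.get(field) for field in group_by)
            counts[key] += 1
    return counts.most_common()


# Hypothetical wide events, one per request.
events = [
    {"endpoint": "/export", "customer_id": "c-17", "build": "a1f", "status": 500},
    {"endpoint": "/export", "customer_id": "c-17", "build": "a1f", "status": 500},
    {"endpoint": "/export", "customer_id": "c-42", "build": "a1f", "status": 200},
    {"endpoint": "/home", "customer_id": "c-17", "build": "9bc", "status": 200},
    {"endpoint": "/export", "customer_id": "c-42", "build": "9bc", "status": 500},
]

# An ad hoc question: which customer and build combinations are seeing errors?
failures = breakdown(
    events,
    group_by=["customer_id", "build"],
    where=lambda e: e["status"] >= 500,
)
print(failures)
```

Swapping in a different `group_by` list or `where` predicate asks a different question of the same events without any new instrumentation.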
Monitoring is great. We're big fans. But it's not what we're trying to build here.
(Historical side note: we first adopted the term because companies like Netflix, Twitter, etc. tend to use "observability" internally. Lots of our users sign up for Honeycomb because they desperately miss the kind of tooling they used to have at their $bigco job, so the association was useful.)
* Could you say that observability is a subset of monitoring? Sure, you could! But what term would you use for older-style thresholds-and-canned-dashboards? I'm stumped on that point, so I've been calling it "monitoring". If you have a better term, please share!