There were a bunch of talks at Monitorama 2017 that could be summed up as “Let me show you how I built this behemoth of a metrics system, so I could safely handle billions of metrics.” I saw them, and they were impressive creations, but they still made me a little sad inside.
The truth is that most of us don’t actually need billions of metrics. Sure, there are the Googles and Facebooks (and legit – one of these presentations was from Netflix, who actually does need billions of metrics), but the rest of us really don’t. And I’m coming from a place of love – I also built a behemoth of a metrics system with multiple tiers of aggregation and mirroring and high availability and fancy dashboards. And it was beautiful. But the real truth is that most of the metrics shoved into that system would have been better served by something different. Something I didn’t know about back then. Something that exists now.
The problem crept up on me. I didn’t see my few precious numbers multiplying so horrendously until it was too late. I started by watching the core metrics on all my servers – CPU, memory, disk utilization and capacity, network throughput. But then I had questions. This network throughput (I’m on a webserver)… what does it look like? It was a small leap to build a web log tailer and start building metrics about the HTTP status codes of the web traffic flowing through the machine. And it worked! I had graphs of HTTP status so I could see the success and failure rates of my web traffic.
Soon, though, more questions came in, and I expanded the log tailer to capture more nuance in the traffic. It would generate multiple metrics that the metrics infrastructure would then sum and aggregate, giving overall numbers while also letting you dive into specific questions. How many failures are from POSTs? How many are from which webserver? What’s the 90th percentile of the response time instead of just the average? What had started as a few (fewer than 30) metrics per host soon became 500 (and ultimately closer to 1,500) per server. 500 metrics times 200 hosts for just the web tier and we’re at 100,000 metrics?! No wonder people are trying to build such amazing systems to handle the load. (And this was still just the beginning.)
But here’s the secret. Those 470 out of 500 “metrics” per server that I was trying to push? They are not system metrics, which is what this metrics system was designed for. They’re much closer to application metrics. And are they even metrics? The questions I’m asking are things like “Which requests to my webserver failed? Why? Who were they from? What customers did they impact?” These are not questions a metrics system can answer, because the answers depend on keeping high-cardinality contextual data. Distilling those events down to the few metrics I had originally chosen lost all the context necessary to answer those questions.
The key to solving this problem was also interspersed in many of the Monitorama talks. They called it by many different names. Betsy Nichols talked about adding context to your metrics system. Bryan Liles and a few others talked about structured logging. There were many people mentioning tracing.
Metrics are here to stay – they’re an effective way of condensing information about the state of your system to numbers you can put up on a graph and get wonderful visualizations of how your infrastructure has changed over time. The (relatively) recent addition of tags to metrics has allowed even better visualizations, though underneath, tagged metrics still suffer from the problem of metric volume explosion. Several talks mentioned how you should be adding a myriad of tags to your metrics… but not IP address! Not customer ID! Those are too high cardinality and will blow out your storage.
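To make the "blow out your storage" warning concrete, here is a rough sketch of how the number of distinct time series grows with tags. The tag names and cardinalities are hypothetical, chosen only for illustration, and the product is an upper bound (real systems only store the combinations that actually occur).

```python
from math import prod

# Hypothetical tag cardinalities (distinct values per tag); illustrative only.
low_cardinality_tags = {"status_code": 10, "http_method": 5, "host": 200}


def series_count(tag_cardinalities):
    """Upper bound on distinct time series for one metric name:
    the product of the number of values each tag can take."""
    return prod(tag_cardinalities.values())


base = series_count(low_cardinality_tags)

# Adding a single high-cardinality tag multiplies the bound by its cardinality.
# 50_000 customers is an assumed figure, not one from any real system.
with_customer = series_count({**low_cardinality_tags, "customer_id": 50_000})

print(f"low-cardinality tags only: {base:,} series")
print(f"plus customer_id:          {with_customer:,} series")
```

Under these assumed numbers, one extra tag turns ten thousand possible series into five hundred million, which is why the talks kept saying "not customer ID".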
As Roy Rapoport pointed out:
The shift from using metrics for everything to an awareness of the importance of context is marking our next evolution as an industry. We’re less interested in the distilled numbers representing a state and more interested in being able to pick that apart and track it down to individual events, customers, servers, stack traces, or states. The path to this kind of analysis is through recording wide events that have all those high-cardinality keys alongside the rest of the data that gives your events context.
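As a minimal sketch of what a "wide event" could look like, the snippet below records one flat record per request and then answers "which requests failed, and who were they from?" by filtering on the raw fields. The field names, the `wide_event` helper, and the sample values are all hypothetical; no particular logging or tracing library is implied.

```python
import json


def wide_event(timestamp, host, method, path, status, duration_ms,
               client_ip, customer_id):
    """Build one wide event: a single flat record per request that keeps
    high-cardinality fields (client_ip, customer_id) next to everything else."""
    return {
        "timestamp": timestamp,
        "host": host,
        "method": method,
        "path": path,
        "status": status,
        "duration_ms": duration_ms,
        "client_ip": client_ip,
        "customer_id": customer_id,
    }


# Made-up sample traffic, using documentation-reserved IP ranges.
events = [
    wide_event("2017-05-22T10:00:00Z", "web-01", "POST", "/checkout",
               502, 1830.0, "203.0.113.7", "cust-42"),
    wide_event("2017-05-22T10:00:01Z", "web-02", "GET", "/cart",
               200, 45.2, "198.51.100.9", "cust-17"),
    wide_event("2017-05-22T10:00:02Z", "web-01", "POST", "/checkout",
               503, 2100.5, "203.0.113.7", "cust-42"),
]


def failed_requests(events):
    """Keep only server-side failures (HTTP 5xx)."""
    return [e for e in events if e["status"] >= 500]


def failures_by(events, key):
    """Count failed requests grouped by any field, chosen at query time."""
    counts = {}
    for e in failed_requests(events):
        counts[e[key]] = counts.get(e[key], 0) + 1
    return counts


print(json.dumps(failures_by(events, "customer_id")))
print(json.dumps(failures_by(events, "host")))
```

The point of the sketch is that the grouping key is picked when the question is asked, not when the data is collected, which is exactly what pre-aggregated counters cannot offer.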
But there’s another reason people like metrics, besides them just being easy to reason about (mostly). They’re cheap. Collecting, transmitting, storing, and analyzing all the events your service creates requires an infrastructure as large as that serving your primary business. Who has the budget to have an analytics platform that is as large as your production infrastructure? (Not counting CERN.) The only way to manage costs while still getting the benefit of a modern approach to observability is to reduce the total data set intelligently, keeping visibility into the parts of your traffic that give you the most insight. Not all events are of equal interest – the customer that’s generating 20k events per second may care less about each individual event than the one calling your service 10 times per day.
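One way to act on that asymmetry is per-customer sampling: keep roughly the same number of events from every customer, and stamp each kept event with its sample rate so totals can still be estimated. The sketch below is a simple deterministic 1-in-N sampler written for illustration, not a description of any specific product's algorithm. It assumes each customer's volume for the window is known up front (a real stream would have to estimate it, for example from the previous window), and `target_per_window` is a made-up knob.

```python
from collections import Counter


def choose_rate(events_in_window, target_per_window=10):
    """Sample rate N (keep 1 in N) so a customer keeps about
    target_per_window events per window; never less than 1."""
    return max(1, events_in_window // target_per_window)


def sample(events, target_per_window=10):
    """Keep every Nth event per customer, where N depends on that
    customer's volume. Each kept event records its sample_rate."""
    volume = Counter(e["customer_id"] for e in events)
    seen = Counter()
    kept = []
    for e in events:
        cid = e["customer_id"]
        rate = choose_rate(volume[cid], target_per_window)
        if seen[cid] % rate == 0:
            # Copy the event so the original record is left untouched.
            kept.append({**e, "sample_rate": rate})
        seen[cid] += 1
    return kept


# Hypothetical window: one very busy customer and one quiet one.
events = ([{"customer_id": "busy"}] * 20_000
          + [{"customer_id": "quiet"}] * 10)

kept = sample(events)
estimated_total = sum(e["sample_rate"] for e in kept)

print(len(events), "events in,", len(kept), "events kept")
print("estimated original volume:", estimated_total)
```

With this made-up traffic the quiet customer keeps every event while the busy one keeps one in two thousand, and summing the recorded sample rates still recovers the original volume.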
Through a good understanding of the important aspects of the business, you can safely discard 99% of events collected, and through a good understanding of your application combined with good tools, you can throw away 99% of the metrics you’re collecting. This is the direction we need to go, and we need to take our services with us, kicking and screaming.