Are you looking for a better way to troubleshoot, debug, and really see and understand what weird behavior is happening in production? Service-level objectives (SLOs) and observability can help you do all that—but they require collecting and storing the right data. If we’re naive with our telemetry strategy, we spend a lot of money on storing data without seeing adequate return on investment in the form of insights.
In this post, I’ll share three strategies for taming the spew of telemetry data with effective sampling so you can keep costs predictable, get better visibility into your code in production, and refine your data to separate the signal from the noise.
Optimizing user experience at all scales
The end-user experience is typically top of mind for most of us when we ship code, and systems have many different properties that help us understand what’s happening. For example, we might want to know what percentage of events fall above or below a certain threshold, or the count and latency of events that share common attributes, all without spending too much time or money on it.
We would prefer to keep all data unsampled, but that becomes uneconomical past a certain scale. How do we preserve the ability to debug our systems and optimize user experiences without breaking the bank? The trick is that we engineers have more to learn from some user experiences than others and can benefit more from getting the right telemetry data from the right code executions.
Reduce, reuse, and recycle your telemetry data
“Reduce, reuse, and recycle” is a cliché because we’ve all heard it in so many other parts of our daily lives. You may be surprised to hear it applied to sampling, too. Let me explain how this works.
Reducing your data means no longer storing data that you write once and never read. Ask yourself, “How long do we need this data for?” It can be 24 hours, it can be 48 hours, it can be a year, but really ask whether you’ll need that data in that time frame.
The next step is to structure your data so that it has well-defined fields that contain the information you need and no more. A lot of people write more than one event per transaction, which means they later have to dig through logs manually to find, say, the five places where an event ID pops up. With structured data, you can instead buffer the fields as the transaction progresses, then emit them all as one event.
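To make the "buffer, then emit once" idea concrete, here is a minimal sketch in Python. The `WideEvent` class, the `handle_request` function, and field names such as `cart_items` and `payment_provider` are hypothetical and chosen only for illustration; they are not part of any Honeycomb SDK.

```python
import json
import time


class WideEvent:
    """Buffers fields for one transaction and emits them as a single event."""

    def __init__(self, name, sink):
        self.fields = {"name": name}
        self.sink = sink  # any callable that accepts one JSON string
        self._start = time.monotonic()

    def add(self, **fields):
        # Accumulate fields as the transaction progresses; nothing is written yet.
        self.fields.update(fields)

    def emit(self):
        # Write exactly one structured record for the whole transaction.
        elapsed_ms = (time.monotonic() - self._start) * 1000
        self.fields["duration_ms"] = round(elapsed_ms, 3)
        self.sink(json.dumps(self.fields, sort_keys=True))


def handle_request(user_id, sink):
    """Hypothetical request handler that produces one event, not five log lines."""
    event = WideEvent("checkout", sink)
    event.add(user_id=user_id)
    event.add(cart_items=3, payment_provider="example-pay")
    event.add(status_code=200)
    event.emit()
```

Calling `handle_request("u-42", print)` writes a single JSON line that carries every field, so there is nothing to stitch back together by event ID later.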
You’ll want to use distributed tracing for linked events because it makes it easy to see which events are related to each other and see how events flow through your system.
An effective sampling strategy can help teams control costs while still achieving the level of observability needed to debug, troubleshoot, and understand what’s happening in production. It’s a technique for reducing the burden on your infrastructure and telemetry systems by only keeping the data on a statistical sample of requests rather than 100% of requests. For a brief breakdown on different types of sampling, plus their benefits and drawbacks, check out this blog post on how to sample traces in Honeycomb. When we discuss sampling, we’re not just discussing naive approaches that treat all data as equally important; we instead need to discuss how to refine your data to get the most signal out of the noise.
You can smartly reuse some of your data via sampling to see the picture of all your data without needing to store all of it. Not all traffic is equally important. For example, you can choose to keep 1 in 6 events, record details for that one, and assume that it represents the other 5 statistically similar ones. This works because 99% of the events you’re collecting are low in signal; in other words, they’re either recording something not very interesting or something you’ve already seen before. You don’t need to keep that duplicate data in raw form as long as you record the sample rate with each kept event, which gives you a mechanism for reconstituting the distribution of the data afterwards.
Similarly, if you’re running a multi-tenant service where you have hundreds of different customers with different orders of magnitude of traffic, you might care just as much about a client that’s sending 20 queries per second (QPS) as you do about the client sending 20k QPS. In this case, each individual request is no longer equally important. The larger customer’s data shouldn’t overwhelm the data from the smaller customer, but if you simply take an unweighted sample, you might barely catch any requests from the smaller customers. Weighting the sample so that you collect a higher percentage of requests from smaller clients and a lower percentage of requests from higher-volume clients ensures that both are well represented in the set of events you actually collect.
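As a rough sketch of how a 1-in-N sample can stand in for the full stream, the Python below keeps about one event in `sample_rate`, tags each kept event with that rate, and then weights by it when estimating totals. The function names and the `sample_rate` field are assumptions for illustration, not a specific vendor API.

```python
import random


def sample_events(events, sample_rate, rng=random):
    """Keep roughly 1 in `sample_rate` events, tagging each with its rate."""
    kept = []
    for event in events:
        if rng.randrange(sample_rate) == 0:
            kept.append({**event, "sample_rate": sample_rate})
    return kept


def estimated_count(kept):
    """Each kept event stands in for `sample_rate` original events."""
    return sum(e["sample_rate"] for e in kept)


def estimated_mean(kept, field):
    """Weighted mean of `field`, or None if nothing was kept."""
    total_weight = sum(e["sample_rate"] for e in kept)
    if total_weight == 0:
        return None
    weighted_sum = sum(e[field] * e["sample_rate"] for e in kept)
    return weighted_sum / total_weight
```

With a rate of 6, a stream of 60,000 events shrinks to roughly 10,000 stored events, while `estimated_count` still lands close to 60,000. Because every event carries its own rate, the same arithmetic keeps working when different events are sampled at different rates.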
For instance, let’s say you have a customer that’s 100% down, but their traffic is so small that it’s only 1% of the total traffic you’re seeing. That customer will be hopping mad, but your dashboards likely won’t show them. To avoid that, you need to make sure you allocate your budgeted events fairly across all your distinct clients and distinct ranges of latency.
So for a client that’s sending 20,000 QPS, maybe retain only 1 in 2,000 events, and for the one sending 20 QPS, sample 1 in 1 or 1 in 2. In a similar vein, if it’s a slow query, then you might decide to keep every single one. Fortunately, if you’re only failing 1% of your traffic, it’s cost effective to keep all of that 1% while sampling the other 99%. Taking this approach doesn’t destroy any of your data and it doesn’t sacrifice context; it instead helps refine your data by tracking all of the anomalies while sampling the uninteresting data so you can zero in on the weird stuff that’s causing problems.
I want to emphasize that it’s OK to sample 1 in 1,000 or 1 in 1,000,000 (or more) on your uninteresting data. In other words, don’t be afraid of high sample rates. What matters is having enough events remaining after sampling for your estimates to be statistically accurate. Retain the most interesting data at the tails of latency 1 for 1, but sample away the bulk of your uninteresting, repetitive data. When it comes to sampling your data, not being 100% accurate is fine; perfection means spending a ton of time and money, which isn’t feasible at scale. When trading off between constraints at scale, sophisticated strategies with high sampling rates are the best choice. They can mean the difference between having the bandwidth to support distributed tracing and not doing tracing at all. The goal is to pragmatically keep enough of the right data rather than all of the wrong data, or none at all.
In part, sampling gets a bad name because some vendors are quick to point out that they don’t sample your data. Yet historically, many of those same vendors aggregate data into metrics to offer calculated medians and means along specified dimensions. Those aggregations help lower costs. With event data, aggregation is basically recycling data, turning it into something completely unrecognizable from the source product. If this is the road you’re taking, you’ll likely regret it later when trying to debug a complex, distributed system (which describes nearly all systems nowadays).
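One way to express these rules in code is sketched below. It assumes each event carries hypothetical `client_id`, `status_code`, and `duration_ms` fields, that you already have a per-client QPS estimate, and that you budget about 10 kept events per second per client; the threshold of 1,000 ms for "slow" is likewise an illustrative assumption.

```python
import random


def choose_sample_rate(event, client_qps, target_per_sec=10, slow_ms=1000):
    """Return N, meaning 'keep 1 in N', for this event."""
    if event["status_code"] >= 500:
        return 1  # keep every failure
    if event["duration_ms"] >= slow_ms:
        return 1  # keep every slow request
    qps = client_qps.get(event["client_id"], 0)
    # Scale the rate with traffic so each client keeps about target_per_sec events.
    return max(1, round(qps / target_per_sec))


def maybe_keep(event, client_qps, rng=random):
    """Return the event tagged with its sample rate, or None if it is dropped."""
    rate = choose_sample_rate(event, client_qps)
    if rng.randrange(rate) == 0:
        return {**event, "sample_rate": rate}
    return None
```

Under these assumptions a client at 20,000 QPS is sampled at 1 in 2,000 and a client at 20 QPS at 1 in 2, while errors and slow requests from either client are always kept. Storing `sample_rate` on each kept event is what lets a later query weight the results back up to the true volume.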
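To see why the number of events that remain matters more than the rate itself, here is a small sketch under a simplifying assumption: every event is kept independently with probability 1/N. In that case the relative standard error of an estimated count is sqrt((1 - 1/N) / k), where k is the expected number of kept events. The function name is hypothetical.

```python
import math


def relative_standard_error(expected_kept, sample_rate):
    """Relative standard error of a count estimated from a 1-in-N sample.

    Assumes each event is kept independently with probability 1/N.
    """
    if expected_kept <= 0:
        raise ValueError("expected_kept must be positive")
    keep_probability = 1 / sample_rate
    return math.sqrt((1 - keep_probability) / expected_kept)
```

With about 10,000 kept events, the estimate is good to roughly 1% whether the rate is 1 in 1,000 or 1 in 1,000,000. With only 100 kept events, it is closer to 10%. The rate can be as aggressive as you like so long as enough events survive in each group you intend to query.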
Aggregation destroys granularity and cardinality by arbitrarily squishing data together. You lose out on the deep context that comes from having access to individual events coming from your production environment and being able to examine them across any dimension. Aggregation should only be a last-resort option and used only when you know what questions you intend to ask. It fails when you try to understand attributes about your system that vary or have high cardinality.
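The small customer who is fully down makes a useful illustration here. The sketch below, with hypothetical `client_id` and `error` fields, compares a single pre-aggregated error rate with the same calculation grouped by a dimension, which is only possible if the individual events are still available.

```python
from collections import defaultdict


def overall_error_rate(events):
    """The one number a pre-aggregated metric would give you."""
    errors = sum(1 for e in events if e["error"])
    return errors / len(events)


def error_rate_by(events, field):
    """Error rate per distinct value of `field`; needs the raw events."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for e in events:
        totals[e[field]] += 1
        if e["error"]:
            errors[e[field]] += 1
    return {key: errors[key] / totals[key] for key in totals}
```

Given 990 successful requests from a large client and 10 failed requests from a small one, the overall error rate is 1%, which looks healthy, while grouping by `client_id` shows the small client failing 100% of the time. A metric aggregated along other dimensions cannot be regrouped this way after the fact.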
This is why we at Honeycomb so strongly advocate sampling your telemetry data. Sampling refines your observability experience because it reduces resource costs while preserving your data’s high cardinality and its usefulness in exploring unknown-unknowns.
Want to learn more?
Interested in seeing how sampling works in more technical detail? Check out my blog post, “Dynamic Sampling by Example,” where I show, with screenshots and code snippets, how this all works under the hood. Or, if you’d like to test it out yourself, get started with Honeycomb for free today. And finally, keep an eye out for a future post where we talk about sampling whole traces!