System Debug Like a Pro with Smart Sampling


The Addiction to Data Collection

Like many developers in today’s Brave New Distributed World, I’ve started to develop an addiction lately: I’m addicted to data. Data, whether it’s small or big or consultant big, is a critical make-or-break factor for businesses today. Once you figure out that you can store and analyze every interaction on your website and every event happening on your servers, it seems to be only a matter of collecting all the right details and turning the proper knobs to grow your app and ensure your status among the unicorns.

It therefore wouldn’t surprise me if the idea of losing some of that precious data is keeping you up at night.


The craving to collect data is especially strong for those of us tasked with keeping the system up, and for engineers who want to test their code in production the right way. The dream, of course, is to observe everything – to collect every drop of data we might need, and query it at a blazing fast rate. To divine outages before they happen. To blast through our systems like we’re using a Cerebro for code.

Do you even Cerebro?

It’s a good dream.

But pretty soon into our journey to become Debugging Geniuses…

…Reality intervenes.

It starts slowly. Maybe your home-grown centralized logging cluster becomes more difficult to operate, demanding unholy amounts of engineer time every week. Maybe engineers start to find that making a query about production is a “go get a coffee and come back later” activity. Or maybe monitoring vendors offer you a quote that elicits a response ranging anywhere from curses under the breath to blood-curdling screams of terror.

The multi-headed beast we know as Scale has reared its ugly visage.

As some of you may have already guessed from the title, I’m going to discuss one way to solve this problem, and why it isn’t as bad as you might think.

Take some of your precious information and throw it in the garbage. In lots of cases, you can just drop those writes on the floor as long as your observability stack is equipped to handle it.

In other words, sample.

“Sample? Like they have at Costco?”

Well, this type of sampling is far less delicious, but arguably more rewarding. Although, now that I’m thinking about it, maybe you can pitch your boss to buy you new snacks with the money you’ll save…

What is sampling, then? It’s sending only a subset of the total collected information (such as events, which are JSON blobs describing what’s happening in your system) to your debugging tool. Using sampling, you can mimic having all of the data without incurring all of the costs of that data, e.g., the terabytes of storage needed (and the horrendously slow query performance that follows) if you were to store everything. In most systems, you can declare a static sample rate up-front and the system will take note of the fact that data is being sampled at this rate. In Honeycomb, you can even set a per-event sample rate so that you can make sure not to lose important data like errors. More on that later.
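To make the bookkeeping concrete, here is a minimal Python sketch of static sampling. The function names and the `sample_rate` field are illustrative assumptions, not Honeycomb's API. The only point is that each kept event records the rate it was sampled at, so totals can still be estimated afterwards.

```python
import random


def sample_static(events, sample_rate, rng=None):
    """Keep roughly 1 out of every `sample_rate` events.

    Each kept event is tagged with the rate it was sampled at
    (the "sample_rate" field name is an assumption for this sketch).
    """
    rng = rng or random.Random()
    kept = []
    for event in events:
        # random() is in [0, 1), so a rate of 1 keeps everything.
        if rng.random() < 1.0 / sample_rate:
            kept.append({**event, "sample_rate": sample_rate})
    return kept


def estimated_count(kept_events):
    """Estimate how many events originally occurred.

    Every kept event stands in for `sample_rate` original events.
    """
    return sum(event["sample_rate"] for event in kept_events)
```

With a sample rate of 10, about a tenth of the events are stored, and `estimated_count` multiplies each survivor back up so that counts stay approximately right.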

“But… my precious data….”

Well, that’s fair. I hate settling for anything less than omniscience too.

But if you reflect on the problem, and give sampling a try, you might find that you lose less important information than you’d expect. If you need visibility into something that’s going wrong, it’s likely to show up multiple times and/or be a persistent problem. Therefore, even when sampling heavily you’re likely to catch it eventually. And if it doesn’t show up again or cause major issues, then it’s one of many inevitable ephemeral blips in your application’s lifespan anyway.
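A quick back-of-the-envelope calculation shows why recurring problems tend to survive sampling. The sketch below assumes each event is kept independently with probability 1/N, which is a simplification; the function name is made up for illustration.

```python
def chance_of_capturing(occurrences, sample_rate):
    """Probability that at least one of `occurrences` events is kept
    when each is independently sampled at 1 in `sample_rate`."""
    keep_probability = 1.0 / sample_rate
    # Chance of missing every single occurrence, subtracted from 1.
    return 1.0 - (1.0 - keep_probability) ** occurrences


# A problem that shows up 500 times while we keep only 1 event in 100:
print(f"{chance_of_capturing(500, 100):.1%}")
```

Under those assumptions, a problem that appears 500 times while sampling at 1 in 100 has a little over a 99% chance of landing in your data at least once. A one-off blip, by contrast, has only a 1% chance.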

One helpful analogy might be to think of sampling like JPEG compression. While technically “lossy”, the tradeoff is worth it, like in this example below (from Colt McAnlis’s blog). An almost indiscernible reduction in quality results in an image which is about 30% of the size, helping to slash the bandwidth and storage bill.

In Honeycomb’s case, you still have access to the raw data from the events that you do send – so you can continue to slice, filter, and deep dive with the Honeycomb workflow you know and love. Sampling therefore allows you to keep harmony between your storage quota, visibility into macro-level trends, and an ability to dig into fine-grained details. Your queries will also run faster because the storage engine doesn’t have to churn through so many redundant rows.

And using Honeycomb, you can sample intelligently to keep what you care about the most. Let’s take a look.

Smart Sampling

Let’s say that you’re in charge of shepherding a high-traffic website or API. You probably have a lot of traffic that you don’t care about checking up on that much because frequently things are operating well or because the paths being exercised are not high value. On the flip side, you might have a subset of traffic that you need crystal clear insight into because it relates to core business functionality such as collecting payments, or it could be from customers of critical importance.

If we set a static sample rate (e.g., “Keep 1 out of every 5 requests”) we’d keep more of the boring stuff than we need and throw away 4 out of every 5 interesting anomalies along with it.

Luckily, with Honeycomb events we can sample normal, boring events at a high rate (with a sample rate of N indicating that we’re keeping 1/N events) and keep all of the interesting bits. For instance, in this image below you can see a demonstration of dropping 99 out of every 100 “boring” HTTP 200s that return in a reasonable amount of time, but keeping every request that returns an HTTP 500-level response, comes from one of our customers of high importance, or doesn’t meet our desired latency SLA.
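Here is one way that per-event decision could look in Python. Treat it as a sketch under stated assumptions: the field names (`status`, `customer`, `duration_ms`), the 500 ms SLA, and the customer list are all hypothetical, and actually handing the event to a client library is left out.

```python
import random

BORING_SAMPLE_RATE = 100       # keep 1 in 100 boring events
LATENCY_SLA_MS = 500           # hypothetical latency SLA
VIP_CUSTOMERS = {"acme-corp"}  # hypothetical high-importance customers


def choose_sample_rate(event):
    """Return 1 (keep everything) for interesting events, else the boring rate."""
    if event["status"] >= 500:
        return 1  # server errors
    if event.get("customer") in VIP_CUSTOMERS:
        return 1  # customers of high importance
    if event["duration_ms"] > LATENCY_SLA_MS:
        return 1  # requests that miss the SLA
    return BORING_SAMPLE_RATE


def sample(event, rng=random):
    """Return the event tagged with its sample rate, or None if it is dropped."""
    rate = choose_sample_rate(event)
    if rng.random() < 1.0 / rate:
        return {**event, "sample_rate": rate}
    return None
```

Because the rate travels with each event, a kept HTTP 200 counts as 100 requests and a kept HTTP 500 counts as exactly one, so aggregate numbers remain meaningful even though the two kinds of traffic are sampled very differently.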

We even open sourced an implementation of dynamic sampling techniques that can determine proper sample rates on the fly. You can simply set the fields you’d like to base the sampling on and let it rip.
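To give a feel for what “on the fly” can mean, here is a deliberately simple Python sketch. It is not the algorithm from the open-sourced implementation; it is a stand-in that counts how often each combination of chosen fields appeared in the last window of traffic and sets the next window's rates so that common combinations are sampled heavily and rare ones are kept in full. All names and the `target_per_key` parameter are assumptions.

```python
from collections import Counter


def sampling_key(event, fields):
    """Build a key from the fields we want to base sampling on."""
    return tuple(event.get(field) for field in fields)


def rates_for_next_window(events, fields, target_per_key):
    """Pick a sample rate per key from the traffic seen in the last window.

    Aims to keep roughly `target_per_key` events for each key;
    keys seen less often than that are kept in full (rate 1).
    """
    counts = Counter(sampling_key(event, fields) for event in events)
    return {key: max(1, count // target_per_key) for key, count in counts.items()}
```

For example, if the last window held 10,000 requests with status 200 on one endpoint and 12 requests with status 500 on another, a target of 50 per key gives the first a sample rate of 200 and the second a sample rate of 1.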

Become a System Debugging Genius

Using sampling, you’ll be able to get answers to questions faster. By querying faster, you’ll be able to try out more hypotheses, and ultimately become a better system debugger. Using the techniques outlined above you should be able to separate the wheat from the chaff and mostly keep the golden data that you absolutely must hang onto.

Like a DJ cutting the bass on one track to cross-fade in another and keep the crowd grooving, you’re not losing effectiveness in your role by trimming some information. You’re gaining it! So don’t be afraid to give it a try – check out our sampling documentation. And as always, we’d love it if you give our Honeycomb free trial a whirl to see how event-based system debugging can change the way you develop software!