System Debug Like a Pro with Smart Sampling


The Addiction to Data Collection

Like many developers in today’s Brave New Distributed World, I’ve started to
develop an addiction lately: I’m addicted to data. Data, whether it’s small or
big or consultant big, is a critical make-or-break factor for businesses
today. Once you figure out that you can store and analyze every interaction on
the website or happening on your servers, it seems to be only a matter of
collecting all the right details and turning the proper knobs to grow your app
and ensure your status among the

It therefore wouldn’t surprise me if the idea of losing some of that precious
data is keeping you up at night.

The craving to collect data is especially strong for those of us tasked with
keeping the system up, and for engineers who want to test their code in the
right way. The dream, of course, is to observe everything – to collect every drop of
data we might need, and query it at a blazing fast rate. To divine outages
before they happen. To blast through our systems like we’re using a
Cerebro for code.

Do you even Cerebro?

It’s a good dream.

But pretty soon into our journey to become Debugging Geniuses…

…Reality intervenes.

It starts slowly. Maybe your home-grown centralized logging cluster becomes more
difficult to operate, demanding unholy amounts of engineer time every week.
Maybe engineers start to find that making a query about production is a “go get
a coffee and come back later” activity. Or maybe monitoring vendors offer you a
quote that elicits a response ranging anywhere from curses under the breath to
blood-curdling screams of terror.

The multi-headed beast we know as Scale has reared its ugly visage.

As some of you may have already guessed from the title, I’m going to discuss one
way to solve this problem, and why it might not be as bad as you might think.

Take some of your precious information and throw it in the garbage. In lots
of cases, you can just drop those writes on the floor as long as your
observability stack is equipped to handle it.

In other words, sample.

“Sample? Like they have at Costco?”

Well, this type of sampling is far less delicious, but arguably more rewarding.
Although, now that I’m thinking about it, maybe you can pitch your boss to buy
you new snacks with the money you’ll save…

What is sampling, then? It’s sending only a subset of the total collected data
(such as events, which are JSON blobs
describing what’s happening in your system) to your debugging tool. Using
sampling, you can mimic having all of the data without entailing all of the
costs of that data, e.g., the terabytes of storage needed (and subsequent
horrendously slow query performance) if you were to store everything. In most
systems, you can declare a static sample rate up-front and the system will take
note of the fact that data is being sampled at this rate. In our product,
Honeycomb, you can even set a per-event sample rate so
that you can make sure not to lose important data like errors. More on that
below.
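
To make the static case concrete, here is a minimal Python sketch of a sampler that keeps roughly 1 out of every N events and stamps each survivor with the rate it was kept at. The function name `sample_static` and the `sample_rate` field are invented for this illustration; this is not Honeycomb's API, just the general idea.

```python
import random

def sample_static(events, sample_rate, rng=random):
    """Keep roughly 1 in `sample_rate` events, recording the rate on each one."""
    kept = []
    for event in events:
        # randrange(n) returns an integer in [0, n), so this test passes
        # with probability 1/n.
        if rng.randrange(sample_rate) == 0:
            # Store the rate alongside the event so later queries know that
            # this one row stands in for about `sample_rate` originals.
            kept.append({**event, "sample_rate": sample_rate})
    return kept

# Hypothetical traffic: 10,000 identical, boring requests.
events = [{"path": "/home", "status": 200} for _ in range(10_000)]
kept = sample_static(events, sample_rate=5)
print(f"kept {len(kept)} of {len(events)} events")
```

The important part is less the coin flip than the bookkeeping: because every kept event carries its rate, nothing downstream has to guess how much data was thrown away.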

“But… my precious data….”

Well, that’s fair. I hate settling for anything less than omniscience too.

But if you reflect on the problem, and try sampling out, you might find that
with sampling you lose less important information than you might think. If you
need to get an eye into something that’s going wrong, it’s likely to show up
multiple times and/or be a persistent problem. Therefore, even when sampling
heavily you’re likely to catch it eventually. And if it doesn’t show up again
or cause major issues, then it’s one of many inevitable ephemeral blips in your
application’s lifespan anyway.
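
You can put rough numbers on that intuition. Assuming each event is kept independently with probability 1/N (a simplification; real samplers may behave differently), the chance of catching at least one of k occurrences is 1 - (1 - 1/N)^k. The helper name below is made up for this sketch.

```python
def chance_of_catching(occurrences, sample_rate):
    """Probability that at least one of `occurrences` events survives
    independent 1-in-`sample_rate` sampling."""
    miss_one = 1 - 1 / sample_rate          # chance a single event is dropped
    return 1 - miss_one ** occurrences      # chance they are not all dropped

# How quickly does a recurring problem become visible at a 1-in-100 rate?
for occurrences in (1, 10, 100, 1000):
    p = chance_of_catching(occurrences, sample_rate=100)
    print(f"{occurrences:>5} occurrences: {p:.1%} chance of seeing at least one")
```

Under those assumptions, a problem that happens once is almost certainly missed at a 1-in-100 rate, one that happens a hundred times is caught more often than not, and one that happens a thousand times is all but guaranteed to show up.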

One helpful analogy might be to think of sampling like JPEG compression. While
technically “lossy”, the tradeoff is worth it, as in an example from Colt
McAnlis: an almost indiscernible reduction in quality results in an image that
is about 30% of the size, helping to slash the bandwidth and storage bill.

In Honeycomb’s case, you still have access to the raw data from the events that
you do send – so you can continue to slice, filter, and deep dive with the
Honeycomb workflow you know and love. Sampling therefore allows you to keep
harmony between your storage quota, visibility into macro-level trends, and an
ability to dig into fine-grained details. Your queries will
also run faster because the storage engine doesn’t have to churn through so many
redundant rows.
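
Here is one way to see how the macro-level trends survive. If each stored event records the rate it was sampled at (assumed here to live in a `sample_rate` field, a name chosen for this sketch), a query can weight each row by that rate to estimate the original totals. This is an illustration of the idea, not a description of how Honeycomb's storage engine works internally.

```python
def estimated_count(sampled_events, predicate=lambda event: True):
    """Estimate how many original events matched `predicate`.

    Each kept event stands in for `sample_rate` originals, so we sum the
    rates instead of counting rows.
    """
    return sum(e["sample_rate"] for e in sampled_events if predicate(e))

# Hypothetical stored rows: boring 200s kept at 1-in-100, an error kept as-is.
sampled = [
    {"status": 200, "sample_rate": 100},
    {"status": 200, "sample_rate": 100},
    {"status": 500, "sample_rate": 1},
]

print(estimated_count(sampled))                                # 201
print(estimated_count(sampled, lambda e: e["status"] == 500))  # 1
```

Three rows on disk answer a question about roughly 201 requests, which is where both the storage savings and the faster queries come from.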

And using Honeycomb, you can sample intelligently to keep what you care about
the most. Let’s take a look.

Smart Sampling

Let’s say that you’re in charge of shepherding a high-traffic website or API.
You probably have a lot of traffic that you don’t care about checking up on
that much because frequently things are operating well or because the paths
being exercised are not high value. On the flip side, you might have a subset of
traffic that you need crystal clear insight into because it relates to core
business functionality such as collecting payments, or it could be from
customers of critical importance.

If we set a static sample rate (e.g., “Keep 1 out of every 5 requests”) we’d
keep more of the boring stuff and lose more of the interesting anomalies.

Luckily, with Honeycomb events we can sample normal, boring events at a high
rate (with a sample rate of N indicating that we’re keeping 1/N events) and keep
all of the interesting bits. For instance, we could drop 99 out of every 100
“boring” HTTP 200s that return in a reasonable amount of time, but keep every
request that returns an HTTP 500-level response, comes from one of our
high-importance customers, or doesn’t meet our desired latency SLA.
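
A rule like that might look something like the sketch below. Everything specific in it is hypothetical: the customer names, the 500 ms SLA, the event field names, and the helper functions are placeholders for illustration, not part of any Honeycomb SDK.

```python
import random

VIP_CUSTOMERS = {"acme", "globex"}  # hypothetical high-importance customers
LATENCY_SLA_MS = 500                # hypothetical latency SLA
BORING_SAMPLE_RATE = 100            # keep 1 in 100 of everything else

def sample_rate_for(event):
    """Pick a per-event sample rate: 1 means 'always keep'."""
    if event["status"] >= 500:
        return 1                    # server errors are always interesting
    if event.get("customer") in VIP_CUSTOMERS:
        return 1                    # so is anything from a key customer
    if event["duration_ms"] > LATENCY_SLA_MS:
        return 1                    # and anything slower than the SLA
    return BORING_SAMPLE_RATE       # fast, successful, ordinary traffic

def maybe_keep(event, rng=random):
    """Return the event tagged with its rate, or None if it was dropped."""
    rate = sample_rate_for(event)
    if rng.randrange(rate) == 0:    # always true when rate == 1
        return {**event, "sample_rate": rate}
    return None

error = {"status": 503, "duration_ms": 40, "customer": "initech"}
print(maybe_keep(error))  # always kept, with sample_rate 1
```

Because the rate travels with each event, the boring and the interesting traffic can live side by side in the same dataset and still be counted correctly.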

We even open sourced an implementation of dynamic sampling that can determine
proper sample rates on the fly. You can simply set the fields you’d like to
base the sampling on and let it rip.
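
To give a flavor of what "on the fly" can mean, here is a deliberately simplified sketch. It is not the open source implementation mentioned above; it is one naive strategy, assumed for illustration: count how often each key (a combination of the fields you chose) appeared in the previous time window, then pick a rate per key so that every key keeps about the same number of events. The function name and the example keys are made up.

```python
from collections import Counter

def rates_from_counts(counts, target_per_key):
    """Choose a sample rate per key so each key keeps about
    `target_per_key` events per window. Rare keys get rate 1 (keep all)."""
    return {
        key: max(1, round(seen / target_per_key))
        for key, seen in counts.items()
    }

# Hypothetical traffic from the last window, keyed on (endpoint, status).
last_window = Counter({
    ("GET /home", 200): 9000,
    ("POST /pay", 200): 300,
    ("POST /pay", 500): 4,
})

rates = rates_from_counts(last_window, target_per_key=30)
for key, rate in rates.items():
    print(key, "-> keep 1 in", rate)
```

With these made-up numbers, the flood of homepage hits is sampled at 1 in 300, successful payments at 1 in 10, and the handful of payment errors are all kept. Common traffic gets squeezed hard and rare traffic passes through untouched, without anyone hand-tuning a rule per endpoint.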

Become a System Debugging Genius

Using sampling, you’ll be able to get answers to questions faster. By querying
faster, you’ll be able to try out more hypotheses, and ultimately become a
better debugger. Using the techniques outlined above you should be able to
separate the wheat from the chaff and mostly keep the golden data that you
absolutely must hang onto.

Like a DJ cutting the bass on one track to cross-fade in another and keep the
crowd grooving, you’re not losing effectiveness in your role by trimming some
information. You’re gaining it! So don’t be afraid to give it a try – check out our sampling documentation. And as always, we’d
love it if you give our Honeycomb free trial a
whirl to see how event-based debugging can change the way you develop software!