One of the most common questions we get at Honeycomb is about how to control costs while still achieving the level of observability needed to debug, troubleshoot, and understand what is happening in production. Historically, the answer from most vendors has been to aggregate your data: to offer you calculated medians and means rather than the deep context you gain from having access to the actual events coming from your production environment.
This is exactly what it sounds like: a poor tradeoff made for the sake of performance. With classic metrics and APM tools, you can never get back to the raw events that are your source of truth, which means you’ll regret that choice when debugging a complex, distributed system. When you’re working with metrics, the data must be numeric, and any other type of data must be stored as metadata, either attached to the datapoints themselves or out-of-band in some way (“tags”, “dimensions”, etc.). In other words: more limits on what you can store and retrieve.
Honeycomb’s answer is: Sample your data.
But, you say, sampling means I’m throwing away some (or a lot) of my data. How is that OK? I won’t know what I am not seeing, right?
What if you had more flexibility? What if sampling offered a greater breadth of options than just “send a percentage of my data”?
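To make the baseline concrete, here is a minimal sketch of the simplest option, “send a percentage of my data”, and of why it is less scary than it sounds. It assumes each kept event is tagged with the rate it was sampled at, so counts can be scaled back up at query time. The function names and the `sample_rate` field are hypothetical, chosen for illustration, and are not taken from any particular SDK.

```python
import random


def sample_events(events, sample_rate, rng):
    """Keep roughly 1 in `sample_rate` events.

    Each kept event records the rate it was sampled at, so that it can
    stand in for `sample_rate` original events later on.
    """
    kept = []
    for event in events:
        if rng.randrange(sample_rate) == 0:
            kept.append({**event, "sample_rate": sample_rate})
    return kept


def estimated_count(kept, predicate=lambda event: True):
    """Estimate how many original events matched `predicate`.

    Every kept event is weighted by its own sample rate.
    """
    return sum(event["sample_rate"] for event in kept if predicate(event))


# Synthetic traffic (illustrative only): 100,000 requests, 5% of them errors.
rng = random.Random(42)
events = [
    {"status": 500 if i % 20 == 0 else 200, "duration_ms": 10 + i % 7}
    for i in range(100_000)
]

kept = sample_events(events, sample_rate=10, rng=rng)
total_estimate = estimated_count(kept)
error_estimate = estimated_count(kept, lambda event: event["status"] == 500)

print(f"kept {len(kept)} of {len(events)} events")
print(f"estimated total requests: {total_estimate}")
print(f"estimated errors: {error_estimate}")
```

The events that survive are still full events with every field intact, unlike an aggregate, and the estimates land close to the true totals while only about a tenth of the data is sent. What a single fixed rate cannot do is treat rare, interesting events differently from common, boring ones.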