The Evolution of Sampling in Honeycomb: Introducing Refinery 2.0
By Kent Quirk | Last modified on September 27, 2023
Honeycomb's Refinery is a tool that customers can use to help manage the volume of their telemetry.
It's rare to have too much telemetry—it's not often that someone says "I wish I didn't have all this information!" However, telemetry is data, and data is not necessarily information—particularly when you’re drowning in it. Honeycomb's query engine is so fast and powerful that many customers can send us all their telemetry. As we say on our stickers, "The Backend Can Handle It."
However, some customers have so much telemetry that it’s redundant and costly to send it all. At that point, the challenge is deciding which parts of it to send.
Refinery can help with that. It's a sampling proxy, which means that it is designed to receive a lot of telemetry, select a representative sample of it, and forward it to Honeycomb while discarding the rest.
Why we updated Refinery
Refinery is four years old now, and it has worked remarkably well over that time. In general, once it’s working, it keeps working—but many customers have found it challenging to set up and configure. In particular, there were a few key issues:
- The configuration file format was a problem. The preferred configuration file format was TOML, while in the intervening years, most systems moved to YAML. While Refinery supported both, all the documentation was in TOML, which made it harder for people who needed to use YAML.
- The design of the configuration didn’t scale well as Refinery grew. The configuration files weren’t well organized and there were several inappropriate defaults that could not be changed for compatibility reasons.
- The configuration file couldn’t easily be validated, which meant that it was all too easy to create configuration files that looked correct but were partly ignored. One unfortunate example was a configuration that misspelled `SampleRate` as `SamplerRate`: Refinery silently ignored the misspelled key and used the default value instead.
Once we looked at all of these related issues, we decided that the best solution would be to fix all of them at once. This would necessarily break backwards compatibility, so that’s why we called this newest release Refinery 2.0.
What’s in it?
A complete rework of configuration
We knew we wanted to redesign the configuration structure, so we turned to one of our designers for help in reorganizing it. Yes, we consider config files to be part of the user interface, and they deserve design love too! The major changes included organizing the file into groups of related configuration values, standardizing things like the way durations are expressed (they now always include units, like `100ms` or `1m30s`), and updating the defaults to modern standards. We also added a version marker so that we can make future updates without breaking existing configuration files.
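To make the duration convention concrete, here is a minimal Python sketch of a parser that accepts only values with explicit units. It is an illustration, not Refinery's own code: the function name `parse_duration`, the restricted unit set (`ms`, `s`, `m`, `h`), and the choice to return whole milliseconds are all assumptions made for this example.

```python
import re

# Milliseconds per unit. The unit set is a simplification chosen for this sketch.
_UNIT_MS = {"ms": 1, "s": 1_000, "m": 60_000, "h": 3_600_000}
_TOKEN = re.compile(r"(\d+)(ms|s|m|h)")


def parse_duration(text: str) -> int:
    """Return the duration in milliseconds; reject numbers that lack a unit."""
    if not text:
        raise ValueError("empty duration")
    pos = 0
    total_ms = 0
    while pos < len(text):
        match = _TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"invalid duration {text!r}: every number needs a unit")
        total_ms += int(match.group(1)) * _UNIT_MS[match.group(2)]
        pos = match.end()
    return total_ms
```

With this sketch, `parse_duration("100ms")` returns `100` and `parse_duration("1m30s")` returns `90000`, while a bare `"100"` raises `ValueError` rather than leaving the reader to guess the unit.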
Of course, changing the configuration format implies a certain level of inconvenience for users, who need to convert their old configs to new ones. As part of this release, we also created `convert`, a configuration conversion tool that can read a v1 configuration file and emit a proper v2 configuration in the new format, with the appropriate default values and with comments!
In order to write a conversion tool, we created metadata that documented, in detail, every configuration value in both its old and new form. Once we had that, we had the information at hand to not only convert old format files to new ones, but also generate full documentation files automatically for both configuration and rules.
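The idea of driving conversion from metadata can be sketched in a few lines of Python. Everything named here is hypothetical: the `FieldMeta` record, the field names, the defaults, and the `Version` key are invented for illustration and are not taken from Refinery's real metadata or from the `convert` tool.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class FieldMeta:
    old_name: str  # key in the flat v1 file
    group: str     # section in the grouped v2 file
    new_name: str  # key inside that section
    default: Any   # value used when the v1 file did not set the key


# Hypothetical metadata table; one entry per configuration value.
METADATA = [
    FieldMeta("CacheSize", "Storage", "CacheCapacity", 1000),
    FieldMeta("FlushDelay", "Storage", "FlushInterval", "2s"),
    FieldMeta("LogLevel", "Logging", "Level", "warn"),
]


def convert_v1_to_v2(v1: dict) -> dict:
    """Build a grouped v2 config from a flat v1 config, filling in defaults."""
    v2: dict = {"Version": 2}  # version marker, so later formats can be detected
    for meta in METADATA:
        section = v2.setdefault(meta.group, {})
        section[meta.new_name] = v1.get(meta.old_name, meta.default)
    return v2
```

Because the same table records the old name, the new location, and the default for each value, a documentation generator could walk it just as the converter does.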
We also added a feature that could extract configuration from Helm charts and rewrite them, reducing toil for operators working with Kubernetes.
We also wanted to be able to do strict validation, meaning we want to ensure that all of the values specified in a config file are valid and consistent. This is a surprisingly subtle problem, because most configuration libraries are designed to load the values they recognize and ignore the ones they don’t.
To achieve this, we used the metadata above and completely rewrote the configuration loader. The result is that configurations for Refinery are now checked for type correctness, spelling, and range. We also added the ability to exit after checking configuration. This will allow CI systems to test configurations before they’re deployed.
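As a rough illustration of what strict validation involves, the Python sketch below checks a flat config against a small schema for spelling, type, and range. The `SCHEMA` table, its limits, and the `validate` function are assumptions made for this example; Refinery's actual loader and metadata are more involved.

```python
import difflib

# Hypothetical schema: expected type plus optional inclusive bounds per key.
SCHEMA = {
    "SampleRate": {"type": int, "min": 1, "max": 1_000_000},
    "Sampler": {"type": str},
}


def validate(config: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    for key, value in config.items():
        spec = SCHEMA.get(key)
        if spec is None:
            # Unknown keys are errors, not something to skip silently.
            hint = difflib.get_close_matches(key, list(SCHEMA), n=1)
            suffix = f" (did you mean {hint[0]!r}?)" if hint else ""
            errors.append(f"unknown key {key!r}{suffix}")
            continue
        if type(value) is not spec["type"]:
            errors.append(
                f"{key}: expected {spec['type'].__name__}, got {type(value).__name__}"
            )
            continue
        if "min" in spec and value < spec["min"]:
            errors.append(f"{key}: {value} is below the minimum {spec['min']}")
        if "max" in spec and value > spec["max"]:
            errors.append(f"{key}: {value} is above the maximum {spec['max']}")
    return errors
```

Here `validate({"SamplerRate": 10})` reports the unknown key and suggests `SampleRate`, which is exactly the class of mistake that used to go unnoticed. A CI job could run a check like this and fail the build whenever the returned list is non-empty.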
Refinery now supports multiple targets for its own metrics, so it’s possible to send metrics to both Honeycomb and a local Prometheus instance. Refinery also now supports OpenTelemetry metrics in addition to the legacy Honeycomb format. There are metrics for individual samplers, and the different logging levels are now more manageable.
We improved the Stress Relief system, introduced in v1.20, to be more accurate and stable; it is now quite effective at keeping Refinery stable during spikes of heavy load.
We also fixed a very important bug. In Refinery 1.x, the dynamic samplers did all their calculations based on traces, rather than spans. But the point of sampling is to predictably reduce span count, so this mismatch made it very hard to reason about the appropriate sampler configuration.
In Refinery 2.0, dynamic samplers now properly count spans. Throughput samplers can be readily tuned to a specific number of spans per second, and rate-based samplers are more likely to hit their target rates as expected.
Note that this change will require re-examining sampler configuration values. In particular, when using a dynamic sampler, the number of items counted may be significantly larger than in 1.x. As such, target rates likely need some adjustments.
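A little arithmetic shows why the same target number means something different once spans are counted. The sketch below is not Refinery's sampler logic; `rate_for_target` is a hypothetical helper, and the traffic figures are made up for illustration.

```python
import math


def rate_for_target(observed_per_second: float, target_per_second: float) -> int:
    """Keep 1 in N items so that roughly target_per_second survive."""
    return max(1, math.ceil(observed_per_second / target_per_second))


# Made-up traffic: 1,000 traces per second, 20 spans in each trace.
traces_per_second = 1_000
spans_per_trace = 20
spans_per_second = traces_per_second * spans_per_trace  # 20,000

target = 500  # the operator wants about 500 spans per second sent on

rate_counting_traces = rate_for_target(traces_per_second, target)  # 2
rate_counting_spans = rate_for_target(spans_per_second, target)    # 40

# Spans actually forwarded under each rate.
sent_when_counting_traces = spans_per_second // rate_counting_traces  # 10,000
sent_when_counting_spans = spans_per_second // rate_counting_spans    # 500
```

In this example, counting traces yields a rate of 1 in 2 and forwards about 10,000 spans per second, twenty times the intended 500. Counting spans yields 1 in 40 and lands on the target, which is why values tuned under 1.x need a second look.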
We have also implemented two new throughput samplers, EMAThroughput and WindowedThroughput. We recommend them to anyone desiring to achieve a particular target quantity of spans sent to Honeycomb.
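For readers unfamiliar with the "EMA" in the name, it stands for exponential moving average, a way of smoothing a noisy measurement over time. The Python sketch below shows only that generic smoothing step applied to observed spans per second. It makes no claim about how EMAThroughput is implemented; the class name, the `weight` parameter, and its default are assumptions for this example.

```python
from typing import Optional


class EmaThroughputEstimator:
    """Smooths observed spans-per-second readings with an exponential moving average."""

    def __init__(self, weight: float = 0.5) -> None:
        if not 0.0 < weight <= 1.0:
            raise ValueError("weight must be in (0, 1]")
        self.weight = weight  # share given to the newest observation
        self.value: Optional[float] = None

    def update(self, spans_per_second: float) -> float:
        """Fold in a new observation and return the smoothed estimate."""
        if self.value is None:
            # The first reading seeds the average.
            self.value = float(spans_per_second)
        else:
            self.value = (
                self.weight * spans_per_second + (1.0 - self.weight) * self.value
            )
        return self.value
```

Feeding this estimator the readings 1000, 3000, and 1000 with a weight of 0.5 produces 1000.0, 2000.0, and 1500.0: the spike is dampened rather than followed immediately, which is the general appeal of averaging when aiming at a steady throughput target.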
For more information on dynamic sampling and how to tune it, please see this blog post.
In short, Refinery is growing up! The 2.0 release is a lot easier than 1.x to configure, reason about, and operate.
If you’re running Refinery today, please make plans to migrate to Refinery v2.0 as soon as possible. The `convert` tool makes the conversion nearly painless, and it will improve your operator experience.