Dynamic Sampling in Honeytail
A while ago I wrote a three-part series on sampling, covering an introduction, some simple, straightforward ways to do it, and some ideas for fancy implementations. I’m happy to say that this work has made its way into Honeytail, our log-tailing agent.
Dynamic sampling in Honeytail uses a two-phase algorithm: it measures the frequency of values in one or more columns for 30 seconds, computes a sample rate for each value by fitting a logarithmic curve to the traffic, then applies those rates for the following 30 seconds. While it applies those rates, it keeps measuring the traffic so that it has updated rates ready for the next 30-second window. In this way it continuously adapts to the shape of your traffic, applying an appropriate sample rate to each event as it goes by.
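As a rough illustration of the two-phase idea, here is a minimal Python sketch. It is not Honeytail’s actual code, and the names compute_sample_rates and DynamicSampler, the log10(count + 1) weighting, and the rule that unseen keys are always kept are all assumptions made for this example. A sample rate of N means "keep 1 event in N".

```python
import math
from collections import Counter


def compute_sample_rates(counts, goal_sample_rate):
    """Turn one window's per-key counts into per-key sample rates (keep 1 in N).

    Illustrative formula only: each key's share of the kept events is
    proportional to log10(count + 1), so rare keys keep most of their
    events while frequent keys are sampled more heavily.
    """
    if not counts:
        return {}
    total = sum(counts.values())
    # Roughly how many events we want to keep overall at the goal rate.
    goal_total = total / goal_sample_rate
    # count + 1 keeps the weight positive even for keys seen only once.
    log_sum = sum(math.log10(c + 1) for c in counts.values())
    rates = {}
    for key, count in counts.items():
        keep = max(1.0, goal_total * math.log10(count + 1) / log_sum)
        rates[key] = max(1, round(count / keep))
    return rates


class DynamicSampler:
    """Measure the current window while applying rates from the previous one."""

    def __init__(self, goal_sample_rate):
        self.goal_sample_rate = goal_sample_rate
        self.current_counts = Counter()  # phase 1: measuring this window
        self.rates = {}                  # phase 2: rates from the last window

    def rate_for(self, key):
        """Record the key for the current window and return its sample rate."""
        self.current_counts[key] += 1
        # Keys not seen in the previous window are always kept (rate 1).
        return self.rates.get(key, 1)

    def roll_window(self):
        """Call once per window (every 30 seconds, say) to refresh the rates."""
        self.rates = compute_sample_rates(self.current_counts, self.goal_sample_rate)
        self.current_counts = Counter()
```

With a goal rate of 10 and a window containing 10,000 events for one key and 10 for another, this sketch gives the frequent key a rate above 1 and leaves the rare key at 1, so every rare event is kept. A real implementation would also redistribute the budget that rare keys cannot use; the sketch skips that step for brevity.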
After downloading the latest release (version 1.411 or newer), you can use this feature by updating your config file (in /etc/honeytail/honeytail.conf by default):
- setting a sample rate with the SampleRate config entry (--samplerate command line flag)
- specifying which fields should have their value distribution measured with one or more DynSample config entries (--dynsampling command line flag)
- optionally adjusting the 30 second window to something more appropriate for your environment with the DynWindowSec config entry (--dynsample_window command line flag)
Honeytail’s implementation of dynamic sampling is tuned to ensure that infrequent events are seen and frequent events are more heavily sampled. This is just what you want when, for example, it is important to see some of every customer’s traffic instead of having high-volume customers drown out low-volume customers’ traffic. It works great when someone starts sending you huge volumes of the same event and you still want to see what everybody else is sending.
To decide if a field will make a good candidate for the dynamic sampler, try doing a COUNT of your traffic with that field as a BREAKDOWN in Honeycomb. Having an order of magnitude or two between the most frequent events and the least frequent events will give you good results. Here’s an example graph with that property, with good coverage across 4 orders of magnitude (note the log scale on the Y axis, which you can turn on from the Gear menu):
Be aware, though, that if you use a field whose values all occur about equally often, the sampler will do a poor job of deciding how to sample your traffic. For example, if you tried to use a unique request ID as the DynSample field, every value would look rare, which would effectively turn off sampling entirely. I’d be happy to talk with you about the shape of your data if you’d like help choosing appropriate settings.
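If you have a sample of field values on hand outside of Honeycomb, the same check can be approximated with a few lines of Python. This is only a sketch; the function name frequency_spread is made up for this example, and it simply reports how many orders of magnitude separate the most and least frequent values.

```python
import math
from collections import Counter


def frequency_spread(values):
    """Return the orders of magnitude between the most and least frequent value."""
    counts = Counter(values)
    if not counts:
        return 0.0
    return math.log10(max(counts.values()) / min(counts.values()))


# An HTTP status field: a few values with very different frequencies.
statuses = ["200"] * 1000 + ["500"] * 10
print(frequency_spread(statuses))

# A unique request ID: every value appears exactly once, so there is no spread.
request_ids = [f"req-{i}" for i in range(1000)]
print(frequency_spread(request_ids))
```

A spread of one or two (or more) suggests the field is a reasonable key for the dynamic sampler; a spread near zero, as with the request IDs, suggests it is not.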
Give it a try! For web traffic, try it out using the HTTP status field as your key. For application traffic, try using a customer ID. (Or both! Then individual errors will still appear even for high-volume customers!)