Dynamic Sampling by Example

Query count and sampling chance plotted


Last week, Rachel published a guide describing the advantages of dynamic sampling. In it, we discussed varying sample rates to achieve a target collection rate overall, and having different sample rates for distinct kinds of keys. We also teased the idea of combining the two techniques to preserve the most important events and traces for debugging without drowning them out in a sea of noise.

While these techniques work out of the box in our log ingestion agent honeytail, you may want to know exactly how they work under the hood, or want to implement them yourself! This week I’m doing a show-and-tell to demonstrate these techniques. This pedagogical example is in Go, but it is straightforward to port to any language that supports hashes/dicts/maps, pseudorandom number generation, and concurrency/timers.

Our base case

Let’s suppose we would like to instrument a high-volume handler that calls a downstream service, performs some internal work, then returns a result and unconditionally records an event to Honeycomb (or another instrumentation collector):

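import (
	"net/http"
	"time"
)

func handler(resp http.ResponseWriter, req *http.Request) {
	start := time.Now()
	i, err := callAnotherService() // downstream call; i must be a []byte for resp.Write
	resp.Write(i)
	RecordEvent(req, start, err) // sent for every request, with no sampling
}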

This is unnecessarily noisy; in Honeycomb, this would result in a shortened retention period for the dataset. With a different collection provider, this would result in a sky-high bill proportional to your traffic volume.

Fixed-rate sampling

A naive approach might be probabilistic sampling using a fixed rate, by randomly choosing to send 1 in 1000 events.

import (
	"flag"
	"math/rand"
	"net/http"
	"time"
)

var sampleRate = flag.Int("sampleRate", 1000, "Static sample rate")

func handler(resp http.ResponseWriter, req *http.Request) {
	start := time.Now()
	i, err := callAnotherService()
	resp.Write(i)

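	// Keep roughly 1 in *sampleRate events. The int flag must be converted to
	// float64; comparing r against an int expression does not compile.
	r := rand.Float64()
	if r < 1.0/float64(*sampleRate) {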
		RecordEvent(req, start, err)
	}
}
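
To check that a fixed-rate decision like the one above really keeps about 1 in N events, it can help to pull it out of the handler. The following is only a sketch under assumptions: FixedRateSampler, NewFixedRateSampler, Keep, and Rate are hypothetical names that do not appear in honeytail or any Honeycomb library, and the seeded random source is there purely to make the behavior repeatable in a test.

```go
package main

import (
	"math/rand"
	"sync"
)

// FixedRateSampler is a hypothetical helper (not part of any Honeycomb
// library) that keeps roughly 1 in rate events.
type FixedRateSampler struct {
	mu   sync.Mutex // *rand.Rand is not safe for concurrent use by handlers
	rate int
	rng  *rand.Rand
}

// NewFixedRateSampler builds a sampler with its own seeded random source.
// Rates below 1 are clamped to 1, which means "keep everything".
func NewFixedRateSampler(rate int, seed int64) *FixedRateSampler {
	if rate < 1 {
		rate = 1
	}
	return &FixedRateSampler{rate: rate, rng: rand.New(rand.NewSource(seed))}
}

// Keep reports whether the current event should be recorded.
// Float64 returns a value in [0.0, 1.0), so a rate of 1 always keeps.
func (s *FixedRateSampler) Keep() bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.rng.Float64() < 1.0/float64(s.rate)
}

// Rate returns the number of original events each kept event stands for.
func (s *FixedRateSampler) Rate() int {
	return s.rate
}
```

In the handler, the check would become something like `if sampler.Keep() { RecordEvent(req, start, err) }`, with `sampler` created once at startup from the sampleRate flag.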

Then, on the receiving end at Honeycomb or another instrumentation collector, we’d need to remember that each event stood for sampleRate events and multiply out all counter values accordingly.
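
The "multiply out" step can be sketched as follows. This is a minimal illustration, not Honeycomb's actual aggregation code: SampledEvent, Totals, and EstimateTotals are hypothetical names, and the sketch assumes each kept event carries the sample rate that was in effect when it was recorded.

```go
package main

// SampledEvent is a hypothetical shape for an event that survived sampling.
type SampledEvent struct {
	SampleRate int     // how many original events this one stands for
	DurationMs float64 // measured latency of the kept request
	IsError    bool    // whether the kept request failed
}

// Totals holds estimates for the original, unsampled traffic.
type Totals struct {
	Requests        int
	Errors          int
	TotalDurationMs float64
}

// EstimateTotals weights every kept event by its sample rate, so a single
// event recorded at a rate of 1000 counts as 1000 requests.
func EstimateTotals(events []SampledEvent) Totals {
	var t Totals
	for _, e := range events {
		t.Requests += e.SampleRate
		if e.IsError {
			t.Errors += e.SampleRate
		}
		t.TotalDurationMs += e.DurationMs * float64(e.SampleRate)
	}
	return t
}
```

For example, three kept events at a sample rate of 1000, one of them an error, would be reported as an estimated 3000 requests and 1000 errors. These are estimates of the real traffic, not exact counts.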

Go to the next page to learn about adjusting the sample rate.

Liz Fong-Jones

Field CTO

Liz is a developer advocate, labor and ethics organizer, and Site Reliability Engineer (SRE) with 18+ years of experience. She is currently the Field CTO at Honeycomb, and previously was an SRE working on products ranging from the Google Cloud Load Balancer to Google Flights.