Dynamic Sampling by Example

By: Liz Fong-Jones | May 17th, 2019


Sampling with dynamic rates on arbitrarily many keys

What if we can’t predict a finite set of request quotas to set, as in the customer ID case above? Setting the target rates by hand for each key (“error/latency” vs. “normal”) was ugly enough, and it required a lot of duplicated code. We can refactor to keep each key’s target rate and number of seen events in maps, and look them up to make sampling decisions. This is how we arrive at what’s implemented in the dynsampler-go library, which maintains a map over any number of sampling keys and allocates a fair share to each key, as long as that key is novel. It looks something like this:

[Figure: two graphs of sample rate varying over time, labeled 200 and 500]

var counts map[SampleKey]int
var sampleRates map[SampleKey]float64
var targetRates map[SampleKey]int

func neverSample(k SampleKey) bool {
	// Left to your imagination. Could be a situation where we know request is a keepalive we never want to record, etc.
	return false
}

// Boilerplate main() and goroutine init to overwrite maps and roll them over every interval goes here.

type SampleKey struct {
	ErrMsg        string
	BackendShard  int
	LatencyBucket int
}

// The rollover goroutine might compute for each k: newRate[k] = counts[k] / (interval * targetRates[k]), for instance.
// The dynsampler-go library has more advanced techniques of computing sampleRates based on targetRates, or even without explicit targetRates.
func checkSampleRate(resp http.ResponseWriter, start time.Time, err error, sr map[SampleKey]float64, c map[SampleKey]int) float64 {
	msg := ""
	if err != nil {
		msg = err.Error()
	}
	// Round latency down to a multiple of 100ms so the key space stays small.
	roundedLatency := int(100 * (time.Since(start) / (100 * time.Millisecond)))
	// Needs the strconv import. A missing or non-numeric header leaves shard at 0.
	shard, _ := strconv.Atoi(resp.Header().Get("Backend-Shard"))
	k := SampleKey{
		ErrMsg:        msg,
		BackendShard:  shard,
		LatencyBucket: roundedLatency,
	}
	if neverSample(k) {
		return -1.0
	}

	// Handlers run concurrently, so real code must guard c and sr with a mutex.
	c[k]++
	if r, ok := sr[k]; ok {
		return r
	}
	// No rate computed yet for this key: keep every event.
	return 1.0
}

func handler(resp http.ResponseWriter, req *http.Request) {
	// Use the Sampling-ID request header when it parses; otherwise fall back to a random value in [0, 1).
	r, err := floatFromHexBytes(req.Header.Get("Sampling-ID"))
	if err != nil {
		r = rand.Float64()
	}

	start := time.Now()
	i, err := callAnotherService(r)
	resp.Write(i)

	sampleRate := checkSampleRate(resp, start, err, sampleRates, counts)
	// Keep the event with probability 1/sampleRate. A negative rate means "never sample".
	if sampleRate > 0 && r < 1.0/sampleRate {
		RecordEvent(req, sampleRate, start, err)
	}
}
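
The code above leaves the periodic rollover as boilerplate. As a rough sketch of what that step could look like, here is the simple formula from the comment, newRate[k] = counts[k] / (interval * targetRates[k]), applied to every key seen in the last interval. The sampler struct, the newRates and rollover names, and the defaultTarget fallback for keys without an explicit target are all assumptions made for illustration. None of this is taken from dynsampler-go, which uses more advanced techniques.

```go
package main

import "sync"

// SampleKey mirrors the key type used in the example above.
type SampleKey struct {
	ErrMsg        string
	BackendShard  int
	LatencyBucket int
}

// sampler is a hypothetical wrapper (not part of dynsampler-go) that guards
// the maps with a mutex, because HTTP handlers run concurrently.
type sampler struct {
	mu          sync.Mutex
	counts      map[SampleKey]int
	sampleRates map[SampleKey]float64
}

// newRates applies newRate[k] = counts[k] / (interval * targetRates[k]) to
// each key that was seen. Keys without a positive target fall back to
// defaultTarget events per second (an assumption of this sketch). Rates are
// clamped to at least 1, which means "keep every event".
func newRates(counts, targetRates map[SampleKey]int, defaultTarget int, intervalSec float64) map[SampleKey]float64 {
	rates := make(map[SampleKey]float64, len(counts))
	for k, n := range counts {
		target, ok := targetRates[k]
		if !ok || target <= 0 {
			target = defaultTarget
		}
		r := float64(n) / (intervalSec * float64(target))
		if r < 1 {
			r = 1
		}
		rates[k] = r
	}
	return rates
}

// rollover swaps in rates computed from the last interval's counts and
// starts a fresh count map for the next interval.
func (s *sampler) rollover(targetRates map[SampleKey]int, defaultTarget int, intervalSec float64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.sampleRates = newRates(s.counts, targetRates, defaultTarget, intervalSec)
	s.counts = make(map[SampleKey]int)
}
```

With a 60-second interval, a key seen 6,000 times with a target of 2 events per second gets a rate of 6000 / (60 * 2) = 50, so roughly 1 in 50 of its events is kept. A rare key seen 30 times with a target of 1 per second would compute to 0.5, which is clamped to 1 so that every one of its events is kept. A goroutine driven by a time.Ticker could call rollover once per interval.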

We’re close to having everything put together. But let’s make one last improvement by combining the tail-based sampling we’ve done so far with head-based sampling that can request tracing of everything downstream.


Liz Fong-Jones

Field CTO

Liz is a developer advocate, labor and ethics organizer, and Site Reliability Engineer (SRE) with over two decades of experience. She is currently the Field CTO at Honeycomb, and previously was an SRE working on products ranging from the Google Cloud Load Balancer to Google Flights.
