Instrumenting High Volume Services: Part 1

This is the first of three posts focusing on sampling as a part of your toolbox for handling services that generate large amounts of instrumentation data.

Recording tons of data about every request coming in to your service is easy when you have very little traffic. As your service scales, the impact of measuring its performance can cause its own problems. There are three main ways to mitigate this problem:

  • measure fewer things
  • aggregate your measurements before submitting them
  • measure a representative portion of your traffic

Each method has its place; this series of posts focuses on the third: various techniques to sample your traffic in order to reduce your overall volume of instrumentation, while retaining useful information about individual requests.

An Introduction to Sampling

Sampling is the idea that you can select a few elements from a large collection and learn about the entire collection by looking at them closely. It is used widely wherever people need to tackle a problem of scale. For example, a survey assumes that by asking a small group of people a set of questions, you can learn something about the opinions of the entire populace.

Sampling as a basic technique for instrumentation is no different: by recording information about a representative subset of requests flowing through a system, you can learn about the overall performance of the system. And as with surveys, the way you choose your representative set (the sample set) can greatly influence the accuracy of your results.

This series explores several sampling methods and the situations each one suits.

A naive approach to sampling an HTTP handler might look something like this:

func handleRequest(w http.ResponseWriter, r *http.Request) {
  // do work
  if rand.Intn(4) == 0 { // send a randomly-selected 25% of requests
    logRequest(r)        // record this request
  }
}

By sampling with this naive method, however, we lose the ability to easily pull metrics about our overall traffic: any graphs or analytics that this method produces would only show around 25% of our actual, real-world traffic.

The non-negotiable: capturing the sample rate

Our first step, then, is capturing some metadata along with this sample datapoint. Specifically, when capturing this request, we’d want to know that this sampled request represents 4 (presumably similar) requests processed by the system. (Or, in other words, the sample rate for this data point is 4.)

func handleRequest(w http.ResponseWriter, r *http.Request) {
  // do work
  if rand.Intn(4) == 0 { // send a randomly-selected 25% of requests
    logRequest(r, 4)     // make sure to track a sample rate of 4, as well
  }
}

Capturing the sample rate will allow our analytics backend to understand that each stored datapoint represents 4 requests in the real world, and return analytics that reflect that reality. (Note: If you’re using any of our SDKs to sample requests, this is taken care of for you.)
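
As a rough sketch of what a backend might do with that number, each stored event can be weighted by its sample rate when computing counts and averages. This is an assumption for illustration, not a description of how any particular product implements it; the `Event` type, its fields, and both helper functions are hypothetical.

```go
package main

import "fmt"

// Event is a hypothetical stored datapoint: one sampled request plus the
// number of real requests it stands in for.
type Event struct {
	DurationMs float64
	SampleRate int // 4 means "this event represents 4 requests"
}

// estimatedCount sums the sample rates to estimate how many requests
// were actually handled.
func estimatedCount(events []Event) int {
	total := 0
	for _, e := range events {
		total += e.SampleRate
	}
	return total
}

// weightedMeanDuration averages durations, counting each event
// SampleRate times. It returns 0 when there is nothing to average.
func weightedMeanDuration(events []Event) float64 {
	n := estimatedCount(events)
	if n == 0 {
		return 0
	}
	var sum float64
	for _, e := range events {
		sum += e.DurationMs * float64(e.SampleRate)
	}
	return sum / float64(n)
}

func main() {
	events := []Event{
		{DurationMs: 10, SampleRate: 4},
		{DurationMs: 20, SampleRate: 4},
		{DurationMs: 100, SampleRate: 2}, // an event recorded at a different rate
	}
	fmt.Println("estimated requests:", estimatedCount(events))
	fmt.Println("weighted mean duration (ms):", weightedMeanDuration(events))
}
```

Storing the rate on each event, rather than assuming one global rate, means the arithmetic still works if different events were sampled at different rates.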

OK, but my traffic isn’t ever that simple:

Next, we’re ready to tackle some harder problems:

  • What if we care a lot about error cases (as in, we want to capture all of them) and not very much about success cases?
  • What if some customers send an order of magnitude more traffic than others—but we want all customers to have a good experience?
  • What if we want to make sure that a huge increase in traffic on our servers can’t also overwhelm our analytics backend?

Coming up in parts 2 and 3, we’ll discuss different methods for sampling traffic more actively than the naive approach shown in this post. Stay tuned, and in the meantime, sign up for Honeycomb and experiment with sampling in your own traffic!

This is part of a 3-part series: