Instrumentation: The First Four Things You Measure

By: Guest Blogger | January 9th, 2017

Note: this is the first in a series of guest posts about best practices and stories around instrumentation. Like it? Check out the other posts in this series. Ping Julia or Charity with feedback!

This is the very basic outline of instrumentation for services. The idea is to be able to quickly identify which components are affected by, and/or responsible for, All The Things Being Broken. The purpose isn’t to shift blame; it’s just to have all the information you need to see who’s involved and what’s probably happening in the outage.

In a total abstract void, you have a service.

[Figure: a service]

Things are calling your service, be it browsers from the interweb, or other services, or API clients from the interwebs: upstream things are depending on you.

[Figure: things calling your service]

Most of the time, your service will have dependencies on other downstream things: some database or other service.

[Figure: dependencies]

And when there are problems happening, people from Upstreamland will be telling you about your broken service, and you’ll, maybe, turn around and blame people in Downstreamistan:

[Figure: complaining]

But somehow, you need to be able to know whether something is your service’s fault or someone downstream’s fault. To do that, when people tell you that your service is broken, you need to be able to see whether it appears to be broken from the inside.

[Figure: investigating]

For all incoming requests, you want to have the following instrumentation points:

- A counter that is incremented for each request that you start serving.
- A counter that is incremented for each request that you finish serving, aka responses, labelled by success or error.
- A histogram of the time it took to serve a response to a request, also labelled by success or error.
- If you feel like it, throw in a gauge that represents the number of ongoing requests (helps identify leaks, deadlocks and other things that prevent progress).
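
To make those four points concrete, here is a minimal sketch in Python using only the standard library. Everything in it is an assumption made for illustration: `InboundMetrics`, `track`, `handle` and the bucket boundaries are hypothetical names and values, and the in-memory counters stand in for whatever metrics client you actually use.

```python
import threading
import time
from collections import defaultdict
from contextlib import contextmanager

# Hypothetical bucket upper bounds in seconds; tune them to your own latencies.
BUCKETS = (0.005, 0.025, 0.1, 0.5, 2.5, float("inf"))


class InboundMetrics:
    """In-memory stand-in for a real metrics client (hypothetical names)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.started = 0                    # counter: requests started
        self.finished = defaultdict(int)    # counter: responses, by outcome
        # histogram: one list of bucket counts per outcome
        self.durations = defaultdict(lambda: [0] * len(BUCKETS))
        self.in_flight = 0                  # gauge: ongoing requests

    @contextmanager
    def track(self):
        """Wrap the handling of one incoming request."""
        start = time.monotonic()
        with self._lock:
            self.started += 1
            self.in_flight += 1
        outcome = "error"  # assume failure until the handler returns normally
        try:
            yield
            outcome = "success"
        finally:
            elapsed = time.monotonic() - start
            # First bucket whose upper bound covers the elapsed time.
            bucket = next(i for i, bound in enumerate(BUCKETS) if elapsed <= bound)
            with self._lock:
                self.in_flight -= 1
                self.finished[outcome] += 1
                self.durations[outcome][bucket] += 1


metrics = InboundMetrics()


def handle(request):
    """Toy handler: any request equal to "bad" fails."""
    with metrics.track():
        if request == "bad":
            raise ValueError("cannot serve this")
        return "ok"
```

The outcome starts as an error and only flips to success once the handler returns, so a request that blows up halfway still shows up in the response counter and the histogram.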

With this information, when people tell you that your service is broken, you can prove or disprove their claims:

Yup, I can see the problem:
- my thing is returning lots of errors, very rapidly.
- my thing is returning few successes, very slowly.
- my thing has been accumulating ongoing requests but hasn’t yet answered them.
Nope, the problem is upstream of me, because my thing hasn’t been receiving any requests.
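
The same reasoning can be written down as a small check over one time window of inbound numbers. This is only a sketch: the `Window` shape, the `triage` function and every threshold in it are made up for illustration, and real thresholds depend entirely on your service.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Inbound numbers over one time window (hypothetical shape)."""
    started: int          # requests that began being served
    successes: int        # responses labelled success
    errors: int           # responses labelled error
    in_flight: int        # gauge value at the end of the window
    p50_success_s: float  # median duration of successes, read off the histogram


def triage(w, slow_s=1.0, error_ratio=0.5, stuck=100):
    """Return the observations the numbers support; thresholds are arbitrary."""
    if w.started == 0:
        return ["not receiving requests: look upstream"]
    findings = []
    finished = w.successes + w.errors
    if finished and w.errors / finished >= error_ratio:
        findings.append("returning lots of errors")
    if w.successes and w.p50_success_s >= slow_s:
        findings.append("successes are slow")
    if w.in_flight >= stuck:
        findings.append("requests are piling up unanswered")
    return findings or ["inbound metrics look healthy"]
```

For example, `triage(Window(500, 5, 0, 495, 3.0))` reports both slow successes and a pile-up of unanswered requests.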

This gives you many dimensions along which to prove or disprove hypotheses about what’s happening.

If it seems like your service is involved in the problem, the next step is to know: is it strictly my fault, or is it a problem with my downstreams? Before you turn around to other people to tell them their things seem to be broken, you need numbers:

[Figure: need numbers]

For all outgoing requests (database queries, RPC calls, etc…), you want to have the following instrumentation points:

- A counter that is incremented for each request that you initiate.
- A counter that is incremented for each request that got a response, labelled by success or error.
- A histogram of the time it took to get a response to a request, also labelled by success or error.
- Again, maybe throw in a gauge that represents the number of ongoing requests (helps identify stuck calls, or build-ups of thundering-herds-to-be).
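
On the outgoing side, one way to get all four measurements is to route every downstream call through a single wrapper that labels them by dependency. The sketch below assumes a hypothetical `call_downstream` helper and module-level stores; it keeps raw duration samples instead of a real histogram and is not thread-safe, so treat it as an outline rather than something to paste into production.

```python
import time
from collections import Counter, defaultdict

initiated = Counter()          # counter: calls started, by dependency
responded = Counter()          # counter: calls finished, by (dependency, outcome)
in_flight = Counter()          # gauge: calls still waiting, by dependency
durations = defaultdict(list)  # raw samples standing in for a histogram


def call_downstream(dependency, fn, *args, **kwargs):
    """Run fn(*args, **kwargs) and record the four outbound measurements.

    `dependency` is a free-form label such as "db" or "cache" (hypothetical).
    """
    initiated[dependency] += 1
    in_flight[dependency] += 1
    start = time.monotonic()
    outcome = "error"  # stays "error" if fn raises
    try:
        result = fn(*args, **kwargs)
        outcome = "success"
        return result
    finally:
        in_flight[dependency] -= 1
        responded[dependency, outcome] += 1
        durations[dependency, outcome].append(time.monotonic() - start)
```

A database query would then go through something like `call_downstream("db", run_query, sql)`, where `run_query` is whatever function your client library gives you.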

And now, you can see quickly whether the reported problem lies within your service or within one of its dependencies.
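
As a rough illustration of that comparison, the sketch below takes the error ratio seen on incoming requests and the error ratio seen per dependency, and suggests where to look first. The `locate` function and its 0.1 threshold are assumptions for the example, not a rule.

```python
def locate(inbound_error_ratio, outbound_error_ratios, threshold=0.1):
    """Guess where to look first; the default threshold is an arbitrary example.

    outbound_error_ratios maps a dependency label to its error ratio.
    """
    if inbound_error_ratio < threshold:
        return "service looks healthy from the inside"
    suspects = sorted(
        name for name, ratio in outbound_error_ratios.items() if ratio >= threshold
    )
    if suspects:
        return "look downstream: " + ", ".join(suspects)
    return "dependencies look fine: look inside the service"
```

Error ratios are only one of the dimensions above; the same comparison works with durations or with the ongoing-requests gauges.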

I talked about services, databases, API clients, the browsers on the interwebs… this principle is valid for any individual piece of software that’s in some sort of client-server shape, be it:

[Figure: basic www to service to db]

A monolithic Rails application with a SQL DB, some Redis and whatnot, serving requests from the webs all on its own, or:

[Figure: www to service to many dbs and backends]

An organically, loosely organized set of DBs and web services, or:

[Figure: lots of interlinked backends]

A massively distributed microservice soup.

In Instrumentation 102, we will see how to instrument the internals of a service. Due to budget constraints, Instrumentation 102 has been indefinitely postponed.

Thanks again to Antoine Grondin for their contribution to this instrumentation series!
