Development at Honeycomb: Crossing the Observability Bridge to Production

For years, the “DevOps” community has been focused on one main idea: what if we pushed our ops folks to do more development? To automate their work and write more code? That’s cool—and it’s clearly been working—but it’s time for the second wave of that movement: for developers (that’s us!) to own our code in production, and to be on point for operating / exploring our apps in the wild.

Observability is that bridge: the bridge from developers understanding code running on local machines to understanding how it behaves in the wild. Observability is all about answering questions about your system using data, and that ability is as valuable for developers as it is for operators.

gif of Valkyrie crossing bridge

During the development process, we’ve got lots of questions about our system that don’t smell like “production monitoring” questions, and aren’t necessarily problems or anomalies: they’re about hypotheticals, or specific customer segments, or “what does ‘normal’ even mean for this system?”

By being curious and empowered to leverage production data to really explore what our services are doing, we as developers can inform not only what features we build / bugs we fix, but also:

  • How to build those features / fix those bugs
  • How features and fixes are scoped
  • How we verify correctness or completion
  • How we roll out that feature or fix

When wondering… how to make a change

There’s a difference between knowing that something can be changed or built and knowing how it should be. Understanding the potential impact of a change we’re making—especially something that’ll have a direct, obvious impact on users—lets us bring data into the decision-making process.

By learning about what “normal” is (or at least—what “reality” is), we can figure out whether our fix is actually a fix or not.

screenshot of bursty and steady traffic

What’s “normal” customer use here? Are the steady requests “normal,” or the bursty ones? Which customers represent which behavior patterns?

In our Check Before You Change dogfood post, Ben described what it was like to code up an “obvious” feature request: unrolling JSON for nested objects. But instead of just diving in and shipping the change, he gathered data first (instrumenting our dogfood payloads to include things like the depth of the nested JSON, and how many columns it would expand to if we unrolled it), and that data let us protect most of our users by making the feature opt-in only.

A bonus example of this is our Measure twice, cut once: How we made our queries 50% faster…with data dogfood post: by looking carefully at where we could trim down perceived latency for the end user, Toshok found a way to dramatically improve the experience for most folks without going into the weeds of performance optimization (plenty of that happening already!).

This kind of instrumentation gives us debug statements in production, and it’s incredibly powerful for us day to day. These fields are lightweight, they help validate hypotheses, they carry metadata specific to our business (e.g. customer IDs), and they let us describe the execution of our service’s logic in the wild. All of this together helps us plan and build a better experience for our customers.
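As a rough illustration (a minimal sketch, not our actual dogfooding code; the helper functions and field names are made up), the payload instrumentation behind the JSON-unrolling example above can be as small as a couple of extra fields computed per event:

```python
import json


def json_depth(value):
    """Nesting depth of a parsed JSON value: 0 for a scalar, 1 for a flat object."""
    if isinstance(value, dict):
        return 1 + max((json_depth(v) for v in value.values()), default=0)
    if isinstance(value, list):
        return 1 + max((json_depth(v) for v in value), default=0)
    return 0


def unrolled_column_count(value):
    """How many flat columns this value would expand to if nested objects were unrolled."""
    if isinstance(value, dict):
        return sum(unrolled_column_count(v) for v in value.values())
    return 1


def instrument_payload(event_fields, raw_payload):
    """Attach "what would unrolling do to this payload?" fields to the event.

    `event_fields` is just a dict of key/value pairs shipped with the rest of
    the event (a stand-in for however your service emits events).
    """
    parsed = json.loads(raw_payload)
    event_fields["payload.json_depth"] = json_depth(parsed)
    event_fields["payload.unrolled_columns"] = unrolled_column_count(parsed)
```

Fields like these cost almost nothing to emit, but they let us see what a feature would do to real traffic before a single line of that feature exists.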

When wondering… how to verify correctness

As developers, we rely on all sorts of tests—unit tests, functional tests, integration tests, sometimes even end-to-end tests—to verify that our code does what we think it does.

And testing code for correctness on your machine or a CI machine is all well and good, but the real test is whether your code does the right thing in production. So what do we do when we aren’t quite sure about our change, or want to check its impact in some lightweight or impermanent way?

Feature flags (shout out to LaunchDarkly!). We love ’em for letting us try out experimental code in a controlled way, while we’re still making sure the change is one we’re happy shipping.

Pairing feature flags with our observability tooling lets us get incredibly fine-grained visibility into our code’s impact, and Honeycomb is flexible enough to let us throw arbitrary fields into our data—then look at our normal top-level metrics, segmented by flags.

screenshot of metric with feature flags

We’re looking at a vanilla top-level metric, but segmented by the values of a few arbitrary feature flags (and combinations thereof!) that were set at the time.
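The instrumentation side is straightforward. Here’s a minimal sketch (hypothetical flag keys and field names, plus a stand-in for whatever flag client you use, LaunchDarkly or otherwise) of stamping flag state onto each request’s event:

```python
FLAGS_TO_RECORD = ["fast-query-path", "unroll-nested-json"]  # illustrative flag keys


def flag_variation(flag_key, user_key):
    """Stand-in for your real feature-flag client (LaunchDarkly, homegrown, etc.)."""
    return False  # e.g. ld_client.variation(flag_key, {"key": user_key}, False)


def annotate_event_with_flags(event_fields, user_key):
    """Copy the flag values this request actually saw onto its event, one column per flag."""
    for flag_key in FLAGS_TO_RECORD:
        event_fields[f"flag.{flag_key}"] = flag_variation(flag_key, user_key)
```

Because the flag values ride along on every event, any query we can already run on a top-level metric can be broken down by flag state after the fact.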

This lets us do wild things like turning a feature flag on for a very small segment of our users, then carefully watching performance metrics for the set of folks with the feature flag on, compared to those without. In our The Correlations Are Not What They Seem dogfood post, Sam did just that, and illustrated the power of being able to slice the performance data by the precise dimensions needed to draw out insights and feel confident about his change.
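Conceptually, that canary comparison boils down to a query shaped something like this (purely illustrative; the column names are assumptions, and this isn’t a literal API call):

```python
# Group by the flag column, then compare the same top-level metrics for each group.
canary_comparison = {
    "breakdowns": ["flag.fast-query-path"],      # requests with the flag on vs. off
    "calculations": [
        {"op": "P95", "column": "duration_ms"},  # latency for each group
        {"op": "COUNT"},                         # and how much traffic each group saw
    ],
    "time_range_seconds": 3600,                  # the first hour of the rollout
}
```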

Feature flags and Honeycomb’s flexibility let us use these ephemeral segments of our data to answer the exact sorts of ephemeral questions that pop up during development—so that we can do ad-hoc, on-demand canarying in a way that’d make our ops folks proud.

When wondering… how that fix was rolled out

As any ops person will tell you, the biggest source of chaos in a system is usually humans: developers pushing new code.

(If you’re lucky, they might even tell you twice.)

The grown-up version of the ad-hoc, on-demand canarying I mentioned above (or less-grown-up, depending on how carefully the code was merged) is simply being able to compare top-level metrics across build IDs and deploys.

Build IDs are a stealthy member of the “high cardinality” club: often a simple, monotonically increasing integer that can nevertheless take on an effectively unlimited number of values. But they’re invaluable for figuring out which build introduced a problem, or where something got fixed.

Anyone with a sufficiently large number of hosts, though, knows that deploys aren’t instantaneous, so timestamps alone aren’t quite enough. It’s useful (and sometimes mesmerizing) to watch a deploy roll out across hosts, with traffic switching over from build to build.

screenshot of graph broken down by build_id

Breaking down by something as straightforward (but deceptively high-cardinality!) as `build_id` can help visualize the real, natural behavior of our systems.
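The instrumentation behind that view is tiny. A minimal sketch, assuming the build ID gets baked in at deploy time (an environment variable here, but a file written by CI works just as well), with illustrative field names:

```python
import os
import socket

# Assumed to be set by the deploy process, e.g. BUILD_ID=4921.
BUILD_ID = os.environ.get("BUILD_ID", "unknown")
HOSTNAME = socket.gethostname()


def annotate_event_with_build(event_fields):
    """Record which build served this request, and on which host it ran."""
    event_fields["build_id"] = BUILD_ID
    event_fields["host"] = HOSTNAME
```

With those two fields on every event, a deploy shows up as traffic draining from one `build_id` to the next, host by host.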

That second wave of DevOps, and what observability means for it

By making our observability tools use the nouns that are already ingrained into our software development processes—build IDs, feature flags, customer IDs—we developers can move from “Oh, CPU utilization is up? I guess I’ll go… read through all of our benchmarks?” to “Oh! Build 4921 caused increased latency for that high-priority customer you’ve been watching? I’d better go take a look and see what makes their workload special.”

Observability isn’t just for operations folks or for “production incident response.” It’s for answering the day-to-day questions we have about our systems, so that we can form hypotheses, validate hunches, and make informed decisions—not just when an exception is thrown or a customer complains.

If you’re interested in learning more, read about how we instrument our production services or (in excruciating detail) the datasets we rely on, day to day, and the sorts of questions we ask about our own services.

Or take Honeycomb for a spin yourself! Our Quickstart dataset is available for anyone to poke around in, and we’d love to hear what you think.