
Scary Things Happen in Production. Context Helps You Find Them.

Your data doesn’t become linearly more powerful as you add more context; it becomes exponentially, combinatorially more powerful with each added attribute.

March 25, 2026
Your Data Is Made Powerful by Context

Production is a rowdy place of chaos, especially at scale. When you have millions of requests per second flowing through your system, weird things are always happening. Outliers, unusual request patterns, spikes and pulses of traffic from unknown sources, port scanning…it’s all there. To the naked eye, it looks like noise.

If you know what you are looking for…patterns emerge.✨

The night sky: every dot is a request. Without intent, it's an undifferentiated field of light. But when relationships between dots are understood and tagged with descriptive attributes, they form a pattern with meaning—a constellation.

“Anomaly detection” is not the hard part

AI and machine learning have many battle-tested techniques for picking anomalies out of a baseline dataset. The (much) harder part is deciding which anomalies we should care about, which anomalies we should act on.

The first gate is the easiest: if the anomaly is causing a sufficient number of customers to have a bad experience, we care. This is what SLOs do. It gets harder from there.
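That first gate can be sketched in a few lines. This is a minimal, illustrative model, not a real SLO implementation; the names (`Request`, `slo_breached`) and the 99.9% target are assumptions for the example.

```python
# Sketch of the "first gate": do we care about this anomaly?
# All names and thresholds here are illustrative, not a vendor API.
from dataclasses import dataclass

@dataclass
class Request:
    latency_ms: float
    error: bool

def is_bad_experience(r: Request, latency_slo_ms: float = 500.0) -> bool:
    """A request counts against the SLO if it errored or was too slow."""
    return r.error or r.latency_ms > latency_slo_ms

def slo_breached(requests: list[Request], target: float = 0.999) -> bool:
    """Care about the anomaly only if the good-request ratio drops below target."""
    if not requests:
        return False
    good = sum(not is_bad_experience(r) for r in requests)
    return good / len(requests) < target
```

Five errors in a thousand requests (99.5% good) trips a 99.9% target; the gate fires only when enough customers are actually having a bad time.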

If you don’t know what you’re looking for, the anomaly has to be quite large to stand out against a noisy baseline. But if you do know what you’re looking for, the anomaly can be surprisingly easy to identify and track over time, even if it is small, so long as you are capturing data in a way that allows you to pick out those particularities.

This is the insight behind the coming wave of agentic solutions that aim to solve the age-old problem: how do you find bugs before your customers do?

The answer: by tracking developer intent, and progressively validating it in production.

The power of developer intent

Any change introduced to a swirling, complex system will have consequences, both intentional and otherwise. When a developer lands a diff and it ships to production, they are trying to change the system’s behavior in some way. That is developer intent.

When you are changing a complex system with many interdependencies, your changes will inevitably have consequences you did not expect. Some of them are good, some of them are bad, many are net neutral. The point is, you cannot predict them (or presumably you would have tested for them).

(This is why we keep banging the drum on testing in production. Yes, you should test pre-production. Yes, you should wrap what you learn in production back into your pre-production testing. But everyone tests in production, because it is impossible not to test in production; every deploy is a unique convergence of infrastructure, build ID, artifact and time, never to be repeated.)

The only way to find those consequences is to put your code in production…and look for them. That second half is what people forget. “Test in prod” is not meant to be YOLO.

Your baseline is littered with anomalies when you’re running at scale; anomalies are the norm (left). The best way to ascribe meaning to any pattern is by tracing developer intent. If you know when you made a change, and who it applies to, you can identify the consequences of that change with high confidence (with the right telemetry data).
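Slicing by intent can be sketched in a few lines. This is an illustrative example, not a real product's API; the field names (`flag_new_checkout`, `latency_ms`) are hypothetical, and it assumes every event carries the attributes that encode the change.

```python
# Illustrative sketch: when every event carries the attributes that encode
# intent (build ID, feature flag state, user), the change cohort can be
# separated from the noisy baseline with a single filter.
# Field names here are hypothetical.

def cohorts(events: list[dict], flag: str) -> tuple[list[dict], list[dict]]:
    """Split events into the change cohort (flag on) and the baseline."""
    changed = [e for e in events if e.get(flag)]
    baseline = [e for e in events if not e.get(flag)]
    return changed, baseline

def p95_latency_ms(events: list[dict]) -> float:
    """Nearest-rank p95 over the events' latency attribute."""
    latencies = sorted(e["latency_ms"] for e in events)
    return latencies[int(0.95 * (len(latencies) - 1))]
```

Even one flagged request out of millions is findable this way, because the attribute that identifies the cohort was written with the event rather than reconstructed after the fact.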

This is a data problem

This isn't primarily a culture problem, or even a tooling problem. It's a data problem. The dominant model for collecting telemetry (metrics in one place, logs in another, traces in a third) rips the fabric of relationships apart at write time.

To trace a small sample over time and explore its effects you need precision: context and cardinality. The relational seams between data points are the value. If you destroy those relationships at write time, you can’t get them back.
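To make the write-time problem concrete, here is a toy sketch (all field names hypothetical) of the same request stored as one wide event versus shredded across pillars:

```python
# Toy illustration of why write-time splitting destroys relationships.
# One wide event keeps every attribute of a single request together:
wide_event = {
    "trace_id": "abc123",            # all field names here are hypothetical
    "build_id": "2026-03-25.4",
    "feature_flag.new_checkout": True,
    "user_id": "u-4821",
    "endpoint": "/api/book",
    "duration_ms": 842,
    "error": False,
}

# The "three pillars" model shreds that same request at write time:
metric = ("http_request_duration_ms", 842)   # no user, no build, no flag
log_line = "POST /api/book 200 842ms"        # unstructured, no trace link
# Once stored separately, "which build made u-4821 slow?" is unanswerable:
# the join keys were never written.
```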

Aggregate metrics and logs are usually good enough for finding bugs. They are woefully insufficient for understanding user experience or the consequences of each diff.

Precision data meets the scalpel of intent

The beauty of developer intent is that it tells you what you are looking for. When combined with rich data sets and production guardrails that allow you to roll a change out progressively, promoting it gradually as you gain confidence, you gain the ability to understand the impact of your change on customers, and earn confidence in your code over time.
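A progressive-rollout guardrail can be as simple as the sketch below. The stages, tolerance, and function names are assumptions for illustration, not any specific vendor's rollout API.

```python
# Hedged sketch of a progressive rollout guardrail: promote a change one
# stage at a time while the change cohort's error rate stays close to
# baseline; roll back the moment it regresses. Values are illustrative.

STAGES = [0.01, 0.05, 0.25, 1.00]  # fraction of traffic receiving the change

def next_stage(current: float, change_err: float, baseline_err: float,
               tolerance: float = 0.002) -> float:
    """Promote one step if healthy; roll back to 0 if the change regresses."""
    if change_err > baseline_err + tolerance:
        return 0.0  # guardrail tripped: roll back
    idx = STAGES.index(current)
    return STAGES[min(idx + 1, len(STAGES) - 1)]
```

The point is the loop, not the numbers: each promotion is a small experiment whose consequences you check before widening the blast radius.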

The sample size this requires is often astonishingly small—maybe just a handful of requests spread out over hours, out of a baseline of tens or hundreds of millions.

But if your telemetry preserves the context of each request, and if your instrumentation hoovers up as many attributes and relationships as possible—build ID, feature flag state, user, endpoint, latency breakdown—those anomalies become findable, even traceable over time.

Context isn't just additive, it's combinatorial

Most people don’t understand why context is so irreplaceably valuable, or why arbitrarily-wide, structured log events (or traces) are such a game changer.

Your data doesn't become linearly more powerful as you add more context. It becomes exponentially more powerful.

When you add another attribute to a structured event, it doesn't give you one more thing to query. It gives you new combinations with every other field that already exists.

A wide event with 250 fields yields 2²⁵⁰ possible field combinations, roughly 10⁷⁵: an astronomically large number. Note that this only accounts for keys; high-cardinality values explode your precision set further still.

Adding field #30 creates 29 new pairwise combinations, one with every field that came before it, bringing the running total to 435. Counted as subsets, that one field doubles the number of possible combinations: it contributes more combinatorial power than the first 29 fields combined.

  • 4 fields? 6 pairwise combos, 15 possible combinations.
  • 8 fields? 28 pairwise combos, 255 possible combinations.
  • 30 fields? 435 pairwise combos, 1.1B possible combinations.
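The numbers above can be checked directly with the standard library:

```python
# The arithmetic behind those bullets, using only the standard library.
from math import comb

def pairwise(n: int) -> int:
    """Number of two-field combinations among n fields: C(n, 2)."""
    return comb(n, 2)

def subsets(n: int) -> int:
    """Number of non-empty field combinations among n fields: 2^n - 1."""
    return 2 ** n - 1

# Each new field combines with every field that came before it:
assert pairwise(30) - pairwise(29) == 29     # field #30 adds 29 new pairs
assert subsets(30) - subsets(29) == 2 ** 29  # ...and doubles the subset count
```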

This is why the three pillars model is so costly. When you scatter signals across siloed tools by type, you don't just lose some of the value. You lose most of it.

Aggregates cover over a multitude of sins

Most people have never looked at their production systems using anything but metrics, logs and traces. Which means they have been accumulating “weird shit” (I believe that’s the technical term) the whole time.

One of the most enjoyable moments of any Honeycomb proof of concept is when engineers start turning over rocks and finding some of the terrifying things that have been lurking beneath the surface in their systems all along.

This never gets old. 🙃 Hey, no judgment—it happens to everyone.

What happens when a team goes from having very little context to having A LOT of it, all at once?

Case study: Homeaglow

When Homeaglow, the largest home cleaning marketplace in the United States, adopted Honeycomb, they jumped right into the deep end: they instrumented everything at once.

"We opened the firehose. We did not know what volume we would have; we did not know anything. That was the whole problem," said James Baxley, VP of Engineering.

Context and precision revealed two sets of problems. First came a wave of quick wins, surfacing bugs and inefficiencies.

Within days of turning on the firehose, Homeaglow found several problems that had been silently degrading customer experience:

  • A memory allocation issue in their core booking API. Once fixed: a 40x performance improvement in their most critical customer flow.
  • A debug log buried in a membership billing job was quietly generating a massive N+1 query, causing a routine job to bloat from minutes to hours.
  • Redis cache behavior they'd never been able to observe was costing them 10x what it should have.

The second kind takes longer to surface, but is even more valuable. These insights are about understanding the product and how users are using it.

Once Homeaglow had rich, wide events flowing, they were able to make product decisions based on how it was actually being used:

  • Correlating accept-language headers with cleaner review scores revealed that Spanish-speaking cleaners were consistently outperforming—a pattern invisible without the relational context to connect the two fields. This became a new product initiative.
  • Trace data showed that a significant number of customers were already using the cleaner-facing mobile app. After seeing this, the team kicked off a multi-month native app project.

These are discovery stories, not debugging stories. Things you didn't know to look for and cannot find without rich context and scalpel tools.

Logs and metrics are infra tools and they solve infra problems. But the way you instrument your product, collect the telemetry and present it for analysis is not an infrastructure problem. It’s a product problem.

As James put it: "You will find scary things. They are already happening."

Agents need this more than we do

It is an exceptional developer who has the discipline to instrument their code as they write it, deploy to a canary, progressively increase the rollout percentage, and check up on their diff for the next few days. (Most developers don’t have the right tools, either.)

Doing this is time-consuming and tedious for a human. It’s what agents were built for.

But agents will need context and cardinality even more than developers do, because agents don’t have intuition. If we do not encode the wisdom of senior developers into the system, it does not exist.

Our entire industry is in the midst of upheaval. We are all trying to figure out how to cross the gap from traditional, handcrafted, artisan code to the software factory. We don’t have all the answers; no one does.

But one thing that won’t change is this: just like their human counterparts, agents are only as good as the context they're given.

Read our O’Reilly book, Observability Engineering

Get your free copy and learn the foundations of observability, right from the experts.