
AI in Production Is Growing Faster Than We Can Trust It


January 23, 2026

Enterprise software has moved past the generative AI testing phase. Businesses with millions of daily users or workloads are no longer just prototyping LLMs in a vacuum. They’re wiring agentic capabilities directly into product interfaces and infrastructure to stay competitive.

This wave is often compared to the earlier spread of microservices, but we aren’t just adding new dependencies and complexity. We’re fundamentally changing the ingredients of the stack: LLMs introduce infinitely varied inputs and outputs into systems that were built for predictability.

Because every step in an agentic workflow carries a probabilistic chance of LLM failure, the risk of end-to-end failure compounds with every step added. The result is a massive growth in data complexity and an extremely narrow margin for error.
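As a back-of-the-envelope illustration (the per-step success rate and step count below are assumptions, not figures from this post), a few lines of Python show how quickly per-step reliability compounds:

```python
# Rough arithmetic: assume each LLM-dependent step succeeds 98% of the time
# and an agentic workflow chains 12 such steps.
step_success = 0.98
steps = 12

workflow_success = step_success ** steps
print(f"End-to-end success rate: {workflow_success:.1%}")  # about 78.5%
```

Under those assumptions, roughly one in five workflows goes wrong even though every individual step looks healthy on its own.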


AI reliability and trust are naturally becoming top-of-mind concerns. By 2028, Gartner predicts, 40% of organizations will have implemented some form of dedicated AI observability. Driving that adoption is a new set of failure patterns teams will need to address:

  • Hallucinations that look “successful” to basic health checks.
  • Drift: model behavior that shifts as weights update or contexts change.
  • Cost spikes triggered by tiny config changes in a RAG pipeline.
  • Latency creep that causes regressions which only appear under actual user load.

Where observability fails with LLMs

Engineering teams are caught between a rock and a hard place. They need to ship fast, but AI failures are difficult to monitor: they are subtle, contextual, and expensive. The demand to ship new AI features is exceeding the ability to reliably operate them.

Traditional observability platforms (the ones built on pre-aggregated metrics) are hitting a structural wall. Leading monitoring suites have introduced valuable LLM dashboards, but their analysis workflows are still fundamentally optimized for low-cardinality metrics (i.e., tracking a small number of dimensions with limited variation across tags and values). New, specialized AI reliability tools are great for dev-time evaluation, but struggle to bridge the gap into full-scale production infrastructure.

AI telemetry is inherently more complex. To understand an agentic AI failure, you need the prompt variant, the retrieval context, the token count, and the specific user ID all in one place, tracked to the request.

  • The status quo: You see that an LLM call “succeeded,” but you have no idea why the output was garbage for a specific user cohort in production.
  • The insight gap: While specialized tools in your testing environment evaluate AI, they’re not designed for granular, real-time insight into performance and efficiency with real users.
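To make “all in one place” concrete, here is a minimal sketch using the OpenTelemetry Python API that puts that context on a single request span. The helpers (`retrieve_context`, `call_model`), the prompt variant name, and the attribute names are placeholders for illustration, not official conventions:

```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def answer_question(user_id: str, question: str) -> str:
    # One span per request, carrying prompt variant, retrieval context,
    # token usage, and user ID together so failures can be sliced later.
    with tracer.start_as_current_span("llm.answer_question") as span:
        span.set_attribute("app.user_id", user_id)
        span.set_attribute("llm.prompt_variant", "support_v3")      # assumed naming
        chunks = retrieve_context(question)                         # placeholder RAG retrieval
        span.set_attribute("llm.retrieval.chunk_ids", [c.id for c in chunks])
        response = call_model(question, chunks)                     # placeholder model client
        span.set_attribute("llm.usage.total_tokens", response.total_tokens)
        span.set_attribute("llm.finish_reason", response.finish_reason)
        return response.text
```

With those attributes on the same span, a “successful” call that produced garbage for one cohort stops being invisible.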

Supporting observability for the AI era

If you want to understand (and trust) how agentic systems behave in the wild, you cannot collapse your data into averages. You need to capture all relevant context and preserve its granularity:

  • Token-level timing: time to first token vs. total duration (see the sketch after this list).
  • Context and chunk injection: what did the RAG actually feed the model?
  • Self-evaluation signals from agentic loops.
  • Cost-per-interaction mapped to specific features.
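To make the first item concrete, here is a minimal sketch of measuring time to first token against total duration while streaming a response. `stream_completion` is a stand-in for whatever streaming client you actually use; only the timing logic is the point:

```python
import time

def timed_stream(prompt: str) -> dict:
    # Record when the first chunk arrives vs. when the stream finishes.
    start = time.monotonic()
    first_token_at = None
    chunks = []
    for chunk in stream_completion(prompt):   # placeholder streaming call
        if first_token_at is None:
            first_token_at = time.monotonic()
        chunks.append(chunk)
    end = time.monotonic()
    return {
        "text": "".join(chunks),
        "time_to_first_token_ms": (first_token_at - start) * 1000 if first_token_at else None,
        "total_duration_ms": (end - start) * 1000,
    }
```

Both numbers belong on the same trace as the rest of the request’s context, not in a separate dashboard.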

Honeycomb doesn’t ask you to choose which dimensions “matter” upfront. Its high-dimensional data model allows you to slice by any variable (prompt version, user cohort, or model config) and see exactly where the quality or cost is leaking.

Instead of guessing, teams can catch subtle degradations before they become outages, debug RAG pipelines by seeing exactly which part of a document caused a hallucination, and set realistic SLOs for quality in a world where "perfect" doesn't exist.
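On the SLO point, here is a minimal sketch of treating quality as a service-level indicator. The eval scores, threshold, and target below are invented for illustration:

```python
def quality_sli(eval_scores: list[float], threshold: float = 0.8) -> float:
    # Fraction of responses whose evaluation score clears the bar.
    good = sum(1 for score in eval_scores if score >= threshold)
    return good / len(eval_scores)

scores = [0.95, 0.88, 0.62, 0.91, 0.79, 0.97]   # made-up per-response eval scores
slo_target = 0.90                                # e.g., 90% of responses should clear 0.8
print(f"Quality SLI: {quality_sli(scores):.1%} (target {slo_target:.0%})")
```

The target deliberately sits below 100%: the question is not whether the model ever misses, but whether it misses more often than the budget allows.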

How Intercom balanced speed and spend

Intercom’s AI agent, Fin.ai, handles millions of support conversations. For them, the challenge wasn’t just making Fin work; it was getting its AI workloads to scale without killing their margins or their user experience.

The visibility gap

With 19 teams and dozens of services involved, Intercom could see that services were up, but they couldn't see the customer’s perspective. How long was a user actually staring at a loading spinner?

The Honeycomb approach

Intercom started tracking time to first token: the exact moment a user sees the AI start to respond. They wrapped this, along with many other LLM context attributes, into Honeycomb traces, linking the frontend experience to the backend execution. This allowed them to:

  • Optimize latency: They cut median response times by two seconds.
  • Manage cost: By adding token counts to those same traces, they could see exactly which optimizations were driving up costs (see the sketch after this list).
  • Break the tradeoff: Intercom stopped guessing if a faster model was worth the price. They had the data to prove it.
  • Grow the business: Fin.ai's financial success has significantly fueled Intercom's expansion, establishing Fin as a major business unit.
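As a rough sketch of the cost side (the prices and attribute names below are assumptions for illustration, not Intercom’s or Honeycomb’s actual figures), token counts on the active span can be turned into a per-interaction cost with the OpenTelemetry Python API:

```python
from opentelemetry import trace

PRICE_PER_1K_TOKENS = {"input": 0.0025, "output": 0.01}   # assumed $/1K token rates

def record_interaction_cost(input_tokens: int, output_tokens: int, feature: str) -> None:
    # Attach token usage and a derived cost to the span already tracking latency,
    # so cost and speed can be sliced together by feature, model, or prompt version.
    cost = (input_tokens / 1000) * PRICE_PER_1K_TOKENS["input"] \
         + (output_tokens / 1000) * PRICE_PER_1K_TOKENS["output"]
    span = trace.get_current_span()
    span.set_attribute("llm.usage.input_tokens", input_tokens)
    span.set_attribute("llm.usage.output_tokens", output_tokens)
    span.set_attribute("llm.cost_usd", cost)
    span.set_attribute("app.feature", feature)   # e.g., "resolve_conversation" (assumed)
```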

The bottom line

Generative AI features in production don’t always fail with a bang, and aggregated monitoring doesn’t capture what’s really happening. These agentic workflows, while essential for modern products, can become a probabilistic nightmare: failing in ways that erode customer trust over time.

Scaling them safely requires an observability platform that supports unlimited cardinality and high dimensionality, so engineers can pull clear signals out of the noise of high-volume, highly granular AI workload telemetry.