Innovation Week Day 2: Observability for AI, and Observability With AI

AI is reshaping the SDLC in two directions at once. AI-generated code is shipping faster and with less human supervision than ever before, while agents and LLMs are running directly in production, where they behave very differently from traditional software: non-deterministic, with a blast radius wider than any single function or component, and no stack trace to catch when something goes wrong.

Together, those changes compress the multi-stage software development lifecycle into rapid loops of intent and validation, with most of what used to live in pre-production now happening live. Honeycomb's answer splits into two: observability for AI, which expands what you can see to include agents and LLMs running in production, and observability with AI, which uses new tools to solve harder problems faster than you could before.

Observability for AI: Agent workflows are opaque, but Agent Timeline makes them visible

Teams putting agents into production keep telling us the same thing: they can't see what's actually happening. Agent workflows branch, retry, call tools, hand off to other agents—and from the outside, you only ever see fragments. When something breaks, the failure could be anywhere.

Existing tools weren't built for this problem. LLM eval platforms understand model behavior but never see your database or API calls. Traditional APM tools see HTTP requests but have no concept of full agent interactions. Engineers end up stitching traces together by hand, tab-hopping across tools.

Agent Timeline, entering early access today, changes that. It gives you a conversation-level view that renders an entire agent workflow as a single visual sequence: every agent invocation, LLM call, tool call, and downstream trace, all bound by a conversation ID.

In Shashank's demo, support handed off a single conversation ID and nothing else. Agent Timeline loaded immediately, with a summary of duration, tool calls, failures, and tokens consumed. From there, Shashank pinpointed a broken tool call in seconds: check_shipping was failing with a connection error because the shipping service had gone offline. Below the timeline, the trace waterfall pulled in all the non-GenAI services the agents were calling, so the entire stack sat in a single view.
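
For teams instrumenting their own agents, the binding mechanism is a shared conversation identifier on every span. Here's a minimal sketch using the OpenTelemetry Python SDK, assuming the gen_ai.conversation.id attribute from the OTel GenAI semantic conventions (the span names and the check_shipping tool call are illustrative, and the exact field a timeline view keys on may differ):

```python
from opentelemetry import trace

tracer = trace.get_tracer("support-agent")

def handle_turn(conversation_id: str, user_message: str) -> None:
    # Every span in the workflow carries the same conversation ID, so a
    # timeline view can bind agent, LLM, and tool spans into one sequence.
    with tracer.start_as_current_span("agent.invoke") as span:
        span.set_attribute("gen_ai.conversation.id", conversation_id)
        with tracer.start_as_current_span("gen_ai.chat") as llm_span:
            llm_span.set_attribute("gen_ai.conversation.id", conversation_id)
            ...  # call the model with user_message
        with tracer.start_as_current_span("tool.check_shipping") as tool_span:
            tool_span.set_attribute("gen_ai.conversation.id", conversation_id)
            ...  # call the shipping service; its downstream trace shares the ID
```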

Observability for AI: Prompts, tokens, and quality scores are now first-class telemetry

Shipping agents into production means accepting that model inputs and outputs are just as important to understand as database queries and API response times. They're part of the same system. Deeper LLM insights, shipping today, brings prompts, token counts, model identifiers, latency distributions, and quality scores into Honeycomb as first-class span attributes, conforming to the OpenTelemetry GenAI semantic conventions.
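
If you're instrumenting by hand rather than through an auto-instrumentation library, the shape looks something like this minimal Python sketch. The gen_ai.* attribute names come from the OpenTelemetry GenAI semantic conventions; the client object, the score_answer helper, and the eval.quality_score attribute are hypothetical stand-ins, since eval scores aren't yet standardized:

```python
from opentelemetry import trace

tracer = trace.get_tracer("llm-client")

def traced_completion(client, prompt: str):
    with tracer.start_as_current_span("chat gpt-4o") as span:
        # Standard OpenTelemetry GenAI semantic convention attributes.
        span.set_attribute("gen_ai.operation.name", "chat")
        span.set_attribute("gen_ai.request.model", "gpt-4o")
        response = client.complete(prompt)  # hypothetical LLM client
        span.set_attribute("gen_ai.response.model", response.model)
        span.set_attribute("gen_ai.usage.input_tokens", response.input_tokens)
        span.set_attribute("gen_ai.usage.output_tokens", response.output_tokens)
        # Quality scores aren't in the conventions yet; a custom attribute works.
        span.set_attribute("eval.quality_score", score_answer(response.text))
        return response
```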

Deeper LLM insights builds on the Gen AI tab on traces, which renders LLM messages as Markdown directly in the trace waterfall sidebar, alongside evals and other AI-specific fields. Everything your agent does is now visible in the same place as the rest of your production system: what it was told, what it said, how long it took, and how good the answer was.

Observability with AI: Your Canvas investigation starts when the alert fires

The compression of the SDLC challenges the underlying assumptions of observability as a practice. Failures often show up first in production. They don't always throw explicit errors: prompt drift can degrade an agent, and a support agent inventing a refund policy never raises an exception. And there isn't always a safe rollback to retreat to while you figure things out. Sometimes the only way out is through.

Observability has to help you solve hard problems fast, in production, while you're still flying the plane. The re-imagined Canvas is built for exactly that, with multiplayer collaboration, living artifacts that update as the team learns, and auto-investigation that means work has already started before you even open the tab.

In Taylor's demo, an overnight checkout latency trigger had already fired an auto-investigation and produced a structured plan with ranked hypotheses, ready by morning. Taylor didn't have to start from scratch. By circling a slow group of requests on a heatmap, they instructed the Canvas agent to run BubbleUp on that selection, which surfaced a misbehaving pod and a specific user dominating the slow requests. When a second investigator joined the same canvas, their agent read everything Taylor's agent had already done and added context the first agent hadn't yet found. A custom skill encoding the team's checkout flow runbook then reordered the action plan and elevated the issue from P3 to P2.

Observability with AI: Skills encode your team's knowledge into every future investigation

Canvas skills are how your team's runbooks and tribal knowledge become an active part of the investigation instead of a document someone has to remember to open. Pre-built skills cover the most common investigation patterns out of the box. Custom skills let you encode the specific context, thresholds, and decision logic your team has accumulated, so every auto-investigation starts with your best thinking already applied.

In Ken's demo, a trigger fired on token usage in the customer service chat agent and kicked off an auto-investigation using a custom skill that knew exactly what to look for. Within minutes, Canvas had assembled the picture: 88 conversations had exceeded 80,000 tokens in the trigger window, the order status agent was burning 79% of all tokens, and a runaway loop in a check_shipping tool was the culprit. Ken jumped into Agent Timeline to confirm and saw the agent pulling 145K of order data on every single turn, then sending it back to the model again and again. Meanwhile, a teammate working a separate thread on the same canvas spotted a system prompt override added a week earlier. Ken pulled it onto the canvas, asked the agent to date it, and the regression timeline assembled itself.

What we've been shipping to get here

MCP became read-write

When our MCP launched, it was read-only: you could ask it questions about your data and get answers. Useful, but not truly agentic. Now, with Agent Skills for AI Coding Assistants, MCP can create SLOs and triggers based on what an agent finds, without handing off to a human. An agent can now investigate a problem and set the alert that catches it next time.
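
In practice, that flow looks like an MCP client invoking a write-capable tool on the server. Here's a rough sketch with the MCP Python SDK, where the server URL, the create_trigger tool name, and its arguments are hypothetical placeholders; list_tools() shows what any given server actually exposes:

```python
import asyncio
from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client

async def set_alert_from_finding() -> None:
    # Connect to an MCP server over streamable HTTP (URL is a placeholder).
    async with streamablehttp_client("https://mcp.example.com/mcp") as (read, write, _):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # "create_trigger" is a hypothetical tool name standing in for
            # whatever write tool the server exposes.
            result = await session.call_tool(
                "create_trigger",
                arguments={
                    "dataset": "checkout",
                    "description": "Alert on the latency pattern found during the investigation",
                },
            )
            print(result.content)

asyncio.run(set_alert_from_finding())
```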

We also resolved the OAuth friction that made connecting anything other than Claude cumbersome: any compliant MCP client (Claude Code, Cursor, Copilot) now connects without pre-registration. And AI agents can now answer Refinery and sampling configuration questions accurately, with the reference docs served directly through the MCP server. We also added OpenTelemetry semantic conventions to the MCP server, so AI-generated queries are grounded in what your span attributes actually mean.

Canvas expanded steadily

Canvas can now do auto-investigations from trigger and SLO burn alerts. Prompt Intelligence allows you to paste a chart URL into Canvas to load it as a component, a faster way to bring existing queries into an investigation. The Canvas Slack app has been available since April, bringing investigations into the tools where engineering teams already live. Contextual Canvas chat (Ask Canvas) is now available across the entire product, covering query results, traces, boards, and SLOs, with surrounding context automatically shared with the model.

Alerting got smarter

Query Math for Triggers lets you write alert conditions using formulas and math expressions rather than raw column values. BubbleUp's sorting algorithm was improved in April to surface fields where the baseline is higher than the outlier. With Anomaly Detection (Early Access), customers can now adjust anomaly detector sensitivity directly in the UI, with a preview graph that shows what the change does before saving it. We also added a presence signal type that tracks whether a metric is being observed at all: if your service stops sending data, you want to know. And anomaly profiles now have an auto_investigate boolean; when an anomaly fires with the flag on, it kicks off a Canvas auto-investigation.
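
To make the Query Math idea concrete, here's a hedged sketch against Honeycomb's Triggers API. The endpoint, query, threshold, and frequency fields follow the existing API shape; the "formula" field and its expression syntax are hypothetical stand-ins for the new feature, expressing an error rate rather than a raw column threshold:

```python
import requests

HONEYCOMB_API_KEY = "..."  # key with trigger-management permissions

trigger = {
    "name": "Checkout error rate above 5%",
    "query": {
        "time_range": 900,
        "calculations": [{"op": "COUNT"}],
        "filters": [{"column": "service.name", "op": "=", "value": "checkout"}],
    },
    "formula": "COUNT(error = true) / COUNT",  # hypothetical Query Math syntax
    "threshold": {"op": ">", "value": 0.05},
    "frequency": 900,
}

resp = requests.post(
    "https://api.honeycomb.io/1/triggers/checkout",
    headers={"X-Honeycomb-Team": HONEYCOMB_API_KEY},
    json=trigger,
)
resp.raise_for_status()
print(resp.json())
```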

Dark mode shipped

It's the most requested feature from our developer community, and it's launching today. Head to the UI and try it out!

What's next

On Day 3, we'll zoom out to how Honeycomb fits into the broader tooling ecosystem. Expect deep dives with existing Honeycomb partners, an integration with Amazon Bedrock AgentCore, and a new partnership announcement with Embrace.