Innovation Week Day 1: The SDLC Is Collapsing, and Observability Has Never Mattered More
Day 1 of Innovation Week was about how software gets validated, where observability fits, and the problems that have always been hard but are now genuinely urgent, with AI part of the software development lifecycle.

By: Shabih Syed

The software development lifecycle is collapsing. The multi-stage pipeline that defined how software got built and shipped for decades is compressing into rapid loops of intent and validation, with agents now part of the teams building and running it. Day 1 of Innovation Week was about what that shift means for how software gets validated, where observability fits, and the problems that have always been hard but are now genuinely urgent.
Production is the most important stage of development
The conventional software lifecycle treated production as the end state, the place you deployed to after all the real work was done. Staging caught the bugs, QA signed off, and production was where software lived, not where it was understood.
That model was already under strain before AI crushed it entirely. AI-generated code is shipping with less pre-production review than ever before. Agents running in production are non-deterministic by definition: the same input doesn't produce the same output, failures aren't repeatable, and what broke today didn't exist as a failure mode yesterday. You can't test your way to confidence before the deploy. The validation has to happen live, on real traffic, against real behavior.
Production has always been the most important stage of development. We just have more pressure to treat it that way now.
The gap between building and operating is where value goes to die
Many engineering teams operate two separate loops: development (build, test, merge) and operations (alert, observe, fix). Those loops have run in parallel for decades, rarely communicating, rarely teaching each other anything. A regression surfaces in production, and an on-call engineer investigates. The knowledge they earn fighting that incident lives in their head until they write it down somewhere, if they ever do. The next engineer who runs into the same problem starts from scratch.
The gap between those loops is where millions of dollars of lost value live. It's where the same bugs get reintroduced, the same incidents recur, and the accumulated operational wisdom of a team never makes it back into the software that created the problems in the first place.
AI magnifies this problem from both ends. More code is shipping with less review, which puts more pressure on the operations side to catch what slipped through. At the same time, AI gives teams a new capability: the ability to encode their best thinking into the system itself. The knowledge that used to live only in the heads of three senior engineers can now be applied consistently, at scale, on every investigation.
Traditional observability was built for a different shape of software
When teams start running agents in production, they discover a specific gap. They thought they had coverage: they had APM, they instrumented their services. They did the work. Then the agent ships, and everything they built turns out to have been designed for a deterministic world: pre-decide what matters, pre-aggregate the data, pre-build the dashboards, model the known failure modes.
That model works when failures are predictable. Agents are not predictable. The path the agent follows is shaped by what the model decides in the moment. What breaks today didn't exist as a possibility yesterday. Teams end up in archeology mode, piecing together partial logs, manually correlating timestamps, trying to reconstruct what the agent was actually doing from fragments.
Trying to debug an agent with disconnected tools is like trying to understand a chess game by looking at each piece in isolation. You can see the board, but you can't see the game. What's missing is the connected view: every model call, every tool invocation, every handoff and sequence, queryable in real time, as it actually happened, not as an average of metrics.
The quality of an AI's output is bounded by the quality of the data it can access. Data fragmented across three tools returns fragments. Short retention means no patterns. Throttled cardinality means no correlations to find. This is why high-cardinality event-based tracing in a connected data store has always been the right foundation.
We believed that was the right way to do observability. It turns out it's also the right way to do it for agents.
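To make that concrete, here's a minimal sketch of the wide-event approach using the OpenTelemetry Python SDK: one span per unit of work, loaded with high-cardinality attributes so questions you haven't thought of yet can still be answered later. The service name, attribute keys, and the `process` helper are illustrative assumptions, not a prescribed schema.

```python
# Minimal sketch: one wide span per unit of work, carrying high-cardinality
# attributes as a single queryable event. Names and fields are illustrative.
from opentelemetry import trace

tracer = trace.get_tracer("checkout-service")  # hypothetical service name

def handle_checkout(request):
    with tracer.start_as_current_span("checkout.process") as span:
        # High-cardinality fields -- user IDs, build IDs, feature flags --
        # are exactly what pre-aggregated metrics have to throw away.
        span.set_attribute("user.id", request.user_id)
        span.set_attribute("cart.id", request.cart_id)
        span.set_attribute("cart.item_count", len(request.items))
        span.set_attribute("deploy.build_id", request.build_id)
        span.set_attribute("feature_flag.new_pricing", request.flags.get("new_pricing", False))

        result = process(request)  # hypothetical business logic
        span.set_attribute("checkout.total_usd", result.total_usd)
        span.set_attribute("checkout.outcome", result.outcome)
        return result
```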
Observability for AI
The first part of Honeycomb's answer to the agent era is expanding what you can see. If agents are now part of your production system, they need to be just as observable as everything else running alongside them:
- Prompts, token counts, model identifiers, latency, and quality scores become first-class telemetry, conforming to the OpenTelemetry GenAI semantic conventions.
- LLM messages get rendered directly in the trace waterfall, not siloed in a separate eval tool.
- Conversation-level views render an entire agent workflow as a single visual sequence: every agent invocation, tool call, and handoff, bound by a conversation ID, with the full trace waterfall of downstream services in the same view.
Without that connected view, you only ever see fragments from the outside. When something breaks, the failure could be in the model, a downstream API, a database, or a broken tool call at any layer in between. That's the gap Agent Timeline, now in early access, closes.
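As a rough sketch of what that telemetry can look like at the instrumentation level, here's a single LLM call inside an agent step, again using the OpenTelemetry Python SDK. The gen_ai.* attributes follow the (still-evolving) GenAI semantic conventions; the provider value, the conversation.id key, and the call_llm helper are illustrative assumptions rather than a fixed schema.

```python
# Rough sketch: one span per LLM call, with gen_ai.* attributes so prompts,
# token counts, and model identifiers become first-class telemetry.
from opentelemetry import trace

tracer = trace.get_tracer("support-agent")  # hypothetical agent service

def agent_llm_step(conversation_id, model, messages):
    # Span name follows the "operation model" pattern, e.g. "chat gpt-4o".
    with tracer.start_as_current_span(f"chat {model}") as span:
        span.set_attribute("gen_ai.operation.name", "chat")
        span.set_attribute("gen_ai.system", "openai")           # provider; assumption
        span.set_attribute("gen_ai.request.model", model)
        span.set_attribute("conversation.id", conversation_id)  # ties agent steps together
        response = call_llm(model, messages)                    # hypothetical client call
        span.set_attribute("gen_ai.response.model", response.model)
        span.set_attribute("gen_ai.usage.input_tokens", response.usage.input_tokens)
        span.set_attribute("gen_ai.usage.output_tokens", response.usage.output_tokens)
        span.set_attribute("gen_ai.response.finish_reasons", response.finish_reasons)
        return response
```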
The second part of the answer is using AI to help your teams solve harder problems faster than they could before, with AI embedded in the investigation itself, starting before an engineer even opens a laptop.
When a trigger fires or an SLO burns, Canvas immediately launches an auto-investigation and builds a structured picture of what's happening: ranked hypotheses, relevant queries, context from prior investigations. By the time someone opens the tab, the work has already started. And because Canvas is multiplayer, humans and agents can work the same investigation simultaneously, each reading what the other has already found and building on it.
Skills are how your team's best thinking gets built into every future investigation. Every engineering organization has a few people who always know exactly where to look when things break. That knowledge has always been impossible to package and scale. Skills encode it directly into the investigation agent, so your runbooks and institutional context are applied from day one on every incident.
Observability is a prerequisite
The organizations navigating this well have made one decision that separates them from the ones that are firefighting: they treat observability as a requirement before any agentic system ships, not something they bolt on after the first incident.
One design partner, a large bank building multi-agent systems, put it directly: no agentic system goes into dev without observability in place, because they have no idea where it’ll break.
That mindset shows up in practice at the companies that have moved fastest. Mixpanel rolled out Claude Code to their entire engineering org in July 2025 and immediately started tracking agent costs, using a Honeycomb board template as one of their sources of truth for agent spend. When agents are writing code that goes into production, from UI work to highly stateful storage systems, you need the same observability primitives you'd apply to any other production system. The lesson holds across the industry. Intercom's Fin chatbot handles millions of customer conversations across thousands of organizations. When resolution rates started degrading and time to first token started climbing, they knew what was happening, and Honeycomb helped them understand why. The resolution rate is now at its highest level.
What we've been building points here
The capabilities we've been shipping over the last several months aren't a pivot toward AI. They're the same conviction, now more visible.
MCP became read-write in March. Agents can now investigate a problem and set the alert that catches it next time, in the same flow, without handing off to a human. Any compliant MCP client (Claude Code, Cursor, Copilot) connects without pre-registration, because friction in the feedback loop is the enemy of the feedback loop.
Canvas auto-investigation launched in February. Within four days, 279 investigations had run across 91 teams. The investigation starts when the alert fires. BubbleUp is now available inside Canvas investigations, and the Canvas Slack app opened to all Honeycomb Intelligence customers in April, bringing investigations into the tools where teams already live.
Each of these is the same idea in a different form: the feedback loop between what happens in production and what your team learns from it should be as fast, precise, and complete as possible. The agent era makes that more urgent.
Join us for Day 2
More agents running faster with less visibility is not progress. One well-observed agent connected to good data and trusted enough that teams act on what it finds is worth more than a swarm running blind.
Join us for Day 2, where we're announcing Agent Timeline, deeper LLM insights, and a re-imagined Canvas built for exactly the kind of investigations that the agent era demands.
Join us for Innovation Week
A three-day virtual event on AI & observability, with keynotes from the founders, product announcements, and the partners enabling it all.