
Your Questions About AI-Assisted Development Answered


March 5, 2026

We recently hosted a webinar on AI-assisted development with DORA, and the audience had a lot of questions—far more than we could get to in an hour. I picked out six that get at the stuff people are wrestling with day to day. These aren't the easy questions, and I don't think there are necessarily easy answers, but I've spent the past year building and shipping with AI coding tools and observing (literally) what happens when that code hits production.

Here's what I have. If you missed the webinar, you can watch it here:

On behavior

Briony Goldsack, Isometric: How do you propose observing agent behavior and decisions?

Austin: The short answer is: the same way you observe everything else. Traces and logs (and some metrics, where it makes sense).

The long answer is: you need to stop thinking of "agent observability" as a special category.

An agent is a program. It makes network calls, reads and writes data, has latency characteristics and failure modes. The fact that it's calling an LLM in a loop instead of a database in a loop doesn't change the fundamental shape of the problem. If you have a good observability practice for your applications today, you already know how to observe agents. Instrument them with OpenTelemetry, emit spans for each tool call and LLM invocation, and send that telemetry to the same place you send everything else.
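The shape of that instrumentation is the same as for any other service: one span per LLM invocation, one per tool call. A minimal sketch in Python, using a stand-in tracer so it runs standalone (in real code you'd use the OpenTelemetry SDK's `trace.get_tracer`; the span names and attributes here are illustrative):

```python
import time
from contextlib import contextmanager

# Stand-in for an OpenTelemetry tracer. In real code, spans would be
# created via trace.get_tracer(__name__) and exported to your collector.
SPANS = []

@contextmanager
def span(name, **attributes):
    start = time.monotonic()
    try:
        yield
    finally:
        SPANS.append({
            "name": name,
            "duration_s": time.monotonic() - start,
            **attributes,
        })

def run_agent(question):
    # One span per LLM invocation and per tool call -- the same shape
    # you'd use for a database query or an outbound HTTP request.
    with span("llm.invoke", model="example-model"):
        plan = "search"  # pretend the model chose a tool here
    with span("tool.search", query=question):
        result = f"results for {question!r}"
    return result

run_agent("why is checkout latency up?")
```

The telemetry then goes to the same backend as everything else, so an agent trace sits next to your service traces instead of in a separate silo.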

The more interesting question is whether you're observing your own agents or agents you're providing to others. If it's the latter—say, you're building an agentic product for your customers—then this looks a lot like any other application. You need traces, you need to understand performance and error rates, and you need to be able to debug issues when a customer says "your agent gave me a wrong answer." Standard stuff. Instrument it like any other service.

If it's your own internal agents, the picture shifts. You have more options, more control, and a different set of concerns. You're probably thinking about things like:

  • Are these agents accessing the right data?
  • Are we auditing that access?
  • Is there a governance layer?

These are real questions, and they're worth taking seriously, but they're also not new questions. They're the same data governance questions you should already be asking about any system that touches production data.

The question I think people should be asking more is: how do you make agent telemetry available to other agents, or to your internal data platform in general? If your observability agent is investigating an incident, it needs access to the same telemetry data your human engineers use. If your coding agent is deploying a service, it should be able to check the observability data from that service to validate the deployment. The data flows in both directions, and the platform that stores and queries that data becomes a shared resource between human and machine operators. That's a shift in how we think about observability platforms—not just "humans looking at dashboards," but "a data layer that both humans and agents can interrogate."


On production incidents

Austin Redding, Traversal: Is the increase in AI-generated code increasing the complexity of resolving production incidents?

Austin: Mostly it's increasing the amount of complexity rather than introducing a new kind. AI-generated code can have subtle errors, and so can human-written code. The difference is throughput: you can make bigger changes, faster, and ship them sooner. The failure modes aren't novel; there are just more of them, and they arrive faster.

I think it's worth being precise about this, because fear tends to run ahead of reality. AI-generated code doesn't have some unique class of bugs that's harder to diagnose than human-written bugs. It has the same kinds of bugs: off-by-one errors, mishandled edge cases, incorrect assumptions about API behavior, just produced at a higher rate. The person on call at 2:00 a.m. is dealing with the same shape of problem they always have. They might just be dealing with more of it, or with code that changed more recently than they expected.

The good news is that a lot of this responds to the same things that have always worked: tests, guardrails, and telemetry. If you have a strong test suite and CI pipeline, AI-generated bugs get caught before they hit production at roughly the same rate as human-generated bugs. If you have good observability, the ones that do get through are no harder to diagnose. In fact, I've found that AI-generated code tends to be more consistently instrumented than human-written code, because you can put your instrumentation patterns in the context window and the LLM will apply them everywhere. Humans get lazy about adding spans to every new handler. LLMs don't, as long as you tell them not to.

Where teams get into trouble is when they let the speed of generation outrun their existing quality checks. If your test coverage was already thin and your telemetry spotty, AI isn't going to make that better. It's going to make it worse, faster. DORA research backs this up: AI tends to exacerbate the preexisting conditions of an organization. If your feedback loops were tight before AI, they'll stay tight. If they were loose, they'll get looser.

On deleting, refactoring, and consolidating

David Keech: AI assistants produce a lot of code and rarely delete, refactor, or consolidate. What can we do to avoid these traps?

Austin: This is a real pattern, and I think it comes down to poor feedback loops, both in terms of how code is authored and how it gets validated in production. The good news is that most of the solutions aren't AI-specific; they're just good engineering practices that become urgent when you have a very fast, very literal-minded code generator on your team.

First: focus on really good interfaces and worry less about implementations. If your service boundaries, API contracts, and module interfaces are well defined, you've done two important things. You've controlled the blast radius of any individual change—AI-generated or otherwise—and you've created natural test boundaries and optimization boundaries. A bad implementation behind a good interface is a Tuesday afternoon fix. A bad interface is a monthlong refactor. Spend your human attention on the contracts and let the machines iterate on the internals.

Second: if you have good test coverage, you can confidently hand autonomous agents tasks to improve and refactor code. This is where the upfront investment in testing really pays off with AI tools; not just as a quality check, but as an enabler. An agent that can run the test suite and see green is an agent that can safely delete dead code, consolidate duplicate implementations, and clean up after itself. Without tests, the agent (reasonably) defaults to the conservative strategy of "add new stuff, don't touch old stuff," and you end up with the bloat the question describes.

Third (and this is one I think is underappreciated): invest in inline documentation. One persistent antipattern in our industry has been not thoroughly documenting code with comments and docstrings. With AI tools, this becomes a much bigger deal. Good docstrings are more token-efficient than forcing the model to read entire implementations to understand what a function does. Hooks and other deterministic helpers can enforce that comments and docstrings get updated as implementations change. The goal is implicit context, close to where it matters. A well-documented function signature tells the model (and your future self) what something does without requiring a deep read of the code. A CLAUDE.md or equivalent rules file can encode your conventions and patterns so that the model generates code that fits your codebase rather than its own generic defaults.
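To make that concrete: a signature and docstring like this (the function itself is hypothetical) carry most of what a model needs, without forcing it to spend tokens reading the body:

```python
def reconcile_invoices(invoices, ledger):
    """Match invoices against ledger entries by (customer_id, amount).

    Returns a (matched, unmatched) pair of invoice lists.
    Does not mutate its inputs.
    """
    paid = {(e["customer_id"], e["amount"]) for e in ledger}
    matched = [i for i in invoices
               if (i["customer_id"], i["amount"]) in paid]
    unmatched = [i for i in invoices
                 if (i["customer_id"], i["amount"]) not in paid]
    return matched, unmatched
```

A model that only ever sees the first five lines knows the matching key, the return shape, and the mutation contract, which is usually enough to call it correctly.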

Bloat is a feedback loop problem. Fix the interfaces, get real test coverage, and document as you go. The accumulation problem mostly sorts itself out.

On speeding up code reviews

Jerry Saravia: How have you sped up code reviews to keep up with AI-assisted coding? How do you enforce quality?

Austin: I don't think anyone has fully solved this yet. But again, it's a feedback loop problem, and the solutions rhyme with everything else we've been talking about.

Start with AI in the review loop. Use AI review tools as a first pass—they're good at catching the mechanical stuff that eats up reviewer time: style violations, missing tests, obvious logic errors, inconsistent naming. This isn't a replacement for human review, but it means that by the time a human looks at a PR, the trivial stuff is already handled.

The second piece is keeping changes small and easy to reason about. This has always been good practice, but it becomes essential when code is being produced faster. Stacked PRs work well: they fit the natural flow of AI-assisted development. You can race to the finish with the AI and see how the whole thing works end to end, then break it up into a stack of reviewable changesets that each tell a coherent story. The person writing the code gets the fast, iterative experience they want; the person reviewing it gets changes they can actually hold in their head.

Third, invest in fast, automated quality checks. Type checking, linting, formatting, test coverage—these should all be automated and enforced in CI. If you're using something like Claude Code, you can put these checks in hooks and the agent will self-correct against your quality bar as it's making changes.

That leaves the human review to focus on what AI genuinely can't answer: does this approach make sense? Is this the right abstraction? Are we building in the right direction?

On integrating AI

Thomas van Gemert, LegalSense: What are the best, concrete first steps to start integrating AI into existing codebases?

Austin: Every AI coding demo is some greenfield React app or a brand-new API built from scratch. That's not where most of us live. Most of us live in codebases that are old, complex, full of implicit knowledge, and held together by a combination of lore and duct tape. So what do you actually do?

I've seen two approaches work.

The first is to start in a new part of the codebase—a new service, a new module, a greenfield corner—and establish good AI-assisted practices from the jump. Write a CLAUDE.md (or equivalent rules file) that describes how to build, test, and run the project. Set up your test harness, your linting, your CI checks. Get the feedback loop right in a low-risk area where you're not fighting legacy constraints, then expand outward as you build confidence.
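A rules file doesn't need to be elaborate to be useful. Something like this, with the contents entirely hypothetical and adapted to your own project's commands and conventions, is enough to start:

```markdown
# CLAUDE.md

## Build and test
- Install dependencies: `make setup`
- Run the full suite: `make test` (must be green before any commit)
- Lint and format: `make lint`

## Conventions
- Every new HTTP handler gets an OpenTelemetry span and a test.
- Public functions get docstrings: one-line summary, then inputs/outputs.
- Prefer extending existing modules over creating new top-level ones.
```

The value isn't the file itself; it's that the model reads it on every session, so your conventions stop depending on who (or what) happens to be writing the code.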

The second is to work around the edges of the existing codebase. Use AI to improve test coverage for code that's currently undertested, add instrumentation, write docstrings and inline comments. Have it explain what modules do. This builds your confidence in the tool and generates useful documentation as a side effect. Bug fixes are another great entry point: you have a clear problem statement, clear success criteria, and tests to validate the result.

Either way, the important part is nailing the feedback loop so you can make changes with confidence. That means:

  • Can the AI run the tests?
  • Can it see the results?
  • Can it validate that what it just did actually works?

If the answer to any of those is "no," fix that first. Everything else follows from being able to make a change and know, quickly, whether it was a good one.

One specific recommendation: adding OpenTelemetry instrumentation to an existing codebase is a great early task for AI. The conventions are well-defined, the scope is bounded, and you get two things at once: better visibility into your system, and better data for the AI going forward.

On junior engineer onboarding

Anon: How does reliance on AI change how we onboard junior engineers?

Austin: I actually find AI to be a great tool for onboarding, and I think the fear around it is somewhat misplaced.

The concern people raise is that junior engineers will use AI to bypass the struggle of early learning and never develop real intuition. I get it. But consider what AI actually offers a junior engineer: an inexhaustible resource that never gets tired and never makes them feel bad for asking a dumb question. That's not a small thing. Anyone who's been junior knows the anxiety of bothering a senior engineer for the third time in an hour. AI eliminates that friction entirely.

How you structure the loop matters. I think a good pattern is: have the junior engineer ask the AI first, push back on what it says, challenge things that don't make sense (ask for citations, ask for ELI5 explanations, ask why) and then take that conversation to someone more senior. The junior isn't just getting answers; they're learning to understand the shape of a system and to ask better questions. By the time they sit down with a human mentor, they've already done a first pass on the problem and can have a much more productive conversation.

This also exposes documentation gaps in a way that's useful to the whole team. When a junior engineer asks the AI about a part of the system and the AI gets it wrong because there's no documentation, that's a signal. It tells you where your explicit knowledge has gaps. Fixing those gaps helps the AI, the next junior engineer, and the senior engineers who've been relying on tribal knowledge they didn't realize was undocumented.

That said, keep human mentors actively in the picture. AI is a good first stop for questions, not a replacement for mentorship. The judgment calls (is this the right abstraction? is this how we do things here? what are the tradeoffs I'm not seeing?) still need a human. The AI gives juniors the confidence to show up to those conversations better prepared.