Fast and Close to Right: How Accurate Should AI Agents Be?

By: Austin Parker

A common question and concern I hear about AI agents has to do with accuracy. This makes sense, I think. We’re accustomed to thinking of software systems as entirely deterministic—unpredictable results are a sign of bugs or logic errors.
This thinking is heavily ingrained in the way a lot of people in the wider software industry are building AI agents. Popular frameworks mirror this sort of ‘deterministic workflow’ approach, even though LLMs are inherently nondeterministic. This nondeterminism is compounded by the very ‘loose’ nature of LLM errors, as we tend to conflate all of them into the grey goo of ‘hallucinations.’ I find this to be a bit of a thought-terminating cliché when it comes to discussing agents, because inasmuch as hallucinations are a term of art in the ML/AI space, users and builders tend to conceptualize them as bugs. Unintended, undesired, something to be squashed with just one more ‘IMPORTANT: NEVER DO THIS’ remark in a prompt.
In this blog, I’d like to discuss why hallucinations aren’t the biggest problem in observability agents, the tradeoffs around data fidelity and task accuracy inherent in agent and tool design, and how to evaluate agentic capabilities as they apply to observability.
Compounding loss and error in observability
It often goes unremarked that telemetry data about a system is inherently a lossy encoding of system state. The different signals that we encode events into, be they metrics, log events, or traces, are all forms of structured data that abstract away potentially thousands of discrete system events so that we can monitor and analyze system behavior at scale. Just because this data is compressed doesn’t mean it isn’t useful, mind you. The number of discrete things that happen on a server just to handle a single HTTP request is staggering, and the vast majority of them aren’t interesting (or if they are, they’re not interesting in a way that we can cheaply determine at the time they’re generated). We see this as we move up the stack, away from hardware and into application code and logic. The things that impact user satisfaction with a website are often not any one specific function or line of code, but a combination of factors, like unoptimized loading or caching, a critical mass of slightly underperforming SQL queries, etc.
It’s likely that you realize this instinctually. Many developers prefer time-series metrics for application state specifically because of this lossiness—why bother creating medium-fidelity data (such as spans) for high-level performance (API performance by route) when the stuff that’s gonna help you figure out what’s happening in prod (low-level system or application logs) isn’t getting aggregated anyway? This works until it doesn’t: your pre-aggregations turn out to be too brittle to cope with functional changes to your app, or too general to pick up user-impacting incidents, and you’re back to the drawing board.
The second important realization is that the act of querying and alerting on telemetry is itself also lossy, even with metric signals. Temporal alignment will shift points around to make them fit your prescribed time ranges, and log and trace aggregation may drop events from slow-responding nodes in a distributed telemetry database, or events may simply be filtered out by query limits. Selective views into the overall system are the norm, not the exception. The law of large numbers lets us feel safe that these incomplete views still give a representative sample of the whole, however, and we operate our systems without truly thinking about the incomplete views we’re dealing with.
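To make that concrete, here’s a toy sketch (not how any particular backend implements alignment) of how bucketing raw timestamps into fixed windows discards detail, and how events from a slow node can drop out of the picture entirely:

```python
from collections import Counter

# Toy example: raw events carry exact timestamps, but a query only sees counts
# aligned to fixed 60-second buckets. The alignment step throws away sub-bucket
# timing, and late or dropped events never show up in the result at all.
raw_events = [12.4, 13.1, 58.9, 61.0, 119.9, 121.3]  # seconds since query start
late_or_dropped = [62.5]  # e.g. from a slow-responding node; excluded entirely

bucket_size = 60
aligned = Counter(int(ts // bucket_size) * bucket_size for ts in raw_events)

print(dict(aligned))  # {0: 3, 60: 2, 120: 1} -- six events become three bucket counts
```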
Adding AI to this compounding loss is a hell of an accelerant, especially when you take into account the obsequiousness of chat models like GPT-5 or Claude. The way these models are trained biases them towards confidence rather than cautiousness, and it’s a constant battle to keep them aligned with operator expectations. A great deal of anxiety involves hallucinations, the tendency for language models to output plausible but incorrect text. Seasoned SREs will recoil in horror at the idea of a confabulation or a misinterpreted alert leading to nodes being spun up or torn down, and it’s not an unreasonable fear.
Midnight in the garden of implicit and explicit knowledge
What I’ve observed in my time building AI observability at Honeycomb is that much of the anxiety around agentic systems springs from the liminal space between what we know and what we think everyone knows. One of the most common failure modes I’ve observed with users of our AI tools is that they’ll ask the system for an answer that is contained within the data, but that answer is only interpretable if you have access to information that exists outside of the data. I’m going to call this the ‘dog problem’ because at Honeycomb, our services are all named after dogs. 🙂 It’s cute, right? Very startuppy.
Do you think Claude understands the joke behind ‘basset’, ‘retriever’, or ‘newf’? If I explained the references, probably. It’s a language model, its whole thing is understanding the relationships between text. You, too, have a dog problem. Your services are quite possibly a mix of exciting names and acronyms that make a lot of sense if you work with them every day. But the machine cares little for your whimsy, and your jokes better be real good or else you’re just making its life harder.
Dog problems abound when it comes to telemetry because it is so difficult to measure implicit knowledge. It’s a fun game to try, though! Get four other engineers and sit down at a table and start asking questions about the system, about where to find answers to things, about who owns what, or what function some component serves. Once you’ve done this for about 15 minutes, divvy up the list of all the questions and answers and try to find documentation about them. My guess is that you’re gonna find a lot of stuff that isn’t written down, and for the stuff that is, you’re gonna find documentation that isn’t up to date. For bonus points, generalize this to telemetry data. Try to figure out how many of your metrics or logs are well-documented and have some sort of schema to them. Are your attributes normalized across every service? Worse, do you have attributes that mean the same thing but with different names? Alternately, do you have attributes with the same name that mean different things?
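To make the telemetry version of the dog problem concrete, here’s a hypothetical pair of spans from two services that record the same customer identifier under different names (the attribute names and values here are invented for illustration):

```python
# Hypothetical span attributes from two services that both record a customer
# identifier -- same meaning, different names. Nothing in the data itself tells
# an agent (or a new teammate) that these fields should be joined.
checkout_span = {"service.name": "basset", "customer_id": "cus_1234", "http.status_code": 500}
billing_span = {"service.name": "retriever", "account.id": "cus_1234", "statusCode": "500"}

# A naive group-by on "customer_id" silently misses half the picture:
spans = [checkout_span, billing_span]
matched = [s for s in spans if s.get("customer_id") == "cus_1234"]
print(len(matched))  # 1, not 2 -- the knowledge that account.id means customer_id is implicit
```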
Truly getting value out of AI for observability means managing the tradeoff between explicit and implicit knowledge. It’s unrealistic to expect you to refactor your knowledge base, version control system, code comments, telemetry systems, and whatever else you can think of in order to prioritize explicit knowledge. I find that organizations know this in a somewhat oblique way, mostly demonstrating it through practice. We ‘know’ that small, tight-knit teams can ship quickly in the same way that we ‘know’ large organizations are slower and more ponderous. Conway’s Law, the coordination penalty: all of these concepts are really just names for the time and energy it takes to convert implicit knowledge into explicit knowledge.
LLMs may be very smart, but they are not psychic, and they tend to make knowledge transfer harder. I find AI-written text to be the visual and mental equivalent of a spherical cow on a frictionless plane: it is unmoored from context, passion, or investment. It has a tell. Crucially, it tends to omit or elide important details and winds up saying less in more words than a short conversation or a few Slack messages would. This is a long way of saying that if you hope AI can fix the problem of ‘not having enough explicit context’ by turning your unstructured knowledge into accurate knowledge bases, you’re gonna have a bad time.
Small mistakes compound, but it might not matter
If we’ve established that hallucinations are inevitable, and that it’s nearly impossible to ensure flawless, accurate documentation, you might conclude that trusting an AI agent to answer observability questions for you, or to manage a system, is a non-starter. This is where I would say that you’re incorrect. It’s unintuitive, but our evaluations bear it out: LLMs that are only ~60% accurate on narrow data extraction tasks can have 100% success rates on more complex investigatory tasks.
Let’s dig into this by being more precise. When I say ‘narrow data extraction tasks,’ I mean something like this: give an LLM the results of an observability query (e.g., the count of requests over a two-hour period, grouped by HTTP route) and ask it factual questions about the data. Things like, “Which route had the most requests?” or “What was the difference in request volume between the top two endpoints?” These are straightforward questions that can be answered with a single number or a true/false, though they often require some level of calculation or generalization. The model has no information beyond the fact that it’s an observability agent answering observability questions. It doesn’t receive a schema for the query results. It just gets some light instructions, the result, and the question. Our evaluations show a success rate on these sorts of questions between roughly 55% and 90%, depending on the model. More powerful models perform better.
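For a sense of what one of these checks could look like, here’s a minimal sketch; the prompt wording, the `call_llm` placeholder, and the data are illustrative, not our actual harness:

```python
# Sketch of a narrow data-extraction eval: hand the model a query result and a
# factual question, then check its answer against ground truth. `call_llm` is a
# placeholder for whatever model client you use.
query_result = [
    {"http.route": "/api/checkout", "COUNT": 4210},
    {"http.route": "/api/search", "COUNT": 1893},
    {"http.route": "/api/login", "COUNT": 640},
]
question = "Which route had the most requests?"
expected = "/api/checkout"

prompt = (
    "You are an observability agent. Answer the question using only this data.\n"
    f"Data: {query_result}\nQuestion: {question}\nAnswer with the value only."
)

def grade(answer: str) -> bool:
    # Exact-match grading keeps the task 'narrow': one right value, no partial credit.
    return answer.strip() == expected

print(grade("/api/checkout"))  # True
# success = grade(call_llm(prompt))  # pass rate over many such cases is the eval score
```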
Our other set of evaluations involves giving an agent with tools (our MCP server) access to a known set of data with known problems. The model does not necessarily have access to all tools; we want it to actually investigate instead of ‘cheating’ and looking at other results. It gets a prompt that we base on our canned demo (if you’ve ever been to a Honeycomb booth, you’ve seen this before) and we let it go off to figure out why the problem is happening. Results are graded against a set of objective criteria: the agent has to identify the root cause, what service it's in, and why. Coming up with this answer requires several discrete steps and queries, and the ability to pivot from looking at where the problem presented itself to where it started. These evaluations succeed almost every time. They’re remarkably consistent, in fact! Some of this is due to the demo data being preternaturally ‘clean’ and accurate—it is demo data, after all—but if models have such a high failure rate on basic tasks, you’d expect the agent to fail at least sometimes on more complex tasks, right?
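The grading for these investigatory runs looks more like a rubric than an exact match. A hypothetical version (the criteria, root cause, and service name below are made up for illustration, not our real grading code) might be:

```python
# Hypothetical rubric for an investigation eval: the agent's write-up passes only
# if it hits every criterion, regardless of which queries it ran to get there.
rubric = {
    "identifies_root_cause": lambda report: "cache stampede" in report.lower(),
    "names_owning_service": lambda report: "basset" in report.lower(),
    "explains_mechanism": lambda report: "timeout" in report.lower(),
}

def grade_investigation(report: str) -> bool:
    return all(check(report) for check in rubric.values())

print(grade_investigation("The basset service caused a cache stampede, leading to timeouts."))  # True
```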
What I’ve come to understand is that it doesn’t necessarily matter if the small things fail, because agents can route around the failure. They do this in a way that is often oddly human. Models will frequently mess up a little on math and be off by one when doing sums or other arithmetic. It turns out this doesn’t really matter, because most interesting things (outliers, etc.) are at least two orders of magnitude larger than their peers. It’s pretty easy to spot true outliers! Another underappreciated point here is that agents are good at checking their work. If they make a mistake, take a step forward, and get conflicting answers, they’ll often reevaluate their approach or try a different way until the situation becomes clearer. This is an inherent advantage of agent loops and reasoning models: they have the ability to self-correct. Finally, agents are very good at exploring a problem space. The ability to quickly query across multiple sources and services, then compare the results, means that the agent is going to toss out a lot of incorrect hypotheses—but that ultimately doesn’t matter, because it can be wrong and move on to the next thing.
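One way to see why per-step accuracy doesn’t doom the whole investigation: here’s a back-of-the-envelope model (illustrative numbers, not measured error rates) of five chained steps, with and without the ability to notice a bad result and retry:

```python
# Back-of-the-envelope: five chained steps at 60% accuracy look hopeless if every
# error is fatal, but if the agent can detect a bad result and retry a couple of
# times, the effective per-step accuracy -- and the whole chain -- recovers.
step_accuracy = 0.60
steps = 5

naive_chain = step_accuracy ** steps          # every error is fatal
retried_step = 1 - (1 - step_accuracy) ** 3   # up to 3 attempts per step
corrected_chain = retried_step ** steps

print(f"{naive_chain:.2f}")      # ~0.08
print(f"{retried_step:.2f}")     # ~0.94
print(f"{corrected_chain:.2f}")  # ~0.72
```

This assumes errors are independent and detectable, which real investigations don’t guarantee, but it captures the shape of the argument: self-correction changes how errors compound.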
What matters for using AI with observability?
I remain convinced that we’re still in the very early days of productized AI for observability purposes. There are quite a few people selling advanced capabilities that I don’t think are viable in the real world. If your agent does great with high resolution, highly commented, and highly annotated data, that’s great! Almost nobody has that. If your agent promises general solutions to not just domain-specific but organization-specific problems without the ability to deeply customize it or provide additional context, then you’re probably not being honest with yourself. Problem solving generalizes, and problems aren’t novel, but the way they present is. Every happy family is alike, as they say.
Think of your AI-powered tools as an extension of your existing capability, not a replacement for it. Yes, it does enable you to do more with less; it allows individuals to have an outsized impact. It’s a force multiplier. Good AI adoption has a lot more to do with figuring out the parts of your organization or engineering strategy that aren’t working and making those things work better. AI is an accelerant, as I said above. The DORA report agrees; it notes that AI tends to exacerbate the preexisting conditions of an organization. If you’re slow, AI won’t make you fast; it’ll make you feel fast while compounding your problems. If you’re fast, it won’t make you good; you’ll eventually collapse under the complexity of a bunch of disconnected, unlearning, overconfident agents.
Organizations that successfully adopt AI-powered observability are doing two things. First, they’re finding out where AI is failing and using that as a signal to improve explicit knowledge, which not only helps the AI, but also helps humans. Second, they’re not replacing their existing expertise and systems; they’re using AI to expand those human capabilities into other parts of the business. AI may not be great at understanding your system telemetry, but if you can teach it, you can allow your customer success and field engineering teams to self-serve a lot of questions that would otherwise have required an engineer’s time. Your sales team can start to appreciate how performance really impacts their customers. Executives can get real visibility into the cost of poor performance in terms of conversion rates or abandoned carts, and conversely, see the impact of prioritizing reliability in dollars and cents.
Leverage AI-powered observability with Honeycomb Intelligence
Learn more about Honeycomb MCP, Canvas, and Anomaly Detection.