
Honeycomb + Google Gemini

By Jessica Kerr | Last modified on April 9, 2024

Today at Google Next, Charity Majors demonstrated how to use Honeycomb to find unexpected problems in our generative AI integration.

Software components that integrate with AI products like Google’s Gemini are powerful in their ability to surprise us. Nondeterministic behavior means there is no such thing as “fully tested.” Never has there been more of a need for testing in production!

Honeycomb is all about helping you find problems you couldn't find before, problems you didn't imagine could exist: the unknown unknowns. Charity showed how this applies to integrating with Gemini.

What’s the problem?

In the Honeycomb product, people can ask questions in natural language and get a Honeycomb query and result. We call this feature Query Assistant.

[Screenshot: Query Assistant, typing in a query]

People type in stuff like “Which users have the highest latency?” and get a graph and table. This comes with a Honeycomb query they can modify.

[Screenshot: the graph and table result of a query]

Query Assistant turns an engineer’s question into a valid Honeycomb query—most of the time. Is that good enough?

How good is the solution?

With Honeycomb, we can measure whether the feature is working. We have instrumented the entire operation: from the moment the user hits enter, through our RAG pipeline, programmatic prompt construction, the AI model call, and response parsing and validation, to query construction and submission to our query engine. There's a lot that can go wrong, even without the nondeterminism of AI!
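As a rough sketch of what that instrumentation can look like (this uses OpenTelemetry for Python; the app.nlq.* field names and the build_prompt/call_model helpers are illustrative, not Honeycomb's actual code):

```python
import json

from opentelemetry import trace

tracer = trace.get_tracer("query-assistant")

def build_prompt(user_input: str) -> str:
    # Placeholder for RAG retrieval + programmatic prompt construction.
    return f"Turn this into a Honeycomb query: {user_input}"

def call_model(prompt: str) -> str:
    # Placeholder for the Gemini call; returns a canned response here.
    return '{"breakdowns": ["user.id"]}'

def answer_question(user_input: str) -> dict:
    # One parent span wraps the whole operation; a child span covers each
    # stage, so a failure anywhere shows up as attributes on the trace.
    with tracer.start_as_current_span("query_assistant_request") as span:
        span.set_attribute("app.nlq.user_input", user_input)

        with tracer.start_as_current_span("build_prompt"):
            prompt = build_prompt(user_input)

        with tracer.start_as_current_span("call_model") as model_span:
            response = call_model(prompt)
            model_span.set_attribute("app.nlq.response", response)

        with tracer.start_as_current_span("parse_and_validate") as parse_span:
            try:
                return json.loads(response)
            except json.JSONDecodeError as err:
                parse_span.set_attribute("error", str(err))
                raise
```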

All of this rolls up into a single Service Level Objective (SLO) we call “Query Assistant Availability.” It tells us that Query Assistant produces a valid query 95% of the time.

[Screenshot: SLO compliance in Honeycomb]
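Conceptually, an SLO like this is just a boolean per event plus a target. A minimal sketch of the arithmetic, with a hypothetical SLI (in Honeycomb, the SLI is a derived column evaluated on each event):

```python
def sli(event: dict) -> bool:
    # An event counts as good if Query Assistant produced a valid query.
    # The field name here is hypothetical.
    return event.get("app.nlq.valid_query") is True

def compliance(events: list[dict]) -> float:
    # Fraction of good events; assumes a nonempty window.
    return sum(sli(e) for e in events) / len(events)

# "Query Assistant Availability" holds when compliance(events) >= 0.95
# over the SLO's time window.
```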

But what about the other 5%?

Where does it go wrong?

Charity scrolls down on the SLO screen to the view where the magic really happens. We call this the SLO BubbleUp View.

BubbleUp looks at every Query Assistant request, compares every single dimension of the requests that violate the SLO against the baseline events that satisfy it, and sorts by the size of the difference so the biggest differences come to the top.

[Screenshot: BubbleUp dimensions, a wall of small graphs]

This answers the question: “What is different about the ones I most care about?”
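Under the hood, the comparison is conceptually simple. Here's a toy sketch of a BubbleUp-style ranking (not Honeycomb's implementation): score each dimension by how differently its values are distributed in the failing events versus the passing baseline, then sort.

```python
from collections import Counter

def bubbleup(failing: list[dict], passing: list[dict]) -> list[tuple[str, float]]:
    """Score each dimension by how differently its values are distributed
    in SLO-violating events vs. the passing baseline, highest first."""
    dimensions = {key for event in failing + passing for key in event}
    scores = []
    for dim in dimensions:
        fail_counts = Counter(event.get(dim) for event in failing)
        pass_counts = Counter(event.get(dim) for event in passing)
        values = set(fail_counts) | set(pass_counts)
        # Total variation distance between the two value distributions.
        distance = 0.5 * sum(
            abs(fail_counts[v] / len(failing) - pass_counts[v] / len(passing))
            for v in values
        )
        scores.append((dim, distance))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

A field like Error, whose values in failing events look nothing like the baseline's, rises straight to the top.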

Charity sees that “Error” is one of the most different fields, and there are several different values. She mouses over one of them. It says “unexpected end of JSON input.”

[Screenshot: the error value "unexpected end of JSON input"]

Charity clicks to add this field as a filter, and now she has a graph of only requests with this error.

[Screenshot: a graph of only requests with this error]

Now, she’d like to see the end of that JSON input. She types something like “also show the user input and response” into Query Assistant.

Query Assistant usually adds the right fields to the GROUP BY, and now they appear in the table. There’s something suspicious about the end of that app.nlq.response value.

[Screenshot: the suspicious end of the app.nlq.response value]

That response is truncated! Indeed, the JSON has ended unexpectedly.
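To make that concrete, here's a sketch of what parsing a cut-off response looks like. The query body is a made-up, Honeycomb-style example, and the exact error text depends on the parser ("unexpected end of JSON input" is Go's encoding/json wording; Python's json module phrases it differently):

```python
import json

# A response cut off mid-generation by an output token limit:
truncated = '{"breakdowns": ["user.id"], "calculations": [{"op": "P99", "col'

try:
    json.loads(truncated)
except json.JSONDecodeError as err:
    print(err)  # names the spot where the JSON breaks off
```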

What’s our limit on output tokens? Charity asks Query Assistant “also group by the output config.”

 

Looks like our configuration allowed just 150 tokens from the LLM, and clearly that's not enough!

[Screenshot: results grouped by output config, showing a 150-token limit]

That seems like an issue Charity should take to the team.
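The eventual fix is likely a one-line configuration change. Here's a sketch using Google's google-generativeai Python SDK; the model name, prompt, and new limit are illustrative, not Honeycomb's actual values:

```python
import os

import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

model = genai.GenerativeModel("gemini-pro")
response = model.generate_content(
    "Turn this question into a Honeycomb query: which users have the highest latency?",
    generation_config=genai.GenerationConfig(
        max_output_tokens=1024,  # the demo surfaced a 150-token limit, too small for a full query JSON
        temperature=0.0,
    ),
)
print(response.text)
```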

This is observability.

This flow shows just how important observability is when you’re building with generative AI. We found this problem to fix, but new ones will surely come up.

When you can use SLOs to identify something you find interesting, then slice and dice and explore your data, you can quickly get to the bottom of any issue—even if it's totally new and unknown.

With Honeycomb, you can develop AI applications with confidence, in Google Cloud or any other environment. Try it today for free.
