Conference Talk

Releasing and Debugging Software in Production with Honeycomb



Liz Fong-Jones [Principal – Developer Advocate|Honeycomb]: 

Hi. I’m Liz. I’m an engineer at Honeycomb. Today, I would like to show what Christine means when she talks about designing for curiosity with features that are fast and fit both developer and operational needs.

We’re going to look at what it means in practice to reimagine the workflow of releasing software to production. I’m going to show you how we release software at Honeycomb, and how we debug it, and how that differs from other places I have worked before. 

Because, in a previous life, we did big bang releases. We only shipped software about once a month in big batches to production. And we wrestled with having many different release branches. We spent hours rolling back any time misbehavior turned up in prod, and we struggled to diagnose performance problems such as individual users seeing slow or erroring requests. 

Now, it’s true that companies have gotten better at performing small batch releases more frequently, but moving faster doesn’t necessarily translate to more safety. There’s often no real confidence that deployments are truly production ready, and engineers don’t really have a good understanding of how their changes are going to behave in production. So we’re packaging up and shipping changes to production often and automatically, but the net result of unreliability and distrust of production is the same. If issues happen once a feature reaches production, with older tools, it’s difficult to diagnose why they are occurring, and rolling back is often the only way to fix the problem. 


But, fortunately, we’re able to move much faster and more reliably at Honeycomb. We practice continuous delivery, and we integrate observability into our entire software development life cycle. We call this observability driven development. 

So today I would like to show you how I modify, release, and debug new Honeycomb features we’re shipping in production by using a second copy of Honeycomb. I’m going to show you how Instrumentation, Build, Service Level Objective, Querying, Metrics, and Collaboration Features all combine together to make for a fast and reliable developer experience that empowers our developers to have curiosity and understanding. 

Let’s show you Honeycomb’s Query Data API. It’s a new feature being released today that allows you to run Honeycomb queries programmatically and obtain results as data that you can integrate into your workflows any way you see fit. 

For this demo, I would like to show you how I can make modifications to that new feature. Let’s suppose I want to improve the freshness of results for beta customers of the Query Data API. Let’s go ahead and cache the results for less time so they can get fresh data every 10 seconds, not every 10 minutes. And let’s allow them to do that not for just 24 hours of data but for an entire month at a time.

So we’re changing these parameters and we’re adjusting the handlers that handle querying, rate limiting, and caching. We’re going to kick off a pull request that will test this change before it is released into production. And then, CircleCI will go ahead and start building my software. Now I can follow what’s happening inside of the CircleCI UI, and that will give me the web of dependencies, but it doesn’t tell me what happened when. So, instead, let’s look at the build job as a trace inside of Honeycomb.

When I look at that view, it enables me to understand why my build is unusually slow and what the slowest part was. This is something that Honeycomb Build Events enables. For us, it integrates through the CircleCI orb, but you can also use GitHub Actions or just run a shell command inside of your build pipeline. 

So once my build has finished, I can go ahead and think about releasing it to production by clicking on the merge button. But, wait a second. Let’s make sure it’s actually production ready. At Honeycomb, what this means is I want to have enough telemetry baked in so I know how my production service’s behavior is changing and which users are impacted. If the performance is negatively impacted for any reason, I need to know what I’m looking for inside of the prod code.


So let’s review how we measure and understand success. With OpenTelemetry, we’ve added custom instrumentation fields like cache hit or miss and team and dataset IDs, and we have inner spans for every individual function call that might take a lot of time. And this is all in addition to the automatic instrumentation that OpenTelemetry adds on every API request.
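
As a rough illustration of what those custom fields and inner spans look like, here is a library-free sketch in Python. The `span` helper, the field names (`cache.hit`, `team.id`, `dataset.id`), and the `run_query` handler are all hypothetical stand-ins, not Honeycomb’s actual code. In a real service the OpenTelemetry SDK would supply the span object and export it.

```python
import time
from contextlib import contextmanager

# Finished spans are kept in a list here; a real tracer would export them.
FINISHED_SPANS = []


@contextmanager
def span(name, parent=None):
    """Hypothetical stand-in for a tracing span: times a block and holds fields."""
    record = {"name": name, "parent": parent, "fields": {}}
    start = time.perf_counter()
    try:
        yield record
    finally:
        record["duration_ms"] = (time.perf_counter() - start) * 1000.0
        FINISHED_SPANS.append(record)


def run_query(team_id, dataset_id, cache):
    """Hypothetical handler: one outer span per request, one inner span for the slow part."""
    with span("query_data_api") as outer:
        # Custom fields that later let us slice by customer and cache behavior.
        outer["fields"]["team.id"] = team_id
        outer["fields"]["dataset.id"] = dataset_id
        key = (team_id, dataset_id)
        hit = key in cache
        outer["fields"]["cache.hit"] = hit
        if not hit:
            # Inner span around the call that might take a lot of time.
            with span("execute_on_storage", parent="query_data_api"):
                cache[key] = {"rows": 0}  # placeholder for the real work
        return cache[key]
```

The point of the sketch is the shape of the data: every request carries who it was for and whether it hit the cache, so a later question like "whose queries stopped being cached?" can be answered from the telemetry alone.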

These bits of telemetry tell me how the caching performance has changed and who that change is impacting. Additionally, we also have service level objectives set for all of Honeycomb as well as for this specific API service. So we’ve defined what the success criteria are for our beta customers who are using this new query data functionality. 

In this particular case, we would like Query Data API results to return in less than two seconds. If results take longer than two seconds to return, that’s a bad experience for Honeycomb customers, and we want to count those queries against our error budget. Remember, slowness is the new downtime. 
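
To make the idea of counting slow queries against an error budget concrete, here is a minimal sketch in Python. The two-second threshold comes from the talk; the 99% target, the function names, and the sample data are assumptions for illustration only.

```python
def sli_good(duration_ms, threshold_ms=2000.0):
    """An event is 'good' if the query returned in under the threshold."""
    return duration_ms < threshold_ms


def budget_remaining(durations_ms, target=0.99):
    """Fraction of the error budget left over a window of events.

    `target` is an assumed SLO target, not a figure from the talk.
    1.0 means nothing has been burned; 0.0 or below means the budget is gone.
    """
    total = len(durations_ms)
    if total == 0:
        return 1.0
    bad = sum(1 for d in durations_ms if not sli_good(d))
    allowed_bad = (1.0 - target) * total
    return 1.0 - bad / allowed_bad


# Example window: 996 fast queries and 4 that took longer than two seconds.
window = [500.0] * 996 + [2500.0] * 4
remaining = budget_remaining(window, target=0.99)
```

With a 99% target, 1,000 events allow roughly 10 bad ones, so 4 slow queries leave about 60% of the budget.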

Of course, we’ve also zoomed out and looked at all of our service level objectives just to make sure everything is in a good state before we push any releases. 

As a developer, I’ve clearly defined success and failure criteria for how my changes will impact production. I know where in my code I can look if I encounter any problems and I understand the state of production before my change is introduced. That is what it means to be production ready. 

At Honeycomb, it’s my responsibility to watch how our changes impact users in production. When I ship, I’m expected to look at production behavior alongside whoever is on call from Engineering. Our observability driven approach really helps us with being able to understand code changes as they’re being rolled out with the help of great tooling like Honeycomb. 

Now, I can hit the merge button and make sure the change is built within 10 minutes, and it automatically gets shipped to all of our environments within an hour. Let’s go ahead and wait an hour and see what happens. Well, that’s less good. Within an hour of my change shipping, the reliability of the Query Data API has really taken a nosedive. That’s unfortunate. The good news is that we found out before we exhausted our error budget because we got a burn alert that told us proactively. So the engineer on call and I both started looking, because they are on call for the service as a whole and I’m responsible for watching my specific code. 
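
The talk does not describe how the burn alert is computed, so the following is only a sketch of one common approach: extrapolate the recent burn rate linearly and alert if the budget would run out soon. The function names, the four-hour alert window, and the sample numbers are assumptions.

```python
def hours_until_exhaustion(budget_left, burned_in_lookback, lookback_hours):
    """Linearly extrapolate recent burn; returns None if nothing is burning.

    budget_left and burned_in_lookback are fractions of the total error budget.
    """
    if burned_in_lookback <= 0:
        return None
    burn_per_hour = burned_in_lookback / lookback_hours
    return budget_left / burn_per_hour


def should_alert(budget_left, burned_in_lookback, lookback_hours,
                 alert_within_hours=4.0):
    """Fire when the projected exhaustion time falls inside the alert window."""
    eta = hours_until_exhaustion(budget_left, burned_in_lookback, lookback_hours)
    return eta is not None and eta <= alert_within_hours


# Half the budget is left and a quarter of it burned in the last hour.
eta = hours_until_exhaustion(0.5, 0.25, 1.0)
```

An alert of this shape fires before the budget is gone, which is what gives the on-call engineer and the change author time to investigate rather than react to an outage.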


So we can see there’s a dip in availability, and Honeycomb points out what factors are contributing to unavailability: which properties are shared between all the failing requests that happened after my code was released to production. I want to understand “why did my code fail?” and “why didn’t this turn up in pre-prod?”

Honeycomb’s integrated BubbleUp feature helps us understand what keys and what part of the key space is broken. You can see a few things. First of all, the heatmap lets us see that the majority of queries are still faster than two seconds, but some are slower. That’s a bad customer experience, so we consider each of those a failed query.

We can see the slow queries come from three specific partitions and from one or two specific build IDs, and they all have a high number of results being returned, that is, a high number of groups. This helps me drill down and understand what the slow queries have in common, and how I might be able to stop this SLO burn from happening. 
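
The idea behind this kind of comparison can be sketched in a few lines of Python: split events into the failing selection and the baseline, then rank field values by how over-represented they are in the selection. This is not BubbleUp’s actual algorithm, which the talk does not describe; the function names, field names, and sample events are hypothetical.

```python
from collections import Counter


def value_shares(events, field):
    """Share of events carrying each value of `field`."""
    total = len(events)
    if total == 0:
        return {}
    counts = Counter(e.get(field) for e in events)
    return {value: count / total for value, count in counts.items()}


def bubble_up(events, is_outlier, fields):
    """Rank (difference, field, value) by over-representation among outliers."""
    selected = [e for e in events if is_outlier(e)]
    baseline = [e for e in events if not is_outlier(e)]
    findings = []
    for field in fields:
        sel = value_shares(selected, field)
        base = value_shares(baseline, field)
        for value, share in sel.items():
            findings.append((share - base.get(value, 0.0), field, value))
    findings.sort(key=lambda f: f[0], reverse=True)
    return findings


# Six fast queries spread over partitions, two slow ones both on "p3".
events = (
    [{"partition": p, "build_id": "b1", "duration_ms": 100.0}
     for p in ["p1", "p1", "p2", "p2", "p3", "p1"]]
    + [{"partition": "p3", "build_id": "b1", "duration_ms": 3000.0}] * 2
)
ranked = bubble_up(events, lambda e: e["duration_ms"] >= 2000.0,
                   ["partition", "build_id"])
```

In this toy data the build ID is identical everywhere, so it scores zero, while partition `p3` stands out because every slow query shares it.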

Let’s drill down even further and get a record of this customer’s queries using the Query Data API on the specific dataset. Now, as you can see, there’s only maybe half a dozen customers here with early access to the Query Data API, but I can just as easily apply the same query across all Honeycomb queries being run across tens of thousands of datasets. It doesn’t really matter. Honeycomb can query and group by fields of arbitrarily high cardinality, and query across arbitrarily many of them as well.

Let’s go ahead and look at the performance for the specific dataset, and let’s also maybe zoom out and group by dataset ID to have a look at all the datasets together. Now that we’ve examined how recently this behavior started happening, let’s use Honeycomb’s new Time Comparison feature, which allows us to understand how this customer’s queries and all the queries against this API have performed day on day and week on week. Is it that the customer is ramping up and this is normal behavior during the weekday? Or have they suddenly started sending us more queries than normal? 


In this case, we can see the customer with the slow query performance is suddenly flooding us with a lot of queries, and we’re not getting cached results. We’re doing work every time. We’re able to compare day on day and week on week to see what’s happening. Just to make sure it’s not just this one customer, let’s also have a look and see what’s going on with the other customers. 

I’m going to note that there are a lot of other customers on this graph, and we may not necessarily want to look at all of them because there are a few customers here that have sent one query in the past week, and then they’ve gone away. They have not really sent us another query using the Query Data API. So what if I could declutter my graph and get rid of all those things that don’t necessarily matter in this particular investigation? 

That’s where it will be helpful to use Honeycomb’s new HAVING clause feature. It lets us focus on the relevant time series by removing that clutter and showing only the groups that meet a condition, for instance groups that actually succeeded or ate into the error budget more than a couple of times. I’m going to go in and set the HAVING clause to show only groups having a count greater than two. This is something you couldn’t do before. Now you can filter not just individual events but also groupings of events. Doesn’t that graph look a lot cleaner? 
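
The difference between filtering events and filtering groups can be shown with a small Python sketch. The function names and sample data are hypothetical; the "count greater than two" condition is the one used in the demo.

```python
from collections import defaultdict


def group_count(events, group_field):
    """GROUP BY group_field with a COUNT aggregate."""
    counts = defaultdict(int)
    for event in events:
        counts[event[group_field]] += 1
    return dict(counts)


def having_count_gt(groups, minimum):
    """HAVING COUNT > minimum: applied to the groups, after aggregation."""
    return {key: count for key, count in groups.items() if count > minimum}


# Datasets "b" and "d" sent only a query or two and clutter the graph.
events = (
    [{"dataset_id": "a"}] * 5
    + [{"dataset_id": "b"}] * 1
    + [{"dataset_id": "c"}] * 3
    + [{"dataset_id": "d"}] * 2
)
busy = having_count_gt(group_count(events, "dataset_id"), 2)
```

A plain filter could not express this, because no single event knows how many other events share its dataset. The condition only makes sense once the groups have been counted.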

Now we can see a couple of customers have been querying us and still been seeing successes, but only one customer is seeing consistently slow performance and is sending consistently more queries. We can confirm this hypothesis by going into BubbleUp and highlighting a specific area that I want to look at rather than trusting Honeycomb to only show me the queries that have a failed service level indicator. 

That enables me to do my own exploration and digging, while Honeycomb, with its machine smarts, hints at what fields I might want to look at. But let’s also share what I know as a human with the rest of the team by sharing the query, so my teammates can see this behavior in their query history. By doing that, I’ve saved the results in a shared lab notebook so we can all look together and see what queries we’ve run and which have been annotated with titles. 

Then I can go in or my teammate can go in, for instance, my teammate who’s on call, and they can see in team-saved queries what queries I’ve run and named. So this helps us, as a team, debug issues faster because we’re all able to understand what’s going on in production together. 


But let’s go ahead and dive in a little bit more and have a look at an individual trace exemplifying the slow behavior. We can see here this is a query that’s not just buffering and sitting in a queue for two seconds. We’re actually spending two seconds of time in AWS Lambda as well as in the parent process of Retriever, which is our storage engine. So this user is not just hammering us with queries. The requests are taking a lot of time, and that time is spent actually doing the computation. So we should probably verify that the system, as a whole, is behaving well. We are going to need to go back and have a look at the system holistically. 

Let’s go ahead and have a look first at which hosts are most negatively impacted by this change. We can see here that some of the hosts are perfectly healthy, but there are three particular hosts seeing a high number of errors being returned and high latency. We could filter to that individual host name as well as being able to filter by dataset and so forth. But, in this case, I would like to show the metrics for my entire system together. 

We can see the CPU of all the Retriever workers, and the memory of all these Retriever workers, as well as a metric from AWS, the current number of concurrent Lambda executions. I can go ahead and do something like roll over an individual line to see what the behavior is of that individual host. I can also dive in and get an understanding of how that host’s behavior differs from the other hosts just by rolling over and comparing. But if I need to see things in more detail, I can, for instance, click on any of these graphs and see the graph blown up to full scale. It’s not just Lambda executions. I could have picked any number of CloudWatch metrics that I wanted to plot here that are automatically ingested into Honeycomb. 

Let’s filter to just this individual host that’s behaving slowly. Let’s go ahead and apply that query filter, specifically to that host, to understand what its CPU utilization is. So the filters and groups are reflected here, and that allows me to quickly understand everything that’s happening all on one screen. 

So overall we can see that since we deployed the shorter caching, each of these queries is actually running, and burning CPU on the host and on the Lambdas. That means we need to do something to address this because this one customer is really slamming us in production, and we didn’t anticipate or see this in pre-production. 

So now is an opportunity for me to go back to my SLO and try to put a stop to this behavior before we burn through all of our error budget. So this set of users is having a hard problem. Let’s go ahead and go into LaunchDarkly and turn off this user so that it stops impacting production. Let’s flag this user off for now so that way we’ll stop burning through our error budget, and we’ll be in a lot better shape. So looking at this change, you would think it would be a simple change, but it turned out to have large ramifications on my entire production system. 


So that means I need to go ahead and make sure I’m mitigating the impact of what I do, and that way I can go ahead and take my own time fixing the overall issue. So now that we’ve, you know, mitigated the issue by turning off this user in LaunchDarkly, now we can go ahead and think about what we’re able to do using Honeycomb. 

Honeycomb enabled us to catch this issue before it burned through our entire error budget and allowed us to debug and understand what was happening. In this case, the user’s queries were suddenly no longer being cached and therefore were being allowed to execute against a real live storage engine 60 times more often. Instead of executing once every 10 minutes, it was once every 10 seconds. That’s really part of what the power of Honeycomb is. 
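
The effect of shortening the cache lifetime can be checked with a small simulation. This Python sketch assumes a simple time-to-live cache and a customer repeating one identical query every 10 seconds for an hour; the function name and the traffic pattern are assumptions, and only the 10-minute and 10-second lifetimes come from the talk.

```python
def backend_executions(request_times_s, ttl_s):
    """Count cache misses for one repeated query under a simple TTL cache."""
    executions = 0
    expires_at = None
    for t in request_times_s:
        if expires_at is None or t >= expires_at:
            # Cache is empty or stale: run against the storage engine and re-cache.
            executions += 1
            expires_at = t + ttl_s
    return executions


one_hour = range(0, 3600, 10)  # one identical request every 10 seconds
old = backend_executions(one_hour, ttl_s=600)  # 10-minute cache: 6 executions
new = backend_executions(one_hour, ttl_s=10)   # 10-second cache: 360 executions
print(new // old)  # 60
```

Under these assumptions, every request becomes a cache miss once the lifetime matches the request interval, so the storage engine does 60 times the work for the same customer traffic.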

Honeycomb allows us to understand what’s happening in production, and it enables us to understand how those changes impact customer experience. We can make design decisions, like whether query cache duration should be a fixed setting or whether we need to make it adapt to the workload size of each customer. This is what it means to give developers fast tools to fit our needs so that both operations and development concerns can be understood from the same interface.

So, at Honeycomb, we’ve reimagined an experience of deploying to and debugging production. 

And using these same approaches, your teams, too, can work together to release features quickly and reliably with Honeycomb. You can center around doing what you do best, building customer experiences that will delight your customers. 

Thank you very much for your attention and enjoy the rest of hnycon.
