Michael covers false starts during early issue investigation including cold starts, provisioned concurrency, and function memory limits. You’ll learn how and why increasing application observability enabled his team to bypass invalid assumptions about root causes of poor application performance, how they moved from guessing to knowing, and how that focus helped them fix application behaviors that most directly impacted our customer experience.
Michael Ericksen [Staff Site Reliability Engineer | Intelligent Medical Objects]:
Hi, hnycon. Thanks so much for having me. My name is Michael Ericksen. I am apparently a Zoom waver as of about last June which was an unexpected turn of events. Professionally I work as a Site Reliability Engineer at Intelligent Medical Objects. We’re headquartered near Chicago. At IMO we enable clinicians to deliver better patient care and operate more efficiently. If your doctor is using an electronic health record, the chances are she is using IMO.
As a little bit of background on our engineering organization, we really came of age with application performance monitoring and auto-instrumentation. But what we have discovered over time is that auto-instrumentation doesn’t capture some of the key event details, perhaps a client ID or a request parameter that can be critical to identifying unknown-unknown issues in production. And while we support the majority of services using our historical APM vendor especially for server-based solutions, Honeycomb still fills an important and niche role in terms of our ability to see and to support critical production services.
As a personal rule, I’ll generally try to avoid using the term post-mortem — since it literally means after deaths — to refer to the learning cycle that hopefully occurs after a production issue. I tend to make two exceptions to that rule. The first is if we’re doing a close reading of act five of Romeo and Juliet. Honestly, five-why the hell out of that shit.
The second is if we choose to treat production incidents like a murder mystery. And in this caper, I’m still a fairly new member of the prod application team. I don’t have the same shared context that other engineers on the team have. I don’t have the fingers to keyboards experience building the application from the ground up. And one of the themes that I hope to develop with you today is the way that application observability can level up all members of a team, regardless of whether they are a recent addition to the team or have been a member of the team since its inception.
Is this slide a bit of foreshadowing? Are experienced engineers going to routinely guess wrong about the root cause of poor end-to-end performance? Stay tuned.
To give background on the application, it’s primarily a serverless application that connects to an Elasticsearch backend and is fronted by a load balancer. And our application team has some internal documentation with SLI-like statements (Service Level Indicators) about a target performance of 50 milliseconds per term for each item in the request payload. So we are expecting request durations to scale more or less linearly with the number of items in the request payload.
But we know customers will always, always be infinitely more creative in terms of load parameters or request inputs that they can send than your development teams are able to cover in automated tests. Being able to understand what some of those unique behaviors are can really help to support your application.
And our application team had a model for how they expected the application to perform under heavy load, a definition of what constituted heavy load, but once we started to see actual application traffic, what we saw diverged somewhat significantly from that model.
You can see up in the left corner, the service health dashboard and on paper, those numbers don’t look very good. P50, 50% of requests are faster than 6.5 seconds which means that 50% of requests are slower than 6.5 seconds. And at P90, that number is much closer to 10 seconds. Fortunately, the primary traffic at this point is from internal applications — our telemetry helped us to see that — and they’re dogfooding the service for us. And knowing who your traffic is coming from and who is having the issues can really decrease the temperature of a potential production issue.
But one of the questions that our application team had at this juncture is how many items were in the request payload? Remember, latency scales somewhat linearly with payload size. If there are more than a hundred items in the payload, it might not look good on paper but we’re probably able to tolerate that performance behavior for a while longer. We might add a story to the backlog to prioritize some feature work to improve application performance with a larger payload size. But if there are significantly less than 100 items in the payload, we probably need to investigate and prioritize stability and performance work for the request that we see coming in.
In this case, we have request item count as one of those request details that we’re adding to our custom telemetry and so we’re able to see almost all of the requests during this time period have between 20 and 30 items. And so we’re expecting performance to be between one and 1.5 seconds and what we’re actually seeing from live traffic significantly diverges from that model.
One of the other behaviors that caught the eye of our engineering team when they looked at that service health dashboard was the higher-than-expected ratio of cold starts to total function invocations and you can see the total function invocations in the light blue line on this graph. And then cold starts are that very colorful graph almost right below. There’s almost no gap between total function invocations and cold starts. And to clarify a little bit of the terminology here, the initial invocation for each function instance will bootstrap your application code and dependencies before your function can execute the user request. So the bootstrap cycle called a cold start can increase end-to-end performance processing time or a user request. Once that bootstrap cycle completes, however, additional invocations are considered warm and they don’t pay that cold-start penalty.
One of our earliest theories was that poor performance is due to cold starts and if we can decrease the number of cold starts, we should see some significant improvements in terms of request latency. So we ran experiments with provision concurrency — that’s a way to pay a little bit of extra money to keep a certain number of instances warm so they don’t go through that cold-start cycle.
We ran that experiment at varying levels from 50 to 250. And although we saw a subtle change in performance behavior, it didn’t really get us back down to that 1 to 1.5 second target threshold that we really hoped to see. And the important thing to remember about this theory is that it cost extra money to put into production. If it’s not significantly changing the end-user experience, it’s probably the wrong technical fix to apply, at least at this juncture.
To quantify that extra cost a little bit, for that first experiment with 50 concurrent instances, that’s about a thousand extra dollars per month if you keep it on 24 hours a day for 30 days. And that cost goes up significantly more than $10,000 for that, for 750 concurrent instances. It wasn’t the right technical solution for the problem our users were having.
What we’re looking at in orange is uncaptured time in the telemetry from our historical APM vendor. It constitutes a sizable portion of the request time. It becomes more significant for longer requests. You can see in that second peak, that most of the time of that five seconds is orange. We reached out to our support rep, and what we learned was that auto instrumentation wasn’t capturing network requests made on background threads. Usually, it’s pretty good about auto-instrumenting network requests, but our application was originally designed to handle network requests on background threads. So the performance was invisible to us.
Designing observability as a foundational behavior of systems creates the opportunity for us to short circuit some of these visibility gaps. In our case, the performance of a primary dependency is entirely opaque in our telemetry with caveats for potential hindsight bias from looking back at the present moment in time.
Whatever was causing our application performance issues, we couldn’t see it. So it was time to pivot. What we’re looking at here is a distributed trace graph after we enabled Honeycomb for one of our pre-prod environments. And the way that we’re getting data into Honeycomb here was writing structured logs to CloudWatch and then ingesting them via a serverless function that’s deployed in our account. It’s a really easy integration pattern as long as you’re able to write structured logs. You can ingest data into Honeycomb, and the documentation here is really clear and easy to get up and off the ground. It also solved the primary issue that we experienced previously around multi-threading and the vendor agents.
Along the top of this graph, that horizontal purple line represents the end-to-end duration of a customer request, and the orange lines below it, are called spans and represent a unit of work necessary to complete the customer request. In our case, we’re talking to that Elasticsearch dependency for each item in the request payload, but for a more complex application, you might see spans for talking to an authorization layer or to a caching layer, or to a database. And it would be important to be able to see each of those dependencies’ performance, status codes, key parameters in your application telemetry.
I’ve heard other Honeycomb customers describe their, “Oh, shit, that’s what my application does?” shortly after enabling it. That was my experience here as well. We can see that some of the requests to Elasticsearch are less than 50 milliseconds the way that we hoped but some of them are also significantly longer than that, almost seven seconds. Even our best and most experienced engineers guess wrong.
What we’re looking at now is, we’ve decomposed those orange spans from the previous slide so that we can understand a little bit more about the performance behavior of processing each of those items in the request payload.
For myself, who is less familiar with the application and more recently joined the team, this distributed trace graph really helped me to see another important application behavior. For a single item in the payload, we’ll sometimes make one request to the Elasticsearch backend, and at other times we make two to perform some rescoring logic. We also can see there’s not really any consistency, whether it is the first request that’s slow or the second request that’s slow, or whether it’s payload items that have to make two requests that are consistently slow. For example, in this image, that bottom span only makes a single request to Elasticsearch and yet it accounts for almost the entirety of the function duration. And looking at this data in aggregate also helped us rule out either the first request or the second request or multiple requests. We’re refining our focus on what is causing this performance issue.
We continue the instrument. Observability isn’t really a binary state, at least in my experience. It’s not, “You’re not observable, you’re not observable, you flip a switch and it’s observable.”
We’re continuing to instrument because we weren’t able to visualize what was causing our performance issues. The highlighted teal span on the left measures the amount of time it takes from our application’s code perspective to complete a round-trip search request. More than 13 seconds. More than 13 seconds!
But the highlighted attribute on the right, that’s a value that’s returned in the Elasticsearch API response and it measures performance once the search request reaches the main or coordinating node until a response is ready to return to the cluster and there we see that Elasticsearch is reporting that it only took a single millisecond to process that request. So we have a significant divergence between what our application code is reporting in terms of performance and what Elasticsearch is reporting in terms of search performance. And that type of custom attribute is perfect to include in your wide event details because it helps us to eliminate the Elasticsearch performance as a potential root cause. Again, we could scale up our cluster if that was the actual root cause or one of the root causes of performance issues, but it wouldn’t solve the issue that our users were experiencing. And having it in our event telemetry allows us to use it to understand potential future performance issues. At some point in the future, we may have actually poor performance due to Elasticsearch cluster performance, so having that attribute in our telemetry allows us to aggregate over it and understand how performance has changed over time.
But if the poor performance isn’t internal to Elasticsearch this time, then where is it?
I and another engineer on the team, we’re looking at the application code again, especially with that last data point in mind. Elasticsearch is reporting that searches are really fast. And we sort of boil it down to one of these four lines of code which is a much smaller problem space than when we started. We looked at the code. We looked away. We looked back and we said, what if — powerful engineering words — what if, we moved the creation of the HTTP session outside of the multi-threaded context which is where this code was originally executed. What if we moved it so that it executed once per function invocation or even once per function lifetime? What if we moved it out and shared it across the multi-threaded context?
So that’s what we did. We moved it, we ran some experiments with a shared HTTP session. And saw some significant improvements. Previously we were only seeing about 20% of requests completing faster than the target 15 milliseconds per term. And with this simple change, we got this much closer to 40% of requests. Not the 80% that we really wanted to get to, but still a significant performance improvement and much more significant than any of our prior theories had moved us. It sort of seemed like we were finally in the right problem space to exert some engineering effort to solve.
We ran a variety of other experiments playing around with thread counts as well. And what we saw was that under bursts of concurrent requests, the application performed significantly better with a lower thread count than in the initial design which used a much higher thread count. And you can see that at 4, 2, and 1 threads were all much closer or above that 80% target threshold.
Under moderate application traffic, the model we had more designed for, the application performed well under all conditions. What this gave us the confidence to do is to deploy an update to the application code that reduced the number of threads that we were running on, and we would actually be able to see, did not change the performance for users, our live traffic. We expected that it would, our model predicted that it would, but did it actually change the behavior? And what we’ve seen since then is it has. It might not be a permanent fix but it’s at least stabilized application performance for our users.
While we were doing this, we had another one of those, “That’s what this application does?” moments. In this distributed trace graph, we’re looking at application behavior with only a single thread. And you can see that that first item in the request payload takes a significant amount of time, more than 500 milliseconds in this example to complete. But then all of the subsequent requests are much, much faster. You can also sort of see since we’re running it with a single thread, that serial behavior, one finishes and the next starts and waterfalls on down until all the items have been processed.
I love this quote. In part, because it doesn’t sound like the author is about to sell you an observability product that just happens to have three pillars: metrics! logs! traces! The goal is to build a culture of engineering based on facts and feedback. I want to conclude with one final anecdote about the power of observability in an engineering organization. There’s a long-running suite of integration tests for one of our applications and because they execute in multiple life cycles, they impede our ability to deploy changes quickly, deploy changes frequently to production. We’re much, much slower than 60 minutes from commit until production. So I was curious — another powerful engineering tool — I was curious, is it specific tests that are consistently slow? Is the whole test suite equally slow? Does the order of test execution matter? Arbitrary questions that I should be able to ask about our environment, in this case, our test environment. And what I saw once we started to ingest data into Honeycomb, it was a few hours’ efforts to refactor the test suite, add in the Honeycomb Beeline for Python, again, Honeycomb makes it easy to start ingesting data and the documentation is super clear.
We’re able to start ingesting test results into Honeycomb within a few hours. In this image, we can see a visualization of a single run of the integration tests. And now that we’re visualizing that behavior, we see this curious period of what appears to be no activity. The test suite starts around 12:05. There appears to be no activity for about the next 12 minutes. What that helped us to understand is during the period of what appears to be no activity, there are some expensive test behaviors that are executing for each of those test scenarios. As the application becomes more complicated, we should expect this expensive test setup behavior to prolong our test runs even more. But we’re able to use this data to have a conversation based on facts and feedback.
Do all the integration tests need to run in multiple environments? If they do, is there a subset of integration tests that we can run that would give us similar confidence in our deployment so we’re not running the entire suite in multiple environments? The conversation was based on facts and feedback.
Throughout this talk, I’ve alluded to the DevOps Research and Assessment (DORA) metrics, but I wanted to make that connection more explicit in closing. Our latency spike caper, observability, and especially Honeycomb allowed us to focus on the right engineering problem to solve in order to restore service back to our goal of, 80% of requests complete faster than 50 milliseconds per term. Eighty percent is the service level objective that we defined for the service, the production issue was ongoing, and 50 milliseconds per term is the service level indicator. And on the previous slide, we also saw the way that adding observability to a test suite allowed us to identify potential opportunities to decrease the amount of time from commit until production while simultaneously increasing our deployment frequency. In my experience, my experience, tools like observability, vendors like Honeycomb, they’re critical to building that high-performing engineering culture based on facts and based on feedback.
Thank you so much. I’m available for some live Q&A. And then also in the conference Slack channel, or on Twitter @erickso_e_n. Thank you.
Ben Darfler [Engineering Manager|Honeycomb]:
All right, let’s give it up for Michael. Thank you, Michael. That was great.
One of the things that jumped out to me was that quote from the Dapper paper. Read the paper before. And that one didn’t jump out at me, or it’s been a while. Yeah, something we like to point out at Honeycomb, tracing empowers the person that can ask the best questions rather than the person who knows the system best. And that helps democratize debugging and empower junior engineers and yeah, it’s a wonderful superpower. I’m wondering if that’s something you’ve seen in your experience.
It really has. As I said, when I joined this team, I was not familiar with the application all that much. I had read the READMES, I looked at application architecture, but there were a lot of behaviors of the system, especially when it was seeing customer traffic that was different to me. And when we as an engineering team sort of first were looking at sort of what are the core causes of performance, a lot of our guesses were wrong. Pretty dramatically so. So being able to sort of move out of that guessing and being able to visualize and actually see the impact of some of the changes that you’re making is a really powerful engineering tool. And I like the way that you talk about it as a democratizing aspect of tools, observability, tools like Honeycomb.
Yeah, right, and I think it kind of plays to one of the comments in Slack too, starting with an instrument, learn, question, instrument kind of cycle is a really powerful place. And something we covered in the last talk as well, like rather than throwing the whole kitchen sink at the beginning, get really opinionated about the details you want to capture, and you get the highest quality data outside of that to answer the questions you’re looking for.
There was a really interesting learning curve to this. And I wasn’t able to go into it in the talk as much as I hoped but sort of as we were adding in that telemetry, our ability to sort of evaluate and either rule in or rule out some of our hypotheses got faster and faster. It was almost like there was a half-life on theory to execution and to rule it in or out. It was really powerful.
Yeah, yeah, absolutely. There’s a question in chat from Ethan wondering, you mentioned you started with structured logs and Ethan was wondering what makes logs structured?
The structured logs that we were using, the multi-threading presented some interesting issues with integrating with our historical APM vendor and I wasn’t sure of my ability to easily integrate one of the Honeycomb Beeline libraries and not run into that same issue. I had used structured logs which is just an object that I’m writing out to standard out as JSON. That shows up in one of our logging tools and we’re able to asynchronously ingest that into Honeycomb. It’s a really, in some ways an easy way to get data into Honeycomb. It’s a little more work for our team to sort of correctly generate those logs, generate propagate parent/child relationships and shared contexts, but it also means that any data that’s output as JSON, you can pump it into Honeycomb and start to sort of ask some of those unknown questions about the behavior of your system.
Yeah, it’s quite impressive, just the idea of like JSON, it’s unreasonably valuable, anything with key-value, throw it in there and away you can kind of go.
Another question from Vera. You mentioned having this aha moment and she was wondering if there’s anyone on the team that didn’t have the aha moment and what was their experience like? Did they eventually come around to seeing this value of the observability-driven kind of loop that we’re talking about?
I think it’s something we’re still working on as an organization. There was a quote from the last presentation that called it a slow conspiracy. So I think telling these stories to other engineering teams about our ability to triage production issues, especially ones where the behavior diverged from our initial hypotheses about them, I think that’s a really powerful tool to sort of help level up your organization in terms of how these types of tools can change developer behavior. I know that’s — I have gone through sort of that transformation. What’s this observability thing about? And just changed behavior for me. I started writing code differently. I started supporting code differently. It sort of puts you on that trajectory that some of the Honeycomb folks talk about, about production excellence. I think that’s a really valuable way to think about it. I don’t think the goal is necessarily observability. I think the goal is production excellence. And delivering that value to your customers. Observability is a really important tool on that trajectory.
Yeah. I love how that aha moment really is the thing. And then the question is how to get to that aha moment. How to give more people an aha moment sooner.
Yeah, it’s really a — I don’t want to have to support production applications without this again. It’s almost a point of no return. It’s tabling stakes in some ways for how we are building high-performing teams and applications.
Yeah, I like to tell the story of my experience first with tracing. I could kind of muddle through with logs and with metrics and kind of make it do. Then all of a sudden tracing comes along and it’s ten times faster to get to the same thing. It’s hard to explain what an impact that has. You could get to the answers previously, maybe not all cases though some cases, but I go back to the Dapper quote. But to be able to move that much more quickly, you can talk about it, but you almost have to feel how empowering it is to move that quickly.
I had a question about the multi-threading piece. It was curious that that’s not a thing that I think shows up in the data, right? That’s where we get to the data informs this intuition in the engineer and back and forth. I’m curious how did that piece unfurl?
Yeah, so there’s a quote from — I may have alluded to, it’s not about knowing the code, it’s about knowing where the system, where in the system the code to debug is. So seeing that pretty significant disparity between how long our application was reporting it took to communicate with Elasticsearch and how performant Elasticsearch was reporting that it was, really cut down on what actually might be going on here. It’s only a handful of lines of code. I think I displayed almost all of them on the screen. Something interesting is going on here, we don’t know what it is. The first thing we alighted on was like, there’s multi-threading going on here. And we had had previous experiences with some interesting behavior about performance in that URL session lib so, it’s sort of like, what if it’s this? But it might not have been, it might have been other things, but we were still in the right area of the code, the right context of the code to refine and iterate. And we might have had future iterations. In this case, that change really stabilized application performance for us. It might not be persistent in perpetuity as production models of customer traffic change but for the behaviors that we’re seeing right now, especially spiky load and sort of steady-state load, we’re right back to where the business, because this is all about delivering value to the business, where the business needs performance to be.
Yeah, I really like that. You got to the point where you were aiming for so you can pause then. You don’t have to keep going down the rabbit hole and continue digging and spending the time and the effort. That’s fantastic. Well, thank you, Michael. Really appreciate the chat.