This talk covers hard lessons learned by a small team at Red Hat over the past few years as they struggled to refactor a monolithic service into a more scalable and resilient architecture. John covers what they’ve learned about observability, production support effectiveness, and the cost of bad or inadequate solutions. They made more than a few mistakes, like repeatedly under-appreciating the value of operational features and support infrastructure, and falling prey to the rush of shipping features while neglecting their service’s operational health.
John Casey [Technical Lead|Red Hat]:
Hello. My name is John Casey. I work with Red Hat. Today I’d like to talk about learning to observe a monolith. I titled this A Diary of Digging Out. I think that’s appropriate foreshadowing. So let’s dive in.
First a little bit about me. I’m a technical lead at Red Hat. I work in an organization that’s responsible for our product delivery pipeline. That’s a fancy build system basically. It does a lot more than your average Jenkins, but we manage the builds for our products. Before that, I worked in the Apache Maven community and for a couple of start-ups. And just generally, my career has had a strong emphasis on Java build systems. A little bit about this talk. This is a play in three parts, three acts.
So at first, we had what I’m going to call primordial production support. This is kind of what a developer might come up with without really knowing anything about operations. We were watching logs on the pod. We didn’t even have aggregated logs.
We were doing that from seven time zones away. At the time I was in the U.S. and we had users primarily in Europe. What that meant was a lot of times I would wake up and, even before I was ready to be at work, I would have a message waiting for me saying there was a problem in production. And so it was this process of asking, “What happened? Let me look at the logs. Oops. The logs rolled.” If we have enough events, they fill up the log file to 100 MB, and then it rolls over and creates a new log file, and we only have space to keep 20 of those.
There is a disk space consideration there. But it’s also just how much log file can you actually get through in a reasonable amount of time to try and find a production problem? So the result of this was often just being dropped in a hole. You didn’t have a satisfactory solution. You didn’t learn a lesson. And you weren’t more prepared for the next time it would happen. This wasn’t really support. At the same time, things were fairly slow. We didn’t have a really good definition for what constituted fast or slow. But anecdotally, things were slow. How slow? We didn’t really know. And why were they slow? We also didn’t know that.
So eventually we implemented enough features that the build system was sort of nominally successful. We could bring products in and could have a reasonable expectation that they would be able to do all the steps they needed to be able to actually ship a product. So naturally, this is the thing everybody’s been waiting for. And the floodgates opened. We had products enrolling in the system and starting to try to actually do their work. This meant the logs were rolling faster and faster as more products ran more builds and more things were happening concurrently. Which also meant our ability to find problems became smaller and smaller. The log volume we were accumulating scaled linearly with how many builds were being run.
Then the inevitable thing happened. We had a big outage.
For those who don’t know, Red Hat hosts a yearly event called Red Hat Summit. And it functions as a big show-and-tell for our products. For any new features we’re planning to roll out or even new products, there’s a big push in the days leading up to Red Hat Summit to get those products built so that we actually have something that we can show at Summit.
We were right in the middle of that giant river of product builds, and our system wasn’t working. We had builds that were timing out, and those timeouts led to data inconsistency. And so we were putting out fires trying to keep the system running, but we were also repairing data.
After three days of incident response, we finally, out of sheer desperation, turned on an ad hoc Grafana deployment and turned on our metrics. This was something we had undertaken the previous summer out of a vague sense that the logs weren’t going to be enough for us. So we had added instrumentation into our code to help us expose metrics for the kind of things we thought would be hot spots. And we probably leaned a little more heavily into the various metrics than we needed to, so that we could see more of the system than was strictly necessary.
One thing that’s really important to know about all of this is it was kind of a skunkworks. We did this a little bit off the books. Most of what our stakeholders wanted to talk about were features and how fast we were delivering the features. And we built a little bit of slack into that process and had done some instrumentation work along the way. But we got to that point, that transition point where it was time to actually stand up infrastructure in order to be able to use that code. That’s where the effort fell apart. Anyway, in the lead-up to Summit, when we hit that point of desperation, we turned on our metrics system. This is really kind of a rudimentary system.
We saw the problem and had a configuration fix within an hour. And then in subsequent weeks and even months after that, we had really good guidance on how to improve our system. How to make small design changes, add a cache here or there, and code optimizations. All of that made a huge difference not only to our performance but also to the reliability of the system. We were losing less data. If things fell apart, we knew more about why or we could adjust.
I liken this to turning on the lights. You can start to see the bugs a little bit more easily.
Up to this point, I would say that we’ve been learning some really hard lessons about what it means to do support in production. And with that transition into producing metrics, there were so many valuable insights that came out of that. It changed our thinking. And at this point, I think we’ve entered a new phase of our growth where we’ve learned to really value observability and to take support seriously as an intentional practice.
Fast forward. Things went pretty well for a while. But then we hit another snag and the builds started timing out again. And it was a very vague feeling. Is that slower? Does that seem slower to you? We looked at our metrics and we could see the overall build times starting to creep up. It’s important to note here that from our perspective, as the content provider to the build system, we don’t have a really good, distinct sense of when a build starts and ends. What we see are a bunch of requests for content. Then at the end, we see a request to validate that content. So we don’t necessarily have the best view on the system to be able to know whether our overall build times are going up.
But in August 2019, the builds started timing out again. It was when we were validating our content that we started to have problems. Again, our metrics had started to show us a slight uptick. It was all inferred because we don’t have a really good direct measure of build time on our side. We didn’t have any really good inflection points down in the deeper metric data that would indicate we had some kind of a problem. There was nothing obvious to fix here. Crucially, we didn’t have a reproducer. There were certain builds that would fail more often, but not reliably enough that we could take one out of production and do testing to figure out what was going on.
At this point, we had aggregated metrics and we had been using them to guide a lot of performance tuning and things like that. We had set up a production support rotation where we were watching those metrics for spikes, among other things. What we found here is that the aggregated metrics gave us an undifferentiated view down in the deeper metric data. Because we had a monolith, we had nine functions we were supplying to our build service out of that one monolith. And this meant that there was no way to tell, for any given data point in the deeper metric data, which function it was coming from.
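The smoothing effect described here can be illustrated with a small Python sketch. The numbers are synthetic and the function names are hypothetical; they are not measurements from the system in the talk. The sketch only shows how a 30% regression in one of nine functions nearly vanishes in a single aggregated mean.

```python
from statistics import mean

# Synthetic latencies in milliseconds: nine hypothetical functions served by
# one monolith, 1,000 samples each. These are made-up illustrative values.
baseline = {f"function_{i}": [10.0] * 1000 for i in range(1, 10)}

# Same data, except one function becomes 30% slower.
regressed = dict(baseline)
regressed["function_9"] = [13.0] * 1000


def aggregate_mean(samples_by_function):
    """Mean over every sample, ignoring which function produced it."""
    return mean(ms for samples in samples_by_function.values() for ms in samples)


before = aggregate_mean(baseline)
after = aggregate_mean(regressed)
aggregate_change = after / before - 1

# The same comparison, kept separate per function.
per_function_change = {
    name: mean(regressed[name]) / mean(baseline[name]) - 1 for name in baseline
}

print(f"aggregate change: {aggregate_change:.1%}")
print(f"function_9 change: {per_function_change['function_9']:.1%}")
```

Under these assumptions the aggregated mean moves by roughly 3%, which is easy to mistake for noise, while the per-function view shows the full 30%.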
And these smoothing effects down in the deeper layers hid the problems that we were dealing with here. In the end, what we found was that we had a method call that had gone up by one-tenth of a millisecond in terms of how long it took to execute. But that was being called 10,000 times per request. And because we were evaluating the output of a build, because of the way that validation worked, we sometimes had 14,000 requests. And when you multiply all that out, we started to see build timeouts.
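To make that multiplication concrete, here is a back-of-the-envelope sketch using the figures quoted above. It assumes the worst case, in which all 14,000 requests make the full 10,000 calls and are handled one after another; the variable names are illustrative.

```python
# Figures quoted in the talk, in integer units to avoid rounding noise.
extra_us_per_call = 100            # one-tenth of a millisecond, in microseconds
calls_per_request = 10_000
requests_per_validation = 14_000   # the upper end mentioned in the talk

# Extra time added to a single request.
extra_seconds_per_request = extra_us_per_call * calls_per_request / 1_000_000

# Extra time across one validation, assuming the requests run sequentially.
extra_seconds_total = extra_seconds_per_request * requests_per_validation
extra_hours_total = extra_seconds_total / 3600

print(f"per request: {extra_seconds_per_request:.0f} s")
print(f"per validation: {extra_seconds_total:.0f} s (~{extra_hours_total:.1f} h)")
```

A slowdown too small to notice on any single call adds about a second to every request and, under these assumptions, close to four hours to a large validation.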
Ironically, the structured logs saved the day here. Earlier on, when we transitioned into aggregated metrics, another thing we did at the same time was start shipping our logs off to an ELK (Elasticsearch, Logstash, Kibana) stack. And in Java we were using what is called the Mapped Diagnostic Context (MDC), which is semi-structured logging basically. We were able to shove in context variables here and there that would help us with filtering our logs. This saved us because we were able to go back and say, “OK, we need information just about this request.” We knew that we were seeing aggregate metrics from all over the place that were hiding the signal. We knew there was a signal in there somewhere.
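The idea of attaching context variables to every log line can be sketched in Python. This is an analogue built on the standard `logging` and `contextvars` modules, not the Java MDC code the team used; the logger name, field names, and values are assumptions made up for the example.

```python
import io
import logging
from contextvars import ContextVar

# Context for the current request, similar in spirit to an MDC map.
request_context = ContextVar("request_context", default=None)


class ContextFilter(logging.Filter):
    """Copy the current request context onto each log record."""

    def filter(self, record):
        ctx = request_context.get() or {}
        record.context = " ".join(f"{k}={v}" for k, v in sorted(ctx.items()))
        return True


# Log to an in-memory stream so the example is self-contained.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s [%(context)s] %(message)s"))
handler.addFilter(ContextFilter())

logger = logging.getLogger("content.service")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

# Set the context once at the start of a request; every line inherits it.
token = request_context.set({"request_id": "req-42", "function": "validate"})
logger.info("validation started")
request_context.reset(token)

output = stream.getvalue()
print(output, end="")
```

Because each line carries the same key-value pairs, a log search can filter on a single request or a single function without the code at each call site having to repeat them.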
We started adding a summary log at the end of a request that would summarize how many times a particular hot spot method was called and how long in total we’d spent executing in that method. Within a couple of hours of rolling that out, we had built up enough log data from our builds that we could see the problem immediately. And it took us a little bit of time to design a fix and get it rolled out and then deal with the inevitable side effects, because we’re working in a monolith. But that’s how we found the problem: through our logs. We put metrics into our logs, which sounds absolutely insane to me now looking back. We needed to be able to measure per request, not per unit of time. This was not straight time series data. That was all jammed together.
As a sidebar here, Kibana has been a painful experience for us. We’re using this ELK stack because it was free to us. We had another team that was providing it for other use cases. And so we were able to piggyback on that service. The problem is it was never designed to be a production-grade service. The stability of that service has been lacking for our uses. But also Kibana itself isn’t the fastest thing. We could probably tune it. But when you’re searching more or less an open set of fields, it’s not easy to set up really good indexes on that stuff. You don’t necessarily know where you need your indexes to be. As a result, our responsiveness in production support has not been great when we have to dive into the logs. We spend hours filtering that stuff down just getting to the right angle.
And even when that happens, we have engineers who are outside of the U.S. Our ELK stack is hosted here in the U.S. When you cross an ocean, the latency in the Kibana interface often causes it to time out. That’s a resource that’s out of reach for our production support engineers outside the U.S.
And as I’m sure you can see, performance tuning around the same time hit the wall. For the same reason that we had a hard time solving that production outage or that production problem, we were also unable to see how to make the system faster. All of a sudden, all that data was smeared together and you couldn’t tell one use case from another. Yes, this is a problem with having a monolith. You’ve got multiple functions living in the same house together and it’s hard to tune it for multiple functions simultaneously. The metrics really didn’t help there, because, again, it’s all just mashed together.
So where do we stand today? Right now we’re in the process of refactoring this monolith. This is a major undertaking as we’ve had to separate our data domains. The service that we host is almost all about data, which makes it a very difficult thing to try to refactor. We separated our data domains and we’re splitting out these different functional services so that we can address scaling strategies in different ways based on the service. We’re creating our SLOs. We’re encoding, writing down, and asking people “What’s the minimum acceptable performance for these various functions?” And that’s underway now, actually. I’m embarrassed to say it’s taken us this long to get here.
We’re shifting away from our aggregated metrics and into wide events and a trace orientation. We’re using Honeycomb for that. We’re hoping that over time we’ll be able to make the case to adjacent teams to help us participate in that trace and to give us a little bit more context to help. Because we think that will have an outsized impact on our ability, as a set of teams and as a pipeline, to do support. If we’re all contributing context, that makes it easier to find problems and see how they’re related to other things upstream and downstream in the pipeline.
We are struggling a little bit with adjacent teams. Because a lot of those adjacent teams have just started the transition into aggregated metrics, or at least capturing them in a methodical way. At the same time, we haven’t quite reached that point where we have something to show for our wide event reorientation. And so we’re still in this “We can’t show them. We have to tell them about it” state. And that makes a lot less of a compelling case. So we’re hoping to be able to improve that story and continue to build out what we’re doing so we can show how much it helps. We can see that extending the trace is valuable and adding more context is valuable, but it’s a difficult thing.
So let’s take a step back here and think about why this journey was so hard. And what are the things we’ve learned along the way?
The first thing and the most important thing is that you have to be stakeholder zero. Your users depend on you to do support. Even if they don’t really understand that in a really conscious way. Can you support them? Do you have the tools and features you need to help? Dropping a support case without an answer is a really unsatisfying thing to do. And you need to plan for intentional support.
You will also need to grow your team. Over time, our build systems are being asked to do more and more things, take on more and more technology, and be faster. You will have a need to grow your team. So the question is how hard is it to onboard new members to your team? Or another way to think about this. If you’re drowning, how long do you have to hold on without help? The complexity of your system really defines that.
And then also you need to be ready to add and remove features quickly. If you identify an opportunity, you need to be able to pivot your design quickly so you can take advantage of it.
Or if you have a major performance problem that comes because there’s some new use case out there, it’s important to be able to shift quickly and fix it so that you don’t have to live with the pain. We’ve really learned that lesson the hard way with the monolith, because shifting out of a monolith is one of these things that takes a lot of work, and there’s nothing to show for it until you’re much closer to the end, especially when you have a lot of data to migrate.
And in asserting your own position here, in standing up for the importance of operations, it’s really important to have management support for this. If managers are all focused on user features or on what the build system can do and they’re not paying so much attention to how stable the build system is, that’s going to be a really difficult thing. It’s important to talk a lot about intentional operations, observability, and how well you can actually do support and what that looks like, and get management on board with it.
Your users care about operations too. If you don’t care enough about operations, they will care a lot more about it. And pretty soon operations issues will become user-requested features, like particular levels of reliability. At the same time, all user features have an implied minimum performance. That’s basically the SLO discussion. And it’s important not to leave those things as implications. We need to chase them out of the system, understand them, write them down, codify them, and monitor them.
That’s a really important part of all this. It gives you something where you can say “Is it broken or is it not broken?” If they say it’s broken and the monitor isn’t going off, then you’ve got the wrong thing written down and it’s time to have a conversation about that. Talk to stakeholders about the operational features. At the same time and in the same conversations where you’re talking about the new things to do with the build system, talk about the way we’re going to provide the stability for those new things.
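Writing an SLO down in a form a monitor can evaluate might look like this Python sketch. The service name, the 500 ms threshold, and the 99% target are invented for illustration; they are not the team's actual SLOs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencySlo:
    name: str
    threshold_ms: float  # a request counts as "good" if it finishes within this
    target: float        # required fraction of good requests, e.g. 0.99


def is_met(slo, latencies_ms):
    """Return True if the observed latencies satisfy the SLO."""
    if not latencies_ms:
        return True  # no traffic in the window, so nothing was violated
    good = sum(1 for ms in latencies_ms if ms <= slo.threshold_ms)
    return good / len(latencies_ms) >= slo.target


# Hypothetical SLO: 99% of validation requests finish within 500 ms.
validation_slo = LatencySlo("content-validation", threshold_ms=500, target=0.99)

healthy = [120] * 995 + [900] * 5    # 99.5% of requests within the threshold
degraded = [120] * 950 + [900] * 50  # 95% of requests within the threshold

print(is_met(validation_slo, healthy))
print(is_met(validation_slo, degraded))
```

With the expectation codified like this, "is it broken?" has a mechanical answer, and a user complaint that arrives while the check still passes is a signal that the written-down number is the wrong one.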
And of course, there’s got to be a buy-versus-build slide. Here it is. One lesson that we learned is that you really have to respect maturity curves. And of course, we knew going into this that we were starting over, that we had an idea we could do a build system better than the last one. And we knew to some extent that we were going to have to climb a hill on this. There was going to be a time when our build system was less mature, less stable and maybe couldn’t do as many of the things. But even knowing that, everything you want to do is harder than you think. We have a tendency to look at and visualize a path for a set of features. But that path that we visualize tends to be the happy path. And it doesn’t talk about all the brambles on the way and the potholes. It’s important to slow down and think about what actually goes into supporting that happy path and trying to keep people on it.
Your organizational shape matters a lot. Whose budget is it going to come out of? Especially if you have a platform team and you have a couple of teams doing various stages in the build infrastructure, what budget does it come out of if you want to invest in a new service? That’s going to be a bit of a tricky conversation. Also if there’s instability, who feels that pain? Obviously, you have support engineers in different places. But your users also have some downtime where they’re not getting builds out the door.
It’s really important to transmit incentives and transmit the pain across organizational boundaries. If you have a team that’s providing a platform and that platform has these transient DNS outages, we can do certain things from our design perspective to smooth over those things. But anything that drops a build will produce an outsized impact on the user. And sometimes the platform team can’t really see that connection. It’s important to have conversations that span all of that so we can really understand when it’s time to buy a service versus hosting something internally.
The other thing here, the last thing about buy-versus-build: service prices are easy to understand because they are in dollars. Team hours and the cost of instability are really hard to quantify. It shouldn’t be that hard to quantify team hours, but it is something we tend to systematically undervalue. For instance, how many times have you been in a meeting where you didn’t say much or didn’t get all that much out of it? How much did that meeting cost the company?
The last thing I’ll drop in here is that it’s important to respect your lack of understanding. That sounds funny. But developers, and I think managers to a certain extent, tend to be optimistic about what they can do and how fast they can do it. It’s important to be suspicious of that. Be skeptical of it. As I said, we have a tendency when visualizing a problem, to think about the happy path. It’s important to slow down and go, “OK. Well, we have these other things that can go wrong.”
Our ability to see that upfront is very limited. We may have written the code. We may have written every line of the code. We may understand it back to front. When you start putting load on it, things are going to surprise you. Things will blow up. So the question is when you’re in that situation, what tools do you have to figure out what’s going on and get it solved?
Metrics are really good at detecting problems you’ve already had. I think about these aggregated metrics we have sometimes, and they’re definitely like pickets. They’re watching for the last problem in case it comes back. It’s an important thing, but by definition, if you have a problem in production, it’s one of the hardest problems that you’re going to encounter. It would have been caught by tests otherwise. So by definition, if it happens in production, it’s likely to be more serious and more difficult to fix or even understand. It’s important to have tools that will help support you in that.
Just as a really quick summary, I would say the three big lessons we learned are these. We need to be stakeholder zero. We need to really intentionally design for operations. We need to consider our biases that go into buy-versus-build, and think hard before we decide to host another service, which comes with its own support load. Lastly, favor tools that support you in exploring the system. Understand that you don’t understand the system as well as you think you do. You never will. There will always be things that come in sideways and it’s important to have something that’s going to support you when you’re in an unfamiliar situation.
And I’d say that that’s pretty much what we’ve learned by trying to stand up and observe our monolith. Are there any questions?
Ben Hartshorne [Engineering Manager|Honeycomb]:
Thank you, John. What a journey. That sounds like it was a real progression through a lot of changes.
I would say we’ve learned a lot.
There’s one thing you talked about there that really keyed in for me and resonated with something Rob Meaney said yesterday in his talk. He said, “We write our application code tests when we know the least about how our code operates in production.” In your talk, you were saying at the beginning when you were adding those first metrics, you were adding things you thought were going to be hot spots. But you didn’t know. This consistent guessing before you actually get to places where your application is running seems to be a theme that cuts across so much of what we do as developers. Even when we get to tracing, you think about what might be a reasonable boundary of a unit of work.
Do you think that we can shorten that arc? Can we guess better? Or is this just a necessary part of a circular process?
I think that we are always stuck in this trap of always fighting the last battle and having the best information about things we’ve already tripped over.
I would say that I’ve learned some interesting things while we’ve been going through the actual methodology. How we’re going to break up the monolith has forced us to think about separating our data domains. And of course, that means we have to grapple with how much latency we are going to have once we separate those things. I feel like that puts us in a position where we can think more intelligently about what kind of metrics might blow up on us. But, of course, part of the problem here is that if it gets to production, there’s a good chance that the thing that goes wrong is missing your guess. It’s the thing you didn’t guess about.
But I do think that if we understand what are the important dynamics, what are the important pieces in play, that we can put in some kind of measurements around those things and then maybe have a better shot at seeing that.
And the other thing I’m really looking forward to out of diving all the way into Honeycomb is being able to take those things and crosscut them in different ways. What I am trying to do now when we have a production problem, is to look at it one way, and then shift to the left and then come in from a new angle. You can see it from different angles and then get a better picture of the problem. It will give us better visibility on those things.
It’s like you said at the end, the fact it’s manifesting in production makes it the hardest problem that you could have because otherwise, you would have caught it already.
Yeah. And that’s honestly the best argument in favor of just queryability, right? It’s not just having a dashboard. It’s having the ability to go and explore your data. Then the only thing you have to worry about is having the right data.