John Casey is a DevOps engineer and architect who leads the team at Red Hat that’s responsible for CI/CD and production support. In his talk at the 2021 o11ycon+hnycon, John told the story of how he and his team learned a thing or three in their transition from a monolith to microservices.
One of the most important of these lessons? Use tools that help you address unknown unknowns. But to get to that point, John and his team had to go through a long journey.
John’s journey started just before Red Hat’s annual conference a few years ago, when dozens of products were slated to launch at once. As the conference approached, things started to fall apart under an influx of tests and launches. The tools Red Hat used to debug could not provide the visibility John’s team needed to understand what was happening, let alone why.
Their answer was aggregated metrics, which guided John and his team to a fix, but at a steep cost that they would only come to realize months later.
In 2019, builds started timing out and it wasn’t clear why. Aggregated metrics transformed from a quick fix to a huge headache because they hid problems behind smoothing effects. To find out what was going on, John’s team added a summary log at the end of each request.
In the end, John’s team found that at some point, the duration of one call had increased by one-tenth of a second. That call was made 10,000 times per request. In the face of 14,000 such requests, build time-outs were bound to appear.
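The arithmetic behind that finding is worth spelling out. The following is a rough back-of-the-envelope sketch in Python, not anything shown in the talk. It assumes the 10,000 calls in a request run one after another, which the story above doesn't state, and the variable names are purely illustrative.

```python
# Back-of-the-envelope estimate of the slowdown described above.
# Assumption (not stated in the source): the calls within a request run
# sequentially, so their added latency accumulates.

added_latency_per_call_s = 0.1   # one call got one-tenth of a second slower
calls_per_request = 10_000       # that call was made 10,000 times per request

added_seconds_per_request = added_latency_per_call_s * calls_per_request
added_minutes_per_request = added_seconds_per_request / 60

print(f"Extra time per request: {added_seconds_per_request:.0f} s "
      f"(about {added_minutes_per_request:.1f} minutes)")
```

Under that assumption, a change too small to notice on any single call adds roughly 1,000 seconds, or nearly 17 minutes, to every request. An averaged per-call metric would barely move, while a per-request summary makes the total visible.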
While logs provided visibility into their build process, John knew he needed to go beyond metrics and logs to debug production. He found that he needed to prioritize exploratory tools, like Honeycomb, that helped him solve for unknown unknowns.
There were many more hard-won lessons like these that John discussed in his o11ycon+hnycon talk. We’ve summarized them in this guide to help you quickly get started on similar projects.
Lesson One: You are stakeholder zero
Embrace your role as the instigator of this transition from a monolithic architecture to microservices. Stakeholder zero means that you’ve bought in before everyone else, and it’s your responsibility to think through contingencies, anticipate opportunities and challenges, and get other stakeholders to buy in. As Frank Chen from Slack put it, this is an “Uncle Ben” situation: with great power comes great responsibility.
John pointed out in his talk that your users will depend on you to do support, even if they don’t consciously realize that. When you apply this way of thinking to the transformation from a monolith to microservices, it reveals the importance of intentionally planning support cases.
To make a microservices architecture work for your devs, you need to empower them to resolve issues users might have. Specifically, John called out two core questions you need to think long and hard about:
1. Can you support your users?
2. Do you have the tools and features you need to help them?
How your team will grow and change is another important aspect of this transformation to anticipate as the behind-the-scenes stakeholder. The extent to which your team changes is usually correlated with the complexity of your architecture.
“Our build system was being asked to do more and more things, taking on new technologies faster,” John said. “You will have a need to grow your team. So the question is how hard is it to onboard new team members? Or if you’re drowning, how long can you hold on without help?”
As stakeholder zero, any slip-ups your team makes will reverberate through to the devs and eventually the end users. And your team will make mistakes because moving from a monolithic architecture is not easy.
Because mistakes are inevitable, it’s also your responsibility to be ready to add and remove features quickly. As John said, “If you identify an opportunity, you need to pivot to take advantage. Or, if a new use case causes performance problems, you need to fix it.”
Perhaps the most important responsibility of being stakeholder zero is to get buy-in, particularly from management. If management is only focused on user features and not the stability of the build system, talk to them about it. This is a continual effort, so talk to management about the transition from a monolithic architecture to microservices early and often.
Lesson Two: Users care about operations, too
Users may not fully understand what you do, but they feel the impact if you do it poorly. At the end of the day, users will care about operations and stability even if you don’t. All features have an implied minimum performance users expect. John points out that this implied minimum performance is essentially a service level objective (SLO).
It’s your role to make sure the implied minimum performance is not implied. Explicitly document, codify, and monitor the performance of these features. “This gives you something where you can say, ‘Is it broken or is it not broken?’” John said. “If they say it’s broken but the monitor isn’t going off, it’s time to have a conversation.”
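One way to make an implied minimum explicit is to write it down as data and check measurements against it. The sketch below is a minimal Python illustration, not John's team's implementation. The `Slo` class, the `build_request` name, the 30-second threshold, and the 99% target are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Slo:
    """A hypothetical, explicitly documented performance expectation."""
    name: str
    threshold_s: float   # a request counts as "good" if it finishes within this
    target_ratio: float  # fraction of requests that must be "good"


def is_meeting_slo(slo: Slo, durations_s: List[float]) -> bool:
    """Return True if enough of the observed requests met the threshold."""
    if not durations_s:
        return True  # no traffic, so nothing was violated
    good = sum(1 for d in durations_s if d <= slo.threshold_s)
    return good / len(durations_s) >= slo.target_ratio


# Illustrative numbers only.
build_slo = Slo(name="build_request", threshold_s=30.0, target_ratio=0.99)
observed = [12.0, 18.5, 25.1, 44.0]  # one of the four requests is too slow

print("broken" if not is_meeting_slo(build_slo, observed) else "not broken")
```

With a check like this wired to a monitor, the "is it broken or is it not broken?" question has a concrete answer that both you and your users can point to.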
Lesson Three: Respect maturity curves in the buy vs. build debate
Understand that the beginning of a change as big as transitioning from a monolithic architecture to microservices will be rough. Your build system will start off as less mature—it’ll be able to do fewer things and it’ll be less stable. Choosing to buy or build an internal solution depends on how much of this instability you can stomach.
In particular, John recommended taking organizational shape into account. For instance, whose budget would buying a new tool come out of? Who would feel the initial instability the most if you build your own solution?
When it comes to investment in time and money, service prices are easier to understand than team hours and the cost of instability. John said that a lot of biases come into calculating team hours and that devs usually underestimate the true cost of time spent building a solution.
Only you and your team can answer whether you build or buy, but no matter what you choose there are important considerations, especially if you have an internal platform team providing tools used across the organization.
When designing solutions, we tend to think of the “happy path” and not the brambles along the way. It is important to slow down and really think through all of the scenarios that need to be supported beyond the happy path. In addition, incentives and pain need to be transmitted across organizational boundaries. A platform team may have issues that result in a dropped build, and they need to understand the impact that will have in terms of downtime for their end users.
Lesson Four: Embrace your lack of understanding
According to John, by definition, problems in production are some of the hardest you’ll encounter; they would’ve been caught in testing otherwise. Problems in production are also more likely to be completely new and difficult to identify, especially in a complex, microservices environment.
Start asking yourself what tools you will need to adapt and solve for these unknown unknowns. Aggregated metrics are great for solving problems you’ve already experienced. Summarized log data can help you dig a little deeper. But what about unforeseen problems in production?
You need to prioritize exploratory tools, like Honeycomb, that help you solve for unknown unknowns. Honeycomb goes beyond aggregated metrics and log data to combine their strengths with event data. Altogether, you get complete observability that allows you to adapt to whatever unforeseen challenges the transition from a monolith throws your way.
Shifting to wide events and trace orientation using Honeycomb
Watch the full recording of John’s talk to learn more about how his team will shift to wide events and trace orientation. You’ll also get to hear him tell more harrowing stories from within the monolith.
Honeycomb is the observability tool John and Red Hat trust because it helps them tackle the unknown unknowns of their product. Try Honeycomb out for free today.