Software Engineering  

What Happens to DevOps when the Kubernetes Adrenaline Rush Ends?

By Martin Thwaites  |   Last modified on October 11, 2023

Kubernetes has been around for nearly 10 years now. In the past five years, we’ve seen a drastic increase in adoption by engineering teams of all sizes. The promise of standardization of deployments and scaling across different types of applications, from static websites to full-blown microservice solutions, has fueled this sharp increase. 

Kubernetes is currently in its “hype cycle” phase. It’s more acceptable for engineers to suggest Kubernetes as their platform of choice, regardless of whether they’re using the cloud or on-premise infrastructure. We’ve seen brick and mortar retail stores deploy single node Kubernetes clusters to manage their till systems, and we’ve also seen ecommerce sites deploy thousands of nodes across hundreds of data centers to manage uptime. 

There is no doubt that Kubernetes is here to stay—but what happens when the hype of migrating to Kubernetes wears off and we now have to manage it every day?

Standardization is the thing folks talk about when they evangelize Kubernetes. This is the idea that everything you run can be containerized, making every service a standard shape, with standard connectors. Kubernetes solves the problem of deploying software at scale, in a standardized way. But what it doesn’t solve is knowing if that software’s doing what it’s supposed to be doing. We simply can’t standardize knowing whether something is doing what it’s supposed to do, since different applications solve different problems.

The broken promise of Kubernetes

I'm an application engineer, and moreover, I'm an application engineer who fully embraced the DevOps movement. I call it a movement as that's what it was to me. It wasn't a new role or new responsibilities, and it wasn't about CI/CD pipelines or IaC. To me, DevOps was about working more closely with the specialists who gave the code I wrote life beyond my local machine, and working with them to make sure that my application performed at its best and got into the hands of users quickly. This was great for me as I started to understand the challenges they faced, but also, they started to see the constraints I had and could offer solutions. Together, we created applications that users wanted. 

With Kubernetes, teams have been running so fast that they didn’t notice a new divide creeping in, this time under a different name: platform engineering. Now we have Kubernetes administrators who can create our clusters but know nothing about what’s running on them, because we’ve standardized everything around a container. 

You might say that this is great because now there is a much clearer divide between the app (the container) and the infrastructure (the cluster). I disagree. Now engineers have to think about deployments, services, sidecars, service meshes, nodes, node affinity, and the list goes on. 

You could say, "But Martin, that isn't what they should worry about, that's platform’s job!" but you’d prove my point: there is now a divide. We pushed for infrastructure and application engineers to work together, to reach into each other's worlds and have an understanding of each part so we could ask intelligent questions of each other. Now we're saying, "Leave that to someone else, they know what they're doing," and that’s where we were 10 years ago, with siloed teams blaming each other when things go wrong. Application engineers can now say, “It worked on my machine but stopped working in production, so it’s platform’s job to fix it,” and platform engineers can point at their dashboard and say everything’s up.

Don’t get me wrong here. The best, most high-performing organizations have excellent dialogue across those teams, and they realize that communicating, and accepting that each team needs different tools to do its job, is what lets both perform. Platform engineers manage everything from autoscaling to network routing, and application engineers look after the product features and make sure that customers are getting the best experience possible. However, what we’re seeing is that migrating to Kubernetes is treated as the end. But what about day two? Once everything is running there, we have nothing else to do, right? We don’t need to upgrade Kubernetes every year, do we?

Most monitoring tools solve for yesterday’s infrastructure 

With the move to Kubernetes, and the ephemeral nature of the infrastructure we use to host our applications, like pods, the approaches we used for monitoring and debugging our applications fail. We’re taking the approaches we used at an infrastructure level and applying them to application debugging because now, everything is standardized, so everything is infrastructure. This severely underserves the application developers building in Kubernetes and the platform engineers looking to enable them with more system context.

We need to evolve our thinking when it comes to supporting modern applications and the application developers working in them. Kubernetes doesn’t make our applications more observable. Rather, it makes them easier to deploy and iterate on. That’s not a bad thing. The ability to update applications easily, promote more deployments, and do blue/green deployments and canaries: these are all great things that will improve the ability of application engineers to support their apps. What it doesn’t do is make it easier for application developers to debug their applications. At best, we’re where we were before Kubernetes was the deployment system of choice. At worst, we’ve introduced more points of failure that we now need to investigate.

When we had a fixed number of servers to deal with, we’d add each of those servers as a dimension in our application metrics. We’d then add the version number of the application. From there, we could delve into which version or which server had an issue, or whether all of the servers were the problem. That combination of Server Name and Application Version was low-cardinality data, which is very well suited to time-series aggregated databases. The situation we’re in now, though, is that pods can be rescheduled at any time, potentially onto a new node. Each deployment also brings new pod names, so most of the time this is now high-cardinality data, which traditional metrics-based systems struggle with.
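To make the cardinality difference concrete, here is a rough Python sketch. The fleet sizes, release count, and names (`web-01`, `checkout-r0-0`, and so on) are made up for illustration, and the pod naming is simplified rather than what Kubernetes actually generates. It only counts how many distinct values each dimension takes over a run of releases.

```python
def cardinality(records, key):
    """Count the distinct values a dimension takes across all records."""
    return len({record[key] for record in records})


RELEASES = 50   # hypothetical number of deployments over some period
REPLICAS = 10   # hypothetical replica count per deployment
SERVERS = ["web-01", "web-02", "web-03", "web-04"]  # hypothetical fixed fleet

# Fixed fleet: the same four hosts report metrics for every release.
fixed_fleet = [
    {"host": host, "version": f"1.0.{release}"}
    for release in range(RELEASES)
    for host in SERVERS
]

# Kubernetes: every release replaces the pods, so every release
# contributes a fresh batch of pod names (naming simplified here).
kubernetes = [
    {"pod": f"checkout-r{release}-{replica}", "version": f"1.0.{release}"}
    for release in range(RELEASES)
    for replica in range(REPLICAS)
]

print("host cardinality:", cardinality(fixed_fleet, "host"))
print("pod cardinality: ", cardinality(kubernetes, "pod"))
```

The host dimension stays at four values however many releases go out, while the pod dimension grows with every release. That unbounded growth is the part that aggregated time-series stores tend to handle poorly.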

Pods only matter when users aren’t happy

Users don’t care about your infrastructure. They don’t care about the CPU on your pods, the network bandwidth, or whether you’re using a service mesh or not. They don’t care if you have one pod or 10 for each service. They only care if your whole system responds to their requests. 

We’re in a situation where unless there's an exception in the code, an HTTP error, or another type of error, it’s likely that it will be pushed to the platform team as an infrastructure issue. Anything to do with slowness—or responses that don’t make sense to the application engineers—is pushed off to the platform team to investigate. At this point, the platform team has little information about the applications and has to investigate infrastructure issues based on coarse-grained metrics. Again, we’re in siloed teams not talking to each other.

The reality is that it could be the pods, but it could also be the code. At this stage, we need the ability to see whether the issue is localized to individual infrastructure components, like pods or nodes, or if it’s affecting everything. This is where high-cardinality data, such as pod names, becomes crucial to your application telemetry. We’re at the point where pods matter, and that’s why you need to be able to bridge that gap. Platform and application engineers need to come together. They need deep, contextual data about both the applications and the infrastructure. To get a full understanding of what makes customers unhappy, they need to correlate customer-centric data like tracing, which application engineers customize to provide context specific to their applications, with infrastructure-centric data like Kubernetes metrics, which the platform team manages.
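As a minimal sketch of that kind of correlation, the Python below groups request records by an infrastructure attribute and reports the share of slow requests for each value. The record shape, the 500 ms threshold, and the sample data are assumptions for illustration. The attribute keys `k8s.pod.name` and `k8s.node.name` follow OpenTelemetry's semantic conventions, but nothing here calls a real tracing backend.

```python
from collections import defaultdict


def slow_rate_by(spans, attribute, threshold_ms=500):
    """Return the fraction of slow requests for each value of `attribute`."""
    totals = defaultdict(int)
    slow = defaultdict(int)
    for span in spans:
        value = span[attribute]
        totals[value] += 1
        if span["duration_ms"] > threshold_ms:
            slow[value] += 1
    return {value: slow[value] / totals[value] for value in totals}


# Hypothetical request records: application timing plus infrastructure context.
spans = [
    {"k8s.pod.name": "checkout-a", "k8s.node.name": "node-1", "duration_ms": 120},
    {"k8s.pod.name": "checkout-a", "k8s.node.name": "node-1", "duration_ms": 95},
    {"k8s.pod.name": "checkout-b", "k8s.node.name": "node-2", "duration_ms": 900},
    {"k8s.pod.name": "checkout-b", "k8s.node.name": "node-2", "duration_ms": 1100},
    {"k8s.pod.name": "checkout-c", "k8s.node.name": "node-1", "duration_ms": 110},
    {"k8s.pod.name": "checkout-c", "k8s.node.name": "node-1", "duration_ms": 130},
]

by_pod = slow_rate_by(spans, "k8s.pod.name")
by_node = slow_rate_by(spans, "k8s.node.name")
print(by_pod)
print(by_node)
```

In this made-up data, every slow request lands on one pod and one node, which points at infrastructure. If the slow share were roughly even across all pods, the code would be the more likely suspect. Either way, the answer only exists because the pod name travelled with the application's own timing data.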

Kubernetes isn't a silver bullet

As organizations move past their migration to Kubernetes and into operating mode, they need to be mindful of the siloed approach that we’ve spent years moving away from. Platform engineering is about enabling application engineers so that, together, they can serve their customers best. Providing the processes, tools, and culture so that those teams can work together is vitally important. It’ll ensure that there isn’t an “us and them” mentality, which leads to a bad overall customer experience. 

This is done through collaboration, not control. Remember: if you have to dictate that a particular tool be used rather than people wanting to use it, there is likely something wrong with the tool.

To build these high-performing teams that work seamlessly together, you need to build bridges by using a common technical language. Tools like OpenTelemetry, which can provide joined-up thinking through tracing (customer-centric for developers) and metrics (infrastructure-centric for platform), will help.
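One small way to share that language is to attach the same infrastructure attributes to everything an application emits. The sketch below builds a value for `OTEL_RESOURCE_ATTRIBUTES`, the environment variable OpenTelemetry SDKs read for resource attributes. The input variable names (`POD_NAME`, `NODE_NAME`, `APP_VERSION`) are assumptions: they stand in for whatever the platform team exposes to the container, for example through the Kubernetes Downward API. Treat it as an illustration rather than a recommended setup.

```python
from urllib.parse import quote

# Semantic-convention attribute key -> hypothetical environment variable
# that the platform team would inject into the container.
ATTRIBUTE_SOURCES = {
    "k8s.pod.name": "POD_NAME",
    "k8s.node.name": "NODE_NAME",
    "service.version": "APP_VERSION",
}


def resource_attributes(env):
    """Build a comma-separated key=value string, skipping unset variables."""
    pairs = []
    for key, variable in ATTRIBUTE_SOURCES.items():
        value = env.get(variable)
        if value:
            # Percent-encode so commas or equals signs in a value
            # cannot break the key=value,key=value format.
            pairs.append(f"{key}={quote(value, safe='')}")
    return ",".join(pairs)


example_env = {
    "POD_NAME": "checkout-7d9f-x2k4q",
    "NODE_NAME": "node-1",
    "APP_VERSION": "1.5.0",
}
print(resource_attributes(example_env))
```

With attributes like these on every trace and metric, application and platform engineers can filter on the same pod or node name when they look at a problem together.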

Together, platform engineering teams and application/product engineering can provide the best customer experience, but those relationships need to be nurtured. And they certainly don’t come for free. To put it succinctly, Kubernetes is not your silver bullet to better performing software. Put in the work.

