I hate to be the bearer of bad news (not really), but the reality for developers is that it’s only getting more complicated to ensure the code you write still works. Assuming operational responsibility for the code you write is becoming a larger and larger part of the developer role — even as “where it runs” gets further and further away from “where it was written.”
The first wave of DevOps primarily embodied Ops folks learning how to Dev: to “automate everything” through code. The second wave, naturally, is now about Dev folks learning how to Ops: now, we own the running of our code in production. But while the two shifting waves typically come together in cross-functional DevOps teams, “understanding production” has historically carried with it a heavy Ops bias.
It’s almost ironic, really — recent trends in platform abstractions have turned everything into code. (Hi, serverless! And thanks, Heroku.) What that should have meant is that understanding what’s happening in production would be easier for devs, not harder. Let’s take a look at why that hasn’t been the case, and how it should be instead.
Shoehorning Devs into Ops: What Could Go Wrong?
Leading software engineering organizations are increasingly asking developers to own their code in production. Software engineers are being asked to join on-call rotations, with varying levels of support.
And yet, conventional “production monitoring” tools are inherently hostile to how developers think and work with their code. Traditional approaches to understanding “production” are tied to an application’s underlying infrastructure. Graphs of data like CPU utilization, network throughput, or database load are very infrastructure-centric ways to understand the world. Just because the lines continue to blur between Dev and Ops doesn’t mean we simply transfer over previous mental models of the world. The goal of DevOps isn’t to swap out responsibilities; it is to get the most out of the varied skills, backgrounds and mindsets that comprise these new cross-functional teams.
Traditional production monitoring tools were written long before the era of DevOps — they speak the language of Ops, not Devs. Unfortunately, that sets up an artificial barrier to entry for developers to think about production as a place they own. We’ve done nothing to help developers see their world as it exists in production. Developers often get handed a dashboard full of Cassandra read/write throughput graphs, thread counts and memtable sizes, as if that somehow inducts them into the club of production ownership.
Sure, those metrics and graphs look cool — but there’s often no way to connect that information back to the code, business logic or customer needs that are the world of software development. When problems occur, there’s a big mental leap that exists between seeing that information and tying it back to “what really happened.” And even if that leap can somehow be made, there’s certainly no path at all that leads toward reproducing any observed phenomenon, much less writing the code to fix it.
The cognitive leap that traditional production monitoring tools require developers to make doesn’t get a lot of attention, because that’s simply how things are done for Ops engineers. In some corners of engineering, there’s a smug satisfaction that devs now have to make that leap. Feel our pain, devs! How do you not know that when both of these lines trend down and that graph turns red, it means your application has run out of memory? Welcome to production.
That cavalier attitude reinforces the hostility reflected by the approach taken by traditional monitoring tools. In practice, that approach inadvertently leads to situations where devs simply follow the breadcrumbs and do their best to replicate production debugging patterns they don’t fully understand. Culturally, it creates a moat between the approaches that Ops values and the approaches that Dev values — and reinforces the illusion that production is a hostile place for developers.
Enhance Existing Dev Behaviors
Instead, a more welcoming approach is to tap into what we Devs do naturally when debugging: allow us to quickly compare our expected outcome against the actual outcome (e.g., this code should handle 10K req/sec, but seems to handle only 100 req/sec). Devs share this part of the investigative journey with their Ops comrades. Where Ops and Dev patterns diverge is in digging into why that deviation occurs.
For Devs, we compare “expected” against “actual” all the time in test suites. Investigating test failures means digging into the code, walking through the logic, and questioning our assumptions. Being able to capture business logic-level metadata in production (often high cardinality, often across many dimensions) is a baseline requirement for being able to tap into Dev experience for production problems.
We need a specific, replicable test case. Being able to tap into the specificity of custom attributes like userID, partitionID, etc., is what enables production to feel like an extension of development and test workflows, as opposed to some new foreign and hostile environment.
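As a minimal sketch of what capturing that kind of metadata might look like, the Python below builds one structured event per request. The function name `build_event` and the field names (`user_id`, `partition_id`, `endpoint`, `duration_ms`) are hypothetical, chosen only for illustration; no particular vendor's API is implied.

```python
import json
import time


def build_event(endpoint, status, duration_ms, **attributes):
    """Return one structured event describing a single request.

    Extra keyword arguments carry business-level metadata (for example
    user_id or partition_id) alongside the standard request fields.
    """
    event = {
        "timestamp": time.time(),
        "endpoint": endpoint,
        "status": status,
        "duration_ms": duration_ms,
    }
    event.update(attributes)
    return event


# Hypothetical failing request against an export endpoint.
event = build_event(
    "/api/export", 500, 1840.0, user_id="u-4821", partition_id=17
)
print(json.dumps(event, sort_keys=True))
```

Because the event keeps the specific identifiers rather than an aggregate, a failing request seen in production can in principle be replayed as a test case with the same inputs.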
A Developer Approach to Production
With the advent of PaaS, IaaS and serverless, our world is increasingly abstracting infrastructure away. That’s paved the way for both waves of DevOps and it has made room to redefine priorities. For software development teams that own running their code in prod, that means they’ve shifted toward aligning their definition of successful operation with what ultimately matters to the business — whether the users of that software are having a good customer experience.
That shift works very well for developers who are accustomed to having functions, endpoints, customer IDs, and other business-level identifiers naturally live in their various tests. Those types of identifiers will only continue to become more critical when investigating and understanding the behavior of production systems. (In contrast, traditional monitoring systems focus on the aggregate behavior of an overall system and almost never include these types of identifiers.)
All of the questions that developers should ask about production boil down to two basic forms:
- Is my code running in the first place?
- Is my code behaving as expected in production?
As a developer in a world with frequent deploys, the first few things I want to know about a production issue are: When did it start happening? Which build is, or was, live? Which code changes were new at that time? And is there anything special about the conditions under which my code is running?
The ability to correlate some signal to a specific build or code release is table stakes for developers looking to grok production. Not coincidentally, “build ID” is precisely the sort of “unbounded source” of metadata that traditional monitoring tools warn against including. In metrics-based monitoring systems, every new build ID adds another set of metrics to store, so the number of metrics captured grows without bound. That degrades the performance of the monitoring system, with the added “benefit” of paying your monitoring vendor substantially more for it.
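As a rough sketch of that correlation, assume each request is recorded as an event that carries a hypothetical `build_id` field. Grouping by that field shows which build introduced a rise in errors. The function name, field names, and sample data below are all assumptions made for illustration.

```python
from collections import defaultdict


def error_rate_by_build(events):
    """Return {build_id: fraction of events with status >= 500}."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for event in events:
        build = event["build_id"]
        totals[build] += 1
        if event["status"] >= 500:
            errors[build] += 1
    return {build: errors[build] / totals[build] for build in totals}


# Hypothetical events spanning two deploys.
events = [
    {"build_id": "build-101", "status": 200},
    {"build_id": "build-101", "status": 200},
    {"build_id": "build-102", "status": 500},
    {"build_id": "build-102", "status": 200},
]
rates = error_rate_by_build(events)
for build, rate in sorted(rates.items()):
    print(build, rate)
```

In this made-up data the errors appear only under `build-102`, which narrows the search to the code changes shipped in that build.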
Feature flags — and the combinatorial explosion of possible parameters when multiple live feature flags intersect — throw additional wrenches into answering Question 1. And yet, feature flags are here to stay; so our tooling and techniques simply have to level up to support this more flexibly defined world.
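To make the arithmetic behind that explosion concrete: n independent on/off flags yield 2^n possible combinations. The short Python sketch below enumerates them for three hypothetical flag names.

```python
from itertools import product

# Hypothetical boolean feature flags that are live at the same time.
flags = ["new_checkout", "fast_search", "beta_export"]

# Every on/off combination is a distinct condition the code may run under.
combinations = [
    dict(zip(flags, states))
    for states in product([False, True], repeat=len(flags))
]
print(len(combinations))  # 2 ** 3 combinations
```

Ten such flags would already give 1,024 combinations, which is why recording the flag state alongside each request tends to be more practical than trying to predefine a graph for every case.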
Question 2, on the other hand, is the same question we ask anytime we run a test suite: “Does my code’s actual execution match what I expect?” The same signals that are useful to us when digging into a failing test case are what help us understand, reproduce and resolve issues identified in production.
A developer approach to debugging prod means being able to isolate the impact of the code by endpoint, by function, by payload type, by response status, or by any other arbitrary metadata used to define a test case. Developers should be able to take those pieces and understand the real-world workload handled by their systems, and then adjust their code accordingly.
The Way Forward: A Developer-Friendly Prod
The future of Dev careers isn’t about having different bespoke ways of approaching debugging your production environment. DevOps is about getting the most out of your new cross-functional teams. Luckily, when it comes to using tools to get answers to the questions you care about in production, there’s an opportunity for everyone to get on the same page. Whether your team labels itself Devs, Ops, DevOps, or SRE, you can all use tools that speak the same language.
In today’s abstracted world — one full of ephemeral instances, momentary containers and serverless functions — classic infrastructure metrics are quickly fading into obsolescence. This is happening so quickly that it even calls into question the future of ops careers. A fundamentally better approach to understanding production is necessary — for everyone.
A good first step is shifting focus away from metrics like CPU and memory and instead embracing RED metrics (request rate, errors, and duration) as the primary signal of service health. That can substantially lower the barrier to entry to production for most developers. Devs can then be armed with the metadata necessary to understand the impact of any given graph, by tagging those metrics with customer ID, API endpoint, resource type, customer action, etc. Doing so bridges the gap between capturing metrics in prod and tying them back to code and tests.
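A minimal sketch of that idea, assuming requests are available as simple records with hypothetical `endpoint`, `status`, and `duration_ms` fields: the function below summarizes rate, errors, and duration for each value of whichever tag you choose. It uses a plain average for duration to stay short; real systems usually prefer percentiles.

```python
from collections import defaultdict


def red_by_tag(events, tag, window_seconds):
    """Summarize Rate, Errors, and Duration for each value of `tag`."""
    groups = defaultdict(list)
    for event in events:
        groups[event[tag]].append(event)

    summary = {}
    for value, group in groups.items():
        durations = [e["duration_ms"] for e in group]
        summary[value] = {
            "rate_per_sec": len(group) / window_seconds,
            "errors": sum(1 for e in group if e["status"] >= 500),
            "avg_duration_ms": sum(durations) / len(durations),
        }
    return summary


# Hypothetical events collected over a 2-second window.
events = [
    {"endpoint": "/login", "status": 200, "duration_ms": 40.0},
    {"endpoint": "/login", "status": 500, "duration_ms": 60.0},
    {"endpoint": "/search", "status": 200, "duration_ms": 120.0},
    {"endpoint": "/search", "status": 200, "duration_ms": 80.0},
]
summary = red_by_tag(events, "endpoint", window_seconds=2)
for endpoint, stats in sorted(summary.items()):
    print(endpoint, stats)
```

Passing a different tag name, such as a customer ID field, would slice the same three signals along a different business dimension without changing the function.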
Going one step further is the reason observability has exploded in popularity. Observability is not a synonym for monitoring. Observability takes an event-based approach that still allows you to incorporate infrastructure metrics to understand the behavior of your production systems. It’s an entirely different approach from the Ops-centric world of monitoring, one that enables understanding the behavior of production systems in ways that make them accessible to engineers from all backgrounds.
The future of dev careers shouldn’t be defined by struggling to understand the correlations between traditional monitoring tools and where that ties into your code. By breaking away from traditional monitoring tools, the future of dev careers instead becomes one where understanding what’s happening in prod feels every bit as natural as understanding why code failed in your development or test environments.
Over the last decade and change, as an industry, we’ve all gotten really good at taking code and shipping it to the user. That was Heroku’s promise, after all: simply and magically hooking a production environment up to a developer’s natural workflow. And because of this — because of how much closer we’ve brought production to the development environment — the developer skill set has to follow the same trajectory… or risk being left behind.
Learn more about making production more approachable to devs and to ops in Honeycomb’s Guide to Developing a Culture of Observability.