I’ve never been on-call, but I’ve been on-call adjacent for a lot of my adult life—my partners, my housemates, my friends…they’ve largely been sysadmins, in Operations, or Dev/Ops, which means I’ve experienced a lot of the pain second-hand.
Being on-call is of course about being reachable when something goes wrong. And if lots of things go wrong when you’re on-call, you end up getting paged a lot and losing sleep and all the other crappy things that go along with an extended battle with Production.
But although I have been awakened by many a page/pager in my life and followed along from the sidelines in the Great Battle For Uptime, until recently the folks doing that battle typically were able to ultimately figure out what a given problem was—even if they sometimes had to jury-rig a fix and do a more thorough review after the fire was out. But that was The Past. Nowadays, incident reports and post-mortems far more frequently wrap up with no clear explanation for what caused the problem.
One obvious reason for this is the escalating complexity of the systems being supported. For a while, the tools (and processes, and skillsets) were keeping up: the industry moved from grep to egrep, to “Google for your datacenter” (log aggregators with a search box on top), and then, as things got more complicated, to APM (application performance monitoring).
Similarly, we began the Dev/Ops “transformation” to improve service outcomes because the tradition of “throwing the code over the wall” to Operations was no longer even remotely tenable. We transitioned to continuous integration and deployment in part to try and mitigate the difficulty of troubleshooting the introduction of new code into such complex systems by shipping smaller changes, more often.
But the transformation is continuing. And the way forward is for engineers themselves to own the systems they create—because at this point, they’re the most likely to be successful at figuring out what those systems are doing in production.
This brings us back to the pain of supporting these systems, of being on-call for them. As our CEO Charity Majors said in a previous post:
“Developing software doesn’t stop once the code is rolled out to production. You could almost say it starts then. Until then, it effectively doesn’t even exist. If users aren’t using your code, there’s no need to support it, to have an on-call rotation for it, right? The thing that matters about your code is how it behaves for your users, in production—and by extension, for the people who are on-call for your code running in production.”
But engineers typically resist being put on-call because of all the aforementioned pain. More transformation is needed. How can you begin to improve the on-call experience for your team—whether they’re devs, dev-ops, ops, or all of the above?
Encourage and reward useful instrumentation
If someone is going to be successfully on-call for a given production system in the modern era, they need the system to communicate information that makes sense in terms of what the user is experiencing. They need access to context that aggregated log output won’t provide. Work with your team to identify the information someone would need to debug the kinds of issues you’re seeing or expecting to see, and begin an iterative, ongoing process of improving the instrumentation of the codebase. Not sure where to start? Refer to the section titled “What should an event contain?” in Observability for Developers for some specific guidance on what to include in your instrumentation.
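To make that a bit more concrete, here is a minimal sketch of what "context-rich" instrumentation could look like: one structured event per request, carrying the fields someone on-call might want to slice by. It uses only the Python standard library. The function name `handle_request`, the field names, and the build identifier are all hypothetical placeholders for illustration, not a recommended schema or any particular vendor's API.

```python
import json
import time
import uuid


def handle_request(user_id, endpoint, handler, emit=print):
    """Run `handler` and emit one structured event describing the request.

    All field names below are illustrative assumptions; pick the ones that
    match the questions your on-call folks actually need to answer.
    """
    event = {
        "request_id": str(uuid.uuid4()),
        "user_id": user_id,                   # who was affected
        "endpoint": endpoint,                 # what they were trying to do
        "build_id": "hypothetical-build-123", # which code was running
    }
    start = time.monotonic()
    try:
        result = handler()
        event["status"] = "ok"
        return result
    except Exception as exc:
        # Record what went wrong, then re-raise so behavior is unchanged.
        event["status"] = "error"
        event["error_type"] = type(exc).__name__
        event["error_message"] = str(exc)
        raise
    finally:
        # Always emit exactly one event, on success or failure.
        event["duration_ms"] = round((time.monotonic() - start) * 1000, 3)
        emit(json.dumps(event))
```

Calling `handle_request("user-42", "/checkout", do_checkout)` would print a single JSON line for that request, where `do_checkout` stands in for whatever your real handler is. The point is less the specific fields than the shape: one event, with enough context attached that someone half-awake can tell who was affected and what they were doing.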
Further incentives: if your instrumentation provides enough context, the Support team might be able to address more issues without paging someone on your team.
Review your alerts religiously and remove them whenever you can
At many shops, the alerts defined in the monitoring system are almost never removed, only added to. There are layers of alerts that haven’t fired in ages, other layers of alerts that just mean ‘restart this service’, and still further layers of alerts that fire but that no one pays attention to, because if ignored they stop on their own. This kind of environment comes about when the on-call folks don’t have the power to decide what is and isn’t being alerted on, and when the folks who are defining the alerts don’t necessarily know what matters to the users vs the operators.
Commit to a thorough review of the current set of defined alerts in your monitoring system—do the things you’re monitoring actually relate to the end-user experience? Be ruthless in stripping out what has been hoarded. Only actually page someone if an end-user will notice, and optimally if there’s something tangible the on-call person can do to resolve the problem. Add new triggers when you’re seeing a new issue, and when you’ve identified and debugged the issue, take them back out.
Empower your team to remove or reconfigure alerts to minimize the impact on them while remaining focused on the customer experience. Consider making the pruning and improving of alerts a key task for whoever is currently on-call.
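If it helps to have a starting point for that review, the questions above can be sketched as a small triage function. This is only an illustration under assumptions: the `Alert` fields, the 90-day staleness window, and the four outcome labels are hypothetical, and your monitoring system will have its own way of exposing this information.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Alert:
    name: str
    user_facing: bool               # would an end-user notice this?
    actionable: bool                # can the on-call person do something tangible?
    last_fired: Optional[datetime]  # None if it has never fired


def review_alert(alert, now, stale_after=timedelta(days=90)):
    """Suggest what to do with an alert during a review.

    The thresholds and labels are assumptions for illustration only.
    """
    if not alert.user_facing:
        # Doesn't relate to the end-user experience: nobody should be paged.
        return "remove"
    if alert.last_fired is None or now - alert.last_fired > stale_after:
        # Hasn't fired in ages: worth asking whether it still earns its place.
        return "review-stale"
    if not alert.actionable:
        # Users notice, but there is nothing to do at 3am: notify, don't page.
        return "downgrade"
    return "keep"
```

Whoever is on-call could run something like this over the exported alert list and bring the "remove" and "review-stale" buckets to the team, rather than deciding alone.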
Further incentives: On-boarding new people to the on-call rotation is a lot easier when someone doesn’t have to explain which alerts to ignore.
Being on-call will always involve getting woken up occasionally. But when that does happen, it should be for something that matters, and that the on-call person can make progress toward fixing. Iterating toward more actionable content in your instrumentation and more focused, significant, and less-disruptive alerting will improve everyone’s experience.