Oncall and Sustainable Software Development
By Charity Majors | Last modified on January 11, 2019
Yes, being on call typically and anecdotally sucks. I understand! If you’ve heard me speak, I often point out that I’ve been oncall since I was 17 years old—so I know how terrible it can be.
But I believe strongly that it doesn’t have to be. Oncall can and should be different. It can be like being a superhero—if you’re on call and an issue comes up, you should get to feel like you’re saving the world (or at least your users), in a good way. It shouldn’t eat your life, or have a serious negative impact on your day-to-day interactions and personal relationships. It just shouldn’t suck that much.
Let’s explore this possibility. The first questions to ask are: Why does on call even exist? Should it exist? What are its goals, and what is out of scope?
If we agree it should exist, then how can we do it well? How can oncall become a thing that is pedestrian, part of life, something that is possible to love, or at least tolerate?
What even is oncall?
What oncall should be about in most environments:
- Bring the service up if it goes down
- Respond to users when they complain (or do so by proxy via Customer Support)
- Triage bug reports and escalate, wontfix or fix
- Make small fixes (that are scoped to some threshold, maybe < 1 day or < 2 hours)
- Do garbage collection: a grab bag of tasks that don’t happen often enough to be worth fully automating or hiring a team for, that require human judgment, that are one-offs, or that are critical and urgent
- Represent back to management when the burden is becoming too heavy
Why are there on call rotations? Because it forces everyone to learn. It is terrible for just a single person to know how to do a given thing; terrible for them, terrible for everyone else. We have oncall rotations for the same reason that we round-robin DNS for our API: it’s about resiliency, and ultimately, sustainability. On call rotations are the primary mechanism by which most teams address the question of sustainability.
Services need owners, not operators
It really does take a village.
Responsibility for tending to brand new, baby code is too important, too hard, too all-consuming to be one person’s job. It will break them, burn them out so fast and so hard until all they want to do is drown that code in the bathtub. (Babies, by the way, are engineered by evolution to be so cute that you won’t kill them no matter how they scream. Your code is not this cute, I guarantee you.)
To build something is to be responsible for it. Writing software should only be done as a very last resort. Great senior engineers do everything possible to avoid having to write new code.
Developing software doesn’t stop once the code is rolled out to production. You could almost say it starts then. Until then, it effectively doesn’t even exist. If users aren’t using your code, there’s no need to support it, to have an oncall rotation for it, right? The thing that matters about your code is how it behaves for your users, in production—and by extension, for the people who are oncall for your code running in production.
The craft of building software and the craft of owning it over its lifetime are effectively the same thing. The core of the craft of software design and engineering is about maintainability. These things should continue to matter to you after your code is in production.
If I ask a room of folks who among them works on building software, and then ask the same room which of them are part of the oncall rotation, I should see about the same people raising their hands. But I typically don’t. Their experience of oncall sucks (see above), so software engineers conclude they don’t want to do it.
So how do we make it not suck (so much)?
On call is a relentlessly human-scale process. It is never “fixed” or “done” any more than managing a team is done. It requires eternal vigilance and creativity and customization to your specific resources and the demands upon them at any given time. There are no laws I can pass down that if you only apply them will make everything better. It is contextual and contingent at its very heart.
But we do know that nothing is sustainable if it is hateful to everyone who participates in it. Sustainable processes are low impact and use renewable resources. Burning people out is not sustainable. Making them dread being oncall is not sustainable. We have to make it suck less.
Be(e) the oncall you want to see in the world
If you’re leading a team responsible for oncall, set an example yourself.
Empower your team to get the work done, then value their work, their time, their personal lives. Make people mobile; provide them with whatever hardware they need to do their work on the go. Show them the impact they have on the service, on customer happiness. If you can, shrink the feedback loop so your team can internalize the impact of their own choices and outcomes.
Pair people up as primary/secondary so there’s very little friction to asking someone to take over for you for a few hours. Send people home or tell them not to come in after a rough night. Support them in making those decisions for themselves. Never page new parents, or anyone who is already being woken up by something. Never force someone to carry a pager. Create incentives and social pressure and team cohesion, and they will want to.
Help your team connect on a human level. Share food with each other, try to spend time together and share things that aren’t software. Make it easier for them to ask for help from you and each other.
Most of all, ensure you and your senior people are modeling this behavior; don’t just say it’s encouraged with your mouth. Nobody believes you ‘til they see it.
Care more about less stuff
Your team can’t handle everything all the time, so be ruthless in prioritizing, reducing responsibilities, setting expectations with the rest of your organization. Consider lowering your standards! Have reasonable expectations; set them as low as you possibly can.
Guard the critical path, architect for resiliency. Prefer daytime alerts and keep nighttime alerts few (and audit these often). Only page on something if the customer is going to notice it. Let lots of things fail until morning. Protect your team from unreasonable expectations.
Above all, pay attention and be outcome-oriented.
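To make the paging rule above concrete, here is a minimal sketch of an alert router. Everything in it is an assumption for illustration: the `Alert` fields, the 22:00 to 08:00 night window, and the "page" / "notify" / "queue" labels are hypothetical and not tied to any particular alerting tool.

```python
from dataclasses import dataclass
from datetime import time

# Hypothetical night window; pick hours that match when your team sleeps.
NIGHT_START = time(22, 0)
NIGHT_END = time(8, 0)


@dataclass
class Alert:
    name: str
    customer_visible: bool  # would a customer notice this failure?


def is_night(now: time) -> bool:
    # The window wraps past midnight, so this is an "or", not an "and".
    return now >= NIGHT_START or now < NIGHT_END


def route(alert: Alert, now: time) -> str:
    """Return "page", "notify", or "queue" (hypothetical routing labels)."""
    if alert.customer_visible:
        return "page"  # the only case that is allowed to wake someone up
    # Everything else waits: a daytime notification, or a queue until morning.
    return "queue" if is_night(now) else "notify"
```

The point of the sketch is that there is exactly one branch that pages, so auditing your night alerts reduces to auditing which alerts are marked customer-visible.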
Paying attention? Here’s what to look for
Signs it’s bad:
- People dread it
- People plan their lives around it
- People cancel things
- People talk about it a lot
- Ghoulish gallows humor
- People complain about things being unfair
Signs it’s going well:
- People cheerfully and actively volunteer to cover for each other
- People ask to be covered for like it’s no big deal
- People don’t keep score obsessively about who owes who
- Someone gets woken up 1-2 times a week at most
- People can freely self-select which rotation to join
- People move among rotations often
- People can just take their laptop and wifi with them to the concert, camping, whatevs
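One of these signals, being woken up 1-2 times a week at most, is easy to check if you keep a log of pages. The following is a rough sketch under stated assumptions: the page log is a list of hypothetical `(person, timestamp)` pairs, "night" is assumed to be 22:00 to 08:00, and weeks are ISO calendar weeks.

```python
from collections import Counter
from datetime import datetime

# Taken from the "1-2 times a week at most" signal above.
MAX_NIGHT_PAGES_PER_WEEK = 2


def is_night(ts: datetime) -> bool:
    # Assumed night window: 22:00 to 08:00 local time.
    return ts.hour >= 22 or ts.hour < 8


def overloaded(pages):
    """Given (person, datetime) pairs, return a sorted list of
    (person, iso_year, iso_week, night_page_count) that exceed the limit."""
    counts = Counter()
    for person, ts in pages:
        if is_night(ts):
            iso = ts.isocalendar()  # (ISO year, ISO week number, ISO weekday)
            counts[(person, iso[0], iso[1])] += 1
    return sorted(
        (person, year, week, n)
        for (person, year, week), n in counts.items()
        if n > MAX_NIGHT_PAGES_PER_WEEK
    )
```

Anyone who shows up in the result is a candidate for being sent home, covered for, or having their noisiest alerts moved to daytime. A count like this is only a prompt for a conversation, not a substitute for one.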
The glorious future of being on call
One of the crappiest things about being oncall in the current reality for most folks is that you’re stuck getting paged for the same old shit every rotation, applying the same workarounds, restarting the same services. Instead, being on call should be about solving new and different problems, being a superhero—not about being the janitor.
Oncall can and should be a break in routine, freedom from the daily tedium of incremental progress factoring widgets, an opportunity to fix all the little things that are bothering you, to give your teammates and users sparks of delight. It can be a breather where you’re encouraged to investigate and pay down technical debt, learn new things, make your teammates’ lives a little better.
And you want your oncall team to be freed up from handling the known-knowns, and the known-unknowns—because as complexity continues to accelerate, you need them fresh and bright-eyed and bushy-tailed to handle all those unknown-unknowns coming our way.
Hoping to see you someday in the glorious future of being oncall :)