Tracking On-Call Health
By Fred Hebert | Last modified on May 18, 2022

If you have an on-call rotation, you want it to be a healthy one. But this is sort of hard to measure because it has very abstract qualities to it. For example, are you feeling burnt out? Does it feel like you're supported properly? Is there a sense of impending doom? Do you think everything is under control? Is it clashing with your own private life? Do you feel adequately equipped to deal with the challenges you may be asked to meet? Is there enough room given to recover after incidents?
These questions are often fundamental to whether your engineers' on-call experience is going to be positive, negative, or anywhere in between. They're also difficult to track. In this post, I'm going to expand on the values we're currently using at Honeycomb to monitor on-call health, why we think they're good, and some of the challenges we're still encountering.
Track things you can do, not things you hope donât happen
In most places, people try to track on-call health by picking a proxy value that's easier to measure: how many disruptions you're going to have.
Measuring disruptions more or less amounts to counting alarms (so you can track false alarms and incidents), with added weight often given to off-hours interruptions; after all, being woken up or interrupted while putting the kids to sleep is worse than handling outages during work hours. The problem with this value is that while it's intuitively appealing, it's challenging in practice.
It's challenging because not all incidents are equal. If I'm on call, I actually expect to be woken up from time to time. Disruptions are part of the job definition, and I don't believe we can get rid of all the outages or all the false alarms. It's more tiring, stressful, and difficult to be caught in a false alarm for a component you are not properly trained to operate than it is to comfortably handle a real alarm for something you're skilled at. Some coworkers revealed that to them, the biggest disruption is actually just being on call, because they can't leave for the mountains on the weekend. Getting paged is not as bad a disruption by comparison.
The other problem with counting incidents is that it's a negative value. One of the principles I keep pushing for is that we should track things we can do, not things we hope do not happen. If we define success as "not having incidents," we're essentially orienting our objectives around counting events that are often out of our control, with measures that become debatable targets: what counts as an incident easily shifts to meet objectives. In fact, we tried this briefly internally, and even knowing that I shouldn't change what I consider an incident to please the metric, I found myself second-guessing everything. By making the value more important and countable, we push ourselves to think about it differently.
Instead of wanting to avoid bad outcomes, our targets should be about things we can do because we believe they improve our operational health. So instead of counting outages and alerts, we should focus on whether we react to them in a useful manner, experiments we run to increase operational experience, or how confident people feel about being on call, to name a few examples.
What we track
Not too long after I joined, I asked to be added to each of the on-call hand-off meetings we have and soon added an agenda item: Tell Fred how you feel about your week. I was determined to get a qualitative feel for the on-call rotation, and for what people found challenging beyond just counting events.
I did this for many months. It worked okay, but wasn't great. It gave me a good feel for how things changed from week to week, how some people's stress levels rose when new services were introduced, how often problems were being tackled and corrected, and so on. In fact, these quick questions ended up being behind establishing frequent training sessions and a dedicated space to discuss on-call concerns.
But an informal discussion wasn't adequate for tracking sentiment over time, knowing whether things were improving or worsening, and understanding what caused stress.
Last year, I stumbled upon literature on how Professor Erik Hollnagel evaluated the resilience of sociotechnical systems by using a specific grid. What was striking was that the grid's four categories defining the basis of resilient performance (the abilities to respond, monitor, learn, and anticipate) resonated with me, both because of my personal experience and because of feedback regarding what made our engineers nervous or uncomfortable about being on call.
The original grid as defined by Erik Hollnagel is rather long and would be time-consuming to fill out. I decided to take the grid and reduce it as much as possible. I'm certainly losing rich data in doing this, but I'm hoping the trade-off is a higher participation rate before or after on-call rotations: enough decent data is possibly better than little perfect data here, and I do not want to add to the burden of on-call folks. So the 5 minutes of discussion at the end of the hand-off is ideally replaced by 2 minutes to fill out a Google Form.
Once enough people have gone through a rotation, we can visit these numbers once or twice a quarter and use them to inform our planning in SRE land. So far, we have primarily used the poll contents to help guide some decisions (e.g., fast-tracking the on-call rotation growth and splits, or directing some of the OnCallogy themes). It has also been used as a way to feed that information back to the rest of the organization, to hopefully better inform decisions there.
The Google Form
Our form currently has the following questions in it:
- Which rotation are you on?
- Are you going on-call or off-call?
- Response: I feel I know how to respond to things happening while on call (rate from "strongly disagree" to "strongly agree")
- Monitoring: I believe we have an adequate level of observability data (rate from "not enough" to "too much")
- Monitoring: I believe we have an adequate level of alerting (rate from "not enough" to "too much")
- Learning: I feel we are learning meaningful lessons from things that happen (rate from "strongly disagree" to "strongly agree," with an extra option for "there was nothing worth learning")
- Anticipation: I think future threats are identified and being addressed (rate from "strongly disagree" to "strongly agree")
- How do you feel about being on-call this week? (optional comment box)
Each of the Response/Monitoring/Learning/Anticipation categories also has an optional comment box for any extra content the responder feels like adding. When generating a report, each chart comes with a little "what the chart doesn't show" section where I paraphrase free-form comments and provide extra context.
Responses are tracked by email so that I know whom to follow up with for further questions, but there's an understanding that the data should be confidential and anonymized when writing broader organizational reports. Even meetings within the SRE teams are done with the email column hidden at this point in time.
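Since form responses land in a spreadsheet, the quarterly roll-up can be a few lines of scripting. Below is a minimal sketch, assuming a CSV export with hypothetical column headers (`rotation`, `email`, and one column per Likert-scaled category); the label-to-number mapping and category names are illustrative assumptions, not the actual form schema. The email column is read but never reported, in keeping with the anonymization practice.

```python
import csv
from collections import defaultdict
from statistics import mean

# Hypothetical mapping from Likert labels to numbers; adjust to
# whatever labels the real form export uses.
LIKERT = {
    "strongly disagree": 1, "disagree": 2, "neutral": 3,
    "agree": 4, "strongly agree": 5,
}
# Likert-scaled categories (the two Monitoring questions use a
# "not enough"/"too much" scale and would need their own mapping).
CATEGORIES = ["response", "learning", "anticipation"]

def summarize(path):
    """Return the mean score per (rotation, category), ignoring emails."""
    scores = defaultdict(list)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            rotation = row["rotation"]
            for cat in CATEGORIES:
                answer = row.get(cat, "").strip().lower()
                # Skip blanks and "there was nothing worth learning".
                if answer in LIKERT:
                    scores[(rotation, cat)].append(LIKERT[answer])
    return {key: round(mean(vals), 2) for key, vals in scores.items()}
```

Charting the per-rotation averages over time is then a matter of running this against each quarter's export; the free-form comments still need human paraphrasing.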
What we need to improve
None of these measures are perfect. While I now believe we have useful qualitative data to track how people feel about being on call, and that our SRE team is using it to help orient interventions and experiments, there are still plenty of limitations to what this form lets us do.
There are countless factors we do not track or represent. For example, the form does not concern itself with impactful ones such as on-call rotation size and frequency, alert volume, rate of change, pressures, quality of tools, amount of trust between engineers and the organization, onboarding, prioritized values, or the impacts of rapid growth.
The data we gather is sparse and slow to trickle in (once or twice per responder shift, if they feel like it) and can't be used as a precise, timely reporting mechanism. It's part of a necessarily fuzzier and slower feedback loop working in terms of multiple months. This means that comparisons over time are harder to do, because team makeup, call rotations, systems, and trends will change a lot in a fast-moving environment.
So while I'm currently pretty happy with decent qualitative data, it's important to keep in mind that the map is never going to be the territory. Our expectation is that we'll need to keep refining and tweaking what we measure for it to remain relevant and useful.
How do you track your on-call health? Let us know by sending us a tweet!