
Tracking On-Call Health

By Fred Hebert | Last modified on May 18, 2022

If you have an on-call rotation, you want it to be a healthy one. But this is hard to measure because on-call health has very abstract qualities to it. For example, are you feeling burnt out? Does it feel like you’re properly supported? Is there a sense of impending doom? Do you think everything is under control? Is being on call clashing with your private life? Do you feel adequately equipped to deal with the challenges you may be asked to meet? Is there enough room to recover after incidents?

These questions are often fundamental to whether your engineers’ on-call experience is going to be positive, negative, or anywhere in between. They’re also difficult to track. In this post, I’m going to expand on the values we’re currently using at Honeycomb to monitor on-call health, why we think they’re good, and some of the challenges we’re still encountering.

Track things you can do, not things you hope don’t happen

In most places, people try to track on-call health by picking a proxy value that’s easier to measure: how many disruptions you’re going to have.

And tracking these disruptions more or less ends up meaning counting alarms (so you can track false alarms and incidents), often with added weight given to off-hours interruptions—after all, being woken up or interrupted while putting the kids to sleep is worse than handling outages during work hours. The problem with this value is that while it’s intuitively good, it’s challenging in practice.

It’s challenging because not all incidents are equal. If I’m on call, I actually expect to be woken up from time to time. Disruptions are part of the job description, and I don’t believe we can get rid of all outages or false alarms. It’s more tiring, stressful, and difficult to be caught in a false alarm for a component you aren’t properly trained to operate than it is to comfortably handle a real alarm for something you’re skilled at. Some coworkers revealed that to them, the biggest disruption is simply being on call, because they can’t leave for the mountains on the weekend. Getting paged isn’t as bad a disruption by comparison.

The other problem with counting incidents is that it’s a negative value. One of the principles I keep pushing for is that we should track things we can do, not things we hope do not happen. If we define success as “not having incidents,” we’re essentially orienting our objectives around counting events that are often out of our control, with measures that become debatable targets—what counts as an incident easily shifts to meet objectives. In fact, we tried this briefly internally, and even knowing that I shouldn’t change what I consider an incident to please the metric, I found myself second-guessing everything. By making the value more important and countable, we push ourselves to think about it differently.

Instead of wanting to avoid bad outcomes, our targets should be about things we can do because we believe they improve our operational health. So instead of counting outages and alerts, we should focus on whether we react to them in a useful manner, experiments we run to increase operational experience, or how confident people feel about being on call, to name a few examples.

What we track

Not too long after I joined, I asked to be added to each of the on-call hand-off meetings we have, and soon added an agenda item: Tell Fred how you feel about your week. I was determined to get a qualitative feel for the on-call rotation and for what people found challenging, rather than just counting events.

I did this for many months. It worked okay, but wasn’t great. It gave me a good feel for how things changed from week to week, how some people’s stress levels rose when new services were introduced, how often problems were being tackled and corrected, and so on. In fact, these quick questions ended up being the starting point for frequent training sessions and a dedicated space to discuss on-call concerns.

But an informal discussion wasn’t adequate for tracking sentiment over time, for knowing whether things were improving or worsening, or for pinpointing what caused stress.

Last year, I stumbled upon literature on how Professor Erik Hollnagel evaluates the resilience of sociotechnical systems using a specific grid. What struck me was that the grid’s four categories defining the basis of resilient performance (the abilities to respond, monitor, learn, and anticipate) resonated with me, both because of my personal experience and because of feedback about what made our engineers nervous or uncomfortable about being on call.

The original grid as defined by Erik Hollnagel is rather long and would be time-consuming to fill out. I decided to take the grid and reduce it as much as possible. I’m certainly losing rich data in doing this, but I’m hoping the trade-off is a higher participation rate before or after on-call rotations—enough decent data is possibly better than a little perfect data here, and I do not want to add to the burden of on-call folks. So the 5 minutes of discussion at the end of the hand-off is ideally replaced by 2 minutes to fill out a Google Form.

Once enough people have gone through a rotation, we can revisit these numbers once or twice a quarter and use them to inform our planning in SRE land. So far, we have primarily used the poll contents to help guide some decisions (e.g., fast-tracking the on-call rotation growth and splits or directing some of the OnCallogy themes). They have also been used to feed that information back to the rest of the organization to hopefully better inform decisions there.

The Google Form

Our form currently has the following questions in it:

  1. Which rotation are you on?
  2. Are you going on-call or off-call?
  3. Response: I feel I know how to respond to things happening while on call (rate from “strongly disagree” to “strongly agree”)
  4. Monitoring: I believe we have an adequate level of observability data (rate from “not enough” to “too much”)
  5. Monitoring: I believe we have an adequate level of alerting (rate from “not enough” to “too much”)
  6. Learning: I feel we are learning meaningful lessons from things that happen (rate from “strongly disagree” to “strongly agree,” with an additional “there was nothing worth learning” option)
  7. Anticipation: I think future threats are identified and being addressed (rate from “strongly disagree” to “strongly agree”)
  8. How do you feel about being on call this week? (optional comment box)

Each of the Response/Monitoring/Learning/Anticipation categories also has an optional comment box for anything extra the responder feels like adding. When generating a report, each chart comes with a little “what the chart doesn’t show” section where I paraphrase the free-form comments and provide extra context.

Responses are tracked by email so that I know whom to follow up with for further questions, but there’s an understanding that the data should be kept confidential and anonymized when writing broader organizational reports. Even meetings within the SRE teams are currently held with the email column hidden.
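To make the aggregation concrete, here’s a minimal sketch of how the form’s answers could be rolled up into per-rotation scores. It assumes a hypothetical CSV export with one column per question; the file name, column names, and Likert labels below are illustrative, not the actual form’s export format:

```python
import csv
from collections import defaultdict
from statistics import mean

# Hypothetical mapping of Likert labels to numbers; the real form's
# labels and export format aren't specified in the post.
LIKERT = {
    "strongly disagree": 1,
    "disagree": 2,
    "neutral": 3,
    "agree": 4,
    "strongly agree": 5,
}

# Assumed column names for the agree/disagree questions. The two
# Monitoring questions use a "not enough"/"too much" scale where the
# midpoint is the healthy answer, so they would need their own mapping.
CATEGORIES = ["response", "learning", "anticipation"]

def summarize(path):
    """Average each category's score per rotation from a form export."""
    scores = defaultdict(lambda: defaultdict(list))
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            rotation = row["rotation"]
            for cat in CATEGORIES:
                answer = row[cat].strip().lower()
                if answer in LIKERT:  # skips blanks and non-Likert answers
                    scores[rotation][cat].append(LIKERT[answer])
    return {
        rotation: {cat: round(mean(vals), 2) for cat, vals in cats.items()}
        for rotation, cats in scores.items()
    }

for rotation, cats in summarize("oncall_form.csv").items():
    print(rotation, cats)
```

Keeping the math this simple is deliberate: the goal is a coarse per-rotation trend, not a precise score.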

What we need to improve

None of these measures are perfect. While I now believe we have useful qualitative data to track how people feel about being on call, and that our SRE team is using it to help orient interventions and experiments, there are still plenty of limitations to what this form lets us do.

There are countless factors we do not track or represent. For example, the form does not concern itself with impactful factors such as on-call rotation size and frequency, alert volume, rate of change, pressures, quality of tooling, the amount of trust between engineers and the organization, onboarding, prioritized values, or the impacts of rapid growth.

The data we gather is sparse and slow-trickling—once or twice per responder per shift, and only if they feel like it—and can’t be used as a precise, timely reporting mechanism. It’s part of a necessarily fuzzier and slower feedback loop that works on a scale of multiple months. This means comparisons over time are harder to do, because team makeup, on-call rotations, systems, and trends will all change a lot in a fast-moving environment.
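One way to live with that fuzziness is to compare quarter-sized buckets and watch the direction of movement rather than the absolute numbers. Here’s a sketch in the same spirit as the one above; the “Timestamp” column name and its date format are, again, assumptions about the export rather than the real thing:

```python
import csv
from collections import defaultdict
from datetime import datetime
from statistics import mean

LIKERT = {"strongly disagree": 1, "disagree": 2, "neutral": 3,
          "agree": 4, "strongly agree": 5}

def quarterly_trend(path, category="response"):
    """Bucket one category's scores by quarter to watch direction, not precision."""
    buckets = defaultdict(list)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            # Column name and date format are assumed, not guaranteed.
            ts = datetime.strptime(row["Timestamp"], "%m/%d/%Y %H:%M:%S")
            quarter = f"{ts.year}-Q{(ts.month - 1) // 3 + 1}"
            answer = row[category].strip().lower()
            if answer in LIKERT:
                buckets[quarter].append(LIKERT[answer])
    return {q: round(mean(vals), 2) for q, vals in sorted(buckets.items())}

print(quarterly_trend("oncall_form.csv"))
```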

So while I’m currently pretty happy with decent qualitative data, it’s important to keep in mind that the map is never going to be the territory. Our expectation is that we’ll need to keep refining and tweaking what we measure for it to remain relevant and useful.

How do you track your on-call health? Let us know by sending us a tweet! 

 
