On Counting Alerts
By Fred Hebert | Last modified on July 9, 2022
A while ago, I wrote about how we track on-call health, and I heard from various people about how “expecting to be woken up” can be extremely unhealthy, or how tracking the number of disruptions would actually be useful. I took that feedback to heart and wanted to address the issues they raised, and also provide some numbers that explain the position I took with these metrics on alerts.
On sleep, stress, and hours worked
The first criticism I received came in Tweet form, reinforcing the idea that sleep deprivation can lead to long-term health problems. As far as I know, this is backed by science, so I have little reason to doubt the claim.
I’m a big fan of Hillel Wayne’s The epistemology of software quality. Essentially, three of the best things we can do to increase the quality of the software we build are to act on variables long known to affect people: sleep, stress, and hours worked. All three have very measurable effects on people across disciplines, and there’s no reason to think we’re immune to that in software.
My original post didn’t call out sleep quality, stress, or hours worked directly. And other bits of feedback I received also mentioned counting alerts as important signals that underpin this sort of balance. My gut feeling is that this is true, but only contextually so.
I’ve worked on-call rotations that received five to 10 alerts a night, and I’ve seen people who were on rotations with five times as many. I’ve been on on-call rotations where we received two alerts a year, and anywhere in between. Obviously the place with two alerts a year was less stressful and more relaxing than the one with 10 a night.
But I’ve also been in places where one to two alerts a week felt less stressful than one to two alerts a month, simply because of the support structures in place.
For me, the issue isn’t only the number of disruptions, even though they obviously correlate with hours worked and quality of sleep. It’s also the additional factors that affect stress and general comfort at work, since those heavily influence how tired work makes me.
Of course, there’s nothing like backing these types of impressions with data, so let’s do a little retrospective.
What’s our alert volume saying?
I wanted to run through six months of data, from November 25 to May 25, combing through all of the PagerDuty alerts we had and ignoring all non-alert work. After a short analysis of the broad strokes, here’s what I found:
| Page type | Count | Ratio |
| --- | --- | --- |
| Test pages (used in hand-offs) | 65 | 36.93% |
| Self-resolving non-acknowledged pages | 5 | 2.84% |

| Alert Target | Total Count | Ratio | Off-hours count | Off-hours ratio |
| --- | --- | --- | --- | --- |
| Pingdom (front-end domains) | 9 | 8.11% | 2 | 1.80% |
| Interactive Queries (SLO) | 6 | 5.40% | 2 | 1.80% |
| Home page (SLO) | 4 | 3.60% | 2 | 1.80% |
| Host disk filling up | 1 | 0.90% | 0 | 0% |
Off-hour pages are defined as happening between 7 p.m. and 8 a.m. PT, and entirely ignore the timezone the employee is working in (that’s too time-consuming to set up). Almost half of the off-hour pages above happen within the three hours between the East Coast being up and the West Coast getting in (roughly 6 a.m. to 9 a.m. PT).
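To make that definition concrete, here is a minimal Python sketch of how a page could be classified. It is not the script used for the numbers above: the function names, the `reference_tz` parameter, the treatment of naive timestamps as UTC, and the choice to count 8:00 a.m. sharp as business hours are all assumptions made for illustration.

```python
from datetime import datetime, timezone, tzinfo
from typing import Optional
from zoneinfo import ZoneInfo

# Window taken from the definition above: 7 p.m. to 8 a.m. Pacific Time.
OFF_HOURS_START = 19  # inclusive
OFF_HOURS_END = 8     # exclusive (assumption: 8:00 a.m. sharp is business hours)


def is_off_hours(page_time: datetime, reference_tz: Optional[tzinfo] = None) -> bool:
    """Return True if a page falls inside the off-hours window.

    The responder's own timezone is deliberately ignored; only the
    reference zone (Pacific Time by default) matters.
    """
    if page_time.tzinfo is None:
        # Assumption: timestamps without a timezone are in UTC.
        page_time = page_time.replace(tzinfo=timezone.utc)
    if reference_tz is None:
        reference_tz = ZoneInfo("America/Los_Angeles")
    local = page_time.astimezone(reference_tz)
    # The window wraps around midnight, so the two halves are OR-ed together.
    return local.hour >= OFF_HOURS_START or local.hour < OFF_HOURS_END


def count_off_hours(page_times: list[datetime],
                    reference_tz: Optional[tzinfo] = None) -> int:
    """Count how many pages in a list fall inside the off-hours window."""
    return sum(is_off_hours(t, reference_tz) for t in page_times)
```

Under this sketch, a page at 6:30 a.m. PT counts as off-hours even when the person who picks it up is on the East Coast, where it is already 9:30 a.m. That is exactly the kind of blind spot the definition has.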
We have about 20 people on-call, and maybe a quarter of them did not get an actual page, which doesn’t mean their on-call time isn’t busy. Many of our activities happen during business hours and don’t rely on alerts. Our rotations also vary significantly: on-call is split into three rotations, and the platform team was paged 36 times, the product team once (an escalation), and the telemetry rotation not at all.
By comparison, two platform engineers together account for nearly 50 percent of all pages assigned or acknowledged during that time period. Are they playing heroes? No. Mostly they were handling common pages for a newer service (SLO processing) that was getting up to scale, intercepting pages that would have woken up West Coast engineers (as they happened to be on the East Coast), or unfortunately experiencing weeks where alert rates were higher because the cloud was angry. Another engineer who wasn’t in an active on-call rotation (but was still there for relief) received only off-hour pages, as they worked from Australia for a while and intercepted off-hour alerts when they could.
Are the end-to-end alerts too sensitive? That’s the sort of gut feeling many of us had, and an easy reaction when we looked at the stats. But these alerts test the combined components of two environments, indirectly covering and alerting for over 16 cross-component hops. Other alerts are all environment-specific and often component-specific. The end-to-end alerts also tend to act as a backstop for other checks, which means we are cautious about relaxing them because they’ve historically been very helpful.
Even then, and before knowing the exact numbers, we actively worked to increase their reliability, with encouraging results (from ~20 a month in January and February, to three or four lately). This brings me to my next question: is knowing the number of alerts actually useful for decision-making?
Is this a practical metric?
Do we get something more out of this data than what is already in the form we have? Does knowing about these alerts help us direct future work in any way?
Knowing whether you have four vs. 40 alerts is like thick fire smoke, and is a good signal if you’re far away. But variations that fall within, say, five or 10 percent of the norm are not going to be informative on their own. In general, I believe that for the Honeycomb engineering team, alert count is not a super practical metric for decision-making—which doesn’t mean it doesn’t have value—for two broad reasons.
The data is messy
The first is that the data is messy. Off-hours pages are ill-defined. Time to resolution is vague and unhelpful. Alerts don’t represent the overall on-call workload accurately. It’s hard to figure out if they are related in a cascade or not. The magnitude of the incident is not obvious.
For example, we had a 14-hour long incident on December 7 that involved 10 engineers, but it generated only two pages (because the incident was so bad our notification capacity was down), whereas a flapping monitoring environment can generate half a dozen alarms in a week before a single engineer restabilizes it at a leisurely pace.
Another thing to be careful about: we don’t want to consider an escalation-related page to be a bad thing. Remember that sleep is only one of the many important variables. Stress and long hours are also there. We want to encourage escalations, and for engineers to ask for help when they feel they need it. Arguably, having as many escalations as pages would be concerning, but it also isn’t necessarily a good thing to have so few of them.
The problem is that the count doesn’t tell you whether they’re desirable escalations (people feeling comfortable asking for help) or undesirable escalations (people asking for help because there’s a chronic lack of training and support). Knowing about this requires a qualitative look, and the same is generally true of most alerting.
We have better signals
We feel the pain of bad weeks or months far earlier than we generate reports and statistics, and we usually take measures to improve things on a shorter feedback loop than that of quarterly reports.
We know about these things because engineers talk with support and customer success frequently, the SRE team has someone sitting in every on-call hand-off meeting to carry context, and people can monitor the #alerts and #ops channels in Slack to see ongoing discussions. Furthermore, we encourage engineers across all rotations to be vocal about issues, and we empower them to improve things. Honeycomb considers alert burden important enough that we can also discuss, negotiate, and prioritize that type of work adequately with the organization.
Another thing to keep in mind is that alerts only represent a fraction of on-call work. They can let you make inferences about how tired (and fed up) people are likely to be, but incompletely so. We also have various tickets, Sentry alerts, long-standing issues, scaling challenges, non-paging notifications, and personal lives to care about. I do believe that the form described in how we track on-call health captures all of this more accurately than alert count, even if it is still incomplete.
Generally, I would say that counting alerts is a lagging rather than leading indicator at Honeycomb. This means that rather than directing corrective work, it lets us look back and see trends that happened, and can reveal new insights about what occurred. It would become a more useful leading indicator only if we didn’t also keep a close eye on our everyday experience.
How do we deal with a burdensome alert volume?
I want to cover some specific things we do regardless of whether we count alerts or not. Honeycomb has operations close enough to its core values that the following are all true:
- Being on-call means you are not expected to do project work; if nothing interrupts you, you are expected to help improve on-call and tackle annoying things as you see fit.
- If you’re paged at night, you are encouraged to take time off the day after or on Friday (for instance), depending on how you feel.
- We earnestly try to lower alert volume, reduce flappy alerts, and lower false-positive rates, and have ongoing experiments to measure and address these.
- We can decide to stop servicing some alerts if we feel they can’t be resolved within the scope of on-call, and go through something dubbed “the SLO protocol” where we trigger a discussion with support, engineering, and product teams to decide how we’re going to address things and communicate the decisions made inside and outside the organization.
- None of our alerts or SLOs are set in stone. They’re all negotiable, and people are free to tweak them or suggest modifications if we think it gives us a better signal.
- Participation in on-call is not mandatory (although we value it); we all understand special circumstances that may arise as well.
- Alerts captured within components that are shared by multiple teams can be reassigned to various rotations depending on recent alert volume, in an attempt to send the right signal to the right people.
We can’t promise people will never be paged at night since we don’t have an around-the-clock staffing situation, but we do make an attempt at restricting alert volume. This can be done by making our stacks more robust, but also through better alert design.
For example, most of our SLOs come with two layers of burn alerts: 24 hours and four hours. The four-hour burn alert pages someone and asks them to take action to resolve things as soon as possible. The 24-hour alert, however, only pings on Slack, and the expectation is that you shouldn’t have to wake up for this sort of slower burn. Dealing with it the morning after is fine.
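The two-layer idea can be sketched in a few lines of Python. This is an illustration and not our actual alert configuration: it assumes a burn alert fires when the error budget is projected to run out within the alert’s window, and the names `Action`, `BurnAlert`, `BURN_ALERTS`, and `route_burn` are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    PAGE = "page"    # wake someone up
    SLACK = "slack"  # ping a channel, handle it the morning after
    NONE = "none"    # budget is not burning fast enough to notify anyone


@dataclass(frozen=True)
class BurnAlert:
    window_hours: float  # fire if the budget is projected to run out within this window
    action: Action


# The two layers described above: four hours pages, 24 hours pings Slack.
BURN_ALERTS = (
    BurnAlert(window_hours=4, action=Action.PAGE),
    BurnAlert(window_hours=24, action=Action.SLACK),
)


def route_burn(hours_until_exhausted: float) -> Action:
    """Pick the most urgent action whose window covers the projected exhaustion time."""
    # Check the tightest window first so a fast burn pages instead of only pinging Slack.
    for alert in sorted(BURN_ALERTS, key=lambda a: a.window_hours):
        if hours_until_exhausted <= alert.window_hours:
            return alert.action
    return Action.NONE
```

With these assumed thresholds, a budget projected to run out in two hours pages someone, one projected to run out in 12 hours only shows up in Slack, and one with two days of headroom stays quiet.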
The other thing we try to be careful about is to always keep the psychological safety of knowing you can escalate your alert. We can’t turn a call rotation into a sort of utilitarian fever dream of “only waking up one person to preserve the sleep of everyone else.” We treat being on-call as a team-wide and even organization-wide accountability affair. If you find yourself in a situation you don’t know how to handle, that’s just bad luck and it’s fine to escalate, wake people up, and fix things.
It’s then on us as a team to figure out what we can do to lower the probability that this happens again, whether through preparation and training, automation, processes, or adjusting how we develop and deliver software. My general objective as an SRE is to constantly seek ways to shorten feedback loops, adapt to them, and come to agreements about values and trade-offs when challenges arise.
The moment I showed initial alert reports internally, people were interested in them and said they’d like to see them tracked over time. From that perspective, and given the current situation, alert counts can be useful as an anchor. For example, if people feel there are a lot of pages, but the count is steadily going down, where does the feeling come from?
I, however, am unlikely to see them as useful in directing work. That being said, it’s an interesting signal, and if you do have other ones to suggest, let us know in our Slack Pollinators community or by Tweeting us.