We’ve posted before about how engineers on call at Honeycomb aren’t expected to do project work, and that whenever they’re not dealing with interruptions, they’re free to work on whatever will make the on-call experience better.
However, all of our engineering rotations rely on hand-off meetings where engineers update the Slack groups to list everyone who’s on call. During my last shift, a small problem kept causing friction for some of our incident management automation. We have some teams who share what is essentially a pager rotation for incident response, but do not otherwise have hand-off meetings.
For these people, the Slack alias is often outdated. This means people can’t necessarily reach them via the generic aliases—but more importantly, that our incident management tooling (via Jeli and PagerDuty) cannot automatically invite the right people to incident channels. This is particularly bad when the incident channel is created in private mode. The end result is some people get paged and are told to go on Slack, but they can’t see why.
We thought the problem would be solved easily by synchronizing schedules. It’s a minor pain—cumbersome, but it can’t be that hard, right? In fact, a solution should already exist.
We did a quick check, and it turns out that most of the solutions out there are somewhat convoluted and require infrastructure. I’ve talked to some friends who were willing to pay their various providers for this. A colleague mentioned a prior workplace that had automation to do this (among many other things), but noted that it cost thousands of dollars on an ongoing basis. Other solutions we found relied on infrastructure (such as databases) that we did not see as required and did not wish to operate.
All of this seemed more work than spending a few hours here and there reading API docs and getting things going.
A few days later
It took some time and coordination to get the right API keys from the right people, but all the pieces came together:
- Have a script with a declaration of all the on-call Slack handles, with lists of their respective PagerDuty rotations (the script is the database)
- For each rotation, ask PagerDuty’s API who’s on call right now (which gives a list of users)
- Fetch each user’s email address from the same API
- De-duplicate records
- Go to Slack’s API and find all user IDs by looking up their email
- Find all the Slack group IDs for each of the handles (via the API)
- Update each group ID with the list of user IDs in it
Take that script, shove it in a container, and run it every hour through your mechanism of choice (we picked a Kubernetes cronjob because we run some of these already).
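To make the steps above a bit more concrete, here is a minimal sketch of the core loop. It is not the script from the gist; the names (`ROTATIONS`, `sync_groups`, and the four callables it takes) and the example handles and schedule IDs are all made up for illustration. The callables stand in for the API calls: as far as I know, PagerDuty's on-call and user endpoints would back `emails_on_call`, and Slack's `users.lookupByEmail`, `usergroups.list`, and `usergroups.users.update` methods would back the other three, but check the current API docs before relying on that.

```python
from typing import Callable, Dict, Iterable, List, Optional

# Hypothetical declaration: Slack group handle -> PagerDuty schedule IDs.
# This dict plays the role of "the script is the database".
ROTATIONS: Dict[str, List[str]] = {
    "oncall-platform": ["PABC123"],
    "oncall-incident": ["PABC123", "PXYZ789"],
}


def dedupe(items: Iterable[str]) -> List[str]:
    """Drop duplicates while keeping first-seen order."""
    return list(dict.fromkeys(items))


def sync_groups(
    rotations: Dict[str, List[str]],
    emails_on_call: Callable[[str], List[str]],        # schedule ID -> emails
    slack_user_id: Callable[[str], Optional[str]],     # email -> Slack user ID
    slack_group_id: Callable[[str], str],              # handle -> Slack group ID
    set_group_members: Callable[[str, List[str]], None],
) -> Dict[str, List[str]]:
    """Update each Slack group with whoever is on call right now.

    Returns the handle -> user IDs mapping that was written.
    """
    synced: Dict[str, List[str]] = {}
    for handle, schedule_ids in rotations.items():
        # Ask every rotation behind this handle who is on call, then
        # de-duplicate people who appear in more than one rotation.
        emails = dedupe(
            email for sid in schedule_ids for email in emails_on_call(sid)
        )
        # Translate emails to Slack user IDs, skipping anyone not found.
        user_ids = [
            uid for uid in (slack_user_id(e) for e in emails) if uid is not None
        ]
        if not user_ids:
            # Assumption: leave the group untouched rather than emptying it.
            continue
        set_group_members(slack_group_id(handle), user_ids)
        synced[handle] = user_ids
    return synced
```

Passing the API calls in as functions is just a choice for this sketch: it keeps the loop easy to try out with fake data before handing it real API keys.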
Eliminating friction, even if it’s small, is worth a lot
When the cronjob ran while most teams’ on-call rotations were cycling, there was an immediate outpouring of love from engineers across the organization. In hindsight, I thought to myself, “Why the hell didn’t any of us spend time fixing this garbage before?”
It surprised me because that was the most direct, instant amount of positive feedback I had received at work in years, and it’s not like I hadn’t worked on useful or important stuff before. I imagine part of it is that maintaining these groups is obviously toil. It’s necessary, frequent, annoying not to do (particularly when someone covers you temporarily), and unpleasant. But it’s also sort of minor (big things page us), and it’s low-enough effort that no one could really justify sitting down and taking the time to fix it. It falls into that gap of “annoying, but never enough to prioritize.” Turns out, free time when on call was a good opportunity for that.
I was surprised enough that I talked about this reaction to my solution in other communities, and two people from different organizations immediately reached out to me asking how we did it, because they were having the same problem and were considering purchasing a product to do it for them. While it’s not my intent to undercut service providers who could monetize the gap between these APIs, the hundred or so lines of non-comment code required seem more reasonable to share freely in order to make people’s on-call a bit nicer.
Because of this, we decided to make the Python script a bit more generic and open-source it: https://gist.github.com/ferd/19b0207bfc10173559e523c049db51db
Simply declare your Slack handles and rotations, follow the instructions in comments about passing in API keys, and your rotations are going to be in sync too.
Was this useful to you? Let us know by dropping a line in Pollinators, our Slack community.