OnCallogy SessionsBy Fred Hebert | Last modified on May 23, 2022
Being on call is challenging. It’s signing up to be operating complex services in a totally interruptible manner, at all hours of the day or night, with limited context.
It’s therefore critical to have proper on-call on-boarding procedures, offer continuous training sessions, and continuously improve documentation. We also need to make sure people feel safe by providing ways to reduce their stress, and make room for questions to surface all sorts of uncertainties around our operations.
This is why we started doing OnCallogy Sessions (a name pitched to me by an online acquaintance named Dave). OnCallogy Sessions are meetings that happen every week on Monday—one day before we hand off the pager from rotation to rotation—and that are always optional. They're mostly used to share information about on-call practices, either rooted in the practical (specific services or procedures) or a bit more abstract (e.g., "How would we feel about a chat-only experts on-call rotation who are just there to give advice?")
There is no hard recipe—whatever we think feels interesting, useful, or timely may get its own week. We’re always open to suggestions or other people running the sessions, and it’s not a problem whether it’s 12 people participating one week and just 2 the week after. Topics are shared ahead of time, and people join if they think it’ll be worth their time, regardless of whether they actually are on a call rotation or just curious about how we work.
We do, however, have some flavors of sessions that have been recurring since the six months or so we have been running OnCallogy. Here are a few patterns.
These sessions are the first and most numerous ones we’ve run. Each session deals with a specific part of our stack (often chosen because someone leaving the call rotation or entering it may have said it made them nervous) or practice, such as being an Incident Commander, for which I provide an agenda such as:
- A “fact sheet” and background information, which is a quick overview relevant to the service or practice—including main data flows, where to find observability data or other documentation, or what various organizations may do in a similar situation—and is intended to be useful even if you didn’t attend the session
- A walkthrough of specific or challenging operations that happened recently or we know may happen sometimes, and if possible, done live and with participation from attendees
- Time for Q&A and discussion
Interruptions and questions are encouraged at any time.
These sessions have proved useful, particularly when a subject matter expert is present to frame and correct the content provided by the presenter (often myself, though not always). I've already seen links to these sessions shared a few times in Slack weeks after they were run when people were looking for some background information.
Planning on-call improvements
Before I make changes like splitting up call rotations, re-working the onboarding experience, or trying to come up with better ways to measure on-call health, I tend to pick an OnCallogy week where nobody had questions or suggestions and turn them into a free discussion or a sort of focus group.
These still come with an agenda:
- A quick review of the current state of this practice within the organization
- A high-level discussion to establish the direction we want to go
- A workshop/brainstorm of ideas we can do to improve things
This sort of session is particularly something I used coming back from holidays or near the start of quarters when I feel there’s time to kick off a new set of improvements, and is specifically intended to make sure whatever idea I have is properly rooted in what we do and what our engineers need.
This type of session is useful when we have had a noisy on-call week, either around high levels of false alarms or when an intriguing or unclear alert popped up. The goal for these is to work as a small group to discuss what it is we think is going on, what would allow us to better communicate it, and then fix it on at least one example during the session.
We've run a couple of these so far—below are two examples.
- A rare alarm that comes from a failure in a deeper data management job. The failure can have unimportant false positives, but sometimes be a sign that something major is broken and that entire partitions may eventually run out of space. During the session, we made sure to identify these possibilities, the expected reaction, things to check, and then provided a link to relevant documentation in the alert so that someone who has never seen this before can understand the type of situation they may be experiencing.
- An alert we had was pretty flappy due to weirder data volumes in production being sent to our internal platform in such high volumes that it created occasional latency. We dug into how to diagnose all of this, saved all our steps in queries as a demo, and tweaked internal tolerance parameters and scaling values to cut the false positive rate by 30-50%.
These hands-on sessions have the advantage of requiring less preparation, instantly improving the operational experience, and being very interactive for a small group. Note though that they may be rather difficult to make interesting for a larger and broader audience.
Free discussions about on-call practices
This is an experimental session we ran recently and that I really liked. While all other sessions are ideally recorded, this one remains "live only," with some anonymous notes left behind, but nothing else. These sessions are based on a question that is intended to be ambiguous and a bit challenging, and to dive together into our perceptions.
The first one I chose was one that I had in mind for a while: Should you visibly be frustrated when handling a call/outage/incident that is difficult or should we keep everything under control?
The objective of the session was to dig into that question, figure out which cases are okay or not, what are some circumstances that change the definition, and just overall come up with a sort of definition about what we expect both as a person on call and as a person looking at the call go.
This is intended to be a non-judgmental discussion. What we are aiming for is discussing what we think the impact may be, not to actually tell people how they may or may not feel. The reflection is more important than the answer.
I provided some background to prime the session with links, putting in opposing ideas that could all be valid, and launched the discussion. There's no real structure, just a bunch of people sharing their various experiences.
I loved this session because it quickly became rich with great ideas, subtle details to look for in behavior, call-outs for specific dynamics, ways to ensure healthy relationships post-incident, and so on.
Specifically for this one, we had discussions about trust, psychological safety, people sharing experiences about group dynamics, whether to kick someone out of an incident or just asking them to take a breather, people seeing emotions as useful signals, or ways they dealt with re-framing blame and trust. To me it felt like I was getting insights from decades of experience made available just like that.
OnCallogy in a nutshell
The core idea here is to make room and create structured time to improve people’s on-call experience collaboratively.
I want it to be a habit for us to have dedicated time specifically for such discussions so there is always room for it outside of any other project work. If something odd happens that would need a quick distribution of information because we think it's relevant to people on call, we can leverage it for that. If there's nothing and we just want to find a friendlier way to mark deprecated documents, we do that.
The time is there for the taking, it's ours, and it's important to have room in the schedule to make sure being on call continually improves.
Do you have any equivalents to OnCallogy sessions? Want to throw in some interesting discussions you've had in a similar vein? Send us a tweet! Or check out this blog to learn how you can make going on call less painful at your org.
“Why Are My Tests So Slow?” A List of Likely Suspects, Anti-Patterns, and Unresolved Personal Trauma
If you get CI/CD right, a lot of other critical functions, behaviors, and intuitions align to be comfortably successful and correct with minimal effort. If...
Here's what some of our customers had to say about Honeycomb features that they found value in right away....