Going On Call for the First Time
By Liz Fong-Jones | Last modified on June 16, 2022
Dear Miss O11y,
I've never been on call before, and I'm not sure what to expect, or how I can best prepare for it. Will I need to upend my life just in case the pager goes off? And how should I best cope with getting paged? I've read Charity's piece on the opposite problem of wanting to stop being on call, but it didn't quite answer my question.
Worried about Wake-ups
Going on call for the first time definitely can be nerve-wracking! Hopefully you have the support of your team, which can help offset some of the uncertainty and give you a smoother on-ramp onto the process of service ownership. Ensure that your team has given you written guidelines on the expectations of on-call. Have they allowed you to clear your plate of project work during the weeks you're on call? And is there a documented response SLA (e.g., 30 minutes to get from receiving a page to being at keys)?
These policies help you switch into response mode when you need to, while still letting you live your life and spend quality time with your pets or children. My colleague Fred Hebert, an SRE here at Honeycomb who writes a series about being on call, does a deep dive on how managers and teams can create effective on-call policies. Beyond policy, though, you'll want to build up some comfort with what to do if the pager does go off.
Everyone starts somewhere
The first thing that I trust you've had a chance to do is business hours on-call. A majority of pages will be caused by either a traffic spike, or by a configuration or code change—both of which primarily tend to happen during business hours. Thus, it's a great way to get familiar with the kinds of issues that go bump in the middle of the night, without that "middle of night" part!
The next piece I'd suggest is to practice with a buddy. As you progress from workday on-calls to weekday evening on-calls to weekend on-calls, make sure that you're either "shadowing" or "reverse shadowing" on-call—getting a copy of the pages with someone else acting as primary on call, pairing on the issues, or having your reverse shadow get a copy of your pages so they can cross-check your work.
If your service doesn't receive a high enough volume of pages to train up this way, the next step is to use game days/wheel of misfortune/chaos engineering to rehearse incident response and practice common on-call workflows and runbooks. These exercises are done while the team is in the office, and allow you and your teammates to experience an incident or mitigation operation with an emergency stop button (or purely as a tabletop exercise).
Practice, practice, practice
This is also a great opportunity to refresh your skills with your observability tool of choice (we're, of course, biased towards Honeycomb). The last time you should be learning how to use your observability tool is at 2 a.m. with the pager blaring; instead, practice understanding what your system looks like when it's functioning normally at steady-state, and ensure that you know how to slice and dice to formulate hypotheses and identify outliers in your data.
Remember that you don't have to memorize every dashboard, as long as you know either how to search for helpful timeseries (if you're using a timeseries tool), or to construct queries ad-hoc if you're using a proper observability tool like Honeycomb. If you’re at a loss, your runbooks or your team's saved queries may suggest starting points.
Now, you’re finally ready to do on-call during a set of hours you haven't taken before. But remember, you're not flying solo. On-call is best done with a primary and secondary on-call teammate, so you always have someone else to escalate to in the event that you need someone to cover the pager for a few hours, an incident exceeds your knowledge, or you need an extra set of hands (or need to, in the worst case, separate incident command from operations responsibilities during a major incident). Since you’re new to being on call, make sure you aren't scheduled alongside someone else who is also brand new. For the first few times, you should be with someone who is experienced until you get the hang of things.
Furthermore, know that it's always better to escalate to someone "unnecessarily" than to get in over your head. Make sure that you know how to get in touch with your counterpart and that you feel comfortable doing so. It's common to see escalation as a failure, but it isn't one. It's an opportunity for you to learn something you didn't know before.
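If your team keeps its rotation somewhere scriptable, the pairing advice above is easy to check automatically. Here's a minimal sketch in Python, assuming a hypothetical rotation made up of primary/secondary pairs and a hand-maintained set of people who are new to on-call. The names, the Shift type, and the risky_shifts helper are all invented for illustration and don't come from any particular paging tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shift:
    """One on-call shift with a primary and a secondary responder (hypothetical shape)."""
    label: str
    primary: str
    secondary: str


def risky_shifts(shifts, newcomers):
    """Return the shifts where both responders are new to on-call."""
    return [
        s for s in shifts
        if s.primary in newcomers and s.secondary in newcomers
    ]


# Made-up rotation: avery and casey are new, blake is experienced.
shifts = [
    Shift("week-1", primary="avery", secondary="blake"),
    Shift("week-2", primary="avery", secondary="casey"),
]
newcomers = {"avery", "casey"}

flagged = risky_shifts(shifts, newcomers)
for s in flagged:
    print(f"{s.label}: pair {s.primary} or {s.secondary} with an experienced teammate")
```

A check like this only catches the scheduling problem. It's no substitute for knowing how to reach your secondary and feeling comfortable doing so.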
If you're interested in learning more about the subject of on-call, the on-call chapter of the Google SRE book has some great advice. With luck, the queries will flow and pagers will be silent—but if an incident happens, hopefully this leaves you prepared.
If you want to chat more about being on call, feel free to schedule 1:1 office hours with me. We can even run practice drills using Honeycomb! It's free to sign up.