Conference Talk

Survival Guide: What I Learned From Putting 200 Developers on Call

 

Transcript

Alina Anderson [Senior TPM, Site Reliability Engineering | Outreach]:

Hello. I hope you’re having a great time at o11ycon. I am honored you have chosen to spend your time with me today. I’m excited to share what I learned putting 200 developers on call. First, a little about me. My name is Alina; I am a Senior Technical Program Manager at Outreach. You can find me online in the usual places.

Let’s be honest, 2020 was truly a dumpster fire. I chose to commemorate this by deciding at the very last minute to learn to cross-stitch Christmas gifts for my family. As long as you don’t look at the back, I totally nailed it. 

In 2021, I’m also trying something new: giving this talk. I hope you can find at least one idea that gives you hope or inspiration for whatever situation you may find yourself in today. Feel free to reach out and say hello in the Honeycomb Slack community. I’ll be hanging out this week.

Let’s jump right into 10 lessons in 10 minutes.

An on-call is a human, not a robot. Humans are complicated, messy, and susceptible to emotional, physical, or social influences. Each on-call engineer handles stress and learning differently. It’s super important to accept this natural variance. It is unrealistic to expect humans to perform manual tasks perfectly and consistently. Humans require compassion and connection. Service level objectives, or SLOs, are an easily understandable language that helps humans operate systems more effectively.

You’re not alone. Leverage relationships. Most, if not all, of us at o11ycon today are grappling with new challenges at work. When it comes to reliability, we’re navigating a complex ecosystem within our organizations that often requires the orchestration of people, processes, and technologies. I want to encourage you to leverage your network of vendor relationships. You and your vendors share the same goal: you want your team and business to be successful. Partner with your vendors to conduct engineer training sessions, host office hours, or even facilitate strategy deep dives.

Through a conference like this, a Slack interest group, or a vendor intro, take the opportunity to connect with leaders at other companies and share ideas. And sometimes, you just need to remember you’re not alone in this.

The obvious is not obvious. A close colleague has challenged me to fully accept this lesson. It is now a standing joke between us when anyone says, “But that was obvious.” The obvious is truly not obvious. What seems plain as day to you can be entirely overlooked by the person next to you, through no fault of theirs. Just ask the question or say the thing. Especially under pressure in incident situations, our brains are in overdrive and a simple reminder can help us all. All it takes is one brave person to think something might be obvious but to ask the question anyway.

Like asking an incident commander, “What is the customer impact here?” In this scenario, it could be the reminder everyone needs that it’s already been a half hour, and we haven’t even answered the single most important question.

Make learning and growth visible. Building and operating reliable cloud services is an exhilarating and, at times, head-spinning journey of continuous learning. Learning and growth is a practice, a commitment to practice, that must be visible across teams and leaders. This can look like an open-door policy for participating in post-mortems, allowing interested engineers to contribute to a given retrospective. It can be highly valuable to create a forum for “lessons learned” authors to share their findings and insights with the broader organization.

Cherish the opportunity to listen. Change is hard. Sometimes, a complaint is just a request to be heard or seen, not something to fix that results in an action item. On-call rotations can create stress and tension for anyone, especially the first time. Cherish the opportunity to listen for what is causing fear, uncertainty, friction, or struggle. And validate the human who is taking the time to express it. We can’t always solve all of our on-call problems, but we can listen. And sometimes, that is enough.

Say what they need to hear. Leadership needs to hear the truth. They need information to make decisions. They are charting our course ahead. Engineers need to hear what to expect and what is expected of them. Sometimes, we wince a little, knowing it will be a little uncomfortable. It helps no one when we downplay or soften something, and it never gets fixed. You can use SLOs as a tool to speak the truth and provide direct visibility to leadership.

Embrace the art of the broken record. When you think you’ve said it too many times, say it one more time. Say it verbally. Say it in writing. Say it in Slack. Say it in Zoom. Say it in the org-wide meeting. Say it in the wiki. Just keep saying it. Embrace being an annoying broken record about things that matter. Everyone is facing information overload from all directions. Assume no one read it the first or the second time. Ask, “What is the customer impact?” every single time until the team develops the muscle to consider this before you even have to ask.


Praise proactive homework. It can be difficult to gain visibility into work that prevents issues, so the firefighting often takes center stage. Cultivate opportunities to recognize proactive work. It can look like quietly working to make the next on-call shift more successful by implementing a weekly hand-off mechanism, methodically updating SLO docs, eliminating noisy alerts that weren’t actionable, or even running a game day for the team. Recognizing proactive work instead of hero worship builds a sustainable engineering culture.

Assume training is never enough. Great knowledge base, great diagram, onboarding session, they’re ready, right? No, not even close. Assume no one read your doc and they were texting during Zoom training. Or they forgot everything since their last on-call shift eight weeks ago. DevOps teams and orgs are moving incredibly fast. It is impossible to keep up with all the documentation and training. Continuous feedback loops weekly or daily via on-call hand-offs can keep the whole team up-to-date on new changes to be aware of.

Trust your struggle. Some days, it can feel too hard. And other days, it can feel hopeful. This is the human experience. The human struggle. At o11ycon, we are part of an evolution within the software industry. So many organizations are faced with the same challenges. We can learn from each other. We are defining the next generation of tools and corporate culture. Know that you are doing some hard stuff.

In 10 minutes, we walked through 10 lessons I learned putting 200 developers on call. Thank you for sharing this time with me. I hope you’re taking at least one idea that will start a conversation when you get back to your team. And thank you to all the awesome humans that made this possible. If you have feedback on how I can make my presentations more accessible or inclusive, I would love to hear from you. And hope to meet you in Slack.

Corey Quinn [Chief Cloud Economist | The Duckbill Group]:

Thank you very much. Before we dive in, just a reminder to the crowd that we will be taking questions in the obnoxiously longly named Slack channel. The first question comes from me because that’s why I put myself in these positions where I can ask the important stuff. And it comes down to the baseline question of, when you put people on call, who until now have not been on call, how do you handle the, I guess I want to say perceived — it’s not, it’s real — shift in expectations where it used to be a 9:00 to 5:00 job? Or these are developers, let’s not kid ourselves, often, it’s 10:00 to 4:00 or 10:00 to midnight depending on what’s been broken. And shift that over to, “Your sleep schedule? Yeah, fun story.” It feels, for folks who have not been in an on-call environment previously or took the role because, at the time, it wasn’t on call, that it’s a shifting of expectations. How do you handle that?

Alina Anderson: 

Yeah, it’s absolutely a shift. And a shift not only for the engineer but for their family. For their lifestyle. For their time away from work. And I think that it is — you’re not going to be successful unless leadership is aligned on what that means to the culture and what the messaging is to the engineers. And I think the reality is, you can’t make that shift without a certain amount of attrition. Some people are going to take — it’s a fork in the road — and some people are going to say, hey, this isn’t for me.

I think sometimes there’s a lot of frank conversations with leadership. And as a TPM, sometimes what I can do is anonymize feedback and present that in a way that’s like, “Hey, we’re getting feedback around this set of challenges. What is our plan to address it?”

Corey Quinn:

I think it’s also very fair if someone decides, well, I signed up because there’s no on-call. Now there is, the expectations changed, I’m going to work somewhere else. And if I were to hear someone tell me that story in a job interview that I’m considering hiring them for, it wouldn’t be a ding against them, even if it were for an on-call role. 

Alina Anderson:

Absolutely.

Corey Quinn:

Because changing expectations versus having an expectation is slightly different. How do you view the approach of compensating people for being on call? In the abstract, I love it. On the other hand, there’s economics we’re trying to maintain here.

Alina Anderson:

Yeah, it’s a great question. And this also comes up. And I think HR has to be enrolled in what’s going on. They have to understand what the kind of company perspective is on this. Some organizations out there offer different types of compensation. Other organizations do not. And I think a more successful transition means that engineering leadership and HR are on the same page about what the messaging is for folks, whether it’s device-related, time-related, certain regional areas around the globe actually have laws around on-call and compensation.

Corey Quinn:

I want to disclaim my own bias here. My first on-call job was so horrific (I would, at this point, say it qualifies as abusive) that it was ridiculous. There was a 15-minute SLA, and two people left the team. One transferred and the other person goes, well, I’m managing the team so I’m always on call so I’m not part of the rotation. It went from one week out of four to a 50% rotation, at which point, you considered it was the best effort, and that was during the 2008 financial crisis. And that doesn’t set me in the direction of, I don’t want to be on-call because it’s awful and feels like it can ruin your quality of life. One thing that I think companies get wrong is, if you’re woken up by something, you should be empowered to fix that thing.

Alina Anderson:

Absolutely.

Corey Quinn:

And sometimes that fix is turning off the alert that would have woken you up. Because if all I have to do is open a ticket and wait for someone else to fix it, great, make them on-call! Which is the point of what you’re getting at. This old world of just the ops people on call and can tag in people as needed — it seems though, improving software quality, correct me if I’m wrong, feels like a shortcut to this because all you’re doing is making people who write the software have to field the consequences of the software not doing what it’s supposed to do in a more visceral way.


Alina Anderson:

Yeah, I absolutely agree. And almost not transitioning from the culture of let’s make it boring and let’s all work, you know, normal lives, and we don’t have heroes that are running in to save the day or we don’t have, you know, there’s a certain, I mean, I consider myself an ops person. And there’s a certain fun, excitement, adrenaline when you guys are working on a problem. 

I think there’s a lot of interesting human factors involved. But to your point around fixing the problem, like every on-call hand-off or every week, look at — and PagerDuty is one tool that allows you to see when you got paged — how many are after-hours and go through every single one and it’s like, was this actionable? Yes or no? Can this be automated, yes or no? Did someone really have to get up for this, yes or no? And being really vigilant about that. And I think the team culture, not just saying, “Oh, it’s okay, just eat the pain and suck it up.” Ultimately, it isn’t a place most of us want to work.
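The hand-off review described here (how many pages came after hours, and for each one: actionable, automatable, did a human really have to get up) could be tallied with something like the minimal sketch below. The `Page` record, its field names, and the 9-to-5 weekday definition of working hours are assumptions made for illustration; this is not a PagerDuty data model or API.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass
class Page:
    """One page from an on-call shift (hypothetical record, not a PagerDuty type)."""
    fired_at: datetime
    actionable: bool     # Was this actionable?
    automatable: bool    # Can this be automated?
    needed_human: bool   # Did someone really have to get up for this?


def is_after_hours(ts, start=time(9, 0), end=time(17, 0)):
    """True on weekends or outside the assumed 9:00-17:00 working day."""
    return ts.weekday() >= 5 or not (start <= ts.time() < end)


def review(pages):
    """Summarize a shift's pages for the hand-off conversation."""
    after_hours = [p for p in pages if is_after_hours(p.fired_at)]
    return {
        "total": len(pages),
        "after_hours": len(after_hours),
        # The three yes/no questions, counted over after-hours pages only.
        "not_actionable": sum(not p.actionable for p in after_hours),
        "automatable": sum(p.automatable for p in after_hours),
        "no_human_needed": sum(not p.needed_human for p in after_hours),
    }


# Made-up example shift: a 2 a.m. weekday page, a mid-afternoon page, a Saturday page.
pages = [
    Page(datetime(2021, 5, 3, 2, 15), actionable=False, automatable=False, needed_human=False),
    Page(datetime(2021, 5, 4, 14, 0), actionable=True, automatable=False, needed_human=True),
    Page(datetime(2021, 5, 8, 11, 30), actionable=True, automatable=True, needed_human=False),
]
summary = review(pages)
print(summary)
```

Any after-hours page counted under `not_actionable` or `no_human_needed` is a candidate for tuning or removing the alert before the next shift.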

Corey Quinn:

Question for me in this one. Did you find challenges in driving for that culture of on-call when, presumably, please correct me if I’m wrong, as a senior TPM, you were not part of the rotation itself? Doesn’t it sound like, “Yeah, you assholes need to wake up?” By the way, wake up asshole is the PagerDuty model, they don’t realize it yet. I’ll be sleeping soundly until morning when I go to work. The people that seem to be forcing you to take an on-call rotation are invariably never the people on the rotation themselves.

Alina Anderson:

Yeah, I agree that pattern is not the way you want to go. And despite people suggesting that I should handle it that way, what we chose to do was, we rolled out an incident commander program. All of the engineering leadership was on call. So they had a 24-hour primary, 24-hour secondary on-call for incident commander. So CTO, directors, VPs were all participating in this rotation. And I was shadowing all the incident commanders: every time we declared an incident, I was paged in as well. This was essentially to observe how we were doing, where the gaps were, and what we needed to address. Until we got to a point where things were running relatively smoothly, and I was no longer needed. And could just kind of step back.

I think anyone that’s driving, there isn’t a world in which people are telling you need to do this, that don’t have skin in the game. It just, I think it comes down to values, and being directed from the ivory tower to impact your life doesn’t feel very good.

Corey Quinn:

Ben asks, if on-call is a tire fire — invariably it is — do you fix it before having devs so they can come into their on-call experience knowing that it doesn’t have to suck? Or bring them into the rotation to help calm the firestorm of burning pain?

Alina Anderson:

I think, a strong engineer, it’s bringing visibility because sometimes what could happen is, after that first on-call, you write a summary of everything that you had. You have a come to Jesus with the product manager and say, look, we are stopping all feature work for two sprints because we’re going to go fix these things and introduce this automation. Sometimes that’s just what you have to do. It’s like surfacing that information and then making a stand that this is not sustainable. It’s actually impacting our velocity. Because at the end of the day, the business wants revenue-driving features and stuff. So if you have an on-call like that, that is going to be a direct conflict to being able to produce that value.

Corey Quinn:

In the noblest tradition of company earnings calls, when analysts are called upon to ask a single question and they come back with two, Sean has two questions: How do we do this messaging in a positive manner without being labeled by the powers-that-be as the squeaky wheel? In a start-up, no one wants to hear this; they just want the duct tape.

Alina Anderson:

Yeah, I think in my experience, abstracting it away from an individual person’s opinion and making it like a guild or a working group or a kind of like a community-driven folk that produces a summary or a report or an insight, and that can bring all the same points, but you no longer have the objection that it’s Bob’s opinion and Bob is always opinionated. You make it more of like a, I guess a whitepaper in a sense or like a published document of a current state of things. And then a proposal of how we mitigate these things.

Corey Quinn:

And the second part of the question — because of course, no one follows instructions — more people are doing it too. I shouldn’t have said anything. How do we know when it’s time to give up? Do we ever? I give up when someone else offers to pay me more and my solution out is, “Here is my two-week notice.” But that is a defeatist attitude.

Alina Anderson:

Yeah and I’ve seen some cool writing on the internet these days about when giving up is actually a positive thing. I do think if your organization is going through this, everyone should soul search and figure out what are my values? Do I enjoy this work? Am I invested in this? Is this aligned with my career aspirations? And sometimes it’s not. 


And instead of suffering for the next six months to a year, it might make sense to make a shift. In terms of giving up on the overall initiative, I think at the end of the day, I think you can just boil it down to, “Is our company fulfilling the promise we are making to our customers?” That’s sort of a yes or no. If we have said we are going to be available for three 9s, are we meeting that or not? And if you are meeting that, and you don’t have, you know, I think it comes back to are you getting the velocity that you want, yes or no? Are you meeting the promises, yes or no? And if those things are true, I think you have to dig a little bit more into what problem are we trying to solve.
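To make the "three 9s, yes or no" question concrete: a 99.9% availability target leaves a small downtime budget, roughly 43 minutes in a 30-day window. The sketch below works through that arithmetic; the function names and the 30-day window are assumptions chosen for illustration, not something stated in the talk.

```python
def allowed_downtime_minutes(availability_pct, window_days):
    """Minutes of downtime permitted in the window at the given availability target."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - availability_pct / 100)


def meeting_promise(availability_pct, window_days, observed_downtime_minutes):
    """The yes/no answer: did observed downtime stay within the budget?"""
    budget = allowed_downtime_minutes(availability_pct, window_days)
    return observed_downtime_minutes <= budget


budget = allowed_downtime_minutes(99.9, 30)
print(f"Three nines over 30 days allows about {budget:.1f} minutes of downtime")
print("Promise met with 20 minutes down:", meeting_promise(99.9, 30, 20))
print("Promise met with 60 minutes down:", meeting_promise(99.9, 30, 60))
```

Framing the target as a budget of minutes gives leadership and engineers the same number to look at when deciding whether reliability work should displace feature work.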

Corey Quinn:

Next, we have from Brandon, how do you approach a long and growing backlog of on-call sourced bugs? There’s always time to patch something, just get it back up, never time to fix it properly.

Alina Anderson:

Yeah, I think what I’ve seen effective there is having a pre-agreed upon basic priority matrix around the severity and you guys can debate it indefinitely. Priority versus severity versus urgency. And at the end of the day, there’s going to be a bucket of things that you just don’t fix. And I think sometimes there’s ruthless prioritization in there and maybe you just publish a known issues list. And it has to reach a certain threshold of a customer report or two customer reports before it comes off the known issue and becomes something you take action on. But having some kind of systemic prioritization, and it’s super uncomfortable to say we’re not going to fix these things, right? That’s not — it’s not enjoyable, but it’s the reality.
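One way to picture the known-issues threshold: count customer reports per issue and only promote an issue to actionable work once it crosses an agreed number. This is a minimal sketch; the issue identifiers and the threshold of two reports are hypothetical values, and a real priority matrix would also weigh severity.

```python
from collections import Counter


def triage(reports, threshold=2):
    """Split reported issues into (take_action, known_issues) by report count.

    reports is a list of issue identifiers, one entry per customer report.
    """
    counts = Counter(reports)
    take_action = sorted(issue for issue, n in counts.items() if n >= threshold)
    known_issues = sorted(issue for issue, n in counts.items() if n < threshold)
    return take_action, known_issues


# Hypothetical report stream: one issue reported twice, another only once.
take_action, known_issues = triage(["login-timeout", "export-csv", "login-timeout"])
print("Take action on:", take_action)
print("Stays on the known issues list:", known_issues)
```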

Corey Quinn:

Not enjoyable but the reality. That feels like work in many respects. Slasher asks, two questions at once but I’ll try to condense them into one because, you know, why not. Engagement is what we want from audience members here. Feel free to ask questions. I’m a fan of Dominica DeGrandis’ Making Work Visible, but today is the first time I’ve heard making learning visible, and I love the phrase. What are ways you found to make learning visible, and do you use different ways tailored to different audiences?

Alina Anderson:

I think on this specific topic, post-mortems are a really key and rich opportunity for learning. And I think part of that is calling out, you know, writing a post-mortem for someone can feel like a tedious, administrative task that isn’t necessarily connected with the value that it could drive across the org. So calling out, as part of that, maybe a bullet point or two of what lessons did you learn that could apply, that all the teams could benefit from. Then if you have some forum, at some companies it’s like a big Zoom meeting, it’s like a wiki publish place, where the listeners, you know, may tune out for some of the details but then for that part of, oh, hey, this is a section that might be relevant to me and lessons I can learn here.

You sort of create that structure for focus, that works really well. Because you have so much input and you’re not going to sit there and retain every detail of every post-mortem. But if you’re highlighting, hey, we had this outage because of certificate expiration, here’s a learning and everyone should go back to their teams and think about how you’re handling certificate expirations, as an example.
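As an illustration of the certificate-expiration lesson, a team could turn it into a small recurring check. The sketch below uses Python's standard `ssl.cert_time_to_seconds` to parse a certificate's `notAfter` timestamp; the function names and the 30-day warning window are assumptions, and fetching the certificate itself (for example via `getpeercert()` on a TLS connection) is left out to keep the example offline.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after, now):
    """Whole days left before a certificate's notAfter timestamp.

    not_after uses the string format found in the dict returned by
    ssl.SSLSocket.getpeercert(), e.g. "Jun  1 12:00:00 2021 GMT".
    now must be a timezone-aware datetime.
    """
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    return (expires - now).days


def needs_renewal(not_after, now, warn_days=30):
    """True when the certificate expires within the assumed warning window."""
    return days_until_expiry(not_after, now) <= warn_days


# Made-up timestamps for illustration.
now = datetime(2021, 5, 12, 12, 0, 0, tzinfo=timezone.utc)
print(days_until_expiry("Jun  1 12:00:00 2021 GMT", now))
print(needs_renewal("Jun  1 12:00:00 2021 GMT", now))
```

Run daily against each service's certificates, a check like this turns one team's outage into an alert every team gets well before expiry.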

Corey Quinn:

I think one of the hardest parts is understanding who the different constituencies are and how this stuff all works out. In a microcosm to say, here’s how you roll this out. As you did this, did the organization as a whole learn things it didn’t know about (I don’t want to talk about politics, necessarily) but the idea of the way that information is flowing, the way that things are construed inside the environment?

Alina Anderson:

That’s huge. And I think things have to be discoverable. One example is standard Slack naming conventions. Where there are people who don’t have to look up a wiki and ask a guy who knows a guy to find a channel for a specific team. There’s a standard team API interface, I guess, in a sense. Where all you need to know is if I search for keywords, I’m going to be able to find an expected interface for a team, whether it’s Slack channel, intake form, that kind of thing. Because people are going to re-org, people are going to leave the company, you’re never going to know who is in charge of what anymore. You have to create these durable things that people can still interact with and eventually find what they’re looking for regardless of the org changes and shifts.

Corey Quinn:

Yeah, I think we have time for one more question. Making developers responsible for their own services can mean breaking one rotation into several. How do you apportion responsibility for each team fairly? It’s a great question. That team has 4 people on it and the other has 14. How do you make sure that transferring to larger teams does not become a perverse incentive?

Alina Anderson:

I tend to think about this in terms of looking at the services first. If you have services A, B, and C. Great, what resources does service A need to be maintained and develop new features? Okay, great, we think that’s six people. And then obviously that service needs to go to a team with six or more people.

I don’t think it’s successful to try to shoehorn services on a spreadsheet into the teams. It’s like, okay, what does the service need in order to operate and meet the commitments, and then we should resource that. And sometimes that results in we need to hire more people, or we need to mix things up. Because one service might have a massive monolithic database, so you need a certain skill set. You might need to shuffle seats a little bit to make sure that the service has the right expertise needed in order to operate it.

Corey Quinn:

Well, thank you very much for giving your talk. This is one of the areas that I think everyone has loud, angry opinions on, but you have not just data but also experience, which is great to hear. And seems in your case to come from a place slightly different from, “I don’t want to be on-call, therefore it’s bad putting me on call.” Which, frankly I’m sympathetic to myself. Thank you so much, Alina. And there is further opportunity for people to hurl questions at you on the other side in the Slack channel.

Alina Anderson:

Absolutely. Thank you so much.

Corey Quinn:

Thank you. Appreciate it.   

