Blameless SRE Resilience Panel

 

Transcript:

Amy Tobey [Staff SRE, Blameless]:

Thanks for joining us today to talk about how SRE can help organizations adapt in all this. A lot of us have been talking about this. I’ve had conversations with Alex and Liz about it. We keep wondering like, “What can we do?” So, those of us at Blameless were asking the same thing, like, “What can we do?” And we thought we would get these very smart people together to talk about that very topic. So what we’re going to do is, we’re going to start with 40 minutes of a panel discussion with our guests. Then we’ll leave 20 minutes in the end for open Q&A. So think of your questions as you go and pop them into the Q&A panel in Zoom.

My colleagues will sort them all out for us as we go. When we get to the end, then I’ll read them off and our panelists can have at them. So for now, introductions. I’m Amy Tobey. I’m going to be moderating today. I’m a Staff SRE at Blameless. I’ve been an SRE and DevOps practitioner since before those names existed. I love this community and believe that SREs are uniquely positioned to change the world in small and large ways. And going alphabetically, I’ll let my panelists introduce themselves. So starting with Alex.

Alex Hidalgo [SRE, Squarespace]:

Hey, everyone, my name is Alex Hidalgo. I’ve been an SRE for about a decade now. It is something that truly speaks to me. I wonder how I ever did anything else before. I’m currently at Squarespace and currently in the process of writing Implementing Service Level Objectives, which I hope will be well received.

Amy Tobey:

And Dave.

Dave Rensin [SRE Director, Google]:

Hello, everyone. My name is Dave Rensin. I’m an SRE Director at Google. I’ve only been an SRE for 5 years. So I guess I’m the puppy of the bunch. I was one of the principal editors on the SRE Workbook and pleased to be contributing to Alex’s book as well. It’s lovely to be here and to see everyone. For some value of seeing.

Amy Tobey:

Thanks, Dave. And Liz.

Liz Fong-Jones [Principal Dev Advocate, Honeycomb.io]:

Hi, I’m Liz Fong-Jones. I’m a principal developer advocate at Honeycomb.io. I’ve also been an SRE. I consider myself both an SRE and a dev advocate. I’ve been around SRE for the past 12 years. I’ve been working in the DevOps space for about the past 15 or 16 years.

2:24

Amy Tobey:

Awesome. Thanks, everyone, for joining us. So we’re getting started. I want to start with a question about, just to go right into it, error budgets and development velocity. And it seems like there’s a lot of teams out there that are running with reduced personnel, and available spoons or cognitive capacity, right? Because we’re all dealing with this. So the three of you have a lot of expertise in this space and how these regulations work. So how can these practices help teams cope and adapt? Let’s start with, we’ll start with Liz.

Liz Fong-Jones:

I think I’ve seen a couple of bad things about how to cope with this, and I’ve seen a couple of good examples of how to cope with this. The bad example is an unnamed bank in Europe that divided their teams up into the blue team and the green team and mandated that the blue team come to work on alternate days with the green team so that you would only lose half of your team. That way, you can still keep the services running and keep the deployments running. That’s a bad example. I don’t think that we should be doing that. I think that we need to be a lot more continuous with our resilience practices, that we need to accept that people are potentially going to have less throughput, that people are potentially going to be less available.

Instead of dividing things into the blue team and the green team, let’s instead think about, how do we actually make it so that people can swap in and out of on-call, depending upon their parenting needs? How do we make sure that people can continue to push out releases according to automated release trains so that if someone does feel like writing code, they can still get that code pushed to production? I think that’s the direction that we need to be headed instead.

Amy Tobey:

I really like that. I feel like I’m going to save you for last Alex because I think you have a lot to say about this. So let’s go to Dave next.

Dave Rensin:

Sure. So, I think I’ve seen a mix of things that went really well and things that are going not well in some realities. I agree with Liz that the scenario she described is generally an anti-pattern. But, I would like to acknowledge that there are times when it’s a requirement, particularly when you have people who have to have physical contact with some infrastructure. We have had double and triple split teams in our data centers at Google because humans have to go in and do human things to the equipment that we haven’t invented the robots to do yet. It’s not optimal in all the ways Liz described, but it’s just a requirement.

Things that have gone well. Well, the same things have gone well and have gone badly. Teams that have been good about paying attention to their technical debt and their toil, right? Meaning keeping their toil volume low, being very proactive about giving to the computers the things that computers are capable of doing. Those teams are generally faring better, having to work remotely, than other teams who weren’t. The teams who haven’t been paying as much attention to their automation or keeping their toil under control are finding they’re having a really hard time, because doing toil really doesn’t scale when everybody is remote, just because of the communication friction that being remote adds.

Especially to teams that are not used to having to interact or video conference, principally. So that’s both a pattern and an anti-pattern. I will say one last thing. One of the conversations we’re having internally today in SRE leadership is, this is a global pandemic and it’s awful, but as long as it exists, it provides a really interesting natural experiment to ask, how many of the principles that we practice in SRE at Google and other places, are scalable to the degree of global scale pandemics? And what edges do they expose? Things we have to rethink, or that maybe aren’t as true as we thought they were.

If you’re looking for sort of a silver lining in an otherwise fairly gray cloud, I think that’ll be pretty interesting as it emerges over the next few weeks.

Amy Tobey:

I like that point about resilience. Alex, what do you think?

Alex Hidalgo:

So I think one of the most important things that everyone has to keep in mind, especially the way you put it, error budgets, right? How do you think about what is a tolerable amount of failure in times like this, especially perhaps in terms of, what does your release cadence look like? What are you expecting from your humans? And people often, when they think about SLOs, or you think about error budgets and they think about the windows that these are calculated over, the classic example is: out of error budget, focus on reliability; have error budget remaining, release, move, do whatever you want.

But that’s not really how to best use those numbers. The real, I think, the best way to use the concept of an error budget, even if you don’t have specific numbers for this, right? It doesn’t mean you have to actually have measurements and all that. But the concept behind it is, it just gives you a different way of thinking about things and to have good discussions with people with that data, and to help you make decisions based upon that. So I often tell people that you should revisit what a target is whenever you need to. Sometimes it’s because you had an incident, sometimes it’s because your code base changed or your dependencies changed, right?

Things change about the world and sometimes you do change your expectations, but I also tell people to do that whenever you need to. Right now might be one of those times. Right now might be one of those times where you have to stop and say, what makes sense? What makes sense for our users, for our engineers, for the product team? In some cases, I could see this being an example where you need to make things a little more stringent. Zoom is very important right now. Netflix is very important right now. But there’s also perhaps chances where maybe you don’t need to focus as much on something because you need to prioritize correctly in this current world.
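As a rough illustration of the error budget idea discussed here, the sketch below computes how much of a budget is left for a given target. The function name, the event counts, and the two targets are hypothetical values chosen for illustration; nothing here comes from the panelists’ own tooling.

```python
def error_budget_remaining(slo_target: float, total_events: int, bad_events: int) -> float:
    """Return the unspent fraction of the error budget (negative if overspent).

    slo_target is the fraction of events that must be good, e.g. 0.999.
    All inputs are illustrative; a real SLO would be measured over a window.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events <= 0:
        # No traffic means nothing has been spent yet.
        return 1.0
    allowed_bad = (1.0 - slo_target) * total_events
    return 1.0 - bad_events / allowed_bad


# Hypothetical window: one million requests, 250 of them failed.
strict = error_budget_remaining(0.999, 1_000_000, 250)   # about 0.75 of the budget left
relaxed = error_budget_remaining(0.99, 1_000_000, 250)   # about 0.975 of the budget left
```

Revisiting the target, as described above, changes only `slo_target`; the same traffic then leaves a different amount of budget, which is the number a team would bring to the discussion.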

Liz Fong-Jones:

Yeah. People first, right? It turns out that your service is going down 1% more, it’s probably actually acceptable if that means that people stay home, if it means that people take their kids to the hospital if they’re getting high fevers, right?

Alex Hidalgo:

Yeah, exactly.

8:34

Dave Rensin:

Also, this is a moment where teams have to make cognitively uncomfortable choices that, maybe this product that I was going to launch, this feature I was going to launch the users will really like, just isn’t that important in the context of everything else that’s going on. And we’re just going to stop and divert resources to other things, and it’s painful in the sense of, it was never frivolous, but maybe it’s relatively frivolous, to the current time.

Liz Fong-Jones:

Yeah. We can think about global prioritization as well. I saw, actually this morning, and have signed up for the New York State Tech SWAT team that is being dispatched to deal with coronavirus. So it’s almost like doing, potentially, either a USDS rotation or similar but for 90 days, or even just volunteering services, right? These are things that we can all be doing, to make sure that we as a group of human beings survive.

Alex Hidalgo:

I also like building off the point that Dave made in terms of, perhaps you don’t need to ship this feature. Maybe you do. There’s this new service called My Bodega that wasn’t planning on launching until later on in the summer. They were trying to from the start, their whole concept was they wanted to allow Bodegas and corner stores to more profitably deliver things. So their plan was always to only charge 50 cents for each delivery, as opposed to things like Caviar and Seamless that can charge upwards of 30%. So they always have the owners of these small businesses in mind in the first place.

So what they’ve done, they launched early and they said, “Look, everything may not be perfect, but we’re launching early.” They’re not charging anyone. They’re a 100% free platform to use as long as things are currently … In the state that things are currently in. So yeah, sometimes for the greater good. Perhaps you do throw something out there as long as people understand it may not have been as polished as you’d originally hoped.

Amy Tobey:

Yeah, I like that. Go ahead, Dave.

Dave Rensin:

Oh, sorry. I mean, I was just going to say, Liz and Alex know, one of the things that happens when you have a service outage, and one of the things you worry about is, what is going to happen when you restart the service? What’s the crush of built-up demand and how do you measure it and moderate? And there’s some really interesting things in the health system that we have to start paying attention to. Like what happens when this abates? So I’ll tell you one, it’s sort of amusing, in a sense. We’re paying a lot of attention, obviously, to intensive care beds and ventilators and respirators and … those are obviously things we need to be in triaged spaces. Those are things we have to be paying attention to.

But with approaching half the US population and a billion people worldwide all sheltering in place … don’t laugh when I say this, it’s probably a good idea to also start paying attention to what our maternity ward capacity is going to be 40 weeks from now. Because I think it would be foolish to assume there won’t be, shall we say, an uptick in that case. That problem, if you will, is generalizable to a lot of things. Now is actually probably, I mean, we’re in the height of this crisis and at least in the US, we’re expecting things to get a lot worse over the next couple weeks.

Now’s actually probably the time to start thinking about what are the ways, when we restart, that we want to restart things, so that we don’t cause immediate overload and crush everything?

Liz Fong-Jones:

Yeah, it’s that when you turn the service on, you have to turn it on with exponential back off. You have to turn it on with jitter. If you don’t do that, then you just get immediately inundated the second everything comes back.
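A minimal sketch of exponential backoff with jitter, assuming the “full jitter” variant, where each wait is drawn uniformly between zero and a doubling ceiling. The function name and the default parameters are hypothetical and chosen only for illustration.

```python
import random


def backoff_delays(base=1.0, cap=60.0, attempts=6, rng=None):
    """Return a list of randomized retry delays in seconds.

    The ceiling doubles on every attempt (exponential backoff) up to `cap`,
    and each delay is picked uniformly below that ceiling (jitter), so that
    many clients restarting together do not all retry at the same instant.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0.0, ceiling))
    return delays


# Example: five retries whose ceilings are 1, 2, 4, 8, 8 seconds.
example = backoff_delays(base=1.0, cap=8.0, attempts=5, rng=random.Random(0))
```

Without the random draw, every client that failed at the same moment would come back at the same moment, which is the inundation described above.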

Alex Hidalgo:

Yeah, I’ve already been imagining, thinking about, what is it going to be like the first day that New Yorkers are allowed to go back out to the bar? And not just in terms of how busy it’s going to be. But what safety measures should we have in place, to ensure that people who are probably going to drink a bit more than they normally would, are they going to be able to get home safe and yeah, these are all things that we need to start thinking about now.

Amy Tobey:

Yeah, there’s capacity turned down almost everywhere, even the rideshare systems are running at lower capacity right now, aren’t they?

Liz Fong-Jones:

Yeah, they are. In New York, it’s sufficiently bad that many drivers are driving around and not able to find work such that the city actually stepped in and said we are going to hire rideshare drivers to carry critical medical supplies from place to place, rather than have them drive around looking for passengers that won’t turn up.

Dave Rensin:

And a bit of fascinating frenemy things happening, right? So some of the rideshare services who might putatively compete with, like, food delivery services are starting to partner with them, so that they can still give rides and hours to their drivers and don’t lose them permanently, except now they’re just delivering food to people instead of people to people.

13:45

Amy Tobey:

That’s really good adaptability there. So we talked about the need to plan for the capacity as we emerge from the lockdowns and quarantines and so on. Going back to where we started a little bit, how do we create the accountability, while balancing that with compassion, right? Because we need to create that pressure, right? This is why we have SLOs and things to do that capacity planning. And so I guess, what are the successful strategies that you have seen in the world for starting that process and getting that process going? Because as SREs, we often are the people crying in the dark and going, “We’ve really got to turn up capacity before it’s too late.”

Sometimes we’re not heard. So what can folks do? Let’s start with Alex this time.

Alex Hidalgo:

I don’t know, this is a tricky question in terms of our current situation, because it’s uncharted territory, right? You can do your best when you’re talking about any capacity planning or any feature launch or any product launch. You can try to use the data that you have. That’s all you can do. No one can really predict the future. You can throw all sorts of stats at what numbers you do have and perform regression analysis, but that’s still not really telling the future. Sometimes I think the best you can do is make a guess, but be ready to change it. That’s what resilience really means, right?

Do you know what to do and can you? Yesterday, I cut myself pretty badly. I was opening an avocado because I wanted to make some guacamole. Robustness is the fact that I have this other hand. Resilience is the fact that I am first aid trained, Red Cross certified, and I knew exactly what to do, and I had the tools on hand to do it. So even though this was unexpected, I certainly wasn’t planning on cutting myself, while I was trying to make some guac. It happens but, the resilient aspect of that is the fact that I knew what to do, I was prepared for it. So I think we have to take a similar view on how we turn both services and society back on, in these times.

We’re not going to know. Take a guess, but try to be as prepared as you can possibly be.

Liz Fong-Jones:

And have feedback loops, right? Maybe we don’t let all of New York City come back to work at the same time, right? Maybe we increase capacity a little bit and then increase capacity a little bit and then discover, “Oh, wait, we need to back off.” The more you have adaptability and flexibility in your system, the better prepared you’ll be.
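The feedback loop described here can be sketched as a small ramp-up routine: raise capacity one step at a time, check a health signal, and back off one step when the signal turns bad. The function name, the step size, and the `is_healthy` callback are hypothetical; a real system would wire the callback to its own monitoring.

```python
def ramp_up(is_healthy, start=0.1, step=0.1, max_rounds=50):
    """Return the sequence of capacity levels (0.0 to 1.0) tried during a ramp.

    is_healthy(level) is a caller-supplied check, assumed to report whether
    the system is coping at the given fraction of full capacity.
    """
    level = round(start, 4)
    history = [level]
    for _ in range(max_rounds):
        if level >= 1.0:
            break
        if is_healthy(level):
            # Things look fine: admit a little more.
            level = round(min(1.0, level + step), 4)
        else:
            # "Oh, wait, we need to back off."
            level = round(max(0.0, level - step), 4)
        history.append(level)
    return history


# A system that copes at every level ramps straight to full capacity.
smooth = ramp_up(lambda level: True)
# A system that struggles at half capacity hovers just below that point.
constrained = ramp_up(lambda level: level < 0.5)
```

In the constrained case the loop keeps probing and retreating until `max_rounds` runs out, which is one simple way to cap how long the experiment continues.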

Amy Tobey:

Do you have anything to add Dave?

Dave Rensin:

Yeah, a couple of things. First, here’s where well-designed incentives play a role. We talked about error budgets, but the value of error budgets is the incentive alignment they create, right? If you’re over your error budget a lot, the SRE teams are going to hand back the pagers. That’s a super oversimplification. So we’re starting to see some of these things. Like in the early days in most countries, you started to see hoarding. And that was terrible. People go, they clean out stores, they buy 14 years of toilet paper because, I don’t know, they expected COVID-19 to make them poo more or something.

Or they bought three years of perishable goods, which made no sense. So now stores are starting to respond and do really good things like, “Oh, okay, sure, that first unit of hand sanitizer is the normal $4, and the second unit, two for $80.” That’s a good incentive response, which will keep people from hoarding. Or stores opening early only to service elderly and at-risk patrons. Those are good responses. But I’ll sort of piggyback on something Alex said, which I think may be the most important thing in all of this: I tell people that it’s not a sin to fail, the sin is in failing to notice.

Whatever we do, it’s going to be wrong the first time. That is a metaphysical certainty. So to Alex’s point and Liz’s point, it’s all about the feedback loop and the monitoring, and noticing that things are going off the rails, and having the levers to adjust and try something new. If, let’s say, the blast radius of your mistake is sufficiently small, you can afford to discover the right thing to do by first discovering all the wrong things to do.

Alex Hidalgo:

I also really liked that point about how people were really hoarding things and, in some cases, still are depending on the store. It also leads to, I think, the need for you to ensure that you’re communicating things well. In my neighborhood, suddenly, all the toilet paper was back. Someone I know, just a few neighborhoods over in Brooklyn, still couldn’t find any at all. I didn’t know that or I would have let people know ahead of time. But as soon as I found that out, it was just on Twitter. I was like, “Oh, go to this store at this intersection. And this store at this intersection.”

This person was able to find the toilet paper they needed. Maybe that could have happened much earlier, if we knew what we had to communicate to each other. You can’t always know what you need to communicate, but that’s just another I think important aspect. That’s how you keep things reliable, making sure that everyone who needs to know, knows.

Liz Fong-Jones:

It’s interesting in that I’ve been following the efforts by a group of aviation professionals. It’s called OpsGroup. They’re a set of people who are private pilots, who are commercial pilots, and also people who work for major airlines. Despite the fact that they work for competitors, they share information about, here’s the airports that are currently closed. Here’s currently what’s going on with the missile strike in Iran, things like that and they share that information. It makes all of them much more adaptable, because they have the information that they need rather than siloing it.

19:45

Amy Tobey:

A thing that struck me while you all were talking about that, that I found funny … Maybe not funny but interesting was … that how the toilet paper thing reminded me instantly of the thundering herd problem we were talking about earlier, when we turn up services at the edge, and then they hammer all the services on the inside, and the whole system goes into a flapping state for a long time. It’s really weird to see that out in the real world. So if we’re good with that, the next thing I had queued up was … Actually, somebody asked a question, and I want to jump ahead to it a little, because I think it’s really relevant to where we are right now.

It’s, how do you all see the world of disaster response and Business Continuity Planning, changing as we move forward into recovery? Because I think right now, a lot of folks are finding out that their disaster plans were incomplete, because nobody really planned for a pandemic, or a lot of organizations didn’t. There’s a lot of need for using some of our recovery, or disaster planning, that maybe hasn’t been tested that well before. So if somebody wanted to kick off and start talking about their thoughts on that.

Liz Fong-Jones:

I think you can’t enumerate every single possible thing that’s going to go wrong, right? I think that going the playbook strategy is not necessarily going to work super well, because you cannot anticipate what the next Black Swan is going to be. So instead we have to focus on making our organizations of people more resilient. I think that’s the lesson I hope people will take away from this. I was reading an article this morning in the New York Times about how American Airlines had a plan for dealing with a pandemic in China. They’re like, “Yeah, we’ll shut down flights to China.” And then it spread to Italy. They’re like, “Well, we can shut that down too, but suddenly our plans no longer work.”

Dave Rensin:

I completely agree with Liz, you cannot game out every contingency. The permutations are insane. But, failure modes cluster across just a handful of axes, right? So you can plan generally about how you’re going to think through and respond to sharp swings in capacity requirements. Whether that’s technical capacity, like computer networking, or human capacity, like we have to surge humans to a place or another. Those are generalized techniques and things you can plan for that are pretty portable across different situations. You can generally plan for a communication partition.

What if the East Coast can’t talk to the West Coast? What do we do there? That rhymes with this problem of, “Oh, we can’t send all the staff to a physical place at the same time because there’s infestation the rest of the way.” That looks like a staff partition. So there are classes of things that you can drill, and so that you at least have a mental framework of how to take these classes of things and apply them to the specifics of where you are. I’ll tell you a funny … whatever. Sure. Let’s call it funny. An anecdote. I think some people might know … Obviously Liz and Alex know.

At Google, we do this thing called DiRT, Disaster Recovery Testing, where we try to simulate the most existentially awful but plausible thing we can think of. Earthquakes or giant outages or whatever-

Liz Fong-Jones:

Or zombie plagues.

Dave Rensin:

Well we did in fact simulate that one, because zombies could happen. When I would talk to people in the industry, and they would find out about DiRT, the reactions usually went like this. First reaction was, “Wow, that’s neat. It sounds kind of fun.” Because it is neat and is fun. The second reaction is, “But man, that is such a luxury item.” Of course, Google does it, because it’s Google and whatever. So the good news is a lot of what we’ve learned in DiRT, we’re actually having to apply. Not just can, like we’re having to. The bad news is, we are discovering all the vectors of terribleness that 20 years of DiRT testing did not prepare us for, or worse, mis-educated us about.

So we’re unlearning some lessons rapidly too. It’s a weird sort of dynamic, you would think a company that spent 20 years thinking about crazy meteor strike kinds of things might be better prepared. On some axes we definitely are, but on other axes, it actually hurts us. So I don’t know what lesson we’ll learn from that, but we definitely need to learn some lessons from it.

24:34

Alex Hidalgo:

Yeah, and not only can you not plan for every potential outcome, you can’t plan for every potential problem. You also can’t plan for the scale, right? Like a good example, I was reading last night that Waffle House is shutting down. Waffle House never shuts down. Part of that is capitalist greed, of course, blah, blah, blah. We don’t have to get into that. But they take this seriously. Every store or every restaurant does training, everyone that works there knows how to do this. They have reduced menus that they fall back onto. They are ready to help deliver things to people, if they can’t leave their houses.

This is part of how Waffle House has set themselves up. They have tried to ensure that they can be as resilient as a restaurant business possibly could. They’re closing. Because sometimes, even if you are as prepared as you possibly think you can be, you can’t prepare for the scale of what you’re dealing with either. Because I think that’s what’s going on there.

Liz Fong-Jones:

Yeah, it’s an interesting situation where Waffle House is used to hurricanes taking out one or two or three states at a time, but they’re not used to something affecting the entire nation at once, with no possibility of serving people with a limited menu, right?

Amy Tobey:

What I found really fascinating about the Waffle House case is, as a signal to the people who live in those states that get hit by hurricanes frequently, that was a stronger signal that they had to batten down than the official warnings that came out. And I was wondering if anybody had thoughts about these unofficial or colloquial alert systems that maybe we already have in our organizations too.

Dave Rensin:

People love tea leaves. There’s something about human nature. I don’t know exactly what it is. Someday maybe there’s going to be some body of research. But I think there’s a fundamental human distrust of authority, particularly authority that seems very abstracted from you, sort of in your locality. I also think maybe it makes people feel a little more in control, when they’re like, “Oh, yes, I see this tea leaf that I can read.” So yeah, things like the Waffle House is a good local thing or in a lot of communities whether a Walmart is open, or Walmart parking capacity is another proxy measure in small towns, people use it.

There’s just something about human nature where they love it. It’s the same thing, in companies. People love anecdotes and anecdotes are more viral than data. For sure. In a previous life a long time ago, I worked in the US Intelligence Community and we used to say that the three great sources of intelligence for any group were SIGINT, signals intelligence; HUMINT, human intelligence, stuff you get from spies or whatever; but the most powerful was RUMINT, what you heard in rumors. There’s just something about human nature that’s weird.

So as leaders, it’s really interesting and hard, because on the one hand, you want to be making decisions based on well-curated and aggregated data. On the other hand, you have this human instinct to want to pay attention to anecdotes, and you have to find some filtration mechanism to mix and munch them together. And the higher you are, the more challenging the problem becomes.

Liz Fong-Jones:

So I guess to spin that on its head, what can we as leaders do to communicate doing the right things to our people, when our people are disinclined to believe what we say at face value?

Alex Hidalgo:

I think part of that is you use narratives, right? Not only do people like these signals that Dave was talking about, but people like stories. We’ve always been storytellers. That’s how, for the vast majority of humans having existed, that’s how we passed information, via stories. That’s why the best post mortems are narratives and not timelines, and-

Liz Fong-Jones:

Goodness. After this pandemic, we’re going to have to really stop using the word post mortem, we’re going to have to start using retrospective because now post mortem does feel like people are going to have relatives who have died recently and that’s a little bit too on the nose.

Alex Hidalgo:

Yeah, I’ve actually renamed things as incident retrospective internally, and I like it a lot better, but we’ve been calling it post mortems for decades, in this industry, it can be difficult to jump onto a new phrase, but totally agree.

Dave Rensin:

There’s an old saying, among marketers and people who have to persuade people for a living. So this is really a piggyback on Alex’s point, because he makes a really important point. That identity beats analogy, analogy beats logic and logic beats nothing, right? And so when you’re making an argument, if you’re arguing against nothing, if you’re arguing against a vacuum, you can use facts and data and logic, and you’re going to win the argument. If you’re making an argument, and what you’re arguing against is, let’s say a set of facts that can be interpreted a bunch of ways, that analogy turns out to be a stronger way to persuade people.

And the strongest way to persuade people is identity. In fact, this is going to change the way you look at advertising, but if you look at any long form of advertising, it always goes in what’s known as the up and down pyramid. It starts with identity. Don’t you want to be known as a handsome, sophisticated, suave human being? And then it’ll go down to an analogy. After that, imagine you are an elegant animal in a herd? I don’t know, some crappy analogy people make for men’s clothing or something. Then data. 46% of all employers love people with blue ties.

And then it works back up. It reinforces with analogy and then an identity again, and that is the arc of every piece of successful long-form presentation. There’s something to Alex’s point about human nature, where we value stories. Maybe they are easier to compress and store in our brain or something about the way they interact with our ability to abstract, think, or maybe they’re sort of pre-chewed in the sense that if you give me data, I have to chew through it and look for knowledge and then look for wisdom and then synthesize it into a thing and store it, but whereas if you just give me a story, I can just pull a moral out of it.

31:03

Liz Fong-Jones:

I think what’s particularly interesting here though is, we had heard those stories for months, right? We’ve been hearing from countries in Asia who are impacted, and countries that were outside of Asia, chose to ignore those stories, which I think brings us back to the idea of identity. Right? If you think that your identity is such that that story has no relevance to you, you’re not going to pay attention to that story.

Alex Hidalgo:

Yeah, but I think part of that too is the way those stories were presented to us here was numbers. Right? It was just news reports, like, X number of people have now been infected, Y number have died, Z number have recovered. I’m sure there was much better data out there. There were better stories being told out there, but that’s how I was receiving the news at least, right? It was just numbers. Again, numbers from a place I’ve never been. I think that’s one of the reasons it was probably easy for people to not take it seriously because, as Dave was saying, stories work better. And that’s why I’m always telling people, a good SRE understands marketing.

Because you need to get other people to buy into what you’re selling, whether that’s a system, a way of thinking about reliability, “I think this tool is a really good idea,” or “We should be focusing on this.” You need to know how to market, and to do that you need to be able to tell stories. Last year I spent a lot of time trying to convince people to spend a lot of money on a certain vendor. I tried using numbers at first, like, “Hey, we need this. We’d need X number of engineers to build the same functionality internally, so the salary costs would far outweigh what it would cost to just pay this vendor to do it.”

That didn’t get me anywhere. But then, during our trial phase, someone was able to actually solve the problem, and when I told that story, I said, “This human on this team did this thing.” That was what helped convince leadership.

Liz Fong-Jones:

That’s when people have that aha moment.

Amy Tobey:

I think another thing that we probably deal with a lot in our spaces and work is, we are very biased toward understanding the numbers. A book I keep thinking about a lot lately is Innumeracy and how … If I said to the three of you, “Oh, yeah, it’s going on an exponential curve,” you would go, “Oh, yeah, that’s an exponential curve. It looks like this.” Right? But for the majority of people, if we say, “It’s going to go exponential,” they go like, “Is that adding more, or?” I think that matters for large-population broadcasts, and so I think that’s a really key skill for SREs.

Dave Rensin:

It’s also worse than that. All humans, even really math-savvy humans, are terrible at internalizing and dealing with probabilities. I believe firmly that humans really only understand two probabilities, zero and one. Everything else is a mental coin toss. Most people reduce risks to either zero or one. Like entrepreneurs too often reduce risks to zero. And pessimists, let’s say, too often reduce risks to one. That’s it. That’s all people understand. So you see this thing where even in the press reporting and the public discussion over the pandemic, you’ll have some people saying, “Well, we could be through it by this date.” And other people say it’ll take much, much longer than this. That’s insane.

Actually, the truth of the matter is that the probability of any one of those scenarios being true is roughly the same, and no one knows which one’s going to happen. So which scenario you gravitate to just depends on how you look at the world: whether you tend to be an optimistic person (or maybe a pessimist who can’t count) or a pessimistic person. And people have a really hard time accepting that two or three or four different outcomes are all equally plausible and therefore things that feel like they’re conflicting can all be simultaneously true. They have a really hard time figuring out, what do I do in that scenario?

Liz Fong-Jones:

And I think that this is where we have a role as people who think about preparedness, right? To prepare for each of those three or four scenarios, right? What would it take to adapt under this scenario versus that scenario? Once you have that concrete menu of options, that feels like a much more reducible problem than dealing with black swans, but I think we do have to deal with black swans too.

Alex Hidalgo:

Yeah, a thing that I notice a lot, I’ve been trying to introduce better knowledge of basic statistics and using probability. These are very powerful tools for calculating meaningful SLIs and picking good SLO targets and stuff. What I learned is that some of the most technical people, yeah, they have real trouble grasping simple probability concepts. It’s just not easy for human brains to do. Like, you may accept the fact that there’s a chance that no deck of cards has ever been shuffled in the same way. Right? Because if you do the math there are just trillions or … I can’t remember what the number is now, but there’s so many possible outcomes, that there’s a chance that no two decks have ever been shuffled in the same way.
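The number Alex is reaching for is 52 factorial, the count of distinct orderings of a standard 52-card deck, which is roughly 8 × 10^67. The short sketch below computes it and then applies a simple union (birthday-style) bound on the chance that any two shuffles in history matched. It assumes every shuffle is perfectly random, which real shuffles are not, and the figure of 10^20 shuffles is a deliberately generous made-up number rather than an estimate from the panel.

```python
import math

DECK_SIZE = 52

# Number of distinct orderings of a 52-card deck: 52!
ORDERINGS = math.factorial(DECK_SIZE)


def collision_upper_bound(num_shuffles: int) -> float:
    """Upper bound on P(at least two of num_shuffles random shuffles match).

    Union bound: (number of pairs of shuffles) / (number of orderings).
    Assumes each shuffle is uniformly random, which is an idealization.
    """
    pairs = num_shuffles * (num_shuffles - 1) // 2
    return min(1.0, pairs / ORDERINGS)


if __name__ == "__main__":
    print(f"52! is about {ORDERINGS:.3e}")
    # Hypothetical, deliberately generous count of shuffles ever performed.
    generous_shuffle_count = 10**20
    print(f"collision bound: {collision_upper_bound(generous_shuffle_count):.1e}")
```

Even under that generous assumption the bound is vanishingly small, which is the math behind the claim; the point in the discussion is that knowing this and believing it are different things.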

You might know that via math, but I don’t even really believe that, right? It’s one thing to have numbers and it’s another thing to actually convince human brains of things. Like the Monty Hall problem, I think, is another great one. It’s pretty simple to prove that … here, like, I’ll explain it. So it was based on a game show and there was a gift behind one door, and behind two other doors were donkeys I believe or something like that. Basically, once you pick a door, one of the two remaining doors would be opened, and you get to find out what was behind that door. If it was a donkey, as opposed to the gift, you were allowed to switch which door you had picked.

Intuitively, we all want to say that it doesn’t make any difference. But it does. You increase your chances of getting the gift if you switch doors. It’s incredibly counterintuitive, but you can find this on Wikipedia and you can find the mathematical proofs. From a probability standpoint, if a door is opened and there’s a donkey behind it, and you’ve already selected, then move: choose the other door. You actually have a better chance. This makes no sense. It’s very difficult for human-

Amy Tobey:

It really doesn’t.

Alex Hidalgo:

It doesn’t, does it?
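For anyone who would rather see it than take it on faith, here is a minimal simulation sketch of the game Alex describes. It assumes the standard formulation of the puzzle, in which the host knows where the gift is and always opens a door that is neither the contestant’s pick nor the gift; the function names and trial count are illustrative choices.

```python
import random


def play_round(switch: bool, rng: random.Random) -> bool:
    """Play one round; return True if the contestant ends up with the gift."""
    doors = [0, 1, 2]
    prize = rng.choice(doors)
    pick = rng.choice(doors)
    # Assumption: the host knows where the gift is and always opens a
    # door that is neither the contestant's pick nor the gift.
    opened = rng.choice([d for d in doors if d != pick and d != prize])
    if switch:
        # Move to the one door that is neither picked nor opened.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == prize


def win_rate(switch: bool, trials: int = 100_000, seed: int = 0) -> float:
    """Fraction of simulated rounds won with the given strategy."""
    rng = random.Random(seed)
    wins = sum(play_round(switch, rng) for _ in range(trials))
    return wins / trials


if __name__ == "__main__":
    print(f"stay:   {win_rate(switch=False):.3f}")
    print(f"switch: {win_rate(switch=True):.3f}")
```

Staying wins only when the first pick was right, which happens one time in three; switching wins in every other case, so its rate settles near two in three.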

37:31

Dave Rensin:

The other thing I find is, people have a pretty good intuition about expected value. If there’s a 30% chance of this outcome, and a 70% chance of this other outcome, we weigh them together, and then the expected value is some number that actually doesn’t appear in any of the outcome tables. They have a pretty good intuition about expected value, but it turns out in a lot of these situations, the expected value is not useful for making a decision. So if you have a 30% chance of living and a 70% chance of dying for this one particular patient, and we say living is a value of one and dying is a value of zero, then you have an expected value of point three, which does nothing for the decision process, because it’s not one of the actual outcomes.
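Dave’s arithmetic can be written out in a few lines. This is only a sketch of the example as he states it (living valued at one, dying at zero, with 30% and 70% probabilities); the variable and function names are illustrative.

```python
# Outcome value -> probability, using the numbers from the example:
# living is worth 1 with probability 0.30, dying is worth 0 with 0.70.
OUTCOMES = {1.0: 0.30, 0.0: 0.70}


def expected_value(outcomes: dict[float, float]) -> float:
    """Probability-weighted average of the outcome values."""
    return sum(value * prob for value, prob in outcomes.items())


if __name__ == "__main__":
    ev = expected_value(OUTCOMES)
    print(f"expected value: {ev}")
    # The average is not itself something that can happen to the patient.
    print(f"is a possible outcome: {ev in OUTCOMES}")
```

The weighted average lands between the two possible results, so it describes a population of such patients far better than it describes what will happen to any one of them.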

Liz Fong-Jones:

Right. Exactly. I’ve been having to tell people, a 3% death rate or even a 1% death rate among people who are able-bodied and younger, that means that every company is going to either lose someone, or is going to lose a relative of someone working at the company, right? That makes it a lot more concrete, it makes it more than just maths. But I think all of this is wrapping back around to the idea that, when we think about reliability, when we think about risks, that a lot of it revolves around persuading people of it, rather than necessarily just the math.

Amy Tobey:

All right, so that brings us to about 17 minutes left. So we’re going to move on to Q&A. So the first one I have overlaps a little bit with what we were talking about earlier, but it has a little twist. Given the view that humans and technology are part of the same system, sudden staffing reductions can deeply impact these systems, for example, and this is the part that I like, the loss of expertise. How can SREs proactively help prepare our organizations to adapt for this?

Liz Fong-Jones:

A lot of my talks are about the idea of collaboration, the idea that we need to not silo knowledge, which means that you have to have more than one person with a working knowledge of, how does something work? How do we do things? Why do we do things? So I think that that’s one redundancy mechanism that we can really employ to guard against the possibility of someone going to the hospital and not being available.

Alex Hidalgo:

I think part of it is also just proactivity on the part of people who know that they can be thought of as subject matter experts, or a single point of failure. People often know that. You need to encourage people: when you see yourself as that person, when you know that you’re the only one that holds this data, you need to proactively share it.

Dave Rensin:

There’s an exercise you can do as a team, which I do on my teams. I talked about this a few months ago at the CHAOS conference. One of the exercises is called “Wheel of Staycation,” where every week you randomly pick a person. I’d go into much more detail, but you can look it up. You randomly pick a person and they stay at work, but they have no work communication. That becomes their day to do project work or whatever it is, but they can’t answer any work emails, they can’t have any IMs, no asking them questions. Nothing. The point of the exercise is to discover your information SPOFs. What questions couldn’t get answered, just because that one person randomly wasn’t there, that one day, that one week?

And then to actively engineer away the information SPOFs. There are other things you can do. But the only way you can discover things like expertise SPOFs or information SPOFs is to regularly and routinely exercise them before the emergency shows up. There is a set of exercises you can do to make this happen.
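The mechanics of the weekly pick are simple enough to sketch. Dave does not spell out the details here, so everything below is an assumption made for illustration: the team names, the seeding scheme, and the idea of tallying blocked questions by topic are all hypothetical, not a description of how his teams run the exercise.

```python
import random
from collections import Counter


def pick_staycation(team: list[str], week: int, seed: str = "wheel") -> str:
    """Pick this week's unreachable person, reproducibly for a given week."""
    # Seeding from the week number means anyone can re-derive the pick.
    rng = random.Random(f"{seed}-{week}")
    return rng.choice(sorted(team))


def likely_spofs(blocked_questions: list[tuple[str, str]]) -> Counter:
    """Count, per topic, the questions nobody else could answer.

    blocked_questions holds (absent_person, topic) pairs logged during the
    exercise; topics that keep showing up point at an information SPOF.
    """
    return Counter(topic for _person, topic in blocked_questions)


if __name__ == "__main__":
    team = ["avery", "blake", "casey", "devon"]  # hypothetical names
    print(pick_staycation(team, week=14))
    log = [("casey", "dns"), ("casey", "dns"), ("avery", "billing")]
    print(likely_spofs(log).most_common(1))
```

The useful output is the tally, not the pick: a topic that blocks people repeatedly is a candidate for documentation or pairing before an emergency forces the issue.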

Amy Tobey:

The other thing that this made me think of is how, probably a lot of SREs are those experts that have a lot of critical knowledge in their head, for how the infrastructure is held together and where all the pieces are connected, and so probably our community is especially loaded with this information.

Alex Hidalgo:

Yeah, and from a technical standpoint, this is tangential I guess, but something I’ve been thinking a lot about is, I’m going to work from home more often, not because I necessarily love it, but because I need to make sure I can. So many companies reported running out of VPN licenses because they never expected every single employee to have to log in at the same time. Even if they had enough VPN licenses, was the subnet used to assign addresses to the clients large enough? And even if that subnet was large enough, were there enough DHCP leases available, right? There are so many different things that you need to think about, so if you’re being a little bit more proactive, you’re testing scenarios and you have exercises like what they’ve mentioned, these are all good things that can help you learn.
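Alex’s chain of questions (licenses, then subnet size, then DHCP leases) amounts to finding which limit falls below headcount. The sketch below checks all three using Python’s standard `ipaddress` module. The numbers in the example and the choice to reserve two addresses per IPv4 subnet (network and broadcast) are assumptions; a real deployment may reserve more, for example for a gateway.

```python
import ipaddress


def remote_access_shortfalls(
    headcount: int,
    vpn_licenses: int,
    client_subnet: str,
    dhcp_pool_size: int,
) -> dict[str, int]:
    """Return each capacity limit that is below headcount, with its value."""
    network = ipaddress.ip_network(client_subnet)
    # Assumption: two addresses (network and broadcast) are unusable.
    usable_addresses = max(network.num_addresses - 2, 0)
    limits = {
        "vpn_licenses": vpn_licenses,
        "subnet_addresses": usable_addresses,
        "dhcp_leases": dhcp_pool_size,
    }
    return {name: cap for name, cap in limits.items() if cap < headcount}


if __name__ == "__main__":
    # Hypothetical company: 600 people, plenty of licenses, but a /24 pool.
    print(remote_access_shortfalls(600, 1000, "10.8.0.0/24", 1000))
```

Running a check like this during a planned work-from-home day is cheaper than discovering the smallest limit when everyone logs in at once.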

Liz Fong-Jones:

Yeah, we had a really interesting thing that really prepared us for this at Honeycomb in that, we were concerned about potentially losing our office lease back in September or October, and not having the cash to put down a deposit on our new office. This was before we raised our most recent round. As a result, the company instituted one week out of every eight weeks, we’re just going to all work from home that week, just to get us ready for this mentality of, we might lose our office. And we didn’t lose our office the way we expected to lose our office. I’ll tell you that.

42:55

Amy Tobey:

I have another question about remote work, so leading into that. When all the forced remote people are told that they can return to their office, what proportion of them do you think will say, “No thanks. I’m used to this now and I like it better. Please don’t make me go back to the office.”

Alex Hidalgo:

This is incredibly anecdotal, but I have this social Slack. It’s just me and my friends. It’s been around for a few years. It’s like a glorified group chat at this point. We appear to be split 50/50 on who misses the office and who doesn’t. We’re split just about 50/50 on who has always wanted to work from home or does, and who would really like for things to get back to normal in that sense. Purely anecdotal, very small sample size, but I wouldn’t be surprised if it’s something close to that. A lot of people have noticed that after a week or so they really missed human interaction. There are other people who are very happy just to be indoors.

Liz Fong-Jones:

I’m really curious to see whether this expands the range of people hiring outside of San Francisco, though. I think that overall, this crisis is really exposing the need for people’s tooling to support remote workflows, for people to not need to be shoulder surfing each other in order to collaborate. The sooner companies can adapt to that reality and adopt solutions that enable people to collaborate without being physically in the same place, the better prepared we’ll be for any scenario, including hiring remote employees, including this crisis dragging on longer. And yes, including people not wanting to go back to the office.

Dave Rensin:

I expect something like 100% of the people, or a very, very high percentage of the people who are working remotely, will go back to the office that first week. I’m a raging introvert. I’m fine not dealing with humans. But even I’m going nuts. And so I’m definitely going to go back to the office in the first week. And then I think we’ll see this double boomerang. A bunch of people who are like, “Oh, okay. Now I have a side-by-side comparison in the recent timeline. I think I’m going three days a week if the company will permit it. Whatever … work from home.”

I also share Liz’s hope that this will make companies bolder about where they are hiring people, and where they’ll allow people to be hired. That would be awesome. For all the reasons. De-densification and decongestion and just generally making things more affordable and better for people. Where I work, Google has a very strong culture of in person. That’s just the way the company was built. Yeah, I know … sometimes it drives me nuts too, Liz, but I’m just saying. And not just that, but for some roles, Bay Area. Like when I interviewed, I was living on the East Coast near Washington, DC. I said to the person hiring, “Hey, you have an office 15 minutes from my house and the job you’re hiring for is a globally distributed job with teams across the world. How about I just do the job from Reston?”

And her answer was, “That is a really fantastic and interesting point. By the way, the job is in Mountain View.” So it’ll be really interesting for a company like ours, that strong geographic affinity thing, to see if we learn any of these lessons or take this as an opportunity, let’s say to relax some of those constraints. That’s a thing I’m really looking forward to … a conversation, I’m really looking forward to seeing.

Liz Fong-Jones:

Sounds so interesting seeing the back and forth, at least in larger SRE orgs that are multihomed, right? That they already have some of these skills in order to collaborate across time zones, just not necessarily being remote in the same time zone.

Amy Tobey:

There are two questions I think are very similar. So I’m going to combine them and that’ll probably be our last … Well, we got about 10 minutes left. So the first one is, if people think that something could happen only once, do you think they’re capable of learning from this incident? If we call this all one big incident. And the second one, which I think is related, is: in two years’ time, when we aren’t directly talking about this anymore, what impacts to the way we work do you hope will have stuck? And so I think those two are very related. Do you feel like people are going to learn from this? I think everybody here has a lot of experience with trying to educate people coming out of incidents, and sometimes very large incidents. Then like, what do you think is going to stick?

Alex Hidalgo:

I know this is contrary to what a lot of people think but, in my experience, spending 18 years in tech in one way or the other, people actually expect things to be the same. The things that they’ve seen happen before they expect to happen again. I think, even with something of this magnitude, people are going to remember that this can happen. So I don’t think people will actually view a problem as large as this as a one-off. People have trouble categorizing things as Black Swan events. When a new problem happens, they will always go back. They go, “Oh, well, is this the same problem we saw back in October? Or is this the same problem we saw last year?”

Even if that problem never pops its head up again, people instinctively, they expect the future to look like the past.

48:16

Liz Fong-Jones:

We can still pick out micro patterns that might happen again, right? People rushing to the grocery store to hoard, right? People all happening to get on Zoom at the same time from the same country, right? These are things that we potentially can learn from. At Google, when Alex, Dave, and I were all at Google at the same time, right? Like there’s this wiki page in the Google internal wiki docs that says something like formative outages, right? And you can hear about all the Black Swan outages that Google has had over the past 10 years, which ones were the most influential, and which ones you should know as a new SRE.

In that way, I’m actually less worried about what people who lived through this are going to learn. I’m more worried about what happens to the next generation of SREs who have not lived through this. How can we communicate what we’ve learned from this to them?

Dave Rensin:

So that’s the most valuable thing I think. I agree with Alex and Liz. I have no fear that people won’t learn lessons from this. Alex is right, everyone’s tendency towards recency bias is going to keep this fresh in their minds. Also, this isn’t rare. Just in the last 10 years, we’ve had SARS and MERS and H1N1 and this. And this one broke out faster and the reaction was sort of more swift and loud, let’s say, than swine flu. But like every couple of years, something like this either happens or is on the verge of happening. So this isn’t nearly as Black Swan as people think. I 100% think companies are going to learn these lessons. I really and truly do. The ones who don’t will self-select for extinction.

What you want to see happen though, is that the lessons they learn go into the culture, just into the culture of being at the company. So that, to Liz’s point, for the next generation of SREs or employees in general, like, it’s just part of the culture, that whatever remote work is part of the culture or capacity planning is part of the culture or whatever is part of the … That’s the most useful thing we can do from here: to ask, what principles, what practices, what cultural norms do we want to drive into our companies, so that the next generations don’t have to remember the specifics of this incident to get the value of what we learned from it.

Liz Fong-Jones:

Like an immune system, right? The immune system remembers things.

Alex Hidalgo:

But I’d say I don’t think it’s necessarily the case that every company is going to learn anything from this, or they’re going to learn the opposite lesson, right? There are plenty of examples of companies who are just waiting on the bailout, and it always works. You’re not going to convince me that the current setup of our airlines is actually going to change any of their practices moving forward unless we put in regulations to change how they make decisions. Same with our banks. Until we can regulate things better, they’re just going to wait for a bailout next time that they fail. Or, all the buybacks happening in aviation and … Boeing knows that the government’s always going to bail them out because our economy cannot survive without them, unfortunately.

There are ways around this, but just want to point out, it’s also not necessarily going to be the case that everyone currently impacted is actually going to change their practices. I don’t think. There’s always going to be outliers, there’s always going to be problem actors.

Amy Tobey:

So what you’re saying is, we need to hand the pager back to those organizations, and stop bailing them out?

Alex Hidalgo:

Yeah.

Amy Tobey:

That’s what you made me think of when you were talking about that. Like that’s our equivalent, if we keep bailing out service teams, they’re never going to really learn to swim for themselves. Sometimes it’s, “Here’s your pager, good luck. We’re here to support you.”

Dave Rensin:

The good news is, they’re not independent actors. They act in an ecosystem, they have externalities in the form of government intervention and regulation. So a lot of the, let’s say many of the, loopholes and laxities that were in, for example, the 2008, 2009 bailout, TARP, don’t exist in the Senate bills that just passed, I think just passed the US Congress. Because people remember, and they learned some lessons. They were like, “Yeah, no, you don’t get to do that. We saw you do that last time. You don’t get to do that this time.” Companies will invent new and creative ways to do something else. But the good news is there is a feedback loop for a bunch of these things.

Liz Fong-Jones:

I think the other element is reducing over the long term the amount of too big to fail, right? Can we save specific things that are essential, while not saving the things that are frivolous?

Dave Rensin:

Yeah. Starting from the minimum.

Liz Fong-Jones:

As far as commerce is concerned, yeah.

Alex Hidalgo:

I mean, I think it’s important to remember we use the term complex system a lot in this industry. When we do, people’s minds often go to some microservice mesh where a bunch of things are talking to each other. But like, that’s a very old concept that other engineering disciplines have been using for a very long time. Like everything is a complex system, and it doesn’t even have to be a thing, right? I’m not just talking about the fact that the airplane is a complex system or a bridge, but our political system is, right? Our economic system is, individual organisms are. You can apply the same ways of how you think about your complex microservice system, you can apply how you think about things to our political system.

Dave Rensin:

We’re also going to have to unlearn some interesting lessons we thought we had learned. One of the arguments for the distribution of global manufacturing across a lot of different countries, rather than concentrating heavily in one country or another, was a thought about resiliency. I remember very clearly in the late ’80s and early ’90s, the debates around NAFTA and other trade agreements were, “Hey, this is going to create resiliency, so that we’re not just dependent on wherever in the United States or someplace in Taiwan, or whatever, and-

Liz Fong-Jones:

We should’ve learned that lesson a long time ago, right? Like the floods in Thailand that took out a whole portion of hard drive manufacturing, right? Like these are …

Dave Rensin:

Yeah, but that was a lesson only the tech industry learned. That was too localized to cause a global conversation, let’s say. So one of the lessons we’re learning now is, “Oh, actually having some capacity sharding across different …” Like, maybe we don’t want to concentrate … I don’t know, the production of all of our antibiotics with the lowest-cost provider, because it tends to concentrate at one place geographically, just as an example. So maybe making it a little more expensive, but sharding it across a lot of different places, will be better. So we’re learning that some things we thought we understood about global supply chains were wrong. Now we have to adjust them. So that’ll be a really interesting thing to watch happen.

Amy Tobey:

Awesome. Well, that’s time for us, everyone. Thank you so much to our attendees, and especially a huge thanks to our panelists, Liz, Dave, and Alex. I certainly appreciate your insights and had a great time. I hope everyone had a great time on this panel. We’re hoping to do more of these in the future. If you have more ideas for how we can connect with our community, I think the three of us are on Twitter. I’ll probably ask the folks to put our handles out there on the Blameless account again, just to make sure it’s easy to find. I’m Miss Amy Tobey. T-O-B-E-Y. And I look forward to hearing from you all, so stay safe, healthy. Stay at home. Bye, everyone.

Liz Fong-Jones:

Thank you.

Dave Rensin:

Thanks.

Alex Hidalgo:

Thanks so much.

If you see any typos in this text or have any questions, reach out to marketing@honeycomb.io.