Nora Jones and Charity Majors share their experience leading major movements shaping the future of shipping software. Nora Jones, CEO of Jeli and a former engineer at Netflix and Slack, will share her research and experience with chaos engineering, human factors, and site reliability. Charity Majors is Honeycomb’s CTO and co-founder, who pioneered observability as a software practice for modern teams.
Kelly Gallamore [Customer Advocacy Marketing Manager|Honeycomb]:
Welcome to o11ycon. We have a very full show for you today. So let’s get right to it. Please welcome to the stage Honeycomb’s CTO and co-founder, Charity Majors.
Charity Majors [CTO & Co-founder|Honeycomb]:
Good morning! Well, this is super weird. It’s Saturday, and I’m here staring at my own face, welcoming you all to o11ycon at god-awful early in the morning on a Thursday. It’s very strange. It’s been three years since we’ve done this. I’m guessing not many of you listening today were at the first o11ycon, because there were only like 150 people there. That was three years ago. But I really enjoyed it. I thought it was one of the best conferences I’ve ever been to. Clearly I’m not prejudiced in any way, but it was really warm, and it was really special seeing so many people show up with their own stories about this brand new, cutting-edge thing. At the time, observability was just starting to get out there, and people were asking, what’s the definition?
We’re in a very different place now, and that’s exciting too. For our keynotes this morning, you’ve got me and Ms. Nora Jones. We wanted Nora to talk about incidents because it’s a different lens on observability. It’s looking at one of the main levers you can use to both create high-performing teams and weaponize high-performing teams. It’s kind of like a ninja move to spread information and sort of get observability into every corner of your sociotechnical system and see how these parts overlap. So she’s going to be talking about that. I’m going to be talking about another aspect of high-performing teams in observability, which is CI/CD, and why everyone should be deploying to production automatically many times a day in 15 minutes or less because this is the way.
Nora Jones [CEO|Jeli]:
Thanks, Charity. Good morning, everyone. Today I’m going to be talking about a different type of observability: a different lens on it, through incident analysis. I know at previous o11ycons there’s been a lot of talk about past outages and what we can learn from them from that perspective. I’m going to take a different approach today. Now, we’ve all had incidents, and this can actually be a good thing; there’s a positive spin to put on incidents. They’re unexpected. They’re stressful. And sometimes inevitable questions creep up afterward, even if we don’t necessarily want to be asking these questions. We ask things like: What can we do to prevent this from ever happening again? What caused this? Why did this take so long to fix?
For several years, I studied why engineering organizations throughout the tech industry were doing incident reviews and post mortems. We all do them, but a lot of folks don’t have a good answer for why, and different folks have different takes on it. They do them. They check that box. But we got a lot of answers like: I’m honestly not sure. Management wants us to. It’s for the engineers; it gives them space to vent. I think people would be mad if we didn’t. We have some obligations to customers; we owe them this RCA. Or it’s for tracking purposes; we want to see that we’re doing better over time. What we got through this research study was that everyone agreed that post mortems are important, but we don’t agree on why they’re important. And even though orgs knew they were important on some level, they weren’t making efforts to improve that process. A lot of the time, we don’t know how.
Now, incident analysis is a skill: it can be trained and aided, but you have to invest in that training to improve it. I want to do an activity. I miss being hands-on at conferences, and this is my way of collaborating today. I want to show you a public RCA, and we’re going to take a few minutes to review it. Now, the purpose of this is not to call out bad examples; I actually want to see if this satisfies your curiosity. Imagine you were reading this as an employee at the organization, as a coworker, as a customer. I’m wondering if it actually tells you where to look. I’m going to read through some of this for y’all.
In this root cause analysis, we have the start date, we have the end date, we have some information about how it was detected, we have the root cause, we have some of the impact, and then we have a bunch of stuff that we’re going to do afterward. Take a second to digest this real quick. We see that the root cause is that our DNS host scheduled a restart for security patches, resulting in a simultaneous outage of the DNS servers in our environment. Then we have a bunch of action items: we’re going to update this playbook and that documentation. I think a lot of these action items, and this root cause and impact, could be said about almost any incident. We’re missing a big point here: a place where we can actually dig in and understand exactly how this happened.
Incidents are beautiful in organizations in that they’re one of the only times where processes go out the window and everyone is doing everything they can to stop the bleeding as soon as possible. You learn a lot about your organization in those moments, and it’s wasted learning if you don’t dig into how you worked together, who you needed to rely upon, which dashboards you needed to pull up, which systems you needed to interact with. None of those things are mentioned in this report, which removes our ability to dig deeper. A good report can actually tell you where to look technically. It can help us focus on the mechanism that triggered the incident, and help us figure out how the system works in practice versus in theory.
Now, I’m guessing you all still have a lot of open questions about this. I’m curious: in the chat, could you share what other questions you still have about the event? Charity Majors mentioned in a New Stack article a couple of years ago that without certain types of access to data, good luck debugging. She was talking about debugging a system with data like cart IDs, et cetera. Without understanding the team dynamics involved in these systems, we’re going to have a hard time understanding those technical aspects too. They’re symbiotic; they play together in nature. My talk today is about understanding some of the human aspects of our sociotechnical system so that we can understand the technical aspects even better.
Now, I want to come back to this. Post-incident reviews are important. I’m sure this organization thought the incident was important enough to create one, but these reviews are not great throughout the industry. What’s worse, when an incident is deemed to have a higher severity, we tend to give our engineers even less time to review it and figure out what happened. There was a pretty large incident that happened in the tech industry; it had Twitter abuzz, LinkedIn, The Register. It was the Salesforce incident. The status page was down. Everything was underwater. We saw explanations get published pretty quickly after the event, right, without a lot of time for the organization to learn. Those public reviews impact how we understand things internally too.
We can’t just slap on more training and more processes after incidents. We actually have to understand the mechanisms that triggered these events, all the pieces at play that helped lead to the incident. There’s a way you can quantify incident reviews too, but we don’t see a lot about people in that tracking. And I don’t mean just calling out a bad actor, like, Oh, this person wasn’t trained enough; we had him or her on call too soon. We need to look at the interpersonal dynamics. Have these teams worked together before? Have they interacted with these systems before? How are these systems used? What scripts were involved in the incident? We track a lot of metrics today, like MTTR and MTTD and the number of incidents, and all of these are shallow.
Now, John Allspaw has talked to us about a lot of this before. He asks: where are the people in this tracking? Where are you? We have not changed much as an industry in that regard, but there are little things we can do to help with that. Gathering this useful data about incidents does not come for free. We’re going to talk, through a couple of different stories, about how incident analysis can work in your favor, how to do it in ways that are not disruptive to your business, and the next steps to embark on. Spoiler alert! Sometimes it’s a mirror we’re not ready to hold up yet. You have to be willing to see and understand the things that are impacting your business.
I’m Nora Jones. I have spent a lot of my career in site reliability engineering, and I’ve done a lot of work on chaos engineering. In 2017, I keynoted to 50,000 people about the business benefits of chaos engineering and my experiences implementing it at jet.com and Netflix. I went to Slack, and I’ve helped people throughout the tech industry share and understand their experiences with incidents in their organizations. That led me to start Jeli, where we focus on the human side of the system alongside the technical. I want to share this quick graphic from psychologist Gary Klein. It’s from a book where he studies what makes people experts and how to build more experts in organizations. I think we can all agree that that is something we want. We want people to be prepared to own the system and to respond to the system. We don’t want to have to rely on the same people every time, because then they become people bottlenecks as well. This is not something he just made up; it’s pulled from years of research. Expertise is a combination of error reduction and insight generation. You cannot have one without the other.
The tech industry focuses a ton on the error reduction portion. That’s the counting of incidents. We don’t focus a lot on insight generation and dissemination. Sharing our experiences with an incident across our organization, and getting people to collaborate and share their own experiences, is insight generation and dissemination. It is hard to do a good job at it, and it’s something that can be improved upon. I’m going to tell three stories today about how we focused on this piece. They’re based on actual things I’ve experienced and been a part of, but names and details are changed. In the first, I worked with three other software engineers. They were brilliant. We worked to help engineers understand and ask more questions about errors in their systems, which behaved unexpectedly when running into turbulent conditions like failure and latency.
I was happy to be working on an engineering problem that could help the business find the weak spots, right, finding the unknown unknowns. But there was a problem: most of the time, the four of us who created this tooling were also its core users. We were using the tooling to create the experiments, to run the experiments, to analyze the results, to find these failure modes. We were running chaos experiments on systems we did not own. And what were the teams doing? They were receiving the results, and sometimes they were fixing the stuff we found, and sometimes they weren’t. Why is that a problem?
Well, we were not the ones on the teams whose systems we were running experiments on. They weren’t our mental models of how certain pieces of the system, like search or bookmarks, worked, or of what we expected our KPIs to do in certain chaos engineering cases. Those weren’t our mental models that needed refining, but we were the ones getting that refinement and understanding. We were leading the horse to water, but then we were drinking the water for it. Sometimes teams would use our tooling. It was usually after a big incident, or before a high-traffic day, or they would put it in their pipelines and forget about it, and we would have to remind them to use it several weeks later. It was tough, right? So what did we do?
We started to look at the steps they were taking, in order to give them easier access to the tooling. We wanted to give them more context on how important a particular vulnerability we found with the chaos tooling was. If we found a weakness in the system, we wanted to give them context on how important it was to fix. To put this in another context for you: I’m sure you could pull up any repo in your organization and find a ton of bugs. That doesn’t mean they’re important to fix at that moment. That’s the situation we had with the chaos experiments: how important are these bugs to fix? We needed to give them prioritization. In order to know if something was important to fix, I started looking at previous incidents. I started digging through them and finding systems that were underwater a lot, or people we relied on too much, or people that were not on call that we needed to pull in, or things that were really surprising to us and caused a lot of impact. I wanted to use that as a prioritization algorithm, to inform whether a team should run an experiment and what they should do with the results.
But what I found through that process of trying to drag more people to my chaos tooling was that incident analysis had far greater power than just driving traffic to that system. It opened my eyes to so much more, things that could help far beyond chaos tooling adoption. And I learned a secret: incident analysis is not just about the incident. The incident is a catalyst to understand how your org is structured in theory versus how it’s structured in practice. It exposes that delta for you. By exposing that delta, you can continuously improve over time, and you can know yourself a lot better.
By knowing your org better, you’re going to move faster. You’re going to move more efficiently and understand your customers more. It’s a catalyst to understand where you need to improve your sociotechnical system. And it’s showing what you’re actually doing really well at versus where you need improvement.
My next story, 3:00 a.m., when all the bad things occur. I was asked to look at this incident the day after. I walked into the office in the morning, and a senior engineering leader had pulled me aside and said something along the lines of, Hey, Nora, I’m not sure if this incident is all that interesting for you to analyze. I said, Why?
Well, it was really human error. He doesn’t know how to run the system. It could have waited until the morning. Now, blamelessness, I will talk about that in a minute, but a lot of orgs lack a deep understanding of what it means. They think it’s about being nice and not naming names, when that’s really not the case. It’s about making it a safe enough space to name names, and to allow someone like Kiran to share. If we don’t do that, people are going to repeat what was done and shove it under the rug over and over. And I don’t blame this organization; this could happen anywhere in the tech industry. When something like this happens, when a Kiran makes an error, it’s usually met with instituting a new rule or process in the organization. Doing that without explicitly stating that you thought it was Kiran’s fault, even though everyone in the org thinks that, and you think that, and Kiran knows you think that, is still blame. It’s not only non-productive, it hurts your organization’s ability to build expertise after incidents.
It’s also so much easier to add in rules and procedures. It covers us and allows us to move on, and that’s what we want to do. When something bad happens, we want an explanation, and we want to move on. But those rules and procedures don’t come informed from the front lines. It’s easy to say, Wow, he really shouldn’t have done that. Unfortunately, adding in these rules and procedures diminishes our ability to glean new insights from these incidents. In spite of being told not to bother, I wanted to talk to Kiran. According to the org, Kiran got an alert at 3 AM that, had he spent more time studying, he would have known how to debug; he would have known that it could wait until business hours to fix. I came in completely blank and asked him to tell me what happened.
He goes: I was debugging a Chef issue that started at 10:00 p.m. We got it stabilized, and I went to bed at 1:30 AM. At 3 AM, I received a Kafka alert. Interesting finding number one: Kiran was already awake, tired, and debugging a separate issue. I asked him what made him investigate the Kafka broker error. He said his team had just gotten paged for it; his team had been transferred the on-call rotation for this particular broker about a month ago. Other brokers were owned by different parts of the org, and a separate part of the org owned most of Kafka minus this broker. It was strange. I said, Had you been alerted for this before? He said, No. But I knew this broker had some tricky nuances.
Which led me to my second interesting finding. Why did they own it? How do on calls and expertise work? If this broker is so tricky, why did we put them on call for it after only a month of ramping them up on it? I asked him how long he’s been at this org. He said five months. He was new to the organization. He was on call for two separate systems in the middle of the night. I think if I was in his shoes, I would have also answered that page. I would have been afraid to just snooze it and go to sleep.
This led us to a lot of interesting changes in the organization. Rather than adding guardrails or taking Kiran off call, we were able to improve how we think about our on-call systems: who we’re putting on call for them, how we’re divvying them up, how we’re ramping up new engineers, and automatically taking someone off a rotation if they’re on call for two separate systems. None of those things would have happened if we had not talked to Kiran one on one and given him that space. I want to go back to this: post-incident reviews are important. They’re often not good, but they can be.
Cognitive interviewing: we can use it to determine what someone’s understanding of the event was, what stood out for them as important, what stood out for them as confusing or ambiguous, and what they believe they know about how things work that others don’t. You have to see what was happening for them in that moment. What they did made sense to them at that moment; they were doing those things for a reason. Extracting that info will improve your whole system later on. These interviews can be used to glean insight on relevant projects, past incidents that were related, past incidents whose action items led to this incident, or even past experiences. And you can iteratively compare and contrast these with other sources of data, like your monitoring solutions, your Slack transcripts, or your architecture diagrams, and use them to form a more cohesive view of what happened.
My last story is around promotion packets being due. I will move quickly through this one. In this organization, there was an uptick of incidents during a certain time of year. What ended up happening, after looking into this, is that managers were putting together promotion packets to make a case for their employees to be promoted. As the organization grew, this became very deadline-driven. People were putting these packets together quickly, and engineers were incentivized to push a bunch of stuff they had agreed to months earlier just to get promoted, even if that stuff was no longer relevant. So we saw a lot of merges and deploys happening at the same time, around things that were not super relevant anymore.
Guess what: incidents upticked. By looking into it more and finding themes across incidents, we were able to change things in this organization so that promotion packets were no longer one big push. Again, a good incident analysis should tell you where to look, and you need good individual data points in order to see the themes across them.
I want to show you an observability chart from a real incident. It’s showing how people interacted in the moment. Here, you may be interested in the gaps in chatter. You may be interested that customer service was the only team talking late at night; I wonder if they were getting support. You may look at the storm from PagerDuty. You may look at the teams involved and their participation levels. You may be interested in whether we leaned on folks that were not on call and needed them to unlock tribal knowledge.
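To make the kind of chart described above concrete, here is a minimal sketch of how you might pull those signals out of an incident-channel export. The message records, names, teams, and the 30-minute gap threshold are all invented for illustration; a real export (from Slack, PagerDuty, etc.) would have its own schema.

```python
from datetime import datetime, timedelta

# Hypothetical message records pulled from an incident channel export:
# (timestamp, author, team). All values here are made up.
messages = [
    (datetime(2021, 6, 9, 22, 5),  "kiran", "infra"),
    (datetime(2021, 6, 9, 22, 7),  "dana",  "customer-service"),
    (datetime(2021, 6, 10, 1, 40), "dana",  "customer-service"),
    (datetime(2021, 6, 10, 1, 42), "dana",  "customer-service"),
    (datetime(2021, 6, 10, 3, 5),  "kiran", "infra"),
]

def chatter_gaps(msgs, threshold=timedelta(minutes=30)):
    """Return (start, end) spans where the channel went quiet."""
    msgs = sorted(msgs)
    return [
        (a[0], b[0])
        for a, b in zip(msgs, msgs[1:])
        if b[0] - a[0] >= threshold
    ]

def participation(msgs):
    """Count messages per team: who carried the incident?"""
    counts = {}
    for _, _, team in msgs:
        counts[team] = counts.get(team, 0) + 1
    return counts

# Two long quiet spans, and customer service doing most of the talking
# late at night, which is exactly the kind of thing worth asking about.
print(chatter_gaps(messages))
print(participation(messages))
```

Nothing here is sophisticated; the point is that the raw material for these human-side questions is already sitting in your chat and paging tools.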
I want to quote something that Charity said before: When you’re flipping through a bunch of dashboards trying to figure out what’s happening, you’re not reasoning about it or following a trail of meaningful breadcrumbs. You’re jumping straight to the end and guessing. It’s as though the entire system was a black box and you had no way to know what happened in what sequence. Incident analysis can help you reason about it better. It can help you dig into what people were looking at at particular times, and how they were debugging things. A good incident analysis can help you with a lot of things: headcount, training, promotion cycles, organizational changes, who came into your organization that wasn’t supposed to.
It can also help you understand your technical systems a little better. It can help you figure out the things everyone is confused by, the things to focus on, and where to instrument. Now, not every incident needs to be given more time and space to analyze. There’s a bit of a formula for this. You can look at ones where more than two teams were involved, especially if those teams have not worked together before, or ones that involved a misuse of something that seemed trivial, like expired certs; there’s usually something else to look at there. Or if the event was really bad. If it took place during something like an earnings call. If a new service, or a new interaction with a service, was involved. Or if more people joined the channel than usual.
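The heuristics above can be sketched as a simple scoring function. The field names and weights below are entirely made up for illustration; the talk gives the signals, not a formula, so treat this as one possible way to rank which incidents deserve a deeper review.

```python
# Hypothetical incident record fields; real systems will differ.
def review_priority(incident):
    """Score an incident: higher means more worth a deep review."""
    score = 0
    teams = incident.get("teams", [])
    if len(teams) > 2:
        score += 2
        if not incident.get("teams_worked_together_before", True):
            score += 2  # unfamiliar teams coordinating is worth studying
    if incident.get("trivial_seeming_trigger"):   # e.g. an expired cert
        score += 2
    if incident.get("severity_high"):
        score += 1
    if incident.get("during_key_event"):          # e.g. an earnings call
        score += 2
    if incident.get("new_service_involved"):
        score += 1
    if incident.get("unusual_channel_turnout"):
        score += 1
    return score

incident = {
    "teams": ["kafka", "infra", "search"],
    "teams_worked_together_before": False,
    "trivial_seeming_trigger": True,  # an expired cert
}
print(review_priority(incident))
```

A score like this should only ever be a nudge toward which incidents get analyst time, not a replacement for the judgment the talk describes.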
Now, I definitely have a lot more to share on incident analysis and some of the things you can learn from it. Some of the places you can find out more are the learningfromincidents.io website, where we open-source a lot of our learnings from across the tech industry. We’re also working on a lot of these problems at Jeli.io. Thank you so much.
Thanks. That was so awesome. I feel like I’ve heard little bits and pieces of that talk over the last few months or however long, but I don’t think I’ve heard it all together.
Thanks. It’s definitely a fun experience talking about a different portion of observability; right?
We can use some of the human systems involved to help us understand the technical a little better. I think that gets lost on a lot of organizations, and it’s a big miss.
In your experience, obviously, there’s a lot of feedback looping going on here, but doing this sort of analysis can make your teams better. Do you find that it typically takes a high-performing team to do these sorts of analyses? Or can a scrappy group use it as leverage?
It doesn’t just have to be high-performing teams, but it does have to be supported by leadership. Like: hey, we want to give folks the time and space to learn. Otherwise, people are just going to spin their wheels more. My favorite story: I was working with a senior engineer who had been trying to get promoted to staff for five years. He was so frustrated that he wasn’t able to get it. We started coaching him on incident analysis, on running reviews better, on understanding different parts of the system. At the end of it, he said, Nora, I’m seeing in color now. I know I’m not just an administrator. I know how the system works in a lot of ways that not a lot of other people at the org do. He got the promotion at the end of that. Not because he was running more post mortems, but because he knew who to talk to, and he was able to look at design reviews and poke holes in things.
He could better navigate the technical systems and the human systems. You know, we throw this word sociotechnical around a lot, which I love because it’s one of my favorite words. Something I’ve used a couple of times recently to try to drive this home: if you take the New York Times, which has a great engineering team, and you send their people all to Cancun, but you replace them with engineers who are equally experienced in skill sets and everything, and then the site goes down, how long would it take the new ones to figure it out? First, they have to figure out how to log into the system. So much of the system is literally in your head and my head; it’s in the team’s heads. Humans are not fungible. They’re part of the system.
Yeah. And one of the reasons I really harp on tenure and stuff is because, guess what, in the tech industry, when people vest, they usually leave organizations, and they take a bunch of that knowledge with them. It’s in the organization’s best interest to continue investing in and training your people. If that group of folks left and you replaced them with equally trained engineers...
They don’t have context and history.
Even if they write it all down, that’s a pale imitation of having lived through the experiences. And, you know, something that used to make me... all right, still makes me really bitter, is when CTOs and CIOs are clearly looking for a vendor to tell them: Give me millions of dollars, and you will never have to understand your systems again. Your engineers will never have to understand your systems again. We’ll show you the answer. We’ll tell you the answer. We’ll tell you what’s happening. From the perspective of the exec, people come and go, but vendors are forever. I hate that.
I like the way you turned this on its head, and you’re going: but we can invest. Instead of investing in these robot systems, we can invest in human systems and keep going. It’s about strength. If you’re rotating your team every couple of years, there aren’t going to be these hairy corners where only two people know what’s happening, right?
Exactly. It levels everyone up.
Levels everyone up.
And the incident review shouldn’t be run by the one lone SRE. If you spread out who’s running the reviews, then you start leveling everyone up. I don’t know if you have ever given a talk on something, or run an incident review on something... you mentioned I wrote the chaos engineering book. I knew a lot about it before, but I learned even more throughout writing the book. That’s the thing about reviews: you learn more by doing them. By investing in and allowing all these different people in the organization to do stuff like that, the business benefit...
Leveling everyone up. You’re both creating and reinforcing an internal team practice and culture that everyone can participate in. It not only makes them better engineers, it’s also binding. It’s one of the things that can bind people together.
It encourages teamwork, right?
I love what you said. It’s not about not naming names; it’s about making it safe enough to name names. I had never heard that before. Damn. That’s so true.
I’ve seen so many organizations like, we’re blameless. We’re not going to mention Jack’s name.
We’re not going to say the name of the person who took everything down.
That makes it so much worse. It’s a million times worse by not doing that. You know people can find it out anyway. Why not just put it in the review? And if you can’t do that, that is something you need to fix.
That’s not blameless. It’s just ignorant.
I want to know about Jack’s tenure. I want to know what teams he’s been on.
Where do they come from? That’s super interesting. Thank you so much for being here. I’m going to do another talk now and Nora is going to stick around to dissect everything afterward.
My talk is going to be about everything: CI/CD, observability, teams. TL;DR: it’s time for us, as an industry, to finally fulfill the promise of continuous delivery. For those of you who don’t know me, I’m the co-founder and CTO of Honeycomb. I co-wrote Database Reliability Engineering with Laine Campbell, and Observability Engineering with Liz Fong-Jones and George Miranda. If you want it, you can download it. So, how well does your team perform?
This is one of those fun, uncomfortable questions. It’s not the same as asking how good your team is at engineering. It’s a very different question. High-performing teams get to solve problems that move the business forward every day. It’s a great career. You’re a creative software engineer. You get to build sandcastles in the sky. Lower performing teams spend most of their time not doing those things. It’s a lot of firefighting, waiting on each other, solving problems over and over again. Fighting with their tools. There’s a lot of toil. And if you’re wondering how high-performing your team is, I recommend people start with the DORA metrics.
How frequently do you deploy? How long does it take for code to go live? How many deploys fail? How often does your team get paged after hours? How quickly are people going to get burned out? There is a wide gap between the elite teams and the bottom 50%. I’m not fond of that word, elite; I tend to call them high-performing teams. They deploy many times a day. The bottom 50% deploy somewhere between once a week and once a month. Times to restore service go from under an hour to between a week and a month. Oh my God. And if you look at them year over year, the top teams are achieving lift-off. More people are getting better, because the tools are getting better, the practices are getting better, and we’re all sharing them. The bottom 50% are losing ground. In software, if you’re standing still, entropy has got you.
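If you want to start measuring where your own team sits, two of the DORA metrics mentioned above fall straight out of a deploy log. Here is a minimal sketch; the record shape (merge time, deploy time, failed flag) is an assumption, and a real analysis would use medians over much larger windows.

```python
from datetime import datetime

# Hypothetical deploy log: (merged_at, deployed_at, failed) per change.
deploys = [
    (datetime(2021, 6, 1, 9, 0),  datetime(2021, 6, 1, 9, 12),  False),
    (datetime(2021, 6, 1, 11, 0), datetime(2021, 6, 1, 11, 15), False),
    (datetime(2021, 6, 2, 10, 0), datetime(2021, 6, 2, 10, 20), True),
    (datetime(2021, 6, 2, 14, 0), datetime(2021, 6, 2, 14, 10), False),
]

def lead_time_minutes(deploys):
    """Mean minutes from merge to live (a stand-in for lead time)."""
    waits = [(d - m).total_seconds() / 60 for m, d, _ in deploys]
    return sum(waits) / len(waits)

def change_failure_rate(deploys):
    """Fraction of deploys that failed."""
    return sum(1 for *_, failed in deploys if failed) / len(deploys)

print(lead_time_minutes(deploys))    # minutes from merge to production
print(change_failure_rate(deploys))  # fraction of failed deploys
```

Deploy frequency and time-to-restore work the same way: count deploys per day, and subtract incident start from resolution time.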
It really pays off to be a high-performing team: 208 times more frequent code deployments. Obviously, the question is how to make these high-performing teams. Hire the smartest people and the best engineers, probably from MIT and Stanford exclusively? Wrong. It actually works in the opposite direction. Same correlation, opposite causation: great teams are what make great engineers. Just think about it. Two kids straight out of school. One joins a high-performing team on the left. One joins a medium-performing team on the right. Who is going to be the better engineer in a couple of years? The one who gets to deploy constantly and who spends a tiny fraction of their time firefighting? Or the one who gets to deploy five times a year?
Every time you deploy, it’s an opportunity to learn, foremost. You build something. You deploy it. You learn something. Over and over again. If you only deploy five times a year, you’re probably still learning some things; they’re just not likely to be very fun things, or useful things, or the things you want to be learning. What happens when Google or Facebook engineers join the team in the blue bubble? Welcome! The team brings them up to meet it. Your productivity can rise or fall to match the team you’re on within a few months. The act of shipping software safely and securely and swiftly has very little to do, relatively speaking, with your personal ability to write good code. It has much more to do with the sociotechnical system you participate in.
It’s all about the deploy systems in place: the CI/CD story that you have, how long it takes to get your code reviews turned around, how many automated checks and tests there are to catch things so that they can be fixed before they get deployed live. Which is why anyone who thinks of themselves as a technical leader, not just managers but definitely managers, needs to focus intensely on constructing and tightening the feedback loops at the heart of the system. I believe that if our leadership class spent half as much time, energy, and vigor tending to those tight feedback loops, and keeping them short, as they do on interviewing, setting a technical bar, and making sure new employees meet that bar, it would be time well spent. It would be easier on everyone, and you would get better engineers out the other side.
Sociotechnical is a great word. It's one of my favorites. One of the reasons I like it is because you know what it means; but in case you didn't, there's the definition. This brings us to CI/CD because shipping is the heartbeat of your company, as my good friends at Intercom like to say. This means shipping new code should be as regular, as ordinary, as boring, as commonplace, as unremarkable as your heartbeat. It should be small. You don't want to bang a drum with every heartbeat. You want it to be consistent and regular. You don't want to have to think about it. You want to be thinking about the code you're writing, not about shipping the code. CI/CD is how we get there. If I ask you, do you do CI/CD? Of course you do CI/CD. Everyone I've talked to: of course we do CI/CD. Drilling into it: well, we have a CircleCI account. We run some automated tests. Cool, right? Most people at this point are doing CI, or a derivative thereof.
But that's the prelude. It's the precursor to the main course. The entire fucking point of CI is to clear the path to be deploying constantly. Continuous deployment is what will change your life. Years ago, when the book on continuous delivery was written, they used a lot of weasel words about how it was getting your system to a state where it was ready to be shipped at any given time, as though that was good enough. Well, it was good enough at the time, because they were shipping shrink-wrapped software. So they had an excuse. You probably do not. If your CI never ends with shipping to production, why even bother? Set up a batch job and get the same result. Any CI/CD run that ends without having deployed your code to production is a failed run.
I've been harping on about feedback loops here, so instead of criticizing everyone, I want to go back and talk about what a good feedback loop looks like. I'm glad you asked. A good feedback loop looks like this: an engineer writes code and merges their change set to main, with any user-visible changes hidden safely behind a feature flag. That merge automatically triggers CI to run tests, build an artifact, and deploy that artifact, either straight to production or to a staging environment from which it is automatically promoted to production.
There are a couple of very important things to note about this. First, the time elapsed should be 15 minutes or less. Very important. One change set by one engineer per artifact. Period. No manual gates. I don't actually care if it takes, you know, an hour or two to be deployed to staging and then promoted to super staging and then promoted to production. That counts, as long as there are no manual gates and no variable times. It should be predictable. It should be something that your body clock can hook into and go: yep, I'm merging. I know that in five minutes it will be live, and I will go look at it.
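The loop described here can be sketched in a few lines. This is a hypothetical illustration, not Honeycomb's actual pipeline; `run_tests`, `build`, and `deploy` are stand-ins for whatever your real CI system does.

```python
import time

DEPLOY_LOG = []

def run_tests(change_set):
    # Stand-in for your automated test suite.
    assert change_set, "empty change set"

def build(change_set):
    # Stand-in for building exactly one artifact from one change set.
    return {"artifact_of": change_set}

def deploy(artifact, env):
    # Stand-in for an automated deploy; note there is no approval step.
    DEPLOY_LOG.append(env)

def pipeline(change_set, budget_seconds=15 * 60):
    """One engineer's change set in, one artifact out, deployed to
    production with no manual gates, inside a fixed time budget."""
    start = time.monotonic()
    run_tests(change_set)
    artifact = build(change_set)
    deploy(artifact, env="staging")
    deploy(artifact, env="production")  # auto-promotion, not a human gate
    elapsed = time.monotonic() - start
    assert elapsed <= budget_seconds, "blew the 15-minute budget"
    return artifact
```

The key property is that nothing between the merge and the production deploy waits on a human; the only way the run ends is with the artifact live or the pipeline failing loudly.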
Because if you’re an engineer and you’re merging some code and you know that your changes will be live and users will be using them in five minutes or less, you’re going to go look at them through the lens of the instrumentation that you just wrote. On the flip side, if you’re an engineer who just merged some changes and you know at some point in the next 6 to 48 hours, your changes and anywhere from 0 to 20 other changes will have been shipped to production, you’re not going to go look at it. Nobody’s ever going to go look at it.
It's important that user-visible changes are hidden behind flags, because this is what decouples deploys from releases. I'm sympathetic to the fact that product managers and marketing people have launches to plan and schedules and everything. Nobody is saying we should jerk all of our users around willy-nilly. Severing deploys from releases is a critical part of this. This is the way. Favorite use of a meme this year. I'm stoked about that.
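A feature flag that severs deploy from release can be as simple as a lookup that defaults to off. A minimal sketch, with a hypothetical in-memory flag store standing in for a real flag service (the flag and function names here are invented for illustration):

```python
# Deployed dark: the new code path is live in production, but off.
FLAGS = {"new_checkout_flow": False}

def is_enabled(flag_name: str) -> bool:
    """Look up a flag; default to off so a missing flag never releases early."""
    return FLAGS.get(flag_name, False)

def old_checkout(cart):
    return {"path": "old", "total": sum(cart)}

def new_checkout(cart):
    return {"path": "new", "total": sum(cart)}

def checkout(cart):
    # The deploy shipped both paths; the *release* is flipping the flag.
    if is_enabled("new_checkout_flow"):
        return new_checkout(cart)
    return old_checkout(cart)
```

With this shape, engineering ships whenever the pipeline runs, and product or marketing choose the release moment by flipping the flag, with no redeploy required.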
What's important about this is that precious interval between when you wrote the code and when the code is deployed, because keeping that time as small as possible is really... it's not everything, but it's a lot. It's the difference between software engineering being a creative, fulfilling career and you being a ticket-taking machine where you take the tickets and grind them out. I firmly believe that. That context ages like fine milk. At the moment you have just finished solving a problem, merged your changes, gotten them reviewed, and feel good about them, you know as much as anyone will ever know about your original intent: what you meant to do, why you meant to do it, the implementation details you tried and discarded, the tradeoffs you had to make, down to the specific variable names. Everything's fresh right there. And if you go and look at users using it right then and there, you are primed to notice: that isn't quite what I meant. That's higher than I expected it to be. This is a little weird.
You can never get that back, and neither can anyone else. That state of mind begins to degrade as soon as your focus shifts from this software to the next thing. It lasts for minutes, maybe. Which is why engineers can find 80% of all bugs or more in that magical, fleeting 15-minute interval, as long as they have good observability tooling, instrument their code, and have the muscle memory to go look at it. Just ask yourself: is it doing what I meant it to do?
It lets you hook into the intrinsic reward systems. Dopamine. Muscle memory. It's so good, right? You do this for a couple of days, a few weeks, and then when you merge code, you get this antsy feeling in your body. You know you're not done yet because you haven't closed the loop. You haven't looked at it yet. You haven't gotten your dopamine hit yet. Finding bugs before the users find them is the best. Having a very short interval isn't only about your attention, although it's a lot about your attention. It also helps you keep to one engineer's changes per artifact per deploy. Which leads you to software ownership. If you don't have that, it starts to get grim really fast. And, in fact, if your interval is not 15 minutes but more on the order of hours or days, you're almost certainly batching people's changes together. You're almost certainly not deploying automatically. Then you enter what I think of as the software development death spiral.
It starts with a longer interval between when the code is written and when it's deployed. This leads to people writing larger diffs and longer turnaround for code review. Changes get batched up, which severs ownership and makes it hard to identify whose code is at fault. More and more engineering cycles are spent waiting on each other. Every time you ship a deploy with 25 people's changes in it and it fails, you're in for a bit of fun, and so is everyone who has a change wrapped up in that mess. It's going to take you the rest of the day, more than likely. Just give up. Now you probably need an SRE team. You might even need a build-and-deploy team. You probably need a release team to try to automate a bunch of this stuff. Now you've got specialist teams. Teams need managers and more coordination. You need product teams. You need project managers. Everything takes more time, and the coordination costs are enormous.
You can either spend your entire life as a technical leader chasing all the symptoms and pathologies that flow forth from this, or fix it at the fucking source. Fifteen minutes or bust. How much is your fear of continuous deployment costing you? Well, I have a rule of thumb: if it takes X engineers to build, maintain, and run your software with a deploy interval of 15 minutes or less, it'll take twice as many engineers with an interval of hours. With an interval of days? Twice as many again. Weeks? Twice as many again. And if you think that I'm exaggerating, well, you're right that I mostly pulled this out of my ass. I mean, not completely, but it's not like I've done research studies on this. But the evidence suggests that, if anything, I'm actually being too conservative. Because the best data we have was the work done by Facebook, where they very rigorously showed that the cost of solving a bug goes up exponentially from the moment the bug is written. Exponentially more expensive to find, isolate, reproduce, understand, solve, test, ship, et cetera.
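This back-of-the-envelope rule (explicitly a rule of thumb, not a study result) works out like this:

```python
def engineers_needed(baseline: int, deploy_interval: str) -> int:
    """Rule of thumb from the talk: headcount roughly doubles each time
    the deploy interval stretches to the next order of magnitude.
    Illustrative only; the speaker says herself it isn't rigorous research."""
    doublings = {"15 minutes": 0, "hours": 1, "days": 2, "weeks": 3}
    return baseline * 2 ** doublings[deploy_interval]

# A team of 10 shipping in 15 minutes stays a team of 10...
engineers_needed(10, "15 minutes")
# ...but the same workload at a weeks-long interval costs 10 * 2^3 = 80.
engineers_needed(10, "weeks")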
Have you ever looked at one of those companies that has, like, 200 engineers and thought: how does it take that many people to do that amount of work? Just saying. If you've ever been an engineering manager who said, God, what would I do with twice as many engineers? Well, have I got a proposal for you. Fix your fucking build pipeline and let's find out. But it isn't just about the economic arguments.
How well does your team perform? Just to remind you: high-performing teams get to spend their lives solving fun puzzles together and having a lot of impact, moving the business forward, being more successful. Lower-performing teams do not. It's a grind. It's a drudge. It's not fun. Which means this is more than just an economic issue. It's also a quality-of-life issue. It's an ethical issue. I'm tired of hiring people onto teams and having them just look dead in the eyes. Or having them distrust me when I tell them the process of shipping software doesn't have to be that bad. The number of cycles that technical leaders spend curating and tending to the internal feedback loops by which software gets made manifest is microscopic. And I've got to say, no one who's worked with true CI/CD is ever willing to go back. So it's really good for hiring. It's really good for recruiting and retention.
By focusing on these inner feedback loops, by focusing on how the sausage gets made, we can build a more humane generation of sociotechnical systems that build the next generation of great engineers. We can build great teams that foster great engineers. We can build systems that are well instrumented and well understood, not just several years of whatever got coughed up in a hairball and shoved under our beds. They can be not noisy. They can be compatible with being an adult, having kids, having families, having other responsibilities.
All we all want out of our jobs anyway is to work on teams that are low in toil and high in autonomy, mastery, and meaning. And how? Hire people who share your team's values. Spend time investing in your sociotechnical feedback loops. Invest in instrumentation and observability. You wouldn't drive down the highway without your glasses, would you? And then instrument, observe, iterate, ship, and repeat. It's not that hard. It's literally not that hard. The way you're doing it now is the hard way. This way is easier. Are you too busy to improve? I hope not. The end. All right. I'm done.
That was awesome, Charity. I really liked your point about burnout and increasing coordination costs too. It's interesting: I feel like when people feel they aren't shipping enough, an immediate reaction is, oh, we just need to hire more people.
Yeah. And it's so costly. I mean, sometimes it's got to be done, but not usually. I don't like to give more headcount to any engineering manager who's underwater. Until they learn how to say no and keep the right amount of work for their team, I would rather give headcount to a manager where I'm like, you're doing such great work, I want more of it, and they're like, I cannot do it until you give me more people. Please, take more people. That's the dynamic that should happen.
It’s a mixed bag for sure.
You talked a little bit about shipping quickly. I’m curious how you think staging environments and our attitudes towards them impact accomplishing that?
Death to all staging environments. I say this knowing that we have three or four of them at Honeycomb. You know, my beef is not with staging environments. I think they have a lot of really good uses. I just think that historically we have spent almost all of our engineering focus, time, and energy on staging, making it elaborate with all the tests so we can be so sure of everything. Then when it's, like, what about production? We don't have any time left over for that.
If we spent 80% of that engineering effort on production first, making it safer to ship, making it so there are guardrails, so there is instrumentation and observability, making it so everybody knows how to deploy, making it great, and then gave our leftover dev energy to staging, I think all would be well in the world. So I'm not an abolitionist. Staging has just been historically misjudged. But I feel like in the last five years this has started to shift.
I feel like the gravitational center has started to shift away from preproduction to production. And I think you see all the little start-ups that started bubbling up, like us and Gremlin and LaunchDarkly and you guys, all of whom believe production is the thing that matters most. I think we're starting to get there. It's just scary.
Yeah. And it feels like it'll help at first, but actually, like you said, it increases your coordination costs and leads to more burnout. Which totally impacts high-performing teams. And then you don't have a high-performing team anymore.
Another thing I feel like people haven't quite internalized is that as an engineer, you have maybe three, maybe four hours a day of engineering labor in you. That's all you get. That's pretty inelastic. You can spend a lot more time at work doing things, but when it comes to the focus that moves the business forward, you've got three or four hours a day in you, honey. That's it. So try to make the most of it and make sure you're moving the business forward in those hours. But there's no reason to burn yourself out. No one can write good code for 12 hours a day. It can't be done.
Thanks so much for joining me. It was really fun to have you.