There is a yawning gap opening up between the best and the rest: the elite top few percent of engineering teams are making incredible gains year over year in reliability and freedom from technical drag, while the bottom 50% are losing ground.
Charity Majors (CTO & Co-founder, Honeycomb):
All right. Good morning. Good evening. I’m going to be talking about the socio-technical path to high-performing teams. For those of you who don’t know me, my name is Charity. I am the co-founder and CTO of Honeycomb.io. I’m an infrastructure engineer and co-author of the O’Reilly Database Reliability Engineering book. What I want to talk to you about today is teams, basically. Teams are the basic building block of everything that we do in tech. I like to think of them as RAID for humans, right? Teams are how we make each other redundant in a good way. Teams are how we cover for each other. Teams are how we function as groupings of individuals within larger groups. And the teams that you join have the greatest impact of almost anything on your career: a greater impact than which subspecialty you choose, an even bigger impact than how good an engineer you are.
The teams that you join really do define your career in technology. I’ve been fortunate enough to work on some amazing teams, and I’ve also had the experience of working on some pretty crappy teams. What’s interesting is that the crappy teams were, as Tolstoy might have put it, all crappy in their own unique ways, but what they had in common was the way they made me feel. It didn’t matter whether they were teams on the cutting edge of tech or teams stuck with really shitty tech. What I’m getting at is that I believe work should be more than what we do because we have to do it, right? It should be more than just a salary. Work has the potential to be part of a life well-lived, a highly functional, highly fulfilling life. In order to be part of a fulfilling life, our jobs need to provide autonomy, mastery, and impact. Oftentimes people confuse this with belonging to a team that is high-performing, and the two tend to go together, although they are not synonymous.
It does not feel good to be on a low-performing team. It doesn’t feel good to be wasting a lot of your time and energy. It does not feel good to be solving the wrong problems. So I think the question of how we make our teams higher-performing is really a question of personal fulfillment. We’ve been stumbling along in the tech industry for years without much data about what makes teams better at what they do. That has changed, and for that we can thank Jez Humble, Nicole Forsgren, and Gene Kim. If you haven’t read Accelerate, you should absolutely read it. One of its key findings was that there are four key metrics that every team should be tracking for themselves, and you should know where you stand, because where you stand on these four metrics really does correlate with how high-performing your team is.
The four metrics being, of course: how often you deploy; the lead time between when you’ve written the code and when it goes live; how many of your deploys fail; and how long it takes to recover from each outage. I would add a fifth thing that we should all be tracking, which is the number of times your team is alerted outside of working hours. That one doesn’t track directly to how high-performing your team is, but it does tightly correlate with how burned out your team members are going to be. So we should all be tracking these. It turns out that once you know these things, you can figure out where you are compared to other teams, and there’s an enormous gap opening up between the high-performing teams and the rest of us.
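As a rough sketch of what tracking these looks like, here is a minimal Python calculation of the four metrics plus the out-of-hours alert count. The record shapes and the 09:00–18:00 working-hours window are illustrative assumptions, not a real schema; in practice these numbers would come from your CI/CD and paging systems.

```python
from datetime import datetime

# Hypothetical deploy records: (commit_time, deploy_time, failed?, minutes_to_recover)
deploys = [
    (datetime(2023, 5, 1, 9, 0),  datetime(2023, 5, 1, 9, 20),  False, 0),
    (datetime(2023, 5, 1, 11, 0), datetime(2023, 5, 1, 11, 15), True,  25),
    (datetime(2023, 5, 2, 10, 0), datetime(2023, 5, 2, 10, 10), False, 0),
]
# Hypothetical alert timestamps
alerts = [datetime(2023, 5, 1, 3, 30), datetime(2023, 5, 1, 14, 0)]
days_observed = 2

# Metric 1: deploy frequency (deploys per day)
deploy_frequency_per_day = len(deploys) / days_observed

# Metric 2: average lead time from commit to live, in minutes
lead_times = [(d - c).total_seconds() / 60 for c, d, _, _ in deploys]
avg_lead_time_min = sum(lead_times) / len(lead_times)

# Metric 3: change failure rate (fraction of deploys that failed)
failures = [m for _, _, failed, m in deploys if failed]
change_failure_rate = len(failures) / len(deploys)

# Metric 4: mean time to recover, in minutes, over failed deploys
mean_time_to_recover_min = sum(failures) / len(failures) if failures else 0.0

# Metric 5: alerts outside an assumed 09:00-18:00 working-hours window
out_of_hours_alerts = sum(1 for t in alerts if not 9 <= t.hour < 18)

print(deploy_frequency_per_day, avg_lead_time_min, change_failure_rate,
      mean_time_to_recover_min, out_of_hours_alerts)
```

The point is that each of these metrics reduces to simple arithmetic over event streams you almost certainly already have.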
Year over year, more and more teams are becoming very high-performing. I don’t like the term elite, so I’m going to say high-performing, though the report uses elite. Meanwhile, the bottom 50% are actually losing ground, which is terrifying, but it also makes sense, right? I think we’ve all noticed that in tech, if you’re standing still, you’re actually losing ground, because of entropy, basically. It really pays to be on a high-performing team. Look at the difference in how often you deploy: on one of the high-performing teams, it’s a few times a day; in the bottom 50%, you’re in once-a-week to once-a-month territory. The other thing I wanted to show you is that we waste a lot of time. A lot of time. If you look at the Stripe Developer Report, they surveyed thousands of engineers and found that almost half of our time is basically wasted. Just wasted.
This isn’t the fun half of our time, where we’re slacking off and getting a bagel or a donut or going for a walk or a cigarette break. This is the bullshit time: fighting your way through a messy CI/CD pipeline, grinding through the work you have to do just to get to the work you want to do. It’s redoing work. It’s when you’ve gone down the wrong pathway and have to undo that work and go down another one. It’s when you can’t even orient yourself and figure out what the problem is. It’s when you’re trying to reproduce a problem, or fighting with somebody else’s leftover mistakes from yesterday. I’m all for slacking off. In my world, you have about four hours a day of really concentrated, focused work in you: learning, novel work, writing new code, solving new problems. That’s about as much as anyone can stand cognitively. Four really good hours.
I’m not saying that anyone should be trying to get 16 hours a day out of people, because that’s literally physically impossible. What I am saying is: make those four hours fucking count. Anything that stands in the way of you using those four hours to move the business measurably forward every single day is a drain on your life force, basically. I mentioned earlier that it really pays to be on a high-performing team. You ship 208 times more often if you’re on a high-performing team. That does not mean those are better engineers; it means the structure around them is better. This whole talk is about socio-technical systems, the systems that support you in being efficient with your time and not having to reinvent the wheel every day.
Just imagine you’re a new grad and you join a team. What do you think the difference is going to be after a year if you’ve joined a high-performing team versus a median, 50th-percentile one, when you’ve shipped 200-odd times more often and spent 2000% less time firefighting? It does not mean the engineers on the high-performing teams are better, because what I’ve seen throughout my entire career is that you join a new team, and within a few weeks or a couple of months you will rise or fall to the level of its productivity. I know we don’t like to think of it this way, because we’re all very individualistic, at least here in America, but very little of your ability to make a difference and move the business forward is defined or limited by your knowledge of algorithms or data structures.
Most of it is determined by the structures around you: your CI/CD pipelines, your deploy process and deploy scripts, all the libraries you use, all of the developer environment stuff. All of that stuff around you that supports you in writing and shipping code and moving swiftly and safely with confidence has an enormous impact. It doesn’t matter if you’re a Google engineer joining a low-performing team. There’s a lot of hubris from engineers who have worked on high-performing teams before and think that through the magic of their presence they can transform a low-performing team. Doesn’t work that way, man. Now imagine you’re a new grad engineer who has joined a high-performing team; a year later, imagine how much further along you’re going to be as a person. This is what I mean when I say that the teams we join define the trajectory of our careers more than almost anything else.
A lot of people tend to think, “Well, how do we build a high-performing team? Just hire the best engineers.” That is bullshit. It’s completely false. What does best engineers even mean? The person with the best knowledge of data structures and algorithms? That does not mean they will be the most effective or impactful at moving your business forward. So when it comes to building high-performing teams, if you can’t just hire the best engineers, what should you do? Great question. I have the outlines of an answer, which we’ll talk about in a bit more detail. It starts with constructing a blameless environment, making it safe for people to fail; then relentlessly tuning and paying attention to these socio-technical feedback loops; practicing observability-driven development, because it doesn’t matter how fast you’re trying to move if you can’t see where you’re going; and then just iterating.
So I’ve mentioned socio-technical a couple of times; let’s define that real quick. I love this word because you instinctively know what it means just by hearing it. The systems we ship software in are not just about people writing code; they’re people using tools to work on the system, and it’s a complex system because it feeds back into itself over and over. What that means is that by changing the tools people use, or the systems they operate, you can actually change the people who work within them, which is mind-boggling. Most engineering leaders tend to act as though they have one set of constituents, but every engineering team has two sets of outcomes that are paired in their destiny: are your customers happy, and are your teams happy? You can’t sustainably make one of the two happy without the other.
Let’s look at how this feedback loop actually operates. Imagine you’re on a team that’s not the worst: you’ve shipped some code, but it doesn’t get auto-deployed, so it sits there for a while. Other people ship; they merge more diffs. Eventually someone comes along, hits deploy, and ships that whole mess of changes out into production. Imagine they do that, and the deploy fails. Whoops. It takes down the site and pages on-call. On-call jumps in, and they might not actually know what the person who deployed was doing. So one of them starts rolling back, and then you have to do a git bisect: which one of these changes is responsible? You start pulling in the other engineers who merged diffs in that elapsed time, shipping change after change, rinse and repeat. This will probably eat up the rest of the day for you and whoever else was writing code and was unlucky enough to get piggybacked onto that merge.
Then you wonder why people say, “On call sucks. I don’t want to be on call. It’s miserable. It’s terrible. I don’t want to work in infrastructure.” No, it just means your feedback loops are stupid. Let’s look at the exact same bug shipped into a system with good feedback loops. You’re the developer. You merge your change, and it automatically gets picked up by CI/CD and deployed within a few minutes. Same bug, so it takes the site down, but you’re watching and you know what you did. You quickly revert it, find the bug, commit a fix, and it automatically rolls back out. Elapsed time: maybe 10 or 15 minutes. Number of people bothered: one, just you, and you go on about your day. If you multiply the impact of these two loops, the pessimal one and the optimal one, across many teams, you start to see how 50% of our time gets wasted on bullshit because our feedback loops are terrible.
I mentioned observability earlier because it is a crucial component of fixing your feedback loops. If you can’t see where you’re going, you’re going to run off the road a bunch, and that running off the road and making course corrections accounts for so much of the flailing that eats up 50% of our time. So let’s talk about observability briefly. I assume most of you know this by now, but it’s a term borrowed from mechanical engineering and control systems theory. It means: how well can you understand the inner workings of your system just by observing it from the outside? This is different from monitoring. Monitoring is about known unknowns; observability is about unknown unknowns. There are a lot of technical differentiators that follow from that definition. You need to support high cardinality and high dimensionality. You need not metrics, but arbitrarily wide structured data blobs, blah, blah, blah.
I’ve written about this a bunch, so if you want to learn more, you definitely can. Applied to software engineering, though, it just means: can you understand any novel situation inside your system without having to ship new code to handle it? Anybody can understand what’s happening and then add some logging output to describe it, but using the telemetry you already have to understand and debug novel problems is what distinguishes observability from plain telemetry. The reason this is exploding now is that everything has high-cardinality dimensions, whether it’s user ID or app ID or container ID, et cetera, and you typically need to chain a bunch of those dimensions together to describe any specific situation in your system.
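As an illustration of those “arbitrarily wide structured data blobs,” here is a minimal sketch of emitting one wide event per unit of work. Every field name here is hypothetical, and `print` stands in for whatever observability backend you would actually send to:

```python
import json
import time
import uuid

# One wide structured event per unit of work, with as many high-cardinality
# fields attached as possible (user ID, app ID, container ID, build SHA...).
# Field names are illustrative, not a real schema.
event = {
    "timestamp": time.time(),
    "service": "checkout",
    "endpoint": "/cart/submit",
    "duration_ms": 43.7,
    "user_id": "user-8675309",         # high cardinality: millions of values
    "app_id": "app-42",
    "container_id": str(uuid.uuid4()),  # high cardinality: unique per container
    "build_sha": "deadbeef",
    "cart_items": 3,
    "error": None,
}

# Emit as one JSON line. A backend that supports high cardinality can then
# filter or group on any field, or chain several of them ("this user, on this
# build, in this container") to describe a specific novel situation.
print(json.dumps(event))
```

This is the shape of the data, not a particular vendor’s API; the key property is that one event carries every dimension you might later need to slice on.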
Really, this is all leading up to: you need to practice observability-driven development, which is simply instrumenting your code as you write it, with an eye towards how a future you will understand it. Will future me, drunk me at 2:00 AM, be able to understand this system? You instrument as you write code. You have a small, tight CI/CD pipeline that gets your code live within minutes of merging, one diff per deploy, one merge per deploy. Then you close the loop by going and looking at it: you look at the code you just wrote through the lens of the instrumentation you just wrote, and you ask yourself two questions. Is it doing what I meant it to do? And does anything else look weird? If you get that feedback loop going as muscle memory, where every engineer at your company knows to look at their code in production a few minutes after writing it because it goes live immediately, you will catch 80% or more of bugs right then and there, before customers ever have to see them. It’s miraculous. Obviously, most of us are used to working on production systems that are like hairballs the cat coughed up. We’ve never understood them. No one has ever understood them. Every day we ship more code we don’t understand, that no one has ever understood, onto this fucking hairball, and then we wonder why it’s a nightmare to support. It starts with understanding your systems, and that is something you can’t compromise on and something you need to do fresh every single day.
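The instrument-as-you-write loop described above can be sketched in code. This is a hypothetical illustration, not any particular library’s API: `emit` stands in for sending a structured event to your observability backend, and the decorator shows the idea of instrumentation travelling with the code it describes.

```python
import functools
import json
import time

def emit(event):
    # Stand-in for shipping the event to an observability backend.
    print(json.dumps(event))

def instrumented(fn):
    """Wrap a function so every call emits one wide structured event."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        event = {"function": fn.__name__, "error": None}
        start = time.monotonic()
        try:
            return fn(*args, **kwargs)
        except Exception as exc:
            event["error"] = repr(exc)  # failures are captured, not swallowed
            raise
        finally:
            event["duration_ms"] = (time.monotonic() - start) * 1000
            emit(event)
    return wrapper

# Instrumenting as you write: the telemetry is added with the code, not after.
@instrumented
def apply_discount(price, pct):
    return price * (1 - pct / 100)

result = apply_discount(100.0, 15)
```

After the deploy goes live a few minutes later, you would look at these events and ask the two questions from above: is it doing what I meant, and does anything else look weird?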
I’m not going to pretend it’s easy to dig yourself out of a hole. It’s not. But it is very easy not to fall into the hole in the first place. The last thing I wanted to touch on is that I think this is a fundamentally optimistic view of the world. The systems we all grew up with, monoliths, had certain socio-technical properties, which had socio-technical effects. They were rigid, they were fragile, they were all-up or all-down. As a result, we, and I say we in ops, became very fear-driven. We said, “Developers, stay out of production,” and we were kind of mean about it, and nobody was allowed in our castle but us, which is basically building glass castles.
What I find optimistic and pleasing is that the next generation of technology has different characteristics with different emergent effects. It’s more resilient. It’s more shades of gray. The tradeoff is that we’ve had to take on a lot more complexity, which means, honestly, that the only person who really has a chance of debugging these bugs in clutch situations is the person who wrote the code. You need to promote a model of software ownership end to end: the full lifetime of the code, owned by the developers who wrote it. But in exchange for that complexity, we’ve gained flexibility, we’ve gained resilience, and we’ve gained the ability to treat production more like a playground, where we put up guardrails to make sure you can’t kill yourself going down the slide. You might get a bloody nose, but that’s okay. We can get bumped around and bruised a little and still serve our customers.
What I love about this is that the fundamentals are there for this to be better than it was. Tomorrow’s systems are not going to be built and supported by people who are burned out, harassed, bastard operators from hell who can’t work well with others. It can’t be done. Systems of the complexity we increasingly have can only be owned by people who bring their full creative selves, who are emotionally and intellectually engaged in their work. You really can’t afford anything less. Where we’re going is a place where everyone’s on call, but it doesn’t have to suck. Your labor is a scarce and precious resource. It’s a creative act. You should only give it to organizations you believe in, that you believe will make the world a better place, because it’s like a superpower. I think we have everything we need to make the world a better place, so let’s do it.