Charity Majors [CTO & Co-founder|Honeycomb]:
It’s super weird. I feel like I’m talking out into the abyss, but hi. Hi, lovely people. Thanks so much for having me. These virtual events are super weird. I’m going to be so curious to see if we keep doing them once the COVID thing is away. I never thought I would say this, I’m such an introvert, but I miss talking to a room full of people. Did I just say that? So weird. See? Dead silence. It’s fine. It’s really creepy. All right, cool.
So I’m going to talk about the socio-technical path to high-performing teams. And that’s a sentence that makes no sense to almost anyone because we’ve just started talking about this as an interest, which is why it’s fun. My name is Charity. I work at Honeycomb. I have had a career as basically an infrastructure engineer. I’ve been on call since I was 17. I have many thoughts and feelings on the subject. I also wrote the Database Reliability Engineering book. Side note, if you happen to own this book, you might notice that your cover is broken, it has a horse on it. I can fix the cover by sending you a unicorn sticker. And I probably will if you ask nicely.
What does it mean to be a high-performing team? Why should you care? How can you convince others to care? Let’s see. Why are computers hard, is really the question here. Well, they’re hard because we don’t understand them, because we inherited systems that were never understood by anyone. And the entire time, we just kept pushing more crap onto the system. Crap we don’t understand, which joins its peers, with all the other messy stuff that nobody’s ever understood, and it just keeps growing and getting nastier. And so we hire teams of people to run it for us, and we split up into teams and we hire more and more engineers and we hire… It’s a whole thing. Also, you probably never actually learned to debug production systems. Almost nobody has. And vendors will happily tell you, “Just give us tens of millions of dollars and you’ll never have to understand your systems again. We’ll just tell you what to look at.” This is always a lie. All these things can be fixed, more or less.
Honestly, I feel like the biggest hurdle that we have, between each of us and having truly better systems, is this Stockholm syndrome, this belief that we have that this is just what it’s like. This is just as good as it gets when it comes to computers. It’s got to be a mess of frustration and coding in the dark and it’s fun. It can be so much better. I’m really excited that we’re starting to do some research into this stuff. I assume most of you have heard of or read the DORA report. It’s in, what, its third year now? As in, I think there’s been two of them? Hope they do a third one.
It’s fantastic. It is mind-blowing. If you ever wondered how your team looks next to other teams in the world, it’s so nice to have numbers. The other interesting thing to look up is the Stripe Developer Report, where they polled tens of thousands of developers on how they spend their time. And showed that over 40% of the time, self-reported, so it’s probably optimistic, 40% of the time that engineers spent engineering was spent not making any forward progress whatsoever on the mission. It was all the work that you need to do in order to get to the work, or the yak shave, or trying to figure out where the thing that you need to actually build actually is, reproducing. Just the stuff we get bogged down in.
It’s sobering to see over 40% of our time. And I don’t like to be sober. Kidding. So out of all this mess of data, it gets really hard to compare human teams to each other, right? It’s so hard because everything is contextual. Everything is specific. There’s no such thing as a generic team, but Nicole and Jez did all this science and they basically boiled it down to these four questions: deployment frequency; lead time for changes, meaning how long it takes you to get changes into prod; time to restore service; and change failure rate. And it’s pretty dramatic there, right? Look at the fact that, I don’t have the proportion here, but over half of the engineers in the world are spending their lives able to deploy once per month, more or less.
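As a rough illustration of those four measures, here is a minimal Python sketch that computes them from a list of deploy records. The `Deploy` record, its field names, and the `four_key_metrics` helper are all hypothetical, invented for this example; they are not how the DORA report gathers its data, which comes from surveys.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Dict, List, Optional


@dataclass
class Deploy:
    """One production deploy (hypothetical record shape)."""
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it degrade service?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def four_key_metrics(deploys: List[Deploy], window_days: int) -> Dict[str, object]:
    """Summarize the four measures over a window of `window_days` days."""
    if not deploys or window_days <= 0:
        raise ValueError("need at least one deploy and a positive window")
    failures = [d for d in deploys if d.failed]
    lead_times = [d.deployed_at - d.committed_at for d in deploys]
    restore_times = [
        d.restored_at - d.deployed_at for d in failures if d.restored_at is not None
    ]
    return {
        "deploys_per_day": len(deploys) / window_days,
        "median_lead_time": median(lead_times),
        "change_failure_rate": len(failures) / len(deploys),
        "median_time_to_restore": median(restore_times) if restore_times else None,
    }
```

Medians are used here so that one unusually slow change does not dominate the summary; that is a choice made for this sketch, not something the talk specifies.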
And if you look at the last results versus the year before, the gap is actually widening, the bottom 50% is moving down slightly, if anything, and the top layers are going off into the stratosphere. It’s super interesting. It really pays off to be a high performer, really, really, really. And this translates directly into business impact. All of the money that you spend is basically headcount, and if those people are able to be even 10% more productive, you’re more likely to win. These are dry statistics, but think about the amount of human suffering behind a number like 2,600 times faster time to recover from incidents. So much empathy for these poor kids.
So how do you get a high performing team? Well, obviously, you get an elite team, because I’m doing these by hand, you aren’t getting the transitions. Teams become elite when you have most of your members come from Facebook, Amazon, Netflix, Google, blah, blah, blah, blah, blah. No, they don’t. No, they do not. They do not. They do not. They do not. There is basically zero correlation, in my experience, between a person’s pedigree and their ability to operate in an elite team. And this speaks to a crazy truth about this, which is that it’s about the team. It’s not about the person. I have repeatedly seen high performing teams that I’ve gotten to work on. I have repeatedly seen very average, very average engineers join. And within six months, they’re performing quite well. And I have seen people who have spent their entire career working at extremely high performing teams after high performing teams. And then they get it done and they join a team that is not performing very well. And you would think that they would drag the team up, but no, they’re right down there in the mud with everyone else.
The power of these social constructs, these organizations to drive our behavior and our outcomes is way more powerful than what you think. Honestly, we don’t really, fully, we can’t give you a recipe. “Here’s what you need to do to be a high performing team.” I have a lot of theories though. I have a lot of guesses and I was lucky enough to have a team for the past few years and Liz and I sat down and just looked at what our team’s stats were for fun. After this last DORA report came out, and ours is almost an order of magnitude-ish, better than the most elite tier of the DORA report. Now you might go, “You’re a startup,” very true. It’s also a four and a half-year-old startup that’s growing really rapidly. We’re a platform, so we don’t get to control the data that comes in. We’re absorbing the full production load of dozens of very large customers.
And the last company that I was at was Parse and I was at Parse for about four and a half, five years. And Parse was just as good, if not better, engineering-wise, just in terms of raw ability, they were a lot more senior than our team is, but the stats were not nearly as good. And a lot of it is because of the tooling that we’ve been able to use, which is why I think that anytime that you’re thinking about this problem in terms of technical solutions or social solutions, you’re on the wrong path. This is where the socio-technical thing comes in because they’re intertwined, they’re interwoven. They are the same thing, right? There is no such thing as a technical solution without a social component. There’s no such thing as a social fix that doesn’t have technical tooling attached to it in this day and age.
And I think that this is a way of thinking that is very foreign to us. And it’s interesting to try and wrap our heads around it. And I’m really interested, honestly in hearing, if anything pops in your mind, you’re like, “Ah, this is what my team has done to make us perform at a higher level,” I would love to hear it cause I’m really trying to pull together stuff. But, as far as I can tell, these so-called elite teams, and instead of elite, can we just say excellent? I like the word excellent so much more than elite. Excellent teams seem to be made up of pretty average engineers in general, who take a lot of pride in their craft. They want to do well for the sake of the work itself. They care about their users and they have the time to fix and iterate.
I don’t care how good of an engineer you are. If you are given zero time to actually fix the things that are paging you, it’s just going to keep paging. Somebody saying something to me? I can’t see notifications. And this is presentation mode. Every time I try and do that, it flips to a full screen and I lose it. I was just trying to figure that out and I couldn’t figure it out. I’m sorry. I will definitely share the slides afterward, I promise.
Care about the work, communicate with themselves. So this is an interesting data point. When Christine and I decided to start recruiting our engineering team, we very explicitly decided we were not going to hire all of the senior people that we worked with at Facebook, Google, et cetera. We did not want to have a team that was that senior-heavy. And we wanted to have a team with diverse backgrounds. And that doesn’t just mean hire chicks. That means hiring people that come from code academies, people that are at different stages in their career, and not just all… I’m a dropout, Christine’s an MIT grad. So we wanted to continue to kind of reflect the diversity of perspective of eyeballs that we’re building for engineers all over the world, right? We’re not just building for Silicon Valley.
One thing that we did select for though in the hiring process was not algorithms, was not data structures; we selected very high for communication skills, meaning that it was just as important as the technical work that they were doing. The main technical component of our interview is actually, we would give them a piece of code the night before and ask them to refactor it in some way and add some piece of functionality. And we explicitly said, “We do not expect you to finish. We do not want you to finish. We want you to improve it some in the time that you have, however long that is. And when you’re sending it off to us in an hour or so, if you feel like it, write down the list of the things that are still in your mind, you still want to do, you didn’t have time to, fine. Wherever it’s at, that’s fine,” because that’s not the interview. The interview is not the code.
The interview is the next day, you pair with a couple of our engineers and you talk through what you did. Just code review, “Oh, here’s what I did. And here’s why. Here are some other things that I considered. Here’s why I didn’t do them.” And it is that conversation, that’s the interview, because we have found that if people can talk about their work, if they can talk you through what they’re doing, they can do the work, right? Whether they know the syntax and all the ins and outs of the language doesn’t matter. They can do the work. And the reverse is not necessarily true. There are lots of extremely skilled engineers out there who cannot really talk you through their thought process. And I’m not saying they’re bad engineers or bad people, but that’s not what we’re selecting for. We’re selecting for people who are able to communicate very clearly and enjoy it. And I think that that’s one of the bets that we’ve made that has paid off the biggest because once our engineers get ramped up, they can perform with the best of them. And what I’ve observed is that this helps the team hold each other accountable. And when they notice problems, they are able to lift each other up and bring the entire team up to the level of the person who is strongest in each area. It’s really cool.
Every engineering org has a dual mandate, which is to make your users happy and make your team happy. And a lot of times people act like this is a conflict. Somehow we have to sacrifice our humans on the altar of users. And it just doesn’t work that way. I absolutely guarantee you that. Production excellence is a two-sided coin, right? You are not actually going to get happy customers if your people are miserable, not long term. People remember every single interaction that they have with your company. You don’t have a lot of opportunities to make a mark on them. And it shows when people are happy, when they’re well-rested. And time check, okay. The world is changing a lot. This is not news to anyone here, but complexity is going up quickly, exponentially.
A lot of you, if you’ve watched my talks before, you probably saw me go into a lot more detail about this, but we had the LAMP stack back when I started out and there’s Parse’s infrastructure in the middle. And then on the right is the National Electrical Grid, which is really the way that we should be thinking about our systems architecture, it’s far-flung, it’s loosely coupled, it’s resilient, it’s instrumented, it reports its state. You don’t sit there and try and predict which trees are going to fall over so that you can go out and be ready. No, you just make a system where you can quickly slice and dice and isolate the source of the problem, send out a crew and fix it in a very ad hoc manner because you know that the same problem never happens twice. History doesn’t repeat, but it rhymes.
Unfortunately, we’re really behind the curve when it comes to understanding our systems, shockingly behind. And I could sling some, I could get very specific here as to why, but instead, let’s just talk about the differences between monitoring and observability. Monitoring is very much about the known unknowns. You build a system, you look at it, you predict how it’s going to fail and you write a bunch of checks, right? These thresholds are okay, right? And if I build a system, a website, I can probably look at it and size it up 80%. I can guess 80% of the ways that it’s going to fail. Connections are going to fill up, right? Run out of capacity. Buyers will crash. Cool. I write checks for all those things, right? And I have them page me.
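To make the "known unknowns" idea concrete, here is a small Python sketch of that style of monitoring: a fixed set of threshold checks written in advance for failures someone predicted. The check names, the snapshot fields, and the 90% thresholds are made up for this example and do not come from any particular monitoring tool.

```python
from typing import Callable, Dict, List

# A check looks at a snapshot of system stats and returns True when it should page.
Check = Callable[[Dict[str, float]], bool]

# Each entry encodes one failure mode that was predicted ahead of time.
CHECKS: Dict[str, Check] = {
    "connections_near_limit": lambda s: s["open_connections"] >= 0.9 * s["max_connections"],
    "disk_almost_full": lambda s: s["disk_used_pct"] >= 90.0,
}


def failing_checks(snapshot: Dict[str, float]) -> List[str]:
    """Return the names of every predefined check that fires for this snapshot."""
    return sorted(name for name, check in CHECKS.items() if check(snapshot))


snapshot = {"open_connections": 95.0, "max_connections": 100.0, "disk_used_pct": 40.0}
print(failing_checks(snapshot))
```

The limitation is built into the shape of the code: a failure that nobody predicted has no entry in `CHECKS`, so every check passes while users are still having a bad time.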
And then over the next few months, as I’m running the system, I encounter the remaining 20%. And I write checks for those too, and it’s very rare that I’m truly stumped with something new, right? Well, that’s not really true increasingly in modern systems where you’ve got microservices and polyglot persistence and serverless, and half your stuff is being run by third-party providers. And you’re kind of sitting in the middle, gluing it together with APIs, and it’s more like the goal should be that you’re not going, “Oh, that again, Oh, that again.” It should be every time that you get paged, it should be, “Huh, that’s new.” Right? It should have to engage your full creative brain to actually solve the problem. That’s how you know it was worth paging you about.
The heritage that I prefer is it comes from our kinfolk in mechanical engineering, where observability is actually the mathematical dual of controllability, to the extent that if you can observe and understand any state inside the system, just by looking at it on the outside, even if you’ve never seen it before, right? You can’t just pattern match and go, “Oh, this looks like an outage that I experienced two months ago,” you have to be able to understand in pretty fine-grained detail what happened, to whom, why, and when; then you have an observable system. And crucially, you need to be able to ask any question, understand any state without shipping new code to handle it, because that would seem to imply that you knew in advance what you were going to need or that you know what the problem is. You know what data you need to go collect. It’s this whole chicken and egg problem that just, it stops working. It stops working. You hit a wall. It stops working hard. God, those years. Those years are sad.
And there are all of these, I’m not going to go into the whys because I know we’re running out of time. But all of these, just take my word for it. If you accept my definition of being able to ask any new question without having to ship new code, then all of these are strongly implied, if not required. It has to be very exploratory. You have to have arbitrarily wide events that carry the full context of the request as it’s going from hop to hop. And you can kind of think of this in a way as distributed strace, right? You can’t strace your process anymore because it’s hopping around across the network. So you have to bundle up all the context and ship it along with the request, then you can trace it and see where the problems actually are.
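One way to picture "bundle up all the context and ship it along with the request" is the Python sketch below. It emits one wide event per hop, and every event carries the shared request context plus whatever that hop wants to add. The function names, field names, and the in-memory `EVENTS` list are assumptions made for illustration; this is not Honeycomb's API or any real tracing library.

```python
import time
import uuid
from typing import Any, Callable, Dict, List

EVENTS: List[Dict[str, Any]] = []  # stand-in for wherever events get shipped


def new_request_context(**fields: Any) -> Dict[str, Any]:
    """Create the context that travels with a request from hop to hop."""
    return {"trace_id": uuid.uuid4().hex, **fields}


def run_hop(service: str, context: Dict[str, Any],
            work: Callable[[Dict[str, Any]], None]) -> None:
    """Run one hop of the request and emit exactly one wide event for it."""
    event = dict(context)        # start from the request-wide context
    event["service"] = service
    start = time.monotonic()
    try:
        work(event)              # the hop may attach any extra fields it likes
        event["error"] = None
    except Exception as exc:
        event["error"] = repr(exc)
        raise
    finally:
        event["duration_ms"] = (time.monotonic() - start) * 1000.0
        EVENTS.append(event)


ctx = new_request_context(user_id="u-4821", endpoint="/export")
run_hop("api", ctx, lambda e: e.update(build_id="abc123"))
run_hop("storage", ctx, lambda e: e.update(shard="s7", rows=1200))
for e in EVENTS:
    print(e["service"], e["trace_id"], sorted(e))
```

Because both events share a `trace_id`, they can be stitched back into the path one request took, and each one keeps the per-hop detail (build, shard, row count) that a later question might need.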
You need high cardinality because that’s always going to be the most relevant debugging stuff. And that means that you can’t achieve observability if you’re using metrics. If you’re pre-aggregating, if you don’t have tracing, et cetera. You can read all the crap that I’ve written about this if you care. Point is, distributed systems, very full of unknown unknowns. But I want to pause here and point out that there are characteristics of the way systems are changing that have really cool ramifications in the real world for humans. Look at the technical aspects and cultural associations, all these things we built up culturally in the era of the LAMP stack. We had the application, right? The app, the monolith, right? The database.
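On the high-cardinality and pre-aggregation point at the start of this passage, a toy Python example may help. The event data, the field names, and the `slowest_by` helper are invented for illustration only.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List

# Raw events: one record per request, including a high-cardinality field (user_id).
events: List[Dict[str, object]] = [
    {"user_id": "u-1", "endpoint": "/export", "duration_ms": 40.0},
    {"user_id": "u-2", "endpoint": "/export", "duration_ms": 45.0},
    {"user_id": "u-3", "endpoint": "/export", "duration_ms": 2900.0},
    {"user_id": "u-3", "endpoint": "/export", "duration_ms": 3100.0},
]

# A pre-aggregated metric keeps a single number per endpoint and drops the rest.
pre_aggregated = {"/export": mean(e["duration_ms"] for e in events)}


def slowest_by(field: str, rows: List[Dict[str, object]]) -> object:
    """Group raw events by any field and return the value with the worst mean latency."""
    groups: Dict[object, List[float]] = defaultdict(list)
    for row in rows:
        groups[row[field]].append(row["duration_ms"])
    averages = {key: mean(values) for key, values in groups.items()}
    return max(averages, key=averages.get)


print(pre_aggregated)
print(slowest_by("user_id", events))
```

The aggregate can only say that `/export` is slow on average. Finding out that a single user accounts for the slowness requires the raw events, because `user_id` was thrown away when the metric was computed.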
And it was fragile, because if something went wrong in that monolith, it all crashed, right? It was very difficult to fail gracefully. It was very difficult to loosely couple things. It was very difficult. You didn’t have sharded databases, but things tended to fail in pretty predictable ways, right? So that’s what we did to debug. We built up this mental database of past traumas, outages, and when something happened, we’d look at our dashboards on the wall and we’d pattern match, we would go, “This feels like Redis.” And then we’d go and see, Redis is down, right? Or, “This looks like the bug in the app code.” That’s not debugging. That is a hypothesis. That is jumping to the end. That’s flipping to the last page of the book. And I get it. It’s fun. I really love being that first person who gets to be the magic worker who’s like, “It’s Redis, I smell it, even though it doesn’t say Redis anywhere on the dashboard, I know it.” And then I’m right. I get to feel super good and smug about myself.
But it’s not repeatable. It’s not something you can teach people. They just have to go through all the same scars that you did. And so we evolved this very masochistic on-call culture and this real fear of deploys. Nobody wanted to change anything, right? Because we associated change with failure and downtime and badness. And most crucially, we in ops, we have a big role to play in this. We were just like, “Developers, stay out, stay out of our clubhouse. We don’t want you in here messing things up.” Right? And we treated failures as though they were something to be prevented.
I think of it like, a question of change/fail rate. I’ll get to that in just a sec. Thank you. But good, I’m glad I have questions. I think of this as the glass castle era, right? Where we built this really fragile, forbidding edifice that nobody was allowed to play in. And as we’ve been growing out of necessity, right? We didn’t build distributed systems because they were fun. Well, maybe some of us did, but we’ve mostly built them because we had to, right? We had to get the reliability guarantee, to get the performance that we needed, and the technical aspects and cultural associations that we have here are super different, right? There are many storage systems. There are many services.
Deployment is something that we embrace, right? We know that the only way to make it not scary is to make failure not scary. The entire mental shift has to be away from this fragile, forbidding, to a friendly, I think of it as a playground, right? We need to build a playground. We need to build some safety measures in, build some guardrails in the slide so that your kid doesn’t break their arm. And it’s probably fine if they bust a nose once in a while. But you build it for experimentation. You build it for real users to use it because that’s how you know that things get better. I see a lot of chat all of a sudden, I can’t see how to fucking get to it. Is my time up? Oh, oh, there it is, there it is.
Kong Summit Host:
Hi Charity, yes, we are out of time.
Charity Majors:
Okay. 10 seconds. Here’s the thing. The next generation of systems won’t be built and run by burned-out people, it can’t. It can’t be done. They don’t fit in their heads anymore. We need people’s full creative selves. And we need observability to come first because otherwise all of the energy that you put into trying to fix your teams and your systems will be 10 times as hard because you won’t be able to see where you’re going, right? If you’re building without a feedback loop, you’re kind of screwed. In the end, on call needs to be less like a heart attack and more like the dentist. It just needs to be not the most fun thing in the world. Invest in your deploys, democratize access, and don’t be scared by regulations. They’re fine. And the end. I swear that’s it. There we go. We have an opportunity to make it better, let’s do it. Thanks so much for having me. I will share my slides.