At this time I’m very excited to introduce our guest speaker today, Charity Majors, Charity is the founder and I believe CTO at Honeycomb. Charity and I go back a little way, wouldn’t you say Charity? We’ve had our share of conversations about serverless. Charity, I’ll turn it over to you to give a little bit of an intro to what you’re all about and what you’re going to be sharing with us today.
Absolutely. Thanks, Forrest. Yeah, we go back to the beginning of serverless, I think, the very first Serverlessconf in New York City, in somebody’s garage. It was great. My name is Charity, I am CTO and co-founder of Honeycomb, and I’ve put up my favorite picture that I could find of Forrest, who’s now at Cloud Bard, as I learned.
I think it’s pretty official. If it’s on the internet, it’s official. And there’s me in my favorite dress. So I’ve been an Ops nerd my entire life. I’ve been on call since I was 17. And things have changed quite a lot. When I started doing systems, none of us were expected to write code, we were managing all these handcrafted systems. I actually started out as a system admin at my university.
Back then they just gave kids root. So I assume that’s changed as well. Anyway, we are going to be talking today about the next wave of changes for Ops stuff. And specifically, I think we’re going to be talking about what it means for three key personas: what it means for Ops engineers, what it means for devs, and what it means for leaders.
It’s not just managers; I feel like everyone who’s been in the game for a while has a real stake in making sure that our socio-technical systems are well-tuned, are high-performing, and are humane for the people who have to run them.
That’s awesome. And Charity, I’m going to pop in with a question or two at times. And my first question for you is that you use the term socio-technical systems and I’m hoping you can explain to us what that means, because it goes a little bit over my head, for sure.
One of my favorite things about the word is that if you hear it, you probably have a guess what it means, and you would be correct. It means systems that are made out of social elements and technical elements. There are three parts to any system, right? There are the artifacts, the production systems themselves. There are the people who run it, the people who build it, and there are the tools that they use to interact and to act upon those systems.
And it’s all a feedback loop, right? The people use the tools to act upon the systems, but less intuitively the tools act upon the humans too, because the tools that you use can actually change who you are and change what you do. They influence us in return, which I think is super fascinating.
Absolutely. All right. Well, let’s see if we can influence our PowerPoint deck to spin up the presentation.
Seriously. So I thought we could start by defining operations, which to me … There are some corners of the internet where Ops means toil, right? It’s something that’s just to be minimized. And I don’t see it that way. I see it as it’s how we deliver value to users. It’s everything, it’s the tools, it’s the best practices, it’s our habits, it’s our shell scripts, it’s our cron jobs.
It’s everything that we use to deliver things to users. I think it’s one of the three essential pillars. The business is why we build stuff, development is what we build and Ops is how we get it to users. If you ship software, you have Ops problems. And less isn’t necessarily better any more than fewer lines of code is necessarily better. We need to do our work efficiently, but it’s not as simple as just saying, well, less is better.
And I thought we could just look at the DORA report real quick here, because every team, I think, should know where they stand, how high-performing they are. And Jez and Nicole and Gene did all this great research where they surveyed tens of thousands of companies. And they categorized folks into low performers, medium performers, high performers.
And there’s a huge gulf between the elite teams and the rest of us, right? And I feel like we often attribute this to being better engineers, and that’s just not how it works. It’s actually the other way around, that great teams make great engineers. And if you look at the year over year bubbles here, it’s pretty clear that the “elite performers”, there are more and more of them and they’re getting better and better.
And I think this really speaks to the surge in adoption around making production systems better, shifting the center of gravity away from staging and all this pre-production stuff to really building guardrails around production. Meanwhile, the bottom 50% is losing ground. Which to me speaks to the fact that we need to constantly invest in ourselves, invest in our skill sets, invest in upgrading our systems.
Because entropy is real and there’s this slow drift towards everything not working. So in terms of what’s changing to impact our jobs, I think that the first wave of DevOps was really all about, all right Ops people. We’re going to learn to write software. And like, message received. We all write code now, cool. And I feel like the second wave of DevOps is about swinging that back around.
Like, okay, software engineers, your turn, time to learn to write operable services. All of the Ops stuff that used to involve lots of hand curation is now done with software as a first-class citizen. That’s how we do Ops, right? We do infrastructure as code. There’s also been this big wave of going from monolith to microservices, which has really shoved operability and Ops skills into the mainstream.
You can’t be a developer now who doesn’t give a shit about basic Ops, right? You have to know how to build things that are operable. And in fact, I think that with ownership there’s this really tight, virtuous, beautiful feedback loop, where developers can do a better job of owning their own services because they built them. They have all that context in their heads.
And five years ago, the idea of putting developers on call was a deeply exotic idea. And it’s not now. And the reason is not that we’re masochists, we want everyone to be a masochist and so we’re trying to make everyone as miserable as we are. The reason is because that’s how we make it better. And it doesn’t mean that there’s no room for Ops people there, Ops people are still the experts in how to run things.
But we serve more as consultants or subject matter experts in helping to build the stuff upon which developers can build good services and own them. I’m excited about this because I’m over 30. I don’t actually want to be woken up all the time anymore either. It was fun when I was younger; it’s not fun anymore. I feel like it is reasonable to ask someone to wake up for their services once or twice a year.
Any engineer who works on a 24/7 system, I think that’s reasonable if you don’t have a young child. One alarm system at a time, right? But most people’s on-call rotations are nowhere near that level of one to two times a year. They’re more like one to two times per night. And that’s just not compatible with human happiness and flourishing. I think it’s really important that we acknowledge that every engineering team has a dual mandate.
It’s not just about the customers. It’s also about the teams that build and run these services. And these two are not in tension, right? It’s not like we need to sacrifice ourselves on the altar of customer satisfaction. They actually reinforce each other. Nobody likes to do mediocre work, right? If you’re building high-quality systems using high-quality engineering, people are happy at their jobs, and your customers are going to be happier too. And vice versa.
And a lot of it begins with observability for the same reason that you put on your glasses before you go and drive down the street because if you can’t see what you’re doing, you’re screwed. I’m going to pause for breath there before we go onto the next section.
And while Charity’s pausing for breath I’ll reiterate what we said right at the beginning for the newcomers, which is yes, this session is being recorded. No worries there, we’ll be sending that out to everyone following the presentation today.
Cool. Should we look at what this means for Ops people next?
There’s actually been a question about this. Jose is asking what are the new skills and capabilities that are required for the actual Ops role, Charity?
Oh, I’m so delighted that you asked. What does this mean for Ops? Well, I think that first of all the role is on the verge of diverging into two, right? For a long time infrastructure has been synonymous with operations, but the actual amount of infrastructure work that we have to do is slowly moving up the stack, right? Raise your hand if you remember having to go to the colo in the middle of the night to flip the switch on the MySQL primary?
We don’t do that anymore. We don’t have to, because Amazon came along, Azure came along, and they do it better than us for cheaper, right? But as a result, those hardware skills have become very niche, they’re something only specialists do at infrastructure companies. And the rest of us don’t have to worry about it, we use an API. And you’re seeing this sort of thing just creep up the stack, right?
Some of these solutions aren’t mature yet, but you can see where in the next five or so years they will become table stakes, they will become commodities. And I’m talking about database storage, containers, a lot of the systems stuff. What this means is that you can follow one fork or the other, right? You can go to an infrastructure company that does infrastructure and sells a category as a service.
Or I think that the next wave of Ops skills is leaning into what will still be very common. Because every engineering team is going to need people that do socio-technical systems engineering, which means curating and tuning those feedback loops so that we are shipping software to users and doing it well. Doing it effectively, doing it efficiently, doing it delightfully.
High-quality operations … Lots of people look at Ops as a cost center, but it’s no more a cost center than optimizing your AWS bill is. It might not be shipping features, but when you do it badly, it is definitely costly, right? And in terms of answering your exact question, I think these are the emerging operational skill domains.
Vendor engineering. As systems become more and more like a bunch of services that are knitted together with APIs, somebody needs to own that relationship. I feel like in the past we have been too prone to rewarding people for vanity projects. I’m going to go build another time series database and I’ll be level E9 at the end of that. Oh God, the world does not need another time series database, right?
The world needs people that are being efficient with their time. We need to reward people, promote people for, not the biggest, flashiest, newest project, which everyone’s just going to have to maintain for the rest of its life, but more for tight integrations. This doesn’t mean you don’t need an observability team. You probably do, but you don’t need an observability team to build a solution from scratch.
You need an observability team to own the vendor relationship and build libraries, modules, helpers, examples, and document how to use it. There’s glue that needs to be built in order to make most vendors’ tools work with your system, and that needs to be done in a maintainable way, so that you have reusable patterns, so that all of the other engineers can pick it up and get going very quickly with it, and so it doesn’t just look like spaghetti code, right?
So where every team does it differently. Embedded consultant. When I was at Facebook there were a bunch of different models that the production engineering team could take. One of them would be, if you had a team of say, five software engineers, well, maybe two production engineers would join the team and just be in the rotation just like them.
And they were just like the software engineers, except they had more systems knowledge and background, and they could consult with the team. Another model is to have those five software engineers be on call for their service, and the two production engineers would serve as escalation points. Another would be to have a sidecar team for the large teams, like for edge computing, right?
They had like 12 software engineers and then a team of five production engineers who worked together closely, but as two separate teams. There are a bunch of different ways to structure this, but the key insight here is that we’re helping them own their own stuff. There is no wall anymore for them to throw it over, right? Because there’s no wall.
You’ve got the software engineers owning their shit all the way over until users are using it in production. Managing the portfolio of technical debt and investments is something that Ops people are uniquely qualified to do, because we have great instincts and we know in our gut when people are making bad architecture decisions, and we’re well positioned to encourage people to make the right investments early, because we’ve seen the consequences when they don’t.
Database reliability engineering is going to continue to matter. Even though so many of those aspects are becoming commodities, it’s still going to be critical. It’s still going to be a skill set that we will need some in-house expertise in. Then there’s release engineering, and bringing the D to CI/CD. I feel like we really haven’t realized the promise of continuous delivery as an industry.
We’ve gotten pretty good at CI, but we’ve stopped short of actually shipping our code automatically. And this, honestly, if there’s a single investment that everyone could make in their future, a single thing that will cut down the bugs that you have to fight, the fires, the outages, everything, it would be this. It would be really automating that delivery pipeline so that as soon as someone merges their code to main, it kicks off.
You should almost be able to think of it like it’s an atomic process, right? As soon as their code is merged to main, it triggers all the tests to run, the artifact to get generated, and it should go live. That doesn’t mean that it has to be immediately viewable by all users. Yes, you could use feature flags. Yes, you can use progressive deployments to put a canary into production, all this stuff.
But just making it so that engineers could rely on the fact that as soon as they merge it, it’s going to be in production. It shortens that feedback loop so that when engineers are writing code and they’ve just merged, they know that in a few minutes it’s going to be live and they can go look at it, which lets them practice observability driven development, where they’re instrumenting their code with an eye towards, how will I know if this is working or not in a few minutes when it’s live?
And then they can go look at it and ask themselves, is it doing what I expected it to do, and does anything else look weird? That right there is magic. It is the key to better systems. And we aren’t there yet. And I feel like Ops can really play a great role here.
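To make the merge-to-deploy idea concrete, here is a deliberately tiny illustrative sketch of an "atomic" pipeline: one merge in, one deploy out, with no batching and no manual trigger. Every function name here is a hypothetical stand-in, not any real CI system's API.

```python
# Sketch of the "atomic" merge-to-deploy flow described above.
# All stage functions are hypothetical stand-ins for a real CI/CD system.

def run_tests(commit: str) -> bool:
    """Stand-in for the test suite; pretend commits named 'broken*' fail."""
    return not commit.startswith("broken")

def build_artifact(commit: str) -> str:
    """Stand-in for producing a deployable artifact from one commit."""
    return f"artifact-{commit}"

def deploy(artifact: str) -> str:
    """Stand-in for pushing the artifact live."""
    return f"live:{artifact}"

def on_merge_to_main(commit: str) -> str:
    """Triggered on every merge: test, build, deploy as one unit."""
    if not run_tests(commit):
        return "rejected"  # the merging engineer is notified immediately
    return deploy(build_artifact(commit))
```

The point of the sketch is the shape, not the stages: each merge flows through to production on its own, so the engineer who merged is still looking at their code when it goes live.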
Absolutely. And Charity, as you might expect, there are lots of great questions coming in as a result of this slide. And I want to just pull a couple of them up before we move any farther. There’s one that’s come in from Ross, which I think is great. He says, if devs are gaining Ops skills and companies are taking on Ops specialists later and later, both of which are things that you’ve already said, Charity. He’s asking, is the Ops field shrinking? Should more Ops people become devs with strong Ops skills? What do you think about that?
Is it shrinking in real terms? No. Is it shrinking in relative terms? I think it is. I think that we need fewer Ops people today. The entire computing industry is expanding, so it’s more like the Ops field has expanded at a slightly slower rate than the developer field, because we can do more with less. This is awesome. I don’t want anyone to hear this and be afraid because there’s nobody … It’s not going away.
But yes, we can do a lot more with less. We have a lot more leverage. We don’t have to hand curate our servers anymore, we’ve embraced automation. Now, the thing about people bringing Ops people on later and later, I don’t know that that’s a good thing. I would really encourage companies to be conscious of the technical debt that they incur by bringing Ops people in so late, when it’s already a forest fire and they’re just like, “Hey, help us.”
That’s terrible. Just to brag just a little bit, Honeycomb, we had an Ops founder, which I know is very rare, but we haven’t had that forest fire nonsense. We just haven’t. We’ve never gotten paged like that. We’ve never gotten woken up. It’s built well. So I think that people think that they can go for a long time without needing Ops. They don’t think they need us until there are fires.
And please, that’s really shortsighted. It’s like calling the doctor when you’re already just keeling over half dead. Just don’t do that. Does that make sense? Yes, the Ops field is growing at a less rapid rate than I think software engineers are, but I think that that’s reflecting our superpowers. So I think it’s a good thing.
Yeah. I would 100% agree with that. And one of the ways that it’s changing, of course, you mentioned, is vendor engineering. And I think that’s a less familiar concept to a lot of folks, really myself included. And so a few have asked if you would say a little more about that. So maybe as the last question before we go on, maybe you could give us an example of vendor engineering that you’ve seen in action and you feel like it’s done really well, Charity.
Yeah. The first one that most people encounter is often observability. They’ve got their monitoring providers or whatever, and every … Well, actually the first one that I ever encountered was with Gmail. For me, I started out my career as a mail server admin. I think I wrote my first spam filters when I was like 18 or whatever, and I loved running mail servers.
I loved getting to grep my mail spool. It’s fantastic. There’s no search that compares to grep. And I remember when I was at Linden Lab they suggested outsourcing to Gmail and I was like, “Hell no, that’s a terrible idea.” I was wrong. It was a great idea. But there was some engineering that we had to put in on our side at the time to do the export, to do the transfer of all the mail, and all that stuff.
So that was my first experience as a vendor engineer. Another one that often people will see is if you’re a serverless shop, right? I think it surprises some people that companies like Fender, who’s a Honeycomb customer, they’re a big serverless shop. And people are like, “Oh, so they have no Ops people.” No, they have some of the best Ops people that I’ve ever known, right?
They have some amazing SREs. So those SREs are often thinking about architectural issues, like how do we design these systems, how do we instrument them, how do we make sure that they are well understood? The particular vendors that they use are, I think, a lot of AWS and Lambda, so they become experts in those technologies and they help the software engineers understand good development patterns, bad development patterns.
They’re always looking out for things that don’t scale, they’re looking out for … Or like when I was at Parse, right? Parse was a mobile backend as a service. And I knew engineering teams that would use Parse for their mobile apps. Their Ops engineers would form relationships with Parse, and they would ask us, how does query tuning work under the hood? Like, how do I write efficient queries? How do I help my software engineers tune their queries so they’re not doing a 5x full table scan, that sort of thing? Does that make sense?
I think it definitely does. And I know we’ve got a lot of great stuff to come so we need to move on, but there are a lot more questions that have come in around this domain of Ops and hopefully, we’ll be able to loop back and hit some of those.
Yeah, totally. Basically, you should know where your team stands. This is another thing that I think Ops often should own. The DORA report pointed out these first four, how often do you deploy, how long does it take for code to go live, how many deploys fail, how long does it take to recover from an outage? And I would add the fifth, everyone should be tracking, how often is anyone alerted after hours?
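The five numbers just described can be computed from nothing more than a deploy log and a page log. Here is a minimal illustrative sketch; the record shapes, the sample data, and the 9-to-6 "business hours" window are all made-up assumptions for the example.

```python
# Sketch: computing DORA-style metrics from a simple deploy/page log.
# Record shapes and sample data are hypothetical, for illustration only.

from datetime import datetime

deploys = [
    {"merged": datetime(2023, 5, 1, 10), "live": datetime(2023, 5, 1, 10, 12), "failed": False},
    {"merged": datetime(2023, 5, 1, 14), "live": datetime(2023, 5, 1, 14, 9),  "failed": True},
    {"merged": datetime(2023, 5, 2, 9),  "live": datetime(2023, 5, 2, 9, 15),  "failed": False},
]

pages = [datetime(2023, 5, 1, 3), datetime(2023, 5, 1, 11)]  # when alerts fired

def lead_times_minutes(deploys):
    """Lead time for changes: merge-to-live, per deploy."""
    return [(d["live"] - d["merged"]).total_seconds() / 60 for d in deploys]

def change_failure_rate(deploys):
    """Fraction of deploys that failed."""
    return sum(d["failed"] for d in deploys) / len(deploys)

def after_hours_pages(pages, start_hour=9, end_hour=18):
    """Charity's fifth metric: alerts outside an assumed 9-18 workday."""
    return sum(1 for p in pages if not (start_hour <= p.hour < end_hour))
```

Deploy frequency is just `len(deploys)` over the window, and time-to-recover would come from incident records in the same way; the point is that knowing where your team stands is a small script, not a big project.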
You should know where you stand so that you know how high-performing your team is. And also we waste a lot of time, we waste … This is self-reported, so it’s really an optimistic estimate. Teams waste half their time just aligning themselves, just trying to figure out what they’re doing, just trying to figure out if they’re working on the right thing, or working on the wrong thing and then having to backtrack.
It’s all the shit that you have to do in order to get to the work that you need to do. And companies that don’t invest in Ops tend to do very poorly here. So this is my summary slide for Ops people. We need to stop punishing people for touching production. We need to stop being masochists, look for ways to enable, empower.
And a little bit tongue in cheek there at the end, get a therapist, go to therapy. Ops is notorious for being cranky and miserable, and we can’t do that anymore. We are uniquely capable of making systems humane. Ops has always been the most aligned with customer pain. And if you’re struggling with this, a little bit of self-knowledge really helps. Getting better at talking about feelings really helps.
So, on to the next section. Yes, you should be on call for your systems. And this is still a little bit controversial. I think that most people have made the leap to realizing that this is good and inevitable, but there’s also a handshake here that if management is asking engineers to be on call, it is management’s responsibility to make sure that it does not suck, or does not suck indefinitely.
You shouldn’t have to plan your life around it. If you carry your laptop and a WiFi device with you, you should be able to just lug it around, go to the movies, whenever you … You should be able to rely on not getting paged most of the time. And I really admire the companies that have made an all volunteer rotation. I think that if you do this right, people can want to be on call.
It should be compensated too, if your company has money; if not, at least compensate in time. It is possible to make this a high-prestige, highly valued thing. For example, if you make it so that when you’re on call you’re not responsible for project work, you’re only responsible for making systems better, you’re rarely paged and it’s just not terrible for your life.
Software engineers do need to be practicing observability driven development. And I mentioned this briefly earlier, but it’s the process of instrumenting like a headlamp, right? You instrument two steps in front of you every time you’re building. Never accept a pull request unless you can answer the question, how will I know when this breaks? Watch your code go out, make it muscle memory, ask yourself if it’s working as intended, right?
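One minimal way to picture "instrumenting like a headlamp" is a wrapper that emits one wide, structured event per request, with enough context to answer "is this working?" minutes after it ships. This is an illustrative sketch only: the event shape, the in-memory `EVENTS` sink, and the `checkout` handler are all hypothetical, not Honeycomb's or anyone's real API.

```python
# Sketch: one wide structured event per request, the core idea behind
# observability driven development. Event shape and sink are hypothetical.

import time
from functools import wraps

EVENTS = []  # stand-in for your observability backend

def instrumented(handler):
    """Wrap a request handler so every call emits one structured event."""
    @wraps(handler)
    def wrapper(request):
        event = {"handler": handler.__name__, "request_id": request.get("id")}
        start = time.monotonic()
        try:
            result = handler(request)
            event["status"] = "ok"
            return result
        except Exception as exc:
            event["status"] = "error"
            event["error"] = type(exc).__name__
            raise
        finally:
            event["duration_ms"] = (time.monotonic() - start) * 1000
            EVENTS.append(event)  # emitted whether the call succeeded or not
    return wrapper

@instrumented
def checkout(request):
    """Hypothetical handler: charge for whatever is in the cart."""
    if "cart" not in request:
        raise ValueError("empty cart")
    return {"charged": len(request["cart"])}
```

The engineer who just shipped `checkout` can then query those events for status, duration, and error type and answer "is it doing what I expected, and does anything else look weird?" while the code is still fresh in their head.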
TDD is the most successful paradigm of my lifetime. No doubt. And TDD is great, and you should still write tests. I’m not saying don’t write tests, but it stops at the boundary of your laptop, right? TDD works by abstracting away everything about reality. And that’s just increasingly not enough. We do need to test in production, we do need to add reality back in: concurrency, production load, all that stuff.
On call. On call is the best lever, the most powerful tool that you have for improving production. Whenever people ask me how they can make their developers care about something or how they can increase ownership or whatever, I always ask who’s on call. And honestly, I feel like this is not the hard sell that sometimes people think it is because we all want to do well, right?
We all got into engineering because we were curious about how things work and we like seeing the impact of what we’ve done. And we all want autonomy, mastery, and meaning from our labor, right? And on call can be a really powerful weapon towards making your work meaningful. It’s really key to invest in your deploys, invest in instrumentation.
Progressive deployment is the new term that I guess we’re using for deployments that do canaries, that do rolling deployments so that you don’t break your backend. I’m not saying that there’s no value in staging here. I’m not saying that at all. There’s some value in staging. What I am begging you guys to do here is not treat production as an afterthought, right?
I’ve seen so many teams invest weeks and months of labor into making staging perfect and all of the different test environments of Dev. And then when you’re like, “Can we invest in production?” They’re just like, “Oh, we don’t have any time.” And that’s just bassackwards. That’s just not okay. Invest in production first and let staging take the leftovers. Feature flags are amazing.
Feature flags are a great way to make sure that your code is getting out into production quickly, but you have very fine-grained control over who sees what, and you can decouple deploys and releases. SLOs are an advanced maneuver, but they are the last step towards making it so that you’re actually planning and doing your production work instead of firefighting it and doing it as soon as it hits you in the face every time. They’re great.
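Decoupling deploy from release can be sketched in a few lines: the new code path ships to production dark, and the flag decides who actually sees it. The in-memory `FLAGS` dict and the group names below are hypothetical stand-ins; a real team would use a flag service.

```python
# Sketch: a feature flag decoupling deploy (code is live) from release
# (users can see it). The flag store here is a plain dict for illustration.

FLAGS = {
    "new_checkout": {"enabled_for": {"internal-testers"}},
}

def flag_on(flag_name, user_group):
    """Is this flag enabled for this group of users?"""
    flag = FLAGS.get(flag_name, {})
    return user_group in flag.get("enabled_for", set())

def checkout_page(user_group):
    """Both code paths are deployed; the flag controls the release."""
    if flag_on("new_checkout", user_group):
        return "new checkout flow"  # released only to this group so far
    return "old checkout flow"      # everyone else, until the flag widens
```

Rolling the feature out is then a flag change, not a deploy, and rolling it back is the same flag change in reverse, with no code revert involved.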
And if you’re in a highly regulated environment, there are some myths floating around that, oh, developers can’t touch production, we can’t do CD on our code. It’s false. It’s completely false. It’s complete nonsense. Some of the biggest and best financial institutions in the world manage to do this. So see me after class. I did want to walk through an example of the insidious loop where all of the time gets wasted, so that you can see just how poor Ops hygiene multiplies and wastes everyone’s time.
Here’s the scenario: an engineer merges a diff, and because they don’t have automatic CI/CD hooked up, nothing happens. Hours pass, other people merge things too; eventually, someone comes along and triggers a deploy, and it’s got a few days’ worth of merges. Well, it fails. Takes down the site, pages on call; on call comes in, isn’t aware that someone else is doing something.
So they start firefighting, the person who’s running the deploy manually rolls back, then they start trying to figure out which merge was to blame. Well, they have to start git bisecting and doing test deploys of artifacts, every combination of diffs that have been merged, which pulls in other engineers, everyone who’s merged something in that deployment. So now you’ve pulled in like half a dozen people and this eats up their entire day.
Everyone complains about how much on call sucks, and you’ve wasted the entire day for many people. Multiply this by day after day, week after week. It sucks. What if instead it looked like this: the engineer merges a diff, which automatically kicks off CI/CD and automatically deploys the code. It fails, but it notifies the engineer who just merged and automatically reverts to safety.
Well, she knows exactly what she just did, so she just fixes it, adds some tests and instrumentation, and commits a fix, and it kicks off another CI/CD run and deploys. And voila, 10 minutes later the fix is live and it never rippled out beyond that engineer. This is why you’ve got orders of magnitude of difference between the lower-performing teams and the elite teams, right? This is the difference between being able to deploy once a week and several times a day.
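The "automatically reverts to safety" step in that good-path story can be sketched as a deploy wrapper: push the new version, run a health check, and if it fails, roll back and notify only the engineer who merged. Every name here, including the `healthy` and `notify` hooks, is a hypothetical stand-in for illustration.

```python
# Sketch: deploy with automatic rollback, so a bad deploy never ripples
# beyond the engineer who merged it. All hooks are hypothetical stand-ins.

def deploy_with_auto_revert(new_version, current_version, healthy, notify):
    """Deploy new_version; if the health check fails, revert and notify.

    Returns the version left running after the attempt.
    """
    running = new_version
    if not healthy(running):
        running = current_version  # automatic rollback, the site stays up
        notify(f"{new_version} failed health checks and was reverted")
    return running

# Usage sketch: pretend v42 is broken, so the check fails and we revert.
messages = []
result = deploy_with_auto_revert(
    "v42", "v41",
    healthy=lambda v: v != "v42",
    notify=messages.append,
)
```

The contrast with the bad-path scenario is that the blast radius is one person and one notification, not a site outage, a page, and a day of git bisecting for half a dozen engineers.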
Yeah. And I think it’s fascinating here, Charity. You had mentioned earlier that it’s not about having more great engineers on these elite teams, but it’s the processes of the team itself that enable these engineers to do their best work.
Yep. And before we move on from on-call I want to bring in one question, this is from Jeff saying, if there are no takers on volunteering for on call, and this is coming from the perspective of the team, how does an org encourage folks to do it shy of baking that expectation into a job description? How do you use that soft influence?
Yeah. Well, I think that an all volunteer force is something to build up to, and I would bake it into the job description. I would not hire somebody for a 24/7 highly available internet service who doesn’t understand that the job involves ownership, right? Because if you’re not on call for your own code, then who are you forcing to be on call for it for you?
That’s just crappy. Your job is your responsibility, and it doesn’t end when the clock hits a certain time. It ends when your code is working in production, right? I think that that’s a completely reasonable expectation. As long as it’s coupled with real urgency on the part of management to getting it to a place where it’s sustainable, where it’s compatible with human flourishing. You can find plenty of engineers who understand that this is what it means to be an adult.
Absolutely. It really pays. Really pays to be on that high-performing team.
It really does. I think that our instincts tell us to slow down when we get wobbly, and our instincts are wrong because speed is safety when it comes to this. Think of it like riding a bicycle or ice skates or you’re a shark, if you stop, you die. Speed is safety. And that’s because you want to get that stuff into production while you have that original intent, what you meant to do in your head.
That moment in time is so powerful. When you’ve just written that code, you will never again understand it as completely as you understand it right now, right? It’s going to start aging out and being replaced by other bits soon. But while it’s fresh in your head, if you ship it right then, and if you’re looking at it in production, you understand the consequences like that, you can find like 80% of all bugs right there before users even have a chance to notice. And it’s much harder … Mm-hmm (affirmative).
I was going to say, you’re not advocating testing in prod here, are you, Charity?
Absolutely, I am. Yes. Testing in production: you test in prod or live a lie, right? That doesn’t mean that you test only in production, it doesn’t mean that you test stupidly in production, but at the end of the day, you’re going to be testing in production. You may as well admit it and try to do it well. Any other questions before we move on?
Yeah. Let’s just cover just a couple more. There’s been a lot of questions about SRE versus DevOps, folks trying to understand, because I don’t think we’ve really used the term SRE in this presentation so far.
I wanted to understand how you think about it, how it fits in with the Ops and Dev clash we’ve talked about so far.
Sure. I read a hilarious quote somewhere that said SRE is DevOps without the empathy, which made me laugh. Honestly, there is no consistency. Every person that you ask will have a different answer here. The answer is really just a historical one. SRE came from Google, DevOps came from the rest of us. And typically if you have a bunch of Googlers on your staff, then they named the team SRE because that’s what they know.
And if you don’t, then they name it DevOps. It’s as simple as that. I do wonder if what we’re heading towards is a world where SRE has more of a real meaning. And what it really means is these are the senior engineers who follow that more consultant-like model. I’m honestly not sure that there’s much meaning in the future in being a junior SRE, because I think that that would just be a software engineer who works on slightly more infra problems.
As we’re moving up the stack, it’s just not really clear what that means. But honestly, if you’re going around expecting the title DevOps versus SRE to mean something, you’re going to be disappointed because it just doesn’t. It’s just a historical thing.
Fair enough. All right. Well, let’s move ahead then and talk about leadership a little bit, and then we may come back, we’ll definitely come back and we’ll loop through some more questions here right at the end.
Every socio-technical system is a snowflake. It is the only one of its kind. There are these highly complex systems that we build, we maintain. And what that means is that you can’t follow anyone else’s rule book, right? You can’t just take a playbook from any other company or any other system and blindly apply it to yours and expect it to work. This is what’s fun about our profession, right?
This is what’s awesome about it. This is what’s interesting, that we all have unique problems to solve. That said, there are some patterns that make it more probable that your snowflake will … If you’re standing still you’re losing ground because everything is just slowly drifting into entropy. And that means you need to hire people who are interested and intrigued by change and who are not too threatened by it.
Because the only constant is change. Then you need to give them emotional safety. You need to invest in good tooling. It doesn’t matter if you hire the best engineers in the world, if they don’t have good tools, they can’t do their jobs. It means you need to pay down tech debt constantly. Practice observability driven development. Really I think so many of the changes that we’ve seen the last few years are oriented towards shifting that center of gravity towards production, right?
Whether it’s chaos engineering or observability or feature flags, it’s all about giving ourselves more knobs around production. Because production is the only environment that matters. If your users aren’t using your code, it is dead code. It may as well not exist. And then of course constructing those feedback loops. I also think it’s important to remember that managers don’t hire people, they craft teams, especially at smaller companies.
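Those knobs around production can be as simple as a feature flag. As a minimal sketch of the idea (the flag names, percentages, and hashing scheme here are illustrative assumptions, not any particular vendor’s API), a flag store can gate code paths per user so new code ships dark and is turned up gradually:

```python
# Minimal feature-flag sketch: ship code dark, then dial up exposure.
# Flag names and rollout percentages are hypothetical illustrations.
import hashlib

class FeatureFlags:
    """In-memory flag store mapping flag name -> rollout percentage (0-100)."""
    def __init__(self, rollouts):
        self.rollouts = rollouts

    def is_enabled(self, flag, user_id):
        pct = self.rollouts.get(flag, 0)  # unknown flags default to off
        # Hash flag+user so each user lands in a stable bucket per flag.
        digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
        bucket = int(digest, 16) % 100
        return bucket < pct

flags = FeatureFlags({"new-checkout": 10})  # only ~10% of users see it
enabled = flags.is_enabled("new-checkout", "user-42")
```

Because the bucket is derived from a hash rather than a random draw, a given user sees a consistent experience while the rollout percentage is raised or rolled back instantly, without a deploy.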
And you should hire people for their strengths, not for their lack of weaknesses. You don’t need super people, right? Nobody started out as an amazing engineer. They joined teams where they became great engineers. I think I just wanted to show this one more time because great engineers are forged in the crucible of great teams. Which of these poor kids is going to be a better engineer in two years?
The one who had five deploys per year to learn from and who’s firefighting constantly, or the one who got 3,000 deploys per year? The amount of human suffering encompassed in the poor dude on the right, versus that person who got to grow up into being a high-performing engineer on the left, is enormous. It is really worth investing in these socio-technical systems.
We often think of it as, how do you make elite teams? Just hire the best people. But what happens when an engineer from the elite bubble joins a team in the medium bubble? Your productivity will rise or fall to match that of the team that you joined, right? We’ve all seen Googlers who have left Google and joined another team, and three to six months later their productivity matches that of the people around them, right?
And the same is true in reverse. I’ve seen plenty of people come out of nowhere from health care or the government, not known for their efficiency but they join a high-performing team and within three to six months they’re keeping up. We are so influenced by the systems that are around us, the infrastructure, the structure, the shipping code, it impacts us all.
It is the most important thing when it comes to being high performing. I also wanted to touch briefly on the build or buy thing, because this is something that I think a lot of leaders are constantly going through. A lot of people have been burned in both directions, so I’m not going to claim that there’s one great answer that is always true. But it is leadership’s responsibility to focus relentlessly on the core business differentiators.
Engineering cycles are perpetually the scarcest resource in our universe. And I think that you should have a default approach that is reluctant to build. Code is legacy, right? As my friend PVH likes to say, the best code of all is no code. Always. It’s the most scalable option. The second best is code that someone else writes and maintains for me, and the third best is anything else.
All code feels great when you write it, but then you have to maintain it. And I feel like for senior engineers, in particular, it’s really a responsibility to amplify the hidden costs. Decision makers are making the best decisions that they can with the information that they have, but they don’t have all the information that you have. And a lot of us I think we feel like we’ve done our jobs if we’ve said something once, but that’s not sufficient.
You need to repeat yourself. We just do. We need to repeat ourselves. It’s important to both amplify hidden costs and make people aware of the costliness of their decisions. And it’s also important for us to point at and praise the things that went well so that they will be repeated. Again, it’s not enough to just do that once, right? If you fix your build pipeline and things are going great and they haven’t broken in weeks, praise that person. Point it out.
Be like, “Oh man, isn’t it great that this thing isn’t breaking?” That’s one of the most powerful tools that you have in your arsenal, both to increase your own impact and to just guide people into doing the right things. But the system relies upon us to amplify the hidden costs and the dog that didn’t bark, right? The things that we don’t see anymore because they’re going well.
I think that when it comes to vendor engineering, in a good vendor relationship, if you’re a small company, they should feel like just another team at your company, right? And conversely, if you’re at a big company, your team should feel just like vendors, right? They should be loosely coupled but highly aligned.
And as for the stuff that goes into vendor engineering, I put a list there: building libraries, modules, shared interfaces, examples, docs, relationships, driving consistent internal use cases, really wrapping this into our promotion practices, into our job ladders … Whatever we praise and promote people for, we’re going to see more of. So I think we need to think thoughtfully about what we want to see more of.
Do we want people feeling like they have to write a lot of Greenfield code to get noticed and appreciated? Most engineers do feel that way, and that’s not great.
The vendor engineering you’re describing, where you’re treating teams like vendors and vendors like teams, is almost another approach to the socio-technical systems engineering you were talking about at the beginning, isn’t it?
Totally. Absolutely. They’re just people, right? Vendors are … well, you should never believe everything they say. With a vendor, I say trust but verify, because they’ve got something to sell, but they’re also solving hard problems. And if something is your core mission, you’re going to do a better job of it than anyone whose core mission it isn’t.
And this is how we scale our efforts, right? I remember when I started doing Ops, when I started doing this stuff, I was a jack of all trades. We all were. We all ran DNS, we ran mail, we did everything from formatting the file system to bootstrapping the OS and applying security patches. We did everything. I don’t do almost any of that stuff anymore.
And it’s not because it went away or it doesn’t have to be done, it’s because somebody now does it. It’s their mission to do every single one of those things, right? And they do it better than I did. And that means that I get to move up the stack and use all of this amazing work as leverage and build powerful things for just like 250 a month. It’s incredible. We live in a golden age. We really do.
That’s awesome. That is just great. And Charity, this has been wonderful. We’ve got time for just a few questions here at the end if you’re up for that.
There’s been a few great things. There are some that I think are right up your alley. A couple of them I’m fairly sure are troll questions specifically, but we’ll get to those too. So let me just say real quick before we get into that, when you leave the webinar today you will be presented with the opportunity to take a survey. And if you go ahead and fill out that quick survey, you will be entered to win the t-shirt that’s shown here, the ACG swag.
So definitely do that. We give these out with every webinar. We’ve got one with your name on it, so go ahead and fill that out. Okay. So let’s go ahead and talk about a couple of these questions. So I think there’s a great one here, and this is from Alan saying … I love this one, Charity, because I think it ties together two things that you’re known for talking about, one being testing in prod, the other being database reliability.
For those that don’t know, Charity literally wrote a book on database reliability engineering. And the question is, when testing in prod affects data and we’ll have to involve data fixes if it goes wrong, is testing in prod really justified, and are staging and UAT really not worth it?
So, the closer you get to laying bits on disk, the more conservative you should be. Absolutely. And I think that, for example, testing a major database version is one of the absolute no-brainer use cases for capture-replay, right? This is a piece of software I’ve written three times in my career for three different databases: something to sniff 24 hours’ worth of traffic and replay it against database snapshots again and again and again.
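The capture-replay idea can be sketched very simply. Real versions of this tool sniff live wire traffic; the toy below, using `sqlite3` purely as a stand-in database (the schema and queries are made up for illustration), just logs queries as they hit production and replays the log against a snapshot running the candidate version, then compares results:

```python
# Toy capture/replay sketch. Real tools sniff live traffic; here we log
# queries while proxying them, then replay the log against a snapshot.
import sqlite3

def capture(conn, query, log):
    """Run a query against 'production', recording it for later replay."""
    log.append(query)
    return conn.execute(query).fetchall()

def replay(snapshot_conn, log):
    """Replay captured queries against a snapshot, collecting all results."""
    return [snapshot_conn.execute(q).fetchall() for q in log]

# "Production" database.
prod = sqlite3.connect(":memory:")
prod.execute("CREATE TABLE users (id INTEGER, name TEXT)")
prod.execute("INSERT INTO users VALUES (1, 'ada')")

log = []
live_result = capture(prod, "SELECT name FROM users WHERE id = 1", log)

# Snapshot of the same data, imagined as the new database version.
snap = sqlite3.connect(":memory:")
snap.execute("CREATE TABLE users (id INTEGER, name TEXT)")
snap.execute("INSERT INTO users VALUES (1, 'ada')")

assert replay(snap, log) == [live_result]  # matching results build confidence
```

The value is in the comparison step: diffing replay results (and timings) between versions surfaces behavioral regressions before any real traffic touches the upgraded database.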
Absolutely, you should be conservative when it comes to touching data. That said, there are a lot of ways to minimize that blast radius. Yes, when you’re doing a database upgrade you should be that paranoid, but it doesn’t mean that every time you mutate data you need to be that paranoid. For example, you could have test accounts in production that follow a particular UID scheme or a naming scheme or something, and you can run tests on them in production.
Which will give you a much more realistic sense of how things will actually perform in production without polluting your other namespaces. And then you can drop those. You can have a scripted way of just pruning and dropping that data. Does that make sense?
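As a concrete sketch of that naming-scheme approach (the `test-` prefix and the shape of the data store are illustrative assumptions, not any real product’s schema), test accounts can be tagged so their data is trivially identifiable and prunable:

```python
# Hypothetical convention: test accounts carry a "test-" prefix so they can
# be exercised in production and their data pruned afterwards.
TEST_PREFIX = "test-"

def is_test_account(user_id: str) -> bool:
    return user_id.startswith(TEST_PREFIX)

def prune_test_data(store: dict) -> int:
    """Delete every record owned by a test account; return how many were dropped."""
    doomed = [uid for uid in store if is_test_account(uid)]
    for uid in doomed:
        del store[uid]
    return len(doomed)

store = {"alice": {"orders": 3}, "test-checkout-1": {"orders": 1}}
prune_test_data(store)  # test data is gone; real data for "alice" survives
```

Running tests under accounts like these exercises the real production path end to end, while the scripted prune keeps their writes from polluting real namespaces.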
I think it definitely does. And along the same lines, I think there’s been a number of questions asking about some form of terminology or some form of describing, what the Ops team is. Questions about DevSecOps, somebody asking if you see companies using NoOps, which I know is a question that you love. Maybe you can talk a little bit about NoOps and maybe the misconceptions that you see there.
Oh, I would love to. First I would like to say I think that the next big wave is NoDevs because, do we really still need developers? All we’re doing is gluing APIs together. How hard is that? Anybody can do that. Do we really need developers? I think that’s got to be a troll question because I don’t think the NoOps thing is really a movement. I don’t like the way that people talk about Ops as just the cost center, just synonymous with toil.
It is not. It is the engineering of how we deliver value to users. There’s a lot of depth there, there’s a lot of creativity, there’s a lot of difficulty, there’s a lot of knowledge and expertise, things that you can learn from. And to just diminish it like that guarantees you’re going to do a bad job of it. I think that the companies I’ve seen with the most inhumane, most humanly catastrophic systems are the ones that have invested the least in operations.
And that certainly rhymes with my experience as well. And I think that’s true across different sizes of organizations. We’ve had a few people asking about how you scale this to large enterprises that have very siloed Dev teams and very separate Ops and governance?
Yeah. Again, the first question that I often want to just ask is who’s on call, because on-call has an amazing way of breaking down barriers and walls. If you put software engineers on call for their services, they will pretty rapidly see the virtue of talking to their Ops colleagues. Not every organization can be helped, or maybe not by you. Depends on how much power you have, how invested you are in their success.
Often the answer is to go get a different job. It will be easier somewhere else. You can join a team that wants your expertise, that wants to be high-performing, that wants to listen. So don’t bash your head against the wall for too long because … I actually think that when we stick around those kinds of jobs, we reward people who shouldn’t be rewarded. We tell them that it’s fine to treat people the way they treat them, and I don’t like that.
Yeah, I think it’s good to know what the threshold is there for when you need to move on. There’s a couple of questions about Honeycomb, people are asking what is it that Honeycomb does in that vendor engineering space? Maybe you can just expand a little bit on that in the couple minutes that we have left, Charity.
Yeah. Honeycomb is an observability tool, the only observability tool. I have written about this a lot so I won’t rehash it all here, but monitoring is not the same thing as observability. And a lot of the companies out there that are stealing our messaging are talking about observability but not actually shipping it.
And observability matters. Observability is the ability to slice and dice across high-cardinality dimensions and ask any question. It’s the ability to understand any system state without prior knowledge of it, without having predicted in advance that it could happen. When we had the monoliths, they tended to fail in predictable ways. We had the app and the database, right?
And most of the application complexity was bundled inside the application. So you might have needed to attach a debugger, but the complexity couldn’t explode out of there. Now we’ve got microservices: dozens of services, dozens of storage engines, everything. The hardest problem is often just figuring out where in the system the problem starts. And monitoring is just insufficient at this point.
You need to have high cardinality, high dimensionality, the ability to understand the system state. That’s what observability is. You asked about vendor engineering, and we see a lot of people adopting this when they have these systems where you can no longer predict all the ways that they’re going to fail. In fact, every time you get paged it’s something new and exciting, right?
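That high-cardinality slicing can be illustrated with a toy in-memory version (the event fields like `user_id` and `build` are made-up examples, not Honeycomb’s actual schema or API): wide events are grouped by any arbitrary dimension on demand, with no pre-declared index or dashboard:

```python
# Toy illustration of slicing wide events by an arbitrary high-cardinality
# dimension. Field names here are hypothetical, not a real vendor schema.
from collections import defaultdict

events = [
    {"user_id": "u1", "endpoint": "/cart", "duration_ms": 120, "build": "abc1"},
    {"user_id": "u2", "endpoint": "/cart", "duration_ms": 900, "build": "abc2"},
    {"user_id": "u1", "endpoint": "/home", "duration_ms": 35,  "build": "abc1"},
]

def slice_by(events, dimension, metric="duration_ms"):
    """Group events by any dimension and report worst-case latency per group."""
    groups = defaultdict(list)
    for e in events:
        groups[e[dimension]].append(e[metric])
    return {key: max(values) for key, values in groups.items()}

slice_by(events, "build")    # the slow build stands out immediately
slice_by(events, "user_id")  # same query shape works for any field
```

The point of the sketch is that the grouping dimension is chosen at query time: you can pivot from build to user to endpoint while investigating, which is exactly what predeclared monitoring checks can’t do.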
And so the old model where every time something new happened we would post-mortem it and then we would create a dashboard that would help us find it immediately the next time, and write a custom monitoring check, all of that effort is wasted if the problem is never going to happen again, which is the vast majority of problems in this really brave new world.
So, people who are adopting this will often have observability teams and their observability teams will practice vendor engineering by writing integrations to deliver events to Honeycomb, to instrument their code, to let them explicitly do this sort of thing. And then constructing visualizations to share with the team as well.
I think that’s great, and it really plays into the power of the cloud in general, doesn’t it? I sometimes call it a hive mind, right?
You’re seeing problems before customers even have them, because you have so many customers with such a broad range of problems that they’re addressing all the time. It’s really an amazing effect of centralizing that much talent on one particular problem, that mission that you were talking about.
Yes. Okay. Well, we’re right up at the end of the time here, we’ve got two minutes left. I’m going to end on what I think is a happy note. It’s from Jose, who’s been asking all kinds of great questions today, and this is the question of how do we improve motivation and recognition of Ops people in the organization?
Oh, that’s such a good question. Oh, yes, I love that question. A lot of this comes back to the fact that in Ops, when we’re doing a good job, there’s a tendency for no one to know that we’re there, right? Because everything seems to just work, like magic, like fairies in the system. And a lot of the answer here has to do with management and leadership, getting recognition for the people who managed to change the engines on the jet plane in midair without anyone noticing, right?
Ops has often historically been seen as not as advanced as developers, so they don’t get raises and they don’t get promotions on the same timeline. But recognizing that this work is difficult and hard and important, there’s a role to be played for managers there, there’s a role for senior engineers in calling out the things that didn’t break and praising people who have done those things. There’s a role for us ourselves too, not just tooting our own horn, but getting better at communication. Communication is a fundamental engineering skill and we could all stand to get much better at it.
Well, that’s for sure.
Also praising each other. You can’t toot your own horn, but you can toot the horn of everyone around you, and then they will toot yours in return.
I love that. Charity, thank you so much for taking the time today. This has been wonderful and so many great questions. There have been other questions which are great, you can feel free to @acloudguru on Twitter with those and we’ll try to get to some more of them. And don’t forget to take that survey on your way out. You’ll see it when you close out of the webinar app. But again, Charity, thanks so much for your time.
Thanks for having me.
Wonderful. All right. Well, keep being awesome everyone and we’ll see you in two weeks. The next webinar is serverless CI/CD on AWS. That’s with Rob Sutter who’s an AWS developer advocate, and we will have a lot of live demos during that session. So if you’re doing anything with DevOps CI/CD, you will want to check out what the state of the art is on AWS now. I’ll see you there and have a wonderful day. Thanks, everyone. Bye now.