
Cultivating Production Excellence

Taming complex distributed systems is not just about changing tools and techniques. It also requires changing who is involved in production, how they collaborate, and how we measure success. Liz Fong-Jones walks through the thinking behind a team that strives for production excellence.

Transcript

Liz Fong-Jones [Developer Advocate|Honeycomb]:

Thank you for having me on. So hello, INS1GHTS. I know it’s been an action-packed day full of lots of exciting things to learn about, and this is one of the final talks. Today, I want to tell you about some of the lessons that I have learned over my past seventeen plus years working as a site reliability engineer or systems engineer, and how it’s really impacted how I think about our sociotechnical systems, and how we have to design both our systems that are technical as well as our people systems in order to make sure that we’re able to run them sustainably and reliably. I think it merits saying that we try to write code in order to solve problems. That we’re trying to solve some kind of issue going on in the world, whether it be improving people’s productivity, improving commerce. But the problem is that that code doesn’t exist in a vacuum. Instead, we need to be able to understand how it works in production in order to be confident that it’s having the impact that it needs to have. So just merging it into the trunk is not sufficient to ensure that our code is having the reliability and security and performance that we want.

Part of the challenge is that we as engineers have a relatively limited amount of things that we can hold in our heads, but we need to make sure that we can understand the behavior of the systems. The systems are growing and growing and growing with time, and therefore we can’t really rely upon just squeezing all that understanding into our heads. We have to instead build all these abstractions so that we only have to hold one layer at a time. This is where the idea of microservices came from, where we had this idea of, let’s separate out concerns, let’s ensure that we each have to focus on only one component at a time, but when things break, how do we actually figure out what’s going on? It can be really challenging to do that. We’re adding all of this complexity all the time, some of which is essential to our business, some of which we really need in order to make sure that we’re delivering the features and value that our customers depend upon, but some of which is technical debt, some of which is stuff that we haven’t really bothered to properly document, or that we intend to simplify later and we don’t bother cleaning up. We’re in a situation where this debt is compounding, where this cognitive overhead is escalating and escalating, and we need to really understand and tame all of this complexity.

The name of the conference is INS1GHTS, and I think this is a really great name for this talk. Let’s focus back on this idea of what it means for our systems to be available so that they’re delivering value to end-users. What does uptime mean? How do we measure it? Well, a long time ago, I used to work at a game studio that was based in San Francisco. This was 17 years ago. We thought about the idea of, if the game world is up, that means that the two servers that run it are up. If the two servers are up, everything’s up. If the two servers are down, our services are powered down. But that doesn’t really work in today’s world where our systems run across uncountably many servers because there are many microservices and many copies of those microservices running. So thinking about whether any one server is up and running, that doesn’t make sense anymore as a way of measuring how our services are running.

We shouldn’t have to wait until users are complaining that things are hard down before we take action. There has to be somewhere in the middle for us to think about how we measure uptime and how we measure whether we’re having the results that we want in the world. In addition to thinking about uptime, we also have to think about delivering features. We also have to think about security. We also have to think about performance. It’s a lot, and we’re really tired, right? If you talk to anyone who’s worked in ops for a long time, we all bear scars and trauma from having to deal with this year after year after year after year. Nights and nights and nights of staying up past 2:00 AM every single night when the servers are on fire. We need better strategies for adapting to this new world in which we have microservices, we have services that are scaling up to meet billions of users, and the rapid rate of change that we’re trying to introduce into our systems. Now, one trap that I’ve seen people fall into is trying to buy their way out of this problem, that they’re trying to buy their way into having a DevOps practice because they hear that DevOps is the right thing to do. It may be true that DevOps is the right thing to do. We have to remember what DevOps is about. DevOps is about culture, automation, and sometimes tools, right? But it fundamentally revolves around the idea of changing our people’s practices. You can’t get that in a box from a vendor. But when you try to do that, right, when you order your alphabet soup from the vendors, when you get automated continuous integration and delivery, because they tell you to ship shit faster, well, ship shit faster is what happens.

4:58

You wind up shipping garbage and that’s no fun for anyone. Or let’s say you deploy infrastructure as code without being fully ready for it. Now, instead of one command being able to take down one server, one stray command can take down your entire AWS environment. Whoops. Or what about adding Kubernetes? Because everyone loves to talk about Kubernetes. It’s the hot new thing. Well, maybe it’s not appropriate for you, but if you buy it without knowing what you’re getting into, you’re signing up for a lot of maintenance headaches. But beyond that, I think I also see people adopting tools like PagerDuty without thinking about the cultural ramifications of putting every developer on call. When you put people on call and subject people to the same thing that us ops types have been dealing with for decades, people burn out, right? Ops people have been burnt out for many, many years, but suddenly your product development software engineers are also burning out night after night because they’re getting alerted all the time and they get really, really grumpy from waking up at 3:00 AM every single morning. When they start debugging the issues, what they find is that their dashboards don’t make sense. Their dashboards are speaking a completely different language. They’re speaking the language of individual hosts and CPU and disk and all of the variety of metrics that people have piled on and piled on and piled on, and now you’re stuck trying to figure out at 3:00 AM what line wiggled at the same time as this other line. So while you’re looking at all these dashboards and taking 20 minutes, 30 minutes, an hour or two to look through and try to figure out what’s correlated with what, your customers are waiting, because everything is not working for them and your system is down.

Maybe that means that people can’t get their packages they shipped overnight. Maybe it means that people can’t pick up their pharmacy prescriptions, right? This has real implications for people. Eventually, it’s 4:00 AM. You’ve been debugging for two hours. You can’t figure out what’s going on. You finally call up the tech lead of your team. For many of us, we have been that tech lead, right? You get woken up every couple of days at 4:00 AM and you don’t get to cycle off of on call because you are the expert, right? So you’re really grumpy. You’re tired. You figure out a mitigation. You go back to sleep. It’s 8 AM. You try to fix it. You try to push a release and you find out that your deploy pipeline is broken, that no amount of testing of each individual component works in order to ensure that the whole works as a whole. There’s no time to do projects, right? All of this time is spent just draining our team cycles trying to triage and fix issue after issue after issue. This is what we call a state of operational overload in the language of DevOps and SRE teams, where you have both no time to do things and no coherent plan for how you’re going to get out of the situation of having too much operations work to do. It feels often like our teams are struggling to hold on, that we don’t really know how we’re getting to the light at the end of the tunnel and we’re just stuck running the same systems that keep breaking in the same ways over and over and over again. So what are we missing? What should we be doing differently? What is going to get us through the next 20 years that didn’t get us through the past 20 years?

Well, personally, what I think we need to do is we need to think about who operates systems. We need to think about the people. Honestly, retelling that story of years and years of organizational ops trauma has made my heart rate go up. So I want you all to do this. Take a breath with me. Doesn’t that feel so much better, right? When you focus on the people aspects, it really makes everything better. You can calm down. You can focus and you can do so much better on the job running your systems. And no tooling is going to be able to do that. You have to make room and space on your team for you to breathe. Tools can help you by reminding you to do things that you already want to do, but they’re not going to top-down inflict a new organizational set of rules upon your team. They’re not going to magically solve issues where people don’t trust each other. We have to think about the people first and really approach things from the angle of people, culture, and process, and then figure out what tools might make sense to go along with that journey. So that’s what I’m going to be telling you about today, is how we achieve production excellence, combining the people, culture, and tooling together in order to make sure that we achieve the optimal results on our teams. We need to make our systems not just more reliable, but also friendlier to the people who operate them. You don’t get there by accident. We really have to develop a roadmap and plan to figure out how we get from where we are today to the ideal world that we’d like to be in, in the future. We really have to also figure out what our signposts are.

What are we measuring by? How can we figure out what’s going to deliver tangible results to our team and to our stakeholders so that they can see the change and they can avoid burning out sooner? So we have to make sure that we’re involving, not just the people who are working in tech, but also our other stakeholders, sales, customer success, finance, the business, and especially product managers and user researchers. We have to have a culture of psychological safety where people feel like they can ask questions, where people feel like they can contribute and raise their hand if they feel like something’s not quite working for them. So how do we get started with this? Well, I argue that there are four things that you need to do. First of all, we need to know what our success criteria are. What does it mean for our system to be working and when is our system not working as intended? Secondly, we have to be able to debug when the system is out of spec, and third, we have to be able to collaborate across many teams in order to get the job done. And then finally, we need to close that feedback loop. We need to make sure that we’re eliminating unnecessary or excessive complexity to make room in our cognitive budget for the necessary complexity that goes with the features that we’re trying to develop. So why did I say know when our systems are too broken? Why did I not say, know when our systems are broken at all? Well, the reason is that our systems are always failing in some small microscopic way. You might have a lawn that’s full of green grass, and there might be a few blades of brown grass in it. It doesn’t matter if a few blades of grass are brown, as long as the lawn as a whole looks green enough and is soft enough for your dog to play in, for your kids to play in, and so forth.

11:19

Let’s think about measuring those success criteria and figuring out what too broken is. This is an idea that’s one of the core concepts of site reliability engineering. It’s called a Service Level Indicator, and it has a companion called the Service Level Objective. SLIs and SLOs represent a common language that helps us connect with business and engineering stakeholders. They help us define what success means and help us measure it throughout the life cycle of a customer. So we think about what an SLI for our business workflow might look like. That probably involves one or many events that have a context associated with them. For instance, maybe if you operate an eCommerce website, your SLI might be that a customer can visit your homepage and can see items available for purchase within a certain duration of time before they become bored and decide your site is not working and give up. The context might involve fields such as where the customer is located, which version of the website they’re seeing, which specific page they’re looking at, what their ISP is, and so many more factors. We need a way to categorize these events as good, bad, or not applicable to figure out which events represent a satisfactory user experience and which ones represent a disappointed customer who’s potentially going away and telling their friends that the site wasn’t reliable. One way of doing this is to ask your product managers or user experience researchers to find out, what are their criteria for success? What are their critical user journeys like? Or maybe you can do chaos engineering experiments. Slow down your own experience and find out, when does it feel laggy? When you add a hundred milliseconds of latency? 200 to 500? You can figure that out in order to better understand what the requirements are for your system in terms of its availability and its latency, and that’ll lead you to figure out what threshold buckets those events. For instance, maybe you’ll figure out that your site is fast enough as long as it serves within 300 milliseconds.

We might decide to put a stake in the ground and say our Service Level Indicator is that a load of the homepage must complete with HTTP status 200 in less than some threshold, say 100 or 300 milliseconds. Then we want to make sure that we’re only categorizing events as good or bad that are real user experiences. We don’t want internal load testing traffic to show up. We don’t want your friendly local botnet to show up, and we don’t want irrelevant things or things that are relevant for a different Service Level Indicator to show up. For instance, you probably don’t necessarily care about categorizing the performance of your checkout workflow in the same indicator as your homepage workflow, because checkout and talking to your credit card company takes much longer than rendering a simple view and an index page of all of your items available for purchase. Now this empowers you to figure out the number of good events and the number of eligible events and allows you to compute the percentage availability or the percentage success rate. That allows us to form our Service Level Objective. Our Service Level Objective is a target for the percentage of events as categorized by our Service Level Indicator measured over a window of time. We can’t just measure, “Oh, we had 100% uptime over the past 24 hours,” because that ignores what happened over the previous 24 hours. If you were 100% down yesterday, you can’t go to your boss and say, “But we were 100% up today,” right? Customers have a much longer memory than that. So we have to set a longer window, for instance, 30 days or 90 days, on which to measure performance and set a target for the percentage of events that we expect to succeed.
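[Editor's sketch] The categorization described above (good, bad, or not applicable, then a success percentage over eligible events) could look roughly like the following. This is a minimal illustration and not Honeycomb's implementation: the `Event` fields, the `is_synthetic` flag, and the function names are all hypothetical, and the 300 millisecond threshold and 99.9% target are simply the example numbers used in the talk.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

LATENCY_THRESHOLD_MS = 300  # example threshold from the talk


@dataclass
class Event:
    path: str                   # which page was requested
    status: int                 # HTTP status code
    duration_ms: float          # time taken to serve the request
    is_synthetic: bool = False  # hypothetical flag for load tests and bots


def classify(event: Event) -> Optional[bool]:
    """Return True (good), False (bad), or None (not applicable)."""
    # Only real user traffic to the homepage is eligible for this SLI.
    if event.is_synthetic or event.path != "/":
        return None
    return event.status == 200 and event.duration_ms < LATENCY_THRESHOLD_MS


def success_rate(events: Iterable[Event]) -> float:
    """Percentage of eligible events that were good."""
    results = [r for r in (classify(e) for e in events) if r is not None]
    if not results:
        raise ValueError("no eligible events in the window")
    return 100.0 * sum(results) / len(results)


def meets_slo(events: Iterable[Event], target_percent: float = 99.9) -> bool:
    """Compare the measured success rate for a window against the target."""
    return success_rate(events) >= target_percent
```

In practice the events passed in would cover the whole SLO window (30 or 90 days in the talk's example) rather than a single day, and checkout traffic would get its own indicator with its own threshold.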

Maybe, for instance, we’ll set 99.9% of events must be good over the past 30 days where an event is defined as good if it was a homepage render and it was served in less than 300 milliseconds with HTTP code 200. So why not aim for 100% or 99.999%? Well, a good SLO barely keeps your users happy. You want your Service Level Objective to barely meet user expectations so that it leaves you the freedom to experiment and room to move much faster rather than pursuing endless reliability and neglecting the key performance and features that your users need. What can we do with SLOs? We can do two things with our Service Level Objectives. First of all, we can use them to decide what’s an emergency and what’s not. We do this by calculating based on the number of total events and the percentage that we’re trying to target. Therefore, we can compute the number of events that we’re allowed to have fail over a given window of time. For instance, if I’m serving a million requests per month and I’m allowed to have one in a thousand fail, AKA 99.9% SLO target, that means that I can have a thousand events fail over that month. I can figure out that if I’m burning through a hundred bad events per hour, that I’m going to run through my error budget in 10 hours, right? Whereas if I’m bleeding much more slowly, I can take my time to resolve it and I don’t necessarily have to wake someone up. That enables us to assign levels of urgency rather than thinking purely about, what’s the instantaneous error rate, or is my CPU usage high? If something is genuinely not an emergency, well, it can wait until the next weekday or business day.
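[Editor's sketch] The error budget arithmetic above can be written out as a few lines of code. The function names and the 24-hour paging threshold are assumptions made for illustration; the numbers in the usage lines (a million requests, a 99.9% target, a hundred bad events per hour) are the ones from the talk. `Fraction` is used so that 99.9% is represented exactly instead of as a rounded float.

```python
from fractions import Fraction
from typing import Optional


def error_budget(total_events: int, slo_target: str) -> Fraction:
    """Number of events allowed to fail in the window, e.g. target '0.999'."""
    return total_events * (1 - Fraction(slo_target))


def hours_until_exhausted(budget_remaining, bad_events_per_hour) -> Optional[Fraction]:
    """Hours left at the current burn rate, or None if nothing is burning."""
    if bad_events_per_hour <= 0:
        return None
    return Fraction(budget_remaining) / Fraction(bad_events_per_hour)


def urgency(hours_left: Optional[Fraction], page_threshold_hours: int = 24) -> str:
    """Hypothetical policy: page a human only if the budget runs out soon."""
    if hours_left is None:
        return "ok"
    return "page" if hours_left <= page_threshold_hours else "ticket"


budget = error_budget(1_000_000, "0.999")        # 1000 events may fail
fast_burn = hours_until_exhausted(budget, 100)   # 10 hours of budget left
slow_burn = hours_until_exhausted(budget, 1)     # 1000 hours of budget left
print(budget, fast_burn, urgency(fast_burn), urgency(slow_burn))
```

The point of the policy function is the one made in the talk: a fast burn wakes someone up, while a slow bleed becomes a ticket for the next business day.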

16:46

Here’s an example that we had at Honeycomb, where we were trying to measure the success of our ingest endpoint. We discovered that over the course of a few hours, we were starting to bleed through our error budget with a 2% brownout that kept on happening every couple of hours. So we woke someone up and dealt with it. But you can also do things beyond immediate response by also thinking about how we measure and maintain our overall product goals versus our reliability goals. We can decide if we have plenty of error budget left, that as long as we know how to mitigate or roll back and limit the blast radius of an experiment, we can push forward with using a feature flag to push something experimental. Worst case, it doesn’t work out. We roll it back and we’ve only burned some of our error budget. Conversely, if you’ve had a set of really bad outages, for instance, the one that we had about nine months ago, we can think instead about investing in more reliability and really fixing some of those things that we hadn’t thought were important before having that outage. You don’t have to be perfect at having an SLO. You just have to start measuring something and you can iterate over time. For instance, your load balancer logs are a perfectly fine place to start in order to measure what your load balancer at least thinks about your services’ response codes and latency. Over time, you can iterate to meet your users’ needs. As you understand, for instance, that you have areas where your SLO didn’t catch a user-impacting outage, you might want to tune your SLO. Or conversely, if your SLO says that things are broken but no one’s complaining, well, maybe your SLO is wrong. But I want to reinforce that our code has social impacts as well, and we need to be thinking about that.

It’s not okay to disparately impact populations of people that are more marginalized. For instance, if people are systematically getting denied loans because they live in one particular zip code, I would treat that as an outage, right? That’s a serious issue that you need to pay attention to. But overall, what I would say is preserve your own cognitive ability. Instead of alerting on CPU load, think instead about measuring what success means to your customers and making sure that a majority of your customers are having a good experience with your site and that your site is providing a good enough quality of service to everyone, not just to specific more privileged groups of people. But SLOs and SLIs are only really half of this picture because they only address the monitoring and alerting side. We also need to be able to debug when we have an actual outage. When we have an outage, ideally it should be something that’s novel to us, right? It shouldn’t be living Groundhog Day, living the same outage over and over. So that means that that tooling that you use to fix that first outage, right, you’d better make sure that it’s flexible enough to ensure that it works across many different kinds of outages, rather than hyper-focusing your tools on only solving a specific set of problems. Every problem that you encounter is going to be new if you’re always making your systems better. You cannot predict in advance how your systems are going to fail for sure, so you have to have that capability to understand everything going on in the context of your system. That means that you have to be able to debug new use cases in production because your customers won’t wait three weeks for you to reproduce their issues in a staging environment.

You have to be able to understand that code in production without pushing new code in order to really have a fast time to repair your incidents. Avoid siloing your data. Make sure that everyone is working off of the same set of data that they use for debugging rather than pointing fingers at each other and saying, “Oh, it works fine in my system. I don’t understand why you think that things are broken.” We have to empower people to look at that data to form and test hypotheses. As someone who is on call, when I have an incident, I spend a majority of my time trying to figure out what might’ve gone wrong and how I can verify it, right? That’s where I spend a lot of my time. Once I understand what’s going on, it tends to be relatively simple to fix it but trying to understand that, right, trying to prove my hypothesis, that’s what takes the most time. We have to empower engineers to dive into that data, to ask new questions of what’s going on in their system and not just the things that they thought were going to be relevant at the time they wrote the code. All of this is just to say our services have to be observable. Our services have to be understandable so that we can ask new questions of them at runtime, that we didn’t predict in advance, in order to comprehend how the system is behaving. We have to be able to look at all these properties of the system like I was alluding to earlier. Like which version numbers did the request cross? Which services did it cross? How long did it take in each of those services? What was the call chain that happened or the stack trace in a distributed sense?

21:44

And what specific features might be shared in common across all of your failing requests? For instance, could we tell that all of our shopping cart failures were coming from the same set of items that people were failing to be able to buy? Or what would happen, for instance, if you had an outage where people were unable to pass age verification for purchasing alcohol, but only in specific states and only with a specific version number of your microservice, if you were a wine brewery? How would you detect that, right? Can you find and pull out those common factors that correlate with a particular set of users being impacted? This requires us to collect relevant dimensions, things like what was the HTTP response code and what was the latency as well as more esoteric things like what was the geography of the user? What was the company that that user belongs to? What was their email address? What service did it serve on, right? All of these things really matter for getting that full picture and being able to do that correlation and figure out what’s going on with your service. Do you know what’s better than debugging at 3 AM with really great tools? Not having to wake up at 3 AM. Can we collect the telemetry that we need without having to wake up for it and then be able to automatically remediate it, for instance, rolling back a bad deploy, or for instance, turning off the bad availability zone and then looking at it during the daytime? That would make all of our lives so much better. So the standard to aim for is not just having the ability to debug things in real-time. It’s to be able to debug things post facto. But observability is not just fixing broken things in production.
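[Editor's sketch] One simple way to "pull out those common factors" is to compare how often each field value shows up among failing requests versus passing ones. The following is an illustration of that idea only, not a description of how any particular observability product does it. The event fields (`state`, `version`, `status`) and the function name are hypothetical, loosely modeled on the age-verification example above.

```python
from collections import Counter
from typing import Callable, Dict, Hashable, Iterable, List, Tuple

Event = Dict[str, Hashable]


def overrepresented_dimensions(
    events: Iterable[Event],
    is_failure: Callable[[Event], bool],
    ignore: Iterable[str] = (),
    top_n: int = 5,
) -> List[Tuple[Tuple[str, Hashable], float]]:
    """Rank (field, value) pairs by how much more common they are in failures."""
    ignored = set(ignore)
    events = list(events)
    failing = [e for e in events if is_failure(e)]
    passing = [e for e in events if not is_failure(e)]
    if not failing:
        return []

    def count_pairs(group: List[Event]) -> Counter:
        return Counter(
            (field, value)
            for event in group
            for field, value in event.items()
            if field not in ignored
        )

    fail_counts, pass_counts = count_pairs(failing), count_pairs(passing)
    scored = []
    for pair, count in fail_counts.items():
        fail_share = count / len(failing)
        pass_share = pass_counts[pair] / len(passing) if passing else 0.0
        scored.append((pair, round(fail_share - pass_share, 3)))
    # Largest difference first: these values are most specific to failures.
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]


sample = (
    [{"state": "CA", "version": "v2", "status": 500}] * 3
    + [{"state": "CA", "version": "v1", "status": 200}] * 2
    + [{"state": "NY", "version": "v2", "status": 200}] * 2
    + [{"state": "NY", "version": "v1", "status": 200}] * 3
)
print(overrepresented_dimensions(sample, lambda e: e["status"] >= 500, ignore=["status"]))
```

In this made-up sample neither the state nor the version alone explains the failures, but both rank at the top, which is the hint that the combination is worth investigating. The field used to define failure is ignored so that it does not trivially rank first.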

It also is the ability to understand what’s happening in our code at all points in the life cycle. From the instant that we first started writing code, can we understand what it’s doing and whether or not it’s going to pass our test cases and how it’s performing as we run it through the test suite? Can we understand what barrier is standing in between it and reaching production and how we can speed up the code deployment process? And can we understand what users are actually doing with it in production in a non-emergency situation? Can we understand usage statistics? Can we understand success metrics? And can we understand those dark areas of the code and that hidden complexity of many nested layers of microservices? Can we untangle that web in order to understand how everything fits together so that we can target our improvements, so that we can really manage that technical debt? Another thing is, observability is a socio-technical capability. It’s not just about the data and it’s not even about what form factors we create that data in. It’s about the overall ergonomics. Can we instrument code as easily as adding a print statement? Can we store the data cheaply enough and can we query it in real-time using questions that we didn’t think of at the time that we instrumented it? So in a lot of senses, it doesn’t really matter whether you use a metrics or tracing or logging approach. What matters is, do you have that ability to introspect that code regardless of how you originally instrumented it? So SLOs help you understand when things are too broken and observability is a capability that you build up that enables you to debug and understand those outages, to debug and understand what’s happening inside of your systems.
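[Editor's sketch] To make the ergonomics point concrete, here is one way instrumentation can be kept about as lightweight as a print statement while still producing a structured event with arbitrary dimensions attached. The `wide_event` helper and its field names are hypothetical and written only for this note; a real system would more likely use an existing tracing or structured logging library than this hand-rolled helper.

```python
import json
import time
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def wide_event(name: str, emit: Callable[[str], None] = print, **fields) -> Iterator[dict]:
    """Collect fields for one unit of work and emit them as a single JSON line."""
    event = {"name": name, **fields}
    start = time.monotonic()
    try:
        # The caller can attach any dimension it likes while the work runs.
        yield event
    except Exception as exc:
        event["error"] = repr(exc)
        raise
    finally:
        event["duration_ms"] = round((time.monotonic() - start) * 1000, 3)
        emit(json.dumps(event))


with wide_event("render_homepage", version="v2", region="us-west") as ev:
    ev["status"] = 200
    ev["items_shown"] = 24
```

Because every field lands on one event, questions that were not anticipated at instrumentation time (for example, grouping failures by `version` and `region` together) can still be asked later.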

But I’d argue that there’s another key element that you need for a successful production excellence practice. You need collaboration between your teams. As I was saying at the beginning, many of us are super, super tired. Heroism is just not sustainable. We need to build up that capability across our teams to be able to debug together. We need to make sure that we’re leveling everyone up so that everyone feels like they’re able to handle on call and not just your most elite performers. We need to make sure that we don’t have our debugging stop at organizational boundaries, that your customer success team, your product development software engineers, and your people working in the data centers or your people working at your cloud provider all have a seamless system to debug things together. This means we have to practice. We have to do game days. We have to do wheels of misfortune. We have to work together so that we understand before it’s 3 AM how we’re going to collaborate together. Collaboration is deeply, deeply interpersonal, and that means that we really have to focus on building that trust, on building up those good relationships between people and making sure that we have good working agreements, that we have good trust to raise our hands and say, “Yeah, it was my change that broke the build,” or, “It was my change that caused that outage,” rather than have the person hide for fear that they’re going to be blamed.

26:23

Inclusion really, really matters. If you do not have an organization that supports a diversity of backgrounds, you’re going to have an organization that is deficient in communicating and an organization that is not able to achieve its full potential. Really make sure that people, especially those who are most marginalized, feel safe to bring their whole selves to work. Their safety really, really matters. I implore people, especially in this moment, to think about how they can facilitate having Black technologists in their organizations and making sure that the work that they’re doing does not negatively impact Black communities and communities of color. If you develop that trust, you can lean on your team. You can do things like, for instance, make sure that people are able to sustainably hand off on call, to make sure that people are not on call during their religious obligations. For instance, maybe you shouldn’t put the observant Jewish person on call on Friday nights, right? Maybe you shouldn’t put that parent on call when they’re also dealing with a crying baby, right?

These are all things that we can do if we trust each other to hand off assignments and collaborate with each other. Really document your code. Make sure that people are working, not just with their current coworkers, but with their past and future coworkers with sufficient documentation. Share that knowledge to make sure that you are not a single point of failure and you can go on vacation and take that much-earned rest, that you can retire, that you can change jobs. Focus on using that same platform of technology to make sure that you’re speaking the same common language and having supportable code that is maintained, not just by you, but by a collaborative ecosystem, to make sure that we’re all speaking the same terminology when it comes to things like observability or Service Level Objectives or monitoring. Make sure that you are thanking people when they exhibit curiosity. Don’t say, “I can’t believe you didn’t know that.” Say, “Thank you for asking that question. It’s clear that we need to improve our documentation. Let’s work on that together.” But let’s talk a little bit more about that theme of learning from the past, because I think there’s one more strategic area that we need to talk about to achieve production excellence, and that is the idea of thinking about common patterns in our outages. They may not be exactly identical, but they definitely rhyme.

We need to target our efforts to improve reliability to make sure we’re having the best possible impact. We do that by conducting risk analysis and thinking about, what are the key areas that we might fail in and how can we mitigate those known failure modes? Of course, there’ll always be unknown failure modes. For instance, maybe a freak tornado will come along and knock over this bridge. But if we know that the bridge has holes in the roadbed and cars are falling through, that seems like a pretty high priority thing to fix, right? Or if we know that the bridge needs a seismic retrofit because it’s in an earthquake-prone area, maybe that’s something that we want to address, right? So we need to quantify our risks based on the frequency that they occur and the impact. How long do they last? How long does it take to detect them? How many people are impacted? And then we can understand, based on that, what we expect the impact upon our users to be over a long period of time, like a quarter or a year. We can figure out which risks are most significant. In particular, which risks are, on average, going to contribute to the greatest number of failures against our Service Level Objective?

If we’re allowed to have a thousand bad events per month and I know that our MySQL database going down is going to cause 500 bad events to be served over that month, that seems like a pretty high priority item because that’s half of our error budget and that’s just half from one cause that we know about and that doesn’t leave room for other causes and other causes that we may not even have predicted in advance. We really need to address those risks that endanger our Service Level Objectives that are, on average, going to cause more bad events than we can tolerate according to our SLO. Having that data really enables us to make the business case to fix them because it lets us say, it doesn’t matter how many new shiny features we ship because as long as we have this potential outage cause, we’re going to have our customers not trust the reliability and be unable to benefit from these features. It means that we really do have to commit to finishing that work, that we can’t just leave a trail of postmortems behind us that had action items that we committed to do and no one actually did, right? We have to prioritize and think about, what is the most impactful thing that I could do with my time devoted to reliability to ensure we’re able to meet our SLOs? So don’t waste your time on vanity projects. Really think about what’s going to move the needle and make for happy customers and make for happy developers.
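[Editor's sketch] The risk quantification described above (frequency, time to detect, duration, and share of users impacted, compared against the error budget) can be turned into a small ranking calculation. Everything named here is an assumption for illustration: the `Risk` fields, the simple multiplicative model, the traffic figure of 25 events per minute, and the example risks. The numbers are chosen so that the database risk reproduces the talk's case of 500 expected bad events against a budget of 1,000.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Risk:
    name: str
    incidents_per_year: float
    minutes_to_detect: float
    minutes_to_repair: float
    fraction_of_users_impacted: float  # between 0.0 and 1.0


def expected_bad_events_per_month(risk: Risk, events_per_minute: float) -> float:
    """Average bad events per month this risk is expected to cause."""
    outage_minutes = risk.minutes_to_detect + risk.minutes_to_repair
    incidents_per_month = risk.incidents_per_year / 12
    return (
        incidents_per_month
        * outage_minutes
        * risk.fraction_of_users_impacted
        * events_per_minute
    )


def rank_risks(
    risks: List[Risk], events_per_minute: float, monthly_error_budget: float
) -> List[Tuple[str, float, float]]:
    """Return (name, expected bad events, share of budget), worst first."""
    rows = []
    for risk in risks:
        bad = expected_bad_events_per_month(risk, events_per_minute)
        rows.append((risk.name, bad, bad / monthly_error_budget))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows


risks = [
    Risk("bad deploy", incidents_per_year=24, minutes_to_detect=2,
         minutes_to_repair=3, fraction_of_users_impacted=0.1),
    Risk("database outage", incidents_per_year=12, minutes_to_detect=5,
         minutes_to_repair=15, fraction_of_users_impacted=1.0),
]
for name, bad, share in rank_risks(risks, events_per_minute=25, monthly_error_budget=1000):
    print(f"{name}: {bad:.0f} expected bad events, {share:.0%} of the budget")
```

Note that time to detect counts toward the outage length, which is why poor observability inflates every row in a table like this.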

31:12

In closing, though, I want to say two things. First, if you have a lack of observability, that is a systemic risk that you really should address, because it adds time to every single outage in which you don’t know what’s happening, where you are not aware of what’s going on inside of your system and therefore you’re not detecting things and you’re not able to resolve your outages. It turns what could be a five or ten-minute outage into a two or three-hour outage. So make sure that you are not under-investing in observability, and really pay attention to whether or not your product developers can understand the impact of their code. Secondly, it’s really, really important to focus on collaboration and psychological safety in your organizations to make sure that you’re working together to reduce the amount of time, again, that it takes to detect and recover from outages, which you can only get through having people feel safe to speak up and having people feel like they can work together to constructively resolve those outages. You don’t have to be a hero to achieve a successful system, but I do recommend working and striving towards production excellence. It’s not just me who says that. The DevOps Research and Assessment (DORA) group also says that in order to achieve high software delivery performance, you have to have the ability to deploy quickly, to reduce the time it takes to deploy a change, to have a fast time to restore service, and to have very few changes that fail.
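As a rough illustration of the four delivery measures just listed, here is a minimal Python sketch that computes them from a list of deploy records. The `Deploy` record, its fields, the `delivery_metrics` function, and the sample data are all hypothetical; this is not how any particular survey or tool defines or collects these numbers.

```python
from dataclasses import dataclass
from statistics import median
from typing import Optional


@dataclass
class Deploy:
    lead_time_hours: float                    # from commit to running in production
    failed: bool                              # did this change cause a failure?
    hours_to_restore: Optional[float] = None  # only set for failed deploys


def delivery_metrics(deploys: list, days_observed: int) -> dict:
    """Summarize deploy frequency, lead time, change failure rate, and restore time."""
    if not deploys or days_observed <= 0:
        raise ValueError("need at least one deploy and a positive observation window")
    failures = [d for d in deploys if d.failed]
    return {
        "deploys_per_day": len(deploys) / days_observed,
        "median_lead_time_hours": median(d.lead_time_hours for d in deploys),
        "change_failure_rate": len(failures) / len(deploys),
        "median_hours_to_restore": (
            median(d.hours_to_restore for d in failures) if failures else None
        ),
    }


# Ten made-up deploys over five days, one of which failed and took 30 minutes to fix.
sample = [Deploy(lead_time_hours=h, failed=False) for h in (1, 2, 3, 4, 5, 1, 2, 3, 4)]
sample.append(Deploy(lead_time_hours=5, failed=True, hours_to_restore=0.5))

metrics = delivery_metrics(sample, days_observed=5)
for name, value in metrics.items():
    print(f"{name}: {value}")
```

Medians are used here instead of means so that one unusually slow deploy or long outage does not dominate the summary; that is a design choice of this sketch, not something stated in the talk.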

It’s true that a growing number of companies are succeeding in doing this. Over a fifth of us are managing to achieve elite performance, where we’re able to deploy on demand multiple times per day and have less than 2% of our code pushes cause a failure. But there’s still a majority of people who are failing to achieve high performance and are stuck in the middle to low performing category. I think the way forward there is really focusing on those automated deployment, production, and observability stories, and that requires us to help each other as a community through this. I know because I’ve helped build Honeycomb, right? It’s a dozen of us who are hands-on product engineers at Honeycomb, and we make our systems humane to run and we make other people’s systems humane to run as well, right? The way we do this is by providing an observability platform that ingests people’s telemetry and enables them to explore that data. We practice observability-driven development, and we’ve found that it’s helped us and it’s also helped many of our customers. Tooling can help, but I think it has to be really coupled with that culture change, that you really need that production excellence culture in order to achieve a high performing software delivery team. So measure how your customers are behaving and what experiences they’re seeing with SLOs, debug your systems with better observability, collaborate with your teammates, and really fix any structural issues that you have by doing risk analysis and planning and closing that feedback loop. Thank you very much. If you have any questions, I’ll be around in the Slack channel. Enjoy the last little bit of INS1GHTS. Take care, folks. Bye.


