Achieving Production Excellence at Scale


Summary:


Many teams are burned out from spending too much time debugging and fixing production issues instead of shipping new features—and they risk falling behind competitors that ship faster and manage production better. James Governor (Redmonk co-founder), Charity Majors (Honeycomb CTO and co-founder), and Liz Fong-Jones (Honeycomb Principal Developer Advocate) have collectively worked with thousands of technical teams who share similar pain points.

In this webinar, they will share what it takes to achieve production excellence and the best practices that lead to:

- Fewer production issues
- Faster debugging
- Faster release cycles
- Less developer frustration and burnout
- Happier customers
- Better business outcomes

Whether you’re a startup building new services from scratch or in a brownfield enterprise environment, this webinar offers expert advice on getting started and measuring the ROI of implementing modern software practices like progressive delivery, observability, and service-level objectives (SLOs).

Transcript

Liz Fong-Jones [Principal Developer Advocate|Honeycomb]: 

Let’s get going. I’m really excited about today’s program, particularly about our panelists today: Charity Majors and James Governor. Before we start, we need to cover a couple of housekeeping items to keep in mind throughout today’s presentation. First of all, this presentation is being recorded. After we complete the content and the recording is processed, if you’ve registered to attend, you will receive a link to watch the recording. 

If there’s anything you missed or would like to refer back to, simply click that link from your email to watch this webinar on-demand. Next, we will be taking questions at the end of today’s presentation content. You can ask your questions at any time by entering them into the Q and A box at the bottom of your screen. We encourage you to ask questions as they come to you, and we should have plenty of time to address your questions in 15 minutes of Q&A at the end. 

Lastly, today, for accessibility, we’re joined by Kimberly from Breaking Barriers who will be providing live captions throughout the webinar. To follow along with the live captions, just select the CC caption or live transcript button in Zoom, or you can use the link we’re just posting now in the Zoom chat. You will see there’s the option to get the stream in a separate browser tab. So, with that, that’s it for housekeeping. Let’s go ahead and start today’s webinar on Achieving Production Excellence at Scale. 

I’m your host, Liz Fong-Jones, and I’m a Principal Developer Advocate here at Honeycomb. I’ve worked for over 15 years in the industry, including spending 11 years as a Site Reliability Engineer at Google. Charity, why don’t you introduce yourself?

Charity Majors [CTO & Co-founder|Honeycomb]: 

I am Charity Majors, Co-founder and CTO at Honeycomb. I am mipsytipsy on Twitter. And, James, you’re here? 

James Governor [Co-founder & Analyst|Redmonk]: 

Yeah. So I am James Governor. I am an Analyst and Co-founder of a company called Redmonk. Basically, we spend our time trying to understand the decisions that developers, engineers, and practitioners are making so that our clients can do a better job of serving them. Our basic thesis is that developers and engineers are increasingly the most important constituency in decision-making about engineering process and tooling. So that is who I am.

Liz Fong-Jones: 

Excellent. Well, thank you for introducing yourselves, James and Charity. So I want to talk about the theme of today’s discussion: What is the challenge of achieving production excellence and running at scale? We want to emphasize “at scale” because not every company is a series A or series B startup. There are a lot of large enterprises out there that are realizing the benefits of production excellence and observability. So that’s the theme I want us to center on today. With that, I wanted to lead off with a question to James. So, James, you’ve been speaking a lot recently about what progressive delivery is. 

James Governor: 

Yes. 

Liz Fong-Jones: 

How does that relate to production excellence? What strengthens that relationship? 

James Governor: 

Great question. So, yeah, essentially, progressive delivery: it might be worth saying what that is, how we got to using that as a term, and then talking a little more broadly about context. Essentially, when I was looking at the landscape in terms of delivery of software by high-performing teams and the kinds of aspirations we saw from other organizations, it was pretty clear that one of the things the high-performing teams were doing was taking a more experimental approach, because they were more confident in their infrastructure, and they were able to do some interesting things. 

So if we think about canarying, blue-green deployments, the ability to do dark launches, perhaps to turn feature flags on and off, and to roll back if necessary, teams could be a bit more experimental in terms of development, but, as I said, with safety. It really came to me that it used to be that in IT we did not have enough resources. So we would have, you know, literally, a separate environment for testing, for development, for QA, and for production, and in all of them, we were constrained in terms of resources. 

Well, now we live in a cloud world, a world of abundance. So what are the possibilities and opportunities that are afforded by that? A world of cloud abundance and network abundance, combined with the fact that we have this sophisticated network and can route traffic accordingly, means we can deliver a new experience, perhaps, to some constituents. As I say, we could roll something out darkly. I just think it needed a name. 

I’m not saying that continuous integration, continuous delivery is not still relevant, but there are a broader set of disciplines. I wanted to point that out. Normally, at my firm, we don’t make up new terms, but I just said this, and a lot of people seemed to resonate with it, so I’m running with it. 

5:15

Charity Majors: 

Over the last couple of years, I found myself saying it all the time. You know, all this stuff about deploying that is not on/off switches, canarying and rolling out in waves, and all this stuff. That’s a very awkward thing to say. So I think it really is a suite of tools that was looking for an umbrella term. I also feel like there’s just been this broader gravitational shift over the last five years from investing all of our resources in pre-production stuff to investing in, you know, production stuff, like real-time. The toolset really didn’t exist until five years ago, which is when Honeycomb was started. LaunchDarkly was started. You know, Gremlin was started. 

There’s a bunch of startups now for developer tools focused on giving you very fine-grained visibility and the ability to take a scalpel to production instead of just kind of like shipping it all and crossing your fingers and looking at big aggregate graphs. 

James Governor: 

See, I love that. This is one of the debates in the industry. We spent a bit of time on that at Redmonk. A lot of people will say, Oh, no! X is a culture change. SRE is a culture change or DevOps is a culture change or perhaps observability is a culture change. People always say these things. Why are these people trying to sell you tools when, in fact, it’s a culture change? 

Charity Majors: 

But it’s a sociotechnical system. Tools are a big part of that. It’s not just culture. Culture reinforces tools reinforces culture reinforces… it’s almost like the tools are how you change culture. 

Liz Fong-Jones: 

I think that gets quite nicely toward what we are getting at: the distinction between progressive delivery as a culture thing versus a tools thing, and production excellence, also, as a culture thing first and foremost. 

James Governor: 

That’s right. This gets to a conversation that I’ve had a few times. You know, there is this, I want to be slightly careful not to press Charity’s buttons too much, but there are people in this industry that will say you should never ship code on a Friday. You should never make production changes. You should never roll out that application on a Friday afternoon. 

And I always felt that I could understand why, because people are afraid and they’re not in a position where they have production excellence. They’re not in a position where they have tools to enable them to do that confidently. They have not done sufficient testing. They don’t have the processes and people and tools in place where they can do that. 

So I really, from my perspective, when I was talking about progressive delivery, it was very clear that in order to do something like this, you have to have a lot of confidence in how you’re delivering software. That’s where this production excellence came in. When you started using the term, I was excited by that because I think it’s aspirational. This is where we need to go. 

I’m not saying to everybody, Oh, you need to be able to do, you know, AB testing of an application overnight and be prepared to ship at any time. On the other hand, an organization that has confidence and a level of safety in terms of knowing that they can fix things, indeed, defines production excellence. So to me, I think progressive delivery and production excellence go together really well. I don’t know if that definition of production excellence works for you, Charity. 

Charity Majors: 

Yeah. I mean, it’s all about being able to ship swiftly and with confidence. That doesn’t mean you’re confident nothing will go wrong, because sometimes there’s just absolutely no way to gain that confidence; you know, production is production, and nothing else is production. But it means that you have confidence that you can find it swiftly, that it won’t impact your users, and that you can fix it swiftly. You know, if the time between when you write code and when it’s live is on the order of hours, well, you’re going to be pretty scared to ship things, and you’re going to put a lot of effort into making sure it’s not going to do anything. 

But if it’s minutes and you have the ability to ship it behind a feature flag, if you can decouple releasing and deploying, the speed at which you can move goes up and the terror of the changes goes down, because you can do it in a so much more controlled manner. 
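
To make the deploy-versus-release decoupling concrete, here is a minimal sketch; the in-memory flag store and the function names are illustrative stand-ins for whatever feature-flag SDK (LaunchDarkly or similar) you would actually use:

```python
# Minimal sketch of decoupling deploy from release with a feature flag.
# The in-memory FLAGS dict stands in for a real feature-flag SDK.
FLAGS = {"new-payment-service": False}   # deployed dark; flip to True to "release"

def charge_via_legacy_service(cart):
    return {"path": "legacy", "total": sum(cart)}

def charge_via_new_service(cart):
    return {"path": "new", "total": sum(cart)}

def checkout(cart):
    # The new code path already lives in production; the release decision
    # happens at runtime, and rolling back is just flipping the flag off.
    if FLAGS["new-payment-service"]:
        return charge_via_new_service(cart)
    return charge_via_legacy_service(cart)

print(checkout([10, 25]))   # -> legacy path until the flag is flipped
```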

James Governor: 

I think that is part of it as well, with production excellence, you know that you’re in a position because what matters is the customer experience. I think this is one of the really important things. Like Charity says, it doesn’t matter if you have five nines if your customer is unhappy because their experience sucks. In particular, it may be a high-value customer. 

10:17

Charity Majors: 

And this is where observability comes in, too, because in the days of, you know, aggregate metrics, your Prometheus and whatnot, which are great tools, they answer the question of: Is my infrastructure healthy or not? They don’t answer the question of: Is this customer’s experience good? There’s no top 10 list in the world that can do that for you, because it may be customer number 576 in terms of load that is the highest value by dollars or whatever. You cannot predict in advance which one you’re going to need to break down by or search by. So having the ability to slice and dice high-dimensionality, high-cardinality data is really key. 

Liz Fong-Jones: 

I think that gets us to the point of how do you measure what a good customer experience is? I think that’s where SLOs and observability help. For people who don’t know what SLOs are, maybe we can spend some time on that. 

Charity Majors: 

Service level objectives. Service level indicators or something. I don’t remember how that works, honestly. But the idea is that you have a budget, from talking to your customers and figuring out… you don’t want to aim as high as you possibly can, which is, I think, what a lot of us naively start out doing. You actually want it to be right there on the bubble between: will my customers notice and care, or not?

You want to have a budget for yourselves so you can sleep through the night if it’s not supercritical. You want a budget for yourself so you can experiment and move quickly. You want to have that budget because that actually, you know, leads… it’s like every engineering team has two constituencies. You have the users and you have your engineers. Engineers’ quality of life, engineers’ ability to maneuver matters too. The right SLO balances both needs. 

In the bad ol’ days, we were paging ourselves on symptoms, paging ourselves on every disk alert. God help me if I ever see another CPU load average page again. So we were getting paged all the time, right? And we spent so much time curating all the things that were alerting us and trying to find the right level on the dial that would not be too sensitive and page all the time, but would still give us advance warning when something was about to go wrong. 

It was an impossible job, and it’s become seriously impossible. SLOs are great because they replace… we’ve heard from customers who ended up deleting all of their alerts, replacing them with SLOs. They ended up having, like, 90% fewer page alerts. You end up aligning alerts with: Are our customers in pain or not? This gives you a lot more room to maneuver. 
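
As a rough illustration of the “budget” idea, here is the arithmetic for a hypothetical 99.9% SLO over a 30-day window; the traffic and failure numbers are made up, but the shape of the calculation is standard:

```python
# Error-budget arithmetic for an illustrative 99.9% SLO over a 30-day window.
slo_target = 0.999
window_minutes = 30 * 24 * 60                        # 43,200 minutes in the window
budget_minutes = window_minutes * (1 - slo_target)   # ~43 minutes of allowed "badness"

total_requests = 10_000_000                          # hypothetical traffic for the window
failed_requests = 4_200                              # hypothetical failed requests so far

budget_requests = total_requests * (1 - slo_target)  # 10,000 failures allowed
budget_consumed = failed_requests / budget_requests  # 0.42 -> 42% of the budget burned

print(f"Allowed bad minutes per 30 days: {budget_minutes:.0f}")
print(f"Error budget consumed so far: {budget_consumed:.0%}")
```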

James Governor: 

I think the “aligning” word is so important. I guess we’re certainly not in the era of ITIL anymore. Thinking about how we align the incentives, how the customer is feeling, and how the service provider or the enterprise, perhaps, is feeling, and then, I think, Charity, the important point you made, which is: how do we manage what it feels like to support those customers? 

So definitely, even if there’s a lot of user pain, we don’t want to be burning out our engineers. And I think that sense of budget is just so important in thinking about: okay, we should be confident in making changes, but how do we make sure that we have psychological safety? 

Charity Majors: 

Yeah. 

Liz Fong-Jones: 

Yeah. I think another interesting thing to think about here, though, to ask a devil’s advocate question: you have SLOs, they’re wonderful, but if your SLO goes sour, how do you actually debug the thing? That’s kind of a challenge I frequently encounter when I talk to people about SLOs. How do you combat that mentality? 

Charity Majors: 

Well, you need to have good tooling. Like, with Honeycomb, we have this thing called BubbleUp where, if an SLO fires, you know, you can see why, because we have the thing where you can select a region. Any spike or whatever, you’re like, hmm, this is weird. You just select it, and then we precompute all the dimensions inside the bubble that you selected and outside of it, and we diff them. And you can see which one, or however many, things are different about the selected spike versus the baseline. So you can see pretty quickly: Oh, I see. All of those errors are coming from requests that are erroring at this service or this request ID or this user ID or this endpoint or any combination of the above. 
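
To make that mechanism concrete: the idea is to compare, for every attribute, the distribution of values inside the selected region against the baseline outside it, and surface the attributes whose values are most over-represented in the selection. A toy sketch of that comparison (not Honeycomb’s actual implementation):

```python
from collections import Counter

def top_differing_attributes(selected, baseline, k=3):
    """Compare each attribute's value distribution in the selected events
    vs. the baseline, and rank values by how over-represented they are."""
    results = []
    attrs = {key for event in selected for key in event}
    for attr in attrs:
        sel_counts = Counter(e.get(attr) for e in selected)
        base_counts = Counter(e.get(attr) for e in baseline)
        for value, count in sel_counts.items():
            sel_frac = count / len(selected)
            base_frac = base_counts.get(value, 0) / max(len(baseline), 1)
            results.append((sel_frac - base_frac, attr, value))
    return sorted(results, reverse=True)[:k]

selected = [{"endpoint": "/payments", "user_id": "42"}] * 9 + [{"endpoint": "/home", "user_id": "7"}]
baseline = [{"endpoint": "/home", "user_id": str(i)} for i in range(90)] + \
           [{"endpoint": "/payments", "user_id": "42"}] * 10
print(top_differing_attributes(selected, baseline))
# -> /payments and user_id=42 dominate the selected spike relative to the baseline
```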

It is really challenging. For as long as there have been ops teams, our debugging has always been incredibly unscientific. We have all these dashboards and graphs. We’ve experienced so many outages, and we have a lot of scar tissue. When the pager goes off, we tend to see which alarms are firing, what the dashboards look like. Then we jump to a conclusion. Ah, this reminds me. It must be this. Then we look for evidence that we’re right. If that doesn’t work, we look for another hypothesis. It’s probably this. Then we look for evidence that we were right. That worked well in days when failures were predictable. 

Liz Fong-Jones: 

So you’re basically talking about confirmation bias as opposed to being scientific. James, I don’t know. What do you think?

Charity Majors: 

If you didn’t have enough context or this scar tissue, you would be lost. God knows how long it would take you to figure it out. With observability, we’re trying to shift this to a much more scientific, predictable, follow-the-trail-of-breadcrumbs-to-the-answer approach, no matter how familiar you are with the system. 

16:37

James Governor: 

We’re always implementing and reimplementing, rinsing and repeating. Then we have a new set of environments that need to be managed, but the economics are constantly changing. And I think one of the sort of underpinnings, how do we do… it’s one thing to say, here are new tools that are effective, but one of the essential questions is: What are the economics that enable those tools? When we think about data, one of the things that excites me about observability, and one of the challenges, is: Do I need to store every single metric? Okay, if I do that, there are storage buckets that are cheap now, but then can I analyze that data once I’ve stored it? 

I think the economics, for me, is one of the big questions in observability. So, like, from Honeycomb’s perspective and maybe from an enterprise perspective, they’re like, We’ve been paying a fortune for logging already. I don’t really think we can do this, Charity. What is the answer to that? 

Charity Majors: 

I mean, one part of the answer is this whole metrics, logs, and traces thing. This is why I find it so painful and so misleading to users: because if you follow that model, you are paying to store your data at least three times in three different ways. Right? But the thing is, all three of those can be derived from the source of truth, which is an arbitrarily wide structured data blob. You can’t go in the other direction. If you store your data once in the observability way, you can get all those other answers. You can get traces. You can get your metrics and aggregates and your logs. 

But if you store them as metrics, logs, and traces, you can’t go the other way. It’s like asymmetric cryptography. You can’t go in that direction. And so, yeah, you shouldn’t just add another one. I get why people don’t want to store their data in four ways. The point is, if you start moving to observability, we hear this from customers all the time: the way they use their other tools starts withering off and dying. They’re not as useful. Honestly, it’s worse than just having to store it three times, because those three data sources probably don’t agree with each other, and you’re probably copying and pasting IDs from one place to another to see the same spike in two or three different systems. 

Usually, you get paged. You look at the data. You’ve got your aggregates and your logs. There’s no way to correlate them, so you jump over to the logging tool and match up timestamps to see if it’s the same thing or not. It’s just a fucking mess. It’s worse than expensive and wasteful. It’s not accurate. Right? These ways of using data should be two sides of the same coin. You should be able to flip back and forth. You should be able to ask the question and then visualize it as a trace over time. And then flip back to, you know, asking questions like: How many people? Here is the problem in the trace; how many people are impacted? 
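
One way to picture “store it once and derive the rest”: a single arbitrarily wide, structured event per request carries enough context to be read as a log line, aggregated into metrics, or stitched into a trace, while a pre-aggregated metric can’t be turned back into the event. The field names here are illustrative:

```python
import json

# One wide, structured event per request (field names are illustrative).
event = {
    "timestamp": "2021-03-10T17:32:01Z",
    "service": "checkout",
    "trace.trace_id": "abc123",   # enough to stitch spans into a trace
    "trace.span_id": "def456",
    "trace.parent_id": "aaa111",
    "endpoint": "/payments",
    "status_code": 500,
    "duration_ms": 812,
    "user_id": "576",
    "error": "card_declined",
}

# Read it as a log line: just serialize it.
log_line = json.dumps(event)

# Aggregate it into a metric: compute over many such events.
events = [event]
error_rate = sum(e["status_code"] >= 500 for e in events) / len(events)

# Reassemble a trace: group events by trace_id and link them via parent_id.
# Going the other way, from a pre-aggregated metric back to this event, is not possible.
print(log_line, error_rate)
```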

James Governor: 

What can you throw away? It’s counterintuitive. What if I throw away the wrong stuff and can’t get the answer? 

20:12

Charity Majors: 

Well, I think we all intuitively ask: what can you throw away? For most people who are not running at hyper-scale, you don’t have to throw anything away if you’re using the right tools. You don’t have to. Most people’s workloads are not that large, frankly. If you get to a point where, economically or by scale, it’s becoming cost-prohibitive, you can start to adopt some type of dynamic sampling. We know health checks from load balancers are not as valuable as, say, errors that are erroring out on /payments. Right? There’s a pretty different level of importance there.

The way Honeycomb does it is, you know, you can start setting rules either at the client side, or use this new thing that we just shipped that does some intelligent sampling for you on the service side. The way we do it is each record, each arbitrarily wide structured log line that comes in, has a number on it that represents the number of requests that this one represents. So we may say we’re discarding 99% of all of your load balancer requests, but then it actually does the math so the lines still look right, because it has the number to say, you know, this represents 100. Does that make sense? Liz is better at explaining this than I am. 

Liz Fong-Jones: 

Yeah. It’s a way of compressing your data. It’s a way of saying: for this highly duplicative data, we’re just going to keep one representative copy, even if it’s repeated a hundred times, a thousand times, but the things that are truly unique, the snowflakes, the errors, those are the things that you want to keep one for one. I think that’s what is really innovative about the product launch we’ve just done around a product called Refinery, which is: let’s help you distinguish the signal from the noise. Let’s help you collect 100% of your signal while only keeping 1% or 0.1% of your noise. I think that helps people optimize on cost and not treat all the data as equal. Because 99% of your data is garbage. Why are you paying to store all of it? You shouldn’t be. 

Charity Majors: 

There are things you can do. 200 OK requests to your root domain, requests like that that are valuable in aggregate but very, very rarely useful in specific: it turns out that these blunt, easy dynamic sampling rules can cut your costs by orders of magnitude very easily. 

Liz Fong-Jones: 

People in the past have been tempted to just say “I’m going to sample everything with a blunt hammer.” One for a hundred, one for a thousand. They miss important data. That’s what happens when you take an ax to a problem rather than a scalpel. 
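
A toy version of the weighted sampling described above, with illustrative rules (keep every error and everything on /payments, keep roughly 1 in 100 health checks) and a sample_rate field on each kept event so the aggregates still add up:

```python
import random

def sample_rate_for(event):
    """Illustrative rules: keep everything interesting, heavily sample the noise."""
    if event.get("status_code", 200) >= 500 or event["endpoint"] == "/payments":
        return 1          # keep 1-for-1: errors and high-value routes
    if event["endpoint"] == "/healthz":
        return 100        # keep roughly 1 in 100 health checks
    return 10             # default: keep roughly 1 in 10

def maybe_keep(event):
    rate = sample_rate_for(event)
    if random.randint(1, rate) == 1:
        event["sample_rate"] = rate   # "this one event represents `rate` requests"
        return event
    return None

kept = [e for e in (maybe_keep({"endpoint": "/healthz"}) for _ in range(10_000)) if e]
# Estimated original count = sum of sample_rate over the kept events, so graphs
# still "look right" even though ~99% of the health checks were dropped.
estimated_total = sum(e["sample_rate"] for e in kept)
print(len(kept), estimated_total)   # ~100 kept, estimate ~10,000
```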

James Governor: 

Can I say something? One thing about observability and these questions: we’ve been thinking and talking about observability in terms of troubleshooting. There’s an idea that we only use it when we’re in trouble. But, of course, observability should be something we use all of the time to better understand the system as it should behave, or as it does behave, so that we get better at knowing it. So is it just about troubleshooting, or is observability something we always need to be doing?

Charity Majors: 

These are your five senses. You know, the best engineers I’ve ever worked with are the ones that keep their IDE up, they keep their editor open, and they keep a tab open to production. You eat, you breathe, you smell, you’re looking at it every day. The reason I get a little bit punchy about defining observability and protecting the definition is that some of the best practices and ways of doing it are diametrically opposed to best practices for monitoring. 

One of the monitoring best practices is that you shouldn’t have to look at your graphs all the time. The system should just inform you when there’s something wrong, and then you go and investigate. With observability, it’s the opposite. It’s not about paging all the time. It is that you should be looking at it. Every time you write some code and merge it, you should go and look at it in production. Look at it through the lens of the instrumentation that you just wrote: is it doing what I want it to do, and does anything else look weird? Right there, when you have the original intent in your head, is a moment you can never recapture. That moment is when you’re best poised to find the most subtle bugs, the most subtle behavioral things, and it will never get any easier than this. It will never get faster or easier. 

You should always be looking at what you’ve done and validating yourself. Did it do what I wanted it to do? And asking questions. I feel like our tools, for so long, have punished us for curiosity. In ops, you know, the first time any software engineer goes on call, they have the experience of like, What is that? What’s going on? And somebody in ops will go, don’t pick up the rug, man. Don’t look under the rug. It’s a dark hole. You will find too many things. Which is true. We actually have people who start with Honeycomb, and they start rolling it out, and they start going, (gasps) there’s a bug. We’re like, Yeah, yeah. It’s been there forever. They say, We have to stop and fix it. And we’re like, you’re never going to get this rolled out if you stop to fix all the bugs that you find. Because you’ve just never had that kind of fine-grained telemetry. Yeah, there are so many things broken in your system right now that you have no idea about. 

Liz Fong-Jones: 

So which ones matter, right? I would argue the ones that matter are the ones that are impacting your service level objective, to tie this back to the earlier SLO conversation. 

26:24

Charity Majors: 

But I think the reason our systems are a coughed-up hairball of garbage is that we’re shipping new code every day to systems we’ve never understood and never been able to see; that’s why the bugs and weirdness keep accumulating. If you look at your code right after you’ve shipped it, you’re going to catch so many more of those subtle things, and your systems will become healthier, better understood, and freer of all of these. Yes, observability, the thing is, shouldn’t be one more tedious thing just tacked on to your list of to-dos or one more thing people are lecturing you about. 

Like, you become hungry for it. You start to crave it. Once you’ve tapped into this feedback loop, the dopamine of it, it’s like you get high on it, and you can’t imagine going back to a world where you can’t… engineers want to do a good job. We don’t enjoy shipping bugs. We want to do well. We’ve just never had the tools to really validate that we are. 

James Governor: 

For what it’s worth, I’ve literally just thought of this at this moment, so it may be a really bad idea. I do feel like, in a way, observability is kind of like being mindful about your operations. So, instead of just eating something and not thinking about it, it’s like appreciating it, so I can understand the system and what’s going on and when something is weird. That’s the thing that I think, to your point, ties into progressive delivery. Any change I make, I should be looking at it and understanding the implications of the change. It’s not just something I should do. It’s something I should understand. 

Liz Fong-Jones: 

And it’s a continued practice. We have to keep working on it. It’s not: Oh, we have observability, we don’t have to keep working. Observability is something you do. It’s not something you have done. 

James Governor: 

So on that note, can I just take my legacy system and press the observability button? 

(Laughter) 

Liz Fong-Jones:

Charity and I joke about this: you have to be this tall to ride. If you’re not shipping code continuously, at least once a month, you have some groundwork to do first. You don’t get that feedback loop if you wait six months between when you write the code and when it reaches production. You have to have the feedback loop to really benefit. 

Charity Majors: 

Yeah. I would say that it is easier than most people think it will be and more complicated. (Laughter). You know, all of these things do kind of go together. Right? The observability and progressive delivery and putting developers on call for their own services. Even to the extent of chaos engineering, they’re all part of that gravitational shift I was talking about. 

I like the way you put that, to be more present and mindful with your systems, really tasting and feeling the changes that you make. And, you know, when you first start doing that, it might not always be pleasant. Right? But it is the only way, I believe, to build systems that are humane to their developers and their users, and it is very rewarding, I think, when you start making those changes. But, no, you can’t just push a button. It is a process. It is not one of those things where, like with tracing in the early days, you have to instrument your entire system and put a ton of work in before you get value back. It’s not like that. You get value back every step of the way. 

Anytime you write the code and ship it to production, anytime you improve your delivery system so that you have a little bit more control over it, anytime you decouple your releases from your deploys, anytime you add a feature flag, anytime you add observability, it’s rewarding. 

30:45

James Governor: 

Whether we’re talking about brownfield or greenfield, sometimes I like to look at it as the new code versus the glue code: we’re integrating with legacy systems, tons of integration, as opposed to this being a new system. Sometimes it’s different teams. In terms of the getting-started question, I think with the organizations I’ve seen that have been successful, that have been adopting new practices, generally, you’ve got to start somewhere. Quite often, it’s with a new development where you have a team that comes together that has some of the new skills, they use some of the new tools, and then they can begin to show the rest of the organization: Hey, look! This is the aspiration, production excellence. We’re not saying you need to do this now, but this is a journey you could go on. 

And, by the way, have you seen this tool? These are some of the things, you know, new approaches you could take. I think for me, with all of these things, DevOps, SRE, there’s no switch where the organization is suddenly doing all of it, but if you can find some good problems, perhaps high-value customer problems, new applications, you can begin to show the rest of the organization the art of the possible. 

Charity Majors:

Yeah. You can begin to create some hunger for it, too, because it is really hard to go back once you’ve experienced it. One of the main things that comes up for me when I’m thinking about this is continuous deployment. I feel like it’s 2021. I can’t believe this is still a question, you know, because the interval between when you write the code and when it’s live is fundamental to a high-performing team; you want it to be as short as possible. When it’s on the order of minutes, then you get to physically hook into the parts of the human nervous system that, you know, handle dopamine and serotonin and everything, motivation. You can literally create, like, the feeling in your body that you need to go look at production. That’s so powerful. That’s the only way to really create an environment where you can give software engineers ownership over their code: one where you’re shipping one diff, one set of merges, at a time. 

Once your lead time is a number of hours, I guarantee you’re not doing that. You’re mushing together many engineers’ changes and shipping them together, and you can’t have ownership over that, because when it breaks, well, whoever deployed needs to figure out which one of these changes broke it and start dissecting it and pulling engineers into it. 

Liz Fong-Jones: 

There’s a workaround though. You can run feature flags. As long as you can turn them on one at a time, it’s less of a problem. 

Charity Majors: 

Don’t tell them that, Liz. 

Liz Fong-Jones: 

I’m trying to help them be pragmatic. People are like, oh my God I have so many developers. I can’t possibly do one build per artifact.

James Governor: 

To be fair, if Liz is going to bring up feature flags, one of the things that’s interesting, if we look at all of this, progressive delivery, feature flags are part of that. Some of the most interesting integrations and projects and tech talks and stories, frankly, that I have seen in the past maybe 12 months or so have been about where feature flags meet observability. So I get where you’re coming from, Charity, responsibility is so important, but I think feature flags are really great. Well, I don’t know. Not everybody likes chocolate and peanut butter, but I do. And I think feature flags and observability go well together. These are the changes we’re making, and these are the implications for the system before we roll out more broadly. 

Charity Majors: 

Yeah. They amplify each other. They make things more powerful. If you have observability, then you have the ability to break down by high-cardinality dimensions, like flags. Right? You can see: are these errors only coming from instances where the flag is enabled? That’s really great to be able to see at a glance. 
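
As one concrete pattern for that chocolate-and-peanut-butter combination, you can record the flag state on the request’s telemetry so errors can later be broken down by it. A sketch using the OpenTelemetry Python API; the flag lookup and the attribute names are illustrative:

```python
from opentelemetry import trace

tracer = trace.get_tracer("checkout")

def flag_enabled(name: str, user_id: str) -> bool:
    # Stand-in for a real feature-flag SDK lookup.
    return hash((name, user_id)) % 10 == 0

def handle_checkout(user_id: str) -> None:
    with tracer.start_as_current_span("checkout") as span:
        new_flow = flag_enabled("new-payment-service", user_id)
        # Record the flag state (and the user) on the span, so later you can ask:
        # "are these errors only coming from requests where the flag was on?"
        span.set_attribute("app.flag.new_payment_service", new_flow)
        span.set_attribute("app.user_id", user_id)
        # ... the rest of the checkout logic, old or new path, goes here ...
```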

35:30

Liz Fong-Jones: 

Yeah. I think the other interesting thing to talk about here is, when we talk about proofs of concept, like, when you talk about prototyping projects, cloud migration is a natural place where you want to adopt the right new stuff in the cloud. I think, also, do it incrementally, service by service. Do it one service at a time. Like distributed tracing: you don’t have to adopt it across your entire org at once. You can wrap a client call and say, I don’t have instrumentation of the server that’s underlying the client, but I can measure the time we are spending on this client call. I think building up the pilot can be helpful, as in the sketch below. 
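
The “wrap a client call” idea can look something like the following with OpenTelemetry, assuming the downstream server isn’t instrumented yet; the URL and attribute names are illustrative, and you still get the client-side timing and status for that hop:

```python
import urllib.request

from opentelemetry import trace

tracer = trace.get_tracer("inventory-client")

def fetch_inventory(item_id: str) -> bytes:
    # The downstream service has no instrumentation yet, but wrapping the
    # client call still captures how long *we* spend waiting on it.
    with tracer.start_as_current_span("inventory.lookup") as span:
        span.set_attribute("app.item_id", item_id)
        url = f"https://inventory.internal.example/items/{item_id}"  # illustrative
        with urllib.request.urlopen(url, timeout=2) as resp:
            span.set_attribute("http.status_code", resp.status)
            return resp.read()
```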

James, what is the business outcome I should be driving towards? How do I know the pilot is working? 

James Governor: 

What does “good” look like? What is the return on the investment? There is a broad set of questions here. I really like the way that Spotify thinks about this stuff. When you talk to them, everything is in terms of engineers, or FTEs, full-time employees. So, literally, everything they do, any changes they make, any automation they build, any new approach that they take is all about: will we be able to hire more engineers because we saved X money? If it is going to take 10 engineers, or four engineers, three months to do an automation, you had better understand the implications in terms of what the benefits are going to be. So for them, I think they really take this engineer-led approach, and I think engineering is the currency. I do think that ROI is hard, but you can begin to help the organization understand, you know: here we were able to roll something out, and it took fewer people to manage, there were, you know, three people on the team, or here were the implications in terms of what we were investing in SRE. 

I think ROI is always difficult, unless, of course, hey, we rolled out the new service and saw a massive benefit, but I think it’s going to be about savings in engineering. Further to that, not just cost savings, but, like, are our engineers happier? Are we getting more people sending in their resumes because they want to join us because they heard the team is crushing it? Are other people within the organization saying, Oh my God! I want to work in that part of the org because that looks like the fun part? I think focusing on engineering costs is a good one. Obviously, we have the cost of infrastructure. We need to take that into account, but, increasingly, I think the Spotify insight around FTEs is super valuable. 

Charity Majors: 

Super great. I’m really excited to hear about this from Spotify. I hope it catches on, because I think we’re really bad at valuing our time. The problem with software engineers is they’re just like, Oh, I will just write some code. I, personally, am a very mediocre software engineer, and I see this as a great strength that I have, because if I write some code, you know God damn well it needs to be written. 

Liz Fong-Jones: 

Right. Like Intercom says, run less software. Right? And it’s 100% true. You should run less software. 

James Governor: 

I think that’s exactly right. Just in terms of production excellence, I think it’s really interesting in terms of all these conversations, which are about service ownership. It’s hard to do ROI when you don’t have clear service ownership. And the FT has done some work recently that would terrify the heck out of a lot of enterprise organizations: we’re going to turn off any infrastructure component that doesn’t have a service owner. Literally, if we don’t know who owns this thing, it has to be turned off. As sort of a discipline, I think that’s really, really interesting. And, of course, observability is going to help you there. Let’s understand the implications of what we’re doing. 

But, yeah, they’ve said, If there’s not a product owner, there shouldn’t be infrastructure that we’re paying for. And talking about paying for things, Charity, shouldn’t you be selling something? What should people pay for? What’s the thing? 

Charity Majors: 

What should people pay for? I mean, they should pay for anything that isn’t their core business model. Right? They should be paying. I almost think distraction is a greater cost than engineering itself. Right? Your business is bound by the number of engineering cycles that you have. And you should spend as much as you can on your core business model. 

40:30

Liz Fong-Jones: 

Yeah. And I think, to that point, you don’t have to develop an entire observability engineering organization. Right? Your observability effort should be focused on integrating something that someone else has developed, not on writing your own observability platform. 

Charity Majors: 

If you don’t have an observability team, you probably should have one, but what they should be doing is sitting between you and the vendor and acting as a service org for the rest of your engineers. Right? They can write libraries for standardization. They can, you know, help train people on service ownership. There’s a lot that the observability org can and should be doing; writing an observability stack from scratch is probably not it. 

Liz Fong-Jones: 

Yeah. So the good news is Honeycomb has a new offering: Honeycomb Enterprise, including quick-start packages. You can try the Honeycomb Enterprise product for 30 days for free. That includes Service Level Objectives, Refinery support for sampling, and Secure Tenancy if you have data governance requirements. We’re here to help you so you don’t have to reinvent this stuff yourself. You don’t have to figure it out on your own. We want to be behind you 100% of the way on your observability journey.

Charity Majors: 

I feel that, with startups, it’s like we should feel like one of your teams. Right? At a big company like Facebook, each team, I felt, when they really worked together, kind of felt like a startup. That was your service’s startup. I feel like the flip side of that is, you know, when you’re a smaller company, startups should feel like teams of your engineering org. You know, it’s like the API between us should be similar. 

James Governor: 

So one of the things I like about this: I’m regularly asked, How do you get started with observability? One of the points I make is that the tools are there. So, you know, your observability team could be, we need to hire people with an understanding of what that is, but the tools are there, and you should play with them. 

The easier it is to start taking advantage of toolsets to gain insight into your systems and be more mindful about how you manage them, the better. Like, the thing now is, if anybody’s not giving you the tool to play with, what do we want to be doing? A 12- to 18-month RFP with slide decks? 

Charity Majors: 

Right. 

James Governor: 

Accessibility is important, and that’s what you focused on with this launch. 

Liz Fong-Jones: 

Awesome. So we now have 15 minutes to take questions from the audience. A couple have been queuing up as we’ve been having this discussion. Let’s look at this question from James Ably. How do you think observability and continuous delivery tie in with resilience engineering? 

Charity Majors: 

Well, you know, I think resilience engineering is sort of an academic way of looking at the outputs and the results of these things, and observability and continuous delivery are ways of achieving resilience. It’s not about making your systems, you know, break as little as possible. It’s about making it so that they can survive, and your users still won’t notice, even if they break in lots and lots of ways. 

Liz Fong-Jones: 

Yeah. 

Charity Majors: 

And we have so much science by now that shows that speed is safety when it comes to software, which is very counterintuitive to us because, as humans, when we feel unsafe, we freeze up, and we get slow. Right? That’s how we try to achieve more control, but the physics of software are very different. The physics of software are like riding a bicycle or being a shark, where if you slow down too much, you’re going to wobble and fall over. 

Achieving some smooth speed, shipping in small diffs as often as possible, is how you get there. Right? Because if you can’t be resilient to the changes you, yourself, are inflicting on your system, what hope do you have of, you know, surviving the changes that Mother Nature is going to inflict upon you? 

45:13

James Governor:

There are some physics to that. If you’re driving on ice, the last thing you want to do is slam your foot on the brakes. You have to drive into it a little bit. As you say, be very, very delicate, make small changes, understand the implications of those changes. 

Charity Majors: 

And look at them. The whole dev-and-ops divide was so pernicious because it meant, you know, the people with the original intent in their heads never looked at the output, and the people who were looking at it didn’t have the original intent. This is how our systems got to be so bad. This is why Dilbert exists. Right? We’re so cynical about how working with software just has to be so terrible. It actually doesn’t. It doesn’t have to be that terrible. 

Liz Fong-Jones: 

Yeah, but the way that you get toward better resilience is having better observability. Right? And specifically, when we talk about resilience engineering as defined by chaos engineering and doing different experiments: if you don’t have the ability to test your hypothesis, you’re just creating chaos. You’re not doing chaos engineering. You’re just doing chaos. 

Charity Majors: 

Specifically, if you don’t have the ability to… as I define observability, with high cardinality, if you’re just using a Prometheus, I love it, but you’re just injecting chaos into your system, because you don’t have the ability to look at those fine-grained outputs. 

Liz Fong-Jones: 

Like, who was impacted by our experiment?

The next question we have: someone is asking, how does progressive delivery work not at scale, such that you don’t have enough traffic to regularly exercise all of your paths? 

Charity Majors: 

Well, every system is different, right? Your wonderful sociotechnical system is a snowflake, just like everyone else’s. At the end of the day, you have to understand your systems. I can’t give you an answer that will work for your systems. You have to find that answer yourselves. If there’s an input that only gets hit once per day, then, you know, your progressive delivery system needs to take that into account. You know the footprint of your traffic, and you know what your objectives are. 

Maybe that means you inject some false traffic under a different namespace or something, but using the production system, so, at the end of each deploy, you actually inject a couple of those records. Maybe it means that you canary for a day. It depends on what you’re trying to achieve there. 

Liz Fong-Jones: 

Excellent. Thank you for that answer, Charity. 

Charity Majors: 

Sure. 

Liz Fong-Jones: 

We have another question. How does observability help API-only or serverless products, where there’s no kind of server, right? There’s no service.

Charity Majors: 

James, did you have an answer to the last one first? 

James Governor: 

Yeah. Sorry, one of the things we do need to be thinking about: one of the things about production excellence is that we’re moving to the idea of running a platform. In running a platform, as Charity said, let’s look at what we need to do to run the platform as a business and what we need to be outsourcing. So choosing platform services that we can work with to get a better sense of things. Basically, if you’re going to be doing something like progressive delivery, that’s something you need an architecture for. You do need to be doing some work in order to think about what the system is going to look like, and, you know, if you’re going to have two zones where you roll two things out, what does that look like? What does it mean to have replication of cloud-based infrastructure?

I think you want to be thinking about it from a platform perspective, and certainly, as you scale up, that’s going to become even more the case. Things like feature flags, you could end up with huge feature flag sprawl, and that’s potentially a problem. Yeah. It’s not something you want to undertake lightly. You want to be making investments in approach and tooling and platform before you do that. Sorry. And then the API question, so, Charity, I always think that is interesting. How do we understand things if we’re talking about unknown unknowns and stuff?

Charity Majors: 

Yeah.

James Governor: 

When we have lambda functions, we have even less. 

50:12

Charity Majors: 

I love this question because I think it gets right to the heart of what is different about observability, which is that I often tell people, when you’re instrumenting your code, pretend it’s serverless. Looking at the code executing on whatever system it’s executing on, you can instrument the same way on lambda functions as you can on your own containers or whatever. It’s all about what’s happening, to who, you know, are there shopping cart IDs or whatever. You just throw them onto the blob, and it works in exactly the same way. I think we’ve gotten used to having all these agents and magical stuff, which is wrong. It’s a crutch. It gives you the wrong perspective into your stuff. I don’t like the sidecar model because the first-person perspective is so core to it.  

Liz Fong-Jones: 

Yeah. We should be thinking about the request as the unit of abstraction, not the machine. We’ve been misled by all these tools that say you should be measuring the CPU. No. That’s backward. When you have a Lambda serverless project, that’s the purest expression of: yes, instrument your code and you will get observability. And you don’t have to care about the CPU, because that’s what Lambda is already taking care of for you. 

Charity Majors: 

Right. There’s a level of abstraction here that you’re aiming for, where the application engineers shouldn’t have to be ops people, shouldn’t have to be infrastructure engineers, you know. Yes, there are a couple of things that you probably care about. You care about: when you shipped your code, did the memory usage triple? Right? You care about that very high, blunt level; but you shouldn’t have to give a shit about all the stuff under /proc or all of the networking firmware and counters and all that stuff. You may not have that information, but you don’t need it. Right? Your infrastructure team, whether that means Amazon or your own, right, should care about that.
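
A sketch of the “pretend it’s serverless” instrumentation style: a Lambda-style handler that builds one wide event per invocation out of business context (cart IDs, user IDs) rather than anything about the host. The emit function and field names are illustrative stand-ins for a real SDK:

```python
import json
import time

def emit(event: dict) -> None:
    # Stand-in for sending the event to your observability backend via an SDK.
    print(json.dumps(event))

def handler(request: dict, context=None) -> dict:
    start = time.monotonic()
    event = {
        "name": "checkout.process",
        "cart_id": request.get("cart_id"),          # business context,
        "user_id": request.get("user_id"),          # not host metrics
        "item_count": len(request.get("items", [])),
    }
    try:
        event["status"] = "ok"
        return {"statusCode": 200}
    except Exception as exc:
        event["status"] = "error"
        event["error"] = str(exc)
        raise
    finally:
        event["duration_ms"] = (time.monotonic() - start) * 1000
        emit(event)   # same instrumentation on Lambda, containers, or bare metal
```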

Liz Fong-Jones: 

So we have another question, which is from Henry Carpenter, who asks: Which tools exist, or what reading should someone do, if they’re interested in progressive delivery and progressive deployment? James, that’s a question for you, I think. Where should people get started with that? 

James Governor: 

It’s interesting. It used to be that there wasn’t that much stuff out there. I mean, I’ve done a bunch of talks on it. I’m not sure I’ve done the best writing on it. I’m kind of on the hook in that regard. But some things to look at: I think there has been some really nice work done by some of my friends at a company called Weaveworks. They’ve got a talk on Flagger, which basically is about doing canary deployments. The model that they use is called GitOps. That’s the notion that I make my changes through Git, and everything is done for you using Kubernetes. It’s just a really nice model. 

Okay, I’m going to roll out some of the traffic, and then I’m going to roll out the rest. I’ve talked about feature flags. I think that one of the companies that is doing a really good job in getting people to understand this is LaunchDarkly. You know, pretty clearly, if you’re thinking about observing the system, spend a lot of time on Honeycomb’s website, because that’s what they’re talking about on a daily basis. In terms of deployment, you know, we’ve seen a big, big change. Any of the folks that were focusing on CI/CD have gravitated toward realizing, you know, Oh, hey, this is something we need to think about. 

GitLab is beginning to do quite good work in this space. I think they’ve got something nice from a user-feedback perspective called Review Apps, where, if we think about some of the experimental aspects rather than just rolling something out and running it, the idea that the user can give feedback and we can roll it back into the system we’re building is really nice. But, yeah, I mean, I think I’ve talked about platforms. I think you do want really solid observability, and then you want to be looking for release management, for vendors investing in release management. 

Liz Fong-Jones: 

Excellent. Thank you. So one person asks: What guidance do you have in terms of influencing your leadership to adopt SLOs? Where should someone get started on selling the idea that we need SLOs? Charity? 

Charity Majors: 

Yeah. Well, I think, first of all, whenever you’re trying to influence leadership about engineering things, try to cast it in terms of, you know, not just abstract engineering goals but in terms of dollars, in terms of happiness, in terms of retention of your employees, et cetera. This is hard to answer without knowing how technical the execs are, et cetera, but it certainly pays dividends in terms of not burning out engineers. It places user happiness first and foremost, as it should be. So, what is their objection to that? I’m not sure. James, do you have a better answer? 

James Governor: 

No. I just think, you know, in terms of persuading people, good stories are useful. While it’s not directly about SLOs, it goes back to the basics of the people that have changed the way we’re thinking about tech delivery. So, you know, The Phoenix Project, The Unicorn Project, the stuff that Gene Kim has done. You know, getting Jeff Lawson’s new book in front of them. As Charity says, who are the most valuable people in the org? They’re the people doing the technical work now. Everybody wants to be a software company. Everybody needs to become better at this. I think as we look at those stories, then you can begin to have a conversation about: let’s understand how SLOs will get us to a position where we’re able to move quickly, we’re able to…

Charity Majors: 

It’s a real competitive advantage to manage software engineering in a more modern way. It’s something that everyone should be aspiring to. 

Liz Fong-Jones: 

Excellent. Well, I thank all of you for joining us today. As a reminder, we’re going to be publishing a recording of this. If you have any further questions, you can go through that material and have a look at the recording. Thank you very much for attending this session. We appreciate your feedback, so if you have any, please feel free to share it with the team. Thank you very much for tuning in. There are a few more resources listed here: the observability maturity model and our previously recorded talks on service level objectives. Thank you very much to our panelists, James and Charity. And thank you for your attention. 

James Governor: 

Thanks, everyone.

If you see any typos in this text or have any questions, reach out to marketing@honeycomb.io.
