
A New Framework for an Observability Maturity Model


Summary:


Everyone is talking about "observability," but a clear game plan for getting there has not yet been defined. We now have a great starting point: Honeycomb's observability experts, CTO and co-founder Charity Majors and developer advocate Liz Fong-Jones, recently combined their knowledge to create an initial framework that focuses on goals instead of tools for achieving observability. The framework asks organizations to look honestly at "how well are we doing?" in key areas that affect observability -- and ultimately customer and developer happiness.

In this webinar, the authors of the framework cover the importance of five key areas for assessment:

- Resilience in responding to system failures
- Quality of code
- Cadence of releasing code
- Management of complexity and technical debt
- Understanding user behavior

The authors will also share their intent for the framework to be a living, evolving guidepost that incorporates input from the dev and engineering community over time. Attendees will also learn how to get involved and provide feedback.

Transcript

Kelly Gallamore [Manager, Demand Gen|Honeycomb]:

Hello everyone. Thank you for joining us today for this webinar. We’re going to start promptly in two minutes to let a few more attendees dial into today’s episode. We’ll see you in two minutes.

Charity Majors [CTO & Co-founder|Honeycomb]:

I feel like Liz and I should have some witty banter here for everyone to listen to. We don’t need to make it start recording? It automatically did that?

Liz Fong-Jones [Developer Advocate|Honeycomb]:

That is correct. This presentation will be recorded and available afterward. And you’ll hear more about that in a moment from Kelly Gallamore.

Kelly Gallamore:

Hello everyone. Welcome to the "Observability Maturity, A Framework, and a Model" presentation. Thank you for joining today's webinar. Before we dive into our presentation, I would like to go over a few housekeeping items. If you have any questions at any time during this webinar, please use the "Ask a Question" tab located below your player. We will have a Q&A session at the end of the presentation to address any and all questions. Also, at the end of the webinar, we would really appreciate it if you would rate our presentation and provide feedback using the "Rate This" tab below the player. Your feedback makes us better. If you have any technical difficulties at all, you can click on "Support for Viewers" down at the bottom of the page.

Finally, today's presentation is being recorded and will be available at the same URL shortly after its conclusion. Feel free to share it with your friends and colleagues. Let me introduce you to today's speakers. Presenting today we have Charity Majors, CTO and co-founder of Honeycomb, and Liz Fong-Jones, Honeycomb's Principal Developer Advocate. Charity, why does observability matter?

3:02

Charity Majors:

That's a great question, Kelly, thank you. You will notice that my voice is gone. I don't know where it went. I hope it's having a nice time on a desert island, perhaps. Instead of a conversation between Liz and me, a lot of this is going to be Liz talking and me doing mime gestures to cheer her on. Why does observability matter? It matters because it has to matter. I've been building systems... I've been on-call since I was 17. I thought that monitoring would carry me forever. I thought that I knew how to build systems and maintain them. And then microservices happened, and then distributed systems happened, and suddenly I realized that the unknown-unknowns way outstripped the known-unknowns. And I was just dying, and that's the history. I'm now going to drink some of my wonderful lemon tea and let Liz take it from here.

Liz Fong-Jones:

Lovely. Thank you for introducing us to this subject, Charity. So, as we were talking about earlier, observability really matters for running sustainable software systems that drive your business. Today's agenda focuses on these things: why we should care about observability, why you need a maturity model to think about your observability journey, how we think about observability and how we've seen it implemented in organizations, how you can evaluate these things and put them into practice, and what our next steps are for popularizing the idea of the observability maturity model. And then finally, there will be an opportunity for you to ask us questions at the end, as well as for you to get involved with our future efforts around the maturity model.

So let's go ahead and start by talking about the idea that observability is essential to modern software development. I think that observability is essential because our systems are becoming more and more complex. And as our systems become more and more complex, some of it is essential complexity, which is required to build the more complicated things we need in order to deliver value. But some of it is complexity that's not essential. Figuring out which is which, and understanding our systems even as the complexity grows: that's the new challenge of software engineering in 2019.

Charity Majors:

That is really the reason we need software ownership, right, Liz? We used to have two teams, Dev and Ops, with different tools and different languages. And if you tried to do both, it was like doing two jobs. Now we realize that in order to build good services you have to own them from end to end.

Liz Fong-Jones:

I definitely think so. We need to, and can, resolve the very common slow feedback loop. We cannot afford to have different sets of tooling because we need to move fast. We need to move really fast, and that's what the observability story is about. That leads to the next question of, why a model? Why are we doing this model? Why are we doing this today? I think part of the answer is that people persistently ask Charity and me, "Oh, this observability stuff sounds great, but I could never do it, right? My organization is just incapable of doing it." And we want to disabuse you of that notion, because we did learn many of these lessons the hard way ourselves and we don't want you to have to suffer through that. We want you to see a path, or a set of paths, that are feasible for you, that other people have walked before and that you can walk as well.

So it's not just about immediately getting to the end. We have to think about, what does the journey look like? How do you progress along this path? How do you know whether you're on a good path or whether you're wandering off into the woods? And this really gets to a lot of the agile movement as well: thinking about how we introduce incremental changes to our organizations and to our development processes, and how we evaluate them. How do we get feedback on our processes, as well as injecting feedback loops into the development process itself? It's the meta loop of thinking about how we iterate on our process.

So when we sat down to think about this, when we sat down to think about how we could come up with a maturity model for observability, I think a lot of the challenge was realizing that there are so many different places that people start from. There are so many different places that people start from, and so many different ways of accomplishing this goal, that we didn't feel it was right for us to prescribe one golden path for people. And it wasn't right for us to prescribe one set of evaluation criteria for people. So instead what we realized was that Simon Wardley's idea of value chain mapping was a really powerful concept. The idea of Wardley value chain mapping is that it carries over very well between organizations, but it has to be customized for each organization; it's a judgment call that people on the ground in each organization need to make.

And it's a framework for thinking about, how am I best serving my customers today? How am I helping my customers and helping my business and helping my key stakeholders achieve what they're aiming for? And that really resonated with me. It really resonated with the idea that, instead of saying thou shalt do logs in a certain fashion, or thou shalt keep your data stored in a certain way, or thou shalt instrument a certain way, we could zoom out. What are people trying to accomplish, and how can observability help them?

9:01

And what we wound up with was something that looks vaguely like this. Completely beautiful, right? Starting from the idea of: if you want to make your customers happy, how do you do that? Well, in order to make your customers happy, you need to ship the product and you need to make sure it's appropriately reliable. But how do you do each of those things? It's about practically breaking it all down: how do I start from making the customers happy and translate that through each step and each individual process to figure out how we improve? Which components do people need to think about as they're listing out the things they might want to improve upon?

Charity Majors:

That reminds me of Tolstoy, how he said every happy family looks the same and every unhappy family is unhappy in its own way, right? Everybody has their unique set of pathologies or unique set of differentiators. And you cannot prescribe a certain route to anyone. And yet there is this concept of, what does a good engineering team look like? What does a high-quality engineering team that is shipping fast, that is winning, look like, and how can you get from wherever it is that you are to be more like that?

Liz Fong-Jones:

Yeah, so that’s definitely where we were coming from when we started approaching this. And the cool thing about Wardley maps is that they let you focus on kind of, not just where you are now, but how can you make each of these pieces evolve over time? How can we move along this path to make ourselves more mature or more capable, and what blockers might we have to resolve? What’s most important? Having a map tailored to your business really really matters for these purposes.

So, with that, we can now kind of talk a little bit about the capability model that we’ve built and thinking about what are the five things that every software organization needs to do and how does observability intersect with those. So let’s go ahead and talk about and show you what this looks like. So, as far as the observability maturity and capabilities, we said everyone here on this call presumably has a business that they care about. But in order for the business to be successful, we said you need to make sure that your development organization is happy. If your development organization is not happy, then they’re not going to be as productive as they could be.

And I know that when I talked to Charity about this she had a bunch of really great thoughts about what makes for a successful engineering team with her previous hat as an engineering director.

Charity Majors:

Yeah, it's miserable when people can't build great systems. Engineering is such a creative act, right? We're not used to thinking of engineering as a creative job, but it so is. Especially when you're getting into the edges. Anytime you're doing something new, anytime you're doing something that is pushing the envelope. And people say 90% of what we do is boring, and that's fine. We can stamp that stuff out by rote, but on a regular basis we encounter new, interesting problems, and we can't solve them without a creative brain. And a creative brain is well-rested. A creative brain is not bogged down in minutiae, in things that you're repeating over and over. A creative brain has a certain amount of freedom to pursue whatever its curiosity is piqued by. Because the greatest way to feed a creative brain is to give it play. Your work should feel like play. Like debugging, to me, the best times of my life, the hardest problems, it felt like I was playing.

Liz Fong-Jones:

Yeah, it definitely goes to a lot of what you said about autonomy, mastery, purpose, and all of these wonderful things that we really know make for successful engineering teams.

Charity Majors:

We're not used to knitting these together. We're not used to saying that our technical problems depend on our care and feeding as human beings. But I think it's so true.

Liz Fong-Jones:

Yeah. But then there's also another piece. You can have all the happy devs in the world, but if they're not working on the problems that your customers actually care about, then you don't really have a successful organization. So those were the first two ways that we broke it down: are your devs happy? Are your customers being served? If you're not accomplishing those two things, then we'd argue that your organization is not actually successful.

But that's really hard to concretize. It's really hard to conceptualize what the things are that I should be working on. So we realized that we had to get a little bit more fine-grained than that. So we went on and said, and we'll talk about each of these things in detail, that there are five key capabilities that every team needs to think about. And these five key capabilities are ultimately backed by: do you have the ability to collect the data that you need, and do you have the ability to query it to answer your questions?

Let's go ahead and talk about each of these five areas, and then let's talk about the instrumentation and querying pieces, which a lot of people jump to right away when they think about observability. At the end, we'll talk a little bit about how that last piece helps in each of the five areas, but we really didn't want to focus on how you solve the problems; we wanted to focus on what the problems are and why.

Let's start talking about what quality code is. You need to have quality code in order to have a successful engineering organization and have happy customers. Because if you don't actually have quality code, then what happens is your users complain that the product is buggy and your developers get really frustrated that their builds don't pass – that they can't actually get anywhere in the codebase because they're constantly firefighting, trying to figure out, why is this freaking software not working?

So that's where we come from as far as thinking about what quality code is and how it advances most of these missions.

Charity Majors:

Quality code is not an absolute – sometimes the right thing to do is to take a shortcut. The right thing to do is not always to refactor and so forth, because maybe this is code that has a short lifetime. That's up to you to decide. You know your workload better than we do.

Liz Fong-Jones:

Yeah, I definitely think that’s true and I think the other purpose of talking about these five capability areas is all of them are areas that you can think about. Not all of them will be equally important to your business at any moment. This is to kind of prime you to think about what is the area that’s going to have the most impact on my business right now.

15:40

So let's go ahead and move on to talking about predictable releases. I think that predictable releases are another key cornerstone of making your customers and your devs happy. Because devs really get frustrated if they commit a change and think something's shipped, but it doesn't actually reach production, doesn't actually reach customers, until months later.

And unfortunately, that's the reality that a lot of people live with, and we're trying to tell you here that it doesn't have to be that way. You can incrementally decrease the time between when someone commits a change and when it's actually delivered and stable enough that it's not being rolled back... And from a customer's perspective, if you make a feature request and it takes six months to be delivered, even though it's sitting in someone's code review somewhere, that's an intensely frustrating experience for you. Go ahead, Charity.

Charity Majors:

I was going to say, if anyone's read "Accelerate" and the research that Nicole and Jez have been doing, they grappled with the idea of what a good working team looks like and they finally boiled it down to one thing: you can pretty much tell how good a team is by the amount of time it takes between when you write the code and when it shows up in production. That is the one identifying factor. If you shorten that, so many virtuous cycles feed in, because it's not a trade-off between speed and accuracy. They actually reinforce each other. When you get better at speed, you get better at success, and when you get better at success, you get better at speed. They reinforce each other; they're not in tension, despite how we intuitively think about it. And we have to start training ourselves and our peers and our leadership out of thinking that way, because our tendency to clench on a problem is the problem.

Liz Fong-Jones:

Yeah, I definitely feel that. We talk about the idea of reducing these development cycle times so that you get that feedback, so you can actually hear from your customers, which we'll talk about more in a moment. I think there's another piece here that we think about as far as capability, and there's a reason I don't mention it first: resilience is a thing that you need in your socio-technical systems. You need the adaptive capacity of the team and the service that they're supporting, together, such that you can run it sustainably for a long time with acceptable levels of reliability. But a lot of people, when they think about observability, think first about incident response; I think the story starts long before that.

Incident response is one piece, but you also have to have all these other factors. You have to think about preventative care versus the emergency room, exactly as Charity was saying. I think that's so critical, and people need to think about the larger picture of, what does resilience in my organization look like? And no, resilience is not just a property of your code. Resilience is a property of the people in combination with the code.

So if you don't have good operational resilience, then what winds up happening is outages inevitably happen and they take longer to resolve. In some cases, outages can even go on for days, weeks, months, as in the case of healthcare.gov. If you don't understand your systems, you're in trouble.

Charity Majors:

And the worst ones are the ones that go out for days or weeks or months and you don’t even know about them. They’re hurting your customers every day. People are bouncing, people are unhappy. But a very small percentage of the people who are suffering actually trickle up to you.

Liz Fong-Jones:

Yeah, so connecting that feedback loop is super, super important. And speaking of connecting that feedback loop together, I think we take a really broad view of what user insight is. We take a broad view of thinking about how you empower not just your software engineers or the people who happen to be on call, who ideally should be your software engineers, but everyone, including product managers, product engineers, and systems engineers, to understand the impact that their software has upon users. Right? Do we have product-market fit? If you don't have these things, what winds up happening is you waste a whole bunch of people's time, and your company goes under if you don't actually understand, what is the market demanding of me?

I think that this is something Charity definitely cares about a whole bunch as someone who was on the operations and product sides at Parse before. I don’t know whether you want to share an anecdote.

Charity Majors:

Oh sure. Yeah, at Parse... we were a backend as a service. Every day people were coming to me and they're like, "Parse is down," and I'd be like, "Parse is not down. Oh my gosh, my lights are all green!" And this is Disney, and Disney is doing, like, thirty requests per second out of all the traffic in our view. But it's their world to them. If you're an indie app developer, your app is your world. If it isn't performing well, you aren't getting paid, your users are pissed, they're all coming at you. But to me, it was one of, like, a hundred thousand apps, and by the time you looked it was one of a million. I can't care about everyone and their million apps. I would go insane, I would kill myself trying.

Liz Fong-Jones:

But you also had the thing going on where people found weird and wacky ways to use Parse that you weren't necessarily aware of, right?

Charity Majors:

Because increasingly, everyone has a platform. And I believe that, from a backend engineer's perspective, a platform is nothing other than a system where you're inviting your users' chaos onto your land. And the more control and flexibility you give your users over what the hell they want to do, whether that's writing ad hoc queries or running snippets of code, the more possible outcomes, right? The more flexibility and power you give to users, the happier they are, but the more they can screw you over. And the more unique they will be; no two snowflakes will ever happen the same way twice. Really, they're just chaos monsters. The more you give your users the ability to do things, the more everything becomes one-off.

Liz Fong-Jones:

Yeah, I totally hear that. The idea is that if you don't have the ability to understand what your users are doing, then you're not going to be able to build correctly for them. They're going to be hacking at your system and you'll have no control over it. Whereas the more you understand your users, the more you can give them a paved path that they really will be happy with.

Charity Majors:

That is true, but also, if what makes you a business is that you do give your users that crazy amount of control, maybe that's fine. Maybe that's the decision that you make. But then you need the ability to understand it. Think about Travis and Jenkins and all the big CI providers: they let you do whatever the fuck you want, basically, on your instance. But they regularly have to deal with this really long, thin tail of things that almost never happen, and that's what you have to care about. So you don't care about the aggregate health, you care about each and every single user as they make their way through all of your code, and being able to understand exactly what happened for every single one of them.

22:58

Liz Fong-Jones:

Yeah, and that brings us to a really great point next that we want to focus our users, our customers, and the general observability community on, which is, how does observability interact with technical debt? The more code you pile on, how do you actually make sure that we can continue to make forward progress? How do we make the right trade-offs so that we don't spend forever in debt paralysis, while at the same time cleaning up as we go so that people can understand the systems? So that people can really understand: hey, this is what the system is currently doing, this is how I can make progress, this is how I can improve without fearing I'm going to break the whole system.

And my former colleague from Google, John Reese, talks a lot about this idea of the haunted graveyard: the thing no one wants to touch because they don't understand it. And that, to me, is an insidious form of technical debt…

Charity Majors:

Yeah, that's the black box. It's something scary that lives in there. I just started watching Stranger Things, so now I'm like, "it's the upside-down place, nobody go there!" And the thing is, you need to have a toolbox for digging your way out of these holes until it's no longer scary. And that's our instrumentation.

Instrumentation is your ears, your eyes, your nose, your tongue; it is your five senses for understanding the code that you're writing and shipping every day. And I feel like as an industry, we're kind of behind here. We've gotten really good about commenting our code. We've gotten really good about rigor around a lot of these things, even documentation. We're better at documentation. I know that's a big, scary, sweeping statement. We're better at documentation than we are at instrumentation, because for so long we've been shaping our instrumentation to fit the needs of the very hyper-specific data structures that certain tools have been giving us.

Instead of starting first with: what will I need in order to understand these problems? And then shipping that off for future analysis.

Liz Fong-Jones:

Yeah, awesome. So those are the five areas, and as Charity was alluding to, instrumentation is needed, but it's not there because we like to write instrumentation. There's a purpose. We need to get the critical data. We need to answer those five questions.

We need the critical data in order to understand: do we have quality code? Is a release going to happen on time? Are you able to debug your system in production? Are you able to understand your users? Are you able to maintain your code without technical debt? That's why we write instrumentation: so that we can get the data that we need, and the context that we need, in order to answer these questions.

Charity Majors:

And this is both for humans to be reading and for systems to be reading. Machines are going to be consuming it and doing some sort of offline analysis, the things machines are good at. But you know what humans are good at? Taking that original intent that you had and looking at the system and seeing if it's doing what you meant it to, and if anything else looks weird. And that is something that you, and only you, can do. And you're going to have forgotten what that original intent was in a few hours or days. So this is a rare, precious point in time where you built something, shipped it, and it's live, and you get to watch it and validate it. And I swear to God, 80-90% of technical problems in production will never even be noticed by your users if your engineers just get in the habit and muscle memory of doing that.

Liz Fong-Jones:

Yeah, and I think the other aspect of this, besides having the muscle memory of watching your code as it goes out to production, is also being unafraid to ask new questions. Being unafraid not just to accept what's on our dashboards, but also to understand the system from first principles, or to put together a new set of fields that have never been put together before in order to understand: what combination of build IDs and users is tripping this bug? Those are the things that you need to be able and empowered to ask, rather than falling back on, "well, there's no dashboard for it, I guess we'd better bug an expert." You are the expert.

Charity Majors:

And honestly, this kind of stuff makes you a better engineer. It really does. When you're passively pattern-matching between the dashboard and your code, you're not actively engaging with the subject matter. And it can be very daunting to make that leap to actually formulating a question, forming a hypothesis and testing it or not, but this is how I see people become really great senior engineers. And it's important to know how your code runs in production under normal circumstances, not just during outages.

27:45

Liz Fong-Jones:

Yeah. So we put out this idea of those five key capabilities, and the instrumentation and data and querying that drive your ability to deliver those five capabilities. And we put it out and we weren't sure what people would think, honestly. Charity and I are very opinionated, and sometimes people disagree with us, and that's fine. But what we heard when we put together this meetup, which had, I think, something like 30 people, over at O'Reilly Velocity Conf in San Jose, was that people really resonated with this idea of five things that people care about as engineers. That was something that really resonated with our audience. People really felt, yes, this describes exactly what my organization is going through, and maybe one of these is more attainable to us than the others. In particular, two of the tables had the highest attendance: the technical debt table and the operational resilience table.

People really strongly identified with those two being their most painful, but everyone felt affinities with each of these five areas. And what we did was we had a really great set of round table discussions about each of these topics.

Charity Majors:

Observability is such a keystone; it's a cornerstone for each of these. It's not the point, it is the how. It's not the what, but how you get there. And I feel like there are many axes on which our industry has been held back by the fact that we haven't had observability tooling, and by that I mean just the ability to understand, at a per-request level, what is happening? What is the context? Being able to slice and dice your data… it's not just throwing your code over the wall and letting someone else in ops look at the aggregate graphs.

And having no idea if this spike means that my SQL is slow, or ten percent of users are affected, or one percent of users are completely failing. All of these are very different scenarios. I think that chaos engineering has started to become a thing because observability is starting to become a thing. I think there are a lot of things that are becoming feasible because of observability. I don't honestly like to focus on observability so much as on these actual questions, which Liz was just talking about. We had these discussions. People have these really rich, challenging problems, and observability is just a missing ingredient every time. It's not the only one, which is why it's something we talk about, but it's bound up with all of them.

It’s like English 101 when you get to college. You gotta take it before you can take anything else.

Liz Fong-Jones:

Yeah, it's so interesting. Ben Sigelman from LightStep often talks about the idea that traces are the fuel, events are the fuel. You use the information you collect in order to answer these questions. And if you lack the ability to answer those questions, then you're going to be in trouble. You're not going to be able to answer questions like: is my deploy pipeline getting slower over time? What's contributing to my deploy pipeline getting slower? Or what kind of user is affected by this?

Charity Majors:

Yes, it's a lot of these social pathologies we're all familiar with, like the grumpy old guy in the corner who we can't fire because he can intuitively understand all these problems in the complex system, but nobody else can understand them because there's not a tool anywhere. No one has access to it. It's all in Ted's head. Or the hoarding of information. There are so many pathologies that we think of as being social problems. But like we said, these are socio-technical systems and there's a lot of crossover. And just opening up this insight into production and making it available to everyone has a great democratizing effect.

Liz Fong-Jones:

Yeah, so definitely, when we had these tables at O'Reilly Velocity, we had everyone from each of the tables collect notes and record what the discussions at the table were and how people thought observability mattered. And one of the things that came up, for instance, at the operational resilience and incident response table, was that people really felt they needed service level objectives to understand, in a sufficiently complex distributed system, is the system up or down? The answer is yes: the system is both up and down. There's no binary up or down anymore. You need to have an idea of what is a sufficiently good state. What's slightly degraded, and what's really too degraded?

And if people don't have the visibility into their systems to understand at the event level what's going on, if they can't break that down, then they're going to be in trouble. And no, AIOps is not going to solve your problems. The better path is to have humans be aided by machines, rather than have the machines tell humans what is wrong. This viewpoint comes from attendees, not just us lecturing the crowd. This is what people said. These are real problems right now in our organizations.
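
To make the "sufficiently good state" idea concrete, here is a minimal sketch of evaluating a latency and error SLO over a window of request events. The field names (`duration_ms`, `error`), the 500 ms threshold, and the 99.9% objective are illustrative assumptions, not anything prescribed in the talk or tied to a particular tool.

```python
from dataclasses import dataclass

@dataclass
class RequestEvent:
    duration_ms: float   # how long the request took
    error: bool          # did it fail?

def slo_compliance(events, latency_threshold_ms=500.0):
    """Fraction of requests that were 'good': fast enough and error-free."""
    if not events:
        return 1.0
    good = sum(1 for e in events if not e.error and e.duration_ms <= latency_threshold_ms)
    return good / len(events)

def error_budget_remaining(events, objective=0.999, latency_threshold_ms=500.0):
    """How much of the error budget (1 - objective) is left for this window."""
    budget = 1.0 - objective
    burned = 1.0 - slo_compliance(events, latency_threshold_ms)
    return max(0.0, (budget - burned) / budget)

# Example: 10,000 requests in the window, 12 of them slow or failing.
window = [RequestEvent(80, False)] * 9988 + [RequestEvent(900, True)] * 12
print(slo_compliance(window))          # 0.9988 -- "slightly degraded"
print(error_budget_remaining(window))  # 0.0    -- budget spent against a 99.9% objective
```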

Charity Majors:

You’re getting very cynical about that, which I love.

Liz Fong-Jones:

Yeah, and then the other area that was super popular, and that there are a lot of detailed notes on, was technical debt: the idea of, how do we manage our technical debt and complexity? And people mentioned a lot of symptoms of pain. They felt all this pain because they had projects that literally couldn't move forward because of technical debt.

Or people were adopting a system for the wrong purpose and running it into the ground. Or there were recurring problems because people weren't able to effectively communicate. Or people weren't able to say, yes, I feel comfortable releasing this, because everyone was afraid to touch it. One person at my table even said, "I literally cannot go on vacation because people will not be able to push forward without me."

Charity Majors:

That’s so sad.

Liz Fong-Jones:

That's terrifying and sad, right? These are the problems that people are grappling with, and we really think that observability helps with this by demystifying production, by making it possible to understand your production systems without relying on that one engineer, by documenting things through your observability and really kind of…

Charity Majors:

Because otherwise the person who has the most political clout wins. Solve arguments with data. All right. Yeah. Sorry, keep going.

Liz Fong-Jones:

And then there were a couple of other tables, I think. Charity, would you like to summarize these?

34:25

Charity Majors:

Yeah, absolutely. The quality code one: there are a lot of religious doctrines that build up around code, I think, and in the absence of an arbiter, which is reality, you can get into all of these pathologies of people having ways they like you to do things. In the end, our engineering labors are for a business purpose. We're not getting paid to sit around and work on our bin files. We're getting paid to deliver value, and this can get very muddled sometimes. But honestly, people are happier at work when it's clearer what the business value they're trying to create is, and when code quality can become a reality. Nobody writes code and then ships it and immediately feels confident about it. I hope not. You shouldn't feel confident about it for quite some time.

Observability can answer that question. Predictable releases: you've all heard me ranting about Friday deploys. One set of diffs per deploy; decouple deploys from releases. Own your code. Anyone who has the ability to merge to master should have the ability to ship their own deploys. Context is key. And user insight: this is one of the most interesting things, I think, which is that observability is not just for engineering teams. It is for engineering teams, but it is also for every team adjacent to engineering. Everyone who has ever filed a ticket for engineers, everyone who has ever asked an engineer a question that the engineer had to go look through code and data to answer: that information could be available to them so they can serve it up themselves, without having to go through that engineering gatekeeper.

Liz Fong-Jones:

Yeah, so that was a summary of the discussion that we had at one event. But what we're hoping to do is have more discussions like this. The 30 most passionate people at Velocity are not necessarily a representative sample of everyone across the world. So we're hoping to continue these conversations. One of the key venues that we've had a lot of feedback from is Twitter. A lot of people talk to Charity and me on Twitter. And surprisingly to us, we didn't really get flamed over this. People just really thought, this really resonates with me, to the point that people are asking us for merch with the model on it. Which is like, oh my God, we can make merch around this?

Charity Majors:

Yeah, so we want to have more of these conversations. We want to start running them in other towns. We want to continue working on a framework for observability. We'd love to hear from you if you start trying to follow the recommendations we've laid out, and how it went. If you're local: Liz is in New York, I'm in San Francisco. I feel very strongly that we only move this industry forward by telling stories and by listening to each other.

Liz Fong-Jones:

Yeah, people really love anecdotes and tales of, here’s how I solved these pains.

And especially, here's how I applied observability, even if people don't necessarily think technical debt and observability go together. Guess what: if you hear the story of the team that applied observability in order to decrease onboarding time for their engineers, or the team that solved their resilience problems with observability, these are all things that we can use to help each other advance. So the next step…

Charity Majors:

We do not need to make all of these mistakes individually all over again. We should be able to learn from each other. God, I hope.

Liz Fong-Jones:

Yeah, you don't have to learn by falling face-first into everything; other people have already fallen face-first. Let's leverage each other's knowledge. So the next step, for those who are interested, is to read our Observability Maturity Model framework. In it, we include things like: this is what good looks like, this is what bad looks like, and how to think about prioritizing these things. These are all things that we're hoping for your feedback on, as well as your personal experiences.

We're also going to have a survey really soon that will help you contribute your thoughts on this and help us understand where people are along each of these dimensions. Because in order for us to give well-targeted advice that helps you where you are now, we need to know where you are. And then we're going to have a few more in-person meetups: in San Francisco next month, and I'm also hoping to run a meetup in New York, where I'm based, in the coming months as well. And we've got a couple of great episodes of o11ycast coming out in the next few months.

Charity and I are now acting as co-hosts of o11ycast, so we've got great guests like Ana Medina of Gremlin talking about chaos engineering and observability. We've got James from Bugsnag talking about measuring technical debt and code quality. And then we have... who is it from Intercom, Charity?

Charity Majors:

Rich Archbold.

Liz Fong-Jones:

Yes, Rich Archbold from Intercom, who’s going to tell us all about how Intercom uses observability to ship software faster, to get their deploys more…

Charity Majors:

If you know anyone who would be a great guest, or if you yourself would be a good guest, we're always on the lookout, and not just for backend folks: client-side folks, mobile developers, even product marketing. Practical people who use data, who want to talk about this, hit us up.

Liz Fong-Jones:

So that’s most of what we have for you but we wanted to leave at least 20 minutes at the end for questions so I think I’m going to hand it over now to Kelly Gallamore to help us wrap up this section of the content.

40:06

Kelly Gallamore:

That sounds wonderful! Now we're taking questions. If you have any questions, and I see a few from our audience already, feel free to enter them into the "ask a question" tab. We'll get to as many as we can during this time. All right, I have a question here that I hope I can do justice to. Someone has asked about what we really need to be aiming for: code freshness, code written so it's easy to test, code written so it's easy to change, limiting the amount of context necessary to understand how a piece of functionality works or fails to, so that constant time is not spent digging just to understand. Are these the things we need to be aiming for?

Charity Majors:

It absolutely should be time spent understanding. I don’t really understand this question. I think that what I’m hearing is when you’re writing code, what should you be paying attention to? And I would say it depends on what’s not working for you. Liz, do you have a better idea?

Liz Fong-Jones:

Yeah, I think this really gets to the idea of toil. In the SRE world, we think a lot about toil as the time that you're not spending productively working on making strategic improvements to your system. And when you're spending a whole bunch of time scratching your head, not being able to figure out why something works the way it does, I think that's an insidious, dark form of technical debt. Not necessarily toil in the manual, repetitive work sense, but it is toil in the sense that it's something that's slowing you down. It's a form of debt that's dragging you down, that's stopping you from making the changes to your system that you really want to make. So that's how I'd look at it: asking people, how much of your time are you spending in the zone? How much time do you feel like you know what you're doing? How much time are you distracted, and how much time are you just scratching your head, feeling like, WTF is going on?

Charity Majors:

Yes, that's a really good point. You should not feel completely unmoored very often. You should feel like you have a place to start. It may take some persistence, but it's following a trail of breadcrumbs to get you there. And if you don't have the breadcrumbs, the first thing you start doing is dropping the breadcrumbs: you instrument your code and decide on a […] for that.

Liz Fong-Jones:

Yeah, I think the dangerous thing that happens in a lot of organizations is that every new hire relearns these lessons over and over, because no one writes the instrumentation that documents what's going on. When you have the same people relearning the same lessons over and over, that's really the trap. And the more you free up just a tiny bit of time: every time someone onboards, have them add a little bit of instrumentation. You can gradually chip away at this and make the problem better. Feel free to ask another question in the box if this didn't answer your question.

Charity Majors:

And when you're on call, just take an instrumentation-first approach. Add or revise instrumentation for whatever is two yards in front of your eyes every time, and before long, your entire stack will be well-instrumented.

Kelly Gallamore:

I think that actually leads into another good question here and Charity, maybe this is the answer to it. If we’re hurting in all five areas, what’s the best place to get started?

Charity Majors:

I thought you were going to ask the Friday deploy one. If you're hurting in all five areas, it depends, because they can all hurt equally. I would start with humans. What is waking your people up? Start with whatever is sapping their will to live, because once you start freeing up some of that, once they start sleeping through the night and have their creative brain back, they'll be better equipped. So your managers should be tracking the paging volume: who is getting paged, where is the page coming from, do you understand what's happening? Can you put adequate automation […] in place? Can you lower your thresholds? Can you decide not to care about some of these things? Maybe you just have to offload some amount of caring to dig yourself out of this hole.

Liz Fong-Jones:

Yeah, I definitely hear that. If your team is not happy, no amount of working on the other things is ever going to make progress, because you can't run a team on burnt-out engineers or engineers who quit. Decrease the operational pain to the point that your engineers aren't feeling burnt out or unable to do work, even if that means compromising on the reliability of your service.

And the next thing I would think about addressing is: okay, now that we're starting to get our people situation under control, how can we get the operations of our service back under control? Are there repeat incidents? Are there things we can do to increase understanding of the service so we can start bringing the reliability levels back up? And then finally, I think, you can start doing the technical debt piece. And the technical debt piece, there's a reason we put it in the middle of that wheel. The reason we put it in the middle of that wheel is that it influences all four of the other things. It makes it harder, or easier, to do all four of the other things.

So that's your highest-leverage thing, but it's not necessarily your most urgent thing. Your most urgent thing is to stop the burnout, stop the constant on-call fires. After that, you can focus on tech debt, and then you can work on each of these four things in turn. I like to think about it in terms of the wheel: you write code, deploy it, make sure it runs in production, and then look at the feedback from users. And you work your way around that wheel. Every time you have one thing that's pulling you back, note it and improve it. And then you improve that until you're held back by the next step. Let's suppose your developers cannot debug their systems, even in their dev environments. If that's the thing that's pulling you back, help them write better tests, help them test using the same tools they would use with production, do all of that.

But then if your blocker suddenly becomes, okay, now they can test well but they can't ship to prod, it takes weeks to ship to prod, now you've got to work on that. Okay, now we're shipping to prod but it's breaking in prod, now you've got to work on resilience. Okay, now everything's resilient and stable but we're not building the right thing, and it loops around. Just keep working on […] while keeping your people from burning out.

Charity Majors:

And consolidate your weight behind fewer efforts. Don't try to attack all five at once. Attack one, knock it out of the park, then attack the next one, knock it out of the park. Nothing counts until it's shipped. You don't get partial credit for partially fixing something if it's still waking you up. Make sure that you're knocking them off one at a time, instead of going, "all right, we have five people, each one of you gets assigned one of these problems." That will never result in success.

Liz Fong-Jones:

The Wardley maps that we talked about earlier: every organization's Wardley map, showing where they are in each of these five areas, will be very different, and the potential priorities […] decided based on the importance of the pain right now.

Charity Majors:

It could be really fun for your leadership team to do during an off-site. I did want to mention SLOs, though, because SLOs are the contract that you have between engineers and management. You're never gonna get to the promised land without those.

Kelly Gallamore:

Got that! It sounds like a really great approach to just identify your problems and take one bite at a time to start working your way towards more success in this area. And since this is a common question I’ve heard: how often should we… [be doing releases]? Should we release on Fridays?

Charity Majors:

Liz, you want to take this?

Liz Fong-Jones:

Sure, I'll take it. I can be a little bit more diplomatic, I think. When you have a fine-tuned release process that's regularly running, it should be an exception to turn it off. Why does a release process that's continuously deploying have to be turned off on a specific day? Or, if you can't catch an outage in the first three hours after it's deployed, why do you think you're necessarily going to catch it the next day? It might explode one day later, two days later, three days later.

At that point, if you choose not to push on Fridays, you are giving up one-fifth of your velocity. And some other organization that has invested in the better tooling needed to maintain Friday deploys is going to eat your lunch. And that's the thing, spoiler alert, that's the thing that Intercom says […] I think that if you haven't made that investment, don't deploy on Fridays, but look at that as a symptom that something is wrong, rather than as "I'm proud of looking after my people."

You can be proud of looking after your people and saying “hey, stick around for long enough until you’re confident.” After that window, it could go wrong at any time and that’s normal on-call work.

Charity Majors:

I am not, despite what some may think, rabid about this. I'm not saying we must always deploy on Fridays. If that's where you're at as an organization, that's a great stopgap. That is a great temporary thing to do to make sure your people have their weekends safe. But if your actual state is one where that's the reasonable thing to do, it probably means that the rest of your week is all shit. The point is not to not deploy on Fridays. The point is to do the right thing for your organization at every stage of development, and something's very wrong if you can't safely ship on Fridays.

49:22

Charity Majors:

But also, differentiate between releases and deploys. Nobody's saying some big, fancy, million-lines-of-code feature should get shipped on Friday at 5 pm, as if that makes sense. No, that is not what we're saying. […] feature flags, but get it in there so that you don't have to stop everything to queue up behind a couple of big deploys. Keep the digestive system moving.

Liz Fong-Jones:

Yeah, incremental releases are so critical to this. Having adequate observability to understand: when I do that incremental feature flag turn-on, what are the results I'm seeing, broken down by feature flag? Can you compare control and experiment groups as you go? Then you'll feel confident. And yes, I do sometimes push on Fridays, but I don't push the thing I think is high risk on Fridays. It's not a universal "no deploys on Friday"; it's "be smart about it." Be aware of the risks.
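
As one way to picture that comparison, the sketch below groups hypothetical wide events by a feature-flag field and summarizes error rate and a rough p95 latency for the control and experiment groups. The field names (`flag.new_cache`, `duration_ms`, `error`) are illustrative assumptions, not from the talk or any particular tool.

```python
# Hypothetical wide events: each request records which feature flags were active for it.
events = [
    {"flag.new_cache": True,  "duration_ms": 42,  "error": False},
    {"flag.new_cache": False, "duration_ms": 55,  "error": False},
    {"flag.new_cache": True,  "duration_ms": 480, "error": True},
    # ... thousands more in a real system
]

def summarize(group):
    """Crude rollup of one group: count, error rate, approximate p95 latency."""
    durations = sorted(e["duration_ms"] for e in group)
    p95_index = max(0, round(0.95 * len(durations)) - 1)
    return {
        "count": len(group),
        "error_rate": sum(e["error"] for e in group) / len(group),
        "p95_ms": durations[p95_index],
    }

def compare_by_flag(events, flag):
    """Split events into experiment (flag on) and control (flag off), summarize each."""
    experiment = [e for e in events if e.get(flag)]
    control = [e for e in events if not e.get(flag)]
    return {"experiment": summarize(experiment), "control": summarize(control)}

print(compare_by_flag(events, "flag.new_cache"))
```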

Charity Majors:

You want people to be using their judgment. You want to be empowering people to use good judgment. Because that helps them build good judgment.

Liz Fong-Jones:

I think that covers the question.

Kelly Gallamore:

All right, that's great. Let's move on to this one. This one assumes familiarity with the lean startup methodology; I think we can all understand that. Do you see, looking for a different angle here, do you see observability as integral […] in learning metrics and learning experiments? Or do you think of these as separate domains that require distinct sets of tools to work efficiently?

Charity Majors:

Data is data, dude. You actually get so much more value from having all of it in one place than by separating it out and pre-optimizing for all of these different things. We've been using Honeycomb for our business logic since day one, and a lot of our happiest users are people where it's not just engineering: it's support, it's project management, it's sales, it's marketing. I'm a firm believer that tools create silos. If you have a team that's using a tool, and the boundary is the edge of that team, and other teams are using other tools, their views of reality are never going to be the same. And you get something much greater than the sum of the parts if you can unify everyone in one place.

Liz Fong-Jones:

This is a great place to talk about Service Level Objectives. It's the way of getting the business stakeholders, the project managers, the engineers, and the people on call all on one page about: this is what our success criteria are, this is what the business is aiming for, and this is how we measure it. And I think that's so critical. I think there is one angle I do want to talk about. There's a reason that we talk about the five capabilities and why you need to evaluate where you are in each of them.

A superb incident response practice does not necessarily translate into product insights. You may need to separately instrument for those, and think about how you're going to instrument for those as you're writing the code. And the instrumentation may be slightly different. The usage metric of a feature may be very different from "this is the error rate that we're seeing and this is which requests are failing." Those are different things that you're instrumenting as you go.

If you put them in the same place, you can correlate them afterward. If you don't put them in the same place, if you use different tools, people are jumping around between different tools, and that really sucks. So you may have to think about building the capability to instrument each of these, and the capability to build the right queries for each of these. But if you make the data store the same, it kind of standardizes things.

Kelly Gallamore:

Liz, it feels like you're actually touching on one of our next questions, so let me ask this specifically: what are some specific examples of things people should be looking at, metrics or events, in order to track these five outcomes of observability? Can you touch on that?

Liz Fong-Jones:

Yeah, I can certainly touch on that. I think that if you are not measuring the SRE golden signals, utilization, saturation, and errors, or the other set, rate, errors, and duration, if you're not studying those things, then your APM tool is deficient. I can flat out say that. Every APM tool needs to have request rates, error rates, latency histograms. If you don't have that, you're flying in the dark. But I think the better data comes when you're able to break that down by arbitrary cardinality, where you're able to ask: for this customer, what's the latency that they're seeing?

Even if you have a million customers, your customer success team can go in and look directly at that without having to ask an engineer. And the other dimension of thinking about how we instrument for each of these areas, and what a success case looks like, really goes area by area. For quality code: your developers have the same sense of how things are working in their dev environment as in their prod environment. For predictable releases: we released this cool, open-source tool called Build Events that lets you find out, how long are my builds taking? Which builds are the most prone to failing? Am I seeing a creep-up in the latency distribution of this particular step? These are all questions you can answer with Build Events, so using that to cut down your release cycle is really powerful. Everyone seems to understand incident response to a degree: can you actually figure out what's broken in production without it taking 20 minutes to form a hypothesis about what's gone wrong?
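
As a sketch of what "golden signals broken down by a high-cardinality field" can mean in practice, here is an illustrative computation of rate, error rate, and median duration per customer from a set of structured events. The field names and the 60-second window are assumptions for the example, not anything prescribed in the talk.

```python
from collections import defaultdict

# Hypothetical structured events, one per request, each tagged with a customer id.
events = [
    {"customer_id": "disney",        "duration_ms": 31,   "error": False},
    {"customer_id": "indie-app-123", "duration_ms": 2200, "error": True},
    {"customer_id": "indie-app-123", "duration_ms": 1900, "error": False},
]

def red_by_customer(events, window_seconds=60):
    """Rate, Errors, Duration, broken down by an arbitrary high-cardinality field."""
    groups = defaultdict(list)
    for e in events:
        groups[e["customer_id"]].append(e)
    report = {}
    for customer, group in groups.items():
        durations = sorted(e["duration_ms"] for e in group)
        report[customer] = {
            "rate_per_s": len(group) / window_seconds,
            "error_rate": sum(e["error"] for e in group) / len(group),
            "median_ms": durations[len(durations) // 2],
        }
    return report

# Lets a support or customer-success person answer "what latency is this customer
# seeing?" without asking an engineer to write custom code.
print(red_by_customer(events))
```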

Charity Majors:

Without having to ship new custom code to describe that one scenario –

Liz Fong-Jones:

– ship that code and wait through that 24-hour, 36-hour, week-long cycle in order to get insight into your systems. You should just build it the right way from the start so you can query it when you do have a question you'd like to ask. So hopefully that gets at the flavor of how we approach these things, how we concretely think about the observability capability in each of these five areas. If that didn't answer your question, feel free to tweet at me or ask the question in the chat.

Kelly Gallamore:

Great, thank you so much. I do want to say, I'm not sure who said that, but somebody wants to say thanks for a very helpful response, whichever one it was. So we appreciate that coming through. Someone has asked for details about the New York meetup and what the name is. I will put a link to that meetup page so you can join it; I've put one in recently. If you become a member of our meetup page, as soon as I have more details about where and when that's going to be, you'll get a notification. So I'll make sure that link comes up very, very soon. And we'll turn to this question: software projects are not naturally broken, they get that way over time. You probably can't improve until you break the negative feedback cycles that lead to things getting worse. How do you break those cycles?

Charity Majors:

I don’t know what cycles you’re talking about.

Liz Fong-Jones:

I think that is a question of executive will. If you are an executive who does not care about your people, then you're going to run the organization into the ground. If you are an executive who does care about your people, the tipping point for a lot of people waking up and realizing we cannot keep doing business as usual is when they have that outage that breaks the camel's back.

I think the classic case there is, we discovered that the government procurement process for software was broken with healthcare.gov. And then there was a crisis, and the crisis caused there to be enough willpower to say, you know what, we're going to fix this. We're going to fix this the right way. Crisis can be one trigger, new executives coming in can be another trigger, and finally, people realizing that they need to do DevOps and observability-driven development, or a full ownership model: that can also be a powerful driver for change. I think once someone decides to change, then you have to make that incremental progress, figuring out what the minimum viable unit is, not "I'm going to change the whole process for everything at once."

What's the minimum viable unit? Where am I going to test this to see how it works with my organization? Which team is going to try this first? Go ahead, Charity.

Charity Majors:

Pain and pleasure are the only two tools. Either it hurts someone and they decide to make a change, or people get hooked on the dopamine hit. Those are the only two ways anything ever changes.

Kelly Gallamore:

All right, awesome. We only have a couple of minutes left so I’m going to see if we can get through this one last question. Sometimes technical debt just means we understand this better now. When deployment is not scary and regressions are easier to spot and the blast radius is limited, does this mean we can actually provide better value?

Liz Fong-Jones:

I think that’s a yes/no question. I think the answer is yes. The more leverage you have, the less you’re fighting the system, the better.

Charity Majors:

Right.

Kelly Gallamore:

Okay, that sounds great. I’m gonna call it… Oh, I’m sorry Charity, go ahead.

Charity Majors:

There’s an open source question actually.

Kelly Gallamore:

Oh, you're absolutely right, we have an open source question: "We have an open source culture. How should we make progress with observability with our own home-built systems?"

Charity Majors:

So first of all, open source culture is one thing. Does that mean that you write or roll your own […] for everything? I would push back; I think that's kind of a pathology at this point. However, the answer to your question is: look at how we've instrumented, how we've written the Beelines. Basically, when a request enters the service, you initialize an empty event and populate it with everything we know about the context. Throughout the life of that request through that service, you stuff more detail into the blob, and at the end, when it exits or errors, you ship that blob off as one arbitrarily wide, structured event. Ship those into, fuck it, I don't know, Aurora or something you can run queries on, and that will give you something.
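
The pattern described here, one arbitrarily wide structured event per request, enriched over the request's lifetime and shipped at the end, might look something like the following sketch. This is a generic illustration rather than the Beelines' actual API; the field names and the `send=print` stand-in for a real backend are assumptions.

```python
import contextlib
import json
import time
import uuid

@contextlib.contextmanager
def wide_event(service, request_path, send=print):
    """One wide structured event per request: start it at entry, enrich it
    throughout the request, ship it exactly once at the end (success or error)."""
    event = {
        "service": service,
        "request.path": request_path,
        "trace.id": str(uuid.uuid4()),
        "timestamp": time.time(),
    }
    start = time.perf_counter()
    try:
        yield event                      # handlers stuff more context in as they learn it
        event["error"] = False
    except Exception as exc:
        event["error"] = True
        event["error.message"] = str(exc)
        raise
    finally:
        event["duration_ms"] = (time.perf_counter() - start) * 1000
        send(json.dumps(event))          # ship to whatever store can query wide events

# Usage inside a request handler:
with wide_event("ticket-service", "/api/tickets") as ev:
    ev["user.id"] = "disney"
    ev["db.rows_examined"] = 1342
    # ... do the actual work of the request ...
```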

Liz Fong-Jones:

Yeah, there are a lot of great open-source things that do this. It's the methodology of structuring your events, and making sure that your backend supports indexing those structured events and querying them at run time. That's what matters. But open source is not free; you have to pay the cost to maintain it and run it yourself. That's a trade-off.

Charity Majors:

All right cool. This was fun. Thanks, guys for showing up, it was really great.

Kelly Gallamore:

Thank you, everyone. Feel better soon Charity, thank you very much everyone for your questions and for joining us today. There are attachments that will come through once this link is finished processing and you’ll get an email including a white paper for the framework for the observability maturity model, written by Charity and Liz. Thank you for joining us today and we’ll see you next time!

If you see any typos in this text or have any questions, reach out to marketing@honeycomb.io.
