
Observability Helps You See What Looks Weird

In this conversation for The New Stack Makers, Charity Majors discusses a number of themes relating to observability and monitoring, as well as how she continues to make herself a better developer.

Transcript

Alex Williams [Founder & Publisher|The New Stack]:

Honeycomb provides observability for all software engineering teams to learn, debug, and improve production systems, to delight end users and eliminate toil. With Honeycomb, developers code with confidence, operator efficiency goes up, quality of life improves, and the business grows. I am so excited for our guest today, Charity Majors, co-founder of Honeycomb. Charity, it’s so great to have you here today. You wrote a post for us just last week on test-driven development, and today is another opportunity to discuss test-driven development as a discipline and why it has to become part of the process that’s built into the muscle (I like what you say about that) and just part of what a developer does. So thank you so much for joining us for this conversation.

Charity Majors [CTO & Co-founder|Honeycomb]:

Thanks for having me and what a beautiful living room you have.

Alex Williams:

Thank you. We just had a lot of work put into it and it’s-

Charity Majors:

Good timing.

Alex Williams:

Yes, it’s the set that makes the pandemic, I guess, bearable to some extent for video purposes and Eddie helped us out with it entirely remotely.

Charity Majors:

Wow.

Alex Williams:

So we’ve become our own AV technicians here as a lot of people have. And I actually like your sound screen there.

Charity Majors:

Oh, yes. This is how we create multiple rooms in San Francisco.

Alex Williams:

I was reading Understanding Media by Marshall McLuhan. He talks a lot about spaces and how they’re defined by media. And I think we’re living in that reality right now.

Charity Majors:

So true. We’re all learning a little bit more about each other than we really needed to. But it’s good.

Alex Williams:

So I wanted to start with a question about test-driven development and what is a test-driven approach today? And I looked back at what Martin Fowler wrote back in 2005. And as you well know, test-driven development dates back much longer than that.

Charity Majors:

Yeah.

Alex Williams:

And he wrote that you write a test for the next bit of functionality you want to add, write the functional code until the test passes, then refactor both new and old code to make it well structured. And my question is, has his process fundamentally changed, and how so or how not?

Charity Majors:

Yeah, well, it hasn’t, and that’s the problem. I mean, there’s nothing wrong with it. That’s how you write code, but it’s not how you make sure the code works. It’s step one. It tells you that your logic works, which is great. It produces an output, but when it comes to running systems in production and making sure that ever-larger codebase actually… How many things do you have to consider besides whether it returns the right result? How quickly does it run? Maybe you’ve got concurrency issues. You’ve got thousands of threads running at the same time, all executing, all sharing resources, all sharing information, all hitting the same data storage systems. It becomes pretty clear pretty quickly that that is just the beginning. That’s just the beginning of your code’s life cycle. And so TDD is brilliant because it allows you to abstract away all the messy reality of production and just focus on getting this function working. But it has almost nothing to say about that function embedded within the codebase, embedded within reality, as your users are using it. Those are just very different worlds.
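As a concrete reminder of the loop Fowler describes, and of the limit Majors is pointing at, here is a minimal, made-up Python sketch: it proves the logic returns the right value and says nothing about how the function behaves once it is embedded in production. The function and test names are invented for illustration.

```python
# A made-up example: one file for brevity; normally the test and the code live apart.

def apply_discount(price: float, discount: float) -> float:
    # Step 2: write the simplest functional code until the test passes ("green"),
    # then refactor both new and old code while the test keeps you honest.
    return max(price - discount, 0.0)

def test_apply_discount_caps_at_zero():
    # Step 1: write a test for the next bit of functionality you want ("red").
    assert apply_discount(price=10.0, discount=15.0) == 0.0

if __name__ == "__main__":
    test_apply_discount_caps_at_zero()
    print("tests pass")
```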

4:16

Alex Williams:

They are very different worlds. So you write that TDD has some great things going for it. It’s a pure way of thinking about your software, but that’s part of the problem too. It’s very pure, isn’t it?

Charity Majors:

Very pure.

Alex Williams:

You write about how it’s almost hermetically sealed.

Charity Majors:

It is. It ends at the border of your laptop. That’s your world, right? Which is great if you’re trying to cart your laptop around and write some code and get something pushed to production. But part of it, I feel like people look at DevOps like it’s a new thing, and I really do not. I look at it as tearing down the wall that should never have been built, right? Returning to our roots as people who built and owned our software in production, where back then your needs and your reality were very aligned with your user’s. When your user was yelling at you, you fixed the problem, right? It was this very tight, virtuous feedback loop. And for reasons of maturity and size and everything, we’ve developed all these subspecialties, but the more detached we get from each other, if you’ve got someone writing the code, handing it over to release engineering, shipping it off to ops, you really risk breaking down that virtuous cycle, so you’re completely detached from your user’s experience. And you don’t have your eye fixed on the biggest problem from their perspective.

Alex Williams:

I heard a technologist once say that one of the biggest mistakes he ever saw was that infrastructure was divided into compute, networking, and storage. And that kind of thinking almost speaks to what you’re saying here a little bit.

Charity Majors:

Yeah. Specialization is necessary for scaling reasons, for all this stuff, but when people ask me how they can get their developers to care about the things that they care about, right, my number one question is, “Are they on call?” Because that’s a very quick, easy, blunt solution, right, to making them feel the pain that you or your users feel. I think that a softer version of that is just kind of what I wrote about in the piece, which is developing that muscle memory, understanding that your job is not done when you’ve merged to master or merged to trunk, and you don’t get to walk out the door then. Your job is not done until you’ve watched your code get deployed. And you’ve watched users using it in production. And while you’re developing, you should be looking, you should be instrumenting it, right, with an eye to your future self.

How is my future self going to be able to understand whether this code is working or not? And you should never accept a pull request unless you can look at it and say, “I understand how I will know how this code is working in production.” And then once your code is out there, even as a canary or whatever, you go and you look at it through the lens of that instrumentation. Is it telling you that your code is doing what you intended it to, and does anything else look weird? And that second part sounds fuzzy, and it is, but it can’t be reduced beyond that description. Does anything else look weird? Are you applying the fuzzy heuristics of your incredibly powerful human brain, which knows these systems intimately and knows what you’ve just done and has the best chance that anyone will ever have of connecting some unintended consequence with what you just did? Because you’re going to move on to a different problem.

All of that context is going to get paged out of your head. You’re going to lose it. If somebody else discovers it, it could be weeks, months, years down the line, after God knows what effects have been absorbed into everyone’s expectations of how the system will perform. And it’s just going to get harder and harder and harder to find it.
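One way to read “instrumenting with an eye to your future self” is to record, alongside each request, the facts that would let the author answer “is my new code working?” later. A minimal sketch using plain structured logging; the field names and the new-pricing flag are invented for illustration, not taken from any particular codebase.

```python
import json
import logging
import time

logger = logging.getLogger("checkout")
logging.basicConfig(level=logging.INFO, format="%(message)s")

def handle_checkout(cart, use_new_pricing: bool):
    start = time.monotonic()
    price = new_pricing(cart) if use_new_pricing else old_pricing(cart)
    # One structured record per request, so the author of the change can later
    # ask: is the new path being taken, and does it behave like the old one?
    logger.info(json.dumps({
        "event": "checkout.priced",
        "new_pricing_path": use_new_pricing,  # hypothetical flag for the change under review
        "cart_items": len(cart),
        "price": price,
        "duration_ms": round((time.monotonic() - start) * 1000, 2),
    }))
    return price

def old_pricing(cart): return sum(cart)
def new_pricing(cart): return round(sum(cart) * 0.95, 2)

if __name__ == "__main__":
    handle_checkout([10.0, 5.0], use_new_pricing=True)
```

The mechanism matters less than the habit: the pull request carries the breadcrumbs its author will need once the code is live.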

Alex Williams:

So how is test-driven development evolving into observability? I like what you say about looking weird, because when something looks weird, you’re actually observing it. You’re saying, “Oh.” Your mind is saying, “I just made an observation. And the observation is that that looks weird.”

Charity Majors:

And it’s embedded in the entire complexity of the production system. And people have tried to create staging environments that match production. I’m willing to declare that a failed battle. It’s not that there aren’t some things that staging is good for. If you’re a designer who wants to see how things will display, staging fits perfectly, because your production system is actually less the infrastructure; your production system is the browser. It includes the browser, right? And so if you’re testing it independent of the browser, you don’t have a production system. But production is never going to be staging. Staging is never production. No one is willing to pay for staging to be the same size as production and to be as complex as production. It’s always going to be mostly mocked, right? And it’s never going to have the same concurrency, the same impact of millions of users using it, right, in parallel.

So I think that it’s time to kind of declare that a failure and invest those resources. Developer cycles are the scarcest resource that we have. We need to invest those resources into making it so that you have the tooling that you need to ship things in a very controlled and isolated fashion.

Alex Williams:

But isn’t that what monitoring has always been? And hasn’t that just been the premise behind monitoring?

9:52

Charity Majors:

Monitoring is post hoc. Monitoring is you’re always fighting the last battle. Monitoring is you’re defining thresholds that say, somewhat arbitrarily, the system between these parameters is good enough. It’s good. It’s not always good, but it’s good enough. It’s good enough that we don’t have to wake someone up in the middle of the night to fix it. Now, most of the bugs that you ship in your life will never actually trip that threshold. And they can’t, because you would go nuts. There are so many false positives. There are so many blips that you can’t explain. That’s just part of your reality of running a very complex production system. Ops teams spend their lives curating these thresholds, and they’re never quite right. Right? And the agreement that we have to make with ourselves to run the future distributed systems of the world is you only page someone when your users are impacted.

Now, most of the bugs that you are shipping, when you’re writing code, your users won’t be impacted or they won’t know that they’re impacted, it’ll be so subtle. It’ll be something that will be… It’s just there are many different kinds of bugs and most of them, you will see if you’re looking at it, if you’re looking for it through your instrumentation, but they’re not going to be catastrophic enough for monitoring to catch it. And when I say the tools that we need to build to ship these things safely and in controlled ways, I’m talking about things like feature flags, right? I’m not saying deploy everything to 100% of your users in production immediately, right? You can have test users in production that you use, right? There are lots of different shades and ways of shipping things, partially shipping things to a canary, if you’re worried about the load impact, shipping things in a progressive way.

Monkchips has been talking about this a lot and I really love what he has to say. It’s all part of the same observability tools that let you break down by high cardinality and high dimensionality. They’re all part of an ecosystem of extending the reach of developers in very sensitive and controlled ways into production.
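The “shades of shipping” Majors lists, feature flags, test users, canaries, progressive rollout, all reduce to the same move: decouple deploying code from exposing it. A rough, hypothetical sketch of a percentage-based flag check follows; a real system would use a flag service rather than this hand-rolled hash, and the flag and function names are invented.

```python
import hashlib

def flag_enabled(flag_name: str, user_id: str, rollout_percent: float) -> bool:
    # Hash the flag name and user id together so each user lands in a stable
    # bucket per flag, then compare the bucket against the rollout percentage.
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100
    return bucket < rollout_percent

def render_search(user_id: str, query: str) -> str:
    # The new code is deployed to 100% of hosts but exposed to only 5% of users;
    # widen the percentage as the instrumentation says it looks healthy.
    if flag_enabled("new-search-ranking", user_id, rollout_percent=5):
        return f"new ranking for {query!r}"
    return f"old ranking for {query!r}"

if __name__ == "__main__":
    print(render_search("user-42", "observability"))
```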

Alex Williams:

I’ve heard more people talking about progressive software development. And one of the questions I do have is about just those people who have been conditioned to believe in monitoring as the way that they do work.

Charity Majors:

Monitoring isn’t going away any more than TDD is going away. They’re both essential tools, but they aren’t sensitive tools.

Alex Williams:

They’re not sensitive tools. And when you say sensitive tools, what do you mean?

Charity Majors:

TDD abstracts away all of reality, right? It doesn’t let you look for a small consequence of a small change in a large complex system, with all the ripple effects of the interconnectedness and everything. You need to be able to see it in its native ecosystem, so to speak, not in your very contrived false ecosystem. Monitoring, the thresholds are saying, “Well, if we get errors that are more than 0.02% out of…” That’s a blunt tool, right? Your code change might not even result in an error code being thrown. It might be suddenly changing the performance footprint of a query, or the behavior of a modality, or something visual, right? They’re blunt tools that give you a certain level of confidence in your system. SLOs are better than most monitoring, but fine. They’re still saying, above all, this is a heuristic, right? This is a probability. This is a very blunt tool that we can use to let us sleep during the nights, but it doesn’t actually necessarily tell you anything about the change you just shipped.

Alex Williams:

So in review, we’ve talked a little bit about test-driven development and how it is muscle memory, and how you really need to think about that. Monitoring is something that’s been around for a long time. And it has existed a lot in on-premise environments and that’s where it really evolved, right?

Charity Majors:

On-prem. So, yeah.

14:13

Alex Williams:

Right. Yeah. So yeah, it all relates to on-prem in many ways. So adding all that up, adding that precedent there, why is observability the missing link? What is it about that fine-grained capability that comes with observability that makes it much, much more relevant now?

Charity Majors:

It’s possible now. Storage has gotten cheap enough that you can gather enough detail. Right? I would argue that it’s not so much a missing link as it is a necessary first step. It’s kind of like people are mostly blind, but they’re wearing reading glasses, which allow them to see a fuzzy outline of a shape, right, which is better than nothing. But people are driving down the highway not being able to see very well, which means that a lot of their engineering energy gets sprayed around in the wrong places and a lot of problems that they cause don’t get caught quickly. Right? And observability is putting on the glasses of your actual prescription to let you see in very specific detail, letting you break down to the difference between the specific rows that are exhibiting the error and the baseline, right?

It’s the ability to, instead of trying to build a sandcastle in the sky in your head, reading your code and trying to imagine how it’s all working and how it’s interplaying in production, instead of trying to hold it all in your head, you can literally just look at it in a tool and follow the trail of breadcrumbs. Let me give you an example. With monitoring tools and outages and ops, if you saw a spike on your graphs, a big spike of errors, you would likely look at it in alarm, try to remember if anything had just been done recently, and you might start flipping through other dashboards to see if you see the spike in any of the others. And you’re just trying to form guesses about what’s happening, right? It’s likely informed by the large library of past outages you have in your head, right? Things that have broken before, things that were most likely to be wrong.

Then you go and you look for those things, which is not really scientific debugging so much as it’s heuristics; it’s very human. It’s very rooted in the past. And it’s less and less useful and helpful today because you don’t have monolithic systems failing in the same predictable ways over and over. You have these complex microservices, distributed services, where it’s a new thing every goddamn time. And it’s really annoying, right? Your library of past outages is not as useful to you when it comes to interpreting the spike. So nowadays, if you have observability, you see that big spike, you might go, “Oh, crap, let me go and look at it.” We have a thing in Honeycomb called BubbleUp that will let you do all this sort of automatically at a glance. But if you’re doing it step by step, you might just go, “There’s a big spike. What is it? Break down by endpoint.

“Is it all of the endpoints? No, it looks like it’s just the read endpoints. Okay. Is it all of the read endpoints? No, it’s just the ones that talked to Memcache and MySQL. Was it all of them? No, it’s just the ones with the primaries that are in these availability zones. Is it all of them? No. It’s just the ones with this type of fiber. Is it all of them? No, it’s just the ones where it’s running this particular version. It’s only Memcache and not MySQL. So is it all of them? Oh, no, it’s just the clients that are running this.” You don’t have to have any knowledge of where you’re going to end up, because you just put one foot in front of the other and follow the breadcrumb trail of clues, and it leads you to the solution every time, even if you’ve never seen that problem before, because you have the ability to…

At a very low level, it’s based on a different building block. Instead of metrics, where a request fires off hundreds of metrics while it’s executing but they’re not connected by anything, there’s this thick connective tissue of the request that all the data is aggregated around, and you can break down by high-cardinality dimensions. You can string many of them together in high-dimensionality queries. And you can string them into these incredibly complex and detailed questions. So you can say, “Oh, these rows are the ones that are failing. And these are all of the things that they have in common.”
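The “one foot in front of the other” breakdown she walks through maps naturally onto group-by queries over those wide, per-request rows. A rough sketch of the same motion using pandas on a toy set of events; the column names are invented to mirror her example, and a real dataset would have hundreds of dimensions rather than four.

```python
import pandas as pd

# Each row is one request-scoped "wide event".
events = pd.DataFrame([
    {"endpoint": "/read",  "backend": "memcache", "az": "us-east-1a", "client_version": "2.3", "error": True},
    {"endpoint": "/read",  "backend": "mysql",    "az": "us-east-1b", "client_version": "2.2", "error": False},
    {"endpoint": "/write", "backend": "mysql",    "az": "us-east-1a", "client_version": "2.3", "error": False},
    {"endpoint": "/read",  "backend": "memcache", "az": "us-east-1a", "client_version": "2.3", "error": True},
])

# "Is it all of the endpoints?" -> break the error count down by endpoint,
# then keep slicing by whichever dimension looks suspicious next.
for dimension in ["endpoint", "backend", "az", "client_version"]:
    print(events.groupby(dimension)["error"].sum(), "\n")
```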

Alex Williams:

It speaks to the dimensions of space and time in a much different way where there are unknown unknowns. And how do you observe what is unknown to uncover what’s deep, what’s really invisible?

Charity Majors:

And so the trick is to just throw it in there: anytime you see a detail you think might be interesting at some point, anything about your environment, the parameters that are passed in, anything, you just toss it in there, because it’s almost free. Instead of incurring the cost linearly, it’s almost free to append more detail to the existing rows. So you incentivize developers to capture everything they think might possibly be interesting, so that it’s there in the distant future when they’ve forgotten about it and it turns out to be that one thing that is the missing link.
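In practice, “just toss it in there” means attaching arbitrary key-value detail to whatever request-scoped event or span is already in flight. A small sketch using the OpenTelemetry Python API as one common way to do this; the attribute names and values are made up, and Honeycomb’s own libraries express the same idea through their event objects.

```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def handle_cart_update(user_plan: str, items: list, build_sha: str):
    with tracer.start_as_current_span("cart.update") as span:
        # Any detail that might be interesting later is nearly free to attach
        # to the row that already exists for this request.
        span.set_attribute("app.user_plan", user_plan)        # made-up attribute names
        span.set_attribute("app.cart_item_count", len(items))
        span.set_attribute("app.build_sha", build_sha)
        # ... actual work happens here ...

if __name__ == "__main__":
    handle_cart_update("pro", ["book", "pen"], "deadbeef")
```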

19:17

Alex Williams:

And so that really defines observability, doesn’t it? And I mean, it’s really about that discovery that you made.

Charity Majors:

But I would point out that this is a socio-technical system, right? It’s not just computers, it’s not just people. It is people and computers and the tools that they use to manage them. It’s a socio-technical system, it’s a socio-technical problem, and it will be a socio-technical solution, right? It’s not enough. You don’t just buy an observability tool and bingo, it’s fixed. It’s about making it so that you’re welcoming developers into production. Ops has this long and deserved reputation for masochism and for being kind of assholes about things. And you can understand why we were assholes because we were trying to keep these systems up with very few tools and developers are just breaking shit all the time. Okay. Well, these days, you have very low hope of ever debugging and understanding your system if you didn’t build it, if you’re not in there changing it.

And so as ops people, it’s our job to welcome developers in. Our job has pivoted into more of a service role, empowering engineers to own their own code in production. Because if you aren’t in there looking at it every day, you’re not going to know when something looks weird, and you really have to build up, like you said, that muscle, that intuition and that familiarity, the deep familiarity with your complex system.

Alex Williams:

So when you come into the actual reality of it and building out those technical architectures, how can anyone build a robust technical architecture that provides an observability-driven approach?

Charity Majors:

You start with observability from day one, because while it’s never too late, it’s always easier the earlier you start. You will move faster and with more confidence when you can see what you are doing and when you make that part of your daily practice. Beyond that, the answer is the same as it’s always been. There are some things you can learn from past mistakes, past principles, but every system is unique and you build it step by step, day by day. I would say that for technical leaders who are looking at where to start to try and bring their teams into the modern era, look at the DORA report and the four metrics. Measure your team. Just by the act of measuring something and making it visible and prominent, people will start to bend their behavior toward making those metrics better.

And those metrics are things like the time between when you write the code and when it goes into production, time to recover from outages, and so forth. The better your metrics get, the more time you’ve reclaimed. And it can turn into this beautiful virtuous circle, where the more time you reclaim from technical debt, from babysitting your systems and the drudgery of shipping code, the more time you have to invest in reclaiming more time. And in computers, if you’re standing still, you’re losing ground, and you can’t afford to be losing ground, because we need to be doing more with less every year. The growth of our systems demands it. And it’s a fun set of problems.
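Measuring the DORA-style metrics she mentions doesn’t require much machinery to start; a rough sketch that computes lead time and time to restore from a couple of toy records (the field names and values are invented, and deployment frequency and change failure rate fall out of the same records).

```python
from datetime import datetime
from statistics import median

deploys = [
    # When the change was merged vs. when it reached production.
    {"merged_at": datetime(2021, 3, 1, 10, 0), "deployed_at": datetime(2021, 3, 1, 10, 9)},
    {"merged_at": datetime(2021, 3, 2, 14, 0), "deployed_at": datetime(2021, 3, 2, 15, 30)},
]
incidents = [
    {"started_at": datetime(2021, 3, 3, 2, 0), "resolved_at": datetime(2021, 3, 3, 2, 40)},
]

lead_times = [(d["deployed_at"] - d["merged_at"]).total_seconds() / 60 for d in deploys]
restore_times = [(i["resolved_at"] - i["started_at"]).total_seconds() / 60 for i in incidents]

print(f"median lead time: {median(lead_times):.0f} min")
print(f"median time to restore: {median(restore_times):.0f} min")
```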

Alex Williams:

So Honeycomb specializes in helping people get there, and you offer counsel to people and you’re citing the DORA report, which is very helpful, but how does Honeycomb help people get there?

Charity Majors:

We have what I would say is one of only two observability tools available in the market today. It’s kind of unfortunate that everyone’s just jumping on the bandwagon going, “We do observability,” because they don’t, and there is a big technical difference. But what we provide, what we do, it’s funny. I think our customers have more credibility when they talk about this than I do, because every founder loves their ugly baby so hard. And I love my ugly baby, but our customers tell us that they’re able to delete 90% of their paging alerts when they move from a blunt monitoring approach to an SLO-guided approach, where their pain is aligned with their customers’ pain. We provide tooling that lets developers see the impact of their code as they’re releasing it, which incentivizes them to look harder while they’re releasing it. There’s nothing more frustrating than putting the energy in to look for something and not being able to find your answer. And I feel like what we give is the parenting tools for developers to really own and parent their code in production.

Alex Williams:

I like that, the parenting tools that allow them to use these capabilities in production. And I do hear a lot about observability, and I hear people talking about their observability capabilities. What is it that is different about the way that, for instance, a traditional monitoring technology company might approach observability versus the approach that you’re talking about?

Charity Majors:

Yeah. Well, it’s funny, because there are logging companies. There are monitoring and metrics companies. There are APM companies. And they’re all kind of racing to make their products look exactly like Honeycomb looks today. That’s where they’re trying to get to, faster than we can build up our business side to look like theirs looks today. So it’s a fun arms race, right? At an implementation level, part of it is about how you gather the data. It needs to be aggregated around that request, not split up into hundreds of metrics that you can’t then correlate or connect together. So when your request enters a service, we initialize an empty Honeycomb event and pre-populate it with everything we know or can infer about it. And then you can stuff in anything you think is valuable while it’s executing, and at the end, when it’s ready to exit or error, it fires it off to us in this one arbitrarily wide structured data blob.
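The lifecycle she describes, one event per request, opened on entry, enriched during execution, sent on exit or error, can be sketched without reference to any particular SDK. The Event class, field names, and values below are hypothetical stand-ins for whatever client library actually ships the blob.

```python
import json
import time

class Event:
    """Hypothetical stand-in for a request-scoped wide event."""
    def __init__(self, **initial_fields):
        self.fields = dict(initial_fields)

    def add(self, **fields):
        self.fields.update(fields)

    def send(self):
        # A real client would batch and ship this to a backend;
        # here we just print the arbitrarily wide structured blob.
        print(json.dumps(self.fields))

def handle_request(path: str, user_id: str):
    start = time.monotonic()
    # Initialize an empty event on entry and pre-populate what we already know.
    ev = Event(path=path, user_id=user_id, service="checkout", build_sha="abc123")
    try:
        ev.add(cart_items=3, payment_provider="stripe")  # stuff in anything valuable mid-flight
        return "ok"
    except Exception as exc:
        ev.add(error=str(exc))
        raise
    finally:
        # On exit or error, fire off the single wide blob.
        ev.add(duration_ms=round((time.monotonic() - start) * 1000, 2))
        ev.send()

if __name__ == "__main__":
    handle_request("/checkout", "user-42")
```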

25:35

Usually, a maturely instrumented service will have 300 or 400 dimensions per request. And that’s part of the secret sauce too: it’s stored in these arbitrarily wide structured data blobs, and this gives you this really rich power to query very quickly. Our 90th-percentile query time is under a second, so you can just query quickly and put one foot in front of the other, going, “Oh, but this and this and this and this,” instead of firing off a query, hoping it was indexed on the right things, and then going to the bathroom or getting a drink, right? That breaks your flow; that breaks your debugging. So, part of it’s about the speed, part of it’s how you gather the data, part of it’s how you store the data. And I’ve written a bunch of blog posts about what defines observability, and people can find them out there if they’re curious; those point, in very rich detail, to the difference between us and others. But observability is, at the end of the day, an emergent socio-technical property.

And if your teams aren’t using it, you don’t have observability. Right? So part of it’s about the usability and the friendliness too. I’ve talked about how ops teams need to get over ourselves and let engineers into production, and then engineers need to redevelop their sense of curiosity for how their code is actually functioning. And this isn’t usually hard, right? It’s been drummed into many software engineers not to go look at production because there’ll be dragons. But we got into engineering because we’re curious people and because we loved it, and everybody loves to do a good job. Right? And I feel like it’s very easy to motivate engineers to care again if you give them the tools to actually see the consequences and the effects of what they’re doing in production. So, it’s partly a tooling thing. It’s partly a social best-practices thing. It’s partly a will-to-do-it thing. But I do think that it becomes less and less possible to build and interact with your complex systems every year without it.

Alex Williams:  

So it makes me think of just how we live every day. And for instance, now it’s a time of pandemic and you have to establish routines for yourself to make sure that you don’t just get defeated by it or defeated by yourself really. You talk a lot about muscle memory and the discipline of just being able to set a routine, set a process, and set almost just your own workflow.

Charity Majors:

Yeah, yeah. This is why I started Honeycomb after being at Facebook and at Parse and leaving and just going, “Oh shit, I don’t know how to engineer anymore without some of these tools.” The best engineers I’ve ever worked with were ones that would keep two buffers open, one with their code and one with the tooling where they were watching it run in production, and every day they’d be in there. They’d be poking through, looking at the stuff that they were doing. Well, we’ve tried to do this in Honeycomb, too, by letting people put little triggers on queries. So if you’re working on an endpoint or data storage or something, you can just be like, “Ping me in Slack during the day if something weird happens here, or if this user does something funny, or if I’m trying to reproduce a problem and I can’t see it. I want to know if it happens.”

So any query that you can compose, you just put a little watch on it to poke you because I feel like it’s like putting yourself in constant conversation with your code in production while users are using it. There is no substitute. I feel like we kind of have to… So yeah, the routines, I feel like the blunt one is just, you keep it up, you put it in your eyesight every day. So you’re watching the things that you’re doing become manifest in production. And it’s different, depending on what kind of engineer you are, but that combined with the muscle memory of, I’ve merged to master. This is why I think it’s so important for people to automate everything that happens between when you’ve merged your code and when it gets deployed to production, all of the tests, all the deployment, everything.

If you can get it down so that people can just expect that within five or 10 minutes after they merge, their code will be live with no human intervention necessary, I think that people will be shocked at the changes that will be wrought in your team, because that’s a short enough amount of time that you’re not going to lose it in your head. Right? You go get a drink, you come back, and now it’s live. You predicted it. You expected it. So you can develop that muscle memory of going to look at it, right? If it’s a variable amount of time, if it’s hours, if it’s days, it’s really hard for you to bake in that expectation of, after you’ve merged it, then you go look at it, right? It needs to be this tight feedback loop, because looking at it while it’s fresh in your mind is the best way that you’re ever going to be able to find those subtle problems.

30:40

Alex Williams:

So just in conclusion, what have you worked on yourself to become a better developer? What is it that you build into your practices that you can share with our listeners out there about the ways that you’re using observability, the way that you’re using these practices you talk about to make yourself better every day? And then how do you impart that upon your team?

Charity Majors:

Yeah. One of the most fun things lately has been watching some of the creative uses that we’ve come up with for tracing. Because of the way Honeycomb instruments traces, we’re agnostic about… It’s just visualizing things in time. That’s it. And so we’ve instrumented our build pipeline as a trace, so we can see which tests are slow. Right? Sometimes our customers will instrument particular things from end to end that they’re interested in. And just the number of uses that have come up for tracing is super fun to watch.
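Since tracing is “just visualizing things in time,” the same span machinery can wrap build steps as easily as requests. A rough sketch using the OpenTelemetry Python API; the step names and placeholder commands are invented, and a real setup would also configure an exporter pointing at whatever tracing backend is in use.

```python
import subprocess
from opentelemetry import trace

tracer = trace.get_tracer("build-pipeline")

def run_step(name: str, cmd: list[str]):
    # Each build step becomes a child span, so slow tests show up as long bars.
    with tracer.start_as_current_span(name) as span:
        result = subprocess.run(cmd, capture_output=True)
        span.set_attribute("build.exit_code", result.returncode)

def build():
    with tracer.start_as_current_span("build"):
        run_step("lint", ["echo", "lint"])          # placeholder commands
        run_step("unit-tests", ["echo", "pytest"])
        run_step("package", ["echo", "package"])

if __name__ == "__main__":
    build()
```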

Alex Williams:

Hmm. Hmm.

Charity Majors:      

We love solving problems. Right? And you give engineers a powerful tool and it’s not hard to get them off and running.

Alex Williams:

So how has that visual representation helped you? What has it done for you?

Charity Majors:

So I come from ops more than from the development side, and the visual representation for me has been so powerful, because I always think I know what’s out there, but often I’m surprised. And forcing yourself to get used to relying on the tool as your source of truth, instead of your head as the source of truth, is amazing. It is so good for you and your team, because what it does is it pulls all the things that I know, all the things that I’m an expert in, out of my head and into a place where everyone has equal access to them. If someone wants to know how I debugged that query last time I was on call, they can go and look at how I did it. Right? So they don’t have to wake me up in the middle of the night. Or if I’m wondering how Christine did something that she’s an expert in, I don’t have to wake her up.

I just break it down: “Oh, the last time Christine was on call, like two weeks ago, it was 2:00 PM on a Thursday. What was she doing?” I can retrieve what was in her brain and try it myself. And I feel like this is why we’ve never built for individuals. We build for teams, because when you’re building these large distributed systems, part of it is locked up in your head, the part that you’re building intimately, but you’re responsible for the whole thing. And all the other parts are locked up in other people’s heads. Right? And so building somewhere for you to pull that out of each of your heads, so you can do what you do best and let other people rely on the stuff that you do because they can look at it in a tool, is a game-changer.

Alex Williams:

Well, that’s a perfect way to end this show, I think, as thinking about that game-changer and the way that you think about your own work and the tracing capabilities and how that allows you to really see the work of your coworkers as well, which really is critical now in these systems that are so distributed, so complex and just full of unknown unknowns.

Charity Majors:

Yeah. Good summary.

Alex Williams:

Charity, thank you so much for taking the time to talk today. I’ve really enjoyed our conversation.

Charity Majors:

Thanks so much for having me, Alex.

Alex Williams:

Honeycomb provides observability for all software engineering teams to learn, debug, and improve production systems, to delight end users and eliminate toil. With Honeycomb, developers code with confidence, operator efficiency goes up, quality of life improves, and the business grows.

