The Future of Observability

In 2018, Peter Alvaro joined us at o11ycon to deliver a visionary and thought-provoking talk around the problems we had yet to solve. In retrospect, he ended up painting a picture of the problems that ultimately defined a lot of the work that was done in the observability space for the next three years. In this session, Charity and Christine join Peter for a chat that digs into where observability might be headed for the next three years.

Transcript

Charity Majors [CTO and Co-founder|Honeycomb]:

Well, hello. Hi, Christine. Hi, Peter.

Christine Yen [CEO and Co-founder|Honeycomb]:

Hi, Peter.

Charity Majors:

It’s so fun to be back with you. The last time I saw your face was at the very first o11ycon three years ago, if you can believe it. So many people came up and told me how amazing your talk was and that it was their favorite one. We were like, Whew! We invited an academic, and it went well. Yes! And by the way, congratulations from all of us on your tenure.

Peter Alvaro [Assistant Professor of Computer Science | UC Santa Cruz]:

Thank you so much! It’s been such a long time coming, but I couldn’t be more delighted about it.

Christine Yen:

One of the things we did to prepare for this talk was go back and rewatch your talk from three years ago. One of the phrases stood out to me as sort of being evergreen and also something that we’re trying to chip away at. You said something about a faraway land like production.

I love that phrase because for so much of Charity and I working together, our relationship has been about, Charity is this ops persona, where production is her world and she speaks the language and everything. And I’m a developer where I live in dev and write code that gets put into production.

Charity Majors:

When we were at Parse, the seating charts were literally … Kevin seated us by how close we were to production. I was here at one wall. We went backend, frontend, and there was Christine literally on the other wall, and we almost never spoke to each other.

Christine Yen:

It felt like a different land. They used words I didn’t really know how to work with. I was over here writing my tests and my code, and they were there talking about, you know, write throughput and database uptime, and it did feel like a foreign land. So I loved that phrasing.

Charity Majors:

And yet it can’t be. Right? And I feel like part of what we’re out to do is to help raise a new generation of engineers who are production native. Where it’s not a faraway land for them, where it’s where they live. Because guess what, it’s where your users live, right?

Peter Alvaro:

I would really love to better understand how we can realize this vision. Before I was an academic, I did work in the industry. I came from ops. I was DBA, and then I was an operations person before I clawed my way into dev. I clawed my way into dev, I didn’t want to be an ops person. I didn’t want to be on call. But I always said when I was a developer that I never forgot where I came from, you know what I mean? In those days, there was a big wall between those two, and, obviously, stuff got thrown over it.

There’s a lot of interesting academic work. I’m thinking about some of the work from my colleague on tooling to help production operations people deal with bugs introduced potentially by devs. So debugging, for example, configuration.

He made a name for himself emphasizing how different these two worlds were. The devs can’t debug production config issues because the devs are not the people setting the values in the production config. The ops people are. But the ops people don’t know the code well enough to understand the consequences of a missetting in those configuration files. So he built a whole bunch of tooling so that, if there’s a bug in the config files, the system is forced to crash fast, and they can enlist the help of devs.

4:07

Charity Majors:

This is the whole thing about config is code. Your config is code. You can’t separate the two. Increasingly, even five years ago, you would hear devs going, “Oh, I don’t want to be on call. That’s why I’m a dev.”

Now they are on call. Over the past five years, it’s become a given. If you write code for production, you should be on call for that code. This has been a sea change. It’s both good and bad, right? It’s good because increasingly, if you didn’t write the code, you have no hope of maintaining it in production. Ops people could no longer black box the shit. You just can’t.

I feel like a lot of teams then are just using this as an excuse to make devs as miserable as ops people have been. Granted, we have a big problem with masochism, right? There’s a grenade! I’ll fall on it!

But the point is not to make everyone as miserable as ops people. The point is that this is the only way to make systems better. This is the only way to make systems not that terrible. If you’re saying devs need to be on call, then your management has to say equally as much that it’s not going to suck, it’s not going to interrupt you.

I think it’s reasonable to ask them to wake up to fix their code once or twice a year. More than that, it’s just abusive.

Peter Alvaro:

I feel really ambivalent about breaking down the wall, though. I think as systems get larger and get more complicated, we’re going to need some modicum of modularity. There’s got to be boundaries of what somebody is an expert in. And there is something nice about the vision where SREs and ops people are experts in observability tools: they don’t look inside the box, they look outside the box, they understand how the boxes are composed and work together at a higher level. The devs, by contrast, are inside the box, right? It looks like abstraction in programming. But, unfortunately, I don’t think it works. I think the kinds of architectures of, you know, the large-scale web-based companies …

Charity Majors:

They’re too porous.

6:06

Peter Alvaro:

Not just in e-commerce. They’re porous and they’re just gigantic. I can’t understand how my code is going to behave without understanding everything it depends on and everything that depends on it. At any given time, no single person might even be able to enumerate all the potential dependencies of a service that you’ve deployed. As systems get more complicated, I think the only way to see what code does is to see what it does out in production. It’s not as though we have staging areas that are faithful, that will show all the ways an API will be exercised.

Charity Majors:

You have to.

Peter Alvaro:

That’s what would appear to be required for classic development and debugging. I’m thinking about our conversation from last week, where I was making this argument that debugging and sort of incident response are two totally different things, despite the fact that you use similar parts of your brain. You’re reasoning differentially, but the labor of the programmer finding a root cause versus the labor of the SRE relieving pressure is different. You push back and were like, no, they’re the same, and they’re going to have to be the same. Over the last week, I think I’ve come around.

Charity Majors:

Well, they are different, in a way, but they’re not so different in practice. One of the things I often say about observability is that it’s not about debugging the code. It’s about finding where in the system the code is that you need to debug. Right? Which is kind of the hardest part.

Peter Alvaro:

Yeah. Localization.

Christine Yen:

One of the things … again, the foreignness, there’s a real language that developers speak, and it’s the language of our tests, it’s the language of the code, it’s the business logic, it’s the pieces of code that you go back to and you’re like, how can I construct a test case to reproduce this thing I’m seeing in the tool?

I had this huge light bulb moment the other day, a while ago, when I realized that one of the reasons why I struggled so much with the tools that Charity’s team was using and handing to devs was that they were all tools that spoke ops language. All CPU and throughput. And I was sitting there like, Well, how do I map this back to a test case so I can fix the thing that’s causing the problem?

Peter Alvaro:

Don’t you think that traces are potentially the lingua franca for connecting those two worlds?

Christine Yen:

Traces are a structure. Traces still have to handle all the high-cardinality data that we talk about. When you think about the values that go into test cases, they are, by nature, high-cardinality values. They’re going to be user types. They’re going to be blog posts with zero comments versus 500 comments. These are the sorts of things you’re throwing in there, and traces play a big part. Traces, I think, help developers feel like production looks the same.

Charity Majors:

For a long time, ops people have been the translation layer between developers who are writing shit and the low-level underlying system components. I think what we’re seeing is increasingly you can separate orange and Cheetos ….I had Cheetos last night.

Devs shouldn’t have to care about resources at a very low level. They need to know if they shipped a change and it tripled the memory usage, sure. Do they need to know all the stuff under /proc? All the different types of memory usage? No, they shouldn’t have to. Right? That should be something that Amazon deals with or whoever is provisioning your infrastructure deals with. That’s the true native ops use case that should be increasingly isolated because what you should care about as a dev is: Can my user execute my code from end to end in a reasonable amount of time?

Christine Yen:

The first thing that you said. What I care about is my deploy. That’s a thing that makes sense in my world. Everything else is, “Hey, Charity, my deploy caused this. Let’s work together to figure out what this means and what to do about the memory increase.”

Charity Majors:

But translating, something that Honeycomb does that I think other companies are starting to come around to is you should only have to deal with your code and your systems in terms of endpoints and variables and functions and this higher order of things.

Teams that have SREs now shouldn’t have to have dashboards of all this low-level hardware shit because it should be abstracted away enough so that, if this isn’t working, you fail it and you try another one. Even SREs should be able to manage it at a much higher level now.

I think that the traditional metrics use cases are increasingly the domain, and they will always be the domain, of whoever is dealing with lower-level infrastructure stuff, but that’s from the perspective of the infrastructure itself. Am I healthy? Am I accepting connections? How is my provisioning? Do I need to provision for capacity? That’s a separate concern from: Is my code executing? Are my users happy? And I do think you increasingly separate them.

11:24

Peter Alvaro:

I agree. Christine, you were drawing a distinction between the quantifiable system measures that the old-fashioned ops people used to care about versus the maybe quantifiable application measures. The reason I mention tracing is that tracing is a story that connects system measures and application measures.

It’s like, tell me one user or one request story, and why did this take longer than expected? That answer is going to involve reasoning about how the application code uses resources and also about whether the capacity was there and the resources were adequately provisioned.

And I agree with you, Charity. Although again, I only made brief sojourns in the industry. I do think SRE teams, at the end of the day, it’s the app that everybody cares about because it’s the app that the customers interact with. The capacity has to be there, and somebody has to make sure the capacity is there, but SREs are looking at app-level metrics as well.

Charity Majors:

The need for ops and SREs is not going away by any means, it’s just that we’re more like consultants and high-level experts in here’s how you do these things the right way.

I like the model that a lot of companies have taken, which is that if you’re a dev team, you’re building to spec, right? When you’ve reached the spec, maybe the SREs will take it over and run it for you or be in the rotation with you or something like that. But they’re not going to get involved until you’ve instrumented it and made it restart cleanly, and you’re using the golden path that we’ve invested in as a company and that sort of thing.

Peter Alvaro:

You mentioned dashboards. Can I ask a controversial question?

Charity Majors:

Absolutely.

Peter Alvaro:

Are dashboards good for anything? You were talking before we started about what was observability, what is now, and where is it going, and I’m inclined to say it was those dashboards. Lots of money spent on big things. The current state of the art is these quite sophisticated UIs that are way better than a dashboard because they’re interactive. But they’re saying there’s one view of the world, and I think the future is going to be like querying these signals in rich ways. It’s like, I want to go out and do exploratory analysis.

Charity Majors:

I swear to God we did not prepare this. We did not prepare this. I will often say that every dashboard is just an artifact of some past failure. We figured it out. We created a dashboard. We’re like, we’re going to find it immediately next time. And now we’ve got this graveyard of tens of thousands of dashboards, many that are no longer receiving data.

Every time you have a dashboard, you’re jumping to the answer and you’re looking for evidence that you’re right. You’re not actually exploring or trying to ask the question and figure out the new current state of things. You’re just like: Was it that? Was it that? Was it that?

It stops your mind from debugging and actually thinking about the problem in a systematic way.

Peter Alvaro:

It gives you the illusion of witnessing all the signals.

Charity Majors:

Like God. I see everything. Single pane of glass, isn’t that what they say? Is that helpful?

Christine Yen:

I will offer a contrasting opinion, which is that, like all things, in moderation. Use sparingly. It’s good to have entry points.

Charity Majors:

Entry points.

Christine Yen:

There should be jumping-off points where you have things your users care about. This speaks to our philosophy in SLOs. But there should be a small set of things that are user impacting that then are used as jumping-off points into that investigative process. The problem is that most people have taken the dashboard model and applied it to all the failures.

Charity Majors:

Because it’s the only thing that they’ve had. It’s a very compelling, fresh user experience if you’re like: I ran one thing, now I have the world opening up to me, all these dashboards. It’s not until you get into the thick of things that you’re like, but what are they telling me? I’ve got many, many dashboards. What do they mean?

Because any computer can tell you that there’s a spike or something, but only humans can impose meaning on it. Maybe the spike was good. Maybe it was desired. Maybe it was expected. Maybe it was anticipated.

Peter Alvaro:

But the spike itself is an artifact, it’s not an interesting thing to stare at. You want the context and you maybe want to compare it to other things, but staring at two graphs and understanding how they’re related is not something that our brains are that good at.

Charity Majors:

Correct.

Peter Alvaro:

If I have to look at two pictures to understand something, I’m already screwing up.

16:00

Charity Majors:

I would agree. Two of the things that helped Honeycomb reach product-market fit were doing the Beelines, so you just added a library and it auto-instrumented a bunch of stuff, and adding the APM home that landed you with the same three graphs you see everywhere: latency, request rate, and errors. Because it gave people a comfortable place to start from instead of landing them in an open query interface where they’re like: What am I doing here? What am I looking for? It was terrifying.

I think there are necessary jumping-off points, but I feel like we in ops need to stop pattern matching and stop leaning on our past scars and our library of past outages, which make us feel like wizards, right? Like, I know what it is. It’s MySQL.

We need to be more in the mindset of debugging and iterating and being in a state of flow and understanding our system step by step, much more like stepping through a GDB output or something like that that devs do.

Peter Alvaro:

Not a sequential stepwise debugging in all likelihood because we’re in a distributed system, and we don’t have total orders, but I agree. This piece-wise Q&A, in the same way when a data scientist gets a new dataset, first we’re going to start with really shallow queries. Min/max, what is the shape of the data? Then you start refining the queries, going back and forth. Then the final query is the query that says: Tell me the story of the system in terms of the system metrics, the app metrics, the traces, before this incident, and explain what all the good executions had in common that the bad executions happening during this incident don’t have in common, whether that’s labels, spans longer than we expected, or actual structural differences in traces. Like, that’s kind of where I want to go.

And then when we get there, the distinction between debugging and localization will go away. Because localization was just a shallow query that you had only a few minutes to do because you were hemorrhaging money versus root cause analysis, if you will forgive the term, which is the same thing, with more iterations going deeper. Right? Because an SRE doesn’t want the root cause. They want the closest cause, the lever I can pull.
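The "what did the bad executions have in common that the good ones did not" query described above can be sketched in a few lines. This is only an illustrative sketch, not how any particular tool implements it: the function name `differential`, the `ignore` parameter, and the event fields (`endpoint`, `build`, `region`, `status`) are all hypothetical, and it assumes each execution has already been flattened into a dictionary of hashable values.

```python
from collections import Counter


def differential(events, is_bad, ignore=()):
    """Rank (field, value) pairs by how much more common they are in bad events.

    Score = fraction of bad events carrying the pair
            minus fraction of good events carrying it.
    Returns a list of (score, (field, value)), highest score first.
    """
    bad = [e for e in events if is_bad(e)]
    good = [e for e in events if not is_bad(e)]
    if not bad or not good:
        return []  # nothing to compare against

    def count(group):
        return Counter(
            (k, v) for e in group for k, v in e.items() if k not in ignore
        )

    bad_counts, good_counts = count(bad), count(good)
    scores = [
        (bad_counts[pair] / len(bad) - good_counts[pair] / len(good), pair)
        for pair in set(bad_counts) | set(good_counts)
    ]
    scores.sort(key=lambda s: s[0], reverse=True)
    return scores


# Hypothetical events: one dictionary per request.
events = [
    {"endpoint": "/export", "build": "v41", "region": "us-east", "status": 200},
    {"endpoint": "/export", "build": "v42", "region": "us-east", "status": 500},
    {"endpoint": "/home", "build": "v42", "region": "eu-west", "status": 500},
    {"endpoint": "/home", "build": "v41", "region": "eu-west", "status": 200},
    {"endpoint": "/home", "build": "v41", "region": "us-east", "status": 200},
]

# "status" defines what bad means, so it is excluded from the ranking.
ranked = differential(events, lambda e: e["status"] >= 500, ignore={"status"})
```

In this made-up data every failing request ran build `v42` and no healthy request did, so that pair ranks first. A shallow version of the same query (one field, a few minutes of data) is the localization step; running it over more fields and more history is the deeper iteration.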

Charity Majors:

Get me back to good as fast as you can, and then we’ll figure it out.

Peter Alvaro:

And then we’ll sort it out. Exactly.

Christine Yen:

I have a question for both of you about the far future. In the first chunk of our call, we were talking about the sharp boundary that we had between ops and dev, and us moving to try to blur that boundary a little bit. Ten years from now, what new boundaries are going to be in place, and are we trying to build them or blur them?

Charity Majors:

Well, I think one answer to this is, you know, I think the DevOps movement, if you look at it from a broad, big picture, it was a split that never really should’ve happened, but it did because of specialization. And the first wave of DevOps was about ops people, you must learn how to write code. The second half, it was like, software engineers, it’s your turn. Learn to build and operate your services. Learn to build operable services, learn to instrument, and learn to understand your stuff.

When I think of the most powerful engineers on the planet right now, they’re the people who are sitting at that nexus. They’re the people who can write the code, who could write the big systems, who could also jump in and understand and debug them and operate them. Those people have superpowers. And I think that you see a lot of aspiration on behalf of engineers who come from both sides to reach that point. I think that’s where we’re at right now. We’re trying to help accelerate that. We’re trying to help…the DORA metrics.

If you look at the DORA metrics, you see the top 50% are getting better; the bottom 50% are getting worse.

Peter Alvaro:

Yeah.

Charity Majors:

I think over the next 5-10 years, that’s going to accelerate. I do think we need to figure out how to reach that bottom 50%. I don’t think this is so much a technical thing, as I think we need to help them leapfrog a decade or so of fear.

I also think the specialization that’s going to occur more in the future is going to be less about dev vs. ops and more about types of industries, almost. Like, the framework that we’re often implicitly talking about is big web applications, right? When we’re talking about something by default, it’s that.

People who do embedded programming and medical devices and mobile and stuff, they have a very different set of starting points and workflows. I think that dev and ops look very different there. I think those boundaries will be maybe harder to cross.

Peter Alvaro:

Charity, you’re an animal. The things you say are so content-rich. Let me see if I can respond to the amazing things you just said. There are two things I definitely want to respond to.

First, yes, as I said before, we have to cut at the joint somewhere. As systems get bigger and have more people contributing to them and are running on larger numbers of computers, we can’t just be omniscient. And so I like this idea of sort of vertical chopping, right? But there’s going to be a different pile of DevOpsy skills that are relevant to different industries and beginning to articulate what those are is a path to the future.

I think that’s great because we have to partition somewhere. This is insane.

Charity Majors:

Right. We can’t just all be everything to everyone.

Peter Alvaro:

I like what you said about the superheroes. Because if you may recall, that was a big theme from my talk years ago, which is we’ve gotten ourselves to this position to where there are these super experts. We don’t really remember or know how they were trained. So we can’t get them to train anyone else.

Charity Majors:

You can’t replicate it.

Peter Alvaro:

In some sense, this is true about all SREs. SREs, they’re apprenticed in. No one knows how to train an SRE. How do you train an SRE? You bring them into the war room. You seat them next to a veteran SRE, and eventually, it leaches into them. You have runbooks, but the thing about runbooks is, who is keeping them up to date? You don’t have the documentation, so you just plop them in.

To be fair, this is how PhDs and professors are trained too. We haven’t figured out how to train them either. There are certain classes of people who were like, oh, these people are super smart, so let’s not try to improve the process of how we create them.

I think, Christine, to answer your question, this isn’t, like, where the industry is going, but I think it’s a crisis the industry needs to solve. It certainly hasn’t gotten any better over the last three years.

Charity Majors:

No.

Peter Alvaro:

Which is how do we train people to be omniscient in this sense to be able to understand the dev and ops side?

Charity Majors:

This is an argument Liz and I have a lot. Well, not an argument. I think we vigorously agree, which is that so many of us who came up from systems land, we crawled in, we scraped our way in by our fingernails. Like, I’m a dropout. I’m a music major dropout. I didn’t even go to high school. And people like me … there are not many avenues into the industry for people like me anymore. It used to be, like, if you could sit there and learn a Linux computer, there was a job for you somewhere. That’s not the case anymore, and I worry about that a lot because it is an apprenticeship industry.

Peter Alvaro:

I was an English major in college. I think about this a lot, Charity, about how, like, I can’t give people advice. People ask me for advice: How do you get into computers? How do you get into tech?

Charity Majors:

I know. They ask me that, too, and I don’t know what to tell them.

Peter Alvaro:

I’m not going to be one of these jerks and tell them, Oh, it worked for me, so you should do what I did, you should do English and then…it’s like… but, yeah, the answer can’t be exceptional people like you, we need to keep finding completely exceptional people.

Charity Majors:

I don’t think I’m an exceptional person. I think there were avenues open for people who were stubborn to work their way in that just are not open anymore. Because if there’s one thing that’s special about me, it’s my persistence and tolerance for pain. But that’s really it.

Christine Yen:

You know the analogy that comes to my mind as you two were talking. People like to bemoan, Kids these days can’t take a computer apart. They have these nice, walled gardens. They have the rib boxes and the cool… I don’t know. Cool kid things.

Peter Alvaro:

That sounded right.

Christine Yen:

Cool kid things. Sorry. Packaged things.

In a way, this is what Charity and I have been pushing against this whole time in our world. We don’t want people to accept that their software is a beautifully packaged thing that only this magical agent can plug in and understand. We want people to be willing to take it apart and look at the guts and put something in here and see what happens.

It’s almost like our reversal of the glossy packaging promise. And, correspondingly, it’s been hard for folks who’d be like, what do you mean? Do I have to do this again? But once people get over that mental hump, one of my favorite phrases I’m going to borrow from another talk I heard a couple of years ago is, it was a new engineer, fairly early in her career, sitting next to an experienced ops person, and she described what she was seeing as leaps of intuition akin to magic.

Peter Alvaro:

Who was this?

Christine Yen:

It was Logan at Monitorama from BuzzFeed. So 2018 or 2019.

Peter Alvaro:

Definitely confirms my biases.

Christine Yen:

It does. And her whole talk was great. It was talking about how to learn, how to build an apprenticeship motion, how to do this transferring of knowledge and expertise.

Charity Majors:

What can be really interesting is the more we lean into this, the harder it is to sell to C levels. Because they really want to hear, “Just give me 10 million bucks, and you’ll never have to understand your system again, no one will ever have to understand your system again.” They want people to be fungible. Of course they do, that’s their job. They want to make sure there are no single points of failure. But, in fact, you can’t. You just honestly can’t.

Our whole mission is leaning into the idea that you have to help humans do what humans are good at, which is attaching meaning to things. You have to center the engineer and make them more powerful and more important in a way instead of less important and replaced by machines that will just magically tell you the right answer.

In my experience, that’s just 100% false.

26:45

Peter Alvaro:

This really is a rough tradeoff space to be running a startup in, I gotta say. I was thinking a moment ago, you were saying it was this auto instrumentation that’s been huge. Certainly, for the tracing community, creative ways to use aspect oriented programming, ramifications, or whatever. It takes away the pain, but the thing is if you take away too much of the pain, you force devs to trace their way through the code. How are you going to understand code that you haven’t implemented? For God’s sake! What a horrible trade-off!

Charity Majors:

You have to do the hybrid model. You have to make them instrument their code some. You can make it easy, you can make it as easy as doing a printf, but that’s as easy as you can make it. You have to make them think about, how is future me going to understand this once it’s deployed? What am I going to need to look for to see if everything is working as planned? You have to bake that expectation into how you write and ship code or they’re not going to get there.
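As a rough illustration of instrumentation that costs about as much effort as a print statement, here is a minimal sketch using only the Python standard library. It is not the Beeline API or any vendor's SDK: the `event` helper, the `export_report` function, and the field names are hypothetical, and it assumes a single JSON line per unit of work is an acceptable output format.

```python
import json
import sys
import time
from contextlib import contextmanager


@contextmanager
def event(name, out=sys.stdout):
    """Collect fields for one unit of work and emit them as one JSON line."""
    fields = {"name": name}
    start = time.monotonic()
    try:
        yield fields  # the caller adds whatever context future-them will need
    except Exception as exc:
        fields["error"] = repr(exc)  # record the failure, then let it propagate
        raise
    finally:
        fields["duration_ms"] = round((time.monotonic() - start) * 1000, 3)
        out.write(json.dumps(fields) + "\n")


def export_report(user_id, rows, out=sys.stdout):
    """Hypothetical handler: each added field is one line, like a print."""
    with event("export_report", out=out) as ev:
        ev["user_id"] = user_id
        ev["row_count"] = len(rows)
        return sum(rows)
```

The design choice worth noting is that the developer decides which fields matter (`user_id`, `row_count`) while the helper handles timing and error capture, which is one way to read the "hybrid" of automatic and hand-written instrumentation.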

Peter Alvaro:

Right. I like the idea that you’re embracing the porous nature of the system, but it’s a lot of work.

Charity Majors:

But the good thing is, these are engineers. Like, we got into this because we loved that dopamine hit of finding something and experimenting and getting the, you know … So many people have had this drummed out of them. They’ve had their tools punish them for being curious and exploring.

We’re really trying to reawaken that joy and love of: Oh, that’s what it was! Oh, I can see it! I don’t have to guess! I can see the user having the problem. I can correlate it with, oh shit, all these other users are having the same problem! I can figure it out and fix it before anyone complains. That is a high that you get them hooked on, and they have a really hard time going back.

Peter Alvaro:

I’d buy that. That’s awesome.

Charity Majors:

Well, it looks like we’re about out of time. Peter, would you like to put on your very fancy, new tenured professor hat and tell us what we’re going to be talking about at o11ycon three years from now?

Peter Alvaro:

Oh my gosh, this is going to be — you’re going to be so disappointed with this answer, but we’re going to be having a really similar conversation three years from now because this movement is new.

Charity Majors:

Yeah.

Peter Alvaro:

I wish I could report on bigger changes over the last three years, but, to some extent, I’m still waiting for tracing to get rolled out at the level of granularity that I need to do my stuff. So much so, I got tired of waiting, and I have some students in my lab writing their own tracers, doing things a little bit differently, and screwing around with hybridizing tracing and fault injection. Because I can’t wait. Industry moves slowly for my tastes, you know.

Charity Majors:

Fair enough.

Peter Alvaro:

I think we’ll have made some progress. We’ll have better tooling. We’ll be asking bigger questions, but I think it’s going to be a very similar conversation to the one we’re having now that we’ll be having in three years.

Charity Majors:

I think we’ve seen more progress than you’ve seen.

Peter Alvaro:

Good.

Charity Majors:

Well, thank you for coming in and talking to us. It’s a delight as always.

Christine Yen:

Thank you for your time.

If you see any typos in this text or have any questions, reach out to team@honeycomb.io.
