Webinar

The State of Observability 2021: Mature Teams Ship Better Code Faster and You Can Too

 

Transcript

Emily Nakashima [VP Engineering|Honeycomb]: 

All right. Good morning, folks. We’re going to give everyone a few more minutes to get dialed in. We’ll get started two minutes after the hour.

James Governor [Analyst & Co-founder|RedMonk]: 

I just realized I have not shouted at my household and told them all to be quiet. 

Emily Nakashima: 

That’s perfect, though. That’s how people will know that this is a live, real webinar and not a pre-recorded thing. 

James Governor: 

There you go. I keep meaning to get one of those “on-air” signs. 

Charity Majors [CTO & Co-founder|Honeycomb]:

The blinky lights.

Emily Nakashima:

All right. It’s two after. So let’s get started. First of all, I want to say thank you so much for joining us today. I know that for those of you on the West Coast, it’s early morning. For James, it’s a late night. I’m glad we could all be here together to talk about this.     

I did just want to start with a couple of really quick housekeeping notes. First of all, we’re recording. Please keep that in mind. Your questions and all those kinds of things, we’ll record them, and we’ll send out the copy of the video from the webinar afterward, so if you want to refer back to anything, it will all be available to you. We do have live captioning. So if you would like to take advantage of that, if you would go into the Zoom UI, you can actually turn that on. That will be available for you. It’s possible to open it in another window as well if you would like to do that. Perfect, Bethanie has put it in the chat to follow along with the live captioning.     

We’re all calling in from our homes around the world. If anyone loses connectivity or anything like that, give us a few moments. Hopefully, we’ll be back momentarily. And we’re doing this live so we can hear your questions and hear from you. So we’ll talk a little bit in the beginning and tell you a little bit about this topic, and Joe will share some of his stories from the trenches with you. Please be generous with your questions. We really want to hear from you and hear what parts of this are interesting to you so we can have an even better conversation. All right. With that, let’s dive in. 

Today, we are talking about the 2021 Observability Maturity Community Research report. So we will tell you a little bit about the report itself and jump into the findings. First of all, to introduce our wonderful panelists here, we’ve got James Governor who’s RedMonk’s Principal Analyst and Co-founder. And then, of course, we have Charity Majors, who is Honeycomb’s CTO and Co-founder. And then we are very lucky to have Joe Thackery who’s a Senior Software Engineer at Eaze. He’s going to tell us about how this has really worked for them in the trenches at Eaze. And I’m Honeycomb’s VP of Engineering. I’ll be moderating today. 

To dive into the report, this just came out a few weeks ago. This is a report conducted by ClearPath Strategies. We had 405 respondents. At a high level, the big picture is that observability adoption is on the rise. Mature teams see significant benefits from observability, which we’ll dig into. We did find that barriers still exist for some less mature teams, which is why I’m so happy to have Joe here today. He can talk a little bit about what that journey has looked like for them. For folks who are less familiar with maturity models, I did want to just talk quickly about what that is and what the value is.

Charity, Honeycomb has been working on its own observability maturity model for a little while. I was hoping you could tell us a little bit about what that tool is and how you expect folks to use it.

Charity Majors:

What the tool is?

Emily Nakashima: 

What is a maturity model?

5:34

Charity Majors:

Oh, yeah. Liz and I started working on this a couple of years ago. Basically, I think of it as a “choose your own adventure.” You should be able to look at yourself, look at the model, and see yourself reflected in it. Like, if you’re weak in one area, you should be able to recognize that and say, oh, these are some steps I can take to become stronger in this area. Because it’s not really a linear path. Everyone starts from a different and sort of unique place, but there are patterns. Right? There are patterns. Some people are really weak when it comes to deploying, but they’re strong when it comes to automation. Some people are really weak when it comes to the on-call stuff but really strong in other areas. I really just think of it as a choose-your-own-adventure sort of thing where you should be able to spot and see your strengths and weaknesses reflected, as well as seeing sort of a map for how to get to the next level in the areas where you’re weak.

James Governor: 

Can I jump in a little bit?

Charity Majors:

Yes. 

James Governor: 

One of the things I like about the approach is that we know it works. It’s good to adopt things that work, I guess. If we think about the impact that the DORA report has made in making the transition from why to how, over time, we hope this research will serve to do that. We’re still kind of in the why phase. Why should we do a thing? 

Charity Majors:

Yeah. 

James Governor: 

But, over time, as you identify those key benefits or metrics, it becomes a how statement. I think that’s the sort of journey that customers need to go on. As a regularly repeated piece of research, we’re currently in the why, but as we get to how, that’s why a report like this can be really valuable in this market space. That was why I was sort of excited to see this.

Charity Majors:

That’s why everybody we’ve talked to has said, Cool. Can you show me a story of someone just like me who has done this before? Or can you connect me with somebody just like me who has done this before? We’re all looking to leverage each other’s growth and what we’ve found, which is good because we can’t each reinvent the wheel from scratch. That would be absolutely exhausting and time-consuming and ridiculous. I think James is right. This is a way to sort of synthesize and derive patterns from how everyone else is doing this and trying to apply it to our own stacks. 

Emily Nakashima: 

I love that. It’s sort of the perfect bridge: it takes the theory and gives you a template for putting it into practice. When we look at the report, we see there’s this maturity distribution, and we do see more teams moving toward that advanced group. But, as we look across the industry, there are still plenty of folks in the novice bucket. 

When we look at folks moving to the right, we see they’re seeing all kinds of benefits. So higher productivity, improvement in code quality, higher end-user satisfaction, and then even better retention of their software development teams. So folks who have these tools find that their teams are more satisfied at work and run into less friction on the job, and it has these positive benefits all around the organization. 

Charity Majors:

And part of what I think is interesting is we also tried to poll people to find out what they meant by “observability.” Because, as you all know, there are a million different definitions out there. And there’s a definition we subscribe to, which is all about the unknown unknowns, et cetera, not the three pillars definition and so forth. And the most interesting thing to me was that the distribution of benefits accrues to the people who define observability more or less the way we do, not in the generic telemetry way, which was really validating. Joe, do you have something to say?

9:45

Joe Thackery [Senior Software Engineer|Eaze]: 

No. Just agreeing. Having the tools in place to answer the unknown unknowns is really valuable for us. We have quite a variety of different patterns of traffic and patterns of services. We don’t know everything about them at all times. So, yeah, the unknown unknowns are really the valuable piece that observability helps answer. 

Charity Majors:

A lot of times, people have accused me: Oh, it’s just a marketing thing. But it’s not. It’s real. It’s kind of a step function in terms of actually being able to understand your systems, and I feel like that’s what we’re starting to see reflected in this maturity model, which is exciting.

Joe Thackery: 

Yeah, logs are great, but if you haven’t put the log statements into your code in advance, anticipating what might go wrong, you’re out of luck when something does go wrong. 

Charity Majors:

And metrics are great, but if you didn’t capture the exact metric that you wanted, you’re screwed because you can’t slice and dice them retroactively.

Emily Nakashima: 

Joe, I’m curious to know if that end-user satisfaction piece has been something that you’ve seen used. Like, how do you think about measuring that, and has that actually been something that’s come out of your observability journey as well? 

Joe Thackery: 

Yeah. We’ve had some wins, definitely, on user satisfaction for the consumer platform. We focused on add-to-carts and checkouts, kind of classic metrics on the business end. So we’ve been able to say, okay, when we improve this code path, we see more add-to-carts and more checkouts. Being able to do that is great for justifying investing more time into improving our observability and keeping that in kind of a virtuous cycle feeding itself.

Emily Nakashima: 

Nice. Fantastic.

Charity Majors:

A lot of the point of this sort of thing is to help give ammunition to our sort of champions who are out there, people who are on the ground who really want to be able to invest time in this or want to buy a tool or whatever but are having a hard time convincing the higher-ups or convincing their team. So I hope this helps arm people with the evidence they need to start investing in this area. 

Emily Nakashima: 

I feel like the benefits stand out really clearly in this report. I really love this set of metrics where you see these higher-performing teams are so much more likely to be able to see the problem when something breaks and then, you know, get to the cause of it more quickly and then be able to immediately identify the solutions. I think, when you look at that last graph, the gap between the advanced and intermediate teams is… 

Charity Majors:

Enormous. 

Emily Nakashima:    

It’s huge. And it’s especially remarkable given that the advanced and intermediate teams are more likely to be more automated. Right? So a lot of the teams on that novice end are remediating the same problems over and over, and it can still be a challenge.

James Governor: 

I found the commonality interesting. Obviously, they’re related, but the answers for getting to know your systems were close to those for being able to find problems and troubleshoot. For the organizations that are spending time using observability tooling and approaches to get to know their systems, there is a correlation with troubleshooting. So it’s not enough to just use the tool for troubleshooting. You also need to use the tool… part of observability is getting to know what you’re running. And I think that’s one of the key changes we’re seeing from a maturity perspective.

 Charity Majors:

Yeah. I mean, we’ve all had the experience of using tools that frustrated our curiosity and did not encourage us to lean in and use them more. The hope is that with observability tooling, you get the dopamine hit: you resolve the problem, you figured it out, and you understood it better. That means you will spend less time understanding the problem because it’s not as alienating.

Emily Nakashima: 

Joe, I would love to hear how much of this resonates with the journey that your team has been on. Did you find that you were able to more quickly diagnose issues? What other kinds of benefits did you see? 

Joe Thackery: 

Yeah. Along the lines of what James was saying, even in advance, we were able to say, Okay, this is not doing what I thought it was designed to do. There are way more requests going out of this initial customer workflow, a path through our website, than we thought there should be. And that kind of gave us opportunities to say, Okay, this will be a problem on our heaviest traffic days of the year if we don’t do something about it now. 

Charity Majors:

That’s super cool. You’re moving a couple of steps away from, oh, I’m getting paged, the site is down, it’s an emergency. You’re spending time getting familiar with your systems, comparing expectation with reality, investigating when they’re different, and warding off problems before your customers even see them. 

15:09

Joe Thackery: 

Our systems were created at a very different time in our industry, and in a different regulatory environment as well. Understanding how they were created years ago and how they’re operating now, versus how we want to be operating in the physical world, gives us a lot of insight. And, of course, when things go wrong, it is great to be able to identify the trouble spots. 

Charity Majors:

Nice. 

Emily Nakashima: 

We’ve also found that one of the other sections of this report really stands out: there are still teams on the novice end of the curve that face barriers to adoption. There are always problems with figuring out how to prioritize this work. There are folks who think their current tools are good enough. There are teams convinced they’ve got other challenges to deal with first. There are folks worried about cost.

James, I would like to hear from you. When you think about people who are weighing this decision, what do you see there in terms of how they’re thinking about how to prioritize observability?

James Governor: 

I think really one of the things we’re seeing is observability in pockets, which is as it should be. It’s certainly very hard to retrofit everything for a different way of working. What we’re seeing is that organizations, certainly on the enterprise side, are observability-curious fairly early in the journey. But they will have some teams that are doing distributed systems and realizing there’s a set of challenges they’re struggling with. Within those pockets, within those specific teams, we’re seeing them want to get better, and I think that’s where they’re quickly realizing there’s actually a bit of a skills gap. 

I think that what we’re seeing, with at least enterprise adoption, is it’s quite different on the SaaS side. People are more mature on this spectrum. They’re basically saying to themselves, Yeah, we would like to have more confidence, and this is going to be part of, I think, what Honeycomb would call production excellence. We’re going to be better at automation. We’re going to be better at testing. We’re going to be doing all of these things. When we’ve done all of those things, observability should and must be part of the mix.     

For high-performing teams, I think one of the things I’m beginning to identify is that, if they’re understanding the interlinkages of observability with the pipeline and their release management, that’s where you realize they’re probably farther along in the journey. But, yeah, lack of skills is definitely an issue.

Charity Majors:

I like what Blake just said in his comment to us: “I find the prioritization struggle kind of funny because, how can we effectively prioritize without good observability?” This is true. People are used to flying super blind. They’re used to, like, Wee! Off into the abyss!

Emily Nakashima: 

Joe, I’d love to hear a little bit about what this journey looked like at Eaze and how people became convinced this was something that they wanted to invest in. In the early days, it was just metrics and logs, right? What convinced your team that they needed to take the leap? 

Joe Thackery: 

Unfortunately, it was painful outages, basically, on high-profile days, the biggest business days of the year. We realized we needed to rethink a lot of aspects of how we approached our systems, and one of them was getting better at observability: understanding what’s happening when a customer opens up our menu, or when one of our operations staff is managing orders as they come in and drivers are delivering them. It was kind of a come-to-Jesus moment where we had to do something. Fortunately, we had some good leadership in place that knew the value of observability and brought us to Honeycomb.

Emily Nakashima: 

Was there that sort of three pillars versus holistic observability debate at your company? Did people argue for just adding in an additional pillar, or was it fairly clear which way you wanted to go?

19:50

Joe Thackery: 

I think at first there was some reluctance to take on the extra work, basically, of introducing the implementation into our systems. But once we saw some of the initial proof-of-concept results, I started to say, okay. I’m a visual person, so it helped me, at least, see what was going on under the hood, compared to just a lot of text scrolling by in logs or dashboards. Those are great, but by then I already know the system is broken; a dashboard is just helping me figure out exactly what is already broken. We started moving from this inherently reactive standpoint to a much more proactive one. That was something we had been talking about for a while, this kind of shift. So it aligned with adding observability tooling.

Emily Nakashima: 

I love that. And your team had sort of a smart strategy for not diving straight into having to instrument every single thing out of the box. What did your instrumentation journey look like? 

Joe Thackery: 

Yeah. We do have services in other languages and frameworks as well, but the Honeycomb Beeline package basically gives you a lot of the information for free. Just install it, add your API keys, and you’re getting valuable information right out of the box. That was kind of our proof of concept with our key systems: okay, we can start tracing our requests all the way down. People started buying into it. We then went through a decent amount of work to do some custom instrumentation in our core systems once we had proven its worth. We were able to get data to resolve issues and identify trouble spots even without that. 
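For illustration, here is a minimal sketch of the kind of out-of-the-box setup Joe describes, using the Python Beeline as an example. The dataset, service name, and the traced function are hypothetical, and Eaze’s actual services may be in other languages; the point is that initialization is small, and custom instrumentation can be layered on later.

```python
import os
import beeline

# One-time setup: point the Beeline at your Honeycomb environment.
beeline.init(
    writekey=os.environ["HONEYCOMB_API_KEY"],  # Honeycomb API key from the environment
    dataset="menu-service",                    # hypothetical dataset name
    service_name="menu-service",               # hypothetical service name
)

# Once the out-of-the-box data has proven its worth, layer on custom instrumentation.
@beeline.traced(name="load_menu")              # wraps this call in a span
def load_menu(store_id):
    beeline.add_context_field("store_id", store_id)  # queryable custom field
    return {"store_id": store_id, "items": []}        # placeholder business logic

if __name__ == "__main__":
    load_menu("store-123")
    beeline.close()  # flush any pending events before exit
```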

Emily Nakashima: 

Yeah. That’s such a good approach. Start with what you can get for free and add on in an iterative way.

James, I know you’ve been following along with the great three pillars debate out there. I’m curious to know what you’ve seen. Are folks moving away from that model? How do we conceptualize it?

James Governor: 

That’s a great question. You know, I think the industry has work to do. You’ve been doing some very useful reframing. I think there’s an aspect of work that has to be done, and had to be done, certainly by incumbents. We can’t be in an environment where we’re constantly context switching. That makes life a whole lot harder. And so certainly without investigating and bringing together data across these pillars, we’re going to be struggling. So the investments being made across the industry, probably following your lead in some respects, are, frankly, welcome. These are problems everyone is going to have, and the industry is going to have to move forward. But from a definitional standpoint, I tend to look at changes in behavior to try and understand what observability means. It kind of gets back to the earlier slide where I was saying: observing and understanding your system so that you can do better troubleshooting, with the knowledge that things will break, but if you apply a set of engineering disciplines, hopefully, you will be able to identify and fix those breakages.

Everyone has sort of talked about shifting testing left, but in a world of shifting, frankly, we’re also shifting testing right. We’re shifting testing into production. And the orgs that understand that, that’s definitional to me about the value and what we’re doing with observability. I don’t think the three pillars lack value, but I think, for me, the definition is about problem-solving and probably, frankly, maturity in general about how you’re developing apps. 

I mentioned the DORA report. You’re going to be thinking about those sorts of metrics. You’re going to understand that these distributed systems are going to be nondeterministic. So that’s the problem you’re solving. And it’s certainly not, Oh, no, this is great because I have distributed tracing now mixed with security logs from three weeks ago. Those are entirely different sets of problems. I think that’s the point: it’s about solving a specific set of problems rather than, oh, here is a bunch of data. 

Emily Nakashima: 

I like that framework. Yeah, that makes a lot of sense. Joe, one of the things I found interesting about your journey is you started with the production stuff, and then you actually moved to instrumenting your CI pipeline, I believe, understanding your builds better. What made you decide that was the next step for you? 

25:25

Joe Thackery: 

Good question. I think when we did that, we were completely revamping our CI and deploy workflows. The idea was to get in at the start. We knew the old deploys were slow and kind of a black box in some places. So we wanted to have more answers and be able to say, at least, this is the slow part of our deploy. We wanted to track them over time. We run on GitHub Actions now, so we wanted to have a good understanding of this new system we’re using and of the new tools we were writing for deploys. We were already using Honeycomb and saw the value of it. So, yeah, it just kind of made sense. 
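As a rough illustration of the kind of build and deploy instrumentation Joe describes, here is a hedged sketch that emits one Honeycomb event per deploy step from a CI job using the libhoney Python SDK. The dataset and step names are hypothetical (the only environment variables assumed beyond the API key are ones GitHub Actions provides), and this is not Eaze’s actual tooling; Honeycomb also maintains a buildevents tool aimed at this use case.

```python
import os
import time
import libhoney

# Point libhoney at a dataset dedicated to build/deploy telemetry.
libhoney.init(
    writekey=os.environ["HONEYCOMB_API_KEY"],
    dataset="ci-deploys",  # hypothetical dataset name
)

def timed_step(name, fn):
    """Run one deploy step and send its duration and outcome as an event."""
    start = time.time()
    ok = True
    try:
        fn()
    except Exception:
        ok = False
        raise
    finally:
        ev = libhoney.new_event()
        ev.add({
            "step": name,
            "duration_ms": (time.time() - start) * 1000,
            "success": ok,
            "git_sha": os.environ.get("GITHUB_SHA", "unknown"),      # set by GitHub Actions
            "workflow": os.environ.get("GITHUB_WORKFLOW", "unknown"),
        })
        ev.send()

if __name__ == "__main__":
    timed_step("build", lambda: time.sleep(0.1))   # placeholder for the real build step
    timed_step("deploy", lambda: time.sleep(0.1))  # placeholder for the real deploy step
    libhoney.close()  # flush events before the job exits
```

With events like these in one dataset, you can graph step durations over time and see exactly which part of the deploy is the slow one.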

Emily Nakashima: 

Got it. Charity, I know you’ve seen this virtuous circle or this interplay between CI/CD and observability. What’s the connection there? Do you need one for the other? Should you have observability all throughout the pipeline? What advice would you give to people? 

Charity Majors:

Well, that interval between when you write the code and when the code is live is so core and so central to having a high-performing team. If you want to get that amount of time down and keep it down, you need observability. I kind of see them as peanut butter and jelly. You can’t have one without the other. It’s a heartbeat, right? You need to swiftly diagnose and see what’s going on. Conversely, you can’t really get observability if you don’t have the ability to ship code in a reasonable amount of time. Because if you have a six-month lead time, what does it even matter if you can diagnose the problem, when you can’t ship the fix for six months? I think they really go together, and it’s hard to do one without the other. 

James Governor: 

There are definitely preferences here because I’ve been talking about it as peanut butter and chocolate. So, you know, that’s just a preferential thing. But, yeah, I just can’t agree enough. My hobby horse is progressive delivery. If you want to be thinking about, as I say, understanding a system as it will perform in production across a set of cohorts or cohorts that you define, then observability is just non-optional. I think to do modern release management, you have to have an observability pipeline and observability of your systems. 

Charity Majors:

You were talking about shifting left and shifting right and all this stuff. I think that’s exactly right. You can shift right all you want, but until you include production, you don’t know what the fuck you’re doing. You can include all that stuff, but until you include production in that cycle, in the thing you’re doing, you don’t know. So I think just accepting that it’s going to break, absolutely, and we don’t know until we know. The continuous delivery book was written 15 years ago, when we were shipping shrink-wrapped software. I think that’s why they included the cop-out in CI/CD: you can deploy whenever possible, but you don’t know if what you have is going to work until you actually deploy it. That’s what we’ve learned. I think continuous delivery should be defined as actually deploying it and seeing if it works. 

Emily Nakashima: 

One of the things I love about that is that instrumenting the CI/CD pipeline is a way to get observability without having to dive straight into production. Joe’s team did production first and then instrumented CI/CD, but I’ve seen teams do it the other way around. Joe, obviously, the people on your team are going to be at different levels in their observability journey. Can you think of anything that has helped the people on your team who are maybe a little slower to understand the concepts or see the value get to that place where they can also have the same benefits? 

30:18

Joe Thackery: 

Yeah. I mean, we evangelize internally. I sit on the platform and infrastructure team, and so we evangelize observability outwards. We try to bake it in under the hood, basically. We give as much as we can to the other teams that are building our applications and building our clients, so it reduces that initial burden on them. They can ideally get something out of the box. We built Honeycomb into our internal, in-house framework, so it’s there. Does it give you the exact instrumentation for your service? Probably not. But it’s at least going to give you some request-level metrics and request-level information that you can trace as it comes from our front ends all the way through.  

That was kind of our answer. We, fortunately, had some members who were, I guess, pretty advanced on the maturity curve, who were able to help lift up the rest of the teams. If you don’t have those people, I would say there are tons of resources out there. The auto-instrumentation is, honestly, just amazing. It’s simple to install. It’s simple to get started. Even if you’re only looking at one service in a distributed system, you can start to get that. Once you do it, it starts to click. For me, that’s how it worked. 
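To make the “bake it in under the hood” idea concrete, here is a hedged sketch of a shared service factory that installs the Python Beeline’s Flask middleware once, so any application built on it emits request-level traces without the app team doing extra work. The factory, dataset, and service names are hypothetical, and the Flask choice is an assumption for illustration; Eaze’s internal framework is their own.

```python
import os
import beeline
from beeline.middleware.flask import HoneyMiddleware
from flask import Flask

def create_app(service_name):
    """Hypothetical shared factory every internal service uses."""
    beeline.init(
        writekey=os.environ["HONEYCOMB_API_KEY"],
        dataset="internal-services",   # hypothetical shared dataset
        service_name=service_name,
    )
    app = Flask(service_name)
    # Every inbound request now gets a span with method, path, status, and latency.
    HoneyMiddleware(app, db_events=True)
    return app

# An application team just does this and gets request-level tracing for free:
app = create_app("orders-api")  # hypothetical service name

@app.route("/health")
def health():
    return "ok"
```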

James Governor: 

I think you’re answering a question we have in the Q&A already: for a team working on observability from scratch, where is a good place to start? The answer you’ve just given, out-of-the-box telemetry, you know, that’s a great place to start. If the tool is going to provide some context, then that’s the first step on the journey, isn’t it? Or certainly a good step on the journey. I don’t know if it’s the first step. 

Charity Majors:

Yeah, installing the Beeline out of the box, I think we underplay, we under-advertise, just how magical it is. There’s so much more you can also do, but just installing it out of the box is a great first step. After that, I feel like a really good way to deploy it is to just follow the pain. As soon as you can start solving people’s pain… I think a lot of people take the cautious strategy of starting with some small services around the edge and stuff. That, I think, is usually a recipe for getting it deprioritized over and over. But if you can go to what is waking people up and causing their customers pain, knock off things really quickly, and follow the pain from there, that’s my preferred approach. 

Emily Nakashima: 

As James mentioned, we’ve got some great questions coming into the Q&A. I’m going to jump to that section. Please do ask us more questions in the chat. 

We’ve got another great one, which is: What resources would you suggest for a team that’s building a Greenfield system and wants to do the best job possible of building observability from day one? This is a great one because it’s kind of the opposite of the journey Joe’s team is on where they had a legacy system and had to figure out how to start. Charity, what advice would you give to someone starting from day one? 

Charity Majors:

I would say install the OpenTelemetry collectors. You know, this is almost more a question of practices than it is of tools, I think. That’s the right place to start with the tooling, but, you know, encourage your team to keep one window open with their IDE and another window open with the Honeycomb graphs, and look at them regularly. 

Making your CI/CD pipeline auto-deploy on day one is the number one tip I would give anyone. It can be small and janky, but every time your team merges something to main, if it builds, it deploys automatically with telemetry, and then you have the practice of expecting your engineers to go look at it. It should be a really short interval, a minute or two. Just getting yourself into that habit. 

I compare it to Alexander the Great and lifting his horse every day before breakfast. He started as a little boy. By the time he was a man, he could lift his horse every day. When you start early, it’s so much easier than when you bolt it on afterward. 
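For a concrete starting point, here is a minimal sketch of the day-one setup Charity describes, using the OpenTelemetry Python SDK exporting over OTLP to Honeycomb. The service name and the deploy-sha attribute are hypothetical, and you could just as easily point the exporter at a standalone OpenTelemetry Collector instead of sending directly.

```python
import os
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Configure a tracer provider that ships spans to Honeycomb via OTLP/gRPC.
provider = TracerProvider(
    resource=Resource.create({"service.name": "day-one-service"})  # hypothetical name
)
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(
            endpoint="api.honeycomb.io:443",
            headers={"x-honeycomb-team": os.environ["HONEYCOMB_API_KEY"]},
        )
    )
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

# Instrument from the very first commit, even if the "work" is trivial.
with tracer.start_as_current_span("handle_request") as span:
    span.set_attribute("build.sha", os.environ.get("GIT_SHA", "dev"))  # tie spans to a deploy
    pass  # real work goes here
```

Pairing this with an auto-deploy pipeline means every merge to main produces telemetry an engineer can go look at a minute or two later, which is the habit being described.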

Emily Nakashima: 

I love that horse analogy. I’m going to use it. 

35:18

James Governor: 

I think it’s one thing to jump in with tooling, but definitely do some learning. I mean, there are some people that you should be following and paying attention to before you do anything. I think @copyconstruct, Cindy Sridharan; I just shared a link in the chat to her observability book. It’s a good read. I would definitely be watching some talks and paying attention to Jaana Dogan. I would be going back and looking at a lot of the writing that Charity has done. I would look at some of the stuff that Liz Fong-Jones has done. I think the place to start for a team is probably to read this stuff together and to establish what it is you’re doing and why you’re doing it. 

Charity Majors:

Learnings over lunch where you talk about it and discuss it. 

James Governor: 

Yeah. So you have a shared language, and you’re coming at it with some shared ideas, so that when you actually proceed to start choosing tools, playing with tools in order to learn more, you’ve got sort of a grounding. In terms of the sociotechnical system, learning together is possibly a really good way to start. Before building anything, let’s think about what we’re building and why and how. Those are the people I always mention initially as the folks you should read if you want to understand these concepts a little bit better. 

I mentioned four folks, all of whom have built and managed systems at a high scale. You know, if you’re doing instrumentation and observability at Amazon, Google, Apple, frankly, the kind of work that Honeycomb is doing, you’re going to know a little bit about it. So I think that’s a great place to start, is lunch and learn together before you build.

Emily Nakashima: 

I love that. 

Joe Thackery:

Getting the teams on board is a big part of it, the sociotechnical aspect of it, as Charity likes to say. We were fortunate; we short-circuited all of that by having people in our company pay the price for outages. We had been saying for a while, Hey, our system is getting creaky. It’s built for a different era. We had initiatives. That’s reflected here: for something that doesn’t, on its face, deliver business value, sometimes you have to suffer through the pain first. But, yeah, there are ways to do it through less painful means. Talk to people. Watch the talks. I think the hnycon and o11ycon talks just came out on YouTube, so check them out. They’re really cool talks. 

Emily Nakashima: 

I love it. Yeah, there’s some good stuff in there. I love it when we don’t have to plug it ourselves. Thanks, Joe, for doing that. Charity, you’ve helped a lot of people. You’ve kind of helped drag a lot of people up that adoption curve to observability. Have you ever seen anyone be a contrarian and dig their feet in? How do you get past their objections? 

39:16

Charity Majors:

Yeah. Often, it’s the person who built the last generation of tools and is an expert in them, for good reason. Like, you know, I don’t think you can reasonably ask someone to adopt a different tool, visibility, whatever, unless what you’re offering them is an order of magnitude better than what they have. For some teams, honestly, it isn’t worth it. If you’ve got a monolith, if you have a database, an app, a load balancer, maybe it isn’t worth it. If you do have these problems and you have holdouts who are vehemently opposed because they know the current system, look for bridges. Look for bridges from the past to the future.

For me, we were using Ganglia at Parse, and we wrote this jank-ass cron job to sync the XML dump once a minute from Ganglia into Scuba. And that helped me see my world and theirs. I knew what the variables were and how to use them, et cetera. You don’t want to take away somebody’s toys until you’ve given them something that’s better. The more you can co-opt that person, ask them to be the one who prototypes this, ask them to partner with you and, like, try to find something better. You want to turn them from your biggest detractor into your biggest evangelist, if at all possible. 

Emily Nakashima: 

I love that. You can hear that you’ve been on both sides of that line, which I really appreciate. We’ve got another great question in the chat, which is: Have you seen any data on how observability affects uptime? Is it three nines or five nines? I always direct my nines questions to you, Charity. 

Charity Majors:

There’s data about how it affects it. It’s ironic, though, because so often it appears that your availability is going down when, in fact, your ability to detect your errors is going up. So it’s a little bit of a mixed bag there. I mean, the more you can understand your systems, the better your availability will be. 

James Governor: 

Charity, did you not hear? If we did less testing, coronavirus would have gone away, and it would have all been fine? 

Charity Majors:

Right. There’s a real thing that we’ve seen, like, over and over in the field where we start to roll out Honeycomb, and people start finding bugs. Like, Aargh! No! Stop! Halt! Right? There are bugs there! We need to fix it! And we’re just like, Okay. Okay. We fixed the bugs. We roll it out. Aargh! Stop! More bugs! 

We’re just like, Okay, kids. We’re picking up the rock, and, yes, there’s a lot of bugs underneath it, but they’ve been there forever. You’re just now finding them. Right? So that’s my answer. It cannot but improve your availability even though, in the short term, it may seem to impact it negatively. 

James Governor: 

From my perspective, I just think one of the things to understand is in asking the question: What are we asking? Nines are very expensive. 

Charity Majors:

Right.

James Governor: 

You’re kind of 10xing your infrastructure investment potentially every time you get another nine. So you need to think about…    

Charity Majors:

Do we really need this? 

James Governor: 

Do we really need this? I think my business partner, Steven, just has my favorite story on this. Sort of back in the day, he was at NSI, and he went into an organization, and they were like, Yeah, we need 24/7/365 for this service. He said, What is the service? They said, Oh, well, it’s used by our customer service representatives to support our customers. He was like, Oh, are you doing 24/7 operations? They’re like, Oh, no. Our customer service people work from 9:00 to 6:00. And Steve was kind of like, Wait. What? Do they work on the weekends? They’re like, No. No, we don’t. And the point was just the amount you’re spending in order to do that is really profound.     

I think, hopefully, he persuaded them it was a bad idea, or maybe, being a good consultant, he helped them do it anyway even though it was a bad idea. I don’t know. But I think the bottom line there is that, look, I know we live in a world of, oh my goodness, wow, Google is not going to fall over. A lot of these services are never going to fall over. But I think, as an organization, we need to understand, and this gets to observability, really: let’s understand our customers’ needs. Let’s understand our capacities and abilities as an organization to account for those needs, and optimize for that. 

I’m a little bit wary of claiming that the orgs that are investing in observability and using the tooling have more nines. I think it would be interesting data, but if you’re embarking on the observability journey, it shouldn’t necessarily be with the expectation that, oh, I will get two extra nines. I think you need to think about why you’re making those investments. 

Charity Majors:

Totally. Good answer.

Emily Nakashima: 

Joe, how do you think about this at Eaze? Do you use SLOs? Do you have particular target availability numbers? 

44:45

Joe Thackery: 

We don’t use SLOs yet, in terms of the Honeycomb feature, but we have a very similar story to James’s. We operate under legal restrictions and during certain hours. So we have a very well-defined set of hours where we know our systems need to be up and need to be responding in, you know, a reasonable amount of time. We have people on the ground doing physical activities in order to deliver our orders to our customers. 

We focus on those hours, and we take advantage, as much as we can, of the fact that we have off-hours where we can make changes that might otherwise be disruptive. And it makes it easier to make those changes because we don’t have to account for having, you know, the new system side by side with the old system. We can just do a swap-over. In terms of nines, we don’t have nines internally. We’re still small and growing, so maybe that’s coming in the future.

But we view it as: these are the core hours, and we need to avoid problems for our customers during these hours. And we have our customers placing orders, and we have customers on the operations team. We think of those as two different sets of needs. They have slightly different hours. They have slightly different tooling. They hit different paths. So, yeah, I think observability helps on that front too. It’s a slightly different topic, but it helps you understand the different constituencies. 

Emily Nakashima: 

Yeah. I love that. You don’t just have to pursue three nines. You can think about what you really need, what your customers are going to need to be happy. Charity, I saw you just answered the question about capacity management, but I would love to just talk a little bit about that one. How do they work together? 

Charity Majors:

Metrics are the right tool for that job. They’re oriented around the perspective of the service: how much utilization does the service have? Versus observability, which is oriented around the perspective of the customer, the user. If you’re an infrastructure team, if you’re responsible for backend services, not for customer-facing stuff, then your metrics, your monitoring tools, your percent utilization, that’s the right tool for that job. It doesn’t really have much to do with observability. 

James Governor: 

Charity, having said that, if we think about capacity planning, one of the reasons you do that is cost, and I know some orgs have been able to get a better handle on their spend.

Charity Majors:

Absolutely. That’s a different way of looking at it. But, yeah, absolutely. We use Honeycomb for managing our own spend. Emily should probably talk about that. 

Emily Nakashima: 

Oh, gosh. That’s a whole rabbit hole. We can deep dive into using Honeycomb to look at your AWS bills, but…   

Charity Majors:

That’s just kind of using it as a data tool. You can definitely use a data tool for that. If it’s an infrastructure problem, you need metrics because they’re cheap. They compress easily, et cetera. But if you’re trying to understand, you know, your costs, then, yeah, absolutely, pipe it into Honeycomb or something like that. 

Emily Nakashima: 

Another question: What are some of the pitfalls you see when people start to adopt observability? How do you avoid the things on the journey that might slow you down or become roadblocks for your team? Joe, is there any advice you would give someone who is going through the same journey you went on? What would help them out? 

Joe Thackery: 

Yeah. Pitfalls: distributed systems are tough. Getting them all on the same page and talking the same tracing headers and formats can get complicated. Use the tooling, the libraries that come from Honeycomb and the open-source community. It’s not always possible; we had to implement it ourselves for some of our services, and we’ve had occasional issues with that. I would say try to rely on the community as much as possible. Rely on the community tooling.

OTel is growing faster and faster. We don’t use it a lot internally yet, but the fact that there’s an open community and kind of a standard emerging is something I would like to look at in the future, or if I were starting greenfield.
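One way that emerging standard helps with the “same tracing headers and formats” problem Joe mentions is standardized context propagation. Here is a hedged sketch using OpenTelemetry’s Python propagation API; the function names and the caller/handler shapes are made up for illustration, and it assumes a tracer provider has already been configured as in the earlier sketch.

```python
import requests
from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer(__name__)

# Outbound: inject the current trace context into HTTP headers so the
# downstream service can continue the same trace.
def call_downstream(url):
    headers = {}
    inject(headers)  # adds the W3C traceparent header (and any configured baggage)
    return requests.get(url, headers=headers)

# Inbound: extract the caller's context and start the server span inside it,
# so both services agree on one trace regardless of language or framework.
def handle_request(incoming_headers):
    ctx = extract(incoming_headers)
    with tracer.start_as_current_span("handle_request", context=ctx):
        return "ok"  # handler logic goes here
```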

49:45 

Emily Nakashima: 

James, what about you? Have you seen any common pitfalls on the journey there? 

James Governor: 

Yeah! Don’t build your own. To Joe’s point, there’s some really cool community-created technology out there, kind of a wealth of interesting technology, and theoretically you could use it to build a platform that would enable you to do observability. Chances are high you’d go down the wrong path and do three pillars. But there’s all this technology, a wealth of it out there, and that’s great. Then you’re going to be left feeding, cleaning, and watering that system. 

As Charity said, you’re going to have one person on the team, the person that built the platform, and if they’re not there, you’ve just bottlenecked your ability to make a change and make sure stuff is working effectively. Essentially, I’m a big believer in open source, but I think you need to have a really good reason to roll your own. And thinking that you can just, like, grab some Prometheus and some Grafana or an ELK stack and start to build your observability platform, you’re going to end up bashing yourself on the thumb with that hammer. 

Emily Nakashima: 

Definitely. I’ve seen teams coming back from having taken that journey and realizing they wanted to take a new approach. One distinction, though: there’s a bit of a difference between open source for the actual tooling and backends of this stuff versus open-source instrumentation, so using OpenTelemetry and things like that. Charity, any advice for people trying to decide between the vendor version of the thing and the open source? 

Charity Majors:

Well, this is why we’re all so delighted to welcome OTel, OpenTelemetry, to the fold. If you’ve instrumented your systems with OTel-compatible stuff, you should never have to reinstrument them again, because all the major vendors and, hopefully, all the major open-source providers out there will be compatible with it. And it will just be as simple as, you know, flipping a switch. This is good for everyone. 

Emily Nakashima: 

Yeah. I work for a vendor. I’m supposed to say the vendor thing is great, but it’s such a sense of relief to say, Okay. We’ve instrumented once. We can always use this instead of having to keep chasing the next thing. James, have you seen OpenTelemetry change the conversation out there much? 

James Governor: 

Oh, yeah, definitely. I think, as an industry, we swing back and forth like a pendulum. We’ll be on a wave of more open instrumentation, with OpenTelemetry; then we’ll have a wave of proprietary agents. And I think, certainly at the moment, customers are showing they would like to see standardization of instrumentation, and, yeah, that’s having a significant impact on vendor behaviors. For those orgs that have had a good run these past 10 years or whatever it is with their proprietary agents, they’re having to respond to that.

I think the openness there is definitely something that customers see value in and vendors are going to have to support. So, yeah, absolutely, I think there’s been a significant change. You don’t want to be in the business of, Hey, check out our proprietary agents. They’re going to make you effective. That’s not the wave of technology that we’re currently on. You want to be: Here are these events. We’re going to do an incredible job of understanding them and troubleshooting accordingly. 

Charity Majors:

We don’t want to be competing on the basis of instrumentation; we want to be competing on the basis of value delivered to customers. 

Emily Nakashima: 

We’re just about at time. That seems like the perfect note to end on. I want to quickly wrap us up. First of all, I want to say a huge thank you to our panelists: Joe, Charity, and James for joining us. And then thank you to all of our attendees as well for dialing in and joining us live and for your great questions. I want to just quickly remind folks if you’re on this training path yourself, we have some great resources. Definitely get started with our free tier, OpenTelemetry, auto instrumentation. And then, as you’re, kind of, going down that journey with your team, we’ve got workshops. We’ve got our o11ycon and hnycon talks, which, of course, are linked in the chat. And then as you get further into that lifecycle, you can start building observability into your processes, and we’ve got some great resources for that as well. Thanks again for joining us. And, as I mentioned at the beginning, if you want to watch this webinar later, we’ll send it out to you. And we’ve also got our other webinars on-demand as well. Thanks so much, folks. Have a good day.

If you see any typos in this text or have any questions, reach out to team@honeycomb.io.
