Never Alone On-call

Summary:

Switch from a deep query to a trace waterfall view and spot outliers with Honeycomb BubbleUp heatmaps: these make debugging effortless and, dare we say, fun. The observability that Honeycomb brings shifts a dev team from guessing what the problem may be to knowing exactly how their complex production system is behaving and where in the code the issue lies. Drawing on the experience of all users, dev teams learn faster when they work together and share the same visibility, down to the raw event. We affectionately call it “See Prod in Hi-Res.”

Team collaboration is achieved through shared charts, dashboards, and following the breadcrumbs of your teammates. With Query History, everyone sees results and can tag what’s interesting. It’s organized, intuitive, and easy to follow. Incident response and on-call run much more smoothly when everyone is virtually on-call with you.

In this webinar, we’ll discuss and show:

- How Honeycomb’s Query History gives rich, meaningful context
- How Honeycomb’ers dogfood and learn from each other
- The benefits across the engineering cycle, and use cases when debugging and maintaining
- How to build a culture of observability, and why you should do it now

Transcript

Emily Nakashima [VP Engineering|Honeycomb]:

Hello! Welcome to today’s webcast. We’re gonna give folks two more minutes to dial in. So we’ll get started in just a couple more minutes. All right! It’s 10:02. Let’s get started. First, a couple of housekeeping notes. If you want captions during this webinar, we do have live captioning, so you should see a link at the bottom of the screen. You can click on that, or if you’d like, it’s also in the email we sent out yesterday. So you can get the captioning link there too. If you have questions, at any time, during this webinar, please use the “ask a question” tab in the BrightTalk player. We will do a Q and A session at the end of the presentation, so don’t worry if we don’t get to them right away. We’re gonna reserve time for questions at the end and get to them all at once. At the end, take a moment to rate this presentation and give us feedback. It’s super helpful to give us an idea of what kind of content would be helpful in the future. So rate us, give us feedback, and if you have any technical difficulties at all, there’s a support for viewers link at the bottom of the page. Finally, a recorded version of this presentation is gonna be available at the same URL after we wrap up, so if you want to share this with your friends and colleagues, you hear something that’s interesting that you want to share with your team, don’t worry. It’ll all be recorded and available after the fact. So check that out if you want.

All right, great! This is part of a five-webinar series, the Honeycomb Learn series. And the goal of this series is to break “how to make observability happen” into bite-sized chunks. We’ve previously covered topics like instrumentation, ongoing performance optimization, incident response, identifying outliers, and anomalies, and this is gonna be episode five, never alone on call. And in this episode, we’re specifically gonna be focusing on team collaboration, curated learning, and we’re gonna be kind of taking a deep dive into how Honeycomb handles knowledge sharing within our on-call rotation, and all of this is part of the story of how we build up a culture of observability within our team.

I am very lucky today to be joined by two of Honeycomb’s finest engineers. My name is Emily Nakashima, I’m the Director of Engineering at Honeycomb, I manage the engineering and design team, and I will let Alyson and Ben tell you a little bit about themselves. Alyson, why don’t we start with you? How did you get to Honeycomb? What do you work on? What was your background before here?

Alyson van Hardenberg [Engineer|Honeycomb]:

Yeah, hi. I’m Alyson van Hardenberg. My background before coming to Honeycomb was as a front-end engineer at a small startup. And prior to joining Honeycomb, I had never been on call. So that was a new experience for me, coming to Honeycomb.

Emily Nakashima:

And Ben, how about you? You’ve got a different background than Alyson. What brought you to Honeycomb and what did you do before here?

Ben Hartshorne [Engineer|Honeycomb]:

My name is Ben Hartshorne. I’ve been an on-call engineer for quite a while now. My background is in operations, and I’ve been on call at basically every technical job I’ve had, from small startups to large organizations like Wikimedia and Facebook. Here at Honeycomb, I’m working on the full stack, from the frontend, the backend storage, ingestion, API, client libraries, and of course, on-call.

Emily Nakashima:

Fantastic. Thanks, Ben. I was so excited that Alyson and Ben could both join for this, because they’re two people who are fantastic engineers, really valuable people to have in our on-call rotations, and they come from super different backgrounds, so that’s part of why I’m so excited to talk about it today. Just to give you a review of how on-call works for us because I know that can mean different things at different places.

5:49

At Honeycomb we have an on-call rotation that lasts for two weeks. For the first week, you’re the secondary on-call. You are the backup for the primary on-call engineer. So pages can escalate to you if primary on-call misses a page. For the most part, secondary on-call week is pretty chill unless you’re ramping up, and then you use that time to watch what the primary is doing, to get oriented to issues that are happening, and you’re the first line of escalation for primary on-call. So if they get too busy or need a hand, they’re more likely to pull you in first, before grabbing someone else.

After that first week, you’ll make the transition to being primary on-call yourself, another engineer will become your secondary, and then you’re the first line of defense, and at the end of each week, we have an on-call handoff meeting where we discuss whatever issues came up during the week, walk through our action items, and make sure someone is on point to take care of potential mitigation, and make sure to share context about what happened during that week and what we can do to get better.
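[Editor's sketch] The rotation described above can be illustrated with a short Python example. This is only a sketch under stated assumptions, not Honeycomb's actual scheduling tooling: the engineer names, the `Shift` type, and both function names are hypothetical. It shows the one rule the transcript states, that each engineer spends a week as secondary and then becomes primary the following week, with pages escalating from primary to secondary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shift:
    week: int
    primary: str
    secondary: str


def build_rotation(engineers: list[str], weeks: int) -> list[Shift]:
    """Build weekly shifts where this week's secondary is next week's primary."""
    if len(engineers) < 2:
        raise ValueError("need at least two engineers for a primary and a secondary")
    shifts = []
    for week in range(weeks):
        primary = engineers[week % len(engineers)]
        # The engineer who will be primary next week shadows as secondary now.
        secondary = engineers[(week + 1) % len(engineers)]
        shifts.append(Shift(week, primary, secondary))
    return shifts


def escalation_order(shift: Shift) -> list[str]:
    """Pages go to the primary first, then escalate to the secondary."""
    return [shift.primary, shift.secondary]


# Hypothetical placeholder names, not real team members.
rotation = build_rotation(["eng_a", "eng_b", "eng_c"], weeks=4)
for shift in rotation:
    print(shift.week, escalation_order(shift))
```

Each engineer's two-week stint appears as two consecutive shifts: secondary in one week, primary in the next.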

Ben, since you have so much on-call experience, you’re my expert. It would be great if you could tell us a bit about how on-call at Honeycomb is different from other places you’ve been before.

Ben Hartshorne:

Sure. So on-call, most of the places I’ve been have been driven by alerting. There will always be alerts that come through, that trigger on-call work. But the interesting thing about Honeycomb’s on-call — the entire engineering staff is on-call. Most of the organizations I’ve worked at, and this is part of the evolution of our industry too — there’s an ops team. The ops team is the group of people responsible for production. And when things come up, they might call in engineers for help. But it was a pretty clear separation.

The industry is moving towards DevOps. The ops teams are learning how to build software and automate all of their processes. The development teams are learning about operations. And what it means to run software in production. And the fact that there isn’t this divide has really changed the way on-call works. I’ve always been part of software development in a way that — you know, that the product will be built, and then we’ll try and run it, and inevitably, there will be challenges, and folks will say: No, no, you didn’t pull in ops early enough. This isn’t able to be run, because you didn’t ask the Ops Team, or ops will complain — you know, this isn’t well-instrumented. There’s no way for me to run this. And this characteristic of there being two separate groups that don’t really talk makes it difficult to run software in production.

Having the engineering team be the ones that are also on call and running it in production gives an immediate feedback loop, and so it improves the way that we can build software, as well as shortening the response for any of these problems that come up. On-call at Honeycomb is not about triaging problems and then finding somebody else to fix them. Because it’s the engineering team that’s on-call, when problems come up, we just fix them.

And so the feedback cycle of getting — of identifying a problem, finding a fix, pushing it back out, is very fast. And I don’t mean that on-call has to fix all problems. You know, we can certainly escalate to the rest of the team, if we hit something that’s more challenging, or will take longer, or an area of the stack that we don’t know so well. But most of the on-call tickets that come up are handled by the on-call rather than being just a point to identify a problem and then pass it off to somebody else. And that’s a really big difference. And has made on-call just way more fun.

10:06

Emily Nakashima:

Perfect, thanks, Ben. Yeah, I’ve definitely worked in places where we’re always trying to find that right moment to hand off things between software and the ops team or the right moment to move them in, and I feel like you never, ever find it. So it’s really different to just have both of those teams be one and the same from the start.

Alyson, I’m curious to hear from you — coming from the perspective of someone who was doing more core software engineering work and not on-call in the past, has going on-call changed the way you think about writing software? Or how you think about your job as a software engineer?

Alyson van Hardenberg:

Yeah, it has. I was definitely a little bit nervous about joining the on-call rotation when I came to Honeycomb. But it’s been great for me. It’s really changed the way that my — even my career path, like, the way I think about my career is different. It’s given me the opportunity to learn more about those ops-y skills, like SSHing into a different host or even instrumenting my code. And because I know that I’m gonna be on-call and have to support the code that I’m writing, I’m more likely to think about the instrumentation as I’m writing the code. And it lets me instrument it in development and see how it’s working, and then as I push that code to production, I can support it as it goes live, and see my code rollout, and how it’s working. Even before I just let the issues roll over to on-call.

Emily Nakashima:

Yeah, having that feedback loop between development and production — I totally agree that it changed the way I thought about what hooks I would put into my code at the time I was writing it. That makes complete sense. I was wondering if you would also tell us a little bit about what on-call does at Honeycomb. I know some places, you’re really just responding to alerts, about CPU or latency. And I think that for us, we maybe cover a broader territory than some folks might in their rotation. So can you tell us a little bit about what you might do during the week on-call?

Alyson van Hardenberg:

Yeah. When we’re on-call at Honeycomb, we’re responsible for so many different things. Different types of tickets. Primarily we really pay attention to any alerts that might come in. We might have a Slack trigger that comes in or a PagerDuty alert. When those come in, it’s a call for your attention. It might be message delays or disk space alerts. Those are the typical ops-y roles that I would imagine on-call handles.

But as an on-call engineer, you’re also responsible for looking into user-generated errors. Something like log.error or a JavaScript error that might come up. But we’re also responsible for being the first line of defense for triaging customer bugs and requests. For example, a customer has found a bug in one of our SDKs, and because we’re all software engineers when those requests come in, we fix them and roll in a fix, and they get resolved really quite quickly.

Emily Nakashima:

Thanks, yeah. You can see it can really change a lot from week to week. Oftentimes you’re on call and focusing on one area of the codebase or the infrastructure, and the next time you’re on call, it’ll completely change. So you get a tour of all Honeycomb’s engineering, for better or for worse.

I wanted to quickly walk through what that looks like in practice and how we collaborate when we’re on call. So I have a really quick demo that I want to walk through: the story of how we might respond to an incident together. I should just say, for this part, there’s gonna be a lot of text and images on screen. If you want to, you can expand the BrightTalk player to zoom in on the images a little bit. For us, on-call gets involved through one of two channels. Either a customer reports a problem to us directly, via a support channel, which might be email or customer Slack, or we get an alert. It’s about 50/50 whether something is alert-driven or customer-driven.

So here’s an example of an alert we want to investigate. An engineer on our team, Doug, has just done a big refactor of something we call hound-usage, and we set up something we call triggers to report if the error rate is higher than what we would expect. Looking at this alert, it looks like we’ve seen 131 errors in the last ten minutes, which is over the expected threshold. Hound-usage is part of how we move data from primary to secondary storage, and because it’s part of the workflow of customer events, we want to look pretty closely at it. The trigger message has all the usual things you would expect. There’s a description of what went wrong and how to think about this alert, a link to a dashboard with relevant queries and their results, and a link to a playbook that Doug just wrote, and you can see Alaina down at the bottom said: Great playbook!
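[Editor's sketch] The kind of threshold check behind an alert like this one can be shown in a few lines of Python. This is an illustration only, not Honeycomb's trigger implementation: the function names are hypothetical, and the threshold of 100 is an assumption, since the transcript gives the observed count (131 errors in ten minutes) but not the configured threshold.

```python
from datetime import datetime, timedelta


def count_recent_errors(error_times, now, window=timedelta(minutes=10)):
    """Count error timestamps that fall inside the trailing window."""
    return sum(1 for t in error_times if now - window < t <= now)


def trigger_fires(error_times, now, threshold, window=timedelta(minutes=10)):
    """Fire when the error count in the window exceeds the threshold."""
    return count_recent_errors(error_times, now, window) > threshold


now = datetime(2020, 1, 1, 12, 0)
# 131 synthetic errors spread over the last ~9 minutes, plus one old error
# that falls outside the ten-minute window and should not be counted.
error_times = [now - timedelta(seconds=4 * i) for i in range(131)]
error_times.append(now - timedelta(minutes=30))

ASSUMED_THRESHOLD = 100  # hypothetical; the real threshold is not stated
print(count_recent_errors(error_times, now))
print(trigger_fires(error_times, now, ASSUMED_THRESHOLD))
```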

15:32

So you see them discuss the alert. Alaina is the primary on-call for the week. They decide that it’s important but not urgent, so they’re gonna look at it together when Doug gets in from his commute. I see this alert a few hours later. I want to know what happened and see if it got resolved okay, so the first thing I’m gonna do is click the “view graph” query link in the trigger, which is gonna take me to the query we ran to generate the alert. So I can look at this myself and rerun the query. It looks like the error rate has gone down, so I’m a little bit less worried, but I still want to know what’s going on. Here’s the cool thing. If you look at the icons over on the right side, the middle one is Query History and the lower one is Team History. So I can see both the queries I’ve run and the ones that my teammates have run if I click those.

Now you’ll see over on that right sidebar I can see that Doug got off his ferry trip, he and Alaina sat down and started running some queries to investigate what was going on. I can see other people on the team are investigating this, I can see what questions they’re asking, and a cool thing that some people don’t know — we save query results forever. So you can always look at an old graph in an instant. This is cool if you have a dataset with only a couple of days of retention. If you want to go back and look at what happened during an incident, you’ll always be able to reload this graph, which is a nice feature.

So if I click into one of the first queries that Doug ran, that’s a group by name. I can see he’s trying to figure out exactly which errors happened. The errors, updating metadata, you can see the table shows you the breakdown of the different types of errors. The metadata ones I’m not as concerned about because we can regenerate the metadata later. The rollout errors I am concerned about. That means potentially there was a failure trying to move this data out of S3.

So let’s do a group by error message to see what the individual error types are and see if we can get more information about what went wrong. That’s really interesting. Down at the bottom, you can see that we’ve got that SQL error. Not what I expected. And then from there, I’m going to group by dataset, because I want to know: Was it just one customer’s data that was affected, or is this a widespread problem? Looking at the count in the table, it looks like just one customer’s data. So we do some digging in the admin table to see what’s going on with it, we find bad data that looks like it’s causing the issue, and they’re gonna manually clean that up. And of course, going back to Slack, I can see that my team is already on top of it. Alaina is checking in with the customer support team, letting them know what happened, letting them know that the issue was resolved, how they resolved it, and Doug is updating the alert.
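[Editor's sketch] The sequence of group-by queries in this walkthrough (by name, then by error message, then by dataset) can be mimicked on a handful of events in Python. The events, field names, and values below are invented for illustration, and `group_count` is a hypothetical helper; Honeycomb's query engine is not shown here. The point is only how successive breakdowns narrow the question from "which errors?" to "which customer?".

```python
from collections import Counter


def group_count(events, field):
    """Count events per distinct value of one field, like a COUNT with GROUP BY."""
    return Counter(event.get(field, "(missing)") for event in events)


# Hypothetical error events; names and values are made up for the example.
events = [
    {"name": "update_metadata", "error": "timeout", "dataset": "customer-a"},
    {"name": "update_metadata", "error": "timeout", "dataset": "customer-b"},
    {"name": "rollout", "error": "sql: bad value", "dataset": "customer-a"},
    {"name": "rollout", "error": "sql: bad value", "dataset": "customer-a"},
    {"name": "rollout", "error": "sql: bad value", "dataset": "customer-a"},
]

# Step 1: which kinds of errors happened?
by_name = group_count(events, "name")

# Step 2: for the concerning kind, what are the individual error messages?
rollout_events = [e for e in events if e["name"] == "rollout"]
by_error = group_count(rollout_events, "error")

# Step 3: is it one customer's dataset or a widespread problem?
by_dataset = group_count(rollout_events, "dataset")

print(by_name.most_common())
print(by_error.most_common())
print(by_dataset.most_common())
```

A single key in the final breakdown corresponds to the "just one customer's data" conclusion in the demo.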

And for me, as a manager on the team, this is the best-case scenario. I’ve already learned a lot about the new hound-usage code by watching them work: I’m seeing what questions they’re asking, what they’re thinking about, what’s normal, and what isn’t. If they need another set of hands, I can jump in and help, but if they’ve got it handled, I’ve collected knowledge about the incident and I can totally get out of the way, which is great.

So that was a little overview of how a typical incident might play out at Honeycomb when we’re on-call. Finally, I think it’s worth talking a little bit about how working this way has changed the way we think about engineering and how that impacts our business overall. What you are seeing in the demo is us using our observability tooling as a team to understand what our software is doing, and for us, that’s all part of this larger culture around what we call software ownership.

So we believe that the best software comes out of involving the team from idea all the way to production. We have the same engineers going through that whole cycle from planning, writing, testing, deploying, validating that things are successful in production, and then running their own software, in an ongoing way, after it’s been shipped. And being responsible for continuous improvements to that software. The demo we just talked through was focused on incident response, but of course in the software ownership model, we’re not just responding to incidents. The same people that are catching those pages are also gonna be doing ongoing development, doing performance optimization, and all these things I think work best when you can take the same toolchain and apply it to each of those phases.

20:40

So this means for us that our engineers are using Honeycomb to understand what they’re building. Sometimes even starting in development. Using Honeycomb to verify that it’s behaving correctly when it’s rolled out, and we’re using it to respond to errors, like what we just looked at, and also the optimization piece. A lot of people think of that as performance optimization, but it’s also product analytics. Making sure that people are using things the way they’re intended, and we like to see what customers are doing with the product. That often helps us design and plan for future improvements. This model, using a shared tool for all the various steps, allows us to have a faster iteration cycle and development process, which is what we really care about. Being able to move quickly, and being sure we’re shipping the right things. I think a lot of people hear about putting all engineers on call and really have a lot of questions about that.

I think this is the thing that we take to conferences, and people kind of give us the most concerned looks or raised eyebrows. I think that for us, this is one of the ways we foster software ownership on our team. But we know that it doesn’t necessarily work in every organization. Not every organization is ready to jump into this. And so a lot of people have asked questions about how we do this. Just a couple of caveats, let me say: We don’t put all engineers on call on their first day or even in their first month. This is the culmination of a process that begins during onboarding, and you should feel like you’re continuing to learn more about our systems and how to do on-call well throughout the tenure of your career as an engineer at Honeycomb.

And we don’t feel like on-call should be able to fix everything. They’re just the first line of defense. This is keeping the barrier to escalating really, really low. You should never feel embarrassment or shame about asking for help or asking questions. You’ll have your secondary there, but it’s really common for engineers to jump in and help if they see something going wrong and they have a lot of context about it.

Ben Hartshorne:

One of my favorite parts, if I can jump in for a sec, Emily, one of my favorite parts of the onboarding experience — you mentioned we don’t put engineers on call on their first day, but when it does come to be time, to help a new engineer ramp up, one of the things that we do is an architecture overview. This is not uncommon. The thing that I like about it is that the previously newest engineer is going to be the one that gives the architecture overview, and most of the team comes. So it gets to serve as both a way for relatively new engineers to go over this whole infrastructure again, to make sure that they have everything right, talk through it, as a way of cementing that understanding, and then also provide an opportunity for everybody on the team to see what things are new or what’s changed. As well as the people that have been there for a while longer, to talk a little bit about any questions that might come up.

Alyson van Hardenberg:

I also really love that architecture meeting. When I gave it, as the most recent hire, it made me feel like I had ownership and a stake in the team. Even though I was relatively new, I still had so much knowledge to contribute. And I really love that the engineers who have been at Honeycomb for a longer time will join, and they add a lot of funny color commentary. It’s kind of like… I don’t know. It just brings everybody together. It’s really great.

Emily Nakashima:

And the thing that’s hilarious to me is how often the most senior engineers come to that meeting and still learn something new! We have a few people who join because they love the fun of sitting in that meeting, but I’m so shocked that employee number two will often go — I didn’t know that! During a meeting. So I love that it emphasizes that we’re all continuously learning, which is really great.

Alyson van Hardenberg:

Even though I just started being on call when I joined Honeycomb and was really nervous about it, I still think it’s really — it’s been a really good thing for me, and it really helps individual engineers, but it also helps the team as a whole. And it helps us by democratizing knowledge. For example, we don’t want people to feel like there are only two experts on the team who solve everything. By putting everybody on call, it makes the team more resilient.

25:37

At the old company I worked at, we would only have one or two power users who knew how to use the instrumentation, and when things broke, nobody knew how to fix it. There was a huge scramble. A lot of stress, a lot of panic. It also levels the playing field between junior and senior engineers and even people with a different set of knowledge. It allowed people to ask questions and question those established practices. For example, somebody might ask: Is the current caching strategy killing our database? That’s not something I would think to ask when I was on call, because my focus tends to be more frontend-y, but having the different players be on call gives us a more well-rounded set of practices.

And then when different team members go on call, they notice holes in the documentation, that people have been doing the same thing over and over and over again. They don’t notice. When new people join the rotation, they improve the docs and help other people get up to speed more quickly.

Ben Hartshorne:

You know, I’m a little embarrassed to admit it, but I think I suffer from that last bit quite a bit. I’m particularly bad at leaving holes in documentation. Just as a side effect of having been on call for 15 years, and having done a lot of these processes over and over and over again, they’re air quotes “obvious” — they’re not obvious. So being on call with another engineer, especially as we work on some things together, whether I’ve escalated to my secondary, or I am the secondary, getting to fill in some of those holes by seeing those same processes through another set of eyes is a fantastic way of working together to make the whole team better.

Emily Nakashima:

Yeah. I love that you mentioned different levels of experience on call. And I think there’s also a component of it that is different engineering disciplines. Alyson, I think, is on frontend development for Honeycomb, and Ben has been senior for systems kinds of things, but it might be that if Ben is on call and there’s a JavaScript issue that comes up, just because it’s not his favorite area of software engineering, he still might have to escalate or pull in someone for help. So I like that it levels the playing field, that everybody has to potentially learn something to be on call, which is really cool.

Alyson van Hardenberg:

Yeah, I’m always super grateful for having the other engineers out there. I’m so comfortable pinging them and asking them questions. It’s great to not be alone on call.

Emily Nakashima:

One of the important things that I don’t want to get lost here is that we wouldn’t do any of this if we didn’t think it was the right thing for our customers and for Honeycomb’s business. There is a component of this that is about retaining people on the engineering team, and we think this is a good way of building if you’re someone who likes to understand the whole process and likes to understand the impact of your work. But we also think it’s the right thing to do for our customers and for the product, because working this way means we can often resolve customer issues more quickly than if we were using a more traditional waterfall model and handing things off between teams.

Oftentimes if customers report a bug, it means things are resolved in minutes or hours instead of days, and it reduces communication hoops both for customers and for teams. I used to work in a more traditional model, and so often we would kick tickets back to the customer. We can’t reproduce this. Please give us more so we can reproduce. And now we can just say — oh yeah, we can see that and fix it.

The most gratifying feedback we hear to make us know this is working for other people too is hearing from customers who have switched from the prior generation of tools, trying to do this type of debugging with APM tools and logs, to using Honeycomb, and telling us — we immediately found something we couldn’t have solved with our APM, we solved our first production problem within minutes. Those kinds of things are really common and help us know that we’re on the right track and getting what we want out of our tools. And the goal is, like I said, to spend less time reacting to large production issues and having more time to respond to specific customer issues, making customers happy, and to keep improving the product and making it better for everyone.

30:09

Ben Hartshorne:

Bringing it back to the customer, if I can… Emily, it’s almost either a side effect or perhaps part of the purpose of the DevOps movement, if you will, for many of the years I was in operations — I never used the product that I was responsible for maintaining. The work that I was doing was making sure all of the services were running and everything was smooth. And it didn’t actually matter what the service was. And so there was one point where I was like… You know what? I need to switch jobs. And I’ve never supported a product that I actually used.

One thing about the engineering team being on call is that it’s the same team that’s building these features, that’s then supporting them. So that disconnect of… I’m just making sure the servers run, I’m just making sure the software runs… From — I’m using this feature. I understand why this feature is here. I understand why this service is here — it brings it together in a way that lets us more easily pull the work that we’re doing directly back to customers, in a way that’s very satisfying.

Emily Nakashima:

Yeah, totally. Tying it into the larger DevOps trend is so important. I don’t want people to miss that connection. Because I think that this is really almost the dream of that improvement, broken down into kind of more actionable steps, which I really like.

Cool. So that was the end of our presentation portion. Before we take questions, a little bit of wrap-up information. As I said, there’s gonna be a recording of this webinar that’s gonna be sent out soon, so please feel free to share it with your friends and colleagues if you want to. For more information about how to get started, we will send the attachments that you see here in an upcoming email. You’ll be able to check out a presentation from Alyson and Alaina on our team about the culture of observability, and you can look at the decks from that, which go into a little more depth about our on-call process and culture. We have an observability eGuides series: achieving observability, observability for developers, and a guide to distributed tracing. So again, if you want to go a little more in-depth, you’ll find those at this link, or on the resources page of our website. If you do want to demo Honeycomb and play around with the things we walked through today in the demo, you can do that on our play site, play.Honeycomb.io, and use the UI and try to debug a real production issue, and of course, if you want to trial Honeycomb or request a demo, you can do that on our website too.

Cool. All right. So questions? Let’s see. Someone asked: What is hound-usage? Alyson, you want to take that one?

Alyson van Hardenberg:

Yeah. Hound-usage is our way of moving our data from our primary storage to our secondary storage, and we back things up to S3 using it. So it runs on a cron job and backs up our data.

Emily Nakashima:

It’s one of those things that has charmingly outgrown its name. It used to be about measuring how much data we’re using, and now it’s a crucial piece of our infrastructure. But you know… Old names die hard. We have… Who is ultimately responsible for writing and maintaining and updating the playbook for on-call? That’s a good question. I don’t know that we’re particularly good at playbooks right now. Ben, what’s your take on that?

Ben Hartshorne:

So for any given service, the team that is building the service goes through an evolution. First, they’re writing everything. And then it gets into our staging or our dogfooding environment. And generally, the team that is building that service is watching it closely, learning about how it’s running in a high traffic environment, and writing documentation about how — what sort of edges it exposes and how to understand them, what instrumentation is available, and how to read that. And then there is the phase that is the difficult one, where this team gives up control of their special creation, and the entire engineering team is then responsible for this service going forward. And that allows the team to dissolve and the individual engineers to move on to other projects.

35:07

So at least part of that process is that when any new service is stood up, there has to be some copy written about how it works and what it’s doing and how to interact with it. And that’s like predictive documentation. Right? That’s the thing that you think people are gonna want to know when they’re trying to understand your service, and it’s never more than 50% of what folks need. So that document turns into an evolving and living set of tips, tricks, instructions, specific responses, and so on. Really through the hands of all of the different on-call engineers that come across it. They find that the problem that they’re trying to solve is either well answered, poorly answered, or not answered at all.

And then they improve that documentation. So, phrasing it as “who is ultimately responsible for the documentation, for updating the playbooks?”, it’s sort of everyone. When everyone’s responsible, no one’s responsible… in that the team that creates the service definitely starts the process. And then part of being on call is ensuring that the operation of the service can continue. Now, for some of the issues, as you resolve them, you feed that back into the documentation. For other issues, you feed that back into the code, so that issue doesn’t happen again.

So it’s sort of a process that has a number of different options. But I think the closest answer to who is ultimately responsible for maintaining it and updating it… Those are the people using it. Namely the on-call engineer, which is everybody through rotation. Is that in line with what you see, Emily?

Emily Nakashima:

Yeah, absolutely. The one thing I would add is that I think kind of taking the software ownership approach incentivizes people to write the right amount of on-call documentation. I think that if you write enough of a playbook that people can successfully debug and operate the software, it means fewer things will be escalated back to you, the person who originally wrote the software, and the nicest version of this ends up being a relatively short playbook and a nice board that helps you get started with queries you might run to try to debug some things. That’s ideal in practice. I think most people on the team would like a little more documentation than we have right now, but we’re getting there, step by step.

Ben Hartshorne:

Along with wanting a little bit more time and a few more employees.

Emily Nakashima:

(laughing) Yeah, I think if you have enough documentation, you’ve probably written too much. All right. We’ve got… When hiring new engineers, how do you talk about on-call? The system is so different from anywhere I’ve worked. How do you get folks on board to participate this way? That is a great question. Alyson, what made you not run screaming from the idea of being put on call?

Alyson van Hardenberg:

I had a firm promise that I would never be just by myself. I mean, the reason that I didn’t run screaming away was… Knowing that I was just gonna be the first line of defense and that I had a whole team backing me up. Because everybody is on-call, everybody has experience doing this and knowing that it’s okay and expected to escalate, for me to escalate an issue, when I don’t know how to solve that issue or I had never seen something like that before… Yeah. It makes it feel safe to be on-call. Also, we don’t get paged very much in the middle of the night. Most of our issues tend to be during working hours because we are a bit more proactive with our response to things.

Emily Nakashima:

Yeah, that’s a great point.

Ben Hartshorne:

That’s a good point, and it’s a direct side effect of the way that we run these things. Nobody likes getting woken up in the middle of the night, and developing software in a way that ensures it has the right fallbacks, the right scaling properties, and so on is a hugely important part of that process.

Emily Nakashima:

Absolutely. All right. We’ve got one last question, it looks like, which is: What percentage of the team should be fully proficient at handling on-call P1 issues? Oh, gosh. That’s a really sticky one. Alyson and Ben — either of you feel compelled to take a guess at that? The answer is… So it’s really squishy, just because, as I said, we have all those different types of P1 issues, and so there really is not one engineer who can handle every single P1 issue that comes up.

40:24

Alyson van Hardenberg:

Yeah. I would almost say… 100% and 0%. We should all know how we should handle P1 issues. Whether or not we know how to resolve them, we should all know the escalation patterns and where to look for the runbook. But nobody is expected to be fully proficient in handling all the P1 issues on their own.

Emily Nakashima:

That’s a great way of putting it. I think that… Go ahead.

Ben Hartshorne:

We are still a small company, and when really severe issues show up, when… I’m not gonna say when the site is down, but when there are major features that are not working correctly, there are definitely, within the engineering organization, teams of people that are helpful for managing different aspects of the infrastructure. We sort of split a little bit into a platform team and a product team. So depending on where it goes, escalation is an expected part of being on-call.

So especially for the folks that are more recently on-call, escalation happens more often. For the folks who have been on-call for a long time, escalation still happens, and it’s a way of ensuring that you understand: When is this really a severe issue? And how can we resolve it most effectively? And bring in the right people necessary to make sure that happens? When things are really on fire, you know, we’ll send Slack pings or text messages or call people up on the phone and say: Yes, we need some help! And I think that’s actually just fine. Trying to believe that we can handle everything on our own is a way of making it so that you reinforce that idea, that a few people handle everything. So yeah. Not everybody handles everything. And that’s a way of making sure that the whole team can handle everything.

Emily Nakashima:

Yeah, I really, really like that framing. I think maybe it’s worth emphasizing that the most important trait to be successful on call at Honeycomb is just having good judgment and being a clear communicator. Because so often what you’re doing is just making the call of: Can I handle this myself? Or do I need to ask for help? And that’s sometimes a harder problem than being able to understand the actual technical issue that you’re getting into.

Cool. So that is all the time we have for questions. If you have any more questions, please do feel free to reach out to us through our website. There’s a little chat bubble in the right-hand corner. Or respond to any of the webinar emails. We’re happy to help or point you to resources. And just a reminder that we love feedback. So tell us how we did, using the “rate this” tab under the player, and as I said, this video will be available on-demand too. We’ll send you an email when it’s available. Please do check out those resources. We’ll send them out if you want to go more in-depth in any of this content. There’s a lot of good stuff, especially in the eGuides and eBooks we have. Well, thank you, Alyson and Ben, for joining us today, and thank you to everyone listening in. We’ll see you next time!

If you see any typos in this text or have any questions, reach out to marketing@honeycomb.io.
