
Reliable Alerting for Honeycomb Triggers and SLOs With PagerDuty

Transcript

Liz Fong-Jones [Principal Developer Advocate|Honeycomb]: 

Hi, I’m Liz Fong-Jones, and I’m a Principal Developer Advocate at Honeycomb. Today I’m delighted to be joined by Mandi.

Mandi Walls [DevOps Advocate|PagerDuty]: 

Hi, Liz. Thanks for having me on. I’m Mandi Walls, DevOps advocate at PagerDuty.

Liz Fong-Jones: 

Today we want to share with you how PagerDuty and Honeycomb can work better together. So let’s suppose that I have a query, and I need to find out on my phone really quickly if the condition it watches for has happened. Now for our production services, it might be something like the error rate is too high or maybe my app is crashing, but in this case, in my demo app, what I have is something that’s wired up to EVE Online. And in EVE Online, we’d like to find out when someone is hacking our stuff in order to be able to get online and respond. So you’ll notice here that if I go back 60 days, we do, in fact, have some of these things that have fired in the past. Things like people are attacking our systems or people are attacking our structures. So it’s really helpful to be able to find out and get that notification.

Let’s go ahead and create a notification for that. So I’m going to go ahead and make a trigger. And you can see here that my query results are reproduced, and I can get the count of the number of notifications grouped by what kind of notification and where they’re happening. In this case, nothing has fired in the past couple of minutes, but if it had, I would get a preview of what was going on, and whether or not the alert would have fired. And that lets me, for instance, tune my thresholds, but in this case, I care if there are any events whatsoever, so the count is appropriate.

And I might want to set the frequency to something like run every minute because I want to find out very quickly. And I want to query the last four minutes of data, that way in case there’s a temporary blip, I still get data for the past four minutes. And I’ll go ahead and get those notifications that happened in the last four minutes every minute. Don’t worry about the duplication. PagerDuty handles the de-duplication of these alerts for me.
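[Editor's note: The overlap Liz describes (run every minute, query the last four minutes) means one event can match several consecutive runs. The sketch below simulates that and shows why a shared de-duplication key collapses the repeated firings into a single incident. The function names and the "structure-attack" key are hypothetical and chosen for illustration; this is not how Honeycomb or PagerDuty implement it internally.]

```python
from datetime import datetime, timedelta

FREQUENCY = timedelta(minutes=1)  # how often the trigger runs
WINDOW = timedelta(minutes=4)     # how much data each run looks back over


def evaluate(events, now, threshold=0):
    """Return the events inside the lookback window if their count exceeds the threshold."""
    window_start = now - WINDOW
    matching = [e for e in events if window_start < e["time"] <= now]
    return matching if len(matching) > threshold else []


def run_trigger(events, first_run, runs):
    """Simulate consecutive trigger runs.

    Returns (number of runs that fired, set of distinct incident keys).
    """
    firings = 0
    incidents = set()
    for i in range(runs):
        now = first_run + i * FREQUENCY
        if evaluate(events, now):
            firings += 1
            # Hypothetical de-duplication key: every firing of this trigger
            # reuses the same key, so the firings map to one incident.
            incidents.add("structure-attack")
    return firings, incidents


t0 = datetime(2021, 1, 1, 12, 0, 0)
events = [{"time": t0 + timedelta(seconds=30), "kind": "structure under attack"}]
firings, incidents = run_trigger(events, first_run=t0 + FREQUENCY, runs=6)
print(firings, len(incidents))
```

[With a single event and six runs, the trigger fires on the four runs whose window still contains the event, but only one incident key is ever produced.]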

We’re also going to need to add a recipient. And the good news here is that if I go and pop open the integration center, we can see here that there is a Slack integration and a PagerDuty integration. So I’ve already configured this, but if I wanted to go over and look at PagerDuty, we can kind of walk through how to set it up on the PagerDuty side. Mandi, do you want to walk me through how to do that?

Mandi Walls: 

Yeah, so for your PagerDuty pieces, you’re going to want to go into one of your services in your service directory and find the place where you’re going to be receiving those notifications. So you have that set up, and you go into the integrations. And we’re going to go ahead and push that into that particular service.

Liz Fong-Jones: 

Awesome. Let’s see what happens if I search for Honeycomb. Ah, there we are. So now it tells me that I need to take this integration key, and I need to paste it in over here. We’re going to do that. And now I have the demo test integration working. Now I can go back and create this trigger again. Now let’s actually create it. And as I did before, one minute, four minutes, although, in reality, I’m not going to go and log in to EVE right now and trigger a structure notification. That would be a little bit mean to my friends. But we’re just going to go ahead and create a demo test. And go ahead and create the trigger. It was really that easy. I was able to create a trigger starting from an existing Honeycomb query and just set a condition on one of the variables in the visualize field. And then I will get immediately notified within a minute should that condition be violated. And let’s show you what this looks like. I’m going to violate the rule that, you know, you have to put your phone on mute when you’re giving a presentation because I want to actually get notified here. So let’s hit the test button and cause this to go to PagerDuty.
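[Editor's note: The integration key pasted into Honeycomb is what routes events to the right PagerDuty service. As a rough sketch of what a trigger event can look like on the wire, the code below builds a payload in the shape of PagerDuty's Events API v2. It assumes an Events API v2 style integration; the exact fields Honeycomb sends are not shown in the video, and the key, summary, source, and de-duplication key here are placeholders. The code only builds the JSON body and does not send anything.]

```python
import json

# Public Events API v2 endpoint; shown for reference only, no request is made here.
EVENTS_API_URL = "https://events.pagerduty.com/v2/enqueue"

VALID_SEVERITIES = {"critical", "error", "warning", "info"}


def build_trigger_event(integration_key, summary, source, dedup_key, severity="critical"):
    """Build an Events API v2 style 'trigger' event as a plain dict."""
    if severity not in VALID_SEVERITIES:
        raise ValueError(f"severity must be one of {sorted(VALID_SEVERITIES)}")
    return {
        "routing_key": integration_key,  # the integration key from the PagerDuty service
        "event_action": "trigger",       # other actions are "acknowledge" and "resolve"
        "dedup_key": dedup_key,          # repeated events with this key join the same alert
        "payload": {
            "summary": summary,
            "source": source,
            "severity": severity,
        },
    }


# All values below are placeholders for illustration.
event = build_trigger_event(
    integration_key="YOUR_INTEGRATION_KEY",
    summary="Demo test: structure under attack",
    source="honeycomb-trigger-demo",
    dedup_key="demo-test",
)
body = json.dumps(event, indent=2)
print(body)
```

[Reusing the same dedup_key on every run is what lets a trigger that fires once a minute show up as one alert rather than many.]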

4:25

In a minute, I’m going to get a phone call from PagerDuty that says, hey, your stuff is under attack. There comes a text message. Now comes the phone call. So PagerDuty handles making these things automatically notified, escalated, and resolved when these happen. And that way I don’t have to worry about, you know, is someone going to check their email? What happens if the person doesn’t immediately respond? PagerDuty takes care of these things for me.

For instance, in my EVE alliance, if I don’t respond to a structure notification, it will also go to other members of my team, which is really great because it means that we will always have a way of making sure that someone is available, no matter what the hour, to go and look at it. For instance, we do have people who are in Europe who might want to look at things during European hours rather than waking me up in the middle of the night. What are some of the other cool things that I can do with PagerDuty besides just send notifications, Mandi?

Mandi Walls: 

Oh, my goodness, there’s so much stuff in PagerDuty now. If you haven’t looked at it in a while, there’s a lot of additional things that are in PagerDuty. As Liz has shown one of my favorite integrations, right, between your game and Honeycomb and all those great things, like all the usual stuff is ready for you as well. So right now, PagerDuty as a platform has over 600 different integrations with other kinds of applications. You still have all the stuff that you expected to be there. You’ve got your on-call notifications. You’ve got your escalation policies and your schedules. But we’re also pulling in lots more data from other places for you to not wake up for stuff, right? We don’t want to wake you up any more than you want to be woken up. So there are additional components under automation that are going to help you deal with, you know, things that should be fixed via a very small shell script and not necessarily by a human being that you can put together that way. We acquired a company called Rundeck last year that provides additional automation services there. We have lots of new features coming with Rundeck, actions that you can see in preview now. You get early access if you’ve got a PagerDuty account already. We want to have all of your data, but we don’t necessarily want to wake you up for all of it.

Liz Fong-Jones: 

Yeah, and this is one of the really awesome things about putting Honeycomb together with PagerDuty.

Mandi Walls: 

Yes.

Liz Fong-Jones: 

Is that you don’t necessarily have to wake up for everything. Because Honeycomb is recording the telemetry of what’s going on inside of your system. And therefore that’s all going to be safely there for you when you’re sitting with a cup of coffee reviewing what automatically remediated last night. You don’t have to be sitting there capturing logs or poking your services in order to capture the debug data as the outage is going on. Just automatically fix the outage and stay fast asleep. And then look at it during the day.

Mandi Walls: 

Yeah, we can send information to your ticketing systems if that’s the way your team handles your backlog for things that happened overnight. You can open a ticket and look at it in the morning when everyone is fresh and rested.

Liz Fong-Jones: 

Yeah. So what do you think that people should know the most about how to make PagerDuty work the best for them?

Mandi Walls: 

One of the things that we work with folks a lot on is, you know, just getting stuff into PagerDuty can be a little onerous. Most people will be using the web UI. We do have a Terraform provider that allows you to use that workflow to create your teams, create your services.

Liz Fong-Jones: 

Ooh. Infrastructure as code, right? Like, it turns out that if you’re not Honeycomb, which only has like two or three teams, but instead you’re one of our clients like Vanguard, which has hundreds of teams, it’s relatively easy then to configure a standard set of escalation policies.

Mandi Walls: 

Absolutely. 100%. And we work with very large customers as well who might have dozens if not hundreds of different teams and escalation policies and maybe thousands of different services. And being able to manage all that stuff via Terraform provider is super helpful. That’s a good one for folks who, if you haven’t looked at PagerDuty recently, that’s a place to really dig into. 

8:47

Super helpful.

Liz Fong-Jones: 

Awesome. And the other thing I wanted to give a shout-out to on the Honeycomb side is that while we do have triggers, and triggers are great for kind of static thresholds, we do think that a majority of people who are running production-quality apps should be using Honeycomb service level objectives. When you have SLOs, they help you understand not just what the behavior in the past minute or four minutes is but instead help you manage your service towards a level of reliability over 30 days or 90 days. And that way you’re not going to have alerts flap quite as much. And you’ll only get alerted if there’s a risk of real user impact from too many queries failing over, for instance, a couple of minutes to a couple of hours, taken in context with everything that’s happened over the past 30 to 90 days.

So I strongly recommend that if you are a Honeycomb enterprise customer or if you’re interested in starting a Honeycomb enterprise trial to really try out our service level objective feature because as Mandi says, we don’t want to wake you up, right? So we’d rather have you blissfully sleep through the tiny blip of one out of one queries failed in the middle of the night, and really only alert you if say 10 or a hundred queries failed during the day when you might actually have a trend you can look at and start looking at the data patterns to understand what’s going on? How do I remediate? How do I fix?
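[Editor's note: The difference between a single failed query overnight and a sustained run of failures can be made concrete with error-budget arithmetic. The sketch below assumes a 99.9% target over a 30-day period with one million eligible events, and pages only when the recent failure rate would exhaust the remaining budget within four hours. Those numbers, the function names, and the simple linear projection are illustrative assumptions, not Honeycomb's actual burn alert calculation.]

```python
def error_budget(total_events, target):
    """Number of failed events the SLO tolerates over the whole period."""
    return total_events * (1 - target)


def hours_until_exhausted(budget_remaining, recent_failures, recent_hours):
    """Project how long the remaining budget lasts if the recent failure rate continues."""
    if recent_failures == 0:
        return float("inf")
    failures_per_hour = recent_failures / recent_hours
    return budget_remaining / failures_per_hour


def should_page(budget_remaining, recent_failures, recent_hours, page_within_hours=4):
    """Page a human only if the budget would run out within the paging horizon."""
    projected = hours_until_exhausted(budget_remaining, recent_failures, recent_hours)
    return projected <= page_within_hours


# Assumed example: 1,000,000 events in 30 days at a 99.9% target.
budget = error_budget(total_events=1_000_000, target=0.999)

# One failure in the last hour: the budget would last roughly 1,000 hours.
tiny_blip = should_page(budget, recent_failures=1, recent_hours=1)

# 500 failures in the last hour: the budget would be gone in about 2 hours.
real_trend = should_page(budget, recent_failures=500, recent_hours=1)

print(round(budget), tiny_blip, real_trend)
```

[Under these assumptions the single overnight failure does not page anyone, while the sustained failure rate does.]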

Mandi Walls: 

Yeah, absolutely. And because we know folks are getting so much information from so many places, you’ve got structured data from places like Honeycomb that are adding context and creating better information out of that data. Or you may also be getting customer support tickets. You might also be getting other telemetry and metrics from other places. Having all of that come into one sort of central nervous system provides context and decision rules. And once you’ve got enough data in there, you can make use of our event intelligence and machine learning capabilities, which can help you, you know, really reduce the number of alerts and other information that’s being passed to your human responders over time.

Liz Fong-Jones: 

Awesome. Well, that’s what we have to share with you today about how Honeycomb and PagerDuty work together. Thank you for joining us.

If you see any typos in this text or have any questions, reach out to team@honeycomb.io.
