
Honeycomb Service Level Objectives (SLOs)

In this three-minute video, you’ll see how Honeycomb’s actionable SLOs can help you get to the source of an issue faster. Using a real production SLO (per-event latency) as an example, we walk you through what exhaustion-time alerts are and how to configure them, as well as how to use a heatmap to investigate and take action when things happen.


Pierre Tessier [Director, Solutions Architect | Honeycomb]:

Let’s talk about Honeycomb’s actionable SLOs and how they can help you. Here, we’re looking at a real production SLO, per-event latency. This is an SLO that computes the time it takes Honeycomb to process every single event ingested into our platform. In large batches, we’re allowed five milliseconds per individual event inside that batch, as defined by this SLI here, which also applies a couple of other rules. We’re holding ourselves to a four-nines goal over 30 days. Right now, we’re holding just below 35%, holding steady. To help us stay on top of our SLO, we’ve got exhaustion-time alerts, and we have two of them configured here. My first one, at 24 hours, goes to Slack; my four-hour one goes to PagerDuty. That way, PagerDuty only notifies me when something’s really happening, and if it’s a slow burn, it goes to Slack and we get to it when we have time.

Now, our SLO itself also has a heatmap that allows us to investigate and take action when things happen. Let’s go look at a time period where something actually did happen, right here in this little downslope. That was about three days ago, and I can get there by hitting these arrow buttons. I’ll hit this arrow button three times to go back three days, and Honeycomb will go back in time. You can see the shaded indicator right here, where we’re selected over our drop. My heatmap has this yellow-gold colored portion: these are the events that errored, that violated my SLI’s definition. All the others are the successful events, the ones we like.

Now, what Honeycomb does is almost magical. It takes every single attribute on every event that errored and compares them against everything else to tell you why it’s happening. Down here, I’ve got the results laid out for me. I can see which IP addresses they come from, and I know which team is responsible. If I hover over this team ID, 62% of all my errors were coming from the exact same team, or tenant, or customer in Honeycomb’s parlance. This is big. I can take action on this right here. What’s great is that I did not have to design or define which attributes to include. This is my observability dashboard, based on an SLO, where the charts change based on the data itself, and I don’t have to predefine that drill path.
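The idea behind this comparison can be sketched simply: for each attribute, compare how often each value appears among failing events versus successful ones, and surface the values with the biggest gap. This is a toy illustration of the concept, not Honeycomb’s algorithm; the events and the `team_id` values are made up.

```python
from collections import Counter


def attribute_deltas(failed_events, ok_events, attribute):
    """Rank an attribute's values by how much more common they are
    among failing events than among successful ones."""
    failed = Counter(e.get(attribute) for e in failed_events)
    ok = Counter(e.get(attribute) for e in ok_events)
    deltas = {}
    for value in set(failed) | set(ok):
        f_share = failed[value] / max(len(failed_events), 1)
        o_share = ok[value] / max(len(ok_events), 1)
        deltas[value] = f_share - o_share
    return sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical events: team 42 dominates the failures (62 of 100)
# but is rare among successes, so it floats to the top.
failed = [{"team_id": 42}] * 62 + [{"team_id": 7}] * 38
ok = [{"team_id": 42}] * 10 + [{"team_id": 7}] * 90
top_value, delta = attribute_deltas(failed, ok, "team_id")[0]
assert top_value == 42
```

Running this over every attribute at once is what lets the dashboard adapt to the data without a predefined drill path.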

I can come here and say, let’s filter on this user and continue the investigation from there. When the heatmap comes up, it’s pretty clear when the issue is happening. We can drill down on any of these items, discover more, and understand exactly what was happening, why, and what we could do to fix it—all with the power of SLOs and a couple of mouse clicks.
