Overcome Alert Fatigue with Service Level Objectives (SLOs)
If you find it challenging to sift through alerts for crucial notifications, we want to show you how Service Level Objectives (SLOs) help reduce alert fatigue and improve system reliability.
Actionable Service Level Objectives (SLOs) Based on What Matters Most
In this session, we discuss the inherent dangers of alert fatigue that are normalized in monitoring-based alerting systems and how the combination of SLOs with structured event data provides a more beneficial experience than using time-series data or aggregated counts.
Service Level Objectives as Code: Terraforming Honeycomb SLOs
In March, we announced official support for a Honeycomb Terraform Provider. Today, we’re announcing additional support for managing Honeycomb Service Level Objectives (SLOs) with Terraform. This furthers Honeycomb’s support for configuration as code and…
Ep. #52, Service Level Objectives with Alex Hidalgo of Nobl9
In episode 52 of o11ycast, Charity and Jess speak with Alex Hidalgo of Nobl9. Alex shares his formative experiences advocating for reliability, insights on utilizing error budgets, and the attributes needed to leverage senior-level influence within a socio-technical environment.
Honeycomb Service Level Objectives (SLOs)
In this three-minute video, you’ll see how Honeycomb’s actionable SLOs can help you get to the source of an issue faster. Using a real production SLO (latency per-event) as an example, we walk you through what exhaustion time alerts are and how to configure them, as well as how to use a heatmap to investigate and take action when things happen.
Debuggable Service Level Objectives
Honeycomb’s Service Level Objectives (SLOs) offer more actionable alerts with less noise. They’re also integrated right into your debugging workflows.
Working Toward Service Level Objectives (SLOs), Part 1
In theory, Honeycomb is always up. Our servers run without hiccups, our user interface loads rapidly and is highly responsive, and our query engine is lightning fast. In practice, this isn’t always perfectly the…
SumUp Uses Honeycomb to Improve Service Quality and Strengthen Customer Loyalty
Growing pains can be a natural consequence of meteoric success. We were reminded of that in our recent panel discussion with SumUp’s observability engineering lead, Blake Irvin, and senior software engineer Matouš Dzivjak. They shared how SumUp’s rapid growth spurt compelled them to change their resolution process—both logistically and culturally—to ensure a service level quality that reflects their customer obsession.
Exploring AWS Costs Beyond the Service Level
This post will talk about using a derived column to directly connect individual customer experiences to the cost of providing that service with AWS Lambda. By leveraging these tools, we can better understand when our product is used in costly ways, and also provide tooling to better analyze and understand the cost effects of configuration changes.
Honeycomb Supports Service Ownership
The software industry is moving toward teams that own the services they build. This concept encompasses principles and possibilities from movements toward microservices, DevOps, Agile, and Project to Product. In these paradigms, a team of people delivers software that provides valued capabilities. These capabilities help customers get their work done, support business operations, or enable other software to do so. Writing code is only part of this; capabilities only work if the software is running in production. Service-ownership teams carry this responsibility. To own production, a team needs visibility into production. Honeycomb recognizes service ownership and supports it.
A Better Environment for Observability, at Your Service
We’ve made some big changes under the hood at Honeycomb to give you better control over how you put your app’s data to work—we’ve expanded our core data model with formal Environments and Services!…
SRE + Honeycomb: Observability for Service Reliability
As a Customer Advocate, I talk to a lot of prospective Honeycomb users who want to understand how observability fits into their existing Site Reliability Engineering (SRE) practice. While I have a passing familiarity with the discipline, I wanted to learn more about what SREs do in their day-to-day work so that I’d be better able to help them determine if Honeycomb is a good fit for their needs.
The Case for SLOs
With one key practice, it’s possible to help your engineers sleep more, reduce friction between engineering and management, and simplify your monitoring to save money. No, really. We’re here to make the case that setting service level objectives (SLOs) is the game changer your team has been looking for.
Authors’ Cut—Gear up! Exploring the Broader Observability Ecosystem of Cloud-Native, DevOps, and SRE
You know that old adage about not seeing the forest for the trees? In our Authors’ Cut series, we’ve been looking at the trees that make up the observability forest—among them, CI/CD pipelines, Service Level Objectives, and the Core Analysis Loop. Today, I’d like to step back and take a look at how observability fits into the broader technical and cultural shifts in technology: cloud-native, DevOps, and SRE.
Authors’ Cut—Actionable SLOs Based on What Matters Most
SLOs—or Service Level Objectives—can be pretty powerful. They provide a safety net that helps teams identify and fix issues before they reach unacceptable levels and degrade the user experience.
But SLOs can also be intimidating. Here’s how a lot of teams feel about them: We know we want SLOs, we’re not sure how to really use them, and we don’t know how to debug SLO-based alerts.
Conditional Distributed Tracing
Distributed tracing is generally a binary affair—it’s off or on. Either a trace is sampled or, according to a flag, it’s not. Span placement is also assumed to be an “always-on” system where spans are always added if the trace is active. For general availability and service level objectives, this is usually good enough. But when we encounter problems, we need more. In this talk, we’ll show you how to “turn up the dial” with detailed diagnostic spans and span events that are inserted using dynamic conditions.
Ask Miss O11y: Is There a Beginner’s Guide On How to Add Observability to Your Applications?
Dear Miss O11y,
I want to make my microservices more observable. Currently, I only have logs. I’ll add metrics soon, but I’m not really sure if there is a set path you follow. Is there a guide of some sort, or a best practice, like you have to have x kinds of metrics?
I just want to know what all possibilities are out there. I am very new to this space.
Authors’ Cut Spark Notes Edition: Jumpstart Your Observability Journey
George Miranda, Liz Fong-Jones, and Charity Majors held a series of live discussions called the Authors’ Cut to bring core concepts of the book to life by applying them to real-world use cases. Now that the series is complete, we thought it would be helpful to combine all of the discussion recaps for your viewing pleasure. Each blog post below takes key concepts from chapters in the book and makes them more digestible.
NS1 Implements Honeycomb to Democratize Their Code and Spark Customer Joy
The line from observability to customer joy is straighter than you think. We recently learned this from NS1, a managed DNS provider and Honeycomb customer, in a panel discussion with Nate Daly, Head of Architecture at NS1, and Chris Bertinato, Software Architect at NS1.
Touching Grass With SLOs
One of the things that struck me upon joining Honeycomb was the seemingly laissez-faire approach we took towards internal SLOs. From my own research (beginning with the classic SRE book, following Google’s example), I came to these conclusions:
- SLOs are strict. They aren’t as binding as an SLA, but burning through your error budget is bad.
- SLOs/SLIs need to be documented somewhere, with a formal specification, and approved by stakeholders.
- SLOs should drive customer-level SLAs.
- Teams should be mandated to create a minimum number of SLOs for the services they own.
New Honeycomb Features Raise the Bar for What Observability Should Do for You
As long as humans have written software, we’ve needed to understand why our expectations (the logic we thought we wrote) don’t match reality (the logic being executed). To that end, we developed techniques to help measure reality—logging text strings, or capturing aggregated metrics—and persevered, seeking out newer and fancier logging or monitoring solutions over the intervening decades.
Honeycomb Welcomes New Field CTO
I am thrilled to share with you that Honeycomb now has a Field CTO: our very own Liz Fong-Jones.
How Do I Do Availability Checks in Honeycomb?
Let’s dig into what we mean by an Availability Check and how that maps to observability, tracing, and supporting production systems.
Authors’ Cut—Not-So-Distant Early Warning: Making the Move to Observability-Driven Development
Observability is about understanding systems, which means more than just production. Moving from logs to tracing and showing causality can be done locally, as well. We can give developers the same superpowers that SREs have: observability-driven development.
Top Takeaways from Monitorama 2022
Two of our folks went to Monitorama 2022, and they gleaned a few pearls of wisdom they’d love to share with you, including an unexpected, but surprisingly insightful talk on carbon impact reporting. Read more now.
Honeycomb Pro: Now With Metrics & SLOs
Honeycomb Pro is about to get even better. Starting today, all Pro accounts have access to Honeycomb Metrics and two Service Level Objectives (SLOs), previously only available to Enterprise accounts. Full disclosure: Later this…
How Reliability and Product Teams Collaborate at Booking.com
This article originally appeared on the Booking.com engineering blog. For more by the author, visit his blog www.codecapsule.com. With more than 1.5M room nights booked per day, Booking.com requires a solid infrastructure that’s constantly…
ICYMI: Honeycomb Developer Week: The Partner Ecosystem
We know that you value collaboration. That’s why we share incident reviews and learnings—because we believe the entire community benefits by working together transparently. In the spirit of working better together, we invited ecosystem…
How to Effectively Lead High-Performing Engineering Teams
What are the foundational elements of a high-performance engineering team? While there’s no silver bullet, a few common threads make up the fabric of engineering teams that set the standard for velocity, quality, and…
Hone Your Observability Skills at Honeycomb Developer Week
Is the lack of time holding you back from sharpening your observability (o11y) skills? Maybe you’ve dipped your toes into o11y, but you’re not sure how and where it will drastically improve your team’s…
Scaling Kafka at Honeycomb
When you send telemetry into Honeycomb, our infrastructure needs to buffer your data before processing it in our “retriever” columnar storage database. For the entirety of Honeycomb’s existence, we have used Apache Kafka to…
Honeycomb Differentiators Series: SLOs That Tell the Whole Story
In the recent past, most engineering teams had a vague notion of what Service Level Agreements (SLAs) and Service Level Objectives (SLOs) were—mainly things that their more business-focused colleagues talked about at length during…
Tale of the Beagle (Or It Doesn’t Scale—Except When It Does)
If there’s one thing folks working in internet services love saying, it’s: Yeah, sure, but that won’t scale. It’s an easy complaint to make, but in this post, we’ll walk through building a service…
How Vanguard used Observability to Accelerate and De-risk their Cloud Migration
Rich Anakor, chief solutions architect at Vanguard, is on a small team with a big goal: Give Vanguard customers a better experience by enabling internal engineering teams to better understand their massively complex production…
Shipping on a Spent Error Budget
Modern software services are expected to be highly available, and running a service with minimal interruptions requires a certain amount of reliability-focused engineering work. At the same time, teams also need to build new…
Data Availability Isn’t Observability
But it’s better than nothing… Most of the industry is racing to adopt better observability practices, and they’re discovering lots of power in being able to see and measure what their systems are doing….
One Year of Graviton2 at Honeycomb
A year ago, we wrote about our experiences as early adopters of Graviton2, and how we were able to see 30% price-performance improvements on one dogfood workload from switching to the arm64 architecture. In…
Honeycomb Raises $20M to Define the Future of Observability
I’m delighted to announce that Honeycomb has raised $20M in Series B funding, led by e.ventures Growth, with participation from existing investors Scale Venture Partners, Storm Ventures, Next World Capital, and Merian Ventures, and…
Honeycomb’s 2020 Blog Roundup
We’re here at last: the final days of 2020. Let’s take a look back at this year’s most popular Honeycomb blog posts. Observability 101 In Observability 101: Terminology and Concepts, Shelby Spees reflects on…
Setting Business Goals with SLOs
‘Tis the season to set 2021 goals. Whether setting OKRs, KPIs, KPAs, MBOs, or any other flavor of goal-setting frameworks in an endless sea of acronym soup, chances are that you’re still dealing with…
Outreach Engages Their Production Code with Honeycomb
Outreach is the number one sales engagement platform with the largest customer base and industry-leading usage. Outreach helps companies dramatically increase productivity and drive smarter, more insightful engagement with their customers. Outreach is a…
Incident Review: Meta-Review, August 2020
Every once in a while, teams or systems hit an inflection point where enough things change at once and the pattern of incidents shifts. We found ourselves at an inflection point like that last week.
Spread the Love: Appreciating Our Pollinators Community
Have you heard the buzz about observability with Honeycomb 🐝? It’s the best tool on the market for observing your systems in real time to reduce toil and delight users. But don’t listen to us, listen to our kickass community of “Pollinators”–this blog post is dedicated to them 💖
Bees Working Together: How ecobee’s Engineers Adopted Honeycomb
At ecobee, adopting Honeycomb started as a grassroots effort. Engineers signed up for the free tier and quickly started sharing insights with teammates. When it came time for ecobee to make the “build vs. buy” decision for observability tooling, sticking with Honeycomb was the clear choice.
Sharing Context Across Space and Time: Honeycomb for Teams
When Charity and I started pitching Honeycomb, we had a “bit” we would do, on the importance of building for teams: I’d identify her as the {Kafka, Mongo, insert tech-of-the-moment here} expert on the…
Challenges with Implementing SLOs
A few months ago, Honeycomb released our SLO — Service Level Objective — feature to the world. We’ve written before about how to use it and some of the use scenarios. Today, I’d like…
Honeycomb SLO Now Generally Available: Success, Defined.
Honeycomb now offers SLOs, aka Service Level Objectives. This is the second in a set of essays on creating SLOs from first principles. Previously, in this series, we created a derived column to…
Velocity (& Reliability) – Two must-haves for every software engineering team
(Field notes from O’Reilly’s Velocity 2019 Show, San Jose.) It was steamy hot in San Jose during O’Reilly’s Velocity show and the normally frigid AC temps in the expo hall were welcomed by all…
How We Manage Incident Response at Honeycomb
When I joined Honeycomb two years ago, we were entering a phase of growth where we could no longer expect to have the time to prevent or fix all issues before things got bad. All the early parts of the system needed to scale, but we would not have the bandwidth to tackle some of them graciously. We’d have to choose some fires to fight, and some to let burn.
Surface and Confirm Buggy Patterns in Your Logs Without Slow Search
Incidents happen. What matters is how they’re handled. Most organizations have a strategy in place that starts with log searches—and logs/log searching are great, but log searching is also incredibly time consuming. Today, the goal is to get safer software out the door faster, and that means issues need to be discovered and resolved in the most efficient way possible.
Honeycomb, Meet Terraform
The best mechanism to combat proliferation of uncontrolled resources is to use Infrastructure as Code (IaC) to create a common set of things that everyone can get comfortable using and referencing. This doesn’t block the ability to create ad hoc resources when desired—it’s about setting baselines that are available when people want answers to questions they’ve asked in the past.
Authors’ Cut—Shifting Cultural Gears: How to Show the Business Value of Observability
How do you solve the people and culture problems that are necessary in making the shift to adopt observability practices? And once you instill those changes, how do you measure the benefits?
Incident Review: Working as Designed, But Still Failing
A few weeks ago, we had a couple of incidents that ended up impacting query performance and alerting via triggers and SLOs. These incidents were notable because of how challenging their investigation turned out to be. In this review, we’ll go over interesting patterns associated with growth, and complex systems—and how these patterns challenged our operations.
On Counting Alerts
A while ago, I wrote about how we track on-call health, and I heard from various people about how “expecting to be woken up” can be extremely unhealthy, or how tracking the number of disruptions would actually be useful. I took that feedback to heart and wanted to address the issues they raised, and also provide some numbers that explain the position I took with these metrics.
Tracking On-Call Health
If you have an on-call rotation, you want it to be a healthy one. But this is sort of hard to measure because it has very abstract qualities to it. For example, are you…
Ask Miss O11y: Pls ELI5 TLAs like PRO, SRE, and SLOs!
Dear Miss O11y, I’m confused by all of the Three Letter Acronyms (TLAs) that have started popping up lately. This week, I got an email from Honeycomb saying that the “PRO” plan now has…
On the Brittleness of Dashboards
Dashboards are one of the most basic and popular tools software engineers use to operate their systems. In this post, I’ll make the argument that their use is unfortunately too widespread, and that the…
Ask Miss O11y: Long-Running Requests
Dear Miss O11y, How do I think about instrumenting and setting service-level objectives (SLOs) on streaming RPC workloads with long-lived connections? We won’t necessarily have a “success” metric per stream to make a percentage…
The Five Characteristics of a Good SLO
This guide covers the basics of SLOs, why their use is preferred as a leading-edge practice in observable systems, and how to ensure your SLOs are set up effectively.
ICYMI: Honeycomb Developer Week Wrap-Up
Getting started with observability can be time consuming. It takes time to configure your apps and practice to change the way you approach troubleshooting. So it can be hard to prioritize investing time, especially…
Ask Miss O11y: Load Testing With Fidelity
Dear Miss O11y, My developers and I can’t agree about what the right approach is for running load tests in production. Should we even be running load tests against our production infrastructure or is…
Honeycomb Expansion in Europe Fueled by New Series C Investment Led by Insight Partners
Former Pivotal, SignalFx, and Chef Software executive, Andy Hawkins, joins as Regional Director to lead Honeycomb’s European presence and growth LONDON, October 21, 2021 – Honeycomb, the leading observability platform used by high-performance engineering…
Vanguard’s Adoption Journey: How Honeycomb Helps Shape Developer Workflows
After evaluating multiple approaches to distributed tracing, Vanguard ultimately landed on using OpenTelemetry and Honeycomb. Now, they have hundreds of teams using Honeycomb, with a different mentality to the way they run and manage production. One example is a team using SLOs for a critical service. A burn alert came through, and they were able to remediate this issue before it became customer-impacting.
The State of Observability in 2021
Observability adoption has increased as more companies seek to understand how their applications behave in production and quickly identify and resolve problems. Our second annual observability maturity report is the first that shows a…
Achieving Production Excellence at Scale
Whether you’re a startup building new services from scratch or in a brownfield enterprise environment, this webinar offers expert advice on how to get started and how to measure the ROI of implementing modern software practices like progressive delivery, observability, and service-level objectives (SLOs).
Refine Your Observability Experience at Scale
Today, we announced that Refinery is now generally available. With Refinery, it’s now easy to highlight the critical debugging data you need and to stop paying for the rest. Refinery is a sampling solution…
Honeycomb Introduces Refinery, a New Solution to Optimize Observability for Enterprises at Scale
San Francisco, March 2, 2021 – Honeycomb, the company that pioneered the first commercial observability solution to understand, debug, and improve distributed production systems, announced today a new solution, Refinery, to help enterprises refine…
Sweetening Your Honey
Are you looking for a better way to troubleshoot, debug, and really see and understand what weird behavior is happening in production? Service-level objectives (SLOs) and observability can help you do all that—but they…
Identifying Hidden Dependencies
Learn how Honeycomb improved the reliability of our Zookeeper, Kafka, and stateful storage systems through terminating nodes on purpose.
SLOs: Uniting Engineering and Business Teams Behind Common Goals
A 451 Research | Business Impact Brief:
If an app/service performs poorly, how likely are you to switch to a different brand? Turns out 79% claim very or somewhat likely. SLOs are now a best practice approach to help engineers and business stakeholders understand what to measure about their service for a consistent quality customer experience.
SLO Theory: Why the Business Needs SLOs
Now, engineering and business speak the same language. Find out why you should care, how SLOs are critical to SRE practice, and how to keep your customers happy.
Honeycomb Announces the First Event Driven Debugging Product for Modern Software
Empowering engineering teams to truly understand how their software works in production SAN FRANCISCO, April 11, 2017 /PRNewswire/ — Honeycomb announces general availability of its flagship event driven debugging product. Built for engineers with operational responsibilities–software, infrastructure,…
Honeycomb Launch!
Many many of you have been asking when we’ll be “launched”, in “production”, taking “money”, or “GA”. Well, here you go! 🙂 A big THANKS to all our early users, our first paying customers,…
Part 5/5: Building Badass Engineers and Badass Teams
No matter how much we love technology, it is always a means to an end. The mission comes first – we don’t do tech for its own sake, we use tech to get the…