Ask Miss O11y: Pls ELI5 TLAs like PRO, SRE, and SLOs!
By Liz Fong-Jones | Last modified on May 23, 2022
Dear Miss O11y,
I'm confused by all of the Three Letter Acronyms (TLAs) that have started popping up lately. This week, I got an email from Honeycomb saying that the "PRO" plan now has "SLO" support that "SREs" might find useful. Can you translate these acronyms for me and explain it like I'm five years old (ELI5)?
– Acronymically Addled
I'll try to answer without using a single (new) acronym! First things first—"PRO" refers to our Pro plan, rather than being an acronym in and of itself. Honeycomb Pro is our cost-effective offering for professionals like you who are running a few production workloads! And we're hoping that folks will get even more benefit now that they have access to our SLO feature!
But what is an SLO, you might ask? Great question! That stands for Service Level Objective, which is a way of setting uptime and reliability goals. The core idea dates back to the Apollo lunar missions, which had reliability goals to bring astronauts safely home. The Site Reliability Engineering (SRE) team at Google was one of the first software teams to apply this idea of quantitative reliability to software beginning as early as 2005.
Today, the SRE discipline aims to bring engineering rigor to the field of software operations, bringing together software engineers, systems engineers, and platform engineers to produce appropriately reliable software and the scaffolding to maintain it. SLOs are a foundational part of practicing SRE, because they give us engineering constraints to work with. While SRE is one potential way of accomplishing the reliability and automation goals of DevOps, other approaches to DevOps focused on shift-left ownership or continuous delivery also exist.
The core idea behind an SLO is to decide how reliable you need your service to be, and to adapt how you engineer your service to meet those goals. After all, without goals, all you have is guesswork about what experience your customers are having, and whether it's good enough. A typical SLO for a consumer-facing application might be 99.9% (44 minutes of downtime per month) or 99.99% (4.4 minutes of downtime per month). Setting the right goal is important—too low, and you'll churn customers; too high, and you'll be spinning your wheels trying to deliver more reliability than your customers will notice.
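To make that downtime arithmetic concrete, here's a small sketch (in Python, not part of any Honeycomb tooling) that converts an SLO target into an allowed-downtime budget. A strict 30-day window gives 43.2 minutes at 99.9%; the rounder "44 minutes" figure above comes from using an average calendar month of about 30.4 days.

```python
def allowed_downtime_minutes(slo_target: float, window_days: float = 30) -> float:
    """Minutes of downtime permitted per window at a given SLO target."""
    window_minutes = window_days * 24 * 60
    return (1 - slo_target) * window_minutes

# A strict 30-day window:
print(round(allowed_downtime_minutes(0.999), 1))   # 43.2 minutes
print(round(allowed_downtime_minutes(0.9999), 1))  # 4.3 minutes
```

The same function shows why each extra "nine" is so expensive: every nine you add shrinks the downtime budget by a factor of ten.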
In addition to defining how reliable you want your service to be and over what window of time, an SLO also requires you to define which user workflows are in and out of scope and how to measure whether they are successful. This definition of success for each individual user workflow is a Service Level Indicator (SLI). Once you've defined an SLO and its corresponding SLIs, you can quantify outages and risks to stability in terms of the SLO, rather than guessing about how urgent they are and whether it's really worth waking an engineer up at 3 a.m. And, by letting the current state of the SLO guide how you divide your time, you can finally resolve the question of whether your team should be spending more of it on feature work or on reliability work.
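The key mechanic of an SLI is that every event is classified one of three ways: good, bad, or not applicable (out of scope). Here's a toy sketch of that idea in Python; the event fields and thresholds are illustrative inventions, not Honeycomb's schema:

```python
# Hypothetical request events; field names are made up for illustration.
events = [
    {"route": "/checkout", "duration_ms": 250,  "status": 200},
    {"route": "/checkout", "duration_ms": 1800, "status": 200},  # too slow
    {"route": "/checkout", "duration_ms": 300,  "status": 503},  # server error
    {"route": "/health",   "duration_ms": 5,    "status": 200},  # out of scope
]

def sli(event):
    """Classify an event: True (good), False (bad), or None (not applicable)."""
    if event["route"] != "/checkout":
        return None  # this workflow isn't covered by the SLO
    return event["duration_ms"] < 1000 and event["status"] < 500

eligible = [e for e in events if sli(e) is not None]
good = sum(1 for e in eligible if sli(e))
print(f"SLI compliance: {good}/{len(eligible)}")  # 1/3
```

Note that the health-check event doesn't count against the budget at all; scoping the SLI correctly matters as much as picking the target number.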
So now that we've defined these acronyms, what does this mean for you as a customer? In short, Honeycomb SLOs make it easier for software delivery teams to define, measure, and debug the reliability of their software. While many other solutions exist for measuring your SLO in retrospect using traditional monitoring, Honeycomb SLOs uniquely let you use your existing Honeycomb traces and telemetry data to proactively manage and act to stay within your objectives.
This means that you can find out about a user-impacting issue before it's too late, while not getting woken up for false alarms. When you inspect the SLO that is at risk, you will immediately see how bad it is in the context of historical reliability, which customers are worst impacted, and what might have triggered the outage—all on one screen and within one tool.
With two SLOs now available on the Honeycomb Pro plan, you can simultaneously measure reliability, in terms of both acceptable latency and successful status, for two different workloads. You'll want to tell Honeycomb how your SLIs should be defined, so that we can categorize incoming trace spans in Honeycomb as good, bad, or not applicable to each SLO. Honeycomb's Derived Column syntax is flexible enough to accommodate many different use cases, but to start, let's work through a simple latency-and-status-code example: a checkout workflow that must succeed in under 1 second without returning an HTTP 5xx code.
You might create a derived column to represent the SLI that looks something like this:
IF(AND(EQUALS($http.route, "/checkout"), NOT(EXISTS($trace.parent_id))), AND(LT($duration_ms, 1000), LT($http.status_code, 500)))
(The NOT(EXISTS($trace.parent_id)) clause restricts the SLI to root spans, so each checkout request is counted exactly once.)
And then you'd configure the SLO to specify how often that SLI should be met; for instance, specifying that 99.9% of checkout requests should succeed over a 30-day window. You then would be able to ask Honeycomb to proactively alert you if you were in danger of violating your reliability goal, and get instant insights into how to fix that checkout flow if something were to go wrong.
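To see how an event-based SLO like that is evaluated, here's a rough sketch (again in Python, purely illustrative, not how Honeycomb implements it) of checking how much of the error budget a window's traffic has consumed:

```python
def error_budget_consumed(good: int, total: int, target: float = 0.999) -> float:
    """Fraction of the error budget used so far in the window.

    A value above 1.0 means the SLO has already been violated;
    values approaching 1.0 are the ones worth alerting on.
    """
    allowed_failures = (1 - target) * total  # budget of bad events
    failures = total - good
    return failures / allowed_failures if allowed_failures else float("inf")

# 1,000,000 checkout requests in the window, 400 of them bad:
consumed = error_budget_consumed(good=999_600, total=1_000_000)
print(f"{consumed:.0%} of the error budget consumed")  # 40%
```

Alerting on how fast that fraction is climbing (the "burn rate"), rather than on any single failed request, is what lets an SLO page you before the budget is gone without waking you for noise.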
To find out more, see https://honeycomb.io/production-slos and https://docs.honeycomb.io/working-with-your-data/slos/.