SLOs, SLAs, SLIs: What’s the Difference?
Table of contents
- What are SLOs?
- What are SLAs?
- SLAs vs SLOs
- What are SLIs?
- How do SLOs and SLAs work with SLIs?
- SLO best practices
- Defining the right SLOs for your organization
- Event-Based SLOs vs. Time Series-Based SLOs
- Honeycomb SLOs
At the end of the day, so much of software development is wrapped up in two basic concepts: happy customers and/or end users resulting from well-running apps and services, coupled with swift incident resolution that prevents them from becoming unhappy when things go wrong. And, the reality is, today’s complex and distributed systems fail, so it’s imperative to have the right tools in place to quickly resolve any issues. This way, your team can spend less time fixing bugs from the last release, and more time focusing on building the next cool feature.
One of the most effective tools for this is the service level objective (SLO). On the surface, SLOs are simply an internal service performance metric, but they’re capable of doing far more than that. A correctly crafted and implemented set of SLOs can help manage customer satisfaction, bring engineering teams together, drastically improve incident response, and drive organization-wide alignment on the importance of reliability as it relates to the business.
It’s a tall order, for sure, and SLOs, as well as their service level compatriots SLAs (service level agreements) and SLIs (service level indicators), need to be well thought through and maintained for maximum benefit.
To set yourself up for success, it’s best to define each service level. Then, we’ll dive into best practices for creating your own.
What are SLOs?
Service level objectives are performance goals teams set for themselves as a way to objectively evaluate how a service or product is operating. On the surface, an SLO looks like a simple metric (“we’ll achieve 96% uptime over the course of 30 days”), but the most effective service level objectives are nuanced and tied closely to two other service metrics: service level indicators (how are things running?) and service level agreements (what level of service have we promised the customer?). They also play a critical role in incident response.
SLOs are the way teams know if they’re meeting service level agreements with customers, and they’re also a clear indication of customer satisfaction or happiness with performance levels. In other words, how much failure is acceptable before the alarm bells go off?
They have another superpower, too: SLOs can serve as a jumping-off point for debugging and can be a far more effective early warning system than traditional alert-based monitoring. Carefully chosen SLOs reduce alert fatigue, give teams more confidence when alerts do go off, and can even help engineers prioritize their work.
But at their core, SLOs are a big juggling act as teams try to balance happy customers with development velocity, business objectives, and an understanding of the importance of reliability across the organization. Done right, an SLO can bring everyone to the same page: customers know what level of reliability to expect, developers and operations can agree on the tug-of-war between new features and stable performance, and the entire organization can unite around a shared responsibility for reliability and customer satisfaction.
Generally, an SLO outlines the service involved, a clearly understood performance goal, a timeframe, and when and where the goal is measured. The trick with an SLO, though, is not to aim too high: the goal should represent a doable number with some padding around it. For example, aim for 95% uptime rather than 100% uptime, because 100% is nearly impossible and very impractical to achieve. A 100% target leaves no error budget at all: the first error has already blown it.
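To make that outline concrete, here is a minimal Python sketch of an SLO as a small record, plus the error budget it implies. The class, field names, service name, and event counts are hypothetical and chosen only for illustration; they are not taken from any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """The parts of an SLO described above (field names are illustrative)."""
    service: str       # the service involved
    target: float      # performance goal as a fraction, e.g. 0.95
    window_days: int   # timeframe the goal is evaluated over
    measured_at: str   # where the goal is measured

    def allowed_failures(self, total_events: int) -> int:
        """Error budget: events that may fail before the SLO is missed.

        Rounded to the nearest whole event to sidestep floating-point noise.
        """
        return round(total_events * (1 - self.target))


realistic = SLO("checkout-api", 0.95, 30, "load balancer")
perfect = SLO("checkout-api", 1.0, 30, "load balancer")

print(realistic.allowed_failures(1_000_000))  # 50000
print(perfect.allowed_failures(1_000_000))    # 0
```

With a 100% target the budget is zero, which is why a single error is enough to miss the objective.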
Also, teams should make SLOs both something that can easily be measured and something that matters to the customer and/or the business. There is no point in creating an SLO that measures something meaningless when it comes to performance or is completely untethered from key business objectives.
What are SLAs?
Service level agreements are how a provider and customer contractually agree on performance factors like uptime and responsiveness. SLAs usually contain a number of service level objectives that spell out in detail what is expected from the service provider. Setting up an SLA is a very common practice in the information technology (IT) industry.
Although common, SLAs aren’t exactly popular with IT and DevOps teams in large part because they’re often drafted in a tech vacuum by legal or sales teams. SLAs need to find the right balance of what will make a customer satisfied and what is technically achievable, and that’s often very difficult to come by. SLAs can be prone to the same problem that plagues SLOs: they focus on metrics that aren’t relevant and are hard to measure and achieve.
A typical SLA would spell out the service type and level, relevant timeframes, expectations and responsibilities, a list of measurement metrics, and any penalties involved.
The most successful SLAs bring IT and/or DevOps teams to the table to ensure the contract reflects technical realities and business needs on both sides.
SLAs vs SLOs
An easy way to distinguish between service level agreements and service level objectives is external versus internal. SLAs are a performance contract between a service provider and an external customer. SLOs are internal outcome goals established to ensure compliance with the contract. SLOs can also be created for internal team use to measure reliability, performance, or other metrics, but an SLA would never exist on its own.
What are SLIs?
In the service level universe, SLIs are the building blocks of SLOs. An SLI defines an indicator (the I in SLI) that specifies whether a given measurement is successful or not. The SLO then defines how many failures are acceptable out of the total number of measurements taken by the SLI, turning the indicator’s success rate into an objective (the O in SLO).
Under an SLI, every valid measurement has exactly one of two outcomes: success or failure. The measurement can be a single request (it must have a status below 400 and take under 500 milliseconds) or an aggregate value (the 95th percentile for this minute must be below 150 milliseconds).
The SLO takes all these SLI results and asks, “Over a given time window, what’s the rate of success? Is it 95%, 99%, or even 99.9%?”
With the above in mind, the SLI is the building block that defines individual measurements, and the SLO is the aggregate success rate we aim to have.
To put it simply:
- SLI: is this event or measurement successful?
- SLO: how successful do we want to be in aggregate?
- SLA: how successful do we have to be in aggregate before we owe reparations to customers?
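The SLI and SLO definitions above can be sketched in a few lines of Python. This is an illustration only: the `Request` type and function names are made up, and the thresholds are the ones from the single-request example earlier in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    status: int
    duration_ms: float


def sli(request: Request) -> bool:
    """SLI: is this one request successful? (status < 400 and under 500 ms)"""
    return request.status < 400 and request.duration_ms < 500


def success_rate(requests: list[Request]) -> float:
    """Fraction of requests in the window that pass the SLI."""
    if not requests:
        raise ValueError("no measurements in the window")
    return sum(sli(r) for r in requests) / len(requests)


def meets_slo(requests: list[Request], target: float) -> bool:
    """SLO: is the aggregate success rate at or above the target?"""
    return success_rate(requests) >= target


requests = [
    Request(200, 120.0),
    Request(200, 480.0),
    Request(500, 90.0),   # fails: server error
    Request(200, 750.0),  # fails: too slow
]
print(success_rate(requests))     # 0.5
print(meets_slo(requests, 0.95))  # False
```

The SLI answers a yes/no question per measurement; the SLO only ever looks at the aggregate of those answers.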
How do SLOs and SLAs work with SLIs?
The “service levels” work closely together, but they’re not interchangeable. Service level agreements are sort of the umbrella “business to customer” performance agreement and they contain service level objectives that clarify expectations on both sides. Those SLO “expectations” are measured by a service level indicator, but SLIs are unlikely to be included in an SLA because they’re a metric for the service provider’s internal consumption and not aimed at the customer.
An SLO doesn’t require an SLA to be an effective way for teams to measure services internally, and in that case, they would likely use an SLI as the measurement metric. An SLI could also stand alone as a way to track the performance of a feature or service and not necessarily be tied into either an SLO or an SLA. We at Honeycomb do that internally: we compare the SLI’s success rate across builds to know whether a new build is similar enough to keep rolling out across environments, or bad enough to stop.
SLAs, SLOs, and SLIs are all tools organizations can leverage to improve overall system performance, home in on the right upgrades and feature improvements, and develop a robust incident response plan that will save time, money, and corporate reputations.
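A standalone SLI check across builds might look something like the Python sketch below. The article does not describe how the comparison is actually done, so the function names and the 1% tolerance (`max_drop`) are assumptions made purely for illustration.

```python
def success_rate(results: list[bool]) -> float:
    """Fraction of SLI results that were successes."""
    if not results:
        raise ValueError("no SLI results to compare")
    return sum(results) / len(results)


def safe_to_keep_rolling_out(
    baseline: list[bool], candidate: list[bool], max_drop: float = 0.01
) -> bool:
    """True if the candidate build's success rate is within max_drop of the baseline's.

    max_drop is a hypothetical tolerance; pick one that suits your service.
    """
    return success_rate(baseline) - success_rate(candidate) <= max_drop


baseline = [True] * 99 + [False] * 1         # current build: 99% success
similar_build = [True] * 985 + [False] * 15  # 98.5% success
worse_build = [True] * 90 + [False] * 10     # 90% success

print(safe_to_keep_rolling_out(baseline, similar_build))  # True
print(safe_to_keep_rolling_out(baseline, worse_build))    # False
```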
SLO best practices
SLOs are a wildly effective tool for fast debugging, improved incident management, and even happier DevOps and SRE teams thanks to reduced alerts. But getting the most out of an SLO is more of an art than science. Here are some best practices to consider.
Understand the users
In order to make sure SLOs measure the right metrics, teams need a thorough understanding of the user journey. Map out user interactions, the entire journey, the business goals, and every single infrastructure touchpoint. Use that information to come up with SLOs that reflect key steps in that journey, like logging in, paying, or asking for help.
Underpromise and overdeliver
A look at industry-standard SLOs will show that teams like to work in nines, as in 99.9%, but that rule is not hard and fast. In fact, many organizations turn to their SLAs to help them set “reality-based” SLOs. If an SLA promises 98% uptime, the SLO might be 99% uptime because building in some padding will mean an earlier alert and more time to fix issues before they start to impact the SLA.
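To see how much padding that one percentage point buys, here is a small Python calculation. The 30-day window is an assumption for the example, and the function name is made up.

```python
WINDOW_MINUTES = 30 * 24 * 60  # a 30-day window, chosen for illustration


def allowed_downtime_minutes(uptime_percent: float) -> float:
    """Minutes of downtime a given uptime percentage tolerates in the window."""
    return WINDOW_MINUTES * (100 - uptime_percent) / 100


sla_budget = allowed_downtime_minutes(98)  # what the contract tolerates
slo_budget = allowed_downtime_minutes(99)  # what the team aims for
padding = sla_budget - slo_budget          # room to react before the SLA is at risk

print(sla_budget)  # 864.0
print(slo_budget)  # 432.0
print(padding)     # 432.0
```

Under these assumptions, the team is alerted after 432 minutes of downtime and still has another 432 minutes before the SLA itself is breached.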
Embrace the flexibility
Unlike the SLA and its signed agreements and expectations, SLOs can provide teams an opportunity to experiment. After all, they’re an internal metric, not external, so it’s possible to try something and change it if it’s too strict, too loose, or simply not providing actionable data. Being flexible and prepared for changes is a good skill for teams to cultivate.
Keep it simple
It’s tempting to think “more is better” applies to SLOs, but the opposite is true. Fewer, aggregated service level objectives make for faster incident resolution and happier, less fatigued teams. Stick to SLIs and the resultant SLOs around user experience, because that is what matters most to an organization.
Defining the right SLOs for your organization
One of the most surprising features of an SLO is the ability to experiment with it. There aren’t many areas in modern software development where teams can be this carefree.
But, even when experimenting, it’s good to have some boundaries. Here are six steps to take to identify the best SLO options for a particular team.
- Start with the user: Map out what a customer does at every step of the way, creating a critical user journey (CUJ).
- Overlay the business: What is most important to the business? Identify that, then map it to the customer journey. The results may show that some user pain points are more “painful” to the business than others. Those are the pain points that should be of utmost importance to track/measure.
- ID the best SLIs: With the user journey and business importance in mind, choose service level indicators carefully and make sure that tracking them will make a difference to the user and the business. Also, it’s important that the SLIs are, in fact, measurable. Availability and latency are normally safe options.
- Establish the SLO: Given all the research above, carefully choose targets that will reflect most clearly on the user journey and business priorities and that also fit into SLA contractual requirements. Then, set a timeframe for the SLO.
- Determine the error budget: The error budget is the amount of failure the SLO tolerates over its timeframe. As mentioned before, it’s usually a good idea to set an SLO that’s slightly stricter than the SLA requires, so teams have room to respond before things get totally out of hand.
- Flip the switch: Establish SLO alerts (and exhale!).
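The last two steps can be sketched in Python as a budget calculation feeding a simple alert rule. The function names, the escalation levels, and the 25% threshold are hypothetical; real alerting policies vary by team and tool.

```python
def budget_remaining(target: float, total: int, failed: int) -> float:
    """Fraction of the error budget still unspent (negative once it is overspent).

    total is the number of events seen in the SLO window, failed the number
    that did not pass the SLI.
    """
    allowed = round(total * (1 - target))
    if allowed == 0:
        raise ValueError("no error budget: target too strict or too few events")
    return 1 - failed / allowed


def alert_level(remaining: float) -> str:
    """Map remaining budget to a hypothetical escalation level."""
    if remaining <= 0:
        return "page"    # budget exhausted
    if remaining <= 0.25:
        return "notify"  # budget running low
    return "ok"


# A 99% target over 100,000 events allows 1,000 failures.
print(alert_level(budget_remaining(0.99, 100_000, 200)))    # ok
print(alert_level(budget_remaining(0.99, 100_000, 800)))    # notify
print(alert_level(budget_remaining(0.99, 100_000, 1_200)))  # page
```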
Event-Based SLOs vs. Time Series-Based SLOs
One of the challenges around SLOs is the proper selection of an SLI. Since SLIs can represent intricate criteria, such as “all HTTP requests for logged-in users return a status code below 400, and take 150ms at most,” it is rather common for people to structure their observability with pre-aggregated metrics that fit in with time-series data.
The risk of doing so is that while the SLO is flexible—you’re free to change its duration and compliance rate—the SLI is not. Redefining the existing SLI may change what an SLO means and confuse people over long time periods. Defining a brand new SLI means that you will also lose all historical data around your SLO and will need to start fresh every time. Put simply, an event that occurs at 2 a.m. is most likely very different than at 2 p.m.
However, if your system generates a rich set of events in its telemetry to begin with, these events are likely to already contain most of the critical fields you would be concerned with: status code, duration, and user information. Having these rich events means you can define SLIs by using their fields, and therefore change the thresholds as often as you want.
If, starting today, you decide you want the SLI to cover both logged-in and anonymous users with a status code below 400 but want to give 300ms to logged-out users instead, you can do that, and look at your past events to rebuild a fresh definition of the SLO with a brand new budget that extends as far back in the past as your data retention allows.
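Here is a minimal Python sketch of that idea: two SLI definitions evaluated over the same stored events. The `Event` fields and function names are assumptions made for illustration, not a real telemetry schema or API.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Event:
    status: int
    duration_ms: float
    logged_in: bool


# An SLI returns True/False, or None when the event is not a valid measurement.
Sli = Callable[[Event], Optional[bool]]


def original_sli(e: Event) -> Optional[bool]:
    """Logged-in users only: status below 400 and at most 150 ms."""
    if not e.logged_in:
        return None
    return e.status < 400 and e.duration_ms <= 150


def revised_sli(e: Event) -> Optional[bool]:
    """All users: status below 400; 150 ms if logged in, 300 ms otherwise."""
    limit_ms = 150 if e.logged_in else 300
    return e.status < 400 and e.duration_ms <= limit_ms


def success_rate(events: list[Event], sli: Sli) -> float:
    """Success rate over the events the SLI considers valid."""
    results = [r for r in map(sli, events) if r is not None]
    if not results:
        raise ValueError("no valid measurements for this SLI")
    return sum(results) / len(results)


stored_events = [
    Event(200, 100.0, True),   # good under both definitions
    Event(200, 200.0, True),   # too slow for a logged-in user
    Event(200, 250.0, False),  # ignored originally; good under the revision
    Event(200, 120.0, False),  # ignored originally; good under the revision
    Event(503, 50.0, False),   # ignored originally; bad under the revision
]

print(success_rate(stored_events, original_sli))  # 0.5
print(success_rate(stored_events, revised_sli))   # 0.6
```

Because the raw fields are still there, the revised definition can be applied to events that were recorded before it existed.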
While adding instrumentation to alter your SLIs will still reset your history, having the ability to craft or alter fields on demand (“this endpoint is a file upload and it’s normal for it to be slow, let’s give it more leeway”) is an incredible way for your teams to progressively refine their understanding of the system and of the pain points it may impose on its users.
Other solutions provide metric-based SLOs, meaning they simply check a count (good minute or bad minute?) with no context on severity (how bad was it?). Honeycomb’s alerts are directly tied to the reality that people are experiencing, so you can better understand severity and meet users’ high performance expectations. Honeycomb’s SLOs are event-based, enabling higher-fidelity alerts that give teams insight into the why. When errors begin, Honeycomb SLOs can ping your engineers in an escalating series of alerts. Unlike other vendors, Honeycomb SLOs reveal the underlying event data, so anyone can quickly see how to improve performance against a particular objective.
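The difference between counting good and bad minutes and counting events can be illustrated with a generic Python sketch. It does not reproduce how Honeycomb or any other vendor computes SLOs; the per-minute threshold and the sample numbers are made up.

```python
def minute_is_good(successes: int, total: int, threshold: float = 0.99) -> bool:
    """Time-based view: a minute is 'good' if enough of its requests succeeded."""
    return successes / total >= threshold


# (successful requests, total requests) for three consecutive minutes
minutes = [(1000, 1000), (980, 1000), (100, 1000)]

good_minutes = sum(minute_is_good(s, t) for s, t in minutes)
time_based_rate = good_minutes / len(minutes)

total_successes = sum(s for s, _ in minutes)
total_requests = sum(t for _, t in minutes)
event_based_rate = total_successes / total_requests

print(good_minutes)      # 1
print(time_based_rate)   # about 0.33
print(event_based_rate)  # about 0.69
```

The second and third minutes both count as one bad minute, even though 20 requests failed in one and 900 in the other. The event-based rate keeps that difference in severity visible.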
Service level objectives can be a powerful tool to speed debugging, better align business with tech, and keep a careful eye on customer satisfaction. When combined with service level agreements and service level indicators, a well-thought-out set of SLOs can be invaluable to a modern DevOps/SRE practice.
Here are examples of SLOs, as well as how to create and monitor them in Honeycomb.
We use SLOs at Honeycomb. That’s how great we think they are.