’Tis the season to set 2021 goals. Whether you’re setting OKRs, KPIs, KPAs, MBOs, or any other flavor of goal-setting framework in an endless sea of acronym soup, chances are that you’re still dealing with a sizable disconnect between business objectives and daily engineering work.
Service Level Objectives (SLOs) have boomed in popularity because they provide a common language between business stakeholders and engineers to set aligned goals. But getting started involves a bit more research than just jumping right into setting targets. This post provides practical examples of how to (and how not to) use SLOs in setting your upcoming yearly goals.
Aligning business goals and engineering work
If big, hairy, audacious goals are to be achieved by an entire company, it’s important that everyone understands how their work connects back to the vision.
Did your eyes roll when you read that sentence?
Congratulations! You’re probably an engineer who’s experienced business goals that are entirely divorced from engineering realities.
Meaningless goals like that are born as a result of bad communication during the goal-setting process. Business leadership sets the strategic vision (as they should!) and then other departments figure out how they can align in that direction. Collaboratively, everyone shifts and reprioritizes work, organizational constraints become apparent, negotiations happen, and — together (ideally) — the company arrives at a set of ambitious, but achievable, goals. The problem comes in when the negotiating parties aren’t speaking the same language.
How exactly does a lower p95 duration of requests to the Artemis service mean the company is winning bigger market share? (Side note: why does almost every org have an Artemis somewhere?) The layers of translation necessary to connect those dots are, at best, complicated and faulty. Round after round of negotiation, relevance and nuance get lost in translation. By the end, somehow you’re committed to five nines of availability for every service, and every engineer knows that’s never even remotely in the realm of possibility.
The good news is that it doesn’t have to be that way. SLOs help solve that problem. Before unpacking how to use them in your goals, let’s first look at why that’s true.
The common language of SLOs
A lot has been written about SLOs, Service Level Indicators (SLIs), Service Level Agreements (SLAs), SLO burn alerts and more. We’ll recap by saying that SLOs are the negotiated availability agreements you set internally, SLIs are how they’re measured, and SLAs are how those agreements are communicated externally to your customers. In this post, we focus on the internal process for setting up SLOs.
SLOs should be expressed as business goals for service reliability. In other words, they measure your service’s customer experience. For example, your business might have a strategic goal expressed as “providing the fastest website experience possible.” On a technical level, that means “a user should be able to load our home page and see a result quickly.” In this example, an appropriate SLI might:
- Identify qualified events by looking for events where request.path = “/home”.
- Consider each qualified event “successful” when duration_ms < 100.
Our SLO target, for this example, might be that during any trailing 30-day period we want 99.9% of all events (as qualified by our SLI) to be successful. Every event slower than the threshold is considered an error, and it counts against your error budget.
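As a rough illustration, the SLI and target above could be evaluated like this in Python. The Event type, the function names, and the sample events are hypothetical and exist only for this sketch. It is not a description of how Honeycomb computes SLOs internally.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """Hypothetical event shape: just the two fields the example SLI needs."""
    path: str
    duration_ms: float


def sli_compliance(events, path="/home", threshold_ms=100.0):
    """Return (qualified_count, successful_count, ratio) for the example SLI.

    An event qualifies when its path matches; a qualified event is
    successful when its duration is under the threshold.
    """
    qualified = [e for e in events if e.path == path]
    successful = [e for e in qualified if e.duration_ms < threshold_ms]
    ratio = len(successful) / len(qualified) if qualified else None
    return len(qualified), len(successful), ratio


def meets_slo(ratio, target=0.999):
    """True when the success ratio is at or above the SLO target."""
    return ratio is not None and ratio >= target


# Made-up sample data for one evaluation window.
sample = [
    Event("/home", 42),
    Event("/home", 250),   # too slow: counts against the error budget
    Event("/api", 900),    # not qualified: ignored by this SLI
    Event("/home", 99),
]
qualified, successful, ratio = sli_compliance(sample)
print(qualified, successful, meets_slo(ratio))
```

In a real system the events would come from your telemetry over the trailing 30-day window rather than from an in-memory list.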
Different parts of your business (and different parts of your stack) will have different targets. For example, at Honeycomb, we have different business goals for how we ingest and serve your data. We have very little tolerance for losing any of the events you send us, so our ingest API service has the strictest SLO target at 99.99%. The UI homepage is set somewhere in the middle with a 99.5% target. One of our least strict SLOs is around query functions (where performance varies greatly depending on the task). There, we want to see results returning within 10 seconds 99% of the time.
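To make the difference between those targets concrete, here is a small sketch that converts a target percentage into the number of failed events it tolerates. The traffic volume of one million qualified events per window is an invented figure for illustration, and the function name is hypothetical.

```python
from decimal import Decimal


def error_budget_events(total_events: int, target_pct: str) -> int:
    """Number of events allowed to fail without missing the target.

    The target is passed as a string (for example "99.99") so Decimal can
    represent it exactly and avoid floating-point rounding surprises.
    """
    allowed = Decimal(total_events) * (Decimal(100) - Decimal(target_pct)) / Decimal(100)
    return int(allowed)  # round down: a partial event cannot fail


TOTAL_EVENTS = 1_000_000  # assumed volume per window, for illustration only

targets = {"ingest API": "99.99", "UI homepage": "99.5", "query": "99"}
for name, pct in targets.items():
    print(f"{name}: {pct}% allows {error_budget_events(TOTAL_EVENTS, pct)} failures")
```

At that assumed volume, the 99.99% target leaves room for 100 failed events, while the 99% target leaves room for 10,000. That hundredfold gap is why the targets need to reflect how much each service actually matters to customers.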
There’s more depth and detail to SLOs but, even in this high-level generalization, you can see how that approach enables explicit agreements with business stakeholders that determine critical paths and necessary investments, allows engineers to clearly understand priorities and how their work impacts business goals, and gives managers the tools needed to set expectations with both groups.
Because SLOs provide that ability, teams are eager to use SLOs when setting their yearly goals. Let’s see how that’s done in practice.
Getting started with SLOs in the real world
A Honeycomb customer, one of the largest US banking institutions, has a cloud-based platform that helps finance professionals manage large asset portfolios. Given the critical nature of those transactions, they use Honeycomb to ensure application reliability for their customers.
We worked with that team as they set their yearly goals around furthering their observability practices. They wanted to start using SLOs in their organization. Their plan was to set an OKR (Objectives and Key Results) with an Objective around “implementing end-to-end observability” and Key Results based on SLO targets they estimated could be met.
We often see that approach with customers new to SLOs. It’s important to remember that practicing observability isn’t an objective in and of itself: it’s a means to an end. Objectives should always map to business outcomes.
The team could have prematurely locked themselves into specific SLO targets that weren’t backed by explicit business agreements. Instead, we advised them to connect their objective to a business outcome (like better service reliability). Throughout the year, they could then take steps to propose initial SLO targets and build organizational consensus. If you bake SLOs into your goals before doing that, you’re committing to meeting targets when you don’t yet know what the right targets are.
The team’s reliability objective became to deliver “a performant and available site experience for our users, which measurably increases retention and improves sales.” The key results accompanying that in Q1 next year are:
- Instrumentation Coverage: XX% of all user-facing requests travel through instrumented services.
- On-Call/Ops Adoption: XX% of all on-call engineers use Honeycomb as their go-to troubleshooting tool during incidents and report (via survey) significant effectiveness improvements from doing so.
- Identify, negotiate, and implement 3-5 SLOs with buy-in from business stakeholders and engineering leadership.
When first approaching SLOs, it helps to take a quarter or two to gather data. You should understand how both planned and unplanned operations impact your SLO error budgets. In turn, that informs how service availability maps to business performance indicators. With that data, you can then negotiate targets with business stakeholders and engineering leaders. Doing that with 3-5 services is a pretty significant goal.
Importantly, remember that your SLOs don’t need to be perfectly defined before those negotiations start (they can always improve!). Identify the services most critical for your business outcomes, gather some early data, have those negotiations, then implement your SLOs and improve iteratively as you go.
Gathering data for your own SLOs
There are no universal thresholds for SLO targets. Even if there were, it’s important to do what’s right for your particular situation. SLOs will integrate into your company goals once you establish the right targets for your particular services. When first implementing SLOs, start by gathering data for a small set of pilot services before committing to any targets baked into your goals.
Many teams can’t wait to implement SLOs because of the order-of-magnitude impact they can have on team efficiency, alert fatigue, and productivity. So it’s tempting to set a few reasonable-sounding (yet still arbitrary) targets just to get started down the SLO path. But it takes time to reorient, learn new processes and practices, and build the team muscle memory it takes to set the right targets or to respond when error budgets have been exceeded.
Take these steps to figure out your own SLO starting points.
First, prioritize your service reliability needs by the customer value they provide. If your website and API components all currently have equal availability goals, start there. Separate the various functions and rank their reliability needs. In our earlier example, Honeycomb’s ingest API needs rose to the top of our list. But the query interface has a greater tolerance for error because if something goes wrong, most users simply hit “refresh.” Not all of your services will have equal availability needs.
Next, meet with customer-facing teams to validate those findings against the technical and business functions that have the highest customer expectations for availability. Take into account any maintenance windows defined in your SLAs, contracts, or terms of service. The data you gather in this step informs the possible size of your error budget. If you don’t have maintenance windows for your services today, now is the time to change that. As the old mechanic saying goes, “plan maintenance for your equipment, or your equipment will plan it for you.”
Then have a conversation with business stakeholders and product management to set an error budget. You not only need to be clear on availability targets, but you must also be clear on what happens when those availability targets are in danger of being missed. Feature development work may need to halt in favor of work that improves availability when that happens. Define, discuss, and negotiate those various scenarios upfront to avoid unpleasant surprises later. Document those commitments in a contract. Be explicit in your negotiations.
Lastly, define how your team reports when your error budget is spent. The error budget is how much time your stakeholders have agreed that your service can be away from its job, so to speak. Similar to when you go on vacation or attend a conference, people want to know where you went and what great things you did. If your API gets a vacation, people will also want to know how it got better. 🙂
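If you think of the budget as time, a report can start from simple arithmetic. The sketch below assumes a time-based budget over a 30-day window and uses invented incident durations. The function name and the report fields are hypothetical, and an event-based SLO like the earlier home page example would count failed events instead of minutes.

```python
def error_budget_report(target_pct: float, window_days: int,
                        downtime_minutes: list[float]) -> dict:
    """Summarize how much of a time-based error budget has been spent."""
    if not 0 < target_pct < 100:
        raise ValueError("target_pct must be between 0 and 100 (exclusive)")
    window_minutes = window_days * 24 * 60
    budget = window_minutes * (100 - target_pct) / 100
    spent = sum(downtime_minutes)
    return {
        "budget_minutes": round(budget, 2),
        "spent_minutes": round(spent, 2),
        "remaining_minutes": round(budget - spent, 2),
        "percent_consumed": round(100 * spent / budget, 1),
    }


# Two made-up incidents (12 and 9.5 minutes) against a 99.9% target.
report = error_budget_report(99.9, 30, [12.0, 9.5])
for field, value in report.items():
    print(f"{field}: {value}")
```

A 99.9% target over 30 days works out to about 43.2 minutes of budget, so the two sample incidents consume roughly half of it. Pairing numbers like these with a short note on what each incident taught the team gives stakeholders the “what great things you did” part of the story.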
Setting your company goals this year
SLOs will help you set goals that are meaningful to business stakeholders and engineers alike. You should use SLOs when setting your company goals. But if you’re new to SLOs, be sure not to back yourself into a corner by setting arbitrary SLO targets. Instead, set goals to identify a small set of pilot services for SLO implementation. To figure out your own starting points for SLO targets:
- Prioritize service reliability needs by their customer value
- Meet with customer-facing teams to validate those reliability needs
- Seek agreement from business and product stakeholders on an error budget contract
- Report on how your service’s error budget was used
That intermediary step gives you time to develop new organizational habits and reap the learnings you’ll need in order to be successful. After that starting point, you’ll be in a much better place to bake the right SLO targets into your objectives and key results.
Let us know how your journey with SLOs and observability practices helps you build competitive advantages, boost customer loyalty, deliver features faster, improve service reliability, and set more effective company goals. We’d love to hear from you!
Talk to sales to learn how Honeycomb SLOs can help your team.