Authors’ Cut—Shifting Cultural Gears: How to Show the Business Value of Observability
By George Miranda | Last modified on January 5, 2023
At Honeycomb, the datastore and query systems that we manage are sociotechnical in nature, meaning the move to observability requires a sociological shift as much as a technical one. We've covered the technical part in several prior discussions for our Authors’ Cut series, but the social aspect is a little squishier. Namely: How do you solve the people and culture problems involved in shifting to observability practices? And once you instill those changes, how do you measure the benefits? How do you show business value?
There’s already some great advice in the book chapters themselves. For the Authors’ Cut series, we thought we’d bring those chapters to life by inviting you to ask us your burning questions about the business value of observability. Then, we’d apply the same principles from the book to help answer your questions in real time. Think of it as Honeycomb’s Dear AbBEE for showing the business value of adopting observability. Let the Q&A begin!
Turning the page to SLOs
Q: My staff is getting paged constantly, and I want us to move toward service-level objectives (SLOs). What’s the best path forward for constructing an organization that cares about getting from A to B?
Our answer: It sounds like your teams don't believe that things can be better. You can change that mindset with some time, attention, and engineering.
When you’re getting paged 50 to 100 times a day, you’re probably not investigating every page. It becomes noise at that point, so don't let it take up space in your head. Redirect your time to doing proactive work on SLOs or on automation. You’ll need to get buy-in from leadership so that time is protected and not overtaken by the alert onslaught.
You can build toward getting that buy-in by running experiments in parallel. This way you’re not changing the current way of managing alerts and introducing organizational risk; you’re just trying out a secondary approach in a small slice. In the case of SLOs, focus on measuring top-level user experience and don't worry about alerting on any issues in your infrastructure. Do a side-by-side comparison of what happens when you get that typical flood of alerts versus what happens when you get paged by an SLO burn alert. When alerts aren’t just noise, and when you can isolate the right causes with observability, your team will quickly fix the customer experiences that actually matter to the business. Then, broadcast how and why that is tangibly better across your organization. That sort of proof is a powerful way to build up your allies.
Measure the results every time you migrate a service to using SLOs in terms of incremental value, like how much faster you’re able to resolve issues or how much you shrank the time it takes to build and deploy your software. All the progress you make compounds over time and gives back by building trust with your colleagues—be clear about how and why SLOs are creating better business outcomes, and those ideas will spread. Sometimes your colleagues just need one engineer to show them it’s possible and to create confidence.
Making the case for tracing
Q: The systems we have at work—metrics and logs—are OK, but it's clear to me that adopting OpenTelemetry and introducing a measure of quality per service would be useful for us across different teams. How do I make the argument for tracing and SLOs for services when there’s no desire for change?
Our answer: It sounds like what you’re proposing is switching to OpenTelemetry so you can add distributed tracing to your telemetry mix and so you can start practicing observability. Here are some starting points where you can anchor arguments for why adopting OpenTelemetry is a great long-term investment:
- Using distributed tracing in tandem with an observability tool decreases the cognitive coordination costs that microservices require.
- It opens the door to trying out modern observability tools, which increase your team's efficiency and flexibility: you can debug quickly based on data instead of relying primarily on experience and intuition, which isn't sustainable in the long run. That means more engineers can effectively debug and work on different parts of the system, without the gatekeeping enforced today when that work first requires deep knowledge and expertise.
- OpenTelemetry is a vendor-neutral framework, which means you can try many different observability tools without first needing to reinstrument your code. That means your org can avoid vendor lock-in and it puts you in control of finding the right tool to suit your needs at any time (especially when your needs change).
- OpenTelemetry makes it easy to start practicing observability, which lets you fix things you didn't know were broken and find anomalies before your customers do.
Freeze frame: painting the observability picture
Q: What should we measure to help frame the improvements that organizations see after implementing observability?
Our answer: Several metrics that reinforce each other help you measure improvements. The key is to focus on "how did we improve our ability to execute?" and not "what improved after implementing observability?"
You’ll want to set expectations. A lot of people expect the number of bugs to go down after bringing in observability, when in fact teams often surface more bugs than ever before. That's a good thing, even though it looks bad on the graphs: it means you can now see the many hidden bugs you had no idea were even happening. It might seem counterintuitive, but it’s not unusual to see that number increase.
At Honeycomb, we talk about lowering MTTWTF—the mean time to figuring out what the fuck is going on. And that should still be happening; you will lower the amount of time it takes to find bizarre and perplexing issues. But you’ll also discover new WTFs that you didn't even know about. We often hear this from our users: a type of problem that used to take three engineers a week to pinpoint will instead take twenty minutes to find with Honeycomb. At first, it may take a bit of time to work through the many WTFs you had no idea were even happening. But in the long run, the improvement comes by way of creating more resilient and performant services so that you can ship features more reliably.
So, look at the DORA metrics around time to recover from failures and time between commit and running in production. Measuring those helps with both shifting left (finding bugs earlier, making them easier to resolve) and shifting right (taking a microscope to production).
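Both of those DORA metrics are simple averages over event timestamps, which makes them easy to start tracking from data you likely already have (incident tickets, commit logs, deploy logs). A hedged sketch, assuming hypothetical `(start, end)` timestamp pairs rather than any official DORA tooling:

```python
from datetime import datetime, timedelta

def mean_time_to_restore(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Average of (restored - detected) across incidents."""
    total = sum((restored - detected for detected, restored in incidents),
                timedelta())
    return total / len(incidents)

def lead_time_for_changes(changes: list[tuple[datetime, datetime]]) -> timedelta:
    """Average of (deployed to production - committed) across changes."""
    total = sum((deployed - committed for committed, deployed in changes),
                timedelta())
    return total / len(changes)

# Two incidents: restored in 2 hours and in 4 hours -> MTTR of 3 hours.
incidents = [
    (datetime(2023, 1, 1, 10, 0), datetime(2023, 1, 1, 12, 0)),
    (datetime(2023, 1, 2, 9, 0), datetime(2023, 1, 2, 13, 0)),
]
print(mean_time_to_restore(incidents))  # 3:00:00
```

Tracking these as trend lines before and after adopting observability keeps the conversation anchored on "how did we improve our ability to execute?" rather than on raw bug counts.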
Connect observability to business outcomes
There’s a lot more in the book, so we highly recommend diving into those chapters if you haven’t already. But a few important lessons also become clear from these questions:
- Focus less on quantifying specifics that changed (like number of bugs) when adopting observability and focus more on quantifying the business outcomes (like how quickly you could fix customer experience issues)
- Run experiments in parallel and loudly broadcast the results if you need to build up buy-in from business allies, because they will be crucial to your success
- OpenTelemetry is a fantastic framework and the way to push for its adoption is (similarly) to focus on the outcomes it enables for your teams and your organization
If we didn't address your conundrum in this post, you can hear additional questions and answers by listening to the full recording. You can always ask us for our advice in the Pollinators community on Slack or submit a question to Miss O11y.
The beginning is a good place to start
In case you can’t tell, we’re big believers in Continuous Delivery and agile practice (and you can be, too). You don’t need to wait until you’ve perfected things—simply start by implementing some observability. Might we suggest taking that first (free) step with Honeycomb?
Our friends at Tracetest recently released an integration with Honeycomb that allows you to build end-to-end and integration tests, powered by your existing distributed traces....