Ask Miss O11y: My Manager Won't Let Me Spend Any Time Instrumenting My Code
By Martin Thwaites | Last modified on July 25, 2022
Dear Miss O11y,
My organization doesn’t want me spending time on instrumenting my product. What can I do?
Thanks for the question! You'll be relieved to hear that you're in the majority, and also that there are quick, easy steps you can take to prove that instrumenting your code is worthwhile.
Is the system working?
First, let’s talk about what makes a system “work.” If we don’t know that our system is running, then it’s not done. If we can’t find out how healthy our system is, we can’t give our stakeholders confidence that the system will continue functioning. That’s the conversation that needs to happen. So it comes down to reframing the question.
I've heard a lot over the years that instrumentation is only needed for production, or that it's something teams do once they have problems. This is a common attitude, and it generally ends with stakeholders saying "well, do the productionization then!" after an outage that takes the team a long time to solve. For some reason, and I can't quite put my finger on why, the business suddenly sees value in knowing what the system is doing once customers notice it isn't doing what they expected.
Obviously, I’m preaching to the converted here as you already know and want to do the instrumentation work.
The business is likely already okay with you writing developer-focused tests (unit, integration, automation), and instrumentation deserves the same treatment. The conversation we need to have is that instrumentation, or knowing the system is healthy, is part of the work we do for the feature. In my opinion, it's even more important than those tests.
What is instrumentation?
One of the big reasons I see stakeholders push back on instrumentation is how it's described to them. It's brought up as a separate task, and therefore one that can be vetoed as "not needed right now."
We’re repeating history by calling instrumentation out as a separate task. For years, engineers fought the battle of getting time for unit tests and integration tests, and eventually prevailed. Very few Product Owners/Managers (or stakeholders in general) now question time spent writing those automated tests to ensure the system doesn’t fail. How did we get there? By not mentioning it. It’s part of the work of the feature, so why even mention it?
Instrumentation, in this context, is exactly the same. It’s part of creating the feature, it’s part of being able to know that the feature works. It’s included in your estimation of delivery times, it’s included in your complexity conversations, everything.
What about auto-instrumentation?
This is such an easy win that I honestly wouldn't even add an estimate for it. For most languages and frameworks, adding basic auto-instrumentation is a few lines of code. You can also sign up for a free Honeycomb account, which covers most small-scale implementations, and get decent data quickly. This gives you the baseline to start adding manual instrumentation, something you can then do as you add more features.
Beyond that, there are a lot of ways to get instrumentation data in without making code changes at all: the OpenTelemetry agent (.NET, Java, and Python all have this capability), or bringing in data from infrastructure outputs like Kubernetes or load balancers.
Observability-Driven Development + Test-Driven Development
This is where we start to level up our practices considerably. Coupling TDD with an observability-first approach is the superpower you didn't know you had. Better still, instrumentation is now part of your workflow for developing the feature, no longer a separate task to build and validate.
So what if we wrote our tests in such a way that tracing is what we use to assert our outcomes? Here's a simple test using ODD and TDD together.
Test: ValidAddRequest_ShouldSaveToTheDatabase

Asserts:
- A span named "Database-Save" is created
- The "Database-Save" span has the correct Id property
- A subsequent GET request for the Id returns the correct object
With this approach, we get both the benefits of TDD (shaping our contract, fast feedback loops, red-green-refactor, tests as system documentation, etc.) and all the information in production to debug issues the same way you debug with your tests.
This has an added benefit that now, instrumenting your code is part of the work. It’s part of your definition of done. It’s part of your system documentation, too.
What about backfilling instrumentation?
This is the hardest part. If you have a system that’s lacking instrumentation (either completely, or partially), then adding that in is a much harder sell. This is similar to backfilling unit or automated regression tests in a system.
Again, I'll refer back to the fact that we've solved this problem before.
How did we get the business to understand that backfilling regression tests was an important thing? We stopped telling them we were doing that. We included it in the estimates of new/adjacent tickets to those features, and we refactored code so that all the code went through central points that are tested.
All of this is the same process for instrumentation. We include it in other work on the same platform, and we refactor code to use common components that we do instrument. Then, when you get the question, "why did you add instrumentation for Feature Y, which we shipped last month, while you were working on Feature X?", you can answer that the refactoring caused changes you needed to protect against.
What if the developers don’t want to do instrumentation?
This is a much harder question to answer. Having developers care about production is a completely different situation. I’ll wait until someone asks that question before we try to answer it!
I hope that I’ve provided you with some practical approaches to talking to your organization (or not talking, as it turns out) about instrumenting your code.
If you have further questions or need clarification, feel free to book 1:1 time with one of the Developer Advocates. We offer office hours!