Ask Miss O11y: What Should An "Observability Team" Do?
By Charity Majors | Last modified on August 3, 2022
Dear Miss O11y:
I care a lot about instrumentation and telemetry and OpenTelemetry, so I was thinking of joining the observability engineering team at my company… but it seems like they spend all their time managing Prometheus and Grafana. I guess I was expecting something very different? We also have an internal tools team working on OpenTelemetry, and our IT department manages our provisioning for logging providers and exception handlers, and there are a dozen different metrics libraries and logging frameworks in use across the various teams. What should an observability team be doing? Please help.
Confused in Philly
From what I can tell, everybody ran the same shitty sed script at some point in the past five years:
sed -i '' 's/Monitoring/Observability Engineering/g' engineering_teams
People use “observability team” as a catchall basket for all kinds of things these days—from cutting-edge tech to truly heinous hacks. Eh, it is what it is. The industry may be in a roiling state of massive flux, but I’m cautiously excited about the changes beginning to take shape and emerge from the muck. And I definitely think it’s worth spending some time talking about what observability teams can and should be.
i think you're going to hear a lot more of it, as it's one of the biggest growth areas for operationally inclined engineers. :)
"vendor engineering" means writing libs, modules, use cases, docs, etc to standardize use of third-party software across the org. https://t.co/iDT7aKFt5J
— Charity Majors (@mipsytipsy) October 24, 2020
Briefly: vendor engineering consists of sitting at the edge of your engineering org and making other companies’ software solutions integrate seamlessly with your own software and workflows. It covers everything from evaluating vendors and rolling out a solution, through configuring, updating, and integrating it with your existing workflows, to writing libraries, modules, docs, etc. that standardize the use cases across your organization. It also typically includes working with the vendor’s product and engineering teams to influence their roadmap, managing the relationship, pitching and selling the solutions internally, guiding migrations and upgrades… and ultimately, accepting responsibility for the partnership’s success.
It is a role that draws on multiple disciplines, from software engineering to security to operations, with a high standard for operational excellence. And it is probably the highest leverage role you will ever have.
What does vendor engineering mean for observability engineering teams?
Observability engineering teams are the most commonly spotted instantiations of vendor engineering.
A good observability engineering team sits between the engineering organization and their vendors. They audit, test drive, and carefully select a set of formally supported tools to meet the engineering needs of the company, based on their knowledge of the codebase and production systems, as well as the feedback they have solicited from others.
Are they gatekeepers who bless a few tools and prohibit the use of any others? No, probably not, but they do establish a golden path—a set of tools that the org guarantees will be available, supported, and easy to use.
A good o11y eng team stays up to date on industry changes, and combines its knowledge of the technical domain and the business needs to provide informed opinions and guidance about where the org should invest its time and resources in this area. They might recommend an early investment into distributed tracing, if they see developers flailing and losing days of work in attempts to find or reproduce complex bugs. Or they might recommend delaying an investment into OTel, if the payoff does not yet seem worth the cost.
Your observability engineering team should make good decisions about which frameworks, client libraries, and instrumentation tools should be used across the engineering org, set sane defaults, and document standards and examples for engineers to pull from. They should handle security and compliance insofar as they can, including watching the pipelines for PII leaks, and provide templates and helper libraries for standardization across teams.
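To make "sane defaults" and "helper libraries" a little more concrete, here is a minimal sketch of the kind of shared helper an o11y team might publish. Everything in it is an assumption for illustration: the required field names, the denied keys, the email pattern, and the function names scrub_pii and build_event are all hypothetical, and a real version would sit on top of whatever instrumentation library your org has actually standardized on.

```python
import re

# Hypothetical org-wide conventions; these names are illustrative only.
REQUIRED_FIELDS = ("service.name", "team", "deployment.environment")
DENIED_KEYS = {"password", "ssn", "credit_card"}
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def scrub_pii(attributes):
    """Return a copy with denied keys redacted and email addresses masked."""
    cleaned = {}
    for key, value in attributes.items():
        if key.lower() in DENIED_KEYS:
            cleaned[key] = "[REDACTED]"
        elif isinstance(value, str):
            cleaned[key] = EMAIL_RE.sub("[EMAIL]", value)
        else:
            cleaned[key] = value
    return cleaned


def build_event(base, **fields):
    """Merge shared defaults with per-call fields, validate, then scrub."""
    event = {**base, **fields}
    missing = [name for name in REQUIRED_FIELDS if name not in event]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    return scrub_pii(event)
```

A services team would define its base fields once (service name, team, environment) and call build_event everywhere else, so every team's telemetry carries the same fields and obvious PII is caught before it leaves the process. A regex and a deny list will not catch everything, which is why watching the pipelines still matters.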
It’s their job to make sure that there is minimal duplication of work across the engineering org and maximal consistency and coherence. Somebody who is instrumenting their code on a particular services team should feel at home reviewing the instrumentation code from another team.
If your organization believes in putting developers on call for their own code in production, the observability team is well-positioned to serve as an expert consultant to engineering teams that are trying to figure out how to instrument and understand their software. They should choose an alerting provider and help teams figure out how to avoid paging themselves to death. They are also well positioned to help design and streamline on-call rotations, especially those that have cross-org dependencies or upstream/downstream alerts.
It’s not all about technical stuff, either. One of the most important things an o11y team can do is build a healthy culture of observability. There’s a huge educational component to this: teaching teams how to practice observability-driven development instead of hurriedly tacking a few metrics on at the end, looking for opportunities to shrink the time spent flailing or firefighting by adding better visibility, and teaching people how to leverage the resources they already have. It also means building dashboards, templates, or query collections that people can share and reach for quickly when they are under pressure, and creating a safe environment for people to ask questions and learn from each other.
Most (but not all) observability teams will spend comparatively little time actually running large in-house software installations. If you aren’t an observability company, your engineering resources are probably better spent on your core differentiators.
Finally, observability teams will inevitably find themselves in the business of cost management—evaluating the value you get from your tooling vs the money you spend on it, forecasting future needs, and making smart investments towards your future.
Hope this helps!
P.S. You are absolutely right, Prometheus and Grafana are not observability tools. 😬