Ask Miss O11y: Mapping Out Your Observability Journey
By Liz Fong-Jones | Last modified on April 20, 2022
Dear Miss O11y:
It feels so overwhelming to get started with observability. I want to use Honeycomb, but it feels like I can't justify spending the time and money without having proof of how much better things will be once we have observability everywhere. And I can't get that proof without spending a bunch of time and money which I don't have. Raaaagh. Help?
– Trapped in Toil
Thanks for asking the question! Approaching observability as an all-or-nothing problem often leads to the project feeling daunting. But that's not specific to observability—any project can be overwhelming if you think it needs to be done all at once, perfectly. Such as, erm, writing an entire book on observability! *looks around worriedly*
The trick to making progress is to decompose the problem into smaller chunks and shorten the time it takes to realize the first value. You don't need the results of the full solution at the very beginning. You can bootstrap by just digging in with your bare hands, then trading in the diamonds of insight you find for a shovel, which will unlock more value, which will let you buy a jackhammer, and so on.
You'll need to break down the enterprise observability adoption journey into three separate pieces, and each of the three pieces into manageable steps. In my experience, organizations benefit from guidance with:
- Getting buy-in for contextual instrumentation
- Understanding how to leverage that instrumentation for insight
- Developing a plan to pay for it all
Breaking down instrumentation into steps
For example, let's talk about generating telemetry with instrumentation. While I'm a proponent of rich, wide events with many attributes, it's not a requirement to have fully enriched events to get started. Even analyzing flat HTTP request logs from your cloud provider's load balancer can give you insight into what requests are slow and exactly how slow they are.
An observability solution can ingest these logs and surface to you the high-cardinality dimensions of HTTP parameters/endpoints, user agents, and source IP addresses. And it can tell you exactly how many requests fell into each latency bucket (and what their properties were) rather than estimating with p99 values. Even if there's no further visibility into the internals of your application architecture or into which users are making those calls, you'll know what is slow—even if you don’t know why yet.
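To make the bucket-counting idea concrete, here is a minimal sketch in Python. The log layout (`method path status duration_ms`), the sample lines, and the 250 ms bucket width are all assumptions chosen for illustration; real load balancer logs carry more fields, and this is not how any particular observability backend is implemented.

```python
from collections import Counter

# Hypothetical flat log lines: "<method> <path> <status> <duration_ms>".
# Real load balancer logs have more fields; this layout is an assumption.
SAMPLE_LOGS = [
    "GET /api/cart 200 42",
    "GET /api/cart 200 57",
    "POST /api/checkout 200 812",
    "GET /api/search 200 130",
    "POST /api/checkout 500 1460",
    "GET /api/search 200 95",
]

BUCKET_WIDTH_MS = 250  # illustrative bucket width


def parse_line(line):
    """Split one log line into a dict of named fields."""
    method, path, status, duration = line.split()
    return {
        "method": method,
        "path": path,
        "status": int(status),
        "duration_ms": int(duration),
    }


def bucket_counts(lines, width=BUCKET_WIDTH_MS):
    """Count requests per (path, latency bucket), keyed by the bucket's lower bound."""
    counts = Counter()
    for line in lines:
        event = parse_line(line)
        lower = (event["duration_ms"] // width) * width
        counts[(event["path"], lower)] += 1
    return counts


for (path, lower), n in sorted(bucket_counts(SAMPLE_LOGS).items()):
    print(f"{path:15} {lower:>5}-{lower + BUCKET_WIDTH_MS - 1} ms: {n}")
```

Because every request lands in exactly one bucket, the counts are exact rather than a percentile estimate, and the path on each row already hints at where the slow requests cluster.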
Armed with that potentially surprising data about which calls are slow, you'll have some initial value and answers, and a team motivated to answer new questions. Why are those requests so slow?
To answer those questions, you'll need to add OpenTelemetry (OTel) to your application. OTel is a vendor-neutral instrumentation standard that allows you to send relevant context from your application. You can use OTel's automatic instrumentation of gRPC, HTTP, and database/cache calls to at least get the skeleton of who calls whom in the tangled web of microservices and downstream dependencies. Depending on which language you are using, you can integrate the appropriate Honeycomb OpenTelemetry Distribution with just an agent plus configuration, or by adding a few lines of code and recompiling. This data will allow you to find the un-cached database call that's being repeatedly issued, or the downstream dependency that's slow only for a subset of its endpoints, from a subset of your services.
Finally, you can invest in custom instrumentation: attaching fields and rich values, such as client IDs, shard IDs, errors, and more, to the auto-instrumented spans inside of your code, making it easier in the future to understand what's happening at each layer. By adding custom spans within your application for particularly expensive steps internal to your process, you can go beyond the automatically instrumented spans for outbound calls to dependencies and get visibility into all areas of your code. You're now practicing Observability-Driven Development: working proactively, rather than reactively, to make your future problems easier to debug.
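As a rough picture of what a custom span with rich fields looks like, here is a toy sketch using only the Python standard library. It is deliberately not the OpenTelemetry API: the `span` helper, the `FINISHED_SPANS` list, and the `render_invoice` handler are hypothetical stand-ins, and in a real service the OTel SDK's tracer and exporter would play these roles.

```python
import time
from contextlib import contextmanager

# Stand-in for an exporter. A real SDK would ship finished spans to a backend.
FINISHED_SPANS = []


@contextmanager
def span(name, **attributes):
    """Time a block of work and record it as one wide event with attributes."""
    record = {"name": name, **attributes}
    start = time.perf_counter()
    try:
        yield record  # callers can attach more fields while the work runs
    except Exception as exc:
        record["error"] = repr(exc)  # keep the failure next to the other context
        raise
    finally:
        record["duration_ms"] = (time.perf_counter() - start) * 1000
        FINISHED_SPANS.append(record)


def render_invoice(client_id, shard_id, line_items):
    """Hypothetical handler with a custom span around its expensive step."""
    with span("render_invoice", client_id=client_id, shard_id=shard_id) as current:
        total = sum(price * qty for price, qty in line_items)
        current["line_item_count"] = len(line_items)
        current["invoice_total"] = total
        return total


render_invoice("client-42", 7, [(10, 2), (5, 1)])
print(FINISHED_SPANS[-1])
```

The point of the sketch is the shape of the data: one record per unit of work, carrying the timing plus whatever identifiers you will want to filter or group by later.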
Learning to query your data
You've learned what kinds of data can be progressively sent into an observability tool like Honeycomb. But we haven't yet discussed how to concretely get insights out of the data. There's a smooth learning path to that, too, that provides you with rewards for each step you unlock.
The default graphs from the Honeycomb Home view allow you to get a quick idea of the rate, errors, and duration (RED) metrics of your service, based on the real trace data flowing through the system. Clicking on any of the graphs will expand it. I especially recommend the heatmap in the upper right showing the number of requests for each range of latency and recent time period. And you can zoom in as far as you like in time by clicking and dragging—no fixed one-minute or five-minute granularity buckets here! "Computer, enhance!"
You can pull up the Traces tab and click anywhere on the heatmap to find an exemplar trace that has the shape you're looking for. "Honeycomb, find the trace of one of the 913 requests that took between 550 and 555 milliseconds and that happened between 5:40 p.m. and 5:42 p.m. today." It sounds magical to instantly recall the needle from the haystack, but it's very real and will accelerate your understanding of your systems.
Eventually, you'll want to do more than use your mouse cursor to navigate our pre-defined dashboards. While it may seem intimidating to learn a new querying API, Honeycomb was designed to behave like SQL, a language you probably already know. Just type in a list of things to VISUALIZE instead of SELECT (such as COUNT, HEATMAP, and more), and use as many clauses as you like under WHERE to filter your data. You aren't limited in what you can GROUP BY, so you can choose as many interesting fields as you like and get each VISUALIZEd summary broken down by each unique set of values.
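If the SQL analogy helps, the semantics of VISUALIZE, WHERE, and GROUP BY can be sketched over a handful of in-memory events. This is only an illustration of what the clauses mean, not Honeycomb's query engine or API; the event fields and the `run_query` helper are assumptions made up for the example.

```python
from collections import defaultdict

# Hypothetical events; the field names are assumptions.
EVENTS = [
    {"endpoint": "/checkout", "status": 200, "duration_ms": 812},
    {"endpoint": "/checkout", "status": 500, "duration_ms": 1460},
    {"endpoint": "/search", "status": 200, "duration_ms": 130},
    {"endpoint": "/search", "status": 200, "duration_ms": 95},
    {"endpoint": "/cart", "status": 200, "duration_ms": 42},
]


def run_query(events, where, group_by):
    """Return {group key: {"COUNT": n, "MAX(duration_ms)": m}} for matching events."""
    groups = defaultdict(list)
    for event in events:
        if all(predicate(event) for predicate in where):     # WHERE: every clause must hold
            key = tuple(event[field] for field in group_by)  # GROUP BY: one row per unique tuple
            groups[key].append(event["duration_ms"])
    return {
        key: {"COUNT": len(durations), "MAX(duration_ms)": max(durations)}  # VISUALIZE
        for key, durations in groups.items()
    }


result = run_query(
    EVENTS,
    where=[lambda e: e["status"] == 200, lambda e: e["duration_ms"] > 50],
    group_by=["endpoint", "status"],
)
for key, summary in sorted(result.items()):
    print(key, summary)
```

Adding another field to `group_by` simply produces more rows, one per unique combination of values, which is the behavior the paragraph above describes.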
Not sure what fields are going to be the most relevant? That's okay! You can plot a HEATMAP(duration_ms), go to the BubbleUp tab, and draw a box around the outlying data to have relevant dimensions and values suggested to you to filter or group by.
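One way to build intuition for this kind of suggestion is to compare how often each field value appears inside the selected box versus outside it. The sketch below does exactly that and nothing more; it is a simplified illustration, not Honeycomb's actual BubbleUp algorithm, and the events, the 1000 ms cutoff, and the helper names are assumptions.

```python
from collections import Counter

# Hypothetical events; the field names and values are assumptions.
EVENTS = [
    {"endpoint": "/checkout", "region": "us-east", "duration_ms": 1460},
    {"endpoint": "/checkout", "region": "eu-west", "duration_ms": 1210},
    {"endpoint": "/checkout", "region": "us-east", "duration_ms": 90},
    {"endpoint": "/search", "region": "us-east", "duration_ms": 130},
    {"endpoint": "/search", "region": "eu-west", "duration_ms": 95},
    {"endpoint": "/cart", "region": "us-east", "duration_ms": 42},
]


def value_share(events, field):
    """Fraction of events carrying each value of `field`."""
    counts = Counter(event.get(field) for event in events)
    total = len(events) or 1  # avoid dividing by zero on an empty selection
    return {value: n / total for value, n in counts.items()}


def rank_dimensions(selected, baseline, fields):
    """Rank (field, value) pairs by how over-represented they are in the selection."""
    scored = []
    for field in fields:
        inside = value_share(selected, field)
        outside = value_share(baseline, field)
        for value, share in inside.items():
            scored.append((share - outside.get(value, 0.0), field, value))
    return sorted(scored, key=lambda item: item[0], reverse=True)


# "Draw a box" around the slow outliers: everything at or above 1000 ms.
selected = [e for e in EVENTS if e["duration_ms"] >= 1000]
baseline = [e for e in EVENTS if e["duration_ms"] < 1000]

for score, field, value in rank_dimensions(selected, baseline, ["endpoint", "region"]):
    print(f"{field}={value}: {score:+.2f}")
```

In this made-up data, `endpoint=/checkout` ranks first because every selected event has it while only a quarter of the baseline does, which makes it the natural next thing to filter or group by.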
There's no penalty to making a mistake and iterating on your query, as each query only takes seconds to run, and we don't charge per query you issue. It's a natural part of the debugging process to run into the occasional red herring or dead end. But you can leave your future self a lifeline by taking notes and annotating which queries you found helpful or not as you go along. Honeycomb's query history acts like Ariadne's thread for you and for your team, helping you remember what you've already visited before.
As you use Honeycomb more and more, you'll be able to find the insights faster and faster, and solve problems that you previously thought to be intractable.
Breaking down data scaling and cost
We've discussed so far how to incrementally add data and how to incrementally improve your querying capabilities. Gradually leveling up one step at a time helps you justify the investment into instrumentation and self-training. However, you still might be wondering how expensive this is all going to be in dollars or euros. The great news is that you can move at your organization's comfort level and never feel blocked.
You don't need to pay a single cent to start seeing the traces from your data. Because OpenTelemetry is vendor-neutral, you have complete freedom of backend selection. You can run an open-source solution like Jaeger all-in-one to quickly visualize traces on your own machine. Or you can make use of Honeycomb's free tier to send up to 20 million trace spans per month while you're prototyping (or forever!).
Speaking of prototyping, you don't need to start off sending production data (with all the data protection requirements that may have). You can get good performance results and learn surprising things just from your development or staging environment. That'll let you show increasing value over time and make the case for bringing Honeycomb closer to your real production environment, where it can solve real production issues. Plus, small production services can still fit into our free tier!
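To judge whether a service is "small" enough, a quick back-of-the-envelope calculation helps. The sketch below uses the 20 million spans per month figure quoted above; the 30-day month and the spans-per-request values are assumptions, so substitute your own numbers.

```python
FREE_TIER_SPANS_PER_MONTH = 20_000_000  # figure quoted in this article
SECONDS_PER_MONTH = 30 * 24 * 60 * 60   # assumes a 30-day month


def sustainable_requests_per_second(spans_per_request):
    """Average request rate that stays within the monthly span budget."""
    spans_per_second = FREE_TIER_SPANS_PER_MONTH / SECONDS_PER_MONTH
    return spans_per_second / spans_per_request


# Hypothetical trace sizes: a single span, a small trace, and a deeper trace.
for spans in (1, 10, 50):
    rate = sustainable_requests_per_second(spans)
    print(f"{spans:>2} spans/request -> about {rate:.2f} requests/second on average")
```

With one span per request the budget works out to roughly 7.7 requests per second sustained over the month, and it shrinks proportionally as each request produces more spans.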
Honeycomb is useful even for just a single microservice that your team owns, since it gives you access to high-cardinality data and shows you client spans for your outgoing requests from day one. It's not necessary to trace every single upstream or downstream service before you see the first value. Of course, following the flow of requests into dependencies beyond your individual service is helpful, so Honeycomb provides compounding benefits the more of your upstream and downstream services are connected.
After you've seen results with the free tier and are ready to trace in production, Honeycomb Pro is a great step that provides metered pricing according to the number of trace spans you send us. When you're ready to send more significant data volumes, you can start a free trial of Honeycomb Enterprise, where you'll get support with Refinery for managing and sampling your data.
Expanding value to your entire organization
So far, we've covered how to progress from utilizing flat logs to automatically generated traces to custom instrumented, rich data, along with how to expand from small amounts of data to the full volume of data you need to level up your team's debugging in production.
Now that you and your team are seeing real value from observability, you can generalize your team-specific solution and expand it for organization-wide adoption. At a larger scale, you may need to spend more than you can expense on your own company card. But that's okay: you can use the evidence of how much happier and more productive your team is to persuade one or two adjacent teams to extend your traces into theirs and join you on a larger Pro tier. As Nick Herring said on o11ycast, showing up with the shiniest fire truck and actually putting out the fire with it can be incredibly persuasive.
As Honeycomb gains momentum within your company, you'll have enough volume to qualify for volume discounts with an Enterprise purchase. And once you've reached the tipping point of adoption, the way is clear for replacing a legacy APM or logging tool with Honeycomb at the entire company level, giving you both better debugging capabilities and net cost savings.
Hope that helps,
Have a question for Miss O11y? Send us an email!