Webinar Recap: How to Avoid Being On Call With Under-Instrumented Tools
By Jessica Kerr | Last modified on June 24, 2022

"It's too expensive!"
"Do we really need another tool?"
"Our APM works just fine."
With strapped tech budgets and an abundance of tooling, it can be hard to justify a new expense, or something new for engineers to learn, especially when they feel their current tool does the job adequately. But does it?
All it takes is one bad on-call experience, one session of floundering around trying to find the real problem, to feel how under-instrumented your tooling might be. That's what happened to retired SRE Paige Cruz. An on-call incident made her see clearly how much more context she needed behind alerts when her service went down. In a recent discussion, we learned more about her observability journey and debunked some common myths. Let's dive in.
Story time, and it's one many of us know all too well
It was a typical Thursday morning. Paige was checking things off her to-do list when the incident channel in Slack lit up. As an SRE, her natural instinct was to jump in, but she wasn't on call. Her team was in the throes of a high-priority migration, so she stayed focused on the task at hand, even as the number of messages continued to climb.
Finally, her team lead messaged her directly and asked if a change she made went through. He followed up by confirming that the entire system was down. A bunch of thoughts started running through her mind: But a bunch of people reviewed my PR. What the heck? Surely it's not my one-line change? She said she didn't know how it could be related, but she would roll the change back. After all, now was the time for mitigation, not investigation, even though her mind was begging to investigate.
Before we disclose what happened, here's a little quiz. What do you think caused the outage?
- A. A bug
- B. A botched config rollout
- C. A random zonal failure
- D. An accidental Terraform apply in the wrong environment
If you guessed answer B, you are correct.
Paige's one-line change lay in a nested layer of YAML. In the prod environment, and only in that environment, there lived a security scanner that served as the front door to their Kubernetes cluster. Without it, no requests from the external CDN could come into the cluster. That one-line change shifted how the YAML was merged, and removed that scanner. No requests got in. Devs saw a drop in throughput and scary HTTP status codes. But with their current APM tool, they couldn't surface the problem. The team saw views specific to their app, not the full system. (Sad face.)
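The webinar doesn't share the actual config, but the failure mode is a common one. Here's a minimal sketch, with entirely made-up keys, of how merging an environment override into a base config can silently drop a prod-only block like that scanner:

```python
import copy

# Hypothetical base config: prod puts a security-scanner "front door"
# in front of the ingress. These keys are illustrative, not Paige's real config.
base = {
    "ingress": {
        "middlewares": ["security-scanner", "rate-limiter"],
        "host": "example.internal",
    }
}

# A one-line-looking override that replaces the whole nested block
# instead of merging into it.
override = {
    "ingress": {
        "host": "example.com",
    }
}

def shallow_merge(a: dict, b: dict) -> dict:
    """Top-level merge: values in b replace whole subtrees in a."""
    merged = copy.deepcopy(a)
    merged.update(b)
    return merged

merged = shallow_merge(base, override)
print(merged["ingress"])
# {'host': 'example.com'}  <- the security-scanner middleware is gone,
# so nothing from the external CDN can get into the cluster.
```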
From this incident, Paige learned that even if you do things by the book, like getting PRs reviewed, mistakes can still happen. Perhaps more jarring, she began to see how APMs have some hefty limitations when it comes to understanding all the things that can go awry, specifically in today's modern distributed systems. To circle back to our original question: are your current APM tools doing the job they need to do? In Paige's case, and most likely for many teams running software in cloud environments, no. Let's explore why.
After this incident (about six years ago), Paige turned to a shiny new concept known as observability. In theory, observability brought promises of full visibility into systems and faster debugging. As she implemented observability in her organization, she encountered three myths (objections, misconceptions) folks often have about observability in practice, and learned how to overcome them.
Myth One: Observability is too expensive.
"We just don't have the budget," says the boss.
Teams are conditioned by the bills they receive from their current APM tools, especially when they're seeking budget approval for a new observability tool to replace that APM. Many of us have heard the answer: "Our bill is already high enough. Why do we need another tool?" Let's debunk Myth One.
Perhaps you're using your tool to monitor requests with an HTTP status in the 500 range, per service. Then you add the particular HTTP status code:
http.request.status_5xx service:$X code:500
5 HTTP status metrics
x 15 total services
x 75 HTTP status codes
= 5,625 potential unique metrics
The more detailed the metrics, the more expensive they are.
Next, it is useful to know which users are sending those requests:
http.request.status_5xx service:$X code:500 user_id:123456
5 HTTP status metrics
x 15 total services
x 75 HTTP status codes
x 100,000 users
= 562,500,000 potential unique metrics
That looks like $$$.
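To make the arithmetic concrete, here's a quick back-of-the-envelope version of the same calculation in Python. The counts are the illustrative ones from above, not measurements from any real system:

```python
# Back-of-the-envelope custom-metric cardinality, using the illustrative
# counts from above. Each unique combination of tag values becomes its
# own time series, and most metrics vendors bill per series.
status_metrics = 5        # e.g., one metric per HTTP status class
services = 15
status_codes = 75
users = 100_000

per_service_and_code = status_metrics * services * status_codes
print(f"{per_service_and_code:,} series")           # 5,625 series

with_user_id = per_service_and_code * users
print(f"{with_user_id:,} series with user_id")      # 562,500,000 series
```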
With traditional APM dashboards, you're charged for every data series you want to ingest and track beyond the automatically included fields. Dimensions with many values (important stuff like container ID, customer, full URL) get extremely expensive. Observability tools don't have this restriction. Add hundreds of fields, each with millions of values, at no additional cost. Throw in region, User-Agent, result, and even IP address. Include everything relevant to the decisions your code makes, and also what the code decided. Include identifying information so that you can find this particular event when you want to study it.

You'd never send these wide events, with their high-cardinality (many-valued) fields, to an APM tool. That would be too expensive! Instead, send them to your observability tool. Then query by any field in seconds, group by as many dimensions as you need, surface correlations between events across all fields, and you're charged only for events per month.
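As a concrete illustration, here's a minimal sketch of attaching those high-cardinality fields to one wide event using the OpenTelemetry Python SDK. The service, field names, and values are all made up for the example:

```python
from opentelemetry import trace

tracer = trace.get_tracer("checkout-service")

def handle_request(request, user, cart):
    # One wide event (span) per unit of work, with as many fields as are
    # useful. All names and values here are illustrative.
    with tracer.start_as_current_span("handle_checkout") as span:
        span.set_attribute("http.route", "/checkout")
        span.set_attribute("http.status_code", 500)
        span.set_attribute("user.id", user.id)                  # high cardinality
        span.set_attribute("container.id", "c-9f8e7d6a")        # high cardinality
        span.set_attribute("url.full", request.url)             # high cardinality
        span.set_attribute("client.address", request.remote_addr)
        span.set_attribute("cart.item_count", len(cart.items))
        span.set_attribute("checkout.retried", cart.retries > 0)  # what the code decided
```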
Myth Two: Observability is too difficult.
"Another tool? Who's gonna own that?" says the SRE team.
Tools aren't magic. I'm not going to tell you that observability is easy. There's work in instrumenting applications, and there's effort in learning a new interface.
Paige points out that all of that work pales in comparison to the difficulty of working without observability.
On-call teams that use observability to monitor their systems often say things like, "I don't know how we did this before. I'd never go back." Especially after the first time they log in to work in the morning and Slack shows them a Service Level Objective (SLO) burn alert. That alert tells them something is wrong by the standards of their reliability commitment to the business, but not super duper urgently wrong. Someone just got a free night's sleep! Monitoring alone can't do that.
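For readers who haven't met burn alerts before, here's a rough sketch of the idea behind one. The SLO target, window, and threshold are made-up numbers, and a real observability tool computes this for you:

```python
# Rough sketch of the idea behind an SLO burn alert. Numbers are illustrative.
SLO_TARGET = 0.999             # 99.9% of requests should succeed
WINDOW_DAYS = 30
ERROR_BUDGET = 1 - SLO_TARGET  # fraction of requests allowed to fail over the window

def burn_rate(recent_error_ratio: float) -> float:
    """How fast the error budget is being spent, relative to the allowed pace.
    1.0 means the budget lasts exactly the window; >1.0 means it runs out early."""
    return recent_error_ratio / ERROR_BUDGET

# Suppose 0.4% of requests failed over the last hour:
rate = burn_rate(0.004)
print(f"burn rate: {rate:.1f}x")   # 4.0x the sustainable pace

# A burn alert fires when the rate predicts budget exhaustion well before the
# window ends: urgent enough to look at in the morning, not at 3am.
if rate > 2.0:
    print("warn: error budget burning fast; investigate during working hours")
```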
Observability enables even the newest on-call engineers to investigate what's happening. It gives them fast feedback loops, so they can get answers quickly. At a click, everyone can drill down into a detailed distributed trace. When they can see what's going on, incidents are less frightening. Debug and resolve in minutes, and then leave work on time. Or close the laptop, however we conclude work these days.
Observability takes the fear out of being on call. Then more people on the team can participate, not only the top experts.
The work of observability pays off quickly. Also! That instrumentation work is portable, thanks to OpenTelemetry. It'll continue to make your application easier to support, whatever tools you use in the future.
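For a sense of what that portable instrumentation looks like, here's a minimal OpenTelemetry setup in Python. Swapping backends later is just a matter of pointing the exporter at a different endpoint; the service name and collector URL below are placeholders:

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Vendor-neutral setup: the instrumentation in your code stays the same
# no matter which backend the OTLP exporter points at.
provider = TracerProvider(resource=Resource.create({"service.name": "checkout-service"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="https://collector.example.com:4317"))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("warm-up"):
    pass  # your real work, with span attributes, goes here
```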
Myth Three: My APM tool works just fine.
"Most of our devs know XYZ tool really well," says the engineer who set it up.
That's great! Maybe they can use the current monitoring tool to its full potential. But Paige asks: how far does that potential go?
Say a page comes in, and the monitoring tool shows you this chart. Blue is HTTP 2xx (success responses), green is 4xx, and red is 5xx (both failures).
Something is definitely wrong! But what?
This is what timeseries metrics can show us: it's not that they're devoid of information; they got us to this point. They can show us when something bad is happening, and how many bad things happened, but not why. With monitoring, you can only get as far as the $$$ custom metrics your organization stores.
It costs something to shift our existing activities from one tool to another. The payout comes with the new activities we can do. For Paige, the "query anything with events" page exemplifies this.
It may seem intimidating at first, but this page is a treasure trove of information for anyone investigating an issue. Sort the events by HTTP code; see which codes are increasing in frequency and narrow the view to those. Check which customers are getting those 4xx responses, and what they're sending. There's more to learn, because there's more power.
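The real page answers these questions interactively at scale, but the shape of the questions is simple. Here's a toy sketch in Python over a handful of made-up wide events, just to show the filter-then-group pattern:

```python
from collections import Counter

# Tiny in-memory stand-in for querying wide events: filter by any field,
# then group by another. Field names and values are illustrative.
events = [
    {"status": 404, "customer": "acme",    "path": "/export", "region": "us-east-1"},
    {"status": 404, "customer": "acme",    "path": "/export", "region": "us-east-1"},
    {"status": 403, "customer": "globex",  "path": "/admin",  "region": "eu-west-1"},
    {"status": 200, "customer": "initech", "path": "/home",   "region": "us-east-1"},
]

# "Which 4xx codes are increasing?" -> group failing events by status code.
by_code = Counter(e["status"] for e in events if 400 <= e["status"] < 500)
print(by_code)            # Counter({404: 2, 403: 1})

# "Which customers are getting those 404s, and what are they sending?"
by_customer_path = Counter(
    (e["customer"], e["path"]) for e in events if e["status"] == 404
)
print(by_customer_path)   # Counter({('acme', '/export'): 2})
```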
To recap
Observability is cost-effective, not expensive. It is a journey, and it is necessary to meet the complexity of today's systems. Learning to use it is also a journey: a journey to new superpowers.
Let's revisit the incident from earlier, where Paige took everything down by accident, except this time in a glorious world with all of the capabilities of Honeycomb and observability.
What might have happened? The on-call developers get an alert, and they pop over to Honeycomb. They see some requests failing: external requests. Internal requests are fine. Then they click on a failed request and see the trace. The first error span in the trace shows that it failed on the security scan. Paige's team lead says, "Hey, can you roll back that config change? No external requests are making it through because the security scanner isn't reachable."
Paige still thinks, "Oh my God, the change that I tested? The change that I had you and a bunch of other people review? That I baked for a long time in lower environments?" But she says, "Hey, thanks for that context! Rolling back now."
She has confidence that the action she's taking will resolve the symptoms they were seeing. And then she will go find out more about the security scanner that only exists in production.
Conclusion
After all she's learned, Paige will forever refuse to be on call without good observability.
Want to get this better on-call experience yourself? Sign up today for our free tier and see what you can learn. You can also watch this webinar on demand.
Still have questions? Book a 1:1 with Jessitron to learn more.