What Is Observability Engineering?

By Rox Williams | Last modified on 2023.04.14

Although it may seem like something out of fantasy, it is possible to find out what you don’t know you don’t know through the magical powers of observability engineering.

Most organizations have monitoring in place, which is fine as far as it goes, but completely useless when it comes to finding out why a random outage has taken place. Observability engineering offers a fast way into the why by letting teams ask questions of the data, visualize anomalies, and pursue possibilities—especially if they’re far-fetched, random, and never-seen-before. In fact, it is exactly that concept of ‘one-off novelty outages’ that observability engineering was created to address.

Here’s what you need to know.

What is observability?

Observability is not a term unique to software development. It was coined in the 1960s by engineer and mathematician Rudolf Kálmán as part of his work on control theory. The idea took hold in the software space during the late 2010s, when complex cloud-native systems demanded a deeper level of diagnostics than ever before.

Observability is how modern software development teams discover and understand a problem within a service. It provides teams with a way to ask fact-finding questions of their data, pursue leads, and generally explore anything and everything that’s occurring during an incident. An observable system allows engineers to look past pre-defined monitoring alerts and dive deep into all areas of the system to pursue answers they’d never thought of before. Arbitrarily-wide structured events are at the heart of observability (in our opinion) because each one can contain hundreds of fields that can be dissected as needed or correlated to surface anomalous patterns.
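To make that concrete, here is a minimal sketch of what a single wide event might look like, written in Python purely for illustration; the field names are hypothetical, not a prescribed schema:

```python
import json
import time

# One wide event per unit of work: everything known about a single request,
# flattened into key/value pairs on one structured record.
event = {
    "timestamp": time.time(),
    "service": "checkout",
    "endpoint": "/api/cart/checkout",
    "duration_ms": 312,
    "status_code": 200,
    "user_id": "user-48213",            # high-cardinality fields are welcome
    "cart_items": 7,
    "payment_provider": "stripe",
    "feature_flag.new_pricing": True,
    "db.query_count": 4,
    "region": "eu-west-1",
    "build_id": "2023.04.12-7f3c1a",
    # ...in practice, hundreds of fields like these
}

# Ship the event to your observability backend; printing stands in for that here.
print(json.dumps(event))
```

Because the whole context of the request travels in one record, any of these fields can later become a filter, a group-by, or a place to spot an anomaly.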

What observability isn’t, however, is ‘three pillars.’ If you’ve never heard of the three pillars before, it’s the idea that observability telemetry divides into three separate buckets: metrics, logs, and traces. The keywords here are divided and separate. We won’t bore you too much on why that’s unequivocally wrong (we’ve written about it enough), but the CliffsNotes version is that observability should give you a complete picture of your system, not only parts of it that you must manually stitch together. So why separate them to begin with? The three pillars also can’t contain all the data required for true observability: business metrics, customer feedback, CI/CD pipeline performance, and many other steps in the SDLC can provide valuable clues and context along the journey.

How do you determine if a system is observable?

Ask the following questions to determine if a system is truly observable:

  • Is it possible to ask an unlimited number of questions about how your system works without running into roadblocks?
  • Can the team get a view into a single user’s experience?
  • Is it possible to quickly see a cross-section of system data in any configuration?
  • Once a problem is identified, is it possible to find similar experiences across the system?
  • Can the team quickly find the most load-generating users, hidden timeouts and faults, or the one random user complaining about timeouts?
  • Can these questions be asked even if they’ve never been imagined before?
  • Once these questions have been asked, can they be iterated on, leading the team down new rabbit holes of data and exploration?
  • If you have to query your current system, are you able to include and group endless numbers of dimensions, regardless of their importance? And do the query responses come back quickly? (A sketch of this kind of slicing follows below.)
  • And finally, do debugging journeys normally end with surprising—or even shocking—results?

An answer of “yes” to all of the above means a system is observable, and also illustrates the observability journey.
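To illustrate the kind of ad-hoc slicing those questions describe, here is a rough sketch in Python using pandas; the events and field names are invented, and a real observability tool would run equivalent queries interactively over far larger volumes of data:

```python
import pandas as pd

# Hypothetical wide events, one row per request.
events = pd.DataFrame([
    {"endpoint": "/checkout", "user_id": "u1", "region": "eu-west-1", "duration_ms": 950, "status": 504},
    {"endpoint": "/checkout", "user_id": "u2", "region": "us-east-1", "duration_ms": 120, "status": 200},
    {"endpoint": "/search",   "user_id": "u1", "region": "eu-west-1", "duration_ms": 880, "status": 504},
    {"endpoint": "/search",   "user_id": "u3", "region": "us-east-1", "duration_ms": 95,  "status": 200},
])

# "Who is timing out, and where?" -- group by any combination of dimensions.
timeouts = events[events.status == 504]
print(timeouts.groupby(["region", "user_id"]).duration_ms.describe())

# Iterate: the answer suggests a new question, so slice again by another field.
print(events.groupby("endpoint").duration_ms.quantile(0.95))
```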

Observability engineering requires tools, of course, to allow for data exploration. However, it also requires a curiosity culture where the question “why?” is prevalent. In an ideal world, observability is baked into a software service from the beginning, and organizational enthusiasm for problem-solving is also baked in.

Observability vs. monitoring

Observability and monitoring are often mentioned in the same breath, but they are in fact distinct entities and can be boiled down to “known” vs. “unknown.”

Monitoring was originally built for predictable monolithic architectures, so it’s firmly planted in the “known” realm, where engineers set up alerts based on their knowledge of what might fail. The alerts, in turn, tell engineers where the problem is, but not why it’s happening. Monitoring’s other serious limitation is that it can’t handle a “never seen that before” situation, simply because it only alerts on known problems: the failure modes that engineers and APM vendors have predefined as “important” for decades.

Observability, on the other hand, was created for modern distributed systems. It isn’t about alerting once something is already broken and impacting the user experience, but rather, the ability to examine the entire system and user experience in real time to surface anomalies and answer why something is happening before it degrades user experience. Observability engineers approach fact-finding without any preconceived notions and let the data tell them where to look and ask.

Why is observability engineering important?

At a time when consumer tolerance for outages has all but disappeared, the importance of observability engineering can’t be overstated. Teams using an observability strategy can find and fix incidents more quickly than those relying solely on monitoring. Observability engineering is also important because it offers tools and culture changes that support modern software development, including:

  • A more robust incident response process.
  • Increased transparency.
  • A broader understanding of how users interact with the product.
  • The opportunity to build observability into software as it’s being created, rather than after the fact.
  • Improved workflows and feedback loops.
  • True visibility into production environments, which creates opportunities for tweaks and improvements.
  • Better understanding of business needs/requirements.
  • The ability to create a culture of problem-solvers.

Benefits of observability engineering

Teams building complex modern software can appreciate the full benefits of observability engineering, starting with the speed of incident resolution. The faster a problem is found, the faster it can be fixed, saving organizations time, money, and worries about reputation damage. The money saved is potentially substantial: software downtime can cost $9,000 per minute, according to research from the Ponemon Institute.

Observability engineering has other benefits as well. Without the need to spend endless hours sorting through logs to resolve issues, teams are able to work on higher-value projects like developing new features or increasing reliability. Many organizations suffer from a “too much information” problem, but observability engineering manages all that data, extracting relevant information that can help resolve an outage. Observability engineering can also help corral data from disparate systems, helping to ease the overwhelming amount of information teams have to process. And there’s a side benefit to surfacing all that data: teams can be more transparent about all aspects of the product, and transparency is key to efficient software development. 

And finally, when developers roll out distributed tracing as part of an observability engineering effort, they can see the concrete effects of their code: individual user requests become traces that show exactly how a request flows through specific infrastructure components. That leads to better, more efficient application development.
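As an illustration of what that instrumentation can look like, here is a minimal distributed-tracing sketch using the OpenTelemetry Python SDK; the service name, span names, and attributes are invented for the example:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Minimal setup: export spans to the console. In a real service you would
# export to your observability backend (for example, over OTLP) instead.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")

def handle_checkout(user_id: str, cart_items: int) -> None:
    # The parent span represents the whole user request...
    with tracer.start_as_current_span("checkout") as span:
        span.set_attribute("user.id", user_id)          # high cardinality, and that's fine
        span.set_attribute("cart.items", cart_items)
        # ...and child spans show how the request flows through each component.
        with tracer.start_as_current_span("charge-payment"):
            pass  # call the payment provider here
        with tracer.start_as_current_span("write-order"):
            pass  # write to the orders database here

handle_checkout("user-48213", 7)
```

The shape is what matters here: one parent span per request, one child span per component the request touches, and attributes that tie the trace back to a specific user.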

Challenges of observability engineering

Observability is a shift in culture, process, and tools—and that comes with understandable apprehension. When monitoring is all you’ve known for years and it’s worked decently enough, it can be hard to justify a change, not only to business stakeholders but also to engineers who are used to a certain way of doing things. But observability champions who succeed in bringing observability into their organization often rise through the ranks quickly as their teams become more efficient and drive greater impact.

Another barrier to observability can be instrumentation. In our Ask Miss O11y series, we’ve received questions from engineers trying to get their manager on board with spending the time to add instrumentation. This can feel daunting, so find a vendor with comprehensive documentation and thought leadership around best practices.

That’s all a long way of saying that the biggest challenges of observability engineering revolve around getting the business side to buy in to a tool purchase (i.e., making the case for improved ROI) and nurturing a culture of observability.

The business case should be a straightforward one: observability engineering will save the company from lengthy, costly outages and unhappy users.

What does an observability engineer do?

An observability engineer by any other name could be called an SRE, platform engineer, system architect, any type of DevOps engineer, a tooling admin, or… ?

The term “observability engineer” is a relatively new moniker for team members charged with building data pipelines, monitoring, working with time-series data, and maybe even distributed tracing and security. While an observability engineer doesn’t necessarily need highly specialized training, the role does require someone who is comfortable with all that data, is curious and likes to solve problems, and has strong communication skills.

The ideal observability engineer would be the organization’s observability champion, choosing platforms and tooling and cross-training key members of the team. They would have a strong grasp of the business needs, customer experience, and product goals. This role would stay up to date with the latest trends in observability and could help create and lead an incident response team through the observability wilderness.

What is observability-driven development?

We’ll take your test-driven development and go one further: observability-driven development (ODD) is a superpower your team can use locally to identify potential issues before they’re actually out in the wild.

We’re not the only ones excited about this: Gartner named observability-driven development as on the rise in its 2022 Gartner Hype Cycle for Emerging Technologies.

Just as test-driven development shifted testing left (and has been a tremendously popular and successful strategy for DevOps teams everywhere), it is possible to shift observability left so that more of it is in the hands of a developer while the code is being written. This is what we like to call “tracing during development” and it has a number of key advantages:

  • Developers are the logical folks to tackle this during coding, rather than having to go back later.
  • ODD means less context switching, and also eliminates the need to attach debuggers and the tediousness of hitting API calls one by one.
  • The process of observability-driven development is going to result in better and cleaner code. Devs can see if the code is “well-behaved” before it gets into production (see the sketch below).

That said, we know it might be hard to get developers excited about a big change like ODD. Our best advice: start slowly, make a big deal of the wins, and add instrumentation as incidents happen.
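Here is one sketch of what “tracing during development” can look like in practice: a developer-written test that asserts on the telemetry the code emits, using OpenTelemetry’s in-memory exporter. The function, field names, and test are hypothetical:

```python
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter

# Capture spans in memory so a test can inspect them; no backend required.
exporter = InMemorySpanExporter()
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(exporter))
tracer = provider.get_tracer("checkout-service-dev")

def apply_discount(price: float, user_tier: str) -> float:
    # Instrumented as it is written: the span records the decision the code made.
    with tracer.start_as_current_span("apply-discount") as span:
        rate = 0.10 if user_tier == "gold" else 0.0
        span.set_attribute("user.tier", user_tier)
        span.set_attribute("discount.rate", rate)
        return price * (1 - rate)

def test_discount_is_traced():
    # Instead of attaching a debugger, assert on the emitted telemetry.
    apply_discount(100.0, "gold")
    (span,) = exporter.get_finished_spans()
    assert span.name == "apply-discount"
    assert span.attributes["discount.rate"] == 0.10

test_discount_is_traced()
```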

Observability engineering best practices

There are a number of best practices teams can employ to get the most out of observability engineering, including:

  • Choose the right tooling: Teams need to be able to see how users experience their code, in real time, even in complicated environments. Tools need to be able to analyze high-cardinality and high-dimensionality data, and do it quickly. Observability platforms must never aggregate or discard telemetry, and should be able to work directly with databases.
  • Understand what observability may cost: Expect it to run 20% to 30% of what you invest in infrastructure. Look for tools that have predictable pricing to avoid surprise overage bills.
  • But don’t overdo it: Auto-instrument what you can at first to get insights quickly, then take the time to add custom tags through manual instrumentation to truly leverage the power of high-cardinality data (see the sketch after this list).
  • Speed up the feedback loop: When trying to resolve an incident, observability means speed—and speed is everything. Ensure the team is structured to get the most out of fast feedback loops.
  • Look at end-to-end requests: Remember that context is vital and so is ordering.
  • Raw data is king (and you can’t have too much of it!): Don’t settle for less. Any and all types of data are welcome because more data means more context and thus faster problem solving.
  • Structured data and wide events are also king: Make sure logs and events are structured so you can maximize the power of your query engine.
  • For every hop (or service or query), generate a single event: One wide event per unit of work is the accepted industry best practice, and it keeps the data easy to query and correlate.
  • Learn to love context: When you don’t even know what you’re looking for, context is what can help guide the process. Everyone on the team should be encouraged to always look for the extra details.

And perhaps most importantly, observability can’t happen without a robust, supportive, and inherently curious culture in place. We know a culture play can be challenging in some organizations, but observability needs to be a team effort in order to get the most out of it. It starts with developers: they need to instrument their code, and they may need to be convinced of the value of that effort. It’s empowering for devs to not only write the code but truly own it in production (though we acknowledge this can be a big change in some organizations). Service ownership is the most straightforward way to build and sustain a culture of observability.

Also, don’t forget that observability is all about asking questions that haven’t been asked before, so keep reminding teams they’re creating a process for future incidents and their future selves. This is easier than it sounds, because o11y tools hang on to query history so teams can learn from each other when familiar situations arise.

Yes, observability enables high-performance engineering

Engineering teams continually strive for faster release times, but what happens when there is an outage? Time is of the essence, which is why observability engineering needs to be part of any modern software development effort.

Observability engineering surfaces data and anomalies, allowing for faster diagnostics and resolution. Tooling and a culture committed to answering the question “why?” are vital for successful observability engineering, and luckily both fit seamlessly into a modern DevOps practice.

Explore how Honeycomb tackles observability

Additional resources

  • Case study: HelloFresh Improves Organization-Wide Performance With Honeycomb
  • Book: Honeycomb’s O’Reilly book, Observability Engineering
  • Case study: Intercom Accelerates Developer Productivity With Observability and Distributed Tracing
  • Video: Intro to o11y Topic 1: What Is Observability?
  • Video: Two-Minute Honeycomb Demo