Observability Engineering: Achieving Production Excellence



A note for Early Release readers

With Early Release ebooks, you get books in their earliest form—the authors’ raw and unedited content as they write—so you can take advantage of these technologies long before the official release of these titles. If you have comments about how we might improve the content and/or examples in this book, or if you notice missing material within this chapter, please reach out to the editor at vwilson@oreilly.com.

Thank you for picking up our book on observability engineering for modern software systems. Our goal is to help you understand how to develop a practice of observability within your engineering organization. This book is based on our experience as practitioners of observability, and as makers of observability tooling for users who want to improve their own observability practices. As outspoken advocates for driving observability practices in software engineering, our hope is that this book can set a clear record of what observability means in the context of modern software systems. The term “observability” has seen quite a bit of recent uptake in the software development ecosystem. This book aims to help you separate facts from hype by providing a deep analysis of:

  • What observability means in the context of software delivery and operations
  • How to build the fundamental components that help you achieve observability
  • The impact observability has on team dynamics
  • Considerations for observability at scale
  • Practical ways to build a culture of observability in your organization

Who this is for

Because observability predominantly focuses on achieving a better understanding of how software operates in the real world, this book is most useful for software engineers responsible for developing production applications. Anyone who supports the operation of software in production will also greatly benefit from the content in this book.

Additionally, managers of software delivery and operations teams who are interested in understanding how the practice of observability can benefit their organization will find value in this book, particularly in the chapters that focus on team dynamics, culture, and scale.

Anyone who helps teams deliver and operate production software and is curious about this new thing called “observability” and why people are talking about it should also find this book useful.

Why we wrote this book

Observability has become a popular topic that has quickly garnered a lot of interest and attention. With its rise in popularity, observability has been unfortunately mischaracterized as a synonym for monitoring or system telemetry. Observability is a characteristic of software systems. Further, it’s a characteristic that can only be effectively utilized in production software systems when teams adopt new practices that support its ongoing development. Introducing observability into your systems is both a technical challenge and a cultural challenge.

We are particularly passionate and outspoken about the topic of observability. We are so passionate about it that we started a company whose sole purpose is to bring the power of observability to all teams that manage production software. We spearheaded a new category of observability tools, and other vendors have followed suit.

While we all work for Honeycomb, this book is not here to sell you on our tools. We have written this book to explain how and why we adapted the original concept of observability to managing modern software systems. You can achieve observability with different tools and in different ways. However, we believe that our dedication to advancing the practice of observability in the software industry makes us uniquely qualified to write a guide that describes, in great detail, the common challenges and effective solutions. You can apply the concepts in this book, regardless of your tool choices, to practice building production software systems with observability.

This book aims to give you a look at the various considerations, capabilities, and challenges associated with teams that practice using observability to manage their production software systems. At times, this book may provide a look at what Honeycomb does as an example of how a common challenge has been addressed. These are not intended as endorsements of Honeycomb, but rather as practical illustrations of abstract concepts. It is our goal to show you how to apply these same principles in other environments, regardless of the tools you use.

What you will learn

You will learn what observability is, how to identify an observable system, and why observability is best suited for managing modern software systems. You’ll learn how observability differs from monitoring, as well as why and when a different approach is necessary. We will also cover why industry trends have helped popularize the need for observability and how that fits into emerging spaces, like the cloud-native ecosystem.

Next, we’ll cover the fundamentals of observability. We’ll examine why structured events are the building blocks of observable systems and how to stitch those events together into traces. Events are captured by telemetry built into your software and you will learn about open-source initiatives, like OpenTelemetry, that help jumpstart the instrumentation process. Instrumentation exists to analyze system events, and you will learn both how that analysis works and how observability and monitoring can co-exist.

After an introduction to the technical concepts behind observability, you will learn about the social and cultural elements that often accompany the adoption of observability. Managing software in production is a team sport, and you will learn how observability should be used to help better shape team dynamics. You will learn about how observability fits into business processes, affects the software supply chain, and reveals hidden risks. And you will learn about the intersection between business objectives, engineering team needs, and user experience that is captured with Service Level Objectives and their role in observable systems.

Observability presents further challenges for large-scale organizations. You will learn about the challenges of efficient data storage for system events, managing large quantities of data with pipelines, deciding when and how to introduce event sampling solutions, and the considerations to take into account when embarking down the path of building your own observability solution.

Finally, we look at organizational approaches to adopting a culture of observability. Beyond introducing observability to your team, you will learn practical ways to scale observability practices across an entire organization. You will learn how to identify and work with key stakeholders, use technical approaches to win allies, and make a business case for adopting observability practices.

Chapter 1: What is Observability?

In the software development industry, the subject of observability has garnered a lot of interest and is frequently found in lists of hot new topics. But when interest in a complex topic surges this quickly, the topic is ripe for misunderstanding unless you take a deeper look at the many nuances hidden behind a simple label. This chapter looks at the mathematical origins of the term “observability” and examines how it has been adapted to describe characteristics of production software systems.

We also look at why the adaptation of observability for use in production software systems is necessary. Traditional practices for understanding the internal state of software applications rely on approaches that were designed for simpler legacy systems than those we typically manage today. As system architecture, infrastructure platforms, and user expectations have continued to evolve, the tools we use to reason about those components have not. Systems that only take aggregate measures into account don’t provide the type of visibility needed to isolate very granular anomalies.

New methods for quickly finding needles buried in proverbial haystacks were born from necessity. This chapter will help you understand what observability means, how to determine if a software system is observable, why observability is necessary, and how observability is used to find problems in ways that are not possible with other approaches.

The mathematical definition of observability

The term “observability” was coined by engineer Rudolf E. Kálmán in 1960. Since then, it has grown to mean many different things in different communities. Let’s explore the landscape before turning to our own definition of observability for modern software systems.

In his 1960 paper,¹ Kálmán introduced a characterization he called observability to describe mathematical control systems. In control theory, observability is defined as a measure of how well internal states of a system can be inferred from knowledge of its external outputs.

This definition of observability would have you study observability and controllability as mathematical duals, along with sensors, linear algebra equations, and formal methods. This traditional definition of observability is the realm of mechanical engineers and those who manage physical systems with a specific end-state in mind. If you are looking for a mathematics- and process-engineering-oriented textbook, you’ve come to the wrong place. Those books definitely exist: as any mechanical engineer or control systems engineer will inform you (usually passionately and at great length), observability has a formal meaning in traditional systems engineering terminology.
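
For orientation only, that formal meaning is usually stated as a rank test for a linear time-invariant system. The sketch below follows common control-theory textbook notation rather than anything defined in this book: the symbols A, C, x, y, and n are assumptions, and control inputs are omitted for brevity.

```latex
% State x(t) evolves under A; only the output y(t) can be measured.
\dot{x}(t) = A\,x(t), \qquad y(t) = C\,x(t), \qquad x(t) \in \mathbb{R}^{n}

% Observability matrix and the rank condition.
\mathcal{O} =
\begin{bmatrix} C \\ CA \\ \vdots \\ CA^{\,n-1} \end{bmatrix},
\qquad
\text{the system is observable} \iff \operatorname{rank}(\mathcal{O}) = n

% Worked example with x = (position, velocity), n = 2.
A = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}, \quad
C = \begin{bmatrix} 1 & 0 \end{bmatrix}
\;\Rightarrow\;
\mathcal{O} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \quad
\operatorname{rank}(\mathcal{O}) = 2 \quad \text{(observable)}

C' = \begin{bmatrix} 0 & 1 \end{bmatrix}
\;\Rightarrow\;
\mathcal{O}' = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}, \quad
\operatorname{rank}(\mathcal{O}') = 1 \quad \text{(not observable)}
```

In the example, measuring position lets you infer velocity over time, so the full internal state can be recovered from the output. Measuring only velocity never reveals the starting position, so part of the state stays hidden.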

However, when adapted for use with squishier virtual software systems, that same concept opens up a radically different way of interacting with the code you write.

Applying observability to software systems

Kálmán’s definition of observability can also be applied to modern software systems. When applying the concept of observability to software, we must also layer additional considerations that are specific to the software engineering domain.

For a software application to have observability, the following things must be true. You must be able to:

  • Understand the inner workings of your application
  • Understand any system state your application may have gotten itself into
  • Understand the things above solely by observing them with external tools
  • Understand that state, no matter how extreme or unusual

A good litmus test for determining if those conditions are true is to ask yourself the following questions:

  • Can you continually answer open-ended questions about the inner workings of your software to explain any anomalous values?
  • Can you understand what any particular user of your software may be experiencing?
  • Can you determine the things above even if you have never seen or debugged this particular state or failure before?
  • Can you determine the things above even if this anomaly has never happened before?
  • Can you ask arbitrary questions about your system and find answers without needing to predict what those anomalies would be in advance?
  • And can you do these things without having to ship any new code to handle or describe that state (which would have implied that you needed to understand it first in order to … understand it)?

Meeting all of the above criteria is a high bar for many software engineering organizations to clear. If you can clear that bar, you no doubt understand why observability has become such a popular topic for software engineering teams.

Put simply, our definition of observability for software systems is a measure of how well you can understand and explain any state your system can get into, no matter how novel or bizarre. You must be able to comparatively debug that bizarre or novel state across all dimensions of system state data, and combinations of dimensions, in an ad-hoc manner, without being required to define or predict those debugging needs in advance. If you can understand that bizarre or novel state without shipping new code, then you have observability.
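
As a rough illustration of the ad-hoc, cross-dimensional debugging described above, the sketch below groups structured events by any combination of fields chosen at query time. Everything in it is hypothetical: the event fields (`endpoint`, `region`, `app_version`, `user_id`, `duration_ms`), the sample data, and the `breakdown` and `slowest` helpers are assumptions made for illustration, not the API of any particular tool.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical structured events: one wide record per request.
# All field names and values are illustrative assumptions.
EVENTS = [
    {"endpoint": "/export", "region": "us-east", "app_version": "4.2.1",
     "user_id": "u1", "duration_ms": 120},
    {"endpoint": "/export", "region": "us-east", "app_version": "4.2.2",
     "user_id": "u2", "duration_ms": 2300},
    {"endpoint": "/export", "region": "eu-west", "app_version": "4.2.2",
     "user_id": "u3", "duration_ms": 140},
    {"endpoint": "/export", "region": "us-east", "app_version": "4.2.2",
     "user_id": "u4", "duration_ms": 2500},
    {"endpoint": "/login", "region": "us-east", "app_version": "4.2.2",
     "user_id": "u2", "duration_ms": 45},
]


def breakdown(events, group_by, where=None):
    """Group events by any combination of fields chosen at query time.

    Returns {group_key_tuple: (event_count, mean_duration_ms)}.
    """
    groups = defaultdict(list)
    for event in events:
        # Optional filter, also supplied at query time.
        if where is not None and not where(event):
            continue
        key = tuple(event.get(field) for field in group_by)
        groups[key].append(event["duration_ms"])
    return {key: (len(d), mean(d)) for key, d in groups.items()}


def slowest(events, group_by, where=None):
    """Return the group key with the highest mean duration."""
    result = breakdown(events, group_by, where)
    return max(result, key=lambda key: result[key][1])


def is_export(event):
    return event["endpoint"] == "/export"


# A question nobody defined in advance: which region and version
# combination is slow for the export endpoint?
print(slowest(EVENTS, ["region", "app_version"], where=is_export))
```

Because the grouping fields are ordinary arguments, asking a different question (grouping by `user_id` instead, for example) requires no change to the instrumentation that emitted the events and no new code to ship.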

We believe that adapting the traditional concept of observability for software systems in this way is a unique approach with additional nuances worth exploring.

For modern software systems, observability is not about the data types or inputs, nor is it about mathematical equations. It is about how people interact with and try to understand their complex systems. Therefore, observability requires recognizing the interaction between people and technology to understand how those complex systems work together.

If you accept that definition, many additional questions emerge that demand answers:

  • How does one gather that data and assemble it for inspection?
  • What are the technical requirements for processing that data?
  • What team capabilities are necessary to benefit from that data?

We will get to these questions and more throughout the course of this book. For now, let’s put some additional
