Observability Glossary

By Rox Williams | Last modified on 2023.10.20

New to observability? We’ve got you covered. Read on for straightforward definitions (and targeted links) that answer your observability questions.

Terms tagged with a 🐝 have a Honeycomb-specific meaning.

Glossary

Annotation 🐝

An annotation is an optional way for Honeycomb users to label or describe a Query 🐝. Annotations help users remember what a query does and are also useful for sharing queries with other team members.

Honeymarkers are a special kind of annotation that mark a point in time on a chart. We use them to indicate things like deploys. This can be helpful for, say, linking a code change to a new problem.

Right now, we don’t have the ability to add comments as markers on a chart, but we may consider adding such a feature.

Alerts

An alert is a notification or automated message generated by a monitoring or alerting system to inform individuals or teams about a specific event, condition, or issue that requires attention or action. Alerts are a crucial part of system monitoring, incident management, and SRE practices, as they help organizations respond to problems promptly to maintain system reliability and availability.

Honeycomb manages alerts via Triggers and SLOs.

Alert fatigue

Alert fatigue, or alarm fatigue, occurs when a person is exposed to too many alerts and becomes desensitized to them. Responding to alerts can be exhausting, so it’s important to try to reduce the overall number of alerts engineers get.

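A “false positive” occurs when an alert fires but nothing is wrong. False positives add to the alert load without pointing to a real problem, so they are a good place to start when reducing alert volume.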

Application Load Balancer (ALB)

An Application Load Balancer is a type of load balancer used in cloud computing and web application architectures to distribute incoming network traffic across multiple target instances, such as virtual machines, containers, or serverless functions. ALBs operate at the application layer, which allows them to make routing decisions based on characteristics of the HTTP/HTTPS requests. ALBs are a critical component for improving the availability, scalability, and reliability of web applications. Honeycomb offers an AWS Elastic Load Balancer integration.

Application Programming Interface (API)

An Application Programming Interface is a set of rules, protocols, and tools that let different applications and services communicate with each other. APIs define the methods and data formats that applications can use to request and exchange information, services, or functionality, often across different systems or platforms. We offer an API to our customers so they can send and receive data, for example when connecting to a third-party tool.

Application Performance Monitoring (APM)

Application Performance Monitoring is a set of practices, tools, and technologies used to observe, measure, manage, and optimize the performance and availability of software applications. APM solutions provide developers, DevOps/SRE teams, and system administrators with real-time insights into how an application is performing, helping them identify performance bottlenecks, diagnose issues, and improve the overall user experience. Monitoring and metrics have their place, but we feel the best questions (and thus answers) happen when using distributed tracing.

AWS

Amazon launched Amazon Web Services (AWS) in 2006 to offer organizations a cloud computing platform on a “pay-as-you-go” basis. AWS offers its customers infrastructure-as-a-service (IaaS), platform-as-a-service (PaaS), and software-as-a-service (SaaS)—all of which can be scaled up or down based on demand. AWS is the world’s leading cloud provider. Honeycomb offers an AWS integration.

Board 🐝

A Board is a collection of queries and the resulting data visualizations displayed together on a page. Query results on a Board can be displayed as charts, tables, or lists. Customers can take advantage of Board templates, a one-click way to get started. Board Filters allow you to apply one or more parameters to all queries on your board, helping you zero in on a specific issue.

Build

A build refers to source code that has been converted into software artifacts that can be run on a computer. Because a project results in many builds, each build is tracked using its “build number.”

Caption 🐝

A caption is an annotation that our users can submit when adding a query to a board. The caption will only show up on the board. Captions provide a way for someone to leave a note about why they’ve added a specific query to a Board.

Cardinality

Cardinality refers to the number of possible values a dimension can have. Some examples of low-cardinality data include booleans (true/false) and days of the week. High-cardinality examples are first name, social security number, UUID, and build ID.
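As a rough illustration, cardinality can be estimated by counting the distinct values observed for each field. The events and field names below are hypothetical, and the count of observed values is only a lower bound on the number of possible values.

```python
from collections import defaultdict

# Hypothetical events; the field names are illustrative only.
events = [
    {"is_error": False, "weekday": "Mon", "request_id": "a1f3"},
    {"is_error": True, "weekday": "Mon", "request_id": "b7c9"},
    {"is_error": False, "weekday": "Tue", "request_id": "c2d8"},
    {"is_error": False, "weekday": "Tue", "request_id": "d4e0"},
]


def observed_cardinality(rows):
    """Count the distinct values seen for each field."""
    seen = defaultdict(set)
    for row in rows:
        for field, value in row.items():
            seen[field].add(value)
    return {field: len(values) for field, values in seen.items()}


cardinality = observed_cardinality(events)
print(cardinality)
```

Here the boolean and the weekday stay small no matter how many events arrive, while the request ID gains a new value with every event.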

Chart

A chart is a graphical representation of data or information. Most of our charts are line graphs, stacked graphs, or heatmaps. While “chart” is the umbrella term, and “graph” is a specific kind of chart with axes and a mathematical representation, the words “chart” and “graph” are often used interchangeably.

Columnar database

A columnar database stores and reads records by column rather than by row. It is optimized for fast retrieval and analysis of columns of data. In Honeycomb, each dataset is stored as a separate table where the columns correspond to fields from your events.

By using a distributed column store, we avoid predefined schemas and indexes that limit true observability, and support unaggregated data that can be sliced and diced and processed along any dimension needed, anytime.
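The difference between row and column layouts can be sketched in a few lines. This is a toy in-memory illustration, not a description of how Honeycomb's storage engine works; the function and field names are assumptions made for the example.

```python
# Row-oriented records, as an application might emit them (hypothetical fields).
rows = [
    {"duration_ms": 12, "status": 200},
    {"duration_ms": 340, "status": 500},
    {"duration_ms": 25, "status": 200},
]


def to_columns(records):
    """Pivot row-oriented records into one list per field."""
    columns = {}
    for i, record in enumerate(records):
        for field, value in record.items():
            # A field first seen at row i gets None for all earlier rows.
            columns.setdefault(field, [None] * i).append(value)
        # Pad fields missing from this record so every column stays aligned.
        for values in columns.values():
            if len(values) <= i:
                values.append(None)
    return columns


columns = to_columns(rows)

# An aggregate over one field only touches that field's column.
max_duration = max(columns["duration_ms"])
print(columns, max_duration)
```

Because no schema is declared up front, a new field simply becomes a new column, with gaps for the events that did not carry it.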

Continuous integration/continuous deployment (CI/CD)

In DevOps, Continuous Integration and Continuous Deployment (CI/CD) is a set of best practices and automation techniques used in software development to streamline the process of building, testing, and deploying software changes to production environments. CI/CD aims to increase the speed, reliability, and efficiency of software delivery by automating various stages of the software development lifecycle. Development teams use a CI/CD pipeline to automatically test and launch new or updated code. Honeycomb can help teams increase visibility into their CI/CD pipelines.

Configuration-as-code (CaC)

Configuration-as-code is a concept in software development and DevOps that involves managing and provisioning system configurations, infrastructure settings, and application configurations using code. CaC extends the principles of version control and automation to configuration management, letting organizations treat configuration settings as code artifacts that can be stored, versioned, and automated alongside application code. By keeping access to configuration separate from the actual code, CaC allows for increased traceability, security, and management.

Context

Context refers to the collection of additional dimensions, fields, or tags on a piece of data that tells you more about the internal state of your system. Context can include metadata, system-level metrics, and domain-specific information about the running program.

Database

A database is an organized collection of data. Databases are designed to efficiently manage, store, retrieve, and manipulate various types of data, including text, numbers, dates, images, and more. There are a variety of popular database formats, including MySQL, MongoDB, and PostgreSQL, all of which Honeycomb works with to provide data observability.

Data model

A data model is a conceptual representation of the structure, organization, and relationships of data in a database or information system. It defines how data is stored, accessed, and manipulated within a database, serving as a framework for designing, implementing, and understanding the database’s structure and behavior. We have data modeling best practices to let customers get the most out of our observability tooling.

Dataset 🐝

A dataset is a structured collection of data that is organized and formatted in a way that makes it suitable for analysis, research, or information retrieval. For instance, in ui-dogfood, we have a dataset called user-events that collects data about how people are using the Honeycomb UI (the frontend). We can use ui-dogfood to see who is actively using the Honeycomb interface. Check out our dataset best practices.

Deploy

A software deployment, or a deploy, is the process of taking software from a development or staging environment and making it available for use in a production or target environment. Deploys are critical to delivering software and maintaining its functionality over time.

Derived column 🐝

A derived column is a column in a database or dataset that is created by applying a transformation or calculation to one or more existing columns. Using a derived column lets you calculate new values from your existing data without having to modify that data or how it’s collected.

Most derived columns live in a dataset’s settings, which means they have to be created and defined before you can use them in a query. You can also use the Query Builder’s syntax to create a derived column on the fly.
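The idea behind a derived column can be sketched in plain Python: a new value is computed from existing fields at read time, and the stored events are left untouched. The field names and the 1000 ms threshold are hypothetical, and this is not Honeycomb's derived column syntax.

```python
# Stored events (hypothetical fields); these are never modified.
events = [
    {"name": "checkout", "duration_ms": 1840},
    {"name": "search", "duration_ms": 95},
]


def is_slow(event, threshold_ms=1000):
    """Derived value: True when the event took longer than the threshold."""
    return event["duration_ms"] > threshold_ms


# Build a read-time view that includes the derived field.
derived = [{**event, "is_slow": is_slow(event)} for event in events]
print(derived)
```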

DevOps

DevOps, short for “Development” and “Operations,” is a set of practices and principles aimed at improving collaboration and communication between software development teams (Dev) and IT operations teams (Ops). The primary goal of DevOps is to streamline and automate software delivery and infrastructure management processes, resulting in faster, more reliable, and more efficient software development and deployment.

Dimensions (in the sense of data dimensions/dimensionality)

Data dimensions refer to different aspects or characteristics of data used to categorize, analyze, or describe the data. Data dimensions also organize and structure data for meaningful insights and decision-making. Examples of low-dimensionality data include metrics and flat logs. High-dimensionality data includes anything that can support many fields or attributes, including JSON blobs, structs, and objects. Observability requires data with high dimensionality.

Distributed Tracing/Trace

Distributed tracing is a technique used to gain insights into the performance and behavior of distributed and microservices-based applications. An individual trace represents a chronological record of the events and processing steps that occur with a single end-to-end transaction or request as it moves through various components, services, and nodes of a distributed system. Distributed tracing helps developers and operations teams diagnose and troubleshoot issues, optimize performance, and understand the interactions between different parts of an application.

Dogfood

“Eating your own dogfood,” or simply “dogfooding,” refers to the practice of using your own product or software internally within the company. It means that employees, including the development team and other staff, actively use the software or product they are building as if they were its users. Honeycomb’s engineering team uses Honeycomb’s tools to diagnose issues with our system, just like Honeycomb’s customers use our tools to diagnose problems with their systems.

Emergent failure modes

Emergent failure modes refer to unexpected or unanticipated ways in which a system or software application can fail when operating in a real-world environment. They can have a significant impact on the reliability, availability, and performance of a system. These failure modes are often challenging to predict during the design and testing phases because they arise from complex interactions, dependencies, or conditions that only manifest under specific circumstances. They are also difficult to debug, which is where observability comes in.

Events/structured events

A structured event refers to a specific type of occurrence or data record organized and formatted in a predefined and consistent manner. Structured events are designed to capture and convey information about a particular occurrence or action in a way that makes it easy to parse, analyze, and process automatically. These events often follow a well-defined schema or format, which typically includes specific fields or attributes with standardized meanings. At a minimum, events usually have a timestamp and a name field. Wide events refer to structured events with many fields. Our website defines an event as “something happening worth tracking” and we offer an Events API.
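A minimal sketch of a wide, structured event, assuming a JSON encoding and hypothetical field names. It builds the record only and does not call the Events API.

```python
import json
from datetime import datetime, timezone


def build_event(name, **fields):
    """Return a structured event with a timestamp, a name, and any extra fields."""
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "name": name,
    }
    event.update(fields)
    return event


# Every extra field adds context that can later be filtered or grouped on.
event = build_event(
    "http_request",
    service="checkout",
    status_code=200,
    duration_ms=42.7,
    user_id="u-1234",
    build_id="example-build-1",
)
line = json.dumps(event, sort_keys=True)
print(line)
```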

Feature flag

A feature flag wraps certain parts of your code in conditional statements that you can turn on and off. Feature flags allow a team to deliver different functionality to different users without maintaining feature branches and running different binary artifacts. Feature flags are also sometimes called feature toggles, release toggles, feature switches, feature gates, or conditional features. In agile environments, you can use toggles during runtime to enable or disable a given feature on demand, for some or all users.
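A minimal sketch of a feature flag with a percentage rollout. The flag store, the flag name, and the hashing scheme are assumptions for illustration; real feature flag services expose their own APIs.

```python
import hashlib

# Hypothetical in-memory flag store; a real system would load this at runtime.
FLAGS = {"new_checkout": {"enabled": True, "rollout_percent": 50}}


def is_enabled(flag_name, user_id, flags=FLAGS):
    """Return True if the flag is on for this user."""
    flag = flags.get(flag_name)
    if flag is None or not flag["enabled"]:
        return False
    # Hash the flag and user together so each user gets a stable bucket 0-99.
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < flag["rollout_percent"]


def checkout(user_id):
    # The conditional is the flag: both code paths ship in the same build.
    if is_enabled("new_checkout", user_id):
        return "new checkout flow"
    return "old checkout flow"


print(checkout("u-1234"))
```

Hashing the user ID keeps the decision stable across requests, so a given user does not flip between the two flows.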

Frontend/backend (or front end/back end)

The frontend, also known as the client-side, is the user interface or the visible part of software that users see and interact with in their web browsers or apps. The backend, also known as the server-side, is the part of software that operates behind the scenes and handles data storage, processing, and management. In our case, the frontend refers to the code that defines the Honeycomb UI that you see in the browser and the backend refers to the code that manages what happens behind the scenes (e.g., managing server-side logic, communicating with external services, fetching and processing data, and so on).

Granularity

In a Honeycomb query, granularity is an integer describing the time resolution of the query’s graph, in seconds. Valid values range from the query’s time range divided by 1000 (the finest resolution) to the time range divided by 10 (the coarsest).

More broadly, granularity refers to the level of detail or precision with which data or processes are divided, measured, or represented. Granularity essentially determines how finely or coarsely something is broken down or represented.
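The valid range for a query's granularity can be worked out from its time range. The sketch below assumes whole-second values with a floor of one second; that rounding behavior is an assumption for the example, not something the definition above specifies.

```python
import math


def granularity_bounds(time_range_seconds):
    """Return (finest, coarsest) granularity in seconds for a query time range."""
    # Finest resolution: time range / 1000, rounded up to a whole second.
    finest = max(1, math.ceil(time_range_seconds / 1000))
    # Coarsest resolution: time range / 10.
    coarsest = max(finest, time_range_seconds // 10)
    return finest, coarsest


# A two-hour query (7200 seconds).
print(granularity_bounds(7200))
```

For a two-hour query this gives buckets between 8 seconds and 720 seconds wide, which is between roughly 900 and 10 points on the graph.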

Hash

A hash function is an algorithm that takes an input of any size and returns a fixed-size value, typically written as a hexadecimal string. This output is often called a “hash value” or simply a “hash.” The primary purpose of a hash function is to transform data into a fixed-size representation that appears random and is difficult to reverse-engineer.
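The fixed-size property is easy to see with Python's standard `hashlib` module. SHA-256 is used here only as one example of a hash function; the helper name is an assumption.

```python
import hashlib


def sha256_hex(text):
    """Return the SHA-256 hash of a string as 64 hexadecimal characters."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


short_hash = sha256_hex("a")
long_hash = sha256_hex("observability" * 1000)

# Inputs of very different sizes produce outputs of the same length.
print(len(short_hash), len(long_hash))
print(short_hash)
```

The same input always produces the same hash, while a small change to the input produces a completely different one.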

Heatmap

A “heatmap” is a graphical representation of data in which values are represented as colors. Heatmaps are used to visualize and analyze data that has two dimensions, typically displayed as rows and columns. The colors in a heatmap convey information about the magnitude, intensity, or density of values within the matrix, making it easier to identify patterns, trends, and outliers in data.

In Honeycomb, heatmaps are one way to visualize your query result, and can lead to deeper analysis using tools like BubbleUp. They work best when you have a lot of events to visualize, and where the spread of values is wide enough to see some differentiation, but not complete noise. For more information, read our guide to getting the most out of heatmaps.

History (see Query History) 🐝

Honeycomb saves your query history, much like your web browser saves your browsing history. The Query History page lets you view and search through a timeline of your entire team’s activity across all datasets.

You can use the Query History search functionality to recall the queries you constructed in the past, view recent boards that were accessed, see how your teammates solve problems, or replay the debugging steps of an incident.

Home 🐝

Honeycomb’s Home area provides a snapshot of the active dataset. Home displays visualizations of some commonly-used queries and breakdowns, and an overview of the user’s most recent traces and events. Both can be used as a jumping-off point to explore data. Customers can use Home to become familiar with key system metrics and to check on system health.

Incident

An incident refers to an unplanned event or occurrence that disrupts the normal operation of a system or service. Incidents can encompass a wide range of issues, including errors, failures, outages, security breaches, performance degradations, and any other unexpected events that impact the availability, reliability, or security of a software application, infrastructure, or service.

Managing and resolving incidents effectively is critical to maintaining the stability and performance of systems and ensuring a positive user experience. A well-defined incident response process helps organizations minimize downtime, reduce customer impact, and continuously improve their systems to prevent future incidents. We have a strategy for managing incident response.

Infrastructure

Infrastructure refers to the underlying foundation of hardware, software, and resources that supports the operation, deployment, and scalability of software applications and services. Infrastructure encompasses a wide range of components and technologies that provide the necessary computing, networking, storage, and other resources required to run and maintain software systems. An infrastructure is the backbone on which applications and services are built and delivered.

Infrastructure stretches from a developer’s IDE through source code management, issue tracking, code quality, dependency management, APM, testing, CI/CD, cloud services, containers, and so on. The operations or platform team is generally responsible for maintaining a company’s infrastructure; on many teams it can be a primary responsibility of site reliability engineers. We believe it’s critical to have complete observability into one’s infrastructure.

Instrumentation

Instrumentation refers to the practice of adding code to a software application to collect data and generate insights into its behavior, performance, and usage. Instrumentation is implemented by inserting code into the application at strategic points, such as before and after function calls, at key events, or within specific code blocks. Developers use libraries, frameworks, or custom code to instrument their applications. Honeycomb’s customers are encouraged to instrument their code to produce wide, structured events that will enhance their search and analysis capabilities.

Instrumentation typically starts with installing auto-instrumentation packages specific to your language and framework. These take care of automatically capturing many general attributes about each event in your service. As your observability practice matures, auto-instrumentation is supplemented by rich custom instrumentation (also known as manual instrumentation), where you add lines of code to capture additional context that is specifically meaningful to your organization.
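As a rough sketch of custom instrumentation, the decorator below wraps a function and records one structured event per call. It uses only the standard library; the `EMITTED` list stands in for a real exporter, and none of these names come from OpenTelemetry or a Honeycomb SDK.

```python
import functools
import time

EMITTED = []  # Stand-in for a real telemetry exporter (hypothetical).


def instrumented(func):
    """Record the name, duration, and error type of each call as an event."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        event = {"name": func.__name__, "error": None}
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            event["error"] = type(exc).__name__
            raise
        finally:
            # Runs on success and on failure, so every call emits an event.
            event["duration_ms"] = (time.perf_counter() - start) * 1000
            EMITTED.append(event)

    return wrapper


@instrumented
def lookup_user(user_id):
    if not user_id:
        raise ValueError("user_id is required")
    return {"id": user_id}


lookup_user("u-1234")
print(EMITTED[-1])
```

Auto-instrumentation packages do something similar on your behalf for common frameworks; custom instrumentation is where you would add the fields that matter to your own business.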

Lambda

AWS Lambda is a serverless computing service provided by Amazon Web Services (AWS). Lambda lets developers run code in response to specific events or triggers without the need to provision or manage servers. Lambda is a key component of serverless computing, a cloud computing model that abstracts infrastructure management and focuses on executing code in a highly scalable and cost-effective manner. It can be difficult to have observability into Lambda, which is why we created an AWS-managed OpenTelemetry Lambda layer.

Logs

Logs (or log files) refer to the recorded messages or entries generated by a computer program or system during its operation. These log messages capture information about the program’s execution, performance, errors, warnings, and other relevant events. Logs can include information on who has accessed an application or provide a time-stamped view of what happened in an application. Because logs provide insights into what a piece of software is doing and how it’s behaving in real-time or over a period of time, logging is a crucial practice for monitoring, debugging, and troubleshooting incidents.

Metrics

Metrics are typically considered one of the “three pillars of observability” (see also logs and traces) and can be used to track the health or reliability of a system. There are two main types of metrics: infrastructure metrics, which are traditionally represented through CPU, memory, and disk utilization, and application metrics, which track higher-order measurements like requests per second, connections to a load balancer, and more. Learn more about how we use metrics.

Monitoring

Monitoring is the practice of continuously observing, collecting, and analyzing metrics from various components of a system, application, or infrastructure to assess its performance, health, and behavior in real-time or over time. Monitoring alerts users when issues occur. Because monitoring relies on metrics, the findings are limited, at least when compared with observability. Monitoring can tell you that an issue occurred while observability can tell you why an issue occurred.

Observability

Observability (sometimes referred to as o11y) is the concept of gaining an understanding of the behavior and performance of applications and systems. This is accomplished by collecting and analyzing telemetry data, such as traces, logs, and metrics. Observability is used to diagnose issues, gain an overview of the interconnectivity of dependencies, and monitor reliability. Observability has become an increasingly critical element in software development as systems and applications become more complex.

OpenTelemetry (OTel)

OpenTelemetry is an open-source framework that provides a standardized method for collecting, processing, and delivering telemetry data from distributed systems. The project lives under the Cloud Native Computing Foundation (CNCF) and is curated by a community of contributors. Honeycomb is a major contributor to OTel.

OpenTelemetry Protocol (OTLP)

CNCF’s OpenTelemetry Protocol represents the technical specs teams use to set up distributed tracing in their applications. OTLP explains how to encode, transport, and deliver telemetry data between various endpoints, including sources, backends, and intermediate nodes.

PagerDuty/on call

PagerDuty is a cloud-based incident management platform and alerting service that helps organizations monitor the health and availability of their systems, quickly respond to incidents, and coordinate responses among team members. PagerDuty is commonly used by DevOps, SRE, and IT support teams to ensure the availability and reliability of services and applications.

PagerDuty is best known for its on-call scheduling and alerting capabilities. Being “on call” refers to taking responsibility for responding to and troubleshooting critical issues after hours. Honeycomb offers a PagerDuty integration.

Pull request (PR)

A pull request is a fundamental concept in collaborative and version-controlled environments like Git, GitHub, GitLab, and BitBucket. A pull request represents a mechanism for proposing, reviewing, and incorporating changes or additions to a codebase. PRs are critical components of the code review and collaboration process, ensuring that code changes meet quality and standards before being merged into the main branch of a repository.

Query 🐝

A query refers to a request or command that is used to retrieve, manipulate, or manage data stored in a database or information system. Queries are written in a specific query language, such as SQL for relational databases or other query languages designed for different types of databases or data sources. Honeycomb’s queries are particularly powerful as they give users a wide range of flexible parameters for answering highly complex questions.

Query Builder 🐝

Honeycomb’s Query Builder lets users build complex queries against their data without having to manually input SQL. Users can select and apply filters, aggregate data, and retrieve visualizations.

Query Template Link 🐝

A Query Template Link is a shareable link that users can define with a query parameter. Opening the link executes the specified query and displays the query’s results.

Raw data

Raw data, sometimes called source data or primary data, has not undergone any transformation, organization, or analysis, and is typically in its original form. Honeycomb lets you easily view your raw data.

Refinery 🐝

Honeycomb’s Refinery provides real-time data processing for high-cardinality events. Refinery examines whole traces and intelligently applies sampling decisions to each trace, so you can keep everything that’s important for debugging and sample from the rest.

Reliability

Reliability refers to the ability of a system, application, or service to consistently perform its intended functions under specific conditions for a specified period. Along with availability and usability, reliability is one of the key attributes of quality in software and services and is a fundamental attribute that users and organizations expect from their tech. Achieving reliability often requires careful design, engineering, and ongoing monitoring and maintenance.

Resilience

Resilience refers to the ability of a system or application to gracefully and effectively handle unexpected failures, disruptions, errors, or adverse conditions while maintaining core functionality and minimizing the impact on users and operations. Resilience is a key aspect of ensuring that software systems can continue to operate under adverse circumstances, recover quickly from failures, and maintain a high level of availability and performance. Observability helps developers build with resilience in mind.

Repo/repository

A repository (often abbreviated as “repo”) is a centralized location or storage space where version-controlled source code, project files, and related assets are stored and managed. Repositories play a crucial role in collaborative software development by enabling multiple developers to work on the same project, track changes, and maintain a history of code revisions. Honeycomb offers some of its repositories on GitHub.

Software-as-a-Service (SaaS)

Software-as-a-Service is a cloud computing model in which software applications are provided and hosted by a third-party service provider, typically as a subscription service. Instead of users purchasing and installing software on their individual devices or on-premises servers, SaaS allows them to access and use the software via a web browser or an application. SaaS providers handle the maintenance and upgrades of their applications, shipping updates to their customers as they are released.

Sandbox

A sandbox refers to a controlled and isolated environment where software, code, or processes can be executed and tested without affecting the host system or other parts of the system. The primary purpose of a sandbox is to provide a safe and secure space for experimentation, testing, or running untrusted code to prevent potential harm to the system or data. Our Sandbox lets developers test drive our observability tool without impacting their existing infrastructure.

Sampling

Sampling is the process of selecting a subset or representative group of data points or items from a larger dataset for the purpose of analysis, testing, or inspection. Sampling is a common technique used when it is impractical or resource-intensive to work with an entire dataset, especially when dealing with large volumes of data.

In observability and performance monitoring, logs, traces, and events can be sampled to summarize information about a service or activity. When sampling your information, there is a constant tradeoff between granularity, system-representative accuracy, cost, performance, and relevancy.
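One common way to sample traces is to hash the trace ID, so that every span in a trace receives the same keep-or-drop decision. The sketch below is a generic illustration of that technique, not a description of how any particular tool samples; the function names and the rate of 10 are assumptions.

```python
import hashlib


def keep_trace(trace_id, sample_rate):
    """Keep roughly 1 in sample_rate traces, decided by hashing the trace ID."""
    if sample_rate <= 1:
        return True
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % sample_rate == 0


def estimated_total(kept_count, sample_rate):
    """Each kept trace stands in for sample_rate original traces."""
    return kept_count * sample_rate


trace_ids = [f"trace-{n}" for n in range(10_000)]
kept = [t for t in trace_ids if keep_trace(t, sample_rate=10)]

print(len(kept), estimated_total(len(kept), 10))
```

Storing the sample rate alongside each kept trace lets counts be scaled back up at query time, which is the accuracy side of the tradeoff described above.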

Serverless

Serverless refers to a cloud computing model in which developers can build and run applications without having to manage traditional server infrastructure. In a serverless architecture, cloud providers abstract away the server management tasks, letting developers focus solely on writing code for their applications. This model is designed to simplify application deployment and scaling, reduce operational overhead, and enable developers to pay only for the computing resources they consume. Serverless apps are at the heart of cloud-native development and benefit from observability.

Service Level Agreement (SLA)

A Service Level Agreement is a formal and legally binding contract or agreement between a service provider and a customer or client. SLAs define the level of service and performance standards that the service provider is expected to meet. SLAs serve as a means of establishing clear expectations, responsibilities, and consequences for service quality and delivery. Most IT contracts include an SLA for the protection of both the vendor and the customer.

Service Level Indicator (SLI)

A Service Level Indicator is a metric or measurement used to assess the performance or quality of a specific aspect of a service or system. SLIs are an essential component of SLAs and SLOs, helping to define and track the reliability, availability, and performance of a service. SLIs are typically expressed as a numerical value, percentage, or ratio and are used to monitor, measure, and ensure that a service is meeting its performance targets. Common SLIs include error rates, latency, and availability. If an SLI alert is triggered, a team will know the SLA or SLO could be at risk.

Service Level Objective (SLO)

A Service Level Objective is a specific and quantifiable target or goal that defines the level of performance, reliability, or quality that a service or system should achieve. SLOs are a critical component of SLAs and are used to establish clear, measurable expectations for service providers and stakeholders. They help ensure that the service meets the needs of its users and aligns with business or operational requirements. The best SLOs set a minimum standard for performance, are chosen thoughtfully with business objectives in mind, and are only focused on metrics that can be measured.
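The relationship between an SLI and an SLO can be sketched with simple arithmetic: the SLI is a measured ratio, and the SLO is the target that ratio is compared against. The request counts and the 99.9% target below are hypothetical.

```python
def availability_sli(good_events, total_events):
    """SLI: the fraction of events that met the success criteria."""
    if total_events == 0:
        return 1.0  # No traffic means nothing failed.
    return good_events / total_events


def meets_slo(good_events, total_events, slo_target):
    """Compare the measured SLI against the SLO target."""
    return availability_sli(good_events, total_events) >= slo_target


# Hypothetical month: 1,000,000 requests, 500 of them failed or were too slow.
sli = availability_sli(999_500, 1_000_000)
print(sli, meets_slo(999_500, 1_000_000, slo_target=0.999))
```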

Span

A key part of distributed tracing, spans are a smaller unit of work within a trace. They represent a specific operation or processing step in a service. Spans are linked together to form a trace, creating a hierarchical view of how requests flow through the system. A trace can, for example, represent a user request and contain spans that show all its time-stamped pieces.
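The hierarchy described above can be sketched by linking each span to its parent and walking the result as a tree. The `Span` fields here are a simplified assumption; real tracing formats carry more attributes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span of the trace.
    name: str
    start_ms: int
    duration_ms: int


def render_trace(spans):
    """Return indented lines showing the parent/child structure of a trace."""
    children = {}
    for span in sorted(spans, key=lambda s: s.start_ms):
        children.setdefault(span.parent_id, []).append(span)

    lines = []

    def walk(parent_id, depth):
        for span in children.get(parent_id, []):
            lines.append(f"{'  ' * depth}{span.name} ({span.duration_ms} ms)")
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines


# Spans can arrive in any order; the parent IDs reconstruct the tree.
trace = [
    Span("c", "a", "query database", 30, 40),
    Span("a", None, "GET /cart", 0, 120),
    Span("b", "a", "check auth", 5, 20),
]
print("\n".join(render_trace(trace)))
```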

Structured Query Language (SQL) and MySQL

Structured Query Language is a programming language used for creating and manipulating queryable, relational databases. SQL provides a standardized way for developers and database administrators to interact with databases, define data structures, perform data queries, update records, and perform various database operations. MySQL is an open-source database management system that uses SQL.
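A short example of defining a table, inserting records, and querying them with SQL. It uses Python's built-in `sqlite3` module and an in-memory SQLite database rather than MySQL so that it runs without a server; the table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Define a data structure.
conn.execute(
    "CREATE TABLE requests (path TEXT, status INTEGER, duration_ms REAL)"
)

# Insert records.
conn.executemany(
    "INSERT INTO requests VALUES (?, ?, ?)",
    [("/cart", 200, 40.0), ("/cart", 500, 360.0), ("/search", 200, 20.0)],
)

# Query: request count and average duration per path.
rows = conn.execute(
    "SELECT path, COUNT(*) AS total, AVG(duration_ms) AS avg_ms "
    "FROM requests GROUP BY path ORDER BY path"
).fetchall()
conn.close()

print(rows)
```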

Site Reliability Engineering (SRE)

Site Reliability Engineering focuses on the availability, reliability, performance, and scalability of systems and applications. Site reliability engineers typically sit on the software engineering team and bridge the gap between traditional software development and IT operations by applying software engineering principles to the design, deployment, and management of systems. SREs work closely with developers, operations teams, and platform engineers.

Telemetry

Telemetry is the collection, analysis, and monitoring of data and metrics related to the performance, availability, and reliability of a software system or online service. Telemetry is an essential element in observability applications as it is used to gain insights into the health and behavior of a system, identify issues, and make data-driven decisions to improve reliability.

Time-series data

Time-series data, a foundational element of observability, refers to a collection of timestamped data points that record changes or measurements of various system metrics and performance indicators over time. This timestamped data is crucial for SREs and operations teams to monitor, analyze, and manage the reliability, availability, and performance of software systems and services. When time-series data is measured at regular intervals, we call that “metrics.” When time-series data is measured at irregular intervals, we call that “events.”
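The distinction between events and metrics can be illustrated by rolling irregular events up into regular intervals. The timestamps (seconds since the start of the window) and event names below are made up for the example.

```python
from collections import Counter

# Irregularly timed events: (timestamp in seconds, name).
events = [(3, "login"), (17, "search"), (18, "search"), (61, "checkout"), (125, "login")]


def requests_per_interval(timestamped_events, interval_seconds):
    """Roll events up into a count per fixed-width interval."""
    counts = Counter(
        ts // interval_seconds * interval_seconds for ts, _ in timestamped_events
    )
    if not counts:
        return []
    start, end = min(counts), max(counts)
    # Emit every interval, including empty ones, so the series is regular.
    return [
        (bucket, counts.get(bucket, 0))
        for bucket in range(start, end + interval_seconds, interval_seconds)
    ]


metric = requests_per_interval(events, interval_seconds=60)
print(metric)
```

The rollup is cheap to store and chart, but the per-event detail (which user, which request) is gone once only the counts are kept.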

Visualization

Visualization refers to the graphical representation of data or information and involves creating visual representations, such as charts, graphs, diagrams, maps, or dashboards, to make complex data more understandable, accessible, and interpretable. Visualization is a powerful tool for gaining insights from data, presenting information, and conveying patterns or trends. Check out all of our visualization features.

Learn more about observability

Now that you have a better understanding of observability terminology, see how it can help you.

Additional resources

Getting Started

What Is Observability Engineering?

Case Study

Reducing Mean Time to Diagnosis: How Salary Finance Uses Honeycomb

Video

Intro to o11y Topic 1: What Is Observability?
