Understanding High Cardinality and Its Role in Observability

By Rox Williams | Last modified on 2023.03.07

Complex distributed systems generate vast amounts of granular data that is ideal for diagnostics.

Most of the time…

But all that granular data also creates a needle-in-a-haystack problem. Without the proper tools in place to find, dissect, and understand which data matters most, it only leads to headaches and dead ends.

In observability, that valuable granular data is described as having high cardinality and high dimensionality, meaning it is chock full of unique values waiting to be analyzed and queried. Easier said than done: high-cardinality data can be too complex for many database backends to analyze quickly. The effort is worth it, though, because high-cardinality data is the most effective data to analyze when resolving an incident. To achieve true observability, leveraging high-cardinality data is a must.

That said, high cardinality and its cousin, high dimensionality, require the right tooling (as well as the correct team framework and mindset) to get the most out of them. Here’s what to understand about high cardinality, why it matters, and best practices for setting up datasets and high-cardinality analysis.

What is high cardinality?

Cardinality is simply a reflection of an attribute’s uniqueness: a high-cardinality field (or database attribute) contains many unique values, while a low-cardinality field has only a few. A user ID is a high-cardinality attribute, while a coarse geo-location field such as country is low cardinality. Other common high-cardinality examples include Social Security or passport numbers and email addresses.

High-cardinality data can make debugging really easy if you know how to use it, but it also suffers from a “too much information” problem. Many database systems can’t efficiently handle the volume of queries required for analysis. But if it’s possible to process high-cardinality data without bloat, teams will be able to quickly spot anomalies, understand what they are, how they’re happening, and who they’re impacting—leading to faster incident resolution.

Low cardinality vs. high cardinality

In the end, the difference between low- and high-cardinality data comes down to broad versus narrow. Low-cardinality data can help teams examine broad patterns in a service, perhaps by looking at geography, gender, or even cloud providers. Low cardinality is straightforward and easy to work with, but it’s not very useful when trying to resolve an incident, largely because its possible values are so limited.

High-cardinality data, on the other hand, is a magnifying glass into a service’s problems, making it possible to look at outlying events that can help guide troubleshooting efforts. It’s the most valuable information an incident response team can access, provided they have a database solution that can support the hefty storage requirements and the need for advanced queries.

To achieve true observability, teams must have a way to quickly leverage the hidden knowledge contained in high-cardinality data.

What is dimensionality, and why does it go hand-in-hand with cardinality?

While high cardinality refers to a large number of unique values within a given attribute, dimensionality takes it further: how many attributes does an event have? It’s not unheard of for event datasets to have anywhere from tens to thousands of dimensions, and that’s not surprising, as dimensions tend to record important business context like the “timeframe,” “product,” “version,” or “customer” associated with each event. Being able to analyze trends across many dimensions quickly is important when you’re trying to figure out what changed but don’t know what to ask. When you’re trying to solve unknown-unknowns, high-dimensionality data is your friend.
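To make that concrete, here’s a minimal sketch of what a single “wide” event might look like. Every field name below is hypothetical, but it shows how one request can carry many dimensions, some high cardinality and some low, each of which can later be filtered or grouped on.

```python
# One structured event describing one request. Each key is a dimension;
# real events often carry dozens or hundreds of them.
event = {
    "timestamp": "2023-03-07T14:02:11Z",
    "service.name": "checkout",
    "service.version": "1.42.0",
    "http.method": "POST",
    "http.route": "/cart/checkout",
    "http.status_code": 500,
    "duration_ms": 1874,
    "customer.id": "cus_81934",        # high cardinality
    "customer.plan": "enterprise",     # low cardinality
    "cart.item_count": 7,
    "feature_flag.new_pricing": True,
    "error.message": "timeout calling payment gateway",
}
```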

Why is understanding high-cardinality data important?

When an incident happens, the speed of finding and fixing it is absolutely everything. Teams must have deep visibility and context into what’s happening so they can follow leads and know the right questions to ask. Requests generate a tremendous amount of related (and often very high-cardinality) data, and having access to that context is a game-changer for teams trying to get a service back on track.

High-cardinality data can showcase the “where?” and the “why?” of a problem, and nothing is more effective at surfacing anomalies. For example, maybe it’s just three users blowing through your error budget because of one particularly odd request call out of hundreds in your app. That’s why it’s crucial to have the right tools in place for fast analysis and visualization, so you can quickly identify outlier behavior that needs investigation.
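As a rough sketch of that kind of analysis (the field names and threshold are assumptions, and a real observability backend would run this grouping for you at query time), counting errors per high-cardinality user ID makes a handful of misbehaving users jump out immediately:

```python
from collections import Counter

def errors_by_user(events):
    """Group server errors by a high-cardinality field to surface outliers."""
    return Counter(
        e["customer.id"]
        for e in events
        if e.get("http.status_code", 200) >= 500
    )

# Usage: if the top entries show three IDs responsible for nearly all of
# the 500s, you've gone from "errors are up" to "who and why" very quickly.
# print(errors_by_user(events).most_common(5))
```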

Modern DevOps teams striving to be elite performers, as defined by the DORA metrics, understand that true observability will enable a faster MTTR (mean time to resolution), while at the same time safeguarding business reputation and the company’s bottom line. High-cardinality data is at the heart of observability, and when combined with a strategy for distributed tracing and service level objectives, an organization has a potent antidote against outages.

High cardinality example

What does high cardinality look like in the real world? It’s any field that, when queried, comes back with numerous unique values. In a software application, “userID” or “requestID” could return a million distinct answers, compared to “state,” which would have only 50 options.
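As a minimal sketch (assuming events is a list of dicts like the wide event shown earlier), cardinality is just the count of distinct values a field takes:

```python
def field_cardinality(events, field):
    """Return the number of distinct values a field takes across events."""
    return len({e[field] for e in events if field in e})

# Hypothetical results for a large dataset of request events:
#   field_cardinality(events, "userID")  -> ~1,000,000  (high cardinality)
#   field_cardinality(events, "state")   -> 50 at most  (low cardinality)
```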

How to evaluate tools for analyzing high-cardinality and high-dimensionality data

Not every database is capable of handling the demands of high-cardinality queries and storage. Therefore, it’s imperative to sniff out whether a tool meets the specific technical requirements needed to analyze that data. Here are some questions to ask:

  • Are you able to pinpoint down to the individual level exactly which users/regions/etc. are experiencing issues?
  • What is your current process for tracking and tagging unique attributes? Is there a cost associated?
  • When you query your current system, how many dimensions can you include and group by if you’re unsure which dimensions are important?
  • How fast do query responses come back?
  • Do you allow a user to always have access to the raw event?
  • Will it be possible to add more data to the event as needed? What constraints are there?
  • How does your tool support an iterative investigation? Is there a point where users hit a dead end, or is there always another breadcrumb to follow?
  • Does your service seamlessly work with tens of millions of values?
  • Will the raw events always be available to run new calculations and transformations or do you aggregate/discard certain data?

Best practices when analyzing high-cardinality and high-dimensionality data

We’ve established that all of this is complicated, so finding efficient ways to get the most out of high-cardinality data is critically important. Below are the sociotechnical requirements needed to drive alignment on your team.

  1. Make sure everyone agrees on the naming of events, attributes, etc.
  2. Decide on what should or should not be included in events.
  3. Know who is responsible for “owning” the high-cardinality data analysis, and, conversely, identify those charged with building observability into the product.
  4. The goal is useful high cardinality, so don’t measure something so unique it has no diagnostic value.
  5. Keep tags and source names stable.
  6. Put related events in a single dataset.
  7. Don’t mix environments—prod and dev don’t get along.
  8. Leverage an individual IDE to create a dataset.
  9. Each service should have its own dataset (test, prod, etc.).
  10. Schemas are complex, so use namespaces, ensure consistency, and double-check field data types (see the naming sketch after this list).
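To illustrate points 1 and 10, here’s a minimal sketch of a shared attribute schema. The namespace prefixes and types are assumptions rather than a required convention, but agreeing on something like this keeps field names and data types consistent across services:

```python
# A lightweight, shared schema: namespaced field names mapped to the type
# every service agrees to emit. The names and types here are examples.
EVENT_SCHEMA = {
    "app.user_id": str,
    "app.cart.item_count": int,
    "http.status_code": int,
    "db.query_duration_ms": float,
}

def validate_event(event):
    """Flag fields that drift from the agreed-upon names or types."""
    problems = []
    for field, value in event.items():
        expected = EVENT_SCHEMA.get(field)
        if expected is None:
            problems.append(f"unknown field: {field}")
        elif not isinstance(value, expected):
            problems.append(f"{field} should be {expected.__name__}")
    return problems
```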

Worried about the costs of high-cardinality data?

A common concern we’ve heard over the years is that high-cardinality data would balloon cloud costs. That’s true of other vendors—most aren’t optimized to execute fast queries on trillions of field combinations, which results in cost spirals. 

In contrast, Honeycomb is built specifically for high-cardinality data. It enables engineers to capture unlimited custom attributes for debugging, with no impact on your spend. Honeycomb charges by the number of events, not by how much data each event contains or the way you analyze that data. There’s no penalty for instrumenting rich, high-dimensionality telemetry or analyzing high-cardinality fields.
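As a hedged example of what that instrumentation can look like, here’s a sketch using the OpenTelemetry Python API. The attribute names and the handle_checkout function are hypothetical, and it assumes you’ve already configured an OpenTelemetry exporter that sends data to Honeycomb:

```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def handle_checkout(request, user, cart):
    # Attach rich, high-cardinality context to the span for this request.
    # The attribute names are illustrative; use whatever fits your domain.
    with tracer.start_as_current_span("checkout") as span:
        span.set_attribute("app.user_id", user.id)
        span.set_attribute("app.request_id", request.headers.get("x-request-id", ""))
        span.set_attribute("app.cart.item_count", len(cart.items))
        span.set_attribute("app.cart.total_cents", cart.total_cents)
        # ... business logic here ...
```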

Why high cardinality and high dimensionality are the answer to many questions

Outages and debugging are a fact of life, but it’s how teams respond that matters most. Using a fast observability tool that can handle high-cardinality and high-dimensionality data to find, analyze, and fix a problem is not only the most efficient option, it’s also a key part of achieving true observability of your services.

We know this is a big step for many teams to take, but we can help. After all, we purpose-built our columnar data store just to support high-cardinality data searches.

Take a deeper dive into observability.

Additional resources

  • Video: Fast Debugging and Optimization with Honeycomb
  • Blog: Observability 101: Terminology and Concepts