OpenTelemetry Best Practices #3: Data Prep and Cleansing

Having telemetry is all well and good. Amazing, in fact. It’s easy to do: add some OpenTelemetry auto-instrumentation libraries to your stack and they’ll fill your disks with data pretty quickly. However, good telemetry data, meaning data that’s curated into being useful, is what is both cost-effective and good value.

By: Martin Thwaites

Updated: August 28, 2024

April 18, 2023

Observability is about getting answers about how your production system is functioning by using telemetry data. If that data isn’t in an accurate, curated state, then you’ll struggle to get the answers you need—even if you have a ton of data. Either the data is confusing, or it’s locked away because of security concerns, or there’s just too much data to find the context you need. Because of this, it’s easy to get overwhelmed with bad data and feel like OpenTelemetry isn’t actually useful. Enter data prep and cleansing.

The Transform processor

With this processor, you can drop attributes whose names you wouldn’t want in your observability backend, such as firstname or creditcard. You can also use the Transform processor to search attribute values for terms such as password.

The processor allows you to perform the following actions on your spans:

Create a new attribute by parsing, searching, or combining existing attributes.
- E.g., combine a primary and secondary product category into a single value.
Delete attributes entirely.
- E.g., remove the social security number attribute when it goes to third parties.
Hash attributes to maintain their cardinality, without keeping personally identifiable information (PII).
- E.g., hash an auth token or API key used to access the system.
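
The three actions above can be sketched in a few lines of Python. This is only an illustration of the logic, not the Transform processor’s own configuration or implementation, and every attribute name in it (product.category.primary, user.ssn, auth.token, and so on) is a hypothetical example rather than something from a real system.

```python
import hashlib


def prepare_attributes(attrs: dict) -> dict:
    """Return a cleaned copy of a span's attributes (the input is not modified)."""
    cleaned = dict(attrs)

    # Create: combine the primary and secondary category into a single value.
    primary = cleaned.get("product.category.primary")
    secondary = cleaned.get("product.category.secondary")
    if primary is not None and secondary is not None:
        cleaned["product.category"] = f"{primary}/{secondary}"

    # Delete: remove the attribute entirely if it is present.
    cleaned.pop("user.ssn", None)

    # Hash: distinct tokens stay distinct, but the raw token is not stored.
    if "auth.token" in cleaned:
        raw = str(cleaned["auth.token"]).encode("utf-8")
        cleaned["auth.token"] = hashlib.sha256(raw).hexdigest()

    return cleaned


# Hypothetical span attributes, for illustration only.
span_attributes = {
    "product.category.primary": "shoes",
    "product.category.secondary": "running",
    "user.ssn": "123-45-6789",
    "auth.token": "secret-api-key",
}
cleaned = prepare_attributes(span_attributes)
```

After this runs, cleaned has a combined product.category value, no user.ssn key, and a 64-character hex digest in place of the raw token.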

There are some well-known fields that you should consider filtering, depending on your context.

url.query and url.full: If your query strings are regularly used for searching and could include anything considered PII, think about whether to filter this information out, either globally or for specific URLs. Also consider whether the engineering team should extract the most pertinent information from the URL and add it as attributes in their code, as this provides a better telemetry experience.
network.peer.address and client.address: These fields can sometimes be populated with the IP address of the client accessing your site which, in some regulatory contexts, can be considered PII. You could choose to hash this. However, since the possible values are known, hashing doesn’t give the protection you might expect.
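
As a rough sketch of what filtering url.full could look like, the Python below keeps only query parameters from an allow list and drops everything else, including the fragment. The SAFE_QUERY_KEYS set and the example URL are assumptions made up for illustration. In a real pipeline this logic would live in your Collector configuration or your instrumentation code, not in a standalone script.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of query parameters considered safe to keep.
SAFE_QUERY_KEYS = {"page", "sort"}


def scrub_url(url: str) -> str:
    """Drop every query parameter that is not explicitly allowed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key in SAFE_QUERY_KEYS
    ]
    # The fragment is dropped as well, since it can also carry user input.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


scrubbed = scrub_url("https://shop.example.com/search?q=jane+doe&page=2&sort=price")
```

Here the free-text search term in q is removed while the pagination and sorting parameters survive, so the attribute stays useful for debugging.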

With these processors, you can also enrich your telemetry data with static context, ranging from the cloud region or availability zone to additional information such as which Collector infrastructure processed the request.

Redacting sensitive data

These processors allow you to drop and redact spans that meet certain criteria. The redaction processor can be configured in two important modes. I’ll call one “aggressive” and one “passive.”

In passive mode, you can tell the processor to look for specific patterns within your attributes. This could be looking for a pattern that resembles a card number or a social security number. To do this, we use regex patterns that scan each attribute.

You’ll need to apply your own context here. However, here’s a non-exhaustive list of what you should include:

Social Security Number (region-specific format)
National Insurance Number (UK-specific)
Credit card numbers (note: not all card numbers follow the same format)
Driver’s license numbers (region-specific format)
Phone numbers
Postal codes / zip codes

In aggressive mode, in addition to looking for patterns in attributes, you’ll also provide a list of allowed attribute names. This means that any attributes not in the list will be dropped.
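
The allow-list part of aggressive mode can be pictured as a simple filter on attribute names. The Python below is a sketch of that idea only, with a hypothetical ALLOWED_KEYS set. The pattern scanning that aggressive mode also performs is left out to keep the example short.

```python
# Hypothetical allow list. Anything not named here is dropped.
ALLOWED_KEYS = {"http.request.method", "http.response.status_code", "url.path"}


def apply_allow_list(attrs: dict) -> dict:
    """Keep only attributes whose names are explicitly allowed."""
    return {key: value for key, value in attrs.items() if key in ALLOWED_KEYS}


# Hypothetical attributes, for illustration only.
kept = apply_allow_list({
    "http.request.method": "POST",
    "url.path": "/checkout",
    "customer.email": "jane@example.com",
    "debug.payload": "{...}",
})
```

The filter only looks at names, so the value stored under an allowed name can still contain anything.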

It’s best practice to run, at the very least, passive mode with some patterns that are specific to your region and sector. Aggressive mode, however, is something you would only really apply in very specific, hyper-secure environments. Even there it has limited use: if data extraction is a concern, engineers could still use the allowed attributes to carry the information they want to extract.

Balancing cardinality and PII

While we want to keep PII out of our telemetry backends, it’s often important to know the number of individual users affected so that we can see if there’s a widespread problem, or to see what an individual user might have done over their lifetime.

We can maintain the cardinality (the number of distinct values) of this data by applying a strong hash to the attribute with the Transform processor.
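
A quick Python sketch shows why hashing keeps this property: equal inputs always produce equal digests, so counting distinct hashed values gives the same answer as counting distinct users. SHA-256 is used here as an example of a strong hash, and the user identifiers are made up.

```python
import hashlib


def hash_value(value: str) -> str:
    """Return the SHA-256 hex digest of a string."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


# Hypothetical user identifiers seen across four requests.
user_ids = [
    "alice@example.com",
    "bob@example.com",
    "alice@example.com",
    "carol@example.com",
]
hashed_ids = [hash_value(user_id) for user_id in user_ids]

distinct_users = len(set(user_ids))
distinct_hashes = len(set(hashed_ids))
```

Both counts come out as three, and the first and third hashed values are identical, so you can still follow one user’s activity without storing their email address.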

A word of caution: if the value you’re hashing has a small number of possible values and a predictable pattern, it’s relatively easy to reverse engineer the value. As such, this may not be a viable way to stop it being considered PII.
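
To illustrate that caution, the sketch below recovers a hashed IP address by hashing every candidate and comparing. It assumes, purely for the example, that an attacker knows the client sits in the 203.0.113.0/24 documentation range, which leaves only 256 candidates to try.

```python
import hashlib
import ipaddress


def sha256_hex(value: str) -> str:
    """Return the SHA-256 hex digest of a string."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def reverse_hash(target_digest, candidates):
    """Try each candidate in turn and return the one whose hash matches, if any."""
    for candidate in candidates:
        if sha256_hex(candidate) == target_digest:
            return candidate
    return None


# What ends up stored in the telemetry backend after hashing.
stored_digest = sha256_hex("203.0.113.42")

# Every address in the assumed /24 network is a candidate.
candidates = (str(ip) for ip in ipaddress.ip_network("203.0.113.0/24"))
recovered = reverse_hash(stored_digest, candidates)
```

The strength of the hash function doesn’t help here, because the weakness is the size of the input space, not the algorithm.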

Filtering non-useful spans

On top of redacting that sensitive data, and removing attributes, it’s also good practice to drop spans that aren’t useful. The most common are health check spans as they generally offer little value and can be dropped without affecting visibility into the system.

One thing to be careful of is that this kind of filtering only evaluates a single span at a time. If your health checks produce a full trace structure, you may need to think about sampling (we’ll cover this in a different post).
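
The single-span limitation is easier to see in a small sketch. The Python below models spans with a hypothetical Span dataclass and drops any span whose url.path is /health. Both the class and the matching rule are assumptions for illustration, not how the Collector represents or filters spans.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    name: str
    attributes: dict = field(default_factory=dict)


def is_health_check(span: Span) -> bool:
    """Decide using only this span's own attributes."""
    return span.attributes.get("url.path") == "/health"


def drop_health_checks(spans: list) -> list:
    """Filter span by span, with no knowledge of the rest of the trace."""
    return [span for span in spans if not is_health_check(span)]


# A health check that also runs a database query as a child span.
trace = [
    Span("a1", None, "GET /health", {"url.path": "/health"}),
    Span("b2", "a1", "SELECT 1", {"db.system": "postgresql"}),
]
remaining = drop_health_checks(trace)
```

The root span is dropped, but the child database span survives and now points at a parent that no longer exists. That leftover is why a health check with a full trace structure calls for a trace-level decision rather than a per-span filter.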

More best practices coming soon

Building good observability pipelines (and by good, we mean pipelines that are safe) is part of what makes for a robust strategy. The Collector and its processors are a core part of that and, luckily, they’re easy to configure.

If you missed part one or part two of my best practices series, you can find them here:

OpenTelemetry Best Practices #1: Naming

OpenTelemetry Best Practices #2: Agents, Sidecars, Collectors, Coded Instrumentation
