
An Introduction to Distributed Tracing

 

There’s no strict definition of a distributed system. But generally speaking, if you have reached a point where you’re running more than five interdependent services at once, that means you’re running a distributed system. It also means you are more than likely experiencing difficulties when troubleshooting using traditional debugging tools. Unfortunately, pulling up multiple tools, each built for a monolithic world, doesn’t help pinpoint the problem. Instead, a practical solution would be to use distributed tracing.

For a monolithic application—an app that has all or most of its functions modularized in a single service—tracing a request is relatively straightforward. You could troubleshoot an error by looking through your log messages or using standard application performance monitoring (APM) tools. A distributed architecture, on the other hand, needs a deeper level of telemetry to make sense of interdependent relationships you may not even know are happening.

How does distributed tracing work?

Distributed tracing is a technique used to monitor and observe requests as they flow through your distributed services or microservices-based applications. Distributed tracing provides visibility into each request and all its subsequent parts. As a result, it allows you to understand if your systems are behaving properly in production.

To understand distributed tracing, let’s consider how it works for a single request, such as submitting a form. When a user enters a request to submit a form, the distributed tracing platform assigns a unique trace ID to the request. That trace describes the entire journey of a single request (now with that unique ID) through all of the systems it encounters to complete all of its necessary work.

The distributed tracing platform also generates an initial span, which is also called a parent span. Spans represent a single unit of work, a single function, or an operation performed upon the request, such as a database query. Because there was an operation that generated this user request, the platform creates a parent span that represents that work. As the request moves through your distributed system, the platform will also generate child spans for every new operation needed along the way.

A child span may operate as a parent span to several child spans nested within it. For example, the distributed tracing platform may generate a child span, Span A, when the request enters one of the services in your microservices-based application. Within that service, numerous functions are performed on the request, generating more child spans and making that initial Span A a parent to the new child spans.

[Figure: Service A's span is the root and takes the longest. It has one child.]

Each span includes its unique ID, the ID of the parent span, and the ID of the trace. Other information encoded in a span can include:

  • Service and operation names
  • Duration of the operation on the request
  • Error messages or exit codes
  • Data on the operation executed
  • Any other metadata, such as user ID, that you wish to include
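To make the span structure concrete, here is a minimal sketch in Python using only the standard library. It is not the API of Honeycomb, OpenTelemetry, or any real tracing tool: the `Span` class, the `start_span` and `finish_span` helpers, and the service and operation names are all hypothetical and exist only to show how a trace ID is shared while each span points at its parent.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """One unit of work within a trace (illustrative, not a real tracing API)."""
    trace_id: str               # shared by every span in the same request
    span_id: str                # unique to this span
    parent_id: Optional[str]    # None marks the root (initial parent) span
    service: str
    operation: str
    start: float = 0.0
    duration_ms: float = 0.0
    error: Optional[str] = None
    attributes: dict = field(default_factory=dict)  # e.g. a user ID


def start_span(service, operation, parent=None, **attributes):
    """Start a root span (new trace ID) or a child span (inherits the trace ID)."""
    return Span(
        trace_id=parent.trace_id if parent else uuid.uuid4().hex,
        span_id=uuid.uuid4().hex[:16],
        parent_id=parent.span_id if parent else None,
        service=service,
        operation=operation,
        start=time.monotonic(),
        attributes=dict(attributes),
    )


def finish_span(span, error=None):
    """Record how long the operation took and any error it hit."""
    span.duration_ms = (time.monotonic() - span.start) * 1000.0
    span.error = error
    return span


# A form submission: the root span, a child span, and a span nested inside the child.
root = start_span("frontend", "submit_form", user_id="u-123")
child = start_span("forms-service", "validate", parent=root)
grandchild = start_span("forms-service", "db.query", parent=child)
for span in (grandchild, child, root):
    finish_span(span)
```

Every span carries the same `trace_id`, and following `parent_id` from any span leads back to the root, which is all a tool needs to rebuild the request's journey.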

After that data has been recorded, a distributed tracing tool will also help you visualize your request lifecycle using the data from the trace and its spans. Different visualization schemes may be used, such as the commonly used waterfall view or flamegraph formats. With these graphs, your engineers can see which parts of your distributed systems are slowing down your application or which parts are erroring out. They can also clearly see dependencies between services and the exact route taken through your various systems to process any given request. With that level of clarity, they can spend their time focusing on what matters.
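As a rough illustration of how a waterfall view is derived from span data, the sketch below draws one text line per span. The span dictionaries, their field names (`id`, `parent`, `name`, `start_ms`, `duration_ms`), and the sample timings are assumptions made for this example; real tools use far richer rendering.

```python
def render_waterfall(spans, width=40):
    """Return one text line per span: indented by depth, with a bar scaled to time."""
    by_id = {s["id"]: s for s in spans}
    trace_start = min(s["start_ms"] for s in spans)
    trace_end = max(s["start_ms"] + s["duration_ms"] for s in spans)
    total = max(trace_end - trace_start, 1)

    def depth(span):
        # Count how many parents sit between this span and the root.
        d = 0
        while span["parent"] is not None:
            span = by_id[span["parent"]]
            d += 1
        return d

    lines = []
    for s in sorted(spans, key=lambda s: s["start_ms"]):
        # Integer arithmetic keeps bar positions deterministic.
        offset = (s["start_ms"] - trace_start) * width // total
        length = max(1, s["duration_ms"] * width // total)
        label = "  " * depth(s) + s["name"]
        lines.append(f"{label:<24}|{' ' * offset}{'#' * length}")
    return lines


# Hypothetical timings (in milliseconds) for a three-span trace.
spans = [
    {"id": "a", "parent": None, "name": "service-a /submit", "start_ms": 0, "duration_ms": 100},
    {"id": "b", "parent": "a", "name": "service-b validate", "start_ms": 10, "duration_ms": 30},
    {"id": "c", "parent": "a", "name": "service-c db.query", "start_ms": 45, "duration_ms": 50},
]

for line in render_waterfall(spans):
    print(line)
```

The longest bar belongs to the root span, and the offset of each child bar shows when that operation started relative to the whole request.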

[Figure: graph with spans]

A common problem in distributed systems is that different user requests can take different execution paths. In fact, similar requests with similar execution paths may also encounter different underlying systems or encounter failure and retry scenarios that completely change their performance profiles. Without distributed tracing, engineers are often left to guess why any two given requests performed differently. Or, worse, they need to spend an exorbitant amount of time piecing together various clues to understand what happened. Distributed tracing greatly simplifies the debugging process in distributed systems.

If you can get all this information from a single trace, imagine what you could achieve by observing and comparing tiny differences between multiple traces coming through your microservices-based applications every day. Distributed tracing provides a fundamental type of data that is necessary on the road to achieving observability.

The benefits of distributed tracing

The primary benefit of distributed tracing is that it allows you to see and understand how your distributed services handle a single request. This benefit also gives way to other advantages, such as:

Reduced mean time to resolution (MTTR)

Distributed tracing is a faster way to tell where a problem originates, allowing you to spend valuable time conducting additional forensics on that specific service call. It is also helpful in spotting unusual patterns or inefficiencies, such as a call to a database made multiple times, which may have occurred erroneously during a recent code ship.
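The repeated database call mentioned above is the kind of pattern that is easy to spot once span data is available. The following sketch counts identical operations under the same parent span; the span dictionaries, the `repeated_operations` helper, and the threshold of three are hypothetical choices for illustration.

```python
from collections import Counter


def repeated_operations(spans, threshold=3):
    """Return {(parent_id, operation): count} for operations repeated under one parent."""
    counts = Counter((s["parent"], s["name"]) for s in spans)
    return {key: n for key, n in counts.items() if n >= threshold}


# Hypothetical trace: one handler span that issues the same query five times.
trace = [{"id": "root", "parent": None, "name": "GET /orders"}]
trace += [
    {"id": f"q{i}", "parent": "root", "name": "db.query SELECT user"}
    for i in range(5)
]
trace.append({"id": "r1", "parent": "root", "name": "render"})

suspicious = repeated_operations(trace)
```

Here `suspicious` maps the parent span `root` and the operation `db.query SELECT user` to a count of 5, pointing at a loop that probably should have been a single batched query.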

In larger organizations, it’s not unusual to have different development teams responsible for owning different services that may be involved in fulfilling any one user request. With distributed tracing, everyone responding to an incident can visualize what is occurring when alerts are sent to those on call. When on-call engineers are woken up in the middle of the night, a trace link can tell them if it’s a service they are responsible for or if the problem is happening with a third-party service, which helps speed up time to resolution.

Increased application performance

With distributed tracing data, it’s possible to see how your services respond to one another and compare performant traces with anomalies. Most single-purpose distributed tracing tools don’t give you the analytical tools necessary to make those comparisons, but they’re available with Honeycomb. You can also measure how long specific users take to complete actions, such as signing up or purchasing a product, and see how their experience differed from that of other users in the system. With these insights, you can prioritize improvements where they are needed and create a satisfying experience, either for specific strategic customers or for every customer.
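To give a sense of what comparing a performant trace with an anomalous one involves, here is a small sketch that lines up per-operation durations from two traces. It does not represent how Honeycomb performs this analysis; the `slowest_regressions` helper, the operation names, and the durations are all made up for the example.

```python
def slowest_regressions(baseline, anomaly):
    """Return (operation, extra_ms) pairs, largest slowdown first."""
    deltas = []
    for operation, ms in anomaly.items():
        base = baseline.get(operation)
        if base is not None:  # only compare operations present in both traces
            deltas.append((operation, ms - base))
    return sorted(deltas, key=lambda d: d[1], reverse=True)


# Hypothetical per-operation durations (ms) for a normal and a slow checkout.
baseline = {"auth": 12, "cart.load": 30, "payment.charge": 80}
anomaly = {"auth": 13, "cart.load": 31, "payment.charge": 950}

ranked = slowest_regressions(baseline, anomaly)
```

The first entry in `ranked` is `("payment.charge", 870)`, which tells you where the slow request spent its extra time.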

Flexible integration

Your engineers can integrate distributed tracing tools with most microservices systems—they can work with a wide range of programming languages and applications. OpenTelemetry makes it easy to get started with instrumenting traces and allows you to send your trace data to multiple backends, including Honeycomb. Honeycomb is a contributor to the OpenTelemetry project and fully supports ingesting trace data using the OpenTelemetry protocol (OTLP).

Improved collaboration

Collecting distributed tracing data is a first step. But most tracing tools simply stop there. Honeycomb is built to also boost collaboration among your teams by eliminating the need to debug errors within a single team. Instead, every team responsible for each of your services can trace and troubleshoot an error and identify the team responsible for fixing it. For example, to ensure everyone is looking at the same data, teams can share permanent links to bring up the same query results. They can also automatically archive interesting trace data and share those investigations with other Honeycomb team members.

Distributed tracing with Honeycomb helps reduce developer frustration because it helps locate bugs faster, and devs become more confident while troubleshooting and resolving errors. They are also able to build more stable applications with a newfound understanding of how your systems work.

The challenges of distributed tracing

While distributed tracing offers several benefits, there are still challenges with its implementation. These difficulties can include:

Manual implementation

Some distributed tracing tools require manual instrumentation or using proprietary instrumentation libraries. As a result, teams spend precious time building instrumentation from the ground up or creating instrumentation that locks them into using one particular tracing solution. Both of those scenarios waste time that could have been used for more productive tasks.

When getting started with instrumenting traces, we recommend using the OpenTelemetry SDKs, which support most common application languages. OpenTelemetry is quickly becoming a widely adopted de facto standard for emitting trace data (and other useful application telemetry). If your language is not currently supported by OpenTelemetry, other instrumentation options may be available, such as Honeycomb’s libhoney, until support is added to the open-source OpenTelemetry project.

Limited front-end implementation

Distributed tracing is traditionally only helpful for server-side development. An emerging practice is adapting distributed tracing to also capture client-side information. However, that is not yet a well-established practice. Generally speaking, most distributed tracing tools do not let you see request information on the client side, so it becomes difficult for front-end developers to trace an error using the same tools.

This is where the collaboration benefits of Honeycomb come in: the front-end team liaises with the back-end team in a cross-functional DevOps fashion to troubleshoot bug requests by leveraging team knowledge-sharing features in Honeycomb, such as query permalinks.

Even when tracing is adapted to capture front-end, client-side information, it’s difficult to connect front-end and back-end information. Since distributed tracing tools generate a trace ID at the first application they encounter on the server side, it can be difficult to trace an error to any separate services or applications used on the front end.
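One common way to narrow this gap is to have the client create the trace ID and pass it along with each request, so the server continues the same trace instead of starting a new one. The sketch below builds and parses a header in the W3C Trace Context `traceparent` format (version, 32-hex-character trace ID, 16-hex-character parent span ID, and flags). The helper names are hypothetical, and it assumes your front end is able to attach custom headers to its requests.

```python
import re
import secrets

# traceparent layout: version "00", trace ID, parent span ID, trace flags.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def make_traceparent(trace_id=None, sampled=True):
    """Build a traceparent value, creating a new trace ID when none is given."""
    trace_id = trace_id or secrets.token_hex(16)   # 32 hex characters
    span_id = secrets.token_hex(8)                 # 16 hex characters
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(header):
    """Return the IDs carried by a traceparent value, or None if it is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are not valid in the Trace Context format.
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 1),
    }


# Client side: attach the header to an outgoing request.
outgoing = make_traceparent()

# Server side: continue the same trace rather than starting a fresh one.
incoming = parse_traceparent(outgoing)
```

Because the server reads the trace ID from the header, spans recorded in the browser and spans recorded in back-end services end up in the same trace.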

Other challenges with distributed tracing include:

  • Transactions in a microservice architecture may take different execution paths, in different orders, so it’s challenging to include a pre-defined and broadly applicable set of instructions that could connect different traces before an execution path actually begins.
  • Any individual function in the execution path may run multiple times during any given single request.
  • As a request moves through your distributed services, operations on the request may cause some contained data to change as part of the business logic, so it’s difficult to predict fields that could make connecting traces more plausible.

However, these challenges are almost insignificant when compared with the numerous benefits distributed tracing provides for your application—so it’s essential to find a solution like Honeycomb that helps reduce the challenges presented by distributed tracing and offers all of its benefits.

One tool for distributed tracing and more

With Honeycomb, distributed tracing is one of the many powerful tools in your debugging toolbox. Honeycomb provides more functionality than most other distributed tracing tools, such as a number of analytical and collaboration features that can help you and your team make the most of your tracing data. 

For example, you may start by running a query to better understand how a specific API call is behaving. Then, you might use Honeycomb’s line graphs or heatmaps to visualize trends, and use tools like BubbleUp to crunch through billions of rows of data across thousands of dimensions containing high-cardinality data to quickly spot outliers and determine what’s different about them.

To precisely understand what is happening with that particular API service request, you can visualize an unusual spike and pinpoint a particular customer (or user) by identifying the ID. You can then view the trace of that exact request. Check out this demo to see how we do it.

Honeycomb allows you to see every field attached to a particular service request and what occurred at each step. For example, in the demo above, the time bar shows the duration, and you can see exactly where any latency occurs, such as bulk ticket requests. You can send details of the latency report to the development team or communicate with the support team so they can inform customers of the updates that resolve the issue. Also, by sharing these query results, other team members can better solve any recurring issues.

To dive deeper into distributed tracing and what it can do for you, download a free copy of “Distributed Tracing: A Guide for Microservices and More.” You can also try Honeycomb for yourself with a free trial today.