If you are running a distributed system and have reached a point of scale where you’re running 5+ services, you are more than likely experiencing difficulties when troubleshooting. Does the following sound familiar? You’re on call and an issue comes up. You start with a few hypotheses, follow up with more questions, and then turn to a handful of different tools: metrics, monitoring, and log management. The APM dashboard says the system is performing as expected and, according to the charts, no unusual patterns are detected. Yet customers continue to report problems, so the on-call team (that’s you) needs to figure out the exact source (or sources) of the problem, which requires a different tool and a more streamlined approach…
Triage with tracing
Teams are used to triaging when something goes wrong, but pulling up multiple tools, each built for a single purpose, is time-consuming and doesn’t help pinpoint the problem. A classic log management tool can tell part of the story, but with 5+ services running, you can’t easily visualize what’s happening. Moreover, typical log messages rarely capture how services are connected and called upon. It would take pages of logs to get there, and it’s difficult to spot latency or understand which service call is taking the longest. Understanding all the steps across a set of distributed services requires a deeper level of telemetry. The answer, distributed tracing, provides visibility into each request and all its subsequent parts. With distributed tracing comes a true understanding of how systems are behaving in production.
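To make the difference from flat logs concrete, here is a minimal stdlib-only Python sketch of the structure a trace adds: every span carries a shared trace ID, its own span ID, and a parent ID, which is exactly the connective tissue ordinary log lines lack. The field names and service names here are illustrative, not any particular vendor’s schema.

```python
import uuid
from collections import defaultdict

def new_span(trace_id, name, parent_id=None, duration_ms=0):
    """A span is just a structured event with IDs that link it into a trace."""
    return {
        "trace_id": trace_id,      # shared by every span in one request
        "span_id": uuid.uuid4().hex,
        "parent_id": parent_id,    # None marks the root span
        "name": name,
        "duration_ms": duration_ms,
    }

trace_id = uuid.uuid4().hex
root = new_span(trace_id, "GET /api/tickets", duration_ms=480)
auth = new_span(trace_id, "auth-service", parent_id=root["span_id"], duration_ms=35)
db = new_span(trace_id, "tickets-db query", parent_id=root["span_id"], duration_ms=410)

# Reconstruct the call tree from the parent links -- something flat log
# lines cannot give you without painful cross-referencing.
children = defaultdict(list)
for span in (root, auth, db):
    children[span["parent_id"]].append(span)

def render(span, depth=0):
    lines = [f'{"  " * depth}{span["name"]} ({span["duration_ms"]}ms)']
    for child in children[span["span_id"]]:
        lines.extend(render(child, depth + 1))
    return lines

print("\n".join(render(root)))
```

The indented output shows at a glance which services were called and how long each took, which is the visualization a trace waterfall gives you.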
Tracing is a faster way to tell you where a problem lies, which lets you spend your time conducting additional forensics on that specific service call. Tracing is also useful for spotting unusual patterns or inefficiencies, such as a database call erroneously made multiple times after a recent code ship. With tracing, everyone on the team can visualize what is occurring, with proactive alerts sent to those on call. When you are woken up in the middle of the night, a trace link can tell you whether the problem is in a service you are responsible for or in a third party, which speeds time to resolution.
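As a sketch of how a repeated-call pattern surfaces once you have span data, the snippet below counts span names within a single trace to flag duplicates, the classic N+1 query smell. The span names are hypothetical; in practice they would come from your tracing backend.

```python
from collections import Counter

# Hypothetical span names pulled from one trace of a single request.
spans_in_trace = [
    "GET /api/orders", "auth.check", "db.query:orders",
    "db.query:users", "db.query:users", "db.query:users",  # same query, three times
]

# Any span name appearing more than once in one request is worth a look.
counts = Counter(spans_in_trace)
suspicious = {name: n for name, n in counts.items() if n > 1}
print(suspicious)  # flags the repeated db.query:users call
```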
Why is tracing hard?
Implementing tracing often feels like a big hurdle because instrumentation is required during development, and many believe that if you instrument one set of services, you have to do it across the entire application for it to be meaningful. OpenTracing has helped codify how to implement tracing and map to the data model being used, but it treats spans a little differently from open source tools such as Zipkin and Jaeger, where child spans can have their own child spans, resulting in very deep nesting. These open source tools are useful for instrumenting and getting data in, but Honeycomb recommends using Beelines, which require much less work and let you avoid storing data in a third-party store such as Cassandra. Honeycomb’s built-in data store is highly efficient and uniquely designed for high-cardinality data with blazing-fast queries.
One tool for tracing, plus query against events and logs
With Honeycomb, tracing is yet another powerful ‘tool’ in your debugging bag that can be used interchangeably with other types of queries. For example, you may start by running a query to better understand how a specific API call is behaving and, using Honeycomb’s line graphs and heatmap charting, quickly spot outliers. To understand exactly what is happening with that particular API service request, you can then zero in on an unusual spike and pinpoint a particular customer (or user) by identifying the ID in the detail view. This allows you to view the trace of that exact request. Check it out in this demo video.
With Honeycomb tracing, you see every field attached to a particular service request and what occurred at each step. The time bar shows the duration, so you can see exactly where any latency occurs, such as the bulk ticket request in the demo linked above. To resolve the issue, you might send details to the development team or communicate with support, who can then inform customers. Making changes to these API service calls will hopefully improve the experience for future customers, and by sharing these query results, other team members can better solve any recurring issues.
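Reading a time bar to find latency amounts to asking which span accounts for the most un-attributed time. Here is a rough stdlib-only Python sketch of that calculation on made-up span data, assuming for simplicity that children run sequentially: a span’s “self time” is its duration minus the time spent in its children.

```python
# Hypothetical spans from one trace: (name, parent_name, duration_ms).
spans = [
    ("GET /tickets/bulk", None, 900),
    ("auth-service", "GET /tickets/bulk", 40),
    ("ticket-db", "GET /tickets/bulk", 780),
]

# Sum each parent's child time, then subtract it out to find where the
# latency actually lives rather than where it is merely inherited.
child_time = {}
for name, parent, dur in spans:
    if parent is not None:
        child_time[parent] = child_time.get(parent, 0) + dur

self_time = {name: dur - child_time.get(name, 0) for name, parent, dur in spans}
slowest = max(self_time, key=self_time.get)
print(slowest, self_time[slowest])  # ticket-db dominates at 780ms
```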
Try it out for yourself
We always recommend a hands-on experience. To better understand the power of distributed tracing and learn how to use it interchangeably with the other debugging features in Honeycomb, join this workshop taking place in San Francisco on Jan 23rd, 2019. You will instrument a sample app, deploy it, trace it, and track down performance problems from start to finish. This will leave you equipped to take the tracing fundamentals back to your own team, and understand how to swiftly get your production app instrumented and trace-aware.
Of course, if you are not local to the SF Bay Area, it may be harder to join, so we recommend you try Honeycomb for yourself and start a free trial.