Paul Osman [Instrumentation Engineer|Honeycomb]:
Thank you very much. I hope everybody is enjoying the session so far. My name is Paul Osman, and I work for Honeycomb. Honeycomb is a tool that lets you see what’s going on in your production applications, so it helps you with debugging and fixing performance issues. I work primarily on instrumentation. So that includes pretty much everything that helps you get data into Honeycomb. So let me start off with a perspective on instrumentation. So the way I see it, and I think the way our product certainly thinks about things, observability isn’t binary. It’s not like you do something and then all of a sudden you have observability. I think of observability as more of a journey. So you can get telemetry from your system that helps you figure out what’s going on in production and how your users are experiencing your application. And that telemetry can vary in the degree to which it helps you answer questions. And people start this journey at various points. You can start with gathering external telemetry about your service, and that can be anything from structured logs to host-level metrics, things that are outside of your actual code. And these are valuable because they tell you something. They tell you something about what’s going on in your system, but they’re not necessarily great at answering questions about what’s happening in your code. How long a particular function is taking to run or how long a database call takes to run or how much latency there is between services, et cetera. So a step up from that in terms of answering those sorts of questions is what we call auto instrumentation.
And auto instrumentation is this notion that you add a library or you add something to either the framework or to the code itself that powers your service and you start gathering data automatically. And this can be incredibly useful because you can hook into if you’re using a framework like Java Spring Boot, or you’re using Ruby on Rails, or something like this, you can hook into the framework to answer certain questions. What sort of HTTP method was used to make this request? How long did the request take overall? What were the various parts of the request and so on and so forth? And so auto instrumentation can get you really far, but at the end of the day, at some point in order to achieve the sort of best observability you can achieve for your system, to be able to answer questions that only you know, you really need to do custom instrumentation. And that means actually instrumenting your code with either spans if you’re doing tracing, or events if you just want to emit some information, or metrics if you want to gather counters, et cetera. And then that’s where we get into modifying your code and modifying your code, as we all know, is expensive. It takes a fair bit of work. And I love this quote, by the way. This was actually somebody who uses Honeycomb, just mentioned that using automatic instrumentation is teaching your code words that someone else thought would be useful. And I love that way of putting it because that definitely mirrors my experience. These are words that can commonly be useful in services, but they’re not necessarily the things that are most useful for your system. Only you know the answer to that, which kind of touches on my point about needing to do custom instrumentation.
Let’s talk about the kind of data that you can get when you start doing custom instrumentation or even auto instrumentation. One of the most… Honeycomb is actually an event-based system. We collect events and those events can be arbitrarily wide. So you can add as many fields as you like to an event. And then we allow you to query by or aggregate on any fields that you send to us, no matter the cardinality. And that’s great. If those events contain certain information, we can string those events together as spans in a trace. And this talk today is focusing really just on tracing. I’m not going to cover metrics even though OpenTelemetry has a lot of great solutions for metrics. I’m going to talk about using OpenTelemetry, specifically the OpenTelemetry collector, to enhance how you do tracing in your application and get traces. I wanted to put this screen up just to kind of show you first off a screenshot of Honeycomb showing a trace in an application. But also for anybody who hasn’t used tracing before, who hasn’t used a product like Honeycomb or LightStep or an open-source platform like Jaeger or Zipkin. A trace can be thought of as causally related spans, a collection of causally related spans. And so in this example trace that I have here, I’ve got a request to a service called Front End, and the endpoint is slash product. That service then made a call to a product catalog service, a recommendation service, a product catalog service, and an ad service. And you can see the hierarchy of calls here. And then you can see information about the duration of each of those calls.
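To make the idea of a trace as a collection of causally related spans concrete, here is a minimal sketch in Python. It assumes each span carries an ID and the ID of its parent, and it rebuilds the call hierarchy from those links. The `Span` class, the operation names, the durations, and the exact nesting are hypothetical illustrations, not data from the screenshot or any Honeycomb API.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span of the trace
    service: str
    name: str
    duration_ms: float


def render_trace(spans: List[Span]) -> List[str]:
    """Return one indented line per span, with parents listed before children."""
    children: Dict[Optional[str], List[Span]] = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)

    lines: List[str] = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        for span in children.get(parent_id, []):
            indent = "  " * depth
            lines.append(f"{indent}{span.service} {span.name} ({span.duration_ms} ms)")
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines


# Hypothetical spans, loosely modeled on the frontend request described in the talk.
trace = [
    Span("a", None, "frontend", "/product", 120.0),
    Span("b", "a", "productcatalog", "GetProduct", 15.0),
    Span("c", "a", "recommendation", "ListRecommendations", 60.0),
    Span("d", "c", "productcatalog", "ListProducts", 20.0),
    Span("e", "a", "ad", "GetAds", 10.0),
]

for line in render_trace(trace):
    print(line)
```

The parent link is what turns a flat stream of events into a tree, which is roughly what a trace view in any tracing backend is drawing.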
Importantly, each of these spans also has an arbitrary number of fields associated with it. Depending on the standard there are different names for these. They could be called attributes on a span. At Honeycomb, as I said, we care about events. We call them fields, fields in the event. But they’re all the same thing. There’s a bunch of different tracing formats out there. And I mentioned a few. Zipkin and Jaeger were two tracing formats that were part of the OpenTracing standard. OpenTracing defined an API and a structure for spans and tracing data. And that helped kind of create some uniformity. OpenCensus was another system. It has its roots in Google. OpenCensus started to become pretty popular. And there was this problem in the world of telemetry and observability, where now all of a sudden you had two standards that were sort of competing with each other and that’s not always great for users. It’s not always great for vendors. And so a bunch of people got together and created the OpenTelemetry project with the goal of actually combining OpenTracing and OpenCensus into one unifying standard for telemetry data. Like I said, my talk is focused on tracing and some of the demos that I’m going to show you have to do with tracing. OpenTelemetry also deals with metrics and, increasingly, logging, which is really exciting. OpenTelemetry has its own format. And then there are a bunch of vendor-specific formats. I mentioned I work on instrumentation at Honeycomb. We have a bunch of libraries and SDKs that help you get data into Honeycomb. And those all wrap our HTTP API, which accepts JSON payloads. LightStep has a format, Amazon X-Ray has a format. Vendors have all introduced various formats over the years to make it easy for people to get data into their systems.
What I want to walk through today is solving this problem of what happens if you’ve already invested a bunch of effort in instrumenting your system and it would be absolutely cost-prohibitive to ask most organizations to go and reinstrument your entire system, if you wanted to use some new product or some new feature of a product. For example, imagine you’ve invested months, maybe even years, instrumenting your system with an open standard or an open-source project like Jaeger. Okay. So you’ve used all the libraries, all of the languages that you deploy code in had good support and you’re comfortable. You have good observability and you’re using Jaeger to look at your traces and everything, but you want to try out a product like Honeycomb or LightStep or another vendor, or you want to try out some other open-source projects that support some of these standards. You really don’t want to have to go through and reinstrument your code. Doing that would be a nonstarter for a lot of people. It’s hard if you have one codebase. It’s really hard if that codebase is large. It’s exceptionally difficult if you have more than a few code bases. At that point and I’ve been there in previous roles where I’ve been on teams that helped do SRE for hundreds of services and getting an organization to lift and shift like that would just be absolutely huge. What I’m going to do is I’m going to walk through a demo of using OpenTelemetry to address a few use cases that I think make these kinds of journeys easier. And specifically, I’m going to show using the OpenTelemetry collector to kind of detach those requirements, your instrumentation from the backend that you’re using.
And then we’ll walk through a few cool things that you could do once you’re already using the OpenTelemetry Collector. The OpenTelemetry Collector is part of the OpenTelemetry ecosystem. I’d encourage you to go to OpenTelemetry.io and just take a look around. There are a lot of components to the project. There are language-specific SDKs, and you can use those language-specific SDKs along with an exporter to hook it up to a backend, or you can use the OpenTelemetry Collector. And so the OpenTelemetry Collector is composed of various components called receivers and processors and exporters. And this diagram kind of shows at a high level that you can have instrumentation in a variety of different formats, run it through the OpenTelemetry Collector and then export it into a variety of different backends regardless of the actual format you use for your instrumentation. And so the configuration happens, and I’ll walk through examples of this, using this notion of a pipeline. And I really like this because it allows you to specify one or more receivers. And so you can even mix and match your instrumentation. You can have some services instrumented with Jaeger, some services instrumented with OpenCensus, and you can have receivers configured in your OpenTelemetry pipeline for each of those. The next step is it runs the trace data through processors. I’ll show a few use cases for processors, but they allow you to modify or scrub or otherwise mutate trace data in flight, which can be really powerful. One of the biggest use cases there is of course sampling, and also modifying span data, scrubbing PII, et cetera. And then at the end of the pipeline, you have exporters, and exporters are the things that vendors or open-source contributors can provide that allow you to send your trace data to some backend.
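The receiver, processor, and exporter flow described above can be modeled in a few lines of Python. This is only a conceptual sketch of the pipeline idea, not the collector's implementation or API. Every name in it (`run_pipeline`, `scrub_email`, `drop_health_checks`, the two backend lists) is hypothetical, and the "exporters" are plain lists standing in for two backends.

```python
from typing import Callable, Dict, Iterable, List, Optional

Span = Dict[str, object]
Processor = Callable[[Span], Optional[Span]]  # returning None drops the span
Exporter = Callable[[Span], None]


def run_pipeline(
    received: Iterable[Span],
    processors: List[Processor],
    exporters: List[Exporter],
) -> None:
    """Pass each received span through every processor, then fan out to every exporter."""
    for span in received:
        current: Optional[Span] = dict(span)  # copy so the caller's data is untouched
        for process in processors:
            current = process(current)
            if current is None:  # a processor dropped the span
                break
        if current is None:
            continue
        for export in exporters:
            export(dict(current))  # each backend gets its own copy


def scrub_email(span: Span) -> Optional[Span]:
    """Remove a sensitive attribute before the span leaves the pipeline."""
    span.pop("user.email", None)
    return span


def drop_health_checks(span: Span) -> Optional[Span]:
    """Filter out spans nobody wants to pay to store."""
    return None if span.get("name") == "/healthz" else span


# Two lists stand in for two backends receiving the same data.
backend_a: List[Span] = []
backend_b: List[Span] = []

run_pipeline(
    received=[
        {"name": "/product", "user.email": "someone@example.com"},
        {"name": "/healthz"},
    ],
    processors=[drop_health_checks, scrub_email],
    exporters=[backend_a.append, backend_b.append],
)
print(backend_a)
print(backend_b)
```

The point of the shape is that the instrumented services only know about the receiving end. Which processors run and which backends receive the result is decided entirely inside the pipeline.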
Let’s go through a demo, let’s go through a few use cases. And the first use case I’m going to show is what I talked a little bit about earlier, which is you have, let’s say, dozens of services, and you were ahead of the curve. You’ve been using Jaeger or another open-source project for a while. And you want to try out a product like Honeycomb or LightStep or some other backend for your tracing data. As I said, you don’t want to go through and reinstrument and redeploy all of your code. And so I’m going to show a way to use the OpenTelemetry Collector to do a zero-code redeploy of the collector that allows you to migrate back ends. Specifically, I’m going to show taking a bunch of services writing trace data to Jaeger, and then using the collector to write to Jaeger and Honeycomb at the same time, and show the trace data in both. And you can imagine being on a team, if you’re using a tool like Jaeger, you don’t want to organize a big shutoff point where all of a sudden you’re turning things over to a product like Honeycomb. You want to be able to use them both at the same time while you’re transitioning between products. Let’s take a look. All right. In this window here, I’m going to run… I have a bunch of terminal windows. I’m going to run the OpenTelemetry Collector. I’m using a version that I’ve compiled. There are of course Docker images that are published. In fact, let me go and show you. I would encourage you to go to OpenTelemetry.io and go to the collector documentation. And you’re going to be able to see how you can get this yourself, either using a prebuilt image or just cloning from GitHub. Let’s see, the first thing I’m going to do… The other component I’m using is the synthetic load generator created by the good folks at Omnition. This is a Java project that just simulates a bunch of services that are instrumented with Jaeger, publishing trace data.
So really useful when you’re trying to simulate a bunch of traffic and you don’t want to touch prod yet because you’re doing a demo or because you’re just trying the product out. And then I’m going to run Jaeger.
I’m going to run Jaeger just using the Docker image that’s hosted on Docker Hub and I’m going to run it like this. So what I’m doing here, actually let me not do that. Instead, here we go. I’m going to run it like this. I’m exposing a few ports, specifically different Jaeger ports for the web UI and for ingest. Okay. So that’s running in this terminal. This, of course, could be run on Kubernetes. This can be run however you like. And so I’ve got port 14268 open, and the synthetic load generator is going to start publishing trace data to that port. I run the synthetic load generator. Imagine this is a bunch of different services that are running and let’s see, I should start getting trace data. Okay. So I’m getting emitted trace ID logs that are showing me that things are happening. Good. So let’s take a look at some of these traces. All right, I’ve got Jaeger running locally. This is the Jaeger UI, and I can already see that I’ve got a bunch of services that are receiving data. So that’s good. Let’s take a look and let’s look at some traces. Excellent. If I click on one trace, I can see all of the various spans. So this is very similar to the Honeycomb screenshot, just visualizing a trace in Jaeger. If I click on an individual span, I can see the tags. I can see metadata that gives me information about that. All right, so we’re getting data. That’s great. This is a world a lot of people exist in. They’re using a tool like this. Now let’s say you want to introduce something like Honeycomb. You want to do a migration or you want to explore a new tool. The first thing I’m going to do then is instead of publishing my trace data directly to Jaeger, I’m going to shut this down. I’m going to shut down Jaeger and let’s see.
I’m going to run Jaeger but I’m not going to expose port 14268 anymore. I’ll expose the web UI and I’ll show you why in a moment. Because now what I’m going to do is I’m going to run the OpenTelemetry Collector and use it to intercept that data. So let’s see. This is the config I’m going to use. And notably, I’ve got one receiver. It’s the Jaeger receiver. It’s going to accept data on this endpoint. So it’s going to listen on port 14268, which I didn’t enable in my Jaeger instance. And I’m going to configure these exporters. The most obvious one is Jaeger. It’s going to publish to a locally running Jaeger instance. I’ve also got this logging exporter, and this is really useful when you’re experimenting with the OpenTelemetry Collector because you can see in real time what’s happening without having to go back to the web UI constantly. You can just make sure that things are happening. And then I’ve got these pipelines. And I mentioned pipelines are the kind of core way that you configure the OpenTelemetry Collector. So I’ve got one pipeline called Traces and I’ve got one receiver and one exporter. In fact, I’m going to put logging in there too. Okay. I’ve got two exporters now. Pretty basic, but it will get the job done. So let’s do this. Step one. And these YAML files, of course. Well, this one is pretty basic, but I’m going to introduce more to it as we go. Okay. I run that, and now data is going through the OpenTelemetry Collector to Jaeger. I’m getting debug log output. Hopefully, I should be able to go to Jaeger now and see that I’m still getting trace data.
That looks good. All right. And I’m getting all of this through the OpenTelemetry Collector now. I’ve already introduced some value in that I’ve abstracted out my instrumentation from my backend. Okay. So let’s throw Honeycomb in there. So what I’m going to do in order to get this publishing to Honeycomb is I am going to add an exporter. So Honeycomb is one of the supported exporters. There are also exporters, as I’ve mentioned, for Stackdriver, for LightStep, for a whole bunch of different products, but let’s just add in Honeycomb here. There are different configuration options for different exporters. Let’s see. Okay. For Honeycomb, I’m going to specify an API key. I’m not going to put my real one. Then a dataset. I’m going to create a dataset called cloud camp. And an API URL, Honeycomb.io. And that should be enough. Yeah. Oh, and then, sorry, I’m also going to go into my exporters here and add the Honeycomb exporter. Just configuring the exporter isn’t enough. I have to add it to my pipeline, which makes sense because I have configurable pipelines that can string together different combinations. All right. Okay. Now let’s start up Jaeger again. Good. Start up my synthetic load generator. Excellent. And now at this point, let’s go over and make sure that we’re still getting data in Jaeger. Good. Looks like it. There we go. There are some traces and now I’m going to go over to Honeycomb and I’m going to see that there’s a dataset that’s just been created called cloud camp. Datasets are just ways of grouping data in Honeycomb. If I click on this, I should see I’m starting to get data. Perfect. So let me run a query. I’m just going to do a straight count query. Look at the actual events that I’m receiving. All right. I’m going to restrict it to the last 10 minutes.
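The point that defining an exporter is not enough, and that it also has to be listed in a pipeline, can be sketched in Python. The dictionary below mirrors the general shape of the collector's YAML as described in the talk: top-level `receivers` and `exporters`, plus pipelines that name them. The `service`/`pipelines` nesting, the Jaeger exporter endpoint, the API URL value, and the exact key spellings are assumptions based on the collector of that era and may not match current releases. `check_pipelines` is a hypothetical helper, not part of any OpenTelemetry tooling.

```python
import json
from typing import Dict, List

# Assumed config shape. Check the collector docs for the keys your version accepts.
config = {
    "receivers": {
        "jaeger": {"protocols": {"thrift_http": {"endpoint": "0.0.0.0:14268"}}},
    },
    "exporters": {
        "jaeger": {"endpoint": "localhost:14250"},  # assumed local Jaeger address
        "logging": {},
        "honeycomb": {
            "api_key": "YOUR_API_KEY",  # placeholder, never commit a real key
            "dataset": "cloud camp",
            "api_url": "https://api.honeycomb.io",  # assumed URL form
        },
    },
    "service": {
        "pipelines": {
            "traces": {"receivers": ["jaeger"], "exporters": ["jaeger", "logging"]},
        },
    },
}


def check_pipelines(cfg: dict) -> Dict[str, List[str]]:
    """Compare components defined at the top level with those named in pipelines."""
    report: Dict[str, List[str]] = {"undefined": [], "unused": []}
    pipelines = cfg["service"]["pipelines"].values()
    for kind in ("receivers", "processors", "exporters"):
        defined = set(cfg.get(kind, {}))
        referenced = {name for p in pipelines for name in p.get(kind, [])}
        report["undefined"] += [f"{kind}: {n}" for n in sorted(referenced - defined)]
        report["unused"] += [f"{kind}: {n}" for n in sorted(defined - referenced)]
    return report


before = check_pipelines(config)
print(before)  # the honeycomb exporter is defined but no pipeline sends to it

# Add the exporter to the pipeline, as done in the demo.
config["service"]["pipelines"]["traces"]["exporters"].append("honeycomb")
after = check_pipelines(config)
print(after)

# JSON is a subset of YAML, so this output can be pasted into a YAML file.
print(json.dumps(config["service"], indent=2))
```

A check like this is cheap to run before restarting a collector, since a forgotten pipeline entry otherwise only shows up as a backend that quietly receives nothing.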
Okay. Looks good. So you can see here, I’m looking at duration, or let’s change that to count. Okay, good. Let’s take a look at some of this data in more detail. If I click on the traces tab, I can see I’m getting traces here. Very similar to the screenshot I showed. You get all of the fields that were being published in Jaeger. And I’m going to take a look at the trace ID here, and I’m just going to copy it over to Jaeger, to my Jaeger UI, and show that I can look at the exact same data in Honeycomb and Jaeger. This way it’s a very no-code way of being able to try out a product and see if it fits your needs. All right, so that was step one. Pretty cool. Let’s go back and see what else we can do with this. Migrating back ends was the first use case. The next one I’m going to cover is scrubbing data. Now that I have data going through the OpenTelemetry Collector, I can start to use processors. And like I mentioned, processors are really powerful. They allow you to mutate span data in flight. And this is a really valuable thing. So let’s go back to our config. And now what I want to do is I want to add something to our pipeline. Specifically, I want to add a processor, and let’s see, there’s a processor that’s built-in and it’s just called the attributes processor.
I’m going to give it a name, delete. And what I’m going to say here is actions. So what this is going to look at is attributes, or as we call them, fields in Honeycomb. And I have region data coming in for each span. I’ll show you that. Let me take a look, run a query, and I’m going to group it by region. All right. I can see I’ve got US East, US West. Then I’ve got a bunch of spans that don’t have a region at all. Let’s say I just don’t want to record that data. It’s not reliable. It’s not being recorded properly. So I want to take that out. If I go back to my YAML here, I say key of region, action, delete. And then I add this to my pipeline. Okay. And I’m just seeing if there are any questions. Let’s see. All right. I think I’ll end with a poll because I have a poll ready. Okay. So I’ve now got this running. Let me see, let’s say step three. Should be. Okay. Restart the collector, restart my load generator. All right, let’s go back into Honeycomb and I’m running this query still. Let’s take a look. And what I should start to see after a little while is, yeah, the only data I’m getting now has no region data because I’m mutating the span in flight and deleting that data before it gets to Honeycomb. This is a bit of a contrived example, doing it on region. But imagine you have some PII or something like that that you don’t want to land in a vendor backend. That’s a very good use case. Very quickly I’m going to introduce one last use case and that’s actually modifying or setting some span data. So let’s say you want to go in and actually update the span or set some data on the span. That’s something you can do too. It’s using the exact same processor. What I’m going to do here is instead of delete, I just go in and let’s see, actually, you know what? We’re running out of time. So I’m just going to show the documentation.
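Since the demo ran out of time before showing the set-or-update case, here is a small Python model of what attribute actions do to a span's fields. It follows the `key` and `action` layout described in the talk and the insert, update, upsert, and delete actions from the attributes processor documentation, but `apply_actions` itself is a hypothetical stand-in, not the collector's code, and the attribute names and values are made up.

```python
from typing import Dict, List


def apply_actions(attributes: Dict[str, object], actions: List[dict]) -> Dict[str, object]:
    """Return a copy of the span attributes with each action applied in order."""
    result = dict(attributes)
    for action in actions:
        key, kind = action["key"], action["action"]
        if kind == "delete":
            result.pop(key, None)  # removing a missing key is not an error
        elif kind == "insert":
            result.setdefault(key, action["value"])  # only when the key is absent
        elif kind == "update":
            if key in result:  # only when the key already exists
                result[key] = action["value"]
        elif kind == "upsert":
            result[key] = action["value"]  # set whether or not the key exists
        else:
            raise ValueError(f"unsupported action: {kind}")
    return result


span_attributes = {"region": "us-east-1", "http.method": "GET"}
actions = [
    {"key": "region", "action": "delete"},
    {"key": "environment", "action": "upsert", "value": "demo"},
]

scrubbed = apply_actions(span_attributes, actions)
print(scrubbed)
```

Because the actions run inside the pipeline, the services emitting the spans never need to know that a field was removed or added on the way to the backend.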
Because this is something I wanted to run over really quickly. If you go to the OpenTelemetry Collector GitHub repository, it’s linked from the docs. Let’s see, there is some good documentation here. You can see configuration, you can see processors, and this will walk you through everything that you can do with processors. I was just using the attributes processor and that shows you all of the use cases there. All right. Let me see if I can figure out how to get this poll going. Because I want to end with a poll quickly. And I don’t see the option. All right. Something must have gone wrong. I must not have queued it up properly. That’s okay. We’ll just keep going. Okay. So just to summarize. The OpenTelemetry Collector allows you to change the way that your telemetry data is sent to back ends or processed in flight without changing your code. And that can be really, really powerful because as we all know, code deploys aren’t always an option, and certainly reinstrumenting your code is prohibitive. So definitely check out opentelemetry.io. Here are the GitHub links. The OpenTelemetry Collector has a repo, and then there’s a contrib repo. You want to go there if you want any of the open-source contributions. The vendor-specific stuff is really what is in that repo. Images are published to Docker. You can just pull down an image and start experimenting. And then docs.honeycomb.io/gettingdatain. This is anything that you want to know about getting data into Honeycomb. Thanks very much.