Conference Talk

OpenTelemetry, Not Just for Production Troubleshooting

June 9, 2021

 

Transcript

Michael Haberman [CTO & Co-founder|Aspecto]: 

Hello, everybody, my name is Michael Haberman. I’m the Co-founder and CTO of Aspecto. I would like to tell you how to prevent your next downtime using OpenTelemetry. If you are at this conference, you’re probably into some kind of observability. I’m going to talk specifically about tracing and OpenTelemetry, and we will take a look at how we can prevent the next issue.

We’re going to start with just a bit of history, not too much: how it started and how we’re using it today. Then I will move to the next portion, which will probably be more interesting, where I’m going to think together with you: how can we use OpenTelemetry and tracing data to prevent issues, to do more with the work we are doing today? And I will give you some action items so you can put this into practice as soon as tomorrow.

Why am I speaking about OpenTelemetry, tracing, and prevention? I have been doing microservices for quite some time. For about five years I worked as an independent consultant with all kinds of companies: either breaking their monolith into microservices, or helping them when their microservices scaled. After four or five years I stopped being an independent consultant and started my own start-up, and that’s what led me to speak about this area.

Let’s start by looking at the early days, when we had monoliths. How were we getting visibility into our systems? We were using logs. Those logs basically told us the story: the application would tell us what happened inside it. And this was great, whether it was with a central log solution or a distributed one. But when applications started to change and became distributed, things got more complex, because now you don’t have one story to follow in order to understand what is happening in your application. Now I have a whole bunch of stories that I need to compose into one big story, and that can be quite complicated to do.

So we came up with the idea of having some kind of correlation between the logs, right? We have all the different logs, but we also have a unique ID that’s propagated throughout my services, and that way I can look at all of my logs as one big story. But it still requires the developer to understand the story: I need to read a lot of stuff and understand how it works, and that is still quite hard.

This is where we got traces. Traces took the story and made it far more visual, making my life as a developer who needs to work with them much easier. And then we are also able to correlate traces to logs. All this history leads us to the fact that we can very easily visualize a production issue. Usually production.

So we have an issue. Something isn’t working the way that I expect it to work. I want to visualize it. I want that big, red sign in my UI saying this is where it failed. I will be able to resolve it faster; it will be easier and simpler for me, and, even cooler, just more fun to work with traces. But the premise remains: I have an issue and I want to fix it. How are traces helping me? Basically, they are helping me improve my MTTR, my mean time to recovery, or any other English word that starts with R and tells you that the issue no longer exists. So we are talking about how fast I can fix an issue, how fast I can respond and make sure the system keeps working. And this means that this is a reactive tool. First I have an issue, and then I resolve it. I’m not trying to say it’s not important. I’m not saying you don’t need it, not at all. You need to resolve things faster. But I am emphasizing two things: it is a reactive tool, and you don’t prevent anything, you just respond to what happened. Look at what I just said: first you have an issue, and then you resolve it. When you say you have an issue, I want you to dig deeper into that.

If we look at our process as developers: at which point in the process do we have an issue? I think it usually starts when I’m committing code. Right when I’m committing code, I’m basically telling my teammates, “Hey, I think this code change should be part of our codebase”. You do the commit, you send the pull request, you have all kinds of tests and validations and whatever else happens in your CI. And then you deploy it. And only once you deploy it do you realize that it is failing. There is an issue. There is quite some time between committing a change and deploying it to production. Depending on the organization and the work, it could be a matter of minutes, hours, days, or weeks, but you have some time right there, between the commit and the deployment, to try to figure out if something is not going to work as expected.

7:00

That was just a bit of history and how we are using it today. Now let’s dive into what more we can do with it. If we have a reactive tool, I want it to be proactive. I want to be able to prevent issues, because I prefer not having issues in production and not waking up on a Friday night to fix something. So I want to be able to predict issues prior to deployment. One thing to emphasize here: by no means am I saying you won’t have any more bugs in production. Not at all. That is not that type of talk. What I want to get you thinking about is that you can take the same data you are using today and use it to do more things. We’ll try to find the main issues that we encounter in our day-to-day work and ask ourselves whether tracing data can help us prevent them.

Last week, when I worked on this talk, I encountered two issues, and I want to go over them with you. I want to describe the issues, tell you what the developer said, and then discuss whether we can fix them or not. The two issues that I found were: a schema change that introduced a failure in a feature, and a DB that got far more queries and basically wasn’t able to handle the load.

Let’s jump to the first one. We’re talking about a schema change. As you can see here, we have service B and we have service A, and service B is communicating with service A. As a developer, I got an assignment to make a change in service B, and once I started to investigate the code, I understood that the change shouldn’t be in service B. It should be in service A, because service B communicates with A and I need to change the response. So I made a schema change. I consulted with my coworkers; they thought it was a good idea. I ran the tests. I asked the CI. I got the code review approved. Everything is green. Everybody is happy. We deployed to production. And then reality hits. There is also a service C, and because of this change that I introduced in service A, we now have a feature that is down and not working in service C. I’m pretty sure if you look at the past three or six months across your whole organization, you may find this somewhat familiar. I’m saying “may”, but I’m guessing most of you know what I’m talking about.

Now I want us to put ourselves in the perspective of the developer who made the change, and think about what they might say. What was the reasoning for this downtime or broken feature? How can I explain it? I thought of three main things I could say. I could say I didn’t know that service C depends on service A. That’s a legitimate thing: if you are working at a company that has hundreds of microservices, you want to know the dependencies between them. I could also say it was a schema change and I wasn’t aware it would be a breaking change; I checked it and it worked. Or I could say we have automated tests: I did my own manual test, the automated tests all passed, so what do you want from me? I passed the gate that the company asks me to pass before moving to production. So yeah, it works for me; it’s not working in production. We know those things. Now, the question is, how could this have been prevented?

Before addressing how we’re going to prevent it, let me say what I think we need to have in our hands in order to do so. The first thing is access to the payload, the schema of this endpoint that we just broke. This would help me understand what I’m about to change. It’s also really important for me to take an endpoint in my system and be able to know who is consuming its messages. Both of these can be achieved with tracing data and OpenTelemetry. You can, of course, know who is communicating with whom; that’s the very basics of traces. And you can also get access to the payload. Maybe not by default: you may need to add some custom hooks to your instrumentation, but you can get it done. These are custom hooks that give you the payload of your HTTP call, your Pub/Sub message, or whatever payload you are working with. If you then want to take the payload and infer the schema from it, you can use a number of open-source tools. Specifically, there is Genson, a tool that converts a payload into a schema, and there are all kinds of implementations out there. My company released genson-js as an open-source project, just because we were missing one, and so we released it. If you have both the payload schema and who is communicating with whom, then what can we do?
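To make the schema-inference idea concrete, here is a minimal, self-contained sketch of what a tool like genson-js does with a captured payload. This is a toy re-implementation for illustration only, not the real library’s API; the real library handles merging multiple samples, nullability, and more.

```typescript
// Infer a minimal JSON-Schema-like description of a payload.
// Toy version of what libraries such as genson-js do for real.
type Schema =
  | { type: "null" | "boolean" | "number" | "string" }
  | { type: "array"; items?: Schema }
  | { type: "object"; properties: Record<string, Schema> };

function inferSchema(value: unknown): Schema {
  if (value === null) return { type: "null" };
  if (Array.isArray(value)) {
    // Describe arrays by the schema of their first element (a simplification).
    return value.length > 0
      ? { type: "array", items: inferSchema(value[0]) }
      : { type: "array" };
  }
  switch (typeof value) {
    case "boolean": return { type: "boolean" };
    case "number": return { type: "number" };
    case "string": return { type: "string" };
    case "object": {
      const properties: Record<string, Schema> = {};
      // Sort keys so identical shapes always produce identical schemas.
      for (const key of Object.keys(value as object).sort()) {
        properties[key] = inferSchema((value as Record<string, unknown>)[key]);
      }
      return { type: "object", properties };
    }
    default:
      throw new Error(`unsupported payload value of type ${typeof value}`);
  }
}

// Example: the schema of an HTTP response body captured by an instrumentation hook.
const schema = inferSchema({ id: 7, name: "order", tags: ["new"] });
```

In a real setup you would call something like this from a payload hook in your instrumentation and attach the resulting schema (or a hash of it) as a span attribute.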

13:18

We can do two things. First, I want to know who is consuming my endpoint. And when I say “I”, I’m putting myself in the shoes of the developer who wrote the code and basically broke that feature. So imagine that you have OpenTelemetry shipping data to Elasticsearch, and that’s all you have: Elasticsearch filled with all of your spans and all of your traces. If I go to Elasticsearch and run a query that groups all the spans calling my endpoint in, say, the last ten days, I will basically get the list of all of my consumers in the production environment over the last ten days. Now, we know that production doesn’t behave the way we think it behaves, and here we have a database telling us, with 100% certainty, how it is being used. So instead of trying to figure out from the code, or by asking your peers, who is consuming your endpoint, you can just go ahead and check, and you will get an answer that relies on data rather than on what you or your peers remember. That is what I would do if I were the developer, and it is a very simple thing: just a query.
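As a sketch, the “who consumes my endpoint” question maps to a terms aggregation over your span documents. The field names below (`attributes.http.target`, `resource.service.name`, `@timestamp`) are assumptions; the actual names depend entirely on how your exporter maps OpenTelemetry spans into Elasticsearch, so adjust them to your own index mapping.

```typescript
// Build an Elasticsearch query body that groups the spans calling a given
// endpoint by the service that produced them, i.e. the endpoint's consumers.
// Field names are assumptions; adapt them to your own span mapping.
function buildConsumersQuery(endpoint: string, days: number) {
  return {
    size: 0, // we only want the aggregation, not the individual span hits
    query: {
      bool: {
        filter: [
          { term: { "attributes.http.target": endpoint } },
          { range: { "@timestamp": { gte: `now-${days}d` } } },
        ],
      },
    },
    aggs: {
      consumers: {
        terms: { field: "resource.service.name", size: 100 },
      },
    },
  };
}

// POST this body to your span index's _search endpoint; each aggregation
// bucket in the response is one consumer of the endpoint.
const body = buildConsumersQuery("/api/orders", 10);
```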

One step further, which could be more complex but gives you even more options, because the previous query basically just gives you a list of your consumers: finding the different schemas out there. If I know all the potential schemas hitting my endpoint, it could really help me. I could go to my integration tests, my API tests, or my contract tests and make sure that I cover all the different schemas that I get in the production environment. Or maybe I’m using mocks, and I don’t know whether my mocks are up to date or whether they represent the production environment well enough. Here I can easily get the list of schemas and use them to mock or test my endpoint. How will I get those schemas? I’m assuming that you already have the payload attached to your spans, which means you need to make some alteration to your instrumentation. Now, for each span that you get, take the payload, extract or infer a schema out of it, and hash that schema. That means you can now know how many schemas you have out there. If I run the same or a similar query to the one I spoke about before, grouping by the endpoint that I’m going to change, I’m going to find all the potential schemas. And this is going to help me figure out whether there is an edge case out there that I’m missing. Had the developer done these things, I think there is a good chance they would have found another use case they hadn’t thought of, and then we could think about it and address it: add the right test, protect it in our code, or whatever needs to be done.
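The hashing step can be sketched like this. As a stand-in for a full inference library, this toy version maps each top-level key of the payload to its JSON type; the point is only that identical shapes hash identically, so grouping spans by this hash in your trace store gives you the set of distinct schemas per endpoint.

```typescript
import { createHash } from "node:crypto";

// Stand-in schema inference: map each top-level key to its JSON type.
// A real setup would use a full inference library instead.
function shapeOf(payload: Record<string, unknown>): Record<string, string> {
  const shape: Record<string, string> = {};
  // Sort keys so key order in the payload never changes the hash.
  for (const key of Object.keys(payload).sort()) {
    const v = payload[key];
    shape[key] = v === null ? "null" : Array.isArray(v) ? "array" : typeof v;
  }
  return shape;
}

// Hash the canonical schema so equal shapes get equal hashes; this hash is
// what you would attach to each span as an attribute.
function schemaHash(payload: Record<string, unknown>): string {
  return createHash("sha256")
    .update(JSON.stringify(shapeOf(payload)))
    .digest("hex");
}

// Counting distinct hashes over captured payloads tells you how many
// schemas an endpoint really sees in production.
function countDistinctSchemas(payloads: Record<string, unknown>[]): number {
  return new Set(payloads.map(schemaHash)).size;
}
```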

This was the first use case, or rather the first issue, that I found. Another one, which is also quite interesting: as you can see here, we have a developer running a service, service B, in their local environment. Then we have service A. Service A is not running in their local environment; it’s running in some cloud environment, some shared environment that all the other developers are using. One of the requirements I need to implement is to change some parameter that service B is sending to service A. Again, I see the code of service B. I don’t know the code of service A. I just change the parameter that’s being sent to service A. Service A is communicating with MySQL, or any other database for that matter, which is also shared, and, unfortunately, this parameter change caused three times more queries to run against MySQL. I wasn’t aware that this endpoint is highly used, and I definitely wasn’t aware that the change was going to increase the load on my database. Again, I made a change in service B; I didn’t even think of looking at what service A would do. What would the developer say? They would say: I didn’t know how service A is implemented, I didn’t know its internals. Also, I did not have the visibility to understand the impact of my changes.

As companies and as R&D managers, we invest a lot of time in quality and processes and in making sure developers have everything they need in order to understand, fix, and change things. Both in this example and in the previous one, the schema change, the concerns that the developer raised were valid. They didn’t have enough tools. In this situation, I think it’s extremely hard for the developer to find this issue without more tooling. So how could we prevent it? You can’t exactly go to the DBA every time just to find it before it becomes an issue.

19:46

We could have detected it, right? If I had been looking at those traces, looking at the whole trace as it develops. Imagine you send the API call to service B and you just view the trace: you would notice that something has changed. But you don’t look at traces when you develop. You look at traces when something doesn’t work, since it’s reactive. Being proactive means that while I’m developing, I want to look at those traces, and maybe I will find the issue and it won’t be shipped to production at all. So my suggestion here is: take Jaeger with you anywhere. I wrote Jaeger here, but it could be any visualization from any open-source project or vendor. If you can look at the traces that you produce while working, it will be much easier for you to understand the impact of what you are doing. Not just the impact on the service that you are changing, but the impact on the overall system.
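If you want to try the “take Jaeger with you anywhere” suggestion locally, the all-in-one image is the usual way to get started. This is a minimal sketch following the standard Jaeger documentation; check the port list and environment variables against the version you actually pull.

```shell
# Run Jaeger all-in-one locally: UI on 16686, OTLP ingest on 4317/4318.
docker run --rm -d --name jaeger \
  -e COLLECTOR_OTLP_ENABLED=true \
  -p 16686:16686 \
  -p 4317:4317 \
  -p 4318:4318 \
  jaegertracing/all-in-one:latest
# Point your OpenTelemetry OTLP exporter at http://localhost:4318,
# then open http://localhost:16686 to browse the traces you produce.
```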

I hope I got you intrigued about what else we can do with tracing data in order to prevent the next issue. Those were two use cases that I encountered last week, and I guess there are hundreds or maybe thousands of use cases we could think of. But in order to find those, I think we need to do two main things. The way I look at it, each developer should have some kind of setup, and this setup consists of two things. First, I want you to have access to Jaeger (again, any tracing visualization) as you develop, from your own local station, in your IDE. Second, most developers should have access to the data source containing your traces; in my case it’s Elasticsearch, but maybe you have some other database. For the local Jaeger, I would stream all the development environments into a single Jaeger, and that would help developers visualize their work. So I can see what I’m doing, I can understand the overall impact, and it’s going to make me more aware of my changes. You can work with it as you develop. Another interesting thing: if you are not the developer who made a change, but you are assigned to review a pull request or merge request and you are not entirely sure of all the details, maybe you can also use Jaeger to understand the changes that the developer made. So even when reviewing something, it may help.

That’s about running Jaeger, but you should also definitely have access to your data source. Every time you have a question: What parameter do I get? Who is calling whom? How many times an hour does it happen? What is the latency? How many times is the database being queried? All of these questions can be answered by just going through your Elasticsearch, through your spans and tracing data sources, and asking them. At first you just look at the database and you don’t know what to query, but you start doing it and using it in your day-to-day, and usually, when developers like something, they start to automate it. Then you may find yourself with all kinds of scripts running over your spans, basically creating more quality for the team. I think doing these two things, having a local Jaeger and access to your traces, can really affect how many bugs reach production. Those were my two action items. If you start doing them in your daily work, you may save yourself some production issues. I suggest you give it a shot. If you have any questions, feel free to shoot them in the chat and to reach out. Thank you very much.

24:48

Charity Majors [CTO & Co-founder|Honeycomb]: 

Hello. Welcome back.

Michael Haberman: 

Hey.

Charity Majors: 

Thanks for being here, Michael. That was an awesome talk.

Michael Haberman: 

Thank you. Thank you very much.

Charity Majors:

Yeah, when did you first hear about OTel? What prompted you to get started with it?

Michael Haberman:

Actually, I started out looking for a tracing solution. I didn’t know it was called tracing back then; I was just looking to visualize what was happening. I started with OpenTracing and then OpenTelemetry.

Charity Majors:

It’s interesting. I feel like so many people got started using the open-source stuff and, you know, OTel, it’s — for way too long, there’s just been no standardization, whatsoever, across the industry of this completely fundamental component, and I’m just so grateful to you for helping us extend, you know, the entire industry’s best practices and, you know, when to use it and when not to. I think it may seem basic, but, like, we’re rethinking a lot of basic first principles here. And with any luck, like, after we get this right, future generations won’t have to do all of this crap.

Michael Haberman:

Yeah.

Charity Majors:

Tell me how this has impacted your team’s ability to execute.

Michael Haberman:

Well, I think using OpenTelemetry data and distributed tracing data pre-production can really affect the number of bugs and issues you eventually deliver to production. And you have so many points at which to do that: you can do it as you develop, prior to creating a pull request, or in your CI. Each team and each organization has its own issues, and if they are looking to catch them as soon as possible, that’s the way to do it.

Charity Majors:

Totally! Do you feel like there should be an increased role for designers who are participating in the collaboration, you know, around the design and the execution on, you know, things like OpenTelemetry?

Michael Haberman:

Do you mean designers?

Charity Majors:

Designers, yeah.

Michael Haberman:

Yeah. Well, I don’t know — it’s a good question. I think we still need to nail down developers using it.

Charity Majors:

Well, yeah, absolutely. I just feel like OTel is shinier than most, but a lot of these projects, you know, fail to get the traction that they need because we insist on treating engineers like engineers but not consumers of technology. You don’t come to the OTel project with “I want to work on OTel!”. You come to it in the process of trying to solve a problem.

Michael Haberman:

Yeah.

Charity Majors:

And so many of our tools are written from the Vim way of doing things: memorize all of this, and you will be a superhero. Whereas most of us really need to be able to put ten minutes in and get something usable out, you know — yeah. I don’t know.

Michael Haberman:

Yeah, I think most people find OpenTelemetry out of necessity. They just have a problem.

Charity Majors:

And it’s dense. It’s not trivial to adopt or to become an expert in.

Michael Haberman:

Yeah.

Charity Majors:

It’s very much V1. It’s very alpha. I’m so happy that it has gotten the uptake that it has. And now I think we need to parlay that into being boring. Build boring software.

Michael Haberman:

Yep, yep, I agree.

Charity Majors:

Thank you so much, Michael, for giving your talk. It was really nice to have you.

Michael Haberman:

Thank you. Thank you very much. I really enjoyed it.

