Raw & Real Ep 6
The Tarot of Telemetry
Help Your Future Self

 

Transcript:

Kelly Gallamore [Manager, Demand Gen|Honeycomb]:

Hi, everyone. Welcome to Raw & Real. It’s great to have you here today. Be it your morning, afternoon, evening, or night. A little bit of morning time for me. 

Liz, thank you for joining us today. How are you?

Liz Fong-Jones [Developer Advocate|Honeycomb]:

I’m doing all right. How about yourself, Kelly?

Kelly Gallamore:

I’m okay. I’m a little nervous this morning. I’m finding myself having to take a few deep breaths today. There’s definitely a lot going on, not just at Honeycomb but in the world around us. 

Liz Fong-Jones:

Yeah. Yeah. It is not wonderful to have people dying of a variety of things. It’s not a great thing to sit with, but we have to soldier on. We have to soldier on. 

Kelly Gallamore: 

We do. And I appreciate having this to focus on. Everybody, it’s a minute after the hour. We’ll start the main part of this in just a couple of minutes as we’re letting people sign on. In the meantime, let me do a little bit of housekeeping. I see people in the chat. This is so great. Hi, everybody. Of course, you know we have the chat panel going. Coco says hi to all of you. She’s mad. She doesn’t want to be on camera this morning. I do want to let you all know that we have captions available during this show. You can either hit the button there at the bottom of your screen, or Bethanie is going to paste a link if you would rather follow along in your browser. Hi, everybody. This is fun. Please say hello in the chat. 

Liz is with me today to talk about The Tarot of Telemetry: Help Your Future Self. I want to let you know you can ask questions anytime. Just use the Q and A section in the Zoom webinar. If we don’t cover your answer, we’ll have a Q&A at the end, and we’ll get to as many as we can. If you see a question come up, and you’re excited about it, please upvote it. We’ll do our best to cover as much as we can in this short period of time. 

What else is important? You know closed captions exist. Let’s say the things that don’t need to be said but get them out explicitly anyway. Code of Conduct: we here at Honeycomb assume good intentions, and we’re going to bring that to Raw & Real today. Please, as you talk with each other, no shaming language. Let’s make this a place where we all want to be. So: you can ask questions, captions, Code of Conduct. Liz Fong-Jones. 

Liz Fong-Jones: 

Hello, folks. 

Kelly Gallamore: 

Hi.

(Laughter) 

When Honeycomb built out the SLO feature, there were many considerations regarding which kinds of telemetry to start with that would be most helpful in making the best product, or the best feature, for our customers. Can you please share our story, share your story about this?

Liz Fong-Jones: 

I would certainly love to. I’m going to share my screen so, hopefully, everyone can see that. Kelly, if you could turn off your video, that would be helpful. 

Kelly Gallamore:

Sounds good. 

Liz Fong-Jones: 

Excellent. So we built out the Service Level Objective feature about a year ago, a little over a year ago is when we started. We launched it in January of this year. For those people who don’t know what a Service Level Objective is, a Service Level Objective is a way to measure the performance of your service over a longer period of time so you understand what experience your customers are having. When we built this, we built this initially for ourselves as well as for some of our customers. This is what it looks like inside the Honeycomb product. You’ll notice this is how you spent your budget over the last couple of months. This is what your reliability record has been and here is what’s happened in the past 24 hours. These are some of the key elements that we wanted to make sure people had. But there are challenges associated with collecting this much data over a period of 60 days. How do we query it? How do we make it accessible to you? How do we make it fast enough?
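To make the error budget idea concrete, here is a tiny illustrative calculation, not Honeycomb’s code: with a 99.9% target, the budget is simply the 0.1% of eligible events you are allowed to fail over the window, and the SLO page shows how much of that budget has been spent. All the numbers below are made up for the example.

```go
// Illustrative only: how an error budget falls out of an SLO target and the
// event counts over the SLO window. The numbers are made up for the example.
package main

import "fmt"

func main() {
	target := 0.999       // SLO target: 99.9% of eligible events should succeed
	total := 10_000_000.0 // eligible events seen over the 30-day window
	failed := 4_200.0     // events that failed the SLI

	budget := (1 - target) * total          // failures the SLO tolerates
	remaining := (budget - failed) / budget // fraction of the budget left

	fmt.Printf("error budget: %.0f events, %.1f%% remaining\n", budget, remaining*100)
}
```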

I wanted to show you as well what another SLO may look like. It’s a little more actionable. It says, hey, we had a big outage that burned a bunch of error budget about a week and a half ago. We needed to debug that during the immediate incident. So when we were designing this, we needed to make sure that we understood what the performance characteristics of it were. So we evaluate your Service Level Objective once every minute, and we needed to be able to understand, like, how long is it taking to evaluate Service Level Objective? Is it too fast? Sorry, is it fast enough? Is it too slow? And kind of where are we spending our time in this process?

So one way to approach this is certainly to create trace spans. They’re one of the core units of instrumentation inside of Honeycomb. What we did is we decided we’re going to extend the existing service that processes triggers. Instead of it only processing triggers, we wanted it to process Service Level Objectives as well as the existing triggers. So the trigger workflow already existed. It had this notion of, we’ll evaluate a bunch of different triggers, and then, after all of them are evaluated, we’ll go ahead and send notifications and return results. So we wanted to duplicate that pattern with Service Level Objectives. When we were creating this feature, we realized that we needed to change what is invoked once per minute: instead of just evaluating triggers, we also needed to evaluate the SLOs. So this is a span that says, hey, we want to keep track of how long it takes to evaluate all the SLOs. But that’s not very helpful if one SLO is particularly slow. So for each individual SLO, we needed to create a specific span around evaluating that individual SLO. And we also needed bits of instrumentation around which SLO is it, which dataset is it? And kind of instrumenting all the way down to the database so that we understand, like, is it failing in our application code? Is it slow in our application code? Or is it slow inside of the database?
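A minimal sketch of that span layout, assuming the Go beeline (github.com/honeycombio/beeline-go) that gets mentioned later in the session; the SLO struct and the evaluateOne helper are hypothetical stand-ins, not Honeycomb’s actual code. The once-per-minute pass gets one parent span, and each SLO gets its own child span carrying the IDs:

```go
// A minimal sketch of that span layout, assuming the Go beeline
// (github.com/honeycombio/beeline-go). The SLO struct and the evaluateOne
// helper are hypothetical stand-ins, not Honeycomb's actual code.
package sloexample

import (
	"context"

	beeline "github.com/honeycombio/beeline-go"
)

type SLO struct {
	ID        string
	DatasetID string
	TeamID    string
}

func evaluateAllSLOs(ctx context.Context, slos []SLO) {
	// One parent span covering the whole once-per-minute evaluation pass.
	ctx, parent := beeline.StartSpan(ctx, "evaluate_all_slos")
	defer parent.Send()
	parent.AddField("slo_count", len(slos))

	for _, slo := range slos {
		// One child span per SLO, so a single slow SLO stands out in the trace.
		sloCtx, span := beeline.StartSpan(ctx, "evaluate_slo")
		span.AddField("slo_id", slo.ID)
		span.AddField("dataset_id", slo.DatasetID)
		span.AddField("team_id", slo.TeamID)

		evaluateOne(sloCtx, slo) // hypothetical: the actual evaluation work
		span.Send()
	}
}

func evaluateOne(ctx context.Context, slo SLO) { /* ... */ }
```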

6:16

For instance, when we issue a call out to the database, we want to explicitly wrap that in a database span, as well as having all of these tidbits of metadata. Which team is it? Which dataset? Which SLO ID? You can see here, when I look at this SLO, we can see the additional pieces of metadata that I sprinkled in. You can see how long it took and which host it ran on and other varying properties. If you drill down further, you can see the individual database calls here. This, for instance, is a getcontext database call that’s issuing a select statement. You can see it’s very, very fast. But where we spend the majority of the time happens to be evaluating the SLO. So that corresponds here to the code that actually evaluates an SLO. So let’s look at that. This is where the code that generates the span lives. We’re asked to evaluate an SLO. We create a new span, and we set, again, the attributes such as: Which SLO is it?
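As a sketch of wrapping a single database call in its own span, again assuming the Go beeline; the query, span name, and field names are illustrative and not the exact ones in Honeycomb’s code:

```go
// A sketch of wrapping a single database call in its own span, assuming the
// Go beeline. The query, span name, and field names are illustrative.
package sloexample

import (
	"context"
	"database/sql"

	beeline "github.com/honeycombio/beeline-go"
)

func getSLOName(ctx context.Context, db *sql.DB, sloID string) (string, error) {
	// A dedicated span for the database call, so time spent in our
	// application code is distinguishable from time spent in the database.
	ctx, span := beeline.StartSpan(ctx, "db.get_slo")
	defer span.Send()

	const query = "SELECT name FROM slos WHERE id = ?"
	span.AddField("db.statement", query)
	span.AddField("slo_id", sloID)

	var name string
	err := db.QueryRowContext(ctx, query, sloID).Scan(&name)
	if err != nil {
		span.AddField("error", err.Error())
	}
	return name, err
}
```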

We also want to know if there’s a correlation between the number of days that we’re measuring the SLO over and how long it takes to evaluate. So we wanted to add that as a property. We also wanted to carry through the dataset ID and so forth. And then after that, you can see we go and do a variety of different things, such as fetching the counts from the database. Of course, when we fetch the counts from the database as well as from our query backend, we need to make sure that that, indeed, is doing what we expect. So we need to have a span for SLO counts, and then we need to have spans for things like, hey, how much of the window managed to hit the cache and how much did we have to fetch anew. We can look at this trace here. We can see fetch SLO count. We can see that the overall SLO window is 720 hours, or 30 days. And you can see, of that, the amount of time that we’re able to fill out of the cache should be right here. Let me search here. It will be easier. Yeah, there we go. So valid cache is set to true. Fetch counts from Retriever. How much did we fetch out of the cache? What is this property called here?

This is kind of part of the fun. When you work with Honeycomb instrumentation, or instrumentation in general, there’s a lovely opportunity to go back and forth between your code and the trace and see where the two things line up with each other. For instance, I think that I can see here query hours; if I search for query hours, it should pop up inside of the trace span. Maybe it doesn’t. Oh, I’m looking at the wrong trace. Query hours, yeah, so we can see right here that it’s telling me I had to do a manual query for one-sixth of an hour of data, but that the rest of it was built out of the cache automatically. This is fetching out of the cache, and this is the live request. This cache is really fast. It takes 50 milliseconds to evaluate, and the storing back to the cache takes relatively little time. We spend 1.2 seconds actually asking our query backend for that data. Yes, it does take time, but we want to make sure that we give you accurate results. Because we link all of the traces together, I can go ahead and look and see: how long did this individual request take to execute? Right? Where did we spend the time in the, you know, one second, or in this case, the 300 milliseconds here? How long did it take, and where did we spend the time?
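A rough sketch of that cache-aware fetch: fill most of the SLO window from a cache and only run a live query for the uncached tail. The cache helpers and the field names (valid_cache, query_hours) are assumptions based on the demo, not Honeycomb’s actual identifiers.

```go
// A rough sketch of the cache-aware fetch described above. The cache helpers
// and the field names (valid_cache, query_hours) are assumptions based on
// the demo, not Honeycomb's actual identifiers.
package sloexample

import (
	"context"
	"time"

	beeline "github.com/honeycombio/beeline-go"
)

type Counts struct{ Succeeded, Failed int64 }

func fetchSLOCounts(ctx context.Context, sloID string, window time.Duration) Counts {
	ctx, span := beeline.StartSpan(ctx, "fetch_slo_counts")
	defer span.Send()
	span.AddField("slo_id", sloID)
	span.AddField("window_hours", window.Hours())

	cached, coveredUntil, ok := readCache(ctx, sloID) // hypothetical cache lookup
	span.AddField("valid_cache", ok)
	if !ok {
		coveredUntil = time.Now().Add(-window) // no cache: query the whole window
	}

	// Only the part of the window not covered by the cache needs a live query.
	missing := time.Since(coveredUntil)
	span.AddField("query_hours", missing.Hours())

	fresh := queryRetriever(ctx, sloID, missing) // hypothetical live query
	writeCache(ctx, sloID, cached, fresh)        // hypothetical write-back

	return Counts{
		Succeeded: cached.Succeeded + fresh.Succeeded,
		Failed:    cached.Failed + fresh.Failed,
	}
}

// Stubs so the sketch compiles; real implementations would talk to the cache
// and the query backend.
func readCache(ctx context.Context, sloID string) (Counts, time.Time, bool) {
	return Counts{}, time.Now().Add(-10 * time.Minute), true
}
func queryRetriever(ctx context.Context, sloID string, missing time.Duration) Counts {
	return Counts{}
}
func writeCache(ctx context.Context, sloID string, cached, fresh Counts) {}
```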

So you can see here, for instance, that we actually were just blocking for 188 milliseconds because there were too many other queries executing in parallel. And then the actual request to carry out the search for how many requests succeeded and failed, our SLI, over that 16-minute window, or over that 10-minute window, took only 40 milliseconds to evaluate. And most of that execution, in fact, all of it, happened locally. We’re just spending that time kind of coalescing the results together. So those are some of the considerations we had when we were thinking about implementing the SLO feature. How do we know it’s fast? How do we know it’s executing correctly? When there are slowdowns, what is going on? Right? How can we debug where it’s slow?

Indeed, one of the things that we found with SLOs before we added the cache was that we were querying every single minute for 30 days of data, and that was massively hitting AWS Lambda every single minute. It actually generated a $10,000 AWS Lambda bill, whoops, in addition to being really slow. But we had the correct instrumentation to know it was going to be very slow after we had already made the mistake, and, therefore, we were able to correct it by adding the cache, and we were able to verify the cache was working correctly. 

Kelly Gallamore: 

Liz, can I ask a question real quick? Pardon. I just want to imagine. What I hear you talking about sounds really, really important right now. The idea of: okay, everybody, we wanted to build this feature, we’re sitting in a room, and these are the questions you’re talking about, the query questions you want to ask, making sure you have this. The key element, I think, is the team sitting in a room asking: What questions do we need to answer to make this successful? Are those the key questions for knowing what custom instrumentation you’re going to need?

12:03

Liz Fong-Jones: 

Yes, exactly. Like, you need to know, we can automatically measure things like duration. Right? That’s the easy thing, as far as Honeycomb is concerned. You just have to add that kind of beeline.StartSpan call. But I think the more interesting questions are: What are the custom properties? What are the things that I need to know beyond just what the name of the span is and how long it took? Like, who is it? You kind of have to ask the who, what, when, where, how, why. 

Those are the kinds of things that you have to ask in order to get all of the instrumentation you need. You need to break up your spans, too: not just the top-level RPC, but what steps did I take along the way, so you know what you can attribute the slowness to if one of the steps gets slow. 

Kelly Gallamore:

Thank you for explaining that.

Liz Fong-Jones: 

Yeah. Of course. But I think kind of getting to Kelly’s other point, all this stuff is useless unless people are actually using it. One of the things we really love to do is understand who is creating SLOs. Are they having trouble creating the SLOs? Those are all super interesting things to explore. I’m going to go over here and show you a little bit about the user events that we created for the SLO feature. If this GitHub search works correctly. Hopefully, this does what I expect it to. There we go. You can see that we’ve created not just server-side instrumentation, but we’ve also created user events around SLO creation. So user events are a thing that we developed inside of Honeycomb that look at what people are doing inside of the browser. For instance, I can go and look to understand: what is the actual number of customers who are performing the operation to look at their Service Level Objectives?

So let’s have a look at that. We want to run a count where “name” is. Let’s see. So SLO heatmap. For instance, I can find out how many people changed the… let’s look at 14 days of data. This will work. Let’s try “name starts with” by name. Oh, I see the problem. It’s warning me that the name of the field is not actually “name.”  The name of the field, instead, is action. Let’s just have a look at… I want to have a look at a little bit of data, and we can figure out what the field is actually called. This is a dataset that my colleague Danyel works with more than I do. I’m less familiar with it. But it’s always great to explore. What is this thing called? This thing is called… that’s not what I want. 

Kelly Gallamore:

You can just sort through the raw data right there. That’s so accessible.

Liz Fong-Jones:

Yeah. If I’m not sure what something does, I can just look right at it. Let’s look at one sample event. It’s called type. SLO grouped by type. I knew that was going to bite me at some point. There we go. So now we can see that there were 189 people over the last two weeks who clicked the button to edit or create a Service Level Objective. And we can see that 28 people clicked the button to create or edit a burn alert, which lets you know when you’re going to run out of error budget. I can go through and look specifically at the examples where someone cancels editing an SLO, because I want to look at those specific sessions. Let’s look at one of those sessions, and we can try to understand what happened. The person did a BubbleUp over some data. They clicked the button to edit the SLO. And then they clicked something in the navbar, and then they canceled their request to edit the SLO.

So this is cool, because it lets me understand what people are actually doing with their data, and are people actually using the features we developed in Honeycomb? And all of it is handled through this library we have, where we want to measure the actions that people take in our UI as well as measuring the backend. So those are the two sides of the coin in how we think about instrumentation. Where are we spending the time on the backend? Where is the latency coming from? But also, are people actually benefiting from the feature that we are developing? So that’s really the demo I had in mind to discuss today, to show you the thought process when we develop a feature for Honeycomb.
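The user-events library Liz describes runs in the browser and is internal to Honeycomb, but each event is just a structured record. As a rough server-side equivalent, here is how a UI action could be recorded with libhoney for Go; the dataset and the field names (action, type, team_id, user_id) are assumptions that mirror the demo, not Honeycomb’s real schema.

```go
// A rough sketch of recording a UI action as a structured event using
// libhoney for Go. Dataset and field names are assumptions mirroring the demo.
package main

import (
	libhoney "github.com/honeycombio/libhoney-go"
)

func main() {
	libhoney.Init(libhoney.Config{
		WriteKey: "YOUR_API_KEY", // placeholder
		Dataset:  "user-events",  // placeholder dataset name
	})
	defer libhoney.Close()

	// One event per UI action, with enough context to group and filter on later.
	ev := libhoney.NewEvent()
	ev.AddField("action", "click")
	ev.AddField("type", "slo_edit") // e.g. slo_edit, slo_create, burn_alert_edit
	ev.AddField("team_id", "team-123")
	ev.AddField("user_id", "user-456")
	ev.Send()
}
```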

17:40

Kelly Gallamore: 

Liz, that’s fantastic. I really appreciate it. I’m going to see if I can come back on here for a second. Video settings. I don’t know how to bring myself back up, but that’s just fine. 

(Laughter) 

I’m not going to worry about that right now. I think one reason this really stands out to me is it makes me think about how hard it is sometimes for people to just start something new. If I’m in the middle of something I do all the time, I forget how easy it can be to just kick off with: what are our goals with this? I love how deciding what telemetry is going to help down the road can bring the team together around what’s most important. Now that we know what’s most important, let’s just kick it off and see where it goes. And then having the goal of understanding how your customers, how your users, are actually using the SLO features means we can try to get ahead of what’s working and what’s not working.

Liz Fong-Jones: 

And where people are confused, yeah. I think the other really awesome thing about this is that it’s the practice Charity Majors and I call observability-driven development. As you’re writing the code, you’re thinking about how you’re going to measure it rather than bolting it on after the fact. 

Kelly Gallamore: 

That was going to be my next question. How does this tie into observability?

Liz Fong-Jones: 

Yeah. I think we want to make sure we can debug things after they’re running in production. We want to understand the how and the why. That requires us to bake in observability from the very beginning in order to make sure we have a shot at being able to answer those questions that we have as software engineers and as designers and as product managers. 

Kelly Gallamore: 

Fantastic. That’s great. I have no specific questions after that. Does anybody attending have any questions for us today? I don’t see anything. I haven’t seen anything come in yet, so I want to encourage you not to be shy. Learning about how to read traces has been very important for me to understand how my teammates can actually see each individual detail about which part takes this amount of time, not just to improve our product for Honeycomb but also to… well, actually, that’s exactly why. It’s what I want for all of our users. We have a question here. How do you determine what… I’m sorry. I don’t understand this question. How do you determine what you need observability into? Oh, I think I understand. How do you decide where you need observability? Forgive me if I didn’t get that right. 

Liz Fong-Jones: 

My interpretation is the question is: how do you figure out what kinds of questions you’re going to ask? And the answer is, you generally don’t know. But there are definitely dimensions that you might know are important. Like, I might want to know if the SLO feature is broken or slow for one specific user. Right? So I’m definitely going to add that team ID. I’m definitely going to add the user ID of whoever created it. I’m definitely going to add that dataset ID. In terms of what services you need observability into, I think the answer is you need observability into everything. You need observability into each end-user-facing service that you have. As you encounter things that you can no longer explain with your existing telemetry, then you have to add new telemetry in.

So if you discover that you cannot explain why the latency is slow because it’s happening somewhere you don’t have visibility into, then you might want to add the database spans and get observability into the database. But, definitely, the goal is to start as close to the user as you can, for instance from your load balancer logs, and then drill all the way down. That’s the priority ordering: focus on the things closest to the customer first. That’s how you get the highest impact. 
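A minimal sketch of “start as close to the user as you can”: wrap the top-level HTTP handler so every request gets a root span automatically, using the Go beeline’s net/http wrapper. The service name, dataset, route, and field values here are placeholders, not Honeycomb’s configuration.

```go
// A minimal sketch of instrumenting closest to the user: wrap the top-level
// HTTP handler so every request gets a root span. Names are placeholders.
package main

import (
	"net/http"

	beeline "github.com/honeycombio/beeline-go"
	"github.com/honeycombio/beeline-go/wrappers/hnynethttp"
)

func main() {
	beeline.Init(beeline.Config{
		WriteKey:    "YOUR_API_KEY", // placeholder
		Dataset:     "my-service",   // placeholder
		ServiceName: "my-service",   // placeholder
	})
	defer beeline.Close()

	mux := http.NewServeMux()
	mux.HandleFunc("/slos", func(w http.ResponseWriter, r *http.Request) {
		// Method, path, status code, and duration come from the wrapper;
		// application-specific fields still have to be added by hand.
		beeline.AddField(r.Context(), "team_id", "team-123")
		w.Write([]byte("ok"))
	})

	// Every request through the wrapped handler becomes a trace to drill into.
	http.ListenAndServe(":8080", hnynethttp.WrapHandler(mux))
}
```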

Kelly Gallamore: 

I want to dig into that for a second because when you say observability into everything, I can imagine if you’re not practicing observability in a way that helps you yet, honestly, that’s kind of daunting. What’s the key? When you talk to customers or prospects, how do you get people started? What’s the one, like…    

Liz Fong-Jones: 

Yeah. I think that the reason that I love the idea of SLOs as a top-level feature, is that it focuses you on what is the customer impact? Can we measure the customer experience? And can we understand when there are degradations in the customers’ experiences? Everything else flows downstream from that.

Kelly Gallamore: 

Okay.    

(Overlapping speakers)

Liz Fong-Jones: 

So, you know, I definitely kept that in mind when we were developing the SLO feature, but, also, the SLO feature itself seems to aid people in observability because it helps them prioritize where to measure.

22:26

Kelly Gallamore: 

Okay. So I just really want to repeat that back for the folks listening in. We talk about it in terms of SLOs here, but whatever you’re building, wherever it’s going to help your customers the most, that’s where you want to start. So just get started. Okay. Also, let’s see. Do you have a sense of how creating a span or adding a tag will add to the latency you’re measuring? 

Liz Fong-Jones: 

So the answer is it takes, in general, like, microseconds of latency to just record the start and end of a span. And the way that we make it efficient is that you batch up the spans that you’re going to send. That’s done in a completely separate thread rather than blocking inline. So the tax is generally negligible. It’s, you know, going to be less than 1% of performance impact unless you’re in a situation where you’re limited to a single core and you don’t have another core available assigned to your microservice. So there are ways to mitigate it, but, definitely, that’s kind of the performance impact. In the worst case, you can sample: you can turn off tracing unless, you know, the dice roll comes up 100 on a d100. Right? You can kind of make sure that you are getting the signals that you need while minimizing the telemetry impact, in terms of the cost of the telemetry.
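A sketch of that “dice roll” as head sampling configured through the Go beeline: a SampleRate of 100 keeps roughly 1 in 100 traces, and events are batched and sent off the hot path, so the per-request cost is mostly the field bookkeeping. The write key and dataset are placeholders.

```go
// A sketch of head sampling via the Go beeline. SampleRate 100 keeps roughly
// 1 in 100 traces; key and dataset are placeholders.
package main

import (
	beeline "github.com/honeycombio/beeline-go"
)

func main() {
	beeline.Init(beeline.Config{
		WriteKey:   "YOUR_API_KEY", // placeholder
		Dataset:    "my-service",   // placeholder
		SampleRate: 100,            // keep ~1 in 100 traces
	})
	defer beeline.Close()

	// ... instrumented application code runs here ...
}
```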

Kelly Gallamore: 

Okay. Thank you. This question: What makes a good attribute name? Are there standard names that everyone uses, or is it more ad hoc?

Liz Fong-Jones: 

I think, as you saw me struggling earlier with the user events dataset, the field called “type” was definitely confusing to me. Right? So I think that the more you can standardize the way that you have your fields set up, the easier it’s going to be to understand.

For instance, we definitely prefix everything that’s generated from application instrumentation with app. We definitely have prefixes, like global, to define process-wide attributes. And then there are definitely things where, like, if it’s an HTTP service, you know, definitely saying, hey, by the way, here’s the HTTP response code. Here’s the X. Here’s the Y. Here’s the Zed. Right? Like, there are taxonomies. I believe there’s a thing called the Elastic Common Schema, which is gaining more and more usage, but it is a little bit heavyweight. So there are trade-offs involved in standardizing on a naming schema, but one of the things that OpenTelemetry, which is one of the various instrumentation standards that we use here at Honeycomb, does is try to standardize some of those field names.
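For illustration only, one consistent way to name fields on a span: the Go beeline’s AddField helper prefixes custom fields with “app.” for you, while the request.* and global.* prefixes below are just one possible convention, not a Honeycomb-mandated schema.

```go
// For illustration only: one consistent field-naming scheme. beeline.AddField
// prefixes custom fields with "app."; request.* and global.* are just one
// possible convention, not a mandated schema.
package naming

import (
	"context"

	beeline "github.com/honeycombio/beeline-go"
)

func annotate(ctx context.Context, statusCode int, teamID, hostname string) {
	ctx, span := beeline.StartSpan(ctx, "annotate_example")
	defer span.Send()

	// Application-level fields: these arrive as app.slo_id and app.team_id.
	beeline.AddField(ctx, "slo_id", "slo-42")
	beeline.AddField(ctx, "team_id", teamID)

	// Explicitly prefixed fields added straight onto the span: a convention
	// for request-scoped and process-wide metadata.
	span.AddField("request.http_status", statusCode)
	span.AddField("global.hostname", hostname)
}
```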

Kelly Gallamore: 

Thank you. Can you elaborate on the idea to look more into the routing logs when it comes to starting as close to the customer first? What important information do NGINX logs capture? Do you have any…  

Liz Fong-Jones:

 Kelly, you just got very, very soft, I think. Your microphone is a little weird. 

Kelly Gallamore: 

Oh, hey! Can you read this question in the chat?

Liz Fong-Jones: 

Yeah, I’ll read the question. So let’s see. Looking into the routing logs. So, on that question: the NGINX logs, or your service mesh logs, capture things like the request URL. They capture which service it is, potentially even metadata like which version. What they do not capture are fields like customer ID, or things that are sent in the body of the POST request rather than exposed in the metadata of the URL. So I think that you kind of need a mixture of both the top-level telemetry that you get from your request logs, as well as having those application-specific attributes that you add. So, you know, things like latency, automatically captured by NGINX. Things like response code, automatically captured. But you kind of don’t get those application-specific metrics and fields. So that’s where you have to combine the two.

Kelly Gallamore:

Gotcha. Can you hear me now, Liz?

Liz Fong-Jones: 

No, I cannot. You’re still absurdly soft. Let me see if I can fix that on my end. Nope, the problem is not on my end.

Kelly Gallamore: 

I’m going to turn these off. If we get a bunch of echoes, I’m really sorry, everybody. 

I just want to point you to… let’s do this.

26:40

Liz Fong-Jones:

I’ll keep reading questions off the Q&A while Kelly debugs that issue. Debugging in production, you gotta do it. Any rules of thumb for what kinds of data are worth propagating across downstream spans? Things like the total duration, or a roll-up of the downstream spans, are helpful to have. For instance, we have data like, for a single process, how much time was spent working on the downstream database? Rolling up the data and adding up the data can be really helpful, and a sophisticated query engine should be able to do it without you having to worry about it at the instrumentation layer. As for propagating data downstream, the idea of baggage has been around for a while. I think you have to be selective with propagating data to spans because it adds additional protocol overhead. If you’re sending 100 bytes with your correlation baggage, that can be very, very frustrating. That’s how we think about it. It’s a trade-off, but it’s definitely worth it for things where you need that data available to query on the child spans. In the long term, a sufficiently sophisticated query engine can do the cross-span correlation. That’s something we’re thinking about and working towards within the Honeycomb product.

I do also see a question in the chat. One of the questions in the chat says: Am I familiar with New Relic transactions versus Honeycomb tracing? I think New Relic’s synthetics and New Relic’s transactions are helpful in some ways, but the ability to issue queries… let me show you. To issue queries like, hey, I want to understand, you know, how many requests look like this? What is their latency? Even asking questions about, hey, show me what’s interesting about these particular things. To jump from trace to trace, or to jump from trace to metrics, or to analyze where our anomalies are coming from, what’s particularly slow, or what’s particularly fast, those are the kinds of things that a product like New Relic cannot do. That’s not to say they can’t do it in the future, but the deep integration, as opposed to treating it as, oh, this is our synthetics product, this is our APM product. I don’t like that approach. I think it’s a better approach to focus on doing everything in one consistent UI and being able to explain both what the behavior was for the client as well as what happened all the way down to the database layer, and to be able to go all the way up and down.

Kelly Gallamore:

Well, thank you for explaining that. Can you hear me now?

Liz Fong-Jones:

Yes. Loud and clear. 

Kelly Gallamore: 

Always have a backup microphone. Everyone, I think we got through the questions in the Q&A. I saw one more. We’re right at 10:30. Liz, do you have a couple more minutes?

Liz Fong-Jones: 

I do have a couple more minutes.

Kelly Gallamore: 

I have this set for half an hour, and I have no idea when this webinar is going to cut off. Let’s go for it. How do we add instrumentation after the fact? Adding child spans that didn’t previously exist under a parent span?

Liz Fong-Jones: 

I think that’s great. As you look at a trace, for instance, if I’m looking at this specific trace here, right, and I see that this took 460 milliseconds, let’s suppose these child spans didn’t exist here. If I just saw 458 milliseconds and I didn’t know why, then I would potentially want to add additional instrumentation. You can’t rewrite the past, right, but you can at least add new instrumentation so you can catch it if it’s slow again. Adding spans as you need them is helpful, rather than over-instrumenting and paying through the nose. 

Kelly Gallamore: 

Fantastic. I’m going to stop it there because I think you need to get ready for a presentation in the near future. Everyone, thank you for interacting with us today. It’s so much fun to see you here. If you have more questions, you know where to find us: the team at honeycomb.io. Hope you’re already following us on Twitter, but if not, we’re @honeycombio. What I do want to let you know is you will get a link for the on-demand recording of this. I will send a few more pieces of content with more information about telemetry and SLOs, but I have a favor to ask all of you. We’re going to send out a survey. At the end of this webinar, you might get the link. Could you please fill out the survey? We want to know if this show works for you, what went well, what didn’t go well, and what other topics you think are interesting. So if you have five minutes to fill that out, that would be great. Liz Fong-Jones, thank you. It was nice talking to you today. 

Liz Fong-Jones: 

Cheers.

Kelly Gallamore: 

We’ll see you all in the future. Bye.

If you see any typos in this text or have any questions, reach out to marketing@honeycomb.io.