Kelly Gallamore [Manager, Demand Gen|Honeycomb]:
Hello, everyone. Welcome to today’s episode of Raw and Real. It’s great to see you this beautiful day. It’s about 10:00 am on Wednesday on the west coast. We’re going to wait two minutes and give everyone a moment to sign in. We will start promptly at 10:02 am.
You’re here at Raw and Real. It’s our tiny short sweet product demo. We’re gonna show you how Honeycomb uses Honeycomb.
For those of you who are interested in captions, we do have live captions from Stacey, and I appreciate her joining us. There’s a link for the StreamText player. You can find the captions at the Honeycombio link on StreamText. If you’re just joining us, we’re going to start in a couple of minutes.
Hi, everyone. Welcome to “Raw and Real.” Our short and sweet series on how Honeycomb uses Honeycomb. Today’s episode, we’re talking about BubbleUp, our feature that speeds up any new questions. You can ask a new question, get results right away, and then ask another question again. I’m Kelly, and I’m here with Michael today. Michael, how are you doing this morning?
Michael Wilde [Dir of Sales Engineering|Honeycomb]:
I’m doing awesome this morning. It’s a wonderful, beautiful, glorious day here in southern California. I just moved into a new office that has a window, so I can see trees.
Kelly Gallamore [Manager, Demand Gen|Honeycomb]:
I know that can be a game-changer.
We’ll get right to it today. In our first episode, Remote But Not Alone, we were talking to folks who were on call. Many folks in our community are remote these days, working from home. And one thing that working in Honeycomb does is allow people to work together. You’re sitting there. You have an issue. You’re trying to work through it. Being able to see other people’s ways of thinking, click on the permalinks, and explore boards that focus on the data set specifically allows for a level of context that really brings the whole team together.
What I also know is that when we ship code at Honeycomb, we watch it right away. It goes into production and we are observing behavior. It’s possible not only to see an issue because you’re on call when there’s an incident but when you’re observing code as you deploy, you can find out if you have latency or an issue right away. When something like that happens, Michael, how do you ask the next question?
Michael Wilde [Dir of Sales Engineering|Honeycomb]:
Yeah, I’ll show you some of that. One of the points that you brought up about, you know, observing services and production, it’s a unique thing that we have been trying to promote for a long time here at Honeycomb, because the engineer that writes the code to perform a specific function within the app or service they build, other than maybe a product manager, they’re the only ones who really understand what that’s supposed to do, how it’s supposed to behave.
Just like if you planted some seeds in your garden, you’re the only one who is supposed to have an expectation of whether flowers are going to come up or not. So the idea is, instead of waiting for an alert to happen, to go in on a periodic basis and see: did the thing you built work the way you expected? Just like asking, are the seeds starting to sprout?
Let me take you through a powerful feature of Honeycomb called BubbleUp. I haven’t seen it exist elsewhere other than maybe in some business analytic tools. It does help us ask the next question really well. So let me show you some stuff. I’ll share my screen.
And let’s get going. It’s always good to get into the meat of this. The first thing I’m going to show you is something you can actually touch and play with. We’re going to talk about Honeycomb and how we use Honeycomb real quick. But if you head over to Honeycomb.io/play, or you can probably find it on Google, there are scenarios based on data that we have used or created. But there’s one called Play with tracing and BubbleUp. The goal of Play is to go through a whole scenario, but I’m not going to go through the whole scenario right now.
I will give you an understanding of what BubbleUp’s job is. So if you look at the right-hand side of the screen, there are not very many fields. This is a demo application. You can probably guess which ones are important, which ones pertain to a database, and which ones pertain to an endpoint. But what most people will typically do is they will query their system, and they’ll start to ask, well, are things behaving normally?
I don’t know. Maybe they are. If you look at this chart every day, maybe you think that that’s normal. And you might say, well, why is it rising? Obviously, this is a demo here, so we can probably guess it’s some event creation. But ultimately, the questions are, well, isn’t everything okay? Is it working right?
Well, one of the things that we have found as a really effective visualization is a heat map. It’s a set of histograms turned on their side. It shows you for this time bucket, this is a range at which values were shorter or longer.
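To make the "histograms turned on their side" idea concrete, here is a minimal sketch of how heat map cells could be counted. It is only an illustration: the function name `heatmap_counts`, the bucket sizes, and the sample events are assumptions, not how Honeycomb builds its charts.

```python
from collections import Counter

def heatmap_counts(events, time_bucket_s=60, value_bucket_ms=100):
    """Count events per (time bucket, duration bucket) cell.

    Each column of the heat map is one time bucket; reading a column
    from bottom to top gives a histogram of durations for that bucket.
    """
    cells = Counter()
    for timestamp_s, duration_ms in events:
        t = int(timestamp_s // time_bucket_s)    # which column
        v = int(duration_ms // value_bucket_ms)  # which row
        cells[(t, v)] += 1
    return cells

# Hypothetical events: (timestamp in seconds, duration in milliseconds).
events = [(5, 20), (30, 80), (45, 950), (70, 40), (95, 60), (110, 1020)]
cells = heatmap_counts(events)

for (t, v), n in sorted(cells.items()):
    print(f"minute {t}, {v * 100}-{(v + 1) * 100} ms: {n} event(s)")
```

Most events land in the lowest duration row, while the two slow requests appear as isolated cells higher up, which is the kind of outlier a plain count chart would hide.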
Kelly Gallamore [Manager, Demand Gen|Honeycomb]:
I see Play on the screen. Is that where you still want to be, Michael?
Michael Wilde [Dir of Sales Engineering|Honeycomb]:
That’s weird. It popped open another window. So apologies. And I’m going to go back here and stop sharing. And that’s awesome. It’s raw, and it’s real, and that’s cool. That’s fine. I was going to share my whole screen, and then we’ll be done with that. Anyways, so back to the lecture at hand, if you will.
We go over here, and as I was saying before, this is one of the cool things about Honeycomb, I can go back to the history on the side, and before I was talking about the actual fields that were in this data set. We’re going to use all the fields in a data set or columns or whatever you want to call them to then do other things like do further analysis.
If I were to use that heat map, that same heat map, it will show me how data is behaving. And clearly we see some weird things here. We see just a count before, and that kind of looks normal. When we look at a heat map of duration, we see a spike. Well, maybe the reason why you jumped into a monitoring product or the reason why you looked at an observability tool like Honeycomb is that you heard something was going wrong, or an alert happened. So now you want to know what set of questions you should answer. It looks like there’s an issue here. I have a set of fields that I could do group bys or filters on. Which ones should I use? Which ones are important? Again, this is easy to figure out here. And that’s why we made Play pretty easy. But the cool thing about BubbleUp, I use this analogy really often. I feel like it’s kind of like an MRI or a scanner for data. Almost like an x-ray machine for your knee.
Before you end up doing any further digging as a doctor, you have to figure out a whole bunch of questions to ask. So BubbleUp allows you to draw a box around something that you will find interesting. You might be able to have things like Machine Learning and AI and all this other stuff, but really, only the human can put the true meaning behind the thing that they’re interested in. What happens here, you draw a box around a place of interest on a heat map, and Honeycomb is going to do a statistical comparison of how things are behaving inside this gold box, which we call the selection, versus everything else in the baseline or everything outside the box.
And it finds fields that are different. Then it orders them by how different they are. So in this case, we see one field here, the user ID field, where a single value shows up in 100% of the selected events. There are two different names of services, API tickets for export, and fetch tickets for export. What will you do with this? We will show you in a moment, but any time you click on any of these mini charts, there’s a menu that pops up. Now I begin to ask the second, third, hundredth question because I’m presented with things of interest. I’m using a really powerful tool to help me along. Because when you get into real production, like now, this is Honeycomb’s dogfood environment. This is what we’ve been showing with Raw and Real. Check out previous episodes of Raw and Real, because they show different aspects of how people work together.
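The selection-versus-baseline comparison can be sketched roughly as follows. This is a simplified illustration, not Honeycomb's actual algorithm: the scoring rule (largest gap in a value's share between the two groups), the function names, and the sample field names and values are all assumptions.

```python
from collections import Counter

def value_shares(events, field):
    """Fraction of events in which each value of `field` appears."""
    counts = Counter(e[field] for e in events if field in e)
    total = len(events)
    return {value: n / total for value, n in counts.items()} if total else {}

def rank_fields(selection, baseline):
    """Order fields by how differently their values are distributed."""
    fields = {key for event in selection + baseline for key in event}
    scored = []
    for field in fields:
        sel = value_shares(selection, field)
        base = value_shares(baseline, field)
        # Score a field by the biggest gap in share for any single value.
        diff = max(abs(sel.get(v, 0.0) - base.get(v, 0.0))
                   for v in set(sel) | set(base))
        scored.append((field, diff))
    # Most different first; ties broken by field name for stable output.
    return sorted(scored, key=lambda item: (-item[1], item[0]))

# Hypothetical events inside the box (selection) and outside it (baseline).
selection = [
    {"user_id": "u42", "name": "fetch_tickets_for_export"},
    {"user_id": "u42", "name": "api_tickets_export"},
]
baseline = [
    {"user_id": "u1", "name": "fetch_tickets_for_export"},
    {"user_id": "u2", "name": "list_tickets"},
    {"user_id": "u3", "name": "list_tickets"},
    {"user_id": "u4", "name": "list_tickets"},
]

ranked = rank_fields(selection, baseline)
for field, diff in ranked:
    print(f"{field}: {diff:.2f}")
```

In this toy data, one user accounts for every selected event and none of the baseline, so `user_id` ranks first, which mirrors the user ID field surfacing at the top in the demo.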
We have a number of services here that we built that make Honeycomb work. The telemetry of how Honeycomb works, the Honeycomb that our customers use and that we were showing in Play, is sent to what’s called dogfood. We have talked about shepherd in the past, which is our ingestion pipeline. Poodle is our front end. We just look at the status of this, and we can see, okay, there are some more requests that were happening at 9:30.
Latency looks pretty good. We’re pretty fast, and we try to make great software. And there was a deploy that was happening. But I’m just going to go ahead and click on “Run Query.” I’m not going to click on BubbleUp right away because I’m going to show you why BubbleUp is awesome. You might have thought BubbleUp was cool before, but we only have 15 fields, and to me, they’re kind of obvious.
Well, we see a spike over here, a higher count of events than previously. Does that mean something good or bad? I don’t know. I can’t apply meaning to something I don’t understand yet. But on the right-hand side, you see a lot of fields. Now, if you were using, let’s say, a log analytics tool or log search tool, let’s say you have a Splunk or Elastic, they’re all pretty good at displaying a count of events. You send structured data, and you can see a lot of fields. But how are you supposed to know what fields are important?
Here at Honeycomb, we recommend our engineers always create a field for every trace and feature flag that is created. It allows us to ask a lot of questions later. But again, I’ve got a lot of fields. This is what someone would normally do. Someone would normally say, well, I heard there’s a field called app.team.ID, and if I do a “group by”, I would see a count grouped by the team. And I could see a particular team that is maybe using Honeycomb more than others at that particular time. Okay. Team 1226. Does that mean things are bad? Does that mean that things are good? Still don’t know.
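As a rough illustration of what a count grouped by team computes, here is a small sketch over made-up events. Only the field name `app.team.ID` and team 1226 come from the demo; the other values are hypothetical.

```python
from collections import Counter

# Hypothetical events; each carries the team that made the request.
events = [
    {"app.team.ID": 1226},
    {"app.team.ID": 1226},
    {"app.team.ID": 77},
    {"app.team.ID": 1226},
    {"app.team.ID": 903},
]

# COUNT grouped by app.team.ID.
counts_by_team = Counter(event["app.team.ID"] for event in events)
busiest_team, busiest_count = counts_by_team.most_common(1)[0]
print(busiest_team, busiest_count)
```

The result says which team sent the most events, but, as noted above, not whether that is good or bad.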
But in the world of observability, we should ask a lot of questions. So now if we were to take and make our heat map, we’ll do the same heat map on duration as we did in the demo. Duration is in milliseconds. We will get rid of the count. We will get rid of the group by. We will just look at the general duration over the last two hours. Not too bad. Let’s look over a longer time range. Maybe look over the last 8 hours. All right. There’s some latency, 60,000 milliseconds. And this is the user interface. What questions should we ask now? Here is where BubbleUp starts to get fun because I can take my core sample and see how things look. BubbleUp is going to tell me fields that are different between baseline and selection.
If you walked in here and didn’t know what an important field was, when we BubbleUp, we can easily see there’s maybe one team associated with this. We might see, you know, particular APIs. We have two traces that are showing up. We’ve got a particular user that’s showing up. And 125 other fields that we might want to drill in and look at here.
So I might say let’s look at a particular request path. Or, you know, a URL. Or, you know, even maybe something in the request header or maybe the build ID. Okay? So knowing that these things show up in this test of BubbleUp versus another one, it might make me say, oh wow, these four over here that are a little longer, there’s a handful of teams that might be experiencing some sort of slowness.
Now just because it shows up as latency doesn’t mean it’s slow. Slow is kind of a perspective. And it also depends on how it’s being measured. I might then want to drag around this area. And then do a group by on something. I could say, you know, group by on error. I might even be doing a group by on handler.route. That is an example of something I could do. And another thing that’s really popular is to actually take a look at a trace. So if I drilled all the way in, maybe it’s at that point that it became something really important for me to look at. And now I see, wow, it was happening on this build and this availability zone, and maybe this instance type. And it might cause me to then say, hmm, let’s do a group by.
So BubbleUp has led me down this path to see important things that maybe I should look at. And maybe I might, you know, go a little bit further. And obviously we have two different builds that are showing up here. Does that correlate to some latency? Maybe it does; maybe it doesn’t. But if you think about it, I’ve got a huge number of fields here and I can do a lot of investigation on how things are behaving. From the name of the service that’s being used to how long it’s taking to any number of other questions.
I’m going to show you something about SLOs and why BubbleUp is awesome. I wonder if we happen to have any questions that are showing up, or do you have thoughts, Kelly?
Kelly Gallamore [Manager, Demand Gen|Honeycomb]:
I do have a few questions, but I want to let everyone know that is listening at home if you have questions, type in the question box at the bottom of the screen and we will address them. We love hearing from you about what’s important about this.
I see a lot of value in BubbleUp. It seems like what you can do is set this comparison so you can go ah, here’s something interesting. Here’s the baseline behavior that I expect. What does this tell me? Why is this so hard to find with other solutions?
Michael Wilde [Dir of Sales Engineering|Honeycomb]:
That’s a great question. Most other solutions, regardless of how you’re asking questions, if you’re looking at a dashboard, dashboards have a set of questions that have already been asked. If you’re using a log tool or a metrics tool, they all have tons of data, tons of fields about what your systems are doing. But every person that comes to look at, you know, a data set, they come with a different set of understanding. And if you have little to no understanding of that dataset, that index, that whatever it is, that database, you have got to start somewhere. And most people will start by looking at an error, or trying to find something that’s slow. Because really, we’re in the business of dealing with things that are slow or things that don’t work in the course of trying to make things great.
Well, a senior engineer who has been building the system for five years could definitely know what things to query on, things to group by, whether using a log tool or whatever. But how it is that a brand new person or someone that might have just got put on the project would even know which fields are important is a mystery to me, and that’s why, you know, in my experience, I think BubbleUp is an awesome tool. I have used log tools before, and boy, every single one could benefit from BubbleUp.
So there’s a cool thing that I want to show you as well. So we looked at BubbleUp from the context of asking the second question, and that’s pretty powerful, the second, third, fourth. What I generally like to see is people doing a BubbleUp on something, drilling in, making a new query, and then BubbleUping again on that particular thing. There’s another place where we see BubbleUp as well.
And that is in SLOs. In the past, we have shown some stuff around SLOs, and in other Honeycomb “Raw and Real,” we have shown activity feeds so that people can learn how to use other queries and have this big, shared brain. BubbleUp also appears in Honeycomb SLO. For those who are unfamiliar, an SLO is just an agreement of how you as a team are going to try to make your service run through a certain amount of success. In our case, this is the ingestion API. It has to do its job very well. We’ve said over four 9s of eligible events, which just means of all the events that come in, these are the ones that would be considered a success or failure. A health check would be an example of something that is ineligible.
But Honeycomb calculates the error budget, you know, for the 99.99%. Well, it looks like on the 13th, we had an issue. We had an incident that happened. We didn’t dip below our 99.99%, but half of our error budget was burned. On the SLO page, Honeycomb automatically creates the BubbleUp for the time range that you’re viewing. In this case, I used the back button to get all the way over to this gray box, because I wanted to investigate that time range. The SLO becomes a great learning tool, a great investigative tool to see what happened in that time range when we weren’t doing so great. But the BubbleUp has been applied to events that have succeeded or failed the SLI. We see this big brown area, similar to what we might see if we drew a box around it. In this area, we had a whole bunch of things happen on the ingestion API. We are meeting our goals, so that’s good. But we can see very quickly kind of where most of the errors came from.
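The error budget arithmetic can be sketched like this. The event counts are invented for illustration, and the function is a simplified assumption rather than Honeycomb's SLO implementation; only the four-nines target comes from the discussion.

```python
import math

def error_budget_report(eligible_events, failed_events, target=0.9999):
    """Summarize SLO compliance and how much error budget is used."""
    allowed_failures = (1 - target) * eligible_events  # the error budget
    compliance = 1 - failed_events / eligible_events
    budget_burned = failed_events / allowed_failures
    return {
        "allowed_failures": allowed_failures,
        "compliance": compliance,
        "budget_burned": budget_burned,
        "meeting_target": compliance >= target,
    }

# Hypothetical window: 10 million eligible events, 500 of them failed.
report = error_budget_report(eligible_events=10_000_000, failed_events=500)
print(f"compliance: {report['compliance']:.5%}")
print(f"budget burned: {report['budget_burned']:.0%}")
print(f"still meeting target: {report['meeting_target']}")
```

With these made-up numbers, about half of the budget is gone while compliance stays above the target, the same shape of situation as described above: the goal is still met, but some requests failed.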
Maybe there was this MySQL RDS lookup thing. What was the reason things were dropped? These are the top errors that are showing up. There were two builds that were deployed at that time. One of them could have been the thing that broke it and the other could have been the build that fixed it. But BubbleUp here gives us a perspective that’s almost not possible in any other tool, because we basically said, let’s measure ourselves with SLOs and then now let’s understand what’s chipping away at that. It becomes almost an instant support tool. So we can say, hey, are we up? Yes, we are still meeting our goals but some people are not having the greatest time. And BubbleUp then becomes the ability for me to go and ask another question like let’s go look at a trace that happened to have failed the SLO or the SLI at that point in time. And then I might go and do some further set of debugging. I find it’s a pretty powerful tool. Heat maps are a fantastic way to see this. Heat maps are the only place right now where BubbleUp appears. So try to create more heat maps, because they give you a different kind of resolution than a count grouped by customer.
Sometimes you can just have a lot of lines on the screen. Sometimes that might be, you know, a little bit easier to understand when you use a heat map as well. So that’s where I see BubbleUp being an awesome and powerful tool, and it just helps me, you know, kind of analyze how my garden is doing. If some of the soil is behaving differently if my irrigation isn’t working and that same kind of thing without having a lot of knowledge about gardening.
Kelly Gallamore [Manager, Demand Gen|Honeycomb]:
Let me see if I can restate and make sure that I understand. BubbleUp is the tool that allows you to visualize interesting selections of data points over your baseline behavior. And it can do it fast, no matter how many numbers of fields you’re querying, right? And I think that that’s important because, in other situations, we have all of these fields that can take other tools a long time to comb through all of that data. Does that sound right?
Michael Wilde [Dir of Sales Engineering|Honeycomb]:
It does. We’re using some pretty smart, sophisticated statistical analysis to make all of this fast. Taking the right number of samples to run the result set that we’re seeing here. But we’re seeing that it’s pretty great at being able to tell you what’s different and then also the fact that it’s ordered by columns that are most different between the baseline and selection is likely a good pointer to the things that you might want to further investigate.
Kelly Gallamore [Manager, Demand Gen|Honeycomb]:
Gotcha. Okay. I think that helps answer a question about seeing 50% in the selection and 0% in baseline across so many fields. So BubbleUp by nature and design, that helps you know which fields to focus on. But in the context of your own environment, with your own time of measurable rules that you’re trying to meet, it gets you right there, so you can dig into a point and decide, this is not what I’m going for. Like I can eliminate this and once you drill down and eliminate, you can bounce back out and see hey, here’s something else that’s interesting. Drill down to that and bounce back out to see where you’re at.
Michael Wilde [Dir of Sales Engineering|Honeycomb]:
You’re right. And one thing that is also important to understand is that your software is evolving. The version that you deployed three weeks ago has a new feature out. It has a new field. Unless you are keeping track of, on a list on your wall, of all the fields that every engineer is putting in every single log event, you might not know that a brand new field has appeared. Maybe you’re not an engineer or a software developer. Maybe you’re in ops. You’re on call and maybe your engineers aren’t on call. How do you know, because you didn’t build the software, what’s important? What got added? You might not be reviewing as an ops person, you might not be reviewing every single pull request, so you might not be aware that hey, there’s a new field. But just by drawing a box around it, boom there’s a new field. That’s interesting. I didn’t know there’s a new feature deployed. Cool. There’s a field for it. Awesome. And it turns out it’s showing up. It may arm you with the ability to escalate to the engineer who built the software faster and easier and they would have a lot more information than they would otherwise.
Kelly Gallamore [Manager, Demand Gen|Honeycomb]:
Okay. That seems really helpful. Another thing that I see as you’re bouncing around, as you add your heat map and run your query so you can see BubbleUp, I like how when you drill down, you can also pop to the trace. To me, it just seems easier to have this access within one tool. Every time I switch tools, I lose cycles and context going wait, where was that in the first place? No matter how organized I am.
Michael Wilde [Dir of Sales Engineering|Honeycomb]:
Exactly. Yeah. Like, even at Honeycomb, we have a lot of tools to do different parts of our job. And you have to reset your mind each time you switch. You know, when you’re running production systems, your mind has to be focused pretty intensely on doing one of two things: building and/or fixing the stuff that you deployed, or observing how it’s actually working and seeing what the customers are experiencing.
Kelly Gallamore [Manager, Demand Gen|Honeycomb]:
Perfect. Michael, we’re almost out of time. Here, I’ll share my screen real fast. I know that folks can go to…
Michael Wilde [Dir of Sales Engineering|Honeycomb]:
Check out Play. Honeycomb Play is a good place to check that out. I don’t know if everybody in the world uses heat maps, but they’re a great visualization. Check out Play. If you’re a Honeycomb customer and you don’t use heat maps and you didn’t use BubbleUp before today, use heat maps. They tend to reveal things that you can’t see in different types of charts. Play is there for you if you want to check out what I did today.
Kelly Gallamore [Manager, Demand Gen|Honeycomb]:
You can check out play.Honeycomb.io so you can see how this works. Play around with it yourself. If you have more questions, reach out to the team at Honeycomb.io. At the end, you’ll get an email that has how to view this again so you can share it with peers and colleagues. There will be a link to our survey. If you could give us feedback so we can make this the best experience possible, we really appreciate it. And we really appreciate you all coming today and participating with us.
Michael, that’s all I’ve got. Do you have any last words of wisdom for the folks at home?
Michael Wilde [Dir of Sales Engineering|Honeycomb]:
You know what? I think they ought to be good by now. If you’re not a Honeycomb customer, try it out. It’s really easy to get started. If you don’t have time to get started, check out Play, because it may give you a different perspective on what’s possible with a tool looking at data in production. And you know, follow us on Twitter. We’re easy to get ahold of. And you know, have an awesome day.
Kelly Gallamore [Manager, Demand Gen|Honeycomb]:
Great. Episode 1, Remote But Not Alone, is available. You can see that. It’s a quick scenario focusing on how teamwork happens within Honeycomb. Michael showed several features that we use, that our customers use. Episode 2 talks about SLOs and is about how to tame your alerting so you can focus on what’s most important for your end-users’ happiness. And today’s episode was about BubbleUp. We’ll see you in June towards the end of the month. Have a great rest of your week. Thanks, everybody!
If you see any typos in this text or have any questions, reach out to firstname.lastname@example.org.