
De-stress Debugging: Triggers, Feature Flags & Fast Query


Summary:


The second episode in our Honeycomb Learn series looks at how to cut stress levels when debugging issues in production. Start with a hypothesis, run fast queries, and then navigate to the code where the problem lies. Be proactive and set triggers to let you know if something needs attention. When engineering is about to ship a new release, set a feature flag to watch how production behaves in real time. Curtail performance issues and reduce customer impact with the right tools to better understand production systems, right now.

Listen to learn:

- How to quickly go from hypothesis to fast query and pinpoint exactly where the issue is
- How to set triggers based on thresholds important to your business
- How to set feature flags to control specific parts of your environment and reduce stress levels

See a Honeycomb demo, ask questions and learn more.

Transcript

Peter Tuhtan [Product Manager|Honeycomb]:

Hey everyone, this is Peter Tuhtan with Honeycomb, joined by Michael Wilde, also with Honeycomb. I hope everyone can hear us well. Feel free to drop feedback into the feedback tab or questions into the questions tab below the player.

We’re going to be starting in about two minutes, so go ahead and grab your breakfast, lunch, or dinner; whatever time it is in the area you’re joining us from, and we’ll get started soon.

Also just an early heads up, at the end of the presentation we will definitely be keeping some time for questions, so if you would rather save your questions until the final slide, feel free and you can drop them into the chat and we’ll discuss amongst ourselves.

All right, welcome everybody. Let’s get started. Thank you all for joining the webinar today. So, before we dive into the presentation, I’d like to go over a few housekeeping items. Like I said earlier, if you have any questions during the webinar, please use the “ask questions” tab located below the player. Your questions will be addressed either during or at the end of the session. Probably at the end of the session just so we can keep the flow moving, but if I see something that really prompts us to dive into it right then, Michael and I can jump into that. At the end of the webinar also please take a moment to rate the presentation and provide some feedback using the “rate this” tab below the player.

2:43

All right, let’s get started. Welcome to the Honeycomb Learn webcast series. It’s designed to educate teams that work in DevOps about what observability is and how to get started. Observability-driven development is the ability to ask any question about your production system so you can better understand and debug when incidents occur. Teams should code confidently, learn continuously, and spend more time innovating.

So, one thing to chime in on here too before we really dive into the material: a lot of this is driven by the fact that Honeycomb has spent a lot of time researching where dev time is being spent, and a lot of that, obviously, is spent debugging and fixing technical debt. It’s super frustrating, time-consuming, and obviously impacts quality of life and the quality of your time. It’s expensive for companies, because customers do not have an optimal experience, and ultimately competition can get ahead, which impacts your revenue.

I’d also like to point out that this is, again, the second in our webinar series. Go ahead and head to our website if you’d like to look for episode one and catch up if you missed it.

Today you’re joined by myself, Peter Tuhtan; I’m a product manager here at Honeycomb. I joined the team when we were working out of a condo with about four people, back in the day, as head of sales, and have transitioned to product management since. I’m joined by Michael Wilde, whom I’ll let introduce himself.

Michael Wilde [Dir of Sales Engineering|Honeycomb]:

Yeah, greetings. I run the Sales Engineering organization here at Honeycomb. I came to Honeycomb last July after a twelve-year stint at Splunk. So I’ve seen the world of machine data and production debugging evolve, and I’m super excited to see what customers have done with Honeycomb. Hopefully, we can show you that today.

Peter Tuhtan:

Yeah, thanks, Michael. On that note, we will be going through some topics today, including what you see now: what kinds of debugging exist in the eyes of Honeycomb and observability, how do you get started facing them, how do you get ahead of those problems, and what are the right tools for those jobs, in our opinion?

We’ll probably be going through about 30 to 45 minutes today, but as I said, we’ll keep a lot of time at the end for questions as they come up. Michael Wilde will also be jumping into a screen share and going through some demoing of the Honeycomb product. Let’s get started.

So, one thing that we definitely preach here at Honeycomb is that debugging only gets harder, and it’s harder than ever right now. A problem that exists is that there’s just a myriad of different tools that are used across an entire company by different teams and the members of those teams. That can cause a lack of visibility into what’s actually going on. At the same time, which I’m sure many of you obviously understand, systems are only becoming more complex through distributed systems and microservices. The challenge is that you’re not using debugging tools designed for these newly architected systems.

Metrics lack detail; they point you in the direction that something is different, but it needs further investigation. Logs, meanwhile, can be searched, but at times can be difficult to actually query unless you’re using another service, and they don’t provide any ability to actually get tracing in unless, again, you’re using another service. And then again, APM tools don’t give you access to all the raw data with different visualization features, and we tend to believe, through some research, that tracing can be an afterthought in APM, and it’s still very new to that space.

By the way, if you’re interested in seeing how Honeycomb stacks up against these three kinds of, what we think are older ways of going about debugging, we have a blog post. If you just search “comparisons” in our blog, you will be able to follow that.

For any of you that follow our fearless leader, @mipsytipsy/Charity Majors, on Twitter, this quote may not be new to you, but for the rest, she claims that “It’s way easier to build a complex system than it is to run one or understand one.” At Honeycomb, we believe observability is the only way forward. This means you have the ability to ask any question of your system that is necessary if you’re going to meet your SLOs. Think of it as production intelligence for modern DevOps teams. Just like BI was built years ago for business users, intelligence for systems, for dev, ops, and SRE teams, is what we’re trying to create here at Honeycomb. You must have a unified, single view into what is actually happening, especially if new code ships on a much more frequent basis.

Charity frequently speaks at events, by the way, such as Velocity. If you’re going, you should definitely try to check it out. She’s got some sweet stickers, but her actual talks are really, really good, and have helped a lot of teams out in the past.

7:54

There are a number of best practice steps that teams must go through to reach a state of observability in our opinion. And as said previously in episode one, which I alluded to at the beginning of this call, we talked about instrumentation. So there’s your first very, very important step, and how to create better telemetry so you give context for the code which helps everyone in the DevOps team to maintain a well-performing service. I encourage everyone to check that first version of our webinar out and share it with other team members so you can follow the whole path to this webinar.

Today we’re going to focus on the ongoing review of your system, and show you how to run queries to better understand what’s going on, but also to use specific tools such as creating triggers so you can be alerted proactively. How to handle an incident is critical, so issues are resolved quickly, and impact to the customer is minimal.

In our opinion, there are really three areas of debugging that commonly exist. Obviously there are outliers and everyone probably has different lingo for all of this stuff, but as a DevOps-oriented team, you’re faced with all sorts of different issues and things you need to focus on so that your service is totally operational, you’re satisfying SLOs, and you’re maintaining that happy customer base. Honeycomb refers to this as software ownership, and regardless of who on the team is responsible, it does impact a wide range of your entire org, from engineers to ops to SREs to customer success and sales. We see debugging, as I said, falling really into these three main buckets, or categories if you like, of activities.

And it’s obviously not meant to be an exhaustive list of what’s out there when incidents occur, but starting with the left, we call this basically just “incident response,” right? Major incidents. Your on-call person is receiving an alert, something’s wrong, they need to jump in there and solve it. The second bucket though is something we’ll focus on a little bit more today, as well as the third. The second being the problems and incidents caused by performance degradation. Maybe this isn’t exactly an on-call alert, but for any number of reasons, such as capacity constraints or the opposite, an unhappy message from the head of finance about your AWS bill, this is something that you can use Honeycomb to keep a constant eye on and manage all the time.

And then the third on the right here is what we believe might be an area that actually gets the least amount of attention these days when one thinks about debugging, but we also think it’s probably the most important. If you’re continually learning from the new system as you’re releasing new features and additions to your services, you can be proactive and get ahead of any issue. You can also work closely with your eng team across other teams, to know exactly when something hits production to understand how users are adopting a new feature. So, going way beyond debugging and leveraging a system like Honeycomb to see if things are successful with what you’re releasing. There’s a tremendous amount of learning for the team overall in this area of ongoing development and release management.

So, Michael, we’re going to cover a variety of these topics today, but just to kind of give some folks the lay of the land, maybe you have a recent example for folks to hear about of an instance where Honeycomb was used.

Michael Wilde:

Yeah, thanks Peter Tuhtan. A recent customer of ours, a company called Behavior Interactive, makes some really awesome multiplayer games. They’re up in Canada. And they had some decent APM tools in the past, but they noticed there was some slowdown, something not behaving right, and they actually just couldn’t find the answer with their other tools. They were looking at, is it caching, is it the database, or is it somewhere else? And they actually felt that the speed at which they were able to get things done with Honeycomb was literally impossible in some of the other products. They’re now at the point where they recommend to anybody who’s running anything in production to think seriously about Honeycomb just because of the speed.

As Peter said, with Charity’s help: complex systems are really easy to build, but they’re very difficult to debug, and it’s really exciting to see existing customers get through some of their issues quickly. Feel free to check out the case study on Behavior, also known as BHVR, on our website at honeycomb.io.

12:28

Peter Tuhtan:

Awesome, thank you, Michael. So let’s dive into the first one of these areas, the incident response. The one we’re all familiar with, right? You’re using PagerDuty, what have you, you’re getting a ping, something’s wrong. There’s an incident and the on-call team is alerted, and this could be obviously one customer or many. The most important customer, or all of them, saying “hey service is slow, service is down.” A lot of the time when this occurs for your team it can feel like a black box that you’re stepping into. You have no idea where to start. So what do you do? Right here we kind of list a best practice map from top to bottom of what we believe is the best approach to solving an unknown bug or problem.

The first is to understand the severity of the issue, and how many people it’s actually affecting. Then, perhaps, something that has happened previously is what you want to look for, or maybe there’s a starting-off point you’ve taken in the past that’s the right route with the current tools you use. Usually, though, teams have some general hypotheses based on the description of the problem, but that’s not always the case. This is obviously one of the bread-and-butter situations where people leverage Honeycomb, so I’ll pass the microphone to Michael here again, and let’s dive into a scenario using Honeycomb to address this.

Michael Wilde:

Yeah, thanks, Peter. I’ve got to show you what this thing is like because the speed that you will see me work at in Honeycomb is pretty much unparalleled. But also, we’re going to experience a couple of, you could call them, team features, where we can observe what each of us is doing to help everyone become the best debugger.

Provided everyone can see my screen, this is just the homepage of Honeycomb when you log in. I’ll give you that scenario of trying to track down maybe something really difficult to find, but pay attention when I’m scrolling down here. I see what I’ve been doing, so my past history is here. Great places to start if I’m debugging problems frequently, but I also see things that other team members are doing, and Honeycomb is unique in the respect that we recognize that most problems are solved by folks on teams, and if we can observe what our other team members are doing, chances are a new person could become smarter at your system, or the team can benefit from all the expertise that is on there.

So, imagine if I’m running an API service and there’s a report of something wrong, but all of my monitoring systems are kind of showing things are okay. On the Honeycomb screen here, on the right-hand side, we’ll see my entire history, which is kind of like my browser history, but very visual, and I can retrieve it instantaneously throughout the entire life of my Honeycomb employment. On the top, there’s a set of gray boxes where I can start to do a query and it’s really, really simple to use. But there’s a lot of power in the simplicity.

If I look at this chart, it’s just a simple count chart over the last six hours. And we see a normal pattern of behavior. In my case, I’m running an API service so I’ve got the information, a little bit from the back end, some stuff from the front end, a few extra fields, and of course I’ve instrumented my code so I get some nice distributed tracing. To really crack this thing open, I’m going to quickly do a breakdown by status code, and I’m going to use a great visualization that we built called a heat map. And we’ll do a heat map on latency. What that’s going to allow us to do is see a bit more about what’s going on inside that normal period of activity.
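
For readers who want to reproduce this outside the UI, the query being built here, a COUNT broken down by status code plus a heat map of latency, can also be written as a Honeycomb query specification, the JSON shape used by Honeycomb’s APIs and boards. This is a minimal sketch only; the column names and time range below are illustrative assumptions, not values from the demo environment.

```python
# A minimal sketch of the query built in the demo, expressed as a Honeycomb
# query specification (the JSON shape used by Honeycomb's APIs and boards).
# Column names and the time range are assumptions for illustration only.
query_spec = {
    "time_range": 21600,                              # last six hours, in seconds
    "breakdowns": ["status_code"],                    # group results by HTTP status
    "calculations": [
        {"op": "COUNT"},                              # the simple count chart
        {"op": "HEATMAP", "column": "duration_ms"},   # the latency heat map
    ],
    "orders": [{"op": "COUNT", "order": "descending"}],
}
```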

We see our purple line here, which shows HTTP 200. Those are successes. Statistically, there are so few failures that it probably wouldn’t even set off most monitoring systems. But if we look deep down inside, there are a few 500 errors that are happening, and if I scroll down here as well, I also see a table that shows my status code by count. It gives me a little information.

As you can see, the heat map also gives me ranges of behavior by color. Most of our latency, or duration, is much less than a second, which is where we like it. We do have this odd spike that is drawing my attention, and we should probably see if we can investigate that. So what we did at Honeycomb, we also built a really great tool for developers and operators that allows us to drill deep down inside and almost x-ray what’s happening inside this weird little spike.

This tool is called BubbleUp. I haven’t seen anybody else have this yet, which is kind of cool, because when I draw a box around that area that I’m interested in, now I get an instantaneous analysis of absolutely every field that is in my dataset, regardless of whether I broke down on it. And it helps me answer really the three big questions if I’m having an issue. Somebody reports a problem, I have to verify it, right? Just because Peter reports a problem doesn’t mean there’s a problem. Second, where is it happening? And third, gosh, I hope it’s not happening to anyone else.

So the bars in yellow represent the statistics around this field and its appearance in the selection. So we have 98% of the events from the ticketing export endpoint show up in this selection. That’s kind of interesting. We also see, is there failure? Again, Peter reported a problem, is there really one? There actually is. So I’m going to take and filter by this status code field. I’m going to do a breakdown by this name field. Well, actually we’ll use the endpoints field. They’re pretty much almost the same. And then, look at the user ID. So user ID is showing up. It’s obviously showing up in every single event, but one user is affected. That one person, that lonely user out there, that might be really important to us.

So let’s break down by that field. And what I’ve basically done is constructed a query right here which is of high granularity. So I’m going to click run, and instantaneously I’ve almost pinpointed the source of the problem. If we look here down on the chart below, we can see user 20109 is getting, what, HTTP 500 status code errors on the Omni ticketing export endpoint, and it’s way more than everyone else. So that’s not good. But at least we’ve found it. What could I do right now? Maybe I have my customer service group reach out to that person.

19:18

Michael Wilde:

But we should be able to take this a little bit further. When you do instrumentation, which we highly recommend in this new world of advanced apps and microservices, distributed tracing becomes your go-to method of really finding out what the behavior actually is. So if I click on the ‘traces’ button here, I get an idea of the slowest transactions that are happening in this time range, and we’re kind of hoping this has only happened to the one user, but it looks like there is another one there. So as an engineer I could drill in and look at any of these traces, but I need to drill all the way down.
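
For context on where those traces come from, here is a rough sketch of the kind of instrumentation that produces them, using Honeycomb’s Python Beeline. The write key, dataset, span names, fields, and the run_export_query helper are all placeholders for illustration; real instrumentation will vary by language and framework.

```python
import beeline

# Illustrative only: write key, dataset, service, and field names are placeholders.
beeline.init(writekey="YOUR_WRITE_KEY", dataset="api-service", service_name="ticketing")

def export_tickets(user_id):
    # Each tracer block becomes a span in the waterfall view shown in the demo.
    with beeline.tracer(name="ticket_export"):
        beeline.add_context_field("user_id", user_id)   # extra context on the span
        with beeline.tracer(name="db_query"):
            rows = run_export_query(user_id)            # hypothetical helper, not a real API
        return rows

# Call beeline.close() at shutdown to flush any pending events.
```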

This is where you might see some differences between Honeycomb and other tools. This is often where most tools stop and Honeycomb really starts to shine bright. So if I want to see life from the perspective of that request, I can go all the way down and get to that distributed trace. Most other tools stop here, at the original request, and as you can see on the right-hand side of the screen, we have every single field from the original raw event, so nothing is pre-aggregated. All sorts of extra context, from platforms to the service name to duration. And we can see the path that this request took. Hit an endpoint. Hit our rate limiter, that looks just fine. Hit an authentication service, good thing that that’s working well on the back end.

And then for some reason, the ticket export was called. Maybe they were printing tickets for a concert so they could hand them out. Well, it looks like we’ve got a high degree of latency here. In our world, 1.3 seconds is pretty long. I mean, think about staring at a webpage for almost two seconds; you’re sometimes on to something else. So we can see the entire path that this request took with this waterfall chart. At each stage, we have enriched fields, such as the query and the time the query took. And we can look at this and say, well, maybe we could change the set of operations to do all the queries in parallel. Maybe you could do that, maybe you can’t. Sometimes we’re not lucky enough to be able to change the code that we have, so could we change the code that runs before what we have?

So right here we may say, “let’s see if somebody hits our endpoint way more frequently than they should.” Maybe we upgrade the code in the rate limiter so that that behavior doesn’t exploit a design that we might not be too happy about. And lastly, what you’ll see on the right-hand side is my entire history. So it makes it easy for me, if I end up at a dead end, as we often do in debugging, to get back, and also to learn from the activity that all of our other users will do. Lastly, I could make a trigger, and that trigger could notify maybe PagerDuty, a webhook, or some other mechanism. I’ll show you a little bit more about triggers, but I just wanted to throw it back to Peter to move on.

Peter Tuhtan:

Thank you. Very helpful. And on that note, let’s talk about the second area of debugging that we recognize here at Honeycomb. Obviously the first is the one that kind of chills our bones and wakes us up in the middle of the night, or you have a dev or engineer on your team who’s just not happy about being on-call knowing that they might run into that situation. But on the flip side of that, one of the ways that you can avoid all of that occurring is just being proactive. Michael just highlighted something, talked about triggers a little bit. Again, here’s a list of what we believe are some of the best practices in setting yourself up for success here.

So, first, we like to get a good understanding of the time frames for shipping code. Hopefully, all of us can rely on a consistent calendar, says the project manager who laughs internally, of the upcoming months and quarters and when things are going to be on time and in GA, staying on top of it and making sure we’re instrumenting ahead of time to prepare ourselves to monitor for any impacts on our service, or on what we’re leveraging to run our service.

So we decide what’s important to watch. This might change over time depending on the nature of new code. We also make sure that everyone on the team is aware, especially those on-call or in customer support, about when new code is being shipped. You don’t want those teams going back to that black box and searching through the dark. For some of our customers, this actually involves giving support a heads-up, like I mentioned, to be extremely alert if there’s a customer that we’re really, really keeping a close eye on. So Michael, why don’t we talk a bit more about this scenario: looking out for performance degradation, leveraging some of the tools in Honeycomb.

Michael Wilde:

Yeah, thanks, Peter. Most tools should be able to do some levels of proactive notification, right? The unique nature of Honeycomb allows us to drill in and dig deep into interesting scenarios that might bubble up, no pun intended, the kind of things we want to be alerted on. In our process of finding this user that is having an issue, I might make a trigger, maybe for example we add a p95. Let’s add a 95th percentile of duration, run that query.

It’s so refreshing to have something that works so fast, and that’s one of the things that Behavior Interactive loved about Honeycomb. I can make a trigger. Maybe something simple, like where the duration of latency is greater than, I don’t know, maybe 800 milliseconds. Most other systems don’t recommend that you run these types of triggers very frequently. At Honeycomb, as you can see, it’s so fast, go ahead and run that thing every minute. And maybe we add a recipient, okay? Sure, we could send an email, send something to Slack, PagerDuty, even your favorite webhook.
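
As a rough sketch of what that trigger could look like if you created it through Honeycomb’s Triggers API rather than the UI: the dataset slug, column name, threshold, and recipient below are assumptions for illustration, so check the current API documentation for the exact payload before relying on it.

```python
import os
import requests

# Sketch only: dataset slug, column name, threshold, and recipient are assumptions.
API_KEY = os.environ["HONEYCOMB_API_KEY"]
DATASET = "api-service"  # hypothetical dataset slug

trigger = {
    "name": "P95 latency over 800ms",
    "query": {
        "calculations": [{"op": "P95", "column": "duration_ms"}],
    },
    "threshold": {"op": ">", "value": 800},  # fire when P95 duration exceeds 800 ms
    "frequency": 60,                         # evaluate every minute
    "recipients": [
        # Could also be a Slack, PagerDuty, webhook, or marker recipient.
        {"type": "email", "target": "oncall@example.com"},
    ],
}

resp = requests.post(
    f"https://api.honeycomb.io/1/triggers/{DATASET}",
    headers={"X-Honeycomb-Team": API_KEY},
    json=trigger,
)
resp.raise_for_status()
```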

But additionally, you may have noticed some odd lines that showed up on my chart before. You can actually create a dataset marker in Honeycomb so that there’s something dropped on there for an operator to see; external context is really awesome. When triggers fire, they show up on a really nice page that allows you to see all the triggers that are happening, test them out to make sure they still work, because you’ve got to make sure things are always working, and the typical idea is taking a look at things that are wrong.
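
Those dataset markers can also be created programmatically, for example from a deploy script, via Honeycomb’s Markers API. Here is a minimal sketch; the dataset slug and message are placeholders.

```python
import os
import time
import requests

# Sketch of dropping a dataset marker (the vertical lines on the charts) from,
# say, a deploy script. Dataset slug and message are placeholders.
API_KEY = os.environ["HONEYCOMB_API_KEY"]
DATASET = "api-service"

marker = {
    "message": "deploy build 19814",  # hypothetical build identifier
    "type": "deploy",
    "start_time": int(time.time()),
}

resp = requests.post(
    f"https://api.honeycomb.io/1/markers/{DATASET}",
    headers={"X-Honeycomb-Team": API_KEY},
    json=marker,
)
resp.raise_for_status()
```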

At Honeycomb, we use Honeycomb to Honeycomb Honeycomb, kidding. Although any time you see a demo or talk to a vendor that has a tool that helps you debug, ask them how they use it on their own systems. It will be quite revealing. At Honeycomb, we try to live the values that we espouse, so we’ve done a great deal of instrumentation on our own code, and in our production environment, we do lots of the things that you would normally expect. Triggering on things like errors, but also perhaps I look at what’s happening maybe on the front end. And I’m looking at activity by the user.

As a product manager, for example, Peter can observe what our customers are doing to kind of see well, are they getting the experience that they’re expecting? Does that mean that somebody on call is then doing something about it? Maybe and maybe not, but the idea of software ownership is really about taking a look at software behavior, not just when it’s broken, not just when there’s an on-call incident, but when things are actually working well, so you can see whether you have built what you expected folks to use.

27:19

Michael Wilde:

Now there’s, as I mentioned, this idea of software ownership. One of the upcoming technologies and methods that folks are using to really own software is this idea of a feature flag. If you’re not familiar with feature flags, a lot of you probably are, but it’s like a way to turn on and turn off parts of your production system, parts of your code that are either hidden or disabled or enabled. It’s a great way to do things like testing in production. It’s a great way to have a beta program. It’s a great way, and we use this at Honeycomb, to help customers that we’re building things for, prior to release. We can turn on a feature flag, and I’ll show you what that looks like really, really soon.

If you look at the whole CI/CD pipeline, when a build occurs, maybe when a feature flag is deployed, and we use a great product called LaunchDarkly to do that, you might take a different approach. You might say, okay, when I’m deploying a new feature flag or I’m deploying an update, why don’t we use an API call to Honeycomb, maybe to generate a dashboard. There’s a nice API for boards in Honeycomb. You could generate a dashboard that had four or five queries relevant to that particular feature flag. Maybe send a message to the developer of that flag so that they can then look at a dashboard that’s already ready for them. And when that deploy happens, a marker has been put on the timeline in Honeycomb by way of an API call from a CI system. You’ll see this all over my screen when I use Honeycomb in production because we’re doing deploys all the time.
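
To make the boards idea concrete, here is a rough sketch of what a CI step might do when a new feature flag ships: create a small board scoped to that flag through Honeycomb’s Boards API. The flag key, dataset, and column names are assumptions, and the exact payload fields should be checked against the Boards API documentation.

```python
import os
import requests

# Sketch only: flag key, dataset, and column names are assumptions, and the
# payload shape should be verified against the current Boards API docs.
API_KEY = os.environ["HONEYCOMB_API_KEY"]
FLAG_KEY = "okta-integration"   # hypothetical feature-flag key
DATASET = "api-service"

board = {
    "name": f"Feature flag: {FLAG_KEY}",
    "queries": [
        {
            "caption": "Requests with the flag enabled",
            "dataset": DATASET,
            "query": {
                "calculations": [{"op": "COUNT"}],
                "filters": [{"column": f"flags.{FLAG_KEY}", "op": "=", "value": True}],
                "breakdowns": ["team_id"],
            },
        },
        {
            "caption": "Latency with the flag enabled",
            "dataset": DATASET,
            "query": {
                "calculations": [{"op": "HEATMAP", "column": "duration_ms"}],
                "filters": [{"column": f"flags.{FLAG_KEY}", "op": "=", "value": True}],
            },
        },
    ],
}

resp = requests.post(
    "https://api.honeycomb.io/1/boards",
    headers={"X-Honeycomb-Team": API_KEY},
    json=board,
)
resp.raise_for_status()
print("Created board:", resp.json().get("id"))
```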

Let me give you a few examples of how this whole feature flag thing works if you’ve never seen it, and some insights that we can glean from what’s in Honeycomb about, hey, how our customers are doing. So if you think about it, this idea of software ownership, which is, again, developers and operators examining how production is doing, not just when it’s broken, but when things are working; that idea of being proactive and owning your code says, keep your eyes open the entire time.

Here at Honeycomb, you may have seen some of my user interface where I was over here at the query screen, and you may have noticed on the right-hand side, there’s a lot going on on the screen; there are three tabs in this green bar over here. Green, teal, whatever color shows up on your screen. There’s also a little ‘x’ button here where I can hide it. My monitor is a huge monitor, but I’ve got a 13-inch MacBook Pro, and our engineers and our designers like to know kind of how people are using the product. If you have this screen hidden the entire time, you actually might not know there’s great history and team activity. But it might be a result of you having a smaller screen.

So instrumentation allows us to observe what’s going on. This is the really cool part of observability. One might ask, Chris might ask, do people have their sidebars open? Yes or no. And that sidebar is that green bar that I just showed you. Okay, great. That’s helpful to give us an idea of the count of folks. What does the sidebar look like while queries are run? You did see the sidebar show up with the details on the dataset, but not everybody clicks on everything in every web app you use. And it might help if the history bar was … Maybe that history should be defaulted at first, right? So that idea helps us understand it. And this is all just natively using Honeycomb again, not to debug problems, but to observe exactly what’s going on. Again, we try to live the observability lifestyle that we espouse.

That idea of creating a board, so if we were to go … I’ll show you a board in a second, but you know, a board is a list of queries that may have a visualization associated with them. And it’s kind of like a dashboard. There’s an API for that so that perhaps out of the process of a build, maybe a new feature flag is deployed, boom, a dashboard is created. It’s really simple to extract an existing dashboard and turn it into something new. And we’ve tried to make this extremely developer and DevOps process friendly.

This idea of a feature flag, if you’ve never seen how they get deployed, obviously there’s some code that’s written in engineering. What you’re seeing on the screen is a product called LaunchDarkly, and LaunchDarkly is how we at Honeycomb, and many other customers, manage the provisioning of feature flags. We can see there are lots that we’re working on at Honeycomb, and we’ve got some great things out here. As a matter of fact, we have a feature flag associated with the integration that we’re building for Okta, so if you’re an Okta customer, you can wire that right in. This might be one to kind of look at. So if we drilled into this feature flag, right now the default rule is to have it on, but a feature flag allows us to target specific users, specific teams, and turn those features on for them.

Well, if I’m doing software ownership and I turn on the Okta flag for some users, I might actually want to see a little bit about that. So I’ve got a dataset inside of Honeycomb that has information about what kinds of things users do. And as we can see, I have two boards here. I have one board here with two queries. One that looks at a feature flag for Okta FSO integration. If I click into that board, the query is quite simple. I’m just looking at a particular team. We have a customer here that’s deployed it recently, and we have obviously some testing going on. We can see how frequently they use it, which is great, because we see a marker here that probably represents when it was deployed. Maybe this is 19814, when that flag got deployed. I’m doing a filter on here, flags.Okta = true. We’ve instrumented our code so that we actually have that information on every flag that we’ve put in here, right in Honeycomb.
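
The flags.Okta field being filtered on here gets there through instrumentation. Below is a minimal sketch of the general pattern, recording a flag’s value as a field on each event or span so it can be queried later; evaluate_flag is a hypothetical stand-in for whatever your feature-flag SDK (LaunchDarkly in Honeycomb’s case) provides, and the flag and field names are assumptions.

```python
import beeline

def evaluate_flag(flag_key, user_id):
    # Hypothetical stand-in: call your feature-flag SDK here (e.g. LaunchDarkly).
    ...

def handle_request(user_id):
    with beeline.tracer(name="handle_request"):
        okta_enabled = evaluate_flag("okta-integration", user_id)
        # Record the flag state on the span so later queries can filter or
        # break down on it, e.g. flags.okta = true.
        beeline.add_context_field("flags.okta", bool(okta_enabled))
        # ... handle the request ...
```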

That’s the idea of taking data from many different systems and looking at what’s going on, not only in prod when things aren’t working well, but also when you deploy things. Lastly, this idea of really looking into what’s going on: another query I found today which I thought was really cool was “most common window widths when looking at query results.” So our team has a heat map here, and we can see most people’s window widths are less than 3000 pixels, but this type of thing helps us understand exactly what the experience people have with Honeycomb actually is. Again, I believe that most vendors that show you anything associated with troubleshooting in production or whatever should give you a good idea of how they actually use their own product.

Lastly, what I find is really cool is, queries are super easy to build, but the query history feature in Honeycomb makes it really easy for me to just search on what everyone else has queried, learn from what they’re doing, and maybe get my job done quicker. So hopefully that’s a good overview of the idea of testing in production, software ownership, how Honeycomb uses even Honeycomb to look at how our customers are doing, and how the idea of feature flag works and could help you in production. Back to you, Peter.

36:11

Peter Tuhtan:

Awesome, thank you, Michael. So to kind of recap what we went through today, really importantly, obviously, and before we move to questions: the entire team across your organization, from the developers to ops to devs that work in ops to SREs to customer support, even sales, can now rely on tools that give everyone visibility into what’s actually happening in production. By being proactive, you can get ahead of the issues that keep your team up at night and cause early gray hairs, especially if they are major and affect more customers or end-users. This gives evenings back to your teams, and over time, obviously, they will spend less time being frustrated with what’s going on on-call and more time focusing on the core initiatives and the products and services you’re trying to build.

I’d like to open this up to questions. I’d also like to highlight that if you go ahead and click on the attachments and links sections here in BrightTALK, you’ll be able to find some useful stuff. One of them is “Play with Honeycomb,” which is, you don’t need to send us any data, it’s just a sandbox scenario. You can walk through tracing or our event-based querying scenarios, our documentation, and of course, which we’ll touch on in a second after we answer some questions here, the next stage of our webinar series.

If you have any questions, please drop them to us now and we can reserve some time right here. And while we’re at it, again, Honeycomb Play is in the attachments. You can start a trial by going to our website as well. And again, this is the second of our series of webinars, with the third coming up on April 24th. We’ll be focusing a little bit more on tracing, so “See the Trace?” is the title, and we’ll be focused on discovering errors, latency, and more across modern distributed systems. We’ll open up for questions now.

So we have the first question, how do you get the whole team to be able to see inside the production system? I guess I have my own opinion about this, but Michael you work with our customers a lot more actively right now than I do. Do you have an answer off the top of your head for this one?

Michael Wilde:

Yeah. One of the best ways to get the whole team working in Honeycomb, aside from inviting them in, is to start using Honeycomb itself. You might be on your own, you’re trying it out, things are working well. Most of us are Slack users in some way, so I might take a query that I ran, and I might share that directly to Slack. So I might put that in my channel. I have one here just for the purposes of the demo, and I might say, “team, check this out. People are actually rocking with the Okta stuff.” That’s going to end up in Slack, preferably decorated. It looks great. You’ll end up seeing the chart, the logic behind it, and that causes the conversation to move.

Somebody then pops into Honeycomb for the first time, maybe clicks on that, and then just randomly clicks on the upper left-hand button, and they can start to see what everyone else is doing. Once one sees others in a system, they often want to jump in. So start using it, but start sharing outside of Honeycomb and you’ll find that it all ends up going both ways, in and out, and the team gets smarter and better.

Peter Tuhtan:

Yeah, and I would add to that, a key example for me is working with customer support. One thing that I’ve seen some of our customers do in the past, for instance, to get them involved using a tool like this, leveraging the data that you’re spending your time putting in, is making sure the trigger fires to the right channel in Slack. It’s simple, but if you think about it, if you have a very high-importance customer out there in your customer base, and you know who on your customer success team is responsible for them, you send that trigger, when something occurs around that customer, straight to your CS team, and they’re the first line of defense in making sure your customer is taken care of.

A couple of other questions here. Duration of the free trial: the standard trial right now is about 14 days, give or take, obviously depending on what your need is for integration with us. We can be flexible with that and help you get data in if it’s not something that’s just forthcoming in our documentation. But beyond that, we also offer a totally free version of Honeycomb. If you hop on our website you’ll be able to see all this information. So go ahead and sign up. If your time runs out on the trial and you need an extension, we can talk. If you want to jump into the free version, we obviously recommend that.

Another question here. Michael, when you were drilling into the trace for the endpoint errors earlier …

Michael Wilde:

Yeah, this is kind of an interesting one for me. The person asked, “When you were drilling into the trace for endpoint errors earlier, have you run into problems where the user was sampling their traces, and thus they won’t always have a trace to go with metrics?”

That’s a great question. First, one should consider sampling itself. When we’re sampling, I’ll share my screen so you can maybe follow along at home on exactly where I’m going, but out here in the Honeycomb docs, cruise over to docs.honeycomb.io. So let me give you one piece of information about Honeycomb: the product is sampling-aware, meaning that every event comes in with a sample rate. This could be a sample rate of one, where every event represents itself. It could be a sample rate of 100, which might mean that event represents 100 events. Some systems, not Honeycomb, do things like blindly sampling one out of every 100 events, which is arguably not the best way to do it.

So there’s some documentation here on sampling that I recommend you read, on why one should sample and methods of doing sampling. So let’s say you were only capturing successful requests; sure, you could randomly sample one out of every 10 events, right? But in the case where I have a failure, I would never want to just sample one out of every 10 events and hope to include the failure. So I might take a dynamic sampling approach where I put a different sampling rate on the successes, like keeping one out of every 100 successful events, but keeping every single failure event. So when you’re doing sampling, be smart about how you do it.

There’s some technology in the Honeycomb ingestion agent that will help you with that. But if you’re doing instrumentation directly in your code, there are some ways to do that as well. If you do the sampling right, then the events that need to be captured will be captured at their full fidelity, and for the ones that you’d like to sample, so as to save on time, speed, and the size of your data, you’ll get the right experience that you expect. Hopefully, that answers your question.
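
Here is a minimal sketch of the dynamic sampling idea described above, implemented by hand with Honeycomb’s libhoney library: keep every failure at full fidelity, keep roughly one in a hundred successes, and record the sample rate on each event so Honeycomb can weight the counts correctly. The write key, dataset, and field names are placeholders, and Honeycomb’s own agents and samplers can handle this for you in many setups.

```python
import random
import libhoney

# Placeholders: write key, dataset, and field names are illustrative only.
libhoney.init(writekey="YOUR_WRITE_KEY", dataset="api-service")

def send_event(fields, status_code):
    # Keep every failure; keep roughly 1 in 100 successes.
    sample_rate = 1 if status_code >= 500 else 100

    if random.randint(1, sample_rate) != 1:
        return  # dropped by sampling; nothing is sent

    ev = libhoney.new_event()
    ev.add(fields)
    ev.add_field("status_code", status_code)
    ev.sample_rate = sample_rate   # how many events this one represents
    ev.send_presampled()           # send without libhoney sampling it again

# Call libhoney.close() at shutdown to flush any queued events.
```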

Peter Tuhtan:

Alright, if there aren’t any other questions, I’d like to remind everyone right now that a copy of today’s webinar will be emailed to you, the attendee, so you can always revisit it or share it across your teams. I’ll hang here for just a few more minutes to see if we have any other questions entered. Otherwise, I hope everyone enjoys the rest of their days, evenings, and afternoons.

Okay, it doesn’t look like we have any more questions. You can always get in contact with us by emailing support or solutions@honeycomb.io. Also, you can email me personally if you have questions and you don’t want a whole team and you feel like now we have an intimate relationship because of this webinar, I’m Peter Tuhtan, P-E-T-E-R, @honeycomb.io, and Michael Wilde is michael@honeycomb.io. We’d be happy to help you with any questions, comments, feedback on today’s webinar, or if you’d like to get started using Honeycomb. Again, enjoy the rest of your day, and we hope to talk soon. Bye.

Michael Wilde:

Thanks, Peter, bye.

If you see any typos in this text or have any questions, reach out to marketing@honeycomb.io.
