
Hot Swapping Active Production Services Reliably With Honeycomb



Summary:


You've heard the analogy: changing a tire while the car is moving. And you've probably seen video of the acrobatic stunt driving necessary to make that kind of operation possible. Hot swapping production services reliably might seem like high-performance engineering stunt acrobatics but, in practice, anyone can do it safely and reliably with Honeycomb. The Mode Analytics backend team is updating different services in their stack by hot swapping them in production. By sending production traffic to both an old and new service at the same time, they can use Honeycomb to identify tiny deviations, introduce proactive fixes, and validate functionality without impacting users. With Honeycomb, the team knows exactly which work to prioritize at every step along the way to ensure a smooth transition in a setting where production traffic never stops.

In this webinar, you’ll learn how to:

- Create a migration validation strategy with feature flags, error reporting, and observability
- Use OpenTelemetry to provide any custom insights you need
- Break down complex service migrations into manageable milestones with Honeycomb

Transcript

Christine Yen [CEO & Co-founder|Honeycomb]: 

All right. Thank you, everyone, for joining us today for “Hot Swapping Active Production Services Reliably With Honeycomb.” You all know the drill at this point. Welcome to our Zoom webinar. If you have any questions or comments, drop them in chat. We’ll have folks watching it and making sure to pull in any questions that come in. Our thanks to Mode Analytics for being our guest for today’s broadcast and sharing their experience. 

Before we get started, a little bit of housekeeping. Again, questions, we love questions. Please ask questions at any time. There’s a Q&A button on the bottom of your Zoom toolbar. If we don’t answer your question right away, we’ll answer it at the end. Captions are available. The link is dropped in the chat. You can choose to access them at the bottom of your screen. Thank you to Kimberly with Breaking Barriers Captioning who is here with us today. 

All right. I’m Christine Yen, CEO and co-founder of Honeycomb. The stars of the show today are Ryan and Talia, senior software engineers with Mode Analytics. Would you mind telling us about yourself and your role at Mode?

Ryan Kennedy [Staff Software Engineer|Mode Analytics]: 

I’m Ryan Kennedy. I’m a Staff Software Engineer on the Mode Analytics backend services engineering team. We work on large-scale data services, integrations, and other things that sort of sit behind the web application. I joined Mode about two years ago.

Talia Trilling [Sr. Software Engineer|Mode Analytics]: 

I’m Talia. I’ve been a backend services engineer at Mode for almost two years. I’m working on the stuff that Ryan described. Prior to Mode, I was at a food tech start-up working on their API, and before that I worked at a large enterprise company. 

Christine Yen: 

For anyone who isn’t familiar, can you give an overview of Mode, your mission, and what it is you do for your customers? 

Ryan Kennedy: 

I can go ahead and take that. Mode is really a tool for cross-functional teams, a tool that empowers them to investigate ideas, analyze data, and make decisions together. We integrate with customers’ data warehouses. We don’t store their data. We go out and fetch their data from where it lives. We bring it back and allow customers to actually do the analysis on top of it using a combination of our visualization tools and our Python and R notebooks. Sort of layered on top of that, we allow customers to do things like schedule reports. Once they have the reporting they want, they can schedule it to run at any time and share it out via Slack, email, whatever. That’s Mode in a nutshell–helping you make decisions with your data using our tools. 

Christine Yen: 

Sounds like a lot of delicious complexity to dive into.

Ryan Kennedy: 

Absolutely. 

Christine Yen: 

Awesome. Well, help set the stage for us, please. I know this started with the introduction of a new service. What was the impetus for this change? And why was the old service so important? What does it do? 

Ryan Kennedy: 

Talia, do you want to take this one?

Talia Trilling: 

Sure. The service we’re looking at, Flamingo, powers all of our visualizations, so it’s an important part of the product. We have this one service, but it was doing way too many things. There was tight coupling between the visualization system, the in-memory database, the syntax, the grammar language; basically, everything was too tightly coupled in a way that made it really difficult to iterate on the service or add new features, because you couldn’t make changes with just one team. You had to involve different teams. It was serving us for what it needed to, but if we wanted to continue to grow and add more features to the service, it was just going to be next to impossible because of how tightly coupled everything was and the lack of any sort of abstraction boundaries. 

Christine Yen: 

All right. I imagine this is not something you did overnight or something you were planning to do overnight. I’ve got some diagrams you shared with us. Would you walk us through what we’re looking at here and the different stages?

Talia Trilling: 

Yeah. On the left there, it’s showing what the architecture looked like before any of these changes. As you can see, Flamingo has a lot of different things going on that are all happening within the same service. It’s handling the API, visualization grammar, SQL dialect, the in-memory database as well as pulling data from S3. So, as we said, just very tightly coupled. 

5:23

Ryan Kennedy: 

One thing worth mentioning is that the visualization grammar there is largely owned by another team inside of Mode. That’s one of the oddities, the quirks of Flamingo: the dual ownership of the service at the moment, where our team, backend services, largely owns Flamingo, the API, the data manager. There’s a whole other team that manages the visualization grammar. It’s complex itself. You give it grammar explaining what it is you would like to visualize, it turns it into an execution plan, and that involves one or more SQL queries against that in-memory database. It’s a very complex piece of code that’s running in there. They upgrade and make changes to it at their own pace and velocity. 

Meanwhile, backend services is trying to maintain the rest of the system at good reliability and performance. There are two very different concerns going on in there, and that’s what kind of goops up the system a bit. One of the initial things we wanted to start tackling was breaking it apart so we could wrangle it as two different software components. 

Talia Trilling: 

Yeah. This shows that a little bit. It’s a product we’ve been working on called Data Engine. It’s basically trying to bring us toward a place where we can decouple the thing that the other team manages from the things we work on. So, make the API and all of… it’s a little bit hard to explain, but, basically, we have duplicated a lot of the functionality in Flamingo into a new service that abstracts away a lot of the things you would normally have to know inside Flamingo. 

Then we are slowly sort of pointing customers at that service without there actually being any noticeable user difference. And then, if all goes well, the hope is that Flamingo will handle the stuff around visualization that the other team owns. Maybe we can change out the in-memory database, because we’ll be less bound to it by kind of abstracting away everything that happens once the Data Engine is called. 

Christine Yen: 

You mentioned the other team kind of co-owning the visualization grammar. Who are the other stakeholders or teams that care about the success of this project? “Cared about” is rough.    

Talia Trilling: 

Kind of everyone. As I said, visualizations are kind of our bread and butter in terms of what we offer people. It’s one of those things where if it works, nobody really thinks about all of the components, but as soon as it doesn’t work, the core functionality is pretty altered. I would say probably almost everyone. Maybe they don’t realize that they care about this, but they probably do because I think as soon as it stops working, that’s a problem. 

Ryan Kennedy: 

I think maybe to get into some specifics, the other team was originally part of what used to be a front-end team. Now we have a visualizations team inside of Mode that takes ownership of that grammar. They care about this deeply. They want to make sure the changes to this don’t impact them very much. 

Product management, a big part of the selling point for this project is opening the door for new capabilities for the company down the road by moving this abstraction layer down a bit. By moving people off the S3 bucket we can start making some changes to that so we can speed up other parts of our pipeline. The fact that other things are tightly coupled to those objects in S3 makes it difficult to make changes. So the abstraction boundary is going to open up, you know, months, years, quarters of work down the road for us that we can start doing. 

I think, also, on the product management side, they want assurance that, you know, we’re not going to break the existing experience, that we’re not going to make it that much slower by adding in this additional hop. There are other stakeholders who are maybe more interested just in as you’re making this change, please make sure nobody notices that we’ve done this. 

Christine Yen: 

Sounds like a lot of eyes. I know you took these changes in phases. Let my slides catch up. Would you walk us through the phases of making this change and what you were thinking at each stage? Come on, slides. There you go. 

Talia Trilling: 

Sorry. The light in the room I’m in just turned off. Okay. The first phase is fire and forget. Basically, we’re sending queries to both the existing service and the new service, but we don’t actually examine what comes back from the new service. It allows us to kind of start getting a look at: do we have capacity problems? Are there very obvious, glaring things we see when we’re hitting the new service with any sort of traffic? Are there problems there? Again, this was where we started, and we weren’t actually looking at what Data Engine gave back. Yeah. And then the next phase. 
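For readers who want to picture the fire-and-forget pattern Talia describes, here is a minimal sketch in Python. The service and function names are illustrative, not Mode’s actual code; the point is simply that the shadow call can never block or fail the user-facing request.

```python
import concurrent.futures

# Hypothetical stand-ins for the two services; names are illustrative.
def evaluate_with_flamingo(query):
    return {"rows": [[1, 2.0]], "source": "flamingo"}      # existing service

def evaluate_with_data_engine(query):
    return {"rows": [[1, 2.0]], "source": "data-engine"}   # new service

_shadow_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def run_query(query):
    """Phase 1, fire and forget: duplicate the query to the new service,
    ignore its response, and always answer the user from the old one."""
    future = _shadow_pool.submit(evaluate_with_data_engine, query)
    # Retrieve (and drop) any exception so the shadow call can never affect
    # users; errors still surface through the new service's own instrumentation.
    future.add_done_callback(lambda f: f.exception())
    return evaluate_with_flamingo(query)
```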

11:00

Christine Yen: 

These are some screenshots you all shared about the sorts of queries you were running in Honeycomb in each of these phases. Can you run us through what question, in plain English, you were trying to ask and how you were trying to answer it with the graph and the table here? 

Talia Trilling: 

Yeah. I think this is the exceptions inside Data Engine grouped by the exception type and ordered by how many of them we saw. So, basically, this is just showing how we use Honeycomb, and also Bugsnag, to get a sense during fire and forget of whether there are any large issues. You can kind of see a thing that happened when we hit our throughput limit for DynamoDB. That’s the second exception. That was something we noticed during the fire and forget that obviously needed to be fixed. 

Ryan Kennedy: 

This was the first service we’ve had instrumented with OpenTelemetry at Mode. The Flamingo service predates the tracing capabilities of Honeycomb. This has been the first chance for this team to build a new service using OpenTelemetry. So it’s been sort of interesting to look at the differences between, you know, the exception-level information coming out of the spans in OpenTelemetry and the exception information we also get in Bugsnag. In some ways, Bugsnag is an easy way to consume error information, but it’s also nice to stay in a single pane of glass when we’re doing a lot of our analysis and investigation. This is one of those places where we have the exception-level information inside of Honeycomb, and we can continue to use this UI that’s really familiar to us to slice and dice this data. 
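As a generic illustration of how exception-level detail ends up on OpenTelemetry spans (and from there in Honeycomb), a pattern like the following is typical. The tracer name, attribute names, and `do_evaluation` helper are assumptions for the sketch, not Mode’s instrumentation.

```python
from opentelemetry import trace

tracer = trace.get_tracer("data-engine")  # tracer name is illustrative

def do_evaluation(query):
    # Placeholder for the real work; raises to show the error path.
    raise RuntimeError("provisioned throughput exceeded")

def evaluate(query):
    # One span per evaluation; exceptions recorded on the span can be
    # grouped by exception type in Honeycomb, alongside Bugsnag reports.
    with tracer.start_as_current_span("evaluate_query") as span:
        span.set_attribute("query.id", str(query.get("id")))
        try:
            return do_evaluation(query)
        except Exception as exc:
            span.record_exception(exc)
            span.set_status(trace.Status(trace.StatusCode.ERROR, str(exc)))
            raise
```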

Christine Yen: 

Very cool. What is this one? You’re grouping by something called query evaluation mode. 

Talia Trilling: 

Yeah. That’s a dimension that we added to Honeycomb that basically reflects what is happening with our LaunchDarkly flag. We kind of used this combination of Honeycomb and LaunchDarkly, where the phases we were talking about, fire and forget and then the other two, are stored in that Honeycomb dimension so that for any given datapoint we’re able to say: this is the phase, this is the part of the code that that data traversed, and actually pinpoint it. This query is looking at the average success rate grouped by which phase the datapoint is in, to get a sense of how each phase is impacting the success rate. 

Ryan Kennedy: 

Yeah. A nice thing here is you can see as we’re transitioning from phase to phase. It starts off with orange, which has a null value for the query evaluation mode. This is before the code actually went out, before we were starting to use the feature flag. Then you see, on June 11th, the purple line shoots up. That’s Flamingo-exclusive evaluation. This is the control for our particular experiment. You can see this is where the LaunchDarkly flag actually went live, but we’re still evaluating the control. 

Then June 28th or so is when we start turning on the fire and forget. You can see these vertical lines actually show where each one of the feature flags is being enabled at certain percentages for us. The success rate metric is either 100 for success or zero for failure. It works similarly to an SLI-type metric, where we can run an average of it to then get a percentage. 

We can see that Flamingo-exclusive is better than all the other evaluation methods. So there’s something for us to go and look at. Particularly for the Data Engine fire and forget, we saw the errors on the previous slide. Part of that contribution was, like Talia said, exhausting our DynamoDB capacity and realizing we underprovisioned our DynamoDB out of the gate. The nice thing is we ignored these responses. The responses were continuing to be generated by Flamingo. Users were never affected by this, but we were able to collect the data early on. If we would’ve automatically gone live with this, it would have just errored because we didn’t put enough quarters in the DynamoDB machine. Let’s go put in more and make sure the error goes away. 
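A rough sketch of how a flag-driven evaluation mode and a 0/100 success metric can be attached to each event, assuming an OpenTelemetry span per query. The flag key, attribute names, and `evaluate_in_mode` dispatcher are hypothetical, not Mode’s actual schema.

```python
from opentelemetry import trace

tracer = trace.get_tracer("flamingo")  # illustrative tracer name

def evaluate_in_mode(query, mode):
    # Placeholder for routing to Flamingo only, fire and forget, full
    # evaluation, or Data Engine only, depending on the flag value.
    return {"rows": [], "mode": mode}

def evaluate(query, flag_client, user_context):
    # LaunchDarkly-style flag lookup deciding which phase this query takes;
    # "query-evaluation-mode" is a hypothetical flag key.
    mode = flag_client.variation("query-evaluation-mode", user_context,
                                 "flamingo_exclusive")
    with tracer.start_as_current_span("evaluate_query") as span:
        # Record the phase so Honeycomb can group results by it.
        span.set_attribute("query.evaluation_mode", mode)
        try:
            result = evaluate_in_mode(query, mode)
            span.set_attribute("query.success_rate", 100)  # success
            return result
        except Exception:
            span.set_attribute("query.success_rate", 0)    # failure
            raise
```

Averaging `query.success_rate` grouped by `query.evaluation_mode` then reads as a percentage per phase, which is the SLI-like view in the chart being discussed.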

15:30

Christine Yen: 

Would it be fair to say that, you mentioned almost using this like an SLI, would you say this was the thing that you communicated externally? This is the standard for stable migration and then the previous slide is what you used to investigate? Help me understand what you would do or what you did when you started seeing the success rate dip down like we’re seeing in the green section. 

Ryan Kennedy: 

You got it exactly right. This is a coarse measurement of how the service is performing relative to the other service. If they’re relatively identical, well, it’s at least as reliable as the other one. That’s the core criterion. We don’t need to spend too much time looking at it. But if we see that it’s lower, okay, reliability is not as good. We should probably go look and investigate and find out why. We needed to figure out what we needed to fix, and fix that, before we moved on to the next phase. 

Christine Yen: 

Gotcha. Remind me. Sorry if you said this already, but what were some of the other metrics that were important to you? We’ve got a question from Martusa in the chat about measuring the success of this migration. 

Ryan Kennedy: 

Yeah. We set some really early success criteria with our product manager and the engineering manager for the project. We said, before we even start writing code, we want to have concrete metrics about what has to be true for us to go ahead and roll this out completely. You know, it has to be at least as reliable as the existing system. That’s where this metric comes in. We don’t want to be any worse on reliability. The other thing is we knew we were adding an additional network hop. Network hops can slow things down. They can impact your reliability. Saying we didn’t want to be any less reliable, that’s a little bit of a crapshoot there, network glitches and whatnot. We’ve been fortunate with that so far, knock on wood. 

But the latency metric has been another one. We negotiated with product early on. It’s probably going to be slower. How much slower are you willing to accept in terms of doing this migration, knowing that down the road, as we start swapping out the backend technologies on this, we might end up going faster? But we might have to endure this small period of time where we’re a little bit slower. So, we ended up coming to an agreement with product that the P50 and P99 latencies for Flamingo, the service invoking the Data Engine, would be no more than some threshold slower than its baseline performance. That’s what we’ll actually be evaluating in the last phase. How is the performance between Flamingo-only evaluation and Flamingo evaluation through the Data Engine? Are we within the bounds? If we’re in the bounds, we’re good to go. We can launch. If we’re not in the bounds, we need to have the conversation of, are we close? Do we want to relax a little bit? Do we want to find extra performance and squeeze it out of the system? Or do we need to decide to scrap the system? We had to say to ourselves that there’s a possibility this approach does not work, and we need to go back to the drawing board at the end of this. 
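To make the latency criterion concrete, here is a small sketch of the check it implies: compare the candidate path’s P50 and P99 against the baseline plus an agreed allowance. The function names and the standalone-Python framing are illustrative only; the actual thresholds and evaluation live in the team’s Honeycomb queries and their agreement with product.

```python
import math

def percentile(samples_ms, pct):
    """Nearest-rank percentile of a list of latency samples (milliseconds)."""
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def within_latency_budget(baseline_ms, candidate_ms, max_regression_ms):
    """True if the candidate path's P50 and P99 are each no more than
    max_regression_ms slower than the baseline's. The allowance is whatever
    was negotiated with product; nothing here is Mode's real number."""
    return all(
        percentile(candidate_ms, p) <= percentile(baseline_ms, p) + max_regression_ms
        for p in (50, 99)
    )

# Example with placeholder numbers: a 150 ms allowance on both percentiles.
baseline = [120, 135, 140, 160, 420]
candidate = [130, 150, 155, 180, 470]
print(within_latency_budget(baseline, candidate, max_regression_ms=150))  # True
```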

Christine Yen: 

Well, with that, let’s go on to some of these later phases where you’re looking into performance. 

Talia Trilling: 

Yeah. The second phase was full evaluation: send queries to both services, still returning the results from the original service to the user, but now creating an event in Honeycomb that lets us compare the results from the two services and see if there are any differences between them. That was important to us. Obviously, if we’re trying to switch out one service for another, the user experience should be the same, and the data should be the same, which sounds straightforward. It got a little complicated.

As you can see here, this is us looking at the new event that we created and grouping by whether or not differences were encountered. As you can see, the majority, there were no differences seen, but there were still a fair number of cases where the difference encountered was true. This is kind of interesting for us. It was a mix of some of them that were easy to figure out. Like, for example, I believe the old service was using microseconds, and the new one was using milliseconds. So that one was pretty easy to fix. 

Then, the one that continued to linger and was complicated for us, if you can go to the next slide, is floats. The majority of mismatches we’ve seen are of the cell type float. We were trying to figure out what the problem was, what was going on there. I think we are still trying to answer this question, but it’s a problem with floating-point loss of precision. Basically, this gives us a UI that allows us to look at what the mismatches we’re seeing actually are: are these things we need to fix, or are they things that don’t actually mean anything? 

21:00

Christine Yen: 

Wow. I’m so glad you captured, on the right-hand side, the derived column that you’re using to define this difference. It’s really cool to see inside your schema and especially such a detailed investigation like this. Do you have a ballpark of how many columns you have on each of these events since you seem to be grouping by and investigating all these different pieces of an execution? 

Talia Trilling: 

Yeah. You know, it’s not that many. I think it’s basically about what is the… because we check for mismatches on four dimensions: row count, column count, column type, and then cell mismatch. I believe we’ve only seen cell mismatches. Everything else has always been correct. I believe our columns are difference encountered, something that encodes which mismatch it was, and, because we only ended up seeing that cell mismatch, two columns that represent the column type and cell value for the mismatch in the old system as well as the column type and cell value for the mismatch in the new system, which sounds like a lot but is actually pretty straightforward when you just look at it. 

Ryan Kennedy: 

Yeah. The way we emit them, it’s one Honeycomb event per row difference that we detect. So it’s not a total evaluation. There’s a cap in there. I don’t know how many. 

Talia Trilling: 

Fifty. 

Ryan Kennedy: 

So up to 50 differences per query that we log into Honeycomb. If there’s a million query results, we don’t log that. We cap it at a useful subset for us to look at, which is how we get to this data that you can see here. 
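A sketch of the emit-one-event-per-difference pattern with a cap, just to make the shape of those events concrete. The field names and the `send_event` hook are illustrative; Mode’s real schema, event emission, and its additional checks for column count and column type will differ.

```python
MAX_DIFFS_PER_QUERY = 50  # cap so a huge result set can't flood Honeycomb

def compare_results(query_id, old_rows, new_rows, send_event):
    """Compare both services' result sets and emit one event per detected
    difference, up to the cap. send_event is whatever emits a Honeycomb
    event; field names here are illustrative, not Mode's schema."""
    if len(old_rows) != len(new_rows):
        send_event({"query.id": query_id, "difference_encountered": True,
                    "mismatch.kind": "row_count",
                    "old.row_count": len(old_rows),
                    "new.row_count": len(new_rows)})
        return
    emitted = 0
    for row_idx, (old_row, new_row) in enumerate(zip(old_rows, new_rows)):
        for col_idx, (old_cell, new_cell) in enumerate(zip(old_row, new_row)):
            if old_cell != new_cell:
                send_event({
                    "query.id": query_id,
                    "difference_encountered": True,
                    "mismatch.kind": "cell_value",
                    "row": row_idx, "column": col_idx,
                    "old.cell_type": type(old_cell).__name__,
                    "old.cell_value": old_cell,
                    "new.cell_type": type(new_cell).__name__,
                    "new.cell_value": new_cell,
                })
                emitted += 1
                if emitted >= MAX_DIFFS_PER_QUERY:
                    return
    if emitted == 0:
        send_event({"query.id": query_id, "difference_encountered": False})

# Usage example: emits a single cell_value difference event via print.
compare_results("q-1", [[1, 0.30000000000000004]], [[1, 0.3]], print)
```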

Christine Yen: 

Very cool. 

Ryan Kennedy: 

The interesting thing here is it turns out Honeycomb, as we discussed yesterday, has its own floating-point weirdness where all those zeros and negative zeros show up as their own groupings. It’s been funny just to see all the places floating-point weirdness rears its ugly head. 

Christine Yen:

Okay. I think we have someone looking into that on our side. Should we go on to the next phase?

Talia Trilling: 

This is the phase that we’re not actually in yet but is captured by the chart where we showed the before, during, and after: this is the after. Basically, once we feel confident enough that we can swap the two services out, we are going to exclusively send things to Data Engine, then look at that and basically compare it to the Flamingo-exclusive baseline. At that point, we can look at what the latency looks like, and actually look at measures that make us feel, not necessarily confident that performance has improved, because we don’t know that it will, but confident that performance has not declined massively. So we’re not at that phase, but I think it’s where we’re hoping to go next in the near future. 

Ryan Kennedy: 

Once we can explain and fix the data differences we have, we’ll feel comfortable moving on to this phase. I’m sure that the only thing worse than giving no answer is giving the wrong answer to someone when they’re trying to do an analysis. Something that we’re acutely concerned with is that we’re not giving people bad data that they’re then making big business decisions based on. It’s one of the worst things we can do to them. So making sure the results you’re getting are at least as correct as they were in the old system has been really important to us. 

Christine Yen: 

As Honeycomb is a customer of Mode, I very much appreciate that attention to detail as well. All right. Lessons learned.

25:15

Talia Trilling: 

Yeah. Definitely something we talked about that’s important is ahead of time, before we touched the code, saying here is what we expect to see in terms of latency, and, like, here are the acceptable bounds of changes in performance for us to feel like this is a success. Work with small changes. Don’t underestimate a system’s ability to get weird. That’s my personal favorite because that did happen a fair number of times. Ryan, is there anything you would want to add? 

Ryan Kennedy: 

Yeah. Nothing to add. That success criteria piece early on is the big one for me. I’ve worked on a lot of problems where we wanted to accomplish X, but there were no real concrete measurements to know whether we’d accomplished it. Are there other things we need to be worried about? Being able to sit down with our product management and engineering management early on and say, we need something concrete to latch on to and measure and show you and demonstrate. 

I think it’s been a big thing for people to have good confidence that Data Engine, when it goes out, will be successful, because we’ve actually thought about the criteria rather than saying once it’s done, we’re done. That was not good enough for us. We wanted to make sure we set ourselves some yard markers we wanted to make it to. Finding the smallest thing we could do was a big thing for us. There’s a lot of stuff we want to do with Data Engine. We had to pick that place to start and get to that point of fire and forget. 

All of our stuff used to run on EC2; now we’re doing stuff based on Terraform and ECS Fargate, and we wanted to see if anything weird fell out of those systems that we were not expecting. Let’s get this out now so we can see the problems rather than getting to the end and trying to roll out 10 different changes and trying to find the problem. That was a big thing for us, given the volume of changes in the system.

As Talia said, don’t underestimate the ability for things to get weird. The floating-point one is not one we expected going into the investigation. We pretty much know where it is. It turns out that’s probably been happening the whole time, and we just didn’t know it was happening, because floating-point addition is weird. Just be prepared for the unexpected. It’s going to happen. 

Christine Yen: 

How has it felt uncovering this weirdness, and has it changed your confidence, engineering confidence, in projects going forward? 

Ryan Kennedy: 

I will start. I think it feels bad at first. You find that this data doesn’t match. Oh, God! What did we do wrong? Fortunately, having the data in Honeycomb made it somewhat easy to say, oh, these floating-point numbers are very similar. We’re talking about billionths of a decimal point. This is sounding a lot like floating-point precision loss. Then you start feeling like this is maybe a to-be-expected thing. Then you find, okay, we’ve got the production case; because there’s a loss of precision, let’s find out where it’s happening. Let’s set breakpoints and find out where it’s happening. That helped steer us in the right direction. 

That helps us. We’ve thrown out a lot of code. If we run it this way, we get this answer. If we run it this way, we get a different answer. It gives you confidence that you found the problem and that you’re smart enough to find problems. It sort of builds on the stuff. You solve one thing that seemed intractable. Then you find out you’re smarter than you thought you were, and the tools help find the things. The future confidence for building things, I think is going to be wonderful for us.
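The kind of drift being described here is easy to reproduce, and a tolerance-based comparison is one common way to decide whether a float mismatch is meaningful. This is a generic illustration, not Mode’s actual fix.

```python
import math

a = 0.1 + 0.2
b = 0.3
print(a)            # 0.30000000000000004: binary floating point can't store 0.3 exactly
print(a == b)       # False, so a naive equality check reports a "mismatch"
print(a - b)        # ~5.55e-17, a vanishingly small residual

# Comparing with a relative tolerance treats values this close as equal,
# so representation drift doesn't get flagged as a real data difference.
print(math.isclose(a, b, rel_tol=1e-9))  # True
```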

Christine Yen: 

You are smarter than you thought you were, that’s something that is important here. When you talk about success criteria and especially what the PMs cared about, did that also encompass the business value of this project? Or are there other conversations you expect to have that will either come out of or be kind of supported by the investigations you’ve done so far in this migration? 

30:18

Talia Trilling: 

The thing that’s interesting to me about this is that the refactor in and of itself, like splitting apart the services, is not something that inherently provides business value, but doing it then opens us up to a lot of things that do provide business value. Really, the goal with this section of the work, and what Honeycomb really helped with, is being able to say, Hey, we can, in production, slowly start putting people on this new thing with, like, no visible impact to the user, and not feel like we have to take down a thing that is really important and that a lot of people rely upon. 

And then the thing that really provides business value is once these things are decoupled, like, all of these things we’ve been asked for and we’ve had to say it’s just not possible, they start to become possible. There’s a little bit of patience that we have to have, but I think that’s part of why Honeycomb was so helpful because we were able to say, Hey, the business value is maybe not being generated right now, but you can at least see that we aren’t harming anything. Then we can work toward something where there actually is a lot of value generated. 

Ryan Kennedy: 

Yeah. I think there was a lot of prework done with Honeycomb in looking at the ways we could improve; we use Honeycomb a lot for those things. There are other parts of the data pipeline, outside of analysis, that lead to analysis. Like ingesting data from a customer database: we rely on Honeycomb heavily for that. And we can say, hey, this is where the time is being lost. It’s being lost because we pulled data out of a customer’s database, and we turned it into a CSV and a JSON file. 

They’re expensive formats. If we can get away from that, we can speed up things. We can do it more incrementally. But we can’t do that until we do this other thing. There’s all this prework that goes into it that says, I want to build up a whole bunch of proposed business value using the metrics that we have and tell you we cannot get that business value until you let us do this other project because it is an obstacle to being able to do this other work. 

Christine Yen: 

That’s good argument building. You mentioned something you said early on. You said that Mode was very explicitly for cross-functional teams to answer questions about their systems. Is Honeycomb something that you’ve seen non-engineering teams use to, I don’t know, uncover weirdness in their worlds? What does that look like? 

Ryan Kennedy: 

Talia, do you want to take that one? 

Talia Trilling: 

All I was going to say is, customer success at Mode uses Honeycomb pretty heavily when customers are reporting issues: being able to say, Hey, I can see that this certain type of error is being encountered. I can see the frequency with which it’s happening. I can narrow down any number of dimensions that make me feel confident, because the data is relevant to the customer that’s having the problem. I think that’s helped our pipeline of kind of working through customer triage significantly. 

Ryan Kennedy: 

Yeah. I think one really good explicit example of that involves Fastly, a CDN provider we use; we ingest their access logs into Honeycomb. We had a problem once with customers calling in, complaining. We discovered that they were all isolated to a particular region. We dove in and found that, oh, there’s a Fastly point of presence that appears to have problems. We’ll get in touch with Fastly and ask if there’s something we can tell the customers about. 

We found out they were working on it. We went through the whole process of finding that, building out the queries, sharing them over with the CS team. I think that’s been the jumping-off point for the CS team. Oh, there’s a bunch of valuable information in Honeycomb. Can you teach us a little bit about how to use it? 

We’ve given them light training on how to use Honeycomb, where the data is at, how to get it and query it. They’ve built their own things now. They have a trigger now. If there’s an alert, they can find out that, in Berlin, there’s a higher incidence of errors for customers. If we get called, we know the first place to go and look. They’re building things to get ahead of it before customer complaints. That’s been really valuable for them in terms of getting access to the data and figuring out how to slice and dice it. The fact that we’re able to capture this stuff inside of Honeycomb has been great for them. They will get a customer that complains, Hey, all of my particular queries hooked up to Mode are not working. 

If they learned how to use the logging provider, maybe they would get something out of it, but we can drop into Honeycomb, and they can drop in the identifier for their database. They can group by errors. What’s the type of error? Connectivity. Credential problems. Did you roll your credentials on the database recently and not update Mode? They’ve been able to help us with that. 

35:58

Christine Yen: 

That’s awesome. I also don’t want to monopolize the question floor. If there are any attendees that have questions, please drop them in. Otherwise, are there any last-minute thoughts, Ryan, or Talia, from this project or from your exposure to Honeycomb? 

Ryan Kennedy: 

Talia, did you have anything you wanted to go with first? 

Talia Trilling: 

I think just that this was my first experience working with a non-time-series system, in terms of looking at events happening in the system, and that’s really changed how I kind of look at understanding what’s happening in the code. That’s been valuable for me. 

Ryan Kennedy: 

I think the thing for me is Honeycomb, more than a lot of the observability, monitoring, telemetry-type tools I’ve used in the past, pulls in other parts of the company easily. I think the incorporation of SLOs is a big one. We recently did Honeycomb workshops. We had days on instrumentation, and then collaborative debugging, and SLOs, where we invited people to come in on those. On the day on SLOs, we actually had engineering, customer support, and product go into breakout rooms and start saying: What type of SLOs would we like to set? 

Seeing those conversations happen between those three groups, and having folks in product and support talk about what’s important to them. Having an engineer say, we’ve got that, or we don’t have it yet, or we’ll put it on the backlog. I think Honeycomb is the only tool where I’ve seen that collaboration. With time-series tools, you don’t see product managers wanting to drop down into them, at least not most. We have some that will, but not all. Those tools don’t lend themselves to cross-functional use. They lend themselves to engineers and things engineers care about. I think Honeycomb is nailing a lot of that cross-functional interface with customers. 

Christine Yen: 

I think so. It’s almost not allowed for a technical person to say this, but I think one of the most valuable parts of adopting SLOs is those conversations, 100%. It’s getting folks into a room. You have to walk out with one thing. You have to all agree. I’m so glad to hear that. There’s so much. When Charity and I started this whole thing, we had, and still have, dreams of this; we say things like: When we hear that a customer’s CS teams and customer-facing teams are using this, that’s when we will have landed. That’s when we’ll know that that customer gets it enough and has stopped fighting fires long enough to start getting creative. 

Thank you so much for sharing your stories. This is so exciting to get to hear and to watch you all continue to kick butt with us. All right. I think that about wraps it. I will pause another, like, five seconds, if anyone has any last-minute questions. Thank you, all, everyone who joined us today. I hope you learned something. I hope you had fun digging into all the strange things we uncovered in Mode systems. Have a great rest of your week. 

Ryan Kennedy: 

Thank you. 

Christine Yen: 

Thank you again, Ryan and Talia.

If you see any typos in this text or have any questions, reach out to team@honeycomb.io.
