Ben Hartshorne [Engineering Manager|Honeycomb]:
Hello, everybody. And welcome to a conversation with a number of folks here about the lessons that we’ve learned implementing observability. I would like to invite on stage, seven other folks from this conference. Come on up. Hello. What a wonderful group of people. Thank you, all for coming here to have a conversation about the scars we’ve gained on this road to observability and how others can learn from our challenges. I want to start with introductions. A number of people have seen your talks. But a number of folks haven’t. If you could start with your name, company, title. Then we’ll get right into some of these stories. Frank, why don’t you start?
Frank Chen [Staff Software Engineer|Slack]:
Sure. I’m Frank Chen. My pronouns are he/his. I’m an Engineer at Slack. Glen, you want to go next?
Glen Mailer [Senior Staff Software Engineer|CircleCI]:
I’m Glen. He/him. I was a Software Engineer at CircleCI. I recently started a new job. Go, John.
John Casey [Principal Software Engineer|Red Hat]:
Hi. I’m John Casey. My pronouns are he/him. I work at Red Hat. Principal Software Engineer involved in the build pipeline for our products.
Michael Ericksen [Staff Site Reliability Engineer|IMO]:
Hey, everyone. Michael Ericksen. Pronouns he/him. I work as a Site Reliability Engineer at Intelligent Medical Objects.
Pierre Vincent [Head of SRE|Glofox]:
Hi, everyone. My name is Pierre Vincent, he/him. I’m Head of SRE at a company called Glofox where we build software for the fitness industry.
Renato Todorov [Global VP of Engineering|HelloFresh]:
Hi, everyone. My name is Renato. He/him. I work at HelloFresh as Global VP of Engineering.
Lovely. Thank you, all. Well, let’s just get right into it. Our goal today is to give the gift of hindsight to others in your roles. And understand how to walk this road to observability. Frank, could you kick us off? What is one pain point about Honeycomb or observability that you and your team have learned the hard way? And one thing you wish you had known sooner.
Yeah. Hey, folks. Complex interfaces are really complex. I think Honeycomb is a great tool. It requires some tinkering, and it requires bootstrapping ideas of what the explorable space is. When someone’s eyes glaze over as I bring up the query interface, my soul dies a little bit. The metaphor I’ve been using is to imagine finding a needle in a haystack. Before, you might need to redeploy the world to know what a needle looks like. But today the haystack tells you what isn’t hay, what doesn’t belong. Make it super easy to get started. Motivate folks with their specific business problem. One way to drive adoption is smaller group explorations and demos of a specific problem that a group is trying to solve in another context. Pair with folks, observe, trace around, and see what you find out. I think we’ve had a lot of help with Slack’s observability to build interfaces into Slack trace, which outputs to both Honeycomb and a few other real-time and analytic stores. That’s been super helpful for us in solving problems at Slack. Cheers.
Yeah, that feeling when you bring up the blank query builder. Even after using this thing for five years, I still feel like what do I do? So I appreciate you calling that out. Glen, do you have suggestions and ideas on how to help people overcome that hump? Or a story of your own?
Well, I was thinking about the prompt for this talk: what are the biggest things to avoid? So I thought more of the paper cuts to tell people to avoid. Field naming. I talked about this a little bit in an earlier talk today, where basically every Honeycomb dataset has 16 names for every field, depending on which one you look at. I feel like this is a little minor gap in Honeycomb. Like, if I could just pick the right field to replace them all and coalesce them into one, that’d be great. But yeah. I think it’s worth trying to keep a handle on that early on, especially for the really important things, the common things throughout your whole system. Just get a Wiki page, write them down, tell people to go there first. Don’t make it difficult, but make it so people can add to that list and reference everyone else’s list really quickly. That would be my main thing. Like, just try and get on top of your field names early and have a data dictionary. But don’t make it a gate-keeping bottleneck.
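Editor's sketch: one way the data dictionary idea above could look in practice is a small alias table that maps every known spelling of a field to a single canonical name before events are sent. This is only an illustration; the field names, the `DATA_DICTIONARY` table, and the `normalize_event` helper are hypothetical and were not described by the panel.

```python
# Hypothetical data dictionary: canonical field name -> aliases seen in the wild.
# Anyone can add a line here, so the list stays a reference, not a gate.
DATA_DICTIONARY = {
    "user.id": ["user_id", "userId", "uid"],
    "http.status_code": ["status", "status_code", "response.status"],
}


def build_alias_index(dictionary):
    """Flatten the dictionary into a lookup of any known name -> canonical name."""
    index = {}
    for canonical, aliases in dictionary.items():
        index[canonical] = canonical
        for alias in aliases:
            index[alias] = canonical
    return index


ALIAS_INDEX = build_alias_index(DATA_DICTIONARY)


def normalize_event(event):
    """Return a copy of the event with known aliases renamed.

    Unknown fields pass through untouched. If an event carries both the
    canonical name and an alias, the canonical value is kept.
    """
    normalized = {}
    for key, value in event.items():
        canonical = ALIAS_INDEX.get(key, key)
        if key == canonical or canonical not in normalized:
            normalized[canonical] = value
    return normalized


if __name__ == "__main__":
    print(normalize_event({"uid": 7, "status": 200, "team": "build"}))
```

Unknown fields pass through rather than being rejected, which keeps the dictionary from becoming the gate-keeping bottleneck warned about above.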
Well, you know, the problem with having 16 field names that are all the same thing is you can put a derived column in and have a 17th too. I love that don’t make it difficult. Renato, I feel like you could pick up that thread easily.
Exactly. There’s one thing I wish we knew when we started this journey, which is the size of the challenge that we were taking on when we decided to embrace observability. Don’t get me wrong, I definitely don’t regret that decision. But the team would have been able to be a little bit more prepared and probably make some different strategic decisions. That ties into the different names for the same datapoint. So I would break it down into two challenging parts: the technical challenges and the sociotechnical ones, which is a term Charity likes. On the technical side, we got held back by the lack of maturity of the available libraries. For example, OpenTracing, OpenCensus, and OpenTelemetry. Here, it was unfortunate timing.
We started our journey at the same time these libraries were merging. And the engineers that we served from the platform team ended up getting mixed messages from us. We started pushing them to use OpenCensus and at some point pushed them to use OpenTelemetry. Which in hindsight might not have been the best decision given OTel was a little bit too early stage and we could have been using it longer. Just to again reduce the cognitive load on people. And another technical problem, which we only identified later on, was the amount of work we needed to do from the platform side to make sure that the traces were propagated through all of the different hops, so that we could get data from load balancers and CDNs. This kind of proved to be a bigger task than we anticipated for the platform team.
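Editor's sketch: to make the propagation work described above concrete, here is roughly what each hop has to do. It assumes the W3C Trace Context `traceparent` header format (`version-traceid-parentid-flags`); the panel did not say which propagation format HelloFresh used. The helper names are hypothetical, and real services would normally rely on a tracing SDK's propagator instead of hand-rolled code.

```python
import re
import secrets

# W3C Trace Context, version 00: 32 hex trace id, 16 hex parent id, 2 hex flags.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    """Return (trace_id, parent_id, flags), or None if the header is unusable."""
    match = TRACEPARENT_RE.match(header or "")
    if not match:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero ids are invalid according to the spec.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return trace_id, parent_id, flags


def outgoing_headers(incoming_headers):
    """Build the header this hop should attach to its downstream calls.

    Assumes header names were already lower-cased by the caller.
    The trace id is carried over unchanged so all hops join one trace;
    only the span (parent) id is new for this hop.
    """
    parsed = parse_traceparent(incoming_headers.get("traceparent"))
    if parsed:
        trace_id, _upstream_span, flags = parsed
    else:
        # No usable context arrived (for example, a proxy dropped the header):
        # start a new trace, marked as sampled.
        trace_id, flags = secrets.token_hex(16), "01"
    span_id = secrets.token_hex(8)
    return {"traceparent": f"00-{trace_id}-{span_id}-{flags}"}
```

Any hop that drops or fails to forward the header, such as a load balancer or CDN, splits the trace in two, which is one reason this can turn into a larger platform task than expected.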
Now to the sociotechnical challenges, which were quite interesting as well. I think timing also played a key role here. We weren’t expecting 300 people to immediately jump in when we said observability is the thing, but we also didn’t expect such low engagement. We didn’t consider the cognitive load that people were already dealing with. When we pushed for adoption, people were busy working on other stuff. And we were just dumping an extra task on them. So this approach creates stress on both sides: on the developers, because they have platform people and SREs bugging them all the time, and on us, because people are not engaging with innovations. So I think knowing about these things wouldn’t have changed our decision, but we could have avoided some bumps here and there.
I’d like to react to that on the sociotechnical side of things. I’ve been in different places, different companies, where we’ve tried, on the SRE side of things, to really push people to understand production better. Because as SREs, we have a massive amount of curiosity. We see something that spikes. We see something that sticks out. It’s like, okay. Look. I’m going to spend the next hour on this because it’s going to bug me if I don’t know what it is. It’s kind of hard to sometimes understand that other engineers just don’t feel that way. And in some ways, this is what you were saying: okay, no, you go and instrument your service. And that, without the context, I feel, most of the time is just bound to fail. Because they have their backlogs. It just feels like you’re adding more stuff.
One thing that seemed to have worked a little bit better is just to try and have that stuff organically seep in. I guess try to get that curiosity mindset going for everybody: hey, there’s some weird stuff happening in your production environment. It’s actually a little bit of a treasure hunt sometimes to just find what’s going on. Showing, after the fact, some kind of weird incident and just doing a five-minute video of, this is some weird thing and this is how we figured it out with Honeycomb. This is how we found that needle in that haystack. I find that’s a pretty cool way to just entice people. It’s kind of a game, right? You can get that benefit in your own services by instrumenting them. Rather than just saying, here is another task that you’ve got to do, it’s, you know, here is how that stuff can become more and more interesting. Oh, I want that. I want that for my services, right? I still wish everybody had some of the curiosity I see SREs having.
I think both those points from Pierre and Renato really resonate with my experience as well. When I’m talking with our engineering teams about the value of observability, the value of curiosity in their production environment, one piece of feedback I get from them is, we’ve added the story to our backlog: make application observable. It’s a big hurdle to get over. It’s okay to start smaller. Maybe find the application endpoint that’s most critical to your business. Maybe it’s performance issues you’re having a hard time sussing out. You can start much smaller than “the whole thing needs to be observable.” You just sort of need a hook into the system, and then I think your engineers start to progress on that trajectory on their own, especially as you’re providing them tools that facilitate that learning curve.
I’ll just chime in. I love the idea of doing a treasure hunt. Almost making it a competition to see who can come up with the most arcane set of cascading impacts or whatever. You know, just like, here’s the thing that I found. And giving people time to do that would be fantastic. I love the idea that you don’t have to have the entire thing done to get the benefit. I mean, the story of what we’re doing is we’re starting in one place and we’re trying to build our way out. The more you’re able to cultivate that curiosity, and that takes time. You have to give people space and time to have curiosity, because time pressure kills curiosity. But the more you’re able to do that, the more you wind up running up against blank walls. There’s nothing worse than seeing it disappear. You have to build the next stage so that you can see what’s going on a little further out. Honestly, that’s a pretty good description of how we progressed.
I have a fun story about treasure hunts. About two years ago we were supposed to have a call with Michael from Honeycomb on the platform with us. We had just dumped our production logs into the system. And we had an incident that started five minutes, or 30 minutes, prior to the call. The incident was going on. Platform people were called. And I was one minute away from telling Michael to postpone the call because we were in the middle of an incident. And then someone said, why don’t we ask him to help us find the issue? So he actually did it. Five minutes after the call started, we were able to pinpoint the issue as being due to some retries connecting our monolith codebase to the database. And we couldn’t see these from any of our other dashboards. We had a hundred other dashboards. We were looking into RDS and applications. But there was a very tricky locking issue that we just couldn’t pinpoint elsewhere. And five minutes into the call, he was able to help us out. That was the buy-in that we needed for signing the contract, let’s say.
That’s a great demo.
Yeah, I love this idea, the curiosity mindset. And making little videos here’s what you can have too. Here you go.
I think those are a game-changer as well. This gets people really excited to, like, try and cover an end-to-end user journey and get SLOs as a bonus.
I’m curious for the other panelists, have you seen observability as a more bottom-up grassroots type of initiative effort? Or have you sort of had more of a top-down mandate to move towards observability? I’ve been trying to pick up the threads of who’s done it in what way. I’m curious how it’s functioned in each of your organizations. We’ve definitely been sort of bottom-up grassroots.
I want to take this one. Because we actually kind of did both in a really weird way. Not weird. Kind of a good way. Basically what happened was that myself and a couple of other engineers, well, actually our first time with Honeycomb was when something was broken and I signed up for the free trial, threw some data at it, and got answers. It was great. But yeah. And then we picked a couple of applications at the edge, from people who were interested in Honeycomb and had an application they wanted to see stuff from. We just grabbed an SDK and started wiring stuff in. That got success and we started talking to people. Then we had the internal tooling team, part of the platform organization at Circle. They said, okay. We will build the standard library that will be used in the applications. We’ll deploy a collector, manage Refinery. And they took over the tracing aspect of things. And then we’re responsible for getting that into other teams’ applications, coaching them, sort of owning that experience. Yeah, it very much started little here and then kind of jumped here and came back to the middle again. Kind of all of the above.
I would say on our side it’s been kind of both. The observability, the drive to be able to see what’s going on in production is mostly born out of a need to be responsive to our stakeholders, our users. And so I guess you could say we started kind of in the grassroots just in a pocket of the company. But now there’s a lot of interest in kind of holding the organizations themselves, you know, inside the company to kind of standards about reliability and responsiveness to things like that. So we’re getting a little bit more kind of management level interest in the numbers we’re putting up. Kind of the SLO numbers, sort of. And so it’s… we’re in this experience of kind of starting to meet in the middle building toward each other, I guess.
Yeah. And I’d love to speak to a similar experience as Renato. At Slack, and I shared a little bit of this story in my talk a few hours ago, we implemented our first cross-service trace in the middle of the second day of a multi-day incident involving Git LFS, which was strangely affecting a small portion of the fleet and had, of course, cascading failure scenarios for other portions. I work on productivity and build tools, where a lot of the events, when they’re either lost or have some sort of issue, are absolutely critical. And that story… and I feel like this should hopefully strike… stories have this way of sharing some really hard concepts really easily. And with that single cross-service trace, I think two hours later our incident was over. And during the talk, I hadn’t realized this yet, but a few days just before this multi-day, multi-team incident, we had kind of the same problem. And so this was, like, incident two, day two of a multi-team incident. And so that, I think, really helped build a lot of interest in how other teams inside of Slack could adopt tracing and use this tooling.
I love this varied collection that is all reflections of, you know, complicated. The work we do is complicated. It’s thick. There are a lot of different pieces. Renato, you mentioned paving the easy path. Pierre, you mentioned the little videos. Frank, telling stories as a way of communicating these. I feel like there’s this collection of methods for reducing cognitive load or communicating these complicated bits. I would love to hear of some other tools that you’ve used for that as you’re bringing observability to your teams.
Storytelling has been really key for us. I think, like Frank said, it’s a really powerful tool. And especially if you can connect some of those stories. Almost like an essay. The thesis statement in this essay, like “In this essay, I will…” It doesn’t have to be that reductive. But then tie that value from observability to things that matter in your organization, whether they’re objectives and key results or recent incidents where things didn’t go as planned and observability would have helped out. Organizational value. Connect that to your business. And I think those stories can really help folks who maybe weren’t at the sharp edge of that production issue to start to understand how those might fit into how they ship software going forward.
Yeah. I think there’s a common theme I hear in pretty much every Honeycomb adoption. It’s not like someone is adopting Honeycomb or observability tooling because they heard it was cool. It’s because they’re blind and they want to see. It’s like they want to achieve something. They want to really get into their systems. And I think once you’ve got a few people starting to get into it and start to experience it, the trick is then finding ways to bring people along for the journey. If you’ve got an incident process that isn’t chat-based, I’ve been at companies that have a chat-based process. And I love it. Then after half an hour, it’s like should we start a Zoom? And you get crickets in the chat. And you’re like, what’s going on? I can’t tell anymore. Because it’s gone into this video.
I think being able to see that flow of chat with the Honeycomb queries in it and then referring back to that later, that sort of thing where you see, oh, well, I can see how as they were investigating that incident, they were dropping queries all over the place. I could see the development of the query over time. And yeah, you can pull that back out from the Honeycomb query history, but being able to see it alongside the discussion in the chat, somebody going what about this? I think that’s powerful. Like if you’re newer at a company, and you’re not involved in incident management, just go and watch the next incident. Sit in and see what’s happening. You’ll learn so much so quickly from that.
I think this whole idea of storytelling goes hand in hand with product thinking as well. I think the SRE space has a lot of opportunities for applying product thinking as a technique. In our platform, we always try to hypothesize about the problem. Then you have to see whether the hypothesis is valid and whether this is worth solving. If you look at MTTR as one of your DORA metrics, you can use it to look at the investment in making this better: if we reduce MTTR by 30%, we can save X amount of money on a monthly basis.
This kind of stuff also really helps create some engagement. And it also helps create empathy with the product owners of the organization. Because there is always this tradeoff between building new features and running the system in production. Not always are they aware of the cost of running things in production or the cost of actually not running things in production. I think product thinking is a really good technique I like to use a lot when it comes to platform topics.
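Editor's sketch: the MTTR argument above comes down to simple arithmetic. The version below assumes, purely for illustration, that incident cost scales linearly with time to restore. The function names and any figures passed in are hypothetical, not numbers from the panel.

```python
def monthly_incident_cost(incidents_per_month, mttr_hours, cost_per_hour):
    """Estimated monthly cost of incidents under a linear cost model."""
    return incidents_per_month * mttr_hours * cost_per_hour


def monthly_savings(incidents_per_month, mttr_hours, cost_per_hour,
                    mttr_reduction=0.30):
    """Estimated monthly savings if MTTR drops by the given fraction.

    mttr_reduction is a fraction between 0 and 1; 0.30 mirrors the
    30% figure used as an example in the discussion.
    """
    if not 0.0 <= mttr_reduction <= 1.0:
        raise ValueError("mttr_reduction must be between 0 and 1")
    baseline = monthly_incident_cost(incidents_per_month, mttr_hours, cost_per_hour)
    improved = monthly_incident_cost(
        incidents_per_month, mttr_hours * (1.0 - mttr_reduction), cost_per_hour
    )
    return baseline - improved


if __name__ == "__main__":
    # Made-up inputs: 10 incidents a month, 2 hours each, 1000 per hour.
    print(monthly_savings(10, 2.0, 1000.0))
```

A real estimate would also need the cost of the investment itself, and incident cost is rarely linear in duration, so treat this as a way to frame the hypothesis rather than a forecast.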
This is maybe a bit of a divergence from that. But just keying off this idea of, you know, the expense of running things in production, this is something that I feel keenly. We started out our path, you know, hosting our own. And at Red Hat, we have a lot of expertise just laying around. If you get in touch with the right people, you can learn to do a lot of things. I think that may have led us to a bit of a false sense of security in terms of what we could do for ourselves in terms of doing support. We were using an internal aggregated logging system because we had a team already doing this other kind of stuff. And the problem is that we didn’t really spend enough time thinking about whether they were actually doing that in a way that was going to be highly reliable. And so as a result, we were trying to do production support with tools that themselves were not, like, production capable. They weren’t production reliable, if you will. And, you know, we have people in Europe who can’t do queries on Kibana because the latency is too high. Things like that.
It confuses things because you end up sending mixed messages about how to do support. And you don’t really know what you can rely on and things like that. Incident response becomes much harder. But it has me thinking of the recursive nature of reliability, and the tooling used to support things, and that kind of thing. If I’d thought about that upfront, you know, what happens when your aggregated solution isn’t there? Because guess what? You’re doing incident response over here and you’re also running that thing. And that’s just falling over. It makes the question a much clearer thing, I think.
That sort of story reminds me of the early days of public cloud adoption. Imagine someone’s got their infrastructure team. You’ve got to talk to the data center team and you’re like, can I have a server? They’re like, come back in three months. Or: I’ve got a credit card and can have that now. Ten years ago this was groundbreaking. And now we can say that’s normal. But that applies to basically every vendor. If you are forced to use your internal team’s tooling, then the incentives are probably not going to align with the things you need from them. Honeycomb is out there trying to grow the market, like, expand their market share and build a better product. Whereas maybe your internal team isn’t interested in those things. Maybe they have different goals. And I think if you’re able to choose, then A, you can choose a vendor. And B, your internal team may say, no, we’re going to make it worthwhile for you to choose us.
I think it’s probably down also to the fact there is a lot of stuff that has been out there for quite a long time. Everybody has been running Elasticsearch for a long time. Why would we need to do anything else? Why would you need to send a big check to Honeycomb or the next crowd if, sure, we have Elasticsearch here? We don’t do context. That just gets lost. We don’t have the scale of Slack or Red Hat. And our SRE team is three people. Right? So those choices of, will I run one extra thing instead of going to the vendor, are super, super critical. It’s like, I want to do my feature flags. I’m going to go to vendor X. I’m going to do my observability. I’m going to go to Honeycomb. From the balance sheet, this is costing that much. But you’re running the Elasticsearch, which is probably going to fall over when you actually need it. And then you’ll be like, oh, now we also need to manage this on top of actually responding to the problem, as you said, John. This is a recipe for disaster. And then the minute you lose people’s trust in what you give them, it’s gone. It’s like, oh, I’m not going to go to Kibana because the last time I went there, half the stuff I wanted wasn’t there. Or it was duplicates. I’m picking on ELK, but it’s whatever you run. And I think we’ve probably seen that multiple times over.
You kind of feel like you’ve got this space-age technology to help you do production support. But you’re still stuck with sticks because things fall over.
Regular expressions are a very sharp stick.
We talked a little bit about pain points with adopting Honeycomb and also this build versus buy. One of the things, especially for folks who might not be familiar with Honeycomb: it was super generous. We were doing a lot on the free tier before we had to consider leveling up. We saw a lot of interesting things in production on the free tier before we moved up.
You don’t even have to go… I mean you can go get OpenTelemetry and Beelines, or whatever. But if you have a log system where you’ve got all the data and you can’t explore it very well, export some logs, get Honeytail, throw them into Honeycomb. And, you know, find out what it’s like to search your logs properly.
Well, we are just about at time for this session. Thank you, both, for those wonderful plugs. Really this has been a real pleasure. It’s fascinating to me to hear different stories of how Honeycomb has worked its way into all of your organizations and the easy parts and difficult parts along the way. And I’m really excited to keep going on this road and see what comes next.
This wraps up our panel. There’s going to be a little bit more conversation on this stage reflecting on the conference. And then I think some workshops after. The agenda’s up there. Thank you, all, very much. It’s been a real pleasure. Talk to y’all later.