EAZE INTO OBSERVABILITY


Transcript:

Pie:

Welcome, early joiners. I guess on-time joiners. We’re going to wait a couple of minutes. We’ll start at 10:02 so everyone has a chance to join us. Hold tight. We’re ready to tell the story, but we want everyone to be here. For those of you who have just joined us, we’re going to start at 10:02 to give everyone a chance to get in here. Some people are probably installing the Zoom client right now.

Ceej:  

The scariest thing I ever did involved one of those super slick conferences. They were recording it for a professional production, and there were cameramen on swooping cranes, and I thought, I’m not going to think about that. Ben, you’ve done stuff on stage, right?

Ben G:

Yeah. I’ve done a few things. So you’re saying the cranes scared you? 

Ceej:

The crane, it brought things up to like oh, God. A lot of strangers are looking at me right now.

Ben G:

Any time somebody shows up with a steadicam rig, it’s always like what the hell is that. 

02:43

Pie: 

Yeah, we’re just about to start. All right. So we’re going to get started. Welcome to Eaze into Observability. We love puns here. We’re going to talk today with a couple of folks about how to build a reliable service with observability.

First I want to point your attention to our options for having captions. If you’re interested in captions, there’s a URL in the chat, or you can turn on captions at the bottom of your screen. It should be a menu option. If you’re seeing captions and they work, let us know. And let us know in the chat box if you can hear us. I think people can hear us. We’re going to get going. Great. Thank you, sir.

We are also going to be recording this webinar. We’re doing it right now. Later on we’ll be making that recording available to you by email, so if you signed up and you have to drop out at some point due to the real world, you’ll be able to watch and catch up later. Another option you should know about is that we’ll be taking questions at the end of the webcast. You can ask them now in the Q and A panel that’s also available in your menu, but we will not be getting to them until the end. You can ask a question and we will check it out at the end. Thanks very much to Nikki Rose for doing the captions. Also, I’m seeing a couple of chats. We have Kelly in the chat, available to answer any questions that I’m not able to get to. So, I think that’s all the housekeeping. Let’s move on to introductions.

First thing I want to say is we all look like we’re not taking any crap right now. This is actually a really great story. So bear with us. Welcome to CJ Silverio — and I’m Rachel Perkins, and this is Ben Gardella. Ben and CJ are from Eaze.

Ceej:

I’ve been in Silicon Valley for about 30 years, which is a terrifying amount of time. I’ve worked in start-ups most of that time and done everything from technical writing, to writing cell phone apps in Java, to making distributed systems work. Before I was at Eaze I was at npm, making your node_modules folder very large very quickly.

Ben G:

My name is Ben Gardella; I’m the Infrastructure Manager at Eaze. I spent most of my career as a dev, mostly Java. I’m not far behind CJ in the time that I’ve spent in this particular universe. I made the transition over to the DevOps side maybe five, six years ago, and I’m probably never going to leave until it’s time to leave. So yeah. That’s me right now.

Pie:

Welcome. And I’m Rachel Perkins — a lot of people know me as Pie. I’ve been a tech writer and ops-adjacent my entire adult life. Now I’m in marketing for Honeycomb. So let’s move directly into what the past looked like for you. Can you talk us through what you walked into, especially CJ?

Ceej:

Yeah. I think I joined like less than a month after you.

Ben G:

Three weeks. I have three weeks on you. 

06:51

Ceej:

So we walked into the same situation, and I think we were part of the response to the situation, which is that it was an enormous .NET monolith — it was over four years old, it had been written as a prototype, pushed into production, and written through a whole lot of industry changes in that time. It started out as a concept of medical marijuana delivery, and the same platform was mutated through legalization in California. There was an attempt to microservicize it, in the sense that a lot of new code was written around it in a little cluster of Node.js microservices. There wasn’t a lot of architectural thought behind that. Everything was running on AWS. Ben will talk about that a little more. They hired me because I don’t believe they had ever had a single big day on which they stayed up. You would have a 4/20, which is a big thing in the marijuana industry, as you might imagine.

Pie:

What?  

Ceej:

And you know, dispensaries would be, you know, planning for a lot of load and the site would go down, and this happened in the two years previous to when we got there. And there was a little bit of a run-up to this year’s 4/20.

08:24

Ben G:

My first day — I want to talk about two days: my first day, and the first day that CJ arrived, because they’re similar. My first day I sat down, you know, got acclimated. I did have a solid mentor who’s still right next to me every day, and he showed me the ropes. But what I noticed immediately is — they had this thing called deploy party. And it involved trying to deploy code, having it fail. Trying to deploy code, having it fail. And I kind of looked over and said, what’s going on over there?  And they said oh, yeah. It’s just how things happen. It takes four hours. And I was like-

Pie:

That’s quite a party. 

Ben G:

Oh, my God. And how often?  Every day you do this? Every day?  Yeah, okay. Let’s not do that anymore. So I got immediately involved in that for a few weeks. And I didn’t really fix anything in those three weeks, and then CJ showed up — speaking of which, I have to call out that once I discovered that her Malcolm Tucker bot was actually her, I was like, this is someone I can work with.

[ Laughter ]

You know, the thing — 

Ceej:  

Did we lose Ben?

Ben G:

What happened?

Pie:  

You just froze for a moment. 

Ben G:  

Oh, it says it’s unstable. Am I good?  

Pie:

You’re good.

Ben G:

Okay. And so CJ’s focus was far more on the applications — she did a quick dive on that and I just sort of watched from afar. And then things started to really go sideways in March. That was our first real outage, and that’s when I saw Malcolm Tucker show up.

Ceej:

Malcolm Tucker does not have a very polite mouth. 

Ben G:

Malcolm is my inner love child. So I knew that I had an ally, in the sense that we needed something. And I had followed — I had followed Honeycomb and Charity for about two years, not understanding a word she was talking about. But I did have an idea that there was a new way of seeing some microservice catastrophes. I didn’t say anything, but CJ said it first and I was like yes, let’s try something. Because I couldn’t make sense of it. Looking at AWS, whatever logs we had, whatever observability we thought we had, it was a lot of regex soup. All of what I would call first generation logging services that I’m intimately aware of — Loggly is what we had here, Slack — not Slack. I’m forgetting them all now. But they all require some sort of prior knowledge, regexes and service name tags — so that every time I start a new job, it’s: what’s the super secret, who has the saved queries?

And the awful naming. Can we talk about naming for a minute, CJ?  The absolute disaster of the legacy code naming — cute names for things that meant nothing to anyone. No context. And there were people that had been there for almost the entire time, and some people for a year. And there’s just passed-on knowledge that you have to acquire over time. And it was steep. And yeah, CJ.

12:13

Ceej:

I want to talk about the observability situation we were in. I want to go back to those tools. The team had some tools. They did have all the logs going into Loggly. They had a hosted Graphite setup and were not using it, and were not emitting any application-specific metrics. And they had some decent front end observability. Sentry was great, but it wasn’t wired up to anything. They thought Loggly was great. One of the people I interviewed with said there isn’t a lot of observability stuff for .NET — and Honeycomb immediately came to mind. I’m like, oh, if you hire me I’m going to blow your mind on that one. Because I don’t think anybody realized their story could be better. I think they thought that logging was it.

Pie:

Classic. Mired in it. You’re down in it and you don’t see a way out. You’re like, this is fine. 

Ceej:

Right. And you know, the scaling situation therein, you can move to your next slide if you would like. I’m sorry — 

Pie:

It’s all good. 

Ceej:

This thing had been growing, and a fairly naive solution really works totally well, right, when you’re small. It’s like, do you have product-market fit? Can you change what you have well enough until you have a thing that does what your business needs? They had arrived there quite early. It turns out that people like legal marijuana delivery just as much as they like legal pizza delivery.

Pie:

Shocking. 

Ceej:

This is absolutely fantastic in the state of California. I’m on the peninsula right now, south of San Francisco. There are no legal dispensaries here because Palo Alto doesn’t like it. It starts succeeding, and that’s when the trouble starts. That’s when your naive solutions stop working. You’ve got to take it to the next level. You start seeing the system slowing down, and it manifested on every big business day. It was a classic distributed systems cascade failure, where you’re like okay, this one thing is slowing down, we see a symptom over here and we don’t know what it is, and suddenly everything is down. So 4/20 this year was my first chance, my second chance —

Ben G:

You’re forgetting March.

Ceej: 

March. 

Pie:

You had initially — to set the stage here. You came to this organization. They had this older platform built for a particular purpose. It was kind of working. And it was specifically to deliver medical marijuana to people who had prescriptions. So that was pretty scoped. And all of a sudden California legalized recreational marijuana. And suddenly the business is booming, and that’s a great problem to have, but it still left you with some serious problems.

15:36

Ceej:

Success is a catastrophe that you have to survive. It’s the classic bind: if you engineer for the kind of wild scale you hope to reach, you’re probably over-complexifying early on. But you have to be able to rewrite and scale later on. And doing that, putting yourself in a position to do that, is work. And forethought. And information. The first signs were in March, on a perfectly normal Friday night. The site went down. Now, I say the site —

Ben G:

Quickly. 

Ceej:

The site went down. Like, that’s very vague. What does it mean that the site went down?  The team did not understand why it had gone down. Like, no idea why. Okay. It got really slow and then it stopped responding — and we had to put it in maintenance mode, let it cool down, turn it back on, at which point it fell over immediately again. Why?

Ben G: 

I will say, actually, CJ, there were several people that knew the why. But they had not really been empowered to get that type of stuff to the forefront. Because it was ship, ship, ship, ship.

Ceej:

So there’s all kinds of reasons you get there. 

Ben G:

Yeah, yeah. There was definitely some engineering understanding of, oh, this is going to happen. There were plenty of people telling leadership in the past that we had no headroom. It was very much — cache is full, when cache full, database go down, everybody goes down. The curve is really sharp. So it didn’t take some people by surprise, but it took everybody else, including CJ and me, by surprise.

Pie:

You need that visibility.

Ceej:

You need visibility, and the answers weren’t there. So at this point, I’m walking into sort of a place where any data at all would be a help. The first attempt was to try to use the tools we had — which was to get hosted Graphite hooked up, to at least get some data about what’s going into our Redis. This business had completely saturated Amazon’s largest ElastiCache instance. And I said, something’s very wrong here, because we’re not doing that much. We can’t possibly — why is it?  And no insight. And logs aren’t telling you that. Not unless you’ve thought in advance to frame the question that way. Logs are perfectly great. I don’t want to give them up. I need them for forensics and security analysis.

Pie:

Is that when you brought in Honeycomb?  After that March situation? 

Ceej:

It was after 4/20 when it happened again.

Pie:

Okay. 

Ceej:

And at this point I had leverage. Like, okay —

Pie:

The biggest day, oh, my God. 

18:41

Ceej: 

It’s the biggest day of the marijuana industry’s year. You do a lot of preparation in advance for it as a business. You get your supply chains in place, your dispensary partners are spending money for inventory — all the industry revs up in order to do banner business on that day. And Eaze was down for eight hours on that day. It was some unbelievable like — 

Pie:

It was hours.

Ceej:

It was hours.

Ben G:

Not all in a row. We went up and down. By that time, I will say that my biggest improvement to the system was — I had created the ability to bring up a maintenance page via a chatops command. So we did that a lot. We put it up and down, up and down, all day long.
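
For readers following along, here is a minimal sketch of what a chatops maintenance toggle like Ben's can look like, assuming an Express app and a Slack slash command pointed at a /maintenance endpoint. The command name, the in-process flag, and the page copy are all hypothetical; in a real deployment the switch would more likely flip something at the load balancer than inside the app itself.

```js
const express = require("express");

const app = express();
app.use(express.urlencoded({ extended: false })); // Slack posts form-encoded bodies

let maintenanceMode = false; // hypothetical in-process flag

// Slash command handler: "/maintenance on" or "/maintenance off"
app.post("/maintenance", (req, res) => {
  maintenanceMode = req.body.text === "on";
  res.json({ text: `Maintenance page is ${maintenanceMode ? "UP" : "DOWN"}` });
});

// While the flag is set, every other request gets the maintenance page
app.use((req, res, next) => {
  if (maintenanceMode) {
    return res.status(503).send("Back soon.");
  }
  next();
});

app.listen(3000);
```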

Ceej:

At this point we had the leverage and the executive attention. When the executive over me had to go and apologize to dispensary partners in person — you have a problem that your business is taking seriously, and I was able to say okay, we’re going to stop doing these things. Stop — we’ve got logs, great, whatever. We’re going to do the work to integrate Honeycomb into this system so we could figure out what’s going on.

The other, you know — the fact is, there was a lot of turnover in the engineering team. The system burned people out. This environment burns people out. When it takes four hours of human attention to run a deploy — or you’re down all the time and your response is to just pour human time or money into it — it’s stressful.

So at this point, okay. We’re going to do things radically differently. We are going to instrument with Honeycomb. And the first thing I did was I got it set up and I convinced my boss — we’re going to spend this money, we’re going to do this, it’s going to pay off, please trust me. Ben was my first ally internally. Because I think no one had figured out yet that this was going to be better than logs. It may seem surprising to everyone listening to this webcast: we have logs, why do we need this? I can search logs — I can draw this graph. So Ben just started pumping our ELB logs into it.

Ben G:

It took me 10 minutes. I’ll have you know, it was the easiest thing I ever did in the history of my ops career. So it was like, please, do something. And that’s all I had to do — I turned on logging for all the ELBs, pointed them at Honeycomb, and voila!
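
That 10-minute integration is worth sketching. Honeycomb ships prebuilt AWS integrations for exactly this, so none of the code below was actually required; but conceptually, turning ELB access logs into Honeycomb events amounts to something like the following, using the libhoney Node client. The dataset name, write key, and parsed fields are illustrative, not Eaze's schema.

```js
const Libhoney = require("libhoney");

const hny = new Libhoney({
  writeKey: "YOUR_WRITE_KEY", // placeholder
  dataset: "elb-logs",        // placeholder
});

// Pull a few interesting fields out of one classic ELB access log line.
// A real parser would also handle the quoted request and user-agent fields.
function elbLineToFields(line) {
  const f = line.split(" ");
  return {
    timestamp: f[0],
    elb: f[1],
    client: f[2],
    backend_processing_time_s: parseFloat(f[5]),
    elb_status_code: parseInt(f[7], 10),
  };
}

// One structured event per request: enough to start asking questions.
function shipLine(line) {
  const ev = hny.newEvent();
  ev.add(elbLineToFields(line));
  ev.send(); // libhoney batches and sends asynchronously
}
```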

Ceej:

So this is like — 

Ben G:

Do you remember the first thing you saw, CJ? 

Ceej:

No. What was the — it was some surprising status response. Like out of bounds.

21:36

Ben G:  

Mine was the sheer volume of 403s everywhere — while the site was fine. Just the sheer — how do I put this? So many of our little microservices around our monolith just kind of flap, for a lot of reasons. There’s a lot of noise. And I was surprised to notice just how much background noise there was. Oh, there’s an occasional burst of 500s. What is that?  That’s a driver shift change. There’s a time when all the drivers on the road end their shifts and the next ones start. And there would be spikes of 400s and 500s — and it was the first time you could see that and go — why is that, why is that, how is that happening? Who’s doing that? And that was just the tip of the iceberg.

Pie:

Seeing that — you mentioned your bosses had to go talk to the partners. They spent a ton of money getting ready for 4/20 and didn’t get to take advantage of your service. It seems at that point you needed to have something behind, we’re doing something differently.

Ceej:

This is — are we fixing it? And the answer is yes, and this is how we’re going to measure our progress. We’re going to be able to put a number to the situation right now. Put numbers to how many active orders we can have in the system, how many active drivers we can have on the road before things start getting hot. And then look at our progress moving those numbers forward. It’s — those high-level proxy metrics are hard.

So in the short term, what we did was look at other, much more immediately meaningful engineering numbers: the number of requests coming in, what the request rate looks like during certain events. All these things are possible to do with metrics, but the integration with everything was work — and the team was pretty stressed at this point. The easier wins were super valuable to us. It was the ease of integrating ELB logging. Just ELB logging — which has enough structure to turn into data, which is enough to start framing questions, answering those questions, observing some changes — and that was the step that got them on board behind the harder work of integrating Honeycomb with our .NET monolith. Integrating it with the Node web services APIs was easy, but the legacy code base took a little more.

Pie:  

At this point you figured out some things in your existing platform, using Honeycomb, and you saw a way forward. But you had to make a big decision anyway. And what happened at that point?

24:40

Ceej:

I eventually reached the point where I said, okay — I don’t think we should iterate on this in place. We have to iterate on this in place to survive, but I think we should rewrite. This period post-4/20 was a big discovery and research period, where we took a team that didn’t really understand the code base very well and sent them into it to document it. What are our APIs? How are our clients using our APIs? What’s going on in the back end in this driver dispatch algorithm — how does it actually work? These were questions that nobody still at the company could answer, because all the people who had written it had left.

Pie:

Burned out.

Ceej:  

Burned out. I think it literally burned out two engineering teams before I got there. And really — I don’t think the fix is hard. The fix is work. But it’s not a mystery. Get data, get data. Have plans —

Pie:

To make decisions.

Ceej:  

Make decisions, measure the results of your actions. If you’re not going where you need to, reevaluate.

Pie:

So then you came to this — you got some data and you were like well, I don’t know if we can really go forward with this platform. 

Ceej:

We’re in the middle of a rewrite.

Pie:

It’s huge.

Ceej:  

And the rewrite actually starts from the ground up. Watching what Ben was going through for the first few months — it’s tough too. It starts with tools.

Pie:

Make that decision.

Ceej:

In the meantime we had to stay up. 

Pie:

Let’s talk about the new platform you’re going to — and what you’re planning to do there and how you were able to make that decision — and then we can talk about doing two things at once.

Ceej:

We’re rewriting it with a little more intentional microservice architecture this time, and a little more intentional understanding of what Eaze’s current business is. It’s just not the same as what it was when the first system was started. And we’re doing this with better knowledge about what that system does. We’re rewriting in Go and Node. Node is part of every stack because you have a website — and Node is very fast for someone like me with a strong JavaScript background.

Ben G:  

Fully containerized as well. We’re embracing containers hard. 

Pie:

So you’ve got this new world you’re going to. But at the same time you have to keep the existing site up, right?

Ceej:  

We’ve got to stay in business.

Pie:  

So what were you able to do there? 

Ceej:

So I said that the 4/20 push and the initial ELB data — we’re getting data, oh, my God — justified the work where we integrated Honeycomb into that monolith, into OG. And that took us a little while. The engineer who was working on that went around a couple of times with you folks, because I believe we were pumping a ridiculous amount of data at you to start with. This is a very chatty service.

Ben G:

Was this the monolith you’re talking about?

Ceej:

The monolith, OG. 

Ben G:

This is Randall’s work?

Ceej:  

This is Randall’s work, exactly. That was nontrivial for him but, oh my God did it pay off. 

Pie:  

It looks like we’ve got some explanation of this. 

28:32

Ben G:  

I was just — there is a question I was trying to ask. I don’t know if you know, CJ. Like, was there any specific .NET solution, or was he making HTTP calls to the API?

Ceej:  

He was making HTTP calls to the API, and he eventually had to batch because he wasn’t sampling — which I think people should do if they have any large volume at all. It’s okay not to sample when you’re tiny. Because it turns out — my first reaction to seeing the OG graphs in Honeycomb was, oh my God, this brave little toaster. How much work it’s doing — for so little effect. It was the stunning revelation of what was going on in our code base.
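
A minimal sketch of the sampling CJ is recommending, again with the libhoney Node client. The policy here is invented for illustration: keep every server error, keep one in twenty of everything else, and record the sample rate on the event so Honeycomb can re-weight counts on the other end.

```js
const Libhoney = require("libhoney");

const hny = new Libhoney({ writeKey: "YOUR_WRITE_KEY", dataset: "api" }); // placeholders

function sendSampled(fields) {
  // Hypothetical policy: errors are always interesting, successes mostly aren't.
  const sampleRate = fields.status_code >= 500 ? 1 : 20;
  if (Math.random() * sampleRate >= 1) return; // dropped by sampling

  const ev = hny.newEvent();
  ev.sampleRate = sampleRate; // Honeycomb multiplies this event back up
  ev.add(fields);
  ev.send();
}
```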

Now, I think you could get there with code tracing tools. You can get there with code inspection. There are a bunch of ways to arrive at this answer of, what is your code doing?  But Honeycomb did — the ease of use of the interface — where my boss, my boss at the time was a data guy, could walk into Honeycomb’s interface and make the spreadsheet you see sitting in front of you right now — like ad hoc queries. Or someone can walk in and not know in advance what the question they want to ask is. 

Logging is phenomenal, but you kind of have to know what you’re looking for — and you have to know a little bit of what you’re asking at the time you write the code. Metrics are useful to a point too, because they let you do that ad hoc graphing without waiting 20 minutes for something to draw. But they don’t let you say, hey, it’s that one user ID that’s causing this problem — what the heck is up with that? Because they can’t have that high-cardinality data in there. This is where the smash — the union of logs and metrics is what makes Honeycomb awesome. That, plus ad hoc queries and the inviting nature.

Ben G:  

And I just want to say at this point — again, the only work that DevOps and ops had to put in was just pointing the logs at it, and then I walked away. I didn’t even know this work was going on, and that’s like the best part of my story with Honeycomb. Because seriously —

Pie:

The be aware. 

Ben G:  

I only care about the HTTP between — what I can see between these services. I haven’t used Honeycomb in a little while, but I looked it up — here’s an example of my saved queries. All 400s. Just the sign-ins. This is what I named them. It’s great because it brings me back as I’m looking at them. Breakdown by user agent. App — and I had a million different ones of these — app X grouped by certificates, or grouped by status code. All 409s everywhere. 500s by request shape. So it was like, I said give me everything, and it’s like bam, okay.
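
For anyone curious what one of those saved queries actually expresses, here is roughly the shape, written out as the kind of query specification Honeycomb works with: "all 400s, broken down by user agent". The column names are hypothetical, not Eaze's schema.

```js
// "All 400s, break down by user agent", as a query object (sketch).
const all400sByUserAgent = {
  time_range: 7200, // last two hours, in seconds
  calculations: [{ op: "COUNT" }],
  filters: [
    { column: "elb_status_code", op: ">=", value: 400 },
    { column: "elb_status_code", op: "<", value: 500 },
  ],
  breakdowns: ["user_agent"],                     // one result row per user agent
  orders: [{ op: "COUNT", order: "descending" }], // noisiest agents first
};
```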

Ceej:

If I have that question, I could look at Ben’s queries and hop off them.

Pie:

What is this telling us?  This spreadsheet here. 

Ceej:

This is what Shri, the person who wrote this, called API damage. He wanted to express the cost of servicing specific API endpoints in terms of the number of servers it took just to answer that one question. And he did this by looking at the number of times we call it, and the average request elapsed time. It’s the simplest possible way of slicing that. And he could just do it — bam, bam, bam, okay, this is where we’re going to focus our effort. Why are we calling this POST endpoint that much — and why does it cost that much to run? You can approach the problem from two directions. You’ve been aimed at where the worst API damage is, and you can say — is the website being a good citizen, calling it this often? Is the app that directs our drivers around being a good citizen, polling for this data like that?
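
Shri's "API damage" number is simple enough to sketch. Count times average elapsed time reduces to total time spent serving an endpoint, which is why it works as a proxy for server cost. A back-of-the-envelope version over a batch of events, with a hypothetical event shape:

```js
// events: [{ endpoint: "POST /orders", elapsed_ms: 142 }, ...]
function apiDamage(events) {
  const perEndpoint = new Map();
  for (const { endpoint, elapsed_ms } of events) {
    const agg = perEndpoint.get(endpoint) || { count: 0, totalMs: 0 };
    agg.count += 1;
    agg.totalMs += elapsed_ms;
    perEndpoint.set(endpoint, agg);
  }
  // damage = count * average = total milliseconds spent on that endpoint
  return [...perEndpoint]
    .map(([endpoint, { count, totalMs }]) => ({
      endpoint,
      count,
      avgMs: totalMs / count,
      damage: totalMs,
    }))
    .sort((a, b) => b.damage - a.damage); // worst offenders first
}
```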

Pie:  

Not knowing where to start is a huge problem in this situation. And this was able to tell you what work was going to have the most impact.

33:03

Ceej:

I have a theory — I think this is what’s slow. No, it’s actually this other thing that’s slow. Why is it slow?  Because the website is calling it too often. Which is another part of the problem. You might have a perfectly performant endpoint that’s being horribly misused by a client that didn’t understand what to do.

Pie:

I think you were able to make a change in the front end that addressed some of the issues in the back end?

Ceej:

This is our front end team — our web team embracing Honeycomb in order to prove that their work was effective. There was a whole series of examples like this that a front-ender would identify: this API is costing us, it’s painful. We’re calling it too much and it’s too slow. The back end team would go and look at why it was too slow; the front end team at why we were calling it that way — can I rewrite this? And they would accompany their PRs with before-and-after data. Like, I did it. Look how much less often we’re calling this now — and the back end team says, look how much the elapsed time went down. I did blow their minds with .NET observability. It’s totally possible.

Pie:

It’s good to know what’s affecting things and how changes are making it better.

Ben G:

You should know this is a familiar pattern. We were DDoSing ourselves quite a bit. Quite a lot. And still are, to some extent. But at, like, a fraction of what we were before.

Pie:  

This is the current system that is in production right now.

Ceej:

It’s in production right now. 

Pie:

So you’re doing all this work while at the same time building your new world. And you’re able to keep the site up. So when we were talking about doing this webcast, it hadn’t been Thanksgiving week yet. So we were wondering, how did it go?  

35:02

Ceej:

I’ve got to tell you one more story about a thing Honeycomb let us see — and then I’m going to tell you about our next big day of the year. There was another thing, a moment when we were looking at instrumentation in OG — we’re looking at why a particular call was so slow. And why are there 200 network spans underneath this one call?  What’s going on here? It turns out that a loop invariant was being recalculated on every pass through the loop, and it happened to involve a network call for data that didn’t change.
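
The pattern CJ is describing looks something like this contrived before-and-after. fetchTaxRates and the order shape are stand-ins, not Eaze's code; the point is that the "invariant" hides a network call, so the trace shows one span per iteration instead of one span total.

```js
// Before: one network round trip per iteration, for data that never changes.
async function totalsSlow(orders) {
  const out = [];
  for (const order of orders) {
    const rates = await fetchTaxRates(); // shows up as ~200 spans in the trace
    out.push(order.subtotal * (1 + rates[order.region]));
  }
  return out;
}

// After: hoist the invariant out of the loop.
async function totalsFast(orders) {
  const rates = await fetchTaxRates(); // one span
  return orders.map((o) => o.subtotal * (1 + rates[o.region]));
}
```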

Pie:  

Classic.

Ceej:

You know — an example where the client wasn’t DDoSing the service; the service was DoSing itself. You can find this through code inspection, call tracing — a whole bunch of ways to find this. In this case we were just like click, click, what — what?  So again, patterns jump out that you can then follow up on with a number of other tools — starting with your ability to read code and understand what it does.

Anyway, fast forward. After 4/20 I talked to the engineering team and I told them about my goal. Actually I talked to the company — I told them my goal was to hand them a boring big day.

Pie:

Uneventful day. 

Ceej:

You want your systems to do what they’re supposed to do. And to let everybody relax and be happy about doing a lot of good business. And oh boy, Ben. Last week — Green Week. You want to talk about what Green Week is, or Green Wednesday?

Ben G:  

Green Wednesday, specifically. We expanded it to Green Week, but it really was about the Wednesday before Thanksgiving. The best part was that we had enough confidence that no one came in on Wednesday.

Ceej:

No war room. 

Ben G:  

No war room, no ordering food, no planning to be there. We were all home with our families. I barely looked at my phone that day. 

Pie:

Amazing. 

Ben G:

It was mostly out of reflex. I kept checking Slack for no reason until I finally was like — muscle memory. Traumatic. And this was at a point where we had been there 8 months, 9 months. So we’re not even — neither CJ nor I have been there a year yet, and there’s already PTSD — and I think Green Week showed it. For me at home, I was like, why am I so freaked out about this? Everything is going to be fine.

Pie:  

Green Week is because people are buying marijuana so they’re ready for their families. Before you get on the plane, you want to have your gummies — you want to have all your microdosing worked out. It’s a thing.

Ceej:

Because if you’re going to talk to your terrifying uncle.

Ben G:  

In this terrifying timeline, you need some medical help.

Ceej: 

The industry prepares for it — the dispensary partners prepare for it.

Pie:  

They were able to succeed because you got —

Ceej:

It didn’t flicker — for the first time. I got texts from former employees saying — I’m really stressed for y’all today, how is it going? And it went the way it’s supposed to.

Pie:  

Nice. So we were talking about — there’s so much to discuss here we’re running really long. 

Ceej:

I’m sorry. 

Pie:  

It’s okay. This is a great story. We talked about this a little bit before, that you couldn’t get this sort of information out of your previous tools. You want to talk a little more about this?

39:06

Ceej:

I think we pretty much covered it. Metrics are totally great to have around. You absolutely need logs. You need them to analyze what happened long after the fact — a security audit trail, all the things you need logs for. But they’re not good at helping you frame questions you’ve never had before. Or pointing you in the direction of new things. Yet —

Pie:  

Sorry. I keep clicking on the chat box and it forwards the slide. I apologize.

Ceej:

Let’s move forward to the civilized age. You need to know what your servers are doing. If your database is not in the same process as your server, you have a distributed system in some way; therefore you have complex interactions you can’t predict. You can have hunches about it. Build this in from the start. This is true for our next gen services. All of them have Honeycomb built in. I just finished about a month, a month and a half, of pure tools work for our next gen. We have Kubernetes now, and the services all have Honeycomb in them from day one, with no effort from our engineers — unless they want to emit specific annotations or start doing deeper, app-aware integrations.
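
A sketch of what "Honeycomb from day one, no effort" can look like in a shared Node service template, using the open source honeycomb-beeline package. The Beeline has to be required before anything else so it can auto-instrument the packages that follow; the service and env var names here are hypothetical, not Eaze's actual framework.

```js
// Must come first so express, http, etc. get auto-instrumented.
const beeline = require("honeycomb-beeline")({
  writeKey: process.env.HONEYCOMB_WRITE_KEY,
  dataset: "nextgen-services",                        // placeholder
  serviceName: process.env.SERVICE_NAME || "example", // placeholder
});

const express = require("express");
const app = express();

app.get("/health", (req, res) => {
  // Teams only write code like this if they want extra, app-aware fields.
  beeline.addContext({ "app.check": "health" });
  res.send("ok");
});

app.listen(3000);
```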

Pie:  

So you had time to — like, how do you go from the left-hand side, of no time to do this work that you’re describing, to having time to address technical debt?

Ceej:  

I now have the time to work on what the next generation is, because the engineering team is at work on features and next gen stuff and not on firefighting. We’re not trying to figure out what is bringing AWS’s biggest Redis down. We know how big it is and how often we’re reading and writing it. It’s not a mystery. We were totally in the tech debt spiral — so much effort was spent just fighting fires that you couldn’t make any progress toward getting out of it.

Pie:  

Knowing that the site is going to go down in every possible situation.

Ceej:  

But why?  We didn’t know. We had hunches.

Pie:  

So now you’ve got — one thing you mentioned earlier is that some people knew about this, but they weren’t able to convey what was going on in a way that made the upper-level folks, or different departments, understand that they needed this help.

Ceej:

Ben’s got that expression on his face. It could be hard to make that justification. Numbers helped — we ended up working for Shri, who’s leaving to go do his own start-up, the big jerk. He’s very data-driven. He ran Eaze’s data analysis team — which does great stuff. And he was very happy to get numbers. Here is what these servers are doing. Here’s why they’re doing it. Here’s proof we’re moving it forward — and I was able to persuasively make that case to the executive team.

Pie:  

All right. And so with this data — go ahead.

42:34

Ben G:

Part of the mix that CJ is going to paper over a little bit is what Honeycomb provided her, combined with her sheer personality and drive — it forced people to pay attention. So you know, I do want to say that I couldn’t have done what she did. Which was, you know, stepping outside of yourself as a normal engineer. And I do think this is kind of an important note — being armed with something like Honeycomb gives you the confidence to walk into that room with people at the E-level, much closer to the money, much closer to the VCs — the pressure is a little different. The story is a little different in those rooms as opposed to engineering pressure. And that changed the climate quite a bit. So it wasn’t all Honeycomb that just did it. It wasn’t like, oh, they see the data and that was it. There’s a little bit of clout, there’s a little bit of chutzpah involved.

Ceej:  

A lot of human grease. 

Ben G:  

I don’t want people to think, if I buy Honeycomb I’m going to get the executive’s attention. It’s not quite that simple.

Ceej:

You’ve got to make a case. Look, business is involved here. We did not enable our dispensary partners to sell because our engineering was not up to snuff. This is what we’re going to do to fix it. This is how it’s going to pay off for us. It’s not going to be zero work, because nothing ever is zero work, but this will be directed work, and we’ll be able to prove we made it better.

Pie:  

Having real data. We haven’t talked about this a ton, but you’re also building this new site and you’re not doing it with more staff. You’ve not added to your team to do this.

Ceej:

No. 

Pie:

How is that possible?  

Ceej:

I’ll tell you in a year if it is. 

Ben G:

Triage. Effectively — it’s like a MASH unit. We created a different wing in the MASH unit, and part of it is the juggling of morale. No one wants to get stuck on legacy. But there’s plenty of interesting legacy work still to do, and I’m actually trying to introduce containers in legacy as well, to keep the skills compatible — so even if you’re not working on the new system yet, we’re going to do new stuff inside the legacy.

Ceej:  

We’re able to slowly pull more and more of the engineering team into the new stuff and give people time to cycle in and out of writing new code versus just maintaining old code. 

Pie:

It’s taking less power.

Ceej:  

It’s taking less power to run. Exactly.

Ben G: 

Less psychic power.

Ceej:

Less emotional burnout.

Pie:  

People don’t think about that. How much you’re taking out of your team. 

Ceej:

This right here. This matters to me. It just gutted me to see what happened on 4/20 — where people ran around. We had people just devastated that they felt they had failed the company. It’s my fault the site’s down. No, it’s not your fault. And there’s a better way here that doesn’t involve us all just feeling bad. Or answering pages. Or frantically trying to guess what’s going wrong. We can use tools. We’ve got lives.

46:06

Pie:  

So, yeah. It would have been a great deal harder. I have quotes from discussions we had earlier. It was you who said “we can’t afford observability” is a myth.

Ceej:

How the hell do you know — distributed systems are complex things. How do you know what they’re doing to any degree without having telemetry from them?  It’s like sending a spacecraft up and not getting telemetry from it. What?

Pie:  

And now, you’ve got some pretty great business value out of this as well. You were able to go back to your partners and say hey, Green Week was okay.

Ceej:  

I won’t mention a number here, but I can say Green Week — the Wednesday before Thanksgiving — was our biggest business day ever.

Pie:  

Congratulations.

Ceej:

I just feel good about that. 

Pie:  

I bet the whole team does. Your engineers feel more empowered to continue to make things better. They saw a way through what seemed like an endless hall of problems before.

Ceej:

They believe it now. We can be successful. We can scale and it can be boring. We can put all this work in up front and we can have a relaxing, excellent day. 

Ben G:

It feels nice, as engineers, to have a positive story to tell in your startup experience. I keep telling people: no one works at the same company for very long, so all you have is stories. And the first thing you do when you go to another place to interview is tell them stories. And you don’t want to walk into a future employer’s place and say, let me tell you about all the disasters I was a part of. You want to have some wins. And it’s really nice to give a group of people that have had so many losses a win.

Pie:  

Wow. I was also looking at the questions. And you’ve been busy, Ben. 

Ben G:

CJ is doing all the talking, I’ll do the typing.

Ceej:  

Where are the questions? 

Ben G:  

There’s a Q and A chat.

Ceej:

Look at that!  

Pie:  

And someone has a question for you, CJ. Is there anything else?  Any last words before we dive into questions? Folks, please ask questions and we’ll go through them in just a few minutes here. Any last words from Ben or CJ?

48:40

Ben G:  

I just have one point for people that might not have used Honeycomb, from an ops perspective — I am a past dev, but I honestly don’t spend any time thinking about the stuff CJ thinks about, in terms of where the recursive calls are inside the code. I just see the whole system. What was nice when I turned on all the ELB logs to flood into Honeycomb: I’m just looking at HTTP. Here are the microservices. I see them all. I see all the traffic between them. And all the application logic and application errors don’t matter. It’s how these systems are responding to each other, and these are the explosions I’m seeing between them. That was such a good first story for us to go deeper with, because it told us where to go deeper. Instead of just guessing. Just having that idea — that here’s a system that’s going to tell me HTTP first. And then you can go deeper later. Because again, that’s what it’s all about for an ops team. I do not want to set up a system where it’s, okay, now that’s just the beginning for you, now you have a 7-week project ahead of you. We do not have time for that in the ops world. So that’s the only thing I wanted to add.

Pie:  

I think CJ might know the answer to this question. You used one or more of our Beelines?

Ceej:  

We used the Node Beeline — we’ve got that integrated into our legacy Node apps, a handful of our legacy services.

Pie:

That’s an auto-instrumentation thing.

Ceej:  

It’s in our next gen framework as well, built into our middleware. That’s open source. I can find the code for that; Chris has posted it somewhere. We’re also using the Go Beeline. We built some lightweight opinions around the Echo framework, and the Go Beeline is built into that too.

Pie:  

These Beelines are for various popular languages, and they do auto-instrumentation for you on standard libraries — for folks who are following along — so you can get started super quickly, the way you can by just piping your logs in as well. And they can instrument your own code. So cool. So yeah, you’re now prepared — 4/20 is a little ways off, but the holidays are right ahead, later this month.

Ceej:  

I don’t know if there’s anything big in December. The cannabis industry is a little bit seasonal. One of the reasons why 4/20 is so huge — it’s the start of the summer and warm weather in most of North America, so it’s like — people enjoy that stuff more during those months.

Pie:

Less time for stress. I would be really curious to hear — we’ll check in, and obviously we’ll be talking to you before then, before next spring’s 4/20 — to see if you continue to have a good time without stressing out your employees, while still making money for you and your partners. And we’ll be curious to hear the state of your new world and when you’ll be able to move over to that.

Let’s see we’ve got a lot of questions that have been asked and answered. I see three open questions.

Ceej:

I don’t know what pricing tier we’re on. My boss handled that.

Ben G:

The pricing structure usually involves retention. Retention and volume. We’ve tuned some of our nonproduction services down to zero retention.

Ceej:  

We need to move to sampling. We will get just as useful data from it.

Pie:  

Some people are worried about sampling because they don’t understand there are ways to make it really tunable to the kind of data you have, based on what’s in the data and the frequency at which it’s coming at you. We have all those options available, so you can keep more of the valuable data in the larger stream of your data.

Ceej:  

Yeah. That’s really up to you. You can spend the money if you want. You can get good answers either way.

Pie:

Absolutely. This one is asking: do you use Honeycomb to log any kind of security audits?

Ceej:  

No, and I’m not sure I would ever send Honeycomb critical user information. Obscured only. We have an opaque user ID token; we’ll send that, so we can figure out, oh, it’s this one user who’s having this terrible problem. But no. I wouldn’t want to do that.
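
One common way to get an opaque but stable user token like the one CJ describes is an HMAC of the internal user ID, keyed with a secret that never leaves your side. The same user always maps to the same token, so you can still group and count by user in Honeycomb without shipping the real ID. A sketch, with a hypothetical secret:

```js
const crypto = require("crypto");

// Same input always yields the same token; without the secret, the
// token cannot be reversed back into the real user ID.
function opaqueUserToken(userId) {
  return crypto
    .createHmac("sha256", process.env.USER_TOKEN_SECRET) // hypothetical secret
    .update(String(userId))
    .digest("hex")
    .slice(0, 16); // a short prefix is plenty for grouping
}
```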

Pie:  

There’s a secure proxy option, so your critical data doesn’t leave your premises unencrypted — and if you have more questions about that, you’re welcome to ping us on our website or by Intercom, however you want, for the folks who are listening. Because we do offer some secure options that are solid enough to be used, for example, in the healthcare industry.

Ceej:  

Interesting. I hadn’t moved as far as thinking about that.

Pie:  

You’ve got a lot on your plate. So cool. I think we — I don’t see any more open questions. These two here, I think, are more responses to other things.

54:27

Ben G: 

I could say, with Ramon’s question — VPN traffic, I suppose, could be interesting, if I was worried about some attacks. Our security issues haven’t really been revealed by Honeycomb so much. But for some of the alarms, you effectively set trip wires — like I was saying, these 400 and 500 explosions. We’ve had some recent auth attacks where the investigation was driven by Honeycomb queries.

Ceej:  

We didn’t touch on that. There are some Slack integrations here. Our 401 rate is going up? Say hello in Slack.
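
Roughly the shape of the tripwire Ben and CJ are describing, sketched as a trigger definition: run a query on a schedule, and say hello in Slack when the 401 count crosses a threshold. The threshold, frequency, channel, and column names are all made up for illustration.

```js
const authAttackTrigger = {
  name: "401 rate is climbing",
  frequency: 900, // evaluate every 15 minutes
  query: {
    calculations: [{ op: "COUNT" }],
    filters: [{ column: "elb_status_code", op: "=", value: 401 }],
  },
  threshold: { op: ">", value: 500 },                     // hypothetical tripwire
  recipients: [{ type: "slack", target: "#ops-alerts" }], // hypothetical channel
};
```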

Ben G:

Slack and AWS alarms. It’s rather nice for that.

Pie:

Cool. Does anybody have any last questions?  Before we wrap up — we’re getting right to the top of the hour. This went perfectly to time.

Ceej:  

I hope it was as fun as advertised.

Pie:

I was really looking forward to this. I love the future that I live in, in the sense that I get to interview you folks for work. It’s great. Any last words from anybody?

Ceej:  

I’m happy to have a use case. I’ve been watching y’all for a while. The Scuba paper is pretty cool. Just this idea that you can query log-level cardinality with metrics-level speed — really, this is super good.

Pie:  

I like that: log-level cardinality with metrics-level speed. Write that down.

Ben G:

And I just like the opinionated UI. It doesn’t ask me anything — in fact, it gives me options. I don’t need to know all of the different API endpoints that my own system has. It’s going to tell me what they are. Never having to use regex — it can support it if you need it. So, different personality types. I’m the personality type that doesn’t want to have memorized queries, or a book lying around somewhere, or a scratch pad of saved queries. Every time I go to Honeycomb, I play.

Pie: 

Exploring is great. Exploring is super easy, I think in Honeycomb.

All right. I’m going to wrap it up. Thank you to Ben and CJ for your time. Thank you to Nikki for the captioning. And for those of you who joined us late, we are recording this, and we will be sending anyone who registered a link to the recording when it’s been processed. Please feel free to share that link with your colleagues. And if you want further information, there’s a bunch of other good stuff in our resources section: white papers and e-guides. I’m partial to the e-guides and white papers because I write most of them. And feel free to follow us on Twitter at @Honeycombio. Thank you very much, everybody.