Get Started: Build One Simple SLO


Transcript:

Nathan LeClaire [Sales Engineer|Honeycomb]:  

So welcome, everyone, to Production SLOs: Success Defined, a three-part webcast series from Honeycomb.io, the one and only true observability tool for debugging production in high resolution. Today I’m very excited. I have a special guest, Josh Hull from Clover Health.

Josh Hull [Site Reliability Engineering Lead|Clover Health]:

Hello.

Nathan LeClaire: 

And this webcast is part two of a three-part series on using SLOs. So what we’re going to look at today is a bit of a story of where Josh and Clover Health come from, why they landed on observability, and where they’re going to go in the future. And especially the success that they’ve had with SLOs. So hopefully you tuned in for part one in the series. There’s going to be a third part, so tune in for that, and we’ll have some details about it at the end of the presentation here. So a little bit about us: myself, over on the right, I’m Nathan LeClaire. I’m a sales engineer at Honeycomb.io. I help customers get started and do all sorts of things, as one does at a startup, and I’ll let Josh introduce himself here.

Josh Hull:    

Hello, I’m Josh Hull, I’m the site reliability engineering lead at Clover Health and I help keep the site reliable.

01:38

Nathan LeClaire:   

There we go. And what we’re going to talk about today is the use case of SLOs at Clover Health. So Josh has a lot of really special context on using this new initiative that’s coming out of Honeycomb, and I’m very excited about what we’re going to cover. We’re going to cover why Clover Health wanted observability in the first place, why they wanted Honeycomb specifically, and how they’re using SLOs.

We’re also going to talk a little bit about how they selected the correct service level indicators to arrive at the service level objectives (which is what SLO stands for), and how they got guidance from Liz Fong-Jones, who helped them start out on their journey. A little bit about the common language of SLOs that helps bridge the gap between business and engineering teams. And we’re also going to talk about why SLOs are important, and we’re going to do some demoing of the different capabilities that they have to offer. So Josh, we’ve got a couple of slides from you on Clover Health. Why don’t you give our listeners just a little bit of an intro on what Clover is all about.

Josh Hull:  

So Clover is a new kind of Medicare Advantage provider. For those who are unfamiliar, Medicare Advantage is a government-supported program that provides health insurance to those who are eligible for the coverage. And Clover is a company that leverages technology to extend benefits to our members through providers, giving guidance to our healthcare providers as they provide care to our members. So we use machine learning and predictive analytics in order to get a better understanding of the diagnoses for customers with chronic conditions and how we can give them the care that they need to improve their quality of life and their health.

Nathan LeClaire:    

Got it. And so that is kind of a text representation of a lot of what you just said there. And you all have been growing and growing. So you were founded in 2012. What’s your journey been like since then? Where do you come into the picture?

Josh Hull:  

My personal journey with Clover actually began this year. The company, as you can see, has had significant growth, from hundreds of members in the 2013-2014 timeframe to our latest annual enrollment period, which just closed north of 50,000 members. We’re in a number of states beyond the initial state of New Jersey, and we are establishing the ability to provide a product, Clover Assistant, that is used outside of our membership plan, so other providers can leverage the same machine learning and predictive analytics to provide care to their members.

Nathan LeClaire:

Right on. And so it sounds like you are taking advantage of data to provide better medical services for folks, and I got to imagine that adds a lot of complexity into the picture. Would you say that’s pretty accurate?

04:53

Josh Hull:    

It is. It’s a very complex system. There are a lot of steps in maintaining compliance and adhering to requirements as they pertain to managing data around personal health information and personal identification information. And obviously we want to provide the lowest cost solution for our customers as well. So any opportunity that we have to reduce bloat within our services and improve the code that runs our plans and our predictive analytics, the better. So if we have an opportunity to holistically observe our environment and make improvements in areas that are surfaced for us, that’s a fantastic opportunity. And that’s sort of what helped us land on Honeycomb as an observability tool.

Nathan LeClaire: 

Awesome. Yeah. So, we’re going to talk a little bit about your journey with SLOs in a second, but just in case there are any listeners out there who didn’t catch part one of the series, or maybe are just really confused by all these acronyms, we’re going to quickly go over a definition of what they are exactly. So, SLOs are a data-driven way to measure and communicate how your production services are performing, based on measures your customers care about, like latency. If an app is slow or has a lot of errors, that’s really bad. So actually, before we dive into the real specifics of Honeycomb SLOs, could you maybe just share a couple of what those measures are for you all?

Josh Hull:

Yeah, so we have a strong interest in the frequency or the latency that’s represented for our member search and for our provider search. Right? So when our members come in and want to find specific services or engage in setting up appointments with providers or clinicians, that search needs to be very responsive, it needs to be accurate, and it can’t have significant delays or breaks in that process. 

Similarly, for providers, when they’re looking for information regarding their members, that search needs to be very responsive. So we want the ability to tie the service level objective to the manner in which the member or the provider is experiencing it, as opposed to looking at it internally and saying, “Oh, well, it looks like we’ve got good responsiveness, our systems are up.” We would much rather get the perspective of the user as it stands than look at it from within the mechanics of our framework.

Nathan LeClaire:

Got it. Yeah, and I’ve got to imagine, especially in medicine, having that reassurance, a snappy website, something that performs well and doesn’t have many errors, is just that much more important.

Josh Hull:

Yeah. If you’re suffering or if you need care, the last thing that you want to do is wait for your service to be responsive or to not have a responsive service at all.

Nathan LeClaire:

Totally. So it is a really key feature of SLOs, and we’re going to talk about how this came into play at Clover later on, that engineering and business stakeholders speak the same language, right? So engineers benefit because they have fewer noisy alerts and a clearer high-level picture of how their systems look and how reliable they are. And then on the business side, it’s clear that the efforts you’re putting into things like code instrumentation, migrations to new infrastructure, or, say, paying off technical debt are actually quantifiably successful. That’s something that can be proved out using SLOs. So now we’re going to talk a little bit about your specific journey with Honeycomb, and why observability for Clover Health. Why did you land here?

Josh Hull:

Specifically, we are looking at doing a significant infrastructure migration. When Clover was built, a lot of consideration was focused on the data science and the calculations of how we arrived at our care and how we provide that care to our members, and not as much consideration was spent, necessarily, on the manner in which those applications should be built. Right? You could talk about Twelve-Factor Applications, you could talk about application design. But it was a very rapidly growing company that leveraged certain technologies that were available at the time.

We have an opportunity to move our current strategy into a more resilient, more stable environment that uses modern technologies, available from cloud providers, that weren’t available when our applications were originally written. So this migration process is something where, if we had a tool that was holistically observing current state and future state, sort of a pre-prod environment if you will, observing both of those in kind would give us a very holistic understanding of what we were missing and where our gaps were in the new environment, given the brittleness or rigidity of the original environment.

To be able to compare those two without necessarily putting a thumb directly on the pulse was very valuable to us, and so was being able to tie in the service level objectives from the jump on the new infrastructure and the applications running in it.

If and when we had the opportunity to do any refactoring, it was a great opportunity to just turn to Honeycomb and say, “Oh, okay, well that makes sense, I can see that very quickly from the information that Honeycomb is sharing.” That’s not to say the same could not have been accomplished, in a much more difficult or challenging manner, with traditional APM. But given that we had those traditional tools in place on the original environment, putting Honeycomb into both environments really gives us sort of a clean slate, if you will.

Nathan LeClaire:   

Got it. Yeah. So you had a bunch of tools already, but Honeycomb just had that much more to offer. I guess that was sort of right in the pocket for what you all needed.

Josh Hull:

Yeah.

11:22

Nathan LeClaire:    

Awesome. So why don’t you talk a little bit about the story of getting started, and Liz coming to help you all out. I love this story because at Honeycomb, one of the things that’s really important to us is the success of our customers, right? It’s not enough to just make software. We really want to help people get good results. So, why don’t you tell us a little bit about the origin story of these tweets and what happened there.

Josh Hull:

Yeah. When we began talking with Honeycomb about an enterprise engagement, Michael Wilde on your team said, “Hey, we have this potential new product you might be interested in, would you like to learn more?” We sat down with Michael and with Liz and got an understanding of what the SLO could bring to the table for us. And the opportunity to engage with her at the professional services level was more than just reading through a tutorial or following a sort of code runbook and saying, “Okay, well, I’ve got my simple little SLO up and it’s working and I kind of have an understanding of its purpose.” She came into our office, met with our engineers, helped us connect and put our Secure Tenancy environment in place. Right? And she really got the understanding of… As Charity says, “Nines don’t matter.” Right?

Your customers have to be happy. We have to have an understanding of the customer experience, and the SLO she was able to help us create really drove that. Right? To the millisecond, we knew what our percentage was of success versus failure. Right? We were in containers already. I’m seeing that line item on the slide here.

So I just wanted to clarify that our current environment is using containers, but not from an orchestrated standpoint. We are shifting from one environment to another, both using containers, but the original was sort of more of a Docker Swarm, if you will, and our future state will be more orchestrated. But aside from that, the opportunity to have Liz come in and help us formulate our understanding of observability was fantastic, and we’re really looking forward to the future engagements that we have with her in the remaining professional services she’s offering.

Nathan LeClaire:

That’s great. Yeah, and it’s really interesting to see containers crop up there, even in changing from one orchestration tool, or not quite an orchestration tool, I’m not sure what you had going on, to something new, whether it’s Kubernetes or Mesos. I think mostly people seem to be all about Kubernetes these days. But that’s such a sign that the world is changing: there are just more moving pieces, and now we need new tools to address that new era.

Josh Hull:

Right? And it’s important to note that no technology is a panacea for what you’re suffering, right? If there’s bad, or less optimized code I should say, in a VM, it can be carried right into an orchestrated container. So it gives you the opportunity to really observe the state of your current solution, and it gives you a very quick representation of the opportunity to improve that state. Right? Oftentimes, if we’re looking at dashboards, or if we’re looking at sort of traditional tracing, we’re sort of looking in the rearview mirror.

We’re seeing the past sort of echo to us what our future failure might become, but it’s almost too late when we get there. And so I think yes, it is important to note that containers are the natural order of the day, but they aren’t a solution. They’re simply a means by which you can present your application. So I think it’s very valuable to note that it’s not just the orchestration of those containers or the migration into a new environment. It’s our opportunity to really examine what is healthy within our system and what could be improved within our system.

Nathan LeClaire:

Totally. Yeah, and not to get into too much detail, but there is one thing I wanted to highlight, which is that Clover deals with a lot of sensitive data. Honeycomb is software as a service, where you send things to us and it’s hosted on our servers. So how is that possible? What did we do to make that work for you all?

16:15

Josh Hull:        

Honeycomb offers Secure Tenancy, which is very brilliant in its simplicity, right? It just converts anything that is a string in that data stream into a hash, which is unreadable by Honeycomb. The hash is decrypted on our side, the secure tenant side. So in the round trip, we’re still getting the metrics from Honeycomb and we’re still able to do all the tracing and what have you, but when it gets to us, it’s human readable, right? It’s decrypted from that hash.

So that was probably more of a layman’s description than most people would be interested in. But at the end of the day, we know that because we are data stewards of health information and identifying information, our best opportunity to work with partners is when we can communicate in an encrypted format that keeps that privacy and protections in place.

Nathan LeClaire:  

Gotcha. Yeah, totally. And it’s really nifty. SLOs and all the Honeycomb goodness are still available for users of Secure Tenancy. So why don’t you talk a little bit about how SLOs tie in for you in this picture of bridging the gap between business and engineering.

Josh Hull:

The engineers really grokked it and dove in right away and said, “Oh wow, if I had this in place, I would have been able to hit that post-mortem so much faster, or gotten to the root cause so much faster.” The tie to the business, in our personal experience, was that when we had the SLOs in place, and Secure Tenancy was ensuring the protection of our data but was still bubbling up the information that we were interested in, we were able to go to an executive presentation and show a slow-query result set over a very quick period. You can go from two hours to 14 days or an extended period if you wanted to. And instantly, from the business perspective, it was, “Okay, well, that’s showing me right there that that query itself could be optimized.”

“Is that something that we can put on the roadmap? That one is creating milliseconds, if not hundreds of milliseconds, of latency each time that query runs.” And to hear that from the executive side, mirrored from the engineering side: we know that we can optimize that query, so which sprint can we put that optimization work in? Right? That really tied together for us that there’s always a journey in improving the code that we have, but for the business to guide it and for the engineers and developers to support it really created a fantastic synergy.

Nathan LeClaire: 

Awesome. Yeah, I’m really excited about that myself, so often being in a role where I’m bridging the gaps between engineering and business by nature. I’m a sales engineer, and that’s going to open the doors up for all kinds of things. So we’re going to take a little bit of a look here now at making an SLO. I’m going to be doing it on some of our demo data. Josh is going to be my copilot; he’s the SLO expert now, because we got them all set up.

And so before we dive into the specifics of the Honeycomb UI, it helps to think a little bit about what some example SLOs actually are. So, we might have a business goal saying that we don’t ever lose any customer data, right? Honeycomb, for example, is a service that constantly receives a stream of data, and it’s really important to us that those events are stored reliably.

So we might say something like, “Well, 99.995% of API calls have to be processed with no errors in less than a hundred milliseconds.” Likewise, we might want to set an SLO that says 99.9% of requests have less than one second of latency. Right? And that’s the demo SLO we’re going to be looking at, actually. It’s related to responding quickly. Likewise, at Honeycomb, we have SLIs and SLOs around not having to wait for a query response. So, we make sure that a certain percentage of requests to our data engine are always within bounds that we consider an acceptable experience.
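
(For readers who want the arithmetic behind a target like that, here’s a minimal sketch in Python; the traffic volume is a made-up number for illustration, not Honeycomb’s.)

    # Error-budget arithmetic for an SLO like "99.9% of requests are
    # served in under one second over 30 days." Numbers are illustrative.
    target = 0.999             # SLO target: 99.9% of events must pass the SLI
    total_events = 10_000_000  # hypothetical request volume in the 30-day window

    # The error budget is the fraction of events allowed to fail the SLI.
    allowed_failures = (1 - target) * total_events
    print(f"Allowed failing events this window: {allowed_failures:,.0f}")  # 10,000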

In the SLO page that you’re going to see in action here, there are a couple of key components. One is the remaining budget, that is to say, how much of the SLO’s error budget is still remaining, and that would be this part over here. There is the historical SLO compliance, which would be this over here, that will show you over time how well you have done at complying with your SLO. So again, tying that back into the business outcomes, this can be really key for establishing that, “Hey, we all agreed upon these objectives, and over a certain time interval looking backwards, we know for a fact that we attained those goals, or if we didn’t attain them, we know what caused it.”

And we can also see visualizations that show us which events are actually failing. So you might have to squint a little bit if you’re following along at home, but in this part that I’ve circled on the graph here, you can see that the failing events on this heatmap are highlighted in a different color than the rest of the heatmap. And so we can actually identify where those failing events are falling. Likewise, we can use that highlighting of these specific events to automatically bubble up which fields and values are associated with them. And that might tie a little bit into your story about finding a particular query that was affecting the outcome. Can you think of any examples where you all have used that at Clover Health?

Josh Hull: 

That is probably the primary example that I would share. Of all of the queries that we have, the mappings between providers and members, the mappings between diagnoses, or any type of geofence that might be present for whether or not a member has the ability to get to a provider in an appointment-setting period of time, all of those are interrelated, and one query can pull from every one of those disparate data sources. So the more optimized that is, the better. What we found is that it’s difficult to optimize against all of those axes, right? And so we do see many of our failures fall in the SLO range and threshold that we’ve set. So maybe after the demo we can get into a little bit of how, within our journey, massaging that percentage and the acceptable latency rate gave us an opportunity to really say, “Okay, this is a trajectory we want to follow.”

24:00

Nathan LeClaire:   

Yeah, sounds perfect. So just a couple of takeaways before we dive into the demo. SLOs unify engineering and business. You can start small, and you can continuously iterate on what you have to gain outcomes that are better for your team. So let’s maybe now take a look at the Honeycomb side of things, the much-promised demo. Here we can see the Honeycomb homepage, and Josh, I’m sure you know this well. What’s the first part of creating an SLO that we’re going to go through right now?

Josh Hull:

The first thing that we want to do is create a derived column. It’s going to give us a very specific measure against which we can create our service level indicator.

Nathan LeClaire:  

Right. So, in Honeycomb, derived columns are sort of like virtual columns that are constructed out of the values of other fields. So we can do things like say, “Hey, if the field request time is greater than or equal to 500, then the value of this column when we evaluate queries will be true; otherwise it will be false.” And in fact that’s the very mechanic we’re going to use here, where we’re going to take a field that we have called duration_ms, which measures the latency of a given API call in milliseconds.

And we’re going to say, if that’s over a thousand, that it’s actually failing the SLO; otherwise it’s okay, because the latency was under what we consider an acceptable threshold. And we’re going to name this column something like latency SLI, because we’re actually describing a service level indicator, right?

So indicators of how the service is performing will go into determining how well we’re doing at our service level objective. So we’re going to create that. And now, having made that, we can come over here, take that latency SLI field, group by it, and see a count for each case where we’re failing and where we’re succeeding. Right? And so you see this tiny little bump up here when it’s true, which is associated with the latency, and we might say, well, this tiny little bump where we’re violating what we consider acceptable is going to be something we want to make our SLO around.

So to do that, we’ll come over to the SLO page here, we’ll click “New SLO,” and we’ll just call this “latency SLO test” or something like that. Select that column that we just created, latency SLI was what it was called. We’ll set a time period in days over which we’ll actually track this SLO, and a target percentage, say 99%, and then we’ll be able to see how well we’re doing at making sure that 99% of the requests that we’re measuring fall within this service level indicator.
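
(In plain terms, the check an SLO performs looks something like this sketch; this is the general definition of SLO compliance, not Honeycomb’s internal code.)

    # An SLO asserts that, over the window, the share of events whose
    # SLI evaluated to true meets or beats the target percentage.
    def slo_compliant(sli_results: list[bool], target: float) -> bool:
        passing = sum(sli_results) / len(sli_results)
        return passing >= target

    week = [True] * 995 + [False] * 5        # hypothetical: 99.5% passing
    print(slo_compliant(week, target=0.99))  # True: 99.5% >= 99% target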

So now we have this new SLO here. If we click on it, we see that overview page that we were mentioning before. So… “Oh, did I invert it, actually?” I think I inverted my SLO here. So I’ve got to come over to the dataset and edit that column. I actually did true instead of false. Got to edit it.

Oh, I think it doesn’t like it if it’s used for an SLO.

Josh Hull: 

Yeah, we need to edit the SLI.

Nathan LeClaire:   

Yeah. Let me just make a new one here. Less than or equal to. And we’ll just delete that original one there. All right, that’s a different one. Make a new one. Set those the same. And now we should see a result that is more what we were expecting. So our SLI column will emit true when we pass the SLO and false when we fail it.
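
(For those following along at home, the corrected derived column can be written in Honeycomb’s derived-column expression language roughly like this; the field name duration_ms is from the demo dataset, so substitute your own latency field.)

    LTE($duration_ms, 1000)

This evaluates to true when the event’s latency is at or under the one-second threshold, that is, when the event passes the SLI, and false otherwise.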

And we can see on this page that, looking back over the data we have for the last seven days, we are using up only about 4% of our error budget. Right? It starts at 100% and burns down. And this SLO has not been around for very long, so we don’t have any historical SLO compliance data, but when we do, it will show up in this little chart over here. Now, why don’t you talk a little bit, Josh, about what this BubbleUp chart here is showing?
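
(As an aside, the “about 4%” figure above can be reproduced with the same budget arithmetic as before; a sketch with invented counts that happen to land on 4%, not Honeycomb’s implementation.)

    # How a "remaining budget" number like that can be derived.
    target = 0.99        # SLO target over the window
    total = 1_000_000    # events seen in the window (hypothetical)
    failed = 400         # events whose SLI evaluated to false (hypothetical)

    budget = (1 - target) * total  # failures the SLO tolerates: 10,000
    burn = failed / budget         # fraction of the budget consumed
    print(f"Budget used: {burn:.1%}, remaining: {1 - burn:.1%}")  # 4.0%, 96.0%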

29:52

Josh Hull: 

So it very quickly identifies that we have a color dissonance between the majority of our calls that are succeeding and those that are failing the derived column that we created, the service level indicator. So, in what do we want to call that, the beige or tan area at the top, and then the histogram vertically on the right-hand side, we can see those areas where we have failures. With this BubbleUp, if you were to drag your mouse and select a small portion of that, you would be able to isolate that time period and those calls and perhaps see an even more expressive display of what potentially is failing.

Nathan LeClaire:   

Yeah, exactly. And so normally you would use BubbleUp and draw a little square on the chart like this, and Honeycomb would bubble up all the events that were in that region to show what’s different about them. In the SLO instance, we do it automatically based on whether they’re passing or failing the SLI. And so, in this case, it’s just test data, so this endpoint is kind of always the guilty one, but we rapidly identified that this one particular endpoint is the one that’s acting up. And likewise, because Honeycomb tracing fields are just regular old fields, we can see, “Oh, there are particular services that were violating the SLI.”

And we can see that there were particular traces that violated it that we could just jump right to. So if we wanted to, we could jump right to that trace there… which doesn’t seem to be playing nice, no idea why. But coming back to the SLO page here, do you think there’s anything else we can do? I mean, we made our SLO, we can monitor it, log into the app and check it. But there might be other things we want to do in addition to that. So, any ideas?

Josh Hull:  

There definitely is. We can configure and assign a burn alert. What this does is give us sort of a window prior to when our budget is exhausted, and we can create notifications based upon that. So if this is not something we absolutely have to wake up our on-call engineers for, we can say, “Hey, give us a notice of eight hours or 12 hours on that burn-down so that we can sleep through the weekend, because it’s something that we know we can resolve. It’s not an outage per se; we can address it during working time as opposed to 3:00 AM on Sunday, if you will.”

Nathan LeClaire:  

Right. And so, unlike just regular old alerts, where we may say, “Okay, if we have more than a hundred errors in a certain time interval, somebody gets a buzz,” this can actually measure how our error budget is going and allow for a little more nuance and sophistication, I think.

Josh Hull:

That’s right.

Nathan LeClaire:   

So to make one of those, we can click on this configure button here and then create a new burn alert. And we see this exhaustion time field. Now, what is that, Josh?

Josh Hull:  

So the exhaustion time essentially is: at the current burn rate, how much time prior to complete exhaustion do you want your alert to fire? So this, in essence, assuming that your burn rate does not have a significant delta over time, that it’s not increasing or decreasing, the calculation here gives you a four-hour window within which to act before that budget is completely exhausted.
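
(The projection Josh describes boils down to a one-line calculation, sketched here with hypothetical counts; it assumes a steady burn rate, exactly as he caveats.)

    # Exhaustion-time projection: at the current burn rate, how long
    # until the remaining error budget hits zero? Steady rate assumed.
    remaining_budget = 8_000   # failing events the SLO can still absorb
    burn_rate = 2_000          # failing events per hour, recent trend

    exhaustion_hours = remaining_budget / burn_rate
    print(f"Projected exhaustion in {exhaustion_hours:.1f} hours")  # 4.0
    # A burn alert set to a four-hour exhaustion time fires as soon as
    # this projection drops to four hours or less.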

Nathan LeClaire:  

Gotcha. So four hours seems like a reasonable default, but everyone might have a different use case, and so we can actually set something indicating that, well, we’re going to blow through our error budget if this keeps going for another four hours, let’s say. And that would obviously be not good, because we would never catch up. We can also notify a variety of different recipients, right? So you can notify by email, through Slack, and of course through PagerDuty.

Josh Hull:

That’s right.

Nathan LeClaire:   

Yep. And even through a webhook, which I think is kind of a nifty thing. And I still think that someday we’re going to see the true potential of this Honeycomb webhook alerting kick off, because you could in fact send out a webhook to something that goes and takes some action as a result of an alert firing. So you could have systems that tie into your monitoring system to go fix themselves, theoretically.

Josh Hull:   

Mm-hmm (affirmative). Now, if you were not relying on some type of autoscaler or horizontal pod autoscaler for instance in Kubernetes and instead wanted to more manually drive your scale up or your scale down, that webhook would be a fantastic opportunity to do so.
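
(A minimal sketch of the kind of receiver Josh is describing: a tiny web service that accepts an alert webhook and drives a manual scale-up. The endpoint path, payload fields, and scale_out helper are hypothetical stand-ins; check the actual webhook body your alerting sends before building on this.)

    from flask import Flask, request

    app = Flask(__name__)

    def scale_out(service: str, extra_instances: int) -> None:
        # Placeholder: call your cloud provider or orchestrator API here.
        print(f"scaling {service} up by {extra_instances} instances")

    @app.route("/hooks/burn-alert", methods=["POST"])
    def on_burn_alert():
        payload = request.get_json(force=True)
        # Hypothetical payload fields; adapt to the webhook body you actually receive.
        if payload.get("alert_type") == "burn_alert":
            scale_out(payload.get("dataset", "unknown"), extra_instances=2)
        return {"ok": True}, 200

    if __name__ == "__main__":
        app.run(port=8080)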

Nathan LeClaire:

Yeah, exactly. And actually, that’s a great point. I’ve seen a lot of autoscaling theorizing, or kind of architecture, that relies on things like CPU: when CPU is over a certain amount, we’ll scale up and down. But that has all kinds of problems associated with it that we don’t necessarily need to get into right now.

And it’s kind of cool to think that, “Hey, maybe there’s a situation where we could do all kinds of cool stuff, like scale up if we’re slow, or scale up if a particular customer actually needs more resources.” Not everyone’s in a fully shared multi-tenant environment. So because Honeycomb allows that high-resolution data, you can actually have alerts and SLOs that get more into the nitty-gritty of dealing with particular customers.

Josh Hull:

That’s right.

36:02

Nathan LeClaire:    

So that’s a little bit about burn alerts. Those will be influenced by the chart that we see here in this SLO, where we’ve got the budget burn-down. So, I think we could even make a more aggressive SLI, if we wanted to, for example purposes here, that would show this burn-down going a lot more aggressively.

Josh Hull:

Yeah, that’d be a good exercise, because adding nines is, from a business objective, ultimately what we want to do. Right? We want to have as few errors as possible for the entire system. The challenge there, we find within the journey of observability and some of these burn-downs, is: if we start too aggressive, what happens? Well, obviously those triggers fire to exhaustion, right? And then we create noise in the system, and we aren’t as responsive to those alerts as we could be.

So we found it much more amenable to adjust the target early on and just say, “Until these are firing at a cadence that is amenable to actually improving the system, it’s firing too rapidly.” Right? We are aware that we have latency; we are aware that this is causing consternation for people who want rapid service or a quick response. But we want to tune this to help our behavior, not necessarily just punish us with constant alerts. Which is fantastic. We’re just shifting that budget and making it more amenable to our opportunity to improve. We’re not getting paged as frequently, and we can focus on the strategy of improvement rather than just firefighting.

Nathan LeClaire:

Yeah, I love that. It really highlights, I think, the amazingness of Honeycomb SLOs in particular: the ability to take advantage of that high-resolution data. You can have a lot of nuance to these, because you can exempt things. For instance, let’s say you have something that’s maybe technically an error, like a 400-level status code, but it doesn’t actually matter for your business goals. We don’t really care if some user got a lot of 404s; that wasn’t really an error, right? That’s the kind of thing you can track in this. And actually, SLO building can be a bit of an iterative process in that way, as you work on exempting some of the things that aren’t really relevant.

Josh Hull: 

That’s right. Liz helped us with our derived column, saying, “Well, let’s focus just on those that are within the 200 response range.” Right? We actually had an active, valid response. Are those still latent? And by limiting it just to 200s, right, we’re stripping out anything that could be server-side, we can strip out anything that’s… waiting for the response, right? And we’re landing on those successful responses and then evaluating, within that smaller subset of total possible responses, whether or not we are meeting our objectives. So you’re absolutely right, the granularity with which you can create the derived columns and sort of drive your business as a result is fantastic.
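(A hedged sketch of what a derived column like that can look like in Honeycomb’s expression language: an SLI that evaluates to neither true nor false excludes the event from the SLO entirely, so wrapping the latency check in an IF on the status code only judges successful responses. The field names status_code and duration_ms are assumptions borrowed from a typical HTTP dataset.)

    IF(EQUALS($status_code, 200), LTE($duration_ms, 1000))

Events that aren’t 200s fall out of the denominator altogether; among the 200s, the SLI is true only when the response also met the latency threshold.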

Nathan LeClaire:

Yeah. And it even almost reminds me of the idea of burning through the alerts really fast, when I messed up that first SLO. And I think I still have this around… We loaded up this page, and we could see, on this incorrectly configured derived column, that our error budget burn-down was at -909896%.

Josh Hull:

I still see some green Nathan. There’s still some green.

Nathan LeClaire:

Still see some green… So, I mean, I happen to consider this to be the opposite of success, but we still have a few successes.

Josh Hull:   

That’s right. We have a few winners.

Nathan LeClaire:  

Hashtag DevOps. Great. Yeah-

Josh Hull:   

No, your point is valid. I like how easily we could recover from that slight misconfiguration, right? The greater-than versus less-than. But it did illuminate very quickly that we were well below budget given our goal, what’s going on here, and we could step back in and fix that. So it’s not unique to anyone trying to set up an environment of observability that it is iterative, and it does give you greater opportunity to understand exactly what’s happening. When we first looked at it, we knew it was broken, right? And it was very easy to identify, “Okay, well, it makes sense: we want these to fall within that window and not exceed the window.”

Nathan LeClaire:    

Yeah, totally. And it actually highlights a thing that’s really special about Honeycomb SLOs, which is that we have all the raw data sitting there, just ready to be queried for things like this. So if you need to back up and change what you’re doing, you don’t have to wait around for 30 days for everything to catch back up again. You actually do have to do that in some tools. So I enjoy that.

So again, your takeaways: SLOs help unify engineering and business. You can start small and simple and then adjust things as you go. You don’t actually have to be experts in observability first; it will take time, and especially if you’d like to become a Honeycomb customer, it’s really important to us that we help you. And just like with a business: iterate, iterate, iterate.

Josh Hull:   

That’s right.

41:49

Nathan LeClaire:      

So that’s kind of the conclusion of the content for part two of our webcast. Tune in next time for part three of the SLO series. Liz Fong-Jones, none other than Liz herself, will be presenting alongside Kristina Bennett from Google, and they’re going to present on how to pick the right SLOs to get started. So tune in next time. And just to do the normal spiel, I am obligated as a member of the sales team to say: if you want to follow myself or Josh on Twitter, you can follow us at @dotpem for me and @vestigialethics for him. And actually, I’ve got to ask him, what’s the origin story of that handle?

Josh Hull: 

I was resistant to any form of social media, and at the moment it was like, it seems like we’re all just kind of letting it go. So that’s the origin.

Nathan LeClaire:

It just sounds like a Magic card to me. Like, you would play that and you’d just be like, “Oh no, Vestigial Ethics, that’s so overpowered.”

Josh Hull:  

I would love to plug the Honeycomb Pollinators Slack channel as well, even if you aren’t a customer of Honeycomb. If you have questions, everybody on there is super responsive. If you’re running into any issues with the UI, the bees, as it were, are extremely adept at actually resolving issues the same day.

It’s almost a ridiculously high bar: you’re able to submit, say, “Hey, this isn’t resolving for me, this modal isn’t popping up, what have you,” and get a very quick response: “Yep, we’re shipping that to production, it should be live for you in 15 minutes.” So if you have questions about Honeycomb, that is, for me, a fantastic resource: just go out and ask the community, or ask the members of Honeycomb to assist with any questions or issues that you’ve got.

Nathan LeClaire:   

Awesome. Yeah. That’s Pollinators working exactly as intended. We really thought that having a community Slack would be a good thing, and it’s worked out super well. So if you want to join Pollinators, or if you are even thinking about using Honeycomb, we have a form on our website, honeycomb.io, where you can contact us. So please do reach out and contact us. We want to talk to you, we want to give you a demo, and we want to help more businesses get started crushing it using observability.

So, thanks, Josh, for calling in. I really appreciate you doing this webcast and talking about your story. We look forward to having you all as customers and helping you continue to be super successful.

Josh Hull:   

Thank you very much Nathan. I appreciate it.

Nathan LeClaire:    

Yep. And thanks everyone for tuning in. That concludes the webcast. Thanks everybody.

If you see any typos in this text or have any questions, reach out to marketing@honeycomb.io.
