Conference Talk

Cloud-Native Observability: Basics to Advanced Forensics

 

Transcript

George Miranda [Head of Product Marketing|Honeycomb]:

We are joined by Frederic Branczyk, Founder of Polar Signals, previously with Red Hat and CoreOS. You might know Frederic from his work as a maintainer of Prometheus and Thanos. He’s also the tech lead for SIG Instrumentation at the Kubernetes project, so welcome, Frederic. Gordon Radlein is an Engineering Manager at Facebook. There he is. He supports observability infrastructure at Facebook, and was previously at Etsy and an SRE at Google. Gordon has worked at a number of companies, both big and small, in a variety of domains. And then we have Liz Fong-Jones, who is a Principal Developer Advocate at Honeycomb. She’s a member of my team. I absolutely love her. You will too. You probably know her from her advocacy work or from a variety of other places. She’s the co-author of the upcoming O’Reilly Observability Engineering book, along with me and Charity Majors. So welcome, Liz. And then finally, this panel is being moderated by Rags Srinivas, Lead Architect at InfoQ, where he focuses on Kubernetes, Open Source, and DevOps. With that, I’ll hand it over to Rags.

Raghavan “Rags” Srinivas [Lead Architect|InfoQ]:

Hello. Good morning, good afternoon, good evening, wherever you are. Hopefully you were at the earlier panel, so we’re not going to talk too much about DORA metrics, because that was covered there. But this panel is really about basics to advanced forensics. We have an awesome panel that is willing to answer just about every question you have on observability, including what is observability; right? And, you know, I just want to remind you that this is your panel, so it’s all going to be based on the questions you ask on Slack.

Please go to the channel on Slack and I will try to moderate those questions and feed them to the right panelists. If you want a particular panelist to answer, just indicate your preference as well. With that said, I really want to thank everybody, especially the panelists. And before we dive into the audience’s questions, I’d like to start by asking each of you: what was your first experience with observability? Whether it was a pleasant experience, a nightmarish experience, or whatever. That way you can introduce yourselves and build a bridge with the audience so they can ask you the right questions. Let’s go with Frederic.

Frederic Branczyk [Founder|Polar Signals]:

Yeah. I think you started with an interesting question there, which is, how do we even define observability? Because I guess we need to sort of do that before we can talk about what my first experience was. And really, I think there are a lot of definitions out there. But for me, observability is sort of anything that allows me to understand the operational aspects of my software better; right?

That’s actually a very broad definition, but if we take that, I guess it must have been somewhere around 2002 or something. You know, I was using logs to understand my software better. But if we talk about what I guess we would think of as the…let’s call it the modern era of observability, that was probably around 2014, when I first started using time series for alerting purposes. And I think for me that was where an evolution happened, when Prometheus was open-sourced. I felt like there was a real new beginning where we weren’t doing checks anymore, but gaining much deeper insight. Ever since, the community has really developed quite drastically; right? We can understand our systems better than ever before. I guess that’s kind of my story. I don’t want to take everybody’s time.

Gordon Radlein [Engineering Manager|Facebook]:

Sure. I’m going to throw it back. I have a similar understanding of what observability is. You know, it’s a property of a system that tells you how well you can understand its state by looking at it. So I’ll take it way back to when I was first learning to program. I wrote a little Python script, went to the command line, hit enter. And then nothing happened…it just went to the next line. I was like, oh. Did something happen? And that’s when I learned about the utility of print statements and getting insight into what is actually happening in a running program. I’d say my journey started there.

Since then it’s been kind of learning new tools that provide better observability and then running into the limits of those tools and looking around to see what’s next.

5:46

Liz Fong-Jones [Developer Advocate|Honeycomb]:

Yeah. I’m going to echo what Frederic said, which is this idea of, let’s see this turbulent behavior in production. Let’s see what unexpected things users are going to do on my systems. That was my first job. I was working as a technical support engineer on a massively multiplayer game, Puzzle Pirates. And players were doing all kinds of weird cheating activities, and those would show telltale fingerprints in the logs. But the modern era of observability began for me with the Monarch system at Google. It was my first exposure to really measuring and monitoring systems using config as code. It was, let’s properly do this, with proper instrumentation and a proper query language.

Raghavan “Rags” Srinivas:

Perfect. I just realized we don’t have any questions on the Slack channel. There may be a little bit of confusion between the previous panel and this panel. So, you know, if you have a question for this panel, just preface it with “forensics.” Okay?

With that said, in the spirit of kind of crawl, walk, run; right? It seems like observability covers so many aspects; right? It could be as simple as what Gordon said, just putting in print statements, all the way to doing a lot more with OpenTelemetry and so on.

For somebody who’s kind of relatively new to this journey, you know, where do you recommend starting? And let’s go kind of reverse this time. Start with Liz Fong.

Liz Fong-Jones:

Yeah. I think that there are a variety of ways to start these days. I think the primary aim should be understanding user behavior. You can always drill in and add more granular instrumentation later. But if you’re not capturing the traffic coming into your application or through your load balancer, then you have no hope of understanding what’s happening under the hood. You’ve got nothing to tie it to. That’s why I advocate this tracing-based approach where you start by tracing the broadest request at ingress and then add much more instrumentation as you go.

That being said, I’m not super picky that it has to be a trace. I feel that there are a variety of ways to do it as long as it’s extensible. As long as you’re not going to silo yourself into a place where you can no longer add telemetry instrumentation because there’s nowhere to attach it to.
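As a rough illustration of the ingress-first approach Liz describes, here is a minimal Go sketch that wraps an HTTP handler with OpenTelemetry’s otelhttp middleware so every inbound request starts a root span you can attach more detail to later. The route and attribute names are placeholders, and a tracer provider is assumed to be configured elsewhere.

```go
package main

import (
	"fmt"
	"log"
	"net/http"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
)

func main() {
	// Business handler: enrich the ingress span with request-level context.
	// More granular instrumentation can be attached to this span over time.
	checkout := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		span := trace.SpanFromContext(r.Context())
		// "app.user_id" is a hypothetical attribute name; pick a schema and stick to it.
		span.SetAttributes(attribute.String("app.user_id", r.Header.Get("X-User-ID")))
		fmt.Fprintln(w, "ok")
	})

	// Wrap the handler so every inbound request starts a trace at the edge.
	// (Assumes a TracerProvider has been configured elsewhere.)
	http.Handle("/checkout", otelhttp.NewHandler(checkout, "POST /checkout"))
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

The point of a sketch like this is that even before any deeper instrumentation exists, every request is tied to a root span that later telemetry has something to attach to.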

Gordon Radlein:

Sure. You know, I think bringing people into the world of observability and the tools can be tricky. Because there’s a lot of tooling now; the ecosystem has broadened dramatically over the past few years, and people can get overwhelmed fairly easily. I think it’s one of those things where an analogy might be test-driven development, where the right thing to do is to write the tests first. That’s good, and people say you should do that, but in practice, people just write their code and maybe bolt on some tests if somebody calls them out on it.

I just try to think about it through the lens of not overwhelming people who are new to this space with the most advanced concepts and things like that. And to Liz’s point, capturing those user flows, and how you do that, will depend on the kind of tooling you have available to you. It’s really just starting to bootstrap that cycle. Because once people start to get that feedback, see those charts move, understand those request flows, and see interesting events in their logs, that is the bootstrap cycle. And you want more of it. You’re like, oh, that’s interesting. Then you realize you can’t answer something and start thinking more about it.

Use whatever tools you have available and just start to capture those kinds of critical flows and ask questions and bootstrap that cycle.

10:14

Frederic Branczyk:

Yeah. I mean, without adding too many more aspects, maybe I can bring in an actual example. I maintain a project that automatically sets up an entire observability stack on Kubernetes, and exactly what the other panelists just said happens there. People install it and they’re immediately overwhelmed with the amount of information available. So I can only repeat what Liz said. Be really intentional about the thing that you’re trying to do and start at the ingress level. That tends to be where you already get so much insight. Maybe enough insight; right? We also often say that’s where you want to define your SLOs, because essentially that’s where most of your users actually interact with your service; right?

I completely agree with that. And not only from a technical perspective, but also from a cultural, organizational perspective, because if you buy a product or install a system that gathers a lot of data, that tends to quickly add up in terms of the cost of the system; right? And if you can gradually prove the worth of the system and grow it that way, I think you’ll be much more successful than creating a really huge bill right away. Even though it’s probably worth it, it’s harder to sell than gradually showing this is improving our reliability and our understanding of our system, rather than going all out right away.

Raghavan “Rags” Srinivas:

Perfect. We have a question from Robert. I think that’s been answered by James Governor as well, but I think this is especially relevant to this panel. What is the difference between observability in a traditional data-center-hosted context and in a cloud-native context? Personally, I think they use the same tools, but there may be more. I guess, what are the tools of the trade for a cloud-native developer who wants to implement observability? Let’s start with Gordon this time.

Gordon Radlein:

Sure. I think Liz will have good insights here, but I’ll give you my perspective, which is that at a high level, I don’t think there’s that much of a difference. At the end of the day, what you’re trying to do is have insight into all aspects of your system so you can answer whatever questions you need to answer at the time you need to answer them; right? And if you control a data center, then that might even get into, is there maintenance happening in the cluster where the set of servers live that is currently serving this workload that’s having problems?

Now, if you’re in the cloud, you’re probably not looking at that level. You probably want to know if there are issues happening in the zone or region that you’re in; right? It’s a similar thing. And so the levels of abstraction go up, but you want to get insight all the way down to the core primitives of your infrastructure.

Liz Fong-Jones:

I love that answer, Gordon, because I think it encapsulates this idea of, what is cloud-native? Does cloud-native mean you are using a modern Kubernetes or microservice architecture? Or does it mean owning your own from soup to nuts? I think those are two different considerations. Facebook, I would argue, is cloud-native in terms of having microservices, but not in terms of running its own data centers. Whereas at Honeycomb, we trust AWS. We need to measure and instrument everything flowing outbound to AWS so that we can communicate clearly with AWS across that boundary. So I think that influences your technical choices: what do you instrument, what do you measure? How deeply does it integrate with your software stack?

Frederic Branczyk:

Yeah. One additional thing I would add is that if you actually own the data center, there are some things you just can’t fundamentally rely on. If you’re entirely on a cloud, you can rely on distributed data storage, you can rely on persistent volumes and things like that. If you don’t have that, there are some fundamental assumptions you just can’t make about your infrastructure. This is where Prometheus, for example, came from. Where you really don’t want a distributed system is when you’re running this infrastructure and there’s nothing but the bare minimum you can rely on.

And when you go up the stack, I understand how using those kinds of systems can make sense. But it’s kind of all layers; right? So, yeah.

Raghavan “Rags” Srinivas:

Great. I know there’s another question on how do you scale observability, but let’s jump to a slightly more technical question, which is really about… this came from Ryan. Thanks, Ryan. I often have trouble finding the specific query I’m looking for, sometimes due to not knowing what some other developers named a trace or what attributes it has. Are there any guidelines or tips on naming, or on ease of querying, when adding a trace, a span, or an attribute to an application?

Liz Fong-Jones:

Exemplars. Exemplars, exemplars, exemplars. I’ve been talking about this for years. Even before I came to Honeycomb, I was at Google, and this was one of the things that changed my mind about the value of tracing. Before exemplars, I was a metrics purist. Monarch is the way. Let’s have many time series. Traces are so hard to access, why do I even bother? And being able to click on a spot in the heatmap that showed me an example trace that exemplified the behavior I was looking for, that changed my mind. It fundamentally changed something in me. And now I want to bring that to everyone, whether they’re using Jaeger or Honeycomb. That changes things. I see Frederic nodding too.

Frederic Branczyk:

Yes. I’m very excited.

Raghavan “Rags” Srinivas:

Maybe kind of a dumb question here. Are exemplars applicable across the board, or only in the context of OpenTelemetry?

17:30

Frederic Branczyk:

I think they’re a general concept. I think OpenTelemetry will make them quite a lot more accessible, just because of that kind of intertwinement, I guess if that’s a word, of technologies. The Prometheus libraries have support for it, but you kind of need to wire it up yourself. So I think the accessibility is going to vastly increase.
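For a sense of what that manual wiring looks like, here is a minimal Go sketch using the Prometheus client library’s ExemplarObserver to attach a trace ID to a histogram observation. The metric name and trace ID are illustrative, and exemplars are only exposed when the metrics endpoint serves the OpenMetrics format.

```go
package main

import (
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// A request-latency histogram; the name is illustrative.
var requestDuration = promauto.NewHistogram(prometheus.HistogramOpts{
	Name:    "http_request_duration_seconds",
	Help:    "Duration of HTTP requests.",
	Buckets: prometheus.DefBuckets,
})

// observe records a latency and, when a trace ID is available, attaches it
// as an exemplar so a metrics heatmap can link straight to the trace.
func observe(d time.Duration, traceID string) {
	if eo, ok := requestDuration.(prometheus.ExemplarObserver); ok && traceID != "" {
		eo.ObserveWithExemplar(d.Seconds(), prometheus.Labels{"trace_id": traceID})
		return
	}
	requestDuration.Observe(d.Seconds())
}

func main() {
	// Exemplars only appear over the OpenMetrics exposition format.
	http.Handle("/metrics", promhttp.HandlerFor(
		prometheus.DefaultGatherer,
		promhttp.HandlerOpts{EnableOpenMetrics: true},
	))
	observe(42*time.Millisecond, "4bf92f3577b34da6a3ce929d0e0e4736")
	log.Fatal(http.ListenAndServe(":9090", nil))
}
```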

Liz Fong-Jones:

Yeah, you don’t have to have OpenTelemetry. We did it at Google. I presume Gordon’s team is thinking about it at Facebook. These are the kinds of things that are easier to do if you have that shared tagging mechanism. It’s the backbone on which you build your telemetry.

Gordon Radlein:

Yeah. One point I want to make here is that when you think about exemplars, what we’re talking about is discovery; right? How do you make it really easy for people to find what they’re looking for? You know, there are systems that will log trace IDs in your logs. If you’re looking at logs, you might be able to click through and load up a trace of interest. But it’s very much dependent on the tooling. Ideally, you’re using, or can acquire or build, tools that make it easy to discover these things. Especially when it comes to tracing, discovery is the most important thing, because if you don’t have discovery, people aren’t going to go to your separate application and start randomly iterating through different permutations of queries.

Liz Fong-Jones:

That’s the worst; right? The random clicking. The frustration. You can see it in users as they query. They’re just like, I give up.

Raghavan “Rags” Srinivas:

I think Ryan had a follow-up here, which is: instead of querying for a trace with X and Y, look for a trace that fails, then click around to find an example and narrow from there. Is that the way you would do it?

Liz Fong-Jones:

You should be able to do both. I think you should be able to do both. The way you get to a trace with X and Y is with sufficiently advanced trace querying and also consistent naming schemes. Schemas are super important for making sure: is it app.error, or is it app error?

Gordon Radlein:

Yep. And naming is hard.
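Since naming is hard, one low-tech guard against the app.error versus app error drift Liz mentions is to define attribute keys once in a shared package that every service imports. The sketch below is only an illustration, and the package and key names in it are made up rather than any published standard.

```go
// Package telemetryattr centralizes span attribute keys so every team
// spells them identically ("app.error", not "app error" or "appError").
package telemetryattr

import "go.opentelemetry.io/otel/attribute"

const (
	KeyError    = attribute.Key("app.error")
	KeyUserID   = attribute.Key("app.user_id")
	KeyTenantID = attribute.Key("app.tenant_id")
)

// Error returns a consistently named error attribute for any span.
func Error(msg string) attribute.KeyValue { return KeyError.String(msg) }
```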

20:15

Frederic Branczyk:

I think there have been multiple attempts at namespacing for observability data. I don’t think there’s a standard that has necessarily established itself. But I know there are a lot of people wanting to have standardized metrics for HTTP or gRPC, for things that are common in the industry. And I think that would also help in terms of discovery.

Raghavan “Rags” Srinivas:

Perfect. The next question again from Kyle is once observability is in a company’s culture, especially let’s talk cloud-native here, how do you scale? Who wants to take it? I’ll let one of you pick it. And again, you know, we don’t need to be congenial to each other. We can fight a little bit.

Gordon Radlein:

I can start. So, to repeat the question: once a company has observability in its culture, how do you scale? I would say once a company has observability in its culture, that will just happen. And I think the really hard part is getting it into the culture. Because typically, if you think about a company that’s not tiny, you have different teams that own different things. And the hard part is not saying, oh, we want to add tracing or we want to instrument some stuff. It’s: who’s going to do this work? Who’s going to own it? Where’s it going to live? There’s an existing system that does a similar thing but not as well. What happens to that? Do we turn it off? What happens to those people? You get all of these questions, and it becomes very hard to scale. The coordination cost, the organizational cost, often becomes the limiting factor. An organization that has observability in its bones is going to prioritize that and find ways to make it work, whatever mechanisms they use.

I think it tends to be really hard when everybody says they want observability; right? But is anybody changing their road map to get this stuff done, to create the integration that’s needed to really make discovery simple, so that it gets in front of the people who aren’t in the weeds on this stuff all day and they start using it? I think that is the hard part.

Liz Fong-Jones:

I really love that idea of, how do you make it so people who are not in the weeds can use it easily? One thing we do at Honeycomb is that our pull request template asks: what is your observability plan? You fill that out for every pull request to explain how you’re going to understand whether this is working. So it’s not just the super experts; it’s everyone on the team who is practicing this and reinforcing it, rather than just the one logging tool wizard. If you have the one logging tool wizard, then that’s not going to work.

Frederic Branczyk:

Yeah. I love how that example ties back to scaling. You’ve implemented the culture, and that’s how you make it scale. And I think that ties exactly back to Gordon’s point that it scales when you have the culture, not really vice versa.

Raghavan “Rags” Srinivas:

All right. I think we have time for one last question, and I’m going to pick my own question here. OpenTelemetry is by far the second most popular project, next only to Kubernetes. Is Kubernetes really helping the effort or adding more complexity? What is the synergy, if you will, between Kubernetes and observability? Who wants it?

Liz Fong-Jones:

You’ve got a big observability person.

Frederic Branczyk:

Yeah. I don’t think Kubernetes is necessarily complicating it. Maybe it is complicating it for itself, just because of historic things that are already there and widely used. I don’t think it’s complicating it, but I think there’s almost a fundamental tension, and sometimes I struggle with this with OpenTelemetry myself. It’s very intentional that OpenTelemetry ties all of these things together. But that also brings difficulties with it when you want to use it; right?

If you already use a metrics system, then switching may be more difficult. I’m not sure whether Kubernetes is making it more difficult for OpenTelemetry, or vice versa, or whether the existing legacy, if you want to call it that, is making it hard for Kubernetes. Not that there’s anything wrong with what Kubernetes does today. But obviously, they want to do tracing, and that’s been long in the works and it’s happening, but it’s difficult to introduce when you have lots of existing processes.

25:42

Liz Fong-Jones:

The one thing Kubernetes has done that’s great for OTel is enabling people to run the OTel Collector as a sidecar. Without a platform like Kubernetes, it would be impossible to standardize and say you’re going to get it on every process that you run.
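As a sketch of what that looks like from the application’s side, the Go snippet below points an OTLP trace exporter at localhost, on the assumption that an OpenTelemetry Collector sidecar is listening on port 4317 in the same pod; the endpoint and tracer names are placeholders rather than anything the panel prescribed.

```go
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	ctx := context.Background()

	// The Collector sidecar is assumed to listen on the pod-local address,
	// so the application never needs to know where telemetry ultimately goes.
	exp, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint("localhost:4317"),
		otlptracegrpc.WithInsecure(),
	)
	if err != nil {
		log.Fatalf("creating OTLP exporter: %v", err)
	}

	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exp))
	defer tp.Shutdown(ctx)
	otel.SetTracerProvider(tp)

	// Any tracer obtained from here on sends spans to the sidecar.
	_, span := otel.Tracer("example").Start(ctx, "startup")
	span.End()
}
```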

Raghavan “Rags” Srinivas:

Gordon, do you have anything to add?

Gordon Radlein:

I do not. I think…

Raghavan “Rags” Srinivas:

Okay. All right. So let’s go to final words. Maybe each of you has 45 seconds? Anything you want to add? It’s perfectly fine to just say pass. So let’s go with Liz.

Liz Fong-Jones:

Yeah. I think that, in keeping with the theme of the panel of going from basics to advanced forensics, I would say don’t be intimidated by folks doing advanced forensics. Don’t feel like you have to get there or else. You can take these baby steps, incremental steps, and they’ll free up so much more of your time; right? The easier you make it to investigate production problems, the easier you make it for people to indulge their curiosity about the code and systems, the more that will drive value, and over time you’ll work your way towards the advanced setup. Measure the simple stuff, observe the simple stuff, and go from there.

Raghavan “Rags” Srinivas:

Perfect. Gordon?

Gordon Radlein:

I guess on that same note of basics to advanced forensics, I’d just say, for folks building the tools: really think about how you take your users from basic to advanced forensics; right? How do they get a simple introduction to the value you can provide with observability, with the data you’re collecting? And then how do you use that to bring them into more powerful tooling and more powerful features that can bring them even more value, but require more of their investment to learn and really make use of?

I think thinking about that is really critical to scaling observability at an organization, because for a lot of people, especially if they’re new to the system, or if it’s just one tool amongst all the other tools they’re using because they’re focused on product development or something like that, you have to show them the value. And you don’t do that by throwing a very complex system in their face and saying, look how awesome this is.

Frederic Branczyk:

Yeah. I agree with everything that’s already been said, but maybe to add another point, what I always like to say is: look at the tools that you are still using today that aren’t automated in any way. All of the tools that we use today somehow originated in us manually collecting this data. At Polar Signals, we do continuous profiling. Profiling has been part of the developer’s toolbox forever, and this is sort of the continuation of that; right? No pun intended. We look at profiling in new and interesting ways to make it even more useful. And I think there’s so much out there that hasn’t even been touched; just look at the tools you use to debug incidents that go beyond the tools you have today. I think there’s so much more value to be created.

Raghavan “Rags” Srinivas:

Perfect. With that said, I think we will give back 30 seconds. I really want to thank everybody who attended. I know that there are some questions in the queue, so please come back on Slack and we will answer those questions. But I really want to thank the panelists. I think it was great. Maybe in the near future, we’ll do it in person. That’s what I’m looking forward to, anyway. But thanks, everyone, again. Ciao.

Gordon Radlein:

Cheers.

