
Embedding Observability Into Your Engineering Culture

How can you create a safe culture that enables engineers to learn, try, and test? The data unlocked by observability is a powerful tool for your engineering teams, but it’s the people and the culture that will be the real force for transformation.

Transcript

Jared Jordan [Leader of Growth APAC Engineering|Netflix]:

Hello, everyone. Thank you for joining us for a discussion on embedding observability into your engineering culture. This is a Lead Dev webinar created in partnership with Honeycomb. The webinar will last approximately 45 minutes, after which the panelists and I will head over to the Lead Dev Slack to answer your questions in the observability channel.

Let’s get started with introductions. First, I’m Jared Jordan. I’m the APAC leader at Netflix, and today, I will be joined by Ryan, Tom, and Liz. Ryan is a software engineering manager leading the monitoring group, focused on ensuring that Slack engineers have world-class observability and visibility into their software. Observability is a key aspect of reliability, and Ryan strives to keep Slack reliable for all of its customers. As a hearing-impaired individual, Ryan is also a leader of the Abilities Employee Resource Group (ERG), advocating for inclusivity and accessibility for people with disabilities in the workplace and in Slack’s products themselves. When Ryan’s not looking at graphs or helping folks debug software issues, you can find him on his bike in the Marin Headlands, or camping with his partner in the mountains. 

Tom is a senior technologist at ThoughtWorks. He routinely works with delivery teams to foster a culture of collaboration, trust, and shared ownership of quality. He has a passion for getting things shipped, as well as for bridging the gaps that often exist between ops and engineering. Liz is a developer advocate, labor and ethics organizer, and site reliability engineer with 15 years of experience. She’s an advocate at Honeycomb for the SRE and observability communities, and previously was an SRE working on products ranging from the Google Cloud Load Balancer to Google Flights. She lives in Vancouver, BC with her wife Elly and a Samoyed and Golden Retriever mix, and in Seattle and San Francisco with her other partners. She plays classical piano, leads an EVE Online alliance, and advocates for transgender rights. 

Now that we have had these wonderful introductions of our panelists, let’s jump to the conversation. Liz, what is the best way for leaders to demonstrate the value of observability to their engineering teams? 

Liz Fong-Jones [Principal Developer Advocate|Honeycomb]: 

So I think that to answer this question we need to first define what observability is, secondly talk about what it looks like if you don’t have enough observability, and then, third, talk about the difference between lacking observability and having observability. 

To me, observability is the ability to understand what is happening inside of your software systems and to debug problems in them that you’ve never seen before, without having to push new code, just using the telemetry emitted by your applications. It’s not necessarily a specific tool, it’s not necessarily the data itself; it’s your ability and your team’s ability to analyze that data. When you lack sufficient observability, you find that your team ships slower. They carry larger amounts of technical debt and incur more outages. It takes forever to solve those outages because people feel like they lack visibility into the system, so they have to push new code to figure out what is going on, and that can slow things down in terms of your development velocity, as well as making your customers unhappy with you. 

So the change that you expect to see when you introduce observability, and the business case for observability, is that it enables you to ship product faster and to satisfy your customers with a sufficiently high degree of reliability, and it allows you to improve the maintainability of your software stack for the years to come. We typically look at things like the DORA metrics: your lead time for changes, how long it takes to resolve your outages, and you look at improvements in those metrics as you improve your observability. 
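
To make that concrete, here is a minimal sketch, with entirely made-up data, of how two of those DORA metrics might be computed from deploy and incident records; the record layout, timestamps, and numbers are purely illustrative and not from the talk.

```python
# A minimal, illustrative sketch of two DORA metrics -- lead time for changes
# and mean time to restore -- computed from hypothetical deploy and incident
# records. All field layouts and timestamps below are made up.
from datetime import datetime, timedelta
from statistics import median

# (commit_time, deploy_time) pairs for each change that reached production
deploys = [
    (datetime(2021, 3, 1, 9, 0), datetime(2021, 3, 1, 11, 30)),
    (datetime(2021, 3, 2, 14, 0), datetime(2021, 3, 3, 10, 0)),
]

# (detected_at, resolved_at) pairs for each customer-impacting incident
incidents = [
    (datetime(2021, 3, 2, 2, 0), datetime(2021, 3, 2, 3, 15)),
]

def lead_time_for_changes(deploys):
    """Median time from commit to running in production."""
    return median(deployed - committed for committed, deployed in deploys)

def mean_time_to_restore(incidents):
    """Average time from detection to resolution."""
    total = sum((resolved - detected for detected, resolved in incidents), timedelta())
    return total / len(incidents)

print(lead_time_for_changes(deploys))   # 11:15:00 for the sample data above
print(mean_time_to_restore(incidents))  # 1:15:00
```

Watching these two numbers trend over time is one simple way to check whether an observability investment is paying off.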

Jared Jordan: 

Are there one or two metrics you would start out with, or a playbook that leaders could take away to get started? 

Liz Fong-Jones: 

I actually published the third article in the Lead Dev series on observability, so it’s available on the Lead Dev website, where we talk about some of those business cases and where to get started in terms of making that argument. In terms of where I’d concretely start, a lot of people just start by trying to improve their ability to debug and fix production issues. 

You know, there are a lot of proactive ways to use observability, but I think the most common pain point is that people can shrink their outage durations by deploying an observability solution they can get started with right away. Often, in the middle of an outage, people will wind up adding tracing so they can figure out what is going on. That is really powerful: when you start resolving incidents with observability, it paves the way for others to start using it. 

5:37

Jared Jordan: 

I think that’s really great. When people start taking ownership individually, I think we see the power of observability, and it improves code quality and makes our pushes safer and less risky. Ryan, how can you create a culture, a safe culture, that allows engineers to learn, try, and test? When you’re doing this, people will learn and fail fast. How do they bounce back from that? How do you create that culture where people can learn? 

Ryan Katkov [Senior Engineering Manager|Slack]: 

First of all, I want to say thank you for having me here. A safe culture is about trust, and I want to think about visibly creating trust, because if the system is opaque, it creates a lot of fear for leadership and for the people operating the service. The first thing I would do, and I would say this is truly paramount, is to build visibility into the deploy pipeline and the systems you manage: if you have more visibility in your tooling, you create safety and trust. 

The other thing that is very important is the quality of your environments, and how closely they approximate production. I would recognize that you can’t fully approximate production; you’re going to run into problems, and that applies to small companies and big companies alike. Liz always says to test in production, but you need really good practices for that: deployment practices where you can compare and contrast between your environments and production. 

Liz Fong-Jones: 

I think everyone tests in production whether they like to or not, right? It’s just a question of how well prepared are you to do it? The more well prepared you are to test in production, the less you need to invest in your staging environments. I think the other interesting point here that Ryan is alluding to is the point about sharing data, right? Having that trust and collaboration that comes from sharing data means you’re no longer pointing fingers, saying, “Ryan, it’s your service; my service is fine.” 

Ryan Katkov: 

Yes. I’m sure there are tried and tested methods, and I want to say that Honeycomb has been absolutely instrumental: we use it extensively at Slack. With Honeycomb’s query engine, we made it easier for developers to see the impact of their code, and with that we created trust rather than fear. 

Jared Jordan: 

Fantastic. Yes, creating trust is a good way to look at it, and also looking at the system holistically. When you start looking at the system holistically, like Liz pointed out, you’re not pointing fingers, because you’re trying to figure out what the resolution is and the fastest way to get to that resolution. Tom, building right on that, if you have all these teams, and you have these microservices running around, how do you foster a culture of collaboration and trust between individuals on a team, and between teams, for that shared ownership of quality? 

Tom Oketch [Senior Technologist|ThoughtWorks]: 

Yes, I would say the key to this is really creating psychological safety. As, you know, both Ryan and Liz have alluded to. But when it comes to the whole idea of shared ownership of quality, first, I have to clarify that, in my mind, shared ownership implies a shared vision, right? You can’t really rally teams to come together to uphold something that is not well defined. That’s why it is important to make sure that everyone is on the same page in the first place. 

Practically, what that means is that your team, or, you know, your company, depending on your scope of influence, has a solid understanding of how you determine that you’re meeting your quality objectives, right? So be that something that is measured from the perspective of users, which is, you know, usually the best view to have, or even just looking at, let’s say, internal teams, so, departments like sales, marketing, customer support, like how do they view quality? How do they view your product, and say we are also having a great experience, even while supporting this thing? So you want to make sure that you are all on the same page. 

There are things you can use: you can look at business metrics, you can look at SLOs, and have those defined, saying these are the things that we strive for. This is how we define good quality, so that, from the very onset, people have an understanding of what good looks like and you can tell deviations from that. 
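
As one concrete illustration of turning “good quality” into something everyone on the team can check, here is a small sketch of an error-budget calculation against an availability SLO; the 99.9% target and the request counts are invented for the example and not from the talk.

```python
# An illustrative sketch of checking an availability SLO: given a target and
# a window of request counts, how much error budget is left? The target and
# the request numbers below are made up.

def error_budget_remaining(total_requests: int, failed_requests: int,
                           slo_target: float = 0.999) -> float:
    """Fraction of this window's error budget that is still unspent."""
    allowed_failures = total_requests * (1 - slo_target)
    if allowed_failures == 0:
        return 0.0
    return 1 - (failed_requests / allowed_failures)

# 30-day window: 50M requests against a 99.9% target allows 50k failed requests.
remaining = error_budget_remaining(total_requests=50_000_000, failed_requests=12_000)
print(f"{remaining:.0%} of the error budget is left")  # 76% of the error budget is left
```

A shared number like this gives every team the same definition of “good,” and makes deviations visible before they become arguments.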

11:15

Then, as you’re doing that, the second thing that you want to do is to basically ask yourself whether you’re setting yourself up for more collaboration, right? And the easiest way of doing that, I’ve found, is by actively encouraging knowledge-sharing. 

I know Ryan talked about Honeycomb. When I first used Honeycomb I was fascinated by the fact that there were no private queries. Everything was public, everything was in the shared domain. I think that’s the right default that you want to have. You want to make everything as visible and as transparent as possible. You can’t just say, “Go and collaborate”; you have to put the systems in place to do this effectively. Things like making queries accessible and making runbooks public are things that you should start with. You can look at things like recording production incidents, and encouraging incident leads to talk through those incidents afterward: what were the contributing factors? What were the steps that were taken for resolution? What are the lessons that were learned, right? You can make sure that you are asking folks to debug incidents together, not just solo. 

Then I think one of the last things, and, again, Ryan talked about it, is around trust. For me, the key thing is that trust is something you build up over time, right? It’s not something that is instant. You kind of have to build it through the course of repeated experience. There are some things that you can do to accelerate the trust-building exercise. So, for example, as a leader, being vulnerable enough to share not only the good but also the bad, in the hope that those experiences will be learning opportunities. But, in general, trust-building has to grow organically. So I think, you know, the combination of those things can really help. 

Jared Jordan: 

That’s great. I think you don’t want to start establishing trust during an incident. It’s a lot better to figure out the strategy for building trust as you’re working through the software development cycle, partnering with teams, really talking to them about what the wanted outcome is, and working backward from there. I think all of you are touching on those points: overall, it’s building trust, establishing metrics, and making sure that everybody is aligned and driving to the same set of metrics, to move the business toward the outcome that you want. That’s great.  

One of the things, Liz, is we have many teams and companies and cultures where ideas have flowed mostly from the top, and you get these mandates to do things, or a few selected individuals are the ones that hold the ideas and then drive that vision. But what we’ve seen in our careers is that the magic is when everyone gets involved in the observability culture and in building a strategic vision of how we are going to ship software. Do you have a strategy or framework for getting everyone involved and for promoting that collaboration from the ground level as opposed to just the top down? 

Liz Fong-Jones: 

I love what Tom said about making sure everyone is aligned on what the mission is, and being aligned on what our goals are, and giving people the autonomy and flexibility to figure out how they are going to meet those goals. In particular, when it comes to observability and service operation, I don’t think that you’re necessarily going to get great results by mandating, “You are going to implement CI/CD, you’re going to implement retrospectives,” right? I think it’s better to say, “Here are the target service-level objectives we are aiming for, because this is what our customers expect of us,” and then trust people to figure out the rest. 

I do think it’s important to do things that build trust before the incident, as you said: things like game days, or Wheel of Misfortune, where you rehearse what it’s like to have an outage so your heart isn’t racing because it’s the first time you’ve ever used your incident management process. And the more you have collaboration between teams that doesn’t flow through the product managers and the people managers, the more you create that room for organic ideas to spring up.

Honestly, in addition to creating great ideas, it also really promotes a decrease in employee turnover, right? If people feel like they have autonomy and impact they’re going to stay on your team. I think that that is a huge reason why we really need to focus on being outcome-driven rather than kind of focusing on how you do each thing. 

16:28

Jared Jordan: 

Tom, are there any missteps when attempting to promote this kind of culture, a collaborative environment and a safe culture? Or do any of you have examples you want to share? I think we all have these kinds of experiences. 

Tom Oketch: 

Yes. There are. As is often the case, right? I think one of the more common missteps that I would like to call attention to is, you know, safety doesn’t really mean sweeping things under the rug, and never talking about them because they make people feel uncomfortable or unqualified, right? I think it means actively working to transform such experiences into learning opportunities, right? And especially when things go wrong. One of the things that I like to think about is that, you know, especially when things go wrong, you want to ensure that attribution is not correlated with punishment, right? 

So, I want to give an example, and I know this might sound a bit controversial, so bear with me for a moment. I actually think that blameless post-mortems, or the way we tend to think of them, can easily fall into this trap. Let me explain what I mean by that. I’ve witnessed a few cases where people have said during a post-mortem that we don’t really want to ask about who did what, right? Because this is a blameless post-mortem. We just want to know what happened, we want to figure out how to make sure it never happens again, and the thinking is that by eliminating any focus on the actors, we’re more likely to create a culture of safety, since no one is worried about being vilified when they do something wrong or push the wrong button. But I think this is self-defeating, because it means we’re basically eliminating a lot of useful context from the equation, right? 

Focus on what happened and what we can learn from it, but, at the same time, we have to make sure that we don’t deny ourselves critical information from the people who were in the front seat when this was going on. Having a culture of safety means being able to discuss who did what, what they were thinking, why they did it, how our systems responded, how different people responded as well, and still not apportioning blame to any of the actors, right? The fact that someone did something means that there was a combination of contributing factors that allowed that thing to happen, so rather than taking the easy path and attributing the entirety of what happened to just the person who was directly involved, you have to do the hard work of investigating what all those other contributing factors were.

It is not an easy thing. It’s easy to say this person did this thing and that’s the issue; being able to find out what those factors were is harder. This is also where it definitely helps when people have the right tools to fill in those knowledge gaps, and I think that’s where observability really helps. You have just bits and pieces of information, but what you want to do is find out what the real story is. So that’s one of the things I would say: safety takes a lot more hard work than just saying let’s not blame people. 

The other thing I would probably say is that this is a people thing; it’s not really about tooling. You don’t improve interpersonal dynamics through the mindless application of new tools, assuming they’re going to improve your processes, right? It’s really about how you use them, and about the culture. You have to invest intentionally in people, and in building the right kind of culture that you want to see, and that is a hard task. 

21:00

Jared Jordan: 

Amazing. Ryan? 

Ryan Katkov: 

I agree with you. It’s about agreeing on the practice and teaching people how to naturally evolve that practice. The adoption of tools is really important too, and as you create that culture and put it out there, the tools are a really important part of it. I completely agree with you, Tom. 

Liz Fong-Jones: 

I think the other interesting thing about this is that focus on contributing factors, and that focus on understanding not just what someone did but why. I think that’s part of what John Allspaw’s research speaks to, where he is trying to understand: what is the state that is not visible? What are people’s hidden assumptions? You can’t unpack that if you divorce the people from it. 

Jared Jordan: 

As everybody says, that is the easy thing to do, but nobody in their career wants to say, “I made this mistake,” especially when it’s the first time, and we’ve all been there. So what do you do to loosen up the room just a little bit so people can share a little bit more, be a little bit more open, and build that trust and empathy that we all want to get to in the end? At the very beginning, in your first incident room, you still have to get there. So what are you doing, or what have you all done in the past, to help that come along? 

Tom Oketch: 

As I said, I think there is no easy answer to this. Actually, there’s an example that I have. At a client I worked with a few years ago, even before we went live, we had the VP of engineering basically coming into the room every morning and saying, “You’re all doing a great job,” and then going away, and asking for updates, and things like that. Doing a great job, keep up the good work. 

And, you know, that’s something that seemed like a small thing, but even when we were having outages, even when we were having issues, he would come into the room saying, I know you’re doing the best you can, just keep doing it. I think it’s small things like that that actually help to build people’s confidence and help to build trust, so people are able to know that, even when this is happening, you’re not alone. You actually feel supported by your leadership, and you feel supported by your teams as well. And I think it’s just things like that.

Jared Jordan: 

That’s great, thank you. Ryan, do you have one or two tactical takeaways for the audience as you’re putting observability at the forefront for your teams? You’re doing this at Slack at a very high level, so I was hoping you could share one or two nuggets for us.

Ryan Katkov: 

Sure, of course. First, recognize that not every company has the resources to staff an observability or monitoring team, which Slack has, and Twitter has an observability team, for example. That is not always possible. Let me back up a little bit. If you think about Liz’s definition of observability, and about how we might try to measure observability, the obvious thing to do might be to try to measure the quality of your observability in order to drive adoption or better observability. You can’t really do that. How do you measure something that is not visible?

If you remember, as Liz said, observability is the ability to infer internal states from the external outputs of the services you own. Tom was talking about post-mortems, and we are talking about incidents here. We want the systems to be reliable, and of course we all want our product up, we never want it to go down, but it’s going to happen. The best way to latch on to that is to use the incident process to highlight visibility gaps, even though we always want to be proactive rather than reactive as a company.

So we take the incidents we have today and we identify visibility gaps. In practice, that means we create action items like, “We couldn’t debug this service because this metric wasn’t available at the time of the incident, and it took too much time to answer that question.” You add the metric, and you put a time element on it. That’s how you create that feedback loop, and slowly, over time, you create better observability. That’s how you get in front of the teams: you make it a priority. Part of that is definitely on the developers, but the tooling and the ease of adoption make it easy for people, and Honeycomb was integral in that. 

27:21

Jared Jordan: 

That’s great. Go ahead, Tom. 

Tom Oketch: 

I was just going to add that one of the other things that I found to be really useful is, I mean, we all aspire to situations where you can deploy into production and test in production, but I feel like that is not really a great starting point for every team, right? Most teams are going to have staging environments and dev environments and things like that, and one of the ways to practice using observability is to use it in all those environments, even before production. If you’re looking for iterative feedback, you’re not going to wait until you’re in production and then say, fire up all my observability systems and make sure I have all the data I’m collecting. Being able to apply that even in those lower environments, to practice beforehand and see changes before you get into the production environment, means you can iteratively improve that process as well.

Jared Jordan: 

One of the questions that came in for the panel was: do you have any recommendations for enforcing expectations? What should happen when an expectation is breached, and how do you build these relationships across engineering to get folks to take these new expectations seriously? 

Ryan Katkov: 

I would like to answer that one. That’s a pretty difficult question to answer, because as a leader it is hard to force this; it has to be natural. I know that’s not the answer everyone wants to hear, but it has to be natural. You can’t just foist these kinds of expectations of good observability down from leadership, because a leader can say, “We do not have good observability, we need better observability,” but that is really arbitrary, so it needs to come from the bottom up. You need to have that culture of observability, you need to have the tooling, the processes, and the trust. If you build that trust, it happens naturally.

If you have to enforce expectations, I think that means you have a problem with adoption, or a problem with mutual understanding. If you make it happen naturally, you find you don’t need to set expectations at all. 

Jared Jordan: 

That is great. When leadership and engineering are aligned and you build that trust, it really resonates with other folks. When you get people building support and accountability from the bottom up, they begin to really buy into this as part of the culture and are really accountable for what goes on. I loved what you were saying there, Ryan. It really resonates with me.

Liz, I had a follow-up question on this from the attendees, for the panel; if any of you want to jump in on this, please do. The question is on improving observability. Let’s say I have a bunch of microservices and have integrated Honeycomb. To go to the next step, I want to add metadata to span or trace context. It seems like a huge task to touch every service and endpoint and start adding context information there. Is there any playbook here on how to start making progress? Or is there another way to think about this?

Liz Fong-Jones: 

The way I like to talk about this is in terms of a person’s body. Like, the data about what requests are flowing through, and the basic metadata about that request, that is the skeleton you have to build on. You can flesh out and add the meat when you need more detail. If you don’t have the skeleton, you’re going to be missing critical pieces.

When we think about building out the trunk of that skeleton of your distributed trace graph, it’s important to focus on your ingest points first. Like where is traffic flowing in from my customers, right? And even if I don’t have each individual, like, finger modeled in the skeleton, if I at least have the stick figure diagram, like, you know, here are my arms, legs, that may be just good enough, right? So you don’t necessarily have to think about instrumenting every service.

For instance, if you’re using a framework like Istio, you can use your service mesh to populate some of the data for you. If you have sufficiently large microservice deployments you’re probably using something like Istio or Kubernetes already. That initial deployment strategy is how you get the basic shape of the skeleton. After that, you can add additional spans, you can add additional data specific to each request and to each microservice, and flesh out the bones. You don’t have to do it all at once. If you find latency that you can’t quite explain with your spans, add more spans and trace down into those services.

It’s a question of where your resources are, and also, does your environment have shared libraries that you can add one line of code to so it gets deployed to every microservice, or is it the case that you have to chase down a hundred different developers? It’s situationally dependent, but focus on the impact first.
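
For readers looking for a starting point, here is a minimal sketch of that “skeleton first, then flesh out the bones” approach using the OpenTelemetry Python SDK. The service name, attribute keys, and handler are hypothetical, and the console exporter stands in for whatever backend you actually send to; treat it as an illustration rather than a prescribed setup.

```python
# A minimal sketch with the OpenTelemetry Python SDK. The service name,
# attribute keys, and handler below are hypothetical; ConsoleSpanExporter
# stands in for a real backend exporter.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# The "skeleton": one tracer provider per process, usually wired up once in a
# shared library so individual teams don't have to repeat this.
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(ConsoleSpanExporter())
)
tracer = trace.get_tracer("checkout-service")  # hypothetical service name

def add_request_metadata(customer_id: str, plan: str) -> None:
    """Flesh out the bones: attach request-specific fields to the current span."""
    span = trace.get_current_span()
    span.set_attribute("app.customer_id", customer_id)
    span.set_attribute("app.plan", plan)

def handle_checkout(customer_id: str, plan: str) -> None:
    # Instrument the ingest point first; downstream spans attach to this trace.
    with tracer.start_as_current_span("handle_checkout"):
        add_request_metadata(customer_id, plan)
        # ... call downstream services here ...

handle_checkout("cust-42", "enterprise")
```

Because the provider setup lives in a shared library, individual teams only need the small helper-style calls to enrich their own spans, which matches the “one line of code deployed everywhere” scenario Liz describes.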

Ryan, I know you’ve done this recently at a pretty large scale. 

33:30

Ryan Katkov: 

Yes, we actually have a PHP monolith, which actually made it easier. With the monolith, you have a framework that emits and collects the trace spans, so we add a lot of context in there. With the microservices, what we did is create a service that collects the trace metadata in a variety of formats. That made it really easy for developers to emit trace spans in any format they wish, whether it is OpenTracing, Jaeger, or Zipkin, over any protocol they like, such as HTTP. So we made it really, really easy for them to just immediately add spans, and we gave them a set of patterns and a set of standards.

I think overall, starting from scratch, it took us about nine to twelve months to get to roughly 95% adoption across Slack, about 100 engineers across I don’t know how many services, and we ingest something like 250 billion events per month into Honeycomb. That is 1% … we are actually really excited about our progress in making it easy to look at the data, to drive that adoption. 
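
To illustrate the “emit spans in whatever format you like” idea in today’s terms, here is a hedged sketch using the OpenTelemetry Python SDK, where the wire format is just a pluggable exporter; the endpoints and service name are placeholders, and this is not a description of Slack’s actual pipeline.

```python
# An illustrative sketch (not Slack's actual system): the instrumentation code
# stays the same while the span format/protocol is a pluggable exporter.
# Endpoints and the service name below are placeholders.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# One team might ship Zipkin JSON over HTTP ...
from opentelemetry.exporter.zipkin.json import ZipkinExporter
# ... while another ships OTLP over gRPC to a central collector.
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(ZipkinExporter(endpoint="http://localhost:9411/api/v2/spans"))
)
# Swap in OTLP instead (or in addition) without touching application code:
# provider.add_span_processor(
#     BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
# )
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("billing-service")  # hypothetical service name
with tracer.start_as_current_span("charge_card"):
    pass  # application code never sees which format the spans leave in
```

The design point is the same one Ryan makes: if the format decision is isolated behind a standard pattern, teams adopt tracing without having to care about the transport.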

Jared Jordan: 

That’s just amazing and shows the scale. And along those lines, Tom, you have a lot of people coming from a lot of different teams, and you’re always changing or adding folks to your team. We’re talking about building trust, adding people, and building a culture, so what are some of the strategies you use to drive adoption of modern products and tools with your teams, given that you have people coming from everywhere and everybody using their favorite tool, while still driving a consistent level of adoption of the tools across teams? 

Tom Oketch: 

I think that’s actually a good segue from what we’ve just been talking about. When I look at adoption, and this goes for the adoption of anything, regardless of what kind of tool it is, two things really stand out to me. The first is that for something to be adopted, people need to be aware of its existence in the first place, right? 

So there is the marketing aspect to it, which is to say, you know, if we are working with a bunch of different teams, and we are providing internal tooling, some work has to actually be done to say, you know, we have this tool, we have this capability that you can now use, right? You can’t just assume that because you’re building something, everyone is going to know about it. 

Liz Fong-Jones: 

This is the danger of internal platform teams where people assume you have a captive audience. That is not true. I think that this is why I’m a huge advocate for developer relations, internal developer relations. More companies need it. 

Tom Oketch: 

Absolutely. That is important. And so, in addition to the marketing aspect, I like the fact that you talked about internal platform teams, because even internal platform teams have users, right? You can’t take shortcuts or just assume what people want. You actually have to take the time to do the research, talk to your teams, and understand: okay, this is what we’re offering, but does it actually work for you? You need to run surveys and questionnaires and all kinds of things like that, because you’re building a product and you want to measure it. You want to apply the same standard as the external products that you might be shipping to other users. So I think that’s really important. 

And I think one thing that is often understated is that you have to be willing and ready to receive feedback from your users. It is not just pushing your own agenda onto all these teams. You actually have to figure out: okay, what is the feedback? What do I do with it? You have to go through a process of prioritizing that. Once you do that, you’re probably going to get a bit more adoption than is typical. 

38:17

Ryan Katkov: 

I agree with you, Tom. Thinking about what the problem space is, that’s important. I’ve struggled with this at other companies, because there were a lot of engineers who wanted to build an internal product without actually thinking about the users or the problem, so I totally agree with that. 

Jared Jordan: 

That’s amazing. Liz commented in the Slack room: don’t mandate. I loved what you were all saying about seeking feedback from your customers, because if you seek feedback, then you’re going to make it better, and they’re going to approach it as their part of the solution, as opposed to “I need to use these tools because somebody said so.” Thank you. 

For any other panelists, any last tidbits that you want to leave the attendees with for getting started on their journey? Or, if they’re a little bit further along with mature microservices, what should they take away from this conversation?

Tom Oketch: 

I mean, I will just say it’s better to start. You know, just start somewhere. Especially for new teams, for whom observability is a new concept, I think there is a tendency to go and try to read all about it, look up all the products out there, and kind of get lost in analysis paralysis. I would say pick something, just start somewhere, and then you can start to iterate on that process. That’s really what I would say. 

Jared Jordan: 

I like that. Pick something, build trust between teams, and seek feedback from your audience, or the people that are using your tools, and then continue to send out surveys, and make sure that those surveys are acted on, because then people will feel like they’re a part of the process. 

Also, I would like to thank all of you, Liz, Ryan, and Tom. Thank you for being panelists today. The insights that you shared today are going to be helpful to all of our attendees. We will also be continuing the conversation and taking questions on the Lead Dev Slack channel about observability. You can all head there, and we will be there answering questions. Thank you for joining us, and we hope to see you all next time. 

If you see any typos in this text or have any questions, reach out to marketing@honeycomb.io.
