
Tradeoffs on the Road to Observability – With Animation

The main way to improve the sustainability of services is to empower engineers, not just SREs, to understand their systems. Liz talks about how observability yields that empowerment as a natural byproduct of the practice.

Transcript

Liz Fong-Jones [Principal Developer Advocate | Honeycomb]:

Hello, LeadDev. Congratulations. Today is your day. You're off to observability. You're off and away. You have brains in your head, you have hands on the keyboard, you can steer yourself any direction you choose, but which direction should you go? Once upon a time, I worked in an organization that really wanted to take care of its customers, but it had many, many different ways of doing everything, because this was a very large organization. People really did take to heart that they wanted to put the user first, but this resulted in a lot of miserable humans. How? Well, it turns out the focus on the user resulted in people having a lot of pain that was very self-inflicted. That road is paved with good intentions: no one intends to introduce technical debt, and yet somehow, when people do introduce technical debt, it's the people who come after them who end up saddled with it. Even worse, people had the wrong incentives. Who do we promote in our organizations?

What do we promote them for? And how do we make sure that we're growing and rewarding the right set of people? Maybe your culture is one that says: it wasn't invented here, so there's no possible way it could work here. I'm a snowflake, I've got massively gigantic scale, I've got massively different needs than anything ever invented before. So you know what, all that old stuff? Push it away. We're going to build it from scratch again. Or what about situations where people get rewarded for doing shiny things that don't necessarily deliver impact? Where people adopt resume-driven or promotion-driven development, where people almost have too much autonomy and build things that are not necessarily the most impactful things they could be doing. I've also seen the opposite: situations where people are in an environment where they're completely disempowered from doing anything. Where the attitude is, you know what? I'm just here to do my job. I'm not going to write any software. I don't have control over my tools. The consultants decide that, management decides that, and I just have to use whatever shit they give me.

Fortunately, there is a better way, and we don't have to live in these dysfunctional organizations that reward mediocrity or make-work. So what's our goal as software engineers and as software engineering managers? Our goal is to make sure that we're delivering features alongside appropriate reliability and scalability. Ordinarily I would spend a while talking about service level objectives, but you can refer back to my talk from LeadDev New York last year, where I talk about appropriate reliability and scalability. The short of it is that we have to measure. We have to understand what our customers are doing and whether we're meeting their expectations, and make sure that we're delivering that level of service, but also not over-delivering, so that we leave enough room for innovation in our code bases rather than feeling frozen in place by excessive reliability requirements.
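To make "appropriate reliability" concrete, here is the basic error-budget arithmetic behind a service level objective, as a minimal TypeScript sketch. This is my illustration, not from the talk, and the numbers are invented:

```typescript
// Error-budget arithmetic for a simple availability SLO.
// All numbers here are illustrative.
const sloTarget = 0.999;           // 99.9% of requests should succeed
const totalRequests = 5_000_000;   // requests observed this 30-day window
const failedRequests = 3_200;      // requests that violated the SLI

// The SLI: what fraction of requests actually succeeded.
const sli = (totalRequests - failedRequests) / totalRequests;

// The error budget: how many failures the SLO permits in this window,
// and how much of that allowance is still unspent.
const budget = totalRequests * (1 - sloTarget);             // 5,000 allowed failures
const budgetRemaining = (budget - failedRequests) / budget; // 0.36

console.log(`SLI: ${(sli * 100).toFixed(3)}%`);                           // 99.936%
console.log(`Error budget left: ${(budgetRemaining * 100).toFixed(1)}%`); // 36.0%
```

A positive remaining budget is the "room for innovation": you can spend it on risky deploys and experiments instead of pushing reliability past what customers notice.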

This means that we have to think about not just whether our fleet of machines can scale out, but also whether we are paging our humans to death. Are we doing too much ops work in order to keep our systems running? Because chances are that doing ops work all the time is not your team's mission. Your team's mission is to deliver a service that's sufficiently reliable. No team should have to do too much operations work, and scaling machines without appropriate control mechanisms is a recipe for disaster. We need to have only the minimum amount of essential complexity relating to the features that we build, and not unnecessary complexity related to technical debt. We also need to have appropriate observability. We need to have a control loop over our data. Observability is not logs, metrics, and traces. Observability is something else. Observability is a capability in our systems that enables us to understand and debug our systems in production. Not in staging, not in dev. In production. Because like it or not, we all test in production. Observability is not just about fixing what's broken. It's our ability to answer any question that we may have about our code, whether that's getting our service back up and running after our service level indicators tell us it's too broken, or answering other business questions we might have. Things like: can we release code on a predictable cadence? Can we manage the quality of our code and debug it, even when we're testing it and writing it for the first time? Can we understand whether or not people are using the features that we're building? Can we get insight into what we're building? And finally, can we actually manage that complexity? Can we understand what the scary areas of our code are, and how to make them less scary?

5:33

It is not just the data. It's not just these interfaces or abstractions of logs, metrics, and traces. It's something else. Observability is our ability to write instrumentation in an ergonomic fashion, so that writing that initial instrumentation is as easy as dropping in a debug print statement to understand what's happening inside our code; then to gather and store the data in an economical format that isn't going to break the bank; and finally to query that data to answer those five concrete business questions I just showed you. That is what observability is. It's that overall end-to-end workflow of being able to ask questions and get answers. That's what we're aiming for. But often people assume that what actually matters is how many people are asking for my attention, or how many people are using my system. I don't think that's the case. We as people are ourselves a scaling limit. So don't make systems that depend upon cloning people as you're thinking about how to engineer observability solutions. We need to make sure that we're engineering for all of our teams, rather than just a subset of teams. We shouldn't shovel work across the fence or make different teams use different sets of homegrown tools, because that's just going to add more cognitive barriers and dissonance when people are trying to understand what's happening in production. I want to go further and say that engineering itself is not writing software. Engineering is a much broader practice that encompasses requirements gathering, then design, then implementation, and finally actually running things in production and making sure they keep running.
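To make the "as easy as a debug print" point above concrete: with OpenTelemetry, annotating the active span can be about as lightweight as a print statement. A minimal TypeScript sketch, my illustration rather than anything from the talk; it assumes an OpenTelemetry SDK is already configured, and the attribute names are hypothetical:

```typescript
import { trace } from "@opentelemetry/api";

function applyDiscount(cartTotal: number, discountCode: string): number {
  // Annotate the active span with the same facts you might otherwise
  // print to a debug log; these attribute names are made up.
  const span = trace.getActiveSpan();
  span?.setAttribute("cart.total", cartTotal);
  span?.setAttribute("discount.code", discountCode);

  const discounted = cartTotal * 0.9; // placeholder business logic
  span?.setAttribute("cart.discounted_total", discounted);
  return discounted;
}
```

The attributes ride along on the trace, so the same one-liner that would have been a print statement becomes queryable data in production.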

The number of lines of code committed is not the metric we should be measuring. So let's suppose that you are an engineer, and you have a problem that you're trying to solve. Let's try to change the incentives, as engineering managers and as engineers, so that the engineer who's thinking "I have a problem I want to solve" does the right thing for your company. Instead of celebrating heroism and duct tape, we should talk about what it is that we actually solved and how elegant the solution was. Not: how many hacks did I build? We should be talking in the language of success cases. What did people do with this observability tooling that we built? How did we successfully implement or integrate this tool to make our lives better, regardless of the number of lines of code, whether it was minus 1,000 or plus 10,000? And we have to think about the problem definition and start from there. When we're telling these stories of success: what was the problem? What requirements did we gather? What did we try? Because the failures are as important as the successes. We have to understand the dead ends, so other people following in our footsteps can avoid walking into those same dead ends. And how did we solve these problems? How did we actually engineer a solution that met the requirements after resolving the dead ends we ran into? And can it actually be reused? Can someone other than my own team have a successful experience using that tool on their team, too? How much time did you save? How many user experiences did you make more delightful by developing better tooling? Not: here's how many lines of code I wrote. Really, look left, look right. Look at what other people around you are doing before you start building a thing from scratch.

Do not glorify creating complexity, or enshrine the complexity of the problem in your requirements for getting someone promoted on your team. And remember, there's a wide ecosystem beyond what you as an individual developer, or you as the technical lead of a team, can write. There's open source. There are the combined efforts of hundreds or thousands of different companies, all working together on shared solutions to our common problems. You may have heard of the idea of inner sourcing: looking at other organizations within your parent company, so that you don't build the same thing at your company 20 times over. Maybe if you're Google-sized, three versions of the same thing to solve the same problem are enough to account for all possible use cases. But the majority of the time, you're probably pretty closely aligned with what people in your business unit are building. Make sure that you copy or extend something they're doing, rather than going off and building something completely greenfield. And vendors do exist, especially in this economy. Should we be spending time building bespoke solutions for something that is not actually core to our business? If this is not a strategic advantage for my company to build, why am I building this? We should only write ourselves the things that we believe are unique value differentiators for our company. So consider outsourcing to people who are experts in a particular thing and have them maintain a shared solution, rather than developing your own bespoke one. Many of you may be nodding your heads and saying, "Okay, that's great, but Liz, tell me what to do. Should I build? Should I buy? Enough blathering from you." Well, I can't. I really cannot, because your situation is unique.

11:15

Instead, what I'm hoping to do is guide you through some of the things you might want to consider. What problem are you solving? And better yet, the problem is not something like "I want to have more dashboards." We need to think instead about what problem our customers are having, whether they're internal engineering teams or external ones. Sometimes the customer is you, sometimes it's your teammates, sometimes it's other engineering teams, and sometimes it's our company's customers, or our company's customers' customers. So anchor first on that source of user pain, and then iterate and understand why and how: how did we get here? If we fixate too early on how we're going to solve it, rather than on how we got here, we constrain the solution space too much.

Honeycomb's VP of engineering pulled me aside last week, in fact, and said, "Hey, Liz, when you're proposing solutions to things rather than stating the requirements first, you're causing a lot of consternation." And I thought about it and I was like, "You know what? Yeah, there might be a simpler solution floating out there that I didn't consider." So the best thing to do when brainstorming is not to propose a solution first, but instead to define the problem. And then we have to think about the who as well. Who's going to run it? Who's going to actually operationalize this and make it work? Who's going to maintain the codebase? Who's going to make sure that it doesn't become a rampant pile of stuff bolted on, and bolted on, and bolted on? Our products and our tools should not be collections of random features. They need to be a coherent product with a maintained codebase. Yes, internal tools and site reliability engineering tools need product management. And we have to think about what options we considered, and write up those factors in case they change in the future. Why did or didn't we reuse something? Why did or didn't we use open source? What about the vendors, and why did we consider building something new? One of the most phenomenal examples of this that I've seen is Etsy, which published a case study of why they picked Google Cloud Platform over Amazon and Azure. More than that, the more you can put the studies of the options you considered out in the open, the more you'll benefit someone who's deciding: should I extend this, or should I do my own thing? Should I continue to build on this? These are all important things that you need to document and share with the wider community. And it's okay if one thing won't solve it; it's okay to use a hybrid of different solutions.

Just make sure that they work well together, because even having two separate existing solutions glued together is much better than building and supporting a third thing. Adapt an existing solution or two. Sometimes the right answer is both: extend both. You don't have to worry too much about building a platform out of multiple components. That's why they're modular. It's why they're extensible. When you write new code, that's technical debt. When you have forked code that you're maintaining, for example a Kubernetes fork, that's really, really costly, because now you have not only your new code, but you also have to periodically integrate the existing code. Don't fork Kubernetes, please. Upstreamed code that exists outside of your own fork, that's in the main repo, that's fine. That's got a community looking after it. But if you are the only person, or the only organization, looking after it, that is a single point of failure. In running software, anything that you have to host on-premises or within your own environments is technical debt. Someone has to patch it. Someone has to keep it up to date. Your library dependencies will have security vulnerabilities, and you're going to need to update them. You're going to have to make sure that your application doesn't have memory leaks. There are so many factors to consider, and therefore software as a service is a much better solution overall when you can wrangle it. But above all else, and I've learned this especially in the past year since I joined a startup, if you build the wrong thing, it is ruinous. Congratulations: you have wasted three to six months of your company's opportunity to differentiate its products by doing undifferentiated heavy lifting. I worried a lot last year when I started advocating for Honeycomb, the company that I work at, to build a feature oriented around service level objectives.

15:51

Because I knew that if we built the wrong thing, if I was wrong in thinking that no company or open source project had yet done service level objective measurement correctly, then I would have wasted several engineers' time for six months. And that would have been a disaster for our series A startup. So think really carefully before you build. What do we do after we build that software, though, or after we adopt the solution? I think we need to collaborate between our organizations. We have to listen first and understand: what is it that we're trying to achieve? What is it that our peers, our colleagues, our users are going to use? Product requirements documents can really, really help in getting you to think about user studies and understanding what you need your solution to do. And then look for similar problems. As I said earlier, look left and right. You're not a unique snowflake. So gather requirements for related use cases too. But at the same time, you do have to deliver value quickly, so don't let feature creep slip in. And whatever you're extending, make sure that you're not painting yourself into a corner; that you can always extend what you've done, rather than feeling locked into a box of some YAML. And make sure that you're documenting and providing examples. Why? Because people love to copy. Good programmers can write code. Great programmers copy. Your colleagues will want an example of the Django framework running at FastCo, and they're going to copy any existing examples they have of Django running at FastCo. Therefore the onus is on you: if you're the first person introducing Django to FastCo, you'd better make sure that it's well documented and adopts the best practices of your organization.

People really do copy the first instance of a working example they can find, so make sure you're documenting adequately. And make sure you have an active community of users. In case you do decide to build your own observability solution, or at least adopt an in-house one based on existing pieces, make sure you have an active user community. Make sure that the people using the same observability tooling are talking to each other. People should feel empowered to improve and extend those tools, rather than feeling like: it's that other team, they never listen to feature requests. That breeds the discontent that causes people to roll their own. The more you make people feel welcome to contribute, and the more they feel they understand the technology, the better adoption you'll see. User confusion is really, really costly. Think about how much, or how little, work people are going to have to do on their own. Not just how much people are going to have to bug you, which is a factor too; make sure that you're not just pushing complexity onto your customers to cut down your own ticket load. Really think about the best way to accomplish the use case, rather than having people pile workaround on workaround on workaround, or override on override on override. And make sure that you share that tooling and that code. Make sure that people are learning about solutions in sibling organizations and in organizations outside their parent organization. Make sure that people share that code and feel a shared sense of ownership, rather than feeling like: if I didn't write it, it doesn't count towards my promotion.

And lastly, I think it's really important to give back to our broader community. Yes, like the set of people here at LeadDev Live, virtually. We need to make sure that people use the same language, that people are using the same terminology for the same concepts, because people change companies all the time. So let's not make them relearn the same concept names. We shouldn't have the ideas of cherry-picks and feature branches with people calling them five different things. Do not dump your random hacks on the community. Do donate your research, but where something is untested, don't tout it as the latest and greatest. The community is a no-dumping zone. And make sure that people understand the full context you used when deciding whether to build or buy, so that when other people are considering adopting your solution, they can understand whether or not it's worth their time to evaluate it. If something is never going to work for someone's use case, make sure they don't waste time evaluating it. And upstream your code, upstream your specifications. If you modified something, don't maintain that separate fork forever. When you do upstream things, it makes it a lot easier for people outside your organization to gain a stake in it, so that they feel like they're benefiting from it too, and they can also help out with the maintenance. Single-company projects are doomed to failure. This is why organizations like the Cloud Native Computing Foundation, the Linux Foundation, and the Apache Software Foundation exist: they help people who have a common shared interest in a piece of technology, like Apache Kafka, come together.

21:03

Even if a plurality of them all work at one company. Because it means that the stewardship is in good hands, and that if that company, heaven forbid, goes out of business, there will still be a continuity plan. We have to make sure that we document our technical decisions. There was an amazing talk last year at LeadDev New York by Rod of Dropbox, who talks about this idea of having technical decision-making documents, where you have something like a request for comments or a Python enhancement proposal that really says: this is the problem we're trying to solve, this is why we did or didn't do it, here are the objections raised, here are the trade-offs and benefits. That way, it helps newcomers understand why we did things, and what we would need to change to do them differently. You don't have to be a hero in order to achieve success. We should have the freedom to walk away or move on and have the world go on without us. This means that we have to automate away our previous jobs, and we have to make sure that the tooling we build or adopt is usable by other people in our organizations. So I often reflect on the DORA key metrics and the DORA research. DORA is the DevOps Research and Assessment group, the same folks who wrote the Accelerate book. They figured out that there are four key metrics of high-effectiveness software delivery, and you'll notice that none of them involve writing your own tooling. They're just questions: can you deploy often enough? Does it take you not very long to get your code from commit on trunk into production? Can you fix your outages quickly? And how often do your changes fail?
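To make those four metrics concrete, here is a rough TypeScript sketch of computing them from a log of deploys. This is my illustration, not from the talk or from DORA, and the Deploy shape and its field names are hypothetical:

```typescript
// Hypothetical deploy record; real pipelines would pull these fields
// from CI/CD and incident-tracking systems.
interface Deploy {
  committedAt: Date;   // when the change landed on trunk
  deployedAt: Date;    // when it reached production
  failed: boolean;     // did this change cause a failure in production?
  restoredAt?: Date;   // if it failed, when service was restored
}

function doraMetrics(deploys: Deploy[], windowDays: number) {
  const hours = (a: Date, b: Date) => (b.getTime() - a.getTime()) / 36e5;
  const failures = deploys.filter((d) => d.failed);

  return {
    // Deployment frequency: deploys per day over the window.
    deployFrequency: deploys.length / windowDays,
    // Lead time for changes: mean hours from commit to production.
    leadTimeHours:
      deploys.reduce((sum, d) => sum + hours(d.committedAt, d.deployedAt), 0) /
      Math.max(deploys.length, 1),
    // Change failure rate: fraction of deploys that caused a failure.
    changeFailureRate: failures.length / Math.max(deploys.length, 1),
    // Time to restore service: mean hours from failed deploy to recovery.
    timeToRestoreHours:
      failures.reduce(
        (sum, d) => sum + hours(d.deployedAt, d.restoredAt ?? d.deployedAt),
        0
      ) / Math.max(failures.length, 1),
  };
}
```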

And I think that's what's golden. That's what we're trying to do: deliver value to our customers as reliably as possible. So when you have observability, it enables you to be a much more mature performer. It enables you to be in that 20% of folks who can reliably deploy within hours, not days, and who reliably have less than 1% of their changes fail. But unfortunately, fewer than 20% of us are there. The remaining 80% of us are struggling with some of these basic, fundamental issues. And I would argue that if you're in that 80% that's still struggling, you shouldn't be building your own things. You should be adopting and learning from what the elite performers are doing. Things like adopting continuous integration and continuous delivery really well, adopting continuous builds, and making sure that you're building the right observability solutions into your production toolchain. Fifty-seven percent of elite performers have really excellent integration between their observability tools and their production deployment systems. So here's a quick case study before I close. Honeycomb is a company with 33 employees and fewer than a dozen engineers. Our job is to make systems humane to run: by ingesting people's telemetry, enabling people to do data exploration, and empowering our customers, who are also engineers. That is the goal we started with. We didn't start with "we want to build a column store because column stores are cool." We started with the idea of focusing on our audience of engineers first, engineers at other companies. And it means that we've spent our time doing the right set of things, adopting the right set of common tooling, and sometimes building our own, in order to do things like deploy on Fridays.

You'll notice that we deploy up to 10 times per day, every single day except Saturdays and Sundays, because we work five days a week, not four or seven. So, lead time: make sure you have the right tools in place to allow people to iterate quickly. Every time build time starts taking more than eight or ten minutes, people go and fix it. And in fact, I went and said, "You know what? I'm not going to design my own JavaScript parallelization framework. I'm going to use thread-loader, and that's going to solve my Webpack problems." That's how we approached this problem of wanting to make our builds faster, so our engineers see fast turnaround times. Deploy more quickly: we deploy once an hour if there's a change. And again, we didn't build our own bespoke deployment system. We use Chef for it. We use cron. It's not that complicated. You don't have to over-engineer things. And our change failure rate has gone down because we've really invested; instead of adopting these static-analysis linting tools, we just make sure that we test things and that we have adequate flag flips, and we fail less than one in a thousand of our production changes. And when we do have an outage, it takes 30 seconds to a minute to flip a flag off. We can roll back within 10 minutes. We can fix forward in 20 minutes. It's a lot more freeing when you've built or adopted the right set of primitives. You don't have to build it all yourself. That's what high-productivity product engineering looks like. It doesn't look like massively over-investing in infrastructure.
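For reference, the thread-loader approach mentioned above typically looks something like the following in a webpack configuration. This is a minimal sketch, not Honeycomb's actual config; thread-loader runs the loaders listed after it in a worker pool:

```typescript
// webpack.config.ts: a minimal sketch of parallelizing an expensive
// loader with thread-loader. Not Honeycomb's actual configuration.
import type { Configuration } from "webpack";

const config: Configuration = {
  module: {
    rules: [
      {
        test: /\.tsx?$/,
        exclude: /node_modules/,
        use: [
          "thread-loader", // loaders below this run in a worker pool
          {
            loader: "ts-loader",
            // transpileOnly skips per-file type checking, which cannot
            // span worker boundaries; run the type checker separately.
            options: { transpileOnly: true },
          },
        ],
      },
    ],
  },
};

export default config;
```

The point of the anecdote stands on its own: dropping in an existing, community-maintained loader beats designing a bespoke parallelization framework.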

So overall, what I'd implore you to do is write less software: look around you, evaluate the solutions around you, collaborate with your peers, and then share what you learned so that other people in the community can benefit from it. If you'd like to view these slides, they are at Honeycomb.io/Liz, and I'm also going to be taking questions in Slack afterward. Thank you.

