Stepping Our Way Into Resilient Services
Is it possible to discover unknown-unknowns proactively with Chaos Engineering? Where exactly is the intersection between intentionally breaking production services and discovering the multitude of ways they could be broken with observability? This is a short presentation that leaves plenty of time to have a real-time discussion with George Miranda, take audience questions, and explore practical steps you can take with your teams as you step your way toward improving service resilience.
Jason Yee [Director of Advocacy|Gremlin]:
Hey, everyone. I’m Jason Yee. I’m the Director of Advocacy at a company called Gremlin. Today, I want to chat about how we step our way into resilient services. I think we can do that with this concept that’s been called the “adjacent possible.” Biologist Stuart Kauffman created this theory of the adjacent possible, and, without getting too deep into the science, the gist is that evolution builds upon itself.
For example, a fish cannot simply evolve into a land animal in one step. That’s impossible. There are all these things that need to come together for that to happen. First, the fish needs to grow lungs and the ability to breathe air. And then the fish needs to grow thicker bones that help support its weight outside the water. And then those bones, those fin bones need to turn into solid bones for legs, right, and have a joint in the middle for elbows. And the ends need to morph from fin into feet.
And in hindsight, we can see this entire chain. It all makes sense to us. These are the natural steps that need to happen. But moving forward through time, we don’t have that ability. It unfolds. Each step is only possible after the previous evolution and couldn’t be conceived of before then.
And so in his book “Where Good Ideas Come From,” Steven Johnson applies this to innovation. Just like the fish to the land animal, the evolution, you cannot just jump from light bulbs to lasers. There are all these things that need to happen in between. The invention of the light bulb creates a platform where innovative people can begin to imagine: what’s next? Things they could never imagine if they only had fire or candles.
Steven Johnson explains the adjacent possible like this: “The adjacent possible is a kind of shadow future, hovering on the edges of the present state of things, a map of all the ways in which the present can reinvent itself.” And he goes on to describe it further: “The strange and beautiful truth about the adjacent possible is that its boundaries grow as you explore them. Each new combination opens up the possibility of other new combinations.”
You can think of it as a house that magically expands with each door you open. You begin in a room with four doors, each leading to a new room that you haven’t visited yet. Once you open one of those doors and stroll into that next room, three new doors appear, each leading to a brand new room that you couldn’t have reached from your original starting point. You keep opening new doors, and, eventually, you will have built a palace.
As I was reading this, it got me thinking about something that we talk about all the time, and that’s contributing factors. We say this word a lot when we think about and talk about incidents. We often say that in complex systems, like the ones that we work on, there is no root cause. It’s not just one thing that leads to incidents or outages, it’s contributing factors. And that one factor allows for another thing to happen, and then combined with another factor, it all leads to our major incidents.
And we often talk about this as if it’s all black swans, and we could have never imagined it. It got me thinking: if failure has contributing factors, then maybe there’s an idea of the adjacent possibility of failure.
What I mean by that is that in our current understanding, your current understanding of your system, you can imagine possible failures. But it’s extremely difficult, if not impossible, to imagine the catastrophes that those possible failures would lead to, how all of those contributing factors can come together. This naturally leads me to think about the three categories of knowledge and unknowns that we have, and the first is the known-knowns. We know a lot of things. These are the things we know we know. We’re certain about them.
Then there are the known-unknowns, things we know we’re ignorant of. I know that nuclear submarines exist. People work with them. I also know that I’m completely ignorant of that. I have no idea how to operate a nuclear submarine. That’s a known-unknown.
And then we have unknown-unknowns. These are the things that are so foreign to me that I don’t even know that I don’t know about them. There’s maybe something crazy. There’s some UFO or something in Area 51 that I could never imagine, so I don’t even know what I don’t know about it. If we put this in Johnson’s analogy of the rooms, the room I’m currently in, that’s my known-knowns. I know everything in that room. I know about it. In the adjacent room, that’s the one I can imagine. There’s a door, and the door leads somewhere, and I can imagine that being a room.
So those are my known-unknowns. I know there’s something there. I just don’t know what it is. In this analogy, the rooms beyond that, the ones that are two or three rooms away, those are the unknown rooms, the rooms I have no concept of. The question then becomes this: How do you explore those adjacent possibilities? How do we innovate on failure and move beyond the rooms and see what is in them?
With the known-knowns, if you know there’s a weakness in your system, you’re going to build for that. That’s why we have testability and reliability. That’s why we build systems for auto-remediation. We have autoscaling groups that know when a VM is down and restart it. We have Kubernetes, which can automatically scale workloads up.
And then we have our known-unknowns. This is where monitoring and observability are important; right? We know our systems need resources. We don’t necessarily know what they’re at or how they’re trending. We have tools that we have built to track for us. This is also where we create incident response processes. Again, we know that our systems could have something go wrong with them, and we don’t know exactly how that will play out, but we know that we can create processes to respond. We can pull in the information and the people that we need so that we can deal with them when they happen.
The problem though is those unknown-unknowns, the things that are so far out there, the things where the failures again become collaborating or contributing factors. We don’t know how that will work, so we don’t have anything. We’re not building for those. We don’t have processes for those. We only have hope. We hope that they don’t happen.
In this, if we’re here in our known knowns, how do we move into our known unknowns? And I think the way that we do that is chaos engineering. I’ve used it, and it works well. Chaos engineering can start to let us explore those known-unknowns by reproducing incidents. When we use that to gain information and gain a better understanding, then our known-unknowns become our known-knowns. That means our adjacent possibilities become our known-unknowns. And we start to diminish some of that unknown-unknown area.
For those not familiar with chaos engineering, it’s thoughtful, planned experiments designed to reveal the weaknesses in our systems. This is what I used to say a lot. It’s short, it’s memorable. It’s not a bad definition. It plays into the idea that we don’t want weak systems. We want strong, robust systems.
But, over time, as I’ve started to do chaos engineering more and more and helped others do it, I’ve evolved my definition of it. Yes, it’s thoughtful, planned experiments. You have to be intentional about your experimenting. If you’re randomly going about things, you’re not going to learn a whole lot, and you’re probably going to be a jerk to your colleagues who are having to deal with your randomness. But, rather than being designed to reveal our weaknesses, it’s about building an understanding of how things work. The best way to make better systems is to have better engineers. Systems don’t build themselves. So the better you are as an engineer, the better your systems are going to be. So how is this done?
Well, I like to follow the scientific process. If you remember from grade school, you start with a question about your environment. You form a hypothesis about how you think things work. You test that hypothesis, gathering data. You analyze that data, and then you repeat it, and you share the results. In chaos engineering, the question often starts as: Is my application or system reliable? With my definition, I think that’s the wrong question. I think the question is: Does my application or system work the way that I think it does?
From that, then, we can form a hypothesis: given some sort of condition, some sort of failure, how do I think my application is going to react? From there, we simply inject that failure or replicate that condition, and we gather the data. As we gather the data, we can analyze it, and we can learn from it.
Or in the case of the adjacent possible, we can start thinking: if this failure condition happened, what other contributing factors could come along and lead to catastrophe? From this new platform, I can start to imagine more than I could have before. And then, again, sharing and repeating. So sharing your learning so that everybody in the organization can learn from what you’ve done and repeating it as you begin to improve your systems and understanding. When you do this, start small. Be careful. Try to start with the smallest experiment you can that will yield some data, and then grow from there.
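The loop described above (question, hypothesis, inject, gather, analyze) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `Experiment`, `run_experiment`, and `ToyService` are hypothetical names invented for this example, and the "service" is an in-memory toy rather than a real system or any particular chaos engineering tool.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Experiment:
    question: str                   # what we want to learn about the system
    hypothesis: str                 # how we think the system will react
    inject: Callable[[], None]      # start the failure condition
    rollback: Callable[[], None]    # undo the failure condition
    measure: Callable[[], float]    # gather one data point (here: error rate)
    expected_max: float             # hypothesis holds if every sample <= this


def run_experiment(exp: Experiment, samples: int = 5) -> dict:
    """Inject the failure, gather data, and compare it with the hypothesis."""
    baseline = exp.measure()
    observed: List[float] = []
    exp.inject()
    try:
        for _ in range(samples):
            observed.append(exp.measure())
    finally:
        exp.rollback()  # always clean up, even if measuring raises
    return {
        "question": exp.question,
        "hypothesis": exp.hypothesis,
        "baseline": baseline,
        "observed": observed,
        "hypothesis_held": max(observed) <= exp.expected_max,
    }


class ToyService:
    """Hypothetical stand-in for a real system: two replicas behind a load balancer."""

    def __init__(self) -> None:
        self.healthy_replicas = 2

    def kill_one_replica(self) -> None:
        self.healthy_replicas -= 1

    def restore(self) -> None:
        self.healthy_replicas = 2

    def error_rate(self) -> float:
        # As long as one replica is healthy, requests still succeed.
        return 0.0 if self.healthy_replicas > 0 else 1.0


service = ToyService()
result = run_experiment(Experiment(
    question="Does my service work the way I think it does when a replica dies?",
    hypothesis="Losing one of two replicas causes no user-facing errors.",
    inject=service.kill_one_replica,
    rollback=service.restore,
    measure=service.error_rate,
    expected_max=0.0,
))
print(result["hypothesis_held"])
```

The returned dictionary is the part you would share with the rest of the organization. The smallest useful experiment here is a single replica and a handful of samples, and you would grow from there.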
Oftentimes, people talk to me and they’re afraid: “I don’t want to take down production.” But I always mention that you’re doing this all the time. You deploy changes to production all the time, and you’re not freaked out. Do the same thing with chaos engineering. Start small. Build confidence. By the time you’re pushing to production, it’s not the idea of: “I’m going to take down production.” It’s “I’m confident and I’m building more confidence.” Back to the question: How do you explore the adjacent possibilities of your system? How do we innovate on failure and move into those adjacent rooms so that we can understand more about our systems and how they fail?
Looking forward to this conversation that’s up next with Charity. Thanks for joining.
George Miranda [Director, Product Marketing|Honeycomb]:
We have time, and we would like to have a bit of conversation about Jason’s talk and figuring out the intersections between observability and chaos engineering. With that, Jason, you know, I think your talk had a really great viewpoint on unknown unknowns. I guess here is the way I would frame the conversation, right?
I think the point of observability is to answer those known-unknown questions when they happen. Right? I think the position I usually take here is you can never know what those known-unknowns are. Let’s dig into them. It does strike me as a bit reactive. So what you’re proposing is, you know, start iterating your way toward that. How do you ever know what those known-unknowns are?
Yeah. That’s the point, right? That’s why we love observability. We get into an incident, and we’re like, this thing happened, and I don’t know what happened. So I have these questions; right? I think the problem when we think about true unknown-unknowns is that you don’t know what questions to ask; right?
That’s why we wait. We’re basically waiting for incidents to happen to find out what questions to ask. So there’s definitely the viewpoint of, well, that’s just the way it is. There’s always known-unknowns. That’s true. You can’t know everything, but, generally, I’ve been doing a lot of study on innovation and practices, and there’s this notion that we can start to explore that; right? Innovation is not just luck. If we think about failure, we can’t really just assume that it’s also luck, right? That it’s just black swans and catastrophes happen.
I found it interesting that we can start to explore. We can start to dive into our systems to try to understand a little bit more and maybe simulate failure with the notion of, yeah, if it’s all contributing factors, can I try something and then from there imagine what’s happening. Right?
Sure. I will give you that. I think that it’s not just luck, but there’s such a permutation of possible contributing factors that come together that it seems like an infinitely large puzzle to solve.
I think you’re right. To some practical degree, there are bits of that puzzle that we can guess against, right? There’s no practical way to answer this question, but how big is that unknown; right? Because how big is infinity?
How much ROI do we get on that; right? It seems like we’re investing a lot of time at guessing at those unknowns and chipping away at it. Is it worth it? Are those the failure modes we’re going to see in production?
That’s a good question. It’s largely up to the organization how much they want to invest. I think part of the reason I was thinking about this is we’ve adopted blameless postmortems at Gremlin. Everyone has. You avoid hindsight bias, but hindsight bias is there because how many times have you been in an incident where you’re like, of course, that happened.
There’s a huge amount of unknowns that we could never know, and you shouldn’t spend all your time doing this, but I think there’s enough stuff to predict if we could only put ourselves in a position at least spending time to imagine what it might look like. I think that’s the balance.
Here’s what I hear you saying. We’ve mentioned today how our systems are sociotechnical. And a lot of things that you do find in a blameless retro, you know, in retrospect you’re like, Yeah, of course, that was going to happen because of the teams that we have set up, the systems of communication that we have. Some of those social bits. So you discover that when you simulate a failure, no matter what that is, to some degree. That failure can actually float up some of those people problems that you intend to find.
Do you know the right place to figure out what is happening with that service? When you communicate with someone else about that service, do they have the same model in their mind about how that works as you do? I think those kinds of things, you’re 100% right. You will find that when you simulate a failure (asterisk: any failure), to some degree, right? Yeah, I think I see that argument. But what about actual observability; right? Where is the line? Where does observability fit into your experiments? Do things need to be instrumented to figure those out? Do they go together?
They totally go together. If you don’t have observability and you’re starting to do chaos engineering, that’s a losing game. Like, what are you actually going to learn if you can’t see anything? Right? So you definitely need to have both together.
I would challenge that. What is interesting is that I’m the observability person here, right? For some of those issues on the sociotechnical spectrum, does the observability matter? Yes and no. I think it does in those cases where you have different models of understanding the system. Without actual data and real models to look at what’s happening in production, it’s hard to get on the same page. But for some things, where it’s communication issues and access to tools and access to production, maybe not so much. Sorry, I hear myself say that, and I think I’ve softened my stance since I came into this talk.
I’m curious for you. Absolutely, you need the telemetry. How does that come into play?
That’s funny. What comes into play, it’s interesting. I don’t have that completely set, if that was the goal. I love the fact that we’re talking through this and thinking through this because that’s predominantly what I wanted. I think it goes also to the fact that chaos engineering is not just this exploratory thing. You can use it to test systems.
A lot of times, we use it to run those sorts of incident replications. That’s definitely a useful value.
In that case, yes, you’re using it for practicing your incident response, maybe you don’t need observability. But, you know, obviously, this being the observability conference, yes, you should have it. You’re going to get the most value out of chaos engineering in that way.
I hear you on that. One of the things I’m looking at is running tabletop experiments and running through here is what I think it’s going to look like. And what sort of telemetry do we need to communicate that. What do we need to communicate to stakeholders and things of that nature?
But there are certain failure modes that you can expect. I think specifically of new features we’re launching into production. Like, I sort of have this idea of what this new microservice does. It returns, like, four possible answers. If that’s the case, then, sure, we can actually iterate around those. That seems like less of an infinite problem. I think 1,000%, you want all the telemetry that you can possibly have to figure out whether all those failure modes I expected to see happened the way I expected. I think that can be surprising sometimes.
Absolutely. But the question is how many times do we have services that are truly microservices that return a limited set.
Okay. All right. I hear you, right, because the problem remaining is not always that small. Where is the practical advice? I come back to observability means you have the telemetry and the tools, hopefully, to be able to get down to the root of what happened. Where was the bottleneck? Where was the source of the issue coming from? When it happens and then trying to predict that in advance. What is your practical advice for folks trying to figure it out? You have observability in your systems. Maybe you’re not doing chaos engineering yet, but what are practical steps to take to start combining the two practices?
Yeah. When you’re starting out, a lot of chaos engineering is going to be confirming what you know to be true. So that’s going to be testing out the fundamentals of: Does my service restart if it fails? What happens if I cut off network connections? Definitely start there. Start with a small blast radius, build your confidence, and work your way into production. That’s kind of the standard core of what we’ve always thought about for chaos engineering.
But I think, you know, in talking about this whole adjacent possible, part of that is, cool, if that’s the base level, where do we go from there to get more interesting value out of what we’re doing?
It’s fine if you’re doing chaos engineering and, to you, it’s just testing to ensure that this thing restarts. That’s easy, and we’ve got tons of customers at Gremlin that do that. Right? They put it in their CI pipeline, and when they go to deploy and it’s running all those tests, that’s what happens; right? It kills their process. They wait for it to come back. Yes. Check that box. Go ahead and deploy.
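A check like the one described here (kill the process, wait for it to come back) might look roughly like the sketch below. Everything in it is an assumption made for illustration: `Supervisor` is a hypothetical stand-in for whatever actually restarts your service, the "service" is just a sleeping Python process, and none of this is Gremlin's API.

```python
import subprocess
import sys
import time


class Supervisor:
    """Hypothetical stand-in for whatever restarts your service in real life."""

    def __init__(self) -> None:
        self.proc = self._start()

    def _start(self) -> subprocess.Popen:
        # The "service" is just a Python process that sleeps.
        return subprocess.Popen(
            [sys.executable, "-c", "import time; time.sleep(60)"]
        )

    def reconcile(self) -> None:
        # Restart the service if it has exited.
        if self.proc.poll() is not None:
            self.proc = self._start()

    def stop(self) -> None:
        self.proc.kill()
        self.proc.wait()


def service_recovers(sup: Supervisor, timeout: float = 5.0) -> bool:
    """Kill the service process, then wait for a new one to come back."""
    old_proc = sup.proc
    old_proc.kill()
    old_proc.wait()
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        sup.reconcile()  # a real supervisor would do this on its own
        if sup.proc is not old_proc and sup.proc.poll() is None:
            return True
        time.sleep(0.1)
    return False


sup = Supervisor()
try:
    recovered = service_recovers(sup)
finally:
    sup.stop()  # do not leave the toy service running
print("recovered" if recovered else "did not recover")
```

In a CI pipeline, the boolean from `service_recovers` would gate the deploy step. The timeout is the part worth tuning: it encodes how quickly you believe the service comes back.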
I think that’s a good baseline. I think as we talk more and more about complex systems, though, starting to explore some of the ways and build our knowledge around our systems and chip away at those known unknowns becomes extremely valuable.
It sounds like what you’re recommending then is you pick an arbitrary point and maybe a really important point in your infrastructure, somewhere in that architecture, start there. Iterate those failure modes. Then, once you sort of have an understanding of what those major ones are, then with that adjacent possible, go with the next service and discover a little bit more there.
As you start to figure out what the adjacent possibilities are and how one service affects the other, you will never have a complete picture, but you will start slowly understanding a lot more of the common failure modes. The way to get there is, one, do the experiments; and, two, understand the telemetry of what happened when things went awry.
That’s a good way to sum it up. Start with the things you have more confidence in or more understanding of. Build that up and then start expanding.
I love it. That’s a great way to go about it. It looks like we’re almost at time. Jason, I want to thank you for your time. Thank you for participating in this talk. I want to let everybody else on the track know that we are meeting back over at the main stage for a closing talk. So we’ll see you there in about a minute or two.
Thanks again, Jason.
Thanks for having me.