Caring for Complex Systems: We Can Do This
By Jessica Kerr | Last modified on February 27, 2023
When we work at it, professionals are pretty good at analysis. We can break down a simple system, look at its parts and their relations, and master it. Given enough time and teammates, we can analyze a very complicated system and fix it when it breaks.
But complex systems don’t yield to analysis. We have to add another skill: sense-making.
Complex systems have parts that learn and change, with relations that vary with state and history. They respond to and influence their environment. They’re always changing, so even if you omnisciently knew how one worked at a given moment, you’d be wrong the next.
Complex systems include families, teams, and distributed software systems under development.
Fortunately, we all have the skill of sense-making. It’s what we do as humans, as people. We form ideas of how our communities work, how each other works, how our social and political systems work. We form ideas of what all that means for us. And we don’t do this by knowing everything about them, but by noticing things and discussing them with each other. We do sense-making naturally in order to exist as part of the world.
Analysis, by contrast, is a skill we learned on purpose. Analysis is rigor. It gives us mastery. It promises control. It’s appropriate for parts of our system: I can totally analyze a program. Then I can change that program to do what I want. This feels good.
But once I hook my program up to software written by a bunch of other people, or put it in front of users who do who-knows-what with browser plugins, most bets are off. Now we’re in the realm of complex systems.
We need sense-making. Sense-making involves coming up with a theory, asking questions to investigate it, and getting something better than answers. The sense-making circle gives you better questions. Narrow in on what matters to you and get a good-enough understanding of that.
A good-enough understanding lets us influence our complex system to our liking. We can move it in the direction we hope for. There is no control.
Some scientists do sense-making very explicitly. I like Brené Brown’s research into what deeply happy people have in common. This is a qualitative, evidence-based method for theory formation and investigation.
What does it mean to do sense-making in teams?
Maybe as a manager, I notice that people’s voices in standup sound duller than they used to. Are they getting tired? I ask them. José says yes, his 2-year-old is sick. Marina says not really; she was just more excited by the last feature than this one they’re implementing. Xiaoping agrees: adding access controls isn’t fun. It makes everything harder for the user, and it feels so arbitrary.
Hmm, maybe the team doesn’t feel good about what they’re implementing. New question: what feels arbitrary? Everyone chimes in with an example. For one: it takes one access level to add an item, but a higher one to change it—what if a user makes a typo?
Now I have detailed questions for the product and design people. They respond, “Wow, good point, let’s meet with the whole team about these.” Soon we have a better product and a team that’s invested in the work.
What does it mean to do sense-making in software? Every incident response offers an example.
Readable code & observability
Early in my career, I learned to write readable, testable code. Readable code is amenable to analysis. You can look at it carefully and predict what it will do, then verify that understanding with a unit test. That understanding keeps our programs malleable.
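As a small illustration of that read, predict, verify loop, here is a sketch in Python. The function cart_total and its price data are hypothetical, invented for this example.

```python
from decimal import Decimal


def cart_total(prices, quantity_by_item):
    """Sum price * quantity for each item in the cart."""
    return sum(
        (prices[item] * quantity for item, quantity in quantity_by_item.items()),
        Decimal("0"),
    )


def test_cart_total():
    # Reading the code, I predict 2 * 1.50 + 1 * 4.00 = 7.00.
    prices = {"apple": Decimal("1.50"), "bread": Decimal("4.00")}
    assert cart_total(prices, {"apple": 2, "bread": 1}) == Decimal("7.00")


test_cart_total()
```

The prediction lives in the comment, and the assertion checks it. That is analysis: the whole behavior is visible on the page.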
These days, I am careful to write observable code, leading to observable software. Observable software is amenable to sense-making. It explains itself in traces, and lets us see what happens at scale in production. Add arbitrary queries, and we get the sense-making circle of forming new questions.
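To make "observable code" a little more tangible, here is a minimal sketch using only the Python standard library. It is a stand-in, not a real tracing setup: in practice you would likely use a tracing library rather than the hypothetical span helper and in-memory SPANS list below, and get_price with its fixed price is invented for illustration.

```python
import time
from contextlib import contextmanager

SPANS = []  # hypothetical stand-in for a tracing backend


@contextmanager
def span(name, trace_id, **attributes):
    """Record how long the wrapped block took, plus any context attached to it."""
    start = time.perf_counter()
    try:
        yield attributes
    finally:
        duration_ms = (time.perf_counter() - start) * 1000
        SPANS.append({"name": name, "trace_id": trace_id,
                      "duration_ms": duration_ms, **attributes})


def get_price(item_id, trace_id):
    with span("GET /price", trace_id, item_id=item_id) as attrs:
        price = 4.00  # placeholder for a real lookup
        attrs["cache_hit"] = False  # fields added mid-span are recorded too
        return price
```

Each call leaves behind a record of what happened and how long it took, with whatever context the code chose to attach. Those records are what later questions get asked of.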
That needs a concrete example. Here’s one: I expect my services to respond to a /cart request reliably and in a reasonable time. I can ask whether that happens with a heatmap, and find out: sometimes. Then I can ask, “Why is it slow sometimes?” I can click on a slow dot to get a trace. The trace tells me which part takes the most time: a span called GET /price. Now I ask, “Is that normal?” and aggregate all the GET /price spans, and I find a jump in latency a few hours ago. Now I know where to go: the GET /price code in the pricing service, and its recent release history. Once I’m in the code, my analysis skills come out.
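The walk from heatmap to trace to aggregate can be sketched as two small queries over span data. This is a toy under stated assumptions: SAMPLE_SPANS is invented data, GET /inventory is a made-up sibling span, and the helper names slowest_child and median_by_hour are hypothetical. An observability tool would run the equivalent queries for you.

```python
from collections import defaultdict
from statistics import median

# Invented sample data: (trace_id, span name, hour of day, duration in ms).
SAMPLE_SPANS = [
    ("t1", "GET /cart", 9, 120), ("t1", "GET /price", 9, 40), ("t1", "GET /inventory", 9, 50),
    ("t2", "GET /cart", 10, 130), ("t2", "GET /price", 10, 45), ("t2", "GET /inventory", 10, 55),
    ("t3", "GET /cart", 11, 900), ("t3", "GET /price", 11, 800), ("t3", "GET /inventory", 11, 50),
    ("t4", "GET /cart", 11, 950), ("t4", "GET /price", 11, 850), ("t4", "GET /inventory", 11, 60),
]


def slowest_child(trace_id, root="GET /cart"):
    """Question 1: within one slow trace, which part takes the most time?"""
    children = [s for s in SAMPLE_SPANS if s[0] == trace_id and s[1] != root]
    return max(children, key=lambda s: s[3])[1]


def median_by_hour(name):
    """Question 2: is that normal? Aggregate every span with this name by hour."""
    durations = defaultdict(list)
    for _, span_name, hour, duration_ms in SAMPLE_SPANS:
        if span_name == name:
            durations[hour].append(duration_ms)
    return {hour: median(values) for hour, values in sorted(durations.items())}


print(slowest_child("t3"))
print(median_by_hour("GET /price"))
```

The first answer names the span to look at. The second shows when its latency changed, which is the cue to go read the release history.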
We need both: analysis and sense-making
Analysis skills are essential for software development. We work on those, and that’s good, because they don’t occur naturally in humans.
Working in production software, we also need sense-making skills. Fortunately, people have those. We can work at exercising them more explicitly.
And then there’s software! We’ve made modern programming languages and APIs more amenable to analysis: more declarative, more readable, more testable. Software can’t do sense-making. But it can help us with ours, when we make it more observable.