Conference Talk

Testing is Not the Goal

 

Transcript

Rob Meaney [Director of Engineering|Glofox]: 

Hi. My name is Rob Meaney. I’m Director of Engineering with Glofox in Ireland. Today I’m going to talk about Testing is Not the Goal. Who is the smug kid on the right, you ask? That’s my son, Adam. A couple of years ago, while we were in England visiting Legoland, he was the richest person in our family. Why? Visa had a European outage. We couldn’t withdraw cash. Adam, with 45 pound sterling in his little velcro wallet was now our chief holiday financier. For a moment, I started to panic as I tried to figure out how to pay for food, transport, and accommodation in a foreign country without access to our bank accounts.     

It was a situation that made me feel acutely aware of how dependent we’ve become on software systems and of how utterly vulnerable and helpless it can make us feel when they don’t work. But it also made me think about the teams of engineers that were scrambling in the background to resolve the issue. It was like being there myself. I could almost feel the stress, the pressure, and the subsequent shame that they would have to endure. Within minutes, the outage was plastered all across social media, and within hours, it was all over mainstream news. I could almost hear the blame-laden questions being asked by senior management. How could this happen? Why didn’t you test this?

What I’ve learned is that if we don’t manage the complexity of our systems, they will ultimately fail, and those failures have consequences. There are real costs to software failure, both commercial and human. So the question is: How can we deliver value to our customers quickly, reliably, and sustainably when we’re working in this increasingly complex world where everything is connected, change is continuous, and failure can be catastrophic? Even though Glofox is a relatively small company, we find ourselves working with a hugely complex distributed system, with dozens of interconnected services that are constantly changing and evolving, deployed to a cloud platform, in our case AWS, that we can’t fully rely upon, because all cloud providers have outages, and integrating against third-party services whose APIs and responsiveness can change at a moment’s notice.

When I started in the software industry about 15 years ago, we had these all-encompassing big-bang releases. When it went wrong, it really went wrong. Everybody knew. Now we have a lot of releases with little failures that may or may not be noticed. We have to assume our systems are in a constant state of partial failure. This scary new world presents a whole new set of challenges and risks, and, consequently, it requires a new approach to deal with these risks. We need to unlearn old ideas and embrace a new reality in this complex, new world. The truth is our job is no longer finished when we deploy to production. It’s just beginning. This is when we really start to learn about our systems and how they actually behave as opposed to how we imagine they behave.

Tests are not reality. They are simply our best guess. We need to accept that we can’t anticipate and test all the ways systems can fail. They’re infinite. No amount of testing can tell us what will happen in production. We need to shift from imagining how our systems behave to understanding how they actually behave. We need to accept that failure is inevitable in these complex systems. And in the face of this reality, we need to rebalance our efforts: not just on prevention, as has been the tradition, but on mitigation and recovery, so teams can move safely with justifiable confidence. We need to build antifragile systems.

5:07

Today’s high-performance teams have learned to deliberately design testability and operability into their systems from the very, very start. I’ve learned that managing risk effectively is the key to delivering value to our customers quickly, reliably, and sustainably. A focus on testability allows us to manage the risks that we can imagine while a focus on operability allows us to deal with the realities. Today, I’m going to talk about the journey that has led me to these conclusions and explore the patterns that have allowed us to achieve speed and stability over the years.   

We’ll talk about the associated insights related to testability, operability, and then observability and how they’ve allowed our teams to not just survive but thrive in this complex, new world. So my journey begins as a software automation engineer, I guess, about seven or eight years ago. We’re going to start with testability. Forgive me for the shameless plug here for the book I co-authored with Ash Winter. I had just joined a U.S. hardware company that was building out a brand new software engineering department in Cork, Ireland, where I live. As part of our onboarding, the whole engineering department was asked to perform regression testing for a release.

After seven weeks, yes, seven weeks, of torturous regression testing by the entire engineering department, the release limped out the door. Shortly after, one of the developers, who was acting as an architect in the office, approached me and asked me one simple, very profound question: How can we make this software easier to test?

We explored the testing challenges we had faced, and I came up with a model I call the CODS model. It reminds us of four testability attributes to consider when designing or architecting software systems. C is for controllability, the ability to identify and control the variables that influence system behavior in each important state. O is for observability, the ability to observe and understand everything important that’s happening in the system. D is for decomposability, the ability to decompose the system into individually testable components. And S is for simplicity, being able to easily understand the system.
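To make the C concrete, here is a minimal sketch, not from the talk, of what designing for controllability can look like. The names are hypothetical: time is injected as a dependency so a test can put the system into any important state on demand instead of waiting for it to occur.

```python
# Hypothetical example: "now" is a controllable input rather than a hidden
# dependency, so tests can drive time-dependent behavior directly.
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class TrialAccount:
    started_at: datetime
    trial_days: int = 14

    def is_expired(self, clock=datetime.utcnow) -> bool:
        # The clock is injected; production uses the real clock, tests pass
        # a fixed one.
        return clock() - self.started_at > timedelta(days=self.trial_days)


def test_trial_expires_after_fourteen_days():
    start = datetime(2021, 1, 1)
    account = TrialAccount(started_at=start)
    fixed_now = lambda: start + timedelta(days=15)  # controlled "now"
    assert account.is_expired(clock=fixed_now)
```

The same idea applies to any variable that influences behavior, feature flags, random seeds, responses from external services: once they are injectable, the important states become cheap to reach and check.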

After a learning curve, we began to see truly remarkable results. Let me explore the impact it had. In my experience, a well-tested system is both checked and explored. We check to confirm our existing beliefs about the software system and its behavior, the knowns: the things the software does because we developed it to do them. And we explore to reveal the unknowns: the problems we couldn’t or didn’t anticipate, which only become obvious through interacting with the software. By focusing on testability, we improve both the automatability and the explorability of the software, that is, the ability to uncover the unexpected and to reveal new information that we didn’t already know.

So what’s explorability? Explorability is the ease with which we can run fast, cheap, safe-to-fail experiments. Exploration is almost the complete opposite of automation. Exploration is the most effective means of managing situations of great uncertainty or complexity, and there’s always uncertainty in new software systems. It’s a very human process centered on learning that requires people to use their intuition, skills, and experience to refine their mental models of a system. Making the software more explorable accelerates the speed at which we can gain a deeper understanding of how the system behaves and why.

John Boyd’s OODA Loop is a great way to frame this kind of exploration. We perform small, safe experiments to learn how to best adapt to situations of complexity. The faster we can iterate through the loop, observe, orient, decide, and act, the faster we can learn and adapt to complex situations. These ideas can be applied to things like finding product-market fit, testing complex distributed systems, and even debugging production issues.

10:18

Going back to my story, before testability, we were at the bottom of this diagram. We had a poor understanding of the expected behavior of our software. We spent most of our time either manually checking for expected behavior or fixing flaky test automation. We spent almost no time exploring for unexpected risks, which meant lots of problems were going undetected or being detected very late.

By focusing on testability, our automation coverage became far deeper and more robust, running thousands of checks in seconds. It became excellent and provided almost immediate feedback on every change. This foundation freed our testers from having to check expected behavior so they could now explore for unexpected behaviors and risks. We began doing valuable testing that we never had time to do before. We began to do deep exploratory, performance, and resilience testing. We found new and interesting types of bugs.

The speed, depth, and breadth of our testing was incomparable to before. The automation provided change detection, and failures invited exploration. Exploration drove understanding of both the expected and unexpected behaviors. This combination of fast, reliable automation and deep, skilled exploration allowed us to minimize the time and effort in achieving shippable quality for each new change. Or, in other words, to minimize the transaction cost of getting changes to production.

What did this achieve? Over the following 12 months, it was remarkable. The design approach was adopted by teams in both the U.S. and Irish offices. The speed of delivery and quality of the product improved dramatically. Releases went from seven weeks of testing to two to three days of exploratory testing, and the teams were able to work at a sustainable pace and were prouder of the work they were doing.

After this experience, I had a profound realization. Significant quality improvements were not due to testing. They were about building relationships and influencing people at the right time to build quality in. So this brings us to the operability leg of my journey. I started working with an early-stage start-up with 20 people in total, maybe six software engineers. We were a business-to-business SaaS company building fraud detection. And the product was beginning to gain real traction with some of the biggest retailers in the world.

It was a new context for me with a new set of challenges. It was the first time I had worked on a SaaS product. I learned that because our service was on a customer’s cart page, it was critical that our platform be available, responsive, and scalable to meet customer needs. But building out robust automation and doing exploratory testing was only a starting point in this context. It told me almost nothing about the availability, responsiveness, and scalability of the system in production. I soon found out that no matter how much testing I did pre-production, we still experienced problems when we deployed changes.

I realized that we were no longer simply testing code but testing complex systems composed of real users and real data on real infrastructure. We simply couldn’t recreate the variety, volume, or complexity of production traffic in our test environments. The reality is, with these complex distributed systems, that if you’re not testing in production, then it’s your customers that are actually testing in production for you.

14:42

We needed to test where our greatest risk lived, which was production. I reframed CODS to focus on operability, to make it safe to test and deploy in production. We did this by, firstly, C, controlling risk exposure. We began doing blue-green deploys, which allowed us to revert a deploy quickly and easily if we detected an issue, with minimal impact to our customers.

O, observing system behavior: we added instrumentation to visualize our critical customer pain points, error rate and response time, and the whole team monitored the system during releases. D, decomposing deployments: each deployment was limited to a single change so we could quickly isolate the source of a problem if an issue was released. And S, simplifying the deployment process.

We could roll back with a single click of a button, and we created runbooks to help debug issues in production. This hugely accelerated the rate at which we could ship while also reducing the negative impact on our customers. It opened my eyes to testing in production and managing risk with operability. Here are some examples of techniques we used in that company when designing for operability, using the CODS model for risk. We used blue/green deploys, canaries, toggles, et cetera. We used things like alerts and tagging, tests, monitoring, logging, and tracing. We simplified so we could roll back with one click and monitor error spikes as well.
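As a rough illustration of that combination of blue/green deploys and one-click rollback, here is a toy Python sketch. The class, versions, and routing mechanism are invented for this example and are not the actual tooling from that company.

```python
# Toy blue/green switch: traffic goes to whichever environment the "live"
# pointer names, so reverting a deploy is just flipping the pointer back.
class BlueGreenRouter:
    def __init__(self):
        self.environments = {"blue": "v1.4.2", "green": None}
        self.live = "blue"

    def deploy(self, version: str) -> str:
        idle = "green" if self.live == "blue" else "blue"
        self.environments[idle] = version      # release to the idle environment
        previous, self.live = self.live, idle  # then switch traffic over
        return previous                        # remember where to roll back to

    def rollback(self, previous: str) -> None:
        # Reverting is a single pointer flip; the old version is still running.
        self.live = previous


router = BlueGreenRouter()
previous = router.deploy("v1.5.0")  # green goes live with the new version
# ... error rate spikes while the team is monitoring the release ...
router.rollback(previous)           # blue, still on v1.4.2, takes traffic again
```

The point is the shape of the control: the riskiest step, switching traffic, is a single reversible operation, which is what makes reverting quick and low-impact when an issue is detected.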

The lesson I learned is that, when it comes to operability, you can’t just test the production code; you also need to test the infrastructure and the tooling that support operability. You need to test that your infrastructure and tooling work when you need them most, when things go wrong. This brings us to the third and final leg of my journey, observability. I think the picture perfectly captures how many companies deal with observability.

My last role, in the previous company I worked with, was quality engineering manager with a company called Poppulo. I worked with all the engineering teams to identify and achieve their quality goals. As I said, I kind of worked as a coach. Having seen the power of designing systems with a focus on testability and operability, I shifted my focus from testing to coaching teams to build quality into the product from the start. As a coach, one pattern I identified was that, of the four attributes, observability is the first and most important thing teams need to focus on when I begin working with them.

Why? Observability allows the team to get a true understanding of how healthy their system currently is, how their system actually behaves as opposed to how they imagine it behaves. It allows them to get an understanding of the reality of what their users are experiencing. Lastly, it allows them to identify where they need to focus their quality efforts. When working with teams, I start with very simple questions. How would you know if your system wasn’t healthy? How would you know if your users were having a bad experience? In the event there was a problem, how would you isolate the cause?

This simple technique gets teams thinking about what aspects or attributes of their system are really important. It gets them identifying and looking at health indicators and using things like health objectives. It lets them know what normal system behavior looks like so they can identify abnormal system behavior. At Glofox, we began looking at these basic health indicators holistically. Starting at the infrastructure and moving up, we’ve introduced Honeycomb. It’s really gained traction across our teams.     
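One way to turn those questions into something a team can check continuously is a small health evaluation like the sketch below. The indicator names and thresholds here are invented for illustration; they are not Glofox’s actual objectives.

```python
# Hypothetical health check: compare a few basic indicators against explicit
# objectives so "unhealthy" is a defined state rather than a gut feeling.
HEALTH_OBJECTIVES = {
    "error_rate": 0.01,          # no more than 1% of requests failing
    "p95_latency_seconds": 0.5,  # 95% of requests under half a second
    "queue_depth": 1000,         # background work should not pile up
}


def unhealthy_indicators(current: dict) -> dict:
    """Return the indicators that are currently outside their objective."""
    return {
        name: value
        for name, value in current.items()
        if name in HEALTH_OBJECTIVES and value > HEALTH_OBJECTIVES[name]
    }


# Example: error rate and latency look normal, but the queue is backing up.
print(unhealthy_indicators(
    {"error_rate": 0.002, "p95_latency_seconds": 0.31, "queue_depth": 4200}
))  # -> {'queue_depth': 4200}
```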

Going back to observability, this may seem familiar: a well-observed system is both checked and explored. We check the system for known failure scenarios, the knowns, symptoms we know are indicative of customer pain. We explore the system for unexpected behavior, the unknowns; we explore the data looking for unusual patterns. We check for symptoms, indicators of customer pain. There’s a finite number of symptoms that tell us customers are experiencing pain, and we explore to understand the source or the cause of that pain. For example, when we go to the doctor, we don’t list all possible causes; we list our symptoms.

20:04

By focusing on observability, we improve both the monitorability, the ability to detect customer pain, and the explorability, the ability to understand system behavior, or the cause of the pain, of our production systems. Previously, in Poppulo, we had low observability. We wasted most of our time in a sea of noisy, unreliable alerts, and creating hundreds of dashboards for everything we could think of. But, more often, it was the customers reporting issues, not the alerts. Even when the alerts did fire for general system outages, they only told us we had a problem, and it was very difficult to isolate the cause of the issue.

Alerts were now focused on identifying the symptoms of customer pain, not alerting on the cause. When they fired, we knew we needed to act immediately, getting an understanding of system behavior and slicing and dicing the data to get to the cause. Teams started curiously exploring their production data, identifying interesting anomalies and, in some cases, longstanding undetected bugs. In these cases, teams were able to locate outages and correct problems before customers were impacted or even aware of the issue. Great monitoring drove action, allowing us to quickly detect production issues, while great explorability allowed us to understand the cause of the problem through skilled debugging. This combination minimized the effort in detecting and resolving production issues.
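Here is a toy example of that slicing and dicing step, with made-up events rather than real telemetry: once a symptom alert fires, breaking the failing requests down by different attributes often makes the skewed dimension, and therefore the likely cause, stand out.

```python
# Toy exploration: group the failing events by attribute and look for skew.
from collections import Counter

failed_requests = [
    {"endpoint": "/checkout", "region": "eu-west-1", "build": "v1.5.0"},
    {"endpoint": "/checkout", "region": "eu-west-1", "build": "v1.5.0"},
    {"endpoint": "/search",   "region": "eu-west-1", "build": "v1.4.2"},
    {"endpoint": "/checkout", "region": "us-east-1", "build": "v1.5.0"},
]

for attribute in ("endpoint", "region", "build"):
    breakdown = Counter(event[attribute] for event in failed_requests)
    print(attribute, breakdown.most_common())

# The dimension with the strongest skew (here, build v1.5.0) tells the
# on-call engineer where to dig next.
```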

About the same time that I began working with our teams on observability, we adopted the metrics from Accelerate. For those of you who don’t know it, it’s a fabulous book by Nicole Forsgren. What I noticed was that there was an interesting relationship between the Accelerate metrics. Low-performing teams were both slow and unstable. When they had production issues, the impact was huge on the team, the business, and our customers.
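For reference, the four key metrics from Accelerate are deployment frequency, lead time for changes, change failure rate, and time to restore service. Here is a hedged sketch, with toy data and made-up field names, of how a team might compute them from a simple deployment log.

```python
# Toy calculation of the four Accelerate metrics from a made-up deployment log.
from datetime import datetime, timedelta

deployments = [
    {"committed": datetime(2021, 3, 1, 9), "deployed": datetime(2021, 3, 1, 11),
     "failed": False, "restored": None},
    {"committed": datetime(2021, 3, 2, 10), "deployed": datetime(2021, 3, 2, 13),
     "failed": True, "restored": datetime(2021, 3, 2, 13, 20)},
    {"committed": datetime(2021, 3, 3, 9), "deployed": datetime(2021, 3, 3, 10),
     "failed": False, "restored": None},
]

days_observed = 7
deployment_frequency = len(deployments) / days_observed           # deploys per day
lead_time = sum((d["deployed"] - d["committed"] for d in deployments),
                timedelta()) / len(deployments)                   # commit -> production
failures = [d for d in deployments if d["failed"]]
change_failure_rate = len(failures) / len(deployments)            # share of bad deploys
time_to_restore = sum((d["restored"] - d["deployed"] for d in failures),
                      timedelta()) / len(failures)                # failure -> recovery

print(deployment_frequency, lead_time, change_failure_rate, time_to_restore)
```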

By improving testability, teams made more informed decisions about the risks they could imagine or anticipate, reducing the likelihood of introducing a failure and allowing them to move quickly with confidence. Improving operability, meanwhile, allowed teams to deal with the realities they didn’t or couldn’t anticipate. Remember, failure is inevitable, so operability allowed them to reduce the impact of failure when it did occur.

The combination of high testability and operability provided a hugely powerful mechanism for managing complexity and the associated risk. So, again, testability reduces the likelihood of failure while operability reduces the impact of failure. It allows teams to produce quality software at break-neck speeds. Focusing on testability drove throughput in terms of the Accelerate metrics. It ultimately minimized the time and effort to ship changes to production.    

Operability, on the other hand, drove stability in production, in terms of reducing change failure rate and mean time to recover, which ultimately minimized the time and effort our team spent dealing with quality problems in production. These, combined, allowed us to optimize flow by accelerating the delivery of value and minimizing the cost. So what have I learned from these experiences?

When we deliberately design and evolve our systems to make them easier to build, test, and operate, we can easily automate the predictable, boring stuff, because computers are faster and more reliable at doing predictable, repeatable, mundane tasks, the painful stuff that people hate doing. And we can easily explore the unexpected, interesting stuff, because people are creative problem solvers. We love observing complexity because that’s how we learn new and interesting things.

We can concentrate, ultimately, our efforts on building the software people love. How can we make this happen? That’s the next obvious question. Okay? We need to cultivate an environment where teams, and when I say teams I mean everybody involved from concept to customer, are empowered to work together to seek problems, solve problems, and share their lessons. Because if there’s one thing I’ve learned in the last eight years or so, working in this complex, new world, is when you optimize for a great engineering experience, you get happy, high performing teams, and a successful business as a second-order effect. Thank you very much.

If you see any typos in this text or have any questions, reach out to team@honeycomb.io.
