Testing In Production
Testing in production has gotten a bad rap — despite the fact that we all do it, all the time.
This is probably because we associate it with skipping the testing that should happen before production: continuous integration, unit tests, functional tests. We hear “testing in production” and think lack of caution, carelessness, poor engineering.
But in reality, every deploy is a test. In production. Every user doing something to your site is a test in production. Increasing scale and changing traffic patterns are a test. In production.
It’s good to knock out all the low-hanging fruit we can before bugs reach our users. We should never, ever stop doing this. But here are some things to consider about testing in production.
You already do it
There are lots of things you already test in prod because it’s the only way you can test them. You can test subcomponents of varying sizes and types, in various ways and with lots of different edge cases. You can even capture-replay smaller systems or shadow components of prod traffic — those are the gold standards of systems testing. But many systems are too big, complex, and cost-prohibitive to clone. Most have user traffic that’s too unpredictable to mock.
Imagine trying to spin up a copy of Facebook for testing (with its what, 8 globally distributed datacenters?). Imagine trying to spin up a copy of the national power grid. Ok, say cost is no object and you’ve done that. Now try to generate the traffic patterns and all the clients you’d need to … you know what, never mind.
And even if you could … you still can’t predict tomorrow’s ice storm or traffic patterns from Antarctica or some other novel, chaotic entrant. It’s a fool’s game to try.
So does everyone else
You can’t spin up a copy of Facebook. You can’t spin up a copy of the national power grid. Some things just aren’t amenable to cloning. And that’s fine.
You just can’t usefully mimic the qualities of size and chaos that tease out the long thin tail of bugs or behavior.
And you shouldn’t try.
It’s probably fine
There’s a lot of value in testing, up to a point. But if you can catch 80-90% of the bugs with 10-20% of the effort (and you can!), the rest of that effort is more usefully poured into making your systems resilient than into preventing failure.
You should actually be practicing failure regularly. Ideally, everyone who has access to production knows how to do a deploy and rollback, or how to get to a known-good state fast. Everyone should know what a normally-operating system looks like, and how to debug basic problems. This should not be a rare occurrence.
If you test in production, it won’t be. I’m talking about things like “Does this have a memory leak?” Maybe run it as a canary on five hosts overnight and see. Or “Does this functionality work as planned?” At some point, just ship it behind a feature flag so only certain users can exercise it. Stuff like that. Ship it and see.
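Here is a minimal sketch of what “ship it with a feature flag so only certain users can exercise it” might look like. Everything in it is an assumption made for illustration: the names `bucket` and `is_enabled`, the flag name, the allowlist, and the choice of hashing user ids into 100 buckets. A real flag service will do much more than this.

```python
import hashlib


def bucket(flag_name: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0..99.

    Hashing the flag name together with the user id means the same user
    lands in different buckets for different flags, so one unlucky group
    of users is not the guinea pig for every new feature.
    """
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(
    flag_name: str,
    user_id: str,
    rollout_percent: int,
    allowlist: frozenset = frozenset(),
) -> bool:
    """Return True if this user should see the flagged code path.

    allowlist is a hypothetical set of user ids (staff, beta testers)
    who always get the feature, regardless of the rollout percentage.
    """
    if user_id in allowlist:
        return True
    # Users whose bucket falls below the rollout percentage are in.
    return bucket(flag_name, user_id) < rollout_percent
```

Because the bucket is derived from a hash rather than a random draw, a user who sees the feature at 10% still sees it at 50%. Turning the percentage down to 0 is the rollback.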
You’ve got bigger problems
You’re shipping code every day and causing self-inflicted damage on the regular, and you can’t tell what it’s doing before, during or after. Breaking shit isn’t the problem; you can break things safely. The part that’s not ok is being unable to tell what your code is doing. That bigger problem can be addressed by:
- Canarying. Automated canarying. Automated canarying in graduated levels with automatic promotion. Multiple canaries in simultaneous flight!
- Making your deploys more automated and robust, and faster in general (a 5-minute upper bound is good).
- Making rollbacks wicked fast and reliable.
- Instrumentation, observability, early warning signs for staged canaries. End to end health checks of key endpoints.
- Choosing good defaults. Feature flags. Developer tooling.
- Educating, sharing best practices, standardizing practices, making the easy/fast way the right way.
- Taking as much code and as many backend components as possible out of the critical path. Limiting the blast radius of any given user or change.
- Exploring production, verifying that the expected changes are what actually happened, understanding what normal looks like.
These things are all a great use of your time. Do those things.
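To make “automated canarying in graduated levels with automatic promotion” a bit more concrete, here is a rough sketch. It assumes you can supply a function that reports the error rate observed while the new build serves a given percentage of traffic. The names `run_graduated_canary` and `StageResult`, the stage percentages, and the tolerance are all hypothetical, and a real pipeline would watch far more than a single error rate.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class StageResult:
    percent: int        # share of traffic on the new build during this stage
    error_rate: float   # error rate observed during this stage
    healthy: bool       # True if the stage stayed within tolerance


def run_graduated_canary(
    stages: Sequence[int],
    measure_error_rate: Callable[[int], float],
    baseline_error_rate: float,
    tolerance: float = 0.01,
) -> Tuple[bool, List[StageResult]]:
    """Walk the new build through increasing traffic percentages.

    measure_error_rate(percent) is a caller-supplied hook (an assumption
    of this sketch) that shifts `percent` of traffic to the new build,
    waits, and returns the observed error rate.

    Returns (promoted, results). Promotion stops at the first unhealthy
    stage, and it is then up to the caller to roll back.
    """
    results: List[StageResult] = []
    for percent in stages:
        rate = measure_error_rate(percent)
        healthy = rate <= baseline_error_rate + tolerance
        results.append(StageResult(percent, rate, healthy))
        if not healthy:
            return False, results
    return True, results
```

A caller might pass stages like `[1, 5, 25, 100]` and wire `measure_error_rate` to whatever metrics system it already has. The useful property is that a bad build gets stopped while it is only annoying a small slice of users.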
Chaos and failure are your friend
Release engineering is a systematically under-invested-in skill set at companies with >50 people. Your deploys are the cause of nearly all your failures, because they inject chaos into your system. Having a staging copy of production is not going to do much to change that (and it adds a large category of problems colloquially known as “it looked just like production soooo I just dropped that table …”).
Embrace failure. The question is not “if” you will fail. It is when you will fail, whether you will notice, and whether it will annoy all your users because the entire site is down or only a few users until you fix it at your leisure the next morning.
Lean into it. It’s probably fine.