“But it worked in staging!” is the new “But it works on my machine!” Docker, Kubernetes, and friends have made it easier to create similar environments, but it is still impossible to create environments that are the same as production. Erwin goes over lessons learned and what is needed to be able to successfully kill your staging environment. SPOILER WARNING: One of those things is proper observability.
Erwin van der Koogh [Product Manager|Cloudflare]:
Hey, everyone. My name is Erwin van der Koogh. I’m a Product Manager at Cloudflare these days and welcome to my talk, How to Kill Your Staging Environment.
It’s based on how we tested in production. If you’ve seen any of my other talks, there are usually pretty photos and lots of pop culture references. This one is going to be slightly different. That’s because I’m basically just going to talk about our journey with Linc, which is a startup I ran that was recently acquired by Cloudflare. I will talk about it a little bit later. It’s basically a CI/CD tool for front end applications.
Now, everything I’m going to talk about, the most important thing is this may work for you or it may not. This is not polished. This is not finished. This is our story.
Have you ever said, “But it works in staging”? Basically that’s the new, “But it works on my machine.” It’s useless. It’s as useless if it works in staging as if it worked on our machine.
Have you ever had issues reproducing an error? Of course you have. We all have. It happens all the time where there’s this weird edge case. A customer has this problem, and no other customer has it, and you, for the life of you, can’t figure out where this thing is going wrong. Staging environments don’t help. They’re not very useful.
The other warning I want to give you is to not freak out. Right? I’m not making you do this. You don’t have to do any of this. This is our story. This is what we did. Take it and try stuff that you think is useful and acceptable in your environment.
The first thing I want to talk about is the high-level overview, the TL;DR, of how we got to our maturity in testing. And that starts with what the application looked like.
We ended up building an event driven architecture with microservices on a bed of idempotency. And that’s going to be important later when we talk about why we did the things that we did.
Testing in production is incredibly easy if you don’t have any customers. Right? If you don’t have any customers, no one cares whether your production is up or down or if it doesn’t work.
That’s what we did for a while. Terribly easy. It’s also relatively easy if you have no paying customers. And again, we were heavily testing things in production.
One of the major reasons we were happy to keep testing in production, even with paying customers on there, was, again, the idempotency. In the event of a bug, we could replay all the events that happened in the past 10, 15, or 20 minutes, and everything would sort itself out automatically.
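To illustrate why idempotency makes replaying safe, here is a minimal sketch. It is not Linc’s actual code: the `Event` fields, the `BuildStore` class, and the in-memory dictionary standing in for a database are all assumptions made for the example. The point is that a handler which writes by event ID leaves the same state no matter how often the same event is processed.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    event_id: str  # hypothetical unique ID carried by every event
    project: str
    commit: str


@dataclass
class BuildStore:
    """In-memory stand-in for a database table keyed by event ID."""
    builds: dict = field(default_factory=dict)

    def handle(self, event: Event) -> None:
        # Writing by key makes the handler idempotent: processing the
        # same event twice leaves the store in the same state.
        self.builds[event.event_id] = (event.project, event.commit)


def replay(store: BuildStore, events: list) -> None:
    """Re-run a window of past events, for example after deploying a fix."""
    for event in events:
        store.handle(event)


events = [Event("e1", "shop", "abc123"), Event("e2", "blog", "def456")]
store = BuildStore()
replay(store, events)
first_pass = dict(store.builds)
replay(store, events)  # replaying the same window again changes nothing
```

A handler that appended to a list or incremented a counter instead would not have this property, and a replay would duplicate work.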
Testing in production went well for a long time. But, eventually, we had too many customers and too many important customers, that we couldn’t really sustain it anymore. So what we then started to do was test locally, but with production events and production data.
That’s how it worked for a while. We would test locally, and our local machines were connected to a set of production databases, the production services.
And, eventually, even that became a bit too scrappy, and we ended up with the model where we could run multiple versions of the same component in production at the same time.
Different components could have different versions, and we would be able to pick which versions were used based on things such as customer, so we could turn things on and off for our personal events or for a particular customer we were working with. Or we could turn them on based on the plan. Our free users could be on version 23 where our paying customers were still on version 22. And the same was true for projects and different things.
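A rough sketch of that kind of version selection might look like the following. The override tables, the customer and project names, and the precedence order (customer first, then project, then plan) are assumptions for illustration; the talk does not say how the rules were ordered or stored.

```python
from typing import Optional

DEFAULT_VERSION = 22

# Hypothetical override tables; in practice these might live in a database.
CUSTOMER_OVERRIDES = {"linc-internal": 24}
PROJECT_OVERRIDES = {"acme/storefront": 23}
PLAN_OVERRIDES = {"free": 23}


def resolve_version(customer: str, plan: str, project: Optional[str] = None) -> int:
    """Return the component version to use, most specific rule first."""
    if customer in CUSTOMER_OVERRIDES:
        return CUSTOMER_OVERRIDES[customer]
    if project is not None and project in PROJECT_OVERRIDES:
        return PROJECT_OVERRIDES[project]
    if plan in PLAN_OVERRIDES:
        return PLAN_OVERRIDES[plan]
    return DEFAULT_VERSION
```

With these tables, a free user resolves to version 23 while a paying customer without overrides stays on version 22.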
The first thing people come up with is: Why? Why don’t you just have a staging environment?
One of the most important things for us was safety. We were building and deploying other people’s main way of making money into production. For us, this safety aspect was crucially important. That’s why reducing the blast radius for us was extremely enticing.
What we could now do is reduce the blast radius to often just ourselves or a handful of other customers, because that’s where it would go wrong, and there we would find the bug. And it would allow us to improve our mean time to repair. MTTR is the time between the error occurring and the error being fixed. Fixing was often just a simple rollback: we’d just go back to the old version. But even fixing forward was really easy because we had full confidence that we could iterate in production on that fix before releasing it to all of our customers.
We would, in the span of 10 to 15 minutes, which is what most of our resolutions came to, maybe push three, four, five versions to production, depending on how complicated something was.
Another massive reason for not having the staging environment is the cost and mainly the return on investment that you get. Staging environments are extremely expensive, not so much in the cost of the resources that you use but in the sheer amount of time and effort it takes to maintain your staging environment. Keeping it somewhat in sync with production is incredibly hard to do. It takes a lot of engineering time, time that people could spend doing other things. So cost is a major factor as well.
Again, having that ability to just try things in production without worrying about how other customers see this or what happens means that we can very quickly push something out and iterate on it in production until we’re happy with it, and then we can ship.
I talk to a lot of engineers about this topic, and being helpful engineers, the first thing they often want to know is: “But why didn’t you just automate your deployments?” That’s a great question. We did. We did fully automate all of our deployments. But, for us, that was for disaster recovery. If something bad happened, we could spin up a new production environment. That’s relatively easy to do, but building your automation so you can transfer between multiple different environments is much harder. Especially since we had hundreds and hundreds of entities running around in AWS.
There were lambdas, queues, subscriptions, tables, buckets, roles, load balancers, API gateway, clusters, tasks, task definitions, and many, many more things that AWS forces you to have. So that alone was a massive engineering task.
Also, data is just incredibly hard to do well. We were lucky that we didn’t have much privacy-sensitive data. But certainly, if you do have that, getting it synced between a production environment and staging is incredibly hard to do.
The next thing people talk about is: “Why don’t you just use feature toggles?” And this is an entire rant all by itself. Feature toggles are hard. They increase the complexity of your code and make things harder to test. I like to ask people who frequently use feature toggles for things like canary releases or quick experiments: well, do you test your code with both the feature toggle on and off? Some do and some don’t. And the follow-up question is always: do you test it with every permutation of every other feature toggle on and off? And the answer is, of course, no. It’s impossible to test that many different permutations.
And that gets to the crux of the issue with feature toggles. You need to have people be really aware of what you can and cannot do in a feature toggle. If you change global state, if you modify or mutate an object, if you access variables outside your scope: there are a bunch of things you can do to screw up things outside of the code that you’ve set aside.
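To put a number on that, n independent on/off toggles give 2^n configurations. The short sketch below enumerates them; the function names and toggle names are made up for the example.

```python
from itertools import product


def toggle_combinations(toggles):
    """Yield every on/off assignment for the given toggle names."""
    for states in product([False, True], repeat=len(toggles)):
        yield dict(zip(toggles, states))


def combination_count(n_toggles: int) -> int:
    """Number of distinct configurations for n independent boolean toggles."""
    return 2 ** n_toggles


three_toggles = list(toggle_combinations(["new_checkout", "dark_mode", "beta_search"]))
```

Three toggles already mean 8 configurations to test, and 20 toggles mean 1,048,576.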
Of course, there’s always a chance that you have a bug in your toggle, like in the “if” statement. Push one of those live, and all of your customers are in trouble. That’s where things go wrong. Lastly, there’s removing the feature toggle. Like, once it’s no longer useful, you need to remove it, which is another task you have to do, and another set of deployments you have to do. Feature toggles are great, and we use them a lot of times for long running experiments, but using feature toggles all the time didn’t work for us very well.
So how did we do it? As I mentioned before, we started with local scripts. We created an event. It ran the code that we had locally on our machines, fully connected to all of our production databases and APIs. And this worked really well. For a long, long time, this was more than sufficient. And again, this depended very much on the type of application we were building and the type of data that we had. It was great.
And then it was moving to that multiple versions in production and being more robust and more controlled. But as it grew organically, there was no one way we did this. It really depended on the different components that we had. I want to talk about four of those components to give you a good overview of the different types of tactics that we used.
First of all, there was the component that built your software, tested it, and uploaded it to be deployed later. This was a Docker container running on ECS. Basically, what we had were three different task definitions. A task definition is the AWS concept of a particular set of images, at particular versions, complete with their configuration.
That meant the software that was spinning up these builds could pick one of these three task definitions.
So we had the stable one, which is sort of the default one that you got, and this was based on the semver version of the application, so a stable release got a stable tag that was used by everyone. Then there was the beta tag, which was an unstable semver version, and that would be served up to our personal accounts and a few other customers that maybe were having issues with something or that we were trying to roll out a feature for. And, lastly, there was experimental. We could use that to give to particular customers to build bigger and newer features. And that’s sort of how it worked: the component was picked based on the customer.
Then there was the preview server. Linc allowed you to preview every commit with any configuration settings. Again, this was a crucial component of our architecture. So what we had there was, again, two different task definitions. They were Docker containers as well. So two different versions of the same thing running on the same cluster, and they just had different domain names.
We had the public one, which is production. Then we had a staging one, which was the closest we ever got to a staging environment, but it was still hooked up to all of our production environments.
The third component was the releaser. This is a component that would take a deployment artifact and deploy it wherever you wanted it to be deployed. This was done just by dynamic loading of the releaser component. There was an ID stored in the database and you could pick where you wanted things to be deployed. Based on that, this ID was picked. Publicly, there were five or six different versions. We had, at any given time, maybe 12 to 15 different possible releaser IDs that you could use, and we could just set them manually in the database to point to other different releasers or other different versions of said releaser.
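One way to picture ID-based loading is a registry that maps releaser IDs to implementations, with the ID for each project stored alongside it. This is only a sketch: the registry stands in for whatever dynamic loading mechanism was really used, and the IDs (`static-host`, `static-host-v2`), the project names, and the dictionary standing in for the database are all hypothetical.

```python
from typing import Callable, Dict

# Registry of available releasers, keyed by releaser ID.
RELEASERS: Dict[str, Callable[[str], str]] = {}


def releaser(releaser_id: str):
    """Decorator that registers a releaser implementation under an ID."""
    def register(fn: Callable[[str], str]) -> Callable[[str], str]:
        RELEASERS[releaser_id] = fn
        return fn
    return register


@releaser("static-host")
def release_v1(artifact: str) -> str:
    return f"v1 released {artifact}"


@releaser("static-host-v2")
def release_v2(artifact: str) -> str:
    # New iteration under test; the old code is left untouched.
    return f"v2 released {artifact}"


# Stand-in for the database column holding each project's releaser ID.
PROJECT_RELEASER = {"shop": "static-host", "blog": "static-host-v2"}


def release(project: str, artifact: str) -> str:
    """Look up the project's releaser ID and run that releaser."""
    releaser_id = PROJECT_RELEASER[project]
    return RELEASERS[releaser_id](artifact)
```

Pointing a single project at a new version is then a one-field change in the stored data, and promoting it for everyone is a matter of repointing the remaining IDs.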
Again, it made it easier to roll out new iterations and update existing iterations. We didn’t change anything in the old code, we just created a version two that we would be testing, and if everything was right, version two became the real version.
And lastly, one of the most complicated components, and the heart of Linc, was the release coordinator. This was the piece of software that would coordinate and figure out, based on a number of events, what should happen. It would listen to commits. It would listen to builds and deploys and test results. And based on that information it would figure out what version should be released.
And what we did there was build a router lambda. That was the lambda that would listen to all of these events, figure out which particular project, which particular application this event was for, and based on that, it would send it out to a different version of the lambda, a different version of the release coordinator. And again, that allowed us to very, very quickly iterate on this absolutely crucial piece of code without any worry that we would be screwing up anyone’s actual production deploys.
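The routing idea can be sketched as below. Plain Python functions stand in for the separately deployed coordinator versions, and the event shape, project names, and lookup table are assumptions for the example rather than Linc’s real data model. In a real deployment the router would invoke another Lambda function instead of calling a local function.

```python
def coordinator_v1(event: dict) -> dict:
    """Stand-in for the current release coordinator."""
    return {"handled_by": "v1", "type": event["type"], "project": event["project"]}


def coordinator_v2(event: dict) -> dict:
    """Stand-in for a new release coordinator version under test."""
    return {"handled_by": "v2", "type": event["type"], "project": event["project"]}


COORDINATORS = {"v1": coordinator_v1, "v2": coordinator_v2}
DEFAULT_COORDINATOR = "v1"

# Hypothetical per-project overrides; only these projects see the new version.
PROJECT_COORDINATOR = {"linc/own-site": "v2"}


def route(event: dict) -> dict:
    """Send every event type for a project to that project's coordinator version."""
    version = PROJECT_COORDINATOR.get(event["project"], DEFAULT_COORDINATOR)
    return COORDINATORS[version](event)
```

Because routing is by project rather than by event type, a project sees commits, builds, deploys, and test results all handled by the same coordinator version.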
So you aren’t scared away yet and may want to give this a go. What do you need? There are only three things you really need. The first is a mature culture with the right incentives. That was something we had, being a startup started by a bunch of developers. Our software development and release culture was very mature, but, also, the incentives were right.
The only thing that mattered for us was building a product that people wanted to pay a lot of money for. There were no silos. There was nothing getting in the way of us doing this. If you have a culture where there’s a strong testing department that has to sign off on everything that goes to production, none of this is going to work, right?
The next thing that you need is an architecture with multiple, small, individually deployable components. Obviously, that’s the whole point, right? Different versions of different components. You need something microservice-y.
Lastly, of course, otherwise I wouldn’t be here, observability. This is incredibly important. Without observability, nothing makes sense anymore. If your error rate spikes, is that a problem with the experiment I just deployed? Is something bad happening with one of our customers? What’s going on? On all of your dashboards, none of it makes sense anymore. You now need a dashboard for every version for every single thing to understand what you’re doing.
It’s crucial to have the ability to slice and dice your data across customers, plans, requests, users, projects, whatever it is that you need to do, you need to be able to go “give me the error rate per version.” Which particular customer was this request from that got me the error? Without observability, you’re flying blind and have no chance of making this work.
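As a small illustration of that kind of slicing, the sketch below computes an error rate grouped by any field on the event. The event fields (`version`, `customer`, `error`) and the sample data are assumptions; a real observability tool would run this kind of query over far richer events.

```python
from collections import defaultdict


def error_rate_by(events, key):
    """Return {value of key: fraction of events with error=True}."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for event in events:
        totals[event[key]] += 1
        if event["error"]:
            errors[event[key]] += 1
    return {value: errors[value] / totals[value] for value in totals}


# Made-up sample events for illustration only.
events = [
    {"version": 22, "customer": "acme", "error": False},
    {"version": 22, "customer": "globex", "error": False},
    {"version": 23, "customer": "acme", "error": True},
    {"version": 23, "customer": "initech", "error": False},
]

by_version = error_rate_by(events, "version")
by_customer = error_rate_by(events, "customer")
```

The same events answer both questions: which version is failing, and which customer is affected.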
Now, I like talking to developers about this. What about regulatory compliance, for example? I don’t know. If we had been in a regulated industry, I have no idea how we would have done this. I’ve been in those environments and maybe it’s possible, but it’s certainly not going to be easy convincing the rest of your organization and said regulators that what you’re doing is okay. The environment is changing, but if you’ve done it, let me know; I’m really interested.
What about QA and test automation? Yeah, it’s okay. Just do it in production. Make sure every person, every developer, everyone that wants to test this has an account on your production system. They probably do anyway. So test it there. If you want to do a full regression test, create a new account, walk through the entire thing, and then delete the account at the end of your test. There’s nothing specific that you need a staging environment for. You can do all of that in production.
What about database schema changes? I don’t know, really. We were using DynamoDB for most of it. We had zero problems with migrations that were forward or backward compatible, which was almost all of our changes. Only the few backward-incompatible changes that we had to make were difficult, but those were few and far between.
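The create, test, delete pattern can be wrapped so that cleanup always runs, even when the test fails. In this sketch `FakeAccounts` is an in-memory stand-in for a real account API, and the account naming scheme is made up.

```python
import uuid
from contextlib import contextmanager


class FakeAccounts:
    """In-memory stand-in for a production account API."""

    def __init__(self):
        self.accounts = set()

    def create(self, name: str) -> None:
        self.accounts.add(name)

    def delete(self, name: str) -> None:
        self.accounts.discard(name)


@contextmanager
def temporary_account(api):
    """Create a throwaway account and always delete it afterwards."""
    name = f"regression-{uuid.uuid4().hex[:8]}"
    api.create(name)
    try:
        yield name
    finally:
        api.delete(name)


api = FakeAccounts()
with temporary_account(api) as account:
    # The full regression walk-through would run here.
    existed_during_test = account in api.accounts
```

The `finally` clause is what keeps test accounts from piling up in production when a regression run aborts halfway.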
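With a store that does not enforce a schema on non-key attributes, a backward-compatible change can be as simple as a reader that tolerates older items. The sketch below assumes a hypothetical `release_branch` attribute added in a later version, with a default for items written before it existed; none of these field names come from the talk.

```python
def read_project(item: dict) -> dict:
    """Normalise a stored item, whichever code version wrote it."""
    return {
        "id": item["id"],
        "name": item["name"],
        # Hypothetical attribute added later; older items do not have it.
        "release_branch": item.get("release_branch", "main"),
    }


old_item = {"id": "p1", "name": "shop"}
new_item = {"id": "p2", "name": "blog", "release_branch": "production"}
```

Old and new component versions can then run side by side against the same table, because neither depends on the other having rewritten the data.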
If you need to make schema changes on every commit, if you sort of make regular backward incompatible changes, I don’t know. I don’t have any good answers; but, again, I’m really interested if you have this issue and manage to solve it.
What about my situation? Unfortunately, I won’t be able to do a live Q and A because I’m in Australia, on the right side of the world. But hit me up on Slack, and I’m more than happy to have a chat with you and sort of figure out how to do this in your particular situation.
And the last thing is, obviously, what’s it like? What’s it like fixing bugs? And this is the closest thing I can find to magic fairy dust because fixing bugs becomes almost trivial. Like, it feels like cheating, basically. Like, if you have that good observability, we have Honeycomb for that, locating the issue is easy. Again, it’s production data you’re looking at, and you have the exact steps that happened to get to that error.
And then you have the full capability to go, “I think it’s this.” Make the change, push to production, test it. And if it’s not that, you just go try something else. You have this ability to iterate in production fully confident that you’re not affecting anyone else but this particular customer that has this particular issue.
So what is it like building a new feature? The same. What it is is a customer would have a feature request or multiple customers would have a feature request. We do a quick and dirty prototype to sort of go, “well, this is what it would look like.” They would look at it, give feedback, we would improve it. We would just rinse and repeat until it was either great or we figured out why it wasn’t and never got it to work and we just deleted it. Right? There was no large undoing and deleting a whole bunch of stuff and a whole bunch of refactoring from our code base. It was just, delete the branch that it was in. That’s the only thing we needed to do. That’s what it’s like. So, basically, it’s magic.
Thank you very much.