Conference Talk

August SRE Leaders Panel: Testing In Production

August 26, 2020

 

Transcript

Blameless Host:

All right. Let’s go ahead and get this show on the road. Hey, everyone. Thanks so much for joining us today for the August edition of our SRE Leaders Panel. Today, we’re really, really excited to have a special edition where we’ll be discussing Testing in Production. So we’ll kick it off with 40 minutes of a panel discussion with our amazing guests and then we’ll leave 20 minutes at the end for some open Q&A. As soon as the recording is ready, we’ll also email all of you a recording of this panel so that you can share that with your teams, your family, anyone you’d like. So make sure you add any questions that you have into the Zoom chat so that we can answer them live during our Q&A session as well. So with that, I’ll go ahead and hand it off to our wonderful panel facilitator, Amy Tobey.

Amy Tobey [SRE Staff|Blameless]:

Hi, everyone. I’m Amy Tobey, and I’m a staff SRE at Blameless. I’ve been doing SRE for a really long time; those who’ve been here before have heard this a few times. I’m really into the SRE community and figuring out ways we can enable reliability in all of our organizations. So with that, I would like to move on to our panelists. So today, we have joining us Shelby Spees from Honeycomb and Talia Nassi from Split. And I’m going to let them introduce themselves. So let’s start with, since I’m on the front page, Shelby’s on the left, so we’ll start with her. Shelby, could you please introduce yourself?

Shelby Spees [Developer Advocate|Honeycomb]:

Yeah. Hi, I’m Shelby. I’m a Developer Advocate at Honeycomb. I’ve been developing software and running production systems for about five years now. And I just really want to help people deliver more value in their business.

Talia Nassi [Developer Advocate|Split]:

And I’m Talia, I’m a developer advocate at Split. And before this, I was doing testing and I was a QA engineer and automation engineer for six years. So I have a background in testing. I think that’s why I’m so passionate about testing in production. So yeah, I’m really excited to be here.

Amy Tobey:

Awesome. So to get started, I want to start with this idea of trunk-based development. But we’re not going to talk about trunk-based development. What we’re going to talk about is something that already exists in the world and that I believe that most people should be doing by default. And what that does is it says that we test our code with things like unit tests and functional tests, on our workstations in our dev environments, and things like that. But all the code, before we merge it to our main branch has to be ready to go for production. So we’ve done that. Let’s just assume that we’ve all had that in place, and all of our lovely attendees have that in place. So what’s next?

My role, as somebody in the incident management space especially, comes a little bit later in the process. So I think in terms of, when we say go and deploy that code to production, what’s the first thing? And I think we can start with Talia to quickly talk about Split’s product. So let’s start with that, just real quick: how it enables some things to be shipped out to the edge and into production before we have full confidence that it’s doing exactly what we expect.

Talia Nassi:

Yeah. So with Split, you can use feature flags. And for those of you who aren’t familiar, a feature flag is basically a piece of code that differentiates code deployment from feature release. So you can deploy code and it won’t be visible to your end users, only to a specific group of people that you can specify in the UI. So you can use feature flags to help you in this way. So if you’re doing trunk-based development, you can deploy your code to production behind a feature flag, and then once you’re ready and you’ve tested the code in production, you can turn the feature flag on.
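
To make that concrete, here is a minimal sketch of deploy-versus-release in application code. The flag check is a stand-in, not Split’s SDK or any other particular one; real flag services evaluate targeting rules configured in their UI.

```python
# A stand-in flag check, not any particular SDK: real flag services evaluate
# targeting rules configured in their UI instead of a hard-coded set.
def is_on(flag_name: str, user_id: str) -> bool:
    targeted_users = {"talia@example.com", "amy@example.com"}
    return flag_name == "delete-todo-items" and user_id in targeted_users

def todo_actions(user_id: str) -> list:
    actions = ["add", "complete"]
    # The delete feature is deployed to production, but dark: only users
    # targeted in the flag see it until the flag is turned on for everyone.
    if is_on("delete-todo-items", user_id):
        actions.append("delete")
    return actions

print(todo_actions("talia@example.com"))     # ['add', 'complete', 'delete']
print(todo_actions("customer@example.com"))  # ['add', 'complete']
```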

Amy Tobey:

Something you said in there: for a lot of folks, I think, the feature flag space has expanded a lot over the last few years. So I heard you say something about the ability to turn flags on for a fraction of users.

Talia Nassi:

Yes.

Amy Tobey:

So I wonder if you could bring that back to how that enables us to more confidently shift that code.

Talia Nassi:

Yeah. Let’s say I’m a developer on a team and I am working on, let’s say it’s a to-do list app. Just a simple basic app. And I want to add the ability to delete items from the list. So I’m a front-end developer, and the back-end developer is creating this new API to delete tasks. My change is done, and I have to wait for the back-end developer. So I have two options: either I can wait for the back-end developer to be done, or I can push my code to production behind a feature flag, wait for the back-end change to be done, and have the back-end developer push his code to production behind the feature flag as well. And then there, I can target myself inside the feature flag, which means that only I will be able to see those changes. And then I can test it with both the front-end and the back-end changes. And it’s a way to do it safely, without affecting your end users.

If this is something where you’re waiting for other parts to be ready or different parts of your system are dependent on each other, and you don’t want to have those dependencies or wait for them to test-

Amy Tobey:

Or even just waiting for release announcements.

Talia Nassi:

Exactly, exactly.

5:38

Amy Tobey:

Right. So let’s say a young developer has these tools and is in a trunk-based development system, and they put a feature flag on, they do all the right things. They deploy it to production and they flip the flag on and go home for the weekend. And this is where I’m going to go to Shelby, who has an amazing t-shirt that I just noticed. And somebody else in the chat has one too. And so we’re testing in production. And customer service wakes up and there are some complaints. And the baby engineer needs to go look and figure out what’s going on. And I think this is where we can hand off and talk a little bit about the role of observability.

Shelby Spees:

Absolutely. And I think it’s so important when you’re using feature flags, I think feature flags are a core part of this. But also, even if you’re not behind a feature flag and you just want to observe your changes that go out, having the data to allow you to observe that change and really see the impact of your code changes on real users, whether they’re internal users and you’re dogfooding a change, or you’re releasing out to a subset of traffic or all of traffic, that’s really important. And the thing about observability is it requires high quality, high context data. And the context we talk about is the context you care about as you’re writing your code.

And so as you’re writing code, you think, “Okay, what’s important for this? How am I going to know it’s working in production?” Similar to how we do test-driven development or when we want to write high-quality tests, we say things like, “How do I know this test is effective? How do I know this code is effective?” You write a failing test first, and then you write the code to pass the test, you can do something very similar, where you write instrumentation and interact with your system, and live in that production space. And learn from the actual behavior.

And so when you get paged, or when you get some customer complaints and they’re saying, “Hey, there’s something wrong,” you can flip your feature flag back off, that’s a really quick way to fix it. And you can go back and look at the difference between who’s behind that feature flag versus our baseline and really understand what’s the impact there.

Talia Nassi:

Yeah-

Amy Tobey:

Am I right in hearing that that implies an integration between the feature flagging service and the observability system? The feature flag would be passed into the events being sent to the observability system?

Shelby Spees:

Yeah, absolutely. You can send that context along with all kinds of other contexts.
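
One way that integration can look in code, using OpenTelemetry-style span attributes purely as an illustration; the field names are made up, and any event-based instrumentation that accepts arbitrary fields works the same way.

```python
# Illustration only: attach the flag decision to the request's telemetry so
# you can group and compare traffic by flag state later. Field names are
# made up; assumes the opentelemetry-api package is installed.
from opentelemetry import trace

def handle_delete_task(user_id: str, task_id: str) -> str:
    # Stand-in for a real flag SDK check.
    flag_on = user_id.endswith("@internal.example.com")

    span = trace.get_current_span()  # no-op span if tracing isn't configured
    span.set_attribute("app.user_id", user_id)
    span.set_attribute("app.feature_flag.delete_todo_items", flag_on)

    if flag_on:
        return f"deleted {task_id} via the new code path"
    return f"delete not available for {task_id} yet"

print(handle_delete_task("talia@internal.example.com", "task-42"))
```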

Amy Tobey:

That’s pretty cool.

Shelby Spees:

Mm-hmm (affirmative), isn’t it?

Talia Nassi:

Just so-

Amy Tobey:

You were going to say something, Talia?

Talia Nassi:

Yeah. I just wanted to piggyback off of what Shelby said. She said, if you release to a subset of users or the entire user base, then you can monitor with observability. And I think that’s also a really important point, is that when you do like a canary release, or a percentage rollout, where you release the feature to only a small subset of users, I think that goes hand in hand with testing in production, because if something goes wrong, would you want 1% of your users to experience this issue or 100%? So it’s just, I think, an added layer of protection to have a percentage rollout where you slowly, incrementally, roll out the change. As opposed to, everyone is going to see this feature all at once in production.
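
Under the hood, a percentage rollout is usually deterministic bucketing of users, so the same user keeps getting the same answer as the percentage ramps up. A rough sketch of the idea (flag services handle this for you):

```python
import hashlib

def rollout_bucket(flag_name: str, user_id: str) -> int:
    # Deterministically map (flag, user) to a bucket from 0 to 99, so the
    # same user gets the same answer until the rollout percentage changes.
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100

def in_rollout(flag_name: str, user_id: str, percent: int) -> bool:
    return rollout_bucket(flag_name, user_id) < percent

# Start at 1%, watch the change in your observability tooling, then ramp up.
print(in_rollout("delete-todo-items", "user-123", percent=1))
print(in_rollout("delete-todo-items", "user-123", percent=100))  # always True
```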

Amy Tobey:

Right, yeah. So we’ve got the system in place, and we’re rolling code out. And let’s say this time, things are going fairly well. But I still have, say, my director of software engineering coming and bugging me regularly about it, just because they’re nervous about it. So what’s next? We’ve got the ability to see how the code is behaving in production, we’ve got the ability to have very low latency, fine-grained control over which features are enabled in production. So what are the other layers that you all see out in the wild that help people build that confidence on top of these things? Because these things give us a lot of confidence, but there’s always a little bit more we can do.

Shelby Spees:

I mean, I think it comes down to the DORA metrics. How quickly can you have a change go to production? Is your lead time on the order of weeks to months, or is it on the order of minutes to hours? If it takes several weeks between writing code and it living in production, you lose all of that context. And something inevitably doesn’t go right; across thousands of changes over the year, there’s going to be something that comes up. Or you even just want to see, what’s the impact of the work I did? And you lose all that context, and it’s that much more expensive to gather all that back up in your head and really remember, what was the point of this thing I was working on?

And so lead time for changes rolling out is a big one. And that’s where CI tooling makes a big difference. And at Honeycomb we actually instrument our builds, so that we can see not only how long it takes for a build to run, but what part of it is slow and can be optimized. Because we try to keep our build times under 10 minutes, because we don’t want people to go and start working on something else and then have to switch contexts back to see if the build failed, or when there’s code review, having to… Context switching is so expensive. And so it’s something I feel really passionate about.
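
The general shape of that build instrumentation is simple: time each step and emit it as a structured event. This is a generic sketch of the idea, not Honeycomb’s actual tooling.

```python
import subprocess
import time

def run_step(name: str, cmd: list) -> dict:
    # Time each build step and record the result as a structured event.
    start = time.monotonic()
    result = subprocess.run(cmd, capture_output=True)
    return {
        "name": "build.step",
        "step": name,
        "duration_ms": round((time.monotonic() - start) * 1000, 1),
        "exit_code": result.returncode,
    }

events = [
    run_step("lint", ["echo", "lint ok"]),
    run_step("unit-tests", ["echo", "tests ok"]),
]
for event in events:
    print(event)  # in CI, these would be sent to your observability backend
```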

Amy Tobey:

It sure is.

Shelby Spees:

Yeah.

11:14

Amy Tobey:

It’s a huge strain on cognitive resources. I like how you brought that up, and I was going to take it somewhere, but I got caught up in what you were talking about. So Vanessa asked in the chat, about folks using trunk-based development. If there are known defects in the code, or it’s incomplete, is it still safe to ship behind a feature flag?

Talia Nassi:

Yeah. I think as long as the default rule is off, then it’s safe to do it. Because you’re saying that, if you’re in this bubble, if you’re internal or if you’re part of a beta testing group, then you get to see all the defects and you get to play with this thing and break it in production. But if you’re not, then you don’t get to see anything related to this feature at all. So as long as the default is off, then it’s safe to deploy.

Shelby Spees:

Yeah, and that’s exactly-

Amy Tobey:

There’s a cognitive advantage right here, which is, if we have… Modern software engineering is about working in teams and groups. And what I heard is: I finished my component. Maybe, as you mentioned before in your example, I get the UI code done, and I’m confident, against a simulator for the APIs or whatever, that the UI is done. I ship it with the feature flag turned off, but now the back-end team doesn’t need to come bug me and cause another context switch, or wait for me to respond and do my thing and deploy stuff. They can simply turn it on in their environment.

Talia Nassi:

Exactly.

Shelby Spees:

Yes.

Amy Tobey:

And so there’s a new level of coordination that’s now captured in the technical system, as opposed to us having to walk around and talk to each other.

Shelby Spees:

And what I really appreciate about this workflow is that, and it’s one of the benefits of trunk-based development is you don’t have your front end feature sitting on a feature branch for weeks at a time and falling behind your trunk. And so instead of using Git, which I love Git, I’ve used the username-

Amy Tobey:

It’s so hard to love.

Shelby Spees:

It’s hard to love. I use the username Git Goddess. I’ve taught Git to juniors and seniors in different jobs. But making people manage integration at the Git level is error-prone, and it’s complicated. And so using feature flags to manage things is a lot better. And I’ve been the person to have to go in and rebase and squash and-

Amy Tobey:

Oh, gosh.

Shelby Spees:

… manage other people’s feature branches. I’ve rebased, I think it was like 115 commits, six months of work, and I broke it up into like nine PRs or something. I will sit down and do that. I don’t want anyone to ever have to do that. And so hopefully it never gets that bad. Merging things in even before you’re ready to release the feature, and just having it guarded by a feature flag, removes all of that, again, that cognitive load and that complexity.

Amy Tobey:

So we’ve talked a lot about production and taking a lot of the guardrails off and letting our developers fly a little faster. But along that spectrum, I glossed over that there’s a series of different kinds of testing and validation we can do. And so I wanted to back up a little and just, for our audience, try to bring these things back. In the Twitter threads we’ve had over the last couple of weeks about this subject, one of the things that came up frequently was a couple of “oh, hell no” responses. And I think that that came from a place of, we really should have confidence before we put anything in production. So what are our options here, pre-production, that these tools help us carry through? So for example, in our dev and integration environments, or even local testing. So maybe Talia, what’s the role there where these tools help us accelerate?

15:31

Talia Nassi:

Yeah. So I think a recommended approach, for unit testing or integration testing, something helpful that I recently learned about, is to make a custom feature flag abstraction, which makes it easy to mock out. I’m a big fan of examples because they help me understand things. So let’s say you’re a developer who’s experimenting with giving people free shipping on an e-commerce site, and now you’re testing the shipping calculator. If the feature flag is on for you, you get free shipping, and if the feature flag is off for you, then you get the existing shipping cost. In this example, you would have three tests.

In the first test, you simulate the feature flag on, where the shipping cost is zero. So that means for the duration of this test, any requests asking if the feature flag is on, you say yes. And then in the second test, you simulate the feature flag off, and the shipping cost is the existing cost, and if any requests come in from the test asking if the feature flag is on, then you say no. And then in the last test, you just validate that you can go through the entire purchasing flow, regardless of whether the flag is on or off. And I think with this approach, you’re being super explicit in the test, and then the test just becomes much more self-documenting and descriptive.
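
Here is a sketch of those three tests, assuming an illustrative flag abstraction that the tests can swap out; the names are made up, not any particular SDK.

```python
import unittest

class Flags:
    """Thin abstraction over the flag SDK so tests can substitute answers."""

    def __init__(self, overrides=None):
        # In production this object would wrap the real flag client;
        # in tests it just answers from a dict of overrides.
        self.overrides = overrides or {}

    def is_on(self, name: str) -> bool:
        return self.overrides.get(name, False)

def shipping_cost(flags: Flags, base_cost: float = 4.99) -> float:
    return 0.0 if flags.is_on("free-shipping") else base_cost

def purchase_total(flags: Flags, items_total: float) -> float:
    return items_total + shipping_cost(flags)

class TestFreeShippingFlag(unittest.TestCase):
    def test_flag_on_shipping_is_free(self):
        self.assertEqual(shipping_cost(Flags({"free-shipping": True})), 0.0)

    def test_flag_off_shipping_is_existing_cost(self):
        self.assertEqual(shipping_cost(Flags({"free-shipping": False})), 4.99)

    def test_purchase_flow_works_regardless_of_flag(self):
        for state in (True, False):
            total = purchase_total(Flags({"free-shipping": state}), 20.00)
            self.assertGreaterEqual(total, 20.00)

if __name__ == "__main__":
    unittest.main()
```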

Amy Tobey:

I like how that sets people up to validate both paths in parallel.

Talia Nassi:

Yes

Amy Tobey:

Each independently, and then both together.

Talia Nassi:

Right.

Amy Tobey:

Because I’ve seen so many incidents where the current path works fine, the new path works fine, but once both get into production together, they interact in strange and unusual ways.

Talia Nassi:

Right, right. And up until a few months ago, I actually just learned about this approach. Up until a few months ago, I would always recommend getting your test users, targeting your test users inside of the feature flag, and then using those to run your tests. But the fragility of that: if someone goes in and deletes a user from a certain treatment in the UI, or if someone like a senior person goes in and is like, “What is this user? I’ve never heard of them. I’m just going to delete them.” It just causes so many problems, because you’re not the only person who can make configuration changes in the UI. So I like this approach a lot better, because you take away that fragility.

Amy Tobey:

Right, nice.

Shelby Spees:

Yeah, I think-

Amy Tobey:

Go ahead, Shelby.

Shelby Spees:

Okay. Yeah, I think it comes down to just thinking about the impact of all your changes. As trivial as it sounds to turn feature flags on and off, a feature flag is still adding a fork in the road; it’s still a different path that your code can take. And so you want to be intentional about where you’re including it and why you’re including it. What’s the purpose of this? And then think about, once again, how do I know it’s going to be successful, and test for that. And like Talia said, test for the interactions between on and off for-

Amy Tobey:

That almost brings me back to chaos engineering. Where we have to have a hypothesis before we start. And if we don’t, that’s the difference between science and screwing around, is the hypothesis.

Shelby Spees:

Totally, totally. And I feel like so many teams do this, where they just throw things at the wall and see what sticks. And there are times when you need to do that, but I think we can do a really good job of reducing how much of the time we’re throwing things at the wall and being more like, what’s going to be the impact of this change? And things like that also make it easier if you know the impact of your change. You can include that in your PR and that facilitates code review. And that facilitates knowledge transfer in your team. It helps you write better code, that’s more self-documenting, and write better comments on your code, and write better documentation around your code. And so all of this stuff, it’s all interconnected. Being intentional helps you just build better software.

Amy Tobey:

Speaking of intentional, that reminds me of something that I think we can talk about real quick, is that we have this fork in the road we’re talking about. And so we get the fork in the road, and we push the new feature flag out at 0% and then we bring it up to 1%. We do the whole process, and eventually, we flip it to 100%. At some point, there needs to be a loopback through the process to remove the dead path. And so could you talk a little bit about the processes that you’ve seen work for making sure that happens? Because I’ve also seen incidents where a feature flag is set and forgotten for months or years, and then someday later somebody else says, “What’s this?” And they flip it, and there’s an incident.

20:38

Talia Nassi:

Yeah. So there’s a few things. The first thing is just piggybacking off what Shelby said, changes to your feature flag should be treated as changes to your codebase because of their sensitivity. So if you require two code reviews for pushing code to production, then you should require two code reviews for making any changes to your feature flags. Just because-

Amy Tobey:

Well, okay.

Talia Nassi:

… you are affecting real users. And then in terms of what to do when you have stale feature flags, there’s a few things. A lot of feature flag management systems have an alert, and you can set this up too, that’ll say, “Hey, this feature flag hasn’t had any impressions,” which means no users are being evaluated against that flag. So it’ll say, “Hey, this feature hasn’t had any impressions in the past X amount of days, do you want to turn it off? Do you want to delete it? What do you want to do?” So in the UI, there’s some configuration to set up.

There’s also, in your task management system, if you’re using JIRA or Asana, whenever you create a ticket to set up the flag, run the test, roll out the feature, whatever it is you’re doing, you should also create a ticket to delete the flag and remove the old code. And then inside of the code, the feature flag is basically just an if/else statement. You’re saying, if you’re in this bucket, do this, and if you’re in this other bucket, do this other thing. So that if/else statement just needs to be reworked: whichever version you chose to keep goes in, and the if/else statement gets taken out.
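
In the code itself, that cleanup is just collapsing the fork. A before-and-after sketch, reusing the free-shipping example and assuming the new branch won:

```python
# Before cleanup: the flag forks the code path.
def shipping_cost(flags, base_cost: float = 4.99) -> float:
    if flags.is_on("free-shipping"):
        return 0.0        # new behavior, behind the flag
    return base_cost      # existing behavior

# After the rollout reaches 100% and the flag is retired: keep whichever
# branch won, delete the check, and remove the flag in the flag service
# (ideally via the cleanup ticket filed when the flag was created).
def shipping_cost_after_cleanup() -> float:
    return 0.0  # in this example, the free-shipping branch won
```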

Amy Tobey:

That makes me think that if/else needs to be kept as simple as possible too.

Talia Nassi:

Yeah.

Amy Tobey:

Because if you’re getting a case statement, then you’re just asking for trouble down the road.

Talia Nassi:

Yeah, yeah. Totally. Totally.

Amy Tobey:

Next one.

Shelby Spees:

Actually, earlier this year, one of our engineers, Alyson, wrote a blog post about using Honeycomb to remember to delete a feature flag. And I think it’s like a hygiene thing. I appreciate the idea of, when you open a ticket that involves creating a feature flag, you also open a ticket to later delete that feature flag. And thankfully, removing feature flag code involves code review. And so there’s that knowledge transfer and context sharing. Thank you for sharing that. But having that step forces you to be like, “Okay, what’s the impact of this?” And then, if you’re using observability tools, you can go and see: is there anyone who’s still behind this feature flag? Or still not behind this feature flag, whatever the case is.

Amy Tobey:

Would that be through the metadata that comes in the events? Or do you actually decorate your feature flag code with spans? Or maybe situational?

Shelby Spees:

Yeah. So I think you just add it to your regular instrumentation: a field that says this feature flag is on or off.

Amy Tobey:

Okay. Awesome.

Shelby Spees:

Because you can add arbitrary fields, you can just give it a name that says feature flag XYZ.

Amy Tobey:

Right, right. I was just curious if there were cases where I have a flag A and B. And maybe in A, I have a new span or something. And does that impact the ability of my observability system to consistently display and compare spans?

Shelby Spees:

It can be really useful with tracing. If the code behind your feature flag is, for example, a lot more complicated, you would probably instrument it just the way you would want it instrumented without the feature flag, to see how it’s going to behave normally. And then, yeah, the feature flag is just another field.

Amy Tobey:

So Dave asked a really interesting question, about when we do these tests in production, sometimes they can impact our data and our back ends. And so what have you seen out in the field, for new techniques for people to say, we have a new route for writing to the database, replacing the old one. Maybe it’s inefficient or whatever. And we flipped to the new one, but maybe it’s writing data in a slightly different format. And so there are all these side effects that can still happen and get out to production and cause incidents, which is more work for me. And so what have you seen out there that people are doing, to protect the data?

25:30

Talia Nassi:

Some of the things that I’ve seen work well are around differentiating test data and real production data. So the first thing is, your test users should have some Boolean or something in the back end that says is test user, and that would be equal to true. So anything that this test user does in production is going to be marked with this Boolean set to true, and then wherever you collect your data, like Looker or, what’s the other one? Datadog, you can just say, if you have any action coming from this test user, put it somewhere else, don’t put it in the same place as production, because production is everything where that Boolean…

Amy Tobey:

I see.

Talia Nassi:

… is false. And then the same thing with all the other test entities: if you have a test cart and a test page, whatever your testing entities are, each company is going to be a little bit different. Those should have some flag in the back end for is test object. That should be set to true, and then everything that has that flag set should be put somewhere else in your dashboard.
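
A sketch of what that tagging and routing can look like, with illustrative field names:

```python
# Illustrative field names: every event carries is_test_user / is_test_object
# so analytics can keep synthetic traffic out of real business data.

def build_event(user: dict, action: str, entity: dict) -> dict:
    return {
        "action": action,
        "user_id": user["id"],
        "is_test_user": user.get("is_test_user", False),
        "entity_id": entity["id"],
        "is_test_object": entity.get("is_test_object", False),
    }

def route_event(event: dict) -> str:
    # Anything marked as test data goes to a separate dataset or dashboard.
    if event["is_test_user"] or event["is_test_object"]:
        return "test-traffic"
    return "production-analytics"

checkout = build_event(
    {"id": "qa-bot-1", "is_test_user": True},
    "checkout",
    {"id": "cart-999", "is_test_object": True},
)
print(route_event(checkout))  # test-traffic

real = build_event({"id": "customer-7"}, "checkout", {"id": "cart-123"})
print(route_event(real))  # production-analytics
```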

Amy Tobey:

I like that, yeah.

Shelby Spees:

Yeah. And the other thing is, if you’re testing on real production users, that’s real production data. It might be an experiment, but every time you release a change, it’s an experiment. Every time you release a change, that change does something, and you can’t possibly know everything it will do in advance. I mean, that’s the underlying theme here when we talk about testing in prod: you’re already testing in prod, you’re just not being intentional about it.

And so the difference here is if you have a subset of your traffic behind a feature flag, if you know that they’ve specifically opted in as beta users, then as Talia said, you can mark them as beta users or test users. If you’re dogfooding and you’re limiting things to your internal team, then those people should be marked as your team’s users and not real-

Amy Tobey:

Nefarious users, yes.

Shelby Spees:

Yeah. And so if your business data already reflects this stuff, then it’s probably not a lot of extra work to be able to report on changes between these different groups.

Amy Tobey:

So we’ve talked quite a bit about confidence and risk and danger. I want to shift since we’re getting into our… We have about 10, 15 minutes for Q&A. And talk now about the opportunities in front of us with these new tools. Because I’ve been doing this for more than 20 years. Back when we started, we didn’t have this stuff. We had stones and chisels and then we got Vim and some people claim that it wasn’t really a step forward. We’ve come a long way. So we have these new capabilities, and they allow us to do new things.

And since we were just talking about the data, let’s talk a little bit about what’s possible about testing in production that we can’t do in the synthetic lab environments. Because the data isn’t real, and because we have things like GDPR that prevent us from doing that testing. So let’s start with Shelby this time. If you could talk a little bit about the things you’ve seen out there, where people are able to validate their code in a way that, in production, that just isn’t possible anywhere else.

Shelby Spees:

Yeah. I’ll actually share another Honeycomb blog post from one of our users who used Honeycomb to debug an emergent failure in Redis. He actually monkey-patched the Redis Ruby library in order to observe what was going on, because there was all this blocking, and they ended up with all these connections and Redis called them and said, “We’re shutting you off.” And it was this huge problem that without observability he couldn’t possibly debug it. And he could not reproduce it locally or in QA. It was just, you needed a certain amount of traffic in order to debug it and the thing is, I appreciate that there are industries and there are certain domains where you need to have a synthetic environment. If you’re working on pacemakers or something, that’s really important. You want to be very, very confident in your tests.

But there’s also a cost to that. And similar to how we talk about three nines versus four nines being an order of magnitude more effort, it’s similar with having a test environment that represents production accurately enough to give you any answers. And so being able to reproduce emergent failure modes from production in a test environment is super expensive, and it’s often not worth the infrastructure cost and the engineer brain cycles to actually be able to do that. And so I think that’s where a lot of the arguments against testing in prod fall apart, because you’re going to have things that can only happen in prod. And so you may as well give yourself the tools to address them.

Amy Tobey:

I like that.

Talia Nassi:

And I couldn’t agree more with you. You’re preaching to the choir right now.

Amy Tobey:

We’re going to start sending out hymnals for these things, so we can all open them up and-

31:10

Talia Nassi:

Yeah. I used to be an automation engineer, so up until the very end of that part of my career, I was only testing in staging because the companies that I was working at didn’t test in production, until the last couple. But I would spend so much time testing features in staging and they would be pushed to production, and then they would break in production, but they were working perfectly in staging. So what’s the point of testing in staging if they’re going to break in production? My users aren’t going in and using the product in staging, so why do I care? And after that happened over and over again, I was like, “Okay, there has to be another way to do this.” And then I interviewed at a company that tested in production, and I haven’t looked back since.

Shelby Spees:

And the thing about that too, is that you as a tester are responsible-

Talia Nassi:

Exactly.

Shelby Spees:

… for the quality of the code going out. And so it’s super demoralizing when you have a certain level of confidence in staging, and then all of that falls apart in production. It’s like, what were you even testing? And it’s like you were doing your job, according to what was assigned to you, but you can’t do your job according to what’s actually good for the business. And for people like us who care about the impact of the changes going out, we want to be able to validate that. We want to be able to feel like things are going to work well. And so, yeah.

And I’ve talked to a few testers who are going into learning about observability and learning about the intersection between being a tester and observing code changes. And there’s so much about testing that involves the sense of responsibility and ownership over the services in production. And so it’s like, you don’t have to be a developer to have an impact there. And so I just really appreciate your story there, because I feel you. I totally feel you on that.

Amy Tobey:

I feel like there’s an opportunity there for testers. Up until… it still goes on all over the place. A lot of shops still have dedicated testers doing old school QA. But what we want is to uplift those people, so that they’re doing more high-value work, just like we tried to do through all of our careers. And maybe the place for them to move towards is this idea of owning and nurturing the test spectrum, extending that all the way up into production.

Talia Nassi:

Right. And when you have the right tools available, when you have the right observability tools and the right feature flagging tools, and they’re working together, you can see the impact of your changes. And you can test them in production before they go out to your end users, so you’re basically creating this bubble of risk mitigation. So if something goes wrong, you’re covered. You can see the changes, you can look in Honeycomb, you can look at your logs and see what’s going on before your end users are affected. So you’re doing it in the safest way possible with the right tools. What else do you need?

Amy Tobey:

Well, you need me. So stuff still falls through-

Talia Nassi:

Yeah.

Amy Tobey:

… and maybe to finish up the body of our conversation. We’ve gone from the beginning of the development cycle, where we have our unit tests and stuff, and we’ve talked all the way through that out into production. But sometimes stuff slips through still. And having these tools in place still makes our life better. So I’m an incident commander, and I’m in there and I’m asking questions, trying to bring an incident to resolution. And obviously now with observability and feature control in place, we have additional tools for resolving that incident. So maybe we can talk about that for a second. Talia, I’m sure you’ve seen cases where an incident is resolved quicker because of the agility that’s enabled by a feature flag service, versus having to republish your code.

Talia Nassi:

Yeah. And this just goes back to what Shelby was saying about not being able to reproduce an issue outside of production. If there’s an incident in production, I’m not going to go to staging to test it. I’m going to go to production to reproduce the issue. And that’s something that used to happen a lot when I was a QA engineer. There would be incidents and things being reported to us in production, and it’s one of those “oh, it’s working on my machine” type things, where I did everything I could to test it and it would be fine for me in staging, and then you would go to production and these incidents would only happen in production. You’re never going to know the differences between your staging environment and your production environment until you test in production.

36:15

Shelby Spees:

We talk about this at Honeycomb in terms of on call, approaching on call in a healthy way. If your on-call engineer gets paged at two in the morning, and it turns out to be something where they can turn off a feature flag and then debug it in the morning when they’ve had enough sleep and a good cup of coffee, why aren’t we doing this more? On call doesn’t have to suck as much as we make it out to be. It doesn’t have to be this painful, masochistic, punishment type of thing. We can put guardrails in place so that incidents get resolved and we stop impacting customers right away. And then we have the data, we have the observability, to go back and figure out what actually went wrong, so that then you can go fix the code.

Amy Tobey:

I love that.

Shelby Spees:

But with a full night of sleep. Yeah.

Amy Tobey:

I love that it has an impact on the health of our engineers. We’re not stuck at 4:00 in the morning when we’re at our lowest possible cognitive capacity. I don’t know about you two, but you wake me up in the middle of a REM cycle and I’m a damn idiot, for at least an hour before I’m ready to do anything. And I just really like the idea that it also is a tool for reducing burnout and attrition, even within our engineering teams.

Talia Nassi:

Yeah. And I like that idea of having different types of alerts. If something is broken, but it’s not a huge issue, there should be different severity levels for the different types of alerts. You should only be woken up in the middle of the night to turn off a feature flag if your entire app is crashing and things are on fire.

Amy Tobey:

Oh, with my SRE hat on, I’d say we should only really be waking engineers up for things that are actually harming our customers.

Talia Nassi:

Exactly.

Amy Tobey:

If it isn’t impacting the critical user journeys, then we probably should sleep through it and come and hit it with a full mind.

Shelby Spees:

And that’s where a lot of the work on SLOs and error budgets comes in. Alex Hidalgo’s new book, Implementing Service Level Objectives, I recommend it to everybody, it’s going to be awesome. I think you can order it now. It talks exactly about how we can reduce alert fatigue and have more meaningful pages. And also be able to anticipate: okay, things are affecting a tiny, tiny percentage of traffic right now, a tiny percentage of users, but in four hours, this is going to start impacting everybody. Should I wake somebody up to fix this? Or can I wait till morning? That sort of thing. And that’s exactly how you do it. And it involves some math behind the scenes, but when you have good data about the health of your system, it’s a lot easier to be able to alert on meaningful things instead of just, CPU usage is high. I’ve been on teams where we paged on, traffic isn’t within the band that we expect it to be in. And you shouldn’t be waking somebody up for that if it’s not going to impact your customer experience.
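
The math behind the scenes is mostly burn-rate arithmetic: how fast the error budget is being consumed. A toy sketch with illustrative thresholds, not taken from the book:

```python
# Toy burn-rate check: how fast is the error budget being consumed, and does
# that justify waking someone up? Thresholds below are illustrative.

SLO_TARGET = 0.999             # 99.9% of requests succeed over a 30-day window
ERROR_BUDGET = 1 - SLO_TARGET  # 0.1% of requests are allowed to fail

def burn_rate(failed: int, total: int) -> float:
    # 1.0 means errors arrive exactly as fast as the budget allows;
    # 10x means the whole month's budget would be gone in about three days.
    return (failed / total) / ERROR_BUDGET

recent_failed, recent_total = 120, 100_000  # e.g., the last hour of traffic
rate = burn_rate(recent_failed, recent_total)

if rate >= 10:
    print(f"burn rate {rate:.1f}x: page someone now")
elif rate >= 2:
    print(f"burn rate {rate:.1f}x: file a ticket, look at it in the morning")
else:
    print(f"burn rate {rate:.1f}x: within budget, let people sleep")
```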

Amy Tobey:

Yeah, I think of those as anxiety-driven alerts. They don’t know that something is wrong, you’re just maybe worried that something might be wrong. And so it’s waking up and going, is the baby okay?

Shelby Spees:

Yeah.

Amy Tobey:

And it isn’t great for anybody because disturbed sleep patterns are bad for us.

Talia Nassi:

Shelby and I were talking on Twitter yesterday and she said something really smart, she said, “A lot of companies and a lot of people are making business decisions just based on hunches or just based on an idea that they have, but they’re not looking at data.” And I think that’s so important, that you write your tests based on actual data and you make your business decisions based on data that comes from production. And using an observability tool like Honeycomb will allow you to do that.

Amy Tobey:

Right. And basing your risk assessment off of your SLOs, as opposed to some fuzzy idea of how many alerts are coming through or anything, which nobody really has a clear picture of in an infrastructure of any size.

Talia Nassi:

Right.

Amy Tobey:

Well, cool. That brings us to 52. We have one question, so we could probably come up with one of our own. Do either of you have questions that we could talk about? Since we’re in the Q&A time, and we have one in the queue. So I’ll let you… Oh, go ahead, Shelby.

40:54

Shelby Spees:

Oh, yeah, no, I always have questions about cultural changes in a team. If you’re watching this talk and you’re like, “Oh, man, I get it, some things we can’t help, the only way to test it is in prod.” And you go to your manager, and you’re like, “Why aren’t we doing this?” And your manager’s like, “What are you talking about?” How do you start moving the needle? How do you start pushing for that cultural change on your teams and in your organizations?

Talia Nassi:

So something I always like to tell people is to use examples from your past. If this is something that consistently happens, where you test something fully in staging and it works great and your automation tests are passing and then you push to production and it fails, if that constantly happens, then maybe this is something that you should bring up. Like, “Can we get a trial of these tools that we need to make this work just to see if it works for us for a few weeks?” And I would start with examples from your past, because if things consistently happen and you’re not making changes to make them better, then they’re going to keep happening. So that’s what I would say.

Shelby Spees:

Yeah.

Amy Tobey:

I’d say most of my testing in production earlier in my career was done on the rule of, it’s easier to ask for forgiveness than permission. It would just be like, well, we don’t really have a way to do this, I’m just going to do it and I’m not going to ask anybody. I’m not recommending this for everyone unless you’re really confident in your ability to get a new job. But we’re often stuck in those places.

Shelby Spees:

At my last job, my manager told me to take more risks, because I would learn faster. And I really appreciate that, because I tend to be very hesitant, “Are you sure? Let me get five people to approve this before I do it.” Which is not the making of a good tester necessarily. But I was also on the DevOps team. I had access to all of our production systems. So I didn’t want to just make changes willy nilly. And so him telling-

Amy Tobey:

But that’s fun sometimes.

Shelby Spees:

Totally. And so him telling me, “Take more risks, you’ll learn faster,” to this day, I carry that with me. And it’s still hard for me to just get over that hurdle of “I crave approval.”

Amy Tobey:

And confidence. That’s usually what our leadership is looking for. So what I coach people on pretty regularly is to think about: what are the goals of the leaders that maybe we need to convince? And their job is to give the business confidence in what we’re doing. So whether we’re writing software, often the task is to say, this is the velocity I have to offer you, and this is my confidence in that velocity.

Talia Nassi:

Right.

Amy Tobey:

And then in operations, it’s typically: this is the availability I can offer you, and this is the confidence I have in that. And we’ve started to move to more encoded systems like SLOs for that availability one. And really, we use it for feature flags and observability too. That’s the metric we use to determine whether our estimates of confidence are meeting the actual wheels on the road.

Talia Nassi:

Right.

Amy Tobey:

So our question from Massimo is, I’m thinking now about destructive tests in production, such as stress tests, security tests. Shall I avoid these kinds of tests in production? And do you have any experience with that? So let’s start with stress tests. Do the feature flags have a role? And how do we integrate that with our stress testing regimen?

Talia Nassi:

Yeah. I always get questions about this, about running these types of security tests and stress tests in production. And when I talk about this, I’m like, “It’s better to do this yourself at a time of low traffic, rather than have your site crash in production because you didn’t do these types of tests in production.” So yes, you should be running performance tests and load tests and stress tests in production. But you should do it at a time of low traffic, and at a time where you know that if you run the test and the site crashes, you can bring it back up. So feature flags can play a role in that. But yes, they definitely should be done in production.

Amy Tobey:

You can also do things like, if your stress testing has a synthetic user it uses-

Talia Nassi:

Yeah.

Amy Tobey:

You could enable features just for the stress test.

Talia Nassi:

Yeah.

Amy Tobey:

And then you can go back to your observability system and see the impact of the new code on your systems.

Talia Nassi:

Exactly.
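
Pulling those pieces together, a sketch in which the synthetic load-test user is both targeted in the flag and marked as test traffic; all of the names here are illustrative.

```python
# Illustrative names: the load generator authenticates as a synthetic user
# that is (a) targeted in the feature flag and (b) marked as test traffic,
# so the new path gets exercised under load without touching real users.

LOAD_TEST_USER = {"id": "loadtest-bot", "is_test_user": True}

def is_on(flag: str, user_id: str) -> bool:
    # Stand-in targeting rule: only the synthetic load-test user for now.
    return flag == "new-write-path" and user_id == "loadtest-bot"

def handle_request(user: dict) -> dict:
    flag_on = is_on("new-write-path", user["id"])
    # These fields go on the request's telemetry so the stress test's
    # impact can be isolated in the observability tooling afterwards.
    return {
        "user_id": user["id"],
        "is_test_user": user["is_test_user"],
        "feature_flag.new_write_path": flag_on,
        "status": "ok",
    }

print(handle_request(LOAD_TEST_USER))
print(handle_request({"id": "real-customer", "is_test_user": False}))
```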

45:59

Shelby Spees:

It brings me back to chaos engineering, and how, if you’re going to perform a chaos experiment, you want to have confidence that your system will be resilient, no matter when you deploy it. But at the same time, you don’t want to drop it on Black Friday. I mean, Black Friday is a chaos experiment itself, for example. But you want to be smart about when you start an experiment or you release a chaos monkey script. And it’s the same thing if you want to start performing stress tests or security tests. If you have people, what’s the term? White hat or security probe people. What are they called?

Amy Tobey:

Penetration testing.

Shelby Spees:

Pen testers, yes. And so if you have those people behind the scenes, you want to be confident that, if they’re starting to cause problems, that you can either block them off from the rest of traffic, quarantine them, or ban them or whatever, that you have mitigation measures in place. And so the point, I mean, the point of these tests is to have more confidence in your production environment. So absolutely, you should be doing these in production. But also, you should have confidence in your production environment.

Amy Tobey:

Right.

Talia Nassi:

Yeah.

Amy Tobey:

I think a lot of the standards around security, actually, the pen tests have to happen in production environments to really be valid. And it got me thinking about, I really like this idea of being able to enable new features just for the security team, to give them the opportunity to attack it while it’s out in the real world, but before it’s exposed to the wild hordes of bots and script kiddies and stuff.

Talia Nassi:

Yeah. And I think the norm of doing all this testing in staging comes from years ago, when there was no way to safely do it in production. Like you said, Amy, these observability tools and feature flagging tools didn’t exist, I don’t know, 20, 30 years ago. So now that these tools are available, the people who are still in tech from when they weren’t available are still in that mindset of, “No, it’s not possible. How is that…”

Amy Tobey:

Some of us grow out of it.

Talia Nassi:

Yeah, yeah.

Amy Tobey:

Massimo says, I saw pen testers hired for testing in a staging environment. And maybe it was because of a C-suite that was afraid of testing in production. Which is probably a thing. Like, “Hey, don’t attack my production environment. I don’t want you accidentally dropping my tables.” Which I guess is a statement about confidence. Again. I keep coming back to that.

Shelby Spees:

It makes me think about, what’s the purpose of running all these software systems that we’re running? We’re here to deliver business value and testing in prod gives us more confidence in our systems and helps us learn more about our systems so that we can do a better job of delivering business value. And so when you refuse to even acknowledge the possibility of testing in prod, when you cover your eyes and cover your ears and stuff, you’re like… It makes me think of trying to deliver packages on horses when you could deliver packages on trains or something. It’s like you’re going to be stuck behind because you’re just not learning about your systems and you can’t address the problems that are there, they’re just going to stay under the surface forever.

Amy Tobey:

And you brushed up against an accessibility element there too. Which is, you got old farts like me, and some of my peers, that started around the time I did or earlier, that are looking at us and going, “You people, what are you doing? Testing in production that is forbidden.” But we also have younger developers or new to tech people who now are empowered to do things that we would not have empowered a young developer to do 10 years ago.

Talia Nassi:

Yeah. And you know what? There’s always going to be those people who say testing in production will never work. What are you guys doing? These naysayers, I like to call them. Honestly, I don’t care about these people. If you are-

Amy Tobey:

I just call them wrong.

Talia Nassi:

Yeah. Sure. Stuck in this mindset, and you absolutely refuse to change your mindset, that’s fine. You do you, everybody here is testing in prod.

Amy Tobey:

Cool. Well, I think we’ve gotten to a pretty good point where we can tie up. Do either of you… Oh, wait, I see… Oh, I thought I saw a new message. So do you have any closing thoughts, before we go, on this whole deal? It’s okay to pass if you don’t want to and then we’ll tie up and let everybody go back to getting ready for lunch or whatever they’re up to.

Talia Nassi:

Shelby?

51:07

Shelby Spees:

Yeah. I mean, as I said, you’re already testing in production. Add the tools to your tool belt so you can get the most value out of those risks that you’re already taking. Because every time you push out a change, that is a test. So yeah, just get the most out of that.

Talia Nassi:

Yeah, absolutely. I would also start with Honeycomb. It has a free version and Split has a free version. I would go in and download both of them, start using them, and figure out if the tools are right for you, if you like the process of testing in production. And again, use examples from your past to bring this up to your management.

Amy Tobey:

And push those tools to production to test them. Right?

Talia Nassi:

Yes.

Amy Tobey:

Maybe don’t push your trial accounts to production. We might stop there.

Shelby Spees:

Actually, we encourage people-

Amy Tobey:

Oh, Honeycomb actually has people doing that, don’t you?

Shelby Spees:

We do. We encourage people to send us their production traffic during their trials. So we also do paired trials. We’ll do proof of concept trials with people if you want to talk to our sales team. And if you have more questions about how to get started with observability, I hold office hours. I added a link in the chat. So yeah, grab a time on my calendar.

Amy Tobey:

Yeah. Well, all three of us are in Developer Relations, so I’m pretty sure we’re all happy, especially on Twitter, we’re all easy to find. And so with that, that’s our time. Thank you, Talia and Shelby, so much for joining us today. I had a great time. I hope you did too. To our audience, thank you for joining us, and that we can all have this time together in a time when we can’t actually be together, and I’m so tired of it. So thank you. Stay safe out there, stay healthy and stay resilient. Goodbye, everyone.

Talia Nassi:

Bye. Thank you so much.

Shelby Spees:

Bye-bye.

Blameless Host:

Take care everyone.
