Ask Miss O11y: Load Testing With Fidelity
By Liz Fong-Jones | Last modified on April 20, 2022
Dear Miss O11y,
My developers and I can't agree about what the right approach is for running load tests in production. Should we even be running load tests against our production infrastructure, or is it too risky? And what about ensuring our service-level objectives (SLOs) are correct? And how do we avoid overloading our observability provider or getting a surprise bill?
–Perplexed About Performance
We're huge advocates for testing in production, and that includes performing chaos engineering/continuous verification in production. But you're right to want to be cautious about exactly how you are performing these tests. After all, it's chaos engineering, not just pure unbridled chaos.
In order to stress test responsibly, you'll want to determine in advance a hypothesis you want to test, know how you intend to measure the results of the experiment, and have an emergency stop button should things go awry. Additionally, you'll want to have plenty of spare error budget in case things do go wrong; if you're running out of error budget already, chances are you have plenty of known unknowns you need to deal with first before you go in search of unknown unknowns.
Therefore, when it comes to load testing in production, you'll need to understand how much headroom you have both off-peak and on-peak: if you keep slamming your service with the level of extra traffic that's safe off-peak, you'll wind up failing on-peak once real customer load is added on top of your load test. You'll also want to ensure everyone is in the loop on when and where the test is happening, and how to disable it in the event of problems.
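To make the headroom point concrete, here's a minimal sketch of that calculation. All the numbers and names (`safe_loadtest_rps`, the 20% safety margin) are illustrative assumptions, not anything from Honeycomb's tooling:

```python
# Hypothetical headroom check for a load test. Capacity, traffic
# levels, and the safety margin below are all illustrative numbers.
def safe_loadtest_rps(capacity_rps: float, expected_user_rps: float,
                      safety_margin: float = 0.2) -> float:
    """Return the extra requests/sec a load test can safely add,
    leaving a fraction of total capacity unused as a buffer."""
    headroom = capacity_rps * (1 - safety_margin) - expected_user_rps
    return max(headroom, 0.0)

# Off-peak: plenty of room for synthetic traffic.
print(safe_loadtest_rps(capacity_rps=1000, expected_user_rps=300))  # 500.0

# On-peak: the same synthetic volume would tip the service over,
# so the safe extra load drops to zero.
print(safe_loadtest_rps(capacity_rps=1000, expected_user_rps=850))  # 0.0
```

The point of the sketch: the safe load-test rate is a function of current real traffic, so a fixed test volume that's harmless at 3 a.m. can cause an outage at noon.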
The question you're alluding to around observability and SLOs is a great one. We feel SLOs should be measurements of the health of requests from end users, rather than artificial traffic. As our CTO Charity Majors likes to say, “Nines don't matter if users aren't happy.” If your SLO is flooded with millions of "successful" requests from your load test, but your users are seeing errors, then an SLO claiming compliance does not reflect the actual user experience.
Conversely, if you need to back out your load test because it's causing too many errors, but you have protected end user traffic with traffic prioritization or QoS headers, you shouldn't be penalized for "missing" your SLO even when no real users have suffered! Honeycomb SLOs make it easy to qualify and exclude spans with a certain attribute from consideration in a service-level indicator (SLI). Set IF(AND(NOT(EXISTS($app.is_loadtest)), EQUALS($service.name, "my_service_name"), [...]), [...]) and proceed onward with your test, knowing you won't be counting load test traffic towards your SLO.
And with regard to observability, you should have visibility into both the performance of load test traffic and real end-user traffic; however, not all that traffic is equally valuable. Real end-user traffic is multifaceted and diverse in all dimensions, whereas artificially generated traffic tends to look very self-similar and is of lower value. As long as you differentiate artificial from end-user traffic in your telemetry, such as by setting an HTTP header in the client request and a corresponding attribute on the root span, you can treat that traffic differently for sampling purposes.
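One way to wire that up is to translate a marker header on incoming requests into an attribute on the root span. A minimal sketch, with the header name (`x-loadtest`) and attribute key (`app.is_loadtest`) as assumed conventions rather than anything standardized:

```python
# Sketch: tag telemetry for synthetic traffic based on a marker header.
# "x-loadtest" and "app.is_loadtest" are illustrative names; use whatever
# convention your load generator and your team agree on.
LOADTEST_HEADER = "x-loadtest"

def root_span_attributes(request_headers: dict) -> dict:
    """Build the attribute set for a request's root span, marking
    load test traffic so it can be sampled and filtered separately."""
    attrs = {}
    if request_headers.get(LOADTEST_HEADER) == "true":
        attrs["app.is_loadtest"] = True
    return attrs

print(root_span_attributes({"x-loadtest": "true"}))    # {'app.is_loadtest': True}
print(root_span_attributes({"user-agent": "chrome"}))  # {}
```

In a real service you'd pass these attributes to your tracing library when starting the root span; real user traffic never carries the header, so its spans simply lack the attribute.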
Tail sampling with Honeycomb Refinery frees you from having to propagate that information all the way downstream, and can apply a different sampling rate to the entire downstream trace based on the presence or absence of the load testing attribute in the root span. You'll be able to, without blowing out your observability bill, continue to debug end-user traffic in full fidelity, while also getting a snapshot of how the load test is performing and where it might be slowing down. Being able to visualize what's genuine and what's part of the test can also help clarify operational alerting and response—your operational dashboards should include and highlight what is extra traffic, but your user experience reporting should exclude it.
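As a rough sketch of what such a rule might look like, assuming the Refinery v2 rules file format and the hypothetical `app.is_loadtest` attribute from earlier (check the Refinery documentation for the exact schema of your version):

```yaml
# Illustrative Refinery rules fragment: sample load test traces at
# 1-in-100 while keeping real end-user traces in full fidelity.
RulesVersion: 2
Samplers:
  __default__:
    RulesBasedSampler:
      Rules:
        - Name: Heavily sample load test traffic
          SampleRate: 100
          Conditions:
            - Field: app.is_loadtest
              Operator: exists
        - Name: Keep real end-user traffic
          SampleRate: 1
```

Because Refinery makes the sampling decision on the whole trace, the attribute only needs to appear on the root span for every downstream span in that trace to inherit the load-test sampling rate.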
(Load) Testing in production is a great idea, as long as you have the appropriate guardrails. It might even help you exercise the paths that generate telemetry, ensuring that emitting and processing telemetry doesn't add too much overhead of its own. Make sure you're heavily sampling load test traffic and excluding it from your SLIs and SLOs, and you'll be in excellent shape.
May today be a good day to test in production!
Have a question for Miss O11y? Send us an email!