Statistical power is the probability that an A/B test will correctly detect a real effect when one actually exists. It is the complement of the Type II error rate (false negative rate): if a test has 80% power, there is a 20% chance it will miss a real improvement and incorrectly conclude the variant had no effect. Power is determined by four factors: sample size, effect size, significance threshold, and variance in the metric being measured.
Power = 1 − β (where β is the Type II error rate)
The standard minimum threshold in A/B testing is 80% power (β = 0.20), meaning you accept a 20% chance of missing a real effect. Research-critical environments often use 90% or 95% power.
To achieve a given power level, the required sample size increases when:
- The effect size is smaller (harder to detect)
- The significance threshold is stricter (e.g., 99% confidence vs. 95%)
- The metric has higher variance (e.g., revenue per visitor vs. binary conversion)
Power calculators (Evan Miller, AB Testguide) accept your baseline rate, minimum detectable effect (MDE), significance level, and desired power level to compute the required sample size.
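To make the relationship between these inputs concrete, here is a minimal sketch of the kind of calculation such a tool performs. It assumes a two-sided two-proportion z-test with an equal traffic split and uses the standard normal-approximation formula; the function name `sample_size_per_variant` and the example inputs are illustrative, and real calculators may use slightly different formulas and return slightly different numbers.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Approximate visitors needed per variant (two-sided two-proportion z-test).

    baseline      -- control conversion rate, e.g. 0.03 for 3%
    relative_mde  -- minimum detectable effect as a relative lift, e.g. 0.10 for +10%
    alpha         -- significance level (0.05 corresponds to 95% confidence)
    power         -- desired power, i.e. 1 - beta
    """
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    p_bar = (p1 + p2) / 2

    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the threshold
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power

    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


if __name__ == "__main__":
    # Hypothetical inputs: 3% baseline conversion rate.
    print(sample_size_per_variant(0.03, 0.10))              # 10% relative MDE
    print(sample_size_per_variant(0.03, 0.05))              # smaller effect, larger sample
    print(sample_size_per_variant(0.03, 0.10, alpha=0.01))  # stricter threshold, larger sample
    print(sample_size_per_variant(0.03, 0.10, power=0.90))  # higher power, larger sample
```

This sketch covers binary conversion metrics only. For a continuous metric such as revenue per visitor, the variance has to be estimated from historical data and used in place of the p(1 − p) terms.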
Why Statistical Power Matters for Ecommerce
Low-powered tests are a silent drain on optimization programs. A test with 50% power has a coin-flip chance of detecting a real improvement — meaning you could run dozens of tests, see mostly "no significant result," and conclude your hypotheses were wrong, when in reality your tests simply weren't large enough to pick up real effects. For D2C Shopify brands with moderate traffic (20,000–100,000 monthly visitors), power planning is essential: without it, you run tests that cannot answer the questions you're asking.
Real-World Example
Pilgrim's CRO team ran a test on their sunscreen product page, changing the hero image from a lifestyle shot to a before/after result image. After 14 days, results showed no statistical significance. The team checked their power retrospectively: the test had only reached 42% power for the 8% lift they were trying to detect, meaning they had a 58% chance of missing a real improvement. They reran the test with a properly sized sample (32 days), and this time the before/after image showed a 7.4% add-to-cart lift at 93% confidence — a genuine win that the first test had been too underpowered to catch.
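A check of this kind can be sketched as follows. This is an approximation under a two-sided two-proportion z-test, not the method the team in the example used. The function name `achieved_power` and all numbers in the usage lines are hypothetical, since the example does not state the baseline rate or the sample sizes involved.

```python
from math import sqrt
from statistics import NormalDist


def achieved_power(baseline, relative_lift, n_per_variant, alpha=0.05):
    """Approximate power to detect a planned relative lift at a given sample size.

    Uses the normal approximation for a two-sided two-proportion z-test and
    ignores the negligible chance of a significant result in the wrong direction.
    """
    p1 = baseline
    p2 = baseline * (1 + relative_lift)

    # Standard error of the difference between the two conversion rates.
    se = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_variant)

    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_effect = abs(p2 - p1) / se
    return NormalDist().cdf(z_effect - z_alpha)


if __name__ == "__main__":
    # Hypothetical inputs: 5% baseline add-to-cart rate, planned 8% relative lift.
    for n in (10_000, 40_000, 100_000):
        print(n, round(achieved_power(0.05, 0.08, n), 2))
```

Note that the lift passed in is the effect the test was planned to detect, not the lift observed in the data. Computing power from the observed effect adds no information beyond the p-value.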
How to Ensure Adequate Statistical Power
- Always calculate required sample size before launching a test, not after.
- Default to 80% power for standard tests; use 90% for high-stakes experiments that are expensive to reverse.
- Increase traffic allocation or extend run time if your daily visitors are too few to reach the required sample size in a reasonable timeframe.
- Widen the MDE if traffic is very limited — be honest about what effect sizes your store can realistically detect.
- Avoid running multiple concurrent tests on the same audience without accounting for traffic fragmentation, which reduces effective power per test.
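The traffic-related points above can be turned into a rough run-time estimate. The sketch below assumes traffic is split evenly across variants and that `traffic_share` is the fraction of daily visitors actually entering this test (less than 1.0 when concurrent tests divide the audience). The function name and parameters are illustrative, not part of any particular testing tool.

```python
from math import ceil


def days_to_run(n_per_variant, daily_visitors, variants=2, traffic_share=1.0):
    """Estimate how many days a test needs to reach its planned sample size.

    n_per_variant  -- required sample size per variant from a power calculation
    daily_visitors -- eligible visitors per day on the tested page
    variants       -- number of arms, including the control
    traffic_share  -- fraction of eligible traffic allocated to this test
    """
    daily_per_variant = daily_visitors * traffic_share / variants
    if daily_per_variant <= 0:
        raise ValueError("daily traffic per variant must be positive")
    return ceil(n_per_variant / daily_per_variant)


if __name__ == "__main__":
    # Hypothetical store: 1,000 eligible visitors per day, 10,000 needed per variant.
    print(days_to_run(10_000, 1_000))                     # full traffic, two variants
    print(days_to_run(10_000, 1_000, traffic_share=0.5))  # half the traffic goes to another test
```

If the estimate comes out impractically long, that is the signal to widen the MDE or increase the traffic allocation before launch rather than stopping the test early.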
Statistical Power in A/B Testing
Power and sample size are two sides of the same calculation: at a fixed power level, detecting a smaller effect requires more data, and at a fixed sample size, a smaller effect means lower power. In practice, teams set 80% power as the floor, enter their baseline rate and MDE into a calculator, and use the output sample size to determine how long the test must run. Reaching the planned sample size, not just a confidence threshold, is the correct stopping condition.