The multiple testing problem (also called the multiple comparisons problem) is the statistical phenomenon where running many hypothesis tests simultaneously inflates the probability of generating at least one false positive result by random chance. When each individual test has a 5% false positive rate and you run 20 tests, you'd expect roughly one "significant" result even if none of the tested changes have any real effect. The multiple testing problem is one of the most common sources of misleading A/B test results in ecommerce experimentation.
Family-Wise Error Rate (FWER) = 1 − (1 − α)ⁿ
Where α = per-test false positive rate and n = number of independent tests.
At α = 0.05:
- 1 test: FWER = 5%
- 5 tests: FWER = 1 − (0.95)⁵ = 22.6%
- 10 tests: FWER = 1 − (0.95)¹⁰ = 40.1%
- 20 tests: FWER = 1 − (0.95)²⁰ = 64.2%
Running 20 independent tests at α = 0.05 (95% confidence) without correction gives you a 64% chance of at least one false positive result, even when none of the changes has a real effect.
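The figures above can be reproduced with a few lines of Python. This is a minimal sketch that assumes the tests are independent. The function name `family_wise_error_rate` is illustrative and does not come from any particular library.

```python
def family_wise_error_rate(alpha: float, n_tests: int) -> float:
    """Chance of at least one false positive across n independent tests."""
    # Probability that every test avoids a false positive is (1 - alpha) ** n,
    # so the complement is the chance that at least one test misfires.
    return 1.0 - (1.0 - alpha) ** n_tests


if __name__ == "__main__":
    for n in (1, 5, 10, 20):
        print(f"{n:>2} tests: FWER = {family_wise_error_rate(0.05, n):.1%}")
```

Changing `alpha` or the list of test counts shows how quickly the error rate grows for your own testing cadence.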
Why the Multiple Testing Problem Matters for Ecommerce
The multiple testing problem silently corrupts optimization programs. It appears in three common forms:
- Many variants: testing 5 variants simultaneously against the control creates 5 comparisons.
- Many metrics: tracking 10 secondary metrics per experiment and reporting any that look significant.
- Many segments: slicing results by device, source, geography, and customer type to find "where it worked."
D2C brands that run multivariate tests without corrections, or that mine experiment data for positive signals across dozens of segments, often build a false picture of their optimization program — a library of "winning" tests that don't hold up in production because the wins were statistical artifacts.
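A small simulation can make these statistical artifacts concrete. The sketch below assumes that no change has any real effect, in which case each test's p-value is roughly uniform between 0 and 1, and that the comparisons are independent. The function name, trial count, and seed are illustrative choices, not part of any testing tool.

```python
import random


def simulate_any_false_positive(n_tests: int = 20, alpha: float = 0.05,
                                n_trials: int = 20_000, seed: int = 0) -> float:
    """Estimate how often at least one of n null comparisons looks significant."""
    rng = random.Random(seed)  # fixed seed so the estimate is repeatable
    hits = 0
    for _ in range(n_trials):
        # Under the null, a p-value behaves like a uniform draw on [0, 1).
        if any(rng.random() < alpha for _ in range(n_tests)):
            hits += 1
    return hits / n_trials


if __name__ == "__main__":
    share = simulate_any_false_positive()
    print(f"Experiments with at least one spurious 'win': {share:.1%}")
```

With 20 comparisons per experiment, the estimated share should land close to the 64% given by the FWER formula, even though nothing in the simulation has a real effect.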
Real-World Example
A Shopify fashion brand ran a multivariate test with 6 variants testing different combinations of product image style and CTA copy. That meant 6 comparisons against the control. Without correction, they'd expect at least one false positive at α = 0.05 roughly 26% of the time (1 − 0.95⁶ ≈ 26.5%), even if no variant had a real effect. Two variants came in at p = 0.04, just below the unadjusted threshold. Rather than declaring two winners, their analyst applied Bonferroni correction (adjusted α = 0.05/6 ≈ 0.0083) and found that neither passed the adjusted threshold. They correctly called the test inconclusive and redesigned it as a focused two-variant A/B test on the single most promising element.
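The Bonferroni check in this example can be sketched in Python as follows. The helper names `bonferroni_alpha` and `passes_bonferroni` are hypothetical. Only the numbers stated in the example (α = 0.05, 6 comparisons, p = 0.04) are used.

```python
def bonferroni_alpha(alpha: float, n_comparisons: int) -> float:
    """Per-comparison threshold after Bonferroni correction."""
    return alpha / n_comparisons


def passes_bonferroni(p_value: float, alpha: float, n_comparisons: int) -> bool:
    """True if a p-value clears the Bonferroni-adjusted threshold."""
    return p_value <= bonferroni_alpha(alpha, n_comparisons)


if __name__ == "__main__":
    adjusted = bonferroni_alpha(0.05, 6)
    print(f"Adjusted alpha: {adjusted:.4f}")
    # The two variants from the example both reported p = 0.04.
    print("p = 0.04 significant after correction?",
          passes_bonferroni(0.04, 0.05, 6))
```

A p-value of 0.04 clears the unadjusted 0.05 threshold but not the adjusted one, which is why the test was called inconclusive.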
How to Manage the Multiple Testing Problem
- Pre-designate a single primary metric for each experiment — only apply your significance threshold to that metric.
- Apply Bonferroni correction or Holm's procedure when comparing multiple variants against a control.
- Treat secondary metric results as exploratory, not conclusive — they generate hypotheses for future tests, not shipping decisions.
- Segment analysis is post-hoc: findings from slicing by segment require a dedicated confirmatory test before acting on them.
- Track your false discovery rate across your overall test program, especially if you run many tests per month.
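Holm's procedure, mentioned above as an alternative to Bonferroni, is a step-down method: the smallest p-value is compared against α/m, the next smallest against α/(m − 1), and so on, stopping at the first one that fails. The following is a minimal sketch. The function name `holm_reject` and the p-values in the usage example are made up for illustration and are not data from any real experiment.

```python
from typing import List


def holm_reject(p_values: List[float], alpha: float = 0.05) -> List[bool]:
    """Return, for each p-value, whether Holm's step-down procedure rejects it."""
    m = len(p_values)
    # Visit hypotheses from the smallest p-value to the largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The threshold loosens at each step: alpha/m, alpha/(m-1), ..., alpha/1.
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one comparison fails, all larger p-values fail too
    return reject


if __name__ == "__main__":
    # Hypothetical p-values for six variants compared against a control.
    example = [0.001, 0.009, 0.04, 0.04, 0.20, 0.50]
    print(holm_reject(example))
```

In this illustrative input, the second p-value (0.009) would miss the Bonferroni threshold of 0.05/6 ≈ 0.0083 but passes Holm's second-step threshold of 0.05/5 = 0.01. Holm's procedure never rejects fewer hypotheses than Bonferroni while still controlling the family-wise error rate.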
Multiple Testing Problem in A/B Testing
The multiple testing problem is especially acute in multivariate testing and in teams that analyze many metrics per test. The solution is not to stop analyzing — it's to be honest about the distinction between confirmatory analysis (pre-planned primary metric) and exploratory analysis (everything else). Findings from exploratory analysis are inputs to the next experiment, not outputs to be shipped.