Statistical Significance in A/B Testing: What It Means and Why It Matters

Statistical significance is the measure that tells you whether your A/B test result reflects a real difference in user behavior — or just random noise. Without it, you're making product decisions based on luck.

This guide explains statistical significance in plain English — no statistics degree required.

The Core Problem: Randomness Looks Like Patterns

Imagine you flip a coin 10 times and get 7 heads. Does that mean the coin is biased? Maybe. But 7 heads in 10 flips can also happen by chance — it's not unusual.
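To put a number on "not unusual," here is a minimal sketch that adds up the binomial probabilities for a fair coin. The function name `prob_at_least` is just an illustrative choice.

```python
from math import comb

def prob_at_least(k: int, n: int, p: float = 0.5) -> float:
    """Probability of k or more successes in n independent trials with success chance p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Chance that a perfectly fair coin gives 7 or more heads in 10 flips
print(round(prob_at_least(7, 10), 3))  # 0.172
```

A fair coin does this about one time in six, so 7 heads alone is weak evidence of bias.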

A/B testing has the same problem. If 52 out of 100 visitors converted with variant B, and 48 out of 100 converted with variant A, does that mean B is better? Or is the 4-visitor difference just noise?

Statistical significance answers this question with a probability.
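As a rough illustration, the 52-versus-48 example can be run through a pooled two-proportion z-test, one common way to get that probability. This is a sketch using only Python's standard library; the function name is hypothetical, and a testing tool may use a different method internally.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value for the gap between two conversion rates (pooled z-test)."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))

p = two_proportion_p_value(48, 100, 52, 100)
print(f"p-value: {p:.2f}")  # about 0.57, nowhere near the usual 0.05 cutoff
```

With only 100 visitors per variant, a 4-visitor gap is entirely consistent with noise.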

What Statistical Significance Actually Measures

When a test reports "95% statistical significance," it means:

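If there were truly no difference between control and variant, a gap at least as large as the one you observed would show up by random chance less than 5% of the time.

Another way to say it: if you ran 100 tests where the variant had no real effect, you'd expect about 5 of them to show a false positive (a difference that isn't real). It does not mean there is a 95% probability that the variant is better.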

95% is the standard threshold used by CustomFit.ai and most professional A/B testing tools. Some teams use 90% for lower-stakes tests or 99% for high-traffic, high-revenue decisions.

The Relationship Between Confidence and Error

Significance Level	False Positive Rate	When to Use
90%	10%	Low-stakes tests, early exploration
95%	5%	Standard for most A/B tests (recommended)
99%	1%	High-stakes decisions (pricing, checkout changes)

The higher your confidence threshold, the more traffic you need and the longer your test runs.
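To see how much extra traffic a stricter threshold costs, the sketch below compares the normal-approximation sample size factor at each confidence level against the 95% baseline. It assumes a two-sided test and 80% power; `relative_traffic` is an illustrative name, not part of any tool.

```python
from statistics import NormalDist

def relative_traffic(confidence: float, power: float = 0.80,
                     baseline_confidence: float = 0.95) -> float:
    """Traffic needed at `confidence`, as a multiple of the traffic at `baseline_confidence`."""
    def factor(conf: float) -> float:
        z_alpha = NormalDist().inv_cdf(1 - (1 - conf) / 2)  # two-sided critical value
        z_beta = NormalDist().inv_cdf(power)
        return (z_alpha + z_beta) ** 2
    return factor(confidence) / factor(baseline_confidence)

for conf in (0.90, 0.95, 0.99):
    print(f"{conf:.0%} confidence: {relative_traffic(conf):.2f}x the traffic of a 95% test")
```

Under these assumptions, moving from 95% to 99% costs roughly half again as much traffic, while dropping to 90% saves about a fifth.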

What Is Statistical Power?

Statistical power is the flip side: the probability of detecting a real effect if one exists. Tests with low power miss real winners.

Standard recommendation: 80% statistical power. This means if your variant truly is better, you have an 80% chance of detecting it.

Low power = high rate of false negatives (you miss real improvements).

The most direct way to increase power: increase sample size.
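The link between sample size and power can be sketched with a normal approximation for a two-sided two-proportion test. The function below is an illustrative estimate, not an exact calculation, and the 2% baseline with a 10% relative lift is just an example input.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(baseline: float, relative_lift: float,
                 n_per_variant: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided two-proportion z-test (normal approximation)."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    std_err = sqrt(p1 * (1 - p1) / n_per_variant + p2 * (1 - p2) / n_per_variant)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Chance the observed z-score clears the critical value when the lift is real.
    # The tiny chance of a significant result in the wrong direction is ignored.
    return NormalDist().cdf(abs(p2 - p1) / std_err - z_crit)

for n in (10_000, 40_000, 80_000):
    print(f"{n:,} visitors per variant: power is about {approx_power(0.02, 0.10, n):.0%}")
```

Power climbs steadily as each variant collects more visitors, which is why underpowered tests so often miss real winners.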

Minimum Detectable Effect (MDE)

MDE is the smallest improvement you want to be able to detect. If your conversion rate is 2% and you want to detect a 10% relative improvement (from 2.0% to 2.2%), that's your MDE.

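Smaller MDE = more traffic needed, and the cost grows quickly: required sample size scales roughly with the inverse square of the MDE. Halving the MDE takes about 4x the traffic, so detecting a 1% relative improvement needs roughly 400x more traffic than detecting a 20% improvement.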
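A small sketch of the arithmetic: converting a relative MDE into an absolute change, and a rule-of-thumb multiplier for how traffic grows as the MDE shrinks. Both helper names are illustrative, and the inverse-square rule is an approximation that ignores small changes in variance.

```python
def absolute_mde(baseline: float, relative_mde: float) -> float:
    """Convert a relative MDE into the absolute change in conversion rate."""
    return baseline * relative_mde

def traffic_multiplier(relative_mde: float, reference_mde: float) -> float:
    """Rule of thumb: required traffic scales with 1 / MDE squared."""
    return (reference_mde / relative_mde) ** 2

print(f"{absolute_mde(0.02, 0.10):.4f}")             # 0.0020, i.e. 2.0% to 2.2%
print(traffic_multiplier(0.05, reference_mde=0.10))  # 4.0
```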

Practical guidance: For most ecommerce A/B tests, set your MDE at 10-15% relative improvement. Testing for smaller effects requires more traffic than most sites can provide in a reasonable timeframe.

Sample Size: How Much Traffic Do You Need?

Use this rough guide (95% confidence, 80% power, 10% relative MDE, two-sided two-proportion z-test):

Current Conversion Rate	Visitors Needed (per variant)
1%	~163,000
2%	~80,700
3%	~53,200
5%	~31,200

These are per variant — multiply by 2 for a standard A/B test with one control and one variant.
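For other baselines or MDEs, a sample size estimate can be sketched with the standard normal-approximation formula for comparing two proportions. This is one common formula among several, so treat the output as an estimate; the function name and defaults are assumptions for illustration.

```python
from math import ceil
from statistics import NormalDist

def visitors_per_variant(baseline: float, relative_mde: float,
                         confidence: float = 0.95, power: float = 0.80) -> int:
    """Visitors needed in each variant for a two-sided two-proportion z-test."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

for rate in (0.01, 0.02, 0.03, 0.05):
    n = visitors_per_variant(rate, 0.10)
    print(f"{rate:.0%} baseline: {n:,} per variant, {2 * n:,} total")
```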

CustomFit.ai estimates this automatically: enter your current conversion rate and traffic volume, and it tells you how many days your test needs to run.
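The runtime arithmetic itself is simple. The sketch below is not CustomFit.ai's calculation; it is a hypothetical helper that assumes all daily traffic enters the test, is split evenly across variants, and that the 14-day minimum from this guide applies.

```python
from math import ceil

def days_to_run(visitors_per_variant: int, daily_visitors: int,
                variants: int = 2, minimum_days: int = 14) -> int:
    """Days needed to fill every variant, never less than the minimum runtime."""
    total_needed = visitors_per_variant * variants
    return max(minimum_days, ceil(total_needed / daily_visitors))

print(days_to_run(visitors_per_variant=20_000, daily_visitors=2_500))  # 16
print(days_to_run(visitors_per_variant=5_000, daily_visitors=2_500))   # 14 (minimum applies)
```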

The "Peeking" Problem

Peeking — checking results before your test has run long enough — is the most common statistical mistake in A/B testing.

Here's why it's dangerous: statistical significance fluctuates during a test, and every extra look is another chance for random noise to cross the 95% line. If you stop the first time the dashboard shows significance, your real false positive rate ends up well above 5%. Stopping early because nothing has shown up yet risks the opposite error, a false negative. Early significance readings are unreliable.

Rules to prevent peeking:

Calculate minimum runtime before you launch
Schedule a calendar reminder for the results review date
Don't log in to check significance until that date
Run for a minimum of 14 days regardless of early results
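To see why these rules matter, here is a small simulation sketch of A/A tests, where both variants are identical and any "winner" is a false positive. It compares stopping at the first significant daily reading with checking once on day 14. The traffic numbers, conversion rate, and seed are arbitrary assumptions chosen to keep the run fast.

```python
import random
from math import sqrt
from statistics import NormalDist

def is_significant(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Pooled two-proportion z-test; True if the two-sided p-value is below alpha."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    if pooled in (0.0, 1.0):
        return False  # no variation yet, nothing to test
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z))) < alpha

def simulate_aa_tests(runs=1000, days=14, daily_visitors=100, rate=0.10, seed=1):
    """Return (false positive rate when peeking daily, rate when checking once at the end)."""
    rng = random.Random(seed)
    peeking_hits = final_hits = 0
    for _ in range(runs):
        conv_a = conv_b = n = 0
        seen_significant = False
        significant_today = False
        for _ in range(days):
            n += daily_visitors  # visitors per variant so far
            conv_a += sum(rng.random() < rate for _ in range(daily_visitors))
            conv_b += sum(rng.random() < rate for _ in range(daily_visitors))
            significant_today = is_significant(conv_a, n, conv_b, n)
            seen_significant = seen_significant or significant_today
        peeking_hits += seen_significant   # would have stopped on some day
        final_hits += significant_today    # significant on the last day only
    return peeking_hits / runs, final_hits / runs

peeking_rate, final_rate = simulate_aa_tests()
print(f"Stop at first significant day: {peeking_rate:.1%} false positives")
print(f"Check once on day 14:          {final_rate:.1%} false positives")
```

The single check should land near the promised 5%, while daily peeking lands well above it, even though nothing about the variants differs.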

Bayesian vs. Frequentist A/B Testing

Most A/B testing tools (including CustomFit.ai) offer two statistical models:

Frequentist (traditional p-value approach):

Reports: "The difference is statistically significant at the 95% level"
Fixed sample size required in advance
Clear pass/fail threshold (p < 0.05)
What most data scientists are trained on

Bayesian (probability-based approach):

Reports: "There's a 97% probability variant B is better than A"
Can update continuously as data arrives
More intuitive for non-statisticians
Better for continuous testing and multi-armed bandit approaches

Both are valid. For teams new to A/B testing, frequentist is simpler to communicate to stakeholders. For teams running many concurrent tests, Bayesian offers more flexibility.
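For a feel of what the Bayesian number represents, here is a minimal Monte Carlo sketch. It assumes independent uniform Beta(1, 1) priors on each conversion rate, which is one simple choice among many; real tools may use different priors and methods, and the function name is hypothetical.

```python
import random

def prob_b_beats_a(conv_a: int, n_a: int, conv_b: int, n_b: int,
                   draws: int = 20_000, seed: int = 1) -> float:
    """Monte Carlo estimate of P(rate_B > rate_A) under independent Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for each rate is Beta(1 + conversions, 1 + non-conversions)
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws

print(f"Probability B beats A: {prob_b_beats_a(48, 100, 52, 100):.0%}")
```

For 52 versus 48 conversions out of 100 each, the estimate comes out around 70%: B is more likely ahead than not, but far from settled.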

When to Trust Your Results

Trust your A/B test results when:

✓ The test ran for at least 14 days (full two weeks)
✓ You reached the minimum sample size calculated before launch
✓ Statistical significance is 95%+
✓ The test ran during a representative period (no major sales, holidays, or traffic anomalies)
✓ The result is consistent across segments (desktop and mobile both show the same direction)
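The checklist above can be sketched as a simple pre-ship check. The `ExperimentSummary` fields and the `failed_checks` helper are hypothetical names for illustration, and the thresholds are the ones stated in this guide.

```python
from dataclasses import dataclass

@dataclass
class ExperimentSummary:
    days_run: int
    visitors_per_variant: int
    required_per_variant: int
    confidence: float            # e.g. 0.96 for 96%
    representative_period: bool  # no major sales, holidays, or traffic anomalies
    lift_by_segment: dict        # e.g. {"desktop": 0.04, "mobile": 0.02}

def failed_checks(t: ExperimentSummary) -> list:
    """Return the names of the checklist items this test does not meet."""
    lifts = list(t.lift_by_segment.values())
    same_direction = all(x > 0 for x in lifts) or all(x < 0 for x in lifts)
    checks = {
        "ran at least 14 days": t.days_run >= 14,
        "reached minimum sample size": t.visitors_per_variant >= t.required_per_variant,
        "significance is 95%+": t.confidence >= 0.95,
        "representative period": t.representative_period,
        "segments agree on direction": same_direction,
    }
    return [name for name, passed in checks.items() if not passed]

summary = ExperimentSummary(10, 5000, 4000, 0.97, True, {"desktop": 0.05, "mobile": -0.01})
print(failed_checks(summary))  # ['ran at least 14 days', 'segments agree on direction']
```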

Be cautious when:

Results only reach significance on day 3 or 4 of a planned 14-day test
The winning variant shows wildly different results on mobile vs. desktop
The test ran during a promotional period that inflated traffic
The sample size is barely above the minimum threshold

Reading Your Results in CustomFit.ai

CustomFit.ai shows:

Confidence level: The statistical significance of the observed difference
Revenue impact: Estimated monthly revenue lift based on current traffic
Conversion lift: Absolute and relative improvement in your primary metric
Sample size per variant: Confirms you've reached minimum requirements
Test duration: Confirms you've met minimum runtime

When all of these indicators are green, you can ship the winner with confidence.

CustomFit.ai handles the statistics so you don't have to. Start your free trial and run your first statistically valid A/B test today.
