
From the conversion glossary
Concepts referenced in this article, defined.

Concepts referenced in this article, defined.
Run rigorous A/B tests and personalize every visit on Shopify or any storefront โ no engineers required.
Sequential testing is a statistical approach that allows you to monitor A/B test results continuously and make valid decisions before reaching a fixed sample size โ without inflating false positive rates. Unlike standard fixed-horizon testing (which requires you to pre-specify a sample size and not peek at results), sequential testing adjusts significance thresholds as data accumulates, enabling earlier decisions on clear winners or losers. It's particularly valuable for time-sensitive ecommerce tests and low-traffic stores where reaching fixed sample sizes takes too long.
The most common A/B testing mistake is checking results before the test reaches its target sample size โ and stopping the test when you see a significant result. This is called "peeking" and it dramatically inflates false positive rates.
Here's why: if you check a test every day for 30 days and stop whenever you see p < 0.05, your actual false positive rate is much higher than 5%. You're essentially running 30 hypothesis tests (one per day) but only reporting the favorable one.
Studies show that peeking and stopping early can inflate false positive rates to 25โ30% โ meaning one in three "winning" variants is actually no better than the control.
The conventional solution: Don't peek. Pre-specify your sample size, collect all the data, then analyze once. This is statistically rigorous but often impractical:
Sequential testing provides a statistically valid alternative.
Sequential testing uses methods that adjust the significance threshold over time to account for multiple looks at the data. The key approaches:
The original sequential testing method developed by Abraham Wald in 1945. At each observation (or batch of observations), it calculates a likelihood ratio that determines whether to:
SPRT is theoretically elegant but sensitive to distributional assumptions and pre-specified effect sizes.
A modern enhancement of SPRT that mixes over possible effect sizes rather than requiring a single pre-specified effect size. mSPRT is used by companies like Airbnb, Booking.com, and Netflix for continuous monitoring of their A/B tests.
The math: rather than testing against one specific alternative (e.g., "the variant is exactly 10% better"), mSPRT tests against a weighted mixture of possible effect sizes โ making it more robust to uncertainty about the expected effect.
Bayesian approaches don't use p-values at all. Instead, they calculate the probability that variant B is better than variant A โ expressed as "there's a 83% chance B is better." This probability updates continuously as data accumulates.
Bayesian sequential testing has no fixed-horizon requirement by design. You can check results at any time and get a meaningful, interpretable probability statement โ though you should still define decision thresholds (e.g., "stop when we're 95% confident B is better, or 95% confident it's not").
CustomFit.ai's statistical engine uses Bayesian methods that allow continuous monitoring without peeking inflation โ well-suited to the smaller sample sizes of typical D2C ecommerce stores.
| Factor | Fixed-Horizon | Sequential Testing |
|---|---|---|
| Sample size | Pre-specified | Flexible, test ends when conclusion reached |
| Can peek at results? | No (inflates false positives) | Yes (by design) |
| Stops early? | Never | Yes, when a conclusion is reached |
| Statistical guarantee | Type I error rate = ฮฑ | Type I error rate โค ฮฑ across all peeks |
| Complexity | Simpler | More complex (requires correct method) |
| Best for | Planned experiments with fixed timelines | Ongoing experimentation, time-sensitive tests |
Festive season campaigns: During Diwali, Republic Day, or Holi campaigns, you may only have 7โ10 days to run a test before the campaign ends. Sequential testing allows you to identify a clear winner faster โ or stop a clearly losing variant before it costs you more festive traffic.
Product launches: A new product page launching on a specific date needs to be optimized quickly. Sequential testing's early-stopping capability lets you identify the better variant within the launch window.
Budget-constrained traffic: If you're spending โน50,000/day on paid traffic to test a new landing page, running a clear loser for 4 more weeks at $5 significance costs โน14 lakh in wasted spend. Sequential testing stops the loser faster.
Low-traffic stores: Sequential methods that allow earlier decisions (when the probability threshold is met rather than a fixed sample size) are particularly useful for stores where reaching a pre-specified fixed sample takes months.
Most major testing platforms are moving toward always-valid inference:
Statsig: Uses sequential testing methods by default โ you can check results at any time without inflating false positives.
Optimizely: Stats Accelerator uses adaptive sampling and early stopping with false positive controls.
VWO: Bayesian engine allows continuous monitoring without fixed-horizon requirements.
CustomFit.ai: Uses Bayesian methods that provide always-valid probability statements.
For teams running analysis in Python or R, libraries like always_valid (Python) and the OptimalDesign package (R) implement mSPRT and related methods.
Using sequential testing as an excuse to peek without controls. Sequential testing is not permission to peek at standard (fixed-horizon) A/B test results. If your testing tool uses fixed-horizon statistics, peeking still inflates false positives regardless of how you frame it. Sequential testing requires a correctly implemented sequential method โ not just checking results more often.
Confusing "95% probability B is better" with 95% confidence. Bayesian probability statements are not the same as frequentist confidence levels. A "90% probability B is better" from a Bayesian sequential test doesn't mean the result would be significant at the 10% level in a frequentist test. Interpret each approach on its own terms.
Stopping too early without a decision threshold. Sequential testing still requires pre-specified decision thresholds โ "we'll call a winner when we're 95% confident" or "we'll stop at 10,000 visitors if no clear winner emerges." Without thresholds, teams stop tests based on impatience rather than statistical logic.
Use sequential testing for operational flexibility, not as a shortcut. The value of sequential testing is that it allows valid decisions when you genuinely need to make them early โ not that it makes tests faster by default. Many sequential tests will still run to full sample.
Pair sequential testing with minimum test duration rules. Even with sequential methods, respect a minimum run time of 7โ14 days to capture weekly behavioral variation. A test that reaches significance in 3 days may be capturing a novelty effect or weekend traffic anomaly, not a real behavioral difference.
Understand the guarantee your method provides. Different sequential testing methods make different guarantees. mSPRT guarantees false positive rate control across all sample sizes. Bayesian methods provide probability statements that require interpretation. Know what your tool is calculating before you act on it.
Document your sequential testing decision thresholds before the test starts. Pre-register your decision rules: "We'll declare a winner at 95% posterior probability B is better. We'll stop the test at 30,000 visitors if no conclusion is reached." This prevents motivated reasoning from driving early termination.