

Run your A/B test long enough to collect a statistically valid sample: typically a minimum of 7 days, and until you reach your pre-calculated sample size. Stopping early because one variant looks like it's winning is one of the most common mistakes in A/B testing. How long that takes depends on your traffic volume, baseline conversion rate, and the minimum detectable effect you care about.
Many D2C brands in India make this mistake: they see Variant B performing 20% better after two days and declare victory, only to watch conversions revert after rolling out the change. This guide gives you a framework for knowing when your test is truly done.
Statistical significance is a probability statement, not a certainty. At 95% confidence, you're accepting a 1-in-20 chance of declaring a winner when there is no real difference between the variants. The earlier you stop, the worse this gets.
The peeking problem: Every time you check results and consider stopping, you're running an implicit hypothesis test. If you check daily for two weeks, you've run 14 implicit tests, not one. This dramatically inflates your false-positive rate.
Day-of-week effects: Consumer behavior on weekdays differs from weekends. Indian shoppers buying beauty products on Nykaa or Plum's website behave differently on Saturday evenings versus Tuesday mornings. A test running only 3 days may capture only weekday behavior.
Novelty effects: When you change something on your site, some visitors click it simply because it's new. This inflates early results. Running a test for at least one full business cycle smooths this out.
Seasonality and campaigns: A test launched during a sale or festive season (Diwali, Holi, Raksha Bandhan) captures atypical behavior. Either avoid launching tests during major promotions or explicitly account for this in your analysis.
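The inflation caused by peeking can be estimated with a small simulation. The sketch below is an illustration under stated assumptions, not a model of any particular testing tool: it assumes no real difference between variants, equal traffic every day, and a normal approximation in which each day adds one standard-normal increment to the running difference. The function name `false_positive_rate` and its parameters are hypothetical.

```python
import random
from math import sqrt

Z_CRIT = 1.96  # two-sided cutoff for 95% confidence


def false_positive_rate(looks, trials=20000, seed=0):
    """Share of A/A tests (no real difference) that get called 'significant'
    when results are checked after each of `looks` equal-sized daily batches
    and the test is stopped at the first significant check."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(trials):
        running_sum = 0.0
        for day in range(1, looks + 1):
            # Assumption: one day's standardized difference is N(0, 1).
            running_sum += rng.gauss(0.0, 1.0)
            z = running_sum / sqrt(day)  # cumulative z-score after `day` days
            if abs(z) > Z_CRIT:
                false_positives += 1
                break  # the tester stops and declares a winner
    return false_positives / trials


if __name__ == "__main__":
    print("one look at the end:", false_positive_rate(1))
    print("daily look for 14 days:", false_positive_rate(14))
```

Comparing `false_positive_rate(1)` with `false_positive_rate(14)` shows how much more often a test with no real effect gets called a winner when it can be stopped at any daily check.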

Use this process before launching any test:
Step 1: Establish your baseline conversion rate. Pull your current conversion rate for the specific goal you're testing. If you're testing a product page add-to-cart button, your baseline might be 4.2%. Use at least 30 days of historical data.
Step 2: Define your Minimum Detectable Effect (MDE). This is the smallest improvement worth detecting. If you need at least a 15% relative lift (e.g., 4.2% to 4.83%) to justify the effort, set that as your MDE. Smaller MDEs require larger samples.
Step 3: Set your statistical parameters. Choose your confidence level and statistical power before launch; the example in Step 4 uses 95% confidence and 80% power.
Step 4: Calculate required sample size. Enter your inputs into a sample size calculator, for example a baseline CVR of 4%, an MDE of 15%, 95% confidence, and 80% power.
Step 5: Divide by daily traffic. Divide the required sample size by your daily visitors to get the test length in days. If your page gets 500 visitors/day and that works out to 20 days minimum, round up to the nearest full week, so 3 weeks.
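Steps 1 to 5 can be sketched in a few lines of Python. This is a minimal sketch, assuming the textbook normal-approximation formula for comparing two proportions with a two-sided test; it is not necessarily the formula behind the figures in this guide, and different calculators use different formulas and defaults, so its output may not match the durations quoted here. The function names and parameters are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Visitors needed in EACH variant to detect a relative lift of
    `relative_mde` over `baseline` (two-sided test, normal approximation)."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)  # conversion rate if the lift is real
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # confidence-level term
    z_beta = NormalDist().inv_cdf(power)           # power term
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


def duration_in_days(n_per_variant, daily_visitors, variants=2):
    """Days needed to reach the sample size, rounded up to full weeks,
    with a floor of 7 days."""
    days = ceil(n_per_variant * variants / daily_visitors)
    return max(7, ceil(days / 7) * 7)


if __name__ == "__main__":
    n = sample_size_per_variant(baseline=0.04, relative_mde=0.15)
    print("visitors per variant:", n)
    print("days to run:", duration_in_days(n, daily_visitors=500))
```

Note that `duration_in_days` assumes `daily_visitors` is the traffic entering the test across all variants, counted as unique visitors rather than sessions.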
Quick reference table:
| Daily Visitors | Baseline CVR | MDE | Test Duration |
|---|---|---|---|
| 200 | 3% | 20% | 6-8 weeks |
| 500 | 4% | 15% | 3-4 weeks |
| 1,000 | 5% | 10% | 3 weeks |
| 2,000 | 5% | 10% | 1-2 weeks |
| 5,000+ | 5% | 10% | 7-10 days |
Regardless of what your sample size calculator says, follow these non-negotiable rules:
1. Always run for at least 7 full days. This captures at least one complete weekly cycle. An ayurvedic supplement brand such as Kapiva, for example, may see very different conversion patterns on weekdays (research intent) versus weekends (purchase intent).
2. Run through at least one full business cycle. If your brand runs weekly email campaigns, run the test long enough to include two send cycles. If you do UPI cashback promotions every fortnight, include both cycles.
3. Don't run longer than 4 to 6 weeks. Extended tests get contaminated by seasonality shifts, competitor actions, and user learning effects. If you can't reach significance in 6 weeks, either increase traffic to the test (via paid promotion) or increase your MDE.
4. Pre-commit to your duration before launch. Write it down: "This test runs from March 1 to 21 and will be evaluated on March 22, regardless of interim results." This prevents peeking-induced bias.
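One way to make the pre-commitment concrete is to record the plan as an immutable object and only evaluate once its conditions are met. The sketch below is illustrative: the class name `ExperimentPlan`, its fields, the year, and the per-variant target of 5,000 are all hypothetical placeholders, not values from this guide.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)  # frozen: the plan cannot be edited after launch
class ExperimentPlan:
    start: date
    end: date  # last day the test runs (inclusive)
    target_per_variant: int

    def __post_init__(self):
        # Enforce the 7-full-days minimum at planning time.
        if (self.end - self.start).days + 1 < 7:
            raise ValueError("plan must cover at least 7 full days")

    def ready_to_evaluate(self, today, visitors_per_variant):
        """True only after the planned end date has passed AND every
        variant has reached the pre-calculated sample size."""
        duration_done = today > self.end
        sample_done = all(
            n >= self.target_per_variant for n in visitors_per_variant
        )
        return duration_done and sample_done


if __name__ == "__main__":
    # Hypothetical plan: runs March 1 to 21, evaluated on March 22.
    plan = ExperimentPlan(date(2025, 3, 1), date(2025, 3, 21), 5000)
    print(plan.ready_to_evaluate(date(2025, 3, 10), [6000, 6100]))
    print(plan.ready_to_evaluate(date(2025, 3, 22), [6000, 6100]))
```

Because interim results never enter `ready_to_evaluate`, a variant that looks like it is winning on day 3 cannot trigger an early call.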

Stopping during a sale spike: Brands such as Mamaearth or mCaffeine that run tests during their sale events risk declaring winners based on inflated sale-period CVRs. The winner often fails after the sale ends because it was optimized for a different customer cohort.
Ignoring COD vs prepaid split: Cash on delivery (COD) is a distinctive feature of Indian ecommerce, and COD customers have different purchase patterns and return rates from prepaid customers. If your test shifts the COD/prepaid ratio, your CVR lift may be artificial.
Testing on too-narrow segments: Running a test only on mobile visitors but applying results to all devices is a common error. Always segment your results by device and validate before full rollout.
Confusing sessions with visitors: Some analytics tools report sessions, not unique visitors. A single visitor might have 3 sessions. Use unique visitors for sample size calculations.
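The difference between sessions and visitors is easy to check on raw data. The sketch below assumes session-level rows with `visitor_id` and `variant` fields; both field names are hypothetical and will differ by analytics tool.

```python
def count_unique_visitors(sessions):
    """Count distinct visitors per variant from session-level rows."""
    visitors_by_variant = {}
    for row in sessions:
        # A set keeps each visitor once, however many sessions they had.
        visitors_by_variant.setdefault(row["variant"], set()).add(
            row["visitor_id"]
        )
    return {
        variant: len(ids) for variant, ids in visitors_by_variant.items()
    }


if __name__ == "__main__":
    sessions = [
        {"visitor_id": "u1", "variant": "A"},
        {"visitor_id": "u1", "variant": "A"},  # same visitor, 2nd session
        {"visitor_id": "u1", "variant": "A"},  # same visitor, 3rd session
        {"visitor_id": "u2", "variant": "B"},
    ]
    print(len(sessions), "sessions")
    print(count_unique_visitors(sessions))
```

The per-variant counts from `count_unique_visitors`, not the number of session rows, are what should be compared against the sample size target.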
CustomFit.ai runs on your Shopify store and includes a built-in sample size calculator that tells you how long to run each test before you launch.
This is especially useful for brands like Bellavita, which achieved an 11% CVR improvement; those results came from tests that ran to statistical completion, not from early winners being called.
Use a sample size calculator every single time; don't guess. Tools like Evan Miller's calculator or CustomFit.ai's built-in tool take 2 minutes.
Write your test plan before launch: include start date, end date, sample size target, and what "winning" means in absolute numbers, not just percentages.
Never peek at results and adjust the duration: if you extend a test because the current variant is losing, you've invalidated the test.
Run one full festive cycle minimum for seasonal businesses: for brands selling during Diwali, Holi, or Valentine's Day, test during the season and validate outside it too.
Split traffic 50/50 unless you have strong reasons not to: unequal splits require larger total sample sizes and increase test duration.
Document novelty effects: for major redesigns, watch your data for a novelty spike in the first 3 to 5 days and weight later data more heavily.
Validate on a holdout group: after declaring a winner, roll out to 80% of traffic and keep 20% on control for 1 week to confirm the lift holds.
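The cost of an unequal traffic split during a test can be approximated with a one-line formula. The sketch below assumes both variants have roughly the same variance, in which case the variance of the measured difference is proportional to 1 / (share × (1 − share)); the function name `split_inflation` is hypothetical.

```python
def split_inflation(treatment_share):
    """Factor by which TOTAL sample size grows, relative to a 50/50 split,
    to keep the same statistical power (equal-variance approximation)."""
    if not 0 < treatment_share < 1:
        raise ValueError("treatment_share must be between 0 and 1")
    # A 50/50 split gives share * (1 - share) = 0.25, so the factor is 1.
    return 1 / (4 * treatment_share * (1 - treatment_share))


if __name__ == "__main__":
    for share in (0.5, 0.7, 0.8, 0.9):
        print(f"{share:.0%} to the variant:", round(split_inflation(share), 2))
```

Under this approximation, an 80/20 split needs about 1.56 times the total traffic of a 50/50 split, which lengthens the test by the same factor.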