You Shipped a Winner. It Wasn’t.
On Day 4, Variant A showed a 20.6% lift. The team shipped it. Three weeks later, conversion rate was back at baseline. Here’s the framework that would have caught it — and the 40 tests a year it makes possible.
4 signals. All pointing to ship. All wrong.
The spreadsheet shows a winner. GA4 confirms the lift. Hotjar heatmaps look supportive. The team signs off and ships. No one asks whether the sample was large enough to trust any of it.
The test was called at 847 sessions per variant. The minimum sample for 80% statistical power at an 8% minimum detectable effect is 4,200 per variant, so the test had only 20% of the required sample when the team declared a winner.
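That required-sample figure is the output of a pre-test power calculation. A minimal sketch using the standard normal-approximation formula for two proportions (the function name and the inputs below are illustrative, not the team's actual tooling or baseline):

```python
from math import ceil
from statistics import NormalDist

def required_n_per_variant(p_base, mde_rel, alpha=0.05, power=0.80):
    """Per-variant sample size for a two-proportion z-test
    (normal approximation). mde_rel is the *relative* minimum
    detectable effect, e.g. 0.08 for an 8% lift."""
    p_var = p_base * (1 + mde_rel)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-tailed critical value
    z_power = NormalDist().inv_cdf(power)
    variance_sum = p_base * (1 - p_base) + p_var * (1 - p_var)
    return ceil((z_alpha + z_power) ** 2 * variance_sum
                / (p_base - p_var) ** 2)

# Illustrative inputs: 10% baseline CVR, 20% relative MDE.
n = required_n_per_variant(0.10, 0.20)
```

Halving the MDE roughly quadruples the required traffic, which is why "call it early" is so tempting and so dangerous.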
| Day | Control | Variant A | Diff | Sig? |
|---|---|---|---|---|
| Day 1 | 4.0% | 5.8% | +1.8% | — |
| Day 2 | 3.8% | 4.9% | +1.1% | — |
| Day 3 | 3.6% | 4.5% | +0.9% | Soon? |
| Day 4 | 3.4% | 4.1% | +0.7% | WINNER 🎉 |
> **Nov 12 · Shipping decision — Checkout Test**
>
> Variant A (1-Page Checkout) showing strong early lift after 4 days. CVR improved 20.6%. Heatmap data supportive. Shipping to 100% of traffic.
Three weeks after shipping, CVR returned to baseline. The early 20.6% lift was sampling noise from an undersized test, a false positive. The variant may still work, but now you can't know: the test is contaminated, the holdout is gone, and you shipped blind.
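Why Day 4 lied is easy to demonstrate. The sketch below (a self-contained Monte Carlo, not the team's data) simulates A/A tests in which there is no true effect, then compares the false-positive rate of declaring a winner at any of five interim looks against a single fixed-horizon check:

```python
import math
import random

def peeking_false_positive_rate(n_sims=2000, looks=5, z_crit=1.96, seed=0):
    """Monte Carlo A/A test (no true effect). 'Peeking' declares a
    winner if the running z-statistic ever crosses z_crit at any of
    the interim looks; 'fixed' checks only once, at the planned end."""
    rng = random.Random(seed)
    peek_hits = fixed_hits = 0
    for _ in range(n_sims):
        s = 0.0
        crossed = False
        for k in range(1, looks + 1):
            s += rng.gauss(0.0, 1.0)          # one block of null data
            if abs(s / math.sqrt(k)) > z_crit:
                crossed = True                # an interim look "wins"
        peek_hits += crossed
        fixed_hits += abs(s / math.sqrt(looks)) > z_crit
    return peek_hits / n_sims, fixed_hits / n_sims

peek_rate, fixed_rate = peeking_false_positive_rate()
```

With five equally spaced looks at a nominal α = 0.05, the any-look false-positive rate comes out around 14% in this simulation, nearly triple the 5% the team thought they were running.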
- **Day 4**: when the test was called; the valid minimum runtime was 14 days
- **847**: sessions per variant at decision time (4,200 needed for 80% power)
- **$140K/yr**: estimated annual revenue at risk from invalid test decisions at this scale
Experimentation done right, end to end.
We connect your raw event data to a proper statistical engine — pre-test power analysis, fixed-horizon design, segment-level results, and guardrail monitoring — then deliver decisions you can act on without second-guessing.
Experiment Engine → Verified Results → decision: Ship, Hold, or Needs more data
Every experiment. Properly measured.
The 1-Page Checkout test did win — just not on Day 4. The correct answer arrived on Day 11 with full statistical power. The segment breakdown then told you exactly where to roll it out first.
Experiment Registry
| Test | Status | Days | Visitors | Control CVR | Variant CVR | Uplift | p-value | Decision |
|---|---|---|---|---|---|---|---|---|
| 1-Page Checkout | ✓ Concluded | 14 | 16,840 | 3.20% | 3.71% | +15.9% | 0.0031 | Ship A |
| Free Shipping $50 → $75 | ✓ Concluded | 21 | 24,200 | 12.4% | 11.8% | −4.8% | 0.041 | Hold |
| Hero CTA Copy | ⚡ Running | 7 | 4,200 | 2.1% | 2.3% | +9.5% | 0.31 | Needs more data |
| Product Image Carousel | ⚡ Running | 3 | 1,800 | 4.8% | 5.1% | +6.3% | 0.61 | Too early |
| Email Popup Timing | ⚡ Running | 10 | 8,900 | 18.2% | 20.4% | +12.1% | 0.07 | Trending — not sig. |
| Checkout Trust Badges | ⚡ Running | 4 | 2,100 | 3.1% | 2.9% | −6.5% | 0.58 | Too early |
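A registry like this implies a decision rule. A hedged sketch of one plausible rule, checking the runtime and sample floors before looking at significance (the thresholds and label logic here are illustrative assumptions, not the actual engine):

```python
def decision(days, n_per_variant, p_value, uplift,
             min_days=14, min_n=4200, alpha=0.05):
    """Map a test's current state to a registry label.
    All thresholds are illustrative assumptions."""
    # Fixed-horizon discipline: no verdict before the planned
    # runtime and sample size are both reached.
    if days < min_days or n_per_variant < min_n:
        return "Too early"
    if p_value < alpha:
        return "Ship" if uplift > 0 else "Hold"
    return "Needs more data"
```

Note the ordering: checking significance before the runtime and sample gates is exactly the peeking error that sank the Day-4 call.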
1-Page Checkout — Statistical Detail
| Metric | Value | Note |
|---|---|---|
| p-value | 0.0031 | Two-tailed · α = 0.05 |
| Relative uplift | +15.9% | +0.51pp absolute |
| Statistical power | 94% | Designed at 80% |
| Days to significance | Day 11 | Team called Day 4 |
Cumulative CVR Over Time — The Peeking Problem
[Chart: cumulative CVR by day, Control (3-Step Checkout) vs Variant A (1-Page Checkout).] Both lines are noisy early and settle to their true values by Day 11. Reading Day 4 as the final answer overstated Variant A's relative lift by roughly 5 points (20.6% vs the true 15.9%).
Day 4 read: Variant A at 4.1% vs Control at 3.4% — looks like +20.6%. Sample: 847/variant. This is sampling noise, not signal.
Day 11 read: Variant A at 3.71% vs Control at 3.20% — true lift +15.9%, p = 0.0031. Now it's safe to ship.
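The significance call itself is a two-tailed, pooled two-proportion z-test. A self-contained sketch (the conversion counts below are illustrative, not the article's exact data):

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Two-tailed pooled z-test for a difference in conversion rates.
    Returns (z, p_value); positive z means the variant is ahead."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)       # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # two-tailed
    return z, p_value

# Illustrative: 100/1000 control conversions vs 150/1000 variant.
z, p = two_proportion_ztest(100, 1000, 150, 1000)
```

Run once at the planned horizon, this p-value means what it claims; run daily, it inherits the peeking inflation shown above.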
Effect Size — 95% Confidence Intervals
X-axis = absolute uplift in percentage points. A CI that does not cross zero (the dashed line) is statistically significant.
- **1-Page Checkout**: Concluded · CI excludes zero
- **Email Popup Timing**: Running · CI barely includes zero
- **Hero CTA Copy**: Running · Underpowered, wide CI
Reading this chart: A CI entirely to the right of zero (like 1-Page Checkout) means the variant reliably beats control. A CI straddling zero (like Hero CTA) means the test is inconclusive — the true effect could be zero or negative.
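The intervals in this chart can be computed with a Wald confidence interval on the difference of two proportions. A minimal sketch (counts are illustrative, not the registry's data):

```python
from math import sqrt
from statistics import NormalDist

def uplift_ci(conv_a, n_a, conv_b, n_b, conf=0.95):
    """Wald confidence interval for the absolute uplift
    (variant minus control), in proportion units."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(0.5 + conf / 2)   # 1.96 for 95%
    diff = p_b - p_a
    return diff - z * se, diff + z * se

# Illustrative: 100/1000 control vs 150/1000 variant conversions.
lo, hi = uplift_ci(100, 1000, 150, 1000)
excludes_zero = lo > 0 or hi < 0   # significant at the chosen level
```

Here the whole interval sits above zero, the "reliably beats control" case; if `lo` were negative and `hi` positive, the test would be inconclusive.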
Segment Breakdown — Variant A Uplift by Audience
Aggregate results hide the segment story. The overall +15.9% is driven almost entirely by mobile users.
| Segment | Uplift | Control CVR → Variant CVR |
|---|---|---|
| Mobile | +57% | 2.1% → 3.3% |
| Desktop | +17% | 4.1% → 4.8% |
| New Visitors | +11% | 2.8% → 3.1% |
| Returning | +8% | 4.8% → 5.2% |
Recommendation: Ship Variant A to mobile segment immediately (+57% uplift, fully powered). Run a separate desktop-specific variant test — the 1-page layout may need different optimisation for larger screens.
Run experiments you can actually trust.
We build the statistical framework, automate the power analysis, and deliver segment-level results your team can ship from — without second-guessing the numbers.