You Shipped a Winner. It Wasn’t.

On Day 4, Variant A showed a 20.6% lift. The team shipped it. Three weeks later, conversion rate was back at baseline. Here’s the framework that would have caught it — and the 40 tests a year it makes possible.

Real problem. Fake data. Same outcome.
Nov 12, Day 4 of the test — 9:14 AM

4 signals. All pointing to ship. All wrong.

The spreadsheet shows a winner. GA4 confirms the lift. Hotjar heatmaps look supportive. The team signs off and ships. No one asks whether the sample is large enough to trust any of it.

The test was called at 847 sessions per variant. A pre-test power calculation at this baseline puts the minimum sample for 80% statistical power at 4,200 per variant. The team declared a winner with 20% of the required sample in hand.
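That 4,200 figure comes from a standard pre-test power calculation. Here is a minimal sketch with statsmodels; the baseline rate and minimum detectable effect are illustrative inputs chosen to land near that number, not the test's documented parameters:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Illustrative inputs -- not the exact parameters behind this test.
baseline = 0.034      # control conversion rate
mde_abs = 0.012       # smallest absolute lift worth detecting (1.2pp)

# Cohen's h for the two proportions, then solve for n per variant
# at alpha = 0.05 and 80% power.
effect = proportion_effectsize(baseline + mde_abs, baseline)
n = NormalIndPower().solve_power(effect_size=effect, alpha=0.05,
                                 power=0.80, alternative='two-sided')
print(f"Required sessions per variant: {n:,.0f}")  # ~4,200
```

Run before traffic starts, that number is the stopping rule: 847 sessions against a 4,200 target is a test still in progress, not a result.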

AB_Test_Tracker.xlsx
Day | Control | Variant A | Diff | Sig?
Day 1 | 4.0% | 5.8% | +1.8% |
Day 2 | 3.8% | 4.9% | +1.1% |
Day 3 | 3.6% | 4.5% | +0.9% | Soon?
Day 4 | 3.4% | 4.1% | +0.7% | WINNER 🎉
Variant A wins · +20.6% uplift · n=847/847
Google Analytics 4
Sessions — Control: 847
Sessions — Variant A: 847
Conversion Rate — Control: 3.4%
Conversion Rate — Variant A: 4.1%
Day 1–4 · No significance column
Hotjar — Heatmap Notes
CTA clicks — Control: Normal
CTA clicks — Variant A: Higher 🔥
Scroll depth: Similar
Rage clicks: None
Qualitative only — used as confirmation
Experiment Log · Confluence

Nov 12 · Shipping decision — Checkout Test

Variant A (1-Page Checkout) showing strong early lift after 4 days. CVR improved 20.6%. Heatmap data supportive. Shipping to 100% of traffic.

Published by @growth_team · Reviewed by 2
Day 4 · 847 sessions per variant · No power analysis documented

Three weeks after shipping, CVR returned to the original baseline. The early 20.6% lift was sampling noise amplified by a small sample — a false positive. The variant may still work, but now you can’t know: the test is contaminated, the holdout is gone, and you shipped blind.

Day 4: when the test was called; the valid minimum runtime was 14 days

847: sessions per variant at decision time (4,200 needed for 80% power)

$140K/yr: estimated revenue at risk from invalid test decisions at this scale

The N9ine layer

Experimentation done right, end to end.

We connect your raw event data to a proper statistical engine — pre-test power analysis, fixed-horizon design, segment-level results, and guardrail monitoring — then deliver decisions you can act on without second-guessing.

Shopify order events
GA4 session data
Variant assignment log
Segment attributes (device, tier)
Guardrail metrics (AOV, returns)
N9ine

Experiment Engine

Pre-test power analysis
Fixed-horizon z-test (p < 0.05; sketched below)
Segment-level breakdown
Guardrail metric monitoring
Automated runtime alerts

Verified Results

Ship

Hold

Needs more data
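Under the hood, a fixed-horizon read-out is nothing exotic: a two-proportion z-test, evaluated once at the pre-registered sample size. A minimal sketch follows; the counts are placeholders, and the three-way rule is a simplified stand-in for how results map to Ship / Hold / Needs more data:

```python
from statsmodels.stats.proportion import proportions_ztest

# Placeholder counts -- in practice these come from the variant
# assignment log joined to order events, read once at the
# pre-registered horizon (no peeking along the way).
variant_conv, variant_n = 310, 8000
control_conv, control_n = 240, 8000

z, p = proportions_ztest([variant_conv, control_conv],
                         [variant_n, control_n],
                         alternative='two-sided')
uplift = (variant_conv / variant_n) / (control_conv / control_n) - 1

# Simplified decision rule mirroring the labels above.
if p < 0.05:
    decision = "Ship" if uplift > 0 else "Hold"
else:
    decision = "Needs more data"

print(f"uplift {uplift:+.1%} · p = {p:.4f} · {decision}")
```

The discipline is in the "once": the horizon is set by the power analysis, and the test is read when it gets there, not whenever the chart looks exciting.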

After N9ine

Every experiment. Properly measured.

The 1-Page Checkout test did win — just not on Day 4. The correct answer arrived on Day 11 with full statistical power. The segment breakdown then told you exactly where to roll it out first.

N9ine Intelligence Platform — A/B Test Results
Live
6 active experiments · Updated 3 min ago

Experiment Registry

Test | Status | Days | Visitors | Control CVR | Variant CVR | Uplift | p-value | Decision
1-Page Checkout | ✓ Concluded | 14 | 16,840 | 3.20% | 3.71% | +15.9% | 0.0031 | Ship A
Free Shipping $50 → $75 | ✓ Concluded | 21 | 24,200 | 12.4% | 11.8% | −4.8% | 0.041 | Hold
Hero CTA Copy | ⚡ Running | 7 | 4,200 | 2.1% | 2.3% | +9.5% | 0.31 | Needs more data
Product Image Carousel | ⚡ Running | 3 | 1,800 | 4.8% | 5.1% | +6.3% | 0.61 | Too early
Email Popup Timing | ⚡ Running | 10 | 8,900 | 18.2% | 20.4% | +12.1% | 0.07 | Trending — not sig.
Checkout Trust Badges | ⚡ Running | 4 | 2,100 | 3.1% | 2.9% | −6.5% | 0.58 | Too early

1-Page Checkout — Statistical Detail

p-value: 0.0031 · two-tailed · α = 0.05
Relative Uplift: +15.9% · +0.51pp absolute
Statistical Power: 94% · designed at 80%
Days to Significance: Day 11 · team called Day 4

Control — 3-Step Checkout
Sample: n = 8,420 · Conversions: 269 · CVR: 3.20% · 95% CI [2.83%, 3.57%]

Variant A — 1-Page Checkout
Sample: n = 8,420 · Conversions: 312 · CVR: 3.71% · 95% CI [3.34%, 4.08%]
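Those per-arm intervals follow directly from the counts above. A quick sketch using a Wald (normal-approximation) interval; other interval methods and rounding choices shift the bounds by a few hundredths of a point, so the output may not match the panel to the last digit:

```python
from statsmodels.stats.proportion import proportion_confint

# Counts from the detail panel above; Wald interval for each arm.
for arm, conv, n in [("Control", 269, 8420), ("Variant A", 312, 8420)]:
    lo, hi = proportion_confint(conv, n, alpha=0.05, method="normal")
    print(f"{arm}: CVR {conv / n:.2%} · 95% CI [{lo:.2%}, {hi:.2%}]")
```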

Cumulative CVR Over Time — The Peeking Problem

Both lines are noisy early and settle to their true values by Day 11. Reading Day 4 as the final answer overstated Variant A's relative lift by roughly five points: +20.6% at Day 4 versus the true +15.9%.

Day 4 read: Variant A at 4.1% vs Control at 3.4% — looks like +20.6%. Sample: 847/variant. This is sampling noise, not signal.

Day 11 read: Variant A at 3.72% vs Control at 3.2% — true lift +15.9%, p = 0.031 at this read (p = 0.0031 by the Day 14 conclusion). Now it's safe to ship.
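You can watch peeking inflate the error rate with a simulation: run an A/A test (no true difference between arms), check significance every day, and count how often a "winner" appears. The traffic and conversion numbers below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# A/A test: both arms share the same true CVR, so any "winner" is a
# false positive. Illustrative numbers: 3.4% CVR, ~212 sessions per
# arm per day, a 14-day horizon, peeked daily at alpha = 0.05.
p_true, daily_n, days, sims = 0.034, 212, 14, 2000
false_positives = 0

for _ in range(sims):
    a = rng.binomial(1, p_true, size=daily_n * days)
    b = rng.binomial(1, p_true, size=daily_n * days)
    for day in range(1, days + 1):
        n = daily_n * day
        ca, cb = a[:n].sum(), b[:n].sum()
        p_pool = (ca + cb) / (2 * n)
        se = np.sqrt(2 * p_pool * (1 - p_pool) / n)
        # Declare a "winner" the first day |z| clears 1.96.
        if se > 0 and abs(ca - cb) / n / se > 1.96:
            false_positives += 1
            break

print(f"False-positive rate with daily peeking: {false_positives / sims:.1%}")
# Typically lands well above the nominal 5% with this many peeks.
```

A fixed-horizon design sidesteps this entirely: the power analysis sets the sample size up front, and the test is read once when it gets there.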

Effect Size — 95% Confidence Intervals

X-axis = absolute uplift in percentage points. A CI that does not cross zero (the dashed line) is statistically significant.

1-Page Checkout: +0.51pp · Concluded · CI excludes zero
Email Popup Timing: +0.38pp · Running · CI barely includes zero
Hero CTA Copy: +0.2pp · Running · Underpowered — wide CI

Reading this chart: A CI entirely to the right of zero (like 1-Page Checkout) means the variant reliably beats control. A CI straddling zero (like Hero CTA) means the test is inconclusive — the true effect could be zero or negative.

Segment Breakdown — Variant A Uplift by Audience

Aggregate results hide the segment story. The overall +15.9% is driven almost entirely by mobile users.

Mobile: +57% (2.1% → 3.3%)
Desktop: +17% (4.1% → 4.8%)
New Visitors: +11% (2.8% → 3.1%)
Returning: +8% (4.8% → 5.2%)

Recommendation: Ship Variant A to the mobile segment immediately (+57% uplift, fully powered). Run a separate desktop-specific variant test — the 1-page layout may need different optimisation for larger screens.
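Once session data is tidy, the segment split itself is a few lines. A sketch assuming one row per session, with hypothetical column names (variant, converted, device) rather than any actual N9ine schema:

```python
import pandas as pd

# Hypothetical session-level export: one row per session with a
# variant label, a 0/1 converted flag, and segment attributes.
# Assumes variant labels "control" and "variant_a".
sessions = pd.read_csv("sessions.csv")  # columns: variant, converted, device

cvr = (sessions
       .groupby(["device", "variant"])["converted"]
       .mean()
       .unstack("variant"))             # rows: device · cols: variants

cvr["uplift"] = cvr["variant_a"] / cvr["control"] - 1
print(cvr.sort_values("uplift", ascending=False))
```

The same groupby extends to any attribute in the assignment log (visitor tier, traffic source), which is how the device and tier splits above come out automatically.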

5 data sources · 6 active experiments · Power analysis run before each test · Guardrail alerts enabled · Segment splits automatic
Before → After | Before N9ine | After N9ine
Winner declared | Day 4 · n=847 | Day 11 · n=8,420
False positive risk | ~80% underpowered | Pre-test power calc
Segment analysis | None | Automatic by device + tier
Guardrail metrics | Not tracked | AOV, return rate, CSAT
Tests per year | ~4 (manual, slow) | 40+ (systematic)

Run experiments you can actually trust.

We build the statistical framework, automate the power analysis, and deliver segment-level results your team can ship from — without second-guessing the numbers.