You Shipped a Winner. It Wasn’t.

On Day 4, Variant A showed a 20.6% lift. The team shipped it. Three weeks later, conversion rate was back at baseline. Here’s the framework that would have caught it — and the 40 tests a year it makes possible.

Real problem. Fake data. Same outcome.
Nov 12, Day 4 of the test — 9:14 AM

4 signals. All pointing to ship. All wrong.

The spreadsheet shows a winner. GA4 confirms the lift. Hotjar heatmaps look supportive. The team signs off and ships. No one asks whether the sample is large enough to trust any of it.

The test was called at 847 sessions per variant. A pre-test power calculation at this baseline puts the minimum sample for 80% statistical power at 4,200 per variant. The team declared a winner with 20% of the required sample in hand.
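That 4,200 figure comes from a standard pre-test power calculation. Here is a minimal sketch with statsmodels; the baseline rate and minimum detectable effect are illustrative inputs chosen to land near that number, not the test's documented parameters:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Illustrative inputs -- not the exact parameters behind this test.
baseline = 0.034      # control conversion rate
mde_abs = 0.012       # smallest absolute lift worth detecting (1.2pp)

# Cohen's h for the two proportions, then solve for n per variant
# at alpha = 0.05 and 80% power.
effect = proportion_effectsize(baseline + mde_abs, baseline)
n = NormalIndPower().solve_power(effect_size=effect, alpha=0.05,
                                 power=0.80, alternative='two-sided')
print(f"Required sessions per variant: {n:,.0f}")  # ~4,200
```

Run before traffic starts, that number is the stopping rule: 847 sessions against a 4,200 target is a test still in progress, not a result.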

AB_Test_Tracker.xlsx
Day | Control | Variant A | Diff | Sig?
Day 1 | 4.0% | 5.8% | +1.8% |
Day 2 | 3.8% | 4.9% | +1.1% |
Day 3 | 3.6% | 4.5% | +0.9% | Soon?
Day 4 | 3.4% | 4.1% | +0.7% | WINNER 🎉
Variant A wins · +20.6% uplift · n=847/847
Google Analytics 4
Sessions — Control: 847
Sessions — Variant A: 847
Conversion Rate — Control: 3.4%
Conversion Rate — Variant A: 4.1%
Day 1–4 · No significance column
Hotjar — Heatmap Notes
CTA clicks — Control: Normal
CTA clicks — Variant A: Higher 🔥
Scroll depth: Similar
Rage clicks: None
Qualitative only — used as confirmation
Experiment Log · Confluence

Nov 12 · Shipping decision — Checkout Test

Variant A (1-Page Checkout) showing strong early lift after 4 days. CVR improved 20.6%. Heatmap data supportive. Shipping to 100% of traffic.

Published by @growth_team · Reviewed by 2
Day 4 · 847 sessions per variant · No power analysis documented

Three weeks after shipping, CVR returned to the original baseline. The early 20.6% lift was sampling noise amplified by a small sample — a false positive. The variant may still work, but now you can’t know: the test is contaminated, the holdout is gone, and you shipped blind.

Day 4: when the test was called; the valid minimum runtime was 14 days

847: sessions per variant at decision time (4,200 needed for 80% power)

$140K/yr: estimated revenue at risk from invalid test decisions at this scale

The N9ine layer

Experimentation done right, end to end.

We connect your raw event data to a proper statistical engine — pre-test power analysis, fixed-horizon design, segment-level results, and guardrail monitoring — then deliver decisions you can act on without second-guessing.

Shopify order events
GA4 session data
Variant assignment log
Segment attributes (device, tier)
Guardrail metrics (AOV, returns)
N9ine

Experiment Engine

Pre-test power analysis
Fixed-horizon z-test (p < 0.05; sketched below)
Segment-level breakdown
Guardrail metric monitoring
Automated runtime alerts

Verified Results

Ship

Hold

Needs more data
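Under the hood, a fixed-horizon read-out is nothing exotic: a two-proportion z-test, evaluated once at the pre-registered sample size. A minimal sketch follows; the counts are placeholders, and the three-way rule is a simplified stand-in for how results map to Ship / Hold / Needs more data:

```python
from statsmodels.stats.proportion import proportions_ztest

# Placeholder counts -- in practice these come from the variant
# assignment log joined to order events, read once at the
# pre-registered horizon (no peeking along the way).
variant_conv, variant_n = 310, 8000
control_conv, control_n = 240, 8000

z, p = proportions_ztest([variant_conv, control_conv],
                         [variant_n, control_n],
                         alternative='two-sided')
uplift = (variant_conv / variant_n) / (control_conv / control_n) - 1

# Simplified decision rule mirroring the labels above.
if p < 0.05:
    decision = "Ship" if uplift > 0 else "Hold"
else:
    decision = "Needs more data"

print(f"uplift {uplift:+.1%} · p = {p:.4f} · {decision}")
```

The discipline is in the "once": the horizon is set by the power analysis, and the test is read when it gets there, not whenever the chart looks exciting.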

After N9ine

Every experiment. Properly measured.

The 1-Page Checkout test did win — just not on Day 4. The correct answer arrived on Day 11 with full statistical power. The segment breakdown then told you exactly where to roll it out first.

N9ine Intelligence Platform — A/B Test Results
Live
6 active experiments · Updated 3 min ago

Experiment Registry

Test | Status | Days | Visitors | Control CVR | Variant CVR | Uplift | p-value | Decision
1-Page Checkout | ✓ Concluded | 14 | 16,840 | 3.20% | 3.71% | +15.9% | 0.0031 | Ship A
Free Shipping $50 → $75 | ✓ Concluded | 21 | 24,200 | 12.4% | 11.8% | −4.8% | 0.041 | Hold
Hero CTA Copy | ⚡ Running | 7 | 4,200 | 2.1% | 2.3% | +9.5% | 0.31 | Needs more data
Product Image Carousel | ⚡ Running | 3 | 1,800 | 4.8% | 5.1% | +6.3% | 0.61 | Too early
Email Popup Timing | ⚡ Running | 10 | 8,900 | 18.2% | 20.4% | +12.1% | 0.07 | Trending — not sig.
Checkout Trust Badges | ⚡ Running | 4 | 2,100 | 3.1% | 2.9% | −6.5% | 0.58 | Too early

1-Page Checkout — Statistical Detail

p-value: 0.0031 · two-tailed · α = 0.05
Relative Uplift: +15.9% · +0.51pp absolute
Statistical Power: 94% · designed at 80%
Days to Significance: Day 11 · team called Day 4

Control — 3-Step Checkout
Sample: n = 8,420 · Conversions: 269 · CVR: 3.20% · 95% CI [2.83%, 3.57%]

Variant A — 1-Page Checkout
Sample: n = 8,420 · Conversions: 312 · CVR: 3.71% · 95% CI [3.34%, 4.08%]
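Those per-arm intervals follow directly from the counts above. A quick sketch using a Wald (normal-approximation) interval; other interval methods and rounding choices shift the bounds by a few hundredths of a point, so the output may not match the panel to the last digit:

```python
from statsmodels.stats.proportion import proportion_confint

# Counts from the detail panel above; Wald interval for each arm.
for arm, conv, n in [("Control", 269, 8420), ("Variant A", 312, 8420)]:
    lo, hi = proportion_confint(conv, n, alpha=0.05, method="normal")
    print(f"{arm}: CVR {conv / n:.2%} · 95% CI [{lo:.2%}, {hi:.2%}]")
```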

Cumulative CVR Over Time — The Peeking Problem

Both lines are noisy early and settle to their true values by Day 11. Reading Day 4 as the final answer overstated Variant A's relative lift by roughly five points: +20.6% at Day 4 versus the true +15.9%.

Day 4 read: Variant A at 4.1% vs Control at 3.4% — looks like +20.6%. Sample: 847/variant. This is sampling noise, not signal.

Day 11 read: Variant A at 3.72% vs Control at 3.2% — true lift +15.9%, p = 0.031 at this read (p = 0.0031 by the Day 14 conclusion). Now it's safe to ship.
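You can watch peeking inflate the error rate with a simulation: run an A/A test (no true difference between arms), check significance every day, and count how often a "winner" appears. The traffic and conversion numbers below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# A/A test: both arms share the same true CVR, so any "winner" is a
# false positive. Illustrative numbers: 3.4% CVR, ~212 sessions per
# arm per day, a 14-day horizon, peeked daily at alpha = 0.05.
p_true, daily_n, days, sims = 0.034, 212, 14, 2000
false_positives = 0

for _ in range(sims):
    a = rng.binomial(1, p_true, size=daily_n * days)
    b = rng.binomial(1, p_true, size=daily_n * days)
    for day in range(1, days + 1):
        n = daily_n * day
        ca, cb = a[:n].sum(), b[:n].sum()
        p_pool = (ca + cb) / (2 * n)
        se = np.sqrt(2 * p_pool * (1 - p_pool) / n)
        # Declare a "winner" the first day |z| clears 1.96.
        if se > 0 and abs(ca - cb) / n / se > 1.96:
            false_positives += 1
            break

print(f"False-positive rate with daily peeking: {false_positives / sims:.1%}")
# Typically lands well above the nominal 5% with this many peeks.
```

A fixed-horizon design sidesteps this entirely: the power analysis sets the sample size up front, and the test is read once when it gets there.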

Effect Size — 95% Confidence Intervals

X-axis = absolute uplift in percentage points. A CI that does not cross zero (the dashed line) is statistically significant.

1-Page Checkout: +0.51pp · Concluded · CI excludes zero
Email Popup Timing: +0.38pp · Running · CI barely includes zero
Hero CTA Copy: +0.2pp · Running · Underpowered — wide CI

Reading this chart: A CI entirely to the right of zero (like 1-Page Checkout) means the variant reliably beats control. A CI straddling zero (like Hero CTA) means the test is inconclusive — the true effect could be zero or negative.

Segment Breakdown — Variant A Uplift by Audience

Aggregate results hide the segment story. The overall +15.9% is driven almost entirely by mobile users.

Mobile: +57% (2.1% → 3.3%)
Desktop: +17% (4.1% → 4.8%)
New Visitors: +11% (2.8% → 3.1%)
Returning: +8% (4.8% → 5.2%)

Recommendation: Ship Variant A to the mobile segment immediately (+57% uplift, fully powered). Run a separate desktop-specific variant test — the 1-page layout may need different optimisation for larger screens.
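Once session data is tidy, the segment split itself is a few lines. A sketch assuming one row per session, with hypothetical column names (variant, converted, device) rather than any actual N9ine schema:

```python
import pandas as pd

# Hypothetical session-level export: one row per session with a
# variant label, a 0/1 converted flag, and segment attributes.
# Assumes variant labels "control" and "variant_a".
sessions = pd.read_csv("sessions.csv")  # columns: variant, converted, device

cvr = (sessions
       .groupby(["device", "variant"])["converted"]
       .mean()
       .unstack("variant"))             # rows: device · cols: variants

cvr["uplift"] = cvr["variant_a"] / cvr["control"] - 1
print(cvr.sort_values("uplift", ascending=False))
```

The same groupby extends to any attribute in the assignment log (visitor tier, traffic source), which is how the device and tier splits above come out automatically.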

5 data sources · 6 active experiments · Power analysis run before each test · Guardrail alerts enabled · Segment splits automatic
Before → After | Before N9ine | After N9ine
Winner declared | Day 4 · n=847 | Day 11 · n=8,420
False positive risk | ~80% underpowered | Pre-test power calc
Segment analysis | None | Automatic by device + tier
Guardrail metrics | Not tracked | AOV, return rate, CSAT
Tests per year | ~4 (manual, slow) | 40+ (systematic)

Run experiments you can actually trust.

We build the statistical framework, automate the power analysis, and deliver segment-level results your team can ship from — without second-guessing the numbers.