Reality Check
04 — Winner's Curse & Bayesian Analysis

// why deflate a winner?

You ran a test and it won. Here is the uncomfortable math: in an underpowered test, real effects are usually too small for the noise to let them reach significance. The only way a modest true effect passes a noisy test is by getting a lucky bounce that makes it look bigger than it is. So the tests that "win" are precisely the ones whose measured lift is inflated. Statisticians call it the winner's curse, and it punishes small samples hardest.
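The selection effect is easy to reproduce. The sketch below is a toy simulation, not the tool's code: it assumes a hypothetical true lift of 0.1pp, a standard error of 0.3pp, and normally distributed noise, then looks only at the runs that reached significance.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical numbers chosen to mimic an underpowered test.
TRUE_LIFT = 0.001   # true absolute lift (0.1pp)
SE = 0.003          # standard error of the measured lift
Z_CRIT = 1.96       # two-sided 5% significance threshold

# Run the same test many times: each observed lift is truth plus noise.
observed = [random.gauss(TRUE_LIFT, SE) for _ in range(100_000)]

# Keep only the "winners": results that are significantly positive.
winners = [d for d in observed if d / SE > Z_CRIT]

win_rate = len(winners) / len(observed)
exaggeration = mean(winners) / TRUE_LIFT

print(f"share of tests that win: {win_rate:.1%}")
print(f"average winning lift is {exaggeration:.1f}x the true lift")
```

Every winner has to clear 1.96 × SE, which here is almost six times the true lift, so the average winner cannot be anything but inflated.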

Science measured this on itself. In 2015, the Open Science Collaboration reran 100 psychology studies that had already passed peer review. On the second run, the average effect was roughly half the published size. The originals weren't fabricated. They were selected: journals printed the lucky bounces. Every time you announce the observed lift of a barely-significant test, you are running the same selection process on your own company.

Reality Check deflates your observed lift with Bayesian shrinkage, shows the probability your variant is better at all, and turns the honest estimate into an honest revenue projection. The first number you announce becomes the anchor. Make sure it's one reality can live up to.

"Extraordinary claims require extraordinary evidence."

CARL SAGAN — astronomer and science communicator

01 — Your "Winning" Result
Use visitors, not sessions or impressions. Your denominator must match the unit your platform randomizes — almost always the visitor. Session-level numbers with a standard test double-count returning users, understate the variance, and fabricate confidence. If switching the denominator "improves" your result, that's not extra evidence — it's the winner's curse with a different coat on.
02 — The Deflation
03 — Bayesian Read
04 — Honest Revenue Projection
// the math — every calculation this tool runs, in full

Observed effect and its uncertainty

From your four inputs (visitors n_A, n_B and conversions c_A, c_B): p_A = c_A/n_A, p_B = c_B/n_B, observed absolute difference d = p_B − p_A. Its standard error:

SE = √( p_A(1−p_A)/n_A + p_B(1−p_B)/n_B )

SE is the size of a typical random fluctuation in d. When SE is large relative to d, the data alone can't pin the effect down — that's where the prior starts to matter.
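A minimal sketch of that first step. The counts are hypothetical example inputs, not data from a real test.

```python
from math import sqrt

# Hypothetical inputs: visitors (n) and conversions (c) per arm.
n_a, c_a = 5000, 100
n_b, c_b = 5000, 130

p_a = c_a / n_a          # control conversion rate
p_b = c_b / n_b          # variant conversion rate
d = p_b - p_a            # observed absolute difference

# Unpooled standard error of the difference between two proportions.
se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)

print(f"p_A = {p_a:.4f}, p_B = {p_b:.4f}")
print(f"d = {d:.4f} ({d / p_a:+.0%} relative), SE = {se:.5f}")
```

With these numbers the observed relative lift looks large, but d is only about two standard errors from zero.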

The sceptical prior

The prior encodes "before seeing this test, what lifts are plausible?" as a normal distribution centred on zero effect:

prior: N( 0, τ² )  ·  τ = prior% × p_A  (sceptical 2.5% · realistic 5% · optimistic 10%)

Centred on zero because most ecom tests move nothing; the width says how surprised you'd be by a big true effect. This must be chosen before looking at results — choosing the prior that flatters your result is p-hacking with extra steps.
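To get a feel for what each width implies, the sketch below works in relative terms, where the prior on the relative lift d/p_A is N(0, prior%²). The function and dictionary names are illustrative, not the tool's own.

```python
from statistics import NormalDist

# Prior widths from the text, expressed as relative lift.
PRIOR_WIDTHS = {"sceptical": 0.025, "realistic": 0.05, "optimistic": 0.10}

def prior_tau(level: str, p_a: float) -> float:
    """Prior standard deviation in absolute conversion-rate units."""
    return PRIOR_WIDTHS[level] * p_a

def prior_prob_lift_exceeds(level: str, relative_lift: float) -> float:
    """Prior probability that the true relative lift is above `relative_lift`."""
    return 1 - NormalDist(0, PRIOR_WIDTHS[level]).cdf(relative_lift)

p_a = 0.02  # hypothetical baseline conversion rate
for level in PRIOR_WIDTHS:
    tau = prior_tau(level, p_a)
    chance = prior_prob_lift_exceeds(level, 0.10)
    print(f"{level:>10}: tau = {tau:.5f}, P(true lift > +10%) = {chance:.5f}")
```

Under the optimistic setting a true +10% lift is a one-sigma event; under the sceptical setting it sits four sigmas out.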

Bayesian shrinkage — the posterior

With a normal prior and (approximately) normal data, the posterior has a closed form: a precision-weighted average of prior and data. Precision = 1/variance.

σ²_post = 1 / ( 1/τ² + 1/SE² )  ·  μ_post = σ²_post · ( d / SE² )

Read it as a tug-of-war: when your sample is large, SE is tiny, the data term dominates, and μ_post ≈ d (barely any deflation). When the sample is small, the prior pulls the estimate toward zero. The "inflation shaved off" figure is 1 − μ_post/d. This is the same mechanism used in empirical-Bayes estimation at large experimentation platforms.
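The two formulas translate directly into code. A sketch with hypothetical inputs (d = 0.6pp, SE = 0.3pp, τ = 0.1pp); `shrink` is an illustrative name.

```python
from math import sqrt

def shrink(d: float, se: float, tau: float) -> tuple[float, float]:
    """Posterior mean and standard deviation for a N(0, tau^2) prior."""
    prior_precision = 1 / tau**2
    data_precision = 1 / se**2
    var_post = 1 / (prior_precision + data_precision)
    mu_post = var_post * (d * data_precision)  # the prior mean is zero, so only the data term remains
    return mu_post, sqrt(var_post)

# Hypothetical underpowered test: observed lift is two standard errors.
d, se, tau = 0.006, 0.003, 0.001
mu_post, sigma_post = shrink(d, se, tau)
shaved = 1 - mu_post / d

print(f"observed d = {d:.4f}, posterior mean = {mu_post:.4f}")
print(f"inflation shaved off: {shaved:.0%}")

# Same observed lift with ten times less noise: the data term dominates.
mu_big, _ = shrink(d, se / 10, tau)
print(f"with SE = {se / 10:.4f}: posterior mean = {mu_big:.4f}")
```

The shrinkage factor μ_post/d works out to τ² / (τ² + SE²), which is why it depends only on how the noise compares with the prior width.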

P(B beats A) and the credible interval

P(B > A) = Φ( μ_post / σ_post )  ·  95% CrI: μ_post ± 1.96·σ_post  (shown relative: ÷ p_A)

Unlike a confidence interval, the credible interval means what people think it means: given the data and prior, there's a 95% probability the true lift is inside it.
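A short sketch of both outputs, using a hypothetical posterior (μ_post = 0.06pp, σ_post = 0.095pp) and a 2% baseline.

```python
from statistics import NormalDist

# Hypothetical posterior in absolute conversion-rate units.
mu_post, sigma_post = 0.0006, 0.00095
p_a = 0.02  # baseline rate, used to express results as relative lift

# Probability that the true difference is above zero.
prob_b_better = NormalDist().cdf(mu_post / sigma_post)

# 95% credible interval, converted to relative lift.
cri_low = (mu_post - 1.96 * sigma_post) / p_a
cri_high = (mu_post + 1.96 * sigma_post) / p_a

print(f"P(B > A) = {prob_b_better:.1%}")
print(f"95% CrI: {cri_low:+.1%} to {cri_high:+.1%}")
```

Here the interval straddles zero, so the honest read is "probably positive, possibly not".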

The frequentist p-value (for comparison)

z = d / √( p̂(1−p̂)(1/n_A + 1/n_B) )  ·  p = 2·(1 − Φ(|z|))  ·  p̂ = (c_A + c_B)/(n_A + n_B), the pooled rate

Shown so you can watch it disagree with the Bayesian read: an underpowered test can be "significant" while P(B>A) stays unconvincing under a sceptical prior. That disagreement is the winner's curse being caught in real time.
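The disagreement is easy to stage. The sketch below runs both reads on the same hypothetical counts (5,000 visitors per arm, 100 vs 130 conversions) with the sceptical 2.5% prior. Function names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal CDF

def frequentist_p(n_a, c_a, n_b, c_b):
    """Two-sided p-value from the pooled two-proportion z-test."""
    p_a, p_b = c_a / n_a, c_b / n_b
    pooled = (c_a + c_b) / (n_a + n_b)
    z = (p_b - p_a) / sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return 2 * (1 - PHI(abs(z)))

def prob_b_beats_a(n_a, c_a, n_b, c_b, prior_pct):
    """P(B > A) under a N(0, (prior_pct * p_A)^2) prior on the difference."""
    p_a, p_b = c_a / n_a, c_b / n_b
    d = p_b - p_a
    se2 = p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b
    tau2 = (prior_pct * p_a) ** 2
    var_post = 1 / (1 / tau2 + 1 / se2)
    mu_post = var_post * d / se2
    return PHI(mu_post / sqrt(var_post))

counts = (5000, 100, 5000, 130)  # hypothetical test result
p_value = frequentist_p(*counts)
prob = prob_b_beats_a(*counts, prior_pct=0.025)

print(f"p-value = {p_value:.3f} (significant at 0.05: {p_value < 0.05})")
print(f"P(B > A) under the sceptical prior = {prob:.1%}")
```

The p-value just clears the 0.05 bar while the posterior probability stays well short of a convincing level.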

Revenue projections

naive = 12 × monthly revenue × d/p_A  ·  honest = … × μ_post/p_A  ·  floor = … × max(0, CrI lower bound)

Same arithmetic, three different effect estimates. The gap between naive and honest is what your slide deck would have over-promised.
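A sketch of the three projections. It assumes the "…" in the formula repeats the 12 × monthly revenue factor and that the credible-interval bound is passed in relative terms (already divided by p_A). Both are readings of the text, not confirmed details of the tool.

```python
def projections(monthly_revenue, d, mu_post, p_a, cri_lower_rel):
    """Annual revenue uplift under the naive, honest and floor estimates."""
    annual = 12 * monthly_revenue
    return {
        "naive": annual * d / p_a,                 # observed lift taken at face value
        "honest": annual * mu_post / p_a,          # shrunken posterior mean
        "floor": annual * max(0.0, cri_lower_rel), # lower credible bound, never negative
    }

# Hypothetical inputs: 100k monthly revenue, +30% observed, +3% after shrinkage.
result = projections(
    monthly_revenue=100_000,
    d=0.006,
    mu_post=0.0006,
    p_a=0.02,
    cri_lower_rel=-0.063,
)

for name, value in result.items():
    print(f"{name:>6}: {value:,.0f} per year")
```

A floor of zero means the data cannot rule out that the variant adds nothing.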

// the logic

Why ecommerce A/B testing needs Bayesian thinking

There's a recurring argument (on Reddit, in CRO communities, everywhere) that A/B testing is "statistically not viable" for most ecommerce stores. The math behind the complaint is real: at α = 0.05, a store with 30,000 monthly visitors and a 2% conversion rate needs roughly 11 weeks per test for even a 50% chance of detecting a 10% relative lift with classical (frequentist) statistics, and about twice that at the conventional 80% power. Most stores don't have that patience, so they either stop testing or, worse, run underpowered tests and trust the results anyway.
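How long a classical test takes depends heavily on the power you demand, so it is worth computing rather than quoting. The sketch below uses the standard normal-approximation sample-size formula for two proportions with a two-sided α of 0.05. It assumes a 50/50 split and treats the monthly traffic figure as independent visitors.

```python
from math import ceil
from statistics import NormalDist

def visitors_per_arm(p_base, rel_lift, alpha=0.05, power=0.80):
    """Approximate visitors needed per arm to detect a relative lift."""
    p_var = p_base * (1 + rel_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_base * (1 - p_base) + p_var * (1 - p_var)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p_var - p_base) ** 2)

MONTHLY_VISITORS = 30_000
weekly_visitors = MONTHLY_VISITORS * 12 / 52

for power in (0.50, 0.80):
    n = visitors_per_arm(p_base=0.02, rel_lift=0.10, power=power)
    weeks = 2 * n / weekly_visitors  # two arms share the traffic
    print(f"power {power:.0%}: {n:,} visitors per arm, about {weeks:.0f} weeks")
```

Halving the detectable lift roughly quadruples the required sample, which is why small stores run out of traffic so quickly.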

But the conclusion "testing isn't viable" is wrong. What's not viable is using the wrong statistical framework for the traffic you have.

The frequentist problem

Classical testing answers a strange question: "If there were truly no difference between A and B, how surprising is my data?" That's what a p-value is. It never tells you the probability that B is better — it tells you how weird your data would look in a hypothetical world where B does nothing.

This framework was designed for agricultural experiments in the 1920s, where you plant a field once and analyse once. It demands a fixed sample size decided upfront, forbids looking at results early (peeking inflates false positives dramatically), and gives a binary significant/not-significant answer that's routinely misread. For a low-traffic store, this is brutal: you commit to 10 weeks blind, you can't stop early even when the signal is obvious, and at the end you get a yes/no instead of a decision-ready number.
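The peeking penalty can be checked with a toy A/A simulation. The sketch below assumes ten equally sized batches of data, no true effect, and a test statistic tracked directly on the z-scale. It compares "analyse once at the end" with "stop the first time |z| > 1.96".

```python
import random
from math import sqrt

random.seed(1)  # fixed seed so the run is repeatable

def declared_significant(looks: int, peek: bool) -> bool:
    """Simulate one A/A test; True if it is ever called significant."""
    total = 0.0
    for k in range(1, looks + 1):
        total += random.gauss(0, 1)   # one more batch of pure noise
        z = total / sqrt(k)           # z-statistic after k batches
        if (peek or k == looks) and abs(z) > 1.96:
            return True
    return False

RUNS, LOOKS = 20_000, 10
fixed_rate = sum(declared_significant(LOOKS, peek=False) for _ in range(RUNS)) / RUNS
peeking_rate = sum(declared_significant(LOOKS, peek=True) for _ in range(RUNS)) / RUNS

print(f"false positives, one look at the end: {fixed_rate:.1%}")
print(f"false positives, stopping at any of {LOOKS} looks: {peeking_rate:.1%}")
```

The single-look rate stays near the nominal 5%; checking at every batch gives the noise ten chances to cross the line.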

The Bayesian alternative

Bayesian analysis answers the question you actually have: "Given the data I've seen, what's the probability that B is better than A — and by how much?"

P(B > A | data) = 94.2% · Expected lift: +3.1% (95% credible: −0.4% to +7.2%)

That statement is directly usable in a business decision. You can weigh it against implementation cost, risk appetite, and opportunity cost — the way you'd weigh any other business decision under uncertainty.

Frequentist (classical)

  • Needs a fixed sample size committed upfront
  • Peeking at results invalidates the test
  • Answers: "how surprising is this data if nothing changed?"
  • Binary verdict at α = 0.05 — nothing in between
  • Underpowered + significant = inflated estimate (winner's curse)
  • Practical for high-traffic sites with discipline

Bayesian

  • Valid at any sample size — evidence just accumulates
  • Monitoring as you go is fine (with a decision rule)
  • Answers: "what's the probability B is better, and by how much?"
  • Continuous evidence — 94% is stronger than 82%
  • Priors provide built-in shrinkage against inflated effects
  • Practical at ecom traffic levels — this is why VWO and AB Tasty use it

Why this is the choice for ecommerce specifically

1. Traffic reality. Bayesian inference doesn't collapse below a magic sample size. With 2,000 visitors per arm you get a wider, more honest credible interval instead of a meaningless "not significant". The evidence you have is quantified, not discarded.

2. Business decisions are already Bayesian. No merchant thinks in terms of rejecting null hypotheses. They think "how confident am I this works, and what does it cost me if I'm wrong?" — which is literally the Bayesian decision framework (expected loss). The statistics should match the decision.

3. Priors are honesty, not cheating. A decade of published CRO data shows most ecom tests move conversion by very little — big lifts of +20% are rare and usually don't replicate. Encoding this as a sceptical prior automatically deflates too-good-to-be-true results. That's the winner's curse protection built directly into the math, which is exactly what the tool above does.

4. You can stop when you know enough. With a pre-agreed decision rule (e.g. "ship when P(B>A) ≥ 95% and the expected loss of shipping is below 0.1pp"), monitoring continuously is legitimate. For a store that can't afford 12-week tests, this typically cuts test duration by 30–50% versus fixed-horizon testing.
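A decision rule like the one in point 4 can be written down before the test starts. The sketch below assumes a normal posterior on the absolute difference and defines the expected loss of shipping as E[max(0, −δ)], the average conversion rate given up in the cases where B is actually worse. The thresholds mirror the example in the text, and the function names are illustrative.

```python
from statistics import NormalDist

STD = NormalDist()  # standard normal

def expected_loss_of_shipping(mu_post: float, sigma_post: float) -> float:
    """E[max(0, -delta)] for delta ~ N(mu_post, sigma_post^2)."""
    z = mu_post / sigma_post
    return sigma_post * STD.pdf(z) - mu_post * STD.cdf(-z)

def should_ship(mu_post, sigma_post, min_prob=0.95, max_loss=0.001):
    """Ship when P(B > A) is high enough and the expected loss is small.

    max_loss is in absolute conversion-rate units: 0.001 is 0.1pp.
    """
    prob_b_better = STD.cdf(mu_post / sigma_post)
    loss = expected_loss_of_shipping(mu_post, sigma_post)
    return prob_b_better >= min_prob and loss < max_loss

# Hypothetical posteriors: one weak, one strong.
print(should_ship(mu_post=0.0006, sigma_post=0.00095))  # weak evidence
print(should_ship(mu_post=0.0040, sigma_post=0.00200))  # strong evidence
```

Fixing `min_prob` and `max_loss` in advance is what keeps continuous monitoring from turning into cherry-picking the stopping point.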

The honest caveat: Bayesian is not magic. A small sample still means high uncertainty: the credible interval will be wide and will often include zero. What Bayesian buys you is that the uncertainty is quantified and usable instead of hidden behind "not significant". And the prior must be chosen before you see results. Picking the prior that makes your result look best is p-hacking with extra steps. That's why this tool makes you choose the scepticism level explicitly, and why Lockbox exists.

The bottom line: if you have Amazon-scale traffic, frequentist testing works fine when you run it by the book. If you're a normal ecom store, Bayesian methods are simply the correct way to make decisions under the uncertainty you actually have. What's not viable is pretending your underpowered frequentist test gave you certainty. That's what this tool is here to check.