A/B testing, sample ratio mismatch, and multiple comparisons

How big a sample does an A/B test need to detect a real effect? To detect an absolute lift d on a baseline rate p, at the usual two-sided 5% significance level and 80% power, each arm needs roughly

n  ~  (1.96 + 0.84)^2 * 2 * p * (1 - p) / d^2

The d^2 in the denominator is the painful part: halving the effect you want to detect quadruples the sample you need. Planning this before a test is what stops you running an experiment that never had a chance of reaching significance.
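As a rough sketch, the formula above can be turned into a small planning helper. The function name and the example numbers (a 10% baseline and a 1-point lift) are illustrative assumptions, not values from the text; the two constants are recovered from the standard normal distribution rather than hard-coded.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(p, d, alpha=0.05, power=0.80):
    """Approximate users per arm to detect absolute lift d on baseline rate p."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for two-sided 5%
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    n = (z_alpha + z_beta) ** 2 * 2 * p * (1 - p) / d ** 2
    return math.ceil(n)


# Hypothetical planning numbers: 10% baseline, 1-point vs 0.5-point lift.
n_full = sample_size_per_arm(p=0.10, d=0.01)
n_half = sample_size_per_arm(p=0.10, d=0.005)
print(n_full, n_half, n_half / n_full)
```

With these assumed inputs the first call lands at roughly 14,000 users per arm, and halving d gives about four times that, which is the d^2 effect in practice.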

A sample ratio mismatch (SRM) is when an intended 50/50 split shows up as, say, 48/52 across millions of users: an imbalance that looks small but that a chi-square goodness-of-fit test flags as wildly unlikely by chance. It usually means the randomisation or the data pipeline is broken.
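A minimal sketch of that chi-square check, using only the standard library. The function name and the user counts are made up for illustration; the p-value relies on the fact that, with one degree of freedom, the chi-square tail probability equals erfc(sqrt(chi2 / 2)).

```python
import math


def srm_check(n_control, n_treatment, expected_share_control=0.5):
    """Chi-square goodness-of-fit test of observed counts against the intended split."""
    total = n_control + n_treatment
    expected_control = total * expected_share_control
    expected_treatment = total - expected_control
    chi2 = ((n_control - expected_control) ** 2 / expected_control
            + (n_treatment - expected_treatment) ** 2 / expected_treatment)
    # Two groups give 1 degree of freedom, so the tail area has a closed form.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: a 48/52 split across one million users.
chi2_bad, p_bad = srm_check(480_000, 520_000)
# Hypothetical counts: a split that is off by only a few hundred users.
chi2_ok, p_ok = srm_check(500_300, 499_700)
print(chi2_bad, p_bad)
print(chi2_ok, p_ok)
```

The first case yields a p-value that is indistinguishable from zero, while the second is comfortably consistent with chance. Which threshold counts as an alarm is a team convention rather than something fixed by the test.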

When you see an SRM, you stop trusting the result until you've explained it — you do not read a "winner" off a corrupted experiment. Investigate the assignment, filters, and duplicates first.

If you test twenty metrics at the 5% level and none of them has a real effect, you still expect about one "significant" result by pure luck. Celebrating that one result is mostly self-deception.
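The arithmetic behind that claim is short enough to check directly. The sketch below assumes the twenty tests are independent and that no metric has a real effect, which is a simplification: real metrics are often correlated.

```python
alpha = 0.05  # per-test significance level
m = 20        # number of metrics tested

# Expected number of false positives when every null hypothesis is true.
expected_false_positives = m * alpha

# Chance of at least one false positive, assuming independent tests.
p_at_least_one = 1 - (1 - alpha) ** m

print(expected_false_positives, round(p_at_least_one, 3))
```

Under these assumptions there is roughly a 64% chance of seeing at least one spurious "win" among the twenty.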

The fix is to pre-specify a single primary metric, or to correct for the number of tests — Bonferroni to control the chance of any false positive, Benjamini-Hochberg to control the false-discovery rate. Most "wins" that fail to replicate die on exactly these points: too small a sample, a broken split, or one lucky metric out of twenty.
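The two corrections can be sketched in a few lines each. The function names and the p-values below are invented for illustration, and the Benjamini-Hochberg version shown is the basic step-up procedure, which assumes independent (or positively dependent) tests.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject only p-values at or below alpha / m (controls the chance of any false positive)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Step-up procedure: reject the k smallest p-values, where k is the
    largest rank with p_(k) <= (k / m) * alpha (controls the false-discovery rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            cutoff = rank
    rejected = [False] * m
    for i in order[:cutoff]:
        rejected[i] = True
    return rejected


# Made-up p-values for five metrics.
p_values = [0.001, 0.008, 0.025, 0.039, 0.20]
print(bonferroni(p_values))
print(benjamini_hochberg(p_values))
```

On this made-up input Bonferroni keeps two results and Benjamini-Hochberg keeps four, which reflects the trade-off: the first guards against any false positive at all, the second tolerates a controlled share of them in exchange for more power.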
