A/B testing
What A/B testing is
An A/B test (also called a controlled experiment or randomized trial) is the gold standard for measuring the causal effect of a change. You randomly split users into two groups — a control group (A) that experiences the current version and a treatment group (B) that experiences the new version — and compare outcomes.
The randomization is what makes A/B tests powerful. Because users are randomly assigned, the two groups are statistically equivalent on average before the experiment begins. Any difference in outcomes afterward can be attributed to the treatment — not to pre-existing differences between the groups.
The anatomy of an A/B test
Hypothesis: formulate what you expect the change to do before running the test.
- Null hypothesis: the change has no effect on the metric.
- Alternative hypothesis: the change improves (or changes) the metric.
Metric: choose a primary metric that directly captures what matters. Common examples: conversion rate, click-through rate, revenue per user, session length.
Sample size: calculate in advance how many users you need per group to detect the minimum effect size you care about, at your chosen significance level and power. Running too few users and stopping early because results look good is a common mistake — covered below.
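The sample-size calculation above can be sketched with the standard two-proportion formula. This is a minimal illustration using only the Python standard library; the function name and the example rates (a 3.2% baseline lifted to 3.5%) are illustrative, not from a specific tool.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(p_control, p_treatment, alpha=0.05, power=0.8):
    """Approximate users needed per group to detect the difference between
    two conversion rates with a two-sided two-proportion z-test."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    z_beta = z.inv_cdf(power)            # about 0.84 for 80% power
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_treatment - p_control
    return ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)

# Detecting a lift from 3.2% to 3.5% requires tens of thousands of users
# per group -- small absolute effects are expensive to measure.
n = sample_size_per_group(0.032, 0.035)
```

Note how the required sample size grows with the inverse square of the effect: halving the minimum detectable lift roughly quadruples the users needed.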
Randomization unit: the unit of randomization is usually the user or session. Every request from the same user should go to the same group (consistent experience).
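Consistent assignment is usually implemented by hashing the randomization unit rather than storing a lookup table. A minimal sketch (the experiment name and split are illustrative):

```python
import hashlib

def assign_group(user_id: str, experiment: str = "checkout_test") -> str:
    """Deterministically assign a user to control (A) or treatment (B).
    Hashing user_id together with the experiment name gives a stable,
    effectively random 50/50 split: the same user always lands in the
    same group, and different experiments get independent splits."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"

# Same user, same group on every request:
assert assign_group("user_42") == assign_group("user_42")
```

Salting the hash with the experiment name matters: without it, every experiment would split users identically, and users stuck in one experiment's treatment would always be in every other experiment's treatment too.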
Duration: run long enough to capture weekly seasonality and reach the required sample size. Never stop early based on peeking at results.
Interpreting results
Once the pre-specified sample size is reached, compute the test statistic and p-value. A standard threshold is α = 0.05: reject the null hypothesis when p < 0.05.
Report the effect size and confidence interval, not just the p-value. "Conversion rate improved from 3.2% to 3.5% (95% CI: 0.1% to 0.5%), p = 0.02" is informative. "p = 0.02" alone is not.
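A sketch of how the effect size, confidence interval, and p-value in that report can all come from one two-proportion z-test. The counts below are illustrative numbers chosen to match the 3.2% → 3.5% example; only the standard library is used.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_test(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Two-sided z-test for the difference between two conversion rates.
    Returns (absolute difference, (1 - alpha) confidence interval, p-value)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled standard error under the null hypothesis of no difference:
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se_pooled = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Unpooled standard error for the confidence interval:
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    diff = p_b - p_a
    return diff, (diff - z_crit * se, diff + z_crit * se), p_value

# 1,920 of 60,000 control users converted (3.2%) vs 2,100 of 60,000 (3.5%):
diff, ci, p = two_proportion_test(1920, 60000, 2100, 60000)
```

Reporting `diff` and `ci` tells the reader how big the effect is and how precisely it was measured; `p` alone says neither.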
Common mistakes
Peeking and stopping early. Checking results daily and stopping as soon as p < 0.05 inflates the false positive rate dramatically. The experiment must run to the pre-specified sample size.
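The inflation from peeking is easy to demonstrate with a seeded A/A simulation: both groups are identical, so every "significant" result is a false positive. The simulation parameters below are illustrative.

```python
import random
from math import sqrt
from statistics import NormalDist

def z_test_p_value(successes_a, successes_b, n):
    """Two-sided two-proportion z-test p-value (equal group sizes)."""
    p_a, p_b = successes_a / n, successes_b / n
    p_pool = (successes_a + successes_b) / (2 * n)
    se = sqrt(p_pool * (1 - p_pool) * 2 / n)
    if se == 0:
        return 1.0
    return 2 * (1 - NormalDist().cdf(abs((p_b - p_a) / se)))

def simulate(n_experiments=400, days=8, users_per_day=250, rate=0.05, seed=7):
    """A/A experiments: compare the false positive rate of a peeker (who
    declares a winner at the first daily p < 0.05) against a single test
    at the pre-specified end of the experiment."""
    rng = random.Random(seed)
    peeking_fp = fixed_fp = 0
    for _ in range(n_experiments):
        a = b = n = 0
        any_peek_significant = False
        for _ in range(days):
            a += sum(rng.random() < rate for _ in range(users_per_day))
            b += sum(rng.random() < rate for _ in range(users_per_day))
            n += users_per_day
            if z_test_p_value(a, b, n) < 0.05:  # daily peek
                any_peek_significant = True
        peeking_fp += any_peek_significant
        fixed_fp += z_test_p_value(a, b, n) < 0.05  # one test at the end
    return peeking_fp / n_experiments, fixed_fp / n_experiments

peek_rate, fixed_rate = simulate()
```

Testing once at the pre-specified horizon keeps the false positive rate near the nominal 5%; taking the best of eight daily looks pushes it far higher, even though nothing changed between the groups.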
Running too many metrics. Testing 20 secondary metrics means roughly one will be significant by chance at α = 0.05. Pre-register your primary metric and apply multiple testing corrections to secondary metrics.
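The simplest such correction is Bonferroni: with m secondary metrics, compare each p-value to α/m instead of α. A minimal sketch:

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag which of m metrics remain significant after a Bonferroni
    correction: each p-value is compared to alpha / m."""
    m = len(p_values)
    return [p < alpha / m for p in p_values]

# One of 20 secondary metrics comes in at p = 0.03: nominally significant
# on its own, but not after correcting for 20 comparisons (0.05 / 20 = 0.0025).
flags = bonferroni_significant([0.03] + [0.5] * 19)
```

Bonferroni is conservative; less strict alternatives (e.g. Benjamini-Hochberg for controlling the false discovery rate) exist, but the principle is the same: the more metrics you test, the stronger the evidence each one must clear.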
Network effects. If users in the control and treatment groups interact with each other (social networks, marketplaces), the randomization is compromised. Cluster randomization (randomizing at the group level) is needed.
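Cluster randomization can reuse the same hash-based assignment idea, applied to the cluster instead of the user. A sketch, assuming a precomputed user-to-cluster mapping (the cluster and experiment names are illustrative):

```python
import hashlib

def assign_cluster_group(cluster_id: str, experiment: str = "feed_ranking") -> str:
    """Randomize at the cluster level (e.g. a city or social circle) so
    that users who interact with each other share one treatment assignment."""
    digest = hashlib.sha256(f"{experiment}:{cluster_id}".encode()).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"

def assign_user(user_id: str, user_to_cluster: dict) -> str:
    """Every user inherits the assignment of their cluster."""
    return assign_cluster_group(user_to_cluster[user_id])

clusters = {"alice": "berlin", "bob": "berlin", "carol": "lisbon"}
# alice and bob interact, so they see the same variant:
assert assign_user("alice", clusters) == assign_user("bob", clusters)
```

The price of clustering is statistical: the effective sample size is closer to the number of clusters than the number of users, so the experiment needs more traffic to reach the same power.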
Novelty effect. Users may engage with a new feature simply because it is new, not because it is better. Short experiments overestimate effects that will decay over time.