Hypothesis testing
The basic question
Hypothesis testing is a formal procedure for deciding whether data provides enough evidence to reject a specific claim about the world. It does not prove things — it measures how surprising the data would be if a particular assumption were true.
The two hypotheses
Every test starts with two competing claims:
- Null hypothesis ($H_0$): the default assumption — usually "no effect," "no difference," or "nothing interesting is happening." Example: the new drug has no effect on blood pressure.
- Alternative hypothesis ($H_1$ or $H_a$): the claim you are trying to find evidence for. Example: the drug does lower blood pressure.
The test asks: is the data sufficiently inconsistent with $H_0$ to warrant rejecting it?
The logic
Hypothesis testing follows a proof-by-contradiction structure:
- Assume $H_0$ is true.
- Compute how likely the observed data (or something more extreme) would be under that assumption.
- If the data is very unlikely under $H_0$, reject $H_0$ in favor of $H_1$.
You never "accept" $H_0$ — you either reject it or fail to reject it. Failing to reject just means the data is not inconsistent enough with $H_0$ to rule it out.
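The three steps above can be sketched as a Monte Carlo simulation, assuming a hypothetical fair-coin null hypothesis (the function name and parameters are illustrative):

```python
import random

def simulated_tail_probability(observed_heads, n_flips=100, trials=10_000, seed=0):
    """Estimate how often a fair coin (H0) produces a result at least as
    far from the expected count as the one observed."""
    rng = random.Random(seed)
    expected = n_flips / 2
    observed_deviation = abs(observed_heads - expected)
    extreme = 0
    for _ in range(trials):
        # Step 1: assume H0 (a fair coin) and generate data under it.
        heads = sum(rng.random() < 0.5 for _ in range(n_flips))
        # Step 2: count simulated outcomes at least as extreme as the observation.
        if abs(heads - expected) >= observed_deviation:
            extreme += 1
    return extreme / trials
```

Step 3 is then a judgment call: if the returned fraction is very small, the observed data is surprising under $H_0$.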
The test statistic and p-value
A test statistic is a number computed from the data that summarizes how far the observation is from what $H_0$ predicts. Common examples: the z-statistic, t-statistic, and chi-squared statistic.
The p-value is the probability of observing a test statistic as extreme as the one computed (or more extreme), assuming $H_0$ is true. A small p-value means the data is unlikely under $H_0$.
A worked example
You want to test whether a coin is fair ($H_0\colon p = 0.5$). You flip it 100 times and get 62 heads.
The z-statistic for a proportion is:

$$z = \frac{\hat{p} - p_0}{\sqrt{p_0(1 - p_0)/n}} = \frac{0.62 - 0.5}{\sqrt{0.5 \cdot 0.5 / 100}} = \frac{0.12}{0.05} = 2.4$$

From a standard normal table, the two-tailed p-value for $z = 2.4$ is approximately $0.016$.
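The same arithmetic can be checked with a few lines of standard-library Python (the function name is illustrative; `math.erf` gives the standard normal CDF up to rescaling):

```python
from math import erf, sqrt

def z_test_proportion(successes, n, p0):
    """Two-tailed z-test for a proportion under the normal approximation."""
    p_hat = successes / n                      # observed proportion
    se = sqrt(p0 * (1 - p0) / n)               # standard error under H0
    z = (p_hat - p0) / se
    normal_cdf = lambda x: 0.5 * (1 + erf(x / sqrt(2)))
    p_value = 2 * (1 - normal_cdf(abs(z)))     # two-tailed
    return z, p_value

z, p = z_test_proportion(62, 100, 0.5)
# z = 2.4, p ≈ 0.016
```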
The significance level
Before conducting the test, you choose a significance level $\alpha$ — the threshold for how small the p-value must be to reject $H_0$. The most common choice is $\alpha = 0.05$.
- If $p \le \alpha$: reject $H_0$. The result is statistically significant.
- If $p > \alpha$: fail to reject $H_0$.
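The decision rule is mechanical once $\alpha$ is fixed; a minimal sketch (names illustrative):

```python
def decide(p_value, alpha=0.05):
    """Apply the significance-level decision rule."""
    return "reject H0" if p_value <= alpha else "fail to reject H0"

decide(0.016)  # the coin example's p-value falls below alpha = 0.05
```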
In the coin example: $p \approx 0.016 < 0.05$, so we reject $H_0$. The evidence suggests the coin is not fair.
One-tailed vs. two-tailed tests
- Two-tailed: tests for a difference in either direction ($H_1\colon p \ne p_0$). Use this unless you have a strong prior reason to only care about one direction.
- One-tailed: tests for a difference in a specific direction ($H_1\colon p > p_0$ or $H_1\colon p < p_0$). Has more power in that direction but is only valid if you committed to the direction before seeing the data.
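For the coin example, the two choices give different p-values from the same z-statistic. A standard-library sketch showing that, for a symmetric null distribution, the one-tailed p-value is half the two-tailed one:

```python
from math import erf, sqrt

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

z = 2.4  # z-statistic from the coin example

p_two_tailed = 2 * (1 - normal_cdf(abs(z)))  # H1: p != 0.5
p_one_tailed = 1 - normal_cdf(z)             # H1: p > 0.5, committed in advance
# Halving the p-value is where the one-tailed test's extra power
# in its chosen direction comes from.
```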