Evaluating Classifiers: Accuracy, Precision, Recall and F1

The limits of accuracy

Accuracy — the fraction of correctly classified examples — is the most intuitive metric:

\text{Accuracy} = \frac{\text{Number of correct predictions}}{m}

where m is the total number of examples.

But accuracy can be deeply misleading. Suppose 95% of emails are not spam. A classifier that always predicts "not spam" achieves 95% accuracy — yet it is completely useless. This problem is class imbalance, and it demands better metrics.
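A quick sketch makes the failure concrete. The dataset below is invented for illustration: 95 of 100 emails are not spam, and the "classifier" ignores its input entirely.

```python
# Hypothetical dataset: 95 of 100 emails are not spam (label 0), 5 are spam (label 1).
y_true = [0] * 95 + [1] * 5

# A degenerate classifier that always predicts "not spam".
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)  # 0.95 -- high accuracy, yet not a single spam email is caught
```

Despite 95% accuracy, the model has zero ability to detect the positive class.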

The confusion matrix

The confusion matrix organizes all possible prediction outcomes for binary classification:

              Predicted 0             Predicted 1
  Actual 0    True Negative (TN)      False Positive (FP)
  Actual 1    False Negative (FN)     True Positive (TP)
  • True Positive (TP): model correctly predicted positive.
  • True Negative (TN): model correctly predicted negative.
  • False Positive (FP): model incorrectly predicted positive (Type I error).
  • False Negative (FN): model incorrectly predicted negative (Type II error).

All four evaluation metrics below are derived from these four counts.
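The four counts can be tallied directly from paired label lists. This helper is a minimal sketch (binary labels, 1 = positive), not a library implementation:

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = positive, 0 = negative)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

print(confusion_counts([1, 0, 1, 0, 1], [1, 1, 0, 0, 1]))  # (2, 1, 1, 1)
```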

Precision

Precision measures how many of the model's positive predictions were actually correct:

\text{Precision} = \frac{TP}{TP + FP}

High precision means: when the model says "positive," it is usually right. Precision suffers when there are many false positives.

Use when false positives are costly. Example: a spam filter with low precision means legitimate emails get deleted — unacceptable.
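As a sketch, the formula translates directly to code; the guard for an all-negative predictor (no positive predictions at all) is a common convention, not part of the definition:

```python
def precision(tp, fp):
    """Fraction of positive predictions that were correct."""
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0

# 100 positive predictions, 20 of them wrong (false positives):
print(precision(tp=80, fp=20))  # 0.8
```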

Recall (Sensitivity)

Recall measures how many of the actual positives the model correctly identified:

\text{Recall} = \frac{TP}{TP + FN}

High recall means: the model misses very few actual positives. Recall suffers when there are many false negatives.

Use when false negatives are costly. Example: a cancer screening test with low recall means many sick patients are missed — dangerous.
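The same sketch for recall, with an analogous guard for the case of no actual positives:

```python
def recall(tp, fn):
    """Fraction of actual positives the model identified."""
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0

# A test that catches 80 of 130 actual cases misses 50 patients:
print(round(recall(tp=80, fn=50), 3))  # 0.615
```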

The precision–recall trade-off

Precision and recall pull in opposite directions as you adjust the classification threshold:

  • Lower threshold → predict positive more aggressively → more TP and FP → recall increases, precision decreases.
  • Higher threshold → predict positive conservatively → fewer FP → precision increases, recall decreases.

There is no threshold that simultaneously maximizes both. The right balance depends on the problem.
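The trade-off is easy to see by sweeping the threshold over a toy set of predicted probabilities (the scores and labels below are invented for illustration):

```python
# Hypothetical predicted probabilities and true labels.
scores = [0.1, 0.3, 0.45, 0.6, 0.8, 0.9]
labels = [0,   0,   1,    0,   1,   1]

for threshold in (0.2, 0.5, 0.7):
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p and t for p, t in zip(preds, labels))
    fp = sum(p and not t for p, t in zip(preds, labels))
    fn = sum(not p and t for p, t in zip(preds, labels))
    # As the threshold rises, precision climbs while recall falls.
    print(threshold, round(tp / (tp + fp), 2), round(tp / (tp + fn), 2))
```

On this data the lowest threshold gives perfect recall with precision 0.6, and the highest gives perfect precision with recall about 0.67.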

F1 Score

The F1 score is the harmonic mean of precision and recall:

F_1 = 2 \cdot \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}

The harmonic mean penalizes extreme imbalances: if either precision or recall is near zero, F1 will also be near zero, even if the other is perfect. F1 is a single number that balances both concerns.
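A short sketch shows the penalizing behavior of the harmonic mean (the zero-denominator guard is a convention):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1_score(1.0, 1.0))             # 1.0
print(round(f1_score(1.0, 0.01), 3))  # ~0.02: one near-zero input drags F1 down
```

Compare with the arithmetic mean of 1.0 and 0.01, which would be a misleadingly rosy 0.505.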

Example calculation

  Outcome    Count
  TP         80
  TN         850
  FP         20
  FN         50

\text{Precision} = \frac{80}{80+20} = 0.80

\text{Recall} = \frac{80}{80+50} \approx 0.615

F_1 = 2 \cdot \frac{0.80 \times 0.615}{0.80 + 0.615} = 2 \cdot \frac{0.492}{1.415} \approx 0.696

\text{Accuracy} = \frac{80+850}{1000} = 0.930

Accuracy looks impressive at 93%. F1 at 0.696 gives a more realistic picture of performance on the positive class.
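The arithmetic above can be checked in a few lines of Python using the counts from the table:

```python
tp, tn, fp, fn = 80, 850, 20, 50

precision = tp / (tp + fp)                          # 0.80
recall = tp / (tp + fn)                             # ~0.615
f1 = 2 * precision * recall / (precision + recall)  # ~0.696
accuracy = (tp + tn) / (tp + tn + fp + fn)          # 0.93

print(precision, round(recall, 3), round(f1, 3), accuracy)
```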

Generalizing to multiclass: macro and weighted averages

For multiclass problems, compute precision, recall, and F1 per class, then average:

  • Macro average: simple mean across classes. Treats all classes equally.
  • Weighted average: mean weighted by the number of true examples per class. Accounts for class imbalance.
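The difference between the two averages is easiest to see on an imbalanced toy example; the per-class F1 scores and class sizes below are invented for illustration:

```python
# Hypothetical per-class F1 scores and class sizes (support).
f1_per_class = [0.90, 0.40, 0.80]
support = [800, 100, 100]

# Macro: every class counts equally, so the weak minority class drags it down.
macro = sum(f1_per_class) / len(f1_per_class)

# Weighted: dominated by the large, well-performing class.
weighted = sum(f * n for f, n in zip(f1_per_class, support)) / sum(support)

print(round(macro, 3), round(weighted, 3))
```

Here macro F1 is 0.70 while weighted F1 is 0.84: the weighted average can hide poor performance on rare classes.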

Choosing the right metric

  Situation                                Recommended metric
  Balanced classes, equal error costs      Accuracy or F1
  Imbalanced classes                       F1, precision, or recall
  False positives very costly              Precision
  False negatives very costly              Recall
  Want a single balanced metric            F1
  Need threshold-independent evaluation    ROC-AUC (next lesson)