Evaluating Classifiers: Accuracy, Precision, Recall and F1

The limits of accuracy

Accuracy — the fraction of correctly classified examples — is the most intuitive metric:

\text{Accuracy} = \frac{\text{Number of correct predictions}}{m}

where m is the total number of examples.

But accuracy can be deeply misleading. Suppose 95% of emails are not spam. A classifier that always predicts "not spam" achieves 95% accuracy — yet it is completely useless. This problem is class imbalance, and it demands better metrics.
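A quick sketch makes the failure concrete. The dataset below is invented for illustration: 95 of 100 emails are not spam, and the "classifier" ignores its input entirely.

```python
# Hypothetical dataset: 95 of 100 emails are not spam (label 0), 5 are spam (label 1).
y_true = [0] * 95 + [1] * 5

# A degenerate classifier that always predicts "not spam".
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)  # 0.95 -- high accuracy, yet not a single spam email is caught
```

Despite 95% accuracy, the model has zero ability to detect the positive class.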

The confusion matrix

The confusion matrix organizes all possible prediction outcomes for binary classification:

              Predicted 0             Predicted 1
  Actual 0    True Negative (TN)      False Positive (FP)
  Actual 1    False Negative (FN)     True Positive (TP)
  • True Positive (TP): model correctly predicted positive.
  • True Negative (TN): model correctly predicted negative.
  • False Positive (FP): model incorrectly predicted positive (Type I error).
  • False Negative (FN): model incorrectly predicted negative (Type II error).

All four evaluation metrics below are derived from these four counts.
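The four counts can be tallied directly from paired label lists. This helper is a minimal sketch (binary labels, 1 = positive), not a library implementation:

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = positive, 0 = negative)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

print(confusion_counts([1, 0, 1, 0, 1], [1, 1, 0, 0, 1]))  # (2, 1, 1, 1)
```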

Precision

Precision measures how many of the model's positive predictions were actually correct:

\text{Precision} = \frac{TP}{TP + FP}

High precision means: when the model says "positive," it is usually right. Precision suffers when there are many false positives.

Use when false positives are costly. Example: a spam filter with low precision means legitimate emails get deleted — unacceptable.
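As a sketch, the formula translates directly to code; the guard for an all-negative predictor (no positive predictions at all) is a common convention, not part of the definition:

```python
def precision(tp, fp):
    """Fraction of positive predictions that were correct."""
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0

# 100 positive predictions, 20 of them wrong (false positives):
print(precision(tp=80, fp=20))  # 0.8
```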

Recall (Sensitivity)

Recall measures how many of the actual positives the model correctly identified:

\text{Recall} = \frac{TP}{TP + FN}

High recall means: the model misses very few actual positives. Recall suffers when there are many false negatives.

Use when false negatives are costly. Example: a cancer screening test with low recall means many sick patients are missed — dangerous.
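The same sketch for recall, with an analogous guard for the case of no actual positives:

```python
def recall(tp, fn):
    """Fraction of actual positives the model identified."""
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0

# A test that catches 80 of 130 actual cases misses 50 patients:
print(round(recall(tp=80, fn=50), 3))  # 0.615
```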

The precision–recall trade-off

Precision and recall pull in opposite directions as you adjust the classification threshold:

  • Lower threshold → predict positive more aggressively → more TP and FP → recall increases, precision decreases.
  • Higher threshold → predict positive conservatively → fewer FP → precision increases, recall decreases.

There is no threshold that simultaneously maximizes both. The right balance depends on the problem.
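The trade-off is easy to see by sweeping the threshold over a toy set of predicted probabilities (the scores and labels below are invented for illustration):

```python
# Hypothetical predicted probabilities and true labels.
scores = [0.1, 0.3, 0.45, 0.6, 0.8, 0.9]
labels = [0,   0,   1,    0,   1,   1]

for threshold in (0.2, 0.5, 0.7):
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p and t for p, t in zip(preds, labels))
    fp = sum(p and not t for p, t in zip(preds, labels))
    fn = sum(not p and t for p, t in zip(preds, labels))
    # As the threshold rises, precision climbs while recall falls.
    print(threshold, round(tp / (tp + fp), 2), round(tp / (tp + fn), 2))
```

On this data the lowest threshold gives perfect recall with precision 0.6, and the highest gives perfect precision with recall about 0.67.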

F1 Score

The F1 score is the harmonic mean of precision and recall:

F_1 = 2 \cdot \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}

The harmonic mean penalizes extreme imbalances: if either precision or recall is near zero, F1 will also be near zero, even if the other is perfect. F1 is a single number that balances both concerns.
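A short sketch shows the penalizing behavior of the harmonic mean (the zero-denominator guard is a convention):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1_score(1.0, 1.0))             # 1.0
print(round(f1_score(1.0, 0.01), 3))  # ~0.02: one near-zero input drags F1 down
```

Compare with the arithmetic mean of 1.0 and 0.01, which would be a misleadingly rosy 0.505.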

Example calculation

  Outcome    Count
  TP         80
  TN         850
  FP         20
  FN         50

\text{Precision} = \frac{80}{80+20} = 0.80

\text{Recall} = \frac{80}{80+50} \approx 0.615

F_1 = 2 \cdot \frac{0.80 \times 0.615}{0.80 + 0.615} = 2 \cdot \frac{0.492}{1.415} \approx 0.696

\text{Accuracy} = \frac{80+850}{1000} = 0.930

Accuracy looks impressive at 93%. F1 at 0.696 gives a more realistic picture of performance on the positive class.
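The arithmetic above can be checked in a few lines of Python using the counts from the table:

```python
tp, tn, fp, fn = 80, 850, 20, 50

precision = tp / (tp + fp)                          # 0.80
recall = tp / (tp + fn)                             # ~0.615
f1 = 2 * precision * recall / (precision + recall)  # ~0.696
accuracy = (tp + tn) / (tp + tn + fp + fn)          # 0.93

print(precision, round(recall, 3), round(f1, 3), accuracy)
```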

Generalizing to multiclass: macro and weighted averages

For multiclass problems, compute precision, recall, and F1 per class, then average:

  • Macro average: simple mean across classes. Treats all classes equally.
  • Weighted average: mean weighted by the number of true examples per class. Accounts for class imbalance.
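The difference between the two averages is easiest to see on an imbalanced toy example; the per-class F1 scores and class sizes below are invented for illustration:

```python
# Hypothetical per-class F1 scores and class sizes (support).
f1_per_class = [0.90, 0.40, 0.80]
support = [800, 100, 100]

# Macro: every class counts equally, so the weak minority class drags it down.
macro = sum(f1_per_class) / len(f1_per_class)

# Weighted: dominated by the large, well-performing class.
weighted = sum(f * n for f, n in zip(f1_per_class, support)) / sum(support)

print(round(macro, 3), round(weighted, 3))
```

Here macro F1 is 0.70 while weighted F1 is 0.84: the weighted average can hide poor performance on rare classes.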

Choosing the right metric

  Situation                                Recommended metric
  Balanced classes, equal error costs      Accuracy or F1
  Imbalanced classes                       F1, precision, or recall
  False positives very costly              Precision
  False negatives very costly              Recall
  Want a single balanced metric            F1
  Need threshold-independent evaluation    ROC-AUC (next lesson)