Accuracy


The default metric — and its limits

Accuracy is the fraction of predictions the model gets right:

$$\text{Accuracy} = \frac{\text{Number of correct predictions}}{m} = \frac{TP + TN}{TP + TN + FP + FN}$$

It is the most intuitive classification metric and a reasonable default when classes are roughly balanced and all errors cost the same.
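The formula can be computed directly from the four confusion-matrix counts. A minimal sketch, with illustrative counts:

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of predictions that are correct: (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)

# Illustrative counts: 90 correct predictions out of 100
print(accuracy(tp=40, tn=50, fp=5, fn=5))  # 0.9
```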

The confusion matrix

Before any metric, build the confusion matrix — a table of actual versus predicted labels:

|                 | Predicted negative  | Predicted positive  |
|-----------------|---------------------|---------------------|
| Actual negative | True Negative (TN)  | False Positive (FP) |
| Actual positive | False Negative (FN) | True Positive (TP)  |

Every classification metric is derived from these four counts. Accuracy uses all four; other metrics focus on specific cells to capture specific concerns.
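Tallying the four cells takes one pass over the labels. A small sketch with made-up label vectors:

```python
from collections import Counter

y_true = [0, 0, 0, 1, 1, 1]  # actual labels (illustrative)
y_pred = [0, 1, 0, 1, 1, 0]  # model predictions (illustrative)

# Classify each prediction into one of the four confusion-matrix cells
counts = Counter()
for actual, predicted in zip(y_true, y_pred):
    if actual == 1 and predicted == 1:
        counts["TP"] += 1
    elif actual == 0 and predicted == 0:
        counts["TN"] += 1
    elif actual == 0 and predicted == 1:
        counts["FP"] += 1
    else:
        counts["FN"] += 1

print(dict(counts))  # {'TP': 2, 'TN': 2, 'FP': 1, 'FN': 1}
```

With these counts, accuracy is (2 + 2) / 6 ≈ 0.67, and the same four numbers feed precision, recall, and every other classification metric.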

When accuracy misleads

Accuracy fails whenever the classes are imbalanced or the costs of different errors are unequal.

Class imbalance example: 99% of credit card transactions are legitimate. A model that predicts "not fraud" for everything achieves 99% accuracy and catches zero fraud. The metric looks excellent; the model is useless.
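The failure is easy to reproduce. A sketch with synthetic labels, using 1% fraud to mirror the example:

```python
# 1,000 transactions, 1% fraud (label 1) — synthetic data for illustration
y_true = [1] * 10 + [0] * 990

# A "model" that predicts "not fraud" for everything
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
fraud_caught = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))

print(accuracy)      # 0.99 — looks excellent
print(fraud_caught)  # 0 — catches zero fraud
```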

Unequal error costs: in disease screening, missing a true case (false negative) is far more costly than a false alarm (false positive). Accuracy treats both equally.

In both situations, accuracy hides the model's failure. It is the metric most likely to create a false sense of confidence.

When accuracy is appropriate

Accuracy is a reasonable metric when:

  • Classes are approximately balanced.
  • False positives and false negatives have similar costs.
  • You want a single, easy-to-communicate number.

For a digit recognition problem (10 balanced classes), accuracy is natural. For fraud, disease detection, or any rare-event problem, you need precision, recall, or AUC.

Baseline comparisons

Always compare accuracy against the majority class baseline — the accuracy of a model that always predicts the most common class. If the majority class is 90% of the data, your model must meaningfully exceed 90% to add any value.

A model that is only slightly better than the majority baseline provides little value, even if its accuracy sounds high.
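The baseline check can be sketched in a few lines; the 90/10 split and the 0.91 model accuracy below are hypothetical numbers for illustration:

```python
from collections import Counter

def majority_baseline_accuracy(y_true):
    """Accuracy of a model that always predicts the most common class."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)

y_true = [0] * 90 + [1] * 10   # majority class is 90% of the data
baseline = majority_baseline_accuracy(y_true)

model_accuracy = 0.91          # hypothetical model result

print(baseline)                          # 0.9
print(model_accuracy > baseline)         # True, but only barely above baseline
```

A model at 0.91 against a 0.90 baseline sounds accurate in isolation, yet its lift over "always guess the majority" is marginal.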