Evaluating Your Model

Why evaluation matters

A model that fits the training data perfectly is not necessarily a good model — it may have simply memorized the training examples and fail on new data. Proper evaluation measures how well the model generalizes to unseen examples.

The train/test split

The standard approach is to divide your dataset into two non-overlapping subsets before training:

  • Training set (typically 70–80%): used to fit the model parameters.
  • Test set (typically 20–30%): held out and only used for final evaluation.

Never use the test set during training or hyperparameter tuning. Treat it like an exam you haven't seen yet.

For hyperparameter tuning (e.g. choosing the learning rate or regularization strength), add a third split:

  • Validation set (e.g. 10–20% of total data): used for tuning, not for final evaluation.
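The three-way split can be sketched in plain Python. The fractions and the fixed seed below are illustrative; a hypothetical helper, not part of any library:

```python
import random

def train_val_test_split(data, val_frac=0.1, test_frac=0.2, seed=42):
    """Shuffle the data, then carve off test and validation subsets.

    Shuffling first matters: if the data is ordered (e.g. by date or
    price), an unshuffled split gives non-representative subsets.
    """
    rng = random.Random(seed)
    indices = list(range(len(data)))
    rng.shuffle(indices)
    n_test = int(len(data) * test_frac)
    n_val = int(len(data) * val_frac)
    test = [data[i] for i in indices[:n_test]]
    val = [data[i] for i in indices[n_test:n_test + n_val]]
    train = [data[i] for i in indices[n_test + n_val:]]
    return train, val, test

train, val, test = train_val_test_split(list(range(100)))
print(len(train), len(val), len(test))  # 70 10 20
```

In practice a library routine (such as scikit-learn's `train_test_split`, applied twice) does the same job; the point is that the three subsets are disjoint and the shuffle happens before the cut.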

Evaluation metrics for regression

Mean Squared Error (MSE)

\text{MSE} = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}^{(i)} - y^{(i)})^2

The same metric used to train the model, now computed on the test set. Lower is better. Units are the square of the target units (e.g. dollars²), which can be hard to interpret.

Root Mean Squared Error (RMSE)

\text{RMSE} = \sqrt{\text{MSE}}

Same units as the target — much easier to interpret. An RMSE of $20,000 means predictions are off by roughly $20,000 on average.

Mean Absolute Error (MAE)

\text{MAE} = \frac{1}{m} \sum_{i=1}^{m} |\hat{y}^{(i)} - y^{(i)}|

The average absolute error. Less sensitive to outliers than RMSE because errors are not squared.
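All three error metrics are one-liners in NumPy. The predictions and targets below are invented for illustration (house prices in dollars):

```python
import numpy as np

# Hypothetical test-set targets and model predictions, in dollars.
y_true = np.array([200_000, 150_000, 320_000, 180_000])
y_pred = np.array([195_000, 160_000, 300_000, 185_000])

mse = np.mean((y_pred - y_true) ** 2)    # squared units (dollars²)
rmse = np.sqrt(mse)                      # back in dollars
mae = np.mean(np.abs(y_pred - y_true))   # average absolute miss, in dollars

print(f"MSE:  {mse:,.0f}")   # 137,500,000
print(f"RMSE: {rmse:,.0f}")  # 11,726
print(f"MAE:  {mae:,.0f}")   # 10,000
```

Note how the single $20,000 miss dominates the RMSE (pulling it above the MAE) while contributing only its proportional share to the MAE.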

R-squared (R^2)

R^2 = 1 - \frac{\sum_{i=1}^{m}(\hat{y}^{(i)} - y^{(i)})^2}{\sum_{i=1}^{m}(y^{(i)} - \bar{y})^2}

R^2 measures the proportion of variance in y explained by the model, relative to a baseline that always predicts \bar{y}:

  • R^2 = 1: perfect predictions.
  • R^2 = 0: the model does no better than predicting the mean every time.
  • R^2 < 0: the model is worse than predicting the mean (possible on the test set).

R^2 is scale-free and easy to communicate: "The model explains 87% of the variance in house prices."
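The formula translates directly to code. A minimal sketch (the helper name and the toy data are illustrative), checking the first two bullet points above:

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R² relative to the predict-the-mean baseline."""
    ss_res = np.sum((y_pred - y_true) ** 2)            # model's squared error
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)   # baseline's squared error
    return 1 - ss_res / ss_tot

y_true = np.array([3.0, 5.0, 7.0, 9.0])

# Perfect predictions: residual sum is zero.
print(r_squared(y_true, y_true))                     # 1.0
# Predicting the mean every time: residual equals total, ratio is 1.
print(r_squared(y_true, np.full(4, y_true.mean())))  # 0.0
```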

Overfitting and underfitting

  Situation      Training error   Test error   Diagnosis
  Underfitting   High             High         Model too simple; try more features
  Good fit       Low              Low          Model generalizes well
  Overfitting    Very low         High         Model memorized training data; reduce complexity

For linear regression, overfitting is less common than in more complex models, but it can occur when you have many features relative to the number of training examples.
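The overfitting row of the table can be reproduced with a small experiment: fit a degree-1 and a degree-9 polynomial (the latter standing in for "many features, few examples") to ten points from a noisy linear ground truth, then evaluate between the training points. The data here is invented for illustration:

```python
import numpy as np

# Fixed "noise" values so the example is reproducible without a random seed.
noise = np.array([0.05, -0.08, 0.03, 0.10, -0.02,
                  0.07, -0.05, 0.02, -0.10, 0.04])

x_train = np.linspace(0.0, 1.0, 10)
y_train = 2.0 * x_train + noise            # linear ground truth plus noise
x_test = (x_train[:-1] + x_train[1:]) / 2  # points between the training x's
y_test = 2.0 * x_test                      # noise-free targets

train_rmse, test_rmse = {}, {}
for degree in (1, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_rmse[degree] = np.sqrt(
        np.mean((np.polyval(coeffs, x_train) - y_train) ** 2))
    test_rmse[degree] = np.sqrt(
        np.mean((np.polyval(coeffs, x_test) - y_test) ** 2))
    print(f"degree {degree}: train RMSE {train_rmse[degree]:.4f}, "
          f"test RMSE {test_rmse[degree]:.4f}")
```

The degree-9 polynomial passes through all ten training points (train RMSE near zero) but wiggles between them, so it typically does worse on the in-between test points than the simple line: low training error, high test error.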

Cross-validation

When data is scarce, a single train/test split may give an unreliable estimate of test performance. k-fold cross-validation is more robust:

  1. Split data into k equal folds (typically k = 5 or k = 10).
  2. Train on k - 1 folds, evaluate on the remaining fold.
  3. Repeat k times, each time using a different fold as the test set.
  4. Average the k evaluation scores.

This uses all data for both training and evaluation, at the cost of training the model k times.
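The four steps above can be sketched from scratch. The helper is generic over any `fit`/`score` pair; the noise-free linear data and the `polyfit`-based model below are stand-ins for illustration:

```python
import numpy as np

def k_fold_scores(X, y, k, fit, score):
    """Generic k-fold cross-validation.

    fit(X_train, y_train) -> model
    score(model, X_val, y_val) -> float
    """
    folds = np.array_split(np.arange(len(X)), k)   # step 1: k folds
    scores = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train_idx], y[train_idx])    # step 2: train on k-1 folds
        scores.append(score(model, X[val_idx], y[val_idx]))
    return scores                                  # step 3 done; step 4 below

X = np.linspace(0.0, 1.0, 20)
y = 3.0 * X + 1.0   # noise-free line, so a linear fit should score near zero

fit = lambda Xt, yt: np.polyfit(Xt, yt, 1)
score = lambda m, Xv, yv: np.sqrt(np.mean((np.polyval(m, Xv) - yv) ** 2))

scores = k_fold_scores(X, y, k=5, fit=fit, score=score)
print(np.mean(scores))  # step 4: average of the 5 fold RMSEs, near 0
```

As with the train/test split, ordered data should be shuffled before being cut into folds; that step is omitted here for brevity.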

Choosing a metric

  • Use RMSE when large errors are especially bad (e.g. financial forecasting).
  • Use MAE when outliers are common and you don't want them to dominate.
  • Use R^2 when you want a scale-free, easily communicated measure of fit.

In practice, report multiple metrics and let the problem context guide which one you optimize.