Evaluating Your Model
Why evaluation matters
A model that fits the training data perfectly is not necessarily a good model — it may have simply memorized the training examples and fail on new data. Proper evaluation measures how well the model generalizes to unseen examples.
The train/test split
The standard approach is to divide your dataset into two non-overlapping subsets before training:
- Training set (typically 70–80%): used to fit the model parameters.
- Test set (typically 20–30%): held out and only used for final evaluation.
Never use the test set during training or hyperparameter tuning. Treat it like an exam you haven't seen yet.
For hyperparameter tuning (e.g. choosing the learning rate or regularization strength), add a third split:
- Validation set (e.g. 10–20% of total data): used for tuning, not for final evaluation.
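The three-way split can be sketched with a shuffled index array. A minimal NumPy sketch; the dataset size and the 70/15/15 proportions below are illustrative assumptions, not a prescription:

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # fixed seed so the split is reproducible

n = 1000                              # assume 1,000 examples
indices = rng.permutation(n)          # shuffle before splitting

train_end = round(0.70 * n)           # 70% training
val_end = round(0.85 * n)             # next 15% validation

train_idx = indices[:train_end]
val_idx = indices[train_end:val_end]
test_idx = indices[val_end:]          # remaining 15% held-out test set

print(len(train_idx), len(val_idx), len(test_idx))  # 700 150 150
```

Shuffling before slicing matters: if the data is ordered (say, by date or price), an unshuffled split gives training and test sets drawn from different distributions.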
Evaluation metrics for regression
Mean Squared Error (MSE)
The same metric used to train the model, now computed on the test set. Lower is better. Units are the square of the target units (e.g. dollars²), which can be hard to interpret.
Root Mean Squared Error (RMSE)
Same units as the target — much easier to interpret. An RMSE of $20,000 means predictions are off by roughly $20,000 on average.
Mean Absolute Error (MAE)
The average absolute error. Less sensitive to outliers than RMSE because errors are not squared.
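All three metrics can be computed directly from the prediction errors. A short NumPy sketch on made-up house-price data (the values are illustrative only):

```python
import numpy as np

y_true = np.array([250_000, 310_000, 180_000, 420_000])  # actual prices ($)
y_pred = np.array([265_000, 295_000, 200_000, 405_000])  # model predictions ($)

errors = y_pred - y_true

mse = np.mean(errors ** 2)        # squared-dollar units: 268,750,000 $²
rmse = np.sqrt(mse)               # back to dollars: ≈ $16,394
mae = np.mean(np.abs(errors))     # average absolute miss: $16,250
```

Note how close RMSE and MAE are here: none of the four errors is an outlier. A single large error (say, $100,000 on one house) would inflate RMSE far more than MAE.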
R-squared (R²)
R² measures the proportion of variance in y explained by the model, relative to a baseline that always predicts the mean ȳ:
- R² = 1: perfect predictions.
- R² = 0: the model does no better than predicting the mean every time.
- R² < 0: the model is worse than predicting the mean (possible on the test set).
R² is scale-free and easy to communicate: "The model explains 87% of the variance in house prices."
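Using the standard definition R² = 1 − SS_res / SS_tot, the computation is a few lines of NumPy (toy values for illustration):

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 9.0])

ss_res = np.sum((y_true - y_pred) ** 2)           # residual sum of squares: 0.5
ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # total sum of squares: 20.0

r2 = 1 - ss_res / ss_tot                          # → 0.975
```

If the model's residuals (ss_res) ever exceed the baseline's (ss_tot), the ratio exceeds 1 and R² goes negative, which is exactly the "worse than predicting the mean" case above.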
Overfitting and underfitting
| Situation | Training error | Test error | Diagnosis |
|---|---|---|---|
| Underfitting | High | High | Model too simple — try more features |
| Good fit | Low | Low | Model generalises well |
| Overfitting | Very low | High | Model memorised training data — reduce complexity |
For linear regression, overfitting is less common than in more complex models, but it can occur when you have many features relative to the number of training examples.
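One way to see the diagnosis table in action is to fit models of increasing flexibility to a small noisy dataset and compare training and test error. This sketch uses NumPy's polynomial fitting as a stand-in for "many features relative to examples"; the data, degrees, and seed are all illustrative assumptions:

```python
import warnings
import numpy as np

rng = np.random.default_rng(seed=1)
x = rng.uniform(0.0, 1.0, 30)
y = 2.0 * x + rng.normal(0.0, 0.1, 30)   # true relationship is linear, plus noise

x_train, y_train = x[:20], y[:20]
x_test, y_test = x[20:], y[20:]

results = {}
for degree in (1, 9):                    # simple model vs. very flexible model
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")  # polyfit may warn at high degree
        coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    results[degree] = (train_mse, test_mse)
```

The degree-9 fit always achieves training error at most that of the degree-1 fit (its hypothesis space contains every line), while its test error typically rises: the signature of overfitting from the table above.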
Cross-validation
When data is scarce, a single train/test split may give an unreliable estimate of test performance. k-fold cross-validation is more robust:
- Split the data into k equal folds (typically k = 5 or k = 10).
- Train on k − 1 folds, evaluate on the remaining fold.
- Repeat k times, each time using a different fold as the test set.
- Average the k evaluation scores.
This uses all the data for both training and evaluation, at the cost of training the model k times.
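The procedure above can be sketched as a generator over index arrays. This is a minimal NumPy version for illustration, not a library API:

```python
import numpy as np

def k_fold_indices(n, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n)         # shuffle once, then partition
    folds = np.array_split(indices, k)   # k (nearly) equal folds
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

# Each example lands in the test fold exactly once across the k iterations.
for fold, (train_idx, test_idx) in enumerate(k_fold_indices(n=10, k=5)):
    print(fold, len(train_idx), len(test_idx))   # 8 training, 2 test per fold
```

Inside the loop you would fit the model on `train_idx` rows, score it on `test_idx` rows, and average the k scores at the end.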
Choosing a metric
- Use RMSE when large errors are especially bad (e.g. financial forecasting).
- Use MAE when outliers are common and you don't want them to dominate.
- Use when you want a scale-free, easily communicated measure of fit.
In practice, report multiple metrics and let the problem context guide which one you optimize.