Measuring Error: The Cost Function
Why we need a cost function
A model's parameters $w$ (slope) and $b$ (intercept) control where the line sits. To find the best line, we need a single number that summarizes how wrong the model is across all training examples. That number is the cost (also called loss).
A cost function takes the current parameters as input and returns a scalar that measures total prediction error. Training means finding the parameters that minimize this cost.
Mean Squared Error
The standard cost function for linear regression is Mean Squared Error (MSE):

$$J(w, b) = \frac{1}{2m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)^2$$

where $\hat{y}^{(i)} = w x^{(i)} + b$ is the model's prediction for example $i$.
Breaking this down:
- $m$ is the number of training examples.
- $\hat{y}^{(i)} - y^{(i)}$ is the residual for example $i$.
- Squaring the residual makes all errors positive and penalizes large errors more heavily than small ones.
- Dividing by $m$ (or $2m$) gives an average, so the cost doesn't grow just because you have more data. The factor of $\frac{1}{2}$ is a convenience that cancels with the exponent $2$ when you take the derivative; it has no effect on which parameters minimize $J$.
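The formula can be sketched directly in code. This is a minimal, dependency-free sketch using the $\frac{1}{2m}$ convention; the function name `mse_cost` is illustrative, not from the lesson:

```python
def mse_cost(w, b, xs, ys):
    """MSE cost J(w, b) for a line y = w*x + b, with the 1/(2m) convention."""
    m = len(xs)
    total = 0.0
    for x, y in zip(xs, ys):
        residual = (w * x + b) - y  # prediction minus true value
        total += residual ** 2      # squaring keeps every term non-negative
    return total / (2 * m)          # average over m examples, times 1/2
```

A line that fits the data perfectly has a cost of exactly zero, which is the floor of this function.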
Why square the errors?
Several reasons:
- Sign cancellation: without squaring, a residual of $+5$ and one of $-5$ would cancel to zero, giving a false impression of a perfect fit.
- Differentiability: the squared function is smooth, making calculus-based optimization straightforward.
- Penalizing outliers: squaring means a residual of 10 contributes 100 to the cost, while a residual of 1 contributes only 1. The model is strongly motivated to fix large errors.
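A quick numeric check of the sign-cancellation and outlier points (the residual values here are illustrative):

```python
residuals = [5.0, -5.0]
print(sum(residuals))                  # raw sum cancels to 0.0: looks like a perfect fit
print(sum(r ** 2 for r in residuals))  # squared sum is 50.0: the error is visible

# An outlier residual of 10 dominates a residual of 1 once squared:
print(10 ** 2, 1 ** 2)                 # 100 vs 1
```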
A concrete example
Suppose $m = 3$ training examples with residuals $1$, $-2$, and $3$:

$$J = \frac{1}{2 \cdot 3} \left( 1^2 + (-2)^2 + 3^2 \right) = \frac{14}{6} \approx 2.33$$
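As a sanity check, a sketch computing the cost for three illustrative residuals (values chosen here for demonstration):

```python
residuals = [1.0, -2.0, 3.0]  # illustrative residuals for m = 3 examples
m = len(residuals)
J = sum(r ** 2 for r in residuals) / (2 * m)  # (1 + 4 + 9) / 6
print(J)
```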
The cost as a surface
Think of $J(w, b)$ as a bowl-shaped surface in three dimensions: the two horizontal axes are $w$ and $b$, and the vertical axis is the cost. Every point on this surface represents a different line. The bottom of the bowl (the global minimum) is the best-fit line.
For simple linear regression with MSE, this surface is always a convex paraboloid: it has exactly one minimum, so there is always a unique best answer.
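One way to see the bowl shape is to evaluate the cost over a grid of $(w, b)$ values and confirm the smallest value sits at a single point. A sketch with made-up data lying exactly on $y = 2x + 1$, so the true minimum is at $w = 2$, $b = 1$:

```python
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # exactly y = 2x + 1, so the minimum cost is 0

def cost(w, b):
    m = len(xs)
    return sum(((w * x + b) - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

# Coarse grid of candidate (w, b) pairs from 0.0 to 4.0 in steps of 0.5.
grid = [(w / 2, b / 2) for w in range(0, 9) for b in range(0, 9)]
best = min(grid, key=lambda p: cost(*p))
print(best, cost(*best))  # the unique grid minimum is (2.0, 1.0) with cost 0.0
```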
Other cost functions
MSE is the default, but alternatives exist:
| Cost function | Formula (per example) | Use case |
|---|---|---|
| MSE | $(\hat{y} - y)^2$ | Standard regression |
| MAE (Mean Absolute Error) | $\lvert \hat{y} - y \rvert$ | Robust to outliers |
| Huber loss | $\frac{1}{2} r^2$ if $\lvert r \rvert \le \delta$, else $\delta \left( \lvert r \rvert - \frac{\delta}{2} \right)$, with $r = \hat{y} - y$ | Quadratic near 0, linear far from 0: a balance of both |
For this course, MSE is used throughout because it has a clean closed-form solution and well-behaved gradients.
Summary
The cost function gives training a clear objective: find $w$ and $b$ that minimize $J(w, b)$. The next two lessons cover the two main methods for doing that: Ordinary Least Squares (closed-form) and Gradient Descent (iterative).