Measuring Error: The Cost Function

Why we need a cost function

A model's parameters $\theta_0$ and $\theta_1$ control where the line sits. To find the best line, we need a single number that summarizes how wrong the model is across all training examples. That number is the cost (also called loss).

A cost function $J(\theta_0, \theta_1)$ takes the current parameters as input and returns a scalar that measures total prediction error. Training means finding the parameters that minimize this cost.

Mean Squared Error

The standard cost function for linear regression is Mean Squared Error (MSE):

$$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)^2$$

Breaking this down:

  • $m$ is the number of training examples.
  • $\hat{y}^{(i)} - y^{(i)}$ is the residual for example $i$.
  • Squaring the residual makes all errors positive and penalizes large errors more heavily than small ones.
  • Dividing by $m$ (or $2m$) gives an average, so the cost doesn't grow just because you have more data. The factor of $\frac{1}{2}$ is a convenience that cancels with the exponent when you take the derivative — it has no effect on which parameters minimize $J$.
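The formula above translates directly into a few lines of code. This is a minimal sketch, not a library implementation; the function name `mse_cost` and the toy data are made up for illustration:

```python
def mse_cost(theta0, theta1, xs, ys):
    """J(theta0, theta1) = (1 / 2m) * sum of squared residuals."""
    m = len(xs)
    total = 0.0
    for x, y in zip(xs, ys):
        y_hat = theta0 + theta1 * x   # model prediction for this example
        total += (y_hat - y) ** 2     # squared residual
    return total / (2 * m)

# The data lies exactly on y = 1 + 2x, so the cost at (1, 2) is zero.
print(mse_cost(1.0, 2.0, [0, 1, 2], [1, 3, 5]))  # → 0.0
```

Any other parameter pair gives a strictly positive cost on this data, which is exactly what minimization will exploit.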

Why square the errors?

Several reasons:

  1. Sign cancellation: without squaring, a residual of $+5$ and one of $-5$ would cancel to zero, giving a false impression of a perfect fit.
  2. Differentiability: the squared function is smooth, making calculus-based optimization straightforward.
  3. Penalizing outliers: squaring means a residual of 10 contributes 100 to the cost, while a residual of 1 contributes only 1. The model is strongly motivated to fix large errors.
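Points 1 and 3 are easy to check directly. A throwaway sketch (the residual values here are arbitrary):

```python
# Point 1: residuals of +5 and -5 sum to zero, a false "perfect fit" --
# but their squares do not cancel.
print(sum([5, -5]))                   # → 0
print(sum(e ** 2 for e in [5, -5]))   # → 50

# Point 3: a residual of 10 contributes 100x more than a residual of 1.
print([e ** 2 for e in [1, 10]])      # → [1, 100]
```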

A concrete example

Suppose $m = 3$ training examples with residuals $e^{(1)} = 2$, $e^{(2)} = -1$, $e^{(3)} = 3$:

$$J = \frac{1}{2 \cdot 3} \left( 2^2 + (-1)^2 + 3^2 \right) = \frac{1}{6}(4 + 1 + 9) = \frac{14}{6} \approx 2.33$$
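The same arithmetic in code, starting from the residuals rather than raw data:

```python
residuals = [2, -1, 3]
m = len(residuals)

# J = (1 / 2m) * sum of squared residuals
J = sum(e ** 2 for e in residuals) / (2 * m)
print(J)  # roughly 2.33 (= 14/6)
```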

The cost as a surface

Think of $J$ as a bowl-shaped surface in three dimensions: the two horizontal axes are $\theta_0$ and $\theta_1$, and the vertical axis is the cost. Every point on this surface represents a different line. The bottom of the bowl — the global minimum — is the best-fit line.

For simple linear regression with MSE, this surface is always a convex paraboloid: it has exactly one minimum, so there is always a unique best answer.
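One way to see the bowl shape is to evaluate $J$ over a grid of parameter pairs and pick the lowest point. A brute-force sketch (the grid resolution and the toy data $y = 1 + 2x$ are arbitrary choices, not part of any real training procedure):

```python
xs, ys = [0.0, 1.0, 2.0], [1.0, 3.0, 5.0]  # data lying on y = 1 + 2x

def cost(t0, t1):
    return sum((t0 + t1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * len(xs))

# Coarse grid over theta0, theta1 in [-2, 4] with step 0.5.
grid = [(t0 / 2, t1 / 2) for t0 in range(-4, 9) for t1 in range(-4, 9)]
best = min(grid, key=lambda p: cost(*p))
print(best, cost(*best))  # → (1.0, 2.0) 0.0
```

The grid scan finds the single bottom of the bowl; the next lessons replace this brute force with methods that go there directly.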

Other cost functions

MSE is the default, but alternatives exist:

| Cost function | Formula (per example) | Use case |
| --- | --- | --- |
| MSE | $(e^{(i)})^2$ | Standard regression |
| MAE (Mean Absolute Error) | $\lvert e^{(i)} \rvert$ | Robust to outliers |
| Huber loss | Quadratic near 0, linear far from 0 | Balance of both |
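The three per-example losses side by side, as a sketch. The Huber definition below is the standard one (quadratic for $\lvert e \rvert \le \delta$, linear beyond); the function names and the threshold `delta=1.0` are illustrative choices:

```python
def squared_loss(e):
    return e ** 2

def absolute_loss(e):
    return abs(e)

def huber_loss(e, delta=1.0):
    # Quadratic near zero, linear once |e| exceeds delta.
    if abs(e) <= delta:
        return 0.5 * e ** 2
    return delta * (abs(e) - 0.5 * delta)

for e in [0.5, 10.0]:
    print(squared_loss(e), absolute_loss(e), huber_loss(e))
# small residual 0.5:  0.25  0.5   0.125
# outlier 10.0:       100.0  10.0  9.5
```

Note how the outlier dominates the squared loss (100) but contributes only linearly to MAE and Huber — that is what "robust to outliers" means in the table.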

For this course, MSE is used throughout because it has a clean closed-form solution and well-behaved gradients.

Summary

The cost function gives training a clear objective: find $\theta_0$ and $\theta_1$ that minimize $J(\theta_0, \theta_1)$. The next two lessons cover the two main methods for doing that — Ordinary Least Squares (closed-form) and Gradient Descent (iterative).