Measuring Error: The Cost Function

Why we need a cost function

A model's parameters $\theta_0$ and $\theta_1$ control where the line sits. To find the best line, we need a single number that summarizes how wrong the model is across all training examples. That number is the cost (also called loss).

A cost function $J(\theta_0, \theta_1)$ takes the current parameters as input and returns a scalar that measures total prediction error. Training means finding the parameters that minimize this cost.

Mean Squared Error

The standard cost function for linear regression is Mean Squared Error (MSE):

$$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)^2$$

Breaking this down:

  • $m$ is the number of training examples.
  • $\hat{y}^{(i)} - y^{(i)}$ is the residual for example $i$.
  • Squaring the residual makes all errors positive and penalizes large errors more heavily than small ones.
  • Dividing by $m$ (or $2m$) gives an average, so the cost doesn't grow just because you have more data. The factor of $\frac{1}{2}$ is a convenience that cancels with the exponent when you take the derivative — it has no effect on which parameters minimize $J$.
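The formula above translates directly into a few lines of code. This is a minimal sketch, not a library implementation; the function name `mse_cost` and the toy data are made up for illustration:

```python
def mse_cost(theta0, theta1, xs, ys):
    """J(theta0, theta1) = (1 / 2m) * sum of squared residuals."""
    m = len(xs)
    total = 0.0
    for x, y in zip(xs, ys):
        y_hat = theta0 + theta1 * x   # model prediction for this example
        total += (y_hat - y) ** 2     # squared residual
    return total / (2 * m)

# The data lies exactly on y = 1 + 2x, so the cost at (1, 2) is zero.
print(mse_cost(1.0, 2.0, [0, 1, 2], [1, 3, 5]))  # → 0.0
```

Any other parameter pair gives a strictly positive cost on this data, which is exactly what minimization will exploit.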

Why square the errors?

Several reasons:

  1. Sign cancellation: without squaring, a residual of $+5$ and one of $-5$ would cancel to zero, giving a false impression of a perfect fit.
  2. Differentiability: the squared function is smooth, making calculus-based optimization straightforward.
  3. Penalizing outliers: squaring means a residual of 10 contributes 100 to the cost, while a residual of 1 contributes only 1. The model is strongly motivated to fix large errors.
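Points 1 and 3 are easy to check directly. A throwaway sketch (the residual values here are arbitrary):

```python
# Point 1: residuals of +5 and -5 sum to zero, a false "perfect fit" --
# but their squares do not cancel.
print(sum([5, -5]))                   # → 0
print(sum(e ** 2 for e in [5, -5]))   # → 50

# Point 3: a residual of 10 contributes 100x more than a residual of 1.
print([e ** 2 for e in [1, 10]])      # → [1, 100]
```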

A concrete example

Suppose $m = 3$ training examples with residuals $e^{(1)} = 2$, $e^{(2)} = -1$, $e^{(3)} = 3$:

$$J = \frac{1}{2 \cdot 3} \left( 2^2 + (-1)^2 + 3^2 \right) = \frac{1}{6}(4 + 1 + 9) = \frac{14}{6} \approx 2.33$$
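The same arithmetic in code, starting from the residuals rather than raw data:

```python
residuals = [2, -1, 3]
m = len(residuals)

# J = (1 / 2m) * sum of squared residuals
J = sum(e ** 2 for e in residuals) / (2 * m)
print(J)  # roughly 2.33 (= 14/6)
```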

The cost as a surface

Think of $J$ as a bowl-shaped surface in three dimensions: the two horizontal axes are $\theta_0$ and $\theta_1$, and the vertical axis is the cost. Every point on this surface represents a different line. The bottom of the bowl — the global minimum — is the best-fit line.

For simple linear regression with MSE, this surface is always a convex paraboloid: it has exactly one minimum, so there is always a unique best answer.
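One way to see the bowl shape is to evaluate $J$ over a grid of parameter pairs and pick the lowest point. A brute-force sketch (the grid resolution and the toy data $y = 1 + 2x$ are arbitrary choices, not part of any real training procedure):

```python
xs, ys = [0.0, 1.0, 2.0], [1.0, 3.0, 5.0]  # data lying on y = 1 + 2x

def cost(t0, t1):
    return sum((t0 + t1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * len(xs))

# Coarse grid over theta0, theta1 in [-2, 4] with step 0.5.
grid = [(t0 / 2, t1 / 2) for t0 in range(-4, 9) for t1 in range(-4, 9)]
best = min(grid, key=lambda p: cost(*p))
print(best, cost(*best))  # → (1.0, 2.0) 0.0
```

The grid scan finds the single bottom of the bowl; the next lessons replace this brute force with methods that go there directly.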

Other cost functions

MSE is the default, but alternatives exist:

| Cost function | Formula (per example) | Use case |
| --- | --- | --- |
| MSE | $(e^{(i)})^2$ | Standard regression |
| MAE (Mean Absolute Error) | $\lvert e^{(i)} \rvert$ | Robust to outliers |
| Huber loss | Quadratic near 0, linear far from 0 | Balance of both |
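The three per-example losses side by side, as a sketch. The Huber definition below is the standard one (quadratic for $\lvert e \rvert \le \delta$, linear beyond); the function names and the threshold `delta=1.0` are illustrative choices:

```python
def squared_loss(e):
    return e ** 2

def absolute_loss(e):
    return abs(e)

def huber_loss(e, delta=1.0):
    # Quadratic near zero, linear once |e| exceeds delta.
    if abs(e) <= delta:
        return 0.5 * e ** 2
    return delta * (abs(e) - 0.5 * delta)

for e in [0.5, 10.0]:
    print(squared_loss(e), absolute_loss(e), huber_loss(e))
# small residual 0.5:  0.25  0.5   0.125
# outlier 10.0:       100.0  10.0  9.5
```

Note how the outlier dominates the squared loss (100) but contributes only linearly to MAE and Huber — that is what "robust to outliers" means in the table.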

For this course, MSE is used throughout because it has a clean closed-form solution and well-behaved gradients.

Summary

The cost function gives training a clear objective: find $\theta_0$ and $\theta_1$ that minimize $J(\theta_0, \theta_1)$. The next two lessons cover the two main methods for doing that — Ordinary Least Squares (closed-form) and Gradient Descent (iterative).