The Line of Best Fit

The equation of a line

From school mathematics you may recall the equation of a straight line:

y = mx + b

where m is the slope and b is the y-intercept. In machine learning notation, the same equation is written:

\hat{y} = \theta_0 + \theta_1 x

  • \hat{y} (pronounced "y-hat") is the predicted value.
  • x is the input feature.
  • \theta_0 (theta-zero) is the intercept: the value of \hat{y} when x = 0.
  • \theta_1 (theta-one) is the slope: how much \hat{y} changes for a one-unit increase in x.

The two values \theta_0 and \theta_1 are called the model parameters or weights. Training a linear regression model means finding the values of these parameters that make the line fit the data as well as possible.
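As a minimal sketch in Python (the function name is illustrative, not part of the course notation), the model is nothing more than this two-parameter function of x:

```python
# Simple linear regression hypothesis: y_hat = theta0 + theta1 * x
def predict(x, theta0, theta1):
    """Return the predicted value y-hat for input feature x."""
    return theta0 + theta1 * x

# Training would choose theta0 and theta1; here they are placeholders.
print(predict(2.0, 1.0, 3.0))  # 1.0 + 3.0 * 2.0 = 7.0
```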

What the slope and intercept mean

Consider a model that predicts house price (in thousands of dollars) from size (in square feet):

\hat{\text{price}} = 50 + 0.15 \times \text{size}

  • Intercept \theta_0 = 50: a house with zero square feet would be predicted to cost $50,000. (This may not be physically meaningful, but it anchors the line.)
  • Slope \theta_1 = 0.15: each additional square foot adds $150 to the predicted price.

Predictions

Given a trained model, making a prediction is just arithmetic. For a 1,200 sq ft house:

\hat{y} = 50 + 0.15 \times 1200 = 50 + 180 = 230

Predicted price: $230,000.
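The same arithmetic, sketched in Python (the function name is illustrative):

```python
def predict_price(size_sqft):
    """Predicted house price in thousands of dollars: 50 + 0.15 * size."""
    theta0, theta1 = 50.0, 0.15
    return theta0 + theta1 * size_sqft

print(predict_price(1200))  # 50 + 180 = 230.0, i.e. $230,000
```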

Residuals

No line fits noisy real-world data perfectly. The difference between the actual value y^{(i)} and the predicted value \hat{y}^{(i)} for training example i is called the residual:

e^{(i)} = y^{(i)} - \hat{y}^{(i)}

A positive residual means the model under-predicted; a negative residual means it over-predicted. The goal of training is to find parameters \theta_0, \theta_1 that make these residuals collectively as small as possible.
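Residuals are easy to compute directly. A sketch on a toy dataset (the numbers are made up for illustration, not course data):

```python
# Residuals e^(i) = y^(i) - y_hat^(i) on a toy dataset.
sizes  = [800, 1200, 1500]   # x^(i), size in sq ft
prices = [180, 220, 290]     # y^(i), actual price in $1000s

theta0, theta1 = 50.0, 0.15
preds = [theta0 + theta1 * x for x in sizes]          # y_hat^(i)
residuals = [y - y_hat for y, y_hat in zip(prices, preds)]

print(residuals)  # [10.0, -10.0, 15.0]
```

Here the model under-predicts the first and third houses (positive residuals) and over-predicts the second (negative residual).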

Many lines are possible

For any dataset there are infinitely many lines you could draw. The question is: which line is best? Different answers to that question lead to different learning algorithms. The most common answer — minimize the sum of squared residuals — is called Ordinary Least Squares, covered in lesson 4.
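As a preview of that criterion, two candidate lines can be compared by their sum of squared residuals on the same toy data as above (both candidate parameter pairs are made up for illustration):

```python
# Compare two candidate lines by their sum of squared residuals (SSR).
sizes  = [800, 1200, 1500]   # x^(i), size in sq ft
prices = [180, 220, 290]     # y^(i), actual price in $1000s

def ssr(theta0, theta1):
    """Sum of squared residuals for the line y_hat = theta0 + theta1 * x."""
    return sum((y - (theta0 + theta1 * x)) ** 2 for x, y in zip(sizes, prices))

print(ssr(50.0, 0.15))  # candidate line A
print(ssr(0.0, 0.20))   # candidate line B
# Whichever line has the smaller SSR fits these points better.
```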

Notation summary

| Symbol | Meaning |
| --- | --- |
| x^{(i)} | Feature value of the i-th training example |
| y^{(i)} | True target value of the i-th training example |
| \hat{y}^{(i)} | Predicted value for the i-th example |
| m | Number of training examples |
| \theta_0 | Intercept parameter |
| \theta_1 | Slope parameter |

This notation will be used throughout the rest of the course.