The Line of Best Fit

The equation of a line

From school mathematics you may recall the equation of a straight line:

y = mx + b

where m is the slope and b is the y-intercept. In machine learning notation, the same equation is written:

\hat{y} = \theta_0 + \theta_1 x

  • \hat{y} (pronounced "y-hat") is the predicted value.
  • x is the input feature.
  • \theta_0 (theta-zero) is the intercept: the value of \hat{y} when x = 0.
  • \theta_1 (theta-one) is the slope: how much \hat{y} changes for a one-unit increase in x.

The two values \theta_0 and \theta_1 are called the model parameters or weights. Training a linear regression model means finding the values of these parameters that make the line fit the data as well as possible.
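As a minimal sketch in Python (the function name is illustrative, not part of the course notation), the model is nothing more than this two-parameter function of x:

```python
# Simple linear regression hypothesis: y_hat = theta0 + theta1 * x
def predict(x, theta0, theta1):
    """Return the predicted value y-hat for input feature x."""
    return theta0 + theta1 * x

# Training would choose theta0 and theta1; here they are placeholders.
print(predict(2.0, 1.0, 3.0))  # 1.0 + 3.0 * 2.0 = 7.0
```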

What the slope and intercept mean

Consider a model that predicts house price (in thousands of dollars) from size (in square feet):

\hat{\text{price}} = 50 + 0.15 \times \text{size}

  • Intercept \theta_0 = 50: a house with zero square feet would be predicted to cost $50,000. (This may not be physically meaningful, but it anchors the line.)
  • Slope \theta_1 = 0.15: each additional square foot adds $150 to the predicted price.

Predictions

Given a trained model, making a prediction is just arithmetic. For a 1,200 sq ft house:

\hat{y} = 50 + 0.15 \times 1200 = 50 + 180 = 230

Predicted price: $230,000.
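The same arithmetic, sketched in Python (the function name is illustrative):

```python
def predict_price(size_sqft):
    """Predicted house price in thousands of dollars: 50 + 0.15 * size."""
    theta0, theta1 = 50.0, 0.15
    return theta0 + theta1 * size_sqft

print(predict_price(1200))  # 50 + 180 = 230.0, i.e. $230,000
```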

Residuals

No line fits noisy real-world data perfectly. The difference between the actual value y^{(i)} and the predicted value \hat{y}^{(i)} for training example i is called the residual:

e^{(i)} = y^{(i)} - \hat{y}^{(i)}

A positive residual means the model under-predicted; a negative residual means it over-predicted. The goal of training is to find parameters \theta_0, \theta_1 that make these residuals collectively as small as possible.
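Residuals are easy to compute directly. A sketch on a toy dataset (the numbers are made up for illustration, not course data):

```python
# Residuals e^(i) = y^(i) - y_hat^(i) on a toy dataset.
sizes  = [800, 1200, 1500]   # x^(i), size in sq ft
prices = [180, 220, 290]     # y^(i), actual price in $1000s

theta0, theta1 = 50.0, 0.15
preds = [theta0 + theta1 * x for x in sizes]          # y_hat^(i)
residuals = [y - y_hat for y, y_hat in zip(prices, preds)]

print(residuals)  # [10.0, -10.0, 15.0]
```

Here the model under-predicts the first and third houses (positive residuals) and over-predicts the second (negative residual).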

Many lines are possible

For any dataset there are infinitely many lines you could draw. The question is: which line is best? Different answers to that question lead to different learning algorithms. The most common answer — minimize the sum of squared residuals — is called Ordinary Least Squares, covered in lesson 4.
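As a preview of that criterion, two candidate lines can be compared by their sum of squared residuals on the same toy data as above (both candidate parameter pairs are made up for illustration):

```python
# Compare two candidate lines by their sum of squared residuals (SSR).
sizes  = [800, 1200, 1500]   # x^(i), size in sq ft
prices = [180, 220, 290]     # y^(i), actual price in $1000s

def ssr(theta0, theta1):
    """Sum of squared residuals for the line y_hat = theta0 + theta1 * x."""
    return sum((y - (theta0 + theta1 * x)) ** 2 for x, y in zip(sizes, prices))

print(ssr(50.0, 0.15))  # candidate line A
print(ssr(0.0, 0.20))   # candidate line B
# Whichever line has the smaller SSR fits these points better.
```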

Notation summary

| Symbol | Meaning |
| --- | --- |
| x^{(i)} | Feature value of the i-th training example |
| y^{(i)} | True target value of the i-th training example |
| \hat{y}^{(i)} | Predicted value for the i-th example |
| m | Number of training examples |
| \theta_0 | Intercept parameter |
| \theta_1 | Slope parameter |

This notation will be used throughout the rest of the course.