Multiple Features and the Normal Equation
Going beyond one feature
Real datasets rarely have just one input variable. A house price model might use size, number of bedrooms, age, and distance to the city centre — all at once. Multiple linear regression extends the single-feature model to handle many features simultaneously.
The prediction equation becomes:

$$\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$$

where $x_j$ is the $j$-th feature and $\theta_j$ is its corresponding weight. There are now $n + 1$ parameters to learn (including the intercept $\theta_0$).
Matrix notation
With $m$ training examples and $n$ features, it is convenient to work in matrix form. Define:

Feature matrix $X$ (shape $m \times (n+1)$), where a column of ones is prepended to absorb the intercept:

$$X = \begin{bmatrix} 1 & x_1^{(1)} & \cdots & x_n^{(1)} \\ 1 & x_1^{(2)} & \cdots & x_n^{(2)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^{(m)} & \cdots & x_n^{(m)} \end{bmatrix}$$

Parameter vector $\theta$ (shape $(n+1) \times 1$):

$$\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{bmatrix}$$

Target vector $y$ (shape $m \times 1$):

$$y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix}$$
With this notation, all predictions are computed at once as:

$$\hat{y} = X\theta$$
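As a concrete sketch with illustrative numbers (two features, three examples), the design matrix and the vectorized prediction can be built in NumPy:

```python
import numpy as np

# Illustrative data: m = 3 examples, n = 2 features (size in sq ft, bedrooms)
X_raw = np.array([[2100.0, 3.0],
                  [1600.0, 2.0],
                  [2400.0, 4.0]])
m = X_raw.shape[0]

# Prepend a column of ones so theta_0 acts as the intercept
X = np.hstack([np.ones((m, 1)), X_raw])        # shape (m, n+1)

theta = np.array([50_000.0, 100.0, 8_500.0])   # [theta_0, theta_1, theta_2], illustrative

# All predictions at once: y_hat = X @ theta
y_hat = X @ theta
print(y_hat)  # one predicted price per example
```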
The Normal Equation
The OLS solution in matrix form is called the Normal Equation:

$$\theta = (X^\top X)^{-1} X^\top y$$
This single formula gives the exact parameter vector that minimizes MSE across all training examples simultaneously. No iterations, no learning rate.
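A minimal NumPy sketch on synthetic, noise-free data (so the true parameters are recovered exactly); `np.linalg.solve` is used rather than an explicit matrix inverse, since solving the linear system is more numerically stable:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 3
X_raw = rng.normal(size=(m, n))
true_theta = np.array([4.0, 2.0, -1.0, 0.5])   # intercept + 3 weights (illustrative)

X = np.hstack([np.ones((m, 1)), X_raw])        # prepend ones column
y = X @ true_theta                              # noise-free targets for clarity

# Normal Equation: solve (X^T X) theta = X^T y in one step
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # recovers true_theta (no noise, so the fit is exact)
```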
Derivation sketch
The cost in matrix form is:

$$J(\theta) = \frac{1}{2m} (X\theta - y)^\top (X\theta - y)$$

Taking the gradient with respect to $\theta$ and setting it to zero:

$$\nabla_\theta J = \frac{1}{m} X^\top (X\theta - y) = 0$$

Rearranging:

$$X^\top X\, \theta = X^\top y \quad\Longrightarrow\quad \theta = (X^\top X)^{-1} X^\top y$$

These are called the normal equations.
Interpreting multiple coefficients
Each $\theta_j$ represents the change in $\hat{y}$ for a one-unit increase in $x_j$, holding all other features constant. This "all else equal" interpretation is important: the coefficients capture partial effects, not total effects.
For example, in a house price model:
- $\theta_1 = 100$ (size in sq ft): adding one sq ft increases price by $100, given fixed bedrooms, age, etc.
- $\theta_2 = 8{,}500$ (bedrooms): adding one bedroom increases price by $8,500, given fixed size, age, etc.
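To make the "all else equal" reading concrete, here is a toy check using the illustrative coefficients above plus an assumed intercept of $50,000: changing one feature while freezing the other moves the prediction by exactly that feature's coefficient.

```python
import numpy as np

# [intercept, size coefficient, bedrooms coefficient] -- illustrative values
theta = np.array([50_000.0, 100.0, 8_500.0])

def predict(size_sqft: float, bedrooms: float) -> float:
    """Predicted price for one house, with the leading 1 for the intercept."""
    return float(theta @ np.array([1.0, size_sqft, bedrooms]))

base = predict(2000, 3)
# One extra sq ft, bedrooms held fixed: price rises by theta_1
assert predict(2001, 3) - base == 100.0
# One extra bedroom, size held fixed: price rises by theta_2
assert predict(2000, 4) - base == 8_500.0
```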
When $X^\top X$ is not invertible

$X^\top X$ is singular (cannot be inverted) in two situations:
- Redundant features: e.g. including both size in sq ft and size in sq m — they are perfectly linearly dependent.
- More features than examples: $m \le n$.
In both cases, the fix is to remove redundant features or apply regularization (Ridge regression adds a term $\lambda I$ to $X^\top X$ that guarantees invertibility — see the Regularization lesson).
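A quick sketch of both the failure and the fix, with illustrative numbers: duplicating a feature (sq ft alongside its sq m conversion) makes $X^\top X$ rank-deficient, while adding $\lambda I$ restores full rank.

```python
import numpy as np

size_sqft = np.array([2100.0, 1600.0, 2400.0, 3000.0])
size_sqm = size_sqft * 0.092903            # perfectly linearly dependent copy
X = np.column_stack([np.ones(4), size_sqft, size_sqm])

gram = X.T @ X
print(np.linalg.matrix_rank(gram))         # 2 < 3: singular, cannot be inverted

# Ridge fix: add lambda * I with lambda > 0; the sum is always invertible
lam = 1.0
ridge_gram = gram + lam * np.eye(3)
print(np.linalg.matrix_rank(ridge_gram))   # 3: full rank again
```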
Gradient descent with multiple features
Gradient descent extends naturally. The update rule for each parameter $\theta_j$ is:

$$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right) x_j^{(i)}$$

where $x_0^{(i)} = 1$ by convention for the intercept term. In matrix form this is simply:

$$\theta := \theta - \frac{\alpha}{m} X^\top (X\theta - y)$$
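The vectorized update can be sketched in a few lines of NumPy on synthetic data (learning rate and iteration count are illustrative choices, not tuned values):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 200, 2
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])  # ones column for intercept
true_theta = np.array([3.0, -2.0, 0.7])                    # illustrative parameters
y = X @ true_theta                                          # noise-free targets

alpha = 0.1           # learning rate (illustrative)
theta = np.zeros(n + 1)
for _ in range(1000):
    # theta := theta - (alpha / m) * X^T (X theta - y)
    gradient = (X.T @ (X @ theta - y)) / m
    theta -= alpha * gradient

print(theta)  # converges close to true_theta
```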