Multiple Features and the Normal Equation

Going beyond one feature

Real datasets rarely have just one input variable. A house price model might use size, number of bedrooms, age, and distance to the city centre — all at once. Multiple linear regression extends the single-feature model to handle $n$ features simultaneously.

The prediction equation becomes:

$$\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n$$

where $x_j$ is the $j$-th feature and $\theta_j$ is its corresponding weight. There are now $n + 1$ parameters to learn (including the intercept $\theta_0$).

Matrix notation

With $m$ training examples and $n$ features, it is convenient to work in matrix form. Define:

Feature matrix $X$ (shape $m \times (n+1)$), where a column of ones is prepended to absorb the intercept:

$$X = \begin{pmatrix} 1 & x_1^{(1)} & x_2^{(1)} & \cdots & x_n^{(1)} \\ 1 & x_1^{(2)} & x_2^{(2)} & \cdots & x_n^{(2)} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^{(m)} & x_2^{(m)} & \cdots & x_n^{(m)} \end{pmatrix}$$

Parameter vector $\boldsymbol{\theta}$ (shape $(n+1) \times 1$):

$$\boldsymbol{\theta} = \begin{pmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{pmatrix}$$

Target vector $\mathbf{y}$ (shape $m \times 1$):

$$\mathbf{y} = \begin{pmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{pmatrix}$$

With this notation, all $m$ predictions are computed at once as:

$$\hat{\mathbf{y}} = X\boldsymbol{\theta}$$
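As a minimal sketch in NumPy (all data and parameter values here are made up for illustration), the design matrix with its prepended ones column turns all $m$ predictions into a single matrix-vector product:

```python
import numpy as np

# Hypothetical toy data: m = 3 houses, n = 2 features (size in sq ft, bedrooms)
X_raw = np.array([[1400.0, 3.0],
                  [1600.0, 4.0],
                  [1700.0, 3.0]])
m = X_raw.shape[0]

# Prepend a column of ones so theta_0 is absorbed into the matrix product
X = np.hstack([np.ones((m, 1)), X_raw])

# Illustrative parameter vector [theta_0, theta_1, theta_2]
theta = np.array([50.0, 0.10, 8.5])

# All m predictions at once: y_hat = X @ theta
y_hat = X @ theta
print(y_hat)  # one prediction per training example
```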

The Normal Equation

The OLS solution in matrix form is called the Normal Equation:

$$\boldsymbol{\theta} = (X^T X)^{-1} X^T \mathbf{y}$$

This single formula gives the exact parameter vector that minimizes MSE across all training examples simultaneously. No iterations, no learning rate.
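A sketch of the Normal Equation on synthetic data (names and values are illustrative). In practice one solves the linear system $X^T X \boldsymbol{\theta} = X^T \mathbf{y}$ directly rather than forming the explicit inverse, which is slower and less numerically stable:

```python
import numpy as np

# Synthetic problem: generate data from known parameters and recover them
rng = np.random.default_rng(0)
m, n = 100, 3
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])
true_theta = np.array([2.0, -1.0, 0.5, 3.0])
y = X @ true_theta          # noise-free targets, so recovery should be exact

# Normal Equation: solve (X^T X) theta = X^T y
# np.linalg.solve is preferable to computing (X^T X)^{-1} explicitly
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # recovers true_theta
```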

Derivation sketch

The cost in matrix form is:

$$J(\boldsymbol{\theta}) = \frac{1}{2m} \| X\boldsymbol{\theta} - \mathbf{y} \|^2$$

Taking the gradient with respect to $\boldsymbol{\theta}$ and setting it to zero:

$$\nabla_{\boldsymbol{\theta}} J = \frac{1}{m} X^T (X\boldsymbol{\theta} - \mathbf{y}) = \mathbf{0}$$

Rearranging:

$$X^T X \boldsymbol{\theta} = X^T \mathbf{y}$$

$$\boldsymbol{\theta} = (X^T X)^{-1} X^T \mathbf{y}$$

These are called the normal equations.

Interpreting multiple coefficients

Each $\theta_j$ represents the change in $\hat{y}$ for a one-unit increase in $x_j$, holding all other features constant. This "all else equal" interpretation is important: the coefficients capture partial effects, not total effects.

For example, in a house price model:

  • $\theta_1 = 0.10$ (size in sq ft): adding one sq ft increases the predicted price by $100, given fixed bedrooms, age, etc. (with price measured in thousands of dollars).
  • $\theta_2 = 8.5$ (bedrooms): adding one bedroom increases the predicted price by $8,500, given fixed size, age, etc.
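The "holding all else constant" reading can be checked directly: varying one feature while fixing the others changes the prediction by exactly that coefficient. A small sketch using the hypothetical coefficients above (price in thousands of dollars):

```python
import numpy as np

# Hypothetical coefficients from the example above: [intercept, size, bedrooms]
theta = np.array([50.0, 0.10, 8.5])

def predict(size_sqft, bedrooms):
    """Predicted price in thousands of dollars."""
    return theta[0] + theta[1] * size_sqft + theta[2] * bedrooms

# Partial effect of one extra bedroom with size held fixed at 1500 sq ft:
delta = predict(1500, 4) - predict(1500, 3)
print(delta)  # equals theta_2, i.e. $8,500
```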

When $X^T X$ is not invertible

$X^T X$ is singular (cannot be inverted) in two situations:

  1. Redundant features: e.g. including both size in sq ft and size in sq m — they are perfectly linearly dependent.
  2. More features than examples: $n \geq m$.

In the first case, remove the redundant feature; in either case, regularization works (Ridge regression adds a multiple of the identity to $X^T X$, which guarantees invertibility — see the Regularization lesson).
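The redundant-feature case can be sketched as follows (data and the regularization strength `lam` are illustrative). Duplicating a feature in different units makes the design matrix rank-deficient, while adding $\lambda I$ to $X^T X$ restores an invertible system:

```python
import numpy as np

rng = np.random.default_rng(1)
m = 50
sqft = rng.uniform(800.0, 2500.0, size=m)
sqm = sqft * 0.092903            # same size in sq m: perfectly collinear

# Design matrix: intercept column plus the two redundant features
X = np.column_stack([np.ones(m), sqft, sqm])

# X has 3 columns but only rank 2, so X^T X is singular
rank = np.linalg.matrix_rank(X)
print(rank)

# Ridge fix: adding lam * I makes the matrix positive definite, hence invertible
y = 0.1 * sqft + rng.normal(size=m)   # synthetic targets
lam = 1.0                             # hypothetical regularization strength
theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
```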

Gradient descent with multiple features

Gradient descent extends naturally. The update rule for each parameter $\theta_j$ is:

$$\theta_j \leftarrow \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left(\hat{y}^{(i)} - y^{(i)}\right) x_j^{(i)}$$

where by convention $x_0^{(i)} = 1$ for the intercept term. In matrix form this is simply:

$$\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \frac{\alpha}{m} X^T (X\boldsymbol{\theta} - \mathbf{y})$$
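The vectorized update above fits in one line of NumPy. A minimal sketch on synthetic, well-scaled data (learning rate and iteration count are illustrative choices, not tuned values):

```python
import numpy as np

# Synthetic data generated from known parameters
rng = np.random.default_rng(2)
m, n = 200, 2
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])
true_theta = np.array([1.0, 2.0, -3.0])
y = X @ true_theta

alpha = 0.1                  # learning rate
theta = np.zeros(n + 1)      # start from all zeros
for _ in range(1000):
    # Vectorized update: theta <- theta - (alpha/m) * X^T (X theta - y)
    theta -= (alpha / m) * X.T @ (X @ theta - y)

print(theta)  # approaches the Normal Equation solution
```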