Multiple Features and the Normal Equation

Going beyond one feature

Real datasets rarely have just one input variable. A house price model might use size, number of bedrooms, age, and distance to the city centre — all at once. Multiple linear regression extends the single-feature model to handle $n$ features simultaneously.

The prediction equation becomes:

$$\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n$$

where $x_j$ is the $j$-th feature and $\theta_j$ is its corresponding weight. There are now $n + 1$ parameters to learn (including the intercept $\theta_0$).

Matrix notation

With $m$ training examples and $n$ features, it is convenient to work in matrix form. Define:

Feature matrix $X$ (shape $m \times (n+1)$), where a column of ones is prepended to absorb the intercept:

$$X = \begin{pmatrix} 1 & x_1^{(1)} & x_2^{(1)} & \cdots & x_n^{(1)} \\ 1 & x_1^{(2)} & x_2^{(2)} & \cdots & x_n^{(2)} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_1^{(m)} & x_2^{(m)} & \cdots & x_n^{(m)} \end{pmatrix}$$

Parameter vector $\boldsymbol{\theta}$ (shape $(n+1) \times 1$):

$$\boldsymbol{\theta} = \begin{pmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{pmatrix}$$

Target vector $\mathbf{y}$ (shape $m \times 1$):

$$\mathbf{y} = \begin{pmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{pmatrix}$$

With this notation, all $m$ predictions are computed at once as:

$$\hat{\mathbf{y}} = X\boldsymbol{\theta}$$
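As a minimal sketch in NumPy (all data and parameter values here are made up for illustration), the design matrix with its prepended ones column turns all $m$ predictions into a single matrix-vector product:

```python
import numpy as np

# Hypothetical toy data: m = 3 houses, n = 2 features (size in sq ft, bedrooms)
X_raw = np.array([[1400.0, 3.0],
                  [1600.0, 4.0],
                  [1700.0, 3.0]])
m = X_raw.shape[0]

# Prepend a column of ones so theta_0 is absorbed into the matrix product
X = np.hstack([np.ones((m, 1)), X_raw])

# Illustrative parameter vector [theta_0, theta_1, theta_2]
theta = np.array([50.0, 0.10, 8.5])

# All m predictions at once: y_hat = X @ theta
y_hat = X @ theta
print(y_hat)  # one prediction per training example
```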

The Normal Equation

The OLS solution in matrix form is called the Normal Equation:

$$\boldsymbol{\theta} = (X^T X)^{-1} X^T \mathbf{y}$$

This single formula gives the exact parameter vector that minimizes MSE across all training examples simultaneously. No iterations, no learning rate.
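A sketch of the Normal Equation on synthetic data (names and values are illustrative). In practice one solves the linear system $X^T X \boldsymbol{\theta} = X^T \mathbf{y}$ directly rather than forming the explicit inverse, which is slower and less numerically stable:

```python
import numpy as np

# Synthetic problem: generate data from known parameters and recover them
rng = np.random.default_rng(0)
m, n = 100, 3
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])
true_theta = np.array([2.0, -1.0, 0.5, 3.0])
y = X @ true_theta          # noise-free targets, so recovery should be exact

# Normal Equation: solve (X^T X) theta = X^T y
# np.linalg.solve is preferable to computing (X^T X)^{-1} explicitly
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # recovers true_theta
```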

Derivation sketch

The cost in matrix form is:

$$J(\boldsymbol{\theta}) = \frac{1}{2m} \| X\boldsymbol{\theta} - \mathbf{y} \|^2$$

Taking the gradient with respect to $\boldsymbol{\theta}$ and setting it to zero:

$$\nabla_{\boldsymbol{\theta}} J = \frac{1}{m} X^T (X\boldsymbol{\theta} - \mathbf{y}) = \mathbf{0}$$

Rearranging:

$$X^T X \boldsymbol{\theta} = X^T \mathbf{y}$$

$$\boldsymbol{\theta} = (X^T X)^{-1} X^T \mathbf{y}$$

These are called the normal equations.

Interpreting multiple coefficients

Each $\theta_j$ represents the change in $\hat{y}$ for a one-unit increase in $x_j$, holding all other features constant. This "all else equal" interpretation is important: the coefficients capture partial effects, not total effects.

For example, in a house price model:

  • $\theta_1 = 0.10$ (size in sq ft): adding one sq ft increases the predicted price by $100, given fixed bedrooms, age, etc. (with price measured in thousands of dollars).
  • $\theta_2 = 8.5$ (bedrooms): adding one bedroom increases the predicted price by $8,500, given fixed size, age, etc.
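The "holding all else constant" reading can be checked directly: varying one feature while fixing the others changes the prediction by exactly that coefficient. A small sketch using the hypothetical coefficients above (price in thousands of dollars):

```python
import numpy as np

# Hypothetical coefficients from the example above: [intercept, size, bedrooms]
theta = np.array([50.0, 0.10, 8.5])

def predict(size_sqft, bedrooms):
    """Predicted price in thousands of dollars."""
    return theta[0] + theta[1] * size_sqft + theta[2] * bedrooms

# Partial effect of one extra bedroom with size held fixed at 1500 sq ft:
delta = predict(1500, 4) - predict(1500, 3)
print(delta)  # equals theta_2, i.e. $8,500
```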

When $X^T X$ is not invertible

$X^T X$ is singular (cannot be inverted) in two situations:

  1. Redundant features: e.g. including both size in sq ft and size in sq m — they are perfectly linearly dependent.
  2. More features than examples: $n \geq m$.

In the first case, remove the redundant feature; in either case, regularization works (Ridge regression adds a multiple of the identity to $X^T X$, which guarantees invertibility — see the Regularization lesson).
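The redundant-feature case can be sketched as follows (data and the regularization strength `lam` are illustrative). Duplicating a feature in different units makes the design matrix rank-deficient, while adding $\lambda I$ to $X^T X$ restores an invertible system:

```python
import numpy as np

rng = np.random.default_rng(1)
m = 50
sqft = rng.uniform(800.0, 2500.0, size=m)
sqm = sqft * 0.092903            # same size in sq m: perfectly collinear

# Design matrix: intercept column plus the two redundant features
X = np.column_stack([np.ones(m), sqft, sqm])

# X has 3 columns but only rank 2, so X^T X is singular
rank = np.linalg.matrix_rank(X)
print(rank)

# Ridge fix: adding lam * I makes the matrix positive definite, hence invertible
y = 0.1 * sqft + rng.normal(size=m)   # synthetic targets
lam = 1.0                             # hypothetical regularization strength
theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
```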

Gradient descent with multiple features

Gradient descent extends naturally. The update rule for each parameter $\theta_j$ is:

$$\theta_j \leftarrow \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left(\hat{y}^{(i)} - y^{(i)}\right) x_j^{(i)}$$

where by convention $x_0^{(i)} = 1$ for the intercept term. In matrix form this is simply:

$$\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \frac{\alpha}{m} X^T (X\boldsymbol{\theta} - \mathbf{y})$$
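The vectorized update above fits in one line of NumPy. A minimal sketch on synthetic, well-scaled data (learning rate and iteration count are illustrative choices, not tuned values):

```python
import numpy as np

# Synthetic data generated from known parameters
rng = np.random.default_rng(2)
m, n = 200, 2
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])
true_theta = np.array([1.0, 2.0, -3.0])
y = X @ true_theta

alpha = 0.1                  # learning rate
theta = np.zeros(n + 1)      # start from all zeros
for _ in range(1000):
    # Vectorized update: theta <- theta - (alpha/m) * X^T (X theta - y)
    theta -= (alpha / m) * X.T @ (X @ theta - y)

print(theta)  # approaches the Normal Equation solution
```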