Linear regression (OLS, assumptions)


The core idea

Linear regression models the relationship between one or more input features and a continuous target variable as a straight line (or hyperplane in multiple dimensions). Given features $x_1, x_2, \ldots, x_n$, the model predicts:

$$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n$$

The $\beta$ values are the coefficients — the model learns them from data. $\beta_0$ is the intercept (the prediction when all features are zero); each $\beta_j$ is the slope for feature $j$.
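
The prediction equation can be sketched directly in NumPy (the coefficient and feature values below are made up for illustration):

```python
import numpy as np

# Hypothetical learned coefficients: intercept beta_0 and two slopes.
beta_0 = 1.0
beta = np.array([2.0, -0.5])

x = np.array([3.0, 4.0])        # one observation with two features
y_hat = beta_0 + beta @ x       # beta_0 + beta_1*x_1 + beta_2*x_2
```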

Ordinary Least Squares (OLS)

The standard method for fitting a linear regression is Ordinary Least Squares (OLS): find the coefficients that minimize the sum of squared differences between the predicted and actual values.

$$\text{minimize} \sum_{i=1}^{m} (y^{(i)} - \hat{y}^{(i)})^2$$

OLS has a closed-form solution — no iteration required. In matrix notation, with $X$ as the feature matrix and $\mathbf{y}$ as the target vector:

$$\boldsymbol{\beta} = (X^T X)^{-1} X^T \mathbf{y}$$

This gives the exact optimal coefficients in one computation. For large numbers of features, gradient descent is used instead, since the Normal Equation requires inverting the $n \times n$ matrix $X^T X$, which scales roughly cubically with the number of features.
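
The closed-form solution above can be sketched in NumPy on synthetic data (the coefficient values and noise scale here are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100
true_beta = np.array([1.5, -2.0, 0.5, 3.0])          # made-up: intercept first
X = np.column_stack([np.ones(m), rng.normal(size=(m, 3))])
y = X @ true_beta + rng.normal(scale=0.1, size=m)    # linear signal plus noise

# Normal Equation: beta = (X^T X)^{-1} X^T y.
# Solving the linear system is numerically more stable than
# forming an explicit inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
```

With low noise, `beta_hat` recovers the true coefficients closely.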

The OLS assumptions

OLS has good statistical properties — unbiased, minimum variance — when its assumptions hold. These are worth knowing because violations lead to unreliable estimates or incorrect inferences.

1. Linearity. The true relationship between features and target is linear. If the actual relationship is curved, the model is systematically wrong.

2. Independence of errors. The residuals (errors) are not correlated with each other. Violated in time-series data, where observations close in time tend to be related.

3. Homoscedasticity. The variance of the residuals is constant across all values of the features. If errors are larger for higher predicted values (a "fan" shape in residual plots), inference is unreliable.

4. Normality of errors. The residuals are approximately normally distributed. This matters primarily for hypothesis tests and confidence intervals on coefficients, not for point predictions.

5. No perfect multicollinearity. No feature is an exact linear combination of others. If it is, $X^T X$ is not invertible and OLS has no unique solution. (Imperfect multicollinearity is a separate issue — covered in the Multicollinearity lesson.)
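
A minimal NumPy sketch of perfect multicollinearity (the feature values are invented): when one column is an exact multiple of another, $X^T X$ loses full rank and cannot be inverted.

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0])
x2 = 2 * x1                                  # exact linear combination of x1
X = np.column_stack([np.ones(4), x1, x2])    # design matrix with intercept

gram = X.T @ X
rank = np.linalg.matrix_rank(X)              # 2, not 3: one column is redundant
det = np.linalg.det(gram)                    # (numerically) zero, so no inverse
```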

Checking assumptions

The standard diagnostic is a residual plot: plot the residuals against the fitted values. You want to see random scatter with no pattern.

  • A curved pattern suggests non-linearity.
  • A fan shape suggests heteroscedasticity.
  • A Q-Q plot of residuals checks normality.
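
The diagnostic can be sketched numerically with NumPy (synthetic data, made-up coefficients): when a linear fit with an intercept is well specified, the residuals average to zero and are uncorrelated with the fitted values by construction, so any visible pattern in the plot is a red flag.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=200)
y = 2.0 + 0.7 * x + rng.normal(size=200)   # made-up linear relationship

X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
residuals = y - fitted

mean_resid = residuals.mean()                  # ~0 when an intercept is fit
corr = np.corrcoef(fitted, residuals)[0, 1]    # ~0 by construction of OLS

# A residual plot would scatter `residuals` against `fitted`;
# curvature or a fan shape there signals a violated assumption.
```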

Evaluating fit

Common metrics for regression:

  • R²: fraction of variance in $y$ explained by the model. R² = 1 is perfect; R² = 0 is no better than predicting the mean.
  • RMSE: root mean squared error — in the same units as $y$, easy to interpret.
  • MAE: mean absolute error — less sensitive to outliers than RMSE.
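
All three metrics are simple enough to compute directly (the toy values below are made up):

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])   # made-up targets
y_pred = np.array([2.5, 5.5, 6.5, 9.5])   # made-up predictions

ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2 = 1 - ss_res / ss_tot                          # fraction of variance explained
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))   # same units as y
mae = np.mean(np.abs(y_true - y_pred))            # less outlier-sensitive
```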

R² on the training set is optimistic; always evaluate on held-out data.
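
A minimal held-out evaluation, sketched with NumPy only (synthetic data; the 80/20 split is an arbitrary but common choice):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=100)
y = 1.0 + 2.0 * x + rng.normal(size=100)   # made-up true relationship

# Random 80/20 train/test split.
idx = rng.permutation(100)
train, test = idx[:80], idx[80:]

X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)

def r_squared(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

r2_train = r_squared(y[train], X[train] @ beta)
r2_test = r_squared(y[test], X[test] @ beta)   # the number to report
```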