Assumptions and Common Pitfalls

Why assumptions matter

OLS gives the best linear unbiased estimates (by the Gauss–Markov theorem) when certain conditions hold. When those conditions are violated, the estimates may be biased, inefficient, or misleading. Understanding the assumptions helps you diagnose problems and choose remedies.

The five classical OLS assumptions

1. Linearity

The relationship between the features and the target is linear in the parameters. That is, the true data-generating process looks like:

$$y = \theta_0 + \theta_1 x_1 + \cdots + \theta_n x_n + \varepsilon$$

How to check: Plot $y$ against each $x_j$. If the relationship is curved, consider adding polynomial features (e.g. $x_1^2$) or applying a log transformation.
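As a minimal sketch of the polynomial-feature remedy (with made-up synthetic data, not from the text): fit the same curved data with and without a squared column in the design matrix and compare the residual sums of squares.

```python
# Hypothetical illustration: data generated from y = 2 + 3x + 0.5x^2,
# fitted with and without a squared feature using NumPy least squares.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 200)
y = 2 + 3 * x + 0.5 * x**2 + rng.normal(0, 0.1, 200)

# Linear-only design matrix: [1, x]
X_lin = np.column_stack([np.ones_like(x), x])
theta_lin, res_lin, *_ = np.linalg.lstsq(X_lin, y, rcond=None)

# With a polynomial feature: [1, x, x^2]
X_poly = np.column_stack([np.ones_like(x), x, x**2])
theta_poly, res_poly, *_ = np.linalg.lstsq(X_poly, y, rcond=None)

# Residual sum of squares drops sharply once x^2 is included
print(res_lin[0], res_poly[0])
```

The model is still linear in the parameters even though it is curved in $x$, which is why ordinary least squares handles it directly.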

2. Independence of errors

The residuals $\varepsilon^{(i)}$ are independent of one another. This is often violated in time-series data, where the error at time $t$ is correlated with the error at time $t-1$ (autocorrelation).

How to check: Plot residuals against time or observation order and look for patterns. The Durbin–Watson test is a formal check.
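The Durbin–Watson statistic is simple enough to compute by hand; a minimal NumPy version (the synthetic residual series below are illustrative):

```python
# Durbin-Watson statistic: sum of squared successive residual
# differences over the residual sum of squares. Values near 2 suggest
# no autocorrelation; values near 0 suggest positive autocorrelation.
import numpy as np

def durbin_watson(resid):
    diff = np.diff(resid)
    return np.sum(diff**2) / np.sum(resid**2)

rng = np.random.default_rng(1)
white = rng.normal(size=500)        # independent errors
ar1 = np.zeros(500)                 # strongly autocorrelated errors
for t in range(1, 500):
    ar1[t] = 0.9 * ar1[t - 1] + rng.normal()

print(durbin_watson(white))  # close to 2
print(durbin_watson(ar1))    # well below 2
```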

3. Homoscedasticity (constant variance)

The variance of the residuals is constant across all values of $x$. If the residual spread grows with $x$, the data is heteroscedastic.

How to check: Plot residuals against fitted values $\hat{y}$. Ideally, the spread should be uniform (a horizontal band). A fan-shaped pattern indicates heteroscedasticity.

Remedy: Take the log of the target ($\log y$) or use weighted least squares.
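A sketch of the weighted least squares remedy, assuming the noise scale grows proportionally to $x$ (the weights here are chosen from that assumption, not estimated from data):

```python
# Weighted least squares: when Var(eps_i) grows with x_i, weight each
# observation by the inverse of its (assumed) variance and solve the
# normal equations (X^T W X) theta = X^T W y.
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(1, 10, 300)
y = 1 + 2 * x + rng.normal(0, x)     # noise scale grows with x

X = np.column_stack([np.ones_like(x), x])
w = 1 / x**2                          # inverse-variance weights (assumed)

W = np.diag(w)
theta_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
theta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print(theta_wls, theta_ols)  # both near [1, 2]; WLS is more efficient
```

Both estimators remain unbiased here; the point of weighting is lower variance in the coefficient estimates.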

4. Normality of errors (for inference)

The residuals are approximately normally distributed. This assumption is needed for hypothesis tests and confidence intervals on the coefficients, but not required for point predictions to be unbiased.

How to check: Plot a histogram or Q-Q plot of the residuals.
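A plot is the better diagnostic, but a quick numeric sanity check is easy: for roughly normal residuals, sample skewness and excess kurtosis should both be near zero. A minimal NumPy sketch with illustrative data:

```python
# Sample skewness and excess kurtosis of a residual vector, computed
# from standardized residuals. Near (0, 0) is consistent with normality.
import numpy as np

def skew_and_excess_kurtosis(resid):
    z = (resid - resid.mean()) / resid.std()
    return np.mean(z**3), np.mean(z**4) - 3

rng = np.random.default_rng(3)
normal_resid = rng.normal(size=2000)
skewed_resid = rng.exponential(size=2000)

print(skew_and_excess_kurtosis(normal_resid))  # both near 0
print(skew_and_excess_kurtosis(skewed_resid))  # clearly nonzero
```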

5. No perfect multicollinearity

No feature is a perfect linear combination of other features. (Imperfect multicollinearity is allowed but inflates coefficient variance.)

How to check: Compute the Variance Inflation Factor (VIF) for each feature. A VIF above 5–10 suggests problematic multicollinearity.

Remedy: Remove one of the correlated features, combine them, or apply Ridge regularization.
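VIF can be computed from first principles: regress each feature on all the others and take $1/(1 - R^2)$. A NumPy sketch with a deliberately near-collinear synthetic matrix:

```python
# VIF for each column of a feature matrix: regress the column on the
# remaining columns (plus an intercept) and report 1 / (1 - R^2).
import numpy as np

def vif(X):
    n, p = X.shape
    out = []
    for j in range(p):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        target = X[:, j]
        beta = np.linalg.lstsq(others, target, rcond=None)[0]
        resid = target - others @ beta
        r2 = 1 - resid.var() / target.var()
        out.append(1 / (1 - r2))
    return np.array(out)

rng = np.random.default_rng(4)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)   # nearly collinear with a
X = np.column_stack([a, b, c])
print(vif(X))  # large for columns 0 and 2, near 1 for column 1
```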

Common pitfalls

Extrapolation

Linear regression can only reliably predict within the range of the training data. Predicting far outside that range — extrapolating — often produces nonsensical results because the linear relationship may not hold.

Outliers and influential points

Outliers (extreme $y$ values) and leverage points (extreme $x$ values) can pull the fitted line significantly. Always inspect residual plots and consider whether outliers should be corrected, removed, or handled with a robust loss function.
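Leverage can be quantified via the diagonal of the hat matrix $H = X(X^\top X)^{-1}X^\top$; points whose leverage is far above the average $p/n$ deserve scrutiny. A small sketch with one planted extreme-$x$ point:

```python
# Leverage scores from the hat matrix diagonal. The diagonal entries
# sum to p (the number of columns incl. intercept), so their mean is p/n.
import numpy as np

def leverage(X):
    Xd = np.column_stack([np.ones(len(X)), X])   # add intercept column
    H = Xd @ np.linalg.inv(Xd.T @ Xd) @ Xd.T
    return np.diag(H)

rng = np.random.default_rng(5)
x = rng.normal(size=100)
x[0] = 15.0                       # one extreme-x (high-leverage) point
h = leverage(x.reshape(-1, 1))
print(h[0], h.mean())             # h[0] is far above the mean 2/100
```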

Omitted variable bias

If a variable that genuinely affects $y$ is left out of the model, its effect is absorbed into the residuals. If the omitted variable also correlates with included features, the coefficients of those features become biased.
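This bias is easy to demonstrate numerically. In the made-up example below, the true model is $y = x_1 + x_2$ with $x_2$ correlated with $x_1$; dropping $x_2$ inflates the $x_1$ coefficient because it absorbs part of $x_2$'s effect:

```python
# Omitted-variable bias demo: the short regression's slope converges to
# 1 + 0.8 = 1.8 rather than the true coefficient 1, because x2 = 0.8*x1 + u.
import numpy as np

rng = np.random.default_rng(6)
x1 = rng.normal(size=5000)
x2 = 0.8 * x1 + rng.normal(size=5000)    # correlated with x1
y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=5000)

full = np.column_stack([x1, x2])
beta_full = np.linalg.lstsq(full, y, rcond=None)[0]
beta_short = np.linalg.lstsq(x1.reshape(-1, 1), y, rcond=None)[0]

print(beta_full)   # close to [1, 1]
print(beta_short)  # close to 1.8, not 1
```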

Including irrelevant features

Adding features that have no true relationship with $y$ does not improve predictions on new data — it increases model variance and can hurt generalization. Evaluate feature importance and consider regularization or feature selection.

Confusing correlation with causation

A large $\theta_j$ tells you that $x_j$ is associated with $y$ in your data. It does not mean $x_j$ causes $y$. Causal inference requires additional assumptions and study design beyond what regression alone can provide.

Residual plot checklist

After fitting, always make these plots:

  1. Residuals vs. fitted values — check for patterns (non-linearity, heteroscedasticity).
  2. Histogram of residuals — check for approximate normality.
  3. Residuals vs. each feature — check for non-linear relationships.
  4. Residuals vs. observation order — check for autocorrelation.

A well-behaved model shows residuals that look like random noise around zero in all four plots.
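The four plots above can be sketched in a single Matplotlib figure. The residuals, fitted values, and feature below are synthetic stand-ins for your model's outputs; the headless `Agg` backend is used only so the script runs without a display:

```python
# The four-plot residual checklist in one 2x2 figure.
import matplotlib
matplotlib.use("Agg")  # headless backend; drop when working interactively
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(7)
fitted = rng.uniform(0, 10, 200)    # stand-in for y-hat
resid = rng.normal(0, 1, 200)       # stand-in for residuals
feature = rng.normal(size=200)      # stand-in for one feature column

fig, axes = plt.subplots(2, 2, figsize=(9, 7))
axes[0, 0].scatter(fitted, resid, s=10)
axes[0, 0].set(title="Residuals vs. fitted", xlabel=r"$\hat{y}$")
axes[0, 1].hist(resid, bins=30)
axes[0, 1].set(title="Histogram of residuals")
axes[1, 0].scatter(feature, resid, s=10)
axes[1, 0].set(title="Residuals vs. feature", xlabel=r"$x_1$")
axes[1, 1].plot(resid, marker=".", linestyle="none")
axes[1, 1].set(title="Residuals vs. observation order")
for ax in (axes[0, 0], axes[1, 0], axes[1, 1]):
    ax.axhline(0, color="gray", linewidth=0.8)  # zero line for reference
fig.tight_layout()
fig.savefig("residual_checks.png")
```

For a real model, replace the three synthetic arrays with your residual vector, fitted values, and each feature column in turn.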