Assumptions and Common Pitfalls
Why assumptions matter
OLS gives the best linear unbiased estimates when certain conditions hold (the Gauss–Markov theorem). When those conditions are violated, the estimates may be biased, inefficient, or misleading. Understanding the assumptions helps you diagnose problems and choose remedies.
The five classical OLS assumptions
1. Linearity
The relationship between the features and the target is linear in the parameters. That is, the true data-generating process looks like:

y = β₀ + β₁x₁ + β₂x₂ + … + β_p x_p + ε
How to check: Plot y against each feature x_j. If the relationship is curved, consider adding polynomial features (e.g. x_j²) or applying a log transformation.
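As a small sketch of the polynomial-feature remedy (synthetic data, numpy only — the coefficients and noise level are illustrative assumptions), fitting a straight line to a curved relationship leaves far more residual error than a fit that includes a quadratic term:

```python
import numpy as np

# Illustrative synthetic data: a genuinely quadratic relationship.
rng = np.random.default_rng(0)
x = np.linspace(0, 4, 100)
y = 1.0 + 0.5 * x + 2.0 * x**2 + rng.normal(0, 0.5, x.size)

# Fit y = b0 + b1*x (straight line) vs. y = b0 + b1*x + b2*x^2 (quadratic).
lin_coef = np.polyfit(x, y, deg=1)
quad_coef = np.polyfit(x, y, deg=2)

lin_rss = float(np.sum((y - np.polyval(lin_coef, x)) ** 2))
quad_rss = float(np.sum((y - np.polyval(quad_coef, x)) ** 2))
# The quadratic fit leaves much less residual sum of squares,
# confirming the curvature the straight line missed.
```

Note that the quadratic model is still linear *in the parameters* — x² is just another feature column — so it remains an OLS fit.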
2. Independence of errors
The residuals are independent of one another. This is often violated in time-series data, where the error at time t is correlated with the error at time t−1 (autocorrelation).
How to check: Plot residuals against time or observation order and look for patterns. The Durbin–Watson test is a formal check.
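The Durbin–Watson statistic is simple enough to compute directly from the residuals. A minimal sketch (synthetic residual series; the AR coefficient 0.8 is an illustrative assumption): values near 2 indicate no autocorrelation, values toward 0 indicate positive autocorrelation, and values toward 4 indicate negative autocorrelation.

```python
import numpy as np

def durbin_watson(resid):
    """Durbin-Watson statistic: sum of squared successive differences
    of the residuals, divided by the residual sum of squares."""
    resid = np.asarray(resid, dtype=float)
    return float(np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2))

rng = np.random.default_rng(1)
white = rng.normal(size=500)        # independent errors

ar1 = np.empty(500)                 # positively autocorrelated errors
ar1[0] = rng.normal()
for t in range(1, 500):
    ar1[t] = 0.8 * ar1[t - 1] + rng.normal()

dw_white = durbin_watson(white)     # close to 2
dw_ar1 = durbin_watson(ar1)         # well below 2
```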
3. Homoscedasticity (constant variance)
The variance of the residuals is constant across all values of x. If the residual spread grows with x, the data is heteroscedastic.
How to check: Plot residuals against fitted values ŷ. Ideally, the spread should be uniform (a horizontal band). A fan-shaped pattern indicates heteroscedasticity.
Remedy: Take the log of the target (model log y instead of y) or use weighted least squares.
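A weighted least squares sketch in numpy (synthetic data; the assumption that the noise variance is known and proportional to x² is purely illustrative). Each observation is weighted by the inverse of its error variance, so noisy points count for less:

```python
import numpy as np

# Synthetic heteroscedastic data: noise standard deviation grows with x.
rng = np.random.default_rng(2)
n = 400
x = rng.uniform(1, 10, n)
sigma = 0.5 * x                                  # spread grows with x
y = 3.0 + 2.0 * x + rng.normal(0, sigma)

X = np.column_stack([np.ones(n), x])             # design matrix with intercept

# OLS: solve the normal equations X'X b = X'y.
b_ols = np.linalg.solve(X.T @ X, X.T @ y)

# WLS: weight each row by 1/variance (assumed known in this sketch),
# solving X'WX b = X'Wy.
w = 1.0 / sigma**2
b_wls = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
```

Both estimators are unbiased here; the gain from WLS is efficiency — its coefficient estimates have lower variance when the weights reflect the true error variances.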
4. Normality of errors (for inference)
The residuals are approximately normally distributed. This assumption is needed for hypothesis tests and confidence intervals on the coefficients, but not required for point predictions to be unbiased.
How to check: Plot a histogram or Q-Q plot of the residuals.
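As a numeric complement to the histogram and Q-Q plot, sample skewness and excess kurtosis are both near 0 for normal residuals. A sketch with synthetic residuals (the exponential alternative is an illustrative assumption):

```python
import numpy as np

def skew_kurtosis(resid):
    """Sample skewness and excess kurtosis of the residuals.
    Both are approximately 0 when the residuals are normal."""
    r = np.asarray(resid, dtype=float)
    z = (r - r.mean()) / r.std()
    return float(np.mean(z**3)), float(np.mean(z**4) - 3.0)

rng = np.random.default_rng(3)
normal_resid = rng.normal(size=2000)
skewed_resid = rng.exponential(size=2000)   # clearly non-normal

s_n, k_n = skew_kurtosis(normal_resid)      # both near 0
s_e, k_e = skew_kurtosis(skewed_resid)      # strongly positive
```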
5. No perfect multicollinearity
No feature is a perfect linear combination of other features. (Imperfect multicollinearity is allowed but inflates coefficient variance.)
How to check: Compute the Variance Inflation Factor (VIF) for each feature. A VIF above 5–10 suggests problematic multicollinearity.
Remedy: Remove one of the correlated features, combine them, or apply Ridge regularization.
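The VIF for feature j is 1 / (1 − R_j²), where R_j² comes from regressing x_j on the other features. A sketch in plain numpy (synthetic features; the near-collinear pair is an illustrative assumption):

```python
import numpy as np

def vif(X):
    """Variance Inflation Factor for each column of X
    (feature columns only, no intercept column)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = []
    for j in range(p):
        # Regress column j on the remaining columns (plus an intercept).
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        target = X[:, j]
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1.0 - resid.var() / target.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(4)
x1 = rng.normal(size=300)
x2 = rng.normal(size=300)                  # independent of x1
x3 = x1 + rng.normal(0, 0.1, 300)          # nearly collinear with x1
v = vif(np.column_stack([x1, x2, x3]))
# v[1] stays near 1; v[0] and v[2] blow up far past the 5-10 threshold.
```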
Common pitfalls
Extrapolation
Linear regression can only reliably predict within the range of the training data. Predicting far outside that range — extrapolating — often produces nonsensical results because the linear relationship may not hold.
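A tiny sketch of the danger (synthetic, noise-free data for clarity): the truth is quadratic, but over the training range [0, 2] a straight line approximates it well. Extrapolated to x = 10, the same line is wildly wrong.

```python
import numpy as np

# True relationship is y = x^2; train only on x in [0, 2].
x_train = np.linspace(0, 2, 50)
y_train = x_train**2

coef = np.polyfit(x_train, y_train, deg=1)   # straight-line fit

# Small error inside the training range...
in_range_err = abs(np.polyval(coef, 1.0) - 1.0**2)
# ...huge error far outside it.
out_range_err = abs(np.polyval(coef, 10.0) - 10.0**2)
```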
Outliers and influential points
Outliers (extreme y values) and leverage points (extreme x values) can pull the fitted line significantly. Always inspect residual plots and consider whether outliers should be corrected, removed, or handled with a robust loss function.
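A sketch of how much a single influential point can move the fit (synthetic data; the planted point at (10, 0) is an illustrative assumption — it has both high leverage and a large residual):

```python
import numpy as np

# Twenty well-behaved points along y = 2x.
rng = np.random.default_rng(5)
x = np.linspace(0, 1, 20)
y = 2.0 * x + rng.normal(0, 0.05, 20)

# Add one point with extreme x (leverage) and an inconsistent y (outlier).
x_bad = np.append(x, 10.0)
y_bad = np.append(y, 0.0)

slope_clean = np.polyfit(x, y, 1)[0]     # close to the true slope of 2
slope_bad = np.polyfit(x_bad, y_bad, 1)[0]  # dragged far from 2 by one point
```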
Omitted variable bias
If a variable that genuinely affects y is left out of the model, its effect is absorbed into the residuals. If the omitted variable also correlates with included features, the coefficients of those features become biased.
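This is easy to demonstrate by simulation (all coefficients below are illustrative assumptions): the true model is y = x₁ + z, with z correlated with x₁. Omitting z inflates the estimated coefficient on x₁, which absorbs part of z's effect.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 5000
x1 = rng.normal(size=n)
z = 0.8 * x1 + rng.normal(0, 0.6, n)          # correlated with x1
y = 1.0 * x1 + 1.0 * z + rng.normal(0, 0.5, n)

X_full = np.column_stack([np.ones(n), x1, z])  # correct specification
X_short = np.column_stack([np.ones(n), x1])    # z omitted

b_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)
b_short, *_ = np.linalg.lstsq(X_short, y, rcond=None)
# b_full[1] is close to the true value 1.0;
# b_short[1] is biased upward toward 1.8 (= 1 + 1 * cov(z, x1)/var(x1)).
```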
Including irrelevant features
Adding features that have no true relationship with y does not improve predictions on new data — it increases model variance and can hurt generalization. Evaluate feature importance and consider regularization or feature selection.
Confusing correlation with causation
A large R² tells you that X is associated with y in your data. It does not mean X causes y. Causal inference requires additional assumptions and study design beyond what regression alone can provide.
Residual plot checklist
After fitting, always make these plots:
- Residuals vs. fitted values — check for patterns (non-linearity, heteroscedasticity).
- Histogram of residuals — check for approximate normality.
- Residuals vs. each feature — check for non-linear relationships.
- Residuals vs. observation order — check for autocorrelation.
A well-behaved model shows residuals that look like random noise around zero in all four plots.
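The checklist above can be sketched as a single four-panel figure (synthetic data and a plain numpy least-squares fit; the data-generating coefficients are illustrative assumptions):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")               # non-interactive backend for scripts
import matplotlib.pyplot as plt

def residual_plots(X, y, fitted):
    """The four diagnostic plots from the checklist, one per panel."""
    resid = y - fitted
    fig, axes = plt.subplots(2, 2, figsize=(10, 8))

    axes[0, 0].scatter(fitted, resid, s=10)
    axes[0, 0].axhline(0, color="grey")
    axes[0, 0].set_title("Residuals vs. fitted values")

    axes[0, 1].hist(resid, bins=30)
    axes[0, 1].set_title("Histogram of residuals")

    axes[1, 0].scatter(X[:, 0], resid, s=10)   # repeat for each feature
    axes[1, 0].set_title("Residuals vs. feature x1")

    axes[1, 1].plot(resid, marker=".", linestyle="none")
    axes[1, 1].set_title("Residuals vs. observation order")
    return fig

# Usage on synthetic data with an OLS fit via least squares.
rng = np.random.default_rng(7)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.5, -2.0]) + rng.normal(0, 0.3, 100)
design = np.column_stack([np.ones(100), X])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)
fig = residual_plots(X, y, design @ beta)
```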