Assumptions and Common Pitfalls

Why assumptions matter

OLS gives the best linear unbiased estimates (by the Gauss–Markov theorem) when certain conditions hold. When those conditions are violated, the estimates may be biased, inefficient, or misleading. Understanding the assumptions helps you diagnose problems and choose remedies.

The five classical OLS assumptions

1. Linearity

The relationship between the features and the target is linear in the parameters. That is, the true data-generating process looks like:

$$y = \theta_0 + \theta_1 x_1 + \cdots + \theta_n x_n + \varepsilon$$

How to check: Plot $y$ against each $x_j$. If the relationship is curved, consider adding polynomial features (e.g. $x_1^2$) or applying a log transformation.
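As a minimal sketch of the polynomial-feature remedy (with made-up synthetic data, not from the text): fit the same curved data with and without a squared column in the design matrix and compare the residual sums of squares.

```python
# Hypothetical illustration: data generated from y = 2 + 3x + 0.5x^2,
# fitted with and without a squared feature using NumPy least squares.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 200)
y = 2 + 3 * x + 0.5 * x**2 + rng.normal(0, 0.1, 200)

# Linear-only design matrix: [1, x]
X_lin = np.column_stack([np.ones_like(x), x])
theta_lin, res_lin, *_ = np.linalg.lstsq(X_lin, y, rcond=None)

# With a polynomial feature: [1, x, x^2]
X_poly = np.column_stack([np.ones_like(x), x, x**2])
theta_poly, res_poly, *_ = np.linalg.lstsq(X_poly, y, rcond=None)

# Residual sum of squares drops sharply once x^2 is included
print(res_lin[0], res_poly[0])
```

The model is still linear in the parameters even though it is curved in $x$, which is why ordinary least squares handles it directly.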

2. Independence of errors

The residuals $\varepsilon^{(i)}$ are independent of one another. This is often violated in time-series data, where the error at time $t$ is correlated with the error at time $t-1$ (autocorrelation).

How to check: Plot residuals against time or observation order and look for patterns. The Durbin–Watson test is a formal check.
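The Durbin–Watson statistic is simple enough to compute by hand; a minimal NumPy version (the synthetic residual series below are illustrative):

```python
# Durbin-Watson statistic: sum of squared successive residual
# differences over the residual sum of squares. Values near 2 suggest
# no autocorrelation; values near 0 suggest positive autocorrelation.
import numpy as np

def durbin_watson(resid):
    diff = np.diff(resid)
    return np.sum(diff**2) / np.sum(resid**2)

rng = np.random.default_rng(1)
white = rng.normal(size=500)        # independent errors
ar1 = np.zeros(500)                 # strongly autocorrelated errors
for t in range(1, 500):
    ar1[t] = 0.9 * ar1[t - 1] + rng.normal()

print(durbin_watson(white))  # close to 2
print(durbin_watson(ar1))    # well below 2
```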

3. Homoscedasticity (constant variance)

The variance of the residuals is constant across all values of $x$. If the residual spread grows with $x$, the data is heteroscedastic.

How to check: Plot residuals against fitted values $\hat{y}$. Ideally, the spread should be uniform (a horizontal band). A fan-shaped pattern indicates heteroscedasticity.

Remedy: Take the log of the target ($\log y$) or use weighted least squares.
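A sketch of the weighted least squares remedy, assuming the noise scale grows proportionally to $x$ (the weights here are chosen from that assumption, not estimated from data):

```python
# Weighted least squares: when Var(eps_i) grows with x_i, weight each
# observation by the inverse of its (assumed) variance and solve the
# normal equations (X^T W X) theta = X^T W y.
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(1, 10, 300)
y = 1 + 2 * x + rng.normal(0, x)     # noise scale grows with x

X = np.column_stack([np.ones_like(x), x])
w = 1 / x**2                          # inverse-variance weights (assumed)

W = np.diag(w)
theta_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
theta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print(theta_wls, theta_ols)  # both near [1, 2]; WLS is more efficient
```

Both estimators remain unbiased here; the point of weighting is lower variance in the coefficient estimates.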

4. Normality of errors (for inference)

The residuals are approximately normally distributed. This assumption is needed for hypothesis tests and confidence intervals on the coefficients, but not required for point predictions to be unbiased.

How to check: Plot a histogram or Q-Q plot of the residuals.
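A plot is the better diagnostic, but a quick numeric sanity check is easy: for roughly normal residuals, sample skewness and excess kurtosis should both be near zero. A minimal NumPy sketch with illustrative data:

```python
# Sample skewness and excess kurtosis of a residual vector, computed
# from standardized residuals. Near (0, 0) is consistent with normality.
import numpy as np

def skew_and_excess_kurtosis(resid):
    z = (resid - resid.mean()) / resid.std()
    return np.mean(z**3), np.mean(z**4) - 3

rng = np.random.default_rng(3)
normal_resid = rng.normal(size=2000)
skewed_resid = rng.exponential(size=2000)

print(skew_and_excess_kurtosis(normal_resid))  # both near 0
print(skew_and_excess_kurtosis(skewed_resid))  # clearly nonzero
```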

5. No perfect multicollinearity

No feature is a perfect linear combination of other features. (Imperfect multicollinearity is allowed but inflates coefficient variance.)

How to check: Compute the Variance Inflation Factor (VIF) for each feature. A VIF above 5–10 suggests problematic multicollinearity.

Remedy: Remove one of the correlated features, combine them, or apply Ridge regularization.
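VIF can be computed from first principles: regress each feature on all the others and take $1/(1 - R^2)$. A NumPy sketch with a deliberately near-collinear synthetic matrix:

```python
# VIF for each column of a feature matrix: regress the column on the
# remaining columns (plus an intercept) and report 1 / (1 - R^2).
import numpy as np

def vif(X):
    n, p = X.shape
    out = []
    for j in range(p):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        target = X[:, j]
        beta = np.linalg.lstsq(others, target, rcond=None)[0]
        resid = target - others @ beta
        r2 = 1 - resid.var() / target.var()
        out.append(1 / (1 - r2))
    return np.array(out)

rng = np.random.default_rng(4)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)   # nearly collinear with a
X = np.column_stack([a, b, c])
print(vif(X))  # large for columns 0 and 2, near 1 for column 1
```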

Common pitfalls

Extrapolation

Linear regression can only reliably predict within the range of the training data. Predicting far outside that range — extrapolating — often produces nonsensical results because the linear relationship may not hold.

Outliers and influential points

Outliers (extreme $y$ values) and leverage points (extreme $x$ values) can pull the fitted line significantly. Always inspect residual plots and consider whether outliers should be corrected, removed, or handled with a robust loss function.
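Leverage can be quantified via the diagonal of the hat matrix $H = X(X^\top X)^{-1}X^\top$; points whose leverage is far above the average $p/n$ deserve scrutiny. A small sketch with one planted extreme-$x$ point:

```python
# Leverage scores from the hat matrix diagonal. The diagonal entries
# sum to p (the number of columns incl. intercept), so their mean is p/n.
import numpy as np

def leverage(X):
    Xd = np.column_stack([np.ones(len(X)), X])   # add intercept column
    H = Xd @ np.linalg.inv(Xd.T @ Xd) @ Xd.T
    return np.diag(H)

rng = np.random.default_rng(5)
x = rng.normal(size=100)
x[0] = 15.0                       # one extreme-x (high-leverage) point
h = leverage(x.reshape(-1, 1))
print(h[0], h.mean())             # h[0] is far above the mean 2/100
```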

Omitted variable bias

If a variable that genuinely affects $y$ is left out of the model, its effect is absorbed into the residuals. If the omitted variable also correlates with included features, the coefficients of those features become biased.
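This bias is easy to demonstrate numerically. In the made-up example below, the true model is $y = x_1 + x_2$ with $x_2$ correlated with $x_1$; dropping $x_2$ inflates the $x_1$ coefficient because it absorbs part of $x_2$'s effect:

```python
# Omitted-variable bias demo: the short regression's slope converges to
# 1 + 0.8 = 1.8 rather than the true coefficient 1, because x2 = 0.8*x1 + u.
import numpy as np

rng = np.random.default_rng(6)
x1 = rng.normal(size=5000)
x2 = 0.8 * x1 + rng.normal(size=5000)    # correlated with x1
y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=5000)

full = np.column_stack([x1, x2])
beta_full = np.linalg.lstsq(full, y, rcond=None)[0]
beta_short = np.linalg.lstsq(x1.reshape(-1, 1), y, rcond=None)[0]

print(beta_full)   # close to [1, 1]
print(beta_short)  # close to 1.8, not 1
```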

Including irrelevant features

Adding features that have no true relationship with $y$ does not improve predictions on new data — it increases model variance and can hurt generalization. Evaluate feature importance and consider regularization or feature selection.

Confusing correlation with causation

A large $\theta_j$ tells you that $x_j$ is associated with $y$ in your data. It does not mean $x_j$ causes $y$. Causal inference requires additional assumptions and study design beyond what regression alone can provide.

Residual plot checklist

After fitting, always make these plots:

  1. Residuals vs. fitted values — check for patterns (non-linearity, heteroscedasticity).
  2. Histogram of residuals — check for approximate normality.
  3. Residuals vs. each feature — check for non-linear relationships.
  4. Residuals vs. observation order — check for autocorrelation.

A well-behaved model shows residuals that look like random noise around zero in all four plots.
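The four plots above can be sketched in a single Matplotlib figure. The residuals, fitted values, and feature below are synthetic stand-ins for your model's outputs; the headless `Agg` backend is used only so the script runs without a display:

```python
# The four-plot residual checklist in one 2x2 figure.
import matplotlib
matplotlib.use("Agg")  # headless backend; drop when working interactively
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(7)
fitted = rng.uniform(0, 10, 200)    # stand-in for y-hat
resid = rng.normal(0, 1, 200)       # stand-in for residuals
feature = rng.normal(size=200)      # stand-in for one feature column

fig, axes = plt.subplots(2, 2, figsize=(9, 7))
axes[0, 0].scatter(fitted, resid, s=10)
axes[0, 0].set(title="Residuals vs. fitted", xlabel=r"$\hat{y}$")
axes[0, 1].hist(resid, bins=30)
axes[0, 1].set(title="Histogram of residuals")
axes[1, 0].scatter(feature, resid, s=10)
axes[1, 0].set(title="Residuals vs. feature", xlabel=r"$x_1$")
axes[1, 1].plot(resid, marker=".", linestyle="none")
axes[1, 1].set(title="Residuals vs. observation order")
for ax in (axes[0, 0], axes[1, 0], axes[1, 1]):
    ax.axhline(0, color="gray", linewidth=0.8)  # zero line for reference
fig.tight_layout()
fig.savefig("residual_checks.png")
```

For a real model, replace the three synthetic arrays with your residual vector, fitted values, and each feature column in turn.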