OLS vs. Gradient Descent: When to Use Which

Two roads to the same destination

Both OLS and gradient descent minimize the same MSE cost function. For linear regression, gradient descent converges to the same answer that OLS computes in one step. The difference is how they get there, and that difference matters a great deal in practice.
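To make this concrete, here is a minimal sketch comparing the two on synthetic data (the data, learning rate, and iteration count are illustrative choices, not prescriptions):

```python
import numpy as np

# Synthetic data: y = 2x + 1 plus noise (illustrative values)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, 100)

# Design matrix with an intercept column
X = np.column_stack([np.ones_like(x), x])

# OLS: one linear solve of the normal equations
theta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient descent: many small steps down the MSE surface
theta_gd = np.zeros(2)
lr = 0.01
m = len(y)
for _ in range(20_000):
    grad = (2 / m) * X.T @ (X @ theta_gd - y)
    theta_gd -= lr * grad

print(theta_ols)  # close to [1.0, 2.0]
print(theta_gd)   # nearly identical after enough iterations
```

OLS gets there in one solve; gradient descent needs thousands of iterations and a tuned learning rate to land on the same parameters.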

Side-by-side comparison

| Property | OLS (Normal Equation) | Gradient Descent |
| --- | --- | --- |
| Solution type | Closed-form, exact | Iterative, approximate |
| Iterations needed | 1 | Many (hundreds to millions) |
| Learning rate | Not needed | Must be tuned |
| Scales with features n | Poorly, O(n^3) | Well, O(kn) over k iterations |
| Scales with examples m | Moderately, O(mn^2) | Well, O(km) over k iterations |
| Feature scaling needed | No | Yes (strongly recommended) |
| Works for non-linear models | No | Yes (with modified cost functions) |

When OLS wins

OLS is the better choice when:

  • You have a small to moderate number of features (roughly n < 10,000).
  • The dataset fits comfortably in memory.
  • You want an exact answer without tuning any hyperparameters.
  • You need statistical guarantees (standard errors, confidence intervals) on the parameters.

The bottleneck of OLS is computing (X^T X)^{-1}, a matrix inversion that costs O(n^3). For n = 100 features this is trivial; for n = 100,000 features it becomes prohibitive.
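The cubic cost is easy to observe directly. The sketch below times inversion of progressively larger matrices (sizes and exact timings are illustrative and machine-dependent):

```python
import time

import numpy as np

# Rough timing of the matrix-inversion step as n grows.
# A is built to be symmetric positive definite, like X^T X.
for n in (100, 200, 400):
    A = np.random.default_rng(1).normal(size=(n, n))
    A = A @ A.T + n * np.eye(n)
    t0 = time.perf_counter()
    np.linalg.inv(A)
    print(n, time.perf_counter() - t0)

# Doubling n should roughly multiply the time by 8 (O(n^3)),
# though small sizes are dominated by fixed overhead.
```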

When gradient descent wins

Gradient descent is the better choice when:

  • You have a large number of features (e.g. text data with millions of word features).
  • The dataset is too large to fit in memory (you can use mini-batches).
  • You are working with neural networks or non-linear models, where no closed form exists.
  • Online learning is needed (updating the model as new data arrives).
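The mini-batch idea from the list above can be sketched as follows. The function name, batch size, and learning rate are illustrative choices, not a standard API:

```python
import numpy as np

def minibatch_sgd(X, y, lr=0.01, epochs=50, batch_size=32, seed=0):
    """Mini-batch gradient descent on MSE. Each update touches only one
    chunk of rows, so the full dataset never has to fit in one gradient."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        idx = rng.permutation(m)           # reshuffle each epoch
        for start in range(0, m, batch_size):
            batch = idx[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            grad = (2 / len(batch)) * Xb.T @ (Xb @ theta - yb)
            theta -= lr * grad
    return theta

# Illustrative usage on synthetic data with true coefficients [3, -2]
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = X @ np.array([3.0, -2.0]) + rng.normal(0, 0.1, 200)
theta = minibatch_sgd(X, y, lr=0.05, epochs=100)
print(theta)   # close to [3.0, -2.0]
```

The same loop structure extends to streaming (online) settings: each incoming batch of fresh data plays the role of `Xb, yb`.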

The inversion problem

A subtlety: OLS requires computing (X^T X)^{-1}. This matrix is singular (non-invertible) if:

  • Two features are perfectly correlated (multicollinearity).
  • You have more features than training examples (n > m).

In these cases OLS fails outright, while gradient descent continues to function (though it may converge slowly or to a non-unique solution). Regularization — covered in a later lesson — fixes this for OLS too via Ridge regression.
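A small demonstration, assuming perfect multicollinearity from a duplicated column (the data and the ridge penalty value are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
X = np.column_stack([X, X[:, 0]])   # duplicate a column: perfect collinearity
y = X @ np.array([1.0, 2.0, 3.0, 0.0]) + rng.normal(0, 0.1, 50)

XtX = X.T @ X
print(np.linalg.matrix_rank(XtX))   # 3, not 4 -> singular, plain OLS fails

# Ridge fix: adding lam * I makes the matrix invertible for any lam > 0
lam = 1e-3
theta_ridge = np.linalg.solve(XtX + lam * np.eye(4), X.T @ y)
print(theta_ridge)                  # weight is split across the two copies
```

With two identical columns there is no unique OLS solution; the ridge penalty picks the one that spreads the weight across both copies, and their coefficients sum to roughly the true combined effect.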

Practical guidance

For most beginner projects and classroom exercises, OLS is the right default because it is simple, exact, and parameter-free. Gradient descent becomes essential when you move to larger problems or more complex models such as neural networks.

A note on numerical stability

Even when OLS is feasible, the matrix inversion can be numerically unstable if features are on very different scales or are nearly collinear. Libraries like NumPy and scikit-learn handle this using QR decomposition or Singular Value Decomposition (SVD) rather than direct inversion, which is more stable.
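In practice this means you rarely call a matrix inverse yourself. NumPy's `np.linalg.lstsq` solves the least-squares problem via SVD internally (the synthetic data below is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(0, 0.1, 100)

# lstsq uses an SVD-based solver rather than forming (X^T X)^{-1},
# so it stays stable even for poorly scaled or nearly collinear features
theta, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
print(theta)   # close to [1.0, 2.0, -1.0]
```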

Summary

| Situation | Recommended method |
| --- | --- |
| Small dataset, few features | OLS |
| Large dataset or many features | Gradient descent |
| Neural network / deep learning | Gradient descent (always) |
| Perfect multicollinearity | Neither (fix the data first) |