OLS vs. Gradient Descent: When to Use Which

Two roads to the same destination

Both OLS and gradient descent minimize the same MSE cost function. For linear regression, gradient descent converges to the same answer that OLS computes in one step. The difference is how they get there, and that difference matters a great deal in practice.
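To make this concrete, here is a minimal sketch comparing the two on synthetic data (the data, learning rate, and iteration count are illustrative choices, not prescriptions):

```python
import numpy as np

# Synthetic data: y = 2x + 1 plus noise (illustrative values)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, 100)

# Design matrix with an intercept column
X = np.column_stack([np.ones_like(x), x])

# OLS: one linear solve of the normal equations
theta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient descent: many small steps down the MSE surface
theta_gd = np.zeros(2)
lr = 0.01
m = len(y)
for _ in range(20_000):
    grad = (2 / m) * X.T @ (X @ theta_gd - y)
    theta_gd -= lr * grad

print(theta_ols)  # close to [1.0, 2.0]
print(theta_gd)   # nearly identical after enough iterations
```

OLS gets there in one solve; gradient descent needs thousands of iterations and a tuned learning rate to land on the same parameters.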

Side-by-side comparison

| Property | OLS (Normal Equation) | Gradient Descent |
| --- | --- | --- |
| Solution type | Closed-form, exact | Iterative, approximate |
| Iterations needed | 1 | Many (hundreds to millions) |
| Learning rate | Not needed | Must be tuned |
| Scales with features n | Poorly, O(n^3) | Well, O(kn) over k iterations |
| Scales with examples m | Moderately, O(mn^2) | Well, O(km) over k iterations |
| Feature scaling needed | No | Yes (strongly recommended) |
| Works for non-linear models | No | Yes (with modified cost functions) |

When OLS wins

OLS is the better choice when:

  • You have a small to moderate number of features (roughly n < 10,000).
  • The dataset fits comfortably in memory.
  • You want an exact answer without tuning any hyperparameters.
  • You need statistical guarantees (standard errors, confidence intervals) on the parameters.

The bottleneck of OLS is computing (X^T X)^{-1}, a matrix inversion that costs O(n^3). For n = 100 features this is trivial; for n = 100,000 features it becomes prohibitive.
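The cubic cost is easy to observe directly. The sketch below times inversion of progressively larger matrices (sizes and exact timings are illustrative and machine-dependent):

```python
import time

import numpy as np

# Rough timing of the matrix-inversion step as n grows.
# A is built to be symmetric positive definite, like X^T X.
for n in (100, 200, 400):
    A = np.random.default_rng(1).normal(size=(n, n))
    A = A @ A.T + n * np.eye(n)
    t0 = time.perf_counter()
    np.linalg.inv(A)
    print(n, time.perf_counter() - t0)

# Doubling n should roughly multiply the time by 8 (O(n^3)),
# though small sizes are dominated by fixed overhead.
```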

When gradient descent wins

Gradient descent is the better choice when:

  • You have a large number of features (e.g. text data with millions of word features).
  • The dataset is too large to fit in memory (you can use mini-batches).
  • You are working with neural networks or non-linear models, where no closed form exists.
  • Online learning is needed (updating the model as new data arrives).
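The mini-batch idea from the list above can be sketched as follows. The function name, batch size, and learning rate are illustrative choices, not a standard API:

```python
import numpy as np

def minibatch_sgd(X, y, lr=0.01, epochs=50, batch_size=32, seed=0):
    """Mini-batch gradient descent on MSE. Each update touches only one
    chunk of rows, so the full dataset never has to fit in one gradient."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        idx = rng.permutation(m)           # reshuffle each epoch
        for start in range(0, m, batch_size):
            batch = idx[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            grad = (2 / len(batch)) * Xb.T @ (Xb @ theta - yb)
            theta -= lr * grad
    return theta

# Illustrative usage on synthetic data with true coefficients [3, -2]
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = X @ np.array([3.0, -2.0]) + rng.normal(0, 0.1, 200)
theta = minibatch_sgd(X, y, lr=0.05, epochs=100)
print(theta)   # close to [3.0, -2.0]
```

The same loop structure extends to streaming (online) settings: each incoming batch of fresh data plays the role of `Xb, yb`.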

The inversion problem

A subtlety: OLS requires computing (X^T X)^{-1}. This matrix is singular (non-invertible) if:

  • Two features are perfectly correlated (multicollinearity).
  • You have more features than training examples (n > m).

In these cases OLS fails outright, while gradient descent continues to function (though it may converge slowly or to a non-unique solution). Regularization — covered in a later lesson — fixes this for OLS too via Ridge regression.
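A small demonstration, assuming perfect multicollinearity from a duplicated column (the data and the ridge penalty value are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
X = np.column_stack([X, X[:, 0]])   # duplicate a column: perfect collinearity
y = X @ np.array([1.0, 2.0, 3.0, 0.0]) + rng.normal(0, 0.1, 50)

XtX = X.T @ X
print(np.linalg.matrix_rank(XtX))   # 3, not 4 -> singular, plain OLS fails

# Ridge fix: adding lam * I makes the matrix invertible for any lam > 0
lam = 1e-3
theta_ridge = np.linalg.solve(XtX + lam * np.eye(4), X.T @ y)
print(theta_ridge)                  # weight is split across the two copies
```

With two identical columns there is no unique OLS solution; the ridge penalty picks the one that spreads the weight across both copies, and their coefficients sum to roughly the true combined effect.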

Practical guidance

For most beginner projects and classroom exercises, OLS is the right default because it is simple, exact, and parameter-free. Gradient descent becomes essential when you move to larger problems or more complex models such as neural networks.

A note on numerical stability

Even when OLS is feasible, the matrix inversion can be numerically unstable if features are on very different scales or are nearly collinear. Libraries like NumPy and scikit-learn handle this using QR decomposition or Singular Value Decomposition (SVD) rather than direct inversion, which is more stable.
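In practice this means you rarely call a matrix inverse yourself. NumPy's `np.linalg.lstsq` solves the least-squares problem via SVD internally (the synthetic data below is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(0, 0.1, 100)

# lstsq uses an SVD-based solver rather than forming (X^T X)^{-1},
# so it stays stable even for poorly scaled or nearly collinear features
theta, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
print(theta)   # close to [1.0, 2.0, -1.0]
```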

Summary

| Situation | Recommended method |
| --- | --- |
| Small dataset, few features | OLS |
| Large dataset or many features | Gradient descent |
| Neural network / deep learning | Gradient descent (always) |
| Perfect multicollinearity | Neither (fix the data first) |