# OLS vs. Gradient Descent: When to Use Which

## Two roads to the same destination
Both OLS and gradient descent minimize the same MSE cost function, and for linear regression they converge to the same solution: OLS computes it directly, while gradient descent approaches it iteratively to arbitrary precision. The difference is how they get there, and that difference matters a great deal in practice.
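The equivalence is easy to verify numerically. Here is a minimal sketch in plain NumPy, using made-up data and an arbitrary learning rate and iteration count:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100
x = rng.uniform(0, 10, m)
y = 3.0 + 2.0 * x + rng.normal(0, 1, m)   # true intercept 3, slope 2

X = np.column_stack([np.ones(m), x])      # design matrix with intercept column

# OLS via the normal equation: solve (X^T X) theta = X^T y
theta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Batch gradient descent on the same MSE cost
theta_gd = np.zeros(2)
lr = 0.01
for _ in range(20000):
    grad = (2 / m) * X.T @ (X @ theta_gd - y)
    theta_gd -= lr * grad

print(theta_ols, theta_gd)   # the two estimates agree to several decimals
```

With enough iterations and a small enough learning rate, the gradient-descent estimate matches the closed-form solution to within numerical noise.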
## Side-by-side comparison
| Property | OLS (Normal Equation) | Gradient Descent |
|---|---|---|
| Solution type | Closed-form, exact | Iterative, approximate |
| Iterations needed | 1 | Many (hundreds to millions) |
| Learning rate | Not needed | Must be tuned |
| Scales with features | Poorly — $O(n^3)$ | Well — $O(mn)$ per iteration |
| Scales with examples | Moderately — $O(mn^2)$ | Well — $O(mn)$ per iteration, less with mini-batches |
| Feature scaling needed | No | Yes (strongly recommended) |
| Works for non-linear models | No | Yes (with modified cost functions) |
## When OLS wins
OLS is the better choice when:
- You have a small to moderate number of features (roughly $n \lesssim 10{,}000$).
- The dataset fits comfortably in memory.
- You want an exact answer without tuning any hyperparameters.
- You need statistical guarantees (standard errors, confidence intervals) on the parameters.
The bottleneck of OLS is computing $(X^\top X)^{-1}$, a matrix inversion that costs $O(n^3)$ in the number of features $n$. For a few hundred features this is trivial; for hundreds of thousands it becomes prohibitive.
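For concreteness, here is a minimal normal-equation sketch on synthetic data. Note that `np.linalg.solve` is preferred over forming the inverse explicitly, but the cost of solving the $n \times n$ system is still cubic in $n$:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 1000, 20                    # 1,000 examples, 20 features
X = rng.normal(size=(m, n))
theta_true = rng.normal(size=n)
y = X @ theta_true + rng.normal(0, 0.1, m)

# Forming X^T X costs O(m n^2); solving the n x n system costs O(n^3).
# For n = 20 this is instantaneous; the cubic term is what eventually bites.
theta = np.linalg.solve(X.T @ X, X.T @ y)
```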
## When gradient descent wins
Gradient descent is the better choice when:
- You have a large number of features (e.g. text data with millions of word features).
- The dataset is too large to fit in memory (you can use mini-batches).
- You are working with neural networks or non-linear models, where no closed form exists.
- Online learning is needed (updating the model as new data arrives).
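The mini-batch idea can be sketched as follows (batch size and learning rate here are arbitrary choices). Only one batch of rows is touched at a time, which is what makes out-of-core and online learning possible:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 10_000, 5
X = rng.normal(size=(m, n))
theta_true = rng.normal(size=n)
y = X @ theta_true + rng.normal(0, 0.1, m)

theta = np.zeros(n)
lr, batch = 0.05, 32
for epoch in range(20):
    order = rng.permutation(m)                 # reshuffle every epoch
    for start in range(0, m, batch):
        b = order[start:start + batch]         # one mini-batch of row indices
        grad = (2 / len(b)) * X[b].T @ (X[b] @ theta - y[b])
        theta -= lr * grad
```

In a true out-of-core setting the batches would be streamed from disk rather than sliced from an in-memory array, but the update rule is identical.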
## The inversion problem
A subtlety: OLS requires inverting $X^\top X$. This matrix is singular (non-invertible) if:
- Two features are perfectly correlated (multicollinearity).
- You have more features than training examples ($n > m$).
In these cases OLS fails outright, while gradient descent continues to function (though it may converge slowly or to a non-unique solution). Regularization — covered in a later lesson — fixes this for OLS too via Ridge regression.
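A small sketch makes the failure concrete: with a perfectly duplicated feature, $X^\top X$ has rank 1 and cannot be inverted, while adding a ridge penalty $\lambda I$ (the value of $\lambda$ below is arbitrary) restores invertibility:

```python
import numpy as np

rng = np.random.default_rng(3)
m = 50
x1 = rng.normal(size=m)
X = np.column_stack([x1, 2.0 * x1])   # second feature perfectly correlated
y = x1 + rng.normal(0, 0.1, m)

A = X.T @ X
print(np.linalg.matrix_rank(A))       # rank 1: the normal equation has no unique solution

# Ridge regression: (X^T X + lambda * I) is full rank and safely invertible
lam = 1.0
theta_ridge = np.linalg.solve(A + lam * np.eye(2), X.T @ y)
```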
## Practical guidance
For most beginner projects and classroom exercises, OLS is the right default because it is simple, exact, and parameter-free. Gradient descent becomes essential when you move to larger problems or more complex models such as neural networks.
## A note on numerical stability
Even when OLS is feasible, the matrix inversion can be numerically unstable if features are on very different scales or are nearly collinear. Libraries like NumPy and scikit-learn therefore avoid direct inversion, using QR decomposition or the singular value decomposition (SVD) instead, both of which are far more stable.
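As an illustration, NumPy's `np.linalg.lstsq` solves least squares via an SVD, so it returns a (minimum-norm) answer even when $X^\top X$ is exactly singular:

```python
import numpy as np

rng = np.random.default_rng(4)
m = 200
x1 = rng.normal(size=m)
X = np.column_stack([x1, 2.0 * x1])   # exactly collinear columns
y = x1 + rng.normal(0, 0.1, m)

# SVD-based least squares: no explicit inversion of X^T X
theta, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
print(rank)    # 1: lstsq detects the rank deficiency
print(theta)   # minimum-norm solution; theta[0] + 2*theta[1] is close to 1
```

Where the normal equation would fail outright, the SVD route degrades gracefully by picking the smallest-norm solution among the infinitely many that fit equally well.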
## Summary
| Situation | Recommended method |
|---|---|
| Small dataset, few features | OLS |
| Large dataset or many features | Gradient descent |
| Neural network / deep learning | Gradient descent (always) |
| Perfect multicollinearity | Neither (fix the data first) |