Gradient Descent: Learning the Parameters
The iterative alternative
OLS computes the best parameters in one shot. But for very large datasets or models with many features, that direct computation becomes expensive. Gradient descent is an iterative optimization algorithm that starts with a guess for the parameters and improves them step by step until it converges on the minimum of the cost function.
The intuition: rolling downhill
Imagine you are standing somewhere on a hilly landscape (the cost surface from the previous lesson) and you want to reach the lowest point. A natural strategy: look around, find the direction that goes downhill most steeply, take a step in that direction, and repeat.
Gradient descent does exactly this, but in the space of parameters. The gradient of $J(\theta_0, \theta_1)$ with respect to $\theta_0$ and $\theta_1$ points in the direction of steepest increase in cost. So gradient descent moves in the opposite direction: downhill.
The update rule
At each iteration, both parameters are updated simultaneously:

$$\theta_0 := \theta_0 - \alpha \frac{\partial J}{\partial \theta_0}, \qquad \theta_1 := \theta_1 - \alpha \frac{\partial J}{\partial \theta_1}$$

where $\alpha$ (alpha) is the learning rate, a small positive number that controls how large each step is.
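In code, the update rule is one small function. Computing both new values before overwriting either parameter is what "simultaneously" means; a minimal sketch, with the gradient values supplied by the caller:

```python
def gd_step(theta0, theta1, grad0, grad1, alpha):
    """One simultaneous gradient-descent update for both parameters."""
    # Both new values are computed from the *old* parameters before
    # either one is overwritten -- updating theta0 first and then using
    # the new theta0 to update theta1 would be a subtle bug.
    new_theta0 = theta0 - alpha * grad0
    new_theta1 = theta1 - alpha * grad1
    return new_theta0, new_theta1
```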
Computing the partial derivatives
For the MSE cost function $J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^{m}(\hat{y}_i - y_i)^2$ (the factor $\frac{1}{2}$ is conventional; it cancels the 2 produced by differentiation):

$$\frac{\partial J}{\partial \theta_0} = \frac{1}{m}\sum_{i=1}^{m}(\hat{y}_i - y_i), \qquad \frac{\partial J}{\partial \theta_1} = \frac{1}{m}\sum_{i=1}^{m}(\hat{y}_i - y_i)\,x_i$$

Notice that $\frac{\partial J}{\partial \theta_0}$ is simply the mean residual, and $\frac{\partial J}{\partial \theta_1}$ is the mean of residuals weighted by the feature value $x_i$.
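These derivatives translate directly into a few lines of NumPy, written so that each gradient is exactly the mean described above. A minimal sketch:

```python
import numpy as np

def gradients(x, y, theta0, theta1):
    """Partial derivatives of the MSE cost for simple linear regression."""
    residuals = (theta0 + theta1 * x) - y   # prediction minus target
    grad0 = residuals.mean()                # mean residual
    grad1 = (residuals * x).mean()          # residuals weighted by the feature
    return grad0, grad1
```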
The learning rate
The learning rate is one of the most important hyperparameters to set correctly.
| Learning rate | Effect |
|---|---|
| Too large | Steps overshoot the minimum; cost may oscillate or diverge |
| Too small | Converges reliably, but very slowly |
| Just right | Smooth, efficient descent to the minimum |
Typical values to try: $0.001$, $0.01$, $0.1$. You can monitor convergence by plotting $J$ against the number of iterations; it should decrease smoothly and level off.
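Recording the cost at every iteration is the easiest way to produce that convergence curve. A minimal sketch on a hypothetical dataset generated from $y = 2 + 3x$ (not the lesson's data):

```python
import numpy as np

def batch_gd(x, y, alpha, iters):
    """Batch gradient descent that records the cost at every iteration."""
    theta0 = theta1 = 0.0
    m = len(x)
    history = []
    for _ in range(iters):
        residuals = (theta0 + theta1 * x) - y            # prediction minus target
        history.append((residuals ** 2).sum() / (2 * m))  # cost J at this step
        # Simultaneous update of both parameters.
        theta0, theta1 = (theta0 - alpha * residuals.mean(),
                          theta1 - alpha * (residuals * x).mean())
    return theta0, theta1, history

# Hypothetical data: four points on the line y = 2 + 3x.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 + 3.0 * x

t0, t1, hist = batch_gd(x, y, alpha=0.1, iters=2000)
```

With a well-chosen learning rate the recorded costs decrease monotonically and flatten out near zero; plotting `hist` (e.g. with matplotlib) gives the diagnostic curve described above.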
A step-by-step trace
Start with $\theta_0 = 0$, $\theta_1 = 0$, and a fixed learning rate $\alpha$, using the four-point dataset from the OLS lesson.
Iteration 1:
- Compute predictions $\hat{y}_i = \theta_0 + \theta_1 x_i$ for all examples (with both parameters at zero, every prediction is $0$).
- Residuals: $r_i = \hat{y}_i - y_i$.
- Update: $\theta_0 \leftarrow \theta_0 - \alpha \cdot \frac{1}{m}\sum_i r_i$
- Update: $\theta_1 \leftarrow \theta_1 - \alpha \cdot \frac{1}{m}\sum_i r_i x_i$
After many iterations the parameters converge toward the OLS solution computed in the previous lesson.
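That convergence can be checked numerically: run batch gradient descent and compare the result against the closed-form OLS slope and intercept. A sketch using a hypothetical four-point dataset (not necessarily the one from the OLS lesson):

```python
import numpy as np

# Hypothetical stand-in for a small four-point dataset.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 5.0, 8.0])

# Closed-form OLS solution, for reference.
slope = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
intercept = y.mean() - slope * x.mean()

# Batch gradient descent starting from theta0 = theta1 = 0.
theta0 = theta1 = 0.0
for _ in range(10_000):
    residuals = (theta0 + theta1 * x) - y
    theta0, theta1 = (theta0 - 0.05 * residuals.mean(),
                      theta1 - 0.05 * (residuals * x).mean())
```

After enough iterations `theta0` and `theta1` agree with `intercept` and `slope` to many decimal places, which is exactly the convergence claim above.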
Batch, stochastic, and mini-batch variants
The version above — computing gradients using all examples at each step — is called batch gradient descent. Two important variants:
- Stochastic gradient descent (SGD): update parameters after each single example. Much faster per step, but noisy.
- Mini-batch gradient descent: update after each small batch (e.g. 32 or 256 examples). The practical default in deep learning.
For simple linear regression, batch gradient descent is standard.
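The two variants differ only in how many examples feed each update. A minimal sketch covering both, where `batch_size=1` gives pure SGD (the per-epoch shuffling and the seed are illustrative assumptions):

```python
import numpy as np

def sgd(x, y, alpha, epochs, batch_size=1, seed=0):
    """Stochastic / mini-batch gradient descent; batch_size=1 is pure SGD."""
    rng = np.random.default_rng(seed)
    theta0 = theta1 = 0.0
    for _ in range(epochs):
        order = rng.permutation(len(x))            # shuffle once per epoch
        for start in range(0, len(x), batch_size):
            idx = order[start:start + batch_size]  # next mini-batch
            xb, yb = x[idx], y[idx]
            residuals = (theta0 + theta1 * xb) - yb
            # Same update rule, but the means are over the batch only.
            theta0, theta1 = (theta0 - alpha * residuals.mean(),
                              theta1 - alpha * (residuals * xb).mean())
    return theta0, theta1
```

On noiseless data the noisy updates still settle on the true line, because every residual is zero at the optimum; on real data the parameters instead jitter around the minimum unless the learning rate is decayed.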
Convergence
Gradient descent is said to have converged when the cost stops decreasing meaningfully between iterations, or when the gradient magnitude falls below a small threshold. For linear regression with MSE, the cost surface is convex, so gradient descent always finds the global minimum — it cannot get stuck in a local minimum.
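The gradient-magnitude criterion can be sketched as a stopping rule inside the training loop (the tolerance `tol`, the cap `max_iters`, and the dataset below are illustrative choices):

```python
import numpy as np

def fit_until_converged(x, y, alpha=0.05, tol=1e-8, max_iters=100_000):
    """Run batch GD until the gradient norm drops below tol."""
    theta0 = theta1 = 0.0
    for i in range(max_iters):
        residuals = (theta0 + theta1 * x) - y
        grad0, grad1 = residuals.mean(), (residuals * x).mean()
        if np.hypot(grad0, grad1) < tol:   # gradient-norm stopping rule
            return theta0, theta1, i       # converged after i iterations
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1, max_iters       # hit the iteration cap
```

Because the MSE surface is convex, this loop always halts near the unique global minimum (given a stable learning rate); the iteration count it returns is a rough measure of how hard the problem is for the chosen $\alpha$.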