Gradient Descent: Learning the Parameters

The iterative alternative

OLS computes the best parameters in one shot. But for very large datasets or models with many features, that direct computation becomes expensive. Gradient descent is an iterative optimization algorithm that starts with a guess for the parameters and improves them step by step until it converges on the minimum of the cost function.

The intuition: rolling downhill

Imagine you are standing somewhere on a hilly landscape (the cost surface from the previous lesson) and you want to reach the lowest point. A natural strategy: look around, find the direction that goes downhill most steeply, take a step in that direction, and repeat.

Gradient descent does exactly this, but in the space of parameters. The gradient of $J$ with respect to $\theta_0$ and $\theta_1$ points in the direction of steepest increase in cost. So gradient descent moves in the opposite direction: downhill.

The update rule

At each iteration, both parameters are updated simultaneously:

$$\theta_0 \leftarrow \theta_0 - \alpha \frac{\partial J}{\partial \theta_0}$$

$$\theta_1 \leftarrow \theta_1 - \alpha \frac{\partial J}{\partial \theta_1}$$

where $\alpha$ (alpha) is the learning rate: a small positive number that controls how large each step is.
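In code, "simultaneously" means both gradients are computed from the current parameter values before either parameter is overwritten; a common bug is updating $\theta_0$ first and then using the new value when computing the update for $\theta_1$. A minimal sketch in plain Python (function and variable names are illustrative):

```python
def gd_step(theta0, theta1, x, y, alpha):
    """One simultaneous gradient descent update for h(x) = theta0 + theta1 * x."""
    m = len(x)
    # Residuals are computed once, from the CURRENT parameters.
    res = [(theta0 + theta1 * xi) - yi for xi, yi in zip(x, y)]
    grad0 = sum(res) / m                              # dJ/dtheta0: mean residual
    grad1 = sum(r * xi for r, xi in zip(res, x)) / m  # dJ/dtheta1: x-weighted mean
    # Both new values are built from the old ones before either is replaced.
    return theta0 - alpha * grad0, theta1 - alpha * grad1
```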

Computing the partial derivatives

For the MSE cost function:

$$\frac{\partial J}{\partial \theta_0} = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}^{(i)} - y^{(i)})$$

$$\frac{\partial J}{\partial \theta_1} = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}^{(i)} - y^{(i)}) \cdot x^{(i)}$$

Notice that $\frac{\partial J}{\partial \theta_0}$ is simply the mean residual, and $\frac{\partial J}{\partial \theta_1}$ is the mean of the residuals weighted by the feature value.
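With NumPy each derivative is one line; a sketch assuming `x` and `y` are equal-length arrays:

```python
import numpy as np

def gradients(theta0, theta1, x, y):
    """Partial derivatives of the cost for h(x) = theta0 + theta1 * x."""
    residuals = (theta0 + theta1 * x) - y
    grad0 = residuals.mean()           # mean residual
    grad1 = (residuals * x).mean()     # mean residual weighted by x
    return grad0, grad1
```

`residuals.mean()` is exactly the "mean residual" reading above.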

The learning rate $\alpha$

The learning rate is one of the most important hyperparameters to set correctly.

  • Too large: steps overshoot the minimum; cost may oscillate or diverge.
  • Too small: convergence is reliable but very slow.
  • Just right: smooth, efficient descent to the minimum.

Typical values to try: 0.001, 0.01, 0.1. You can monitor convergence by plotting $J$ against the number of iterations; it should decrease smoothly and level off.
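The monitoring loop can be sketched as follows (NumPy; `history` is what you would hand to a plotting call, and the $\frac{1}{2m}$ cost convention matches the gradient formulas above):

```python
import numpy as np

def train(x, y, alpha, iters):
    theta0, theta1 = 0.0, 0.0
    history = []                                 # J recorded each iteration
    for _ in range(iters):
        res = (theta0 + theta1 * x) - y
        history.append((res ** 2).mean() / 2)    # 1/(2m) * sum of squared residuals
        theta0, theta1 = (theta0 - alpha * res.mean(),
                          theta1 - alpha * (res * x).mean())
    return theta0, theta1, history
```

If `history` oscillates or grows, lower `alpha`; if it creeps down very slowly, try raising it.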

A step-by-step trace

Start with $\theta_0 = 0$, $\theta_1 = 0$, and $\alpha = 0.01$, using the four-point dataset from the OLS lesson.

Iteration 1:

  1. Compute predictions $\hat{y}^{(i)} = 0 + 0 \cdot x^{(i)} = 0$ for all examples.
  2. Residuals: $-200, -260, -330, -380$.
  3. $\frac{\partial J}{\partial \theta_0} = \frac{1}{4}(-200 - 260 - 330 - 380) = -292.5$
  4. $\frac{\partial J}{\partial \theta_1} = \frac{1}{4}((-200)(1000) + (-260)(1500) + (-330)(2000) + (-380)(2500)) = -550000$
  5. Update: $\theta_0 \leftarrow 0 - 0.01 \times (-292.5) = 2.925$
  6. Update: $\theta_1 \leftarrow 0 - 0.01 \times (-550000) = 5500$

Notice how far $\theta_1$ overshoots: with raw feature values in the thousands, the gradient with respect to $\theta_1$ is enormous, and $\alpha = 0.01$ makes the cost diverge rather than settle. In practice you would scale the feature (or use a much smaller $\alpha$) first; with that done, the parameters converge over many iterations toward the OLS solution ($\theta_0 \approx 79$, $\theta_1 \approx 0.122$).
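Iteration 1 can be checked numerically with the four $(x, y)$ pairs from the trace (the x values appear in step 4; the y values are implied by the residuals):

```python
x = [1000, 1500, 2000, 2500]
y = [200, 260, 330, 380]
theta0, theta1, alpha = 0.0, 0.0, 0.01
m = len(x)

res = [(theta0 + theta1 * xi) - yi for xi, yi in zip(x, y)]  # [-200, -260, -330, -380]
grad0 = sum(res) / m                                         # -292.5
grad1 = sum(r * xi for r, xi in zip(res, x)) / m             # -550000.0
theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
print(theta0, theta1)                                        # 2.925 5500.0
```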

Batch, stochastic, and mini-batch variants

The version above, which computes gradients using all $m$ examples at each step, is called batch gradient descent. Two important variants:

  • Stochastic gradient descent (SGD): update parameters after each single example. Much faster per step, but noisy.
  • Mini-batch gradient descent: update after each small batch (e.g. 32 or 256 examples). The practical default in deep learning.

For simple linear regression, batch gradient descent is standard.
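The three variants differ only in how many examples feed each update. A sketch of the mini-batch variant (NumPy; the batch size and epoch count are illustrative, and setting the batch size to $m$ recovers batch gradient descent while a batch size of 1 recovers SGD):

```python
import numpy as np

def minibatch_gd(x, y, alpha, batch_size, epochs, seed=0):
    rng = np.random.default_rng(seed)
    theta0, theta1 = 0.0, 0.0
    m = len(x)
    for _ in range(epochs):
        order = rng.permutation(m)                   # reshuffle each epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]    # one mini-batch of indices
            res = (theta0 + theta1 * x[idx]) - y[idx]
            theta0, theta1 = (theta0 - alpha * res.mean(),
                              theta1 - alpha * (res * x[idx]).mean())
    return theta0, theta1
```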

Convergence

Gradient descent is said to have converged when the cost $J$ stops decreasing meaningfully between iterations, or when the gradient magnitude falls below a small threshold. For linear regression with MSE, the cost surface is convex, so gradient descent with a suitable learning rate always finds the global minimum; it cannot get stuck in a local minimum.
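The first stopping rule can be written as a small loop; a sketch (NumPy, illustrative names) that stops once the per-iteration drop in $J$ falls below a tolerance:

```python
import numpy as np

def gd_until_converged(x, y, alpha, tol=1e-9, max_iters=100000):
    theta0, theta1 = 0.0, 0.0
    prev_cost = np.inf
    for i in range(max_iters):
        res = (theta0 + theta1 * x) - y
        cost = (res ** 2).mean() / 2        # 1/(2m) convention, as in the gradients
        if prev_cost - cost < tol:          # J stopped decreasing meaningfully
            return theta0, theta1, i
        prev_cost = cost
        theta0, theta1 = (theta0 - alpha * res.mean(),
                          theta1 - alpha * (res * x).mean())
    return theta0, theta1, max_iters
```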