Gradient Descent: Learning the Parameters

The iterative alternative

OLS computes the best parameters in one shot. But for very large datasets or models with many features, that direct computation becomes expensive. Gradient descent is an iterative optimization algorithm that starts with a guess for the parameters and improves them step by step until it converges on the minimum of the cost function.

The intuition: rolling downhill

Imagine you are standing somewhere on a hilly landscape (the cost surface from the previous lesson) and you want to reach the lowest point. A natural strategy: look around, find the direction that goes downhill most steeply, take a step in that direction, and repeat.

Gradient descent does exactly this, but in the space of parameters. The gradient of $J$ with respect to $\theta_0$ and $\theta_1$ points in the direction of steepest increase in cost. So gradient descent moves in the opposite direction: downhill.

The update rule

At each iteration, both parameters are updated simultaneously:

$$\theta_0 \leftarrow \theta_0 - \alpha \frac{\partial J}{\partial \theta_0}$$

$$\theta_1 \leftarrow \theta_1 - \alpha \frac{\partial J}{\partial \theta_1}$$

where $\alpha$ (alpha) is the learning rate: a small positive number that controls how large each step is.
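In code, "simultaneously" means both gradients are computed from the current parameter values before either parameter is overwritten; a common bug is updating $\theta_0$ first and then using the new value when computing the update for $\theta_1$. A minimal sketch in plain Python (function and variable names are illustrative):

```python
def gd_step(theta0, theta1, x, y, alpha):
    """One simultaneous gradient descent update for h(x) = theta0 + theta1 * x."""
    m = len(x)
    # Residuals are computed once, from the CURRENT parameters.
    res = [(theta0 + theta1 * xi) - yi for xi, yi in zip(x, y)]
    grad0 = sum(res) / m                              # dJ/dtheta0: mean residual
    grad1 = sum(r * xi for r, xi in zip(res, x)) / m  # dJ/dtheta1: x-weighted mean
    # Both new values are built from the old ones before either is replaced.
    return theta0 - alpha * grad0, theta1 - alpha * grad1
```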

Computing the partial derivatives

For the MSE cost function:

$$\frac{\partial J}{\partial \theta_0} = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}^{(i)} - y^{(i)})$$

$$\frac{\partial J}{\partial \theta_1} = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}^{(i)} - y^{(i)}) \cdot x^{(i)}$$

Notice that $\frac{\partial J}{\partial \theta_0}$ is simply the mean residual, and $\frac{\partial J}{\partial \theta_1}$ is the mean of the residuals weighted by the feature value.
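With NumPy each derivative is one line; a sketch assuming `x` and `y` are equal-length arrays:

```python
import numpy as np

def gradients(theta0, theta1, x, y):
    """Partial derivatives of the cost for h(x) = theta0 + theta1 * x."""
    residuals = (theta0 + theta1 * x) - y
    grad0 = residuals.mean()           # mean residual
    grad1 = (residuals * x).mean()     # mean residual weighted by x
    return grad0, grad1
```

`residuals.mean()` is exactly the "mean residual" reading above.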

The learning rate $\alpha$

The learning rate is one of the most important hyperparameters to set correctly.

  • Too large: steps overshoot the minimum; cost may oscillate or diverge.
  • Too small: convergence is reliable but very slow.
  • Just right: smooth, efficient descent to the minimum.

Typical values to try: 0.001, 0.01, 0.1. You can monitor convergence by plotting $J$ against the number of iterations; it should decrease smoothly and level off.
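The monitoring loop can be sketched as follows (NumPy; `history` is what you would hand to a plotting call, and the $\frac{1}{2m}$ cost convention matches the gradient formulas above):

```python
import numpy as np

def train(x, y, alpha, iters):
    theta0, theta1 = 0.0, 0.0
    history = []                                 # J recorded each iteration
    for _ in range(iters):
        res = (theta0 + theta1 * x) - y
        history.append((res ** 2).mean() / 2)    # 1/(2m) * sum of squared residuals
        theta0, theta1 = (theta0 - alpha * res.mean(),
                          theta1 - alpha * (res * x).mean())
    return theta0, theta1, history
```

If `history` oscillates or grows, lower `alpha`; if it creeps down very slowly, try raising it.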

A step-by-step trace

Start with $\theta_0 = 0$, $\theta_1 = 0$, and $\alpha = 0.01$, using the four-point dataset from the OLS lesson.

Iteration 1:

  1. Compute predictions $\hat{y}^{(i)} = 0 + 0 \cdot x^{(i)} = 0$ for all examples.
  2. Residuals: $-200, -260, -330, -380$.
  3. $\frac{\partial J}{\partial \theta_0} = \frac{1}{4}(-200 - 260 - 330 - 380) = -292.5$
  4. $\frac{\partial J}{\partial \theta_1} = \frac{1}{4}((-200)(1000) + (-260)(1500) + (-330)(2000) + (-380)(2500)) = -550000$
  5. Update: $\theta_0 \leftarrow 0 - 0.01 \times (-292.5) = 2.925$
  6. Update: $\theta_1 \leftarrow 0 - 0.01 \times (-550000) = 5500$

Notice how far $\theta_1$ overshoots: with raw feature values in the thousands, the gradient with respect to $\theta_1$ is enormous, and $\alpha = 0.01$ makes the cost diverge rather than settle. In practice you would scale the feature (or use a much smaller $\alpha$) first; with that done, the parameters converge over many iterations toward the OLS solution ($\theta_0 \approx 79$, $\theta_1 \approx 0.122$).
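Iteration 1 can be checked numerically with the four $(x, y)$ pairs from the trace (the x values appear in step 4; the y values are implied by the residuals):

```python
x = [1000, 1500, 2000, 2500]
y = [200, 260, 330, 380]
theta0, theta1, alpha = 0.0, 0.0, 0.01
m = len(x)

res = [(theta0 + theta1 * xi) - yi for xi, yi in zip(x, y)]  # [-200, -260, -330, -380]
grad0 = sum(res) / m                                         # -292.5
grad1 = sum(r * xi for r, xi in zip(res, x)) / m             # -550000.0
theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
print(theta0, theta1)                                        # 2.925 5500.0
```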

Batch, stochastic, and mini-batch variants

The version above, which computes gradients using all $m$ examples at each step, is called batch gradient descent. Two important variants:

  • Stochastic gradient descent (SGD): update parameters after each single example. Much faster per step, but noisy.
  • Mini-batch gradient descent: update after each small batch (e.g. 32 or 256 examples). The practical default in deep learning.

For simple linear regression, batch gradient descent is standard.
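The three variants differ only in how many examples feed each update. A sketch of the mini-batch variant (NumPy; the batch size and epoch count are illustrative, and setting the batch size to $m$ recovers batch gradient descent while a batch size of 1 recovers SGD):

```python
import numpy as np

def minibatch_gd(x, y, alpha, batch_size, epochs, seed=0):
    rng = np.random.default_rng(seed)
    theta0, theta1 = 0.0, 0.0
    m = len(x)
    for _ in range(epochs):
        order = rng.permutation(m)                   # reshuffle each epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]    # one mini-batch of indices
            res = (theta0 + theta1 * x[idx]) - y[idx]
            theta0, theta1 = (theta0 - alpha * res.mean(),
                              theta1 - alpha * (res * x[idx]).mean())
    return theta0, theta1
```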

Convergence

Gradient descent is said to have converged when the cost $J$ stops decreasing meaningfully between iterations, or when the gradient magnitude falls below a small threshold. For linear regression with MSE, the cost surface is convex, so gradient descent with a suitable learning rate always finds the global minimum; it cannot get stuck in a local minimum.
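The first stopping rule can be written as a small loop; a sketch (NumPy, illustrative names) that stops once the per-iteration drop in $J$ falls below a tolerance:

```python
import numpy as np

def gd_until_converged(x, y, alpha, tol=1e-9, max_iters=100000):
    theta0, theta1 = 0.0, 0.0
    prev_cost = np.inf
    for i in range(max_iters):
        res = (theta0 + theta1 * x) - y
        cost = (res ** 2).mean() / 2        # 1/(2m) convention, as in the gradients
        if prev_cost - cost < tol:          # J stopped decreasing meaningfully
            return theta0, theta1, i
        prev_cost = cost
        theta0, theta1 = (theta0 - alpha * res.mean(),
                          theta1 - alpha * (res * x).mean())
    return theta0, theta1, max_iters
```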