Gradient descent


The core optimization algorithm

Training a machine learning model means finding parameter values that minimize a loss function. For most models — logistic regression, neural networks, SVMs — there is no algebraic formula for the optimal parameters. Instead, we use gradient descent: an iterative algorithm that starts with some initial parameter values and repeatedly moves them in the direction that reduces the loss.

The gradient

The gradient $\nabla_\theta \mathcal{L}$ of the loss function $\mathcal{L}$ with respect to the parameters $\theta$ is a vector that points in the direction of steepest increase in the loss. It has one component per parameter, each telling you how much the loss changes if you increase that parameter slightly.

If the gradient points uphill, moving in the opposite direction — the direction of steepest descent — reduces the loss.
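A gradient can be approximated numerically by nudging each parameter and measuring the change in loss. The sketch below uses central differences; the function name and the quadratic loss are illustrative, not from the lesson:

```python
import numpy as np

def numerical_gradient(loss, theta, eps=1e-6):
    """Estimate the gradient of `loss` at `theta` by central differences:
    one component per parameter, as described above."""
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        bump = np.zeros_like(theta)
        bump[i] = eps
        grad[i] = (loss(theta + bump) - loss(theta - bump)) / (2 * eps)
    return grad

# Example loss: L(theta) = theta_0^2 + 3*theta_1^2,
# whose exact gradient is (2*theta_0, 6*theta_1).
loss = lambda t: t[0] ** 2 + 3 * t[1] ** 2
print(numerical_gradient(loss, np.array([1.0, 2.0])))  # close to [2., 12.]
```

This technique is too slow to train real models (it needs two loss evaluations per parameter) but is a standard sanity check for analytically derived gradients.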

The update rule

At each step, all parameters are updated simultaneously:

$$\theta \leftarrow \theta - \alpha \nabla_\theta \mathcal{L}(\theta)$$

where $\alpha$ is the learning rate — a small positive number controlling the step size. This is repeated until the loss stops decreasing meaningfully.

The update says: move each parameter in the direction that reduces the loss, scaled by the learning rate.
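The update rule is a few lines of code. This is a minimal sketch assuming a gradient function is supplied; the names `gradient_descent` and `grad` are illustrative:

```python
import numpy as np

def gradient_descent(grad, theta0, alpha=0.1, steps=100):
    """Repeatedly apply theta <- theta - alpha * grad(theta)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(steps):
        theta = theta - alpha * grad(theta)
    return theta

# Minimize L(theta) = (theta - 3)^2; its gradient is 2*(theta - 3),
# so the minimum is at theta = 3.
theta_star = gradient_descent(lambda t: 2 * (t - 3), theta0=[0.0])
print(theta_star)  # approaches 3.0
```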

A geometric picture

Imagine the loss surface as a hilly landscape. The current parameters are your position. The gradient points uphill. You step downhill, landing at a new position with lower loss. You compute the gradient again from there, step downhill again, and repeat. Over many steps, you descend toward a valley — a local or global minimum.

Batch gradient descent

In standard (batch) gradient descent, the gradient is computed using the full training dataset at every step:

$$\nabla_\theta \mathcal{L} = \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta \mathcal{L}^{(i)}(\theta)$$

This gives the exact gradient of the loss, so each step is as informative as possible. The downside: for large datasets, computing the gradient over all $m$ examples before taking a single step is very slow. With 1 million examples, each step requires passing through all 1 million before any parameter update occurs.
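As a sketch, here is batch gradient descent on a tiny linear-regression problem, where each step averages the gradient over every example, matching the formula above (the data and helper name are invented for illustration):

```python
import numpy as np

def batch_gradient(X, y, theta):
    """Gradient of the half-MSE loss, averaged over all m examples."""
    m = len(y)
    residual = X @ theta - y          # shape (m,): error on each example
    return (X.T @ residual) / m       # (1/m) * sum of per-example gradients

# Tiny dataset following y = 2*x, so the optimum is theta = [0, 2].
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # bias column + feature
y = np.array([2.0, 4.0, 6.0])

theta = np.zeros(2)
for _ in range(5000):                 # every step touches all m examples
    theta -= 0.1 * batch_gradient(X, y, theta)
print(theta)  # approaches [0., 2.]
```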

The gradient as a first-order approximation

Gradient descent uses only the first derivative — it approximates the loss surface locally as a plane. This is why the learning rate must be small: a large step in the direction of the gradient might overshoot the minimum because the plane is only a local approximation.
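The overshoot can be seen in one dimension. The sketch below minimizes $L(\theta) = \theta^2$ (an illustrative loss, not from the lesson): a small step shrinks $\theta$ toward the minimum, while a step that is too large lands farther away each iteration.

```python
# L(theta) = theta^2 has gradient 2*theta and its minimum at 0.
# Each update is theta <- theta - alpha * 2 * theta = (1 - 2*alpha) * theta,
# so the iterates converge only when |1 - 2*alpha| < 1, i.e. alpha < 1 here.
def run(alpha, steps=20, theta=1.0):
    for _ in range(steps):
        theta = theta - alpha * 2 * theta
    return theta

print(run(0.1))   # shrinks toward 0
print(run(1.1))   # |theta| grows: each step overshoots, and the loss diverges
```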

Second-order methods (Newton's method, L-BFGS) use the curvature of the surface (second derivatives) to take better-informed steps. They converge faster per step but are expensive to compute for large parameter spaces.
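A minimal sketch of a Newton step in one dimension, where the curvature (second derivative) sets the step size automatically; the loss $L(\theta) = \theta^2 + e^\theta$ is illustrative:

```python
import math

# Newton's method divides the gradient by the curvature, so flat regions
# get large steps and sharply curved regions get small ones.
def newton_step(theta):
    grad = 2 * theta + math.exp(theta)   # first derivative of theta^2 + e^theta
    hess = 2 + math.exp(theta)           # second derivative (curvature)
    return theta - grad / hess

theta = 0.0
for _ in range(6):
    theta = newton_step(theta)
print(theta)  # the minimizer, near -0.3517
```

In many dimensions the curvature is a full Hessian matrix, and forming or inverting it costs far more than a gradient, which is why first-order methods dominate for large models.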

When gradient descent finds the global minimum

For convex loss functions — where the loss surface is bowl-shaped with a single minimum — gradient descent with a small enough learning rate always converges to the global minimum. Logistic regression and linear regression (with MSE) both have convex loss surfaces.

For non-convex loss functions — such as those arising in neural networks — gradient descent finds a local minimum or saddle point, not necessarily the global minimum. This is explored in the Convex vs Non-Convex lesson.
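A one-dimensional non-convex sketch makes this concrete: the illustrative loss $L(\theta) = \theta^4 - 2\theta^2$ has two minima, and gradient descent settles into whichever basin the starting point lies in.

```python
# L(theta) = theta^4 - 2*theta^2 has minima at theta = -1 and theta = +1,
# separated by a local maximum at theta = 0. The gradient is 4*theta^3 - 4*theta.
def descend(theta, alpha=0.01, steps=2000):
    for _ in range(steps):
        theta = theta - alpha * (4 * theta ** 3 - 4 * theta)
    return theta

print(descend(0.5))   # settles near +1
print(descend(-0.5))  # settles near -1: a different minimum, same algorithm
```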