Gradient Descent for Logistic Regression

No closed-form solution

Unlike linear regression, logistic regression has no closed-form solution. The log-loss cost function combined with the sigmoid non-linearity means there is no algebraic formula that gives the optimal parameters in one step.

Instead, the parameters must be found iteratively using gradient descent (or a more advanced optimizer such as L-BFGS or Newton's method, which most libraries use under the hood).

The update rule

The gradient descent update rule for logistic regression is:

$$\theta_j \leftarrow \theta_j - \alpha \frac{\partial J}{\partial \theta_j}$$

From the previous lesson, the gradient is:

$$\frac{\partial J}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left(\hat{p}^{(i)} - y^{(i)}\right) x_j^{(i)}$$

Substituting in full, with $\hat{p}^{(i)} = \sigma(\boldsymbol{\theta}^T \mathbf{x}^{(i)})$:

$$\theta_j \leftarrow \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} \left(\sigma(\boldsymbol{\theta}^T \mathbf{x}^{(i)}) - y^{(i)}\right) x_j^{(i)}$$

All parameters are updated simultaneously at each iteration.
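The simultaneous update can be sketched as a single vectorized step in NumPy (the function and variable names here are illustrative, not from any library):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_step(theta, X, y, alpha):
    """One simultaneous gradient-descent update of all parameters.

    X: (m, n) design matrix whose first column is all ones (the bias term),
    y: (m,) labels in {0, 1}, theta: (n,) parameters, alpha: learning rate.
    """
    m = X.shape[0]
    p_hat = sigmoid(X @ theta)             # predictions for all m examples
    gradient = X.T @ (p_hat - y) / m       # (1/m) * sum of residual * feature
    return theta - alpha * gradient        # every theta_j updated at once
```

Because the whole gradient vector is computed before `theta` changes, no parameter sees a partially updated version of the others.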

Comparison with linear regression

The update rule is structurally identical to that for linear regression — only the prediction function differs:

| | Linear regression | Logistic regression |
|---|---|---|
| Prediction | $\hat{y} = \boldsymbol{\theta}^T \mathbf{x}$ | $\hat{p} = \sigma(\boldsymbol{\theta}^T \mathbf{x})$ |
| Cost | MSE | Log-loss |
| Gradient | $\frac{1}{m}\sum(\hat{y}^{(i)} - y^{(i)})\, x_j^{(i)}$ | $\frac{1}{m}\sum(\hat{p}^{(i)} - y^{(i)})\, x_j^{(i)}$ |
| Update rule | Same form | Same form |
| Closed-form | Yes (Normal Equation) | No |
| Cost surface | Convex (MSE) | Convex (log-loss) |
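The structural identity can be made literal in code: one gradient function, parameterized only by the prediction function (a NumPy sketch under the table's notation, not library code):

```python
import numpy as np

def gradient(predict, theta, X, y):
    """Shared gradient form: (1/m) * X^T (prediction - y)."""
    return X.T @ (predict(X @ theta) - y) / len(y)

linear = lambda z: z                          # linear regression: y_hat = theta^T x
logistic = lambda z: 1.0 / (1.0 + np.exp(-z)) # logistic regression: p_hat = sigma(theta^T x)
```

Swapping `linear` for `logistic` changes only the predictions fed into the residual; the gradient expression itself is untouched.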

A step-by-step trace

Suppose $m = 2$ examples, one feature, and initial parameters $\theta_0 = 0$, $\theta_1 = 0$, $\alpha = 0.5$:

| $i$ | $x^{(i)}$ | $y^{(i)}$ |
|---|---|---|
| 1 | 2 | 1 |
| 2 | -1 | 0 |

Iteration 1:

1. Compute scores: $z^{(1)} = 0 + 0 \cdot 2 = 0$, $z^{(2)} = 0$.
2. Compute predictions: $\hat{p}^{(1)} = \sigma(0) = 0.5$, $\hat{p}^{(2)} = 0.5$.
3. Residuals: $\hat{p}^{(1)} - y^{(1)} = -0.5$, $\hat{p}^{(2)} - y^{(2)} = 0.5$.
4. Gradients:
   - $\frac{\partial J}{\partial \theta_0} = \frac{1}{2}\left((-0.5)(1) + (0.5)(1)\right) = 0$
   - $\frac{\partial J}{\partial \theta_1} = \frac{1}{2}\left((-0.5)(2) + (0.5)(-1)\right) = \frac{1}{2}(-1 - 0.5) = -0.75$
5. Updates: $\theta_0 \leftarrow 0$, $\theta_1 \leftarrow 0 - 0.5 \times (-0.75) = 0.375$.

After iteration 1 the model assigns a higher score to $x = 2$ (the positive example) and a lower score to $x = -1$ (the negative example): it is learning the right direction.
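The trace above can be reproduced in a few lines of NumPy (variable names are illustrative):

```python
import numpy as np

# Iteration 1 of the trace: m = 2 examples, one feature plus a bias column.
X = np.array([[1.0,  2.0],    # example 1: bias term, x = 2
              [1.0, -1.0]])   # example 2: bias term, x = -1
y = np.array([1.0, 0.0])
theta = np.array([0.0, 0.0])
alpha = 0.5

p_hat = 1.0 / (1.0 + np.exp(-(X @ theta)))   # step 2: sigma(0) = 0.5 for both
gradient = X.T @ (p_hat - y) / len(y)        # step 4: [0, -0.75]
theta = theta - alpha * gradient             # step 5: [0, 0.375]
print(theta)                                 # [0.    0.375]
```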

Convergence

Because log-loss is convex, gradient descent converges to the global minimum for any sufficiently small learning rate. Convergence is detected by monitoring the cost $J$ at each iteration; training stops when $J$ changes by less than a small threshold $\epsilon$ (e.g. $10^{-4}$) between iterations.
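This stopping rule can be sketched as a full training loop (a minimal NumPy implementation; `fit`, `tol`, and `max_iter` are illustrative names, not a library API):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(theta, X, y, eps=1e-12):
    p = np.clip(sigmoid(X @ theta), eps, 1 - eps)   # clip to avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def fit(X, y, alpha=0.5, tol=1e-4, max_iter=10_000):
    """Gradient descent until J changes by less than tol between iterations."""
    theta = np.zeros(X.shape[1])
    prev_cost = log_loss(theta, X, y)
    for _ in range(max_iter):
        p_hat = sigmoid(X @ theta)
        theta = theta - alpha * (X.T @ (p_hat - y)) / len(y)
        cost = log_loss(theta, X, y)
        if abs(prev_cost - cost) < tol:   # J barely moved: declare convergence
            break
        prev_cost = cost
    return theta
```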

Practical optimizers

Most production implementations of logistic regression (including scikit-learn's LogisticRegression) do not use vanilla gradient descent. They use more efficient second-order or quasi-Newton methods:

  • Newton's method: uses second derivatives (the Hessian) to take larger, more accurate steps. Converges faster but expensive per step.
  • L-BFGS: approximates the Hessian efficiently. The default solver in scikit-learn for small-to-medium datasets.
  • Stochastic / mini-batch gradient descent (SGD): preferred for very large datasets. In scikit-learn, solver='saga' uses the closely related SAGA stochastic-gradient method, and SGDClassifier with log-loss implements plain SGD.
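Switching between these optimizers in scikit-learn is a one-argument change (the toy data below is made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative data: positive x -> class 1, negative x -> class 0.
X = np.array([[2.0], [-1.0], [3.0], [-2.0]])
y = np.array([1, 0, 1, 0])

# lbfgs is scikit-learn's default quasi-Newton solver;
# saga targets very large datasets (extra iterations help it converge here).
clf_lbfgs = LogisticRegression(solver="lbfgs").fit(X, y)
clf_saga = LogisticRegression(solver="saga", max_iter=5000).fit(X, y)

print(clf_lbfgs.predict([[2.5], [-1.5]]))   # [1 0]
```

Both solvers minimize the same convex log-loss, so on a problem this small they reach essentially the same parameters.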

Understanding gradient descent gives you the conceptual foundation for all of these methods.

Feature scaling

Just as with linear regression, feature scaling is important for gradient-based optimization of logistic regression. Standardize features to mean 0 and standard deviation 1 before training to ensure gradients across all parameters are on a similar scale and convergence is efficient.
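In scikit-learn, standardization is typically attached to the model via a pipeline so the same scaling is applied at train and predict time (the toy data here is invented to show two very different feature scales):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Illustrative data: one feature in the thousands, one below 1.
X = np.array([[1000.0, 0.1],
              [2000.0, 0.2],
              [1500.0, 0.9],
              [3000.0, 0.8]])
y = np.array([0, 0, 1, 1])

# StandardScaler rescales each feature to mean 0 / std 1 before the model sees it.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X, y)
print(model.predict(X))
```

Without the scaler, the gradient with respect to the large-scale feature would dwarf the other, forcing a tiny learning rate and slow convergence.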