# Gradient Descent for Logistic Regression
## No closed-form solution
Unlike linear regression, logistic regression has no closed-form solution. The log-loss cost function combined with the sigmoid non-linearity means there is no algebraic formula that gives the optimal parameters in one step.
Instead, the parameters must be found iteratively using gradient descent (or a more advanced optimizer such as L-BFGS or Newton's method, which most libraries use under the hood).
## The update rule
The gradient descent update rule for logistic regression is:

$$\theta_j := \theta_j - \alpha \frac{\partial J}{\partial \theta_j}$$

From the previous lesson, the gradient is:

$$\frac{\partial J}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

Substituting in full, with $h_\theta(x) = \sigma(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$:

$$\theta_j := \theta_j - \alpha \, \frac{1}{m} \sum_{i=1}^{m} \left( \sigma(\theta^T x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$
All parameters are updated simultaneously at each iteration.
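As a concrete sketch (the function and the toy dataset are illustrative, not from the lesson), the simultaneous update can be vectorized with NumPy:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, n_iters=1000):
    """Batch gradient descent for logistic regression.
    X: (m, n) feature matrix; y: (m,) labels in {0, 1}."""
    m, n = X.shape
    Xb = np.hstack([np.ones((m, 1)), X])   # prepend x_0 = 1 for the intercept
    theta = np.zeros(n + 1)
    for _ in range(n_iters):
        h = sigmoid(Xb @ theta)            # predictions for all m examples
        grad = (Xb.T @ (h - y)) / m        # (1/m) sum (h - y) x_j, all j at once
        theta -= alpha * grad              # simultaneous update of every theta_j
    return theta

# toy data: positive class where x > 0
X = np.array([[2.0], [1.0], [-1.0], [-2.0]])
y = np.array([1.0, 1.0, 0.0, 0.0])
theta = gradient_descent(X, y)
```

Because the whole gradient vector is computed before `theta` changes, every parameter is updated from the same old values, exactly as the simultaneous-update rule requires.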
## Comparison with linear regression
The update rule is structurally identical to that for linear regression — only the prediction function differs:
| | Linear regression | Logistic regression |
|---|---|---|
| Prediction | $h_\theta(x) = \theta^T x$ | $h_\theta(x) = \sigma(\theta^T x)$ |
| Cost | MSE | Log-loss |
| Gradient | $\frac{1}{m}\sum_i \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$ | $\frac{1}{m}\sum_i \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$ |
| Update rule | Same form | Same form |
| Closed-form solution | Yes (Normal Equation) | No |
| Cost surface | Convex (MSE) | Convex (log-loss) |
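One way to see the structural identity: a single gradient-descent step can share the same code for both models, with only the prediction function swapped. A sketch (`gd_step` and the data are illustrative):

```python
import numpy as np

def gd_step(theta, X, y, predict, alpha=0.1):
    """One step of theta := theta - alpha * (1/m) X^T (h - y).
    Identical for linear and logistic regression."""
    m = X.shape[0]
    return theta - alpha * (X.T @ (predict(X @ theta) - y)) / m

linear_predict = lambda z: z                       # h(x) = theta^T x
logistic_predict = lambda z: 1 / (1 + np.exp(-z))  # h(x) = sigma(theta^T x)

X = np.array([[1.0, 2.0], [1.0, -1.0]])  # bias column plus one feature
y = np.array([1.0, 0.0])
theta_lin = gd_step(np.zeros(2), X, y, linear_predict)
theta_log = gd_step(np.zeros(2), X, y, logistic_predict)
```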
## A step-by-step trace
Suppose $m = 2$ examples, one feature, and initial parameters $\theta_0 = 0$ and $\theta_1 = 0$, with learning rate $\alpha = 0.1$:

| Example $i$ | $x^{(i)}$ | $y^{(i)}$ |
|---|---|---|
| 1 | 2 | 1 |
| 2 | -1 | 0 |
Iteration 1:
- Compute scores: $z^{(1)} = \theta_0 + \theta_1 x^{(1)} = 0$, $z^{(2)} = 0$.
- Compute predictions: $\hat{y}^{(1)} = \sigma(0) = 0.5$, $\hat{y}^{(2)} = \sigma(0) = 0.5$.
- Residuals: $\hat{y}^{(1)} - y^{(1)} = -0.5$, $\hat{y}^{(2)} - y^{(2)} = 0.5$.
- Gradients: $\frac{\partial J}{\partial \theta_0} = \frac{1}{2}(-0.5 + 0.5) = 0$, $\frac{\partial J}{\partial \theta_1} = \frac{1}{2}\left((-0.5)(2) + (0.5)(-1)\right) = -0.75$.
- Updates (with $\alpha = 0.1$): $\theta_0 := 0 - 0.1 \cdot 0 = 0$, $\theta_1 := 0 - 0.1 \cdot (-0.75) = 0.075$.
After iteration 1 the model assigns a higher score to $x = 2$ (the positive example) and a lower score to $x = -1$ (the negative example): it is learning in the right direction.
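The trace can be reproduced numerically. This sketch assumes zero-initialized parameters and a learning rate of 0.1:

```python
import numpy as np

sigmoid = lambda z: 1 / (1 + np.exp(-z))

x = np.array([2.0, -1.0])          # the two training examples
y = np.array([1.0, 0.0])
theta0, theta1, alpha = 0.0, 0.0, 0.1

z = theta0 + theta1 * x            # scores: [0, 0]
y_hat = sigmoid(z)                 # predictions: [0.5, 0.5]
resid = y_hat - y                  # residuals: [-0.5, 0.5]
grad0 = resid.mean()               # 0.0
grad1 = (resid * x).mean()         # -0.75
theta0 -= alpha * grad0            # stays 0.0
theta1 -= alpha * grad1            # 0.075
```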
## Convergence
Because log-loss is convex, gradient descent converges to the global minimum for any sufficiently small learning rate. Convergence is detected by monitoring the cost at each iteration; training stops when $J$ changes by less than a small tolerance (e.g. $10^{-6}$) between iterations.
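A sketch of this stopping criterion (the function names, tolerance value, and toy data are illustrative):

```python
import numpy as np

def log_loss(theta, X, y):
    p = 1 / (1 + np.exp(-(X @ theta)))
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def fit(X, y, alpha=0.1, tol=1e-6, max_iters=100_000):
    m = X.shape[0]
    theta = np.zeros(X.shape[1])
    prev = log_loss(theta, X, y)
    for _ in range(max_iters):
        p = 1 / (1 + np.exp(-(X @ theta)))
        theta -= alpha * (X.T @ (p - y)) / m   # gradient step
        cost = log_loss(theta, X, y)
        if abs(prev - cost) < tol:             # cost barely moved: stop
            break
        prev = cost
    return theta

# bias column plus one feature; labels are noisy so the optimum is finite
Xb = np.hstack([np.ones((4, 1)), np.array([[2.0], [1.0], [-1.0], [-2.0]])])
y = np.array([1.0, 1.0, 0.0, 1.0])
theta = fit(Xb, y)
```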
## Practical optimizers
Most production implementations of logistic regression (including scikit-learn's `LogisticRegression`) do not use vanilla gradient descent. They use more efficient second-order or quasi-Newton methods:
- Newton's method: uses second derivatives (the Hessian) to take larger, more accurate steps. Converges faster but expensive per step.
- L-BFGS: approximates the Hessian efficiently. The default solver in scikit-learn for small-to-medium datasets.
- Stochastic / mini-batch gradient descent (SGD): preferred for very large datasets (`solver='saga'` in scikit-learn).
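For reference, these solvers are selected through the `solver` parameter (assuming scikit-learn is installed; the toy data is illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[2.0], [1.0], [-1.0], [-2.0]])
y = np.array([1, 1, 0, 0])

# lbfgs is the default; newton-cg is a Newton-type method;
# saga is a stochastic method suited to very large datasets
for solver in ("lbfgs", "newton-cg", "saga"):
    clf = LogisticRegression(solver=solver, max_iter=1000).fit(X, y)
    print(solver, clf.predict(X))
```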
Understanding gradient descent gives you the conceptual foundation for all of these methods.
## Feature scaling
Just as with linear regression, feature scaling is important for gradient-based optimization of logistic regression. Standardize features to mean 0 and standard deviation 1 before training to ensure gradients across all parameters are on a similar scale and convergence is efficient.
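A minimal standardization helper (illustrative; scikit-learn's `StandardScaler` does the same job):

```python
import numpy as np

def standardize(X):
    """Z-score each column: mean 0, standard deviation 1."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma

# one small-scale feature and one large-scale feature
X = np.array([[1.0, 2000.0], [2.0, 3000.0], [3.0, 1000.0]])
X_scaled, mu, sigma = standardize(X)
# at prediction time, reuse the SAME mu and sigma on new inputs
```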