Gradient Descent for Logistic Regression

No closed-form solution

Unlike linear regression, logistic regression has no closed-form solution. The log-loss cost function combined with the sigmoid non-linearity means there is no algebraic formula that gives the optimal parameters in one step.

Instead, the parameters must be found iteratively using gradient descent (or a more advanced optimizer such as L-BFGS or Newton's method, which most libraries use under the hood).

The update rule

The gradient descent update rule for logistic regression is:

$$\theta_j \leftarrow \theta_j - \alpha \frac{\partial J}{\partial \theta_j}$$

From the previous lesson, the gradient is:

$$\frac{\partial J}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left(\hat{p}^{(i)} - y^{(i)}\right) x_j^{(i)}$$

Substituting in full, with $\hat{p}^{(i)} = \sigma(\boldsymbol{\theta}^T \mathbf{x}^{(i)})$:

$$\theta_j \leftarrow \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} \left(\sigma(\boldsymbol{\theta}^T \mathbf{x}^{(i)}) - y^{(i)}\right) x_j^{(i)}$$

All parameters are updated simultaneously at each iteration.
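The simultaneous update can be sketched as a single vectorized step in NumPy (the function and variable names here are illustrative, not from any library):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_step(theta, X, y, alpha):
    """One simultaneous gradient-descent update of all parameters.

    X: (m, n) design matrix whose first column is all ones (the bias term),
    y: (m,) labels in {0, 1}, theta: (n,) parameters, alpha: learning rate.
    """
    m = X.shape[0]
    p_hat = sigmoid(X @ theta)             # predictions for all m examples
    gradient = X.T @ (p_hat - y) / m       # (1/m) * sum of residual * feature
    return theta - alpha * gradient        # every theta_j updated at once
```

Because the whole gradient vector is computed before `theta` changes, no parameter sees a partially updated version of the others.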

Comparison with linear regression

The update rule is structurally identical to that for linear regression — only the prediction function differs:

| | Linear regression | Logistic regression |
|---|---|---|
| Prediction | $\hat{y} = \boldsymbol{\theta}^T \mathbf{x}$ | $\hat{p} = \sigma(\boldsymbol{\theta}^T \mathbf{x})$ |
| Cost | MSE | Log-loss |
| Gradient | $\frac{1}{m}\sum(\hat{y}^{(i)} - y^{(i)})\, x_j^{(i)}$ | $\frac{1}{m}\sum(\hat{p}^{(i)} - y^{(i)})\, x_j^{(i)}$ |
| Update rule | Same form | Same form |
| Closed-form | Yes (Normal Equation) | No |
| Cost surface | Convex (MSE) | Convex (log-loss) |
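The structural identity can be made literal in code: one gradient function, parameterized only by the prediction function (a NumPy sketch under the table's notation, not library code):

```python
import numpy as np

def gradient(predict, theta, X, y):
    """Shared gradient form: (1/m) * X^T (prediction - y)."""
    return X.T @ (predict(X @ theta) - y) / len(y)

linear = lambda z: z                          # linear regression: y_hat = theta^T x
logistic = lambda z: 1.0 / (1.0 + np.exp(-z)) # logistic regression: p_hat = sigma(theta^T x)
```

Swapping `linear` for `logistic` changes only the predictions fed into the residual; the gradient expression itself is untouched.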

A step-by-step trace

Suppose $m = 2$ examples, one feature, and initial parameters $\theta_0 = 0$, $\theta_1 = 0$, $\alpha = 0.5$:

| $i$ | $x^{(i)}$ | $y^{(i)}$ |
|---|---|---|
| 1 | 2 | 1 |
| 2 | -1 | 0 |

Iteration 1:

1. Compute scores: $z^{(1)} = 0 + 0 \cdot 2 = 0$, $z^{(2)} = 0$.
2. Compute predictions: $\hat{p}^{(1)} = \sigma(0) = 0.5$, $\hat{p}^{(2)} = 0.5$.
3. Residuals: $\hat{p}^{(1)} - y^{(1)} = -0.5$, $\hat{p}^{(2)} - y^{(2)} = 0.5$.
4. Gradients:
   - $\frac{\partial J}{\partial \theta_0} = \frac{1}{2}\left((-0.5)(1) + (0.5)(1)\right) = 0$
   - $\frac{\partial J}{\partial \theta_1} = \frac{1}{2}\left((-0.5)(2) + (0.5)(-1)\right) = \frac{1}{2}(-1 - 0.5) = -0.75$
5. Updates: $\theta_0 \leftarrow 0$, $\theta_1 \leftarrow 0 - 0.5 \times (-0.75) = 0.375$.

After iteration 1 the model assigns a higher score to $x = 2$ (the positive example) and a lower score to $x = -1$ (the negative example): it is learning the right direction.
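The trace above can be reproduced in a few lines of NumPy (variable names are illustrative):

```python
import numpy as np

# Iteration 1 of the trace: m = 2 examples, one feature plus a bias column.
X = np.array([[1.0,  2.0],    # example 1: bias term, x = 2
              [1.0, -1.0]])   # example 2: bias term, x = -1
y = np.array([1.0, 0.0])
theta = np.array([0.0, 0.0])
alpha = 0.5

p_hat = 1.0 / (1.0 + np.exp(-(X @ theta)))   # step 2: sigma(0) = 0.5 for both
gradient = X.T @ (p_hat - y) / len(y)        # step 4: [0, -0.75]
theta = theta - alpha * gradient             # step 5: [0, 0.375]
print(theta)                                 # [0.    0.375]
```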

Convergence

Because log-loss is convex, gradient descent converges to the global minimum for any sufficiently small learning rate. Convergence is detected by monitoring the cost $J$ at each iteration; training stops when $J$ changes by less than a small threshold $\epsilon$ (e.g. $10^{-4}$) between iterations.
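This stopping rule can be sketched as a full training loop (a minimal NumPy implementation; `fit`, `tol`, and `max_iter` are illustrative names, not a library API):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(theta, X, y, eps=1e-12):
    p = np.clip(sigmoid(X @ theta), eps, 1 - eps)   # clip to avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def fit(X, y, alpha=0.5, tol=1e-4, max_iter=10_000):
    """Gradient descent until J changes by less than tol between iterations."""
    theta = np.zeros(X.shape[1])
    prev_cost = log_loss(theta, X, y)
    for _ in range(max_iter):
        p_hat = sigmoid(X @ theta)
        theta = theta - alpha * (X.T @ (p_hat - y)) / len(y)
        cost = log_loss(theta, X, y)
        if abs(prev_cost - cost) < tol:   # J barely moved: declare convergence
            break
        prev_cost = cost
    return theta
```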

Practical optimizers

Most production implementations of logistic regression (including scikit-learn's LogisticRegression) do not use vanilla gradient descent. They use more efficient second-order or quasi-Newton methods:

  • Newton's method: uses second derivatives (the Hessian) to take larger, more accurate steps. Converges faster but expensive per step.
  • L-BFGS: approximates the Hessian efficiently. The default solver in scikit-learn for small-to-medium datasets.
  • Stochastic / mini-batch gradient descent (SGD): preferred for very large datasets. In scikit-learn, solver='saga' uses the closely related SAGA stochastic-gradient method, and SGDClassifier with log-loss implements plain SGD.
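Switching between these optimizers in scikit-learn is a one-argument change (the toy data below is made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative data: positive x -> class 1, negative x -> class 0.
X = np.array([[2.0], [-1.0], [3.0], [-2.0]])
y = np.array([1, 0, 1, 0])

# lbfgs is scikit-learn's default quasi-Newton solver;
# saga targets very large datasets (extra iterations help it converge here).
clf_lbfgs = LogisticRegression(solver="lbfgs").fit(X, y)
clf_saga = LogisticRegression(solver="saga", max_iter=5000).fit(X, y)

print(clf_lbfgs.predict([[2.5], [-1.5]]))   # [1 0]
```

Both solvers minimize the same convex log-loss, so on a problem this small they reach essentially the same parameters.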

Understanding gradient descent gives you the conceptual foundation for all of these methods.

Feature scaling

Just as with linear regression, feature scaling is important for gradient-based optimization of logistic regression. Standardize features to mean 0 and standard deviation 1 before training to ensure gradients across all parameters are on a similar scale and convergence is efficient.
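In scikit-learn, standardization is typically attached to the model via a pipeline so the same scaling is applied at train and predict time (the toy data here is invented to show two very different feature scales):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Illustrative data: one feature in the thousands, one below 1.
X = np.array([[1000.0, 0.1],
              [2000.0, 0.2],
              [1500.0, 0.9],
              [3000.0, 0.8]])
y = np.array([0, 0, 1, 1])

# StandardScaler rescales each feature to mean 0 / std 1 before the model sees it.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X, y)
print(model.predict(X))
```

Without the scaler, the gradient with respect to the large-scale feature would dwarf the other, forcing a tiny learning rate and slow convergence.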