Regularization for Logistic Regression

Overfitting in logistic regression

Logistic regression can overfit when:

  • There are many features relative to the number of training examples.
  • Features are highly correlated (multicollinearity).
  • The classes are linearly separable (the complete separation problem — covered in the Assumptions lesson), causing coefficients to grow without bound.

The remedy is the same as in linear regression: add a penalty on the magnitude of the coefficients to the cost function.

L2 regularization (Ridge)

L2 regularization adds the sum of squared coefficients to the log-loss:

$$J_{L2}(\boldsymbol{\theta}) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\hat{p}^{(i)} + (1-y^{(i)})\log(1-\hat{p}^{(i)})\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$

As in linear regression, $\theta_0$ (the intercept) is not penalized. L2 shrinks all coefficients toward zero but never exactly to zero.

L1 regularization (Lasso)

L1 regularization adds the sum of absolute values of coefficients:

$$J_{L1}(\boldsymbol{\theta}) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\hat{p}^{(i)} + (1-y^{(i)})\log(1-\hat{p}^{(i)})\right] + \frac{\lambda}{m}\sum_{j=1}^{n}|\theta_j|$$

L1 can drive some coefficients exactly to zero, performing automatic feature selection. This is especially useful in high-dimensional settings (e.g. text classification with thousands of word features).
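A minimal sketch of this sparsity effect, using synthetic data (the dataset shape, random seed, and C value here are illustrative assumptions, not from the text):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: only the first 5 of 50 features carry signal (illustrative setup).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
y = (X[:, :5].sum(axis=1) > 0).astype(int)

# The L1 penalty needs a compatible solver such as liblinear or saga.
lasso_lr = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
lasso_lr.fit(X, y)

# With L1, some coefficients land exactly at zero -- automatic feature selection.
n_zero = int(np.sum(lasso_lr.coef_ == 0))
print(f"{n_zero} of 50 coefficients are exactly zero")
```

Fitting the same data with `penalty="l2"` instead would typically leave all 50 coefficients nonzero, merely shrunken.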

The $C$ parameter convention

Most machine learning libraries (including scikit-learn) parameterize logistic regression regularization using $C$ rather than $\lambda$:

$$C = \frac{1}{\lambda}$$

$C$ is the inverse of regularization strength:

  • Large $C$ (small $\lambda$): weak regularization — the model fits the training data closely, risking overfitting.
  • Small $C$ (large $\lambda$): strong regularization — coefficients are heavily penalized, risking underfitting.

The default in scikit-learn is C=1.0. This is the convention to remember:

$$\uparrow C \Rightarrow \text{less regularization} \qquad \downarrow C \Rightarrow \text{more regularization}$$

This is the opposite of $\lambda$: larger $\lambda$ means more regularization, while larger $C$ means less.
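A quick check of this convention: fit the same data at a large and a small $C$ and compare coefficient magnitudes (the dataset and the two C values are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

norms = {}
for C in (100.0, 0.01):
    clf = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    norms[C] = np.linalg.norm(clf.coef_)
    print(f"C={C}: ||theta|| = {norms[C]:.3f}")

# Smaller C means stronger regularization, hence smaller coefficients.
```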

Choosing $C$ (or $\lambda$)

Use cross-validation over a logarithmic grid of values, for example:

$$C \in \{0.001,\ 0.01,\ 0.1,\ 1,\ 10,\ 100,\ 1000\}$$

Plot validation log-loss (or AUC) against $\log C$ and pick the value that maximizes validation performance. Scikit-learn's LogisticRegressionCV automates this.
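A sketch of the automated search with LogisticRegressionCV, using the logarithmic grid from the text (the synthetic dataset and the 5-fold split are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Cross-validate over the grid from the text, scoring by validation log-loss.
Cs = [0.001, 0.01, 0.1, 1, 10, 100, 1000]
clf = LogisticRegressionCV(Cs=Cs, cv=5, scoring="neg_log_loss", max_iter=1000)
clf.fit(X, y)

# clf.C_ holds the best C found for each class (one entry for binary problems).
print("best C:", clf.C_[0])
```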

L1 vs. L2 for logistic regression

|                              | L1 (Lasso)                    | L2 (Ridge)                         |
|------------------------------|-------------------------------|------------------------------------|
| Penalty                      | $\sum_j \lvert\theta_j\rvert$ | $\sum_j \theta_j^2$                |
| Feature selection            | Yes (exact zeros)             | No                                 |
| Coefficient behavior         | Sparse                        | Smooth shrinkage                   |
| Handles correlated features  | Picks one arbitrarily         | Distributes weight evenly          |
| Solver compatibility         | Requires liblinear or saga    | Works with most solvers            |
| Best for                     | High-dimensional sparse data  | Dense data with many small effects |

Elastic Net

As with linear regression, Elastic Net combines both penalties:

$$J_{\text{EN}} = J_{\text{log-loss}} + \frac{\lambda_1}{m}\sum_{j=1}^{n}|\theta_j| + \frac{\lambda_2}{2m}\sum_{j=1}^{n}\theta_j^2$$

In scikit-learn use penalty='elasticnet' with the l1_ratio parameter controlling the mix (l1_ratio=1 is pure L1, l1_ratio=0 is pure L2).
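A minimal Elastic Net fit (the dataset, l1_ratio=0.5, and C=1.0 are illustrative assumptions; note that scikit-learn's elastic-net penalty requires the saga solver):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# penalty='elasticnet' is only supported by the saga solver.
enet = LogisticRegression(penalty="elasticnet", solver="saga",
                          l1_ratio=0.5, C=1.0, max_iter=5000)
enet.fit(X, y)
print("training accuracy:", enet.score(X, y))
```

In practice both C and l1_ratio would be tuned jointly, e.g. with LogisticRegressionCV's l1_ratios parameter or a grid search.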

Effect of regularization on the decision boundary

Regularization keeps coefficients small, which softens the decision boundary — predictions near the boundary become less extreme (closer to 0.5). This is particularly important for the complete separation problem: without regularization, the model drives its coefficients to infinity to perfectly separate training data; with regularization, the coefficients remain finite and the model generalizes better.
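This effect can be seen on a tiny perfectly separable dataset (the data and the two C values are illustrative assumptions; C=1e6 approximates "no regularization"):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Perfectly separable 1-D data: any threshold between 0 and 1 splits the classes.
X = np.array([[-2.0], [-1.0], [0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 0, 1, 1, 1])

weak = LogisticRegression(C=1e6, max_iter=10000).fit(X, y)  # nearly unregularized
strong = LogisticRegression(C=1.0).fit(X, y)                # default regularization

# With almost no penalty the coefficient grows very large, pushing predicted
# probabilities toward 0 and 1; with regularization it stays finite and moderate.
print("weak regularization coef:  ", weak.coef_[0, 0])
print("strong regularization coef:", strong.coef_[0, 0])
```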