Regularization for Logistic Regression

Overfitting in logistic regression

Logistic regression can overfit when:

  • There are many features relative to the number of training examples.
  • Features are highly correlated (multicollinearity).
  • The classes are linearly separable (the complete separation problem — covered in the Assumptions lesson), causing coefficients to grow without bound.

The remedy is the same as in linear regression: add a penalty on the magnitude of the coefficients to the cost function.

L2 regularization (Ridge)

L2 regularization adds the sum of squared coefficients to the log-loss:

$$J_{L2}(\boldsymbol{\theta}) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\hat{p}^{(i)} + (1-y^{(i)})\log(1-\hat{p}^{(i)})\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$

As in linear regression, $\theta_0$ (the intercept) is not penalized. L2 shrinks all coefficients toward zero but never exactly to zero.

L1 regularization (Lasso)

L1 regularization adds the sum of absolute values of coefficients:

$$J_{L1}(\boldsymbol{\theta}) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\hat{p}^{(i)} + (1-y^{(i)})\log(1-\hat{p}^{(i)})\right] + \frac{\lambda}{m}\sum_{j=1}^{n}|\theta_j|$$

L1 can drive some coefficients exactly to zero, performing automatic feature selection. This is especially useful in high-dimensional settings (e.g. text classification with thousands of word features).
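A minimal sketch of this sparsity effect, using synthetic data (the dataset shape, random seed, and C value here are illustrative assumptions, not from the text):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: only the first 5 of 50 features carry signal (illustrative setup).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
y = (X[:, :5].sum(axis=1) > 0).astype(int)

# The L1 penalty needs a compatible solver such as liblinear or saga.
lasso_lr = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
lasso_lr.fit(X, y)

# With L1, some coefficients land exactly at zero -- automatic feature selection.
n_zero = int(np.sum(lasso_lr.coef_ == 0))
print(f"{n_zero} of 50 coefficients are exactly zero")
```

Fitting the same data with `penalty="l2"` instead would typically leave all 50 coefficients nonzero, merely shrunken.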

The $C$ parameter convention

Most machine learning libraries (including scikit-learn) parameterize logistic regression regularization using $C$ rather than $\lambda$:

$$C = \frac{1}{\lambda}$$

$C$ is the inverse of regularization strength:

  • Large $C$ (small $\lambda$): weak regularization — the model fits the training data closely, risking overfitting.
  • Small $C$ (large $\lambda$): strong regularization — coefficients are heavily penalized, risking underfitting.

The default in scikit-learn is C=1.0. This is the convention to remember:

$$\uparrow C \Rightarrow \text{less regularization} \qquad \downarrow C \Rightarrow \text{more regularization}$$

This is the opposite of $\lambda$: larger $\lambda$ means more regularization, while larger $C$ means less.
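A quick check of this convention: fit the same data at a large and a small $C$ and compare coefficient magnitudes (the dataset and the two C values are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

norms = {}
for C in (100.0, 0.01):
    clf = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    norms[C] = np.linalg.norm(clf.coef_)
    print(f"C={C}: ||theta|| = {norms[C]:.3f}")

# Smaller C means stronger regularization, hence smaller coefficients.
```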

Choosing $C$ (or $\lambda$)

Use cross-validation over a logarithmic grid of values, for example:

$$C \in \{0.001,\ 0.01,\ 0.1,\ 1,\ 10,\ 100,\ 1000\}$$

Plot validation log-loss (or AUC) against $\log C$ and pick the value that maximizes validation performance. Scikit-learn's LogisticRegressionCV automates this.
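A sketch of the automated search with LogisticRegressionCV, using the logarithmic grid from the text (the synthetic dataset and the 5-fold split are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Cross-validate over the grid from the text, scoring by validation log-loss.
Cs = [0.001, 0.01, 0.1, 1, 10, 100, 1000]
clf = LogisticRegressionCV(Cs=Cs, cv=5, scoring="neg_log_loss", max_iter=1000)
clf.fit(X, y)

# clf.C_ holds the best C found for each class (one entry for binary problems).
print("best C:", clf.C_[0])
```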

L1 vs. L2 for logistic regression

|                              | L1 (Lasso)                    | L2 (Ridge)                         |
|------------------------------|-------------------------------|------------------------------------|
| Penalty                      | $\sum_j \lvert\theta_j\rvert$ | $\sum_j \theta_j^2$                |
| Feature selection            | Yes (exact zeros)             | No                                 |
| Coefficient behavior         | Sparse                        | Smooth shrinkage                   |
| Handles correlated features  | Picks one arbitrarily         | Distributes weight evenly          |
| Solver compatibility         | Requires liblinear or saga    | Works with most solvers            |
| Best for                     | High-dimensional sparse data  | Dense data with many small effects |

Elastic Net

As with linear regression, Elastic Net combines both penalties:

$$J_{\text{EN}} = J_{\text{log-loss}} + \frac{\lambda_1}{m}\sum_{j=1}^{n}|\theta_j| + \frac{\lambda_2}{2m}\sum_{j=1}^{n}\theta_j^2$$

In scikit-learn use penalty='elasticnet' with the l1_ratio parameter controlling the mix (l1_ratio=1 is pure L1, l1_ratio=0 is pure L2).
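A minimal Elastic Net fit (the dataset, l1_ratio=0.5, and C=1.0 are illustrative assumptions; note that scikit-learn's elastic-net penalty requires the saga solver):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# penalty='elasticnet' is only supported by the saga solver.
enet = LogisticRegression(penalty="elasticnet", solver="saga",
                          l1_ratio=0.5, C=1.0, max_iter=5000)
enet.fit(X, y)
print("training accuracy:", enet.score(X, y))
```

In practice both C and l1_ratio would be tuned jointly, e.g. with LogisticRegressionCV's l1_ratios parameter or a grid search.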

Effect of regularization on the decision boundary

Regularization keeps coefficients small, which softens the decision boundary — predictions near the boundary become less extreme (closer to 0.5). This is particularly important for the complete separation problem: without regularization, the model drives its coefficients to infinity to perfectly separate training data; with regularization, the coefficients remain finite and the model generalizes better.
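This effect can be seen on a tiny perfectly separable dataset (the data and the two C values are illustrative assumptions; C=1e6 approximates "no regularization"):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Perfectly separable 1-D data: any threshold between 0 and 1 splits the classes.
X = np.array([[-2.0], [-1.0], [0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 0, 1, 1, 1])

weak = LogisticRegression(C=1e6, max_iter=10000).fit(X, y)  # nearly unregularized
strong = LogisticRegression(C=1.0).fit(X, y)                # default regularization

# With almost no penalty the coefficient grows very large, pushing predicted
# probabilities toward 0 and 1; with regularization it stays finite and moderate.
print("weak regularization coef:  ", weak.coef_[0, 0])
print("strong regularization coef:", strong.coef_[0, 0])
```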