Regularization for Logistic Regression
Overfitting in logistic regression
Logistic regression can overfit when:
- There are many features relative to the number of training examples.
- Features are highly correlated (multicollinearity).
- The classes are linearly separable (the complete separation problem — covered in the Assumptions lesson), causing coefficients to grow without bound.
The remedy is the same as in linear regression: add a penalty on the magnitude of the coefficients to the cost function.
L2 regularization (Ridge)
L2 regularization adds the sum of squared coefficients to the log-loss:

$$J(\beta) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right] + \lambda \sum_{j=1}^{p} \beta_j^2$$
As in linear regression, $\beta_0$ (the intercept) is not penalized. L2 shrinks all coefficients toward zero but never exactly to zero.
L1 regularization (Lasso)
L1 regularization adds the sum of absolute values of coefficients:

$$J(\beta) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right] + \lambda \sum_{j=1}^{p} \vert\beta_j\vert$$
L1 can drive some coefficients exactly to zero, performing automatic feature selection. This is especially useful in high-dimensional settings (e.g. text classification with thousands of word features).
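A minimal sketch of this sparsity effect, using a hypothetical toy dataset where only a few features carry signal (the dataset and the C value are illustrative assumptions, not from the lesson):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: only 5 of 50 features are informative
X, y = make_classification(n_samples=200, n_features=50, n_informative=5,
                           random_state=0)

# L1 needs a solver that supports it (liblinear or saga)
l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
l2 = LogisticRegression(penalty="l2", C=0.1).fit(X, y)

# L1 drives many coefficients exactly to zero; L2 only shrinks them
print("L1 exact zeros:", int(np.sum(l1.coef_ == 0)))
print("L2 exact zeros:", int(np.sum(l2.coef_ == 0)))
```

The surviving nonzero coefficients under L1 are the model's selected features.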
The C parameter convention
Most machine learning libraries (including scikit-learn) parameterize logistic regression regularization using $C$ rather than $\lambda$:

$$C = \frac{1}{\lambda}$$

$C$ is the inverse of regularization strength:
- Large $C$ (small $\lambda$): weak regularization — the model fits the training data closely, risking overfitting.
- Small $C$ (large $\lambda$): strong regularization — coefficients are heavily penalized, risking underfitting.
The default in scikit-learn is C=1.0. The convention to remember: $C$ works in the opposite direction from $\lambda$. Larger $\lambda$ means more regularization; larger $C$ means less.
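The convention is easy to verify empirically. A small sketch (toy data and C values are assumed for illustration) showing that coefficient magnitudes grow as C increases:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Larger C = weaker penalty = larger coefficients
norms = []
for C in [0.01, 1.0, 100.0]:
    model = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    norms.append(np.linalg.norm(model.coef_))
    print(f"C={C}: coefficient norm = {norms[-1]:.3f}")
```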
Choosing $C$ (or $\lambda$)
Use cross-validation over a logarithmic grid of $C$ values, for example:

$$C \in \{0.001,\ 0.01,\ 0.1,\ 1,\ 10,\ 100\}$$

Plot validation log-loss (or AUC) against $C$ and pick the value that maximizes validation performance. Scikit-learn's LogisticRegressionCV automates this.
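A sketch of that search with LogisticRegressionCV (the grid and dataset are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# 5-fold CV over a logarithmic grid of C values, scored by log-loss
Cs = np.logspace(-3, 2, 6)  # 0.001 up to 100
cv_model = LogisticRegressionCV(Cs=Cs, cv=5, scoring="neg_log_loss",
                                max_iter=1000).fit(X, y)
print("Best C:", cv_model.C_[0])  # the grid value with best CV score
```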
L1 vs. L2 for logistic regression
| | L1 (Lasso) | L2 (Ridge) |
|---|---|---|
| Penalty | $\lambda \sum_j \vert\beta_j\vert$ | $\lambda \sum_j \beta_j^2$ |
| Feature selection | Yes (exact zeros) | No |
| Coefficient behavior | Sparse | Smooth shrinkage |
| Handles correlated features | Picks one arbitrarily | Distributes weight evenly |
| Solver compatibility | Requires liblinear or saga | Works with most solvers |
| Best for | High-dimensional sparse data | Dense data with many small effects |
Elastic Net
As with linear regression, Elastic Net combines both penalties:

$$\lambda \left[ \alpha \sum_{j=1}^{p} \vert\beta_j\vert + (1 - \alpha) \sum_{j=1}^{p} \beta_j^2 \right]$$

In scikit-learn, use penalty='elasticnet' (with the saga solver) and the l1_ratio parameter $\alpha$ controlling the mix ($\alpha = 1$ is pure L1, $\alpha = 0$ is pure L2).
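A minimal elastic-net fit in scikit-learn (the dataset and the even 0.5 mix are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=30, random_state=0)

# Elastic net requires the saga solver; l1_ratio=0.5 weights L1 and L2 equally
enet = LogisticRegression(penalty="elasticnet", solver="saga",
                          l1_ratio=0.5, C=1.0, max_iter=5000).fit(X, y)
print("Coefficient shape:", enet.coef_.shape)
```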
Effect of regularization on the decision boundary
Regularization keeps coefficients small, which softens the decision boundary — predictions near the boundary become less extreme (closer to 0.5). This is particularly important for the complete separation problem: without regularization, the model drives its coefficients to infinity to perfectly separate training data; with regularization, the coefficients remain finite and the model generalizes better.
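The separation effect can be sketched on a tiny, perfectly separable dataset (the data and C values are illustrative assumptions): with an effectively unregularized fit the coefficient grows very large, while the default penalty keeps it modest.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Perfectly separable 1-D data: everything below 2.5 is class 0
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

weak = LogisticRegression(C=1e6, max_iter=10000).fit(X, y)  # ~no penalty
reg = LogisticRegression(C=1.0).fit(X, y)                   # default L2

# The nearly unregularized coefficient is far larger: it is heading
# toward infinity to make training predictions exactly 0 or 1
print("Nearly unregularized coefficient:", float(weak.coef_[0, 0]))
print("Regularized coefficient:", float(reg.coef_[0, 0]))
```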