Regularization: Ridge and Lasso
The problem regularization solves
When a model has many features — especially relative to the number of training examples — OLS can overfit: it assigns large coefficients that fit the training data well but generalize poorly. Regularization adds a penalty on the size of the coefficients to the cost function, discouraging the model from becoming too complex.
Ridge regression (L2 regularization)
Ridge regression adds the sum of squared coefficients to the cost:

$$J(\theta) = \mathrm{MSE}(\theta) + \lambda \sum_{j=1}^{n} \theta_j^2$$

The term $\lambda \sum_{j=1}^{n} \theta_j^2$ is the L2 penalty. The hyperparameter $\lambda$ (lambda) controls the strength of regularization:
- $\lambda = 0$: reduces to ordinary OLS.
- Large $\lambda$: strongly penalizes large coefficients, shrinking them toward zero.
Note: The intercept $\theta_0$ is conventionally not penalized, because it does not contribute to overfitting in the same way.
Closed-form solution for Ridge
The Normal Equation generalizes beautifully:

$$\hat{\theta} = (X^\top X + \lambda A)^{-1} X^\top y$$

where $A$ is the identity matrix (with the top-left entry set to 0 to exclude $\theta_0$ from the penalty). Adding $\lambda A$ to $X^\top X$ guarantees invertibility even when $X^\top X$ is singular — solving the multicollinearity problem from earlier.
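To make the formula concrete, here is a minimal NumPy sketch of the closed-form solution. It assumes the first column of `X` is all ones (the intercept column), and all names are illustrative:

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve Ridge regression via the regularized Normal Equation.

    Assumes X already includes a leading column of ones for the
    intercept; the top-left entry of the penalty matrix is zeroed
    so the intercept theta_0 is not penalized.
    """
    n_features = X.shape[1]
    A = np.eye(n_features)
    A[0, 0] = 0.0  # do not penalize the intercept
    # Solve (X'X + lam*A) theta = X'y instead of forming the inverse.
    return np.linalg.solve(X.T @ X + lam * A, X.T @ y)

# Example: Ridge shrinks the coefficients relative to plain OLS.
rng = np.random.default_rng(0)
X = np.hstack([np.ones((50, 1)), rng.normal(size=(50, 3))])
y = X @ np.array([1.0, 2.0, -3.0, 0.5]) + rng.normal(scale=0.1, size=50)

theta_ols = ridge_closed_form(X, y, lam=0.0)    # lam = 0 is plain OLS
theta_ridge = ridge_closed_form(X, y, lam=10.0)
```

Using `np.linalg.solve` rather than explicitly inverting the matrix is both faster and numerically more stable.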
Effect of Ridge
Ridge shrinks all coefficients toward zero but never exactly to zero. It is useful when you believe most features are relevant but their individual effects are small.
Lasso regression (L1 regularization)
Lasso replaces the squared penalty with the sum of absolute values:

$$J(\theta) = \mathrm{MSE}(\theta) + \lambda \sum_{j=1}^{n} |\theta_j|$$
Effect of Lasso
Lasso can shrink coefficients exactly to zero, effectively performing automatic feature selection. Features with $\theta_j = 0$ are dropped from the model entirely. This makes Lasso valuable when you suspect many features are irrelevant.
Because $|\theta_j|$ is not differentiable at zero, Lasso has no closed-form solution and requires iterative methods (e.g. coordinate descent).
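A minimal sketch of coordinate descent for Lasso, assuming standardized features and centered targets so no intercept is needed (all names are illustrative). The soft-thresholding step is what drives coefficients exactly to zero:

```python
import numpy as np

def soft_threshold(z, gamma):
    """L1 proximal operator: shrink z toward zero by gamma, clipping at zero."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iters=200):
    """Minimize (1/2m)||y - X theta||^2 + lam * sum(|theta_j|).

    Assumes the columns of X are standardized and y is centered,
    so no intercept term is needed.
    """
    m, n = X.shape
    theta = np.zeros(n)
    col_sq = (X ** 2).sum(axis=0) / m  # per-feature scale (1 if standardized)
    for _ in range(n_iters):
        for j in range(n):
            # Partial residual: prediction error ignoring feature j.
            residual = y - X @ theta + X[:, j] * theta[j]
            rho = X[:, j] @ residual / m
            theta[j] = soft_threshold(rho, lam) / col_sq[j]
    return theta

# Example: only features 0 and 1 matter; with a large enough lam,
# the irrelevant coefficients end up exactly 0.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)
theta = lasso_coordinate_descent(X, y, lam=0.5)
```

Inspecting `theta` shows the irrelevant coefficients are literally `0.0`, not merely small, which is the feature-selection behavior described above.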
Comparing Ridge and Lasso
| Property | Ridge (L2) | Lasso (L1) |
|---|---|---|
| Penalty term | $\lambda \sum_j \theta_j^2$ | $\lambda \sum_j \|\theta_j\|$ |
| Coefficient behavior | Shrinks toward zero | Can shrink to exactly zero |
| Feature selection | No | Yes |
| Closed-form solution | Yes | No |
| Best when | Many small relevant effects | Sparse: few features truly matter |
| Handles multicollinearity | Yes (distributes weight evenly) | Partially (picks one of correlated features) |
Elastic Net
Elastic Net combines both penalties:

$$J(\theta) = \mathrm{MSE}(\theta) + r\lambda \sum_{j=1}^{n} |\theta_j| + \frac{1-r}{2}\,\lambda \sum_{j=1}^{n} \theta_j^2$$

where $r \in [0, 1]$ is the mixing ratio between the L1 and L2 penalties.
It offers the feature selection of Lasso and the stability of Ridge, at the cost of two hyperparameters to tune.
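One way to see how the two hyperparameters interact is to compute the penalty term directly. This small sketch assumes the common parameterization with an overall strength `lam` and a mixing ratio `r`, where `r = 1` recovers pure Lasso and `r = 0` pure Ridge (names are illustrative):

```python
import numpy as np

def elastic_net_penalty(theta, lam, r):
    """Elastic Net penalty with overall strength lam and mixing ratio r.

    r = 1 gives the pure L1 (Lasso) penalty; r = 0 gives the pure L2
    (Ridge) penalty, here with the conventional 1/2 factor.
    """
    l1 = np.sum(np.abs(theta))
    l2 = np.sum(theta ** 2)
    return r * lam * l1 + (1.0 - r) * lam * l2 / 2.0

theta = np.array([1.0, -2.0, 0.0])
pure_lasso = elastic_net_penalty(theta, lam=1.0, r=1.0)  # L1 norm of theta
pure_ridge = elastic_net_penalty(theta, lam=1.0, r=0.0)  # half squared L2 norm
```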
Choosing $\lambda$
$\lambda$ is a hyperparameter — it is not learned from the training data. Use cross-validation:
- Train Ridge/Lasso models with a grid of $\lambda$ values (e.g. a logarithmic grid such as $0.001, 0.01, 0.1, 1, 10, 100$).
- Evaluate each with $k$-fold cross-validation.
- Pick the $\lambda$ with the lowest validation error.
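The steps above can be sketched end to end in NumPy, using the closed-form Ridge solution from the Normal Equation earlier. The grid values and fold count here are illustrative, not prescriptive:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form Ridge solution (intercept in column 0, unpenalized)."""
    A = np.eye(X.shape[1])
    A[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + lam * A, X.T @ y)

def cv_error(X, y, lam, k=5):
    """Mean validation MSE over k folds for a given lambda.

    Assumes the rows are already in random order; in practice,
    shuffle the indices before splitting.
    """
    folds = np.array_split(np.arange(len(y)), k)
    errors = []
    for val_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(y)), val_idx)
        theta = ridge_fit(X[train_idx], y[train_idx], lam)
        pred = X[val_idx] @ theta
        errors.append(np.mean((y[val_idx] - pred) ** 2))
    return np.mean(errors)

# Synthetic data: 8 features, only two of which matter.
rng = np.random.default_rng(42)
X = np.hstack([np.ones((60, 1)), rng.normal(size=(60, 8))])
y = X[:, 1] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=60)

grid = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
scores = {lam: cv_error(X, y, lam) for lam in grid}
best_lam = min(scores, key=scores.get)
```

In practice a library routine (e.g. scikit-learn's cross-validation utilities) does this search for you, but the logic is exactly this loop.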
The bias–variance trade-off
Regularization increases bias (the model is slightly constrained away from the true OLS fit) but reduces variance (predictions are more stable across different training sets). For a well-chosen $\lambda$, the reduction in variance outweighs the increase in bias, leading to better test-set performance.
This trade-off is one of the central ideas in all of machine learning: simpler models generalize better when data is limited, but may be too simple when data is abundant.