Regularization: Ridge and Lasso

The problem regularization solves

When a model has many features — especially relative to the number of training examples — OLS can overfit: it assigns large coefficients that fit the training data well but generalize poorly. Regularization adds a penalty on the size of the coefficients to the cost function, discouraging the model from becoming too complex.

Ridge regression (L2 regularization)

Ridge regression adds the sum of squared coefficients to the cost:

$$J_{\text{Ridge}}(\boldsymbol{\theta}) = \frac{1}{2m} \sum_{i=1}^{m} \left(\hat{y}^{(i)} - y^{(i)}\right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2$$

The term $\lambda \sum_{j=1}^{n} \theta_j^2$ is the L2 penalty. The hyperparameter $\lambda \geq 0$ (lambda) controls the strength of regularization:

  • $\lambda = 0$: the penalty vanishes and Ridge reduces to ordinary OLS.
  • Large $\lambda$: strongly penalizes large coefficients, shrinking them toward zero.

Note: The intercept $\theta_0$ is conventionally not penalized, because it does not contribute to overfitting in the same way.

Closed-form solution for Ridge

The Normal Equation generalizes beautifully:

$$\boldsymbol{\theta} = (X^T X + \lambda I)^{-1} X^T \mathbf{y}$$

where $I$ is the $(n+1) \times (n+1)$ identity matrix (with the top-left entry set to 0 to exclude $\theta_0$ from the penalty). Adding $\lambda I$ to $X^T X$ guarantees invertibility even when $X^T X$ is singular, solving the multicollinearity problem from earlier.
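As a quick sketch, the closed-form solution can be implemented directly with NumPy. The data below is synthetic and deliberately includes two perfectly collinear features to show that the $\lambda I$ term keeps the system solvable; all names are illustrative, and the design matrix is assumed to carry a leading column of ones:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge: theta = (X^T X + lam * I)^(-1) X^T y.

    X is assumed to carry a leading column of ones; the top-left entry
    of I is zeroed so the intercept theta_0 is not penalized.
    """
    I = np.eye(X.shape[1])
    I[0, 0] = 0.0                                   # leave theta_0 unpenalized
    return np.linalg.solve(X.T @ X + lam * I, X.T @ y)

# Synthetic data with perfect collinearity: column 3 is exactly 2x column 2,
# so X^T X is singular and the plain Normal Equation would fail.
rng = np.random.default_rng(0)
x = rng.normal(size=50)
X = np.column_stack([np.ones(50), x, 2 * x])
y = 3 + 1.5 * x + rng.normal(scale=0.1, size=50)

theta = ridge_fit(X, y, lam=1.0)                    # solves without complaint
```

Note how Ridge splits the shared effect across the two correlated columns: the fitted combination $\theta_1 + 2\theta_2$ recovers the true slope of 1.5, while neither coefficient alone does.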

Effect of Ridge

Ridge shrinks all coefficients toward zero but never exactly to zero. It is useful when you believe most features are relevant but their individual effects are small.

Lasso regression (L1 regularization)

Lasso replaces the squared penalty with the sum of absolute values:

$$J_{\text{Lasso}}(\boldsymbol{\theta}) = \frac{1}{2m} \sum_{i=1}^{m} \left(\hat{y}^{(i)} - y^{(i)}\right)^2 + \lambda \sum_{j=1}^{n} |\theta_j|$$

Effect of Lasso

Lasso can shrink coefficients exactly to zero, effectively performing automatic feature selection. Features with $\theta_j = 0$ are dropped from the model entirely. This makes Lasso valuable when you suspect many features are irrelevant.

Because $|\theta_j|$ is not differentiable at zero, Lasso has no closed-form solution and requires iterative methods (e.g. coordinate descent).
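To make this concrete, here is a minimal coordinate-descent sketch in plain NumPy. It works on centered data so the unpenalized intercept can be left out; the helper names and the synthetic problem are invented for this illustration:

```python
import numpy as np

def soft_threshold(rho, lam):
    """Soft-thresholding: the scalar minimizer the L1 penalty produces."""
    return np.sign(rho) * np.maximum(np.abs(rho) - lam, 0.0)

def lasso_cd(X, y, lam, n_sweeps=200):
    """Lasso via cyclic coordinate descent on centered data (no intercept).

    Minimizes (1/2m) * ||y - X theta||^2 + lam * sum_j |theta_j|.
    """
    m, n = X.shape
    theta = np.zeros(n)
    z = (X ** 2).sum(axis=0) / m            # per-feature curvature
    for _ in range(n_sweeps):
        for j in range(n):
            # Partial residual with feature j's current contribution removed
            r = y - X @ theta + X[:, j] * theta[j]
            rho = X[:, j] @ r / m
            theta[j] = soft_threshold(rho, lam) / z[j]
    return theta

# Sparse ground truth: only the first two of ten features matter.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
X -= X.mean(axis=0)                          # center so the intercept is zero
true_theta = np.array([3.0, -2.0] + [0.0] * 8)
y = X @ true_theta + rng.normal(scale=0.5, size=200)
y -= y.mean()

theta = lasso_cd(X, y, lam=0.1)
```

On this synthetic problem, the two truly relevant coefficients survive (mildly shrunk) while most of the irrelevant ones are set exactly to zero by the soft-thresholding step.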

Comparing Ridge and Lasso

| Property | Ridge (L2) | Lasso (L1) |
| --- | --- | --- |
| Penalty term | $\lambda \sum \theta_j^2$ | $\lambda \sum \lvert\theta_j\rvert$ |
| Coefficient behavior | Shrinks toward zero | Can shrink to exactly zero |
| Feature selection | No | Yes |
| Closed-form solution | Yes | No |
| Best when | Many small relevant effects | Sparse: few features truly matter |
| Handles multicollinearity | Yes (distributes weight evenly) | Partially (picks one of correlated features) |

Elastic Net

Elastic Net combines both penalties:

$$J_{\text{EN}} = \frac{1}{2m}\sum_{i=1}^{m} \left(\hat{y}^{(i)} - y^{(i)}\right)^2 + \lambda_1 \sum_{j=1}^{n} |\theta_j| + \lambda_2 \sum_{j=1}^{n} \theta_j^2$$

It offers the feature selection of Lasso and the stability of Ridge, at the cost of two hyperparameters to tune.
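As a side note, coordinate descent handles this combined penalty too. Writing $\mathbf{x}_j$ for feature column $j$, $r^{(j)} = \mathbf{y} - X\boldsymbol{\theta} + \mathbf{x}_j \theta_j$ for the partial residual, $\rho_j = \frac{1}{m}\mathbf{x}_j^T r^{(j)}$, $z_j = \frac{1}{m}\mathbf{x}_j^T \mathbf{x}_j$, and $S(\rho, \lambda) = \operatorname{sign}(\rho)\max(|\rho| - \lambda, 0)$ for the soft-thresholding operator, the per-coordinate minimizer of $J_{\text{EN}}$ works out to

$$\theta_j = \frac{S(\rho_j, \lambda_1)}{z_j + 2\lambda_2}$$

The $\lambda_1$ in the numerator does the zeroing (Lasso's feature selection); the $2\lambda_2$ in the denominator does the shrinking (Ridge's stability).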

Choosing $\lambda$

$\lambda$ is a hyperparameter: it is not learned from the training data. Use cross-validation:

  1. Train Ridge/Lasso models with a grid of $\lambda$ values (e.g. $0.001, 0.01, 0.1, 1, 10, 100$).
  2. Evaluate each with $k$-fold cross-validation.
  3. Pick the $\lambda$ with the lowest validation error.
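The three steps above can be sketched with the closed-form ridge solver and plain NumPy; the data and the grid are synthetic, and the helper names are made up:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge with an unpenalized leading intercept column."""
    I = np.eye(X.shape[1])
    I[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + lam * I, X.T @ y)

def cv_error(X, y, lam, k=5):
    """Mean validation MSE of ridge(lam) under k-fold cross-validation."""
    idx = np.arange(len(y))
    errors = []
    for val_idx in np.array_split(idx, k):
        train_idx = np.setdiff1d(idx, val_idx)
        theta = ridge_fit(X[train_idx], y[train_idx], lam)
        resid = X[val_idx] @ theta - y[val_idx]
        errors.append(np.mean(resid ** 2))
    return np.mean(errors)

# Synthetic regression: 100 examples, 20 weak features, unit noise.
rng = np.random.default_rng(2)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 20))])
theta_true = np.concatenate([[1.0], rng.normal(scale=0.3, size=20)])
y = X @ theta_true + rng.normal(size=100)

grid = [0.001, 0.01, 0.1, 1, 10, 100]                  # step 1: lambda grid
scores = {lam: cv_error(X, y, lam) for lam in grid}    # step 2: k-fold CV
best_lam = min(scores, key=scores.get)                 # step 3: lowest error
```

In practice you would refit on the full training set with `best_lam` before touching the test set.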

The bias–variance trade-off

Regularization increases bias (the model is slightly constrained away from the true OLS fit) but reduces variance (predictions are more stable across different training sets). For a well-chosen $\lambda$, the reduction in variance outweighs the increase in bias, leading to better test-set performance.
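The variance half of this claim is easy to simulate: refit ridge on many small independent training sets and compare the spread of the slope estimate at $\lambda = 0$ (plain OLS) and at a large $\lambda$. The setup below is a toy illustration with invented names:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge with an unpenalized leading intercept column."""
    I = np.eye(X.shape[1])
    I[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + lam * I, X.T @ y)

rng = np.random.default_rng(3)

def slope_spread(lam, trials=200, n=15):
    """Std-dev of the fitted slope across many independent training sets."""
    slopes = []
    for _ in range(trials):
        x = rng.normal(size=n)
        X = np.column_stack([np.ones(n), x])
        y = 2.0 * x + rng.normal(size=n)       # true slope is 2
        slopes.append(ridge_fit(X, y, lam)[1])
    return np.std(slopes)

spread_ols = slope_spread(lam=0.0)     # high variance, no bias
spread_ridge = slope_spread(lam=50.0)  # much lower variance, slope biased low
```

The regularized slopes cluster far more tightly, but around a value below the true slope of 2: variance traded for bias, exactly as the paragraph above describes.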

This trade-off is one of the central ideas in all of machine learning: simpler models generalize better when data is limited, but may be too simple when data is abundant.