Regularization: Ridge and Lasso

The problem regularization solves

When a model has many features — especially relative to the number of training examples — OLS can overfit: it assigns large coefficients that fit the training data well but generalize poorly. Regularization adds a penalty on the size of the coefficients to the cost function, discouraging the model from becoming too complex.

Ridge regression (L2 regularization)

Ridge regression adds the sum of squared coefficients to the cost:

$$J_{\text{Ridge}}(\boldsymbol{\theta}) = \frac{1}{2m} \sum_{i=1}^{m} \left(\hat{y}^{(i)} - y^{(i)}\right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2$$

The term $\lambda \sum_{j=1}^{n} \theta_j^2$ is the L2 penalty. The hyperparameter $\lambda \geq 0$ (lambda) controls the strength of regularization:

  • $\lambda = 0$: the penalty vanishes and Ridge reduces to ordinary OLS.
  • Large $\lambda$: strongly penalizes large coefficients, shrinking them toward zero.

Note: The intercept $\theta_0$ is conventionally not penalized, because it does not contribute to overfitting in the same way.

Closed-form solution for Ridge

The Normal Equation generalizes beautifully:

$$\boldsymbol{\theta} = (X^T X + \lambda I)^{-1} X^T \mathbf{y}$$

where $I$ is the $(n+1) \times (n+1)$ identity matrix (with the top-left entry set to 0 to exclude $\theta_0$ from the penalty). Adding $\lambda I$ to $X^T X$ guarantees invertibility even when $X^T X$ is singular, solving the multicollinearity problem from earlier.
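As a quick sketch, the closed-form solution can be implemented directly with NumPy. The data below is synthetic and deliberately includes two perfectly collinear features to show that the $\lambda I$ term keeps the system solvable; all names are illustrative, and the design matrix is assumed to carry a leading column of ones:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge: theta = (X^T X + lam * I)^(-1) X^T y.

    X is assumed to carry a leading column of ones; the top-left entry
    of I is zeroed so the intercept theta_0 is not penalized.
    """
    I = np.eye(X.shape[1])
    I[0, 0] = 0.0                                   # leave theta_0 unpenalized
    return np.linalg.solve(X.T @ X + lam * I, X.T @ y)

# Synthetic data with perfect collinearity: column 3 is exactly 2x column 2,
# so X^T X is singular and the plain Normal Equation would fail.
rng = np.random.default_rng(0)
x = rng.normal(size=50)
X = np.column_stack([np.ones(50), x, 2 * x])
y = 3 + 1.5 * x + rng.normal(scale=0.1, size=50)

theta = ridge_fit(X, y, lam=1.0)                    # solves without complaint
```

Note how Ridge splits the shared effect across the two correlated columns: the fitted combination $\theta_1 + 2\theta_2$ recovers the true slope of 1.5, while neither coefficient alone does.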

Effect of Ridge

Ridge shrinks all coefficients toward zero but never exactly to zero. It is useful when you believe most features are relevant but their individual effects are small.

Lasso regression (L1 regularization)

Lasso replaces the squared penalty with the sum of absolute values:

$$J_{\text{Lasso}}(\boldsymbol{\theta}) = \frac{1}{2m} \sum_{i=1}^{m} \left(\hat{y}^{(i)} - y^{(i)}\right)^2 + \lambda \sum_{j=1}^{n} |\theta_j|$$

Effect of Lasso

Lasso can shrink coefficients exactly to zero, effectively performing automatic feature selection. Features with $\theta_j = 0$ are dropped from the model entirely. This makes Lasso valuable when you suspect many features are irrelevant.

Because $|\theta_j|$ is not differentiable at zero, Lasso has no closed-form solution and requires iterative methods (e.g. coordinate descent).
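To make this concrete, here is a minimal coordinate-descent sketch in plain NumPy. It works on centered data so the unpenalized intercept can be left out; the helper names and the synthetic problem are invented for this illustration:

```python
import numpy as np

def soft_threshold(rho, lam):
    """Soft-thresholding: the scalar minimizer the L1 penalty produces."""
    return np.sign(rho) * np.maximum(np.abs(rho) - lam, 0.0)

def lasso_cd(X, y, lam, n_sweeps=200):
    """Lasso via cyclic coordinate descent on centered data (no intercept).

    Minimizes (1/2m) * ||y - X theta||^2 + lam * sum_j |theta_j|.
    """
    m, n = X.shape
    theta = np.zeros(n)
    z = (X ** 2).sum(axis=0) / m            # per-feature curvature
    for _ in range(n_sweeps):
        for j in range(n):
            # Partial residual with feature j's current contribution removed
            r = y - X @ theta + X[:, j] * theta[j]
            rho = X[:, j] @ r / m
            theta[j] = soft_threshold(rho, lam) / z[j]
    return theta

# Sparse ground truth: only the first two of ten features matter.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
X -= X.mean(axis=0)                          # center so the intercept is zero
true_theta = np.array([3.0, -2.0] + [0.0] * 8)
y = X @ true_theta + rng.normal(scale=0.5, size=200)
y -= y.mean()

theta = lasso_cd(X, y, lam=0.1)
```

On this synthetic problem, the two truly relevant coefficients survive (mildly shrunk) while most of the irrelevant ones are set exactly to zero by the soft-thresholding step.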

Comparing Ridge and Lasso

| Property | Ridge (L2) | Lasso (L1) |
| --- | --- | --- |
| Penalty term | $\lambda \sum \theta_j^2$ | $\lambda \sum \lvert\theta_j\rvert$ |
| Coefficient behavior | Shrinks toward zero | Can shrink to exactly zero |
| Feature selection | No | Yes |
| Closed-form solution | Yes | No |
| Best when | Many small relevant effects | Sparse: few features truly matter |
| Handles multicollinearity | Yes (distributes weight evenly) | Partially (picks one of correlated features) |

Elastic Net

Elastic Net combines both penalties:

$$J_{\text{EN}} = \frac{1}{2m}\sum_{i=1}^{m} \left(\hat{y}^{(i)} - y^{(i)}\right)^2 + \lambda_1 \sum_{j=1}^{n} |\theta_j| + \lambda_2 \sum_{j=1}^{n} \theta_j^2$$

It offers the feature selection of Lasso and the stability of Ridge, at the cost of two hyperparameters to tune.
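As a side note, coordinate descent handles this combined penalty too. Writing $\mathbf{x}_j$ for feature column $j$, $r^{(j)} = \mathbf{y} - X\boldsymbol{\theta} + \mathbf{x}_j \theta_j$ for the partial residual, $\rho_j = \frac{1}{m}\mathbf{x}_j^T r^{(j)}$, $z_j = \frac{1}{m}\mathbf{x}_j^T \mathbf{x}_j$, and $S(\rho, \lambda) = \operatorname{sign}(\rho)\max(|\rho| - \lambda, 0)$ for the soft-thresholding operator, the per-coordinate minimizer of $J_{\text{EN}}$ works out to

$$\theta_j = \frac{S(\rho_j, \lambda_1)}{z_j + 2\lambda_2}$$

The $\lambda_1$ in the numerator does the zeroing (Lasso's feature selection); the $2\lambda_2$ in the denominator does the shrinking (Ridge's stability).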

Choosing $\lambda$

$\lambda$ is a hyperparameter: it is not learned from the training data. Use cross-validation:

  1. Train Ridge/Lasso models with a grid of $\lambda$ values (e.g. $0.001, 0.01, 0.1, 1, 10, 100$).
  2. Evaluate each with $k$-fold cross-validation.
  3. Pick the $\lambda$ with the lowest validation error.
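The three steps above can be sketched with the closed-form ridge solver and plain NumPy; the data and the grid are synthetic, and the helper names are made up:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge with an unpenalized leading intercept column."""
    I = np.eye(X.shape[1])
    I[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + lam * I, X.T @ y)

def cv_error(X, y, lam, k=5):
    """Mean validation MSE of ridge(lam) under k-fold cross-validation."""
    idx = np.arange(len(y))
    errors = []
    for val_idx in np.array_split(idx, k):
        train_idx = np.setdiff1d(idx, val_idx)
        theta = ridge_fit(X[train_idx], y[train_idx], lam)
        resid = X[val_idx] @ theta - y[val_idx]
        errors.append(np.mean(resid ** 2))
    return np.mean(errors)

# Synthetic regression: 100 examples, 20 weak features, unit noise.
rng = np.random.default_rng(2)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 20))])
theta_true = np.concatenate([[1.0], rng.normal(scale=0.3, size=20)])
y = X @ theta_true + rng.normal(size=100)

grid = [0.001, 0.01, 0.1, 1, 10, 100]                  # step 1: lambda grid
scores = {lam: cv_error(X, y, lam) for lam in grid}    # step 2: k-fold CV
best_lam = min(scores, key=scores.get)                 # step 3: lowest error
```

In practice you would refit on the full training set with `best_lam` before touching the test set.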

The bias–variance trade-off

Regularization increases bias (the model is slightly constrained away from the true OLS fit) but reduces variance (predictions are more stable across different training sets). For a well-chosen $\lambda$, the reduction in variance outweighs the increase in bias, leading to better test-set performance.
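The variance half of this claim is easy to simulate: refit ridge on many small independent training sets and compare the spread of the slope estimate at $\lambda = 0$ (plain OLS) and at a large $\lambda$. The setup below is a toy illustration with invented names:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge with an unpenalized leading intercept column."""
    I = np.eye(X.shape[1])
    I[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + lam * I, X.T @ y)

rng = np.random.default_rng(3)

def slope_spread(lam, trials=200, n=15):
    """Std-dev of the fitted slope across many independent training sets."""
    slopes = []
    for _ in range(trials):
        x = rng.normal(size=n)
        X = np.column_stack([np.ones(n), x])
        y = 2.0 * x + rng.normal(size=n)       # true slope is 2
        slopes.append(ridge_fit(X, y, lam)[1])
    return np.std(slopes)

spread_ols = slope_spread(lam=0.0)     # high variance, no bias
spread_ridge = slope_spread(lam=50.0)  # much lower variance, slope biased low
```

The regularized slopes cluster far more tightly, but around a value below the true slope of 2: variance traded for bias, exactly as the paragraph above describes.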

This trade-off is one of the central ideas in all of machine learning: simpler models generalize better when data is limited, but may be too simple when data is abundant.