L1 vs L2 geometry

Why geometry explains the difference

L1 and L2 regularization both add a penalty on the size of model coefficients, but they behave differently in an important way: L2 shrinks all coefficients toward zero, while L1 can shrink some all the way to exactly zero. To understand why, it helps to think geometrically.

The setup

Regularized regression solves:

$$\min_{\boldsymbol{\beta}} \underbrace{\sum_i \bigl(y^{(i)} - \hat{y}^{(i)}\bigr)^2}_{\text{loss}} + \lambda \underbrace{P(\boldsymbol{\beta})}_{\text{penalty}}$$

where $P(\boldsymbol{\beta})$ is either the L1 or L2 penalty. This can be rewritten as a constrained optimization: minimize the loss subject to the penalty being below some budget $t$ (the two formulations are equivalent for corresponding values of $\lambda$ and $t$).

  • L2 (Ridge): $\sum_j \beta_j^2 \leq t$ — coefficients constrained to a sphere (in 2D: a circle).
  • L1 (Lasso): $\sum_j |\beta_j| \leq t$ — coefficients constrained to a diamond (in 2D: a square rotated 45°).
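As a quick numeric check, the two penalties can be computed directly; the coefficient vector here is made up purely for illustration:

```python
import numpy as np

beta = np.array([0.5, -1.2, 0.0, 3.0])  # hypothetical coefficients

l1 = np.sum(np.abs(beta))  # L1 penalty: sum of absolute values
l2 = np.sum(beta ** 2)     # L2 penalty: sum of squares

print(l1)  # 4.7
print(l2)  # 10.69
```

Note that the zero coefficient contributes nothing to either penalty — sparsity is "free" under both, but only L1's geometry actively drives coefficients there.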

The geometric picture

Imagine 2D coefficient space with axes $\beta_1$ and $\beta_2$. The unregularized OLS solution sits at some point in this space — the center of elliptical contours of equal loss.

Regularization forces the solution inside the constraint region. The regularized solution is where the loss contours first touch the constraint boundary.

For L2 (circle): the circle is smooth with no corners. The loss contours almost always touch the circle at a point where both $\beta_1$ and $\beta_2$ are non-zero. Coefficients shrink toward zero but rarely reach it exactly.

For L1 (diamond): the diamond has corners on the axes — at points like $(\beta_1, 0)$ and $(0, \beta_2)$. The loss contours are likely to first touch the diamond at one of these corners. When that happens, one coefficient is exactly zero. In higher dimensions, with many corners and edges, many coefficients land exactly at zero simultaneously.

This is why L1 produces sparse solutions — not a special mathematical trick, but a geometric consequence of the shape of the constraint region.
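This sparsity effect is easy to observe empirically. The sketch below fits scikit-learn's Lasso and Ridge to synthetic data in which only a few features are informative; the dataset parameters and the penalty strength `alpha=1.0` are illustrative choices, not values from the text:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 20 features, only 5 actually informative
X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# L1 sets many coefficients to exactly zero; L2 leaves all of them non-zero
lasso_zeros = int(np.sum(lasso.coef_ == 0))
ridge_zeros = int(np.sum(ridge.coef_ == 0))
print(lasso_zeros, ridge_zeros)
```

On data like this, the Lasso zeroes out most of the uninformative features, while Ridge produces small but strictly non-zero weights everywhere.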

L1: the penalty and its geometry

$$P_{L1}(\boldsymbol{\beta}) = \sum_j |\beta_j| \quad \text{(sum of absolute values)}$$

The absolute value function has a kink at zero — it is not differentiable there. This discontinuity in the gradient is precisely what allows the optimization to land exactly on the axes. The subgradient at zero can take any value in $[-\lambda, \lambda]$, which means a coefficient can be held at zero even if there is some gradient pushing it away, as long as the push is small enough.
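One concrete consequence of this subgradient condition is the soft-thresholding operator, which coordinate-descent solvers for the Lasso apply to each coefficient in turn: any value whose magnitude falls below the threshold is clamped to exactly zero. A minimal sketch (the input values are made up):

```python
import numpy as np

def soft_threshold(z, lam):
    """Proximal operator of the L1 penalty: shrinks z toward zero by lam,
    and sets it to exactly zero when |z| <= lam."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

out = soft_threshold(np.array([2.5, -0.3, 0.8]), 1.0)
print(out)  # large entry shrunk by 1.0; small entries exactly zero
```

The "held at zero unless the push exceeds $\lambda$" behavior from the text is visible directly: 2.5 survives (shrunk to 1.5), while -0.3 and 0.8 are inside the threshold and land on exactly zero.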

L2: the penalty and its geometry

$$P_{L2}(\boldsymbol{\beta}) = \sum_j \beta_j^2 \quad \text{(sum of squares)}$$

The squared penalty is smooth everywhere, including at zero. There is no kink — the gradient at zero is exactly zero, so the optimization slides smoothly toward (but never reaches) zero unless the loss gradient is also exactly zero, which is rare in practice. Coefficients shrink but retain small non-zero values.
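The contrast shows up in Ridge's closed-form solution, $\hat{\boldsymbol{\beta}} = (X^\top X + \lambda I)^{-1} X^\top y$: even a feature whose true coefficient is zero comes out small but non-zero. A sketch with made-up data (the true coefficients and $\lambda$ are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
# True coefficients: the middle feature genuinely contributes nothing
y = X @ np.array([2.0, 0.0, -1.0]) + 0.1 * rng.normal(size=50)

lam = 10.0
# Ridge closed form: beta = (X'X + lam*I)^{-1} X'y
beta = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
print(beta)  # all entries shrunk toward zero, none exactly zero
```

Every entry of `beta` is pulled toward zero relative to the true values, but none is exactly zero — including the coefficient of the irrelevant middle feature.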

Elastic Net: both at once

Elastic Net combines L1 and L2:

$$P_{\text{EN}} = \alpha \sum_j |\beta_j| + (1-\alpha)\sum_j \beta_j^2$$

The mixing parameter $\alpha \in [0,1]$ controls the blend. $\alpha = 1$ is pure Lasso; $\alpha = 0$ is pure Ridge. Elastic Net gets the feature selection of L1 and the stability of L2 — useful when many correlated features all matter.
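In scikit-learn the mixing parameter is called `l1_ratio`, while its `alpha` parameter is the overall penalty strength $\lambda$ — a naming difference worth keeping in mind. A sketch on synthetic data with only a few informative features (all parameter values are illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

# l1_ratio=0.5 mixes the L1 and L2 penalties equally;
# alpha=1.0 is the overall strength (lambda in the formula above)
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)

enet_zeros = int(np.sum(enet.coef_ == 0))
print(enet_zeros)  # still sparse thanks to the L1 component
```

Because the L1 component is present, the solution is still sparse, while the L2 component stabilizes the estimates when features are correlated.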

Summary

|                              | L1 (Lasso)                   | L2 (Ridge)                 |
|------------------------------|------------------------------|----------------------------|
| Constraint region            | Diamond (corners on axes)    | Sphere (smooth)            |
| Effect on coefficients       | Some shrink to exactly zero  | All shrink toward zero     |
| Feature selection            | Yes — sparse solutions       | No                         |
| Handles correlated features  | Picks one, zeros others      | Distributes weight evenly  |
| Penalty differentiable at 0  | No (kink)                    | Yes                        |