L1 vs L2 geometry

Why geometry explains the difference

L1 and L2 regularization both add a penalty on the size of model coefficients, but they behave differently in an important way: L2 shrinks all coefficients toward zero, while L1 can shrink some all the way to exactly zero. To understand why, it helps to think geometrically.

The setup

Regularized regression solves:

$$\min_{\boldsymbol{\beta}} \underbrace{\sum_i \bigl(y^{(i)} - \hat{y}^{(i)}\bigr)^2}_{\text{loss}} + \lambda \underbrace{P(\boldsymbol{\beta})}_{\text{penalty}}$$

where $P(\boldsymbol{\beta})$ is either the L1 or L2 penalty. This can be rewritten as a constrained optimization: minimize the loss subject to the penalty being below some budget $t$ (the two formulations are equivalent for corresponding values of $\lambda$ and $t$).

  • L2 (Ridge): $\sum_j \beta_j^2 \leq t$ — coefficients constrained to a sphere (in 2D: a circle).
  • L1 (Lasso): $\sum_j |\beta_j| \leq t$ — coefficients constrained to a diamond (in 2D: a square rotated 45°).
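As a quick numeric check, the two penalties can be computed directly; the coefficient vector here is made up purely for illustration:

```python
import numpy as np

beta = np.array([0.5, -1.2, 0.0, 3.0])  # hypothetical coefficients

l1 = np.sum(np.abs(beta))  # L1 penalty: sum of absolute values
l2 = np.sum(beta ** 2)     # L2 penalty: sum of squares

print(l1)  # 4.7
print(l2)  # 10.69
```

Note that the zero coefficient contributes nothing to either penalty — sparsity is "free" under both, but only L1's geometry actively drives coefficients there.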

The geometric picture

Imagine 2D coefficient space with axes $\beta_1$ and $\beta_2$. The unregularized OLS solution sits at some point in this space — the center of elliptical contours of equal loss.

Regularization forces the solution inside the constraint region. The regularized solution is where the loss contours first touch the constraint boundary.

For L2 (circle): the circle is smooth with no corners. The loss contours almost always touch the circle at a point where both $\beta_1$ and $\beta_2$ are non-zero. Coefficients shrink toward zero but rarely reach it exactly.

For L1 (diamond): the diamond has corners on the axes — at points like $(\beta_1, 0)$ and $(0, \beta_2)$. The loss contours are likely to first touch the diamond at one of these corners. When that happens, one coefficient is exactly zero. In higher dimensions, with many corners and edges, many coefficients land exactly at zero simultaneously.

This is why L1 produces sparse solutions — not a special mathematical trick, but a geometric consequence of the shape of the constraint region.
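This sparsity effect is easy to observe empirically. The sketch below fits scikit-learn's Lasso and Ridge to synthetic data in which only a few features are informative; the dataset parameters and the penalty strength `alpha=1.0` are illustrative choices, not values from the text:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 20 features, only 5 actually informative
X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# L1 sets many coefficients to exactly zero; L2 leaves all of them non-zero
lasso_zeros = int(np.sum(lasso.coef_ == 0))
ridge_zeros = int(np.sum(ridge.coef_ == 0))
print(lasso_zeros, ridge_zeros)
```

On data like this, the Lasso zeroes out most of the uninformative features, while Ridge produces small but strictly non-zero weights everywhere.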

L1: the penalty and its geometry

$$P_{L1}(\boldsymbol{\beta}) = \sum_j |\beta_j| \quad \text{(sum of absolute values)}$$

The absolute value function has a kink at zero — it is not differentiable there. This discontinuity in the gradient is precisely what allows the optimization to land exactly on the axes. The subgradient at zero can take any value in $[-\lambda, \lambda]$, which means a coefficient can be held at zero even if there is some gradient pushing it away, as long as the push is small enough.
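One concrete consequence of this subgradient condition is the soft-thresholding operator, which coordinate-descent solvers for the Lasso apply to each coefficient in turn: any value whose magnitude falls below the threshold is clamped to exactly zero. A minimal sketch (the input values are made up):

```python
import numpy as np

def soft_threshold(z, lam):
    """Proximal operator of the L1 penalty: shrinks z toward zero by lam,
    and sets it to exactly zero when |z| <= lam."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

out = soft_threshold(np.array([2.5, -0.3, 0.8]), 1.0)
print(out)  # large entry shrunk by 1.0; small entries exactly zero
```

The "held at zero unless the push exceeds $\lambda$" behavior from the text is visible directly: 2.5 survives (shrunk to 1.5), while -0.3 and 0.8 are inside the threshold and land on exactly zero.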

L2: the penalty and its geometry

$$P_{L2}(\boldsymbol{\beta}) = \sum_j \beta_j^2 \quad \text{(sum of squares)}$$

The squared penalty is smooth everywhere, including at zero. There is no kink — the gradient at zero is exactly zero, so the optimization slides smoothly toward (but never reaches) zero unless the loss gradient is also exactly zero, which is rare in practice. Coefficients shrink but retain small non-zero values.
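The contrast shows up in Ridge's closed-form solution, $\hat{\boldsymbol{\beta}} = (X^\top X + \lambda I)^{-1} X^\top y$: even a feature whose true coefficient is zero comes out small but non-zero. A sketch with made-up data (the true coefficients and $\lambda$ are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
# True coefficients: the middle feature genuinely contributes nothing
y = X @ np.array([2.0, 0.0, -1.0]) + 0.1 * rng.normal(size=50)

lam = 10.0
# Ridge closed form: beta = (X'X + lam*I)^{-1} X'y
beta = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
print(beta)  # all entries shrunk toward zero, none exactly zero
```

Every entry of `beta` is pulled toward zero relative to the true values, but none is exactly zero — including the coefficient of the irrelevant middle feature.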

Elastic Net: both at once

Elastic Net combines L1 and L2:

$$P_{\text{EN}} = \alpha \sum_j |\beta_j| + (1-\alpha)\sum_j \beta_j^2$$

The mixing parameter $\alpha \in [0,1]$ controls the blend. $\alpha = 1$ is pure Lasso; $\alpha = 0$ is pure Ridge. Elastic Net gets the feature selection of L1 and the stability of L2 — useful when many correlated features all matter.
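In scikit-learn the mixing parameter is called `l1_ratio`, while its `alpha` parameter is the overall penalty strength $\lambda$ — a naming difference worth keeping in mind. A sketch on synthetic data with only a few informative features (all parameter values are illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

# l1_ratio=0.5 mixes the L1 and L2 penalties equally;
# alpha=1.0 is the overall strength (lambda in the formula above)
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)

enet_zeros = int(np.sum(enet.coef_ == 0))
print(enet_zeros)  # still sparse thanks to the L1 component
```

Because the L1 component is present, the solution is still sparse, while the L2 component stabilizes the estimates when features are correlated.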

Summary

|                              | L1 (Lasso)                   | L2 (Ridge)                 |
|------------------------------|------------------------------|----------------------------|
| Constraint region            | Diamond (corners on axes)    | Sphere (smooth)            |
| Effect on coefficients       | Some shrink to exactly zero  | All shrink toward zero     |
| Feature selection            | Yes — sparse solutions       | No                         |
| Handles correlated features  | Picks one, zeros others      | Distributes weight evenly  |
| Penalty differentiable at 0  | No (kink)                    | Yes                        |