L1 vs L2 geometry
Why geometry explains the difference
L1 and L2 regularization both add a penalty on the size of model coefficients, but they behave differently in an important way: L2 shrinks all coefficients toward zero, while L1 can shrink some all the way to exactly zero. To understand why, it helps to think geometrically.
The setup
Regularized regression solves:

$$\min_{\beta} \; L(\beta) + \lambda P(\beta)$$

where $P(\beta)$ is either the L1 or L2 penalty. This can be rewritten as a constrained optimization: minimize the loss subject to the penalty being below some budget $t$ (the two formulations are equivalent for corresponding values of $\lambda$ and $t$).
- L2 (Ridge): $\sum_j \beta_j^2 \le t$ — coefficients constrained to a sphere (in 2D: a circle).
- L1 (Lasso): $\sum_j |\beta_j| \le t$ — coefficients constrained to a diamond (in 2D: a square rotated 45°).
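The two constraint regions can be sketched directly in NumPy (a minimal illustration; the budget $t = 1$ and the test point are arbitrary choices, not from the original):

```python
import numpy as np

def in_l2_ball(beta, t):
    # L2 (Ridge) constraint: sum of squared coefficients within budget t
    return float(np.sum(beta ** 2)) <= t

def in_l1_ball(beta, t):
    # L1 (Lasso) constraint: sum of absolute coefficients within budget t
    return float(np.sum(np.abs(beta))) <= t

beta = np.array([0.75, 0.5])
inside_l2 = in_l2_ball(beta, 1.0)   # 0.5625 + 0.25 = 0.8125 <= 1, inside
inside_l1 = in_l1_ball(beta, 1.0)   # 0.75 + 0.5 = 1.25 > 1, outside
print(inside_l2, inside_l1)
```

The same point can satisfy one budget and violate the other because the two regions have different shapes: the circle bulges out past the diamond away from the axes.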
The geometric picture
Imagine 2D coefficient space with axes $\beta_1$ and $\beta_2$. The unregularized OLS solution sits at some point in this space — the center of elliptical contours of equal loss.
Regularization forces the solution inside the constraint region. The regularized solution is where the loss contours first touch the constraint boundary.
For L2 (circle): the circle is smooth with no corners. The contours almost always touch the circle at a point where both $\beta_1$ and $\beta_2$ are non-zero. Coefficients shrink toward zero but do not reach it exactly.
For L1 (diamond): the diamond has corners on the axes — at the points $(\pm t, 0)$ and $(0, \pm t)$. The loss contours are likely to first touch the diamond at one of these corners. When that happens, one coefficient is exactly zero. In higher dimensions, with many corners and edges, many coefficients land exactly at zero simultaneously.
This is why L1 produces sparse solutions — it is not a special mathematical trick, it is a geometric consequence of the shape of the constraint region.
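A quick empirical check of this sparsity claim, sketched with scikit-learn on synthetic data (the data shape, `alpha` values, and seed are illustrative choices, not from the original):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: only the first two of ten features carry signal.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.standard_normal(200)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# Lasso zeros out most of the eight irrelevant coefficients exactly;
# Ridge shrinks them but leaves small non-zero values.
n_zero_lasso = int(np.sum(lasso.coef_ == 0.0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0.0))
print(n_zero_lasso, n_zero_ridge)
```

Counting coefficients that equal zero exactly (not merely "small") is the point: for Ridge that count is essentially always zero, while Lasso discards the noise features outright.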
L1: the penalty and its geometry
The absolute value penalty $\lambda|\beta|$ has a kink at zero — it is not differentiable there. This discontinuity in the gradient is precisely what allows the optimization to land exactly on the axes. The subgradient at zero can take any value in $[-\lambda, \lambda]$, which means a coefficient can be held at zero even if there is some gradient pushing it away, as long as the push is small enough.
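This "held at zero" behavior is exactly what the soft-thresholding operator — the proximal operator of the L1 penalty — implements. A minimal NumPy sketch (the input values and threshold are arbitrary illustrations):

```python
import numpy as np

def soft_threshold(z, lam):
    # Proximal operator of lam * |.|: shrink each entry toward zero by lam,
    # and clip any entry with |z| <= lam to exactly zero.
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

vals = soft_threshold(np.array([2.0, 0.3, -0.1, -1.5]), lam=0.5)
print(vals)  # entries with magnitude below 0.5 land exactly at zero
```

Coordinate-descent solvers for the Lasso apply this operator to each coefficient in turn, which is why their solutions contain exact zeros rather than merely small values.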
L2: the penalty and its geometry
The squared penalty $\lambda\beta^2$ is smooth everywhere, including at zero. There is no kink — the penalty's gradient at zero is exactly zero, so the optimization slides smoothly toward (but never exactly reaches) zero unless the loss gradient is also exactly zero, which is rare in practice. Coefficients shrink but retain small non-zero values.
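The uniform shrinkage is easiest to see in the textbook closed form for ridge under an orthonormal design, $\hat\beta_{\text{ridge}} = \hat\beta_{\text{OLS}} / (1 + \lambda)$: every coefficient is divided by the same factor, and none reaches zero exactly. A sketch (the coefficient values are made up for illustration):

```python
import numpy as np

# Orthonormal-design closed form: ridge divides every OLS coefficient
# by (1 + lam). Even a tiny coefficient only approaches zero asymptotically.
beta_ols = np.array([4.0, 0.02, -1.0])
for lam in [0.1, 1.0, 10.0]:
    print(lam, beta_ols / (1.0 + lam))

beta_heavy = beta_ols / (1.0 + 10.0)  # strong regularization, still no zeros
```

Even at $\lambda = 10$ the tiny middle coefficient becomes very small but remains strictly non-zero — exactly the contrast with soft-thresholding.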
Elastic Net: both at once
Elastic Net combines L1 and L2:

$$\lambda \left( \alpha \sum_j |\beta_j| + (1 - \alpha) \sum_j \beta_j^2 \right)$$

The mixing parameter $\alpha$ controls the blend: $\alpha = 1$ is pure Lasso; $\alpha = 0$ is pure Ridge. Elastic Net gets the feature selection of L1 and the stability of L2 — useful when many correlated features all matter.
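A short scikit-learn sketch of that trade-off on two highly correlated features (the data, `alpha`, and seed are illustrative; note that scikit-learn names the mixing parameter `l1_ratio`):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# Two highly correlated features that both carry the signal.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 6))
X[:, 1] = X[:, 0] + 0.01 * rng.standard_normal(200)
y = X[:, 0] + X[:, 1] + 0.1 * rng.standard_normal(200)

# l1_ratio = 1.0 would be pure Lasso, 0.0 pure Ridge.
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
pair_weight = enet.coef_[0] + enet.coef_[1]
print(enet.coef_)  # the correlated pair tends to share the weight
```

Where pure Lasso would often pick one of the correlated pair and zero the other, the L2 component encourages the weight to be spread across both.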
Summary
| | L1 (Lasso) | L2 (Ridge) |
|---|---|---|
| Constraint region | Diamond (corners on axes) | Sphere (smooth) |
| Effect on coefficients | Some shrink to exactly zero | All shrink toward zero |
| Feature selection | Yes — sparse solutions | No |
| Handles correlated features | Picks one, zeros others | Distributes weight evenly |
| Penalty differentiable at 0 | No (kink) | Yes |