The Geometry Behind L1/L2 Regularization

Regularization is a common technique in machine learning to prevent overfitting. It does so by penalizing overly complex models, encouraging them to find simpler, more generalizable patterns. Regularization often involves adding a regularization term to the loss function like so:

L_\lambda(\theta) = L(\theta) + \lambda R(\theta)

where $\theta$ denotes the model parameters, $L$ is the loss function, $\lambda \geq 0$ is the regularization strength, and $R$ is a regularization term that captures model complexity. $\lambda$ balances the goal of minimizing the loss against the goal of producing a simple model. A smaller $\lambda$ prioritizes minimizing the original loss, while a larger $\lambda$ prioritizes simple models (which can cause underfitting if $\lambda$ is too high).
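To make the role of $\lambda$ concrete, here is a minimal sketch that assumes a squared-error loss and the squared L2 penalty (ridge regression), for which the penalized objective has a closed-form minimizer. The data, the helper name `ridge_fit`, and the chosen values of $\lambda$ are illustrative assumptions, not part of the setup above.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize ||X @ theta - y||^2 + lam * ||theta||^2 in closed form."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Synthetic data (assumed for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + 0.5 * rng.normal(size=40)

lams = [0.0, 1.0, 10.0, 1000.0]
losses, norms = [], []
for lam in lams:
    theta = ridge_fit(X, y, lam)
    losses.append(float(np.sum((X @ theta - y) ** 2)))  # original loss L(theta)
    norms.append(float(np.sum(theta ** 2)))             # penalty R(theta)
    print(f"lambda={lam:7.1f}  loss={losses[-1]:9.3f}  R(theta)={norms[-1]:7.3f}")
```

As $\lambda$ grows, the original loss can only go up and the penalty can only go down, which is the trade-off described above.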

Two common forms of regularization are L1 and L2 regularization, which we define below:

Name	Regularization term $R(\theta)$	Penalizes	Effect on weights
L1	$\lVert \theta \rVert_1$	The sum of absolute weights	Drives some weights exactly to zero, producing sparse solutions
L2	$\lVert \theta \rVert_2^2$	The sum of squared weights	Shrinks all weights toward zero without forcing them to vanish
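The "effect on weights" column can be seen in the simplest possible setting. The sketch below assumes a per-coordinate quadratic loss $\tfrac{1}{2}(\theta - w)^2$, so that each penalized problem has a closed-form minimizer: soft-thresholding for the L1 penalty $\lambda\lvert\theta\rvert$ and uniform scaling for the squared L2 penalty $\lambda\theta^2$. The vector `w`, the value of `lam`, and the helper names are hypothetical.

```python
import numpy as np

def l1_shrink(w, lam):
    """Minimizer of 0.5 * (theta - w)^2 + lam * |theta| (soft-thresholding)."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

def l2_shrink(w, lam):
    """Minimizer of 0.5 * (theta - w)^2 + lam * theta^2 (uniform scaling)."""
    return w / (1.0 + 2.0 * lam)

w = np.array([3.0, -0.4, 0.2, -1.5])  # assumed unregularized weights
lam = 0.5

theta_l1 = l1_shrink(w, lam)
theta_l2 = l2_shrink(w, lam)
print("L1:", theta_l1)  # entries with |w| <= lam become exactly zero
print("L2:", theta_l2)  # every entry shrinks, none becomes zero
```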

Translating Penalty Form to Constraint Form

To picture the geometry of L1/L2 regularization, we change the way we represent regularization. Earlier, we expressed regularization in terms of a loss function, known as penalty form:

\min_\theta L(\theta) + \lambda R(\theta)

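The penalty form is, up to a constant, the Lagrangian of a constrained optimization problem (see the appendix). We now express regularization as that constrained problem, known as the constraint form: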

\min_\theta L(\theta) \text{ s.t. } R(\theta) \leq t

where $t \geq 0$ is inversely related to $\lambda$: a larger $\lambda$ corresponds to a smaller $t$, that is, a tighter constraint.
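The correspondence between the two forms can be checked by hand in one dimension. The sketch below assumes the loss $L(\theta) = (\theta - a)^2$ and the penalty $R(\theta) = \theta^2$, where the constant `a` and the function names are illustrative choices. Setting the derivative of the penalty form to zero gives $\theta = a / (1 + \lambda)$, and the constraint form simply clips $a$ to the interval $[-\sqrt{t}, \sqrt{t}]$.

```python
import numpy as np

a = 3.0  # assumed unconstrained optimum of L(theta) = (theta - a)^2

def penalty_solution(lam):
    """Minimizer of (theta - a)^2 + lam * theta^2."""
    return a / (1.0 + lam)

def constraint_solution(t):
    """Minimizer of (theta - a)^2 subject to theta^2 <= t."""
    bound = np.sqrt(t)
    return float(np.clip(a, -bound, bound))

lams = [0.0, 0.5, 2.0, 9.0]
ts = []
for lam in lams:
    theta_pen = penalty_solution(lam)
    t = theta_pen ** 2  # the constraint level matching this lambda
    ts.append(t)
    theta_con = constraint_solution(t)
    print(f"lambda={lam:4.1f}  t={t:6.3f}  penalty={theta_pen:.3f}  constraint={theta_con:.3f}")
```

In this example each $\lambda$ has a matching $t$ that yields the same solution, and that $t$ shrinks as $\lambda$ grows.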

Visualizing the Geometry

With the problem in constraint form, we can visualize the regularization. The shaded region is the set of weights allowed by the constraint $R(\theta) \leq t$, and the concentric curves are level sets (contours) of the loss $L(\theta)$ centered on the unconstrained optimum $\hat\theta$. In two dimensions, the allowed region is a disk for L2 and a diamond for L1. The constrained solution $\theta^\star$ is the first point a growing loss contour touches as it expands out toward the constraint.

For a point on the constraint boundary to be the optimum, the loss gradient and the constraint gradient at that point have to point in the same direction (up to a sign). In L2, and along the flat edges of L1, the boundary is smooth, so each boundary point has a single gradient direction. For any one such point to be the optimum, the loss gradient there has to match that single direction exactly, which rarely happens.

A corner of the L1 diamond is different. The boundary is not smooth there, so no single gradient exists. Instead the corner has a whole set of subgradients consisting of every direction lying between the gradients of the two edges that meet at it. The corner is a candidate optimum as long as the loss gradient points in the same direction (up to a sign) as any direction in that set. Because the set spans a range of directions rather than a single one, a corner is far more likely to be the optimum.

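The corners of the L1 diamond lie on the coordinate axes, where at least one weight is exactly zero. This reasoning is why L1 prefers sparse weights while L2 prefers small weights.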
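The corner argument can be tested numerically in two dimensions. The sketch below makes a simplifying assumption that is not part of the discussion above: the loss is $L(\theta) = \lVert \theta - \hat\theta \rVert_2^2$, so its contours are circles and the constrained solution is the Euclidean projection of $\hat\theta$ onto the allowed region. The L1 projection uses the standard sort-based thresholding procedure. The function names, the radius of 1, and the random distribution of $\hat\theta$ are all illustrative choices.

```python
import numpy as np

def project_l2_ball(v, radius):
    """Closest point to v with Euclidean norm at most radius."""
    norm = np.linalg.norm(v)
    return v.copy() if norm <= radius else v * (radius / norm)

def project_l1_ball(v, radius):
    """Closest point to v with L1 norm at most radius (sort-based thresholding)."""
    abs_v = np.abs(v)
    if abs_v.sum() <= radius:
        return v.copy()
    u = np.sort(abs_v)[::-1]                       # magnitudes, descending
    css = np.cumsum(u)
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u * k > css - radius)[0][-1]  # last index kept nonzero
    tau = (css[rho] - radius) / (rho + 1)          # soft-threshold level
    return np.sign(v) * np.maximum(abs_v - tau, 0.0)

# Random unconstrained optima (assumed distribution, for illustration only).
rng = np.random.default_rng(0)
optima = rng.normal(scale=2.0, size=(10_000, 2))

l1_sparse = np.mean([np.any(project_l1_ball(v, 1.0) == 0.0) for v in optima])
l2_sparse = np.mean([np.any(project_l2_ball(v, 1.0) == 0.0) for v in optima])
print(f"fraction with an exactly-zero weight: L1={l1_sparse:.3f}, L2={l2_sparse:.3f}")
```

Under these assumptions, a sizable share of the L1 solutions land on a corner of the diamond, while the L2 solutions essentially never have a weight that is exactly zero.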

Appendix

Refresher of the Lagrangian

The Lagrangian function is a mathematical tool used to optimize an objective function given a set of constraints. Let's denote the function we want to maximize as $J(\mathbf{x})$, and our constraint as $g(\mathbf{x}) = k$, where $g$ is the constraint function and $k$ is a constant.

g(\mathbf{x}) = k \tag{3}

To find the maximum values of $J$ with respect to the constraint, we first search for points on the constraint surface where the value of $J$ remains unchanged as we take small steps along the surface from that point. Intuitively, searching for such points makes sense, as leading up to a maximum, the slope of the curve is positive. At the maximum, the slope is zero, which means $J$ remains unchanged for small steps around that point. Past the maximum, the slope of the curve becomes negative.

Now, to find points on the constraint surface where $J$ remains unchanged for small steps, we need the directional derivative of $J$ along the constraint surface to be zero at those points:

\nabla J(\mathbf{x}) \cdot \mathbf{t}(\mathbf{x}) = 0

where $\mathbf{t}(\mathbf{x})$ is any tangent vector of the constraint surface at point $\mathbf{x}$ . This equation literally looks for points such that $\nabla J$ at $\mathbf{x}$ has no component along any direction of motion on the constraint surface. In other words, at $\mathbf{x}$ , you cannot immediately increase $J$ by moving in any direction on the constraint surface.

Since $g$ remains unchanged along the constraint surface (our constraint is $g(\mathbf{x}) = k$ where $k$ is constant), its directional derivative along every tangent direction is zero, so $\nabla g$ is also perpendicular to every tangent direction. Since $\nabla J$ and $\nabla g$ are both perpendicular to every tangent direction, they must be parallel, that is, they point in the same direction up to sign and scale (assuming $\nabla g(\mathbf{x}) \neq 0$). This implies:

\nabla J(\mathbf{x}) = \lambda \nabla g(\mathbf{x}) \tag{4}

for some $\lambda \in \mathbb{R}$ .

You can now use equations (3) and (4) to construct a system of equations that helps you find a set of candidates for the maximum of $J(\mathbf{x})$ . However, not all candidate points are maxima. This is because other kinds of points such as minima, saddle points, etc. also have the property of having a zero directional derivative. Thus, to discover the maximum, you must compute the value of $J(\mathbf{x})$ for every candidate point, and the $\mathbf{x}$ that produces the maximum value of $J(\mathbf{x})$ is the optimal input.

Lagrangian Function

So, finding a set of candidates for the maximum of $J$ given our constraint $g$ amounts to finding points $\mathbf{x}$ that satisfy equations (3) and (4). The Lagrangian function provides a convenient way to package (3) and (4) together:

\mathcal{L}(\mathbf{x}, \lambda) = J(\mathbf{x}) - \lambda (g(\mathbf{x}) - k)

Now look at what happens when we set the partial derivatives of the Lagrangian equal to zero:

\begin{aligned} \frac{\partial \mathcal{L}}{\partial \lambda} &= -(g(\mathbf{x}) - k) \implies -(g(\mathbf{x}) - k) = 0 \implies g(\mathbf{x}) = k \\ \frac{\partial \mathcal{L}}{\partial \mathbf{x}} &= \nabla J(\mathbf{x}) - \lambda \nabla g(\mathbf{x}) \implies \nabla J(\mathbf{x}) - \lambda \nabla g(\mathbf{x}) = 0 \implies \nabla J(\mathbf{x}) = \lambda \nabla g(\mathbf{x}) \end{aligned}

These are exactly the equations (3) and (4) we had above. So effectively, finding candidates for the maximum of $J(\mathbf{x})$ is the same as solving $\nabla \mathcal{L}(\mathbf{x}, \lambda) = 0$ for $\mathbf{x}$ and $\lambda$ . Plug the candidate $\mathbf{x}$ points back into $J$ to find the $\mathbf{x}$ that produces the maximum $J$ .
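As a small check of this recipe, the sketch below works through an assumed example that is not from the text above: maximize $J(x, y) = x + y$ subject to $g(x, y) = x^2 + y^2 = 1$. Solving equations (3) and (4) by hand gives two candidates, $x = y = \pm 1/\sqrt{2}$. The code verifies both conditions at each candidate, picks the one with the larger $J$, and compares it against a brute-force scan of the constraint circle.

```python
import numpy as np

def J(p):
    return p[0] + p[1]

def grad_J(p):
    return np.array([1.0, 1.0])

def g(p):
    return p[0] ** 2 + p[1] ** 2

def grad_g(p):
    return 2.0 * np.asarray(p)

k = 1.0
s = 1.0 / np.sqrt(2.0)
# Candidates obtained by solving grad J = lam * grad g together with g = k by hand.
candidates = [np.array([s, s]), np.array([-s, -s])]

for p in candidates:
    lam = grad_J(p)[0] / grad_g(p)[0]  # multiplier implied by the first coordinate
    on_surface = np.isclose(g(p), k)                       # equation (3)
    parallel = np.allclose(grad_J(p), lam * grad_g(p))     # equation (4)
    print(p, "lam =", lam, "on surface:", on_surface, "parallel:", parallel)

# Not every candidate is a maximum, so compare J across candidates.
best = max(candidates, key=J)

# Brute-force check: evaluate J on a fine grid of points on the circle.
angles = np.linspace(0.0, 2.0 * np.pi, 10_001)
brute_best = J(np.stack([np.cos(angles), np.sin(angles)])).max()
print("best candidate:", best, "J =", J(best), "brute force max:", brute_best)
```

The second candidate satisfies the same two equations but is the constrained minimum, which is why the final comparison of $J$ values is needed.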

Constraints with Inequalities

Let's say we allow our constraint to take on the form:

g(\mathbf{x}) \leq k

At the solution, the constraint is either inactive or active:

Inactive ( $g(\mathbf{x}) < k$ ). The solution sits strictly inside the feasible region, so the constraint has breathing room and isn't doing any real work. The solution is just an unconstrained maximum that happens to be feasible, found by solving $\nabla J(\mathbf{x}) = 0$ and keeping the points that satisfy the constraint.
Active ( $g(\mathbf{x}) = k$ ). The solution sits on the boundary, so the inequality behaves like the equality constraint $g(\mathbf{x}) = k$ and we solve it with the Lagrangian from before.

The constrained optimum has to fall into one of these two cases, so we just gather the candidates from both: the feasible solutions of $\nabla J(\mathbf{x}) = 0$ and the solutions of the equality-constrained Lagrangian. Plug each candidate into $J$ and keep the one that produces the maximum value.
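The two-case procedure can be sketched for one assumed problem where both cases solve in closed form: maximize $J(\mathbf{x}) = -\lVert \mathbf{x} - \mathbf{c} \rVert_2^2$ subject to $g(\mathbf{x}) = \lVert \mathbf{x} \rVert_2^2 \leq k$. The function name `solve_disk_constrained` and the test points are hypothetical. In the inactive case, $\nabla J = 0$ gives $\mathbf{x} = \mathbf{c}$. In the active case, the Lagrangian conditions give the two boundary points $\pm\sqrt{k}\,\mathbf{c} / \lVert \mathbf{c} \rVert_2$.

```python
import numpy as np

def solve_disk_constrained(c, k):
    """Maximize J(x) = -||x - c||^2 subject to ||x||^2 <= k."""
    c = np.asarray(c, dtype=float)

    def J(x):
        return -np.sum((x - c) ** 2)

    candidates = []

    # Inactive case: grad J = 0 gives x = c; keep it only if it is feasible.
    if np.sum(c ** 2) <= k:
        candidates.append(c)

    # Active case: grad J = lam * grad g on the boundary ||x||^2 = k
    # gives x = +/- sqrt(k) * c / ||c||. (Skipped when c = 0, where the
    # feasible unconstrained maximum above is already the answer.)
    norm = np.linalg.norm(c)
    if norm > 0:
        for sign in (1.0, -1.0):
            candidates.append(sign * np.sqrt(k) * c / norm)

    # Plug every candidate into J and keep the best one.
    return max(candidates, key=J)

inside = solve_disk_constrained([0.5, 0.0], k=1.0)   # unconstrained optimum is feasible
outside = solve_disk_constrained([3.0, 4.0], k=1.0)  # constraint is active
print("inactive case:", inside)
print("active case:  ", outside)
```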