§ Machine Learning · Jun 2026

The Geometry Behind L1/L2 Regularization

L1 prefers sparse weights while L2 prefers small weights. We'll explore why this is, and how circles and squares help answer this question.

Published: June 20, 2026
Category: Machine Learning
Reading time: 7 min
Tags: regularization, optimization

Regularization is a common technique in machine learning for preventing overfitting. It penalizes overly complex models, encouraging them to find simpler, more generalizable patterns. It usually works by adding a regularization term to the loss function:

$$L_\lambda(\theta) = L(\theta) + \lambda R(\theta)$$

where $\theta$ denotes the model parameters, $L$ is the loss function, $\lambda \geq 0$ is the regularization strength, and $R$ is a regularization term that captures model complexity. $\lambda$ balances the goal of minimizing the loss against the goal of producing a simple model. A smaller $\lambda$ prioritizes minimizing the original loss, while a larger $\lambda$ prioritizes simple models (which can cause underfitting if $\lambda$ is too high).

Two common forms are L1 and L2 regularization, defined below:

| Name | Regularization term $R(\theta)$ | Penalizes | Effect on weights |
| --- | --- | --- | --- |
| L1 | $\lVert \theta \rVert_1$ | The sum of absolute weights | Drives some weights exactly to zero, producing sparse solutions |
| L2 | $\lVert \theta \rVert_2^2$ | The sum of squared weights | Shrinks all weights toward zero without forcing them to vanish |
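The two effects can be seen in a one-dimensional sketch. It assumes a toy quadratic loss $\frac{1}{2}(\theta - a)^2$ per weight, which is not part of the article; for that loss both penalized problems have well-known closed-form minimizers. The function names and the sample values are hypothetical.

```python
def l1_solution(a, lam):
    """Minimizer of 0.5*(theta - a)**2 + lam*abs(theta) (soft-thresholding)."""
    if a > lam:
        return a - lam
    if a < -lam:
        return a + lam
    return 0.0  # any |a| <= lam is snapped exactly to zero


def l2_solution(a, lam):
    """Minimizer of 0.5*(theta - a)**2 + lam*theta**2 (proportional shrinkage)."""
    return a / (1.0 + 2.0 * lam)


# Hypothetical unregularized weights and regularization strength.
unregularized = [3.0, -0.4, 0.2, -1.5]
lam = 0.5

l1_weights = [l1_solution(a, lam) for a in unregularized]
l2_weights = [l2_solution(a, lam) for a in unregularized]

print("L1:", l1_weights)
print("L2:", l2_weights)
```

Under this assumed loss, L1 subtracts a fixed amount from every weight and zeroes the ones smaller than $\lambda$ in magnitude, while L2 rescales every weight by the same factor and so never reaches zero unless the weight was already zero.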

Translating Penalty Form to Constraint Form

To picture the geometry of L1/L2 regularization, we change how we represent it. Earlier, we expressed regularization as an extra term in the loss function, known as the penalty form:

$$\min_\theta L(\theta) + \lambda R(\theta)$$

Now, we express regularization as a constrained optimization problem, known as the constraint form:

$$\min_\theta L(\theta) \text{ s.t. } R(\theta) \leq t$$

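where $t$ is inversely related to $\lambda$: a larger $\lambda$ corresponds to a smaller budget $t$. The two forms are linked through the Lagrangian (see the appendix): the penalty form is the Lagrangian of the constrained problem with the constant term $\lambda t$ dropped.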
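As a small worked example of this inverse relationship, assume a one-dimensional toy loss $L(\theta) = \frac{1}{2}(\theta - a)^2$ with $R(\theta) = \theta^2$ (both chosen here for illustration, not taken from the article). Setting the derivative of the penalty form to zero gives the minimizer, and evaluating $R$ at that minimizer gives the matching budget $t$:

$$
\begin{aligned}
\frac{d}{d\theta}\left[\tfrac{1}{2}(\theta - a)^2 + \lambda \theta^2\right] &= (\theta - a) + 2\lambda\theta = 0 \implies \theta^\star = \frac{a}{1 + 2\lambda} \\
t = R(\theta^\star) &= \frac{a^2}{(1 + 2\lambda)^2}
\end{aligned}
$$

In this sketch, $\lambda = 0$ gives $t = a^2$, so the constraint only just contains the unconstrained optimum, and $t$ shrinks toward $0$ as $\lambda$ grows.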

Visualizing the Geometry

With the problem in constraint form, we can visualize the regularization. The shaded region is the set of weights allowed by the constraint $R(\theta) \leq t$, and the concentric curves are level sets (contours) of the loss $L(\theta)$ centered on the unconstrained optimum $\hat\theta$. The constrained solution $\theta^\star$ is the first point that a growing loss contour touches as it expands outward from $\hat\theta$ toward the constraint region.

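[Figure: two panels with axes $\theta_1$ and $\theta_2$, each showing the unconstrained optimum $\hat\theta$ and the constrained solution $\theta^\star$. Left panel, L1: $\lVert \theta \rVert_1 \leq t$, with $\theta^\star_1 = 0$ (sparse). Right panel, L2: $\lVert \theta \rVert_2 \leq t$, with $\theta^\star_1, \theta^\star_2$ both nonzero.]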

For a point on the constraint boundary to be the optimum, the loss gradient and the constraint gradient at that point have to point in the same direction (up to a sign). On the L2 circle, and along the flat edges of the L1 diamond, the boundary is smooth, so each point has a single gradient direction. Any particular point there, such as a sparse point where the circle crosses an axis, is the optimum only if the loss gradient matches that one direction exactly, which rarely happens.

A corner of the L1 diamond is different. The boundary is not smooth there, so no single gradient exists. Instead the corner has a whole set of subgradients consisting of every direction lying between the gradients of the two edges that meet at it. The corner is a candidate optimum as long as the loss gradient points in the same direction (up to a sign) as any direction in that set. Because the set spans a range of directions rather than a single one, a corner is far more likely to be the optimum.

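The corners of the L1 diamond lie on the coordinate axes, where at least one weight is exactly zero, so landing on a corner means a sparse solution. The L2 circle has no corners, so its solution generally shrinks every weight without zeroing any. This reasoning is why L1 prefers sparse weights while L2 prefers small weights.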
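The corner argument can be checked numerically with a rough sketch. It assumes a toy loss $L(\theta) = \frac{1}{2}\lVert \theta - \hat\theta \rVert_2^2$, for which the constrained solution is the Euclidean projection of $\hat\theta$ onto the constraint region. The L1 projection uses a standard sort-based method. The function names, the budget `t=0.5`, and the Gaussian sampling of $\hat\theta$ are all assumptions made for illustration.

```python
import math
import random


def project_l2(v, t):
    """Closest point to v inside the L2 ball of radius t."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm <= t:
        return list(v)  # already feasible, constraint inactive
    return [x * t / norm for x in v]


def project_l1(v, t):
    """Closest point to v inside the L1 ball of radius t (sort-based method)."""
    if sum(abs(x) for x in v) <= t:
        return list(v)  # already feasible, constraint inactive
    u = sorted((abs(x) for x in v), reverse=True)
    cumulative, tau = 0.0, 0.0
    for j, u_j in enumerate(u, start=1):
        cumulative += u_j
        candidate = (cumulative - t) / j
        if u_j - candidate > 0:
            tau = candidate  # keep the threshold from the last valid j
    # Soft-threshold every coordinate by tau, preserving signs.
    return [math.copysign(max(abs(x) - tau, 0.0), x) for x in v]


def sparse_fraction(project, t=0.5, trials=2000, seed=0):
    """Fraction of random 2D optima whose constrained solution has a zero weight."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        theta_hat = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
        theta_star = project(theta_hat, t)
        hits += any(x == 0.0 for x in theta_star)
    return hits / trials


l1_fraction = sparse_fraction(project_l1)
l2_fraction = sparse_fraction(project_l2)
print(f"fraction with a zero weight  L1: {l1_fraction:.2f}  L2: {l2_fraction:.2f}")
```

Under these assumptions, a sizable share of the L1 solutions land exactly on a corner, while the L2 solutions essentially never contain an exact zero, since the L2 projection only rescales $\hat\theta$.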

Appendix

Refresher of the Lagrangian

The Lagrangian function is a mathematical tool used to optimize an objective function given a set of constraints. Let's denote the function we want to maximize as $J(\mathbf{x})$, and our constraint as $g(\mathbf{x}) = k$ where $k$ is a constant.

$$g(\mathbf{x}) = k \tag{3}$$

To find the maximum values of $J$ with respect to the constraint, we first search for points on the constraint surface where the value of $J$ remains unchanged as we take small steps along the surface from that point. Intuitively, searching for such points makes sense, as leading up to a maximum, the slope of the curve is positive. At the maximum, the slope is zero, which means $J$ remains unchanged for small steps around that point. Past the maximum, the slope of the curve becomes negative.

Now, to find points on the constraint surface where $J$ remains unchanged for small steps, we need the directional derivative of $J$ along the constraint surface to be zero at those points:

$$\nabla J(\mathbf{x}) \cdot \mathbf{t}(\mathbf{x}) = 0$$

where $\mathbf{t}(\mathbf{x})$ is any tangent vector of the constraint surface at point $\mathbf{x}$. This equation literally looks for points such that $\nabla J$ at $\mathbf{x}$ has no component along any direction of motion on the constraint surface. In other words, at $\mathbf{x}$, you cannot immediately increase $J$ by moving in any direction on the constraint surface.

Since $g$ remains unchanged along the constraint surface (our constraint is $g(\mathbf{x}) = k$ where $k$ is constant), its directional derivative along every tangent direction is zero, so $\nabla g$ is perpendicular to every tangent direction. At a candidate point, $\nabla J$ is also perpendicular to every tangent direction. With a single constraint and $\nabla g \neq \mathbf{0}$, only one line is perpendicular to the whole surface, so the two gradients point in the same direction (up to a sign). This implies:

$$\nabla J(\mathbf{x}) = \lambda \nabla g(\mathbf{x}) \tag{4}$$

for some $\lambda \in \mathbb{R}$.

You can now use equations (3) and (4) to construct a system of equations that helps you find a set of candidates for the maximum of $J(\mathbf{x})$. However, not all candidate points are maxima. This is because other kinds of points such as minima, saddle points, etc. also have the property of having a zero directional derivative. Thus, to discover the maximum, you must compute the value of $J(\mathbf{x})$ for every candidate point, and the $\mathbf{x}$ that produces the maximum value of $J(\mathbf{x})$ is the optimal input.

Lagrangian Function

So, finding a set of candidates for the maximum of $J$ given our constraint $g$ amounts to finding points $\mathbf{x}$ that satisfy equations (3) and (4). The Lagrangian function provides a convenient way to package (3) and (4) together:

$$\mathcal{L}(\mathbf{x}, \lambda) = J(\mathbf{x}) - \lambda (g(\mathbf{x}) - k)$$

Look at what happens when we set the partial derivatives of the Lagrangian equal to zero:

$$
\begin{aligned}
\frac{\partial \mathcal{L}}{\partial \lambda} &= -(g(\mathbf{x}) - k) \implies -(g(\mathbf{x}) - k) = 0 \implies g(\mathbf{x}) = k \\
\frac{\partial \mathcal{L}}{\partial \mathbf{x}} &= \nabla J(\mathbf{x}) - \lambda \nabla g(\mathbf{x}) \implies \nabla J(\mathbf{x}) - \lambda \nabla g(\mathbf{x}) = 0 \implies \nabla J(\mathbf{x}) = \lambda \nabla g(\mathbf{x})
\end{aligned}
$$

These are exactly the equations (3) and (4) we had above. So effectively, finding candidates for the maximum of $J(\mathbf{x})$ is the same as solving $\nabla \mathcal{L}(\mathbf{x}, \lambda) = 0$ for $\mathbf{x}$ and $\lambda$. Plug the candidate $\mathbf{x}$ points back into $J$ to find the $\mathbf{x}$ that produces the maximum $J$.
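As a small worked example (the functions are chosen here for illustration and are not from the original text), take $J(x, y) = x + y$ with the constraint $g(x, y) = x^2 + y^2 = 1$:

$$
\begin{aligned}
\mathcal{L}(x, y, \lambda) &= x + y - \lambda (x^2 + y^2 - 1) \\
\frac{\partial \mathcal{L}}{\partial x} &= 1 - 2\lambda x = 0, \quad \frac{\partial \mathcal{L}}{\partial y} = 1 - 2\lambda y = 0 \implies x = y = \frac{1}{2\lambda} \\
\frac{\partial \mathcal{L}}{\partial \lambda} &= -(x^2 + y^2 - 1) = 0 \implies \frac{1}{2\lambda^2} = 1 \implies \lambda = \pm \frac{1}{\sqrt{2}}
\end{aligned}
$$

This yields two candidates, $\left(\tfrac{1}{\sqrt{2}}, \tfrac{1}{\sqrt{2}}\right)$ with $J = \sqrt{2}$ and $\left(-\tfrac{1}{\sqrt{2}}, -\tfrac{1}{\sqrt{2}}\right)$ with $J = -\sqrt{2}$. Both satisfy the stationarity conditions, but only the first is the maximum; the second is the constrained minimum, which is why each candidate has to be plugged back into $J$.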

Constraints with Inequalities

Let's say we allow our constraint to take on the form:

$$g(\mathbf{x}) \leq k$$

At the solution, the constraint is either inactive or active:

  1. Inactive ($g(\mathbf{x}) < k$). The solution sits strictly inside the feasible region, so the constraint has breathing room and isn't doing any real work. The solution is just an unconstrained maximum that happens to be feasible, found by solving $\nabla J(\mathbf{x}) = 0$ and keeping the points that satisfy the constraint.
  2. Active ($g(\mathbf{x}) = k$). The solution sits on the boundary, so the inequality behaves like the equality constraint $g(\mathbf{x}) = k$ and we solve it with the Lagrangian from before.

The constrained optimum has to fall into one of these two cases, so we just gather the candidates from both: the feasible solutions of $\nabla J(\mathbf{x}) = 0$ and the solutions of the equality-constrained Lagrangian. Plug each candidate into $J$ and keep the one that produces the maximum value.
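The two-case procedure can be sketched in code for a toy problem. It assumes the objective $J(x, y) = -\left((x - a)^2 + (y - b)^2\right)$ and the constraint $g(x, y) = x^2 + y^2 \leq k$, neither of which comes from the text above. For this choice both cases can be solved by hand, and the function names are hypothetical.

```python
import math


def J(x, y, a, b):
    """Toy objective: negative squared distance to (a, b), maximal at (a, b)."""
    return -((x - a) ** 2 + (y - b) ** 2)


def constrained_max(a, b, k=1.0):
    """Maximize J subject to x**2 + y**2 <= k by gathering candidates from both cases."""
    candidates = []

    # Inactive case: grad J = 0 has the single solution (a, b);
    # keep it only if it satisfies the constraint.
    if a * a + b * b <= k:
        candidates.append((a, b))

    # Active case: grad J = lambda * grad g together with g = k gives
    # the two boundary points +/- sqrt(k) * (a, b) / ||(a, b)||.
    r = math.hypot(a, b)
    if r > 0:
        s = math.sqrt(k) / r
        candidates.append((s * a, s * b))
        candidates.append((-s * a, -s * b))

    # Plug every candidate into J and keep the best one.
    return max(candidates, key=lambda p: J(p[0], p[1], a, b))


print(constrained_max(3.0, 4.0))   # unconstrained optimum is infeasible
print(constrained_max(0.3, 0.4))   # unconstrained optimum is feasible
```

When $(a, b)$ lies outside the disk, only the boundary candidates exist and the nearer one wins. When it lies inside, the unconstrained maximum is feasible and beats both boundary candidates.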