§ Mathematics · May 2026

PCA is the Answer to a Constrained Optimization

Eigenvectors of the covariance matrix aren't a coincidence. They fall out of maximizing variance under a unit-norm constraint.

Published
May 8, 2026
Category
Mathematics
Reading time
7 min
Tags
linear-algebrastatisticsoptimizationcalculus

We explored Principal Component Analysis (PCA) from a linear algebra perspective in a previous post. I recommend reading that article if you want a refresher on what PCA is, or if you are unfamiliar with the relationship between singular value decomposition and PCA. This article focuses on the optimization side of PCA, walking through how to derive PCA using the Lagrangian.

Why is Variance Important in PCA?

As a reminder, given some mean-centered data matrix XcX_c (where each row is a sample), PCA uses the following matrix decomposition:

Xc=UΣVTX_c = U \Sigma V^T

where URm×mU \in \mathbb{R}^{m \times m}, ΣRm×n\Sigma \in \mathbb{R}^{m \times n}, VRn×nV \in \mathbb{R}^{n \times n}. Furthermore, VV is a matrix whose columns are eigenvectors of XcTXcX_c^T X_c, and these eigenvectors point in the directions of maximum variance in the underlying data XcX_c since 1NXcTXc\frac{1}{N} X_c^T X_c is the covariance matrix.

Remember that when compressing each sample in XcX_c into a lower-dimensional space Rk\mathbb{R}^k, the convention is to right-multiply the centered data matrix XcX_c by the first kk columns of VV. The reason we do this is that we want to capture the parts that make each data point distinct from the others. Thus, we want to capture the value of each data point along the directions of maximum variance, so it’s rather fitting that the columns of VV happen to point in exactly those directions.

Derivation of PCA Using the Lagrangian

The constrained problem we wish to solve is:

maxwVar(Xcw) s.t. w2=1\max_{\mathbf{w}} \operatorname{Var}(X_c \mathbf{w}) \text{ s.t. } \|\mathbf{w}\|^2 = 1

Plugging in the formula for variance:

Var(Xcw)=1N(Xcw)T(Xcw)=1NwTXcTXcw\operatorname{Var}(X_c \mathbf{w}) = \frac{1}{N}(X_c \mathbf{w})^T(X_c \mathbf{w}) = \frac{1}{N} \mathbf{w}^T X_c^T X_c \mathbf{w}

where NN is the number of data points. 1NXcTXc\frac{1}{N} X_c^T X_c is just the covariance matrix since XcX_c is mean-centered.

Thus, the objective function we wish to maximize is J(w)=1NwTXcTXcwJ(\mathbf{w}) = \frac{1}{N}\mathbf{w}^T X_c^T X_c \mathbf{w}. Our constraint is g(w)=w2=wTw=1g(\mathbf{w}) = \|\mathbf{w}\|^2 = \mathbf{w}^T \mathbf{w} = 1. Expressing this in the form of the Lagrangian:

L(w,λ)=1NwTXcTXcwλ(wTw1)\mathcal{L}(\mathbf{w}, \lambda) = \frac{1}{N}\mathbf{w}^T X_c^T X_c \mathbf{w} - \lambda(\mathbf{w}^T \mathbf{w} - 1)

Now to solve, we set the partial derivatives of the Lagrangian equal to zero. We use the fact wwTAw=2Aw\nabla_\mathbf{w} \mathbf{w}^T A \mathbf{w} = 2A\mathbf{w} for symmetric AA.

wL(w,λ)=2NXcTXcw2λw=0    1NXcTXcw=λwλL(w,λ)=(wTw1)=0    wTw=1\begin{align} \nabla_\mathbf{w} \mathcal{L}(\mathbf{w}, \lambda) &= \frac{2}{N} X_c^T X_c \mathbf{w} - 2 \lambda \mathbf{w} = 0 \implies \frac{1}{N}X_c^T X_c \mathbf{w} = \lambda \mathbf{w} \\ \nabla_\lambda \mathcal{L}(\mathbf{w}, \lambda) &= -(\mathbf{w}^T \mathbf{w} - 1) = 0 \implies \mathbf{w}^T \mathbf{w} = 1 \end{align}

Notice that in equation (1), w\mathbf{w} is just an eigenvector of the covariance matrix 1NXcTXc\frac{1}{N} X_c^T X_c.

Lastly, we evaluate our candidate points above to find the maximum:

J(w)=1NwTXcTXcw=wTλw=λJ(\mathbf{w}) = \frac{1}{N} \mathbf{w}^T X_c^T X_c \mathbf{w} = \mathbf{w}^T \lambda \mathbf{w} = \lambda

So, the eigenvector w\mathbf{w} with the highest eigenvalue maximizes the variance. That’s the whole derivation! The eigenvectors of the covariance matrix 1NXcTXc\frac{1}{N} X_c^T X_c point in the directions of maximum variance in XcX_c.

Appendix

Refresher of the Lagrangian

The Lagrangian function is a mathematical tool used to optimize an objective function given a set of constraints. Let's denote the function we want to maximize as J(x)J(\mathbf{x}), and our constraint function as g(x)=kg(\mathbf{x}) = k where kk is a constant.

g(x)=k(3)g(\mathbf{x}) = k \tag{3}

To find the maximum values of JJ with respect to the constraint, we first search for points on the constraint surface where the value of JJ remains unchanged as we take small steps along the surface from that point. Intuitively, searching for such points makes sense, as leading up to a maximum, the slope of the curve is positive. At the maximum, the slope is zero, which means JJ remains unchanged for small steps around that point. Past the maximum, the slope of the curve becomes negative.

Now, to find points on the constraint surface where JJ remains unchanged for small steps, we need the directional derivative of JJ along the constraint surface to be zero at those points:

J(x)t(x)=0\nabla J(\mathbf{x}) \cdot \mathbf{t}(\mathbf{x}) = 0

where t(x)\mathbf{t}(\mathbf{x}) is any tangent vector of the constraint surface at point x\mathbf{x}. This equation literally looks for points such that J\nabla J at x\mathbf{x} has no component along any direction of motion on the constraint surface. In other words, at x\mathbf{x}, you cannot immediately increase JJ by moving in any direction on the constraint surface.

Given that gg remains unchanged along the constraint surface (since our constraint is g(x)=kg(\mathbf{x}) = k where kk is constant), g\nabla g is perpendicular to every tangent direction because the gradient points in the direction of greatest change. Since J\nabla J and g\nabla g are both perpendicular to every tangent direction, they both point in the same direction (up to a sign). This implies:

J(x)=λg(x)(4)\nabla J(\mathbf{x}) = \lambda \nabla g(\mathbf{x}) \tag{4}

for some λR\lambda \in \mathbb{R}.

You can now use equations (3) and (4) to construct a system of equations that helps you find a set of candidates for the maximum of J(x)J(\mathbf{x}). However, not all candidate points are maxima. This is because other kinds of points such as minima, saddle points, etc. also have the property of having a zero directional derivative. Thus, to discover the maximum, you must compute the value of J(x)J(\mathbf{x}) for every candidate point, and the x\mathbf{x} that produces the maximum value of J(x)J(\mathbf{x}) is the optimal input.

Lagrangian Function

So, finding a set of candidates for the maximum of JJ given our constraint gg amounts to finding points x\mathbf{x} that satisfy equations (3) and (4). The Lagrangian function provides a convenient way to package (3) and (4) together:

L(x,λ)=J(x)λ(g(x)k)\mathcal{L}(\mathbf{x}, \lambda) = J(\mathbf{x}) - \lambda (g(\mathbf{x}) - k)

Look at what happens when we set the partial derivatives of the Lagrangian equal to zero though.

Lλ=(g(x)k)    (g(x)k)=0    g(x)=kLx=J(x)λg(x)    J(x)λg(x)=0    J(x)=λg(x)\begin{aligned} \frac{\partial \mathcal{L}}{\partial \lambda} &= -(g(\mathbf{x}) - k) \implies -(g(\mathbf{x}) - k) = 0 \implies g(\mathbf{x}) = k \\ \frac{\partial \mathcal{L}}{\partial \mathbf{x}} &= \nabla J(\mathbf{x}) - \lambda \nabla g(\mathbf{x}) \implies \nabla J(\mathbf{x}) - \lambda \nabla g(\mathbf{x}) = 0 \implies \nabla J(\mathbf{x}) = \lambda \nabla g(\mathbf{x}) \end{aligned}

These are exactly the equations (3) and (4) we had above. So effectively, finding candidates for the maximum of J(x)J(\mathbf{x}) is the same as solving L(x,λ)=0\nabla \mathcal{L}(\mathbf{x}, \lambda) = 0 for x\mathbf{x} and λ\lambda. Plug the candidate x\mathbf{x} points back into JJ to find the x\mathbf{x} that produces the maximum JJ.

Constraints with Inequalities

Let's say we allow our constraint to take on the form:

g(x)kg(\mathbf{x}) \leq k

At the solution, the constraint is either inactive or active:

  1. Inactive (g(x)<kg(\mathbf{x}) < k). The solution sits strictly inside the feasible region, so the constraint has breathing room and isn't doing any real work. The solution is just an unconstrained maximum that happens to be feasible, found by solving J(x)=0\nabla J(\mathbf{x}) = 0 and keeping the points that satisfy the constraint.
  2. Active (g(x)=kg(\mathbf{x}) = k). The solution sits on the boundary, so the inequality behaves like the equality constraint g(x)=kg(\mathbf{x}) = k and we solve it with the Lagrangian from before.

The constrained optimum has to fall into one of these two cases, so we just gather the candidates from both: the feasible solutions of J(x)=0\nabla J(\mathbf{x}) = 0 and the solutions of the equality-constrained Lagrangian. Plug each candidate into JJ and keep the one that produces the maximum value.