PCA is the Answer to a Constrained Optimization

We explored Principal Component Analysis (PCA) from a linear algebra perspective in a previous post. I recommend reading that article if you want a refresher on what PCA is, or if you are unfamiliar with the relationship between singular value decomposition and PCA. This article focuses on the optimization side of PCA, walking through how to derive PCA using the Lagrangian.

Why is Variance Important in PCA?

As a reminder, given some mean-centered data matrix $X_c$ (where each row is a sample), PCA uses the following matrix decomposition:

X_c = U \Sigma V^T

where $U \in \mathbb{R}^{m \times m}$, $\Sigma \in \mathbb{R}^{m \times n}$, and $V \in \mathbb{R}^{n \times n}$, with $m$ the number of samples (written $N$ elsewhere in this article) and $n$ the number of features. Furthermore, the columns of $V$ are eigenvectors of $X_c^T X_c$. Since the covariance matrix $\frac{1}{N} X_c^T X_c$ is just a rescaling of $X_c^T X_c$, it has the same eigenvectors, and these eigenvectors point in the directions of maximum variance in the underlying data $X_c$.

Remember that when compressing each sample in $X_c$ into a lower-dimensional space $\mathbb{R}^k$ , the convention is to right-multiply the centered data matrix $X_c$ by the first $k$ columns of $V$ . The reason we do this is that we want to capture the parts that make each data point distinct from the others. Thus, we want to capture the value of each data point along the directions of maximum variance, so it’s rather fitting that the columns of $V$ happen to point in exactly those directions.
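As a rough illustration of the projection described above, here is a minimal NumPy sketch. The toy data, the random seed, and the names `X`, `Z`, and `k` are assumptions made for the example. It uses the thin SVD (`full_matrices=False`), so `U` comes back as $m \times n$ rather than the $m \times m$ shape written earlier; `V` is unaffected.

```python
import numpy as np

# Toy data (an assumption for illustration): 200 samples, 5 correlated features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

# Mean-center each feature (column) so that X_c matches the article's setup.
X_c = X - X.mean(axis=0)
N = X_c.shape[0]

# Thin SVD: X_c = U @ diag(S) @ Vt. The rows of Vt are the columns of V.
U, S, Vt = np.linalg.svd(X_c, full_matrices=False)
V = Vt.T

# Compress each sample to k dimensions by right-multiplying
# X_c with the first k columns of V.
k = 2
Z = X_c @ V[:, :k]

# The columns of V are eigenvectors of X_c^T X_c with eigenvalues S**2.
gram = X_c.T @ X_c
eig_ok = np.allclose(gram @ V, V * S**2)

print(Z.shape, eig_ok)
```

Each column of `Z` holds the coordinate of every sample along one principal direction, and the variance of column `j` works out to `S[j]**2 / N`.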

Derivation of PCA Using the Lagrangian

The constrained problem we wish to solve is:

\max_{\mathbf{w}} \operatorname{Var}(X_c \mathbf{w}) \text{ s.t. } \|\mathbf{w}\|^2 = 1

Plugging in the formula for variance:

\operatorname{Var}(X_c \mathbf{w}) = \frac{1}{N}(X_c \mathbf{w})^T(X_c \mathbf{w}) = \frac{1}{N} \mathbf{w}^T X_c^T X_c \mathbf{w}

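where $N$ is the number of data points. The first equality holds because $X_c$ is mean-centered, so the projected values $X_c \mathbf{w}$ also have zero mean and their variance is simply their mean square. For the same reason, $\frac{1}{N} X_c^T X_c$ is just the covariance matrix.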
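The variance identity above is easy to check numerically. The sketch below assumes random toy data and an arbitrary unit vector `w`; none of these values come from the article.

```python
import numpy as np

# Toy data (an assumption for illustration): 500 samples, 3 features.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
X_c = X - X.mean(axis=0)
N = X_c.shape[0]

# An arbitrary direction, normalized to unit length.
w = rng.normal(size=3)
w /= np.linalg.norm(w)

# Left-hand side: variance of the projected data (np.var divides by N by default).
var_direct = np.var(X_c @ w)

# Right-hand side: the quadratic form with the covariance matrix.
cov = X_c.T @ X_c / N
var_quadratic = w @ cov @ w

print(np.isclose(var_direct, var_quadratic))
```

Note that this uses the $\frac{1}{N}$ normalization from the article, which corresponds to `np.cov(X_c, rowvar=False, bias=True)`; NumPy's default for `np.cov` divides by $N - 1$ instead.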

Thus, the objective function we wish to maximize is $J(\mathbf{w}) = \frac{1}{N}\mathbf{w}^T X_c^T X_c \mathbf{w}$ . Our constraint is $g(\mathbf{w}) = \|\mathbf{w}\|^2 = \mathbf{w}^T \mathbf{w} = 1$ . Expressing this in the form of the Lagrangian:

\mathcal{L}(\mathbf{w}, \lambda) = \frac{1}{N}\mathbf{w}^T X_c^T X_c \mathbf{w} - \lambda(\mathbf{w}^T \mathbf{w} - 1)

Now to solve, we set the partial derivatives of the Lagrangian equal to zero. We use the fact $\nabla_\mathbf{w} \mathbf{w}^T A \mathbf{w} = 2A\mathbf{w}$ for symmetric $A$ .

\begin{align} \nabla_\mathbf{w} \mathcal{L}(\mathbf{w}, \lambda) &= \frac{2}{N} X_c^T X_c \mathbf{w} - 2 \lambda \mathbf{w} = 0 \implies \frac{1}{N}X_c^T X_c \mathbf{w} = \lambda \mathbf{w} \\ \nabla_\lambda \mathcal{L}(\mathbf{w}, \lambda) &= -(\mathbf{w}^T \mathbf{w} - 1) = 0 \implies \mathbf{w}^T \mathbf{w} = 1 \end{align}

Notice that in equation (1), $\mathbf{w}$ is just an eigenvector of the covariance matrix $\frac{1}{N} X_c^T X_c$ .
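To make the stationarity conditions concrete, the following sketch plugs every eigenpair of a toy covariance matrix into both partial derivatives of the Lagrangian and checks that they vanish. The data and the helper names `grad_w` and `grad_lam` are hypothetical, chosen only for this example.

```python
import numpy as np

# Toy data (an assumption for illustration): 300 samples, 4 correlated features.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 4)) @ rng.normal(size=(4, 4))
X_c = X - X.mean(axis=0)
N = X_c.shape[0]
cov = X_c.T @ X_c / N

# eigh handles symmetric matrices and returns orthonormal eigenvectors
# as columns, with eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(cov)

def grad_w(w, lam):
    """Gradient of the Lagrangian with respect to w."""
    return 2 * cov @ w - 2 * lam * w

def grad_lam(w):
    """Partial derivative of the Lagrangian with respect to lambda."""
    return -(w @ w - 1.0)

# Every (eigenvector, eigenvalue) pair should zero out both derivatives.
residuals = [
    max(np.linalg.norm(grad_w(eigvecs[:, i], eigvals[i])),
        abs(grad_lam(eigvecs[:, i])))
    for i in range(cov.shape[0])
]
max_residual = max(residuals)
print(max_residual)  # close to zero, up to floating-point error
```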

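Lastly, we evaluate the objective at our candidate points to find the maximum. Substituting equation (1) and then the constraint $\mathbf{w}^T \mathbf{w} = 1$:

J(\mathbf{w}) = \frac{1}{N} \mathbf{w}^T X_c^T X_c \mathbf{w} = \mathbf{w}^T (\lambda \mathbf{w}) = \lambda \mathbf{w}^T \mathbf{w} = \lambda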

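So the variance at each candidate equals its eigenvalue, and the unit eigenvector $\mathbf{w}$ with the largest eigenvalue maximizes the variance. That’s the whole derivation! The top eigenvector of the covariance matrix $\frac{1}{N} X_c^T X_c$ points in the direction of maximum variance in $X_c$, and each subsequent eigenvector points in the direction of maximum variance among the directions orthogonal to the ones before it.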
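One way to sanity-check this conclusion is to compare the top eigenvector against a large batch of random unit vectors. The sketch below assumes toy data and 10,000 random directions; it is a spot check, not a proof.

```python
import numpy as np

# Toy data (an assumption for illustration): 400 samples, 3 features,
# mixed so that the variance differs noticeably across directions.
rng = np.random.default_rng(3)
mix = np.array([[3.0, 0.0, 0.0],
                [1.0, 1.0, 0.0],
                [0.0, 0.5, 0.2]])
X = rng.normal(size=(400, 3)) @ mix
X_c = X - X.mean(axis=0)
cov = X_c.T @ X_c / X_c.shape[0]

# eigh returns eigenvalues in ascending order, so the last pair is the top one.
eigvals, eigvecs = np.linalg.eigh(cov)
w_top, lam_top = eigvecs[:, -1], eigvals[-1]

def J(w):
    """Variance of the data projected onto the unit vector w."""
    return w @ cov @ w

# Variance along 10,000 random unit directions, one per row of W.
W = rng.normal(size=(10_000, 3))
W /= np.linalg.norm(W, axis=1, keepdims=True)
random_vars = np.einsum("ij,jk,ik->i", W, cov, W)

print(np.isclose(J(w_top), lam_top))   # J at the top eigenvector equals its eigenvalue
print(random_vars.max() <= lam_top)    # no random direction does better
```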

Appendix

Refresher of the Lagrangian

The Lagrangian function is a mathematical tool used to optimize an objective function given a set of constraints. Let's denote the function we want to maximize as $J(\mathbf{x})$ , and our constraint function as $g(\mathbf{x}) = k$ where $k$ is a constant.

g(\mathbf{x}) = k \tag{3}

To find the maximum of $J$ subject to the constraint, we first search for points on the constraint surface where the value of $J$ remains unchanged as we take small steps along the surface from that point. Intuitively, searching for such points makes sense: leading up to a maximum, the slope of the curve is positive. At the maximum, the slope is zero, which means $J$ remains unchanged for small steps around that point. Past the maximum, the slope of the curve becomes negative.

Now, to find points on the constraint surface where $J$ remains unchanged for small steps, we need the directional derivative of $J$ along the constraint surface to be zero at those points:

\nabla J(\mathbf{x}) \cdot \mathbf{t}(\mathbf{x}) = 0

where $\mathbf{t}(\mathbf{x})$ is any tangent vector of the constraint surface at point $\mathbf{x}$ . This equation literally looks for points such that $\nabla J$ at $\mathbf{x}$ has no component along any direction of motion on the constraint surface. In other words, at $\mathbf{x}$ , you cannot immediately increase $J$ by moving in any direction on the constraint surface.

Since $g$ remains unchanged along the constraint surface (our constraint is $g(\mathbf{x}) = k$ where $k$ is constant), the directional derivative of $g$ along every tangent direction is also zero, that is, $\nabla g(\mathbf{x}) \cdot \mathbf{t}(\mathbf{x}) = 0$. So $\nabla J$ and $\nabla g$ are both perpendicular to every tangent direction. With a single constraint and $\nabla g(\mathbf{x}) \neq 0$, only one direction is perpendicular to all tangent directions, so the two gradients must be parallel (possibly pointing opposite ways). This implies:

\nabla J(\mathbf{x}) = \lambda \nabla g(\mathbf{x}) \tag{4}

for some $\lambda \in \mathbb{R}$ .

You can now use equations (3) and (4) to construct a system of equations that helps you find a set of candidates for the maximum of $J(\mathbf{x})$ . However, not all candidate points are maxima. This is because other kinds of points such as minima, saddle points, etc. also have the property of having a zero directional derivative. Thus, to discover the maximum, you must compute the value of $J(\mathbf{x})$ for every candidate point, and the $\mathbf{x}$ that produces the maximum value of $J(\mathbf{x})$ is the optimal input.

Lagrangian Function

So, finding a set of candidates for the maximum of $J$ given our constraint $g$ amounts to finding points $\mathbf{x}$ that satisfy equations (3) and (4). The Lagrangian function provides a convenient way to package (3) and (4) together:

\mathcal{L}(\mathbf{x}, \lambda) = J(\mathbf{x}) - \lambda (g(\mathbf{x}) - k)

Look at what happens when we set the partial derivatives of the Lagrangian equal to zero:

\begin{aligned} \frac{\partial \mathcal{L}}{\partial \lambda} &= -(g(\mathbf{x}) - k) \implies -(g(\mathbf{x}) - k) = 0 \implies g(\mathbf{x}) = k \\ \frac{\partial \mathcal{L}}{\partial \mathbf{x}} &= \nabla J(\mathbf{x}) - \lambda \nabla g(\mathbf{x}) \implies \nabla J(\mathbf{x}) - \lambda \nabla g(\mathbf{x}) = 0 \implies \nabla J(\mathbf{x}) = \lambda \nabla g(\mathbf{x}) \end{aligned}

These are exactly the equations (3) and (4) we had above. So effectively, finding candidates for the maximum of $J(\mathbf{x})$ is the same as solving $\nabla \mathcal{L}(\mathbf{x}, \lambda) = 0$ for $\mathbf{x}$ and $\lambda$ . Plug the candidate $\mathbf{x}$ points back into $J$ to find the $\mathbf{x}$ that produces the maximum $J$ .
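Here is a small worked example of that recipe, using a toy problem that is not from the article: maximize $J(x, y) = x + y$ subject to $x^2 + y^2 = 1$. Solving $\nabla J = \lambda \nabla g$ by hand gives $x = y = \frac{1}{2\lambda}$, and the constraint then forces $\lambda = \pm \frac{1}{\sqrt{2}}$. The sketch below checks those two candidates and picks the best one; the function names are hypothetical.

```python
import numpy as np

k = 1.0

def J(p):
    """Toy objective: J(x, y) = x + y."""
    return p[0] + p[1]

def g(p):
    """Toy constraint function: g(x, y) = x^2 + y^2."""
    return p[0] ** 2 + p[1] ** 2

def grad_L(p, lam):
    """Gradient of L = J - lam * (g - k) with respect to (x, y, lam)."""
    grad_J = np.array([1.0, 1.0])
    grad_g = 2 * p
    return np.append(grad_J - lam * grad_g, -(g(p) - k))

# Candidates derived by hand: x = y = 1 / (2 * lam), lam = +-1 / sqrt(2).
candidates = []
for lam in (1 / np.sqrt(2), -1 / np.sqrt(2)):
    p = np.array([1 / (2 * lam), 1 / (2 * lam)])
    candidates.append((p, lam))

# Both candidates are stationary points of the Lagrangian ...
stationarity = [np.linalg.norm(grad_L(p, lam)) for p, lam in candidates]

# ... but only one is the maximum, so evaluate J at each and keep the best.
best_p, best_lam = max(candidates, key=lambda c: J(c[0]))
print(best_p, J(best_p))
```

The other candidate, with negative $\lambda$, is the constrained minimum, which is why evaluating $J$ at every candidate is a necessary last step.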

Constraints with Inequalities

Let's say we allow our constraint to take on the form:

g(\mathbf{x}) \leq k

At the solution, the constraint is either inactive or active:

Inactive ( $g(\mathbf{x}) < k$ ). The solution sits strictly inside the feasible region, so the constraint has breathing room and isn't doing any real work. The solution is just an unconstrained maximum that happens to be feasible, found by solving $\nabla J(\mathbf{x}) = 0$ and keeping the points that satisfy the constraint.
Active ( $g(\mathbf{x}) = k$ ). The solution sits on the boundary, so the inequality behaves like the equality constraint $g(\mathbf{x}) = k$ and we solve it with the Lagrangian from before.

The constrained optimum has to fall into one of these two cases, so we just gather the candidates from both: the feasible solutions of $\nabla J(\mathbf{x}) = 0$ and the solutions of the equality-constrained Lagrangian. Plug each candidate into $J$ and keep the one that produces the maximum value.
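The two-case procedure can be sketched on another toy problem (again an assumption, not from the article): maximize $J(\mathbf{x}) = -\|\mathbf{x} - \mathbf{c}\|^2$ with $\mathbf{c} = (2, 1)$ subject to $\|\mathbf{x}\|^2 \leq 1$. The only unconstrained stationary point is $\mathbf{x} = \mathbf{c}$, and solving the equality-constrained Lagrangian by hand gives the two boundary candidates $\mathbf{x} = \pm \mathbf{c} / \|\mathbf{c}\|$.

```python
import numpy as np

c = np.array([2.0, 1.0])
k = 1.0

def J(p):
    """Toy objective: negative squared distance to c."""
    return -np.sum((p - c) ** 2)

def g(p):
    """Toy constraint function: squared norm, with constraint g(p) <= k."""
    return p @ p

candidates = []

# Inactive case: grad J = 0 only at p = c. Keep it only if it is strictly feasible.
if g(c) < k:
    candidates.append(c)

# Active case: stationary points of the Lagrangian on the boundary g(p) = k,
# derived by hand as p = +-sqrt(k) * c / ||c||.
for sign in (1.0, -1.0):
    candidates.append(sign * np.sqrt(k) * c / np.linalg.norm(c))

# Evaluate J at every candidate and keep the best.
best = max(candidates, key=J)
print(best, J(best))
```

Here the unconstrained maximum lies outside the feasible region, so the inactive case contributes no candidates and the answer sits on the boundary at the point closest to $\mathbf{c}$.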