We explored Principal Component Analysis (PCA) from a linear algebra perspective in a previous post. I recommend reading that article if you want a refresher on what PCA is, or if you are unfamiliar with the relationship between singular value decomposition and PCA. This article focuses on the optimization side of PCA, walking through how to derive PCA using the Lagrangian.
Why is Variance Important in PCA?
As a reminder, given some mean-centered data matrix (where each row is a sample), PCA uses the following matrix decomposition:
where , , . Furthermore, is a matrix whose columns are eigenvectors of , and these eigenvectors point in the directions of maximum variance in the underlying data since is the covariance matrix.
Remember that when compressing each sample in into a lower-dimensional space , the convention is to right-multiply the centered data matrix by the first columns of . The reason we do this is that we want to capture the parts that make each data point distinct from the others. Thus, we want to capture the value of each data point along the directions of maximum variance, so it’s rather fitting that the columns of happen to point in exactly those directions.
Derivation of PCA Using the Lagrangian
The constrained problem we wish to solve is:
Plugging in the formula for variance:
where is the number of data points. is just the covariance matrix since is mean-centered.
Thus, the objective function we wish to maximize is . Our constraint is . Expressing this in the form of the Lagrangian:
Now to solve, we set the partial derivatives of the Lagrangian equal to zero. We use the fact for symmetric .
Notice that in equation (1), is just an eigenvector of the covariance matrix .
Lastly, we evaluate our candidate points above to find the maximum:
So, the eigenvector with the highest eigenvalue maximizes the variance. That’s the whole derivation! The eigenvectors of the covariance matrix point in the directions of maximum variance in .
Appendix
Refresher of the Lagrangian
The Lagrangian function is a mathematical tool used to optimize an objective function given a set of constraints. Let's denote the function we want to maximize as , and our constraint function as where is a constant.
To find the maximum values of with respect to the constraint, we first search for points on the constraint surface where the value of remains unchanged as we take small steps along the surface from that point. Intuitively, searching for such points makes sense, as leading up to a maximum, the slope of the curve is positive. At the maximum, the slope is zero, which means remains unchanged for small steps around that point. Past the maximum, the slope of the curve becomes negative.
Now, to find points on the constraint surface where remains unchanged for small steps, we need the directional derivative of along the constraint surface to be zero at those points:
where is any tangent vector of the constraint surface at point . This equation literally looks for points such that at has no component along any direction of motion on the constraint surface. In other words, at , you cannot immediately increase by moving in any direction on the constraint surface.
Given that remains unchanged along the constraint surface (since our constraint is where is constant), is perpendicular to every tangent direction because the gradient points in the direction of greatest change. Since and are both perpendicular to every tangent direction, they both point in the same direction (up to a sign). This implies:
for some .
You can now use equations (3) and (4) to construct a system of equations that helps you find a set of candidates for the maximum of . However, not all candidate points are maxima. This is because other kinds of points such as minima, saddle points, etc. also have the property of having a zero directional derivative. Thus, to discover the maximum, you must compute the value of for every candidate point, and the that produces the maximum value of is the optimal input.
Lagrangian Function
So, finding a set of candidates for the maximum of given our constraint amounts to finding points that satisfy equations (3) and (4). The Lagrangian function provides a convenient way to package (3) and (4) together:
Look at what happens when we set the partial derivatives of the Lagrangian equal to zero though.
These are exactly the equations (3) and (4) we had above. So effectively, finding candidates for the maximum of is the same as solving for and . Plug the candidate points back into to find the that produces the maximum .
Constraints with Inequalities
Let's say we allow our constraint to take on the form:
At the solution, the constraint is either inactive or active:
- Inactive (). The solution sits strictly inside the feasible region, so the constraint has breathing room and isn't doing any real work. The solution is just an unconstrained maximum that happens to be feasible, found by solving and keeping the points that satisfy the constraint.
- Active (). The solution sits on the boundary, so the inequality behaves like the equality constraint and we solve it with the Lagrangian from before.
The constrained optimum has to fall into one of these two cases, so we just gather the candidates from both: the feasible solutions of and the solutions of the equality-constrained Lagrangian. Plug each candidate into and keep the one that produces the maximum value.