Principal Component Analysis (PCA) is a valuable statistical method used to take high-dimensional data, capture the "most important parts" of it, and represent them in a lower-dimensional form. This is widely useful for applications like machine learning, where the curse of dimensionality significantly increases the amount of data you need.
Eigenvectors
From linear algebra, we say that matrices encode linear transformations of space. That means given a matrix $A \in \mathbb{R}^{m \times n}$, it tells you where each vector $x \in \mathbb{R}^n$ lands in $\mathbb{R}^m$ after the transformation. A linear transformation is geometrically interpreted as one where grid lines remain parallel and evenly spaced, and the origin is left fixed in place.
A linear transformation can do three things: rotate, scale, and shear. For a given transformation encoded by $A$, there are a few vectors that might only scale (not rotate or shear) after the transformation is applied. These special vectors are called the eigenvectors of $A$. The amount they scale is known as the eigenvalue (often denoted as $\lambda$). Formally, $v$ is an eigenvector of $A$ if the following equation holds for some $\lambda$ (and conventionally $v \neq 0$):

$$Av = \lambda v$$
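As a quick sanity check (a minimal NumPy sketch; the matrix here is an arbitrary example, not one from the text), we can ask for the eigenvectors of a small matrix and verify that each one only gets scaled:

```python
import numpy as np

# Arbitrary 2x2 example matrix, chosen purely for illustration.
A = np.array([[4.0, 1.0],
              [2.0, 3.0]])

# np.linalg.eig returns the eigenvalues and a matrix whose columns are eigenvectors.
eigenvalues, eigenvectors = np.linalg.eig(A)

for i, lam in enumerate(eigenvalues):
    v = eigenvectors[:, i]
    # A @ v equals lam * v: the eigenvector is only scaled, never rotated.
    assert np.allclose(A @ v, lam * v)
```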
Eigendecomposition
Given a matrix $A$, we can represent it in terms of its eigenvectors/eigenvalues if the following conditions hold:
- $A$ is square ($A \in \mathbb{R}^{n \times n}$)
- $n$ linearly independent eigenvectors of $A$ exist
If the above hold, we can factor $A$ into the following:

$$A = Q \Lambda Q^{-1}$$

where $Q$ is a matrix containing each of the eigenvectors of $A$ in its columns, and $\Lambda$ is a diagonal matrix where $\Lambda_{ii}$ is the eigenvalue for the $i$-th column of $Q$.
Intuitively, this entire process can be viewed as a change of basis. Imagine we take a vector $x$ and transform it by $A$:

$$Ax = Q \Lambda Q^{-1} x$$
Multiplying by $Q^{-1}$ translates $x$ from being represented by the standard basis vectors to being represented in terms of our eigenvectors. This means the first coordinate of $Q^{-1}x$ scales the first eigenvector, the second coordinate scales the second eigenvector, and so forth.
Since we know that $A$ only scales our eigenvectors, $\Lambda$ is diagonal. $\Lambda$ applies the scaling factor for each eigenvector, where $\lambda_1$ scales the first coordinate, $\lambda_2$ scales the second coordinate, and so on.
Finally, multiplying by $Q$ transforms the vector given by $\Lambda Q^{-1} x$ from being represented in terms of our eigenvectors back to being represented in terms of our original standard basis vectors.
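A tiny numerical check of this change-of-basis picture (a NumPy sketch, again with an arbitrary example matrix): applying $Q^{-1}$, then $\Lambda$, then $Q$ gives the same result as applying $A$ directly.

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])       # arbitrary diagonalizable example
x = np.array([1.0, -2.0])        # arbitrary input vector

eigvals, Q = np.linalg.eig(A)    # columns of Q are the eigenvectors
Lam = np.diag(eigvals)

# Change basis into eigenvector coordinates, scale, then change basis back.
assert np.allclose(Q @ (Lam @ (np.linalg.inv(Q) @ x)), A @ x)
```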
Given this explanation, can you see why the two constraints above had to hold? The answer is that non-square matrices don't have eigenvectors. If $A \in \mathbb{R}^{m \times n}$ with $m \neq n$, then $Av \in \mathbb{R}^m$ while $\lambda v \in \mathbb{R}^n$, so it is impossible to find a $v$ and $\lambda$ that satisfy the equality since the two sides live in different dimensions. Additionally, $n$ linearly independent eigenvectors must exist for there to be a valid inverse of $Q$. The intuition is that having a matrix with dependent columns squishes space to lower dimensions, and there is no inverse matrix that can undo the transformation from higher to lower dimensionality space.
One useful property of eigendecomposition is it lets you efficiently compute the powers of a matrix:

$$A^2 = (Q \Lambda Q^{-1})(Q \Lambda Q^{-1}) = Q \Lambda^2 Q^{-1}$$

and in general

$$A^n = Q \Lambda^n Q^{-1}$$
Taking the n-th power of a diagonal matrix just involves taking the n-th power of each of the diagonal terms. This is much more efficient to compute.
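To make the efficiency claim concrete, here is a short sketch (NumPy, same arbitrary example matrix as above) computing $A^{10}$ via $Q \Lambda^{10} Q^{-1}$ and checking it against repeated multiplication:

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])       # any diagonalizable matrix works here

eigvals, Q = np.linalg.eig(A)

n = 10
# Powers of a diagonal matrix are just element-wise powers of its diagonal.
A_to_the_n = Q @ np.diag(eigvals ** n) @ np.linalg.inv(Q)

assert np.allclose(A_to_the_n, np.linalg.matrix_power(A, n))
```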
Singular Value Decomposition (SVD)
SVD operates on the key insight that you can express every linear transformation as the combination of applying a rotation to the input space, scaling along the resulting axes, and applying another rotation to land in the output space. Given any matrix $A \in \mathbb{R}^{m \times n}$, SVD says we can decompose it into the following:

$$A = U \Sigma V^T$$

where $U \in \mathbb{R}^{m \times m}$, $\Sigma \in \mathbb{R}^{m \times n}$, $V \in \mathbb{R}^{n \times n}$, and $U$ and $V$ are orthogonal matrices. Since $U$ and $V$ are orthogonal, we know that they encode rotations/reflections. $\Sigma$ is a diagonal matrix whose entries (the singular values) encode the stretch along each axis.
In order to see how SVD works geometrically, imagine the unit circle in $\mathbb{R}^2$ and transform it by $A$. Functionally, this just means taking a bunch of vectors that represent the circle, transforming each by $A$, and observing the result.
In order for us to decompose a matrix into a combination of rotation, stretch, rotation, we must choose orthogonal vectors such that they remain orthogonal after the transformation (if this weren't the case, it would be impossible to find a $U$ that was just a rotation/reflection matrix). These vectors happen to be the ones that stretch the most. Intuitively, this is because vectors are pulled unevenly towards the strongest stretching directions, so vectors already along the strongest stretching directions preserve relative angles with each other.
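Here is a quick numerical check of that claim (a NumPy sketch with an arbitrary matrix): the eigenvectors of $A^T A$ are orthogonal, and their images under $A$ remain orthogonal.

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [0.0, 2.0]])             # arbitrary example matrix

# A^T A is symmetric, so eigh returns an orthonormal set of eigenvectors.
_, eigvecs = np.linalg.eigh(A.T @ A)
v1, v2 = eigvecs[:, 0], eigvecs[:, 1]

assert np.isclose(v1 @ v2, 0.0)               # orthogonal before the transformation...
assert np.isclose((A @ v1) @ (A @ v2), 0.0)   # ...and still orthogonal after it
```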
To find which vectors stretch the most, let's try to find the unit vector $v$ that maximizes the value of $\|Av\|$:

$$\|Av\|^2 = (Av)^T (Av) = v^T (A^T A) v$$

This quantity is maximized over unit vectors exactly when $v$ is the eigenvector of $A^T A$ with the largest eigenvalue (a standard Rayleigh quotient result).
This implies that the eigenvectors of $A^T A$ are the directions that stretch the most. The reason eigenvectors of $A^T A$ get stretched the most by $A$ is because eigenvectors only get stretched (not rotated), so there is no "wasted" effort going into rotation. All of the matrix transformation goes towards stretching. These eigenvectors represent $V$, and you order the eigenvectors in $V$ such that each subsequent column has an eigenvalue less than or equal to the previous column's.
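As a rough empirical check (a NumPy sketch; the matrix and the random sampling are purely illustrative), no random unit vector gets stretched more than the top eigenvector of $A^T A$:

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[3.0, 1.0],
              [0.0, 2.0]])                    # arbitrary example matrix

# Eigenvectors of A^T A, sorted by descending eigenvalue.
eigvals, eigvecs = np.linalg.eigh(A.T @ A)
v_top = eigvecs[:, np.argmax(eigvals)]        # claimed direction of maximum stretch

# The stretch of many random unit vectors never exceeds the stretch of v_top.
dirs = rng.normal(size=(1000, 2))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
max_random_stretch = np.linalg.norm(dirs @ A.T, axis=1).max()

assert np.linalg.norm(A @ v_top) >= max_random_stretch
```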
To uncover $U$ and $\Sigma$, observe the following property:

$$AV = U \Sigma V^T V = U \Sigma$$

by the fact that $V^T V = I$ as $V$ is an orthogonal matrix.
But since we said $\Sigma$ is just a diagonal matrix (as it only scales $U$), we can easily compute both $U$ and $\Sigma$ by taking the result of $AV$. Then make each column of the result unit length and store it as the $i$-th column of $U$. The $i$-th diagonal entry of $\Sigma$ becomes the reciprocal of the scaling factor we used to scale the $i$-th column of $AV$ to unit length (in other words, that column's original length, $\sigma_i = \|Av_i\|$).
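Putting this construction together, here is a sketch (NumPy, with an arbitrary invertible example matrix so that no singular value is zero) that builds $V$ from the eigenvectors of $A^T A$, forms $AV$, and reads off $U$ and $\Sigma$ from its columns:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [0.0, 2.0]])                 # arbitrary invertible example

# V: eigenvectors of A^T A, ordered by descending eigenvalue.
eigvals, eigvecs = np.linalg.eigh(A.T @ A)
order = np.argsort(eigvals)[::-1]
V = eigvecs[:, order]

# A V = U Sigma, so each column of A V is sigma_i * u_i.
AV = A @ V
sigmas = np.linalg.norm(AV, axis=0)        # column lengths are the singular values
U = AV / sigmas                            # normalize each column to unit length
Sigma = np.diag(sigmas)

assert np.allclose(U @ Sigma @ V.T, A)     # we have recovered A = U Sigma V^T
```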
So effectively, we can capture the transformation of $A$ applied to $x$ ($Ax = U \Sigma V^T x$) by performing the following (a short code sketch follows the list):
- Multiply $x$ by $V^T$. This expresses $x$ in terms of the input vectors that get stretched the most. Again, this can be seen as a change of basis where multiplying by $V^T$ expresses $x$ in terms of the principal stretching directions.
- Apply the stretch encoded by $\Sigma$. $\sigma_1$ scales the first principal stretch direction, $\sigma_2$ scales the second, and so forth.
- Perform one final rotation by $U$ to express the scaled result of step 2 in the output space. This converts the vector from being represented in terms of the principal stretch directions back into the output space.
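Continuing the sketch above (it reuses the `U`, `Sigma`, and `V` computed in the previous snippet), we can apply the three steps one at a time and confirm they reproduce $Ax$:

```python
x = np.array([1.0, 2.0])     # arbitrary input vector

step1 = V.T @ x              # 1. express x in the principal stretching directions
step2 = Sigma @ step1        # 2. stretch each coordinate by its singular value
step3 = U @ step2            # 3. rotate into the output space

assert np.allclose(step3, A @ x)
```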
Principal Component Analysis (PCA)
PCA is one of the most widely used applications of SVD. PCA is a statistical technique used to (lossily) compress a high-dimensional dataset into fewer dimensions.
To understand how PCA works, let's switch gears. Instead of viewing a matrix as a linear transformation, let's view $X \in \mathbb{R}^{n \times d}$ as a convenient way to package data. Each of the $n$ samples is a row in $X$ and the $d$ columns represent features.
The first step in PCA is to center the data by subtracting the mean for each feature (columns of $X$). The intuition is that PCA cares about the relationship amongst the variables, not the actual values of the variables themselves. Denote the centered matrix as $X_c$.
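Centering is a one-liner in NumPy (a sketch; `X` here is just randomly generated stand-in data with samples as rows and features as columns):

```python
import numpy as np

# Stand-in data: 100 samples, 3 features, deliberately not zero-mean.
X = np.random.default_rng(0).normal(loc=5.0, scale=2.0, size=(100, 3))

X_c = X - X.mean(axis=0)                    # subtract each feature's (column) mean

assert np.allclose(X_c.mean(axis=0), 0.0)   # every feature now has mean ~0
```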
Once the data is centered, a natural next question is: how do we extract the most significant patterns in the data to represent the original dataset in lower dimensions? That's where SVD comes into play. Recall the formula for SVD, applied to our centered matrix:

$$X_c = U \Sigma V^T$$

where $U \in \mathbb{R}^{n \times n}$, $\Sigma \in \mathbb{R}^{n \times d}$, $V \in \mathbb{R}^{d \times d}$, and $U$ and $V$ are orthogonal matrices.
Remember how we said that $V$ is just a collection of the eigenvectors of $X_c^T X_c$? Well, when we view $X_c$ as a centered data matrix, $X_c^T X_c$ happens to be proportional to the covariance matrix, because it dots every feature with every other feature. The eigenvectors of the covariance matrix, sorted by eigenvalue, give us an ordered list of directions such that projecting the data onto each eigenvector captures the largest remaining variance not already captured by the previous directions. Each of these directions is known as a "principal component." Sneak peek: Knowing the directions of maximum variance is helpful because we want to capture the "most important" parts of the data. What makes each data point unique amongst all the others is where it falls along the lines of maximal variance.
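To see the covariance connection numerically (a sketch continuing from the centered `X_c` above), $X_c^T X_c / (n-1)$ matches NumPy's sample covariance, and its top eigenvector matches the first right singular vector of $X_c$ up to sign:

```python
n = X_c.shape[0]
cov = X_c.T @ X_c / (n - 1)                     # sample covariance of the features
assert np.allclose(cov, np.cov(X_c, rowvar=False))

# The right singular vectors of X_c are the eigenvectors of X_c^T X_c,
# i.e. the principal components (each determined only up to sign).
_, _, Vt = np.linalg.svd(X_c, full_matrices=False)
eigvals, eigvecs = np.linalg.eigh(cov)
top_pc = eigvecs[:, np.argmax(eigvals)]

assert np.isclose(abs(top_pc @ Vt[0]), 1.0)     # same direction up to sign
```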
$U$ is composed of the eigenvectors of the matrix $X_c X_c^T$, which is the covariance of the rows. To see this, take a look at this proof:

$$X_c X_c^T = (U \Sigma V^T)(U \Sigma V^T)^T = U \Sigma V^T V \Sigma^T U^T = U (\Sigma \Sigma^T) U^T = U (\Sigma \Sigma^T) U^{-1}$$

(the last step uses $U^T = U^{-1}$, since $U$ is orthogonal).
Looking at the final form, this is exactly the formula for eigendecomposition. This implies that the columns of $U$ are the eigenvectors of $X_c X_c^T$. In the context of PCA, we say that $U$ helps us combine the principal components in $V$ to recover the original samples. Concretely, the first row of $U \Sigma$ tells us how much of each of the principal components given by $V$ to use to recover the first sample, the second row of $U \Sigma$ tells us how much of each principal component to use to recover the second sample, and so forth.
$\Sigma$ once again corresponds to the stretching factors. $\Sigma$ is organized so that higher singular values $\sigma_i$ occur first.
To perform compression, you choose the top $k$ columns of $U$, $\Sigma$, and $V$ and represent them as $U_k \in \mathbb{R}^{n \times k}$, $\Sigma_k \in \mathbb{R}^{k \times k}$, and $V_k \in \mathbb{R}^{d \times k}$. You can then either do:

$$X_k = X_c V_k$$

or

$$X_k = U_k \Sigma_k$$

The first multiplication projects each datapoint onto the most important principal components. This is useful because we retain the parts of the data that account for the most variance. $U_k \Sigma_k$ is equivalent to the first form, which you can see by right-multiplying both sides of $X_c = U \Sigma V^T$ by $V_k$.
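A sketch of the compression step itself (NumPy, continuing with the centered `X_c` from above; `k` is however many components you choose to keep), showing that the two forms agree:

```python
k = 2                                     # number of principal components to keep

U, s, Vt = np.linalg.svd(X_c, full_matrices=False)   # X_c = U @ np.diag(s) @ Vt

U_k = U[:, :k]                            # top-k left singular vectors
Sigma_k = np.diag(s[:k])                  # top-k singular values
V_k = Vt[:k].T                            # top-k principal components as columns

# Both forms give the same k-dimensional representation of every sample.
assert np.allclose(X_c @ V_k, U_k @ Sigma_k)
```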
Application to MNIST
Here's a visualization of performing PCA (with $k = 2$) on the MNIST dataset. MNIST is a collection of images of handwritten digits from 0-9. You'll see that clusters emerge for each digit, showing that PCA helps capture the internal structure of the data.
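Here is one way such a plot could be produced (a hedged sketch: it assumes scikit-learn and matplotlib are installed and uses `fetch_openml("mnist_784")` to download MNIST; the original figure may have been generated differently):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml

# Download MNIST: 70,000 images of 28x28 = 784 pixels each.
mnist = fetch_openml("mnist_784", version=1, as_frame=False)
X = mnist.data.astype(np.float64)
y = mnist.target.astype(int)

# Subsample to keep the SVD quick; this is purely for speed.
rng = np.random.default_rng(0)
idx = rng.choice(len(X), size=10_000, replace=False)
X, y = X[idx], y[idx]

# Center the data and project onto the top-2 principal components via SVD.
X_c = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(X_c, full_matrices=False)
X_2d = X_c @ Vt[:2].T

# Scatter plot colored by digit label; rough clusters emerge per digit.
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap="tab10", s=2, alpha=0.5)
plt.colorbar(label="digit")
plt.xlabel("principal component 1")
plt.ylabel("principal component 2")
plt.show()
```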