
Eigenvectors → SVD → PCA

How these fundamental pieces of math are connected.

Published
April 8, 2026
Category
Mathematics
Reading time
9 min
Tags
linear algebra, statistics

Principal Component Analysis (PCA) is a valuable statistical method used to take high-dimensional data, capture the “most important parts” of the data, and represent them (lossily) in fewer dimensions. This is widely useful for applications like machine learning, where the curse of dimensionality significantly increases the amount of data you need.

Eigenvectors

From linear algebra, we say that matrices encode linear transformations of space. That means given a matrix $A \in \mathbb{R}^{m \times n}$, it tells you where each vector $\vec{v} \in \mathbb{R}^n$ lands in $\mathbb{R}^m$ after the transformation. A linear transformation is geometrically interpreted as one where grid lines remain parallel and evenly spaced, and the origin is left fixed in place.

A linear transformation can do three things: rotate, scale, and shear. For a given transformation encoded by $A$, there are a few vectors that might only scale (not rotate or shear) after the transformation is applied. These special vectors are called the eigenvectors of $A$. The amount they scale is known as the eigenvalue (often denoted as $\lambda$). Formally, $\vec{v} \in \mathbb{R}^n$ is an eigenvector of $A$ if $A\vec{v} = \lambda \vec{v}$ holds for some $\lambda \in \mathbb{R}$ (and conventionally $\vec{v} \neq \vec{0}$).

[Figure: eigenvectors $\vec{v}_1$ and $\vec{v}_2$ before and after the transformation $A\vec{v}$; afterwards they become $\lambda_1 \vec{v}_1$ and $\lambda_2 \vec{v}_2$.]
$$A = \begin{bmatrix} 3 & 1 \\ 1 & 3 \end{bmatrix}, \quad \lambda_1 = 4, \quad \lambda_2 = 2$$
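As a quick sanity check, here is a minimal NumPy sketch (the library choice and variable names are mine, not from the original post) that computes the eigenvalues and eigenvectors of the example matrix above and verifies that each eigenvector is only scaled by $A$:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0]])

# np.linalg.eig returns the eigenvalues and a matrix whose columns are the eigenvectors.
eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)  # 4 and 2 (order is not guaranteed)

for i, lam in enumerate(eigenvalues):
    v = eigenvectors[:, i]
    # Each eigenvector is only scaled by A, never rotated: A v = lambda v.
    assert np.allclose(A @ v, lam * v)
```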

Eigendecomposition

Given a matrix $A \in \mathbb{R}^{m \times n}$, we can represent it in terms of its eigenvectors/eigenvalues if the following conditions hold:

  1. $m = n$
  2. $n$ independent eigenvectors exist

If the above hold, we can factor $A$ into the following:

$$A = V \Lambda V^{-1}$$

where $V$ is a matrix containing each of the eigenvectors in its columns, and $\Lambda$ is a diagonal matrix whose $i$-th diagonal entry $\lambda_i$ is the eigenvalue for the i-th column of $V$.

Intuitively, this entire process can be viewed as a change of basis. Imagine we take a vector $\vec{b} \in \mathbb{R}^n$ and transform it by $A$:

$$A\vec{b} = V \Lambda V^{-1} \vec{b}$$

Multiplying by $V^{-1}$ translates $\vec{b}$ from being represented by the standard basis vectors to being represented in terms of our eigenvectors. This means the first coordinate of $V^{-1}\vec{b} \in \mathbb{R}^n$ scales the first eigenvector, the second coordinate scales the second eigenvector, and so forth.

Since we know that $A$ only scales our eigenvectors, $\Lambda$ is diagonal. $\Lambda$ applies the scaling factor for each eigenvector, where $\lambda_1$ scales the first coordinate, $\lambda_2$ scales the second coordinate, and so on.

Finally, multiplying by $V$ transforms the vector given by $\Lambda V^{-1}\vec{b}$ from being represented in terms of our eigenvectors to being represented back in terms of our original standard basis vectors.
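To make this concrete, here is a tiny sketch (using the same example matrix from earlier and an arbitrary vector $\vec{b}$) that applies each factor in turn and confirms the result matches $A\vec{b}$:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0]])
b = np.array([2.0, -1.0])

eigenvalues, V = np.linalg.eig(A)
Lambda = np.diag(eigenvalues)

coords = np.linalg.inv(V) @ b   # express b in terms of the eigenvectors
scaled = Lambda @ coords        # scale each coordinate by its eigenvalue
result = V @ scaled             # translate back to the standard basis

assert np.allclose(result, A @ b)
```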

Given this explanation, can you see why the two constraints above had to hold? The answer is that non-square matrices don’t have eigenvectors. If $A \in \mathbb{R}^{m \times n}$ with $m \neq n$, then $A\vec{v} \in \mathbb{R}^m$ while $\lambda\vec{v} \in \mathbb{R}^n$, so it is impossible to find a $\vec{v}$ and $\lambda$ that satisfy the equality since the two sides live in different dimensions. Additionally, $n$ independent eigenvectors must exist for there to be a valid inverse of $V$. The intuition is that a matrix with dependent columns squishes space down to a lower dimension, and there is no inverse matrix that can undo a transformation from a higher- to a lower-dimensional space.

One useful property of eigendecomposition is that it lets you efficiently compute powers of a matrix:

$$A^2 = (V \Lambda V^{-1})(V \Lambda V^{-1}) = V \Lambda (V^{-1} V) \Lambda V^{-1} = V \Lambda^2 V^{-1}$$

and in general

$$A^n = (V \Lambda V^{-1})^n = V \Lambda^n V^{-1}$$

Taking the n-th power of a diagonal matrix just involves taking the n-th power of each of the diagonal terms. This is much more efficient to compute.
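For instance, a short sketch of computing $A^n$ through the eigendecomposition (again with the example matrix from above), checked against direct repeated multiplication:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0]])
n = 10

eigenvalues, V = np.linalg.eig(A)

# The n-th power of the diagonal matrix is just the n-th power of each eigenvalue.
Lambda_n = np.diag(eigenvalues ** n)
A_n = V @ Lambda_n @ np.linalg.inv(V)

assert np.allclose(A_n, np.linalg.matrix_power(A, n))
```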

Singular Value Decomposition (SVD)

SVD operates on the key insight that you can express every linear transformation as the combination of applying a rotation to the input space, scaling along the resulting axes, and applying another rotation to land in the output space. Given any matrix $A \in \mathbb{R}^{m \times n}$, SVD says we can decompose it into the following:

$$A = U \Sigma V^T$$

where $U \in \mathbb{R}^{m \times m}$, $\Sigma \in \mathbb{R}^{m \times n}$, $V \in \mathbb{R}^{n \times n}$, and $U$ and $V$ are orthogonal matrices. Since $U$ and $V$ are orthogonal, we know that they encode rotations/reflections. $\Sigma$ is a (rectangular) diagonal matrix whose entries $\sigma_i$ (the singular values) encode the stretch along each axis.

In order to see how SVD works geometrically, imagine the unit circle in $\mathbb{R}^2$ and transform it by $A$. Functionally, this just means taking a bunch of vectors that represent the circle, transforming each by $A$, and observing the result.

[Figure: the unit circle, then its image after applying $V^\top$, then $\Sigma$, then $U$.]
$$V^\top \text{ rotates by } {-30^\circ}, \quad \Sigma = \begin{bmatrix} 2.5 & 0 \\ 0 & 1.25 \end{bmatrix}, \quad U \text{ rotates by } 50^\circ$$

In order for us to decompose a matrix into a combination of rotation, stretch, rotation, we must choose orthogonal vectors $\vec{v}_i \in \mathbb{R}^n$ such that they remain orthogonal after the transformation (if this weren’t the case, it would be impossible to find a $U$ that was just a rotation/reflection matrix). These vectors happen to be the ones that stretch the most. Intuitively, this is because vectors are pulled unevenly towards the strongest stretching directions, so vectors already along the strongest stretching directions preserve relative angles with each other.

To find which vectors stretch the most, let’s try to find what maximizes the value of $\lVert Ax \rVert^2$:

$$\lVert Ax \rVert^2 = (Ax)^T (Ax) = x^T (A^T A) x$$

This implies that the eigenvectors of $A^T A$ are the directions that stretch $Ax$ the most: since $A^T A$ is symmetric, the quadratic form $x^T (A^T A) x$ is maximized over unit vectors by the eigenvector of $A^T A$ with the largest eigenvalue (the Rayleigh quotient argument). The intuitive reason eigenvectors of $A^T A$ get stretched the most by $A^T A$ is that eigenvectors only get stretched (not rotated), so there is no “wasted” effort going into rotation; all of the matrix transformation goes towards stretching. These eigenvectors make up the columns of $V$, ordered so that each subsequent column has an eigenvalue less than or equal to the previous column’s.

To uncover $U$, observe the following property:

$$A = U \Sigma V^T \implies A V = U \Sigma$$

by the fact that $V^{-1} = V^T$, as $V$ is an orthogonal matrix.

But since we said $\Sigma$ is just a diagonal matrix (as it only scales the columns of $U$), we can easily compute both $\Sigma$ and $U$ from the result of $AV$: normalize each column of $AV$ to unit length and store it as the $i$-th column of $U$; the length of that column (the factor you divided by) becomes the $i$-th singular value $\sigma_i$.
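Here is a rough sketch of that construction (the $3 \times 2$ matrix is an arbitrary full-rank example, so no singular value is zero and every column of $AV$ can be normalized), checked against NumPy’s built-in SVD:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0],
              [0.0, 1.0]])

# V: eigenvectors of A^T A, sorted by decreasing eigenvalue.
eigvals, eigvecs = np.linalg.eigh(A.T @ A)
order = np.argsort(eigvals)[::-1]
V = eigvecs[:, order]

# AV = U Sigma, so each column's length is a singular value and
# normalizing that column gives the corresponding column of U.
AV = A @ V
sigma = np.linalg.norm(AV, axis=0)
U = AV / sigma  # note: this is the "thin" U (m x n), not the full m x m version

assert np.allclose(U @ np.diag(sigma) @ V.T, A)

# The singular values agree with NumPy's SVD (columns of U and V may differ by sign).
_, s, _ = np.linalg.svd(A, full_matrices=False)
assert np.allclose(sigma, s)
```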

So effectively, we can capture the transformation of $A$ applied to $\vec{b}$ ($A\vec{b} = U \Sigma V^T \vec{b}$) by performing the following:

  1. Multiply by $V^{-1} = V^T$. This expresses $\vec{b}$ in terms of the input vectors that get stretched the most. Again, this can be seen as a change of basis where multiplying by $V^{-1}$ expresses $\vec{b}$ in terms of the principal stretching directions.
  2. Apply the stretch encoded by $\Sigma$: $\sigma_1$ scales the first principal stretching direction, $\sigma_2$ scales the second, and so forth.
  3. Multiply by $U$, a final rotation that expresses the scaled result of step 2 back in the standard basis of the output space.

Principal Component Analysis (PCA)

PCA is one of the most widely used applications of SVD. PCA is a statistical technique used to (lossily) compress a high-dimensional dataset into fewer dimensions.

To understand how PCA works, let’s switch gears. Instead of viewing a matrix $X \in \mathbb{R}^{m \times n}$ as a linear transformation, let’s view $X$ as a convenient way to package data. Each sample is a row in $X$ and the columns represent features.

The first step in PCA is to center the data by subtracting the mean for each feature (columns of $X$). The intuition is that PCA cares about the relationship amongst the variables, not the actual values of the variables themselves. Denote the centered matrix as $X_c$.

Once the data is centered, a natural next question is: how do we extract the most significant patterns in the data to represent the original dataset in lower dimensions? That’s where SVD comes into play. Recall the formula for SVD, applied to our centered matrix:

$$X_c = U \Sigma V^T$$

where $U \in \mathbb{R}^{m \times m}$, $\Sigma \in \mathbb{R}^{m \times n}$, $V \in \mathbb{R}^{n \times n}$, and $U$ and $V$ are orthogonal matrices.

Remember how we said that $V$ is just a collection of the eigenvectors of $X_c^T X_c$? Well, when we view $X_c$ as a centered data matrix, $X_c^T X_c$ happens to be proportional to the covariance matrix because $X_c^T X_c$ dots every feature with each other feature. The eigenvectors of the covariance matrix, sorted by eigenvalue, give us an ordered list of directions such that projecting the data onto each eigenvector captures the largest remaining variance not already captured by the previous directions. Each of these directions is known as a “principal component.” Sneak peek: knowing the directions of maximum variance is helpful because we want to capture the “most important” parts of the data. What makes each data point unique amongst all the others is where it falls along the lines of maximal variance.
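As a quick numerical check (with an arbitrary random dataset), $X_c^T X_c / (m - 1)$ matches NumPy’s covariance matrix, and the right-singular vectors of $X_c$ match that matrix’s eigenvectors up to ordering and sign:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))    # 100 samples, 3 features
Xc = X - X.mean(axis=0)          # center each feature
m = Xc.shape[0]

cov = Xc.T @ Xc / (m - 1)
assert np.allclose(cov, np.cov(X, rowvar=False))

# Right-singular vectors of Xc vs. eigenvectors of the covariance matrix.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]          # eigh returns ascending order
assert np.allclose(np.abs(eigvecs[:, order]), np.abs(Vt.T), atol=1e-6)
```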

$U$ is composed of the eigenvectors of the matrix $X_c X_c^T$, which is the covariance of the rows. To see this, take a look at this proof:

$$\begin{aligned} X_c X_c^T &= (U \Sigma V^T)(U \Sigma V^T)^T \\ &= (U \Sigma V^T)(V \Sigma^T U^T) \\ &= U \Sigma \Sigma^T U^T \\ &= U \Sigma^2 U^T \end{aligned}$$

Looking at the final form, this is exactly the formula for eigendecomposition. This implies that the columns of $U$ are the eigenvectors of $X_c X_c^T$. In the context of PCA, we say that $U$ helps us combine the principal components in $V$ to recover the original samples. Concretely, the first row of $U$ tells us how much of each principal component in $V$ to combine to recover the first sample, the second row of $U$ tells us how much of each to combine to recover the second sample, and so forth.

$\Sigma$ once again corresponds to the stretching factors. $\Sigma$ is organized so that larger $\sigma_i$ occur first.

To perform compression, you keep the top-$k$ columns of $U$ and $V$ and the top-$k$ singular values of $\Sigma$, denoted $U_k$, $V_k$, and $\Sigma_k$. You can then either do:

$$X_c V_k$$

or

$$U_k \Sigma_k$$

The first multiplication projects each datapoint onto the most important principal components. This is useful because we retain the parts of the data that account for the most variance. $U_k \Sigma_k$ is equivalent to the first form, which you can see by right-multiplying both sides of $X_c = U \Sigma V^T$ by $V$ and keeping only the first $k$ columns of the result.
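Putting it all together, a minimal PCA sketch (on hypothetical random data; `k` is the number of components to keep), showing that the two formulations give the same projection:

```python
import numpy as np

def pca_project(X, k):
    """Project X onto its top-k principal components via SVD."""
    Xc = X - X.mean(axis=0)                        # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                           # X_c V_k

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
Z = pca_project(X, k=2)                            # 200 samples, now 2-dimensional

# Equivalent formulation: U_k Sigma_k.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
assert np.allclose(Z, U[:, :2] * S[:2])
```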

Application to MNIST

Here’s a visualization of performing PCA (with $k=2$) on the MNIST dataset. MNIST is a collection of images of handwritten digits from 0-9. You’ll see that clusters emerge for each digit, showing that PCA helps capture the internal structure of the data.

[Figure: MNIST digits projected onto the first two principal components (PC1 vs. PC2), colored by digit 0-9.]
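A sketch of how such a plot could be produced, using scikit-learn’s bundled 8×8 digits dataset as a small stand-in for the full MNIST images (assuming `scikit-learn` and `matplotlib` are installed):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()            # 1797 images of handwritten digits 0-9, 8x8 pixels each
X = digits.data                   # shape (1797, 64)

# PCA with k=2 via SVD of the centered data matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:2].T                 # project onto the first two principal components

plt.scatter(Z[:, 0], Z[:, 1], c=digits.target, cmap="tab10", s=8)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.colorbar(label="digit")
plt.show()
```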