
0. Problem setup

Suppose we observe $n$ data points $X_1, X_2, \ldots, X_n \in \mathbb{R}^p$.

We write the data matrix as

$$X = \begin{bmatrix} X_1^\top \\ X_2^\top \\ \vdots \\ X_n^\top \end{bmatrix} \in \mathbb{R}^{n \times p},$$

where each row corresponds to one observation and each column corresponds to one variable.

Principal Component Analysis (PCA) is a dimension reduction method that represents high-dimensional data through a small number of orthogonal directions that preserve as much variation as possible.

To do this, PCA finds a direction $v \in \mathbb{R}^p$ such that the projected values

$$X_i^\top v, \qquad i = 1,\ldots,n,$$

are as spread out as possible. A direction with larger projected variance captures more variation in the data.
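As a minimal sketch of this idea, the following NumPy snippet computes the sample variance of the projected values $X_i^\top v$ for two candidate directions. The helper name `projected_variance` and the toy data are assumptions made purely for illustration.

```python
import numpy as np

def projected_variance(X, v):
    """Sample variance of the projections X_i^T v along the direction v."""
    v = np.asarray(v, dtype=float)
    v = v / np.linalg.norm(v)      # rescale v to unit length
    scores = X @ v                 # projected values X_i^T v, i = 1, ..., n
    return scores.var(ddof=1)      # sample variance (divides by n - 1)

# Toy data (assumption): wide spread along the first axis, narrow along the second.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2)) * np.array([3.0, 0.5])

var_axis1 = projected_variance(X, [1.0, 0.0])
var_axis2 = projected_variance(X, [0.0, 1.0])
print(var_axis1, var_axis2)
```

In this toy example the first coordinate axis should give the larger projected variance, which is the sense in which it "captures more variation".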

1. Population PCA

Let $Y \in \mathbb{R}^p$ be a random vector with covariance matrix $\Sigma = \operatorname{Cov}(Y)$.

First principal component direction

The first population principal component direction is defined as

$$v_1 = \arg\max_{\|v\|_2 = 1} \operatorname{Var}(v^\top Y) = \arg\max_{\|v\|_2 = 1} v^\top \Sigma v.$$

Thus, $v_1$ is the unit direction that maximizes the variance of the projection of $Y$ onto $v$.

Subsequent principal component directions

Similarly, each subsequent principal component direction is obtained by maximizing the variance of the projection of $Y$ onto that direction, subject to being orthogonal to all previously chosen directions.

For $k \geq 2$,

$$v_k = \arg\max_{\|v\|_2 = 1} v^\top \Sigma v \quad \text{subject to} \quad v^\top v_j = 0, \qquad j = 1,\ldots,k-1.$$

Therefore, PCA gives a sequence of mutually orthogonal directions $v_1, v_2, \ldots, v_p$ ordered by decreasing projected variance.

2. Eigenvalue decomposition

The solutions to the above variance maximization problems are obtained from the eigenvalue decomposition of the covariance matrix $\Sigma$.

We can write

$$\Sigma = V \Lambda V^\top,$$

where

$$V = [v_1,\ldots,v_p], \quad \Lambda = \operatorname{diag}(\lambda_1,\ldots,\lambda_p), \quad \text{and} \quad \lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_p \geq 0.$$

The eigenvectors $v_1,\ldots,v_p$ are the population principal component directions, and each eigenvalue $\lambda_j$ gives the variance of the projection of $Y$ onto $v_j$.
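The decomposition can be checked numerically. The sketch below uses a small, made-up covariance matrix (an assumption, not taken from any data set) and NumPy's `numpy.linalg.eigh`, which returns eigenvalues in ascending order, so they are reordered to match the convention $\lambda_1 \geq \cdots \geq \lambda_p$.

```python
import numpy as np

# Hypothetical 3 x 3 covariance matrix (symmetric, positive definite).
Sigma = np.array([[4.0, 2.0, 0.0],
                  [2.0, 3.0, 1.0],
                  [0.0, 1.0, 2.0]])

# eigh is designed for symmetric matrices; eigenvalues come back in ascending order.
eigvals, eigvecs = np.linalg.eigh(Sigma)

# Reorder so that lambda_1 >= lambda_2 >= ... >= lambda_p.
order = np.argsort(eigvals)[::-1]
lam = eigvals[order]          # diagonal entries of Lambda
V = eigvecs[:, order]         # columns are v_1, ..., v_p

# Rebuild Sigma from its eigendecomposition V Lambda V^T.
Sigma_rebuilt = V @ np.diag(lam) @ V.T
print(np.allclose(Sigma_rebuilt, Sigma))
```

The columns of `V` are orthonormal, so `V.T @ V` should be the identity up to rounding error.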

Lagrangian formulation

To see why the PCA directions are eigenvectors, consider the first principal component problem

$$\max_{\|v\|_2 = 1} v^\top \Sigma v.$$

Using the constraint $v^\top v = 1$, define the Lagrangian

$$\mathcal{L}(v,\lambda) = v^\top \Sigma v - \lambda (v^\top v - 1).$$

Taking the derivative with respect to $v$ and setting it equal to zero gives

$$\frac{\partial \mathcal{L}}{\partial v} = 2\Sigma v - 2\lambda v = 0.$$

Therefore,

$$\Sigma v = \lambda v.$$

Hence, the optimizer $v$ must be an eigenvector of $\Sigma$, and the corresponding Lagrange multiplier $\lambda$ is the associated eigenvalue.

For a unit eigenvector $v$, we have

$$v^\top \Sigma v = v^\top (\lambda v) = \lambda v^\top v = \lambda.$$

Thus, every candidate maximizer is a unit eigenvector whose objective value equals its eigenvalue, so the maximum is attained at the largest eigenvalue:

$$\max_{\|v\|_2 = 1} \operatorname{Var}(v^\top Y) = \max_{\|v\|_2 = 1} v^\top \Sigma v = \lambda_{\max}.$$

The first principal component direction is the eigenvector corresponding to the largest eigenvalue $\lambda_1 = \lambda_{\max}$, and the maximum projected variance is $\lambda_1$.

Repeating this procedure under orthogonality constraints gives the remaining eigenvectors. Therefore,

$$\Sigma v_j = \lambda_j v_j, \qquad j = 1,\ldots,p,$$

with

$$\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_p \geq 0.$$

The $\ell$-th eigenvalue can be interpreted as the variance of the projected random variable along the $\ell$-th principal component direction:

$$\lambda_\ell = v_\ell^\top \Sigma v_\ell = \operatorname{Var}(v_\ell^\top Y).$$
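A quick numerical sanity check of this argument, as a sketch under assumed toy inputs: for a made-up covariance matrix, the quadratic form $v^\top \Sigma v$ evaluated at many random unit vectors should never exceed the largest eigenvalue, and once the $v_1$ component is projected out it should never exceed the second largest. The matrix `Sigma` and the sampling scheme are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical covariance matrix: A^T A / 50 is symmetric positive semidefinite.
A = rng.normal(size=(50, 4))
Sigma = A.T @ A / 50

eigvals, eigvecs = np.linalg.eigh(Sigma)   # ascending order
lam_max = eigvals[-1]                      # largest eigenvalue, lambda_1
lam_second = eigvals[-2]                   # second largest eigenvalue, lambda_2
v1 = eigvecs[:, -1]                        # first principal component direction

# Random unit vectors, one per row, and the quadratic form v^T Sigma v for each.
U = rng.normal(size=(10_000, 4))
U /= np.linalg.norm(U, axis=1, keepdims=True)
quad = np.einsum("ij,jk,ik->i", U, Sigma, U)

# Directions orthogonal to v1: remove the v1 component, then renormalize.
W = U - np.outer(U @ v1, v1)
W /= np.linalg.norm(W, axis=1, keepdims=True)
quad_orth = np.einsum("ij,jk,ik->i", W, Sigma, W)

print(quad.max() <= lam_max, quad_orth.max() <= lam_second)
```

Random sampling only illustrates the bound; the Lagrangian argument above is what establishes it.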

3. Sample PCA

In practice, the population covariance matrix $\Sigma$ is unknown, so we use the sample covariance matrix $\hat\Sigma$ instead.

Let

$$\bar X = \frac{1}{n} \sum_{i=1}^n X_i$$

be the sample mean. The sample covariance matrix is

$$\hat\Sigma = \frac{1}{n-1} \sum_{i=1}^n (X_i-\bar X)(X_i-\bar X)^\top.$$

Equivalently, if $X_c$ denotes the centered data matrix, whose $i$-th row is $(X_i - \bar X)^\top$, then

$$\hat\Sigma = \frac{1}{n-1} X_c^\top X_c.$$
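The two expressions for $\hat\Sigma$ can be compared directly. The following sketch uses simulated data (an assumption for illustration) and also checks both against `numpy.cov`, which by default normalizes by $n-1$ and, with `rowvar=False`, treats columns as variables.

```python
import numpy as np

rng = np.random.default_rng(2)
# Toy data (assumption): n = 100 observations of p = 3 variables with nonzero means.
X = rng.normal(size=(100, 3)) + np.array([5.0, -2.0, 1.0])
n = X.shape[0]

x_bar = X.mean(axis=0)     # sample mean, a vector of length p
Xc = X - x_bar             # centered data matrix X_c

# Sum-of-outer-products form.
Sigma_hat_sum = sum(np.outer(row, row) for row in Xc) / (n - 1)

# Matrix form X_c^T X_c / (n - 1).
Sigma_hat_mat = Xc.T @ Xc / (n - 1)

# Reference value from NumPy.
Sigma_hat_np = np.cov(X, rowvar=False)

print(np.allclose(Sigma_hat_sum, Sigma_hat_mat), np.allclose(Sigma_hat_mat, Sigma_hat_np))
```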

Sample principal component directions

The first sample principal component direction is

$$\hat v_1 = \arg\max_{\|v\|_2 = 1} v^\top \hat\Sigma v.$$

For $\ell \geq 2$,

$$\hat v_\ell = \arg\max_{\|v\|_2 = 1} v^\top \hat\Sigma v \quad \text{subject to} \quad v^\top \hat v_j = 0, \qquad j = 1,\ldots,\ell-1.$$

The $\ell$-th sample principal component direction $\hat v_\ell$ is obtained as the $\ell$-th eigenvector of $\hat\Sigma$, and the corresponding sample eigenvalue is

$$\hat\lambda_\ell = \hat v_\ell^\top \hat\Sigma \hat v_\ell.$$

This value is the sample variance of the data projected onto the direction $\hat v_\ell$.

From an estimation point of view, $\hat v_\ell$ and $\hat\lambda_\ell$ estimate the population quantities $v_\ell$ and $\lambda_\ell$, respectively.
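As a sketch of how the sample directions might be computed, the snippet below takes the eigendecomposition of $\hat\Sigma$ on simulated data. For comparison it also uses the singular value decomposition of the centered data matrix; this second route is not part of the text above and is included only as a commonly used equivalent, since the right singular vectors of $X_c$ are eigenvectors of $X_c^\top X_c$. All variable names and the toy data are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
# Toy data (assumption): correlated columns from mixing independent normals.
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))
n = X.shape[0]
Xc = X - X.mean(axis=0)

# Route 1: eigendecomposition of the sample covariance matrix.
Sigma_hat = Xc.T @ Xc / (n - 1)
eigvals, eigvecs = np.linalg.eigh(Sigma_hat)   # ascending order
order = np.argsort(eigvals)[::-1]
lam_hat = eigvals[order]                       # sample eigenvalues, descending
V_hat = eigvecs[:, order]                      # columns are the sample PC directions

# Route 2: SVD of the centered data, Xc = U diag(s) Vt.
# Rows of Vt are the directions, and s**2 / (n - 1) are the eigenvalues.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
lam_svd = s**2 / (n - 1)

# Eigenvectors are only defined up to sign, so compare absolute inner products.
alignment = np.abs(np.sum(Vt.T * V_hat, axis=0))
print(np.allclose(lam_hat, lam_svd), np.allclose(alignment, 1.0))
```

The sign ambiguity matters in practice: two correct implementations can return $\hat v_\ell$ and $-\hat v_\ell$ for the same data.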

4. Principal component scores and scree values

PC scores

Assume that the data matrix $X \in \mathbb{R}^{n \times p}$ has been centered. Let $\hat v_\ell$ be the $\ell$-th sample principal component direction. The $\ell$-th PC score vector is defined as

$$z_\ell = X \hat v_\ell \in \mathbb{R}^n.$$

Equivalently, the $i$-th entry of $z_\ell$ is

$$z_{i\ell} = X_i^\top \hat v_\ell,$$

where $X_i^\top$ is the $i$-th row of the centered matrix $X$. This is the coordinate of the $i$-th observation after projection onto the $\ell$-th principal component direction.

If the first $k$ principal component directions are used, the score matrix is

$$Z_k = X \hat V_k, \qquad \hat V_k = [\hat v_1,\ldots,\hat v_k].$$

A score plot usually displays two score vectors, such as $(z_1, z_2)$, as a two-dimensional scatter plot. It is used to explore low-dimensional patterns in the data, such as clusters, outliers, or separation between groups.
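The score matrix can be sketched in a few lines. The helper `pc_scores` below is a hypothetical name introduced for illustration; it centers the data, extracts the first $k$ sample directions, and returns $Z_k$ together with the matching eigenvalues.

```python
import numpy as np

def pc_scores(X, k):
    """Return the n x k score matrix Z_k and the first k sample eigenvalues."""
    Xc = X - X.mean(axis=0)                        # center each column
    Sigma_hat = Xc.T @ Xc / (X.shape[0] - 1)       # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(Sigma_hat)   # ascending order
    order = np.argsort(eigvals)[::-1][:k]          # indices of the k largest eigenvalues
    V_k = eigvecs[:, order]                        # p x k matrix of directions
    return Xc @ V_k, eigvals[order]

rng = np.random.default_rng(4)
# Toy data (assumption): 150 observations of 5 correlated variables.
X = rng.normal(size=(150, 5)) @ rng.normal(size=(5, 5))

Z2, lam2 = pc_scores(X, k=2)
print(Z2.shape)
```

The two columns of `Z2` are the coordinates one would pass to a scatter plot routine to draw the $(z_1, z_2)$ score plot. Their sample covariance is diagonal, with the corresponding eigenvalues on the diagonal.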

PC Scree values

Let

$$\hat\lambda_1 \ge \hat\lambda_2 \ge \cdots \ge \hat\lambda_p \ge 0$$

be the eigenvalues of the sample covariance matrix $\hat\Sigma$. We call $\hat\lambda_\ell$ the $\ell$-th sample scree value. It represents the sample variance explained by the $\ell$-th principal component direction.

In terms of the score vector $z_\ell = X\hat v_\ell$,

$$\hat\lambda_\ell = \operatorname{Var}(X\hat v_\ell) = \hat v_\ell^\top \hat\Sigma \hat v_\ell = \frac{1}{n-1}\|z_\ell\|_2^2.$$

The last equality holds because $X$ has been centered, so the entries of $z_\ell$ have mean zero.

A scree plot displays the sequence of eigenvalues

$$\hat\lambda_1,\hat\lambda_2,\ldots,\hat\lambda_p,$$

or the proportion of variance explained by each principal component,

$$\widehat{\mathrm{PVE}}_\ell = \frac{\hat\lambda_\ell}{\sum_{j=1}^p \hat\lambda_j}.$$

It summarizes how much variation is explained by each principal component. Since the eigenvalues are ordered decreasingly, the scree plot is often used to decide how many principal components should be retained.
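The scree values and the proportion of variance explained can be computed as in the sketch below. The function name `scree_values` and the simulated data are assumptions for illustration; the cumulative proportion is added as a common companion quantity when choosing how many components to keep.

```python
import numpy as np

def scree_values(X):
    """Return sample eigenvalues (descending) and proportion of variance explained."""
    Xc = X - X.mean(axis=0)                          # center each column
    Sigma_hat = Xc.T @ Xc / (X.shape[0] - 1)         # sample covariance matrix
    lam_hat = np.linalg.eigvalsh(Sigma_hat)[::-1]    # eigvalsh is ascending; reverse it
    pve = lam_hat / lam_hat.sum()                    # PVE_l = lambda_l / sum_j lambda_j
    return lam_hat, pve

rng = np.random.default_rng(5)
# Toy data (assumption): independent columns with decreasing scale.
X = rng.normal(size=(300, 4)) * np.array([4.0, 2.0, 1.0, 0.5])

lam_hat, pve = scree_values(X)
cum_pve = np.cumsum(pve)     # cumulative proportion of variance explained
print(lam_hat, pve, cum_pve)
```

Because the eigenvalues sum to the trace of $\hat\Sigma$, the denominator of the PVE equals the total sample variance across all $p$ variables.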

Relationship between score plots and scree plots

The scree plot and the score plot show PCA results in different ways.

The scree plot shows how much variance each principal component direction explains. For the $\ell$-th principal component direction $\hat v_\ell$, the scree value is

$$\hat\lambda_\ell = \hat v_\ell^\top \hat\Sigma \hat v_\ell = \frac{1}{n-1}\|X\hat v_\ell\|_2^2.$$

This is the sample variance of the projected scores $X\hat v_\ell$.

The score plot shows the observations in the principal component coordinate system. If

$$\hat V_k = [\hat v_1,\ldots,\hat v_k],$$

then the score matrix is

$$Z_k = X\hat V_k.$$

The rows of $Z_k$ are the low-dimensional coordinates of the observations.

Therefore, the scree plot helps decide which components are important, while the score plot visualizes the data using those components. For instance, when the first two scree values are large, the two-dimensional score plot

$$(X\hat v_1,\; X\hat v_2)$$

can give an informative view of the main structure in the data.
