
In ordinary PCA, the principal component directions are obtained from the eigenvectors of the sample covariance matrix. In dppca, these directions can be computed in two different ways.

  1. Non-private PC directions: eigenvectors of the sample covariance matrix.
  2. Differentially private PC directions: private principal component directions obtained through the g-DPPCA procedure.

Notation

Let

$$
X = \begin{bmatrix} X_1^\top \\ X_2^\top \\ \vdots \\ X_n^\top \end{bmatrix} \in \mathbb{R}^{n \times p}
$$

be the data matrix used for PCA, where $X_i \in \mathbb{R}^p$ is the $i$-th observation. We assume that $X$ has been centered, and optionally standardized.

The principal component direction matrix is denoted by

$$
V_k = [v_1,\ldots,v_k] \in \mathbb{R}^{p \times k},
$$

where each column $v_\ell$ is a unit vector representing the $\ell$-th PC direction.

The corresponding score matrix is $Z = X V_k$.

1. Non-private PC directions

The classical sample covariance matrix is

$$
\hat\Sigma = \frac{1}{n-1}X^\top X.
$$

The non-private PCA directions are obtained from the eigenvalue decomposition

$$
\hat\Sigma = \hat V \hat\Lambda \hat V^\top,
$$

where

$$
\hat V = [\hat v_1,\ldots,\hat v_p], \quad \hat\Lambda = \operatorname{diag}(\hat\lambda_1,\ldots,\hat\lambda_p) \quad \text{with} \quad \hat\lambda_1 \geq \hat\lambda_2 \geq \cdots \geq \hat\lambda_p \geq 0.
$$

The $\ell$-th sample principal component direction is $\hat v_\ell$.

Equivalently,

$$
\hat v_\ell = \arg\max_{\|v\|_2 = 1} v^\top \hat\Sigma v \quad \text{subject to} \quad v^\top \hat v_j = 0, \qquad j = 1,\ldots,\ell-1.
$$

In the non-private option of dppca, the direction matrix used for projection is

$$
\hat V_k = [\hat v_1,\ldots,\hat v_k].
$$
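The non-private computation can be illustrated with a minimal sketch. It is written in plain Python for portability and is not the dppca implementation: the helper names (`sample_covariance`, `top_directions`, `scores`) are hypothetical, and power iteration with re-orthogonalization stands in for a library eigensolver. It assumes the data are already centered and that the leading eigenvalues are distinct.

```python
import math
import random


def sample_covariance(X):
    """Return X^T X / (n - 1) for an already centered n x p data matrix X."""
    n, p = len(X), len(X[0])
    return [[sum(row[a] * row[b] for row in X) / (n - 1) for b in range(p)]
            for a in range(p)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthogonalize(w, basis):
    """Remove from w its components along the unit vectors in basis."""
    for q in basis:
        c = dot(w, q)
        w = [wi - c * qi for wi, qi in zip(w, q)]
    return w


def normalize(w):
    norm = math.sqrt(dot(w, w))
    return [x / norm for x in w]


def top_directions(S, k, iters=500, seed=0):
    """First k eigenvectors of a symmetric positive semidefinite matrix S.

    Each direction is found by power iteration restricted to the orthogonal
    complement of the directions already found, mirroring the constrained
    maximization that defines the l-th PC direction.
    """
    rng = random.Random(seed)
    p = len(S)
    found = []
    for _ in range(k):
        start = [rng.gauss(0.0, 1.0) for _ in range(p)]
        v = normalize(orthogonalize(start, found))
        for _ in range(iters):
            w = orthogonalize([dot(row, v) for row in S], found)
            if math.sqrt(dot(w, w)) < 1e-12:
                break  # remaining eigenvalues are numerically zero
            v = normalize(w)
        found.append(v)
    return found  # list of k unit vectors, i.e. the columns of V_k


def scores(X, V):
    """Score matrix Z = X V_k, one row per observation."""
    return [[dot(row, v) for v in V] for row in X]


# Toy centered data whose variance is concentrated in the first coordinate.
X = [[2.0, 0.1], [-2.0, -0.1], [1.0, -0.1], [-1.0, 0.1]]
V_hat = top_directions(sample_covariance(X), k=2)
Z = scores(X, V_hat)
```

The sign of each direction is arbitrary, so two runs with different seeds may return $\hat v_\ell$ or $-\hat v_\ell$.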

2. DP PC directions

Kim and Jung (2025) proposed g-DPPCA, which applies the matrix Gaussian mechanism to the generalized multivariate Kendall’s tau matrix. This matrix is built on a robust data transformation called the generalized spatial sign, proposed by Raymaekers and Rousseeuw (2019).

For a positive-valued scale function $\xi: (0, \infty) \to (0, \infty)$, consider the map $g_\xi: \mathbb{R}^d \to \mathbb{R}^d$ defined as

$$
g_\xi(t) = \xi(\|t\|_2)\cdot \frac{t}{\|t\|_2}.
$$

The map $g_{\xi}$ is called the generalized spatial sign with respect to $\xi$. Here $d$ denotes the data dimension, written $p$ in the Notation section.
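As a small illustration of the definition, the sketch below evaluates $g_\xi$ for two scale functions. The function names are hypothetical, the Winsorizing-type scale is chosen only as an example, and mapping the origin to itself is a convention assumed here because $g_\xi$ is not defined at $t = 0$.

```python
import math


def generalized_spatial_sign(t, xi):
    """Return g_xi(t) = xi(||t||_2) * t / ||t||_2 for a vector t."""
    norm = math.sqrt(sum(x * x for x in t))
    if norm == 0.0:
        # Assumed convention: the text does not define g_xi at the origin.
        return [0.0 for _ in t]
    scale = xi(norm) / norm
    return [scale * x for x in t]


def xi_spherical(r):
    """Constant scale: every nonzero point is projected onto the unit sphere."""
    return 1.0


def xi_winsor(r, cutoff=1.0):
    """Illustrative scale: keep short vectors, shrink long ones to the cutoff."""
    return min(r, cutoff)


far = generalized_spatial_sign([3.0, 4.0], xi_spherical)   # norm 5 -> norm 1
near = generalized_spatial_sign([0.3, 0.4], xi_winsor)     # norm 0.5, unchanged
clipped = generalized_spatial_sign([3.0, 4.0], xi_winsor)  # norm 5 -> norm 1
```

In both cases the direction of $t$ is preserved and only its length changes, which is what bounds the influence of any single observation.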

The generalized multivariate Kendall’s tau matrix with respect to $g_\xi$ is defined as

$$
K_{g_\xi} = \mathbb{E}_{X, X'}\left[ g_\xi\left( \frac{X - X'}{\sqrt{2}}\right) g_\xi\left( \frac{X - X'}{\sqrt{2}}\right)^\top \right],
$$

where $X'$ is an independent copy of $X$. Importantly, if $X$ follows an elliptical distribution (a family that includes the Gaussian and multivariate $t$-distributions), $K_{g_\xi}$ has the same eigenvectors, in the same order, as $\operatorname{cov}(X)$. One can therefore conduct PCA by estimating $K_{g_\xi}$ and taking its eigenvectors.

For convenience, we write $g$ for the given generalized spatial sign. For a random sample $S = (X_1, \dots, X_n)$, the second-order U-statistic estimating $K_{g}$ can be written as

$$
\widehat{K}_g(S) = \frac{2}{n(n-1)} \sum_{i < j} g\left(\frac{X_j - X_i}{\sqrt{2}}\right) g\left(\frac{X_j - X_i}{\sqrt{2}}\right)^\top.
$$
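The U-statistic is a plain average over all pairs of observations, which the following sketch spells out. It uses the spherical sign as the default $g$ and hypothetical function names; the double loop costs on the order of $n^2 d^2$ operations, so it is meant for reading rather than for large samples.

```python
import math


def spatial_sign(t):
    """Spherical sign t / ||t||_2; the zero vector is mapped to itself."""
    norm = math.sqrt(sum(x * x for x in t))
    return [x / norm for x in t] if norm > 0.0 else [0.0 for _ in t]


def kendall_tau_matrix(S, g=spatial_sign):
    """Average of g(u) g(u)^T over pairs i < j, with u = (X_j - X_i) / sqrt(2)."""
    n, d = len(S), len(S[0])
    K = [[0.0] * d for _ in range(d)]
    for i in range(n):
        for j in range(i + 1, n):
            u = g([(S[j][a] - S[i][a]) / math.sqrt(2.0) for a in range(d)])
            for a in range(d):
                for b in range(d):
                    K[a][b] += u[a] * u[b]
    pairs = n * (n - 1) / 2  # dividing by this equals the 2 / (n (n - 1)) factor
    return [[K[a][b] / pairs for b in range(d)] for a in range(d)]


# Toy sample that varies mostly along the first coordinate.
S = [[0.0, 0.0], [4.0, 0.5], [-3.0, 0.2], [1.0, -0.4], [-2.0, -0.3]]
K_hat = kendall_tau_matrix(S)
```

With the spherical sign every summand has trace one, so the estimate also has trace one whenever no two observations coincide.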

Note that the sensitivity of $\widehat{K}_g$ with respect to the Frobenius norm can be upper bounded by

$$
\Delta_F(\widehat{K}_g) = \sup_{S \sim S'} \|\widehat{K}_g(S) - \widehat{K}_g(S')\|_F \le \frac{4\|g\|_\infty^2}{n}.
$$
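The bound can be checked with a short argument, sketched here under two assumptions that the text leaves implicit: $S \sim S'$ means that the two samples differ in a single observation, and $\|g\|_\infty = \sup_t \|g(t)\|_2$. Write $u_{ij} = g\left((X_j - X_i)/\sqrt{2}\right)$ for the sample $S$ and $u'_{ij}$ for the same quantity computed from $S'$. Only the $n-1$ pairs that contain the modified observation change, and $\|uu^\top\|_F = \|u\|_2^2 \le \|g\|_\infty^2$ for any such vector, so

$$
\|\widehat{K}_g(S) - \widehat{K}_g(S')\|_F
\le \frac{2}{n(n-1)} \sum_{\text{changed pairs}} \left( \|u_{ij} u_{ij}^\top\|_F + \|u'_{ij} {u'_{ij}}^\top\|_F \right)
\le \frac{2}{n(n-1)} \cdot (n-1) \cdot 2\|g\|_\infty^2
= \frac{4\|g\|_\infty^2}{n}.
$$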

So, for a dataset $S = (x_1, \dots, x_n)$, the randomized mechanism $\bar{K}_g$ defined as

$$
\bar{K}_g(S) := \frac{2}{n(n-1)} \sum_{i < j} g\left(\frac{x_j-x_i}{\sqrt{2}}\right)g\left(\frac{x_j-x_i}{\sqrt{2}}\right)^\top + \operatorname{vecd}^{-1}(\xi),
$$

where $\xi \sim N_{d(d+1)/2}(0, \sigma_{\varepsilon, \delta}^2 I_{d(d+1)/2})$ and $\sigma_{\varepsilon, \delta} = \frac{4\|g\|_{\infty}^2 \sqrt{2 \ln(1.25/\delta)}}{n\varepsilon}$, satisfies $(\varepsilon, \delta)$-DP. In this formula $\xi$ denotes the Gaussian noise vector, not the scale function of the generalized spatial sign.
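The noise step can be sketched as follows. Two points are assumptions rather than facts from the text: the layout of $\operatorname{vecd}^{-1}$ (upper triangle filled row by row, then mirrored) and all function names, which are hypothetical. Python's `random` module is used only for illustration and is not a suitable noise source for a real privacy guarantee.

```python
import math
import random


def gaussian_sigma(eps, delta, n, g_sup=1.0):
    """sigma = 4 * ||g||_inf^2 * sqrt(2 ln(1.25 / delta)) / (n * eps)."""
    return 4.0 * g_sup ** 2 * math.sqrt(2.0 * math.log(1.25 / delta)) / (n * eps)


def vecd_inverse(z, d):
    """Arrange d(d+1)/2 values into a symmetric d x d matrix.

    Assumed layout: the upper triangle is filled row by row and mirrored.
    """
    M = [[0.0] * d for _ in range(d)]
    idx = 0
    for a in range(d):
        for b in range(a, d):
            M[a][b] = M[b][a] = z[idx]
            idx += 1
    return M


def privatize(K_hat, n, eps, delta, g_sup=1.0, rng=None):
    """Return K_hat plus symmetric Gaussian noise with the calibrated sigma."""
    rng = rng or random.Random()
    d = len(K_hat)
    sigma = gaussian_sigma(eps, delta, n, g_sup)
    z = [rng.gauss(0.0, sigma) for _ in range(d * (d + 1) // 2)]
    E = vecd_inverse(z, d)
    return [[K_hat[a][b] + E[a][b] for b in range(d)] for a in range(d)]


# Hypothetical 2 x 2 estimate from a sample of size n = 500.
K_hat = [[0.8, 0.1], [0.1, 0.2]]
K_bar = privatize(K_hat, n=500, eps=0.5, delta=1e-5, rng=random.Random(1))
sigma = gaussian_sigma(eps=0.5, delta=1e-5, n=500)
```

Because only $d(d+1)/2$ independent noise values are drawn, the perturbed matrix stays symmetric and its eigenvectors are real.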

Define $\bar{V}_{g, m}(S) \in \mathcal{O}(d, m)$ as the matrix of the first $m$ eigenvectors of $\bar{K}_g(S)$, where $m$ plays the role of $k$ in the Notation section. Then $\bar{V}_{g, m}(S)$ satisfies $(\varepsilon, \delta)$-DP by the post-processing property, and its columns can serve as DP principal component directions. Kim and Jung (2025) call this procedure g-DPPCA.

In the implementation of the function dp_pc_dir with option g_dppca=TRUE, we use the spherical transformation $g_{sph}(t) = t/\|t\|_2$ to output differentially private PC directions $\bar{V}_{sph,m}$. In this case $\|g_{sph}\|_{\infty} = 1$, and thus the standard deviation of the additive Gaussian noise is set to $\sigma_{\varepsilon, \delta} = \frac{4\sqrt{2 \ln(1.25/\delta)}}{n\varepsilon}$.

Summary

The principal component direction step in dppca can be summarized as follows.

  1. Start with a preprocessed data matrix $X$.
  2. Choose a direction estimation method.
  3. Obtain a direction matrix $V_k$.
  4. Compute projected scores $Z = X V_k$.
  5. Use the scores for private scree estimation or private score visualization.

The main distinction is whether $V_k$ is obtained from the ordinary sample covariance matrix or from a differentially private robust PC direction estimator.
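Steps 1 to 4 can be traced end to end on simulated data with the sketch below. It is a plain-Python illustration with $k = 1$, not the dppca implementation: every function name is hypothetical, the spherical sign is used for the private option, and a shifted power iteration stands in for a library eigensolver because the noisy matrix may have negative eigenvalues.

```python
import math
import random


def spatial_sign(t):
    norm = math.sqrt(sum(x * x for x in t))
    return [x / norm for x in t] if norm > 0.0 else [0.0 for _ in t]


def covariance(X):
    """Sample covariance X^T X / (n - 1) of centered data."""
    n, d = len(X), len(X[0])
    return [[sum(r[a] * r[b] for r in X) / (n - 1) for b in range(d)]
            for a in range(d)]


def kendall_tau_matrix(X):
    """U-statistic estimate of the Kendall's tau matrix for the spherical sign."""
    n, d = len(X), len(X[0])
    K = [[0.0] * d for _ in range(d)]
    for i in range(n):
        for j in range(i + 1, n):
            u = spatial_sign([(X[j][a] - X[i][a]) / math.sqrt(2.0) for a in range(d)])
            for a in range(d):
                for b in range(d):
                    K[a][b] += u[a] * u[b]
    pairs = n * (n - 1) / 2
    return [[K[a][b] / pairs for b in range(d)] for a in range(d)]


def add_symmetric_noise(K, sigma, rng):
    """Add N(0, sigma^2) noise to the upper triangle and mirror it."""
    d = len(K)
    out = [row[:] for row in K]
    for a in range(d):
        for b in range(a, d):
            e = rng.gauss(0.0, sigma)
            out[a][b] += e
            if b != a:
                out[b][a] += e
    return out


def leading_direction(A, iters=1000):
    """Unit eigenvector for the largest eigenvalue of a symmetric matrix A.

    A is shifted by a Gershgorin bound so that power iteration targets the
    largest eigenvalue even when A is indefinite.
    """
    d = len(A)
    shift = max(sum(abs(x) for x in row) for row in A)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(A[a][b] * v[b] for b in range(d)) + shift * v[a] for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


rng = random.Random(0)
n, eps, delta = 500, 0.5, 1e-5

# Step 1: toy data with most variance along the first coordinate, then centered.
raw = [[rng.gauss(0.0, 3.0), rng.gauss(0.0, 0.5)] for _ in range(n)]
means = [sum(col) / n for col in zip(*raw)]
X = [[x - m for x, m in zip(row, means)] for row in raw]

# Steps 2-3, non-private option: leading eigenvector of the sample covariance.
v_np = leading_direction(covariance(X))

# Steps 2-3, private option: leading eigenvector of the noisy Kendall's tau matrix.
sigma = 4.0 * math.sqrt(2.0 * math.log(1.25 / delta)) / (n * eps)
v_dp = leading_direction(add_symmetric_noise(kendall_tau_matrix(X), sigma, rng))

# Step 4: scores Z = X V_k with k = 1. The scores are computed from the raw
# data, so they are not private by themselves even when the direction is.
Z = [[sum(x * v for x, v in zip(row, v_dp))] for row in X]
```

On this toy sample both options should recover a direction close to the first coordinate axis, up to sign. With smaller $n$ or $\varepsilon$ the noise grows and the private direction can drift further from the non-private one.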

References

Minwoo Kim and Sungkyu Jung (2025), “Robust and differentially private principal component analysis,” Statistical Analysis and Data Mining, 18(6), https://doi.org/10.1002/sam.70053

Jakob Raymaekers and Peter Rousseeuw (2019), “A generalized spatial sign covariance matrix,” Journal of Multivariate Analysis, 171:94–111, https://doi.org/10.1016/j.jmva.2018.11.010