DP score plots in dppca

A PCA score plot is a standard visualization for examining the low-dimensional structure of multivariate data. In a non-private analysis, the score plot displays the projected observations directly. In dppca, the differentially private score plot instead represents the distribution of two-dimensional PCA scores by a differentially private histogram.

PC scores

Let

$X \in \mathbb{R}^{n \times p}$

be the input data matrix after the requested preprocessing. In dppca, preprocessing is controlled by the arguments center and standardize.

Let

$V_k = [v_1,\ldots,v_k] \in \mathbb{R}^{p \times k}$

be the matrix of principal component directions, where column $v_\ell$ is the $\ell$-th principal component direction. For the $i$-th observation $x_i$, stored as the row $x_i^\top$ of $X$, the $k$-dimensional score vector is

$z_i = V_k^\top x_i \in \mathbb{R}^k, \qquad i=1,\ldots,n.$

For visualization, we select two score coordinates. If axes = c(a, b), define

$s_i = (z_{i,a}, z_{i,b})^\top \in \mathbb{R}^2, \qquad i=1,\ldots,n.$

The collection $S = \{s_i\}_{i=1}^n$ is the two-dimensional score point cloud. A non-private score plot would draw these points directly. The private score plot instead releases a noisy two-dimensional histogram of these points.
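As a non-private illustration of these definitions, the sketch below computes the score matrix and selects two coordinates in base R. The simulated data, the centering-only preprocessing, and the use of svd() to obtain $V_k$ are assumptions made for the example; they do not describe how dppca estimates the directions.

```r
# Non-private illustration of the score definitions (simulated data).
set.seed(1)
n <- 200; p <- 5; k <- 3
X <- matrix(rnorm(n * p), nrow = n, ncol = p)

# Assumed preprocessing: column centering only.
Xc <- scale(X, center = TRUE, scale = FALSE)

# Principal component directions: leading right singular vectors of Xc.
V_k <- svd(Xc)$v[, 1:k, drop = FALSE]    # p x k

# Row i of Z is the transposed score vector z_i = V_k^T x_i.
Z <- Xc %*% V_k                           # n x k

# Select two score coordinates, as with axes = c(a, b).
axes <- c(1, 2)
S <- Z[, axes, drop = FALSE]              # n x 2 score point cloud
```

A non-private score plot would simply call plot(S); the remaining sections replace that step with a private histogram.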

Overview of the DP score plot

The private score visualization in dppca has the following steps.

Compute two-dimensional PCA scores.
Construct a private plotting frame.
Divide the frame into rectangular bins.
Count how many score points fall into each bin.
Apply a differentially private histogram mechanism.
Normalize and visualize the noisy bin frequencies.

The plotting frame and histogram both consume privacy budget. If g_dppca = TRUE, the private PC directions also consume privacy budget.

1. Private plotting frame

Before constructing a two-dimensional histogram, we need a plotting region. This region is called the plotting frame. If the frame is too narrow, many points are excluded. If it is too wide, the histogram may become sparse and visually uninformative.

The current implementation uses a private center-radius frame. This approach constructs a square frame by privately estimating a center and then privately estimating a radius around that center. The private quantiles appearing in this step are computed using a smooth-sensitivity-based DP quantile estimator, as in Nissim, Raskhodnikova, and Smith (2007).

Private center

Let $S \in \mathbb{R}^{n \times 2}$ be the score matrix, whose $i$-th row is $s_i^\top = (z_{i,a}, z_{i,b})$. The frame center is estimated coordinate-wise using private medians:

$\widetilde c_1 = \widetilde Q_{0.5}(z_{1,a},\ldots,z_{n,a}), \qquad \widetilde c_2 = \widetilde Q_{0.5}(z_{1,b},\ldots,z_{n,b}).$

Here $\widetilde Q_q(\cdot)$ denotes a private estimate of the $q$-quantile. The private center is

$\widetilde c = (\widetilde c_1,\widetilde c_2)^\top.$

Private radius

After obtaining the private center, compute the Euclidean distance from each score point to the private center:

$r_i = \|s_i-\widetilde c\|_2 = \sqrt{(z_{i,a}-\widetilde c_1)^2 + (z_{i,b}-\widetilde c_2)^2}, \qquad i=1,\ldots,n.$

The radius is then estimated by the private 0.99 quantile of these distances:

$\widetilde R = \widetilde Q_{0.99}(r_1,\ldots,r_n).$

To add a visual margin and reduce boundary effects, the private radius is inflated by a fixed factor $\alpha > 0$:

$\widetilde R_{\mathrm{infl}} = (1+\alpha)\widetilde R,$

where the current implementation uses $\alpha = 0.20$.

The final plotting frame is

$F = [\widetilde c_1-\widetilde R_{\mathrm{infl}}, \widetilde c_1+\widetilde R_{\mathrm{infl}}] \times [\widetilde c_2-\widetilde R_{\mathrm{infl}}, \widetilde c_2+\widetilde R_{\mathrm{infl}}].$

This produces a square frame centered at the private center.
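The center-radius construction can be sketched as follows. The helper center_radius_frame() and its qfun argument are hypothetical names, not part of dppca. The default qfun is the ordinary, non-private stats::quantile(), used only as a stand-in: a private release would need a DP quantile estimator, such as the smooth-sensitivity estimator cited above, plugged in through qfun.

```r
# Sketch of the center-radius frame. `qfun(x, q)` is a placeholder for a
# private quantile routine; the default below is NOT differentially private.
center_radius_frame <- function(S,
                                qfun = function(x, q) unname(stats::quantile(x, q)),
                                alpha = 0.20) {
  stopifnot(is.matrix(S), ncol(S) == 2)
  # Coordinate-wise medians give the frame center.
  center <- c(qfun(S[, 1], 0.5), qfun(S[, 2], 0.5))
  # Euclidean distance of every score point to the center.
  r <- sqrt((S[, 1] - center[1])^2 + (S[, 2] - center[2])^2)
  # Radius: 0.99 quantile of the distances, then inflated by (1 + alpha).
  R <- qfun(r, 0.99)
  R_infl <- (1 + alpha) * R
  list(center = center, radius = R, radius_infl = R_infl,
       xlim = center[1] + c(-1, 1) * R_infl,
       ylim = center[2] + c(-1, 1) * R_infl)
}

set.seed(1)
S <- matrix(rnorm(400), ncol = 2)
frame <- center_radius_frame(S)
```

Because both side lengths equal $2\widetilde R_{\mathrm{infl}}$, the returned xlim and ylim always describe a square.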

Numerical safeguard for the private radius

The distances $r_i$ are nonnegative, but the private quantile estimator adds random noise. Therefore, the private radius estimate can occasionally become non-finite or nonpositive, especially when the privacy budget is very small, the sample size is small, or the score points are nearly identical.

The implementation checks the private radius before forming the frame. If the private radius is not finite or is nonpositive, the score plotting routine stops with an informative error.
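A check of this kind might look like the sketch below. The function name check_private_radius() and the wording of the error message are hypothetical; only the condition being tested (finite and strictly positive) comes from the description above.

```r
# Hypothetical guard: stop if the private radius cannot define a frame.
check_private_radius <- function(R) {
  if (!is.numeric(R) || length(R) != 1L || !is.finite(R) || R <= 0) {
    stop("Private radius estimate is not finite and positive; ",
         "consider a larger privacy budget or more observations.",
         call. = FALSE)
  }
  invisible(R)
}

ok  <- check_private_radius(1.5)
bad <- tryCatch(check_private_radius(-0.3), error = function(e) e)
```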

2. Choosing the number of bins

After the plotting frame $F$ has been determined, it is divided into histogram bins. Let $m_x$ and $m_y$ be the number of bins along the two score axes. The two-dimensional histogram then has

$m = m_x m_y$

bins in total.

In dppca, the user specifies the bin counts through the bins argument, for example bins = c(20, 20). The best bin choice depends on the sample size, privacy budget, and visible structure in the score distribution. Fewer bins can be more stable under stronger privacy noise, while more bins can reveal finer structure when the sample size and privacy budget are sufficiently large.

3. Two-dimensional histogram

Let the private plotting frame be divided into bins $B_1,\ldots,B_m$. For the score point set $S = \{s_i\}_{i=1}^n$, the non-private count in bin $B_k$ is

$c_k = \sum_{i=1}^n \mathbf{1}\{s_i \in B_k\}, \qquad k=1,\ldots,m.$

The count vector is $c = (c_1,\ldots,c_m) \in \mathbb{N}^m$. The empirical frequency in bin $B_k$ is

$q_k = \frac{c_k}{n}, \qquad k=1,\ldots,m.$

The private score visualization displays a noisy version of this frequency vector.
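The non-private counts $c_k$ and frequencies $q_k$ can be computed on a given frame as in the sketch below. The helper bin_counts_2d() is a hypothetical name, and two choices in it are assumptions: points outside the frame are dropped, and points on the upper edge are assigned to the last bin.

```r
# Non-private bin counts on a rectangular frame split into bins[1] x bins[2] cells.
bin_counts_2d <- function(S, xlim, ylim, bins = c(20, 20)) {
  mx <- bins[1]; my <- bins[2]
  xb <- seq(xlim[1], xlim[2], length.out = mx + 1)   # bin edges along axis a
  yb <- seq(ylim[1], ylim[2], length.out = my + 1)   # bin edges along axis b
  # rightmost.closed = TRUE puts points on the upper edge into the last bin.
  ix <- findInterval(S[, 1], xb, rightmost.closed = TRUE)
  iy <- findInterval(S[, 2], yb, rightmost.closed = TRUE)
  # Assumption: points outside the frame are not counted.
  inside <- ix >= 1 & ix <= mx & iy >= 1 & iy <= my
  k <- (iy[inside] - 1) * mx + ix[inside]            # linear bin index in 1..m
  tabulate(k, nbins = mx * my)
}

S <- rbind(c(0.1, 0.1), c(0.9, 0.1), c(0.9, 0.9), c(0.95, 0.95), c(2, 2))
counts <- bin_counts_2d(S, xlim = c(0, 1), ylim = c(0, 1), bins = c(2, 2))
q <- counts / nrow(S)   # empirical frequencies q_k = c_k / n
# counts is 1 1 0 2: the point (2, 2) lies outside the frame and is dropped.
```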

Sensitivity of histogram counts

Under row-level adjacency, two neighboring datasets differ in one observation. Changing one observation can move one score point from one bin to another. Therefore, the count vector can change by at most $+1$ in one bin and $-1$ in another bin. Hence,

$\Delta_1(c) \leq 2, \qquad \Delta_2(c) \leq \sqrt{2}.$

These sensitivity bounds are used to calibrate privacy noise for the histogram mechanisms.
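The worst case behind these bounds can be checked numerically. In the sketch below the bin assignments are simulated rather than derived from real scores, and one observation is moved to a different bin to form a neighboring dataset.

```r
m <- 9
set.seed(1)
bin_of <- sample.int(m, size = 50, replace = TRUE)   # simulated bin index per point
c_old  <- tabulate(bin_of, nbins = m)

# Neighboring dataset: observation 1 is replaced and lands in a different bin.
bin_new    <- bin_of
bin_new[1] <- if (bin_of[1] == m) 1L else bin_of[1] + 1L
c_new      <- tabulate(bin_new, nbins = m)

d  <- c_new - c_old
l1 <- sum(abs(d))      # L1 distance between the two count vectors
l2 <- sqrt(sum(d^2))   # L2 distance between the two count vectors
```

Here d has exactly one entry equal to $+1$ and one equal to $-1$, so l1 equals 2 and l2 equals $\sqrt{2}$, attaining both bounds.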

4. Privacy accounting

The DP score histogram procedure has two main privacy-consuming steps when g_dppca = FALSE:

private quantile estimation for constructing the plotting frame,
private histogram release.

If the total privacy budget is $(\epsilon,\delta)$, the implementation splits the budget as

$(\epsilon_{\mathrm{frame}},\delta_{\mathrm{frame}}) = (\epsilon/2,\delta/2), \qquad (\epsilon_{\mathrm{hist}},\delta_{\mathrm{hist}}) = (\epsilon/2,\delta/2).$

The frame construction itself uses three private quantile estimates: two private medians for the center and one private 0.99 quantile for the radius. These share the frame budget by basic composition.

When g_dppca = TRUE, private PC direction estimation also consumes privacy budget. In that case, the total budget is split across

private PC direction estimation,
private plotting frame construction,
private histogram release.

The implementation uses an equal split:

$(\epsilon_{\mathrm{pc}},\delta_{\mathrm{pc}}) = (\epsilon_{\mathrm{frame}},\delta_{\mathrm{frame}}) = (\epsilon_{\mathrm{hist}},\delta_{\mathrm{hist}}) = (\epsilon/3,\delta/3).$

By basic composition, the overall procedure satisfies the requested $(\epsilon,\delta)$-DP guarantee.
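The equal split can be written as a small helper. The function split_budget() is a hypothetical name used for illustration; it covers only the top-level split described above, not how the frame budget is divided among the three quantile estimates.

```r
# Equal split of (eps, delta) across the privacy-consuming steps.
split_budget <- function(eps, delta, g_dppca = FALSE) {
  steps <- if (g_dppca) c("pc", "frame", "hist") else c("frame", "hist")
  k <- length(steps)
  out <- lapply(steps, function(s) c(eps = eps / k, delta = delta / k))
  names(out) <- steps
  out
}

b2 <- split_budget(eps = 5, delta = 1e-5)                  # frame + histogram
b3 <- split_budget(eps = 3, delta = 1e-5, g_dppca = TRUE)  # pc + frame + histogram

# Basic composition: the per-step budgets add back up to the total.
eps_total   <- sum(vapply(b3, function(x) x[["eps"]],   numeric(1)))
delta_total <- sum(vapply(b3, function(x) x[["delta"]], numeric(1)))
```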

Method 1: Additive DP histogram

A simple DP histogram can be constructed by adding independent Gaussian noise to each bin count. The noisy counts are then post-processed to be nonnegative and normalized. This additive-noise approach is commonly used for DP histograms (Wasserman and Zhou, 2010); the procedure is summarized in the Additive DP histogram article.
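A minimal sketch of the additive approach is given below. It assumes the classical Gaussian-mechanism calibration $\sigma = \Delta_2 \sqrt{2\log(1.25/\delta)}/\epsilon$ with $\Delta_2 = \sqrt{2}$, which is stated for $\epsilon < 1$, and it normalizes by the sum of the clipped counts. Both choices, and the name dp_hist_additive(), are assumptions; the calibration and post-processing used by dppca may differ.

```r
# Additive Gaussian histogram (illustrative calibration, requires eps < 1).
dp_hist_additive <- function(counts, eps, delta) {
  stopifnot(eps > 0, eps < 1, delta > 0, delta < 1)
  # L2 sensitivity of the count vector is sqrt(2) under row-level adjacency.
  sigma <- sqrt(2) * sqrt(2 * log(1.25 / delta)) / eps
  noisy <- counts + rnorm(length(counts), mean = 0, sd = sigma)
  # Post-processing: clip negatives, then normalize to frequencies.
  clipped <- pmax(noisy, 0)
  total <- sum(clipped)
  if (total > 0) clipped / total else rep(0, length(counts))
}

set.seed(123)
q_tilde <- dp_hist_additive(c(40, 30, 0, 30), eps = 0.9, delta = 1e-5)
```

Clipping and normalizing operate only on the already-noised counts, so they are post-processing and do not consume additional privacy budget.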

Method 2: Sparse DP histogram

When many bins are empty, adding noise to every bin can dominate the visualization. A sparse histogram aims to report only bins whose counts are large enough to be distinguishable from noise.

In dppca, the sparse histogram is based on the stability-based private histogram idea of Karwa and Vadhan (2018); the procedure is summarized in the Sparse DP histogram article.
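The stability-based idea can be sketched as follows: only occupied bins receive noise, and a noisy count is released only if it clears a threshold. The constants used here (Laplace scale $2/\epsilon$ and threshold $1 + 2\log(2/\delta)/\epsilon$) follow a common textbook form of the stability-based histogram and are assumptions; the names rlaplace() and dp_hist_sparse() are hypothetical, and the constants in dppca may differ.

```r
# Laplace(0, scale) draws as the difference of two exponential draws.
rlaplace <- function(n, scale) {
  rexp(n, rate = 1 / scale) - rexp(n, rate = 1 / scale)
}

# Stability-based sparse histogram on raw counts (illustrative constants).
dp_hist_sparse <- function(counts, eps, delta) {
  stopifnot(eps > 0, delta > 0, delta < 1)
  threshold <- 1 + 2 * log(2 / delta) / eps
  noisy <- numeric(length(counts))
  occupied <- counts > 0
  # Empty bins are never noised, so they always stay at zero.
  noisy[occupied] <- counts[occupied] + rlaplace(sum(occupied), scale = 2 / eps)
  # Suppress bins whose noisy count is not clearly above the noise level.
  noisy[noisy < threshold] <- 0
  noisy
}

set.seed(123)
c_tilde <- dp_hist_sparse(c(500, 0, 0, 1, 300), eps = 1, delta = 1e-5)
```

With these settings the threshold is about 25.4, so the two large bins are released with small noise, while the empty bins and the bin containing a single point are reported as zero.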

Group-wise DP score histograms

When group labels are available, DP score histograms can be constructed separately for each group. Let

$\{(s_i,g_i)\}_{i=1}^n$

denote the score data with group labels, where $s_i \in \mathbb{R}^2$ is the two-dimensional PCA score and $g_i \in \mathcal{G}$ is the group label.

The score directions, private plotting frame, and histogram grid are shared across all groups. For each group $g \in \mathcal{G}$, define the group-specific bin count

$c_k^{(g)} = \sum_{i=1}^n \mathbf{1}\{s_i \in B_k,\; g_i = g\}.$

Because the groups form a partition of the rows, group-wise histogram releases can use parallel composition across groups on the common grid.

In dppca, the group-wise version can be constructed using either the group-wise additive DP histogram or the group-wise sparse DP histogram.
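The group-specific counts $c_k^{(g)}$ on a common grid can be illustrated with a toy example. The bin indices and labels below are made up; the point is that every row contributes to exactly one group's count vector, so the group-wise counts partition the pooled counts.

```r
m      <- 6
bin_of <- c(1, 1, 2, 4, 4, 4, 6, 6)   # toy bin index of each score point
g      <- factor(c("a", "b", "a", "a", "b", "b", "a", "b"))

# One column of bin counts per group, all on the same m-bin grid.
counts_by_group <- sapply(levels(g), function(lv) {
  tabulate(bin_of[g == lv], nbins = m)
})

# Pooled counts are recovered by summing over groups.
pooled <- tabulate(bin_of, nbins = m)
```

Each column of counts_by_group would then be passed separately to the chosen histogram mechanism (additive or sparse).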

Example usage

library(dppca)

data(gau, package = "dppca")

set.seed(123)
score_plot <- dp_score_plot(
  X = gau,
  eps = 5,
  delta = 1e-5,
  bins = c(15, 15),
  method = c("add", "sparse"),
  axes = c(1, 2)
)

score_plot$plot$all

For grouped score histograms:

library(dppca)

data(gau_g, package = "dppca")

set.seed(123)
score_plot_group <- dp_score_plot_group(
  X = gau_g,
  group = "group",
  eps = 3,
  delta = 1e-5,
  bins = c(15, 15),
  method = c("add", "sparse")
)

score_plot_group$plot$all

References

Nissim, K., Raskhodnikova, S., & Smith, A. (2007). “Smooth sensitivity and sampling in private data analysis”. In Proceedings of the Thirty-Ninth Annual ACM Symposium on Theory of Computing (STOC ’07). Association for Computing Machinery, New York, NY, USA, 75–84. https://doi.org/10.1145/1250790.1250803

Lei, J. (2011). “Differentially private M-estimators”. Advances in Neural Information Processing Systems, 24. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2011/file/f718499c1c8cef6730f9fd03c8125cab-Paper.pdf

Wasserman, L., & Zhou, S. (2010). “A Statistical Framework for Differential Privacy”. Journal of the American Statistical Association, 105(489), 375–389. https://doi.org/10.1198/jasa.2009.tm08651

Karwa, V., & Vadhan, S. (2018). “Finite sample differentially private confidence intervals”. In Proceedings of ITCS 2018, LIPIcs, 94, 44:1–44:9. https://doi.org/10.4230/LIPIcs.ITCS.2018.44