Differentially private score histograms

This function computes two-dimensional principal component scores and returns differentially private histogram estimates on the score space. It returns the score coordinates, the plotting frame, the non-private histogram, and the requested private histogram estimates.

Usage

dp_score(
  X,
  eps,
  delta,
  bins,
  method = c("add", "sparse"),
  center = TRUE,
  standardize = FALSE,
  g_dppca = FALSE,
  cpp.option = FALSE,
  axes = c(1, 2)
)

Arguments

X: A numeric matrix or data frame. Rows correspond to observations and columns correspond to variables.
eps: Positive number defining the total epsilon privacy parameter.
delta: Number in (0, 1) defining the total delta privacy parameter.
bins: Integer vector of length 2 defining the number of histogram bins along the first and second score axes, respectively.
method: Character vector specifying which private histogram methods to compute. Use "add" for the additive Gaussian histogram and "sparse" for the sparse thresholded histogram. The default is c("add", "sparse").
center: A logical value indicating whether to center the columns of X before computing principal component directions. The default is TRUE.
standardize: A logical value indicating whether to scale the columns of X by their sample standard deviations after optional centering. The default is FALSE.
g_dppca: A logical value indicating whether to use private principal component directions. The default is FALSE. See dp_pc_dir() for details.
cpp.option: A logical value passed to dp_pc_dir() when g_dppca = TRUE. The default is FALSE.
axes: Integer vector of length 2 specifying the principal components used to construct the score coordinates. The default is c(1, 2).

Value

A list with components:

score: An \(n \times 2\) matrix containing the PC scores for the two selected axes.
frame: A list with components xlim and ylim.
none: Data frame for the non-private empirical histogram.
add: Data frame for the additive Gaussian private histogram, or NULL if not requested.
sparse: Data frame for the sparse private histogram, or NULL if not requested.
method: Character vector of private histogram methods used.

Details

Let \(v_a\) and \(v_b\) be the principal component directions selected by axes = c(a, b) for some \(1 \le a < b \le \mathrm{ncol}(X)\). After preprocessing, the score point for the \(i\)th observation is \(s_i = (x_i^\top v_a, x_i^\top v_b)\). A non-private score plot would display the points \(s_1, \ldots, s_n\) directly. This function instead summarizes their empirical distribution with a two-dimensional histogram and releases private versions of that histogram for visualization.
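As a rough illustration of the score definition above, the following sketch computes non-private scores for axes = c(1, 2) with center = TRUE and standardize = FALSE. It uses base R only; the toy matrix and the variable names are assumptions made for illustration and do not reflect how dp_score() is implemented internally.

```r
# Illustrative sketch only; not the package's implementation.
set.seed(1)
X <- matrix(rnorm(200 * 4), nrow = 200, ncol = 4)  # toy data: 200 observations, 4 variables

axes <- c(1, 2)

# Preprocessing: center the columns, no scaling (center = TRUE, standardize = FALSE).
Xc <- scale(X, center = TRUE, scale = FALSE)

# Non-private PC directions: right singular vectors of the centered data.
V <- svd(Xc)$v

# Row i of `score` is s_i = (x_i' v_a, x_i' v_b).
score <- Xc %*% V[, axes]
colnames(score) <- paste0("PC", axes)
head(score)
```

Up to the sign of each column, these scores agree with prcomp(X)$x[, 1:2].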

The plotting frame is constructed privately from the score coordinates. The frame center is estimated by coordinate-wise private medians, and the frame radius is estimated by the private 0.99 quantile of the Euclidean distances from this private center. The resulting private radius is inflated by a fixed factor and used to form a square plotting frame. The private frame is computed using a smooth-sensitivity based quantile mechanism (Nissim et al. 2007).
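The geometry of the frame can be sketched without any privacy mechanism. The helper below is hypothetical and deliberately non-private: it uses ordinary medians and quantiles where the function uses their private counterparts, and the inflation factor of 1.1 is an assumed placeholder, since the actual factor is not stated here.

```r
# Non-private sketch of the frame geometry only; no noise is added here.
# `make_frame` and `inflate = 1.1` are hypothetical, for illustration.
make_frame <- function(score, prob = 0.99, inflate = 1.1) {
  center <- unname(apply(score, 2, median))          # coordinate-wise medians
  dist <- sqrt(rowSums(sweep(score, 2, center)^2))   # Euclidean distances from the center
  radius <- inflate * unname(quantile(dist, prob))   # inflated 0.99 quantile of the distances
  list(
    xlim = center[1] + c(-1, 1) * radius,            # square frame: same half-width on both axes
    ylim = center[2] + c(-1, 1) * radius
  )
}

set.seed(1)
score <- matrix(rnorm(600), ncol = 2)
fr <- make_frame(score)
fr
```

Because the radius comes from a high quantile of the distances, almost all score points fall inside the frame; the few that do not are simply left out of the histogram grid.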

The private histogram is computed on the rectangular grid defined by the private frame and the bin counts in bins. Under row-level adjacency, changing one observation can increase one bin count by one and decrease another by one, giving \(\ell_1\) sensitivity at most \(2\) and \(\ell_2\) sensitivity at most \(\sqrt{2}\) for the count vector.
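The sensitivity bound can be checked numerically on a toy example. The sketch below bins two score matrices that differ in a single row on a fixed frame and compares the resulting count vectors. The helper bin_counts and the fixed frame are assumptions made for illustration.

```r
# Hypothetical helper: count score points in each cell of a bins[1] x bins[2] grid.
bin_counts <- function(score, frame, bins) {
  bx <- seq(frame$xlim[1], frame$xlim[2], length.out = bins[1] + 1)
  by <- seq(frame$ylim[1], frame$ylim[2], length.out = bins[2] + 1)
  ix <- cut(score[, 1], bx, include.lowest = TRUE)
  iy <- cut(score[, 2], by, include.lowest = TRUE)
  as.vector(table(ix, iy))  # points outside the frame become NA and are dropped
}

set.seed(1)
frame <- list(xlim = c(-4, 4), ylim = c(-4, 4))
bins <- c(10, 10)

S <- matrix(rnorm(400), ncol = 2)
S2 <- S
S2[1, ] <- c(3.5, 3.5)       # neighboring dataset: one observation replaced

d <- bin_counts(S, frame, bins) - bin_counts(S2, frame, bins)
sum(abs(d))                  # l1 distance, at most 2
sqrt(sum(d^2))               # l2 distance, at most sqrt(2)
```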

Two private histogram mechanisms are supported:

"add" constructs an additive differentially private histogram by adding Gaussian noise to all bin counts, clipping negative noisy counts to zero, and normalizing the result. This additive-noise approach is commonly used for private histograms; see Wasserman and Zhou (2010).
"sparse" constructs a sparse differentially private histogram for settings where many bins are empty. It perturbs only nonzero empirical bin proportions and keeps bins whose noisy values exceed a stability threshold, following the stability-based private histogram idea of Karwa and Vadhan (2018).
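The two mechanisms can be sketched as follows. This is a minimal illustration under stated assumptions, not the calibration used by dp_score(): the Gaussian noise scale follows the classical Gaussian mechanism (whose standard guarantee is stated for eps below 1) with \(\ell_2\) sensitivity \(\sqrt{2}\), and the sparse variant assumes Laplace noise of scale 2 / (eps * n) with threshold 2 * log(2 / delta) / (eps * n) + 1 / n, in the spirit of the stability-based histogram. The function names are hypothetical.

```r
# Sketch under stated assumptions; the noise calibration in dp_score() may differ.

# Laplace(0, scale) draws as the difference of two independent exponentials.
rlaplace <- function(n, scale) scale * (rexp(n) - rexp(n))

# "add": Gaussian noise on every bin count, clip at zero, normalize.
add_hist <- function(counts, eps, delta) {
  sigma <- sqrt(2) * sqrt(2 * log(1.25 / delta)) / eps  # assumed classical calibration
  noisy <- pmax(counts + rnorm(length(counts), sd = sigma), 0)
  if (sum(noisy) > 0) noisy / sum(noisy) else noisy
}

# "sparse": perturb only nonzero proportions, keep those above a threshold.
sparse_hist <- function(counts, eps, delta) {
  n <- sum(counts)
  p <- counts / n
  nz <- counts > 0
  noisy <- numeric(length(p))                       # empty bins stay exactly zero
  noisy[nz] <- p[nz] + rlaplace(sum(nz), scale = 2 / (eps * n))
  thresh <- 2 * log(2 / delta) / (eps * n) + 1 / n  # assumed stability threshold
  noisy[noisy < thresh] <- 0
  noisy                                             # left unnormalized in this sketch
}

set.seed(1)
counts <- c(40, 0, 0, 55, 0, 5, 0, 0, 0)
p_add <- add_hist(counts, eps = 0.5, delta = 1e-3)
p_sparse <- sparse_hist(counts, eps = 0.5, delta = 1e-3)
```

In this sketch the additive version can assign mass to bins that were empty, while the sparse version never does, which is why the latter suits grids where most bins are empty.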

The privacy parameters are allocated across the privacy-consuming steps. If g_dppca = FALSE, half of each of eps and delta is used for private frame construction and the other half for the private histogram. If g_dppca = TRUE, eps and delta are each split equally among private direction estimation, private frame construction, and private histogram release.
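The allocation rule can be written out as a small helper. The function split_budget is hypothetical and only restates the equal split described above; it is not part of the package.

```r
# Hypothetical helper that restates the equal budget split described above.
split_budget <- function(eps, delta, g_dppca = FALSE) {
  steps <- c(if (g_dppca) "direction", "frame", "histogram")
  k <- length(steps)  # 2 steps without private directions, 3 with them
  data.frame(step = steps, eps = eps / k, delta = delta / k)
}

split_budget(eps = 2, delta = 1e-3)                  # frame and histogram each get eps = 1, delta = 5e-4
split_budget(eps = 2, delta = 1e-3, g_dppca = TRUE)  # three steps, each with eps = 2/3, delta = 1e-3/3
```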

For the detailed procedure and mathematical formulation, see https://yejinjo0220.github.io/dppca/articles/dp_score.

References

Dwork C, Roth A (2014). “The Algorithmic Foundations of Differential Privacy.” Found. Trends Theor. Comput. Sci., 9(3–4), 211–407. ISSN 1551-305X, doi:10.1561/0400000042.

Nissim K, Raskhodnikova S, Smith A (2007). “Smooth Sensitivity and Sampling in Private Data Analysis.” In STOC'07: Proceedings of the 39th Annual ACM Symposium on Theory of Computing, 75–84. ISBN 9781595936318, doi:10.1145/1250790.1250803.

Wasserman L, Zhou S (2010). “A Statistical Framework for Differential Privacy.” Journal of the American Statistical Association, 105(489), 375–389. doi:10.1198/jasa.2009.tm08651.

Karwa V, Vadhan S (2018). “Finite Sample Differentially Private Confidence Intervals.” In Proceedings of the 9th Innovations in Theoretical Computer Science Conference, volume 94 of Leibniz International Proceedings in Informatics, 44:1–44:9. doi:10.4230/LIPIcs.ITCS.2018.44.

Kim M, Jung S (2025). “Robust and Differentially Private Principal Component Analysis.” Statistical Analysis and Data Mining: An ASA Data Science Journal, 18(6), e70053. doi:10.1002/sam.70053.

Examples

data(gau, package = "dppca")

# Use a small subset to keep the example fast.
X <- gau[1:300, ]

# Compute two-dimensional PC scores and release a private histogram using the additive method.
set.seed(123)
score_gau <- dp_score(
  X,
  eps = 2,
  delta = 1e-3,
  method = "add",
  bins = c(10, 10)
)

head(score_gau$score)
#>          PC1        PC2
#> 1 -1.6418971 -2.9417503
#> 2  2.4192805 -1.9747774
#> 3 -1.5647289  2.4500389
#> 4  1.1818664  0.6632302
#> 5 -0.7668155 -2.7729387
#> 6  1.4701354  3.1142919
head(score_gau$add)
#>         xmin       xmax      ymin      ymax        prob
#> 1 -5.3429455 -4.4593527 -3.298014 -2.414422 0.016055676
#> 2 -4.4593527 -3.5757599 -3.298014 -2.414422 0.000000000
#> 3 -3.5757599 -2.6921671 -3.298014 -2.414422 0.016720579
#> 4 -2.6921671 -1.8085743 -3.298014 -2.414422 0.000000000
#> 5 -1.8085743 -0.9249815 -3.298014 -2.414422 0.005652482
#> 6 -0.9249815 -0.0413887 -3.298014 -2.414422 0.009703807