A PCA score plot is a standard visualization for examining the
low-dimensional structure of multivariate data. In a non-private
analysis, the score plot displays the projected observations directly.
In dppca, the differentially private score plot instead
represents the distribution of two-dimensional PCA scores by a
differentially private histogram.
PC scores
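Let $X \in \mathbb{R}^{n \times p}$, with rows $x_1^\top, \dots, x_n^\top$, be the input data matrix after the requested preprocessing. In
dppca, preprocessing is controlled by the arguments
center and standardize.
Let $V = (v_1, \dots, v_k) \in \mathbb{R}^{p \times k}$ be the matrix of principal component directions, where the $j$-th column $v_j$ is the $j$-th principal component direction. For the $i$-th observation $x_i$, the $k$-dimensional score vector is
$$z_i = V^\top x_i = (z_{i1}, \dots, z_{ik})^\top \in \mathbb{R}^k.$$
For visualization, we select two score coordinates. If
axes = c(a, b), define
$$s_i = (z_{ia}, z_{ib})^\top \in \mathbb{R}^2, \qquad i = 1, \dots, n.$$
The collection $\{s_i\}_{i=1}^n$ is the two-dimensional score point cloud. A non-private score plot would draw these points directly. The private score plot instead releases a noisy two-dimensional histogram of these points.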
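The score construction can be sketched in base R as follows. This is a non-private illustration under stated assumptions: it uses `prcomp()` on simulated data rather than any dppca routine, and the object names (`X_raw`, `X`, `V`, `Z`, `S`) are chosen here for illustration only.

```r
# Non-private illustration of the score construction (base R only).
set.seed(1)
X_raw <- matrix(rnorm(200 * 4), nrow = 200, ncol = 4)

# Preprocessing: center the columns; no standardization in this sketch.
X <- scale(X_raw, center = TRUE, scale = FALSE)

# Columns of V are the principal component directions.
V <- prcomp(X, center = FALSE, scale. = FALSE)$rotation

# Row i of Z is the score vector of observation i (x_i projected onto V).
Z <- X %*% V

# axes = c(a, b) selects the two score coordinates that are plotted.
axes <- c(1, 2)
S <- Z[, axes, drop = FALSE]
```

Each row of `S` is one point of the two-dimensional score point cloud.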
Overview of the DP score plot
The private score visualization in dppca has the
following steps.
- Compute two-dimensional PCA scores.
- Construct a private plotting frame.
- Divide the frame into rectangular bins.
- Count how many score points fall into each bin.
- Apply a differentially private histogram mechanism.
- Normalize and visualize the noisy bin frequencies.
The plotting frame and histogram both consume privacy budget. If
g_dppca = TRUE, the private PC directions also consume
privacy budget.
1. Private plotting frame
Before constructing a two-dimensional histogram, we need a plotting region. This region is called the plotting frame. If the frame is too narrow, many points are excluded. If it is too wide, the histogram may become sparse and visually uninformative.
The current implementation uses a private center-radius frame. This approach constructs a square frame by privately estimating a center and then privately estimating a radius around that center. The private quantiles appearing in this step are computed using a smooth-sensitivity-based DP quantile estimator, as in Nissim, Raskhodnikova, and Smith (2007).
Private center
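Let $S \in \mathbb{R}^{n \times 2}$ be the score matrix, whose $i$-th row is $s_i^\top = (s_{i1}, s_{i2})$. The frame center is estimated coordinate-wise using private medians:
$$\hat c_1 = \widetilde{Q}_{0.5}(s_{11}, \dots, s_{n1}), \qquad \hat c_2 = \widetilde{Q}_{0.5}(s_{12}, \dots, s_{n2}).$$
Here $\widetilde{Q}_{\tau}$ denotes a private estimate of the $\tau$-quantile. The private center is $\hat c = (\hat c_1, \hat c_2)^\top$.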
Private radius
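After obtaining the private center $\hat c$, with coordinates $(\hat c_1, \hat c_2)$, compute the Euclidean distance from each score point $s_i$ to the private center:
$$d_i = \lVert s_i - \hat c \rVert_2, \qquad i = 1, \dots, n.$$
The radius is then estimated by the private 0.99 quantile of these distances:
$$\tilde r = \widetilde{Q}_{0.99}(d_1, \dots, d_n),$$
where $\widetilde{Q}_{0.99}$ is the private estimate of the 0.99-quantile.
To add a visual margin and reduce boundary effects, the radius is multiplied by an inflation factor $\kappa$:
$$\hat r = \kappa \, \tilde r,$$
where the current implementation uses a fixed value of $\kappa$.
The final plotting frame is
$$[\hat c_1 - \hat r, \ \hat c_1 + \hat r] \times [\hat c_2 - \hat r, \ \hat c_2 + \hat r].$$
This produces a square frame centered at the private center.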
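The geometry of the center-radius frame can be sketched as below. This is only an illustration of the construction, not the dppca implementation: the function names `np_quantile` and `center_radius_frame` are hypothetical, the default quantile function is the ordinary non-private `stats::quantile()` standing in for a private estimator, and the inflation value `1.1` is an arbitrary choice for the example.

```r
# Placeholder quantile function. It is NOT differentially private;
# a private quantile estimator would be plugged in here instead.
np_quantile <- function(x, tau) unname(stats::quantile(x, probs = tau))

# Build a square frame from a coordinate-wise median center and an
# inflated 0.99 quantile of the distances to that center.
center_radius_frame <- function(S, inflate, qfun = np_quantile) {
  center <- c(qfun(S[, 1], 0.5), qfun(S[, 2], 0.5))
  d <- sqrt((S[, 1] - center[1])^2 + (S[, 2] - center[2])^2)
  radius <- inflate * qfun(d, 0.99)
  list(
    center = center,
    radius = radius,
    xlim = center[1] + c(-1, 1) * radius,
    ylim = center[2] + c(-1, 1) * radius
  )
}

set.seed(2)
S <- cbind(rnorm(500), rnorm(500, sd = 0.5))
frame <- center_radius_frame(S, inflate = 1.1)  # 1.1 is illustrative only
```

Both `frame$xlim` and `frame$ylim` have width `2 * frame$radius`, so the frame is square even when the two score coordinates have different spreads.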
Numerical safeguard for the private radius
The distances are nonnegative, but the private quantile estimator adds random noise. Therefore, the private radius estimate can occasionally become non-finite or nonpositive, especially when the privacy budget is very small, the sample size is small, or the score points are nearly identical.
The implementation checks the private radius before forming the frame. If the private radius is not finite or is nonpositive, the score plotting routine stops with an informative error.
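A check of this kind might look like the following sketch. The function name `check_radius` and the wording of the error message are hypothetical; the actual message produced by dppca may differ.

```r
# Stop with an error unless the radius is a single finite, positive number.
check_radius <- function(radius) {
  if (length(radius) != 1L || !is.finite(radius) || radius <= 0) {
    stop(
      "Private radius estimate is not a finite positive number; ",
      "the plotting frame cannot be formed.",
      call. = FALSE
    )
  }
  invisible(radius)
}

check_radius(1.5)                                  # passes silently
res <- try(check_radius(-0.2), silent = TRUE)      # caught error
```

Failing early avoids building a histogram grid on a degenerate or undefined frame.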
2. Choosing the number of bins
After the plotting frame has been determined, it is divided into histogram bins. Let $m_1$ and $m_2$ be the number of bins along the two score axes. The two-dimensional histogram then has
$$m = m_1 m_2$$
bins in total.
In dppca, the user specifies the bin counts through the
bins argument, for example bins = c(20, 20).
The best bin choice depends on the sample size, privacy budget, and
visible structure in the score distribution. Fewer bins can be more
stable under stronger privacy noise, while more bins can reveal finer
structure when the sample size and privacy budget are sufficiently
large.
3. Two-dimensional histogram
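Let the private plotting frame be divided into bins $B_1, \dots, B_m$. For the score point set $\{s_i\}_{i=1}^n$, the non-private count in bin $B_j$ is
$$N_j = \sum_{i=1}^n \mathbf{1}\{s_i \in B_j\}, \qquad j = 1, \dots, m.$$
The count vector is $N = (N_1, \dots, N_m)$. The empirical frequency in bin $B_j$ is
$$\hat p_j = \frac{N_j}{n}.$$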
The private score visualization displays a noisy version of this frequency vector.
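The non-private counting step can be sketched in base R. The helper `hist2d_counts` is a hypothetical name, not a dppca function; points outside the frame are simply dropped, and dividing by the number of rows to obtain frequencies is an assumption of this sketch.

```r
# Count score points per rectangular bin of a given frame.
hist2d_counts <- function(S, xlim, ylim, bins) {
  # Equally spaced bin edges along each axis.
  bx <- seq(xlim[1], xlim[2], length.out = bins[1] + 1)
  by <- seq(ylim[1], ylim[2], length.out = bins[2] + 1)
  # Bin index of each point; 0 or bins + 1 means outside the frame.
  ix <- findInterval(S[, 1], bx, rightmost.closed = TRUE)
  iy <- findInterval(S[, 2], by, rightmost.closed = TRUE)
  inside <- ix >= 1 & ix <= bins[1] & iy >= 1 & iy <= bins[2]
  counts <- table(
    factor(ix[inside], levels = seq_len(bins[1])),
    factor(iy[inside], levels = seq_len(bins[2]))
  )
  matrix(as.integer(counts), nrow = bins[1], ncol = bins[2])
}

S <- rbind(c(0.1, 0.1), c(0.2, 0.3), c(0.9, 0.9), c(2.0, 0.5))
counts <- hist2d_counts(S, xlim = c(0, 1), ylim = c(0, 1), bins = c(2, 2))
freq <- counts / nrow(S)  # assumption: frequencies are relative to all rows
```

In this toy example the fourth point lies outside the frame, so only three of the four points are counted.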
Sensitivity of histogram counts
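Under row-level adjacency, two neighboring datasets differ in one observation. Changing one observation can move one score point from one bin to another. Therefore, the count vector can change by at most $1$ in one bin and $1$ in another bin. Hence, for the count vectors $N$ and $N'$ of two neighboring datasets,
$$\lVert N - N' \rVert_1 \le 2, \qquad \lVert N - N' \rVert_2 \le \sqrt{2}.$$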
These sensitivity bounds are used to calibrate privacy noise for the histogram mechanisms.
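As a small numerical check of this argument, the sketch below compares two made-up count vectors that differ by moving a single point from one bin to another. The numbers are illustrative only.

```r
# Bin counts for a dataset and for a neighboring dataset in which
# one point has moved from bin 3 to bin 2.
counts_a <- c(3, 0, 5, 2)
counts_b <- c(3, 1, 4, 2)

l1_change <- sum(abs(counts_a - counts_b))       # L1 distance: 2
l2_change <- sqrt(sum((counts_a - counts_b)^2))  # L2 distance: sqrt(2)
```

If the moved point stays in the same bin, both distances are zero, so these values are the worst case.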
4. Privacy accounting
The DP score histogram procedure has two main privacy-consuming steps
when g_dppca = FALSE:
- private quantile estimation for constructing the plotting frame,
- private histogram release.
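If the total privacy budget is $(\varepsilon, \delta)$, the implementation splits the budget between the frame construction and the histogram release, so that
$$\varepsilon = \varepsilon_{\mathrm{frame}} + \varepsilon_{\mathrm{hist}}.$$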
The frame construction itself uses three private quantile estimates: two private medians for the center and one private 0.99 quantile for the radius. These share the frame budget by basic composition.
When g_dppca = TRUE, private PC direction estimation
also consumes privacy budget. In that case, the total budget is split
across
- private PC direction estimation,
- private plotting frame construction,
- private histogram release.
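The implementation uses an equal split:
$$\varepsilon_{\mathrm{pca}} = \varepsilon_{\mathrm{frame}} = \varepsilon_{\mathrm{hist}} = \frac{\varepsilon}{3}.$$
By basic composition, the overall procedure satisfies the requested $(\varepsilon, \delta)$-DP guarantee.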
Method 1: Additive DP histogram
A simple DP histogram can be constructed by adding independent Gaussian noise to each bin count. The noisy counts are then post-processed to be nonnegative and normalized. This additive-noise approach is commonly used for DP histograms (Wasserman and Zhou, 2010), and the procedure is summarized in "Additive DP histogram".
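A minimal sketch of this idea is given below. It is not the dppca implementation: the function name `additive_dp_hist` is hypothetical, and the noise scale uses the classical Gaussian-mechanism calibration $\sigma = \Delta_2 \sqrt{2 \log(1.25/\delta)} / \varepsilon$, which is usually stated for $\varepsilon < 1$. The package may calibrate its noise differently. The default $\Delta_2 = \sqrt{2}$ corresponds to one point moving between two bins.

```r
# Additive Gaussian histogram: noise on every bin, then post-processing.
additive_dp_hist <- function(counts, eps, delta, l2_sens = sqrt(2)) {
  # Classical Gaussian-mechanism noise scale (an assumption of this sketch).
  sigma <- l2_sens * sqrt(2 * log(1.25 / delta)) / eps
  noisy <- counts + rnorm(length(counts), mean = 0, sd = sigma)
  # Post-processing: clamp negative values, then normalize to frequencies.
  noisy <- pmax(noisy, 0)
  total <- sum(noisy)
  if (total > 0) noisy / total else noisy
}

set.seed(1)
counts <- matrix(c(40, 0, 10, 50), nrow = 2, ncol = 2)
freq <- additive_dp_hist(counts, eps = 0.5, delta = 1e-5)
```

Clamping and normalizing are post-processing steps, so they do not consume additional privacy budget. Note that empty bins also receive noise, which is why sparse grids can look cluttered under this method.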
Method 2: Sparse DP histogram
When many bins are empty, adding noise to every bin can dominate the visualization. A sparse histogram aims to report only bins whose counts are large enough to be distinguishable from noise.
In dppca, the sparse histogram is based on the
stability-based private histogram idea of Karwa
and Vadhan (2018), summarized in "Sparse DP histogram".
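One common textbook formulation of a stability-based histogram is sketched below: Laplace noise with scale $2/\varepsilon$ is added to non-empty bins only, and a bin is reported only if its noisy count reaches the threshold $1 + 2\log(2/\delta)/\varepsilon$. This is an illustrative sketch under those assumptions. The names `rlaplace` and `sparse_dp_hist` are hypothetical, and the constants used in dppca may differ.

```r
# Laplace(0, scale) samples as the difference of two exponentials.
rlaplace <- function(n, scale) {
  rexp(n, rate = 1 / scale) - rexp(n, rate = 1 / scale)
}

# Stability-based sparse histogram on a vector or matrix of bin counts.
sparse_dp_hist <- function(counts, eps, delta) {
  threshold <- 1 + 2 * log(2 / delta) / eps
  noisy <- numeric(length(counts))
  nonempty <- which(counts > 0)
  # Only non-empty bins receive noise; empty bins stay exactly zero.
  noisy[nonempty] <- counts[nonempty] + rlaplace(length(nonempty), 2 / eps)
  # Suppress bins whose noisy count is below the threshold.
  noisy[noisy < threshold] <- 0
  total <- sum(noisy)
  freq <- if (total > 0) noisy / total else noisy
  dim(freq) <- dim(counts)
  freq
}

set.seed(3)
counts <- matrix(c(200, 0, 0, 1, 0, 150), nrow = 2, ncol = 3)
freq <- sparse_dp_hist(counts, eps = 1, delta = 1e-5)
```

Bins with small true counts are very likely to be suppressed, which keeps the display free of isolated noise-only cells at the cost of hiding sparsely populated regions.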
Group-wise DP score histograms
When group labels are available, DP score histograms can be constructed separately for each group. Let
$$\{(s_i, g_i)\}_{i=1}^n$$
denote the score data with group labels, where $s_i \in \mathbb{R}^2$ is the two-dimensional PCA score and $g_i$ is the group label.
The score directions, private plotting frame, and histogram grid are shared across all groups. For each group $g$, define the group-specific bin count
$$N_{gj} = \sum_{i=1}^n \mathbf{1}\{s_i \in B_j, \ g_i = g\}, \qquad j = 1, \dots, m.$$
Because the groups form a partition of the rows, group-wise histogram releases can use parallel composition across groups on the common grid.
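The group-specific counts on a common grid can be illustrated with a small base R sketch. The bin indices and group labels below are made up for the example, and the object names are hypothetical.

```r
# Bin index (on the shared grid) and group label of six score points.
bin_id <- c(1, 1, 2, 3, 3, 3)
group <- c("a", "b", "a", "a", "b", "b")
n_bins <- 4

# One row of counts per group, one column per bin of the common grid.
group_counts <- table(
  factor(group),
  factor(bin_id, levels = seq_len(n_bins))
)

# Each observation contributes to exactly one cell of the table.
pooled_counts <- colSums(group_counts)
```

Because every row of the data belongs to exactly one group, summing the group tables over groups recovers the pooled bin counts.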
In dppca, the group-wise version can be constructed
using either the group-wise
additive DP histogram or the group-wise sparse DP
histogram.
Example usage
```r
library(dppca)
data(gau, package = "dppca")
set.seed(123)
score_plot <- dp_score_plot(
  X = gau,
  eps = 5,
  delta = 1e-5,
  bins = c(15, 15),
  method = c("add", "sparse"),
  axes = c(1, 2)
)
score_plot$plot$all
```
For grouped score histograms:
References
Nissim, K., Raskhodnikova, S., & Smith, A. (2007). “Smooth sensitivity and sampling in private data analysis”. In Proceedings of the Thirty-Ninth Annual ACM Symposium on Theory of Computing (STOC ’07), 75–84. Association for Computing Machinery, New York, NY, USA. https://doi.org/10.1145/1250790.1250803
Lei, J. (2011). “Differentially private M-estimators”. Advances in Neural Information Processing Systems, 24. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2011/file/f718499c1c8cef6730f9fd03c8125cab-Paper.pdf
Wasserman, L., & Zhou, S. (2010). “A Statistical Framework for Differential Privacy”. Journal of the American Statistical Association, 105(489), 375–389. https://doi.org/10.1198/jasa.2009.tm08651
Karwa, V., & Vadhan, S. (2018). “Finite sample differentially private confidence intervals”. In Proceedings of ITCS 2018, LIPIcs, 94, 44:1–44:9. https://doi.org/10.4230/LIPIcs.ITCS.2018.44