This article describes the scree estimation used in the
dppca package. The main idea is to rewrite each scree value
as a mean estimation problem.
The package considers three main methods as follows.
- Clipped mean using Gaussian noise
- Huber GDP mean based on noisy gradient descent
- Private modified winsorized mean (PMWM)
Overview
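Let $X \in \mathbb{R}^{n \times p}$ be the preprocessed data matrix. This means that the data have been centered, and possibly standardized.
Let $V = [v_1, \dots, v_k] \in \mathbb{R}^{p \times k}$ be the matrix of PC directions. Depending on the analysis, $V$ may be non-private or differentially private.
The goal is to estimate the scree vector $(\lambda_1, \dots, \lambda_k)$ in a differentially private way.
In the sample PCA case, the $j$-th scree value is
$$\lambda_j = \frac{1}{n-1} \lVert X v_j \rVert_2^2 .$$
In dppca, we construct private estimates $\tilde\lambda_1, \dots, \tilde\lambda_k$
and use them to build a differentially private scree plot.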
Scree Estimation in dppca
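For the $j$-th principal component direction $v_j$, define the score vector $z_j = X v_j = (z_{1j}, \dots, z_{nj})^\top$.
Because $X$ is centered, the scores have mean zero, so the sample variance of the $j$-th score vector is
$$\lambda_j = \frac{1}{n-1} \sum_{i=1}^{n} z_{ij}^2 = \frac{n}{n-1} \cdot \frac{1}{n} \sum_{i=1}^{n} y_{ij}, \qquad y_{ij} = z_{ij}^2 .$$
Therefore, for each component $j$, scree estimation can be viewed as a mean estimation problem for the squared scores $y_{1j}, \dots, y_{nj}$.
In dppca, we want to construct a differentially private estimate of $\lambda_j$, denoted by $\tilde\lambda_j$.
So, we first privately estimate the mean of $y_{1j}, \dots, y_{nj}$, and then rescale it by $n/(n-1)$.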
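The reduction to mean estimation can be checked numerically. The following is a minimal base-R sketch on simulated data; the toy matrix, the use of `prcomp()` for non-private directions, and all variable names are illustrative assumptions and are not taken from dppca.

```r
set.seed(1)

# Toy data (assumption): n observations, p variables, centered columns.
n <- 200
p <- 5
X <- scale(matrix(rnorm(n * p), n, p), center = TRUE, scale = FALSE)

# Non-private PC directions from base R's prcomp(); columns of V are v_1..v_k.
k <- 3
pc <- prcomp(X, center = FALSE)
V <- pc$rotation[, 1:k]

# Scores z_j = X v_j and squared scores y_ij = z_ij^2.
Z <- X %*% V
Y <- Z^2

# Scree values written as rescaled means of the squared scores.
scree_from_means <- n / (n - 1) * colMeans(Y)

# Reference values: variances of the principal components.
scree_reference <- pc$sdev[1:k]^2

print(cbind(scree_from_means, scree_reference))
```

Both columns agree up to floating-point error, so any private mean estimator applied to the squared scores yields a private scree estimate after rescaling.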
Method 1: Clipped scree estimation
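The simplest way to privately estimate the mean of $y_{1j}, \dots, y_{nj}$ is to first clip the values to a bounded interval and then add Gaussian noise.
Clipping
Choose a clipping threshold $C > 0$.
Define the clipped observations by $y_{ij}^{C} = \min(y_{ij}, C)$.
Then
$$0 \le y_{ij}^{C} \le C .$$
The clipped empirical mean is
$$\bar y_j^{C} = \frac{1}{n} \sum_{i=1}^{n} y_{ij}^{C} .$$
The corresponding non-private clipped scree estimate is
$$\hat\lambda_j^{C} = \frac{n}{n-1} \, \bar y_j^{C} .$$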
Sensitivity
Because each clipped observation lies in $[0, C]$, changing one observation can change the mean by at most $C/n$.
After multiplying by $n/(n-1)$, the sensitivity of the clipped scree estimate is
$$\Delta_j = \frac{n}{n-1} \cdot \frac{C}{n} = \frac{C}{n-1} .$$
Parameter
Clipped scree estimation is simple and easy to implement, but it depends on the choice of the clipping threshold $C$.
- If $C$ is too small, the scree values may be underestimated, because large squared scores are truncated.
- If $C$ is too large, more privacy noise is required, because the sensitivity grows with $C$.
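The trade-off in the choice of the clipping threshold can be illustrated with a small base-R sketch. The helper `clipped_scree_sketch()` is hypothetical and not part of dppca; it assumes that squared scores are clipped to $[0, C]$ and that noise is calibrated with the classical Gaussian mechanism (valid for $\varepsilon < 1$), which may differ from the calibration used inside the package.

```r
# Hypothetical helper (not part of dppca): clipped-mean scree estimate with
# Gaussian noise for a single component.
clipped_scree_sketch <- function(y, C, eps, delta) {
  n <- length(y)
  y_clip <- pmin(pmax(y, 0), C)        # clip squared scores to [0, C]
  sens <- C / (n - 1)                  # sensitivity after the n / (n - 1) rescaling
  # ASSUMPTION: classical Gaussian-mechanism calibration, requires eps < 1.
  sigma <- sens * sqrt(2 * log(1.25 / delta)) / eps
  estimate <- n / (n - 1) * mean(y_clip) + rnorm(1, sd = sigma)
  list(estimate = estimate, sigma = sigma, sensitivity = sens)
}

set.seed(42)
y <- rnorm(500)^2                      # toy squared scores

# A small threshold truncates more values; a large one needs more noise.
out_small <- clipped_scree_sketch(y, C = 3, eps = 0.5, delta = 1e-6)
out_large <- clipped_scree_sketch(y, C = 30, eps = 0.5, delta = 1e-6)

print(c(sigma_small = out_small$sigma, sigma_large = out_large$sigma))
```

In this sketch the noise standard deviation scales linearly with the threshold, while the clipping bias shrinks as the threshold grows.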
Example usage
In dppca, the clipped-mean parameter is set through clipped_control():
library(dppca)

# X: numeric data matrix with observations in rows
dp_scree(
  X,
  k = 3,
  method = "clipped",
  eps_total = 1,
  delta_total = 1e-6,
  center = TRUE,
  standardize = FALSE,
  control = clipped_control(
    C_clip = 3  # clipping threshold
  )
)
Method 2: Huber scree estimation
The Huber method is based on the differentially private robust mean estimator of Yu, Ren, and Zhou (2024). For each component $j$, it estimates the mean of $y_{1j}, \dots, y_{nj}$ using a robust loss function instead of the ordinary sample mean.
Huber loss
For a robustification parameter $\tau > 0$, the Huber loss is
$$\ell_\tau(u) = \begin{cases} u^2 / 2, & |u| \le \tau, \\ \tau |u| - \tau^2 / 2, & |u| > \tau. \end{cases}$$
For small residuals, the Huber loss behaves like squared error loss. For large residuals, it grows linearly, which reduces the influence of extreme observations.
The derivative of the Huber loss is the score function
$$\psi_\tau(u) = \ell_\tau'(u) = \operatorname{sign}(u) \min(|u|, \tau) .$$
Thus, $\psi_\tau$ clips the residual to the interval $[-\tau, \tau]$.
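The relationship between the Huber loss and its clipped score function can be verified directly. This is a small base-R sketch; the function names `huber_loss()` and `huber_psi()` are illustrative and are not dppca functions.

```r
# Huber loss: quadratic near zero, linear in the tails.
huber_loss <- function(u, tau) {
  ifelse(abs(u) <= tau, u^2 / 2, tau * abs(u) - tau^2 / 2)
}

# Score function: the residual clipped to [-tau, tau].
huber_psi <- function(u, tau) {
  pmin(pmax(u, -tau), tau)
}

u <- c(-5, -1, 0, 0.5, 4)
tau <- 2

# Central finite difference of the loss, to compare with the score function.
h <- 1e-6
num_deriv <- (huber_loss(u + h, tau) - huber_loss(u - h, tau)) / (2 * h)

print(rbind(psi = huber_psi(u, tau), numeric = num_deriv))
```

The two rows agree at every point away from the kinks at $\pm\tau$, confirming that the score is the derivative of the loss.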
Huber mean estimator
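For the $j$-th component, the Huber mean estimator is defined as
$$\hat\mu_j = \arg\min_{\mu \in \mathbb{R}} \frac{1}{n} \sum_{i=1}^{n} \ell_\tau(y_{ij} - \mu),$$
where $\ell_\tau$ is the Huber loss.
The first-order condition is
$$\frac{1}{n} \sum_{i=1}^{n} \psi_\tau(y_{ij} - \hat\mu_j) = 0,$$
where $\psi_\tau = \ell_\tau'$ is the score function.
This equation shows that large residuals are clipped through the score function. As a result, the estimator is less sensitive to outliers than the ordinary mean.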
Noisy gradient descent
The Huber objective is optimized using noisy gradient descent.
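Given the current estimate $\mu_j^{(t)}$ at iteration $t$, we define the corresponding residuals as
$$r_{ij}^{(t)} = y_{ij} - \mu_j^{(t)} .$$
The clipped score values are $\psi_\tau(r_{ij}^{(t)})$,
and the average gradient-type quantity (the negative gradient of the Huber objective) is
$$g_j^{(t)} = \frac{1}{n} \sum_{i=1}^{n} \psi_\tau(r_{ij}^{(t)}) .$$
A non-private update with step size $\eta_0$ would take the form
$$\mu_j^{(t+1)} = \mu_j^{(t)} + \eta_0 \, g_j^{(t)} .$$
To ensure privacy, Gaussian noise is added at each iteration.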
Sensitivity of one gradient step
Since
$$|\psi_\tau(r_{ij}^{(t)})| \le \tau ,$$
changing one observation can change the average score by at most $2\tau/n$.
After multiplying by the step size $\eta_0$, the one-step sensitivity is
$$\Delta^{(t)} = \frac{2 \eta_0 \tau}{n} .$$
This sensitivity determines the scale of the Gaussian noise added at each gradient step.
DP Huber scree estimate
After $T$ noisy gradient descent steps, let the final private Huber mean estimate be $\tilde\mu_j = \mu_j^{(T)}$.
The final Huber scree estimate is
$$\tilde\lambda_j^{\text{Huber}} = \frac{n}{n-1} \, \tilde\mu_j .$$
The Huber method is useful when the squared scores may be heavy-tailed or affected by outliers.
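The noisy gradient descent loop can be sketched in a few lines of base R. The helper `huber_ngd_sketch()` is hypothetical and not part of dppca. It assumes the one-step sensitivity $2\eta_0\tau/n$, splits $(\varepsilon, \delta)$ evenly over the iterations by basic composition, and uses the classical Gaussian mechanism per step; the package and Yu, Ren, and Zhou (2024) may calibrate the noise differently.

```r
# Hypothetical helper (not part of dppca): noisy gradient descent for the
# Huber mean of squared scores y, followed by the n / (n - 1) rescaling.
huber_ngd_sketch <- function(y, tau, eps, delta, mu0 = 0, eta0 = 1, T = 50) {
  n <- length(y)
  sens <- 2 * eta0 * tau / n                   # one-step sensitivity
  # ASSUMPTION: even budget split over T steps (basic composition) and the
  # classical Gaussian mechanism per step (requires eps / T < 1).
  eps_t <- eps / T
  delta_t <- delta / T
  sigma <- sens * sqrt(2 * log(1.25 / delta_t)) / eps_t
  mu <- mu0
  for (t in seq_len(T)) {
    psi <- pmin(pmax(y - mu, -tau), tau)       # clipped residuals
    mu <- mu + eta0 * mean(psi) + rnorm(1, sd = sigma)
  }
  list(mu = mu, scree = n / (n - 1) * mu, sigma = sigma)
}

set.seed(7)
y <- rnorm(5000)^2                             # toy squared scores
fit <- huber_ngd_sketch(y, tau = 5, eps = 0.9, delta = 1e-6)
print(fit$scree)
```

Because every step consumes part of the budget, more iterations mean more noise per step, so the number of iterations and the step size have to be balanced against the sample size.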
Robustification parameter
The robustification parameter $\tau$ controls the trade-off between bias, robustness, and privacy.
A small $\tau$ gives stronger clipping and more robustness, but may increase the bias of the estimator. A large $\tau$ behaves more like the ordinary mean, but is less robust and requires more noise, because the per-step sensitivity is proportional to $\tau$.
In the Huber DP mean, $\tau$ is chosen using a scale quantity $m_2$ related to the second moment; the theoretical form is given in Yu, Ren, and Zhou (2024).
Because $m_2$ is usually unknown and can be sensitive to outliers, it may need to be estimated robustly and privately. We use a private robust estimator for $m_2$ described in Algorithm 2.
Other control parameters
The Huber scree method also uses additional control parameters for noisy gradient descent and for the private scale-proxy step used to estimate $m_2$. The default values are chosen based on the recommendations in Yu, Ren, and Zhou (2024).
- `mu0`: Initial value for noisy gradient descent.
- `eta0`: Step size for noisy gradient descent.
- `T`: Number of noisy gradient descent iterations.
- `M`: Number of blocks used in the private estimator for $m_2$.
- `k_min_m2` and `k_max_m2`: Lower and upper dyadic bin indices used in the private histogram step for estimating $m_2$. The histogram searches over scale levels $2^k$ for $k$ between `k_min_m2` and `k_max_m2`.
- `m2_frac`: Fraction of the Huber scree privacy budget used to privately estimate $m_2$. If $\varepsilon$ is the budget for Huber scree estimation, then `m2_frac` $\cdot\, \varepsilon$ is spent on estimating $m_2$, while the remaining budget is used for Huber noisy gradient descent.
Example usage
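In dppca, the Huber scree estimator is controlled through huber_control():
dp_scree(
  X,
  k = 3,
  method = "huber",
  eps_total = 1,
  delta_total = 1e-6,
  center = TRUE,
  standardize = FALSE,
  control = huber_control(
    mu0 = 0,         # initial value for noisy gradient descent
    eta0 = 1,        # step size
    T = 50,          # number of noisy gradient descent iterations
    M = 20,          # number of blocks in the private m2 estimator
    k_min_m2 = -40,  # lower dyadic bin index
    k_max_m2 = 40,   # upper dyadic bin index
    m2_frac = 1/4    # budget fraction spent on estimating m2
  )
)
Method 3: Private modified winsorized mean scree estimation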
The PMWM method is based on the private modified winsorized mean of Ramsay and Spicker (2025). For each component $j$, it estimates the mean of $y_{1j}, \dots, y_{nj}$ by privately estimating tail quantiles, winsorizing the data, and then adding noise to the winsorized mean.
Non-private modified winsorized mean
The PMWM method builds on the non-private modified winsorized mean of Lugosi and Mendelson (2021).
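The non-private modified winsorized mean starts with a clipping proportion $\gamma \in (0, 1/2)$, which determines the lower and upper tail quantiles used for winsorization.
Let $\hat q_{\gamma}$ and $\hat q_{1-\gamma}$ be empirical lower and upper quantiles.
Define the function
$$\phi_{\ell, u}(x) = \begin{cases} \ell, & x < \ell, \\ x, & \ell \le x \le u, \\ u, & x > u. \end{cases}$$
The non-private modified winsorized mean is
$$\hat\mu^{W} = \frac{1}{n} \sum_{i=1}^{n} \phi_{\hat q_{\gamma}, \hat q_{1-\gamma}}(y_i) .$$
The idea is to first estimate the lower and upper tail cutoffs, and then winsorize the data by replacing extreme values with the corresponding cutoffs.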
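The effect of winsorization on an outlier can be seen in a short base-R sketch. The helper `winsorized_mean_sketch()` is hypothetical and not part of dppca; it uses `quantile()` on the full sample for both cutoffs, which is a simplification of the sample-splitting version described in the literature.

```r
# Hypothetical helper (not part of dppca): non-private winsorized mean with
# empirical cutoffs at the gamma and 1 - gamma quantiles.
winsorized_mean_sketch <- function(y, gamma) {
  cutoffs <- quantile(y, probs = c(gamma, 1 - gamma), names = FALSE)
  y_w <- pmin(pmax(y, cutoffs[1]), cutoffs[2])   # replace tails by the cutoffs
  list(mean = mean(y_w), lower = cutoffs[1], upper = cutoffs[2], values = y_w)
}

set.seed(3)
y <- c(rnorm(200)^2, 1e4)                        # toy squared scores plus one gross outlier
w <- winsorized_mean_sketch(y, gamma = 0.05)

print(c(plain_mean = mean(y), winsorized_mean = w$mean))
```

The plain mean is dominated by the single outlier, while the winsorized mean stays close to the bulk of the data.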
Sample splitting notation
In the theoretical description, the data may be split into two subsets:
- a quantile estimation subset with size $n_q$,
- a mean estimation subset with size $n_m$.
If the full sample size is $n$, a simple split is
$$n_q = \lfloor n/2 \rfloor, \qquad n_m = n - n_q .$$
In practice, the implementation may use all available observations at each step instead of splitting the sample.
Private quantile estimation
PMWM uses private quantile estimates instead of non-private empirical quantiles. For component $j$, let the clipping proportion be $\gamma_j$.
A theoretical choice of $\gamma_j$ depends on the following quantities:
- $\eta$: Contamination level,
- $\beta$: Log-grid parameter in the private quantile estimator,
- $a, b$: Lower and upper search bounds used in private quantile estimation,
- a confidence parameter controlling the high-probability accuracy statement of the PMWM estimator,
- $n_q$: Number of observations used for quantile estimation.
For practical implementation, the paper suggests using the clipping proportion
$$\gamma_j = \max\left(\eta, \frac{c}{n_q}\right),$$
where $c$ is a user-chosen trimming constant (`trim_const`). This is also the choice used in the dppca package.
Winsorized mean
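Let $\tilde q_{j,L}$ and $\tilde q_{j,U}$ be private estimates of the lower and upper quantiles.
Define the winsorized observations by
$$y_{ij}^{W} = \min\{\max(y_{ij}, \tilde q_{j,L}), \tilde q_{j,U}\} .$$
Equivalently,
- if $y_{ij} < \tilde q_{j,L}$, replace it by $\tilde q_{j,L}$;
- if $\tilde q_{j,L} \le y_{ij} \le \tilde q_{j,U}$, keep it unchanged;
- if $y_{ij} > \tilde q_{j,U}$, replace it by $\tilde q_{j,U}$.
Thus,
$$\tilde q_{j,L} \le y_{ij}^{W} \le \tilde q_{j,U} .$$
Using the mean estimation subset $I_m$ of size $n_m$, define $\bar y_j^{W} = \frac{1}{n_m} \sum_{i \in I_m} y_{ij}^{W}$.
The corresponding non-private winsorized scree estimate is
$$\hat\lambda_j^{W} = \frac{n}{n-1} \, \bar y_j^{W} .$$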
Sensitivity
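Because all winsorized observations lie in $[\tilde q_{j,L}, \tilde q_{j,U}]$, where $\tilde q_{j,L}$ and $\tilde q_{j,U}$ are the private lower and upper cutoffs, the sensitivity of the winsorized mean over $n_m$ observations is
$$\Delta_j^{\text{mean}} = \frac{\tilde q_{j,U} - \tilde q_{j,L}}{n_m} .$$
After multiplying by $n/(n-1)$, the scree sensitivity is $\Delta_j = \frac{n}{n-1} \cdot \frac{\tilde q_{j,U} - \tilde q_{j,L}}{n_m}$.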
DP PMWM scree estimate
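The final PMWM scree estimate can be written as
$$\tilde\lambda_j^{\text{PMWM}} = \frac{n}{n-1} \, \bar y_j^{W} + \xi_j ,$$
where $\bar y_j^{W}$ is the winsorized mean and $\xi_j$ is the privacy noise.
For privacy parameters $(\varepsilon_{j,m}, \delta_{j,m})$ of the mean step, the noise scale is calibrated to the scree sensitivity.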
Privacy budget splitting
PMWM uses privacy budget for both private quantile estimation and private mean estimation.
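For component $j$, let the total component-level budget be $(\varepsilon_j, \delta_j)$.
This can be split as
$$\varepsilon_j = \varepsilon_{j,q} + \varepsilon_{j,m}, \qquad \delta_j = \delta_{j,q} + \delta_{j,m},$$
where the subscripts $q$ and $m$ refer to the quantile step and the mean step.
The quantile budget itself is used to estimate two quantiles, so it can be split again.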
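The budget split and the noisy release of the winsorized mean can be sketched as follows in base R. Everything here is an illustrative assumption rather than the dppca implementation: the helper `pmwm_release_sketch()` is hypothetical, the cutoffs are fixed placeholders standing in for the output of the private quantile step (which is not implemented here), the budget is split 50/50, and the noise uses the classical Gaussian mechanism (valid for $\varepsilon < 1$).

```r
# Hypothetical helper (not part of dppca): noisy winsorized-mean scree value,
# given cutoffs q_lo < q_hi that are assumed to be already privately released.
pmwm_release_sketch <- function(y_mean, q_lo, q_hi, n, eps_m, delta_m) {
  n_m <- length(y_mean)
  y_w <- pmin(pmax(y_mean, q_lo), q_hi)          # winsorize to [q_lo, q_hi]
  sens <- n / (n - 1) * (q_hi - q_lo) / n_m      # scree sensitivity
  # ASSUMPTION: classical Gaussian mechanism, requires eps_m < 1.
  sigma <- sens * sqrt(2 * log(1.25 / delta_m)) / eps_m
  list(estimate = n / (n - 1) * mean(y_w) + rnorm(1, sd = sigma),
       sensitivity = sens, sigma = sigma)
}

set.seed(11)
y <- rnorm(1000)^2                               # toy squared scores
n <- length(y)
idx_q <- seq_len(n %/% 2)                        # quantile-estimation half
y_m <- y[-idx_q]                                 # mean-estimation half

# Illustrative 50/50 split of the component budget (assumption).
eps_j <- 0.8
delta_j <- 1e-6
eps_q <- eps_j / 2
eps_m <- eps_j - eps_q
delta_m <- delta_j / 2
eps_per_quantile <- eps_q / 2                    # two quantiles share eps_q

# Placeholder cutoffs; in PMWM these come from the private quantile step.
out <- pmwm_release_sketch(y_m, q_lo = 0, q_hi = 6, n = n,
                           eps_m = eps_m, delta_m = delta_m)
print(out$estimate)
```

Tighter cutoffs reduce the sensitivity and hence the noise, which is why spending part of the budget on good quantile estimates can pay off in the mean step.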
Parameters
The PMWM scree estimator uses additional parameters for private quantile estimation and winsorization.
- `beta`: Log-binning base used in the private quantile estimator. It determines the spacing of the geometric search grid and must satisfy $\beta > 1$.
- `a`, `b`: Lower and upper search bounds supplied to the private quantile estimator. The private lower and upper clipping cutoffs are searched within this range.
- `trim_const`, `eta`: Parameters used to set the practical clipping proportion `max(eta, trim_const / n_q)`. Here, `trim_const / n_q` controls the baseline clipping level, while `eta` gives a lower bound reflecting the expected contamination level.
- `split_mode`: Logical value indicating whether the sample is split into two parts. If `TRUE`, one part is used for private quantile estimation and the other part is used for the winsorized mean step. If `FALSE`, all observations are used in both steps.
- `max_extra_bins`: Maximum number of additional log-grid bins searched beyond the largest occupied bin in the private quantile estimator.
Example usage
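In dppca, the PMWM-specific parameters can be specified through pmwm_control():
dp_scree(
  X,
  k = 3,
  method = "pmwm",
  eps_total = 1,
  delta_total = 1e-6,
  center = TRUE,
  standardize = FALSE,
  control = pmwm_control(
    beta = 1.01,       # log-binning base, must exceed 1
    a = 0,             # lower search bound
    b = 10,            # upper search bound
    trim_const = 10,   # trimming constant
    eta = 0.01,        # expected contamination level
    split_mode = TRUE  # split the sample between the two steps
  )
)
Post-processing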
Because of the added privacy noise, the raw DP scree estimates $\tilde\lambda_1, \dots, \tilde\lambda_k$
may not have the usual scree shape. They may not be decreasing, and some values may be negative.
In ordinary PCA, scree values satisfy
$$\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_k \ge 0 .$$
Therefore, dppca can apply post-processing to make the
DP scree estimates nonnegative and nonincreasing.
If monotone post-processing is used, the proportion of variance explained (PVE) can be computed from the post-processed scree values.
This step only modifies the already released DP estimates, so it does not use any additional privacy budget.
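One simple way to enforce both constraints is shown below in base R. This is only a sketch of a possible post-processing rule (clamp at zero, then take a running minimum); the helper name is hypothetical, the input values are made up, and dppca may use a different rule such as isotonic regression. The PVE here is computed relative to the released components only, which is also an assumption.

```r
# Hypothetical helper (not part of dppca): clamp at zero, then force a
# nonincreasing sequence with a running minimum.
postprocess_scree_sketch <- function(lambda_dp) {
  cummin(pmax(lambda_dp, 0))
}

raw <- c(2.1, 2.4, 0.7, -0.2)            # made-up noisy DP scree values
post <- postprocess_scree_sketch(raw)    # 2.1 2.1 0.7 0.0

# PVE relative to the released components (assumption).
pve <- post / sum(post)
print(rbind(raw = raw, post = post, pve = pve))
```

Since the function only transforms values that have already been released, it is pure post-processing and leaves the privacy guarantee unchanged.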
Example usage
# DP scree plot with a single method
dp_scree_plot(
  X,
  k = 3,
  dp_scree_method = "clipped",
  eps_total = 1,
  delta_total = 1e-6,
  control = clipped_control(
    C_clip = 3
  )
)

# DP scree plot comparing all three methods
dp_scree_plot(
  X,
  k = 3,
  dp_scree_method = "all",
  eps_total = 1,
  delta_total = 1e-6,
  center = TRUE,
  standardize = FALSE,
  control = list(
    clipped = clipped_control(
      C_clip = 3
    ),
    pmwm = pmwm_control(
      beta = 1.01,
      a = 0,
      b = 10,
      trim_const = 10,
      eta = 0.01,
      split_mode = TRUE
    ),
    huber = huber_control(
      mu0 = 0,
      eta0 = 1,
      T = 50,
      M = 20,
      k_min_m2 = -40,
      k_max_m2 = 40,
      m2_frac = 1/4
    )
  )
)
References
Myeonghun Yu, Zhao Ren, and Wen-Xin Zhou. (2024). “Gaussian differentially private robust mean estimation and inference”. Bernoulli 30(4), 3059-3088. https://doi.org/10.3150/23-BEJ1706
Kelly Ramsay and Dylan Spicker. (2025). “Improved subsample-and-aggregate via the private modified winsorized mean”. arXiv preprint. https://arxiv.org/abs/2501.14095
Gábor Lugosi and Shahar Mendelson. (2021). “Robust multivariate mean estimation: The optimality of trimmed mean”. Ann. Statist. 49(1), 393-410. https://doi.org/10.1214/20-AOS1961