Per-Instance Privacy Accounting for Differentially Private Stochastic Gradient Descent

by Da Yu, et al.

Differentially private stochastic gradient descent (DP-SGD) is the workhorse algorithm for recent advances in private deep learning. It provides a single privacy guarantee to all datapoints in the dataset. We propose an efficient algorithm to compute per-instance privacy guarantees for individual examples when running DP-SGD. We use our algorithm to investigate per-instance privacy losses across a number of datasets. We find that most examples enjoy stronger privacy guarantees than the worst-case bounds. We further discover that the loss and the privacy loss on an example are well-correlated. This implies groups that are underserved in terms of model utility are simultaneously underserved in terms of privacy loss. For example, on CIFAR-10, the average ϵ of the class with the highest loss (Cat) is 32% higher than that of the class with the lowest loss (Ship). We also run membership inference attacks to show this reflects disparate empirical privacy risks.





1 Introduction

Differential privacy is a strong notion of data privacy, enabling rich forms of privacy-preserving data analysis [DworkMNS06, DworkR14]. Informally speaking, it quantitatively bounds the maximum influence of any datapoint using a privacy parameter ε, where a small value of ε corresponds to a stronger privacy guarantee. Training deep models with differential privacy is an active research area [AbadiCGMMTZ16, PapernotAEGT17, BuDLS20, YuNBGIKKLMWYZ22, AnilGGKM21, LiTLH22, GolatkarAWRKS22, MehtaTKC22, DeBHSB22]. Models trained with differential privacy not only provide theoretical privacy guarantees to their data but are also more robust against empirical attacks [BernauGRK19, CarliniLEKS19, JagielskiUO20, NasrSTPC21].

Differentially private stochastic gradient descent (DP-SGD) is the de-facto choice for differentially private deep learning [SongCS13, BassilyST14, AbadiCGMMTZ16]. DP-SGD first clips per-instance gradients and then adds Gaussian noise to the aggregated gradients. Standard privacy accounting takes a worst-case approach, and provides all examples with the same privacy parameter ε. However, from the perspective of machine learning, different examples can have very different impacts on a learning algorithm [KohL17, FeldmanZ20]. For example, consider support vector machines: removing a non-support vector has no effect on the resulting model, and hence that example would have perfect privacy.

In this paper, we give an efficient algorithm to approximately compute per-instance privacy parameters for DP-SGD. Inspecting these per-instance privacy parameters allows us to better understand instance-wise impacts. It turns out that, for a particular dataset, many instances experience much lower privacy loss than the worst-case guarantees. To illustrate this, we plot the per-instance privacy parameters for CIFAR-10 and MNIST in Figure 1. Experimental details, as well as more results, can be found in Section 4. These differences in per-instance privacy parameters naturally arise when running DP-SGD. To the best of our knowledge, our investigation is the first to explicitly reveal this difference.

Figure 1: Distribution of per-instance privacy parameters on CIFAR-10 and MNIST. The value of δ is fixed. The dashed line indicates the average of the ε values. The black solid line indicates the original privacy parameter of DP-SGD for all instances.

We propose two techniques to make per-instance privacy accounting viable for DP-SGD. First, we maintain estimates of the gradient norms for all examples so the per-instance privacy loss can be computed accurately at every update. Second, we round the gradient norms with a small precision to control the number of unique privacy losses, which need to be computed numerically. We explain why these two techniques are necessary in Section 2. More details of the proposed algorithm, as well as methods to release per-instance parameters without additional privacy loss, are in Section 3.

We further demonstrate a strong correlation between per-instance privacy parameters and per-instance losses. That is, we find that datapoints with large per-instance privacy parameters usually also experience high losses over the training process. Stated differently, the same examples suffer simultaneous unfairness in terms of both worse privacy and worse utility. While prior works have shown that underrepresented groups experience worse utility [BuolamwiniG18], and that these disparities are amplified when models are trained privately [BagdasaryanPS19, SuriyakumarPGG21, petren2022impact, noe2022exploring], we are the first to show that privacy loss and utility are both negatively impacted concurrently. This is in contrast to prior work in the differentially private setting, which took a worst-case perspective on privacy accounting, resulting in a uniform privacy loss for all training examples. Empirical evaluations on the MNIST [LeCunBBH98], CIFAR-10 [Krizhevsky09], and UTKFace [ZhangSQ17] datasets are in Section 5. For instance, when running gender classification on the UTKFace dataset, the average ε of the subgroup with the highest loss (Asian) is 28% higher than that of the subgroup with the lowest loss (Indian). We also run membership inference attacks on these datasets and show that the privacy parameters correlate well with the attack success rates.

1.1 Related Work

There are several works exploring instance-wise privacy in differentially private learning. [RedbergW21] study how to privately publish the per-instance guarantees of objective perturbation. [FeldmanZ21] design a Rényi filter to make use of the per-instance privacy budget of DP-GD. [MuhlB22] provide per-instance privacy guarantees for the PATE framework. In this work, we give an algorithm to compute the per-instance privacy guarantees for DP-SGD, which is the most widely used algorithm in differentially private deep learning.

A recent line of work has found that some examples are more vulnerable to empirical attacks [long2017towards, kulynych2019disparate, ChoquetteChooTCP21, CarliniCNSTT21]. They show membership inference attacks have significantly higher success rates on some examples, e.g., on some specific classes [ShokriSSS17]. In this work, we show the disparity of per-example privacy risks also exists theoretically when learning with differential privacy. Moreover, we also show the disparity in privacy risks correlates well with the disparity in utility.

2 Preliminaries

We first give some background on differentially private learning. We then describe the privacy analysis of DP-SGD and highlight the challenges in computing per-instance privacy parameters. Finally, we argue that providing the same privacy bound to all samples is not ideal because different examples naturally have different privacy losses due to variation in gradient norms.

2.1 Background on Differentially Private Learning

Differential privacy is built on the notion of neighboring datasets. A dataset D′ is a neighboring dataset of D (denoted as D′ ∼ D) if D′ can be obtained by adding or removing one example from D. We use the following per-instance form of (ε, δ)-differential privacy [Wang19, RedbergW21].

Definition 1.

[Per-instance DP] Fix a dataset D and a datapoint z. An algorithm A satisfies (ε, δ)-differential privacy for (D, z) if for any subset of outputs S it holds that Pr[A(D) ∈ S] ≤ e^ε · Pr[A(D ∪ {z}) ∈ S] + δ and Pr[A(D ∪ {z}) ∈ S] ≤ e^ε · Pr[A(D) ∈ S] + δ.

The privacy guarantee of Definition 1 is for a pair (D, z) of a dataset and a single datapoint. It is slightly different from the individual privacy notions in [jorgensen2015conservative] and [FeldmanZ21], where the guarantee is for a datapoint z while the dataset is arbitrarily chosen. Definition 2 gives the notion of individual differential privacy.

Definition 2.

[Individual DP] Fix a datapoint z. An algorithm A satisfies (ε, δ)-differential privacy for z if for any dataset D and any subset of outputs S it holds that Pr[A(D) ∈ S] ≤ e^ε · Pr[A(D ∪ {z}) ∈ S] + δ and Pr[A(D ∪ {z}) ∈ S] ≤ e^ε · Pr[A(D) ∈ S] + δ.

The privacy guarantee in Definition 2 is for a datapoint z, which means the guarantee holds when the dataset D is arbitrarily chosen. Although individual DP is a stronger privacy notion, it may hide the disparate privacy parameters of an individual datapoint because the disparity is inherent with respect to a specific dataset. A datapoint may be an inlier for one dataset but an outlier for another, and would thus suffer disparate privacy losses. For example, one can modify the dataset D maliciously to maximize the influence of a given example z [TramerSJLJHC22]. If using individual DP, the final guarantee for z would be the worst-case privacy loss for the worst possible D. However, the privacy risk of a trained model is always bound to a given training set, and many examples would have a much smaller influence than the worst case. Therefore, we use Definition 1 to evade the worst-case analysis.

DP-SGD is the most common approach for deep learning with differential privacy. Instead of protecting the trained model directly, DP-SGD makes each SGD update differentially private. The composition property of differential privacy allows us to reason about the overall privacy of running several such steps. In this work, the privacy of different steps is composed through Rényi Differential Privacy [Mironov17]. More details on per-instance Rényi DP and how it composes are in Appendix A. The overall Rényi DP of an instance is converted into (ε, δ)-DP after training.

We give a simple example to illustrate how to privatize each update. Suppose we take the sum g of all gradients from dataset D. Without loss of generality, further assume we add an arbitrary example z to obtain a neighboring dataset D′ = D ∪ {z}. The summed gradient becomes g + g_z, where g_z is the gradient of z. If we add independent Gaussian noise with variance σ² to each coordinate, then the output distributions of the two neighboring datasets are

P = N(g, σ²I),    Q = N(g + g_z, σ²I).

We can then bound the Rényi divergence between P and Q to provide (ε, δ)-DP. Their expectations differ only by g_z, and hence a large gradient norm ‖g_z‖ leads to a large divergence (privacy loss).
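For two Gaussians with the same isotropic covariance, the Rényi divergence has a well-known closed form, D_α(N(μ, σ²I) ‖ N(μ + g_z, σ²I)) = α‖g_z‖²/(2σ²), which makes the dependence on the gradient norm explicit. A minimal sketch:

```python
def renyi_gaussian_shift(alpha: float, shift_norm: float, sigma: float) -> float:
    """Renyi divergence of order alpha between N(mu, sigma^2 I) and
    N(mu + g_z, sigma^2 I); it depends on g_z only through ||g_z||."""
    return alpha * shift_norm**2 / (2 * sigma**2)

# Doubling the gradient norm quadruples the per-step privacy loss.
assert renyi_gaussian_shift(2.0, 1.0, 1.0) == 1.0
assert renyi_gaussian_shift(2.0, 2.0, 1.0) == 4.0
```

This quadratic dependence on ‖g_z‖ is why examples with small gradients accumulate much less privacy loss over training.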

2.2 Challenges of Computing Per-Instance Privacy Parameters for DP-SGD

Privacy accounting in DP-SGD is more complex than the simple example in Section 2.1 because the analysis involves privacy amplification by subsampling [AbadiCGMMTZ16, BalleBG18, MironovTZ19, ZhuW19, WangBK19]. Roughly speaking, randomly sampling a minibatch in DP-SGD strengthens the privacy guarantees since most points in the dataset are not involved in a single step. This complication makes direct computation of per-instance privacy parameters impractical.

Before we expand on these difficulties, we first describe the output distributions of neighboring datasets in DP-SGD [AbadiCGMMTZ16]. Poisson sampling is assumed, i.e., each example is sampled independently with probability p. Let g_B be the sum of the minibatch of gradients of D, where B is the set of sampled indices. Consider also a neighboring dataset D′ = D ∪ {z} that has one datapoint z with gradient g_z added. Because of Poisson sampling, the output on D′ is exactly g_B with probability 1 − p (z is not sampled) and is g_B + g_z with probability p (z is sampled). Suppose we still add isotropic Gaussian noise; the output distributions of the two neighboring datasets are

P = N(g_B, σ²I),    (1)
Q = (1 − p) · N(g_B, σ²I) + p · N(g_B + g_z, σ²I).    (2)
With Equations (1) and (2), we explain the challenges in computing per-instance privacy parameters.
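The Rényi divergence between the subsampled pair in Equations (1) and (2) — a Gaussian P and a two-component mixture Q — has no closed form, but because the two means differ only along the direction of g_z, the d-dimensional integral reduces to one dimension and can be evaluated numerically. The sketch below uses a crude uniform-grid quadrature, far simpler than the method of [MironovTZ19], just to illustrate the computation:

```python
import numpy as np

def _normal_pdf(x, mu, sigma):
    return np.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

def subsampled_gaussian_renyi(alpha, c, sigma, p, grid=200_001, span=20.0):
    """Estimate D_alpha(Q || P) for P = N(0, sigma^2) and
    Q = (1 - p) N(0, sigma^2) + p N(c, sigma^2), where c = ||g_z||.
    D_alpha(Q || P) = log(integral of q^alpha * p^(1-alpha)) / (alpha - 1)."""
    x = np.linspace(-span * sigma, span * sigma + c, grid)
    p_pdf = _normal_pdf(x, 0.0, sigma)
    q_pdf = (1 - p) * _normal_pdf(x, 0.0, sigma) + p * _normal_pdf(x, c, sigma)
    integrand = q_pdf**alpha * p_pdf ** (1 - alpha)
    dx = x[1] - x[0]
    # Trapezoid rule on the uniform grid.
    total = dx * (integrand.sum() - 0.5 * (integrand[0] + integrand[-1]))
    return np.log(total) / (alpha - 1)
```

As sanity checks, p = 1 recovers the closed form α c²/(2σ²) of the unsubsampled Gaussian mechanism, and p = 0 gives zero divergence; for fixed p, the divergence grows with the gradient norm c.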

2.2.1 Full Batch Gradient Norms Are Required at Every Iteration

There is some privacy loss for z even if it is not sampled in the current iteration, because the analysis makes use of the subsampling process. For a given sampling probability and noise variance, the amount of privacy loss is determined by the gradient norm ‖g_z‖. Therefore, we need accurate gradient norms of all examples to compute accurate privacy losses at every iteration. However, when running SGD, we only have minibatch gradients. Previous analyses of DP-SGD evade this problem by simply assuming all examples have the maximum possible norm, i.e., the clipping threshold.

2.2.2 Computational Cost of Per-Instance Privacy Parameters is Huge

The density function of Q is a mixture of two Gaussian distributions. This makes computing the Rényi divergence between P and Q harder, as there is no closed-form solution. Although there are some asymptotic bounds, those bounds are looser than computing the divergence numerically [AbadiCGMMTZ16, WangBK19, MironovTZ19, GopiLW21], and thus such numerical computations are necessary to achieve strong privacy guarantees. In the classic analysis, there is only one numerical computation, as all examples have the same privacy loss over all iterations. However, naive computation of per-instance privacy losses would require up to n × T computations, where n is the dataset size and T is the number of iterations.

Figure 2: Average gradient norms.

2.3 An Observation: Gradient Norms in Deep Learning Vary Significantly

We show that the gradient norms vary significantly across datapoints in the dataset, to demonstrate that different examples experience very different privacy losses when running DP-SGD. We train the standard ResNet-20 model in [HeZRS16] on CIFAR-10. The maximum clipping threshold is the median of gradient norms at initialization. More details are in Section 4. We first sort all examples based on their average gradient norms across training. Then we divide them into five equally sized groups based on their quintile. We plot the average norms across training in Figure 2. The norms of different groups show significant stratification. Such stratification naturally leads to different per-instance privacy losses and hence different privacy parameters for each example. This suggests that quantifying per-instance privacy parameters may be valuable despite the aforementioned challenges.
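The quintile grouping behind Figure 2 can be sketched as follows; the norms here are synthetic stand-ins (drawn from a lognormal distribution) for the measured per-example average gradient norms:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical per-example average gradient norms over training.
avg_norms = rng.lognormal(mean=0.0, sigma=1.0, size=50_000)

order = np.argsort(avg_norms)            # sort examples by average norm
groups = np.array_split(order, 5)        # five equally sized quintile groups
group_means = [avg_norms[g].mean() for g in groups]
```

By construction the group means are stratified; the point of Figure 2 is that the same stratification shows up in the measured norms, and persists across training.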

3 Deep Learning with Per-Instance Privacy Parameters

We give an efficient algorithm to compute per-instance privacy parameters for DP-SGD (Algorithm 1). We perform two modifications to make per-instance privacy loss accounting feasible with small computational overhead. We first use past gradient norms to estimate the gradient norms at the current iteration, and update the estimates when points are sampled into the current minibatch. We also introduce the option to compute full batch gradient norms deliberately to trade off between running time and estimation accuracy. Additionally, we round the gradient norms to a given precision so the number of numerical computations is independent of the dataset size and number of iterations.

Input: Maximum clipping threshold C, rounding precision r, noise variance σ², sampling probability p, frequency k of full gradient-norm updates at every epoch.

1  Let c be the estimated gradient norms of all examples and initialize every c_i = C. Let V = {r, 2r, ..., C} be all possible norms under rounding.
2  for v ∈ V do
3      Compute the Rényi divergences between P and Q numerically with v, σ, and p. // Formulations of P and Q are in Equations (1) and (2).
4  end for
5  for t = 1, ..., sE do  // s is the number of iterations per epoch, E is the number of epochs.
6      Sample a minibatch of indices B and compute gradients {g_i} for i ∈ B. Clip each gradient to norm at most c_i and aggregate. Update the model with the aggregated gradient plus noise u ∼ N(0, σ²I). Update c_i = ‖g_i‖ for i ∈ B. Round c with precision r.
7      for i = 1 to n do  // Update privacy loss for the whole dataset.
8          Find the corresponding v ∈ V for the i-th example. Add the privacy loss of v to the accumulated loss of the i-th example.
9      end for
10     if k > 0 and t is a multiple of ⌊s/k⌋ then
11         Compute full batch gradient norms and update c with rounded norms.
12     end if
13 end for

Algorithm 1: Deep Learning with Per-Instance Privacy Accounting
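The accounting loop of Algorithm 1 can be sketched as follows. This is a minimal illustration, not the authors' implementation: `renyi_loss` is a hypothetical stand-in for the numerical method of [MironovTZ19], random numbers replace the true gradient norms, and the model update itself is omitted.

```python
import numpy as np

def train_with_accounting(n, steps, C, r, sigma, p, alpha, renyi_loss, rng):
    """Per-instance RDP accounting at a single order alpha (sketch)."""
    n_buckets = int(round(C / r))                # at most C/r unique rounded norms
    # One numerical divergence computation per rounded norm, done once up front.
    table = np.array([renyi_loss(alpha, (b + 1) * r, sigma, p)
                      for b in range(n_buckets)])
    bucket = np.full(n, n_buckets - 1, dtype=int)    # norm estimates start at C
    eps_rdp = np.zeros(n)                            # accumulated RDP per example
    for _ in range(steps):
        sampled = rng.random(n) < p                  # Poisson sampling
        # ... clip each sampled gradient to its estimated norm, add N(0, sigma^2 I)
        #     noise, and update the model (omitted in this sketch) ...
        new_norms = rng.uniform(r, C, size=int(sampled.sum()))   # stand-in norms
        bucket[sampled] = np.clip(np.round(new_norms / r).astype(int) - 1,
                                  0, n_buckets - 1)  # round to precision r
        eps_rdp += table[bucket]    # every example pays a loss, sampled or not
    return eps_rdp
```

The key point is that the inner loop only performs table lookups; all numerical divergence computations happen once, before training.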

3.1 Estimated Privacy Parameters Are Accurate

Although the gradient norms used for privacy accounting are updated only occasionally, we show that the computed per-instance privacy parameters are very close to the actual ones (Figure 3). This indicates that, in general, the gradient norms do not change rapidly during training. Before we examine this phenomenon, we note that the estimated privacy parameters themselves are strict differential privacy guarantees, because we use the estimated norms to clip per-instance gradients. (Using per-instance clipping thresholds could lose more gradient signal if the estimates are inaccurate; in Appendix C, we show that the per-instance clipping in Algorithm 1 does not affect the utility.)

To compute the actual privacy parameters, we randomly sample 1000 examples and compute the exact gradient norms at every iteration. We compute the Pearson correlation coefficient between the estimated and actual privacy parameters, as well as the average and worst absolute errors. In addition to CIFAR-10 and MNIST, we also include the UTKFace dataset and run age/gender classification tasks on it [ZhangSQ17]. Details about the experiments are in Section 4.

We plot the results in Figure 3. The estimated ε values are very close to the actual ones (as measured by Pearson's correlation coefficient), even when we only update the gradient norms when points are sampled into a minibatch, which incurs almost no computational overhead as those gradients are already calculated by DP-SGD. Updating full batch gradient norms once or twice per epoch further improves the estimation, though doing so would double or triple the running time.

Figure 3: Estimated ε values versus actual ε values. The value of k is the number of full batch norm updates per epoch. The solid black line indicates the ε of the original analysis for every example.

It is worth noting that the maximum clipping threshold C affects the computed privacy parameters. A large C increases the variation of gradient norms but leads to large worst-case privacy loss (or large gradient variance if keeping the worst-case privacy unchanged), while a small C suppresses the variation and leads to large gradient bias [ChenWH20, SongSTT21]. In this work, we set the maximum clipping threshold as the median of gradient norms at initialization unless otherwise mentioned, which is a common choice in practice and has been observed to achieve good accuracy [AbadiCGMMTZ16, AndrewTMR21]. In Appendix D, we show the influence of using different values of C on both accuracy and privacy.

3.2 Rounding Per-Instance Gradient Norms

The rounding operation in Algorithm 1 is essential to make the computation of per-instance privacy losses feasible. For given σ and p, one needs to run the numerical method in [MironovTZ19] once for every unique privacy loss. Consequently, there are up to n × T unique privacy losses, because gradient norms vary across different examples and iterations. In order to make the number of unique privacy losses tractable, we round the gradient norms with a prespecified precision r. Because the maximum clipping threshold C is usually a small constant, then, by the pigeonhole principle, there are at most ⌈C/r⌉ unique values. Throughout this paper we use a small fixed precision, which has almost no impact on the accuracy of privacy accounting.

To give a concrete running time comparison, running the numerical method of [MironovTZ19] one hundred times using a single CPU core takes around five seconds. This is the total computational overhead of Algorithm 1 on CIFAR-10 in Figure 3. We only need to run the computations before training and can reuse the results directly throughout training. Without rounding, however, we would need to compute a fresh set of unique privacy losses at every epoch on CIFAR-10, which would incur an additional cost of approximately 10 hours to compute the per-instance privacy parameters.
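The savings can be made concrete with a back-of-the-envelope comparison; the clipping threshold, precision, and iteration count below are hypothetical stand-ins, since the paper's exact settings are not restated here:

```python
import math

C, r = 1.0, 0.01        # hypothetical clipping threshold and rounding precision
n, T = 50_000, 2_500    # CIFAR-10 size and a hypothetical iteration count

with_rounding = math.ceil(C / r)   # numerical computations, done once up front
without_rounding = n * T           # worst case: one computation per example-step

print(with_rounding)      # 100
print(without_rounding)   # 125000000
```

With rounding, the number of numerical computations depends only on C and r, never on the dataset size or training length.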

3.3 What Can We Do with Per-Instance Privacy Parameters?

Note that per-instance privacy parameters depend on the private data and are thus sensitive; consequently, they may not be released publicly without care. We describe some approaches to safely make use of per-instance privacy parameters. The first is to release each privacy parameter only to the rightful data owner. The second is to release some statistics of the per-instance privacy parameters to the public. Both approaches offer more granular and tighter privacy guarantees than the single worst-case guarantee offered by the conventional analysis. Another approach is for a trusted data curator to improve the model quality based on the per-instance parameters.

The first approach is to release ε_i to the owner of the i-th example. Although we use gradient norms without adding noise, this approach does not incur additional privacy loss, for two reasons. First, it is safe for the i-th example because only the rightful owner sees ε_i. Second, releasing ε_i does not increase the privacy loss of any other example. This follows from the post-processing property of differential privacy and the fact that computing ε_i only involves a differentially private model and the i-th example itself. The second reason is important, and it may not hold under other private learning algorithms. For example, the per-instance privacy parameters for objective perturbation are interdependent and require additional delicate analysis before publication [RedbergW21].

The second approach is to privately release aggregate statistics of the population, e.g., the average or quantiles of the ε values. Recent works have demonstrated that such statistics can be published accurately with a minor increase in privacy loss [AndrewTMR21]. In Appendix E, we show the statistics can be released accurately at very small privacy cost.

Finally, per-instance privacy parameters can also serve as a powerful tool for a trusted data curator to improve the model quality. By analysing the per-instance privacy parameters of a dataset, a trusted curator can focus on collecting more data representative of the subgroups that have higher privacy risks (and worse performance) to mitigate the disparity in privacy.

4 Per-Instance Privacy Parameters on Different Datasets

We investigate the distribution of per-instance privacy parameters of running DP-SGD on four classification tasks. All experiments are run on a Tesla V100 GPU. Experimental setup is as follows.

Datasets. We use two benchmark datasets, MNIST (n = 60000) and CIFAR-10 (n = 50000) [LeCunBBH98, Krizhevsky09], as well as the UTKFace dataset (n = 15000) [ZhangSQ17], which contains face images of four different races (White, 7000; Black, 3500; Asian, 2000; Indian, 2800). We construct two tasks on UTKFace: predicting gender, and predicting whether the age is below a threshold. (We acknowledge that predicting gender and age from images may be problematic. Nonetheless, as facial images have previously been highlighted as a setting where machine learning has disparate accuracy on different subgroups, we revisit this domain through a related lens.) The labels are provided by the dataset curators. We slightly modify the dataset between these two tasks by randomly removing a few examples to ensure each race has balanced positive and negative labels.

Models and hyperparameters. We train ResNet-20 models on all datasets. For CIFAR-10 and MNIST, we train the models from scratch. For UTKFace, we fine-tune a model from the PyTorch library that is pre-trained on ImageNet. For all datasets, the maximum clipping threshold is the median of gradient norms at initialization. We update the full gradient norms twice per epoch. More details about the models and hyperparameters are in Appendix B.


4.1 Per-Instance Privacy Parameters Vary Significantly

Figure 4: Distribution of per-instance privacy parameters on UTKFace. The value of δ is fixed. The dashed line indicates the average of the ε values. The black solid line indicates the original privacy parameter of DP-SGD for all instances.

Figure 1 shows the per-instance privacy parameters on CIFAR-10 and MNIST, and Figure 4 shows those on UTKFace. The privacy parameters vary across a large range on all datasets. For example, when running gender classification on the UTKFace dataset, the maximum ε value is 4.5 while the minimum is only 0.1. We also observe that, for easier tasks, more examples enjoy stronger privacy guarantees. For example, 30% of examples reach the worst-case ε on CIFAR-10 while only 3% do so on MNIST. This may be because the loss decreases quickly when the task is easy, so the gradient norms also decrease, reducing the privacy loss.

5 Privacy Loss is Unequal Across Different Subgroups

We investigate the differences in privacy loss among different subgroups. We first empirically show that the privacy parameter of an example correlates well with its loss. Then we show that example groups with higher loss (i.e., groups underserved by the model) also have worse privacy parameters. Finally, we run membership inference attacks to show the computed privacy parameters reflect empirical privacy risks.

5.1 Privacy Parameters Correlate with Loss

Figure 5: Correlation between ε and loss.

We show a strong correlation between the privacy parameters and loss values. Based on the analysis in Section 2, the privacy parameter of an example directly correlates with its gradient norms across training. The gradient norms in turn depend on the loss values. To verify this correlation, we plot the average loss and average ε of different groups on CIFAR-10 and MNIST in Figure 5. We use two ways to divide examples into groups: the first takes the average loss over training and the second takes the loss during the last epoch. We then sort the examples based on loss values and divide the sorted examples into ten equally sized groups. The correlation between loss values and privacy parameters is apparent on both datasets in Figure 5.

5.2 Groups Are Simultaneously Underserved in Both Accuracy and Privacy

It is well-documented that machine learning models may have large differences in accuracy on different subgroups [BuolamwiniG18, BagdasaryanPS19]. Our finding demonstrates that this disparity is simultaneous in terms of both loss values and privacy risks. We empirically verify this by plotting the average ε and loss values of different subgroups. For loss values, we use both the average loss across training and the loss during the last epoch. For CIFAR-10 and MNIST, the subgroups are the data from different classes, while for UTKFace, the subgroups are the data from different races.

Figure 6: Loss and ε of different groups. Groups with higher loss have worse privacy in general.

We plot the results on CIFAR-10 and UTKFace-Gender in Figure 6. The results on MNIST and UTKFace-Age are in Appendix F. The subgroups are sorted based on their average loss values. The loss correlates well with ε in general, and subgroups with higher loss do tend to have higher privacy parameters. On CIFAR-10, the average ε of the 'Cat' class (which has the highest average loss at the last epoch) is 6.43, while the average ε of the class with the lowest loss ('Ship') is only 4.87. The observation is similar on UTKFace-Gender, where the average ε of the subgroup with the highest loss is 2.62 while that of the subgroup with the lowest loss is 2.04. To the best of our knowledge, our work is the first to reveal this simultaneous disparity.

5.3 Privacy Parameters Reflect Empirical Privacy Risks

We run membership inference (MI) attacks to verify whether examples with larger privacy parameters have higher privacy risks in practice. We use a simple loss-threshold attack that predicts an example is a member if its loss value is smaller than a prespecified threshold [SablayrollesDSOJ19]. Previous works show that even large privacy parameters are sufficient to defend against such attacks [CarliniLEKS19, YuZCYL21]. In order to better observe the difference in privacy risks, we also include models trained without differential privacy as target models. For each data subgroup, we use its whole test set and a random subset of the training set so that the numbers of training and test loss values are balanced. We further split the data evenly into two subsets, finding the optimal threshold on one and reporting the success rate on the other.
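The loss-threshold attack and its tune/hold-out split can be sketched as follows; this is an illustrative reimplementation under our reading of the setup, not the authors' evaluation code:

```python
import numpy as np

def loss_threshold_attack(train_losses, test_losses, rng):
    """Loss-threshold membership inference: predict 'member' when the loss is
    below a threshold. The threshold is tuned on one half of the pooled data
    and the attack success rate is reported on the other half."""
    losses = np.concatenate([train_losses, test_losses])
    member = np.concatenate([np.ones(len(train_losses)),
                             np.zeros(len(test_losses))])
    idx = rng.permutation(len(losses))
    half = len(losses) // 2
    tune, hold = idx[:half], idx[half:]
    # Pick the threshold maximizing accuracy on the tuning split.
    candidates = np.unique(losses[tune])
    accs = [np.mean((losses[tune] < t) == member[tune]) for t in candidates]
    t_star = candidates[int(np.argmax(accs))]
    # Report the success rate on the held-out split.
    return np.mean((losses[hold] < t_star) == member[hold])
```

When training and test losses are well separated (as for a non-private, overfit model), this attack succeeds far above the 50% random-guessing baseline; on well-mixed losses it stays near 50%.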

Figure 7: Average ε and membership inference success rates on different subgroups.

The results on CIFAR-10 and UTKFace-Gender are in Figure 7. The results on MNIST and UTKFace-Age are in Appendix F. The subgroups are sorted based on their average ε. When the models are trained with DP, all attack success rates are close to random guessing (50%), as anticipated. Although the attack we use cannot show the disparity in this case, we note that there are more powerful attacks whose success rates are closer to the lower bound that DP offers [JagielskiUO20, NasrSTPC21]. On the other hand, the difference in privacy risks is clear when models are trained without DP. On CIFAR-10, the MI success rate is 79.7% on the Cat class (which has the worst average ε) while it is only 61.4% on the Ship class (which has the best average ε). These results suggest that the ε values reflect empirical privacy risks, which can vary significantly across subgroups.

6 Conclusion

We propose an algorithm to compute per-instance privacy parameters for DP-SGD. The algorithm gives accurate per-instance privacy parameters even when the extra computation is very small (the case where gradient norms are updated only when examples are sampled). We use this new algorithm to examine per-instance privacy risks for examples in several datasets. Significantly, we find that groups with worse utility also suffer from worse privacy. Our paper reveals a complex yet interesting relationship among utility, fairness, and privacy, which may inspire new studies that jointly consider these factors to build trustworthy systems.


Appendix A More Background on Rényi Differential Privacy

In this work, privacy loss is composed through Rényi Differential Privacy (RDP) and the overall privacy loss is converted into (ε, δ)-DP after training [Mironov17]. Here we first introduce the definition of per-instance Rényi Differential Privacy. Then we give the composition theorem for per-instance RDP and the conversion rule from RDP to (ε, δ)-DP.

A.1 Per-Instance Rényi Differential Privacy

RDP uses the Rényi divergences of different orders between two output distributions to measure privacy. Let D_α(P ‖ Q) be the Rényi divergence of order α between P and Q, and let D↔_α(P, Q) = max(D_α(P ‖ Q), D_α(Q ‖ P)) be the maximum of the two directions. The definition of per-instance RDP is as follows.

Definition 3.

[Per-instance RDP] Fix a dataset D and a datapoint z. An algorithm A satisfies (α, ρ)-RDP for (D, z) if D↔_α(A(D), A(D ∪ {z})) ≤ ρ.

We use the numerical tool in [MironovTZ19] to compute the divergence for every order α at every iteration. The results at different iterations are composed and then converted into (ε, δ)-DP.

A.2 Composition and Conversion Rules for Per-Instance RDP

We use the composition theorem in [FeldmanZ21], which allows privacy parameters to be chosen adaptively. Let A_1, ..., A_T be a sequence of algorithms, where A_t takes the dataset and the outputs v_1, ..., v_{t−1} of the previous algorithms and produces output v_t. Further let A be the algorithm that composes A_1, ..., A_T. For a fixed pair of neighboring datasets D and D′ = D ∪ {z}, the privacy parameter of order α at the t-th algorithm is

ρ_t(α) = D↔_α( A_t(D′; v_1, ..., v_{t−1}), A_t(D; v_1, ..., v_{t−1}) ).    (3)

Theorem A.1 (A special case of Theorem 3.1 in [FeldmanZ21]).

Fix any α > 1, B ≥ 0, and a pair of neighboring datasets D and D′ = D ∪ {z}. For any sequence of algorithms A_1, ..., A_T, if Σ_{t=1}^{T} ρ_t(α) ≤ B holds almost surely, where ρ_t(α) is defined in Equation (3), then the adaptive composition A satisfies (α, B)-RDP for (D, z).

Theorem A.1 states that, for adaptively chosen privacy parameters, we can still add up the privacy parameters at different steps to get the overall privacy guarantee. After training, we use Theorem A.2 to convert the overall RDP of an instance into (ε, δ)-DP [Mironov17].

Theorem A.2 (Convert per-instance RDP into per-instance (ϵ, δ)-DP [Mironov17]).

If M obeys (α, ϵ(z))-RDP for z, then M obeys (ϵ(z) + log(1/δ)/(α − 1), δ)-DP with respect to z for all 0 < δ < 1.

We compute ϵ(z) with different orders α in our experiments and choose the tightest (ϵ, δ)-DP bound from the conversion results.
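A minimal sketch of this conversion, applying Theorem A.2 at each order on a grid of integer orders up to 256 (as in Appendix B) and keeping the tightest bound; the toy RDP curve below is a hypothetical placeholder, not values from our experiments:

```python
import math

# Sketch of the RDP-to-(eps, delta)-DP conversion of Theorem A.2: each order
# alpha yields one candidate eps, and we keep the minimum over the grid.

def rdp_to_dp(rdp_eps, delta, orders=range(2, 257)):
    """rdp_eps maps each order alpha to the accumulated RDP epsilon(alpha)."""
    return min(rdp_eps[a] + math.log(1 / delta) / (a - 1) for a in orders)

# Toy accumulated RDP curve eps(alpha) = 0.01 * alpha (linear in alpha, as for
# the Gaussian mechanism without subsampling); hypothetical numbers only.
rdp = {a: 0.01 * a for a in range(2, 257)}
eps = rdp_to_dp(rdp, delta=1e-5)
```

Because the candidate bound trades off the RDP term (growing in α) against log(1/δ)/(α − 1) (shrinking in α), the minimum is attained at an intermediate order.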

Appendix B More Details on Hyperparameters

The noise multipliers are 3.2, 6.0, and 1.5 for CIFAR-10, MNIST, and UTKFace, respectively. The standard deviation of the noise in Algorithm 1 is the noise multiplier times the maximum clipping threshold. The batch size is 4000 for CIFAR-10/MNIST and 200 for UTKFace. The number of training epochs is 200 for CIFAR-10 and 100 for MNIST and UTKFace. For ResNet-20 models on CIFAR-10 and MNIST, we replace batch normalization with group normalization. For ResNet-20 models on UTKFace, we freeze the batch normalization layers of the pre-trained model. We compute the Rényi divergence with integer orders up to 256.
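The noise calibration above can be sketched as follows; the CIFAR-10 noise multiplier (3.2) and batch size (4000) are from this appendix, while the clipping threshold of 1.0 and the gradient dimensions are hypothetical placeholders:

```python
import numpy as np

# Sketch of the noisy aggregation step: the Gaussian noise added to the summed
# clipped gradients has std = noise_multiplier * max_clip.
rng = np.random.default_rng(0)

def noisy_gradient_sum(clipped_grads, noise_multiplier, max_clip):
    sigma = noise_multiplier * max_clip          # std of the added noise
    total = clipped_grads.sum(axis=0)            # sum over the batch
    return total + rng.normal(0.0, sigma, size=total.shape)

grads = rng.normal(size=(4000, 10))              # toy batch of (clipped) gradients
avg = noisy_gradient_sum(grads, noise_multiplier=3.2, max_clip=1.0) / 4000
```

Note the noise std is fixed by the maximum threshold, so the privacy analysis holds regardless of how small individual clipped gradients are.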

Appendix C Per-example Clipping Does Not Affect Accuracy

Here we run experiments to check the influence of per-example clipping thresholds on utility. Algorithm 1 uses per-example clipping thresholds to ensure the computed privacy parameters are valid privacy guarantees. If the clipping thresholds are close to the actual gradient norms, the clipped results are close to those obtained with a single maximum clipping threshold. However, if the estimates of the gradient norms are inaccurate, per-example thresholds clip away more signal than a single maximum threshold would.
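The two clipping schemes compared in this section can be sketched as below; the thresholds and gradient values are hypothetical placeholders, and the per-example thresholds stand in for the on-the-fly norm estimates of Algorithm 1:

```python
import numpy as np

# Sketch: clip every gradient to a single maximum threshold versus clipping
# each gradient to its own per-example threshold (never above the maximum).

def clip_to(grads, thresholds):
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    scale = np.minimum(1.0, thresholds / np.maximum(norms, 1e-12))
    return grads * scale

rng = np.random.default_rng(0)
grads = rng.normal(size=(8, 4))
max_clip = 1.0                                   # hypothetical maximum threshold
per_example = 0.5 * np.ones((8, 1))              # hypothetical per-example estimates

single = clip_to(grads, max_clip)
per_ex = clip_to(grads, per_example)
# Per-example thresholds can only clip as much as or more than the maximum one:
assert np.all(np.linalg.norm(per_ex, axis=1) <= np.linalg.norm(single, axis=1) + 1e-9)
```

When the per-example estimates track the true norms well, the two outputs coincide for most examples, which is consistent with the matched accuracies in Table 1.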

Clipping        CIFAR-10       MNIST
Per-example     65.62 (0.51)   97.32 (0.16)
Single          65.66 (0.68)   97.26 (0.11)
Table 1: Comparison between the test accuracy (%) of using per-example clipping thresholds and that of using a single maximum clipping threshold.

We compare the accuracy of using the maximum clipping threshold with that of using per-example clipping thresholds. The results on CIFAR-10 and MNIST are in Table 1. All the per-example clipping thresholds are updated on-the-fly. We repeat the experiment four times with different random seeds. Other setups are the same as those in Section 4. The results suggest that using per-example clipping thresholds in Algorithm 1 does not affect accuracy.

Appendix D On the Influence of the Maximum Clipping Threshold on Privacy

Figure 8: Distribution of privacy parameters on CIFAR-10 with different values of the maximum clipping threshold, chosen relative to the median of gradient norms at initialization. The black solid line indicates the original privacy parameter of DP-SGD for all instances. The maximum values in the first two plots do not match the bound of the original analysis because of the rounding operation in Algorithm 1.

As discussed in Section 3.1, the value of the maximum clipping threshold affects the per-instance privacy parameters in Algorithm 1. Here we run experiments with different values of the maximum clipping threshold on CIFAR-10, chosen as multiples of the median of gradient norms at initialization. The rest of the experimental setup is the same as that in Section 4.

We plot the results in Figure 8. The variation in privacy parameters increases with the value of the maximum clipping threshold. With the largest threshold, nearly 70% of datapoints reach the worst-case privacy parameter, while only 3% of datapoints do with the smallest. Moreover, for small thresholds, some privacy parameters do not reach the worst-case parameter after training: the maximum per-instance privacy parameter stays below the ϵ of the original analysis.

Appendix E Releasing Population Statistics of Per-instance Privacy Parameters

The per-instance privacy parameters computed by Algorithm 1 are sensitive and hence cannot be directly released to the public. Here we show that population statistics of the per-instance parameters can be released at a minor cost in privacy. Specifically, we compute the average and quantiles of the ϵ values with differential privacy. The sensitivity of each ϵ value is bounded by the ϵ of the original analysis. For the average, we release the noisy aggregate through the Gaussian mechanism in [DworkR14]. For the quantiles, we solve the objective function in [AndrewTMR21] with 20 steps of full-batch gradient descent. The privacy loss of running the gradient descent steps is composed under RDP and then converted into (ϵ, δ)-DP. The results on MNIST and CIFAR-10 are in Table 2 and Table 3, respectively. The released statistics are close to the actual values on both datasets.
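The average release can be sketched as below; this uses the standard analytic Gaussian-mechanism calibration σ = Δ·sqrt(2 ln(1.25/δ))/ϵ as an illustrative stand-in (the paper composes the release under RDP), and all numbers are hypothetical placeholders:

```python
import math
import numpy as np

# Sketch: release the average per-instance epsilon with the Gaussian mechanism.
# Each per-instance epsilon is bounded by the worst-case epsilon of the
# original analysis, so replacing one example changes the average over n
# examples by at most eps_max / n -- that is the sensitivity.
rng = np.random.default_rng(0)

def release_average(per_instance_eps, eps_max, eps_release, delta):
    n = len(per_instance_eps)
    sensitivity = eps_max / n
    sigma = sensitivity * math.sqrt(2 * math.log(1.25 / delta)) / eps_release
    return float(np.mean(per_instance_eps) + rng.normal(0.0, sigma))

eps_values = rng.uniform(0.5, 1.5, size=60000)   # toy MNIST-sized population
noisy_avg = release_average(eps_values, eps_max=2.0, eps_release=0.1, delta=1e-5)
```

Because the sensitivity shrinks as 1/n, the noise needed for a dataset-sized population is tiny relative to the statistic, which is why the released values in Tables 2 and 3 are so close to the actual ones.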

MNIST         Average   0.1-quantile   0.3-quantile   Median   0.7-quantile   0.9-quantile
Non-private   0.906     0.563          0.672          0.790    0.967          1.422
Private       0.907     0.562          0.670          0.791    0.969          1.369
Table 2: Population statistics of per-instance privacy parameters on MNIST. The average estimation error rate is 0.77%.
CIFAR-10      Average   0.1-quantile   0.3-quantile   Median   0.7-quantile   0.9-quantile
Non-private   5.670     2.942          4.822          6.398    6.959          7.132
Private       5.669     3.125          4.818          6.390    6.977          7.159
Table 3: Population statistics of per-instance privacy parameters on CIFAR-10. The average estimation error rate is 1.18%.

Appendix F More Plots Showing Privacy Risk is Unequal Across Different Subgroups

Figure 9: Loss and ϵ of different subgroups. Subgroups with higher loss have worse privacy in general. The experimental setup is the same as that in Section 5.

We plot the average loss and ϵ of different subgroups on MNIST and UTKFace-Age in Figure 9. The subgroups are sorted by their average loss values. The observation is similar to that in Section 5: subgroups with higher loss also have higher privacy parameters in general. On UTKFace-Age, the average ϵ of the subgroup with the highest loss (Black) is 3.12, while the average ϵ of the subgroup with the lowest loss (Asian) is 2.54.

Figure 10: Average ϵ and membership inference attack success rates of different subgroups. The experimental setup is the same as that in Section 5.

The membership inference attack success rates and average ϵ on MNIST and UTKFace-Age are in Figure 10. Subgroups are sorted by their average ϵ. On MNIST, the attack success rates are close to that of random guessing (50%) regardless of whether the models are trained with DP. This may be because the generalization gap on MNIST is very small, i.e., test accuracy is >99% when trained without DP, so it is hard to distinguish the training loss distribution from the test loss distribution. On UTKFace-Age, the difference in attack success rates is clear when the model is trained without DP. For example, the attack success rate on the Black subgroup (which has the highest average ϵ) is 70.6%, while the attack success rate on the Asian subgroup (which has the lowest average ϵ) is only 60.1%.
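A simple loss-threshold attack, one common instantiation of membership inference, can be sketched as below to illustrate why overlapping train and test loss distributions (as on MNIST) keep success near 50%; the loss distributions and threshold are hypothetical, and the paper's actual attack setup is the one described in Section 5:

```python
import numpy as np

# Sketch of a loss-threshold membership inference attack: predict "member"
# when an example's loss is below a threshold, and report balanced accuracy.
rng = np.random.default_rng(0)

def attack_success_rate(train_losses, test_losses, threshold):
    tpr = np.mean(train_losses < threshold)      # members flagged as members
    tnr = np.mean(test_losses >= threshold)      # non-members flagged as such
    return (tpr + tnr) / 2                       # balanced accuracy

# Hypothetical losses: a model that memorizes gives members lower loss on
# average, so the two distributions separate and the attack beats chance.
train = rng.exponential(0.2, size=5000)
test = rng.exponential(1.0, size=5000)
rate = attack_success_rate(train, test, threshold=0.5)
assert rate > 0.5
```

If the two distributions were identical, any threshold would give balanced accuracy near 0.5, matching the near-chance rates observed on MNIST.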