Machine learning with differential privacy (DP) (Dwork and Roth, 2014) and its relaxations has attracted growing interest in the last few years. Several approaches have been proposed to optimize models with stochastic gradient descent (SGD) using private data, including objective perturbation (Chaudhuri et al., 2011), output perturbation (Chen et al., 2019) and gradient perturbation (Abadi et al., 2016). The latter is the most widely applicable, since it makes few assumptions about the model or the data.
In this paper, we analyze the effect on privacy of the order in which data is processed. Consider one epoch of SGD with batch size one. At each step, one can independently pick a new training sample uniformly at random. We call this method sampling with replacement. Another option is to first draw a random permutation of the whole dataset, and then process all the samples according to this permuted ordering. This second method, known as sampling without replacement or as cyclic SGD, is known to converge faster (Bottou, 2009; HaoChen and Sra, 2019) and is hence more widely used in practice. While the privacy of the "sampling with replacement" method is well known (Abadi et al., 2016), it is unclear how the usual proof techniques extend to the second approach. Recently, the effect of shuffling on pure differential privacy has been studied (Erlingsson et al., 2019). However, relaxations such as $(\epsilon, \delta)$-differential privacy or Rényi differential privacy are more popular since they yield smaller privacy budgets.
We analyze the Rényi differential privacy (RDP) (Mironov, 2017) of SGD without replacement, with a particular focus on privacy guarantees that still hold in a non-convex setting. We prove privacy guarantees for sampling without replacement which are asymptotically similar to those for sampling with replacement, with the restriction that each epoch is stopped after only a fraction of the dataset has been processed. We also provide a toy example showing that the two sampling methods can behave differently in terms of privacy.
Rényi differential privacy. Pure differential privacy is a strong but hard-to-enforce guarantee. It was first relaxed to the weaker $(\epsilon, \delta)$-differential privacy, and then to the intermediate notion of Rényi differential privacy (RDP) (Mironov, 2017). RDP has the triple advantage of (i) being easy to work with under Gaussian perturbations, (ii) having simple composition properties, and (iii) allowing easy amplification via subsampling (Balle et al., 2018; Wang et al., 2018), under some assumptions on the required privacy level.
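Given a dataset $D$ and some algorithm $\mathcal{A}$, suppose $D'$ is a dataset that differs from $D$ in only one data point. Let $P$ and $Q$ be the probability distributions of the output of $\mathcal{A}$ on $D$ and $D'$ respectively. The algorithm $\mathcal{A}$ is said to have $(\alpha, \epsilon)$-Rényi differential privacy if:

$$D_\alpha(P \,\|\, Q) = \frac{1}{\alpha - 1} \log \mathbb{E}_{x \sim Q}\left[\left(\frac{P(x)}{Q(x)}\right)^{\alpha}\right] \le \epsilon.$$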
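As a small numerical illustration of the Rényi divergence used in this definition, the sketch below compares a grid-based evaluation of the divergence between two one-dimensional Gaussians with equal variance against the closed form $\alpha \Delta^2 / (2\sigma^2)$ given for the Gaussian mechanism in (Mironov, 2017). The function names, the symbols `delta_shift` and `sigma`, and the integration grid are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def renyi_divergence_numeric(alpha, mu_p, mu_q, sigma, grid=200_001, span=40.0):
    """Approximate D_alpha(P || Q) for P = N(mu_p, sigma^2), Q = N(mu_q, sigma^2)
    with a Riemann sum on a fixed grid (illustrative, not a general-purpose routine)."""
    x = np.linspace(-span, span, grid)
    dx = x[1] - x[0]
    log_norm = np.log(sigma * np.sqrt(2.0 * np.pi))
    log_p = -0.5 * ((x - mu_p) / sigma) ** 2 - log_norm
    log_q = -0.5 * ((x - mu_q) / sigma) ** 2 - log_norm
    # E_{x ~ Q}[(P/Q)^alpha] equals the integral of p^alpha * q^(1 - alpha).
    integrand = np.exp(alpha * log_p + (1.0 - alpha) * log_q)
    integral = np.sum(integrand) * dx
    return np.log(integral) / (alpha - 1.0)

def renyi_divergence_gaussian(alpha, delta_shift, sigma):
    """Closed form for two Gaussians whose means differ by delta_shift."""
    return alpha * delta_shift ** 2 / (2.0 * sigma ** 2)

# Hypothetical values chosen only for illustration.
numeric = renyi_divergence_numeric(alpha=2.0, mu_p=1.0, mu_q=0.0, sigma=2.0)
closed = renyi_divergence_gaussian(alpha=2.0, delta_shift=1.0, sigma=2.0)
print(numeric, closed)
```

The two values should agree up to discretization error, which shows why Gaussian perturbations are convenient to analyze under RDP: the divergence grows linearly in the order and inversely with the noise variance.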
2. Main Results
Given a dataset, we run several epochs of cyclic SGD with batch size one. Our algorithm takes as inputs the dataset, an initial parameter, a learning rate, a loss function and a gradient clipping parameter. Every epoch, we draw a random permutation of the dataset, then return the sequence of parameters computed after clipped gradient steps on the loss, processing the dataset in the permuted order. At each gradient step, Gaussian noise is added to the stochastic gradient, so that each gradient step (ignoring gradient clipping for readability) is:
Importantly, our algorithm processes only a fraction of the shuffled data before moving on to the next epoch; it does not perform a full cycle. This is crucial to our privacy results.
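The procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' Algorithm 1: the names `partial_epoch_noisy_sgd`, `fraction`, `sigma` and `clip_norm` are hypothetical, and scaling the noise standard deviation as `sigma * clip_norm` is a convention borrowed from common DP-SGD implementations rather than something the text specifies.

```python
import numpy as np

def clip(gradient, max_norm):
    """Rescale the gradient so that its Euclidean norm is at most max_norm."""
    norm = np.linalg.norm(gradient)
    return gradient if norm <= max_norm else gradient * (max_norm / norm)

def partial_epoch_noisy_sgd(data, theta0, grad_fn, lr, clip_norm, sigma,
                            num_epochs, fraction, rng):
    """Cyclic SGD with batch size one, Gaussian gradient noise, and early
    stopping of each epoch after `fraction` of the shuffled data."""
    n = len(data)
    steps_per_epoch = int(fraction * n)  # only part of each permutation is used
    theta = np.array(theta0, dtype=float)
    iterates = []
    for _ in range(num_epochs):
        order = rng.permutation(n)  # fresh shuffle at every epoch
        for idx in order[:steps_per_epoch]:
            g = clip(grad_fn(theta, data[idx]), clip_norm)
            # Assumed noise scale: sigma times the clipping threshold.
            noise = rng.normal(0.0, sigma * clip_norm, size=theta.shape)
            theta = theta - lr * (g + noise)
            iterates.append(theta.copy())  # the whole sequence is released
    return iterates

# Toy usage: one-dimensional least squares on synthetic (x, y) pairs.
rng = np.random.default_rng(0)
xs = rng.normal(size=10)
data = np.stack([xs, 3.0 * xs], axis=1)

def squared_loss_grad(theta, row):
    x, y = row
    return np.array([2.0 * (theta[0] * x - y) * x])

iterates = partial_epoch_noisy_sgd(data, theta0=[0.0], grad_fn=squared_loss_grad,
                                   lr=0.05, clip_norm=1.0, sigma=1.0,
                                   num_epochs=3, fraction=0.5, rng=rng)
print(len(iterates))
```

Slicing the permutation with `order[:steps_per_epoch]` is the only change relative to a standard shuffled epoch, which makes the early-stopping restriction cheap to adopt in practice.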
2.2. Privacy Guarantees
Theorem 2.1 (Privacy of one step).
At any epoch and step of Algorithm 1, let be the outputs. Conditioned on and assuming , the -th step satisfies -RDP for
The privacy bound degrades as the step index within the epoch grows, and ultimately no privacy is obtained at the end of a full epoch. This is the main reason why we only process a fraction of each epoch. The condition on the noise level corresponds to the high noise/high privacy regime considered in (Abadi et al., 2016). We can then use Theorem 2.1 to obtain the following global privacy guarantee on our procedure.
Theorem 2.2 (Global privacy).
The output sequence of parameters given by Algorithm 1 satisfies -Rényi differential privacy for
Our analysis recovers the same asymptotic rate as for sampling with replacement (Abadi et al., 2016). We can easily translate the bounds into differential privacy: Algorithm 1 has privacy for any (Mironov, 2017). Tighter bounds can be obtained using a privacy accountant as in (Abadi et al., 2016) that keeps track of the privacy loss at each step and uses the optimal .
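The translation from RDP to $(\epsilon, \delta)$-differential privacy mentioned above can be sketched with the standard conversion of (Mironov, 2017): an $(\alpha, \epsilon)$-RDP algorithm is $(\epsilon + \log(1/\delta)/(\alpha - 1), \delta)$-DP for any $\delta \in (0, 1)$. The example below searches a grid of orders for the smallest resulting budget, in the spirit of a privacy accountant. The RDP curve used here is that of a composed Gaussian mechanism with unit sensitivity, chosen only for illustration; it is not the bound of Theorem 2.2, and all names and numerical values are assumptions.

```python
import math

def rdp_to_dp(alpha, rdp_eps, delta):
    """Convert an (alpha, rdp_eps)-RDP guarantee into an (eps, delta)-DP one."""
    return rdp_eps + math.log(1.0 / delta) / (alpha - 1.0)

def best_dp_epsilon(rdp_curve, delta, alphas):
    """Return (eps, alpha) minimizing the converted budget over a grid of orders."""
    return min((rdp_to_dp(a, rdp_curve(a), delta), a) for a in alphas)

# Illustrative RDP curve: num_steps compositions of a Gaussian mechanism with
# unit sensitivity and noise scale sigma, each costing alpha / (2 sigma^2).
num_steps, sigma, delta = 100, 10.0, 1e-5

def composed_gaussian_rdp(alpha):
    return num_steps * alpha / (2.0 * sigma ** 2)

alphas = [1.0 + 0.1 * k for k in range(1, 631)]  # orders from 1.1 to 64.0
eps, alpha_star = best_dp_epsilon(composed_gaussian_rdp, delta, alphas)
print(eps, alpha_star)
```

Because the conversion holds for every order simultaneously, taking the minimum over the grid is valid, and a finer grid can only tighten the reported budget.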
3. Proof Overview
3.1. Privacy amplification via non-uniform subsampling
Subsampling the dataset with ratio before using an -RDP algorithm roughly increases its privacy to (Wang et al., 2018), and this is used to compute the RDP of SGD with replacement. Suppose instead that we are subsampling with non-uniform weights (Hartley et al., 1962). These weights may depend on the processed data but can be upper bounded almost surely . Then we have the following generalization:
Theorem 3.1 (non-uniform subsampling).
Suppose that is -RDP for any integer when run on data . Then suppose that is run on a subsampled dataset where for any data point , . The new procedure is -RDP:
This is a generalization of Theorem 9 of (Wang et al., 2018), and the proof closely follows theirs. The only difference is in Proposition 21, which states the effect of subsampling on ternary DP, where we need to replace by with an inequality. ∎
3.2. Proof of Theorem 2.1
We will assume for simplicity that Algorithm 1 runs only for one epoch with output for . Let denote all the past iterates (in this epoch) until . Conditioned on the previous iterates , is a mixture of Gaussians where each distribution corresponds to using a particular datapoint . In the case of uniform sampling with replacement, this mixture has uniform weights , which allows the direct use of subsampling theorems. However, in the case of sampling without replacement, the sampling weights may depend on the previous iterates, since within an epoch each datapoint is sampled at most once.
are the random variables indicating the shuffled indices. The following holds, for any and :
where we used the fact that . Here is a Gaussian distribution with mean and variance . Further, note that almost surely
Thus we can use Theorem 3.1 to bound the privacy of . Note that the update step (without considering the subsampling) is -RDP (Mironov, 2017) (where is the gradient clipping parameter, such that the sensitivity of one gradient step is less than ). Theorem 3.1 then tells us that:
Because of the use of Gaussian noise, as noted by (Wang et al., 2018), this bound has a different behaviour depending on the size of . We consider the case of large enough (), which is the one handled in (Abadi et al., 2016). Then we can simplify the above terms to obtain Theorem 2.1.∎
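To make the dependence of the sampling weights on the history concrete, the following sketch enumerates all permutations of a tiny dataset and computes the exact probability of each index being drawn next, given the indices already drawn in the epoch. It is a simplification of the setting in the proof, which conditions on past iterates rather than on past indices, and the function name and the values of `n` and `prefix` are hypothetical.

```python
from fractions import Fraction
from itertools import permutations

def next_index_weights(n, prefix):
    """Exact P(next index = i | indices drawn so far = prefix) when the epoch
    order is a uniformly random permutation of range(n)."""
    t = len(prefix)
    counts = [0] * n
    total = 0
    for perm in permutations(range(n)):
        if perm[:t] == tuple(prefix):
            counts[perm[t]] += 1
            total += 1
    return [Fraction(c, total) for c in counts]

n = 5
weights = next_index_weights(n, prefix=(3, 0))
print(weights)       # non-uniform: indices already drawn have weight zero
print(max(weights))  # largest weight after t = 2 steps
```

In this toy computation the largest weight after `t` steps is `1 / (n - t)`, compared with `1 / n` for sampling with replacement. Under the assumption that an epoch stops after a fraction `gamma` of the `n` points (the symbol is ours), the weights stay below `1 / ((1 - gamma) * n)`, which is consistent with the role of an almost-sure upper bound on the weights in Theorem 3.1.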
4. Sampling with replacement can be more private
One might wonder why our privacy guarantee is weaker for sampling without replacement than for sampling with replacement, while both would be expected to have similar behaviours. Here we present a toy example where DP depends on the type of sampling. is initialized at and each training example corresponds either to the non-convex loss function or to (see Fig. 1). The algorithm does two SGD steps with step-size 1, then with probability returns the final , otherwise chooses a response uniformly among (output perturbation). With , sampling with replacement has -DP, while sampling without replacement only has -DP.
Intuitively, the example is built such that choosing the same data point twice preserves more privacy (one cannot tell between and ) than choosing different ones ( and end up in different states). This is a very simple example of randomized response where pure DP can be computed exactly, but the same principle can be applied to SGD with gradient perturbation, albeit with approximate computations.
We proposed a simple analysis of the privacy of SGD with gradient perturbation when sampling is done without replacement. We make no assumptions on the convexity or smoothness of the problem. Our analysis extends results from (Feldman et al., 2018) to non-convex settings, when it is possible to shuffle the training data. The privacy guarantees are almost the same as for sampling with replacement (Abadi et al., 2016), up to a slight modification of the shuffling procedure. This might be advantageous in practice since cyclic SGD is more commonly used and empirically converges significantly faster. A caveat, though, is that it is unclear whether cyclic SGD with added noise still retains its speed of convergence. Further, our guarantees hold even when the whole sequence of iterates is released, allowing its use in distributed and decentralized settings where multiple workers want to hide their data from each other.
- Abadi et al. (2016) Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. 2016. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. ACM, 308–318.
- Balle et al. (2018) Borja Balle, Gilles Barthe, and Marco Gaboardi. 2018. Privacy amplification by subsampling: Tight analyses via couplings and divergences. In Advances in Neural Information Processing Systems. 6277–6287.
- Bottou (2009) Léon Bottou. 2009. Curiously fast convergence of some stochastic gradient descent algorithms. In Proceedings of the symposium on learning and data science, Paris.
- Chaudhuri et al. (2011) Kamalika Chaudhuri, Claire Monteleoni, and Anand D Sarwate. 2011. Differentially private empirical risk minimization. Journal of Machine Learning Research 12, Mar (2011), 1069–1109.
- Chen et al. (2019) Chen Chen, Jaewoo Lee, and Dan Kifer. 2019. Renyi Differentially Private ERM for Smooth Objectives. In The 22nd International Conference on Artificial Intelligence and Statistics. 2037–2046.
- Dwork and Roth (2014) Cynthia Dwork and Aaron Roth. 2014. The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science 9, 3–4 (2014), 211–407.
- Erlingsson et al. (2019) Úlfar Erlingsson, Vitaly Feldman, Ilya Mironov, Ananth Raghunathan, Kunal Talwar, and Abhradeep Thakurta. 2019. Amplification by shuffling: From local to central differential privacy via anonymity. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms. SIAM, 2468–2479.
- Feldman et al. (2018) Vitaly Feldman, Ilya Mironov, Kunal Talwar, and Abhradeep Thakurta. 2018. Privacy amplification by iteration. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS). IEEE, 521–532.
- HaoChen and Sra (2019) Jeffery Z HaoChen and Suvrit Sra. 2019. Random Shuffling Beats SGD after Finite Epochs. In International Conference on Machine Learning.
- Hartley et al. (1962) HO Hartley, JNK Rao, et al. 1962. Sampling with unequal probabilities and without replacement. The Annals of Mathematical Statistics 33, 2 (1962), 350–374.
- Mironov (2017) Ilya Mironov. 2017. Rényi differential privacy. In 2017 IEEE 30th Computer Security Foundations Symposium (CSF). IEEE, 263–275.
- Wang et al. (2018) Yu-Xiang Wang, Borja Balle, and Shiva Kasiviswanathan. 2018. Subsampled Rényi Differential Privacy and Analytical Moments Accountant. arXiv preprint arXiv:1808.00087 (2018).