1 Introduction
Modern mobile devices and web services benefit significantly from large-scale machine learning, often involving training on user (client) data. When such data is sensitive, steps must be taken to ensure privacy, and a formal guarantee of differential privacy (DP) [16, 15] is the gold standard. For this reason, DP has been adopted by companies including Google [20, 9, 18], Apple [2], Microsoft [13], and LinkedIn [31], as well as the US Census Bureau [26].
Other privacy-enhancing techniques can be combined with DP to obtain additional benefits. In particular, cross-device federated learning (FL) [27] allows model training while keeping client data decentralized (each participating device keeps its own local dataset, and only sends model updates or gradients to the coordinating server). However, existing approaches to combining FL and DP make a number of assumptions that are unrealistic in real-world FL deployments such as [10]. To highlight these challenges, we must first review the state of the art in centralized DP training, where differentially private stochastic gradient descent (DP-SGD) [34, 8, 1] is ubiquitous. It achieves optimal error for convex problems [8], and can also be applied to non-convex problems, including deep learning, where the privacy amplification offered by randomly subsampling data to form batches is critical for obtaining meaningful DP guarantees [25, 8, 1, 5, 37].
Attempts to combine FL and the above lines of DP research have been made previously; notably, [28, 3] extended the approach of [1] to FL and user-level DP. However, these works and others in the area sidestep a critical issue: the DP guarantees require very specific sampling or shuffling schemes assuming, for example, that each client participates in each iteration with a fixed probability. While possible in theory, such schemes are incompatible with the practical constraints and design goals of cross-device FL protocols [10]; to quote [23], a comprehensive recent FL survey, “such a sampling procedure is nearly impossible in practice.” (Footnote 1: In cross-silo FL applications [23], an enumerated set of addressable institutions or data silos participate in FL, and so explicit server-mediated subsampling or shuffling using existing techniques may be feasible.) The fundamental challenge is that clients decide when they will be available for training and when they will check in to the server, and by design the server cannot index specific clients. In fact, it may not even know the size of the participating population.
Our work targets these challenges. Our primary goal is to provide strong central DP guarantees for the final model released by FL-like protocols, under the assumption of a trusted orchestrating server. (Footnote 2: Notably, our guarantees are obtained by amplifying the privacy provided by local DP randomizers; we treat this use of local DP as an implementation detail in accomplishing the primary goal of central DP. As a byproduct, our approach offers (weaker) local DP guarantees even in the presence of an untrusted server.) This is accomplished by building upon recent work on amplification by shuffling [19, 12, 18, 22, 6] and combining it with new analysis techniques targeting FL-specific challenges (e.g., client-initiated communications, non-addressable global population, and constrained client availability).
We propose the first privacy amplification analysis specifically tailored for distributed learning frameworks. At the heart of our result is a novel technique, called random check-in, that relies only on randomness independently generated by each individual client participating in the training procedure. We show that distributed learning protocols based on random check-ins can attain privacy gains similar to privacy amplification by subsampling/shuffling (see Table 1 for a comparison), while requiring minimal coordination from the server. While we restrict our exposition to distributed DP-SGD within the FL framework for clarity and concreteness (see Figure 1 for a schematic of one of our protocols), we note that the techniques used in our analyses are broadly applicable to any distributed iterative method and might be of interest in other applications. (Footnote 3: In particular, the Federated Averaging [27] algorithm, which computes an update based on multiple local SGD steps rather than a single gradient, can immediately be plugged into our framework.)
Contributions
The main contributions of this paper can be summarized as follows:

We propose random check-ins, the first privacy amplification technique for distributed systems with minimal server-side overhead. We also instantiate three distributed learning protocols that use random check-ins, each addressing different natural constraints that arise in applications.

We provide formal privacy guarantees for our protocols, and show that random check-ins attain similar rates of privacy amplification as subsampling and shuffling while reducing the need for server-side orchestration. We also provide utility guarantees for one of our protocols in the convex case that match the optimal privacy/accuracy tradeoffs for DP-SGD in the central setting [7].

As a byproduct of our analysis, we improve privacy amplification by shuffling [19] on two fronts. For the case of DP local randomizers, we improve the dependency of the final central DP by a factor of . Figure 2 provides a numerical comparison of the bound from [19] with our bound; for typical parameter values this improvement allows us to provide similar privacy guarantees while reducing the number of required users by one order of magnitude. We also extend the analysis to the case of DP local randomizers, including Gaussian randomizers that are widely used in practice.
Related work
Our work considers the paradigm of federated learning as a stylized example throughout the paper. We refer the reader to [23] for an excellent overview of the state of the art in federated learning, along with a suite of interesting open problems. There is a rich literature on studying differentially private ERM via DP-SGD [34, 8, 1, 39, 35, 30]. However, constraints such as limited availability in distributed settings restrict direct applications of existing techniques. There is also a growing line of work on privacy amplification by shuffling [9, 19, 12, 4, 6, 22, 18] that focuses on various ways in which protocols can be designed using trusted shuffling primitives. Lastly, privacy amplification by iteration [21] is another recent advancement that can be applied in an iterative distributed setting, but it is limited to convex objectives.
2 Background and Problem Formulation
Differential Privacy
To formally introduce our notion of privacy, we first define neighboring data sets. We will refer to a pair of data sets $D, D' \in \mathcal{D}^n$ as neighbors if $D'$ can be obtained from $D$ by modifying one sample $d_i \in D$ for some $i \in [n]$.
Definition 2.1 (Differential privacy [16, 15]).
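A randomized algorithm $\mathcal{A}: \mathcal{D}^n \to \mathcal{S}$ is $(\varepsilon, \delta)$-differentially private if, for any pair of neighboring data sets $D, D' \in \mathcal{D}^n$, and for all events $S \subseteq \mathcal{S}$ in the output range of $\mathcal{A}$, we have $\Pr[\mathcal{A}(D) \in S] \le e^{\varepsilon} \cdot \Pr[\mathcal{A}(D') \in S] + \delta$.
For meaningful central DP guarantees (i.e., when $n > 1$), $\varepsilon$ is assumed to be a small constant, and $\delta \ll 1/n$. The case $\delta = 0$ is often referred to as pure DP (in which case, we just write $\varepsilon$-DP). We shall also use the term approximate DP when $\delta > 0$.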
Adaptive differentially private mechanisms occur naturally when constructing complex DP algorithms, e.g., DP-SGD. In addition to the dataset $D$, adaptive mechanisms also receive as input the output of other differentially private mechanisms. Formally, we say that an adaptive mechanism $\mathcal{A}(s', D)$ is $(\varepsilon, \delta)$-DP if the mechanism $\mathcal{A}(s', \cdot)$ is $(\varepsilon, \delta)$-DP for every auxiliary input $s'$.
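As a concrete illustration of the definition, the following minimal sketch checks the pure-DP inequality for binary randomized response, a standard local randomizer that is not one of this paper's protocols. The function names and the choice of mechanism are assumptions made purely for illustration.

```python
import math

def randomized_response_probs(bit, eps):
    """Output distribution of binary randomized response on input `bit`."""
    keep = math.exp(eps) / (1.0 + math.exp(eps))  # probability of reporting the true bit
    return {bit: keep, 1 - bit: 1.0 - keep}

def max_privacy_loss(eps):
    """Largest absolute log-ratio of output probabilities over the two neighboring inputs."""
    p0 = randomized_response_probs(0, eps)
    p1 = randomized_response_probs(1, eps)
    return max(abs(math.log(p0[out] / p1[out])) for out in (0, 1))

eps = 0.5
worst_case = max_privacy_loss(eps)
# The worst-case log-ratio should not exceed eps (up to floating-point error).
print(worst_case <= eps + 1e-12)
```

Because each data set here is a single bit and the output space has only two elements, checking the two singleton events is enough to verify the inequality in the pure-DP case.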
Problem Setup
The distributed learning setup we consider in this paper involves clients, where each client holds a data record , , forming a distributed data set . (Footnote 4: Each client is identified as a user. In a general FL setting, each can correspond to a local data set [10].) We assume a coordinating server wants to train the parameters of a model by using the dataset to perform stochastic gradient descent steps according to some loss function . The server’s goal is to protect the privacy of all the individuals in by providing strong DP guarantees against an adversary that can observe the final trained model as well as all the intermediate model parameters. We assume the server is trusted, all devices adhere to the prescribed protocol (i.e., there are no malicious users), and all server-client communications are privileged (i.e., they cannot be detected or eavesdropped by an external adversary).
The server starts with model parameters and over a sequence of time slots produces a sequence of model parameters . Our random check-ins technique allows clients to independently decide when to offer their contributions for a model update. If and when a client’s contribution is accepted by the server, she uses the current parameters and her data to send a privatized gradient of the form to the server, where is a DP local randomizer (e.g., performing gradient clipping and adding Gaussian noise [1]).
Our results consider three different setups inspired by practical applications [10]: (1) The server uses time slots, where at most one user’s update is used in each slot, for a total of minibatch SGD iterations. It is assumed all users are available for the duration of the protocol, but the server does not have enough bandwidth to process updates from every user (Section 3.1); (2) The server uses time slots, and all users are available for the duration of the protocol (Section 4.1). On average, users contribute updates to each time slot, and so, we take minibatch SGD steps; (3) As with (2), but each user is only available during a small window of time relative to the duration of the protocol (Section 4.2).
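The local randomizer mentioned above, which clips a gradient and adds Gaussian noise, can be sketched as follows. This is a minimal illustration only: the names `clip_l2` and `local_randomizer`, the plain-list gradient representation, and the parameters `clip_norm` and `sigma` are assumptions, and the sketch does not say how the noise scale must be calibrated to obtain a particular DP guarantee.

```python
import math
import random

def clip_l2(grad, clip_norm):
    """Rescale `grad` so that its L2 norm is at most `clip_norm`."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]

def local_randomizer(grad, clip_norm, sigma, rng):
    """Clip the gradient, then add independent Gaussian noise to each coordinate."""
    clipped = clip_l2(grad, clip_norm)
    return [g + rng.gauss(0.0, sigma) for g in clipped]

rng = random.Random(0)
noisy = local_randomizer([3.0, 4.0], clip_norm=1.0, sigma=0.5, rng=rng)
print(len(noisy))
```

Clipping bounds the influence of any single record on the update, which is what allows a fixed noise scale to mask it.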
3 Distributed Learning with Random Check-Ins
This section presents the random check-ins technique for privacy amplification in the context of distributed learning. We formally define the random check-ins procedure, describe a fully distributed DP-SGD protocol with random check-ins, and analyze its privacy and utility guarantees.
3.1 Random Check-Ins with a Fixed Window
Consider the distributed learning setup described in Section 2 where each client is willing to participate in the training procedure as long as their data remains private. To boost the privacy guarantees provided by the local randomizer , we will let clients volunteer their updates at a random time slot of their choosing. This randomization has a similar effect on the uncertainty about the use of an individual’s data on a particular update as the one provided by uniform subsampling or shuffling. We formalize this concept using the notion of random check-in, which can be informally expressed as a client in a distributed iterative learning framework randomizing their instant of participation, and determining with some probability whether to participate in the process at all.
Definition 3.1 (Random check-in).
Let be a distributed learning protocol with check-in time slots. For a set and probability , client performs an check-in in the protocol if with probability she requests the server to participate in at time step , and otherwise abstains from participating. If , we alternatively denote it as an check-in.
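A client-side random check-in can be sketched in a few lines. The sketch assumes that the time step is drawn uniformly from the client's set of allowed slots; the function name `random_check_in` and its parameters (`slots`, `p`, `rng`) are hypothetical names introduced for illustration.

```python
import random

def random_check_in(slots, p, rng):
    """With probability p return a slot drawn uniformly from `slots`; otherwise abstain (None)."""
    if rng.random() < p:
        return rng.choice(slots)
    return None

rng = random.Random(0)
choice = random_check_in(slots=list(range(1, 11)), p=0.8, rng=rng)
print(choice)
```

Note that the only randomness used is generated locally by the client; the server plays no part in deciding whether or when the client checks in.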
Our first distributed learning protocol based on random check-ins is presented in Algorithm 1. Client independently decides in which of the possible time steps (if any) she is willing to participate by performing an check-in. We set for all , and assume all clients are available throughout the duration of the protocol. (Footnote 5: We make this assumption only for utility; the privacy guarantees are independent of this assumption.) On the server side, at each time step , a random client among all the ones that checked in at time is queried: this client receives the current model , locally computes a gradient update using their data , and returns to the server a privatized version of the gradient obtained using a local randomizer . Clients checked in at time that are not selected do not participate in the training procedure. If at time no client is available, the server adds a “dummy” gradient to update the model.
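The scheduling logic described above can be simulated with the following minimal sketch. It only tracks which client (if any) the server queries in each slot and leaves out the gradient computation. The function name, its parameters, and the uniform choice of slot are assumptions made for illustration; this is not the paper's Algorithm 1 verbatim.

```python
import random

def run_fixed_window(num_clients, num_slots, p, rng):
    """Return, for each slot, the queried client id, or None when a dummy update is used."""
    checked_in = [[] for _ in range(num_slots)]
    for client in range(num_clients):
        if rng.random() < p:                  # the client decides to participate
            slot = rng.randrange(num_slots)   # uniformly chosen slot (assumed)
            checked_in[slot].append(client)
    schedule = []
    for waiting in checked_in:
        # One checked-in client is picked at random; the others in this slot are discarded.
        schedule.append(rng.choice(waiting) if waiting else None)
    return schedule

schedule = run_fixed_window(num_clients=50, num_slots=10, p=0.8, rng=random.Random(0))
print(schedule)
```

Each client appears in at most one slot, which is the source of the correlation between updates discussed in the privacy analysis.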
3.2 Privacy Analysis
From a privacy standpoint, Algorithm 1 shares an important pattern with DP-SGD: each model update uses noisy gradients obtained from a random subset of the population. However, there exist two key factors that make the privacy analysis of our protocol more challenging than the existing analysis based on subsampling and shuffling. First, unlike in the case of uniform sampling where the randomness in each update is independent, here there is a correlation induced by the fact that clients that check in to one step cannot check in to a different step. Second, in shuffling there is also a similar correlation between updates, but there we can ensure each update uses the same number of datapoints, while here the server does not control the number of clients that will check in to each individual step. Nonetheless, the following result shows that random check-ins provide a factor of privacy amplification comparable to these techniques.
Theorem 3.2 (Amplification via random check-ins into a fixed window).
Suppose is an DP local randomizer. Let be the protocol from Algorithm 1 with check-in probability and check-in window for each client . For any , algorithm is DP with . In particular, for and , we get . Furthermore, if is DP with , then is DP with and .
Remark 1
We can always increase privacy in the above statement by decreasing . However, this will also increase the number of dummy updates, which suggests choosing . With such a choice, we obtain an amplification factor of . Critically, however, exact knowledge of the population size is not required to have a precise DP guarantee above.
Remark 2
At first look, the amplification factor of may appear stronger than the typical factor obtained via uniform subsampling/shuffling. Note that one run of our technique provides updates (as opposed to updates via the other methods). When the server has sufficient capacity, we can set to recover a amplification. The primary advantage of our approach is that we can benefit from amplification in terms of even if only a much smaller number of updates are actually processed. We can also extend our approach to recover the amplification even when the server is rate limited (), by repeating the protocol adaptively times to get Corollary 3.3 from Theorem 3.2 and applying advanced composition for DP [17].
Corollary 3.3.
For algorithm described in Theorem 3.2, suppose is an DP local randomizer s.t. , and . Setting , and running repetitions of results in a total of updates, along with an overall central DP guarantee with and , where hides polylog factors in and .
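Remark 2 and Corollary 3.3 rely on advanced composition [17]. As a minimal sketch, the function below evaluates the standard advanced composition bound for k adaptive runs of an (eps, delta)-DP mechanism, namely eps_total = eps * sqrt(2k ln(1/delta')) + k * eps * (e^eps - 1) and delta_total = k * delta + delta'. It illustrates the general theorem rather than the exact constants of Corollary 3.3, and the function and parameter names are assumptions.

```python
import math

def advanced_composition(eps, delta, k, delta_prime):
    """Privacy parameters after k-fold adaptive composition of an (eps, delta)-DP mechanism."""
    eps_total = eps * math.sqrt(2.0 * k * math.log(1.0 / delta_prime)) \
        + k * eps * (math.exp(eps) - 1.0)
    delta_total = k * delta + delta_prime
    return eps_total, delta_total

eps_total, delta_total = advanced_composition(eps=0.01, delta=1e-8, k=100, delta_prime=1e-6)
# Basic composition would give k * eps; advanced composition is tighter for small eps.
print(eps_total < 100 * 0.01)
```

For small per-run eps the total grows roughly with the square root of the number of repetitions instead of linearly, which is why repeating the protocol remains affordable.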
Comparison to Existing Privacy Amplification Techniques
Table 1 provides a comparison of the bound in Corollary 3.3 to other existing techniques, for performing one epoch of training (i.e., using one update from each client). Note that for this comparison, we assume that , since for all the shown amplification bounds can be written as . “None” denotes a naïve scheme (with no privacy amplification) where each client is used exactly once in any arbitrary order. Also, note that in general, the guarantees via privacy amplification by subsampling/shuffling apply only under the assumption of complete participation availability of each client. (Footnote 6: By a complete participation availability for a client, we mean that the client should be available to participate when requested by the server for any time step(s) of training.) Thus, they define the upper limits of achieving such amplifications. Also, note that even though the bound in Corollary 3.3 appears better than amplification via shuffling, our technique does include dummy updates which do not occur in the other techniques. For linear optimization problems, it is easy to see that our technique will add a factor of more noise as compared to the other two privacy amplification techniques at the same privacy level.
Proof Sketch for Theorem 3.2
Here, we provide a summary of the argument used to prove Theorem 3.2 in the case . (Footnote 7: Full proofs for every result in the paper are provided in Appendix A.) First, note that it is enough to argue about the privacy of the sequence of noisy gradients by post-processing. Also, the role each client plays in the protocol is symmetric, so w.l.o.g. we can consider two datasets differing in the first position. Next, we imagine that the last clients make the same random check-in choices in and . Letting denote the number of such clients that check in to step , we model these choices by a pair of sequences where is the data record of an arbitrary client who checked in to step (with representing a “dummy” data record if no client checked in), and represents the probability that client ’s data will be picked to participate in the protocol at step if she checks in at step . Conditioned on these choices, the noisy gradients produced by can be obtained by: (1) initializing a dataset ; (2) sampling , and replacing with in w.p. ; (3) producing the outputs by applying a sequence of DP adaptive local randomizers to by setting . Here each of the uses all past gradients to compute the model and return .
The final step involves a variant of the amplification by swapping technique [19, Theorem 8] which we call amplification by probable replacement. The key idea is to reformulate the composition of the applied to the random dataset , to a composition of mechanisms of the form . Mechanism uses the gradient history to compute and returns with probability , and otherwise. Note that before the process begins, we have for every ; our analysis shows that the posterior probability after observing the first gradients is not too far from the prior: . The desired bound is then obtained by using the overlapping mixtures technique [5] to show that is DP with respect to changes on , and heterogeneous advanced composition [24] to compute the final of composing the adaptively.
3.3 Utility Analysis
Proposition 3.4 (Dummy updates in random check-ins with a fixed window).
For algorithm described in Theorem 3.2, the expected number of dummy updates performed by the server is at most . For if , we get at most expected dummy updates.
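To build intuition for how the number of dummy updates behaves, the sketch below estimates the expected number of empty slots by simulation and compares it with a direct calculation. It assumes n clients that each check in with probability p0 into a slot chosen uniformly among m slots, so that a given slot stays empty with probability (1 - p0/m)^n. These symbols and the helper names are assumptions made for illustration, and the closed form is a direct calculation under these assumptions rather than a restatement of the bound in Proposition 3.4.

```python
import math
import random

def expected_empty_slots(n, m, p0):
    """Expected number of slots that receive no check-in under the stated assumptions."""
    return m * (1.0 - p0 / m) ** n

def simulate_empty_slots(n, m, p0, trials, rng):
    """Monte Carlo estimate of the same quantity."""
    total = 0
    for _ in range(trials):
        occupied = set()
        for _ in range(n):
            if rng.random() < p0:
                occupied.add(rng.randrange(m))
        total += m - len(occupied)
    return total / trials

n, m, p0 = 1000, 100, 0.1
exact = expected_empty_slots(n, m, p0)
estimate = simulate_empty_slots(n, m, p0, trials=200, rng=random.Random(0))
print(round(exact, 2), round(estimate, 2))
```

Since (1 - x)^n is at most e^(-nx), the expected count is at most m * exp(-n * p0 / m) under these assumptions, so it decays quickly once n * p0 exceeds m.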
Utility for Convex ERMs
We now instantiate our amplification theorem (Theorem 3.2) in the context of differentially private empirical risk minimization (ERM). For convex ERMs, we will show that DP-SGD [34, 8, 1] in conjunction with our privacy amplification theorem (Theorem 3.2) is capable of achieving the optimal privacy/accuracy tradeoffs [8].
Theorem 3.5 (Utility guarantee).
Suppose in algorithm described in Theorem 3.2 the loss is Lipschitz and convex in its second parameter and the model space has dimension and diameter , i.e., . Furthermore, let be a distribution on , define the population risk , and let . If is a local randomizer that adds Gaussian noise with variance , and the learning rate for a model update at step is set to be , then the output of on a dataset containing i.i.d. samples from satisfies (Footnote 8: Here, hides a polylog factor in .)
Remark 3
4 Variations: Thrifty Updates, and Sliding Windows
This section presents two variants of the main protocol from the previous section. The first variant makes better use of the updates provided by each user at the expense of a small increase in the privacy cost. The second variant allows users to check in to a sliding window to model the case where different users might be available during different time windows.
4.1 Leveraging Updates from Multiple Users
Now, we present a variant of Algorithm 1 which, at the expense of a mild increase in the privacy cost, removes the need for dummy updates, and for discarding all but one of the clients checked in at every time step. The server-side protocol of this version is given in Algorithm 2 (the client-side protocol is identical to that of Algorithm 1). Note that here, if no client checked in at some step , the server simply skips the update. Furthermore, if at some step multiple clients checked in, the server requests gradients from all the clients, and performs a model update using the average of the submitted noisy gradients.
These changes have the obvious advantage of reducing the noise in the model coming from dummy updates, and increasing the algorithm’s data efficiency by utilizing gradients provided by all available clients. The corresponding privacy analysis becomes more challenging because (1) the adversary gains information about the time steps where no clients checked in, and (2) the server uses the potentially non-private count of clients checked in at time when performing the model update. Nonetheless, we show that the privacy guarantees of Algorithm 2 are similar to those of Algorithm 1 with an additional factor, and the restriction of non-collusion among the participating clients. For simplicity, we only analyze the case where each client has check-in probability .
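The server-side behaviour described above (skip empty slots, average all noisy gradients submitted in a slot) can be sketched as follows. The function name, the list-of-lists input format, and the plain SGD step are assumptions for illustration; the noisy gradients are taken as given, so the sketch does not cover the local randomizer or the privacy accounting.

```python
def averaged_updates(theta, grads_per_slot, learning_rate):
    """Apply one SGD step per non-empty slot, using the mean of that slot's noisy gradients."""
    for grads in grads_per_slot:
        if not grads:
            continue  # nobody checked in: the update is skipped, no dummy gradient is used
        count = len(grads)
        g_avg = [sum(g[i] for g in grads) / count for i in range(len(theta))]
        theta = [t - learning_rate * a for t, a in zip(theta, g_avg)]
    return theta

# Three slots: an empty one, one with two clients, and one with a single client.
final_theta = averaged_updates([0.0], [[], [[1.0], [3.0]], [[2.0]]], learning_rate=0.5)
print(final_theta)
```

The division by `count` is the step that uses the number of checked-in clients, which is the quantity the text identifies as potentially non-private.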
Theorem 4.1 (Amplification via random check-ins with averaged updates).
Suppose is an DP local randomizer. Let be the protocol from Algorithm 2 performing averaged model updates with check-in probability and check-in window for each user . Algorithm is DP with
where . In particular, for we get . Furthermore, if is DP with , then is DP with and .
Next, we provide a utility guarantee for in terms of the excess population risk for convex ERMs (similar to Theorem 3.5).
Theorem 4.2 (Utility guarantee).
Suppose in algorithm described in Theorem 4.1 the loss is Lipschitz and convex in its second parameter and the model space has dimension and diameter , i.e., . Furthermore, let be a distribution on , define the population risk , and let . If is a local randomizer that adds Gaussian noise with variance , and the learning rate for a model update at step is set to be , then the output of on a dataset containing i.i.d. samples from satisfies
Furthermore, if the loss is smooth in its second parameter and we set the step size , then we have
Comparison to Algorithm 1 in Section 3: Recall that in we can achieve a small fixed by taking and , in which case the excess risk bound in Theorem 3.5 becomes . On the other hand, in we can obtain a fixed small by taking . In this case the excess risks in Theorem 4.2 are bounded by in the convex case, and by in the convex and smooth case. Thus, we observe that all the bounds recover the optimal population risk tradeoffs from [8, 7] as , and for and nonsmooth loss provides a better tradeoff than , while on smooth losses and are incomparable. Note that (with ) will not attain a better bound on smooth losses because each update is based on a single datapoint. Setting will reduce the number of updates to for , whereas to get an excess risk bound for for smooth losses where more than one data point is sampled at each time step will require extending the privacy analysis to incorporate the change, which is beyond the scope of this paper.
4.2 Random Check-Ins with a Sliding Window
The second variant we consider removes the need for all clients to be available throughout the training period. Instead, we assume that the training period comprises time steps, and each client is only available during a window of time steps. Clients perform a random check-in to provide the server with an update during their window of availability. For simplicity, we assume clients wake up in order, one every time step, so client will perform a random check-in within the window . The server will perform updates starting at time to provide a warm-up period where the first clients perform their random check-ins.
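The sliding-window schedule can be simulated with the sketch below. Several specifics are assumptions made for illustration because they are not fixed by the text above: client j (1-indexed) is given the window {j, ..., j + window - 1}, the slot is drawn uniformly from that window, and the server is active from time `window` through time `num_clients`. The function and parameter names are hypothetical.

```python
import random

def sliding_window_schedule(num_clients, window, p, rng):
    """Map each server-active time step to the queried client id, or None if the slot is empty."""
    active = range(window, num_clients + 1)      # warm-up: the server starts at time `window`
    checked_in = {t: [] for t in active}
    for j in range(1, num_clients + 1):
        if rng.random() >= p:
            continue                             # the client abstains
        slot = rng.randrange(j, j + window)      # uniform over {j, ..., j + window - 1}
        if slot in checked_in:                   # check-ins outside the active range are lost
            checked_in[slot].append(j)
    return {t: (rng.choice(c) if c else None) for t, c in checked_in.items()}

schedule = sliding_window_schedule(num_clients=30, window=5, p=1.0, rng=random.Random(0))
print(schedule)
```

Clients whose chosen slot falls before the server starts or after it stops never contribute, which is one side of the accuracy and privacy tradeoff governed by the window length.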
Theorem 4.3 (Amplification via random check-ins with sliding windows).
Suppose is an DP local randomizer. Let be the distributed algorithm performing model updates with check-in probability and check-in window for each user . For any , algorithm is DP with . For and , we get . Furthermore, if is DP with , then is DP with and .
Remark 4
We can always increase privacy in the statement above by increasing . However, that also increases the number of clients who do not participate in training because their scheduled check-in time is before the process begins, or after it terminates. Moreover, the number of empty slots where the server introduces dummy updates will also increase, which we would want to minimize for good accuracy. Thus, introduces a tradeoff between accuracy and privacy.
Proposition 4.4 (Dummy updates in random check-ins with sliding windows).
For algorithm described in Theorem 4.3, the expected number of dummy gradient updates performed by the server is at most .
5 Improvements to Amplification via Shuffling
Here, we provide an improvement on privacy amplification by shuffling. This is obtained using two technical lemmas (deferred to the supplementary material) to tighten the analysis of amplification by swapping, a central component in the analysis of amplification by shuffling given in [19].
Theorem 5.1 (Amplification via Shuffling).
Let , , be a sequence of adaptive DP local randomizers. Let be the algorithm that given a dataset samples a uniform random permutation over , sequentially computes and outputs . For any , algorithm satisfies DP with . Furthermore, if , , is DP with , then satisfies DP with and .
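The shuffle-then-randomize procedure in the theorem statement can be sketched as follows. The sketch assumes that a single callable `randomizer(history, record)` stands in for the whole sequence of adaptive local randomizers, with the length of `history` indicating the position in the sequence; this interface and all names are assumptions made for illustration.

```python
import random

def shuffle_and_randomize(dataset, randomizer, rng):
    """Apply adaptive local randomizers to the records in a uniformly random order."""
    order = list(range(len(dataset)))
    rng.shuffle(order)                            # uniform random permutation of the indices
    outputs = []
    for idx in order:
        # Each randomizer sees all previous outputs and exactly one (permuted) record.
        outputs.append(randomizer(tuple(outputs), dataset[idx]))
    return outputs

def toy_randomizer(history, record):
    """Placeholder that adds no noise; a real local randomizer would go here."""
    return record

released = shuffle_and_randomize([5, 3, 8, 1], toy_randomizer, random.Random(0))
print(sorted(released))
```

With the placeholder randomizer the output is just a permutation of the input; the amplification result concerns what an observer can infer once each record is additionally passed through a DP local randomizer.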
For comparison, the guarantee in [19, Theorem 7] in the case results in
6 Conclusion
Our work highlights the fact that proving DP guarantees for distributed or decentralized systems can be substantially more challenging than for centralized systems, because in a distributed setting it becomes much harder to precisely control and characterize the randomness in the system, and this precise characterization and control of randomness is at the heart of DP guarantees. Specifically, production FL systems do not satisfy the assumptions that are typically made under state-of-the-art privacy accounting schemes, such as privacy amplification via subsampling. Without such accounting schemes, service providers cannot provide DP statements with small ’s. This work, though largely theoretical in nature, proposes a method shaped by the practical constraints of distributed systems that allows for rigorous privacy statements under realistic assumptions.
Nevertheless, there is more to do. Our theorems are sharpest in the high-privacy regime (small ’s), which may be too conservative to provide sufficient utility for some applications. While significantly relaxed from previous work, our assumptions will still not hold in all real-world systems. Thus, we hope this work encourages further collaboration between distributed systems and DP theory researchers in establishing protocols that address the full range of possible systems constraints as well as improving the full breadth of the privacy vs. utility Pareto frontier.
Acknowledgements
The authors would like to thank Vitaly Feldman for suggesting the idea of privacy accounting in DP-SGD via shuffling, and for help in identifying and fixing a mistake in the way a previous version of this paper handled DP local randomizers.
References
 [1] M. Abadi, A. Chu, I. J. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang. Deep learning with differential privacy. In Proc. of the 2016 ACM SIGSAC Conf. on Computer and Communications Security (CCS’16), pages 308–318, 2016.
 [2] Apple Differential Privacy Team. Learning with privacy at scale, 2017.
 [3] S. Augenstein, H. B. McMahan, D. Ramage, S. Ramaswamy, P. Kairouz, M. Chen, R. Mathews, et al. Generative models for effective ml on private, decentralized datasets. arXiv preprint arXiv:1911.06679, 2019.
 [4] V. Balcer and A. Cheu. Separating local & shuffled differential privacy via histograms. CoRR, abs/1911.06879, 2019.
 [5] B. Balle, G. Barthe, and M. Gaboardi. Privacy amplification by subsampling: Tight analyses via couplings and divergences. In S. Bengio, H. M. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3–8 December 2018, Montréal, Canada, pages 6280–6290, 2018.
 [6] B. Balle, J. Bell, A. Gascon, and K. Nissim. The privacy blanket of the shuffle model. In Advances in Cryptology—CRYPTO, 2019.
 [7] R. Bassily, V. Feldman, K. Talwar, and A. G. Thakurta. Private stochastic convex optimization with optimal rates. In H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8–14 December 2019, Vancouver, BC, Canada, pages 11279–11288, 2019.
 [8] R. Bassily, A. Smith, and A. Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In Proc. of the 2014 IEEE 55th Annual Symp. on Foundations of Computer Science (FOCS), pages 464–473, 2014.
 [9] A. Bittau, Ú. Erlingsson, P. Maniatis, I. Mironov, A. Raghunathan, D. Lie, M. Rudominer, U. Kode, J. Tinnés, and B. Seefeld. Prochlo: Strong privacy for analytics in the crowd. In Proceedings of the 26th Symposium on Operating Systems Principles, Shanghai, China, October 28–31, 2017, pages 441–459. ACM, 2017.
 [10] K. Bonawitz, H. Eichner, W. Grieskamp, D. Huba, A. Ingerman, V. Ivanov, C. Kiddon, J. Konecny, S. Mazzocchi, H. B. McMahan, et al. Towards federated learning at scale: System design. arXiv preprint arXiv:1902.01046, 2019.
 [11] S. Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends® in Machine Learning, 8(3–4):231–357, 2015.
 [12] A. Cheu, A. Smith, J. Ullman, D. Zeber, and M. Zhilyaev. Distributed differential privacy via mixnets. CoRR, abs/1808.01394, 2018.
 [13] B. Ding, J. Kulkarni, and S. Yekhanin. Collecting telemetry data privately. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4–9 December 2017, Long Beach, CA, USA, pages 3571–3580, 2017.
 [14] J. C. Duchi, M. I. Jordan, and M. J. Wainwright. Local privacy and statistical minimax rates. In 54th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2013, 26–29 October, 2013, Berkeley, CA, USA, pages 429–438. IEEE Computer Society, 2013.
 [15] C. Dwork, K. Kenthapadi, F. McSherry, I. Mironov, and M. Naor. Our data, ourselves: Privacy via distributed noise generation. In Advances in Cryptology—EUROCRYPT, pages 486–503, 2006.
 [16] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In Proc. of the Third Conf. on Theory of Cryptography (TCC), pages 265–284, 2006.
 [17] C. Dwork and A. Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4):211–407, 2014.
 [18] Ú. Erlingsson, V. Feldman, I. Mironov, A. Raghunathan, S. Song, K. Talwar, and A. Thakurta. Encode, shuffle, analyze privacy revisited: Formalizations and empirical evaluation. CoRR, abs/2001.03618, 2020.
 [19] Ú. Erlingsson, V. Feldman, I. Mironov, A. Raghunathan, K. Talwar, and A. Thakurta. Amplification by shuffling: From local to central differential privacy via anonymity. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2468–2479. SIAM, 2019.
 [20] Ú. Erlingsson, V. Pihur, and A. Korolova. RAPPOR: Randomized aggregatable privacy-preserving ordinal response. In Proc. of the 2014 ACM Conf. on Computer and Communications Security (CCS’14), pages 1054–1067. ACM, 2014.
 [21] V. Feldman, I. Mironov, K. Talwar, and A. Thakurta. Privacy amplification by iteration. In 59th Annual IEEE Symp. on Foundations of Computer Science (FOCS), pages 521–532, 2018.
 [22] B. Ghazi, R. Pagh, and A. Velingker. Scalable and differentially private distributed aggregation in the shuffled model. CoRR, abs/1906.08320, 2019.
 [23] P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings, R. G. L. D’Oliveira, S. E. Rouayheb, D. Evans, J. Gardner, Z. Garrett, A. Gascón, B. Ghazi, P. B. Gibbons, M. Gruteser, Z. Harchaoui, C. He, L. He, Z. Huo, B. Hutchinson, J. Hsu, M. Jaggi, T. Javidi, G. Joshi, M. Khodak, J. Konecný, A. Korolova, F. Koushanfar, S. Koyejo, T. Lepoint, Y. Liu, P. Mittal, M. Mohri, R. Nock, A. Özgür, R. Pagh, M. Raykova, H. Qi, D. Ramage, R. Raskar, D. Song, W. Song, S. U. Stich, Z. Sun, A. T. Suresh, F. Tramèr, P. Vepakomma, J. Wang, L. Xiong, Z. Xu, Q. Yang, F. X. Yu, H. Yu, and S. Zhao. Advances and open problems in federated learning. CoRR, abs/1912.04977, 2019.
 [24] P. Kairouz, S. Oh, and P. Viswanath. The composition theorem for differential privacy. IEEE Trans. Inf. Theory, 63(6):4037–4049, 2017.
 [25] S. P. Kasiviswanathan, H. K. Lee, K. Nissim, S. Raskhodnikova, and A. D. Smith. What can we learn privately? In 49th Annual IEEE Symp. on Foundations of Computer Science (FOCS), pages 531–540, 2008.
 [26] Y. Kuo, C. Chiu, D. Kifer, M. Hay, and A. Machanavajjhala. Differentially private hierarchical count-of-counts histograms. PVLDB, 11(11):1509–1521, 2018.

 [27] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, AISTATS 2017, 20–22 April 2017, Fort Lauderdale, FL, USA, pages 1273–1282, 2017.
 [28] H. B. McMahan, D. Ramage, K. Talwar, and L. Zhang. Learning differentially private language models without losing accuracy. CoRR, abs/1710.06963, 2017.
 [29] M. Mitzenmacher and E. Upfal. Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press, 2005.
 [30] V. Pichapati, A. T. Suresh, F. X. Yu, S. J. Reddi, and S. Kumar. AdaCliP: Adaptive clipping for private SGD. arXiv preprint arXiv:1908.07643, 2019.
 [31] R. Rogers, S. Subramaniam, S. Peng, D. Durfee, S. Lee, S. K. Kancha, S. Sahay, and P. Ahammad. LinkedIn’s Audience Engagements API: A privacy preserving data analytics system at scale, 2020.
 [32] O. Shamir and T. Zhang. Stochastic gradient descent for nonsmooth optimization: Convergence results and optimal averaging schemes. In International Conference on Machine Learning, pages 71–79, 2013.
 [33] A. Smith, A. Thakurta, and J. Upadhyay. Is interaction necessary for distributed private learning? In 2017 IEEE Symposium on Security and Privacy (SP), pages 58–77. IEEE, 2017.
 [34] S. Song, K. Chaudhuri, and A. D. Sarwate. Stochastic gradient descent with differentially private updates. In 2013 IEEE Global Conference on Signal and Information Processing, pages 245–248. IEEE, 2013.
 [35] O. Thakkar, G. Andrew, and H. B. McMahan. Differentially private learning with adaptive clipping. CoRR, abs/1905.03871, 2019.
 [36] J. S. Vitter. Random sampling with a reservoir. ACM Trans. Math. Softw., 11(1):37–57, Mar. 1985.

 [37] Y. Wang, B. Balle, and S. P. Kasiviswanathan. Subsampled Rényi differential privacy and analytical moments accountant. In The 22nd International Conference on Artificial Intelligence and Statistics, AISTATS 2019, 16–18 April 2019, Naha, Okinawa, Japan, pages 1226–1235, 2019.
 [38] Y.-X. Wang, S. E. Fienberg, and A. J. Smola. Privacy for free: Posterior sampling and stochastic gradient Monte Carlo. In Proceedings of the 32nd International Conference on Machine Learning - Volume 37, ICML’15, pages 2493–2502. JMLR.org, 2015.
 [39] X. Wu, F. Li, A. Kumar, K. Chaudhuri, S. Jha, and J. F. Naughton. Bolt-on differential privacy for scalable stochastic gradient descent-based analytics. In S. Salihoglu, W. Zhou, R. Chirkova, J. Yang, and D. Suciu, editors, Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD, 2017.
Appendix A Omitted Results and Proofs
Lemma A.1.
Let be an DP local randomizer. For , and , define to return with probability , and a sample from an arbitrary distribution over with probability . For any and any set of outcomes , we have
Proof.
Fix a set of outcomes . By LDP of , for any , we get
(1) 
Now, for dataset and , we have:
where the third equality follows as , and the first inequality follows using inequality 1, and the fourth equality follows as . ∎
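As an illustration of the local DP property that the proof invokes as inequality 1, the following sketch numerically checks it for one concrete local randomizer. It assumes k-ary randomized response as the randomizer, which is our choice for illustration and is not specified in the text. The names `randomized_response_probs`, `max_ldp_ratio`, `eps0`, and `num_values` are likewise hypothetical. The check enumerates every pair of inputs and every nonempty set of outcomes, and confirms that the ratio of output probabilities never exceeds e raised to the privacy parameter.

```python
import itertools
import math


def randomized_response_probs(true_value, num_values, eps0):
    """Output distribution of k-ary randomized response on `true_value`.

    Assumption for illustration: the randomizer reports the true value with
    probability e^eps0 / (e^eps0 + k - 1) and each other value with
    probability 1 / (e^eps0 + k - 1).
    """
    denom = math.exp(eps0) + num_values - 1
    keep = math.exp(eps0) / denom
    other = 1.0 / denom
    return [keep if v == true_value else other for v in range(num_values)]


def max_ldp_ratio(num_values, eps0):
    """Largest Pr[A(d) in S] / Pr[A(d') in S] over inputs d != d' and nonempty S."""
    worst = 0.0
    outcomes = range(num_values)
    for d, d_prime in itertools.permutations(outcomes, 2):
        p = randomized_response_probs(d, num_values, eps0)
        q = randomized_response_probs(d_prime, num_values, eps0)
        for size in range(1, num_values + 1):
            for subset in itertools.combinations(outcomes, size):
                ratio = sum(p[v] for v in subset) / sum(q[v] for v in subset)
                worst = max(worst, ratio)
    return worst


if __name__ == "__main__":
    eps0 = 1.0
    worst = max_ldp_ratio(num_values=4, eps0=eps0)
    # The local DP guarantee says this ratio is at most e^eps0.
    print(worst, math.exp(eps0), worst <= math.exp(eps0) + 1e-12)
```

For this particular randomizer the bound is attained on the singleton outcome set containing the true input, which is why the check uses a small numerical tolerance rather than a strict inequality.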
Lemma A.2.
Let be mechanisms of the form . Suppose there exist constants and such that each is DP with . Then, for any , the fold adaptive composition of is DP with .