Iterative training methods like stochastic gradient descent (SGD) have seen many recent advances, in large part because of their applicability to training neural networks. Deep learning has found a variety of applications, including image classification, language translation, and music generation [25, 12, 26, 4]. To perform such tasks effectively, neural networks need a large amount of training data, and these datasets often contain a lot of sensitive information. Moreover, many recent works [10, 27, 24, 5, 20] have shown that it is possible to extract sensitive information about the training data just from the parameters of a trained model. Thus, it becomes imperative to use learning techniques that provide a rigorous privacy guarantee for the training data.
Differential privacy (DP) [8, 9] has recently become the gold standard for bounding the privacy leakage of sensitive data when performing learning tasks. Intuitively, DP prevents an adversary from confidently making any conclusions about whether a sample was used in training a model, even while having access to arbitrary side information. To formally establish the notion of DP, we first define neighboring datasets. We will refer to a pair of datasets $D, D'$ as neighbors if $D'$ can be obtained from $D$ by adding or removing one element.
A (randomized) algorithm $\mathcal{M}$ with input domain $\mathcal{D}$ and output range $\mathcal{R}$ is $(\epsilon, \delta)$-differentially private if for all pairs of neighboring datasets $D, D' \in \mathcal{D}$, and every measurable $S \subseteq \mathcal{R}$, we have, with the probability taken over the coin flips of $\mathcal{M}$, that
$$\Pr[\mathcal{M}(D) \in S] \le e^{\epsilon} \cdot \Pr[\mathcal{M}(D') \in S] + \delta.$$
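The guarantee above is typically achieved by adding noise calibrated to the sensitivity of the released quantity. As a minimal sketch (function names ours), the classic Gaussian mechanism uses noise scale $\sigma = \sqrt{2 \ln(1.25/\delta)}\,\Delta_2/\epsilon$ (valid for $\epsilon < 1$), where $\Delta_2$ is the L2 sensitivity:

```python
import math
import random

def gaussian_sigma(l2_sensitivity, epsilon, delta):
    """Noise std for the classic (epsilon, delta) Gaussian mechanism (epsilon < 1)."""
    return math.sqrt(2.0 * math.log(1.25 / delta)) * l2_sensitivity / epsilon

def gaussian_mechanism(value, l2_sensitivity, epsilon, delta, rng):
    """Release a scalar with (epsilon, delta)-DP by adding Gaussian noise."""
    return value + rng.gauss(0.0, gaussian_sigma(l2_sensitivity, epsilon, delta))

rng = random.Random(0)
# Releasing a sum whose value changes by at most 1.0 when one element
# is added or removed (L2 sensitivity 1.0).
noisy_sum = gaussian_mechanism(42.0, 1.0, epsilon=0.5, delta=1e-5, rng=rng)
```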
Federated Learning [17] is a decentralized approach in which the training data remains distributed on user devices, and training proceeds by aggregating updates that have been computed locally. Federated Averaging [17] is a technique that combines local SGD on each user's data with a server that performs model averaging. Learning language models involves fairly complex networks such as long short-term memory (LSTM) recurrent neural networks (RNNs), and the training data can contain personalized sensitive information such as passwords and text conversations. Hence, we use learning language models for next-word prediction as a motivating task and running example.
In this paper, we will consider two settings of privacy: example-level and user-level. When we use Federated SGD, we preserve example-level DP (also in [7, 3, 2, 21, 28, 22, 13]), where each element is a single training example. On the other hand, when we use Federated Averaging, we preserve the stronger guarantee of user-level DP (also in [19]), where an element refers to the complete data held by a user.
While there has been a lot of work on designing DP techniques for learning, almost every technique has hyperparameters that must be set appropriately to obtain good utility. It is often unclear a priori how to set the values of hyperparameters introduced by the addition of privacy, for example, the clipping threshold for gradient updates in DP SGD. Moreover, learning techniques have their own hyperparameters which might need to be set differently when training is performed with privacy. For example, the learning rate in DP SGD might need to be set to a high value if the clipping threshold is very low, and vice-versa. Such tuning for large networks can have an exorbitant cost in computation and efficiency, which can be a bottleneck for real-world systems that process millions of samples to train a single network. Tuning also incurs an additional cost for privacy, which needs to be accounted for when providing a privacy guarantee for the released model with tuned hyperparameters.
1.1 Related Work
DP SGD has been the focus of many recent works [7, 3, 2, 28]. Privacy amplification via subsampling was introduced in [14]. The moments accountant, which tightly bounds the privacy loss of the Gaussian mechanism when used with amplification via subsampling, was introduced by Abadi et al. [2]. It was further extended in [18] to incorporate estimating heterogeneous sets of vectors from batches of subsamples. The technique of Federated Averaging was introduced in [17], and was subsequently used in [19] to effectively train differentially private recurrent language models. This work builds upon [19].
Several works have studied the problem of privacy-preserving hyperparameter tuning. An approach based on target accuracy was provided in [11], which was further improved in terms of privacy cost and computational efficiency in [16]. A method based on data splitting was provided in [7], whereas one based on satisfying certain stability conditions was introduced in [6]. We note that all of these prior works focused on the general problem of parameter search, whereas the focus of this work is to adaptively adjust the value of a parameter during an iterative procedure, eliminating the need for extensive tuning.
Bounding the influence of any sample in a learning process is both desirable and necessary. If left unbounded, any sample can potentially sway the learned system to overfit to its data, defeating the purpose of trying to learn actual trends in the population. One way to bound the contribution of an example in any phase of the learning process is to bound the total norm of its gradient update. Let the bound be denoted by $C$. This implies that if the norm of any example's update is greater than $C$, then it gets 'clipped' to have a norm of $C$ before being sent to the server. Such clipping also effectively bounds the sensitivity of the system with respect to the addition or removal of any example from the training set. As a result, adding appropriate noise post clipping is sufficient for achieving a differential privacy guarantee for the system [7, 3, 2].
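The clipping operation above can be sketched in a few lines (a minimal illustration; the function name is ours, not from a specific library):

```python
import numpy as np

def clip_update(update, clip_norm):
    """Project `update` onto the L2 ball of radius `clip_norm`: if its norm
    exceeds the bound C = clip_norm, scale it down to have norm exactly C;
    otherwise return it unchanged."""
    update = np.asarray(update, dtype=float)
    norm = np.linalg.norm(update)
    if norm > clip_norm:
        return update * (clip_norm / norm)
    return update
```

After clipping, adding or removing one example changes the sum of updates by at most $C$ in L2 norm, so Gaussian noise with standard deviation proportional to $C$ suffices for the DP guarantee.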
Setting an appropriate value for the clipping threshold can be crucial for the utility of a differentially private learned system: setting it too low can result in high information loss, whereas setting it too high can result in the addition of a lot of noise. Both cases can decrease the signal-to-noise ratio of the learning process, which can adversely affect the utility of the learned system. Such behavior can be observed in prior work [19], which shows the performance of a differentially private language model learned over various values of $C$.
Learning large models using the Federated Averaging/SGD algorithms [17, 19] can take thousands of rounds of interaction between the central server and the clients. The norms of the updates can vary as the rounds progress. As a result, even a carefully chosen constant clipping threshold can reduce the utility of the system over the course of training. Prior work [19] has shown that decreasing the value of the clipping threshold after training a language model for some initial number of rounds actually results in increased accuracy of the system. However, the behavior of the norms can be difficult to predict without prior knowledge about the system, and it can be inefficient to conduct experiments just to learn this behavior.
Since each layer of a learning system can provide a different functionality, it can be useful in some situations to clip the updates layer-wise (i.e., per-layer clipping [19]). However, as shown in Figure 1, the norms of the individual layers can be of different magnitudes, making it even more difficult to efficiently search the space of clipping parameters. As a result, there is a need for a system which learns these thresholds 'on-the-fly' to get high utility while ensuring privacy.
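As a toy illustration of why a single flat threshold treats layers unevenly, consider two hypothetical layers whose individual norms differ by an order of magnitude (shapes and values are made up):

```python
import numpy as np

# Two hypothetical "layers": a large matrix of small entries (e.g. an
# embedding) and a small vector of large entries (e.g. a bias).
layers = [np.full(10_000, 0.01), np.full(100, 1.0)]

# Per-layer clipping sees each norm separately; flat clipping sees only
# the norm of the concatenation, dominated by the larger layer.
per_layer_norms = [float(np.linalg.norm(layer)) for layer in layers]  # [1.0, 10.0]
flat_norm = float(np.linalg.norm(np.concatenate(layers)))             # sqrt(101) ~ 10.05
```

A flat threshold tuned to the concatenated norm leaves the small-norm layer essentially untouched while the large-norm layer absorbs nearly all of the clipping, which is why per-layer thresholds can matter.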
2 Differentially Private Adaptive Quantile Clipping
In this section, we describe an adaptive strategy for adjusting the clipping threshold according to the norms of the updates. First, we describe the adaptive quantile clipping strategy, which is designed for iterative differentially private mechanisms. Next, we describe layer-specific noise addition strategies that obtain higher utility than the basic strategy of adding noise of the same scale to each layer.
2.1 Loss Functions for Estimating Quantiles
Let $X$ be a random variable, and let $\gamma \in [0, 1]$ be a quantile to be matched. For any estimate $C$, define the loss
$$\ell_\gamma(C; x) = \begin{cases} (1-\gamma)(C - x) & \text{if } x \le C, \\ \gamma (x - C) & \text{otherwise.} \end{cases}$$
For $C^*$ such that $\Pr[X \le C^*] = \gamma$, we have $\mathbb{E}_X[\nabla_C \ell_\gamma(C^*; X)] = (1-\gamma)\Pr[X \le C^*] - \gamma \Pr[X > C^*] = 0$. Therefore, $C^*$ is at the $\gamma$th quantile of $X$. Because the loss is convex and has gradients bounded by 1, we can produce an online estimate of $C$ that converges to the $\gamma$th quantile of $X$ using online gradient descent (see, e.g., Shalev-Shwartz [23]). Since the loss is convex but not strongly convex, a learning rate proportional to $1/\sqrt{t}$ will produce a sublinear regret bound. See Figure 2.
Suppose at some round we have $m$ samples of $X$, with values $x_1, \ldots, x_m$. The average derivative of the loss for that round is
$$\bar{\nabla}\ell = \frac{1}{m}\sum_{i=1}^m \nabla_C \ell_\gamma(C; x_i) = \bar{b} - \gamma,$$
where $\bar{b} = \frac{1}{m}\left|\{i : x_i \le C\}\right|$ is the empirical fraction of samples with value at most $C$. For a given learning rate $\eta$, we can perform the linear update $C \leftarrow C - \eta(\bar{b} - \gamma)$.
Since $\bar{b}$ and $\gamma$ take values in the range $[0, 1]$, the linear update rule described above changes $C$ by at most $\eta$ at each step. This can be slow if $C$ is on the wrong order of magnitude. At the other extreme, if the optimal value of $C$ is orders of magnitude smaller than $\eta$, the update can be very coarse, and $C$ may often overshoot to become negative. To remedy such issues, we propose the following geometric update rule: $C \leftarrow C \cdot \exp(-\eta(\bar{b} - \gamma))$.
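The geometric rule above can be sketched as an online quantile tracker; the stream distribution, learning rate, and target quantile below are illustrative choices, not values from the text:

```python
import numpy as np

def update_threshold(C, samples, gamma, eta):
    """One geometric update C <- C * exp(-eta * (b_bar - gamma)), where
    b_bar is the fraction of this round's samples at or below C."""
    b_bar = np.mean(samples <= C)
    return C * np.exp(-eta * (b_bar - gamma))

rng = np.random.default_rng(0)
C = 1e-3  # deliberately start orders of magnitude below the true scale
for _ in range(500):
    samples = rng.lognormal(mean=2.0, sigma=0.5, size=100)
    C = update_threshold(C, samples, gamma=0.5, eta=0.2)
# C now tracks the median of the stream, exp(2) ~ 7.39
```

Note how the multiplicative step lets $C$ cross several orders of magnitude quickly (each step moves $\log C$ by at most $\eta$), which is exactly the behavior the linear rule lacks.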
2.2 Adaptive Quantile Clipping
Let $m$ be the number of layers, $W$ be the expected number of users sampled per iteration, and $\gamma$ denote the target fraction of users without clipped updates. As in [19], we consider two kinds of clipping: i) flat clipping, where we have an overall clipping parameter $C$ and we clip the concatenation of all the layers, and ii) per-layer clipping, where we are given a per-layer clipping parameter $C_j$ for each layer $j$ and each layer is clipped separately. If performing per-layer clipping, set $k = m$; otherwise, set $k = 1$. For each $j \in [k]$ and iteration $t$, let $C_j^t$ be the clipping threshold, and $\eta_j$ be the learning rate for learning $C_j$. We start with some initial value $C_j^0$. Let $W_t$ be the random variable denoting the number of users sampled in round $t$. Each user $i$ will send $k$ bits along with the usual update $\Delta_i^t$, where bit $b_{i,j}^t = \mathbb{I}\left[\|\Delta_{i,j}^t\| \le C_j^t\right]$ for $j \in [k]$, and $\Delta_{i,j}^t$ denotes the $j$th layer of user $i$'s update (the entire update if $k = 1$).
We define the loss for user $i$ for the update to the $j$th layer in the $t$th round as
$$\ell_\gamma\left(C_j^t; \|\Delta_{i,j}^t\|\right).$$
Then define $\bar{b}_j^t = \frac{1}{W_t}\sum_i b_{i,j}^t$, and $L_j^t = \frac{1}{W_t}\sum_i \ell_\gamma(C_j^t; \|\Delta_{i,j}^t\|)$ for the $j$th layer in the $t$th round. As each bit $b_{i,j}^t$ indicates whether user $i$'s update to layer $j$ was left unclipped, $\bar{b}_j^t$ is an unbiased estimate of the fraction of unclipped updates for the $j$th layer in the $t$th round. Thus, we have $\nabla_{C_j^t} L_j^t = \bar{b}_j^t - \gamma$. Observe that if $\bar{b}_j^t = \gamma$, then $\nabla_{C_j^t} L_j^t = 0$. Note that the server only requires $\bar{b}_j^t$ to compute the gradient $\nabla_{C_j^t} L_j^t$, and this quantity can be computed privately along with the average of the updates from the users. Moreover, the magnitude of the gradient depends on how far $\bar{b}_j^t$ is from the target unclipped fraction $\gamma$. We update the clipping threshold for the next round as $C_j^{t+1} = C_j^t - \eta_j(\bar{b}_j^t - \gamma)$ for a linear update, and $C_j^{t+1} = C_j^t \cdot \exp(-\eta_j(\bar{b}_j^t - \gamma))$ for a geometric update. We also define a parameter $\beta$ denoting the proportion of the per-iteration privacy budget that is used for the computation of the clipped counts described above; the rest of the budget is used for computing an average of the user updates. We provide pseudocode of the complete algorithm in Algorithm 1.
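Putting the pieces together, one round of adaptive flat clipping might look like the following sketch. The noise multipliers `z_delta` and `z_b` are placeholders rather than values calibrated to a particular privacy budget, and all names are ours:

```python
import numpy as np

def adaptive_clip_round(updates, C, gamma, eta, z_delta=1.0, z_b=1.0, seed=0):
    """One round: clip user updates to norm C, release a noisy average, and
    update C geometrically from a noisy count of unclipped updates."""
    rng = np.random.default_rng(seed)
    updates = np.asarray(updates, dtype=float)
    n, dim = updates.shape
    norms = np.linalg.norm(updates, axis=1)
    bits = (norms <= C).astype(float)        # b_i = 1 iff update i is unclipped
    scale = np.minimum(1.0, C / np.maximum(norms, 1e-12))
    clipped = updates * scale[:, None]
    # Sum of clipped updates has L2 sensitivity C -> Gaussian noise z_delta * C.
    avg = (clipped.sum(axis=0) + rng.normal(0.0, z_delta * C, size=dim)) / n
    # Sum of the bits has sensitivity 1 -> Gaussian noise z_b.
    b_bar = (bits.sum() + rng.normal(0.0, z_b)) / n
    # Geometric threshold update toward the gamma-quantile of the norms.
    new_C = C * np.exp(-eta * (b_bar - gamma))
    return avg, new_C
```

If the threshold is far too high, nearly every bit is 1, so $\bar{b} > \gamma$ and the threshold shrinks; far too low, nearly every bit is 0 and it grows.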
Such a strategy can also be useful for estimating the range of the norms of the updates: setting $\gamma$ very close to 0 (resp. 1) tracks the minimum (resp. maximum). This can provide a ballpark estimate of the magnitudes of the individual layers without any prior knowledge, which can then be used to set the initial clipping threshold appropriately.
- Abadi et al. [2016a] Martin Abadi, Andy Chu, Ian Goodfellow, Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In 23rd ACM Conference on Computer and Communications Security (ACM CCS), 2016a.
- Abadi et al. [2016b] Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, CCS ’16, pages 308–318, New York, NY, USA, 2016b. ACM. ISBN 978-1-4503-4139-4. doi: 10.1145/2976749.2978318.
- Bassily et al.  Raef Bassily, Adam Smith, and Abhradeep Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In Foundations of Computer Science (FOCS), 2014 IEEE 55th Annual Symposium on, pages 464–473. IEEE, 2014.
- Briot et al.  Jean-Pierre Briot, Gaëtan Hadjeres, and François-David Pachet. Deep Learning Techniques for Music Generation - A Survey. arXiv e-prints, art. arXiv:1709.01620, Sep 2017.
- Carlini et al.  Nicholas Carlini, Chang Liu, Jernej Kos, Úlfar Erlingsson, and Dawn Song. The secret sharer: Measuring unintended neural network memorization & extracting secrets. CoRR, abs/1802.08232, 2018. URL http://arxiv.org/abs/1802.08232.
- Chaudhuri and Vinterbo  Kamalika Chaudhuri and Staal Vinterbo. A stability-based validation procedure for differentially private machine learning. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS'13, pages 2652–2660, USA, 2013. Curran Associates Inc. URL http://dl.acm.org/citation.cfm?id=2999792.2999908.
- Chaudhuri et al.  Kamalika Chaudhuri, Claire Monteleoni, and Anand D Sarwate. Differentially private empirical risk minimization. Journal of Machine Learning Research, 12(Mar):1069–1109, 2011.
- Dwork et al. [2006a] Cynthia Dwork, Krishnaram Kenthapadi, Frank McSherry, Ilya Mironov, and Moni Naor. Our data, ourselves: Privacy via distributed noise generation. In EUROCRYPT, 2006a.
- Dwork et al. [2006b] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pages 265–284. Springer, 2006b.
- Fredrikson et al.  Matt Fredrikson, Somesh Jha, and Thomas Ristenpart. Model inversion attacks that exploit confidence information and basic countermeasures. In Proceedings of the 22Nd ACM SIGSAC Conference on Computer and Communications Security, CCS ’15, pages 1322–1333, New York, NY, USA, 2015. ACM. ISBN 978-1-4503-3832-5. doi: 10.1145/2810103.2813677.
- Gupta et al.  Anupam Gupta, Katrina Ligett, Frank McSherry, Aaron Roth, and Kunal Talwar. Differentially private combinatorial optimization. In Proceedings of the Twenty-first Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '10, pages 1106–1125, Philadelphia, PA, USA, 2010. Society for Industrial and Applied Mathematics. ISBN 978-0-898716-98-6. URL http://dl.acm.org/citation.cfm?id=1873601.1873691.
- He et al.  K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 1026–1034, Dec 2015. doi: 10.1109/ICCV.2015.123.
- Iyengar et al.  Roger Iyengar, Joseph P. Near, Dawn Song, Om Thakkar, Abhradeep Thakurta, and Lun Wang. Towards practical differentially private convex optimization. In S&P 2019, 2019.
- Kasiviswanathan et al.  Shiva Prasad Kasiviswanathan, Homin K Lee, Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. What can we learn privately? SIAM Journal on Computing, 40(3):793–826, 2011.
- Koenker and Bassett Jr  Roger Koenker and Gilbert Bassett Jr. Regression quantiles. Econometrica: journal of the Econometric Society, pages 33–50, 1978.
- Liu and Talwar  Jingcheng Liu and Kunal Talwar. Private selection from private candidates. CoRR, abs/1811.07971, 2018. URL http://arxiv.org/abs/1811.07971.
- McMahan et al.  Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, AISTATS 2017, 20-22 April 2017, Fort Lauderdale, FL, USA, pages 1273–1282, 2017. URL http://proceedings.mlr.press/v54/mcmahan17a.html.
- McMahan et al.  H. Brendan McMahan, Galen Andrew, Ulfar Erlingsson, Steve Chien, Ilya Mironov, Nicolas Papernot, and Peter Kairouz. A General Approach to Adding Differential Privacy to Iterative Training Procedures. arXiv e-prints, art. arXiv:1812.06210, Dec 2018.
- McMahan et al.  H. Brendan McMahan, Daniel Ramage, Kunal Talwar, and Li Zhang. Learning differentially private recurrent language models. In ICLR 2018, 2018.
- Melis et al.  Luca Melis, Congzheng Song, Emiliano De Cristofaro, and Vitaly Shmatikov. Exploiting Unintended Feature Leakage in Collaborative Learning. arXiv e-prints, art. arXiv:1805.04049, May 2018.
- Papernot et al.  Nicolas Papernot, Martın Abadi, Úlfar Erlingsson, Ian Goodfellow, and Kunal Talwar. Semi-supervised knowledge transfer for deep learning from private training data. stat, 1050, 2017.
- Papernot et al.  Nicolas Papernot, Shuang Song, Ilya Mironov, Ananth Raghunathan, Kunal Talwar, and Úlfar Erlingsson. Scalable private learning with pate. arXiv preprint arXiv:1802.08908, 2018.
- Shalev-Shwartz  Shai Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2012.
- Shokri et al.  R. Shokri, M. Stronati, C. Song, and V. Shmatikov. Membership inference attacks against machine learning models. In 2017 IEEE Symposium on Security and Privacy (SP), pages 3–18, May 2017. doi: 10.1109/SP.2017.41.
- Szegedy et al.  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–9, June 2015. doi: 10.1109/CVPR.2015.7298594.
- Vinyals et al.  Oriol Vinyals, Lukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. Grammar as a foreign language. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, NIPS’15, pages 2773–2781, Cambridge, MA, USA, 2015. MIT Press. URL http://dl.acm.org/citation.cfm?id=2969442.2969550.
- Wu et al.  X. Wu, M. Fredrikson, S. Jha, and J. F. Naughton. A methodology for formalizing model-inversion attacks. In 2016 IEEE 29th Computer Security Foundations Symposium (CSF), pages 355–370, June 2016. doi: 10.1109/CSF.2016.32.
- Wu et al.  Xi Wu, Fengan Li, Arun Kumar, Kamalika Chaudhuri, Somesh Jha, and Jeffrey Naughton. Bolt-on differential privacy for scalable stochastic gradient descent-based analytics. In Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD ’17, pages 1307–1322, New York, NY, USA, 2017. ACM. ISBN 978-1-4503-4197-4. doi: 10.1145/3035918.3064047.