1 Introduction
Training machine learning models on user data without violating privacy is a challenging problem. One common solution is differential privacy (DP) [4], a mathematical framework that bounds the contribution of individual entries to some statistic over a database. We identify three practical problems with current approaches to differentially private machine learning:

Privacy guarantees are difficult to interpret. Commonly, training is constrained by the parameters $\epsilon$ and $\delta$. These values are difficult to translate into practical guarantees.

In non-private training, multiple models are often trained to find the optimal hyperparameters. In private training, privacy spending accumulates for every trained model [3, 7]. Additional hyperparameters impact our privacy budget, and should be eliminated where possible.

Most existing work focuses either on preventing private information extraction while reaching acceptable performance, or on optimizing privacy guarantees while reaching performance similar to non-private learning. Neither approach achieves what is desirable in practice: first determine how much privacy is required for a given task, then, within this privacy budget, optimize the parameters and hyperparameters of the model. We offer three methods towards making such a workflow practical.
Preliminaries
A common way to apply DP to deep learning is through the DP-SGD algorithm [1]. This algorithm has two important parameters: a noise scale $\sigma$ and a clipping bound $C$. During gradient descent, any gradient whose norm exceeds $C$ is scaled such that its norm is $C$, and Gaussian noise with a variance of $\sigma^2 C^2$ is added to the gradient. DP-SGD can be combined with the moments accountant [1], which tracks and controls privacy spending such that it stays within the privacy budget defined by the parameters $\epsilon$ and $\delta$ of the DP framework.
Privacy-sensitive data often contains multiple records per user, with user identity the sensitive attribute. In such settings, training is often federated over the users: each computes a gradient update over their data, applies noise, and the gradients over all data are aggregated centrally. The DP-FedAvg framework [8] is a popular extension of DP-SGD to the federated setting.
2 Methods and experiments
We perform three experiments, each intended to improve a different part of the DP learning workflow. Section 3 discusses how these are combined into a practical approach to private deep learning.
2.1 Memorization under differential privacy
In Zhang et al. [11], it was shown that deep neural networks can easily memorize training labels in image classification, even on randomly labeled data. Under differential privacy, such memorization should not be possible. (The idea that “differential privacy implies generalization” is considered folklore [10].) This allows us to calibrate our privacy parameters: if the model is able to learn a randomly labeled task, the privacy parameters are insufficiently strict.
We train the small AlexNet architecture from Zhang et al. [11] for 60 epochs on a dataset of 50,000 random noise examples. We train two models: one with the DP-SGD algorithm, extended with momentum, and one with non-private SGD with momentum. In the first experiment, the models are trained on random noise and the training accuracy is reported. In the second experiment, the same models are trained on the CIFAR-10 task. Here, the test accuracy and the difference between train and test accuracy are reported for both models. The differentially private models use for each layer and a batch size of 128, resulting in for after 60 epochs. Layers are clipped independently using a clipping bound . The results are reported in Figure 1.
Even with a large privacy spending, DP effectively prevents memorization of random noise while remaining capable of learning on real data, with a small reduction in performance.
Memorization in user-level differential privacy
As noted in the preliminaries, privacy-sensitive data often contains many records per user. In this experiment, we test memorization in a user-level differential privacy setting, using the DP-FedAvg framework.
We generate two 10-class datasets of 1,000 users with 10 records each, resulting in 10,000 records. All records contain 28×28 random noise images. All records of one user are assigned the same (random) label. For of the 10,000 records, a pattern is inserted: in the 14×14 upper left patch, all pixels are set to 1.0 and the label is set to 1. In the first dataset, the centralized one, the pattern is inserted only into the records of one user. In the second, the distributed one, it is inserted uniformly over all records. Figure 2 shows the results. For more details, see [9]. As expected, in the centralized setting, we are able to learn the pattern without differential privacy but not with differential privacy.
Distributed patterns, in contrast, should be learned. Using the proportions of the centralized setting, DP is too constrictive, but when we increase the occurrence of the pattern, we see that learning is possible.
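The construction of the two datasets can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: the function name `make_user_dataset`, the uniform noise distribution, the seed, and the default of 10 patterned records (the number of records a single user holds) are assumptions made for the example.

```python
import numpy as np


def make_user_dataset(mode, n_users=1000, records_per_user=10,
                      n_classes=10, n_pattern=10, seed=0):
    """Random-noise dataset with a planted pattern (illustrative sketch).

    mode="centralized": the pattern goes into records of a single user.
    mode="distributed": the pattern is spread uniformly over all records.
    n_pattern is an assumed value, not taken from the text.
    """
    rng = np.random.default_rng(seed)
    n_records = n_users * records_per_user

    # 28x28 noise images; the uniform distribution is an assumption.
    images = rng.random((n_records, 28, 28))
    user_ids = np.repeat(np.arange(n_users), records_per_user)

    # All records of one user share the same random label.
    user_labels = rng.integers(0, n_classes, size=n_users)
    labels = user_labels[user_ids].copy()

    if mode == "centralized":
        if n_pattern > records_per_user:
            raise ValueError("one user cannot hold that many records")
        user = rng.integers(n_users)
        idx = np.flatnonzero(user_ids == user)[:n_pattern]
    elif mode == "distributed":
        idx = rng.choice(n_records, size=n_pattern, replace=False)
    else:
        raise ValueError(f"unknown mode: {mode!r}")

    # Planted pattern: upper-left 14x14 patch set to 1.0, label set to 1.
    images[idx, :14, :14] = 1.0
    labels[idx] = 1
    return images, labels, user_ids, idx
```

Returning `user_ids` alongside the data keeps the user grouping available, which a user-level method such as DP-FedAvg needs in order to aggregate updates per user rather than per record.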
2.2 Adaptive clipping
One of the main challenges for training with DP-SGD is choosing a good clipping parameter $C$. To illustrate the difficulty, Figure 4 shows the norm of the gradient per layer over the course of training for a non-private model. Two observations can be made. First, the norms of layers may be very different at the beginning of training compared to the end of training. Second, the size of the gradient may differ between layers, and between weights and biases.
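To make the role of the clipping parameter concrete, the following is a minimal NumPy sketch of a single DP-SGD gradient computation in the style of Abadi et al. [1]: per-example gradients are clipped, summed, perturbed with Gaussian noise, and averaged. The function and argument names are hypothetical, and the sketch treats the gradient as one flat vector; with per-layer clipping it would be applied to each layer separately.

```python
import numpy as np


def dp_sgd_gradient(per_example_grads, clip_bound, noise_scale, rng):
    """Return a noisy average gradient for one batch (illustrative sketch).

    per_example_grads: array of shape (batch_size, dim)
    clip_bound:        maximum L2 norm allowed per example
    noise_scale:       noise multiplier; the noise standard deviation is
                       noise_scale * clip_bound
    """
    grads = np.asarray(per_example_grads, dtype=float)
    norms = np.linalg.norm(grads, axis=1, keepdims=True)

    # Scale down only those gradients whose norm exceeds the bound.
    scale = np.minimum(1.0, clip_bound / np.maximum(norms, 1e-12))
    clipped = grads * scale

    # Gaussian noise is added to the sum, then the result is averaged.
    noise = rng.normal(0.0, noise_scale * clip_bound, size=grads.shape[1])
    return (clipped.sum(axis=0) + noise) / grads.shape[0]


rng = np.random.default_rng(0)
batch = [[3.0, 4.0], [0.3, 0.4]]  # norms 5.0 and 0.5
print(dp_sgd_gradient(batch, clip_bound=1.0, noise_scale=1.0, rng=rng))
```

A bound that is too small shrinks most gradients and slows learning, while a bound that is too large leaves the gradients intact but inflates the noise, since the noise standard deviation is proportional to the bound.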
Combining these two observations, we propose a gradient-aware clipping scheme. This adaptive clipping schedule uses the differentially private mean norm of the previous batch times a constant factor as the norm bound for the current batch . We define the per-layer clipping bound for round and layer over the individual gradients from that layer of the previous round and privacy parameters and :
We train two versions of the small AlexNet model with the DP-SGD algorithm, this time without momentum. The first uses a constant clipping bound, optimized by a grid search [9]. The second uses the adaptive clipping scheme with . For the adaptive model, we increase the gradient noise scale to and use an norm noise scale to use the same privacy budget as the non-adaptive model. The test accuracy is reported for both models in Figure 4. With adaptive clipping, the accuracy climbs from 61.6% to 63.5%.
We replaced parameter with , and . However, the model is very robust to changes in and . Doubling the values of either or yields very similar test accuracy. Because our approach is adaptive, we expect that a single parameter value for leads to good performance across tasks.
To examine this hypothesis, we carried out additional experiments on MNIST, CIFAR-10 and CIFAR-100 (reported in [9]). An value of is near the optimum for all datasets. This suggests that adaptive clipping results in parameters that are easier to choose without seeing the data. A broader investigation across datasets is required to test this hypothesis further.
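The idea of deriving the clipping bound from a differentially private mean norm can be illustrated with the sketch below. It is one possible instantiation and not necessarily the authors' exact definition: the names `factor` and `norm_noise_scale`, the choice to cap each norm at the previous bound in order to limit the sensitivity of the sum, and the small positive floor are all assumptions made for the example.

```python
import numpy as np


def adaptive_clip_bound(prev_grads, prev_bound, factor, norm_noise_scale, rng):
    """Clipping bound for the current round of one layer (illustrative sketch).

    prev_grads:       per-example gradients of this layer from the previous
                      round, shape (batch_size, dim)
    prev_bound:       clipping bound used in the previous round
    factor:           constant multiplier applied to the private mean norm
    norm_noise_scale: noise multiplier for the norm estimate
    """
    grads = np.asarray(prev_grads, dtype=float)
    norms = np.linalg.norm(grads, axis=1)

    # Assumption: cap each norm at the previous bound, so that a single
    # example changes the sum of norms by at most prev_bound.
    capped = np.minimum(norms, prev_bound)

    # Gaussian noise calibrated to that sensitivity.
    noisy_sum = capped.sum() + rng.normal(0.0, norm_noise_scale * prev_bound)

    # Assumption: keep the estimate strictly positive.
    private_mean = max(noisy_sum / len(norms), 1e-12)
    return factor * private_mean


rng = np.random.default_rng(0)
grads = rng.normal(size=(128, 50))
print(adaptive_clip_bound(grads, prev_bound=10.0, factor=1.0,
                          norm_noise_scale=1.0, rng=rng))
```

Because the norm estimate is itself a noisy query on the data, it consumes part of the privacy budget, which is why the gradient noise has to be increased to stay within the same overall budget as a fixed-bound model.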
2.3 Large batch training
Training with larger batches reduces privacy spending. To illustrate, Figure 2.3 shows the noise added per example as a function of batch size (using the moments accountant from [1] to find the minimum that fully uses a predefined privacy budget in 10 epochs for a training set of 60,000 examples). Large batches dramatically reduce added noise. However, large batches can strongly hurt performance. Goyal et al. [5] introduce several methods to improve the performance of large-batch training. We adopt the simplest: scaling the learning rate along with the batch size.
We train small AlexNet models with varying batch sizes on CIFAR-10, using the DP-SGD algorithm for 60 epochs with a budget of . A base learning rate of 0.01 is used for a batch size of 128. We increase both the batch size and the learning rate by a factor of , and repeat the experiment. We train each model for 60 epochs, until an accumulated privacy loss of after 60 epochs is found. Table 1 shows the results: training DP models with larger batches can be beneficial, but only when the learning rate is scaled accordingly.
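Scaling the learning rate by the same factor as the batch size can be written as a one-line rule. The sketch below uses the base values stated above (learning rate 0.01 at batch size 128); the function name and the list of batch sizes are hypothetical and only serve as an illustration.

```python
def scaled_learning_rate(batch_size, base_lr=0.01, base_batch_size=128):
    """Linear scaling: the learning rate grows by the same factor as the batch."""
    return base_lr * batch_size / base_batch_size


# Illustrative batch sizes, not the ones used in the experiments.
for batch_size in (128, 256, 512, 1024):
    print(batch_size, scaled_learning_rate(batch_size))
```

Keeping the ratio of learning rate to batch size fixed means that a larger batch changes the noise level without also changing the effective step size per example.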
3 Conclusion
For differentially private learning, hyperparameter optimization on sensitive datasets is undesirable. The proposed methods enable an approach to differentially private learning with reduced privacy spending on hyperparameter tuning before training the final model. Given a classification task and data dimensions, we suggest the following approach to choosing the differential privacy parameters:

Choose a model that has been successfully tested on similar, non-private benchmark tasks, and use default hyperparameters.

Choose the largest batch size that fits in memory on the training device(s) and scale the learning rate accordingly.

When all sanity checks have passed, train on private data with the same budget. Use a small portion of the budget for computing the differentially private mean norm and use the adaptive clipping method of Section 2.2 with default values , .
Combined, these steps provide the structure of a basic differential privacy training workflow. This is far from a foolproof approach: the sanity checks function more as unit tests than hard guarantees; some architectures, like conditional models [2], are not yet supported; and it is not clear whether this approach is sufficient when adversaries actively attempt to influence the training process, such as in the privacy attacks proposed by Hitaj et al. [6]. We hope that our approach provides a basis that can be extended to study such questions.
Acknowledgements
We thank the reviewers for their valuable comments. This publication was supported by the Amsterdam Academic Alliance Data Science (AAADS) Program Award to the UvA and VU Universities.
References
 [1] M. Abadi, A. Chu, I. Goodfellow, H. Brendan McMahan, I. Mironov, K. Talwar, and L. Zhang. Deep Learning with Differential Privacy. ArXiv e-prints, July 2016.
 [2] Nicholas Carlini, Chang Liu, Jernej Kos, Úlfar Erlingsson, and Dawn Song. The secret sharer: Measuring unintended neural network memorization & extracting secrets. CoRR, abs/1802.08232, 2018.
 [3] Kamalika Chaudhuri and Staal A Vinterbo. A stability-based validation procedure for differentially private machine learning. In Advances in Neural Information Processing Systems, pages 2652–2660, 2013.
 [4] Cynthia Dwork. Differential privacy. In 33rd International Colloquium on Automata, Languages and Programming, part II (ICALP 2006), volume 4052, pages 1–12, Venice, Italy, July 2006. Springer Verlag.
 [5] Priya Goyal, Piotr Dollár, Ross B. Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: training ImageNet in 1 hour. CoRR, abs/1706.02677, 2017.
 [6] Briland Hitaj, Giuseppe Ateniese, and Fernando Pérez-Cruz. Deep models under the GAN: information leakage from collaborative deep learning. CoRR, abs/1702.07464, 2017.
 [7] Jing Lei, Anne-Sophie Charest, Aleksandra Slavkovic, Adam Smith, and Stephen Fienberg. Differentially private model selection with penalized and constrained likelihood. Journal of the Royal Statistical Society: Series A (Statistics in Society), 181(3):609–633, 2018.
 [8] H. Brendan McMahan, Daniel Ramage, Kunal Talwar, and Li Zhang. Learning differentially private language models without losing accuracy. CoRR, abs/1710.06963, 2017.
 [9] Koen Lennart van der Veen. A Practical Approach to Differential Private Learning. Master’s thesis, University of Amsterdam, Amsterdam, The Netherlands, 2018.
 [10] Yu-Xiang Wang, Jing Lei, and Stephen E Fienberg. On-average KL-privacy and its equivalence to generalization for max-entropy mechanisms. In International Conference on Privacy in Statistical Databases, pages 121–134. Springer, 2016.
 [11] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. CoRR, abs/1611.03530, 2016.