Three Tools for Practical Differential Privacy

12/07/2018 ∙ by Koen Lennart van der Veen, et al. ∙ University of Amsterdam

Differentially private learning on real-world data poses challenges for standard machine learning practice: privacy guarantees are difficult to interpret, hyperparameter tuning on private data reduces the privacy budget, and ad-hoc privacy attacks are often required to test model privacy. We introduce three tools to make differentially private machine learning more practical: (1) simple sanity checks which can be carried out in a centralized manner before training, (2) an adaptive clipping bound to reduce the effective number of tuneable privacy parameters, and (3) large-batch training, which we show improves model performance.


1 Introduction

Training machine learning models on user data without violating privacy is a challenging problem. One common solution is differential privacy differential-privacy (DP), a mathematical framework that bounds the contribution of individual entries to some statistic over a database. We identify three practical problems with current approaches to differentially private machine learning:

  1. Privacy guarantees are difficult to interpret. Commonly, training is constrained by the parameters ε and δ. These values are difficult to translate into practical guarantees.

  2. In non-private training, multiple models are often trained to find the optimal hyperparameters. In private training, privacy spending accumulates for every trained model chaudhuri2013stability ; lei2018differentially . Each additional hyperparameter therefore costs privacy budget and should be eliminated where possible.

  3. Earlier research treats hyperparameters and privacy parameters as independent recdp ; userdp . However, this is not always the case. For instance, the batch size influences privacy guarantees, but to preserve model performance, the learning rate must be changed accordingly.
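For reference, the two privacy parameters mentioned in the first item are conventionally written ε and δ. The standard definition of (ε, δ)-differential privacy is given below; the symbols M (the randomized training mechanism), D and D' (two datasets differing in a single entry) and S (a set of possible outputs) are notation introduced here for illustration.

```latex
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[\mathcal{M}(D') \in S] + \delta
\qquad \text{for all adjacent } D, D' \text{ and all measurable } S .
```

Smaller values of ε and δ give a stronger guarantee. Part of the interpretation problem is that the bound is multiplicative in e^ε, so moderate changes in ε correspond to large changes in the guarantee.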

Most existing work focuses either on preventing private information extraction while reaching acceptable performance, or on optimizing privacy guarantees while reaching performance similar to non-private learning. Neither approach achieves what is desirable in practice: first determining how much privacy is required for a given task and then, within this privacy budget, optimizing the parameters and hyperparameters of the model. We offer three methods towards making such a workflow practical.

Preliminaries

A common way to apply DP to deep learning is through the DPSGD algorithm recdp . This algorithm has two important parameters: a noise scale σ and a clipping bound C. During gradient descent, any gradient whose L2-norm exceeds C is scaled such that its norm is C, and Gaussian noise with a variance of σ²C² is added to the gradient. DPSGD can be combined with the moments accountant recdp , which tracks and controls privacy spending such that it stays within the privacy budget defined by the parameters ε and δ of the DP framework.

Privacy-sensitive data often contains multiple records per user, with user identity as the sensitive attribute. In such settings, training is often federated over the users: each user computes a gradient update over their own data and applies noise, and the resulting gradients over all data are aggregated centrally. The DP-FedAvg framework userdp is a popular extension of DPSGD to the federated setting.
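The clip-then-add-noise step of DPSGD can be sketched as follows. This is a minimal illustration using only the Python standard library, not the implementation used in the experiments. It assumes the common formulation in which noise with standard deviation equal to the noise scale times the clipping bound is added to the sum of clipped per-example gradients before averaging. The function and argument names (`clip_gradient`, `dpsgd_noisy_gradient`, `clip_bound`, `noise_scale`) are hypothetical.

```python
import math
import random


def clip_gradient(grad, clip_bound):
    """Scale grad so that its L2 norm is at most clip_bound."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= clip_bound:
        return list(grad)
    return [g * clip_bound / norm for g in grad]


def dpsgd_noisy_gradient(per_example_grads, clip_bound, noise_scale, rng):
    """Clip every per-example gradient, sum, add Gaussian noise, then average.

    Assumption: the noise standard deviation is noise_scale * clip_bound.
    """
    batch_size = len(per_example_grads)
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for grad in per_example_grads:
        for i, g in enumerate(clip_gradient(grad, clip_bound)):
            total[i] += g
    std = noise_scale * clip_bound
    # One independent noise draw per coordinate of the summed gradient.
    return [(t + rng.gauss(0.0, std)) / batch_size for t in total]


if __name__ == "__main__":
    rng = random.Random(0)
    grads = [[3.0, 4.0], [0.3, 0.4]]
    print(dpsgd_noisy_gradient(grads, clip_bound=1.0, noise_scale=1.1, rng=rng))
```

The sketch computes only the noisy gradient. Tracking the cumulative privacy loss over many such steps is the job of the moments accountant and is not shown here.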

2 Methods and experiments

We perform three experiments, each intended to improve a different part of the DP learning workflow. Section 3 discusses how these are combined into a practical approach to private deep learning.

2.1 Memorization under differential privacy

Zhang et al. understandingrequires showed that deep neural networks can easily memorize training labels in image classification, even on randomly labeled data. Under differential privacy, such memorization should not be possible. (The idea that “differential privacy implies generalization” is considered folklore wang2016average .) This allows us to calibrate our privacy parameters: if the model is able to learn a randomly labeled task, the privacy parameters are insufficiently strict.

We train the small Alexnet architecture from Zhang et al. understandingrequires for 60 epochs on a dataset of 50,000 random noise examples. We train two models: one with the DPSGD algorithm extended with momentum, and one with non-private SGD with momentum. In the first experiment, the models are trained on random noise and the training accuracy is reported. In the second experiment, the same models are trained on the CIFAR-10 task; here, the test accuracy and the difference between train and test accuracy are reported for both models. The differentially private models use the same noise scale σ for each layer and a batch size of 128, which determines the privacy spending ε for a fixed δ after 60 epochs. Layers are clipped independently using a clipping bound C. The results are reported in Figure 1.

a) Train accuracy
b) Test accuracy
c) Generalization gap
Figure 1: Memorization tests on random noise (a) and on CIFAR-10 (b, c)

Even with a large privacy spending, DP effectively prevents memorization of random noise, while being capable of learning on real data, with a small reduction in performance.
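The random-noise sanity check can be framed as a small test harness. This is a sketch under stated assumptions, not the experimental code. `random_noise_dataset` and `passes_memorization_check` are hypothetical helper names, and `margin` is an illustrative tolerance above chance level that the source does not specify.

```python
import random


def random_noise_dataset(num_examples, num_features, num_classes, seed=0):
    """Inputs drawn uniformly from [0, 1) with labels assigned at random."""
    rng = random.Random(seed)
    inputs = [[rng.random() for _ in range(num_features)]
              for _ in range(num_examples)]
    labels = [rng.randrange(num_classes) for _ in range(num_examples)]
    return inputs, labels


def passes_memorization_check(train_accuracy, num_classes, margin=0.05):
    """Return True if training accuracy on random labels stays near chance.

    A private model that clearly beats chance on this data has memorized
    labels, which suggests the privacy parameters are not strict enough.
    """
    chance = 1.0 / num_classes
    return train_accuracy <= chance + margin


if __name__ == "__main__":
    inputs, labels = random_noise_dataset(100, 16, num_classes=10)
    # The accuracy would come from training the private model on this data.
    print(passes_memorization_check(train_accuracy=0.11, num_classes=10))
```

Passing such a check is a necessary condition only: it behaves like a unit test for the privacy parameters, not like a proof of privacy.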

Memorization in user level differential privacy

As noted in the preliminaries, privacy-sensitive data often contains many records per user. In this experiment, we test memorization in a user-level differential privacy setting, using the DP-FedAvg framework.

We generate two 10-class datasets of 1,000 users with 10 records each, resulting in 10,000 records per dataset. All records contain 28×28 random noise images, and all records of one user are assigned the same (random) label. For a subset of the 10,000 records, a pattern is inserted: all pixels in the 14×14 upper-left patch are set to 1.0 and the label is set to 1. In the first dataset, the centralized dataset, the pattern is inserted only into the records of one user; in the second, the distributed dataset, it is inserted uniformly over all records. Figure 2 shows the results. For more details, see apracapproach . As expected, in the centralized setting, we are able to learn the pattern without differential privacy but not with differential privacy.

MLP on centralized dataset
MLP on distributed dataset
Figure 2: Training MLPs with centralized or distributed patterns

Distributed patterns should be learned. Using the proportions of the centralized setting, DP is too constrictive, but when we increase the occurrence of the pattern, we see that learning is possible.
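The two synthetic datasets can be generated along the following lines. This is an illustrative sketch in plain Python rather than the authors' code. `make_user_dataset` and its arguments are hypothetical names, and the default `num_pattern_records=10` (one user's worth of records) is an assumption made for the example.

```python
import random


def make_user_dataset(num_users=1000, records_per_user=10, size=28, patch=14,
                      num_classes=10, num_pattern_records=10,
                      centralized=True, seed=0):
    """Return a list of [user_id, image, label] records of random noise.

    A square patch of ones is written into the upper-left corner of
    num_pattern_records images, and their label is set to 1.
    """
    if centralized and num_pattern_records > records_per_user:
        raise ValueError("centralized pattern must fit within one user")
    rng = random.Random(seed)
    records = []
    for user in range(num_users):
        label = rng.randrange(num_classes)  # one random label per user
        for _ in range(records_per_user):
            image = [[rng.random() for _ in range(size)] for _ in range(size)]
            records.append([user, image, label])
    if centralized:
        # The first records all belong to user 0.
        chosen = range(num_pattern_records)
    else:
        # Spread the pattern uniformly over all records.
        chosen = rng.sample(range(len(records)), num_pattern_records)
    for idx in chosen:
        image = records[idx][1]
        for r in range(patch):
            for c in range(patch):
                image[r][c] = 1.0
        records[idx][2] = 1
    return records


if __name__ == "__main__":
    small = make_user_dataset(num_users=5, records_per_user=4, size=4,
                              patch=2, num_pattern_records=4)
    print(sum(1 for _, _, label in small if label == 1))
```

Raising `num_pattern_records` in the distributed case corresponds to increasing the occurrence of the pattern, as discussed above.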

2.2 Adaptive clipping

One of the main challenges for training with DPSGD is choosing a good clipping parameter C. To illustrate the difficulty, Figure 3 shows the L2-norm of the gradient per layer over the course of training for a non-private model. Two observations can be made. First, the L2-norms of a layer may be very different at the beginning of training than at the end. Second, the size of the gradient may differ between layers, and between weights and biases.

Combining these two observations, we propose a gradient-aware clipping scheme. This adaptive clipping schedule uses the differentially private mean L2-norm of the previous batch, multiplied by a constant factor α, as the norm bound for the current batch. The per-layer clipping bound for each round and layer is defined over the individual gradients from that layer in the previous round and two privacy parameters.

To choose , we use a similar adaptive procedure: we use the differentially private L2-norm from the previous iteration times a constant : . We choose and initialize by training for one iteration on random noise and extracting the mean L2-norm.
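One plausible reading of this scheme is sketched below. It is a rough illustration and not the authors' exact formula. It assumes that each per-example norm is capped at a bound to limit its sensitivity, and that Gaussian noise proportional to that bound is added to the sum of norms before averaging. All names (`dp_mean_l2_norm`, `next_clipping_bound`, `norm_bound`, `norm_noise_scale`, `alpha`) are hypothetical.

```python
import math
import random


def dp_mean_l2_norm(per_example_grads, norm_bound, norm_noise_scale, rng):
    """Noisy mean of per-example L2 norms, each capped at norm_bound.

    Assumption: the noise standard deviation is norm_noise_scale * norm_bound.
    """
    capped = [min(math.sqrt(sum(g * g for g in grad)), norm_bound)
              for grad in per_example_grads]
    noise = rng.gauss(0.0, norm_noise_scale * norm_bound)
    # Clamping at zero is post-processing and does not affect privacy.
    return max((sum(capped) + noise) / len(capped), 0.0)


def next_clipping_bound(prev_layer_grads, alpha, norm_bound,
                        norm_noise_scale, rng):
    """Clipping bound for the current round from the previous round's gradients."""
    return alpha * dp_mean_l2_norm(prev_layer_grads, norm_bound,
                                   norm_noise_scale, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    prev = [[3.0, 4.0], [0.0, 1.0]]
    print(next_clipping_bound(prev, alpha=1.0, norm_bound=10.0,
                              norm_noise_scale=0.5, rng=rng))
```

In a per-layer setting, this function would be called once per layer and round, so each layer follows its own gradient scale as training progresses.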

We train two versions of the small Alexnet model with the DPSGD algorithm, this time without momentum. The first uses a constant clipping bound, optimized by a grid search apracapproach . The second uses the adaptive clipping scheme. For the adaptive model, we increase the gradient noise scale and add an L2-norm noise scale, so that the model uses the same privacy budget as the non-adaptive model. The test accuracy is reported for both models in Figure 4. With adaptive clipping, the test accuracy rises from 61.6% to 63.5%.

Figure 3: L2 norms
Figure 4: Test accuracy vs clipping methods

We replaced the parameter C with α and two further parameters. However, the model is very robust to changes in these two: doubling the value of either yields very similar test accuracy. Because our approach is adaptive, we expect that a single value of α leads to good performance across tasks.

To examine this hypothesis, we carried out additional experiments on MNIST, CIFAR-10 and CIFAR-100 (reported in apracapproach ). The same α value is near the optimum for all datasets. This suggests that adaptive clipping results in parameters that are easier to choose without seeing the data. A broader investigation across datasets is required to test this hypothesis further.

2.3 Large batch training

Training with larger batches reduces privacy spending. To illustrate, Figure 5 shows the noise added per example as a function of batch size (using the moments accountant from recdp to find the minimum σ that fully uses a pre-defined privacy budget in 10 epochs for a training set of 60,000 examples). Large batches dramatically reduce the added noise per example. However, large batches can strongly hurt performance. Goyal et al. largebatchtraining introduce several methods to improve the performance of large-batch training. We adopt the simplest: scaling the learning rate along with the batch size.

We train small Alexnet models with varying batch sizes on CIFAR-10, using the DPSGD algorithm for 60 epochs with a fixed privacy budget. A base learning rate of 0.01 is used for a batch size of 128. We then increase both the batch size and the learning rate by the same factor and repeat the experiment. Each model is trained for 60 epochs, such that the privacy budget is fully spent after 60 epochs. Table 1 shows the results: training DP models with larger batches can be beneficial, but only when the learning rate is scaled accordingly.

Figure 5: Noise per example vs batch size

Table 1: Batch size versus accuracy

Batch size        Accuracy
128               61.6%
512               64.2%
1024              66.9%
1024 (base lr)    47.2%
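Scaling the learning rate by the same factor as the batch size amounts to a linear scaling rule. The helper below is a minimal sketch that uses the base values quoted in the text (learning rate 0.01 at batch size 128); the function name `scaled_learning_rate` is hypothetical.

```python
def scaled_learning_rate(batch_size, base_lr=0.01, base_batch_size=128):
    """Scale the learning rate by the same factor as the batch size."""
    return base_lr * batch_size / base_batch_size


if __name__ == "__main__":
    for batch_size in (128, 512, 1024):
        print(batch_size, scaled_learning_rate(batch_size))
```

The "1024 (base lr)" row of Table 1 corresponds to skipping this scaling and keeping the learning rate at 0.01.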

3 Conclusion

For differentially private learning, hyperparameter optimization on sensitive datasets is undesirable. The proposed methods enable an approach to differentially private learning with reduced privacy spending on hyperparameter tuning before training the final model. Given a classification task and data dimensions, we suggest the following approach to choosing the differential privacy parameters:

  • Choose a model that has been tested successfully on a similar, non-private benchmark task, and use its default hyperparameters.

  • Choose the largest batch size that fits in memory on the training device(s) and scale the learning rate accordingly.

  • Calibrate the noise scale parameter with the DPSGD recdp or DP-FedAvg userdp model on a centralized dataset of random noise until the sanity checks succeed.

  • When all sanity checks have passed, train on private data with the same budget. Use a small portion of the budget for computing the differentially private mean L2-norm and use the adaptive clipping method of Section 2.2 with its default parameter values.

Combined, these steps provide the structure of a basic differential privacy training workflow. This is far from a foolproof approach: the sanity checks function more as unit tests than as hard guarantees; some architectures, like conditional models secretsharer , are not yet supported; and it is not clear whether this approach is sufficient when adversaries actively attempt to influence the training process, as in the privacy attacks proposed by Hitaj et al. dmug . We hope that our approach provides a basis that can be extended to study such questions.

Acknowledgements

We thank the reviewers for their valuable comments. This publication was supported by the Amsterdam Academic Alliance Data Science (AAA-DS) Program Award to the UvA and VU Universities.

References