A General Approach to Adding Differential Privacy to Iterative Training Procedures

12/15/2018 ∙ by H. Brendan McMahan, et al. ∙ Google

In this work we address the practical challenges of training machine learning models on privacy-sensitive datasets by introducing a modular approach that minimizes changes to training algorithms, provides a variety of configuration strategies for the privacy mechanism, and then isolates and simplifies the critical logic that computes the final privacy guarantees. A key challenge is that training algorithms often require estimating many different quantities (vectors) from the same set of examples --- for example, gradients of different layers in a deep learning architecture, as well as metrics and batch normalization parameters. Each of these may have different properties like dimensionality, magnitude, and tolerance to noise. By extending previous work on the Moments Accountant for the subsampled Gaussian mechanism, we can provide privacy for such heterogeneous sets of vectors, while also structuring the approach to minimize software engineering challenges.

1 Introduction

There has been much work recently on integrating differential privacy (DP) techniques into iterative training procedures like stochastic gradient descent [Chaudhuri et al., 2011, Bassily et al., 2014, Abadi et al., 2016, Wu et al., 2017, Papernot et al., 2017]; for completeness we provide a formal definition of DP in Appendix A. Although these works differ in the granularity of privacy guarantees offered and the method of privacy accounting, most proposed approaches share the general idea of iteratively computing a model update from training data and then applying the Gaussian mechanism for differential privacy to the update before incorporating it into the model. Our goal in this work is to decouple, to the extent possible, three aspects of integrating a privacy mechanism with the training procedure:

  1. the specification of the training procedure itself (e.g., stochastic gradient descent with batch normalization and simultaneous collection of accuracy metrics and training data statistics),

  2. the selection and configuration of the privacy mechanisms to apply to each of the aggregates collected (model gradients, batch normalization weight updates, and metrics), and

  3. the accounting procedure used to compute a final $(\varepsilon, \delta)$-DP guarantee.

This separation is critical: the person implementing 1 is likely not a DP expert, and this code typically already exists; there are many configuration options for 2, which will likely require experimentation, and this configuration logic may become complex; thus isolating the key privacy calculations in 3 and keeping them as simple (and well tested) as possible prevents bugs in 1 or 2 from introducing errors in the calculation of the actual privacy achieved.

While model training is our primary motivation, the approach is applicable to any iterative procedure that fits the following template. We have a database with $n$ records. A record might correspond to a single training example, a “microbatch” of examples, or all of the data from a particular user or entity (e.g., to achieve user-level DP as in McMahan et al. [2018]). On each round, a random subset of records (a sample) is selected and the training procedure consumes the results of a number of vector queries over that sample; see Table 1. Such vector queries may include the average gradient for each layer, updates to batch-normalization parameters, or the average value for different training accuracy metrics. We describe a general approach to allocating a privacy budget across each of these queries and analyzing the privacy cost of the complete mechanism, all respecting the decoupling of concerns described earlier. Our analysis builds on the Moments Accountant approach of Abadi et al. [2016], which applies to a single vector query per round, and generalizes the extension of McMahan et al. [2018] to multi-vector queries.

|        | Per-example SGD | Microbatch SGD | Federated learning (user-level DP) |
|--------|-----------------|----------------|------------------------------------|
| record | gradient on one example | average gradient on one microbatch (~10 examples) | model update from one user |
| sample | minibatch (~100 examples) | minibatch (~10 microbatches for a total of 100 examples) | set of participating user devices for the round |

Table 1: Defining record and sample in different training contexts.

We focus on the following basic building block for a single vector. Suppose we have a database with $n$ records consisting of vectors $x_1, \dots, x_n$ and we are interested in estimating the average $\frac{1}{n}\sum_{i=1}^{n} x_i$.¹

¹ For simplicity, we focus on unweighted average queries, for example to compute average gradient on a batch of examples; the generalization to weighted average and vector sum queries is straightforward. We also restrict attention to the fixed expected denominator estimator of McMahan et al. [2018]; extension to the other estimators for averages considered there is straightforward.

Given a selection probability $q$, clipping threshold $S$, and noise multiplier $z$, the procedure is:

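  1. Select a subset $B \subseteq \{1, \dots, n\}$ of the records by choosing each record independently with probability $q$.

  2. Clip each $x_i$ for $i \in B$ to have maximum $L_2$ norm $S$ using $\pi_S(x) = x \cdot \min(1, S / \|x\|_2)$.

  3. Output $\frac{1}{qn}\big(\sum_{i \in B} \pi_S(x_i) + \mathcal{N}(0, \sigma^2 I)\big)$ where $\sigma = zS$.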

The quantity $\sum_{i \in B} \pi_S(x_i) + \mathcal{N}(0, \sigma^2 I)$ is the output of the Gaussian mechanism for sums. As $\mathbb{E}[|B|] = qn$, scaling it by $\frac{1}{qn}$ produces an unbiased estimate of the average (of the clipped vectors). The noise multiplier $z$ (the ratio of the noise standard deviation to the $L_2$-sensitivity of the query) acts as a knob to trade off privacy vs. utility. If we choose $z = \sqrt{2 \ln(1.25/\delta)}/\varepsilon$, the mechanism is $(O(q\varepsilon), q\delta)$-differentially private with respect to the full database [Beimel et al., 2014, Dwork and Roth, 2014]. Importantly, the privacy cost of this mechanism is fully specified by $q$ together with the privacy tuple $(S, \sigma)$, where $S$ is an upper bound on the $L_2$ norm of the vectors being summed, and $\sigma$ is the standard deviation of the noise added to the sum.
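The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the function names `clip_to_norm` and `private_average` are hypothetical, and the sketch uses NumPy's default generator, which is not cryptographically secure (see Section 8).

```python
import numpy as np

def clip_to_norm(x, S):
    """Scale x so that its L2 norm is at most S (the clipping operator)."""
    norm = np.linalg.norm(x)
    return x * min(1.0, S / norm) if norm > 0 else x

def private_average(records, q, S, z, rng):
    """Subsampled Gaussian estimate of the average of `records` (shape n x d)."""
    n, d = records.shape
    # Step 1: include each record independently with probability q.
    mask = rng.random(n) < q
    # Step 2: clip every selected record to L2 norm at most S.
    clipped = [clip_to_norm(x, S) for x in records[mask]]
    total = np.sum(clipped, axis=0) if clipped else np.zeros(d)
    # Step 3: add N(0, sigma^2 I) noise with sigma = z * S, then divide by q * n.
    sigma = z * S
    noisy_sum = total + rng.normal(0.0, sigma, size=d)
    return noisy_sum / (q * n)

# Illustrative values only; NOT a secure source of randomness.
rng = np.random.default_rng(0)
records = rng.normal(size=(10_000, 5))
estimate = private_average(records, q=0.01, S=4.0, z=1.1, rng=rng)
```

Note that the divisor is the expected sample size $qn$, not the realized size of the sample, so the actual sample size is never used in the output.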

We generalize the above procedure to the case where each record corresponds to a collection of vectors. We still do the sampling step (1) only once, but we estimate the average of each of the vectors separately, potentially with different clipping thresholds and noise multipliers. Let $(x_1, \dots, x_m)$ be the total set of vectors for which averages are to be estimated privately. (Note, we will include the superscript for the $i$th record, as in $x_j^{(i)}$, only when necessary in equations that sum over records.) In general, we may partition this set of $m$ vectors into multiple groups, e.g., fully connected layers vs. convolutional layers vs. metrics. We assume the user (that is, the person using the privacy tools defined here) has identified the relevant set of groups whose averages are needed in the training procedure. For each of these, they need to specify a privacy mechanism together with some hyperparameters. We first describe the privacy mechanisms that can be applied to individual vectors or groups of vectors, then show how the privacy cost of the full collection of mechanisms can be calculated, and finally propose strategies for choosing the parameters to achieve the desired privacy versus utility tradeoff.

Implementations of techniques in this paper may be found in the open-source TensorFlow Privacy framework [Google et al., 2018] for TensorFlow [Abadi et al., 2015], as described in Section 7.

2 Privacy mechanisms for a group of vectors

In this section, we describe two strategies that can be applied to a single group of vectors, without loss of generality the first $k$, $(x_1, \dots, x_k)$, for $k \le m$; when $k = 1$, the two mechanisms described are identical. Both mechanisms allow individual noise standard deviation parameters to be used for the separate groups. While this might at first seem to preclude the use of the Moments Accountant, which requires spherical noise, we will show how to resolve this issue in the next section.²

² Privacy mechanisms for groups can be used within TensorFlow Privacy [Google et al., 2018] by employing the NestedQuery class, which evaluates an arbitrary nested structure of queries where each leaf query would be a GaussianAverageQuery corresponding to one group of vectors.

Separate clipping and noise parameters.

This strategy essentially treats the whole group as a single concatenated vector $x = (x_1, \dots, x_k)$. The user provides $S$, a clipping parameter, and $\sigma$, a noise parameter. For now, assume both of these parameters are simply chosen so as to provide reasonable utility for the resulting average; we will discuss strategies for choosing these parameters in detail in Section 4. The output of the mechanism is

$$\hat{x} = \frac{1}{qn} \sum_{i \in B} \pi_S\big(x^{(i)}\big) + Z = \frac{1}{qn} \Big( \sum_{i \in B} \pi_S\big(x^{(i)}\big) + qn\, Z \Big),$$

where $qn$ is the expected sample size and $Z \sim \mathcal{N}(0, \sigma^2 I)$. The final expression shows that the mechanism is equivalent to the Gaussian mechanism for sums with privacy tuple $(S, qn\,\sigma)$. Applying this mechanism with $k = m$ to all vectors recovers the “flat clipping” approach of McMahan et al. [2018], and applying this mechanism separately to each of the $m$ vectors with $S_j = S/\sqrt{m}$ recovers their “per-layer clipping” approach (where $S$ is the total bound). Another reasonable strategy that takes into account dimensionality is to apply the mechanism separately with $S_j = S\sqrt{d_j / \sum_{l} d_l}$ where $d_j$ is the dimensionality of $x_j$.
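The clipping variants discussed above differ only in how a total bound is split across vectors. The sketch below illustrates this under stated assumptions: the per-vector thresholds are chosen so that their squares sum to the square of the total bound, the dimension-weighted rule is one plausible reading of a dimensionality-aware split, and all function names are hypothetical.

```python
import numpy as np

def clip_group(vectors, S):
    """Clip a group jointly, treating it as one concatenated vector of norm <= S."""
    norm = np.sqrt(sum(float(np.sum(v ** 2)) for v in vectors))
    factor = min(1.0, S / norm) if norm > 0 else 1.0
    return [factor * v for v in vectors]

def uniform_thresholds(S, m):
    """Per-vector thresholds S / sqrt(m), so that the squared thresholds sum to S^2."""
    return [S / np.sqrt(m)] * m

def dimension_thresholds(S, dims):
    """Assumed rule: thresholds S * sqrt(d_j / d), larger vectors get a larger share."""
    d = sum(dims)
    return [S * np.sqrt(d_j / d) for d_j in dims]

layers = [np.ones(4), 3.0 * np.ones(12)]
# Flat clipping: the whole collection is one group.
flat = clip_group(layers, S=2.0)
# Per-layer clipping: each vector is its own group with threshold S / sqrt(m).
per_layer = [clip_group([v], S_j)[0]
             for v, S_j in zip(layers, uniform_thresholds(2.0, len(layers)))]
```

In both variants the concatenation of the clipped vectors has norm at most the total bound, which is what the privacy analysis relies on.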

Joint clipping.

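Here we introduce a new mechanism that allows us to clip less aggressively than applying the previous strategy to each vector individually, while still letting different vectors live on different multiplicative scales. The user supplies as input scale parameters $S_1, \dots, S_k$, which may be thought of as bounds or reasonable norm clip parameters on the individual $x_j$, were they to be clipped individually. The strategy first does a pre-processing step via the scaling operator $\phi(x) = (x_1/S_1, \dots, x_k/S_k)$. If $\|x_j\| \le S_j$ for all $j$, then the joint norm $\|\phi(x)\| \le \sqrt{k}$; however, it may typically be much less. Then joint clipping and noising is performed using a total clipping parameter $S$ and noise with the standard deviation of $\sigma$. The mechanism’s output then scales the vectors back by the factor $S_j$ in post-processing:

$$\hat{x}_j = S_j \Big( \frac{1}{qn} \sum_{i \in B} \big[\pi_S\big(\phi(x^{(i)})\big)\big]_j + Z_j \Big) = \frac{S_j}{qn} \Big( \sum_{i \in B} \big[\pi_S\big(\phi(x^{(i)})\big)\big]_j + qn\, Z_j \Big),$$

where again $Z \sim \mathcal{N}(0, \sigma^2 I)$, and $[\cdot]_j$ denotes the block corresponding to $x_j$. The final expression shows the output can be written as a post-processing of the subsampled Gaussian mechanism for sums with privacy tuple $(S, qn\,\sigma)$. Note that if no clipping happens then $\hat{x}_j = \frac{1}{qn} \sum_{i \in B} x_j^{(i)} + S_j Z_j$ for all $j$.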

To see where this mechanism might be superior to the first, suppose two vectors $x_1$ and $x_2$ have norm bounds on very different scales, and suppose they can tolerate correspondingly different noise standard deviations. Additionally, assume it is known that either $x_1^{(i)}$ or $x_2^{(i)}$ will be zero for any record $i$. We could clip these separately, but this ignores the (useful) side information that one of the vectors is always zero. On the other hand, if we treat them as a single group, we cannot take into account the fact that they are on very different scales; in particular, we must pick a single noise value which will either be insufficient to add privacy for the larger vector, or will completely obscure the signal in the smaller one. The joint mechanism proposed here lets us directly handle this situation through suitable choices of $S_1$, $S_2$, $S$, and $\sigma$.
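A minimal sketch of joint clipping on the scenario just described, in which exactly one of two vectors is non-zero per record. The numeric bounds (1 and 100), the convention that the noise parameter applies to the scaled average, and the function name `joint_clip_average` are all assumptions made for illustration.

```python
import numpy as np

def joint_clip_average(sample, scales, S, sigma, expected_size, rng):
    """Joint clipping for a sample of records; each record is a list of k arrays."""
    if not sample:
        raise ValueError("this sketch assumes a non-empty sample")
    sums = [np.zeros_like(x, dtype=float) for x in sample[0]]
    for record in sample:
        # Pre-processing: divide each vector by its scale parameter S_j.
        scaled = [x / s for x, s in zip(record, scales)]
        # Clip the concatenation of the scaled vectors to joint norm S.
        norm = np.sqrt(sum(float(np.sum(v ** 2)) for v in scaled))
        factor = min(1.0, S / norm) if norm > 0 else 1.0
        sums = [acc + factor * v for acc, v in zip(sums, scaled)]
    # Noise the scaled average, then scale each vector back by S_j (post-processing).
    return [s * (acc / expected_size + rng.normal(0.0, sigma, size=acc.shape))
            for acc, s in zip(sums, scales)]

rng = np.random.default_rng(0)  # illustration only; not a secure PRNG
record_a = [np.array([1.0, 0.0]), np.zeros(3)]              # only the small vector is set
record_b = [np.zeros(2), np.array([0.0, 100.0, 0.0])]       # only the large vector is set
noisy = joint_clip_average([record_a, record_b], scales=[1.0, 100.0],
                           S=1.0, sigma=0.01, expected_size=2, rng=rng)
```

Because one vector is always zero, every scaled record has joint norm at most 1, so a total clipping parameter of 1 clips nothing here, while the noise on each output vector is automatically proportional to its own scale.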

3 Composing privacy guarantees for multiple vector groups

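Now, suppose we have partitioned the vectors into $G$ groups, and selected a privacy mechanism for each one, producing privacy tuples $(S_j, \tilde\sigma_j)$ for $j \in \{1, \dots, G\}$, where $\tilde\sigma_j$ is the standard deviation of the noise added to the sum for group $j$. From a privacy accounting point of view, each of these mechanisms is equivalent to running a Gaussian sum query on vectors $x_j$ with $\|x_j\| \le S_j$ and then adding noise $\mathcal{N}(0, \tilde\sigma_j^2 I)$ to the final sum. We now demonstrate a transformation that lets us analyze this composite mechanism as a single Gaussian sum query on the sample for use with the privacy accountant.

First, we scale each vector to $x'_j = x_j / \tilde\sigma_j$, so $\|x'_j\| \le S_j / \tilde\sigma_j$. Now, we imagine a single Gaussian sum query over the concatenation $x' = (x'_1, \dots, x'_G)$ with noise standard deviation $1$, and output the estimate after rescaling by the $\tilde\sigma_j$ factors. This is equivalent since

$$\sum_{i \in B} x_j^{(i)} + \mathcal{N}(0, \tilde\sigma_j^2 I) = \tilde\sigma_j \Big( \sum_{i \in B} x_j'^{(i)} + \mathcal{N}(0, I) \Big). \qquad (1)$$

The final expression is a simple post-processing on the output of a single Gaussian sum query with parameters $\big(\sqrt{\sum_{j=1}^{G} S_j^2 / \tilde\sigma_j^2},\; 1\big)$. Thus, we can apply the privacy accountant to bound the privacy loss of iterative applications of this mechanism.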
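The reduction to a single Gaussian sum query is easy to compute. The sketch below assumes each privacy tuple is given as (norm bound, standard deviation of the noise added to the sum); the function names are hypothetical.

```python
import numpy as np

def combine_privacy_tuples(tuples):
    """Return (norm bound, noise std) of the equivalent single Gaussian sum query.

    After dividing group j by its noise std, the concatenated vector has norm
    at most sqrt(sum_j (S_j / sigma_j)^2) and the noise becomes N(0, I).
    """
    bound = np.sqrt(sum((S / sigma) ** 2 for S, sigma in tuples))
    return bound, 1.0

def effective_noise_multiplier(tuples):
    """Ratio of noise std to sensitivity for the combined query."""
    bound, sigma = combine_privacy_tuples(tuples)
    return sigma / bound

# Two groups, each with noise multiplier 2 on its own.
z = effective_noise_multiplier([(1.0, 2.0), (3.0, 6.0)])
```

With two groups that each have an individual noise multiplier of 2, the combined multiplier is $2/\sqrt{2} = \sqrt{2}$: answering more queries on the same sample always lowers the effective noise multiplier.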

4 Hyperparameter selection strategies

Here we consider selecting hyperparameters $S_j$, $\sigma_j$, and $q$ to achieve a particular privacy vs. utility tradeoff. Recall that for both mechanisms the privacy tuple of group $j$ is $(S_j, qn\,\sigma_j)$, so the key quantity is the effective noise multiplier

$$z = \Big( \sum_{j} \Big( \frac{S_j}{qn\,\sigma_j} \Big)^2 \Big)^{-1/2}.$$

Typically, a sufficiently large value of $z$ will provide a reasonable privacy guarantee. If $z$ is too small for the desired level of privacy, the user has several knobs available: clip more aggressively by decreasing the $S_j$’s; noise more aggressively by scaling up the $\sigma_j$’s; or increasing the expected sample size $qn$. When datasets are large and the additional computational cost of processing larger samples is affordable, this last approach is generally preferable, as observed by McMahan et al. [2018]. If additionally the total number of iterations is known, then since the privacy cost scales monotonically with any of these adjustments to $z$, a binary search can be performed using the privacy accountant repeatedly with different parameters to find, e.g., the precise value of $z$ needed to achieve a particular $(\varepsilon, \delta)$-DP guarantee.
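The binary search can be sketched as follows. The accountant used here is a deliberately simple stand-in: it composes the Rényi DP of the plain Gaussian mechanism and ignores the amplification from subsampling, so it is much more pessimistic than the Moments Accountant. The function names, the integer order grid, and the search bounds are assumptions of this sketch.

```python
import math

def epsilon_gaussian_rdp(z, steps, delta, orders=range(2, 256)):
    """Stand-in accountant: epsilon after `steps` Gaussian queries, no subsampling.

    Each step with noise multiplier z satisfies (alpha, alpha / (2 z^2))-RDP;
    RDP adds up over steps and converts to (epsilon, delta)-DP via
    epsilon = rdp + log(1 / delta) / (alpha - 1), minimized over the orders tried.
    """
    return min(steps * a / (2.0 * z * z) + math.log(1.0 / delta) / (a - 1)
               for a in orders)

def smallest_noise_multiplier(target_eps, steps, delta, lo=1e-2, hi=1e3, tol=1e-4):
    """Bisection for (approximately) the smallest z meeting the target epsilon."""
    if epsilon_gaussian_rdp(hi, steps, delta) > target_eps:
        raise ValueError("target epsilon not reachable within the search range")
    # Invariant: epsilon(lo) > target >= epsilon(hi); epsilon is decreasing in z.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if epsilon_gaussian_rdp(mid, steps, delta) <= target_eps:
            hi = mid
        else:
            lo = mid
    return hi

z_needed = smallest_noise_multiplier(target_eps=1.0, steps=100, delta=1e-5)
```

Swapping in a tighter accountant only changes `epsilon_gaussian_rdp`; the search itself relies solely on monotonicity of the privacy cost in $z$.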

Choosing $S_j$ and $\sigma_j$.

Typical approaches to setting $S_j$ include: 1) using an a priori upper bound on the norm; 2) choosing $S_j$ so that “few” vectors are clipped; or 3) running parameter tuning grids to find a value of $S_j$ that does not reduce utility (e.g., the accuracy of the model) by too much. If private data is used in 2) or 3), the privacy cost of this should be accounted for. Similar strategies can be used to choose $\sigma_j$, e.g., selecting a value that will introduce an a priori acceptable amount of error, or more likely for model training, running experiments to find the largest amount of noise that does not slow the training procedure.

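In some cases one may have bounds $S_j$ on the norms of groups plus an overall target value of $z$, which needs to be distributed across multiple groups. Here $\sigma_j$ is the noise standard deviation on the average, so that group $j$ on its own has noise multiplier $qn\,\sigma_j / S_j$. To achieve proportional noise, where the ratio $\sigma_j / S_j$ is the same for all $j$, we can use $\sigma_j = z \sqrt{G}\, S_j / (qn)$, where $G$ is the number of groups. Another reasonable alternative, dimensionality adjusted noise, assigns noise proportional to the maximum root mean squared value of the components of $x_j$ given its bound $S_j$ and its dimensionality: $\sigma_j = z\, S_j \sqrt{d / d_j} / (qn)$, where $d_j$ is the dimensionality of group $j$ and $d = \sum_j d_j$.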
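Both allocation rules can be checked against the target noise multiplier numerically. The sketch below assumes that each group's noise parameter is the standard deviation on the average (so the noise on the sum is the expected sample size times that value) and that the overall multiplier combines as an inverse root of summed inverse squares; the function names are hypothetical.

```python
import numpy as np

def proportional_noise(z, bounds, expected_size):
    """Same ratio sigma_j / S_j for every group, scaled to hit overall multiplier z."""
    G = len(bounds)
    return [z * np.sqrt(G) * S / expected_size for S in bounds]

def dimension_adjusted_noise(z, bounds, dims, expected_size):
    """Noise proportional to S_j / sqrt(d_j), scaled to hit overall multiplier z."""
    d = sum(dims)
    return [z * S * np.sqrt(d / d_j) / expected_size for S, d_j in zip(bounds, dims)]

def effective_z(bounds, sigmas, expected_size):
    """Overall noise multiplier implied by per-group bounds and average-noise stds."""
    return sum((S / (expected_size * s)) ** 2 for S, s in zip(bounds, sigmas)) ** -0.5

bounds, dims, qn = [1.0, 5.0, 0.2], [1000, 250, 10], 100.0
sig_prop = proportional_noise(1.2, bounds, qn)
sig_dim = dimension_adjusted_noise(1.2, bounds, dims, qn)
```

Under the dimension-adjusted rule, low-dimensional groups such as scalar metrics receive proportionally more noise per coordinate budget than under the proportional rule, while the overall multiplier stays at the same target.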

5 Sampling policies

The basic update step of the SGD algorithm operates on a small subset of records (the minibatch). Convergence guarantees of the standard optimization theory hold under the assumption that each minibatch is an i.i.d. sample of the training dataset, and the original Moments Accountant by Abadi et al. [2016] supported privacy analysis in this regime.

In practice, there are valid reasons for using alternative policies for sampling minibatches, with implications for privacy analysis. We list three of the most common sampling policies below.

Minibatches are i.i.d. samples.

Privacy of this sampling procedure is analyzed by Abadi et al. [2016] and it is used by the federated learning framework where decisions of whether to participate in a particular update step are made locally [McMahan et al., 2018]. If the privacy accountant is dependent on the secrecy of the sample (as in the case of the Moments Accountant), then the size of the sample cannot be released without applying a privacy-preserving mechanism, which can be as simple as additive noise. The variability of the sample’s size makes this sampling policy a poor fit for hardware accelerators. It can be repaired by sampling subsets of a fixed size from the training set without replacement, which leads us to the next policy.

Minibatches are equally sized and independent.

The basic SGD corresponds to this sampling policy and minibatches of cardinality 1. Recent works analyze composition of this sampling policy with a mechanism satisfying RDP [Wang et al., 2018] or tCDP [Bun et al., 2018]. Independence of minibatches makes analysis of multiple iterations of SGD straightforward via application of composition rules for differential privacy.

Minibatches are equally sized and disjoint.

In practice, the most common manner of forming minibatches is permuting the training dataset and partitioning it into disjoint subsets of a fixed size. After a single pass (an epoch) the process is repeated. This sampling policy can be efficiently implemented, and has intuitive semantics: an epoch corresponds to a training cycle in which every example is visited exactly once. Quantitatively tight analysis of DP-SGD in this model is not known. (A related problem of analyzing randomized response followed by a random permutation is addressed by Erlingsson et al. [2019].)
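The three sampling policies can be contrasted in code. This is a minimal NumPy sketch with hypothetical function names; the first policy is implemented as independent per-record selection, and the generator used is for illustration only and is not cryptographically secure.

```python
import numpy as np

def poisson_sample(n, q, rng):
    """Policy 1: each record joins the minibatch independently with probability q.

    The minibatch size is random, with expectation q * n.
    """
    return np.flatnonzero(rng.random(n) < q)

def fixed_size_sample(n, batch_size, rng):
    """Policy 2: a fresh uniformly random subset of fixed size for every step."""
    return rng.choice(n, size=batch_size, replace=False)

def epoch_partition(n, batch_size, rng):
    """Policy 3: shuffle once, then cut into disjoint minibatches (one epoch).

    Any remainder smaller than batch_size is dropped in this sketch.
    """
    perm = rng.permutation(n)
    return [perm[i:i + batch_size] for i in range(0, n - batch_size + 1, batch_size)]

rng = np.random.default_rng(0)
variable_batch = poisson_sample(1000, 0.1, rng)
independent_batch = fixed_size_sample(1000, 100, rng)
epoch = epoch_partition(1000, 100, rng)
```

Only the third policy guarantees that minibatches within an epoch are disjoint; under the first two, a record may appear in consecutive steps or not at all.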

6 Privacy ledger

In principle, privacy accounting (via, e.g., the Moments Accountant) could be done in tandem with calls to the mechanism to keep an online estimate of the privacy guarantee. However, we advocate a different approach which cleanly separates concerns 2 and 3 from the introduction. We maintain a privacy ledger and record two types of events: sampling events, which record that a set of records has been drawn using parameters $q$ and $n$, and sum query events, which record that a Gaussian sum query has been performed over some group of vectors with privacy tuple $(S, \sigma)$. Then the privacy accountant can process the ledger post hoc to produce a privacy guarantee, first converting each group of one sampling event plus some sum query events to an equivalent single sum query event with a single privacy tuple obtained using Equation (1).

There are two main advantages of this approach. First, bugs in the hyperparameter selection strategy code cannot affect the privacy estimate. Second, it allows the privacy accounting mechanism to be changed and the ledger reprocessed if, for example, a tighter bound on the privacy loss is discovered after the data has been processed.
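A ledger of this kind can be very small. The sketch below is illustrative only: the class and field names are hypothetical and do not reproduce the PrivacyLedger class of TensorFlow Privacy, and the post hoc step only reduces each sampling event to a selection probability and an effective noise multiplier, which a separate accountant would then consume.

```python
import math
from dataclasses import dataclass, field

@dataclass
class SumQueryEvent:
    l2_norm_bound: float   # clipping bound for the group
    noise_stddev: float    # standard deviation of the noise added to the sum

@dataclass
class SampleEvent:
    population_size: int           # number of records in the database
    selection_probability: float   # per-record sampling probability
    queries: list = field(default_factory=list)

class Ledger:
    """Append-only record of sampling events and the sum queries run on each sample."""

    def __init__(self):
        self.samples = []

    def record_sample(self, population_size, selection_probability):
        self.samples.append(SampleEvent(population_size, selection_probability))

    def record_sum_query(self, l2_norm_bound, noise_stddev):
        if not self.samples:
            raise RuntimeError("record a sampling event before any sum query")
        self.samples[-1].queries.append(SumQueryEvent(l2_norm_bound, noise_stddev))

def effective_multipliers(ledger):
    """Post hoc processing: one (q, effective noise multiplier) pair per sample."""
    result = []
    for sample in ledger.samples:
        inv_sq = sum((e.l2_norm_bound / e.noise_stddev) ** 2 for e in sample.queries)
        z = math.inf if inv_sq == 0 else inv_sq ** -0.5
        result.append((sample.selection_probability, z))
    return result

ledger = Ledger()
ledger.record_sample(population_size=60_000, selection_probability=0.01)
ledger.record_sum_query(l2_norm_bound=1.0, noise_stddev=2.0)   # e.g. gradients
ledger.record_sum_query(l2_norm_bound=3.0, noise_stddev=6.0)   # e.g. metrics
summary = effective_multipliers(ledger)
```

Because the ledger stores only raw events, a different or tighter accountant can be run over the same records later without touching the training code.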

7 TensorFlow Privacy

TensorFlow Privacy [Google et al., 2018]³ is a Python library that implements TensorFlow optimizers for training machine learning models with differential privacy. The library comes with tutorials and analysis tools for computing the privacy guarantees provided. From an engineering perspective, the implementation of differentially private optimizers found in the library leverages the decoupled structure outlined above to make it easier for developers to both (a) wrap most optimizers into their differentially private counterpart and (b) compare different privacy mechanisms and accounting procedures.

³ Available from https://github.com/tensorflow/privacy under Apache 2.0 license.

Perhaps the library is best illustrated by one of its main use cases: training a neural network with differentially private stochastic gradient descent [Abadi et al., 2016]. Given the stochastic gradient descent optimizer class, tf.train.GradientDescentOptimizer, implemented in the main TensorFlow library, one first wraps it into a new optimizer that implements logic for both the clipping and noising of gradients needed to obtain privacy. This is done by having the optimizer estimate the gradients via an instance of a class implementing the DPQuery interface. A DPQuery is responsible for clipping gradients computed by the optimizer, accumulating them, and returning their noisy average to the optimizer. This introduces two additional hyperparameters to the optimizer: the clipping norm and the noise multiplier. The PrivacyLedger class maintains a record of the sum query events for each sampling event which can then be processed by the RDP accountant.

In addition, our implementation leverages microbatches, as defined in Table 1. This implies that gradients are computed over several examples before they are clipped, and once all microbatches in a minibatch have been processed, they are averaged and noised. This introduces a third additional hyperparameter to the optimizer: the number of microbatches. Increasing it often improves utility but typically slows down training.
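The role of the three hyperparameters (clipping norm, noise multiplier, number of microbatches) can be seen in a plain NumPy sketch of a single update step. This is not the TensorFlow Privacy optimizer; the function name `dp_sgd_step` and its argument names are hypothetical, and the generator shown is not cryptographically secure.

```python
import numpy as np

def dp_sgd_step(params, per_example_grads, num_microbatches,
                l2_norm_clip, noise_multiplier, learning_rate, rng):
    """One noisy SGD step; per_example_grads has shape (batch_size, dim)."""
    batch_size, dim = per_example_grads.shape
    if batch_size % num_microbatches != 0:
        raise ValueError("batch size must be divisible by num_microbatches")
    # Average gradients within each microbatch: one row per microbatch (record).
    record_grads = per_example_grads.reshape(num_microbatches, -1, dim).mean(axis=1)
    # Clip each microbatch gradient to L2 norm at most l2_norm_clip.
    norms = np.linalg.norm(record_grads, axis=1, keepdims=True)
    clipped = record_grads * np.minimum(1.0, l2_norm_clip / np.maximum(norms, 1e-12))
    # Sum, add Gaussian noise with std noise_multiplier * l2_norm_clip, then average.
    noise = rng.normal(0.0, noise_multiplier * l2_norm_clip, size=dim)
    noisy_average = (clipped.sum(axis=0) + noise) / num_microbatches
    return params - learning_rate * noisy_average

rng = np.random.default_rng(0)
grads = rng.normal(size=(100, 8))          # stand-in for per-example gradients
new_params = dp_sgd_step(np.zeros(8), grads, num_microbatches=10,
                         l2_norm_clip=1.0, noise_multiplier=1.1,
                         learning_rate=0.1, rng=rng)
```

With more microbatches, clipping is applied to finer-grained gradients and the noise is divided by a larger count, at the cost of more per-microbatch gradient computations.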

TensorFlow Privacy is also designed to work with training in a federated context in the vein of McMahan et al. [2018]. In that case the “gradients” supplied to the DPQuery would in fact be the model updates supplied by the users in a given round.

Finally, to compute the differential privacy guarantee for the model, an implementation of the RDP accountant is provided. Given the sampling fraction $q$ and the noise multiplier $z$, the RDP is computed for a step. Summing the RDP over the steps, it can then estimate $\varepsilon$ for a fixed $\delta$.
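Written out, the accounting in this paragraph takes the following form. The symbols $\rho(\alpha)$ (per-step RDP at order $\alpha$ for the given sampling fraction and noise multiplier) and $T$ (number of steps) are notation assumed here, and the conversion shown is the standard one from Rényi DP to $(\varepsilon, \delta)$-DP rather than a statement about the library's exact formula.

```latex
% RDP composes additively over T steps:
\rho_{\mathrm{total}}(\alpha) = T \cdot \rho(\alpha)

% Conversion to (\varepsilon, \delta)-DP, minimized over the orders tried:
\varepsilon(\delta) = \min_{\alpha > 1}
  \left[ T \cdot \rho(\alpha) + \frac{\log(1/\delta)}{\alpha - 1} \right]

% Special case without subsampling (q = 1), noise multiplier z:
\rho(\alpha) = \frac{\alpha}{2 z^{2}}
```

For $q < 1$ the per-step term $\rho(\alpha)$ is smaller than in the special case, which is where the benefit of subsampling enters the final guarantee.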

8 Floating-point arithmetic and randomness source

The hallmark feature of the definition of differential privacy is that it is unconditional; in other words, it makes no assumptions about the adversarial knowledge or capabilities. It also puts a high burden on a differentially private implementation: its output distribution must have effectively infinite entropy. In practice, the distribution is defined over only a finite domain (such as a vector of single-precision floating-point numbers) and the source of randomness is guaranteed (at best) to be computationally secure. We consider these issues in turn.

Floating-point arithmetic.

The problem of achieving differential privacy by means of standard floating-point arithmetic has been addressed for the additive Laplace mechanism by Mironov [2012]. We leave open the task of developing a provable floating-point implementation of DP-SGD and integrating it into an ML library.

Sources of randomness.

Most computational devices have access to only a few sources of entropy, and these tend to be very low rate (hardware interrupts, on-board sensors). It is standard, and theoretically well justified, to use the entropy to seed a cryptographically secure pseudo-random number generator (PRNG) and use the PRNG’s output as needed. Robust and efficient PRNGs based on standard cryptographic primitives exist that have an output rate of gigabytes per second on modern CPUs and require a seed as short as 128 bits [Salmon et al., 2011].

The output distribution of a randomized algorithm $\mathcal{A}$ with access to a PRNG is indistinguishable from the output distribution of $\mathcal{A}$ with access to a true source of entropy as long as the distinguisher is computationally bounded. Compare this with the guarantee of differential privacy, which holds against any adversary, no matter how powerful. As such, virtually all implementations of differential privacy satisfy only (variants of) Computational Differential Privacy introduced by Mironov et al. [2009]. On the positive side, a computationally bounded adversary cannot tell the difference, which allows us to avoid being overly pedantic about this point.

A training procedure may have multiple sources of non-determinism (e.g., dropout layers or an input of a generative model) but only those that are reflected in the privacy ledger must come from a cryptographically secure PRNG. In particular, the minibatch sampling procedure and the additive Gaussian noise must be drawn from a PRNG for the trained model to satisfy computational differential privacy. In contrast, microbatches need not be chosen using a randomized process.
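As an illustration of the last point, the two draws that appear in the privacy ledger can be taken from the operating system's entropy-backed generator using only the Python standard library. This is a sketch with hypothetical function names; it addresses the source of randomness only and does nothing about the floating-point issues discussed above.

```python
import random

# SystemRandom draws from os.urandom, the operating system's source of
# cryptographically suitable random bytes, instead of a seeded Mersenne Twister.
secure = random.SystemRandom()

def sample_minibatch(n, q):
    """Select each of n records independently with probability q."""
    return [i for i in range(n) if secure.random() < q]

def gaussian_noise(dim, sigma):
    """Draw dim independent N(0, sigma^2) values for the additive noise."""
    return [secure.gauss(0.0, sigma) for _ in range(dim)]

batch = sample_minibatch(n=1000, q=0.05)
noise = gaussian_noise(dim=8, sigma=1.1)
```

Other randomness in training, such as dropout masks or the assignment of examples to microbatches, could keep using an ordinary seeded generator.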

9 Conclusion

We have shown how the Gaussian mechanism can be applied to vectors of different types with different norm bounds and noise standard deviations, enabling training over heterogeneous parameter vectors, as well as simultaneous privacy-preserving estimation of other statistics such as classifier accuracy, or the number of instances in each class. By implementing iterative training algorithms in terms of a series of Gaussian sum queries and then recording for each query privacy events to a ledger to be processed by a privacy accountant, we separate the three major concerns of implementing privacy-preserving iterative training procedures while allowing flexibility in the specification of clipping strategy and noise allocation. The techniques described in the paper can be easily implemented using the TensorFlow Privacy library.

References

  • Abadi et al. [2015] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL https://tensorflow.org/.
  • Abadi et al. [2016] Martín Abadi, Andy Chu, Ian Goodfellow, Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In 23rd ACM Conference on Computer and Communications Security (ACM CCS), pages 308–318, 2016.
  • Bassily et al. [2014] Raef Bassily, Adam Smith, and Abhradeep Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In Proceedings of the 2014 IEEE 55th Annual Symposium on Foundations of Computer Science, pages 464–473, 2014.
  • Beimel et al. [2014] Amos Beimel, Hai Brenner, Shiva Prasad Kasiviswanathan, and Kobbi Nissim. Bounds on the sample complexity for private learning and private data release. Machine Learning, 94(3):401–437, 2014.
  • Bun et al. [2018] Mark Bun, Cynthia Dwork, Guy N. Rothblum, and Thomas Steinke. Composable and versatile privacy via Truncated CDP. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 74–86, 2018.
  • Chaudhuri et al. [2011] Kamalika Chaudhuri, Claire Monteleoni, and Anand D. Sarwate. Differentially private empirical risk minimization. J. Mach. Learn. Res., 12(Mar):1069–1109, 2011.
  • Dwork and Roth [2014] Cynthia Dwork and Aaron Roth. The Algorithmic Foundations of Differential Privacy. Foundations and Trends in Theoretical Computer Science. Now Publishers, 2014.
  • Erlingsson et al. [2019] Úlfar Erlingsson, Vitaly Feldman, Ilya Mironov, Ananth Raghunathan, Kunal Talwar, and Abhradeep Thakurta. Amplification by shuffling: From local to central differential privacy via anonymity. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 2468–2479, 2019.
  • Google et al. [2018] Google et al. TensorFlow Privacy. https://github.com/tensorflow/privacy, 2018.
  • McMahan et al. [2018] Brendan McMahan, Daniel Ramage, Kunal Talwar, and Li Zhang. Learning differentially private recurrent language models. In International Conference on Learning Representations (ICLR), 2018. URL https://openreview.net/pdf?id=BJ0hF1Z0b.
  • Mironov [2012] Ilya Mironov. On significance of the least significant bits for differential privacy. In Proceedings of the 2012 ACM conference on Computer and Communications Security (CCS), pages 650–661, 2012.
  • Mironov et al. [2009] Ilya Mironov, Omkant Pandey, Omer Reingold, and Salil Vadhan. Computational differential privacy. In Advances in Cryptology—CRYPTO, pages 126–142, 2009.
  • Papernot et al. [2017] Nicolas Papernot, Martín Abadi, Úlfar Erlingsson, Ian Goodfellow, and Kunal Talwar. Semi-supervised knowledge transfer for deep learning from private training data. In Proceedings of the International Conference on Learning Representations, 2017. URL https://arxiv.org/abs/1610.05755.
  • Salmon et al. [2011] John K Salmon, Mark A Moraes, Ron O Dror, and David E Shaw. Parallel random numbers: As easy as 1, 2, 3. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, page 16. ACM, 2011.
  • Wang et al. [2018] Yu-Xiang Wang, Borja Balle, and Shiva Kasiviswanathan. Subsampled Rényi differential privacy and analytical moments accountant. CoRR, abs/1808.00087, 2018. URL http://arxiv.org/abs/1808.00087.
  • Wu et al. [2017] Xi Wu, Fengan Li, Arun Kumar, Kamalika Chaudhuri, Somesh Jha, and Jeffrey F. Naughton. Bolt-on differential privacy for scalable stochastic gradient descent-based analytics. In Proceedings of SIGMOD, pages 1307–1322, 2017.

Appendix A Differential Privacy

The formal definition of $(\varepsilon, \delta)$-differential privacy is provided here for reference:

Definition 1.

A randomized mechanism $\mathcal{M} \colon \mathcal{D} \to \mathcal{R}$ satisfies $(\varepsilon, \delta)$-differential privacy if for any two adjacent datasets $D, D' \in \mathcal{D}$ and for any measurable subset of outputs $Y \subseteq \mathcal{R}$ it holds that

$$\Pr[\mathcal{M}(D) \in Y] \le e^{\varepsilon} \Pr[\mathcal{M}(D') \in Y] + \delta.$$

The interpretation of adjacent datasets above determines the unit of information that is protected by the algorithm: a differentially private mechanism guarantees that two datasets differing only by addition or removal of a single unit produce outputs that are nearly indistinguishable. For machine learning applications the two most common cases are example-level privacy (e.g., Chaudhuri et al. [2011], Bassily et al. [2014], Abadi et al. [2016], Wu et al. [2017], Papernot et al. [2017]), in which an adversary cannot tell with high confidence from the learned model parameters whether a given example was present in the training set, or user-level privacy (e.g., McMahan et al. [2018]) in which adding or removing an entire user’s data from the training set should not substantially impact the learned model. It is also possible to consider $D$ and $D'$ to be adjacent if they differ by replacing a training example (or an entire user’s data) with another, which would increase the $\varepsilon$ by a factor of two.