Fast and Memory Efficient Differentially Private-SGD via JL Projections

02/05/2021 ∙ by Zhiqi Bu, et al. ∙ Microsoft, University of Pennsylvania, University of Washington, Stanford University

Differentially Private-SGD (DP-SGD) of Abadi et al. (2016) and its variations are the only known algorithms for private training of large scale neural networks. This algorithm requires computation of per-sample gradient norms, which is extremely slow and memory intensive in practice. In this paper, we present a new framework to design differentially private optimizers called DP-SGD-JL and DP-Adam-JL. Our approach uses Johnson-Lindenstrauss (JL) projections to quickly approximate the per-sample gradient norms without exactly computing them, thus making the training time and memory requirements of our optimizers closer to those of their non-DP versions. Unlike previous attempts to make DP-SGD faster, which work only on a subset of network architectures or use compiler techniques, we propose an algorithmic solution which works for any network in a black-box manner; this is the main contribution of this paper. To illustrate this, on the IMDb dataset, we train a Recurrent Neural Network (RNN) to achieve a good privacy-vs-accuracy tradeoff, while being significantly faster than DP-SGD and with a memory footprint similar to that of non-private SGD. The privacy analysis of our algorithms is more involved than that of DP-SGD; we use the recently proposed f-DP framework of Dong et al. (2019) to prove privacy.


1 Introduction

Over the past decade, machine learning algorithms based on (deep) neural architectures have led to a revolution in applications such as computer vision, speech recognition and natural language processing (NLP). An important factor contributing to this success is the abundance of data. For most of these applications, however, the training data comes from individuals, often containing personal and sensitive information about them. For example, natural language models for applications such as suggested replies for e-mails and dialog systems rely on the training of neural networks on email data of users Chen et al. (2019); Deb et al. (2019), who may be left vulnerable if personal information is revealed. This could happen, for example, when a model generates a sentence or predicts a word that can potentially reveal private information of users in the training set. Many studies have shown successful membership inference attacks on deep learning models Shokri et al. (2017); Carlini et al. (2019). Indeed, in a recent work, Carlini et al. (2019) show that “unintended memorization” in neural networks is both commonplace and hard to prevent. Such memorization is not due to overtraining Tetko et al. (1995); Carlini et al. (2019), and ad hoc techniques such as early stopping and dropout do not prevent the risk of privacy violations. Moreover, Feldman (2020) shows that memorization is in fact provably necessary for some learning tasks. Thus, to prevent unintended privacy breaches one needs a principled approach for private training of deep learning models. In this paper we study training neural networks with differential privacy, a mathematically rigorous notion of privacy introduced in the seminal work of Dwork et al. (2006), and focus on user level privacy.

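Definition 1.1 ($(\varepsilon,\delta)$-DP).

We say that an algorithm $\mathcal{A}$ is $(\varepsilon,\delta)$-DP if for any two neighboring databases $D, D'$ and any subset $S$ of outputs, we have

$$\Pr[\mathcal{A}(D) \in S] \le e^{\varepsilon} \Pr[\mathcal{A}(D') \in S] + \delta.$$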

Besides being a provable privacy notion, it has been shown that deep learning models trained with DP protect against leakage of sensitive information; we refer the readers to Carlini et al. (2019); Abadi et al. (2016) for more details.

In a highly influential paper, Abadi et al. (2016) introduced a differentially private version of stochastic gradient descent (DP-SGD) for training deep learning models, and showed that it is possible to achieve a reasonable accuracy-vs-privacy tradeoff on common benchmarks such as MNIST and CIFAR10. Since then, there has been a vast body of work building on and extending the algorithm of Abadi et al. (2016); we refer the readers to McMahan et al. (2018); Bu et al. (2019); Carlini et al. (2019); Thakkar et al. (2019); Augenstein et al. (2020); Zhou et al. (2020); Chen et al. (2020); Balle et al. (2020). DP-SGD and its variations such as DP-Adam differ from their non-private counterparts in two crucial ways:

  • Gradient Clipping: In each iteration of DP-SGD, we clip each per-sample gradient to have $\ell_2$-norm at most some fixed parameter $C$. This step ensures that the sensitivity of the average gradient is bounded, which is crucial for privacy analysis. Computing the norms of per-sample gradients is the most expensive step in the algorithm. Efficient implementations of backpropagation such as in TensorFlow and PyTorch only maintain the average of per-sample gradients across a batch by default. Therefore, getting the norm of each per-sample gradient requires significantly more time and memory.

  • Adding Noise: Once clipped gradients are averaged across a batch, the DP-SGD algorithm adds carefully calibrated noise, typically sampled from the Gaussian distribution, to ensure privacy.
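The two steps above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the implementation of Abadi et al. (2016): the function names `clip` and `dp_sgd_gradient` are hypothetical, and the convention of adding noise with standard deviation `sigma * C` to the sum of clipped gradients before dividing by the batch size is an assumption.

```python
import math
import random

def clip(g, C):
    """Scale g so that its l2-norm is at most C."""
    norm = math.sqrt(sum(x * x for x in g))
    scale = min(1.0, C / norm) if norm > 0 else 1.0
    return [scale * x for x in g]

def dp_sgd_gradient(per_sample_grads, C, sigma, rng):
    """Noisy average of clipped per-sample gradients (hypothetical helper).

    Assumption: Gaussian noise with standard deviation sigma * C is added to
    the sum of clipped gradients, and the result is divided by the batch size.
    """
    B = len(per_sample_grads)
    d = len(per_sample_grads[0])
    clipped = [clip(g, C) for g in per_sample_grads]
    total = [sum(g[k] for g in clipped) for k in range(d)]
    noisy = [total[k] + rng.gauss(0.0, sigma * C) for k in range(d)]
    return [x / B for x in noisy]

rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4], [-6.0, 8.0]]  # toy per-sample gradients
print(dp_sgd_gradient(grads, C=1.0, sigma=1.0, rng=rng))
```

Note that the sketch takes the per-sample gradients as given; producing them (or at least their norms) is exactly the expensive part in a real framework.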

The analysis of DP-SGD in Abadi et al. (2016) then follows from a careful tracking of the privacy budget lost in each iteration, for which they introduced a novel technique called the moments accountant, which was later generalized as Renyi differential privacy by Mironov (2017). This analysis was further refined and improved using the $f$-DP framework by Dong et al. (2019) and Bu et al. (2019), which led to better privacy bounds. In this work, we use the $f$-DP framework of Dong et al. (2019) for our privacy analysis.

While DP-SGD has been shown to achieve a reasonable accuracy-vs-privacy tradeoff Bu et al. (2019); Abadi et al. (2016), and is arguably the only known algorithm for privately training deep neural networks, its use in real-world deep learning has been rather limited. One of the primary reasons for this is the training time of DP-SGD compared to SGD. In DP-SGD, per-sample gradients are computed at a heavy cost in runtime, especially for large deep learning models. The naive approach of setting the batch size to 1 is too slow to be practical as we completely lose the benefits of parallelization. This problem has attracted significant attention in the community and has been noted in popular implementations of DP-SGD including TensorFlow Privacy and Opacus (PyTorch DP). Many strategies have been proposed to circumvent this issue, and they fall broadly into the following categories:

  • Microbatching: The DP-SGD implementation in TensorFlow Privacy allows dividing a batch into several microbatches and clipping the gradient at the microbatch level; the per-sample gradients in a microbatch are first averaged and then clipped, and finally these clipped gradients are averaged across microbatches. Thus, if each microbatch is of size $k$, then it gives a speedup of $k$ over the usual DP-SGD. Unfortunately, the sensitivity goes up by a factor of $k$, so we need to add $k$ times more noise. In our experiments, we observe that this often leads to poor accuracy even for moderate values of $k$.

  • Multiple method: In this approach, proposed by Goodfellow and implemented as the vectorized DP-SGD in TensorFlow Privacy, one copies the model as many times as there are samples in a batch. As each copy of the model is used on only one example, we can compute the per-sample gradients in parallel. This approach improves speed at the cost of memory and is impractical for large models.

  • Outer product method: This strategy was proposed by Goodfellow (2015) for fully-connected networks and later generalized by Rochette et al. (2019) to include convolutional layers. The norms of per-sample gradients are computed exactly using outer products between the activations and backpropagated gradients across adjacent layers. In a very recent work, Lee and Kifer (2021) showed how to extend this approach to recurrent layers. A drawback of this approach is that it does not work for all network architectures in a black-box manner, and needs careful implementation for each network architecture. Furthermore, the current implementations of these methods require significantly more memory than non-private SGD as shown in Subramani et al. (2020).

  • Compiler Optimization: A completely different approach towards mitigating the problem was suggested by Subramani et al. (2020). They showed that by exploiting language primitives, such as vectorization, just-in-time compilation, and static graph optimization, one can implement DP-SGD algorithm significantly faster. They demonstrated these ideas in two frameworks: JAX and TensorFlow. While we believe this is exciting progress, the ideas in Subramani et al. (2020) are specific to these JAX and TensorFlow implementations (as of today) and present a non-algorithmic approach to this problem.
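To make the microbatching strategy from the list above concrete, the following sketch averages within each microbatch, clips the microbatch averages, and then averages across microbatches. It is a toy illustration under stated assumptions: `microbatch_gradient` is a hypothetical helper, and per-sample gradients are passed in explicitly only for readability (a real framework would compute each microbatch average directly, which is where the speedup comes from).

```python
import math

def clip(g, C):
    """Scale g so that its l2-norm is at most C."""
    norm = math.sqrt(sum(x * x for x in g))
    scale = min(1.0, C / norm) if norm > 0 else 1.0
    return [scale * x for x in g]

def mean(vectors):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def microbatch_gradient(per_sample_grads, k, C):
    """Clip at the microbatch level; returns (gradient, number of clip operations)."""
    microbatches = [per_sample_grads[s:s + k]
                    for s in range(0, len(per_sample_grads), k)]
    clipped = [clip(mean(mb), C) for mb in microbatches]
    return mean(clipped), len(clipped)

grads = [[3.0, 4.0], [3.0, 4.0], [0.0, 0.5], [0.0, 0.5]]
print(microbatch_gradient(grads, k=2, C=1.0))  # 2 norm computations
print(microbatch_gradient(grads, k=1, C=1.0))  # 4 norm computations (usual DP-SGD)
```

With microbatches of size `k`, only one norm per microbatch is needed, but a single sample can now influence a clipped vector that stands in for `k` samples, which is why the noise has to grow accordingly.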

Optimizers Privacy Speed Memory Generalizability
Non-DP SGD
DP-SGD-Vanilla
DP-SGD-Multiple
DP-SGD-Outer
JAX
DP-SGD-JL
Table 1: Summary of different methods for DP training of neural networks

As summarized in Table 1, none of these approaches for speeding up DP-SGD completely solves the problem; each falls short in at least one dimension. In this work, we propose a new algorithmic framework based on JL projections for fast and memory efficient differentially private training of deep neural networks, which bypasses the expensive step of exactly computing per-sample gradient norms.

1.1 Our Contributions and Techniques

The main idea of our algorithm is to approximate the per-sample gradient norms instead of computing them exactly. Johnson–Lindenstrauss (JL) projections provide a convenient way to approximate the norm of a vector: simply project the vector onto a uniformly random direction; the length of the projection (scaled appropriately) is a good approximation to the $\ell_2$-norm of the vector. By doing more such projections and averaging, we get an even better approximation to the true $\ell_2$-norm. Moreover, there is an efficient way to obtain such projections using forward-mode auto-differentiation or Jacobian-vector product (jvp) (see Section 2.1 for details). jvp can be calculated during the forward pass, making it very efficient. Since this makes the sensitivity itself a random variable, the privacy analysis is significantly harder than for the traditional DP-SGD. We use the recently proposed $f$-DP framework of Dong et al. (2019) for our analysis. Intuitively, $f$-DP captures the entire collection of $(\varepsilon,\delta)$-DP guarantees that each iteration of the algorithm satisfies and composes them optimally to find the best $(\varepsilon,\delta)$-DP guarantee for the final algorithm.


Figure 1: Performance of various algorithms training an RNN on the IMDb dataset using a batch size of 256. The privacy parameter $\varepsilon$ is color coded for a fixed $\delta$. In DP-Adam-JL($r$), $r$ refers to the JL dimension, i.e., the number of JL projections used to approximate per-sample gradient norms. DP-Adam-Vanilla (TFP) is the implementation of DP-Adam in the TensorFlow Privacy library.

To summarize, the key contributions of this paper are:

  • Our algorithms DP-SGD-JL and DP-Adam-JL are considerably faster than previously known differentially private training algorithms that require exact per sample gradient norms, and work for all network architectures. The privacy-vs-accuracy tradeoff achieved by our algorithms is comparable to the existing state-of-the-art DP-algorithms.

  • Memory footprint of our algorithms is nearly the same as that of non-private training algorithms. This allows us to work with larger batch sizes (which is crucial for achieving good privacy-vs-accuracy tradeoffs) without resorting to gradient accumulation. This also improves the running time of training.

  • Compared to DP-SGD, our analysis of privacy is more involved. Since we only approximate the per-sample gradient norms, we cannot precisely bound sensitivity. Therefore the analysis requires significantly new ideas, which could be of independent interest.

  • We demonstrate these improvements by training an RNN using layers such as bidirectional LSTM, embedding layer, fully connected etc. on the IMDb dataset. As can be seen from Figure 1, our algorithms are significantly faster than current implementations of DP-SGD while achieving similar privacy-vs-accuracy tradeoff.

  • Our algorithms introduce a new knob, the dimension of the JL projection, which allows us to trade off training time against privacy, which was not possible in earlier algorithms. All hyperparameters being the same, a smaller JL dimension gives a much better running time with an increase in privacy budget. Moreover, our experiments show that although the privacy bounds we could prove for DP-SGD-JL are not so great for very small JL dimensions (see Figure 3), their behavior (accuracy-vs-epoch curve) converges very quickly to that of DP-SGD-Vanilla. Figure 4 shows that DP-SGD-JL(3) is already very close to DP-SGD-Vanilla and DP-SGD-JL(20) is almost indistinguishable. We find these properties of DP-SGD-JL algorithms to be very useful during initial stages of experimentation and hyper-parameter tuning for private training.

2 DP-SGD-JL Algorithm

In this section we describe our new differentially private optimizers. We first present DP-SGD-JL; DP-Adam-JL follows along the same lines and is presented in Appendix A. We begin with an introduction to the “Jacobian-vector product” (jvp), which is crucial for our algorithm.

2.1 Jacobian-vector product (jvp)

Given a function $f:\mathbb{R}^d \to \mathbb{R}$, the gradient of $f$ with respect to $\theta$, denoted by $\nabla_\theta f$, is:

$$\nabla_\theta f = \left(\frac{\partial f}{\partial \theta_1}, \dots, \frac{\partial f}{\partial \theta_d}\right).$$

Let $F:\mathbb{R}^d \to \mathbb{R}^m$ be some function given by $F(\theta) = (F_1(\theta), \dots, F_m(\theta))$. The Jacobian of $F$, denoted by $J$, is the $m \times d$ matrix:

$$J = \begin{bmatrix} \nabla_\theta F_1 \\ \vdots \\ \nabla_\theta F_m \end{bmatrix}, \qquad J_{ij} = \frac{\partial F_i}{\partial \theta_j}.$$

Most auto-differentiation packages allow for calculating the vector-Jacobian product (vjp) given by $u^\top J$ for any $u \in \mathbb{R}^m$ efficiently using reverse-mode auto-differentiation, which is the familiar ‘backpropagation’. (Footnote 2: In PyTorch, $u^\top J$ can be calculated as autograd.grad(F, θ, grad_outputs=u). In TensorFlow, this is tf.GradientTape().gradient(F, θ, output_gradients=u).) One can also calculate the Jacobian-vector product (jvp) given by $Jv$ for any $v \in \mathbb{R}^d$ efficiently using forward-mode auto-differentiation (i.e., the required derivatives are calculated during the forward pass of the network). We refer the reader to the survey on automatic differentiation by Baydin et al. (2017) for more details. jvp is implemented using forward-mode auto-differentiation in recent TensorFlow versions. (Footnote 3: Supported in tf-nightly 2.4.0.dev20200924 as tf.autodiff.ForwardAccumulator(θ, v).jvp(F).) At the time of writing, PyTorch does not have an implementation of forward-mode auto-differentiation. Instead, one can compute jvp using two calls to vjp; this is called the ‘double vjp trick’ (see Townsend). (Footnote 4: In PyTorch, an implementation of jvp using the double vjp trick exists and can be invoked as torch.autograd.functional.jvp(F, θ, v).) Define $g(u) = u^\top J$, which can be calculated using vjp. Note that $g$ is linear in $u$, with $\frac{\partial g}{\partial u} = J^\top$. Now we can use vjp again on $g$ to calculate jvp as

$$v^\top \frac{\partial g}{\partial u} = v^\top J^\top = (Jv)^\top.$$

In our experiments, we use the efficient implementation of jvp in TensorFlow to compute the Jacobian-vector products, as the double vjp trick is a few times slower.
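As a dependency-free illustration of why a jvp costs only one forward pass, the sketch below implements forward-mode differentiation with dual numbers and applies it to a vector of per-sample losses. This is not the TensorFlow or PyTorch implementation; the `Dual` class, the `jvp` helper and the toy linear model are all assumptions made for the example.

```python
class Dual:
    """Number val + tan * eps with eps**2 = 0; tan carries a directional derivative."""
    def __init__(self, val, tan=0.0):
        self.val, self.tan = val, tan

    @staticmethod
    def _lift(other):
        return other if isinstance(other, Dual) else Dual(other)

    def __add__(self, other):
        o = Dual._lift(other)
        return Dual(self.val + o.val, self.tan + o.tan)
    __radd__ = __add__

    def __sub__(self, other):
        o = Dual._lift(other)
        return Dual(self.val - o.val, self.tan - o.tan)

    def __rsub__(self, other):
        return Dual._lift(other) - self

    def __mul__(self, other):
        o = Dual._lift(other)
        # Product rule: (a + a' eps)(b + b' eps) = ab + (a b' + a' b) eps
        return Dual(self.val * o.val, self.val * o.tan + self.tan * o.val)
    __rmul__ = __mul__

def jvp(F, theta, v):
    """Return J v, where J is the Jacobian of F at theta, using one forward pass."""
    duals = [Dual(t, vi) for t, vi in zip(theta, v)]
    return [out.tan for out in F(duals)]

def per_sample_losses(theta, data=((1.0, 2.0, 3.0), (0.5, -1.0, 0.0))):
    """F(theta): squared error of a toy linear model on each sample (x1, x2, y)."""
    losses = []
    for x1, x2, y in data:
        residual = theta[0] * x1 + theta[1] * x2 - y
        losses.append(residual * residual)
    return losses

theta = [0.5, -0.25]
v = [1.0, 2.0]
# Entry i equals <gradient of the i-th per-sample loss, v>.
print(jvp(per_sample_losses, theta, v))
```

Each output entry is the projection of one per-sample gradient onto `v`, obtained without ever materializing the per-sample gradients themselves.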

2.2 Algorithm

Figure 2: Distribution of $\frac{1}{r}\chi_r^2$ for different $r$

The main idea of our algorithm is to approximate the $\ell_2$-norms of per-sample gradients instead of computing them exactly. The key tool to achieve this is JL projections.

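Proposition 2.1 (JL projections).

Let $g \in \mathbb{R}^d$ be any vector. If $v_1, \dots, v_r \sim N(0, I_d)$ are independent standard Gaussian vectors, then $M^2 = \frac{1}{r}\sum_{j=1}^{r} \langle g, v_j \rangle^2$ has the same distribution as $\frac{\|g\|_2^2}{r}\chi_r^2$. (Footnote 5: $\chi_r^2$ is the chi-square distribution with $r$ degrees of freedom, which is the distribution of the sum of squares of $r$ standard Gaussians.) In particular, $\mathbb{E}[M^2] = \|g\|_2^2$.

Proof.

By the properties of the standard Gaussian distribution, $\langle g, v_j \rangle$ has the distribution of $N(0, \|g\|_2^2)$, and the $\langle g, v_j \rangle$ are independent for $j = 1$ to $r$. Therefore $\frac{1}{r}\sum_{j=1}^{r} \langle g, v_j \rangle^2$ has the same distribution as $\frac{\|g\|_2^2}{r}\chi_r^2$. ∎

As shown in Figure 2, as $r$ grows larger, the distribution of $\frac{1}{r}\chi_r^2$ concentrates more around 1 and therefore $M$ becomes a better estimate of $\|g\|_2$.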

Using jvp, we can compute projections of per-sample gradients on to standard Gaussian vectors quickly and therefore get good approximations to their norms. This is the main idea of DP-SGD-JL (Algorithm 1). The privacy analysis of our algorithm is quite involved and is presented in Section 4.
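The concentration described in Proposition 2.1 is easy to check empirically. The sketch below is a small simulation, not part of the paper's experiments: the helper name `jl_norm_estimate`, the vector dimension, the number of trials and the random seed are all arbitrary choices made for illustration.

```python
import math
import random

def jl_norm_estimate(g, r, rng):
    """Estimate ||g||_2 as sqrt((1/r) * sum_j <g, v_j>^2) with v_j ~ N(0, I)."""
    total = 0.0
    for _ in range(r):
        proj = sum(x * rng.gauss(0.0, 1.0) for x in g)
        total += proj * proj
    return math.sqrt(total / r)

rng = random.Random(0)
g = [rng.uniform(-1.0, 1.0) for _ in range(20)]
true_norm = math.sqrt(sum(x * x for x in g))

for r in (1, 5, 30, 100):
    ratios = [jl_norm_estimate(g, r, rng) / true_norm for _ in range(200)]
    avg = sum(ratios) / len(ratios)
    spread = math.sqrt(sum((x - avg) ** 2 for x in ratios) / len(ratios))
    print(f"r={r:3d}  mean ratio={avg:.3f}  std of ratio={spread:.3f}")
```

The spread of the ratio between the estimate and the true norm shrinks as the number of projections `r` grows, which is the behavior the algorithm relies on.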

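Input: Examples $\{x_1, \dots, x_N\}$, loss function $L(\theta; x)$, initialization $\theta_0$.
Parameters: number of iterations $T$, learning rates $\eta_t$, noise scale $\sigma$, batch size $B$, clipping norm $C$, number of JL projections $r$.
for $t = 0$ to $T-1$ do
     Sample a batch $S_t = \{x_{t_1}, \dots, x_{t_B}\}$ uniformly at random.
     Define $F(\theta) = (L(\theta; x_{t_1}), \dots, L(\theta; x_{t_B}))$
     Sample $v_1, \dots, v_r \sim N(0, I_d)$ (where $\theta \in \mathbb{R}^d$)    ▷ JL projections to estimate per-sample gradient norm
     for $j = 1$ to $r$ do
          $P_j = J v_j$, where $J$ is the Jacobian of $F$ at $\theta_t$ (using jvp)    ▷ Note that $P_{ji} = \langle \nabla_\theta L(\theta_t; x_{t_i}), v_j \rangle$
     end for
     for $i = 1$ to $B$ do
         Set $M_i = \sqrt{\frac{1}{r}\sum_{j=1}^{r} P_{ji}^2}$    ▷ $M_i$ is an estimate for $\|\nabla_\theta L(\theta_t; x_{t_i})\|_2$
     end for
     Define $\tilde{L}(\theta) = \sum_{i=1}^{B} \min\{1, C/M_i\}\, L(\theta; x_{t_i})$    ▷ Scale the losses to clip per-sample gradients
     $\tilde{g}_t = \frac{1}{B}\left(\nabla_\theta \tilde{L}(\theta_t) + \sigma C \cdot N(0, I_d)\right)$    ▷ Add noise to the average of clipped gradients
     Update $\theta_{t+1} = \theta_t - \eta_t \tilde{g}_t$
end for
Output: $\theta_T$
Algorithm 1 Differentially private SGD using JL projections (DP-SGD-JL)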
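One iteration of Algorithm 1 can be sketched in plain Python on a toy least-squares problem. This is a simplified sketch under several assumptions, not the authors' TensorFlow implementation: the projections $\langle \nabla L_i, v_j \rangle$ are computed from explicit per-sample gradients only to keep the example dependency-free (the algorithm obtains them with jvp precisely to avoid forming those gradients), and the noise convention (standard deviation `sigma * C` added to the sum before dividing by the batch size) is assumed. The names `per_sample_grad` and `dp_sgd_jl_step` are hypothetical.

```python
import math
import random

def per_sample_grad(theta, x, y):
    """Gradient of the squared error (theta . x - y)^2 for one sample."""
    residual = sum(t * xi for t, xi in zip(theta, x)) - y
    return [2.0 * residual * xi for xi in x]

def dp_sgd_jl_step(theta, batch, C, sigma, r, lr, rng):
    """One DP-SGD-JL update; returns the new parameters and the norm estimates M."""
    d, B = len(theta), len(batch)
    grads = [per_sample_grad(theta, x, y) for x, y in batch]
    # JL projections P[j][i] = <grad_i, v_j>. A real implementation would get
    # each row P[j] from a single jvp instead of from explicit gradients.
    vs = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(r)]
    P = [[sum(g[k] * v[k] for k in range(d)) for g in grads] for v in vs]
    # M[i] estimates the l2-norm of the i-th per-sample gradient.
    M = [math.sqrt(sum(P[j][i] ** 2 for j in range(r)) / r) for i in range(B)]
    weights = [min(1.0, C / m) if m > 0 else 1.0 for m in M]
    # Gradient of the reweighted loss = sum of approximately clipped gradients.
    total = [sum(w * g[k] for w, g in zip(weights, grads)) for k in range(d)]
    noisy = [(total[k] + sigma * C * rng.gauss(0.0, 1.0)) / B for k in range(d)]
    return [t - lr * g for t, g in zip(theta, noisy)], M

rng = random.Random(0)
batch = [([1.0, 0.0], 1.0), ([0.0, 1.0], 2.0), ([1.0, 1.0], 3.0)]
theta = [0.0, 0.0]
for _ in range(5):
    theta, M = dp_sgd_jl_step(theta, batch, C=1.0, sigma=0.5, r=10, lr=0.1, rng=rng)
print(theta, M)
```

Because the clipping weights depend on the random estimates `M` rather than on the exact norms, a clipped gradient can occasionally exceed norm `C`; this is why the sensitivity becomes a random variable in the privacy analysis.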

3 Experiments

In this section, we demonstrate experimentally that compared to existing implementations of DP-SGD with exact per-sample gradient clipping, our optimizers have significant advantages in speed and memory cost while achieving a comparable accuracy-vs-privacy tradeoff. Moreover, our algorithms perform well on a variety of network architectures. The main goal of this section is to give empirical evidence for the following three strengths of our algorithm alluded to in the introduction:

  1. Our algorithm is significantly faster compared to per-sample gradient computations and works for any network in a black-box way.

  2. Memory footprint of our algorithm is roughly same as non-private SGD.

  3. The DP-SGD-JL algorithms with smaller values of JL dimension exhibit similar behavior as DP-SGD but with orders of magnitude speed up, and hence can be used for hyper-parameter search.

In the following, we write ‘nonDP-SGD’ for the standard non-private SGD and ‘DP-SGD-Vanilla’ for the implementation of DP-SGD in TensorFlow Privacy; nonDP-Adam and DP-Adam-Vanilla are defined similarly. We use TensorFlow and TensorFlow Privacy for all our experiments because Opacus does not support arbitrary network architectures. (Footnote 6: In the PyTorch Opacus GitHub repository, the LSTM layer is only partially supported, e.g. single directional, single LSTM layer, no dropout layer; other recurrent layers such as GRU are not supported (see Opacus).) Moreover, TensorFlow has an efficient implementation of jvp while PyTorch doesn’t.

Notation:

We denote the noise multiplier as $\sigma$, clipping norm as $C$, batch size as $B$, learning rate as $\eta$ and the number of epochs as $E$. We fix the privacy parameter $\delta$, as done by prior work. We denote the JL dimension used by each optimizer in parentheses. We use one Tesla P100 16GB GPU for all experiments. In all the experiments, we report the time per epoch by averaging over a large number of epochs.

3.1 Training an LSTM model on IMDb dataset

Optimizer          Seconds per epoch
Non-DP Adam        12
DP-Adam-Vanilla    19317
DP-Adam-JL(1)      41
DP-Adam-JL(5)      128
DP-Adam-JL(10)     243
DP-Adam-JL(30)     675
Table 2: Seconds per epoch to train our RNN with 598,274 parameters and 25,000 training samples.

The goal of these experiments is to demonstrate the first strength of our algorithm: it is significantly faster than per-sample gradient computations and works for any network in a black-box way. We train a bidirectional LSTM with an embedding layer on the IMDb dataset for sentiment analysis. We remark that simpler networks not based on RNNs can achieve a good accuracy-vs-privacy tradeoff as shown in Bu et al. (2019) and Pytorch Team. However, LSTM models based on RNN architectures are widely used in NLP applications, and hence efficient private training of such models remains an important challenge in this area McMahan et al. (2018). Moreover, as we noted in the introduction, extensions of the outer product trick to bidirectional LSTMs as described in Lee and Kifer (2021) are significantly more complicated, and require considerable effort to implement. Since the authors of Lee and Kifer (2021) did not provide the code, we are unable to compare against their improvements. Moreover, as also noted in Lee and Kifer (2021), the outer product method requires significantly more memory and hence will not scale to large batch sizes, which are very important for achieving a good privacy-vs-utility tradeoff.

We implement DP-Adam-JL using the jvp method in TensorFlow. We train the same single-layer bidirectional LSTM as in the Tensorflow Team tutorial, using the same IMDb dataset with 8k vocabulary. (Footnote 7: We cannot use CuDNNLSTM and instead use LSTM for the following reason. When using CuDNNLSTM on GPU, we observe a significant speedup compared to LSTM, but the accuracy is invalid and we further incur a LookupError when computing jvp.) The dataset has binary labels and is preprocessed by zero-padding the sequences to a length of 150 before being fed into the embedding layer.

Figure 3: Privacy curves for various algorithms while training the LSTM model on IMDb dataset. Here is fixed.

Table 2 shows the training time per epoch for different algorithms. As expected, we observe that DP-Adam-JL algorithms with smaller values of JL dimension are significantly faster than DP-Adam-Vanilla. However, as shown in Figure 3, the privacy guarantees of the DP-Adam-JL algorithm with smaller values of JL dimension are considerably worse than those of DP-Adam-Vanilla. On the other hand, DP-Adam-JL(30) is faster than DP-Adam-Vanilla while achieving similar privacy guarantees. When allowed to train for a sufficient number of epochs, we observed that all the algorithms achieved the same accuracy but with different privacy costs and running times. This three-dimensional tradeoff between utility, privacy and speed is depicted in Figure 1 (see introduction), where we plot the privacy values using a color plot.

Remark 3.1.

As we can see from Table 2, for a batch size of 256, the slowdown of DP-Adam-Vanilla compared to nonDP-Adam is more than a factor of 256. This is counterintuitive, as a naive implementation of DP-Adam-Vanilla, setting the batch size equal to 1 and then doing gradient accumulation across 256 batches, should only be 256 times slower. As we show below, this is due to the memory issues of DP-Adam-Vanilla in the TensorFlow Privacy implementation. Indeed, the naive implementation of DP-Adam-Vanilla in TensorFlow using gradient accumulation takes about 4000 seconds per epoch for the same batch size.
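The ratios discussed in this remark follow directly from the timings in Table 2. The short script below simply redoes that arithmetic; the dictionary and variable names are ad hoc, and the numbers are copied from Table 2.

```python
# Seconds per epoch, copied from Table 2.
seconds_per_epoch = {
    "Non-DP Adam": 12,
    "DP-Adam-Vanilla": 19317,
    "DP-Adam-JL(1)": 41,
    "DP-Adam-JL(5)": 128,
    "DP-Adam-JL(10)": 243,
    "DP-Adam-JL(30)": 675,
}

baseline = seconds_per_epoch["Non-DP Adam"]
vanilla = seconds_per_epoch["DP-Adam-Vanilla"]

# Slowdown relative to non-private Adam and speedup relative to DP-Adam-Vanilla.
slowdown_vs_non_dp = {name: s / baseline for name, s in seconds_per_epoch.items()}
speedup_vs_vanilla = {name: vanilla / s for name, s in seconds_per_epoch.items()}

for name in seconds_per_epoch:
    print(f"{name:16s} slowdown vs non-DP: {slowdown_vs_non_dp[name]:8.1f}x   "
          f"speedup vs vanilla: {speedup_vs_vanilla[name]:7.1f}x")
```

In particular, the slowdown of DP-Adam-Vanilla over non-private Adam comes out well above the batch size of 256, which is the anomaly the remark points out.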

3.2 Memory footprint

Another key strength of JL-based algorithms is their memory consumption, and the goal of this section is to show this aspect of our new algorithms via experiments. As a proxy for memory consumption, we compare the largest batch size each algorithm can handle without running out of memory. It is known that to achieve good privacy-vs-accuracy tradeoffs for DP-SGD, we need to use large batch sizes Abadi et al. (2016). One way to support large batch sizes is via gradient accumulation; however, this has the disadvantage that one loses parallelism, which in turn leads to slower run times. Hence the memory footprint of an algorithm also indirectly affects the training time.

We compare our JL algorithms with the implementation of DP-SGD-Vanilla in TensorFlow Privacy. We train a convolutional neural network from the TensorFlow Privacy tutorial on the MNIST dataset, which has 60,000 training samples. (Footnote 8: We use the implementation and the network from mnist_dpsgd_tutorial_keras.py.) As we can see from Table 3, the DP-SGD-JL algorithm and nonDP-SGD can both run with the maximum possible batch size of 60,000, whereas DP-SGD-Vanilla can only handle a batch size of at most 512. In general, we believe that the memory footprint of the DP-SGD-JL algorithm is very close to that of non-private SGD. To show this, we augment the CNN in the TensorFlow Privacy tutorial with dense layers to blow up the model size to 17,146,938 parameters, and repeat the experiment. As we see from Table 4, the largest batch size supported by DP-SGD-JL(30) is only about a factor of 2 away from the largest batch size supported by non-DP SGD. On the other hand, we observe that DP-SGD-Vanilla only supports a batch size of 100.

Optimizer          Maximum batch size
nonDP-SGD          60,000
DP-SGD-Vanilla     512
DP-SGD-JL(10)      60,000
DP-SGD-JL(30)      60,000
Table 3: Maximum batch size supported by various algorithms on a CNN with 26,010 parameters trained on MNIST with 60,000 training samples. Training done on one Tesla P100 16GB GPU.

Optimizer          Maximum batch size
Non-DP SGD         52,000
DP-SGD-Vanilla     100
DP-SGD-JL(10)      28,000
DP-SGD-JL(30)      25,000
Table 4: Maximum batch size supported by various algorithms on a CNN with 17,146,938 parameters trained on MNIST with 60,000 training samples. Training done on one Tesla P100 16GB GPU.

The above experiments also give a possible explanation of why DP-SGD-Vanilla implementation in TFP has a slowdown that is larger than the batch size, as we observed in the LSTM experiments. Even for MNIST, we observe that DP-SGD-Vanilla running time gets better with batch size in the very beginning but as the batch size becomes larger the running time gets worse, and soon after it runs out of memory.

3.3 Using DP-SGD-JL for hyper-parameter search

Figure 4: Behavior of various private algorithms on the MNIST dataset. The test accuracy is averaged over 25 runs. Note that the DP-SGD-JL(20) curve nearly overlaps with the curve for DP-SGD-Vanilla.

As we saw in our experiments summarized in Table 2, our algorithms with small values of JL dimension are orders of magnitude faster than DP-Adam-Vanilla; DP-Adam-JL(1) is about 470x faster and DP-Adam-JL(5) is about 150x faster. Unfortunately, the privacy bounds we can prove for these algorithms are considerably worse than those of DP-Adam-Vanilla. Despite this drawback, we observe that the behavior of DP-SGD-JL, even for small JL dimension, is very close to that of DP-SGD-Vanilla. Figure 4 plots the accuracy vs epochs for various algorithms training a CNN from the TensorFlow Privacy tutorial on the MNIST dataset. We observe that as the JL dimension increases, the accuracy-vs-epoch curve quickly converges to that of DP-SGD-Vanilla. DP-SGD-JL(3) is already very similar to DP-SGD-Vanilla and DP-SGD-JL(20) is nearly indistinguishable. This also lets us hypothesize that the privacy of our algorithms could be much better than what we could prove and that it should converge equally quickly to that of DP-SGD-Vanilla. Thus we believe that DP-SGD-JL(3) or DP-SGD-JL(5) are good candidates for experimentation and hyper-parameter tuning during private training, since their behavior is almost identical to that of DP-SGD-Vanilla while being orders of magnitude faster.

4 Privacy Analysis

4.1 $f$-DP preliminaries

We use the recently proposed $f$-DP framework of Dong et al. (2019) for our privacy analysis. The $f$-DP framework allows us to reason about a collection of $(\varepsilon,\delta)$-privacy guarantees simultaneously, which can then be composed to get a much better $(\varepsilon,\delta)$-privacy guarantee for the final algorithm. We will first define the notion of $(\varepsilon,\delta)$-DP formally and then define the notion of $f$-DP. We then state a proposition from Dong et al. (2019) which shows that these two notions are dual to each other.

Definition 4.1 ($(\varepsilon,\delta)$-DP).

We say that an algorithm $\mathcal{A}$ is $(\varepsilon,\delta)$-DP if for any two neighboring databases $D, D'$ and any subset $S$ of outputs, we have

$$\Pr[\mathcal{A}(D) \in S] \le e^{\varepsilon} \Pr[\mathcal{A}(D') \in S] + \delta.$$

For each value of $\varepsilon \ge 0$, there exists some $\delta \in [0,1]$ such that $\mathcal{A}$ is $(\varepsilon,\delta)$-DP. We can represent all these privacy guarantees by a function $\delta:[0,\infty) \to [0,1]$ and say that $\mathcal{A}$ is $(\varepsilon,\delta(\varepsilon))$-DP for each $\varepsilon \ge 0$. We will now see that there is a dual way to represent the function $\delta(\cdot)$, called $f$-DP. To introduce this, we will need the notion of a tradeoff function.

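Definition 4.2 (Tradeoff function).

Let $P, Q$ be two distributions over some domain $\Omega$. Let $\phi:\Omega \to \{P, Q\}$ be a (possibly randomized) prediction rule which, given a sample $x \in \Omega$, predicts which distribution the sample came from. The type I error is defined as $\alpha_\phi = \Pr_{x \sim P}[\phi(x) = Q]$ and the type II error is defined as $\beta_\phi = \Pr_{x \sim Q}[\phi(x) = P]$. The tradeoff function $T(P,Q):[0,1] \to [0,1]$ is defined as:

$$T(P,Q)(\alpha) = \inf_{\phi}\{\beta_\phi : \alpha_\phi \le \alpha\}. \qquad (1)$$

Given two random variables $X, Y$, we define $T(X,Y)$ to be $T(P,Q)$ where $P, Q$ are the distributions of $X, Y$ respectively.

Note that if $T(P,Q) = f$, then $T(Q,P) = f^{-1}$. A tradeoff curve $f$ is called symmetric if $f = f^{-1}$. Given two functions $f, g$ on the same domain, we say $f \ge g$ if $f(x) \ge g(x)$ for all $x$ in the domain. $f \le g$ is similarly defined.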

Definition 4.3 ($f$-DP).

We say an algorithm $\mathcal{A}$ is $f$-differentially private if for every two neighboring databases $D, D'$, we have $T(\mathcal{A}(D), \mathcal{A}(D')) \ge f$.

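Proposition 4.1 (Duality of $f$-DP and $(\varepsilon,\delta)$-DP from Dong et al. (2019)).

Suppose $\mathcal{A}$ satisfies $f$-DP where $f$ is symmetric (i.e., $f = f^{-1}$). Let $\alpha(\varepsilon)$ be such that $f'(\alpha(\varepsilon)) = -e^{\varepsilon}$. Then $\mathcal{A}$ satisfies $(\varepsilon,\delta(\varepsilon))$-DP for every $\varepsilon \ge 0$ where:

$$\delta(\varepsilon) = 1 - f(\alpha(\varepsilon)) - e^{\varepsilon}\alpha(\varepsilon).$$

So the tangent to $f$ at $\alpha(\varepsilon)$ has slope $-e^{\varepsilon}$ and $y$-intercept $1 - \delta(\varepsilon)$ (see Figure 5).

Figure 5: Duality of $f$-DP and $(\varepsilon,\delta)$-DP. Every tangent to the $f$-DP curve gives an $(\varepsilon,\delta)$-DP guarantee where the slope is $-e^{\varepsilon}$ and the $y$-intercept is $1-\delta$.

Note that by convexity of $f$, $f'(\alpha)$ is increasing in $\alpha$, and by symmetry $f'(\alpha) = -1$ at the fixed point $\alpha = f(\alpha)$. Therefore Proposition 4.1 covers the entire range of $\varepsilon \ge 0$.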
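The tangent picture can be checked numerically: for a convex tradeoff function $f$, the $(\varepsilon,\delta)$ guarantee is $\delta(\varepsilon) = \max_\alpha \,[1 - f(\alpha) - e^{\varepsilon}\alpha]$, attained where the slope of $f$ equals $-e^{\varepsilon}$. The sketch below is an illustration only. It assumes the Gaussian tradeoff function $G_\mu(\alpha) = \Phi(\Phi^{-1}(1-\alpha) - \mu)$ between $N(0,1)$ and $N(\mu,1)$ as the example $f$, together with its closed-form $\delta(\varepsilon)$ from Dong et al. (2019); neither appears in the text above, and the function names are hypothetical.

```python
import math
from statistics import NormalDist

STD = NormalDist()  # standard Gaussian, provides cdf (Phi) and inv_cdf

def gaussian_tradeoff(alpha, mu):
    """G_mu(alpha) = Phi(Phi^{-1}(1 - alpha) - mu), an example symmetric tradeoff function."""
    return STD.cdf(STD.inv_cdf(1.0 - alpha) - mu)

def delta_from_tradeoff(f, eps, grid=20000):
    """delta(eps) = max over alpha of 1 - f(alpha) - e^eps * alpha, by grid search."""
    best = 0.0
    for k in range(1, grid):
        alpha = k / grid
        best = max(best, 1.0 - f(alpha) - math.exp(eps) * alpha)
    return best

def delta_gaussian_closed_form(eps, mu):
    """Closed form for G_mu: Phi(-eps/mu + mu/2) - e^eps * Phi(-eps/mu - mu/2)."""
    return STD.cdf(-eps / mu + mu / 2) - math.exp(eps) * STD.cdf(-eps / mu - mu / 2)

mu = 1.0
for eps in (0.0, 0.5, 1.0, 2.0):
    numeric = delta_from_tradeoff(lambda a: gaussian_tradeoff(a, mu), eps)
    exact = delta_gaussian_closed_form(eps, mu)
    print(f"eps={eps:.1f}  delta (tangent search)={numeric:.6f}  delta (closed form)={exact:.6f}")
```

The same grid search works for any tradeoff function that can be evaluated pointwise, which is useful when no closed form is available.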

Proposition 4.2 (Post-processing).

Let $X, Y$ be two random variables supported on $\Omega$ and let $\phi:\Omega \to \Omega'$ be some randomized function. Then

$$T(\phi(X), \phi(Y)) \ge T(X, Y).$$

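Proposition 4.3.

Let $(X, \tilde{X})$ and $(Y, \tilde{Y})$ be pairs of random variables such that $\tilde{X} \mid X = x$ has the same distribution as $\tilde{Y} \mid Y = x$ for all $x$. Then

$$T((X, \tilde{X}), (Y, \tilde{Y})) = T(X, Y).$$

Proof.

By post-processing (Proposition 4.2), $T(X, Y) \ge T((X, \tilde{X}), (Y, \tilde{Y}))$. Let $Z_x$ be a random variable which has the distribution of $\tilde{X} \mid X = x$ and $\tilde{Y} \mid Y = x$. Let $\phi(x) = (x, Z_x)$. Then $\phi(X)$ has the distribution of $(X, \tilde{X})$ and $\phi(Y)$ has the distribution of $(Y, \tilde{Y})$. Therefore, by post-processing (Proposition 4.2), we have the inequality in the other direction. ∎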

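Proposition 4.4 (Composition Dong et al. (2019)).

Let $X_1, \dots, X_k$ be independent random variables and let $Y_1, \dots, Y_k$ be independent random variables. Then

$$T((X_1, \dots, X_k), (Y_1, \dots, Y_k)) = T(X_1, Y_1) \otimes \dots \otimes T(X_k, Y_k),$$

where $\otimes$ is a commutative, associative operation on functions from $[0,1] \to [0,1]$.

For any random variable $X$, we have $T(X, X) = \mathrm{Id}$ where $\mathrm{Id}$ is defined as $\mathrm{Id}(\alpha) = 1 - \alpha$. The function $\mathrm{Id}$ is the identity for the $\otimes$ operation, i.e., $f \otimes \mathrm{Id} = f$ for all $f$.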

We will need the following proposition, which explains how subsampling affects the privacy curve.

Proposition 4.5 (Dong et al. (2019)).

Let and let . Then .

Though Eqn (1) defining the tradeoff curve requires us to take an infimum over a large class of tests, the Neyman-Pearson lemma gives a very simple form for the optimal tests to use.

Proposition 4.6 (Neyman-Pearson lemma).

Let $P, Q$ be two continuous distributions over some domain $\Omega$ with densities $p, q$. The type I error vs type II error tradeoff function between $P$ and $Q$ is attained by the Neyman-Pearson tests, which are a single-parameter family of tests $\phi_t$ defined as: $\phi_t(x)$ predicts $Q$ if $q(x)/p(x) > t$, and predicts $P$ otherwise.
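As a sanity check of the Neyman-Pearson lemma in the simplest case, take $P = N(0,1)$ and $Q = N(\mu,1)$ with $\mu > 0$ (an example chosen here for illustration, not taken from the text). The likelihood ratio $q(x)/p(x) = \exp(\mu x - \mu^2/2)$ is increasing in $x$, so the Neyman-Pearson tests are simply threshold tests on $x$. The sketch below sweeps the threshold and confirms that the resulting (type I, type II) error pairs lie on the curve $\beta = \Phi(\Phi^{-1}(1-\alpha) - \mu)$; the helper name `np_test_errors` is hypothetical.

```python
from statistics import NormalDist

STD = NormalDist()  # standard Gaussian, provides cdf (Phi) and inv_cdf

def np_test_errors(t, mu):
    """Errors of the test 'predict Q = N(mu, 1) when x > t' against P = N(0, 1)."""
    type1 = 1.0 - STD.cdf(t)   # x drawn from P, but the test predicts Q
    type2 = STD.cdf(t - mu)    # x drawn from Q, but the test predicts P
    return type1, type2

mu = 1.0
for t in (-1.0, 0.0, 0.5, 1.0, 2.0):
    alpha, beta = np_test_errors(t, mu)
    # Tradeoff curve evaluated at the same type I error.
    curve = STD.cdf(STD.inv_cdf(1.0 - alpha) - mu)
    print(f"t={t:+.1f}  alpha={alpha:.4f}  beta={beta:.4f}  curve(alpha)={curve:.4f}")
```

Sweeping the single parameter `t` therefore traces out the entire tradeoff curve, which is what makes the infimum in Eqn (1) tractable.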

Using the Neyman-Pearson lemma, we will prove the following lemma which is crucial for our privacy analysis.

Lemma 4.1.

Let be some random variable and let . Then where:

and is the CDF of standard Gaussian.

Proof.

Denote . From Neyman-Pearson lemma (Proposition 4.6), the type I/II errors are

and

We will now prove another key lemma which is useful for our privacy analysis.

Lemma 4.2.

Let be two random variables such that there exists some coupling with Then .

Proof.

We will prove this by post-processing (Proposition 4.2). Define a randomized map

where , i.e., is sampled from the conditional distribution of given Note that because the coupling satisfies Now it is easy to verify that and

4.2 Proof of privacy for Algorithm 1

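Input: Vectors $g_1, \dots, g_B \in \mathbb{R}^d$, clipping norm $C$, noise scale $\sigma$, number of JL projections $r$.
Sample $v_1, \dots, v_r \sim N(0, I_d)$    ▷ JL projections to estimate per-sample gradient norm
For $i = 1, \dots, B$, set $M_i = \sqrt{\frac{1}{r}\sum_{j=1}^{r} \langle g_i, v_j \rangle^2}$    ▷ $M_i$ is an estimate for $\|g_i\|_2$
Output: $\sum_{i=1}^{B} \min\{1, C/M_i\}\, g_i + \sigma C \cdot N(0, I_d)$
Algorithm 2 Subroutine of Algorithm 1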

We will first analyze the privacy of a crucial subroutine used in Algorithm 1, which is shown in Algorithm 2.

Lemma 4.3.

Algorithm 2 is -DP with

where Moreover can be parametrized as for where:

Proof.

Let be the output of Algorithm 2 with input and let be the output of Algorithm 2 with input . We want to show that . We have

By post-processing property (Proposition 4.2), we have

where

By Proposition 4.3,

Let be a rotation matrix which rotates to where Let and . Since is a fixed bijective map,

Because of rotation invariance, has the same distribution as So,

The coordinates are independent of each other and . Similarly the coordinates are also independent of each other and . Moreover and has the same distribution for Therefore by Proposition 4.4,

Let . We can further simplify this using Proposition 4.3 as:

We can simplify further by using the fact that has the same distribution as . Therefore has the same distribution as

Let and so By Lemma 4.2,

Therefore, this proves that where The parametrization follows from Lemma 4.1. ∎

Theorem 4.1.

Let where is the batch size and is the total number of samples. Then Algorithm 1 is -DP with where and is the double convex conjugate of (i.e., the greatest convex lower bound for ).
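When a privacy curve is only available on a grid, the greatest convex lower bound mentioned in the theorem can be computed as a lower convex hull. The sketch below is one way to do this; the function name `convex_lower_envelope` and the grid representation are assumptions for illustration and are not taken from the paper's implementation.

```python
def convex_lower_envelope(xs, ys):
    """Greatest convex function lying below the points (xs[i], ys[i]), evaluated on xs.

    xs must be strictly increasing. Uses a monotone-chain lower hull followed by
    linear interpolation between consecutive hull points.
    """
    hull = []  # indices of the points on the lower hull
    for i in range(len(xs)):
        # Drop the last hull point while it lies on or above the chord hull[-2] -> i.
        while len(hull) >= 2:
            a, b = hull[-2], hull[-1]
            cross = (xs[b] - xs[a]) * (ys[i] - ys[a]) - (ys[b] - ys[a]) * (xs[i] - xs[a])
            if cross <= 0:
                hull.pop()
            else:
                break
        hull.append(i)
    out = [float(y) for y in ys]
    for a, b in zip(hull, hull[1:]):
        for i in range(a, b + 1):
            w = (xs[i] - xs[a]) / (xs[b] - xs[a])
            out[i] = (1 - w) * ys[a] + w * ys[b]
    return out


# A non-convex curve (for example, the pointwise minimum of two convex curves).
grid = [0.0, 0.25, 0.5, 0.75, 1.0]
values = [1.0, 0.3, 0.5, 0.1, 0.0]
envelope = convex_lower_envelope(grid, values)
```

The output agrees with the input wherever the input is already convex and replaces each non-convex stretch by the chord beneath it.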

Proof.

Algorithm 1 can be thought of as an adaptive composition of iterations of Algorithm 2, where the input to Algorithm 2 in each iteration is subsampled from the entire input with sampling probability . We already showed in Lemma 4.3 that Algorithm 2 satisfies -DP with as claimed. The rest of the proof is very similar to that of Theorem 3 in Bu et al. (2019), which itself builds on a similar theorem in Dong et al. (2019). It proceeds by applying Proposition 4.5 to understand the effect of subsampling and an adaptive version of the composition in Proposition 4.4 to compose the privacy curves in all the iterations. ∎

One could hope to use the central limit theorem for composition from Dong et al. (2019); Bu et al. (2019) to find an approximate closed-form expression for the final privacy curve. Unfortunately, these central limit theorems do not apply in our setting (footnote 10: diverges). Instead, we numerically compute the final privacy curve obtained in Theorem 4.1, making use of Lemma 4.1.

4.3 Effect of JL dimension on privacy

Since the per-sample gradient norm estimates get more accurate as the JL dimension grows, the privacy of DP-SGD-JL should converge to that of DP-SGD-Vanilla for large JL dimension. We also observe that the privacy parameter decreases monotonically with increasing JL dimension and eventually converges to its value for DP-SGD-Vanilla. This can be seen from Figure 3.
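The claim that norm estimates sharpen with the JL dimension can be checked empirically with a small simulation. The sketch below assumes Gaussian projections and a root-mean-square estimator; the helper names `jl_norm_estimate` and `mean_relative_error` are hypothetical. For Gaussian projections the squared estimate is a scaled chi-square variable, so the relative error should shrink roughly like one over the square root of the number of projections.

```python
import math
import random


def jl_norm_estimate(g, num_proj, rng):
    """Estimate ||g|| from num_proj independent Gaussian projections of g."""
    sq = 0.0
    for _ in range(num_proj):
        dot = sum(gi * rng.gauss(0.0, 1.0) for gi in g)
        sq += dot * dot
    return math.sqrt(sq / num_proj)


def mean_relative_error(g, num_proj, trials, rng):
    """Average of |estimate - ||g||| / ||g|| over independent trials."""
    true_norm = math.sqrt(sum(gi * gi for gi in g))
    errs = [
        abs(jl_norm_estimate(g, num_proj, rng) - true_norm) / true_norm
        for _ in range(trials)
    ]
    return sum(errs) / trials


rng = random.Random(0)
errors = {r: mean_relative_error([3.0, 4.0], r, 200, rng) for r in (1, 10, 100)}
```

The simulation only measures estimation accuracy; it does not by itself say anything about the resulting privacy parameters, which require the analysis above.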

Acknowledgements

We thank Sergey Yekhanin for his constant support and encouragement during this work. We also thank Lukas Wutschitz and Osman Ramadan for helpful discussions. Finally, we thank the amazing open source community of TensorFlow for their quick response in fixing bugs, which was crucial for our experiments (TFissue43449).

References

  • M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang (2016) Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pp. 308–318.
  • S. Augenstein, H. B. McMahan, D. Ramage, S. Ramaswamy, P. Kairouz, M. Chen, R. Mathews, and B. A. y Arcas (2020) Generative models for effective ML on private, decentralized datasets. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020.
  • B. Balle, P. Kairouz, H. B. McMahan, O. Thakkar, and A. Thakurta (2020) Privacy amplification via random check-ins. CoRR abs/2007.06605.
  • A. G. Baydin, B. A. Pearlmutter, A. A. Radul, and J. M. Siskind (2017) Automatic differentiation in machine learning: a survey. The Journal of Machine Learning Research 18 (1), pp. 5595–5637.
  • Z. Bu, J. Dong, Q. Long, and W. J. Su (2019) Deep learning with gaussian differential privacy. arXiv preprint arXiv:1911.11607.
  • N. Carlini, C. Liu, Ú. Erlingsson, J. Kos, and D. Song (2019) The secret sharer: evaluating and testing unintended memorization in neural networks. In 28th USENIX Security Symposium (USENIX Security 19), pp. 267–284.
  • M. X. Chen, B. N. Lee, G. Bansal, Y. Cao, S. Zhang, J. Lu, J. Tsay, Y. Wang, A. M. Dai, Z. Chen, T. Sohn, and Y. Wu (2019) Gmail smart compose: real-time assisted writing. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2019, Anchorage, AK, USA, August 4-8, 2019, A. Teredesai, V. Kumar, Y. Li, R. Rosales, E. Terzi, and G. Karypis (Eds.), pp. 2287–2295.
  • X. Chen, Z. S. Wu, and M. Hong (2020) Understanding gradient clipping in private SGD: A geometric perspective. CoRR abs/2006.15429.
  • B. Deb, P. Bailey, and M. Shokouhi (2019) Diversifying reply suggestions using a matching-conditional variational autoencoder. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Industry Papers), Minneapolis, Minnesota.
  • J. Dong, A. Roth, and W. J. Su (2019) Gaussian differential privacy. arXiv preprint arXiv:1905.02383.
  • C. Dwork, F. McSherry, K. Nissim, and A. Smith (2006) Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pp. 265–284.
  • V. Feldman (2020) Does learning require memorization? a short tale about a long tail. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, STOC 2020, Chicago, IL, USA, June 22-26, 2020, K. Makarychev, Y. Makarychev, M. Tulsiani, G. Kamath, and J. Chuzhoy (Eds.), pp. 954–959.
  • [13] I. Goodfellow. https://github.com/tensorflow/tensorflow/issues/4897#issuecomment-290997283.
  • I. Goodfellow (2015) Efficient per-example gradient computations. arXiv preprint arXiv:1510.01799.
  • J. Lee and D. Kifer (2021) Scaling up differentially private deep learning with fast per-example gradient clipping. Proceedings on Privacy Enhancing Technologies 2021 (1), pp. 128–144.
  • H. B. McMahan, D. Ramage, K. Talwar, and L. Zhang (2018) Learning differentially private recurrent language models. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings.
  • I. Mironov (2017) Rényi differential privacy. In 2017 IEEE 30th Computer Security Foundations Symposium (CSF), pp. 263–275.
  • [18] L. Opacus. github.com/facebookresearch/pytorch-dp.
  • [19] Pytorch Team. DP training on imdb. https://github.com/pytorch/opacus/blob/master/examples/imdb_README.md.
  • G. Rochette, A. Manoel, and E. W. Tramel (2019) Efficient per-example gradient computations in convolutional neural networks. arXiv preprint arXiv:1912.06015.
  • R. Shokri, M. Stronati, C. Song, and V. Shmatikov (2017) Membership inference attacks against machine learning models. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 3–18.
  • P. Subramani, N. Vadivelu, and G. Kamath (2020) Enabling fast differentially private sgd via just-in-time compilation and vectorization. arXiv preprint arXiv:2010.09063.
  • [23] L. Tensorflow Privacy. github.com/tensorflow/privacy.
  • [24] Tensorflow Team. Text classification with an rnn. https://www.tensorflow.org/tutorials/text/text_classification_rnn.
  • I. V. Tetko, D. J. Livingstone, and A. I. Luik (1995) Neural network studies, 1. comparison of overfitting and overtraining. J. Chem. Inf. Comput. Sci. 35 (5), pp. 826–833.
  • [26] TFissue43449. Fast per example gradient support. https://github.com/tensorflow/tensorflow/issues/43449.
  • O. Thakkar, G. Andrew, and H. B. McMahan (2019) Differentially private learning with adaptive clipping. CoRR abs/1905.03871.
  • [28] J. Townsend. A new trick for calculating Jacobian vector products. https://j-towns.github.io/2017/06/12/A-new-trick.html.
  • Y. Zhou, Z. S. Wu, and A. Banerjee (2020) Bypassing the ambient dimension: private SGD with gradient subspace identification. CoRR abs/2007.03813.

Appendix A DP-SGD and DP-Adam-JL

For completeness, we provide pseudo-code for DP-SGD and DP-Adam-JL used in our experiments. DP-Adam-JL satisfies exactly the same privacy bounds as DP-SGD-JL.

Input: Examples , loss function , initialization .
Parameters: number of iterations , momentum parameters , noise scale , batch size , clipping norms , number of JL projections .
for  to  do
     Sample uniformly at random.
     Define
     Sample (where ) JL projections to estimate per-sample gradient norm
     for  to  do
          (using jvp) Note that
     end for
     for  do
          is an estimate for
     end for
     Define Scale the losses to clip per-sample gradients