1 Introduction
Over the past decade, machine learning algorithms based on (deep) neural architectures have led to a revolution in applications such as computer vision, speech recognition and natural language processing (NLP). An important factor contributing to this success is the abundance of data. For most of these applications, however, the training data comes from individuals and often contains personal and sensitive information about them. For example, natural language models for applications such as suggested replies for emails and dialog systems rely on training neural networks on the email data of users Chen et al. (2019); Deb et al. (2019), who may be left vulnerable if personal information is revealed. This could happen, for example, when a model generates a sentence or predicts a word that reveals private information of users in the training set. Many studies have shown successful membership inference attacks on deep learning models Shokri et al. (2017); Carlini et al. (2019). Indeed, in a recent work, Carlini et al. (2019) show that “unintended memorization” in neural networks is both commonplace and hard to prevent. Such memorization is not due to overtraining Tetko et al. (1995); Carlini et al. (2019), and ad hoc techniques such as early stopping and dropout do not remove the risk of privacy violations. Moreover, Feldman (2020) shows that memorization is in fact provably necessary for some learning tasks. Thus, to prevent unintended privacy breaches one needs a principled approach for private training of deep learning models. In this paper we study training neural networks with differential privacy, a mathematically rigorous notion of privacy introduced in the seminal work of Dwork et al. (2006), and focus on user level privacy.

Definition 1.1 ($(\epsilon,\delta)$-DP).
We say that an algorithm $\mathcal{M}$ is $(\epsilon,\delta)$-DP if for any two neighboring databases $D, D'$ and any subset $S$ of outputs, we have
$$\Pr[\mathcal{M}(D) \in S] \le e^{\epsilon} \Pr[\mathcal{M}(D') \in S] + \delta.$$
Besides being a provable privacy notion, it has been shown that deep learning models trained with DP protect against leakage of sensitive information; we refer the readers to Carlini et al. (2019); Abadi et al. (2016) for more details.
In a highly influential paper, Abadi et al. (2016) introduced a differentially private version of stochastic gradient descent (DPSGD) for training deep learning models, and showed that it is possible to achieve a reasonable accuracy-vs-privacy tradeoff on common benchmarks such as MNIST and CIFAR10. Since then, there has been a vast body of work building on and extending the algorithm of Abadi et al. (2016); we refer the readers to McMahan et al. (2018); Bu et al. (2019); Carlini et al. (2019); Thakkar et al. (2019); Augenstein et al. (2020); Zhou et al. (2020); Chen et al. (2020); Balle et al. (2020). DPSGD and its variations such as DPAdam differ from their non-private counterparts in two crucial ways:
Gradient Clipping: In each iteration of DPSGD, we clip each per-sample gradient to have norm at most some fixed parameter $C$. This step ensures that the sensitivity of the average gradient is bounded, which is crucial for the privacy analysis. Computing the norms of per-sample gradients is the most expensive step in the algorithm. Efficient implementations of backpropagation such as in TensorFlow and PyTorch only maintain the average of per-sample gradients across a batch by default. Therefore, getting the norm of each per-sample gradient requires significantly more time and memory.

Adding Noise: Once clipped gradients are averaged across a batch, the DPSGD algorithm adds carefully calibrated noise, typically sampled from the Gaussian distribution, to ensure privacy.
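The two modifications above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the implementation used in the paper: the function names `clip` and `dp_sgd_step` are hypothetical, the per-sample gradients are passed in explicitly, and the noise is calibrated with the common convention of standard deviation `noise_multiplier * clip_norm` added to the summed clipped gradients.

```python
import math
import random


def clip(grad, clip_norm):
    """Scale grad so that its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_step(weights, per_sample_grads, clip_norm, noise_multiplier, lr, rng):
    """One noisy SGD update from a batch of explicit per-sample gradients."""
    batch_size = len(per_sample_grads)
    clipped = [clip(g, clip_norm) for g in per_sample_grads]
    # Sum the clipped gradients coordinate-wise.
    total = [sum(col) for col in zip(*clipped)]
    # Assumed calibration: Gaussian noise with std noise_multiplier * clip_norm
    # per coordinate, added to the sum before averaging.
    noisy = [t + rng.gauss(0.0, noise_multiplier * clip_norm) for t in total]
    avg = [n / batch_size for n in noisy]
    return [w - lr * a for w, a in zip(weights, avg)]


rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4], [-6.0, 8.0]]  # toy per-sample gradients
new_w = dp_sgd_step([0.0, 0.0], grads, clip_norm=1.0,
                    noise_multiplier=1.1, lr=0.1, rng=rng)
```

The expensive part in a real network is obtaining `per_sample_grads` (or just their norms) in the first place, which is exactly the step that the rest of the paper aims to avoid.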
The analysis of DPSGD in Abadi et al. (2016) then follows from a careful tracking of the privacy budget lost in each iteration, for which they introduced a novel technique called the moments accountant, which was later generalized as Rényi differential privacy by Mironov (2017). This analysis was further refined and improved using the $f$-DP framework by Dong et al. (2019) and Bu et al. (2019), which led to better privacy bounds. In this work, we use the $f$-DP framework of Dong et al. (2019) for our privacy analysis.
While DPSGD has been shown to achieve a reasonable accuracy-vs-privacy tradeoff Bu et al. (2019); Abadi et al. (2016), and is arguably the only known algorithm for private training of deep neural networks, its use in real-world deep learning has been rather limited. One of the primary reasons for this is the training time of DPSGD compared to SGD. In DPSGD, per-sample gradients are computed at a heavy cost in runtime, especially for large deep learning models. The naive approach of setting the batch size to 1 is too slow to be practical, as we completely lose the benefits of parallelization. This problem has attracted significant attention in the community and has been noted in popular implementations of DPSGD including TensorFlow Privacy and Opacus (PyTorch DP). Many strategies have been proposed to circumvent this issue, and they fall broadly into the following categories:

Microbatching: The DPSGD implementation in TensorFlow Privacy allows dividing a batch into several microbatches and clipping the gradient at the microbatch level; the per-sample gradients in a microbatch are first averaged and then clipped, and finally these clipped gradients are averaged across microbatches. Thus, if each microbatch is of size $k$, then it gives a speedup of $k$ over the usual DPSGD. Unfortunately, the sensitivity goes up by a factor of $k$, so we need to add $k$ times more noise. In our experiments, we observe that this often leads to poor accuracy even for moderate values of $k$.

Multiple method: In this approach, proposed by Goodfellow [13] and implemented as the vectorized DPSGD in TensorFlow Privacy [23], one copies the model as many times as there are samples in a batch. As each copy of the model is used on only one example, we can compute the per-sample gradients in parallel. This approach improves speed at the cost of memory and is impractical for large models.

Outer product method: This strategy was proposed by Goodfellow (2015) for fully-connected networks and later generalized by Rochette et al. (2019) to include convolutional layers. The norms of per-sample gradients are computed exactly using outer products between the activations and backpropagated gradients across adjacent layers. In a very recent work, Lee and Kifer (2021) showed how to extend this approach to recurrent layers. A drawback of this approach is that it does not work for all network architectures in a black-box manner, and needs careful implementation for each network architecture. Furthermore, the current implementations of these methods require significantly more memory than non-private SGD, as shown in Subramani et al. (2020).

Compiler Optimization: A completely different approach towards mitigating the problem was suggested by Subramani et al. (2020). They showed that by exploiting language primitives, such as vectorization, just-in-time compilation, and static graph optimization, one can implement the DPSGD algorithm significantly faster. They demonstrated these ideas in two frameworks: JAX and TensorFlow. While we believe this is exciting progress, the ideas in Subramani et al. (2020) are specific to these JAX and TensorFlow implementations (as of today) and present a non-algorithmic approach to this problem.
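To make the outer product method concrete, consider a single fully-connected layer $z = Wa$. For one sample, the gradient of the loss with respect to $W$ is the outer product $g a^\top$, where $g$ is the backpropagated gradient with respect to $z$, so its Frobenius norm is simply $\|g\|_2 \|a\|_2$ and the full gradient matrix never has to be formed. The sketch below checks this identity numerically; the vectors `a` and `g` are made-up toy values and the helper names are hypothetical.

```python
import math


def outer(g, a):
    """Explicit per-sample weight gradient g a^T of a layer z = W a."""
    return [[gi * aj for aj in a] for gi in g]


def frobenius_norm(mat):
    return math.sqrt(sum(x * x for row in mat for x in row))


def l2_norm(vec):
    return math.sqrt(sum(x * x for x in vec))


a = [1.0, -2.0, 0.5]  # toy layer input (activations) for one sample
g = [0.3, 0.4]        # toy backpropagated gradient w.r.t. the layer output z

explicit = frobenius_norm(outer(g, a))  # materializes the full gradient
shortcut = l2_norm(g) * l2_norm(a)      # uses only the two vectors
```

The shortcut relies on the layer having this specific structure, which is why the method needs a separate derivation for convolutional and recurrent layers.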
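Table 1: Comparison of approaches to training with DPSGD.

| Optimizers | Privacy | Speed | Memory | Generalizability |
|---|---|---|---|---|
| NonDP SGD | ✘ | ✔ | ✔ | ✔ |
| DPSGDVanilla | ✔ | ✘ | ✔ | ✔ |
| DPSGDMultiple | ✔ | ✔ | ✘ | ✔ |
| DPSGDOuter | ✔ | ✔ | ✘ | ✘ |
| JAX | ✔ | ✔ | ✔ | ✘ |
| DPSGDJL | ✔ | ✔ | ✔ | ✔ |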
As summarized in Table 1, none of the existing approaches for speeding up DPSGD completely solves the problem; each falls short in at least one dimension. In this work, we propose a new algorithmic framework based on JL projections for fast and memory-efficient differentially private training of deep neural networks, which bypasses the expensive step of exactly computing per-sample gradient norms.
1.1 Our Contributions and Techniques
The main idea of our algorithm is to approximate the per-sample gradient norms instead of computing them exactly. Johnson–Lindenstrauss (JL) projections provide a convenient way to approximate the norm of a vector: simply project the vector onto a uniformly random direction; the length of the projection (scaled appropriately) is a good approximation to the norm of the vector. By doing more such projections and averaging, we get an even better approximation to the true norm. Moreover, there is an efficient way to obtain such projections using forward-mode autodifferentiation or the Jacobian-vector product (jvp) (see Section 2.1 for details). The jvp can be calculated during the forward pass, making it very efficient. Since this makes the sensitivity itself a random variable, the privacy analysis is significantly harder than for traditional DPSGD. We use the recently proposed $f$-DP framework of Dong et al. (2019) for our analysis. Intuitively, $f$-DP captures the entire collection of $(\epsilon,\delta)$-DP guarantees that each iteration of the algorithm satisfies and composes them optimally to find the best $(\epsilon,\delta)$-DP guarantee for the final algorithm.
To summarize, the key contributions of this paper are:

Our algorithms DPSGDJL and DPAdamJL are considerably faster than previously known differentially private training algorithms that require exact per-sample gradient norms, and work for all network architectures. The privacy-vs-accuracy tradeoff achieved by our algorithms is comparable to that of the existing state-of-the-art DP algorithms.

The memory footprint of our algorithms is nearly the same as that of non-private training algorithms. This allows us to work with larger batch sizes (which is crucial for achieving good privacy-vs-accuracy tradeoffs) without resorting to gradient accumulation. This also improves the running time of training.

Compared to DPSGD, our analysis of privacy is more involved. Since we only approximate the per-sample gradient norms, we cannot precisely bound the sensitivity. Therefore the analysis requires substantially new ideas, which could be of independent interest.

We demonstrate these improvements by training an RNN built from layers such as a bidirectional LSTM, an embedding layer and fully connected layers on the IMDb dataset. As can be seen from Figure 1, our algorithms are significantly faster than current implementations of DPSGD while achieving a similar privacy-vs-accuracy tradeoff.

Our algorithms introduce a new knob, the dimension of the JL projection, which allows us to trade off training time against privacy; this was not possible in earlier algorithms. All hyperparameters being the same, a smaller JL dimension gives a much better running time at the cost of an increase in privacy budget. Moreover, our experiments show that although the privacy bounds we could prove for DPSGDJL are weak for very small JL dimensions (see Figure 3), their behavior (accuracy-vs-epoch curve) converges very quickly to that of DPSGDVanilla. Figure 4 shows that DPSGDJL(3) is already very close to DPSGDVanilla and DPSGDJL(20) is almost indistinguishable. We find these properties of DPSGDJL algorithms to be very useful during the initial stages of experimentation and hyperparameter tuning for private training.
2 DPSGDJL Algorithm
In this section we describe our new differentially private optimizers. We first present DPSGDJL; DPAdamJL follows along the same lines and is presented in Appendix A. We begin with an introduction to the “Jacobian-vector product” (jvp), which is crucial for our algorithm.
2.1 Jacobian-vector product (jvp)
Given a function $f:\mathbb{R}^d \to \mathbb{R}$, the gradient of $f$ with respect to $\theta \in \mathbb{R}^d$, denoted by $\nabla_\theta f$, is:
$$\nabla_\theta f = \left(\frac{\partial f}{\partial \theta_1}, \dots, \frac{\partial f}{\partial \theta_d}\right).$$
Let $F:\mathbb{R}^d \to \mathbb{R}^m$ be some function given by $F(\theta) = (F_1(\theta), \dots, F_m(\theta))$. The Jacobian of $F$, denoted by $J$, is the $m \times d$ matrix:
$$J = \begin{bmatrix} \nabla_\theta F_1 \\ \vdots \\ \nabla_\theta F_m \end{bmatrix}, \qquad J_{ij} = \frac{\partial F_i}{\partial \theta_j}.$$
Most autodifferentiation packages allow for calculating the vector-Jacobian product (vjp) given by $u^\top J$ for any $u \in \mathbb{R}^m$ efficiently using reverse-mode autodifferentiation, which is the familiar ‘backpropagation’. (In PyTorch, $u^\top J$ can be calculated as autograd.grad(F,θ,grad_outputs=u). In TensorFlow, this is tf.GradientTape().gradient(F,θ,output_gradients=u).) One can also calculate the Jacobian-vector product (jvp) given by $Jv$ for any $v \in \mathbb{R}^d$ efficiently using forward-mode autodifferentiation (i.e., the required derivatives are calculated during the forward pass of the network). We refer the reader to the survey on automatic differentiation by Baydin et al. (2017) for more details. jvp is implemented using forward-mode autodifferentiation in recent TensorFlow versions. (Supported in tf-nightly 2.4.0.dev20200924 as tf.autodiff.ForwardAccumulator(θ,v).jvp(F).) Unfortunately, PyTorch doesn’t have an implementation of forward-mode autodifferentiation. Instead, one can compute jvp using two calls to vjp; this is called the ‘double vjp trick’ (see Townsend [28]). (In PyTorch, an implementation of jvp using the double vjp trick exists and can be invoked as torch.autograd.functional.jvp(F,θ,v=v).) Define $g(u) = (u^\top J)^\top = J^\top u$, which can be calculated using vjp. Note that $g$ is linear in $u$ and its Jacobian with respect to $u$ is $J^\top$. Now we can use vjp again on $g$ to calculate jvp as
$$v^\top J^\top = (Jv)^\top.$$
In our experiments, we use the efficient implementation of jvp in TensorFlow to compute the Jacobian-vector products, as the double vjp trick is a few times slower.
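To illustrate why a jvp is available "during the forward pass", the following toy sketch implements forward-mode differentiation with dual numbers in plain Python. It is not how TensorFlow implements jvp; the class `Dual`, the helper `jvp` and the loss `squared_error` are hypothetical names chosen for this example. Each intermediate value carries its directional derivative along $v$, so a single evaluation of the loss yields $\langle \nabla_\theta f, v\rangle$ without ever forming the gradient.

```python
class Dual:
    """A number value + eps * tangent with eps**2 == 0."""

    def __init__(self, value, tangent=0.0):
        self.value = value
        self.tangent = tangent

    @staticmethod
    def lift(x):
        return x if isinstance(x, Dual) else Dual(x)

    def __add__(self, other):
        other = Dual.lift(other)
        return Dual(self.value + other.value, self.tangent + other.tangent)

    __radd__ = __add__

    def __sub__(self, other):
        other = Dual.lift(other)
        return Dual(self.value - other.value, self.tangent - other.tangent)

    def __mul__(self, other):
        other = Dual.lift(other)
        # Product rule applied to the tangent part.
        return Dual(self.value * other.value,
                    self.value * other.tangent + self.tangent * other.value)

    __rmul__ = __mul__


def jvp(func, theta, v):
    """Return (func(theta), <grad func(theta), v>) from one forward pass."""
    out = func([Dual(t, vi) for t, vi in zip(theta, v)])
    return out.value, out.tangent


def squared_error(theta, x=(1.0, 2.0, 3.0), y=2.0):
    """Toy per-sample loss (theta . x - y)^2."""
    pred = sum(t * xi for t, xi in zip(theta, x))
    err = pred - y
    return err * err


loss, directional = jvp(squared_error, [0.5, -1.0, 2.0], [1.0, 0.0, -1.0])
```

For this toy loss the gradient at $\theta = (0.5, -1, 2)$ is $2 \cdot 2.5 \cdot x = (5, 10, 15)$, so the projection onto $v = (1, 0, -1)$ is $-10$, which is what the tangent part of the output carries.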
2.2 Algorithm
The main idea of our algorithm is to approximate per-sample gradient norms instead of computing them exactly. The key tool to achieve this is JL projections.
Proposition 2.1 (JL projections).
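Let $u \in \mathbb{R}^d$ be any vector. If $v_1, \dots, v_r \sim \mathcal{N}(0, I_d)$ are independent standard Gaussian vectors, then $\frac{1}{r}\sum_{i=1}^{r} \langle u, v_i\rangle^2$ has the same distribution as $\|u\|_2^2 \cdot \frac{\chi^2_r}{r}$, where $\chi^2_r$ is the chi-square distribution with $r$ degrees of freedom, which is the distribution of the sum of squares of $r$ standard Gaussians. In particular, $\mathbb{E}\left[\frac{1}{r}\sum_{i=1}^{r} \langle u, v_i\rangle^2\right] = \|u\|_2^2$.
Proof.
By the properties of the standard Gaussian distribution, $\langle u, v_i\rangle$ has the distribution of $\|u\|_2 \cdot \mathcal{N}(0,1)$, and the $\langle u, v_i\rangle$ are independent for $i = 1$ to $r$. Therefore $\sum_{i=1}^{r} \langle u, v_i\rangle^2$ has the same distribution as $\|u\|_2^2 \cdot \chi^2_r$. ∎
As shown in Figure 2, as $r$ grows larger, the distribution of $\chi^2_r / r$ concentrates more around 1 and therefore $\frac{1}{r}\sum_{i=1}^{r} \langle u, v_i\rangle^2$ becomes a better estimate of $\|u\|_2^2$.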
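The concentration described above is easy to observe numerically. The sketch below estimates the norm of an explicit toy vector from $r$ Gaussian projections; the function name `jl_norm_estimate` and the vector `u` are assumptions made for this illustration. In the actual algorithm the vector is a per-sample gradient that is never materialized, and each inner product would come from a jvp instead.

```python
import math
import random


def jl_norm_estimate(u, r, rng):
    """Estimate ||u||_2 from r projections onto standard Gaussian vectors."""
    total = 0.0
    for _ in range(r):
        # <u, v> for a fresh v ~ N(0, I); in DPSGDJL this is one jvp.
        proj = sum(ui * rng.gauss(0.0, 1.0) for ui in u)
        total += proj * proj
    return math.sqrt(total / r)


rng = random.Random(0)
u = [3.0, 4.0, 12.0]  # toy vector with true norm 13
estimates = {r: jl_norm_estimate(u, r, rng) for r in (1, 10, 1000, 20000)}
```

With a single projection the estimate can be far from 13, while for large `r` it settles close to the true norm, mirroring the behavior of $\chi^2_r / r$.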
Using jvp, we can compute projections of per-sample gradients onto standard Gaussian vectors quickly and therefore get good approximations to their norms. This is the main idea of DPSGDJL (Algorithm 1). The privacy analysis of our algorithm is quite involved and is presented in Section 4.
3 Experiments
In this section, we demonstrate experimentally that, compared to existing implementations of DPSGD with exact per-sample gradient clipping, our optimizers have significant advantages in speed and memory cost while achieving a comparable accuracy-vs-privacy tradeoff. Moreover, our algorithms perform well on a variety of network architectures. The main goal of this section is to give empirical evidence for the following three strengths of our algorithm alluded to in the introduction:

Our algorithm is significantly faster than per-sample gradient computations and works for any network in a black-box way.

The memory footprint of our algorithm is roughly the same as that of non-private SGD.

The DPSGDJL algorithms with smaller values of JL dimension exhibit similar behavior to DPSGD but with orders of magnitude speedup, and hence can be used for hyperparameter search.
In the following, we write ‘nonDPSGD’ for the standard non-private SGD and ‘DPSGDVanilla’ for the implementation of DPSGD in TensorFlow Privacy [23]; nonDPAdam and DPAdamVanilla are similarly defined. We use TensorFlow and TensorFlow Privacy for all our experiments because Opacus does not support arbitrary network architectures. (In the PyTorch Opacus GitHub repository, the LSTM layer is only partially supported, e.g. single directional, single LSTM layer, no dropout layer; other recurrent layers such as GRU are not supported; see Opacus [18].) Moreover, TensorFlow has an efficient implementation of jvp while PyTorch doesn’t.
Notation:
We denote the noise multiplier as $\sigma$, clipping norm as $C$, batch size as $B$, learning rate as $\eta$ and the number of epochs as $E$. We fix the privacy parameter $\delta$, as done by prior work. We denote the JL dimension used by each optimizer in parentheses, e.g. DPSGDJL(30). We use one Tesla P100 16GB GPU for all experiments. In all the experiments, we report the time per epoch by averaging over a large number of epochs.
3.1 Training an LSTM model on IMDb dataset
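Table 2: Training time per epoch (in seconds) for different algorithms.

| NonDP Adam | DPAdamVanilla | DPAdamJL(1) | DPAdamJL(5) | DPAdamJL(10) | DPAdamJL(30) |
|---|---|---|---|---|---|
| 12 | 19317 | 41 | 128 | 243 | 675 |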
The goal of these experiments is to demonstrate the first strength of our algorithm: it is significantly faster than per-sample gradient computations and works for any network in a black-box way. We train a bidirectional LSTM with an embedding layer on the IMDb dataset for sentiment analysis. We remark that simpler networks not based on RNNs can achieve a good accuracy-vs-privacy tradeoff, as shown in Bu et al. (2019) and PyTorch Team [19]. However, LSTM models based on RNN architectures are widely used in NLP applications, and hence efficient private training of such models remains an important challenge in this area McMahan et al. (2018). Moreover, as we noted in the introduction, extensions of the outer product trick to bidirectional LSTMs as described in Lee and Kifer (2021) are significantly more complicated and require considerable effort to implement. Since the authors of Lee and Kifer (2021) did not provide the code, we are unable to compare against their improvements. Moreover, as also noted in Lee and Kifer (2021), the outer product method requires significantly more memory and hence will not scale to large batch sizes, which are very important for achieving a good privacy-vs-utility tradeoff.
We implement DPAdamJL using the jvp method in TensorFlow. We train the same single-layer bidirectional LSTM as in the TensorFlow Team [24] tutorial, using the same IMDb dataset with 8k vocabulary. (We cannot use CuDNNLSTM and instead use LSTM for the following reason: when using CuDNNLSTM on GPU, we observe a significant speedup compared to LSTM, but the accuracy is invalid and we further incur a LookupError when computing jvp.) The dataset has binary labels and is preprocessed by zero-padding the sequences to a length of 150 before being fed into the embedding layer.
Table 2 shows the training time per epoch for different algorithms. As expected, we observe that DPAdamJL algorithms with smaller values of JL dimension are significantly faster than DPAdamVanilla. However, as shown in Figure 3, the privacy guarantees of the DPAdamJL algorithm with smaller values of JL dimension are considerably worse than those of DPAdamVanilla. On the other hand, DPAdamJL(30) is faster than DPAdamVanilla while achieving similar privacy guarantees. When allowed to train for a sufficient number of epochs, we observed that all the algorithms achieved the same accuracy but with different privacy costs and running times. This three-dimensional tradeoff between utility, privacy and speed is depicted in Figure 1 (see introduction), where we plot the privacy values using a color plot.
Remark 3.1.
As we can see from Table 2, for a batch size of 256, the slowdown of DPAdamVanilla compared to nonDPAdam is more than 256x. This is counterintuitive, as a naive implementation of DPAdamVanilla that sets the batch size to 1 and accumulates gradients across 256 batches should be only 256 times slower. As we show below, this is due to the memory issues of the DPAdamVanilla implementation in TensorFlow Privacy [23]. Indeed, the naive implementation of DPAdamVanilla in TensorFlow using gradient accumulation takes about 4000 seconds per epoch for the same batch size.
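The ratios behind this remark follow directly from the per-epoch times in Table 2. The short calculation below only restates that arithmetic; the dictionary and variable names are chosen for this illustration.

```python
# Per-epoch training times copied from Table 2.
times = {
    "NonDP Adam": 12,
    "DPAdamVanilla": 19317,
    "DPAdamJL(1)": 41,
    "DPAdamJL(5)": 128,
    "DPAdamJL(10)": 243,
    "DPAdamJL(30)": 675,
}

# Slowdown of each optimizer relative to non-private Adam.
slowdown = {name: t / times["NonDP Adam"] for name, t in times.items()}

# Speedup of each JL variant relative to DPAdamVanilla.
speedup = {name: times["DPAdamVanilla"] / t
           for name, t in times.items() if name.startswith("DPAdamJL")}

for name, s in speedup.items():
    print(f"{name}: {s:.1f}x faster than DPAdamVanilla, "
          f"{slowdown[name]:.1f}x slower than NonDP Adam")
```

DPAdamVanilla is roughly 1610 times slower than non-private Adam, well above the batch size of 256, while DPAdamJL(30) is still about 29 times faster than DPAdamVanilla.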
3.2 Memory footprint
Another key strength of JL-based algorithms is their low memory consumption, and the goal of this section is to demonstrate this aspect of our new algorithms via experiments. As a proxy for memory consumption, we compare the largest batch size each algorithm can handle without running out of memory. It is known that to achieve good privacy-vs-accuracy tradeoffs for DPSGD, we need to use large batch sizes Abadi et al. (2016). One way to support large batch sizes is via gradient accumulation; however, this has the disadvantage that one loses parallelism, which in turn leads to slower run times. Hence the memory footprint of an algorithm also indirectly affects the training time.
We compare our JL algorithms with the implementation of DPSGDVanilla in TensorFlow Privacy [23]. We train a convolutional neural network from the TensorFlow Privacy tutorial on the MNIST dataset, which has 60,000 training samples. (We use the implementation and the network from mnist_dpsgd_tutorial_keras.py.) As we can see from Table 3, the DPSGDJL algorithm and nonDPSGD can both run with the maximum possible batch size of 60,000, whereas DPSGDVanilla can only handle a batch size of roughly 500. In general, we believe that the memory footprint of the DPSGDJL algorithm is very close to that of non-private SGD. To show this, we augment the CNN in the TensorFlow Privacy tutorial with dense layers to blow up the model size to 17,146,938 parameters, and repeat the experiment. As we see from Table 4, the largest batch size supported by DPSGDJL(30) is only a factor of 2 away from the largest batch size supported by nonDP SGD. On the other hand, we observe that DPSGDVanilla only supports a batch size of 100.

Table 3: Largest batch size each algorithm can handle without running out of memory (CNN from the TensorFlow Privacy tutorial, MNIST).

| nonDPSGD | DPSGDVanilla | DPSGDJL(10) | DPSGDJL(30) |
|---|---|---|---|
| 60,000 | 512 | 60,000 | 60,000 |

Table 4: Largest batch size each algorithm can handle without running out of memory (augmented CNN with 17,146,938 parameters, MNIST).

| NonDP SGD | DPSGDVanilla | DPSGDJL(10) | DPSGDJL(30) |
|---|---|---|---|
| 52,000 | 100 | 28,000 | 25,000 |
The above experiments also give a possible explanation of why the DPSGDVanilla implementation in TensorFlow Privacy has a slowdown that is larger than the batch size, as we observed in the LSTM experiments. Even for MNIST, we observe that the running time of DPSGDVanilla improves with batch size at first, but as the batch size becomes larger the running time gets worse, and soon after it runs out of memory.
3.3 Using DPSGDJL for hyperparameter search
As we saw in our experiments summarized in Table 2, our algorithms with small values of JL dimension are orders of magnitude faster than DPAdamVanilla; DPAdamJL(1) is about 470x faster and DPAdamJL(5) is about 150x faster. Unfortunately, the privacy bounds we can prove for these algorithms are considerably worse than those of DPAdamVanilla. Despite this drawback, we observe that the behavior of DPSGDJL even for small JL dimension is very close to that of DPSGDVanilla. Figure 4 plots the accuracy vs epochs for various algorithms training a CNN from the TensorFlow Privacy tutorial on the MNIST dataset. We observe that as the JL dimension increases, the accuracy-vs-epoch curve quickly converges to that of DPSGDVanilla. DPSGDJL(3) is already very similar to DPSGDVanilla and DPSGDJL(20) is nearly indistinguishable. This also lets us hypothesize that the privacy of our algorithms could be much better than what we could prove and that it should converge equally quickly to that of DPSGDVanilla. Thus we believe that DPSGDJL(3) or DPSGDJL(5) are good candidates for experimentation and hyperparameter tuning during private training, since their behavior is almost identical to that of DPSGDVanilla while being orders of magnitude faster.
4 Privacy Analysis
4.1 DP preliminaries
We use the recently proposed $f$-DP framework of Dong et al. (2019) for our privacy analysis. The $f$-DP framework allows us to reason about a collection of privacy guarantees simultaneously, which can then be composed to get a much better privacy guarantee for the final algorithm. We will first define the notion of $(\epsilon,\delta)$-DP formally and then define the notion of $f$-DP. We then state a proposition from Dong et al. (2019) which shows that these two notions are dual to each other.
Definition 4.1 ($(\epsilon,\delta)$-DP).
We say that an algorithm $\mathcal{M}$ is $(\epsilon,\delta)$-DP if for any two neighboring databases $D, D'$ and any subset $S$ of outputs, we have
$$\Pr[\mathcal{M}(D) \in S] \le e^{\epsilon} \Pr[\mathcal{M}(D') \in S] + \delta.$$
For each value of $\epsilon$, there exists some $\delta$ such that $\mathcal{M}$ is $(\epsilon,\delta)$-DP. We can represent all these privacy guarantees by a function $\delta(\epsilon)$ and say that $\mathcal{M}$ is $(\epsilon,\delta(\epsilon))$-DP for each $\epsilon \ge 0$. We will now see that there is a dual way to represent the function $\delta(\epsilon)$, called $f$-DP. To introduce this, we will need the notion of a tradeoff function.
Definition 4.2 (Tradeoff function).
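Let $P, Q$ be two distributions over some domain $\Omega$. Let $\phi: \Omega \to [0,1]$ be a prediction rule which, given a sample $x \in \Omega$, predicts which distribution the sample came from. The type I error is defined as $\alpha_\phi = \mathbb{E}_{x \sim P}[\phi(x)]$ and the type II error is defined as $\beta_\phi = 1 - \mathbb{E}_{x \sim Q}[\phi(x)]$. The tradeoff function $T(P,Q): [0,1] \to [0,1]$ is defined as:
$$T(P,Q)(\alpha) = \inf_{\phi} \{\beta_\phi : \alpha_\phi \le \alpha\}. \tag{1}$$
Given two random variables $X, Y$, we define $T(X,Y)$ to be $T(P,Q)$ where $P, Q$ are the distributions of $X, Y$ respectively.
Note that if $f = T(P,Q)$, then $f^{-1} = T(Q,P)$. A tradeoff curve $f$ is called symmetric if $f = f^{-1}$. Given two functions $f, g$ on the same domain, we say $f \ge g$ if $f(x) \ge g(x)$ for all $x$ in the domain. $f \le g$ is similarly defined.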
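A standard worked example, taken from Dong et al. (2019) and not specific to this paper, is the tradeoff function between two unit-variance Gaussians: $T(\mathcal{N}(0,1), \mathcal{N}(\mu,1))(\alpha) = \Phi(\Phi^{-1}(1-\alpha) - \mu)$, where $\Phi$ is the standard Gaussian CDF. The sketch below evaluates it with the Python standard library; the function name `gaussian_tradeoff` is an assumption made for this illustration.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard Gaussian N(0, 1)


def gaussian_tradeoff(alpha, mu):
    """T(N(0,1), N(mu,1)) at type I error alpha, for 0 < alpha < 1."""
    return STD_NORMAL.cdf(STD_NORMAL.inv_cdf(1.0 - alpha) - mu)


# Smallest achievable type II error at type I error 0.1 when mu = 1.
beta = gaussian_tradeoff(0.1, mu=1.0)

# With mu = 0 the two distributions coincide and no test beats guessing.
beta_trivial = gaussian_tradeoff(0.1, mu=0.0)

# This curve is symmetric, so applying it twice recovers alpha.
alpha_back = gaussian_tradeoff(beta, mu=1.0)
```

Larger `mu` makes the two distributions easier to tell apart and pushes the curve towards zero, which corresponds to weaker privacy.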
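Definition 4.3 ($f$-DP).
We say an algorithm $\mathcal{M}$ is $f$-differentially private if for every two neighboring databases $D, D'$, we have $T(\mathcal{M}(D), \mathcal{M}(D')) \ge f$.
Proposition 4.1 (Duality of $(\epsilon,\delta)$-DP and $f$-DP from Dong et al. (2019)).
Suppose $\mathcal{M}$ satisfies $f$-DP where $f$ is symmetric (i.e., $f = f^{-1}$), and let $x^* \in [0,1]$ be such that $f(x^*) = x^*$. Then $\mathcal{M}$ satisfies $(\epsilon(x), \delta(x))$-DP for every $x \in [0, x^*]$ where:
$$\epsilon(x) = \log(-f'(x)), \qquad \delta(x) = 1 - f(x) + x f'(x).$$
So the tangent to $f$ at $x$ has slope $-e^{\epsilon(x)}$ and intercept $1 - \delta(x)$ (see Figure 5).
Note that by convexity of $f$, $f'(x)$ is increasing in $x$, and by symmetry $f'(x^*) = -1$. Therefore Proposition 4.1 covers the entire range of $\epsilon \ge 0$.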
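The tangent picture can be checked numerically on the Gaussian tradeoff curve $f(\alpha) = \Phi(\Phi^{-1}(1-\alpha) - \mu)$. The sketch below assumes the standard reading of the duality in Dong et al. (2019): a tangent with slope $-e^{\epsilon}$ and intercept $1-\delta$ corresponds to an $(\epsilon,\delta)$-DP guarantee, and for this curve the known closed form is $\delta(\epsilon) = \Phi(-\epsilon/\mu + \mu/2) - e^{\epsilon}\Phi(-\epsilon/\mu - \mu/2)$. All function names are hypothetical, and the derivative is approximated by a central difference.

```python
import math
from statistics import NormalDist

STD = NormalDist()


def f_gauss(alpha, mu):
    """Gaussian tradeoff curve T(N(0,1), N(mu,1)) at alpha in (0, 1)."""
    return STD.cdf(STD.inv_cdf(1.0 - alpha) - mu)


def eps_delta_from_tangent(alpha, mu, h=1e-6):
    """Read (eps, delta) off the tangent to the curve at alpha."""
    slope = (f_gauss(alpha + h, mu) - f_gauss(alpha - h, mu)) / (2 * h)
    eps = math.log(-slope)                   # slope = -exp(eps)
    intercept = f_gauss(alpha, mu) - alpha * slope
    return eps, 1.0 - intercept              # intercept = 1 - delta


def delta_closed_form(eps, mu):
    """Closed-form delta(eps) for the Gaussian curve (Dong et al., 2019)."""
    return (STD.cdf(-eps / mu + mu / 2)
            - math.exp(eps) * STD.cdf(-eps / mu - mu / 2))


eps, delta = eps_delta_from_tangent(alpha=0.1, mu=1.0)
delta_ref = delta_closed_form(eps, mu=1.0)
```

Sweeping `alpha` from 0 up to the fixed point of the curve traces out the whole family of $(\epsilon,\delta)$ pairs implied by a single tradeoff function.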
Proposition 4.2 (Post-processing).
Let $X, Y$ be two random variables supported on $\Omega$ and let $\phi: \Omega \to \Omega'$ be some randomized function. Then $T(\phi(X), \phi(Y)) \ge T(X, Y)$.
Proposition 4.3.
Let and be pairs of random variables such that has the same distribution as for all . Then
Proof.
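Proposition 4.4 (Composition, Dong et al. (2019)).
Let $X_1, X_2$ be independent random variables and let $Y_1, Y_2$ be independent random variables. Then
$$T((X_1, X_2), (Y_1, Y_2)) = T(X_1, Y_1) \otimes T(X_2, Y_2),$$
where $\otimes$ is a commutative, associative operation on functions from $[0,1] \to [0,1]$.
For any random variable $X$, we have $T(X, X) = \mathrm{Id}$ where $\mathrm{Id}$ is defined as $\mathrm{Id}(\alpha) = 1 - \alpha$. The function $\mathrm{Id}$ is the identity for the $\otimes$ operation, i.e., $f \otimes \mathrm{Id} = \mathrm{Id} \otimes f = f$ for all $f$.
We will need the following proposition, which explains how subsampling affects the privacy curve.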
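As a worked instance of composition, consider the Gaussian case treated in Dong et al. (2019). The notation $G_\mu = T(\mathcal{N}(0,1), \mathcal{N}(\mu,1))$ is borrowed from that paper and is an assumption here, since it does not appear in the surrounding text. Take independent $X_i \sim \mathcal{N}(0,1)$ and $Y_i \sim \mathcal{N}(\mu_i, 1)$ for $i = 1, 2$. The likelihood ratio between the two joint distributions depends on the sample only through $\mu_1 x_1 + \mu_2 x_2$, so distinguishing the pairs is as hard as distinguishing two one-dimensional Gaussians:

```latex
\begin{align*}
T\big((X_1, X_2), (Y_1, Y_2)\big)
  &= T(X_1, Y_1) \otimes T(X_2, Y_2)
   = G_{\mu_1} \otimes G_{\mu_2} \\
  &= T\big(\mathcal{N}(0, \mu_1^2 + \mu_2^2),\;
           \mathcal{N}(\mu_1^2 + \mu_2^2,\, \mu_1^2 + \mu_2^2)\big)
   = G_{\sqrt{\mu_1^2 + \mu_2^2}} .
\end{align*}
```

In particular, composing $n$ copies of $G_\mu$ gives $G_{\mu\sqrt{n}}$, and taking $\mu_2 = 0$ recovers $f \otimes \mathrm{Id} = f$ for $f = G_{\mu_1}$.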
Proposition 4.5 (Dong et al. (2019)).
Let $f = T(P, Q)$ and let $p \in [0,1]$. Then $T(P, (1-p)P + pQ) = pf + (1-p)\mathrm{Id}$.
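The effect of subsampling can be illustrated on the Gaussian tradeoff curve. The sketch below assumes the form used in Dong et al. (2019), where mixing the alternative with the null at rate $p$ turns a tradeoff function $f$ into $f_p = pf + (1-p)\mathrm{Id}$ with $\mathrm{Id}(\alpha) = 1 - \alpha$; the function names are hypothetical, and the symmetrization and convexification applied in the final theorem are not shown.

```python
from statistics import NormalDist

STD = NormalDist()


def gaussian_tradeoff(alpha, mu):
    """T(N(0,1), N(mu,1)) at type I error alpha, for 0 < alpha < 1."""
    return STD.cdf(STD.inv_cdf(1.0 - alpha) - mu)


def subsample(f, p):
    """Return f_p = p * f + (1 - p) * Id for a tradeoff function f."""
    def f_p(alpha):
        return p * f(alpha) + (1.0 - p) * (1.0 - alpha)
    return f_p


mu, p = 1.0, 0.1  # toy values: Gaussian separation and sampling probability


def f(alpha):
    return gaussian_tradeoff(alpha, mu)


f_p = subsample(f, p)
grid = [i / 100 for i in range(1, 100)]
curve = [(a, f(a), f_p(a)) for a in grid]
```

On the whole grid `f_p` lies between `f` and the perfectly private curve `1 - alpha`, which is the privacy amplification one expects when each sample is included only with probability `p`.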
Though Eqn (1) defining the tradeoff curve requires us to take an infimum over a large class of tests, the Neyman–Pearson lemma gives a very simple form for the optimal tests to use.
Proposition 4.6 (Neyman–Pearson lemma).
Let $P, Q$ be two continuous distributions over some domain $\Omega$, and write $P(x), Q(x)$ for their densities. The type I error vs type II error tradeoff function between $P$ and $Q$ is attained by the Neyman–Pearson tests, which are a single parameter family of tests of the form $\phi_t$ defined as:
$$\phi_t(x) = \begin{cases} 1 & \text{if } Q(x)/P(x) > t, \\ 0 & \text{otherwise.} \end{cases}$$
Using the Neyman–Pearson lemma, we will prove the following lemma, which is crucial for our privacy analysis.
Lemma 4.1.
Let be some random variable and let . Then where:
and is the CDF of standard Gaussian.
Proof.
We will now prove another key lemma which is useful for our privacy analysis.
Lemma 4.2.
Let be two random variables such that there exists some coupling with Then .
Proof.
We will prove this by postprocessing (Proposition 4.2). Define a randomized map
where , i.e., is sampled from the conditional distribution of given Note that because the coupling satisfies Now it is easy to verify that and ∎
4.2 Proof of privacy for Algorithm 1
We will first analyze the privacy of a crucial subroutine used in Algorithm 1, which is shown in Algorithm 2.
Lemma 4.3.
Proof.
Let be the output of Algorithm 2 with input and let be the output of Algorithm 2 with input . We want to show that . We have
By postprocessing property (Proposition 4.2), we have
where
By Proposition 4.3,
Let be a rotation matrix which rotates to where Let and . Since is a fixed bijective map,
Because of rotation invariance, has the same distribution as So,
The coordinates are independent of each other and . Similarly the coordinates are also independent of each other and . Moreover and has the same distribution for Therefore by Proposition 4.4,
Let . We can further simplify this using Proposition 4.3 as:
We can simplify further by using the fact that has the same distribution as . Therefore has the same distribution as
Let and so By Lemma 4.2,
Therefore, this proves that where The parametrization follows from Lemma 4.1. ∎
Theorem 4.1.
Let $p = B/n$ where $B$ is the batch size and $n$ is the total number of samples. Then Algorithm 1 is DP with where and . (Here $g^{**}$ denotes the double convex conjugate of $g$, i.e., the greatest convex lower bound for $g$.)
Proof.
Algorithm 1 can be thought of as an adaptive composition of iterations of Algorithm 2, where the input to Algorithm 2 in each iteration is subsampled from the entire input with the sampling probability from Theorem 4.1. We already showed in Lemma 4.3 that Algorithm 2 satisfies $f$-DP with $f$ as claimed. The rest of the proof is very similar to Theorem 3 in Bu et al. (2019), which itself builds on a similar theorem in Dong et al. (2019). It proceeds by applying Proposition 4.5 to understand the effect of subsampling and an adaptive version of the composition in Proposition 4.4 to compose the privacy curves in all the iterations. ∎
One could hope to use the central limit theorem for composition from Dong et al. (2019); Bu et al. (2019) to find an approximate closed form expression for the final privacy curve. Unfortunately, these central limit theorems do not apply in our setting, because a quantity they require diverges. Instead, we numerically compute the final privacy curve obtained in Theorem 4.1, making use of Lemma 4.1.
4.3 Effect of JL dimension on privacy
Since the per-sample gradient norm estimates get more accurate with increasing JL dimension, it is clear that the privacy of DPSGDJL should converge to that of DPSGDVanilla for large JL dimension. We also observe that the privacy parameter $\epsilon$ is monotonically decreasing with increasing JL dimension and eventually converges to the $\epsilon$ of DPSGDVanilla. This can be seen from Figure 3.
Acknowledgements
We thank Sergey Yekhanin for his constant support and encouragement during this work. We also thank Lukas Wutschitz and Osman Ramadan for helpful discussions. Finally, we thank the amazing open source community of TensorFlow for their quick response in fixing bugs, which was crucial for our experiments (TFissue43449 [26]).
References
[1] Abadi et al. (2016). Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pp. 308–318.
[2] Augenstein et al. (2020). Generative models for effective ML on private, decentralized datasets. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26–30, 2020.
[3] Balle et al. (2020). Privacy amplification via random check-ins. CoRR abs/2007.06605.
[4] Baydin et al. (2017). Automatic differentiation in machine learning: a survey. The Journal of Machine Learning Research 18 (1), pp. 5595–5637.
[5] Bu et al. (2019). Deep learning with Gaussian differential privacy. arXiv:1911.11607.
[6] Carlini et al. (2019). The secret sharer: evaluating and testing unintended memorization in neural networks. In 28th USENIX Security Symposium (USENIX Security 19), pp. 267–284.
[7] Chen et al. (2019). Gmail smart compose: real-time assisted writing. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2019, Anchorage, AK, USA, August 4–8, 2019, A. Teredesai, V. Kumar, Y. Li, R. Rosales, E. Terzi, and G. Karypis (Eds.), pp. 2287–2295.
[8] Chen et al. (2020). Understanding gradient clipping in private SGD: A geometric perspective. CoRR abs/2006.15429.
[9] Deb et al. (2019). Diversifying reply suggestions using a matching-conditional variational autoencoder. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Industry Papers), Minneapolis, Minnesota.
[10] Dong et al. (2019). Gaussian differential privacy. arXiv:1905.02383.
[11] Dwork et al. (2006). Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pp. 265–284.
[12] Feldman (2020). Does learning require memorization? A short tale about a long tail. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, STOC 2020, Chicago, IL, USA, June 22–26, 2020, K. Makarychev, Y. Makarychev, M. Tulsiani, G. Kamath, and J. Chuzhoy (Eds.), pp. 954–959.
[13] Goodfellow. https://github.com/tensorflow/tensorflow/issues/4897#issuecomment290997283.
[14] Goodfellow (2015). Efficient per-example gradient computations. arXiv preprint arXiv:1510.01799.
[15] Lee and Kifer (2021). Scaling up differentially private deep learning with fast per-example gradient clipping. Proceedings on Privacy Enhancing Technologies 2021 (1), pp. 128–144.
[16] McMahan et al. (2018). Learning differentially private recurrent language models. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 – May 3, 2018, Conference Track Proceedings.
[17] Mironov (2017). Rényi differential privacy. In 2017 IEEE 30th Computer Security Foundations Symposium (CSF), pp. 263–275.
[18] Opacus. github.com/facebookresearch/pytorchdp.
[19] PyTorch Team. DP training on imdb. https://github.com/pytorch/opacus/blob/master/examples/imdb_README.md.
[20] Rochette et al. (2019). Efficient per-example gradient computations in convolutional neural networks. arXiv preprint arXiv:1912.06015.
[21] Shokri et al. (2017). Membership inference attacks against machine learning models. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 3–18.
[22] Subramani et al. (2020). Enabling fast differentially private SGD via just-in-time compilation and vectorization. arXiv preprint arXiv:2010.09063.
[23] TensorFlow Privacy. github.com/tensorflow/privacy.
[24] TensorFlow Team. Text classification with an rnn. https://www.tensorflow.org/tutorials/text/text_classification_rnn.
[25] Tetko et al. (1995). Neural network studies, 1. Comparison of overfitting and overtraining. J. Chem. Inf. Comput. Sci. 35 (5), pp. 826–833.
[26] TFissue43449. Fast per example gradient support. https://github.com/tensorflow/tensorflow/issues/43449.
[27] Thakkar et al. (2019). Differentially private learning with adaptive clipping. CoRR abs/1905.03871.
[28] Townsend. A new trick for calculating Jacobian vector products. https://jtowns.github.io/2017/06/12/Anewtrick.html.
[29] Zhou et al. (2020). Bypassing the ambient dimension: private SGD with gradient subspace identification. CoRR abs/2007.03813.
Appendix A DPSGD and DPAdamJL
For completeness, we provide pseudocode for DPSGD and DPAdamJL used in our experiments. DPAdamJL satisfies exactly the same privacy bounds as DPSGDJL.