Backpropagation Clipping for Deep Learning with Differential Privacy

by Timothy Stevens, et al.

We present backpropagation clipping, a novel variant of differentially private stochastic gradient descent (DP-SGD) for privacy-preserving deep learning. Our approach clips each trainable layer's inputs (during the forward pass) and its upstream gradients (during the backward pass) to ensure bounded global sensitivity for the layer's gradient; this combination replaces the gradient clipping step in existing DP-SGD variants. Our approach is simple to implement in existing deep learning frameworks. The results of our empirical evaluation demonstrate that backpropagation clipping provides higher accuracy at lower values for the privacy parameter ϵ compared to previous work. We achieve 98.7% test accuracy on MNIST with ϵ = 0.07 and 74.0% test accuracy on CIFAR-10 with ϵ = 3.64.




1 Introduction

Deep learning has proved extremely effective in many different application areas. However, deep learning also poses privacy risks when used on sensitive data: deep learning models tend to leak information about the data they are trained on [shokri2017membership, jayaraman2020revisiting, yeom2018privacy], and in some cases, can memorize training data [carlini2021extracting, carlini2019secret].

Many of these concerns can be addressed by training with differential privacy [dwork2006calibrating, dwork2014algorithmic], a strong formal privacy definition. Differentially private stochastic gradient descent (DP-SGD) [abadi2016deep] is one algorithm for this task; it works by adding Gaussian noise to the computed gradient before each update step. The noise is scaled to the global sensitivity of the gradient, which is bounded by gradient clipping: scaling the gradient so that its norm is bounded by a constant. Recent work has steadily improved the accuracy of models trained using DP-SGD [papernot2021tempered]. However, a significant accuracy gap between differentially private and non-private models still exists, precluding wider adoption.

We propose a new approach called backpropagation clipping that significantly improves the accuracy of DP-SGD while providing the same formal guarantee of differential privacy. Our approach clips each trainable layer’s inputs (during the forward pass) and its upstream gradients (during the backward pass) to ensure bounded global sensitivity for the layer’s gradient; this combination replaces the gradient clipping step in existing DP-SGD variants.

Our approach is fundamentally different from gradient clipping that occurs after backpropagation, because backpropagation clipping actually redistributes gradient magnitude during backpropagation. This modification reduces gradient growth (which is a challenge for DP-SGD, as observed by Papernot et al. [papernot2021tempered]) and may help improve the signal-to-noise ratio of perturbed gradients.

Our experimental evaluation demonstrates a significant benefit to our approach. Compared to the state-of-the-art DP-SGD algorithm [papernot2021tempered], we improve accuracy on MNIST from 98.1% to 98.7%, on FMNIST from 86.1% to 88.6%, and on CIFAR-10 from 66.2% to 74.0%, while simultaneously reducing the privacy parameter in all three cases.


In summary, our contributions are:

  • We propose backpropagation clipping, a modified DP-SGD algorithm that clips inputs and upstream gradients during backpropagation (§ 3)

  • We prove global sensitivity bounds based on backpropagation clipping for fully-connected (§ 4) and convolutional (§ 5) layers

  • We show how to implement backpropagation clipping with minimal effort in existing deep learning frameworks like PyTorch and TensorFlow (§ 6)

  • Our empirical evaluation demonstrates that backpropagation clipping improves accuracy significantly compared to previous work (§ 7)

2 Preliminaries

Differential privacy.

Differential privacy is a formal privacy definition. A differentially private probabilistic mechanism is required to give a particular output with similar probability, with or without any particular individual's data. The strength of the guarantee is controlled by the privacy parameter (or privacy budget) ϵ.

Definition 1 (Differential Privacy [dwork2014algorithmic]).

Let ϵ > 0 and let M be a randomized algorithm. M satisfies ϵ-differential privacy if, for all datasets D and D′ that differ on a single element, and all sets of outputs S:

Pr[M(D) ∈ S] ≤ e^ϵ · Pr[M(D′) ∈ S]

where the probability is taken over the randomness used by the algorithm.

We leverage a recent variant of differential privacy called Zero-Concentrated Differential Privacy [bun2016concentrated] (zCDP), due to its favorable composition properties.

Definition 2 (Zero-Concentrated Differential Privacy (zCDP) [bun2016concentrated]).

Let ρ > 0 and let M be a randomized algorithm. M satisfies ρ-zero-concentrated differential privacy (ρ-zCDP) if, for all datasets D and D′ that differ on a single element, and all α ∈ (1, ∞):

D_α(M(D) ‖ M(D′)) ≤ ρα

where D_α denotes the Rényi divergence of order α between the output distributions of M on the two datasets.

zCDP can be easily converted to (ϵ, δ)-differential privacy:

Proposition 1 (zCDP conversion ([bun2016concentrated], Proposition 1.3)).

If a mechanism M satisfies ρ-zCDP, then, for any δ > 0, M also satisfies (ϵ, δ)-differential privacy for:

ϵ = ρ + 2√(ρ log(1/δ))

zCDP satisfies sequential composition: if M satisfies ρ-zCDP, then running M k times satisfies kρ-zCDP. It also satisfies parallel composition: running M on disjoint subsets of a dataset satisfies ρ-zCDP. To achieve zCDP, we can add Gaussian noise to the output of a (non-private) function f, scaled to the L2 sensitivity of f.
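These conversion and composition rules are simple to compute. The following Python sketch (function names are ours, not from the paper) applies sequential composition and then the conversion of Proposition 1:

```python
import math

def compose_sequential(rho, k):
    # Sequential composition: running a rho-zCDP mechanism k times
    # satisfies (k * rho)-zCDP.
    return k * rho

def zcdp_to_dp(rho, delta):
    # Proposition 1: a rho-zCDP mechanism satisfies
    # (rho + 2 * sqrt(rho * log(1/delta)), delta)-differential privacy.
    return rho + 2 * math.sqrt(rho * math.log(1 / delta))

rho_total = compose_sequential(0.01, 10)      # ten 0.01-zCDP releases: 0.1-zCDP
epsilon = zcdp_to_dp(rho_total, delta=1e-5)   # approx. 2.25
```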

Definition 3 (L2 sensitivity [dwork2014algorithmic]).

The L2 global sensitivity of a function f is:

Δf = max ‖f(D) − f(D′)‖₂

where the maximum is taken over datasets D and D′ that differ on a single element.

Proposition 2 (Gaussian mechanism [bun2016concentrated]).

For a function f with L2 sensitivity Δf, the mechanism:

M(D) = f(D) + N(0, σ² I)

satisfies (Δf² / (2σ²))-zCDP.
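As a hedged illustration (our sketch, not the paper's code), the Gaussian mechanism and its zCDP cost amount to a few lines:

```python
import random

def zcdp_cost(sensitivity, sigma):
    # Proposition 2: adding N(0, sigma^2) noise to a function with
    # L2 sensitivity Delta satisfies (Delta^2 / (2 * sigma^2))-zCDP.
    return sensitivity**2 / (2 * sigma**2)

def gaussian_mechanism(value, sigma, rng=random):
    # Release a noisy version of a (scalar) query answer.
    return value + rng.gauss(0.0, sigma)

rho = zcdp_cost(sensitivity=1.0, sigma=5.0)   # a 0.02-zCDP release
noisy_count = gaussian_mechanism(42.0, sigma=5.0)
```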

Input : Model architecture, training data X, noise parameter σ, minibatch size B, learning rate η, clipping parameter C, number of epochs T
Output : Noisy model weights θ.
Privacy guarantee: θ satisfies (Tk/(2σ²))-zCDP, for a network with k trainable layers.
1: initialize weights θ
2: for T epochs do
3:     for each disjoint batch Xb of size B do
4:         for each example xi ∈ Xb do
5:             gi ← ∇θ L(θ; xi)                               ▹ compute gradient
6:             ḡi ← gi · min(1, C / ‖gi‖₂)                     ▹ clip gradient
7:         for each trainable layer ℓ: g̃ℓ ← Σi ḡi,ℓ + N(0, σ²C² I)   ▹ add noise
8:         θ ← θ − (η/B) g̃                                    ▹ update model
Algorithm 1 DP-SGD [abadi2016deep], zCDP variant with per-layer noise addition


Differentially private stochastic gradient descent (DP-SGD) [abadi2016deep] is an iterative optimization algorithm based on gradient perturbation—at each iteration, the algorithm computes the gradient of the loss, adds noise to the gradient to satisfy differential privacy, then updates the model using the noisy gradient. DP-SGD has been successfully used to train deep learning models with differential privacy [abadi2016deep], but the noise used to achieve privacy has a significant impact on the accuracy of the trained model for smaller values of the privacy parameter ϵ.

The major challenge of DP-SGD is bounding the sensitivity of the gradient computation. This computation depends on the neural network architecture being trained, and can be complex; in addition, the worst-case sensitivity of the gradient is often extremely large (much larger than its average-case output).

To address this challenge, Abadi et al. [abadi2016deep] introduced gradient clipping, an approach that enforces an upper bound on the magnitude of the gradient for each example in a mini-batch, resulting in bounded sensitivity for the average gradient of that minibatch. Gradient clipping is controlled by a clipping parameter C, an upper bound on the maximum allowed L2 norm of any example's gradient. The clipping parameter must be set carefully (usually via hyperparameter tuning): if it is too large, we add more noise than necessary; if it is too small, we lose useful information from the gradient.

Algorithm 1 depicts a zCDP variant of the original DP-SGD algorithm of Abadi et al. [abadi2016deep]. Line 6 clips the gradient associated with each example in the current batch, so that the gradient sum in line 7 has an L2 global sensitivity of C. Line 7 adds noise to achieve (k/(2σ²))-zCDP for each training epoch (for a network with k trainable layers), and the entire training process (of T epochs) satisfies (Tk/(2σ²))-zCDP by sequential composition.
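The clip-then-noise steps of lines 6-7 can be sketched in NumPy (a simplified, per-parameter-vector version; names and shapes are ours, not the paper's):

```python
import numpy as np

def clip_l2(g, c):
    # Scale gradient g so that its L2 norm is at most c.
    norm = np.linalg.norm(g)
    return g * min(1.0, c / max(norm, 1e-12))

def noisy_batch_gradient(per_example_grads, c, sigma, rng):
    # Clip each example's gradient to norm c, sum them (the sum now has
    # sensitivity c), then add Gaussian noise with std sigma * c.
    total = sum(clip_l2(g, c) for g in per_example_grads)
    return total + rng.normal(0.0, sigma * c, size=total.shape)

rng = np.random.default_rng(0)
grads = [rng.normal(size=10) * s for s in (0.1, 5.0, 50.0)]
noisy = noisy_batch_gradient(grads, c=1.0, sigma=1.0, rng=rng)
```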

The variant in Algorithm 1 differs from the original of Abadi et al. [abadi2016deep] in that it does not make use of privacy amplification by subsampling or the moments accountant. Instead, our version ensures disjoint batches in each epoch, and bounds the per-epoch privacy cost by parallel composition. This change is to ease understanding and implementation; recent improvements in composition and subsampling yield similar algorithms with slightly tighter bounds on privacy cost [dong2019gaussian, bu2020deep, asoodeh2020better].

3 Backpropagation Clipping

Input : Model architecture with k trainable layers, training data X, noise parameter σ, minibatch size B, learning rate η, clipping parameters C_in and C_g, number of epochs T.
Output : Noisy model weights θ.
Privacy guarantee: θ satisfies (Tk/(2σ²))-zCDP (Theorem 1).
1: initialize weights θ
2: for each trainable layer ℓ do
3:     add_forward_hook(ℓ): clip ℓ's inputs to L2 norm at most C_in          ▹ clip inputs
4:     add_backward_hook(ℓ): clip ℓ's upstream gradients to norm at most C_g ▹ clip upstream grads
5:
6: for T epochs do
7:     for each disjoint batch Xb of size B do
8:         forward and backward pass on Xb, producing gradient gℓ for each trainable layer ℓ
9:         for each trainable layer ℓ do
10:            g̃ℓ ← gℓ + N(0, σ²Δℓ² I)    ▹ add noise; Δℓ bounds ℓ's sensitivity (Proposition 3)
11:        θ ← θ − (η/B) g̃                ▹ update model
Algorithm 2 DP-SGD with Backpropagation Clipping

This section introduces backpropagation clipping, our novel variant of DP-SGD. The key idea of our approach is to ensure bounded gradient norms for each layer of the neural network by enforcing an upper bound on the norms of the input and the upstream gradient to each layer. By upstream gradient, we mean the gradient of the “next” layer in the network, computed during backpropagation.

A summary of our modified algorithm appears in Algorithm 2. It is largely similar to Algorithm 1, with two major differences:

  1. We use forward and backward hooks to perform backpropagation clipping and ensure bounded sensitivity of each layer’s gradient (lines 2-4)

  2. We compute gradients using the full batch, rather than on a per-example basis, because per-example gradient clipping is not needed (lines 6-8)

Note that our approach does involve per-example clipping—but it takes place during the forward and backward passes, as a result of the forward and backward hooks installed in lines 3 and 4.

Enforcing bounded sensitivity.

We enforce these two upper bounds by adding “hooks” to the network that operate during the forward and backward passes. During the forward pass, our hook clips the inputs to each layer with trainable weights, so that the input to each of these layers has L2 norm bounded by C_in (Algorithm 2, line 3).

During the backward pass, our hook clips the upstream gradient passed to each layer with trainable weights, so that the final norm of the layer’s gradient is bounded by C_in · C_g. The specific method of clipping during the backward pass depends on the layer type, but generally works for any layer whose gradient is a function of the inputs and upstream gradient. Section 4 describes backpropagation clipping for fully-connected layers, and Section 5 describes the clipping approach for convolutional layers.

Proposition 3 (Sensitivity of the gradient).

For each layer ℓ, Algorithm 2 ensures the layer’s gradient has L2 global sensitivity at most Δℓ = C_in · C_g.

For each layer type with trainable weights in our architecture, we will need to define a combination of forward and backward hooks that ensures bounded gradient norm. As in traditional gradient clipping, bounding the maximum norm of the gradient is sufficient to ensure bounded global sensitivity.

Privacy analysis.

The privacy analysis of Algorithm 2 is similar to that of the original DP-SGD, but the global sensitivity of the gradient is bounded by our clipping hooks rather than explicit clipping after the gradient is computed.

Theorem 1 (Privacy of Algorithm 2).

For a neural network with k trainable layers, trained for T epochs, with input clipping parameter C_in, backpropagation clipping parameter C_g, and noise scale σ, Algorithm 2 satisfies (Tk/(2σ²))-zCDP.


By analysis of the definition of Algorithm 2.

  • Privacy of the gradient: by Proposition 3, the L2 global sensitivity of layer ℓ’s gradient (line 10) is at most Δℓ. By Proposition 2, adding noise N(0, σ²Δℓ² I) satisfies (1/(2σ²))-zCDP.

  • Privacy for the batch: The network has k trainable layers, so the noise addition loop (lines 9-10) runs k times. By sequential composition, the loop satisfies (k/(2σ²))-zCDP.

  • Privacy for one epoch: The batches are disjoint, so by parallel composition, each epoch of training (lines 6-11) satisfies (k/(2σ²))-zCDP.

  • Overall privacy: We run T epochs, so by sequential composition, Algorithm 2 satisfies (Tk/(2σ²))-zCDP. ∎

The privacy cost of Algorithm 2 can be converted to (ϵ, δ)-differential privacy (via Proposition 1) for comparison to other approaches. Our results in Section 7 are presented in terms of (ϵ, δ)-differential privacy.

Definition of the general per-layer gradient.

A layer with function denoted f, weights W, and inputs x computes an output y:

y = f(W, x)

During backpropagation, the gradient of the loss L with respect to the weights of f is computed using the chain rule:

∂L/∂W = (∂L/∂y) · (∂y/∂W)

Our primary goal is to bound the norm of the gradient ∂L/∂W by bounding the norms of these individual terms with C_g and C_in, respectively.

Specific norm bounds for common layer types.

We give specific details on the bounds for two common layer types in the next two sections. In Section 4, we show how to set C_in and C_g to bound the sensitivity of the gradient for fully-connected layers. In Section 5, we show how to set C_in and C_g for convolutional layers.

We focus on fully-connected and convolutional layers, due to their prevalence in models trained with DP-SGD. To use our approach with other layer types, associated sensitivity bounds would be required. We are investigating provable bounds and implementations of other common layer types, such as recurrent layers, token embeddings, and attention in order to enable this approach on additional model architectures.

4 Backpropagation Clipping for Fully-Connected Layers

This section shows how backpropagation clipping can be used to bound the global sensitivity of fully-connected layers. We first describe how gradients are calculated for these layers, then show how bounds on the layer’s input and upstream gradient imply a bound on the layer’s gradient.

Definition of the gradient.

A fully-connected layer with weights W and inputs x computes an output y:

y = W x

During backpropagation, the gradient of the loss L with respect to the weights of the layer is computed using the chain rule:

∂L/∂W = (∂L/∂y) xᵀ

The upstream gradient ∂L/∂y is the gradient of the loss with respect to this layer’s output, computed during backpropagation.

Bounding the gradient norm.

If we know an upper bound on the norms of and the upstream gradient, it is easy to bound the layer’s gradient with respect to the weights.

Proposition 4 (Gradient bound, linear layer).

For a linear layer with weights W, inputs x, and outputs y = Wx, and for any sub-multiplicative matrix norm ‖·‖:

‖∂L/∂W‖ = ‖(∂L/∂y) xᵀ‖ ≤ ‖∂L/∂y‖ · ‖x‖

Proposition 4 follows directly from the sub-multiplicative property of the norm, and directly implies a bound on the global sensitivity of the layer’s gradient.
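Proposition 4 is easy to check numerically. This PyTorch sketch (shapes are illustrative, not from the paper) verifies both the outer-product form of the gradient and the norm bound:

```python
import torch

torch.manual_seed(0)
x = torch.randn(5)                         # layer input
W = torch.randn(3, 5, requires_grad=True)  # layer weights
y = W @ x                                  # linear layer output

g = torch.randn(3)                         # stand-in for the upstream gradient dL/dy
y.backward(g)

# The gradient wrt the weights is the outer product (dL/dy) x^T ...
assert torch.allclose(W.grad, torch.outer(g, x))
# ... so its Frobenius norm is at most ||dL/dy||_2 * ||x||_2.
assert W.grad.norm() <= g.norm() * x.norm() + 1e-6
```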

Theorem 2 (Clipping, fully-connected layer).

For fully-connected layer ℓ and clipping parameters C_in and C_g:

  • If the input is clipped so that ‖x‖₂ ≤ C_in

  • And the upstream gradient is clipped so that ‖∂L/∂y‖₂ ≤ C_g

Then the L2 sensitivity of the layer’s gradient is bounded by:

Δℓ ≤ C_in · C_g

Theorem 2 follows directly from Proposition 4.


For the bias component of a fully-connected layer, the gradient of the loss is equal to the upstream gradient:

∂L/∂b = ∂L/∂y

Thus the gradient norm (and thus, the sensitivity) for the bias weights is bounded by C_g, the bound on the upstream gradient.

Activation functions.

Our consideration of fully-connected layers separates the layer itself from the activation function (as in PyTorch, where activations are layers with no trainable weights). Our bounds work for any activation function; we use ReLU activations in our evaluation. Our approach clips the upstream gradient between the activation layer and the fully-connected layer. Since the activation function has no trainable weights, it has no further impact on sensitivity or privacy.

5 Backpropagation Clipping for Convolutional Layers

This section shows how backpropagation clipping can be used to bound the global sensitivity of convolutional layers, following the same structure as Section 4. We will make extensive use of norm bounds given by Sedghi et al. [sedghi2018singular].

Definition of the gradient.

A 2-dimensional convolutional layer with filter weights W and inputs x computes an output y:

y = conv2d(x, W)

where conv2d is a 2-dimensional convolution operation. In this case, the gradient of the loss with respect to the filter weights is also computed using a convolution, in which the upstream gradient G = ∂L/∂y replaces the filter weights:

∂L/∂W = conv2d(x, G)

Here, the use of the chain rule is encoded in the conv2d operation.

Representing convolution as a linear transform.

Sedghi et al. [sedghi2018singular] show how a convolution operation can be represented using a linear transform, and how to derive properties of that transform without computing it explicitly. Given a convolution operation with input x, output y, and filters W, there exists a matrix M(W) such that:

vec(y) = M(W) · vec(x)

This linear transformation allows us to define our convolution gradient as:

vec(∂L/∂W) = M(G) · vec(x)

for some matrix M(G), defined from the upstream gradient G in the same way that M(W) is defined from our filters, W.
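The linear-transform view can be checked numerically. This hedged NumPy sketch (single channel, "valid" padding; the helper is our own, not the paper's code) builds the matrix M column by column from basis inputs and confirms that convolution is linear:

```python
import numpy as np

def conv2d_valid(x, w):
    # Single-channel 2-D cross-correlation with 'valid' padding.
    H, W_ = x.shape
    kH, kW = w.shape
    out = np.zeros((H - kH + 1, W_ - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kH, j:j + kW] * w)
    return out

rng = np.random.default_rng(0)
w = rng.normal(size=(2, 2))      # 2x2 filter on a 4x4 input

# Column k of M is the (flattened) convolution of the k-th basis input.
M = np.zeros((9, 16))            # output 3x3 = 9, input 4x4 = 16
for k in range(16):
    e = np.zeros(16)
    e[k] = 1.0
    M[:, k] = conv2d_valid(e.reshape(4, 4), w).ravel()

x = rng.normal(size=(4, 4))
assert np.allclose(M @ x.ravel(), conv2d_valid(x, w).ravel())
```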

Bounding the gradient norm for convolutional layers.

As with Theorem 2, we can use the sub-multiplicative property of the norm to form a bound on the gradient, following from the linear transformation above:

‖∂L/∂W‖₂ = ‖M(G) · vec(x)‖₂ ≤ ‖M(G)‖₂ · ‖x‖₂

We can enforce a bound on the global sensitivity of the convolutional layer’s gradient by bounding the norms of x and M(G). The bound on ‖x‖₂ can be enforced directly, while bounding ‖M(G)‖₂ requires calculating a bound from the values of G.

Bounding the linear transformation of a convolutional layer.

It is possible to calculate M(G) and bound it directly, but this would substantially reduce computational performance. Instead, we utilize the following existing bound for the norm of the matrix M(W) as a function of the filters W of a convolutional layer.

Proposition 5 (Norm bound for convolution as a linear transformation).

Suppose a convolution operation with input x, output y, and filters W is represented by a linear transform, or matrix, M(W). Then ‖M(W)‖₂ can be bounded as a function of the filter values W_{c,d,i,j}, where c iterates over the output channels, d iterates over the input channels, and i and j iterate over the filter values.


Follows from Sedghi et al. [sedghi2018singular] (Proposition 11). ∎

Substituting G for W in Proposition 5 allows us to bound the norm of M(G) and therefore bound the global sensitivity of a convolutional layer. In M(G), the input channels are the upstream gradients for different samples in a batch. Since we are concerned with the sensitivity of each individual sample, we calculate our bound of ‖M(G)‖ as if the batch size is 1, and do not utilize the max function in Proposition 5.

This gives us a final bound on the global sensitivity of a convolutional layer:

Theorem 3 (Clipping, convolutional layer).

For convolutional layer ℓ and clipping parameters C_in and C_g:

  • If the input is clipped so that ‖x‖₂ ≤ C_in

  • And the upstream gradient G is clipped, using the bound of Proposition 5, so that ‖M(G)‖₂ ≤ C_g

Then the L2 sensitivity of the layer’s gradient is bounded by:

Δℓ ≤ C_in · C_g

Theorem 3 follows directly from Proposition 5.

Padding and bias.

Our sensitivity bound inherits two limitations from the analysis of Sedghi et al. [sedghi2018singular]: it works only with valid and wrap-around padding, and it does not support a bias term. Further analysis of norm bounds for convolutions may allow relaxing these limitations.

6 Implementation

We can implement backpropagation clipping as defined in Algorithm 2 using PyTorch’s hooks, which allow us to register functions that modify each layer’s input and gradient during backpropagation. We can clip inputs (add_forward_hook in Algorithm 2) using PyTorch’s forward pre hook, and we can clip upstream gradients (add_backward_hook in Algorithm 2) by using PyTorch’s backward hook on the layer following the target layer.

Backpropagation clipping can be implemented in TensorFlow by defining a “clipping layer” with a custom gradient function (implemented using the custom_gradient decorator) that clips the upstream gradient.

In both systems, the implementation is simple and does not require significant changes to the underlying framework—the code in this section is sufficient to reproduce our implementation. Our complete implementation, including examples and the code to reproduce our experimental results, is available as open source.


Input clipping.

To clip inputs, we use PyTorch’s forward pre hook to register a clipping function for each layer with trainable weights. The clipping function (l2_clip_examples) performs clipping on each input example.

# L2 clipping for a layer's input
def clip_input(self, input):
    return tuple([l2_clip_examples(x, INPUT_BOUND)
                  for x in input])
# Called for each layer "L" with trainable weights

Here, the l2_clip_examples function is defined to clip each example in the batch input to have norm bounded by INPUT_BOUND. This is a form of per-example clipping, but it does not require modification of the backpropagation framework because per-example inputs (and gradients wrt a layer’s input) are available by default.
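A possible implementation of l2_clip_examples is sketched below (our own version; the paper's exact implementation may differ). It scales each example along the batch dimension so that its flattened L2 norm is at most the bound:

```python
import torch

def l2_clip_examples(t, bound):
    # Per-example clipping: scale each example (dim 0) of the batch
    # so that its flattened L2 norm is at most `bound`.
    norms = t.flatten(1).norm(dim=1).clamp(min=1e-12)
    scale = (bound / norms).clamp(max=1.0)
    return t * scale.view(-1, *([1] * (t.dim() - 1)))

# Example: a batch of 4 large inputs is scaled down to norm <= 1.0.
x = 100.0 * torch.randn(4, 10)
clipped = l2_clip_examples(x, 1.0)
```

Examples whose norm is already below the bound are left unchanged, since the scaling factor is capped at 1.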

Backpropagation clipping for fully-connected layers.

To clip the upstream gradient, we can use the backward hook on the activation layer directly following each fully-connected layer. PyTorch’s hooks do not allow modifying the upstream gradient directly, but do allow modifying the “input” gradient for a layer after it is calculated; modifying the input gradient of the layer following a fully-connected layer accomplishes the goal.

# L2 clipping for a layer's gradient wrt input
def clip_grad_fc(self, grad_input, grad_output):
    g = grad_input[0]
    return (l2_clip_examples(g, UPSTREAM_GRAD_BOUND),)
# Called for each layer "L" *following*
# a fully-connected layer

The constants UPSTREAM_GRAD_BOUND and INPUT_BOUND are hyperparameters that can be tuned in the same way as the traditional clipping parameter in DP-SGD. Smaller values for these hyperparameters result in lower sensitivity, but may harm training. See Section 7 for more on hyperparameter tuning.

Backpropagation clipping for convolutional layers.

Backpropagation clipping for convolutional layers is similar to fully-connected layers, but uses a different norm corresponding to Theorem 3. Because the bound imposed by Theorem 3 is not a standard norm, we call this clipping bound l15.

# L15 clipping for a layer's gradient wrt input
def clip_grad_conv(self, grad_input, grad_output):
    g = grad_input[0]
    return (l15_clip_examples(g, UPSTREAM_GRAD_BOUND),)
# Called for each layer "L" *following*
# a convolutional layer

In our experiments, we use the same UPSTREAM_GRAD_BOUND and INPUT_BOUND for fully-connected and convolutional layers. Using different constants for different layer types, or even for different layers of the same type could allow more accurate models at the cost of expanding the hyperparameter space.

7 Evaluation

This section describes our experimental evaluation. We compare our approach to previous work on DP-SGD for three datasets. The results demonstrate that backpropagation clipping provides significantly better accuracy than previous work for much smaller values of .

7.1 Experiment Setup

We evaluate Backpropagation Clipping using the MNIST, FashionMNIST, and CIFAR10 datasets. In each of these datasets, the current state of the art for differentially private ML has been set using the tempered sigmoid [papernot2021tempered].

Experiments are performed on an AWS g4dn.xlarge instance with NVIDIA T4 GPUs and 16GB of memory. The models and clipping method are implemented using PyTorch. The architectures used for these experiments are nearly identical to the architecture used in [papernot2021tempered], with the exception of the model for CIFAR-10. In our CIFAR-10 model, we replace the final stage (10-filter convolution, followed by spatial averaging) with a fully connected layer followed by a softmax, and change the filter dimensions so that padding is not needed. We set padding=valid and bias=False, consistent with the limitations of our sensitivity bound discussed in Section 5. We use ReLU activations for our convolutional and fully connected layers. Our architectures are described in Figures 3 and 4.

We train our models using the Adam optimizer and perform a parameter sweep by varying ϵ, the number of epochs, the batch size, and the clipping bounds for each layer. Figure 1 describes the hyperparameter space searched for all three datasets. We used grid search to find the optimal set of hyperparameters, described in Figure 2. Privacy budgets are accumulated using zCDP [bun2016concentrated] and sequential composition, and converted to (ϵ, δ)-differential privacy using Proposition 1.

Parameter        Range
Epochs           15, 25, 40, 100
Batch Size       512, 1024, 2048, 4096
Input Clip       0.1, 1, 5, 10
Gradient Clip    10⁻¹, 10⁻², 10⁻³, 10⁻⁴, 10⁻⁵
Epsilon          0.1 - 7
Figure 1: Hyperparameters tested for each data set
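Assuming the gradient-clip row of Figure 1 lists powers of ten (consistent with the 0.01 and 0.001 values in Figure 2), the searched grid can be enumerated directly:

```python
import itertools

# Hyperparameter space from Figure 1 (epsilon is varied separately).
grid = {
    "epochs":     [15, 25, 40, 100],
    "batch_size": [512, 1024, 2048, 4096],
    "input_clip": [0.1, 1, 5, 10],
    "grad_clip":  [10.0**-k for k in range(1, 6)],  # 0.1 down to 0.00001
}
configs = [dict(zip(grid, vals)) for vals in itertools.product(*grid.values())]
assert len(configs) == 4 * 4 * 4 * 5   # 320 configurations
```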
                 MNIST   FMNIST   CIFAR-10
Epochs           25      40       100
Batch Size       4096    4096     4096
Input Clip       1.0     10.0     5.0
Gradient Clip    0.01    0.01     0.001
Figure 2: Most accurate parameters for each data set.
Layer            Parameters
Convolution      16 filters of 8x8, strides 2
Max-Pooling      2x2
Convolution      32 filters of 4x4, strides 2
Max-Pooling      2x2
Fully Connected  32 units
Softmax          10 units
Figure 3: Architecture for MNIST and FMNIST
Layer Parameters
Convolution x 2 32 filters of 2x2, strides 1
Avg-Pooling 2x2, stride 1
Convolution x 2 64 filters of 2x2, strides 1
Avg-Pooling 2x2, stride 1
Convolution x 2 128 filters of 2x2, strides 1
Avg-Pooling 2x2, stride 1
Convolution x 2 256 filters of 2x2, strides 1
Fully Connected 256 units
Softmax 10 units
Figure 4: Architecture for CIFAR-10
Figure 5: Test accuracy as a function of privacy budget ϵ. Results averaged over 10 runs. “Non-private baseline” indicates baseline accuracy of our models without privacy. “Non-private clipping” indicates accuracy of our models when backpropagation clipping is used, but no noise is added. “Backprop clipping” indicates differentially private training using Algorithm 2.
Dataset Technique Acc. ϵ
MNIST SGD (not private) 99.0% 0
SGD w/ backprop clipping (not private) 98.9% 0
DP-SGD w/ ReLU [papernot2021tempered] 96.6% 2.93
DP-SGD w/ tempered sigmoid (tanh) [papernot2021tempered] 98.1% 2.93
DP-SGD w/ backprop clipping [ours] 98.7% 0.07
FMNIST SGD (not private) 89.4% 0
SGD w/ backprop clipping (not private) 88.7% 0
DP-SGD w/ ReLU [papernot2021tempered] 81.9% 2.7
DP-SGD w/ tempered sigmoid (tanh) [papernot2021tempered] 86.1% 2.7
DP-SGD w/ backprop clipping [ours] 88.6% 0.87
CIFAR-10 SGD (not private) 76.9% 0
SGD w/ backprop clipping (not private) 74.4% 0
DP-SGD w/ ReLU [papernot2021tempered] 61.6% 7.53
DP-SGD w/ tempered sigmoid (tanh) [papernot2021tempered] 66.2% 7.53
DP-SGD w/ backprop clipping [ours] 74.0% 3.64
Table 1: Summary of results comparing backpropagation clipping against state-of-the-art results for DP-SGD. DP-SGD results for comparison (both tempered sigmoid and ReLU activations) reproduced from Papernot et al. [papernot2021tempered]. Results for our approach are averaged over 10 runs.

7.2 Results

Figure 5 demonstrates the effect of changing ϵ on accuracy for our three datasets. Even with small values of ϵ, the accuracy of private models quickly approaches the accuracy of the non-private baseline: backpropagation clipping produces a usably accurate model for all three datasets at small ϵ, and as ϵ grows, the models approach the accuracy of the no-noise clipping baseline. The MNIST models report the best accuracy at low ϵ values, supporting the notion that MNIST is the easiest of the three datasets tested. CIFAR-10 is the most difficult problem tested, and uses a more complex model architecture than MNIST or FMNIST. Both of these factors lead to the lower accuracy and higher privacy budgets reported for CIFAR.

The higher privacy budgets of FMNIST and CIFAR are a function of the number of epochs we used in our optimal FMNIST and CIFAR models. The optimal configurations for FMNIST and CIFAR are presented in Figure 2. The best FMNIST and CIFAR models are trained for 40 and 100 epochs, respectively, implying more composed dataset queries and thus more privacy budget usage.

Table 1 compares our approach with the state of the art [papernot2021tempered]. In spite of reporting results for much smaller privacy budgets, our trained models achieve much higher test set accuracy for every dataset tested. On CIFAR-10, we improve upon the state of the art by 7.8 percentage points.

8 Discussion

Our accuracy results represent a substantial improvement over the state of the art in differentially private models trained from scratch in spite of limitations imposed on our model architecture. Our models use no padding or bias in any of their layers. This is due to the limitations of our bounds on global sensitivity (Sections 4 and 5). In particular, the results of Sedghi et al. [sedghi2018singular] for convolutions apply only when no padding is used. Extending this result to the case where padding is used may allow further improvements in accuracy by adding padding to our convolutional layers.

In all cases, our approach significantly reduces the privacy budget required to achieve reasonable accuracy (e.g. from ϵ = 2.93 to ϵ = 0.07 for MNIST). This reduction in ϵ is possible because aggressive clipping leads to significantly reduced sensitivity for the gradient; for example, the best parameters for MNIST lead to a sensitivity of 0.01 for each layer’s gradient, which is roughly 100 times smaller than values typically used for traditional gradient clipping (for MNIST, in the range of 1-5). The lower sensitivity allows for much less noise to be used while maintaining a small value of ϵ.

One limitation of our approach is the addition of new hyperparameters to tune. In typical differentially private deep learning settings, a single clipping parameter is applied to the entire gradient. With backpropagation clipping, there are potentially up to two clipping parameters per layer, which substantially increases the size of the hyperparameter search space. For simplicity, we apply the same gradient clipping parameter and input clipping parameter to each layer.

9 Related Work

Deep learning with differential privacy.

The first algorithm for differentially private deep learning is due to Shokri and Shmatikov [shokri2015privacy], but it required very high values for ϵ. Abadi et al. [abadi2016deep] developed a significantly improved approach based on the improved composition bounds of the moments accountant, and were able to achieve 95% accuracy on MNIST. Both approaches bound sensitivity using gradient clipping, as described in Algorithm 1.

Since then, further advances have improved the DP-SGD approach. Nasr et al. [nasr2020improving] achieved 96.1% accuracy on MNIST via gradient encoding and denoising, and Feldman and Zrnic [feldman2020individual] achieved 96.6% accuracy on MNIST. Asoodeh et al. [asoodeh2020better] use f-divergences to give better composition bounds (allowing more epochs of training); Bu et al. [bu2020deep] also use f-divergences (via Gaussian differential privacy [dong2021gaussian]) and achieve 96.6% accuracy on MNIST. Tramer and Boneh [tramer2020differentially] provide a good overview of these results.

Papernot et al. [papernot2021tempered] replace ReLU activations with tempered sigmoids (e.g. Tanh) that have bounded output ranges. This has a similar effect to our input clipping step, and also has the indirect effect of reducing gradient norms. This approach achieves 98.1% accuracy on MNIST, the best of any DP-SGD approach of which we are aware.

Pre-trained layers.

Pre-training some layers (without privacy) on a separate but related public dataset can improve accuracy significantly, but requires public data to be available. Abadi et al. [abadi2016deep] obtain 73% accuracy on CIFAR-10, but pre-train the convolutional layers (without privacy) on CIFAR-100 data. End-to-end differentially private training for CIFAR-10 yields much lower accuracy, as reported in subsequent work [nasr2020improving, papernot2021tempered].

Similarly, Tramer and Boneh [tramer2020differentially] use a pre-trained ScatterNet model as a pre-processing step, and train either a linear classifier or a convolutional network on ScatterNet’s output. They achieve 69.3% accuracy on CIFAR-10 using the convolutional model.

Our results are for end-to-end differentially private training; we would expect similar improvements in accuracy if we pre-trained some components of our architectures.

Transfer learning approaches.

Differentially private transfer learning is an alternative to DP-SGD that avoids perturbing gradients: it instead adds noise to predictions, and uses those predictions for transfer learning. This approach was first proposed in the PATE framework [papernot2016semi, papernot2018scalable], with the theory further developed by Bassily et al. [bassily2018model]. Transfer learning approaches generally achieve better accuracy than DP-SGD (e.g. Papernot et al. [papernot2018scalable] achieve 98.5% accuracy on MNIST). However, they are more challenging to deploy, because they require training many models (the “teachers”) and sometimes require access to unlabeled public data. To the best of our knowledge, our approach is the first DP-SGD variant to provide better accuracy than approaches based on transfer learning.

Dimensionality reduction approaches.

A major challenge specific to differentially private deep learning is that deep learning models may have millions or tens of millions of trainable weights. The lottery ticket hypothesis [frankle2018lottery] suggests that only a small subset of these weights really “matter” for model performance. Zhou et al. [zhou2020bypassing] and Luo et al. [luo2021scalable] both use public data to learn a gradient subspace sufficient for good performance, and demonstrate improved accuracy by reducing dimensionality. These techniques could be combined with backpropagation clipping to improve accuracy even further, but would require access to public data.

10 Conclusion

We have presented backpropagation clipping, a novel variant of DP-SGD that clips each trainable layer’s inputs (during the forward pass) and its upstream gradients (during the backward pass) to ensure bounded global sensitivity for the layer’s gradient. Our approach is simple to implement in existing deep learning frameworks. The results of our empirical evaluation demonstrate that backpropagation clipping provides higher accuracy at lower values of the privacy parameter ϵ when compared to the state-of-the-art for end-to-end differentially private deep learning.


This material is based upon work supported by the Broad Agency Announcement Program and the Cold Regions Research and Engineering Laboratory (ERDC-CRREL) under Contract No. W913E521C0003, and by the Defense Advanced Research Projects Agency (DARPA) under Agreement No. HR00112090106. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the United States Government or DARPA.