DP-FP: Differentially Private Forward Propagation for Large Models

by   Jian Du, et al.

When applied to large-scale learning problems, the conventional wisdom on privacy-preserving deep learning, Differentially Private Stochastic Gradient Descent (DP-SGD), has met with limited success due to significant performance degradation and high memory overhead compared to its non-private counterpart. We show how to mitigate the performance drop by replacing DP-SGD with a novel DP Forward-Propagation (DP-FP) followed by an off-the-shelf non-DP optimizer. Our DP-FP employs novel (1) representation clipping followed by noise addition in the forward-propagation stage, as well as (2) micro-batch construction via subsampling to achieve DP amplification and reduce the noise power to 1/M, where M is the number of micro-batches in a step. When training a classification model, our DP-FP, with all privacy-preserving operations applied to the representation, is innately free of the gradient bias, model-size-proportional total noise, and memory issues of DP-SGD. As a result, DP-FP approaches non-private baselines while retaining the same level of privacy and significantly outperforms state-of-the-art DP-SGD variants. When applied to RoBERTa-large on four downstream tasks, for example, DP-FP achieves an average accuracy of 91.34% with privacy budgets less than 3, representing a 3.81% performance improvement over the state-of-the-art DP-SGD and only a 0.9% loss compared to the non-private baseline, but with a significantly lower privacy-leakage risk.




1 Introduction

Large pretrained language models such as BERT (devlin2019bert; roberta) have gained a lot of traction in a variety of downstream tasks due to their powerful representation learning from massive data. Simultaneously, machine learning models are notorious for leaking information about their (potentially sensitive) training data. For example, it is known that adversaries can use membership inference attacks (shokri2017membership) to predict whether or not a particular data record was in a model's training data. carlini2021extracting; shokri2017membership; carlini2019secret all demonstrate how specific and sensitive training records can be extracted from pretrained models. As a result, Differential Privacy (DP) (dp_dwork; dp_dwork2), a gold-standard framework for privacy-preserving computation, is becoming increasingly important and must be guaranteed for large pretrained language models in order to protect privacy.

Unfortunately, when applied to large models, DP has typically struggled to produce useful models, resulting in models with either weak privacy guarantees (dupuy2021efficient; qu2021privacy) or performance far below non-private baselines. hoory2021learning; yu2021large; li2021large have recently demonstrated, separately, that the performance drop under DP can be reduced by fine-tuning large pretrained models with Differentially Private Stochastic Gradient Descent (DP-SGD). During backward propagation, DP-SGD clips the gradient computed from each data record and adds random noise. However, it is widely acknowledged that DP-SGD invariably degrades model performance, raising a privacy-utility trade-off, and that the degradation is especially severe for large-scale models like BERT. Aside from lowering model accuracy, per-example gradient clipping significantly increases memory cost, making DP-SGD unsuitable for large-model training.

li2021large propose a ghost clipping method to help alleviate the memory cost. Nonetheless, gradient clipping in DP-SGD remains a significant source of accuracy degradation because it introduces gradient bias into the learning process, lowering both the convergence rate and model accuracy. Furthermore, adding noise to the clipped gradient, whose total power scales with the huge model dimension (yu2021large), slows DP-SGD convergence even further. As a result, there is a nearly 8% performance drop compared to the non-DP model on the SST-2 dataset under moderate privacy guarantees (li2021large; yu2021large).

To alleviate the issues above, we leverage the post-processing property of DP (dwork2014algorithmic, Proposition 2.1) to step back from DP in the backward-propagation stage and instead enforce DP in the forward computation, at the price of leaving labels unprotected. We consider this acceptable because most classification labels, such as "True" and "False," are trivial; only the data sample without its label requires protection, as long as it cannot be linked to the training data, which holds if the training data is protected under DP. We leave DP for generation tasks as future work. In contrast to DP-SGD for large-scale models (li2021large; yu2021large), we propose Differentially Private Forward Propagation (DP-FP) followed by an off-the-shelf optimizer such as SGD or Adam, which yields classification models with much higher accuracy while imposing no extra memory or computation burden on training. In this paper, we make the following main contributions:

  1. Unlike DP-SGD, which pays a high memory cost for clipping per-example gradients, DP-FP only clips the latent representation for noise calibration during the forward-propagation stage. Because an off-the-shelf optimizer can be used, DP-FP is naturally free of gradient bias and has no additional memory cost over its non-DP counterpart.

  2. The amount of noise required no longer scales with the model size and is only related to the representation dimension. More importantly, we propose a novel micro-batch structure that is compatible with the DP-FP process and further reduces the noise power to 1/M, where M is the number of micro-batches in each step. The maximum noise-power reduction occurs when M equals the mini-batch size.

  3. We show that DP-FP significantly improves model performance with a strong privacy guarantee. For example, compared to the cutting-edge DP-SGD method under the same privacy budgets on RoBERTa-large, DP-FP achieves an average accuracy of 91.34% on four downstream tasks, which is within 0.9% of the non-private baseline and 3.81% better than the cutting-edge DP-SGD method.

2 ML Model under DP

2.1 DP Preliminary

DP Definition: DP stabilizes the output by introducing "noise," i.e., random information, into the process of computing the output of a function, such as an ML model, making it nearly impossible for an adversary to determine whether or not a specific data record, such as a sentence in the training data, was used. Thanks to this assurance, individual data records face nearly the same privacy risk whether or not they are included in the computation.

The rigorous DP definition is as follows. Let $D$ and $D'$ be neighboring datasets, i.e., $D \simeq D'$, such that one can be obtained from the other by either adding or removing a data record (dwork2006calibrating). In an NLP task, for example, $D$ and $D'$ could be two datasets of sentences that differ by only one sentence. A randomized algorithm $\mathcal{M}$ is $(\epsilon, \delta)$-DP if for any subset $S$ of outputs and all neighboring datasets $D \simeq D'$, we have

$$\Pr[\mathcal{M}(D) \in S] \le e^{\epsilon} \Pr[\mathcal{M}(D') \in S] + \delta. \quad (1)$$

The probability distributions of the outputs $\mathcal{M}(D)$ and $\mathcal{M}(D')$ thus differ only by a multiplicative factor $e^{\epsilon}$ and a very small additive value $\delta$. The value of $\epsilon$ determines the strength of the privacy guarantee: the lower the value of $\epsilon$, the more privacy is guaranteed, while $\delta$ can be interpreted as the likelihood of failing to achieve DP. Note that $\mathcal{M}$ can be any complicated random function of $D$: for instance, it could be a gradient estimate in DP-SGD as shown later in Eq. (5), or a token/latent representation computed by neural networks in DP-FP as shown later in Eq. (9). To quantify privacy protection, data administrators impose a privacy-loss cap, also referred to as a privacy budget $(\epsilon, \delta)$, to calibrate the amount of random noise required to design $\mathcal{M}$.
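As a concrete toy illustration of the definition in Eq. (1) — not a mechanism used in this paper — randomized response satisfies $(\epsilon, 0)$-DP, and its worst-case output-probability ratio can be checked directly. The function names below are our own:

```python
import math
import random

def randomized_response(bit, eps, rng):
    # Report the true bit with probability e^eps / (1 + e^eps),
    # otherwise flip it; this is (eps, 0)-DP.
    p_truth = math.exp(eps) / (1.0 + math.exp(eps))
    return bit if rng.random() < p_truth else 1 - bit

def max_privacy_loss(eps):
    # Worst-case ratio Pr[M(D) = s] / Pr[M(D') = s] over outputs s and
    # neighboring inputs; for randomized response it equals e^eps.
    p = math.exp(eps) / (1.0 + math.exp(eps))
    return p / (1.0 - p)
```

The ratio bound in Eq. (1) is tight here: the mechanism's privacy loss is exactly $e^{\epsilon}$, with $\delta = 0$.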

DP Mechanism: To achieve the $(\epsilon, \delta)$-DP protection defined in Eq. (1), the random function $\mathcal{M}$ must be specified. A simple but effective method for achieving DP protection of a function $f(D)$ is obtained by first constraining the range of $f$ with a clipping operator $\mathrm{Clip}_C(\cdot)$, and then randomizing the result with calibrated Gaussian noise (abadi2016deep; li2021large). We introduce details of $\mathrm{Clip}_C(\cdot)$ in Eq. (6) and Eq. (8) with concrete examples. The clipping threshold $C$ is also known as the sensitivity, which bounds the greatest output variation for any pair $D \simeq D'$ in terms of the $\ell_2$-norm. Mathematically, the DP mechanism is defined by

$$\mathcal{M}(D) = \mathrm{Clip}_C(f(D)) + z, \quad z \sim \mathcal{N}(0, \sigma^2 C^2 \mathbf{I}), \quad (2)$$

with $\sigma$ being calibrated according to the DP profile given by (balle2018improving):

$$\delta(\epsilon) = \Phi\!\left(-\frac{\epsilon}{\mu} + \frac{\mu}{2}\right) - e^{\epsilon}\,\Phi\!\left(-\frac{\epsilon}{\mu} - \frac{\mu}{2}\right), \quad (3)$$

$$\mu = 1/\sigma, \quad (4)$$

where $\Phi$ is the c.d.f. of the standard Gaussian distribution.
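A minimal pure-Python sketch of the clip-then-noise mechanism and the profile in Eqs. (3)-(4), assuming $\mu = 1/\sigma$; the function names are illustrative, not from the paper:

```python
import math
import random

def std_normal_cdf(x):
    # Phi(x) for the standard Gaussian.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def delta_profile(eps, mu):
    # Privacy profile of the Gaussian mechanism: the smallest delta for
    # which (eps, delta)-DP holds, with mu = 1 / sigma.
    return (std_normal_cdf(mu / 2.0 - eps / mu)
            - math.exp(eps) * std_normal_cdf(-mu / 2.0 - eps / mu))

def gaussian_mechanism(vec, C, sigma, rng):
    # Clip vec to L2 norm at most C, then add N(0, sigma^2 C^2)
    # noise to every coordinate.
    norm = math.sqrt(sum(v * v for v in vec))
    scale = min(1.0, C / norm) if norm > 0 else 1.0
    return [v * scale + rng.gauss(0.0, sigma * C) for v in vec]
```

For a fixed $\mu$, $\delta(\epsilon)$ decreases rapidly in $\epsilon$; calibration inverts this relation numerically to find the $\sigma$ matching a target $(\epsilon, \delta)$.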

Post-Processing Property: A key property of DP is its immunity to post-processing (dwork2014algorithmic), which states that a differentially private output $\mathcal{M}(D)$ can be arbitrarily transformed using any data-independent function without compromising its privacy guarantees.

DP Accounting: When each DP mechanism individually meets certain DP guarantees, privacy degrades as these mechanisms are composed. Subsampling applied before a private mechanism, on the other hand, amplifies the privacy guarantee. Because differentially private model training entails a series of compositions of subsampling and updates, DP accounting for such a series of operations is required. There are several widely used privacy-accounting methodologies, such as the moments accountant (abadi2016deep), Rényi DP (RDP) (mironov2017renyi), Gaussian DP with the central limit theorem (dong2021gaussian; bu2020deep), and numerical composition of tradeoff functions (gopi2021numerical; koskela2020computing; zhu2021optimal). RDP, a generalization of the moments accountant, provides strict upper bounds on the actual privacy cost, whereas Gaussian DP with the central limit theorem (CLT), while asymptotically exact, tends to underestimate the privacy loss; composing tradeoff functions (gopi2021numerical) provides a more precise estimate. We use Gaussian DP with CLT for DP analysis due to its simplicity, and also report the total DP cost computed with the various methods listed above in the experimental analysis. To make the paper self-contained, we include more information about Gaussian DP with CLT in the appendix.

2.2 DP-SGD Challenges in Large Models

DP assurance has been incorporated into deep learning by appropriately randomizing conventional model updates so as to limit what can be inferred about the training data when the model is revealed (abadi2016deep). DP-SGD, which randomizes SGD-based optimizers such as SGD and Adam, is commonly used to achieve such a sanitized release of model parameters while concealing individual training data records.

Unlike traditional off-the-shelf optimizers, DP-SGD clips the $\ell_2$-norm of each per-sample gradient before adding calibrated noise, to protect the input data at every step. In more concrete terms, the DP mechanism for estimating the gradient in each $B$-sized batch is

$$\tilde{g} = \frac{1}{B}\left(\sum_{i=1}^{B} \mathrm{Clip}_C\!\left(\nabla_{\theta}\,\mathcal{L}(x_i)\right) + z\right), \quad z \sim \mathcal{N}(0, \sigma^2 C^2 \mathbf{I}_d), \quad (5)$$

with $d$ the model dimension and $\mathcal{L}(x_i)$ the loss function of data record $x_i$. The clipping operator, which shrinks the gradient of an individual example whenever its norm exceeds the threshold $C$, is computed by

$$\mathrm{Clip}_C(g) = g \cdot \min\!\left(1, \frac{C}{\lVert g \rVert_2}\right). \quad (6)$$

The model is then updated based on $\tilde{g}$, which is known as DP-SGD (footnote: "DP-SGD" here stands for the DP-enhanced version of most off-the-shelf optimizers, such as DP-Adam, which, like regular Adam, performs updates and moment accumulation with the privatized gradients $\tilde{g}$ of Eq. (5); in this paper we simply refer to all such DP-version optimizers as DP-SGD):

$$\theta_{t+1} = \theta_t - \eta\,\tilde{g}_t. \quad (7)$$

Since $\tilde{g}$ is DP-guaranteed with a per-step privacy cost by Eq. (5), the updated model on the left-hand side of Eq. (7) is DP-protected at the same cost, as DP is immune to post-processing. Furthermore, for $T$-step updates, the total privacy cost can be calculated by composing the costs of all steps using the methods discussed in the DP Accounting paragraph of Section 2.1. We denote the total privacy budget by $(\epsilon, \delta)$, with $\epsilon$ the privacy-loss bound and $\delta$ the failure probability. Applying the above DP-SGD to large-scale models raises the following challenges:

  • DP-SGD requires the computation and storage of individual gradients in each step due to the clipping operation in Eq. (5). Because each individual gradient requires the same amount of memory as the model itself, storing per-example gradients in DP-SGD consumes a significant amount of memory (li2021large).

  • The clipping operation in Eq. (6) inescapably introduces a significant bias (chen2020understanding) in the update direction, hindering convergence and significantly degrading model performance in general.

  • The total noise power scales with the model size $d$, as shown in Eq. (5), and degrades model performance significantly (yu2021large). Even with small per-coordinate noise, the total noise power is still huge for large models such as BERT, where $d$ scales to 110M.
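The per-example clipping and noisy averaging of Eqs. (5)-(6) can be sketched as follows (a simplified pure-Python illustration; the names are ours):

```python
import math
import random

def clip_l2(vec, C):
    # Eq. (6): shrink vec whenever its L2 norm exceeds C.
    norm = math.sqrt(sum(v * v for v in vec))
    scale = min(1.0, C / norm) if norm > 0 else 1.0
    return [v * scale for v in vec]

def dp_sgd_gradient(per_example_grads, C, sigma, rng):
    # Eq. (5): clip each example's gradient, sum, add calibrated
    # Gaussian noise, and average over the batch of size B.
    B = len(per_example_grads)
    d = len(per_example_grads[0])
    total = [0.0] * d
    for g in per_example_grads:
        for i, v in enumerate(clip_l2(g, C)):
            total[i] += v
    return [(t + rng.gauss(0.0, sigma * C)) / B for t in total]
```

Materializing `per_example_grads` is exactly the memory bottleneck described above: each entry is as large as the model itself.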

3 DP Forward-Propagation (DP-FP)

To address the DP-SGD issues mentioned above, we propose Differentially Private Forward Propagation, or DP-FP, which protects data privacy during forward propagation so that any off-the-shelf optimizer can be used in the back-propagation stage without DP. DP-FP is therefore free of DP-SGD's flaws: massive memory consumption, gradient bias, and a large-dimension noise vector. We also propose a novel micro-batch construction method for DP amplification that further reduces the noise power in each coordinate of the added noise vector.

3.1 Forward Propagation under DP

Let $D$ denote the entire training dataset and $\mathcal{S}$ be the subsampling scheme used to build a batch for each step of the model update, e.g., shuffling, sampling with/without replacement, or Poisson sampling. The latent representation is then denoted in composition form as $f(\mathcal{S}(D))$, with $f$ being the ML model that computes the latent representation, e.g., the hidden state or pooler output of [CLS]. (Footnote: in this paper, we simply use the pooler output of [CLS] as our choice of $f$, since it is straightforward to integrate our DP outside the encoder for classification tasks without changing any code inside the encoder; furthermore, it is much smaller in size than the transformer layers preceding the pooler.) Note that subsampling schemes provide DP amplification (wang2019subsampled; dong2021gaussian), which reduces the amount of noise required under the same privacy budget.

To achieve forward propagation under DP, we first stabilize the latent representation. Because the training data is random, the output of $f$ can vary significantly, implying that the model will vary significantly depending on which data records are used for training; data privacy is thus at risk of being compromised by a membership attack. We therefore constrain $f$'s output range by clipping the representation corresponding to each data record, such as a sentence in a language model. Similarly to DP-SGD, we clip the output of $f$, shrinking the latent representation whenever its norm exceeds a threshold $C$. More formally, the clipped latent representation of a record $x$ is given by

$$r = \mathrm{Clip}_C(f(x)) = f(x)\cdot\min\!\left(1, \frac{C}{\lVert f(x)\rVert_2}\right). \quad (8)$$

The clipping operation also implies that the greatest output variation for a pair of neighboring datasets in terms of the $\ell_2$-norm is bounded by $C$. Following clipping, Gaussian noise is added to ensure DP, with details on noise-power calibration provided later in this subsection. Before the downstream classification layer, the latent representation is computed as

$$\tilde{r} = \mathrm{Clip}_C(f(x)) + z, \quad z \sim \mathcal{N}(0, \sigma^2 C^2 \mathbf{I}_K), \quad (9)$$

where $K$ is the dimension of the latent representation. Since the input data is under DP according to Eq. (9), a model updated based on the result of Eq. (9) enjoys the same DP assurance by the post-processing property of DP (dwork2014algorithmic, Proposition 2.1). As a result, after Eq. (9), a conventional SGD-based optimizer such as Adam can be used, and the privacy of $D$ is still guaranteed.
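The per-record clipping and noising before the classifier can be sketched as follows (our illustrative code, assuming, as Algorithm 1 suggests, that the operations are applied to each record's representation):

```python
import math
import random

def dp_fp_batch(reps, C, sigma, rng):
    # Clip each record's latent representation to L2 norm at most C,
    # then add N(0, sigma^2 C^2) noise to every coordinate (Eq. (9)).
    out = []
    for rep in reps:
        norm = math.sqrt(sum(v * v for v in rep))
        scale = min(1.0, C / norm) if norm > 0 else 1.0
        out.append([v * scale + rng.gauss(0.0, sigma * C) for v in rep])
    return out
```

Note that the noise dimension here is `len(rep)` — e.g., 768 for a BERT-base [CLS] pooler output — not the model size, and everything downstream (classifier, loss, optimizer) runs unmodified on the privatized representations.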

So far, we have identified the following advantages of DP-FP over DP-SGD:

Proposition 1

Small noise dimension in DP-FP: Note that $\mathbf{I}_K$ in Eq. (9) is a $K$-dimensional identity matrix, so the noise-vector dimension $K$ in DP-FP is much smaller than $d$, the noise-vector dimension in Eq. (5), which equals the model dimension in DP-SGD. Take BERT for example: $d \approx$ 110M, whereas $K$ is only 768 when the hidden state of [CLS] is used for the downstream task. Thus, DP-FP saves significant total noise power relative to DP-SGD due to this dramatic dimension reduction.

Proposition 2

Unbiased gradient in DP-FP: The result of Eq. (9) is fed to the classifier, which predicts the label distribution, computes the loss, and then performs standard backpropagation via an off-the-shelf optimizer such as Adam for the model update. As a result, DP-FP backpropagation inherits all the advantages of the non-DP optimizer and produces an unbiased estimate of the true gradient.

3.2 Micro-batch Construction for DP Amplification in DP-FP

(Footnote: The micro-batch construction in our DP-FP is used to achieve DP amplification in order to reduce the calibrated noise variance without sacrificing privacy; this is distinct from the micro-batch functionality in TensorFlow Privacy, which reduces the memory cost of the DP-SGD implementation.)

Intuitively, privacy amplification by subsampling stems from the fact that an individual data record enjoys complete privacy if it is not included in the subsample. Based on this intuition, DP-SGD benefits from batch subsampling for DP amplification (li2021large; yu2021large), which significantly reduces the calibrated noise power per coordinate.

In contrast to DP-SGD, we show that additional DP amplification can be achieved by subsampling the micro-batches that comprise a batch, resulting in lower noise power for each coordinate of the $K$-dimensional noise vector in Eq. (9). Because of the unique structure of the DP operations in forward propagation, this DP amplification is specific to DP-FP and does not exist in DP-SGD, as discussed further below.

More concretely, an independent Bernoulli trial over all data records, i.e., sentences in a dataset, is performed to construct each micro-batch with subsampling probability $p$. Clipping is applied to each latent representation corresponding to an input data record, i.e., the hidden state of [CLS] in the BERT model is clipped. In the following, we evaluate the privacy cost using the Gaussian DP (GDP) framework (dong2021gaussian), which measures the privacy profile in terms of $\mu$ using Eqs. (3) and (4). To make our paper self-contained, we include a preliminary on GDP calculation in Appendix A.1, as well as the more detailed privacy-accounting procedure described below.

Table 1: Comparison of memory cost, computational cost, the number of coordinates to which noise is added, and the noise power at each coordinate. For all methods, $B$ is the batch size and $d$ is the model size. In our DP-FP, $K$ and $M$ are the latent-representation dimension used for downstream tasks and the micro-batch number, respectively. Note that $K \ll d$: e.g., $K = 768$ in the experiments, while $d \approx$ 110 million for the BERT model. Specifically for RGP in yu2021large, $w$ is the model width, $r$ is the reparametrization rank, and $k$ is the number of power iterations. (Footnote: we use slightly different symbol notation for the complexity analysis from that in yu2021large, without assuming the weight matrix is square.)

According to the subsampling DP amplification in the Gaussian DP framework (bu2020deep), the privacy cost corresponding to each micro-batch is given by

$$\mu_{\text{micro}} = p\sqrt{e^{1/\sigma^2} - 1}, \quad (10)$$

which is a function of the subsampling probability $p$ and the noise multiplier $\sigma$; the details of its derivation are given in the appendix. In each step, the training stage executes $M$ micro-batches, and $T$ steps of updates are performed over the training dataset. Even if each micro-step is DP-protected with a privacy cost of $\mu_{\text{micro}}$, the question is whether all micro-batches remain private when combined and, if so, how privacy degrades as the number of steps increases, a process known as DP composition. According to the central limit theorem for composing the $MT$ rounds of Gaussian DP mechanisms, the total privacy cost is

$$\mu_{\text{DP-FP}} \approx p\sqrt{MT\left(e^{1/\sigma^2} - 1\right)}, \quad (11)$$

with details provided in the appendix. It is evident that the smaller $p$ is, the smaller the total privacy cost $\mu_{\text{DP-FP}}$.

Similarly, in DP-SGD (li2021large) with subsampling probability $p_{\text{batch}}$ used to construct the mini-batch, the total privacy cost is given by

$$\mu_{\text{DP-SGD}} \approx p_{\text{batch}}\sqrt{T\left(e^{1/\sigma_{\text{SGD}}^2} - 1\right)}, \quad (12)$$

where $\sigma_{\text{SGD}}$ is DP-SGD's noise multiplier. To make a fair comparison, DP-FP is set to have the same expected batch size as DP-SGD by setting $p_{\text{batch}} = Mp$. In the strong-DP regime, $1/\sigma^2$ and $1/\sigma_{\text{SGD}}^2$ are very small positive values close to zero. Thus, by taking the Taylor-series expansion of the exponential functions in (11) and (12), respectively, and using the fact that $\mu_{\text{DP-FP}} = \mu_{\text{DP-SGD}}$ for the same privacy budget, we obtain

$$\sigma^2 \approx \frac{\sigma_{\text{SGD}}^2}{M}. \quad (13)$$
As a result, we have the third advantage of DP-FP over DP-SGD:

Proposition 3

Low noise power in DP-FP: Compared with DP-SGD, DP-FP requires less than $1/M$ of the noise power per coordinate. Note that this comparison holds under the same batch size, privacy budget, and clipping value. However, as demonstrated in li2021large, larger batch sizes improve performance in DP-SGD, whereas, as demonstrated later in our experiments, DP-FP prefers small batch sizes. Because each batch is constructed by randomly sampling each data record with probability $p$, the batch size decreases as $p$ decreases, and a smaller $p$ allows an even smaller $\sigma$ according to Eq. (11).
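Proposition 3's Taylor argument can be checked numerically, assuming the CLT composition formulas $\mu_{\text{DP-FP}} = p\sqrt{MT(e^{1/\sigma^2}-1)}$ and $\mu_{\text{DP-SGD}} = Mp\sqrt{T(e^{1/\sigma_{\text{SGD}}^2}-1)}$ as stated above (a sketch under our reading of the accounting, not the paper's exact code):

```python
import math

def mu_dp_fp(p, sigma, M, T):
    # Total budget after composing M * T subsampled Gaussian micro-steps.
    return p * math.sqrt(M * T * (math.exp(1.0 / sigma ** 2) - 1.0))

def mu_dp_sgd(p_batch, sigma, T):
    # Total budget for T subsampled Gaussian steps at rate p_batch.
    return p_batch * math.sqrt(T * (math.exp(1.0 / sigma ** 2) - 1.0))
```

For example, with p = 0.001, M = 32, T = 1000, and sigma_SGD = 20, running DP-FP at sigma = sigma_SGD / sqrt(M) spends a budget within about 2% of DP-SGD's: the per-coordinate noise power is cut by roughly 1/M at essentially the same privacy, and the gap vanishes as sigma grows (the strong-DP regime).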

Finally, we summarize the DP-FP procedure in Algorithm 1, including the micro-batch construction for DP amplification. Since the noise power in each step is calibrated according to the DP budget $(\epsilon, \delta)$ and the total number of steps $T$, the budget is fully spent after $T$ steps.

0:  DP budget $(\epsilon, \delta)$, sampling rate $p$, clipping threshold $C$, and representation function $f$.
1:  Put $(\epsilon, \delta)$ into (3) and compute the total privacy parameter $\mu$.
2:  Calibrate the noise power $\sigma$ by substituting $\mu$, $p$, $M$, and $T$ into (11).
3:  for $t = 1, \ldots, T$ do
4:     for $m = 1, \ldots, M$ in parallel do { $M$ micro-batches in each step:}
5:        Sample a micro-batch with an independent Bernoulli trial for each data record
6:        Clip each sampled record's representation and add noise via Eq. (9)
7:        Update the model { SGD-based off-the-shelf optimizer for model update}
8:     end for
9:  end for
Algorithm 1 DP-FP Training
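A toy end-to-end sketch of the training loop in Algorithm 1, on a linear model with squared loss (everything here — the identity "encoder," the loss, and the update rule — is our simplification; the paper fine-tunes transformer encoders with AdamW):

```python
import math
import random

def l2_clip(vec, C):
    # Shrink vec so its L2 norm is at most C (Eq. (8)).
    norm = math.sqrt(sum(v * v for v in vec))
    scale = min(1.0, C / norm) if norm > 0 else 1.0
    return [v * scale for v in vec]

def dp_fp_train(data, w, C, sigma, p, M, T, lr, rng):
    # data: list of (feature_vector, target) pairs; w: linear weights.
    for _ in range(T):                       # line 3: T steps
        for _ in range(M):                   # line 4: M micro-batches
            for x, y in data:
                if rng.random() >= p:        # line 5: Bernoulli trial
                    continue
                r = l2_clip(x, C)            # line 6: clip representation
                r = [v + rng.gauss(0.0, sigma * C) for v in r]  # add noise
                err = sum(wi * ri for wi, ri in zip(w, r)) - y
                for i in range(len(w)):      # line 7: plain (non-DP) SGD
                    w[i] -= lr * err * r[i]
    return w
```

With sigma = 0 and p = 1 this reduces to ordinary per-example SGD, illustrating that the only DP machinery sits in the forward path.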

3.3 Comparison to existing methods

In contrast to SGD, DP-SGD necessitates the computation and storage of individual gradients, resulting in batch- and model-size-dependent memory costs. Furthermore, the total noise power in DP-SGD scales linearly with model size. As a result, applying DP-SGD to large-scale models is difficult.

li2021large propose ghost clipping, a memory-saving technique that allows clipping in DP-SGD to run without instantiating per-example gradients. Extending the method of lee2020scaling, ghost clipping consumes nearly the same memory as non-private SGD training, at roughly double the computation cost, as validated numerically. We therefore use the memory and computational cost of SGD as a reference for ghost clipping. We also compare against the costs of the most recent memory-efficient method, reparameterized gradient perturbation (RGP) (yu2021large). While RGP is faster per update, it requires more than three times as many epochs as the ghost clipping method, as pointed out by li2021large.

4 Experiments

Model Method | ε = 3 (RDP): MNLI-(m/mm) QQP QNLI SST-2 Avg. | ε = 8 (RDP): MNLI-(m/mm) QQP QNLI SST-2 Avg.
BERT-base Non-DP | 83.97/84.46 90.99 90.09 92.55 88.41 | 83.97/84.46 90.99 90.09 92.55 88.41
BERT-base DP-SGD | 54.6/53.4 74.5 63.6 82.3 65.68
BERT-base RGP | 79.1/78.0 84.8 86.2 91.5 83.92
BERT-base DP-FP (ours) | 81.93/82.11 89.47 89.25 90.14 86.58 | 82.54/82.39 89.68 88.82 91.51 86.99
RoBERTa-base Non-DP | 87.33/87.08 91.07 91.61 94.27 90.27 | 87.33/87.08 91.07 91.61 94.27 90.27
RoBERTa-base DP-SGD | 82.47/82.10 85.41 84.62 86.12 84.14 | 83.30/83.13 86.15 84.81 85.89 84.66
RoBERTa-base RGP | – | 80.5/79.6 85.5 87.2 91.6 84.88
RoBERTa-base DP-FP (ours) | 85.84/86.18 89.92 90.35 93.23 89.10 | 86.00/85.81 89.84 90.96 93.23 89.17
RoBERTa-large Non-DP | 90.12/89.72 91.60 93.56 96.22 92.24 | 90.12/89.72 91.60 93.56 96.22 92.24
RoBERTa-large DP-SGD | 85.53/85.81 86.65 88.94 90.71 87.53 | 86.28/86.54 87.49 89.42 90.94 88.13
RoBERTa-large RGP | 86.1/86.0 86.7 90.0 93.0 88.36
RoBERTa-large DP-FP (ours) | 89.27/89.08 90.68 92.59 95.07 91.34 | 89.70/89.05 90.81 93.11 95.18 91.57
ε (Gaussian DP + CLT) | 2.52 2.52 2.00 1.73 | 5.83 5.85 4.75 4.33
ε (Compose tradeoff func.) | 2.75 2.75 2.57 2.41 | 7.15 7.16 6.87 6.69
Table 2: Accuracy scores of different methods on development sets. The DP-SGD and RGP results are documented numbers from yu2021large and li2021large. Best accuracy scores for each privacy level are in bold. "Avg." shows average scores over the four tasks. (Footnote: the DP-SGD and RGP results taken from yu2021large are reported under DP guarantees that are strictly weaker than those of li2021large's DP-SGD and our DP-FP; note that the smallest dataset among these tasks contains more than 60k records.)

4.1 Data and Settings

Following li2021large, we mainly focus on fine-tuning large pretrained models on classification tasks, including MNLI, QQP, QNLI, and SST-2 from the GLUE benchmark (wang2018glue), each of which has more than 60k training instances. We provide the data statistics for the four tasks in Appendix A.2.

Our non-differentially-private (Non-DP) baselines are fine-tuned BERT-base, RoBERTa-base, and RoBERTa-large. For classification, following common settings, we add a linear layer over the output of the encoder's pooler layer; the pooler layer simply takes the hidden state of the [CLS] token as input and applies a linear dense layer with a tanh activation. We train our baselines on the four datasets for 5 epochs with a fixed learning rate, batch size, and maximum input length. We save and validate each model every epoch and report the best accuracy scores as our baseline systems.

During the fine-tuning stage, our DP-FP method adds the noise and clipping operations (as in Algorithm 1) before the linear classification layer, based on the following two facts. First, this approach treats the pretrained encoder classes as black boxes and does not require any code change inside them. Second, for large pretrained models the noise dimension, i.e., 768 for BERT-base and RoBERTa-base and 1024 for RoBERTa-large, is fixed and much smaller than the model size, i.e., 110M for BERT-base. We then apply the standard AdamW (adamw) optimizer in the back-propagation stage. Note that we do not add any noise or clipping at inference time, as we only need to protect the fine-tuning training data. In the following experiments, our default DP-FP settings are three total fine-tuning epochs, with the clipping threshold, learning rate, maximum input length, micro-batch subsampling rate, and expected batch size held fixed across runs. We consider a practical scenario in which the total privacy budget is constrained. For both DP-FP and DP-SGD, this budget constraint corresponds to a constraint on the number of data samples used during tuning. We therefore report accuracy scores on the development sets once the privacy budget has been depleted.

To ensure a fair comparison with the DP-SGD results for large-scale models reported in the literature, we set the same privacy budget for Gaussian DP with CLT as in li2021large for each experiment, using ε values from the set considered there and δ determined by the number of records in the training set. We then numerically calibrate σ as in Line 2 of Algorithm 1. We also report the corresponding total privacy cost documented in li2021large as calculated by RDP and by composing tradeoff functions. Please see Section 2.1 and the references therein for further information.

4.2 Main Results

Table 2 shows the main results on four tasks. The larger the pretrained models, the higher the accuracies for the non-DP baselines, ranging from BERT-base (110M parameters) to RoBERTa-large (355M parameters).

We compare against full fine-tuning with reparameterized gradient perturbation (RGP) (yu2021large) and memory-efficient DP-SGD for large language models (li2021large), the state-of-the-art methods for DP fine-tuning on sentence classification at the time of writing. The DP-SGD scores are documented in li2021large; DP-SGD hurts the baseline performance by 4-6 points on average for RoBERTa-base and RoBERTa-large, even with ε as large as 8 (RDP). In particular, the RoBERTa-base results show that DP-SGD still degrades performance by up to 8 points on the SST-2 dataset. By contrast, RGP (yu2021large) reduces the gap significantly, to about 1 point on SST-2, but still trails the Non-DP BERT-base by at least 4 points on the other three tasks.

We run DP-FP at each privacy-budget level for three epochs of steps and report the final scores once the model reaches the privacy budget. Note that we use the default settings and the same hyper-parameters described in the previous subsection for every DP-FP experiment in Table 2. As shown in Table 2, DP-FP significantly improves accuracy over existing methods and closes the gap to the Non-DP counterpart: in particular, only a 0.58-point performance drop for SST-2 on RoBERTa-large with ε set to 1.73 under GDP with CLT. Averaging the DP-FP performance drops across datasets for each model, when ε = 3 (RDP) the average drops relative to the Non-DP models are within 0.83-1.78 points, and when ε = 8 they are within 0.68-1.43 points. Moreover, DP-FP is clearly better than the DP-SGD and RGP approaches. This significant advantage stems in part from the fact that DP-FP overcomes the biased-gradient problem of DP-SGD, and also has lower noise power thanks to the micro-batch number M as well as a lower noise dimension.

More interestingly, the larger the pretrained model, the smaller DP-FP's gap to the Non-DP model. For example, on the SST-2 task with ε = 3 (RDP), the gap between DP-FP and Non-DP shrinks from 2.41 points (BERT-base) to 0.58 points (RoBERTa-large). The main reason, we believe, is that because DP-FP adds noise to the latent representation before the linear classification layer, the total noise power does not scale with model size.

Figure 1: Accuracy scores for different batch sizes B on the SST-2 development set.

4.3 Hyperparameter Tuning

As suggested in li2021large, DP optimization is sensitive to hyper-parameter selection. We therefore examine the performance of DP-FP under various hyper-parameters, as DP-FP differs significantly from DP-SGD in architecture. In this section, we report all results for RoBERTa-base with DP-FP on the SST-2 dataset, and we fix the total training epochs to 3, at which point the privacy budget is depleted.

Figure 1 shows the heat map of accuracy scores for different batch sizes B and learning rates. The results suggest that 1) large batch sizes, such as 128 and 256, hurt performance, and 2) DP-FP requires a small learning rate to achieve better performance. Both observations are the opposite of the findings for DP-SGD in li2021large. The intuitive explanation is as follows. Substituting $p = B/(MN)$, where $N$ is the number of training records, into Eq. (11), we have the privacy parameter

$$\mu_{\text{DP-FP}} \approx \frac{B}{MN}\sqrt{MT\left(e^{1/\sigma^2} - 1\right)}. \quad (14)$$

Intuitively, this shows that, given a fixed privacy budget (a fixed $\mu_{\text{DP-FP}}$), the noise power decreases as the batch size $B$ decreases. However, we cannot reduce $B$ too much, because the model will then fail to converge.
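This trade-off can be made concrete by inverting the budget formula $\mu = p\sqrt{MT(e^{1/\sigma^2}-1)}$ for $\sigma$, holding $M$ and $T$ fixed for illustration (the helper below and its name are our own):

```python
import math

def sigma_for_budget(mu, p, M, T):
    # Solve mu = p * sqrt(M * T * (exp(1/sigma^2) - 1)) for sigma.
    # A smaller sampling rate p (smaller expected batch) tolerates a
    # smaller sigma, i.e., less noise, under the same budget mu.
    return 1.0 / math.sqrt(math.log(1.0 + (mu / p) ** 2 / (M * T)))
```

For instance, with mu = 1, M = 32, and T = 1000, the sigma required at p = 0.001 is roughly a quarter of that required at p = 0.01, matching the observation that DP-FP prefers small batches.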

Figure 2: Accuracy scores for different clipping thresholds C on the SST-2 development set.
Figure 3: Accuracy curves for different micro-batch numbers M on the SST-2 development set.

Figure 2 shows the heat map of accuracy scores for different clipping thresholds C, with the other hyper-parameters fixed as above. li2021large show that, for DP-SGD, small clipping thresholds lead to the best performance. In contrast, we find a different trend: clipping thresholds smaller than 0.4 actually hurt performance, while there is no significant change for larger thresholds (from 0.6 to 1.0). Clipping away too much of the latent representation with a small C in DP-FP discards too much semantic information and leads to a significant performance drop.

Finally, Figure 3 shows accuracy curves of DP-FP with different micro-batch numbers M on the SST-2 development set. In this experiment, we set the privacy budget to 0.02 under Gaussian DP with CLT (a very strong privacy level), the batch size to 32, and the total number of epochs to 3. Larger micro-batch numbers clearly lead to better performance: the larger M is, the more the noise power is reduced (by 1/M), as shown in Proposition 3.
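As a minimal sketch of the forward-stage operation (our own NumPy illustration under stated assumptions, not the authors' implementation): each example's latent representation is clipped to L2 norm at most C, then Gaussian noise with standard deviation σC is added; averaging M independent micro-batch contributions reduces the per-coordinate noise power by 1/M:

```python
import numpy as np

def clip_and_noise(h, C, sigma, rng):
    """Clip each row of h to L2 norm <= C, then add N(0, (sigma*C)^2) noise."""
    norms = np.linalg.norm(h, axis=1, keepdims=True)
    h_clipped = h * np.minimum(1.0, C / np.maximum(norms, 1e-12))
    return h_clipped + rng.normal(0.0, sigma * C, size=h.shape)

rng = np.random.default_rng(0)
h = rng.normal(size=(8, 16))                         # a micro-batch of 8 latent vectors
out = clip_and_noise(h, C=1.0, sigma=0.0, rng=rng)   # sigma=0 isolates the clipping step
assert np.all(np.linalg.norm(out, axis=1) <= 1.0 + 1e-9)
# With M independent micro-batches per step, the averaged noise has
# per-coordinate variance (sigma*C)^2 / M, i.e., noise power reduced by 1/M.
```

Because the classifier head only sees the noised representations, everything downstream is post-processing and incurs no extra privacy cost.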

5 Related Work

In the research line of sharing models while protecting the privacy of the corresponding training data, previous work mainly uses DP-SGD to train privacy-preserving models (shokri2015privacy; yu2019differentially). abadi2016deep propose a tight privacy accounting that yields a reasonable privacy cost for DP-SGD. Following that, variants of DP-SGD have been proposed to improve model accuracy, such as quantile-based clipping (andrew2019differentially), dynamic noise power and clipping values (du2021dynamic), and dynamic learning rates (wu2021adaptive). DP-SGD and its variants, however, have had limited success on large deep learning models due to their high computation cost, huge memory overhead, and significant performance drops.

To reduce the huge memory cost of each DP-SGD step, subramani2020enabling; anil2021large study an effective use of JAX primitives in conjunction with the XLA compiler. Later, li2021large mitigate the memory cost with a ghost clipping method, which still incurs roughly twice the computation cost due to per-example gradient clipping. yu2021large reduce the memory cost via rank decomposition.

To improve model accuracy, anil2021large successfully use a mega batch size (e.g., 2M examples) for DP-BERT, but still observe a 10% accuracy drop compared to the non-private BERT model. dupuy2021efficient investigate private BERT fine-tuning, but report results with ε of at least 100. Very recently, hoory2021learning; yu2021large; li2021large show that DP-SGD's performance drop can be mitigated by using large pretrained models. Even so, an obvious performance gap to the non-DP counterpart remains due to gradient bias and additive noise.

6 Discussion

It is instructive to compare the types of information that DP-SGD and DP-FP protect under DP. As illustrated in Eq. (6), in DP-SGD the label information is embedded in the gradient, so the data record under DP protection is each pair of a training sample and its label. In DP-FP, as illustrated in Eq. (8), the DP operation is performed in the forward stage, so the data record under DP protection is simply the training sample without its label. Interestingly, for the majority of classification tasks one only needs to protect the privacy of the data sample: the labels are drawn from a finite set and do not constitute private information as long as they cannot be linked to the training samples. Consider the SST-2 sentence classification task as an illustration, which contains 67,349 sentences that require protection under DP, each labeled "positive" or "negative". Because DP-FP ensures that an adversary can almost never recover any sentence from the fine-tuned model, labels cannot be associated with the training sentences. For generation tasks, however, protecting the label information in DP-FP is both required and difficult, and further model-architecture design is needed. We plan to investigate this as part of our future work.

7 Conclusion

In this paper, we have introduced the differentially private forward propagation (DP-FP) method for applying differential privacy to large pretrained models on classification tasks. The key design of DP-FP exploits differential privacy's post-processing property, ensuring privacy by protecting the latent representation in the forward stage, rather than following the conventional wisdom of DP stochastic gradient descent (DP-SGD), which protects the gradient in the backward stage. DP-FP has the same memory cost as an off-the-shelf SGD-based optimizer, an unbiased gradient, and significantly lower noise power that scales only with the latent-representation dimension; DP-SGD, by contrast, has a large memory cost, a biased gradient, and total noise power that scales with the huge model size. We have also designed micro-batches unique to DP-FP to further reduce the noise power on each coordinate. As a result, on a large model such as RoBERTa-large, DP-FP achieves an average accuracy of 91.34% on four downstream tasks with ε less than 3, which is only 0.9% lower than the non-private baseline and 3.81% better than the state-of-the-art DP-SGD method.


Appendix A Supplementary Formalism Details

Dataset Task Type Classes Average Length Training Sample Size (D)
MNLI Natural Language Inference 3 32.96 240,942
QQP Semantic Matching 2 24.77 384,348
QNLI Question Answering 2 39.74 104,374
SST-2 Sentiment Analysis 2 9.94 67,349
Table 3: Statistics of four GLUE data sets.

A.1 Privacy Accounting for DP-FP

Based on hypothesis testing between two Gaussian distributions, Gaussian DP (GDP) defines a canonical single-parameter family of privacy notions. It provides a computationally efficient tool for analyzing the exact composition of private algorithms in a tractable way. We present some key results below; for more details, please see dong2021gaussian.

The hypothesis-testing view of the privacy notion: Let $P$ and $Q$ denote the distributions of the random function outputs $\mathcal{M}(S)$ and $\mathcal{M}(S')$ with neighboring datasets $S$ and $S'$, and let $\phi$ be any rejection rule for testing the hypothesis $H_0: P$ against $H_1: Q$. With these in place, we can define the trade-off function between $P$ and $Q$ as follows:

$$T(P, Q)(\alpha) = \inf_{\phi}\{\beta_\phi : \alpha_\phi \le \alpha\},$$

where $\alpha_\phi = \mathbb{E}_P[\phi]$ and $\beta_\phi = 1 - \mathbb{E}_Q[\phi]$ are the type I and type II errors of the rejection rule $\phi$, respectively. When $P = \mathcal{N}(0,1)$ and $Q = \mathcal{N}(\mu,1)$, the trade-off function is a single-parameter function denoted by $G_\mu := T(\mathcal{N}(0,1), \mathcal{N}(\mu,1))$, and a mechanism whose trade-off function is lower-bounded by $G_\mu$ is referred to as $\mu$-GDP.
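For illustration (our own stdlib-only sketch, not part of the paper), the Gaussian trade-off function has the closed form $G_\mu(\alpha) = \Phi(\Phi^{-1}(1-\alpha) - \mu)$; below, $\Phi^{-1}$ is computed by bisection to avoid external dependencies:

```python
import math

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def Phi_inv(q):
    """Inverse standard normal CDF via bisection on [-10, 10]."""
    lo, hi = -10.0, 10.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if Phi(mid) < q:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def G(mu, alpha):
    """Trade-off function of N(0,1) vs N(mu,1): the mu-GDP curve."""
    return Phi(Phi_inv(1.0 - alpha) - mu)

# mu = 0 is perfect privacy: the curve equals Id(alpha) = 1 - alpha.
assert abs(G(0.0, 0.3) - 0.7) < 1e-9
# Larger mu means weaker privacy: lower type II error for the adversary.
assert G(1.0, 0.05) < G(0.5, 0.05)
```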

In each micro-batch of the DP-FP in (9) with the Gaussian mechanism, the mechanism achieves $\mu$-GDP, with $\mu$ given by the ratio of the sensitivity (set by the clipping threshold $C$) to the standard deviation of the added Gaussian noise. Consider the sampling scheme in which each individual data record is subsampled independently with probability $p$ from the training set to construct a micro-batch. bu2020deep demonstrate that, given two neighboring datasets $S$ and $S'$, if a random mechanism $\mathcal{M}$ is $f$-DP, then the subsampled mechanism, denoted $\mathcal{M} \circ \mathrm{Sample}_p$, satisfies

$$T\big(\mathcal{M} \circ \mathrm{Sample}_p(S),\ \mathcal{M} \circ \mathrm{Sample}_p(S')\big) \ge \min\{f_p, f_p^{-1}\}^{**},$$

where $f_p = p f + (1 - p)\,\mathrm{Id}$ with $\mathrm{Id}(\alpha) = 1 - \alpha$, and ${}^{**}$ denotes the double convex conjugate. Then, after $M$ micro-batches in each step and a total of $T$ steps, a Berry-Esseen-style CLT result of bu2020deep shows that, as $T \to \infty$ with $p\sqrt{MT}$ held constant, the composition of the r.h.s. of (15) converges to $\mu_{\mathrm{tot}}$-GDP with

$$\mu_{\mathrm{tot}} = p\,\sqrt{M T \left(e^{1/\sigma^2} - 1\right)}.$$
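The privacy-amplification effect of subsampling can be seen directly from the operator $f_p = p f + (1-p)\,\mathrm{Id}$; the tiny sketch below (our own illustration, with an arbitrary valid trade-off curve for concreteness) shows that subsampling can only raise the trade-off curve:

```python
def subsample_tradeoff(f, p):
    """Subsampling operator on trade-off functions (bu2020deep):
    f_p(alpha) = p * f(alpha) + (1 - p) * Id(alpha), with Id(alpha) = 1 - alpha."""
    return lambda alpha: p * f(alpha) + (1.0 - p) * (1.0 - alpha)

# Any valid trade-off function satisfies f <= Id on [0, 1], so the mixture
# with Id dominates f pointwise -- i.e., privacy is amplified.
f = lambda a: (1.0 - a) ** 2          # illustrative trade-off curve, f <= Id
f_p = subsample_tradeoff(f, p=0.01)
assert f_p(0.3) >= f(0.3)
```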
Then the total privacy cost in terms of $(\epsilon, \delta)$ after $T$ steps can be computed according to (3), which is repeated below:

$$\delta(\epsilon) = \Phi\!\left(-\frac{\epsilon}{\mu_{\mathrm{tot}}} + \frac{\mu_{\mathrm{tot}}}{2}\right) - e^{\epsilon}\,\Phi\!\left(-\frac{\epsilon}{\mu_{\mathrm{tot}}} - \frac{\mu_{\mathrm{tot}}}{2}\right),$$

where $\Phi$ is the standard normal CDF.
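This $\mu$-GDP to $(\epsilon, \delta)$ conversion from dong2021gaussian can be sketched in a few lines (our own stdlib implementation):

```python
import math

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def delta_of_eps(eps, mu):
    """delta(eps) for a mu-GDP mechanism (Dong et al., 2021)."""
    return Phi(-eps / mu + mu / 2.0) - math.exp(eps) * Phi(-eps / mu - mu / 2.0)

# The privacy curve behaves as expected: delta shrinks as eps grows,
# and a tighter mu (stronger GDP guarantee) also shrinks delta.
assert delta_of_eps(3.0, 1.0) < delta_of_eps(1.0, 1.0)
assert delta_of_eps(1.0, 0.5) < delta_of_eps(1.0, 1.0)
```

In practice one fixes $\delta$ (e.g., $\delta = 1/D$) and reports the smallest $\epsilon$ satisfying this equation.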

A.2 Dataset Statistics

Please refer to Table 3.