An Efficient DP-SGD Mechanism for Large Scale NLP Models

by   Christophe Dupuy, et al.

Recent advances in deep learning have drastically improved performance on many Natural Language Understanding (NLU) tasks. However, the data used to train NLU models may contain private information such as addresses or phone numbers, particularly when drawn from human subjects. It is desirable that the underlying models do not expose private information contained in the training data. Differentially Private Stochastic Gradient Descent (DP-SGD) has been proposed as a mechanism to build privacy-preserving models. However, DP-SGD can be prohibitively slow to train. In this work, we propose a more efficient DP-SGD for training on GPU infrastructure and apply it to fine-tuning models based on LSTM and transformer architectures. We report faster training times, alongside accuracy, theoretical privacy guarantees and the success of membership inference attacks for our models, and observe that fine-tuning with the proposed variant of DP-SGD can yield competitive models without significant degradation in training time and with improved privacy protection. We also observe that even looser theoretical (ϵ, δ) guarantees can translate into significant practical privacy gains.





1 Introduction

Large-scale NLP models have contributed significantly to the success of commercial voice assistants like Amazon Alexa, Google Assistant and Siri. They have shown high generalization accuracies for various learning tasks, from question answering (Qu et al., 2019) to named entity recognition (Akbik et al., 2019). However, large NLP models can be prone to privacy attacks (e.g. MIA: Membership Inference Attacks (Shokri et al., 2017)) and can leak data used to train these models. In this paper, we focus on measuring and mitigating the privacy risks of these models. Specifically, differentially private (DP) model building algorithms have shown promise in providing defense against privacy attacks (Erlingsson et al., 2019; Rahimian et al., 2019). We focus on a specific central differentially private mechanism, Differentially Private Stochastic Gradient Descent (DP-SGD), and evaluate its impact on model utility and privacy.

DP-SGD (Abadi et al., 2016) is an extension of the popular stochastic gradient descent algorithm that offers theoretical (ϵ, δ) privacy guarantees (Dwork et al., 2006). In our work, we extend the proposals of McMahan et al. (2016) and Yu et al. (2019) and propose an efficient DP-SGD (eDP-SGD) suited for GPU training. Specifically, we build on the addition of noise to gradients over data batches proposed by McMahan et al. (2016), computing gradients per GPU for enhanced efficiency, and we use a clipping parameter for each layer of the NLU models used in this study. Inspired by Yu et al. (2019), we apply noise decay to the gradients computed on each GPU. We apply the combination of these techniques to NLU model fine-tuning (as opposed to full training) on datasets of various sizes. For privacy evaluation, we study the impact of our extension of DP-SGD on MIA success. Previous work on the effects of DP on MIA performance studies the two extreme cases where either 1) the DP model is resilient to MIA but suffers significant degradation in performance (Rahman et al., 2018); or 2) the DP model achieves similar performance to the non-private model but does not offer significant gains in privacy (Jayaraman & Evans, 2019). We observe that applying our method with looser theoretical DP guarantees translates into a significant reduction in MIA performance. We also report the impact of using DP-SGD and our extension on the training time which, to the best of our knowledge, no prior work has done. We observe that DP-SGD can be prohibitively slower (up to 150 times) than non-private baselines while our method, in the worst case, is slower by a factor of about 2.

In this paper, we make the following contributions: (i) We study the impact of applying DP techniques to NLP models trained on large-scale datasets. We present a computationally efficient setting for DP-SGD and provide a comparison in terms of training time, accuracy and privacy of the non-private and DP models. To the best of our knowledge, no other study has made this comparison, let alone in a large-scale NLP setting. (ii) We demonstrate that by using DP-SGD during fine-tuning, one can obtain models with competitive utility (in comparison to models trained with vanilla SGD), while achieving significant gains in protection against privacy attacks. (iii) Building on existing DP-SGD variants, we propose an extended version of the DP-SGD technique which is computationally efficient and report significant compute gains over DP-SGD. The conclusions of this paper are novel and we believe them to be useful for the NLP community as we advance the state of knowledge for the DP-SGD algorithm.

2 Related work

DP-SGD (Abadi et al., 2016) modifies vanilla SGD by clipping the gradients computed over each individual datapoint, followed by accumulation of the clipped gradients over a batch and noise addition. Researchers have applied DP-SGD to reduce memorization in language models (Thomas et al., 2020; McMahan et al., 2017), data generation (Xie et al., 2018) and image classification (Rahman et al., 2018). Chen et al. (2020) and Jagielski et al. (2020) aim to understand the properties of DP-SGD. Attempts have been made to improve the efficiency of DP-SGD by improving communication protocols in a distributed training setting, where individual servers contribute gradients on locally stored data and privacy of the data in each local server is desirable (Agarwal et al., 2018; Xu et al., 2020). Adaptive variants of DP-SGD in a federated setting have also been proposed (Thakkar et al., 2019).
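As an illustration only, the per-example clipping, accumulation and noise addition described above can be sketched in plain Python. The function names (`clip`, `dp_sgd_gradient`), the use of a noise standard deviation equal to the noise multiplier times the clipping norm, and the final averaging step are assumptions made for this sketch; it is not the implementation used in our experiments.

```python
import math
import random


def clip(grad, clip_norm):
    """Rescale a gradient vector so that its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    factor = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [g * factor for g in grad]


def dp_sgd_gradient(per_example_grads, clip_norm, noise_multiplier, rng):
    """Clip every per-example gradient, sum them, add Gaussian noise, average."""
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for grad in per_example_grads:
        clipped = clip(grad, clip_norm)
        total = [t + c for t, c in zip(total, clipped)]
    # Assumed noise scale: standard deviation = noise_multiplier * clip_norm.
    std = noise_multiplier * clip_norm
    noisy = [t + rng.gauss(0.0, std) for t in total]
    return [n / len(per_example_grads) for n in noisy]


if __name__ == "__main__":
    rng = random.Random(0)
    grads = [[3.0, 4.0], [0.3, 0.4], [-1.0, 2.0]]
    print(dp_sgd_gradient(grads, clip_norm=1.0, noise_multiplier=1.1, rng=rng))
```

The loop over `per_example_grads` is what makes the standard scheme expensive: one gradient has to be materialised and clipped for every example in the batch.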

MIA has been studied in a variety of settings, such as MIA with synthetic or noisy data (Shokri et al., 2017), shallow models (Truex et al., 2019), non-matching shadow and target data sets (Salem et al., 2019), customer-level MIA on language models (Song & Shmatikov, 2019), and MIA on GANs (generative adversarial networks (Goodfellow et al., 2014)) (Hayes et al., 2017). DP-SGD, while carrying theoretical privacy guarantees, has also been demonstrated to provide defense against MIA (Rahman et al., 2018). In this work, we report both theoretical and MIA-based privacy quantifications.

3 Efficient Differentially Private Stochastic Gradient Descent (eDP-SGD)

In this section, we present the eDP-SGD algorithm, which modifies the following three techniques and adapts them for GPU-based training. Given the high risks associated with NLP models revealing private information, the growing concern over these risks, and the scarcity of studies on privacy-preserving algorithms for NLP, we have chosen some sensitive and proprietary datasets of a voice assistant for our studies. We propose the following modifications to the existing DP-SGD technique in order to 1) improve training speed and 2) preserve the accuracy of the models, both of which are affected when training models with differential privacy.

Micro-batch computations per GPU. DP-SGD requires clipping the gradient of every single example in the batch. This induces a significant computational cost: given a batch of size B, DP-SGD requires computing B gradients, as opposed to a single gradient in classic SGD. McMahan & Andrew show that it is possible to group examples into micro-batches in the DP-SGD scheme and still maintain DP guarantees for the resulting model. This manipulation is equivalent to a global Gaussian mechanism, and the authors provide the equivalence relationship in their paper. We leverage this work and apply the DP-SGD computations to the micro-batch contained within each GPU. Given a batch of B data points and G GPUs, we divide the batch into G micro-batches, each processed independently on its own GPU.

We make a further addition to the algorithm suggested by McMahan & Andrew: we add scaled noise to the gradient computed per micro-batch within each GPU (as opposed to adding noise after aggregating the gradients from all micro-batches). With multiple GPUs, adding suitably scaled Gaussian noise to the gradient of each micro-batch is equivalent to adding Gaussian noise to the gradient aggregated over the micro-batches. This change removes the need for aggregation before noise addition and accelerates computation.
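The equivalence between per-GPU noise and noise added after aggregation can be checked numerically. The sketch below assumes that the micro-batch gradients are aggregated by summation and that each of the `num_gpus` devices draws independent Gaussian noise with standard deviation `sigma / sqrt(num_gpus)`; both the scaling and the names are assumptions for illustration. Under these assumptions, the summed noise has standard deviation `sigma`, as if it had been added once after aggregation.

```python
import math
import random
import statistics


def aggregated_noise_samples(num_gpus, sigma, num_trials, seed=0):
    """Draw per-GPU noise and return the summed noise for each trial."""
    rng = random.Random(seed)
    # Assumed per-GPU scale so that the sum over GPUs has variance sigma**2.
    per_gpu_std = sigma / math.sqrt(num_gpus)
    return [
        sum(rng.gauss(0.0, per_gpu_std) for _ in range(num_gpus))
        for _ in range(num_trials)
    ]


samples = aggregated_noise_samples(num_gpus=8, sigma=2.0, num_trials=20000)
empirical_std = statistics.pstdev(samples)
print(f"target std: 2.0, empirical std of summed noise: {empirical_std:.3f}")
```

If the framework averages rather than sums gradients across devices, the per-GPU scale would have to change accordingly; the sketch does not cover that case.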

Scaling. For large-scale models, the magnitude of the gradient varies across parameters in the model. For instance, the magnitude of gradients for parameters in the lower layers can differ from that in the upper layers. Hence, a constant clipping parameter can be either too aggressive or too weak for a given set of parameters. In this case, given a set of parameters (e.g. those drawn from a particular layer in a neural network), a strategy that clips gradients with a parameter specific to the set is preferable. However, this strategy may again lead to poor privacy guarantees when the clipping parameters assigned to each set vary widely. Inspired by the scaling approach suggested by McMahan & Andrew, we compute a scaling factor for each layer, proportional to the norm of the gradient calculated on the first iteration for that layer. The scaling is applied to the gradient computed over the micro-batch contained in each GPU. Since the model parameters are randomly initialized and the gradient norm is expected to decrease during training, the norm of the gradient in the first iteration gives a rough estimate of the upper bound of the gradient magnitude throughout training. We scale a constant clipping value by this factor for each layer. This strategy also reduces the number of hyper-parameters to tune for every new model or dataset.
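A minimal sketch of per-layer scaling is given below. The text only states that the scaling factor is proportional to the first-iteration gradient norm of each layer; normalising by the largest layer norm, as well as the names `layer_clip_values` and `clip_layer` and the layer names in the example, are assumptions made for illustration.

```python
import math


def l2_norm(values):
    """Euclidean norm of a flat list of numbers."""
    return math.sqrt(sum(v * v for v in values))


def layer_clip_values(first_iter_grads, base_clip):
    """Scale one constant clipping value by each layer's relative gradient norm.

    first_iter_grads maps a layer name to its flattened first-iteration gradient.
    Assumption: the proportionality constant is 1 / (largest layer norm).
    """
    norms = {name: l2_norm(grad) for name, grad in first_iter_grads.items()}
    largest = max(norms.values())
    if largest == 0.0:
        return {name: base_clip for name in norms}
    return {name: base_clip * norm / largest for name, norm in norms.items()}


def clip_layer(grad, clip_value):
    """Clip one layer's gradient to the layer-specific clipping value."""
    norm = l2_norm(grad)
    factor = min(1.0, clip_value / norm) if norm > 0 else 1.0
    return [g * factor for g in grad]


if __name__ == "__main__":
    first_grads = {"embedding": [3.0, 4.0], "output": [0.6, 0.8]}
    clips = layer_clip_values(first_grads, base_clip=1.0)
    print(clips)
    print(clip_layer([3.0, 4.0], clips["embedding"]))
```

Only the single base clipping value needs tuning in this scheme; the per-layer values follow from the gradients observed on the first iteration.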

Noise decay. In DP-SGD, the amount of noise added is the same for all training iterations, with a variance equal to the clipping parameter times the noise multiplier used to compute the theoretical DP guarantees. As the magnitude of the gradients is likely to decrease when approaching convergence, adding noise with constant variance can lead to slower convergence (Yu et al., 2019), because the magnitude of the noise can be significantly higher than the magnitude of the gradient. In addition, noise with high variance can wash out the information contained in the gradients after a few epochs, while noise with low variance would yield weak privacy guarantees. To improve convergence, Yu et al. reduce the noise variance at every epoch by scaling the initial noise multiplier by a decreasing function governed by a decay parameter. In our work, we use this strategy, but apply it to the gradients computed independently at each GPU. We decay the noise strength as a function of the epoch number using one of two forms of the multiplier: (i) linear decay or (ii) exponential decay. Algorithm 1 summarizes eDP-SGD.

Algorithm 1 eDP-SGD
Input: GPU devices; a batch of data. DP-SGD input: noise multiplier; clipping coefficient; noise decay; scaling. Model input: loss; epoch.
Output: DP gradient for the optimizer
  for each GPU device do
     Send a micro-batch to the device
     Compute the gradient on the micro-batch
     Set the noise multiplier
     Add scaled Gaussian noise to the gradient
  end for
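The noise-decay step of Algorithm 1 can be sketched as a schedule that maps the epoch number to a noise multiplier. The two functional forms below (a clipped linear ramp and an exponential), together with the names `sigma0`, `k` and `floor`, are assumptions chosen for illustration and are not necessarily the exact expressions used in our experiments.

```python
import math


def linear_decay(sigma0, k, epoch, floor=0.0):
    """Assumed linear form: sigma0 * (1 - k * epoch), never below `floor`."""
    return max(floor, sigma0 * (1.0 - k * epoch))


def exponential_decay(sigma0, k, epoch):
    """Assumed exponential form: sigma0 * exp(-k * epoch)."""
    return sigma0 * math.exp(-k * epoch)


# Noise multiplier per epoch for both schedules, starting from sigma0 = 1.0.
schedule = [
    (epoch, linear_decay(1.0, 0.1, epoch), exponential_decay(1.0, 0.1, epoch))
    for epoch in range(5)
]

if __name__ == "__main__":
    for epoch, lin, exp in schedule:
        print(f"epoch {epoch}: linear={lin:.3f} exponential={exp:.3f}")
```

In eDP-SGD, each GPU would evaluate such a schedule for the current epoch before drawing its own noise, so no cross-device communication is needed for this step.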

4 Experimental setup

We focus on Intent Classification (IC) and Named-Entity Recognition (NER) tasks in this work as they are popular in industrial NLP systems (Su et al., 2018). We next describe the datasets used in our experiments.

4.1 Datasets

We use three publicly available datasets, ATIS (Hemphill et al., 1990), SNIPS (Coucke et al., 2018) and NLU-EVAL (Xingkun Liu & Rieser, 2019), and three additional datasets from a leading smart-home company: Communication, Health and Video, each containing roughly 1 million utterances. Example utterances in the internal datasets are: Communication: “call my parents at 0123”, Health: “refill my aspirin prescription” and Video: “play my favorite movie”.

We construct a train/validation/test split for each of these datasets so that the split ratio is approximately the same (45:5:50) across datasets. A roughly equal number of datapoints in the train and test sets helps us create a balanced evaluation set for training the MIA models. The “member” utterances used to train and evaluate the MIA are sourced from the IC-NER training sets, and an equal number of “non-member” utterances are sourced from the test set (not used in training).
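The split and the construction of balanced member and non-member sets can be sketched as follows. The function names, the random shuffling and the fixed seed are assumptions for illustration; only the 45:5:50 ratio and the sourcing of members from the training set and non-members from the test set come from the text.

```python
import random


def split_dataset(utterances, seed=0):
    """Shuffle and split into train/validation/test with a 45:5:50 ratio."""
    items = list(utterances)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 45 // 100
    n_valid = len(items) * 5 // 100
    train = items[:n_train]
    valid = items[n_train:n_train + n_valid]
    test = items[n_train + n_valid:]
    return train, valid, test


def mia_sets(train, test, seed=0):
    """Equal numbers of members (from train) and non-members (from test)."""
    n = min(len(train), len(test))
    rng = random.Random(seed)
    members = rng.sample(train, n)
    non_members = rng.sample(test, n)
    return members, non_members


if __name__ == "__main__":
    train, valid, test = split_dataset(range(1000))
    members, non_members = mia_sets(train, test)
    print(len(train), len(valid), len(test), len(members), len(non_members))
```

Keeping the train and test sets of similar size means that the member and non-member classes are close to balanced without discarding much data.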

Dataset  | SGD: SER | SGD: MIA | No decay: ΔSER | ΔMIA | ϵ | Linear decay: ΔSER | ΔMIA | ϵ | Exponential decay: ΔSER | ΔMIA | ϵ
CLC model
ATIS     | 6.3  | 67.5 | -12.9 | -8.26 | 4.4 | -10.6 | -11.6 | 5.8 | -11.9 | -11.3 | 24.4
SNIPS    | 13.0 | 70.6 | -1.8  | -2.5  | 3.1 | -10.3 | 1.8   | 6.5 | -6.7  | 4.4   | 17.4
NLU-eval | 26.7 | 65.3 | -0.7  | -1.4  | 2.0 | 3.6   | -7.6  | 2.7 | 1.4   | -4.6  | 22.1
Health   | 6.3  | 95.3 | -1.2  | -3.1  | 3.6 | -2.5  | -2.3  | 4.0 | -2.7  | -1.5  | 4.3
Comm.    | 5.4  | 87.7 | 0.3   | -3.2  | 6.2 | 0.3   | -5.7  | 7.5 | 0.5   | -5.4  | 19.1
Video    | 15.6 | 75.1 | 4.0   | -2.4  | 4.8 | 4.4   | -3.7  | 5.9 | 3.1   | -6.0  | 19.5
BERT model
ATIS     | 5.4  | 56.1 | 1.3   | -2.5  | 5.9 | -9.7  | -5.2  | 6.3 | -4.1  | -5.8  | 11.1
SNIPS    | 12.7 | 65.3 | -21.5 | -8.3  | 2.7 | -20.3 | -9.2  | 2.3 | -7.1  | -8.3  | 2.7
NLU-eval | 24.1 | 63.6 | -2.4  | -7.6  | 2.2 | 4.6   | -11.6 | 1.7 | 0.3   | -9.5  | 2.3
Health   | 6.2  | 90.6 | 3.2   | -4.1  | 1.9 | 0.8   | -2.9  | 2.8 | 1.2   | -4.3  | 2.9
Comm.    | 5.6  | 88.4 | 0.3   | -4.4  | 2.1 | 0.2   | -3.4  | 2.8 | 0.7   | -4.4  | 3.7
Video    | 15.8 | 81.0 | 4.7   | -10.1 | 1.9 | 4.6   | -10.9 | 1.8 | 3.3   | -8.2  | 2.3
Table 1: Results showcasing the privacy-utility trade-off of NLU models trained using eDP-SGD against models trained with SGD. Datasets are arranged by size. For the eDP-SGD columns, SER and MIA are reported as relative changes (ΔSER, ΔMIA) w.r.t. the SGD baseline. ϵ is computed using CDP (Yu et al., 2019), with δ set separately for the public and the internal corpora.
Dataset  | SGD time/epoch | DP-SGD multiplicative factor | eDP-SGD multiplicative factor
CLC model
ATIS     | 9.5s   | 2.1   | 1.04
SNIPS    | 9.8s   | 5.1   | 1.01
NLU-EVAL | 9.9s   | 7.3   | 1.03
Health   | 21.8s  | 153.9 | 1.14
Comm.    | 109.8s | 143.5 | 1.15
Video    | 83.7s  | 152.4 | 1.2
BERT model
ATIS     | 2.5s   | 13.0  | 1.29
SNIPS    | 2.8s   | 22.8  | 1.52
NLU-EVAL | 3.5s   | 29.2  | 1.62
Health   | 56.5s  | 82.0  | 2.06
Comm.    | 398.8s | 73.8  | 2.05
Video    | 276.0s | 80.3  | 2.17
Table 2: Comparison of training time per epoch for SGD, DP-SGD and eDP-SGD. The multiplicative factors are relative to the SGD time per epoch.

4.2 Models

We train IC and NER models using the following two architectures, chosen to cover both an LSTM-based and a transformer-based model. These architectures have consistently pushed the state of the art on several tasks (Devlin et al., 2018; Peters et al., 2018).

CLC (Ma & Hovy, 2016): The CLC model takes as input the concatenation of a token-level character CNN output with token embeddings, followed by a bi-LSTM layer. The IC and NER models are a fully connected layer and a CRF layer, respectively, on top of the bi-LSTM layer. We use pre-trained FastText embeddings (Mikolov et al., 2018) as inputs to our models. To train IC-NER models on the public data, we use the pre-trained embeddings obtained using the Wiki-news corpus. To train the models on the internal corpora, we train FastText embeddings on a separate set of utterances sourced from the same smart home agent. We use 2 bi-LSTM layers with a hidden dimension of 384.

BERT (Devlin et al., 2018): We also train IC-NER layers on top of a BERT model, pretrained on a combination of a Wikipedia articles dataset and the One Billion Word corpus (Chelba et al., 2013) using the masked language modeling task. Our BERT model has 4 layers, 12 attention heads, and a hidden dimension of 312. We use the sum of the CRF loss (NER) and the cross-entropy loss (IC) as the optimization objective.

We tuned the eDP-SGD hyper-parameters over predefined sets and ranges of values, with separate sets of decay values for linear decay and for exponential decay. We use a p3.16xlarge instance with 8 GPUs to train each model. We implemented the model block and the DP scheme with MXNet (Chen et al., 2015), and we leverage Horovod (Sergeev & Balso, 2018) to boost efficiency.

We report the semantic error rate (SER (Su et al., 2018), a normalized measure for the IC-NER task) as the utility metric (lower is better). We also report the theoretical DP guarantees (for DP models) and the success of an MIA, measured by the area under the ROC curve (AUC). Following the MIA structure of Shokri et al., we train a shadow model on a chosen public corpus. The attack model is then trained on the outputs of this shadow model and evaluated on the outputs of the model under attack. Sorted IC and NER output scores are used as features for the attack model. As a reminder, during MIA evaluation, the train set of the model under attack is used as the member set, while the test set is used as the non-member set. Finally, we also report the time taken to train the models using SGD, eDP-SGD and DP-SGD. The ADAM optimizer is used in all three settings.
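The two ingredients of the MIA evaluation, sorted output scores as attack features and AUC as the success metric, can be sketched as follows. The helper names and the `top_k` truncation are hypothetical, and the AUC is computed by direct pairwise comparison (the probability that a member receives a higher attack score than a non-member, with ties counted as one half), which is one standard way to obtain the area under the ROC curve.

```python
def attack_features(output_scores, top_k=3):
    """Sort a model's output scores in descending order and keep the top_k."""
    return sorted(output_scores, reverse=True)[:top_k]


def auc(member_scores, non_member_scores):
    """Area under the ROC curve via pairwise comparison of attack scores."""
    wins = 0.0
    for m in member_scores:
        for n in non_member_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(non_member_scores))


if __name__ == "__main__":
    print(attack_features([0.1, 0.7, 0.2], top_k=2))
    # Attack scores assigned to known members and non-members.
    print(auc([0.9, 0.3], [0.5, 0.1]))
```

An AUC of 0.5 corresponds to an attack that cannot distinguish members from non-members, so a lower AUC for a DP model indicates better protection.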

5 Results

Table 1 presents the utility and privacy metrics obtained using SGD training, together with the relative changes in those metrics when training with the various decay schemes in eDP-SGD (for the sake of brevity, we do not report DP-SGD results separately as they are similar to the no-decay scheme). We sometimes observe an improvement in performance (particularly for public corpora), as was also reported by Khatri. This is encouraging, and we attribute the improvement to eDP-SGD acting as a regularizer: for smaller datasets, it is easier to overfit the CLC and BERT models to the training set. On the larger corpora, the model error rates do not degrade significantly (except for Video). Additionally, either linear or exponential decay offers a better privacy-utility trade-off than no decay on larger datasets where a loss in utility is expected (in Table 1, we embolden a lower utility loss and a higher MIA value compared to no decay for the Communication and Video datasets). This indicates that decay, when applied independently at each micro-batch, still improves the privacy-utility trade-off, although the observation is limited to larger datasets. We also observe that the theoretical privacy guarantee seems to be only loosely correlated with the MIA success rate: a higher value of ϵ is sometimes associated with a greater decrease in MIA success rate. Table 2 shows that our algorithm does not substantially degrade training time relative to SGD (by a factor of about 2 in the worst case), whereas DP-SGD can increase training time by a factor of up to roughly 150. Overall, these results indicate that task-specific fine-tuning with eDP-SGD can be used in industrial settings to train privacy-preserving models.
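To make the training-time comparison concrete, the per-epoch times implied by Table 2 can be computed directly. The sketch assumes that each multiplicative factor is applied to the SGD time per epoch of the same row; the helper name `per_epoch_seconds` is hypothetical, and the numbers are copied from the Health (CLC) and Comm. (BERT) rows.

```python
def per_epoch_seconds(sgd_seconds, factor):
    """Per-epoch time implied by an SGD time and a multiplicative factor."""
    return sgd_seconds * factor


# (SGD time per epoch in seconds, DP-SGD factor, eDP-SGD factor) from Table 2.
rows = {
    "Health (CLC)": (21.8, 153.9, 1.14),
    "Comm. (BERT)": (398.8, 73.8, 2.05),
}

if __name__ == "__main__":
    for name, (sgd, dp_factor, edp_factor) in rows.items():
        dp = per_epoch_seconds(sgd, dp_factor)
        edp = per_epoch_seconds(sgd, edp_factor)
        print(f"{name}: SGD {sgd:.1f}s, DP-SGD {dp:.1f}s, eDP-SGD {edp:.1f}s")
```

For the Health corpus with the CLC model, this gives roughly 3355 seconds per epoch for DP-SGD against about 25 seconds for eDP-SGD.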

6 Conclusion

We propose a variant of DP-SGD that is suited for GPU-based training and use it to fine-tune IC-NER models on multiple datasets. We report training time, accuracy and privacy metrics on two model architectures, and argue that task-specific fine-tuning with eDP-SGD is practical for large-scale model training from an accuracy and efficiency perspective. In the future, one can explore the protection offered by DP-SGD on recently proposed larger models. We can also study the impact of combining other techniques, such as regularization, with DP-SGD on model memorization.