Differentially Private Learning with Adaptive Clipping

by Om Thakkar, et al.
Boston University

We introduce a new adaptive clipping technique for training learning models with user-level differential privacy that removes the need for extensive parameter tuning. Previous approaches to this problem use the Federated Stochastic Gradient Descent or the Federated Averaging algorithm with noised updates, and compute a differential privacy guarantee using the Moments Accountant. These approaches rely on choosing a norm bound for each user's update to the model, which needs to be tuned carefully. The best value depends on the learning rate, model architecture, number of passes made over each user's data, and possibly various other parameters. We show that adaptively setting the clipping norm applied to each user's update, based on a differentially private estimate of a target quantile of the distribution of unclipped norms, is sufficient to remove the need for such extensive parameter tuning.





1 Introduction

Iterative training methods such as stochastic gradient descent (SGD) have advanced considerably in recent years, driven in large part by their applicability to training neural networks. Deep learning has found a variety of applications, including image classification, language translation, and music generation [25, 12, 26, 4]. To perform such tasks effectively, neural networks need a large amount of training data, and these datasets often contain sensitive information. Moreover, many recent works [10, 27, 24, 5, 20] have shown that it is possible to extract sensitive information about the training data from the parameters of a trained model alone. It is therefore imperative to use learning techniques that provide a rigorous privacy guarantee for the training data.

Differential privacy (DP) [8, 9] has recently become a gold standard for bounding the privacy leakage of sensitive data in learning tasks. Intuitively, DP prevents an adversary from confidently concluding whether a given sample was used in training a model, even when the adversary has access to arbitrary side information. To formally establish the notion of DP, we first define neighboring datasets. We will refer to a pair of datasets $D, D'$ as neighbors if $D'$ can be obtained from $D$ by adding or removing one element.

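Definition 1.1 (Differential Privacy [8, 9]).

A (randomized) algorithm $\mathcal{A}$ with input domain $\mathcal{D}$ and output range $\mathcal{R}$ is $(\varepsilon, \delta)$-differentially private if for all pairs of neighboring datasets $D, D' \in \mathcal{D}$ and every measurable $S \subseteq \mathcal{R}$, we have, with probability over the coin flips of $\mathcal{A}$,

$$\Pr[\mathcal{A}(D) \in S] \le e^{\varepsilon} \cdot \Pr[\mathcal{A}(D') \in S] + \delta.$$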
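To make the definition concrete, the following minimal sketch checks the $(\varepsilon, \delta)$ inequality for a mechanism with a finite output range. It assumes we are handed the two output distributions of the mechanism on one pair of neighboring datasets; the distributions, the function names `dp_violation` and `is_dp`, and the tolerance are all hypothetical and chosen only for illustration. It uses the fact that the worst-case set $S$ consists of exactly those outputs where one probability exceeds $e^{\varepsilon}$ times the other.

```python
import math


def dp_violation(p, q, eps):
    """Largest value of P(S) - e^eps * Q(S) over all output sets S.

    p and q map each output to its probability. The maximum is attained
    by the set of outputs where p exceeds e^eps * q.
    """
    outcomes = set(p) | set(q)
    return sum(
        max(0.0, p.get(o, 0.0) - math.exp(eps) * q.get(o, 0.0))
        for o in outcomes
    )


def is_dp(p, q, eps, delta, tol=1e-12):
    """Check the (eps, delta) inequality in both directions for one pair."""
    return (
        dp_violation(p, q, eps) <= delta + tol
        and dp_violation(q, p, eps) <= delta + tol
    )


# Hypothetical output distributions on two neighboring datasets.
out_d = {"yes": 0.75, "no": 0.25}
out_d_prime = {"yes": 0.25, "no": 0.75}

print(is_dp(out_d, out_d_prime, math.log(3), 0.0))   # probability ratio is 3
print(is_dp(out_d, out_d_prime, math.log(2), 0.0))   # eps too small for delta = 0
print(is_dp(out_d, out_d_prime, math.log(2), 0.25))  # slack absorbed by delta
```

A full proof of differential privacy must hold for every neighboring pair, so a check like this can only refute a claimed guarantee or confirm it for the pairs examined.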


Federated Learning [17] is a decentralized approach in which the training data is left distributed on user devices, and training is done via aggregating updates that have been computed locally. Federated Averaging [17] is a technique that combines local SGD on each user’s data with a server that performs model averaging. Learning language models involves fairly complex networks like long short-term memory (LSTM) recurrent neural networks (RNNs), and the training data can involve personalized sensitive information such as passwords and text conversations. Hence, we use learning a language model for next-word prediction as our motivating task and running example.

In this paper, we will consider two settings of privacy: example-level, and user-level. When we use Federated SGD, we preserve example-level DP (also in [7, 3, 2, 21, 28, 22, 13]) where each element is a single training example. On the other hand, when we use Federated Averaging, we preserve the stronger guarantee of user-level DP (also in [19]), where an element refers to the complete data held by a user.

While there has been a lot of work on designing DP techniques for learning, almost every technique has some hyperparameters which need to be set appropriately to obtain good utility. It is often unclear a priori how to set the hyperparameters introduced by the addition of privacy, for example, the clipping threshold for gradient updates in DP SGD. Moreover, learning techniques have their own hyperparameters which might need to be set differently when training is performed with privacy. For example, the learning rate in DP SGD might need to be set to a high value if the clipping threshold is very low, and vice versa. Such tuning for large networks can have an exorbitant cost in computation and efficiency, which can be a bottleneck for real-world systems that involve communicating with millions of samples for training a single network. Tuning also incurs an additional privacy cost, which needs to be accounted for when providing a privacy guarantee for the released model with tuned hyperparameters.

1.1 Related Work

DP SGD has been the focus of many recent works [7, 3, 2, 28]. Privacy amplification via subsampling was introduced in [14]. The moments accountant, which tightly bounds the privacy loss of the Gaussian mechanism when used with amplification via subsampling, was introduced by Abadi et al. [1]. It was further extended in [18] to incorporate estimating heterogeneous sets of vectors from batches of subsamples. The technique of Federated Averaging was introduced in [17], and was subsequently used in [19] to effectively train recurrent language models. This work builds upon [19].

Several works have studied the problem of privacy-preserving hyperparameter tuning. An approach based on target accuracy was provided in [11], which was further improved in terms of privacy cost and computational efficiency in [16]. A method based on data splitting was provided in [7], whereas one based on satisfying certain stability conditions was introduced in [6]. We would like to note that all the prior works focused on the general problem of parameter search, whereas the focus of this work is to adaptively adjust the value of a parameter in iterative procedures to eliminate the need for extensive tuning.

1.2 Motivation

Bounding the influence of any sample in a learning process is both desirable and necessary. If left unbounded, any sample can potentially sway the learned system to overfit to its data, defeating the purpose of trying to learn actual trends in the population. One way to bound the contribution of an example in any phase of the learning process is to bound the total norm of its gradient update. Let the bound be denoted by $C$. This implies that if the norm of any example’s update is greater than $C$, then it gets ‘clipped’ to have a norm of $C$ before being sent to the server. Such clipping also effectively bounds the sensitivity of the system with respect to the addition or removal of any example from the training set. As a result, adding appropriate noise after clipping is sufficient for achieving a differential privacy guarantee for the system [7, 3, 2].
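The clip-then-noise step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper: updates are plain lists of floats, the names `clip_to_norm` and `noisy_clipped_sum` are hypothetical, and the Gaussian noise standard deviation is taken to be a hypothetical `noise_multiplier` times the clipping bound, which is the usual shape of the Gaussian mechanism when the sum has sensitivity equal to the bound.

```python
import math
import random


def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))


def clip_to_norm(update, c):
    """Scale `update` down so its L2 norm is at most c; leave it unchanged otherwise."""
    norm = l2_norm(update)
    scale = min(1.0, c / norm) if norm > 0 else 1.0
    return [x * scale for x in update]


def noisy_clipped_sum(updates, c, noise_multiplier, rng):
    """Sum the clipped updates and add Gaussian noise scaled to the bound c.

    Adding or removing one update changes the clipped sum by at most c in
    L2 norm, so the noise standard deviation is proportional to c.
    """
    clipped = [clip_to_norm(u, c) for u in updates]
    total = [sum(column) for column in zip(*clipped)]
    sigma = noise_multiplier * c
    return [t + rng.gauss(0.0, sigma) for t in total]


rng = random.Random(0)
updates = [[3.0, 4.0], [0.1, 0.2]]          # hypothetical per-example updates
print(clip_to_norm(updates[0], 1.0))         # norm 5 is scaled down to norm 1
print(noisy_clipped_sum(updates, 1.0, 1.0, rng))
```

The privacy level that a given `noise_multiplier` buys depends on the accounting method (for example, the moments accountant) and is outside the scope of this sketch.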

Setting an appropriate value for the clipping threshold $C$ can be crucial for the utility of a differentially private learned system: setting it too low can result in high information loss, whereas setting it too high can result in the addition of a lot of noise. Both cases can decrease the signal-to-noise ratio of the learning process, which can adversely affect the utility of the learned system. Such behavior can be observed in prior work [19], which shows the performance of a differentially private language model learned over various values of $C$.

Learning large models using the Federated Averaging/SGD algorithm [17, 19] can take thousands of rounds of interaction between the central server and the clients. The norms of the updates can vary as the rounds progress. As a result, even setting a constant clipping threshold throughout the learning process can result in decreased utility of the system. Prior work [19] has shown that decreasing the value of the clipping threshold after training a language model for some initial number of rounds actually results in increased accuracy of the system. However, the behavior of the norms can be difficult to predict without prior knowledge about the system, and it might be inefficient to conduct experiments to learn such behavior.

Since each layer of a learning system can provide a different functionality, it can be useful in some situations to clip the updates layer-wise (i.e., per-layer clipping [19]). However, as shown in Figure 1, the norms of the individual layers can be of different magnitudes, thus making it even more difficult to efficiently search the space of clipping parameters. As a result, there is a need for a system which learns this ‘on-the-fly’ to get high utility while ensuring privacy.

Figure 1: Layers in an LSTM next-word prediction model, together with the mean norm of the per-layer client updates, after 500 rounds of Federated Averaging. The largest number of parameters are found in the word embedding matrix shared by the input and output, embedding_lookup/params.

2 Differentially Private Adaptive Quantile Clipping

In this section, we will describe the adaptive strategy that can be used for adjusting the clipping threshold according to the norms of the updates. First, we will describe the adaptive quantile clipping strategy, which is designed for iterative differentially private mechanisms. Next, we will describe layer-specific noise addition strategies for getting a higher utility than the basic strategy of adding noise with the same scale to each layer.

2.1 Loss Functions for Estimating Quantiles


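Let $X \in \mathbb{R}$ be a random variable, and let $\gamma \in [0, 1]$ be a quantile to be matched. For any $C$, define

$$\ell_\gamma(C; X) = \begin{cases} (1 - \gamma)(C - X) & \text{if } X \le C, \\ \gamma (X - C) & \text{otherwise,} \end{cases} \qquad \text{so that} \qquad \ell'_\gamma(C; X) = \begin{cases} 1 - \gamma & \text{if } X \le C, \\ -\gamma & \text{otherwise.} \end{cases}$$

Hence, $\mathbb{E}[\ell'_\gamma(C; X)] = (1 - \gamma) \Pr[X \le C] - \gamma \Pr[X > C] = \Pr[X \le C] - \gamma$.

For $C^*$ such that $\mathbb{E}[\ell'_\gamma(C^*; X)] = 0$, we have $\Pr[X \le C^*] = \gamma$. Therefore, $C^*$ is at the $\gamma$th quantile of $X$ [15]. Because the loss is convex and has gradients bounded by 1, we can produce an online estimate of $C$ that converges to the $\gamma$th quantile of $X$ using online gradient descent (see, e.g., Shalev-Shwartz [23]). Since the loss is convex but not strongly convex, a learning rate proportional to $1/\sqrt{t}$ will produce a sublinear regret bound. See Figure 2 for a plot of the loss function for a discrete random variable that takes six values with equal probability.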
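As a quick numerical illustration of this loss, the sketch below evaluates the average quantile loss for a random variable that is uniform over six values. The values `1.0` through `6.0` and the function names are hypothetical, chosen only so the arithmetic is easy to follow; the loss is assumed to be the standard piecewise-linear quantile loss, weighting overestimates by $1 - \gamma$ and underestimates by $\gamma$.

```python
def quantile_loss(c, x, gamma):
    """Piecewise-linear quantile loss of estimate c against sample x."""
    return (1.0 - gamma) * (c - x) if x <= c else gamma * (x - c)


def quantile_loss_grad(c, x, gamma):
    """Derivative of the loss with respect to c (bounded by 1 in magnitude)."""
    return (1.0 - gamma) if x <= c else -gamma


def mean_loss(c, xs, gamma):
    return sum(quantile_loss(c, x, gamma) for x in xs) / len(xs)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # hypothetical sample, equal probability

# For the median (gamma = 0.5), every point between the two middle
# values has the same, minimal, average loss.
print(mean_loss(3.0, xs, 0.5), mean_loss(3.5, xs, 0.5), mean_loss(4.0, xs, 0.5))

# For gamma = 0.75 the minimizer among the sample points is the fifth value.
print(min(xs, key=lambda c: mean_loss(c, xs, 0.75)))
```

With these values, the fraction of samples at or below 4 is 4/6, which is less than 0.75, while the fraction at or below 5 is 5/6, which is greater than 0.75, so the average loss for $\gamma = 0.75$ bottoms out at 5.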

Figure 2: Loss functions to estimate the 0th, 50th, 75th, and 100th quantiles for a random variable $X$ that takes six values $x_1 < x_2 < \dots < x_6$ with equal probability. The loss function is the average of six convex piecewise-linear functions, one for each value. For instance, for the median ($\gamma = 0.5$), each of these is just $\tfrac{1}{2}|X - C|$, where $X$ is the random value and $C$ is the estimate. When we average these functions, we arrive at the blue function in the plot showing the average loss, which indeed is minimized by any value between the middle two elements, i.e., in the interval $[x_3, x_4]$. The function for $\gamma = 0.75$ is minimized at $x_5$ because for values of $C$ in $[x_4, x_5)$ the quantile $\Pr[X \le C] = 4/6$ is less than $0.75$, while for values in $[x_5, x_6)$ the quantile $5/6$ is greater than $0.75$.

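Suppose at some round we have $n$ samples of $X$, with values $(x_1, \ldots, x_n)$. The average derivative of the loss for that round is

$$\bar{\ell}'_\gamma(C) = \frac{1}{n} \sum_{i=1}^{n} \ell'_\gamma(C; x_i) = (1 - \gamma)\,\bar{b} - \gamma\,(1 - \bar{b}) = \bar{b} - \gamma,$$

where $\bar{b} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\{x_i \le C\}$ is the empirical fraction of samples with value at most $C$. For a given learning rate $\eta_C$, we can perform the linear update: $C \leftarrow C - \eta_C (\bar{b} - \gamma)$.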
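The linear update can be sketched in a few lines. The code assumes the update subtracts the learning rate times the difference between the empirical fraction of samples at or below the current estimate and the target quantile; the sample (the integers 1 to 100), the starting point, the learning rate, and the function names are hypothetical, and the same sample is reused every round to keep the run deterministic.

```python
def unclipped_fraction(c, xs):
    """Empirical fraction of samples with value at most c."""
    return sum(1 for x in xs if x <= c) / len(xs)


def linear_update(c, b_bar, gamma, lr):
    """One online-gradient step on the average quantile loss."""
    return c - lr * (b_bar - gamma)


xs = [float(v) for v in range(1, 101)]  # hypothetical norms 1, 2, ..., 100
c = 10.0                                # deliberately poor initial estimate
for _ in range(1000):
    c = linear_update(c, unclipped_fraction(c, xs), gamma=0.5, lr=1.0)

print(c)  # settles just above 50, where exactly half the samples are <= c
```

Each step moves the estimate by at most the learning rate, and the steps shrink as the empirical fraction approaches the target, which is why several hundred rounds are needed here.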

Geometric updates.

Since $\bar{b}$ and $\gamma$ take values in the range $[0, 1]$, the linear update rule described above changes $C$ by a maximum of $\eta_C$ at each step. This can be slow if $C$ is on the wrong order of magnitude. At the other extreme, if the optimal value of $C$ is orders of magnitude smaller than $\eta_C$, the update can be very coarse, and may often overshoot to become negative. To remedy such issues, we propose the following geometric update rule: $C \leftarrow C \cdot \exp(-\eta_C (\bar{b} - \gamma))$.
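The contrast between the two rules is easiest to see when the initial estimate is off by several orders of magnitude. The sketch below assumes the geometric rule multiplies the estimate by the exponential of the negated linear step; the sample values (100, 200, ..., 10000), the starting estimate of 0.1, the learning rate of 0.2, and the function names are hypothetical.

```python
import math


def unclipped_fraction(c, xs):
    return sum(1 for x in xs if x <= c) / len(xs)


def linear_update(c, b_bar, gamma, lr):
    return c - lr * (b_bar - gamma)


def geometric_update(c, b_bar, gamma, lr):
    # Multiplicative step: the estimate stays positive and can change
    # by a constant factor per round regardless of its current scale.
    return c * math.exp(-lr * (b_bar - gamma))


xs = [100.0 * v for v in range(1, 101)]  # hypothetical norms 100, ..., 10000
gamma, lr, rounds = 0.5, 0.2, 500

c_lin = c_geo = 0.1                      # four to five orders of magnitude too small
for _ in range(rounds):
    c_lin = linear_update(c_lin, unclipped_fraction(c_lin, xs), gamma, lr)
    c_geo = geometric_update(c_geo, unclipped_fraction(c_geo, xs), gamma, lr)

print(c_lin)  # still below the smallest sample after 500 rounds
print(c_geo)  # has reached the median band [5000, 5100)
```

In this run the linear rule can gain at most 0.1 per round, so it is still far below every sample, whereas the geometric rule grows by a fixed factor per round until samples start falling below the estimate.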

2.2 Adaptive Quantile Clipping

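Let $m$ be the number of layers, $n$ be the expected number of users sampled per iteration, and $\gamma$ denote the target fraction of users without clipped updates. As in [19], we consider two kinds of clipping: i) flat clipping, where we have an overall clipping parameter $C$ and we clip the concatenation of all layers, and ii) per-layer clipping, where we are given a per-layer clipping parameter $C_j$ for each layer $j \in [m]$ and each layer is clipped separately. If performing per-layer clipping, set $k = m$, otherwise set $k = 1$. For each $j \in [k]$ and iteration $t$, let $C_j^t$ be the clipping threshold, and $\eta_j^t$ be the learning rate for learning $C_j$. We start with some value of $C_j^0$. Let $n_t$ be the random variable that denotes the number of users sampled in round $t$. Each user $i$ will send $k$ bits $(b_{i,1}^t, \ldots, b_{i,k}^t)$ along with the usual update $\Delta_i^t = (\Delta_{i,1}^t, \ldots, \Delta_{i,k}^t)$, where bit $b_{i,j}^t = \mathbb{1}\{\|\Delta_{i,j}^t\| \le C_j^t\}$, for $j \in [k]$.

We define the loss for user $i$ for the update to the $j$th layer in the $t$th round as

$$\ell_{i,j}^t = \ell_\gamma\big(C_j^t; \|\Delta_{i,j}^t\|\big) = \begin{cases} (1 - \gamma)\big(C_j^t - \|\Delta_{i,j}^t\|\big) & \text{if } \|\Delta_{i,j}^t\| \le C_j^t, \\ \gamma \big(\|\Delta_{i,j}^t\| - C_j^t\big) & \text{otherwise.} \end{cases}$$

Then define $\bar{\ell}_j^t = \frac{1}{n} \sum_{i=1}^{n_t} \ell_{i,j}^t$, and $\bar{b}_j^t = \frac{1}{n} \sum_{i=1}^{n_t} b_{i,j}^t$ for the $j$th layer in the $t$th round. As $\mathbb{E}[n_t] = n$, $\bar{b}_j^t$ is an unbiased estimate of the fraction of unclipped updates for the $j$th layer in the $t$th round. Thus, we have $\mathbb{E}[\nabla \bar{\ell}_j^t] = \mathbb{E}[\bar{b}_j^t] - \gamma$. Observe that if $\mathbb{E}[\bar{b}_j^t] = \gamma$, then $\mathbb{E}[\nabla \bar{\ell}_j^t] = 0$. Note that the server only requires $\bar{b}_j^t$ to compute the gradient estimate $\bar{b}_j^t - \gamma$, which is computed privately along with the average of the updates from the users. Moreover, the magnitude of the gradient depends on how far $\bar{b}_j^t$ is from the target unclipped percentage $\gamma$. We update the clipping threshold for the next round as $C_j^{t+1} = C_j^t - \eta_j^t (\bar{b}_j^t - \gamma)$ for a linear update, and $C_j^{t+1} = C_j^t \cdot \exp(-\eta_j^t (\bar{b}_j^t - \gamma))$ for a geometric update. We also define a parameter $\rho$ denoting the proportion of the per-iteration privacy budget (details in [18]) that is used for the computation of the clipped counts described above. The rest of the budget is used for computing an average of the user updates. We provide pseudocode for the complete algorithm in Algorithm 1.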
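A server-side view of this step might look like the sketch below. It is written under several assumptions that go beyond the text: each user's bits arrive as a list with one 0/1 entry per threshold, the unclipped count is divided by the expected number of sampled users, and privacy is represented only by an optional Gaussian perturbation of the count whose standard deviation `count_noise_std` is a hypothetical parameter. How that standard deviation should be derived from the budget split of [18] is not reproduced here.

```python
import math
import random


def update_thresholds(thresholds, bits_per_user, expected_users, gamma, lr,
                      geometric=True, count_noise_std=0.0, rng=None):
    """Return next-round clipping thresholds, one per layer (or a single one).

    bits_per_user[i][j] is 1 if user i's update for threshold j was NOT
    clipped (its norm was at most thresholds[j]) and 0 otherwise.
    """
    rng = rng or random.Random(0)
    new_thresholds = []
    for j, c in enumerate(thresholds):
        count = float(sum(bits[j] for bits in bits_per_user))
        if count_noise_std > 0.0:
            # Hypothetical noise on the count; calibration is not shown.
            count += rng.gauss(0.0, count_noise_std)
        b_bar = count / expected_users          # estimated unclipped fraction
        grad = b_bar - gamma                    # sign says which way to move
        if geometric:
            new_thresholds.append(c * math.exp(-lr * grad))
        else:
            new_thresholds.append(c - lr * grad)
    return new_thresholds


# Four users, two layers: layer 0 is never clipped, layer 1 is clipped
# for three of the four users.
bits = [[1, 1], [1, 0], [1, 0], [1, 0]]
print(update_thresholds([1.0, 10.0], bits, expected_users=4, gamma=0.5, lr=0.2))
```

With a target of 0.5, the first threshold shrinks because every update was unclipped, and the second grows because only a quarter of the updates were unclipped.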

Such a strategy can also be used to estimate the range of the update norms: setting $\gamma$ very close to 0 (respectively, 1) finds the minimum (respectively, maximum) norm. This gives a ballpark idea of the magnitudes of the individual layers' updates without any prior knowledge, which can then be used to set the initial clipping threshold appropriately.

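  function Train(Parameters- user selection probability $q$, target unclipped quantile $\gamma$, clipping counts budget proportion $\rho$, noise scale $z$, ClipUpdate (Lin or Geom), UsrUpdate (for FedAvg or FedSGD), ClipFn (FlatClip or PerLayerClip))
     Initialize model $\theta^0$, clipping bound(s) $C^0$, moments accountant $\mathcal{M}$
     for each round $t = 0, 1, 2, \ldots$ do
        $U^t \leftarrow$ (sample users with probability $q$)
        for each user $i \in U^t$ in parallel do
           $(\Delta_i^t, b_i^t) \leftarrow$ UsrUpdate($i$, $\theta^t$, ClipFn, $C^t$)
        $S \leftarrow$ (bound on $\|\Delta_i^t\|$ for ClipFn)
        $(\bar{\Delta}^t, \bar{b}^t) \leftarrow$ (noised averages of the $\Delta_i^t$ and the $b_i^t$, with noise calibrated to $z$ and $S$, spending a proportion $\rho$ of the round's privacy budget on $\bar{b}^t$)
        $\theta^{t+1} \leftarrow \theta^t + \bar{\Delta}^t$
        $C^{t+1} \leftarrow$ ClipUpdate($C^t$, $\bar{b}^t$)
        Record the privacy cost of round $t$ in $\mathcal{M}$
  function LinUpdate(Input- Current value(s) $C$, current unclipped quantile(s) $\bar{b}$; Parameters- $\gamma$, $\eta_C$)
     return $C - \eta_C (\bar{b} - \gamma)$
  function GeomUpdate(Input- Current value(s) $C$, current unclipped quantile(s) $\bar{b}$; Parameters- $\gamma$, $\eta_C$)
     return $C \cdot \exp(-\eta_C (\bar{b} - \gamma))$
  function UsrUpdateFedAvg(Input- $i$, $\theta^0$, ClipFn, Clipping bound(s) $C$; Parameters- $B$, $E$, $\eta$)
     $\theta \leftarrow \theta^0$
     for each local epoch $e$ from 1 to $E$ do
        $\mathcal{B} \leftarrow$ ($i$'s local data split into size $B$ batches)
        for batch $\beta \in \mathcal{B}$ do
           $\theta \leftarrow \theta - \eta \nabla \ell(\theta; \beta)$
     return ClipFn($\theta - \theta^0$, $C$)
  function UsrUpdateFedSGD(Input- $i$, $\theta^0$, ClipFn, Clipping bound(s) $C$; Parameters- $B$)
     Select a batch $\beta$ of size $B$ from $i$'s examples
     return ClipFn($-\nabla \ell(\theta^0; \beta)$, $C$)
  function FlatClip(Input- Update $\Delta$, Clipping bound $C$)
     $b \leftarrow \mathbb{1}\{\|\Delta\| \le C\}$, $\Delta' \leftarrow \Delta \cdot \min(1, C / \|\Delta\|)$, return $(\Delta', b)$
  function PerLayerClip(Input- Update $\Delta = (\Delta_1, \ldots, \Delta_m)$, Clipping bounds $C_1, \ldots, C_m$)
     for each layer $j \in [m]$ do
        $b_j \leftarrow \mathbb{1}\{\|\Delta_j\| \le C_j\}$, $\Delta'_j \leftarrow \Delta_j \cdot \min(1, C_j / \|\Delta_j\|)$
     return $(\Delta', b)$ with $\Delta' = (\Delta'_1, \ldots, \Delta'_m)$ and $b = (b_1, \ldots, b_m)$
Algorithm 1 Differentially Private Learning with Adaptive Clipping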