Differential privacy [DMNS06] is a standard concept for capturing the privacy properties of statistical algorithms. In its original formulation, (pure) differential privacy is parameterized by a single real number $\varepsilon$, the so-called privacy budget, which characterizes the privacy loss of an individual contributor to the input dataset.
As applications of differential privacy start to proliferate, they bring to the fore the problem of administering the privacy budget, with specific emphasis on privacy composition and privacy amplification.
Privacy composition enables modular design and analysis of complex and heterogeneous algorithms from simpler building blocks by controlling the total privacy budget of their combination. Improving on “naïve” composition, which simply (but very consequentially!) states that the privacy budgets of composition blocks sum up, “advanced” composition theorems allow subadditive accumulation of the privacy budgets. All existing proofs of advanced composition theorems assume that all intermediate outputs are revealed, whether the composite mechanism requires it or not.
Privacy amplification goes even further by bounding the privacy budget of a combination, for select mechanisms, to be less than the privacy budget of its parts. The only systematically studied instance of this phenomenon is privacy amplification by sampling [KLN08, BBKN14, WFS15, BDRS18, WBK18, ACG16]. In its basic form, for $\varepsilon \le 1$, an $\varepsilon$-differentially private mechanism applied to a secretly sampled $p$ fraction of the input satisfies $O(p\varepsilon)$-differential privacy. More recent results demonstrate that privacy can be amplified in proportion to the sampling probability $p$ (for a Gaussian additive noise mechanism and appropriate relaxations of differential privacy).
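As a quick numeric companion, the classical secrecy-of-the-sample bound can be evaluated directly. A minimal sketch (the function name is ours, and the closed form $\log(1 + p(e^{\varepsilon} - 1))$ is the standard subsampling bound rather than a formula quoted from this paper):

```python
import math

def amplified_eps(eps: float, p: float) -> float:
    """Privacy budget after running an eps-DP mechanism on a secretly
    sampled p-fraction of the dataset (standard subsampling bound)."""
    return math.log(1.0 + p * (math.exp(eps) - 1.0))

# For small eps the amplified budget is close to p * eps:
print(amplified_eps(0.1, 0.5))  # roughly 0.5 * 0.1
```

Note that the amplification degrades sharply once $\varepsilon$ is large, which is the limitation the present work sidesteps.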
This work introduces a new amplification argument—amplification by iteration—that in certain contexts can be seen as an alternative to privacy amplification by sampling. As an exemplar of the kind of algorithms we wish to analyze, we consider noisy stochastic gradient descent for a smooth and convex objective.
Our preferred privacy notion for formally stating our contributions is Rényi differential privacy (RDP). For the purpose of this introduction, it suffices to keep in mind that RDP is parameterized by an order $\alpha > 1$ and measures the Rényi divergence of order $\alpha$ (denoted $D_\alpha$) between the output distributions of a randomized algorithm on two neighboring datasets. It is a relaxation of (pure) differential privacy which has been instrumental in achieving tighter bounds on privacy cost in a number of recent papers on privacy-preserving machine learning. In addition to being a privacy definition in its own right, RDP bounds can easily be translated to the usual $(\varepsilon, \delta)$-DP bounds.
Our first contribution is a general theorem that states that, under certain conditions on an iterative process, the process shrinks the Rényi divergence between distributions. We will focus on the simplest form of these conditions, in which the mechanism is a composition of a sequence of contractive (i.e., 1-Lipschitz) maps and an additive Gaussian noise mechanism. This is a natural setting for several differentially private optimization algorithms. A more general treatment that allows other Banach spaces and noise distributions appears in Section 3.3.
Theorem 1 (Informal).
Let $X_0, X'_0 \in \mathbb{R}^d$ satisfy $\|X_0 - X'_0\| \le s$, and let $X_T$ be obtained from $X_0$ by iterating
$$X_{t+1} = \psi_{t+1}(X_t) + Z_{t+1}$$
for some sequence of contractive maps $\psi_1, \ldots, \psi_T$ and independent $Z_1, \ldots, Z_T \sim \mathcal{N}(0, \sigma^2 \mathbb{I}_d)$. Let $X'_T$ denote the output of the same process started at $X'_0$. Then for every $\alpha > 1$,
$$D_\alpha(X_T \| X'_T) \le \frac{\alpha s^2}{2 T \sigma^2}.$$
We note that in this result we measure the divergence only between the final iterates; in other words, the intermediate steps of the iteration are not revealed. This theorem is a special case of our more general result, Theorem 22.
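The tightness claim can be checked numerically in the worst case of identity maps: the two processes are then Gaussians with variance $T\sigma^2$ centered $s$ apart, and the closed-form Gaussian Rényi divergence coincides with a bound of the form $\alpha s^2/(2T\sigma^2)$. A minimal sketch (function names are ours; the Gaussian divergence formula is the standard closed form):

```python
def gaussian_renyi(alpha: float, dist: float, var: float) -> float:
    """Closed form: D_alpha(N(m1, v*I) || N(m2, v*I)) = alpha*||m1-m2||^2/(2v)."""
    return alpha * dist ** 2 / (2.0 * var)

def iteration_bound(alpha: float, s: float, sigma: float, T: int) -> float:
    """Bound of the form alpha * s^2 / (2 * T * sigma^2)."""
    return alpha * s ** 2 / (2.0 * T * sigma ** 2)

# With identity maps, X_T ~ N(x0, T*sigma^2) and X'_T ~ N(x0', T*sigma^2),
# so the exact divergence equals the bound:
alpha, s, sigma, T = 2.0, 1.0, 0.5, 10
exact = gaussian_renyi(alpha, s, T * sigma ** 2)
assert abs(exact - iteration_bound(alpha, s, sigma, T)) < 1e-12
```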
This result translates a metric assumption of bounded distance between $X_0$ and $X'_0$ into an information-theoretic conclusion of bounded Rényi divergence between $X_T$ and $X'_T$. While standard facts about the Gaussian distribution allow one to make such a statement for a one-step process, the arbitrary intermediate contractive steps essentially rule out a first-principles approach to proving such a theorem. We use a careful induction argument that rests on controlling the “distance” between $X_t$ and $X'_t$. We start by measuring the metric distance at $t = 0$ and gradually transform it into an information-theoretic divergence at $t = T$. We interpolate between these two using a new hybrid distance measure that we refer to as shifted Rényi divergence. We believe that this notion should find additional applications in the analyses of stochastic processes. Our bounds are tight (with no loss in constants) and show that the worst case for such a result is when all the contractive maps are the identity map.
This result has some surprising implications. Consider an iterative mechanism that processes one input record at a time, $n$ iterations in total. The immediate application of this result to this mechanism leads to the following observation about individuals' privacy loss. The person whose record was processed last experiences the privacy loss afforded by the Gaussian noise added at the last iteration. At the same time, the person whose record was processed first suffers the least amount of privacy loss, equal to a $1/\sqrt{n}$ fraction of the last one's. Importantly, the order in which the inputs were considered need not be random or secret for this analysis to be applicable. In contrast, privacy amplification by sampling depends crucially on the sample's randomness and secrecy.
We outline some applications of this analysis in privacy-preserving machine learning via convex optimization.
Distributed stochastic gradient descent.
In this setting records are stored locally, and the parties engage in a distributed computation to train a model [DKM06]. Using amplification by sampling as in DP-SGD by Abadi et al. [ACG16] would require keeping secret the set of parties taking part in each step of the algorithm. When the communication channel is not trusted, hiding whether or not a party takes part in a certain step would essentially require all parties to communicate in all steps, leading to an unreasonable amount of communication. In addition, the assumption that the sample of parties participating in each step is a random subset may itself be difficult to enforce in many settings.
Our approach does not need the order of participating parties to be random or hidden. It is sufficient to hide the model itself until a certain number of update steps have been applied. This approach then allows significantly reducing communication costs to be proportional to the size of the mini-batch (the number of records consumed by each update). Additionally, our approach can amplify privacy even when the noise added in each step is too small to guarantee much privacy on its own. This is in contrast to amplification by sampling, which requires the unamplified privacy cost to be small to start with: a starting $\varepsilon$ becomes $\log(1 + p(e^{\varepsilon} - 1))$, which is close to $p\varepsilon$ for $\varepsilon \le 1$ but grows quickly thereafter, and, for instance, precludes setting $\varepsilon \gg \log(1/p)$ for small $p$. Our main result applies for arbitrary noise scales, so that even if the privacy guarantee of each individual step is very weak, the final privacy guarantee is non-vacuous. A smaller noise scale then permits a smaller size of each mini-batch, further reducing the communication cost. On the negative side, the privacy guarantee we get varies between examples: examples used early in the SGD get stronger privacy than those occurring late.
Our approach above gives better privacy than competing approaches to the parties taking part early in the computation, while giving similar guarantees to the last user. This better per-user privacy guarantee can allow one to solve several such convex optimization problems on the same set of users at no increase in the worst-case privacy cost. Specifically, if we have $n$ parties, then we can solve on the order of $\sqrt{n}$ such convex optimization problems at the same privacy cost as answering one of them. More generally, the privacy cost of solving $k$ tasks grows linearly in $k/\sqrt{n}$. To our knowledge, except for privacy amplification by sampling, existing techniques such as output perturbation have utility bounds that grow linearly in $k$.
The setting in which some public data from the same distribution as private data is available has been recently identified as promising and practically important [PAE17, AKZ17]. The public corpus can be based on opt-in population, such as a product’s developers or early testers, data shared by volunteers [Chu05], or be released through a legal process [KY04].
In this model, the last $m$ iterations of the iterative algorithm can be performed over the public samples, whose privacy need not be preserved. Since data points used early lose less privacy, we can add much less noise at each step. In effect, having $m$ public samples decreases the error due to the addition of noise by a factor of $\sqrt{m}$. In the absence of public data, privacy comes at a provable cost: while the statistical error due to sampling scales as $1/\sqrt{n}$ independently of the dimension, the error of the differentially private version scales as $\sqrt{d}/(\varepsilon n)$ [BST14a]. Our results imply that for convex optimization problems satisfying very mild smoothness assumptions, given enough public data points, we can ensure that the additional error due to privacy is comparable to the statistical error.
We remark that our technique requires that the optimized functions satisfy a mild smoothness assumption. However, as we show, in our applications we can always achieve the desired level of smoothness by convolving the optimized functions with the Gaussian kernel. Such convolution introduces an additional error but this error is dominated by the error necessary to ensure privacy.
The rest of the paper is organized as follows. After discussing some additional related work, we start with some preliminaries in Section 2. We present our main technique in Section 3. Section 4 shows how this technique can be applied to versions of the noisy stochastic gradient descent algorithm. Finally, in Section 5, we apply this framework to derive the applications mentioned above.
1.1 Related Work
The field of differentially private convex optimization spans almost a decade [CM08, CMS11, JKT12, KST12, ST13, DJW13, Ull15, JT14, BST14b, TTZ15, STU17, WLK17, INS19]. Many of these results are optimal under different regimes, such as empirical loss, population loss, the low-dimensional setting ($d \le n$), or the high-dimensional setting ($d > n$). Some of the algorithms (e.g., output perturbation [CMS11] and objective perturbation [CMS11, KST12]) require finding a global optimum of an optimization problem to argue privacy and utility, while the others are based on variants of noisy stochastic gradient descent. In this section we restrict ourselves to only the population loss, and allow comparisons to algorithms that can be implemented with one pass of stochastic gradient descent over the data set for a direct comparison (which is close to the typical application of optimization algorithms in machine learning). We note that our analysis technique also applies to multi-pass and batch versions of gradient descent. In this setting our algorithm achieves close to optimal bounds on population loss (see Table 1 for details).
In this table we also compare the local differential privacy (LDP) of the algorithms [KLN08]. In several settings (such as distributed learning) we want the published outcome of the optimization algorithm to satisfy a strong level of (central) differential privacy while still guaranteeing local differential privacy. Local differential privacy protects the user's data even from the aggregating server or an adversary who can obtain the complete transcript of communication between the server and the user.
| Algorithm | Excess loss, one task | Excess loss, $k$ tasks | LDP ($\varepsilon$), one task |
|---|---|---|---|
| Noisy SGD + sampling [BST14b] | | | |
| Noisy SGD [DJW13, STU17] | | | |
| Output perturbation [CMS11, WLK17] | | | |
We note that some architectures may not be compatible with all privacy-preserving techniques or guarantees. For instance, we assume secrecy of intermediate computations, which rules out sharing intermediate updates (which is a standard step in federated learning [MMR17]). In contrast, analyses based on secrecy of the sample (e.g., [KLN08, ACG16]) require that either data be stored centrally (thus eliminating local differential privacy guarantees) or all-to-all communications.
2 Preliminaries

We recall definitions and tools from learning theory, probability theory, and differential privacy, and define the notion of shifted divergence. In the process we set up the notation that we will use throughout the paper.
2.1 Convex Loss Minimization
Let $\mathcal{X}$ be the domain of data points, and let $\mathcal{D}$ be a distribution over $\mathcal{X}$. Let $S = (s_1, \ldots, s_n)$ be a data set drawn i.i.d. from $\mathcal{D}$. Let $\mathcal{K} \subseteq \mathbb{R}^d$ be a convex set denoting the space of all models. Let $f : \mathcal{K} \times \mathcal{X} \to \mathbb{R}$ be a loss function, which is convex in its first parameter (the second parameter is a data point and the dependence on this parameter can be arbitrary).
The excess population loss of a solution $x \in \mathcal{K}$ is defined as
$$\mathbb{E}_{s \sim \mathcal{D}}[f(x, s)] - \min_{x' \in \mathcal{K}} \mathbb{E}_{s \sim \mathcal{D}}[f(x', s)].$$
In order to argue differential privacy we place certain assumptions on the loss function. To that end, we need the following two definitions of Lipschitz continuity and smoothness.
Definition 2 ($L$-Lipschitz continuity).
A function $f : \mathcal{K} \to \mathbb{R}$ is $L$-Lipschitz continuous over the domain $\mathcal{K}$ if the following holds for all $x, y \in \mathcal{K}$: $|f(x) - f(y)| \le L \|x - y\|_2$.
Definition 3 ($\beta$-smoothness).
A function $f : \mathcal{K} \to \mathbb{R}$ is $\beta$-smooth over the domain $\mathcal{K}$ if for all $x, y \in \mathcal{K}$, $\|\nabla f(x) - \nabla f(y)\|_2 \le \beta \|x - y\|_2$.
2.2 Probability Measures
In this work, we will primarily be interested in the $d$-dimensional Euclidean space $\mathbb{R}^d$ endowed with the $\ell_2$ metric and the Lebesgue measure. Our main result holds in the more general setting of Banach spaces.
We say a distribution $\mu$ is absolutely continuous with respect to $\nu$ if $\mu(A) = 0$ whenever $\nu(A) = 0$ for all measurable sets $A$. We will denote this by $\mu \ll \nu$.
Given two distributions $\mu$ and $\nu$ on a Banach space $(\mathcal{Z}, \|\cdot\|)$, one can define several notions of distance between them. The first family of distances we consider is independent of the norm:
Definition 4 (Rényi Divergence [Rén61]).
Let $1 < \alpha < \infty$ and let $\mu$ and $\nu$ be measures with $\mu \ll \nu$. The Rényi divergence of order $\alpha$ between $\mu$ and $\nu$ is defined as
$$D_\alpha(\mu \| \nu) = \frac{1}{\alpha - 1} \ln \int \left( \frac{\mu(z)}{\nu(z)} \right)^{\alpha} \nu(z)\, \mathrm{d}z.$$
Here we follow the convention that $\frac{0}{0} = 0$. If $\mu \not\ll \nu$, we define the Rényi divergence to be $\infty$. Rényi divergence of orders $\alpha = 1, \infty$ is defined by continuity.
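For finite distributions the definition can be evaluated directly; a minimal sketch (the function name is ours):

```python
import math

def renyi_divergence(alpha: float, p, q) -> float:
    """D_alpha(P||Q) = 1/(alpha-1) * log sum_i p_i^alpha * q_i^(1-alpha),
    for finite distributions with q_i > 0 wherever p_i > 0 (alpha > 1)."""
    assert alpha > 1
    total = sum(pi ** alpha * qi ** (1.0 - alpha)
                for pi, qi in zip(p, q) if pi > 0)
    return math.log(total) / (alpha - 1.0)
```

One can check the basic sanity properties: the divergence is zero between identical distributions and is non-decreasing in the order $\alpha$.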
Proposition 5 ([vEH14]).
The following hold for any $\alpha \in [1, \infty]$ and distributions $\mu, \nu$:
For any (deterministic) function $g$, $D_\alpha(g(\mu) \| g(\nu)) \le D_\alpha(\mu \| \nu)$, where we denote by $g(\mu)$ the distribution of $g(Z)$ for $Z \sim \mu$.
As usual, we denote by $\mu * \nu$ the convolution of $\mu$ and $\nu$, that is, the distribution of the sum $Z_1 + Z_2$ where we draw $Z_1 \sim \mu$ and $Z_2 \sim \nu$ independently.
We will also need the following “norm-aware” statistical distance:
Definition 6 (-Wasserstein Distance).
The $\infty$-Wasserstein distance between distributions $\mu$ and $\nu$ on a Banach space $(\mathcal{Z}, \|\cdot\|)$ is defined as
$$W_\infty(\mu, \nu) = \inf_{\gamma \in \Gamma(\mu, \nu)}\ \operatorname{ess\,sup}_{(z_1, z_2) \sim \gamma} \|z_1 - z_2\|,$$
where the essential supremum is taken relative to the measure $\gamma$ over pairs $(z_1, z_2) \in \mathcal{Z} \times \mathcal{Z}$. Here $\Gamma(\mu, \nu)$ is the collection of couplings of $\mu$ and $\nu$, i.e., the collection of measures on $\mathcal{Z} \times \mathcal{Z}$ with marginals $\mu$ and $\nu$ on the first and second factors respectively.
The following is immediate from the definition.
Lemma 7.
The following are equivalent for any $s \ge 0$ and distributions $\mu$, $\nu$ over $\mathcal{Z}$:
(i) $W_\infty(\mu, \nu) \le s$;
(ii) there exist jointly distributed random variables $(Z_1, Z_2)$ such that $Z_1 \sim \mu$, $Z_2 \sim \nu$, and $\|Z_1 - Z_2\| \le s$ with probability 1.
Next we define a hybrid between these two families of distances that plays a central role in our work. (Here we use a budgeted version of the definition, putting a hard constraint on the shift portion of the distance, as it is most convenient for reasoning about differential privacy. A Lagrangian version of the definition may be more natural in other applications.)
Definition 8 (Shifted Rényi Divergence).
Let $\mu$ and $\nu$ be distributions defined on a Banach space $(\mathcal{Z}, \|\cdot\|)$. For parameters $z \ge 0$ and $\alpha \ge 1$, the $z$-shifted Rényi divergence between $\mu$ and $\nu$ is defined as
$$D_\alpha^{(z)}(\mu \| \nu) = \inf_{\mu' :\, W_\infty(\mu, \mu') \le z} D_\alpha(\mu' \| \nu).$$
The following follows from the definition:
Proposition 9.
The shifted Rényi divergences satisfy the following for any $\alpha \ge 1$, $z \ge 0$, and any distributions $\mu, \nu$ over $\mathcal{Z}$:
For $z = 0$, $D_\alpha^{(0)}(\mu \| \nu) = D_\alpha(\mu \| \nu)$.
For any $a \in \mathcal{Z}$, $D_\alpha^{(z + \|a\|)}(\mu * \delta_a \| \nu) \le D_\alpha^{(z)}(\mu \| \nu)$, where we let $\delta_a$
denote the distribution of the random variable that is always equal to $a$ (note that $\mu * \delta_a$ is the distribution of $Z + a$ for $Z \sim \mu$).
Definition 10.
For a noise distribution $\zeta$ over a Banach space, we measure the magnitude of noise by considering the function $R_\alpha(\zeta, \cdot)$ that, for $s \ge 0$, measures the largest Rényi divergence of order $\alpha$ between $\zeta$ and the same distribution shifted by a vector of length at most $s$:
$$R_\alpha(\zeta, s) = \sup_{a \in \mathcal{Z} :\, \|a\| \le s} D_\alpha(\zeta * \delta_a \| \zeta).$$
We denote the standard Gaussian distribution over $\mathbb{R}^d$ with variance $\sigma^2$ in each coordinate by $\mathcal{N}(0, \sigma^2 \mathbb{I}_d)$. By the well-known properties of Gaussians, for any $\alpha \ge 1$, $\sigma > 0$, and $a \in \mathbb{R}^d$,
$$D_\alpha(\mathcal{N}(a, \sigma^2 \mathbb{I}_d) \| \mathcal{N}(0, \sigma^2 \mathbb{I}_d)) = \frac{\alpha \|a\|_2^2}{2\sigma^2}.$$
This implies that in the Euclidean space, $R_\alpha(\mathcal{N}(0, \sigma^2 \mathbb{I}_d), s) = \frac{\alpha s^2}{2\sigma^2}$.
When $Z_1$ and $Z_2$ are sampled from $\mu$ and $\nu$ respectively, we will often abuse notation and write $D_\alpha(Z_1 \| Z_2)$, $D_\alpha^{(z)}(Z_1 \| Z_2)$, and $W_\infty(Z_1, Z_2)$ to mean $D_\alpha(\mu \| \nu)$, $D_\alpha^{(z)}(\mu \| \nu)$, and $W_\infty(\mu, \nu)$, respectively.
2.3 (Rényi ) Differential Privacy
The notion of differential privacy (Definition 11) is by now a de facto standard for statistical data privacy [DMNS06, Dwo06, DR14]. At a semantic level, the privacy guarantee ensures that an adversary learns almost the same thing about an individual independent of the individual's presence or absence in the data set. The parameters $(\varepsilon, \delta)$ quantify the amount of information leakage. A common choice of these parameters is $\varepsilon$ being a small constant and $\delta \ll 1/n$, where $n$ refers to the size of the dataset.
Definition 11 (Differential Privacy [DMNS06]).
A randomized algorithm $A$ is $(\varepsilon, \delta)$-differentially private ($(\varepsilon, \delta)$-DP) if, for all neighboring data sets $S$ and $S'$ and for all events $E$ in the output space of $A$, we have
$$\Pr[A(S) \in E] \le e^{\varepsilon} \cdot \Pr[A(S') \in E] + \delta.$$
The notion of neighboring data sets is domain-dependent, and it is commonly taken to capture the contribution of a single individual. In the simplest case, $S$ and $S'$ differ in one record, or equivalently, $d_H(S, S') \le 1$, where $d_H$ is the Hamming distance. We also define
Definition 12 (Per-person Privacy).
An algorithm $A$ operating on a sequence of data points is said to satisfy $(\varepsilon, \delta)$-differential privacy at index $t$ if for any pair of sequences $S$ and $S'$ that differ in the $t$th position, and for any event $E$ in the output space of $A$, we have
$$\Pr[A(S) \in E] \le e^{\varepsilon} \cdot \Pr[A(S') \in E] + \delta.$$
Another related model of privacy is local differential privacy [KLN08]. In this model each user executes a differentially private algorithm on their individual input which is then used for arbitrary subsequent computation (we omit the formal definition as it is not used in our work).
Starting with Concentrated Differential Privacy [DR16], definitions that allow more fine-grained control of the privacy loss random variable have proven useful. The notions of zCDP [BS16], Moments Accountant [ACG16], and Rényi differential privacy (RDP) [Mir17] capture versions of this definition. This approach improves on traditional $(\varepsilon, \delta)$-DP accounting in numerous settings, often leading to significantly tighter privacy bounds as well as being applicable when the traditional approach fails [PAE17, PSM18]. In the current work, we will use the nomenclature based on the notion of the Rényi divergence (Definition 4).
Definition 13 ([Mir17]).
For $\alpha > 1$ and $\varepsilon \ge 0$, a randomized algorithm $A$ is $(\alpha, \varepsilon)$-Rényi differentially private, or $(\alpha, \varepsilon)$-RDP, if for all neighboring data sets $S$ and $S'$ we have
$$D_\alpha(A(S) \| A(S')) \le \varepsilon.$$
Per-person RDP can be defined in an analogous way. The following two lemmas [Mir17] allow translating Rényi differential privacy to -differential privacy, and give a composition rule for RDP.
Lemma 14.
If $A$ satisfies $(\alpha, \varepsilon)$-Rényi differential privacy, then for all $\delta \in (0, 1)$ it also satisfies $\left(\varepsilon + \frac{\log(1/\delta)}{\alpha - 1}, \delta\right)$-differential privacy. Moreover, pure $\varepsilon$-differential privacy coincides with $(\infty, \varepsilon)$-RDP.
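The conversion is a one-liner; a minimal sketch (the function name is ours; the formula is the standard RDP-to-DP translation):

```python
import math

def rdp_to_dp(alpha: float, eps: float, delta: float) -> float:
    """(alpha, eps)-RDP implies (eps + log(1/delta)/(alpha - 1), delta)-DP."""
    return eps + math.log(1.0 / delta) / (alpha - 1.0)
```

In practice one computes RDP bounds at many orders $\alpha$ and reports the smallest resulting $(\varepsilon, \delta)$-DP guarantee.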
The standard composition rule for Rényi differential privacy, when the outputs of all algorithms are revealed, takes the following form.
Lemma 15.
If $A_1, \ldots, A_k$ are randomized algorithms satisfying, respectively, $(\alpha, \varepsilon_1)$-RDP, …, $(\alpha, \varepsilon_k)$-RDP, then their composition defined as $(A_1(S), \ldots, A_k(S))$ is $(\alpha, \varepsilon_1 + \cdots + \varepsilon_k)$-RDP. Moreover, the $i$'th algorithm can be chosen on the basis of the outputs of algorithms $A_1, \ldots, A_{i-1}$.
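As a concrete instance of this rule: composing $k$ Gaussian-mechanism steps of sensitivity $s$ and noise scale $\sigma$, with every intermediate output revealed, accumulates budgets additively. A sketch with our own function names, using the standard Gaussian RDP formula:

```python
def gaussian_rdp(alpha: float, sens: float, sigma: float) -> float:
    """One Gaussian mechanism step: (alpha, alpha*sens^2/(2*sigma^2))-RDP."""
    return alpha * sens ** 2 / (2.0 * sigma ** 2)

def composed_rdp(alpha: float, sens: float, sigma: float, k: int) -> float:
    """k adaptively composed steps, all outputs revealed: budgets add up."""
    return k * gaussian_rdp(alpha, sens, sigma)
```

Contrast this linear growth in $k$ with the amplification-by-iteration setting, where hiding the intermediate outputs makes the bound shrink with the number of subsequent steps.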
2.4 Contractive Noisy Iteration
We start by recalling the definition of a contraction.
Definition 16 (Contraction).
For a Banach space $(\mathcal{Z}, \|\cdot\|)$, a function $\psi : \mathcal{Z} \to \mathcal{Z}$ is said to be contractive if it is 1-Lipschitz. Namely, for all $x, y \in \mathcal{Z}$,
$$\|\psi(x) - \psi(y)\| \le \|x - y\|.$$
A canonical example of a contraction is projection onto a convex set in the Euclidean space.
Proposition 17.
Let $\mathcal{K}$ be a convex set in $\mathbb{R}^d$. Consider the projection operator
$$\Pi_{\mathcal{K}}(x) = \arg\min_{y \in \mathcal{K}} \|x - y\|_2.$$
The map $\Pi_{\mathcal{K}}$ is a contraction.
Another example of a contraction, which will be important in our work, is a gradient descent step for a smooth convex function. The following is a standard result in convex optimization [Nes04]; for completeness, we give a proof in Appendix A.
Proposition 18.
Suppose that a function $f : \mathbb{R}^d \to \mathbb{R}$ is convex and $\beta$-smooth. Then the function $\psi$ defined as
$$\psi(x) = x - \eta \nabla f(x)$$
is contractive as long as $\eta \le \frac{2}{\beta}$.
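One can verify the contraction numerically for a simple smooth convex function; the quadratic below is $\beta$-smooth with $\beta = 4$, and any step size $\eta \le 2/\beta$ keeps the map nonexpansive (a sketch under our own choice of test function):

```python
beta, eta = 4.0, 0.4          # eta = 0.4 <= 2/beta = 0.5

def grad_step(x: float) -> float:
    """Gradient step for f(x) = (beta/2) * x^2, i.e. x -> x - eta*beta*x."""
    return x - eta * beta * x

# |grad_step(x) - grad_step(y)| = |1 - eta*beta| * |x - y| <= |x - y|
for x, y in [(-3.0, 5.0), (0.1, 0.2), (-1.0, -0.5)]:
    assert abs(grad_step(x) - grad_step(y)) <= abs(x - y)
```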
We will be interested in a class of iterative stochastic processes where we alternate between adding noise and applying some contractive map.
Definition 19 (Contractive Noisy Iteration (Cni)).
Given an initial random state $X_0 \in \mathcal{Z}$, a sequence of contractive functions $\psi_t : \mathcal{Z} \to \mathcal{Z}$, and a sequence of noise distributions $\{\zeta_t\}$, we define the Contractive Noisy Iteration (CNI) by the following update rule:
$$X_{t+1} = \psi_{t+1}(X_t) + Z_{t+1},$$
where $Z_{t+1}$ is drawn independently from $\zeta_{t+1}$. For brevity, we will denote the random variable output by this process after $T$ steps as $\mathrm{CNI}_T(X_0, \{\psi_t\}, \{\zeta_t\})$.
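The update rule can be sketched directly (Gaussian noise and one dimension for brevity; the function name and signature are ours):

```python
import random

def cni(x0: float, maps, sigma: float, rng: random.Random) -> float:
    """Contractive Noisy Iteration: X_{t+1} = psi_{t+1}(X_t) + Z_{t+1},
    with Z_{t+1} ~ N(0, sigma^2). Only the final state is returned;
    the intermediate iterates stay hidden, which the analysis exploits."""
    x = x0
    for psi in maps:
        x = psi(x) + rng.gauss(0.0, sigma)
    return x
```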
3 Coupled Descent
In this section, we prove a bound on the Rényi divergence between the outputs of two contractive noisy iterations. Suppose that $X_0$ and $X'_0$ are two random states such that $\|X_0 - X'_0\| \le s$ with probability 1. The maps' contractivity and the fact that we are adding Gaussian noise of scale $\sigma$ ensure that $X_1$ and $X'_1$ are $\frac{\alpha s^2}{2\sigma^2}$-close in Rényi divergence of order $\alpha$. By the post-processing property of Rényi divergence, $X_T$ and $X'_T$ are similarly close. Our main theorem says that this can be substantially improved if we do not release the intermediate steps. The noise added in subsequent steps further decreases the Rényi divergence even when contractive steps are taken in between the noise additions.
While the final result is a statement about Rényi divergences, the shifted Rényi divergences play a crucial role in the proof. We start with an important technical lemma that, for the noise addition step, allows one to reduce the shift parameter $z$. We will then show how contractive maps affect the shifted divergence. Armed with these results, we prove the main theorem in Section 3.3.
3.1 The Shift-Reduction Lemma
In this section we prove the key lemma that relates $D_\alpha^{(z)}(\mu * \zeta \| \nu * \zeta)$ to $D_\alpha^{(z+a)}(\mu \| \nu)$. Recall that we use $R_\alpha(\zeta, a)$ to measure how well the noise distribution $\zeta$ hides changes in our norm (see Definition 10):
Lemma 20 (Shift-Reduction Lemma).
Let $\mu$, $\nu$, and $\zeta$ be distributions over a Banach space $\mathcal{Z}$. Then for any $z \ge 0$ and $a \ge 0$,
$$D_\alpha^{(z)}(\mu * \zeta \| \nu * \zeta) \le D_\alpha^{(z+a)}(\mu \| \nu) + R_\alpha(\zeta, a).$$
Let $Z_1$ be distributed as $\mu$ and $Z_2$ as $\nu$. We first show the result for the case when $z = 0$. Let $\mu'$ be the distribution certifying the value of $D_\alpha^{(a)}(\mu \| \nu)$; that is, $W_\infty(\mu, \mu') \le a$ and $D_\alpha(\mu' \| \nu) = D_\alpha^{(a)}(\mu \| \nu)$. Let $Z'_1$ be the random variable whose existence is given by Lemma 7. That is, $\|Z_1 - Z'_1\| \le a$ with probability 1, and $Z'_1 \sim \mu'$. Let $Z$ be an independent random variable distributed as $\zeta$. We can write
where we have used the post-processing property of Rényi divergence. Note that the first distribution is a product distribution, whereas the factors of the second are dependent. Denoting by $p_X$ the density function of a random variable $X$, we expand
Taking logarithms and dividing by $\alpha - 1$, we get the claim for $z = 0$.
The general case reduces readily to the $z = 0$ case. Define
It is easy to see that for all , and that whenever .
As before, let be r.v.’s from the joint distribution guaranteed by Lemma 7. Let and . It follows that and with probability 1. We write
where we have used the $z = 0$ case in the last step. On the other hand,
This completes the proof. ∎
3.2 Contractive Maps
We next show that contractive maps cannot increase a shifted divergence. In the lemma below we give a more general version that allows using different contractive maps.
Lemma 21 (Contraction reduces shifted divergence).
Suppose that $\psi$ and $\psi'$ are contractive maps on $\mathcal{Z}$ and $\sup_{x \in \mathcal{Z}} \|\psi(x) - \psi'(x)\| \le s$. Then for r.v.'s $X$ and $X'$ over $\mathcal{Z}$,
$$D_\alpha^{(z+s)}(\psi(X) \| \psi'(X')) \le D_\alpha^{(z)}(X \| X').$$
3.3 Privacy Amplification by Iteration
We are now ready to prove our main result. We prove a general statement that can handle changes in several of the maps $\psi_t$; this enables us to easily analyze algorithms that access data points more than once. (Since Rényi divergence does not satisfy the triangle inequality, blackbox analyses of such algorithms use the group privacy properties of RDP, which can be loose.) Recall that $R_\alpha$ is introduced in Definition 10 and measures the maximal Rényi divergence of order $\alpha$ between a noise distribution and its shifted copy.
Theorem 22.
Let $X_T$ and $X'_T$ denote the outputs of $\mathrm{CNI}_T(X_0, \{\psi_t\}, \{\zeta_t\})$ and $\mathrm{CNI}_T(X_0, \{\psi'_t\}, \{\zeta_t\})$. Let $s_t = \sup_x \|\psi_t(x) - \psi'_t(x)\|$. Let $a_1, \ldots, a_T$ be a sequence of reals and let $z_t = \sum_{i \le t} s_i - \sum_{i \le t} a_i$. If $z_t \ge 0$ for all $t$ and $z_T = 0$, then
$$D_\alpha(X_T \| X'_T) \le \sum_{t=1}^{T} R_\alpha(\zeta_t, a_t).$$
In particular, if $\zeta_t = \mathcal{N}(0, \sigma^2 \mathbb{I}_d)$ for all $t$, then
$$D_\alpha(X_T \| X'_T) \le \frac{\alpha}{2\sigma^2} \sum_{t=1}^{T} a_t^2.$$
The proof is by induction: we use the contraction lemma (Lemma 21) and then reduce the shift amount by using the shift-reduction lemma (Lemma 20).
Let $X_t$ (resp., $X'_t$) denote the $t$'th iterate of the first (resp., second) CNI. We argue that for all $t \le T$,
$$D_\alpha^{(z_t)}(X_t \| X'_t) \le \sum_{i \le t} R_\alpha(\zeta_i, a_i).$$
The base case is $t = 0$. By definition, $z_0 = 0$ and $X_0 = X'_0$. For the inductive step, let $Z_{t+1}$ denote the random variable drawn from $\zeta_{t+1}$.
This completes the induction step and the proof. ∎
4 Privacy Guarantees for Noisy Stochastic Gradient Descent
We will now apply our analysis technique to derive the privacy parameters of several versions of the noisy stochastic gradient descent algorithm (also referred to as Stochastic Gradient Langevin Dynamics), defined as follows. We are given a family of convex loss functions $\{f(\cdot, s)\}_{s \in \mathcal{X}}$ over some convex set $\mathcal{K} \subseteq \mathbb{R}^d$ parameterized by $s \in \mathcal{X}$; that is, $f(x, s)$ is convex and differentiable in the first parameter for every $s \in \mathcal{X}$. Given a dataset $S = (s_1, \ldots, s_n)$, starting point $x_0 \in \mathcal{K}$, rate parameter $\eta$, and noise scale $\sigma$, the algorithm works as follows. Starting from $x_0$, perform the following update: $y_{t+1} = x_t - \eta (\nabla f(x_t, s_{t+1}) + Z_{t+1})$ and $x_{t+1} = \Pi_{\mathcal{K}}(y_{t+1})$, where $Z_{t+1}$ is a freshly drawn sample from $\mathcal{N}(0, \sigma^2 \mathbb{I}_d)$ and $\Pi_{\mathcal{K}}$ denotes the Euclidean projection to the set $\mathcal{K}$. We refer to this algorithm as PNSGD and describe it formally in Algorithm 1.
The key property that allows us to treat noisy gradient descent as a contractive noisy iteration is the fact that for any convex function, a gradient step is contractive as long as the function satisfies a relatively mild smoothness condition (see Proposition 18). In addition, as is well known, for any convex set $\mathcal{K}$, the (Euclidean) projection to $\mathcal{K}$ is contractive (see Proposition 17). Naturally, a composition of two contractive maps is a contractive map, and therefore we can conclude that PNSGD is an instance of contractive noisy iteration. More formally, consider the sequence $y_1, \ldots, y_n$. In this sequence, $y_{t+1}$ is obtained from $y_t$ by first applying a contractive map that consists of projection to $\mathcal{K}$ followed by the gradient step at $s_{t+1}$, and then adding Gaussian noise of scale $\eta\sigma$. Note that the final output of the algorithm is $x_n = \Pi_{\mathcal{K}}(y_n)$, but this does not affect our analysis of divergence as it can be seen as an additional post-processing step.
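The PNSGD loop described above can be sketched as follows (one pass, one record per step; the function name and one-dimensional signature are ours):

```python
import random

def pnsgd(data, grad, proj, x0: float, eta: float, sigma: float,
          rng: random.Random) -> float:
    """Projected noisy SGD: y_{t+1} = x_t - eta*(grad(x_t, s_{t+1}) + Z),
    x_{t+1} = proj(y_{t+1}), with Z ~ N(0, sigma^2). Returns x_n."""
    x = x0
    for s in data:
        y = x - eta * (grad(x, s) + rng.gauss(0.0, sigma))
        x = proj(y)
    return x
```

Only the final iterate is released; the amplification-by-iteration analysis hinges on keeping the intermediate iterates hidden.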
For this baseline algorithm we prove that points that are used earlier have stronger privacy guarantees due to noise injected in subsequent steps.
Theorem 23.
Let $\mathcal{K} \subseteq \mathbb{R}^d$ be a convex set and $\{f(\cdot, s)\}_{s \in \mathcal{X}}$ be a family of convex, $L$-Lipschitz, and $\beta$-smooth functions over $\mathcal{K}$. Then, for every $\sigma > 0$, $\eta \le 2/\beta$, starting point $x_0 \in \mathcal{K}$, and $t \in [n]$, PNSGD satisfies $(\alpha, \varepsilon_t)$-RDP for its $t$'th input, where $\varepsilon_t = \frac{2\alpha L^2}{\sigma^2 (n + 1 - t)}$.
Let $S$ and $S'$ be two arbitrary datasets that differ at index $t$. As discussed above, under the smoothness condition the steps of PNSGD are a contractive noisy iteration. Specifically, on the dataset $S$, the CNI is defined by the initial point $x_0$, the sequence of functions $\psi_i(y) = \Pi_{\mathcal{K}}(y) - \eta \nabla f(\Pi_{\mathcal{K}}(y), s_i)$, and the sequence of noise distributions $\zeta_i = \mathcal{N}(0, \eta^2 \sigma^2 \mathbb{I}_d)$. Similarly, on the dataset $S'$, the CNI is defined in the same way with the exception of $\psi'_t$, which uses $s'_t$ in place of $s_t$. By our assumption, $f(\cdot, s)$ is $L$-Lipschitz for every $s \in \mathcal{X}$, and therefore
$$\sup_y \|\psi_t(y) - \psi'_t(y)\| \le 2\eta L.$$
We can now apply Theorem 22 with $s_t = 2\eta L$, $s_i = 0$ for $i \ne t$, and $a_i = \frac{2\eta L}{n + 1 - t}$ for $i \ge t$ (and $a_i = 0$ for $i < t$). Note that $z_i \ge 0$ for all $i$ and $z_n = 0$. In addition, $R_\alpha(\zeta_i, a_i) = \frac{\alpha a_i^2}{2\eta^2 \sigma^2}$ for all $i$. Hence we obtain that
$$D_\alpha(X_n \| X'_n) \le (n + 1 - t) \cdot \frac{\alpha}{2\eta^2 \sigma^2} \left( \frac{2\eta L}{n + 1 - t} \right)^2 = \frac{2\alpha L^2}{\sigma^2 (n + 1 - t)},$$
as claimed. ∎
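Following the proof, the per-index guarantee can be tabulated: the sensitivity shift is $2\eta L$, the per-step noise is $\eta\sigma$ (so $\eta$ cancels), and index $t$ is followed by $n + 1 - t$ noisy steps. A sketch (the closed form below is our reading of the bound obtained via Theorem 22, and the function name is ours):

```python
def per_index_rdp(alpha: float, L: float, sigma: float, n: int, t: int) -> float:
    """RDP budget of the t-th record (1-indexed) after one pass of PNSGD:
    alpha * (2L)^2 / (2 * sigma^2 * (n + 1 - t)); the rate eta cancels."""
    return alpha * (2.0 * L) ** 2 / (2.0 * sigma ** 2 * (n + 1 - t))

# Earlier records enjoy more amplification:
eps = [per_index_rdp(2.0, 1.0, 1.0, 10, t) for t in range(1, 11)]
assert eps == sorted(eps)   # budget increases monotonically with t
```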
We now consider privacy guarantees for several variants of this baseline approach. These variants are needed to ensure utility guarantees, which require that the algorithm output one of the iterates randomly. Specifically, we define the algorithm Skip-PNSGD as the algorithm that picks an index $\tau \in \{0, 1, \ldots, n - 1\}$ randomly and uniformly and then skips the first $\tau$ points. That is, it makes only $n - \tau$ steps, and at step $t$ the update uses the data point $s_{\tau + t}$. It is easy to see that the privacy guarantees of Skip-PNSGD are at least as good as those we gave for PNSGD in Theorem 23.
Let $\mathcal{K} \subseteq \mathbb{R}^d$ be a convex set and $\{f(\cdot, s)\}_{s \in \mathcal{X}}$ be a family of convex, $L$-Lipschitz, and $\beta$-smooth functions over $\mathcal{K}$. Then, for every $\sigma > 0$, $\eta \le 2/\beta$, starting point $x_0 \in \mathcal{K}$, and $t \in [n]$, Skip-PNSGD satisfies