Sharp Composition Bounds for Gaussian Differential Privacy via Edgeworth Expansion

03/10/2020 ∙ by Qinqing Zheng, et al. ∙ University of Pennsylvania

Datasets containing sensitive information are often sequentially analyzed by many algorithms and, accordingly, a fundamental question in differential privacy is concerned with how the overall privacy bound degrades under composition. To address this question, we introduce a family of analytical and sharp privacy bounds under composition using the Edgeworth expansion in the framework of the recently proposed f-differential privacy. In contrast to the existing composition theorems using the central limit theorem, our new privacy bounds under composition gain improved tightness by leveraging the refined approximation accuracy of the Edgeworth expansion. Our approach is easy to implement and computationally efficient for any number of compositions. The superiority of these new bounds is confirmed by an asymptotic error analysis and an application to quantifying the overall privacy guarantees of noisy stochastic gradient descent used in training private deep neural networks.


1 Introduction

Machine learning, data mining, and statistical analysis are widely used in applications that impact our daily lives. While we celebrate the benefits brought by these applications, to an alarming degree, the algorithms are accessing datasets containing sensitive information such as individual behaviors on the web and health records. By tweaking the datasets and leveraging the output of algorithms, it is possible for an adversary to learn information about and even identify certain individuals [FJR15, SSSS17]. In particular, privacy concerns become even more acute when the same dataset is probed by a sequence of algorithms. With knowledge of the dataset from the prior algorithms’ output, an adversary can adaptively analyze the dataset to cause additional privacy loss at each round. This reality raises one of the most fundamental problems in the area of private data analysis:

How to accurately and efficiently quantify the cumulative privacy loss under composition of private algorithms?

To address this important problem, one has to start with a formal privacy definition. To date, the most popular statistical privacy definition is $(\epsilon, \delta)$-differential privacy (DP) [DKM06, DMNS06], with numerous deployments in both industrial applications and academic research [EPK14, ACG16, PAE16, DKY17, App17, Abo18]. Informally, this privacy definition requires an unnoticeable change in a (randomized) algorithm’s output due to the replacement of any individual in the dataset. More concretely, letting $\epsilon \ge 0$ and $0 \le \delta \le 1$, an algorithm $M$ is $(\epsilon, \delta)$-differentially private if for any pair of neighboring datasets $S$, $S'$ (in the sense that $S$ and $S'$ differ in only one individual) and any event $E$,

$$\mathbb{P}(M(S) \in E) \le e^{\epsilon}\, \mathbb{P}(M(S') \in E) + \delta. \tag{1}$$

Unfortunately, this privacy definition comes with a drawback when handling composition. Explicitly, $(\epsilon, \delta)$-DP is not closed under composition in the sense that the overall privacy bound of a sequence of (at least two) $(\epsilon, \delta)$-DP algorithms cannot be precisely described by a single pair of the parameters $(\epsilon, \delta)$ [KOV17]. Although the precise bound can be collectively represented by infinitely many pairs of $(\epsilon, \delta)$, [MV16] shows that it is computationally hard to find such a pair of privacy parameters.

The need for a better treatment of composition has triggered a surge of interest in proposing generalizations of $(\epsilon, \delta)$-DP, including divergence-based relaxations [DR16, BS16, Mir17, BDRS18] and, more recently, a hypothesis testing-based extension termed $f$-differential privacy ($f$-DP) [DRS19]. This privacy definition leverages the hypothesis testing interpretation of differential privacy, and characterizes the privacy guarantee using the trade-off between type I and type II errors given by the associated hypothesis testing problem. As an advantage over the divergence-based privacy definitions, among others, $f$-DP allows for a concise and sharp argument for privacy amplification by subsampling. More significantly, $f$-DP is accompanied with a technique powered by the central limit theorem (CLT) for analyzing privacy bounds under composition of a large number of private algorithms. Loosely speaking, the overall privacy bound asymptotically converges to the trade-off function defined by testing between two normal distributions. This class of trade-off functions gives rise to Gaussian differential privacy (GDP), a subfamily of $f$-DP guarantees.

In deploying differential privacy, however, the number of private algorithms under composition may be moderate or small (see such applications in private sparse linear regression [KST12] and personalized online advertising [LO11]). In this regime, the CLT phenomenon does not kick in and, as such, the composition bounds developed using CLT can be inaccurate [DRS19, BDLS19]. To address this practically important problem, in this paper we develop sharp and analytical composition bounds in $f$-DP without assuming a large number of algorithms, by leveraging the Edgeworth expansion [Hal13]. The Edgeworth expansion is a technique for approximating probability distributions in terms of their cumulants. Compared with the CLT approach, in our setting, this technique enables a significant reduction of approximation errors for composition theorems.

In short, our Edgeworth expansion-powered privacy bounds have a number of appealing properties, which will be shown in this paper through both theoretical analysis and numerical examples.

  • The Edgeworth expansion is a more general approach that subsumes the CLT-based approximation. Moreover, our new privacy bounds tighten the composition bounds that are developed in the prior art [DRS19, BDLS19].

  • Our method is easy to implement and computationally efficient. In the case where all trade-off functions under composition are identical, the computational cost is constant regardless of the number of private algorithms. This case is not uncommon and can be found, for example, in the privacy analysis of noisy stochastic gradient descent (SGD) used in training deep neural networks.

The remainder of the paper is organized as follows. Section 2 reviews the $f$-DP framework. Section 3 introduces our methods based on the Edgeworth expansion. Section 4 provides a two-parameter interpretation of the Edgeworth approximation-based privacy guarantees. Finally, we present experimental results to demonstrate the superiority of our approach in Section 5.

2 Preliminaries on $f$-Differential Privacy

Let a randomized algorithm $M$ take a dataset $S$ as input. Leveraging the output of this algorithm, differential privacy seeks to measure the difficulty of identifying the presence or absence of any individual in $S$. The $(\epsilon, \delta)$-DP definition offers such a measure using the probabilities that $M$ gives the same outcome for two neighboring datasets $S$ and $S'$. A more concrete description is as follows. Let $P$ and $Q$ denote the probability distributions of $M(S)$ and $M(S')$, respectively. To breach the privacy, in essence, an adversary performs the following hypothesis testing problem:

$$H_0: \text{the output is drawn from } P \quad \text{vs} \quad H_1: \text{the output is drawn from } Q.$$

The privacy guarantee of $M$ boils down to what extent the adversary can tell the two distributions apart. In the case of $(\epsilon, \delta)$-DP, the privacy guarantee is expressed via Equation 1. The relationship between differential privacy and hypothesis testing was first studied in [WZ10, KOV17, LHC19, BBG19]. More recently, [DRS19] proposes to use the trade-off between type I and type II errors of the optimal likelihood ratio tests at levels ranging from 0 to 1 as a measure of the privacy guarantee. Note that the optimal tests are given by the Neyman–Pearson lemma, and can be thought of as the most powerful adversary.

Trade-off function.

Let $\phi$ be a rejection rule for testing $H_0$ against $H_1$. The type I and type II errors of $\phi$ are $\mathbb{E}_P[\phi]$ and $1 - \mathbb{E}_Q[\phi]$, respectively. The trade-off function $T(P, Q): [0, 1] \to [0, 1]$ between the two probability distributions $P$ and $Q$ is defined as

$$T(P, Q)(\alpha) = \inf_{\phi} \left\{ 1 - \mathbb{E}_Q[\phi] : \mathbb{E}_P[\phi] \le \alpha \right\}.$$

That is, $T(P, Q)(\alpha)$ equals the minimum type II error that one can achieve at significance level $\alpha$. A larger trade-off function corresponds to a more difficult hypothesis testing problem, thereby implying more privacy of the associated private algorithm. When the two distributions are the same, perfect privacy is achieved and the corresponding trade-off function is $T(P, P)(\alpha) = 1 - \alpha$. In the sequel, we denote this function by $\mathrm{Id}(\alpha)$. With the definition of trade-off functions in place, [DRS19] introduces the following privacy definition (we say $f \ge g$ if $f(\alpha) \ge g(\alpha)$ for all $0 \le \alpha \le 1$): Let $f$ be a trade-off function. An algorithm $M$ is $f$-differentially private if $T(M(S), M(S')) \ge f$ for any pair of neighboring datasets $S$ and $S'$.

While the definition above considers a general trade-off function, it is worthwhile to remark that $f$ can always be assumed to be symmetric. Letting $f^{-1}(\alpha) := \inf\{0 \le t \le 1 : f(t) \le \alpha\}$ (note that $f^{-1} = T(Q, P)$ if $f = T(P, Q)$), a trade-off function $f$ is said to be symmetric if $f = f^{-1}$. Due to the symmetry of the two neighboring datasets in the privacy definition, an $f$-DP algorithm must be $\max\{f, f^{-1}\}$-DP. Compared to $f$, the new trade-off function $\max\{f, f^{-1}\}$ is symmetric and gives a greater or equal privacy guarantee. For the special case where the lower bound in Definition 2 is a trade-off function between two Gaussian distributions, we say that the algorithm has Gaussian differential privacy (GDP): Let $G_\mu := T(\mathcal{N}(0, 1), \mathcal{N}(\mu, 1))$, that is, $G_\mu(\alpha) = \Phi(\Phi^{-1}(1 - \alpha) - \mu)$ for some $\mu \ge 0$, where $\Phi$ denotes the cumulative distribution function (CDF) of the standard normal distribution. An algorithm $M$ gives $\mu$-Gaussian differential privacy if $T(M(S), M(S')) \ge G_\mu$ for any pair of neighboring datasets $S$ and $S'$.
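To make the GDP definition concrete, the following is a minimal sketch of the Gaussian trade-off function, written as Φ(Φ⁻¹(1 − α) − μ) with Φ the standard normal CDF, using only the Python standard library. The function names are illustrative and are not taken from the paper's implementation.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal N(0, 1)


def gaussian_tradeoff(alpha: float, mu: float) -> float:
    """G_mu(alpha) = Phi(Phi^{-1}(1 - alpha) - mu), the trade-off function of mu-GDP."""
    if alpha <= 0.0:
        return 1.0  # level 0: the test never rejects, so the type II error is 1
    if alpha >= 1.0:
        return 0.0  # level 1: the test always rejects, so the type II error is 0
    return STD_NORMAL.cdf(STD_NORMAL.inv_cdf(1.0 - alpha) - mu)


def identity_tradeoff(alpha: float) -> float:
    """Id(alpha) = 1 - alpha, the trade-off function of perfect privacy."""
    return 1.0 - alpha


# A larger mu pushes the curve down, i.e. gives a weaker privacy guarantee.
for mu in (0.0, 0.5, 1.0, 2.0):
    print(mu, [round(gaussian_tradeoff(a, mu), 4) for a in (0.05, 0.25, 0.5)])
```

With `mu = 0` the curve coincides with `identity_tradeoff`, and applying `gaussian_tradeoff` twice with the same `mu` returns the original level, which reflects the symmetry of the Gaussian trade-off function.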

Figure 1: An example of a trade-off function and two supporting lines induced by the associated $(\epsilon, \delta)$-DP guarantees. These lines have slopes $-e^{\pm\epsilon}$, respectively, and intercepts $1 - \delta$.

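Duality to $(\epsilon, \delta)$-DP.

The $f$-DP framework has a dual relationship with $(\epsilon, \delta)$-DP in the sense that $f$-DP is equivalent to an infinite collection of $(\epsilon, \delta)$-DP guarantees via the convex conjugate of $f$. One can view $f$-DP as the primal representation of privacy, and accordingly, its dual representation is the collection of $(\epsilon, \delta)$-DP guarantees. In this paper, the Edgeworth approximation addresses $f$-DP from the primal perspective. However, it is also instructive to check the dual representation. The following propositions introduce how to convert the primal to the dual, and vice versa. Geometrically, each associated $(\epsilon, \delta)$-DP guarantee defines two symmetric supporting linear functions to $f$ (assuming $f$ is symmetric). See Figure 1.

[Primal to Dual] For a symmetric trade-off function $f$, let $f^*$ be its convex conjugate function $f^*(y) = \sup_{0 \le \alpha \le 1} \{ y\alpha - f(\alpha) \}$. A mechanism is $f$-DP if and only if it is $(\epsilon, \delta(\epsilon))$-DP for all $\epsilon \ge 0$ with $\delta(\epsilon) = 1 + f^*(-e^{\epsilon})$.

[Dual to Primal] Let $I$ be an arbitrary index set such that each $i \in I$ is associated with $\epsilon_i \in [0, \infty)$ and $\delta_i \in [0, 1]$. A mechanism is $(\epsilon_i, \delta_i)$-DP for all $i \in I$ if and only if it is $f$-DP with $f = \sup_{i \in I} f_{\epsilon_i, \delta_i}$, where $f_{\epsilon, \delta}(\alpha) = \max\{0,\ 1 - \delta - e^{\epsilon}\alpha,\ e^{-\epsilon}(1 - \delta - \alpha)\}$ is the trade-off function corresponding to $(\epsilon, \delta)$-DP.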
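The two conversions can be checked numerically on the Gaussian trade-off function. The sketch below assumes the closed-form dual of μ-GDP known from [DRS19], δ(ε) = Φ(−ε/μ + μ/2) − e^ε Φ(−ε/μ − μ/2), which is not restated in this section; the grid of ε values and all function names are illustrative choices.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def gaussian_tradeoff(alpha, mu):
    """G_mu(alpha) = Phi(Phi^{-1}(1 - alpha) - mu) for 0 < alpha < 1."""
    return STD_NORMAL.cdf(STD_NORMAL.inv_cdf(1.0 - alpha) - mu)


def eps_delta_tradeoff(alpha, eps, delta):
    """Trade-off function of (eps, delta)-DP: the maximum of 0 and two supporting lines."""
    return max(0.0,
               1.0 - delta - math.exp(eps) * alpha,
               math.exp(-eps) * (1.0 - delta - alpha))


def gdp_delta(eps, mu):
    """Primal to dual for G_mu (closed form assumed from [DRS19])."""
    return (STD_NORMAL.cdf(-eps / mu + mu / 2.0)
            - math.exp(eps) * STD_NORMAL.cdf(-eps / mu - mu / 2.0))


def primal_from_dual(alpha, mu, eps_grid):
    """Dual to primal: pointwise supremum over a finite grid of (eps, delta(eps)) pairs."""
    return max(eps_delta_tradeoff(alpha, eps, gdp_delta(eps, mu)) for eps in eps_grid)


MU = 1.0
EPS_GRID = [0.01 * k for k in range(601)]  # eps in [0, 6], an arbitrary finite grid
for a in (0.05, 0.2, 0.5, 0.8):
    print(a, gaussian_tradeoff(a, MU), primal_from_dual(a, MU, EPS_GRID))
```

Because every supporting line lies below the convex curve, the reconstruction from a finite grid is a lower bound that tightens as the grid is refined.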

Next, we introduce how $f$-DP guarantees degrade under composition. With regard to composition, SGD offers an important benchmark for testing a privacy definition. As a popular optimizer for training deep neural networks, SGD outputs a series of models that are generated from the composition of many gradient descent updates. Furthermore, each update step is computed from a subsampled mini-batch of data points. While composition degrades the privacy, subsampling in contrast amplifies the privacy, as individuals not included in the mini-batch have perfect privacy. Quantifying these two operations under the $f$-DP framework is crucial for analyzing the privacy guarantee of deep learning models trained by noisy SGD.

Composition.

Let $f_1 = T(P_1, Q_1)$ and $f_2 = T(P_2, Q_2)$. [DRS19] defines a binary operator $\otimes$ on trade-off functions such that $f_1 \otimes f_2 = T(P_1 \times P_2, Q_1 \times Q_2)$, where $P_1 \times P_2$ is the distribution product. This operator is commutative and associative. The composition primitive refers to an algorithm $M$ that consists of $n$ algorithms $M_1, \ldots, M_n$, where $M_i$ observes both the input dataset and the output from all previous algorithms [1]. In [DRS19], it is shown that if $M_i$ is $f_i$-DP for $1 \le i \le n$, then the composed algorithm $M$ is $f_1 \otimes \cdots \otimes f_n$-DP. The authors further identify a central limit theorem-type phenomenon of the overall privacy loss under composition. Loosely speaking, the privacy guarantee asymptotically converges to GDP in the sense that $f_1 \otimes \cdots \otimes f_n \to G_\mu$ as $n \to \infty$ under certain conditions. The privacy parameter $\mu$ depends on the trade-off functions $f_1, \ldots, f_n$.

[1] In this paper, $n$ denotes the number of private algorithms under composition, as opposed to the number of individuals in the dataset. This is to be consistent with the literature on central limit theorems.
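For intuition on the composition operator, the Gaussian family is closed under it: the composition of μ_1-GDP, ..., μ_n-GDP guarantees is again GDP with parameter equal to the square root of the sum of the squared μ_i. This closed form is a known property from [DRS19] that is not restated in this section; the helper name below is illustrative.

```python
import math


def compose_gdp(mus):
    """GDP parameter of the composition of mu_i-GDP guarantees: sqrt(sum of mu_i^2)."""
    return math.sqrt(sum(m * m for m in mus))


# One hundred 0.3-GDP steps compose into a single GDP guarantee.
print(compose_gdp([0.3] * 100))
```

For general trade-off functions no such closed form is available, which is what motivates the CLT and Edgeworth approximations discussed in this paper.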

Subsampling.

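Consider the operator $\mathrm{Sample}_p$ that includes each individual in the dataset with probability $p$ independently. Let $M \circ \mathrm{Sample}_p$ denote the algorithm where $M$ is applied to the subsampled dataset. In the subsampling theorem for $f$-DP, [DRS19] proves that if $M$ is $f$-DP, then $M \circ \mathrm{Sample}_p$ is $\tilde{f}$-DP if $\tilde{f} \le f_p$ and $\tilde{f} \le f_p^{-1}$, where $f_p = pf + (1 - p)\mathrm{Id}$. As such, we can take $\tilde{f} = \min\{f_p, f_p^{-1}\}$, which however is not convex in general. This issue can be resolved by using $\min\{f_p, f_p^{-1}\}^{**}$ in place of $\min\{f_p, f_p^{-1}\}$, where $f^{**}$ denotes the double conjugate of $f$. Indeed, [DRS19] shows that the subsampled algorithm is $\min\{f_p, f_p^{-1}\}^{**}$-DP.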

Noisy SGD.

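Let $M_\sigma$ denote the noisy gradient descent update, where $\sigma$ is the scale of the Gaussian noise added to the gradient. The noisy SGD update can essentially be represented as $M_\sigma \circ \mathrm{Sample}_p$. Exploiting the above results for composition and subsampling, [BDLS19] shows that $M_\sigma \circ \mathrm{Sample}_p$ is $\min\{g_p, g_p^{-1}\}^{**}$-DP, where $g_p = pG_{1/\sigma} + (1 - p)\mathrm{Id}$. Recognizing that noisy SGD with $n$ iterations is the $n$-fold composition of $M_\sigma \circ \mathrm{Sample}_p$, the overall privacy lower bound is $(\min\{g_p, g_p^{-1}\}^{**})^{\otimes n}$-DP, where $f^{\otimes n}$ denotes the $n$-fold composition $f \otimes \cdots \otimes f$. To evaluate the composition bound, [BDLS19] uses a central limit theorem-type result in the asymptotic regime where $p\sqrt{n}$ converges to a positive constant as $n \to \infty$: in this regime, one can show $g_p^{\otimes n} \to G_\mu$ and consequently $(\min\{g_p, g_p^{-1}\}^{**})^{\otimes n} \to G_\mu$ as well.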
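The quantities in this paragraph are simple to evaluate. The sketch below computes the subsampled Gaussian trade-off function p·G_{1/σ} + (1 − p)·Id and the GDP parameter of the CLT estimate, assuming the formula μ = p·sqrt(n(e^{1/σ²} − 1)) from [BDLS19]; the function names and the example values of p, σ, and n are illustrative.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def gaussian_tradeoff(alpha, mu):
    """G_mu(alpha) = Phi(Phi^{-1}(1 - alpha) - mu) for 0 < alpha < 1."""
    return STD_NORMAL.cdf(STD_NORMAL.inv_cdf(1.0 - alpha) - mu)


def subsampled_gaussian_tradeoff(alpha, p, sigma):
    """g_p(alpha) = p * G_{1/sigma}(alpha) + (1 - p) * (1 - alpha)."""
    return p * gaussian_tradeoff(alpha, 1.0 / sigma) + (1.0 - p) * (1.0 - alpha)


def clt_mu_noisy_sgd(p, sigma, n):
    """GDP parameter of the CLT estimate (formula assumed from [BDLS19])."""
    return p * math.sqrt(n * math.expm1(1.0 / sigma ** 2))


# Illustrative values only: sampling rate 1%, noise scale 1, 10000 steps.
print(clt_mu_noisy_sgd(p=0.01, sigma=1.0, n=10_000))
print(subsampled_gaussian_tradeoff(0.1, p=0.01, sigma=1.0))
```

Note that the subsampled function is not symmetric in general, so the actual guarantee uses its symmetrized and convexified version as described above; the sketch only evaluates the unsymmetrized building block.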

3 Edgeworth Approximation

In this section, we introduce the Edgeworth expansion-based approach to computing the privacy bound under composition. The development of this approach builds on top of [DRS19], with two crucial modifications.

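Consider the hypothesis testing problem $H_0: x \sim P_1 \times \cdots \times P_n$ vs $H_1: x \sim Q_1 \times \cdots \times Q_n$. Let $P$ denote the distribution $P_1 \times \cdots \times P_n$, and $p_i$ denote the probability density function of $P_i$. Correspondingly, we define $Q$ and $q_i$ in the same way. Letting $L_i(x_i) = \log \frac{q_i(x_i)}{p_i(x_i)}$, the likelihood ratio test statistic is given by

$$T_n(x) = \log \prod_{i=1}^{n} \frac{q_i(x_i)}{p_i(x_i)} = \sum_{i=1}^{n} L_i(x_i).$$

The Neyman–Pearson lemma states that the most powerful test at a given significance level $\alpha$ must be a thresholding function of $T_n(x)$. As a result, the optimal rejection rule would reject $H_0$ if $T_n(x) > \eta$, where $\eta$ is determined by $\alpha$. An equivalent rule is to apply thresholding to the standardized statistic: $H_0$ is rejected if

$$\frac{T_n(x) - \mathbb{E}_P[T_n(x)]}{\sqrt{\mathrm{Var}_P[T_n(x)]}} > h, \tag{2}$$

where the threshold $h$ is determined by $\alpha$.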

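In the sequel, for notational simplicity we shall use $T_n$ to denote $T_n(x)$, though it is a function of $x$. Let $F_n(\cdot)$ be the CDF of $\frac{T_n - \mathbb{E}_P[T_n]}{\sqrt{\mathrm{Var}_P[T_n]}}$ when $x$ is drawn from $P$. That is,

$$F_n(h) = \mathbb{P}_P\left( \frac{T_n - \mathbb{E}_P[T_n]}{\sqrt{\mathrm{Var}_P[T_n]}} \le h \right).$$

By the Lyapunov CLT, the standardized statistic converges in distribution to the standard normal random variable. In other words, it holds that $F_n(h) \to \Phi(h)$ as $n \to \infty$. Likewise, we write

$$\tilde{F}_n(h) = \mathbb{P}_Q\left( \frac{T_n - \mathbb{E}_Q[T_n]}{\sqrt{\mathrm{Var}_Q[T_n]}} \le h \right)$$

with $x \sim Q$ and get $\tilde{F}_n(h) \to \Phi(h)$.

With these notations in place, one can write the type I error of the rejection rule (2) as

$$\alpha = \mathbb{P}_P\left( \frac{T_n - \mathbb{E}_P[T_n]}{\sqrt{\mathrm{Var}_P[T_n]}} > h \right) = 1 - F_n(h). \tag{3}$$

The type II error of this test, which is $f(\alpha)$ by definition with $f = T(P, Q)$, is given by

$$f(\alpha) = \mathbb{P}_Q\left( \frac{T_n - \mathbb{E}_P[T_n]}{\sqrt{\mathrm{Var}_P[T_n]}} \le h \right) = \tilde{F}_n\left( \frac{h\sqrt{\mathrm{Var}_P[T_n]} + \mathbb{E}_P[T_n] - \mathbb{E}_Q[T_n]}{\sqrt{\mathrm{Var}_Q[T_n]}} \right).$$

In [DRS19], the authors assume that $f$ is symmetric and therefore derive the identity $\mathrm{Var}_P[T_n] = \mathrm{Var}_Q[T_n]$. As a consequence, $f(\alpha)$ can be written as $\tilde{F}_n(h - \mu_n)$, where $\mu_n = \frac{\mathbb{E}_Q[T_n] - \mathbb{E}_P[T_n]}{\sqrt{\mathrm{Var}_P[T_n]}}$. Taken together, the equations above give rise to $f(\alpha) = \tilde{F}_n(F_n^{-1}(1 - \alpha) - \mu_n)$. Leveraging this expression of $f$, [DRS19] proves a CLT-type asymptotic convergence result under certain conditions:

$$f(\alpha) \to \Phi(\Phi^{-1}(1 - \alpha) - \mu) = G_\mu(\alpha) \tag{4}$$

as $n \to \infty$, where $\mu$ is the limit of $\mu_n$.

Now, we discard the symmetry assumption and just rewrite

$$f(\alpha) = \tilde{F}_n\left( (h - \mu_n) \sqrt{\frac{\mathrm{Var}_P[T_n]}{\mathrm{Var}_Q[T_n]}} \right). \tag{5}$$

Plugging Equation 3 into 5, we obtain

$$f(\alpha) = \tilde{F}_n\left( \left( F_n^{-1}(1 - \alpha) - \mu_n \right) \sqrt{\frac{\mathrm{Var}_P[T_n]}{\mathrm{Var}_Q[T_n]}} \right). \tag{6}$$

In the special case where $f$ is symmetric, the factor $\sqrt{\mathrm{Var}_P[T_n] / \mathrm{Var}_Q[T_n]}$ is equal to one and we recover the result in [DRS19].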

To obtain the composition bound, computing Equation 6 exactly is not trivial. In Section 5.1 we present a numerical method to compute it directly; however, this method is computationally daunting and does not scale to a large number of compositions. The CLT estimator (Equation 4) can be computed quickly, but it can be loose for a small or moderate number of compositions. More importantly, in practice, we observe that the CLT estimator does not handle the composition of asymmetric trade-off functions well. To address these issues, we propose a two-sided approximation method, where the Edgeworth expansion is applied to both $F_n$ and $\tilde{F}_n$ in Equation 6. Our method leads to a more accurate description of $f$, as justified in Section 5.

3.1 Technical Details

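In Equation 6, we need to evaluate $\tilde{F}_n(\cdot)$ and $F_n^{-1}(\cdot)$. Our methods for addressing each of them are described below.

Approximate $\tilde{F}_n$.

Assume $x \sim Q$. Denote

$$X_Q := \frac{T_n - \mathbb{E}_Q[T_n]}{\sqrt{\mathrm{Var}_Q[T_n]}} = \frac{\sum_{i=1}^{n} (L_i - \tilde{\mu}_i)}{\sqrt{\sum_{i=1}^{n} \tilde{\sigma}_i^2}}, \tag{7}$$

where $\tilde{\mu}_i$ and $\tilde{\sigma}_i^2$ are the mean and variance of $L_i$ under distribution $Q_i$. Recall that $\tilde{F}_n$ is the CDF of $X_Q$, and we can apply the Edgeworth expansion to approximate $\tilde{F}_n$ directly, following the techniques introduced in [Hal13]. It provides a family of series that approximate $\tilde{F}_n$ in terms of the cumulants of $X_Q$, which can be derived from the cumulants of the $L_i$s under the distributions $Q_i$s. See Definition 3.1 for the notion of cumulant and Appendix B for how to compute them.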

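For a random variable $X$, let $K_X(t)$ be the natural logarithm of the moment generating function of $X$: $K_X(t) = \log \mathbb{E}[e^{tX}]$. The cumulants of $X$, denoted by $\kappa_r$ for integer $r \ge 1$, are the coefficients in the Taylor expansion of $K_X(t)$ about the origin:

$$K_X(t) = \sum_{r=1}^{\infty} \kappa_r \frac{t^r}{r!}.$$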
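As a small illustration of the definition, the first four cumulants can be obtained from the first four raw moments through standard identities, and cumulants of independent summands add. The sketch below is not the procedure of Appendix B (which is not reproduced here); the function names are illustrative, and the Exp(1) example relies on the standard fact that its raw moments are r! and its cumulants are (r − 1)!.

```python
def cumulants_from_raw_moments(m1, m2, m3, m4):
    """First four cumulants from the raw moments m_r = E[X^r]."""
    k1 = m1
    k2 = m2 - m1 ** 2
    k3 = m3 - 3 * m2 * m1 + 2 * m1 ** 3
    k4 = m4 - 4 * m3 * m1 - 3 * m2 ** 2 + 12 * m2 * m1 ** 2 - 6 * m1 ** 4
    return (k1, k2, k3, k4)


def cumulants_of_iid_sum(kappas, n):
    """Cumulants are additive over independent summands: a sum of n i.i.d. terms has n * kappa_r."""
    return tuple(n * k for k in kappas)


# Exp(1) has raw moments 1, 2, 6, 24.
kappas = cumulants_from_raw_moments(1, 2, 6, 24)
print(kappas, cumulants_of_iid_sum(kappas, 10))
```

The additivity is what makes the cumulants of the sum of log-likelihood ratios cheap to obtain from those of the individual terms.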

The key idea of the Edgeworth approximation is to write the characteristic function of the distribution of $X_Q$ in the following form:

$$\chi_n(t) = \left( 1 + \sum_{j=1}^{\infty} \frac{P_j(it)}{n^{j/2}} \right) \exp\left( -\frac{t^2}{2} \right),$$

where $P_j$ is a polynomial with degree $3j$, and then truncate the series up to a fixed number of terms. The corresponding CDF approximation is obtained by the inverse Fourier transform of the truncated series. The Edgeworth approximation of degree $j$ amounts to truncating the above series up to terms of order $n^{-j/2}$.

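Let $\tilde{\kappa}_r(L_i)$ be the $r$-th cumulant of $L_i$ under $Q_i$, and $\tilde{K}_r = \sum_{i=1}^{n} \tilde{\kappa}_r(L_i)$. Let $\tilde{\sigma}_n = \sqrt{\tilde{K}_2}$. Denoted by $\tilde{F}_{EW}$ [2], the degree-2 Edgeworth approximation of $\tilde{F}_n$ is given by

$$\tilde{F}_{EW}(h) = \Phi(h) - \underbrace{\frac{\tilde{\sigma}_n^{-3} \tilde{K}_3}{6} (h^2 - 1)\phi(h)}_{T_1} - \underbrace{\left[ \frac{\tilde{\sigma}_n^{-4} \tilde{K}_4}{24} (h^3 - 3h) + \frac{\tilde{\sigma}_n^{-6} \tilde{K}_3^2}{72} (h^5 - 10h^3 + 15h) \right] \phi(h)}_{T_2}, \tag{8}$$

where $\phi$ denotes the density of the standard normal distribution. The term $T_1$ is of order $n^{-1/2}$ and $T_2$ is of order $n^{-1}$. See Appendix A for detailed derivations. Our framework covers the CLT approximation introduced in [DRS19]. The CLT approximation is equivalent to the degree-0 Edgeworth approximation, whose approximation error is of order $n^{-1/2}$.

[2] We note that $\tilde{F}_{EW}$ is not guaranteed to be a valid CDF in general; however, it is a numerical approximation to $\tilde{F}_n$ with improved error bounds compared with the CLT approximation.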
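The degree-2 Edgeworth approximation is a few lines of code once the summed cumulants are available. The sketch below uses the standard form of the expansion, Φ(h) minus the skewness correction (of order n^{-1/2}) and the kurtosis and squared-skewness corrections (of order n^{-1}). It is checked on an illustrative case that is not from the paper: the sum of n i.i.d. Exp(1) variables, whose exact CDF is available in closed form (an Erlang law). Function names are illustrative.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def edgeworth_cdf(h, K2, K3, K4):
    """Degree-2 Edgeworth approximation of the CDF of a standardized sum.

    K2, K3, K4 are the 2nd, 3rd and 4th cumulants of the (unstandardized) sum.
    """
    sigma = math.sqrt(K2)
    lam3 = K3 / sigma ** 3  # standardized 3rd cumulant (skewness)
    lam4 = K4 / sigma ** 4  # standardized 4th cumulant (excess kurtosis)
    phi = STD_NORMAL.pdf(h)
    t1 = lam3 / 6.0 * (h ** 2 - 1.0) * phi
    t2 = (lam4 / 24.0 * (h ** 3 - 3.0 * h)
          + lam3 ** 2 / 72.0 * (h ** 5 - 10.0 * h ** 3 + 15.0 * h)) * phi
    return STD_NORMAL.cdf(h) - t1 - t2


def exact_cdf_sum_of_exponentials(h, n):
    """Exact CDF at h of the standardized sum of n i.i.d. Exp(1) variables."""
    x = n + h * math.sqrt(n)
    if x <= 0.0:
        return 0.0
    return 1.0 - math.exp(-x) * sum(x ** k / math.factorial(k) for k in range(n))


n, h = 10, 0.5
# Exp(1) has cumulants (r - 1)!, so the sum has K_2 = n, K_3 = 2n, K_4 = 6n.
print("Edgeworth:", edgeworth_cdf(h, n * 1.0, n * 2.0, n * 6.0))
print("CLT      :", STD_NORMAL.cdf(h))
print("Exact    :", exact_cdf_sum_of_exponentials(h, n))
```

Setting the third and fourth cumulants to zero recovers the CLT approximation, in line with the remark that the CLT is the degree-0 member of the family.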

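Approximate $F_n^{-1}$.

Figure 2: Comparison of the trade-off function estimates obtained when approximating $F_n^{-1}$ using Method (i) and (ii). The approximation of $\tilde{F}_n$ is fixed as in Equation 8.

In Equation 6 we need to compute $F_n^{-1}(1 - \alpha)$. This is the $(1 - \alpha)$ quantile of the distribution of $X_P := \frac{T_n - \mathbb{E}_P[T_n]}{\sqrt{\mathrm{Var}_P[T_n]}}$, where $x$ is $P$ distributed. We consider two approaches to deal with it.

Method (i): First compute the degree-2 Edgeworth approximation of $F_n$:

$$F_{EW}(h) = \Phi(h) - \frac{\sigma_n^{-3} K_3}{6} (h^2 - 1)\phi(h) - \left[ \frac{\sigma_n^{-4} K_4}{24} (h^3 - 3h) + \frac{\sigma_n^{-6} K_3^2}{72} (h^5 - 10h^3 + 15h) \right] \phi(h), \tag{9}$$

where $\sigma_n = \sqrt{K_2}$, $K_r = \sum_{i=1}^{n} \kappa_r(L_i)$, and $\kappa_r(L_i)$ is the $r$-th cumulant of $L_i$ under $P_i$. Next, numerically solve the equation $F_{EW}(h) = 1 - \alpha$.

Method (ii): Apply the closely related Cornish-Fisher expansion [CF38, FC60], an asymptotic expansion used to approximate the quantiles of a probability distribution based on its cumulants, to approximate $F_n^{-1}(1 - \alpha)$ directly. Let $z = \Phi^{-1}(1 - \alpha)$ be the $(1 - \alpha)$ quantile of the standard normal distribution. The degree-2 Cornish-Fisher approximation of the $(1 - \alpha)$ quantile of $X_P$ is given by

$$F_n^{-1}(1 - \alpha) \approx z + \frac{\sigma_n^{-3} K_3}{6} (z^2 - 1) + \frac{\sigma_n^{-4} K_4}{24} (z^3 - 3z) - \frac{\sigma_n^{-6} K_3^2}{36} (2z^3 - 5z). \tag{10}$$

Both approaches have pros and cons. The Cornish-Fisher approximation has a closed-form solution, yet Figure 2 shows that it is unstable at the boundary when the number of compositions is small. For our experiments in Section 5, we use the numerical inverse approach throughout all the runs.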
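The two quantile methods can be compared side by side. The sketch below implements Method (i) with plain bisection (one of many possible root finders, chosen here for simplicity) and Method (ii) with the standard second-order Cornish-Fisher formula. The cumulants in the example are those of a sum of ten Exp(1) variables, an illustrative case that is not from the paper; function names are also illustrative.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def edgeworth_cdf(h, K2, K3, K4):
    """Degree-2 Edgeworth approximation of the CDF of a standardized sum."""
    sigma = math.sqrt(K2)
    lam3, lam4 = K3 / sigma ** 3, K4 / sigma ** 4
    phi = STD_NORMAL.pdf(h)
    t1 = lam3 / 6.0 * (h ** 2 - 1.0) * phi
    t2 = (lam4 / 24.0 * (h ** 3 - 3.0 * h)
          + lam3 ** 2 / 72.0 * (h ** 5 - 10.0 * h ** 3 + 15.0 * h)) * phi
    return STD_NORMAL.cdf(h) - t1 - t2


def edgeworth_quantile(prob, K2, K3, K4, lo=-10.0, hi=10.0, iters=100):
    """Method (i): solve edgeworth_cdf(h) = prob by bisection on [lo, hi]."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if edgeworth_cdf(mid, K2, K3, K4) < prob:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def cornish_fisher_quantile(prob, K2, K3, K4):
    """Method (ii): second-order Cornish-Fisher approximation of the quantile."""
    sigma = math.sqrt(K2)
    lam3, lam4 = K3 / sigma ** 3, K4 / sigma ** 4
    z = STD_NORMAL.inv_cdf(prob)
    return (z + lam3 / 6.0 * (z ** 2 - 1.0)
            + lam4 / 24.0 * (z ** 3 - 3.0 * z)
            - lam3 ** 2 / 36.0 * (2.0 * z ** 3 - 5.0 * z))


K2, K3, K4 = 10.0, 20.0, 60.0  # sum of ten Exp(1) variables (illustrative)
for prob in (0.5, 0.9, 0.99):
    print(prob, edgeworth_quantile(prob, K2, K3, K4), cornish_fisher_quantile(prob, K2, K3, K4))
```

Since the Edgeworth series is not guaranteed to be monotone, bisection returns a crossing point rather than a certified unique root; this is a simplification of the sketch, not a statement about the paper's solver.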

3.2 Error Analysis

Here we provide an error bound for approximating the overall privacy level $f$ using the Edgeworth expansion. For simplicity, we assume that $f$ is symmetric and the log-likelihood ratios $L_i$'s are i.i.d. with the common distribution having sufficiently light tails for the convergence of the Edgeworth expansion, under both $P$ and $Q$.

The Edgeworth expansion of degree 2 satisfies both $F_n(h) - F_{EW}(h) = O(n^{-3/2})$ and $\tilde{F}_n(h) - \tilde{F}_{EW}(h) = O(n^{-3/2})$. Conversely, the inverse satisfies $F_n^{-1}(\alpha) - F_{EW}^{-1}(\alpha) = O(n^{-3/2})$ and $\tilde{F}_n^{-1}(\alpha) - \tilde{F}_{EW}^{-1}(\alpha) = O(n^{-3/2})$ for $\alpha$ that is bounded away from 0 and 1. Making use of these approximation bounds, we get

$$\tilde{F}_{EW}\left( F_{EW}^{-1}(1 - \alpha) - \mu_n \right) - \tilde{F}_n\left( F_n^{-1}(1 - \alpha) - \mu_n \right) = O(n^{-3/2}).$$

As a caveat, we remark that the analysis above does not extend the error bound to a type I error that is close to 0 or 1. The result states that the approximation error of using the Edgeworth expansion quickly tends to 0 at the rate of $O(n^{-3/2})$. This error rate can be improved at the expense of a higher order expansion of the Edgeworth series. For comparison, our analysis can be carried over to show that the approximation error is of order $O(n^{-1/2})$ using CLT.

3.3 Computational Cost

The computational cost of the Edgeworth approximation can be broken down as follows. We first need to compute the cumulants of $L_i$ under $P_i$ and $Q_i$ up to a certain order, for $i = 1, \ldots, n$. Next, we need to compute $F_n^{-1}(1 - \alpha)$. The Cornish-Fisher approximation (Equation 10) costs constant time. If we choose to compute the inverse numerically, we need to evaluate Equation 9 and then solve a one-dimensional root-finding problem. The former has a constant cost, and the latter can be computed very efficiently by various types of iterative methods, for which we can consider the cost as constant too. Finally, it costs constant time to evaluate $\tilde{F}_{EW}$ using Equation 8. Therefore, the only cost that might scale with the number of compositions $n$ is the computation of cumulants. However, in many applications including computing the privacy bound for noisy SGD, all the $L_i$s are i.i.d. Under such a condition, we only need to compute the cumulants once. The total cost is thus a constant independent of $n$. This is verified by the runtime comparison in Section 5.
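Putting the pieces together for the i.i.d. case, the sketch below evaluates the two-sided approximation of Equation 6 from the per-term cumulants of the log-likelihood ratio under the null and the alternative, so the work does not grow with n. The bisection solver, the final clipping to [0, 1], and all names are choices of this sketch rather than details confirmed by the paper. The sanity check uses a Gaussian pair N(0, 1) versus N(μ, 1), where the log-likelihood ratio is itself Gaussian and the composition is exactly GDP with parameter sqrt(n)·μ.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def edgeworth_cdf(h, K2, K3, K4):
    """Degree-2 Edgeworth approximation of the CDF of a standardized sum."""
    sigma = math.sqrt(K2)
    lam3, lam4 = K3 / sigma ** 3, K4 / sigma ** 4
    phi = STD_NORMAL.pdf(h)
    t1 = lam3 / 6.0 * (h ** 2 - 1.0) * phi
    t2 = (lam4 / 24.0 * (h ** 3 - 3.0 * h)
          + lam3 ** 2 / 72.0 * (h ** 5 - 10.0 * h ** 3 + 15.0 * h)) * phi
    return STD_NORMAL.cdf(h) - t1 - t2


def edgeworth_quantile(prob, K2, K3, K4, lo=-10.0, hi=10.0, iters=100):
    """Numerical inverse of edgeworth_cdf by bisection."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if edgeworth_cdf(mid, K2, K3, K4) < prob:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def edgeworth_tradeoff(alpha, cum_p, cum_q, n):
    """Two-sided Edgeworth estimate of the n-fold composition at level alpha.

    cum_p, cum_q: first four cumulants of one log-likelihood ratio L_i
    under P_i and Q_i respectively (identical for all i).
    """
    mean_p, var_p, K3_p, K4_p = (n * k for k in cum_p)
    mean_q, var_q, K3_q, K4_q = (n * k for k in cum_q)
    mu_n = (mean_q - mean_p) / math.sqrt(var_p)
    h = edgeworth_quantile(1.0 - alpha, var_p, K3_p, K4_p)  # approximates F_n^{-1}(1 - alpha)
    arg = (h - mu_n) * math.sqrt(var_p / var_q)
    value = edgeworth_cdf(arg, var_q, K3_q, K4_q)           # approximates F~_n(arg)
    return min(1.0, max(0.0, value))  # clip, since the series need not be a valid CDF


# Gaussian sanity check: L = mu * x - mu^2 / 2 is normal under both hypotheses.
mu, n = 0.5, 4
cum_p = (-mu ** 2 / 2.0, mu ** 2, 0.0, 0.0)
cum_q = (mu ** 2 / 2.0, mu ** 2, 0.0, 0.0)
print(edgeworth_tradeoff(0.1, cum_p, cum_q, n))
```

Changing `n` only rescales the cumulants, which is the constant-cost behavior described above.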

4 A Two-Parameter Privacy Interpreter

Let $M_1$ and $M_2$ be two private algorithms that are associated with trade-off functions $f_1$ and $f_2$, respectively. The algorithm $M_1$ will be more private than $M_2$ if $f_1$ upper bounds $f_2$. For the family of Gaussian differentially private algorithms, this property can be reflected by the parameter $\mu$ directly, where a smaller value of $\mu$ manifests a more private algorithm.

Here we provide a two-parameter description for the Edgeworth approximation, through which the privacy guarantees of two different approximations can be directly compared.

Given an approximation $f$, let $\alpha^*$ be its fixed point such that $f(\alpha^*) = \alpha^*$. Let $\mu^*$ be the parameter of GDP for which $G_{\mu^*}$ admits the same fixed point as $f$: $G_{\mu^*}(\alpha^*) = \alpha^*$. Such $\mu^*$ can be computed in closed form:

$$\mu^* = \Phi^{-1}(1 - \alpha^*) - \Phi^{-1}(\alpha^*) = -2\Phi^{-1}(\alpha^*).$$

Let $\gamma = \int_0^1 f(\alpha)\, d\alpha$ be the area under the curve of $f$.
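The two parameters are easy to compute for any decreasing trade-off function on [0, 1]. The sketch below finds the fixed point by bisection, converts it to a GDP parameter through −2Φ⁻¹ of the fixed point (which follows from solving the fixed-point equation of the Gaussian trade-off function), and integrates with the trapezoidal rule. The function names, the solver, and the grid size are illustrative choices.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def gaussian_tradeoff(alpha, mu):
    """G_mu(alpha), with the end points handled explicitly."""
    if alpha <= 0.0:
        return 1.0
    if alpha >= 1.0:
        return 0.0
    return STD_NORMAL.cdf(STD_NORMAL.inv_cdf(1.0 - alpha) - mu)


def fixed_point(f, iters=100):
    """Solve f(alpha) = alpha on [0, 1] by bisection, for a decreasing f."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) > mid:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def gdp_parameter_from_fixed_point(alpha_star):
    """GDP parameter whose trade-off function has the fixed point alpha_star."""
    return -2.0 * STD_NORMAL.inv_cdf(alpha_star)


def area_under_curve(f, num=10_000):
    """Trapezoidal approximation of the integral of f over [0, 1]."""
    step = 1.0 / num
    total = 0.5 * (f(0.0) + f(1.0))
    total += sum(f(k * step) for k in range(1, num))
    return total * step


f = lambda a: gaussian_tradeoff(a, 1.0)
alpha_star = fixed_point(f)
print(alpha_star, gdp_parameter_from_fixed_point(alpha_star), area_under_curve(f))
```

For a Gaussian trade-off function the recovered parameter equals the original one, and the perfect-privacy function 1 − α attains the maximal area of one half.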

Two symmetric Edgeworth approximations [3] $f_1$ and $f_2$ can be compared in the sense that $f_1$ is more private than $f_2$ if their associated parameters $(\mu_1^*, \gamma_1)$ and $(\mu_2^*, \gamma_2)$ satisfy the following condition:

$$\mu_1^* \le \mu_2^* \quad \text{and} \quad \gamma_1 \ge \gamma_2.$$

[3] If $f$ is asymmetric, we can always symmetrize it by taking $\min\{f, f^{-1}\}^{**}$.

Figure 3: Left: An illustration of the $(\mu^*, \gamma)$ parameterization. Right: $f_1$ is more private than $f_2$.

The left panel of Figure 3 provides the geometric interpretation of the above parameterization. The Edgeworth approximation $f$, the CLT approximation $G_{\mu^*}$, and the line $y = x$ intersect at the point $(\alpha^*, \alpha^*)$.

The right panel compares two Edgeworth approximations $f_1$ and $f_2$. It is easy to see that in this case $f_1$ upper bounds $f_2$ and thus it is more private than $f_2$. There are two important properties. First, the intersection of $f_1$ with the line $y = x$ is further away from the origin than that of $f_2$. Considering the geometric interpretation shown in the left panel, this implies that $\mu_1^* < \mu_2^*$. Second, $f_1$ also has a larger area under the curve than $f_2$, which is essentially $\gamma_1 > \gamma_2$.

This parameterization defines a partial order over the set $\{(\mu^*, \gamma) : \mu^* \ge 0,\ 0 \le \gamma \le 1/2\}$ [4]. It is also applicable to general trade-off functions.

[4] Perfect privacy is attained when the trade-off function is $\mathrm{Id}$, whose area under the curve is $1/2$.

5 Experiments

In this section, we present numerical experiments to compare the Edgeworth approximation and the CLT approximation. Before we proceed, we introduce a numerical method to directly compute the true composition bound in Section 5.1. This method is not scalable and hence merely serves as a reference for our comparison. We use the Edgeworth approximation of degree 2 for all the experiments. In the sequel, we refer to those methods as Edgeworth, CLT, and Numerical, respectively. All the methods are implemented in Python and all the experiments are carried out on a MacBook with 2.5GHz processor and 16GB memory.

5.1 A Numerical Method

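Consider the problem of computing $f_1 \otimes \cdots \otimes f_n$ numerically. We know that we can find $P_i, Q_i$ such that $f_i = T(P_i, Q_i)$ and $f_1 \otimes \cdots \otimes f_n = T(P_1 \times \cdots \times P_n, Q_1 \times \cdots \times Q_n)$. However, computing $T(P_1 \times \cdots \times P_n, Q_1 \times \cdots \times Q_n)$ directly involves high-dimensional testing, which can be challenging. We show this difficulty can be avoided by going from the primal representation to the dual representation. Let $\delta_i(\epsilon)$ be the dual representation associated with $f_i$, and let $\delta^{(k)}(\epsilon)$ be the dual representation associated with $f_1 \otimes \cdots \otimes f_k$. The method contains 3 steps to obtain $f_1 \otimes \cdots \otimes f_n$.

  1. Convert $f_i$ to $\delta_i$. This step can be done implicitly via $P_i$ and $Q_i$, see Lemma 5.1.

  2. Iteratively compute $\delta^{(k+1)}$ from $\delta^{(k)}$ using Lemma 5.1.

  3. Convert $\delta^{(n)}$ to $f_1 \otimes \cdots \otimes f_n$ using Proposition 2.

Next, we explain how to compute $\delta^{(k+1)}$ from $\delta^{(k)}$. First, we need a lemma that relates $\delta$ with $(P, Q)$.

[] Suppose $f = T(P, Q)$, and $P$ and $Q$ have densities $p$ and $q$ with respect to a dominating measure $\nu$. Then the dual representation $\delta(\epsilon) = 1 + f^*(-e^{\epsilon})$ satisfies

$$\delta(\epsilon) = \int \left( q - e^{\epsilon} p \right)_+ \, d\nu.$$

Suppose we are given $f_1 = T(P_1, Q_1)$ and $f_2 = T(P_2, Q_2)$. Let $\delta_1$ and $\delta_2$ be the dual representations of $f_1$ and $f_2$, respectively. The following lemma shows how to evaluate the dual representation of $f_1 \otimes f_2$ from $\delta_1$ and $(P_2, Q_2)$. To simplify notations, we assume $P_i, Q_i$ are distributions on the real line and have corresponding densities $p_i, q_i$ for $i = 1, 2$ with respect to Lebesgue measure. Generalization to abstract measurable spaces is straightforward.

[] Let $f = f_1 \otimes f_2$ with dual representation $\delta$, and let $L_2(x) = \log \frac{q_2(x)}{p_2(x)}$. Then

$$\delta(\epsilon) = \int \delta_1\left( \epsilon - L_2(x) \right) q_2(x) \, dx.$$

In particular, it yields a recursive formula to compute $\delta^{(k+1)}$ when $f_1 = \cdots = f_n = T(P, Q)$. Again we assume $P, Q$ have densities $p, q$ on the real line. Let $L(x) = \log \frac{q(x)}{p(x)}$. We have

$$\delta^{(k+1)}(\epsilon) = \int \delta^{(k)}\left( \epsilon - L(x) \right) q(x) \, dx.$$

We remark here that if $f$ is asymmetric, then the dual involves negative $\epsilon$, which is why the conversion to $f$ involves the whole real line. The proof of the above lemmas is deferred to Appendix D.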

In practice, it is more efficient to store in the memory than to perform the computation on the fly, so we need to discretize the domain and store the function value on this grid. Consider an abstract grid , the recursion goes as follows:

where

. This rounding step can be replaced by an interpolation as well.

The major challenge in making this numerical method practical for computing composition product of trade-off functions is that it is slow in computation as it involves numerical integrations.

5.2 A Moderate Number of Compositions

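Section 3 shows that the approximation error of Edgeworth is $O(n^{-3/2})$, and for CLT the error is $O(n^{-1/2})$. We thus expect that for small or moderate values of $n$, Edgeworth will noticeably outperform CLT. To verify this, we investigate their performance on a toy problem for testing order-$n$ compositions of Laplace distributions [5]: $H_0: x \sim \mathrm{Lap}(0, 1)^{\otimes n}$ vs $H_1: x \sim \mathrm{Lap}(\theta, 1)^{\otimes n}$.

[5] The density function of $\mathrm{Lap}(\theta, b)$ is $\frac{1}{2b} \exp\left( -\frac{|x - \theta|}{b} \right)$.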
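To connect this toy problem with the cumulant-based formulas of Section 3, the sketch below computes the first four cumulants of a single log-likelihood ratio between two unit-scale Laplace distributions by numerical integration. It assumes the pair Lap(0, 1) versus Lap(θ, 1) with an illustrative shift θ = 1; the integration range, the step size, and all names are choices of this sketch and are not taken from the paper's experiments.

```python
import math


def laplace_pdf(x, loc=0.0):
    """Density of the unit-scale Laplace distribution centered at loc."""
    return 0.5 * math.exp(-abs(x - loc))


def log_likelihood_ratio(x, theta):
    """L(x) = log(q(x) / p(x)) for p = Lap(0, 1) and q = Lap(theta, 1)."""
    return abs(x) - abs(x - theta)


def raw_moments(theta, loc, half_width=40.0, step=0.002):
    """E[L^r] for r = 1..4 with x ~ Lap(loc, 1), by the trapezoidal rule."""
    num = int(round(2.0 * half_width / step))
    moments = [0.0, 0.0, 0.0, 0.0]
    for i in range(num + 1):
        x = -half_width + i * step
        weight = step * (0.5 if i in (0, num) else 1.0) * laplace_pdf(x, loc)
        value = log_likelihood_ratio(x, theta)
        for r in range(4):
            moments[r] += weight * value ** (r + 1)
    return moments


def cumulants(m):
    """First four cumulants from the first four raw moments."""
    m1, m2, m3, m4 = m
    return (m1,
            m2 - m1 ** 2,
            m3 - 3 * m2 * m1 + 2 * m1 ** 3,
            m4 - 4 * m3 * m1 - 3 * m2 ** 2 + 12 * m2 * m1 ** 2 - 6 * m1 ** 4)


THETA = 1.0  # illustrative shift
kappa_p = cumulants(raw_moments(THETA, loc=0.0))    # under the null
kappa_q = cumulants(raw_moments(THETA, loc=THETA))  # under the alternative
print(kappa_p)
print(kappa_q)
```

By the symmetry of the Laplace density, the log-likelihood ratio under the alternative has the same law as its negative under the null, so the odd cumulants flip sign and the even ones agree. Multiplying these per-term cumulants by n gives the inputs needed for Equations 8 to 10.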

Figure 4: The estimated trade-off functions for testing $H_0$ vs $H_1$.
Figure 5: The associated $(\epsilon, \delta)$-DP of the estimated trade-off functions for testing $H_0$ vs $H_1$.

We vary the number of compositions $n$ over a small range. Since the privacy guarantee decays as $n$ increases and the resulting curves would be very close to the axes, we choose the distribution parameter accordingly for the sake of better visibility. Figure 4 plots the estimated trade-off functions for 4 representative cases. For each of the methods, we also compute the associated $(\epsilon, \delta)$-DP (see Proposition 2) and plot $\delta$ as a function of $\epsilon$ in Figure 5. From both the primal and dual views, Edgeworth coincides better with Numerical in all the cases. When the number of compositions is 10, even though the difference between Edgeworth and CLT is small in the primal visualization (Figure 4), the $(\epsilon, \delta)$ presentation still clearly distinguishes them. In addition, due to the heavy tail of the Laplace distribution, the log-likelihood ratio is bounded and the composition satisfies pure $(\epsilon, 0)$-DP for sufficiently large $\epsilon$. Therefore, the ground truth has the property that $\delta(\epsilon) = 0$ once $\epsilon$ exceeds a finite threshold. Figure 5 shows that Edgeworth also outperforms CLT in predicting this change point.

CLT 0.0004 0.0004 0.0004 0.0004 0.0005
Edgeworth 0.2347 0.2341 0.2391 0.2222 0.2234
Numerical 3.6834 7.3361 12.055 16.729 21.3575
Table 1: Runtime (in seconds) for estimating the trade-off function for testing $H_0$ vs $H_1$. Each column corresponds to one number of compositions.

Table 1 reports the runtime of the above experiment. CLT takes constant computing time on the order of $10^{-4}$ seconds. Due to the homogeneity of the component distributions under composition, the runtime of Edgeworth is also independent of the number of compositions, and is on the order of 0.1 seconds. Numerical is computationally heavy. Its runtime is much larger and grows linearly with the number of compositions.

Figure 6: The estimation of .

5.3 Privacy Guarantees for Noisy SGD

Figure 7: The estimation of the privacy bound for -step noisy SGD. The sampling rate is and the noise scale is .

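We inspect the performance of Edgeworth and CLT on estimating the privacy guarantee for $n$-step noisy SGD. As introduced in Section 2, the privacy bound is of the form $(\min\{g_p, g_p^{-1}\}^{**})^{\otimes n}$ where $g_p = pG_{1/\sigma} + (1 - p)\mathrm{Id}$, and the CLT estimation is $G_\mu$ with $\mu = p\sqrt{n(e^{1/\sigma^2} - 1)}$. For Edgeworth, note that $g_p$ is the trade-off function of testing the standard normal distribution $\mathcal{N}(0, 1)$ versus the mixture model $p\,\mathcal{N}(1/\sigma, 1) + (1 - p)\,\mathcal{N}(0, 1)$ (see Appendix C for the proof). It follows that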