1 Introduction
Machine learning, data mining, and statistical analysis are widely used in applications that affect our daily lives. While we celebrate the benefits brought by these applications, to an alarming degree, the algorithms are accessing datasets containing sensitive information such as individual behaviors on the web and health records. By tweaking the datasets and leveraging the output of algorithms, it is possible for an adversary to learn information about and even identify certain individuals [FJR15, SSSS17]. In particular, privacy concerns become even more acute when the same dataset is probed by a sequence of algorithms. With knowledge of the dataset from the prior algorithms' output, an adversary can adaptively analyze the dataset to cause additional privacy loss at each round. This reality raises one of the most fundamental problems in the area of private data analysis:
How to accurately and efficiently quantify the cumulative
privacy loss under composition of private algorithms?
To address this important problem, one has to start with a formal privacy definition. To date, the most popular statistical privacy definition is $(\epsilon,\delta)$-differential privacy (DP) [DKM06, DMNS06], with numerous deployments in both industrial applications and academic research [EPK14, ACG16, PAE16, DKY17, App17, Abo18]. Informally, this privacy definition requires an unnoticeable change in a (randomized) algorithm's output due to the replacement of any individual in the dataset. More concretely, letting $\epsilon \ge 0$ and $0 \le \delta \le 1$, an algorithm $M$ is $(\epsilon,\delta)$-differentially private if for any pair of neighboring datasets $S$, $S'$ (in the sense that they differ in only one individual) and any event $E$,
$P(M(S) \in E) \le e^{\epsilon} P(M(S') \in E) + \delta.$ (1)
Unfortunately, this privacy definition comes with a drawback when handling composition. Explicitly, $(\epsilon,\delta)$-DP is not closed under composition in the sense that the overall privacy bound of a sequence of (at least two) $(\epsilon,\delta)$-DP algorithms cannot be precisely described by a single pair of the parameters $\epsilon, \delta$ [KOV17]. Although the precise bound can be collectively represented by infinitely many pairs of $\epsilon, \delta$, [MV16] shows that it is computationally hard to find such a pair of privacy parameters.
The need for a better treatment of composition has triggered a surge of interest in proposing generalizations of differential privacy, including divergence-based relaxations [DR16, BS16, Mir17, BDRS18] and, more recently, a hypothesis testing-based extension termed $f$-differential privacy ($f$-DP) [DRS19]. This privacy definition leverages the hypothesis testing interpretation of differential privacy, and characterizes the privacy guarantee using the tradeoff between type I and type II errors given by the associated hypothesis testing problem. As an advantage over the divergence-based privacy definitions, among others, $f$-DP allows for a concise and sharp argument for privacy amplification by subsampling. More significantly, $f$-DP is accompanied by a technique powered by the central limit theorem (CLT) for analyzing privacy bounds under composition of a large number of private algorithms. Loosely speaking, the overall privacy bound asymptotically converges to the tradeoff function defined by testing between two normal distributions. This class of tradeoff functions gives rise to Gaussian differential privacy (GDP), a subfamily of $f$-DP guarantees.
In deploying differential privacy, however, the number of private algorithms under composition may be moderate or small (see such applications in private sparse linear regression [KST12] and personalized online advertising [LO11]). In this regime, the CLT phenomenon does not kick in and, as such, the composition bounds developed using the CLT can be inaccurate [DRS19, BDLS19]. To address this practically important problem, in this paper we develop sharp and analytical composition bounds in $f$-DP without assuming a large number of algorithms, by leveraging the Edgeworth expansion [Hal13]. The Edgeworth expansion is a technique for approximating probability distributions in terms of their cumulants. Compared with the CLT approach, in our setting, this technique enables a significant reduction of approximation errors for composition theorems.
In short, our Edgeworth expansion-powered privacy bounds have a number of appealing properties, which will be shown in this paper through both theoretical analysis and numerical examples.

Our method is easy to implement and computationally efficient. In the case where all tradeoff functions are identical under composition, the computational cost is constant regardless of the number of private algorithms. This case is not uncommon and can be found, for example, in the privacy analysis of noisy stochastic gradient descent (SGD) used in training deep neural networks.
The remainder of the paper is organized as follows. Section 2 reviews the $f$-DP framework. Section 3 introduces our methods based on the Edgeworth expansion. Section 4 provides a two-parameter interpretation of the Edgeworth approximation-based privacy guarantees. Finally, we present experimental results to demonstrate the superiority of our approach in Section 5.
2 Preliminaries on Differential Privacy
Let a randomized algorithm $M$ take a dataset $S$ as input. Leveraging the output of this algorithm, differential privacy seeks to measure the difficulty of identifying the presence or absence of any individual in $S$. The $(\epsilon,\delta)$-DP definition offers such a measure using the probabilities that $M$ gives the same outcome for two neighboring datasets $S$ and $S'$. A more concrete description is as follows. Let $P$ and $Q$ denote the probability distributions of $M(S)$ and $M(S')$, respectively. To breach the privacy, in essence, an adversary performs the following hypothesis testing problem:
$H_0$: the true dataset is $S$ versus $H_1$: the true dataset is $S'$.
The privacy guarantee of $M$ boils down to the extent to which the adversary can tell the two distributions apart. In the case of $(\epsilon,\delta)$-DP, the privacy guarantee is expressed via Equation 1. The relationship between differential privacy and hypothesis testing was first studied in [WZ10, KOV17, LHC19, BBG19]. More recently, [DRS19] proposes to use the tradeoff between type I and type II errors of the optimal likelihood ratio tests at levels ranging from 0 to 1 as a measure of the privacy guarantee. Note that the optimal tests are given by the Neyman–Pearson lemma, and can be thought of as the most powerful adversary.
Tradeoff function.
Let $0 \le \phi \le 1$ be a rejection rule for testing $H_0$ against $H_1$. The type I and type II errors of $\phi$ are $E_P[\phi]$ and $1 - E_Q[\phi]$, respectively. The tradeoff function $T(P,Q): [0,1] \to [0,1]$ between the two probability distributions $P$ and $Q$ is defined as
$T(P,Q)(\alpha) = \inf_{\phi} \{ 1 - E_Q[\phi] : E_P[\phi] \le \alpha \}.$
That is, $T(P,Q)(\alpha)$ equals the minimum type II error that one can achieve at significance level $\alpha$. A larger tradeoff function corresponds to a more difficult hypothesis testing problem, thereby implying more privacy of the associated private algorithm. When the two distributions are the same, perfect privacy is achieved and the corresponding tradeoff function is $T(P,P)(\alpha) = 1 - \alpha$. In the sequel, we denote this function by $\mathrm{Id}(\alpha)$. With the definition of tradeoff functions in place, [DRS19] introduces the following privacy definition (we say $f \ge g$ if $f(\alpha) \ge g(\alpha)$ for all $0 \le \alpha \le 1$): Let $f$ be a tradeoff function. An algorithm $M$ is $f$-differentially private if $T(M(S), M(S')) \ge f$ for any pair of neighboring datasets $S$ and $S'$.
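The tradeoff function defined above can be made concrete for distributions on a finite outcome set. The following is a minimal sketch, not taken from the paper: the function name `tradeoff` and the list-based representation of $P$ and $Q$ are assumptions made for illustration. It applies the Neyman–Pearson construction, rejecting on the outcomes with the largest likelihood ratio first and randomizing on the boundary outcome.

```python
def tradeoff(p, q, alpha):
    """Minimum type II error at level alpha for testing P (null) vs Q (alternative).

    p and q are lists of probabilities over the same finite outcome set
    (an illustrative representation, not the paper's).
    """
    # Sort outcomes by likelihood ratio q/p, largest first: the most powerful
    # test rejects the null on the outcomes that favour Q most strongly.
    def ratio(i):
        return float("inf") if p[i] == 0 else q[i] / p[i]

    order = sorted(range(len(p)), key=ratio, reverse=True)

    budget = alpha  # type I error still available
    power = 0.0     # probability of rejecting under Q
    for i in order:
        if p[i] <= budget:
            budget -= p[i]
            power += q[i]
        else:
            # Randomize on the boundary outcome to spend the budget exactly.
            power += q[i] * (budget / p[i])
            break
    return 1.0 - power


if __name__ == "__main__":
    same = [0.5, 0.5]
    shifted = [0.9, 0.1]
    print(tradeoff(same, same, 0.3))     # identical distributions give 1 - alpha
    print(tradeoff(same, shifted, 0.25))
```

When the two distributions coincide the function returns $1 - \alpha$, the perfect-privacy curve; the further apart they are, the lower the curve.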
While the definition above considers a general tradeoff function, it is worthwhile to remark that $f$ can always be assumed to be symmetric. Letting $f^{-1}(\alpha) := \inf\{0 \le t \le 1 : f(t) \le \alpha\}$ (note that $f^{-1} = T(Q,P)$ if $f = T(P,Q)$), a tradeoff function $f$ is said to be symmetric if $f = f^{-1}$. Due to the symmetry of the two neighboring datasets in the privacy definition, an $f$-DP algorithm must be $\max\{f, f^{-1}\}$-DP. Compared to $f$, the new tradeoff function $\max\{f, f^{-1}\}$ is symmetric and gives a greater or equal privacy guarantee. For the special case where the lower bound in Definition 2 is a tradeoff function between two Gaussian distributions, we say that the algorithm has Gaussian differential privacy (GDP): Let $G_\mu(\alpha) := T(N(0,1), N(\mu,1))(\alpha) = \Phi(\Phi^{-1}(1-\alpha) - \mu)$ for some $\mu \ge 0$, where $\Phi$ denotes the cumulative distribution function (CDF) of the standard normal distribution. An algorithm $M$ gives $\mu$-Gaussian differential privacy if $T(M(S), M(S')) \ge G_\mu$ for any pair of neighboring datasets $S$ and $S'$.
Duality to $(\epsilon,\delta)$-DP.
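The $f$-DP framework has a dual relationship with $(\epsilon,\delta)$-DP in the sense that $f$-DP is equivalent to an infinite collection of $(\epsilon,\delta)$-DP guarantees via the convex conjugate of $f$. One can view $f$-DP as the primal representation of privacy, and accordingly, its dual representation is the collection of $(\epsilon,\delta)$-DP guarantees. In this paper, the Edgeworth approximation addresses $f$-DP from the primal perspective. However, it is also instructive to check the dual presentation. The following propositions introduce how to convert the primal to the dual, and vice versa. Geometrically, each associated $(\epsilon,\delta)$-DP guarantee defines two symmetric supporting linear functions to $f$ (assuming $f$ is symmetric). See Figure 1.
[Primal to Dual] For a symmetric tradeoff function $f$, let $f^*$ be its convex conjugate function $f^*(y) = \sup_{0 \le \alpha \le 1} \{ y\alpha - f(\alpha) \}$. A mechanism is $f$-DP if and only if it is $(\epsilon, \delta(\epsilon))$-DP for all $\epsilon \ge 0$ with $\delta(\epsilon) = 1 + f^*(-e^{\epsilon})$.
[Dual to Primal] Let $I$ be an arbitrary index set such that each $i \in I$ is associated with $\epsilon_i \in [0, \infty)$ and $\delta_i \in [0, 1]$. A mechanism is $(\epsilon_i, \delta_i)$-DP for all $i \in I$ if and only if it is $f$-DP with $f = \sup_{i \in I} f_{\epsilon_i, \delta_i}$, where $f_{\epsilon,\delta}(\alpha) = \max\{0, 1 - \delta - e^{\epsilon}\alpha, e^{-\epsilon}(1 - \delta - \alpha)\}$ is the tradeoff function corresponding to $(\epsilon,\delta)$-DP.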
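For GDP the primal and dual views can both be written in closed form, which makes the duality easy to check numerically. The sketch below is illustrative only: the function names are hypothetical, and the formula $\delta(\epsilon) = \Phi(-\epsilon/\mu + \mu/2) - e^{\epsilon}\Phi(-\epsilon/\mu - \mu/2)$ for $\mu$-GDP is the one reported in [DRS19], assumed here rather than derived.

```python
from math import exp
from statistics import NormalDist

_STD = NormalDist()  # standard normal; cdf is Phi, inv_cdf is Phi^{-1}


def gaussian_tradeoff(alpha, mu):
    """G_mu(alpha) = Phi(Phi^{-1}(1 - alpha) - mu) for 0 < alpha < 1."""
    return _STD.cdf(_STD.inv_cdf(1.0 - alpha) - mu)


def gdp_delta(eps, mu):
    """delta(eps) such that mu-GDP implies (eps, delta(eps))-DP (formula from [DRS19])."""
    return _STD.cdf(-eps / mu + mu / 2.0) - exp(eps) * _STD.cdf(-eps / mu - mu / 2.0)


def eps_delta_tradeoff(alpha, eps, delta):
    """Tradeoff function of (eps, delta)-DP: two symmetric lines, floored at zero."""
    return max(0.0, 1.0 - delta - exp(eps) * alpha, exp(-eps) * (1.0 - delta - alpha))


if __name__ == "__main__":
    mu = 1.0
    for eps in (0.0, 0.5, 1.0, 2.0):
        delta = gdp_delta(eps, mu)
        # Each dual guarantee should lie on or below the primal curve G_mu.
        gap = min(
            gaussian_tradeoff(a / 100.0, mu) - eps_delta_tradeoff(a / 100.0, eps, delta)
            for a in range(1, 100)
        )
        print(eps, delta, gap)
```

Each $(\epsilon, \delta(\epsilon))$ pair yields a piecewise linear function that supports $G_\mu$ from below; taking the pointwise supremum over all $\epsilon \ge 0$ recovers the primal curve.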
Next, we introduce how $f$-DP guarantees degrade under composition. With regard to composition, SGD offers an important benchmark for testing a privacy definition. As a popular optimizer for training deep neural networks, SGD outputs a series of models that are generated from the composition of many gradient descent updates. Furthermore, each update step is computed from a subsampled minibatch of data points. While composition degrades the privacy, in contrast, subsampling amplifies the privacy, as individuals not included in the minibatch have perfect privacy. Quantifying these two operations under the $f$-DP framework is crucial for analyzing the privacy guarantee of deep learning models trained by noisy SGD.
Composition.
Let $f = T(P,Q)$ and $g = T(P',Q')$. [DRS19] defines a binary operator $\otimes$ on tradeoff functions such that $f \otimes g = T(P \times P', Q \times Q')$, where $P \times P'$ is the distribution product. This operator is commutative and associative. The composition primitive refers to an algorithm $M$ that consists of $n$ algorithms $M_1, \ldots, M_n$, where $M_i$ observes both the input dataset and the output from all previous algorithms. (In this paper, $n$ denotes the number of private algorithms under composition, as opposed to the number of individuals in the dataset. This is to be consistent with the literature on central limit theorems.) In [DRS19], it is shown that if $M_i$ is $f_i$-DP for $1 \le i \le n$, then the composed algorithm $M$ is $f_1 \otimes \cdots \otimes f_n$-DP. The authors further identify a central limit theorem-type phenomenon of the overall privacy loss under composition. Loosely speaking, the privacy guarantee asymptotically converges to GDP in the sense that $f_1 \otimes \cdots \otimes f_n \to G_\mu$ as $n \to \infty$ under certain conditions. The privacy parameter $\mu$ depends on the tradeoff functions $f_1, \ldots, f_n$.
Subsampling.
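Consider the operator $\mathrm{Sample}_p$ that includes each individual in the dataset with probability $p$ independently. Let $M \circ \mathrm{Sample}_p$ denote the algorithm where $M$ is applied to the subsampled dataset. In the subsampling theorem for $f$-DP, [DRS19] proves that if $M$ is $f$-DP, then $M \circ \mathrm{Sample}_p$ is $\tilde f$-DP if $\tilde f \le f_p$ and $\tilde f \le f_p^{-1}$, where $f_p := p f + (1 - p)\,\mathrm{Id}$. As such, we can take $\tilde f = \min\{f_p, f_p^{-1}\}$, which however is not convex in general. This issue can be resolved by using $\min\{f_p, f_p^{-1}\}^{**}$ in place of $\min\{f_p, f_p^{-1}\}$, where $f^{**}$ denotes the double conjugate of $f$. Indeed, [DRS19] shows that the subsampled algorithm is $\min\{f_p, f_p^{-1}\}^{**}$-DP.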
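The subsampling construction can be sketched on a grid: form $f_p$, invert it numerically, take the pointwise minimum, and replace it by its lower convex envelope (which is what the double conjugate computes for a function on an interval). The code below is a hedged illustration under these assumptions; the helper names, the grid size, and the use of a monotone-chain hull are choices made here, not the paper's implementation.

```python
from statistics import NormalDist

_STD = NormalDist()


def gaussian_tradeoff(alpha, mu):
    """G_mu(alpha) = Phi(Phi^{-1}(1 - alpha) - mu), with the endpoints handled."""
    level = 1.0 - alpha
    if level >= 1.0:
        return 1.0
    if level <= 0.0:
        return 0.0
    return _STD.cdf(_STD.inv_cdf(level) - mu)


def subsampled(f, p):
    """f_p = p * f + (1 - p) * Id, where Id(alpha) = 1 - alpha."""
    return lambda a: p * f(a) + (1.0 - p) * (1.0 - a)


def inverse(g, y, tol=1e-12):
    """g^{-1}(y) = inf{t in [0, 1] : g(t) <= y} for a decreasing g, by bisection."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) <= y:
            hi = mid
        else:
            lo = mid
    return hi


def lower_convex_envelope(xs, ys):
    """Vertices of the largest convex function lying below the points (xs increasing)."""
    hull = []
    for pt in zip(xs, ys):
        # Drop the last vertex while it lies on or above the chord to the new point.
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            if (x2 - x1) * (pt[1] - y1) - (y2 - y1) * (pt[0] - x1) <= 0:
                hull.pop()
            else:
                break
        hull.append(pt)
    return hull


def subsample_tradeoff(f, p, n_grid=201):
    """Grid approximation of min{f_p, f_p^{-1}}^{**}, returned as envelope vertices."""
    f_p = subsampled(f, p)
    xs = [i / (n_grid - 1) for i in range(n_grid)]
    ys = [min(f_p(a), inverse(f_p, a)) for a in xs]
    return lower_convex_envelope(xs, ys)


if __name__ == "__main__":
    vertices = subsample_tradeoff(lambda a: gaussian_tradeoff(a, 1.0), p=0.2)
    print(len(vertices), vertices[:3])
```

Linear interpolation between the returned vertices gives a convex, piecewise linear approximation of the subsampled tradeoff function; a finer grid tightens it.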
Noisy SGD.
Let denote the noisy gradient descent update, where is the scale of the Gaussian noise added to the gradient. The noisy SGD update can essentially be represented as . Exploiting the above results for composition and subsampling, [BDLS19] shows that is DP, where . Recognizing that noisy SGD with $n$ iterations is the $n$-fold composition of , the overall privacy lower bound is DP, where . To evaluate the composition bound, [BDLS19] uses a central limit theorem-type result in the asymptotic regime where converges to a positive constant as $n \to \infty$: in this regime, one can show and consequently as well.
3 Edgeworth Approximation
In this section, we introduce the Edgeworth expansion-based approach to computing the privacy bound under composition. The development of this approach builds on top of [DRS19], with two crucial modifications.
Consider the hypothesis testing problem vs . Let denote the distribution , and
denote the probability density functions of
. Correspondingly, we define and in the same way. Letting, the likelihood ratio test statistic is given by
The Neyman–Pearson lemma states that the most powerful test at a given significance level must be a thresholding function of . As a result, the optimal rejection rule would reject if , where is determined by . An equivalent rule is to apply thresholding to the standardized statistic: is rejected if
(2) 
where the threshold is determined by .
In the sequel, for notational simplicity we shall use to denote , though it is a function of . Let be the CDF of when is drawn from . That is, By the Lyapunov CLT, the standardized statistic
converges in distribution to the standard normal random variable. In other words, it holds that
as . Likewise, we write with and get
With this notation in place, one can write the type I error of the rejection rule (2) as
(3) 
The type II error of this test, which is by definition, is given by
In [DRS19], the authors assume that is symmetric and therefore derive the identity . As a consequence, can be written as , where . Taken together, the equations above give rise to . Leveraging this expression of , [DRS19] proves a CLT-type asymptotic convergence result under certain conditions:
(4) 
as , where is the limit of .
Now, we discard the symmetry assumption and just rewrite
(5)  
Plugging Equation 3 into 5, we obtain
(6) 
In the special case is symmetric, the factor is equal to one and we recover the result in [DRS19].
Computing Equation 6 exactly to obtain the composition bound is not trivial. In Section 5.1 we present a numerical method to compute it directly; however, this method is computationally daunting and does not scale to a large number of compositions. The CLT estimator (Equation 4) can be computed quickly, but it can be loose for a small or moderate number of compositions. More importantly, in practice, we observe that the CLT estimator does not handle the composition of asymmetric tradeoff functions well. To address these issues, we propose a two-sided approximation method, where the Edgeworth expansion is applied to both and in Equation 6. Our method leads to a more accurate description of , as justified in Section 5.
3.1 Technical Details
In Equation 6, we need to evaluate and . Our methods for addressing each of them are described below.
Approximate .
Assume . Denote
(7) 
where and
are the mean and variance of
under distribution . Recall is the CDF of , and we can apply the Edgeworth expansion to approximate directly, following the techniques introduced in [Hal13]. It provides a family of series that approximate in terms of the cumulants of , which can be derived from the cumulants of s under distribution s. See Definition 3.1 for the notion of cumulant and Appendix B for how to compute them.
For a random variable $X$, let $K_X(t)$ be the natural logarithm of the moment generating function of $X$: $K_X(t) = \log E[e^{tX}]$. The cumulants of $X$, denoted by $\kappa_r$ for integer $r \ge 1$, are the coefficients in the Taylor expansion of $K_X(t)$ about the origin: $K_X(t) = \sum_{r=1}^{\infty} \kappa_r t^r / r!$.
The key idea of the Edgeworth approximation is to write the characteristic function of the distribution of as the following form:
where is a polynomial with degree , and then truncate the series up to a fixed number of terms. The corresponding CDF approximation is obtained by the inverse Fourier transform of the truncated series. The Edgeworth approximation of degree amounts to truncating the above series up to terms of order .
Let be the th cumulant of under , and . Let . Denoted by (we note that is not guaranteed to be a valid CDF in general; however, it is a numerical approximation to with improved error bounds compared to the CLT approximation), the degree Edgeworth approximation of is given by
(8)  
The term is of order and is of order . See Appendix A for detailed derivations. Our framework covers the CLT approximation introduced in [DRS19]. The CLT approximation is equivalent to the degree Edgeworth approximation, whose approximation error is of order .
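To make the idea concrete, here is a minimal sketch of a degree-2 Edgeworth approximation to the CDF of a standardized random variable, written in the standard textbook form with Hermite polynomials; it is assumed, not confirmed, to coincide with Equation 8. The function names and the sanity check against a sum of exponential random variables (whose exact CDF is a gamma CDF) are illustrative choices, not part of the paper.

```python
from math import exp, factorial
from statistics import NormalDist

_STD = NormalDist()


def edgeworth_cdf(x, k2, k3, k4):
    """Degree-2 Edgeworth approximation to P((T - E[T]) / sqrt(k2) <= x).

    k2, k3, k4 are the second, third and fourth cumulants of T.
    """
    lam3 = k3 / k2 ** 1.5   # standardized third cumulant (skewness)
    lam4 = k4 / k2 ** 2     # standardized fourth cumulant (excess kurtosis)
    he2 = x * x - 1.0
    he3 = x ** 3 - 3.0 * x
    he5 = x ** 5 - 10.0 * x ** 3 + 15.0 * x
    correction = lam3 / 6.0 * he2 + lam4 / 24.0 * he3 + lam3 ** 2 / 72.0 * he5
    return _STD.cdf(x) - _STD.pdf(x) * correction


def gamma_cdf(t, n):
    """Exact CDF of a sum of n i.i.d. Exp(1) variables (integer n)."""
    return 1.0 - exp(-t) * sum(t ** k / factorial(k) for k in range(n))


if __name__ == "__main__":
    n = 10
    # Cumulants add over independent summands; Exp(1) has kappa_r = (r - 1)!.
    k2, k3, k4 = 1.0 * n, 2.0 * n, 6.0 * n
    x = 0.0
    exact = gamma_cdf(n + x * k2 ** 0.5, n)
    print("exact    ", exact)
    print("CLT      ", _STD.cdf(x))
    print("Edgeworth", edgeworth_cdf(x, k2, k3, k4))
```

Setting the third and fourth cumulants to zero reduces the expression to the CLT approximation $\Phi(x)$, which matches the remark that the CLT is the lowest-degree member of the family.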
Approximate .
In Equation 6 we need to compute the . This is the quantile of the distribution of , where is distributed. We consider two approaches to deal with it.
Method (i): First compute the degree Edgeworth approximation of :
(9)  
where and is the th cumulant of under . Next, numerically solve equation .
Method (ii): Apply the closely related Cornish-Fisher expansion [CF38, FC60], an asymptotic expansion used to approximate the quantiles of a probability distribution based on its cumulants, to approximate directly. Let be the quantile of the standard normal distribution. The degree-2 Cornish-Fisher approximation of the quantile of is given by
(10)  
Both approaches have pros and cons. The Cornish-Fisher approximation has a closed-form solution, yet Figure 2 shows that it is unstable at the boundary when the number of compositions is small. For our experiments in Section 5, we use the numerical inverse approach throughout all the runs.
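The two quantile methods can be compared side by side in a short sketch. The Cornish-Fisher formula below is the standard second-order form and is assumed, not confirmed, to match Equation 10; the numerical inverse uses plain bisection on the degree-2 Edgeworth CDF and assumes the approximation crosses the target level once inside the bracket. All function names and the bracket $[-8, 8]$ are hypothetical choices.

```python
from statistics import NormalDist

_STD = NormalDist()


def edgeworth_cdf(x, k2, k3, k4):
    """Degree-2 Edgeworth approximation to the CDF of a standardized variable."""
    lam3, lam4 = k3 / k2 ** 1.5, k4 / k2 ** 2
    correction = (
        lam3 / 6.0 * (x * x - 1.0)
        + lam4 / 24.0 * (x ** 3 - 3.0 * x)
        + lam3 ** 2 / 72.0 * (x ** 5 - 10.0 * x ** 3 + 15.0 * x)
    )
    return _STD.cdf(x) - _STD.pdf(x) * correction


def edgeworth_quantile(alpha, k2, k3, k4, lo=-8.0, hi=8.0, tol=1e-12):
    """Method (i): solve edgeworth_cdf(x) = alpha by bisection on [lo, hi]."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if edgeworth_cdf(mid, k2, k3, k4) < alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def cornish_fisher_quantile(alpha, k2, k3, k4):
    """Method (ii): second-order Cornish-Fisher approximation of the alpha-quantile."""
    lam3, lam4 = k3 / k2 ** 1.5, k4 / k2 ** 2
    z = _STD.inv_cdf(alpha)
    return (
        z
        + lam3 / 6.0 * (z * z - 1.0)
        + lam4 / 24.0 * (z ** 3 - 3.0 * z)
        - lam3 ** 2 / 36.0 * (2.0 * z ** 3 - 5.0 * z)
    )


if __name__ == "__main__":
    # Illustrative cumulants: a sum of 10 i.i.d. Exp(1) variables.
    k2, k3, k4 = 10.0, 20.0, 60.0
    for alpha in (0.1, 0.5, 0.9):
        print(alpha,
              edgeworth_quantile(alpha, k2, k3, k4),
              cornish_fisher_quantile(alpha, k2, k3, k4))
```

The closed form costs a handful of arithmetic operations, while the bisection costs a few dozen evaluations of the Edgeworth CDF; both are independent of the number of compositions once the cumulants are known.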
3.2 Error Analysis
Here we provide an error bound for approximating the overall privacy level using the Edgeworth expansion. For simplicity, we assume that is symmetric and the log-likelihood ratios are i.i.d., with the common distribution having sufficiently light tails for the convergence of the Edgeworth expansion, under both and .
The Edgeworth expansion of degree 2 satisfies both and . Conversely, the inverse satisfies and for that is bounded away from 0 and 1. Making use of these approximation bounds, we get
As a caveat, we remark that the analysis above does not extend the error bound to a type I error that is close to 0 or 1. The result states that the approximation error of using the Edgeworth expansion quickly tends to 0 at the rate of . This error rate can be improved at the expense of a higher order expansion of the Edgeworth series. For comparison, our analysis can be carried over to show that the approximation error is of using CLT.
3.3 Computational Cost
The computational cost of the Edgeworth approximation can be broken down as follows. We first need to compute the cumulants of under and up to a certain order, for . Next, we need to compute . The Cornish-Fisher approximation (Equation 10) costs constant time. If we choose to compute the inverse numerically, we need to evaluate Equation 9 and then solve a one-dimensional root-finding problem. The former has a constant cost, and the latter can be solved very efficiently by various iterative methods, so we can consider its cost constant too. Finally, it costs constant time to evaluate using Equation 8. Therefore, the only cost that might scale with the number of compositions is the computation of cumulants. However, in many applications, including computing the privacy bound for noisy SGD, all the log-likelihood ratios are i.i.d. Under this condition, we only need to compute the cumulants once. The total cost is thus a constant independent of . This is verified by the runtime comparison in Section 5.
4 A Two-Parameter Privacy Interpreter
Let and be two private algorithms that are associated with tradeoff functions and , respectively. The algorithm will be more private than if upper bounds . For the family of Gaussian differentially private algorithms, this property can be reflected by the parameter directly, where a smaller value of manifests a more private algorithm.
Here we provide a two-parameter description for the Edgeworth approximation, through which the privacy guarantee between two different approximations can be directly compared.
Given an approximation , let be its fixed point such that . Let be the parameter of GDP for which admits the same fixed point as : . Such can be computed in closed form:
Let be the area under the curve of .
Two symmetric Edgeworth approximations and (if is asymmetric, we can always symmetrize it by taking ) can be compared in the sense that is more private than if their associated parameters and satisfy the following condition:
The left panel of Figure 3 provides the geometric interpretation of the above parameterization. The Edgeworth approximation , the CLT approximation , and the line intersect at the point .
The right panel compares two Edgeworth approximations and . It is easy to see that in this case upper bounds and thus it is more private than . There are two important properties. First, its intersection with the line is farther from the origin than . Considering the geometric interpretation shown in the left panel, this implies that . Second, the approximation also has a larger area under the curve than , which is essentially .
This parameterization defines a partial order over the set (perfect privacy is attained when the tradeoff function is , whose area under the curve is ). It is also applicable to general tradeoff functions.
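The two parameters can be computed for any tradeoff function supplied as a Python callable. The sketch below is an illustration under stated assumptions: the names `fixed_point`, `matching_gdp_mu` and `area_under_curve` are hypothetical, the fixed point is found by bisection, the area by the trapezoid rule, and the closed form $\mu = -2\Phi^{-1}(\alpha^*)$ is derived here directly from $G_\mu(\alpha^*) = \alpha^*$ rather than copied from the paper.

```python
from statistics import NormalDist

_STD = NormalDist()


def gaussian_tradeoff(alpha, mu):
    """G_mu(alpha) = Phi(Phi^{-1}(1 - alpha) - mu), with the endpoints handled."""
    level = 1.0 - alpha
    if level >= 1.0:
        return 1.0
    if level <= 0.0:
        return 0.0
    return _STD.cdf(_STD.inv_cdf(level) - mu)


def fixed_point(f, tol=1e-12):
    """alpha* in [0, 1] with f(alpha*) = alpha*, for a decreasing tradeoff function f."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) > mid:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def matching_gdp_mu(alpha_star):
    """GDP parameter whose curve passes through (alpha*, alpha*).

    G_mu(a) = a  <=>  Phi^{-1}(1 - a) - mu = Phi^{-1}(a)  <=>  mu = -2 * Phi^{-1}(a).
    """
    return -2.0 * _STD.inv_cdf(alpha_star)


def area_under_curve(f, n=10000):
    """Trapezoid-rule approximation of the integral of f over [0, 1]."""
    h = 1.0 / n
    total = 0.5 * (f(0.0) + f(1.0)) + sum(f(i * h) for i in range(1, n))
    return total * h


if __name__ == "__main__":
    for mu in (0.5, 1.0):
        f = lambda a, m=mu: gaussian_tradeoff(a, m)
        print(mu, matching_gdp_mu(fixed_point(f)), area_under_curve(f))
```

Applied to an exact GDP curve, the first parameter returns the original $\mu$; for two curves where one dominates the other, the dominating curve has the smaller first parameter and the larger area, which is the situation described for the right panel of Figure 3.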
5 Experiments
In this section, we present numerical experiments to compare the Edgeworth approximation and the CLT approximation. Before we proceed, we introduce a numerical method to directly compute the true composition bound in Section 5.1. This method is not scalable and hence merely serves as a reference for our comparison. We use the Edgeworth approximation of degree for all the experiments. In the sequel, we refer to those methods as Edgeworth, CLT, and Numerical, respectively. All the methods are implemented in Python and all the experiments are carried out on a MacBook with a 2.5GHz processor and 16GB of memory.
5.1 A Numerical Method
Consider the problem of computing numerically. We know that we can find such that and . However, computing directly involves high-dimensional testing, which can be challenging. We show this difficulty can be avoided by going from the primal representation to the dual representation. Let be the dual representation associated with . The method contains 3 steps to obtain for .
Next, we explain how to compute from . First, we need a lemma that relates with .
[Lemma] Suppose and have densities with respect to a dominating measure . Then the dual representation satisfies
Suppose we are given . Let and be the dual representations of and respectively. The following lemma shows how to evaluate from and . To simplify notation, we assume are distributions on the real line and have corresponding densities for with respect to Lebesgue measure. Generalization to an abstract measurable space is straightforward.
[Lemma] Let . Then
In particular, it yields a recursive formula to compute when . Again we assume has densities on the real line. Let . We have
We remark here that if is asymmetric, then the dual involves negative , which is why the conversion to involves the whole real line. The proof of the above lemmas is deferred to Appendix D.
In practice, it is more efficient to store in the memory than to perform the computation on the fly, so we need to discretize the domain and store the function value on this grid. Consider an abstract grid , the recursion goes as follows:
where
. This rounding step can be replaced by an interpolation as well.
The major challenge in making this numerical method practical for computing composition product of tradeoff functions is that it is slow in computation as it involves numerical integrations.
5.2 A Moderate Number of Compositions
Section 3 shows that the approximation error of Edgeworth is , and for CLT the error is . We thus expect that, for small or moderate values of , Edgeworth will outperform CLT by a non-negligible margin. To verify this, we investigate their performance on a toy problem for testing order compositions of Laplace distributions (the density function of is ): vs .
We let the number of compositions vary from to . Since the privacy guarantee decays as increases and the resulting curves would be very close to the axes, we set for the sake of better visibility. Figure 4 plots the estimated tradeoff functions for 4 representative cases . For each of the methods, we also compute the associated DP (see Proposition 2) and plot as a function of in Figure 5. From both the primal and dual views, Edgeworth coincides better with Numerical in all the cases. When the number of compositions is 10, even though the difference between Edgeworth and CLT is small in the primal visualization (Figure 4), the presentation still clearly distinguishes them. In addition, due to the heavy tail of the Laplace distribution, we shall have for (see Definition 2 for the exact form of ). Therefore, the ground truth has the property that for . Figure 5 shows that Edgeworth also outperforms CLT for predicting this changing point.
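Table 1: Runtime in seconds of the three methods; columns correspond to the numbers of compositions considered, in increasing order.
Method      Runtime (seconds)
CLT         0.0004   0.0004   0.0004   0.0004   0.0005
Edgeworth   0.2347   0.2341   0.2391   0.2222   0.2234
Numerical   3.6834   7.3361   12.055   16.729   21.3575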
Table 1 reports the runtime of the above experiment. CLT takes constant computing time at the scale of 1e-4 seconds. Due to the homogeneity of the component distributions under composition, the runtime of Edgeworth is also independent of the number of compositions, at the scale of 0.1 second. Numerical is computationally heavy: its runtime is much larger and grows linearly with the number of compositions.
5.3 Privacy Guarantees for Noisy SGD
We inspect the performance of Edgeworth and CLT on estimating the privacy guarantee for step noisy SGD. As introduced in Section 2, the privacy bound is of the form where , and the CLT estimation is with . For Edgeworth, note that is the tradeoff function of testing the standard normal distribution versus the mixture model (see Appendix C for the proof). It follows that