Integral Privacy for Density Estimation with Approximation Guarantees

06/13/2018 · Hisham Husain et al.

Density estimation is an old and central problem in statistics and machine learning. There exist only a few approaches that cast this problem in a differential privacy framework and, to our knowledge, while all provide proofs of privacy, little is known about how well the private density learned approximates the unknown one. In this paper, we exploit the tools of boosting to show that, provided we have access to a weak learner in the original boosting sense, there exists a way to learn a private density out of classifiers which guarantees an approximation of the true density that degrades gracefully as the privacy budget ε decreases. There are three key formal features of our results: (i) our approximation bound is, as we show, near optimal for the technique at hand; (ii) the privacy guarantee holds even when we remove the famed adjacency condition on inputs in differential privacy, leading to a stronger privacy guarantee we refer to as integral privacy; and (iii) we provide, for the first time, approximation guarantees for the capture of fat regions of the density, a problem which is receiving a lot of attention in the generative adversarial networks literature through the mode capture problem. Experimental results against a state of the art implementation of private kernel density estimation show that our technique consistently obtains improved results, in particular managing to match its outputs with a privacy budget ε that is orders of magnitude smaller.

1 Introduction

Over the past decade, (ε-)differential privacy (DP) has become the leading statistical protection model for individuals Dwork and Roth (2014), as it guarantees plausible deniability regarding the presence of an individual in the input of a mechanism from the observation of its output.

DP has however a limitation inherent to its formulation regarding group protection: what if we wish to extend the guarantee to subsets of the input, not just individuals? This is particularly relevant to medical / pharmaceutical applications where privacy needs to be enforced at the subpopulation level (people affected by particular conditions, purchasing specific drugs, etc.; Palanisamy et al. (2017)).

When the group size is bounded, it is a simple textbook matter to extend the privacy guarantee by scaling the privacy budget by the maximum size, see e.g. Gaboardi (2016, Proposition 1.13): to protect groups of size at most k, we just divide ε by k and obtain the same guarantee for groups as ε-DP gives for individuals. One might object that this is not very efficient at retaining information, as standard randomized mechanisms then get their variance scaled accordingly (Dwork and Roth, 2014, Section 3), but this can be reasonable for small group sizes Palanisamy et al. (2017). When the maximum group size is unknown, this approach no longer works: any non-trivial privacy requirement results in a vanishing privacy budget, and therefore in a variance that diverges with the input size. This setting is however arguably the most interesting one for group protection since, in the worst case, group sizes may simply not be available. One might wonder whether such a strong privacy regime, to which we refer as integral privacy, is actually achievable, as it requires, in the worst case, extending plausible deniability to the whole input itself, regardless of its size. Classical DP comes with some privacy-dependent loss of information, so we may also ask what information loss must be paid to lift DP to the level of integral privacy.
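As an illustration of the textbook scaling above, here is a minimal sketch (ours, not part of the paper's method) of a Laplace mechanism whose budget is divided by an assumed maximum group size k; the function names and numeric values are illustrative.

```python
import numpy as np

def laplace_mechanism(value, sensitivity, epsilon, rng=None):
    """Release `value` with Laplace noise of scale sensitivity / epsilon."""
    rng = rng or np.random.default_rng()
    return value + rng.laplace(scale=sensitivity / epsilon)

def group_private_release(value, sensitivity, epsilon, max_group_size):
    """Protect groups of size <= max_group_size by scaling the budget down."""
    return laplace_mechanism(value, sensitivity, epsilon / max_group_size)

# As max_group_size grows, the noise scale grows linearly (its variance quadratically),
# which is why this route breaks down when the group size is unbounded.
print(group_private_release(value=10.0, sensitivity=1.0, epsilon=1.0, max_group_size=5))
```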

In this paper, we propose a general mechanism to achieve such a non-trivial privacy regime, with guaranteed approximation bounds, for the problem of sampling. Sampling at large is an old and major problem in machine learning and statistics (Fix and Hodges, 1951; Hastings, 1970; Metropolis et al., 1953; Silverman, 1986). Its private counterpart is more recent (Bernstein and Sheldon, 2018; Geumlek et al., 2017; Machanavajjhala et al., 2008; Rubin, 1993; Wang et al., 2015), but the urgency of tackling the problem is clear, with a regularly expanding list of data privacy breaches in medical / pharmaceutical systems Lord (2018); Rubinstein et al. (2016); Farell (2017).

The task of sampling in a private setting ideally entails two objectives: being accurate and being private. Without getting into technicalities yet, suppose we have access to a dataset S of i.i.d. samples from a fixed but unknown target density P whose support is denoted X. Let A denote any algorithm taking S as input and returning an observation by sampling a density Q_S, which therefore depends on S (Q_S can be implicit, as for GANs). The first objective is that Q_S be accurate with respect to P: their divergence, for any suitable notion of divergence, should be bounded by a small approximation error. The second objective is that sampling from Q_S must never disclose too much information about the input dataset used to learn it: in classical ε-DP,

$$\Pr[\mathcal{A}(S) \in E] \;\le\; e^{\epsilon} \cdot \Pr[\mathcal{A}(S') \in E], \qquad \forall E,\ \forall S \approx S'. \qquad (1)$$

Here, ε is the DP budget (Dwork and Roth, 2014) and S ≈ S' means that the two samples differ in one observation only.

Our contribution can be split in three parts:

We introduce a simple trick that allows private sampling from not necessarily private densities without using the randomization toolkit of DP (Dwork and Roth, 2014). This trick considers subsets of densities with prescribed range, which we call mollifiers (the name bears superficial similarities with functional mollifiers (Gilbarg and Trudinger, 2001, Section 7.2)), and which makes their sampling private. Related tricks, albeit significantly more constrained and/or tailored to weaker models of privacy, have recently been used in the context of private Bayesian inference, sampling and regression (Dimitrakakis et al., 2014, Section 3), (Mir, 2013, Chapter 5), (Wang et al., 2015, Theorem 1), (Wasserman and Zhou, 2010, Section 4.1). A key benefit is that privacy holds even when we remove the adjacency condition of (1), and therefore the same-size constraint as well, resulting in a privacy guarantee stronger than classical (central) or local DP, which we call integral privacy.

Within this set of mollifier densities, we show how to learn a density with guaranteed approximation with respect to the target, and introduce a computationally efficient algorithm for this task, MBDE. To get guaranteed approximation of the target, we use a celebrated theory of statistical learning: boosting (Schapire and Freund, 2012). Boosting was previously used once in the context of differential privacy, to privatize database queries (Dwork et al., 2010). Compared to that work, we use the ingredients of boosting as they were initially designed: a weak learner outputting classifiers different from random guessing. Hence, we bring density estimation in contact with supervised classification, and our technique learns a density out of classifiers. Those classifiers are used to craft the sufficient statistics of an exponential family; hence, with the successes of deep learning, those sufficient statistics can represent extremely complex mappings, resulting in virtually arbitrarily diverse mollifier densities.

We provide experimental results for MBDE against a state of the art approach for sampling from private KDE (Aldà and Rubinstein, 2017), which display the competitiveness and convenience of MBDE, in particular when ε is small, making our approach a good fit for privacy-demanding applications. We also performed comparisons with a private generative approach (Xie et al., 2018), which MBDE very significantly outperforms; Figure 1 provides a snapshot of experimental results comparing the various approaches.

[Figure 1 panels: target vs. learned density for our method, private KDE, and DPGAN.]
Figure 1: Our method vs. private KDE (Aldà and Rubinstein, 2017) and DPGAN (Xie et al., 2018) on a ring Gaussian mixture (see Section 5). Note the values of ε (chosen so that the densities look alike) and the fact that the GAN is subject to mode collapse.

The rest of this paper is organized as follows. Section 2 presents related work. Section 3 introduces key definitions and basic results. Section 4 introduces our algorithm, MBDE, and states its key privacy and approximation properties. Section 5 presents experiments, and the last two sections respectively discuss and conclude the paper. Proofs and additional experiments are postponed to an appendix.

2 Related work

Figure 2 summarizes the various approaches for sampling when the input is an i.i.d. sample from a target distribution (notations from the Introduction appear in the Figure), and the four essential locations where differential-privacy-style protection can be enforced. One highly desirable feature of differential privacy (DP) is that it survives post-processing. Hence, if we protect the input sample upstream for differential privacy, directly as shown in Figure 2, then all further steps will be private as well and there will be no restriction on the algorithms used to come up with the learned density. A significant advantage of this process is that standard tools from differential privacy can be used directly (Dwork and Roth, 2014, Section 3). The problem, however, is that standard tools operate by noisifying the signal, and the more upstream the noise is, the less tailored to the task at hand it can be and, ultimately, the worse it should be (accuracy-wise) for the downstream output. A second possibility is to learn a private implicit generative model instead, like a private GAN (Xie et al., 2018). This makes sampling easy, but there is a significant downside: noisification for privacy makes GAN training much trickier (Triastcyn and Faltings, 2018), granted that even without noise, training GANs already faces significant challenges (like mode collapse, see Figure 1). One can instead opt for general private Markov Chain Monte Carlo sampling (Wang et al., 2015), but convergence is in weak asymptotic form and privacy typically biases likelihoods and priors.

Somewhat closer to ours is a third set of techniques that directly learn a private density. A broad literature was developed early for discrete distributions (Machanavajjhala et al., 2008) (and references therein). For a general, not necessarily discrete, target, more sophisticated approaches have been tried, most of which exploit randomisation and the basic toolbox of differential privacy (Dwork and Roth, 2014, Section 3): given a non-private estimate, one computes the sensitivity of the approach, then uses a standard mechanism to compute a private one. If the mechanism delivers ε-DP, like the Laplace mechanism (Dwork and Roth, 2014), then we get an ε-DP density. Such general approaches have been used when the non-private estimator is the popular kernel density estimation (KDE, (Givens and Hoeting, 2013)), with variants (Aldà and Rubinstein, 2017; Hall et al., 2013; Rubinstein and Aldà, 2017).

Figure 2: Different approaches that allow (differentially) private sampling: (1) protecting the input dataset sampled from the true distribution; (2) learning a private sampler (e.g. Monte Carlo) or generative model (e.g. GAN) from it; (3) learning a private density from it and then sampling from that density; (4) learning a density whose sampling is private (but whose release may not be). The grey area locates our contribution.

A convenient way to fit a private density is to approximate it in a specific function space, be it Sobolev (Duchi et al., 2013a; Hall et al., 2013; Wasserman and Zhou, 2010), Bernstein polynomials (Aldà and Rubinstein, 2017) or Chebyshev polynomials (Thaler et al., 2012), and then compute the coefficients in a differentially private way. This approach suffers several drawbacks. First, the sensitivity depends on the quality of the approximation: improving it can blow up sensitivity exponentially (Aldà and Rubinstein, 2017; Rubinstein and Aldà, 2017), which translates into a significantly larger amount of noise. Second, one always pays the price of the underlying function space's assumptions, even if limited to smoothness (Duchi et al., 2013a, b; Hall et al., 2013; Wainwright, 2014; Wasserman and Zhou, 2010), continuity or boundedness (Aldà and Rubinstein, 2017; Duchi et al., 2013a, b; Thaler et al., 2012).

We note that we have framed the general approach to private density estimation in ε-DP. While the state of the art we consider investigates privacy models that are closely related, not all are stated in (ε-)differential privacy. Some models opt for a more local (or "on device", because the sample size is one) form of differential privacy (Differential privacy team, Apple, 2017; Duchi et al., 2013a, b; Wainwright, 2014), others for relaxed forms of differential privacy (Hall et al., 2013; Rubinstein and Aldà, 2017). Finally, while all previous techniques formally investigate privacy (eq. (1)), the quality of the approximation of the target by the private density is much less investigated. The state of the art investigates expected-risk criteria in which the expectation involves all relevant randomizations, including the sampling of the input, the mechanism, etc. (Duchi et al., 2013a, b; Wainwright, 2014; Wasserman and Zhou, 2010); minimax rates are also known (Duchi et al., 2013a, b; Wainwright, 2014). Pointwise approximation bounds are available (Aldà and Rubinstein, 2017) but require substantial assumptions on the target density or on the sensitivity for the approach to remain tractable.
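To make the generic recipe above concrete, here is a minimal sketch (ours, not the Bernstein mechanism of Aldà and Rubinstein (2017)): fit coefficients in a simple basis, then add Laplace noise calibrated to their sensitivity. A histogram stands in as the simplest basis, and the sensitivity bound used is the standard one for relative-frequency histograms.

```python
import numpy as np

def fit_histogram_density(samples, bins, data_range):
    """Non-private density estimate: relative frequencies on a fixed grid."""
    counts, edges = np.histogram(samples, bins=bins, range=data_range)
    return counts / len(samples), edges

def privatize_coefficients(coeffs, n, epsilon, rng=None):
    """Laplace mechanism on the coefficient vector.

    For relative frequencies over a fixed grid, changing one of n records moves
    two cells by 1/n each, so the L1 sensitivity is 2/n (standard bound)."""
    rng = rng or np.random.default_rng(0)
    sensitivity = 2.0 / n
    noisy = coeffs + rng.laplace(scale=sensitivity / epsilon, size=coeffs.shape)
    noisy = np.clip(noisy, 0.0, None)   # post-processing: keep it a (sub)measure
    return noisy / noisy.sum()          # renormalize to a density estimate

samples = np.random.default_rng(0).normal(size=1000)
coeffs, edges = fit_histogram_density(samples, bins=30, data_range=(-4, 4))
private_coeffs = privatize_coefficients(coeffs, n=len(samples), epsilon=0.5)
```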

3 Basic definitions and results

Basic definitions: let X be a set (typically, a subset of a Euclidean space) and let P be the target density. Without loss of generality, all distributions considered have the same support, X. We are given a dataset S of i.i.d. observations from P. Part of our goal is to learn, and then sample from, a distribution Q such that KL(P‖Q) is small, where KL denotes the Kullback-Leibler divergence:

$$\mathrm{KL}(P \,\|\, Q) \;\doteq\; \int_{\mathcal{X}} \log\!\left(\frac{\mathrm{d}P}{\mathrm{d}Q}\right) \mathrm{d}P \qquad (2)$$

(we assume for the sake of simplicity the same base measure for all densities). We pick the KL divergence for its popularity and because it is the canonical divergence for broad sets of distributions (Amari and Nagaoka, 2000).

Boosting: in supervised learning, a classifier is a real-valued function over X used to predict a class, and we assume its output is bounded. This is not a restrictive assumption: much other work in the boosting literature makes the same boundedness assumption (Schapire and Singer, 1999). We now present the cornerstone of boosting, the weak learning assumption. It involves a weak learner, an oracle that takes two distributions as input and is required to always return a classifier that weakly distinguishes samples of the first from samples of the second. [WLA] Fix two constants. We say that the weak learner satisfies the weak learning assumption (WLA) for these constants iff, for any pair of input distributions, it returns a classifier c satisfying

(3)

that is, a classifier whose advantage at telling the two inputs apart is at least as large as prescribed by the constants. Remark that as the two inputs become "closer" to one another in some sense, the WLA gets harder to satisfy. However, this is not a problem, since whenever this happens we shall have successfully learned the target through the current density. The classical theory of boosting would just assume one constraint over a distribution whose marginals over classes would be the two inputs (Kearns, 1988), but our definition can in fact easily be shown to coincide with that of boosting. A boosting algorithm is an algorithm which only has access to a weak learner and, through repeated calls, typically combines the weak classifiers it receives into a combination significantly more accurate than its parts. The trick of boosting that we shall employ, as exemplified in Schapire (1990), is to train the weak learner on inputs that carefully change throughout the course of learning, so that the weak learner's outputs are indeed weak with respect to their respective tweaked inputs while the global combination gets stronger on the boosting algorithm's original input.
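To make the WLA concrete, here is a small sketch (ours, not the paper's code) that estimates a classifier's advantage at distinguishing samples from two distributions; taking the advantage to be the difference of expected scores is one common formalization, and the threshold used by the WLA is not reproduced here.

```python
import numpy as np

def empirical_edge(classifier, samples_p, samples_q):
    """Estimate how well `classifier` separates P from Q.

    `classifier` maps a batch of points to bounded real scores; positive scores
    should be more likely under P than under Q. The WLA asks the weak learner to
    always achieve an edge above some fixed positive constant."""
    return float(np.mean(classifier(samples_p)) - np.mean(classifier(samples_q)))

# Toy check: a tanh score on the first coordinate separates two shifted Gaussians.
rng = np.random.default_rng(0)
samples_p = rng.normal(loc=+1.0, size=(1000, 2))
samples_q = rng.normal(loc=-1.0, size=(1000, 2))
edge = empirical_edge(lambda x: np.tanh(x[:, 0]), samples_p, samples_q)
print(f"empirical edge: {edge:.3f}")   # well above 0: weak learning holds here
```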
Differential privacy, integral privacy: we introduce a user-defined parameter ε > 0, which represents a privacy budget; the smaller it is, the stronger the privacy demand. Hereafter, S and S' denote input datasets over X, and S ≈ S' denotes the predicate that S and S' differ by one observation. Let A denote a randomized algorithm that takes datasets as input and outputs samples. For any fixed ε, A is said to meet ε-differential privacy (DP) iff it always holds that

$$\Pr[\mathcal{A}(S) \in E] \;\le\; e^{\epsilon} \cdot \Pr[\mathcal{A}(S') \in E], \qquad \forall E,\ \forall S \approx S'. \qquad (4)$$

A meets ε-integral privacy (IP) iff (4) holds when alleviating the constraint S ≈ S' to arbitrary pairs S, S'. Note that integral privacy is a significantly stronger notion of privacy: by removing the constraint, we implicitly remove the requirement that S and S' be neighbors or even have the same size.

Mollifiers. We now introduce a property of sets of densities that shall be crucial for privacy. Let M be a set of densities with the same support. M is said to be an ε-mollifier, for some ε ≥ 0, iff

$$\mu(x) \;\le\; e^{\epsilon} \cdot \mu'(x), \qquad \forall \mu, \mu' \in \mathcal{M},\ \forall x \in \mathcal{X}. \qquad (5)$$

Before stating how we can simply transform any set of densities with finite range into an ε-mollifier, let us show why such sets are important for integrally private sampling. Lemma. Suppose there exist ε and an ε-mollifier M such that A only samples densities within M. Then A is ε-integrally private. (Proof in the appendix, Section 8.1.) Notice that we do not need to require that S and S' be sampled from the same density P. This trick, which essentially provides "privacy for free" by letting sampling itself carry out the randomization needed for privacy, is not new: a similar, more specific trick was designed for Bayesian learning in Wang et al. (2015), and in fact the first statement of (Wang et al., 2015, Theorem 1) implements in disguise a specific ε-mollifier related to one we use (see below). We now show examples of mollifiers and properties they can bear.
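As a quick numerical illustration (ours), the ε-mollifier condition can be checked on a grid by bounding the worst-case log-ratio between any two densities in the set at every point; the example densities and the tolerance are illustrative.

```python
import numpy as np

def is_mollifier(densities, grid, epsilon):
    """Check the epsilon-mollifier condition numerically on `grid`.

    `densities` is a list of callables, each a normalized pdf on the same support.
    The set is an epsilon-mollifier iff, at every point, the log-ratio between
    the largest and smallest value across the set is at most epsilon."""
    logs = np.log(np.stack([np.clip(d(grid), 1e-300, None) for d in densities]))
    worst_log_ratio = logs.max(axis=0) - logs.min(axis=0)   # pointwise over the grid
    return float(worst_log_ratio.max()) <= epsilon

# Two densities on [0, 1] that stay within a factor e^{0.5} of each other pointwise.
grid = np.linspace(0.0, 1.0, 1001)
flat = lambda x: np.ones_like(x)
tilted_unnorm = lambda x: np.exp(0.4 * (x - 0.5))
z = tilted_unnorm(grid).mean()          # ~ integral over the unit-length support
tilted = lambda x: tilted_unnorm(x) / z
print(is_mollifier([flat, tilted], grid, epsilon=0.5))   # expected: True
```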

Figure 3: Left: examples of mollifiers for two values of ε (red curves and blue curves). For the latter case, we also indicate in light blue the necessary range of values to satisfy (5), and in dark blue a sufficient range that allows (5) to be satisfied. Right: schematic depiction of how one can transform any set of finite densities into an ε-mollifier without losing the modes and while keeping derivatives up to a positive constant scaling.

Our examples are featured in the simple case where the support of the mollifier is an interval and the densities have finite range and are continuous: see Figure 3 (left). The two ranges indicated depict necessary and sufficient conditions on the overall range of a set of densities for it to be a mollifier. For the necessary part, we note that any continuous density must take, somewhere, the value of the uniform density on its support (otherwise its total mass cannot be one), so if it belongs to an ε-mollifier, its maximal value cannot exceed, nor its minimal value fall below, thresholds determined by ε. We end up with the range in light blue, within which any ε-mollifier has to fit. For the sufficiency part, we indicate in dark blue a possible range of values that gives a sufficient condition on the range of all elements of a set for that set to be an ε-mollifier (the ratio between the endpoints of the dark blue range is indeed at most e^ε). Let us denote more formally the set of densities whose range fits in this sufficient interval as M_ε.

Notice that as ε → 0, any ε-mollifier converges to a singleton. In particular, all elements of M_ε converge in distribution to the uniform distribution, which would also happen when sampling with standard mechanisms of differential privacy (Dwork and Roth, 2014), so we do not lose qualitatively in terms of privacy. However, because there is no constraint beyond the range constraint for membership in M_ε, this freedom is going to be instrumental in getting guaranteed approximations of the target via boosting theory. Figure 3 (right) also shows how a simple scale-and-shift procedure fits any finite density into M_ε while keeping some of its key properties: "mollifying" a finite density in this way does not change its modes, which is an important property for sampling, and only scales its gradients by a positive constant, which is an important property for learning and optimization.
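The following sketch (ours) illustrates one way to realize the scale-and-shift idea of Figure 3 (right): mix a density with the uniform density until its values fit the sufficient range; the range endpoints exp(±ε/2) and the grid discretization are assumptions, and the exact transform used in the paper may differ.

```python
import numpy as np

def mollify(q_grid, epsilon):
    """Affinely mix a density with the uniform density so its values fall in
    [exp(-eps/2), exp(+eps/2)].

    Assumes `q_grid` holds a normalized density's values on a uniform grid over a
    unit-length support. The map q -> a*q + (1-a) is affine and increasing, so it
    keeps normalization, preserves the modes, and scales gradients by a > 0."""
    lo, hi = np.exp(-epsilon / 2.0), np.exp(epsilon / 2.0)
    a = min((hi - 1.0) / max(q_grid.max() - 1.0, 1e-12),
            (1.0 - lo) / max(1.0 - q_grid.min(), 1e-12),
            1.0)
    return a * q_grid + (1.0 - a)

grid = np.linspace(0.0, 1.0, 1001)
q = np.exp(-0.5 * ((grid - 0.3) / 0.05) ** 2)
q = q / q.mean()                       # normalized density on [0, 1] (grid mean ~ integral)
print(mollify(q, epsilon=1.0).max())   # <= exp(0.5) ~ 1.65
```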

4 Mollifier density estimation with approximation guarantees

We now present our approach, as depicted in Figure 2. The cornerstone is an algorithm that (i) learns an explicit density in an ε-mollifier and (ii) does so with approximation guarantees with respect to the target P. Recall from Lemma 3 that sampling from the output of such an algorithm is ε-integrally private because of (i), so we can focus on showing (i)+(ii). This algorithm, MBDE, for Mollified Boosted Density Estimation, is depicted below (Algorithm 1).

1:  for t = 1, 2, …, T do
2:     c_t ← WeakLearner(samples from P, samples from q_{t-1})
3:     set the step size θ_t prescribed by the analysis (it shrinks with the privacy budget ε)
4:     q_t(x) ∝ q_{t-1}(x) · exp(θ_t · c_t(x))   (normalized via the log-normalizer in (7))
5:  end for
Output: q_T
Algorithm 1 MBDE (Mollified Boosted Density Estimation)

It uses a weak learner whose objective is to distinguish between the target P and the current guessed density q_t (the index t indicates the iterative nature of the algorithm). q_t is progressively refined using the weak learner's output classifier c_t, for a total number of user-fixed iterations T. We start boosting by setting q_0 as the starting distribution, typically a simple, non-informed (to be private) distribution such as a standard Gaussian. The classifier c_t is then aggregated into q_t in the following way:

$$q_t(x) \;\propto\; q_{t-1}(x) \cdot \exp\!\left(\theta_t \, c_t(x)\right), \qquad (6)$$

where θ_t is a step size scaled by the privacy budget (from now on, c denotes the vector of all classifiers) and the normalization is carried out via the log-normalizer

$$Z_t \;\doteq\; \log \int_{\mathcal{X}} q_{t-1}(x)\, \exp\!\left(\theta_t \, c_t(x)\right) \mathrm{d}x. \qquad (7)$$

This process repeats until t = T, and the proposed distribution is q_T. It is not hard to see that q_T is an exponential family with natural parameter the vector of step sizes, sufficient statistics the classifiers, and base measure q_0 (Amari and Nagaoka, 2000). We now show three formal results on MBDE.
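To fix ideas, here is a minimal 1D sketch (ours, not the paper's implementation) of the boosted, mollified update in (6)-(7): a bounded classifier tilts the current density, which is then renormalized on a grid. The step size theta = epsilon / (2 * num_iters), the clipping of classifier outputs to [-1, 1], and the standard-Gaussian q_0 are placeholder assumptions; together they are what keep the learned log-density within a bounded band of log q_0, the mechanism behind the mollifier membership claimed below.

```python
import numpy as np

def mbde_sketch(target_samples, grid, epsilon, num_iters, train_weak_learner, seed=0):
    """1D grid sketch of mollified boosted density estimation.

    `train_weak_learner(pos, neg)` must return a bounded callable scoring points
    that look drawn from `pos` higher than points from `neg`."""
    rng = np.random.default_rng(seed)
    dx = grid[1] - grid[0]
    logq = -0.5 * grid ** 2 - 0.5 * np.log(2.0 * np.pi)      # log q_0: standard Gaussian
    theta = epsilon / (2.0 * num_iters)                      # assumed step size
    for _ in range(num_iters):
        # Normalize the current density on the grid and sample from it.
        p = np.exp(logq - logq.max()); p /= p.sum()
        model_samples = rng.choice(grid, size=len(target_samples), p=p)
        # Weak learner: scores target-like points higher than model-like ones.
        c = train_weak_learner(target_samples, model_samples)
        # Exponential-family tilt (eq. 6) followed by renormalization (eq. 7).
        logq = logq + theta * np.clip(c(grid), -1.0, 1.0)
        log_norm = logq.max() + np.log((np.exp(logq - logq.max()) * dx).sum())
        logq = logq - log_norm
    return logq                                              # log-density values on `grid`
```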
Sampling from q_T is ε-integrally private. Recall that M_ε is the set of densities whose range lies in the sufficient interval of Section 3. We now show that the output of MBDE is in M_ε, which guarantees ε-integral privacy of its sampling (Lemma 3). Theorem. The output q_T of MBDE belongs to M_ε. (Proof in the appendix, Section 8.2.) We observe that privacy comes at a price: the step sizes shrink with ε, so as we become more private, the updates on q_t become less and less significant and the learned density gets flattened. As already underlined in Section 3, such a phenomenon is not a particularity of our method, as it would also be observed for standard mechanisms of differential privacy (Dwork and Roth, 2014).
MBDE approximates the target distribution in the boosting framework. As explained in Section 3, it is not hard to fit a density into M_ε so as to make its sampling private. The important question is what approximation guarantees we can still obtain with respect to P, given that P may not be in M_ε. We now give such guarantees for MBDE in the boosting framework, and we also show that the approximation is within close order of the best possible given the constraint of fitting in M_ε. We start with the former result and, for this objective, include the iteration index in the notations of Definition 3, since the actual weak learning guarantees may differ between iterations even when they remain within the prescribed bounds. Theorem. For any t, suppose WeakLearner satisfies the WLA at iteration t. Then we have:

(8)

where:

(9)

(Proof in the appendix, Section 8.3.) Remark that in the high boosting regime, the drop term is guaranteed to be positive, so the bound on the KL divergence is guaranteed to decrease. This is a regime we are more likely to encounter during the first boosting iterations, since P and q_t are then easier to tell apart, so we can expect a larger weak-learning advantage. In the low boosting regime the picture can be different, since an extra condition is needed for the bound not to be vacuous. Since that condition relaxes exponentially fast, this constraint is somewhat minor and we can also expect the bound on the KL divergence to decrease in the low boosting regime.

We now check that the guarantees we get are close to the best possible in an information-theoretic sense given the two constraints: (i) the learned density is an exponential family as in (6), and (ii) it belongs to M_ε. Let us define the set of such densities, for a fixed q_0, and consider the best achievable KL divergence within it. Intuitively, the farther q_0 is from P, the farther the best element of this set can remain from P, and so the larger this limit should be. Notice that this would typically imply being in the high boosting regime for MBDE. For the sake of simplicity, we consider the WLA constants to be the same throughout all iterations. Theorem. We have

(10)

and if MBDE stays in the high boosting regime, then

(11)

(Proof in the appendix, Section 8.4.) Hence, in the limit considered in the theorem, MBDE indeed reaches the information-theoretic limit in the high boosting regime.
MBDE and the capture of modes of P. Mode capture is a prominent problem in the area of generative models (Tolstikhin et al., 2017). We have already seen that mollification can be enforced while keeping modes, but we would like to show that MBDE is indeed efficient at building a density with guarantees on mode capture. For this objective, we define, for any region of the support and any density, two quantities:

(12)

namely the total mass of P on the region and the KL divergence between P and the density restricted to the region. Theorem. Suppose MBDE stays in the high boosting regime. Then, for any region, if

(13)

then the mass that q_T puts on the region is lower-bounded accordingly. (Proof in the appendix, Section 8.5.) There is not much we can do to control the restricted KL divergence term, as it quantifies our luck in picking q_0 to approximate P on the region; but if this restricted KL divergence is small compared to the mass of P on the region, then we are guaranteed to capture a substantial part of it through q_T. As a mode, in particular a "fat" one, would tend to have large mass over its region, Theorem 4 says that we can indeed hope to capture a significant part of it as long as we stay in the high boosting regime. As the number of iterations grows, the condition in (13) vanishes with ε and we end up capturing any fat region (and therefore any mode, assuming modes represent "fatter" regions) whose mass is sufficiently large with respect to the privacy budget.

To finish up this section, recall that a mollifier related to M_ε is also defined (in disguise) and analyzed in (Wang et al., 2015, Theorem 1) for posterior sampling. However, the convergence analysis (Wang et al., 2015, Section 3) does not dig into specific forms for the likelihoods of the densities chosen; as a result, and as an eventual price to pay, it remains essentially in weak asymptotic form, and it is furthermore later applied in a weaker, relaxed model of differential privacy. We exhibit particular choices for these mollifier densities, along with a specific training algorithm to learn them, which allow for significantly better approximation, quantitatively and qualitatively (mode capture), without even relaxing privacy.

5 Experiments

Figure 4: Gaussian ring: densities obtained for DPB (upper row) against MBDE (lower row).

[Figure 5 panels: NLL and mode coverage as functions of ε, for the Gaussian ring and the 1D non-random Gaussian domains.]
Figure 5: Metrics for MBDE (blue): NLL (lower is better) and mode coverage (higher is better). Orange: DPB (see text).

Architectures (of MBDE, private KDE and private GANs): we carried out experiments in a simulated setting inspired by Aldà and Rubinstein (2017), to compare MBDE (implemented following its description in Section 4) against differentially private KDE (Aldà and Rubinstein, 2017). To learn the sufficient statistics of MBDE, we fit at each boosting iteration a neural network (NN) classifier:

(14)

with the architecture depending on the experiment. At each iteration of boosting, the classifier is trained on samples from the target and from the current model using Nesterov's accelerated gradient descent on the cross-entropy loss. Random-walk Metropolis-Hastings is used to sample from the current model at each iteration. For the number of boosting iterations of MBDE, we pick a quite small value, but given the rate of decay of the step sizes and the small dimensionality of the domain, we found it a good compromise between complexity and accuracy. Finally, we pick for q_0 a standard Gaussian (zero mean, identity covariance).
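For concreteness, a minimal random-walk Metropolis-Hastings sampler (ours, not the paper's code) for an unnormalized log-density of the exponential-family form learned by MBDE; the proposal scale, burn-in, and the toy tilt function are illustrative choices.

```python
import numpy as np

def rw_metropolis_hastings(log_density, init, num_samples, step=0.5, burn_in=500, seed=0):
    """Random-walk Metropolis-Hastings for an unnormalized log-density.

    `log_density` can be log q_0(x) + sum_t theta_t * c_t(x): the log-normalizer
    cancels in the acceptance ratio, so it never needs to be computed."""
    rng = np.random.default_rng(seed)
    x = np.asarray(init, dtype=float)
    logp = log_density(x)
    out = []
    for i in range(num_samples + burn_in):
        proposal = x + step * rng.normal(size=x.shape)
        logp_prop = log_density(proposal)
        if np.log(rng.uniform()) < logp_prop - logp:   # accept with prob min(1, ratio)
            x, logp = proposal, logp_prop
        if i >= burn_in:
            out.append(x.copy())
    return np.stack(out)

# Example: sample a standard 2D Gaussian tilted by a bounded "classifier" score.
tilt = lambda x: np.tanh(x[0])
samples = rw_metropolis_hastings(lambda x: -0.5 * x @ x + 0.5 * tilt(x), np.zeros(2), 2000)
```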
Contenders: we know of no integrally private sampling approach operating under conditions equivalent to ours, so our main contender is a particular state of the art ε-differentially private approach that provides a private density, DPB (Aldà and Rubinstein, 2017). We choose this approach because digging into its technicalities reveals that its integral privacy budget would be roughly equivalent to ours, mutatis mutandis. Here is why: this approach allows sampling a dataset of arbitrary size while keeping the same privacy budget, but its budget needs to be scaled to accommodate integral privacy, while in our case MBDE obtains integral privacy for a single output observation and its privacy budget needs to be scaled to accommodate larger output sizes. It turns out that in both approaches the scaling of the privacy parameter needed to accommodate arbitrary output sizes and integral privacy is roughly the same. In our case, the change is obvious: the privacy parameter is naturally scaled by the output size. In the case of Aldà and Rubinstein (2017), the requirement of integral privacy multiplies the sensitivity (cf. Aldà and Rubinstein (2017, Definition 4) for the sensitivity, Aldà and Rubinstein (2017, Section 6) for the key function involved) by the same factor, which implies that the Laplace mechanism is left unchanged only if ε is scaled accordingly (Dwork and Roth, 2014, Section 3.3).
We have also compared with a private GAN approach, DPGAN (Xie et al., 2018), which has the benefit of yielding a simple sampler but involves a weaker privacy model. For DPB, we use a bandwidth kernel and learn the bandwidth parameter via cross-validation. For DPGAN, we train the WGAN base model with standard batch sizes and epoch counts. We found that DPGAN is significantly outperformed by both DPB and MBDE, so to save space we only include that experiment in Figure 1. We observed that DPB does not always yield a positive measure (fitting kernel density estimation in the Bernstein basis does not guarantee positivity of the measure). To ensure this property, we shift and scale its output for positivity, without accounting for privacy in doing so, which degrades the privacy guarantee of DPB but keeps the approximation guarantees of its output Aldà and Rubinstein (2017). Clearly, MBDE does not suffer this drawback.
Metrics: we consider two metrics, inspired by those used in our theoretical analysis and by one investigated in Tolstikhin et al. (2017) for mode capture. We first investigate the ability of our method to learn highly dense regions by computing mode coverage, defined as the mass of the target over a region where the model's density exceeds a threshold. Mode coverage essentially finds high-density regions of the model (based on the learned density) and computes the mass of the target under those regions. Second, we compare the negative log-likelihood, as a general loss measure.
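A short sketch (ours) of the two metrics on a 1D grid; the thresholding rule follows the description above, with the exact threshold value left as an assumption.

```python
import numpy as np

def negative_log_likelihood(logq_on_samples):
    """Average negative log-likelihood of held-out target samples under the model."""
    return float(-np.mean(logq_on_samples))

def mode_coverage(target_pdf, model_pdf, grid, threshold):
    """Mass of the target on the region where the model's density exceeds `threshold`."""
    dx = grid[1] - grid[0]
    high_density_region = model_pdf(grid) > threshold
    return float(np.sum(target_pdf(grid)[high_density_region]) * dx)
```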
Domains: we essentially consider three different problems, the first of which is sketched in the code below. It is the ring Gaussians problem now common to generative approaches Goodfellow (2016), in which 8 Gaussians have their modes regularly spaced on a circle; the target is shown in Figure 1. Second, we consider a mixture of three 1D Gaussians. For the final experiment, we consider a 1D domain and randomly place Gaussians with means in a fixed interval and fixed variances; we vary the number of Gaussians and repeat the experiment four times to get means and standard deviations. The appendix (Section 9) shows more experiments.
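As a reference for reproduction, a simple generator (ours) for the ring of 8 Gaussians; the radius and component standard deviation are assumptions, not the paper's exact values.

```python
import numpy as np

def sample_ring_gaussians(n, num_modes=8, radius=2.0, sigma=0.1, seed=0):
    """Sample a mixture of `num_modes` Gaussians whose modes are equally spaced
    on a circle of the given radius (radius and sigma are illustrative values)."""
    rng = np.random.default_rng(seed)
    angles = 2.0 * np.pi * rng.integers(num_modes, size=n) / num_modes
    centers = radius * np.stack([np.cos(angles), np.sin(angles)], axis=1)
    return centers + sigma * rng.normal(size=(n, 2))

data = sample_ring_gaussians(5000)
```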
Results: Figure 4 displays contour plots of the densities learned by MBDE against DPB (Aldà and Rubinstein, 2017). Figure 1 in the introduction provides additional insight into the comparison, and Figure 5 provides metrics. We indicate the metric performance for DPB on one plot only, since the density estimates obtained in some of the other settings did not allow for an accurate computation of the metrics. The experiments bring the following observations. MBDE is much better at density estimation than DPB on the ring Gaussian problem: it essentially obtains the same results as DPB for values of ε that are 400 times smaller, as seen from Figure 1. We also remark that the densities modelled by MBDE are smoother and more regular in this case. One might attribute the fact that our performance is much better on the ring Gaussians to the fact that our q_0 is a standard Gaussian, located at the middle of the ring in this case, but experiments on random 2D Gaussians (see the appendix) show that our performance also remains better in settings where q_0 should represent a handicap.

All domains, including the 1D random Gaussians experiments in Figure 7 (appendix), display a consistently decreasing NLL for MBDE as ε increases, with sometimes very sharp decreases for small ε (see also the appendix, Section 9). We attribute this to the fact that it is in this regime of the privacy parameter that MBDE captures all modes of the mixture; for larger values of ε, it just fits better the modes already discovered. We also remark on the 1D Gaussians that DPB rapidly reaches an NLL plateau, showing that it gains little as ε increases, while MBDE still manages additional improvements and significantly beats DPB. We attribute this to the flexibility of the sufficient statistics as (deep) classifiers in MBDE. The 1D random Gaussian problem (Figure 7 in the appendix) displays the same pattern for MBDE: results improve as ε increases and become significantly better than those of DPB, which does not succeed in improving with ε, indicating a privacy regime in which that algorithm fails to learn a good density. We also observe that the standard deviation of MBDE is often 100 times smaller than that of DPB, indicating not just better but also much more stable results.

In the case of mode coverage, we observe for several experiments (e.g. ring Gaussians) that mode coverage first decreases with ε and then increases, on all domains. This, we believe, is due to our choice of q_0, which, as a Gaussian, already captures with its mode a part of the existing modes. As ε increases, however, MBDE performs better and in general obtains a significant improvement over q_0. We also observe this phenomenon for the random 1D Gaussians (Figure 6), where the very small standard deviations display a significant stability of the solutions of MBDE.

[Figure 6 panels: mean and standard deviation of mode coverage as functions of ε.]
Figure 6: Mode coverage for MBDE on the 1D random Gaussians.

6 Discussion

By dropping the neighboring condition on inputs, our privacy guarantee is stronger than that of classical (or central (Differential privacy team, Apple, 2017)) DP. It is also stronger than that of local ("on device") DP, which requires individual protection of every row of the input (Differential privacy team, Apple, 2017, Definition 3.2); equivalently, local DP is DP applied to inputs of size one. Integral privacy puts no constraint on sizes and is therefore a good protection when groups of any or unknown size have to be protected. If we drop the "local" constraint in local DP, the model becomes equivalent to integral privacy. This equivalence must be considered with caution. First, local DP algorithms would typically not scale: relaxing the protection to subsets of rows would multiply the sensitivity by the subset size as well, and making no assumption on that size for privacy (which is what integral privacy enforces) would just wipe out the protection guarantee as inputs grow. Second, we do not suffer this caveat, but we suspect that many integral privacy algorithms would risk being over-conservative in a local DP setting. So, there is no "one size fits all" algorithm, which somehow justifies two highly distinct models to accommodate specific algorithms.

In the introduction, we highlighted why it would be a bad idea to lift DP to integral privacy by scaling the budget to the input size. One might think that this argument in fact applies to the output sample size in integral privacy for sampling, as partially discussed in the experiments. This needs to be nuanced for three reasons: (i) in relevant applications, the output would be much smaller than the input, as for example when the input data is census-based (thus virtually equivalent to the whole population) but the output is deliberately small-sized (such was the setting of Australia's Medicare data breach Rubinstein et al. (2016)); (ii) we can trade part of the scaling-down of ε in MBDE by scaling down the classifiers' outputs instead: we still get mollifiers, and thus ε-integral privacy, for larger output samples (appendix, Section 8.2), which allows fine-tuning the boosting convergence even when the final guarantee is inevitably weakened. Finally, (iii) it turns out that, regardless of the output size, if one element of the ε-mollifier is close to the target, then the density learned by MBDE is guaranteed to be close to the target as well. This is shown in the appendix (Section 8.6), along with the fact that such a guarantee is essentially the best we can hope for, not just for integral privacy but also for the weaker model of ε-differential privacy.

7 Conclusion

In this paper, we have proposed an extension of ε-differential privacy that handles the protection of groups of arbitrary size, and applied it to sampling. The technique bypasses noisification as usually carried out in DP. An efficient learning algorithm is proposed, which learns (deep) sufficient statistics for an exponential family using classifiers. Formal approximation guarantees are obtained for this algorithm in the context of boosting theory, using assumptions substantially weaker than the state of the art. Experiments demonstrate the quality of the solutions found, in particular in the context of the mode capture problem.

Acknowledgements and code availability

We are indebted to Benjamin Rubinstein for providing us with the Private KDE code, Borja de Balle Pigem and anonymous reviewers for significant help in correcting and improving focus, clarity and presentation, and finally Arthur Street for stimulating discussions around this material. Our code is available at:

https://github.com/karokaram/PrivatedBoostedDensities

References

  • Aldà and Rubinstein [2017] F. Aldà and B. Rubinstein. The Bernstein mechanism: Function release under differential privacy. In AAAI’17, 2017.
  • Amari and Nagaoka [2000] S.-I. Amari and H. Nagaoka. Methods of Information Geometry. Oxford University Press, 2000.
  • Bernstein and Sheldon [2018] G. Bernstein and D. Sheldon. Differentially private Bayesian inference for exponential families. CoRR, abs/1809.02188, 2018.
  • Boissonnat et al. [2010] J.-D. Boissonnat, F. Nielsen, and R. Nock. Bregman voronoi diagrams. DCG, 44(2):281–307, 2010.
  • Differential privacy team, Apple [2017] Differential privacy team, Apple. Learning with differential privacy at scale, 2017.
  • Dimitrakakis et al. [2014] C. Dimitrakakis, B. Nelson, A. Mitrokotsa, and B. Rubinstein. Robust and private Bayesian inference. In ALT'14, pages 291–305, 2014.
  • Duchi et al. [2013a] J.-C. Duchi, M.-I. Jordan, and M. Wainwright. Local privacy and minimax bounds: sharp rates for probability estimation. NIPS*26, pages 1529–1537, 2013a.
  • Duchi et al. [2013b] J.-C. Duchi, M.-I. Jordan, and M. Wainwright. Local privacy, data processing inequalities, and minimax rates. CoRR, abs/1302.3203, 2013b.
  • Dwork and Roth [2014] C. Dwork and A. Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9:211–407, 2014.
  • Dwork et al. [2010] C. Dwork, G.-N. Rothblum, and S.-P. Vadhan. Boosting and differential privacy. In Proc. of the 51 FOCS, pages 51–60, 2010.
  • Farell [2017] P. Farell. The medicare machine: patient details of ’any australian’ for sale on darknet. The Guardian Australia, July 2017.
  • Fix and Hodges [1951] E. Fix and J. L. Hodges. Discrimatory analysis, nonparametric discrimination. Technical Report TR-21-49-004, Rept 4, USAF School of Aviation Medicine, Randolph Field, TX, 1951.
  • Gaboardi [2016] M. Gaboardi. Topics in differential privacy. Course Notes, State University of New York, 2016.
  • Geumlek et al. [2017] J. Geumlek, S. Song, and K. Chaudhuri. Rényi differential privacy mechanisms for posterior sampling. In NIPS*30, pages 5295–5304, 2017.
  • Gilbarg and Trudinger [2001] D. Gilbarg and N. Trudinger. Elliptic Partial Differential Equations of Second Order. Springer, 2001.
  • Givens and Hoeting [2013] G.-F. Givens and J.-A. Hoeting. Computational Statistics. Wiley, 2013.
  • Goodfellow [2016] I. Goodfellow. Generative adversarial networks, 2016. NIPS’16 tutorials.
  • Hall et al. [2013] R. Hall, A. Rinaldo, and L.-A. Wasserman. Differential privacy for functions and functional data. JMLR, 14(1):703–727, 2013.
  • Hastings [1970] W.-K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57:97–109, 1970.
  • Kearns [1988] M. Kearns. Thoughts on hypothesis boosting, 1988. ML class project.
  • Lord [2018] N. Lord. Top 10 biggest healthcare data breaches of all time. The Digital Guardian, June 2018.
  • Machanavajjhala et al. [2008] A. Machanavajjhala, D. Kifer, J.-M. Abowd, J. Gehrke, and L. Vilhuber. Privacy: Theory meets practice on the map. In ICDE’08, pages 277–286, 2008.
  • Metropolis et al. [1953] N. Metropolis, A.-W. Rosenbluth, M.-N. Rosenbluth, A.-H. Teller, and E. Teller. Equation of state calculations by fast computing machines. Journal of Chemical Physics, 21:1087–1092, 1953.
  • Mir [2013] D.-J. Mir. Differential privacy: an exploration of the privacy-utility landscape. PhD thesis, Rutgers University, 2013.
  • Palanisamy et al. [2017] B. Palanisamy, C. Li, and P. Krishnamurthy. Group differential privacy-preserving disclosure of multi-level association graphs. In ICDCS’17, pages 2587–2588, 2017.
  • Rubin [1993] D. B. Rubin. Discussion: statistical disclosure limitation. Journal of Official Statistics, 9(2):462–468, 1993.
  • Rubinstein and Aldà [2017] B. Rubinstein and F. Aldà. Pain-free random differential privacy with sensitivity sampling. In 34th ICML, 2017.
  • Rubinstein et al. [2016] B. Rubinstein, V. Teague, and C. Culnane. Understanding the maths is crucial for protecting privacy. The University of Melbourne, September 2016.
  • Schapire [1990] R. E. Schapire. The strength of weak learnability. MLJ, pages 197–227, 1990.
  • Schapire and Freund [2012] R.-E. Schapire and Y. Freund. Boosting, Foundations and Algorithms. MIT Press, 2012.
  • Schapire and Singer [1999] R. E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. MLJ, 37:297–336, 1999.
  • Silverman [1986] B.-W. Silverman. Density estimation for statistics and data analysis. Chapman and Hall, 1986.
  • Thaler et al. [2012] J. Thaler, J. Ullman, and S.-P. Vadhan. Faster algorithms for privately releasing marginals. In ICALP’12, pages 810–821, 2012.
  • Tolstikhin et al. [2017] I.-O. Tolstikhin, S. Gelly, O. Bousquet, C. Simon-Gabriel, and B. Schölkopf. Adagan: Boosting generative models. In NIPS*30, pages 5430–5439, 2017.
  • Triastcyn and Faltings [2018] A. Triastcyn and B. Faltings. Generating differentially private datasets using GANs. CoRR, abs/1803.03148, 2018.
  • Wainwright [2014] M. Wainwright. Constrained forms of statistical minimax: computation, communication, and privacy. In International Congress of Mathematicians, ICM’14, 2014.
  • Wang et al. [2015] Y.-X. Wang, S. Fienberg, and A.-J. Smola. Privacy for free: Posterior sampling and stochastic gradient Monte Carlo. In 32 ICML, pages 2493–2502, 2015.
  • Wasserman and Zhou [2010] L. Wasserman and S. Zhou. A statistical framework for differential privacy. J. of the Am. Stat. Assoc., 105:375–389, 2010.
  • Xie et al. [2018] L. Xie, K. Lin, S. Wang, F. Wang, and J. Zhou. Differentially private generative adversarial network. CoRR, abs/1802.06739, 2018.

Appendix: table of contents

Proofs and formal results (Section 8)
Proof of Lemma 3 (Section 8.1)
Proof of Theorem 4 (Section 8.2)
Proof of Theorem 4 (Section 8.3)
Proof of Theorem 1 (Section 8.4)
Proof of Theorem 4 (Section 8.5)
Additional formal results (Section 8.6)

Additional experiments (Section 9)

8 Proofs and formal results

8.1 Proof of Lemma 3

The proof follows from two simple observations: (i) ensuring (4) is equivalent to ensuring the corresponding pointwise bound on the sampled densities, since (4) has to hold for all events; and (ii) the probability of sampling any given observation equals its mass under the density the algorithm samples from:

(15)

Recall that base measures are assumed to be the same, so membership in an ε-mollifier translates into a property of Radon-Nikodym derivatives, and we then get the statement of the Lemma: since the algorithm only samples densities belonging to an ε-mollifier, we get from Definition 3 that for any input samples S, S' and any output event:

(16)

which shows that the algorithm is ε-integrally private.

8.2 Proof of Theorem 4

The proof follows from two lemmata which we state and prove. Lemma. For any iteration index, we have that

(17)

Proof.

Since the step sizes follow the same decay at every iteration, the sequence of interest is geometric. For any geometric series with ratio of absolute value strictly smaller than one, we have that

(18)
(19)
(20)

Indeed, the last expression is the limit of the geometric series above as the number of terms grows. In our case, instantiating the ratio shows that

(21)

which concludes the proof. ∎
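For reference, the standard geometric series facts used above, stated in our notation since the original symbols are not recoverable:

```latex
% Partial sums and limit of a geometric series with ratio |r| < 1:
\sum_{t=0}^{T-1} r^{t} \;=\; \frac{1 - r^{T}}{1 - r},
\qquad
\sum_{t=0}^{\infty} r^{t} \;=\; \lim_{T \to \infty} \frac{1 - r^{T}}{1 - r} \;=\; \frac{1}{1 - r}.
```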

Lemma. For any iteration and any point of the support, let θ denote the parameters and c the sufficient statistics returned by Algorithm 1; then we have

(22)

Proof.

Since the algorithm returns classifiers with bounded outputs at every point, we have from the previous Lemma,

(23)

and similarly,

(24)

Thus we have

(25)

By taking exponentials, integrating (with respect to the base measure) and taking logarithms in (25), we get

(26)
(27)

Since both bounds hold simultaneously, the proof concludes by considering the highest and lowest values. ∎

The proof of Theorem 4 now follows from taking the exponential of all quantities in (22), which makes q_T appear in the middle and the conditions for membership in M_ε appear in the bounds.

8.3 Proof of Theorem 4

We begin by deriving the KL drop expression. At each iteration, we learn a classifier, fix some step size, multiply the current density by the exponentiated scaled classifier and renormalize; we denote the resulting distribution with an explicit dependence on the step size. Lemma. The drop in KL is

(28)
Proof.

Note that the tilted density is indeed a one-dimensional exponential family with natural parameter the step size, sufficient statistic the classifier, log-partition function as in (7) and base measure the current density. We can write out the KL divergence as

(29)
(30)
(31)
(32)

It is not hard to see that the drop is indeed a concave function of the step size, suggesting that there exists an optimal step size at each iteration. We split our analysis into two cases. In the first case, we can lower-bound the first term of the KL drop using the WLA. The trickier part, however, is bounding the log-partition term, for which we make use of Hoeffding's lemma. [Hoeffding's Lemma] Let X be a random variable bounded between two constants almost surely; then for all real λ, we have

(33)
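For reference, the standard form of Hoeffding's lemma used here, stated in our notation:

```latex
% Hoeffding's lemma: if a <= X <= b almost surely, then for all real lambda,
\mathbb{E}\!\left[e^{\lambda (X - \mathbb{E}[X])}\right]
  \;\le\; \exp\!\left(\frac{\lambda^{2}\,(b-a)^{2}}{8}\right).
```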

For any classifier satisfying Assumption 3 (WLA), we have

(34)
Proof.

With the appropriate choices of the random variable and its bounds, noticing that

(35)

allows us to apply Lemma 8.3 (Hoeffding). Realizing first that

(36)

we get that

(37)

Re-arranging and using the WLA inequality yields

(38)
(39)

Applying the two lemmata above together gives us

(40)
(41)
(42)
(43)

Now we move to the second case. Lemma. For any classifier returned by Algorithm 1, we have that

(44)

where the relevant shorthand is defined as before.

Proof.

Consider the straight line between the two endpoint values, which by convexity lies above the exponential on the corresponding interval. To this end, we define the function

(45)

Since the chord dominates the exponential at every point of the interval, the pointwise inequality holds everywhere on it. Taking expectations on both sides and using linearity of expectation gives

(46)
(47)
(48)
(49)
(50)
(51)
(52)

as claimed. ∎

Now we use the Lemma above together with Jensen's inequality (applied to the concave logarithm), so that

(53)
(54)
(55)
(56)
(57)
(58)
(59)

8.4 Proof of Theorem 1

We first note that for any element of the constrained set,

(60)
(61)
(62)
(63)
(64)

which completes the proof of (10). To show (11), we have that

(65)
(66)
(67)
(68)