Differential privacy is gaining popularity as an agreed upon measure of privacy leakage, widely used by the government to publish Census statistics [Abo18], Google to aggregate user’s choices in web-browser features [EPK14, FPE16], Apple to aggregate mobile user data [Tea17], and smart meters in telemetry [DKY17]. As increasing number of privatization mechanisms are introduced and deployed in the wild, it is critical to have countermeasures to check the fidelity of those mechanisms. Such techniques will allow us to hold accountable the deployment of privatization mechanisms if the claimed privacy guarantees are not met, and find and fix bugs in implementations of those mechanisms.
A user-friendly tool for checking privacy guarantees is necessary for several reasons. Writing a program for a privatization mechanism is error-prone, as it involves complex probabilistic computations. Even with customized languages for differential privacy, checking the end-to-end privacy guarantee of an implementation remains challenging [BKOZB12, WDW19]. Furthermore, even when the implementation is error-free, there have been several cases where the mechanism designers have made errors in calculating the privacy guarantees, and falsely reported higher level of privacy [LSL17, CM15]. This is evidence of an alarming issue that analytically checking the proof of a privacy guarantee is a challenging process even for an expert. An automated and data-driven algorithm for checking privacy guarantees will significantly reduce such errors in the implementation and the design. On other cases, we are given very limited information about how the mechanism works, like Apple’s white paper [Tea17]. The users are left to trust the claimed privacy guarantees.
To address these issues, we propose a data-driven approach to estimate how much privacy is guaranteed, from a black-box access to a purportedly private mechanism. Our approach is based on an optimal polynomial approximation that gracefully trades off bias and variance. We study the fundamental limit of how many samples are necessary to achieve a desired level of accuracy in the estimation, and show that the proposed approach achieves this fundamental bound.
Problem formulation. Differential privacy (DP) introduced in [Dwo11] is a formal mathematical notion of privacy that is widespread, due to several key advantages. It gives one of the strongest guarantees, allows for precise mathematical analyses, and is intuitive to explain even to non-technical end-users. When accessing a database through a query, we say the query output is private if the output did not reveal whether a particular person’s entry is in the database or not. Formally, we say two databases are neighboring if they only differ in one entry (one row in a table, for example). Let denote the distribution of the randomized output to a query on a database . We consider discrete valued mechanisms taking one of values, i.e. the response to a query is in for some integer . We say a mechanism guarantees -DP, [Dwo11], if the following holds
for some , , and all subset and for all neighboring databases and . When , -DP is referred to as (pure) differential privacy, and the general case of is referred to as approximate differential privacy. For pure DP, the above condition can be relaxed as
for all output symbol , and for all neighboring databases and . This condition can now be checked, one symbol at a time from , without having to enumerate all subsets . This naturally leads to the following algorithm.
For a query and two neighboring databases and of interest, we need to verify the condition in Eq. (2). As we only have a black-box access to the mechanism, we collect responses from the mechanism on the two databases. We check the condition on the empirical distribution of those collected samples, for each . If it is violated for any , we assert the mechanism to be not -DP and present as an evidence. Focusing only on pure DP, [DWW18] proposed an approach similar to this, where they also give guidelines for choosing the databases and to test. However, their approach is only evaluated empirically, no statistical analysis is provided, and a more general case of approximate DP is left as an open question, as the condition in Eq. (1) cannot be decoupled like Eq. (2) when .
We propose an alternative approach from first principles to check the general approximate DP guarantees, and prove its minimax optimality. Given two probability measuresand over , we define the following approximate DP divergence with respect to as
where . The last representation indicates that this metric falls under a broader class of metrics known as -divergences, with a special choice of . From the definition of DP, it follows that a mechanism is -DP if and only if for all neighboring databases and . We propose estimating this divergence from samples, and comparing it to the target . This only requires number of operations scaling as where is the sample size.
In this paper, we suppose there is a specific query of interest, and two neighboring databases and have been already selected either by a statistician who has some side information on the structure of the mechanism or by some algorithm, such as those from [DWW18]. Without exploiting the structure (such as symmetry, exchangeability, or invariance to the entries of the database conditioned on the true output of the query), one cannot avoid having to check all possible combinations of neighboring databases. As a remedy, [GM18] proposes checking randomly selected databases. This in turn ensures a relaxed notion of privacy known as random differential privacy. Similarly, [DJRT13] proposed checking the typical databases, assuming there we have access to a prior distribution over the databases. Our framework can be seamlessly incorporated with such higher-level routines to select databases.
Contributions. We study the problem of estimating the approximate differential privacy guaranteed by a mechanism, from a black-box access where we can sample from the mechanism output given a query , a database , and a target . We first show that a straightforward plug-in estimator of achieves mean squared error scaling as , where is the size of the alphabet and is the number of samples used (Section 2.1.1).
In the regime where we fix and increase the sample size, this achieves the parametric rate of , and cannot be improved upon. However, in many cases of practical interest where is comparable to , we show that this can be improve upon with a more sophisticated estimator. To this end, we introduce a novel estimator of . The main idea is to identify the regimes of non-smoothness in where the plug-in estimator has a large bias. We replace it by the uniformly best polynomial approximation of the non-smooth regime of the function, and estimate those polynomial from samples. By selecting appropriate degree of the polynomial, we can optimally trade off the bias and variance. We provide an upper bound on the error scaling as , when and are comparable. We prove that this is the best one can hope for, by providing a minimax lower bound that matches.
We first show this for the case when we know and sample from in Section 2.1, to lay out the main technical insights while maintaining simple exposition. Then, we consider the practical scenario where both and are accessed via samples, and provide an minimax optimal estimator in Section 2.2. This phenomenon is referred to as effective sample size amplification; one can achieve with samples a desired error rate, that would require samples for a plug-in estimator. We present numerical experiments supporting our theoretical predictions in Section 3, and use our estimator to identify those real-world and artificial mechanisms that have incorrectly reported the privacy guarantees.
Related work. Formally guaranteeing differential privacy is a challenging and error-prone task. Principled approaches have been introduced to derive the end-to-end privacy loss of a software program that uses multiple differentially private accesses to the data, including SQL-based languages [McS09], higher-order functional languages [RP10], and imperative languages [CGLN11]. A unifying recipe of these approaches is to combine standard mechanisms like Laplacian, exponential, and Gaussian mechanisms [Dwo11, MT07, GKOV15], and calculate the end-to-end privacy loss using the composition theorem [KOV17]. To extend these approaches to a more general notion of approximate differential privacy and allow non-standard components, principled approaches have been proposed in [BKOZB12, WDW19]. The main idea is to perform fine-grained reasoning about the complex probabilistic computations involved. These existing approaches have been preemptive measures aimed at automated verification of privacy of the source code. Instead, we seek a post-hoc measure of estimating the privacy guarantee, given a black-box access to a privatization mechanism and not the source code.
Several published mechanisms have mistakes in the privacy guarantees. These are variations of a popular mechanism known as Sparse Vector Technique (SVT). The original SVT first generates a random threshold. To answer a sequence of queries, it adds a random noise to each query output, and respond whether this is higher than the threshold (true) or not (false). This process continues until either all queries are answered, each with a privatized Boolean response or if the number of trues meets a pre-defined bound. Several variations violate claimed DP guarantees.[SCM14] proposes variant of SVT with no noise adding and no bound on the number of trues, which does not satisfy DP for any values of . [CXZX15] makes a similar false claim, proposing a version of SVT with no bound on the number of trues. [LC14] adds a smaller noise independent of the bound on the trues. [Rot11] outputs the actual noisy vector instead of the Boolean vector.
A formal investigation into verifying DP guarantees of a given mechanism was addressed in [DJRT13]. DP condition is translated into a certain Lipschitz condition on over the databases , and a Lipschitz tester is proposed to check the conditions. However, this approach is not data driven, as it requires the knowledge of the distribution and no sampling of the mechanism outputs is involved. [GM18] analyzes tradeoffs involved in testing DP guarantees. It is shown that one cannot get accurate testing without sacrificing the privacy of the databases used in the testing. Hence, when testing DP guarantees, one should not use databases that contain sensitive data. We compare some of the techniques involved in Section 2.1.1.
Our techniques are inspired by a long line of research in property estimation of a distribution from samples. In particular, there has been significant recent advances for high-dimensional estimation problems, starting from entropy estimation for discrete random variables in[VV13, JVHW15, WY16]. The general recipe is to identify the regime where the property to be estimated is not smooth, and use functional approximation to estimate a smoothed version of the property. This has been widely successful in support recovery [WY19], density estimation with loss [HJW15], and estimating Renyi entropy [AOST17]
. more recently, this technique has been applied to estimate certain divergences between two unknown distributions, for Kullback-Leibler divergence[HJW16], total variation distance [JHW18], and identity testing [DKW18]. With carefully designed estimators, these approximation-based approaches can achieve improvement over typical parametric rate of error rate, sometimes referred to as effective sample size amplification.
Notations. We let the alphabet of a discrete distribution be for some positive integer denoting the size of the alphabet. We let
denote the set of probability distributions over. We use to denote that for some constant , and is analogously defined. denotes that and .
2 Estimating differential privacy guarantees from samples
We want to estimate from a blackbox access to the mechanism outputs accessing two databases, i.e. and . We first consider a simpler case, where is known and we observe samples from an unknown distribution in Section 2.1. We cover this simpler case first to demonstrate the main ideas on the algorithm design and analysis technique while maintaining the exposition simple. This paves the way for our main algorithmic and theoretical results in Section 2.2, where we only have access to samples from both and .
2.1 Estimating with known P
For a given budget , representing an upper bound on the expected number of samples we can collect, we propose sampling a random number
of samples from Poisson distribution with mean, i.e. . Then, each sample is drawn from for , and we let denote the resulting histogram divided by , such that .
Note that is not the standard empirical distribution, as with high probability. However, in this paper we refer to as empirical distribution of the samples. The empirical distribution would have been divided by instead of . Instead, is the maximum likelihood estimate of the true distribution . This Poisson sampling, together with the MLE construction of , ensures independence among , making the analysis simpler. We first analyze the sample complexity of a simple plug-in estimator in Section 2.1.1. This is well-defined regardless of whether sums to one or not. We next show that we can significantly improve the sample complexity by using a novel estimator, Algorithm 1, in Section 2.1.2. We show minimax optimality of the proposed estimator by proving a matching lower bound in Section 2.1.3.
2.1.1 Performance of the plug-in estimator
The following result shows that it is necessary and sufficient to have samples to achieve an arbitrary desired error rate, if we use this plug-in estimator , under the worst-case and . Some assumption on is inevitable as it is trivial to achieve zero error for any sample size, for example if and have disjoint supports. Both and are 1 with probability one. We provide a proof in Section 4.2. The bound in Eq. (6) also holds for . The proof technique is analogous, and we omit the proof here.
For any , support size , and distribution , the plug-in estimator satisfies
with expected number of samples . If , we can also lower bound the worst case mean squared error as
When , it follows that right-hand side of the lower bound is at least , and the matching upper and lower bounds can be simplified as follows, for the worst case and .
If and , we have
A similar analysis was done in [GM18], which gives an upper bound scaling as . We tighten the analysis by a factor of , and provide a matching lower bound.
2.1.2 Achieving optimal sample complexity with a polynomial approximation
To simplify the analysis, we split the samples randomly into two partitions, each having an independent and identical distribution of samples from the multinomial distribution . We let denote the count of the first set of samples (normalized by ), and the second set of samples. See Algorithm 1 for a formal definition. Note that for the analysis we are collecting samples in total on average. In all the experiments, however, we apply our estimator without partitioning the samples. A major challenge in achieving the minimax optimality is in handling the non-smoothness of the function at . We use one set of samples to identify whether an outcome is in the smooth regime () or not (), with an appropriately defined set function:
for and . The scaling of the interval is chosen carefully such that it is large enough for the probability of making a mistake on the which regime falls into to vanishes (Lemma 14); and it is small enough for the variance of the polynomial approximation in the non-smooth regime to match that of the other regimes (Lemma 15). In the smooth regime, we use the plug-in estimator. In the non-smooth regime, we can improve the estimation error by using the best polynomial approximation of , which has a smaller bias:
where is the set of polynomial functions of degree at most ,
and we approximate in an interval for any .
Having this slack of in the approximation allows us to guarantee the approximation quality,
even if the actual is not exactly in the non-smooth regime
Once we have the polynomial approximation, we estimate this polynomial function
from samples, using
the uniformly minimum variance unbiased estimator (MVUE)
uniformly minimum variance unbiased estimator (MVUE).
There are several advantages that makes this two-step process attractive. As we use an unbiased estimate of the polynomial, the bias is exactly the polynomial approximation error of , which scales as . Larger degree reduces the approximation error, and larger reduces the support of the domain we apply the approximation to in (Lemma 15). The variance is due to the sample estimation of the polynomial , which scales as for some universal constant (Lemma 15). Larger degree increases the variance. We prescribe choosing for appropriate constant to optimize the bias-variance tradeoff in Algorithm 1. The resulting two-step polynomial estimator, has two characterizations, depending on where is. Let .
Case 1: and . We consider the function by substituting into . Let be the best polynomial approximation of with order , i.e. and denote it as . Then . Once we have the polynomial approximation, we estimate with the uniformly minimum variance unbiased estimator (MVUE) to estimate .
Computing the ’s can be challenging, and we discuss this for the general case when is not known in Section 2.2.
Case 2: and . In this regime, the best polynomial approximation of is given by
where ’s are defined from the best polynomial approximation of on with order : . The unique uniformly minimum variance unbiased estimator (MVUE) for is
shown in Lemma 12. Hence,
The coefficients ’s only depend on and can be pre-computed and stored in a table.
Choosing an appropriate scaling as , we can prove an upper bound on the error scaling as . This proves that the proposed estimator achieves the minimax optimal performance.
For any , suppose for some constants and , then there exist constants and that only depends on , and such that
for and where is defined in Algorithm 1.
For worst case and , if for some , then there exist constants and , such that for ,
Note that the plug-in estimator in Corollary 2 achieves the parametric rate of . In the low-dimensional regime, where we fix and grow , this cannot be improved upon. To go beyond the parametric rate, we need to consider a high-dimensional regime, where grows with . Hence, a condition similar to is necessary, although it might be possible to further relax it.
2.1.3 Matching minimax lower bound
In the high-dimensional regime, where grows with sufficiently fast, we can get a tighter lower bound then Theorem 1, that matches the upper bound in Theorem 3. Again, supremum over is necessary as there exists where it is trivial to achieve zero error, for any sample size (see Section 2.1.1 for an example). For any given we provide a minimax lower bound in the following.
Suppose and there exists a constant such that . Then for any , there exists a constant that only depends on such that if , then
where the infimum is taken over all possible estimator.
A proof is provided in Section 4.4. When and , the condition
is satisfied for uniform distribution, and we can simplify the right-hand side of the lower bound. Together, we get the following minimax lower bound, matching the upper bound in Corollary 4.
If there exist constants , such that and , then
2.2 Estimating from samples
We now consider the general case where and are both unknown, and we access them through samples. We propose sampling a random number of samples and from each distribution, respectively. Define the empirical distributions and as in the previous section. From the proof of Theorem 1, we get the same sample complexity for the plug-in estimator: If and , we have
Using the same two-step process, we construct an estimator that improves upon this parametric rate of plug-in estimator.
2.2.1 Estimator for
We present an estimator using similar techniques as in Algorithm 1, but there are several challenges in moving to a multivariate case. The multivariate function in non-smooth in a region . We first define a two-dimensional non-smooth set as
where . As before, the plug-in estimator is good enough in the smooth regime, i.e. .
We construct a polynomial approximation of this function with order , in this non-smooth regime. We will set to achieve the optimal tradeoff. We split the samples randomly into four partitions, each having an independent and identical distribution of samples, two from the multinomial distributions and other two from . See Algorithm 2 for a formal definition. We use one set of samples to identify the regime, and the other for estimation.
case 1: For .
A straightforward polynomial approximation of on cannot achieve approximation error smaller than . As , this gives a bias of for each symbol in , resulting in total bias of . This requires to achieve arbitrary small error, as opposed to which is what we are targeting. This is due to the fact that we are requiring multivariate approximation, and the bias is dominated by the worst case for each . If is fixed, as in the case of univariate approximation in Lemma 15, the bias would have been , with , where total bias scales as when summed over all symbols .
Our strategy is to use the decomposition . Each function can be approximated up to a bias of , and the dominant term in the bias becomes . This gives the desired bias. Concretely, we use two bivariate polynomials and to approximate and in , respectively. Namely,
Then denote . For , define
In practice, one can use the best Chebyshev polynomial expansion to achieve the same uniform error rate, efficiently [MNS13].
case 2: For and .
We utilize the best polynomial approximation of on with order . Denote it as . Define