Recently, differential privacy (DP) has gained traction outside of theoretical research as several organisations (Google, Apple, Microsoft, the US Census Bureau, etc.) have announced deployment of large-scale differentially private mechanisms (Erlingsson et al., 2014; Apple, 2017; Abowd and Schmutte, 2017; Ding et al., 2017). This use of DP, while exciting, might be construed as a marketing tool used to encourage privacy-aware consumers to release more of their sensitive data to the company. In addition, the software behind the deployment of DP is typically proprietary since it ostensibly provides commercial advantage. This raises the question: with limited access to the software, can we verify the privacy guarantees of purportedly DP algorithms?
Suppose there exists some randomised algorithm $\mathcal{A}$ that is claimed to be $\Xi$-differentially private and we are given query access to $\mathcal{A}$. That is, the domain of $\mathcal{A}$ is the set of databases and we have the power to choose a database $D$ and obtain a (randomised) response $\mathcal{A}(D)$. How many queries are required to verify the privacy guarantee? We formulate this problem in the property testing framework for pure DP, approximate DP, random pure DP, and random approximate DP.
Definition 1 (Property testing with side information).
A property testing algorithm with query complexity $q$, proximity parameter $\alpha$, privacy parameters $\Xi$ and side information $S$, makes $q$ queries to the black-box and:
(Completeness) ACCEPTS with probability at least 2/3 if $\mathcal{A}$ is $\Xi$-private and $S$ is accurate.
(Soundness) REJECTS with probability at least 2/3 if $\mathcal{A}$ is $\alpha$-far from being $\Xi$-private.
In this early stage of commercial DP algorithms, approaches to transparency have been varied. For some algorithms, like Google’s RAPPOR, a full description of the algorithm has been released (Erlingsson et al., 2014). On the other hand, while Apple has released a white paper (Apple DP Team, 2017) and a patent (Thakurta et al., 2017), there are still many questions about their exact implementations. We focus on the two extreme settings: the no information setting, where we are given no information about the black-box (except the domain and range), and the full information setting, where we have an untrusted full description of the algorithm. A similar formulation of DP in the property testing framework was first introduced in Dixit et al. (2013), who consider testing for DP given oracle access to the probability density functions on outputs. Dixit et al. (2013) reduce this version of the problem to testing the Lipschitz property of functions and make progress on this more general problem.
Both settings we consider, full information and no information, are subject to fundamental limitations. We first show that verifying privacy is at least as difficult as breaking privacy, even in the full information setting. That is, suppose some number of samples per database is sufficient to verify that an algorithm is private with given parameters. Then Theorem 6 implies that for every algorithm that is not private with those parameters, there exists some pair of neighbouring databases such that the same number of samples from the algorithm's output is enough to distinguish between the two databases. Differential privacy is designed so that this latter problem requires a large number of samples. This connection has the unfortunate implication that verifiability and privacy are directly at odds: if a privacy guarantee is efficiently verifiable, then it cannot be a strong privacy guarantee.
For the remainder of this work we restrict to discrete distributions on . Our upper and lower bounds in each setting are contained in Table 1. We rule out sublinear verification of privacy in every case except verifying approximate differential privacy in the full information setting. That is, for all other definitions of privacy, the query complexity for property testing of privacy is .
Each privacy notion we consider is a relaxation of pure differential privacy. Generally, the privacy is relaxed in one of two ways: either privacy loss is allowed to occur on unlikely outputs, or privacy loss is allowed to occur on unlikely inputs. The results in Theorem 8 and the lower bounds in Table 1 imply that for efficient verification, we need to relax in both directions. That is, random approximate DP is the only efficiently verifiable privacy notion in the no information setting. Even then, we need about queries per database to verify -approximate differential privacy. Theorem 14 shows that random approximate DP can be verified in (roughly) samples, where (roughly) and are the probabilities of choosing a disclosive output or input, respectively. This means verification is efficient if and are small but not too small. This may seem insufficient to those familiar with DP, where common wisdom decrees that and should be small enough that this query complexity is infeasibly large.
There have been several other relaxations of pure differential privacy proposed in the literature, chief among them Rényi DP (Mironov, 2017) and concentrated DP (Dwork and Rothblum, 2016). These relaxations find various ways to sacrifice privacy, with a view towards allowing a strictly broader class of algorithms to be implemented. Similar to pure DP, Rényi and concentrated DP have the property that two distributions and can be close in TV distance while the pair has infinite privacy parameters. Thus, many of the results for pure differential privacy in this work can be easily extended to Rényi and concentrated DP. We leave these out of our discussion for brevity.
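Table 1: Main query complexity bounds, by privacy notion and information setting.

| | No information | Full information |
|---|---|---|
| pDP | Unverifiable [Theorem 10] | [Theorem 18] |
| aDP | [Theorem 12] | [Theorem 16] |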
One might hope to obtain significantly lower query complexity if the property tester algorithm is given side information, even if the side information is untrusted. We find that this is true for both approximate DP and pure DP, if we allow the query complexity to depend on the side information. A randomised algorithm can be abstracted as a set of distributions where . We obtain a sublinear verifier for approximate DP. For pure DP, we find the quantity that controls the query complexity is
the minimum value of the collection of distributions. If is large then efficient verification is possible: verifying that the pure differential privacy parameter is less than requires queries of each database (Theorem 17). Note that this is not sublinear since and if then we have no improvement on the no information setting. However, for reasonable , this is a considerable improvement on the no information lower bounds and may be efficient for reasonable .
A central theme of this work is that verifying the privacy guarantees that corporations (or any entity entrusted with private data) claim requires compromise by either the verifier or algorithm owner. If the verifier is satisfied with only a weak privacy guarantee (random approximate DP with and small but not extremely small), then they can achieve this with no side information from the algorithm owner. If the company is willing to compromise by providing information about the algorithm up-front, then much stronger privacy guarantees can be verified. Given this level of transparency, one might be tempted to suggest that the company provide source code instead. While verifying privacy given source code is an important and active area of research, there are many scenarios where the source code itself is proprietary. We have already seen instances where companies have been willing to provide detailed descriptions of their algorithms. In the full information case, we obtain our lowest sample complexity algorithms, including a sublinear algorithm for verifying approximate differential privacy.
This paper proceeds as follows: we start by defining property testing for privacy in Section 2. We then proceed to the main contributions of this work:
Verifying privacy is as hard as breaking privacy (Section 3).
In the no information setting, verifying pure differential privacy is impossible while there is a finite query complexity property tester for approximate differential privacy (Section 5).
If , then finite query complexity property testers exist for pure differential privacy in the full information setting (Section 6).
A sublinear property tester exists for approximate differential privacy in the full information setting.
The main lower bounds and algorithmic upper bounds in this paper are summarized in Table 1.
2 Background and Problem Formulation
A database is a vector $D$ in $\mathbb{N}^{|\mathcal{X}|}$ for some data universe $\mathcal{X}$. That is, for $x \in \mathcal{X}$, $D_x$ is the number of copies of $x$ in the database. We call two databases $D$, $D'$ neighbouring if they differ on a single data point. For a randomised algorithm $\mathcal{A}$ and database $D$, we use $\mathcal{A}(D)$ to denote the output and $P_D$ to denote the distribution of $\mathcal{A}(D)$. We will often prefer to view an algorithm as simply a collection of distributions $\{P_D\}_D$. We will only consider discrete distributions in this paper, so $P_D$ is a discrete distribution on $[n]$. For a distribution $P$, $P^r$ represents $r$ independent copies of $P$.
For much of this paper we will consider algorithms that accept only two databases as input. We use the notation $\mathcal{A} = (P_0, P_1)$ to denote such an algorithm that accepts only two databases, 0 and 1, as input, with $\mathcal{A}(0) \sim P_0$ and $\mathcal{A}(1) \sim P_1$. The databases 0 and 1 are assumed to be neighbouring.
The privacy notions we discuss will all center around the idea that and should be close for neighbouring databases and . As such, we will deal with many measures of closeness between distributions. We collect these definitions for ease of reference.
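Let $P$ and $Q$ be two distributions.
(Max divergence) $D_\infty(P \| Q) = \sup_{E} \ln \frac{P(E)}{Q(E)}$.
($\delta$-approximate max divergence) $D_\infty^{\delta}(P \| Q) = \sup_{E :\, P(E) \ge \delta} \ln \frac{P(E) - \delta}{Q(E)}$.
(KL divergence) $D_{KL}(P \| Q) = \sum_{x} P(x) \ln \frac{P(x)}{Q(x)}$.
(Total variation (TV) distance) $\|P - Q\|_{TV} = \sup_{E} |P(E) - Q(E)|$.
Here each supremum is over all events $E$ in the outcome space.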
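For discrete distributions on a small outcome space, these measures of closeness can be computed directly. The following Python sketch is illustrative only: the function names and the list-of-probabilities representation are assumptions made for this example, and the approximate max divergence is computed by brute force over all events, so it is only practical for very small supports.

```python
import math
from itertools import combinations

def max_divergence(p, q):
    """sup over events of ln(P(E)/Q(E)); for discrete P, Q it is attained on one outcome."""
    best = -math.inf
    for pi, qi in zip(p, q):
        if pi == 0:
            continue
        if qi == 0:
            return math.inf  # P puts mass where Q has none
        best = max(best, math.log(pi / qi))
    return best

def approx_max_divergence(p, q, delta):
    """Brute-force sup over events E with P(E) > delta of ln((P(E) - delta)/Q(E))."""
    n = len(p)
    best = -math.inf
    for k in range(1, n + 1):
        for event in combinations(range(n), k):
            pe = sum(p[i] for i in event)
            qe = sum(q[i] for i in event)
            if pe <= delta:  # numerator would be non-positive
                continue
            if qe == 0:
                return math.inf
            best = max(best, math.log((pe - delta) / qe))
    return best

def kl_divergence(p, q):
    """Sum of P(x) ln(P(x)/Q(x)); infinite if P has mass where Q has none."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue
        if qi == 0:
            return math.inf
        total += pi * math.log(pi / qi)
    return total

def tv_distance(p, q):
    """For discrete distributions, sup over events of |P(E) - Q(E)| equals half the L1 distance."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))
```

For instance, with `p = [0.5, 0.5]` and `q = [0.25, 0.75]`, `max_divergence(p, q)` is `ln 2` while `tv_distance(p, q)` is `0.25`.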
2.1 Privacy Definitions
Pure differential privacy is the gold standard for privacy-preserving data analysis. However, it is a very strong definition and as a result, many relaxations of it have gained traction as the work on differential privacy evolves. These relaxations find various ways to sacrifice privacy, with a view towards allowing a strictly broader class of algorithms to be implemented. Since these definitions are becoming standard, we give only a cursory introduction in this section. An introduction can be found in Dwork and Roth (2014) and more in depth surveys can be found in Vadhan (2016), Dwork (2008), Ji et al. (2014).
The idea is simple: suppose the adversary has narrowed the list of possible databases down to two neighbouring databases $D$ and $D'$. Any output the adversary sees is almost equally likely to have arisen from $D$ or $D'$. Thus, the adversary gains almost no information that helps them distinguish between $D$ and $D'$.
Definition 3 (Data Distribution Independent Privacy Definitions).
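A randomised algorithm $\mathcal{A}$, with output distribution $P_D$ on database $D$, is
$\epsilon$-pure differentially private (pDP) if $\sup_{D, D'} D_\infty(P_D \| P_{D'}) \le \epsilon$.
$(\epsilon, \delta)$-approximate differentially private (aDP) if $\sup_{D, D'} D_\infty^{\delta}(P_D \| P_{D'}) \le \epsilon$.
where the suprema are over all pairs of neighbouring databases $D$ and $D'$.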
Note that $\epsilon$-pDP is exactly $(\epsilon, 0)$-aDP. The parameter $\delta$ can be thought of as our probability of failing to preserve privacy. To see this, suppose the distributions output 0 with probability $1 - \delta$, and a unique identifier for the database with probability $\delta$. Then this algorithm is $(0, \delta)$-DP. Thus, we typically want $\delta$ to be small enough that we can almost guarantee that we will not observe this difference in the distributions. In contrast, while it is desirable to have $\epsilon$ small, a larger $\epsilon$ still gives meaningful guarantees (Dwork et al., 2011). Typically one should think of $\delta$ as extremely small and $\epsilon$ as quite small. The larger is, the more private the algorithm is. Unlike the other parameters we will treat as a fixed part of the definition, rather than a variable like .
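The example in this paragraph can be checked numerically. For discrete distributions, the approximate DP condition holds in one direction exactly when the total excess mass $\sum_x \max(P(x) - e^{\epsilon} Q(x), 0)$ is at most $\delta$, which follows by taking the event on which $P$ exceeds $e^{\epsilon} Q$. The sketch below assumes a three-outcome encoding (outcome 0 is the shared output, outcomes 1 and 2 are the two database identifiers); the helper names are hypothetical.

```python
import math

def excess_mass(p, q, eps):
    """sup over events E of P(E) - exp(eps) * Q(E), for discrete P and Q."""
    return sum(max(pi - math.exp(eps) * qi, 0.0) for pi, qi in zip(p, q))

def smallest_delta(p, q, eps):
    """Smallest delta for which the pair (p, q) is (eps, delta)-aDP in both directions."""
    return max(excess_mass(p, q, eps), excess_mass(q, p, eps))

def is_adp(p, q, eps, delta, tol=1e-12):
    return smallest_delta(p, q, eps) <= delta + tol

delta = 1e-3
p0 = [1 - delta, delta, 0.0]  # database 0: reveals its identifier (outcome 1) w.p. delta
p1 = [1 - delta, 0.0, delta]  # database 1: reveals its identifier (outcome 2) w.p. delta

print(is_adp(p0, p1, eps=0.0, delta=delta))      # the (0, delta) guarantee holds
print(is_adp(p0, p1, eps=5.0, delta=delta / 2))  # a larger eps cannot buy a smaller delta here
```

In this toy pair the identifying outcomes have probability zero under the other database, so increasing `eps` never reduces the required `delta`, which is why the pair satisfies no pure DP guarantee.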
Let be a distribution on the data universe . For a database and datapoint , let denote the neighbouring database where the first datapoint of is replaced by .
Definition 4 (Data Distribution Dependent Privacy Definitions).
An algorithm is
-Random pure differentially private (RpDP) if .
-Random approximate differentially private (RaDP) if .
where the probabilities in RpDP and RaDP are over .
Similar to $\delta$, $\gamma$ represents the probability of catastrophic failure in privacy. Therefore, we require that $\gamma$ is small enough that this event is extremely unlikely to occur.
2.2 Problem Formulation
Our goal is to answer the question: given these privacy parameters, is the algorithm at least $\Xi$-private? Here $\Xi$ is an appropriate privacy parameter. A property testing algorithm, which outputs ACCEPT or REJECT, answers this question if it ACCEPTS whenever the algorithm is $\Xi$-private, and only ACCEPTS if the algorithm is close to being $\Xi$-private. A tester with side information may also REJECT simply because the side information is inaccurate.
We say that is -far from being -private if , where the minimum is over all such that is -private. The metrics used for each form of privacy are contained in Table 2. We introduce the scalar to penalise deviation in one parameter more than deviation in another parameter. For example, it is much worse to mistake a -RpDP algorithm for -RpDP than it is to mistake a -RpDP algorithm for -RpDP. We leave the question of how much worse as a parameter of the problem. However, we give the general guideline that if we want an error to be tolerable in both and then , which may be large, is an appropriate choice.
The formal definition of a property tester with side information was given in Definition 1. A no information property tester is the special case when . A full information property tester is the special case when contains a distribution for each database . We use to denote the distribution on outputs presented in the side information and to denote the true distribution on outputs of the algorithm being tested. For and privacy parameter , a full information (FI) property tester for this problem satisfies:
(Completeness) Accepts with probability at least if the algorithm is -private and for all .
(Soundness) Rejects with probability at least if the algorithm is -far from being -private.
We only force the property tester to ACCEPT if the side information is exactly accurate (). It is an interesting question to consider a property tester that is forced to ACCEPT if the side-information is close to accurate, for example in TV-distance. We do not consider this in this work as being close in TV-distance does not imply closeness of privacy parameters.
For a database $D$, we will refer to the process of obtaining a sample from the output distribution $P_D$ of the algorithm on $D$ as querying the black-box. It will usually be necessary to input each database into the black-box multiple times. We will use $m$ to denote the number of unique databases that are queried and $r$ to denote the number of times each database is input. We will only consider algorithms where the number of samples from $P_D$ is the same, $r$, for each input database $D$, so our query complexity is $mr$ for each algorithm. Our aim is to verify the privacy parameters using as few queries as possible.
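The query model described here can be sketched as follows. The class `BlackBox`, the function `collect_samples`, and the parameter name `r` are hypothetical names chosen for this illustration: the tester only ever calls `query`, each chosen database is queried the same number of times, and the total query count is the number of databases times `r`.

```python
import random
from collections import Counter

class BlackBox:
    """Hypothetical stand-in for a purportedly private algorithm.

    Internally it holds one discrete output distribution per database,
    but a tester may only interact with it through `query`.
    """

    def __init__(self, distributions, seed=0):
        self._distributions = distributions  # dict: database -> list of outcome probabilities
        self._rng = random.Random(seed)
        self.queries = 0  # running count of black-box queries

    def query(self, database):
        self.queries += 1
        probs = self._distributions[database]
        return self._rng.choices(range(len(probs)), weights=probs, k=1)[0]

def collect_samples(box, databases, r):
    """Query each chosen database r times and return outcome counts per database."""
    return {d: Counter(box.query(d) for _ in range(r)) for d in databases}

box = BlackBox({0: [0.5, 0.5], 1: [0.25, 0.75]})
samples = collect_samples(box, databases=[0, 1], r=100)
print(box.queries)  # total queries used: 2 databases x 100 samples each
```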
2.3 Related Work
This work connects to two main bodies of literature. There are several works on verifying privacy with different access models that share the same motivation as this work. In terms of techniques, our work is most closely linked to recent work on property testing of distributions.
Testing DP in the property testing framework was first considered in Dixit et al. (2013). The access model in this paper is different to ours but their goal is similar. Recent work by Ding et al. (2018) studies privacy verification from a hypothesis testing perspective. They design a privacy verification algorithm which aims to find violations of the privacy guarantee. Their algorithm provides promising experimental results in non-adversarial settings (when the privacy guarantee is frequently violated), although they provide no theoretical guarantees.
Several algorithms and tools have been proposed for formal verification of the DP guarantee of an algorithm (Barthe et al., 2014, Roy et al., 2010, Reed and Pierce, 2010, Gaboardi et al., 2013, Tschantz et al., 2011). Much of this work focuses on verifying privacy given access to a description of the algorithm. There is a line of work (Barthe et al., 2014, Roy et al., 2010, Reed and Pierce, 2010, Gaboardi et al., 2013, Tschantz et al., 2011, Barthe et al., 2012, 2013, McSherry, 2009) using logical arguments (type systems, I/O automata, Hoare logic, etc.) to verify privacy. These tools are aimed at automatic (or simplified) verification of privacy of source code. There is another related line of work where the central problem is testing software for privacy leaks. This work focuses on blatant privacy leaks, such as a smart phone application surreptitiously leaking a user’s email (Jung et al., 2008, Enck et al., 2010, Fan et al., 2012).
Given sample access to two distributions $P$ and $Q$ and a distance measure $d$, the question of distinguishing between $d(P, Q) \le a$ and $d(P, Q) \ge b$ is called tolerant property testing. This question is closely related to the question of whether $(P, Q)$ is private. There is a large body of work exploring lower bounds and algorithmic upper bounds for tolerant testing using standard distances (TV, KL, $\chi^2$, etc.) with both $a = 0$ and $a > 0$ (Daskalakis et al., 2018, Paninski, 2008, Batu et al., 2013, Acharya et al., 2015, Valiant and Valiant, 2014). In our work, we draw most directly from the techniques of Valiant (2011).
3 Lower Bounds via Distinguishability
We now turn to examining the fundamental limitations of property testing for privacy. We find that even in the full information setting, the query complexity of verifying privacy is lower bounded by the number of queries required to distinguish between two possible inputs. We expect the latter to increase with the strength of the privacy guarantee.
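Databases $D$ and $D'$ are $r$-distinguishable under $\mathcal{A}$ if there exists a testing algorithm such that, given a description of $\mathcal{A}$ and $r$ independent samples from $\mathcal{A}(D'')$ where $D'' \in \{D, D'\}$, it accepts with probability at least 2/3 if $D'' = D$ and rejects with probability at least 2/3 if $D'' = D'$.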
The following theorem says that the per database query complexity of a privacy property testing algorithm is lower bounded by the minimal such that two neighbouring databases are -distinguishable under . Recall that we use the notation to denote an algorithm that accepts only two databases and as input, and and . The databases 0 and 1 are assumed to be neighbouring.
Consider any privacy definition, privacy parameter , and let . Suppose there exists a -privacy property tester with proximity parameter and (per database) query complexity . Let be an algorithm that is -far from -private. If the privacy notion is
pDP or aDP then there exists a pair of neighbouring databases that are -distinguishable under .
RpDP or RaDP and , then a randomly sampled pair of neighbouring databases has probability at least of being -distinguishable.
A major reason that DP has gained traction is that it is preserved even if the (randomised) algorithm is repeated. That is, if a private algorithm is run $r$ times on the same database, the combined output is still private, with slightly worse privacy parameters. Typically we want the privacy parameters to start small enough that $r$ has to be quite large before any pair of neighbouring databases can be distinguished using the output. If the algorithm is known and trusted then distinguishability may be possible with a feasible number of samples, for example by mean estimation (Laplacian distribution, etc.). This is no longer necessarily the case when we make no assumptions on the form of the distributions. In fact, many of our proofs in the following sections proceed by finding two distributions $P_0$ and $P_1$ such that $(P_0, P_1)$ has high privacy parameters but it is still difficult to distinguish between $P_0$ and $P_1$ (for example because they only differ on a set with small measure). We consider this setting because we are considering an untrusted, possibly adversarial, algorithm owner. In the future we would like to explore assumptions that can be placed on the class of distributions that may lower the sample complexity of distinguishability.
We start with pDP or aDP and suppose such a $\Xi$-privacy property testing algorithm exists. Let $\mathcal{A}$ be an algorithm that is $\alpha$-far from $\Xi$-private. Since the privacy parameter is defined as a maximum over all neighbouring databases, there exists a pair of databases $D$ and $D'$ such that $(P_D, P_{D'})$ has the same privacy parameter as $\mathcal{A}$. We can design a tester algorithm that distinguishes between $D$ and $D'$ as follows: given input $x \sim P_{D''}^r$, first sample $y \sim P_D^r$. Then run the privacy property testing algorithm on $(P_D, P_{D''})$ with sample $(y, x)$. If $D'' = D$ then $(P_D, P_D)$ is 0-DP, so the property tester will accept with probability at least 2/3. If $D'' = D'$ then $(P_D, P_{D'})$ is $\alpha$-far from $\Xi$-private so the property tester will reject with probability at least 2/3.
Finally, suppose such a -RpDP property testing algorithm exists. Let be an algorithm that is -far from -private so that, in particular, is not -RpDP. Thus, if we randomly sample a pair of neighbouring databases and , with probability , is not -pDP. The remainder of the proof proceeds as above by noticing that the algorithm is -far from -RpDP and is -RpDP. The proof is almost identical for RaDP. ∎
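The reduction in the first part of this proof can be sketched in code. This is a reading of the argument under stated assumptions, not the authors' implementation: `privacy_tester` is a hypothetical callable that takes two lists of samples (one per database of a two-database algorithm) and returns `True` for ACCEPT, and `toy_tester` is a deliberately crude stand-in used only to exercise the wrapper.

```python
import random

def distinguish(unknown_samples, p_d, privacy_tester, rng):
    """Decide whether unknown_samples came from database D or its neighbour D'.

    p_d is the (known) output distribution on D. We draw fresh samples from it
    and ask the privacy tester about the pair (fresh, unknown): if the unknown
    samples also came from D, the pair is perfectly private and should be accepted.
    """
    r = len(unknown_samples)
    fresh = rng.choices(range(len(p_d)), weights=p_d, k=r)
    return "D" if privacy_tester(fresh, unknown_samples) else "D'"

def toy_tester(sample_a, sample_b, threshold=0.5):
    """Toy stand-in (NOT a real privacy tester): accept iff the empirical TV distance is small."""
    outcomes = set(sample_a) | set(sample_b)
    tv = 0.5 * sum(
        abs(sample_a.count(o) / len(sample_a) - sample_b.count(o) / len(sample_b))
        for o in outcomes
    )
    return tv < threshold

rng = random.Random(0)
p_d = [1.0, 0.0]                                    # D always outputs 0
print(distinguish([0] * 20, p_d, toy_tester, rng))  # samples consistent with D
print(distinguish([1] * 20, p_d, toy_tester, rng))  # samples that can only come from D'
```

The wrapper uses exactly as many samples as the tester it calls, which is the sense in which the query complexity of testing bounds the sample complexity of distinguishing.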
4 Restriction to Two Distribution Setting
Differential privacy is an inherently local property. That is, verifying that is -private means verifying that is -private, either always or with high probability, for pairs of neighbouring databases and . We refer to the problem of determining whether a pair of distributions satisfies -privacy as the two database setting. We argue in this section that the hard part of privacy property testing is the two database setting. For this reason, from Section 5 onwards, we only consider the two database setting. Recall that we use the notation to denote an algorithm that accepts only two databases and as input, and and . The databases 0 and 1 are assumed to be neighbouring.
An algorithm is non-adaptive if it chooses pairs of distributions and queries the blackbox with each database times. It does not choose its queries adaptively. The following is a non-adaptive algorithm for converting a tester in the two database setting to a random privacy setting.
Theorem 7 (Conversion to random privacy tester).
If there exists a -privacy property tester for the two database setting with query complexity per database and proximity parameter , then there exists a privacy property tester for -random privacy with proximity parameter and query complexity
The conversion is given in Algorithm 1. We first prove completeness. Suppose is -random private. Let
so . Our goal is to estimate using the empirical estimate given by . We perform the property tester times on the pair to reduce the failure probability from to so
so . Then Therefore as above,
So, Algorithm 1 REJECTS with probability at least 2/3. ∎
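The proof repeats the two-database tester on each sampled pair to drive down its failure probability. One standard way to do this, assumed here purely for illustration, is a majority vote over independent runs. The sketch below computes the exact error of a majority vote when each run errs independently with probability 1/3, and wraps an arbitrary tester accordingly; all names are hypothetical.

```python
import math

def majority_error(k, p_fail=1 / 3):
    """Probability that more than half of k independent runs are wrong (k odd)."""
    return sum(
        math.comb(k, j) * p_fail**j * (1 - p_fail) ** (k - j)
        for j in range(k // 2 + 1, k + 1)
    )

def runs_needed(target, p_fail=1 / 3):
    """Smallest odd number of runs whose majority vote errs with probability <= target."""
    k = 1
    while majority_error(k, p_fail) > target:
        k += 2
    return k

def amplified(tester, k):
    """Wrap a tester so that it is run k times and the majority answer is returned."""
    def run(*args):
        votes = sum(1 for _ in range(k) if tester(*args))
        return votes > k // 2
    return run

k = runs_needed(0.05)
print(k, majority_error(k))
```

The number of repetitions grows only logarithmically in the inverse of the target failure probability, which is why amplification adds a modest factor to the query complexity.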
Notice that if then the query complexity is approximately . One shortcoming of the conversion algorithm in Theorem 7 is that we need to know the data distribution . We can relax to an approximation that is close in TV-distance, but it is not difficult to see that is necessary.
Theorem 8 (Lower bound).
Let . Let be a lower bound on the query complexity in the two distribution setting. If is sufficiently small then any non-adaptive -random privacy property tester with proximity parameter has query complexity .
We conjecture that the lower bound is actually . If this is true then Theorem 7 gives an almost optimal conversion from the two database setting to the random setting.
A random privacy property tester naturally induces a property tester in the two distribution setting by setting for half the databases and for the other half. Then is -random private if is and -far if is -far. Therefore, the random privacy tester must use at least as many queries as a privacy tester in the two database setting.
Suppose is -far from -private and the data universe is uniformly distributed. If is small enough then there exists a pair of nested subsets such that
Define if , if , if and if . Then is -random private and is -far from -random private.
Recall that a non-adaptive property testing algorithm can query by randomly sampling a pair of neighbours , and then sampling . If is the normalisation factor, the distributions and have total variation distance . Therefore, it takes at least queries to distinguish between and . ∎
5 No Information Setting
We first show that no privacy property tester with finite query complexity exists for pDP. We then analyse a finite query complexity privacy property tester for aDP, as well as query complexity lower bounds. For the remainder of this work we consider the two database setting, where each algorithm $\mathcal{A} = (P_0, P_1)$ accepts only two databases, 0 and 1, as input, with $\mathcal{A}(0) \sim P_0$ and $\mathcal{A}(1) \sim P_1$. The databases 0 and 1 are assumed to be neighbouring.
The impossibility of testing pDP arises from the fact that very low probability events can cause the privacy parameters to increase arbitrarily. In each case we can design distributions and that are close in TV-distance but for which the algorithm has arbitrarily large privacy parameters. This intuition allows us to use a corollary of Le Cam’s inequality (Corollary 9) to prove our impossibility results.
For any privacy definition, let be the proximity parameter and be the privacy parameters. Suppose and are algorithms such that is -DP and is -far from being -DP. Then, any privacy property testing algorithm with QC must satisfy
Theorem 10 (pDP lower bound).
Let and . No -pDP property tester with proximity parameter has finite query complexity.
Let be the query complexity of any pDP property tester. Let . Our goal is to prove that . If this is true for all , the query complexity cannot be finite.
Consider algorithms, and where
Then is -pDP and is -far from -pDP. Now, by Pinsker’s inequality,
Therefore, by Lemma 9,
We designed two distributions that are equal on a large probability set but for which the ratio of their probabilities blows up on a set with small probability. In Section 6 we will see that testing pure DP becomes possible if we make assumptions on the algorithm. The assumption we need will ensure that this ratio is upper bounded.
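The phenomenon behind this impossibility result can be illustrated with a small numerical sketch. The construction below is an assumption made for illustration and is not necessarily the pair used in the proof of Theorem 10: one distribution places a tiny mass `eta` on an outcome that the other never produces, so the TV distance is only `eta` while the max divergence is infinite.

```python
import math

def tv_distance(p, q):
    """Half the L1 distance, which equals the TV distance for discrete distributions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def max_divergence(p, q):
    """Largest log-ratio P(x)/Q(x) over outcomes with P(x) > 0; infinite if Q(x) = 0 there."""
    best = -math.inf
    for pi, qi in zip(p, q):
        if pi == 0:
            continue
        if qi == 0:
            return math.inf
        best = max(best, math.log(pi / qi))
    return best

def hard_pair(n, eta):
    """P0 is uniform on the first n-1 outcomes; P1 moves total mass eta onto the last outcome."""
    p0 = [1 / (n - 1)] * (n - 1) + [0.0]
    p1 = [(1 - eta) / (n - 1)] * (n - 1) + [eta]
    return p0, p1

def miss_probability(eta, r):
    """Probability that r samples from P1 never show the revealing outcome."""
    return (1 - eta) ** r

p0, p1 = hard_pair(n=10, eta=1e-6)
print(tv_distance(p0, p1))          # approximately eta
print(max_divergence(p1, p0))       # infinite: no pure DP guarantee holds
print(miss_probability(1e-6, 1000)) # a thousand samples almost surely look identical
```

Shrinking `eta` keeps the max divergence infinite while making the two distributions harder to tell apart from samples, so no fixed number of queries suffices.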
5.2 Property Testing for aDP in the No Information Setting
Fortunately, the situation is less dire for verifying aDP. Finite query complexity property testers do exist for aDP, although their query complexity can be very large. In the previous section, we relied on the fact that two distributions and can be close in TV-distance while has unbounded privacy parameters. In this section, we first show this is not true for aDP, which sets it apart from the other privacy notions. We then prove that the query complexity is , and there exists an algorithm that uses queries per database. Define
An algorithm is -aDP if and only if . The following lemma shows the relationship between the aDP parameters and TV-distance.
Let and suppose is -aDP and . If and
then is Furthermore, if then this bound is tight. That is, if , then there exists an algorithm such that conditions (1) and (2) hold but is -far from
For any event ,
Conversely, let and suppose . There must exist an event such that . The condition on can be rewritten as so we must have that either or .
First, suppose that . Then there exists a distribution such that
and . If we let then , which implies is -far from -aDP.
Finally, suppose . Then there exists a distribution such that
and . Letting , again is -far from -aDP. ∎
Theorem 12 (Lower bound).
Let and suppose . Any -aDP property tester with proximity parameter has query complexity
The proof of the component of the lower bound relies on Lemma 9 in a similar way to Theorem 10. The proof of the lower bound borrows a technique from Valiant (2011). The lemma uses the fact that if two distributions only differ on elements of with low probability, then many samples are needed to distinguish between them.
A property of a distribution is a function . It is called symmetric if for all permutations and distributions we have . It is -weakly-continuous if for all distributions and satisfying we have . The following lemma will be used in the proof of Theorem 12.
Lemma 13 (Valiant, 2011). Given a symmetric property on distributions on that is -weakly-continuous and two distributions, , that are identical for any index occurring with probability at least in either distribution but where and , then no tester can distinguish between and in samples.
For aDP, our property is , which is -weakly-continuous. We can now prove our lower bound.
Proof of Theorem 12.
Let be the uniform distribution on . Let
Then, is -aDP and is -far from -aDP. Now,
By the same argument as Theorem 10 we have .
Suppose is a disjoint union of the sets and , all of which have cardinality . Let so and . Let
Now, for , and for , . Since the distributions agree on any index with probability greater than , Lemma 13 implies that no tester can distinguish between and with less than samples. ∎
At first glance, Theorem 12 doesn’t look too bad. We should expect the sample complexity to scale like since we need to have enough samples to detect the bad events. Our concern is the size of . If we would like to be the same order as , then our query complexity must scale as . As we typically require to be extremely small (i.e. ), may be infeasibly large. If we are willing to accept somewhat larger , then may be reasonable.
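The intuition that enough samples are needed to detect the bad events can be made concrete with a short calculation. This sketch only illustrates that intuition and does not reproduce the bound of Theorem 12: it computes the smallest number of independent samples needed to see an event of probability `delta` at least once with probability 2/3. The function name is hypothetical.

```python
import math

def samples_to_observe(delta, confidence=2 / 3):
    """Smallest r with 1 - (1 - delta)^r >= confidence, i.e. roughly ln(3) / delta for small delta."""
    # log1p keeps precision when delta is tiny
    return math.ceil(math.log(1 - confidence) / math.log1p(-delta))

for delta in (1e-2, 1e-4, 1e-8):
    print(delta, samples_to_observe(delta))
```

The required number of samples grows like the inverse of `delta`, so values of `delta` that are considered safe in practice already push this simple detection task beyond a hundred million queries per database.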
We now turn our attention to Algorithm 2, a simple algorithm for testing aDP with query complexity . Its sample complexity matches the lower bound in Theorem 12 in when is held constant and in when is held constant. We are going to use a trick called Poissonisation to simplify the proof of soundness and completeness, as in Batu et al. (2013). Suppose that, rather than taking $r$ samples from $P_0$, the algorithm first samples $r_0$ from a Poisson distribution with parameter $r$ and then takes $r_0$ samples from $P_0$. Let $X_i$ be the random variable corresponding to the number of times the element $i$ appears in the sample from $P_0$. Then $X_i$ is distributed identically to the Poisson distribution with parameter $r P_0(i)$ and all the $X_i$'s are mutually independent. Similarly, we sample $r_1$ from a Poisson distribution with parameter $r$ and then take $r_1$ samples from $P_1$. Let $Y_i$ be the number of times $i$ appears in the sample from $P_1$, so $Y_i$ is Poisson with parameter $r P_1(i)$ and the $Y_i$ are independent.
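Poissonisation can be sketched in a few lines. The code below is a minimal illustration, assuming a target sample size called `r` and a discrete distribution given as a list of probabilities; the function names are hypothetical and the Poisson sampler is Knuth's simple multiplication method, which is adequate only for moderate parameters.

```python
import math
import random

def poisson(lam, rng):
    """Draw from Poisson(lam) by Knuth's multiplication method (moderate lam only)."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def poissonised_counts(probs, r, rng):
    """Draw a Poisson(r) number of samples from probs and return per-outcome counts.

    Each count is then Poisson with parameter r * probs[i], and the counts are
    mutually independent, which is what simplifies the analysis.
    """
    total = poisson(r, rng)
    draws = rng.choices(range(len(probs)), weights=probs, k=total)
    counts = [0] * len(probs)
    for outcome in draws:
        counts[outcome] += 1
    return counts

rng = random.Random(1)
print(poissonised_counts([0.2, 0.3, 0.5], r=50, rng=rng))
```

With a fixed sample size the counts are negatively correlated because they must sum to that size; drawing the sample size from a Poisson distribution removes this dependence at the cost of a slightly random number of queries.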
Theorem 14 (Upper bound).
Let . Algorithm 2 is a -aDP property tester with proximity parameter and sample complexity .
Let and . Let so and Var. Note also that is -DP if . First note that . If then
If then . Therefore,
Now, let be an independent copy of then
So . Therefore,