1. Introduction
Differential privacy has become a de facto standard for extracting information from a dataset (e.g., answering queries, building machine learning models, etc.) while protecting the confidentiality of individuals whose data are collected. Implemented correctly, it guarantees that any individual’s record has very little influence on the output of the algorithm.
However, the design of differentially private algorithms is very subtle and error-prone – it is well-known that a large number of published algorithms are incorrect (i.e., they violate differential privacy). A sign of this problem is the existence of papers that are solely designed to point out errors in other papers (Lyu et al., 2017; Chen and Machanavajjhala, 2015). The problem is not limited to novices who may not understand the subtleties of differential privacy; it even affects experts whose goal is to design sophisticated algorithms for accurately releasing statistics about data while preserving privacy.
There are two main approaches to tackling this prevalence of bugs: programming platforms and verification. Programming platforms, such as PINQ (McSherry, 2009), Airavat (Roy et al., 2010), and GUPT (Mohan et al., 2012), provide a small set of primitive operations that can be used as building blocks of algorithms for differential privacy. They make it easy to create correct differentially private algorithms at the cost of accuracy (the resulting privacy-preserving query answers and models can become less accurate). Verification techniques, on the other hand, allow programmers to implement a wider variety of algorithms and verify proofs of correctness (written by the developers) (Barthe et al., 2012, 2013; Barthe and Olmedo, 2013; Barthe et al., 2014, 2016, 2016) or synthesize most (or all) of the proofs (Albarghouthi and Hsu, 2017; Zhang and Kifer, 2017; Reed and Pierce, 2010; Gaboardi et al., 2013).
In this paper, we take a different approach: finding bugs that cause algorithms to violate differential privacy, and generating counterexamples that illustrate these violations. We envision that such a counterexample generator would be useful in the development cycle – variations of an algorithm can be quickly evaluated and buggy versions could be discarded (without wasting the developer’s time in a manual search for counterexamples or a doomed search for a correctness proof). Furthermore, counterexamples can help developers understand why their algorithms fail to satisfy differential privacy and thus can help them fix the problems. This feature is absent in all existing programming platforms and verification tools. To the best of our knowledge, this is the first paper that treats the problem of detecting counterexamples in incorrect implementations of differential privacy.
Although recent work on relational symbolic execution (Farina et al., 2017) aims for simpler versions of this task (like detecting incorrect calculations of sensitivity), it is not yet powerful enough to reason about probabilistic computations. Hence, it cannot detect counterexamples in sophisticated algorithms like the sparse vector technique (Dwork et al., 2014), which satisfies differential privacy but is notorious for having many incorrect published variations (Lyu et al., 2017; Chen and Machanavajjhala, 2015).
Our counterexample generator is designed to function in black-box mode as much as possible. That is, it executes code with a variety of inputs and analyzes the (distribution of) outputs of the code. This allows developers to use their preferred languages and libraries as much as possible; in contrast, most language-based tools restrict developers to specific programming languages and a very small set of libraries. In some instances, the code may include some tuning parameters. In those cases, we can use an optional symbolic execution module (our current implementation analyzes Python code) to find values of those parameters that make it easier to detect counterexamples. Thus, we refer to our method as a semi-black-box approach.
Our contributions are as follows:

We present the first counterexample generator for differential privacy. It treats programs as semi-black-boxes and uses statistical tests to detect violations of differential privacy.

We evaluate our counterexample generator on a variety of sophisticated differentially private algorithms and their common incorrect variations. These include the sparse vector method and noisy max (Dwork et al., 2014), which are cited as the most challenging algorithms that have been formally verified so far (Albarghouthi and Hsu, 2017; Barthe et al., 2016). In particular, the sparse vector technique is notorious for having many incorrect published variations (Lyu et al., 2017; Chen and Machanavajjhala, 2015). We also evaluate the counterexample generator on some simpler algorithms such as the histogram algorithm (Dwork, 2006), which are also easy for novices to get wrong (by accidentally using too little noise). In all cases, our counterexample generator produces counterexamples for incorrect versions of the algorithms, thus showing its usefulness to both experts and novices.

The false positive error (i.e., generating “counterexamples” for correct code) of our algorithm is controllable because it is based on statistical testing. The false positive rate can be made arbitrarily small simply by giving the algorithm more time to run.
Limitations: it is impossible to create a counterexample/bug detector that works for all programs. For this reason, our counterexample generator is not intended to be used in an adversarial setting (where a rogue developer wants to add an algorithm that appears to satisfy differential privacy but has a back door). In particular, if a program satisfies differential privacy except with an extremely small probability (a setting known as approximate differential privacy (Dwork et al., 2006a)), then our counterexample generator may not detect the violation. Solving this issue is an area for future work.
The rest of the paper is organized as follows. Related work is discussed in Section 2. Background on differential privacy and statistical testing is discussed in Section 3. The counterexample generator is presented in Section 4. Experiments are presented in Section 5. Conclusions and future work are discussed in Section 6.
2. Related Work
Differential privacy
The term differential privacy covers a family of privacy definitions that include pure differential privacy (the topic of this paper) (Dwork et al., 2006b) and its relaxations: approximate differential privacy (Dwork et al., 2006a), concentrated differential privacy (Dwork and Rothblum, 2016; Bun and Steinke, 2016), and Rényi differential privacy (Mironov, 2017). The pure and approximate versions have received the most attention from algorithm designers (e.g., see the book (Dwork et al., 2014)). However, due to the lack of availability of easy-to-use debugging and verification tools, a considerable fraction of published algorithms are incorrect. In this paper, we focus on algorithms for which there is a public record of an error (e.g., variants of the sparse vector method (Lyu et al., 2017; Chen and Machanavajjhala, 2015)) or where a seemingly small change to an algorithm breaks an important component of the algorithm (e.g., variants of the noisy max algorithm (Dwork et al., 2014; Barthe et al., 2016) and the histogram algorithm (Dwork, 2006)).
Programming platforms and verification tools
Several dynamic tools (McSherry, 2009; Roy et al., 2010; Tschantz et al., 2011; Xu et al., 2014; Ebadi et al., 2015) exist for enforcing differential privacy. These tools track privacy budget consumption at runtime and terminate a program when the intended privacy budget is exhausted. On the other hand, static methods exist for verifying that a program obeys differential privacy during any execution, based on relational program logics (Barthe et al., 2012, 2013; Barthe and Olmedo, 2013; Barthe et al., 2014, 2016, 2016; Albarghouthi and Hsu, 2017) and relational type systems (Zhang and Kifer, 2017; Reed and Pierce, 2010; Gaboardi et al., 2013). We note that these methods are largely orthogonal to this paper: their goal is to verify a correct program or to terminate an incorrect one, while our goal is to detect an incorrect program and generate counterexamples for it. The counterexamples provide algorithm designers with valuable guidance for fixing incorrect algorithms. Moreover, we believe our tool fills a currently missing piece in the development of differentially private algorithms: with our tool, immature designs can first be tested for counterexamples before being fed into those dynamic and static tools.
Counterexample generation
Symbolic execution (King, 1976; Cadar et al., 2006; Cadar et al., 2008) is widely used for program testing and bug finding. One attractive feature of symbolic execution is that when a property is violated, it generates counterexamples (i.e., program inputs) that lead to violations. More relevant to this paper is work on testing relational properties based on symbolic execution (Person et al., 2008; Milushev et al., 2012; Farina et al., 2017). However, those techniques apply only to deterministic programs, whereas the differential privacy property inherently involves probabilistic programs, which are beyond their scope.
3. Background
In this section, we discuss relevant background on differential privacy and hypothesis testing.
3.1. Differential Privacy
We view a database as a finite multiset of records from some domain. It is sometimes convenient to represent a database by a histogram, where each cell records the number of times a specific record appears.
Differential privacy relies on the notion of adjacent databases. The two most common definitions of adjacency are: (1) two databases $D_1$ and $D_2$ are adjacent if $D_1$ can be obtained from $D_2$ by adding or removing a single record; (2) two databases $D_1$ and $D_2$ are adjacent if $D_1$ can be obtained from $D_2$ by modifying one record. The notion of adjacency used by an algorithm must be provided to the counterexample generator. We write $D_1 \sim D_2$ to mean that $D_1$ is adjacent to $D_2$ (under whichever definition of adjacency is relevant in the context of a given algorithm).
We use the term mechanism to refer to an algorithm that tries to protect the privacy of its input. In our case, a mechanism is an algorithm that is intended to satisfy differential privacy:
Definition 3.1 (Differential Privacy (Dwork et al., 2006b)).
Let $\epsilon > 0$. A mechanism $M$ is said to be $\epsilon$-differentially private if for every pair of adjacent databases $D_1$ and $D_2$, and every set $E$ of possible outputs, we have $P(M(D_1) \in E) \le e^{\epsilon} P(M(D_2) \in E)$.
The value of $\epsilon$, called the privacy budget, controls the level of privacy: the smaller $\epsilon$ is, the more privacy is guaranteed.
One of the most common building blocks of differentially private algorithms is the Laplace mechanism (Dwork et al., 2006b), which is used to answer numerical queries. Let $\mathcal{D}$ be the set of possible databases. A numerical query is a function $q : \mathcal{D} \to \mathbb{R}^d$ (i.e., it outputs a $d$-dimensional vector of numbers). The Laplace mechanism is based on a concept called global sensitivity, which measures the worst-case effect one record can have on a numerical query:
Definition 3.2 (Global Sensitivity (Dwork et al., 2006b)).
The global sensitivity of a numerical query $q$ is $\Delta_q = \max_{D_1 \sim D_2} \|q(D_1) - q(D_2)\|_1$.
The Laplace mechanism works by adding Laplace noise (having density $f(x) = \frac{1}{2b} e^{-|x|/b}$ and variance $2b^2$) to query answers. The chosen variance depends on $\epsilon$ and the global sensitivity. We use the notation $\text{Lap}(b)$ to refer to this Laplace noise.
Definition 3.3 (The Laplace mechanism (Dwork et al., 2006b)).
For any numerical query $q : \mathcal{D} \to \mathbb{R}^d$, the Laplace mechanism $M_q$ outputs $M_q(D) = q(D) + (\eta_1, \dots, \eta_d)$, where the $\eta_i$ are independent random variables sampled from $\text{Lap}(\Delta_q / \epsilon)$.
Theorem 3.4 ((Dwork et al., 2014)).
The Laplace mechanism $M_q$ is $\epsilon$-differentially private.
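To make Definition 3.3 concrete, here is a minimal Python sketch of the Laplace mechanism (the function names and the inverse-CDF sampling strategy are ours, not the paper's code):

```python
import math
import random

def laplace_noise(scale):
    # Inverse-CDF sampling of a Laplace(0, scale) random variable.
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def laplace_mechanism(query_answers, sensitivity, epsilon):
    # Add independent Lap(sensitivity / epsilon) noise to each query answer,
    # which yields epsilon-differential privacy (Theorem 3.4).
    scale = sensitivity / epsilon
    return [a + laplace_noise(scale) for a in query_answers]
```

For a histogram (global sensitivity 1 under add/remove adjacency), `laplace_mechanism(counts, 1.0, 0.5)` would release the counts with 0.5-differential privacy.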
3.2. Hypothesis Testing
A statistical hypothesis is a claim about the parameters of the distribution that generated the data. The null hypothesis, denoted by $H_0$, is a statistical hypothesis that we are trying to disprove. For example, if we have two samples $X$ and $Y$, where $X$ was generated by a Binomial$(n, p_1)$ distribution and $Y$ was generated by a Binomial$(n, p_2)$ distribution, one null hypothesis could be $H_0 : p_1 = p_2$ (that is, we would like to know if the data supports the conclusion that $X$ and $Y$ came from different distributions). The alternative hypothesis, denoted by $H_1$, is the complement of the null hypothesis (e.g., $H_1 : p_1 \neq p_2$).
A hypothesis test is a procedure that takes in a data sample
and either rejects the null hypothesis or fails to reject the null hypothesis. A hypothesis test can have two types of errors: type I and type II. A type I error occurs if the test incorrectly rejects $H_0$ when it is in fact true. A type II error occurs if the test fails to reject $H_0$ when the alternative hypothesis is true. Type I and type II errors are analogous to false positives and false negatives, respectively.
In most problems, controlling the type I error is the most important. In such cases, one specifies a significance level $\alpha$ and requires that the probability of a type I error be at most $\alpha$. Commonly used values for $\alpha$ are $0.05$ and $0.01$. In order to allow users to control the type I error, the hypothesis test also returns a number $p$ – known as the p-value – which is a probabilistic estimate of how unlikely it is that the null hypothesis is true. The user rejects the null hypothesis if $p \le \alpha$. In order for this to work (i.e., in order for the type I error to be below $\alpha$), the p-value must satisfy certain technical conditions: (1) a p-value is a function $p(X)$ of a data sample $X$; (2) $0 \le p(X) \le 1$; (3) if the null hypothesis is true, then $P(p(X) \le \alpha) \le \alpha$ for every $\alpha \in [0, 1]$.
A relevant example of a hypothesis test is Fisher’s exact test (Fisher, 1935) for two binomial populations. Let $c_1$ be a sample from a Binomial$(n, p_1)$ distribution and let $c_2$ be a sample from a Binomial$(n, p_2)$ distribution. Here $p_1$ and $p_2$ are unknown. Using these values of $c_1$ and $c_2$, the goal is to test the null hypothesis $H_0 : p_1 \le p_2$ against the alternative $H_1 : p_1 > p_2$. Let $s = c_1 + c_2$. The key insight behind Fisher’s test is that if $C_1 \sim \text{Binomial}(n, p_1)$ (this is read as “$C_1$ is a random variable having the Binomial$(n, p_1)$ distribution”) and $C_2 \sim \text{Binomial}(n, p_2)$ and if $p_1 = p_2$, then the value $P(C_1 \ge c_1 \mid C_1 + C_2 = s)$ does not depend on the unknown parameters $p_1$ or $p_2$ and can be computed from the cumulative distribution function of the hypergeometric distribution; specifically, it is equal to $1 - \text{Hypergeom.cdf}(c_1 - 1 \mid 2n, n, s)$. When $p_1 < p_2$, the quantity $P(C_1 \ge c_1 \mid C_1 + C_2 = s)$ cannot be computed without knowing $p_1$ and $p_2$. However, it is less than $1 - \text{Hypergeom.cdf}(c_1 - 1 \mid 2n, n, s)$. Thus it can be shown that $1 - \text{Hypergeom.cdf}(c_1 - 1 \mid 2n, n, s)$ is a valid p-value, and so Fisher’s exact test rejects the null hypothesis when this quantity is $\le \alpha$.
4. Counterexample Detection
For a mechanism $M$ that does not satisfy $\epsilon$-differential privacy, the goal is to prove this failure. By Definition 3.1, this involves finding a pair of adjacent databases $D_1 \sim D_2$ and an output event $E$ such that $P(M(D_1) \in E) > e^{\epsilon} P(M(D_2) \in E)$. Thus generating a counterexample involves finding these two adjacent inputs $D_1$ and $D_2$ and the bad output set $E$, and showing that for these choices, $P(M(D_1) \in E) > e^{\epsilon} P(M(D_2) \in E)$.
Ideally, one would compute the probabilities $P(M(D_1) \in E)$ and $P(M(D_2) \in E)$. Unfortunately, for sophisticated mechanisms, it is not always possible to compute these quantities exactly. However, we can sample from these distributions many times by repeatedly running $M(D_1)$ and $M(D_2)$ and counting the number of times that the outputs fall into $E$. Then, we need a statistical test to reject the null hypothesis (or fail to reject it if the algorithm is differentially private).
We will be using the following conventions:

The input to most mechanisms is actually a list of query answers rather than a database directly. For example, algorithms for releasing differentially private histograms operate on a histogram of the data; the sparse vector mechanism operates on a sequence of queries that each have global sensitivity equal to 1. Thus, we require the user to specify how the input query answers can differ on two adjacent databases. For example, in a histogram, exactly one cell count changes by at most 1. In the sparse vector technique (Dwork et al., 2014), every query answer changes by at most 1. To simplify the discussion, we abuse notation and use $D_1$ and $D_2$ to also denote the vectors of query answers computed from two adjacent databases. Thus, when we write a pair of vectors $D_1$ and $D_2$, we mean that there are adjacent databases and a list of queries such that the queries evaluate to $D_1$ on the first database and to $D_2$ on the second database.

We use $\epsilon_0$ to indicate the privacy level that a mechanism $M$ claims to achieve.

We use $\Omega$ for the set of all possible outputs (i.e., the range) of the mechanism $M$. We use $\omega$ for a single output of $M$.

We call a subset $E \subseteq \Omega$ an event. We use $P(M(D_1) \in E)$ (respectively, $P(M(D_2) \in E)$) to denote the probability that the output of $M$ falls into $E$ when executed on database $D_1$ (respectively, $D_2$).

Some mechanisms take additional inputs, e.g., the sparse vector mechanism. We collectively refer to them as args.
Our discussion is organized as follows. We provide an overview of the counterexample generator in Section 4.1. Then we incrementally explain our approach. In Section 4.2 we present the hypothesis test. That is, supposing we already have query sequences $D_1$ and $D_2$ that are generated from adjacent databases and an output set $E$, how do we test whether $P(M(D_1) \in E) > e^{\epsilon} P(M(D_2) \in E)$ or $P(M(D_2) \in E) > e^{\epsilon} P(M(D_1) \in E)$? Next, in Section 4.3, we consider the question of output selection. That is, supposing we already have query answers $D_1$ and $D_2$ that are generated from adjacent databases, how do we decide which event $E$ should be used in the hypothesis test? Finally, in Section 4.4, we consider the problem of generating the adjacent query sequences $D_1$ and $D_2$ as well as the additional inputs args.
The details of specific mechanisms we test for violations of differential privacy will be given in the experiments in Section 5.
4.1. Overview
At a high level, the counterexample generator can be summarized by the pseudocode in Algorithm 1. First, it generates an InputList, a set of candidate tuples of the form $(D_1, D_2, \text{args})$. That is, instead of returning a single pair of adjacent inputs and any auxiliary arguments the mechanism may need, we return multiple candidates which will be filtered later. Each adjacent pair is designed to be short so that a developer can understand the problematic inputs and trace them through the code of the mechanism $M$. Because the inputs are short, the code of $M$ will also run fast, so it will be possible to later evaluate $M(D_1)$ and $M(D_2)$ multiple times very quickly.
The next step is the EventSelector. It takes each tuple from InputList and runs $M(D_1)$ and $M(D_2)$ multiple times. Based on the type of the outputs, it generates a set of candidates for $E$. For example, if the output is a real number, then the set of candidates is a set of intervals. For each candidate $E$ and each tuple $(D_1, D_2, \text{args})$, it counts how many times $M(D_1)$ produced an output in $E$ and how many times $M(D_2)$ produced an output in $E$. Based on these results, it picks the one specific $E$ and one tuple $(D_1, D_2, \text{args})$ that it believes are most likely to show a violation of differential privacy.
Finally, the HypothesisTest takes the selected $E$, $D_1$, $D_2$, and args and checks if it can detect statistical evidence that $P(M(D_1) \in E) > e^{\epsilon} P(M(D_2) \in E)$ – which corresponds to the p-value $p_1$ – or $P(M(D_2) \in E) > e^{\epsilon} P(M(D_1) \in E)$ – which corresponds to the p-value $p_2$.
It is important to note that the EventSelector also uses the HypothesisTest internally as a subroutine to filter out candidates. That is, for every candidate $E$ and every candidate $(D_1, D_2, \text{args})$, it runs the HypothesisTest and treats the returned p-value as a score. The combination of $E$ and $(D_1, D_2, \text{args})$ with the best score is returned by the EventSelector. Note that the EventSelector is using the HypothesisTest in an exploratory way – it evaluates many hypotheses and returns the best one it finds. This is why the $E$ and $(D_1, D_2, \text{args})$ that are finally chosen need to be evaluated again on Line 1 using fresh samples from $M$.
Interpreting the results
One of the best ways of understanding the behavior of the counterexample generator is to look at the p-values it outputs. That is, we take a mechanism $M$ that claims to satisfy $\epsilon_0$-differential privacy and, for each $\epsilon$ close to $\epsilon_0$, we test whether it satisfies $\epsilon$-differential privacy (that is, even though $M$ claims to satisfy $\epsilon_0$-differential privacy, we may want to test if it satisfies differential privacy for some other value of $\epsilon$ that is close to $\epsilon_0$). The hypothesis tester returns two p-values:

$p_1$. Small values indicate that probably $P(M(D_1) \in E) > e^{\epsilon} P(M(D_2) \in E)$.

$p_2$. Small values indicate that probably $P(M(D_2) \in E) > e^{\epsilon} P(M(D_1) \in E)$.
For each $\epsilon$, we plot the minimum of $p_1$ and $p_2$. Figure 1 shows typical results that would appear when the counterexample detector is run with real mechanisms as input.
In Figure 1(a), $M$ correctly satisfies the claimed $\epsilon_0$-differential privacy. In that plot, we see that the p-values corresponding to $\epsilon < \epsilon_0$ are very low, meaning that the counterexample generator can prove that the algorithm does not satisfy $\epsilon$-differential privacy for those smaller values of $\epsilon$. Near $\epsilon_0$ it becomes difficult to find counterexamples; that is, if an algorithm satisfies $\epsilon_0$-differential privacy, it is very hard to statistically prove that it does not satisfy $\epsilon$-differential privacy for $\epsilon$ slightly below $\epsilon_0$. This is a typical feature of hypothesis tests, as it becomes difficult to reject the null hypothesis when it is only slightly incorrect (e.g., when the true privacy parameter is only slightly different from the $\epsilon$ we are testing). Now, any algorithm that satisfies $\epsilon_0$-differential privacy also satisfies $\epsilon$-differential privacy for all $\epsilon \ge \epsilon_0$. This behavior is seen in Figure 1(a), as the p-values are large for all larger values of $\epsilon$.
Figure 1(b) shows a graph that can arise from two distinct scenarios. One situation is when the mechanism claims to provide $\epsilon_0$-differential privacy but actually provides more privacy (i.e., $\epsilon$-differential privacy for some $\epsilon < \epsilon_0$). In this figure, the counterexample generator could prove, for example, that $M$ does not satisfy $\epsilon$-differential privacy for some $\epsilon$ well below $\epsilon_0$, but leaves open the possibility that it satisfies $\epsilon_0$-differential privacy. The other situation is when our tool has failed to find good counterexamples. Thus when a mechanism is correct, good precision by the counterexample generator means that the line starts rising close to $\epsilon_0$ (but before the dotted line), and worse precision means that the line starts rising much earlier.
Figure 1(c) shows a typical situation in which an algorithm claims to satisfy $\epsilon_0$-differential privacy but actually provides less privacy than advertised. In this case, the counterexample generator can generate good counterexamples at $\epsilon_0$ (the dotted line) and even at much higher values of $\epsilon$. When a mechanism is incorrect, such a graph indicates good precision by the counterexample generator.
Limitations
In some cases, finding counterexamples requires large input datasets. In those cases, searching for the right inputs and running algorithms on them many times will impact the ability of our counterexample generator to find counterexamples. This is a limitation of all techniques based on statistical tests.
Another important case where our counterexample generator is not expected to perform well is when violations of differential privacy happen very rarely. For example, consider a mechanism $M$ that checks whether its input is a particular database $D_1$. If so, with some tiny probability it outputs 1, and otherwise it outputs 0 (if the input is not $D_1$, $M$ always outputs 0). $M$ does not satisfy $\epsilon$-differential privacy for any value of $\epsilon$. However, showing this statistically is very difficult. Supposing $D_1$ and $D_2$ are adjacent databases, it requires running $M(D_1)$ and $M(D_2)$ billions of times to observe that an output of 1 is possible under $D_1$ but is at least $e^{\epsilon}$ times less likely under $D_2$.
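A hypothetical sketch of such a mechanism follows (the input vector and the probability constant are illustrative choices of ours, not values from the paper):

```python
import random

def rare_leak_mechanism(db):
    # Outputs 1 only on one specific input, and even then only with a tiny
    # probability; otherwise outputs 0. This violates differential privacy
    # (the output 1 is impossible on any other database), but a
    # sampling-based test would almost never observe the violation.
    if db == [1, 0, 0, 0, 0] and random.random() < 2.0 ** -40:
        return 1
    return 0
```

Since the output 1 has probability $2^{-40}$ on one database and probability 0 on its neighbors, the likelihood ratio is unbounded, yet on the order of a trillion runs would be needed before a statistical test even observes a single 1.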
Addressing both of these problems will likely involve incorporation of program analysis, such as symbolic execution, into our statistical framework and is a direction for future work.
4.2. Hypothesis Testing
Suppose we have a mechanism $M$, inputs $D_1$, $D_2$, and args, and an output set $E$ (we discuss the generation of $D_1$, $D_2$, and args in Section 4.4 and of $E$ in Section 4.3). We would like to check if $P(M(D_1) \in E) > e^{\epsilon} P(M(D_2) \in E)$ or if $P(M(D_2) \in E) > e^{\epsilon} P(M(D_1) \in E)$, as that would demonstrate a violation of differential privacy. We treat the case $P(M(D_1) \in E) > e^{\epsilon} P(M(D_2) \in E)$ in this section, as the other case is symmetric.
To do this, the high level idea is to:

Define $p_1 = P(M(D_1) \in E)$ and $p_2 = P(M(D_2) \in E)$.

Formulate the null hypothesis as $H_0 : p_1 \le e^{\epsilon} p_2$ and the alternative as $H_1 : p_1 > e^{\epsilon} p_2$.

Run $M$ with inputs $D_1$ and $D_2$ independently $n$ times each, recording the resulting output samples.

Count the number of times the result falls in $E$ in each case. Let $c_1$ be the number of outputs of $M(D_1)$ that fall in $E$, and $c_2$ the number of outputs of $M(D_2)$ that fall in $E$. Intuitively, $c_1 \gg e^{\epsilon} c_2$ provides strong evidence against the null hypothesis.

Calculate a p-value based on $c_1$ and $c_2$ to determine how unlikely the null hypothesis is.
The challenge is, of course, in the last step, as we don’t know what $p_1$ and $p_2$ are. One direction is to estimate them from $c_1$ and $c_2$. However, it is also challenging to estimate the variance of our estimates $\hat{p}_1$ and $\hat{p}_2$ (the higher the variance, the less the test should trust the estimates).
Instead, we take a different approach that allows us to conduct the test without knowing what $p_1$ and $p_2$ are. First, we note that $c_1$ and $c_2$ are equivalent to samples from a Binomial$(n, p_1)$ distribution and a Binomial$(n, p_2)$ distribution, respectively. We first consider the border case where $p_1 = e^{\epsilon} p_2$. Consider a sample $\tilde{c}_1$ generated from $c_1$ by sampling from the Binomial$(c_1, 1/e^{\epsilon})$ distribution. We note that this sample enjoys the following property (which implies that in the border case, $\tilde{c}_1$ has the same distribution as $c_2$):
Lemma 4.1.
Let $C_1 \sim \text{Binomial}(n, e^{\epsilon} p)$ and let $\tilde{C}_1$ be generated from $C_1$ by sampling from the Binomial$(C_1, 1/e^{\epsilon})$ distribution. The marginal distribution of $\tilde{C}_1$ is Binomial$(n, p)$.
Proof.
The relationship between Binomial and Bernoulli random variables means that $C_1 = \sum_{i=1}^{n} B_i$, where each $B_i$ is a Bernoulli$(e^{\epsilon} p)$ random variable. Generating $\tilde{C}_1$ from $C_1$ is the same as doing the following: set $\tilde{B}_i = 0$ if $B_i = 0$. If $B_i = 1$, set $\tilde{B}_i = 1$ with probability $1/e^{\epsilon}$ (and set $\tilde{B}_i = 0$ otherwise). Then set $\tilde{C}_1 = \sum_{i=1}^{n} \tilde{B}_i$. Hence, the marginal distribution of each $\tilde{B}_i$ is a Bernoulli random variable with $P(\tilde{B}_i = 1) = e^{\epsilon} p \cdot \frac{1}{e^{\epsilon}} = p$.
This means that the marginal distribution of $\tilde{C}_1$ is Binomial$(n, p)$. ∎
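Lemma 4.1 can be checked empirically with a quick simulation (a sketch of ours, not part of the paper's tool): draw from Binomial$(n, e^{\epsilon} p)$, thin each success with probability $1/e^{\epsilon}$, and compare the resulting mean with $n p$.

```python
import math
import random

def binomial(n, p):
    # Naive Binomial(n, p) sampler via n Bernoulli trials.
    return sum(1 for _ in range(n) if random.random() < p)

def thin(count, keep_prob):
    # Keep each of `count` successes independently with probability keep_prob.
    return sum(1 for _ in range(count) if random.random() < keep_prob)

n, p, eps, trials = 500, 0.3, 0.5, 4000
assert math.exp(eps) * p <= 1.0          # boosted parameter must be a probability

total = 0
for _ in range(trials):
    c1 = binomial(n, math.exp(eps) * p)  # C1 ~ Binomial(n, e^eps * p)
    total += thin(c1, math.exp(-eps))    # marginally Binomial(n, p) by Lemma 4.1
empirical_mean = total / trials          # should be close to n * p = 150
```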
Thus we have the following facts that follow immediately from the lemma:

If $p_1 > e^{\epsilon} p_2$, then the distribution of $\tilde{C}_1$ is Binomial$(n, p_1/e^{\epsilon})$ with $p_1/e^{\epsilon} > p_2$, and so $\tilde{C}_1$ has a larger Binomial parameter than $C_2$ (which is Binomial$(n, p_2)$). We want our test to be able to reject the null hypothesis in this case.

If $p_1 = e^{\epsilon} p_2$, then the distribution of $\tilde{C}_1$ is Binomial$(n, p_1/e^{\epsilon})$ with $p_1/e^{\epsilon} = p_2$, and so $\tilde{C}_1$ has the same Binomial parameter as $C_2$. We do not want our test to reject the null hypothesis in this case.

If $p_1 < e^{\epsilon} p_2$, then the distribution of $\tilde{C}_1$ is Binomial$(n, p_1/e^{\epsilon})$ with $p_1/e^{\epsilon} < p_2$, and so $\tilde{C}_1$ has a smaller Binomial parameter than $C_2$ (which is Binomial$(n, p_2)$). We do not want to reject the null hypothesis in this case.
Thus, by randomly generating $\tilde{c}_1$ from $c_1$, we have (randomly) reduced the problem of testing $p_1 > e^{\epsilon} p_2$ vs. $p_1 \le e^{\epsilon} p_2$ (on the basis of $c_1$ and $c_2$) to the problem of testing $\tilde{p}_1 > p_2$ vs. $\tilde{p}_1 \le p_2$ (on the basis of $\tilde{c}_1$ and $c_2$). Now, checking whether $\tilde{c}_1$ and $c_2$ come from the same distribution can be done with Fisher’s exact test (see Section 3): the p-value is $1 - \text{Hypergeom.cdf}(\tilde{c}_1 - 1 \mid 2n, n, \tilde{c}_1 + c_2)$. (Here we use notation from the SciPy package (Jones et al., 2001), where Hypergeom.cdf denotes the cumulative distribution function of the hypergeometric distribution.) This is done in the function pvalue in Algorithm 2.
To summarize, given $c_1$ and $c_2$, we first sample $\tilde{c}_1$ from the Binomial$(c_1, 1/e^{\epsilon})$ distribution and then return the p-value $1 - \text{Hypergeom.cdf}(\tilde{c}_1 - 1 \mid 2n, n, \tilde{c}_1 + c_2)$. Since this is a random reduction, we reduce its variance by sampling multiple times and averaging the p-values. That is, we run the pvalue function (Algorithm 2) multiple times with the same inputs and average the p-values it returns.
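This procedure can be sketched directly in Python. The rendition below is ours and uses only the standard library (the paper's Algorithm 2 uses SciPy's `hypergeom.cdf` instead of the hand-rolled CDF here):

```python
import math
import random

def hypergeom_cdf(k, M, n, N):
    # P(X <= k), where X counts successes in N draws (without replacement)
    # from a population of M items containing n successes.
    lo = max(0, N - (M - n))
    hi = min(k, n, N)
    total = math.comb(M, N)
    return sum(math.comb(n, i) * math.comb(M - n, N - i)
               for i in range(lo, hi + 1)) / total

def pvalue(c1, c2, n, epsilon):
    # Randomized reduction: thin c1 with probability 1/e^eps, then apply
    # the one-sided Fisher's exact test to (c1_tilde, c2).
    c1_tilde = sum(1 for _ in range(c1) if random.random() < math.exp(-epsilon))
    return 1.0 - hypergeom_cdf(c1_tilde - 1, 2 * n, n, c1_tilde + c2)

def averaged_pvalue(c1, c2, n, epsilon, reps=100):
    # Average repeated runs to reduce the variance of the random reduction.
    return sum(pvalue(c1, c2, n, epsilon) for _ in range(reps)) / reps
```

For example, with $n = 100$ runs on each database, counts $c_1 = 90$ and $c_2 = 10$ at $\epsilon = 0.1$ yield a p-value near zero (strong evidence of a violation), while $c_1 = c_2$ yields a large p-value.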
4.3. Event Selection
Having discussed how to test whether $P(M(D_1) \in E) > e^{\epsilon} P(M(D_2) \in E)$ or $P(M(D_2) \in E) > e^{\epsilon} P(M(D_1) \in E)$ when $D_1$, $D_2$, and $E$ were prespecified, we now discuss how to select the event $E$ that is most likely to show violations of differential privacy.
One of the challenges is that different mechanisms could have different output types (e.g., a discrete number, a vector of numbers, a vector of categorical values, etc.). To address this problem, we define a search space of possible events to look at. The search space depends on the type of the output of $M$, which can be determined by running $M(D_1)$ and $M(D_2)$ multiple times.

The output is a fixed-length list of categorical values. We first run $M$ once and ask it to not use any noise (i.e., tell it to satisfy differential privacy with $\epsilon = \infty$). Denote this output as $\omega^*$. Now, when $M$ runs with its preferred privacy settings to produce an output $\omega$, we define $d(\omega)$ to be the Hamming distance between $\omega$ and $\omega^*$. The search space contains the events $\{\omega : d(\omega) = t\}$ for $t = 0, \dots, k$, where $k$ is the fixed length of the output of $M$. Another set of events relates to the count of a categorical value in the output: for each possible categorical value $a$, we define the events $\{\omega : a \text{ appears in } \omega \text{ exactly } t \text{ times}\}$ for $t = 0, \dots, k$. The overall search space is the union of all of these events.

The output is a variable-length list of categorical values. In this case, one extra set of events we look at corresponds to the length of the output. For example, we may check if $|\omega| = t$. Hence, we define the events $\{\omega : |\omega| = t\}$ for $t = 0, 1, 2, \dots$ For the search space, we use these events unioned with the search space from the previous case.

The output is a fixed-length list of numeric values. In this case, the output is of the form $\omega = (\omega_1, \dots, \omega_k)$. Our search space is the union of events of the form $\{\omega : \omega_i \in [a, b]\}$ over the coordinates $i$ and interval endpoints $a \le b$. That is, we would end up checking if $\omega_1 \in [a, b]$, etc. To save time, we often restrict $a$ and $b$ to be multiples of a small fixed number. In the case that the output is always an integer array, we replace the condition “$\omega_i \in [a, b]$” with “$\omega_i = t$” for each integer $t$.

$M$ outputs a variable-length list of numeric values.
The search space is the union of the search space in Case 3 and the length-based events in Case 2.

$M$ outputs a variable-length list of mixed categorical and numeric values. In this case, we separate the categorical values from the numeric values and use the cross product of the search spaces for numeric and categorical values. For instance, events would be of the form “$\omega$ has a given number of categorical components equal to a given value, and the average of the numerical components of $\omega$ lies in a given interval $[a, b]$.”
The EventSelector is designed to return one event $E$ for use in the hypothesis test in Algorithm 1. The way the EventSelector works is that it receives an InputList, which is a set of tuples $(D_1, D_2, \text{args})$, where $D_1, D_2$ are adjacent databases and args is a set of values for any other parameters $M$ needs. For each such tuple, it runs $M(D_1)$ and $M(D_2)$ $n$ times each. Then for each possible event $E$ in the search space, it runs the hypothesis test (as an exploratory tool) to get a p-value. The combination of $(D_1, D_2, \text{args})$ and $E$ that produces the lowest p-value is then returned to Algorithm 1. Algorithm 1 uses those choices to run the real hypothesis test on fresh executions of $M$ on $D_1$ and $D_2$.
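As a small illustration of how the search space might be enumerated for Case 3 (fixed-length numeric outputs), the sketch below generates per-coordinate interval events with endpoints on a coarse grid (the function names and grid parameters are our assumptions, not the paper's exact implementation):

```python
def interval_events(dim, step, lo, hi):
    # Events of the form {omega : omega[i] in [a, b]}, where a and b are
    # multiples of `step`; represented as (coordinate, a, b) triples.
    k = round((hi - lo) / step)
    points = [lo + j * step for j in range(k + 1)]
    return [(i, a, b)
            for i in range(dim)
            for idx, a in enumerate(points)
            for b in points[idx:]]

def in_event(output, event):
    # Membership test used when counting how often M's output lands in E.
    i, a, b = event
    return a <= output[i] <= b
```

An event selector would count, for each such event, how often samples of $M(D_1)$ and $M(D_2)$ satisfy `in_event`, and score the event with the exploratory hypothesis test.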
4.4. Input Generation
In this section we discuss our approaches for generating candidate tuples $(D_1, D_2, \text{args})$, where $D_1, D_2$ are adjacent databases and args is a set of auxiliary parameters that a mechanism may need.
4.4.1. Database Generation
To find the adjacent databases that are likely to form the basis of counterexamples that illustrate violations of differential privacy, we adopt a simple and generic approach that works surprisingly well. Recalling that the inputs to mechanisms are best modeled as a vector of query answers, we use the types of patterns shown in Table 1.


Category  Sample D1  Sample D2 
One Above  [1, 1, 1, 1, 1]  [2, 1, 1, 1, 1] 
One Below  [1, 1, 1, 1, 1]  [0, 1, 1, 1, 1] 
One Above Rest Below  [1, 1, 1, 1, 1]  [2, 0, 0, 0, 0] 
One Below Rest Above  [1, 1, 1, 1, 1]  [0, 2, 2, 2, 2] 
Half Half  [1, 1, 1, 1, 1]  [0, 0, 0, 2, 2] 
All Above & All Below  [1, 1, 1, 1, 1]  [2, 2, 2, 2, 2] 
X Shape  [1, 1, 0, 0, 0]  [0, 0, 1, 1, 1] 

The “One Above” and “One Below” categories are suitable for algorithms whose input is a histogram (i.e., in adjacent databases, at most one query can change, and it will change by at most 1). The rest of the categories are suitable when in adjacent databases every query can change by at most one (i.e., the queries have sensitivity 1; for queries with larger sensitivity, the extension is obvious – each change of 1 becomes a change equal to the sensitivity).
The design of the categories is based on the wide variety of changes in query answers that are possible when evaluated on one database and on an adjacent database. For example, it could be the case that a few of the queries increase (by 1, if their sensitivity is 1, or by their sensitivity in the general case) but most of them decrease. A simple representative of this situation is “One Above Rest Below,” in which one query increases and the rest decrease. The category “One Below Rest Above” is the reverse.
Another situation is where roughly half of the queries increase and half decrease (when evaluated on a database compared to when evaluated on an adjacent database). This scenario is captured by the “Half Half” category. Another situation is where all of the queries increase (or all decrease). This is captured by the “All Above & All Below” category. Finally, the “X Shape” category captures the setting where the query answers are not all the same, and some increase and others decrease when evaluated on one database compared to an adjacent database.
These categories were chosen from our desire to allow counterexamples to be easily understood by mechanism designers (and to make it easier for them to manually trace the code to understand the problems). Thus the samples are short and simple. We consider inputs of length 5 (as in Table 1) and also versions of length 10.
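The patterns in Table 1 are straightforward to generate programmatically; a sketch follows (the function and category names are ours):

```python
def generate_inputs(length=5):
    # Candidate adjacent query-answer vectors following Table 1. Each pair
    # differs by at most 1 in every coordinate (sensitivity-1 queries).
    ones = [1] * length
    half = length // 2
    return {
        "one_above":            (ones, [2] + [1] * (length - 1)),
        "one_below":            (ones, [0] + [1] * (length - 1)),
        "one_above_rest_below": (ones, [2] + [0] * (length - 1)),
        "one_below_rest_above": (ones, [0] + [2] * (length - 1)),
        "half_half":            (ones, [0] * (length - half) + [2] * half),
        "all_above_all_below":  (ones, [2] * length),
        "x_shape":              ([1] * half + [0] * (length - half),
                                 [0] * half + [1] * (length - half)),
    }
```

These pairs (at lengths 5 and 10) would form the InputList handed to the EventSelector.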
4.4.2. Argument Generation
Some differentially private algorithms require extra parameters beyond the database. For example, the sparse vector technique (Dwork et al., 2014), shown in Algorithm 11, takes as inputs a threshold T and a bound N. It aims to report the numerical queries whose answers are larger than T. However, for privacy reasons, it stops after it has returned N noisy query answers whose values are greater than T. These two arguments are specific to the algorithm, and their proper values depend on the desired privacy level as well as the algorithm's precision.
To find values of auxiliary parameters (such as T and N in Sparse Vector), we build an argument generator based on symbolic execution (King, 1976), which is typically used for bug finding: it generates concrete inputs that violate assertions in a program. In general, a symbolic executor assigns symbolic values to inputs, rather than concrete values as normal execution would. As the execution proceeds, the executor maintains a symbolic program state at each assertion and generates constraints that would violate the assertion. When those constraints are satisfiable, concrete inputs (i.e., a solution of the constraints) are generated.
Compared with standard symbolic execution, a major difference in our argument generation is that we are interested in algorithm arguments that will likely maximize the privacy cost of an algorithm. In other words, there is no obvious assertion to be checked in our argument generation. To proceed, we use two heuristics that are likely to incur a large privacy cost:

The first heuristic applies to parameters that affect noise generation. For example, in Sparse Vector the algorithm adds Laplace noise whose scale grows with the bound N. For such a variable, we use the value that results in a small amount of noise (i.e., N = 1). A small amount of noise is favorable since it reduces the variance in the hypothesis testing (Section 4.2).

The second heuristic (for variables that do not affect noise) prefers arguments that make two program executions using two different databases (as described in Section 4.4.1) take as many diverging branches as possible. The reason is that diverging branches are likely to consume more privacy budget.
Next, we give a more detailed overview of our customized symbolic executor. The symbolic executor takes a pair of concrete databases as inputs (as described in Section 4.4.1) and uses symbolic values for the other input parameters. Random samples in the program (e.g., samples from the Laplace distribution) are set to the value 0 in the symbolic execution. Then, the symbolic executor tracks symbolic program states along the program execution in the standard way (King, 1976). For example, the executor will generate a constraint^5 (x = α + 1) after an assignment (x := y + 1), assuming that variable y has the symbolic value α before the assignment. Also, the executor unrolls loops in the source code, which is standard in most symbolic executors. ^5For simplicity, we use a simple representation for constraints; Z3 has an internal format, and a user can use either Z3's APIs or the SMT2 format (Ranise and Tinelli, 2006) to represent constraints.
Unlike standard symbolic executors, ours conceptually tracks a pair of symbolic program states along the program execution (one on concrete database D1, and one on concrete database D2). Moreover, it also generates extra constraints, according to the two heuristics above, in the hope of maximizing the privacy cost of the algorithm. In particular, it handles two kinds of statements in the following way:

Sampling. The executor generates two constraints for a sampling statement: one that eliminates randomness in symbolic execution by fixing the sample to the value 0, and one that ensures a small amount of noise. Consider a statement (η := Lap(λ)). The executor generates the constraint (η = 0), as well as a constraint that minimizes the expression λ.

Branch. The executor generates a constraint that makes the two executions diverge on branches. Consider a branch statement (if e then c1 else c2). Assume that the executor has symbolic values s1 and s2 for the value of expression e on databases D1 and D2 respectively; it will generate the constraint (s1 ≠ s2) to make the executions diverge. Note that unlike the other constraints, a diverging constraint might be unsatisfiable (e.g., if the query answers under D1 and D2 are the same). However, our goal is to maximize the number of satisfiable diverging constraints, which can be achieved by a MaxSMT solver.
The executor then uses an external MaxSMT solver, such as Z3 (De Moura and Bjørner, 2008), on all generated constraints to find arguments that maximize the number of diverging branches.
For example, the correct version of the Sparse Vector algorithm (see the complete algorithm in Algorithm 11) has the parameter T (a threshold). It has a branch that tests whether the noisy query answer is above the noisy threshold T̂:

(if q[i] + η₂ ≥ T̂)

Here, η₂ is a noise variable, q[i] is one query answer (i.e. one of the components of the input of the algorithm) and T̂ is a noisy threshold (T̂ := T + η₁). Suppose we start from the database candidate ([1, 1, 1, 1, 1], [2, 2, 2, 2, 2]). The symbolic executor assigns symbolic values to the parameters and unrolls the loop in the algorithm, where each iteration handles one noisy query. Along the execution, it updates the program states. For example, the statement (T̂ := T + η₁) results in T̂ = T, since the sample η₁ is fixed to 0. For the first execution of the branch of interest, the executor tracks the symbolic program state (q[0] = 1 on D1 and q[0] = 2 on D2, with η₂ = 0 and T̂ = T), as well as the following constraint for diverging branches:

(1 ≥ T) ≠ (2 ≥ T)

Similarly, the executor generates constraints from the other iterations. In this example, the MaxSMT solver returns a value of T between 1 and 2, so that the constraints from all iterations are satisfied. This value of T is used as arg in the candidate tuple (D1, D2, arg).
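The MaxSMT step in this worked example can be mimicked with a simple grid search in place of Z3. The sketch below (ours, not the tool's implementation) fixes the noise variables to 0 and counts diverging branch outcomes per unrolled iteration:

```python
# Grid-search stand-in for the MaxSMT query: with noise variables fixed to 0,
# iteration i of the unrolled loop diverges iff (d1[i] >= T) != (d2[i] >= T).
def count_diverging(T, d1, d2):
    """Number of loop iterations whose branch outcome differs on d1 vs d2."""
    return sum((a >= T) != (b >= T) for a, b in zip(d1, d2))

def pick_threshold(d1, d2, grid):
    """Return the candidate T satisfying the most diverging constraints."""
    return max(grid, key=lambda T: count_diverging(T, d1, d2))

d1, d2 = [1, 1, 1, 1, 1], [2, 2, 2, 2, 2]
T = pick_threshold(d1, d2, [i / 10 for i in range(-10, 41)])
assert 1 < T <= 2 and count_diverging(T, d1, d2) == 5
```

Any T in (1, 2] satisfies all five diverging constraints here, matching the value returned by the solver in the example above.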
5. Experiments
We implemented our counterexample detection framework with all of its components, including the hypothesis test, the event selector and the input generator. The implementation is publicly available^6. The tool takes an algorithm implementation and the desired privacy bound ε₀, and generates counterexamples if the algorithm does not satisfy ε₀-differential privacy. ^6https://github.com/cmla-psu/statdp
In this section we evaluate our detection framework on some popular privacy mechanisms and their variations. We demonstrate the power of our tool: for mechanisms that falsely claim to be differentially private, it produces convincing evidence that this is not the case in just a few seconds.
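As rough intuition for what the framework checks, the definition of ε-differential privacy can be probed empirically: estimate the probability of an output event on two adjacent databases and flag a violation when the ratio clearly exceeds e^ε. The sketch below is a crude Monte Carlo stand-in for the tool's actual hypothesis test; the function names and the slack margin are our own:

```python
import math
import random

def estimate_event_prob(alg, db, event, trials=20000):
    """Empirical probability that alg(db) produces an output in the event."""
    return sum(event(alg(db)) for _ in range(trials)) / trials

def looks_like_violation(alg, d1, d2, event, epsilon, slack=0.1):
    """Flag a violation of epsilon-DP when Pr[alg(d1) in E] exceeds
    e^epsilon * Pr[alg(d2) in E] by more than a sampling-slack margin.
    (The tool uses a proper hypothesis test; this is only a sketch.)"""
    p1 = estimate_event_prob(alg, d1, event)
    p2 = estimate_event_prob(alg, d2, event)
    return p1 > math.exp(epsilon) * p2 + slack

# A mechanism that ignores privacy entirely is flagged immediately.
assert looks_like_violation(lambda db: db[0], [2], [1], lambda o: o == 2, 0.5)
```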
5.1. Noisy Max
Report Noisy Max reports which one among a list of counting queries has the largest value. It adds noise to each answer and returns the index of the query with the largest noisy answer. The correct versions have been proven to satisfy ε-differential privacy (Dwork et al., 2014), no matter how long the input list is. A naive proof would show only that it satisfies Nε-differential privacy (where N is the length of the input query list), but a clever proof shows that it actually satisfies ε-differential privacy.
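A minimal Python sketch of Report Noisy Max, assuming Laplace noise of scale 2/ε on each sensitivity-1 counting query; the Laplace sampler built from two exponentials is our own helper:

```python
import random

def laplace(scale):
    """Sample Laplace(scale) noise as the difference of two i.i.d.
    exponentials with mean `scale`."""
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def report_noisy_max(queries, epsilon):
    """Add independent Laplace noise to each counting-query answer and return
    the INDEX of the largest noisy answer. Returning the index, not the
    value, is what the clever epsilon-DP proof relies on."""
    noisy = [q + laplace(2.0 / epsilon) for q in queries]
    return max(range(len(noisy)), key=lambda i: noisy[i])
```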
5.1.1. Adding Noise
The correct Noisy Max algorithm (Algorithm 5) adds independent noise to each query answer and returns the index of the maximum value. As Figure 1(a) shows, we test this algorithm for different privacy budgets ε₀ ∈ {0.2, 0.7, 1.5}. All lines rise when the test ε is slightly less than the claimed privacy level of the algorithm. This demonstrates the precision of our tool: before ε₀, there is almost no chance of falsely claiming that this algorithm is not private; after ε₀, the p-value is too large to conclude that the algorithm is incorrect. We note that the test results are very close to the ideal cases, illustrated by the vertical dashed lines.
5.1.2. Adding Exponential Noise
One correct variant of Noisy Max adds exponential noise, rather than Laplace noise, to each query answer (Algorithm 6). This mechanism has also been proven to be differentially private (Dwork et al., 2014). Figure 1(b) shows the corresponding test result, which is similar to that of Figure 1(a). The result indicates that this correct variant likely satisfies differential privacy for the claimed privacy budget.
5.1.3. Incorrect Variants of Exponential Noise
An incorrect variant has the same setup, but instead of returning the index of the maximum value, it directly returns the maximum noisy value itself. We evaluate two variants that report the maximum value instead of the index (Algorithms 7 and 8) and show the test results in Figures 1(c) and 1(d).
For the variant using Laplace noise (Figure 1(c)), we can see that for ε₀ = 0.2, the line rises at around a test ε of 0.4, indicating that this algorithm is incorrect for the claimed privacy budget of 0.2. The same pattern occurs when we set the privacy budget to 0.7 and 1.5: all lines rise much later than their claimed privacy budgets. In this incorrect version, returning the maximum value (instead of its index) causes the algorithm to satisfy only a weaker level of differential privacy than the claimed ε-differential privacy. For the variant using exponential noise (Figure 1(d)), the lines also rise much later than the claimed privacy budgets, providing strong evidence that this variant is indeed incorrect. In fact, we can hardly see the lines for privacy budgets 0.7 and 1.5, since their p-values remain 0 for all test ε values ranging from 0 to 2.2 in the experiment.
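The Laplace-noise variant's bug can be seen as a one-line change to Report Noisy Max: the return statement leaks the maximum noisy value instead of the arg-max index. A sketch, with the same assumed 2/ε noise scale and our own Laplace helper:

```python
import random

def laplace(scale):
    # Difference of two i.i.d. exponentials with mean `scale` is Laplace(scale).
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def noisy_max_value(queries, epsilon):
    """Incorrect variant: identical noise, but it returns the maximum noisy
    VALUE, which voids the clever epsilon-DP proof for Report Noisy Max."""
    noisy = [q + laplace(2.0 / epsilon) for q in queries]
    return max(noisy)  # BUG: should return the arg-max index instead
```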
5.2. Histogram
The Histogram algorithm (Dwork, 2006) is a very simple algorithm for publishing an approximate histogram of the data. The input is a histogram and the output is a noisy histogram with the same dimensions. The Histogram algorithm requires that adjacent inputs differ in at most one element, by at most 1. Here we evaluate variants with different scale parameters for the added Laplace noise.
The correct Histogram algorithm adds independent Lap(1/ε) noise to each query answer, as shown in Algorithm 9. Since at most one query answer may differ, by at most 1, releasing all of the noisy answers satisfies ε-differential privacy (Dwork, 2006).
To mimic common mistakes made by novices in differential privacy, we also evaluate an incorrect variant where Lap(ε) noise is used in the algorithm (Algorithm 10). We note that this incorrect variant satisfies (1/ε)-differential privacy, rather than the claimed ε-differential privacy.
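The two variants differ only in the Laplace scale. A sketch under the scales discussed above (Lap(1/ε) for the correct version, Lap(ε) for the incorrect one; the Laplace helper is our own):

```python
import random

def laplace(scale):
    # Difference of two i.i.d. exponentials with mean `scale` is Laplace(scale).
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def histogram(hist, epsilon):
    """Correct: Lap(1/epsilon) noise per bin; sensitivity 1 gives epsilon-DP."""
    return [h + laplace(1.0 / epsilon) for h in hist]

def histogram_wrong_scale(hist, epsilon):
    """Incorrect novice variant: the scale parameter is epsilon itself, so the
    mechanism is (1/epsilon)-DP -- too little noise whenever epsilon > 1."""
    return [h + laplace(epsilon) for h in hist]
```

Note the inversion: for the wrong-scale variant, increasing the budget ε increases the noise, which is exactly why it appears more private than claimed at ε₀ = 1.5.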
Figures 2(a) and 2(b) show the test results for the correct and incorrect variants, respectively. Figure 2(a) indicates that the correct implementation satisfies the claimed privacy budgets. For the incorrect variant, the claimed budgets of 0.2 and 0.7 are correctly rejected; this is expected, since the true privacy budgets (1/ε₀) are 5 and approximately 1.43, respectively, for this incorrect version. Interestingly, the result indicates that for ε₀ = 1.5, this algorithm is likely to be more private than claimed (the line rises around 0.6 rather than 1.5). Again, this is expected, since in this case the variant is indeed (1/1.5 ≈ 0.67)-differentially private.
5.3. Sparse Vector
The Sparse Vector Technique (SVT) (Dwork et al., 2009) (see Algorithm 11) is a powerful mechanism for answering numerical queries. It takes a list of numerical queries and simply reports whether each answer is above or below a preset threshold T. It allows the program to output some of these results without any privacy cost. In particular, arbitrarily many “below threshold” answers can be returned, but at most N “above threshold” answers can be returned. Because of this remarkable property, many variants have been proposed, both in published papers and in practical use. However, most of them turn out not to be differentially private (Lyu et al., 2017). We test our tool on a correct implementation of SVT and on the major incorrect variants summarized in (Lyu et al., 2017). In the following, we describe what the variants do and list their pseudocode.
5.3.1. SVT (Lyu et al., 2017)
Lyu et al. have proposed an implementation of SVT and proved that it satisfies ε-differential privacy. This algorithm (Algorithm 11) allocates the global privacy budget into two parts: half of the budget goes to the threshold, and the other half goes to the values that are above the threshold. There is no privacy cost if the noisy value is below the noisy threshold, in which case the program outputs False. If the noisy value is above the noisy threshold, the program outputs True. After outputting a certain number (N) of True's, the program halts.
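A sketch of this SVT variant, using the noise calibration of Lyu et al. (threshold noise Lap(2/ε), per-query noise Lap(4N/ε)); treat the exact scales and the helper as assumptions of this sketch:

```python
import random

def laplace(scale):
    # Difference of two i.i.d. exponentials with mean `scale` is Laplace(scale).
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def sparse_vector(queries, T, N, epsilon):
    """SVT sketch: half the budget noises the threshold, half covers the
    (at most N) above-threshold answers."""
    noisy_T = T + laplace(2.0 / epsilon)
    out, count = [], 0
    for q in queries:
        if q + laplace(4.0 * N / epsilon) >= noisy_T:
            out.append(True)
            count += 1
            if count >= N:      # halt after N True answers
                break
        else:
            out.append(False)   # below-threshold answers cost no extra budget
    return out
```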
Figure 3(a) shows the test result for this correct implementation. All lines rise around the true privacy budget, indicating that our tool correctly concludes that this algorithm is correct.
5.3.2. iSVT 1 (Stoddard et al., 2014)
One incorrect variant (Algorithm 12) adds no noise to the query answers and has no bound on the number of True's that the algorithm can output. This implementation does not satisfy ε-differential privacy for any finite ε.
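A sketch of iSVT 1 as described, with noise on the threshold only and no bound on True outputs (the noise scale and helper are our assumptions):

```python
import random

def laplace(scale):
    # Difference of two i.i.d. exponentials with mean `scale` is Laplace(scale).
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def isvt1(queries, T, N, epsilon):
    """iSVT 1 sketch: the threshold is noised, but the query answers are NOT,
    and every comparison result is released. N is accepted but ignored,
    mirroring the missing bound on True outputs."""
    noisy_T = T + laplace(2.0 / epsilon)
    return [q >= noisy_T for q in queries]  # BUG: unnoised answers, unbounded Trues
```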
This expectation is consistent with the test result shown in Figure 3(b): the p-value never rises at any test ε. This result strongly indicates that the implementation, with claimed privacy budgets of 0.2, 0.7 and 1.5, is not ε-differentially private for any ε in the tested range (0 to 2.2).