Guidelines for Implementing and Auditing Differentially Private Systems

02/10/2020 ∙ by Daniel Kifer, et al. ∙ Facebook, Penn State University, University of California Santa Cruz, University of Pennsylvania

Differential privacy is an information theoretic constraint on algorithms and code. It provides quantification of privacy leakage and formal privacy guarantees that are currently considered the gold standard in privacy protections. In this paper we provide an initial set of “best practices” for developing differentially private platforms, techniques for unit testing that are specific to differential privacy, guidelines for checking if differential privacy is being applied correctly in an application, and recommendations for parameter settings. The genesis of this paper was an initiative by Facebook and Social Science One to provide social science researchers with programmatic access to a URL-shares dataset. In order to maximize the utility of the data for research while protecting privacy, researchers should access the data through an interactive platform that supports differential privacy. The intention of this paper is to provide guidelines and recommendations that can generally be re-used in a wide variety of systems. For this reason, no specific platforms will be named, except for systems whose details and theory appear in academic papers.


1 Introduction

Datasets can provide a wealth of information to social scientists, statisticians and economists. However, data sharing is often restricted due to privacy concerns for users whose records are contained in the data. Differential privacy is the current gold standard in protecting their privacy while allowing for statistical computations on their data. It requires the careful addition of noise into computations over the data; the noise is designed to mask the effects of any one individual on the outcome of the computation.

The purpose of this whitepaper is to produce an initial set of “best practices” for testing, debugging, and code review of differential privacy implementations. To the best of our knowledge, such principles have not been codified before, but are necessary due to the increasing adoption of differential privacy for data sharing.

Data sharing organizations can provide differentially private access to tabular data as does the U.S. Census (Abowd et al., 2019) or they can create systems to allow external researchers to construct more flexible and precise differentially private analyses. In this paper we focus on the latter.

Differential privacy platforms and languages (McSherry, 2009; Roy et al., 2010; Mohan et al., 2012; Gaboardi et al., 2016; Zhang et al., 2018; Reed and Pierce, 2010; Gaboardi et al., 2013; Winograd-Cort et al., 2017; Zhang et al., 2019) simplify the construction of differentially private data analyses by providing API calls that allow users to specify the computations they want to perform from a limited menu. The platforms add appropriate amounts of noise into those computations and track the overall privacy risk of those data accesses. Provided the platforms themselves are implemented correctly, the programs built within them are guaranteed to satisfy differential privacy.

Despite this obvious benefit, the real-world deployments of differential privacy to date have not made use of these systems, due to a number of limitations. First, the software engineering aspects of such platforms – hardening against side-channels, support for unit testing, mature design principles, and code review standards – are still lagging. In terms of side-channels, many existing differential privacy systems have known vulnerabilities in their floating point implementations (Mironov, 2012) and are susceptible to timing attacks (Haeberlen et al., 2011). Furthermore, in complex systems it is easy to introduce subtle bugs that impact privacy leakage (Ebadi et al., 2016). Moreover because these systems seek to provide mathematical certainty about the correctness of the privacy properties of programs developed within them, they are necessarily limited to a set of predefined differential privacy primitives. Determining whether an arbitrary piece of code in a general-purpose programming language satisfies differential privacy is an undecidable problem, so systems that offer provable guarantees must pay a price in expressivity.

There has been considerable progress in formally verifying the correctness of differentially private programs (Mislove et al., 2011; Barthe et al., 2012; Barthe and Olmedo, 2013; Eigner and Maffei, 2013; Barthe et al., 2014; Ebadi et al., 2015; Barthe et al., 2015, 2016a, 2016b; Zhang and Kifer, 2017; Albarghouthi and Hsu, 2017; Wang et al., 2019b); however, many of these systems are designed to verify the correctness of differential privacy primitives, rather than to develop differentially private data analyses. Hence, they are difficult for data analysts to use and restrict the choice of programming languages and operations/libraries available to them. A goal that is less ambitious than formal verification is debugging and testing. Research on testing differentially private programs (Ding et al., 2018; Bichsel et al., 2018; Gilbert and McMillan, 2018) is fairly new, but these ideas inform much of our proposed test design.

The role of testing.

Verifying that a complex system satisfies differential privacy is, in general, an intractable task. Proving that a piece of code satisfies differential privacy is in general undecidable, and black-box statistical testing cannot in general positively establish privacy guarantees because differential privacy requires a constraint to hold for every pair of adjacent datasets and for every possible outcome. Nevertheless, confidence in a system can be increased through a combination of choosing a good set of design principles, performing code audits, developing a mathematical model of a system's semantics along with a proof of its correctness (possibly aided by software verification tools), and designing (unit) tests that help ensure that the code conforms to the mathematical model. Such tests can also help find subtle bugs in code and proofs, and can surface mismatches between implementation and pseudo-code.

The most actionable part of this paper is the testing framework it provides. Thus, it is important to emphasize (again) that unit tests cannot guarantee that a system works correctly. At best, they can fail to prove that a system violates differential privacy. In doing so, they can increase our confidence in the likelihood that a system designed to be differentially private in fact satisfies differential privacy. Unit testing is in no way a replacement for mathematical proof or formal verification, and the core belief that a system is differentially private should continue to stem from mathematical argument (either derived “by hand” or as a result of using a framework for differentially private programming that itself certifies correctness).

Organization

The rest of this whitepaper is organized as follows. In Section 2, we briefly describe the use-case that motivated the effort to establish these guidelines. In Section 3 we present background information on differential privacy. In Section 4 we present recommended design principles for differential privacy platforms. In Section 5 we discuss how code review can reduce the potential for privacy bugs and identify possible problems that code reviewers should check for. In Section 6, we describe methods for testing different components of a differential privacy system and in Section 7 we present results and conclusions.

2 Motivating Use Case

In this section we briefly describe the motivating use case and challenges encountered along the way. The project we describe is a work in progress, and the guidelines established in the remainder of this paper are intended to guide the development of an interactive system that best meets the needs of this use-case. While in the short-term this project will rely on a differentially private table release, the medium- and long-term plan is to implement a differentially private query system to facilitate additional analyses and optimal expenditure of privacy budget.

Purpose and origin of the project.

In April 2018, an array of academic and funding organizations partnered with Facebook to make the company's data available to researchers to study the impact of social media on democracy and elections. (The partnership consisted of Social Science One (SS1), Facebook, the Social Science Research Council (SSRC), and the following foundations: the Laura and John Arnold Foundation, the Children's Investment Fund Foundation, the Democracy Fund, the William and Flora Hewlett Foundation, the John S. and James L. Knight Foundation, the Charles Koch Foundation, Omidyar Network's Tech and Society Solutions Lab, and the Alfred P. Sloan Foundation.) The project was announced in the context of growing concern about both privacy and the use of the platform by nefarious actors to influence political outcomes. These concerns produced two countervailing pressures: protecting privacy on the one hand while, on the other, providing data that yields as much precision as possible.

After some deliberation, the parties decided to start by releasing a data set that allowed external researchers to study trends in misinformation from websites external to Facebook that were shared on the platform. Initially, the effort was to rely on k-anonymity to protect privacy. However, that approach was revised after the privacy community raised concerns that k-anonymity did not provide substantive privacy guarantees. The consortium then started to explore sharing the data under differential privacy, which offers a substantially more rigorous guarantee (see Section 3).

Data description.

The data that we wish to grant researchers access to describe web page addresses (URLs) that have been “shared” on Facebook from January 1, 2017 up to and including August 6, 2019. URLs are included if they were shared with public privacy settings more than 100 times, where noise is added to this inclusion threshold. This approach of using a soft threshold for inclusion in the dataset minimizes information leakage. Note that because all of these URLs have been shared “publicly,” the set of URLs included in the dataset need not be viewed as private information — adding Laplace noise to the threshold for inclusion is done simply out of an abundance of caution, rather than for a formal privacy guarantee.
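A minimal sketch of this soft-thresholding rule (the function name and noise scale below are illustrative assumptions, not the production logic or parameters):

```python
import numpy as np

def included_in_release(public_share_count, threshold=100, noise_scale=10.0,
                        rng=np.random.default_rng()):
    """Soft-threshold inclusion check: add Laplace noise to the threshold and
    include the URL only if its public share count exceeds the noisy threshold.
    The noise scale here is a placeholder, not the value used in practice."""
    noisy_threshold = threshold + rng.laplace(loc=0.0, scale=noise_scale)
    return public_share_count > noisy_threshold
```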

The sheer volume of the data makes a comprehensive “static” data release difficult, which is another reason that the initial and ultimate goal of this project is to produce an interactive data analysis system. Data processed includes more than 50 TB per day for public sharing activity and interaction metrics, plus more than 1 PB per day for exposure data. Data were collected by logging actions taken on Facebook and processed using a combination of Hive, Presto, and Spark data systems, using the Dataswarm job-execution framework.

The underlying data consist of 3 tables:

  1. A URL-table with descriptive information about URLs (actual URL, page title, brief description). The data in this table is the least sensitive, since it pertains to the URLs, and not to the users' interactions with them (except, as discussed above, via the inclusion of the URL in the dataset).

  2. A user-URL-action table describing actions users have taken on specific URLs (likes, clicks, views, shares, comments, flagging for false news, etc.). This table contains sensitive data because whether or not a user has shared or liked a particular URL can reveal latent attributes, including sensitive personal characteristics (e.g. political affiliation, sexual orientation, etc.).

  3. A user table describing the characteristics of the users in question. This data is also sensitive.

Early problems.

Facebook sought to provide researchers with approved project proposals access to a differentially private query platform that would allow them to construct arbitrary queries (including joins) against these tables. However, this proved to be more difficult than anticipated because of a lack of pre-existing industrial-strength solutions that met our use case at the time. In particular, we found that existing solutions had the following shortcomings:

  • Existing solutions were not able to cleanly provide user-level privacy guarantees for these data. These solutions are tailored to tables in which all of the data pertaining to a single user is contained in one and only one row. The data for this project contain user-URL-action records, which are not useful if aggregated at the user level.

  • The sheer size of our data meant that existing solutions were unable to scale to handle the necessary data operations.

  • Privacy guarantees are often difficult to assess in proprietary code, which complicates one of the desired features of differential privacy: mathematically-based privacy guarantees. This is especially true for less-mature codebases that do not implement clear testing standards.

DP table release.

These shortcomings led to delays, and as a stopgap measure Social Science One and Facebook created a series of static, differentially private table releases that were anticipated to be of use to researchers. After a number of intermediate releases designed to allow researchers to get started on their projects, Facebook created a URL table and a table containing counts of actions taken related to each URL, broken down by user characteristics.

Data aggregates that describe user actions were protected under a form of action-level, zero-concentrated differential privacy (zCDP, see Bun and Steinke (2016)). The privacy parameter was chosen to be small enough to provide protections akin to user-level guarantees for 99% of users in the dataset (i.e., 99% of users in the dataset took fewer actions than the number up to which the action-level guarantee still translates, via group privacy, into a serious user-level guarantee — see Section 3.3 for further discussion), with a graceful decay for the remaining 1% as a function of action frequency. This decision was taken after weighing the tradeoffs with utility, and deciding that the action-level guarantee was substantial enough to provide meaningful protections of plausible deniability, while still allowing for more accurate statistical releases than would have been possible with a full user-level guarantee.

The release was a simple application of perturbation with Gaussian noise, and does not correct the data for consistency as, for example, the U.S. Census has done in its differentially private releases (Abowd et al., 2019). That means (for example) that some values in DP-protected aggregated fields will be negative due to the noise added, even though this is clearly impossible in the real data. (Note that truncating these counts at zero would bias statistical estimates.)

However, this strategy allows researchers to compute unbiased estimates in a straightforward manner. In general, post-processing for things like consistency can only destroy information, and can always be performed by the end-user if desired.
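A minimal sketch of this style of release (names and parameters are illustrative, not the production pipeline), using the standard fact from Bun and Steinke (2016) that adding Gaussian noise with scale $\sigma = \Delta/\sqrt{2\rho}$ to a statistic with $\ell_2$ sensitivity $\Delta$ satisfies $\rho$-zCDP:

```python
import numpy as np

def gaussian_zcdp_release(true_value, l2_sensitivity, rho,
                          rng=np.random.default_rng()):
    """Release a statistic under rho-zCDP by adding Gaussian noise with
    sigma = l2_sensitivity / sqrt(2 * rho). The output is deliberately not
    clamped at zero, so noisy counts may be negative but downstream
    estimates remain unbiased."""
    sigma = l2_sensitivity / np.sqrt(2.0 * rho)
    return true_value + rng.normal(loc=0.0, scale=sigma)
```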

Going Forward

We view the release of static tables as a stopgap measure, and our ultimate goal continues to be the development of a robust interactive API through which researchers can dynamically make detailed queries to the data. Towards this end, the rest of this paper proposes design principles for such future systems, principles which Facebook will aim to adhere to through this development process. This includes a particular focus on modularity and testing, so that the privacy guarantees of any future systems (which will inevitably be complex) can be rigorously evaluated and stress tested.

3 Differential Privacy

In this section we define differential privacy and discuss what it does and does not promise, along with crucial decisions that a designer needs to make when deciding to use differential privacy. The suitability of differential privacy for an application depends on answers to questions such as:

  • Are the harms prevented by differential privacy of the same sort that we care about? (Sections 3.1 and 3.2).

  • At what “granularity” do we care about privacy? (Section 3.3).

  • Does the privacy loss budget that we can “afford” in our application offer a reasonable quantitative guarantee? (Section 3.4).

We note that for historical reasons we use the word “privacy” for what is often called “confidentiality” in the literature on law and ethics – the protection of sensitive data/information. There, the term “privacy” refers to an individual's control over their information and freedom from intrusion.

3.1 What Differential Privacy Promises

Differential privacy requires algorithms that process data to be randomized – given a certain input, we cannot predict with certainty what the output of an algorithm will be. However, for any possible input $D$ and output $o$, we may know the probability of the output $o$ occurring when an algorithm $A$ is run with input $D$ – that is, we may know the output distribution of $A(D)$.

Differential privacy offers a very specific promise: if the data record corresponding to a single individual were removed from a dataset (or possibly changed in some other way — see Section 3.3), it would have only a controllably small effect on the distribution of outcomes for any analysis of the data. Formally, the mathematical guarantee is as follows. Let $\mathcal{X}$ be an arbitrary data domain, and let a dataset $D$ be an unordered multiset of records from $\mathcal{X}$. We say that two datasets $D$ and $D'$ are neighbors if $D'$ can be derived from $D$ by adding or removing the record of a single user.

Definition 1 (Dwork et al. (2006a, b)).

An algorithm $A$ mapping datasets to outcomes in some space $\mathcal{O}$ satisfies $(\epsilon, \delta)$-user-level differential privacy if for all pairs of neighboring datasets $D, D'$ and for all events $S \subseteq \mathcal{O}$, we have that:

$$\Pr[A(D) \in S] \leq e^{\epsilon} \Pr[A(D') \in S] + \delta.$$

Differential privacy has a number of interpretations. Here we list three, but see e.g. Kasiviswanathan and Smith (2014); Dwork and Roth (2014) for a more in-depth discussion.

  1. Most directly, differential privacy promises that no individual will substantially increase their likelihood of coming to harm as a result of their data's inclusion in the dataset. Said another way, for any event at all, the probability that the event occurs increases or decreases by only a little if their data is unilaterally added to or removed from the dataset, and the remaining records are left unchanged. Here, “only a little” is controlled by the parameters $\epsilon$ and $\delta$.

  2. Differential privacy controls the unilateral effect that any single person's data can have on the beliefs of any observer of the output of a differentially private computation, independent of what that observer's prior beliefs might have been. Specifically, for any outcome $o$, the beliefs of an outside observer after having observed $o$ given that an individual's data was included in the dataset are likely to be close to what they would have been had the outcome $o$ been observed but the individual's data not been included in the dataset. Here, “close” is controlled by $\epsilon$ and “likely” is controlled by $\delta$. (This description most directly maps onto the guarantee of “pointwise indistinguishability” — but as shown in Kasiviswanathan and Smith (2014), this is equivalent to $(\epsilon, \delta)$-differential privacy up to small constant factors.)

  3. Differential privacy controls the ability of any outside observer to distinguish whether or not the unilateral decision was made to include your data in the dataset under study, holding the participation of all other individuals fixed. In particular, it implies that no statistical test aimed at determining, from the output of a differentially private computation, whether the input was $D$ or $D'$ (for any neighboring pair $D$, $D'$) can succeed with probability substantially outperforming random guessing. Here “substantially” is what is controlled by the privacy parameters $\epsilon$ and $\delta$. It is this interpretation that is most actionable when it comes to testing: finding a statistical test for an algorithm $A$ that is able to reliably distinguish the distributions $A(D)$ and $A(D')$ for a particular pair of neighboring datasets (beyond what is consistent with the chosen parameters $\epsilon, \delta$) is enough to falsify a claim of differential privacy for the algorithm $A$ (a simple Monte Carlo version of this check is sketched below).

These last two interpretations can be viewed as giving people a form of plausible deniability about the content of their data record.
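To make the falsification idea concrete, here is a minimal Monte Carlo sketch. The mechanism, event, and dataset pair are illustrative placeholders; such a check can only produce evidence against a claimed guarantee, and a negative result proves nothing:

```python
import numpy as np

def estimate_violation(mechanism, D, D_prime, event, eps, delta,
                       trials=100_000, rng=np.random.default_rng()):
    """Monte Carlo check of the DP inequality for one pair of neighboring
    datasets and one event. A clearly positive margin (beyond sampling error)
    is evidence against the claimed (eps, delta) guarantee."""
    p = np.mean([event(mechanism(D, rng)) for _ in range(trials)])
    q = np.mean([event(mechanism(D_prime, rng)) for _ in range(trials)])
    return p - (np.exp(eps) * q + delta)  # should be <= 0, up to sampling error

# Illustrative example: a counting query with Laplace noise, event = "output > 10".
laplace_count = lambda data, rng: len(data) + rng.laplace(scale=1.0 / 0.5)
margin = estimate_violation(laplace_count, list(range(10)), list(range(11)),
                            lambda o: o > 10, eps=0.5, delta=0.0)
```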

3.2 What Differential Privacy Does Not Promise

Differential privacy compares two hypothetical “worlds,” World 1 and World 2. In World 1, your data is used in carrying out some data analysis. In World 2, your data is not used — but the data analysis is still carried out, using the remainder of the dataset. Differential privacy guarantees in a strong sense that you should be almost indifferent between World 1 and World 2 — and hence if you do not fear any privacy harms in World 2 (because your data was not used at all), you should also not fear privacy harms in World 1.

However, it is important to note that this guarantee — although strong in many respects — does not necessarily prevent many harms that might colloquially be thought of as related to privacy. Crucially, in both of the hypothetical worlds the data analysis is carried out, and even with the guarantees of differential privacy, an individual might strongly prefer that the analysis had not been carried out at all. Here we discuss two stylized examples of things that are not prevented by differential privacy:

  1. Statistical Inferences From Population Level Information: The goal of statistical analyses is to learn generalizable facts about the population described by a dataset. By design, differential privacy allows for useful statistical analyses. However, inevitably, population-level facts about the world, once learned, can be used in combination with observable features of an individual to make further inferences about them. For example, Facebook “Likes” turn out to exhibit significant correlation with individual attributes, like sexual orientation, political affiliation, religious views, and more (Kosinski et al., 2013). These correlations can in principle be learned in a differentially private manner, and used to make inferences about these traits for any user who makes their “Likes” publicly observable. Note that the privacy loss to a hypothetical user in this situation is not really due to the differentially private analyses that discovered correlations between Likes and sensitive attributes, but rather because the user allowed their Likes to be publicly visible. Nevertheless, this might feel like a privacy harm to a user who was not aware that the information they were making public correlated with sensitive attributes that they did not want known.

  2. “Secrets” embedded in the data of many people: Differential privacy promises that essentially nothing can be learned from a data analysis that could not also have been learned without the use of your data. However, it does not necessarily hide “secrets” that are embedded in the data of many people, because these could have been learned without the data of any one of those people. Genetic data is a prime example of this phenomenon: an individual's genome, although highly “personal”, contains information about close relatives. Similarly, on a social network, some kinds of information can be contained in the records of many individuals in the same social group: for example, a link or post that is widely shared. However, differential privacy does protect information that is held amongst only a small number of individuals – the “group privacy” property of $\epsilon$-differential privacy promises that the data of any set of $k$ individuals is collectively given the protections of $k\epsilon$-differential privacy (Dwork and Roth, 2014).

3.3 The Granularity of Privacy

The discussion so far has been about the standard “user-level” guarantee of differential privacy: informally, as we discussed in Section 3.1, differential privacy promises a user plausible deniability about the entire content of their data record. However, it is also possible to define differential privacy at a finer granularity. Doing so offers a weaker but still meaningful guarantee. Formally, this is accomplished by modifying the definition of “neighboring datasets”, which was defined (for the user-level guarantee) as datasets differing in the record of a single user. Other forms of finer granularity are possible, but here we mention two of particular interest:

  • Action Level Privacy: In some datasets, user records may simply be collections of actions the users have taken. For example, in a Facebook dataset, actions might include pages and posts that the user has “liked”, URLs she has shared, etc. Action level privacy corresponds to redefining the neighboring relation such that two neighboring datasets differ by the addition or subtraction of a single action of one user. This correspondingly gives users a guarantee of plausible deniability at the level of a single action: although it might be possible to obtain confidence in the general trends of a particular user's behavior, no analyst would be able to make confident assertions about whether any particular user had taken any particular action.

  • Month (or Week, or Year) Level Privacy: Similarly, in some datasets, user-level data might accumulate with time and be time-stamped. This would be the case, for example, with actions in a Facebook dataset. Month level privacy corresponds to redefining the neighboring relation such that two neighboring datasets differ by the addition or subtraction of an (arbitrary) set of user actions all time-stamped with the same month. This correspondingly gives users a guarantee of plausible deniability at the level of a month: once again, although it might be possible to obtain confidence in long-term trends of the behavior of a particular user, the user could make a statistically plausible claim that their behavior during any particular month was whatever they assert.

Note that the chosen granularity of privacy interacts with the quantitative level of privacy chosen via differential privacy's group privacy property. If e.g. a computation promises action-level differential privacy at level $\epsilon$, then it offers a guarantee akin to user-level privacy at level $k\epsilon$ for any user who has taken at most $k$ actions (and this guarantee scales smoothly: for example, it offers a guarantee akin to user-level privacy at level $2000\epsilon$ for any user who has taken at most 2000 actions, etc.). Such a guarantee might be acceptable when, for example, the vast majority of users in the dataset in question have taken fewer than $k$ actions, for a value of $k$ at which $k\epsilon$ is still meaningfully small.

3.4 The Importance of the Privacy Budget

A crucial feature of differential privacy is that it is compositional: if a sequence of $k$ analyses is carried out, such that each analysis $i$ is $(\epsilon_i, \delta_i)$-differentially private, then in aggregate, the entire sequence of analyses is $(\epsilon, \delta)$-differentially private for $\epsilon = \sum_{i=1}^{k} \epsilon_i$ and $\delta = \sum_{i=1}^{k} \delta_i$. There are a number of other more sophisticated methods for “accounting for” the constituent privacy losses and aggregating them into an overall guarantee: but the important point is that 1) such aggregations of privacy loss are possible, and 2) the guarantees of differential privacy ultimately stem from the final $(\epsilon, \delta)$-guarantee characterizing the entire sequence of analyses performed. Differential privacy should therefore be thought of as a limited resource to parcel out, with some bound on the overall privacy usage. These parameters can be thought of as a privacy budget. The choice of what the privacy budget should be set at, and how it should be parcelled out, are among the most important policy decisions that are made when deciding to allow for differentially private data access.
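In a system, this bookkeeping can be as simple as the following sketch (class and method names are illustrative), which refuses any analysis that would push the cumulative $(\epsilon, \delta)$ past the budget under basic (additive) composition; real systems may use tighter accounting:

```python
class PrivacyBudgetExceeded(Exception):
    pass

class BasicAccountant:
    """Minimal privacy-budget bookkeeping under basic composition:
    the epsilons and deltas of the individual analyses simply add up."""
    def __init__(self, eps_budget, delta_budget=0.0):
        self.eps_budget, self.delta_budget = eps_budget, delta_budget
        self.eps_spent, self.delta_spent = 0.0, 0.0

    def charge(self, eps, delta=0.0):
        if (self.eps_spent + eps > self.eps_budget or
                self.delta_spent + delta > self.delta_budget):
            raise PrivacyBudgetExceeded("query refused: budget would be exceeded")
        self.eps_spent += eps
        self.delta_spent += delta
```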

The Scale of the Privacy Budget

In the following, we discuss the “meaning” of $\epsilon$ under the assumption that $\delta = 0$. In general, these interpretations will continue to hold with $\delta > 0$, with the caveat that the guarantees discussed may fail to hold with probability roughly $\delta$ (see Kasiviswanathan and Smith (2014) for the formal derivation of this equivalence). We also note that many mechanisms will simultaneously satisfy a guarantee of $(\epsilon(\delta), \delta)$-differential privacy for every value of $\delta > 0$, with $\epsilon(\delta)$ some function of $\delta$. In settings like this, $\delta$ should not be viewed as a “catastrophic failure” probability.

The statistical testing interpretation of differential privacy implies that no statistical test for distinguishing whether an individual's data is contained in the dataset that has significance level $\alpha$ (i.e., the risk of concluding that the user is in the dataset when she is in fact not) can have power greater than $e^{\epsilon} \cdot \alpha$ (i.e., the probability of correctly concluding that the user is in the dataset when she is). So, for example, a test that has a false positive rate of only 5% will not be able to have a true positive rate that is greater than $e^{\epsilon} \cdot 0.05$. For $\epsilon = 0.5$, this corresponds to a true positive rate just above 8%. For $\epsilon = 1$, this corresponds to a true positive rate of about 13.6%. But note that the guarantee declines precipitously as $\epsilon$ increases beyond 1: when $\epsilon = 3$, the bound exceeds 1 and the guarantee becomes almost vacuous.
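A quick arithmetic check of these figures, taking $\delta = 0$ and an illustrative significance level $\alpha = 0.05$:

```python
import math

alpha = 0.05  # significance level: tolerated false positive rate
for eps in (0.5, 1.0, 3.0):
    power_bound = math.exp(eps) * alpha
    print(f"eps = {eps}: true positive rate at most {min(power_bound, 1.0):.3f}")
# eps = 0.5 -> ~0.082, eps = 1.0 -> ~0.136, eps = 3.0 -> bound reaches 1 (vacuous)
```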

Note, however, that the calculation above for a worst-case hypothesis test is carried out in an unrealistically strong threat model: the optimal hypothesis test on which the above bounds bind has knowledge of the entire private dataset as well as the individual's data, and is simply trying to determine whether that individual's data was used in the computation or not. For an alternative view, we can take the most basic interpretation of differential privacy: that events that would have probability $p$ absent an individual's data will have probability at most $e^{\epsilon} \cdot p$ when that individual's data is included in the computation. Under this view, what a meaningful value of $\epsilon$ is depends on the estimate that one makes about the probability $p$ of the event that we are trying to prevent from occurring, if the data analyses were run without the individual's data. For example, if we are only worried about “re-identification”, $p$ might be extremely small: in the best case, on the order of $1/N$, where $N$ is the number of possible data records an individual could have. Roughly speaking, values of $\epsilon$ on the order of $\ln N$ can still be meaningful in this context.

More generally, one can attempt to choose the privacy budget by performing an economic analysis that trades off the privacy costs of the participants with the potential utility of an accurate analysis (Hsu et al., 2014; Ghosh and Roth, 2015; Abowd and Schmutte, 2019). This style of reasoning is delicate however and potentially brittle to modeling assumptions.

Who Shares the Budget?

In addition to deciding the numeric scale of the privacy budget $\epsilon$ (and $\delta$), when a differentially private API giving access to a dataset will be shared amongst many users, an important design decision is what group of users will share the privacy budget. The tradeoffs to consider include the strength of the final guarantee, and how much of that guarantee relies on the mathematics of differential privacy (as opposed to the contractual obligations of the users), together with the rate at which the “privacy budget” (and hence the useful lifetime of the dataset) will be expended. There are many choices that can be made here, but we focus on two natural ones:

  1. One (Global) Privacy Budget: The safest decision (and the only one whose global guarantees rely only on the mathematics of differential privacy, rather than legal obligation) is to have a single privacy budget shared by (and depleted via the actions of) all users. This offers the rigorous protections of differential privacy (and all of the interpretations discussed in Section 3.1) no matter how those users behave, even if they aggregate all of their findings collectively, with the goal of trying to violate the privacy of a particular individual. On the other hand, in applications in which many independent groups wish to have access to the same dataset for different purposes, we expect that this approach will often exceed reasonable privacy budgets very quickly, and so it may be reasonable to consider a less conservative solution.

  2. Separate Privacy Budgets Across Organizational Structures — With Contractual Protections: One important use case is when there are many organizationally distinct groups of users (e.g. research groups at different universities, internal analytics teams in different parts of an organizational structure, etc.) who may have no need or intention of communicating with one another over the course of their data analysis. If each of these groups is given their own privacy budget, the guarantees of differential privacy (and all of the interpretations discussed in Section 3.1) will hold for each of these groups in isolation, so long as they do not share information with each other. This guarantee may be acceptable, especially if additional legal protections are added on: for example, as a condition for obtaining access to the differentially private API, a research group may have to sign a contract forbidding them from sharing information obtained through the API with other research groups. In this case, the overall privacy guarantee will come from a combination of mathematical protections and legal protections.

While affording different groups separate privacy budgets does offer a meaningfully weaker guarantee given the same value of $\epsilon$, we expect that in many cases — when paired with appropriately designed access control and contractual obligations — the protection it affords will be sufficiently strong, while allowing for substantially longer-lived data re-use as well as more precise analytical results for each group, compared to requiring a single, shared privacy budget.

4 System Design

An interactive differentially private query answering system needs to be designed carefully to achieve the following (sometimes conflicting) goals:

  • Expressivity: The system should support privacy-preserving versions of common data analysis workflows, such as computing descriptive statistics (e.g., means, medians, and variances of sub-populations), building machine learning models, and computing queries used for business decisions.

  • Modularity: The system should be developed from sub-components that are easy to isolate from the rest of the system during testing. Differential privacy is a distributional property of a randomized computation. Thus, tests may need to execute some components millions of times in order to estimate the output distributions and verify that they meet expectations. This is made much easier if the individual components on which the claims of privacy rely are simple and fast to run.

  • Minimality: The privacy-critical components that need code review and testing should be few in number. This reduces the load on the code reviewer and provides fewer chances for information-leaking bugs to creep into the code base. It also increases the chance that the privacy-critical components can be formally verified by emerging program verification tools.

  • Security: The system should prevent intentional and unintentional attempts at leaking more information than is allowed by the differential privacy settings. This involves attention to details that stand outside of the abstract mathematical model in which differential privacy is usually proven: in particular, the details of random number generation, floating point arithmetic, and side-channels such as system run-time.

  • Transparency: A differentially private system provides approximate (noisy) answers to its users, and the users need to adjust their statistical inferences based on the noise. Such adjustments are only possible if the users know exactly what was done to the original data. Moreover, a strength of differential privacy is that its guarantees do not in any way rely on the secrecy of the algorithms used to analyze the data. As a result, the code of a differentially private data analysis system should ideally be open source and open to inspection. This also has the benefit of enabling a larger community to check for privacy errors and implementation bugs, thereby building trust in the system.

Modularity and minimality are possible because of two important properties of differential privacy: postprocessing immunity and composition.

Figure 1: Postprocessing Immunity. If the first component satisfies $\epsilon$-differential privacy, then the entire workflow satisfies differential privacy with the same parameter $\epsilon$.

Postprocessing immunity, informally, means that we can do whatever we want with the output of an $\epsilon$-differentially private computation (so long as we don't make further use of the private data) without weakening its differential privacy guarantee. To illustrate this point, consider a simple algorithm that can be represented as a two-component flow diagram as in Figure 1. The first component processes the sensitive data and satisfies $\epsilon$-differential privacy. Its output is fed into the second component. This is called the postprocessing component because it does not directly access the data we want to protect – it only processes its differentially private input. The entire algorithm then satisfies $\epsilon$-differential privacy with the same privacy loss budget (i.e., $\epsilon$) as the first component. In particular, this means that once we release differentially private data to the users, the users can do whatever they want with that data without compromising privacy (beyond the leakage allowed by the parameter settings). It also informs the system design that we recommend in Section 4.1 — in particular, the “postprocessing layer”.

Figure 2: Independent composition. The total privacy guarantee is $(\epsilon_1 + \epsilon_2)$-differential privacy.

Figure 3: Sequential composition. The total privacy leakage is $(\epsilon_1 + \epsilon_2)$-differential privacy.

On the other hand, composition relates to the amount of information that is leaked when multiple differentially private processes are combined together. In the case of pure $\epsilon$-differential privacy (with $\delta = 0$), the overall privacy loss is dictated by the sum of the individual privacy losses. Consider Figure 2. Here the data are accessed by two independent differentially private algorithms. The first algorithm satisfies $\epsilon_1$-differential privacy while the second one (which possibly takes some constant as a second input) satisfies $\epsilon_2$-differential privacy. The combined release of both of their outputs satisfies $(\epsilon_1 + \epsilon_2)$-differential privacy.

Now consider the situation in Figure 3, where the constant input to Component 2 is replaced by the output of Component 1. This is a form of sequential composition. Releasing the outputs of both components to the user satisfies $(\epsilon_1 + \epsilon_2)$-differential privacy. Note that, generically, releasing only the output of Component 2 also results in $(\epsilon_1 + \epsilon_2)$-differential privacy – hiding the output of the first component does not necessarily reduce the privacy cost, as that output affects the operation of the second component.

There is a caveat in our discussion of sequential composition, which is that the privacy parameters are assumed to be fixed up front, before the computation begins. If Component 2 uses the output of Component 1 to dynamically determine its privacy level $\epsilon_2$, then this situation is known as parameter-adaptive composition. The simple composition theorem for pure differential privacy (i.e., $\delta = 0$) can be extended to this case (Rogers et al., 2016) without any loss in the composition guarantees, so that the overall privacy loss is the sum of the privacy losses of the two components. Similarly, “additive” composition theorems for approximate differential privacy (when $\delta > 0$) and variants like Renyi and concentrated differential privacy extend to this case. However, the more sophisticated sub-additive composition theorems do not necessarily extend to the parameter-adaptive case, and adapting them can result in slightly larger overall privacy loss compared to the non-parameter-adaptive case. See Rogers et al. (2016); Winograd-Cort et al. (2017) for more details.

Finally, many differential privacy guarantees ultimately stem from a bound on the sensitivity of a function. Function sensitivity measures the quantitative effect that the presence or absence of a single individual can have on the (exact) value of a function. Consider, for example, a simple counting function $f$ which, given a dataset $D$, computes the number of users who have Facebook accounts and have listed their romantic preference as “Interested in Men”. We would say that this function is “1-sensitive”, because adding or removing a single user from the dataset could change this count by at most 1. More generally, the sensitivity of a 1-dimensional numeric-valued function is defined as follows:

Definition 2.

Given a function $f$ which takes as input a dataset and outputs a real number, its sensitivity is:

$$\Delta f = \max_{D, D'} |f(D) - f(D')|,$$

where the maximum is taken over all pairs of neighboring datasets $D, D'$.

There are also multi-dimensional generalizations of sensitivity. Function sensitivity is important because it is often the primary thing we need to know about a function in order to use its value as input to a differentially private algorithm. For example, the simplest differentially private algorithms for estimating numeric-valued functions (the “Laplace” and “Gaussian” mechanisms — see Dwork and Roth (2014)) simply perturb the value $f(D)$ with noise whose standard deviation is scaled proportionally to $\Delta f$, the sensitivity of $f$. This also lends itself to a kind of modularity that informs our recommended system design in Section 4.1: function sensitivity can be tracked independently of differential privacy calculations as data is manipulated, and then passed to algorithms whose privacy guarantees depend on this sensitivity.
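For concreteness, here is the textbook Laplace mechanism as a short sketch. It is idealized: it ignores the randomness-source and floating point concerns discussed in Section 5, and the names are illustrative.

```python
import numpy as np

def laplace_mechanism(value, sensitivity, eps, rng=np.random.default_rng()):
    """Perturb a real-valued, `sensitivity`-sensitive statistic with
    Laplace(scale = sensitivity / eps) noise to obtain eps-differential
    privacy (in the idealized, infinite-precision model)."""
    return value + rng.laplace(loc=0.0, scale=sensitivity / eps)

# Example: a counting query has sensitivity 1.
noisy_count = laplace_mechanism(value=4213, sensitivity=1.0, eps=0.1)
```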

4.1 Recommended System Design

Figure 4: Recommended System Architecture

The principles of composition and postprocessing immunity together with the central importance of function sensitivity simplify the design of interactive differential privacy platforms to make it easier to achieve the desiderata listed at the beginning of Section 4. A recommended system design, shown in Figure 4, consists of three layers: a data access layer, a privacy layer, and a postprocessing layer. Each layer performs a qualitatively distinct task, and the separation of a system into these three layers both lends itself to modular design, and simplifies the process of testing for correctness.

4.1.1 Data access layer.

The data access layer is the only layer that should have access to the raw data. It performs transformations and aggregations on the data (e.g., joins, group by, and other SQL operations are simple examples). In general, it is this layer that performs exact computations on the data in response to user queries. For example, when computing a regression, functions in the data access layer might compute the covariance matrix of a set of datapoints, or when performing stochastic gradient descent to optimize model parameters in some class, the (exact) gradients of the current model would be computed in the data access layer.

A crucial function of the data access layer is to compute (bounds on) the sensitivity of the functions that it is computing on the data. Sensitivity calculations are deterministic, and can be automated under some circumstances: see for example the type system for tracking sensitivity developed in Reed and Pierce (2010) and subsequent work for the state-of-the-art in automating sensitivity calculations.

Functions in the data access layer can be called only from the privacy layer, and they return to the privacy layer both an exact computation of some function on the raw data, and a bound on the sensitivity of that computation.

4.1.2 Privacy layer.

The privacy layer is responsible for performing the randomized computations on functions of the data obtained from the data access layer that guarantee differential privacy at a particular level $\epsilon$ (or $(\epsilon, \delta)$). At a high level, the privacy layer has two distinct roles. First, it provides the implementations of the “base” private computations that are supported — e.g. simple counts and averages, estimates of marginal tables and correlations, etc. Second, it performs a book-keeping role: keeping track of a bound on the privacy loss that has accumulated through the use of these base computations.

The base computations should take as input a desired base privacy parameter, and then make calls to the data access layer. Then, potentially as a function of the sensitivity bound provided by the data access layer, each base computation performs the randomized computation that is guaranteed to protect privacy at the specified level. The base private computations are the cornerstone of the end-to-end privacy guarantee of the entire system. Hence, each of them should come with a rigorous mathematical proof and/or formal verification to ensure that, at the algorithm level, each base computation is differentially private; moreover, testing and code auditing is required to validate the correctness of the implementation of each differentially private algorithm.

Separately, the privacy layer keeps track of the total privacy cost expended during a sequence of analyses, and makes sure that it does not exceed some pre-specified privacy budget $\epsilon$. This is accomplished by keeping track of the privacy cost of each of the base computations, and applying an appropriate “composition theorem” to track the overall loss. In the simplest case in which the final guarantee is $\epsilon$-differential privacy, this is accomplished simply by summing the base privacy costs $\epsilon_i$ and making sure that $\sum_i \epsilon_i \leq \epsilon$ (and disallowing further computation if it would cause the privacy budget to be exceeded). However, more sophisticated methods can be employed when the desired final privacy guarantee is $(\epsilon, \delta)$-differential privacy for $\delta > 0$. This can involve tracking variants of differential privacy like “Concentrated Differential Privacy” (Dwork and Rothblum, 2016; Bun and Steinke, 2016) or “Renyi Differential Privacy” (Mironov, 2017b), which are measured and tracked in different units, but can be converted to standard $(\epsilon, \delta)$-differential privacy guarantees as desired (and can be used to ensure that the total privacy cost doesn't exceed a budget specified with the standard parameters $\epsilon$ and $\delta$).

4.1.3 Postprocessing layer.

The postprocessing layer houses the user-facing API, and makes calls to the privacy layer. Given queries supplied by the user, it verifies through the privacy layer that the desired computations will not exceed the privacy budget. If the budget permits, it issues the queries to the privacy layer, and then potentially performs further transformations on the results (the “postprocessing”) before returning a response to the user. The guarantees of the privacy layer are sufficient to guarantee the privacy of answers returned through the postprocessing layer, because of differential privacy's postprocessing immunity property.

As much computation as possible should be embedded in the post-processing layer so as to keep the set of base functions in the privacy layer as small as possible. For example, an algorithm for solving a linear regression problem privately might be implemented across layers as follows: the covariance matrix for a dataset would be computed in the data access layer, and then passed to an implementation of the Gaussian mechanism in the privacy layer (which would add noise in proportion to the automatically computed sensitivity of the covariance matrix). The privacy layer would return to the postprocessing layer a perturbed (and hence privacy preserving) covariance matrix. Only in the postprocessing layer would the covariance matrix be used to compute the final regression coefficients.

The postprocessing layer should never have access to the original data. Hence, it should ideally be on a different server than the privacy and data-access layers. Aside from interfacing with the user, the purpose of the postprocessing layer is to keep the privacy layer as small as possible (to make it easier to verify and harder to attack). Because differential privacy is immune to postprocessing, under our suggested architecture, differential privacy guarantees will not be compromised even if there are errors in the implementation of functions in the postprocessing layer. Thus when implementing a new piece of functionality, designers should first carefully consider whether it can be implemented as a post-processing of functionality that already exists in the privacy layer (for example, if we already have code to compute a sum and a size, we do not need new code in the privacy layer to compute an average). This both keeps the attack surface as small as possible and reduces the burden of testing for correctness.

5 The Role of Code Review

Differential privacy is ultimately a mathematical property of a system that must be proven in some idealized model — it is not something that can be established through testing alone. The first step of designing a system should therefore involve specifying the idealized model in artifacts such as pseudo-code and design documents. The second step involves mathematical proofs about the idealized model: that the sensitivity calculations in the data access layer are correct, that the base algorithms in the privacy layer have the privacy guarantees they claim, and that the accounting for the privacy budget is correct. The role of code review is to check that the actual system as implemented conforms as closely as possible to the idealized model (in the form of pseudo-code, design documents, etc.) in which privacy has been proven.

Our recommended system design simplifies this process by limiting the volume of code that has to be reviewed and compartmentalizing what has to be reviewed:

  1. Data Access Layer: The only thing that needs to be reviewed at the data access layer is whether the computed bounds on sensitivity correctly characterize the actual data transformations being performed. There are existing type systems that can guarantee correctness of sensitivity calculations for a limited but expressive subset of data operations (Reed and Pierce, 2010). Additional discussion of the documentation of the data access layer that needs to be checked can be found in Section 6.3.

  2. Privacy Layer: Each base function in the privacy layer corresponds to an assertion that if the inputs returned from the data access layer have sensitivity $\Delta$, then the privacy guarantee of the algorithm implemented in the base function is $\epsilon(\Delta)$, for some function $\epsilon(\cdot)$. It is this assertion that needs to be verified in the privacy layer — which can be checked independently of the correctness of the sensitivity calculations in the data access layer. Typically, most base functions (such as the Laplace mechanism and the Exponential mechanism) come with rigorous mathematical proofs in the literature; when developing new base functions or variants of well-studied base functions, the developer should either provide rigorous mathematical proofs or use verification tools to warrant their correctness.

  3. Postprocessing Layer: Functions in the postprocessing layer have no bearing on the privacy guarantee offered by the algorithm. Hence, from the perspective of auditing the privacy guarantee of a system, nothing in the postprocessing layer needs to be reviewed at all (other than checking that all access to the data goes through the privacy layer). This is the primary motivation for designing functionality so that as much code as possible is assigned to the postprocessing layer.

Code review for differentially private systems is not different in kind from code review for any other system aimed at guaranteeing that implemented code correctly matches a system specification — except that it is more important, because differential privacy cannot be verified empirically. However, there are certain common pitfalls to watch out for, which we discuss in the following.

Naive implementations of differential privacy platforms are expected to have potential vulnerabilities in their source of randomness, their use of finite precision numbers, their use of optimizers, and in timing channels. Some of these vulnerabilities have published exploits and some do not. We note that implementations which defeat specific published attacks but do not solve the underlying problem are still considered vulnerable.

5.1 The Source of Randomness

The mathematical formulation of differential privacy requires perfect randomness (Dodis et al., 2012). Perfect randomness is extremely difficult to achieve even in cryptographic applications, so real systems will have to settle for imperfect randomness. We can still ask for the same strong computational indistinguishability guarantees used for cryptography, however. Therefore we recommend that a code review check for the following:

  • “Random numbers” should be generated by a cryptographically secure pseudo-random number generator (CSPRNG). The Mersenne Twister is a commonly used pseudo-random number generator for statistical applications (in many libraries, it is the default) but is considered insecure for differential privacy.

  • Seeds for the CSPRNG should not be hardcoded and should not be read from configuration files. Seeds should also never be saved.

  • If two CSPRNGs are used in parallel (for example in implementations using Hadoop, Spark, etc.) then they must be explicitly seeded differently. For example, one “global” CSPRNG can be used to seed the rest.
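A minimal sketch of the first recommendation, using Python's OS-backed generator (illustrative only; this addresses the source of randomness, not the floating point issues discussed in Section 5.2):

```python
import secrets

# OS-backed CSPRNG with no user-supplied or stored seed, unlike the default
# Mersenne Twister behind random.random() or legacy numpy RandomState.
csprng = secrets.SystemRandom()

def secure_uniform():
    """Uniform draw in [0, 1) from a cryptographically secure source."""
    return csprng.random()
```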

5.2 Finite Precision Bugs

Differentially private algorithms are typically analyzed in the idealized setting in which real-valued distributions (like Laplace and Gaussian distributions) can be sampled from, and in which all computations are carried out on real numbers (i.e. to infinite decimal precision). In practice, however, these algorithms are necessarily implemented on finite-precision machines and this disconnect can lead to bugs that can leak significant amounts of information. Three important cases for a code review to focus on include random number generation, floating point computations that involve exponents and/or products, and continuous optimization solvers.

5.2.1 Noise Generation.

Mironov (2012) studied common implementations of the Laplace mechanism (Dwork et al., 2006b; Dwork and Roth, 2014) in a variety of languages and found that the resulting Laplace distributions had “holes” corresponding to numbers that would never be generated. Such “holes” are shown to breach differential privacy: for two adjacent datasets, adding random variables with these holes to finite-precision numbers can miss different output values under each dataset, so the resulting mechanism cannot satisfy differential privacy. Paradoxically, small privacy budgets (i.e., very small $\epsilon$), which in the idealized model would leak almost nothing, were the ones that resulted in the most leakage in practice. Similar issues can affect discrete distributions such as the discrete Laplace distribution (Ghosh et al., 2009) over the integers. This problem has not been fully solved, but a variety of techniques can be used to mitigate it. These include:

  • Use a secure version of a distribution that has been published in the literature. For example, the Snapping Mechanism (Mironov, 2012) is a viable replacement for the Laplace mechanism.

  • Place a lower limit on the privacy budget used for computations. Holes in distributions arise when the noise variance is large (and hence the privacy budget allocated to a computation is small). Placing a lower limit on the budget used in a mechanism can help mitigate this issue. For example, the Laplace mechanism adds noise with scale $\Delta/\epsilon$, where $\Delta$ is the sensitivity and $\epsilon$ is the budget allocated for this mechanism. As part of the mitigation procedure, one can require, for instance, that $\epsilon$ be no smaller than some fixed lower bound, to cap the variance and reduce the risk of holes.

  • In addition to placing a lower limit on the privacy budget allocated to an operation, one can also discretize the output (for example, round the Laplace mechanism to the nearest integer).
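A sketch combining the last two mitigations above (the budget floor value is an illustrative assumption, and this is not the Snapping Mechanism):

```python
import numpy as np

EPS_FLOOR = 2.0 ** -10   # illustrative lower limit on the per-query budget

def discretized_laplace(value, sensitivity, eps, rng=np.random.default_rng()):
    """Laplace mechanism with a budget floor (capping the noise variance) and
    output rounded to the nearest integer. This reduces, but does not formally
    eliminate, floating point artifacts; the Snapping Mechanism (Mironov, 2012)
    is the principled fix."""
    if eps < EPS_FLOOR:
        raise ValueError("per-query epsilon below the configured floor")
    return round(value + rng.laplace(scale=sensitivity / eps))
```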

Some mechanisms use continuous random variables to create discrete distributions. This includes selection algorithms such as “Report Noisy Max” (Dwork and Roth, 2014) and “private selection from private candidates” (Liu and Talwar, 2018). Here the concerns about finite-precision numbers are slightly different. Report Noisy Max takes a list of query answers as an input, adds Laplace noise to each, and returns the index of the query that had the largest noisy answer. The privacy analysis of such algorithms assumes that ties are impossible, which is true for continuous distributions but not for the samples produced by finite-precision computers. The true privacy guarantee is then approximate differential privacy (Dwork et al., 2006a), which allows guarantees of differential privacy to fail with a very small probability. One concern in generating random variables inside these algorithms is that the chance of a tie should be minimized. In these cases, a standard floating point implementation of the Laplace distribution (with a lower bound on the privacy budget used) would be preferable to the Snapping Mechanism, as the Snapping Mechanism has a much higher chance of ties.

In other cases, discrete distributions are typically sampled in the following way (which is not recommended). Suppose the possible values are $v_1, \dots, v_k$ with probabilities $p_1, \dots, p_k$ (with $\sum_j p_j = 1$). Form the cumulative sums $c_j = p_1 + \dots + p_j$, draw a uniform random variable $u$ between $0$ and $1$, find the smallest $j$ such that $u \leq c_j$, and output $v_j$. This technique is commonly used for naive implementations of the exponential mechanism (McSherry and Talwar, 2007), which samples from distributions of the form $p_j \propto \exp(\epsilon\, q(v_j) / (2\Delta))$, where $q$ is a quality score with sensitivity $\Delta$. Due to the exponentiation, some of the probabilities can underflow to 0, resulting in a potentially exploitable bug. It is recommended to avoid computation of probabilities as much as possible and to opt for mechanisms that use noise addition to achieve similar functionality – for example, replacing the Exponential Mechanism with special versions of Report Noisy Max that use random variables drawn from the exponential distribution (Barthe et al., 2016b).
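As an illustration of this recommendation, the following sketch selects a candidate by adding exponentially distributed noise to the (bounded-sensitivity) quality scores and returning the argmax, so probabilities are never materialized. The noise scale shown is the commonly used $2\Delta/\epsilon$; the privacy proof for this variant should be checked against the cited literature rather than taken from this sketch.

```python
import numpy as np

def noisy_argmax_selection(quality_scores, sensitivity, epsilon, rng=None):
    """Select an index by adding exponential noise to each quality score.

    Works directly on scores (never on probabilities), so there is no
    exponentiation step that can underflow to zero. The scale
    2 * sensitivity / epsilon mirrors the exponential-mechanism calibration;
    verify against the accompanying proof before relying on it.
    """
    rng = np.random.default_rng() if rng is None else rng
    scores = np.asarray(quality_scores, dtype=float)
    noise = rng.exponential(scale=2.0 * sensitivity / epsilon, size=scores.shape)
    return int(np.argmax(scores + noise))
```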

5.2.2 Numerically Stable Computations.

Floating point operations such as addition differ from their mathematically idealized counterparts. For example, in pure mathematics, $a + b = a$ only occurs when $b = 0$, and $(a + b) - a$ always equals $b$. These identities can fail in floating point: if $a$ is much larger than $b$, then $a + b = a$ even for nonzero $b$, due to rounding errors. There are three particular situations to look for in code review:

  • Multiplication of many numbers. The product $x_1 x_2 \cdots x_k$ is numerically unstable. If all of the $x_i$ are strictly between -1 and 1, this computation risks numerical underflow (resulting in 0). If all of the $x_i$ have absolute value larger than 1, the product could overflow (the program may treat it as $\infty$ or NaN). If some of the $x_i$ are close to 0 and others are large, the result will depend on the order in which the multiplications are performed (with possible answers ranging from 0 to $\infty$ depending on this order). Such multiplication often occurs when working with probabilities. It is better to work on a log scale, storing $\log x_i$ instead of $x_i$. Noting that $\log(x_1 x_2 \cdots x_k) = \sum_i \log x_i$, working in log scale results in significantly better numerical stability. Note that in this section, we take $\log$ to be the natural logarithm.

  • Working with exponentials. Differential privacy often deals with quantities such as $e^{\epsilon}$ or $e^{\epsilon q(v)/(2\Delta)}$. To avoid overflowing towards infinity, it is better to store the exponent $x$ instead of $e^{x}$ (again, this means working on a log scale).

  • When working in log scale, one often needs to compute sums: we store $\log a$ and $\log b$ but need to compute $\log(a + b)$. The naive approach, which loses precision, is to compute $\log(e^{\log a} + e^{\log b})$. A better approach is the following. Let $x = \log a$, $y = \log b$, $M = \max(x, y)$, and $m = \min(x, y)$. Then, mathematically, $\log(a + b) = M + \log(1 + e^{m - M})$. We can further use the function $\mathrm{log1p}$, where $\mathrm{log1p}(z)$ is a much more accurate version of $\log(1 + z)$ and is available in most numerical libraries (e.g., numpy). Hence we can compute $\log(a + b)$ as $M + \mathrm{log1p}(e^{m - M})$ to obtain more precision than the naive approach for addition in log scale; a short sketch follows this list.
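For concreteness, here is a minimal log-scale addition helper of the kind described above; it is a sketch using numpy rather than a drop-in component of any particular platform.

```python
import numpy as np

def log_add(log_a, log_b):
    """Return log(a + b) given log(a) and log(b), working entirely in log scale."""
    big, small = max(log_a, log_b), min(log_a, log_b)
    return big + np.log1p(np.exp(small - big))

# Example: adding two very small probabilities without underflow.
# log_p = -2000.0 corresponds to p = e^{-2000}, which underflows to 0.0 directly.
print(log_add(-2000.0, -2001.0))  # finite, close to -1999.69
```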

5.2.3 Use of Continuous Optimizers.

Similarly, some algorithms have analyses in the idealized model that depend on computations being carried out to infinite decimal precision. An example is the objective perturbation technique (Chaudhuri et al., 2011), often used for differentially private logistic regression. This technique requires that an optimizer find the exact optimum of a convex optimization problem. In practice, such optima can only be approximated (since they are not guaranteed to have rational solutions in general), and the optimization is often stopped before reaching optimality (to reduce running time). Such inaccuracies must be accounted for in the privacy analysis.

Some solutions to this problem include the following. Objective perturbation and related algorithms could be replaced by a different method, such as certain types of output perturbation (Wu et al., 2017) that do not require optimality for privacy guarantees. Or, one could use a variant that does not require exact computation of the optimum (Iyengar et al., 2019; Neel et al., 2019a) – these variants also perturb the output with a small amount of noise.

5.3 Use of Heuristic Optimizers

There are a number of differentially private algorithms that make use of heuristic optimizers like integer program solvers (Gaboardi et al., 2014; Neel et al., 2019b, a; Abowd et al., 2019). These heuristics frequently work well in practice, but because they are designed to attack NP-hard problems, they are not guaranteed to succeed on all inputs. This can make their use difficult to square with differential privacy (which is a worst-case guarantee). Thus, whenever heuristic solvers are used in differentially private systems, they should be a focus of the code review. Whenever possible, the privacy guarantees of the algorithm should not depend on the correctness of the solution found by the solver. To make this transparent, algorithms that depend on solvers for utility guarantees but not for privacy should implement calls to the solver in the post-processing layer of the system (Gaboardi et al., 2014). This is by far the most preferred way to use a heuristic solver.

When the correctness of the solver’s output is important for the privacy guarantee of the algorithm (Neel et al., 2019a), it is important that the solver be able to certify its own success (i.e. that it has returned an optimal solution to the optimization problem it was tasked with). Many commonly used mixed integer program solvers (e.g. Gurobi and CPLEX, among others) can do this. Algorithms that use solvers with this guarantee can be converted into algorithms whose privacy guarantee does not depend on the success of the solver (Neel et al., 2019b), albeit with significant computational overhead. As an option of last resort, the reviewer should verify with extensive testing that the solver indeed terminates successfully on a wide variety of realistic test cases.

5.4 Timing Attacks

Differential privacy is typically analyzed in an idealized model in which the only thing observable by the data analyst is the intended output of the algorithm. In practice, however, there are other side-channels through which information about the data may leak. Many of these side channels (e.g. electricity usage or electromagnetic radiation) are eliminated by making sure that user queries are executed on a remote server, but for systems designed to interactively field queries made by users, one significant channel remains: execution time. The concern is that the following kind of adversarial query might be posed: “given a boolean predicate $\varphi$, return an approximate count of the number of data points $x$ such that $\varphi(x)$ is true”. This query has sensitivity 1, and so can be answered using standard perturbation techniques like the Laplace or Gaussian mechanisms. However, executing the query requires evaluating $\varphi$ on every database element, and if evaluating $\varphi$ takes much longer on certain database elements than on others, then the running time of the query can reveal the presence of those elements in the database.

The best practice for closing the timing channel is to make sure that — as closely as possible — queries to the dataset take the same amount of time to complete, independently of the contents of the dataset. Guaranteeing this will typically involve both time-outs and padding. Consider again the example of estimating the number of database elements $x$ such that $\varphi(x)$ is true: the system could evaluate $\varphi$ on each dataset element and default to a value of “TRUE” if the execution of $\varphi$ took more than some pre-specified time $T$. Similarly, if $\varphi$ finishes evaluating in less time than $T$, then a delay should be added so that in total time $T$ elapses before moving on to the next element. $T$ should be chosen so that on almost all reasonable (i.e. non-adversarial) queries, the timeout never occurs. If this can be accomplished, then the execution time of computing $\varphi(x)$ does not leak any information about $x$. The running time of the whole query will now scale linearly with the number of records $n$ (it will be roughly $nT$), which also leaks some private information through the total dataset size $n$. This can be mitigated by having the system expend a small amount of its privacy budget at startup to compute a private estimate $\tilde{n}$ of the dataset size (which will also be useful for many other tasks, such as computing averages), and adding a delay to the overall run-time of the query that depends on $\tilde{n}$, so that the run-time itself is differentially private. For details of the implementation of timing attack mitigations, see Haeberlen et al. (2011). It is also possible to perform the padding in an adaptive and automatic manner, by following the predictive mitigation mechanism (Askarov et al., 2010; Zhang et al., 2011). Predictive mitigation is a general mechanism for mitigating timing channels in interactive systems: it starts with an initial prediction of computation time, and then for each query, it either pads the response time to the prediction (when the query takes less time) or updates the prediction to be longer and pads the response time to the new prediction. Although predictive mitigation was designed for interactive systems in general, it can be adapted to mitigate timing channels in differentially private systems.
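The per-record timeout-and-padding discipline can be sketched as follows. This is an illustrative outline: the thread pool, the default value of TRUE, and padding via sleep are expository assumptions rather than the mechanism of any specific system, and a real deployment must also control the granularity of its clock.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

T = 0.01  # per-record time budget in seconds (illustrative choice)

def timed_predicate(predicate, record, executor):
    """Evaluate predicate(record) in at most T seconds, padding the elapsed time up to T.

    Defaults to True on timeout so that slow adversarial predicates do not
    change the observable per-record running time.
    """
    start = time.monotonic()
    future = executor.submit(predicate, record)
    try:
        result = bool(future.result(timeout=T))
    except FutureTimeout:
        result = True  # default value when the evaluation exceeds the budget
    # NOTE: the worker thread is not forcibly killed on timeout; a production
    # system needs real preemption (see Haeberlen et al., 2011).
    remaining = T - (time.monotonic() - start)
    if remaining > 0:
        time.sleep(remaining)  # pad so that roughly T seconds elapse per record
    return result

def padded_count(predicate, records):
    with ThreadPoolExecutor(max_workers=1) as executor:
        return sum(timed_predicate(predicate, r, executor) for r in records)
```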

Consultations with the NCC Group have raised additional indirect sources of timing attacks. For instance, if the predicate $\varphi$ is allowed to use regular expressions, then an attacker can craft a regular expression that takes a long time to evaluate on a target record. Similarly, $\varphi$ can create a large string (gigabytes in size) if a particular record is present. The resulting memory pressure could slow execution and defeat improperly implemented timeouts.

5.5 Running arbitrary code

In principle, it is possible to allow users to create custom queries to be answered by writing an arbitrary piece of code. While this can be useful, it also poses special challenges that must be taken seriously.

The first challenge is to bound the sensitivity of a user-provided function. Consider the following example: a hypothetical system allows the user to supply an arbitrary function $f$, written in a general programming language, that takes as input a database element $x$ and outputs a real number $f(x)$. The user would like to estimate the empirical average of this function on the dataset: $\frac{1}{n}\sum_{i=1}^{n} f(x_i)$. In order to estimate this average privately (using, say, the Laplace or Gaussian mechanism), it is necessary to know the sensitivity of $f$, which might be difficult to bound automatically if $f$ consists of arbitrary code (in general the problem is undecidable). One potential solution is to ask the user to assert that $f$ always returns values in a range $[a, b]$; the system can then instead estimate $\frac{1}{n}\sum_{i=1}^{n} \mathrm{clamp}(f(x_i), a, b)$, where $\mathrm{clamp}(\cdot, a, b)$ forces its argument to lie in the range $[a, b]$. Treating $n$ as fixed, this quantity is guaranteed to have sensitivity bounded by $(b - a)/n$, and it equals the quantity the user wanted to estimate if their assertion on the bounds of $f$ was correct. This is sufficient to guarantee differential privacy in the idealized model.
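A minimal sketch of this clamping approach is shown below. The range [a, b], the use of the Laplace mechanism, and treating the dataset size as public are illustrative assumptions for exposition, not recommendations of this document.

```python
import numpy as np

def private_clamped_mean(records, user_function, a, b, epsilon, rng=None):
    """Estimate the mean of a user-supplied function asserted to take values in [a, b].

    Clamping makes the sensitivity (b - a) / n regardless of what the
    arbitrary code in user_function actually returns.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(records)
    clamped = [min(max(user_function(x), a), b) for x in records]
    sensitivity = (b - a) / n  # assumes n is treated as fixed/public
    noise = rng.laplace(scale=sensitivity / epsilon)
    return sum(clamped) / n + noise
```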

The second challenge is that allowing the user to run arbitrary code opens up an attack surface for data to be exfiltrated through side channels. For example, the function $f$ supplied by the user might attempt to write to a global variable as a function of the data point it is evaluated on (say, setting a global Boolean value to “TRUE” if a particular target individual’s record is accessed, thus revealing through a side channel that the target individual is in the dataset). If $f$ can be written in a general-purpose programming language, detecting such side channels can be a very challenging task, one that has been extensively studied in the programming languages community (Agat, 2000; Molnar et al., 2006; Zhang et al., 2012; Pasareanu et al., 2016; Chen et al., 2017; Antonopoulos et al., 2017; Wang et al., 2017; Almeida et al., 2016; Wang et al., 2019a; Brotzman et al., 2019).

We recommend that either the system does not support writing user-defined queries of this sort, or else if it does, that the queries must be written in a small domain specific language such as Fuzz (Haeberlen et al., 2011) that is specifically designed to prevent side effects like this. See Haeberlen et al. (2011) for further details of attacks that can result from running arbitrary code, and the mitigations that have been proposed.

5.6 Peeking at the data

It is important that the system does not peek at the data when making decisions (e.g., when choosing which functions to run or how much noise to use). All decisions must be noisy. Some common errors include revealing the exact attribute values that appear in a dataset (e.g., the exact set of diseases of patients in a hospital, the exact maximum age, etc.) and even the number of records in a table. Such deterministic information complicates privacy analyses (Ashmead et al., 2019) and interferes with useful properties of differential privacy such as composition.

To see why revealing the number of records is problematic, consider a hospital with several different departments accessing the same underlying data. One wants to publish private statistics about cancer patients, another wants to publish statistics about patients with infectious diseases, and another wants to publish statistics about patients over 65. If each department publishes the size of its view of the dataset, they are essentially publishing query answers with no noise. Pieces of information such as this can be combined with additional differentially private statistics to further sharpen inferences about individuals in the dataset (Kifer and Machanavajjhala, 2014). Even if the publication of statistics only goes through one department in an organization, revealing the exact number of records should be avoided (because of possible future data releases by the organization or because of data releases by other organizations about overlapping sets of people).

6 Testing the System

In addition to careful code review, it is important to set up an automated collection of unit-tests for privacy-critical components of the system. These tests cannot positively confirm that the system is differentially private, but can raise flags confirming that it is not. Such tests are important even after a careful code-review because they can catch bugs that are introduced as the system is further developed.

In a system that follows the recommended design, each layer has its own specific set of tests that can be run to help catch mistakes that could cause violations of privacy.

6.1 Testing the Postprocessing Layer

The postprocessing layer does not affect the privacy properties of the system if it is designed according to our recommendation, as it has no direct access to the original data. Thus standard penetration testing is sufficient. This is the reason why as much functionality as possible (like a function for “average” that just re-uses functionality in the privacy layer for “sum” and “size”) should be implemented in this layer rather than in the privacy layer (as discussed in Section 4.1.3) and why the postprocessing layer should be hosted on its own server.

6.2 Testing the Privacy Layer

The privacy layer consists of three main components:

  • The privacy accountant. The privacy accountant keeps track of the accumulated privacy loss of all queries made by the user, and makes sure that it does not exceed a pre-specified budget. For pure differential privacy (i.e., $(\epsilon, \delta)$-differential privacy with $\delta = 0$), this amounts to keeping track of the sum of the privacy parameters $\epsilon_i$ spent on each operation and ensuring that the sum does not exceed the specified budget $\epsilon$. For approximate differential privacy there are more sophisticated variants that may keep track of privacy in different units — like Rényi Differential Privacy (Mironov, 2017a) — using more sophisticated accounting procedures like the moments accountant (Abadi et al., 2016).

  • Basic primitives that sample from distributions used by differentially private mechanisms (e.g. the Laplace distribution, the Gaussian distribution, the two-sided geometric distribution (Ghosh et al., 2009), and the staircase distribution (Geng et al., 2015)). These functions do not access the data layer, but are implemented in the privacy layer because they are called by the differentially private mechanisms which do access the data layer, and their correctness is important for the final privacy guarantees of the system.

  • Differentially private mechanisms (e.g., Report Noisy Max and Sparse Vector (Dwork and Roth, 2014), the Laplace mechanism (Dwork et al., 2006b), and others) whose inputs are computed from the data and whose outputs are supposed to satisfy differential privacy. Note that the difference between the Laplace mechanism and the Laplace distribution is that the Laplace mechanism computes some quantity via a call to the data layer, and then perturbs it using a call to sample from the Laplace distribution (with parameters determined by the sensitivity of the data). The correctness of the Laplace mechanism depends both on the correctness of the function that samples from the Laplace distribution, and on the correctness of the parameters it chooses as a function of the sensitivity of its input and the desired privacy parameter $\epsilon$.

6.2.1 Testing the privacy accountant.

The privacy accountant accumulates the overall privacy impact of results returned to the user. To do this, it requires details about which mechanism was used and what data transformations were applied to the input data. The privacy accountant, however, cannot be given deterministic information about the original data (otherwise, such access would violate differential privacy). Validating the privacy accountant often requires three different steps. The first step is to make sure that the privacy accountant is never accidentally bypassed – it must record the impact of every privacy mechanism that was run. We note that this task can be aided by programming-language support: for instance, through inheritance we can force every mechanism to implement a privacy-accounting method, as sketched below.
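One possible shape for such language-level enforcement is sketched here; the class and method names are invented for illustration and do not correspond to any particular platform.

```python
from abc import ABC, abstractmethod

class PrivacyAccountant:
    """Tracks cumulative privacy loss under pure epsilon-differential privacy."""
    def __init__(self, total_budget):
        self.total_budget = total_budget
        self.spent = 0.0

    def charge(self, epsilon):
        if self.spent + epsilon > self.total_budget:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon

class Mechanism(ABC):
    """Every mechanism must declare its privacy cost; run() always charges it."""
    @abstractmethod
    def privacy_cost(self):
        ...

    @abstractmethod
    def release(self):
        ...

    def run(self, accountant):
        # Charging happens before any output is released, so the accountant
        # cannot be bypassed by a mechanism that forgets to report its cost.
        accountant.charge(self.privacy_cost())
        return self.release()
```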

The second step is to make sure it can correctly measure the privacy impact of each mechanism in isolation. In the case of pure differential privacy, this is often very easy – use the $\epsilon$ value reported by the mechanism. For more complex variants of differential privacy, such as Rényi Differential Privacy, the privacy impact is often provided in the literature as a formula that arises from evaluating an integral. In such cases, verifying the formula with numerical integration is useful. This is also a place where code review should be used to ensure the numerical stability of the floating point operations that are used.

The third step is to ensure that the total privacy impact is aggregated correctly. This involves taking a sequence of operations whose total privacy impact is already known (in the literature) and verifying that the privacy accountant claims the same (or larger) privacy leakage. In many cases, such as in the moments accountant, this simply involves adding the privacy impacts of the individual mechanisms (here it is also important to review the code for numerical stability).

In some cases, testing the accumulated privacy impact is more complex. To illustrate this point, we consider how to account for linear queries. Suppose the input data can be represented as a vector $x$ and suppose a change to one person’s record can change this vector by at most 1 in the $L_1$ norm. That is, changing one person’s record will result in a different vector $x'$ that is close to $x$ in the following sense: $\|x - x'\|_1 \leq 1$. A linear query can be represented as a vector $q$ having the same dimensionality as the data, so that $q \cdot x$ is the query answer.

Example 1.

Suppose the vector $x$ represents a histogram on age, with $x_i$ being the number of people in the data whose age is equal to $i$, for $i = 0, \dots, 115$. (Note that the upper bound on age is 115 in this example. This upper bound must be chosen without looking at the data. Any records with age larger than 115 must then be set to 115; this process is called top-coding or clipping.) Adding or removing one person changes just one entry of the vector by at most 1 (so one person can change $x$ by at most 1 in the $L_1$ norm). A query such as “how many people are 18 and under?” can be represented as a vector $q$, where $q_0 = q_1 = \dots = q_{18} = 1$ while all other entries of $q$ are 0.

Suppose we have $k$ queries $q^{(1)}, \dots, q^{(k)}$ and they are answered using Laplace noise with scales $b_1, \dots, b_k$ (i.e., for each $j$, we are given the noisy query answer $q^{(j)} \cdot x + \mathrm{Laplace}(b_j)$). To compute their precise privacy impact under pure $\epsilon$-differential privacy, we first define the $k \times d$ matrix $Q$ whose row $j$ is the vector $q^{(j)}$. We also define the $k \times k$ diagonal matrix $B$ with entries $B_{jj} = 1/b_j$. For each column of the matrix product $BQ$ we can compute the $L_1$ norm of that column. Let $c$ be the largest norm among the columns. Then the release of the noisy query answers satisfies $\epsilon$-differential privacy with $\epsilon = c$ (but no smaller value).

Thus, when we check the privacy accountant (under pure differential privacy), it should produce an $\epsilon$ that is at least as large as the value $c$ computed above.
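The check described above is easy to automate. The following sketch computes the exact privacy cost of a batch of Laplace-answered linear queries and compares it with whatever the accountant reports; the function names are placeholders.

```python
import numpy as np

def exact_epsilon_for_linear_queries(Q, scales):
    """Exact pure-DP cost of answering the rows of Q with Laplace noise of the given scales.

    Q has one query per row; the data vector can change by at most 1 in L1 norm.
    """
    B = np.diag(1.0 / np.asarray(scales, dtype=float))
    col_norms = np.abs(B @ np.asarray(Q, dtype=float)).sum(axis=0)  # L1 norm of each column of BQ
    return float(col_norms.max())

def test_accountant_is_conservative(accountant_epsilon, Q, scales):
    # The accountant may over-estimate the privacy loss, but must never under-estimate it.
    assert accountant_epsilon >= exact_epsilon_for_linear_queries(Q, scales) - 1e-9
```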

6.2.2 Testing distributions.

Testing the correctness of functions that sample from a distribution can be done using a goodness-of-fit test. For discrete distributions with a relatively small number of possible outcomes (e.g., 20), one can use the chi-squared goodness-of-fit test that is available in virtually any statistical software package (e.g., R, scipy). Generate a large number of samples (e.g., 10 million) and run the test.

For continuous one-dimensional distributions, a recommended goodness-of-fit test is the Anderson-Darling test (Anderson and Darling, 1954). It is well-suited for differential privacy because it puts emphasis on the tails of a distribution (which is where privacy-related problems tend to show up). Suppose the distribution we want to sample from has cumulative distribution function (CDF) $F$ and density $f$. Let $z_1, \dots, z_n$ be points sampled from our code ($n$ should be large, like 10 million). The steps are as follows:

  1. First, sort the $z_i$. Let $z_{(1)} \leq z_{(2)} \leq \dots \leq z_{(n)}$ be those points in sorted order.

  2. Define the empirical distribution $F_n$ as follows: $F_n(t) = \frac{1}{n}\,\bigl|\{i : z_i \leq t\}\bigr|$, the fraction of sampled points that are at most $t$.

  3. Compute the test statistic $A^2 = n \int_{-\infty}^{\infty} \frac{(F_n(t) - F(t))^2}{F(t)(1 - F(t))}\, f(t)\, dt$. For continuous distributions, this integral is equal to $-n - \frac{1}{n}\sum_{i=1}^{n} (2i - 1)\bigl[\log F(z_{(i)}) + \log\bigl(1 - F(z_{(n+1-i)})\bigr)\bigr]$.

  4. Compare the value of the test statistic to the 99th percentile of its asymptotic distribution under the null hypothesis (Marsaglia and Marsaglia, 2004). If the test statistic is larger than this critical value, the test has failed (a correct implementation should only fail about 1% of the time).

In general, it is a good idea to use several different goodness-of-fit tests.
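As an illustration, the following sketch applies the computational formula above to samples that are supposed to come from a Laplace distribution. The critical value is left as a placeholder to be taken from Marsaglia and Marsaglia (2004), and the sampler under test (sample_laplace) is a hypothetical name.

```python
import numpy as np

def anderson_darling_statistic(samples, cdf):
    """Compute the Anderson-Darling statistic A^2 against a fully specified CDF."""
    z = np.sort(np.asarray(samples, dtype=float))
    n = len(z)
    F = np.clip(cdf(z), 1e-300, 1.0 - 1e-16)  # guard the logarithms against 0 and 1
    i = np.arange(1, n + 1)
    return -n - np.mean((2 * i - 1) * (np.log(F) + np.log(1.0 - F[::-1])))

def laplace_cdf(x, scale):
    return np.where(x < 0, 0.5 * np.exp(x / scale), 1.0 - 0.5 * np.exp(-x / scale))

# Hypothetical usage against a sampler under test:
# CRITICAL_VALUE_99 = ...  # 99th percentile from Marsaglia and Marsaglia (2004)
# samples = sample_laplace(scale=2.0, size=10_000_000)
# assert anderson_darling_statistic(samples, lambda x: laplace_cdf(x, 2.0)) <= CRITICAL_VALUE_99
```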

6.2.3 White-box testing of privacy mechanisms.

All differential privacy mechanisms used by a system should have an accompanying mathematical proof of correctness, or a proof produced by a computer-aided verification tool with soundness guarantees. A reference to the literature is not always enough, for the following reasons:

  • Some published mechanisms do not satisfy differential privacy for any value of the privacy parameters, due to errors that are discovered after publication.

  • Some published mechanisms do satisfy differential privacy but have errors in the constants (i.e., they might under-estimate or over-estimate the privacy impact).

  • Some published mechanisms are not accompanied by published proofs.

  • Some published mechanisms may rely on assumptions stated elsewhere in a paper. Verifying their proofs is one way of identifying these assumptions.

  • Some published mechanisms may use slightly different variants of differential privacy than the one being guaranteed by the system. For example, some variants consider the effect of modifying one person’s record (instead of adding/removing a record), or assume that a person can affect only one record in exactly one table of a multi-table dataset (which may not necessarily be the case for actual data).

Once a rigorous proof is established, one can design tests to determine whether the code faithfully matches the pseudocode provided in the literature. For example, for a differentially private histogram mechanism, we may be able to analytically derive the variance of the histogram cells. This variance can be checked by running the mechanism many times and empirically estimating the variance.

This is another reason for the separation between the privacy layer and the data access layer. Creating an input histogram from raw data is usually an expensive operation, while turning the input histogram into a sanitized histogram is a much faster process. Hence, for tests that require multiple runs of a mechanism, we want to avoid re-running the slow components.
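A white-box test of this kind might look like the following sketch, which checks the empirical variance of a Laplace histogram mechanism against the analytical value $2(\Delta/\epsilon)^2$. The tolerance and run count are arbitrary illustrative choices.

```python
import numpy as np

def laplace_histogram(counts, epsilon, sensitivity=1.0, rng=None):
    """Mechanism under test: adds Laplace(sensitivity / epsilon) noise to each cell."""
    rng = np.random.default_rng() if rng is None else rng
    return counts + rng.laplace(scale=sensitivity / epsilon, size=len(counts))

def test_histogram_variance(epsilon=0.5, runs=200_000):
    counts = np.array([10.0, 0.0, 3.0])              # precomputed input histogram
    outputs = np.stack([laplace_histogram(counts, epsilon) for _ in range(runs)])
    expected_variance = 2.0 * (1.0 / epsilon) ** 2   # Var[Laplace(b)] = 2 * b^2
    empirical = outputs.var(axis=0)
    assert np.allclose(empirical, expected_variance, rtol=0.05), (empirical, expected_variance)
```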

6.2.4 Black-box testing of privacy mechanisms.

In some cases, it may be difficult to derive properties of a privacy mechanism for white-box testing. In this case black-box testing can be done. The goal of a black-box test is to identify a neighboring pair of input databases $D_1, D_2$ and a set $S$ of possible outputs with the property that $\Pr[M(D_1) \in S] > e^{\epsilon} \Pr[M(D_2) \in S]$, where $M$ is the mechanism being tested. If such $D_1$, $D_2$, and $S$ are found, they constitute a counterexample that demonstrates a violation of differential privacy.

Searching for an appropriate $D_1$, $D_2$, and $S$ is an area of active research and requires running the mechanism many times. Heuristic techniques for identifying promising choices for $D_1$, $D_2$, and $S$ are discussed by Ding et al. (2018). Typically, it suffices to use small input datasets (with few rows and columns). For example, to test one-dimensional mechanisms, like differentially private sums, variances, counts, and quantiles, one can first consider simple datasets having two columns (the first being a number ranging from 0 to 100 and the second from 0 to 1):

  • Dataset 1: {} (i.e., no records).

  • Dataset 2: {(0, 0)}

  • Dataset 3: {(100, 1), (0, 0)}

  • Dataset 4: {(100, 1), (50, 0), (0, 0)}

Note that Datasets 1 and 2 are neighbors, as are 2 and 3, and 3 and 4. The feature of these datasets is that the records are nearly as different from one another as possible. In such cases, suitable choices for the output set $S$ could be intervals of the form $(-\infty, a]$, $[a, b]$, or $[b, \infty)$. In practice, one would evaluate many different neighboring pairs of databases and many different sets $S$ (e.g., many different intervals).

Once $D_1$, $D_2$, and $S$ are chosen, verifying whether $\Pr[M(D_1) \in S] > e^{\epsilon} \Pr[M(D_2) \in S]$ can be done statistically by running the mechanism on both inputs many times and using a hypothesis test to determine whether there is enough evidence to conclude that the inequality holds, as proposed by Ding et al. (2018). In some cases, the corresponding probabilities can be computed directly from the source code using tools such as the PSI solver (Gehr et al., 2016; Bichsel et al., 2018). Many different values of $\epsilon$ should also be evaluated.
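A crude Monte Carlo version of this check is sketched below; a rigorous test would replace the naive ratio comparison with the hypothesis-testing procedure of Ding et al. (2018), and the mechanism name and trial count here are illustrative.

```python
import math

def estimate_event_probability(mechanism, dataset, event, trials=1_000_000):
    """Estimate Pr[mechanism(dataset) in event] by repeated sampling."""
    hits = sum(event(mechanism(dataset)) for _ in range(trials))
    return hits / trials

def looks_like_violation(mechanism, d1, d2, event, epsilon, trials=1_000_000):
    # Flags candidate counterexamples only; statistical significance must be
    # assessed separately (e.g., with the test of Ding et al., 2018).
    p1 = estimate_event_probability(mechanism, d1, event, trials)
    p2 = estimate_event_probability(mechanism, d2, event, trials)
    return p1 > math.exp(epsilon) * p2

# Example event: the mechanism's output falls in the interval (-inf, 0.5].
# event = lambda output: output <= 0.5
```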

If counterexamples are found, the mechanism does not satisfy differential privacy. If counterexamples are not found, then this adds more support to the working hypothesis that the mechanism is correct — but does not constitute a proof of correctness.

It is important to note that proper system design aids these black-box tests. Since they require running privacy mechanisms millions of times, these tests benefit from a separation between the privacy layer and the data access layer (which houses slow data manipulation operations). Furthermore, the simpler the mechanism, the easier it is to test. For example, suppose we have a two-stage mechanism for computing a differentially private mean. The first stage computes a noisy mean and the second stage forces the noisy mean to be nonnegative (i.e. a postprocessing stage). Black-box testing of the two stages together is less likely to find counterexamples when only the first stage has bugs. However, if we move the postprocessing stage to the postprocessing layer (where it belongs), then we just need to test the first stage by itself, making it easier to discover whether it has any errors.

6.3 Testing the Data Access Layer

The data access layer is responsible for managing and querying the original data. Hadoop, Spark, Hive, and relational databases are backends that can be part of this layer. While it is performing data transformations (such as selecting rows, dropping columns, or aggregating values), it needs to keep track of quantities such as:

  • The set of valid values for a transformation. For example, if we group records by disease, we need to keep track of what the valid groups (diseases) are. These have to be pre-defined (not taken from the data). For numeric attributes, this could include computation on the bounds of attributes. For example, there need to be pre-defined upper and lower bounds on Age. If we are interested in squaring the ages, the system needs to keep track of upper and lower bounds on the square of ages (these bounds cannot depend on the data).

  • Sensitivity and stability. One can often define a distance over the datasets that result from data transformations. For transformations that return datasets (e.g., the SQL query “SELECT Age, Income FROM Salaries WHERE Income > 30,000”), one natural metric between two datasets $D_1$ and $D_2$ is the size of their symmetric difference (how many records must be added to or removed from $D_1$ to obtain $D_2$). For transformations that return vectors of numbers, a natural metric is the $L_1$ norm (common for pure differential privacy) or the $L_2$ norm (common for Rényi Differential Privacy (Mironov, 2017a)). These metrics help us quantify the possible effect that one record could have on the output of a transformation. For numerical transformations, the sensitivity measures (using $L_1$ or $L_2$ distance) how much the output of a transformation can be affected by adding or removing one record in the worst case. For transformations that output datasets, the corresponding concept is called stability (McSherry, 2009; Ebadi et al., 2016) and uses the size of the symmetric difference to quantify how much the output can change.

Thus testing the data access layer requires that it produces the correct query answers, that it properly keeps track of the data domain (and/or bounds on the data), that it properly computes sensitivity, and that it properly computes stability. Note that domain, bounds, sensitivity, and stability can only depend on public knowledge about the data (such as the database schema) but cannot depend on any private information (e.g., the actual records).

The intermediate data products produced within this layer generally fall into at least three types:

  • Datasets – tables with rows corresponding to records and columns corresponding to record attributes. An example is an employees table that records the name, title, workplace location, and salary of every employee in a company.

  • Grouped datasets. Suppose we group the records of this employees table by title and workplace location. Every record in the resulting table consists of a title (e.g., “manager”), a workplace location (e.g., Menlo Park), and the set of employee records that match the title and workplace location (i.e., the records of managers in Menlo Park). Thus this grouped table has the following columns: “title”, “workplace location”, and “record-set”.

  • Vector – a list of predefined data statistics. For example, a two-dimensional vector could include the most frequent title as the first dimension and the overall average salary in the second dimension. We can also think of a scalar as a vector of dimension 1.

The reason for grouping data products into types is that different operations are available on different data types. For example, we can perform SQL queries on tables; we can additionally perform aggregations on the record-set of each row in a grouped dataset; and we can perform vector operations on a vector. Furthermore, for each type we may need to store different information that summarizes the worst-case influence that adding or removing one record can have.

6.3.1 Tracking Bounds and the Data Domain

For each column in a table, we must keep track of the valid values for that column. These values cannot depend on the data. For instance, if a hospital produced a medical records table for patients with infectious diseases, the set of diseases must be predefined (for instance, “ebola” should be one of those diseases even if no patients at the hospital ever contracted ebola). For numerical attributes, one can maintain upper and lower bounds on the attributes (e.g., a lower bound on recorded age and an upper bound on recorded age). All of this can be considered metadata.

For data access operations that output tables, unit tests must ensure that the metadata does not depend on the actual data records. For example, if a system top-codes age at 115 (i.e. ages over 115 are converted to 115), then the age column must have an upper bound of 115 even if such ages do not appear in the data. Similarly, suppose “Workers” is a table that keeps track of the name, title, and age of each worker, and suppose the a priori bounds on age in this table ensure that it is between 18 and 65 (inclusive). If we perform a SQL query such as “SELECT * FROM Workers WHERE Age <= 25”, then the upper and lower bounds on age in the resulting table should be 25 and 18, respectively, even if the resulting table is empty.

As tracking bounds and allowable sets of values can be complex, it is advisable that system designers create a document detailing how each operation affects the allowable domain and attribute bounds of its input tables. This can be augmented by runtime assertions. For example, if the system expects that age is between 18 and 25 in a table that it has produced, it can go through each record to ensure that the bounds hold. What should the system do if it finds a violation of the bounds? It must not throw an error or perform any special action that is detectable by a user (as this would leak information). Instead, it should correct the violating record (e.g., clamp the offending value) so that it satisfies the bound, and log an error message that can only be viewed by system developers (since a bug has been detected). The logging mechanism must not introduce side-channel vulnerabilities (e.g., it should not delay system processing by a noticeable amount or noticeably increase memory usage).

It is important to note that data transformation operations such as GroupBy can differ from traditional implementations in databases and big-data platforms such as Spark. If we have a table of infectious diseases and we perform a GroupBy operation on disease, the resulting grouped table should have an entry for each possible disease (not just the diseases that appear in the data). It is important to specifically write unit tests for this case; a sketch of such a test follows.
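The following sketch, written with pandas purely for illustration, shows both the data-independent grouping behavior being described and a unit test for it; the column name and the disease domain are made up.

```python
import pandas as pd

DISEASE_DOMAIN = ["flu", "measles", "ebola"]  # predefined, never read from the data

def group_counts_over_domain(records: pd.DataFrame) -> pd.Series:
    """Count records per disease, with one entry for every disease in the domain."""
    counts = records.groupby("disease").size()
    # reindex guarantees an entry (possibly 0) for every predefined disease,
    # even when no record with that disease appears in the data.
    return counts.reindex(DISEASE_DOMAIN, fill_value=0)

def test_groups_do_not_depend_on_data():
    data = pd.DataFrame({"disease": ["flu", "flu", "measles"]})
    grouped = group_counts_over_domain(data)
    assert list(grouped.index) == DISEASE_DOMAIN   # "ebola" is present
    assert grouped["ebola"] == 0
```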

6.3.2 Tracking Stability

For operations whose inputs are tables (or grouped tables) and outputs are also tables (or grouped tables), the system must track how a change to the input propagates to the output. For example, suppose a user asks the system to perform the following sequence of operations on an Employee table:

  • Temp1 = (SELECT * FROM Employee) UNION (SELECT * FROM Employee)

  • Temp2 = (SELECT * FROM Temp1) UNION (SELECT * FROM Temp1)

  • Temp3 = (SELECT * FROM Temp2) UNION (SELECT * FROM Temp2)

  • Temp4 = (SELECT * FROM Temp3) UNION (SELECT * FROM Temp3)

  • Temp5 = (SELECT * FROM Temp4) UNION (SELECT * FROM Temp4)

  • SELECT SUM(Salary) FROM Temp5 WHERE Employee_id = ‘1234567’

Suppose salaries are capped at $300,000. How much noise should be added to the result? One could reason holistically about the whole set of queries and determine that the largest change in the answer occurs from adding one person with salary $300,000 to the input table (causing 32 records to be added to Temp5). Hence, if using the Laplace mechanism, Laplace noise with scale $32 \cdot 300{,}000 / \epsilon$ is needed. In general, it is difficult to program this kind of holistic reasoning into a system, but it is easier to track the effect of each transformation in isolation and then to combine the results. For datasets and grouped datasets, this is tracked using the notion of stability of an operation, which depends on the concept of symmetric difference. Given two tables $D_1$ and $D_2$, their symmetric difference $D_1 \triangle D_2$ is the set of records that appear in one, but not both, of the datasets (i.e. the records that appear in $D_1$ but not in $D_2$, together with those that appear in $D_2$ but not in $D_1$). The stability of a function $f$ is a rule that explains how a perturbation of its inputs (as measured by the size of the symmetric difference) affects its output. For instance, let $f$ be the union operation on two datasets $D_1$ and $D_2$ that may contain the same people. If we have two other datasets $D_1'$ and $D_2'$ such that $|D_1 \triangle D_1'| \leq a$ and $|D_2 \triangle D_2'| \leq b$, then clearly $|(D_1 \cup D_2) \triangle (D_1' \cup D_2')| \leq a + b$. In other words, if we add/remove a total of $a$ records in $D_1$ and a total of $b$ records in $D_2$, we can change the union by at most $a + b$ records. By applying this rule sequentially to the above sequence of operations, we can track the influence of one person as we perform the computation: adding/removing 1 person in Employees affects Temp1 by at most 1+1=2 records, which affects Temp2 by at most 2+2=4 records, etc.

Thus unit tests for stability of an operation involve creating datasets, measuring the size of their symmetric differences, and then measuring the sizes of the symmetric differences after applying the operation to each (a sketch of such a test appears after the list below). For some common cases, the stability is given in the literature (Ebadi et al., 2016):

  • GroupBy has a stability of 2. Adding or removing a total of $k$ records in the input causes at most $k$ groups to change. This results in a symmetric difference of size at most $2k$ (since changing one group is the same as removing that group and then adding the new value of that group).

  • Many query languages have a “Limit m” operation which returns at most $m$ records from a dataset. Adding or removing a total of $k$ records can cause the size of the symmetric difference of the output to be as large as $2m$ or as small as $2k$, depending on how Limit is implemented. For example, if we pose the query “SELECT * FROM Employees Limit 10” and the system returns 10 arbitrarily chosen records, then if the Employees table is modified by adding one record, the query could return 10 completely different records (for a symmetric difference of 20). On the other hand, if the system returns the first $m$ records in order based on their timestamp, then the symmetric difference can still have size $2m$ (since the ordering information adds an implicit field, and inserting one early record shifts the position of every other returned record). However, if the system returns the first $m$ records (based on timestamp) but randomly shuffles their order, then the size of the symmetric difference is at most $2k$, as adding/removing $k$ records can affect the returned set by at most $k$ records. Due to these tricky complications, it is not recommended that a system provide an operator such as Limit.

  • Bernoulli random sampling. If we take a table and perform a Bernoulli random sample (drop each record independently with probability $p$), then we require a more general notion of stability that accounts for the randomness (Ebadi et al., 2016). However, this is a stable operation: the effective size of the symmetric difference remains unchanged. Hence, this is a good replacement for Limit.

  • The Order By operator, which sorts the records in a table, is particularly troublesome since adding one record (which could be first in the sorted order) could change the positions of all other records. This introduces an implicit ordering field, which causes the size of the symmetric difference to grow up to the maximum table size. Most differentially private computations have no need for ordering (for example, if we need the average salary or the 95th percentile, we do not need to sort the data first), and hence it is not recommended that a system provide an operator such as Order By.

  • SELECT/WHERE. Simple Select-Project queries of the form “SELECT list-of-attributes FROM MyTable” keep the size of the symmetric difference unchanged.

  • Distinct. Transformations of the form “SELECT DISTINCT(column-name) FROM MyTable” also keep the size of the symmetric difference unchanged.
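As promised above, here is a minimal sketch of a stability unit test; it treats tables as Python multisets (collections.Counter), which is an expository simplification of whatever table representation the data access layer actually uses.

```python
from collections import Counter

def symmetric_difference_size(table_a: Counter, table_b: Counter) -> int:
    """Size of the multiset symmetric difference between two tables."""
    return sum(((table_a - table_b) + (table_b - table_a)).values())

def union_all(table_a: Counter, table_b: Counter) -> Counter:
    """Bag-semantics UNION ALL, the transformation whose stability we test."""
    return table_a + table_b

def test_union_stability():
    d1, d1_prime = Counter({"alice": 1, "bob": 1}), Counter({"bob": 1})        # differ by 1
    d2, d2_prime = Counter({"carol": 1}), Counter({"carol": 1, "dave": 1})     # differ by 1
    before = symmetric_difference_size(d1, d1_prime) + symmetric_difference_size(d2, d2_prime)
    after = symmetric_difference_size(union_all(d1, d2), union_all(d1_prime, d2_prime))
    assert after <= before  # union changes by at most a + b records
```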

It is important to note that in special cases, some operations can actually decrease the size of the symmetric difference, depending on the previous operations. For instance, going back to our motivating example at the beginning of Section 6.3.2, if we added the transformation “SELECT DISTINCT(Salary, Employee_id) FROM Temp5”, then the entire sequence of transformations up to that point leaves the size of the symmetric difference unchanged (i.e. adding/removing $k$ people from the Employees table causes the output of the query sequence to change by at most $k$ records). Such special cases should be tested separately in the same way as individual transformations (that is, the sequence of operations is run on two different inputs whose symmetric difference is known; the test measures the size of the symmetric difference of the outputs and compares it to the upper bound produced by the system).

In addition to tests, it is important to have a document that states and proves the correctness of the stability computations for each transformation.

6.3.3 Tracking Sensitivity and Lipschitz Continuity

For functions whose inputs are datasets and whose outputs are numerical vectors, the analogue of stability is sensitivity, which measures the following: if a total of $k$ records are added to/removed from the input, then by how much (measured in $L_1$ or $L_2$ distance) does the output change? One example is a function that computes the total salary of an input table. If $k$ records are added/removed, then the output changes by at most $k$ times the upper bound on salary. For some functions, the input is a vector and the output is a vector. For such functions, one can track the Lipschitz continuity: if the input to the function changes by at most $c$ (in the $L_1$ or $L_2$ distance), by how much does the output change? For example, if we have a function that takes as input a number and doubles it, then the output changes by at most $2c$. Keeping track of sensitivity and Lipschitz continuity allows a system to determine the overall sensitivity of a calculation. For instance, suppose salary is capped at $s_{\max}$. Then the query “SELECT 2*SUM(Salary) FROM Employees” can be thought of as a sequence of two operations: compute the total salary, then double it. Adding or removing one record changes the total by at most $s_{\max}$, and the doubling operation turns that into $2 s_{\max}$. Thus the sensitivity of the combined sequence of operations is $2 s_{\max}$.

Sensitivity and Lipschitz continuity can be tested by feeding different inputs into a function and measuring how far apart the outputs are; a small sketch of such tracking appears below. Again, a document that states and proves the correctness of the sensitivity and Lipschitz continuity calculations is necessary. These computations can be embedded into a type system that automatically tracks the overall sensitivity of a sequence of operations. For details, see the languages proposed by Reed and Pierce (2010); Winograd-Cort et al. (2017); Gaboardi et al. (2013).
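The sketch below illustrates how per-operation sensitivity and Lipschitz constants compose for the salary example above; the class is a toy for exposition, not a substitute for the type systems cited.

```python
class SensitivityTracker:
    """Tracks a worst-case bound on how much one record can change the output."""
    def __init__(self, bound):
        self.bound = bound

    def then_lipschitz(self, constant):
        # Applying an L-Lipschitz function scales the sensitivity bound by L.
        return SensitivityTracker(self.bound * constant)

SALARY_CAP = 300_000  # a priori, data-independent upper bound on salary

# SELECT 2*SUM(Salary): the sum has sensitivity SALARY_CAP; doubling is 2-Lipschitz.
query_sensitivity = SensitivityTracker(SALARY_CAP).then_lipschitz(2).bound
assert query_sensitivity == 2 * SALARY_CAP
```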

7 Summary and Conclusions

Differential privacy is a worst-case, mathematical statement about an algorithm, and as such it can only be established through mathematical proof. But differential privacy is typically proved in an idealized setting based on mathematical abstractions of algorithms rather than their concrete implementations, and this introduces significant challenges for deployed systems:

  1. Complicated algorithms have complicated analyses, and even the paper proofs of correctness in the idealized model may have errors.

  2. As with any large system, bugs can be introduced when translating the mathematical abstraction of the algorithm into code, and these bugs can invalidate the privacy guarantees.

  3. There are fundamental differences between the idealized model and implemented software: in practice, we do not have access to truly random numbers, continuous distributions, or infinite precision calculation. There are also side channels like run-time that are not typically considered in the idealized model.

All of these issues make both careful code review and testing essential for deployed systems claiming differential privacy. Testing is not a substitute for mathematical proof, but rather a way of potentially detecting either errors in the mathematical proof or important gaps between the idealized model and the implemented system. These tests should be automated and periodically run so that errors introduced as new updates are deployed can be caught quickly.

Testing for differential privacy is a difficult task in and of itself, but can be simplified by a careful system design. The one we propose has two main features that aid in testing:

  1. It partitions functions across layers according to what needs to be tested to guarantee differential privacy: at the data access layer, it is deterministic sensitivity and stability calculations; at the privacy layer, it is the claimed properties of the distributions output by randomized functions. This modularity means, for example, that when testing functionality at the privacy layer, it is not necessary to run the code at the data access layer, which might be time consuming — especially since tests at the privacy layer may need to run millions of times to achieve sufficient statistical significance.

  2. It aims at keeping the “privacy core” as small as possible by pushing as much code as possible to the post-processing layer, whose correctness is immaterial for claims of differential privacy. This limits the quantity of code that needs to be tested in order to build confidence in a claimed degree of differential privacy.

When possible, we also recommend that the core code on which privacy guarantees rely be made open source: this enables a wider population to verify correctness and discover and remediate bugs, and ultimately will help build confidence in the system.

Our guidelines make large-scale deployment of differentially private APIs feasible now. But there is a need for more research on automatic verification and testing for differential privacy to make the development process easier and less error prone. The automated “unit tests” that we propose in these guidelines are statistical hypothesis tests aimed at falsifying claims of worst-case differential privacy. However, one can imagine automated tests that combine statistical and mathematical reasoning and aim to positively establish weaker notions, like “Random Differential Privacy” (Hall et al., 2013). Weaker guarantees of this sort fall short of the end goal of differential privacy, but automated tests that can positively establish such guarantees complement statistical tests aimed at falsifying differential privacy, allowing system builders to establish trust in their systems.

8 Acknowledgments

We thank Mason Hemmel and Keegan Ryan from the NCC Group for pointing out additional sources of timing attacks on differential privacy systems.

References

  • M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang (2016) Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, CCS ’16. Cited by: 1st item.
  • J. Abowd, R. Ashmead, S. Garfinkel, D. Kifer, P. Leclerc, A. Machanavajjhala, B. Moran, and W. Sexton (2019) Census topdown: differentially private data, incremental schemas, and consistency with public knowledge. Note: https://github.com/uscensusbureau/census2020-das-e2e/blob/master/doc/20190711_0945_Consistency_for_Large_Scale_Differentially_Private_Histograms.pdf Cited by: §1, §2, §5.3.
  • J. M. Abowd and I. M. Schmutte (2019) An economic analysis of privacy protection and statistical accuracy as social choices. American Economic Review 109 (1), pp. 171–202. Cited by: §3.4.
  • J. Agat (2000) Transforming out timing leaks. In ACM Symp. on Principles of Programming Languages (POPL), pp. 40–53. Cited by: §5.5.
  • A. Albarghouthi and J. Hsu (2017) Synthesizing coupling proofs of differential privacy. Proceedings of ACM Programming Languages 2 (POPL), pp. 58:1–58:30. External Links: ISSN 2475-1421 Cited by: §1.
  • J. B. Almeida, M. Barbosa, G. Barthe, F. Dupressoir, and M. Emmi (2016) Verifying constant-time implementations. In 25th USENIX Security Symposium (USENIX Security 16), pp. 53–70. Cited by: §5.5.
  • T.W. Anderson and D.A. Darling (1954) A test of goodness-of-fit. Journal of the American Statistical Association 49, pp. 765––769. Cited by: §6.2.2.
  • T. Antonopoulos, P. Gazzillo, M. Hicks, E. Koskinen, T. Terauchi, and S. Wei (2017) Decomposition instead of self-composition for proving the absence of timing channels. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pp. 362–375. Cited by: §5.5.
  • R. Ashmead, D. Kifer, P. Leclerc, A. Machanavajjhala, and W. Sexton (2019) EFFECTIVE privacy after adjusting for invariants with applications to the 2020 census. Note: https://github.com/uscensusbureau/census2020-das-e2e/blob/master/doc/20190711_0941_Effective_Privacy_after_Adjusting_for_Constraints__With_applications_to_the_2020_Census.pdf Cited by: §5.6.
  • A. Askarov, D. Zhang, and A. C. Myers (2010) Predictive black-box mitigation of timing channels. In ACM Conference on Computer and Communications Security (CCS), pp. 297–307. Cited by: §5.4.
  • G. Barthe, M. Gaboardi, E. J. G. Arias, J. Hsu, C. Kunz, and P. Y. Strub (2014) Proving differential privacy in hoare logic. In 2014 IEEE 27th Computer Security Foundations Symposium, pp. 411–424. Cited by: §1.
  • G. Barthe and F. Olmedo (2013) Beyond differential privacy: composition theorems and relational logic for f-divergences between probabilistic programs. In ICALP, Cited by: §1.
  • G. Barthe, N. Fong, M. Gaboardi, B. Grégoire, J. Hsu, and P. Strub (2016a) Advanced probabilistic couplings for differential privacy. In Proceedings of the ACM SIGSAC Conference on Computer and Communications Security (CCS), Cited by: §1.
  • G. Barthe, M. Gaboardi, E. J. G. Arias, J. Hsu, A. Roth, and P. Strub (2015) Higher-order approximate relational refinement types for mechanism design and differential privacy. In POPL, Cited by: §1.
  • G. Barthe, M. Gaboardi, B. Gregoire, J. Hsu, and P. Strub (2016b) Proving differential privacy via probabilistic couplings. In IEEE Symposium on Logic in Computer Science (LICS), Note: To appear Cited by: §1, §5.2.1.
  • G. Barthe, B. Köpf, F. Olmedo, and S. Zanella Béguelin (2012) Probabilistic relational reasoning for differential privacy. In Proceedings of the 39th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pp. 97–110. Cited by: §1.
  • B. Bichsel, T. Gehr, D. Drachsler-Cohen, P. Tsankov, and M. Vechev (2018) DP-finder: finding differential privacy violations by sampling and optimization. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, Cited by: §1, §6.2.4.
  • R. Brotzman, S. Liu, D. Zhang, G. Tan, and M. Kandemir (2019) CaSym: cache aware symbolic execution for side channel detection and mitigation. In IEEE Symposium on Security and Privacy (S&P), pp. 364–380. Cited by: §5.5.
  • M. Bun and T. Steinke (2016) Concentrated differential privacy: simplifications, extensions, and lower bounds. In Theory of Cryptography Conference, pp. 635–658. Cited by: §2, §4.1.2.
  • K. Chaudhuri, C. Monteleoni, and A. D. Sarwate (2011) Differentially private empirical risk minimization. J. Mach. Learn. Res. 12. Cited by: §5.2.3.
  • J. Chen, Y. Feng, and I. Dillig (2017) Precise detection of side-channel vulnerabilities using quantitative cartesian hoare logic. In ACM Conference on Computer and Communications Security (CCS), pp. 875–890. Cited by: §5.5.
  • Z. Ding, Y. Wang, G. Wang, D. Zhang, and D. Kifer (2018) Detecting violations of differential privacy. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, Cited by: §1, §6.2.4, §6.2.4.
  • Y. Dodis, A. López-Alt, I. Mironov, and S. P. Vadhan (2012) Differential privacy with imperfect randomness. In CRYPTO, Cited by: §5.1.
  • C. Dwork, K. Kenthapadi, F. McSherry, I. Mironov, and M. Naor (2006a) Our data, ourselves: privacy via distributed noise generation. In Annual International Conference on the Theory and Applications of Cryptographic Techniques, pp. 486–503. Cited by: §5.2.1, Definition 1.
  • C. Dwork, F. McSherry, K. Nissim, and A. Smith (2006b) Calibrating noise to sensitivity in private data analysis. In Proceedings of the Third Conference on Theory of Cryptography (TCC), Cited by: §5.2.1, 3rd item, Definition 1.
  • C. Dwork and A. Roth (2014) The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci. 9 (3–4). Cited by: item 2, §3.1, §4, §5.2.1, §5.2.1, 3rd item.
  • C. Dwork and G. N. Rothblum (2016) Concentrated differential privacy. arXiv preprint arXiv:1603.01887. Cited by: §4.1.2.
  • H. Ebadi, D. Sands, and G. Schneider (2015) Differential privacy: now it’s getting personal. In POPL, Cited by: §1.
  • H. Ebadi, T. Antignac, and D. Sands (2016) Sampling and partitioning for differential privacy. In 14th Annual Conference on Privacy, Security and Trust (PST), Cited by: §1, 2nd item, 3rd item, §6.3.2.
  • F. Eigner and M. Maffei (2013) Differential privacy by typing in security protocols. In CSF, Cited by: §1.
  • M. Gaboardi, E. J. Gallego-Arias, J. Hsu, A. Roth, and Z. S. Wu (2014) Dual query: practical private query release for high dimensional data. In International Conference on Machine Learning, pp. 1170–1178. Cited by: §5.3.
  • M. Gaboardi, A. Haeberlen, J. Hsu, A. Narayan, and B. C. Pierce (2013) Linear dependent types for differential privacy. In Proceedings of the 40th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL ’13, pp. 357–370. Cited by: §1, §6.3.3.
  • M. Gaboardi, J. Honaker, G. King, K. Nissim, J. Ullman, and S. P. Vadhan (2016) PSI (Ψ): a private data sharing interface. Vol. abs/1609.04340. External Links: Link Cited by: §1.
  • T. Gehr, S. Misailovic, and M. Vechev (2016) PSI: exact symbolic inference for probabilistic programs. In Computer Aided Verification (CAV), Cited by: §6.2.4.
  • Q. Geng, P. Kairouz, S. Oh, and P. Viswanath (2015) The staircase mechanism in differential privacy. IEEE Journal of Selected Topics in Signal Processing 9 (7), pp. 1176–1184. Cited by: 2nd item.
  • A. Ghosh and A. Roth (2015) Selling privacy at auction. Games and Economic Behavior 91, pp. 334–346. Cited by: §3.4.
  • A. Ghosh, T. Roughgarden, and M. Sundararajan (2009) Universally utility-maximizing privacy mechanisms. In Proceedings of the Forty-first Annual ACM Symposium on Theory of Computing, STOC ’09. Cited by: §5.2.1, 2nd item.
  • A. Gilbert and A. McMillan (2018) Property testing for differential privacy. Note: https://arxiv.org/abs/1806.06427 Cited by: §1.
  • A. Haeberlen, B. C. Pierce, and A. Narayan (2011) Differential privacy under fire. In Proceedings of the 20th USENIX Conference on Security, Cited by: §1, §5.4, §5.5.
  • R. Hall, L. Wasserman, and A. Rinaldo (2013) Random differential privacy. Journal of Privacy and Confidentiality 4 (2). Cited by: §7.
  • J. Hsu, M. Gaboardi, A. Haeberlen, S. Khanna, A. Narayan, B. C. Pierce, and A. Roth (2014) Differential privacy: an economic method for choosing epsilon. In 2014 IEEE 27th Computer Security Foundations Symposium, pp. 398–410. Cited by: §3.4.
  • R. Iyengar, J. P. Near, D. Song, O. Thakkar, A. Thakurta, and L. Wang (2019) Towards practical differentially private convex optimization. In Towards Practical Differentially Private Convex Optimization, Cited by: §5.2.3.
  • S. P. Kasiviswanathan and A. Smith (2014) On the’semantics’ of differential privacy: a bayesian formulation. Journal of Privacy and Confidentiality 6 (1). Cited by: §3.1, §3.4, footnote 3.
  • D. Kifer and A. Machanavajjhala (2014) Pufferfish: a framework for mathematical privacy definitions. ACM Trans. Database Syst. 39 (1), pp. 3:1–3:36. Cited by: §5.6.
  • M. Kosinski, D. Stillwell, and T. Graepel (2013) Private traits and attributes are predictable from digital records of human behavior. Proceedings of the National Academy of Sciences 110 (15), pp. 5802–5805. Cited by: item 1.
  • J. Liu and K. Talwar (2018) Private selection from private candidates. CoRR abs/1811.07971. Cited by: §5.2.1.
  • G. Marsaglia and J. Marsaglia (2004) Evaluating the anderson-darling distribution. Journal of Statistical Software, Foundation for Open Access Statistics 9 (2). Cited by: item 4.
  • F. D. McSherry (2009) Privacy integrated queries: an extensible platform for privacy-preserving data analysis. In SIGMOD, pp. 19–30. Cited by: §1, 2nd item.
  • F. McSherry and K. Talwar (2007) Mechanism design via differential privacy. In Proceedings of the 48th Annual IEEE Symposium on Foundations of Computer Science, FOCS ’07, Washington, DC, USA, pp. 94–103. External Links: ISBN 0-7695-3010-9, Link, Document Cited by: §5.2.1.
  • I. Mironov (2017a) Rényi differential privacy. In 2017 IEEE 30th Computer Security Foundations Symposium (CSF), Cited by: 1st item, 2nd item.
  • I. Mironov (2012) On significance of the least significant bits for differential privacy. In the ACM Conference on Computer and Communications Security (CCS), Cited by: §1, 1st item, §5.2.1.
  • I. Mironov (2017b) Rényi differential privacy. In 2017 IEEE 30th Computer Security Foundations Symposium (CSF), pp. 263–275. Cited by: §4.1.2.
  • M. Mislove, J. Ouaknine, M. C. Tschantz, D. Kaynar, and A. Datta (2011) Formal verification of differential privacy for interactive systems. In Twenty-seventh Conference on the Mathematical Foundations of Programming Semantics (MFPS XXVII), Cited by: §1.
  • P. Mohan, A. Thakurta, E. Shi, D. Song, and D. Culler (2012) GUPT: privacy preserving data analysis made easy. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Cited by: §1.
  • D. Molnar, M. Piotrowski, D. Schultz, and D. Wagner (2006) The program counter security model: automatic detection and removal of control-flow side channel attacks. In International Conference on Information Security and Cryptology, pp. 156–168. Cited by: §5.5.
  • S. Neel, A. Roth, G. Vietri, and Z. S. Wu (2019a) Differentially private objective perturbation: beyond smoothness and convexity. arXiv preprint arXiv:1909.01783. Cited by: §5.2.3, §5.3, §5.3.
  • S. Neel, A. Roth, and Z. S. Wu (2019b) How to use heuristics for differential privacy. In Proceedings of the 60th Annual IEEE Symposium on Foundations of Computer Science, FOCS ’19. Cited by: §5.3, §5.3.
  • C. S. Pasareanu, Q. Phan, and P. Malacaria (2016) Multi-run side-channel analysis using symbolic execution and max-smt. In IEEE Computer Security Foundations (CSF), pp. 387–400. Cited by: §5.5.
  • J. Reed and B. C. Pierce (2010) Distance makes the types grow stronger: a calculus for differential privacy. In Proceedings of the 15th ACM SIGPLAN International Conference on Functional Programming, ICFP ’10, pp. 157–168. Cited by: §1, §4.1.1, item 1, §6.3.3.
  • R. Rogers, A. Roth, J. Ullman, and S. Vadhan (2016) Privacy odometers and filters: pay-as-you-go composition. In Advances in Neural Information Processing Systems, pp. 1921–1929. Cited by: §4.
  • I. Roy, S. Setty, A. Kilzer, V. Shmatikov, and E. Witchel (2010) Airavat: security and privacy for MapReduce. In NSDI, Cited by: §1.
  • S. Wang, Y. Bao, X. Liu, P. Wang, D. Zhang, and D. Wu (2019a) Identifying cache-based side channels through secret-augmented abstract interpretation. In 28th USENIX Security Symposium (USENIX Security 19), pp. 657–674. Cited by: §5.5.
  • S. Wang, P. Wang, X. Liu, D. Zhang, and D. Wu (2017) CacheD: identifying cache-based timing channels in production software. In Proceedings of the 26th USENIX Security Symposium (USENIX Security 17), pp. 235–252. Cited by: §5.5.
  • Y. Wang, Z. Ding, G. Wang, D. Kifer, and D. Zhang (2019b) Proving differential privacy with shadow execution. In PLDI, Cited by: §1.
  • D. Winograd-Cort, A. Haeberlen, A. Roth, and B. C. Pierce (2017) A framework for adaptive differential privacy. Proceedings of the ACM on Programming Languages 1 (ICFP), pp. 10. Cited by: §1, §4, §6.3.3.
  • X. Wu, F. Li, A. Kumar, K. Chaudhuri, S. Jha, and J. Naughton (2017) Bolt-on differential privacy for scalable stochastic gradient descent-based analytics. In Proceedings of the ACM International Conference on Management of Data, Cited by: §5.2.3.
  • D. Zhang, R. McKenna, I. Kotsogiannis, M. Hay, A. Machanavajjhala, and G. Miklau (2018) EKTELO: a framework for defining differentially-private computations. In Proceedings of the 2018 International Conference on Management of Data, SIGMOD ’18. Cited by: §1.
  • D. Zhang, A. Askarov, and A. C. Myers (2011) Predictive mitigation of timing channels in interactive systems. In ACM Conference on Computer and Communications Security (CCS), pp. 563–574. Cited by: §5.4.
  • D. Zhang, A. Askarov, and A. C. Myers (2012) Language-based control and mitigation of timing channels. In ACM SIGPLAN Conf. on Programming Language Design and Implementation (PLDI), pp. 99–110. Cited by: §5.5.
  • D. Zhang and D. Kifer (2017) LightDP: towards automating differential privacy proofs. In Proceedings of the 44th ACM SIGPLAN Symposium on Principles of Programming Languages, POPL 2017, Paris, France, January 18-20, 2017, Cited by: §1.
  • H. Zhang, E. Roth, A. Haeberlen, B. C. Pierce, and A. Roth (2019) Fuzzi: a three-level logic for differential privacy. Proc. ACM Program. Lang. 3 (ICFP), pp. 93:1–93:28. Cited by: §1.