Differential privacy (DP) provides formal guarantees that the output of a database query does not reveal too much information about any individual present in the database. While many differentially private algorithms have been proposed in the scientific literature, there are only a few end-to-end implementations of differentially private query engines. Crucially, existing systems assume that each individual is associated with at most one database record, which is unrealistic in practice. We propose a generic and scalable method to perform differentially private aggregations on databases, even when individuals can each be associated with arbitrarily many rows. We express this method as an operator in relational algebra, and implement it in an SQL engine. To validate this system, we test the utility of typical queries on industry benchmarks, and verify its correctness with a stochastic test framework we developed. We highlight the promises and pitfalls learned when deploying such a system in practice, and we publish its core components as open-source software.READ FULL TEXT VIEW PDF
A private data federation is a set of autonomous databases that share a
We consider accurately answering smooth queries while preserving differe...
Modern society generates an incredible amount of data about individuals,...
When sharing sensitive databases with other parties, a database owner ai...
This paper firstly proposes a simple yet efficient generalized approach ...
We study the design of differentially private algorithms for adaptive
Data exploration systems that provide differential privacy must manage a...
Many services collect sensitive data about individuals. These services must balance the possibilities offered by analyzing, sharing, or publishing this data with their responsibility to protect the privacy of the individuals present in their data. Releasing aggregate results about a population without revealing too much about individuals is a long-standing field of research. The standard definition used in this context is differential privacy (DP): it provides a formal guarantee on how much the output of an algorithm reveals about any individual in its input [dwork2014algorithmic, dwork2007ad, dwork2009differential]. Differential privacy states that the distribution of results derived from private data cannot reveal “too much” about a single person’s contribution, or lack thereof, to that data [dwork2006calibrating]. By using differential privacy when analyzing data, organizations can minimize the disclosure risk of sensitive information about their users.
Query engines are a major analysis tool for data scientists, and one of the most common ways for analysts to write queries is with Structured Query Language (SQL). As a result, multiple query engines have been developed to enable data analysis while enforcing DP [mcsherry2009privacy, johnson2018towards, kotsogiannisarchitecting, bater2018shrinkwrap], and all of them use a SQL-like syntax.
However, as we discuss in Section 2, these differentially private query engines make some implicit assumptions, notably that each individual in the underlying database is associated with at most one database record. This does not hold in many real-world datasets, so the privacy guarantee offered by these systems is weaker than advertised for those databases. To overcome this limitation we introduce a generic mechanism for bounding user contribution to a large class of differentially private aggregate functions. We then propose a design for a SQL engine using these contribution bounding mechanisms to enforce DP, even when a given individual can be associated with arbitrarily many records or the query contains joins.
Our work goes beyond this design and accompanying analysis: we also describe the implementation of these mechanisms as part of a SQL engine, and the challenges encountered in the process. We describe a stochastic testing framework that generates databases on which we test for differential privacy to increase our level of trust into the system’s robustness. To aid in replicability of our work and encourage wider adoption of differential privacy, we release core components of the system as open-source software.
To be useful for non-expert analysis, a differentially private SQL engine must at least:
Make realistic assumptions about the data, specifically allowing multiple records to be associated with an individual user.
Support typical data analysis operations, such as counts, sums, means, percentiles, etc.
Provide analysts with information about the accuracy of the queries returned by the engine, and give them clear privacy guarantees.
Provide a way to test the integrity of the engine and validate the engine’s privacy claims.
In this work, we present a differentially private SQL engine that satisfies these high-level requirements. More precisely:
We detail how we use the concept of row ownership to enforce the original meaning of differential privacy: the output of the analysis does not reveal anything about a single individual. In our engine, multiple rows can be associated with the same “owner” (hereafter referred to as a user, although the owner could also be a group), and the differential privacy property is enforced at the user level.
We implement common aggregations (counts, sums, medians, etc.), arbitrary per-record transforms, and joins on the row owner column as part of our engine. To do so we provide a method of bounding query sensitivity and stability across transforms and joins, and a mechanism to enforce row ownership throughout the query transformation.
We detail some of the usability challenges that arise when trying to productionize such a system and increase its adoption. In particular, we explain how we communicate the accuracy impact of differential privacy to analysts, and we experimentally verify that the noise levels are acceptable in typical conditions. We also propose an algorithm for automatic sensitivity determination.
We present a testing framework that verifies that -DP aggregation functions are correctly implemented, and can be used to detect software regressions that break the privacy guarantees.
Overall, this work contributes to research on differential privacy by proposing a method of bounded contributions and exploring the trade-offs in accuracy in a testable and verifiable way. Additionally, this work can increase the appropriate adoption of differential privacy by providing a usable system based on popular tools used by data analysts. For reproducibility and adoption, we release the new SQL aggregation operations and the stochastic tester as open-source software.
Multiple differentially private query engines have been proposed in the literature. In this work, we mainly compare our system to two existing differentially private query engines: PINQ [mcsherry2009privacy] and Flex [johnson2018towards]. Our work differs in two major ways from these engines: we support the common case where a single user is associated with multiple rows, and we support arbitrary
GROUP BY statements.
In these systems a single organization is assumed to hold all the raw data. Query engines can also be used in other contexts: differential privacy can be used in concert with secure multiparty computation techniques to enable join queries between databases held by different organizations, e.g., DJoin [narayan2012djoin] and Shrinkwrap [bater2018shrinkwrap].
A significant amount of research focuses on improving the accuracy of query results while still maintaining differential privacy. In this work, for clarity, we keep the description of our system conceptually simple, and explicitly do not make use of techniques like smooth sensitivity [nissim2007smooth], tight privacy budget computation methods [kairouz2017composition, meiser2018tight], variants of the differential privacy definition [bun2016concentrated, mironov2017renyi, desfontaines2019sok], adjustment of noise levels to a pre-specified set of queries [li2014data], or generation of differentially private synthetic data to answer arbitrarily many queries afterwards [kotsogiannisarchitecting, bindschaedler2017plausible].
The testing framework we introduce in Section 5.3 is similar to recent work in verification for differential privacy [DingEtAl2018, bichsel2018dp], but approaches the problem in a generic way by testing a diverse set of databases that are agnostic to specific algorithms.
Our work is not the first to use noise and thresholding to preserve privacy: this method was originally proposed in [korolova2009releasing, gotz2011publishing] in the specific context of releasing search logs with -DP; our work can be seen as an extension and generalization of this intuition. Diffix [francis2017diffix] is another system using similar primitives; however, it does not provide any formal privacy guarantee; so a meaningful comparison with our work is not feasible. In Section 4, we provide a comparison of query accuracy between our work, PINQ, and Flex.
We introduce here the definitions and notations used throughout this paper. Let be an arbitrary set of records, and an arbitrary set of user identifiers. A row is a pair for some and , and a database is a multiset of rows. A user is said to own the rows for all . We denote the space of all databases.
We denote row-level change the addition or removal of a single row from a database, and user-level change the addition or removal of all rows associated with a user. Given two databases and , we denote the minimum number of row-level changes necessary to transform into , and the minimum number of user-level changes necessary to transform into .
We recall the definitions of -differential privacy and of global -sensitivity.
A randomized mechanism satisfies row-level -DP if for all pairs of databases that satisfy , and for all sets of outputs , we have:
satisfies user-level DP111A similar notion in the context of streaming data, pan-privacy, is introduced in [dwork2010pan]. if the above condition holds for all pairs of databases such that . -DP is an alias for -DP.
Note that this notion is technically unbounded differential privacy [kifer2011no], which we use for simplicity throughout this work. Up to a change in parameters, it is equivalent to the classical definition, which also allows the change of one record in the distance relation between databases.
The global -sensitivity of a function is defined by:
where denotes the norm. The user-global -sensitivity of is defined by:
Before describing the technical details of our system we first give an intuition of how it works using a simple example: histogram queries. Consider a simple database that logs accesses to a given website. An analyst wants to know which browser agents are most commonly used among users visiting the page. A typical query to do so is presented in 1.
How would one make this simple operation -differentially private? One naive approach is to add Laplace noise of scale to each count. This solution suffers from several shortcomings.
The naive approach will correctly hide the existence of individual records from the database: each record of the access log will only influence one of the returned counts by at most , and it is well known [dwork2006calibrating] that this mechanism will provide -DP. However, it fails to protect the existence of individual users: the same user could have visited the example page many times with a particular browser agent, and therefore could have contributed an arbitrarily large number of rows to
visits for a particular
GROUP BY partition, violating our assumption that query sensitivity is 1.
In PINQ and Flex, the differential privacy definition explicitly considers records as the privacy unit. Because we instead want to protect the full contribution of users, we need to explicitly include a notion of a user in our system design. In this work, we do this via the notion of a user identifier, hereafter abbreviated uid.
Because 1 has unbounded sensitivity, just adding noise to counts is not enough to enforce differential privacy; we need to bound user contribution to each partition. One way this can be addressed is by counting distinct users, which has a user-global sensitivity of 1, instead of counting rows. Here, even though it modifies the query semantics, we chose this approach to keep the example simple. This leads to a modification of the query, as shown in 2.
In other contexts it might make more sense to allow a user to contribute more than once to a partition (e.g. we count up to five visits from each distinct user with each distinct browser agent); in this case we would need to further modify the query to allow multiple contributions and increase sensitivity to match the maximum number of contributions.
Even if we bound contribution to partitions and adapt noise levels, the query is still not -DP. Suppose that the attacker is trying to distinguish between two databases differing in only one record, but this record is a unique browser agent : this browser agent does not appear in , but appears once in . Then, irrespective of the value of the noisy counts, the
GROUP BY keys are enough to distinguish between the two databases simply by looking at the output: will appear in the query output for but not for .
To address this problem, we drop from the results all keys associated with a noisy count lower than a certain threshold . is chosen independently of the data, and the resulting process is -DP with . We call this mechanism -thresholding. With a sufficiently high , the output rows with keys present in but not
(and vice-versa) will be dropped with high probability, making the keys indistinguishable to an attacker. A longer discussion on the relation between, and can be found in Section 3.5. This approach is represented in SQL in 3.
PINQ and Flex handle this issue by requiring the analyst to enumerate all keys to use in a
GROUP BY operation, and return noisy counts for only and all such keys222The open-source implementation of Flex [flexgithub], however, does not appear to implement this requirement.. This enforces -DP but impairs usability: the range of possible values is often large (potentially the set of all strings) and difficult to enumerate, especially if the analyst cannot look at the raw data.
Some data synthesis algorithms have been proposed to release histograms under -DP [mcsherrydatasynthesis], but are limited, for example to datasets subject to hierarchical decomposition. Our approach is simpler and more generic, at some cost in the privacy guarantee.
Finally, we must consider the possibility of a user contributing to multiple partitions in our query. Imagine a user visiting the example page with many different browsers, each with a different browser agent. Such a user could potentially contribute a value of 1 to each partition’s count, changing the sensitivity of the query to be the number of partitions which is unbounded!
Because both PINQ and Flex consider records as the privacy unit, this is not an issue for their privacy models. So long as they are only used on databases where that requirement holds true, and where the sensitivity and stability impact of joins (and related operations) are carefully considered, they will provide adequate DP guarantees. However as shown in [ubernotdp], these conditions are not always true.
Instead of adding strict requirements on the nature of the underlying database and on how joins are used, we introduce a novel mechanism for bounding user contribution across partitions. Concretely, we first choose a number , and for each user, we randomly keep the contributions to partitions for this user, dropping contributions to other partitions. This operation allows us to bound the global sensitivity of the aggregation: each user can then influence at most unique counts, and we can adapt the noise level added to each count, by using Laplace noise of scale .
The final version of our query is shown in 4. It uses a non-standard variant of the SQL
TABLESAMPLE operator, which supports partitioning and reservoir sampling, to represent the mechanism we introduced. This finaly version satisfies -differential privacy for well-chosen parameters.
In the remainder of this paper, we formalize this approach, and adapt it to a larger set of operations. In particular, we extend it to arbitrary aggregations with bounded sensitivity, and we explain how to make this model compatible with joins.
As suggested in Section 1.3, we assume that there is a special column of the input database that specifies which user owns each row. The system is agnostic to the semantics of this special column. In principle, it can be any unit of privacy that we need to protect: a device identifier, an organization, or even a unique row ID if we want to protect rows and not users. For simplicity of notation we assume that this special column is a user identifier. Users may own multiple rows in each input table, and each row must be owned by exactly one user. Our model guarantees -DP with respect to each user, as defined in Definition 2.
We implement our DP query engine in two components on top of a general SQL engine: a collection of custom SQL aggregation operators (described in Section 3.2), and a query rewriter that performs anonymization semantics validation and enforcement (described in Section 3.3). The underlying SQL engine tracks user ID metadata across tables, and invokes the DP query rewriter when our anonymization query syntax is used on tables containing user data and any applicable permission checks succeed. 5 provides an example of a SQL query accepted by our system.
The query rewriter decomposes such queries into two steps, one before and one after our introduced DP aggregation operator, denoted by
ANONYMIZATION. The first step begins by validating that all table subqueries inside the DP operator’s
clause enforce unique user ownership of intermediate rows. Next, for each row in the subquery result relation, our operator partitions all input rows by the vector of user-specified
GROUP BYkeys and the user identifier, and applies an intermediate vanilla-SQL partial aggregation to each group.
For the second step, we sample a fixed number of these partially aggregated rows for each user to limit user contribution across partitions. Finally, we compute a cross-user DP aggregation across all users contributing to each
GROUP BY partition, limiting user contribution within partitions. Adjusting query semantics is necessary to ensure that, for each partition, the cross-user aggregations receive only one input row per user.
In this section, we present the set of supported -DP statistical aggregates with bounded contribution. These functions are applied as part of the cross-user aggregation step, discussed in Section 3.3. Importantly, at this step, we assume that each user’s contributions have been aggregated to a single input row - this property is enforced by the query rewriter.
For a simple example for bounded contribution,
COUNT(DISTINCT uid) counts unique users. Adding or subtracting a user will change the count by no more than 1.
For more complex aggregation functions we must determine how much a user can contribute to the result and add appropriately scaled noise. A naive solution without limits on the value of each row leads to unbounded contribution by a single user. For example, a
SUM which can take any real as input has an unbounded -sensitivity by Definition 3.
To address this, each -DP function accepts an additional pair of lower and upper limit parameters used to clamp (i.e., bound) each input. For example, denoting the lower and upper bounds as and , respectively, consider the anonymized sum function:
Let be the function that transforms each of its inputs into , and then all are summed. The global sensitivity for this bounded sum function is:
ANON_SUM can be defined by using this function and then adding noise scaled by this sensitivity. For all functions, noise is added to internal states using the well-known Laplace mechanism [geng2012optimal, dwork2014algorithmic] before a differentially private version of the aggregate result can be released.
ANON_AVG, we use the algorithm designed in Li et al. [li2016differential]: we take the quotient of a noisy sum (bounded as in
ANON_SUM, and scaled by sensitivity ) and a noisy count.
ANON_VAR is similarly derived; we use the same algorithm to compute a bounded mean and square it, and to compute a mean of bounded squares.
ANON_STDDEV is implemented as the square root of
ANON_NTILE, is based on a Bayesian binary search algorithm [ben2008bayesian, karp2007noisy], and can be used to define max, min, and median functions. The upper and lower bounds are only used to restrict the search space and do not affect sensitivity. Each iteration of the internal binary search alters counts with a noise-adding mechanism scaled by sensitivity .
For the rest of this paper, we assume that contribution bounds are specified as literals in each query to simplify our presentation. Setting bounds requires some prior knowledge about the input set. For instance, to average a column of ages, the lower bound can be reasonably set to and the upper bound to . To enhance usability in the case where there is no such prior knowledge, we also introduce a mechanism for automatically inferring contribution bounds, described in more detail in Section 5.1.
The definition of sensitivity given in Definition 3 assumes deterministic functions. From here, we will say that the global sensitivity of a -DP aggregate function is bounded by if the global sensitivity of the same function with no noise added is bounded by . Due to the contribution bounding discussed in this section, we can determine such bounds for our -DP aggregation functions.
Table 1 lists our suite of aggregate functions and their sensitivity bounds, proven in Appendix E. These bounds assume that the aggregation is done on at least one user; we do not consider the case where we compare an empty aggregation with an aggregation over one user. This last case is considered in Section 3.5. Note that the bounds shown here are loose; we mostly care about the boundedness, and the order of magnitude with respect to and .
In this section, we define our -DP relational operator, denoted by . To do so, we use some of the conventional operators in relational algebra, a language used to describe the transformations on databases that occur in a query:
: Project columns from .
: Select from satisfying the predicate .
: Group on the cross products of distinct keys in the columns in . Apply the aggregations in to each group.
: Take the cross product of rows in and , select only the rows that satisfy the predicate cond.
Let (select-list), (group-list), and (aggregate-list) denote the attribute names , , and , respectively. Assume is restricted to only contain the -DP aggregate function calls discussed in Section 3.2. Our introduced operator, , can be interpreted as an anonymized grouping and aggregation operator with similar semantics to .
Let be a table subquery containing any additional analyst-specified operators that do not create any intermediate objects of shared ownership (as defined in Section 3.3). Let be an input table with each row owned by exactly one user. must have a denoted user-identifying uid attribute in the schema for the query to be allowed. We use as the input to our proposed operator . Then, we can define our query :
Our approach represents the general form of this relational expression using augmented SQL:
The query rewriter discussed in Section 3.1 transforms a query containing into a query that only contains SQL primitives, minimizing the number of changes to the underlying SQL engine. We define the following in order to express our rewriter operation:
Let be the corresponding non--DP partial aggregation function of . For example, if is
ANON_SUM, would be
Let behave like a reservoir-sampling SQL
TABLESAMPLE operator where is the grouping list and is the number of samples per group. In other words, the operation partitions by columns . For each partition, it randomly samples up to rows. We use as the stability-bounding operator and discuss its implications in Section 3.4.
When the query rewriter is invoked on the following relational expression containing :
it returns the modified expression, with expanded:
Effectively, the rewriter splits into the two-stage aggregation and . groups the output of by the key vector , applying the partial aggregation functions to each group. rows are then sampled for each user. The first aggregation enforces that there is only one row per user per partition during the next aggregation step. performs a second, differentially private, aggregation over the output of . This aggregation groups only by the keys in and applies the -DP aggregation functions . These functions assume that every user contributes at most with one input row. A filter operator is applied last to suppress any rows with too few contributing users.
In this section, we introduce the constraints imposed by on the table subquery . Our approach requires that the relational operators composing do not create any intermediate objects of shared ownership, that is, no intermediate row may be derived from rows owned by different users. A naive application of certain relational operators violate this requirement. For example, for the aggregation operator, rows owned by distinct users may be aggregated into the same partition. Then the resulting row from that partition will be owned by all users whose data are contributed to the group. For the naive join operator, two rows from distinct users may be joined together, creating a row owned by both users.
This restriction limits our system since some queries cannot be run. We observed that in practice, most use-cases can be fit within these constraints. We leave extensions to a wider class of queries for future work. This might require a different model than the one presented here, but changing query semantics will always be necessary for queries with unbounded sensitivity.
We address the shared ownership issue by restricting each operator composing such that, for each row in that operator’s output relation, that row is derived only from rows in the input relation that have matching uid attributes. We enforce this rule for aggregate operators by requiring that the analyst additionally groups-by the input relation’s uid attribute. For join operators, we require that the analyst adds a
USING() clause (or equivalent) to the join condition. Additionally, each operator of must propagate uid from the input relation(s) to the output relation. In queries where this is unambiguous (i.e., the analyst does not refer to uid in the query), we can automatically propagate uid.
Allowed alternatives for each basic operator are listed in Table 2. They are enforced recursively during the query rewrite for each operator that composes .
|Operator||Basic Form||Required Variant|
Consider the following query, counting the number of employees per department with at least one order:
Note that including bounds on
ANON_COUNT(*, L, U) is shorthand for using
ANON_SUM(col, L, U) in the cross-user aggregation step. We can express this query in relational algebra using our DP operator, :
Expand into the two-stage aggregation, and . is a table subquery. Our query can then be written as:
Note that we add an additional user counting -DP function, aliased as , which is compared to our threshold parameter, , to ensure that the grouping does not violate the -DP predicate, to be discussed in Section 3.5. In this case is the number of unique users in each department.
In Figure 1, we illustrate the workflow with example tables
Order, , and .
We adapt the notion of query stability from [mcsherry2009privacy].
Let be a function . We say that has -stability if for all :
Note that for a user owning rows in and a -stable transformation , there may be rows derived from rows owned by in .
Our privacy model requires the input to the cross-user aggregation to have constant stability. Simple SQL operators have a stability of one. For instance, each record in an input relation can only be projected onto one record. So an addition or a deletion can only affect one record in a projection; thus projections have a stability of one. The same logic applies for selection. Other operators, such as joins, have unbounded stability because source records can be multiplied. Adding or removing a source record can affect an unbounded number of output records. When SQL operators are sequentially composed, we multiply the stability of each operator to yield the entire query’s overall stability.
We can compose an unbounded transform with a stability-bounding transform to yield a composite -stable transform:
For , we use partitioned-by-user reservoir sampling with a per-partition reservoir size of , which has a stability of . Reservoir sampling was chosen for its simplicity, and because it guarantees a strict bound on the contribution of one user. Non-random sampling (e.g. taking the first elements, or using systematic sampling) risks introducing bias in the data, depending on the relative position of records in the database. Simple random sampling does not guarantee contribution bounding, and all types of sampling with replacement can also introduce bias in the data.
Joins appear frequently in queries [johnson2018towards], so it is imperative to support them in order for an engine to be practical. Since joins have unbounded stability, a stability bounding mechanism is necessary to provide global stability privacy guarantees on queries with joins. We can thus support a well bounded, full range of join operators.
In this section, we show that the user-global sensitivity of any allowed query in our engine is bounded. Sensitivity is bounded due to the structure of our two-stage aggregation, the bounded-contribution aggregation functions, and the stability-bounding operator .
Consider an anonymization query in the form , where is the input table and is the query transformation. Then there exist constants , which depends on , and an engine-defined constant , such that the user sensitivity satisfies .
Recall Definition 3:
Let be some allowed query transformation in our privacy model. Since must return a vector of aggregate values as the output, we can write , where and . In other words, is a database-to-database transformation while takes a database and returns a vector of real numbers. Suppose that has stability . Then for any such that , we have .
Suppose the maximum number of rows any user owns in the database is . Then for our query , the addition or deletion of a single user from a database creates at most changes in . Thus, our user-global sensitivity is bounded by:
The databases for all is a subset of all databases , so
which, by the definition of global sensitivity, can be bounded as
Now, consider in addition that:
Again, since for all is a subset of ,
from which we conclude that
Our per-user sensitivity is unbounded if at least one of , , or are unbounded. Our privacy model, however, is formulated so that we can bound the product.
Since is an allowed query for our privacy model, we know that must be a finite vector of bounded-contribution aggregation functions, as discussed in Section 3.2. Therefore, for each , the per-row global sensitivity of is bounded and listed in Table 1. Each sensitivity is function-dependent, so call it . The sensitivity of is then bounded by the sum of these sensitivities .
The operator bounds stability by sampling a fixed amount of rows per user after the per-user aggregation stage. Call this fixed amount . Then the number of rows owned by a user in the transformed database, previously , is now bounded by .
Putting it all together, for each user there can only be contributing rows to , each with a bounded contribution of , as determined by the analyst-specified clamp bounds in Table 1. We can conclude that for model-defined constants and , we have
concluding the argument. ∎
In this section, we outline a technique to prevent the presence of arbitrary group keys in a query from violating the privacy predicate: the -thresholding mechanism. For example, consider the naive implementation for the following query:
SELECT col1, ANON_SUM(col2, L, U) FROM Table GROUP BY col1;
Suppose that for the value , only one user contributed to the sum. Then querying without data from would reveal the absence of the output row corresponding to group . It would be revealed with certainty that user has value for .
To prevent this, for each grouping or aggregation result row we first calculate an -DP count of unique contributing users. If that count is less than some minimum user threshold the result row must not be released. is chosen by our model based on , , and parameter values. In the example above, the output row for group would not appear in the result with some probability.
Consider a database containing one row. The probability that an -DP noisy count of the number of rows in will yield at least for any is
The noisy count is distributed as the Laplace distribution centered at the true count of with scale parameter . Evaluating the CDF at yields . ∎
Let be privacy parameters. Consider a SQL engine that, for each non-empty group in a query’s grouping list, computes and releases an -DP noisy count of the number of contributing users. For empty groups, nothing is released. A user may not influence more than such counts. Each count must be or greater in order to be released. We may set
to provide user-level -DP in such an engine.
In this section, we demonstrate that our engine satisfies user-level -DP, as defined in Definition 2.
Suppose that a query has aggregate function list and grouping list . Let the privacy parameter for aggregation function be .
Consider any set of rows owned by a single user in the input relation of our anonymization operator, . We first partition and aggregate these rows by the key vector , before sampling up to rows for each partition by .
The result for a group is reported if the -thresholding predicate defined in Section 3.5 is satisfied. Computing and reporting the comparison count for this predicate is -DP. For , such that they differ by a user’s data for a single group , consider each aggregation function as applied to group . By composition theorems [kairouz2017composition] and Theorem 2, we provide user-level -DP for that row.
However, there are many groups in a given query. Due to our stability bounding mechanism, a single user can contribute to up to groups. The output rows corresponding to these groups can be thought of as the result of sequentially composed queries. Let such that . Let be the sum of the greatest elements in the set . By composition theorems in differential privacy [kairouz2017composition], we conclude that for any output set , and some , we have
This demonstrates that given engine-defined parameters , , and , it is possible to set privacy parameters for individual -DP functions to satisfy the query -DP predicate. In reality, we conservatively set for all and . is enforced by the derived parameter as discussed in Section 3.5. Both user-privacy parameters and can therefore be bounded to be arbitrarily small by analysts and data owners.
Note that the method we use to guarantee user-level differential privacy can be interpreted as similar to row-level group privacy: after the per-user aggregation step, we use the composition theorem to provide group privacy for a group of size . Alone, row-level group privacy does not provide user-level privacy, but in combination to user-level contribution bounding, this property is sufficient to obtain the desired property.
In this section, we explore the accuracy of our system by running numerical experiments and provide analytical reasoning about the relationship between accuracy and various parameters.
Intentionally incorrect sensitivity to demonstrate contribution bounding
We assess accuracy experimentally using TPC-H [council2008tpc], an industry standard SQL benchmark. The TPC-H benchmarks contains a database schema and queries that are similar to those used by analysts of personal data at real-world organizations. In addition, the queries contain interesting features such as joins and a variety of aggregations. We generate a TPC-H database with the default scale factor of . We treat suppliers or customers as “users”, as appropriate. Our metric for accuracy is median relative error, the same one used in [johnson2018towards]; a smaller median relative error corresponds to higher utility.
We compute the median relative error of 1,000,000 runs for our model over TPC-H Query 1 using three different aggregation functions and in Table 3. We compare our results to 1,000,000 runs of Flex over the same query, and 10,000 runs (due to performance considerations) of PINQ over the same query. To present a fair comparison, we disabled -thresholding and compared only one result record to remove the need for stability bounding. In addition, each run of the experiment was performed using a function of fixed sensitivity, controlled by supplying the function with a lower bound of and an upper bound of the value in the Q1 column. The bounds were fixed to minimize accuracy loss from contribution clamping.
For our experiments with PINQ and Flex, we also set sensitivity to our previously determined Q1 listed in Table 3. The results are close to our model’s results, but because neither PINQ nor Flex can enforce contribution bounds for databases with multiple contributions per user, incorrectly set sensitivity can result in query results that are not differentially private. Such incorrectly set bounds can be seen in experiments in Johnson et al. [johnson2018towards] and McSherry’s analysis [ubernotdp], and in the last row of Table 3, where PINQ and Flex report errors far below what are required to satisfy the -DP predicate.
With correctly set sensitivity bounds our model’s results are comparable to PINQ’s results for count and average. Implementation differences in our median function mean that our error is lower by a factor of 2. Both PINQ and our model outperform Flex’s result for count by around an order of magnitude. We don’t report errors for average and median for Flex because Flex does not support those functions.
|Query||Q||Experimental error||-thresholding rate|
Results uninterpretable due to high levels of -thresholding
We present the results of running our system over a selection of TPC-H queries containing joins in Table 4. Similarly, we report the median relative error of 1,000,000 runs for each query using . We report the impact of -thresholding (the ratio of suppressed records), suggesting that our model is -DP. was set with [dwork2009differential], where is the number of distinct users in the underlying database: either customers or suppliers, depending on the query.
Q4 represents how our system behaves when very little -thresholding occurs. Q16 and Q21 demonstrate the opposite, both queries exhibit a very large error that differs from the theoretical error due to most partitions being removed by the threshold because of their small user count. Indeed, this is by design: as partition user counts approach 1, the ratio of -thresholding approaches . Finally, Q13 represents a more typical result, a query containing mostly medium user count partitions with a long tail of lower count partitions. A moderate amount of -thresholding occurs which increases error compared to Q4, but the results are still quite accurate.
In this section we explore the relationship between utility and various parameters, which must be adjusted to balance privacy and utility [amin2019].
The privacy parameter is inversely proportional to the Laplace scale parameter used by anonymous functions to add noise. Hence, an increase in leads to a decrease in utility. By the derivation in Appendix C, the median error satisfies:
where is the sensitivity. When many aggregations are used in a single query, the privacy budget is split equally among them. In the presence of aggregations, each will satisfy -differential privacy. Thus, utility also degrades inversely with the number of aggregations.
For a fixed , varying causes the threshold to change, which changes the number of records dropped due to thresholding. Similarly, changing modifies the number of records dropped due to contribution bounding. We first perform experiments on TPC-H Query 13 with varying to quantify the impact on partitions returned: Figure 2 displays the results.
In addition, we analyze the effect of for a specific artificial query, typical of common analyses. Consider the following query after rewriting.
Suppose that the values in
col are distributed according to some distribution and that each of users in
Table has a number of rows distributed according to . Then the mean squared error of the query above is
Figure 3 shows the effect of on with and for various distributions , all of them with the same mean. For all distributions shown the curves are nearly identical. Notably for larger the change in mean squared error diminishes rapidly to .
We analyze the effect of clamping on accuracy using model input distributions. Since clamping occurs at the -DP aggregation level, we focus on input sets that have at most one row per user.
Consider an input set of size uniformly distributed on . The expected mean of is . For symmetric input distribution, symmetric clamping will not create bias, so we focus on clamping only one end of the distribution: consider the clamping bounds and such that . Then the expected mean of set clamped between is
and the expected error of the clamped mean is
the error scales quadratically with . Compare this error to the median noise added by
ANON_AVG(S, l, u), which is approximately (see derivation in Appendix C). Not only does this error only grow linearly in the clamp range , but it decreases inversely with input size . We plot the sum of the clamping error and the median noise with and various in Figure 4. Analysis with the other aggregation functions yields similar results: overestimating the spread of the input set is superior to restricting the sensitivity, especially with large sets.
Figure 5 displays the clamping expected error as a function of the upper bound for several input distributions. These distributions are all centered at . The lower bound was set at .
Designing a differentially private query engine for non-expert use requires a number of considerations beside the design and implementation of the system described in the previous section. In this section, we highlight a few of these concerns, and detail the approaches we have taken to mitigate them.
Usability is paramount to the success of any system, especially if it changes existing workflows. We present three ways we tried to improve the system’s usability.
One major difference between standard SQL queries and queries using our differentially private aggregation operator is the presence of bounds: e.g., when using
ANON_SUM, an analyst must specify the lower and upper bound for each sum input. This differs from standard workflows, and more importantly, it requires prior knowledge of the data that an analyst might not have.
To remove this hurdle, we designed an aggregation function which can be sequentially composed with our previously introduced -DP functions to automatically compute bounds that minimize accuracy loss. Call this function
Without contribution bounds, the domain of an -DP function in our model spans . Finding differentially private extrema over such a range is difficult. Fortunately, we can leverage two observations. First, inputs to
are represented on a machine using finite precision; typically 64-bit integers or floating point numbers. Second, bounds do not need to be close to the real extrema: clipping a small fraction of data will usually not have a large influence on the aggregated results, and might even have a positive influence by removing outliers.
ANON_SUM operation operating on 64-bit unsigned integers where bounds are not provided. We divide the privacy budget in two: the first half to be used to infer approximate bounds using
APPROX_BOUNDS(col); the second half to be used to calculate the noisy sum as usual. We must spend privacy budget to choose bounds in a data-dependent way.
APPROX_BOUNDS(col) function, we instantiate a 64-bin logarithmic histogram of base 2, and, for each input , increment the th bin. Laplace noise is then added to the count in each bin, as is standard for differentially private histograms [dwork2014algorithmic].
Then, to find the approximate maximum of the input values, we select the most significant bin whose count exceeds some threshold , calculated using parameters and :
where is the count of histogram bins and is the desired probability of not selecting a false positive. For example, for unsigned integers. For the derivation of the threshold , see Appendix D.
When setting , the trade-offs of clipping distribution tails, false positive risk, and algorithm failure due to no bin count exceeding must all be considered. Values on the order of for can be suitable, depending on and the size of the input database.
The approximate minimum bound can similarly be found by searching for the least significant bin with count exceeding . We generalize this for signed numbers by adding additional negative-signed bins, and for floating point numbers by adding bins for negative powers of .
To provide data analysts information on the accuracy of their results we attach a confidence interval (CI) of the noise that was added. The CI can be calculated from each function’s contribution bounds and share of. The CI does not account for the effect of clamping or thresholding. In addition, during automatic bounds determination (Section 5.1), the log-scale histogram gives us an approximate fraction of inputs exceeding the chosen bounds; this can also be returned to the analyst.
For queries with a long tail of low user count partitions that do not pass -thresholding, we can combine all such partitions into a single partition. If the combined partition now exceeds the
-threshold, we may return aggregate results for the "leftovers" partition. This will allow analysts to estimate data loss.
To represent privacy, there are well-established techniques [lee2011much, hsu2014differential, naldi2015differential, krehbiel2019choosing] and perspectives [nissim2017differential] in the literature.
Testing is necessary to get a strong level of assurance that our query engine correctly enforces its privacy guarantee. We audited the code manually and found some implementation issues. Some of these issues have previously been explored in the literature, notably regarding the consequences of using a floating-point representation [mironov2012significance] with Laplace noise. Some of them, however, do not appear to have been previously mentioned in the literature, and are good examples of what can go wrong when writing secure anonymization implementations.
One of these comes from another detail of floating-point representation: special
NaN (“not a number”) values. These special values represent undefined numbers, like the result of . Importantly, arithmetic operations including a
NaN are always
NaN, and comparisons between any
NaN and other numbers are always
False. This can be exploited by an attacker, for example using a query like
ANON_SUM(IF uid=4217 THEN 0/0 ELSE 0). The
NaN value will survive naive contribution bounding (bounds checks like
if(value > upper_bound) will return
False), and the overall sum will return
NaN iff the condition was verified. We suspect that similar issues might arise with the use of special
infinity values, although we have not found them in our system (such values are correctly clamped).
From this example, we found that a larger class of issues can appear whenever the user can abuse a branching condition to fail if an arbitrary condition is satisfied (by example, by throwing a runtime error or crashing the engine). Thus, in a completely untrusted environment, the engine should catch all errors and silently ignore them, to avoid leaking information in the same way; and it should be hardened against crashes. We do not think that we can completely mitigate this problem, and silently catching all errors severely impedes usability. Thus, is it a good idea to add additional risk mitigation techniques, like query logging and auditing.
While the operations used in the engine are theoretically proven to be differentially private, it is crucial to verify that these operations are implemented correctly. Since the number of possible inputs is unbounded, it is impossible to exhaustively test this. Thus we fall back to stochastic testing and try to explore the space of databases as efficiently as possible. This does not give us a guarantee that an algorithm passing the test is differentially private, but it is a good mechanism to detect violations.
Note that we focus on testing DP primitives (aggregation functions) in isolation, which allows us to restrict the scope of the tests to row-level DP. We then use classical unit testing to independently test contribution bounding. We leave it as future work to extend our system to handle end-to-end user-level DP testing.
Our testing system contains four components: database generation, search procedure to find database pairs, output generation, and predicate verification.
What databases should we be generating? All DP aggregation functions are scale-invariant, so without loss of generality, we can consider only databases with values in a unit range . Of course, we can’t enumerate all possible databases , where is the size of the database. Instead, we try to generate a diverse set of databases. We use the Halton sequence [Halton1964] to do so. As an example, Figure 6 plots databases of size 2 generated by a Halton sequence. Unlike uniform random sampling, Halton sequences ensure that databases are evently distributed and not clustered together.
A database is a set of records: we consider its power set, and find database pairs by recursively removing records. This procedure is shown in Figure 7.
Once we have pairs of adjacent databases, we describe how we test each pair . The goal is to check that for all possible outputs of mechanism :
By repeatedly evaluating
on each database, we estimate the density of these probability distributions. We then use a simple method to compare these distributions: histograms.
We illustrate this procedure in (a) and (b). The upper curves (in orange) are the upper DP bound, created by multiplying the probability estimate of each bin for database by and adding . The lower curve (in blue) is the unmodified probability estimate of . In (a), all blue buckets are less than the upper DP bound: that the DP predicate is not violated. In (b), 3 buckets exceed this upper bound: the DP predicate is been violated. For symmetry, we also repeat this check with swapped with .
It is sufficient to terminate once we find a single pair of databases which violate the predicate. However, since the histogram is subject to sampling error, a correctly implemented algorithm can fail this test with non-zero probability. To address this, we relax our test by using confidence intervals as bounds [wasserman2013all]. We can also parameterize the tester with a parameter that tolerates a percentage of failing buckets per histogram comparison.
The overall approach is an algorithm that iterates over databases and performs a DFS on each of the database search graphs, where each edge is a DP predicate test. See Appendix A for more details.
We were able to detect that an algorithm was implemented incorrectly, violating DP.
When we first implemented
ANON_AVG, we used the Noisy Average with Accurate Count algorithm from [li2016differential]: we used our
ANON_SUM implementation to compute the noisy sum and then divided it by the un-noised count.
Our first version of
ANON_SUM used a Laplace distribution with scale , where and are the upper and lower clamping bounds, which is the correct bound when used as a component of
However, this was not correct for noisy sum in the case when adjacent databases differ by the presence of a row. We updated the scale to , as maximum change in this case is the largest magnitude. This change created a regression in DP guarantee for
ANON_AVG, which was detected by the stochastic tester.
Figure 9 a pair of datasets where the stochastic tester detected a violation of the DP predicate: and .
We can clearly see that several buckets violate the predicate. Once the stochastic tester alerted us to the error we quickly modified
ANON_AVG to no longer depend on
ANON_SUM so that it could use the correct sensitivity.
We presented a generic system to answer SQL queries with user-level differential privacy. This system is able to capture most data analysis tasks based on aggregations, performs well for typical use-cases, and provides a mechanism to deduce privacy parameters from accuracy requirements, allowing a principled decision between re-identification risk and the required utility of a particular application. All implemented mechanisms are tested with a stochastic checker that prevents regressions and increases our level of confidence in the robustness of the privacy guarantee. By releasing components of our system as open-source software after we validated its viability on internal use-cases, we hope to encourage further adoption and research of differentially private data analysis.
The algorithms presented in this work are relatively simple, but empirical evidence show that this approach is useful, robust and scalable. Future work could include usability studies to test the success of the methods we used to explain the system and its inherent accuracy/privacy trade-offs. In addition, we see room for significant accuracy improvements: using Gaussian noise and better composition theorems is an natural next step. There might also be opportunities to compensate the data loss due to contribution bounding and thresholding, optimize the algorithms used for specific sets of queries, ore use amplification by sampling. We did not attempt to cache results to allow people to re-run the queries, but further work could also explore the usability and privacy impact of such a method.
More generally, we believe that future work on DP should consider that realistic data typically includes multiple contributions for a single user: we believe that contribution bounds can be built into other DP mechanisms that are not SQL-based.
We present our algorithm putting all the pieces together in Algorithm 1. For simplicity, we abstract away the generation of databases by including it as an input parameter here, which can be assumed to be generated by the Halton sequence as we described in Section 5.3. We also do not include the confidence intervals or parameter described earlier for dealing with the approximation errors. It is also possible to adaptively choose a histogram bin width, but we put an input parameter here. The general idea is a depth-first search procedure that iterates over edges of the database search graph.
Our actual implementation includes all of the above omissions, including an efficient implementation of the search procedure that caches samples of databases already generated.
Let be the SQL engine operator. Consider any pair of databases such that and such that the set of non-empty groups from is the same as that for ; call this set . The -DP noisy count will get invoked for all groups in for both databases, and the -thresholding applied. For all groups not in , no row will be released for both databases. A change in the user may affect a maximum of output counts, each of which is -DP, so by differential privacy composition rules [kairouz2017composition],
Next, consider empty database . Let be the set of all outputs and let be the output set containing only the output “no result rows are produced”. Then we have . It remains to show that for database containing a single user, . The -DP count is computed by counting the number of rows, and then adding Laplace noise with scale . Database contains values for a maximum of groups; it is only possible to produce an output row for those groups. For each groups, the probability that the noisy count will be at least is , by Lemma 1. Then:
so we need to satisfy:
Solving for in the expression :
Lastly, consider SQL engine operator and databases such that is with the addition of a single user. It remains to consider the case where and do not have the same set of non-empty groups. Since they differ by one user, may have a maximum of additional non-empty groups, each containing unique user; call this set of groups . Call the set of the remaining groups . Split the rows of into two databases: the rows that correspond to groups in and , respectively. Call the rows of that correspond to groups as and the rows that correspond to groups as . Note that only contains rows corresponding to groups . Now, split the operator into two operators and : is with an added filter that only outputs result rows corresponding groups in ; is the same for .
We have decomposed our problem into the previous two cases. The system of , , and is the case where the pair of databases have the same set of non-empty groups, . The system of , the empty database, and is the case where the non-empty database contains a single user. Each system satisfies the DP predicate separately. Since they operate on a partition of all groups in and , the two systems satisfy the DP predicate when recombined into , , and .
We have shown that for two databases differing by a single user, the DP predicate is satisfied. Thus, we have shown that our engine provides user-level -DP. ∎
We find the theoretical median noise of a Laplace distribution. Divide the theoretical median noise by the exact result to obtain the theoretical median error.
For instance, the
ANON_COUNT function applies Laplacian noise with parameter . Let be the theoretical median noise. Then satisfies:
In this section we will derive the internal threshold used in the
APPROX_BOUNDS(col) function described in Section 5.1.
We argue that for privacy parameter , the number of histogram bins , and the probability of a false positive , we should set the threshold to be
Recall that in the automatic bounding algorithm, we create a logarithmic histogram of input values, and apply Laplace noise to the count in each histogram bin. The probability that a given bin produced a count of if its true count is zero is
Suppose we are looking for the most significant bin with a count greater than . In the
APPROX_BOUNDS(col) function, we iterate through the histogram bins, from most to least significant, until we find one exceeding . In the worst case, the desired bin is the least significant bin. This means bins with exact counts of must not have noisy counts exceeding . Thus, the probability that there was not a false positive in this worst case is
Solving for , we obtain the desired threshold.
ANON_COUNT(col) is bounded by sensitivity 1.
Adding any row only changes the count by 1. ∎
ANON_SUM(col, L, U) is bounded by sensitivity .
Consider adding a clamped input . Since , we have and . ∎
ANON_AVG(col, L, U) is bounded by sensitivity .
The average of a clamped set of inputs whose elements are on will always lie on . Then the change in the average when adding or removing an element must be bounded by . ∎
ANON_VAR(col, L, U) is bounded by sensitivity .
Consider clamped input set of size with average
. The following is the variance.
The values , so the magnitude difference between them must be bounded by . Then the variance is bounded by
ANON_STDDEV(col, L, U) is bounded by sensitivity .
This follows directly from taking the square root of the bound in the proof of Lemma 5. ∎
ANON_NTILE(col, L, U) is bounded by sensitivity .
The search space is bounded by so the result is in the interval. Thus, sensitivity can never exceed the interval width. ∎