Towards Practical Differential Privacy for SQL Queries

06/28/2017
by   Noah Johnson, et al.
berkeley college
0

Differential privacy promises to enable general data analytics while protecting individual privacy, but existing differential privacy mechanisms do not support the wide variety of features and databases used in real-world SQL-based analytics systems. This paper presents the first practical approach for differential privacy of SQL queries. Using 8.1 million real-world queries, we conduct an empirical study to determine the requirements for practical differential privacy, and discuss limitations of previous approaches in light of these requirements. To meet these requirements we propose elastic sensitivity, a novel method for approximating the local sensitivity of queries with general equijoins. We prove that elastic sensitivity is an upper bound on local sensitivity and can therefore be used to enforce differential privacy using any local sensitivity-based mechanism. We build FLEX, a practical end-to-end system to enforce differential privacy for SQL queries using elastic sensitivity. We demonstrate that FLEX is compatible with any existing database, can enforce differential privacy for real-world SQL queries, and incurs negligible (0.03

READ FULL TEXT VIEW PDF

Authors

page 2

page 3

09/20/2018

Chorus: Differential Privacy via Query Rewriting

We present Chorus, a system with a novel architecture for providing diff...
12/07/2020

Local Dampening: Differential Privacy for Non-numeric Queries via Local Sensitivity

Differential privacy is the state-of-the-art formal definition for data ...
11/15/2018

Achieving Differential Privacy using Methods from Calculus

We introduce derivative sensitivity, an analogue to local sensitivity fo...
03/04/2021

Quantifying identifiability to choose and audit ε in differentially private deep learning

Differential privacy allows bounding the influence that training data re...
04/23/2018

Individual Sensitivity Preprocessing for Data Privacy

The sensitivity metric in differential privacy, which is informally defi...
09/17/2021

Robust Control Under Uncertainty via Bounded Rationality and Differential Privacy

The rapid development of affordable and compact high-fidelity sensors (e...
06/08/2020

Blockchain-Based Differential Privacy Cost Management System

Privacy preservation is a big concern for various sectors. To protect in...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

As organizations increasingly collect sensitive information about individuals, these organizations are ethically and legally obligated to safeguard against privacy leaks. Data analysts within these organizations, however, have come to depend on unrestricted access to data for maximum productivity. This access is frequently provided in the form of a relational database that supports SQL queries. Current approaches for data security and privacy cannot guarantee privacy for individuals while providing general-purpose access for the analyst.

As demonstrated by recent insider attacks [8, 10, 7, 11], allowing members of an organization unrestricted access to data is a major cause of privacy breaches. Access control policies can limit access to a particular database, but once an analyst has access, these policies cannot control how the data is used. Data anonymization attempts to provide privacy while allowing general-purpose analysis, but cannot be relied upon, as demonstrated by a number of re-identification attacks [51, 43, 46, 18].

Differential privacy [20, 23, 25] is a promising technique for addressing these issues. Differential privacy allows general statistical analysis of data while protecting data about individuals with a strong formal guarantee of privacy.

Because of its desirable formal guarantees, differential privacy has received growing attention from organizations including Google and Apple. However, research on practical techniques for differential privacy has focused on special-purpose use cases, such as collecting statistics about web browsing behaviors [27] and keyboard and emoji use [1], while differential privacy for general-purpose data analytics remains an open challenge.

Various mechanisms [40, 47, 44, 41, 14, 42] provide differential privacy for some subsets of SQL-like queries but none support the majority of queries in practice. These mechanisms also require modifications to the database engine, complicating adoption in practice.

Furthermore, although the theoretical aspects of differential privacy have been studied extensively, little is known about the quantitative impact of differential privacy on real-world queries. Recent work has evaluated differential privacy mechanisms on real-world data [15, 32, 33], however this work uses a limited set of queries representing a single, special-purpose analytics task such as histogram analysis [15] or range queries [32]. To the best of our knowledge, no existing work has explored the design and evaluation of differential privacy techniques for general, real-world queries.

This paper proposes elastic sensitivity, a novel approach for differential privacy of SQL queries. In contrast to existing work, our approach is compatible with real database systems, supports queries expressed in standard SQL, and integrates easily into existing data environments. The work therefore represents a first step towards practical differential privacy. The approach has recently been adopted by Uber to enforce differential privacy for internal data analytics [12].

We developed elastic sensitivity using requirements derived from a dataset of 8.1 million real-world queries. This paper focuses on counting queries, which constitute the majority of statistical queries in this dataset, and discusses extensions to the approach for other aggregation functions. We have released an open-source tool for computing elastic sensitivity of SQL queries 

[4].

Contributions

We make four primary contributions toward practical differential privacy:

  1. [topsep=1mm,leftmargin=4mm]

  2. We conduct the largest known empirical study of real-world SQL queries—8.1 million queries in total. From these results we show that the queries used in prior work to evaluate differential privacy mechanisms are not representative of real-world queries. We propose a new set of requirements for practical differential privacy on SQL queries based on these results.

  3. To meet these requirements, we propose elastic sensitivity, a sound approximation of local sensitivity [22, 44] that supports general equijoins and can be calculated efficiently using only the query itself and a set of precomputed database metrics. We prove that elastic sensitivity is an upper bound on local sensitivity and can therefore be used to enforce differential privacy using any local sensitivity-based mechanism.

  4. We design and implement Flex, an end-to-end differential privacy system for SQL queries based on elastic sensitivity. We demonstrate that Flex is compatible with any existing database, can enforce differential privacy for the majority of real-world SQL queries, and incurs negligible (0.03%) performance overhead.

  5. In the first experimental evaluation of its kind, we use Flex to evaluate the impact of differential privacy on 9862 real-world statistical queries in our dataset. In contrast to previous empirical evaluations of differential privacy, our experimental set contains a diverse variety of real-world queries executed on real data. We show that Flex introduces low error for a majority of these queries.

The rest of the paper is organized as follows. Section 2 contains our empirical study and defines the requirements for a practical differential privacy mechanism. In Section 3, we define elastic sensitivity and prove that elastic sensitivity is an upper bound on local sensitivity. In Section 4 we describe Flex, our system for enforcing differential privacy using elastic sensitivity. Section 5 contains our experimental evaluation of Flex and Section 6 surveys related work.

2 Requirements for Practical
Differential Privacy

We use a dataset consisting of millions of SQL queries to establish requirements for a practical differential privacy system that supports the majority of real-world queries. We investigate the limitations of existing general-purpose differential mechanisms in light of these requirements, and introduce elastic sensitivity, our new approach that meets these requirements.

We investigate the following properties of queries in our dataset:

  • [topsep=1mm,leftmargin=4mm]

  • How many different database backends are used? A practical differential privacy system must integrate with existing database infrastructure.

  • Which relational operators are used most frequently? A practical differential privacy system must at a minimum support the most common relational operators.

  • What types of joins are used most frequently and many are used by a typical query? Making joins differentially private is challenging because the output of a join may contain duplicates of sensitive rows. This duplication is difficult to bound as it depends on the join type, join condition, and the underlying data. Understanding the different types of joins and their relative frequencies is therefore critical for supporting differential privacy on these queries.

  • What fraction of queries use aggregations and which aggregation functions are used most frequently? Aggregation functions in SQL return statistics about populations in the data. Aggregation and non-aggregation queries represent fundamentally different privacy problems, as will be shown. A practical system must at minimum support the most common aggregations.

  • How complex are typical queries and how large are typical query results? To be practical, a differential privacy mechanism must support real-world queries without imposing restrictions on query syntax, and it must scale to typical result sizes.

Dataset

We use a dataset of SQL queries written by employees at Uber. The dataset contains 8.1 million queries executed between March 2013 and August 2016 on a broad range of sensitive data including rider and driver information, trip logs, and customer support data.

Data analysts at Uber query this information in support of many business interests such as improving service, detecting fraud, and understanding trends in the business. The majority of these use-cases require flexible, general-purpose analytics.

Given the size and diversity of our dataset, we believe it is representative of SQL queries in other real-world situations.

2.1 Study Results

We first summarize the study results, then define requirements of a practical differential privacy technique for real-world queries based on these results.

Question 1: How many different database backends are used?

Results. The queries in our dataset use more than 6 database backends, including Vertica, Postgres, MySQL, Hive, and Presto.

Question 2: Which relational operators are used most frequently?

Operator Frequency
Select 100%
Join 62.1%
Union 0.57%
Minus/Except 0.06%
Intersect 0.03%
Results. All queries in our dataset use the Select operator, more than half of the queries use the Join operator, and fewer than 1 percent use other operators such as Union, Minus, and Intersect.

Question 3: How many joins are used by a typical query?

Results. A significant number of queries use multiple joins, with queries using as many as 95 joins.

Question 4: What types of joins are used most frequently?

Join condition. The vast majority (76%) of joins are equijoins: joins that are conditioned on value equality of one column from both relations. A separate experiment (not shown) reveals that 65.9% of all join queries use exclusively equijoins.

Compound expressions, defined as join conditions using function applications and conjunctions and disjunctions of primitive operators, account for 19% of join conditions. Column comparisons, defined as conditions that compare two columns using non-equality operators such as greater than, comprise 3% of join conditions. Literal comparisons, defined as join conditions comparing a single column to a string or integer literal, comprise 2% of join conditions.

Join relationship. A majority of joins (64%) are conditioned on one-to-many relationships, over one-quarter of joins (26%) are conditioned on one-to-one relationships, and 10% of joins are conditioned on many-to-many relationships.

Self join. 28% of queries include at least one self join, defined as a join in which the same database table appears in both joined relations. The remaining queries (72%) contain no self joins.

Join type. Inner join is the most common join type (69%), followed by left join (29%) and cross join (1%). The remaining types (right join and full join) together account for less than 1%.

Question 5: What fraction of queries use aggregations?

Results. Approximately one-third of queries are statistical, meaning they return only aggregations (count, average, etc.). The remaining queries return non-aggregated results (i.e., raw data) in at least one output column.

Question 6: Which aggregation functions are used most frequently?

Results. Count is the most common aggregation function (51%), followed by Sum (29%), Avg (8%), Max (6%) and Min (5%). The remaining functions account for fewer than 1% of all aggregation functions.

Question 7: How complex are typical queries?

Results. The majority of queries are fewer than 100 clauses but a significant number of queries are much larger, with some queries containing as many as thousands of clauses.

Question 8: How large are typical query results?

Results. The output sizes of queries varies dramatically with respect to both rows and columns, and queries commonly return hundreds of columns and hundreds of thousands of rows.

Database One-to-one One-to-many Many-to-many
compatibility equijoin equijoin equijoin
PINQ [40]
wPINQ [47]
Restricted sensitivity [14]
DJoin [42]
Elastic sensitivity (this work)
Table 1: Comparison of general-purpose differential privacy mechanisms with support for join.

2.2 Summary of Results

Our study reveals that real-world queries are executed on many different database engines—in our dataset there are over 6. We believe this is typical; a variety of databases are commonly used within a company to match specific performance and scalability requirements. A practical mechanism for differential privacy will therefore allow the use of any of these existing databases, requiring neither a specific database distribution nor a custom execution engine in lieu of a standard database.

The study shows that 62.1% of all queries use SQL Join, and specifically equijoins which are by far the most common. Additionally, a majority of queries use multiple joins, more than one-quarter use self joins, and joins are conditioned on one-to-one, one-to-many, and many-to-many relationships. These results suggest that a practical differential privacy approach must at a minimum provide robust support for equijoins, including the full spectrum of join relationships and an arbitrary number of nested joins.

One-third (34%) of all queries return aggregate statistics. Differential privacy is principally designed for such queries, and in the remainder of this paper we focus on these queries. Enforcing differential privacy for raw data queries is beyond the scope of this work, as differential privacy is generally not intended to address this problem.

For statistical queries, Count is by far the most common aggregation. This validates the focus on counting and histogram queries by the majority of previous general-purpose differential privacy mechanisms [40, 47, 14, 42]. Our work similarly focuses on this class of queries. In Section 3.7.2 we discuss possible extensions to support other aggregation functions.

Requirements

We summarize our requirements for practical differential privacy of real-world SQL queries:

  • [topsep=0.5mm,leftmargin=4mm]

  • Requirement 1: Compatibility with existing databases. A practical differential privacy approach must support heterogeneous database environments by not requiring a specific database distribution or replacement of the database with a custom runtime.

  • Requirement 2: Robust support for equijoin. A practical differential privacy approach must provide robust support for equijoin, including both self joins and non-self joins, all join relationship types, and queries with an arbitrary number of nested joins.

Our study shows that a differential privacy system satisfying these requirements is likely to have broad impact, supporting a majority of real-world statistical queries.

2.3 Existing Differential Privacy Mechanisms

Several existing general-purpose differential privacy mechanisms support queries with join. Table 1 summarizes these mechanisms and their supported features in comparison to our proposed mechanism (last row). In Section 6 we discuss additional mechanisms which are not considered here either because they are not general-purpose or they cannot support joins.

Pinq

Privacy Integrated Queries (PINQ) [40] is a mechanism that provides differential privacy for counting queries written in an augmented SQL dialect. PINQ supports a restricted join operator that groups together results with the same key. For one-to-one joins, this operator is equivalent to the standard semantics. For one-to-many and many-to-many joins, on the other hand, a PINQ query can count the number of unique join keys but not the number of joined results. Additionally, PINQ introduces new operators that do not exist in standard SQL, so the approach is not compatible with standard databases.

wPINQ

Weighted PINQ (wPINQ) [47] extends PINQ with support for general equijoins and works by assigning a weight to each row in the database, then scaling down the weights of rows in a join to ensure an overall sensitivity of 1. In wPINQ, the result of a counting query is the sum of weights for records being counted plus noise drawn from the Laplace distribution. This approach allows wPINQ to support all three types of joins. However, wPINQ does not satisfy our database compatibility requirement. The system described by Proserpio et al. [47] uses a custom runtime; applying wPINQ in an existing database would require modifying the database to propagate weights during execution.

Restricted sensitivity

Restricted sensitivity [14] is designed to bound the global sensitivity of counting queries with joins, by using properties of an auxiliary data model. The approach requires bounding the frequency of each join key globally (i.e. for all possible future databases). This works well for one-to-one and one-to-many joins, because the unique key on the “one” side of the join has a global bound. However, it cannot handle many-to-many joins, because the frequency of keys on both sides of the join may be unbounded. Blocki et al. [14] formalize the restricted sensitivity approach but do not describe how it could be used in a system compatible with existing databases, and no implementation is available.

DJoin

DJoin [42] is a mechanism designed for differentially private queries over datasets distributed over multiple parties. Due to the additional restrictions associated with this setting, DJoin supports only one-to-one joins, because it rewrites join queries as relational intersections. For example, consider the following query:

SELECT COUNT(*) FROM X JOIN Y ON X.A = Y.B

DJoin rewrites this query to the following (in relational algebra), which is semantically equivalent to the original query only if the join is one-to-one: .

Additionally, the approach requires the use of special cryptographic functions during query execution, so it is not compatible with existing databases.

To address the limitations of existing mechanisms we propose elastic sensitivity, discussed next. Elastic sensitivity is compatible with any existing database and supports general equijoins with the full spectrum of join relationships. This combination allows use of elastic sensitivity in real-world settings.

3 Elastic Sensitivity

Elastic sensitivity is a novel approach for calculating an upper bound on a query’s local sensitivity. After motivating the approach, we provide background on necessary concepts in Section 3.2, formally define elastic sensitivity in Section 3.3, give an example in Section 3.4, prove its correctness in Section 3.5, and discuss an optimization in Section 3.6.

3.1 Motivation

Many previous differential privacy mechanisms [40, 14] are based on global sensitivity. These approaches do not generalize to queries with joins; the global sensitivity of queries with general joins may be unbounded because “a join has the ability to multiply input records, so that a single input record can influence an arbitrarily large number of output records.” [40]

Techniques based on local sensitivity [44, 22] generally provide greater utility than global sensitivity-based approaches because they consider the actual database. Indeed, local sensitivity is finite for general queries with joins. However, directly computing local sensitivity is computationally infeasible, as it requires running the query on every possible neighbor of the true database—in our environment this would require running more than 1 billion queries for each original query. Previous work [44] describes efficient methods to calculate local sensitivity for a limited set of fixed queries (e.g., the median of all values in the database) but these techniques do not apply to general-purpose queries or queries with join.

These challenges are reflected in the design of previous mechanisms listed in Table 1. PINQ and restricted sensitivity support only joins for which global sensitivity can be bounded, and wPINQ scales weights attached to the data during joins to ensure a global sensitivity of 1. DJoin uses a measure of sensitivity unique to its distributed setting. None of these techniques are based on local sensitivity.

Elastic sensitivity is the first tractable approach to leverage local sensitivity for queries with general equijoins. The key insight of our approach is to model the impact of each join in the query using precomputed metrics about the frequency of join keys in the true database. This novel approach allows elastic sensitivity to compute a conservative approximation of local sensitivity without requiring any additional interactions with the database. In Section 3.5, we prove elastic sensitivity is an upper bound on local sensitivity and can therefore be used with any local sensitivity-based differential privacy mechanism. In Section 4, we describe how to use elastic sensitivity to enforce differential privacy.

3.2 Background

We briefly summarize existing differential privacy concepts necessary for describing our approach. For a more thorough overview of differential privacy, we refer the reader to Dwork and Roth’s excellent reference [25].

Differential privacy provides a formal guarantee of indistinguishability: a differentially private result does not yield very much information about which of two neighboring databases was used in calculating the result.

Formally, differential privacy considers a database modeled as a vector

, in which represents the data contributed by user . The distance between two databases is . Two databases are neighbors if .

Definition 1 (Differential privacy).

A randomized mechanism preserves -differential privacy if for any pair of databases such that , and for all sets of possible outputs:

Intuitively, the sensitivity of a query corresponds to the amount its results can change when the database changes. One measure of sensitivity is global sensitivity, which is the maximum difference in the query’s result on any two neighboring databases.

Definition 2 (Global Sensitivity).

For and all , the global sensitivity of is

McSherry [40] defines the notion of stable transformations on a database, which we will use later. Intuitively, a transformation is stable if its privacy implications can be bounded.

Definition 3 (Global Stability).

A transformation is -stable if for such that , .

Another definition of sensitivity is local sensitivity [22, 44], which is the maximum difference between the query’s results on the true database and any neighbor of it:

Definition 4 (Local Sensitivity).

For and , the local sensitivity of at is

Local sensitivity is often much lower than global sensitivity since it is a property of the single true database rather than the set of all possible databases.

We extend the notion of stability to the case of local sensitivity by fixing to be the true database.

Definition 5 (Local Stability).

A transformation is locally -stable for true database if for such that , .

Differential privacy for multi-table databases

In this paper we consider bounded differential privacy [36], in which can be obtained from its neighbor by changing (but not adding or removing) a single tuple. Our setting involves a database represented as a multiset of tuples, and we wish to protect the presence or absence of a single tuple. If tuples are drawn from the domain and the database contains tuples, the setting can be represented as a vector , in which if row in the database contains the tuple .

For queries without joins, a database is considered as a single table. However, our setting considers database with multiple tables and queries with joins. We map this setting into the traditional definition of differential privacy by considering tables as disjoint subsets of a single database , so that .

With this mapping, differential privacy offers the same protection as in the single-table case: it protects the presence or absence of any single tuple in the database. When a single user contributes more than one protected tuple, however, protecting individual tuples may not be sufficient to provide privacy. Note that this caveat applies equally to the single- and multi-table cases—it is not a unique problem of multi-table differential privacy.

We maintain the same definition of neighboring databases as the single-table case. Neighbors of can be obtained by selecting a table and changing a single tuple, equivalent to changing a single tuple in a single-table database.

Smoothing functions

Because local sensitivity is based on the true database, it must be used carefully to avoid leaking information about the data. Prior work [44, 22] describes techniques for using local sensitivity to enforce differential privacy. Henceforth we use the term smoothing functions to refer to these techniques. Smoothing functions are independent of the method used to compute local sensitivity, but generally require that local sensitivity can be computed an arbitrary distance from the true database (i.e. when at most entries are changed).

Definition 6 (Local Sensitivity at Distance).

The local sensitivity of at distance from database is:

3.3 Definition of Elastic Sensitivity

We define the elastic sensitivity of a query recursively on the query’s structure. To allow the use of smoothing functions, our definition describes how to calculate elastic sensitivity at arbitrary distance from the true database (under this definition, the local sensitivity of the query is defined at ).

Figure 1 contains the complete definition, which is in four parts: (a) Core relational algebra, (b) Definition of Elastic sensitivity, (c) Max frequency at distance , and (d) Ancestors of a relation. We describe each part next.

Core relational algebra

We present the formal definition of elastic sensitivity in terms of a subset of the standard relational algebra, defined in Figure 1(a). This subset includes selection (), projection (), join (), counting (Count), and counting with grouping (). It admits arbitrary equijoins, including self joins, and all join relationships (one-to-one, one-to-many, and many-to-many).

To simplify the presentation our notation assumes the query performs a count as the outermost operation, however the approach naturally extends to aggregations nested anywhere in the query as long as the query does not perform arithmetic or other modifications to aggregation result. For example, the following query counts the total number of trips and projects the “count” attribute:

Our approach can support this query by treating the inner relation as the query root.

Elastic sensitivity

Figure 1(b) contains the recursive definition of elastic sensitivity at distance . We denote the elastic sensitivity of query at distance from the true database as . The function is defined in terms of the elastic stability of relational transformations (denoted ).

bounds the local stability (Definition 5) of relation at distance from the true database . is defined in terms of , the maximum frequency of attribute in relation at distance from database .

Max frequency at distance

The maximum frequency metric is used to bound the sensitivity of joins. We define the maximum frequency as the frequency of the most frequent value of attribute in relation in the database instance . In Section 4 we describe how the values of mf can be obtained from the database.

To bound the local sensitivity of a query at distance from the true database, we must also bound the max frequency of each join key at distance from the true database. For attribute of relation in the true database , we denote this value , and define it (in terms of mf) in Figure 1(c).

Ancestors of a relation

The definition in Figure 1(d) is a formalization to identify self joins. Self joins have a much greater effect on sensitivity than joins of non-overlapping relations. In a self join, adding or removing one row of the underlying database may cause changes in both joined relations, rather than just one or the other. The join case of elastic sensitivity is therefore defined in two cases: one for self joins, and one for joins of non-overlapping relations. To distinguish the two cases, we use (defined in Figure 1(d)), which denotes the set of tables possibly contributing rows to . A join of two relations and is a self join when and overlap, which occurs when some table in the underlying database contributes rows to both and . Rows and are non-overlapping when .

Core relational algebra: Attribute namesaValue constantsvRelational transformationsR::=t  —  R_1 x = yR_2Π_a_1, …, a_n R  —  σ_φRCount(R)Selection predicatesφ::=a_1 θa_2  —  a θvθ::=¡  —  ≤ —  =≠ —  ≥ —  ¿Counting queriesQ::=Count(R)G_1..G_nCount(R) Definition of elastic stability: ^S^(k)_R::R →D^n →elastic stability^S^(k)_R(t, x)=1^S^(k)_R(r_1 a = br_2, x)={max(mfk(a, r1, x)^S(k)R(r2, x),max(mfk(b, r2, x)^S(k)R(r1, x))A(r1) ∩A(r2) — = 0mfk(a, r1, x)^S(k)R(r2, x) +mfk(b, r2, x)^S(k)R(r1, x) +^S(k)R(r1, x)^S(k)R(r2, x)A(r1) ∩A(r2) — ¿ 0^S^(k)_R(Π_a_1, …, a_n r, x)=^S^(k)_R(r, x)^S^(k)_R(σ_φr, x)=^S^(k)_R(r, x)^S^(k)_R(Count(r))=1Definition of elastic sensitivity:^S^(k)::Q →D^n →elastic sensitivity^S^(k)(Count(r), x)=^S^(k)_R(r, x)^S^(k)(G_1..G_nCount(r), x)=2 ^S^(k)_R(r, x) Maximum frequency at distance : mf_k::a →R →D^n →Nmf_k(a, t, x)=mf(a, t, x) + kmf_k(a_1, r_1 a_2 = a_3r_2,x)={mfk(a1, r1, x) mfk(a3, r2, x)a1∈r1mfk(a1, r2, x) mfk(a2, r1, x)a1∈r2mf_k(a, Π_a_1, …, a_n r, x)=mf_k(a, r, x)mf_k(a, σ_φr, x)=mf_k(a, r, x)mf_k(a, Count(r), x)= (c) Ancestors of a relation: A::R →{R}A(t)={t}A(r_1 a = br_2)=A(r_1) ∪A(r_2)A(Π_a_1, …, a_n r)=A(r)A(σ_φr)=A(r)
(a) (b) (d)
Figure 1: (a) syntax of core relational algebra; (b) definition of elastic stability and elastic sensitivity at distance ; (c) definition of maximum frequency at distance ; (d) definition of ancestors of a relation.
Join conditions

For simplicity our notation refers only to the case where a join contains a single equality predicate. However, the approach naturally extends to join conditions containing any predicate that can be decomposed into a conjunction of an equijoin term and any other terms. Consider for example the following query:

SELECT COUNT(*) FROM a
JOIN b ON a.id = b.id AND a.size > b.size

Calculation of elastic sensitivity for this query requires only the equijoin term () and therefore follows directly from our definition. Note that in a conjunction, each predicate adds additional constraints that may decrease (but never increase) the true local stability of the join, hence our definition correctly computes an upper bound on the stability.

Unsupported queries

We discuss several cases of queries that are not supported by our definition in Section 3.7.1.

3.4 Example: Counting Triangles

We now consider step-by-step calculation of elastic sensitivity for an example query. We select the problem of counting triangles in a directed graph, described by Prosperio et al. in their evaluation of WPINQ [47]. This example contains multiple self-joins, which demonstrate the most complex recursive cases of Figure 1.

Following Prosperio et al. we select privacy budget and consider the ca-HepTh [6] dataset, which has maximum frequency metric of 65.

In SQL, the query is expressed as:

SELECT COUNT(*) FROM edges e1
JOIN edges e2 ON e1.dest = e2.source AND
                 e1.source < e2.source
JOIN edges e3 ON e2.dest = e3.source AND
                 e3.dest = e1.source AND
                 e2.source < e3.source

Consider the first join (), which joins the edges table with itself. By definition of (self join case) the elastic stability of this relation is:

Furthermore, since edges is a table, .

We then have:

Substituting the max frequency metric (65), the elastic stability of this relation is .

Now consider the second join, which joins (an alias for the edges table) with the previous joined relation (). Following the same process and substituting values, the elastic stability of this relation is:

This expression describes the elastic stability at distance of relation . Per the definition of the elastic sensitivity of a counting query is equal to the elastic stability of the relation being counted, therefore this expression defines the elastic sensitivity of the full original query.

As we will discuss in Section 4.1, elastic sensitivity must be smoothed using smooth sensitivity [44] before it can be used with the Laplace mechanism. In short, this process requires computing the maximum value of elastic sensitivity at multiplied by an exponentially decaying function in :

where and .

The maximum value is , which occurs at distance . Therefore, to enforce differential privacy we add Laplace noise scaled to , per Definition 7 (see Section 4.1).

3.5 Elastic Sensitivity is an Upper Bound on Local Sensitivity

In this section, we prove that elastic sensitivity is an upper bound on the local sensitivity of a query. This fundamental result affirms the soundness of using elastic sensitivity in any local sensitivity-based differential privacy mechanism. First, we prove two important lemmas: one showing the correctness of the max frequency at distance , and the other showing the correctness of elastic stability.

Lemma 1.

For database , at distance , has at most occurrences of the most popular join key in attribute :

Proof.

By induction on the structure of .

Case . To obtain the largest possible number of occurrences of the most popular join key in a table at distance , we modify rows to contain the most popular join key. Thus, .

Case . We need to show that:

(1)

Consider the case when (the proof for case is symmetric). The worst-case sensitivity occurs when each tuple in with the most popular value for also contains attribute value matching the most popular value of attribute in . So we can rewrite equation 1:

(2)

We then rewrite the left-hand side, based on the definition of and the inductive hypothesis. Each step may make the left-hand side smaller, but never larger, preserving the original inequality:

Which is equal to the right-hand side of equation 2.

Case . Projection does not change the number of rows, so the conclusion follows directly from the inductive hypothesis.

Case . Selection might filter out some rows, but does not modify attribute values. In the worst case, no rows are filtered out, so has the same number of occurrences of the most popular join key as . The conclusion thus follows directly from the inductive hypothesis. ∎

Lemma 2.

is an upper bound on the local stability of relation expression at distance from database :

Proof.

By induction on the structure of .

Case . The stability of a table is 1, no matter its contents.

Case . We want to bound the number of changed rows in the joined relation. There are two cases, depending on whether or not the join is a self join.

Subcase 1: no self join. When the ancestors of and are non-overlapping (i.e. ), then the join is not a self join. This means that either may change or may change, but not both. As a result, either or . We therefore have two cases:

  1. [topsep=0.2mm,leftmargin=4mm]

  2. When , may contain at most changed rows, producing at most changed rows in the joined relation.

  3. In the symmetric case, when , the joined relation contains at most changed rows.

We choose to modify the relation resulting in the largest number of changed rows, which is exactly the definition.

Subcase 2: self join. When the set of ancestor tables of overlaps with the set of ancestor tables of , i.e. , then changing a single row in the database could result in changed rows in both and .

In the self join case, there are three sources of changed rows:

  1. [topsep=0.2mm,leftmargin=4mm]

  2. The join key of an original row from could match the join key of a changed row in .

  3. The join key of an original row from could match the join key of a changed row in .

  4. The join key of a changed row from could match the join key of a changed row in .

Now consider how many changed rows could exist in each class.

  1. [topsep=0.2mm,leftmargin=4mm]

  2. In class 1, could have at most changed rows (by the inductive hypothesis). In the worst case, each of these changed rows matches the most popular join key in , which occurs at most times (by Lemma 1), so class 1 contains at most changed rows.

  3. Class 2 is the symmetric case of class 1, and thus contains at most changed rows.

  4. In class 3, we know that contains at most changed rows and contains at most changed rows. In the worst case, all of these changed rows contain the same join key, and so the joined relation contains changed rows.

The total number of changed rows is therefore bounded by the sum of the bounds on the three classes:

Which is exactly the definition.

Case . Projection does not change rows. The conclusion therefore follows from the inductive hypothesis.

Case . Selection does not change rows. The conclusion therefore follows from the inductive hypothesis.

Case . Count without grouping produces a relation with a single row. The stability of such a relation is 1, at any distance.

Main theorem

We are now prepared to prove the main theorem.

Theorem 1.

The elastic sensitivity of a query at distance from the true database is an upper bound on the local sensitivity of executed at distance from database :

Proof.

There are two cases: histogram queries and non-histogram queries.

Case (non-histogram). The local sensitivity of a non-histogram counting query over is equal to the stability of , so the result follows directly from Lemma 2.

Case (histogram). In a histogram query, each changed row in the underlying relation can change two rows in the histogram [23]. Thus by Lemma 2, the histogram’s local stability is bounded by .∎

3.6 Optimization for Public Tables

Our definition of elastic sensitivity assumes that all database records must be protected. In practice, databases often contain a mixture of sensitive and non-sensitive data. This fact can be used to tighten our bound on local sensitivity for queries joining on non-sensitive tables.

In our dataset, for example, city data is publicly known, therefore the system does not need to protect against an attacker learning information about the cities table. Note the set of public tables is domain-specific and will vary in each data environment.

More precisely, in a join expression T1 JOIN T2 ON T1.A = T2.B, if T2 is publicly known, the elastic stability of the join is equal to the elastic stability of T1 times the maximum frequency of T2.B. This formulation prevents the use of a publicly-known table with repeated join keys from revealing information about a private table.

3.7 Discussion of Limitations and Extensions

This section discusses limitations of elastic sensitivity and potential extensions to support other common aggregation functions.

3.7.1 Unsupported Queries

Elastic sensitivity does not support non-equijoins, and adding support for these is not straightforward. Consider the query:

SELECT count(*) FROM A JOIN B ON A.x > B.y

This query compares join keys using the greater-than operator, and bounding the number of matches for this comparison would require knowledge about all the data for A.x and B.y.

Fortunately, as demonstrated in our empirical study, more than three-quarters of joins are equijoins. Elastic sensitivity could be extended to support other join types by querying the database for necessary data-dependent bounds, but this modification would require interactions with the database for each original query.

Elastic sensitivity can also fail when requisite max-frequency metrics are not available due to the query structure. Consider the query:

WITH A AS (SELECT count(*) FROM T1),
     B AS (SELECT count(*) FROM T2)
SELECT count(*) FROM A JOIN B ON A.count = B.count

This query uses counts computed in subqueries as join keys. Because the mf metric covers only the attributes available in the original tables of the database, our approach cannot bound the sensitivity of this query and must reject it. In general, elastic sensitivity applies only when join keys are drawn directly from original tables. Fortunately, this criterion holds for 98.5% of joins in our dataset, so this limitation has very little consequence in practice.

3.7.2 Supporting Other Aggregation Functions

In this section we outline possible extensions of our approach to support non-count aggregation functions, and characterize the expected utility for each. These extensions, which provide a roadmap for potential future research, would expand the set of queries supported by an elastic sensitivity-based system.

Value range metric

To describe these extensions we define a new metric, value range , defined as the maximum value minus the minimum value allowed by the data model of column in relation .

This metric can be derived in a few ways. First, it can be extracted automatically from the database’s column constraint definitions [2], if they exist. Second, a SQL query can extract the current value range, which can provide a guideline for selecting the permissible value range based on records already in the database; finally, a domain expert can define the metric using knowledge about the data’s semantics.

Once the metric is defined, it must be enforced in order for differential privacy to be guaranteed. The metric could be enforced as a data integrity check, for example using column check constraints [2].

Sum and Average

For sum and average, we note that the local sensitivity of these functions is affected both by the stability of the underlying relation, because each row of the relation potentially contributes to the computed sum or average, and by the range of possible values of the attributes involved.

Given our definition of vr above, the elastic sensitivity of both Sum and Average on relation at distance from database is defined by . This expression captures the largest possible change in local sensitivity, in which each new row in has the maximum value of , for a total change of per row.

For Sum queries on relations with stability 1 (i.e. relations without joins), this definition of elastic sensitivity is exactly equal to the query’s local sensitivity, so the approach will provide optimal utility. As the relation’s stability grows, so does the gap between elastic sensitivity and local sensitivity, and utility degrades, since elastic sensitivity makes the worst-case assumption that each row duplicated by a join contains the maximum value allowed by the data model.

For the average function, this definition is exactly equal to local sensitivity only for the degenerate case of averages of a single row. As more input rows are added, local sensitivity shrinks, since the impact of a single new row is amortized over the number of averaged records, while elastic sensitivity remains constant. Therefore utility degradation is proportional to both the stability of the relation as well as the number of records being averaged.

This could be mitigated with a separate analysis to compute a lower bound on the number of records being averaged, in which case the sensitivity could be scaled down by this factor. Such an analysis would require inspection of filter conditions in the query and an expanded set of database metrics.

Max and min

We observe that the stability of the underlying relation has no effect on the local sensitivity of max and min. Consequently, for such queries the data model directly provides the global sensitivity

, which is an upper bound of local sensitivity. However, the max and min functions are inherently sensitive, because they are strongly affected by outliers in the database 

[22], and therefore any differential privacy technique will provide poor utility in the general case.

Due to this fundamental limitation, previous work  [22, 44, 50] has focused on the robust counterparts of these functions, such as the interquartile range, which are less sensitive to changes in the database. This strategy is not viable in our setting since functions like interquartile range are not supported by standard SQL.

4 FLEX: Practical Differential
Privacy for SQL Queries

Figure 2: Architecture of Flex.

This section describes Flex, our system to enforce differential privacy for SQL queries using elastic sensitivity. Figure 2 summarizes the architecture of our system. For a given SQL query, Flex uses an analysis of the query to calculate its elastic sensitivity, as described in Section 3. Flex then applies smooth sensitivity [44, 45] to the elastic sensitivity and finally adds noise drawn from the Laplace distribution to the original query results. In Section 4.1 we prove this approach provides -differential privacy.

Importantly, our approach allows the query to execute on any existing database. Flex requires only static analysis of the query and post-processing of the query results, and requires no interactions with the database to enforce differential privacy. As we demonstrate in Section 5, this design allows the approach to scale to big data while incurring minimal performance overhead.

Collecting max frequency metrics

The definition of elastic sensitivity requires a set of precomputed metrics mf from the database, defined as the frequency of the most frequent attribute for each join key. The values of mf can be easily obtained with a SQL query. For example, this query retrieves the metric for column of table :

SELECT COUNT(a) FROM T GROUP BY a
ORDER BY count DESC LIMIT 1;

Obtaining these metrics is a separate step from enforcing differential privacy for a query; the metrics can be obtained once and re-used for all queries. Note the metric must be recomputed when the most frequent join attribute changes, otherwise differential privacy is no longer guaranteed. For this reason, the architecture in Figure 2 is ideal for environments where database updates are far less frequent than database queries.

Most databases can be configured using triggers [3] to automatically recompute the metrics on database updates; this approach could support environments with frequent data updates.

Elastic Sensitivity analysis

To compute elastic sensitivity we built an analysis framework for SQL queries based on the Presto parser [9], with additional logic to resolve aliases and a framework to perform abstract interpretation-based dataflow analyses on the query tree. Flex’s elastic sensitivity analysis is built on this dataflow analysis engine, and propagates information about ancestor relations and max-frequency metrics for each joined column in order to compute the overall elastic sensitivity of the query, per the recursive definition in Section 3. We evaluate the runtime and success rate of this analysis in Section 5.

Histogram bin enumeration

When a query uses SQL’s GROUP BY construct, the output is a histogram containing a set of bin labels and an aggregation result (e.g., count) for each bin. To simplify presentation, our definition of elastic sensitivity in Section 3.3 assumes that the analyst provides the desired histogram bins labels . This requirement, also adopted by previous work [40], is necessary to prevent leaking information via the presence or absence of a bin. In practice, however, analysts do not expect to provide histogram bin labels manually.

In some cases, Flex can automatically build the set of histogram bin labels for a given query. In our dataset, many histogram queries use non-protected bin labels drawn from finite domains (e.g. city names or product types). For each possible value of the histogram bin label, Flex can automatically build and obtain the corresponding differentially private count for that histogram bin. Then, Flex adds a row to the output containing the bin label and its differentially private count, where results for missing bins are assigned value 0 and noise added as usual.

This process returns a histogram of the expected form which does not reveal anything new through the presence or absence of a bin. Additionally, since this process requires the bin labels to be non-protected, the original bin labels can be returned. The process can generalize to any aggregation function.

This process requires a finite, enumerable, and non-protected set of values for each histogram bin label. When the requirement cannot be met, for example because the histogram bin labels are protected or cannot be enumerated, Flex can still return the differentially private count for each bin, but it must rely on the analyst to specify the bin labels.

4.1 Proof of Correctness

In this section we formally define the Flex mechanism and prove that it provides -differential privacy.

Flex implements the following differential privacy mechanism derived from the Laplace-based smooth sensitivity mechanism defined by Nissim et al. [44, 45]:

Definition 7 (Flex mechanism).

For input query and histogram bin labels on true database of size , with privacy parameters :

  1. Set .

  2. Calculate .

  3. Release .

This mechanism leverages smooth sensitivity [44, 45], using elastic sensitivity as an upper bound on local sensitivity.

Theorem 2.

The Flex mechanism provides -differential privacy.

Proof.

By Theorem 1 and Nissim et al. [45] Lemma 2.3, is a -smooth upper bound on the local sensitivity of . By Nissim et al. Lemma 2.9, when the Laplace mechanism is used, a setting of suffices to provide -differential privacy. By Nissim et al. Corollary 2.4, the value released by the Flex mechanism is -differentially private. ∎

4.2 Efficiently Calculating

The definition of the Flex mechanism (Definition 7) requires (in step 2) calculating the maximum smooth sensitivity over all distances between 0 and (the size of the true database). For large databases, this is inefficient, even if each sensitivity calculation is very fast.

The particular combination of elastic sensitivity with smooth sensitivity allows for an optimization. The elastic sensitivity grows as , where is the number of joins in (see Lemma 3 below). For a given query, is fixed. The smoothing factor (), on the other hand, shrinks exponentially in .

Recall that the smoothed-out elastic sensitivity at is . We will show that to find , it is sufficient to find , where . Since is typically much smaller than (and depends on the query, rather than the size of the database), this observation yields significant computational savings.

Lemma 3.

For all relation expressions and databases , where is the number of joins in , is a polynomial in of degree at most , and all coefficients are non-negative.

Proof.

Follows from the recursive definitions of and , since each makes at most recursive calls and only adds or multiplies the results. ∎

Theorem 3.

For all queries and databases , the smoothed-out elastic sensitivity at distance is . For each and , if queries a relation , the maximum value of occurs from to .

Proof.

Let the constant . By Lemma 3, we have that for some set of constants :

We therefore have that:

Under the condition that , each term in the numerator is exactly when . We know that by the definition of elastic sensitivity.

We also know that , because a query cannot have a negative number of joins. Thus the first term () is exactly when (we know that by its definition). All of the other terms will also be when , because for , .

We can therefore conclude that when , and so is flat or decreasing for . ∎

4.3 Privacy Budget & Multiple Queries

Flex does not prescribe a specific privacy budget management strategy, allowing the use existing privacy budget methods as needed for specific applications. Below we provide a brief overview of several approaches.

Composition techniques

Composition for differential privacy [23] provides a simple way to support multiple queries: the s and s for these queries simply add up until they reach a maximum allowable budget, at which point the system refuses to answer new queries. The strong composition theorem [26] improves on this method to produce a tighter bound on the privacy budget used. Both approaches are independent of the mechanism and thus apply directly to Flex.

Budget-efficient approaches

Several approaches answer multiple queries together (i.e. in a single workload) resulting in more efficient use of a given privacy budget than simple composition techniques. These approaches work by posing counting queries through a low-level differentially private interface to the database. Flex can provide the low-level interface to support these approaches.

The sparse vector technique [24] answers only queries whose results lie above a predefined threshold. This approach depletes the privacy budget for answered queries only. The Matrix Mechanism [37] and MWEM [30] algorithms build an approximation of the true database using differentially private results from the underlying mechanism; the approximated database is then used to answer questions in the workload. Ding et al. [19] use a similar approach to release differentially private data cubes. Each of these mechanisms is defined in terms of the Laplace mechanism and thus can be implemented using Flex.

5 Experimental Evaluation

We evaluate our approach with the following experiments:

  • [leftmargin=4mm]

  • We measure the performance overhead and success rate of Flex on real-world queries (Section 5.1).

  • We investigate the utility of Flex-based differential privacy for real-world queries with and without joins (Section 5.2).

  • We evaluate the effect of the privacy budget on the utility of Flex-based differential privacy (Section 5.3).

  • We measure the utility impact of the public table optimization described in Section 3.6 (Section 5.4).

  • We compare Flex and wPINQ on a set of representative counting queries using join (Section 5.5).

Experimental setup & dataset

We ran all of our experiments using our implementation of Flex with Java 8 on Mac OSX. Our test machine was equipped with a 2.2 GHz Intel Core i7 and 8GB of memory. Our experiment dataset includes 9862 real queries executed during October 2016. To build this dataset, we identified all counting queries (including histogram queries) submitted during this time which examined sensitive trip data. Our dataset also includes original results for each of these queries.

Avg (s) Max (s)
Original query 42.4 3,452
Flex: Elastic Sensitivity Analysis 0.007 1.2
Flex: Output Perturbation 0.005 2.4
Table 2: Performance of Flex-based differential privacy.

5.1 Success Rate and Performance of Flex

To investigate Flex’s support for the wide range of SQL features in real-world queries, we ran Flex

’s elastic sensitivity analysis on the queries in our experiment dataset. We recorded the number of errors and classified each error according to its type.

In total, Flex successfully calculated elastic sensitivity for 76% of the queries. The largest group of errors is due to unsupported queries (14.14%). These queries use features for which our approach cannot compute an elastic sensitivity, as described in Section 3.7.1. Parsing errors occurred for 6.58% of queries. These errors result from incomplete grammar definitions for the full set of SQL dialects used by the queries, and could be fixed by expanding Presto parser’s grammar definitions. The remaining errors (3.21%) are due to other causes.

To investigate the performance of Flex-based differential privacy, we measured the total execution time of the architecture described in Figure 2 compared with the original query execution time. We report the results in Table 2. Parsing and analysis of the query to calculate elastic sensitivity took an average of 7.03 milliseconds per query. The output perturbation step added an additional 4.86 milliseconds per query. By contrast, the average database execution time was 42.4 seconds per query, implying an average performance overhead of 0.03%.

5.2 Utility of Flex on Real-World Queries

Our work is the first to evaluate differential privacy on a set of real-world queries executed on real data. In contrast with previous evaluations of differential privacy [15, 32, 33], our dataset includes a wide variety of real queries executed on real data.

We evaluate the behavior of Flex for this broad range of queries. Specifically, we measure the noise introduced to query results based on whether or not the query uses join and what percentage of the data is accessed by the query.

Query population size

To evaluate the ability of Flex to handle both small and large populations, we define a metric called population size. The population size of a query is the number of unique trips in the database used to calculate the query results. The population size metric quantifies the extent to which a query targets specific users or trips: a low population size indicates the query is highly targeted, while a higher population size means the query returns statistics over a larger subgroup of records.

Figure 3: Distribution of population sizes for dataset queries.
Figure 4: Median error vs population size for queries with no joins (a) and with joins (b).

Figure 3 summarizes the distribution of population sizes of the queries in our dataset. Our dataset contains queries with a wide variety of population sizes, reflecting the diversity of queries in the dataset.

Utility of Flex-based differential privacy

We evaluate the utility of Flex by comparing the error introduced by differential privacy on each query against the population size of that query. For small population sizes, we expect our approach to protect privacy by producing high error; for large population sizes, we expect our approach to provide high utility by producing low error.

We used Flex to produce differentially private results for each query in our dataset. We report separately the results for queries with no joins and those with joins. For each cell in the results, we calculated the relative (percent) error introduced by Flex, as compared to the true (non-private) results. Then, we calculated the median error of the query by taking the median of the error values of all cells. For this experiment, we set and (where is the size of the database), following Dwork and Lei [22].

Figure 4 shows the median error of each query against the population size of that query for queries with no joins (a) and with joins (b). The results indicate that Flex achieves its primary goal of supporting joins. Figure 4 shows similar trends with and without joins. In both cases the median error generally decreases with increasing population size; furthermore, the magnitude of the error is comparable for both. Overall, Flex provides high utility (less than 10% error) for a majority of queries both with and without joins.

Figure 4(b) shows a cluster of queries with higher errors but exhibiting the same error-population size correlation as the main group. The queries in this cluster perform many-to-many joins on private tables and do not benefit from the public table optimization described in Section 3.6. Even with this upward shift, a high utility is predicted for sufficiently large population size: at population sizes larger than 5 million the median error drops below 10%.

Hay et al. [32] define the term scale- exchangeability to describe the trend of decreasing error with increasing population size. The practical implication of this property is that a desired utility can always be obtained by using a sufficiently large population size. For counting queries, a local sensitivity-based mechanism using Laplace noise is expected to exhibit scale- exchangeability. Our results provide empirical confirmation that Flex preserves this property, for both queries with and without joins, while calculating an approximation of local sensitivity.

Query Description # Joins
  Q1 Billed, shipped, and returned business 0
  Q4 Priority system status and customer satisfaction 0
  Q13 Relationship between customers and order size 1
  Q16 Suppliers capable of supplying various part types 1
  Q21 Suppliers with late shipping times for required parts 3
Table 3: Evaluated TPC-H queries.

5.2.1 Utility of Flex on TPC-H benchmark

We repeat our utility experiment using TPC-H [17], an industry-standard SQL benchmark. The source code and data for this experiment are available for download [5].

The TPC-H benchmark includes synthetic data and queries simulating a workload for an archetypal industrial company. The data is split across 8 tables (customers, orders, suppliers, etc.) and the benchmark includes 22 SQL queries on these tables.

The TPC-H benchmark is useful for evaluating our system since the queries are specifically chosen to exhibit a high degree of complexity and to model typical business decisions [17]. This experiment measures the ability of our system to handle complex queries and provide high utility in a new domain.

Experiment setup

We populated a database using the TPC-H data generation tool with the default scale factor of 1. We selected the counting queries from the TPC-H query workload, resulting in five queries for evaluation including three queries that use join. The selected queries use SQL’s GROUP BY operator and other SQL features including filters, order by, and subqueries. The selected queries are summarized in Table 3. The remaining queries in the benchmark are not applicable for this experiment as they return raw data or use non-counting aggregation functions.

We computed the median population size and median error for each query using the same methodology as the previous experiment and privacy parameters and . We marked as private every table containing customer or supplier information (customer, orders, lineitem, supplier, partsupp). The 3 tables containing non-sensitive metadata (region, nation, part) were marked as public.

Results

The results are presented in Figure 5. Elastic sensitivity exhibits the same trend as the previous experiment: error decreases with increasing population size; this trend is observed for queries with and without joins, but error tends to be higher for queries with many joins.

Figure 5: Median error vs population size (TPC-H queries).

5.2.2 Inherently sensitive queries

Differential privacy is designed to provide good utility for statistics about large populations in the data. Queries with low population size, by definition, pose an inherent privacy risk to individuals; differential privacy requires poor utility for their results in order to protect privacy. As pointed out by Dwork and Roth [25], “Questions about specific individuals cannot be safely answered with accuracy, and indeed one might wish to reject them out of hand.”

Since queries with low population size are inherently sensitive and therefore not representative of the general class of queries of high interest for differential privacy, we exclude queries with sample size smaller than 100 in the remaining experiments. This ensures the results reflect the behavior of Flex on queries for which high utility may be expected.

5.3 Effect of Privacy Budget

In this section we evaluate the effect of the privacy budget on utility of Flex-based differential privacy. For each value of in the set (keeping fixed at ), we computed the median error of each query, as in the previous experiment.

We report the results in Figure 6, as a histogram grouping queries by median error. As expected, larger values of result in lower median error. When , Flex produces less than 1% median error for approximately half (49.8%) of the less sensitive queries in our dataset.

Figure 6: Effect of on median error.
High-error queries

The previous two experiments demonstrate that Flex produces good utility for queries with high population size, but as demonstrated by the number of queries in the “More” bin in Figure 6, Flex also produces high error for some queries.

To understand the root causes of this high error, we manually examined a random sample of 50 of these queries and categorized them according to the primary reason for the high error.

We summarize the results in Table 4. The category filter on individual’s data (8% of high error queries) includes queries that use a piece of data specific to an individual—either to filter the sample with a Where clause, or as a histogram bin. For example, the query might filter the set of trips by comparing the trip’s driver ID against a string literal containing a particular driver’s ID, or it might construct a histogram grouped by the driver ID, producing a separate bin for each individual driver. These queries are designed to return information specific to individuals.

The category low-population statistics (72% of high error queries) contains queries with a Where clause or histogram bin label that shrinks the set of rows considered. A query to determine the success rate of a promotion might restrict the trips considered to those within a small city, during the past week, paid for using a particular type of credit card, and using the promotion. The analyst in this case may not intend to examine the information of any individual, but since the query is highly dependent on a small set of rows, the results may nevertheless reveal an individual’s information.

These categories suggest that even queries with a population size larger than 100 can carry inherent privacy risks, therefore differential privacy requires high error for the reasons motivated earlier.

The third category (20% of high error queries) contains queries that have many-to-many joins with large maximum frequency metrics and which do not benefit from any of the optimizations described in Section 3.6. These queries are not necessarily inherently sensitive; the high error may be due to a loose bound on local sensitivity arising from elastic sensitivity’s design.

Figure 7: Impact of public table optimization.

5.4 Impact of Public Table Optimization

Section 3.6 describes an optimization for queries joining on public tables. We measure the impact of this optimization on query utility by calculating median error introduced by Flex for each query in our dataset with the optimization enabled and disabled. We use the same experimental setup described in the previous section, with and . As before, we exclude queries with population size less than 100.

The optimization is applied to 23.4% of queries in our dataset. Figure 7 shows the utility impact of the optimization across all queries. The optimization increases the percentage of queries with greatest utility (error less than 1.0%) from 28.5% to 49.8%. The majority of the increase in high-utility queries come from the lowest-utility bin (error greater than 100%) while little change is seen in the mid-range error bins. This suggests our optimization is most effective on queries which would otherwise produce high error, optimizing more than half of these queries into the 1% error bin.

Category Percent
Filters on individual’s data 8%
Low-population statistics 72%
Many-to-many Join causes high elastic sensitivity 20%
Table 4: Manual categorization of queries with high error.
Program Joined tables Median Median Error (%)
Population wPINQ Elastic
Size Sensitivity
1. Count distinct drivers who have completed a trip in San Francisco yet enrolled as a driver in a different city. trips, drivers 663 45.9 22.6
2. Count driver accounts that are active and were tagged after June 6 as duplicate accounts. users, user_tags 734 71.5  2.8
3. Count motorbike drivers in Hanoi who are currently active and have completed 10 or more trips. drivers, analytics 212 51.4  4.72
4. Histogram: Daily trips by city (for all cities) on Oct. 24, 2016. trips, cities 87 11.5 23
5. Histogram: Total trips per driver in Hong Kong between Sept. 9 and Oct. 3, 2016. trips, drivers 1 974 6437
6. Histogram: Drivers by different thresholds of total completed trips for drivers registered in Syndey, AUS who have completed a trip within the past 28 days. drivers, analytics 72 51.5 27.8
Table 5: Utility comparison of wPINQ and Flex for selected set of representative counting queries using join.

5.5 Comparison with wPINQ

We aim to compare our approach to alternative differential privacy mechanisms with equivalent support for real-world queries. Of the mechanisms listed in Section 2.3, only wPINQ supports counting queries with the full spectrum of join types.

Since wPINQ programs are implemented in C#, we are unable to run wPINQ directly on our SQL query dataset. Instead we compare the utility between the two mechanisms for a selected set of representative queries. The precise behavior of each mechanism may differ for every query, however this experiment provides a relative comparison of the mechanisms for the most common cases.

Experiment Setup

We selected a set of representative queries based on the most common equijoin patterns (joined tables and join condition) across all counting queries in our dataset. We identify the three most common join patterns for both histogram and non-histogram queries and select a random query representing each. Our six selected queries collectively represent 8.6% of all join patterns in our dataset.

For each selected query we manually transcribe the query into a wPINQ program. To ensure a fair comparison, we use wPINQ’s select operator rather than the join operator for joins on a public table. This ensures that no noise is added to protect records in public tables, equivalent to the optimization described in Section 3.6.

Our input data for these programs includes all records from the cities table, which is public, and a random sample of 1.5 million records from each private table (it was not feasible to download the full tables, which contain over 2 billion records). We execute each program 100 times with the wPINQ runtime [13].

To obtain baseline (non-differentially private) results we run each SQL query on a database populated with only the sampled records. For elastic sensitivity we use max-frequency metrics calculated from this sampled data. We compute the median error for each query using the methodology described in the previous section, setting for both mechanisms.

Table 5 summarizes the queries and median error results. Flex provides lower median error than wPINQ for programs 1, 2, 3 and 6—more than 90% lower for 2 and 3 and nearly 50% lower for programs 1 and 6. Flex produces higher error than wPINQ for programs 4 and 5.

In the case of program 5, both mechanisms produce errors above 900%. The median population size of 1 for this program indicates that our experiment data includes very few trips per driver that satisfy the filter conditions. Elastic sensitivity provides looser bounds on local sensitivity for queries that filter more records, resulting in a comparably higher error for queries such as this one. Given that such queries are inherently sensitive, a high error (low utility) is required for any differential privacy mechanism, therefore the comparably higher error of Flex is likely insignificant in practice.

Proserpio et al. [47]

describe a post-processing step for generating synthetic data by using wPINQ results to guide a Markov-Chain Monte Carlo simulation. The authors show that this step improves utility for graph triangle counting queries when the original query is executed on the synthetic dataset. While this strategy may produce higher utility than the results presented in Table 

5, we do not evaluate wPINQ with this additional step since it is not currently automated.

6 Related Work

Differential privacy was originally proposed by Dwork [20, 23, 21], and the reference by Dwork and Roth [25] provides an excellent general overview of differential privacy. Much of this work focuses on mechanisms for releasing the results of specific algorithms. Our focus, in contrast, is on a general-purpose mechanism for SQL queries that supports general equijoins. We survey the existing general mechanisms that support join in Section 2.3.

Lu et al. [38] propose a mechanism for generating differentially private synthetic data such that queries with joins have similar performance characteristics, but not necessarily similar answers, on the synthetic and true databases. However, Lu et al. do not propose a mechanism for answering queries with differential privacy. As such, it does not satisfy either of the two requirements in Section 2.2.

Airavat [48] enforces differential privacy for arbitrary MapReduce programs, but requires the analyst to bound the range of possible outputs of the program, and clamps output values to lie within that range. Fuzz [28, 29] enforces differential privacy for functional programs, but does not support one-to-many or many-to-many joins.

Propose-test-release [22] (PTR) is a framework for leveraging local sensitivity that works for arbitrary real-valued functions. PTR requires (but does not define) a way to calculate the local sensitivity of a function. Our work on elastic sensitivity is complementary and can enable the use of PTR by providing a bound on local sensitivity.

Sample & aggregate [44, 45]

is a data-dependent framework that applies to all statistical estimators. It works by splitting the database into chunks, running the query on each chunk, and aggregating the results using a differentially private algorithm. Sample & aggregate cannot support joins, since splitting the database breaks join semantics, nor does it support queries that are not statistical estimators, such as counting queries. GUPT 

[41] is a practical system that leverages the sample & aggregate framework to enforce differential privacy for general-purpose analytics.

The Exponential Mechanism [39] supports queries that produce categorical (rather than numeric) data. It works by randomly selecting from the possible outputs according to a scoring function provided by the analyst. Extending Flex to support the exponential mechanism would require specification of the scoring function and a means to bound its sensitivity.

A number of less-general mechanisms for performing specific graph analysis tasks have been proposed [31, 49, 34, 35]. These tasks often involve joins, but the mechanisms used to handle them are specific to the task and are not applicable for general-purpose analytics. For example, the recursive mechanism [16] supports general equijoins in the context of graph analyses, but is restricted to monotonic queries and in the worst case, runs in time exponential in the number of participants in the database.

Kifer et al. [36] point out that database constraints (such as uniqueness of a primary key) can lead to leaks of private data. Such constraints are common in practice, and raise concerns for all differential privacy approaches. Kifer et al. propose increasing sensitivity based on the specific constraints involved, but calculating this sensitivity is computationally hard. Developing a tractable method to account for common constraints, such as primary key uniqueness, is an interesting target for future work.

7 Conclusion

This paper takes a first step towards practical differential privacy for general-purpose SQL queries. To meet the requirements of real-world SQL queries, we proposed elastic sensitivity, the first efficiently-computed approximation of local sensitivity that supports joins. We have released an open-source tool for computing elastic sensitivity of SQL queries [4]. We use elastic sensitivity to build Flex, a system for enforcing differential privacy for SQL queries. We evaluated Flex on a wide variety of queries, demonstrating that Flex can support real-world queries and provides high utility on a majority of queries with large population sizes.

Acknowledgments

The authors would like to thank Abhradeep Guha Thakurta, Om Thakkar, Frank McSherry, Ilya Mironov and the anonymous reviewers for their helpful comments. This work was supported by the Center for Long-Term Cybersecurity, and DARPA & SPAWAR under contract N66001-15-C-4066. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes not withstanding any copyright notation thereon. The views, opinions, and/or findings expressed are those of the author(s) and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.

References

  • [1] Apple previews iOS 10, the biggest iOS release ever. http://www.apple.com/newsroom/2016/06/apple-previews-ios-10-biggest-ios-release-ever.html.
  • [2] Check Constraint. https://msdn.microsoft.com/en-us/library/ms190377.aspx.
  • [3] Database Triggers. https://docs.oracle.com/cd/A57673_01/DOC/server/doc/SCN73/ch15.htm.
  • [4] Dataflow analysis & differential privacy for SQL queries. http://www.github.com/uber/sql-differential-privacy.
  • [5] Elastic Sensitivity experiments using TPC-H. http://www.github.com/sunblaze-ucb/elastic-sensitivity-experiments.
  • [6] High Energy Physics - Theory collaboration network. https://snap.stanford.edu/data/ca-HepTh.html.
  • [7] Morgan Stanley Breach a Reminder of Insider Risks. https://securityintelligence.com/news/morgan-stanley-breach-reminder-insider-risks/.
  • [8] Nearly 5,000 patients affected by UC Irvine medical data breach. http://www.latimes.com/business/la-fi-uc-irvine-data-breach-20150618-story.html.
  • [9] Presto: Distributed SQL Query Engine for Big Data. https://prestodb.io/.
  • [10] Sutter Health California Pacific Medical Center audit uncovers data breach. http://www.csoonline.com/article/2876324/data-breach/sutter-health-california-pacific-medical-center-audit-uncovers-data-breach.html.
  • [11] Swiss spy agency warns U.S., Britain about huge data leak. http://www.reuters.com/article/us-usa-switzerland-datatheft-idUSBRE8B30ID20121204.
  • [12] Uber Releases Open Source Project for Differential Privacy. https://medium.com/uber-security-privacy/differential-privacy-open-source-7892c82c42b6.
  • [13] Weighted Privacy Integrated Queries. http://cs-people.bu.edu/dproserp/wPINQ.html.
  • [14] J. Blocki, A. Blum, A. Datta, and O. Sheffet. Differentially private data analysis of social networks via restricted sensitivity. In Proceedings of the 4th Conference on Innovations in Theoretical Computer Science, ITCS ’13, pages 87–96, New York, NY, USA, 2013. ACM.
  • [15] J. Blocki, A. Datta, and J. Bonneau. Differentially private password frequency lists. In 23nd Annual Network and Distributed System Security Symposium, NDSS 2016, San Diego, California, USA, February 21-24, 2016. The Internet Society, 2016.
  • [16] S. Chen and S. Zhou. Recursive mechanism: Towards node differential privacy and unrestricted joins. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, SIGMOD ’13, pages 653–664, New York, NY, USA, 2013. ACM.
  • [17] T. P. P. Council. Tpc-h benchmark specification. Published at http://www.tpc.org/tpc_documents_current_versions/pdf/tpc-h_v2.17.2.pdf, 21:592–603, 2008.
  • [18] Y.-A. de Montjoye, C. A. Hidalgo, M. Verleysen, and V. D. Blondel. Unique in the crowd: The privacy bounds of human mobility. Scientific reports, 3, 2013.
  • [19] B. Ding, M. Winslett, J. Han, and Z. Li. Differentially private data cubes: optimizing noise sources and consistency. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of data, pages 217–228. ACM, 2011.
  • [20] C. Dwork. Differential privacy. In M. Bugliesi, B. Preneel, V. Sassone, and I. Wegener, editors, Automata, Languages and Programming, volume 4052 of Lecture Notes in Computer Science, pages 1–12. Springer Berlin Heidelberg, 2006.
  • [21] C. Dwork. Differential privacy: A survey of results. In Theory and applications of models of computation, pages 1–19. Springer, 2008.
  • [22] C. Dwork and J. Lei. Differential privacy and robust statistics. In

    Proceedings of the forty-first annual ACM symposium on Theory of computing

    , pages 371–380. ACM, 2009.
  • [23] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pages 265–284. Springer, 2006.
  • [24] C. Dwork, M. Naor, O. Reingold, G. N. Rothblum, and S. Vadhan. On the complexity of differentially private data release: efficient algorithms and hardness results. In Proceedings of the forty-first annual ACM symposium on Theory of computing, pages 381–390. ACM, 2009.
  • [25] C. Dwork, A. Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4):211–407, 2014.
  • [26] C. Dwork, G. N. Rothblum, and S. Vadhan. Boosting and differential privacy. In Foundations of Computer Science (FOCS), 2010 51st Annual IEEE Symposium on, pages 51–60. IEEE, 2010.
  • [27] Ú. Erlingsson, V. Pihur, and A. Korolova. Rappor: Randomized aggregatable privacy-preserving ordinal response. In Proceedings of the 2014 ACM SIGSAC conference on computer and communications security, pages 1054–1067. ACM, 2014.
  • [28] M. Gaboardi, A. Haeberlen, J. Hsu, A. Narayan, and B. C. Pierce. Linear dependent types for differential privacy. In ACM SIGPLAN Notices, volume 48, pages 357–370. ACM, 2013.
  • [29] A. Haeberlen, B. C. Pierce, and A. Narayan. Differential privacy under fire. In USENIX Security Symposium, 2011.
  • [30] M. Hardt, K. Ligett, and F. McSherry. A simple and practical algorithm for differentially private data release. In Advances in Neural Information Processing Systems, pages 2339–2347, 2012.
  • [31] M. Hay, C. Li, G. Miklau, and D. Jensen. Accurate estimation of the degree distribution of private networks. In Data Mining, 2009. ICDM’09. Ninth IEEE International Conference on, pages 169–178. IEEE, 2009.
  • [32] M. Hay, A. Machanavajjhala, G. Miklau, Y. Chen, and D. Zhang. Principled evaluation of differentially private algorithms using dpbench. In F. Özcan, G. Koutrika, and S. Madden, editors, Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26 - July 01, 2016, pages 139–154. ACM, 2016.
  • [33] X. Hu, M. Yuan, J. Yao, Y. Deng, L. Chen, Q. Yang, H. Guan, and J. Zeng. Differential privacy in telco big data platform. PVLDB, 8(12):1692–1703, 2015.
  • [34] V. Karwa, S. Raskhodnikova, A. Smith, and G. Yaroslavtsev. Private analysis of graph structure. PVLDB, 4(11):1146–1157, 2011.
  • [35] S. P. Kasiviswanathan, K. Nissim, S. Raskhodnikova, and A. Smith. Analyzing graphs with node differential privacy. In Theory of Cryptography, pages 457–476. Springer, 2013.
  • [36] D. Kifer and A. Machanavajjhala. No free lunch in data privacy. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of data, pages 193–204. ACM, 2011.
  • [37] C. Li, M. Hay, V. Rastogi, G. Miklau, and A. McGregor. Optimizing linear counting queries under differential privacy. In Proceedings of the twenty-ninth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 123–134. ACM, 2010.
  • [38] W. Lu, G. Miklau, and V. Gupta. Generating private synthetic databases for untrusted system evaluation. In Data Engineering (ICDE), 2014 IEEE 30th International Conference on, pages 652–663. IEEE, 2014.
  • [39] F. McSherry and K. Talwar. Mechanism design via differential privacy. In Foundations of Computer Science, 2007. FOCS’07. 48th Annual IEEE Symposium on, pages 94–103. IEEE, 2007.
  • [40] F. D. McSherry. Privacy integrated queries: an extensible platform for privacy-preserving data analysis. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of data, pages 19–30. ACM, 2009.
  • [41] P. Mohan, A. Thakurta, E. Shi, D. Song, and D. Culler. Gupt: privacy preserving data analysis made easy. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, pages 349–360. ACM, 2012.
  • [42] A. Narayan and A. Haeberlen. Djoin: differentially private join queries over distributed databases. In Presented as part of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI 12), pages 149–162, 2012.
  • [43] A. Narayanan and V. Shmatikov. How to break anonymity of the Netflix prize dataset. CoRR, abs/cs/0610105, 2006.
  • [44] K. Nissim, S. Raskhodnikova, and A. Smith. Smooth sensitivity and sampling in private data analysis. In Proceedings of the thirty-ninth annual ACM symposium on Theory of computing, pages 75–84. ACM, 2007.
  • [45] K. Nissim, S. Raskhodnikova, and A. Smith. Smooth sensitivity and sampling in private data analysis. 2011. Draft Full Version, v1.0. http://www.cse.psu.edu/~ads22/pubs/NRS07/NRS07-full-draft-v1.pdf.
  • [46] V. Pandurangan. On taxis and rainbows: Lessons from NYC’s improperly anonymized taxi logs. https://medium.com/@vijayp/of-taxis-and-rainbows-f6bc289679a1.
  • [47] D. Proserpio, S. Goldberg, and F. McSherry. Calibrating data to sensitivity in private data analysis: A platform for differentially-private analysis of weighted datasets. PVLDB, 7(8):637–648, 2014.
  • [48] I. Roy, S. T. Setty, A. Kilzer, V. Shmatikov, and E. Witchel. Airavat: Security and privacy for mapreduce. In NSDI, volume 10, pages 297–312, 2010.
  • [49] A. Sala, X. Zhao, C. Wilson, H. Zheng, and B. Y. Zhao. Sharing graphs using differentially private graph models. In Proceedings of the 2011 ACM SIGCOMM conference on Internet measurement conference, pages 81–98. ACM, 2011.
  • [50] A. Smith. Privacy-preserving statistical estimation with optimal convergence rates. In Proceedings of the forty-third annual ACM symposium on Theory of computing, pages 813–822. ACM, 2011.
  • [51] L. Sweeney. Weaving technology and policy together to maintain confidentiality. The Journal of Law, Medicine & Ethics, 25(2-3):98–110, 1997.