Overlook: Differentially Private Exploratory Visualization for Big Data

by   Pratiksha Thaker, et al.
Stanford University

Data exploration systems that provide differential privacy must manage a privacy budget that measures the amount of privacy lost across multiple queries. One effective strategy to manage the privacy budget is to compute a one-time private synopsis of the data, to which users can make an unlimited number of queries. However, existing systems using synopses are built for offline use cases, where a set of queries is known ahead of time and the system carefully optimizes a synopsis for it. The synopses that these systems build are costly to compute and may also be costly to store. We introduce Overlook, a system that enables private data exploration at interactive latencies for both data analysts and data curators. The key idea in Overlook is a virtual synopsis that can be evaluated incrementally, without extra space storage or expensive precomputation. Overlook simply executes queries using an existing engine, such as a SQL DBMS, and adds noise to their results. Because Overlook's synopses do not require costly precomputation or storage, data curators can also use Overlook to explore the impact of privacy parameters interactively. Overlook offers a rich visual query interface based on the open source Hillview system. Overlook achieves accuracy comparable to existing synopsis-based systems, while offering better performance and removing the need for extra storage.




1 Introduction

Privacy has become a key issue for all organizations that collect personal data, from companies to government entities [us-census-privacy, apple-privacy, google-privacy]. After organizations collect a dataset, they would like to make it available to internal data analysis teams, or even expose it to external researchers [us-census-privacy], without leaking a significant amount of information about any individual in the dataset. To be broadly useful, a private data analysis system should support ad-hoc, exploratory queries through familiar interfaces while making it easy for the data curator (the administrator configuring the system) to control the amount of information leaked.

The most widely used framework for reasoning about privacy is differential privacy (DP) [dp, dp-survey]. Differential privacy quantifies the privacy cost of a statistical analysis through a privacy budget; a smaller privacy budget implies more error in the query results but more privacy for the individuals in the dataset.

Although many research systems provide differential privacy [pinq, Johnson18, psi, privatesql, dawa, airavat], these current systems are challenging for organizations to configure and use, especially for ad-hoc exploratory analysis. At a high level, current DP systems fall into two categories:

  • Systems with per-query budgeting: Systems such as PINQ [pinq] and FLEX [Johnson18] ask users to select a privacy budget, $\epsilon$, for each query they execute. The total privacy leakage of the system is then bounded by the sum of these values. These systems are complex for both users and data curators to use. Users typically have a limited total privacy budget available and need to decide how to divide it among the queries they submit; when they run out of budget, they can no longer make queries. In addition, two users that collaborate can reveal information proportional to the sum of their budgets, so data curators must carefully limit which users can access the system. Such systems would be unsuitable for exposing a research dataset to the public, for instance [us-census-privacy].

  • Synopsis-based systems: Systems such as PrivateSQL [privatesql] generate a synopsis data structure that can answer a specific class of queries after taking in a dataset, a description of the query class, and a total privacy budget $\epsilon$. Users can then query the synopsis arbitrarily many times without revealing additional information beyond the budget. These systems are more suitable for exploratory analysis and for public access, but unfortunately, they are also challenging to use. Constructing the synopsis requires solving an expensive optimization problem to minimize the error it will produce for a specific query workload, which can take hours even for a modest dataset. The synopsis can also consume a large amount of space, on par with the original data, making it costly for large datasets.

In this paper, we present Overlook, a system that makes synopsis-based differential privacy practical for one of the most common types of data analysis: visual exploratory analysis of immutable datasets. Visual query interfaces, such as Tableau [tableau], are one of the most common ways for organizations to expose data internally, and produce a class of queries that are a good fit for synopsis data structures (mostly counting queries). In Overlook, we seek to make private visual queries accessible to both data users and data curators, by designing a system that lets curators tune a synopsis interactively to set privacy parameters, and lets users query data interactively at a similar cost to their existing data analysis infrastructure. Overlook runs as an interposition layer in front of existing analytical engines, such as any SQL RDBMS, enabling organizations to benefit from the scalability and optimizations of these existing engines and to offer private visual query interfaces over existing datasets.

The key idea in Overlook is a virtual synopsis data structure that represents the noise that would be added by a classical synopsis algorithm in a highly compressed format using a pseudo-random function (PRF). For any counting query (e.g., counting the users in a dataset by country), Overlook can use the virtual synopsis to compute just the noise that should be added to each tuple in the query result. Overlook simply adds this noise to the results from an existing query engine. Thanks to this design, users can run queries at a similar speed to their existing query engine, with extra computation that is only proportional to the number of output tuples (e.g., number of bins in a histogram). Likewise, data curators can use Overlook to explore parameters of the virtual synopsis interactively, e.g., change the total privacy budget and see its results on several queries. Overlook’s synopses are based on the hierarchical histogram mechanism [Hay10, binary-mechanism], a synopsis design that supports multidimensional queries, and can be tuned by curators to provide different noise levels for different dimensions in the data.

Overlook also offers a rich privacy-aware visual query interface built on virtual synopses, based on the open source Hillview system [hillview]. In particular, we extend most of the built-in visualizations in Hillview, such as histograms and heatmaps, to display information about the noise introduced by DP, and to automatically coarsen the visualization when the noise exceeds the discernible signal. These automatic adjustments all query the virtual synopsis, so they do not cause any additional privacy leakage. Data visualization has some unique features that make it a good candidate for a synopsis-based approach to DP. First, most queries can be expressed as (combinations of) count queries, for which there are good synopses. Second, the visualization itself introduces errors via the quantization to the pixels on the screen. We found that this inherent approximation often masks the error introduced by DP. Finally, data visualization interfaces typically already incorporate methods for presenting errors and approximations to the user (e.g., via confidence intervals and error bars). These tools help the user understand the results the differential privacy mechanism produces.

Unlike prior systems, Overlook provides an interface for both the data curator and the data analyst. The data curator, who manages and has access to the raw data, must make decisions about the privacy parameters used to build the synopsis. Overlook’s curator UI lets the curator quickly see the impact of adjusting various privacy parameters, such as bins for categorical features and privacy budgets along different data dimensions, on multiple visual queries. To our knowledge, Overlook is the first system to provide interactive feedback for tuning a DP synopsis.

We implement Overlook using the Hillview UI, and develop backends to let it run either over a SQL DBMS or over Hillview’s built-in distributed execution engine [hillview], a high performance in-memory query engine that supports approximate query processing for common visualization queries. In both cases, Overlook benefits directly from the optimizations in the underlying engine.

We evaluate Overlook against the algorithms for computing synopses in DAWA [dawa] and PrivateSQL [privatesql], and other algorithms in DPBench [dpbench]. Although many of these algorithms build workload-aware synopses, which are optimized for a specific set of queries [matrix-mechanism], we find that Overlook’s workload-agnostic virtual synopses offer similar levels of error in query results when given the set of visualization queries as input. The key intuition is that in an exploratory analytics setting with many possible queries, a synopsis that “balances” the noise over the possible queries will perform well, and it is not useful to solve a complex optimization problem [QardajiYL13, qardaji2013differentially, Hay10]. Moreover, Overlook’s virtual synopses require no precomputation and minimal storage over the underlying histogram. We show that Overlook achieves the same scaling properties as the underlying database while requiring no more than 2.5× the time required to compute equivalent non-private queries. Overlook requires only a 32-byte key for its synopsis, compared to existing synopses that require kilobytes of space to store and potentially gigabytes of memory to compute.

To summarize, our contributions are:

  • We present Overlook, the first differentially private visual analytics system that provides interactive configuration for data curators and simultaneously supports arbitrary, interactive ad-hoc queries for users by leveraging synopses.

  • We introduce virtual synopses, a data structure to represent DP synopses for multidimensional counting queries that can be used incrementally to add noise to a specific query instead of requiring costly precomputation and storage.

  • We develop a privacy-aware interactive visualization UI for both data users and data curators, including a novel curation UI that lets curators see the effect of configuring virtual synopsis parameters interactively.

Overlook is open source at http://github.com/vmware/hillview.

2 System overview

Figure 1: Overlook architecture.

Figure 1 shows the architecture of Overlook. A data analyst interacts with Overlook through a browser interface that allows them to issue queries to the Overlook root node. The root dispatches the query to the backend, applies a privacy mechanism to the returned result, and returns the private result to the user.

The raw data resides on a possibly distributed set of trusted servers; a centralized root receives histogram queries and dispatches them to the servers. The root also stores relevant privacy parameters used to compute the private response to a histogram query. The trusted data curator can make changes to these privacy parameters until the dataset is published, at which point the privacy parameters as well as the dataset must become immutable. The untrusted data analyst can only access published data through the private results of histogram queries. The privacy parameters are assumed to be public and visible to both the data curator and the data analyst. The privacy parameters are discussed further in Section 2.3.

Overlook primarily supports histogram queries. A histogram query over a column takes as input a set of disjoint buckets and returns the number of data items that fall in each bucket. In Overlook, these counts are perturbed with noise consistent with a differentially-private mechanism, described further in Section 4. Overlook allows users to issue an arbitrary number of histogram queries on privately-published data, with no privacy budget restriction. Histogram queries are sufficient to support a large number of visualizations, including histograms, heat maps, pie charts, trellis plots, and CDFs. The visualizations supported in Overlook are described in Section 2.2.1.

In addition to histogram queries, Overlook’s interface supports other count-based queries, such as the number of NULL values in a column.

2.1 Threat model

We assume untrusted users, who can make an unbounded number of queries to the Overlook backend through the Overlook UI. Users may communicate with each other and make queries in parallel or from multiple sessions. Users cannot modify privacy parameters, view raw data, or alter any secret key stored on the root. Side-channel and denial-of-service attacks are out of scope in our work.

The data curator is trusted and has access to the raw data and privacy parameters. The curator may not modify data or parameters once a dataset has been published, as doing so would violate differential privacy.

The distributed backend is entirely trusted, including the root server as well as servers hosting the raw data.

2.2 User interface

The data curator and data analyst both access Overlook through interfaces that are extensions of the Hillview data visualization system [hillview], which provides a browser interface for interacting with charts and data. The curator UI is similar to the analyst’s view, except that the curator may additionally modify privacy parameters and generate new histogram queries under the new parameters prior to publishing the dataset.

The result of a histogram query is displayed as an interactive plot. The user can make additional queries by selecting an interval with the mouse to zoom in, which issues a new histogram query to the backend.

Note that, while the frontend is an extension to Hillview, Overlook can be used with any backend that supports count queries. In Section 5, we describe one such alternate implementation using MySQL.

2.2.1 Supported visualizations

Overlook supports two main categories of private data summaries: (1) histograms, and (2) counts of specific values, such as NULL values. Multi-dimensional histograms encompass a number of useful visualizations, including the traditional 1-dimensional histogram but also heat maps, pie charts, trellis plots, and CDFs.

In addition, the user interface displays schema metadata including standard values such as the column type, but also the privacy policy associated with a column or group of columns.


Overlook’s main primitive is a histogram query. This primitive can be applied to create a variety of useful visualizations:

  • Histogram queries over a column (with numeric or categorical data). The visual presentation can be a bar chart with confidence intervals, as shown in Figure 2, or, for example, a pie chart, which emphasizes percentage of the whole that falls within each bucket.

  • Cumulative distribution functions (CDFs) over a column (numeric or categorical). CDF plots are always shown together with a histogram plot. Figure 2 shows a histogram plot with an overlaid CDF curve.

  • Histogram queries over a pair of columns, each of which can be either numeric or categorical. This can be visually presented as a heat map as in Figure 3, or for example a trellis plot of 1-dimensional histograms.

One important feature of Overlook is that it displays estimates of uncertainty about the data. For 1-dimensional histograms, this is in the form of 99% confidence intervals. The presentation of uncertainty is discussed further in Section 6.


In addition to histograms, Overlook supports releasing certain useful counts (“degenerate histograms”) privately:

  • In many views, the system displays information about the number of elements and the number of NULL values in a column.

  • Distinct count queries: these estimate the number of distinct values in a column.

Figure 2: Histogram plot with CDF curve overlaid. The data at the mouse position is shown in a semi-transparent white rectangle; notice that the bar size (count) is given as an interval, and confidence intervals are plotted for each bar.
Figure 3: Heatmap on two columns. The color shows the count for each combination of values. Values with low confidence are hidden, and the user can highlight with the mouse values within a specific range.

2.3 Curator interface

The data curator’s job is to decide which columns and pairs of columns will be released privately, and to then decide the privacy level for each of those data releases. Overlook’s curator UI helps the data curator make these decisions.

For each set of columns that is to be released privately, the curator must specify a corresponding privacy policy. This policy provides Overlook with information about public values that can be used in the histogram as well as parameters that are used to instantiate the privacy mechanism. The curator’s view of the Overlook UI allows the curator to edit these policy settings and generate sample charts on a dataset before it is published. In this section, we describe the parameter settings the curator can choose in a privacy policy.

While specifying a policy for every such histogram may be impractical, Overlook provides some useful default values for the convenience of the curator. While curators should be careful to choose parameters that are public and independent of the data, we note that the curator’s decisions may nevertheless leak information because they are made based on the true underlying dataset. Devising methods to add differential privacy to this kind of human-in-the-loop parameter selection is an interesting avenue for future work.

2.3.1 Privacy levels

For each set of columns to release, the curator specifies a corresponding value $\epsilon$ that denotes the privacy level that should be used to release the column. A smaller value of $\epsilon$ typically results in more privacy at the cost of more noise in the private output. The curator can explore many values of $\epsilon$ for each set of columns before deciding which privacy level gives the best tradeoff between data privacy and utility for potential analysts.

2.3.2 Data ranges

On the first histogram query a user makes for a column, Overlook must return a histogram over a sensible range, after which the user can zoom in to regions of interest. A non-private visualization could compute the true minimum and maximum values in the dataset and return a histogram over the full range. Computing these values in a differentially-private manner is a considerably more complex task [DworkJ09]. Instead, Overlook requires the data curator to specify publicly-visible values for the initial minimum and maximum for each column in the privacy policy. As in prior work [pinq, flex], the curator must be careful to not choose values closely associated with specific data points.

2.3.3 Quantization

In Overlook, the data curator must specify a public quantization, or partitioning, of the data. The need for this is twofold:

  • The synopsis used in Overlook, detailed in Section 3, operates over a finite, enumerable data domain.

  • When displaying histograms on categorical data, Overlook uses the curator-specified bin boundaries as public labels for the bins.

To illustrate the second point, consider a column that contains the names of patients in a hospital. A non-private histogram might reveal specific names through the choice of bin labels. In Overlook, a curator may set the quantization boundaries to be the letters ‘A’ through ‘Z’, so that the finest unit of aggregation is the first letter of the name, and no individual’s name is leaked in the published histogram.

The idea of public partitioning has been explored in prior systems [pinq, flex], but the partitioning in those systems has been left up to the data analyst to specify. In contrast, in Overlook, the data curator specifies the bin boundaries, and can therefore choose a set of boundaries that is appropriate for the dataset and provides good visual utility to analysts.

Importantly, Overlook never fully materializes a quantized version of the dataset. Instead, the quantization policy is expressed implicitly as a function that maps data points to their quantized versions.
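As an illustration of quantization expressed as a function rather than a stored copy, the following minimal sketch applies a curator-chosen policy lazily, one value at a time. The helper names (`make_range_quantizer`, `first_letter`, `quantized`) and the example bin edges are hypothetical and are not taken from the Overlook code base.

```python
import bisect

def make_range_quantizer(edges):
    """Build a quantizer from sorted, public bin edges (hypothetical helper).

    A value in [edges[i], edges[i+1]) maps to edges[i]; values outside
    [edges[0], edges[-1]) map to None.
    """
    def quantize(value):
        if value < edges[0] or value >= edges[-1]:
            return None
        # bisect_right gives the index of the first edge greater than value.
        return edges[bisect.bisect_right(edges, value) - 1]
    return quantize

def first_letter(name):
    """Quantize a name to its first letter, as in the patient-name example."""
    return name[:1].upper()

def quantized(values, quantize):
    """Lazily yield quantized values; the quantized column is never stored."""
    for value in values:
        yield quantize(value)

quantize_age = make_range_quantizer([0, 10, 20, 30, 40])
ages = list(quantized([3, 17, 29, 40], quantize_age))
letters = list(quantized(["alice", "Bob"], first_letter))
```

Because the policy is only a function, changing the bin edges in the curator UI requires no recomputation over the stored data.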

3 Definitions and Data Model

We consider a tabular dataset $D$, which consists of $n$ rows and $m$ columns, or attributes. This dataset could be the result of a join on multiple tables or a materialized view. (Footnote 1: We note that a line of prior work [ProserpioGM14, restricted-sensitivity, djoin, flex, privatesql] considers limiting the sensitivity of joins in differential privacy. Overlook assumes all released views are materialized and provides only per-row privacy in the materialized view, but, as Overlook operates over a standard database backend, these techniques could be incorporated in the future.) Each column $c$ takes values in the domain $\mathcal{D}_c$, which must be finite and public (which can be achieved implicitly by applying the privacy policy to each column at query time).

Our goal is to answer queries from some set $Q$, where each $q \in Q$ is a function of the dataset. A mechanism for answering queries from $Q$ is a randomized algorithm $M$. Following [dp], a mechanism $M$ is $\epsilon$-differentially private if for any two databases $D, D'$ which differ in a single row, any query $q \in Q$, and any set $S$ of possible outputs,

$$\Pr[M(D, q) \in S] \le e^{\epsilon} \, \Pr[M(D', q) \in S].$$

Intuitively, this means that adding, removing, or changing any one row of the dataset will not change the probability of an event under the differentially-private mechanism by more than a pre-specified multiplicative factor, $e^{\epsilon}$. An extensive line of work has explored the construction of mechanisms that achieve this guarantee [dp-survey, salil-survey]. The most common mechanisms add random noise to the raw result according to the Laplace distribution. In a histogram visualization, this manifests as random perturbations to the counts for each bar. Our challenge is then to choose a mechanism that gives good visual utility to the user in spite of these random perturbations (§3.2) and implement it efficiently (§4).
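To make the Laplace perturbation concrete, here is a minimal sketch of adding Laplace noise to a single count. It assumes a count with sensitivity 1 and therefore a noise scale of $1/\epsilon$; the function names are hypothetical. It draws fresh randomness from Python's `random` module on every call, which is only meant to show the shape of the noise and is not suitable for a production DP release.

```python
import random

def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample.

    The difference of two independent exponential variables with the same
    rate 1/scale follows a Laplace distribution with that scale.
    """
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)

def noisy_count(true_count, epsilon, rng):
    """Perturb a count of sensitivity 1 (assumed) with scale 1/epsilon."""
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(0)
samples = [noisy_count(100, 1.0, rng) for _ in range(20000)]
mean = sum(samples) / len(samples)
```

Smaller values of `epsilon` widen the noise distribution, which is the privacy and accuracy tradeoff the curator tunes.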

Importantly, throughout this section we assume that the mechanism operates over data which has already been quantized according to the privacy policy, and therefore belongs to a finite and public domain $\mathcal{D}$. In practice, this quantization happens on demand, at query time; the quantized dataset is not materialized.

3.1 Query model

Overlook primarily supports one- and two-dimensional histogram queries. (Footnote 2: Larger dimensions can be accommodated easily, but histograms and heat maps are most relevant to the visualization setting.)

The basic building block of a histogram query is computing the count of the elements that belong to a bucket; a bucket is in general a $d$-dimensional rectangle.

For example, a 1-dimensional histogram query with $t$ buckets specifies a column $c$ and a set of bucket boundaries $a_0 < a_1 < \dots < a_t$. The query returns a vector of $t$ counts, one for each interval $[a_{i-1}, a_i)$.

Note that, although the domain must be finite and public, the bucket boundaries can be any values specified by the user. For example, if the quantized domain consists of integers, the user may query a range with non-integer endpoints that contains only the value 1. In this case, Overlook would privately return the (noisy) count corresponding to the value 1 (the only value in the quantized domain that falls in the query range). We are careful to note that this count corresponds to the data after quantization.
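The non-private part of a 1-dimensional histogram query can be sketched as follows. This is an illustrative helper, not Overlook's implementation; it assumes half-open buckets (each bucket includes its left boundary and excludes its right one) and arbitrary user-chosen boundaries.

```python
import bisect

def histogram_counts(values, boundaries):
    """Count values per bucket for sorted boundaries a_0 < a_1 < ... < a_t.

    Bucket i covers [boundaries[i], boundaries[i+1]); values outside the
    overall range are ignored. Half-open buckets are an assumption here.
    """
    counts = [0] * (len(boundaries) - 1)
    for value in values:
        if boundaries[0] <= value < boundaries[-1]:
            counts[bisect.bisect_right(boundaries, value) - 1] += 1
    return counts

# Quantized integer data queried with non-integer boundaries: only the
# domain value 1 falls inside the single bucket [0.5, 1.5).
quantized_column = [0, 1, 1, 2, 3, 1]
counts = histogram_counts(quantized_column, [0.5, 1.5])
```

In Overlook these true counts would come from the backend engine, and the privacy layer would then perturb each one.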

Overlook also supports releasing certain counts, such as counts of NULL or missing values, privately apart from the histograms. These can be made private simply by perturbing the count with noise distributed as $\mathrm{Lap}(1/\epsilon)$ [dp]. The data curator must take into account the additional privacy cost of releasing these values.

3.2 Synopsis mechanism

Certain families of queries permit a special type of mechanism that produces a private summary called a synopsis. Such a mechanism can be decomposed into two stages:

  • It releases a synopsis of the dataset, which is guaranteed to be $\epsilon$-differentially private, and is independent of the queries in $Q$.

  • On input $q \in Q$, it computes an answer using only the synopsis. Since the answer is computed by post-processing the synopsis, and post-processing does not leak privacy [dp], the answer is differentially private. Further, there is no limit to the number of queries that a user can submit.

Given a column of a dataset over a finite, enumerable data domain of size $N$, one can build a histogram of $N$ buckets containing the count of each element in the domain. Making this histogram private naïvely requires adding Laplace noise with scale $1/\epsilon$ to each of the $N$ buckets. Answering an interval query of size $\ell$ then adds $\ell$ independent random Laplace variables to the result. The error of such a mechanism scales as $O(\sqrt{\ell}/\epsilon)$ [binary-mechanism].

However, adding this much noise to each query is suboptimal. Instead, Overlook uses a mechanism called the hierarchical histogram [Hay10, binary-mechanism], also referred to as $H_b$ in the literature. The error of this mechanism scales as $O(\log^{1.5} N/\epsilon)$ rather than $O(\sqrt{\ell}/\epsilon)$. At a high level, the hierarchical histogram builds a tree such that nodes higher in the tree correspond to progressively larger contiguous intervals in the domain. Each internal node of the tree corresponds to an interval of the histogram that is the union of its children, and the mechanism adds noise with scale $\log_b N/\epsilon$ to each internal node. For such a tree with branching factor $b$, an arbitrary interval of size $\ell$ can be computed by taking the union of only $O(b \log_b \ell)$ nodes: the number of noise variables now scales logarithmically, rather than linearly, in the interval size. Figure 4 shows such a tree with branching factor 2 (also called a “dyadic” tree).
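A rough worked comparison may help here. It is a sketch under stated assumptions, not a result from the evaluation: a dyadic tree ($b = 2$), a domain of size $N = 2^{20}$, an interval of size $\ell = 2^{19}$, $\epsilon = 1$, a per-node scale of $\log_2 N/\epsilon$ for the hierarchical mechanism, and at most $2\log_2 N$ nodes in any dyadic decomposition. It uses the fact that a Laplace variable with scale $s$ has variance $2s^2$.

```latex
\begin{align*}
\text{flat histogram:}\quad
  \sigma &= \sqrt{\ell \cdot 2\,(1/\epsilon)^2}
          = \sqrt{2^{19} \cdot 2} = 1024, \\
\text{hierarchical:}\quad
  \sigma &\le \sqrt{2\log_2 N \cdot 2\,(\log_2 N/\epsilon)^2}
          = \sqrt{40 \cdot 800} \approx 179.
\end{align*}
```

Under these assumptions the hierarchical histogram reduces the standard deviation of the noise on this wide range by roughly a factor of six, while very narrow ranges pay a higher per-node scale than in the flat histogram.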

Figure 4: Tree used in a hierarchical histogram with $b = 2$. The tree for a column is constructed over its quantized domain. Each internal node in the tree corresponds to the count for a contiguous interval in the domain, and receives independent random Laplace noise. In this example, the grey internal nodes in the tree can be used to compute the count for the grey range in the domain.

A multidimensional rectangle query can be computed by taking the Cartesian product of its decomposition in each axis, as illustrated in Figure 5 for a heatmap over a pair of columns.

Figure 5: Interval decomposition for a 2D histogram (heatmap). The decomposition in two dimensions is simply the Cartesian product of the decompositions in each dimension.
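The Cartesian-product step can be sketched in a few lines. The node lists below are given by hand as (start, size) pairs for illustration, and `decompose_rectangle` is a hypothetical name; each resulting pair identifies one 2-D tree node that would receive its own noise sample.

```python
from itertools import product

def decompose_rectangle(x_nodes, y_nodes):
    """Pair every 1-D node on the x axis with every 1-D node on the y axis."""
    return [(x, y) for x, y in product(x_nodes, y_nodes)]

# Dyadic nodes covering [3, 8) on one axis and [0, 3) on the other.
x_nodes = [(3, 1), (4, 4)]
y_nodes = [(0, 2), (2, 1)]
rectangle_nodes = decompose_rectangle(x_nodes, y_nodes)
```

The number of noise variables for a rectangle is the product of the per-axis node counts, so it stays polylogarithmic in the domain sizes.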

3.3 Discussion

Why use hierarchical histograms

Hierarchical histograms are just one of many mechanisms that have been proposed in the literature (see the surveys of [Hay10, QardajiYL13]). We choose to implement hierarchical histograms for two reasons:

  • Data-obliviousness. Hierarchical histogram mechanisms are oblivious to the data. They only need to know the quantization, which is public knowledge in our setting. This makes them particularly suitable for exploratory data analysis. We do not need any expensive processing of the data to compute quantization boundaries, which several data dependent mechanisms require.

  • Error guarantees. A systematic comparative study of various data-dependent and data-independent mechanisms was performed by [dpbench]. They found that for large datasets, the $H_b$ mechanism typically results in less error than any other mechanism.

Optimizing the shape of the tree

Changing the topology of the tree yields different tradeoffs between accuracy and privacy [QardajiYL13]. The shallower the tree, the lower the sensitivity of the resulting synopsis, hence the less noise we need to add per node of the tree. But then the number of nodes we need to sum might be large for some ranges, which means those queries produce noisier results. The tradeoffs between these parameters have been studied extensively in [QardajiYL13]. The best choice of the branching factor $b$ depends on what type of queries one wishes to optimize for.

Setting privacy parameters

The simplest option for the curator is to rely on the basic composition theorem [salil-survey, Lemma 2.3], which states that the privacy leakage adds up across mechanisms. Hence, for range queries of dimensionality $d$ the curator might specify a value $\epsilon_d$, such that $\sum_d \epsilon_d = \epsilon$, the total budget for the table. The curator may then partition each $\epsilon_d$ (uniformly or otherwise) across the mechanisms that answer $d$-dimensional range queries. Overlook’s privacy policy allows the curator to specify the value of $\epsilon$ for each set of columns independently. A curator with greater expertise in differential privacy may take advantage of advanced composition theorems [salil-survey] to optimize the choice of $\epsilon$ for a table.
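Under basic composition, dividing a total budget among released column sets is simple arithmetic. The sketch below is illustrative only: `split_budget`, the column names, and the weights are hypothetical and do not reflect Overlook's policy format.

```python
def split_budget(total_epsilon, column_sets, weights=None):
    """Divide a total budget across column sets in proportion to weights.

    With basic composition the per-set budgets add up, so the shares are
    chosen to sum to total_epsilon. Uniform weights are used by default.
    """
    if weights is None:
        weights = [1.0] * len(column_sets)
    if len(weights) != len(column_sets):
        raise ValueError("need exactly one weight per column set")
    total_weight = sum(weights)
    return {
        columns: total_epsilon * weight / total_weight
        for columns, weight in zip(column_sets, weights)
    }

# Two 1-D histograms and one 2-D heat map, with more budget on the 1-D views.
budgets = split_budget(
    1.0,
    [("age",), ("state",), ("age", "state")],
    weights=[2, 2, 1],
)
```

A curator could recompute such a split interactively and inspect the resulting charts before publishing the dataset.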

4 Virtual Synopses

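Input: range $R$ for which to compute private count, branching factor $b$, domain size $N$, privacy level $\epsilon$, column index $c$, true count $x$, PRF key $k$
      Output: noisy count

Set scale $s = \log_b N / \epsilon$
Set $I$ = B-adicDecomposition($R$, $b$)
Set $\eta = 0$
for $v \in I$ do
     $\eta = \eta$ + Laplace sample with scale $s$ derived from $F_k(v, c)$
end for
return $x + \eta$
Algorithm 1 Overlook virtual synopsis.

Input: interval $R = [l, r)$; branching factor $b$
Output: nodes in tree corresponding to $R$, indexed by (start, interval size)

function B-adicDecomposition(R, b)
     nodes = empty list
     if $l \ge r$ then
          return nodes
     end if
     while $l < r$ do
         lsb = $\infty$
         if $l \ne 0$ then
               lsb = largest power of $b$ that divides $l$
         end if
         rem = largest power of $b$ that fits in remaining interval $[l, r)$
         nodeSize = min(lsb, rem)
         append ($l$, nodeSize) to nodes
         $l = l$ + nodeSize
     end while
     return nodes
end function
Algorithm 2 Computing the $b$-adic decomposition for a range.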
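One way to realize the decomposition of Algorithm 2 in runnable form is sketched below. It assumes half-open integer intervals over the quantized domain and returns nodes as (start, size) pairs; the function name and the loop structure are illustrative rather than taken from the Overlook source.

```python
def b_adic_decomposition(left, right, b):
    """Split [left, right) into aligned intervals whose sizes are powers of b.

    Each returned (start, size) pair names one tree node: size is a power
    of b and start is a multiple of size.
    """
    if b < 2:
        raise ValueError("branching factor must be at least 2")
    nodes = []
    while left < right:
        # Largest power of b that fits in the remaining interval.
        size = 1
        while size * b <= right - left:
            size *= b
        # Shrink until the node is aligned, i.e. size divides the start.
        while left % size != 0:
            size //= b
        nodes.append((left, size))
        left += size
    return nodes

nodes = b_adic_decomposition(3, 11, 2)
```

For the range [3, 11) with `b = 2` this yields four nodes of sizes 1, 4, 2, and 1, instead of the eight leaf buckets a flat histogram would sum.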

An important requirement for releasing a private synopsis is that random noise is added once, when the synopsis is constructed, and must not be resampled on future queries to the synopsis. For the hierarchical histogram mechanism, this requirement naïvely would mean that Overlook would have to store a random sample for every node in the synopsis tree, a storage overhead that grows linearly in the size of the domain.

Our solution is to use a cryptographically secure pseudo-random function (PRF) $F$. Informally, a PRF guarantees that, given a small random key $k$ in the key space $\mathcal{K}$, $F_k$ will be indistinguishable from a truly random function to a computationally-bounded adversary. (See e.g. [bonehshoup] for a formal definition.)

In our setting, the inputs to $F$ are nodes in the hierarchical histogram synopsis for a given column, each of which corresponds to a contiguous interval in the underlying domain. $F$ takes as input a node index $v$, a column index $c$, and the key $k$ associated with a table, and returns a uniformly distributed random sample $u = F_k(v, c)$, which we then transform to a sample from a Laplace variable with the appropriate scale $s$. (Footnote 3: Multi-dimensional histograms are also assigned unique indexes, which can then be used with the PRF in the same manner; the general PRF for a node $(v_1, \dots, v_d)$ in a $d$-dimensional histogram with index $c$ is $F_k(v_1, \dots, v_d, c)$.)

Using the PRF, we are able to reduce the storage cost of the synopsis from linear in the domain size to a small constant – in fact, only the 32 bytes required to store the key associated with a given table.

We incorporate the PRF into the hierarchical histogram mechanism as follows. A range can be decomposed into a minimal set of internal nodes in the synopsis tree, each of which corresponds to a sub-interval of the range. For each such node, we can use the PRF to compute the random noise corresponding to that interval. Then the private count for the interval is its true count plus this noise. (The scale of the noise depends on the size of the domain.)

Algorithm 1 describes the synopsis algorithm in detail. For completeness, we also describe the algorithm for computing a b-adic decomposition of an interval in Algorithm 2.

Because the PRF is ultimately deterministic (though indistinguishable from random), this algorithm satisfies the requirement that queries to the synopsis will not require resampling noise, although we have not explicitly stored any samples.
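The following Python sketch illustrates how keyed, deterministic noise can stand in for stored samples. It rests on several assumptions that are not from the paper: HMAC-SHA256 from the Python standard library is swapped in for the AES-256 PRF that Overlook uses, the input encoding (column ID, node start, node size) is hypothetical, and the tree nodes covering the range are supplied by the caller.

```python
import hashlib
import hmac
import math
import struct


def prf_uniform(key, column_id, start, size):
    """Map a keyed PRF output to a double strictly inside (0, 1).

    HMAC-SHA256 is a stand-in here; the paper's system uses AES-256.
    """
    msg = struct.pack(">QQQ", column_id, start, size)
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    bits = int.from_bytes(digest[:8], "big") >> 12  # keep 52 bits
    return (bits + 0.5) / (1 << 52)


def laplace_from_uniform(u, scale):
    """Invert the CDF of a zero-mean Laplace distribution at u."""
    if u < 0.5:
        return scale * math.log(2.0 * u)
    return -scale * math.log(2.0 * (1.0 - u))


def noisy_range_count(cell_counts, nodes, key, column_id, scale):
    """Sum true counts plus per-node PRF noise over the given tree nodes."""
    total = 0.0
    for start, size in nodes:
        true_count = sum(cell_counts[start:start + size])
        u = prf_uniform(key, column_id, start, size)
        total += true_count + laplace_from_uniform(u, scale)
    return total
```

Calling `noisy_range_count` twice with the same key, column, and nodes returns the same value, which is the property the synopsis needs; a real deployment would draw the 32-byte key from a secure random source and keep it secret.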

This use of random number generators to save space is similar in spirit to the use of such generators in streaming algorithms [AMS96].

Cryptographic security

It is important to note that the PRF used to generate random samples must be cryptographically secure. If one can somehow reverse-engineer the generator and compute the noise added to a count, then one can subtract it from the released result and obtain the true count. A cryptographic PRF ensures that, even if an adversary knows the noise for some nodes, perhaps because they know the corresponding true counts as auxiliary information, the adversary still cannot efficiently compute the noise for a new node and column. This additionally requires that the key associated with a table be stored securely on the root node.

Section 5.2.1 further describes the implementation of virtual synopses in Overlook.

5 Implementation

We have implemented Overlook by extending the open-source Hillview [Budiu2019] visualization system. However, to make the case that a system like Overlook could be implemented as an agent between any suitably powerful UI and a generic database, we have extended Hillview with a custom back-end that generates SQL queries in the SQL dialect of MySQL. We then have added the Overlook differential privacy layer on top of both these back-ends: the Hillview in-memory database and MySQL. For both cases the adaptations required were minimal; the MySQL engine is completely unmodified. We describe them in the following sections.

In this section, we describe implementation details for the UI (§ 5.1), privacy interposition layer (§ 5.2), and Hillview and MySQL backends (§ 5.3).

5.1 User interface

For data interaction and presentation we reuse the UI of Hillview [hillview]. This UI is written in TypeScript and runs in any modern web browser. To support differentially private visualizations we had to modify a few hundred lines of code. The most significant changes are (1) the data presentation of uncertainty (confidence intervals) that is inherent in differentially private results, and (2) the curator interface, which enables the curator to edit the privacy parameters interactively. We believe that displaying the confidence intervals significantly enhances the usefulness of a visualization tool.

5.2 Privacy layer

The privacy layer of Overlook is implemented in Java. It is sandwiched between the web server layer and the query generation and execution layer. When a new dataset is opened, the privacy layer checks for the existence of related privacy metadata to decide whether a data source should be treated as differentially private.

5.2.1 PRFs for virtual synopses

The virtual synopsis described in Section 4 uses a PRF to generate noise. In practice, we use AES-256 as the PRF. The root node stores one AES key per table, which ensures that no two tables are released using the same PRF. In the same vein, columns and pairs of columns are labeled with unique and immutable IDs so that no two synopses within a table share random samples.

On receiving an interval query, the root computes unique IDs of the nodes in the synopsis corresponding to this interval. To generate a new Laplace sample for an interval, the root uses the interval ID and column ID as input to AES to generate random bits, which can then be transformed into a Laplace sample using standard methods for converting bits to doubles and then inverting the Laplace CDF. (Footnote 4: We note that we do not currently implement the snapping mechanism described in [ilya-dp], but this is not fundamental to our system design and can be incorporated in the future.)

5.2.2 Confidence intervals

To approximate the confidence interval at a given confidence level for a sum of Laplace variables, we sample the corresponding distribution and return the matching percentile value. Naïvely, this operation would be performed for every bucket on every histogram query. However, we observe that our synopsis guarantees that every interval will be the sum of a bounded number of random variables, where the bound depends on the dimension of the histogram and the size of the domain. Therefore, the confidence intervals, once computed, can be stored in a cache whose size is at most this bound. Moreover, the confidence intervals are added as a postprocessing step independent of the raw data, and therefore need not be computed securely; the cache can be shared across columns and tables.
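A minimal sketch of this sampling-plus-caching idea in Python follows. The function name, the sample count, the fixed seed, and the choice to report a symmetric half-width are assumptions made for illustration; the paper does not specify them. Because the estimate depends only on the number of summed variables and the noise scale, an ordinary (non-cryptographic) generator is used.

```python
import functools
import random


@functools.lru_cache(maxsize=None)
def ci_half_width(num_vars, scale, confidence=0.95, samples=20000, seed=0):
    """Monte Carlo estimate of the half-width of a symmetric confidence
    interval for a sum of num_vars independent Laplace(0, scale) variables.
    Results are cached per argument tuple."""
    rng = random.Random(seed)
    rate = 1.0 / scale
    magnitudes = []
    for _ in range(samples):
        total = 0.0
        for _ in range(num_vars):
            # The difference of two i.i.d. exponentials is Laplace-distributed.
            total += rng.expovariate(rate) - rng.expovariate(rate)
        magnitudes.append(abs(total))
    magnitudes.sort()
    index = min(samples - 1, int(confidence * samples))
    return magnitudes[index]
```

A second call such as `ci_half_width(4, 1.0)` is served from the cache, so the sampling cost is paid once per distinct number of summed variables.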

5.2.3 Privacy policies

Overlook stores privacy policies in JSON format at the root node. Any user or curator can query the privacy policy, but only the curator is allowed to modify it.

5.2.4 Query rewriting

The query rewriting layer receives queries from the UI and rewrites them to operate on quantized data. (The exception is the “distinct count” query, which operates directly on the raw data, and not on the quantized data.) We give a concrete example of such query rewriting in Section 5.3.2, where we describe the implementation of Overlook using a traditional SQL database. Recall that the quantization parameters for a column are established by the data curator. The quantization information describes a range of intervals for the data values; data that falls outside all the quantization intervals is treated as if it is a NULL value.

5.2.5 Adding noise to results

When the root receives the complete counts for the base histogram, it queries the virtual synopsis for the noise to add to each bucket and adds this noise to the histogram before returning it to the UI.

5.3 Backends

Overlook’s operation can be adapted to use any back-end that supports a rich enough query language to compute standard histograms. We demonstrate this by describing how it operates over two different back-ends: the Hillview back-end, and one that operates on top of MySQL. We are interested in highlighting the additional effort required for adding privacy on top of an existing SQL-based query engine. In this section we describe how this is done for the case of histogram queries.

5.3.1 Hillview backend

Hillview is a MapReduce-like distributed query engine that implements vizketches – mergeable sketches for visualization. It implements a number of data-parallel aggregation tasks suitable for visualization, including those used in Overlook. Hillview is described in additional detail in [Budiu2019].

The only change required to support privacy-related processing is to adapt all the existing sketches to first quantize the columns that they operate on according to the appropriate privacy policy. No other changes were required in the backend. This change amounts to less than 10 lines of code in each sketch.

5.3.2 MySQL backend

Overlook can interface with unmodified, existing database backends; we have implemented one such backend in MySQL. In this section, we describe some of the queries implemented in order to support the Overlook UI.

Numeric histograms

Consider a user request to display the data in a column C as a (non-private) histogram with b buckets. Let us assume first that C is a numeric column in table t. This kind of visualization is executed using SQL in two stages: (1) the range of the data in the column is computed, and (2) the histogram is built. To obtain the range of the data we generate the following query:

SELECT min(C), max(C), count(*), count(C) FROM t

This query computes the minimum and maximum values in column C, and also the total number of elements and the number of non-null elements.

The UI receives these parameters and decides on a range [l, r] of data and on a number of buckets b to display (in some cases the UI does not need to issue any other query, for example when all elements are NULL, or when l=r). The query to compute a histogram is written as:

SELECT bucket, COUNT(bucket) FROM (
  SELECT CAST(FLOOR((C - l) * scale)
     AS UNSIGNED) AS bucket
  FROM t
  WHERE C BETWEEN l AND r) AS buckets -- MySQL requires an alias for a derived table
GROUP BY bucket

scale=b/(r-l) is computed statically before the query is generated.
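To make the bucket arithmetic concrete, here is a small in-memory Python reference for the same computation. It is only a sketch: the function name is hypothetical, and it mirrors the query as written, including the edge case that a value exactly equal to r maps to bucket index b.

```python
import math
from collections import Counter


def numeric_histogram(values, l, r, b):
    """Count values per bucket, mirroring FLOOR((C - l) * scale)."""
    scale = b / (r - l)  # computed once, before scanning the data
    counts = Counter()
    for v in values:
        # NULLs and out-of-range values are skipped, like the WHERE clause.
        if v is None or not (l <= v <= r):
            continue
        counts[int(math.floor((v - l) * scale))] += 1
    return dict(counts)
```

The case l = r is not handled here; as the text notes, the UI does not issue a histogram query in that situation.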

Quantized data view

Since all private queries operate over quantized data, one option is to pre-compute and materialize a view where all columns are quantized using the curator-specified quantization intervals. Such a query can be generated automatically by the system once the privacy policy has been set. For example, to create a view QV of a table with a single numeric column C with equal-sized quantization intervals of size g between qmin and qmax one can issue the following query:

CREATE view QV as
  (SELECT qmin + FLOOR((C-qmin)/g)*g AS C
   FROM t WHERE C between qmin AND qmax)
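The rounding performed by this view can be sketched in Python as follows. The function name is hypothetical, and returning None for out-of-range values follows the earlier statement that such data is treated as NULL (the view above instead filters those rows out).

```python
import math


def quantize(value, qmin, qmax, g):
    """Round value down to the start of its quantization interval.

    Returns None for NULL input or for values outside [qmin, qmax].
    """
    if value is None or not (qmin <= value <= qmax):
        return None
    return qmin + math.floor((value - qmin) / g) * g
```

For example, with qmin = 0, qmax = 100, and g = 10, the value 37 is reported as 30.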
Private numeric histograms

For the case of a private numeric column the general flow is very much as described in the previous section; there are two changes: (1) the query is executed over the quantized view, and (2) after the histogram is computed noise is added to each bucket. Let us assume that we are quantizing the data to be within the range qmin and qmax with a granularity g.

The complete query that is executed is:

-- compute histogram
SELECT bucket, COUNT(bucket) FROM (
  -- compute buckets
  SELECT CAST(FLOOR((C - l) * scale)
     AS UNSIGNED) AS bucket
  FROM QV -- quantized view
  WHERE C BETWEEN l AND r) AS buckets -- MySQL requires an alias for a derived table
GROUP BY bucket
String histograms

Computing (non-private) histograms over a categorical column is a bit more involved because the UI never displays a large number of histogram buckets (more than can be shown on the screen). The first step in computing a histogram over a string column involves computing a set of distinct quantiles over the column. For example, if the screen can accommodate 50 columns, then the UI will first issue a query to sort the distinct values in the column and extract 50 equi-distant values from the sorted set. These 50 values will be used as histogram bucket boundaries. If the column has fewer than 50 distinct values then all values will be used as distinct bucket boundaries.


(One has to be careful with the sorting and comparison order: these have to be consistent between the code that computes the buckets and the database code that performs comparisons and sorting. In our case we had to prevent MySQL from doing default case-insensitive string comparisons in order to obtain consistent results — this is why we used the BINARY keyword. We will omit it from the subsequent queries.)
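The distinct-quantile step can be illustrated with a short Python sketch. The function name and the exact rule for picking evenly spaced positions are assumptions; Python's default string ordering compares code points, which is case-sensitive and so plays the same role as the binary comparison discussed above.

```python
def distinct_quantiles(values, max_buckets):
    """Pick up to max_buckets evenly spaced values from the sorted
    distinct non-NULL values, for use as bucket boundaries."""
    distinct = sorted({v for v in values if v is not None})
    if len(distinct) <= max_buckets:
        return distinct
    step = len(distinct) / max_buckets
    return [distinct[int(i * step)] for i in range(max_buckets)]
```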

To compute a histogram quickly over a set of explicit string buckets the Java code generates an explicit binary search tree using nested SQL IF expressions. For example, to build a histogram with buckets separated by strings ‘A’, ‘G’, ‘M’, and ‘Z’ it generates the following query:

SELECT bucket, COUNT(bucket) FROM (
  SELECT (IF(C<'G',0,IF(C<'M',1,2))) AS bucket
  FROM t) AS buckets
GROUP BY bucket
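The nested IF expression can be generated mechanically from the sorted boundaries. The Python sketch below is one possible way to do it, not the authors' Java code; the function names are hypothetical, and the quoting helper only doubles single quotes, so production code should rely on the database driver's escaping instead.

```python
def sql_quote(text):
    """Quote a string literal by doubling embedded single quotes."""
    return "'" + text.replace("'", "''") + "'"


def bucket_expr(column, boundaries, lo=0, hi=None):
    """Build a nested IF expression that binary-searches the boundaries.

    boundaries is sorted; bucket i covers boundaries[i] up to (but not
    including) boundaries[i + 1], so there are len(boundaries) - 1 buckets.
    """
    if hi is None:
        hi = len(boundaries) - 2  # index of the last bucket
    if lo == hi:
        return str(lo)
    mid = (lo + hi - 1) // 2
    pivot = boundaries[mid + 1]  # upper boundary of bucket mid
    left = bucket_expr(column, boundaries, lo, mid)
    right = bucket_expr(column, boundaries, mid + 1, hi)
    return f"IF({column}<{sql_quote(pivot)},{left},{right})"
```

With boundaries 'A', 'G', 'M', and 'Z' this produces the same expression as the query above, and the depth of nesting grows logarithmically with the number of buckets.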
Private string histograms

Finally, computing private histograms requires modifying the query for string histograms in a way similar to numeric private histograms, by quantizing the data in the column first. A quantization policy for a string column is given essentially by a sorted list of strings. The quantization query also makes use of a binary search tree. Let’s assume that our quantization boundaries are ‘A’, ‘F’, ‘N’, ‘O’ and ‘Z’. The quantization query is:

CREATE VIEW QV AS
  (SELECT IF(C<'N',IF(C<'F','A','F'),
                 IF(C<'O','N','O')) AS C
   FROM t)

The query to compute a histogram over a quantized view is the composition of these two queries:

SELECT bucket, COUNT(bucket) FROM (
  SELECT (IF(C<'G',0,IF(C<'M',1,2))) AS bucket
  FROM QV -- query the quantized view
  ) AS buckets
GROUP BY bucket

6 Experience

Since we have built Overlook on top of the UI of an existing visualization tool, we can make a direct comparison of the user experience for traditional and differentially-private visualization. In this section, we describe some notable differences between these user experiences.

Browsing individual data items

The most conspicuous difference is that many operations that are natural in a normal visualization are unavailable when doing differentially-private visualization. For example, enumerating the rows of a table is something that cannot be done in a differentially-private way. Traditional visualization systems can be used for two purposes: detecting trends and identifying outliers. A differentially-private visualization system can only be used for the first purpose: differential privacy masks rare events.

Displaying uncertain values

A second difference is that all counts that are displayed in a differentially-private visualization are noisy. This can be interpreted as displaying a value with uncertain range. Although there is substantial work on the visualization of uncertain values, from our experience the interpretation of confidence intervals requires a sophisticated understanding from users. While confidence intervals are a useful tool to visualize uncertainty, they do not prevent spurious high counts that users might confuse for signal. This phenomenon has already been observed by [zhang-dpw16].

Uncertainty in heat maps
Figure 6: An example of a confidence interval (dotted box) overlaid on a heat map legend. When a user hovers over a heat map cell, Overlook highlights the confidence interval for the box on the legend.
Figure 7: Whitening to convey uncertainty. The color scale of the heat map conveys the value in each cell; whitening adds an additional axis of color that can be used to convey uncertainty. (7(a)) Raw histogram with no noise or whitening. (7(b)) Histogram with noise and no color adjustment. (7(c)) Whitening added to raw histogram on fine-grained bins. Each bin has a small count relative to the standard deviation of the added noise, which is conveyed through the whiteness of the chart. (7(d)) Whitening added to raw histogram on coarse-grained bins. Each bin now has a larger count relative to the standard deviation of the data, so less whitening is applied.

While uncertainty for histograms can intuitively be presented as a range around each count, it is less clear how uncertainty should be displayed in a heat map. We prototyped multiple possible solutions to this problem.

Figure 6 shows a heat map legend in Overlook that displays a confidence interval. When the user hovers over a cell, Overlook highlights the confidence interval in the legend.

In addition, however, we would like to visually convey the confidence in each cell on the chart itself. One way to achieve this, shown in Figure 3, is to suppress any values whose count is smaller than a multiplicative factor of its confidence interval. As a result, only counts that are likely to be informative are displayed.

Another idea is to use the whiteness of the image to convey uncertainty. In Figure 7, we demonstrate a prototype of such a plot. The raw color scale is used to convey the count in each cell, and the whiteness provides an additional dimension that can be used to convey the amount of certainty a user should have in the visualization.

Quantization intervals

The quantization intervals, especially for categorical data, have a huge impact on the information that is conveyed to the user. As an example, we show in Figure 8 several histograms of the exact same dataset on the column “cities” with different quantization intervals. The first histogram has on the X axis only the cities that actually appear in the dataset, sorted alphabetically, so the distribution of cities into buckets is quite different. The last histogram allows the user to zoom in further and explore the distribution of data for each letter pair; the additional structure is visible in the CDF, which is more fine-grained (it has more steps).

Figure 8: Histogram with up to 26 buckets of a set of cities quantized in different ways: (upper) public histogram with adaptively chosen bin boundaries; (lower) quantization fixed to two-letter boundaries. The shape of the histogram changes according to the choice of bins.
Query resolution

The amount of noise added to a histogram or heat map bucket on a set of columns depends on two primary factors: the extent of the bucket (the set of quantization intervals that fall in the bucket), and the privacy budget allocated for the set of columns. The noise does not depend on the actual data distribution; however, the relative noise added does depend on the number of data items that fall into the bucket. So there is a trade-off between the resolution of the query and its precision: if we make buckets smaller, we can potentially see more detail in the data, but the relative noise will be higher. If we make buckets larger, we lose resolution but gain precision. There is no obvious choice in this trade-off, since it depends very much on the data distribution. This is a trade-off that the data curator can explore and to some degree control by choosing the privacy budget and the quantization intervals for each column.

Outliers or sentinel values

In one database we encountered a date column that used a year value of 9999 to indicate that an event has not happened yet. Overlook in general releases counts for NULL values separately from the hierarchical histogram synopsis, as NULL will not be part of any range query. In contrast, this sentinel value would be naïvely included in the displayed histogram if the specified public date range included values up to 9999.

7 Evaluation

In this section, we evaluate the design decisions made in Overlook to support our claims that the system:

  • allows data curators to quickly explore parameter settings for synopses before data release (§ 7.1),

  • implements a synopsis that provides accuracy comparable to state-of-the-art methods (§ 7.2),

  • achieves significantly lower storage cost than the synopses implemented in prior systems through the use of a pseudorandom function (§ 7.3), and

  • retains the scaling properties of the underlying distributed system with low performance overhead from privacy (§ 7.4).

In addition, we demonstrate that, in the visualization setting, the error induced by differential privacy can be smaller than a pixel on the screen – so that, with high probability, the user loses no utility compared to the raw visualization (§ 7.5).

Evaluation setup

Local experiments were run on a machine with 16 GB of memory and 4 cores using an Intel i7 processor. Cloud experiments were run on an Amazon EC2 cluster of 15 machines with 8 GB of RAM and 2 cores each. The Hillview backend uses Java 8.

7.1 Synopsis generation overhead

One benefit of Overlook is that it allows the data curator to quickly explore privacy parameters for the data before it is released. In particular, the data curator might change the data quantization or privacy budget for any column. Generating example visualizations with the new parameters then requires recomputing the underlying synopsis.

We use DPBench [dpbench] to evaluate the time required to generate a synopsis with the hierarchical histogram mechanism against the time required for comparable synopses. We stress that these times are not trivially comparable as DPBench is primarily an accuracy benchmark that is not optimized for performance.

(a) Time required to generate synopses as the domain size increases.
(b) Generation time for faster synopses.
Figure 9: Time required to generate synopses using various mechanisms, benchmarked using DPBench. MWEM and DAWA dominate in the first plot; the second plot shows that generating hierarchical histograms scales in the domain size, when not using Overlook’s PRF-based construction.

We evaluate each method on a one-dimensional all-zeros dataset of increasing size, on a workload of all intervals (the workload that Overlook targets). These times do not include the additional time required to compute the base histogram of counts over which the synopses are computed. We evaluate seven mechanisms in the literature: the baseline “identity” mechanism [dp], the binary hierarchical histogram [binary-mechanism, Hay10], the hierarchical histogram with adaptive branching [Hay10], DAWA [dawa], MWEM [mwem], Privelet [privelet], and StructureFirst [xu2013differentially].

Figure 9 shows the results of the benchmark. Figure 9(a) shows that MWEM and DAWA are by far the most expensive algorithms, followed by StructureFirst. The remaining algorithms run in under one second, so we plot these separately in Figure 9(b). While the time required for the hierarchical mechanisms scales linearly in the data size, they are still considerably less expensive to compute than more complicated workload-aware synopses.

In fact, Overlook itself does not instantiate the synopsis and computes the noisy values on the fly, so the synopsis adds no precomputation overhead but does add some overhead at query time. However, this overhead scales with the number of histogram buckets in the query, rather than the size of the domain or dataset. We benchmark Overlook query overhead in Section 7.4.

7.2 Synopsis accuracy

(a) Histogram accuracy. Most mechanisms perform comparably on the flights dataset.
(b) Heatmap accuracy. The baseline identity mechanism outperforms all others; the hierarchical mechanism nevertheless achieves reasonable accuracy.
Figure 10: Error of the Overlook mechanism on the U.S. flights dataset on 5000 randomly-sampled queries per column or pair of columns.

Overlook uses hierarchical histograms as the underlying synopsis. The primary benefit of the hierarchical histogram is that the error scales logarithmically, rather than linearly, in the size of the underlying dataset. However, more complex optimization procedures [matrix-mechanism, dawa] may yield even better accuracy.

In this section, we demonstrate empirically that Overlook’s histogram mechanism achieves utility comparable to that of state-of-the-art synopses. These results are supported by prior work [QardajiYL13, qardaji2013differentially, Hay10] that also investigates the empirical accuracy of hierarchical histograms. More complex methods work well for skewed or restricted query workloads, but Overlook benefits from simple mechanisms because it aims to support a very general set of range queries.

We benchmark accuracy on a dataset of 20 years of U.S. flights [ontime]. This dataset contains both numeric and categorical columns over a range of data distributions and domain sizes (varying from 7 to over 4000). Figure 10 shows results for histograms on all columns and heat maps on a selection of columns. For each bar, we sample 5000 random intervals or rectangles and compute the distance between the vector of true counts for all samples and the vector of noisy counts returned by the mechanism.

The key takeaway from these figures is that the hierarchical histogram mechanism has comparable accuracy to mechanisms that perform more complex, workload-specific optimization on the random-intervals workload. As noted in [QardajiYL13], the benefits of this mechanism decrease as the dimension increases. Adaptively choosing which mechanism to use for a given visualization may be a direction for future work.

7.3 Synopsis memory overhead

Overlook’s synopsis mechanism has only 32 bytes of memory overhead required to store the AES secret key.

The synopsis mechanisms implemented by DPBench are consistent mechanisms. These take as input a histogram over the elements of the domain and output a synthetic histogram as the synopsis over which all subsequent queries are run. The size of the synopsis is therefore proportional to the size of the data domain for a histogram, and grows exponentially in the number of dimensions. (In particular, if the data is small or especially sparse in the domain, the size may be considerably larger than necessary to represent the data.)

We note that these mechanisms may require a considerably larger amount of memory at computation time. For example, DAWA in two dimensions requires instantiating a matrix representation of the workload [dawa]. For the workload that Overlook supports (the all-queries workload), this matrix is large: a relatively small domain with 1000 quantization intervals would require at least 8 gigabytes of memory simply to compute the synopsis.

7.4 Overlook performance

In this section, we benchmark the performance of Overlook and demonstrate that adding differential privacy does not substantially slow down the system or change its underlying scaling properties. In particular, the overall slowdown from privacy is no greater than 2.5×.

7.4.1 Slowdown relative to public data

(a) Histogram slowdown.
(b) Heatmap slowdown.
Figure 11: Slowdown relative to raw (non-private) databases for histograms and heat maps. In all cases, privacy adds at most a 2.5× performance penalty.

We first evaluate how much differential privacy causes queries to slow down relative to queries on public data. In order to understand the slowdown, we make two measurements for each backend: first, the time required to quantize the dataset, and second, the time required to answer a quantized histogram query with noise added.

Figure 11 shows the average slowdown when plotting histograms and heat maps on the U.S. flights dataset using both the Hillview and MySQL backends. The slowdown is below 2.5× for all configurations. In all cases, the majority of the slowdown is a result of the quantization step. This is intuitive: where each data point would initially have required one operation to add it to the appropriate bucket, quantization adds an additional operation to round the point to its nearest value in the public column domain.

7.4.2 Scaling

The Overlook frontend can be used with any SQL backend. However, the Hillview distributed backend is powerful as it retains Hillview’s ability to scale to large datasets.

(a) Histogram scaling.
(b) Heat map scaling.
Figure 12: Average time to generate histograms for columns in the flights dataset as the number of machines grows. The data size grows with the number of machines, so the runtime remains constant.

We evaluate scaling using clusters of 1, 2, 4, 8, and 15 Amazon EC2 machines. The total dataset size is 58.2 GB, split equally among the machines in the cluster. The data size grows with the number of machines (Figure 12), so for linear scaling we expect the time required for each query to be roughly the same regardless of the number of machines. We measure the time to compute charts once the data is already in memory.

In Figure 12 we show our measurements that evaluate the time breakdown for computing histograms over the U.S. flights dataset. Each point corresponds to the total time required to compute a histogram or heat map for every column or pair of columns. The overhead of privacy is roughly the same 2× overhead as in Figure 11, but privacy does not change the scaling behavior of the system at all, as expected.

7.5 Visual error

At large enough data sizes, the error induced by differential privacy can be smaller than the pixel-level rounding error induced by the screen resolution. In particular, in a 1-dimensional histogram, Overlook rescales the y-axis to the maximum displayed value in order to make use of all of the available vertical pixels.

Given the number of vertical pixels and the maximum displayed value, each pixel represents a count equal to the maximum displayed value divided by the number of vertical pixels. Hence, a confidence interval smaller than this per-pixel count will be smaller than a pixel on the screen. Assuming the case of a single Laplace random variable added to each bucket (i.e. the “identity” mechanism), we ask what value of the privacy parameter ε would suffice to achieve this level of error.

The width of the confidence interval at a chosen confidence level follows from the inverse CDF of the Laplace distribution, and we would like this width to be at most the count represented by one pixel. Solving for ε yields an approximate solution: if the maximum count is approximately 2.3 times the number of vertical pixels, the corresponding privacy level will likely result in no visible difference from the raw data.
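The arithmetic behind the 2.3 factor can be checked with a short Python sketch. The specifics here are assumptions, since the formulas did not survive in this copy of the text: the noise is taken to be Laplace with scale 1/ε (a count with sensitivity 1), and the interval width is taken to be the 0.95 quantile of that distribution. Under those assumptions the 0.95 quantile equals ln(10)/ε, about 2.3/ε.

```python
import math


def laplace_quantile(q, scale):
    """Inverse CDF of a zero-mean Laplace distribution with given scale."""
    if q < 0.5:
        return scale * math.log(2.0 * q)
    return -scale * math.log(2.0 * (1.0 - q))


def min_epsilon_for_subpixel_error(max_count, vertical_pixels, q=0.95):
    """Smallest epsilon whose q-quantile of Laplace(1/epsilon) noise is at
    most the count represented by one pixel (assumed model, see lead-in)."""
    per_pixel = max_count / vertical_pixels
    # quantile(q, 1/eps) = quantile(q, 1) / eps <= per_pixel
    return laplace_quantile(q, 1.0) / per_pixel
```

With 1000 vertical pixels and a maximum count of 2300, `min_epsilon_for_subpixel_error(2300, 1000)` comes out very close to 1 under these assumptions.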

8 Related Work

Differentially private database management systems

A number of prior systems make differential privacy available through a SQL-like database API. PINQ [McSherry2009] implements a subset of SQL as well as a prototype visualization system with incremental ε-budgeting. PINQ additionally pointed out that joins have potentially unbounded sensitivity, and several later systems [ProserpioGM14, restricted-sensitivity, djoin, flex] propose methods for mitigating this issue.

Other work considers additional variants on SQL-like programming frameworks that allow developers to easily express differentially private queries. Airavat [airavat] allows users to run custom MapReduce queries on sensitive data by enforcing differential privacy on the queries. Ektelo [Zhang2018] exposes a number of higher-level operators as a programming framework for differentially private mechanisms. PrivateSQL [privatesql] uses synopsis-based mechanisms rather than incremental budgeting to release dataset views. PrivateSQL introduces the notion of view sensitivity as an approach to handle joins. While Overlook does not explicitly target joins, such techniques could naturally integrate with Overlook, as the join is ultimately materialized as a tabular view of the data. Chorus [chorus] implements differential privacy directly in SQL.

Most of these prior systems support a broad set of queries written directly in SQL or a SQL-like language. In contrast, Overlook restricts queries to those visualizations enabled by the UI, which enables the release of a flexible and small synopsis.

Synopsis-based mechanisms

A number of DP mechanisms [salil-survey] are designed to support all queries in a given class of queries simultaneously. Recent work on synopses includes the matrix mechanism [li2010optimizing, li2012adaptive, li2015matrix], wavelet transforms [privelet], and approaches that incorporate data-dependent partitioning [xiao2012dpcube, xu2013differentially, qardaji2013differentially, cormode2012differentially]. Overlook primarily relies on a hierarchical histogram [Hay10, binary-mechanism]. A growing body of work [QardajiYL13, Qardaji:2014, dpbench] additionally considers data-dependent optimizations that can improve the accuracy of synopses under certain query workloads.

A number of papers [QardajiYL13, qardaji2013differentially, Hay10] have investigated the accuracy of these methods in practice. These papers support our claim that the synopses used in Overlook give usable, and often optimal, accuracy in practice.

The idea of public quantization boundaries, or partitions, has been explored by PINQ [pinq] and FLEX [flex]. Both of these systems leave it to the data analyst, rather than the data curator, to specify the quantization boundaries.

Similar ideas are used to analyze streaming data, cf. [Ghayyur0YMHMM18], [Chen2017]. A data aware version of the binary mechanism is presented in [Acs:2012, dawa]. It uses a private partitioning method that smooths regions of similar count.

Differentially private visualization

PSI [psi] may be the closest system to Overlook; PSI makes ε-budgeting more user-friendly by providing a visual interface for users to interact with and understand the impact of various values of ε. In contrast to Overlook, PSI assumes a per-user, incremental budget; additionally, PSI is not targeted toward the data exploration use case.

PINQ [pinq] provides a case study of a differentially private map visualization as an application of the framework. [chen2016differentially] provides methods to make linear and logistic regression plots differentially private. VisDPT [visdpt] is an interface to view two-dimensional trajectories in a differentially private manner. [zhangchallenges] studies the challenges involved in creating meaningful visualizations under differential privacy.

9 Conclusion

We have presented Overlook, a visualization system for private data that provides interactive latencies both for data curators and data analysts. Overlook’s novel virtual synopsis enables it to scale to large data domains while incurring minimal performance and storage overhead over queries to raw data. Overlook can integrate with existing query engines with no intrusive changes. Overlook makes differential privacy accessible, useful, and performant, making it a practical privacy tool for the real world.