All data analysis fundamentally depends on a basic understanding of how the data is distributed. Many sophisticated data analysis and machine learning techniques are built on top of primitives that describe where data points are located, or what is the data density in a given region. That is, we need to provide accurate answers to estimates of the data density at a given point or within a range. Consequently, we need to ensure that such queries can be answered accurately under a variety of data access models.
This remains the case when the data is sensitive, comprised of the personal details of many individuals. Here, we still need to answer range queries accurately, but also meet high standards of privacy, typically by ensuring that answers are subject to sufficient bounded perturbations that each individual’s data is protected. In this work, we adopt the recently popular model of Local Differential Privacy (LDP). Under LDP, individuals retain control of their own private data, by revealing only randomized transformations of their input. Aggregating the reports of sufficiently many users gives accurate answers, and allows complex analysis and models to be built, while preserving each individual’s privacy.
LDP has risen to prominence in recent years due to its adoption and widespread deployment by major technology companies, including Google (Erlingsson et al., 2014), Apple (Differential Privacy Team, Apple, 2017) and Microsoft (Ding et al., 2017). These applications rely at their heart on allowing frequency estimation within large data domains (e.g. the space of all words, or of all URLs). Consequently, strong locally private solutions are known for this point estimation problem. It is therefore surprising to us that no prior work has explicitly addressed the question of range queries under LDP. Range queries are perhaps of wider application than point queries, from their inherent utility to describe data, through their immediate uses to address cumulative distribution and quantile queries, up to their ability to instantiate classification and regression models for description and prediction.
In this paper, we tackle the question of how to define protocols to answer range queries under strict LDP guarantees. Our main focus throughout is on one-dimensional discrete domains, which provides substantial technical challenges under the strict model of LDP. These ideas naturally adapt to multiple dimensions, as we discuss briefly as an extension. A first approach to answer range queries is to simply pose each point query that constitutes the range. This works tolerably well for short ranges over small domains, but rapidly degenerates for larger inputs. Instead, we adapt ideas from computational geometry, and show how hierarchical and wavelet decompositions can be used to reduce the error. This approach is suggested by prior work in the centralized privacy model, but we find some important differences, and reach different conclusions about the optimal way to include data and set parameters in the local model. In particular, we see that approaches based on hierarchical decomposition and wavelet transformations are both effective and offer similar accuracy for this problem; whereas, naive approaches that directly evaluate range queries via point estimates are inaccurate and frequently unwieldy.
Our contributions. In more detail, our contributions are as follows: We provide background on the model of Local Differential Privacy (LDP) and related efforts for range queries in Section 2. Then in Section 3, we summarize the approaches to answering point queries under LDP, which are a building block for our approaches. Our core conceptual contribution (Section 4) comes from proposing and analyzing several different approaches to answering one-dimensional range queries.
We first formalize the problem and show that the simple approach of summing a sequence of point queries entails error (measured as variance) that grows linearly with the length of the range (Section 4.2).
In Section 4.3, we consider hierarchical approaches, generalizing the idea of a binary tree. We show that the variance grows only logarithmically with the length of the range. Post-processing of the noisy observations can remove inconsistencies, and reduces the constants in the variance, allowing an optimal braching factor for the tree to be determined.
The last approach is based on the Discrete Haar wavelet transform (DHT, described in Section 4.6). Here the variance is bounded in terms of the logarithm of the domain size, and no post-processing is needed. The variance bound is similar but not directly comparable to that in the hierarchical approach.
Once we have a general method to answer range queries, we can apply it to the special case of prefix queries, and to find order statistics (medians and quantiles). We perform an empirical comparison of our methods in Section 5. Our conclusion is that both the hierarchical and DHT approach are effective for domains of moderate size and upwards. The accuracy is very good when there is a large population of users contributing their (noisy) data. Further, the related costs (computational resources required by each user and the data aggregator, and the amount of information sent by each user) are very low for these methods, making them practical to deploy at scale. We show that the wavelet approach is most accurate in high privacy settings, while the hierarchical approach dominates for weaker privacy requirements. We conclude by considering extensions of our scenario, such as multidimensional data (Section 6).
2. Related Work
Range queries. Primitives to support range queries are necessary in a variety of data processing scenarios. Exact range queries can be answered by simply scanning the data and counting the number of tuples that fall within the range; faster answers are possible by pre-processing, such as sorting the data (for one-dimensional ranges). Multi-dimensional range queries are addressed by geometric data structures such as - trees or quadtrees (Samet, 2005)
. As the dimension increases, these methods suffer from the “curse of dimensionality”, and it is usually faster to simply scan the data.
Various approaches exist to approximately answer range queries. A random sample of the data allows the answer on the sample to be extrapolated; to give an answer with an additive guarantee requires a sample of size (Cormode et al., 2012). Other data structures, based on histograms or streaming data sketches can answer one-dimensional range queries with the same accuracy guarantee and with a space cost of (Cormode et al., 2012). However, these methods do not naturally translate to the private setting, since they retain information about a subset of the input tuples exactly, which tends to conflict with formal statistical privacy guarantees.
Local Differential Privacy (LDP). The model of local differential privacy has risen in popularity in recent years in theory and in practice as a special case of differential privacy. It has long been observed that local data perturbation methods, epitomized by Randomized Response (Warner, 1965) also meet the definition of Differential Privacy (Dwork and Roth, 2014). However, in the last few years, the model of local data perturbation has risen in prominence: initially from a theoretical interest (Duchi et al., 2013), but subsequently from a practical perspective (Erlingsson et al., 2014). A substantial amount of effort has been put into the question of collecting simple popularity statistics, by scaling randomized response to handle a larger domain of possibilities (Differential Privacy Team, Apple, 2017; Ding et al., 2017; Bassily et al., 2017; Wang et al., 2017). The current state of the art solutions involve a combination of ideas from data transformation, sketching and hash projections to reduce the communication cost for each user, and computational effort for the data aggregator to put the information together (Bassily et al., 2017; Wang et al., 2017).
Building on this, there has been substantial effort to solve a variety of problems in the local model, including: language modeling and text prediction (Chang and Thakurta, 2018); higher order and marginal statistics (Zhang et al., 2018; Fanti et al., 2016; Cormode et al., 2018); social network and graph modeling (Gao et al., 2018; Qin et al., 2017); and various machine learning, recommendation and model building tasks (Wang et al., 2019; Duchi et al., 2013; Nguyên et al., 2016; Zheng et al., 2017; Shin et al., 2018) However, among this collection of work, we are not aware of any work that directly or indirectly addresses the question of allowing range queries to be answered in the strict local model, where no interaction is allowed between users and aggregator.
Private Range queries. In the centralized model of privacy, there has been more extensive consideration of range queries. Part of our contribution is to show how several of these ideas can be translated to the local model, then to provide customized analysis for the resulting algorithms. Much early work on differentially private histograms considered range queries as a natural target (Dwork et al., 2006; Hardt et al., 2012). However, simply summing up histogram entries leads to large errors for long range queries.
Xiao et al. (Xiao et al., 2011) considered adding noise in the Haar wavelet domain, while Hay et al. (Hay et al., 2010) formalized the approach of keeping a hierarchical representation of data. Both approaches promise error that scales only logarithmically with the length of the range. These results were refined by Qardaji et al. (Qardaji et al., 2013a), who compared the two approaches and optimized parameter settings. The conclusion there was that a hierarchical approach with moderate fan-out (of 16) was preferable, more than halving the error from the Haar approach. A parallel line of work considered two-dimensional range queries, introducing the notion of private spatial decompositions based on - trees and quadtrees (Cormode et al., 2012). Subsequent work argued that shallow hierarchical structures were often preferable, with only a few levels of refinement (Qardaji et al., 2013b).
3. Model And Preliminaries
3.1. Local Differential Privacy
Initial work on differential privacy assumed the presence of a trusted aggregator, who curates all the private information of individuals, and releases information through a perturbation algorithm. In practice, individuals may be reluctant to share private information with a data aggregator. The local variant of differential privacy instead captures the case when each user only has their local view of the dataset (typically, they only know their own data point ) and she independently releases information about her input through an instance of a DP algorithm. This model has received widespread industrial adoption, including by Google (Erlingsson et al., 2014; Fanti et al., 2016), Apple (Differential Privacy Team, Apple, 2017), Microsoft (Ding et al., 2017) and Snap (Pihur et al., 2018)
for tasks like heavy hitter identification (e.g., most used emojis), training word prediction models, anomaly detection, and measuring app usage.
In the simplest setting, we assume each participant has an input drawn from some global discrete or continuous distribution over a domain . We do not assume that users share any trust relationship with each other, and so do not communicate amongst themselves. Implicitly, there is also an (untrusted) aggregator interested in estimating some statistics over the private dataset .
Formal definition of Local Differential Privacy (LDP) (Kasiviswanathan et al., 2011). A randomized function is -locally differentially private if for all possible pairs of and for every possible output tuple in the range of :
This is a local instantiation of differential privacy (Dwork and Roth, 2014), where the perturbation mechanism is applied to each data point independently. In contrast to the centralized model, perturbation under LDP happens at the user’s end.
3.2. Point Queries and Frequency Oracles
A basic question in the LDP model is to answer point queries on the distribution: to estimate the frequency of any given element from the domain . Answering such queries form the underpinning for a variety of applications such as population surveys, machine learning, spatial analysis and, as we shall see, our objective of quantiles and range queries.
In the point query problem, each holds a private item drawn from a public set using an unknown common discrete distribution . That is,
is the probability that a randomly sampled input element is equal to. The goal is to provide a protocol in the LDP model (i.e. steps that each user and the aggregator should follow) so the aggregator can estimate as as accurately as possible. Solutions for this problem are referred to as providing a frequency oracle.
Several variant constructions of frequency oracles have been described in recent years. In each case, the users perturb their input locally via tools such as linear transformation and random sampling, and send the result to the aggregator. These noisy reports are aggregated and an appropriate bias correction is applied to them to reconstruct the frequency for each item in. The error in estimation is generally quantified by the mean squared error. We know that the mean squared error can be decomposed into (squared) bias and variance. Often estimators for these mechanisms are unbiased and have the same variance for all items in the input domain. Hence, the variance can be used interchangably with squared error, after scaling. The mechanisms vary based on their computation and communication costs, and the accuracy (variance) obtained. In most cases, the variance is proportional to .
Optimized Unary Encoding (OUE). A classical approach to releasing a single bit of data with a privacy guarantee is Randomized Response (RR), due to Wagner (Warner, 1965). Here, we either report the true value of the input or its complement with appropriately chosen probabilities. To generalize to inputs from larger domains, we represent
as the sparse binary vector(where and 0 elsewhere), and randomly flip each bit of to obtain the (non-sparse) binary vector . Naively, this corresponds to applying one-bit randomized response (Warner, 1965) to each bit independently. Wang et al. (Wang et al., 2017) proposed a variant of this scheme that reduces the variance for larger .
Perturbation: Each user flips each bit at each location of using the following distribution.
Finally user sends the perturbed input to the aggregator.
OUE does not scale well to very large due to large communication complexity (i.e., bits from each user), and the consequent computation cost for the user ( time to flip the bits). Subsequent mechanisms have smaller communication cost than OUE.
Optimal Local Hashing (OLH) (Wang et al., 2017). The OLH method aims to reduce the impact of dimensionality on accuracy by employing universal hash functions111A family of hash functions is said to be universal if i.e. collision probability behaves uniformly.. More specifically, each user samples a hash function () u.a.r from a universal family and perturbs the hashed input.
Perturbation: User samples a u.a.r and computes . User then perturbs using a version of RR generalized for categorical inputs (Kairouz et al., 2016). Specifically, each user reports and, with probability gives the true , else she reports a value sampled u.a.r from .
Aggregation: The aggregator collects the perturbed hash values from all users. For each hash value , the aggregator computes a frequency vector for all items in the original domain, based on which items would produce the hash value under . All such histograms are added together to give
and an unbiased estimator for each frequency for all elements in the original domain is given by the correction.
Variance: Setting minimizes the variance to be . OLH has the same variance as OUE and it more economical on communication. However, a major downside is that it is compute intensive in terms of the decoding time at the aggregator’s side, which is prohibitive for very large dimensions (say, for above tens of thousands), since the time cost is proportional to .
Hadamard Randomized Response (HRR) (Cormode et al., 2018; Nguyên et al., 2016). The Discrete Fourier (Hadamard) transform is described by an orthogonal, symmetric matrix of dimension (where is a power of 2). Each entry in is
where is the number of ’s that and agree on in their binary representation. The (full) Hadamard transformation (HT) of user’s input is the th column of i.e. . For convenience, the user can scale up by to give values either or . Figure 1 shows an example of the Hadamard transformation matrix.
Perturbation: User samples an index u.a.r and perturbs with binary randomized response, keeping the value with probability , and flipping it with probability . Finally user releases the perturbed coefficient and .
Aggregation: Consider each report from each user. With probability , the report is the true value of the coefficient; with probability , we receive its negation. Hence, we should divide the reported value by
to obtain an unbiased random variable whose expectation is the correct value. The aggregator can then compute the observed sum of each perturbed coefficientas . An unbiased estimation of the th Hadamard coefficient (with the factor restored) is given by . Therefore, the aggregator can compute an unbiased estimator for each coefficient, and then apply the inverse transform to produce .
Variance: The variance of each user report is given by the squared error of our unbiased estimator. With probability , the squared error is , else the squared error is . Then, we can expand the variance for each report as
There are total reports, each of which samples one of coefficients at random. Observing that the estimate of any frequency in the original domain is a linear combination of Hadamard coefficients with unit Euclidean norm, we can find an expression for the value of as . Using (to ensure LDP), we find .
This method achieves a good compromise between accuracy and communication since each user transmits only bits to describe the index and the perturbed coefficient, respectively. Also, the aggregator can reconstruct the frequencies in the original domain by computing the estimated coefficients and then inverting HT with operations, versus for OLH.
Thus, we have three representative mechanisms to implement a frequency oracle. Each one provides -LDP, by considering the probability of seeing the same output from the user if her input were to change. There are other frequency oracles mechanisms developed offering similar or weaker variance bounds (e.g. (Fanti et al., 2016; Ding et al., 2017)) and resouce trade-offs but we do not include them for brevity.
4. Range Queries
4.1. Problem Definition
We next formally define the range queries that we would like to support. As in Section 3.2, we assume non-colluding individuals each with a private item . For any , a range query is to compute
is a binary variable that takes the valueif the predicate is true and otherwise.
Definition 4.1 ().
(Range Query Release Problem) Given a set of users, the goal is to collect information guaranteeing -LDP to allow approximation of any closed interval of length . Let be an estimation of interval of length computed using a mechanism . Then the quality of is measured by the squared error .
4.2. Flat Solutions
One can observe that for an interval , , where is the (fractional) frequency of the item . Therefore a first approach is to simply sum up estimated frequencies for every item in the range, where estimates are provided by an -LDP frequency oracle: . We denote this approach (instantiated by a choice of frequency oracle ) as flat algorithms.
Fact 1 ().
For any range query of length answered using a flat method with frequency oracle ,
Note that the variance grows linearly with the interval size which can be as large as .
Lemma 4.2 ().
The average worst case squared error over evaluation of queries is
There are queries of length . Hence the average error is ∎
4.3. Hierarchical Solutions
We can view the problem of answering range queries in terms of representing the frequency distribution via some collection of histograms, and producing the estimate by combining information from bins in the histograms. The “flat” approach instantiates this, and keeps one bin for each individual element. This is necessary in order to answer range queries of length 1 (i.e. point queries). However, as observed above, if we have access only to point queries, then the error grows in proportion to the length of the range. It is therefore natural to keep additional bins over subranges of the data. A classical approach is to impose a hierarchy on the domain items in such a way that the frequency of each item contributes to multiple bins of varying granularity. With such structure in place, we can answer a given query by adding counts from a relatively small number of bins. There are many hierarchical methods possible to compute histograms. Several of these have been tried in the context of centralized DP. But to the best of our knowledge, the methods that work best in centralized DP tend to rely on a complete view on the distribution, or would require multiple interactions between users and aggregator when translated to the local model. This motivates us to choose more simple yet effective strategies for histogram construction in the LDP setting. We start with the standard notion of -adic intervals and a useful property of -adic decompositions.
Fact 2 ().
For and , an interval is B-adic if it is of the form i.e. its length is of a power of and starts with an integer multiple of its length.
Fact 3 ().
Any sub-range of length of length from can be decomposed into disjoint -adic ranges.
For example, for , the interval can be decomposed into sub-intervals .
The -adic decomposition can be understood as organizing the domain under a complete -ary tree where each node corresponds to a bin of a unique -adic range. The root holds the entire range and the leaves hold the counts for unit sized intervals. A range query can be answered by a walk over the tree similar to the standard pre-order traversal and therefore a range query can be answered with at most nodes, and at most in the worst case.
4.4. Hierarchical Histograms (HH)
Now we describe our framework for computing hierarchical histograms. All algorithms follow a similar structure but differ on the perturbation primitive they use:
Input transformation: user locally arranges the input in the form of a full -ary tree of height . Then defines a unique path from a leaf to the root with a weight of 1 attached to each node on the path, and zero elsewhere. Figure 2 shows an example. Figure 2(a) shows the dyadic () decomposition of the input vector , where the weights on internal nodes are the sum of the weights in their subtree. Figure 2(b) illustrates two user’s local views ( and ). In each local histogram, the nodes in the path from leaf to the root are shaded in red and have a weight of 1 on each node.
Perturbation: samples a level with probability . There are nodes at this level, with exactly one node of weight one and the rest zero. Hence, we can apply one of the mechanisms from Section 3. User perturbs this vector using some frequency oracle and sends the perturbed information to the aggregator along with the level id .
Aggregation: The aggregator builds an empty tree with the same dimensions and adds the (unbiased) contribution from each user to the corresponding nodes, to estimate the fraction of the input at each node. Range queries are answered by aggregating the nodes from the -adic decomposition of the range.
Key difference from the centralized case: Hierarchical histograms have been proposed and evaluated in the centralized case. However, the key difference here comes from how we generate information about each level. In the centralized case, the norm is to split the “error budget” into pieces, and report the count of users in each node; in contrast, we have each user sample a single level, and the aggregator estimates the fraction of users in each node. The reason for sampling instead of splitting emerges from the analysis: splitting would lead to an error proportional to , whereas sampling gives an error which is at most proportional to . Because sampling introduces some variation into the number of users reporting at each level, we work in terms of fractions rather than counts; this is important for the subsequent post-processing step.
In summary, the approach of hierarchical decomposition extends to LDP by observing the fact that it is a linear transformation of the original input domain. This means that adding information from the hierarchical decomposition of each individual’s input yields the decomposition of the entire population. Next we evaluate the error in estimation using the hierarchical methods.
Error behavior for Hierarchical Histograms. We begin by showing that the overall variance can be expressed in terms of the variance of the frequency oracle used, . In what follows, we denote hierarchical histograms aggregated with fanout B as .
Theorem 4.3 ().
When answering a range query of length using a primitive , the worst case variance under the framework is where .
Recall that all the methods we consider have the same (asymptotic) variance bound , with denoting the number of users contributing to the mechanism. Importantly, this does not depend on the domain size , and so we can write , where is a constant for method that depends on . This means that once we fix the method , the variance for any node at level will be the same, and is determined by , the number of users reporting on level . The range query of length is decomposed into at nodes at each level, for levels (from leaves upwards). So we can bound the total variance in our estimate by
using the fact that (in expectation) . ∎
In the worst case, , and we can minimize this bound by a uniform level sampling procedure:
Lemma 4.4 ().
The quantity subject to and is minimized by setting .
We use the Lagrange multiplier technique, and define a new function , introducing a new variable .
We compute the gradient as
Equating , we get and . Hence, . ∎
Then, setting in Theorem 4.3 gives
Hierarchical versus flat methods. The benefit of the HH approach over the baseline flat method depends on the factor versus the quantity . Note that and , so we obtain an improvement over flat methods when , for example. When is very small, this may not be achieved: for and , this condition yields . However, for larger , say and , we obtain , which corresponds to approximately 1.5% of the range.
Theorem 4.5 ().
The worst case average (squared) error incurred while answering all range queries using HH, , is (approximately)
We obtain the bound by summing over all range lengths . For a given length , there are possible ranges. Hence,
We find bounds on each of the two components separately.
1. Using Stirling’s approximation
2. Writing and , we make use of Jensen’s inequality to get
Plugging these upper bounds in to the main expression,
Key difference from the centralized case: Similar looking bounds are known in centralized case, for example due to Qardaji et al. (Qardaji et al., 2013a), but with some key differences. There, the bound (simplified) is proportional to rather than the we see here. Note however that in the centralized case scales proportionate to rather than in the local case: a necessary cost to provide local privacy guarantees.
Optimal branching factor for HH. In general, increasing the fan-out has two consequences under our algorithmic framework. Large reduces the tree height, which increases accuracy of estimation per node since larger population is allocated to each level. But this also means that we can require more nodes at each level to evaluate a query which tends to increase the total error incurred during evaluation. We would like to find a branching factor that balances these two effects. We use the expression for the variance in (1) to find the optimal branching factor for a given . We first compute the gradient of the function . Differentiating w.r.t. we get
We now seek a such that the derivative . The numerical solution is (approximately) . Hence we minimize the variance by choosing to be 4 or 5. This is again in contrast to the centralized case, where the optimal branching factor is determined to be approximately (Qardaji et al., 2013a).
4.5. Post-processing for consistency
There is some redundancy in the information materialized by the HH approach: we obtain estimates for the weight of each internal node, as well as its child nodes, which should sum to the parent weight. Hence, the accuracy of the HH framework can be further improved by finding the least squares solution for the weight of each node taking into account all the information we have about it, i.e. for each node , we approximate the (fractional) frequency with such that is minimized subject to the consistency constraints. We can invoke the Gauss-Markov theorem since the variance of all our estimates are equal, and hence the least squares solution is the best linear unbiased estimator.
Lemma 4.6 ().
The least-squares estimated counts reduce the associated variance by a factor of at least in a hierarchy of fan-out .
We begin by considering the linear algebraic formulation. Let denote the matrix that encodes the hierarchy, where is the number of nodes in the tree structure. For instance, if we consider a single level tree with leaves, then , where is the -length vector of all 1s, and is the identity matrix. Let denote the vector of reconstructed (noisy) frequencies of nodes. Then the optimal least-squares estimate of the true counts can be written as . Denote a range query as the length vector that is 1 for indices between and , and 0 otherwise. Then the answer to our range query is . The variance associated with query can be found as
First, consider the simple case when is a single level tree with leaves. Then we have , where denotes the matrix of all ones. We can verify that . From this we can quickly read off the variance of any range query. For a point query, the associated variance is simply , while for a query of length , the variance is . Observe that the variance for the whole range is just , and that the maximum variance is for a range of just under half the length, , which gives a bound of
The same approach can be used for hierarchies with more than one level. However, while there is considerable structure to be studied here, there is no simple closed form, and forming can be inconvenient for large . Instead, for each level, we can apply the argument above between the noisy counts for any node and its children. This shows that if we applied this estimation procedure to just these counts, we would obtain a bound of to any node (parent or child), and at most for any sum of node counts. Therefore, if we find the optimal least squares estimates, their (minimal) variance can be at most this much. ∎
Consequently, after this constrained inference, the error variance at each node is at most . It is possible to give a tighter bound for nodes higher up in the hierarchy: the variance reduces by for level (counting up from level 1, the leaves). This approaches , from above; however, we adopt the simpler bound for clarity.
This modified variance affects the worst case error, and hence our calculation of an optimal branching factor. From the above proof, we can obtain a new bound on the worst case error of for every level touched by the query (that is, for the left and right fringe of the query). This equates to total variance. Differentiating w.r.t. , we find
Consequently, the value that minimizes is — larger than without consistency. This implies a constant factor reduction in the variance in range queries from post-processing. Specifically, if we pick (a power of 2), then this bound on variance is
compared to for HH without consistency. We confirm this reduction in error experimentally in Section 5.
We can make use of the structure of the hierarchy to provide a simple linear-time procedure to compute optimal estimates. This approach was introduced in the centralized case by Hay et al. (Hay et al., 2010). Their efficient two-stage process can be translated to the local model.
Stage 1: Weighted Averaging: Traversing the tree bottom up, we use the weighted average of a node’s original reconstructed frequency and the sum of its children’s (adjusted) weights to update the node’s reconstructed weight. For a non-leaf node , its adjusted weight is a weighted combination as follows:
Stage 2: Mean Consistency: This step makes sure that for each node, its weight is equal to the sum of its children’s values. This is done by dividing the difference between parent’s weight and children’s total weight equally among children. For a non-root node ,
where is the weight of ’s parent after weighted averaging. The values of achieve the minimum solution.
Finally, we note that the cost of this post-processing is relatively low for the aggregator: each of the two steps can be computed in a linear pass over the tree structure. A useful property of finding the least squares solution is that it enforces the consistency property: the final estimate for each node is equal to the sum of its children. Thus, it does not matter how we try to answer a range query (just adding up leaves, or subtracting some counts from others) — we will obtain the same result.
Key difference from the centralized case. Our post-processing is influenced by a sequence of papers in the centralized case. However, we do observe some important points of departure. First, because users sample levels, we work with the distribution of frequencies across each level, rather than counts, as the counts are not guaranteed to sum up exactly. Secondly, our analysis method allows us to give an upper bound on the variance at every level in the tree – prior work gave a mixture of upper and lower bounds on variances. This, in conjunction with our bound on covariances allows us to give a tighter bound on the variance for a range query, and to find a bound on the optimal branching factor after taking into account the post-processing, which has not been done previously.
4.6. Discrete Haar Transform (DHT)
The Discrete Haar Transform (DHT) provides an alternative approach to summarizing data for the purpose of answering range queries. DHT is a popular data synopsis tool that relies on a hierarchical (binary tree-based) decomposition of the data. DHT can be understood as performing recursive pairwise averaging and differencing of our data at different granularities, as opposed to the HH approach which gathers sums of values. The method imposes a full binary tree structure over the domain, where is the height of node , counting up from the leaves (level 0). The Haar coefficient for a node is computed as , where are the sum of counts of all leaves in the left and right subtree of . In the local case when represents a leaf of the tree, there is exactly one non-zero haar coefficient at each level with value . The DHT can also be represented as a matrix of dimension (where is a power of 2) with each row encoding the Haar coefficients for item . Figure 3 shows an instance of DHT matrix for , i.e. . In each row, the zeroth coefficient is always . The subsequent entries give weights for each of levels. For instance, for row 0, , and are the sets for coefficients for levels 1, 2 and 3. We can decode the count at any leaf node by taking the inner product of the vector of Haar coefficients with the row of corresponding to . Observe that we only need coefficients to answer a point query.
Answering a range query. A similar fact holds for range queries. We can answer any range query by first summing all rows of that correspond to leaf nodes within the range, then taking the inner product of this with the coefficient vector. We can observe that for an internal node in the binary tree, if it is fully contained (or fully excluded) by the range, then it contributes zero to the sum. Hence, we only need coefficients corresponding to nodes that are cut by the range query: there are at most of these. The main benefit of DHT comes from the fact that all coefficients are independent, and there is no redundant information. Therefore we obtain a certain amount of consistency by design: any set of Haar coefficients uniquely determines an input vector, and there is no need to apply the post-processing step described in Section 4.5.
Our algorithmic framework. For convenience, we rescale each coefficient reported by a user at a non-root node to be from , and apply the scaling factor later in the procedure. Similar to the HH approach, each user samples a level with probability and perturbs the coefficients from that level using a suitable perturbation primitive. Each user then reports her noisy coefficients along with the level. The aggregator, after accepting all reports, prepares a similar tree and applies the correction to make an unbiased estimation of each Haar coefficient. The aggregator can evaluate range queries using the (unbiased but still noisy) coefficients.
Perturbing Haar coefficients. As with hierarchical histogram methods, where each level is a sparse (one hot) vector, there are several choices for how to release information about the sampled level in the Haar tree. The only difference is that previously the non-zero entry in the level was always a 1 value; for Haar, it can be a or a . There are various straightforward ways to adapt the methods that we have already (see, for example, (Bassily and Smith, 2015; Nguyên et al., 2016; Ding et al., 2017)). We choose to adapt the Hadamard Randomized Response (HRR) method, described in Section 3.2. First, this is convenient: it immediately works for negative valued weights without any modification. But it also minimizes the communication effort for the users: they summarize their whole level with a single bit (plus the description of the level and Hadmard coefficient chosen). We have confirmed this choice empirically in calibration experiments (omitted for brevity): HRR is consistent with other choices in terms of accuracy, and so is preferred for its convenience and compactness.
Recall that the (scaled) Hadamard transform of a sparse binary vector is equivalent to selecting the th row/column from the Hadamard matrix. When we transform , the Hadamard coefficients remain binary, with their signs negated. Hence we use HRR for perturbing levelwise Haar coefficients. At the root level, where there is a single coefficient, this is equivalent to 1 bit RR. The 0th coefficient can be hardcoded to since it does not require perturbation. We refer to this algorithm as HaarHRR.
Error behavior for HaarHRR. As mentioned before, we answer an arbitrary query of length by taking a weighted combination of at most coefficients. A coefficient at level contributes to the answer if and only if the leftmost and rightmost leaves of the subtree of node partially overlaps with the range. The 0th coefficient is assigned the weight . Let () be the size of the overlap sets for left (right) subtree for with the range. Using reconstructed coefficients, we evaluate a query to produce answer as below.
where, is an unbiased estimation of a coefficient at level . In the worst case, the absolute weight . We can analyze the corresponding varance, , as
Here, is the variance associated with the HRR frequency oracle. As in the hierarchical case, the optimal choice is to set (i.e. we sample a level uniformly), where . Then we obtain
It is instructive to compare this expression with the bounds obtained for the hierarchical methods. Recall that, after post-processing for consistency, we found that the variance for answering range queries with HH, based on optimizing the branching factor, is (from (2)). That is, for long range queries where is close to , these (3) will be close to (2). Consequently, we expect both methods to be competitive, and will use empirical comparison to investigate their behavior in practice.
Finally, observe that since this bound does not depend on the range size itself, the average error across all possible range queries is also bounded by (3).
Key difference from the centralized case. The technique of perturbing Haar coefficients to answer differentially private range queries was proposed and studied in the centralized case under the name “privelets” (Xiao et al., 2011). Subsequent work argued that more involved centralized algorithms could obtain better accuracy. We will see in the experimental section that HaarHRR is among our best performing methods. Hence, our contribution in this work is to reintroduce the DHT as a useful tool in local privacy.
4.7. Prefix and Quantile Queries
Prefix queries form an important class of range queries, where the start of the range is fixed to be the first point in the domain. The methods we have developed allow prefix queries to be answered as a special case. Note that for hierarchical and DHT-based methods, we expect the error to be lower than for arbitrary range queries. Considering the error in hierarchical methods (Theorem 4.3), we require at most nodes at each level to construct a prefix query, instead of , which reduces the variance by almost half. For DHT similarly, we only split nodes on the right end of a prefix query, so we also reduce the variance bound by a factor of 2. Note that a reduction in variance by 0.5 will translate into a factor of in the absolute error. Although the variance bound changes by a constant factor, we obtain the same optimal choice for the branching factor in .
Prefix queries are sufficient to answer quantile queries. The -quantile is that index in the domain such that at most a -fraction of the input data lies below , and at most a fraction lies above it. If we can pose arbitrary prefix queries, then we can binary search for a prefix such that the prefix query on meets the -quantile condition. Errors arise when the noise in answering prefix queries causes us to select a that is either too large or too small. The quantiles then describe the input data distribution in a general purpose, non-parametric fashion. Our expectation is that our proposed methods should allow more accurate reconstructions of quantiles than flat methods, since we expect they will observe lower error. We formalize the problem:
Definition 4.7 ().
(Quantile Query Release Problem) Given a set of users, the goal is to collect information guaranteeing -LDP to approximate any quantile . Let be the item returned as the answer to the quantile query using a mechanism , which is in truth the quantile, and let be the true quantile. We evaluate the quality of by both the value error, measured by the squared error ; and the quantile error .
5. Experimental Evaluation
Our goal in this section is to validate our solutions and theoretical claims with experiments.
We are interested in comparing the flat, hierarchical and wavelet methods for range queries of varying lengths on large domains, capturing meaningful real-world settings. We have evaluated the methods over a variety of real and synthetic data. Our observation is that measures such as speed and accuracy do not depend too heavily on the data distribution. Hence, we present here results on synthetic data sampled from Cauchy distributions. This allows us to easily vary parameters such as the population sizeand the domain size
, as well as varying the distribution to be more or less skewed.
The shape of the (symmetrical) Cauchy distribution is controlled by two parameters, center and height. We set the location of the center at , for , so that larger values of shift the mass further to the right. Since Cauchy distribution has infinite support, we drop any values that fall outside . Larger height parameters tend to reduce the sparsity in the distribution by flattening it. Our default choice is a relatively spread out distribution with height = and .222Our experiments varying did not show much significant deviation in the trends observed. We vary the domain size from small () to large () as powers of two.
Algorithm default parameters and settings. We set a default value of (), in line with prior work on LDP. This means, for example, that binary randomized response will report a true answer of the time, and lie of the time — enough to offer plausible deniability to users, while allowing algorithms to achieve good accuracy. Since the domain size is chosen to be a power of 2, we can choose a range of branching factors for hierarchical histograms so that remains an integer. The default population size is set to be which captures the scenario of an industrial deployment, similar to (Erlingsson et al., 2014; Pihur et al., 2018; Differential Privacy Team, Apple, 2017)
. Each bar plot is the mean of 5 repetitions of an experiment and error bars capture the observed standard deviation. The simulations are implemented in C++ and tested on a standard Linux machine. To the best of our knowledge, ours is among the first non-industrial work to provide simulations with domain sizes as large as. Our final implementation will shortly be made available as open source.
Sampling range queries for evaluation. When the domain size is small or moderate ( and ), it is feasible to evaluate all range queries and so compute the exact average. However, this is not scalable for larger domains, and so we average over a subset of the range queries. To ensure good coverage of different ranges, we pick a set of evenly-spaced starting points, and then evaluate all ranges that begin at each of these points. For and we pick start points every and steps, respectively, yielding a total of 17M and 69M unique queries.
Histogram estimation primitives. The HH framework in general is agnostic to the choice of the histogram estimation primitive . We show results with OUE, HRR and OLH as the primitives for histogram reconstruction, since they are considered to be state of art (Wang et al., 2017), and all provide the same theoretical bound on variance. Though any of these three methods can serve as a flat method, we choose OUE as a flat method since it can be simulated efficiently and reliably provides the lowest error in practice by a small margin. We refer to the hierarchical methods using HH framework as TreeOUE, TreeOLH and TreeHRR. Their counterparts where the aggregator applies postprocessing to enforce consistency are identified with the CI suffix, e.g. TreeHRRCI.
We quickly observed in our preliminary experiments that direct implementation of OUE can be very slow for large : the method perturbs and reports bits for each user. For accuracy evaluation purposes, we can replace the slow method with a statistically equivalent simulation. That is, we can simulate the aggregated noisy count data that the aggregator would receive from the population. We know that noisy count of any item is aggregated from two distributions (1) “true” ones that are reported as ones (with prob. ) (2) zeros that are flipped to be ones (with prob. ). Therefore, using the (private) knowledge of the true count of item , the noisy count can be expressed as a sum of two binomial random variables, Our simulation can perform this sampling for all items, then provides the sampled count to the aggregator, which then performs the usual bias correction procedure.
The OLH method suffers from a more substantial drawback: the method is very slow for the aggregator to decode, due to the need to iterate through all possible inputs for each user report (time ). We know of no short cuts here, and so we only consider OLH for our initial experiments with small domain size .