In recent years, Differential Privacy [dwork2006calibrating] has emerged as the dominant approach to privacy and its adoption in practical settings is growing. Differential privacy is achieved with carefully designed randomized algorithms, called mechanisms. These mechanisms must (provably) satisfy the strong requirement of differential privacy, while still extracting utility from the data. Utility is measured on a task-by-task basis, and different tasks require different mechanisms. Utility-optimal mechanisms, or mechanisms that maximize utility for a given task under a fixed privacy budget, are still not known in many cases.
There are two main models of differential privacy: the central model and the local model. In the central model, users provide their data to a trusted data curator, who runs a privacy mechanism on the dataset in its entirety. In the local model, each user executes a privacy mechanism on their own data before sending the randomized result to the data curator. Local differential privacy (LDP) offers a stronger privacy guarantee than central differential privacy, as it does not rely on the assumption of a trusted data curator. For that reason, it has been embraced by several organizations, including Google [erlingsson2014rappor], Apple [thakurta2017learning], and Microsoft [ding2017collecting], for the collection of personal data from customers. While the stronger privacy guarantee is a benefit of the local model, it necessarily leads to greater error than the central model [dwork2014algorithmic], which makes error-optimal mechanisms an important goal.
Our focus is answering a workload of linear counting queries under local differential privacy. Answering a query workload is a general task that subsumes other common tasks, like estimating histograms, range queries, and marginals. Furthermore, the expressivity of linear query workloads goes far beyond these special cases, as it can include an arbitrary set of predicate counting queries. By defining the workload, the analyst expresses the exact queries they care about most, and their relative importance.
There are several LDP mechanisms for answering particular fixed workloads, like histograms [acharya2018, ye2018optimal, bassily2017practical, wang2017locally], range queries [cormode2019answering], and marginals [cormode2018marginal]. These mechanisms were carefully crafted to provide accuracy on the workloads for which they were designed, but their accuracy properties typically do not transfer to other workloads. Some LDP mechanisms are designed to answer an arbitrary collection of linear queries [bassily2018linear, edmonds2019power], but they do not outperform simple baselines in practice. Many of these mechanisms are summarized in Table 1.
|Mechanism||Workload|
|Randomized Response [warner1965randomized]||Histogram|
|Subset Selection [ye2018optimal]||Histogram|
|Hierarchical [cormode2019answering]||All Range Queries|
|Fourier [cormode2018marginal]||k-Way Marginals|
|Gaussian Factorization [edmonds2019power]||Any|
In this paper, we propose a new mechanism that automatically adapts in order to prioritize accuracy on a target workload. Adaptation to the workload is accomplished by solving a numerical optimization problem, in which we search over an expressive class of unbiased LDP mechanisms for one that minimizes variance on the workload queries.
Workload-adaptation [hardt2012simple, li2010optimizing, mckenna2018optimizing] is a much more developed topic in the central model of differential privacy and has led to mechanisms that offer best-in-class error rates in some settings [dpbench]. Our work is conceptually similar to the Matrix Mechanism [li2010optimizing, mckenna2018optimizing], which also minimizes variance over a class of unbiased mechanisms. However, because the class of mechanisms we consider is different, the optimization problem is fundamentally different and requires a novel analysis and algorithmic solution. In addition, the optimal mechanism in our problem formulation depends on the setting of the privacy parameter, $\epsilon$, while in the Matrix Mechanism this is not a factor.
The paper consists of four main technical contributions.
We give an efficient algorithm to approximately solve this optimization problem, by reformulating it into a form that is algorithmically tractable (Section 4).
We provide a theoretical analysis which illuminates error properties of the mechanism and justifies the design choices we made (Section 5).
In a comprehensive set of experiments we test our mechanism on a range of workloads, showing that it consistently delivers lower error than competitors, by as much as a factor of (Section 6).
2 Background and Problem Setup
In this section we introduce notation for the input data and query workload as well as provide basic definitions of local differential privacy. A full review of notation is provided in Table 2 of the Appendix.
2.1 Input Data and Workload
Given a domain $\mathcal{U}$ of $n$ distinct user types, the input data is a collection of $N$ users $\langle u_1, \ldots, u_N \rangle$, where each $u_i \in \mathcal{U}$. We commonly use a vector representation of the input data, containing a count for each possible user type:
Definition (Data Vector)
The data vector, denoted by $x$, is an $n$-length vector of counts indexed by user types $u \in \mathcal{U}$ such that:
$$x_u = \sum_{i=1}^{N} \mathbb{1}\{u_i = u\}$$
In the local model, we do not have direct access to $x$, but it is still useful to define it for the purpose of analysis. Below is a simple data vector one might obtain from a data set of student grades.
Example (Student Data)
Consider a data set of student grades, where . Suppose students got an , students got a , students got a and no students got a or . Then the data vector would be:
Linear counting queries have a similar vector representation, as shown in Definition .
Definition (Linear Query)
A linear counting query is an $n$-length vector $w$ indexed by user types $u \in \mathcal{U}$, such that the answer to the query is $w^T x = \sum_{u \in \mathcal{U}} w_u x_u$.
A workload is a collection of $p$ linear queries organized into a matrix $W \in \mathbb{R}^{p \times n}$, with one query per row. Our goal is to accurately estimate answers to each workload query under local differential privacy, i.e., we want to privately estimate $Wx$.
The most commonly studied workload is the so-called Histogram workload, which is represented by an $n \times n$ identity matrix. A more interesting workload is given below:
Example (Prefix workload)
The prefix workload contains queries that compute the (unnormalized) empirical cumulative distribution function of the data, or the number of students that have grades.
The workload is an input to our algorithm, reflecting the queries of interest to the analyst and therefore determining the measure of utility that will be used to assess algorithm performance.
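To make the vector and matrix representations concrete, the following is a minimal sketch in Python with NumPy. The grade counts, the ordering of the domain, and the variable names are illustrative assumptions, not values from the paper. The prefix workload is encoded as a lower-triangular matrix of ones, which is one natural encoding under the assumed ordering.

```python
import numpy as np

# Hypothetical domain and users; the counts are made up for illustration.
grades = ["A", "B", "C", "D", "F"]
users = ["A", "B", "B", "C", "A", "B"]  # one entry per user

# Data vector: x[u] counts how many users have type u.
index = {g: i for i, g in enumerate(grades)}
x = np.zeros(len(grades))
for u in users:
    x[index[u]] += 1

# Histogram workload: the identity matrix (one counting query per user type).
W_hist = np.eye(len(grades))

# Prefix workload: row i counts users whose type is among the first i + 1 types.
W_prefix = np.tril(np.ones((len(grades), len(grades))))

hist_answers = W_hist @ x      # equals x itself
prefix_answers = W_prefix @ x  # unnormalized empirical CDF
print(prefix_answers)
```

In the local model the analyst never forms `x` directly; the sketch only shows what the true workload answers would be.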
2.2 Local Differential Privacy
Local differential privacy (LDP) [dwork2014algorithmic] is a property of a randomized mechanism $\mathcal{M}$ that acts on user data. Instead of reporting their true user type, users report a randomized response obtained by executing $\mathcal{M}$ on their true input. These randomized responses allow an analyst to learn something about the population as a whole, while still providing the individual users a form of plausible deniability about their true input. The formal requirement on $\mathcal{M}$ is stated in the following definition.
Definition (Local Differential Privacy)
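A randomized mechanism $\mathcal{M} : \mathcal{U} \rightarrow \mathcal{O}$ is said to be $\epsilon$-LDP if and only if, for all $u, u' \in \mathcal{U}$ and all $S \subseteq \mathcal{O}$:
$$\Pr[\mathcal{M}(u) \in S] \leq \exp(\epsilon) \Pr[\mathcal{M}(u') \in S]$$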
The output range $\mathcal{O}$ can vary between mechanisms. In some simple cases, it is the same as the input domain $\mathcal{U}$, but it does not have to be. Choosing the output range is typically the first step in designing a mechanism. When the range of the mechanism is finite, i.e., $|\mathcal{O}| = m$, we can completely specify a mechanism by a so-called strategy matrix $Q \in \mathbb{R}^{m \times n}$, indexed by $(o, u) \in \mathcal{O} \times \mathcal{U}$. The mechanism $\mathcal{M}_Q$ is then defined by:
$$\Pr[\mathcal{M}_Q(u) = o] = Q_{o,u}$$
This encoding of a mechanism essentially stores a probability for every possible (input, output) pair in the strategy matrix. We translate Definition 2.2 to strategy matrices in the following proposition.
Proposition (Strategy Matrix)
The mechanism $\mathcal{M}_Q$ is $\epsilon$-LDP if and only if the following conditions are satisfied:
1. $Q_{o,u} \leq \exp(\epsilon) \, Q_{o,u'}$ for all $o \in \mathcal{O}$ and all $u, u' \in \mathcal{U}$
2. $\sum_{o \in \mathcal{O}} Q_{o,u} = 1$ and $Q_{o,u} \geq 0$ for all $o \in \mathcal{O}$ and all $u \in \mathcal{U}$
Above, the first condition is the privacy constraint, ensuring that the output distributions for any two users are close, and the second is the probability simplex constraint, ensuring that each column of $Q$ corresponds to a valid probability distribution. Representing mechanisms as matrices is useful because it allows us to reason about them mathematically with linear algebra [kairouz2014extremal, holohan2017extreme]. The example below shows how a simple mechanism, called randomized response, can be encoded as a strategy matrix. (The name of this mechanism should not be confused with the outputs of an arbitrary mechanism $\mathcal{M}$, which we also call randomized responses.)
Example (Randomized Response)
The randomized response mechanism [warner1965randomized] can be encoded as a strategy matrix in the following way:
$$Q = \frac{1}{e^{\epsilon} + n - 1} \begin{bmatrix} e^{\epsilon} & 1 & \cdots & 1 \\ 1 & e^{\epsilon} & \cdots & 1 \\ \vdots & \vdots & \ddots & \vdots \\ 1 & 1 & \cdots & e^{\epsilon} \end{bmatrix}$$
For this mechanism, the output range is the same as the input domain, and hence the strategy matrix is square. The diagonal entries of the strategy matrix are proportional to $e^{\epsilon}$, and the off-diagonal entries are proportional to $1$. This means that each user reports their true input with probability proportional to $e^{\epsilon}$ and all other possible outputs with probability proportional to $1$. It is easy to see that both conditions of the strategy matrix proposition are satisfied. While this is one of the simplest mechanisms, many other mechanisms can also be represented in this way. In fact, the RAPPOR, Hadamard, Subset Selection, Hierarchical, and Fourier mechanisms from Table 1 can all be expressed as a strategy matrix (plus a post-processing step).
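As a quick sanity check, the sketch below builds a randomized response strategy matrix and tests the two strategy-matrix conditions numerically. The function names `randomized_response` and `is_ldp` are hypothetical helpers introduced here. The normalization by $e^{\epsilon} + n - 1$ is the constant that makes each column sum to one. The privacy condition is checked row by row, by comparing the largest entry of each row against $e^{\epsilon}$ times the smallest.

```python
import numpy as np

def randomized_response(n, eps):
    """Strategy matrix for n-ary randomized response (rows: outputs, columns: inputs)."""
    Q = np.ones((n, n)) + (np.exp(eps) - 1.0) * np.eye(n)
    return Q / (np.exp(eps) + n - 1.0)

def is_ldp(Q, eps, tol=1e-9):
    """Check the privacy and probability-simplex conditions on a strategy matrix."""
    nonneg = np.all(Q >= -tol)
    cols_sum_to_one = np.allclose(Q.sum(axis=0), 1.0, atol=tol)
    # Privacy: within every row, the largest entry is at most exp(eps) times the smallest.
    private = np.all(Q.max(axis=1) <= np.exp(eps) * Q.min(axis=1) + tol)
    return bool(nonneg and cols_sum_to_one and private)

Q = randomized_response(5, 1.0)
print(is_ldp(Q, 1.0))   # the matrix satisfies the conditions at its own budget
print(is_ldp(Q, 0.5))   # but not at a stricter budget
```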
When executing the mechanism, each user reports a (randomized) response $o_i = \mathcal{M}_Q(u_i)$. When all users randomize their data with the same mechanism, these responses are typically aggregated into a response vector $y \in \mathbb{R}^m$ (indexed by elements $o \in \mathcal{O}$), where $y_o = \sum_{i=1}^{N} \mathbb{1}\{o_i = o\}$. For notational convenience, we overload the definition of $\mathcal{M}_Q$, allowing it to consume a data vector and return a response vector, so that $\mathcal{M}_Q(x) = y$. The response vector is often not that useful by itself, but it can be used to estimate more useful quantities, such as the data vector $x$ or the workload answers $Wx$. This is typically achieved with a post-processing step, and does not impact the privacy guarantee of the mechanism.
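The aggregation step can be simulated as follows. This is a sketch under stated assumptions: the data vector is hypothetical, and `run_mechanism` is an illustrative helper that plays the role of the overloaded mechanism acting on a data vector. Since all users of the same type sample from the same column of the strategy matrix, their responses can be drawn together as a single multinomial sample.

```python
import numpy as np

def randomized_response(n, eps):
    """Strategy matrix for n-ary randomized response."""
    Q = np.ones((n, n)) + (np.exp(eps) - 1.0) * np.eye(n)
    return Q / (np.exp(eps) + n - 1.0)

def run_mechanism(Q, x, rng):
    """Response vector y: each of the x[u] users of type u draws an output from column u of Q."""
    m, n = Q.shape
    y = np.zeros(m)
    for u in range(n):
        y += rng.multinomial(int(x[u]), Q[:, u])
    return y

rng = np.random.default_rng(0)
x = np.array([10, 20, 5, 0, 0])          # hypothetical data vector
Q = randomized_response(len(x), eps=1.0)
y = run_mechanism(Q, x, rng)             # y[o] = number of users who reported output o
print(y, y.sum())
```

In a real deployment each user would sample locally and send only their own response; drawing the counts centrally here is purely a simulation shortcut.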
3 The Factorization Mechanism
In this section, we describe our mechanism and the main optimization problem that underlies it. We begin with a high-level problem statement, and reason about it analytically until it is in a form we can deal with algorithmically. We present our key ideas and the main steps of our derivation, but defer the finer details of the proofs to Appendix B.
At a high level, our goal is to find a mechanism that has low expected error on the workload. This objective is formalized in the problem statement below.
Problem (Workload Error Minimization)
Given a workload , design an -LDP mechanism that offers low (ideally minimal) expected error on the workload for all possible data vectors . Formally,
In the problem statement above, our goal is to search through the space of all -LDP mechanisms for the one that is best for the given workload. Because it is difficult to characterize an arbitrary mechanism in a way that makes optimization possible, we do not solve the above problem in its full generality. Instead, we perform the search over a restricted class of mechanisms which is easier to characterize. While somewhat restricted, this class of mechanisms is quite expressive, and it captures many of the state-of-the-art LDP mechanisms available today [erlingsson2014rappor, ye2018optimal, acharya2018hadamard, cormode2018marginal, cormode2019answering].
Definition (Workload Factorization Mechanism)
Given an $\epsilon$-LDP strategy matrix $Q \in \mathbb{R}^{m \times n}$ and a reconstruction matrix $V \in \mathbb{R}^{p \times m}$ such that $W = VQ$ (where $p$ is the number of workload queries), the Workload Factorization Mechanism (factorization mechanism for short) is defined as:
$$\mathcal{M}_{V,Q}(x) = V \mathcal{M}_Q(x)$$
Note that the mechanism is completely characterized by the matrices $V$ and $Q$, and hence selecting a mechanism from this class amounts to choosing $V$ and $Q$. $\mathcal{M}_{V,Q}$ is defined in terms of $\mathcal{M}_Q$, and it is parameterized by an additional reconstruction matrix $V$, which is used to estimate the workload query answers from the response vector. Clearly, $\mathcal{M}_{V,Q}$ inherits the privacy guarantee of $\mathcal{M}_Q$ by the well-known post-processing principle of differential privacy [dwork2014algorithmic]. Additionally, this class of mechanisms is appealing because it gives unbiased answers to the workload queries. It is easy to see this, as:
$$\mathbb{E}[\mathcal{M}_{V,Q}(x)] = V \, \mathbb{E}[\mathcal{M}_Q(x)] = V Q x = W x$$
Because these mechanisms are unbiased, they will produce the correct workload answers in expectation. However, the individual estimates produced by the mechanism may not be the true workload answers for any underlying dataset, which is a consistency problem. For example, the estimates might suggest that one or more entries of the data vector are negative, which is clearly impossible. To address this problem we show our mechanism can be improved through a post-processing technique that produces consistent estimates to the workload queries that are as close as possible to the unbiased estimates. This idea is not new, but an adaptation of existing techniques [nikolov2013geometry, li2015matrix], so we defer the full description to the appendix. We nevertheless demonstrate its benefits experimentally in Section 6.6.
We consider this an extension of our mechanism that can improve utility in practice, and evaluate it in isolation. For the remainder of the paper, we will focus on the original, unbiased mechanism, as it is substantially easier to reason about analytically.
Many of the mechanisms in Table 1 can be represented as a factorization mechanism. For example, we show how the Randomized Response mechanism can be expressed as a factorization mechanism in Example .
Example (Randomized Response)
The randomized response mechanism uses the strategy matrix $Q$ defined in the earlier randomized response example and $V = Q^{-1}$ to estimate the Histogram workload ($W = I$).
While the randomized response mechanism is intended to be used to answer the Histogram workload, there is no reason why it cannot be used for other workloads as well. In fact, it is quite straightforward to see how it can be extended to answer an arbitrary workload $W$, simply by using $V = W Q^{-1}$.
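The extension of randomized response to an arbitrary workload can be sketched numerically. The example below assumes a small prefix workload and a hypothetical data vector. It takes the reconstruction matrix to be the workload times the inverse of the (square, invertible) randomized response strategy matrix, then checks the factorization and the unbiasedness identity in expectation, without any sampling.

```python
import numpy as np

n, eps = 5, 1.0

# Randomized response strategy matrix.
Q = (np.ones((n, n)) + (np.exp(eps) - 1.0) * np.eye(n)) / (np.exp(eps) + n - 1.0)

# Prefix workload, used here only as an example of a non-Histogram workload.
W = np.tril(np.ones((n, n)))

# Reconstruction matrix chosen so that V Q = W.
V = W @ np.linalg.inv(Q)

x = np.array([10.0, 20.0, 5.0, 0.0, 0.0])   # hypothetical data vector
expected_response = Q @ x                    # expected response vector
expected_estimate = V @ expected_response    # equals W x, so the estimate is unbiased

print(np.allclose(V @ Q, W), np.allclose(expected_estimate, W @ x))
```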
3.1 Variance Derivation
While the factorization mechanism is unbiased for any workload factorization, different factorizations lead to different amounts of variance on the workload answers. This creates the opportunity to choose the workload factorization that leads to the lowest possible total variance. In order to do that, we need an analytic expression for the total variance in terms of $V$ and $Q$, which we derive in the following theorem.
The expected total squared error (total variance) of a workload factorization mechanism is:
$$\mathbb{E}\left[ \| W x - \mathcal{M}_{V,Q}(x) \|_2^2 \right] = \sum_{u \in \mathcal{U}} x_u \sum_{i=1}^{p} \left[ v_i^T \mathrm{Diag}(q_u) v_i - (v_i^T q_u)^2 \right]$$
where $q_u$ denotes column $u$ of $Q$ and $v_i$ denotes row $i$ of $V$.
Notice above that the exact expression for variance depends on the data vector $x$, which we do not have access to, as it is a private quantity. We want our mechanism to work well for all possible $x$, so we consider the worst-case variance and a relaxation, the average-case variance, instead. (Alternatively, if we had a prior distribution over $x$, we could use that to estimate variance.)
Corollary (Worst-case variance)
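The worst-case variance of $\mathcal{M}_{V,Q}$ occurs when all users have the same worst-case type (i.e., $x_u = N$ for some $u$), and is:
$$L_{worst}(V, Q) = N \max_{u \in \mathcal{U}} \sum_{i=1}^{p} \left[ v_i^T \mathrm{Diag}(q_u) v_i - (v_i^T q_u)^2 \right]$$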
Corollary (Average-case variance)
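The average-case variance of $\mathcal{M}_{V,Q}$ occurs when $x_u = N/n$ for all $u$ and is:
$$L_{avg}(V, Q) = \frac{N}{n} \sum_{u \in \mathcal{U}} \sum_{i=1}^{p} \left[ v_i^T \mathrm{Diag}(q_u) v_i - (v_i^T q_u)^2 \right]$$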
With these analytic expressions for variance, we can analyze and compare existing mechanisms that can be expressed as a workload factorization mechanism. The variance for randomized response is shown in Example .
Example (Variance of Randomized Response)
The worst-case and average-case variance of the factorization in Example on the Histogram workload is:
The expression above is obtained by simply plugging in and to the equations above and simplifying. Interestingly, the worst-case and average-case variance are the same for this workload factorization due to the symmetry in the workload and strategy matrices.
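The variance expressions can be evaluated numerically. The sketch below assumes each user's response is a one-hot draw from their column of the strategy matrix, so a single user of type $u$ contributes $v_i^T \mathrm{Diag}(q_u) v_i - (v_i^T q_u)^2$ to the variance of query $i$. The function names are hypothetical. The check at the end illustrates the symmetry claim for randomized response on the Histogram workload.

```python
import numpy as np

def per_type_variance(V, Q):
    """Entry u is the total variance contributed by a single user of type u."""
    second_moment = (V ** 2) @ Q     # (i, u) entry: v_i^T Diag(q_u) v_i
    squared_mean = (V @ Q) ** 2      # (i, u) entry: (v_i^T q_u)^2
    return (second_moment - squared_mean).sum(axis=0)

def worst_case_variance(V, Q, N):
    """All N users share the single worst user type."""
    return N * per_type_variance(V, Q).max()

def average_case_variance(V, Q, N):
    """Users are spread evenly over the n user types."""
    return N * per_type_variance(V, Q).mean()

n, eps, N = 5, 1.0, 1000
Q = (np.ones((n, n)) + (np.exp(eps) - 1.0) * np.eye(n)) / (np.exp(eps) + n - 1.0)
V = np.linalg.inv(Q)                 # Histogram workload: W = I, so V = Q^{-1}

worst = worst_case_variance(V, Q, N)
avg = average_case_variance(V, Q, N)
print(worst, avg)                    # the two coincide for this symmetric factorization
```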
3.2 Strategy Optimization
With an analytic expression for variance, we can state the optimization problem underlying the factorization mechanism. Our goal is to find a workload factorization that minimizes the total variance on the workload. To do that, we set up an optimization problem, using total variance as the objective function while taking into consideration the constraints that have to hold on and . This is formalized in Section 3.2.
Problem (Optimal Factorization)
Given a workload and a privacy budget :
Here the objective is a loss function that captures how good a given factorization is, such as the worst-case variance $L_{worst}$ or the average-case variance $L_{avg}$. While our original objective in Section 3 was to find the mechanism that optimizes worst-case variance, for practical reasons we use the average-case variance instead. The average-case variance is a smooth approximation of the worst-case variance, which leads to a more analytically and computationally tractable problem. Additionally, the smoothness of the average-case variance makes the corresponding optimization problem more amenable to numerical methods. We study the ramifications of this relaxation theoretically in Section 5.1 and experimentally in Section 6.5. When using $L_{avg}$ as the objective function, we observe it can be expressed in a much simpler form using matrix operations, as shown in the following theorem.
Theorem (Objective Function)
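The objective function $L(V, Q) = \mathrm{tr}[V D_Q V^T]$ is related to $L_{avg}(V, Q)$ by:
$$L_{avg}(V, Q) = \frac{N}{n} \left( L(V, Q) - \| W \|_F^2 \right)$$
where $D_Q = \mathrm{Diag}(Q \mathbf{1})$ and $\mathrm{tr}[\cdot]$ is the trace of a matrix.
From now on, when we refer to $L(V, Q)$, we are referring to its definition in the theorem above. The new objective is equivalent to $L_{avg}$ up to a positive constant factor and an additive constant that depends only on $W$, and hence can be used in place of it for the purposes of optimization.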
With this simplified objective function, we observe that for a fixed strategy matrix $Q$, we can compute the optimal $V$ in closed form. If the entries of the response vector were statistically independent and had equal variance, then this would simply be $V = W Q^{\dagger}$, where $Q^{\dagger}$ is the Moore-Penrose pseudo-inverse of $Q$ [casella2002statistical, hay2010boosting]. However, since the entries of the response vector have unequal variance and are not statistically independent in general, this simple expression is not correct. We can still express the optimal $V$ in closed form, however, as shown in the following theorem.
Theorem (Optimal $V$ for fixed $Q$)
For a fixed $Q$, the minimizer of $L(V, Q)$ subject to $W = VQ$ is given by:
$$V = W (Q^T D_Q^{-1} Q)^{\dagger} Q^T D_Q^{-1}$$
Note that we can assume $D_Q$ is invertible without loss of generality. If it were not, then one entry of the diagonal would have to be $0$, implying that a row of $Q$ is all zero. Such a row corresponds to an output that never occurs under the privacy mechanism, and can be removed without changing the mechanism. Further note that for the above formula to apply, there must exist a $V$ such that $W = VQ$, which is guaranteed if and only if $W$ is in the row space of $Q$ [strang1993introduction]. Expressed as a constraint, this is $W = W Q^{\dagger} Q$.
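The closed-form reconstruction matrix can be compared against the plain pseudo-inverse on a small example. This sketch assumes the loss is $\mathrm{tr}[V D_Q V^T]$ with $D_Q$ the diagonal matrix of row sums of $Q$. The random strategy matrix below has valid columns but is not tuned for any privacy budget, since only the linear algebra is being illustrated. The helper names are hypothetical.

```python
import numpy as np

def optimal_reconstruction(W, Q):
    """V = W (Q^T D^{-1} Q)^+ Q^T D^{-1}, with D the diagonal matrix of row sums of Q."""
    d = Q.sum(axis=1)
    QtDinv = Q.T / d                       # divides column o of Q^T by d[o]
    return W @ np.linalg.pinv(QtDinv @ Q) @ QtDinv

def loss(V, Q):
    """tr[V D V^T], computed without forming D explicitly."""
    return float(np.sum((V ** 2) * Q.sum(axis=1)))

rng = np.random.default_rng(0)
Q = rng.random((8, 4)) + 0.5               # 8 outputs, 4 user types
Q /= Q.sum(axis=0)                         # normalize columns into probability distributions
W = np.tril(np.ones((4, 4)))               # example workload

V_opt = optimal_reconstruction(W, Q)
V_pinv = W @ np.linalg.pinv(Q)             # the simpler, generally suboptimal choice

print(loss(V_opt, Q), loss(V_pinv, Q))     # both factor W, but V_opt has no larger loss
```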
Now that we know the optimal $V$ for any $Q$, we can plug it into $L(V, Q)$ to express the objective as a function of $Q$ alone. Doing this, and simplifying further, leads to our final optimization objective, stated in the following theorem.
Theorem (Objective Function for $Q$)
The objective function can be expressed as:
$$L(Q) = \mathrm{tr}\left[ (Q^T D_Q^{-1} Q)^{\dagger} (W^T W) \right]$$
is our final optimization objective. We are almost ready to restate the optimization problem in terms of . However, before we do that, it is useful to simplify the constraints of the problem. The constraints stated in Section 3.2 are challenging to deal with algorithmically because there are a large number of them. Ignoring the factorization constraint, there are constraints on , and each entry of is constrained by entries from the same column and entries from the same row.
By introducing an auxiliary optimization variable , we reduce this to constraints, so that each entry of is only constrained by entries from the same column and . Specifically, corresponds to the minimum allowable value on each row of , and every column of must be between and (coordinate-wise inequality). It is clear that this is exactly equivalent to Condition in Proposition . Also note that Condition can be expressed in matrix form as . The final optimization problem underlying the workload factorization mechanism is stated in Section 3.2.
Problem (Strategy Optimization)
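Given a workload $W$ and a privacy budget $\epsilon$:
$$\begin{aligned} \underset{Q, z}{\text{minimize}} \quad & \mathrm{tr}\left[ (Q^T D_Q^{-1} Q)^{\dagger} (W^T W) \right] \\ \text{subject to} \quad & W = W Q^{\dagger} Q \\ & Q^T \mathbf{1} = \mathbf{1} \\ & 0 \leq z \leq q_u \leq e^{\epsilon} z \quad \text{for all } u \in \mathcal{U} \end{aligned}$$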
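The objective and constraints of the strategy optimization problem can be evaluated numerically as in the sketch below. It assumes the objective $\mathrm{tr}[(Q^T D_Q^{-1} Q)^{\dagger} (W^T W)]$ and the constraint form described above: columns summing to one, each column bounded coordinate-wise between $z$ and $e^{\epsilon} z$, and $W$ in the row space of $Q$. The helper names `objective` and `is_feasible` are hypothetical.

```python
import numpy as np

def objective(Q, W):
    """tr[(Q^T D^{-1} Q)^+ (W^T W)], with D the diagonal matrix of row sums of Q."""
    d = Q.sum(axis=1)
    M = (Q.T / d) @ Q
    return float(np.trace(np.linalg.pinv(M) @ (W.T @ W)))

def is_feasible(Q, z, W, eps, tol=1e-8):
    """Check the simplex, bound, and factorization constraints."""
    cols_ok = np.allclose(Q.sum(axis=0), 1.0, atol=tol)
    lower = z[:, None] - tol
    upper = np.exp(eps) * z[:, None] + tol
    bounds_ok = np.all(z >= -tol) and np.all(Q >= lower) and np.all(Q <= upper)
    factor_ok = np.allclose(W @ np.linalg.pinv(Q) @ Q, W, atol=1e-6)
    return bool(cols_ok and bounds_ok and factor_ok)

n, eps = 4, 1.0
Q = (np.ones((n, n)) + (np.exp(eps) - 1.0) * np.eye(n)) / (np.exp(eps) + n - 1.0)
z = Q.min(axis=1)          # smallest entry of each row serves as the auxiliary variable
W = np.eye(n)              # Histogram workload

print(is_feasible(Q, z, W, eps), objective(Q, W))
```

For randomized response the row sums are all one, so the objective on the Histogram workload reduces to the squared Frobenius norm of the inverse strategy matrix, which gives a convenient cross-check.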
4 Optimization Algorithm
We now discuss our approach to solving the strategy optimization problem stated at the end of Section 3.2. It is a nonlinear optimization problem with linear and nonlinear constraints. While the objective is smooth (and hence differentiable) within the boundary of the constraint $W = W Q^{\dagger} Q$, it is not convex. It is typically infeasible to find a closed form solution to such a problem, and conventional numerical optimization methods are not guaranteed to converge to a global minimum for non-convex objectives. However, such numerical gradient-based methods have seen remarkable empirical success in a variety of domains, often finding high quality local minima. That is the approach we take. However, rather than use an out-of-the-box commercial solver, which would not be able to scale to larger problem sizes, we provide our own optimization algorithm, which achieves greater scalability by exploiting structure in the constraint set.
The algorithm we propose is an instance of projected gradient descent [nocedal2006numerical], a variant of gradient descent that handles constraints. To implement this algorithm, the key challenge is to project onto the constraint set. In other words, given a matrix that does not satisfy the constraints, find the “closest” matrix that does satisfy the constraints. Ignoring the constraint $W = W Q^{\dagger} Q$ for now, this sub-problem is stated formally below.
Problem (Projection onto LDP Constraints)
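Given an arbitrary matrix $R \in \mathbb{R}^{m \times n}$, a vector $z \in \mathbb{R}^m$, and a privacy budget $\epsilon$, the projection onto the privacy constraint, denoted $\Pi_{z,\epsilon}(R)$, is obtained by solving the following problem:
$$\begin{aligned} \underset{Q}{\text{minimize}} \quad & \| Q - R \|_F^2 \\ \text{subject to} \quad & Q^T \mathbf{1} = \mathbf{1} \\ & z \leq q_u \leq e^{\epsilon} z \quad \text{for all } u \in \mathcal{U} \end{aligned}$$
This projection problem is easier to solve than the strategy optimization problem of Section 3.2 because the objective is now a quadratic function of $Q$. In addition, a key insight to solve this problem efficiently is to notice that it is closely related to the problem of projecting onto the probability simplex [wang2013projection] (now with bound constraints), and admits a similar solution. Specifically, the form of the solution is stated in the following proposition.
Proposition (Projection Algorithm)
The solution to the projection problem may be obtained one column at a time using
$$q_u = \mathrm{clip}(r_u + \lambda_u \mathbf{1}, \; z, \; e^{\epsilon} z)$$
where $r_u$ is column $u$ of $R$, clip “clips” the entries of $r_u + \lambda_u \mathbf{1}$ to the range $[z, e^{\epsilon} z]$ entry-wise, and $\lambda_u$ is a scalar value that makes $\mathbf{1}^T q_u = 1$.
The solution is remarkably simple. Intuitively, we add the same scalar value to every entry of $r_u$, then clip those values that lie outside the allowed range. The scalar value $\lambda_u$ is chosen so that $q_u$ sums to $1$, and finding it is the main challenge. It may be calculated through binary search or any other method to find the root of the function $f(\lambda_u) = \mathbf{1}^T \mathrm{clip}(r_u + \lambda_u \mathbf{1}, z, e^{\epsilon} z) - 1$. We give an efficient implementation of this proposition in Algorithm 1.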
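The shift-and-clip projection can be sketched with a simple bisection on the scalar shift of each column. This is a binary-search variant written for illustration, not a reproduction of Algorithm 1. The function name `project`, the iteration count, and the particular lower-bound vector used in the example are assumptions. The routine relies on the fact that the column sum after clipping is a non-decreasing function of the shift.

```python
import numpy as np

def project(R, z, eps, iters=100):
    """Project each column of R onto {q : z <= q <= exp(eps) z, sum(q) = 1}."""
    lo, hi = z[:, None], np.exp(eps) * z[:, None]
    if not (z.sum() <= 1.0 <= np.exp(eps) * z.sum()):
        raise ValueError("constraint set is empty for this z")
    lam_lo = (lo - R).min(axis=0)   # shift at which every entry clips to its lower bound
    lam_hi = (hi - R).max(axis=0)   # shift at which every entry clips to its upper bound
    for _ in range(iters):
        lam = 0.5 * (lam_lo + lam_hi)
        too_big = np.clip(R + lam, lo, hi).sum(axis=0) > 1.0
        lam_hi = np.where(too_big, lam, lam_hi)
        lam_lo = np.where(too_big, lam_lo, lam)
    return np.clip(R + 0.5 * (lam_lo + lam_hi), lo, hi)

rng = np.random.default_rng(0)
m, n, eps = 8, 3, 1.0
R = rng.random((m, n))                          # arbitrary matrix to be projected
z = np.full(m, (1.0 + np.exp(-eps)) / (2 * m))  # hypothetical lower bounds
Q = project(R, z, eps)
print(Q.sum(axis=0))
```

All columns are handled at once through broadcasting, so each bisection step costs one clip and one column sum over the whole matrix.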
Now that we have discussed the projection problem and its solution, we are ready to state the full projected gradient descent algorithm for finding an optimized strategy. Algorithm 2 is an iterative algorithm, where in each iteration we perform a gradient descent plus projection step on the optimization parameters $Q$ and $z$. The gradient $\nabla_Q L$ is easily obtained as $L$ is a function of $Q$, but the gradient term $\nabla_z L$ is less obvious. However, by observing that $Q$ is actually a function of $z$ (from the projection step $Q = \Pi_{z,\epsilon}(\cdot)$), we can use the multivariate chain rule to back-propagate the gradient from $Q$ to $z$ to obtain $\nabla_z L$. We do not discuss the details of computing the gradients here, as it can be easily accomplished with automatic differentiation tools [griewank1989automatic, maclaurin2015autograd].
We note that Algorithm 2 handles the constraint $W = W Q^{\dagger} Q$ “for free” in the sense that we do not need to deal with it explicitly, as long as the step sizes and initialization are chosen appropriately. Specifically, as long as the initial $Q$ satisfies the constraint, and the step sizes are sufficiently small, every subsequent $Q$ in the algorithm will also satisfy the constraint. Intuitively, this is because as we move closer to the boundary of the constraint, the objective function blows up and eventually reaches a point of discontinuity when the constraint is not satisfied. Because we update using the negative gradient, which is a descent direction, we will never approach the boundary of the constraint. We discuss the choice of step size and initialization below. This trick is a convenient way to handle a constraint that is otherwise challenging to deal with. We note that similar ideas have been used to deal with related constraints in prior work [yuan2016convex].
The step size for the gradient descent step must be supplied as input, and two different step sizes are used to update $Q$ and $z$. Notice that we take a smaller step size to update $z$ than $Q$. This is a heuristic we use to make sure $z$ doesn’t change too fast; it improves the robustness of the algorithm. We perform a hyper-parameter search to find a step size that works well, only running the algorithm for a few iterations in this phase, then running it longer once a step size is chosen. Decaying the step size at each iteration is also possible, as smaller step sizes typically work better in later iterations.
The final missing piece in Algorithm 2 is the initialization of , for which there are multiple options. One option is to initialize with an existing factorization, such as the best one from Table 1. Then intuitively the optimized strategy will never be worse than the other mechanisms, because the negative gradient is a descent direction. This is an informal argument, as it is technically possible that the optimized strategy has better average-case variance but worse worst-case variance than the initial strategy. We do not take this approach, however, as we find initializing randomly tends to work better. Specifically, we let be a random matrix, where each entry is sampled from . Then we obtain by projecting onto the constraint set; i.e., , where , where is a vector of ones. The choice of is also an important consideration when initializing . While larger leads to a more expressive strategy space, it also leads to more expensive optimization. Furthermore, we have noticed that using larger does not always lead to higher quality strategies. This is likely due to the fact that the objective is non-convex, and it is harder to find good local minima in higher dimensional spaces. Our choice of represents a sweet spot that we found works well empirically. In general, can be chosen via hyper-parameter search, however.
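The overall loop can be illustrated with a deliberately simplified sketch, which is not Algorithm 2 itself. It keeps the lower-bound vector fixed instead of updating it. It uses central finite differences in place of automatic differentiation. It remembers the best iterate so the returned strategy is never worse than the random starting point. The step size, iteration count, output dimension, and the fixed lower-bound vector are all arbitrary assumptions chosen for a tiny problem.

```python
import numpy as np

def objective(Q, W):
    """tr[(Q^T D^{-1} Q)^+ (W^T W)], with D the diagonal matrix of row sums of Q."""
    M = (Q.T / Q.sum(axis=1)) @ Q
    return float(np.trace(np.linalg.pinv(M) @ (W.T @ W)))

def project(R, z, eps, iters=60):
    """Column-wise projection onto {q : z <= q <= exp(eps) z, sum(q) = 1} via bisection."""
    lo, hi = z[:, None], np.exp(eps) * z[:, None]
    lam_lo, lam_hi = (lo - R).min(axis=0), (hi - R).max(axis=0)
    for _ in range(iters):
        lam = 0.5 * (lam_lo + lam_hi)
        too_big = np.clip(R + lam, lo, hi).sum(axis=0) > 1.0
        lam_hi = np.where(too_big, lam, lam_hi)
        lam_lo = np.where(too_big, lam_lo, lam)
    return np.clip(R + 0.5 * (lam_lo + lam_hi), lo, hi)

def numerical_gradient(f, Q, h=1e-6):
    """Central finite differences; a slow stand-in for automatic differentiation."""
    G = np.zeros_like(Q)
    for idx in np.ndindex(*Q.shape):
        E = np.zeros_like(Q)
        E[idx] = h
        G[idx] = (f(Q + E) - f(Q - E)) / (2 * h)
    return G

def optimize_strategy(W, m, eps, steps=30, step_size=1e-4, seed=0):
    rng = np.random.default_rng(seed)
    n = W.shape[1]
    z = np.full(m, (1.0 + np.exp(-eps)) / (2 * m))   # fixed lower bounds (assumption)
    Q = project(rng.random((m, n)), z, eps)          # random feasible starting point
    start = objective(Q, W)
    best_Q, best = Q, start
    for _ in range(steps):
        G = numerical_gradient(lambda A: objective(A, W), Q)
        Q = project(Q - step_size * G, z, eps)       # gradient step, then projection
        value = objective(Q, W)
        if value < best:                             # keep the best iterate seen so far
            best_Q, best = Q, value
    return best_Q, start, best

W = np.tril(np.ones((4, 4)))                         # small prefix workload
Q_opt, start_loss, final_loss = optimize_strategy(W, m=8, eps=1.0)
print(start_loss, final_loss)
```

A fixed step size gives no guarantee of monotone progress, which is why the sketch tracks the best iterate. A practical implementation would tune or decay the step size as described above and would use automatic differentiation to scale beyond toy sizes.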
5 Theoretical Results
In this section, we answer several theoretical questions about our mechanism. First, we justify the relaxation in the objective function, used to make the optimization analytically tractable. Second, we theoretically analyze the error achieved by our mechanism, measured in terms of sample complexity. Third, we derive lower bounds on the achievable error of workload factorization mechanisms. Finally, we state the computational complexity of our optimization algorithm. All the proofs are deferred to Appendix B.
5.1 Relaxed Objective
In Section 3 we replaced our true optimization objective $L_{worst}$ with a relaxation $L_{avg}$. In this section, we justify that choice theoretically, showing that $L_{worst}$ is tightly bounded above and below by $L_{avg}$.
Theorem (Bounds on )
Let be an arbitrary factorization of where is an -LDP strategy matrix. Then the worst case variance and average-case variance are related as follows:
The theorem above suggests that relaxing $L_{worst}$ to $L_{avg}$ does not significantly impact the optimization problem. Intuitively, this theorem holds because of the privacy constraint on $Q$, which guarantees that the column of $Q$ for the worst-case user cannot be too different from any other column. Hence, all users must have a similar impact on the total variance of the mechanism. Empirically, we find that $L_{worst}$ is often even closer to $L_{avg}$ than the upper bound suggests (Section 6.5). Furthermore, in some cases $L_{worst}$ is exactly equal to $L_{avg}$, as we showed for randomized response on the Histogram workload in Section 3.1.
5.2 Sample complexity
We gave an analytic expression for the expected total squared error (total variance) of our mechanism in Corollary . However, this quantity might be difficult to interpret, and it is more natural to look at the number of samples needed to achieve a fixed error instead. Furthermore, when running an LDP mechanism it is important to know how much data is required to obtain a target error rate, as that information is critical for determining an appropriate privacy budget.
Because the total variance increases with the number of individuals and the number of workload queries , we instead look at a normalized measure of variance.
Definition (Normalized Variance)
The normalized worst-case variance of is:
is the same as up to constant factors, although it is more interpretable because it is a measure of variance on a single “average” workload query, where variance is measured on the normalized data vector .
Corollary (Normalized Variance)
The normalized variance is:
Interestingly, the dependence on does not change with and — it is always , but the constant factor depends on the quality of the workload factorization.
Corollary (Sample Complexity)
The number of samples needed to achieve normalized variance is:
We can readily compute the sample complexity numerically for any factorization . In fact, the sample complexity and worst-case variance of a mechanism are proportional, as evident from the above equation. Additionally, by replacing with a lower bound, we can get an analytical expression for the sample complexity in terms of the privacy budget and the properties of the workload .
Example (Sample complexity, Randomized Response)
The Randomized Response mechanism described in Example has sample complexity:
on the Histogram workload.
Example suggests the sample complexity of the randomized response mechanism grows roughly at a linear rate with the domain size .
5.3 Lower Bound
For a given workload, a theoretical lower bound on the achievable error is useful for checking how close to optimal our strategies are. It also can be used to characterize the inherent difficulty of the workload. In this section, we derive an easily-computable lower bound on the achievable error under our mechanism in terms of the singular values of the workload matrix.
Theorem (Lower Bound, Factorization Mechanism)
Let be a workload matrix and let be any -LDP strategy matrix. Then we have:
where are the singular values of and is the loss function defined in Theorem .
This result is similar to lower bounds known to hold in the central model of differential privacy, based on the analysis of the Matrix Mechanism [li2015lower]. In both cases the hardness of a workload is characterized by its singular values.
Other lower bounds for this problem have characterized the hardness of a workload in terms of quantities like the largest column norm of [bassily2018linear], the so-called factorization norm of [edmonds2019power], and the so-called packing number associated with [blasiok2019towards]. While interesting theoretically, the factorization norm and packing number are hard to calculate in practice. In contrast, our bound can be easily calculated.
Corollary (Worst-case variance)
The worst-case variance of any factorization mechanism must be at least:
Example (Lower Bound for Histogram Workload)
Every workload factorization mechanism requires at least samples to achieve normalized variance on the Histogram workload.
Note the very weak dependence on in Example , which suggests that the sample complexity should not change much with . Further, recall from Example that the sample complexity of randomized response is linear in . This suggests randomized response is not the best mechanism for the Histogram workload. This result is not new, as there are several mechanisms that are known to perform better than randomized response [acharya2018, ye2018optimal, wang2017locally, erlingsson2014rappor]. We show empirically in Section 6 that some of these mechanisms achieve the optimal sample complexity for the Histogram workload up to constant factors (i.e., no dependence on ). Our mechanism also achieves the optimal sample complexity for this workload, but has better constant factors.
For other workloads, the sample complexity may depend on . Calculating the exact dependence on for other workloads requires deriving the singular values of the workload as a function of in closed form, which may be challenging for workloads with complicated structure.
5.4 Computational Complexity
We now study the computational complexity of Algorithm 2. In each iteration, we perform a gradient descent step and a projection step. Evaluating the gradient requires the same amount of time as evaluating the objective function using automatic differentiation [baydin2018automatic]. We remind the reader that the objective function only depends on through its Gram matrix . Assuming this has been pre-computed, the time complexity to evaluate the objective is , or to compute and to compute its inverse and the multiplication with .
The time complexity of Algorithm 1 is , as it requires sorting a list of length and then iterating through it. Algorithm 1 is called times in each iteration of Algorithm 2, and hence the total cost of projection is . The total per-iteration time complexity of Algorithm 2 is thus , or if we set as we recommended in Section 4.
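To illustrate why a sort followed by a single pass dominates the projection cost, the sketch below implements the standard sort-based Euclidean projection onto the probability simplex. This is a stand-in with the same sort-then-scan structure, not a reproduction of Algorithm 1, whose constraint set also encodes the privacy constraint; the function names are hypothetical.

```python
import numpy as np

def project_onto_simplex(v):
    """Euclidean projection of v onto {q : q >= 0, sum(q) = 1}."""
    m = v.size
    u = np.sort(v)[::-1]                      # sort descending
    shifted_cumsum = np.cumsum(u) - 1.0
    ks = np.arange(1, m + 1)
    # Largest index at which the shifted threshold is still below u.
    rho = np.nonzero(u - shifted_cumsum / ks > 0)[0][-1]
    theta = shifted_cumsum[rho] / (rho + 1)
    return np.maximum(v - theta, 0.0)

def project_columns(Q):
    """Project each column independently (one projection call per column)."""
    return np.apply_along_axis(project_onto_simplex, 0, Q)

rng = np.random.default_rng(0)
Q = project_columns(rng.standard_normal((16, 4)))
print(Q.sum(axis=0))  # each column sums to one after projection
```

The sort is the most expensive step of each call, and one call is made per column, so the per-iteration projection cost is the per-call cost multiplied by the number of columns.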
In this section we experimentally evaluate our mechanism. We extensively study the utility of our mechanism on a variety of workloads, domains, and privacy levels, and compare it against multiple competing mechanisms from the literature. We demonstrate consistent improvements in utility compared to other mechanisms in all settings (Section 6.2 and Section 6.3). We also evaluate the scalability of our optimization algorithm (Section 6.4), the impact of using the relaxed optimization objective (Section 6.5), and the utility improvements we can obtain from using the postprocessing extension described in Remark (Section 6.6).
6.1 Experimental setup
We consider six different workloads in our empirical analysis, each of which can be defined for a specified domain size. These workloads are intended to capture common queries an analyst might want to perform on data and have been studied previously in the privacy literature. These workloads are Histogram, Prefix, All Range, All Marginals, 3-Way Marginals, and Parity. Histogram is the simplest workload, studied as a running example throughout the paper, and encoded as an identity matrix. Prefix was introduced in Example , and includes a set of range queries required to compute the empirical CDF over a 1-dimensional domain. All Range is a workload containing all range queries over a 1-dimensional domain, studied in [cormode2018marginal]. All Marginals and 3-Way Marginals contain queries to compute the marginals over a multidimensional binary domain, and were studied in [cormode2019answering]. Parity also contains queries defined over a multidimensional binary domain, and was studied in [gaboardi2014dual].
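The three one-dimensional workloads can be written down directly as query matrices, where each row is a counting query and each column is a domain element. The sketch below is illustrative only; the function names are ours, and the marginal and parity workloads are omitted because they require a multidimensional binary domain.

```python
import numpy as np

def histogram(n):
    """Identity matrix: one point query per domain element."""
    return np.eye(n)

def prefix(n):
    """Row i counts elements 0..i, giving the empirical CDF."""
    return np.tril(np.ones((n, n)))

def all_range(n):
    """One row for every contiguous range [i, j] with i <= j."""
    rows = []
    for i in range(n):
        for j in range(i, n):
            query = np.zeros(n)
            query[i:j + 1] = 1.0
            rows.append(query)
    return np.array(rows)

x = np.array([3.0, 1.0, 0.0, 2.0])    # hypothetical data vector of counts
print(prefix(4) @ x)                   # [3. 4. 4. 6.]
print(all_range(4).shape)              # (10, 4)
```

All Range has n(n+1)/2 rows, so it grows quadratically in the domain size while Histogram and Prefix grow linearly.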
We compare our mechanism against six other state-of-the-art mechanisms, including Randomized Response [warner1965randomized], Hadamard [acharya2018hadamard], Hierarchical [cormode2019answering], Fourier [cormode2018marginal], and the Matrix Mechanism [li2010optimizing, edmonds2019power] (both and versions), adapted to the local model. To adapt the Matrix Mechanism to the local model, we run it independently for all single-user databases and aggregate the results, as described in [edmonds2019power].
The first four mechanisms are all particular instances of the class of factorization mechanisms, just with different factorizations. Although each was designed to answer a fixed workload (e.g., Randomized Response was designed for the Histogram workload), they can still be run on other workloads with minor modifications. In particular, for each mechanism we use the same across different workloads, but change based on the workload, using Theorem (the optimal ).
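As a concrete illustration of reusing a fixed strategy on a different workload, the sketch below runs Randomized Response on the Prefix workload. It assumes the factorization convention W = VQ, with Q a column-stochastic strategy matrix and V a reconstruction matrix; these symbol names and the helper functions are assumptions of this sketch. Because the Randomized Response strategy is square and invertible, V = W Q^{-1} is the only matrix satisfying W = VQ, so no further choice is involved in this special case.

```python
import numpy as np

def randomized_response_strategy(n, eps):
    """Column u is the output distribution for a user of type u."""
    Q = np.ones((n, n)) + (np.exp(eps) - 1.0) * np.eye(n)
    return Q / (np.exp(eps) + n - 1.0)

def reconstruction(W, Q):
    """Solve V Q = W for V when Q is square and invertible."""
    return np.linalg.solve(Q.T, W.T).T

n, eps = 4, 1.0
Q = randomized_response_strategy(n, eps)
W = np.tril(np.ones((n, n)))             # Prefix workload
V = reconstruction(W, Q)

rng = np.random.default_rng(0)
users = rng.integers(0, n, size=1000)    # hypothetical user types
outcomes = np.array([rng.choice(n, p=Q[:, u]) for u in users])
y = np.bincount(outcomes, minlength=n)   # histogram of randomized reports
estimate = V @ y                         # unbiased estimate of W x
print(estimate)
```

The estimate is unbiased because the expected report histogram is Q x, and V Q x = W x. Only V changes when the workload changes; the randomization each user performs stays the same.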
We omit from comparison the Gaussian mechanism [bassily2018linear], as it is strictly dominated by the Matrix Mechanism. We also omit from comparison RAPPOR [erlingsson2014rappor] and Subset Selection [ye2018optimal], as they require exponential space to represent the strategy matrix, making it prohibitive to calculate worst-case variance and sample complexity. However, we note that these mechanisms have been previously compared with Hadamard, and shown to offer comparable performance on the Histogram workload [acharya2018hadamard].
Our primary evaluation metric for comparing algorithms is sample complexity, which we calculate exactly using Corollary with . Recall that the sample complexity is proportional to the worst-case variance, but is appropriately normalized and easier to interpret. Furthermore, we remark that for most experiments in this section, no input data is required, as the sample complexities we report apply for the worst-case dataset. In practice, however, we have found that the variance on real-world datasets is quite close to the worst-case variance, consistent with the theory in Theorem . We also vary the privacy budget and domain size , studying their impact on the sample complexity for each mechanism and workload.
6.2 Impact of Epsilon
Figure 1 shows the relationship between workload and on the sample complexity for each mechanism. We consider ranging from to , fixing to be . These privacy budgets are common in practical deployments of differential privacy, and local differential privacy in particular [wang2017locally, ding2017collecting, erlingsson2014rappor]. We state our main findings below:
Our mechanism (labeled Optimized) is consistently the best in all settings: it requires fewer samples than every other mechanism on every workload and we tested.
The magnitude of improvement over the best competitor varies between (Histogram, ) and (All Range, ), but the improvement is typically around in the medium privacy regime. In the very high-privacy regime with , our mechanism is typically quite close to the best competitor, and in the very low-privacy regime with , our mechanism matches randomized response, which is optimal in that regime. The reduction of required samples translates to a context that really matters: data collectors can now run their analyses on smaller samples to achieve their desired accuracy.
The best competitor changes with the workload and . For example, the best competitor on the Prefix workload was Hierarchical, while the best competitor on the 3-Way Marginals workload was Fourier. In both cases, these mechanisms were specifically designed to offer low error on their respective workloads, but they do not work as well on other workloads. Additionally, Randomized Response is often the best mechanism at high , so even for a fixed workload, the best competitor is not always the same. On the other hand, our mechanism adapts effectively to the workload and , and works well in all settings.
As a result, only one algorithm needs to be implemented, rather than an entire library of algorithms, and accordingly it is not necessary to select among alternative algorithms.
Some workloads are inherently harder to answer than others: the number of samples required by our mechanism differs by up to two orders of magnitude between workloads. The easiest workload appears to be Histogram, while the hardest is Parity. This is consistent with the lower bound we gave in Theorem , which characterizes the hardness of the workload in terms of its singular values – the bound is much lower for Histogram than for Parity.
6.3 Impact of Domain Size
Figure 2 shows the relationship between workload and on the sample complexity for each mechanism. We consider ranging from to , fixing to be . We state our main findings below:
For the Histogram workload, there is almost no dependence on the domain size for every mechanism except randomized response. This is consistent with our finding in Example regarding the lower bound on sample complexity. This observation is unique to the Histogram workload, however.
The mechanisms that were designed for a given workload, and those that adapt to the workload, have a better dependence on the domain size (smaller slope) than the mechanisms that do not. This includes the Matrix Mechanism, which is worse than every other mechanism in most settings, but slowly overtakes the other mechanisms for large domain sizes.
The sample complexity of our mechanism and other mechanisms tailored to the workload is generally , as the slope of the lines is in log space. (The slope of a line in log space corresponds to the power of a polynomial in linear space; i.e., .) On the other hand, the sample complexity of the mechanisms not tailored to the workload is more like (as the slope of the line is ). These findings are quite interesting: they suggest the improvements offered by workload adaptivity are more than just a constant factor; they grow with the domain size.
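The relationship between slope in log space and polynomial degree can be checked with a linear fit on the logarithms of both axes. The sketch below uses synthetic values generated from a known power law, not measurements from the experiments; the function name and the constants are hypothetical.

```python
import numpy as np

def loglog_slope(domain_sizes, sample_complexities):
    """Fit a line to (log n, log samples) and return its slope."""
    slope, _intercept = np.polyfit(
        np.log(domain_sizes), np.log(sample_complexities), deg=1
    )
    return slope

domain_sizes = np.array([8, 16, 32, 64, 128, 256])
synthetic = 5.0 * domain_sizes ** 1.5     # hypothetical power law c * n^p
print(loglog_slope(domain_sizes, synthetic))  # recovers p, approximately 1.5
```

The constant factor only shifts the intercept, so two mechanisms whose lines have different slopes differ by a factor that grows with the domain size.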
6.4 Scalability of Optimization
We measure the scalability of optimization by looking at the per-iteration time complexity. In each iteration, we must evaluate the objective function (and its gradient) stated in Theorem , then project onto the constraint set using Algorithm 1. We assume has been precomputed, and note that the per-iteration time complexity only depends on through its size, and not its contents. We therefore use the identity matrix for . Additionally, we let be a random strategy matrix. In Figure (a), we report the per-iteration time required for increasing domain sizes, averaged over 15 iterations. As we can see, optimization scales up to domains as large as , where it takes about seconds per iteration. While expensive, it is not unreasonable to run for a few hundred iterations in this case, and this is an impressive scalability result given that there are over million optimization variables when is that large. Additionally, we note that strategy optimization is a one-time cost, and it can be done offline before deploying the mechanism to the users. Furthermore, as we showed in Section 6.3, the number of samples required typically increases with the domain size, so there is good reason to run mechanisms on small domains whenever possible, compressing the domain if necessary. In general, the plot shows that the time grows roughly at a rate, as it took about seconds for and seconds for . This is consistent with the theoretical analysis in Section 5.4.
6.5 Worst-case vs. Average-case variance
Recall from Section 3.2 that our goal is to optimize worst-case variance, but we actually optimize a relaxed objective where the maximum term is replaced with an average. In Section 5.1, we gave a theoretical justification for this relaxation, showing that the worst-case variance is tightly bounded above and below by the average-case variance. Here we supplement that with empirical evidence that the relaxation is effective.
We collect all of the optimized strategies, covering all the workloads and epsilons used in Figure 1 for all domains . We compute the worst-case variance of each strategy (which is our true objective) and the average-case variance of each strategy (which is our relaxed objective), plotting each of these pairs as a point in Figure (b). As we can see, all of the points are very close to the main diagonal, suggesting that the bound is quite tight. In fact, about 90% of points have an error ratio (worst / average) of less than . For this reason, we are confident that the relaxation is well-founded.
6.6 Non-negativity and consistency
In this section, we experimentally evaluate the extension we proposed in Remark , which we call workload non-negative least squares (WNNLS). The full details are described in Appendix A. For this experiment, we fix , , and , and use a random sample of data from the “HEPTH” dataset obtained from the DPBench study [dpbench], but note that results on other datasets were similar. With this extension, we no longer have a closed form expression for the variance of the mechanism, so we run simulations to estimate it on real data.
Figure (c) shows the (normalized) variance of the mechanism on this dataset with and without the extension. As we can see, the extension reduces the variance in all cases, and the improvement ranges from to , which is a significant amount. In general, the magnitude of improvement depends on factors like and , which are not varied here. When and are sufficiently large, the default workload query estimates will already be non-negative, in which case WNNLS would offer no improvement. WNNLS offers significant improvement for many practical and , however. Additionally, we note that this extension can be plugged into any of the other competing mechanisms as well, and offers similar utility improvements.
7 Related Work
The mechanism we propose in this work is related to a large body of research in both the central and local model of differential privacy.
Answering linear queries under central differential privacy is a widely studied topic [barak2007privacy, li2015matrix, mckenna2018optimizing, bhaskara2012unconditional, li2012adaptive]. Many state-of-the-art mechanisms for this task, like the Matrix Mechanism [li2010optimizing], achieve privacy by adding Laplace or Gaussian noise to a carefully selected set of “strategy queries”. This query strategy is tailored to the workload, and can even be optimized for it, as the matrix mechanism does. The optimization problem posed by the matrix mechanism has been studied theoretically [li2010optimizing, li2015lower], and several algorithms have been proposed to solve it or approximately solve it [yuan2012low, li2012adaptive, yuan2016convex, mckenna2018optimizing]. While similar in spirit to our mechanism, the optimization problem underlying the matrix mechanism is substantially different from ours, as it requires search over a different space of mechanisms not tailored to local differential privacy and the optimal solution does not depend on in any way.
While answering linear queries under local differential privacy has received less attention, any mechanism designed for the central model of differential privacy can be easily adapted to the local model simply by executing it independently for each single-user database. This approach has been studied theoretically with the Gaussian mechanism [bassily2018linear] and the Matrix Mechanism [edmonds2019power]. However, since it is not tailored to local differential privacy, it does not work well in practice. Another notable approach for this task casts it as a mean estimation problem, and uses LDP mechanisms designed for that [blasiok2019towards]. These works provide a thorough theoretical treatment of this problem, showing bounds on achieved error, but no practical implementation or evaluation.
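The adaptation described above can be sketched for the simplest case. The example below assumes a Laplace mechanism with the identity strategy, rather than the Gaussian mechanism or the Matrix Mechanism studied in the cited works, and the function name is hypothetical. Each user holds a single-user database (a one-hot vector), perturbs it locally, and the curator sums the reports. Two single-user databases differ in at most two coordinates, so the L1 sensitivity is 2.

```python
import numpy as np

def local_laplace_histogram(user_types, n, eps, rng):
    """Run a Laplace mechanism once per user and aggregate the noisy reports."""
    num_users = len(user_types)
    one_hot = np.zeros((num_users, n))
    one_hot[np.arange(num_users), user_types] = 1.0
    # Any two one-hot vectors differ by at most 2 in L1 norm.
    noise = rng.laplace(scale=2.0 / eps, size=(num_users, n))
    reports = one_hot + noise             # what each user sends
    return reports.sum(axis=0)            # curator-side aggregation

rng = np.random.default_rng(0)
user_types = rng.integers(0, 8, size=5000)   # hypothetical user data
estimate = local_laplace_histogram(user_types, n=8, eps=1.0, rng=rng)
print(np.round(estimate, 1))
print(np.bincount(user_types, minlength=8))
```

Because independent noise is added to every coordinate of every report, the variance of each aggregated count grows with the number of users, which gives some intuition for why this generic adaptation is not competitive with mechanisms tailored to the local model.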
More work has been done to answer specific, fixed workloads of general interest, such as histograms [acharya2018, ye2018optimal, wang2017locally, warner1965randomized, erlingsson2014rappor, kairouz2016discrete, bassily2015local, wang2016mutual], range queries [cormode2019answering], and marginals [cormode2018marginal]. A very nice summary of the computational complexity, sample complexity and communication complexity for the mechanisms designed for the Histogram workload is given in [acharya2018]. Interestingly, even for the very simple Histogram workload, there are multiple mechanisms because the optimal mechanism is not clear. This is in stark contrast to the central model of differential privacy, where, for the Histogram workload, it is clear that the optimal strategy is the workload itself. Almost all of these mechanisms are instances of the general class of mechanisms we consider in Definition , just with different workload factorizations. The strategy matrices for these workloads were all carefully designed to offer low error on the workloads they were designed for, by exploiting knowledge about those specific workloads. However, none of these mechanisms perform optimization to choose the strategy matrix; instead, it is fixed in advance.
Kairouz et al. propose an optimization-based approach to mechanism design as well [kairouz2014extremal]. The mechanism they propose is not designed to estimate histograms or workload queries, but for other statistical tasks, namely hypothesis testing and something they call information preservation. They also consider the class of mechanisms characterized by a strategy matrix (Proposition ), and propose an optimization problem over this space to maximize utility for a given task. Moreover, for convex and sublinear utility functions, they show that the optimal mechanism is a so-called extremal mechanism, and state a linear program to find this optimal mechanism. Unfortunately, there are optimization variables in this linear program, making it infeasible to solve in practice. Furthermore, the restriction on the utility function (sublinear, convex) prevents the technique from applying to our setting.
We proposed a new LDP mechanism that adapts to a workload of linear queries provided by an analyst. We formulated this as a constrained optimization problem over an expressive class of unbiased LDP mechanisms and proposed a projected gradient descent algorithm to solve this problem. We showed experimentally that our mechanism outperforms all other existing LDP mechanisms in a variety of settings, even outperforming mechanisms on the workloads for which they were intended.
We would like to thank Daniel Sheldon for his insights regarding Theorem . This work was supported by the National Science Foundation under grants CNS-1409143, CCF-1642658, CCF-1618512, TRIPODS-1934846; and by DARPA and SPAWAR under contract N66001-15-C-4067.
|number of users|
|set of possible users,|
|set of possible outcomes,|
|vector of ones|
|trace of a matrix|
|optimization objective for|
|diagonal matrix from vector|
Appendix A Non-negativity and consistency
In this section, we describe and evaluate a simple extension to our mechanism that can greatly improve its utility in practice. While our basic mechanism offers unbiased answers to the workload queries, this can come at the cost of certain anomalies. For example, the estimated answer to a workload query might be negative, even though the true answer could never be negative. This motivates us to consider an extension where we try to account for the structural constraints we know about the data vector. Specifically, we propose the following non-negative least squares problem to find a non-negative (feasible) such that is as close as possible to our unbiased estimate . Formally, the problem at hand is:
Once is obtained, the workload answers can be estimated by computing . While this estimate is not necessarily unbiased, it often has substantially lower variance than and can be a worthwhile trade-off, particularly in the high-privacy/low-data regime where non-negativity is a bigger issue. A similar problem was studied theoretically by Nikolov et al. [nikolov2013geometry] and empirically by Li et al. [li2015matrix] in the central model of differential privacy, where it has been shown to offer significant utility improvements. There are various Python implementations to solve the above problem efficiently [scipy, zhang2018ektelo, mckenna19graphical]. We simply use the limited-memory BFGS algorithm from SciPy to solve it [liu1989limited, scipy].
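A minimal sketch of this post-processing step follows, assuming the problem is to minimize the squared Euclidean distance between the workload applied to a non-negative vector and the unbiased workload estimate. The names `wnnls`, `W`, and `y_hat` are ours, and the noisy estimate in the usage example is synthetic. The solver is SciPy's bound-constrained limited-memory BFGS routine.

```python
import numpy as np
from scipy.optimize import minimize

def wnnls(W, y_hat):
    """Find x >= 0 minimizing ||W x - y_hat||^2 with L-BFGS-B."""
    n = W.shape[1]

    def loss_and_grad(x):
        residual = W @ x - y_hat
        return residual @ residual, 2.0 * (W.T @ residual)

    result = minimize(
        loss_and_grad,
        x0=np.zeros(n),
        jac=True,                     # the callable returns (loss, gradient)
        method="L-BFGS-B",
        bounds=[(0.0, None)] * n,     # non-negativity constraint
    )
    return result.x

W = np.tril(np.ones((4, 4)))                 # Prefix workload
y_hat = np.array([-1.5, 2.0, 1.0, 6.5])      # synthetic noisy estimate
x_nonneg = wnnls(W, y_hat)
consistent_answers = W @ x_nonneg            # non-negative and monotone for Prefix
print(x_nonneg, consistent_answers)
```

Answering the workload from a single non-negative vector also makes the answers mutually consistent; for Prefix, for example, the estimated cumulative counts can no longer decrease.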
Appendix B Missing Proofs
Proof (of Theorem )
We begin by deriving the variance for a single query (where ). Note that
is a sum of multinomial random variables instantiated with and . This gives us:
The total variance is obtained by summing over all the rows of . This completes the proof.
Proof (of Theorem )
We prove the claim by showing the two objectives are the same up to constant additive and multiplicative factors.
Since is constant, we can drop that term for the purposes of defining the objective function. This completes the proof.
Proof (of Theorem )
Observe that we can construct one row at a time because there are no interaction terms between and for in the objective function. Furthermore, we can optimize through the following quadratic program.
The above problem is closely related to a standard norm minimization problem, and can be transformed to one by making the substitution . The problem becomes:
The unique solution to this problem is given by [boyd2004convex]. Using the Hermitian reduction identity [ben2003generalized] , we have:
Applying this for all , we arrive at the desired solution.
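The two linear-algebra facts used in this proof can be checked numerically on a generic instance. The sketch below assumes the standard form of each fact: the minimum-norm solution of an underdetermined system Ax = b is the pseudo-inverse of A applied to b, and the Hermitian reduction identity states that the pseudo-inverse of A equals the pseudo-inverse of A^T A times A^T. The matrix A and vector b are random placeholders, not quantities from the proof.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 6))       # underdetermined system: 3 equations, 6 unknowns
b = rng.standard_normal(3)

A_pinv = np.linalg.pinv(A)
x_star = A_pinv @ b                   # minimum-norm solution of A x = b

# Any other solution adds a null-space component and has a larger norm.
null_component = (np.eye(6) - A_pinv @ A) @ rng.standard_normal(6)
x_other = x_star + null_component

print(np.allclose(A @ x_star, b), np.allclose(A @ x_other, b))
print(np.linalg.norm(x_star) < np.linalg.norm(x_other))

# Hermitian reduction identity: pinv(A) = pinv(A^T A) A^T.
# An explicit cutoff guards against numerically zero singular values of A^T A.
reduced = np.linalg.pinv(A.T @ A, rcond=1e-10) @ A.T
print(np.allclose(A_pinv, reduced))
```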
Proof (of Theorem )
We plug in the optimal solution for as given in Theorem and simplify using linear algebra identities and the cyclic permutation property of trace.
Proof (of Theorem )
It is obvious that the worst-case variance is greater than (or equal to) the average-case variance. We will now bound the worst-case variance from above. Using to denote the worst-case user, we have the following upper bound on worst-case variance:
In step (a), we use the fact that is non-negative. In step (b), we apply the fact that for all . In step (c), we express the bound in terms of , adding in the form of . This completes the proof.
Proof (of Theorem )
Consider the following optimization problem which is closely related to Problem 3.2.
Li et al. derived the SVD bound, which shows the minimum above is at least [li2015lower]. See also [yuan2016convex]. Furthermore, if is the optimal solution and the constraint is replaced with then remains optimal [li2015matrix], in which case the bound becomes .
We will now argue that any feasible solution to our problem can be directly transformed into a feasible solution of the above related problem. Suppose is a feasible solution to Problem 3.2 and let . Note that the objective functions are identical now. We will argue that .
Thus, we have shown that any solution to Problem 3.2 gives rise to a corresponding solution to the above problem. Thus, the SVD Bound applies and we arrive at the desired result: