Differential privacy [DMNS06] is a rigorous mathematical definition that has emerged as one of the most successful notions of privacy in statistical data analysis. Differential privacy provides a rich and powerful algorithmic framework for private data analysis, which can help organizations mitigate users’ privacy concerns. There are two main models for private data analysis that are studied in the literature of differential privacy: the centralized model and the local model. The centralized model assumes a trusted centralized curator that collects all the personal information and then analyzes it. In contrast, the local model, which dates back to [War65], does not involve a central repository. Instead, each individual holding a piece of private data randomizes her data herself via a local randomizer before it is collected for analysis. This local randomizer is designed to satisfy differential privacy, providing strong privacy protection for each individual. The local model is attractive in many practical and industrial domains since it relieves organizations and companies from the liability of holding and securing their users’ private data. Indeed, in the last few years there have been many successful deployments of local differentially private algorithms in the industrial domain, most notably by Google and Apple [EPK14, TVV17].
In this paper, we study the problem of linear queries estimation under local differential privacy (LDP). Let be a data domain of size . A linear query with respect to is uniquely identified by a vector that describes a linear function where denotes the probability simplex in . In this problem, we have a set of individuals (users), where each user holds a private value drawn independently from some unknown distribution . An entity (server) generates a sequence of linear queries and wishes to estimate, within a small error, the values of these queries over the unknown distribution , i.e., . To do this, the server collects signals from the users about their inputs and uses them to generate these estimates. Due to privacy concerns, the signal sent by each user is generated via a local randomizer that outputs a randomized (privatized) version of the user’s true input in a way that satisfies LDP. The goal is to design a protocol that enables the server to derive accurate estimates for its queries under the LDP constraint. This problem subsumes a wide class of estimation tasks under LDP, including distribution estimation studied in [DJW13b, BS15, DHS15, KBR16, BNST17, YB18, ASZ18] and mean estimation in dimensions [DJW13a, DJW13b].
Non-adaptive versus Adaptive Queries:
In this work, we consider two versions of the above problem. In the non-adaptive (offline) version, the set of queries is decided by the server before the protocol starts (i.e., before users send their signals). In this case, the set of queries can be represented as the rows of a matrix that is published before the protocol starts. In the adaptive version of this problem, the queries are submitted and answered over rounds: one query in each round. Before the start of each round, the server can adaptively choose the query based on the entire history it has seen, i.e., based on all the previous queries and signals from users in the past rounds. This setting is clearly harder than the offline setting. Both distribution estimation and mean estimation over a finite (arbitrarily large) domain can be viewed as special cases of the offline queries model above. In particular, for distribution estimation, the queries matrix is set to , the identity matrix of size (in such a case, the dimensionality ). For -dimensional mean estimation, the columns of are viewed as the set of all realizations of a -dimensional random variable.
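As a minimal sketch of the offline model just described, the snippet below evaluates a set of linear queries as a matrix-vector product over a distribution. The function names (`answer_queries`, `identity`) and the toy numbers are hypothetical and chosen only for illustration; they do not come from the paper.

```python
# Illustrative sketch only; names and numbers are assumptions, not the paper's.

def answer_queries(A, p):
    """True answers of the queries (rows of A) over the distribution p."""
    return [sum(a * pj for a, pj in zip(row, p)) for row in A]


def identity(n):
    """Queries matrix for distribution estimation: the n-by-n identity."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    p = [0.5, 0.3, 0.2]  # toy distribution over a domain of size 3

    # Distribution estimation: the answers are the p.m.f. itself.
    print(answer_queries(identity(3), p))

    # Mean estimation: column j is the realization associated with item j,
    # so the answers form the mean of a 2-dimensional random variable.
    A_mean = [[1.0, 0.0, -1.0],
              [0.0, 2.0, 1.0]]
    print(answer_queries(A_mean, p))
```

With the identity matrix the number of queries equals the domain size, which is why distribution estimation over a large domain is automatically a high-dimensional instance of the problem.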
One of the main challenges in the local model is dealing with high-dimensional settings (i.e., when ). Previous constructions for distribution estimation [DJW13b, KBR16, YB18, ASZ18] and mean estimation [DJW13b] suffer from an explicit polynomial dependence on the dimensions in the resulting estimation error.
In this work, we address this challenge and give new constructions for large, natural families of offline linear queries that subsume the above estimation problems. The resulting estimation error has no dependence on in the high-dimensional setting and depends only sub-logarithmically on . (In this work, we consider the true population risk, not the empirical risk. We refer to it as the estimation error and sometimes as the true error.) We also consider the adaptive version of the general linear queries problem, and give a new protocol with optimal error (which is a more natural error criterion in the adaptive setting). We discuss these results below.
1.1 Results and comparison to previous works
The accuracy guarantees of our -LDP protocols are summarized in Table 1.
General offline linear queries:
We assume that the norm of any column of the queries matrix is bounded from above by some arbitrary constant . We note that this is a weaker assumption than assuming that the spectral norm of (largest singular value) is bounded by . For any , let denote the collection of all matrices in satisfying this condition. We design an -LDP protocol that, given any queries matrix from this family, outputs an estimate for with nearly optimal estimation error (see Section 2.2.1 for the definition of the estimation error). As noted earlier, the resulting estimation error does not depend on in the high-dimensional setting: in particular, in the case where (which subsumes the high-dimensional setting when ). This improves over the upper bound in [DJW13b, Proposition 3] achieved by the ball sampling mechanism proposed therein. The near optimality of our protocol follows from the lower bound in the same reference (see Table 1). To construct our protocol, we start with an -LDP protocol that employs the Gaussian mechanism together with a projection technique similar to the one used in [NTZ13] in the centralized model of differential privacy. We show the applicability of this technique in the local model. Next, we transform our -LDP construction into a pure -LDP construction while maintaining the same accuracy (and the same computational cost). To do this, we give a technique based on rejection sampling ideas from [BS15, BNS18]. In particular, our technique can be viewed as a simpler, more direct version of the generic transformation of [BNS18] tuned to the linear queries problem. For this general setting, we focus on improving the estimation error. We do not consider the problem of optimizing communication or computational efficiency. We think that providing a succinct description of the queries matrix (possibly under more assumptions on its structure) is an interesting problem, which we leave to future work.
Distribution estimation:
For this special case, we extend the Hadamard-Response protocol of [ASZ18] to the high-dimensional setting. This protocol enjoys several computational advantages, particularly in communication and running time for each user. We show that this protocol, when combined with a projection step onto the probability simplex, gives estimation error that depends only sub-logarithmically on for all . The resulting error is also tight up to a sub-logarithmic factor in . We note that the error bound in [ASZ18] is applicable only in the case where . Our result thus shows the possibility of accurate distribution estimation under the error criterion in the high-dimensional setting. Our bound also improves over the bound of [ASZ18] for all . To the best of our knowledge, existing results do not imply an error bound better than the trivial error in the regime where . It is worth pointing out that the error bound of [ASZ18] is optimal only when . Although this condition is not explicitly mentioned in [ASZ18], as stated in the same paper, their claim of optimality follows from the lower bound in [YB18]; specifically, [YB18, Theorem IV]. From this theorem, it is clear that the lower bound is only valid when . Hence, our bound does not contradict the results of these previous works. We also note that the idea of projecting the estimated distribution onto the probability simplex was proposed in [KBR16] (along with a different protocol than that of [ASZ18]). Although [KBR16] show empirically that the projection technique yields improvements in accuracy, no formal analysis or guarantees were provided for the resulting error in this case.
Note that the estimation error bounds in the previous works were derived for the expected -squared error, and hence the expressions here are the square roots of the bounds appearing in these references. Moreover, we note that our bounds are obtained by first deriving bounds on the -squared estimation error, which then imply our stated bounds on the error. Hence, squaring our bounds gives valid bounds on the -squared error.
Adaptive linear queries:
We assume the following constraint on any sequence of adaptively chosen queries : for each for some . That is, each vector defining a query has a bounded norm. Unlike the offline setting, since the sequence of the queries is not fixed beforehand (i.e., the queries matrix is not known a priori), the above constraint is more natural than constraining a quantity related to the norm of the queries matrix as we did in the offline setting. For any , we let , i.e., denote the family of all linear queries satisfying the above constraint. In this setting, we measure accuracy in terms of the true error; that is, the maximum true error in any of the estimates for the queries. (See Section 2.2.2 for a precise definition).
We give a construction of an -LDP protocol that answers any sequence of adaptively chosen queries from . Our protocol attains the optimal estimation error. The optimality follows from the fact that our upper bound matches a lower bound on the same error in the non-adaptive setting given in [DJW13b, Proposition 4]. In our protocol, each user sends only a constant number of bits to the server, namely, bits per user. In our protocol, the set of users is partitioned into disjoint subsets, and each subset is used to answer one query. Roughly speaking, this partitioning technique can be viewed as some version of sample splitting. In contrast, this technique is known to be suboptimal (w.r.t. the estimation error) in the centralized model of differential privacy [BNS16]. Moreover, given the offline lower bound in [DJW13b], our result shows that adaptivity does not pose any extra penalty in the true estimation error for linear queries in the local model. In contrast, it is still not clear whether the same statement can be made in the centralized model of differential privacy. For instance, assuming and then in the centralized model, the best known upper bound on the true estimation error for this problem in the adaptive setting is [BNS16, Corollary 6.1] (which combines [DMNS06] with the generalization guarantees of differential privacy). Whereas in the offline setting, the true error is upper-bounded by (combining [DMNS06] with the standard generalization bound for the offline setting). There is also a gap to be tightened in the other regime of and as well. For example, this can be seen by comparing [BNS16, Corollary 6.3] with the bound attained by the private multiplicative weights algorithm [HR10] in the offline setting.
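The partitioning idea can be illustrated with a short sketch. This is not the paper's protocol: it assumes a standard one-bit randomizer for bounded values, and it lets the server choose each query from the past estimates only. All names (`prob_plus`, `one_bit_randomizer`, `debias`, `answer_adaptively`) are hypothetical.

```python
import math
import random


def prob_plus(q_v, r, eps):
    """Probability of reporting +1 when the user's query value is q_v, |q_v| <= r."""
    c = (math.exp(eps) - 1.0) / (math.exp(eps) + 1.0)
    return 0.5 + 0.5 * c * (q_v / r)


def one_bit_randomizer(q_v, r, eps, rng):
    """One-bit report in {-1, +1}; the two extreme inputs differ by a factor e^eps."""
    return 1 if rng.random() < prob_plus(q_v, r, eps) else -1


def debias(bits, r, eps):
    """Unbiased estimate of the mean query value from the one-bit reports."""
    c = (math.exp(eps) - 1.0) / (math.exp(eps) + 1.0)
    return (r / c) * sum(bits) / len(bits)


def answer_adaptively(user_values, choose_query, d, r, eps, rng):
    """Split users into d disjoint groups; group t answers only the t-th query."""
    group = len(user_values) // d
    estimates = []
    for t in range(d):
        q = choose_query(estimates)  # may depend on all earlier estimates
        users = user_values[t * group:(t + 1) * group]
        bits = [one_bit_randomizer(q[v], r, eps, rng) for v in users]
        estimates.append(debias(bits, r, eps))
    return estimates


if __name__ == "__main__":
    rng = random.Random(0)
    users = rng.choices([0, 1, 2], weights=[0.5, 0.3, 0.2], k=20000)
    first = [1.0, -1.0, 0.5]
    second = [0.0, 1.0, -1.0]
    print(answer_adaptively(users, lambda past: first if not past else second,
                            d=2, r=1.0, eps=1.0, rng=rng))
```

Because every user contributes to exactly one round, each user's privacy loss is that of a single report, and a later query can depend on earlier answers without touching data that has already been used.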
2 Preliminaries and Definitions
2.1 $(\epsilon, \delta)$-Local Differential Privacy
In the local model, an algorithm can access any entry in a private data set only via a randomized algorithm (local randomizer) that, given an index , runs on the input and returns a randomized output to . Such an algorithm satisfies -local differential privacy (-LDP) if the local randomizer satisfies -LDP defined as follows.
Definition 2.1 (-LDP).
A randomized algorithm is -LDP if for any pair and any measurable subset we have
where the probability is taken over the random coins of . The case of is called pure -LDP.
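For reference, the condition in Definition 2.1 can be written in the standard form of approximate local differential privacy. The symbols below are assumed notation and may differ from the paper's: $\mathcal{R}$ is the local randomizer, $v, v'$ are any two inputs, and $\mathcal{O}$ is any measurable set of outputs.

```latex
\Pr\big[\mathcal{R}(v) \in \mathcal{O}\big]
  \;\le\; e^{\epsilon}\, \Pr\big[\mathcal{R}(v') \in \mathcal{O}\big] + \delta
```

Setting $\delta = 0$ gives the pure case mentioned above.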
2.2 Accuracy Definitions
2.2.1 Offline queries
For the non-adaptive (offline) setting, we measure accuracy in terms of the worst-case expected -error in the responses to queries. Let be any (unknown) distribution over a data domain . To simplify presentation, we will overload notation and use to also denote the probability mass function (p.m.f.) of the same distribution, where refers to the probability simplex in defined as
Let denote the set of users’ inputs that are drawn i.i.d. from (this will usually be denoted as ). For any , let ; that is, denote the family of all matrices in whose columns lie in (the -dim ball of radius ). Let be a queries matrix whose rows determine offline linear queries. An -LDP protocol describes a set of procedures executed at each user and the server that eventually produce an estimate for the true answer vector subject to -LDP. Let denote the final estimate vector generated by the protocol for a data set and queries matrix . The true expected error in the estimate when is defined as
where the expectation is taken over the randomness in and the random coins of the protocol.
The worst-case expected -error (with respect to worst-case distribution and worst case queries matrix in ) is defined as
Sometimes, we will consider the worst-case empirical error of an LDP protocol. Given any data set , let denote the histogram (i.e., the empirical distribution) of . The worst-case empirical error of an LDP protocol is defined as
Note that the expectation in this case is taken only over the random coins of .
Optimal non-private estimators for offline linear queries
Given (3), if we have an LDP protocol that has worst-case empirical error , then such a protocol has worst-case true error .
2.2.2 Adaptive queries
For any , we let , i.e., denote the family of all linear queries described by vectors in of norm bounded by . In the adaptive setting, we consider the worst-case expected error in the vector of estimates generated by an LDP protocol for any sequence of adaptively chosen queries . Let be a data set of users’ inputs. Let be an LDP protocol for answering any such sequence. We define the worst-case error as
where denotes the estimate generated by the protocol in the -th round of the protocol.
2.3 Geometry facts
For a convex body , the polar body is defined as . A convex body is symmetric if . The Minkowski norm induced by a symmetric convex body is defined as . The Minkowski norm induced by the polar body of is the dual norm of , and has the form . By Hölder’s inequality, we have .
Let denote the unit ball in . A symmetric convex polytope of vertices that are represented as the columns of a matrix is defined as The dual Minkowski norm induced by the convex symmetric polytope is given by where the last equality is due to the fact that any linear function over a polytope attains its maximum at one of the vertices of the polytope.
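The vertex argument above can be checked numerically. The sketch below assumes that the polytope is the image of the unit $\ell_1$ ball under the vertex matrix (an assumption, since the displayed definition is not reproduced here), so that the dual norm of a vector is the largest absolute inner product with a vertex. The function names are hypothetical.

```python
import random


def dual_norm_polytope(vertices, x):
    """max_j |<a_j, x>| over the vertices a_j (each given as a list of floats)."""
    return max(abs(sum(a * xi for a, xi in zip(vertex, x))) for vertex in vertices)


def random_point_in_polytope(vertices, rng):
    """Sample sum_j w_j a_j with ||w||_1 <= 1, i.e. a point of the symmetric polytope."""
    raw = [rng.uniform(-1.0, 1.0) for _ in vertices]
    scale = sum(abs(w) for w in raw) or 1.0
    shrink = rng.random()  # pushes the point strictly inside with probability 1
    weights = [shrink * w / scale for w in raw]
    dim = len(vertices[0])
    return [sum(weights[j] * vertices[j][i] for j in range(len(vertices)))
            for i in range(dim)]


if __name__ == "__main__":
    rng = random.Random(1)
    verts = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
    x = [0.3, -0.7]
    bound = dual_norm_polytope(verts, x)
    worst = max(abs(sum(yi * xi for yi, xi in zip(random_point_in_polytope(verts, rng), x)))
                for _ in range(1000))
    print(bound, worst)  # no sampled point should exceed the vertex maximum
```

The practical consequence is that the dual norm can be evaluated with one pass over the columns of the matrix, without solving an optimization problem over the whole polytope.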
The following is a useful lemma based on standard analysis that bounds the least-squares estimation error over convex bodies. We restate here the version that appeared in [NTZ13].
Lemma 2.2 (Lemma 1 in [NTZ13]).
Let be a symmetric convex body, and let and for some . Let . Then, we must have
As a direct consequence of the above lemma and the preceding facts, we have the following corollary.
Corollary 2.3.
Let be a symmetric convex polytope of vertices , and let and for some . Let . Then, we must have
2.4 SubGaussian random variables
Definition 2.4 (-subGaussian random variable).
A zero mean random variable is called -subgaussian if for all
Another equivalent version of the definition is as follows: A zero-mean random variable is -subgaussian if for all . It is worth noting that these two versions of the definition are equivalent up to a small constant in (see, e.g., [Bul]).
3 LDP Protocols for Offline Linear Queries
In this section, we consider the problem of estimating offline linear queries under -LDP. For any given , as discussed in Section 2.2.1, we consider a queries matrix ; that is, the columns of are assumed to lie in the ball of radius .
As a warm-up, in Section 3.1, we first describe and analyze an -LDP protocol. Our protocol is simple and is based on (i) perturbing the columns of corresponding to users’ inputs via Gaussian noise and (ii) applying a projection step, when appropriate, to the noisy aggregate, similar to the technique of [NTZ13] in the centralized model. This projection step reduces the error significantly in the regime where (which subsumes the high-dimensional setting when ). In particular, in this regime, our protocol yields an error , which does not depend on and depends only sub-logarithmically on . Moreover, this error is within a factor of from the optimal error in this regime. Hence, this result establishes the possibility of accurate estimation of linear queries with respect to the error in high-dimensional settings. Adapting previously known algorithms (particularly, the ball sampling mechanism of [DJW13b]) does not provide any guarantee better than the trivial error for that problem in the regime where .
In Section 3.2, we give a construction that transforms our algorithm into a pure -LDP algorithm with essentially the same error guarantees. Our transformation is inspired by ideas from [BS15, BNS18]. In particular, [BNS18] gives a generic technique for transforming an -LDP protocol to an -LDP protocol. Our construction can be viewed as a simpler, more direct version of this transformation for the case of linear queries.
3.1 $(\epsilon, \delta)$-LDP Protocol for Offline Linear Queries
We first describe the local randomization procedure carried out by each user . The local randomization is based on perturbation via Gaussian noise ; that is, it can be viewed as an LDP version of the standard Gaussian mechanism [DKM06].
The description of our protocol for linear queries is given in Algorithm 2.
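The overall structure (Gaussian perturbation at each user, averaging at the server, then an optional projection) can be sketched as follows. This is a rough illustration under stated assumptions, not a transcription of Algorithms 1 and 2: the noise scale uses the classical Gaussian-mechanism calibration for sensitivity at most twice the column norm bound (valid for a privacy parameter below 1) and may differ from the paper's constant; the projection is approximated with Frank-Wolfe over the unit $\ell_1$ ball, which is our choice of solver. All function names are hypothetical.

```python
import math
import random


def gaussian_sigma(r, eps, delta):
    """Classical calibration (assumption): L2 sensitivity 2r, eps in (0, 1)."""
    return 2.0 * r * math.sqrt(2.0 * math.log(1.25 / delta)) / eps


def gaussian_randomizer(A, v, sigma, rng):
    """User side: noisy copy of the column of A indexed by the private input v."""
    return [row[v] + rng.gauss(0.0, sigma) for row in A]


def aggregate(reports):
    """Server side: coordinate-wise average of the users' noisy reports."""
    n = len(reports)
    return [sum(rep[i] for rep in reports) / n for i in range(len(reports[0]))]


def project_onto_polytope(A, y, iters=2000):
    """Approximate Euclidean projection of y onto {A w : ||w||_1 <= 1} (Frank-Wolfe)."""
    d, J = len(A), len(A[0])
    z = [0.0] * d  # current point A w, starting from w = 0
    for _ in range(iters):
        resid = [z[i] - y[i] for i in range(d)]
        grad = [sum(A[i][j] * resid[i] for i in range(d)) for j in range(J)]
        j = max(range(J), key=lambda k: abs(grad[k]))
        sign = -1.0 if grad[j] > 0 else 1.0
        direction = [sign * A[i][j] - z[i] for i in range(d)]  # toward best vertex
        denom = sum(di * di for di in direction)
        if denom == 0.0:
            break
        # Exact line search for the quadratic objective, clipped to [0, 1].
        step = -sum(ri * di for ri, di in zip(resid, direction)) / denom
        step = min(1.0, max(0.0, step))
        if step == 0.0:
            break
        z = [z[i] + step * direction[i] for i in range(d)]
    return z


def run_protocol(A, user_values, r, eps, delta, rng, project=True):
    """End-to-end sketch: randomize locally, average, optionally project."""
    sigma = gaussian_sigma(r, eps, delta)
    reports = [gaussian_randomizer(A, v, sigma, rng) for v in user_values]
    y = aggregate(reports)
    return project_onto_polytope(A, y) if project else y


if __name__ == "__main__":
    rng = random.Random(0)
    A = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    users = rng.choices([0, 1, 2], weights=[0.5, 0.3, 0.2], k=5000)
    print(run_protocol(A, users, r=1.0, eps=0.5, delta=1e-5, rng=rng))
```

The projection is useful because the true answer vector always lies inside the polytope, so moving the noisy average onto it cannot increase the distance to the truth.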
We now state and prove the privacy and accuracy guarantees of our protocol. Note that in the local model of differential privacy, the privacy of the entire protocol rests only on the differential privacy of the local randomizers, which we prove now.
Theorem 3.1 (Privacy Guarantee).
Algorithm 1 is -LDP.
Theorem 3.2 (Accuracy of Algorithm 2).
Fix any queries matrix . Let where is the actual histogram of the users’ data set (here, denotes the vector with in the -th coordinate and zeros elsewhere). First, consider the case where . Note that , and hence is a Gaussian random vector with zero mean and covariance matrix . Hence, in this case, it directly follows that where is the worst-case empirical error as defined in (2).
Next, consider the case where . Since is the projection of on the symmetric convex polytope , by Corollary 2.3, it follows that
Hence, we have
As before, note that . Note also that . Hence, for each , is Gaussian with zero mean and variance . By standard bounds on the maximum of Gaussian r.v.s (e.g., see [Rig15]), we have
Hence, in this case, we have .
Putting the two cases above together, we get that is upper-bounded by the expression in the theorem statement.
Note that the term above is swamped by the bound on . This completes the proof.
3.2 $\epsilon$-LDP Protocol for Offline Linear Queries
In this section, we give a pure LDP construction that achieves essentially the same accuracy (up to a constant factor of at most ) as our approximate LDP algorithm above. Our construction is based on a direct transformation of the above approximate LDP protocol into a pure LDP one. Our construction is inspired by the idea of rejection sampling in [BS15, BNS18], and can be viewed as a simpler, more direct version of the generic technique in [BNS18] in the case of linear queries.
In our construction, we assume that . (This is not a loss of generality in most practical scenarios, where we aim at a reasonably strong privacy guarantee.) For any , let where . (Note that the setting of is the same as that of the Gaussian noise used in Algorithm 2 with .)
We now state and prove the privacy and accuracy guarantees of our protocol.
Theorem 3.3 (Privacy Guarantee).
Algorithm 3 is -LDP.
Consider any user . Let be any input of user . Define
where . Note that by the standard analysis of the Gaussian mechanism, we have , where and are set as in Step 2 of Algorithm 3. Now, we note that the output of Algorithm 3 is a function of only the bit . Since differential privacy is resilient to post-processing, it suffices to show that for any and any , we have . First, observe that
We also have
Thus, . Note that for any , . Also, note that since , we have . Hence, this ratio can be upper bounded as
In the last step, we use the fact that and hence, .
Now, we consider the event that . Note that . Hence, we have
We also have
Theorem 3.4 (Accuracy of Algorithm 4).
The high-level idea of the proof can be described as follows. We first show that the number of users who end up sending a signal to the server (i.e., those users with ) is at least a constant fraction of the total number of users (). Hence, the effective reduction in the sample size will not have a pronounced effect on the true error (it can only increase the true expected error by at most a factor ). Next, we show that conditioned on