1 Introduction
Differential privacy [DMNS06] is a rigorous mathematical definition that has emerged as one of the most successful notions of privacy in statistical data analysis. Differential privacy provides a rich and powerful algorithmic framework for private data analysis, which can help organizations mitigate users’ privacy concerns. Two main models for private data analysis are studied in the differential privacy literature: the centralized model and the local model. The centralized model assumes a trusted centralized curator that collects all the personal information and then analyzes it. In contrast, the local model, which dates back to [War65], does not involve a central repository. Instead, each individual holding a piece of private data randomizes her data herself via a local randomizer before it is collected for analysis. This local randomizer is designed to satisfy differential privacy, providing strong privacy protection for each individual. The local model is attractive in many practical and industrial domains since it relieves organizations and companies of the liability of holding and securing their users’ private data. Indeed, in the last few years there have been many successful deployments of locally differentially private algorithms in industry, most notably by Google and Apple [EPK14, TVV17].
In this paper, we study the problem of estimating linear queries under local differential privacy (LDP). Let be a data domain of size . A linear query with respect to is uniquely identified by a vector that describes a linear function where denotes the probability simplex in . In this problem, we have a set of individuals (users), where each user holds a private value drawn independently from some unknown distribution . An entity (server) generates a sequence of linear queries and wishes to estimate, within a small error, the values of these queries over the unknown distribution , i.e., . To do this, the server collects signals from the users about their inputs and uses them to generate these estimates. Due to privacy concerns, the signal sent by each user is generated via a local randomizer that outputs a randomized (privatized) version of the user’s true input in a way that satisfies LDP. The goal is to design a protocol that enables the server to derive accurate estimates for its queries under the LDP constraint. This problem subsumes a wide class of estimation tasks under LDP, including distribution estimation studied in [DJW13b, BS15, DHS15, KBR16, BNST17, YB18, ASZ18] and mean estimation in dimensions [DJW13a, DJW13b].
Nonadaptive versus Adaptive Queries:
In this work, we consider two versions of the above problem. In the nonadaptive (offline) version, the set of queries is decided by the server before the protocol starts (i.e., before users send their signals). In this case, the set of queries can be represented as the rows of a matrix that is published before the protocol starts. In the adaptive version of this problem, the queries are submitted and answered over rounds: one query in each round. Before the start of each round, the server can adaptively choose the query based on the entire history it has seen, i.e., based on all the previous queries and signals from users in the past rounds. This setting is clearly harder than the offline setting. Both distribution estimation and mean estimation over a finite (arbitrarily large) domain can be viewed as special cases of the offline queries model above. In particular, for distribution estimation, the queries matrix is set to , the identity matrix of size (in such a case, the dimensionality ). For -dimensional mean estimation, the columns of are viewed as the set of all realizations of a -dimensional random variable.
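As a small illustration of how both special cases reduce to a matrix-vector product, the sketch below evaluates a queries matrix against a known distribution. The names `p`, `A_dist`, and `A_mean`, and the toy domain of size 4, are assumptions made for this example only.

```python
import numpy as np

# Hypothetical toy distribution over a domain of size 4.
p = np.array([0.1, 0.2, 0.3, 0.4])

# Distribution estimation: the queries matrix is the identity,
# so the vector of query answers is the distribution itself.
A_dist = np.eye(4)
dist_answers = A_dist @ p

# Mean estimation: column j of the queries matrix is the vector value
# taken by the j-th domain element, so A @ p is the mean vector.
A_mean = np.array([[0.0, 1.0, 0.0, 1.0],
                   [0.0, 0.0, 1.0, 1.0]])
mean_answers = A_mean @ p
```

Each row of the matrix is one linear query, and each answer is the inner product of that row with the distribution.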
One of the main challenges in the local model is dealing with high-dimensional settings (i.e., when ). Previous constructions for distribution estimation [DJW13b, KBR16, YB18, ASZ18] and mean estimation [DJW13b] suffer from an explicit polynomial dependence on the dimension in the resulting estimation error.
In this work, we address this challenge and give new constructions for large, natural families of offline linear queries that subsume the above estimation problems. The resulting estimation error has no dependence on in the high-dimensional setting and depends only sublogarithmically on . (Here we consider the true population risk, not the empirical risk; we refer to it as the estimation error and sometimes as the true error.) We also consider the adaptive version of the general linear queries problem, and give a new protocol with optimal error (which is a more natural error criterion in the adaptive setting). We discuss these results below.
1.1 Results and comparison to previous works
The accuracy guarantees of our LDP protocols are summarized in Table 1.
General offline linear queries:
We assume that the norm of any column of the queries matrix is bounded from above by some arbitrary constant . We note that this is a weaker assumption than assuming that the spectral norm of (largest singular value) is bounded by . For any , let denote the collection of all matrices in satisfying this condition. We design an LDP protocol that, given any queries matrix from this family, outputs an estimate for with nearly optimal estimation error (see Section 2.2.1 for the definition of the estimation error). As noted earlier, the resulting estimation error does not depend on in the high-dimensional setting: in particular, in the case where (which subsumes the high-dimensional setting when ). This improves over the upper bound in [DJW13b, Proposition 3] achieved by the ball sampling mechanism proposed therein. The near optimality of our protocol follows from the lower bound in the same reference (see Table 1). To construct our protocol, we start with an LDP protocol that employs the Gaussian mechanism together with a projection technique similar to the one used in [NTZ13] in the centralized model of differential privacy. We show the applicability of this technique in the local model. Next, we transform our LDP construction into a pure LDP construction while maintaining the same accuracy (and the same computational cost). To do this, we give a technique based on rejection sampling ideas from [BS15, BNS18]. In particular, our technique can be viewed as a simpler, more direct version of the generic transformation of [BNS18] tuned to the linear queries problem. For this general setting, we focus on improving the estimation error. We do not consider the problem of optimizing communication or computational efficiency. We think that providing a succinct description of the queries matrix (possibly under more assumptions on its structure) is an interesting problem, which we leave to future work.

[Table 1: accuracy guarantees of our LDP protocols; columns include Problem/Error metric and Lower bound.]

Distribution estimation:
For this special case, we extend the Hadamard Response protocol of [ASZ18] to the high-dimensional setting. This protocol enjoys several computational advantages, particularly, communication and running time for each user. We show that this protocol, when combined with a projection step onto the probability simplex, gives estimation error that depends only sublogarithmically on for all . The resulting error is also tight up to a sublogarithmic factor in . We note that the error bound in [ASZ18] is applicable only in the case where . Our result thus shows the possibility of accurate distribution estimation under the error criterion in the high-dimensional setting. Our bound also improves over the bound of [ASZ18] for all . To the best of our knowledge, existing results do not imply an error bound better than the trivial error in the regime where . It is worth pointing out that the error bound of [ASZ18] is optimal only when . This condition is not explicitly mentioned in [ASZ18]; however, as stated in the same paper, their claim of optimality follows from the lower bound in [YB18], specifically [YB18, Theorem IV]. From this theorem, it is clear that the lower bound is only valid when . Hence, our bound does not contradict the results of these previous works. We also note that the idea of projecting the estimated distribution onto the probability simplex was proposed in [KBR16] (along with a different protocol than that of [ASZ18]). Although [KBR16] show empirically that the projection technique yields improvements in accuracy, no formal analysis or guarantees were provided for the resulting error in this case.
Note that the estimation error bounds in the previous works were derived for the expected squared error, and hence the expressions here are the square roots of the bounds appearing in these references. Moreover, we note that our bounds are obtained by first deriving bounds on the squared estimation error, which then imply our stated bounds on the error. Hence, squaring our bounds gives valid bounds on the squared error.
Adaptive linear queries:
We assume the following constraint on any sequence of adaptively chosen queries : for each for some . That is, each vector defining a query has a bounded norm. Unlike the offline setting, since the sequence of the queries is not fixed beforehand (i.e., the queries matrix is not known a priori), the above constraint is more natural than constraining a quantity related to the norm of the queries matrix as we did in the offline setting. For any , we let ; that is, denotes the family of all linear queries satisfying the above constraint. In this setting, we measure accuracy in terms of the true error; that is, the maximum true error in any of the estimates for the queries (see Section 2.2.2 for a precise definition).
We give a construction of an LDP protocol that answers any sequence of adaptively chosen queries from . Our protocol attains the optimal estimation error. The optimality follows from the fact that our upper bound matches a lower bound on the same error in the nonadaptive setting given in [DJW13b, Proposition 4]. In our protocol, each user sends only a constant number of bits to the server, namely, bits per user. The set of users is partitioned into disjoint subsets, and each subset is used to answer one query. Roughly speaking, this partitioning technique can be viewed as a version of sample splitting. In contrast, this technique is known to be suboptimal (w.r.t. the estimation error) in the centralized model of differential privacy [BNS16]. Moreover, given the offline lower bound in [DJW13b], our result shows that adaptivity does not incur any extra penalty in the true estimation error for linear queries in the local model. It is still not clear whether the same statement can be made in the centralized model of differential privacy. For instance, assuming and , then in the centralized model, the best known upper bound on the true estimation error for this problem in the adaptive setting is [BNS16, Corollary 6.1] (which combines [DMNS06] with the generalization guarantees of differential privacy), whereas in the offline setting, the true error is upper-bounded by (combining [DMNS06] with the standard generalization bound for the offline setting). There is also a gap to be tightened in the other regime of and . For example, this can be seen by comparing [BNS16, Corollary 6.3] with the bound attained by the private multiplicative weights algorithm [HR10] in the offline setting.
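The partitioning idea can be sketched in a few lines. The code below is an illustration only and is not the protocol of this paper: it assumes that each query value lies in [-r, r], and it uses a generic one-bit randomizer (randomized rounding followed by binary randomized response) as a stand-in for the local randomizer. The function names and the callback `choose_query` are hypothetical.

```python
import numpy as np

def one_bit_report(value, r, eps, rng):
    """Privatize a value in [-r, r] into one bit in {-1, +1} (eps-LDP)."""
    # Randomized rounding: the sign is unbiased, E[sign] = value / r.
    sign = 1.0 if rng.random() < 0.5 * (1.0 + value / r) else -1.0
    # Binary randomized response: keep the sign w.p. e^eps / (1 + e^eps).
    keep = rng.random() < np.exp(eps) / (1.0 + np.exp(eps))
    return sign if keep else -sign

def debias_mean(bits, r, eps):
    """Unbiased estimate of the average query value from privatized bits."""
    scale = r * (np.exp(eps) + 1.0) / (np.exp(eps) - 1.0)
    return scale * float(np.mean(bits))

def answer_adaptively(data, choose_query, d, r, eps, seed=0):
    """Answer d adaptive queries, using a fresh block of users per round."""
    rng = np.random.default_rng(seed)
    blocks = np.array_split(rng.permutation(len(data)), d)
    answers = []
    for block in blocks:
        q = choose_query(answers)  # query vector; may depend on past answers
        bits = [one_bit_report(q[data[i]], r, eps, rng) for i in block]
        answers.append(debias_mean(bits, r, eps))
    return answers
```

Because every user contributes to exactly one round, each user's privacy loss is that of a single call to the randomizer, regardless of how the queries are chosen.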
2 Preliminaries and Definitions
2.1 (ε, δ)-Local Differential Privacy
In the local model, an algorithm can access any entry in a private data set only via a randomized algorithm (local randomizer) that, given an index , runs on the input and returns a randomized output to . Such an algorithm satisfies local differential privacy (LDP) if the local randomizer satisfies LDP, defined as follows.
Definition 2.1 (LDP).
A randomized algorithm is LDP if for any pair and any measurable subset we have
where the probability is taken over the random coins of . The case of is called pure LDP.
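For reference, the standard (ε, δ)-LDP condition can be written as below. The symbols are notation assumed here, not taken from the text: $\mathcal{R}$ is the local randomizer, $x, x'$ are any two inputs in the data domain, and $\mathcal{O}$ is any measurable subset of the output range.

```latex
\Pr\left[\mathcal{R}(x) \in \mathcal{O}\right]
  \;\le\; e^{\epsilon}\,\Pr\left[\mathcal{R}(x') \in \mathcal{O}\right] + \delta
```

Setting $\delta = 0$ gives the pure LDP case.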
2.2 Accuracy Definitions
2.2.1 Offline queries
For the nonadaptive (offline) setting, we measure accuracy in terms of the worst-case expected error in the responses to queries. Let be any (unknown) distribution over a data domain . To simplify presentation, we will overload notation and use to also denote the probability mass function (p.m.f.) of the same distribution, where refers to the probability simplex in defined as
Let denote the set of users’ inputs that are drawn i.i.d. from (this will usually be denoted as ). For any , let ; that is, denotes the family of all matrices in whose columns lie in (the -dimensional ball of radius ). Let be a queries matrix whose rows determine offline linear queries. An LDP protocol describes a set of procedures executed at each user and the server that eventually produce an estimate for the true answer vector subject to LDP. Let denote the final estimate vector generated by the protocol for a data set and queries matrix . The true expected error in the estimate when is defined as
where the expectation is taken over the randomness in and the random coins of the protocol.
True error:
The worst-case expected error (with respect to the worst-case distribution and the worst-case queries matrix in ) is defined as
(1) 
Empirical error:
Sometimes, we will consider the worst-case empirical error of an LDP protocol. Given any data set , let denote the histogram (i.e., the empirical distribution) of . The worst-case empirical error of an LDP protocol is defined as
(2) 
Note that the expectation in this case is taken only over the random coins of .
Optimal non-private estimators for offline linear queries
The following is a simple observation that follows from well-known facts in statistical estimation.
(3) 
Note that is an unbiased estimator of . The above bound follows from a simple analysis of the variance of .
Note:
Given (3), if we have an LDP protocol that has worst-case empirical error , then such a protocol has worst-case true error .
2.2.2 Adaptive queries
For any , we let ; that is, denotes the family of all linear queries described by vectors in of norm bounded by . In the adaptive setting, we consider the worst-case expected error in the vector of estimates generated by an LDP protocol for any sequence of adaptively chosen queries . Let be a data set of users’ inputs. Let be an LDP protocol for answering any such sequence. We define the worst-case error as
(4) 
where denotes the estimate generated by the protocol in the th round of the protocol.
2.3 Geometry facts
For a convex body , the polar body is defined as . A convex body is symmetric if . The Minkowski norm induced by a symmetric convex body is defined as . The Minkowski norm induced by the polar body of is the dual norm of , and has the form . By Hölder’s inequality, we have .
Let denote the unit ball in . A symmetric convex polytope of vertices that are represented as the columns of a matrix is defined as The dual Minkowski norm induced by the convex symmetric polytope is given by where the last equality is due to the fact that any linear function over a polytope attains its maximum at one of the vertices of the polytope.
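The vertex characterization of the dual norm is easy to evaluate numerically. The sketch below assumes, as in [NTZ13], that the polytope is the symmetric convex hull of the columns of a matrix, i.e., the hull of the columns and their negations. The function name `dual_norm_polytope` and the symbols `A` and `y` are chosen here for illustration.

```python
import numpy as np

def dual_norm_polytope(A, y):
    """Dual Minkowski norm of y w.r.t. the symmetric hull of A's columns.

    A linear function attains its maximum over a polytope at a vertex,
    so the maximum over the hull of {+a_j, -a_j} is max_j |<a_j, y>|.
    """
    return float(np.max(np.abs(A.T @ y)))
```

The cost is a single matrix-vector product, which is what makes the bound in Corollary 2.3 convenient to work with.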
The following is a useful lemma based on standard analysis that bounds the least squared estimation error over convex bodies. We restate here the version that appeared in [NTZ13].
Lemma 2.2 (Lemma 1 in [NTZ13]).
Let be a symmetric convex body, and let and for some . Let . Then, we must have
As a direct consequence of the above lemma and the preceding facts, we have the following corollary.
Corollary 2.3.
Let be a symmetric convex polytope of vertices , and let and for some . Let . Then, we must have
2.4 Sub-Gaussian random variables
Definition 2.4 (sub-Gaussian random variable).
A zero-mean random variable is called sub-Gaussian if for all
Another equivalent version of the definition is as follows: A zero-mean random variable is sub-Gaussian if for all . It is worth noting that these two versions of the definition are equivalent up to a small constant in (see, e.g., [Bul]).
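For orientation, two commonly used forms of the sub-Gaussian condition for a zero-mean random variable $X$ with parameter $\sigma$ are given below. These are the textbook statements, with notation assumed here; the exact constants used in this paper may differ.

```latex
% Moment generating function form
\mathbb{E}\left[e^{\lambda X}\right] \le e^{\lambda^{2}\sigma^{2}/2}
  \quad \text{for all } \lambda \in \mathbb{R},
\\[4pt]
% Tail form
\Pr\left[\,|X| \ge t\,\right] \le 2\,e^{-t^{2}/(2\sigma^{2})}
  \quad \text{for all } t > 0.
```

A Gaussian random variable with variance $\sigma^{2}$ satisfies both forms with the same $\sigma$.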
3 LDP Protocols for Offline Linear Queries
In this section, we consider the problem of estimating offline linear queries under LDP. For any given , as discussed in Section 2.2.1, we consider a queries matrix ; that is, the columns of are assumed to lie in the ball of radius .
As a warm-up, in Section 3.1, we first describe and analyze an LDP protocol. Our protocol is simple and is based on (i) perturbing the columns of corresponding to users’ inputs via Gaussian noise and (ii) applying a projection step, when appropriate, to the noisy aggregate, similar to the technique of [NTZ13] in the centralized model. This projection step reduces the error significantly in the regime where (which subsumes the high-dimensional setting when ). In particular, in this regime, our protocol yields an error , which does not depend on and depends only sublogarithmically on . Moreover, this error is within a factor of from the optimal error in this regime. Hence, this result establishes the possibility of accurate estimation of linear queries with respect to the error in high-dimensional settings. Adapting previously known algorithms (particularly, the ball sampling mechanism of [DJW13b]) does not provide any guarantee better than the trivial error for that problem in the regime where .
In Section 3.2, we give a construction that transforms our algorithm into a pure LDP algorithm with essentially the same error guarantees. Our transformation is inspired by ideas from [BS15, BNS18]. In particular, [BNS18] gives a generic technique for transforming an approximate LDP protocol into a pure LDP protocol. Our construction can be viewed as a simpler, more direct version of this transformation for the case of linear queries.
3.1 (ε, δ)-LDP Protocol for Offline Linear Queries
We first describe the local randomization procedure carried out by each user . The local randomization is based on perturbation via Gaussian noise ; that is, it can be viewed as an LDP version of the standard Gaussian mechanism [DKM06].
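A minimal sketch of such a Gaussian local randomizer is given below. It is not a transcription of Algorithm 1: it assumes that the columns of the queries matrix have Euclidean norm at most r (so two columns differ by at most 2r), and it uses the classical Gaussian-mechanism calibration, which is valid for ε < 1. The function names are hypothetical.

```python
import numpy as np

def gaussian_sigma(r, eps, delta):
    """Noise scale from the classical Gaussian mechanism (assumes eps < 1).

    The L2 sensitivity is taken to be 2r, the largest distance between
    two columns of norm at most r.
    """
    return 2.0 * r * np.sqrt(2.0 * np.log(1.25 / delta)) / eps

def local_randomizer(A, v, r, eps, delta, rng):
    """Report the column of A indexed by the user's input v, plus noise."""
    sigma = gaussian_sigma(r, eps, delta)
    return A[:, v] + rng.normal(0.0, sigma, size=A.shape[0])
```

Each report is an unbiased, noisy copy of the user's column, so averaging the reports yields an unbiased estimate of the queries matrix applied to the empirical histogram.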
The description of our protocol for linear queries is given in Algorithm 2.
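The server side of such a protocol, averaging the noisy reports and then optionally projecting, can be sketched as follows. This is an illustration under assumptions rather than a transcription of Algorithm 2: the polytope is taken to be the symmetric convex hull of the columns of the queries matrix, and the Euclidean projection is approximated with the Frank-Wolfe method, a solver chosen here for simplicity. The function names are hypothetical.

```python
import numpy as np

def project_onto_polytope(A, z, iters=500):
    """Approximate L2 projection of z onto conv{+a_j, -a_j} (Frank-Wolfe)."""
    y = np.zeros(A.shape[0])               # the origin lies in the polytope
    for _ in range(iters):
        grad = y - z                       # gradient of 0.5 * ||y - z||^2
        scores = A.T @ grad
        j = int(np.argmax(np.abs(scores)))
        s = -np.sign(scores[j]) * A[:, j]  # vertex minimizing <grad, s>
        direction = s - y
        denom = direction @ direction
        if denom == 0.0:
            break
        # Exact line search for the quadratic objective, clipped to [0, 1].
        step = np.clip(-(grad @ direction) / denom, 0.0, 1.0)
        y = y + step * direction
    return y

def server_estimate(A, reports, project=True):
    """Average the users' noisy reports, then optionally project."""
    z = np.mean(reports, axis=0)
    return project_onto_polytope(A, z) if project else z
```

The `project` flag mirrors the "when appropriate" qualifier in the text: the projection is only worthwhile in the regime where the noisy average is likely to fall far outside the polytope.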
We now state and prove the privacy and accuracy guarantees of our protocol. Note that in the local model of differential privacy, the privacy of the entire protocol rests only on the differential privacy of the local randomizers, which we prove now.
Theorem 3.1.
[Privacy Guarantee] Algorithm 1 is LDP.
Proof.
Theorem 3.2 (Accuracy of Algorithm 2).
Proof.
Fix any queries matrix . Let where is the actual histogram of the users’ data set (here, denotes the vector with in the th coordinate and zeros elsewhere). First, consider the case where . Note that , and hence is a Gaussian random vector with zero mean and covariance matrix . Hence, in this case, it directly follows that where is the worst-case empirical error as defined in (2).
Next, consider the case where . Since is the projection of onto the symmetric convex polytope , it follows by Corollary 2.3 that
Hence, we have
As before, note that . Note also that . Hence, for each , is Gaussian with zero mean and variance . By standard bounds on the maximum of Gaussian r.v.s (e.g., see [Rig15]), we have
Hence, in this case, we have .
Putting the two cases above together, we get that is upper-bounded by the expression in the theorem statement.
From (3) in Section 2.2.1 (and the succeeding note), we have
Note that the term above is dominated by the bound on . This completes the proof.
∎
3.2 Pure ε-LDP Protocol for Offline Linear Queries
In this section, we give a pure LDP construction that achieves essentially the same accuracy (up to a constant factor of at most ) as our approximate LDP algorithm above. Our construction is based on a direct transformation of the above approximate LDP protocol into a pure LDP one. Our construction is inspired by the idea of rejection sampling in [BS15, BNS18], and can be viewed as a simpler, more direct version of the generic technique in [BNS18] in the case of linear queries.
In our construction, we assume that . (This is not a loss of generality in most practical scenarios where we aim at a reasonably strong privacy guarantee.) For any , let denote the probability density function of the Gaussian distribution where . (Note that the setting of is the same setting for the Gaussian noise used in Algorithm 2 with .) In Algorithm 3, we describe the local randomization procedure executed independently by every user . Then, we describe our LDP protocol for offline linear queries in Algorithm 4.
We now state and prove the privacy and accuracy guarantees of our protocol.
Theorem 3.3.
[Privacy Guarantee] Algorithm 3 is LDP.
Proof.
Consider any user . Let be any input of user . Define
where . Note that by the standard analysis of the Gaussian mechanism, we have , where and are set as in Step 2 of Algorithm 3. Now, we note that the output of Algorithm 3 is a function of only the bit . Since differential privacy is resilient to post-processing, it suffices to show that for any and any , we have . First, observe that
We also have
Thus, . Note that for any , . Also, note that since , we have . Hence, this ratio can be upper bounded as
In the last step, we use the fact that and hence, .
Now, we consider the event that . Note that . Hence, we have
We also have
Hence,
∎
Theorem 3.4 (Accuracy of Algorithm 4).
The high-level idea of the proof can be described as follows. We first show that the number of users who end up sending a signal to the server (i.e., those users with ) is at least a constant fraction of the total number of users (). Hence, the effective reduction in the sample size will not have a pronounced effect on the true error (it can only increase the true expected error by at most a factor ). Next, we show that conditioned on