Estimation problems in which users keep their personal data private even from data collectors are of increasing interest in large-scale machine learning applications in both industrial [14, 1, 4] and academic [e.g. 34, 3, 23, 9] settings. These notions of privacy are satisfying because a user or data provider can be confident that his or her data will remain private irrespective of what the data collectors do, and they mitigate risks for the collectors themselves, limiting the fallout of hacking or other interference. Because of their importance, a parallel literature on optimality results in local privacy is developing [9, 8, 35].
Yet this theory fails to address a number of important issues. Most saliently, many of these results apply only in settings in which the privatization scheme is non-adaptive, that is, the scheme remains static for all data contributors except in 1-dimensional problems [9, 35, 16]. A second issue is that these results provide meaningful bounds only for certain types of privacy. Typically, the results are sharp only for high levels of privacy (in the language of differential privacy, privacy parameters $\varepsilon \lesssim 1$), as in the papers of Duchi et al., Rohde and Steinberger, and Duchi and Ruan, or at most logarithmic in the dimension $d$; given the promise of privacy amplification in local settings, it is important to address limits in the case that $\varepsilon \gg 1$. With the exception of Duchi and Ruan, they also fail to apply to weakenings of differential privacy (for example, Rényi differential privacy).
We remove many of these restrictions by framing the problem of estimation and learning under local privacy constraints as a problem in the communication complexity of statistical estimation. By doing so, we can build off of a line of sophisticated results due to Zhang et al., Garg et al., and Braverman et al., who develop minimax lower bounds on distributed estimation problems. To set the stage for our results and give intuition for what follows, we recall the intuitive consequences of these results. Each applies in a setting in which $m$ machines each receive a sample from an underlying (unknown) population distribution $P$. These machines then interactively communicate with a central server, or a public blackboard, sending $B$ bits of (Shannon) information in total, so that each sends an average of $B/m$ bits. For $d$-dimensional mean estimation problems, where $X \sim P$ and the goal is to estimate $\theta = \mathbb{E}[X]$, the main consequence of these papers is that the mean-squared error for estimation must scale as $\max\{1, dm/B\} \cdot \mathfrak{M}_m$, where $\mathfrak{M}_m$ is the optimal (communication-unlimited) mean-squared error based on a sample of size $m$. Such scaling is intuitive: to estimate a $d$-dimensional quantity, we expect each machine must send roughly $d$ bits to achieve optimal complexity, and otherwise we receive information about only $B/m$ coordinates. The strength of these results is that, in the most general case, they allow essentially arbitrary interaction between the machines, so long as it is mitigated by the information constraints.
We leverage these results on information-limited statistical estimation to establish lower bounds for locally private estimation. By providing bounds on the information released by locally private protocols—even when data release schemes are adaptive and arbitrarily interactive—we can nearly immediately provide minimax lower bounds on rates of convergence in estimation and learning problems under privacy. By using this information-based-complexity framework, we can simultaneously address each of the challenges we identify in previous work on optimal estimation under privacy constraints, in that our results apply to differential privacy and its weakenings, including concentrated differential privacy [11, 6], Rényi differential privacy, and $(\varepsilon, \delta)$-differential privacy. They also apply to arbitrarily interactive data release scenarios. Roughly, what we show is that so long as we wish to estimate quantities for $d$-dimensional parameters that are “independent” of one another—which we define subsequently—the effective sample size available to a private procedure reduces from $n$ to $n \varepsilon^2 / d$ for all $\varepsilon$-private procedures.
The use of information and communication complexity in determining fundamental limits in differential privacy is not uniquely ours. Indeed, McGregor et al. show that low-error differentially private protocols and low-communication protocols for approximating functions are strongly related. In their case, however, they study low-error approximation of conditional, or sample, quantities, where one wishes to estimate $f(x_1, \ldots, x_n)$ for a function $f$ of the sample itself. Here, as in most work in statistics and learning [33, 36, 9, 5], we provide limits on the ability to estimate functions of the population from which the sample comes.
As a consequence of our lower bounds, we identify a number of open questions for future work. The work in information-limited estimation [37, 17, 5, 19] typically relies strongly on some type of independence among estimands, which allows decoupling approaches to apply. Our results suffer similar restrictions, which, as we show, is essential: when correlations exist among different coordinates of the sample vectors $X_i$, it is often possible to achieve much faster convergence. Thus, we argue for renewed focus on local (non-minimax) notions of complexity [24, 30, 8], which address the difficulty of the particular problem at hand. In classical statistics, the theory of local asymptotic normality and minimaxity addresses these issues; a modern treatment of these ideas in the face of restricted estimators would be valuable.
We index several quantities throughout. We always indicate coordinates of a vector by $j$, and (independent) vectors we index by $i$. We consider private protocols that communicate data in rounds, which we index by time $t$. We define $a_{1:n} := (a_1, \ldots, a_n)$ and $a_{s:t} := (a_s, \ldots, a_t)$, and similarly for superscripts. For distributions $P$ and $Q$, $D_\alpha(P \| Q)$ is the Rényi $\alpha$-divergence [cf. 31].
2 Problem setting and main results
We first describe our problem setting in detail, providing graphical representations of each of our privacy (or communication) settings. We present corollaries of our main lower bounds to highlight their application, then (in Section 3) give the main techniques, which extend Assouad’s method for lower bounds.
2.1 Local privacy and interactivity
In our local privacy setting, we consider $n$ individuals, each with private data $X_i$ for $i = 1, \ldots, n$, and each individual communicates privatized views of $X_i$. In contrast to other work on lower bounds for (local) differential privacy [e.g. 9], we allow the communication of this private data to depend on other data providers’ private data. Thus, in general, we consider communication of privatized data in rounds $t = 1, 2, \ldots$, where the number of rounds $T$ may be infinite, and in round $t$, individual $i$ communicates a private datum $Z_i^{(t)}$. This datum may depend on all the previous individuals’ data in the current round of communication as well as all previous rounds. We visualize this as a blackboard, which after round $t$ collects all the $Z_i^{(t)}$ (and previous blackboards) into $B^{(t)}$; subsequent private variables $Z_i^{(t')}$, $t' > t$, may depend on $B^{(t)}$. Thus, at round $t$, individual $i$ generates the private variable $Z_i^{(t)}$ according to the channel
Figure 1 illustrates this communication scheme over two rounds of communication.
Our main assumptions are that each channel satisfies a particular quantitative privacy definition. Let $\varepsilon \ge 0$ and $\delta \in [0, 1)$. A random variable $Z$ is $(\varepsilon, \delta)$-differentially private for $X$ if, conditional on $X = x$, $Z$ has distribution $Q(\cdot \mid x)$, and for all measurable sets $S$ and all $x, x'$,
$$Q(Z \in S \mid X = x) \le e^{\varepsilon} Q(Z \in S \mid X = x') + \delta.$$
We also say the channel $Q$ is $(\varepsilon, \delta)$-differentially private, and when $\delta = 0$, that $Q$ is $\varepsilon$-differentially private. For $\alpha \in [1, \infty]$, the channel $Q$ is $(\alpha, \varepsilon)$-Rényi differentially private if for all $x, x'$,
$$D_{\alpha}\left(Q(\cdot \mid x) \,\|\, Q(\cdot \mid x')\right) \le \varepsilon.$$
Because the Rényi divergence is non-decreasing in $\alpha$, any $(\alpha, \varepsilon)$-Rényi differentially private channel is also $(\alpha', \varepsilon)$-Rényi private for $\alpha' \le \alpha$, making $(1, \varepsilon)$-Rényi privacy (that is, KL-privacy) the weakest Rényi privacy.
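As a quick numerical sanity check of this monotonicity (an illustration of our own, not part of the formal development; the Bernoulli pair is arbitrary), one can evaluate the closed-form Rényi divergence between two Bernoulli distributions at increasing orders $\alpha$:

```python
import math

def renyi_div(p, q, alpha):
    """Renyi alpha-divergence between Bernoulli(p) and Bernoulli(q), alpha > 1."""
    s = p ** alpha * q ** (1 - alpha) + (1 - p) ** alpha * (1 - q) ** (1 - alpha)
    return math.log(s) / (alpha - 1)

# Divergence grows (weakly) with alpha, so smaller alpha is a weaker guarantee.
divs = [renyi_div(0.7, 0.4, a) for a in (1.5, 2.0, 4.0, 8.0, 32.0)]
assert all(d1 <= d2 + 1e-12 for d1, d2 in zip(divs, divs[1:]))
```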
We consider channel and disclosure scenarios in which users and data providers obtain, in expectation, a given amount of privacy. For shorthand, let $M_i^{(t)} := (B^{(1:t-1)}, Z_{1:i-1}^{(t)})$ be the “messages” coming into the channel generating $Z_i^{(t)}$, so that $Z_i^{(t)}$ is drawn conditional on $(X_i, M_i^{(t)})$ as in Fig. 1. Our lower bounds depend on the assumption that $Z_i^{(t)}$ is private: [Interactive local privacy bounds] For each $i$ and $t$, there exists a function $\varepsilon_i^{(t)}$ such that
for all $x, x'$. Additionally, there exists $\varepsilon < \infty$ such that for all $i$,
where the expectation is taken over the randomness in the private variables.
Assumption 2.1 means that the total amount of private information compromised per person—as measured by the summed KL-divergences—is at most $\varepsilon^2$. This holds irrespective of the interaction patterns between the sequential data releases, so the private variables can interact arbitrarily. Assumption 2.1 is weaker than the assumption that each individual’s data is $\varepsilon$-differentially private: inequality (1) shows that $\varepsilon$-differential privacy implies $\varepsilon^2$-KL-privacy.
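Inequality (1) itself is not reproduced in this excerpt, but the relationship is easy to verify numerically for the canonical $\varepsilon$-differentially private channel, binary randomized response; the bound $\mathrm{KL} \le \varepsilon^2$ checked below is a simple illustration of our own, not the paper's exact constant:

```python
import math

def rr_kl(eps):
    """KL divergence between the two output distributions of binary randomized
    response with parameter eps (report truthfully w.p. e^eps / (1 + e^eps))."""
    p = math.exp(eps) / (1 + math.exp(eps))
    # D_kl(Bern(p) || Bern(1-p)) = (2p - 1) * log(p / (1-p)) = (2p - 1) * eps
    return (2 * p - 1) * eps

# epsilon-DP yields KL-privacy at level at most eps^2 across privacy regimes.
for eps in (0.1, 0.5, 1.0, 2.0, 4.0):
    assert rr_kl(eps) <= eps ** 2
```

Since $2p - 1 = \tanh(\varepsilon/2) \le \varepsilon/2$, the divergence is in fact at most $\varepsilon^2/2$ for this channel.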
We also consider channels that provide $(\varepsilon, \delta)$-differential privacy, and use the following assumption. [Interactive approximate local privacy bounds] The space $\mathcal{Z}$ is finite. For each $i$ and $t$ there exist $\varepsilon_i^{(t)}$ and $\delta_i^{(t)}$ such that the channel generating $Z_i^{(t)}$ is $(\varepsilon_i^{(t)}, \delta_i^{(t)})$-approximately differentially private. Additionally, the $\delta_i^{(t)}$ satisfy
Finally, there exists $\varepsilon < \infty$ such that for all $i$,
where the expectation is taken over the randomness in the private variables. Assumption 1 captures the idea that the probabilities $\delta_i^{(t)}$ of the bad events must be low relative to the privacy levels $\varepsilon_i^{(t)}$. Thus, for example, if all the $\varepsilon_i^{(t)}$ equal a common level $\varepsilon_0$, then we require the $\delta_i^{(t)}$ to be small relative to $\varepsilon_0$. In this protocol, the total privacy loss [10, Appendix B] for individual $i$ is that he or she compromises at most $(\varepsilon, \delta)$-approximate differential privacy.
2.2 Minimax lower bounds on private estimation
Given our definitions of (interactive) privacy and the interactive privacy bounds in Assumptions 2.1 and 1, we may now describe the minimax framework in which we work. Let $\mathcal{P}$ be a collection of distributions on a space $\mathcal{X}$, and let $\theta(P)$ be a parameter of interest. In the classical (non-information-limited) setting, we wish to estimate $\theta(P)$ given observations $X_1, \ldots, X_n$ drawn i.i.d. according to the distribution $P \in \mathcal{P}$. We focus on $d$-dimensional parameters $\theta(P) \in \mathbb{R}^d$, and the performance of an estimator $\widehat{\theta}$ is its expected loss (or risk) for a loss $\ell$,
We elaborate this classical setting with an additional privacy layer. For a sample $X_{1:n}$, any (interactive) channel produces a set of private observations, each from some set $\mathcal{Z}$,
and we consider estimators that depend only on this private sample, which then suffer risk
where the expectation is taken over the i.i.d. observations $X_i$ and the privatized views $Z_i$. For the channel $Q$, we define the channel minimax risk for the family $\mathcal{P}$, parameter $\theta$, and loss $\ell$ by
We prove lower bounds on the quantity (2) for all channels satisfying the local interactive privacy bounds of either Assumption 2.1 or 1, which thus respectively imply lower bounds on estimation for Rényi-locally-differentially private algorithms or -locally differentially private algorithms.
Rather than stating and proving our main theorems, we present a number of corollaries of our main results, all of whose proofs we defer to Section 4, to illustrate the power of the information-based framework we adopt. Our first corollary deals with estimating Bernoulli means. Let
be the Bernoulli distributions on $\{-1, 1\}^d$, with loss $\ell(\widehat{\theta}, \theta) = \sum_{j=1}^d \phi(\widehat{\theta}_j - \theta_j)$ for some symmetric $\phi$ minimized at $0$. There are numerical constants $c_0, c_1 > 0$ such that for any channel satisfying either Assumption 2.1 or Assumption 1 with privacy budget $\varepsilon$,
A second consequence of this result is that, if the private data releases of each individual are $\varepsilon$-differentially private, then inequality (1) implies that for a numerical constant $c$,
An interesting counterpart to the lower bound (3) is that $\varepsilon$-differentially-private channels achieve this risk when $\varepsilon \le d$, and they require no interactivity. At least to within numerical constant factors, weakenings of $\varepsilon$-local differential privacy—down to KL-privacy—provide no gain in estimation utility over fully private mechanisms. Bhowmick et al. [4, Sec. 4.1] exhibit a mechanism (PrivUnit), based on uniform sampling of vectors from spherical caps, that given $x$ satisfying $\|x\|_2 \le r$ samples a vector $Z$ that is $\varepsilon$-differentially private, satisfies $\mathbb{E}[Z \mid x] = x$, and satisfies $\|Z\|_2 \le c r \sqrt{d / \min\{\varepsilon, \varepsilon^2\}}$ for a numerical constant $c$. Taking the radius $r = \sqrt{d}$ (as $x \in [-1, 1]^d$ satisfies $\|x\|_2 \le \sqrt{d}$), the estimator $\widehat{\theta} := \frac{1}{n} \sum_{i=1}^n Z_i$ yields
For the simpler case of KL-privacy, Gaussian noise addition suffices. We have thus characterized the complexity of locally private $d$-dimensional estimation of bounded vectors.
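To see why Gaussian noise suffices for KL-privacy, recall that the KL divergence between two Gaussians with common variance is the squared mean gap over twice the variance; a minimal sketch of our own (assuming scalar data confined to $[-r, r]$, a convention not fixed in this excerpt):

```python
import math
import random

def gaussian_kl_sigma(r, eps):
    """Smallest noise scale making x -> x + N(0, sigma^2) eps^2-KL-private on [-r, r]:
    D_kl(N(x, s^2) || N(x', s^2)) = (x - x')^2 / (2 s^2) <= (2r)^2 / (2 s^2) = eps^2."""
    return math.sqrt(2) * r / eps

def privatize(x, r, eps):
    """Release a KL-private view of x by additive Gaussian noise."""
    return x + random.gauss(0.0, gaussian_kl_sigma(r, eps))

# Worst-case KL divergence between outputs for x, x' in [-r, r] equals eps^2.
sigma = gaussian_kl_sigma(r=1.0, eps=0.5)
worst_case_kl = (2 * 1.0) ** 2 / (2 * sigma ** 2)
assert abs(worst_case_kl - 0.5 ** 2) < 1e-12
```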
By a reduction, the lower bound of Corollary 2.2 applies to logistic regression. In this case, we let $\mathcal{P}$ be the collection of logistic distributions on pairs $(x, y) \in \{-1, 1\}^d \times \{-1, 1\}$, where for a parameter $\theta \in \mathbb{R}^d$, $p_\theta(y \mid x) = (1 + e^{-y \langle \theta, x \rangle})^{-1}$. We take the loss as the gap in prediction risk: for the logistic loss $\ell_\theta(x, y) := \log(1 + e^{-y \langle \theta, x \rangle})$ and risk $L(\theta) := \mathbb{E}[\ell_\theta(X, Y)]$, we define the excess risk $L(\widehat{\theta}) - \inf_\theta L(\theta)$. Let $\mathcal{P}$ be the family of logistic distributions and $\ell$ be the excess logistic risk as above. There exists a numerical constant $c > 0$ such that for any sequence of channels satisfying either Assumption 2.1 or Assumption 1, for all suitably large $n$ we have
Having presented lower bounds on estimation in certain discrete distribution families, it is also of interest to consider continuous distributions with unbounded support. In this case, we consider estimation of both general and sparse Gaussian means, showing results that follow as (more or less) immediate corollaries of our information bounds coupled with those of Garg et al. and Braverman et al. In these cases, the lower bounds hold only for channels satisfying Assumption 2.1, which is intuitive: we use mutual-information-based bounds, and on the (negligible) $\delta$-probability event of a privacy failure under Assumption 1, it is possible to release infinite information.
Let $\mathcal{P}$ be the collection of Gaussian distributions $\mathsf{N}(\theta, \sigma^2 I_d)$, where $\theta \in [-1, 1]^d$ and $\sigma^2$ is known, and consider the squared loss $\ell(\widehat{\theta}, \theta) = \|\widehat{\theta} - \theta\|_2^2$. There exists a numerical constant $c > 0$ such that if $Q$ is any channel satisfying Assumption 2.1,
We demonstrate how to achieve this risk in Section 2.3.1, showing (as is the case for our other results) that it is achievable by differentially private schemes.
2.3 Achievability, information complexity, independence, and correlation
The lower bounds in our corollaries are achievable—we demonstrate each of these here—but we highlight a more subtle question regarding correlation. Each of our lower bounds relies on the independence structure of the data: roughly, all the communication-based bounds we discuss require the coordinates of $X$ to follow a product distribution. The lower bounds in this case are intuitive: we must estimate $d$-dimensional quantities using a limited (average) amount of information per observation, so we expect penalties scaling linearly in $d$ because one coordinate carries no information about the others. In cases where there is correlation, however, we might hope for more efficient estimation; we view this as a major open question in privacy and, more broadly, information-constrained estimators. To that end, we briefly show (Section 2.3.1) that each of our lower bounds in Corollaries 2.2–2.2 is achievable. After this, we mention asymptotic results for sparse estimation (Sec. 2.3.2) and correlated data problems (Sec. 2.3.3).
2.3.1 Achievability by differentially-private estimators
We first demonstrate that the results in each of our corollaries are achievable by $\varepsilon$-differentially private channels with limited interactivity. We have already done so for Corollary 2.2. For Corollary 2.2, Corollary 3.2 of Bhowmick et al. gives the achievability result. For the Gaussian results, we require a small amount of additional work, which we now provide.
We begin by demonstrating a one-dimensional Gaussian estimator. Let $X_i \sim \mathsf{N}(\theta, \sigma^2)$, where $\sigma^2$ is known and $\theta \in [-r, r]$. Now, consider the privatized version $Z_i$ of $\mathrm{sign}(X_i)$ defined by
Then it is clear that $Z_i$ is $\varepsilon$-differentially private and takes values $\pm 1$. Thus, letting $\Phi$ denote the standard Gaussian CDF, we have
Thus, letting $\bar{Z}$ be the average of the $Z_i$, the estimator defined by solving this identity for $\theta$, with $\bar{Z}$ substituted for its expectation, is nearly unbiased. By projecting this quantity onto $[-r, r]$, we achieve our estimator:
This estimator satisfies the following lemma. Let $\widehat{\theta}$ be the estimator (5) for the location family, where $n \varepsilon^2$ is at least some numerical constant. Assume $|\theta| \le r$ and $\varepsilon \le 1$. Then for numerical constants $c_0, c_1$,
See Appendix B.1 for the proof, which is essentially a Taylor expansion.
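Mechanism (4) and estimator (5) are not displayed in this excerpt; the sketch below implements a sign-based randomized-response mechanism and the CDF inversion the text describes, under the assumption (ours) that (4) randomizes $\mathrm{sign}(X_i)$. All names, clipping values, and tolerances are illustrative:

```python
import math
import random
from statistics import NormalDist

def privatize(x, eps):
    """Sign-based randomized response: report sign(x) w.p. e^eps / (1 + e^eps),
    so the output distribution depends on x only through its sign (eps-DP)."""
    p = math.exp(eps) / (1 + math.exp(eps))
    s = 1.0 if x >= 0 else -1.0
    return s if random.random() < p else -s

def estimate(zs, eps, sigma, r):
    """Invert E[Z] = ((e^eps - 1)/(e^eps + 1)) * (2 * Phi(theta/sigma) - 1)
    and project the result onto [-r, r]."""
    zbar = sum(zs) / len(zs)
    a = (math.exp(eps) + 1) / (math.exp(eps) - 1)
    q = 0.5 * (1 + a * zbar)           # plug-in estimate of Phi(theta / sigma)
    q = min(max(q, 1e-9), 1 - 1e-9)    # keep inside (0, 1) for the inverse CDF
    t = sigma * NormalDist().inv_cdf(q)
    return min(max(t, -r), r)

random.seed(0)
theta, sigma, eps, r, n = 0.5, 1.0, 2.0, 2.0, 20000
zs = [privatize(random.gauss(theta, sigma), eps) for _ in range(n)]
theta_hat = estimate(zs, eps, sigma, r)
```

With these (arbitrary) parameters the estimate concentrates tightly around $\theta = 0.5$.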
To achieve an upper bound matching the rate in Corollary 2.2, consider the following simple estimator, which applies when each individual releases data once with some level $\varepsilon$ of differential privacy. We consider the cases $\varepsilon \le d$ and $\varepsilon > d$ separately.
In the case that $\varepsilon \le d$, choose $k = \max\{\lfloor \varepsilon \rfloor, 1\}$ coordinates uniformly at random. On each chosen coordinate, release its privatized sign via mechanism (4) using privacy level $\varepsilon / k$, and use the estimator (5) applied to each coordinate; this mechanism is $\varepsilon$-differentially private, each coordinate (when sampled) takes values $\pm 1$, and so the resulting vector satisfies
When $\varepsilon > d$, we use the $\ell_\infty$-based mechanism of Duchi et al. applied to the vector of signs $\mathrm{sign}(X_i)$, which then releases a vector $Z_i \in \{-c, c\}^d$ for a numerical constant $c$ chosen to guarantee $\mathbb{E}[Z_i \mid X_i] = \mathrm{sign}(X_i)$. Thus each coordinate of $Z_i$ satisfies the conditions of Lemma 5, and applying the inversion (5) to each coordinate independently yields the claimed risk; the KL-privacy level in this setting follows from inequality (1).
2.3.2 Sparse Estimation
We now turn to the first of our settings in which the coordinates exhibit some dependence, assuming individuals guarantee $\varepsilon$-differential privacy for $\varepsilon \le 1$ to make the discussion concrete. Duchi et al. [9, Sec. 4.2.2] achieve the minimax rate to within a logarithmic factor. Consider the sparse Gaussian mean problem, where $X_i \sim \mathsf{N}(\theta, \sigma^2 I_d)$ for a $1$-sparse vector $\theta$. For simplicity, let us consider that $\theta = \mu e_j$ for an unknown index $j$ and that $\sigma^2$ is known; Corollary 2.2 gives the minimax lower bound under $\varepsilon$-differential privacy, while the non-private minimax risk is the exponentially smaller $\sigma^2 \log d / n$.
In the case of a (very) large sample size $n$, however, we observe a different phenomenon: the non-private and private rates coincide, at least in a restricted set of cases. Let us assume that $\theta = \mu e_j$ for an unknown index $j$, and that $\mu \neq 0$ remains fixed as $n \to \infty$. Let the sample be of size $2n$, which we split into halves of size $n$. On the first half, we further split the sample into $d$ bins of size $n / d$; for each bin $j$, we construct a 1-dimensional estimator of the mean of coordinate $j$ via (5), which gives us preliminary estimates $\widehat{\mu}_1, \ldots, \widehat{\mu}_d$, each of which is $\varepsilon$-locally differentially private. Lemma 5 shows that we can identify the non-zero coordinate of $\theta$ by $\widehat{j} := \operatorname{argmax}_j |\widehat{\mu}_j|$ with exponentially high probability. Then, on the second half of the sample, we apply the private estimator (5) to estimate the mean of coordinate $\widehat{j}$. In combination, this yields an estimator $\widehat{\theta}$ that achieves risk of order $\sigma^2 / (n \varepsilon^2)$ for large $n$, while the non-private analogue in this case has risk $\sigma^2 / n$.
We have moved from an exponential gap in the dimension to one that scales only as $1/\varepsilon^2$, as soon as $n$ is large enough. This example is certainly stylized and relies on a particular flavor of asymptotics ($n \to \infty$ with $d$ fixed); we believe this transformation from “independent” structure, with risk scaling linearly in the dimension, to an identified structure with dimension-independent risk, merits more investigation.
2.3.3 Correlated Data
We consider one additional stylized example of strong correlation. Let $u \in \{-1, 1\}^d$ be a known bit vector, and assume the data $X_i = V_i u$, where $V_i \in \{-1, 1\}$ with $\mathbb{E}[V_i] = \nu$ for an unknown $\nu \in [-1, 1]$. Without privacy, the debiased sample mean achieves minimax optimal risk; the error is simply $d$ times that for the one-dimensional quantity $\nu$. In the private case, however, as $u$ is known, the private channel for user $i$ may privatize only the bit $V_i$ using randomized response, setting $Z_i$ as in Eq. (4). Using the debiased private estimate $\widehat{\nu}$ and setting $\widehat{\theta} := \widehat{\nu} u$, the mean squared error of $\widehat{\theta}$ is
In contrast to the case with independent coordinates in Corollary 2.2, here the locally private estimator achieves (to within a factor of $1/\varepsilon^2$) the same risk as the non-private estimator. This example is again special, but it suggests that leveraging correlation structures may close some of the substantial gaps between private and non-private estimation that prevent wider adoption of private estimators.
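The distributional details above are partly elided in this excerpt; the simulation below assumes $X_i = V_i u$ with $V_i \in \{-1, 1\}$, $\mathbb{E}[V_i] = \nu$, and privatizes only the single bit $V_i$ by randomized response, as the text suggests (all names and parameters are our own illustration):

```python
import math
import random

def private_corr_mean(xs, u, eps):
    """Estimate theta = nu * u from x_i = v_i * u by privatizing only the bit
    v_i = x_i[0] * u[0] with eps-DP randomized response, then debiasing."""
    p = math.exp(eps) / (1 + math.exp(eps))
    zs = [(x[0] * u[0] if random.random() < p else -x[0] * u[0]) for x in xs]
    nu_hat = (sum(zs) / len(zs)) / (2 * p - 1)   # debias the randomized response
    return [nu_hat * uj for uj in u]

random.seed(1)
d, n, nu, eps = 8, 40000, 0.3, 1.0
u = [random.choice((-1, 1)) for _ in range(d)]
vs = [1 if random.random() < (1 + nu) / 2 else -1 for _ in range(n)]
xs = [[v * uj for uj in u] for v in vs]
theta_hat = private_corr_mean(xs, u, eps)
```

The per-coordinate error here is exactly the one-dimensional error $|\widehat{\nu} - \nu|$, so the mean squared error is $d$ times a quantity of order $1/(n \varepsilon^2)$.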
3 Lower bounds via information complexity
We begin with an extension of Assouad’s method [2, 36], which transforms a $d$-dimensional estimation problem into one of testing $d$ binary hypotheses, to information-limited settings. We consider a family of distributions $\{P_v\}_{v \in \mathcal{V}}$ indexed by the hypercube $\mathcal{V} = \{-1, 1\}^d$, where nature chooses $V \in \mathcal{V}$ uniformly at random. Conditional on $V = v$, we draw $X_{1:n}$ i.i.d. from $P_v$, from which we obtain the observed (privatized) $Z_{1:n}$. Letting $\theta_v := \theta(P_v)$, we follow Duchi et al. and say that the family $\{P_v\}$ induces a $\delta$-Hamming separation if there exists a function $\mathsf{v}$ mapping estimates to $\{-1, 1\}^d$ such that
An example is illustrative.
[Location families] Let $\{P_\theta\}$ be a family of distributions, each specified by its mean $\theta$, and set $\theta_v = \delta v$ for some $\delta > 0$ and each $v \in \{-1, 1\}^d$. Then for any symmetric, non-decreasing $\phi$ and loss of the form $\ell(\widehat{\theta}, \theta) = \sum_{j=1}^d \phi(\widehat{\theta}_j - \theta_j)$, we have a $\phi(\delta)$-Hamming separation.
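For location families, the separation reduces to the coordinatewise fact that $|\widehat{\theta}_j - \delta v_j| \ge \delta \, 1\{\mathrm{sign}(\widehat{\theta}_j) \ne v_j\}$: if the sign of $\widehat{\theta}_j$ disagrees with $v_j$, the two points lie on opposite sides of zero. A small randomized check of this inequality (our own illustration, with $\phi$ the absolute value):

```python
import random

def hamming_sep_holds(theta_hat, v, delta):
    """Check sum_j |t_j - delta*v_j| >= delta * #{j : sign(t_j) != v_j}."""
    lhs = sum(abs(t - delta * vj) for t, vj in zip(theta_hat, v))
    rhs = delta * sum(1 for t, vj in zip(theta_hat, v) if (t >= 0) != (vj > 0))
    return lhs >= rhs - 1e-12

random.seed(2)
delta = 0.25
for _ in range(100):
    v = [random.choice((-1, 1)) for _ in range(5)]
    theta_hat = [random.uniform(-1, 1) for _ in range(5)]
    assert hamming_sep_holds(theta_hat, v, delta)
```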
Similar separations hold for (strongly) convex risk minimization problems.
[Convex risk minimization] Consider the problem of minimizing a convex risk functional $L_P(\theta) := \mathbb{E}_P[\ell(\theta; X)]$, where $\ell$ is convex in its first argument and the expectation is over $X \sim P$. Now, define the population minimizer $\theta_P := \operatorname{argmin}_\theta L_P(\theta)$, and let the loss be the excess risk $L_P(\widehat{\theta}) - L_P(\theta_P)$. If $L_P$ is $\lambda$-strongly convex in a neighborhood of radius $\delta$ of $\theta_P$, then a straightforward convexity argument yields
Thus, if as in the previous example we can construct distributions $P_v$ such that $\theta_{P_v} = \delta v$, where $v \in \{-1, 1\}^d$, then the family $\{P_v\}$ induces a $\frac{\lambda \delta^2}{2}$-separation in Hamming metric.
Letting $P_{+j}$ and $P_{-j}$ be the marginal distributions of the privatized $Z_{1:n}$ conditional on $V_j = +1$ and $V_j = -1$, respectively, we have Assouad’s method (Duchi et al. [9, Lemma 1] gives this form): [Assouad’s method] Let the conditions of the previous paragraph hold and let $\{P_v\}$ induce a $\delta$-separation in Hamming metric. Then
Consequently, if we can show that the total variation distance $\|P_{+j} - P_{-j}\|_{\mathrm{TV}}$ is small while the separation $\delta$ in (6) is large for our family, we have shown a strong lower bound.
3.1 Strong data processing and information contraction
To prove lower bounds via Lemma 3, we build off of ideas that originate with Zhang et al., which Braverman et al. develop elegantly. Braverman et al. show how strong data processing inequalities, which quantify the information loss in classical information processing inequalities, extend from one observation to multiple observations. They use this to prove lower bounds on the information complexity of distributed estimators, and we show how their results imply strong lower bounds on private estimation. We first provide a definition.
Let $V \to X \to Z$ be a Markov chain, where $V$ takes values in $\{-1, 1\}$, and conditional on $V = v$ we draw $X \sim P_v$, then draw $Z$ conditional on $X$. The strong data processing constant $\beta$ is the smallest value such that for all conditional distributions of $Z$ given $X$,
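As a concrete instance, the binary symmetric channel $\mathrm{BSC}(\delta)$ is known to satisfy a strong data processing inequality with constant $(1 - 2\delta)^2$; the check below verifies the corresponding KL contraction numerically for a few input pairs (an illustration of our own, not the paper's construction):

```python
import math

def kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), p, q in (0, 1)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def bsc(p, delta):
    """Push Bernoulli(p) through a binary symmetric channel with crossover delta."""
    return p * (1 - delta) + (1 - p) * delta

# KL contraction coefficient of BSC(delta) equals (1 - 2*delta)^2.
delta = 0.2
beta = (1 - 2 * delta) ** 2
for p, q in [(0.3, 0.6), (0.1, 0.9), (0.45, 0.55)]:
    assert kl(bsc(p, delta), bsc(q, delta)) <= beta * kl(p, q) + 1e-12
```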
We consider families of distributions where the coordinates of $X$ are independent, dovetailing with Assouad’s method. For $v \in \{-1, 1\}^d$, conditional on $V = v$ we assume that $X \sim P_v := \prod_{j=1}^d P_{v_j}$, a $d$-dimensional product distribution (7). That is, conditional on $v$, the observations are i.i.d., and coordinate $X_j$ depends only on $v_j$, independent of $v_{\setminus j}$. When we have the generation strategy (7), we can use Garg et al.’s and Braverman et al.’s results to prove the following lower bound.
Let $\beta \in [0, 1]$ and consider the Markov chain $V \to X_{1:n} \to Z_{1:n}$, where conditional on $V = v$ the $X_i$ are i.i.d., follow the product distribution (7), and the private variables follow the protocol of Fig. 1. Assume that for each coordinate $j$, the chain $V_j \to X_j \to Z$ satisfies a strong data processing inequality with value $\beta_j \le \beta$ for some $\beta \le 1$. Then for any estimator $\widehat{v}$,
By combining Theorem 3.1 with Lemma 3, we can prove strong lower bounds on minimax rates of convergence if we can both (i) provide a strong data processing constant for the chains $V_j \to X_j \to Z$ and (ii) bound the mutual information $I(X_{1:n}; Z_{1:n})$. We do both presently, but we note that Theorem 3.1 relies strongly on the repeated communication structure in Figure 1 (as does Corollary A, Braverman et al.’s Theorem 3.1 in the sequel). Similar techniques appear challenging in centralized settings.
3.2 Information bounds
To apply Theorem 3.1, the first step is to develop information bounds on private communication. We present here our two main lemmas that accomplish this, based on Assumptions 2.1 and 1. The main result of the section, which follows immediately by combining Lemma 3.2 or Lemma 3.2 with Theorem 3.1, is the following corollary. Let the conditions of Theorem 3.1 hold, and assume additionally that the private variables satisfy Assumption 2.1 or Assumption 1. Then
We begin with the easier case of Assumption 2.1, where we note that such bounds hold in more generality. For example, in centralized settings with Rényi differential privacy, we always have the bound $I(X; Z) \le \varepsilon$ if $Z$ is $(1, \varepsilon)$-Rényi private [6, Lemma 6.2].
Let the private variables satisfy Assumption 2.1. Then
For any Markov chain we have
so we control . We observe that
because $Z_i^{(t)}$ is conditionally independent of the remaining variables given $X_i$ and the incoming messages. Using that
where inequality $(i)$ follows by the convexity of the KL-divergence and the definition of the marginal over $Z_i^{(t)}$, and inequality $(ii)$ by Assumption 2.1. We then sum over $i$ and $t$. ∎
In the more complicated $(\varepsilon, \delta)$-differential privacy cases, we require more care. We present a sequence of lemmas. The linchpin, whose somewhat complex proof (see Appendix B.2) we base on the development of Rogers et al., is the following lemma.
Let $Z$ be an $(\varepsilon, \delta)$-differentially private view of $X$, where $Z$ takes values in the finite set $\mathcal{Z}$. Let and define . If , then
Additionally, if $\delta$ is small enough that