The Right Complexity Measure in Locally Private Estimation: It is not the Fisher Information

06/14/2018 ∙ by John C. Duchi, et al.

We identify fundamental tradeoffs between statistical utility and privacy under local models of privacy in which data is kept private even from the statistician, providing instance-specific bounds for private estimation and learning problems by developing the local minimax risk. In contrast to approaches based on worst-case (minimax) error, which are conservative, this allows us to evaluate the difficulty of individual problem instances and delineate the possibilities for adaptation in private estimation and inference. Our main results show that the local modulus of continuity of the estimand with respect to the variation distance—as opposed to the Hellinger distance central to classical statistics—characterizes rates of convergence under locally private estimation for many notions of privacy, including differential privacy and its relaxations. As consequences of these results, we identify an alternative to the Fisher information for private estimation, giving a more nuanced understanding of the challenges of adaptivity and optimality, and provide new minimax bounds for high-dimensional estimation showing that even interactive locally private procedures suffer poor performance under weak notions of privacy.


1 Introduction

The increasing collection of data at large scale—medical records, location information from cell phones, internet browsing history—points to the importance of a deeper understanding of the tradeoffs inherent between privacy and the utility of using collected data. Classical mechanisms for preserving privacy, such as permutation, small noise addition, releasing only mean information, or basic anonymization, are insufficient, and notable privacy compromises with genomic data [30] and movie rating information [39] have caused the NIH to temporarily stop releasing genetic information and Netflix to cancel a proposed competition for predicting movie ratings. Balancing the tension between utility and the risk of disclosure of sensitive information is thus essential.

In response to these challenges, researchers in the statistics, databases, and computer science communities have studied differential privacy [54, 25, 24, 23, 27, 21] as a formalization of disclosure risk limitation, providing strong privacy guarantees. This literature discusses two notions of privacy: local privacy, in which data is privatized before it is even shared with a data collector, and central privacy, where a centralized curator maintains the sample and guarantees that any information it releases is appropriately private. The local model is stronger, and consequently it is more challenging to develop statistically efficient algorithms. Yet the strong privacy protections local privacy provides encourage its adoption. Whether for ease of compliance with regulatory strictures, for example with European Union privacy rules [48]; for reasons of transparency and belief in the importance of privacy; or to avoid the risks proximate to holding sensitive data, such as hacking or subpoena, because private data never leaves an individual's device in the clear, major technology companies have adopted local differential privacy protections in their data collection and machine learning tools. Apple provides local differential privacy in many of its iPhone systems [3], and Google has built systems supplying central and local differential privacy [47, 1]. The broad impact of privacy protections in billions of devices suggests we should carefully understand the fundamental limitations and possibilities of learning with local notions of privacy.

To address this challenge, we study the local minimax complexity of estimation and learning under local notions of privacy. Worst-case notions of complexity may be too stringent for statistical practice [21], and in real-world use, we wish to understand how difficult the actual problem we have is, and whether we can adapt to this problem difficulty, so that our procedures more efficiently solve easy problems as opposed to being tuned to worst-case notions of difficulty. Our adoption of local minimax complexity is thus driven by three desiderata: we seek fundamental limits on estimation and learning that (i) are instance specific, applying to the particular problem at hand, (ii) are uniformly attainable, in that there exist adaptive procedures to achieve the instance-specific difficulty, and (iii) have super-efficiency limitations, so that if a procedure achieves better behavior than the lower bounds suggest is possible, there should be problem instances in which the procedure must have substantially worse behavior. In this paper, we provide a characterization of the difficulty of estimation of one-dimensional quantities under local privacy that satisfies these desiderata.

The celebrated Le Cam–Hájek local asymptotic minimax theory [26, 32, 34, 50, 35] cleanly delineates efficient from inefficient estimators in classical statistics and highlights the importance of local notions of optimality (making Fisher information bounds rigorous). As an example encapsulating the differences between global and local minimax complexity, consider the one-dimensional logistic regression problem of predicting $y \in \{-1, 1\}$ from $x \in \{-1, 1\}$, with $p_\theta(y \mid x) = (1 + e^{-y \theta x})^{-1}$, where—taking our motivation from applications of machine learning [28, 3, 1]—we wish only to accurately estimate $\mathrm{sign}(\theta)$. This problem is easier the larger $|\theta|$ is, and a calculation shows that the maximum likelihood estimator has misclassification error decreasing exponentially in $n\theta^2$; a fully minimax analysis provides a lower bound at $\theta = 0$, or random guessing, with convergence lower bound independent of $\theta$. For most applications of statistical learning, we hope our model substantially outperforms random guessing, so such (global worst-case) analyses are of limited utility in the design of (near-) optimal procedures. To that end, any practicable theory of optimal private estimation should encapsulate a local notion of problem difficulty.
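To make the example concrete, the following is a minimal simulation sketch (ours, not the paper's): it assumes the model $p_\theta(y \mid x) = (1 + e^{-y\theta x})^{-1}$ with $x$ uniform on $\{-1, 1\}$ as above, and uses the fact that the sign of the one-dimensional MLE equals the sign of $\sum_i x_i y_i$. The function names are our own.

```python
import numpy as np

rng = np.random.default_rng(0)

def sign_error_rate(theta, n, trials=2000):
    """Monte Carlo estimate of P(sign(theta_hat) != sign(theta)) for
    one-dimensional logistic regression with x uniform on {-1, +1}."""
    errors = 0
    for _ in range(trials):
        x = rng.choice([-1.0, 1.0], size=n)
        p_plus = 1.0 / (1.0 + np.exp(-theta * x))       # P(Y = +1 | x)
        y = np.where(rng.random(n) < p_plus, 1.0, -1.0)
        # The log-likelihood theta -> sum_i log(sigmoid(theta * x_i * y_i))
        # is concave with derivative at zero proportional to sum_i x_i y_i,
        # so sign(theta_MLE) = sign(sum_i x_i y_i).
        errors += np.sign(np.sum(x * y)) != np.sign(theta)
    return errors / trials

for theta in [0.25, 0.5, 1.0, 2.0]:
    print(f"theta={theta}: error rate ~ {sign_error_rate(theta, n=100):.4f}")
# The error rate falls off rapidly in |theta|, while a global minimax
# analysis is pinned to the hardest case theta ~ 0.
```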

1.1 Contributions, outline, and related work

Our development of instance-specific (local) notions of problem complexity under privacy constraints allows us to more precisely quantify the statistical price of privacy. Identifying the tension here is of course of substantial interest, and Duchi, Jordan, and Wainwright [21, 20] develop a set of statistical and information-theoretic tools for understanding the minimax risk in locally differentially private settings, providing the point of departure for our work. To understand their approach and ours, let us formally define our setting.

We have i.i.d. data $X_1, \ldots, X_n$ drawn according to a distribution $P$ on a space $\mathcal{X}$. Instead of observing the original sample $X_{1:n}$, however, the statistician or learner sees only privatized data $Z_1, \ldots, Z_n$, where $Z_i$ is drawn from a Markov kernel $Q$ conditional on $X_i$ (following information-theoretic parlance, we often call $Q$ the privacy channel [13]; in the privacy literature $Q$ is the mechanism [24]). In full generality, we allow the channel to be sequentially interactive [21], meaning that at observation $i$, the channel may depend on the previous (private) observations $Z_1, \ldots, Z_{i-1}$. That is, we have

$$Z_i \mid X_i = x,\; Z_1 = z_1, \ldots, Z_{i-1} = z_{i-1} \;\sim\; Q(\cdot \mid x, z_{1:i-1}). \qquad (1)$$

This notion of interactivity is important for procedures, such as stochastic gradient methods [21] or the one-step-corrected estimators we develop in the sequel, which modify the mechanism after some number of observations to more accurately perform inference.
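As a structural sketch (our own, not the paper's), a sequentially interactive mechanism is a loop in which the kernel applied to $X_i$ is chosen as a function of the already-released $Z_{1:i-1}$; the helper names below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

def run_interactive(xs, make_kernel):
    """Release Z_1, ..., Z_n as in Eq. (1): each Z_i is drawn from a kernel
    that may depend on the past private outputs Z_{1:i-1} (but never on the
    raw X_j for j != i)."""
    zs = []
    for x in xs:
        kernel = make_kernel(zs)     # choose Q(. | x, z_{1:i-1})
        zs.append(kernel(x, rng))    # Z_i ~ Q(. | X_i = x, Z_{1:i-1})
    return zs

# Toy example: randomized response around a running estimate, a stand-in
# for the one-step-corrected estimators mentioned above.
def make_kernel(past_zs):
    center = np.mean(past_zs) if past_zs else 0.0
    def kernel(x, rng, eps=1.0):
        p_true = np.exp(eps) / (1 + np.exp(eps))
        bit = float(x > center)      # 1 if x lies above the current center
        return bit if rng.random() < p_true else 1 - bit
    return kernel

print(run_interactive(rng.normal(size=5), make_kernel))
```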

The statistical problems we consider are, abstractly, as follows. Let $\mathcal{P}$ be a family of distributions, and let $\theta(P) \in \Theta$ be a parameter belonging to a parameter space $\Theta$ that we wish to estimate, where $\theta(P)$ denotes the estimand. Let $L : \Theta \times \mathcal{P} \to \mathbb{R}_+$ be a loss function measuring the loss of an estimated value $\theta$ for the distribution $P$, where we assume that $L(\theta(P), P) = 0$ for all distributions $P \in \mathcal{P}$. As an example, we may consider the mean $\theta(P) = \mathbb{E}_P[X]$ and squared error $L(\theta, P) = (\theta - \theta(P))^2$. Let $\mathcal{Q}$ be a collection of appropriately private channels, for example, $\varepsilon$-differentially private channels (which we define in the sequel). The private minimax risk [21] is

$$\mathfrak{M}_n(\mathcal{P}, L, \mathcal{Q}) := \inf_{\hat\theta, Q \in \mathcal{Q}} \sup_{P \in \mathcal{P}} \mathbb{E}_{M_P^Q}\big[L(\hat\theta(Z_{1:n}), P)\big], \qquad (2)$$

where $M_P^Q$ denotes the marginal distribution of $Z_{1:n}$ when $X_i \stackrel{\mathrm{iid}}{\sim} P$ and $Z_i$ is drawn conditionally via the channel (1). Duchi et al. [21] provide upper and lower bounds on this quantity when $\mathcal{Q}$ is the collection of $\varepsilon$-locally differentially private channels, developing strong data processing inequalities to quantify the costs of privacy.

The worst-case nature of the formulation (2) may suggest lower bounds that are too pessimistic for practice, and it does not allow a characterization of problem-specific difficulty, which is important for a deeper understanding of adaptive and optimal procedures. Accordingly, we adopt a local minimax approach, which builds out of the classical statistical literature on hardest one-dimensional alternatives that begins with Stein [44, 4, 16, 17, 18, 9, 11]. In the same setting as the above, we define the local minimax risk at the distribution $P_0$ for the set of channels $\mathcal{Q}$ as

$$\mathfrak{M}_n^{\mathrm{loc}}(P_0, L, \mathcal{Q}) := \sup_{P_1 \in \mathcal{P}} \inf_{\hat\theta, Q \in \mathcal{Q}} \max_{P \in \{P_0, P_1\}} \mathbb{E}_{M_P^Q}\big[L(\hat\theta(Z_{1:n}), P)\big]. \qquad (3)$$

The quantity (3) measures the difficulty of the loss minimization problem for a particular distribution $P_0$ under the privacy constraints that $\mathcal{Q}$ characterizes, and at this distinguished distribution, we look for the hardest alternative distribution $P_1$.

To situate our contributions, let us first consider the non-private variants of the minimax complexity (2) and local minimax complexity (3), when $Q$ is the identity mapping and we use the squared error loss $L(\theta, P) = (\theta - \theta(P))^2$. In a classical setting, in which we wish to estimate a linear function $\theta = v^T \vartheta$ of a parameter $\vartheta$ in a parametric family with Fisher information matrix $I_\vartheta$, the Fisher information bound [35] for the parameter is that the local minimax risk scales as $v^T I_\vartheta^{-1} v / n$.

More generally, if we wish to estimate a functional $\theta(P)$ of a distribution $P \in \mathcal{P}$, Donoho and Liu [16, 17, 18] show how the modulus of continuity takes the place of the classical information bound. Again considering the squared error, define the modulus of continuity of $\theta$ over $\mathcal{P}$ with respect to Hellinger distance by

$$\omega_{\mathrm{hel}}(\delta) := \sup\big\{ |\theta(P_1) - \theta(P_0)| : d_{\mathrm{hel}}(P_0, P_1) \le \delta,\; P_0, P_1 \in \mathcal{P} \big\}, \qquad (4)$$

where $d_{\mathrm{hel}}^2(P_0, P_1) := \frac{1}{2} \int (\sqrt{dP_0} - \sqrt{dP_1})^2$. Then under mild regularity conditions, the minimax risk satisfies

$$\mathfrak{M}_n(\mathcal{P}, L) \asymp \omega_{\mathrm{hel}}(1/\sqrt{n})^2,$$
which highlights that separation in Hellinger distance precisely governs problem difficulty in non-private classical statistical problems. In the local minimax case, similar characterizations via a local modulus of continuity are available in some problems, including estimation of the value of a convex function [9] and stochastic optimization [11].

In contrast, the work of Duchi et al. [21, 20] suggests that for $\varepsilon$-locally differentially private estimation, we should replace the Hellinger distance by the variation distance. In the case of higher-dimensional problems, there are additional dimension-dependent penalties in estimation that local differential privacy makes unavoidable, at least in a minimax sense [21]. In work independent of and contemporaneous to our own, Rohde and Steinberger [43] build off of [21] to show that (non-local) minimax rates of convergence under $\varepsilon$-local differential privacy are frequently governed by a modulus of continuity (4), except that the variation distance $\|P_0 - P_1\|_{\mathrm{TV}}$ replaces the Hellinger distance $d_{\mathrm{hel}}(P_0, P_1)$. Rohde and Steinberger also exhibit a mechanism that is minimax optimal for “nearly” linear functionals based on randomized response [54, 43, Sec. 4]. Thus, locally differentially private procedures give rise to a different geometry than classical statistical problems.

Now we are in a position for a high-level description of our results. Our results apply in a variety of locally private estimation settings, whose definitions we formalize in Section 2, but all of them consist of weakenings of $\varepsilon$-differential privacy (including concentrated and Rényi differential privacy [24, 22, 8, 38]). We provide a precise characterization of the local minimax complexity (3) in these settings. If we define the local modulus of continuity (for the squared error) at $P_0$ by

$$\omega_{\mathrm{TV}}(\delta; P_0) := \sup\big\{ |\theta(P) - \theta(P_0)| : \|P - P_0\|_{\mathrm{TV}} \le \delta,\; P \in \mathcal{P} \big\},$$

then a consequence of our Theorem 3.1 is that for the squared loss and the family $\mathcal{Q}_\varepsilon$ of $\varepsilon$-locally private channels,

$$\mathfrak{M}_n^{\mathrm{loc}}\big(P_0, (\cdot)^2, \mathcal{Q}_\varepsilon\big) \asymp \omega_{\mathrm{TV}}\Big(\frac{1}{\sqrt{n \varepsilon^2}};\, P_0\Big)^2.$$
We provide this characterization in more detail and for general losses in Section 3. Moreover, we show a super-efficiency result: any procedure that achieves risk better than the local minimax complexity at a distribution $P_0$ must suffer higher risk at another distribution $P_1$, so that this characterization does indeed satisfy our desiderata of an instance-specific complexity measure.

The departure of these risk bounds from the typical Hellinger-based moduli of continuity (4) has consequences for locally private estimation and adaptivity of estimators, which we address via examples in Section 4. For instance, instead of the classical Fisher information, an alternative we term the $L^1$-information,

$$\big(\mathbb{E}_{P_\theta}\big[|\dot\ell_\theta(X)|\big]\big)^2, \quad \text{where } \dot\ell_\theta \text{ denotes the score function},$$

characterizes the complexity of locally private estimation: the classical Fisher information bounds are unobtainable. A challenging consequence of these results is that, for some parametric models (including Bernoulli estimation and binomial logistic regression), the local complexity (3) is independent of the underlying parameter: problems that are easy (in the classical Fisher information sense) are never easy under local privacy constraints. Our proofs, building off of those of Duchi et al. [21], rely on novel Markov contraction inequalities for divergence measures, which strengthen classical strong data processing inequalities [12, 14].
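To see why easy problems stop being easy, compare the two information quantities for a Bernoulli($\theta$) model; the snippet below (our own illustration, taking the $L^1$-information to be the squared expected absolute score as displayed above) shows the Fisher information diverging near the boundary while the $L^1$-information stays flat.

```python
def fisher_info(theta):
    # Classical Fisher information for Bernoulli(theta).
    return 1.0 / (theta * (1.0 - theta))

def l1_info(theta):
    # Squared expected absolute score, E_theta[|d/dtheta log p_theta(X)|]^2.
    # The score is 1/theta at x = 1 and -1/(1 - theta) at x = 0, so
    # E|score| = theta * (1/theta) + (1 - theta) * (1/(1 - theta)) = 2.
    e_abs_score = theta * (1.0 / theta) + (1.0 - theta) * (1.0 / (1.0 - theta))
    return e_abs_score ** 2

for theta in [0.5, 0.9, 0.99]:
    print(f"theta={theta}: Fisher = {fisher_info(theta):6.1f}, L1 = {l1_info(theta):.1f}")
# Fisher information blows up as theta -> 1 (classically, estimation gets
# easier there), but the L1-information is identically 4, matching the claim
# that Bernoulli estimation is never easy under local privacy.
```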

Developing adaptive procedures uniformly achieving the instance-specific local minimax risk (3) is challenging, but we show that such optimal design is possible in a number of cases in Section 4, including well- and mis-specified exponential family models, using an extension of classical one-step corrected estimators. We compare these locally optimal procedures with the minimax optimal procedures Duchi et al. [21] propose on a protein expression-prediction problem in Section 6; the experimental results suggest that the local minimax perspective indeed outperforms the global minimax procedures, though the costs of privacy remain nontrivial.

Lastly, because we consider weaker notions of privacy, one might ask whether it is possible to improve the minimax bounds that Duchi et al. [21, 20] develop. Unfortunately, this appears impossible (see Section 5). Duchi et al. show only that non-interactive privacy mechanisms (i.e., the channel for $Z_i$ may depend only on $X_i$ and not the past observations $Z_{1:i-1}$) must suffer poor performance in high dimensions under stringent differential privacy constraints. Our results show that this is unavoidable, even with weaker notions of privacy and allowing interactive mechanisms. We provide some additional discussion and perspective in the closing of the paper in Section 7.

2 Definitions of privacy

Our starting point is a formalization of our notions of local privacy. With the notion (1) of sequentially interactive channels, where the $i$th private observation $Z_i$ is drawn conditionally on the past as $Z_i \sim Q(\cdot \mid X_i, Z_{1:i-1})$, we consider several notions of privacy, going from the strongest to the weakest. First is local differential privacy, which Warner [54] first proposed (implicitly) in his 1965 work on survey sampling and which Evfimievski et al. [25] and Dwork et al. [24] later defined explicitly.

Definition 2.1. The channel $Q$ is $\varepsilon$-locally differentially private if for all $i$, all $x, x' \in \mathcal{X}$, and all $z_{1:i-1}$, we have

$$\sup_{S} \frac{Q(Z_i \in S \mid X_i = x, z_{1:i-1})}{Q(Z_i \in S \mid X_i = x', z_{1:i-1})} \le e^{\varepsilon}.$$

The channel is non-interactive if

$$Q(\cdot \mid x, z_{1:i-1}) = Q(\cdot \mid x)$$

for all $x$ and $z_{1:i-1}$.

Duchi et al. [21] consider this notion of privacy, developing its consequences for minimax optimal estimation. It is a satisfying definition from a privacy point of view, and an equivalent view is that an adversary knowing the data is either $x$ or $x'$ cannot accurately test, even conditional on the output $Z_i$, whether the generating data was $x$ or $x'$. To mitigate the consequent difficulties for estimation and learning with differentially private procedures, researchers have proposed a number of weakenings of Definition 2.1, which we also consider.
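The canonical $\varepsilon$-locally differentially private mechanism for a single bit is Warner's randomized response [54]; a minimal sketch (our own code):

```python
import numpy as np

rng = np.random.default_rng(2)

def randomized_response(x, eps):
    """epsilon-LDP release of a bit x in {0, 1}: answer truthfully with
    probability e^eps / (1 + e^eps), otherwise flip the bit."""
    p_true = np.exp(eps) / (1.0 + np.exp(eps))
    return x if rng.random() < p_true else 1 - x

# Privacy check: for any output z, the likelihood ratio Q(z|x) / Q(z|x')
# is either 1 or e^{+/-eps}, so Definition 2.1 holds with parameter eps.
eps = 1.0
p_true = np.exp(eps) / (1 + np.exp(eps))
print(p_true / (1 - p_true), "==", np.exp(eps))
```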

To that end, a second notion of privacy, which Dwork and Rothblum [22] propose and Bun and Steinke [8] develop, reposes on Rényi divergences. For $\alpha \in [1, \infty]$, the Rényi divergence of order $\alpha$ is

$$D_\alpha\big(P \,\|\, Q\big) := \frac{1}{\alpha - 1} \log \int \Big(\frac{dP}{dQ}\Big)^{\alpha}\, dQ,$$

where for $\alpha = 1$ one takes the downward limit as $\alpha \downarrow 1$, yielding $D_1(P \| Q) = D_{\mathrm{kl}}(P \| Q)$. We then have the following definition.

Definition 2.2. The channel $Q$ is $\rho$-zero-concentrated locally differentially private (zCDP) if for all $\alpha \in (1, \infty)$, $x, x' \in \mathcal{X}$, and $z_{1:i-1}$, we have

$$D_\alpha\big(Q(\cdot \mid x, z_{1:i-1}) \,\|\, Q(\cdot \mid x', z_{1:i-1})\big) \le \rho\, \alpha.$$

An equivalent definition is that the log likelihood ratio between the channels is sub-Gaussian: defining the privacy loss $L := \log \frac{dQ(\cdot \mid x, z_{1:i-1})}{dQ(\cdot \mid x', z_{1:i-1})}(Z)$, for $\lambda \ge 0$ we have

$$\mathbb{E}\big[e^{\lambda (L - \rho)}\big] \le e^{\lambda^2 \rho}$$

for all $x, x' \in \mathcal{X}$ (and all $z_{1:i-1}$). Mironov [38] proposes a natural relaxation of Definition 2.2, suggesting that we require it hold only for a single fixed order $\alpha$. This yields

Definition 2.3. The channel $Q$ is $(\alpha, \varepsilon)$-Rényi locally differentially private if for all $x, x' \in \mathcal{X}$ and $z_{1:i-1}$, we have

$$D_\alpha\big(Q(\cdot \mid x, z_{1:i-1}) \,\|\, Q(\cdot \mid x', z_{1:i-1})\big) \le \varepsilon.$$
Perhaps the most salient point in Definition 2.3 is the choice $\alpha = 2$, which will be important in our subsequent analysis. Consider a prior distribution $\pi$ on two points $x_1, x_2 \in \mathcal{X}$, represented by $\pi(x_1)$ and $\pi(x_2)$, and then consider the posteriors $\pi(x_1 \mid Z)$ and $\pi(x_2 \mid Z)$ after observing the private quantity $Z$. Then $(2, \varepsilon)$-Rényi privacy is equivalent [38, Sec. VII] to the condition that the prior and posterior odds ratios of $x_1$ against $x_2$ do not change much in expectation:

$$\mathbb{E}\bigg[\frac{\pi(x_1 \mid Z)/\pi(x_2 \mid Z)}{\pi(x_1)/\pi(x_2)}\bigg] \le e^{\varepsilon} \qquad (5)$$

for all two-point priors $\pi$, where the expectation is taken over $Z \sim Q(\cdot \mid x_1)$. (For $\varepsilon$-differential privacy, the inequality holds for all realizations of $Z$ without the expectation.) Because Rényi divergences are monotonic in $\alpha$ (cf. [51, Thm. 3]), any channel that is $(\alpha, \varepsilon)$-Rényi private is also $(\alpha', \varepsilon)$-Rényi private for $\alpha' \le \alpha$.

Our final notion of privacy is based on $f$-divergences and is related to Definition 2.3. Recall that for a convex function $f : \mathbb{R}_+ \to \mathbb{R}$ with $f(1) = 0$, the $f$-divergence between distributions $P$ and $Q$ is

$$D_f\big(P \,\|\, Q\big) := \int f\Big(\frac{dP}{dQ}\Big)\, dQ,$$

which is non-negative and strictly positive when $P \neq Q$ and $f$ is strictly convex at the point $1$. We consider $f$-divergences parameterized by $k \in [1, \infty)$ of the form

$$f_k(t) := |t - 1|^k.$$

Definition 2.4. For $k > 1$, the channel $Q$ is $(\varepsilon, k)$-divergence locally private if for all $x, x' \in \mathcal{X}$ and $z_{1:i-1}$, we have

$$D_{f_k}\big(Q(\cdot \mid x, z_{1:i-1}) \,\|\, Q(\cdot \mid x', z_{1:i-1})\big) \le \varepsilon^k.$$

When $k = 2$, this is the familiar $\chi^2$-divergence [37, 46], and it is equivalent to $(2, \log(1 + \varepsilon^2))$-Rényi local differential privacy. We describe this special situation as $\varepsilon^2$-$\chi^2$-privacy.

The definitions provide varying levels of privacy. It is immediate that if a channel is $\varepsilon$-differentially private, then it is $(e^\varepsilon - 1, k)$-divergence locally private for every $k$, as its likelihood ratios lie in $[e^{-\varepsilon}, e^{\varepsilon}]$; for $\varepsilon \le 1$, this implies $(2\varepsilon, k)$-divergence local privacy. It is also clear that Definition 2.1 is stronger than Definition 2.2, which is stronger than Definition 2.3. We can quantify this as well: $\varepsilon$-differential privacy implies $\frac{\varepsilon^2}{2}$-zero-concentrated differential privacy [8]. For $k = 2$, we also find that if the channel is $\rho$-zCDP, then it is immediate that it satisfies $\varepsilon^2$-$\chi^2$-divergence privacy with $\varepsilon^2 = e^{2\rho} - 1$, where we take $\alpha = 2$ in the definition of the Rényi divergence. Our results all apply for $\varepsilon^2$-$\chi^2$-private channels, so that $\chi^2$-privacy (Definition 2.3 with $\alpha = 2$ or Definition 2.4 with $k = 2$) implies strong lower bounds on estimation.
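The orderings are easy to check numerically for binary randomized response; the computations below (our own, using the parameterizations reconstructed above) evaluate the $\chi^2$, order-2 Rényi, and variation distances between the channel's two output rows.

```python
import numpy as np

def rr_rows(eps):
    """The two conditional output distributions of binary randomized
    response with privacy parameter eps."""
    p = np.exp(eps) / (1 + np.exp(eps))
    return np.array([p, 1 - p]), np.array([1 - p, p])

for eps in [0.25, 0.5, 1.0]:
    q0, q1 = rr_rows(eps)
    chi2 = np.sum((q0 - q1) ** 2 / q1)      # chi^2-divergence between rows
    d2 = np.log(1 + chi2)                   # Renyi divergence of order 2
    tv = 0.5 * np.sum(np.abs(q0 - q1))
    # eps-DP holds by construction; chi^2 stays below (e^eps - 1)^2 as claimed.
    print(f"eps={eps}: chi2={chi2:.4f} <= {(np.exp(eps) - 1) ** 2:.4f}, "
          f"D2={d2:.4f}, TV={tv:.4f}")
```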

3 Local minimax complexity and private estimation

We turn to our main goal of establishing localized minimax complexities for locally private estimation. To that end, we begin in Section 3.1 by defining the modulus of continuity of estimands, showing how it provides a tight lower bound on localized complexity for private estimation. Section 3.2 continues the development of Section 3.1 by establishing a super-efficiency result, showing that any estimator achieving lower risk than our localized modulus of continuity for some distribution $P_0$ must be inefficient on other distributions $P_1$. In Section 3.3, we present the main technical tools that underlie our results, providing new strong data-processing inequalities showing precisely how locally private channels degrade the information in statistical problems.

3.1 The modulus of continuity and local minimax complexities

Recall our setting, where we wish to estimate a parameter $\theta(P)$ of a distribution $P \in \mathcal{P}$, a collection of possible distributions, and we measure the performance of an estimate via a loss $L$ satisfying $L(\theta(P), P) = 0$. We define the “distance” between distributions $P_0$ and $P_1$ for the loss $L$ by

$$d_L(P_0, P_1) := \inf_{\theta \in \Theta}\big\{ L(\theta, P_0) + L(\theta, P_1) \big\},$$

which is always non-negative. As an example, if $\theta$ is 1-dimensional and we use the squared error $L(\theta, P) = (\theta - \theta(P))^2$,

$$d_L(P_0, P_1) = \frac{1}{2}\big(\theta(P_0) - \theta(P_1)\big)^2.$$
More generally, for any symmetric convex function $\ell$, if $L(\theta, P) = \ell(\theta - \theta(P))$,

$$d_L(P_0, P_1) = 2\,\ell\Big(\frac{\Delta}{2}\Big), \qquad (6)$$

where $\Delta := \theta(P_0) - \theta(P_1)$. A similar result holds for general losses; if $\Phi$ is non-decreasing and we measure the parameter error by $L(\theta, P) = \Phi(|\theta - \theta(P)|)$, then

$$d_L(P_0, P_1) \ge \Phi\Big(\frac{|\Delta|}{2}\Big), \qquad (7)$$

as it is no loss of generality to assume that $\theta$ lies between $\theta(P_0)$ and $\theta(P_1)$ in the definition of the distance.

For a family of distributions $\mathcal{P}$, the modulus of continuity associated with the loss $L$ at the distribution $P_0$ is

$$\omega_L(\delta; P_0, \mathcal{P}) := \sup_{P \in \mathcal{P}}\big\{ d_L(P_0, P) : \|P - P_0\|_{\mathrm{TV}} \le \delta \big\}. \qquad (8)$$

As we shall see, this modulus of continuity fairly precisely characterizes the difficulty of locally private estimation of functionals. The key in this definition is that the modulus of continuity is defined with respect to variation distance. This is in contrast to classical results on optimal estimation, where the more familiar modulus of continuity with respect to Hellinger distance characterizes problem difficulty. Indeed, Le Cam’s theory of quadratic mean differentiability, contiguity, local asymptotic normality, and local alternatives for testing all reposes on closeness in Hellinger distance [34, 35, 42, 50], which justifies the use of Fisher information in classical statistical problems. In nonparametric problems, as we mentioned briefly in the introduction, the modulus of continuity of the parameter with respect to Hellinger distance also characterizes minimax rates for estimation of functionals of distributions [4, 16, 17] (at least in a global minimax sense), and in some instances, it governs local minimax guarantees as well [9]. These results all correspond to replacing the variation distance $\|P - P_0\|_{\mathrm{TV}}$ in definition (8) with the Hellinger distance between $P$ and $P_0$. As we illustrate, the difference between the classical Hellinger-based modulus of continuity and ours (8) leads to substantially different behavior for private and non-private estimation problems.

With this, we come to our first main result, which lower bounds the local minimax risk using the modulus (8). We defer the proof, which relies on the strong data-processing inequalities to come, to Section 3.4.3.

Theorem 3.1. Let $\mathcal{Q}$ be the collection of $\varepsilon^2$-$\chi^2$-locally private channels. Then for any distribution $P_0 \in \mathcal{P}$, we have

$$\mathfrak{M}_n^{\mathrm{loc}}(P_0, L, \mathcal{Q}) \ge c\, \omega_L\Big(\frac{c'}{\sqrt{n \varepsilon^2}};\, P_0, \mathcal{P}\Big)$$

for numerical constants $c, c' > 0$. Noting that the modulus of continuity is increasing in its first argument, we thus have a simplified lower bound on the local minimax complexity scaling as the modulus evaluated at $1/\sqrt{n \varepsilon^2}$.
In (nearly) simultaneous independent work, Rohde and Steinberger [43] provide a global minimax lower bound, akin to (2), using a global modulus of continuity with respect to variation distance, extending [16, 17, 18] to the private case. Our focus here is on instance-specific bounds, with the hope that we may calculate practically useful quantities akin to classical information bounds [50, 35].

Achievability in the theorem is a somewhat more delicate argument; demonstrating procedures that achieve the lower bound uniformly is typically nontrivial. With that said, under two reasonable conditions on our loss, distance, and growth of the modulus of continuity, we can show a converse to Theorem 3.1, showing that the modulus indeed describes the local minimax complexity to within numerical constants.

Condition 3.1 (Reverse triangle inequality). There exists $C_\triangle < \infty$ such that for all $P_0, P_1, P_2 \in \mathcal{P}$,

$$d_L(P_0, P_1) \le C_\triangle\, \big( d_L(P_0, P_2) + d_L(P_2, P_1) \big).$$

In the case that the loss is based on the error, $L(\theta, P) = \Phi(|\theta - \theta(P)|)$ for $\Phi$ nondecreasing, the inequality (7) shows that Condition 3.1 holds whenever $\Phi(2t) \le C\,\Phi(t)$ for all $t \ge 0$. In addition, we sometimes use the following condition on the modulus of continuity.

Condition 3.2 (Polynomial growth). At the distribution $P_0$, there exist constants $C_{\mathrm{grow}} < \infty$ and $\beta < \infty$ such that for all $c \ge 1$ and $\delta > 0$,

$$\omega_L(c\,\delta; P_0, \mathcal{P}) \le C_{\mathrm{grow}}\, c^{\beta}\, \omega_L(\delta; P_0, \mathcal{P}).$$

Condition 3.2 is similar to the typical Hölder-type continuity properties assumed on the modulus of continuity for estimation problems [16, 17]. We give examples satisfying Condition 3.2 presently.

The conditions yield the following converse to Theorem 3.1, which shows that the modulus of continuity characterizes the local minimax complexity. See Appendix A.2 for a proof of the result.

Proposition. Let Conditions 3.1 and 3.2 on $d_L$ and $\omega_L$ hold. Let $\varepsilon > 0$ and $n \in \mathbb{N}$, and let $\mathcal{Q}$ be the collection of non-interactive $\varepsilon$-differentially private channels (Definition 2.1). Then

$$\mathfrak{M}_n^{\mathrm{loc}}(P_0, L, \mathcal{Q}) \le C\, \omega_L\Big(\frac{c}{\sqrt{n \varepsilon^2}};\, P_0, \mathcal{P}\Big)$$

for constants $C$ and $c$ that may depend on $C_\triangle$, $C_{\mathrm{grow}}$, and $\beta$.

The proposition as written is a bit unwieldy, so we unpack it slightly. For $\varepsilon \le 1$, the statement says that for a constant $C$ that may depend on $C_\triangle$, $C_{\mathrm{grow}}$, and $\beta$, for each $n$ there exists a non-interactive $\varepsilon$-differentially private channel $Q$ and estimator $\hat\theta$ such that

$$\mathbb{E}_{M_{P_0}^Q}\Big[L\big(\hat\theta(Z_{1:n}), P_0\big)\Big] \le C\, \omega_L\Big(\frac{1}{\sqrt{n \varepsilon^2}};\, P_0, \mathcal{P}\Big).$$

This matches the lower bound in Theorem 3.1 up to a numerical constant.

We briefly discuss one example satisfying Condition 3.2 to demonstrate that we typically expect it to hold, so that the modulus of continuity characterizes the local minimax rates. We also show that (potentially mis-specified) exponential family models satisfy Condition 3.2 in Section 4.4, Lemma 4.4.

Example (modulus of continuity in (nonparametric) mean estimation). Consider the loss $L(\theta, P) = \Phi(|\theta - \theta(P)|)$ where $\Phi$ is nondecreasing. Inequality (7) implies that, using the shorthand

$$\omega_{\mathrm{TV}}(\delta; P_0) := \sup_{P \in \mathcal{P}}\big\{ |\theta(P) - \theta(P_0)| : \|P - P_0\|_{\mathrm{TV}} \le \delta \big\},$$

we have

$$\omega_L(\delta; P_0, \mathcal{P}) \ge \Phi\Big(\frac{\omega_{\mathrm{TV}}(\delta; P_0)}{2}\Big),$$

and assuming that the function $\Phi$ itself satisfies $\Phi(2t) \le C\,\Phi(t)$ for all $t \ge 0$, to show that $\omega_L$ satisfies Condition 3.2, it is sufficient to show that $\omega_{\mathrm{TV}}$ satisfies Condition 3.2.

Now consider the problem of estimating the mean $\theta(P) = \mathbb{E}_P[X]$, where the unknown distribution $P$ belongs to $\mathcal{P} = \{\text{distributions supported on } C\}$ for some compact set $C \subset \mathbb{R}$. Denote $r_0 := \sup_{x \in C} |x - \theta(P_0)|$. We claim the following upper and lower bounds on $\omega_{\mathrm{TV}}$:

$$\delta\, r_0 \;\le\; \omega_{\mathrm{TV}}(\delta; P_0) \;\le\; 2\,\delta\, r_0, \qquad (9)$$

which of course combine to imply Condition 3.2. To see the lower bound (9), for any $x^\star \in C$, define $P_1 := (1 - \delta) P_0 + \delta\, \mathbb{1}_{x^\star}$, where $\mathbb{1}_{x^\star}$ denotes a point mass at $x^\star$. Then $\|P_1 - P_0\|_{\mathrm{TV}} \le \delta$ and $\theta(P_1) - \theta(P_0) = \delta (x^\star - \theta(P_0))$ for all $x^\star \in C$, so $\omega_{\mathrm{TV}}(\delta; P_0) \ge \delta\, r_0$. The upper bound (9) is similarly straightforward: for all $P$ with $\|P - P_0\|_{\mathrm{TV}} \le \delta$, we have

$$|\theta(P) - \theta(P_0)| = \Big| \int \big(x - \theta(P_0)\big)\, (dP - dP_0) \Big| \le r_0 \int |dP - dP_0| \le 2\,\delta\, r_0$$

by the triangle inequality, which is our desired result.
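A quick numerical sanity check of the sandwich (9), using pmfs on a finite grid so that means and variation distances are exact; the grid and base distribution are arbitrary choices of ours.

```python
import numpy as np

rng = np.random.default_rng(3)

grid = np.linspace(-1.0, 1.0, 21)        # compact support C = [-1, 1]
p0 = rng.dirichlet(np.ones(grid.size))    # an arbitrary base pmf P_0
theta0 = grid @ p0
r0 = np.max(np.abs(grid - theta0))        # r_0 = sup_{x in C} |x - theta(P_0)|

def best_point_mass_shift(delta):
    """Largest |theta(P_1) - theta(P_0)| over the mixtures
    P_1 = (1 - delta) * P_0 + delta * (point mass at x*)."""
    best = 0.0
    for i in range(grid.size):
        p1 = (1 - delta) * p0
        p1[i] += delta
        best = max(best, abs(grid @ p1 - theta0))
    return best

for delta in [0.01, 0.05, 0.1]:
    achieved = best_point_mass_shift(delta)
    print(f"delta={delta}: {delta * r0:.4f} <= {achieved:.4f} <= {2 * delta * r0:.4f}")
```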

3.2 Super-efficiency

To demonstrate that the local modulus of continuity is indeed the “correct” lower bound on estimation, we consider the third of the desiderata for a strong lower bound that we identify in the introduction: a super-efficiency result. We provide this via a constrained risk inequality [6, 19]. Our result applies in the typical setting in which the loss is $L(\theta, P) = \Phi(|\theta - \theta(P)|)$ for some increasing function $\Phi$, and we use the shorthand $R_n(\hat\theta, P) := \mathbb{E}_P[L(\hat\theta(Z_{1:n}), P)]$ for the risk (expected loss) of the estimator $\hat\theta$ under the distribution $P$. The starting point for our development is an inequality extending Brown and Low [6, Thm. 1], showing that if $\hat\theta$ has small risk for a parameter $\theta(P_0)$ under a distribution $P_0$, then its risk under a distribution $P_1$ close to $P_0$ may be large (see also [45, Thm. 6]). In the lemma and the remainder of this section, for measures $P_0$ and $P_1$ we define the $\chi^2$-affinity

$$\rho_{\chi^2}(P_0, P_1) := \Big( \int \frac{dP_1^2}{dP_0} \Big)^{1/2} = \big(1 + \chi^2(P_1 \,\|\, P_0)\big)^{1/2}, \qquad (10)$$

which measures the similarity between the distributions $P_0$ and $P_1$. With these definitions, we have the following constrained risk inequality.

Lemma 3.2 ([19], Theorem 1). Let $\theta_0 = \theta(P_0)$ and $\theta_1 = \theta(P_1)$, and define $\Delta := |\theta_1 - \theta_0|$. If the estimator $\hat\theta$ satisfies $\mathbb{E}_{P_0}[(\hat\theta - \theta_0)^2] \le \delta^2$ for some $\delta \ge 0$, then

$$\mathbb{E}_{P_1}\big[(\hat\theta - \theta_1)^2\big] \ge \big( \Delta - \delta\, \rho_{\chi^2}(P_0, P_1) \big)_+^2.$$

The lemma shows that if an estimator has small risk under a distribution $P_0$, then its risk for a nearby (in $\chi^2$-divergence) distribution $P_1$ must be nearly the distance between the associated parameters $\theta(P_0)$ and $\theta(P_1)$.

With Lemma 3.2 in hand, we can prove a super-efficiency result, showing that improvement over our modulus of continuity lower bound at a point implies worse performance elsewhere.

Proposition. Let $Q$ be a sequentially interactive $\varepsilon^2$-$\chi^2$-private channel (Definition 2.3 with $\alpha = 2$ or Definition 2.4 with $k = 2$) with associated marginal distributions $M_P^Q$. Let Condition 3.2 hold with parameter $\beta$. If for some $\eta \in (0, 1]$ the estimator $\hat\theta$ satisfies

$$\mathbb{E}_{M_{P_0}^Q}\Big[L\big(\hat\theta(Z_{1:n}), P_0\big)\Big] \le \eta\, \omega_L\Big(\frac{1}{\sqrt{n \varepsilon^2}};\, P_0, \mathcal{P}\Big),$$

then for all $n$ there exists a distribution $P_1 \in \mathcal{P}$ and a constant $c > 0$ (depending on the problem parameters) such that

$$\mathbb{E}_{M_{P_1}^Q}\Big[L\big(\hat\theta(Z_{1:n}), P_1\big)\Big] \ge c\, \eta^{-c}\, \omega_L\Big(\frac{1}{\sqrt{n \varepsilon^2}};\, P_1, \mathcal{P}\Big).$$
See Section 3.4.4 for a proof.

The proposition depends on a number of constants, but roughly, it shows (for small enough $\eta$, where we simplify by taking $\varepsilon = 1$) that if an estimator is super-efficient at $P_0$, in that its risk is at most $\eta\, \omega_L(1/\sqrt{n}; P_0, \mathcal{P})$, then there exists a constant $c > 0$ such that for some $P_1$ its risk is at least $c\, \eta^{-c}\, \omega_L(1/\sqrt{n}; P_1, \mathcal{P})$. In this sense, our local modulus of continuity bounds are sharp: no estimator can achieve much better risk than the local modulus of continuity at a distribution without paying nontrivial cost elsewhere.
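The phenomenon is visible in a toy, non-private form: an estimator hard-wired toward a particular parameter value is super-efficient there but pays elsewhere. The construction below is ours, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(4)

def risks(theta_true, n=200, trials=5000):
    """Squared-error risks of the sample mean and of the constant
    estimator theta_hat = 0 when X_i ~ N(theta_true, 1)."""
    x = rng.normal(theta_true, 1.0, size=(trials, n))
    mean_risk = np.mean((x.mean(axis=1) - theta_true) ** 2)
    const_risk = (0.0 - theta_true) ** 2    # deterministic, ignores the data
    return mean_risk, const_risk

for theta in [0.0, 0.2, 1.0]:
    m, c = risks(theta)
    print(f"theta={theta}: sample mean {m:.4f}, constant {c:.4f}")
# At theta = 0 the constant estimator beats the 1/n rate outright, but at
# any theta != 0 its risk theta^2 never decreases with n: the gain at one
# point is paid for elsewhere, as in the proposition above.
```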

3.3 Contractions of probability measures

The main technical tool underpinning our lower bounds is that our definitions of privacy imply strong contractions on the space of probability measures. Such contractive properties have been important in the study of information channels [12, 14], where one studies strong data processing inequalities, and in the mixing properties of Markov chains under so-called strong mixing conditions, such as the Dobrushin condition [15]. For distributions $P_0$ and $P_1$ on $\mathcal{X}$ and a channel $Q$ from $\mathcal{X}$ to $\mathcal{Z}$, define the marginal distributions

$$M_b(S) := \int Q(S \mid x)\, dP_b(x) \quad \text{for } b \in \{0, 1\}.$$

The goal is then to provide upper bounds on the divergence $D_f(M_0 \,\|\, M_1)$ in terms of the channel $Q$; the standard data-processing inequality [13, 37] guarantees $D_f(M_0 \| M_1) \le D_f(P_0 \| P_1)$. Dobrushin’s celebrated ergodic coefficient, $\alpha(Q) := 1 - \sup_{x, x'} \|Q(\cdot \mid x) - Q(\cdot \mid x')\|_{\mathrm{TV}}$, guarantees that for any $f$-divergence (see [12, 14]),

$$D_f\big(M_0 \,\|\, M_1\big) \le \big(1 - \alpha(Q)\big)\, D_f\big(P_0 \,\|\, P_1\big). \qquad (11)$$

Thus, as long as the Dobrushin coefficient $\alpha(Q)$ is strictly positive, one obtains a strong data processing inequality. In our case, our privacy guarantees provide a stronger condition than the positivity of the Dobrushin coefficient. Consequently, we are able to provide substantially stronger data processing inequalities: we can even show that it is possible to modify the underlying $f$-divergence.

Thus, we reconsider the notions of privacy based on divergences between the channel outputs $Q(\cdot \mid x)$ and $Q(\cdot \mid x')$. We have the following proposition, which provides a strong data processing inequality for all channels satisfying the divergence-based notion of privacy (Definition 2.4).

Proposition 3.3. Let $f_k(t) = |t - 1|^k$ for some $k > 1$, and let $P_0$ and $P_1$ be arbitrary distributions on a common space $\mathcal{X}$. Let $Q$ be a Markov kernel from $\mathcal{X}$ to $\mathcal{Z}$ satisfying

$$D_{f_k}\big(Q(\cdot \mid x) \,\|\, Q(\cdot \mid x')\big) \le \varepsilon^k$$

for all $x, x' \in \mathcal{X}$. Then

$$D_{f_k}\big(M_0 \,\|\, M_1\big) \le \big(2\,\varepsilon\, \|P_0 - P_1\|_{\mathrm{TV}}\big)^k.$$

Jensen’s inequality implies that $(2\|P_0 - P_1\|_{\mathrm{TV}})^k \le D_{f_k}(P_0 \| P_1)$, so Proposition 3.3 provides a stronger guarantee than the classical bound (11) for the specific divergence associated with $f_k$. Because $f_k(t) \to \infty$ as $t \to \infty$ for all $k > 1$, it is possible that the divergence $D_{f_k}(P_0 \| P_1)$ is infinite, while the marginals are much closer together. It is this transfer from power divergence to variation distance, that is, to $\|P_0 - P_1\|_{\mathrm{TV}}$, that allows us to prove strong localized lower bounds depending on variation distance such as Theorem 3.1.
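The contraction is easy to see numerically; the check below (our own) uses binary randomized response, whose two rows have $\chi^2$-divergence $\varepsilon_{\mathrm{chan}}^2 = (e^\varepsilon - 1)^2 / e^\varepsilon$, and compares $\chi^2(M_0 \| M_1)$ against the $k = 2$ bound $(2\,\varepsilon_{\mathrm{chan}}\, \|P_0 - P_1\|_{\mathrm{TV}})^2$ from our reconstruction of Proposition 3.3.

```python
import numpy as np

def chi2(p, q):
    return np.sum((p - q) ** 2 / q)

eps = 0.5
p_stay = np.exp(eps) / (1 + np.exp(eps))
Q = np.array([[p_stay, 1 - p_stay],
              [1 - p_stay, p_stay]])       # binary randomized response

eps2_chan = chi2(Q[0], Q[1])               # channel-level chi^2-privacy level

p0 = np.array([0.5, 0.5])
for t in [0.05, 0.1, 0.2]:
    p1 = np.array([0.5 + t, 0.5 - t])      # ||p0 - p1||_TV = t
    lhs = chi2(p0 @ Q, p1 @ Q)             # divergence between the marginals
    rhs = 4 * eps2_chan * t ** 2           # (2 * eps_chan * TV)^2
    print(f"TV={t}: chi2(M0||M1) = {lhs:.6f} <= bound = {rhs:.6f}")
```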

As a corollary of Proposition 3.3, we may parallel the proof of [21, Theorem 1] to obtain a tensorization result. In this context, the most important divergence for us is the $\chi^2$-divergence, which corresponds to the case $k = 2$ in Proposition 3.3, that is, $f_2(t) = (t - 1)^2$, and which also corresponds to Rényi differential privacy with $\alpha = 2$ (Definition 2.3), with its guarantee that prior and posterior odds of discovery do not change much (Eq. (5)). Recall our formulation (1), in which the channel may be defined sequentially as $Z_i \sim Q(\cdot \mid X_i, Z_{1:i-1})$. Now, let $P_0^n = \prod_{i=1}^n P_{0,i}$ and $P_1^n = \prod_{i=1}^n P_{1,i}$ be product distributions on $\mathcal{X}^n$, where we say that the distribution of $X_i$ either follows $P_{0,i}$ or $P_{1,i}$, and define the induced marginals $M_b^n$ over $Z_{1:n}$ for $b \in \{0, 1\}$, noting that each $P_b^n$ is a product distribution (although the marginal $M_b^n$ need not be when the channel is interactive). We have the following corollary.

Corollary 3.4. Let $Q$ be a sequentially interactive channel satisfying $\varepsilon^2$-$\chi^2$-divergence privacy, that is, $D_{\chi^2}(Q(\cdot \mid x, z_{1:i-1}) \,\|\, Q(\cdot \mid x', z_{1:i-1})) \le \varepsilon^2$ for all $x, x' \in \mathcal{X}$ and $z_{1:i-1}$. Then

$$D_{\chi^2}\big(M_0^n \,\|\, M_1^n\big) \le \prod_{i=1}^n \Big(1 + 4\,\varepsilon^2\, \|P_{0,i} - P_{1,i}\|_{\mathrm{TV}}^2\Big) - 1.$$

See Section 3.4.2 for a proof. An immediate consequence of Corollary 3.4 and the fact [46, Lemma 2.7] that $D_{\mathrm{kl}}(P \| Q) \le \log(1 + \chi^2(P \| Q))$ yields

$$D_{\mathrm{kl}}\big(M_0^n \,\|\, M_1^n\big) \le \sum_{i=1}^n \log\Big(1 + 4\,\varepsilon^2\, \|P_{0,i} - P_{1,i}\|_{\mathrm{TV}}^2\Big) \le 4\,\varepsilon^2 \sum_{i=1}^n \|P_{0,i} - P_{1,i}\|_{\mathrm{TV}}^2. \qquad (12)$$

The tensorization (12) is the key to our results, as we see in the later sections.
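For non-interactive channels and product inputs the output marginals factor, so (12) (in the form reconstructed above) can be checked coordinate-by-coordinate; the snippet below (ours) does this for randomized response.

```python
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

def chi2(p, q):
    return float(np.sum((p - q) ** 2 / q))

eps = 0.5
p_stay = np.exp(eps) / (1 + np.exp(eps))
Q = np.array([[p_stay, 1 - p_stay], [1 - p_stay, p_stay]])
eps2_chan = chi2(Q[0], Q[1])            # channel's chi^2-privacy level

tvs = [0.05, 0.10, 0.15]                # ||P_{0,i} - P_{1,i}||_TV per coordinate
kl_total, bound = 0.0, 0.0
for t in tvs:
    m0 = np.array([0.5, 0.5]) @ Q       # per-coordinate output marginals
    m1 = np.array([0.5 + t, 0.5 - t]) @ Q
    kl_total += kl(m0, m1)              # KL adds across independent coordinates
    bound += np.log(1 + 4 * eps2_chan * t ** 2)
print(f"KL(M0^n || M1^n) = {kl_total:.6f} <= {bound:.6f}")
```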

3.4 Proofs

We collect the proofs of our main results in this section, as they are reasonably brief and (we hope) elucidating. We begin with the key contraction inequality in Proposition 3.3, as it underlies all subsequent results.

3.4.1 Proof of Proposition 3.3

Let $m_0$ and $m_1$ be the densities of $M_0$ and $M_1$ with respect to some base measure $\mu$ dominating both. Without loss of generality, we may assume that $\mathcal{Z}$ is finite, as all $f$-divergences are approximable by finite partitions [49]; we let $q(z \mid x)$ denote the associated p.m.f. of the channel. For $k > 1$, the function $t \mapsto t^{1-k}$ is convex on $(0, \infty)$. Thus, applying Jensen’s inequality to $m_1(z)^{1-k}$, where $m_1(z) = \int q(z \mid x')\, dP_1(x')$, we may bound $D_{f_k}(M_0 \| M_1)$ by

$$D_{f_k}\big(M_0 \,\|\, M_1\big) = \sum_z \frac{|m_0(z) - m_1(z)|^k}{m_1(z)^{k-1}} \le \int \sum_z \frac{|m_0(z) - m_1(z)|^k}{q(z \mid x')^{k-1}}\, dP_1(x'). \qquad (13)$$

It thus suffices to upper bound $\sum_z |m_0(z) - m_1(z)|^k / q(z \mid x')^{k-1}$ for each fixed $x'$. To do so, we rewrite $m_0 - m_1$ as

$$m_0(z) - m_1(z) = \int \big( q(z \mid x) - q(z \mid x') \big)\, \big(dP_0(x) - dP_1(x)\big),$$

where we have used that $\int (dP_0 - dP_1) = 0$. Now define the function

$$T(x, x') := \bigg( \sum_z \frac{|q(z \mid x) - q(z \mid x')|^k}{q(z \mid x')^{k-1}} \bigg)^{1/k}.$$

By Minkowski’s integral inequality, we have the upper bound

$$\bigg( \sum_z \frac{|m_0(z) - m_1(z)|^k}{q(z \mid x')^{k-1}} \bigg)^{1/k} \le \int T(x, x')\, \big|dP_0(x) - dP_1(x)\big|.$$

Now we compute the inner summation: we have that

$$T(x, x')^k = \sum_z q(z \mid x')\, \bigg| \frac{q(z \mid x)}{q(z \mid x')} - 1 \bigg|^k = D_{f_k}\big(Q(\cdot \mid x) \,\|\, Q(\cdot \mid x')\big) \le \varepsilon^k.$$

Substituting this into our upper bound on $m_0 - m_1$, we obtain that

$$\bigg( \sum_z \frac{|m_0(z) - m_1(z)|^k}{q(z \mid x')^{k-1}} \bigg)^{1/k} \le \varepsilon \int \big|dP_0(x) - dP_1(x)\big| = 2\,\varepsilon\, \|P_0 - P_1\|_{\mathrm{TV}},$$

as $\int |dP_0 - dP_1| = 2\|P_0 - P_1\|_{\mathrm{TV}}$. Substitute this upper bound into inequality (13) to obtain the proposition.

3.4.2 Proof of Corollary 3.4

We use an inductive argument. The base case, in which $n = 1$, follows immediately from Proposition 3.3. Now, suppose that Corollary 3.4 holds at $n - 1$; we will show that the claim holds for $n$. We use the shorthand $m_b(z_{1:n})$ for the density of the measure $M_b^n$, and $m_b(z_n \mid z_{1:n-1})$ for the associated conditional densities, which we may assume exist w.l.o.g. Then, by definition of the $\chi^2$-divergence, we have,