Locally Private Gaussian Estimation

11/20/2018 · by Matthew Joseph et al. · University of Minnesota, Microsoft, University of Pennsylvania

We study a basic private estimation problem: each of n users draws a single i.i.d. sample from an unknown Gaussian distribution, and the goal is to estimate the mean of this Gaussian distribution while satisfying local differential privacy for each user. Informally, local differential privacy requires that each data point is individually and independently privatized before it is passed to a learning algorithm. Locally private Gaussian estimation is therefore difficult because the data domain is unbounded: users may draw arbitrarily different inputs, but local differential privacy nonetheless mandates that different users have (worst-case) similar privatized output distributions. We provide both adaptive two-round solutions and nonadaptive one-round solutions for locally private Gaussian estimation. We then partially match these upper bounds with an information-theoretic lower bound. This lower bound shows that our accuracy guarantees are tight up to logarithmic factors for all sequentially interactive (ε,δ)-locally private protocols.


1 Introduction

Differential privacy is a formal algorithmic guarantee that no single input has a large effect on the output of a computation. Since its introduction [13] over a decade ago, a rich line of work has made differential privacy a compelling privacy guarantee (see  Dwork et al. [14] and Vadhan [26] for surveys), and deployments of differential privacy now exist at many organizations, including Apple [3], Google [15, 6], Microsoft [11], Mozilla [4], and the US Census Bureau [1, 22].

Much recent attention, including almost all industrial deployments, has focused on a stronger variant of differential privacy called local differential privacy [27, 16, 21]. In the local model private data is distributed across many users, and each user privatizes their data before the data is collected by an analyst. Thus, as any locally differentially private computation runs on already-privatized data, data contributors need not worry about compromised data analysts or insecure communication channels. In contrast, (global) differential privacy assumes that the data analyst has trusted access to the unprivatized data. As a result, under global differential privacy any violation of this trust may lead to serious privacy loss for the users contributing the data.

However, the stronger privacy guarantees of the local model come at a price: for many problems, “good” solutions under local privacy require far more samples than similarly good solutions under global privacy [21]. Moreover, many problems remain little understood under local differential privacy. In this paper, we study the simple problem of locally private Gaussian estimation: given $n$ users each holding an i.i.d. draw from an unknown Gaussian distribution $N(\mu, \sigma^2)$, can one accurately estimate the mean $\mu$ while guaranteeing local differential privacy for each user?

One challenge of this problem is that, since data is drawn from a Gaussian, there is no a priori (worst-case) bound on the scale of the observations. Naive applications of standard privatization methods like the Laplace and Gaussian mechanisms, which add noise proportional to the worst-case scale of the data, are therefore infeasible. A second challenge is that it is desirable to limit the number of rounds of interaction between users and the data analyst, as protocols requiring many rounds of user-analyst interaction are difficult to implement.

1.1 Our Contributions

We divide our solution to locally private Gaussian estimation into two cases. In the first case, $\sigma^2$ is known to the analyst; in the second, $\sigma^2$ is unknown but bounded in a known interval $[\sigma_{\min}^2, \sigma_{\max}^2]$. For each case, we provide adaptive two-round and nonadaptive one-round sequentially interactive protocols. Here, sequential interactivity informally means that no user outputs information more than once (see Section 2 for details). Informal guarantees for these protocols appear below.

Theorem 1.1.

Let $x_1, \ldots, x_n \sim N(\mu, \sigma^2)$ where $\mu$ lies in a known bounded range and $\sigma^2$ is known. Then

  1. Adaptive two-round protocol KVGausstimate satisfies $\varepsilon$-local differential privacy and, with probability at least $1 - \beta$, outputs $\hat{\mu}$ such that $|\hat{\mu} - \mu| = \tilde{O}\!\left(\frac{\sigma}{\varepsilon\sqrt{n}}\right)$, where $\tilde{O}$ hides logarithmic factors in $n$ and $1/\beta$.

  2. Nonadaptive one-round protocol 1RoundKVGausstimate satisfies $\varepsilon$-local differential privacy and, with probability at least $1 - \beta$, outputs $\hat{\mu}$ such that $|\hat{\mu} - \mu| = \tilde{O}\!\left(\frac{\sigma}{\varepsilon\sqrt{n}}\right)$, with larger logarithmic factors than the two-round protocol.

Theorem 1.2.

Let $x_1, \ldots, x_n \sim N(\mu, \sigma^2)$ where $\mu$ lies in a known bounded range and $\sigma^2$ is unknown but bounded in a known interval $[\sigma_{\min}^2, \sigma_{\max}^2]$ where $\sigma_{\min} > 0$. Then

  1. Adaptive two-round protocol UVGausstimate satisfies $\varepsilon$-local differential privacy and, with probability at least $1 - \beta$, outputs $\hat{\mu}$ such that $|\hat{\mu} - \mu| = \tilde{O}\!\left(\frac{\sigma}{\varepsilon\sqrt{n}}\right)$, where the logarithmic factors now also depend on $\sigma_{\max}/\sigma_{\min}$.

  2. Nonadaptive one-round protocol 1RoundUVGausstimate satisfies $\varepsilon$-local differential privacy and, with probability at least $1 - \beta$, outputs $\hat{\mu}$ such that $|\hat{\mu} - \mu| = \tilde{O}\!\left(\frac{\sigma}{\varepsilon\sqrt{n}}\right)$, again with larger logarithmic factors than its two-round counterpart.

Moreover, we show in the following (informal) information-theoretic lower bound that these upper bounds are tight up to logarithmic factors. Our proof relies on techniques from the strong data-processing inequality literature [7, 23].

Theorem 1.3.

For given $\varepsilon$ and $\delta$, there does not exist a sequentially interactive $(\varepsilon, \delta)$-locally private protocol that, for every mean $\mu$ in the given range, takes samples $x_1, \ldots, x_n \sim N(\mu, \sigma^2)$ and outputs an estimate $\hat{\mu}$ satisfying $|\hat{\mu} - \mu| = o\!\left(\frac{\sigma}{\varepsilon\sqrt{n}}\right)$ with constant probability.

1.2 Related Work

Several works have already studied differentially private versions of various statistical tasks, especially in the global setting. Both Karwa and Vadhan [20] and Kamath et al. [19] are relevant, as they consider similar versions of Gaussian estimation under global differential privacy, respectively in the one-dimensional and high-dimensional cases. For both the known and unknown variance cases, Karwa and Vadhan [20] offer an accuracy upper bound for estimating $\mu$ under global differential privacy. Our upper and lower bounds thus demonstrate that the cost of privacy is roughly a factor of $\sqrt{n}$ larger under local privacy than under global privacy.

In local differential privacy, several recent works have studied related statistical tasks like identity and independence testing [17, 24, 2], albeit restricted to discrete distributions. In concurrent work, Gaboardi et al. [18] also study Gaussian estimation under local differential privacy. They provide an adaptive two-round protocol in the known variance case and an adaptive multi-round protocol in the unknown variance case, where a single parameter $R$ upper bounds both $|\mu|$ and $\sigma$ and both protocols are approximately locally private. In our setting, this parameter range may be large, leading to correspondingly large round complexity for their unknown variance protocol.

In comparison, we construct adaptive two-round and nonadaptive one-round purely locally private protocols improving on these guarantees in both cases; see Figure 1 for a detailed comparison. Moreover, while Gaboardi et al. [18] prove a lower bound for nonadaptive one-round protocols, we prove a logarithmically weaker but also more general lower bound for adaptive sequentially interactive protocols. Gaboardi et al. [18] also offer extensions to quantile estimation and to estimation when $\sigma$ lacks a known upper bound.

Our lower bounds are structurally similar to existing mutual information-based approaches [12, 5, 25] and build on recent results showing that pure and approximate local differential privacy are “equivalent” [8, 10]. Our lower bound also uses tools from the strong data processing inequality literature [7, 23]; broader application of these techniques to local differential privacy may be of independent interest.

Setting | Gaboardi et al. [18]: rounds | This work: rounds
Known $\sigma^2$, adaptive | 2 | 2
Known $\sigma^2$, nonadaptive | none given | 1
Unknown $\sigma^2$, adaptive | grows with the parameter range | 2
Unknown $\sigma^2$, nonadaptive | none given | 1
Figure 1: A comparison of upper bounds presented in Gaboardi et al. [18] and our work. In all cases, Gaboardi et al. [18] use $(\varepsilon, \delta)$-locally private algorithms while we use $\varepsilon$-locally private algorithms. Here, $R$ denotes an upper bound on both $|\mu|$ and $\sigma$. In our setting, $R$ may be large, leading the unknown variance protocol of Gaboardi et al. [18] to correspondingly large round complexity.

2 Preliminaries

We consider a setting in which each of $n$ users has private data consisting of a single i.i.d. draw $x_i \sim N(\mu, \sigma^2)$ from an unknown Gaussian distribution. In our communication protocol, users may exchange messages over public channels with a single (possibly untrusted) central analyst. (The notion of a central analyst is a useful simplification but is not intrinsic to the protocol: as the analyst need not be trusted, any user can fulfill the same role.) The analyst’s task is to accurately estimate $\mu$ while guaranteeing local differential privacy for each user.

We restrict our attention to sequentially interactive protocols, where every user sends at most a single message to the analyst in the entire protocol. For simplicity, our definition of sequentially interactive protocols is slightly less general than the one introduced by Duchi et al. [12] (see Section 5 for details). The algorithms we present for our upper bounds all satisfy our more restrictive notion of sequential interactivity, while our lower bounds apply to the more general notion used by Duchi et al. [12].

We also study the round complexity of these interactive protocols. Formally, one round of interaction in a protocol consists of the following two steps: 1) the analyst selects a subset $S$ of users, along with a local randomizer for each user in $S$, and 2) each user in $S$ computes a message using their assigned randomizer and sends the message to the analyst.
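To make this concrete, here is a minimal Python sketch of one round under sequential interactivity. The names (run_round, LocalRandomizer) and signatures are our own illustrative choices, not part of the paper's protocol.

```python
from typing import Callable, Dict, List, Set

# Illustrative sketch of one round of a sequentially interactive protocol:
# the analyst selects a subset of users and a local randomizer for each,
# and every selected user responds exactly once in the entire protocol.

LocalRandomizer = Callable[[float], int]

def run_round(samples: List[float],
              selected: List[int],
              randomizers: Dict[int, LocalRandomizer],
              already_queried: Set[int]) -> Dict[int, int]:
    """Execute one analyst-user round; enforces one message per user."""
    messages = {}
    for i in selected:
        assert i not in already_queried, "each user speaks at most once"
        messages[i] = randomizers[i](samples[i])  # privatized on the user's side
        already_queried.add(i)
    return messages
```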

2.1 Differential Privacy

Informally, a randomized algorithm is differentially private if arbitrarily changing a single input does not change the output distribution “too much”. The resulting computation preserves privacy because the output distribution is insensitive to any change of a single user’s data. More formally:

Definition 2.1 ((Standard) Differential Privacy).

A randomized algorithm $M$ satisfies $(\varepsilon, \delta)$-differential privacy if, for any two databases $X, X'$ that differ by a single observation and any event $Y$, $\Pr[M(X) \in Y] \le e^{\varepsilon} \cdot \Pr[M(X') \in Y] + \delta.$

Here, we study a stronger privacy guarantee called local differential privacy. In the local model, each user computes their message using a local randomizer. A local randomizer is a differentially private algorithm taking single-element databases as input. More formally, a randomized function $R$ is an $(\varepsilon, \delta)$-local randomizer if, for every pair of observations $x, x'$ and any event $Y$, $\Pr[R(x) \in Y] \le e^{\varepsilon} \cdot \Pr[R(x') \in Y] + \delta.$

A sequentially interactive protocol is locally private if every user computes their message using a local randomizer.

Definition 2.2.

A sequentially interactive protocol is $(\varepsilon, \delta)$-locally private for private user data if, for every user $i$, user $i$'s message is computed using an $(\varepsilon, \delta)$-local randomizer. When $\delta > 0$, we say the protocol is approximately locally private. If $\delta = 0$, it is purely locally private.
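As a concrete example of a local randomizer, the following minimal Python sketch implements classic binary randomized response, a pure $\varepsilon$-local randomizer. This is our own illustration; the paper's randomizers RR1 and KVRR2 build on the same primitive.

```python
import math
import random

def randomized_response(bit: int, epsilon: float) -> int:
    """epsilon-local randomizer for a single bit (pure LDP, delta = 0).

    Reports the true bit with probability e^eps / (1 + e^eps), else flips it.
    For any output y and inputs b, b': Pr[out=y | b] / Pr[out=y | b'] <= e^eps.
    """
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return bit if random.random() < p_truth else 1 - bit
```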

3 Known Variance

In this section, we present two solutions for the setting where the variance $\sigma^2$ is known (shorthanded “KV”). In Section 3.1, we analyze an adaptive protocol KVGausstimate that requires two rounds of analyst-user interaction. In Section 3.2, we analyze a nonadaptive protocol 1RoundKVGausstimate achieving a weaker accuracy guarantee in a single round.

3.1 Two-round protocol

We begin with a high-level overview of KVGausstimate before analyzing its components in detail. In KVGausstimate, the analyst splits the users into two halves, employing users from the first half to compute an initial estimate of $\mu$ and then users from the second half to further refine this estimate.

More concretely, the analyst partitions the first half of the users into subgroups, one per scale of a binary search over the range of possible means, with each subgroup's size chosen as a function of the desired failure probability $\beta$. The analyst then solicits (a privatized version of) one bit of each user's sample, with subgroup $j$ responsible for the $j$-th bit (equivalently, the user's bin at that scale). Each user responds by calling RR1, and the analyst aggregates these responses through KVAgg1. By doing so, the analyst effectively executes a one-round binary search and obtains an initial estimate $\hat{\mu}_1$ of $\mu$ accurate to within $O(\sigma)$.

The analyst then passes $\hat{\mu}_1$ to the users in the second half and solicits user estimates using (a privatized version of) a de-meaning protocol from the distributed statistical estimation literature [7]. These users respond by calls to KVRR2, where each user de-means their point using $\hat{\mu}_1$, standardizes it using $\sigma$, and randomized responds on the result. Crucially, this de-meaning relies on knowing an $O(\sigma)$-accurate estimate of $\mu$, which necessitates the first-round estimate $\hat{\mu}_1$. The analyst then uses KVAgg2 to aggregate these responses into an estimate of the CDF of the de-meaned distribution near its mean, from which the analyst can finally back out a final estimate $\hat{\mu}$. Pseudocode for KVGausstimate appears below. Throughout, we make the following assumptions on our problem parameters, deferring exact constants to the analysis.

Assumption 3.1.

The mean $\mu$ lies in a known bounded nonnegative range, and $n$ is sufficiently large relative to the stated parameters. (While we assume $\mu$ is nonnegative, this is largely for convenience: all of our methods extend to negative (but similarly bounded) $\mu$ at the expense of constant factors.)

Require: privacy parameter $\varepsilon$, failure probability $\beta$, known $\sigma$, users $1, \ldots, n$
1:  for each subgroup $j$ of the first half of users do
2:     for each user $i$ in subgroup $j$ do
3:        User $i$ outputs RR1($x_i$) at subgroup $j$'s scale
4:     end for
5:  end for (End of round 1)
6:  Analyst computes histogram $\hat{H}$ via KVAgg1 on the round-1 responses
7:  Analyst computes initial estimate $\hat{\mu}_1$ via EstMean1($\hat{H}$)
8:  for each user $i$ in the second half of users do
9:     User $i$ outputs KVRR2($x_i$, $\hat{\mu}_1$)
10: end for (End of round 2)
11: Analyst aggregates the round-2 responses via KVAgg2
12: Analyst computes a refined correction to $\hat{\mu}_1$ from this aggregate
13: Analyst outputs the final estimate $\hat{\mu}$
Output: Analyst estimate $\hat{\mu}$ of $\mu$
Algorithm 1 KVGausstimate

We start our analysis with a privacy guarantee.

Theorem 3.2.

KVGausstimate satisfies $\varepsilon$-local differential privacy.

Proof.

As KVGausstimate is sequentially interactive, each user only produces one output. It therefore suffices to show that each randomized response routine used in KVGausstimate is $\varepsilon$-locally private. In RR1, for any possible inputs $x, x'$ and output $y$ we have

$\frac{\Pr[\mathrm{RR1}(x) = y]}{\Pr[\mathrm{RR1}(x') = y]} \le e^{\varepsilon},$

so RR1 is $\varepsilon$-locally private. KVRR2 is $\varepsilon$-locally private by similar logic. ∎

Next, we recall our overall accuracy result for KVGausstimate.

Theorem 3.3.

With probability at least $1 - \beta$, KVGausstimate outputs an estimate $\hat{\mu}$ such that $|\hat{\mu} - \mu| = \tilde{O}\!\left(\frac{\sigma}{\varepsilon\sqrt{n}}\right)$, where $\tilde{O}$ hides logarithmic factors in $n$ and $1/\beta$.

We prove this result by analyzing the execution of KVGausstimate in sequence below.

3.1.1 Round one

We start with KVGausstimate’s first round of interaction. First, each user in the first group runs RR1 to publish an $\varepsilon$-privatized version of their bit of $x_i$. Note that in the pseudocode below, the user's alternative response is a uniform random draw from the set of possible responses.

Require: sample $x_i$, subgroup scale, privacy parameter $\varepsilon$
1:  User $i$ computes the bin containing $x_i$ at this scale
2:  if the randomized response coin lands heads (with bias chosen so that RR1 is $\varepsilon$-locally private) then
3:     User $i$ publishes the true bin
4:  else
5:     User $i$ publishes a uniform random draw from the set of bins
6:  end if
Output: Private user estimate of $x_i$'s bin
Algorithm 2 RR1
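For intuition, here is a simplified Python stand-in for RR1. This is our own sketch: the bin range, scale, and response set are illustrative assumptions rather than the paper's exact choices. It discretizes the sample into a bin at the subgroup's scale, then applies k-ary randomized response over the bins.

```python
import math
import random

def rr1(x: float, scale: float, lo: float, hi: float, epsilon: float) -> int:
    """RR1-style response sketch: bin x at the given scale, then apply
    k-ary randomized response over the bins (epsilon-LDP)."""
    num_bins = max(1, math.ceil((hi - lo) / scale))
    true_bin = min(num_bins - 1, max(0, int((x - lo) // scale)))
    # Report the truth with probability e^eps / (e^eps + k - 1); otherwise
    # report a uniform draw among the other bins. The output probability
    # ratio between any two inputs is then at most e^eps.
    p_truth = math.exp(epsilon) / (math.exp(epsilon) + num_bins - 1)
    if num_bins == 1 or random.random() < p_truth:
        return true_bin
    others = [b for b in range(num_bins) if b != true_bin]
    return random.choice(others)
```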

Since these randomized responses contain information about users’ local estimates of each bit of $\mu$, the analyst uses KVAgg1 to aggregate them into a histogram $\hat{H}$.

Require: privatized round-1 responses
1:  for each subgroup $j$ do
2:     for each bin $k$ at subgroup $j$'s scale do
3:        Count the responses from subgroup $j$ landing in bin $k$
4:        Debias the count to account for randomized response
5:     end for
6:  end for
7:  Output the debiased histogram $\hat{H}$
Output: Aggregated histogram of private user responses
Algorithm 3 KVAgg1
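A matching sketch of KVAgg1-style aggregation, again our own stand-in paired with the k-ary randomized response above: count responses per bin, then invert the expected response distribution to debias.

```python
import math
from collections import Counter
from typing import List

def kv_agg1(responses: List[int], num_bins: int, epsilon: float) -> List[float]:
    """Debiased histogram from k-ary randomized responses.

    If each user reports truthfully with probability
    p = e^eps / (e^eps + k - 1) and uniformly among the other k - 1 bins
    otherwise, then E[freq_b] = p * h_b + q * (1 - h_b) with
    q = (1 - p) / (k - 1), where h_b is the true frequency of bin b.
    Solving gives h_b = (freq_b - q) / (p - q).
    """
    n = len(responses)
    k = num_bins
    p = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    q = (1.0 - p) / (k - 1) if k > 1 else 0.0
    counts = Counter(responses)
    hist = []
    for b in range(k):
        freq = counts.get(b, 0) / n
        hist.append((freq - q) / (p - q) if p != q else freq)
    return hist
```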

Let $H$ denote the “true” histogram of unprivatized responses across all subgroups and bins. Since the analyst only has access to the privatized histogram $\hat{H}$, we need to show that $H$ and $\hat{H}$ are similar.

Lemma 3.4.

With probability at least $1 - O(\beta)$, for all subgroups $j$ and bins $k$, the privatized entry $\hat{H}_j(k)$ is close to the true entry $H_j(k)$.

Proof.

Fix a subgroup and a bin. By a pair of Chernoff bounds on the responses of the users in that subgroup, with probability at least $1 - O(\beta)$ the count of privatized responses landing in the bin concentrates around its expectation. Debiasing this count then yields the claimed bound on the gap between the corresponding entries of $\hat{H}$ and $H$. Union bounding over all bins and all subgroups completes the proof. ∎

Next, we show how the analyst uses $\hat{H}$ to estimate $\mu$ through EstMean1. Intuitively, when the responses of subgroup $j$ concentrate in a single bin, this suggests that $\mu$ lies in the corresponding bin. In the other direction, when user responses do not concentrate in a single bin, users with points near $\mu$ must spread out over multiple bins, suggesting that $\mu$ lies near the boundary between two bins. We formalize this intuition in EstMean1 and Lemma 3.5.

Require: aggregated histogram $\hat{H}$
1:  Initialize the candidate interval to the full range of possible means
2:  Initialize the current scale to the coarsest subgroup
3:  Initialize the termination flag
4:  while the responses at the current scale concentrate in a single bin and scales remain do
5:     Analyst identifies the concentrated bin at the current scale
6:     Analyst restricts the candidate interval to that bin
7:     Advance to the next (finer) scale
8:  end while
9:  If the loop exhausted all scales, output any point of the final bin
10: Analyst computes the two most popular adjacent bins at the terminating scale
11: Analyst computes their shared boundary
12: Analyst computes the boundary candidate consistent with the observed response frequencies
13: Analyst outputs $\hat{\mu}_1$
Output: Initial estimate $\hat{\mu}_1$ of $\mu$
Algorithm 4 EstMean1
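To make the search concrete, here is a heavily simplified stand-in for EstMean1. This is our own sketch, not the paper's exact algorithm: at each scale, if one bin clearly dominates, descend into it; otherwise output the boundary shared by the two most popular bins.

```python
from typing import List

def est_mean1(hists: List[List[float]], lo: float, hi: float,
              threshold: float = 0.6) -> float:
    """Simplified EstMean1-style descent over per-scale debiased histograms.

    hists[j] is the debiased histogram from subgroup j, whose bins evenly
    partition the current candidate interval. Descend while one bin
    dominates; on a split, return the shared boundary of the top two bins.
    Assumes each histogram has at least two bins when a split occurs.
    """
    left, right = lo, hi
    for hist in hists:
        k = len(hist)
        width = (right - left) / k
        top = max(range(k), key=lambda b: hist[b])
        if k == 1 or hist[top] >= threshold:
            # Responses concentrate: recurse into the dominant bin.
            left, right = left + top * width, left + (top + 1) * width
        else:
            # Responses split: mu likely sits near the edge shared by the
            # two most popular (adjacent) bins.
            second = max((b for b in range(k) if b != top),
                         key=lambda b: hist[b])
            return left + (min(top, second) + 1) * width
    return (left + right) / 2.0
```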
Lemma 3.5.

Conditioned on the success of the preceding lemmas, with probability at least $1 - O(\beta)$, the initial estimate satisfies $|\hat{\mu}_1 - \mu| = O(\sigma)$.

Proof.

Recall the quantities defined in the pseudocode for EstMean1. We start by proving two useful claims.

Claim 1: With high probability, at every scale where the responses concentrate in a single bin, the bin identified by EstMean1 contains $\mu$.

To see why, recall the Gaussian CDF $\Phi$: a sample from $N(\mu, \sigma^2)$ lands in a bin far from $\mu$ with probability bounded by a Gaussian tail. Thus, by a binomial Chernoff bound together with Lemma 3.4, with high probability no bin far from $\mu$ can attract a concentrated fraction of the (debiased) responses, so any concentrated bin must contain $\mu$. As this argument holds at every scale under Assumption 3.1, the claim follows by induction.

Claim 2: With high probability, if the loop in EstMean1 terminates at a scale where responses fail to concentrate in a single bin, then $\mu$ lies within $O(\sigma)$ of the boundary between the two most popular adjacent bins at that scale.

To see why, first note that by Claim 1 the search has, up to this point, tracked bins containing $\mu$. If responses at the terminating scale split across two bins, then by another application of the Gaussian CDF a constant fraction of the mass of $N(\mu, \sigma^2)$ must fall on each side of the shared boundary; by the same Chernoff-bound argument as above together with Lemma 3.4, this can only happen when $\mu$ is within $O(\sigma)$ of that boundary. The remaining cases are symmetric, and a final application of the same argument rules out any boundary candidate far from $\mu$, completing the claim.

We put these claims together in EstMean1 as follows. If the loop runs through every scale with concentrated responses, then Claim 1 implies the final bin, of width $O(\sigma)$, contains $\mu$, so any point of that bin (including the one EstMean1 outputs) is an $O(\sigma)$-accurate estimate. If instead the loop terminates early, Claim 2 implies the boundary point EstMean1 outputs is within $O(\sigma)$ of $\mu$. Thus in all cases, with probability at least $1 - O(\beta)$, $|\hat{\mu}_1 - \mu| = O(\sigma)$. ∎

3.1.2 Round two

The results above give the analyst an (initial) estimate $\hat{\mu}_1$ such that $|\hat{\mu}_1 - \mu| = O(\sigma)$. Now, the analyst passes this estimate to the second half of the users, and each such user uses $\hat{\mu}_1$ to de-mean their value $x_i$ and randomized responds on the resulting quantity in KVRR2.

Require: sample $x_i$, initial estimate $\hat{\mu}_1$, known $\sigma$, privacy parameter $\varepsilon$
1:  User $i$ computes the de-meaned value $x_i - \hat{\mu}_1$
2:  User $i$ standardizes it by $\sigma$
3:  User $i$ computes the response bit from the standardized value
4:  if the randomized response coin lands heads (with bias chosen so that KVRR2 is $\varepsilon$-locally private) then
5:     User $i$ publishes the true response bit
6:  else
7:     User $i$ publishes the flipped bit
8:  end if
Output: Private de-meaned user estimate
Algorithm 5 KVRR2
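A sketch of a KVRR2-style response under a single-threshold simplification (our own; the paper's version may respond on finer information): de-mean by $\hat{\mu}_1$, standardize by $\sigma$, and randomized-respond on the indicator of falling below the threshold.

```python
import math
import random

def kv_rr2(x: float, mu_hat1: float, sigma: float, epsilon: float) -> int:
    """De-mean, standardize, and randomized-respond on a threshold bit."""
    z = (x - mu_hat1) / sigma          # standardized, de-meaned point
    bit = 1 if z <= 0.0 else 0         # indicator of the event {x <= mu_hat1}
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return bit if random.random() < p_truth else 1 - bit
```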

De-meaning thus effectively transforms the problem of estimating $\mu$ into the problem of estimating $\mu - \hat{\mu}_1$ when $|\mu - \hat{\mu}_1|$ is small. This in turn enables us to use techniques for estimating the Gaussian CDF near its mean (specifically, a private version of Protocol 2 in Braverman et al. [7]).

Require: privatized round-2 responses
1:  for each user $i$ in the second half do
2:     Debias user $i$'s randomized response
3:     Add the debiased response to the running aggregate
4:  end for
5:  Analyst outputs the aggregate
Output: Aggregated histogram of private user responses
Algorithm 6 KVAgg2
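Continuing the single-threshold simplification (our own sketch, not the paper's exact estimator): debias the randomized-response bits to estimate $p = \Pr[x \le \hat{\mu}_1] = \Phi((\hat{\mu}_1 - \mu)/\sigma)$, then invert the Gaussian CDF to recover $\hat{\mu} = \hat{\mu}_1 - \sigma \cdot \Phi^{-1}(\hat{p})$.

```python
import math
from statistics import NormalDist
from typing import List

def kv_agg2_and_estimate(bits: List[int], mu_hat1: float,
                         sigma: float, epsilon: float) -> float:
    """Debias randomized-response bits, then invert the Gaussian CDF.

    With truth probability p_truth = e^eps / (1 + e^eps),
    E[reported bit] = p_truth * p + (1 - p_truth) * (1 - p),
    so p_hat = (mean_bit - (1 - p_truth)) / (2 * p_truth - 1).
    Since p = Phi((mu_hat1 - mu) / sigma), the final estimate is
    mu_hat = mu_hat1 - sigma * Phi^{-1}(p_hat).
    """
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    mean_bit = sum(bits) / len(bits)
    p_hat = (mean_bit - (1.0 - p_truth)) / (2.0 * p_truth - 1.0)
    p_hat = min(max(p_hat, 1e-6), 1.0 - 1e-6)  # clamp before inverting
    return mu_hat1 - sigma * NormalDist().inv_cdf(p_hat)
```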

We now prove that this de-meaning process results in a more accurate final estimate of $\mu$.

Lemma 3.6.

Conditioned on the success of the previous lemmas, with probability at least $1 - O(\beta)$, KVGausstimate outputs $\hat{\mu}$ satisfying the accuracy guarantee of Theorem 3.3.

Proof.

The proof is broadly similar to that of Theorem B.1 in Braverman et al. [7], with some modifications for privacy. First, by Lemma 3.5, $|\hat{\mu}_1 - \mu| = O(\sigma)$, so after de-meaning and standardizing, the users' points are distributed as a standard Gaussian shifted by a quantity of constant magnitude. The fraction of points falling below any fixed threshold is then governed by $\Phi$, the standard Gaussian CDF. (Note that we are analyzing the unprivatized values to start; later, we will use this analysis to prove the analogous result for the privatized values.)

A Chernoff bound on these bounded random variables then shows that, with probability at least $1 - O(\beta)$, the empirical CDF estimates concentrate around their expectations, and an analogous statement holds for the privatized responses after debiasing.

Because the standardized shift has constant magnitude, these CDF values lie in a constant-width interval around $1/2$ on which $\Phi^{-1}$ is Lipschitz. Let $L$ be an upper bound on the Lipschitz constant of $\Phi^{-1}$ on this interval. Applying $\Phi^{-1}$ to the empirical CDF estimate and rescaling by $\sigma$ therefore converts the concentration bound into the claimed bound on $|\hat{\mu} - \mu|$, using the bound on $L$ from above.

It remains to analyze the privatized values, recalling the debiasing set in KVAgg1. By a Chernoff bound analogous to that of Lemma 3.4, with probability at least