Differential privacy is a mathematically rigorous notion of privacy that has become the de facto gold standard of privacy-preserving data analysis. Informally, $\epsilon$-differential privacy bounds the effect of a single datapoint on any result of the computation by $\epsilon$. By now we have differentially private analogues of numerous data analysis tasks. Moreover, in recent years the subject of private hypothesis testing has been receiving increasing attention (see Related Work below). However, by and large, the focus of private hypothesis testing is on the centralized model (or the curated model), where a single trusted entity holds the sensitive details of users and runs the private hypothesis tester on the actual data.
In contrast, the subject of this work is private hypothesis testing in the local model (or the distributed model), where an $\epsilon$-differentially private mechanism is applied independently to each datum, resulting in one noisy signal per datum. Moreover, the noisy signal is quite close to being uniformly distributed among all possible signals, so any observer that sees the signal has a very limited advantage in inferring the datum's true type. This model, which alleviates trust (each user can run the mechanism independently on her own and release the noisy signal from the mechanism), has gained much popularity in recent years, especially since it was adopted by Google's Rappor [EPK14] and Apple [App17]. And yet, despite its popularity, and the fact that recent works [BS15, BNST17] have shown that the space of possible locally-private mechanisms is richer than what was originally thought, little is known about private hypothesis testing in the local model.
1.1 Background: Local Differential Privacy as a Signaling Scheme
We view the local differentially private model as a signaling scheme. Each datum / user has a type $x$ taken from a predefined and publicly known set of possible types $\mathcal{X}$ whose size is $T = |\mathcal{X}|$. The differentially private mechanism is merely a randomized function $f_i : \mathcal{X} \to \mathcal{S}$, mapping each possible type of the $i$-th datum to some set of possible signals $\mathcal{S}$, which we assume to be $\epsilon$-differentially private: for any index $i$, any pair of types $x, x' \in \mathcal{X}$ and any signal $s \in \mathcal{S}$ it holds that $\Pr[f_i(x) = s] \leq e^{\epsilon} \Pr[f_i(x') = s]$. [Footnote 1: For simplicity, we assume $\mathcal{S}$, the set of possible signals, is discrete. Note that this doesn't exclude mechanisms such as adding Gaussian/Gamma noise to a point in $\mathbb{R}^d$; such mechanisms require $\mathcal{X}$ to be some bounded subset of $\mathbb{R}^d$ and use the bound to set the noise appropriately. Therefore, the standard approach of discretizing $\mathbb{R}^d$ and projecting the noisy point to the closest point in the grid yields a finite set of signals $\mathcal{S}$.] In our most general results (Theorems 1 and 9), we ignore the fact that $f_i$ is $\epsilon$-differentially private, and just refer to any signaling scheme that transforms one domain (namely, $\mathcal{X}$) into another ($\mathcal{S}$). For example, a surveyor might unify rarely occurring types under the category of “other”, or perhaps users report their types over noisy channels, etc.
We differentiate between two types of signaling schemes, both anchored in differentially private mechanisms: the symmetric (or index-oblivious) variety, and the non-symmetric (index-aware) type. A local signaling mechanism is called symmetric or index-oblivious if it is independent of the index of the datum, namely, if for any $i \neq j$ we have that $f_i = f_j$. A classic example of such a mechanism is randomized response, which actually dates back to before differential privacy was defined [War65] and was first put to use in differential privacy in [KLN08]. Here each user / datum draws her own signal from the set $\mathcal{S} = \mathcal{X}$, skewing the probability ever so slightly in favor of the original type. I.e., if the user's type is $x$ then she sends the signal $x$ with probability $\frac{e^{\epsilon}}{T-1+e^{\epsilon}}$ and each other signal $x' \neq x$ with probability $\frac{1}{T-1+e^{\epsilon}}$, where $T$ is the number of possible types. This mechanism applies to all users, regardless of position in the dataset.
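The randomized-response mechanism described above can be sketched in a few lines. This is a minimal illustration rather than the implementation of any cited system: it assumes the signal set equals the type set and that the true type is sent with probability $e^{\epsilon}/(T-1+e^{\epsilon})$, and the names `rr_probabilities` and `randomized_response` are hypothetical.

```python
import math
import random

def rr_probabilities(true_type, types, eps):
    """Probability of each signal under randomized response, given the true type."""
    rho = 1.0 / (len(types) - 1 + math.exp(eps))
    return {s: rho * math.exp(eps) if s == true_type else rho for s in types}

def randomized_response(true_type, types, eps, rng=random):
    """Draw one noisy signal for a user of the given type."""
    probs = rr_probabilities(true_type, types, eps)
    signals = list(probs)
    return rng.choices(signals, weights=[probs[s] for s in signals], k=1)[0]

types = ["A", "B", "C", "D"]
eps = 0.5
p_a = rr_probabilities("A", types, eps)
p_b = rr_probabilities("B", types, eps)
# Largest ratio between the probabilities two different types assign to a signal.
worst_ratio = max(p_a[s] / p_b[s] for s in types)
```

Under these assumptions `worst_ratio` should not exceed $e^{\epsilon}$, which is exactly the local differential privacy requirement for this mechanism.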
The utility of the above-mentioned symmetric mechanism scales polynomially with (or rather, with ), which motivated the question of designing locally differentially-private mechanisms with error scaling logarithmically in . This question was recently answered in the affirmative by the works of Bassily and Smith [BS15] and Bassily et al [BNST17], whose mechanisms are not symmetric. In fact, both of them work by presenting each user with a mapping (the mapping itself is chosen randomly, but it is public, so we treat it as a given), and the user then runs the standard randomized response mechanism on the signals using as the more-likely signal. (In fact, in both schemes, : in [BS15] is merely the -th coordinate of a hashing of the types where and the hashing function are publicly known, and in [BNST17] maps a u.a.r chosen subset of to and its complement to . [Footnote 2: In both works, much effort is put into first reducing to the most frequent types, and then running the counting algorithm. Regardless, the end-counts / collection of users' signals are the ones we care about for the sake of hypothesis testing.]) It is simple to identify each as a -matrix of size ; and, even though current works use only a deterministic mapping, we even allow for a randomized mapping, so can be thought of as a matrix of entries in (such that for each we have ). Regardless, given , the user then tosses her own private random coins to determine what signal she broadcasts. Therefore, each user's mechanism can be summarized in a -matrix, where is the probability a user of type sends the signal . For example, using the mechanism of [BNST17], each user whose type maps to sends “signal ” with probability and “signal ” with probability . Namely, and , where is the mapping set for user .
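To make the per-user matrix concrete, the following sketch builds it for a binary-signal scheme in the spirit of the description above. Several details are assumptions made for illustration only: the two signals are labelled `+1` and `-1`, the public mapping is drawn uniformly at random per type, and the inherent signal is kept with probability $e^{\epsilon}/(1+e^{\epsilon})$. The helper `user_matrix` is hypothetical.

```python
import math
import random

def user_matrix(mapping, eps):
    """Return {signal: {type: Pr[user sends signal | type]}} for one user.

    `mapping` sends each type to +1 or -1 (the user's public, inherent signal).
    """
    keep = math.exp(eps) / (1.0 + math.exp(eps))  # send the inherent signal
    flip = 1.0 - keep                             # send the other signal
    return {
        s: {x: keep if mapping[x] == s else flip for x in mapping}
        for s in (+1, -1)
    }

rng = random.Random(0)
types = ["A", "B", "C", "D"]
eps = 0.25
mapping = {x: rng.choice((+1, -1)) for x in types}  # public randomness
G_i = user_matrix(mapping, eps)
```

Each column of `G_i` (one fixed type, both signals) sums to one, and within each row the ratio between any two entries is at most $e^{\epsilon}$.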
1.2 Our Contribution and Organization
This work initiates (to the best of our knowledge) the theory of differentially private hypothesis testing in the local model. First we survey related work and preliminaries. Then, in Section 3, we examine the symmetric case and show that any mechanism (not necessarily a differentially private one) yields a distribution on the signals for which finding a maximum-likelihood hypothesis is feasible, assuming the set of possible hypotheses is convex. Then, focusing on the classic randomized-response mechanism, we show that the problem of maximizing the likelihood of the observed signals is strongly convex and thus simpler than the original problem. More importantly, in essence we give a characterization of hypothesis testing under randomized response: the symmetric locally-private mechanism translates the original null hypothesis (and the alternative ) by a known affine translation into a different set (and resp. ). Hence, hypothesis testing under randomized response boils down to discerning between two different (and considerably closer in total-variation distance) sets, but in the exact same model as in standard hypothesis testing, as all signals were drawn from the same hypothesis in . As an immediate corollary we give bounds on identity testing (Corollary 5) and independence testing (Theorem 6) under randomized response. (The latter requires some manipulations and is far less straightforward than the former.) The sample complexity (under certain simplifying assumptions) of both problems is proportional to .
In Section 4 we move to the non-symmetric local model. Again, we start with a general result showing that in this case too, finding a hypothesis that maximizes the likelihood of the observed signals is feasible when the hypothesis set is convex. We then focus on the mechanism of Bassily et al [BNST17] and show that it also makes the problem of finding a maximum-likelihood hypothesis strongly convex. We then give a simple identity tester under this scheme whose sample complexity is proportional to , and is thus more efficient than any tester under standard randomized response. Similarly, we also give an independence tester with a similar sample complexity. In Section 4.2 we empirically investigate alternative identity testing and independence testing based on Pearson's $\chi^2$-test in this non-symmetric scheme, and identify a couple of open problems in this regime.
1.3 Related Work
Several works have looked at the intersection of differential privacy and statistics [DL09, Smi11, CH12, DJW13a, DSZ15], mostly focusing on robust statistics; but only a handful of works rigorously study the significance and power of hypothesis testing under differential privacy [VS09, USF13, WLK15, RVLG16, CDK17, She17, KV18]. Vu and Slavkovic [VS09] looked at the sample size for privately testing the bias of a coin. Johnson and Shmatikov [JS13], Uhler et al [USF13] and Yu et al [YFSU14] focused on the Pearson $\chi^2$-test (the simplest goodness-of-fit test), showing that the noise added by differential privacy vanishes asymptotically as the number of datapoints goes to infinity, and proposed a private $\chi^2$-based test which they study empirically. Wang et al [WLK15] and Gaboardi et al [RVLG16], who noticed the issues with both of these approaches, revised the statistical tests themselves to also incorporate the noise added in the private computation. Cai et al [CDK17] give a private identity tester based on a noisy $\chi^2$-test over large bins, Sheffet [She17] studies private Ordinary Least Squares using the JL transform, and Karwa and Vadhan [KV18] give matching upper and lower bounds on the confidence intervals for the mean of a population. All of these works, however, deal with the centralized model of differential privacy.
Closest to our setting are the works of Duchi et al [DJW13a, DJW13b], who give matching upper and lower bounds on robust estimators in the local model. And while their lower bounds do inform as to the sample complexity's dependency on $\epsilon$, they do not ascertain the sample complexity's dependency on the size of the domain ($T$) that we get in Section 3. Moreover, these works disregard independence testing (and in fact [DJW13b] focuses on mean estimation, so it applies randomized response to each feature independently, generating a product distribution even when the input isn't sampled from a product distribution). And so, to the best of our knowledge, no work has focused on hypothesis testing in the local model, let alone in the (relatively new) non-symmetric local model.
2 Preliminaries, Notation and Background
We use lower-case letters to denote scalars, bold characters to denote vectors and upper-case letters to denote matrices. So $1$ denotes the number, $\mathbf{1}$ denotes the all-$1$ vector, and $1_{\mathcal{X}\times\mathcal{X}}$ denotes the all-$1$ matrix over a domain $\mathcal{X}$. We use $\mathbf{e}_x$ to denote the standard basis vector with a single $1$ in the coordinate corresponding to $x$. To denote the $x$-coordinate of a vector $\mathbf{v}$ we use $v(x)$, and to denote the $(x,x')$-coordinate of a matrix $M$ we use $M(x,x')$. For a given vector $\mathbf{v}$, we use $\mathrm{diag}(\mathbf{v})$ to denote the matrix whose diagonal entries are the coordinates of $\mathbf{v}$. For any natural $n$, we use $[n]$ to denote the set $\{1, 2, \dots, n\}$.
Distances and norms.
Unless specified otherwise, $\|\mathbf{v}\|$ refers to the $L_2$-norm of $\mathbf{v}$, whereas $\|\mathbf{v}\|_1$ refers to the $L_1$-norm. We also denote . For a matrix, $\|M\|_1$ denotes (as usual) the maximum absolute column sum. We identify a distribution $\mathbf{p}$ over a domain $\mathcal{X}$ as a $|\mathcal{X}|$-dimensional vector with non-negative entries that sum to $1$. This defines the total variation distance between two distributions: $d_{TV}(\mathbf{p},\mathbf{q}) = \frac{1}{2}\|\mathbf{p}-\mathbf{q}\|_1$. (On occasion, we will apply $d_{TV}$ to vectors that aren't distributions, but rather nearby estimations; in those cases we use the same definition: half of the $L_1$-norm.) It is known that the TV-distance is a metric over distributions. We also use the $\chi^2$-divergence to measure the difference between two distributions: $d_{\chi^2}(\mathbf{p},\mathbf{q}) = \sum_{x} \frac{(p(x)-q(x))^2}{q(x)}$. The $\chi^2$-divergence is not symmetric and can be infinite; however, it is non-negative and is zero only when $\mathbf{p} = \mathbf{q}$. We refer the reader to [SV16] for more properties of the total-variation distance and the $\chi^2$-divergence.
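As a small numerical companion to these two definitions, the sketch below computes both quantities for distributions given as lists. It assumes the common convention in which the $\chi^2$-divergence divides by the second argument; the function names are illustrative.

```python
def tv_distance(p, q):
    """Total variation distance: half of the L1 distance between p and q."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def chi2_divergence(p, q):
    """Chi-square divergence of p from q, dividing by the entries of q (assumed convention)."""
    total = 0.0
    for a, b in zip(p, q):
        if b == 0.0:
            if a != 0.0:
                return float("inf")  # p puts mass where q has none
            continue
        total += (a - b) ** 2 / b
    return total

p = [0.5, 0.5, 0.0]
q = [0.25, 0.25, 0.5]
```

On this pair the total variation distance is symmetric, while `chi2_divergence(p, q)` is finite and `chi2_divergence(q, p)` is infinite, which illustrates the asymmetry noted in the text.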
An algorithm $\mathcal{A}$ is called $\epsilon$-differentially private if, for any two datasets $D$ and $D'$ that differ only on the details of a single user and any set of outputs $O$, we have that $\Pr[\mathcal{A}(D) \in O] \leq e^{\epsilon} \Pr[\mathcal{A}(D') \in O]$. The unacquainted reader is referred to the Dwork-Roth monograph [DR14] as an introduction to the rapidly growing field of differential privacy.
Hypothesis testing is an extremely wide field of study; see [HMC05] as just one of many resources about it. In general, however, the problem of hypothesis testing is to test whether a given set of samples was drawn from a distribution satisfying the null hypothesis or the alternative hypothesis. Thus, the null hypothesis is merely a set of possible distributions $H_0$ and the alternative is a disjoint set $H_1$. Hypothesis testing boils down to estimating a test statistic whose distribution has been estimated under the null hypothesis (or the alternative hypothesis). We can thus reject the null hypothesis if the observed value of the statistic is highly unlikely, or accept the null hypothesis otherwise. We call an algorithm a tester if the acceptance (in the completeness case) or rejection (in the soundness case) happens with probability at least $2/3$. Standard amplification techniques (return the median of independent tests) reduce the error probability from $1/3$ to any $\beta > 0$ at the expense of increasing the sample complexity by a factor of $O(\log(1/\beta))$; hence we focus on achieving a constant error probability. One of the most prevalent and basic tests is identity testing, where the null hypothesis is composed of a single distribution $H_0 = \{\mathbf{p}\}$ and our goal is to accept if the samples are drawn from $\mathbf{p}$ and reject if they were drawn from any other distribution that is $\alpha$-far (in $d_{TV}$) from $\mathbf{p}$. Another extremely common tester is for independence, when $\mathcal{X}$ is composed of several features (i.e., $\mathcal{X} = \mathcal{X}^1 \times \mathcal{X}^2 \times \dots \times \mathcal{X}^d$) and the null hypothesis is composed of all product distributions $H_0 = \{\mathbf{p}^1 \times \dots \times \mathbf{p}^d\}$ where each $\mathbf{p}^j$ is a distribution on the $j$th feature $\mathcal{X}^j$.
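The amplification step mentioned above can be sketched as follows. For an accept/reject tester the median of the outcomes coincides with the majority vote; the helper `amplify` and the simulated base tester are hypothetical and serve only to illustrate the idea.

```python
import random
from statistics import median

def amplify(run_once, repetitions):
    """Run an accept/reject tester an odd number of times and return the median vote."""
    if repetitions % 2 == 0:
        raise ValueError("use an odd number of repetitions so the median is a vote")
    votes = [1 if run_once() else 0 for _ in range(repetitions)]
    return median(votes) == 1

# A simulated base tester that answers correctly (accepts) with probability 2/3.
rng = random.Random(1)
noisy_accept = lambda: rng.random() < 2.0 / 3.0
decision = amplify(noisy_accept, 101)
```

Each call to `run_once` is assumed to use a fresh, independent batch of samples, which is where the extra multiplicative factor in the sample complexity comes from.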
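The Chebyshev inequality states that for any random variable $X$ and any $t > 0$, we have that $\Pr[|X - \mathbb{E}[X]| \geq t] \leq \mathrm{Var}(X)/t^2$. We also use the Hoeffding inequality, stating that for iid random variables $X_1, \dots, X_n$ in the range $[a,b]$ and any $t > 0$ we have that $\Pr[\frac{1}{n}\sum_i X_i - \mathbb{E}[X_1] \geq t] \leq \exp(-\frac{2nt^2}{(b-a)^2})$ and similarly that $\Pr[\frac{1}{n}\sum_i X_i - \mathbb{E}[X_1] \leq -t] \leq \exp(-\frac{2nt^2}{(b-a)^2})$. It is a particular case of the McDiarmid inequality, stating that for every function $h$ of independent random variables $X_1, \dots, X_n$ such that changing the $i$th argument alone changes $h$ by at most $c_i$, we have $\Pr[h(X_1,\dots,X_n) - \mathbb{E}[h] \geq t] \leq \exp(-\frac{2t^2}{\sum_i c_i^2})$.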
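As an illustration of how the Hoeffding inequality is typically used to size a sample, the sketch below solves the standard one-sided bound $\exp(-2nt^2/(b-a)^2) \leq \beta$ for $n$. The function name and the parameter names `t` and `beta` are assumptions made for this example.

```python
import math

def hoeffding_sample_size(t, beta, lo=0.0, hi=1.0):
    """Smallest n for which exp(-2 n t^2 / (hi - lo)^2) <= beta."""
    return math.ceil((hi - lo) ** 2 * math.log(1.0 / beta) / (2.0 * t ** 2))

# Number of samples in [0, 1] needed so the empirical mean overshoots the
# true mean by more than 0.1 with probability at most 0.05.
n = hoeffding_sample_size(0.1, 0.05)
```

The quadratic dependence on $1/t$ is the reason sample complexities in this setting scale with the inverse square of the accuracy parameter.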
A matrix $M$ is called positive semidefinite (PSD) if for any unit-length vector $\mathbf{u}$ we have $\mathbf{u}^T M \mathbf{u} \geq 0$. We use $M \succeq 0$ to denote that $M$ is a PSD matrix, and $M \succeq N$ to denote that $(M - N) \succeq 0$. We use $M^{\dagger}$ to denote $M$'s pseudo-inverse. When the rows of $M$ are linearly independent, we have that $M^{\dagger} = M^T (M M^T)^{-1}$. We emphasize that we made no effort to minimize constants in our proofs, and only strove to obtain asymptotic bounds ($O(\cdot)$, $\Omega(\cdot)$). We use $\tilde{O}$ to hide poly-log factors.
3 Symmetric Signaling Scheme
Recall, in the symmetric signaling scheme, each user's type is mapped through a random function $f$ into a set of signals $\mathcal{S}$. This mapping is index-oblivious: each user of type $x$ sends the signal $s$ with the same probability $\Pr[f(x) = s]$. We denote by $G$ the $(|\mathcal{S}| \times |\mathcal{X}|)$-matrix whose entries are $G(s,x) = \Pr[f(x) = s]$, and its $s$th row by $\mathbf{g}^s$. Note that all entries of $G$ are non-negative and that for each $x$ we have $\sum_{s} G(s,x) = 1$. By garbling each datum i.i.d., we observe the new dataset $(y_1, \dots, y_n) \in \mathcal{S}^n$.
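Theorem 1. For any convex set of hypotheses, the problem of finding the max-likelihood hypothesis generating the observed signals is poly-time solvable.
Proof. Since $G(s,x)$ describes the probability that a user of type $x$ sends the signal $s$, any distribution $\mathbf{p}$ over the types in $\mathcal{X}$ yields a distribution on $\mathcal{S}$ where
$$\Pr[\text{signal } s] = \sum_{x \in \mathcal{X}} G(s,x)\, p(x) = \mathbf{g}^s \cdot \mathbf{p},$$
with $\mathbf{g}^s$ denoting the row of $G$ that corresponds to the signal $s$ (treated as a column vector below).
Therefore, given the signals $y_1, \dots, y_n$, we can summarize them by a histogram $(n_s)_{s \in \mathcal{S}}$ over the different signals, and thus the likelihood of seeing these particular signals is given by:
$$\prod_{s \in \mathcal{S}} \left(\mathbf{g}^s \cdot \mathbf{p}\right)^{n_s}.$$
Denoting the log-loss function as $\mathcal{L}(\mathbf{p}) = -\frac{1}{n}\sum_{s \in \mathcal{S}} n_s \log(\mathbf{g}^s \cdot \mathbf{p})$, we get that its gradient is
$$\nabla \mathcal{L}(\mathbf{p}) = -\frac{1}{n}\sum_{s \in \mathcal{S}} \frac{n_s}{\mathbf{g}^s \cdot \mathbf{p}}\, \mathbf{g}^s$$
and its Hessian is given by the $(|\mathcal{X}| \times |\mathcal{X}|)$-matrix
$$\nabla^2 \mathcal{L}(\mathbf{p}) = \frac{1}{n}\sum_{s \in \mathcal{S}} \frac{n_s}{(\mathbf{g}^s \cdot \mathbf{p})^2}\, \mathbf{g}^s (\mathbf{g}^s)^T.$$
As each $\mathbf{g}^s (\mathbf{g}^s)^T$ is a PSD matrix, and each of these rank-$1$ summands is scaled by a non-negative number, it follows that the Hessian is a PSD matrix and that our loss function is convex. Finding the minimizer of a convex function over a convex set is poly-time solvable (say, by gradient descent [Zin03]), so we are done. ∎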
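The argument above can be checked numerically on a toy instance. The sketch below assumes a signaling matrix stored as a list of rows (one row per signal, one column per type) and a log-loss normalized by the number of signals; the normalization, the example numbers and the function names are illustrative choices, not taken from the text.

```python
import math

def signal_distribution(G, p):
    """Distribution over signals induced by a distribution p over types."""
    return [sum(g * px for g, px in zip(row, p)) for row in G]

def log_loss(G, counts, p):
    """Negated average log-likelihood of the signal histogram `counts`."""
    n = sum(counts)
    q = signal_distribution(G, p)
    return -sum(c * math.log(qs) for c, qs in zip(counts, q)) / n

def gradient(G, counts, p):
    """Gradient of log_loss with respect to p."""
    n = sum(counts)
    q = signal_distribution(G, p)
    grad = [0.0] * len(p)
    for row, c, qs in zip(G, counts, q):
        for x, g in enumerate(row):
            grad[x] -= c * g / (n * qs)
    return grad

# Two signals, three types; every column sums to one.
G = [[0.6, 0.3, 0.2],
     [0.4, 0.7, 0.8]]
counts = [30, 70]
p = [0.2, 0.3, 0.5]
g = gradient(G, counts, p)
```

Since the loss is convex, any off-the-shelf projected first-order method over a convex hypothesis set can consume `log_loss` and `gradient` directly.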
Unfortunately, in general the solution to this problem has no closed form (to the best of our knowledge). However, we can find a closed-form solution under the assumption that , the assumption that (all applications we are aware of use no more signals than user types) and one extra condition.
Corollary 2. Let be the -dimensional vector given by . Given that , that is a full-rank matrix satisfying and assuming that , then any vector in of the form where and is a hypothesis that maximizes the likelihood of the given signals .
Proof. Our goal is to find some which minimizes . Denoting as the -dimensional vector such that , we note that isn’t just any linear transformation, but rather one that induces probability over the signals, and so is a non-negative vector that sums to . We therefore convert the problem of minimizing our loss function into the following optimization problem
Using Lagrange multipliers, it is easy to see that and that and so the minimizer is obtained when equates all ratios for all , namely when . Since we assume has a non-empty intersection with , then let be any hypothesis in of the form where . We get that is the minimizer of satisfying all constraints. By assumption, . Due to the fact that is full-rank and that we have that , and by definition, is a valid distribution vector (non-negative that sums to ). ∎
If all conditions of Corollary 2 hold, we get a simple procedure for finding a minimizer for our loss function: (1) Compute the pseudo-inverse and find ; (2) find a vector such that . (The latter step requires the exact description of , and might be difficult if is not convex. However, if is convex, then is a shift of a convex body and therefore convex, so finding the point which minimizes the distance to a given linear subspace is a feasible problem.)
3.1 Hypothesis Testing under Randomized-Response
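We now aim to check the effect of a particular $G$, the one given by the randomized-response mechanism. In this case $\mathcal{S} = \mathcal{X}$ and we denote $G$ as the matrix whose entries are $G(x',x) = \rho e^{\epsilon}$ if $x' = x$ and $G(x',x) = \rho$ otherwise, where $\rho = \frac{1}{T-1+e^{\epsilon}}$ and $T = |\mathcal{X}|$. We get that $G = \rho 1_{\mathcal{X}\times\mathcal{X}} + \rho(e^{\epsilon}-1) I$ (where $1_{\mathcal{X}\times\mathcal{X}}$ is the all-$1$ matrix). In particular, all vectors $\mathbf{g}^x$, which correspond to the rows of $G$, are of the form: $\mathbf{g}^x = \rho\mathbf{1} + \rho(e^{\epsilon}-1)\mathbf{e}_x$. It follows that for any probability distribution $\mathbf{p}$ we have that $\mathbf{g}^x \cdot \mathbf{p} = \rho + \rho(e^{\epsilon}-1)p(x)$. We have therefore translated any $\mathbf{p}$ (over $\mathcal{X}$) to a hypothesis over $\mathcal{S}$ (which in this case is $\mathcal{X}$ itself), using the affine transformation $\varphi(\mathbf{p}) = \rho(e^{\epsilon}-1)\mathbf{p} + T\rho\,\mathbf{u}_{\mathcal{X}}$, where $\mathbf{u}_{\mathcal{X}}$ denotes the uniform distribution over $\mathcal{X}$. (Indeed, $T\rho + \rho(e^{\epsilon}-1) = 1$, an identity we will often apply.) Furthermore, at the risk of overburdening notation, we use $\varphi$ to denote the same transformation over scalars, vectors and even sets (applying $\varphi$ to each vector in the set).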
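The affine structure of randomized response can be verified numerically. The sketch below assumes the standard parameterization in which the true type is reported with probability $\rho e^{\epsilon}$ and every other type with probability $\rho = 1/(T-1+e^{\epsilon})$; the names `rr_matrix`, `phi` and `phi_inverse` are illustrative.

```python
import math

def rr_matrix(T, eps):
    """Randomized-response matrix: entry [s][x] is Pr[signal s | type x]."""
    rho = 1.0 / (T - 1 + math.exp(eps))
    return [[rho * math.exp(eps) if s == x else rho for x in range(T)]
            for s in range(T)]

def phi(p, eps):
    """Map a distribution over types to the induced distribution over signals."""
    rho = 1.0 / (len(p) - 1 + math.exp(eps))
    return [rho * (math.exp(eps) - 1.0) * px + rho for px in p]

def phi_inverse(q, eps):
    """Undo phi: recover the distribution over types from the one over signals."""
    rho = 1.0 / (len(q) - 1 + math.exp(eps))
    return [(qs - rho) / (rho * (math.exp(eps) - 1.0)) for qs in q]

def tv(p, q):
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

eps = 1.0
p = [0.5, 0.3, 0.2]
q = [0.2, 0.2, 0.6]
G = rr_matrix(len(p), eps)
Gp = [sum(g * px for g, px in zip(row, p)) for row in G]
shrink = (math.exp(eps) - 1.0) / (len(p) - 1 + math.exp(eps))
```

Here `Gp` agrees with `phi(p, eps)`, and the total variation distance between `phi(p, eps)` and `phi(q, eps)` equals `shrink * tv(p, q)`, which quantifies how much closer two hypotheses become after the mechanism is applied.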
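As the transformation $\varphi$ is injective, we have therefore discovered the following theorem.
Theorem 3. Under the classic randomized response mechanism, testing for any hypothesis $H_0$ (or for comparing $H_0$ against the alternative $H_1$) of the original distribution translates into testing for the hypothesis $\varphi(H_0)$ (or $\varphi(H_0)$ against $\varphi(H_1)$) for generating the observed signals, where $\varphi$ is the affine transformation that the mechanism induces on distributions.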
Theorem 3 seems very natural and simple, and yet (to the best of our knowledge) it was never put into words.
Moreover, it is simple to see that under standard-randomized response, our log-loss function is in fact strongly-convex, and therefore finding becomes drastically more efficient (see, for example [HKKA06]).
Given signals generated using standard randomized response with parameter , we have that our log-loss function from Equation (1) is -strongly convex.
Note that in expectation , hence with overwhelming probability we have so our log-loss function is -strongly convex.
Recall that for any we have . Hence, our log-loss function , whose gradient is the vector whose -coordinate is . The Hessian of is therefore the diagonal matrix whose diagonal entries are . Recall the definitions of and : it is easy to see that , and since we also have that , hence . And so:
making at least ()-strongly convex. ∎
A variety of corollaries follow from Theorem 3. In particular, a variety of results detailing matching sample complexity upper and lower bounds translate automatically into the realm of making such hypothesis tests over the outcomes of the randomized-response mechanism. We focus here on two of the most prevalent tests: identity testing and independence testing.
Perhaps the simplest of all hypothesis tests is to test whether a given sample was generated according to a given distribution or not. Namely, the null hypothesis is a single hypothesis , and the alternative is for a given parameter . The seminal work of Valiant and Valiant [VV14] discerns that (roughly) samples are sufficient and are necessary for correctly rejecting or accepting the null hypothesis w.p. . [Footnote 3: For the sake of brevity, we ignore pathological examples where by removing probability mass from we obtain a vector of significantly smaller -norm.]
Here, the problem of identity testing under standard randomized response reduces to the problem of hypothesis testing between and .
Corollary 5. In order to do identity testing under standard randomized response with confidence and power , it is necessary and sufficient that we get samples.
Proof. For any it follows that . Recall that and , and so, for we have and , namely and . Next, we bound :
It follows that the necessary and sufficient number of samples required for identity-testing under standard randomized response is proportional to
For any -dimensional vector with -norm of we have . Thus and therefore the first of the two terms in the sum is the greater one. The required follows.
Comment: It is evident that the tester given by Valiant and Valiant [VV14] solves (w.p. ) the problem of identity-testing in the randomized response model using samples. However, it is not a-priori clear why their lower bounds hold for our problem. After all, the set is only a subset of . Nonetheless, delving into the lower bound of Valiant and Valiant, the collection of distributions which is hard to differentiate from given samples is given by choosing suitable and then looking at the ensemble of distributions given by for each . Luckily, this ensemble is maintained under , mapping each such distribution to . The lower bound follows. ∎
Another prevalent hypothesis test over a domain where each type is composed of multiple features is independence testing (examples include whether having a STEM degree is independent of gender or whether a certain gene is uncorrelated with cancer). Denoting as a domain with possible features (hence ), our goal is to discern whether an observed sample is drawn from a product distribution or a distribution -far from any product distribution. In particular, the null hypothesis in this case is a complex one: and the alternative is . To the best of our knowledge, the (current) tester with the smallest sample complexity is that of Acharya et al [ADK15], which requires iid samples.
We now consider the problem of testing for independence under standard randomized response. [Footnote 4: Note that if we were to implement feature-wise randomized response (i.e., run randomized response per feature with privacy loss set to ) then we would definitely create signals that come from a product distribution. That is why we stick to the straightforward implementation of randomized response even when is composed of multiple features.] Our goal is to prove the following theorem.
Theorem 6. There exists an algorithm that takes signals generated by applying standard randomized response (with ) on samples drawn from a distribution over a domain and with probability accepts if , or rejects if . Moreover, no algorithm can achieve such guarantee using signals.
Note that there have to be at least two types per feature, so , and if all s are the same we have . Thus is the leading term in the above bound.
Theorem 3 implies we are comparing to . Note that is not a subset of product-distributions over but rather a convex combination (with publicly known weights) of the uniform distribution and ; so we cannot run the independence tester of Acharya et al on the signals as a black-box. Luckily — and similar to the identity testing case — it holds that is far from all distributions in : for each and we have . And so we leverage on the main result of Acharya et al ([ADK15] , Theorem 2): we first find a distribution such that if the signals were generated by some then , and then test if indeed the signals are likely to be generated by a distribution close to using Acharya et al’s algorithm. Again, we follow the pattern of [ADK15] — we construct as a product distribution where is devised by projecting each signal onto its th feature. Note that the th-marginal of the distribution of the signals is of the form (again, denotes the uniform distribution over ). Therefore, for each we derive by first approximating the distribution of the th marginal of the signals via some , then we apply the inverse mapping from Corollary 2 to so to get the resulting distribution which we show to approximate the true . We now give our procedure for finding the product-distribution .
Per feature , given the th feature of the signals where each appears times, our procedure for finding is as follows.
(Preprocessing:) Denote . We call any type where as small and otherwise we say type is large. Ignore all small types, and learn only over large types. (For brevity, we refer to as the number of signals on large types and as the number of large types.)
Set the distribution as the “add-1” estimator of Kamath et al [KOPS15] for the signals: .
Once is found for each feature , run the test of Acharya et al [ADK15] (Theorem 2) with looking only at the large types from each feature, setting the distance parameter to and confidence , to decide whether to accept or reject.
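The first two steps of the per-feature procedure above can be sketched as follows. The exact threshold separating small from large types is not reproduced here, so it is passed in as a hypothetical `threshold` parameter, and the estimator is the usual add-1 rule, (count + 1) / (number of signals + number of types), restricted to the large types. The subsequent inversion of the marginal and the call to the independence tester are not shown.

```python
from collections import Counter

def add_one_estimate(signals, threshold):
    """Add-1 estimate of one feature's signal distribution over its large types.

    Types observed fewer than `threshold` times (a hypothetical cutoff) are
    treated as small and ignored.
    """
    counts = Counter(signals)
    large = {x: c for x, c in counts.items() if c >= threshold}
    n_large = sum(large.values())   # number of signals on large types
    t_large = len(large)            # number of large types
    return {x: (c + 1) / (n_large + t_large) for x, c in large.items()}

feature_signals = ["a"] * 6 + ["b"] * 3 + ["c"]
estimate = add_one_estimate(feature_signals, threshold=2)
```

In this toy run the type `"c"` is dropped as small, and the estimate over the remaining types is a proper distribution with no zero entries, which is what keeps a $\chi^2$-type divergence against it finite.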
In order to successfully apply Acharya et al's test, a few conditions need to hold. First, the provided distribution should be close to . This, however, holds trivially, as is a product distribution. Secondly, we need and to be close in $\chi^2$-divergence, as we argue next.
Lemma 7. Suppose that , the number of signals, is at least . Then the above procedure creates distributions such that the product distribution satisfies the following property. If the signals were generated by for some product-distribution , then w.p. we have that .
We table the proof of Lemma 7 for now. Next, either completeness or soundness must happen: either the signals were taken from randomized-response on a product distribution (were generated using some ), or they were generated by a distribution -far from . If no type of any feature was deemed as “small” in our preprocessing stage, this condition clearly holds; but we need to argue this continues to hold even when we run our tester on a strict subset of composed only of large types in each feature. Completeness is straight-forward: since we remove types feature by feature, the types now come from a product distribution where each is a restriction of to the large types of feature , and Lemma 7 assures us that and are close in -divergence. Soundness however is more intricate. We partition into two subsets: and ; and break into , with . Using the Hoeffding bound, Claim 8 argues that . Therefore, , implying that .
Claim 8. Assume the underlying distribution of the samples is and that the number of signals is at least . Then w.p. our preprocessing step marks certain types of each feature as “small” such that the probability (under ) of sampling a type such that is .
So, given that both Lemma 7 and Claim 8 hold, we can use the test of Acharya et al, which requires a sample of size . Recall that so , and we get that the sample size required for the last test is . Moreover, for this last part, the lower bound in Acharya et al [ADK15] still holds (for the same reason it holds in the identity-testing case): the lower bound is derived from the counterexample of testing whether the signals were generated from the uniform distribution (which clearly lies in ) or any distribution from a collection of perturbations which all belong to (see [Pan08] for more details). Each such distribution is thus -far from and so any tester for this particular construction requires -many samples. Therefore, once we provide the proofs of Lemma 7 and Claim 8 our proof of Theorem 6 is done.
4 Non-Symmetric Signaling Schemes
Let us recall the non-symmetric signaling schemes in [BS15, BNST17]. Each user , with true type , is assigned her own mapping (the mapping is broadcast and publicly known) . This sets her inherent signal to , and then she runs standard (symmetric) randomized response on the signals, making the probability of sending her true signal $e^{\epsilon}$ times greater than that of any other signal .
In fact, let us allow an even broader look. Each user is given a mapping , and denoting and , we identify this mapping with a -matrix . The column is the probability distribution that a user of type is going to use to pick which signal she broadcasts. (And so the guarantee of differential privacy is that for any signal and any two types we have that .) Therefore, all entries in are non-negative and for all s.
Similarly to the symmetric case, we first exhibit the feasibility of finding a maximum-likelihood hypothesis given the signals from the non-symmetric scheme. Since we view which signal in was sent, our likelihood mainly depends on the row vectors .
Theorem 9. For any convex set of hypotheses, the problem of finding the max-likelihood hypothesis generating the observed non-symmetric signals is poly-time solvable.
Proof. Fix any , a probability distribution on . Using the public we infer a distribution on , as
with denoting the row of corresponding to signal .
Therefore, given the observed signals , the likelihood of any is given by
Naturally, the function we minimize is the negation of the average log-likelihood, namely
whose partial derivatives are: , so the gradient of is given by