Locally differentially private (LDP) mechanisms have gained prominence as methods of choice for sharing sensitive data with untrusted curators. This strong notion of privacy, introduced in [DJW13] (see also [EGS03]) as a variant of differential privacy [DMNS06, Dwo06], requires each user to report only a noisy version of its data such that the distribution of the reported data does not change multiplicatively beyond a prespecified factor when the underlying user data changes. With the proliferation of user data accumulated using such locally private mechanisms, there is an increasing demand for designing data analytics toolkits for operating on the collated user data. In this paper, we consider the design of algorithms aimed at providing a basic ability to such a toolkit, namely the ability to run statistical tests for the underlying user data distribution. At a high level, we seek to address the following question.
How should one conduct statistical testing on the (sensitive) data of users, such that each user maintains their own privacy both to the outside world and to the (untrusted) curator performing the inference?
In particular, we consider two fundamental statistical inference problems for a discrete distribution over a large alphabet: identity testing (goodness-of-fit) and independence testing. A prototypical example of the former is testing whether the user data was generated from a uniform distribution; the latter tests if two components of user data vectors are independent. Our main focus is the uniformity testing problem; most of our other results are obtained as extensions using similar techniques. We seek algorithms that are efficient in the number of LDP user data samples required and can be implemented practically. These two problems are instances of distribution testing, a sub-area of statistical hypothesis testing focusing on small-sample analysis, introduced by Batu et al. [BFR00] and Goldreich, Goldwasser, and Ron [GGR98].
Our results are comprehensive, and organized along two axes: First, we consider tests that use existing LDP data release mechanisms to collect inputs at the center and perform a post-processing test on this aggregated data. Specifically, we consider the popular Rappor mechanism of [EPK14] and the recently introduced Hadamard Response (HR) mechanism of [ASZ18]. Because these mechanisms have utility beyond our specific use-case of distribution testing – Rappor, for instance, is already deployed in many applications – it is natural to build a more comprehensive data analytics toolkit using the data accumulated by these mechanisms. To this end, we provide uniformity testing algorithms with optimal sample complexity for both mechanisms; further, for HR, we also provide an independence testing algorithm and analyze its performance.
Second, we consider the more general class of public-coin mechanisms for solving testing problems which are allowed to use public randomness. We present a new response mechanism, Randomized Aggregated Private Testing Optimal Response (Raptor), that only requires users to send a single privatized bit indicating whether their data point is in a (publicly known) random subset of the domain. Using Raptor, we obtain simple algorithms for uniformity and independence testing that are sample-optimal even among public-coin mechanisms.
We next provide a detailed description of our results, followed by a discussion of the relevant literature to put them in perspective. At the outset we mention that the problems studied here have been introduced earlier in [She18, GR18]. Our algorithms outperform their counterparts from these papers, and we complement them with information-theoretic lower bounds establishing their optimality (except for the proposed HR-based independence test).
1.1 Algorithms and results
The privacy level of a locally private mechanism is often parameterized by a single parameter ε ≥ 0. Specifically, an ε-LDP mechanism (Duchi et al. [DJW13]) ensures that for any two distinct values of user data, the distributions of the output reported to the curator are within a multiplicative factor of e^ε of each other; smaller values of ε indicate stronger privacy guarantees. In this work, we focus on the high-privacy regime, and assume throughout that ε ≤ 1; however, our choice of 1 as an upper bound is only to set a convention and can be replaced with any constant.
In uniformity testing, the user data comprises independent samples from an unknown k-ary distribution. These samples are then made available to the curator through an ε-LDP mechanism, and she seeks to determine if the underlying distribution was uniform or α-far from uniform in total variation distance. How many locally private samples must the curator access?
First, we consider two representative locally private mechanisms, Rappor and HR. We briefly describe these mechanisms here informally and provide a more complete definition in Section 2. In Rappor, the k-ary observation of the user is first converted to a k-bit one-hot encoding, each bit of which is then flipped independently with an appropriate probability. HR, on the other hand, is a generalization of the classic Randomized Response (RR) [War65], which roughly maps each k-ary observation x either to a uniformly random element of a set S_x determined by the x-th row of the Hadamard matrix, with probability e^ε/(e^ε + 1), or to a uniformly random element of the complement of S_x, with probability 1/(e^ε + 1). Interestingly, both these mechanisms have been shown recently to be sample-optimal for learning k-ary distributions; see [DJW17, EPK14, WHW16, YB17, KBR16, ASZ18]. Further, note that both Rappor and HR are private-coin mechanisms, and are symmetric across users.
We propose the following algorithm to enable uniformity testing using data obtained via Rappor. Once again, the description here is brief and a formal description is provided in Section 3.1.
We analyze the sample complexity of the above test and show that it is order-wise optimal among all tests that use Rappor.
Result 1 (Sample complexity of uniformity testing using Rappor).
The uniformity test described above requires O(k^{3/2}/(α² ε²)) samples. Furthermore, any test using Rappor must use Ω(k^{3/2}/(α² ε²)) samples.
Moving now to HR, denote by q_u the output distribution of HR when the underlying samples are generated from the uniform distribution. (Note that q_u can be computed explicitly.) Invoking Parseval’s theorem, we show that the ℓ₂ distance between q_u and the output distribution of HR is, up to an explicit factor depending on k and ε, proportional to the distance between the uniform and the user data distributions. This motivates the following test.
Our next result shows that this test is indeed sample-optimal among all tests using HR.
Result 2 (Sample complexity of uniformity testing using HR).
The uniformity test described above requires O(k^{3/2}/(α² ε²)) samples. Furthermore, any test using HR must use Ω(k^{3/2}/(α² ε²)) samples.
Both tests proposed above thus provably cannot be improved beyond this barrier of k^{3/2} samples. Interestingly, this was conjectured by Sheffet to be the optimal sample complexity of locally private uniformity testing [She18], although no algorithm achieving this sample complexity was provided. Yet, our next result shows that one can achieve the same guarantees with far fewer samples when public randomness is allowed.
Specifically, we introduce a new mechanism, Raptor, described below:
The key observation is that when the underlying distribution is α-far from uniform, the bias of the reported bit is 1/2 + Ω(α/√k) with constant probability (over the choice of the random subset); while clearly, under the uniform distribution the bits are unbiased. Thus, we can simply test for uniformity by estimating the bias of the bit up to an accuracy of O(α/√k), which can be done using O(k/(α² ε²)) samples from Raptor. In fact, we further show that (up to constant factors) this number of samples cannot be improved upon.
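To make the reduction concrete, the following Python sketch simulates Raptor under the description above: each user reports whether its sample lies in a public random subset of size k/2, privatized via binary randomized response, and the curator debiases the reported bits to estimate the set's probability mass. All function names here are ours, for illustration only.

```python
import math
import random

def raptor_bit(x, S, eps, rng):
    # User side: report whether x lies in the public random set S,
    # privatized with binary eps-LDP randomized response.
    b = 1 if x in S else 0
    t = math.exp(eps) / (math.exp(eps) + 1.0)  # probability of reporting truthfully
    return b if rng.random() < t else 1 - b

def estimate_set_mass(samples, S, eps, rng):
    # Curator side: debias the reported bits to estimate p(S), using
    # E[reported bit] = (1 - t) + (2t - 1) * p(S).
    t = math.exp(eps) / (math.exp(eps) + 1.0)
    m = sum(raptor_bit(x, S, eps, rng) for x in samples) / len(samples)
    return (m - (1.0 - t)) / (2.0 * t - 1.0)
```

Under the uniform distribution the mass of any set of size k/2 is exactly 1/2, so the curator simply checks whether the estimate deviates from 1/2 by more than a threshold of order α/√k.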
Result 3 (Sample complexity of locally private uniformity testing).
Uniformity testing using Raptor requires O(k/(α² ε²)) samples. Furthermore, any public-coin mechanism for locally private uniformity testing requires Ω(k/(α² ε²)) samples.
Although we have stated the previous three results for uniformity testing, our proofs extend easily to identity testing, i.e., the problem of testing equality of the underlying distribution to a fixed known distribution that is not necessarily uniform. In fact, if we allow simple preprocessing of user observations before applying locally private mechanisms, a reduction argument due to Goldreich [Gol16] can be used to directly convert identity testing to uniformity testing.
Our final set of results is for independence testing, where user data consists of two-dimensional vectors from [k] × [k]. We seek to ascertain whether these vectors were generated from an independent (product) distribution or from a distribution that is α-far in total variation distance from every product distribution. For this problem, a natural counterpart of Raptor, which simply applies Raptor to each of the two coordinates using independently generated sets, yields a sample-optimal test – indeed, we then simply need to test whether the pair of indicator bits is independent or not. This can be done using O(k²/(α² ε²)) samples, leading to the following result.
Result 4 (Sample complexity of locally private independence testing).
The sample complexity of locally private independence testing is Θ(k²/(α² ε²)), and is achieved by a simple public-coin mechanism that applies Raptor to each coordinate of user data.
For completeness, we also present a private-coin mechanism for independence testing based on HR. The proposed test builds on a technique introduced in Acharya, Daskalakis, and Kamath [ADK15] and relies on learning in χ² divergence. Although this result is suboptimal in the dependence on the privacy parameter ε, it improves on both [She18] and the testing-by-learning baseline approach. We summarize all our results in Table 1 and compare them with the best known prior bounds from [She18].
Table 1: Comparison of the sample complexity bounds obtained in this work with the previous bounds of [She18].
1.2 Proof techniques
We start by describing the analysis of our tests based on existing ε-LDP mechanisms. Recall that a standard (non-private) uniformity test entails estimating the ℓ₂ norm of the underlying distribution by counting the number of collisions in the observed samples. When applying the same idea to the data collected via Rappor, we can naively try to estimate the number of collisions by adding, for each coordinate j ∈ [k], the number of pairs of output vectors with 1s in the j-th coordinate. However, the resulting statistic has a prohibitively high variance stemming from the noise added by Rappor. We fix this shortcoming by considering a bias-corrected version of this statistic that closely resembles the classic χ² statistic. However, analyzing the variance of this new statistic turns out to be rather technical and involves handling the covariance of quadratic functions of correlated binomial random variables. Our main technical effort in this part goes into analyzing this covariance, which may find further applications.
For our second test, which builds on HR, we follow a different approach. In this case, we exploit the structure of the Hadamard transform and take recourse to Parseval’s theorem to show that the distance to uniformity of the original distribution is equal, up to an explicit factor depending on k and ε, to the ℓ₂ distance of the Fourier transform of the output distribution to some (explicit) fixed distribution q_u. With this structural result in hand, we can test identity to q_u in the Fourier domain by invoking the non-private ℓ₂ tester of Chan et al. [CDVV14] with the corresponding distance parameter. Exploiting the fact that q_u has a small ℓ₂ norm leads to the stated sample complexity.
Our private-coin mechanism for independence testing uses HR as well, and once again hinges on the idea that testing and learning in the Fourier domain can be done efficiently. To wit, we adapt the “testing-by-learning” framework of Acharya, Daskalakis, and Kamath [ADK15] (which they show can be applied to many testing problems, including independence testing) to our private setting. The main insight here is that instead of using HR to learn and test the original distribution in χ² distance, we perform both operations directly in the transformed domain, on the distribution at the output of HR. Namely, we first learn the Fourier transform of the output distribution, and then test whether the outcome is consistent with the transform induced by a product distribution. The main challenge here is to show that the variant of Hadamard transform that we use preserves (as was the case for uniformity testing) the distance from independence. We believe this approach to be quite general, as was the case in [ADK15], and that it can be used to tackle many other distribution testing questions such as locally private testing of monotonicity or log-concavity.
As mentioned above, our main results – the optimal public-coin mechanisms for identity and independence testing – are remarkably simple. The key heuristic underlying both can be summarized as follows: if p is α-far from uniform, then with constant probability a uniformly random subset S ⊆ [k] of size k/2 will satisfy p(S) ≥ 1/2 + Ω(α/√k); on the other hand, if p is uniform then p(S) = 1/2 always holds. Thus, one can reduce the original testing problem (over alphabet size k) to the much simpler question of estimating the bias of a coin. This latter task is easy to perform optimally in a locally private manner – for instance, it can be completed via RR – and requires each player to send only one bit to the server. Hence, the main technical difficulty is to prove this quite intuitive claim. We do this by showing anticoncentration bounds for a suitable random variable by bounding its fourth moment and invoking the Paley–Zygmund inequality. As a byproduct, we end up establishing a more general version of this claim, Theorem 14, which we believe to be of independent interest.
Our information-theoretic lower bounds are all based on a general approach introduced recently by Acharya, Canonne, and Tyagi [ACT18] (in a non-private setting) that allows us to handle the change in distances between distributions when information constraints are imposed on samples. We utilize the by-now-standard “Paninski construction” [Pan08], a collection of distributions obtained by adding small pointwise perturbations to the k-ary uniform distribution. In order to obtain a lower bound for the sample complexity of locally private uniformity testing, following [ACT18], we identify such a mechanism with the noisy channels (that is, the randomized mappings used by the players) it induces on the samples and consider the distribution of the tuple of messages when the underlying distribution of the samples is p. The key step then is to bound the divergence between (i) the distribution of the messages when the samples come from the uniform distribution; and (ii) the average distribution of the messages when p is chosen uniformly at random among the “perturbed distributions.”
Using the results of [ACT18], this in turn is tantamount to obtaining an upper bound on the Frobenius norms of specific matrices that capture the information constraints imposed by the channels. Deriving these bounds on Frobenius norms constitutes the main technical part of the lower bounds and relies on a careful analysis of the underlying mechanism and of the LDP constraints it must satisfy.
On the range of parameters.
As pointed out earlier, in this work we focus on the high-privacy regime, i.e., the case when the privacy parameter ε is small and the privacy constraints on the mechanisms are the most stringent. From a technical standpoint, this allows us to rewrite expressions such as e^ε − 1 and e^{ε/2} − 1, which appear frequently, as simply Θ(ε), and greatly simplifies the statements of our results. However, our results carry through to the general setting of large ε, with (e^ε − 1)² replacing the ε² term; note that the former grows exponentially for large ε.
1.3 Related prior work
Testing properties of a distribution by observing samples from it is a central problem in statistics and has been studied for over a century. Motivated by applications arising from algorithms dealing with massive amounts of data, it has seen renewed interest in the computer science community under the broad title of distribution testing, with a particular focus on sample-optimal algorithms for discrete distributions. This literature itself is over two decades old; we refer the interested reader to surveys and books [Rub12, Can15, Gol17, BW17] for a comprehensive review. Here, we only touch upon works that are related directly to our paper.
The sample complexity of uniformity testing, Θ(√k/α²), was settled in [Pan08], following a long line of work. The related, and more general, problem of identity testing has seen revived interest lately. The sample complexity for this problem was shown to be Θ(√k/α²) in [VV17], and by now even the optimal dependence on the error probability is known ([HM13, DGPP16]). Moreover, a work of Goldreich [Gol16] further shows that any uniformity testing algorithm implies an identity testing one with similar sample complexity. Another variant of this problem, termed “instance-optimal” identity testing and introduced in [VV17], seeks to characterize the dependence of the sample complexity on the distribution we are testing identity to, instead of the alphabet size. As pointed out in [ACT18], the reduction from [Gol16] can be used in conjunction with results from [BCG17] to go through even for the instance-optimal setting. This observation allows us to focus on uniformity testing only, even when local privacy constraints are imposed.
The optimal sample complexity for the independence testing problem where both observations are from the same set [k] was shown to be Θ(k/α²) in [ADK15, DK16]. (The more general question asks to test independence of distributions over a product of two different sets, or even of multi-dimensional distributions; optimal non-private sample complexities for these generalizations are also known [DK16].)
Moving now to distribution testing settings with privacy constraints, the setting of differentially private (DP) testing has by now been extensively studied. Here the algorithm itself is run by a trusted curator who has access to all the user data, but needs to ensure that the output of the test maintains differential privacy. Private identity testing in this sense has been considered in [CDK17, ADR17], with a complete characterization of sample complexity derived in [ASZ17]. Interestingly, in several parameter ranges of interest the sample complexity here matches the sample complexity for the non-private case discussed earlier, showing that “privacy often comes at no additional cost” in this setting. As we show in this work, this is in stark contrast to what can be achieved in the more stringent locally private setting.
We are not aware of any existing DP independence testing algorithm with finite-sample guarantees. While the literature on DP testing includes several interesting mechanisms – for instance, the works [GLRV16, KR17, WLK15] contain mechanisms for both identity and independence testing – finite-sample guarantees are not available and the results hold only in the asymptotic regime.
Finally, coming to the literature most closely related to our work, locally private hypothesis testing was considered first by Sheffet in [She18] where, too, both identity and independence testing were considered. This work characterized the sample complexity of LDP independence and uniformity testing when using Randomized Response, and introduced more general mechanisms. However, as pointed out in Table 1, the algorithms proposed in [She18] require significantly more samples than our sample-optimal algorithms for those questions. Moreover, the overall sample complexity without restricting to any specific class of mechanisms has not been considered.
An interesting concern studied in Sheffet’s work is the distinction between symmetric and asymmetric mechanisms. Broadly speaking, the former are locally private mechanisms where each player applies the same randomized function to its data, while asymmetric mechanisms allow different behaviors, with each player using its own randomized function. While we mention this distinction in our results (see Table 1), we observe in Lemma 4 that allowing asymmetric mechanisms can only improve the sample complexity by at most a logarithmic factor.
Another class of statistical inference problems requires learning the unknown distribution up to a desired accuracy α in total variation distance. Clearly, the testing problems we consider can be solved by privately learning the underlying distributions (to accuracy α/2). The optimal sample complexity of locally private learning of discrete k-ary distributions is known to be Θ(k²/(α² ε²)); see [DJW17, EPK14, YB17, KBR16, ASZ18]. (Furthermore, all these sample-optimal learning schemes are symmetric.) This readily implies a sample complexity upper bound of O(k²/(α² ε²)) for locally private identity testing, and of O(k⁴/(α² ε²)) for independence testing. In this respect, the theoretical guarantees from [She18] are either implied or superseded by this “testing-by-learning” approach.
2 Notation and Preliminaries
We write [k] for the set of integers {1, …, k}, and denote by log and ln the binary and natural logarithms, respectively. We make extensive use of the standard asymptotic O(·), Ω(·), and Θ(·) notation; moreover, we shall sometimes use ≲, ≳, and ≍ for their non-asymptotic counterparts (i.e., a ≲ b if a ≤ C·b, a ≳ b if a ≥ c·b, and a ≍ b if c·b ≤ a ≤ C·b for every choice of the parameters, where c, C > 0 are absolute constants).
Following the standard setting of distribution testing, we consider probability distributions over a discrete (and known) domain [k]. Denote by Δ([k]) the set of all such distributions, endowed with the total variation distance (statistical distance) as a metric, defined as d_TV(p, q) := sup_{S ⊆ [k]} (p(S) − q(S)). It is easy to see that d_TV(p, q) = (1/2)·‖p − q‖₁, where ‖p − q‖₁ is the ℓ₁ distance between p and q viewed as probability mass functions. For a distance parameter α ∈ (0, 1], we say that p, q are α-far if d_TV(p, q) > α; otherwise, they are α-close. We denote by p₁ ⊗ p₂ the product distribution over [k] × [k] defined by (p₁ ⊗ p₂)(i₁, i₂) := p₁(i₁)·p₂(i₂), for i₁, i₂ ∈ [k].
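For concreteness, these two definitions translate directly into code (illustrative helper functions; the names are ours):

```python
def total_variation(p, q):
    # d_TV(p, q) = (1/2) * sum_i |p_i - q_i| for probability vectors p, q.
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def product_distribution(p1, p2):
    # (p1 x p2)(i1, i2) = p1[i1] * p2[i2], flattened in row-major order.
    return [a * b for a in p1 for b in p2]
```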
In distribution testing, for a prespecified set of distributions 𝒫 and given independent samples from an unknown p, our goal is to distinguish with constant probability between the cases (i) p ∈ 𝒫 and (ii) p is α-far from every distribution in 𝒫. (As is typical, we set that probability to be 2/3; by a standard argument, it can be amplified to any 1 − δ at the price of an extra O(log(1/δ)) factor in the sample complexity and running time.) The sample complexity of testing 𝒫 is defined as the minimum number of samples required to achieve this task in the worst case over all p (as a function of k, α, and all other relevant parameters of 𝒫).
The specific problem of identity testing corresponds to 𝒫 = {q} for some fixed and known distribution q. Uniformity testing is the special case of identity testing with q being the uniform distribution u, i.e., u(i) = 1/k for all i ∈ [k]. Lastly, independence testing corresponds to distributions over [k] × [k] and 𝒫 = {p₁ ⊗ p₂ : p₁, p₂ ∈ Δ([k])}.
2.1 Local Differential Privacy
We consider the standard setting of ε-local differential privacy, which we recall below. A single-user mechanism is simply a randomized mapping which, given as input user data x ∈ 𝒳, outputs a random variable Y taking values in 𝒴. We represent this mechanism by a channel W, where W(y|x) denotes the probability that the mechanism outputs y when the user input is x. Similarly, an n-user mechanism is represented by (W₁, …, Wₙ), where Wᵢ denotes the channel used for the i-th user; when n is clear from context, we will simply say mechanism for an n-user mechanism. For our purposes, 𝒳 will be the domain of our discrete probability distributions, [k], and 𝒴 will be identified with [K], for some integer K.
Note that each channel is applied independently to each user’s data. In particular, for independent samples, the outputs of the mechanism are independent, too. The mechanisms described above are private-coin mechanisms: they only require independent, local randomness at each user to implement the local channels Wᵢ. A private-coin mechanism is further said to be symmetric if Wᵢ is the same for all i, in which case, with an abuse of notation, we denote it W. A broader class of mechanisms of interest to us are public-coin mechanisms, where the output of each user may depend additionally on shared public randomness U (independent of the users’ data); when the shared randomness takes the value u, the mechanism uses channels W₁ᵘ, …, Wₙᵘ. Clearly, private-coin mechanisms are a special case, corresponding to a constant U. The above distinction between symmetric and asymmetric mechanisms applies to public-coin mechanisms as well.
A public-coin mechanism is an ε-locally differentially private (ε-LDP) mechanism if it satisfies the following: for every realization u of the public randomness, every user i, every pair of inputs x, x′ ∈ 𝒳, and every output y ∈ 𝒴,
Wᵢᵘ(y|x) ≤ e^ε · Wᵢᵘ(y|x′).
2.2 Existing LDP mechanisms
Three LDP mechanisms will be of interest to us: randomized response, Rappor, and Hadamard response.
The k-randomized response (k-RR) mechanism [War65] is an ε-LDP mechanism, with 𝒴 = 𝒳 = [k], such that
W(y|x) = e^ε / (e^ε + k − 1) if y = x, and W(y|x) = 1 / (e^ε + k − 1) otherwise.
Originally introduced for the binary case (k = 2), it is one of the simplest and most natural response mechanisms.
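The channel above can be sampled directly, as in the following minimal sketch (0-indexed symbols; the function name is ours):

```python
import math
import random

def k_randomized_response(x, k, eps, rng):
    # Report the true symbol w.p. e^eps / (e^eps + k - 1); otherwise report
    # a uniformly random *other* symbol, so every y != x gets equal mass
    # 1 / (e^eps + k - 1), as in the channel definition.
    if rng.random() < math.exp(eps) / (math.exp(eps) + k - 1):
        return x
    y = rng.randrange(k - 1)
    return y if y < x else y + 1
```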
The randomized aggregatable privacy-preserving ordinal response (Rappor) is an ε-LDP mechanism introduced in [DJW13, EPK14]. Its simplest implementation, k-Rappor, maps x ∈ [k] to Y ∈ {0,1}^k in two steps. First, a one-hot encoding is applied to the input x to obtain the vector x̄ ∈ {0,1}^k such that x̄_x = 1 and x̄_j = 0 for j ≠ x. The privatized output Y of k-Rappor is the k-bit vector obtained by flipping each bit of x̄ independently with probability ρ := 1/(e^{ε/2} + 1).
Note that if X is drawn from p, this leads to an output Y whose coordinates Y_j are (non-independent) Bernoulli random variables, with Y_j distributed as Bern(λ_j), where
λ_j := (1 − ρ)·p_j + ρ·(1 − p_j), for j ∈ [k],
with ρ = 1/(e^{ε/2} + 1) as above.
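The encoding and the marginal computation can be sketched as follows (illustrative Python, with parameter names of our choosing):

```python
import math
import random

def rappor(x, k, eps, rng):
    # eps-LDP Rappor: one-hot encode x in {0,1}^k, then flip each bit
    # independently with probability rho = 1 / (e^(eps/2) + 1).
    rho = 1.0 / (math.exp(eps / 2.0) + 1.0)
    return [(1 if j == x else 0) ^ (rng.random() < rho) for j in range(k)]

def lam(p_j, eps):
    # Marginal P(Y_j = 1) when the input distribution puts mass p_j on j:
    # lambda_j = (1 - rho) * p_j + rho * (1 - p_j).
    rho = 1.0 / (math.exp(eps / 2.0) + 1.0)
    return (1 - rho) * p_j + rho * (1 - p_j)
```

A quick simulation with uniform inputs confirms that the empirical frequency of a 1 in any fixed coordinate matches lam(1/k, eps).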
Hadamard response is a symmetric, communication- and time-efficient mechanism, proposed in [ASZ18].
In order to define the Hadamard response mechanism, we first define a general family of ε-LDP mechanisms that includes RR as a special case. Let K and s be two integers with 1 ≤ s ≤ K, and for each x ∈ [k] let S_x ⊆ [K] be a subset of size s. Then, the general privatization scheme maps x to an output Y ∈ [K] distributed as
W(y|x) = e^ε / (s·e^ε + K − s) for y ∈ S_x, and W(y|x) = 1 / (s·e^ε + K − s) for y ∉ S_x,
which can easily be seen to be ε-LDP. Further, note that k-RR corresponds to the special case with K = k, s = 1, and S_x = {x} for all x ∈ [k].
The Hadamard Response mechanism (HR) is obtained by choosing K = Θ(k), s = K/2, and a collection of sets (S_x)_{x ∈ [k]} such that
For every x ∈ [k], |S_x| = K/2.
For every distinct x, x′ ∈ [k], the symmetric difference satisfies |S_x △ S_{x′}| = K/2.
For these parameters, we get that for every x, x′ ∈ [k] with x′ ≠ x,
Pr[Y ∈ S_x | X = x] = e^ε/(e^ε + 1) and Pr[Y ∈ S_x | X = x′] = 1/2,
and combining these two, when X ∼ p the output Y satisfies
Pr[Y ∈ S_x] = 1/2 + p_x·(e^ε − 1)/(2(e^ε + 1)).
A method for constructing such sets that also allows efficient implementation of the resulting mechanism was proposed in [ASZ18] using Hadamard codes (hence the name Hadamard Response). Specifically, let K be the smallest power of two larger than k, so that k < K ≤ 2k, and let H_K be the Hadamard matrix of order K (see Section 2.3 for more details). Hereafter, we identify each row of H_K with the subset of [K] indexing its +1 entries. As K > k, we can pick an injection from [k] into the rows of H_K other than the first, and map each x ∈ [k] to the distinct subset S_x defined by the corresponding row of H_K. By the properties of Hadamard matrices stated in the next section, this family satisfies properties 1 and 2 above.
2.3 Hadamard matrices and linear codes
Next, we recall some useful properties of Hadamard matrices which will be needed for our analysis of HR-based tests.
Let K be any power of two. The Hadamard matrix of order K, denoted H_K, is the K × K matrix defined recursively by Sylvester’s construction: (i) H_1 = (1), and (ii) for K ≥ 2,
H_K = ( H_{K/2}  H_{K/2} ; H_{K/2}  −H_{K/2} ),
where the semicolon separates the two block rows. Note that all entries of H_K are in {−1, +1}.
Let K ≥ 2 be any power of two. Then, the Hadamard matrix H_K has the following properties:
The first row of H_K is the all-one vector.
For every i ≥ 2, the i-th row of H_K is balanced, i.e., it contains exactly K/2 entries equal to +1.
Every two distinct rows are orthogonal; that is, for every i ≠ j, the i-th and j-th rows agree (resp. disagree) on exactly K/2 entries.
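Sylvester’s recursion can be implemented in a few lines, and the three properties above verified mechanically (an illustrative check, not part of the original development):

```python
def hadamard(K):
    # Sylvester's construction: H_1 = (1); H_{2K} = [[H_K, H_K], [H_K, -H_K]].
    H = [[1]]
    while len(H) < K:
        H = [row + row for row in H] + [row + [-v for v in row] for row in H]
    return H

H = hadamard(8)
assert all(v == 1 for v in H[0])                        # first row: all ones
assert all(sum(row) == 0 for row in H[1:])              # other rows: balanced
assert all(sum(a * b for a, b in zip(H[i], H[j])) == 0  # distinct rows: orthogonal
           for i in range(8) for j in range(i + 1, 8))
```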
Fix any K = 2^m. The Hadamard matrix H_K corresponds to the Walsh–Hadamard transform (or Fourier transform over {0,1}^m; see, for example, [O’D14]). Specifically, for any two functions f, g : {0,1}^m → ℝ, define the inner product over {0,1}^m as
⟨f, g⟩ := 2^{−m} Σ_{x ∈ {0,1}^m} f(x)·g(x),
and let ‖·‖₂ denote the norm induced by this inner product. Moreover, the functions χ_S, defined for every S ⊆ [m] by χ_S(x) := (−1)^{Σ_{i ∈ S} x_i}, form an orthonormal basis, whereby every f can be uniquely written as
f = Σ_{S ⊆ [m]} f̂(S)·χ_S,
where f̂(S) := ⟨f, χ_S⟩. The Walsh–Hadamard matrix specifies this change of basis. Specifically, we note the following standard fact:
Let K = 2^m. Then, for every x ∈ {0,1}^m and every subset S ⊆ [m] identified with its characteristic vector s ∈ {0,1}^m, the entry of H_K in the row indexed by s and the column indexed by x (under the natural binary indexing of [K] by {0,1}^m) equals χ_S(x) = (−1)^{s·x}.
This spectral view of the Walsh–Hadamard matrix leads to Parseval’s theorem, which is instrumental in the design of our tests based on HR.
Theorem 3 (Parseval’s Theorem).
For every function f : {0,1}^m → ℝ, we have ‖f‖₂² = Σ_{S ⊆ [m]} f̂(S)².
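Parseval’s identity under this normalization can be verified numerically, as in the sketch below (the helper names are ours; S and x are encoded as m-bit masks):

```python
import random

def chi(S, x):
    # chi_S(x) = (-1)^{sum_{i in S} x_i}, with S and x encoded as bitmasks.
    return -1 if bin(S & x).count("1") % 2 else 1

def fourier_coefficients(f, m):
    # f_hat(S) = 2^{-m} * sum_x f(x) * chi_S(x)
    return [sum(f[x] * chi(S, x) for x in range(1 << m)) / (1 << m)
            for S in range(1 << m)]

m = 3
rng = random.Random(0)
f = [rng.uniform(-1, 1) for _ in range(1 << m)]
f_hat = fourier_coefficients(f, m)
norm_sq = sum(v * v for v in f) / (1 << m)   # ||f||_2^2 = E_x[f(x)^2]
assert abs(norm_sq - sum(c * c for c in f_hat)) < 1e-9   # Parseval
```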
2.4 On symmetry and asymmetry
While all the LDP mechanisms underlying our proposed sample-optimal tests in this paper can be cast as symmetric mechanisms, the next result shows that asymmetric mechanisms can in any case yield at most a logarithmic-factor improvement in sample complexity over symmetric ones.
Suppose that there exists a private-coin (respectively, public-coin) LDP mechanism for some task with n users and probability of success 1 − γ. Then, there exists a private-coin (respectively, public-coin) symmetric LDP mechanism for the same task with O(n log n) users and probability of success 1 − γ − o(1).
Let (W₁, …, Wₙ) be the purported mechanism, with Wᵢ being the mapping of the i-th user. We create a symmetric (randomized) mechanism as follows: on input x, use private (respectively, public) randomness to generate an index J ∈ [n] uniformly at random (and independently of everything else), and output the pair (J, W_J(x)).³ (³Note that for public-coin mechanisms, one can simply output W_J(x), as there is no need for a user to communicate the random index to the referee.)
Clearly, the resulting mechanism is symmetric. Further, by a standard coupon-collector argument, for n′ = Ω(n log n) users we have that, with probability 1 − o(1), each index i ∈ [n] is drawn at least once. Whenever this is the case, upon gathering all the outputs, the referee can select a subset of n outputs corresponding to distinct indices and simulate the original mechanism, having received for each i the output of Wᵢ. Overall, the probability of failure is at most γ + o(1) by a union bound. ∎
2.5 A warmup for the binary case
We conclude this section with simple algorithms for identity and independence testing for the case when k = 2, i.e., for support size two. These algorithms will be used later in our optimal tests based on Raptor.
2.5.1 Private estimation of the bias of a coin.
First, we deal with the problem of estimating the bias of a coin up to an additive accuracy of α, when the outcomes of the coin tosses can be accessed only via an ε-LDP mechanism. Note that this yields as a corollary an algorithm for identity testing over {0,1}. Indeed, to test if the generating distribution equals a given Bern(q) or is α-far from it, we estimate the probability of 1 to an additive α/2 and compare it with q. The following result is folklore and is included for completeness.
Lemma 5 (Locally Private Bias Estimation, Warmup).
For ε ≤ 1, an estimate of the bias of a coin with an additive accuracy of α can be obtained using O(1/(α² ε²)) samples via ε-LDP RR. Moreover, any estimate of the bias obtained via ε-LDP RR must use Ω(1/(α² ε²)) samples.
Recall that the binary ε-LDP RR channel reports the input bit faithfully with probability e^ε/(e^ε + 1) and flips it otherwise. When a Bern(p) random variable passes through this channel, the output is a Bernoulli random variable with mean
λ = 1/(e^ε + 1) + p·(e^ε − 1)/(e^ε + 1).
Therefore, estimating p to an additive α using this mechanism is equivalent to estimating λ to an additive α·(e^ε − 1)/(e^ε + 1) = Θ(αε), which can be done with O(1/(α² ε²)) samples (the second equality since ε ≤ 1).
It remains to prove optimality. For k = 2, it can be shown that the output of any ε-LDP scheme can be obtained by passing the output of an ε-LDP RR through another channel. Therefore, RR requires the least number of samples among ε-LDP mechanisms for estimating the bias, and it suffices to show the claimed bound of Ω(1/(α² ε²)) for RR. To that end, suppose we provide as input a Bernoulli random variable with bias 1/2 + α to RR. Then, the output has bias 1/2 + α·(e^ε − 1)/(e^ε + 1) = 1/2 + Θ(αε). On the other hand, when the input is Bern(1/2), then the output is Bern(1/2) as well. Therefore, distinguishing between a Bern(1/2) and a Bern(1/2 + α) using samples from an ε-LDP RR is at least as hard as distinguishing Bern(1/2) and Bern(1/2 + Θ(αε)) without privacy constraints. This latter task is known to require the stated Ω(1/(α² ε²)) number of samples. ∎
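The estimator from the proof above amounts to inverting an affine map, as the following sketch shows (illustrative implementation; function names are ours):

```python
import math
import random

def rr_bit(b, eps, rng):
    # Binary eps-LDP randomized response: keep b w.p. t = e^eps / (e^eps + 1).
    t = math.exp(eps) / (math.exp(eps) + 1.0)
    return b if rng.random() < t else 1 - b

def estimate_bias(bits, eps, rng):
    # E[RR(b)] = (1 - t) + (2t - 1) * p, so we invert this affine map.
    # The shrinkage factor 2t - 1 = (e^eps - 1)/(e^eps + 1) = Theta(eps) is
    # what drives the 1/(alpha * eps)^2 sample complexity.
    t = math.exp(eps) / (math.exp(eps) + 1.0)
    m = sum(rr_bit(b, eps, rng) for b in bits) / len(bits)
    return (m - (1.0 - t)) / (2.0 * t - 1.0)
```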
2.5.2 Independence testing over {0,1} × {0,1}
As a corollary of Lemma 5, we obtain an algorithm for locally private independence testing over {0,1} × {0,1}, which, too, will be used later in the paper.
For ε ≤ 1, there exists a symmetric, private-coin ε-LDP mechanism that tests whether a distribution over {0,1} × {0,1} is a product distribution or α-far from any product distribution using O(1/(α² ε²)) samples.
Consider a distribution p over {0,1} × {0,1} with marginals p₁ and p₂. Note that
d_TV(p, p₁ ⊗ p₂) = 2·|p(1,1) − p₁(1)·p₂(1)|.
Thus, if p is α-far in total variation distance from any product distribution, it must hold that d_TV(p, p₁ ⊗ p₂) > α, which in view of the equation above yields |p(1,1) − p₁(1)·p₂(1)| > α/2. Using this observation, we can test for independence using O(1/(α² ε²)) samples as follows. First, note that for any fixed symbol, its probability can be estimated up to an accuracy of α/16 using O(1/(α² ε²)) samples by converting each observation (X₁, X₂) to a binary observation (for instance, 1{(X₁, X₂) = (1,1)}) and applying the estimator of Lemma 5. Thus, we can estimate p(1,1), p₁(1), and p₂(1) up to an accuracy of α/16 by assigning a third of the samples to each. Denote the respective estimates by p̂(1,1), p̂₁(1), and p̂₂(1). When p = p₁ ⊗ p₂,
|p̂(1,1) − p̂₁(1)·p̂₂(1)| ≤ 3α/16 + (α/16)² ≤ α/4.
On the other hand, when |p(1,1) − p₁(1)·p₂(1)| > α/2, we have
|p̂(1,1) − p̂₁(1)·p̂₂(1)| ≥ α/2 − 3α/16 − (α/16)² ≥ α/4.
Thus, for k = 2, locally private independence testing can be performed with O(1/(α² ε²)) samples by estimating the probabilities p(1,1), p₁(1), p₂(1) and comparing |p̂(1,1) − p̂₁(1)·p̂₂(1)| to the threshold α/4. ∎
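The proof translates directly into code. The sketch below uses accuracy α/16 for each private estimate and the threshold α/4; the sample-splitting constants are illustrative and the function names are ours.

```python
import math
import random

def private_mean(bits, eps, rng):
    # eps-LDP randomized response on each bit, followed by debiasing.
    t = math.exp(eps) / (math.exp(eps) + 1.0)
    noisy = [b if rng.random() < t else 1 - b for b in bits]
    return (sum(noisy) / len(noisy) - (1.0 - t)) / (2.0 * t - 1.0)

def independence_test(pairs, eps, alpha, rng):
    # Split the samples three ways, privately estimate p(1,1), p1(1), p2(1),
    # and accept independence iff |p(1,1) - p1(1) * p2(1)| < alpha / 4.
    n = len(pairs) // 3
    p11 = private_mean([int(x == 1 and y == 1) for x, y in pairs[:n]], eps, rng)
    p1 = private_mean([x for x, _ in pairs[n:2 * n]], eps, rng)
    p2 = private_mean([y for _, y in pairs[2 * n:3 * n]], eps, rng)
    return abs(p11 - p1 * p2) < alpha / 4.0
```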
3 Locally Private Uniformity Testing using Existing Mechanisms
In this section, we provide two locally private mechanisms for uniformity testing. As discussed earlier, this in turn provides similar mechanisms for identity testing as well. These two tests, based respectively on the symmetric, private-coin mechanisms Rappor and HR, will be seen to have the same sample complexity of O(k^{3/2}/(α² ε²)). However, the first has the advantage of being based on a widespread mechanism, while the second is more efficient in terms of both time and communication.
3.1 A mechanism based on Rappor
Given n independent samples from p, let the output of Rappor applied to these samples be denoted by Y₁, …, Yₙ, where Yᵢ ∈ {0,1}^k for i ∈ [n]. The following fact is a simple consequence of the definition of Rappor.
Let λ_j := (1 − ρ)·p_j + ρ·(1 − p_j) for j ∈ [k], as in (3), and let λ_u := ρ + (1 − 2ρ)/k denote the common value of the λ_j’s when p is uniform. Then, for every i ∈ [n] and j ∈ [k], the coordinate Y_{i,j} is distributed as Bern(λ_j).
First idea: Counting Collisions.
A natural idea would be to try and estimate ‖p‖₂² by counting the collisions from the output of Rappor. Since this only adds post-processing to Rappor, which is LDP, the overall procedure does not violate the ε-LDP constraint. Specifically, the statistic counting, over all pairs of samples and all coordinates j ∈ [k], the pairs of output vectors with 1s in the j-th coordinate can be seen to have expectation (n choose 2)·Σ_{j=1}^k λ_j². Up to the constant normalizing factor, this suggests an unbiased estimator for Σ_j λ_j², and thereby also for ‖p‖₂². However, the issue lies with the variance of this estimator. Indeed, the noise added by Rappor inflates this variance so much that (for constant ε) using this statistic to distinguish the uniform distribution from distributions α-far from it requires n = Ω(k³/α²) samples. This sample requirement turns out to be off by a quadratic factor in k, and is even worse than the trivial upper bound obtained by learning p.
An Optimal Mechanism.
We now propose our testing mechanism based on Rappor, which, in essence, uses a privatized version of a -type statistic of [CDVV14, ADK15, VV17]. For , let the number of occurrences of among the (privatized) outputs of Rappor be
which by the definition of Rappor follows a distribution. Now, letting
we get a statistic, applied to the output of Rappor, which (as we shall see) is up to normalization an unbiased estimator for the squared distance of to uniform. The main difference with the naive approach we discussed previously, however, lies in the extra linear term. Indeed, the collision-based statistic was of the form
and in comparison, keeping in mind that is typically concentrated around its expected value of roughly , our new statistics can be seen to take the form
since . That is, now the fluctuations of the quadratic term are reduced significantly by the subtracted linear term, bringing down the variance of the statistic.
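As an illustration of this bias-correction idea, the following sketch implements one natural bias-corrected statistic; this is our own reconstruction of the normalization, not necessarily the exact statistic of Algorithm 4. For N_j ~ Bin(n, λ_j), each summand below has expectation n(n−1)(λ_j − λ_u)², which vanishes under the uniform distribution.

```python
import math
import random

def rappor(x, k, eps, rng):
    # eps-LDP Rappor: one-hot encoding with each bit flipped w.p. rho.
    rho = 1.0 / (math.exp(eps / 2.0) + 1.0)
    return [(1 if j == x else 0) ^ (rng.random() < rho) for j in range(k)]

def bias_corrected_statistic(outputs, k, eps):
    # With N_j ~ Bin(n, lam_j), one can check that
    #   E[(N_j - n*lam_u)^2 - (1 - 2*lam_u)*N_j - n*lam_u^2]
    #     = n*(n-1)*(lam_j - lam_u)^2,
    # so the statistic concentrates near 0 when p is uniform.
    n = len(outputs)
    rho = 1.0 / (math.exp(eps / 2.0) + 1.0)
    lam_u = rho + (1.0 - 2.0 * rho) / k   # value of lam_j when p is uniform
    Z = 0.0
    for j in range(k):
        N = sum(y[j] for y in outputs)
        Z += (N - n * lam_u) ** 2 - (1.0 - 2.0 * lam_u) * N - n * lam_u ** 2
    return Z
```

In simulation, Z/(n(n−1)) stays near zero for uniform inputs and is bounded away from zero for a point mass, matching the expectation computation in the comment.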
This motivates our testing algorithm based on Rappor, Algorithm 4, and leads to the main result of this section:
For ε ≤ 1, Algorithm 4 based on ε-LDP Rappor can test whether a distribution is uniform or α-far from uniform using n = O(k^{3/2}/(α² ε²)) samples.
Proof of Theorem 8.
Clearly, since Rappor is an ε-LDP mechanism, the overall Algorithm 4 does not violate the ε-LDP constraint. We now analyze the error performance of the proposed test, which we do simply by using Chebyshev’s inequality. Towards that, we evaluate the expected value and the variance of Z.
The following evaluation of the expected value of the statistic Z uses a simple calculation entailing moments of a binomial random variable:
With Z defined as above, we have
E[Z] = n(n − 1)·Σ_{j=1}^k (λ_j − λ_u)²,
where the expectation is taken over the private coins used by Rappor and the samples drawn from p. In particular, (i) if p = u, then E[Z] = 0; while (ii) if d_TV(p, u) > α, then E[Z] = Ω(n(n − 1)·α² ε²/k).
Letting Δ_j := p_j − 1/k and using the fact that λ_j − λ_u = (1 − 2ρ)·Δ_j, we have
E[Z] = n(n − 1)·(1 − 2ρ)²·Σ_{j=1}^k Δ_j² = n(n − 1)·(1 − 2ρ)²·‖p − u‖₂²,
which, along with the observations that 1 − 2ρ = (e^{ε/2} − 1)/(e^{ε/2} + 1) = Θ(ε) for ε ≤ 1 and that ‖p − u‖₂² ≥ 4α²/k whenever d_TV(p, u) > α, gives the result. ∎
Turning to the variance, we get the following:
With defined as above, we have