We develop differentially private methods for estimating various distributional properties. Given a sample from a discrete distribution p, some functional f, and accuracy and privacy parameters α and ε, the goal is to estimate f(p) up to accuracy α, while maintaining ε-differential privacy of the sample. We prove almost-tight bounds on the sample size required for this problem for several functionals of interest, including support size, support coverage, and entropy. We show that the cost of privacy is negligible in a variety of settings, both theoretically and experimentally. Our methods are based on a sensitivity analysis of several state-of-the-art methods for estimating these properties with sublinear sample complexities.
How can we infer a distribution given a sample from it? If data is in abundance, the solution may be simple – the empirical distribution will approximate the true distribution. However, challenges arise when data is scarce in comparison to the size of the domain, and especially when we wish to quantify “rare events.” This is frequently the case: for example, it has recently been observed that there are several very rare genetic mutations which occur in humans, and we wish to know how many such mutations exist [KeinanC12, TennessenBOFKGMDLJKJLGRAANBSBBABSN12, NelsonWEKSVSTBFWAZLZZLLLWTHNWACZWCNM12]. Many of these mutations have only been seen once, and we can infer that there are many which have not been seen at all. Over the last decade, a large body of work has focused on developing theoretically sound and effective tools for such settings (see [OrlitskySW16] and references therein), including the problem of estimating the frequency distribution of rare genetic variations [ZouVVKCSLSDM16].
However, in many settings where one wishes to perform statistical inference, data may contain sensitive information about individuals. For example, in medical studies, the data may contain individuals’ health records and whether they carry some disease which bears a social stigma. Alternatively, one can consider a map application which suggests routes based on aggregate positions of individuals, which contains delicate information including users’ residence data. In these settings, it is critical that our methods protect sensitive information contained in the dataset. This does not preclude our overall goals of statistical analysis, as we are trying to infer properties of the population, and not the samples which are drawn from said population.
That said, without careful experimental design, published statistical findings may be prone to leaking sensitive information about the sample. As a notable example, it was recently shown that one can determine the identity of some individuals who participated in genome-wide association studies [HomerSRDTMPSNC08]. This realization has motivated a surge of interest in developing data sharing techniques with an explicit focus on maintaining privacy of the data [JohnsonS13, UhlerSF13, YuFSU14, SimmonsSB16].
Privacy-preserving computation has enjoyed significant study in a number of fields, including statistics and almost every branch of computer science, including cryptography, machine learning, algorithms, and database theory – see, e.g., [Dalenius77, AdamW89, AgrawalA01, DinurN03, Dwork08, DworkR14] and references therein. Perhaps the most celebrated notion of privacy, proposed by theoretical computer scientists, is differential privacy [DworkMNS06]. Informally, an algorithm is differentially private if its outputs on neighboring datasets (differing in a single element) are statistically close (for a more precise definition, see Section 2). Differential privacy has become the standard for theoretically-sound data privacy, leading to its adoption by several large technology companies, including Google and Apple [ErlingssonPK14, AppleDP17].
Our focus in this paper is to develop tools for privately performing several distribution property estimation tasks. In particular, we study the tradeoff between statistical accuracy, privacy, and error rate in the sample size. Our model is that we are given sample access to some unknown discrete distribution , over a domain of size , which is possibly unknown in some tasks. We wish to estimate the following properties:
Support Coverage: If we take samples from the distribution, what is the expected number of unique elements we will see?
Support Size: How many elements of the support have non-zero probability?
Entropy: What is the Shannon entropy of the distribution?
For more formal statements of these problems, see Section 2.1. We require that our output is α-accurate, satisfies ε-differential privacy, and is correct with probability . The goal is to give an algorithm with minimal sample complexity , while simultaneously being computationally efficient.
Theoretical Results. Our main results show that privacy can be achieved for all these problems at a very low cost. For example, if one wishes to privately estimate entropy, this incurs an additional additive cost in the sample complexity which is very close to linear in . We draw attention to two features of this bound. First, this is independent of . All the problems we consider have complexity , so in the primary regime of study where , this small additive cost is dwarfed by the inherent sample complexity of the non-private problem. Second, the bound is almost linear in . We note that performing even the most basic statistical task privately, estimating the bias of a coin, incurs this linear dependence. Surprisingly, we show that much more sophisticated inference tasks can be privatized at almost no cost. In particular, these properties imply that the additive cost of privacy is in the most studied regime where the support size is large. In general, this is not true – for many other problems, including distribution estimation and hypothesis testing, the additional cost of privacy depends significantly on the support size or dimension [DiakonikolasHS15, CaiDK17, AcharyaSZ17, AliakbarpourDR17]. We also provide lower bounds, showing that our upper bounds are almost tight. A more formal statement of our results appears in Section 3.
Experimental Results. We demonstrate the efficacy of our method with experimental evaluations. As a baseline, we compare with the non-private algorithms of [OrlitskySW16] and [WuY18]. Overall, we find that our algorithms’ performance is nearly identical, showing that, in many cases, privacy comes (essentially) for free. We begin with an evaluation on synthetic data. Then, inspired by [ValiantV13, OrlitskySW16], we analyze a text corpus consisting of words from Hamlet, in order to estimate the number of unique words which occur. Finally, we investigate name frequencies in the US census data. This setting has been previously considered by [OrlitskySW16], but we emphasize that this is an application where private statistical analysis is critical. This is evidenced by the efforts of the US Census Bureau to incorporate differential privacy into the 2020 US census [DajaniLSKRMGDGKKLSSVA17].
Techniques. Our approach works by choosing statistics for these tasks which possess bounded sensitivity, which is well-known to imply privacy under the Laplace or Gaussian mechanism. We note that bounded sensitivity of statistics is not always something that can be taken for granted. Indeed, for many fundamental tasks, optimal algorithms for the non-private setting may be highly sensitive, thus necessitating crucial modifications to obtain differential privacy [AcharyaDK15, CaiDK17]. Thus, careful choice and design of statistics must be a priority when performing inference with privacy considerations.
To this end, we leverage recent results of [AcharyaDOS17], which studies estimators for non-private versions of the problems we consider. The main technical work in their paper exploits bounded sensitivity to show sharp cutoff-style concentration bounds for certain estimators, which operate using the principle of best-polynomial approximation. They use these results to show that a single algorithm, the Profile Maximum Likelihood (PML), can estimate all these properties simultaneously. On the other hand, we consider the sensitivity of these estimators for purposes of privacy – the same property is utilized by both works for very different purposes, a connection which may be of independent interest.
We note that bounded sensitivity of a statistic may be exploited for purposes other than privacy. For instance, by McDiarmid’s inequality, any such statistic also enjoys very sharp concentration of measure, implying that one can boost the success probability of the test at an additive cost which is logarithmic in the inverse of the failure probability. One may naturally conjecture that, if a statistical task is based on a primitive which concentrates in this sense, then it may also be privatized at a low cost. However, this is not true – estimating a discrete distribution in distance is such a task, but the cost of privatization depends significantly on the support size [DiakonikolasHS15].
One can observe that, algorithmically, our method is quite simple: compute the non-private statistic, and add a relatively small amount of Laplace noise. The non-private statistics have recently been demonstrated to be practical [OrlitskySW16, WuY18], and the additional cost of the Laplace mechanism is minimal. This is in contrast to several differentially private algorithms which invoke significant overhead in the quest for privacy. Our algorithms attain almost-optimal rates (which are optimal up to constant factors for most parameter regimes of interest), while simultaneously operating effectively in practice, as demonstrated in our experimental results.
Related Work. Over the last decade, there has been a flurry of works on the problems we study in this paper by the computer science and information theory communities, including Shannon and Rényi entropy estimation [Paninski03, ValiantV17b, JiaoVHW17, AcharyaOST17, ObremskiS17, WuY18], as well as support coverage and support size estimation [OrlitskySW16, WuY18]. A recent paper studies the general problem of estimating functionals of a discrete distribution from samples in terms of the smoothness of the functional [FukuchiS17]. These have culminated in a nearly-complete understanding of the sample complexity of these properties, with optimal sample complexities (up to constant factors) for most parameter regimes.
Recently, there has been significant interest in performing statistical tasks under differential privacy constraints. Perhaps most relevant to this work are [CaiDK17, AcharyaSZ17, AliakbarpourDR17], which study the sample complexity of performing classical distribution testing problems, including identity and closeness testing, under differential privacy. Other works investigating private hypothesis testing include [WangLK15, GaboardiLRV16, KiferR17, KakizakiSF17, Rogers17, GaboardiR17], which focus less on characterizing the finite-sample guarantees of such tests, and more on understanding their asymptotic properties and applications to computing p-values. There has also been work on private distribution learning [DiakonikolasHS15, DuchiJW17, KarwaV18], in which we wish to estimate parameters of the distribution, rather than just a particular property of interest. A number of other problems have been studied with privacy requirements, including clustering [WangWS15, BalcanDLMZ17], among others [ChaudhuriSS13, KapralovT13, HardtP14b, Sheffet17].
We will start with some definitions.
Let be the set of discrete distributions over a countable support. Let be the set of distributions in with at most non-zero probability values. A property is a mapping from . We now describe the classical distribution property estimation problem, and then state the problem under differential privacy.
Given , , and independent samples from an unknown distribution , design an estimator such that with probability at least , . The sample complexity of , is the smallest number of samples needed to estimate to accuracy , with error probability . We study the problem for , and by the median trick, we can boost the error probability to at the cost of a multiplicative factor of more samples: . The sample complexity of estimating a property is the minimum sample complexity over all estimators: .
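The median trick mentioned above can be sketched as follows. This is a minimal illustration, not code from the paper: the names `median_of_estimates`, `estimator`, and `draw_samples` are hypothetical, and the number of repetitions needed for a target failure probability (typically logarithmic in its inverse) is left to the caller.

```python
import random
import statistics


def median_of_estimates(estimator, draw_samples, n, repetitions):
    """Run `estimator` on `repetitions` fresh batches of n samples
    and return the median of the resulting estimates.

    If each run is accurate with probability at least 2/3, the median is
    accurate unless about half of the runs fail, which becomes exponentially
    unlikely as `repetitions` grows.
    """
    estimates = [estimator(draw_samples(n)) for _ in range(repetitions)]
    return statistics.median(estimates)


# Illustrative usage: estimating the bias of a coin (hypothetical setup).
rng = random.Random(0)


def draw_coin_flips(n):
    return [1 if rng.random() < 0.3 else 0 for _ in range(n)]


def sample_mean(xs):
    return sum(xs) / len(xs)


print(median_of_estimates(sample_mean, draw_coin_flips, n=200, repetitions=9))
```

Each repetition uses a fresh batch, so the total number of samples is `n * repetitions`.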
An estimator is -differentially private (DP) [DworkMNS06] if for any and , with , , for all , and measurable .
Given , , and independent samples from an unknown distribution , design an -differentially private estimator such that with probability at least , . Similar to the non-private setting, the sample complexity of the -differentially private estimation problem is , the smallest number of samples for which there exists such an estimator with error probability at most 1/3.
The original paper [DworkMNS06] provides a scheme for achieving differential privacy, known as the Laplace mechanism. This method adds Laplace noise to a non-private scheme in order to make it private. We first define the sensitivity of an estimator, and then state their result in our setting.
The sensitivity of an estimator is Let .
[DworkMNS06] showed that for a function with sensitivity , adding Laplace noise makes the output -differentially private. By the definition of , the Laplace noise we add has parameter at most
. Recall that the probability density function of is , hence we have . By the union bound, we get an additive error larger than with probability at most . Hence, with the median trick, we can boost the error probability to , at the cost of a constant factor in the number of samples. ∎
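For reference, the standard definitions behind this argument can be written out as follows. The notation ($X^n$, $Y^n$, $d_{\mathrm{ham}}$, $\hat f$, $\Delta$, $b$) is an assumption made for this block, since the original symbols did not survive in this text; the statements themselves are the textbook forms of ε-differential privacy, sensitivity, and the Laplace mechanism.

```latex
% epsilon-differential privacy of an estimator \hat f
\Pr\big[\hat f(X^n) \in S\big] \;\le\; e^{\varepsilon}\,\Pr\big[\hat f(Y^n) \in S\big]
\quad \text{for all measurable } S \text{ and all } X^n, Y^n \text{ with } d_{\mathrm{ham}}(X^n, Y^n) \le 1 .

% sensitivity of \hat f on inputs of length n
\Delta(\hat f) \;=\; \max_{d_{\mathrm{ham}}(X^n, Y^n) \le 1} \big|\hat f(X^n) - \hat f(Y^n)\big| .

% Laplace mechanism: the following release is epsilon-differentially private
\hat f(X^n) + \mathrm{Lap}\!\left(\frac{\Delta(\hat f)}{\varepsilon}\right),
\qquad \text{where } \mathrm{Lap}(b) \text{ has density } \frac{1}{2b}\, e^{-|x|/b} .

% tail bound for the added noise
\Pr\big[\,|\mathrm{Lap}(b)| > t\,\big] \;=\; e^{-t/b} \quad (t \ge 0),
\qquad \text{e.g. } b \le \frac{\alpha}{4} \;\Rightarrow\; \Pr\Big[\,|\mathrm{Lap}(b)| > \frac{\alpha}{2}\Big] \le e^{-2} .
```

The last line is only an illustration of how a bound on the noise parameter turns into a bound on the additional error; the specific constants are not taken from the source.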
To prove sample complexity lower bounds for differentially private estimators, we observe that the estimator can be used to test between two distributions with distinct property values, hence is a harder problem. For lower bounds on differentially private testing, [AcharyaSZ17] gives the following argument based on coupling:
Suppose there is a coupling between distributions and over , such that . Then, any -differentially private algorithm that distinguishes between and with error probability at most must satisfy .
The support size of a distribution is , the number of symbols with non-zero probability values. However, notice that estimating from samples can be hard due to the presence of symbols with negligible, yet non-zero probabilities. To circumvent this issue, [RaskhodnikovaRSS09] proposed to study the problem when the smallest probability is bounded. Let be the set of all distributions where all non-zero probabilities have value at least . For , our goal is to estimate up to with the least number of samples from .
For a distribution , and an integer , let , be the expected number of symbols that appear when we obtain independent samples from the distribution . The objective is to find the least number of samples in order to estimate to an additive .
Support coverage arises in many ecological and biological studies [ColwellCGLMCL12] to quantify the number of new elements (gene mutations, species, words, etc) that can be expected to be seen in the future. Good and Toulmin [GoodT56] proposed an estimator that for any constant , requires samples to estimate .
The Shannon entropy of a distribution is , a central object in information theory [CoverT06] that also arises in many fields such as machine learning [Nowozin12], neuroscience [BerryWM97, NemenmanBRS04], and others. Estimating is hard with any finite number of samples due to the possibility of infinite support. To circumvent this, a natural approach is to consider distributions in . The goal is to estimate the entropy of a distribution in to an additive , where is all discrete distributions over at most symbols.
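To make the three properties concrete, the sketch below evaluates them for a fully known distribution given as a list of probabilities. The function names are hypothetical, and entropy is reported in nats (natural logarithm), which is an assumption since the base of the logarithm is not visible in this text.

```python
import math


def support_size(p):
    """Number of symbols with non-zero probability."""
    return sum(1 for px in p if px > 0)


def support_coverage(p, m):
    """Expected number of distinct symbols among m i.i.d. draws from p.

    A symbol with probability px is missed by all m draws with
    probability (1 - px)^m, so it appears with probability 1 - (1 - px)^m.
    """
    return sum(1.0 - (1.0 - px) ** m for px in p)


def shannon_entropy(p):
    """Shannon entropy in nats: sum of px * log(1 / px) over the support."""
    return -sum(px * math.log(px) for px in p if px > 0)


uniform4 = [0.25, 0.25, 0.25, 0.25]
print(support_size(uniform4), support_coverage(uniform4, 2), shannon_entropy(uniform4))
```

In the estimation problems of this section the distribution is unknown, so these quantities are the targets that an estimator must approximate from samples.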
Our theoretical results for estimating support coverage, support size, and entropy are given below. Algorithms for these problems and proofs of these statements are provided in Section 4. Our experimental results are described and discussed in Section 5.
For any , the sample complexity of support coverage is
For any , the sample complexity of support size estimation is
Let be any small fixed constant. For instance, can be chosen to be any constant between and . We have the following upper bounds on the sample complexity of entropy estimation:
We provide some discussion of our results. At a high level, we wish to emphasize the following two points:
Our upper bounds show that the cost of privacy in these settings is often negligible compared to the sample complexity of the non-private statistical task, especially when we are dealing with distributions over a large support. Furthermore, our upper bounds are almost tight in all parameters.
The algorithmic complexity introduced by the requirement of privacy is minimal, consisting only of a single step which noises the output of an estimator. In other words, our methods are realizable in practice, and we demonstrate the effectiveness on several synthetic and real-data examples.
First, we examine our results on support size and support coverage estimation. We note that we focus on the regime where is not exceptionally small, since otherwise the privacy requirement becomes somewhat unusual. For instance, non-privately, if we have samples for the problem of support coverage, then the empirical plug-in estimator is the best we can do. However, if , then group privacy [DworkR14] implies that the algorithm’s output distribution on any dataset of samples must be very similar. Yet these samples may have an arbitrary value of support coverage , which precludes hopes for a highly accurate estimator. To avoid degeneracies of this nature, we restrict our attention to . In this regime, if for any constant , then our upper bound is within a constant factor of the optimal sample complexity without privacy constraints. In other words, for most meaningful values of , privacy comes for free.
Next, we turn our attention to entropy estimation. We note that the second upper bound in Theorem 3 has a parameter that indicates a tradeoff between the sample complexity incurred in the first and third term. This parameter determines the degree of a polynomial to be used for entropy estimation. As the degree becomes smaller (corresponding to a large ), accuracy of the polynomial estimator decreases, however, at the same time, low-degree polynomials have a small sensitivity, allowing us to privatize the outcome.
In terms of our theoretical results, one can think of . With this parameter setting, it can be observed that our upper bounds are almost tight. For example, one can see that the upper and lower bounds match up to either logarithmic factors (when looking at the first upper bound), or a very small polynomial factor in (when looking at the second upper bound). For our experimental results, we experimentally determined an effective value for the parameter on a single synthetic instance. We then show that this choice of parameter generalizes, giving highly accurate private estimation in other instances, on both synthetic and real-world data.
In this section, we prove our results for support coverage in Section 4.1, support size in Section 4.2, and entropy in Section 4.3. In each section, we first describe and analyze our algorithms for the relevant problem. We then go on to describe and analyze a lower bound construction, showing that our upper bounds are almost tight.
All our algorithms fall into the following simple framework:
Compute a non-private estimate of the property;
Privatize this estimate by adding Laplace noise, where the parameter is determined through analysis of the estimator and potentially computation of the estimator’s sensitivity.
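The two-step framework above can be sketched in a few lines. This is a minimal illustration under stated assumptions: `privatize` and `laplace_noise` are hypothetical names, the sensitivity is supplied by the caller (it must be derived separately for each estimator), and the Laplace sample is generated as the difference of two independent exponential variables using only the standard library.

```python
import random


def laplace_noise(scale, rng=random):
    """Sample from the Laplace distribution with mean 0 and the given scale.

    The difference of two independent exponential variables with mean
    `scale` is Laplace-distributed with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def privatize(estimate, sensitivity, epsilon, rng=random):
    """Step 2 of the framework: add Laplace noise of scale sensitivity / epsilon."""
    if sensitivity <= 0 or epsilon <= 0:
        raise ValueError("sensitivity and epsilon must be positive")
    return estimate + laplace_noise(sensitivity / epsilon, rng)


# Illustrative usage: the mean of n bits changes by at most 1/n when one
# sample changes, so its sensitivity is 1/n.
bits = [1, 0, 1, 1, 0, 1, 0, 1]
non_private = sum(bits) / len(bits)          # step 1: non-private estimate
private = privatize(non_private, 1.0 / len(bits), epsilon=1.0)
print(non_private, private)
```

The privacy guarantee holds only if the supplied sensitivity is a valid upper bound for the estimator on every pair of neighboring datasets.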
Let be the number of symbols that appear times in . We will use the following non-private support coverage estimator from [OrlitskySW16]:
is a Poisson random variable with mean (which is a parameter to be instantiated later), and .
Our private estimator of support coverage is derived by adding Laplace noise to this non-private estimator with the appropriate noise parameter. Its performance is therefore analyzed by bounding the sensitivity and the bias of the non-private estimator, according to Lemma 1.
The sensitivity and bias of this estimator are bounded in the following lemmas.
Suppose , then the maximum coefficient of in is at most .
By the definition of , we know , hence we have:
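One plausible way to fill in this step is shown below. It assumes, as in the smoothed Good-Toulmin construction, that the coefficient of $\varphi_i$ in the estimator has the form $1 - (-t)^i \Pr(Z \ge i)$ with $t = (m-n)/n \ge 1$ and $Z \sim \mathrm{Poi}(r)$. These symbols and this form are assumptions, since the original display is missing from this text; only the magnitude of the coefficient matters for the bound.

```latex
\big|(-t)^i \Pr(Z \ge i)\big|
  \;=\; t^i \sum_{j \ge i} e^{-r} \frac{r^j}{j!}
  \;\le\; e^{-r} \sum_{j \ge i} \frac{(rt)^j}{j!}    % uses t^i \le t^j for j \ge i, since t \ge 1
  \;\le\; e^{-r} \, e^{rt}
  \;=\; e^{r(t-1)} ,
\qquad\text{so}\qquad
\big|1 - (-t)^i \Pr(Z \ge i)\big| \;\le\; 1 + e^{r(t-1)} .
```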
The bias of the estimator is bounded in Lemma 4 of [AcharyaDOS17]:
Suppose , then
Using these results, letting , [OrlitskySW16] showed that there is a constant , such that with samples, with probability at least 0.9,
Our upper bound in Theorem 1 is derived by the following analysis of the sensitivity of .
If we change one sample in , at most two of the ’s change. Hence by Lemma 3, the sensitivity of the estimator satisfies
By Lemma 1, there is a private algorithm for support coverage estimation as long as
which by (1) holds if
Let , note that . Suppose , then, the condition above reduces to
This is equivalent to
Suppose , then the condition above reduces to the requirement that
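The private support coverage estimator of this subsection can be sketched as below. Several details are assumptions rather than facts taken from this text: the smoothed Good-Toulmin coefficients are written as $1 - (-t)^i \Pr(Z \ge i)$ with $t = (m-n)/n$ and $Z \sim \mathrm{Poi}(r)$, the sensitivity bound $2(1 + e^{r(t-1)})$ is the one obtained when changing one sample alters at most two profile entries, and all function names are hypothetical. The Poisson tail is truncated after a fixed number of terms, which is adequate only for moderate values of `r`.

```python
import math
import random
from collections import Counter


def smoothed_weight(i, t, r, terms=200):
    """(-t)^i * Pr(Z >= i) for Z ~ Poisson(r), summed in log space (truncated)."""
    log_t, log_r = math.log(t), math.log(r)
    magnitude = sum(
        math.exp(i * log_t - r + j * log_r - math.lgamma(j + 1))
        for j in range(i, i + terms)
    )
    return -magnitude if i % 2 else magnitude


def sgt_estimate(samples, m, r):
    """Smoothed Good-Toulmin style estimate of the support coverage at m."""
    n = len(samples)
    t = (m - n) / n
    # profile[i] = number of symbols appearing exactly i times
    profile = Counter(Counter(samples).values())
    return sum(phi * (1.0 - smoothed_weight(i, t, r)) for i, phi in profile.items())


def private_sgt_estimate(samples, m, r, epsilon, rng=random):
    """Add Laplace noise scaled to the assumed sensitivity bound 2 * (1 + e^{r(t-1)})."""
    n = len(samples)
    if m < 2 * n:
        raise ValueError("this sketch assumes m >= 2n, i.e. t >= 1")
    t = (m - n) / n
    scale = 2.0 * (1.0 + math.exp(r * (t - 1.0))) / epsilon
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return sgt_estimate(samples, m, r) + noise


samples = ["a", "b", "b", "c"]
print(sgt_estimate(samples, m=8, r=1.0))
print(private_sgt_estimate(samples, m=8, r=1.0, epsilon=1.0))
```

A smaller `r` shrinks the sensitivity (and hence the noise) but increases the bias of the non-private estimate, which is the tradeoff the analysis above balances.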
We now prove the lower bound described in Theorem 1. Note that the first term in the lower bound is the sample complexity of non-private support coverage estimation, shown in [OrlitskySW16]. Therefore, we turn our attention to prove the latter term in the sample complexity.
Consider the following two distributions. is uniform over . is distributed over elements where and . Moreover, . Then,
Hence their support coverages differ by . Moreover, their total variation distance is . The following lemma is folklore, based on the coupling interpretation of total variation distance, and the fact that total variation distance is subadditive for product measures.
For any two distributions , and , there is a coupling between iid samples from the two distributions with an expected Hamming distance of .
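The standard argument behind this lemma can be written as follows. The notation ($p$, $q$, $X^n$, $Y^n$, $d_{TV}$, $d_{\mathrm{ham}}$) is assumed for this block; the argument itself is the usual coordinate-wise maximal coupling.

```latex
% For a single pair, a maximal coupling (X_i, Y_i) of p and q satisfies
\Pr[X_i \ne Y_i] \;=\; d_{TV}(p, q) .

% Coupling each of the n coordinates independently in this way gives
\mathbb{E}\big[d_{\mathrm{ham}}(X^n, Y^n)\big]
  \;=\; \sum_{i=1}^{n} \Pr[X_i \ne Y_i]
  \;=\; n \cdot d_{TV}(p, q) ,

% while X^n and Y^n remain i.i.d. samples from p and q respectively.
```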
Using Lemma 5 and , we have
Suppose and are as defined before, there is a coupling between and with expected Hamming distance equal to .
In this section, we prove our main theorem about support size estimation, Theorem 2:
In [OrlitskySW16], it is shown that the support coverage estimator can be used to obtain optimal results for estimating the support size of a distribution. In this fashion, taking , we may use an estimate of the support coverage as an estimator of . In particular, their result is based on the following observation.
Suppose , then for any ,
From the definition of , we have . For the other side,
Therefore, estimating for , up to , also estimates up to . Therefore, the goal is to estimate the smallest value of to solve the support coverage problem.
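The relation between support size and support coverage used here can be made explicit with a short calculation. The notation is assumed for this block: $S(p)$ is the support size, $S_m(p)$ the support coverage at $m$, every non-zero probability is at least $1/k$, and $\delta > 0$ is an arbitrary target; the specific constants used in the paper are not visible in this text.

```latex
0 \;\le\; S(p) - S_m(p)
  \;=\; \sum_{x:\, p(x) > 0} \big(1 - p(x)\big)^m
  \;\le\; \sum_{x:\, p(x) > 0} e^{-m\,p(x)}
  \;\le\; k\, e^{-m/k} ,

% using p(x) \ge 1/k and the fact that such a distribution has at most k support elements.
% Hence, for any \delta > 0,
m \;\ge\; k \log\frac{1}{\delta}
  \quad\Longrightarrow\quad
S_m(p) \;\le\; S(p) \;\le\; S_m(p) + \delta k .
```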
Suppose , and in the support coverage problem. Then, we have
Then, by Lemma 4 in the previous section, we have
We will find conditions on such that the middle term above is at most . Toward this end, note that holds if and only if . Plugging in (4), this holds when
which is equivalent to
where we have assumed without loss of generality that .
The computations for sensitivity are very similar. From Lemma 1, we need to find the value of such that
where we assume that , else we just add noise to the true number of observed distinct elements. By computations similar to the expectation case, this reduces to
Therefore, this gives us a sample complexity of
for the sensitivity result to hold.
We note that the bound above blows up when . However, we note that our lower bound implies that we need at least samples in this case, which is not in the sub-linear regime that we are interested in. We therefore consider only the regime where the privacy parameter is at least .
In this section, we prove a lower bound for support size estimation, as described in Theorem 2. The techniques are similar to those for support coverage in Section 4.1.2. The first term of the complexity is the lower bound for non-private setting. This follows by combining the lower bound of [OrlitskySW16] for support coverage, with the equivalence between estimation of support size and coverage as implied by Lemma 7. We focus on the second term in the sequel.
Consider the following two distributions:
is a uniform distribution over and is a uniform distribution over . Then the support sizes of these two distributions differ by , and .
Hence by Lemma 5, we know the following:
Suppose and , there is a coupling between and with expected Hamming distance equal to .
In this section, we prove our main theorem about entropy estimation, Theorem 3:
We describe and analyze two upper bounds. The first is based on the empirical entropy estimator, and is described and analyzed in Section 4.3.1. The second is based on the method of best-polynomial approximation, and appears in Section 4.3.2. Finally, our lower bound is in Section 4.3.3.
Our first private entropy estimator is derived by adding Laplace noise into the empirical estimator. The parameter of the Laplace distribution is , where denotes the sensitivity of the empirical estimator. By analyzing its sensitivity and bias, we prove an upper bound on the sample complexity for private entropy estimation and get the first upper bound in Theorem 3.
Let be the empirical distribution, and let be the entropy of the empirical distribution. The theorem is based on the following three facts:
With these three facts in hand, the sample complexity of the empirical estimator can be bounded as follows. By Lemma 1, we need , which gives . We also need and , which gives .
The largest change in any when we change one symbol is one. Moreover, at most two change. Therefore,
By the concavity of entropy function, we know that
The variance bound of is given precisely in Lemma 15 of [JiaoVHW17]. To obtain the other half of the bound of , we apply the bounded differences inequality in the form stated in Corollary 3.2 of [BoucheronLM13].
Let be a function. Suppose further that
Then for independent variables ,
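The private plug-in estimator of this subsection can be sketched as follows. The function names are hypothetical, entropy is measured in nats, and the sensitivity bound $2\ln(n)/n$ (valid for $n \ge 3$) is a simple bound derived for this sketch: changing one sample moves two empirical counts by one each, and each affected term of the entropy sum changes by at most $\ln(n)/n$. The constant in the paper's own bound is not visible in this text and may differ.

```python
import math
import random
from collections import Counter


def empirical_entropy(samples):
    """Entropy (in nats) of the empirical distribution of the samples."""
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in Counter(samples).values())


def plugin_sensitivity_bound(n):
    """Upper bound 2 * ln(n) / n on the sensitivity, assumed valid for n >= 3."""
    if n < 3:
        raise ValueError("bound is stated for n >= 3")
    return 2.0 * math.log(n) / n


def private_empirical_entropy(samples, epsilon, rng=random):
    """Empirical entropy plus Laplace noise of scale sensitivity / epsilon."""
    scale = plugin_sensitivity_bound(len(samples)) / epsilon
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return empirical_entropy(samples) + noise


rng = random.Random(0)
data = [rng.randrange(10) for _ in range(1000)]
print(empirical_entropy(data), private_empirical_entropy(data, epsilon=1.0, rng=rng))
```

Because the sensitivity decays roughly like $\ln(n)/n$, the added noise becomes small relative to the target accuracy once the number of samples is moderately large.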
We prove an upper bound on the sample complexity of private entropy estimation when one adds Laplace noise to the best-polynomial estimator. This gives the second upper bound in Theorem 3.
In the non-private setting the optimal sample complexity of estimating over is given by Theorem 1 of [WuY16]
However, this estimator can have a large sensitivity. [AcharyaDOS17] designed an estimator that has the same sample complexity but a smaller sensitivity. We restate Lemma 6 of [AcharyaDOS17] here:
Let be a fixed small constant, which may be taken to be any value between and . Then there is an entropy estimator with sample complexity
and has sensitivity .
We can now invoke Lemma 1 on the estimator in this lemma to obtain the upper bound on private entropy estimation.
We now prove the lower bound for entropy estimation. Note that any lower bound on privately testing two distributions , and such that is a lower bound on estimating entropy.
We analyze the following construction for Proposition 2 of [WuY16]. The two distributions , and over are defined as:
Then, by the grouping property of entropy,
For , the entropy difference becomes .
The total variation distance between and is . By Lemma 5 in the paper, there is a coupling over , and generated from and with expected Hamming distance at most . This along with Lemma 2 in the paper gives a lower bound of on the sample complexity.
We evaluated our methods for entropy estimation and support coverage on both synthetic and real data. Overall, we found that privacy is quite cheap: private estimators achieve accuracy which is comparable or near-indistinguishable to non-private estimators in many settings. Our results on entropy estimation and support coverage appear in Sections 5.1 and 5.2, respectively. Code of our implementation is available at https://github.com/HuanyuZhang/INSPECTRE.
We compare the performance of our entropy estimator with a number of alternatives, both private and non-private. Non-private algorithms considered include the plug-in estimator (plug-in), the Miller-Madow Estimator (MM) [Miller55], and the sample-optimal polynomial approximation estimator (poly) of [WuY16]. We analyze the privatized versions of plug-in and poly in Sections 4.3.1 and 4.3.2, respectively. The implementation of the latter is based on code from the authors of [WuY16] (see https://github.com/Albuso0/entropy for their code for entropy estimation). We compare performance on different distributions including uniform, a distribution with two steps, Zipf(1/2), a distribution with Dirichlet-1 prior, and a distribution with Dirichlet- prior, and over varying support sizes.
While plug-in and MM are parameter-free, poly (and its private counterpart) requires choosing the degree of the polynomial to use, which manifests in the parameter in the statement of Theorem 3. [WuY16] suggest the value of in their experiments. However, since we add further noise, we choose a single as follows: (i) run privatized poly for different values and distributions for , ; (ii) choose the value of that performs well across different distributions (see Figure 1). We choose from this, and use it for all other experiments. To evaluate the sensitivity of poly, we computed the estimator’s value at all possible input values, computed the sensitivity (namely, ), and added noise distributed as .
The RMSE of various estimators for , and for various distributions are illustrated in Figure 2. The RMSE is averaged over 100 iterations in the plots.
We observe that the performance of our private-poly is near-indistinguishable from the non-private poly, particularly as the number of samples increases. It also performs significantly better than all other alternatives, including the non-private Miller-Madow and the plug-in estimator. The cost of privacy is minimal for several other settings of and , for which results appear in Section A.
We investigate the cost of privacy for the problem of support coverage. We provide a comparison between the Smoothed Good-Toulmin estimator (SGT) of [OrlitskySW16] and our algorithm, which is a privatized version of their statistic (see Section 4.1.1). Our implementation is based on code provided by the authors of [OrlitskySW16]. As shown in our theoretical results, the sensitivity of SGT is at most , necessitating the addition of Laplace noise with parameter . Note that while the theory suggests we select the parameter , is unknown. We instead set , as previously done in [OrlitskySW16].
In our synthetic experiments, we consider different distributions over different support sizes . We generate samples, and then estimate the support coverage at . For large , estimation is harder. Some results of our evaluation on synthetic data are displayed in Figure 3. We compare the performance of SGT, and privatized versions of SGT with parameters and . For this instance, we fixed the domain size . We ran the methods described above with samples, and estimated the support coverage at , for ranging from to . The performance of the estimators is measured in terms of RMSE over 1000 iterations.
We observe that, in this setting, the cost of privacy is relatively small for reasonable values of . This is as predicted by our theoretical results, where unless is extremely small (less than ) the non-private sample complexity dominates the privacy requirement. However, we found that for smaller support sizes (as shown in Section A.2), the cost of privacy can be significant. We provide an intuitive explanation for why no private estimator can perform well on such instances. To minimize the number of parameters, we instead argue about the related problem of support-size estimation. Suppose we are trying to distinguish between distributions which are uniform over supports of size and . We note that, if we draw
samples, the “profile” of the samples (i.e., the histogram of the histogram) will be very similar for the two distributions. In particular, if one modifies only a few samples (say, five or six), one could convert one profile into the other. In other words, these two profiles are almost-neighboring datasets, but simultaneously correspond to very different support sizes. This pits the two goals of privacy and accuracy at odds with each other, thus resulting in a degradation in accuracy.
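The "profile" referred to above (the histogram of the histogram) can be computed as in the following sketch. The function name `profile` is hypothetical; the output maps each multiplicity i to the number of symbols that appear exactly i times in the sample.

```python
from collections import Counter


def profile(samples):
    """Histogram of the histogram: maps i to the number of symbols seen exactly i times."""
    symbol_counts = Counter(samples)               # symbol -> multiplicity
    return dict(Counter(symbol_counts.values()))   # multiplicity -> number of symbols


# "b" appears twice, "a" and "c" once each.
print(profile(["a", "b", "b", "c"]))
# Changing a single sample moves at most two symbols between multiplicities.
print(profile(["a", "b", "b", "b"]))
```

Two samples whose profiles differ in only a few entries can be turned into one another by modifying a few elements, which is why a private estimator must give similar outputs on both.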
We conclude with experiments for support coverage on two real-world datasets, the 2000 US Census data and the text of Shakespeare’s play Hamlet, inspired by investigations in [OrlitskySW16] and [ValiantV17b]. Our investigation on US Census data is also inspired by the fact that this is a setting where privacy is of practical importance, evidenced by the proposed adoption of differential privacy in the 2020 US Census [DajaniLSKRMGDGKKLSSVA17].
The Census dataset contains a list of last names that appear at least 100 times. Since the dataset is so oversampled, even a small fraction of the data is likely to contain almost all the names. As such, we make the task non-trivial by subsampling individuals from the data, obtaining distinct last names. We then sample of the individuals without replacement and attempt to estimate the total number of last names. Figure 4 displays the RMSE over 100 iterations of this process. We observe that even with an exceptionally stringent privacy budget of , the performance is almost indistinguishable from the non-private SGT estimator.
The Hamlet dataset has words, of which 4804 are distinct. Since the distribution is not as oversampled as the Census data, we do not need to subsample the data. Besides this difference, the experimental setup is identical to that of the Census dataset. Once again, as we can see in Figure 5, we get near-indistinguishable performance between the non-private and private estimators, even for very small values of . Our experimental results demonstrate that privacy is realizable in practice, with particularly accurate performance on real-world datasets.
This section contains additional plots of our synthetic experimental results. Section A.1 contains experiments on entropy estimation, while Section A.2 contains experiments on estimation of support coverage.
We present four more plots of our synthetic experimental results for entropy estimation. Figures 6 and 7 are on a smaller support of , with and , respectively. Figures 8 and 9 are on a support of , with and .