1 Introduction
1.1 Overview
Differential privacy (Dwork et al., 2006) is a strong and by now widely accepted definition of privacy for statistical analysis of datasets with sensitive information about individuals. While there is now a rich and flourishing body of research on differential privacy, extending well beyond theoretical computer science, the following three basic goals for research in the area have not been studied in combination with each other:
Differentially private statistical inference:
The vast majority of work in differential privacy has studied how well one can approximate statistical properties of the dataset itself, i.e. empirical quantities, rather than inferring statistics of an underlying population from which a dataset is drawn. Since the latter is the ultimate goal of most data analysis, it should also be a more prominent object of study in the differential privacy literature.
Conservative statistical inference:
An important purpose of statistical inference is to limit the chance that data analysts draw incorrect conclusions because their dataset may not accurately reflect the underlying population, for example due to the sample size being too small. For this reason, classical statistical inference also offers measures of statistical significance such as $p$-values and confidence intervals. Constructing such measures for differentially private algorithms is more complex, as one must also take into account the additional noise that is introduced for the purpose of privacy protection. We therefore advocate that differentially private inference procedures should be conservative, and err on the side of underestimating statistical significance, even at small sample sizes and for all settings of other parameters.
Rigorous analysis of the inherent price of privacy:
As has been done extensively in the differential privacy literature for empirical statistics, we should also investigate the fundamental “privacy–utility tradeoffs” for (conservative) differentially private statistical inference. This involves both designing and analyzing differentially private statistical inference procedures, as well as proving negative results about the performance that can be achieved, using the best nonprivate procedures as a benchmark.
In this paper, we pursue all of these goals, using as a case study the problem of constructing a confidence interval for the mean of normal data. The latter is one of the most basic problems in statistical inference, yet already turns out to be nontrivial to fully understand under the constraints of differential privacy. We expect that most of our modeling and methods will find analogues for other inferential problems (e.g. hypothesis testing, Bayesian credible intervals, nonnormal data, and estimating statistics other than the mean).
1.2 Confidence Intervals for a Normal Mean
We begin by recalling the problem of constructing a $(1-\alpha)$-level confidence interval for a normal mean without privacy. Let $X_1,\ldots,X_n$ be an independent and identically distributed (iid) random sample from a normal distribution with an unknown mean $\mu$ and variance $\sigma^2$. The goal is to design an estimator $I$ that, given $X_1,\ldots,X_n$, outputs an interval $I(X_1,\ldots,X_n) \subseteq \mathbb{R}$ such that
$$P\big(I(X_1,\ldots,X_n) \ni \mu\big) \ge 1-\alpha$$
for all $\mu$ and $\sigma$. Here $1-\alpha$ is called the coverage probability. Given a desired coverage probability, the goal is to minimize the expected length of the interval, namely $E\big[|I(X_1,\ldots,X_n)|\big]$.
Known Variance.
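In the case that the variance $\sigma^2$ is known (so only $\mu$ is unknown), the classic confidence interval for a normal mean is:
$$I(X_1,\ldots,X_n) = \bar{X} \pm \frac{\sigma}{\sqrt{n}} \cdot z_{1-\alpha/2},$$
where $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ is the sample mean and $z_{1-\alpha/2}$ represents the $(1-\alpha/2)$ quantile of a standard normal distribution. (The proof that this is in fact a $(1-\alpha)$-level confidence interval follows by observing that $\sqrt{n}(\bar{X}-\mu)/\sigma$ has a standard normal distribution, and $(-z_{1-\alpha/2}, z_{1-\alpha/2})$ covers a $1-\alpha$ fraction of the mass of this distribution.) It is known that this interval has the smallest expected size among all $(1-\alpha)$-level confidence sets for a normal mean; see, for example, Lehmann and Romano (2006). In this case, the length of the confidence interval is fixed and equal to
$$\frac{2\sigma}{\sqrt{n}} \cdot z_{1-\alpha/2}.$$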
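As a concrete illustration of the known-variance interval, the following minimal Python sketch computes the classical interval from a sample. The function name `z_interval` and the example data are illustrative assumptions, not part of the paper; only the standard library is used.

```python
import math
from statistics import NormalDist, fmean

def z_interval(xs, sigma, alpha):
    """Classical (1 - alpha)-level interval for a normal mean, sigma known."""
    n = len(xs)
    # (1 - alpha/2) quantile of the standard normal distribution
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * sigma / math.sqrt(n)
    center = fmean(xs)
    return center - half_width, center + half_width

# Hypothetical data, for illustration only.
lo, hi = z_interval([4.9, 5.3, 5.1, 4.7, 5.0], sigma=0.2, alpha=0.05)
print(lo, hi)
```

The interval length `hi - lo` depends only on `sigma`, `n`, and `alpha`, not on the data, which is the "fixed length" property noted above.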
Unknown Variance.
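In the case that the variance is unknown, the variance must be estimated from the data itself, and the classic confidence interval is:
$$I(X_1,\ldots,X_n) = \bar{X} \pm \frac{s}{\sqrt{n}} \cdot t_{n-1,1-\alpha/2},$$
where $s^2$ is the sample variance defined by
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2,$$
and $t_{n-1,1-\alpha/2}$ is the $(1-\alpha/2)$ quantile of a $t$ distribution with $n-1$ degrees of freedom (see the appendix for definitions). (Again the proof follows by observing that $\sqrt{n}(\bar{X}-\mu)/s$ follows a $t$ distribution with $n-1$ degrees of freedom, with no dependence on the unknown parameters.)
Now the length of the interval is a random variable with expectation
$$\frac{2\sigma}{\sqrt{n}} \cdot k_n \cdot t_{n-1,1-\alpha/2}$$
for an appropriate constant $k_n = 1 - O(1/n)$. (See Lehmann and Romano (2006).)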
Relation to Hypothesis Tests.
In general, including both cases above, a confidence interval for a population parameter also gives rise to hypothesis tests, which is often how the confidence intervals are used in applied statistics. For example, if our null hypothesis is that the mean $\mu$ is nonnegative, then we could reject the null hypothesis if the interval $I(X_1,\ldots,X_n)$ does not intersect the positive real line. The significance level of this hypothesis test is thus at least $1-\alpha$. Minimizing the length of the confidence interval corresponds to being able to reject the alternate hypotheses that are closer to the null hypothesis; that is, when the confidence interval is of length at most $\beta$ and $\mu$ is distance greater than $\beta$ from the null hypothesis, then the test will reject with probability at least $1-\alpha$.
1.3 Differential Privacy
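Let $x = (x_1,\ldots,x_n)$ be a dataset of $n$ elements where each $x_i \in \mathcal{X}$. In the problems that we consider, $\mathcal{X} = \mathbb{R}$ and $x \in \mathbb{R}^n$. Two datasets $x$ and $x'$, both of size $n$, are called neighbors if they differ by one element.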
Definition 1 (Differential Privacy, Dwork et al. (2006)).
A randomized algorithm $M : \mathcal{X}^n \to \mathcal{R}$ is $(\varepsilon,\delta)$ differentially private if for all neighboring $x, x' \in \mathcal{X}^n$ and for all measurable sets of outputs $S \subseteq \mathcal{R}$, we have
$$P\big(M(x) \in S\big) \le e^{\varepsilon} \cdot P\big(M(x') \in S\big) + \delta,$$
where the probability is over the randomness of $M$.
Intuitively, this captures privacy because arbitrarily changing one individual’s data (i.e. changing a row of $x$ to obtain $x'$) has only a small effect on the output distribution of $M$. Typically we think of $\varepsilon$ as a small constant (e.g. $\varepsilon = 0.1$) while $\delta$ should be cryptographically small (in particular, much smaller than $1/n$) to obtain a satisfactory formulation of privacy. The case that $\delta = 0$ is often called pure differential privacy, while $\delta > 0$ is known as approximate differential privacy.
Nontrivial differentially private algorithms are necessarily randomized (as the probabilities in the definition above are taken only over the randomness of $M$, which is important for the interpretation and composability of the definition), and thus such algorithms work by injecting randomness into statistical computations to obscure the effect of each individual.
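The most basic differentially private algorithm is the Laplace Mechanism (Dwork et al., 2006), which approximates an arbitrary function $f : \mathcal{X}^n \to \mathbb{R}$ by adding Laplace noise:
$$M(x) = f(x) + \mathrm{Lap}\left(\frac{\mathrm{GS}_f}{\varepsilon}\right).$$
Here $\mathrm{Lap}(s)$ is the Laplace distribution with scale $s$ (which has standard deviation proportional to $s$), and $\mathrm{GS}_f$ is the global sensitivity of $f$, i.e. the maximum of $|f(x) - f(x')|$ over all pairs of neighboring datasets $x, x'$. In particular, if $f$ is the empirical mean of the dataset and $\mathcal{X} = [-R, R]$, then $\mathrm{GS}_f = 2R/n$, so $M$ approximates the empirical mean to within additive error $O(R/(\varepsilon n))$ with high probability.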
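The Laplace mechanism for the empirical mean can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the names `laplace` and `private_mean` are hypothetical, data are clamped to $[-R, R]$ so that the sensitivity bound $2R/n$ actually holds, and Laplace noise is drawn as the difference of two independent exponential variables. Floating-point and random-number-generator subtleties that matter in a deployed system are ignored.

```python
import random

def laplace(scale, rng):
    """Sample Lap(scale): the difference of two iid exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_mean(xs, R, eps, rng):
    """eps-DP estimate of the empirical mean of data assumed to lie in [-R, R]."""
    n = len(xs)
    clamped = [min(max(x, -R), R) for x in xs]  # enforce the range assumption
    sensitivity = 2.0 * R / n                   # global sensitivity of the mean
    return sum(clamped) / n + laplace(sensitivity / eps, rng)

rng = random.Random(0)
print(private_mean([0.2, -0.4, 1.1, 0.7], R=2.0, eps=0.5, rng=rng))
```

Clamping is what makes the privacy guarantee hold for every input dataset; as the text notes later, it can also bias the estimate when the true data fall outside the assumed range.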
1.4 Statistical Inference with Differential Privacy
The Laplace mechanism described above is about estimating a function of the dataset , rather than the population from which is drawn, and much of the differential privacy literature is about estimating such empirical statistics. There are several important exceptions, the earliest being the work on differentially private PAC learning (Blum et al. (2005); Kasiviswanathan et al. (2011)), but still many basic statistical inference questions have not been addressed.
However, a natural approach was already suggested in early works on differential privacy. In many cases, we know that population statistics are well-approximated by empirical statistics, and thus we can try to estimate these empirical statistics with differential privacy. For example, the population mean $\mu$ for a normal population is well-approximated by the sample mean $\bar{X}$, which we can estimate using the Laplace mechanism:
$$M(X) = \bar{X} + \mathrm{Lap}\left(\frac{2R}{\varepsilon n}\right).$$
On the positive side, observe that the noise being introduced for privacy vanishes linearly in $1/n$, whereas $\bar{X}$ converges to the population mean at a rate of $1/\sqrt{n}$, so asymptotically we obtain privacy “for free” compared to the (optimal) nonprivate estimator $\bar{X}$.
However, this rough analysis hides some important issues. First, it is misleading to look only at the dependence on $n$. The other parameters, such as $\sigma$, $\varepsilon$, and $R$, can be quite significant and should not be treated as constants. Indeed $R/(\varepsilon n) \ll \sigma/\sqrt{n}$ only when $n \gg R^2/(\varepsilon^2\sigma^2)$, which means that the asymptotics only kick in at a very large value of $n$. Thus it is important to determine whether the dependence on these parameters is necessary or can be improved. Second, the parameter $R$ is supposed to be a (worst-case) bound on the range of the data, which is incompatible with modeling the population as following a normal distribution (which is supported on the entire real line). Thus, there have been several works seeking the best asymptotic approximations we can obtain for population statistics under differential privacy, such as Dwork and Lei (2009); Smith (2011); Wasserman and Zhou (2010); Wasserman (2012); Hall et al. (2013); Duchi et al. (2013a, b); Barber and Duchi (2014).
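To make the first issue concrete, the short Python sketch below compares the Laplace noise scale $2R/(\varepsilon n)$ with the sampling error $\sigma/\sqrt{n}$ and solves for the sample size at which the two are equal. The function names and the parameter values are illustrative assumptions chosen only to show the orders of magnitude involved.

```python
import math

def noise_vs_sampling(n, R, sigma, eps):
    """Return (Laplace noise scale, standard error of the sample mean)."""
    privacy_scale = 2.0 * R / (eps * n)
    sampling_sd = sigma / math.sqrt(n)
    return privacy_scale, sampling_sd

def crossover_n(R, sigma, eps):
    """Sample size at which 2R/(eps*n) equals sigma/sqrt(n)."""
    return (2.0 * R / (eps * sigma)) ** 2

# Hypothetical setting: data range 1000 times the standard deviation.
n_star = crossover_n(R=1000.0, sigma=1.0, eps=0.1)
print(n_star)  # about 4e8
```

With these assumed values, the privacy noise only drops below the sampling error once $n$ is in the hundreds of millions, which illustrates why the dependence on $R$, $\sigma$ and $\varepsilon$ cannot be treated as a constant.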
1.5 Conservative Statistical Inference with DP
The works discussed in the previous section focus on providing point estimates for population quantities, but as mentioned earlier, it is also important to be able to provide measures of statistical significance, to prevent analysts from drawing incorrect conclusions from the results. These measures of statistical significance need to take into account the uncertainty coming both from the sampling of the data and from the noise introduced for privacy. Ignoring the noise introduced for privacy can result in wildly incorrect results at finite sample sizes, as demonstrated empirically many times (e.g. Fienberg et al. (2010); Karwa and Slavković (2012, 2016)), and this can have severe consequences. For example, Fredrikson et al. (2014) found that naive use of differential privacy in calculating warfarin dosage would lead to unsafe levels of medication, but of course one should never use any sort of statistics for life-or-death decisions without some analysis of statistical significance.
Since calculating the exact statistical significance of differentially private computations seems difficult in general, we advocate conservative estimates of significance. That is, we require that $P\big(I(X_1,\ldots,X_n) \ni \mu\big) \ge 1-\alpha$ for all values of $n$, values of the population parameters, and values of the privacy parameter.
For sample sizes that are too small or privacy parameters that are too aggressive, we may achieve this property by allowing the algorithm to sometimes produce an extremely large confidence interval, but that is preferable to producing a small interval that does not actually contain the true parameter, which may violate the desired coverage property. Note that what constitutes a sample size that is “too small” can depend on the unknown parameters of the population (e.g. the unknown variance $\sigma^2$) and their interplay with other parameters (such as the privacy parameter $\varepsilon$).
Returning to our example of estimating a normal mean with known variance under differential privacy, if we use the Laplace Mechanism to approximate the empirical mean (as discussed above), we can obtain a conservative confidence interval for the population mean by increasing the length of the classical, nonprivate confidence interval to account for the likely magnitude of the Laplace noise. More precisely, starting with the differentially private mechanism
$$M(X) = \bar{X} + \mathrm{Lap}\left(\frac{2R}{\varepsilon n}\right),$$
the following is a $(1-2\alpha)$-level confidence interval for the population mean $\mu$:
$$M(X) \pm \left(\frac{\sigma}{\sqrt{n}} \cdot z_{1-\alpha/2} + \frac{2R}{\varepsilon n} \cdot \log\frac{1}{\alpha}\right).$$
The point is that with probability $1-\alpha$, the Laplace noise has magnitude at most $\frac{2R}{\varepsilon n}\log\frac{1}{\alpha}$, so increasing the interval by this amount will preserve coverage (up to an $\alpha$ change in the probability). Again, the privacy guarantee of the Laplace mechanism relies on the data points being guaranteed to lie in $[-R, R]$; otherwise, points need to be clamped to lie in the range, which can bias the empirical mean and compromise the coverage guarantee. Thus, to be safe, a user may choose a very large value of $R$, but then this makes for a much larger (and less useful) interval, as the length of the interval grows linearly with $R$. Thus, a natural research question (which we investigate) is whether such a choice and corresponding cost is necessary.
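The widening argument can be sketched in Python as follows. This is an illustrative sketch, not the paper's algorithm: the name `laplace_ci` is hypothetical, and it assumes one particular split of the failure probability, namely $\alpha$ for the sampling error and $\alpha$ for the Laplace tail $P(|\mathrm{Lap}(b)| > b\ln(1/\alpha)) = \alpha$, which gives coverage at least $1-2\alpha$ when the data really lie in $[-R, R]$.

```python
import math
import random
from statistics import NormalDist

def laplace_ci(xs, sigma, R, eps, alpha, rng):
    """Conservative interval for a normal mean: eps-DP, known sigma, data in [-R, R]."""
    n = len(xs)
    clamped = [min(max(x, -R), R) for x in xs]
    scale = 2.0 * R / (eps * n)  # Laplace scale = sensitivity / eps
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    center = sum(clamped) / n + noise
    # Nonprivate half-width plus a bound on the Laplace noise magnitude.
    sampling = NormalDist().inv_cdf(1 - alpha / 2) * sigma / math.sqrt(n)
    privacy = scale * math.log(1.0 / alpha)
    half = sampling + privacy
    return center - half, center + half

rng = random.Random(7)
data = [rng.gauss(1.5, 1.0) for _ in range(200)]
print(laplace_ci(data, sigma=1.0, R=10.0, eps=1.0, alpha=0.05, rng=rng))
```

The second term of the half-width grows linearly in `R`, so an overly cautious choice of `R` directly inflates the interval, which is the cost discussed above.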
Conservative hypothesis testing with differential privacy, where we require that the significance level is at least $1-\alpha$, was advocated by Gaboardi et al. (2016b). Methods aimed at calculating the combined uncertainty due to sampling and privacy (for various differentially private algorithms) were given in Vu and Slavkovic (2009); Williams and McSherry (2010); Karwa and Slavković (2012); Karwa et al. (2015, 2014); Karwa and Slavković (2016); Gaboardi et al. (2016b); Solea (2014); Wang et al. (2015); Kifer and Rogers (2016), but generally the utility of these methods (e.g. the expected length of a confidence interval or power of a hypothesis test) is only evaluated empirically or the conservativeness only holds in a particular asymptotic regime. Rigorous, finite-sample analyses of conservative inference were given in Sheffet (2017) for confidence intervals on the coefficients from ordinary least-squares regression (which can be seen as a generalization of the problem we study to multivariate Gaussians) and in Cai et al. (2017) for hypothesis testing of discrete distributions. However, neither paper provides matching lower bounds, and in particular, the algorithms of Sheffet (2017) only apply for bounded data (similar to the basic Laplace mechanism). In our work, we provide a comprehensive theoretical analysis of conservative differentially private confidence intervals for a normal mean, with both algorithms and lower bounds, without any bounded data assumption.
Before stating our results, we define more precisely the notion of a (conservative) $(1-\alpha)$-level confidence set. Let $\{P_{\theta,\gamma}\}$ be a family of distributions supported on $\mathbb{R}$ where $\theta \in \Theta \subseteq \mathbb{R}$ is a real valued parameter and $\gamma$ is a vector of nuisance parameters. A nuisance parameter is an unknown parameter that is not a primary object of study, but must be accounted for in the analysis. For example, when we consider estimating the mean of a normal distribution with unknown variance, the variance is a nuisance parameter. We write $X_1,\ldots,X_n \sim P_{\theta,\gamma}$ if $X_1,\ldots,X_n$ is an independent and identically distributed random sample from a distribution $P_{\theta,\gamma}$. We sometimes abuse the notation and write $X$ instead of $X_1,\ldots,X_n$ when $n$ is clear from the context.
Definition 2 ($(1-\alpha)$-level confidence set).
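Let $\alpha \in (0,1)$. Let $X_1,\ldots,X_n \sim P_{\theta,\gamma}$ where $\theta \in \Theta$ and $\gamma$ is a vector of nuisance parameters. A $(1-\alpha)$-level confidence set for $\theta$ with sample complexity $n$ is a (possibly randomized) measurable function $I : \mathbb{R}^n \to \mathcal{S}$, where $\mathcal{S}$ is a set of measurable subsets of $\Theta$, such that for all $\theta \in \Theta$ and $\gamma$, we have
$$P\big(I(X_1,\ldots,X_n) \ni \theta\big) \ge 1-\alpha,$$
where the probability is taken over the randomness of both $I$ and the data $X_1,\ldots,X_n$.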
1.6 Our Results
As discussed above, in this paper we develop conservative differentially private estimators of confidence intervals for the mean $\mu$ of a normal distribution with known and unknown variance $\sigma^2$. Our algorithms are designed to be differentially private for all input datasets and they provide $(1-\alpha)$-level coverage whenever the data is generated from a normal distribution. Unlike the Laplace mechanism described above and many other differentially private algorithms, we do not make any assumptions on the boundedness of the data. Our pure DP (i.e. $(\varepsilon,0)$-DP) algorithms assume that the mean $\mu$ and variance $\sigma^2$ lie in a bounded (but possibly very large) interval, and we show (using lower bounds) that such an assumption is necessary. Our approximate (i.e. $(\varepsilon,\delta)$) differentially private algorithms do not make any such assumptions, i.e. both the data and the parameters ($\mu$, $\sigma^2$) can remain unbounded. We also show that the differentially private estimators that we construct have nearly optimal expected length, up to logarithmic factors. This is done by proving lower bounds on the length of differentially private confidence intervals. A key aspect of the confidence intervals that we construct is their conservativeness: the coverage guarantee holds in finite samples, as opposed to only holding asymptotically. We also show that as $n \to \infty$, the length of our differentially private confidence intervals is at most a $1+o(1)$ factor larger than the length of their nonprivate counterparts.
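Let $X_1,\ldots,X_n$ be an independent and identically distributed (iid) random sample from a normal distribution with an unknown mean $\mu$ and variance $\sigma^2$, where $\mu \in (-R, R)$ and $\sigma \in (\sigma_{\min}, \sigma_{\max})$. Our goal is to construct $(\varepsilon,\delta)$ differentially private $(1-\alpha)$-level confidence sets for $\mu$ in both the known and the unknown variance case, i.e. we seek a set $I = I(X_1,\ldots,X_n)$ such that
1. $I$ is a $(1-\alpha)$-level confidence interval,
2. $I$ is $(\varepsilon,\delta)$ differentially private, and
3. $E[|I|]$ is as small as possible.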
Known Variance:
We prove the following result for estimating the confidence interval with the privacy constraint:
Theorem 1.1 (known variance case).
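There exists an $(\varepsilon,\delta)$-differentially private algorithm that on input $X_1,\ldots,X_n \sim N(\mu,\sigma^2)$ with known $\sigma$ and unknown mean $\mu \in (-R, R)$ outputs a $(1-\alpha)$-level confidence interval for $\mu$. Moreover, if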
(where and are universal constants) then the interval is of fixed width where
Theorem 1.1 asserts that there exists a differentially private algorithm that outputs a fixed width level confidence interval for any . Moreover, when is large enough, the algorithm outputs a confidence interval of length which is nontrivial in the sense that . Specifically, is a maximum of two terms: The first term is which is the same as the length of the nonprivate confidence interval discussed in Section 1.2 up to constant factors. The second term is up to polylogarithmic factors it goes to at the rate of which is faster than the rate at which the first term goes to . Thus for large the increase in the length of the confidence interval due to privacy is mild. Note that, unlike the basic approach based on the Laplace mechanism discussed in Section 1.5, the length of the confidence interval has no dependence on the range of the data, or even the range of the mean .
The sample complexity required for obtaining a nontrivial confidence interval is the minimum of two terms: and . The dependence of sample complexity on is only logarithmic. Thus one can choose a very large value of . Moreover, when , we can set and hence there is no dependence of the sample complexity on .
The first term in the length of the confidence interval in Theorem 1.1 hides some constants which can lead to a constant multiplicative factor increase in the length of the differentially private confidence intervals when compared to the nonprivate confidence intervals. We show that it is possible to eliminate this multiplicative increase and obtain differentially private confidence intervals with only additive increase in the length:
Theorem 1.2.
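There exists an $(\varepsilon,\delta)$-differentially private algorithm that on input $X_1,\ldots,X_n \sim N(\mu,\sigma^2)$ with known $\sigma$ and unknown mean $\mu \in (-R, R)$ outputs a $(1-\alpha)$-level confidence interval of $\mu$. Moreover, if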
(where is a universal constant) then
Theorem 1.2 asserts that with a small change to the sample complexity on by an additive term of , we can achieve an additive increase in the length of the confidence interval as opposed to a multiplicative increase. Note that the first term in exactly matches the length of the nonprivate confidence interval, namely , while the second term vanishes more quickly as a function of .^{4}^{4}4We note that when the range of the mean is bounded, as we require for our pure differentially private algorithms, the length of a nonprivate algorithm can be improved, but the improvement is insignificant in the regime of parameters we are interested in, namely when . See Theorem 6.1. An important point is that we retain an rather than an in the first term, contrary to the common belief that differential privacy has a price of in sample size. (See Steinke and Ullman (2015) for a proof of such a statement for computing summary statistics of a dataset rather than inference, and Hay et al. (2016) for an informal claim along these lines.)
Unknown Variance:
Our differentially private confidence interval in the unknown variance case is as follows:
Theorem 1.3 (unknown variance case).
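There exists an $(\varepsilon,\delta)$-differentially private algorithm that on input $X_1,\ldots,X_n \sim N(\mu,\sigma^2)$ with unknown mean $\mu$ and variance $\sigma^2$ always outputs a $(1-\alpha)$-level confidence interval of $\mu$. Moreover, if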
(where is a universal constant), then the expected length of the interval is such that
As in the known variance case, Theorem 1.3 asserts that there exists an $(\varepsilon,\delta)$-differentially private algorithm that always outputs a $(1-\alpha)$-level confidence interval of $\mu$ for all $n$. If $n$ is large enough, the length of the confidence interval is a maximum of two terms, where the first term is the same as the length of the nonprivate confidence interval and the second term goes to $0$ at a faster rate.
As before, the dependence of sample complexity on $R$ and $\sigma_{\max}/\sigma_{\min}$ is logarithmic, as opposed to linear. Hence we can set these parameters to a large number. Moreover, when $\delta > 0$, we can set $R$ and $\sigma_{\max}$ to be $\infty$ and $\sigma_{\min}$ to be $0$. Thus when $\delta > 0$, there are no assumptions on the boundedness of the parameters.
Finally, along the lines of Theorem 1.2, at the cost of a minor increase in sample complexity, we can obtain a differentially private algorithm that has only additive increase in the length of the confidence interval that is asymptotically vanishing relative to the nonprivate length. Specifically, we can obtain an interval with length
where again the first term is exactly the same as in the nonprivate case (see Section 1.2) and the second term vanishes more quickly as a function of .
Lower Bounds.
We also prove lower bounds on the length of any level differentially private confidence set of expected size :
Theorem 1.4 (Lower bound).
Let be any differentially private algorithm that on input produces a level confidence set of of expected size . If , then
Moreover, if , then
where is a universal constant.
Note that the first lower bound says that we must pay in the length of the confidence interval when is very large. Our algorithms come quite close to this lower bound with an extra factor of . The second lower bound shows that the sample complexity required by Theorem 1.1 is necessary to obtain a confidence interval that saves more than a factor of 2 over the trivial interval . By setting , the sample complexity lower bound also matches that of Theorem 1.3 in our parameter regime of interest, namely when .
1.7 Techniques
Known Variance Algorithms:
Our algorithms for the known variance case (Theorems 1.1 and 1.2) are based on the simple Laplace-mechanism-based confidence interval discussed in Section 1.5, except that we calculate a suitable bound on the range of the data in a differentially private manner, rather than having it be an input provided by a data analyst. Specifically, we give a differentially private algorithm $M$ that takes $n$ real numbers $x_1,\ldots,x_n$ and outputs an interval $[x_{\min}, x_{\max}]$ (which need not be centered at 0) such that for every $\mu \in (-R, R)$, when $X_1,\ldots,X_n \sim N(\mu,\sigma^2)$, we have:
1. With probability at least $1-\alpha$ over $X_1,\ldots,X_n$ and the coins of $M$, we have $x_{\min} \le X_i \le x_{\max}$ for all $i$, and
2. With probability 1, $x_{\max} - x_{\min} = O\big(\sigma\sqrt{\log(n/\alpha)}\big)$.
Thus, if we clamp all datapoints to lie in the interval $[x_{\min}, x_{\max}]$ (which will usually have no effect for data that comes from our normal model, by Property 1), we can calculate an approximate mean and thus construct a confidence interval using Laplace noise of scale $O\big(\sigma\sqrt{\log(n/\alpha)}/(\varepsilon n)\big)$.
Now, estimating the range of a dataset with differential privacy is impossible to do with any nontrivial accuracy in the worst case, so we must exploit the distributional assumption on our dataset to construct $M$. Specifically, we exploit the following properties of normal data:
1. A vast majority of the probability mass of $N(\mu,\sigma^2)$ is concentrated in an interval of width $O(\sigma)$ around the mean $\mu$.
2. With probability at least $1-\alpha$, all datapoints are at distance at most $O\big(\sigma\sqrt{\log(n/\alpha)}\big)$ from $\mu$.
Similar properties hold for many other natural parameterized families of distributions, changing the factor of $\sqrt{\log(n/\alpha)}$ according to the concentration properties of the family.
Given these properties, $M$ works as follows: we partition the original range $(-R, R)$ (where $R$ might be infinite) into “bins” (intervals) of width $O(\sigma)$, and calculate an $(\varepsilon,\delta)$ differentially private approximate histogram of how many points lie in each bin. By Property 1 and a Chernoff bound, with high probability, the vast majority of our normally distributed data points will be in the bin containing $\mu$ or one of the neighboring bins. Existing algorithms for differentially private histograms (Dwork et al., 2006; Bun et al., 2016) allow us to identify one of these heavy bins with probability $1-\alpha$, provided $n$ is sufficiently large (depending on $\varepsilon$, $\delta$ and $\alpha$, and at most logarithmically on $K$), where $K$ is the number of bins. After identifying such a bin, Property 2 tells us that we can simply expand the bin by $O\big(\sigma\sqrt{\log(n/\alpha)}\big)$ on each side and include all of the datapoints with high probability. This proof sketch gives a $(1-O(\alpha))$-level confidence interval, and redefining $\alpha$ yields Theorem 1.1. To obtain Theorem 1.2, we set parameters more carefully so that the failure probability in estimating the range is much smaller than $\alpha$, which increases the sample complexity of the histogram algorithms only slightly.
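The range-finding idea can be sketched in Python for the pure-DP case with a finite range. This is a simplified illustration under stated assumptions, not the paper's exact algorithm: the name `private_range` is hypothetical, the histogram is the basic Laplace histogram (noise of scale $2/\varepsilon$ per bin, since changing one point alters at most two counts by one each), bins have width exactly $\sigma$, and the padding $\sigma(1+\sqrt{2\ln(2n/\alpha)})$ is one convenient choice derived from the Gaussian tail bound rather than the constant used in the paper.

```python
import math
import random

def private_range(xs, sigma, R, eps, alpha, rng):
    """eps-DP interval intended to contain all of xs when xs ~ N(mu, sigma^2), |mu| < R."""
    n = len(xs)
    num_bins = max(1, math.ceil(2 * R / sigma))  # bins of width sigma covering [-R, R]
    counts = [0] * num_bins
    for x in xs:
        j = int((min(max(x, -R), R) + R) // sigma)
        counts[min(j, num_bins - 1)] += 1
    # Laplace histogram: add Lap(2/eps) noise to every bin count.
    noisy = [c + rng.expovariate(eps / 2) - rng.expovariate(eps / 2) for c in counts]
    j_star = max(range(num_bins), key=noisy.__getitem__)  # noisy heaviest bin
    left = -R + j_star * sigma
    # Pad so that, if the chosen bin holds mu or is adjacent to it, all points fit.
    pad = sigma * (1 + math.sqrt(2 * math.log(2 * n / alpha)))
    return left - pad, left + sigma + pad

rng = random.Random(1)
sample = [rng.gauss(50.0, 1.0) for _ in range(2000)]
print(private_range(sample, sigma=1.0, R=1000.0, eps=1.0, alpha=0.05, rng=rng))
```

The returned width is $\sigma + 2\cdot\text{pad}$ regardless of the data, matching Property 2 of the algorithm, while the location depends on the data only through the noisy histogram. Memory here scales with the number of bins, so this sketch does not cover the unbounded-range case that the approximate-DP histogram algorithms handle.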
This general approach, of finding a differentially private estimate of the range and using it to compute a differentially private mean, is inspired by the work of Dwork and Lei (2009). They present an $(\varepsilon,\delta)$ differentially private algorithm to estimate the scale of the data. They use the estimate of scale to obtain a differentially private estimate of the median without making any assumptions on the range of the data. Their algorithms require $\delta > 0$. In contrast, our range finding algorithms for Gaussian data work for $\delta = 0$ without making any assumptions on the range of the data (but instead assume that the parameters are bounded). Our algorithms also handle the unknown variance case, as discussed below. Also, while the general idea of eliminating the dependence on the range of the data is similar, the underlying techniques and privacy and utility guarantees are different.
Unknown Variance Algorithms:
For the case of an unknown variance (Theorem 1.3), we begin with the observation that our range-finding algorithm discussed above only needs a constant-factor approximation to the variance $\sigma^2$. Thus, we will begin by calculating a constant-factor approximation to $\sigma$ in a differentially private manner, and then estimate the range as above. To do this, we consider the dataset of size $n/2$ given by $Y_i = |X_{2i} - X_{2i-1}|$. Here each point is distributed as the absolute value of a $N(0, 2\sigma^2)$ random variable, which has the vast majority of its probability mass on points of magnitude $\Theta(\sigma)$. Thus, if we partition the interval $(\sigma_{\min}, \sigma_{\max})$ into bins of the form $(2^j, 2^{j+1}]$ and apply an approximate histogram algorithm, the heaviest bin will give us an estimate of $\sigma$ to within a constant factor. Actually, to analyze the expected length of our confidence interval, we will need that our estimate of $\sigma$ is within a constant factor of the true value not only with high probability but also in expectation; this requires a finer analysis of the histogram algorithm, where the probability of picking any bin decays linearly with the probability mass of that bin (so bins further away from $\sigma$ have exponentially decaying probability of being chosen). Note that this approach for approximating $\sigma$ also exploits the symmetry of a normal distribution, so that $X_{2i} - X_{2i-1}$ is likely to have magnitude $\Theta(\sigma)$, independent of $\mu$; it should generalize to many other common symmetric distribution families. For nonsymmetric families, one could instead use differentially private algorithms for releasing threshold functions (i.e. estimating quantiles) at the price of a small dependence on the ranges even when $\delta > 0$. (See Vadhan (2017, Sec. 7.2) and references therein.)
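The scale-estimation step can be sketched along the same lines. Again this is a simplified illustration and not the paper's algorithm: the name `private_scale` is hypothetical, the bins are assumed to be the dyadic intervals $(2^j, 2^{j+1}]$, the histogram is the basic Laplace histogram for the pure-DP case with finite $(\sigma_{\min}, \sigma_{\max})$, and the finer analysis needed for the in-expectation guarantee is not reflected here.

```python
import math
import random

def private_scale(xs, sigma_min, sigma_max, eps, rng):
    """eps-DP constant-factor estimate of sigma from paired differences."""
    # Disjoint pairs: changing one x_i changes exactly one difference.
    ys = [abs(xs[2 * i + 1] - xs[2 * i]) for i in range(len(xs) // 2)]
    j_lo = math.floor(math.log2(sigma_min)) - 2  # a little slack at both ends
    j_hi = math.ceil(math.log2(sigma_max)) + 2
    counts = {j: 0 for j in range(j_lo, j_hi + 1)}
    for y in ys:
        j = j_lo if y == 0 else min(max(math.floor(math.log2(y)), j_lo), j_hi)
        counts[j] += 1
    # One changed difference moves at most two counts by one: Lap(2/eps) per bin.
    noisy = {j: c + rng.expovariate(eps / 2) - rng.expovariate(eps / 2)
             for j, c in counts.items()}
    j_star = max(noisy, key=noisy.get)
    return 2.0 ** j_star  # lower edge of the noisy heaviest bin

rng = random.Random(5)
sample = [rng.gauss(3.0, 8.0) for _ in range(4000)]
print(private_scale(sample, sigma_min=0.01, sigma_max=1000.0, eps=1.0, rng=rng))
```

Because the differences $X_{2i} - X_{2i-1}$ do not depend on $\mu$, the estimate is unaffected by the location of the data, which is the symmetry property the paragraph above relies on.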
Now, once we have found the range as in the known-variance case, we can again use the Laplace mechanism to estimate the empirical mean to within additive error $O\big(\sigma\sqrt{\log(n/\alpha)}/(\varepsilon n)\big)$. And we can use our constant-factor approximation of $\sigma$ to estimate the size of the nonprivate confidence interval to within a constant factor. This suffices for Theorem 1.3. But to obtain the tighter bound, where we only pay an additive increase over the length of the nonprivate interval, we cannot just use a constant-factor approximation of the variance. Instead, we also use the Laplace mechanism to estimate the sample variance $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2$. Our bound on the range (with clamping) ensures that $s^2$ has global sensitivity $O\big(\sigma^2\log(n/\alpha)/n\big)$, and thus can be estimated quite accurately.
Lower Bounds:
For our lower bounds (Thm 1.4), we observe that the expected length of a confidence set can be written as
where ranges over and the notation indicates that probability is taken over generated according to for a particular value of , and over the mechanism . Next, we use the differential privacy guarantee to deduce that
where is the total variation distance between and . (This can be seen as a distributional analogue of the “group privacy” property used in “packing lower bounds” for calculating empirical statistics under differential privacy (Hardt and Talwar, 2010; Beimel et al., 2010; Wasserman and Zhou, 2010; Hall et al., 2011), and is also a generalization of the “secrecy of the sample” property of differential privacy (Kasiviswanathan et al., 2011; Smith, 2009; Bun et al., 2015). Finally, we know that by the coverage property of , yielding our lower bound (after some calculations).
1.8 Directions for Future Work
The most immediate direction for future work is to close the (small) gaps between our upper and lower bounds. Most interesting is whether the price of privacy in the length of confidence intervals needs to be even additive, as in Theorem 1.2. Our lower bound only implies that the length of a differentially private confidence interval must be at least the maximum of a privacy term (namely, the lower bound in Theorem 1.4) and the nonprivate length (cf. Theorem 6.1), rather than the sum. In particular, when $n$ is sufficiently large, the nonprivate length is larger than the privacy term, and Theorem 1.4 leaves open the possibility that a differentially private confidence interval can have exactly the same length as a nonprivate confidence interval. This seems unlikely, and it would be interesting to prove that there must be some price to privacy even if $n$ is very large.
We came to the problem of constructing confidence intervals for a normal mean as part of an effort to bring differential privacy to practice in the sharing of social science research data through the design of the software tool PSI (Gaboardi et al., 2016a), as confidence intervals are a workhorse of data analysis in the social sciences. However, our algorithms are not optimized for practical performance, but rather for asymptotic analysis of the confidence interval length. Initial experiments indicate that alternative approaches (not just tuning of parameters) may be needed to obtain reasonably sized confidence intervals (e.g. length at most twice that of the nonprivate length) at modest sample sizes (e.g. in the 1000’s). Thus designing practical differentially private algorithms for confidence intervals remains an important open problem, whose solution could have wide applicability.
As mentioned earlier, we expect that much of the modelling and techniques we develop should also be applicable more widely. In particular, it would be natural to study the estimation of other population statistics, and families of distributions, such as other continuous random variables, Bernoulli random variables, and multivariate families. In particular, a natural generalization of the problem we consider is to construct confidence intervals for the parameters of a (possibly degenerate) multivariate Gaussian, which is closely related to the problem of ordinary leastsquares regression (cf.
Sheffet (2017)).Finally, while we have advocated for conservative inference at finite sample size, to avoid spurious conclusions coming from the introduction of privacy, many practical, nonprivate inference methods rely on asymptotics also for measuring statistical significance. In particular, the standard confidence interval for a normal mean with unknown variance and its corresponding hypothesis test (see Section 1.2
) is often applied on nonnormal data, and heuristically justified using the Central Limit Theorem. (This is heuristic since the rate of convergence depends on the data distribution, which is unknown.) Is there a criterion to indicate what asymptotics are “safe”? In particular, can we formalize the idea of only using the “same” asymptotics that are used without privacy?
Kifer and Rogers (2016) analyze their hypothesis tests using asymptotics that constrain the setting of the privacy parameter in terms of the sample size (e.g. ), but it’s not clear that this relationship is safe to assume in general.
1.9 Organization
The rest of the paper is organized as follows. In Section 2, we introduce preliminary results on differential privacy and techniques, such as the Laplace mechanism and histogram learners, that are needed for our algorithms. In Section 3, we present differentially private algorithms to estimate the range of the data; these algorithms serve as building blocks for estimating differentially private confidence intervals. In Sections 4 and 5, we present differentially private algorithms to estimate an $\alpha$-level confidence interval for the mean $\mu$ with known and unknown variance, respectively. Section 6 is devoted to lower bounds.
2 Preliminaries
2.1 Notation
We use $\log$ to denote the natural logarithm (base $e$), unless otherwise noted. Random variables are denoted by capital roman letters and their realizations by small roman letters. For example, $X$ is a random variable and $x$ is its realization. We write or equivalently or to denote a sample of independent and identically distributed random variables from the distribution , where and is a vector of nuisance parameters. We sometimes abuse notation and write instead of . Estimators of parameters are denoted by . A differentially private mechanism is denoted by or .
There are two sources of randomness in our algorithms. The first is the coin flips made by the estimator or algorithm, with the dataset considered fixed. We use the notation of conditioning to denote probabilities and expectations with respect to the privacy mechanism when the data is considered fixed. Specifically, conditional probability is denoted by and conditional expectation is denoted by . The second source of randomness comes from assuming that the dataset is a sample from an underlying distribution . The probability and expectation with respect to this distribution are denoted by , and . While the privacy guarantees are with respect to a fixed dataset, the accuracy guarantees are with respect to both the randomness in the data and the mechanism. In such cases, we state both sources of randomness by writing
where is any measurable event and the subscripts denote the sources of randomness.
2.2 Differential Privacy
We now present some key properties of differential privacy that we use in this paper. One of the attractive properties of differential privacy is that it composes. To prove the privacy properties of an algorithm, we will rely on the fact that a differentially private algorithm that takes as input a dataset together with the output of a previous differentially private computation is also differentially private, with the privacy parameters adding up.
Lemma 2.1 (Composition of DP, Dwork et al. (2006)).
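Let $M_1 : \mathcal{X}^n \to \mathcal{R}_1$ be an $(\epsilon_1, \delta_1)$-differentially private algorithm. Let $M_2 : \mathcal{X}^n \times \mathcal{R}_1 \to \mathcal{R}_2$ be such that $M_2(\cdot, r_1)$ is an $(\epsilon_2, \delta_2)$-differentially private algorithm for every fixed $r_1 \in \mathcal{R}_1$. Then the algorithm $M(x) = M_2(x, M_1(x))$ is $(\epsilon_1 + \epsilon_2, \delta_1 + \delta_2)$-differentially private.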
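As a hedged illustration of the composition pattern in Lemma 2.1, the sketch below chains two pure differentially private steps (both $\delta$ parameters equal to zero) and reports the summed privacy budget. The function names, the counting queries, and the Laplace noise of scale $1/\epsilon$ for a count of sensitivity 1 are illustrative assumptions, not constructions from this paper.

```python
import random


def lap(scale, rng):
    """Laplace(0, scale) sample: difference of two iid exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def first_stage(data, eps1, rng):
    """Hypothetical eps1-DP step: noisy count of positive entries."""
    return sum(1 for x in data if x > 0) + lap(1.0 / eps1, rng)


def second_stage(data, r1, eps2, rng):
    """Hypothetical step that is eps2-DP for every fixed r1.

    The earlier output r1 only selects which sensitivity-1 count is queried.
    """
    cutoff = 1.0 if r1 > len(data) / 2 else -1.0
    return sum(1 for x in data if x > cutoff) + lap(1.0 / eps2, rng)


def composed(data, eps1, eps2, rng):
    """Run both stages; by basic composition the total budget is eps1 + eps2."""
    r1 = first_stage(data, eps1, rng)
    r2 = second_stage(data, r1, eps2, rng)
    return r2, eps1 + eps2
```

The second stage may depend arbitrarily on the first stage's output, as long as it is differentially private for each fixed value of that output.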
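In many algorithms, we will rely on a basic mechanism for differential privacy that works by adding Laplace noise. Let $f$ be any function of the dataset that we wish to release an approximation of. The global sensitivity of $f$ is defined as
$$GS_f = \max_{x \sim x'} \| f(x) - f(x') \|_1,$$
where the maximum is over pairs of datasets $x, x'$ that differ in one row, and $\|v\|_1$ is the $\ell_1$ norm of the vector $v$.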
Lemma 2.2 (The Laplace Mechanism, Dwork et al. (2006)).
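Let $f : \mathcal{X}^n \to \mathbb{R}^k$ be a function with global sensitivity at most $\Delta$. The mechanism
$$M(x) = f(x) + Z$$
is $\epsilon$-differentially private, where $x$ is the input dataset and $Z = (Z_1, \ldots, Z_k)$ is a $k$-dimensional random vector whose components are independent Laplace random variables (defined in 7.2) with mean $0$ and scale parameter $\Delta / \epsilon$.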
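A minimal sketch of the Laplace mechanism of Lemma 2.2, using only the Python standard library. The function names are hypothetical, and the caller is assumed to supply a valid upper bound on the global sensitivity. The sampler is not hardened against floating-point attacks, so this is for illustration only.

```python
import random


def laplace_sample(scale, rng):
    """Draw from Laplace(0, scale) as a difference of two iid exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def laplace_mechanism(values, sensitivity, epsilon, rng=None):
    """Add independent Laplace(sensitivity / epsilon) noise to each coordinate.

    `values` is f(x) already evaluated on the dataset; `sensitivity` must be
    an upper bound on the global (l1) sensitivity of f.
    """
    if sensitivity <= 0 or epsilon <= 0:
        raise ValueError("sensitivity and epsilon must be positive")
    if rng is None:
        rng = random.Random()
    scale = sensitivity / epsilon
    return [v + laplace_sample(scale, rng) for v in values]
```

For instance, the mean of $n$ values known to lie in $[0, 1]$ changes by at most $1/n$ when one row changes, so it could be released with `laplace_mechanism([mean], 1.0 / n, epsilon)`.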
In many estimators we design, we need a differentially private mechanism for finding a heaviest bin from a (possibly countably infinite) collection of bins, i.e. the bin with the maximum probability mass under the data generating distribution . Formally, let be any collection of disjoint measurable subsets of , which we will refer to as bins, where can be . The histogram of a distribution corresponding to the bins is given by the vector where each . In Lemma 2.3, we assert the existence of an differentially private mechanism that on input iid samples from outputs a noisy histogram from which the heaviest bin can be extracted.
Lemma 2.3 (Histogram Learner, following Dwork et al. (2006), Bun et al. (2016), Vadhan (2016)).
For every , domain , for every collection of disjoint bins defined on , , , and there exists an differentially private algorithm such that for every distribution on , if

,

, and

(1)
then,
(2)  
(3) 
where the probability is taken over the randomness of and the data .
Lemma 2.3 asserts the existence of an differentially private histogram learner that takes as input an iid sample and a collection of disjoint bins, and outputs estimates of , the probability of falling in bin for all . It has the property that with high probability, the maximum generalization error is at most . It is important to note that the error is measured with respect to the population and not the sample. In particular, the error term includes the noise due to sampling and differential privacy. To bound the generalization error, we follow a standard technique of bounding the generalization error without privacy (i.e. the difference between the sample and the population) and the error introduced for privacy (see for example the equivalence between differentially private query release and differentially private threshold learning in Bun et al. (2015)).
If we take equal to the RHS of inequality 1 in Lemma 2.3, we obtain a bound on as a function of , , , and , namely
(4) 
The last term, is the sampling error, which is incurred even without privacy. For privacy, we incur an error that is the minimum of two terms: and . Note that these two terms vanish linearly in , faster than the sampling error, which vanishes as . Moreover, the dependence on the number of bins is only logarithmic or in case of , even nonexistent. When , the choice of allows us to construct DP algorithms that have no dependence on the range of the parameters.
We will use the histogram learner to obtain the largest noisy bin from a possibly infinite collection of bins. Hence, apart from a bound on the maximum generalization error, we also need a bound on the probability of picking the wrong bin as the largest bin. Lemma 2.3 asserts that the probability of choosing any bin as the largest bin is roughly upper bounded by , the expected number of points falling in bin . As before, this probability is over the sampling and the noise added by the differential privacy mechanism. Note that this bound is useful only when is small, in particular . Hence, the lemma bounds the probability of incorrectly choosing a bin that has very few expected points as the largest bin.
Proof of Lemma 2.3.
Let be the number of points that fall in bin and be the corresponding proportion of points. The distribution learner operates as follows: When , it uses an DP algorithm, and when , it uses an DP algorithm to output a noisy histogram.
The key idea behind the proof of the existence of a histogram learner is the following. There exist basic and differentially private mechanisms with the property that on input and bins they output , such that if
then for every ,
That is, with high probability, there is a small difference between the differentially private output and the empirical estimates . Moreover, the Dvoretzky-Kiefer-Wolfowitz inequality (Massart, 1990) tells us that with high probability the empirical estimates are close to the population parameters:
So, if , then . Thus, by a union bound, if
then,
Now we review the two differentially private algorithms we use and also prove the additional claim regarding the probability of any bin being selected as the maximum.
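The sampling-error step in the argument above can be made concrete with the classical form of the Dvoretzky-Kiefer-Wolfowitz inequality with Massart's constant, $P(\sup_t |F_n(t) - F(t)| > \alpha) \le 2 e^{-2 n \alpha^2}$, stated for the empirical CDF $F_n$ of $n$ iid samples. The sketch below evaluates this bound and the sample size it implies. The function names are hypothetical, and the constants may differ from those in the displayed conditions of Lemma 2.3, which concern bin probabilities rather than the CDF.

```python
import math


def dkw_tail(n, alpha):
    """Upper bound on P(sup_t |F_n(t) - F(t)| > alpha) for n iid samples."""
    return min(1.0, 2.0 * math.exp(-2.0 * n * alpha ** 2))


def dkw_sample_size(alpha, beta):
    """Smallest n with 2 * exp(-2 * n * alpha^2) <= beta."""
    if not (0 < alpha < 1 and 0 < beta < 1):
        raise ValueError("alpha and beta must lie in (0, 1)")
    return math.ceil(math.log(2.0 / beta) / (2.0 * alpha ** 2))
```

The required sample size grows as $1/\alpha^2$, which is the sampling-error rate that the privacy terms are compared against.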
case:
We will first start with the differentially private algorithm which we use when . Consider the Laplace mechanism, given in Lemma 2.2 applied to the empirical histogram. The empirical histogram of a dataset on bins is given by Note that the global sensitivity of is . Hence we can release the empirical histogram by adding Laplace noise with scale to each . Formally, let where . If , then,
where the second line follows from the tails of a Laplace distribution, see Proposition 7.2. Now, let us prove that
Let . Consider the event . Note that is continuous. Hence the probability of ties is zero and the argmax is unique and well defined. This implies that . In turn, this implies that either, , or or or . Hence, by a union bound,
where we have used the Chernoff bound (Proposition 7.1) and a tail bound for a Laplace random variable (Proposition 7.2).
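The Laplace-based histogram described in this case can be sketched as follows, for finitely many interval bins on the real line. The interval representation, the function names, and the explicit noise scale $2/(\epsilon n)$ are assumptions of this sketch. The scale follows from the fact that changing one data point moves at most two empirical proportions by $1/n$ each.

```python
import bisect
import random


def laplace_sample(scale, rng):
    """Draw from Laplace(0, scale) as a difference of two iid exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_histogram(data, edges, epsilon, rng=None):
    """Release noisy bin proportions for bins [edges[k], edges[k+1]).

    Points outside [edges[0], edges[-1]) are ignored. The l1 sensitivity of
    the vector of proportions is taken to be 2/n.
    """
    if rng is None:
        rng = random.Random()
    n = len(data)
    counts = [0] * (len(edges) - 1)
    for x in data:
        k = bisect.bisect_right(edges, x) - 1
        if 0 <= k < len(counts):
            counts[k] += 1
    scale = 2.0 / (epsilon * n)
    return [c / n + laplace_sample(scale, rng) for c in counts]


def heaviest_bin(noisy):
    """Index of the largest noisy proportion (ties have probability zero)."""
    return max(range(len(noisy)), key=noisy.__getitem__)
```

Selecting the heaviest bin is post-processing of the noisy histogram, so it does not consume additional privacy budget.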
Case:
For the case that , we will use an $(\epsilon, \delta)$-differentially private algorithm, called the stability-based histogram (Bun et al., 2016), to estimate the bin probabilities. It removes the dependence on the number of bins by allowing that number to be infinite. Specifically, on input the sample and the collection of bins (possibly infinitely many), the algorithm runs as follows:

Let

If , set

If ,

Let , where .

Let

If , set


Output
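The steps above can be sketched as follows. The noise scale $2/(\epsilon n)$ and the threshold $2\log(2/\delta)/(\epsilon n) + 1/n$ are assumptions taken from the usual presentation of stability-based histograms, and may not match the constants used in this paper. The function names and the `bin_of` labelling function are hypothetical.

```python
import math
import random
from collections import Counter


def laplace_sample(scale, rng):
    """Draw from Laplace(0, scale) as a difference of two iid exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def stability_histogram(data, bin_of, epsilon, delta, rng=None):
    """Noisy proportions for occupied bins only; other bins are implicitly 0.

    `bin_of` maps a data point to a hashable bin label, so the collection of
    bins may be infinite. Noise scale and threshold are assumed constants.
    """
    if rng is None:
        rng = random.Random()
    n = len(data)
    counts = Counter(bin_of(x) for x in data)  # empty bins never appear
    scale = 2.0 / (epsilon * n)
    threshold = 2.0 * math.log(2.0 / delta) / (epsilon * n) + 1.0 / n
    released = {}
    for label, count in counts.items():
        noisy = count / n + laplace_sample(scale, rng)
        if noisy >= threshold:  # small noisy values are suppressed to 0
            released[label] = noisy
    return released
```

Because only occupied bins are ever touched, the running time and the error do not depend on the total number of bins. The thresholding step is what accounts for the $\delta$ in the privacy guarantee.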
A proof of differential privacy of this algorithm can be found in Theorem 3.5 of Vadhan (2016). For utility, we will show that if
then . Note that for any such that , we have,