Protection against disclosure is a legal and ethical obligation for agencies releasing microdata files for public use. Any decision about release requires a careful assessment of the risk of disclosure, which is supported by the estimation of measures of disclosure risk (e.g., Willenborg and de Waal). Consider a microdata sample from a finite population of size , and without loss of generality assume that each sample record contains two disjoint types of information for the -th individual: identifying information and sensitive information. Identifying information consists of the values of a set of categorical variables, which might be matchable to known units of the population. A risk of disclosure arises from the possibility that an intruder might succeed in identifying a microdata unit through such a matching and hence be able to disclose the sensitive information on this unit. To quantify the risk of disclosure, microdata sample records are cross-classified according to potentially identifying variables, i.e., is partitioned in cells with corresponding frequency counts such that , where denotes the frequency of the -th cell out of the sample . A risk of disclosure arises from cells in which both sample frequencies and population frequencies are small. Of special interest are cells with frequency one (singletons or uniques) since, assuming no errors in the matching process or data sources, for these cells the match is guaranteed to be correct. This has motivated inference on measures of disclosure risk that are functionals of the number of singletons, the most common being the number of sample singletons which are also population singletons. See, e.g., Bethlehem et al. and Skinner et al. for a thorough discussion of measures of disclosure risk.
The Poisson abundance model is arguably the most natural, and weak, modeling assumption to infer . If , with , it assumes that: i) the population records can be ideally extended to a sequence , of which is an observable subsample; ii) the ’s are independent and identically distributed according to an unknown distribution , where is the probability of the -th cell in which may be cross-classified; iii) the sample size is a Poisson random variable with mean , in symbols . Then sample records result in cells with frequencies such that for , is independent of for any , and . As discussed in Section 2.4 of Skinner and Elliot, nonparametric estimation of under the Poisson abundance model is an intrinsically difficult problem. It shares the well-known difficulties of the classical problem of estimating the number of unseen species (e.g., Good and Toulmin, Efron and Thisted, Orlitsky et al.). In particular, nonparametric estimators of may be “very unreasonable” since they are subject to serious upward bias and high variance for small sampling fractions of the population, i.e. for . To overcome these issues, in the last three decades stronger modeling assumptions have been considered. These studies resulted in a range of parametric and semiparametric approaches, both frequentist and Bayesian, to infer , e.g., Bethlehem et al. , Samuels , Skinner and Elliot , Reiter , Rinott and Shlomo , Skinner and Shlomo , Manrique-Vallier and Reiter , Manrique-Vallier and Reiter , Carota et al. and Carota et al. .
In this paper, we first study nonparametric estimation of under the Poisson abundance model for sample records. Given a collection of sample records from the population , we introduce a class of nonparametric linear estimators of that are simple, computationally efficient and scalable to massive datasets. We show that our estimators admit an interpretation as (smoothed) nonparametric empirical Bayes estimators in the sense of Robbins, and we prove theoretical guarantees for them that hold uniformly for any distribution . In particular, we show that the proposed estimators provably estimate all of the way up to the sampling fraction , with vanishing normalized mean-square error (NMSE) as becomes large. Then, by relying on recent techniques developed in Wu and Yang in the context of optimal estimation of the support size of discrete distributions, we establish a lower bound for the minimax NMSE for the estimation of . This result allows us to show that is the smallest possible sampling fraction of the population, and that the estimators’ NMSE is near optimal, in the sense of matching the minimax lower bound, for large . This is the main result of our paper, and it provides a precise answer to the question raised by Skinner and Elliot about the feasibility of nonparametric estimation of under the Poisson abundance model and for a sampling fraction . Indeed, our result shows that nonparametric estimation of has uniformly provable guarantees, in terms of vanishing NMSE for large , if and only if .
The paper is structured as follows. In Section 2 we introduce a class of nonparametric estimators for , and we show that they provably estimate all of the way up to the sampling fraction , with vanishing NMSE as becomes large. In Section 3 we show that is the smallest possible sampling fraction of the population, and that the estimators’ NMSE is near optimal for large . Section 4 contains a numerical illustration of the proposed estimators. Proofs are deferred to the Appendix.
2 A nonparametric estimator of
We consider an infinite sequence of observations , and we assume that is the microdata sample of random size under the Poisson abundance model. We suppose that is a subsample of , where , with and independent of . In the present framework may be seen as the unobservable population. When the sample records are cross-classified according to the potentially identifying variables, the sample is partitioned in cells with corresponding frequency counts such that . Hereafter we denote by the number of cells with frequency , and by the number of cells with frequency greater than or equal to , for any index . We are interested in estimating the number of sample uniques which are also population uniques, namely the functional
We recall that the frequency counts ’s are independent, and that they are Poisson distributed with parameter , where is the unknown probability associated to the -th cell, that is for such that . We will denote by the whole sequence of the cell frequency counts, when we are provided with a sample of size .
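The cross-classification described above can be sketched in a few lines of Python. This is a minimal illustration, not code from the paper: the function names and the toy records are hypothetical, and each record is assumed to be already reduced to a tuple of identifying categorical values (its cell). The last function counts cells observed exactly once in the sample and never in the unobserved part of the population, which is how sample uniques that are also population uniques are understood here.

```python
from collections import Counter

def cell_frequencies(keys):
    """Map each observed cell (tuple of identifying values) to its frequency."""
    return Counter(keys)

def frequencies_of_frequencies(cell_counts):
    """Return a Counter mapping i to the number of cells with frequency i."""
    return Counter(cell_counts.values())

def sample_and_population_uniques(sample_keys, unobserved_keys):
    """Count cells seen exactly once in the sample and never in the unobserved part."""
    sample_counts = cell_frequencies(sample_keys)
    unobserved_cells = set(unobserved_keys)
    return sum(
        1
        for cell, count in sample_counts.items()
        if count == 1 and cell not in unobserved_cells
    )

# Hypothetical toy data: (sex, age class, region) are the identifying variables.
sample = [
    ("F", "30-39", "TO"),
    ("M", "30-39", "TO"),
    ("F", "30-39", "TO"),
    ("M", "60-69", "MI"),
]
unobserved = [("M", "60-69", "MI"), ("F", "20-29", "RM")]

counts = cell_frequencies(sample)
z = frequencies_of_frequencies(counts)
tau = sample_and_population_uniques(sample, unobserved)
print(dict(z), tau)
```

In practice the unobserved part of the population is not available, which is why the quantity computed by the last function has to be estimated from the sample frequencies alone.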
To fix the notation, in the sequel we will write , for two generic functions and , iff there exists a universal constant such that ; we will further write whenever both and are satisfied. Let us denote by the set of all possible distributions over , i.e. , where denotes the Dirac measure centered at . An estimator of is understood to be a measurable function depending on the available sample and the actual size of the observed sample . We will evaluate the performance of a generic estimator of , by its worst–case NMSE, defined as
where is the mean squared error (MSE) of , also denoted by . The use of the NMSE (1) has been recently proposed in Orlitsky et al.  in the context of the estimation of the number of unseen species.
A nonparametric estimator for may be deduced by comparing expectations; indeed, it is easy to see that
from which we may define the following estimator
which turns out to be unbiased by construction. See Appendix A.1 for the derivation of (2). The estimator admits a natural interpretation as a nonparametric empirical Bayes estimator in the sense of Robbins. More precisely, is the posterior expectation of with respect to an unknown prior distribution on the ’s that is estimated from the . See Appendix A.2 for details. The next theorem justifies the use of as an estimator of , for , i.e. when the size of the unobserved population is less than or equal to , the size of the observed sample.
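The comparison of expectations can be checked numerically. The sketch below rests on assumptions that are not spelled out in the text as reproduced here: the cell frequencies in the sample are taken to be independent Poisson with means n p_j, those in the unobserved part independent Poisson with means λ n p_j, and Z_k denotes the number of cells with sample frequency k. Under these assumptions the expected number of sample uniques that are also population uniques is the sum over j of n p_j exp(-(1+λ) n p_j), and expanding exp(-λ n p_j) as a power series suggests the linear form with coefficients (-1)^i (i+1) λ^i on Z_{i+1}. The symbols n, λ, p_j, Z_k and all function names are assumptions made for this illustration.

```python
import math

def linear_estimator(z, lam):
    """Sum over i >= 0 of (-1)^i (i+1) lam^i Z_{i+1}; z maps frequency -> number of cells."""
    total = 0.0
    for freq, n_cells in z.items():
        i = freq - 1
        total += (-1) ** i * (i + 1) * lam ** i * n_cells
    return total

def expected_z(probs, n, k):
    """E[Z_k] = sum_j exp(-n p_j) (n p_j)^k / k! when frequencies are Poisson(n p_j)."""
    return sum(math.exp(-n * p) * (n * p) ** k / math.factorial(k) for p in probs)

def expected_tau1(probs, n, lam):
    """Expected number of cells with sample frequency 1 and unobserved frequency 0."""
    return sum(n * p * math.exp(-(1 + lam) * n * p) for p in probs)

# Estimator evaluated on observed frequencies of frequencies (toy values).
estimate = linear_estimator({1: 2, 2: 1}, lam=0.5)

# Unbiasedness check in expectation for a uniform distribution on 50 cells.
probs = [1 / 50] * 50
n, lam = 20, 0.5
expected_estimate = sum(
    (-1) ** (k - 1) * k * lam ** (k - 1) * expected_z(probs, n, k)
    for k in range(1, 60)
)
target = expected_tau1(probs, n, lam)
print(estimate, expected_estimate, target)
```

The check works with exact expectations rather than simulated samples, so it only illustrates unbiasedness; it says nothing about the variance of the estimator.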
Theorem 1. For any positive real numbers and let denote the integer part of and let denote the maximum between and . If , for any , we have
where in (5) we defined such that .
See Appendix A.3 for the proof of Theorem 1. According to Theorem 1, for one has and upon noticing that . That is, in expectation, approximates to within . Hence we formalize our considerations in the following.
Corollary 1. Assume that , then the nonparametric estimator defined in (3) satisfies
for any .
This justifies the use of as an estimator of under the hypothesis , which unfortunately is a quite restrictive assumption within the framework of disclosure risk: indeed, the size of the unobserved sample is usually much larger than the size of the available one. However, the derivation of a variance bound for is a crucial step for our study. Indeed, it reveals that the assumption is necessary to obtain a finite estimate of the variance. This variance issue of is determined by the geometrically increasing magnitude of the coefficients . Indeed, as , the estimator grows superlinearly as for the largest such that , thus eventually far exceeding , which grows at most linearly. This is the main reason why becomes useless for , thus requiring an adjustment via suitable smoothing techniques. Hereafter we follow ideas originally developed by Good and Toulmin, Efron and Thisted and Orlitsky et al. for nonparametric estimators of the number of unseen species. Specifically, we propose a smoothed version of by truncating the series (3) at an independent random location and averaging over the distribution of , i.e.,
For any , as the index in (7) increases, the tail probability compensates for the exponential growth of , thereby stabilizing the variance. In the next theorem we show that for the estimator is biased for , and we provide a bound for the MSE of .
Theorem 2. Suppose that , then is a biased estimator of with
Choosing different smoothing distributions for the random variable yields different estimators for . Following Orlitsky et al., three possible choices for the distribution of are the following: i) a Poisson distribution with parameter ; ii) a Binomial distribution with parameter ; iii) a Binomial distribution with parameter . In particular, it can be shown that the choice of the Binomial distribution with parameter corresponds to the truncation at the point of the Euler transformation of the estimator (3). To choose the parameter of the Poisson distribution and the parameter of the Binomial distribution, one should look for and which minimize the MSE bound (9). Once the values of and are determined explicitly, we are able to obtain the limit of predictability for . That is, for some we are able to specify the maximum value of the sampling fraction for which . This gives a provable (performance) guarantee for the estimation of in terms of the sampling fraction . The next proposition specifies the limit of predictability for the estimator under the choice of a Poisson distribution with parameter for the smoothing distribution .
Proposition 1. Let be a Poisson random variable with parameter . Then
whose upper bound is minimized when
for any . Moreover, if is a Poisson random variable with parameter then
and for any
where is continuous in with and .
See Appendix A.5 for the proof of Proposition 1. Similar results hold true when is assumed to be a Binomial random variable, and their derivation follows along the same lines as the proof of Proposition 1. Hence we state the following result for Binomial smoothing without proof.
Proposition 2. Let be a Binomial random variable with parameter . Then
whose upper bound is minimized when
for any . Moreover, if is a Binomial random variable with parameter then
and for any
where is continuous in with and .
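The smoothing idea of Section 2 can be sketched as follows. Truncating a series at an independent random location L and averaging over L amounts to multiplying the i-th coefficient by the tail probability P(L ≥ i). The code assumes the unsmoothed coefficients are (-1)^i (i+1) λ^i on Z_{i+1}, where Z_k is the number of cells with sample frequency k; this form, the function names, and the numerical smoothing parameters used in the example are assumptions for illustration and are not the optimized values of Propositions 1 and 2.

```python
import math

def poisson_tail(mean, i, extra_terms=200):
    """P(L >= i) for L ~ Poisson(mean), summing the upper tail term by term."""
    term = math.exp(-mean) * mean ** i / math.factorial(i)
    total = 0.0
    for k in range(i, i + extra_terms):
        total += term
        term *= mean / (k + 1)  # ratio of consecutive Poisson probabilities
    return total

def binomial_tail(trials, prob, i):
    """P(L >= i) for L ~ Binomial(trials, prob)."""
    if i > trials:
        return 0.0
    return sum(
        math.comb(trials, k) * prob ** k * (1 - prob) ** (trials - k)
        for k in range(i, trials + 1)
    )

def smoothed_estimator(z, lam, tail):
    """Sum over i >= 0 of (-1)^i (i+1) lam^i P(L >= i) Z_{i+1}.

    z maps a frequency k to the number of cells with that frequency;
    tail(i) must return P(L >= i) for the chosen smoothing distribution.
    """
    total = 0.0
    for freq, n_cells in z.items():
        i = freq - 1
        total += (-1) ** i * (i + 1) * lam ** i * tail(i) * n_cells
    return total

# Toy frequencies of frequencies and a ratio lam > 1, where smoothing matters.
z = {1: 2, 2: 1, 3: 1}
lam = 3.0

raw = smoothed_estimator(z, lam, lambda i: 1.0)  # no smoothing
binom = smoothed_estimator(z, lam, lambda i: binomial_tail(2, 1 / (1 + lam), i))
pois = smoothed_estimator(z, lam, lambda i: poisson_tail(1.0, i))
print(raw, binom, pois)
```

With these toy values the unsmoothed estimate is dominated by the largest coefficient, while the damped versions stay on the scale of the number of sample uniques. The computation is a single pass over the frequencies of frequencies, which is what makes linear estimators of this kind cheap on large files.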
3 Optimality of the proposed estimators
In Section 2 we have defined two different estimators of providing guarantees of their performance as in terms of the NMSE. We have already remarked that the case is the most interesting one for estimating the disclosure risk ; indeed, the fraction of the unobserved sample is usually much larger than . Throughout the section we assume that and we prove that the proposed estimator is essentially optimal. More precisely, we determine a lower bound for the best worst–case NMSE, defined by
where the infimum in the previous definition runs over all possible estimators of . We will then see that the determined lower bound essentially matches with the upper bound (11). In the sequel we refer to as the (normalized) minimax risk.
The theorem we are going to state below provides us with a lower bound for the minimax risk.
Theorem 3. Assume that . Then, there exists a universal constant such that for any sufficiently large we have
From Theorem 3, it is clear that the minimax risk goes to zero if and the rate is provided by the following Corollary.
Corollary 2. Assume that , then there exist universal constants and such that for any sufficiently large
Corollary 2 is an easy consequence of Theorem 3. Indeed, when the two lower bounds in (17)–(18) are constants, whereas if it is easy to observe that the leading term in (17) (as ) is of order as in (18) for some . Corollary 2 provides us with a lower bound for the NMSE of any estimator of the disclosure risk . The lower bound (18) has an important implication: without imposing any parametric assumption on the model, one can estimate with vanishing NMSE all the way up to . It is then impossible to determine an estimator having provable guarantees (in terms of vanishing NMSE) when goes to much faster than , as a function of . By the limit of predictability (12) determined for the estimator , we conclude that the proposed estimator is optimal, because its limit of predictability matches (asymptotically) its maximum possible value .
3.1 Guideline for the proof of Theorem 3
In the present section we provide the main ingredients for the proof of Theorem 3; technical results and related proofs are deferred to the Appendix. In the sequel we will write to make explicit the dependence of the expected value w.r.t. and the parameter of the Poisson random variable .
The starting point for the proof of Theorem 3 is the next Lemma 1, which is an interesting result in its own right and considerably simplifies the proof of Theorem 3. Note that the definition of the minimax risk in (16) allows for estimators depending on the whole sample , while depends only on the frequencies and . Thus, one expects no gain of information in using estimators depending on over estimators depending only on the frequencies . This is made formal in the next lemma, proved in Section B.1. Note that this is convenient since is nicely distributed under the Poisson model.
Lemma 1. The following equality holds
where the infimum in the previous equation is understood to be taken with respect to all measurable maps .
The next step is to use Jensen’s inequality to deduce that
Note that there is no explicit dependency on and anymore in the last display, but only on the random variable which, under , is distributed as an infinite vector of independent Poisson random variables with parameters . Moreover, observe that . For the sake of notational simplicity, in the sequel will stand for the random variable , and we also let
Note that is independent of and is a collection of independent Poisson random variables with intensities . Hence, we get
We now trade for its expectation. Let us introduce . Recall that under the vector is distributed as independent Poisson with parameters . Hence,
Similarly, for any ,
Thus, from (20) and Young’s inequality, we find that
The remainder of the proof mostly follows the reduction scheme used in Wu and Yang [28, 29], which consists of reducing the problem to finding the best polynomial approximation (in uniform norm) to a suitable function.
The first step of the reduction scheme is to trade in (16) for a slightly more convenient set. We let be the unique integer satisfying
and we also let, for some constant to be determined later,
Then, for another constant and for to be determined later, we define
Note that contains measures that are not probability measures, and hence it is not clear a priori that we can lower bound the supremum over by the supremum over . The next proposition shows that this is possible as long as is not too large. Here and in what follows, under , is understood as a vector of independent Poisson random variables with intensities , with not necessarily equal to one, and is extended trivially from to by letting , . The next proposition is proved in Section B.2.
Proposition 3. Assume that as and define . Then as ,
We are now in position to lower bound the risk by the Bayes risk. To do so, we follow the prior construction of Wu and Yang [28, 29]. For some to be determined later, but satisfying for some constant , we let and be two random variables taking values in such that when is large enough,
The existence of such random variables is guaranteed by Theorem C.1 for a universal constant . Then we let , respectively , be an independent vector of i.i.d. copies of , respectively . Denoting by the space of all measures on , we construct the following random variable such that . Then, from Proposition 3 and Hölder’s inequality, we find that is bounded from below by plus
which is in turn lower bounded by
The last display follows because we have neither nor almost surely in , but it is clear that the strong law of large numbers implies they should be concentrated on . Formally, an application of Bernstein’s inequality (see Section B.3 below) leads to the following proposition.
Proposition 4. Assume that as . Then, there exists a constant , depending only on , such that for large enough,
Thus under the conditions of Proposition 4, we get
We now wish to trade and for their expectations in the last equation. Intuitively, this should not be problematic: being sums of i.i.d. random variables, they should concentrate near their expectations for large enough. We make this formal using a Hoeffding argument in the next proposition, proved in Section B.4.
Proposition 5. Let everything be as above. Then,