Feature allocation models generalize classical species sampling models by allowing every observation to belong to more than one species, now called features. In particular, every observation is endowed with a (unknown) finite set of features selected from a (possibly infinite) collection of features. We conveniently represent each observation with a binary sequence, with entries indicating the presence (1) or absence (0) of each feature. Every feature
is associated to an unknown probability, and each observation displays feature with probability . The Bernoulli product model, or binary independence model, is arguably the most popular feature allocation model. It models the –th observation as a sequence
of independent Bernoulli random variables with unknown success probabilities, and that is independent of for any
. Bernoulli product models have found applications in several scientific disciplines. They have been applied extensively in ecology when animals are captured using traps and each observation is an incidence vector collecting the presence or absence of each species in the traps (e.g.,Colwell et al. (2012) and Chao et al. (2014)). Besides ecology, Bernoulli product models found applications in the broad area of biosciences (e.g., Ionita-Laza et al. (2009), Zou et al. (2016)); in the analysis of choice behaviour arising from psychology and marketing (Görür et al. (2006)); in binary matrix factorization for dyadic data (Meeds et al. (2007)); in graphical models (e.g., Wood et al. (2006) and Wood & Griffiths (2007)); in the analysis of similarity judgement matrices (Navarro & Griffiths (2007)); in network data analysis (Miller et al. (2006)).
Let be random samples collected under the Bernoulli product model with unknown feature probabilities , and let be the number of times that feature has been observed in . That is, is a Binomial random variable with parameters for any . In this paper we consider the problem of estimating the conditional expected number, given the sample , of hitherto unseen features that would be observed if one additional sample was collected, i.e.
where denotes the indicator function. For easiness of notation, in the rest of the paper we will not highlight the dependence on and , and simply write instead of . The statistic is typically referred to as the missing mass, namely the sum of the probability masses of unobserved features in the first initial samples.
Interest in estimating the missing mass is motivated by numerous applied problems where the sampling procedure is expensive, in terms of time and/or financial resources, and further draws can be only motivated by the possibility of recording new unobserved features. In genetics, for instance, the ambitious prospect of growing databases to encompass hundreds of thousands of human genomes, makes important to quantify the power of large sequencing projects to discover new genetic variants (Auton et al. (2015)). An accurate estimate of will enable better study design and quantitative evaluation of the potential and limitations of these datasets. Indeed one can fix a threshold such that the sequencing procedure takes place until the estimate of becomes for the first time smaller than the threshold. This introduces a criterion for evaluating the effectiveness of further sampling, providing a roadmap for large-scale sequencing projects. A Bayesian parametric estimator of (1) has been introduced in Ionita-Laza et al. (2009), and it relies on a Beta prior distribution for the unknown probabilities ’s. A limitation of this approach is that it requires parametric forms for the distribution of variant frequencies, which requires some model of demography and selection. For instance, the Beta prior is a reasonable assumption for neutrally evolving variants but may not be appropriate for deleterious mutations. To overcome this drawback, a nonparametric approach to estimate (1) has been proposed in Zou et al. (2016)
. This is a purely algorithmic type approach, whose core is a linear programming based algorithm.
In this work, we introduce a simple, robust and theoretically sound estimator of the missing mass. As in Zou et al. (2016), our approach is nonparametric in the sense that it does not rely on any distributional assumption on the unknown ’s. The proposed estimator turns out to have the same analytic form of the popular Good-Turing estimator of the missing mass under the multinomial model for species sampling (e.g., Good (1953) and Robbins (1968)), with the difference that the two estimators have different ranges (supports). For this reason we refer to the estimator as the Good-Turing estimator for feature allocation models. We show that admits a natural interpretation both as a jackknife (resampling) estimator (Quenouille (1956) and Tukey (1958)) and as a nonparametric empirical Bayes estimator in the sense of Efron and Morris (1973). Theoretical properties of the estimator
are investigated. Specifically, we first provide a lower bound for the minimax risk under a squared loss function and then we show that the mean squared error of the proposed estimator achieves the optimal minimax rate for the estimation of
. This asymptotic analysis provides with an interesting connection between the Good-Turing estimators for species sampling models and for feature allocation models in terms of the limiting Poisson regime of the Binomial distribution. Finally, we derive a non-asymptotic confidence interval forby means of novel Bernstein type concentration inequalities which are of separate interest; the confidence interval is easy to compute and it does not rely on any asymptotic assumption or any parametric constraint on the unknown ’s. We illustrate the proposed methodology by the analysis of various synthetic data and SNP datasets from the ENCODE (http://www.hapmap.org/downloads/encode1.html.en) sequencing genome project.
The paper is structured as follows. In Section 2 we introduce the Good-Turing estimator for the missing mass , and we give provable guarantees for its performance. In Subsection 3 we derive a non-asymptotic confidence interval for , in Subsection 4 we discuss the use of for designing cost-effective feature inventories, and in Section 5 we present a numerical illustration of with synthetic and real data. Some concluding remarks on future works are discussed in Section 6. Proofs are deferred to Appendix A.
2 A Good-Turing estimator for
Let be random samples under the Bernoulli product model with unknown feature probabilities such that . That is, is a sequence of independent Bernoulli random variables, with being the success probability of for any , and is independent of for any . The assumption implies that each displays finitely many features almost surely; indeed, by monotone convergence, we have if and only if , which in turns implies that almost surely. If denotes the number of times that feature has been observed in the sample , then
is the number of features with frequency in . We denote by the total number of features in the random sample , i.e. . For any two sequences and , write to mean that as . An intuitive estimator of can be deduced from a comparison between expectations of and . Specifically, since is a Binomial random variable with parameter , we can write
as an estimator of , for any . The estimator is nonparametric, in the sense that it does not rely on any distributional assumptions on the feature probabilities .
The estimator turns out to have the same analytic form of the popular Good-Turing estimator of the missing mass for species sampling models. The Good-Turing estimator first appeared in Good (1953) as a nonparametric empirical Bayes estimator under the classical multinomial model for species sampling, i.e., are random samples from a population of individuals belonging to a (possibly infinite) collection of species with unknown proportions such that . Under the multinomial model every observation is endowed with one species selected from , and hence . Therefore, although the estimator has the same analytic form of the Good-Turing estimator, the two estimators have different ranges (supports): while the Good-Turing estimator for species sampling models takes values in , the Good-Turing estimator for feature allocation models takes positive (finite) values. Hereafter we show that: i) admits interpretations both as a jackknife (resampling) estimator and as a nonparametric empirical Bayes estimator in the sense of Efron and Morris (1973); ii) the mean squared error of converges to zero at the best possible rate for the estimation of ; iii) is linked to the Good-Turing estimator for species sampling models in terms of the limiting Poisson regime of the Binomial distribution.
2.1 Interpretations of
The estimator admits a natural interpretation as a jackknife estimator in the sense of Quenouille (1956) and Tukey (1958). See also the monograph by Efron (1987) and references therein for a comprehensive account on jackknife (resampling) estimators. Let be the number of distinct features observed in the sample after the removal of the -th sample . It is easy to show that the missing mass equals the posterior expected value of , given all the samples but , i.e., . Accordingly, the difference
provides with an unbiased estimator of. The corresponding jackknife estimator is then which equals ; indeed, for any fixed , coincides with the number of features displayed only by the -th sample .
The estimator also has an interpretation as a nonparametric empirical Bayes estimator in the sense of Efron and Morris (1973). A Bayesian nonparametric approach to estimate relies on the specification of a prior distribution for the unknown feature probabilities . We consider the three parameters Beta process prior (Teh and Görür (2009) and James (2017)). This is a generalization of the Beta process prior of Hjort (1990), and it is defined as the distribution of a completely random measure (Daley and Vere-Jones (2008)) with intensity measure where , for , and , with being the Gamma function. Under the three parameters Beta process prior, it is shown that an estimator of with respect to a squared loss function is
is a consistent estimator of for any and , and plugging the estimator into (2) we obtain
In other terms, the proposed estimator coincides with the nonparametric empirical Bayes estimator of the missing mass. is obtained by assigning the three parameters Beta process prior to and then estimating its mass parameter with an appropriately chosen consistent estimator.
2.2 Optimality of
We start by showing that the mean squared error (-risk) of the estimator goes to zero at the best possible rate. This legitimates the use of as an estimator of . For any let be a class of features probabilities. Also, let denote a whichever estimator of the missing mass based on , that is a map on the -fold product space of observations and taking values in . For a specific collection of feature probabilities in the class , the -risk of the estimator is defined as follows
and the minimax risk is
i.e. the infimum, with respect to the set of possible estimators of , of the worst-case risk over the class . The next theorem provides with an upper bound for and a lower bound for .
Let be feature probabilities in the class . Then,
See Appendix A.2 for the proof of Theorem 2.1. For any two sequences and , write to mean that there exists a constant such that for all and to mean that we have both and . Part i) of Theorem 2.1 shows that , namely as goes to infinity the mean squared error of goes to zero at rate . Part ii) of Theorem 2.1 provides with a lower bound for the minimax risk, showing that . In other terms the mean squared error of goes to zero at the best possible rate. Theorem 2.1 then shows that the estimator is essentially rate optimal due to the matching minimax lower bound in the class of admissible probabilities’ masses up to a constant factor.
In Theorem 2.1 we have considered the class of feature probabilities having total mass bounded by . Here we briefly remark the crucial role played by this assumption in order to be able to provide asymptotic results. First, let us notice from Theorem 2.1 that, for a fixed value of the sample size , the minimax rate increases linearly in . This implies that, for a fixed , the estimation problem can be made as hard as desired if no bounds are imposed on the sum of the possible vectors of probabilities . Since at first glance this result can seem slightly counter intuitive, let us consider an illustrative example which should clarify why the estimation difficulty of the problem is proportional to . To simplify the argument, let us suppose that is a strictly positive integer and consider frequency vectors , i.e. such that for all . Now, let us construct another vector of frequencies obtained by setting for and . The resulting vector is the concatenation of , such that in the first entries of we put the first elements , followed by the second ones , the third ones and so on. Furthermore, any observation sampled from the Bernoulli product model with frequencies given by this can be rewritten, following a similar construction, as the concatenation of observations , , each of them sampled from a binomial product model with the corresponding frequencies . Hence, the missing mass for a sample of observations sampled from can be related to the missing masses of each of its subcomponents, by
where . From this construction, we can see that by trying to estimate the missing mass on the left hand side, we are basically trying to estimate a sum of unrelated quantities, which explains why the error is linear in .
A final remark is that in order for practitioners to use Theorem 2.1 and evaluate the performances of the Good-Turing estimator compared to the minimax rate, we would need to know an upper bound , i.e. an upper bound on the total mass of the unknown vector of probabilities . In real life applications, is unlikely to be known. However, we can easily estimate it. Specifically, since the total mass is the expected number of features displayed per observation, we can use as a consistent estimator for the estimator .
2.3 Connection to the Good Turing estimator for species sampling models
We relate the Good-Turing estimator for feature allocation models with the classical Good-Turing estimator for species sampling, by relying on the limiting Poisson approximation of Binomial random variables. In particular, Theorem 2.1 states that, in the feature allocation models, the Good-Turing estimator achieves a risk of order , while it is known from Rajaraman et al. (2017) that the risk of the Good Turing estimator in the species sampling case asymptotically behaves as . In order to compare the two models, we will consider the limiting scenario when . Let us consider a vector of feature frequencies , such that for some positive value and denote the normalized probability vector. Applying the large Poisson approximation of the binomial distribution, it follows that each
is now approximately distributed according to a Poisson distribution with mean. Therefore, the approximated model for large boils down to sample first an effective size , distributed according to a Poisson with parameter , and, conditionally on it, sample independent and identically distributed observations from probability vector . Hence it is equivalent to a species sampling model where the sample size is random. Denote the missing mass in the corresponding species sampling model. Now, by noticing that , it follows that . Therefore, we expect the quantity to have the same asymptotic behaviour of . Indeed, from Theorem 2.1 and , we get that .
We conclude by remarking that this construction also provides a justification for the Good-Turing in the feature allocation models. Indeed, in feature allocation models, we want to estimate , where both and are unknown and need to be estimated. In order to estimate , we can use the Good-Turing estimator for species models, which here turns to be
However, we also need to estimate in order to use to estimate . As pointed out at the end of the previous subsection, a consistent estimator of is . Using this estimator for and for , we obtain the following estimator of the missing mass ,
which turns out to be exactly the Good-Turing estimator for feature allocation models.
3 A confidence interval for
In this section, we consider the problem of uncertainty quantification of the proposed estimator of the missing mass. We exploit tools from concentration inequalities for sum of independent random variables (Boucheron et al. (2013)) to introduce a non-asymptotic level- confidence interval for , for any .
Let be any sequence of feature probabilities s.t. , and set for any nonnegative integer and . Then, with probability at least
See Appendix A.3 for the proof of Theorem 3.1. For any , Theorem 3.1 provides with a level confidence interval for that holds for every sample size and for every possible values of the feature probabilities . Indeed, this is derived by applying finite sample concentration inequalities, without using any asymptotic approximation, and it does not rely on any distributional assumption on the feature probabilities . Note that the proposed confidence interval can be easily evaluated without resorting to any simulation based strategy. It is enough to count the total number of features and number of features with frequency one and two in the observable sample, i.e. , and , and plug these quantities into and to be added and subtracted to .
4 A Stopping Rule for the Discovery Process
As we recalled in the Introduction, interest in estimating missing mass is motivated by the design of feature inventories that are cost-effective in terms of the number of features discovered and the amount of resources allocated in the discovery process. This is a fundamental issue in numerous scientific disciplines where the sampling procedure is expensive, in terms of time and/or financial resources. Feature inventories must then be designed in such a way to redirect the search of new features to more productive sites, methods or time periods whenever the sampling effort becomes unprofitable. In such a context, the estimator is the key ingredient for defining an adaptive sequential approach in terms a stopping rule for the discovery process. Specifically, let be an utility function, defined on the integers, such that is the gain of observing features; assume that in non-decreasing and concave. If is the cost associated with each sampling step, and is the number of features in the sample , then the stopping rule may be defined as
This brief discussion highlights how to exploit the estimator within the context of designing cost-effective feature inventories. In particular, it gives rises to the challenging problem of finding the time at which it is optimal to conclude the discovery process, i.e. the time that maximizes the expected payoff .
5 Numerical Illustration
We illustrate the experimental performance of the estimator by the analysis of various synthetic data, and by the analysis SNP datasets from a genome project. Let be the total number of possible features in the whole population of individuals. With regards to synthetic data, we present a numerical illustration by setting for and for , with being a nonnegative parameter. Note that the feature probability masses ’s correspond to the unnormalized masses of the ubiquitous Zipf distribution. The parameter controls how the total mass is spread among the features: the lager , the smaller is the number of features with high probability. In other terms the larger , the larger is the number of features with small frequencies (rare features). Hereafter we set , and we consider the following values for the discount parameter : 0.6, 0.8, 1.0, 1.2, 1.4, 1.6. For each of these values of the parameter , we take samples of size from the population . Table 1 displays the true value of the missing mass , the estimated value and the corresponding confidence interval (CI). All the experiments are averaged over the 100 samples. Results in Table 1 show that provides good estimates of true value of in all the scenarios that we considered. In addition, confidence intervals are quite narrow around the estimator and contain always the true value of the missing mass.
|0.6||183.81||184.66||(174.35, 198.12)||105.38||105.61||(101.74, 110.53)|
|0.8||33.79||33.67||(29.45, 39.51)||26.40||26.05||(24.40, 28.30)|
|1.0||7.02||7.02||(4.88, 9.91)||5.41||5.43||(4.66, 6.49)|
|1.2||1.92||1.87||(0.54, 3.60)||1.35||1.35||(0.92, 1.93)|
|1.4||0.71||0.72||(0, 1.98)||0.44||0.44||(0.15, 0.81)|
|1.6||0.34||0.36||(0, 1.39)||0.18||0.18||(0, 0.46)|
We conclude with an application to SNP datasets from the ENCODE sequencing genome project (http://www.hapmap. org/downloads/encode1.html.en). The same project was analyzed in Ionita-Laza et al. (2009). Ten 500-kb regions of the genome were sequenced in 209 unrelated DNA samples: 60 Yoruba (YRI), 60 CEPH European (CEPH), 45 Han Chinese (CHB), and 44 Japanese (JPT). These regions were chosen to be representative of the genome in general, including various chromosomes, recombination rates, gene density, and values of nontranscribed conservation with mouse. To make results comparable across the 4 populations (YRI, CEPH, CHB, and JPT), we consider only of the sequenced individuals for each dataset. Table 2 displays the estimated values and confidence interval (CI). For samples of 20 individuals, the YRI population displays the highest estimate of the missing mass. This agrees with results in Ionita-Laza et al. (2009), showing that the African population is the most diverse. We also consider increasing the sample size to of the sequenced individuals for each dataset. Table 2 shows how the missing mass decreases with respect to the case . This highlights the saturation effect in discovering new variants. The discovery process is very efficient in the beginning, but after many individuals are sequenced, each additional individual contributes less and less to the pool of the newly discovered variants.
|CEPH||55.6||(38.7, 88.9)||22.3||(15.6, 38.9)|
|CHB||50.0||(37.3, 81.3)||17.6||(12.6, 32.8)|
|JPT||61.7||(46.9, 93.4)||26.7||(20.9, 42.2)|
|YRI||88.3||(65.2, 125.2)||26.6||(18.9, 44.7)|
6 Concluding Remarks
We introduced a nonparametric estimator of the missing mass for feature allocation, we obtained provable guarantees for its performance, and we derived corresponding non-asymptotic confidence intervals. In particular, we proved that the mean squared error of goes to zero at the best possible rate as the sample size goes to infinity. Our approach is simple, intuitive and easily implementable from practitioners. It distinguishes from the approaches of of Ionita-Laza et al. (2009) and Zou et al. (2016) for being the first nonparametric statistical approach for estimating the missing mass in feature allocation models. In particular, differently from Ionita-Laza et al. (2009) and Zou et al. (2016), we associated to the proposed estimator an exact (non-asymptotic) quantification of its uncertainty. The present paper paves the way to new directions in feature allocation problems. In particular, three promising directions are: i) estimating the conditional expected number, given an observable sample , of features with frequency in that would be observed in one additional sample; ii) solving the optimal stopping problem discussed in Section 4; iii) estimating the conditional expected number, given an observable sample , of new features that would be observed in additional (unobservable) samples. Work on this is ongoing.
Appendix A Appendix - Proofs
a.1 Nonparametric empirical Bayes
In the present section we derive as a nonparametric empirical Bayes estimator in the sense of Efron and Morris (1973). Recall that the label of feature is denoted by , in the sequel we further assume that .
Remind also that we have associated the -th individual with a sequence of independent Bernoulli random variables with parameter , where (resp. ) indicates the presence (resp. absence) of feature in the –th individual. The Bayesian nonparametric approach to estimate the missing mass requires a prior specification for the ’s: we resort to the three-parameter Beta process introduced by Teh and Görür (2009). Such a prior distribution is defined as the distribution of a Completely Random Measure (CRM) (see e.g. Daley and Vere-Jones (2008)), with
Lévy intensity , where , being and .
Let be a random sample of size and be the distinct features observed in it, the posterior estimate of , under a square loss function, equals
where the ’s are jumps having a density on the positive real line and is a CRM having updated Lévy intensity given by . Hence the Bayesian estimator in (6) boils down to
where the second equality follows from the fact that the base measure of is diffuse and the last equality follows from the availability of the Laplace functional and from standard calculations. We are now going to show that is a consistent estimator of , then we will conclude that the empirical Bayes estimator coincides with . The consistency of can be established resorting to the regular variation theory by Karlin (1967); Gnedin et al. (2007). More specifically we are able to show that the tail integral of is a regularly varying function, having regular variation index , indeed
Hence we just proved that as , and therefore an application of (Gnedin et al., 2007, Corollary 21) gives the asymptotic relation . Moreover, from Stirling’s approximation, . Therefore, and is a consistent estimator for . Plugging into the posterior estimate , we obtain . ∎
a.2 Proof of Theorem 2.1
We prove part i) (upper bound) and ii) (lower bound) of Theorem 2.1 separately.
We first focus on the upper bound for the -risk of . In the sequel we denote by the total mass of features with frequency 1 in a sample of size , formally we observe that the expectation of equals
and obviously , for any sequence belonging to the class .
In order to bound the -risk of the estimator from above for any , we remind that
may be seen as the sum of the squared bias and the variance. The upper bound for the bias can be easily proved as follows
As for the variance bound, we note that
where we have exploited the independence of the random variables . We now observe that the events and are incompatible, hence we get
Putting together the bound on the bias and the variance we immediately obtain (4).
We now focus on the proof of the lower bound of the minimax risk , which has been defined in (3). Let us first notice that, since belongs to , then almost surely, and hence