1 Introduction
Feature models generalize species sampling models by allowing every observation to belong to more than one species, now called features. In particular, every observation is endowed with a finite set of features selected from a (possibly infinite) collection of features $(F_j)_{j\geq 1}$. Every feature $F_j$ is associated with an unknown probability $p_j$, and each observation displays feature $F_j$ with probability $p_j$. We may conveniently represent each observation as a binary sequence, whose entries indicate the presence (1) or absence (0) of each feature. Feature models were first applied in ecology for modeling incidence vectors recording the presence or absence of species in traps (Colwell et al. (2012) and Chao et al. (2014)), and more recently in several fields of the biosciences, such as the study of genetic variation and protein interactions (Chu et al. (2006), Ionita-Laza et al. (2009), Ionita-Laza et al. (2010) and Zou et al. (2016)). They have also found applications in the analysis of choice behaviour arising from psychology, marketing and computer science (Görür et al. (2006)); in the context of binary matrix factorization for modeling dyadic data to design recommender systems (Meeds et al. (2007)); in graphical models (Wood et al. (2006) and Wood & Griffiths (2007)); in cognitive psychology for the analysis of similarity judgement matrices (Navarro & Griffiths (2007)); in the context of independent component analysis and sparse factor analysis (Knowles & Ghahramani (2007)); and in link prediction using network data (Miller et al. (2010)).
The Bernoulli product model is arguably the most popular feature model. It assumes that the $i$-th observation is a sequence $Y_i = (Y_{i,j})_{j\geq 1}$ of independent Bernoulli random variables with unknown success probabilities $(p_j)_{j\geq 1}$, and that $Y_r$ is independent of $Y_s$ for any $r \neq s$. Therefore $X_{n,j} = \sum_{i=1}^{n} Y_{i,j}$, namely the number of times that feature $F_j$ has been observed in a sample $(Y_1,\ldots,Y_n)$, is a Binomial random variable with parameter $(n, p_j)$ for any $j \geq 1$. Recently, the Bernoulli product model has been extensively applied to the fundamental problem of discovering genetic variation in human populations. See, e.g., Ionita-Laza et al. (2009), Zou et al. (2016) and references therein. In such a context, interest is in estimating the conditional expected number, given a sample $(Y_1,\ldots,Y_n)$, of hitherto unseen features that would be observed if an additional sample $Y_{n+1}$ was collected, namely
$$M_n(Y_1,\ldots,Y_n; (p_j)_{j\geq 1}) = \mathbb{E}\Big[\sum_{j\geq 1} \mathbb{1}\{X_{n,j}=0,\, Y_{n+1,j}=1\} \,\Big|\, Y_1,\ldots,Y_n\Big] = \sum_{j\geq 1} p_j\, \mathbb{1}\{X_{n,j}=0\}, \qquad (1)$$
where $\mathbb{1}$ is the indicator function. The statistic $M_n(Y_1,\ldots,Y_n; (p_j)_{j\geq 1})$ is referred to as the missing mass, i.e. the sum of the probability masses of unobserved features in a sample of size $n$. In genetics, interest in estimating (1) is motivated by the ambitious prospect of growing databases to encompass hundreds of thousands of genomes, which makes it important to quantify the power of large sequencing projects to discover new genetic variants (Auton et al. (2015)). An accurate estimate of the missing mass provides a quantitative evaluation of the potential and limitations of these datasets, providing a roadmap for large-scale sequencing projects.
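To make the Bernoulli product model and the missing mass concrete, the following is a minimal simulation sketch. It assumes a finite truncation of the feature collection and a hypothetical summable choice of probabilities $p_j = 1/(j+1)^2$; the function names and parameter values are illustrative only and are not taken from the paper.

```python
import random

def sample_feature_counts(p, n, rng):
    """Draw n observations from a (truncated) Bernoulli product model and
    return X_{n,j}, the number of observations displaying feature j."""
    return [sum(rng.random() < pj for _ in range(n)) for pj in p]

def missing_mass(p, counts):
    """Missing mass: sum of p_j over features never observed (X_{n,j} = 0)."""
    return sum(pj for pj, x in zip(p, counts) if x == 0)

rng = random.Random(0)
# Hypothetical probabilities: summable and strictly below one (assumption).
p = [1.0 / (j + 1) ** 2 for j in range(1, 501)]
counts = sample_feature_counts(p, n=50, rng=rng)
m_n = missing_mass(p, counts)
```

In practice the probabilities are unknown, so `missing_mass` can only be evaluated in simulations like this one; the statistical problem is to estimate it from `counts` alone.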
Let $\hat{T}_n(Y_1,\ldots,Y_n)$ denote an arbitrary estimator of $M_n(Y_1,\ldots,Y_n; (p_j)_{j\geq 1})$. For ease of notation, in the rest of the paper we will not highlight the dependence on $(Y_1,\ldots,Y_n)$ and $(p_j)_{j\geq 1}$, and we simply write $\hat{T}_n$ and $M_n$. Motivated by the recent works of Ohannessian & Dahleh (2012), Mossel & Ohannessian (2015), Ben-Hamou et al. (2017) and Ayed et al. (2018) on the estimation of the missing mass in species sampling models, in this paper we consider the problem of consistent estimation of $M_n$ under the Bernoulli product model. The classical notion of additive consistency, involving the large $n$ limiting behaviour of $\hat{T}_n - M_n$, is not suitable in the context of the estimation of $M_n$. This is because $M_n \to 0$, as $n \to \infty$, which implies that any sequence of estimators converging to zero is trivially consistent for the missing mass in the additive sense. Hence, in such a framework, one should invoke a more adequate notion of consistency, which yields more informative results. This notion of consistency is based on the limiting behaviour of the multiplicative loss function
$$L(\hat{T}_n, M_n) = \left| \frac{\hat{T}_n}{M_n} - 1 \right|. \qquad (2)$$
More precisely, we say that the estimator $\hat{T}_n$ is multiplicative consistent for $M_n$ if $L(\hat{T}_n, M_n) \to 0$ as $n \to \infty$, either almost surely or in probability. The multiplicative loss function has already been used in statistics, e.g. for the estimation of small value probabilities using importance sampling (Chatterjee & Diaconis (2018)) and for the estimation of tail probabilities in extreme value theory (Beirlant & Devroye (1999)).
We show that there do not exist universally consistent estimators, in the multiplicative sense, of the missing mass $M_n$. That is, under the Bernoulli product model and the loss function (2), we prove that for any estimator $\hat{T}_n$ of $M_n$ there exists at least one choice of $(p_j)_{j\geq 1}$ for which $L(\hat{T}_n, M_n)$ does not converge to $0$ in probability, as $n \to \infty$. The proof relies on nontrivial extensions of Bayesian nonparametric ideas and techniques developed by Ayed et al. (2018) for the estimation of the missing mass in species sampling models. In particular, the key argument makes use of a generalized Indian Buffet construction (James (2017)), which allows us to prove inconsistency by exploiting properties of the posterior distribution of $M_n$. Our inconsistency result is the natural counterpart for feature models of the work of Mossel & Ohannessian (2015), showing the impossibility of estimating the missing mass without imposing any structural (distributional) assumption on the $p_j$'s.
We complete our study by investigating the consistency of an estimator of $M_n$ recently proposed by Ayed et al. (2017). To the best of our knowledge this is the first nonparametric estimator of $M_n$, in the sense that its derivation does not rely on any distributional assumption on the $p_j$'s. We show that the estimator of Ayed et al. (2017) is strongly consistent, in the multiplicative sense, under the assumption that the tail of $(p_j)_{j\geq 1}$ decays to zero as a regularly varying function (Bingham et al. (1987)). The proof relies on novel concentration inequalities for $M_n$, as well as for related statistics, which are of independent interest.
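The contrast between additive and multiplicative consistency can be illustrated with a short sketch. It assumes hypothetical, hand-picked values of a shrinking missing mass and uses the trivial estimator that always returns zero; the function names are illustrative and not part of the paper.

```python
def additive_loss(t_hat, m):
    """Additive error |T - M|."""
    return abs(t_hat - m)

def multiplicative_loss(t_hat, m):
    """Multiplicative loss |T / M - 1|; requires a strictly positive M."""
    if m <= 0:
        raise ValueError("the missing mass must be strictly positive")
    return abs(t_hat / m - 1.0)

# Hypothetical missing-mass values shrinking to zero (assumption).
masses = [1e-2, 1e-4, 1e-6]
trivial_estimate = 0.0  # an estimator that ignores the data entirely

additive = [additive_loss(trivial_estimate, m) for m in masses]
multiplicative = [multiplicative_loss(trivial_estimate, m) for m in masses]
```

Here the additive errors vanish along the sequence while the multiplicative loss stays equal to one, which is why the additive criterion cannot distinguish a useful estimator from a useless one.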
The paper is structured as follows. In Section 2 we prove that for the Bernoulli product model there do not exist universally consistent estimators, in the multiplicative sense, of the missing mass $M_n$. Section 3 introduces some exponential tail bounds for $M_n$, as well as for related statistics, which are then applied in Section 4 to show that the estimator of $M_n$ in Ayed et al. (2017) is consistent under the assumption of regularly varying probabilities $p_j$'s.
2 Non existence of universally consistent estimators of the missing mass
Consider the Bernoulli product model described in the Introduction. Without loss of generality, we assume that each feature is labeled by a value in a fixed label space, so that $(F_j)_{j\geq 1}$ is a sequence of distinct points in that space. Furthermore, the probabilities $(p_j)_{j\geq 1}$ are assumed to be summable, i.e. $\sum_{j\geq 1} p_j < \infty$; this condition is needed in order to guarantee that every observation will display only a finite number of features almost surely. Indeed, $\sum_{j\geq 1} p_j < \infty$ is equivalent to $\mathbb{E}\big[\sum_{j\geq 1} Y_{i,j}\big] < \infty$, which in turn implies $\sum_{j\geq 1} Y_{i,j} < \infty$ almost surely, by the Tonelli-Fubini Theorem. The two unknown sequences $(F_j)_{j\geq 1}$ and $(p_j)_{j\geq 1}$ can be uniquely encoded in a finite measure on the label space, $\sum_{j\geq 1} p_j \delta_{F_j}$, with all masses smaller than one. We can therefore consider as parameter space the set
(3) 
Recall that $X_{n,j}$ denotes the number of times that feature $F_j$ has been observed in the sample $(Y_1,\ldots,Y_n)$, that is, $X_{n,j}$ is a Binomial random variable with parameter $(n, p_j)$. For a fixed $n$, an estimator $\hat{T}_n$ of the missing mass is a measurable map whose argument is the observed sample $(Y_1,\ldots,Y_n)$. We say that the estimator $\hat{T}_n$ is multiplicative consistent under the parameter space (3) if, for every element of the parameter space and every $\varepsilon > 0$,
(4) 
where denotes the law of the observations under a feature allocation model of parameter . Theorem 2.1 shows that there are no universally multiplicative consistent estimators of $M_n$ for the parameter space (3). This means that for any estimator $\hat{T}_n$ of the missing mass, there exists at least one element of the parameter space for which $L(\hat{T}_n, M_n)$ does not converge to $0$ in probability, as $n \to \infty$.
Theorem 2.1
Under the feature allocation model, there are no universally consistent estimators, i.e. there are no estimators satisfying (4). In particular, for every estimator , it is possible to find an element such that for any
(5) 
for some strictly positive constant .
2.1 Proof of Theorem 2.1
In order to prove Theorem 2.1, it is enough to show that for every estimator and every ,
(6) 
and therefore there exists a for which is not consistent.
First, let us notice that, for every ,
(7) 
Indeed, if , then
(8) 
and, from the lower bound of (8), . Because , it follows that . This last inequality together with (8) leads to . Considering the complements of the two events, it follows that
and, as a consequence, , proving (7). From now on, we will denote and prove that
(9) 
for some strictly positive constant .
The main idea of the proof is captured by formula (10) below and works as follows: we lower bound the supremum over the parameter in (7) by an average with respect to a (carefully chosen) prior for the parameter; we swap the conditional distribution of the sample given the parameter and the marginal of the parameter with the conditional distribution of the parameter given the sample and the marginal of the sample; we lower bound the event probability with respect to the posterior of the parameter given the sample. Formally,
(10) 
where we have applied reverse Fatou’s lemma to take the outside the expectation. In (10), denotes the expectation with respect to the prior for , the expectation with respect to the marginal distribution of and the probability under the posterior of given .
Our choice of the nonparametric prior for is based on completely random measures (see Daley and Vere-Jones (2008)) and the generalized Indian Buffet process prior of James (2017). In particular, a prior for can be defined through a completely random measure on , where is a Poisson Point Process on , by setting . We select to be a completely random measure with Lévy intensity . The distribution of is completely characterized by its Laplace functional, defined as follows,
for any measurable function . See also Kingman (1993).
Theorem 3.1 of James (2017) provides a distributional equality for the posterior of given . Denoting by the distinct features observed in , we have the following distributional equality
(11) 
where the ’s are nonnegative random jumps and is an independent completely random measure with updated Lévy intensity .
Defining , from (11) we have that, for any Borel set in , the missing mass satisfies
(12) 
showing that the posterior distribution of the missing mass is equal in distribution to the random variable . It is also worth introducing the random variable
whose distribution can be computed exactly and turns out to be a Gamma random variable of parameters . Indeed, from the Laplace functional, for every we have
which is the characteristic function of a Gamma random variable.
We now have all the necessary ingredients to prove the lower bound (10). Fix . First note that the reverse triangle inequality entails
(13) 
which implies
(14) 
indeed, thanks to (13), the two events together
imply that
where the last inequality follows from the fact that . Hence, from (14), we have that
which may be plugged into (10) to obtain
(15) 
We are going to lower bound separately the two terms on the r.h.s. of (15). With regard to the first term, let us observe that the elementary inequality , for , implies that for all
Summing over ,
and therefore,
As a simple consequence of the last inequality, for any , the event implies the validity of and therefore we can upper bound the first term in (15) as follows
(16) 
where we have used the fact that the posterior distribution of is .
Let us now consider the second term on the r.h.s. of (15). Using again the fact that
is Gamma distributed and
, we have

It is now easy to see that the function is strictly positive, continuous and admits a global minimum on at the point , therefore
(17) 
Using the two bounds (16) and (17) in (15), for any we get
which completes the proof.
3 Concentration inequalities for feature models
In this section we will establish exponential tail bounds for the missing mass $M_n$ and the statistic $K_{n,r}$ defined by
$$K_{n,r} = \sum_{j\geq 1} \mathbb{1}\{X_{n,j} = r\},$$
which counts the number of features observed with frequency $r$ in the sample $(Y_1,\ldots,Y_n)$. The statistic $K_{n,r}$ is of interest in different applications of feature allocation models and its analysis will be important for the study of the estimator of the missing mass considered in Section 4, which involves $K_{n,1}$.
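As an illustration of the frequency-count statistic just defined, the sketch below computes it from a small hand-made binary sample and evaluates its expectation using the Binomial law of each per-feature count, $\mathbb{E}[K_{n,r}] = \sum_j \binom{n}{r} p_j^r (1-p_j)^{n-r}$. The function names and the toy sample are hypothetical.

```python
from math import comb

def k_nr(counts, r):
    """Number of features observed exactly r times in the sample."""
    return sum(1 for x in counts if x == r)

def expected_k_nr(p, n, r):
    """Expected number of features with frequency r, using
    X_{n,j} ~ Binomial(n, p_j) independently across features."""
    return sum(comb(n, r) * pj**r * (1.0 - pj) ** (n - r) for pj in p)

# Toy sample (assumption): 3 observations (rows) over 4 features (columns).
sample = [
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
]
# Per-feature counts X_{n,j}: column sums of the binary matrix -> [3, 1, 1, 0]
counts = [sum(col) for col in zip(*sample)]
singletons = k_nr(counts, 1)  # features seen exactly once
```

With this toy sample two features are seen exactly once, one is never seen and one is seen in every observation.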
The tail bounds we present in this section are valid in full generality, i.e. without any assumptions on the probability masses $(p_j)_{j\geq 1}$. In Section 4, we will use these results to prove consistency under the assumption of regularly varying heavy tails for $(p_j)_{j\geq 1}$.
In order to derive these concentration inequalities we will use Chernoff bounds, which require suitable bounds on the log-Laplace transform. First, let us recall some definitions from Boucheron et al. (2013) and Ben-Hamou et al. (2017).
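Definition 3.1
Let $X$ be a real-valued random variable defined on some probability space. Then:

$X$ is sub-Gaussian on the right tail (resp. on the left tail) with variance factor $v$ if for any $\lambda \geq 0$ (resp. $\lambda \leq 0$)
$$\log \mathbb{E}\, e^{\lambda (X - \mathbb{E}X)} \leq \frac{v \lambda^2}{2}; \qquad (18)$$

$X$ is sub-Gamma on the right tail with variance factor $v$ and scale parameter $c$ if
$$\log \mathbb{E}\, e^{\lambda (X - \mathbb{E}X)} \leq \frac{\lambda^2 v}{2(1 - c\lambda)} \quad \text{for any } 0 < \lambda < 1/c; \qquad (19)$$

$X$ is sub-Gamma on the left tail with variance factor $v$ and scale parameter $c$ if $-X$ is sub-Gamma on the right tail with variance factor $v$ and scale parameter $c$;

$X$ is sub-Poisson with variance factor $v$ if for all $\lambda \in \mathbb{R}$
$$\log \mathbb{E}\, e^{\lambda (X - \mathbb{E}X)} \leq \phi(\lambda)\, v, \qquad (20)$$
where $\phi(\lambda) = e^{\lambda} - 1 - \lambda$.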
Note that a sub-Gaussian random variable is also sub-Gamma for any choice of the scale parameter $c$, but in general the converse is not true. As we will see in the sequel, the bounds on the log-Laplace transform (18)–(19) imply exponential tail bounds by means of the Chernoff inequality. See Boucheron et al. (2013) for the details.
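As a rough illustration of how log-Laplace bounds of this kind turn into tail bounds through the Chernoff inequality, the sketch below evaluates the standard textbook right-tail bounds for a deviation $t > 0$ above the mean: $\exp(-t^2/(2v))$ in the sub-Gaussian case, $\exp(-t^2/(2(v+ct)))$ in the sub-Gamma case, and $\exp(-v\,h(t/v))$ with $h(u) = (1+u)\log(1+u) - u$ in the sub-Poisson case. The function names are hypothetical, and these are generic bounds rather than the specific constants obtained in this section.

```python
import math

def sub_gaussian_tail(t, v):
    """Chernoff bound on P(X - E[X] >= t) for variance factor v."""
    return math.exp(-t * t / (2.0 * v))

def sub_gamma_tail(t, v, c):
    """Bernstein-type bound for variance factor v and scale parameter c."""
    return math.exp(-t * t / (2.0 * (v + c * t)))

def sub_poisson_tail(t, v):
    """Bennett-type bound, with h(u) = (1 + u) log(1 + u) - u."""
    u = t / v
    return math.exp(-v * ((1.0 + u) * math.log1p(u) - u))
```

Setting `c = 0.0` in `sub_gamma_tail` recovers the sub-Gaussian bound, which mirrors the remark that sub-Gaussian variables are sub-Gamma for any scale parameter.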
The following proposition shows that the missing mass $M_n$ is sub-Gaussian on the left tail and sub-Gamma on the right one.
Proposition 3.1
Let . On the left tail, the random variable $M_n$ is sub-Gaussian with variance factor , i.e. for any it holds
(21) 
On the right tail, the random variable $M_n$ is sub-Gamma with variance factor and scale parameter , i.e. for any one has
(22) 
Proof.
We first focus on the proof of (21). Let , exploiting the independence of the random variables ’s and the elementary inequality , valid for any , we obtain
We observe that, being , one has:
hence (21) has been proven.
We now concentrate on the proof of (22), arguing exactly as before we obtain that
where we have used the infinite series representation for the exponential function. Fixing the useful notation
and observing that , for any , we get
(23) 
for any . Proceeding along similar lines as in (Gnedin et al., 2007, Lemma 1), it is not difficult to see that
which entails , for any . The last inequality can be used to provide an upper bound for the r.h.s. of (23) as follows
and (22) has been now proved. ∎
As already mentioned at the beginning of this section, the sub-Gaussian and sub-Gamma bounds obtained in Proposition 3.1 imply useful exponential tail bounds for $M_n$ (see Boucheron et al. (2013)). More specifically we have that:
Corollary 3.1
For any and , the following hold
Proof.
Proceeding along similar lines as before, we show that $K_{n,r}$ is a sub-Poisson random variable. This result is implicitly proved in the Supplementary material of Ayed et al. (2017), but for the sake of completeness we also report it here.
Proposition 3.2
For any and , the random variable $K_{n,r}$ is sub-Poisson with variance factor . Indeed, for any the following bound holds true
(24) 
where .
Proof.
Exploiting the independence of the random variables ’s, for any we can write:
where we have used the inequality , for any . ∎
The previous proposition and the Chernoff bounds imply an exponential tail bound for $K_{n,r}$. Indeed, one can prove the following.
Corollary 3.2
For any , and the following holds true
(25) 
Corollaries 3.1 and 3.2 provide us with concentration inequalities for the missing mass $M_n$ and the statistic $K_{n,r}$, respectively, around their means. These results have been derived without any assumption on the probabilities $(p_j)_{j\geq 1}$ and hold for all elements of the parameter space. In the next section, we will focus on the class of regularly varying probabilities and, after recalling the nonparametric estimator proposed by Ayed et al. (2017), we will prove that this estimator is consistent within such a subset of the parameter space.
4 A consistent estimator for regularly varying feature probabilities
Ayed et al. (2017) have introduced a nonparametric estimator of the missing mass, defined as follows
$$\hat{T}_n = \frac{K_{n,1}}{n}. \qquad (26)$$
Namely, $\hat{T}_n$ is the number of features having frequency one divided by the sample size $n$. Such an estimator is attractive both from a theoretical and a computational standpoint. Indeed, on the one side, it admits two different interpretations, as a Jackknife estimator in the sense of Quenouille (1956) and as a nonparametric empirical Bayes estimator in the same spirit as Efron and Morris (1973); on the other side, it is feasible and easy to implement. See Ayed et al. (2017) for details. Here we want to study the consistency of (26). Since we have seen that, without assumptions on the feature probabilities, no estimator of the missing mass is universally consistent (Theorem 2.1), we study the consistency of (26) under the ubiquitous assumption of heavy-tailed probabilities $(p_j)_{j\geq 1}$. We rely on the theory of regular variation by Karamata (1930, 1933) (see also Karlin (1967)) to define a suitable class of heavy-tailed $(p_j)_{j\geq 1}$, showing that, under this class, $\hat{T}_n$ turns out to be multiplicative consistent.
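The following sketch shows how the estimator (26) could be compared with the true missing mass in a simulation. It assumes a finite truncation of the feature collection and a hypothetical power-law choice $p_j = 0.5\, j^{-1/\alpha}$ as a simple instance of heavy-tailed probabilities; the function names, the constant 0.5 and the default parameter values are illustrative and not taken from the paper.

```python
import random

def singleton_estimate(counts, n):
    """Estimator (26): number of features seen exactly once, divided by n."""
    return sum(1 for x in counts if x == 1) / n

def simulate(n, alpha=0.5, n_features=2000, seed=0):
    """Return (estimate, true missing mass) for one simulated sample of size n."""
    rng = random.Random(seed)
    # Hypothetical heavy-tailed probabilities, truncated to n_features (assumption).
    p = [0.5 * j ** (-1.0 / alpha) for j in range(1, n_features + 1)]
    # X_{n,j}: number of the n observations that display feature j.
    counts = [sum(rng.random() < pj for _ in range(n)) for pj in p]
    # True missing mass: total probability of the features never observed.
    m_n = sum(pj for pj, x in zip(p, counts) if x == 0)
    return singleton_estimate(counts, n), m_n
```

Calling `simulate(n)` for increasing `n` and inspecting `abs(t_hat / m_n - 1)` gives a rough empirical feel for the multiplicative loss; because of the truncation and the single random seed, such a run is only indicative and does not replace the consistency proof.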
We use the limiting notation to mean ; we further write if there exists a fixed constant such that . Then, similarly as done by Karlin (1967) we give the following
Definition 4.1
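Let $(p_j)_{j\geq 1}$ be the feature probabilities and define the measure $\bar{\nu}(x) = \sum_{j\geq 1} \mathbb{1}\{p_j \geq x\}$, which is the cumulative count of all features having no less than a certain probability mass. We say that $(p_j)_{j\geq 1}$ is regularly varying with regular variation index $\alpha$ if $\bar{\nu}(x) \sim x^{-\alpha} \ell(1/x)$ as $x \downarrow 0$, where $\ell$ is a slowly varying function, that is $\ell(cx)/\ell(x) \to 1$ as $x \to \infty$ for all $c > 0$.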
Let us remark that, if we denote by the sorted probabilities in decreasing order, Definition 4.1 is equivalent to

as , where is another slowly varying function. For simplicity, the relation between , and is skipped here; interested readers can refer to Lemma 22 and Proposition 23 of Gnedin et al. (2007). Definition 4.1 is in the same spirit as Karlin (1967), but for our purposes here we consider the case $\sum_{j\geq 1} p_j < \infty$, while in Karlin (1967) the $p_j$'s satisfy the more restrictive condition $\sum_{j\geq 1} p_j = 1$. The next theorem is similar to a result proved by Karlin (1967) and provides the first order asymptotic of .
Theorem 4.1
Let be regularly varying with . If denotes the Gamma function, then as ,