Consistent estimation of the missing mass for feature models

Feature models are popular in machine learning and have recently been used to solve many unsupervised learning problems. In these models every observation is endowed with a finite set of features, usually selected from an infinite collection (F_j)_{j≥1}. Every observation can display feature F_j with an unknown probability p_j. A statistical problem inherent to these models is how to estimate, given an initial sample, the conditional expected number of hitherto unseen features that will be displayed in a future observation. This problem is usually referred to as the missing mass problem. In this work we prove that, using a suitable multiplicative loss function and without imposing any assumption on the parameters p_j, there does not exist any universally consistent estimator of the missing mass. In the second part of the paper, we focus on a special class of heavy-tailed probabilities (p_j)_{j≥1}, which are common in many real applications, and we show that, within this restricted class of probabilities, the nonparametric estimator of the missing mass suggested by Ayed et al. (2017) is strongly consistent. As a byproduct, we derive concentration inequalities for the missing mass and for the number of features observed with a specified frequency in a sample of size n.


1 Introduction

Feature models generalize species sampling models by allowing every observation to belong to more than one species, now called features. In particular, every observation is endowed with a finite set of features selected from a (possibly infinite) collection of features (F_j)_{j≥1}. Every feature F_j is associated with an unknown probability p_j, and each observation displays feature F_j with probability p_j. We may conveniently represent each observation as a binary sequence whose entries indicate the presence (1) or absence (0) of each feature. Feature models were first applied in ecology for modeling incidence vectors collecting the presence or absence of species in traps (Colwell et al. (2012) and Chao et al. (2014)), and more recently in several fields of the biosciences, such as the study of genetic variation and protein interactions (Chu et al. (2006), Ionita-Laza et al. (2009), Ionita-Laza et al. (2010) and Zou et al. (2016)). They have also found applications in the analysis of choice behaviour arising in psychology, marketing and computer science (Görür et al. (2006)); in binary matrix factorization for modeling dyadic data to design recommender systems (Meeds et al. (2007)); in graphical models (Wood et al. (2006) and Wood & Griffiths (2007)); in cognitive psychology for the analysis of similarity judgement matrices (Navarro & Griffiths (2007)); in independent component analysis and sparse factor analysis (Knowles & Ghahramani (2007)); and in link prediction using network data (Miller et al. (2010)).

The Bernoulli product model is arguably the most popular feature model. It assumes that the i-th observation is a sequence X_i = (X_{i,j})_{j≥1} of independent Bernoulli random variables with unknown success probabilities (p_j)_{j≥1}, and that X_i is independent of X_{i'} for any i ≠ i'. Therefore Y_{n,j} := ∑_{1≤i≤n} X_{i,j}, namely the number of times that feature F_j has been observed in a sample X_1, …, X_n, is a Binomial random variable with parameters (n, p_j), for any j ≥ 1. Recently, the Bernoulli product model has been extensively applied to the fundamental problem of discovering genetic variation in human populations; see, e.g., Ionita-Laza et al. (2009), Zou et al. (2016) and references therein. In such a context, interest is in estimating the conditional expected number, given a sample X_1, …, X_n, of hitherto unseen features that would be observed if an additional sample X_{n+1} were collected, namely

(1)    M_n := E[ ∑_{j≥1} 1{X_{n+1,j} = 1, Y_{n,j} = 0} | X_1, …, X_n ] = ∑_{j≥1} p_j 1{Y_{n,j} = 0},

where 1{·} denotes the indicator function. The statistic M_n is referred to as the missing mass, i.e. the sum of the probability masses of the features unobserved in a sample of size n. In genetics, interest in estimating (1) is motivated by the ambitious prospect of growing databases to encompass hundreds of thousands of genomes, which makes it important to quantify the power of large sequencing projects to discover new genetic variants (Auton et al. (2015)). An accurate estimate of the missing mass provides a quantitative evaluation of the potential and limitations of these datasets, thus providing a roadmap for large-scale sequencing projects.
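To fix ideas, the following short Python sketch simulates the Bernoulli product model and evaluates the missing mass (1) exactly from the simulated probabilities; the truncation level, the particular probability sequence and the sample size are purely illustrative choices, not part of the model.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Illustrative choices (not taken from the paper):
J = 10_000                                # truncation: finitely many candidate features
p = 0.5 * np.arange(1, J + 1) ** (-1.5)   # feature probabilities p_j, summable by construction
n = 100                                   # sample size

# Bernoulli product model: X[i, j] = 1 iff observation i displays feature j.
X = rng.random((n, J)) < p                # each row is an independent observation
Y = X.sum(axis=0)                         # Y_{n,j}: frequency of feature j in the sample

# Missing mass (1): total probability mass of the features not yet observed.
missing_mass = p[Y == 0].sum()
print(f"missing mass M_n = {missing_mass:.4f}")
```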

Let M̂_n denote an arbitrary estimator of M_n. For ease of notation, in the rest of the paper we do not highlight the dependence on the sample X_1, …, X_n and on (p_j)_{j≥1}, and we simply write M_n and M̂_n. Motivated by the recent works of Ohannessian & Dahleh (2012), Mossel & Ohannessian (2015), Ben-Hamou et al. (2017) and Ayed et al. (2018) on the estimation of the missing mass in species sampling models, in this paper we consider the problem of consistent estimation of M_n under the Bernoulli product model. The classical notion of additive consistency, which involves the large n limiting behaviour of |M̂_n − M_n|, is not suitable in the context of the estimation of M_n. This is because M_n → 0 as n → +∞, which implies that the trivial estimator M̂_n ≡ 0 is a consistent estimator of the missing mass for any sequence (p_j)_{j≥1}. Hence, in such a framework, one should invoke a more adequate notion of consistency, which allows one to obtain more informative results. This notion of consistency is based on the limiting behaviour of the multiplicative loss function

(2)    L(M̂_n, M_n) := | M̂_n / M_n − 1 |.

More precisely, we say that the estimator M̂_n is multiplicative consistent for M_n if L(M̂_n, M_n) → 0 as n → +∞, either almost surely or in probability. The multiplicative loss function has already been used in statistics, e.g. for the estimation of small probabilities via importance sampling (Chatterjee & Diaconis (2018)) and for the estimation of tail probabilities in extreme value theory (Beirlant & Devroye (1999)). We show that there do not exist universally consistent estimators, in the multiplicative sense, of the missing mass M_n. That is, under the Bernoulli product model and the loss function (2), we prove that for any estimator M̂_n of M_n there exists at least one choice of (p_j)_{j≥1} for which L(M̂_n, M_n) does not converge to 0 in probability, as n → +∞. The proof relies on nontrivial extensions of Bayesian nonparametric ideas and techniques developed by Ayed et al. (2018) for the estimation of the missing mass in species sampling models. In particular, the key argument makes use of a generalized Indian Buffet construction (James (2017)), which allows us to prove inconsistency by exploiting properties of the posterior distribution of M_n. Our inconsistency result is the natural counterpart for feature models of the work of Mossel & Ohannessian (2015), which shows the impossibility of estimating the missing mass without imposing any structural (distributional) assumption on the p_j's. We complete our study by investigating the consistency of an estimator of M_n recently proposed by Ayed et al. (2017). To the best of our knowledge, this is the first nonparametric estimator of M_n, in the sense that its derivation does not rely on any distributional assumption on the p_j's. We show that the estimator of Ayed et al. (2017) is strongly consistent, in the multiplicative sense, under the assumption that the tail of (p_j)_{j≥1} decays to zero as a regularly varying function (Bingham et al. (1987)). The proof relies on novel concentration inequalities for M_n, as well as for related statistics, which are of independent interest.
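A one-line computation makes the previous remark explicit: since each Y_{n,j} is a Binomial(n, p_j) random variable,

E[M_n] = ∑_{j≥1} p_j P(Y_{n,j} = 0) = ∑_{j≥1} p_j (1 − p_j)^n → 0,    as n → +∞,

by dominated convergence, the summands being dominated by the sequence (p_j)_{j≥1}, which is summable under the assumptions of Section 2. Hence M_n → 0 in L¹, and in particular in probability, so that the trivial estimator M̂_n ≡ 0 is additively consistent for every choice of (p_j)_{j≥1}.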

The paper is structured as follows. In Section 2 we prove that, for the Bernoulli product model, there do not exist universally consistent estimators, in the multiplicative sense, of the missing mass M_n. Section 3 introduces exponential tail bounds for M_n, as well as for related statistics, which are then applied in Section 4 to show that the estimator of M_n proposed by Ayed et al. (2017) is consistent under the assumption of regularly varying probabilities p_j.

2 Non-existence of universally consistent estimators of the missing mass

Consider the Bernoulli product model described in the Introduction. Without loss of generality, we assume that each feature F_j is labeled by a value in [0,1], and therefore (F_j)_{j≥1} is a sequence of distinct points in [0,1]. Furthermore, the probabilities (p_j)_{j≥1} are assumed to be summable, i.e. ∑_{j≥1} p_j < +∞; this condition is needed in order to guarantee that every observation displays only a finite number of features almost surely. Indeed, by the Tonelli–Fubini theorem, ∑_{j≥1} p_j < +∞ is equivalent to E[∑_{j≥1} X_{1,j}] < +∞, which in turn implies ∑_{j≥1} X_{1,j} < +∞ almost surely. The two unknown sequences (p_j)_{j≥1} and (F_j)_{j≥1} can be uniquely encoded in a finite measure on [0,1], namely μ := ∑_{j≥1} p_j δ_{F_j}, having all masses smaller than one. We can therefore consider as parameter space the set

(3)    { ∑_{j≥1} p_j δ_{F_j} : (F_j)_{j≥1} distinct points in [0,1], p_j ∈ (0,1) for every j ≥ 1, ∑_{j≥1} p_j < +∞ }.

Recall that Y_{n,j} denotes the number of times that feature F_j has been observed in the sample X_1, …, X_n, that is, Y_{n,j} = ∑_{1≤i≤n} X_{i,j} is a Binomial random variable with parameters (n, p_j). For a fixed n, an estimator of the missing mass M_n is a measurable map whose argument is the observed sample, i.e. M̂_n = M̂_n(X_1, …, X_n). We say that the estimator M̂_n is multiplicative consistent under the parameter space (3) if for every μ in (3) and every ε > 0,

(4)    lim_{n→+∞} P_μ( L(M̂_n, M_n) > ε ) = 0,

where P_μ denotes the law of the observations under a feature allocation model of parameter μ. Theorem 2.1 shows that there are no universally multiplicative consistent estimators of M_n for the class (3). This means that for any estimator M̂_n of the missing mass, there exists at least one element μ of (3) for which M̂_n / M_n does not converge to 1 in probability, as n → +∞.

Theorem 2.1

Under the feature allocation model, there are no universally consistent estimators, i.e. there are no estimators satisfying (4). In particular, for every estimator M̂_n, it is possible to find an element μ of the parameter space (3) such that, for any n,

(5)

for some strictly positive constant .

2.1 Proof of Theorem 2.1

In order to prove Theorem 2.1, it is enough to show that, for every estimator M̂_n and every n,

(6)

and therefore there exists a μ in (3) for which M̂_n is not consistent.

First, let us notice that, for every ,

(7)

Indeed, if , then

(8)

and, from the lower bound of (8), . Because , it follows that . This last inequality together with (8) leads to . Considering the complements of the two events, it follows that

and, as a consequence, , proving (7). From now on, we will denote and prove that

(9)

for some strictly positive constant .

The main idea of the proof is contained in the following chain of bounds and works as follows: we lower bound the supremum over μ in (7) by an average with respect to a (carefully chosen) prior for μ; we swap the conditional distribution of the sample given μ and the marginal (prior) distribution of μ with the posterior distribution of μ given the sample and the marginal distribution of the sample; and we lower bound the resulting quantity by the probability of a suitable event under the posterior distribution given the sample. Formally,

(10)

where we have applied the reverse Fatou lemma to interchange the limit and the expectation. In (10), the first expectation is taken with respect to the prior for μ, the second expectation with respect to the marginal distribution of the sample, and the probability is computed under the posterior distribution given the sample.

Our choice of the nonparametric prior for μ is based on completely random measures (see Daley & Vere-Jones (2008)) and on the generalized Indian Buffet process prior of James (2017). In particular, a prior for μ can be defined through a completely random measure μ̃ = ∑_{j≥1} s_j δ_{x_j} on [0,1], where the (s_j, x_j)'s are the points of a Poisson point process on (0,+∞) × [0,1], by setting μ = μ̃. We select μ̃ to be a completely random measure with Lévy intensity ν(ds, dx). The distribution of μ̃ is completely characterized by its Laplace functional, i.e. the map f ↦ E[exp{−∫_{[0,1]} f(x) μ̃(dx)}], defined for any measurable function f : [0,1] → [0,+∞); see also Kingman (1993).
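For completeness, we recall the classical expression of the Laplace functional of a completely random measure μ̃ with Lévy intensity ν (Kingman (1993)):

E[ exp{ − ∫_{[0,1]} f(x) μ̃(dx) } ] = exp{ − ∫_{(0,+∞)×[0,1]} (1 − e^{−s f(x)}) ν(ds, dx) },

valid for every measurable function f : [0,1] → [0,+∞); this is the formula invoked in the Laplace functional computations below.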

Theorem 3.1 of James (2017) provides a distributional equality for the posterior of μ̃ given the sample X_1, …, X_n. Denoting by F*_1, …, F*_{K_n} the K_n distinct features observed in X_1, …, X_n, we have the following distributional equality

(11)

where the jumps at the observed features F*_1, …, F*_{K_n} are non-negative random variables and the remaining component is an independent completely random measure with an updated Lévy intensity.

Defining , from (11) we have that, for any Borel set in , the missing mass satisfies

(12)

showing that the posterior distribution of the missing mass coincides with the distribution of the random variable . Moreover, it is worth introducing the random variable

whose distribution can be computed exactly and turns out to be Gamma with parameters . Indeed, from the Laplace functional, for every we have

which is the characteristic function of a Gamma

random variable.
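The Gamma identification above can be checked against the standard transform pair: if G is a Gamma random variable with shape a > 0 and rate b > 0, then

E[e^{−λ G}] = (1 + λ/b)^{−a}  for every λ > −b,    and    E[e^{i t G}] = (1 − i t/b)^{−a}  for every real t,

so matching the expression obtained from the Laplace functional with this form identifies the shape and rate parameters of the posterior quantity at hand.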

We now have all the necessary ingredients to prove the lower bound (10). Fix . First note that the reverse triangle inequality entails

(13)

which implies

(14)

indeed, thanks to (13), the two events together

imply that

where the last inequality follows from the fact that . Hence, from (14), we have that

which may be plugged into (10) to obtain

(15)

We are going to lower bound separately the two terms on the r.h.s. of (15). With regard to the first term, let us observe that the elementary inequality , for , implies that for all

Summing over ,

and therefore,

As a simple consequence of the last inequality, for any , the event implies the validity of and therefore we can upper bound the first term in (15) as follows

(16)

where we have used the fact that the posterior distribution of is .

Let us now consider the second term on the r.h.s. of (15). Using again the fact that

is Gamma distributed and

, we have

it is now easy to see that the function is strictly positive, continuous and admits a global minimum on at the point , therefore

(17)

Using the two bounds (16) and (17) in (15), for any we get

which completes the proof.

3 Concentration inequalities for feature models

In this section we establish exponential tail bounds for the missing mass M_n and for the statistic K_{r,n} defined by

K_{r,n} := ∑_{j≥1} 1{Y_{n,j} = r},

which counts the number of features observed with frequency r in the sample X_1, …, X_n. The statistic K_{r,n} is of interest in different applications of feature allocation models, and its analysis will be important for the study of the estimator of the missing mass considered in Section 4, which involves K_{1,n}. The tail bounds we present in this section are valid in full generality, i.e. without any assumption on the probability masses (p_j)_{j≥1}. In Section 4 we will use these results to prove consistency under the assumption of regularly varying heavy-tailed probabilities.
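As a concrete illustration, continuing the purely illustrative simulation sketch of Section 1 (whose frequency vector Y is assumed to be available), K_{r,n} is a one-line count:

```python
import numpy as np

def frequency_count(Y: np.ndarray, r: int) -> int:
    """K_{r,n}: number of features observed exactly r times in the sample."""
    return int(np.sum(Y == r))

# Example, reusing the frequency vector Y from the sketch in Section 1:
# K_1 = frequency_count(Y, 1)   # number of features seen exactly once
```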
In order to derive the concentration inequalities for M_n and K_{r,n} we will use Chernoff bounds, which require suitable bounds on the log-Laplace transform. First, let us recall some definitions from Boucheron et al. (2013) and Ben-Hamou et al. (2017).

Definition 3.1

Let X be a real-valued random variable defined on some probability space, and let ψ_X(λ) := log E[exp{λ(X − E[X])}] denote its centered log-Laplace transform. Then:

  • X is sub-Gaussian on the right tail (resp. on the left tail) with variance factor v > 0

    if for any λ ≥ 0 (resp. λ ≤ 0)

    (18)    ψ_X(λ) ≤ λ² v / 2;
  • X is sub-Gamma on the right tail with variance factor v > 0 and scale parameter c > 0 if, for any λ ∈ (0, 1/c),

    (19)    ψ_X(λ) ≤ λ² v / (2 (1 − c λ));
  • X is sub-Gamma on the left tail with variance factor v and scale parameter c if −X is sub-Gamma on the right tail with variance factor v and scale parameter c;

  • X is sub-Poisson with variance factor v > 0 if for all λ ∈ ℝ

    (20)    ψ_X(λ) ≤ v φ(λ),

    being φ(λ) := e^λ − λ − 1.

Note that a sub-Gaussian random variable is also sub-Gamma for any choice of the scale parameter c > 0, but in general the converse is not true. As we will see in the sequel, the bounds (18)–(19) on the log-Laplace transform imply exponential tail bounds by means of the Chernoff inequality. See Boucheron et al. (2013) for the details.
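For reference, the Chernoff argument yields the following standard implications, stated here in the notation of Definition 3.1 (see Boucheron et al. (2013)). If X is sub-Gaussian on the left tail with variance factor v, then for every t > 0

P( X ≤ E[X] − (2 v t)^{1/2} ) ≤ e^{−t},

while if X is sub-Gamma on the right tail with variance factor v and scale parameter c, then for every t > 0

P( X ≥ E[X] + (2 v t)^{1/2} + c t ) ≤ e^{−t}.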
The following proposition shows that the missing mass M_n is sub-Gaussian on the left tail and sub-Gamma on the right one.

Proposition 3.1

Let . On the left tail, the random variable M_n is sub-Gaussian with variance factor , i.e. for any it holds

(21)

On the right tail, the random variable M_n is sub-Gamma with variance factor and scale parameter , i.e. for any one has

(22)
Proof.

We first focus on the proof of (21). Let . Exploiting the independence of the random variables Y_{n,j}'s and the elementary inequality , valid for any , we obtain

We observe that, since , one has:

hence (21) follows.
We now concentrate on the proof of (22); arguing exactly as before, we obtain that

where we have used the infinite series representation of the exponential function. Introducing the notation

and observing that , for any , we get

(23)

for any . Proceeding along similar lines as in (Gnedin et al., 2007, Lemma 1), it is not difficult to see that

which entails , for any . The last inequality can be used to provide an upper bound for the r.h.s. of (23) as follows

and (22) is now proved. ∎

As already mentioned at the beginning of this section, the sub-Gaussian and sub-Gamma bounds obtained in Proposition 3.1 imply useful exponential tail bounds for M_n (see Boucheron et al. (2013)). More specifically, we have the following:

Corollary 3.1

For any and , the following hold

Proof.

The two inequalities follow from the Chernoff bound and the log-Laplace bounds proved in Proposition 3.1. This is a standard argument; see Boucheron et al. (2013) for details. ∎

Proceeding along similar lines as before, we now show that K_{r,n} is a sub-Poisson random variable. This result is implicitly proved in the Supplementary material of Ayed et al. (2017), but for the sake of completeness we report it here as well.

Proposition 3.2

For any and , the random variable K_{r,n} is sub-Poisson with variance factor . Indeed, for any the following bound holds true

(24)

where .

Proof.

Exploiting the independence of the random variables ’s, for any we can write:

where we have used the inequality , for any . ∎

The previous proposition and the Chernoff bound imply an exponential tail bound for K_{r,n}; indeed, one can prove the following:

Corollary 3.2

For any , and the following holds true

(25)
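Although the precise constants in (25) are not reproduced here, we recall the standard shape of such a bound: by the Chernoff inequality, a sub-Poisson random variable X with variance factor v satisfies, for every t > 0,

P( X ≥ E[X] + t ) ≤ exp{ − v h(t/v) },    with h(u) := (1 + u) log(1 + u) − u,

which is the familiar Bennett-type tail bound (see Boucheron et al. (2013)).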

Corollaries 3.1 and 3.2 provide concentration inequalities for the missing mass M_n and for the statistic K_{r,n}, respectively, around their means. These results have been derived without any assumption on the probabilities (p_j)_{j≥1} and hold for all elements of the parameter space (3). In the next section we focus on the class of regularly varying probabilities and, after recalling the nonparametric estimator proposed by Ayed et al. (2017), we prove that this estimator is consistent within such a subclass of the parameter space.

4 A consistent estimator for regularly varying feature probabilities

Ayed et al. (2017) have introduced a nonparametric estimator of the missing mass, defined as follows:

(26)    M̂_n := K_{1,n} / n.

Namely, M̂_n is the number of features having frequency one in the sample, divided by the sample size n. Such an estimator is attractive from both a theoretical and a computational standpoint. On the one hand, it admits two different interpretations, as a Jackknife estimator in the sense of Quenouille (1956) and as a nonparametric empirical Bayes estimator in the same spirit as Efron & Morris (1973); on the other hand, it is feasible and easy to implement. See Ayed et al. (2017) for details. Here we study the consistency of (26). We have seen that, without assumptions on the feature probabilities, no estimator of the missing mass can be universally consistent (Theorem 2.1); hence we study the consistency of (26) under the ubiquitous assumption of heavy-tailed probabilities (p_j)_{j≥1}. We rely on the theory of regular variation of Karamata (1930, 1933) (see also Karlin (1967)) to define a suitable class of heavy-tailed (p_j)_{j≥1}, showing that, within this class, M̂_n turns out to be multiplicative consistent.
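For illustration, reusing the purely illustrative simulated sample of Section 1 (the frequency vector Y, the sample size n and the exact missing mass computed there are hypothetical objects, not part of the paper), the estimator (26) and the multiplicative loss (2) can be evaluated as follows.

```python
import numpy as np

def missing_mass_estimator(Y: np.ndarray, n: int) -> float:
    """Estimator (26): number of features with frequency one, divided by n."""
    return float(np.sum(Y == 1)) / n

# Example, reusing Y, n and missing_mass from the sketch in Section 1:
# M_hat = missing_mass_estimator(Y, n)
# multiplicative_loss = abs(M_hat / missing_mass - 1.0)   # loss (2)
```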

We use the limiting notation a(n) ∼ b(n) to mean a(n)/b(n) → 1 as n → +∞; we further write a(n) ≲ b(n) if there exists a fixed constant C > 0 such that a(n) ≤ C b(n). Then, similarly to what is done in Karlin (1967), we give the following

Definition 4.1

Let α ∈ (0,1) and define the measure ν := ∑_{j≥1} δ_{p_j}, so that ν([x,1]) = #{ j ≥ 1 : p_j ≥ x } is the cumulative count of all features having no less than a given probability mass x. We say that (p_j)_{j≥1} is regularly varying with regular variation index α if ν([x,1]) ∼ x^{−α} ℓ(1/x) as x → 0, where ℓ is a slowly varying function, that is, ℓ(t y)/ℓ(y) → 1 as y → +∞ for all t > 0.
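A worked example may help fix ideas: take the pure power-law sequence p_j = c j^{−1/α}, with α ∈ (0,1) and 0 < c < 1, so that the p_j's lie in (0,1) and are summable; this example is ours and is only meant to illustrate Definition 4.1. Then, as x → 0,

ν([x,1]) = #{ j ≥ 1 : c j^{−1/α} ≥ x } = #{ j ≥ 1 : j ≤ (c/x)^α } ∼ c^α x^{−α},

so that (p_j)_{j≥1} is regularly varying with index α and constant slowly varying function ℓ ≡ c^α, in agreement with the equivalent formulation in terms of the decreasingly ordered probabilities recalled below.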

Let us remark that, if (p_{(j)})_{j≥1} denotes the sequence of probabilities sorted in decreasing order, Definition 4.1 is equivalent to

p_{(j)} ∼ j^{−1/α} ℓ*(j)

as j → +∞, where ℓ* is another slowly varying function. For simplicity, the relation between ℓ, ℓ* and α is skipped here; interested readers can refer to Lemma 22 and Proposition 23 of Gnedin et al. (2007). Definition 4.1 is in the same spirit as Karlin (1967), but for our purposes we consider the case ∑_{j≥1} p_j < +∞, while in Karlin (1967) the p_j's satisfy the more restrictive condition ∑_{j≥1} p_j = 1. The next theorem is similar to a result proved by Karlin (1967) and provides the first-order asymptotics of .

Theorem 4.1

Let (p_j)_{j≥1} be regularly varying with regular variation index α ∈ (0,1). If Γ denotes the Gamma function, then, as n → +∞,