1 Introduction
Given $n$ samples from a population of individuals belonging to different types with unknown proportions, how do we estimate the probability of discovering a new type at the $(n+1)$-th draw? This is a classical problem in statistics, commonly referred to as the missing mass estimation problem. It first appeared in ecology (e.g., Fisher et al. (1943) and Good (1953)), and its importance has grown considerably in recent years, driven by challenging applications in a wide range of scientific disciplines, such as biological and physical sciences (e.g., Kroes et al. (1999), Gao et al. (2007) and Ionita-Laza et al. (2009)), machine learning and computer science (e.g., Motwani and Vassilvitskii (2006) and Bubeck et al. (2013)), and information theory (e.g., Orlitsky et al. (2014) and Ben-Hamou et al. (2018)). To move into a concrete setting, let $P=\sum_{j\geq 1}p_j\delta_{s_j}$ be an unknown discrete distribution, where $(s_j)_{j\geq 1}$ is a sequence of atoms on some measurable space and $(p_j)_{j\geq 1}$ denote the corresponding probability masses, i.e. $p_j\geq 0$ such that $\sum_{j\geq 1}p_j=1$. If $\mathbf{X}_n=(X_1,\ldots,X_n)$ is a collection of independent and identically distributed random variables from $P$, then we define the missing mass as
$$M_n(P,\mathbf{X}_n)=\sum_{j\geq 1}p_j\,\mathbb{1}\{X_1\neq s_j,\ldots,X_n\neq s_j\}, \tag{1}$$
where $\mathbb{1}\{\cdot\}$ is the indicator function. Among various nonparametric estimators of the missing mass, both frequentist and Bayesian, the Good-Turing estimator (Good (1953)) is arguably the most popular. It has been the subject of numerous studies, most of them in recent years. These cover, e.g., asymptotic normality and large deviations (Zhang and Zhang (2009) and Gao (2013)), admissibility and concentration properties (McAllester and Ortiz (2003), Ohannessian and Dahleh (2012) and Ben-Hamou et al. (2017)), consistency and convergence rates (McAllester and Schapire (2000), Wagner et al. (2006) and Mossel and Ohannessian (2015)), and optimality and minimax properties (Orlitsky et al. (2013) and Rajaraman et al. (2017)).
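To make the definition of the missing mass concrete, here is a minimal sketch that evaluates it for a small, fully known distribution. The function name `missing_mass` and the toy distribution are illustrative assumptions; in the estimation problem itself the masses are unknown, so this quantity cannot be computed from the data alone.

```python
def missing_mass(probs, sample):
    """Total probability of the atoms that never appear in the sample.

    probs  : dict mapping each atom to its probability mass (assumed to sum to 1)
    sample : iterable of observed atoms
    """
    seen = set(sample)
    return sum(p for atom, p in probs.items() if atom not in seen)


# Toy distribution on four atoms (hypothetical values, for illustration only).
probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
sample = ["a", "b", "a", "a", "b"]

# Atoms "c" and "d" are unseen, so the missing mass is their combined mass.
m = missing_mass(probs, sample)
print(m)
```

With an empty sample the missing mass is the total mass, and it drops to zero once every atom has been observed.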
Under the setting depicted above, let $\hat{M}_n$ denote an estimator of $M_n(P,\mathbf{X}_n)$. Motivated by the recent works of Ohannessian and Dahleh (2012), Mossel and Ohannessian (2015) and Ben-Hamou et al. (2017), in this paper we consider the problem of consistent estimation of the missing mass under the multiplicative loss function
$$L(\hat{M}_n,M_n)=\left|\frac{\hat{M}_n}{M_n}-1\right|. \tag{2}$$
As discussed in Ohannessian and Dahleh (2012), the loss function (2) is well suited to estimating parameters that take small values, in the sense that it yields more informative results. Such a loss function has already been used in statistics, e.g. for the estimation of small value probabilities using importance sampling (Chatterjee and Diaconis (2018)) and for the estimation of tail probabilities in extreme value theory (Beirlant and Devroye (1999)). Under the loss function (2), Ohannessian and Dahleh (2012) showed that: i) the Good-Turing estimator may be inconsistent; ii) the Good-Turing estimator is strongly consistent if the tail of $P$ decays to zero as a regularly varying function with parameter $\alpha\in(0,1)$ (Bingham et al. (1987)). See also Ben-Hamou et al. (2017) for further results on missing mass estimation under regularly varying $P$. Mossel and Ohannessian (2015) then strengthened the inconsistency result of Ohannessian and Dahleh (2012), showing the impossibility of estimating (learning) the missing mass in a completely distribution-free fashion, that is, without imposing further structural assumptions on $P$.
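A small numerical illustration of why a multiplicative loss is more informative than an absolute one for small quantities. This is a sketch that reads the loss (2) as the relative error $|\hat{M}_n/M_n-1|$; the function name and the numbers are hypothetical.

```python
def multiplicative_loss(estimate, truth):
    """Relative error |estimate / truth - 1|; requires truth > 0."""
    if truth <= 0:
        raise ValueError("the true value must be strictly positive")
    return abs(estimate / truth - 1.0)


truth = 1e-4      # a small true missing mass (hypothetical)
estimate = 1e-3   # an estimate that is ten times too large

abs_err = abs(estimate - truth)               # tiny on the absolute scale
mult = multiplicative_loss(estimate, truth)   # large on the relative scale
print(abs_err, mult)
```

The absolute error is below $10^{-3}$ and would look negligible, while the multiplicative loss is about 9 and exposes the tenfold overestimate.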
We present an alternative, and simpler, proof of the result of Mossel and Ohannessian (2015). Our proof relies on tools from Bayesian nonparametrics, and in particular on the use of a Dirichlet prior (Ferguson (1973)) for the unknown distribution $P$. This allows us to exploit properties of the posterior distribution to prove the impossibility of a distribution-free estimation of the missing mass, thus avoiding the winding (geometric) coupling argument of Mossel and Ohannessian (2015). To the best of our knowledge, the use of Bayesian ideas to study large sample asymptotics for the missing mass is new, and it could be of independent interest. Motivated by the work of Ohannessian and Dahleh (2012) and Ben-Hamou et al. (2017), we then investigate convergence rates and minimax rates for the Good-Turing estimator under the class of regularly varying $P$. We still rely on tools from Bayesian nonparametrics, thus providing an original approach to tackle these problems. In particular, we make use of the two-parameter Poisson-Dirichlet prior (Perman et al. (1992) and Pitman and Yor (1997)) for the unknown distribution $P$, which is known to generate (almost surely) discrete distributions whose tail decays to zero as a regularly varying function with parameter $\alpha\in(0,1)$. See Gnedin et al. (2007) and references therein. This allows us to exploit properties of the posterior distribution to prove that: i) the convergence rate of the Good-Turing estimator is the best rate that any estimator of the missing mass can achieve, up to a slowly varying function; ii) the minimax rate must be at least . We conclude with a discussion on the problem of deriving the minimax rate of the Good-Turing estimator, conjecturing that the Good-Turing estimator is an asymptotically optimal minimax estimator under the class of regularly varying $P$.
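The two-parameter Poisson-Dirichlet prior mentioned above can be simulated through its standard sequential description, the two-parameter Chinese restaurant process. The following is a minimal sketch under stated assumptions: the names `pitman_yor_partition`, `alpha` and `theta` are illustrative, and the parameter values are arbitrary rather than taken from the paper.

```python
import random


def pitman_yor_partition(n, alpha, theta, rng):
    """Block sizes of a random partition of n items drawn from the
    two-parameter Chinese restaurant process (0 <= alpha < 1, theta > -alpha)."""
    sizes = []
    for i in range(n):
        k = len(sizes)
        # After i items, a new block opens with probability
        # (theta + alpha * k) / (theta + i); the first item always opens one.
        if i == 0 or rng.random() < (theta + alpha * k) / (theta + i):
            sizes.append(1)
        else:
            # Otherwise block j is chosen with probability proportional
            # to (size_j - alpha).
            weights = [s - alpha for s in sizes]
            j = rng.choices(range(k), weights=weights, k=1)[0]
            sizes[j] += 1
    return sizes


rng = random.Random(0)
sizes = pitman_yor_partition(2000, alpha=0.5, theta=1.0, rng=rng)
num_blocks = len(sizes)                            # number of distinct values
num_singletons = sum(1 for s in sizes if s == 1)   # values seen exactly once
print(num_blocks, num_singletons)
```

The block sizes correspond to the frequencies of the distinct values in a sample from a random distribution with this prior, so the number of blocks and the number of singleton blocks are the quantities that enter the Good-Turing estimator.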
The paper is structured as follows. In Section 2 we state our main results on convergence rates and minimax rates for the Good-Turing estimator under regularly varying distributions $P$. Proofs of these results, as well as the alternative proof of the result of Mossel and Ohannessian (2015), are provided in Section 4. In Section 3 we discuss open problems and possible future developments on missing mass estimation. Auxiliary results and technical lemmas are deferred to Appendix 5. The following notation is adopted throughout the paper:

is the unit interval, and its Borel $\sigma$-algebra;

is the space of discrete distributions on , endowed with the smallest $\sigma$-algebra making measurable for every ;

is the $n$-fold product of on , and the expectation with respect to ; for ease of notation, we will use to denote also the expectation with respect to ;

$\ell$ is a generic slowly varying function, i.e. a function satisfying $\ell(cx)/\ell(x)\to 1$ as $x\to\infty$ for every $c>0$;

denotes a generic strictly positive constant that can vary in the calculations and in distinct statements;

Given a sequence of probabilities $(p_j)_{j\geq 1}$, $(p_{[j]})_{j\geq 1}$ denotes the corresponding ordered sequence, i.e. $p_{[1]}\geq p_{[2]}\geq\cdots$;

Given two functions and , stands for , for , for ;

is the Beta integral of parameters and .
2 Main results
Let $\mathbf{X}_n=(X_1,\ldots,X_n)$ be a collection of independent and identically distributed random variables from an unknown discrete distribution $P$. The actual values taken by the observations, the $X_i$'s, are irrelevant for the missing mass estimation problem and, without loss of generality, they can be assumed to be values in the set $[0,1]$. Therefore, $P$ is supposed to be a discrete distribution on the sample space $[0,1]$, given by a sequence of atoms $(s_j)_{j\geq 1}$ and masses $(p_j)_{j\geq 1}$ such that $\sum_{j\geq 1}p_j=1$. Both atoms and masses of the distribution $P$ are assumed to be unknown. Given the sample $\mathbf{X}_n$, we are interested in estimating the missing mass $M_n(P,\mathbf{X}_n)$ defined in (1), which turns out to be a jointly measurable function of $P$ and $\mathbf{X}_n$, as proved in Proposition 5.1. Given an estimator $\hat{M}_n$ of $M_n(P,\mathbf{X}_n)$, we will measure its statistical performance by using the multiplicative loss function defined in (2). As we discussed in the introduction, this loss function is suitable for studying theoretical properties of parameters or functionals taking small values, and it has already been used in previous works on missing mass estimation, e.g., Ohannessian and Dahleh (2012), Mossel and Ohannessian (2015) and Ben-Hamou et al. (2017).
A sequence of estimators $(\hat{M}_n)_{n\geq 1}$ is said to be consistent for $M_n(P,\mathbf{X}_n)$ under a given parameter space and the loss function $L$ if the loss incurred by the estimator converges in probability to zero under all points in the parameter space. Formally, $\hat{M}_n$ is consistent for $M_n(P,\mathbf{X}_n)$ if for all $P$ in the parameter space and for all $\varepsilon>0$,
$$P^n\big(L(\hat{M}_n,M_n(P,\mathbf{X}_n))>\varepsilon\big)\to 0 \tag{3}$$
as $n\to\infty$. Also, $\hat{M}_n$ is strongly consistent if (3) is replaced by almost sure convergence. Under this setting, Mossel and Ohannessian (2015) proved the following result.
Theorem 2.1
Mossel and Ohannessian (2015) proved Theorem 2.1 by exploiting a coupling of two generalized (dithered) geometric distributions. In Section 4 we present an alternative proof of Theorem 2.1. While the proof of Mossel and Ohannessian (2015) has the merit of being constructive, our approach has the merit of being simpler, and it provides a new way to tackle this type of problem, one that relies mainly on Bayesian nonparametric techniques. Similar Bayesian nonparametric arguments will then be crucial in order to study convergence rates and minimax rates of the Good-Turing estimator under the class of regularly varying $P$.

Roughly speaking, Theorem 2.1 shows that, for any asymptotic result to hold uniformly over a set of possible distributions, the parameter space must be restricted to a suitable subclass. In particular, from the proof of Theorem 2.1 we see that some conditions have to be imposed on the tail decay of the elements of the parameter space. That is, from the proof of Theorem 2.1 we deduce that there are no consistent estimators for the class of distributions sampled from a Dirichlet process. From Kingman (1975) (Equation 65), we have that, if $P$ is sampled from a Dirichlet process, its sequence of ordered masses behaves like , as . Therefore, the tail of $P$ has approximately exponential form, resembling a geometric distribution and satisfying for every . Indeed, a geometric distribution was used in Ohannessian and Dahleh (2012) as an example to prove that the Good-Turing estimator can be inconsistent. Theorem 2.1 shows that, under this very light-tailed regime, any estimator of the missing mass, not just the Good-Turing estimator, fails to be consistent under multiplicative loss. This motivates us to consider the class of distributions $P$ having heavy enough tails. This will be the subject of the rest of this section.
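The geometric example mentioned above can be explored numerically. The following is a minimal sketch under stated assumptions: the masses are taken as $p_j=2^{-j}$ truncated to 60 atoms, the Good-Turing estimator is taken as the number of singletons divided by $n$, and the helper name `good_turing_ratio` is hypothetical. A single simulated run only illustrates how the ratio of the estimate to the true missing mass can behave; it proves nothing about consistency.

```python
import random
from collections import Counter


def good_turing_ratio(probs, n, rng):
    """Ratio of the Good-Turing estimate to the true missing mass
    for one sample of size n drawn from the masses in probs."""
    sample = rng.choices(range(len(probs)), weights=probs, k=n)
    freq = Counter(sample)
    singletons = sum(1 for c in freq.values() if c == 1)
    estimate = singletons / n
    # True missing mass: total mass of the atoms absent from the sample.
    missing = sum(p for j, p in enumerate(probs) if j not in freq)
    return estimate / missing


rng = random.Random(0)
# Geometric masses 2^-(j+1), truncated to 60 atoms (an assumption of this sketch).
geometric = [2.0 ** -(j + 1) for j in range(60)]
ratios = {n: good_turing_ratio(geometric, n, rng) for n in (100, 1000, 10000)}
for n, r in ratios.items():
    print(n, r)
```

Under the multiplicative loss, consistency would require this ratio to approach 1 as $n$ grows; repeating the run with different seeds gives a feel for how strongly it fluctuates in the light-tailed case.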
2.1 Consistency under regularly varying $P$
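In this section we recall the Good-Turing estimator (Good (1953)) of the missing mass $M_n(P,\mathbf{X}_n)$, and we study its convergence rate and minimax risk for regularly varying $P$. The definition of the Good-Turing estimator makes use of the proportion of unique values in the sample, that is, values observed exactly once, to estimate the missing mass. Let $Y_{n,j}$ be the number of times the value $s_j$ is observed in the sample $\mathbf{X}_n$, i.e.,
$$Y_{n,j}=\sum_{i=1}^{n}\mathbb{1}\{X_i=s_j\}.$$
Furthermore, let $K_{n,r}$ and $K_n$ be the number of values observed $r$ times and the total number of distinct values, respectively, observed in $\mathbf{X}_n$, i.e.,
$$K_{n,r}=\sum_{j\geq 1}\mathbb{1}\{Y_{n,j}=r\},\qquad K_n=\sum_{j\geq 1}\mathbb{1}\{Y_{n,j}>0\}.$$
The Good-Turing estimator of $M_n(P,\mathbf{X}_n)$ is defined in terms of the statistic $K_{n,1}$, that is
$$\hat{G}_n=\frac{K_{n,1}}{n}. \tag{4}$$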
Ohannessian and Dahleh (2012) first showed the inconsistency of the Good-Turing estimator under the choice of $P$ being a geometric distribution. In the same paper, it is shown that, under the assumption that the tail of $P$ decays to zero as a regularly varying function with parameter $\alpha\in(0,1)$, the Good-Turing estimator is strongly consistent. This latter result was generalized to the range in Ben-Hamou et al. (2017).
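The occupancy statistics and the Good-Turing estimator described above are straightforward to compute from a sample. A minimal sketch, assuming the estimator is the number of values observed exactly once divided by the sample size; the function names are illustrative.

```python
from collections import Counter


def occupancy_counts(sample):
    """Return (k_r, k_n): k_r[r] is the number of distinct values observed
    exactly r times, and k_n is the total number of distinct values."""
    freq = Counter(sample)          # how often each value occurs
    k_r = Counter(freq.values())    # how many values share each frequency
    return k_r, len(freq)


def good_turing(sample):
    """Good-Turing estimate of the missing mass: singletons / sample size."""
    n = len(sample)
    if n == 0:
        raise ValueError("the sample must be non-empty")
    k_r, _ = occupancy_counts(sample)
    return k_r[1] / n


sample = ["a", "b", "a", "c", "d", "a", "b"]
k_r, k_n = occupancy_counts(sample)
print(k_r[1], k_n, good_turing(sample))
```

Here "c" and "d" are the only values seen once among seven draws, so the estimate is 2/7. Note that the estimator uses only the frequencies, never the labels of the observed values.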
The assumption of regularly varying $P$ is a generalization of the power-law tail decay, adding some more flexibility through the introduction of the slowly varying function $\ell$. Power-law distributions are observed in the empirical distributions of many quantities in different applied areas, and their study has attracted a lot of interest in recent years. For extensive discussions of power laws in empirical data and their properties, the reader is referred to Mitzenmacher (2014), Goldwater et al. (2006), Newman (2003), Clauset et al. (2009) and Sornette (2006). Restricting the parameter space to probability distributions having a regularly varying tail is not a mere technical assumption; on the contrary, it represents a natural subset of the parameter space to consider, which we expect to contain the true data generating distribution in many different applications.
To move into the concrete setting of regular variation (Bingham et al. (1987)), for every $P$ we define a counting measure on $(0,1]$ as $\nu_P(dx)=\sum_{j\geq 1}\delta_{p_j}(dx)$, with corresponding tail function defined as $\bar{\nu}_P(x)=\nu_P([x,1])$ for all $x\in(0,1)$. Then a distribution $P$ is said to be regularly varying with parameter $\alpha\in(0,1)$ if
$$\bar{\nu}_P(x)\sim x^{-\alpha}\ell(1/x)\quad\text{as } x\to 0, \tag{5}$$
where $\ell$ is a slowly varying function. From Lemma 22 and Proposition 23 of Gnedin et al. (2007), (5) is equivalent to the more explicit condition in terms of the ordered masses of $P$
$$p_{[j]}\sim j^{-1/\alpha}\ell^{*}(j)\quad\text{as } j\to\infty, \tag{6}$$
where $\ell^{*}$ is a slowly varying function depending on $\ell$. We denote by the set of all regularly varying distributions on $[0,1]$ with parameter $\alpha$. From (6) it is clear that such a class includes distributions having power-law tail decay, which corresponds to the particular case of $\ell^{*}$ being constant, or equivalently $\ell$ being constant. We denote the class of distributions having power-law tail decay by . In the following results, we will restrict our attention to the estimation problem under the restricted parameter spaces and .
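For the power-law case, a short simulation shows how the Good-Turing estimate compares with the true missing mass under the multiplicative loss. This is a sketch under stated assumptions: the masses are taken proportional to $j^{-1/\alpha}$ on a finite, truncated set of atoms, the loss is read as the relative error, and all function names and parameter values are hypothetical.

```python
import random
from collections import Counter


def power_law_masses(alpha, num_atoms):
    """Masses proportional to j^(-1/alpha), normalised over a finite number
    of atoms (the truncation is an assumption of this sketch)."""
    weights = [(j + 1) ** (-1.0 / alpha) for j in range(num_atoms)]
    total = sum(weights)
    return [w / total for w in weights]


def good_turing_vs_truth(probs, n, rng):
    """Return (Good-Turing estimate, true missing mass, multiplicative loss)."""
    sample = rng.choices(range(len(probs)), weights=probs, k=n)
    freq = Counter(sample)
    singletons = sum(1 for c in freq.values() if c == 1)
    estimate = singletons / n
    missing = sum(p for j, p in enumerate(probs) if j not in freq)
    return estimate, missing, abs(estimate / missing - 1.0)


rng = random.Random(1)
probs = power_law_masses(alpha=0.5, num_atoms=100_000)
estimate, missing, loss = good_turing_vs_truth(probs, 20_000, rng)
print(estimate, missing, loss)
```

Rerunning with increasing sample sizes gives an empirical view of how the loss shrinks in the heavy-tailed regime, in contrast with the light-tailed geometric case.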
From Ohannessian and Dahleh (2012) it is known that the Good-Turing estimator is consistent under all distributions $P$ belonging to the regularly varying class . Focusing attention on the class , in the next proposition we refine the result of Ohannessian and Dahleh (2012) by studying the rate at which the multiplicative loss of the Good-Turing estimator converges to zero. A sequence is a convergence rate of an estimator for the distribution if
for all sequences . The next proposition shows that the rate of convergence of the Good-Turing estimator is . The proof is omitted because it follows from Proposition 2.2 below along with a simple application of Markov's inequality.
Proposition 2.1
As a further result on the convergence rate of the Good-Turing estimator, in Theorem 2.2 we show that the convergence rate achieved by the Good-Turing estimator is in fact almost the best convergence rate that any estimator of the missing mass can achieve. Specifically, for any other estimator, it is possible to find a point for which the rate of convergence is not faster than .
Theorem 2.2
For any estimator , there exists such that for every
(8) 
Therefore the convergence rate of cannot be faster than .
Proposition 2.1 and Theorem 2.2 together show that the Good-Turing estimator achieves the best convergence rate, up to possibly a slowly varying function. In particular, if the distribution has a power-law decay, i.e. , the two rates match and the Good-Turing estimator achieves the best rate possible. Moreover, because does not depend on , it follows that the Good-Turing estimator is actually rate adaptive for the class of power-law distributions. However, for a general we do not know whether the two rates of Proposition 2.1 and Theorem 2.2 can be improved so that they match.
As a final result, in the next theorem we consider the asymptotic minimax estimation risk for the missing mass under the loss function (2) and with parameter space . Theorem 2.3 provides a lower bound for the estimation risk of this statistical problem, showing that the minimax rate is not smaller than .
Theorem 2.3
Let be the class of discrete distributions on with power law tail function and let denote the multiplicative loss function (2). Then, there exists a positive constant such that
where the infimum is taken over all possible estimators .
The lower bound of Theorem 2.3 can be used to derive the minimax rate, by matching it with appropriate upper bounds for specific estimators of the missing mass. This lower bound trivially still holds for any parameter set larger than and, therefore, the theorem also provides a lower bound for the estimation risk under the larger parameter space . In the next proposition, we show that for a fixed distribution , the Good-Turing estimator achieves the best possible rate of Theorem 2.3, up to a slowly varying term.
Proposition 2.2
Let be the Good-Turing estimator and let . Then, there exists a finite constant such that for every
(9) 
where is the slowly varying function specific to appearing in (5).
Extending Proposition 2.2 to hold uniformly over is an open problem and probably requires careful control over the size of . Indeed, the classes of distributions we are considering are defined through the asymptotic properties of their elements, while to obtain minimax results we need control for each . Even though Proposition 2.2 does not directly provide the minimax rate of the Good-Turing estimator, it still provides a sanity check for its asymptotic risk. Specifically, Proposition 2.2 implies that for every ,
Moreover, with a minor change at the beginning of the proof of Theorem 2.3, we can also prove that for every estimator and every sequence diverging to infinity, we can find an element such that . This leads us to conjecture that the Good-Turing estimator should be a rate-optimal minimax estimator.
3 Discussion
In this paper we have considered the problem of consistent estimation of the missing mass under a suitable multiplicative loss function. We have presented an alternative, and simpler, proof of the result by Mossel and Ohannessian (2015) on the impossibility of a distribution-free estimation of the missing mass. Our results rely on novel arguments from Bayesian nonparametric statistics, which are then exploited to study convergence rates and minimax rates of the Good-Turing estimator under the class of regularly varying $P$. In Proposition 2.1 and Theorem 2.2 it has been shown that, within the class , the Good-Turing estimator achieves the best convergence rate possible, while for the class , this rate is the best up to a slowly varying function. An open problem is to understand whether this additional slowly varying term is intrinsic to the problem, or whether our results can be improved so that the rate of the Good-Turing estimator matches the best possible rate also within the class of regularly varying distributions. Under the restricted parameter spaces, in Theorem 2.3 we have provided a lower bound for the asymptotic risk. This bound can be used to compare estimators from a minimax point of view, by finding suitable upper bounds matching the lower bound rate. In particular, in Proposition 2.2 we have shown that the asymptotic rate of the risk of the Good-Turing estimator matches the lower bound rate, up to a slowly varying function. However, the rate of Proposition 2.2 is a pointwise result, for a fixed . An open problem is to extend Proposition 2.2 to the uniform case, when considering the supremum of the risk over all . This extension probably requires a careful analysis and control of the size of this parameter space . Work on this is ongoing.
4 Proofs
In this section we will prove all the theorems stated in Section 2. The proofs of some technical auxiliary results are postponed to Appendix 5. We start with a simple lemma that will be useful in the sequel.
Lemma 4.1
For and , implies
Proof. Let be positive real numbers, and suppose . Straightforwardly, we have that
(10) 
From the lower bound of (10), and, therefore, . Multiplying (10) by this last inequality, we have . ∎
4.1 Proof of Theorem 2.1
We are going to show that for every estimator , there exists such that
(11) 
and, therefore, there exists such that does not converge to zero in probability.
First, note that for , Lemma 4.1 implies that
(12) 
so it is sufficient to show that there exists such that for every estimator ,
(13) 
We will prove that (13) holds for all (and therefore (11) holds for any ). Let . Let denote the Dirichlet process measure on (Ferguson (1973)), with base measure on . We choose uniform, i.e. . Now, we can lower bound the supremum in (13) by an average over with respect to and then swap the order of integration by Fubini's theorem; therefore
where the first inequality follows since we can lower bound the supremum by an average, the second from the reverse Fatou lemma, the equality comes from swapping the marginal of and the conditional of given with the marginal of , denoted , and the conditional of given , and the last inequality follows since we are considering the infimum over all possible values of . Also recall that, when is distributed as , the marginal of , , is a generalized Pólya urn, while the conditional of given is (see Theorem 4.6 and Subsection 4.1.4 of Ghosal and van der Vaart (2017)).
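The Pólya urn description of the marginal law of the sample is easy to simulate. The following is a minimal sketch under stated assumptions: `theta` denotes the total mass of the Dirichlet process (a symbol introduced here for illustration), the base distribution is uniform on $[0,1]$, and the function name is hypothetical. It relies on the standard fact that, for a diffuse base measure, the posterior of the missing mass given $n$ observations is Beta with parameters `theta` and `n`, whose mean is `theta / (theta + n)`.

```python
import random


def polya_urn_sample(n, theta, rng):
    """Draw X_1, ..., X_n from the Polya urn scheme of a Dirichlet process
    with total mass theta > 0 and uniform base distribution on [0, 1)."""
    values = []
    for i in range(n):
        # After i draws, a fresh value appears with probability theta / (theta + i).
        if rng.random() < theta / (theta + i):
            values.append(rng.random())         # new atom from the uniform base
        else:
            values.append(rng.choice(values))   # repeat a past draw, chosen uniformly
    return values


rng = random.Random(0)
theta, n = 1.0, 1000
sample = polya_urn_sample(n, theta, rng)
distinct = len(set(sample))

# Mean of the Beta(theta, n) posterior of the missing mass.
posterior_mean_missing = theta / (theta + n)
print(distinct, posterior_mean_missing)
```

The posterior of the missing mass depends on the data only through the sample size, which is what makes the Dirichlet process convenient for the averaging argument used in this proof.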
From Proposition 5.2 (Appendix 5), under the posterior distribution is distributed according to a Beta random variable . Therefore,
(14) 
where . We are now going to lower bound the probability of the event on the right hand side of (14). First let us consider and
(15)  
(16)  
where we have used in (15), and in (16) the fact that the maximum of the function is achieved at . Now let ; noticing that , it follows that
Therefore, for all and . Plugging this estimate in place of (14), we obtain
and the right hand side is strictly positive for all .
4.2 Proof of Theorem 2.2
Let be any nonnegative sequence converging to . We will show that for any estimator ,
(17) 
Let us denote by the law of a stable process on of parameter . This is a subordinator with Lévy intensity . See Kingman (1975), Lijoi and Prünster (2010) and Pitman (2006) for details and additional references. Because of , the stable process samples probability measures belonging to . Now we can upper bound the infimum in (17) by an average with respect to ,
where the last equality follows by applying Fatou’s Lemma.
Take large enough so that . Let us denote by the marginal law of the observations under an stable process, when is integrated out, i.e. the probability measure on defined as for all . We swap the integration of the marginal of and the conditional of given with the marginal of and the conditional of given and then apply Lemma 4.1 to obtain
where denotes the posterior of given the sample. Therefore, taking , we can upper bound the quantity appearing on the l.h.s. of (17) by
(18) 
We will upper bound the two terms of the sum in (18) independently. Let us focus on the first term of (18). Let be large enough so that and . From Proposition 5.2, under the posterior , is distributed according to a Beta random variable . Let us denote , , and for ease of notation we will simply write and in the following calculations. Also let
be the cumulative distribution function of the Beta random variable . From Proposition 5.2 in the Appendix, we have that

Consider the function defined by
Notice that and that . Therefore, reaches its maximum in (denoted for easiness of notation) satisfying
where denotes the density function of the distribution. On the event , we have , and so is bell-shaped with second inflection point
Therefore, is nondecreasing on the interval and, as a consequence, is nonnegative on , from which we can deduce that is nondecreasing on the same interval. Now, since , it follows that on . Therefore,
We can now upper bound as follows: