Given samples from a population of individuals belonging to different types with unknown proportions, how do we estimate the probability of discovering a new type at the (n+1)-th draw? This is a classical problem in statistics, commonly referred to as the missing mass estimation problem. It first appeared in ecology (e.g., Fisher et al. (1943) and Good (1953)), and its importance has grown considerably in recent years, driven by challenging applications in a wide range of scientific disciplines, such as the biological and physical sciences (e.g., Kroes et al. (1999), Gao et al. (2007) and Ionita-Laza et al. (2009)), machine learning and computer science (e.g., Motwani and Vassilvitskii (2006) and Bubeck et al. (2013)), and information theory (e.g., Orlitsky et al. (2014) and Ben-Hamou et al. (2018)). To move into a concrete setting, let P = Σ_{j≥1} p_j δ_{s_j} be an unknown discrete distribution, where (s_j)_{j≥1} is a sequence of atoms on some measurable space and the (p_j)_{j≥1} are the corresponding probability masses, i.e. such that Σ_{j≥1} p_j = 1. If (X_1, …, X_n) is a collection of independent and identically distributed random variables from P, then we define the missing mass as

M_n = Σ_{j≥1} p_j 1(s_j ∉ {X_1, …, X_n}),    (1)

where 1(·) is the indicator function. Among the various nonparametric estimators of the missing mass, both frequentist and Bayesian, the Good-Turing estimator (Good (1953)) is arguably the most popular. It has been the subject of numerous studies, most of them in recent years. These include, e.g., asymptotic normality and large deviations (Zhang and Zhang (2009) and Gao (2013)), admissibility and concentration properties (McAllester and Ortiz (2003), Ohannessian and Dahleh (2012) and Ben-Hamou et al. (2017)), consistency and convergence rates (McAllester and Schapire (2000), Wagner et al. (2006) and Mossel and Ohannessian (2015)), and optimality and minimax properties (Orlitsky et al. (2013) and Rajaraman et al. (2017)).
Under the setting depicted above, let M̂_n denote an estimator of M_n. Motivated by the recent works of Ohannessian and Dahleh (2012), Mossel and Ohannessian (2015) and Ben-Hamou et al. (2017), in this paper we consider the problem of consistent estimation of the missing mass under the multiplicative loss function

ℓ(M̂_n, M_n) = |M̂_n / M_n − 1|.    (2)

As discussed in Ohannessian and Dahleh (2012), the loss function (2) is adequate for estimating small-valued parameters, in the sense that it allows one to achieve more informative results. Such a loss function has already been used in statistics, e.g. for the estimation of small value probabilities using importance sampling (Chatterjee and Diaconis (2018)) and for the estimation of tail probabilities in extreme value theory (Beirlant and Devroye (1999)). Under the loss function (2), Ohannessian and Dahleh (2012) showed that: i) the Good-Turing estimator may be inconsistent; ii) the Good-Turing estimator is strongly consistent if the tail of P decays to zero as a regularly varying function with parameter α ∈ (0, 1) (Bingham et al. (1987)). See also Ben-Hamou et al. (2017) for further results on missing mass estimation under regularly varying P. Mossel and Ohannessian (2015) then strengthened the inconsistency result of Ohannessian and Dahleh (2012), showing the impossibility of estimating (learning) the missing mass in a completely distribution-free fashion, that is without imposing further structural assumptions on P.
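To see concretely why this loss is the right yardstick for small parameters, the following sketch (assuming the loss has the form ℓ(M̂, M) = |M̂/M − 1| displayed in (2)) contrasts it with absolute error:

```python
def multiplicative_loss(m_hat, m):
    """Multiplicative loss |m_hat/m - 1|: a relative, not absolute, notion of error.

    The missing mass typically vanishes as n grows, so an estimator can have
    a tiny absolute error while being off by an order of magnitude relatively.
    """
    return abs(m_hat / m - 1.0)

# Two estimates, both with absolute error 0.009:
print(multiplicative_loss(0.010, 0.001))  # small target: loss ~ 9.0
print(multiplicative_loss(0.509, 0.500))  # moderate target: loss ~ 0.018
```

Note also that the trivial estimator M̂_n = 0 has vanishing absolute error whenever M_n → 0, yet its multiplicative loss is identically 1; consistency under (2) is therefore a genuinely stronger requirement.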
We present an alternative, and simpler, proof of the result of Mossel and Ohannessian (2015). Our proof relies on tools from Bayesian nonparametrics, and in particular on the use of a Dirichlet prior (Ferguson (1973)) for the unknown distribution P. This allows us to exploit properties of the posterior distribution of P to prove the impossibility of a distribution-free estimation of the missing mass, thus avoiding the winding (geometric) coupling argument of Mossel and Ohannessian (2015). To the best of our knowledge, the use of Bayesian ideas to study large sample asymptotics for the missing mass is new, and it could be of independent interest. Motivated by the work of Ohannessian and Dahleh (2012) and Ben-Hamou et al. (2017), we then investigate convergence rates and minimax rates for the Good-Turing estimator under the class of regularly varying P. We again rely on tools from Bayesian nonparametrics, thus providing an original approach to tackle these problems. In particular, we make use of the two-parameter Poisson-Dirichlet prior (Perman et al. (1992) and Pitman and Yor (1997)) for the unknown distribution P, which is known to generate (almost surely) discrete distributions whose tail decays to zero as a regularly varying function with parameter α. See Gnedin et al. (2007) and references therein. This allows us to exploit properties of the posterior distribution of P to prove that: i) the convergence rate of the Good-Turing estimator is the best rate that any estimator of the missing mass can achieve, up to a slowly varying function; ii) the minimax rate admits an explicit lower bound. We conclude with a discussion on the problem of deriving the minimax rate of the Good-Turing estimator, conjecturing that the Good-Turing estimator is an asymptotically optimal minimax estimator under the class of regularly varying P.
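The behaviour induced by the two-parameter Poisson-Dirichlet prior can be illustrated by simulating its predictive (urn) scheme. The sketch below is only illustrative: the discount 0.5 and strength 1.0 are arbitrary choices, and the scheme shown is the standard two-parameter Chinese-restaurant construction, under which the number of distinct observed values grows like n^α for discount α:

```python
import random

def pitman_yor_urn(alpha, theta, n, rng):
    """Two-parameter (Pitman-Yor) urn with discount alpha and strength theta.

    Observation i+1 takes a brand-new value with probability
    (theta + alpha*k) / (theta + i), where k is the current number of
    distinct values; an existing value t is repeated with probability
    proportional to (count_t - alpha).
    """
    counts = []   # counts[t] = occurrences of the t-th distinct value
    labels = []   # the sampled sequence, encoded by order of first appearance
    for i in range(n):
        k = len(counts)
        if rng.random() * (theta + i) < theta + alpha * k:
            counts.append(1)          # a new "species" is discovered
            labels.append(k)
        else:
            r = rng.random() * (i - alpha * k)
            t, acc = 0, counts[0] - alpha
            while acc <= r and t + 1 < k:
                t += 1
                acc += counts[t] - alpha
            counts[t] += 1
            labels.append(t)
    return labels, counts

rng = random.Random(7)
labels, counts = pitman_yor_urn(alpha=0.5, theta=1.0, n=5000, rng=rng)
print(len(counts), "distinct values;", 5000 ** 0.5, "= n**alpha for comparison")
```

The power-law growth of the number of distinct values mirrors the regularly varying tails that the prior generates almost surely.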
The paper is structured as follows. In Section 2 we state our main results on convergence rates and minimax rates for the Good-Turing estimator under regularly varying distributions P. Proofs of these results, as well as the alternative proof of the result of Mossel and Ohannessian (2015), are provided in Section 4. In Section 3 we discuss open problems and possible future developments on missing mass estimation. Auxiliary results and technical lemmas are deferred to Appendix 5. The following notation is adopted throughout the paper:
[0, 1] is the unit interval, and B([0, 1]) its Borel σ-algebra;
𝒫 is the space of discrete distributions on [0, 1], endowed with the smallest σ-algebra making the map P ↦ P(A) measurable for every Borel set A;
P^(n) is the n-fold product of P on [0, 1]^n, and E the expectation with respect to it; for ease of notation, we will use E to denote also the expectation with respect to P;
L is a generic slowly varying function, i.e. a function satisfying L(cx)/L(x) → 1 as x → ∞ for every c > 0;
C denotes a generic strictly positive constant whose value can change within the calculations and across distinct statements;
Given a sequence of probabilities (p_j)_{j≥1}, (p_{(j)})_{j≥1} denotes the corresponding non-increasing ordered sequence, i.e. p_{(1)} ≥ p_{(2)} ≥ ⋯;
Given two functions and , stands for , for , for ;
B(a, b) = ∫_0^1 x^{a−1} (1 − x)^{b−1} dx is the Beta integral with parameters a and b.
2 Main results
Let (X_1, …, X_n) be a collection of independent and identically distributed random variables from an unknown discrete distribution P. The actual values taken by the observations, the X_i's, are irrelevant for the missing mass estimation problem and, without loss of generality, they can be assumed to take values in the unit interval. Therefore, P is supposed to be a discrete distribution on the sample space [0, 1], given a sequence of atoms (s_j)_{j≥1} and masses (p_j)_{j≥1} such that Σ_{j≥1} p_j = 1. Both the atoms and the masses of the distribution are assumed to be unknown. Given the sample (X_1, …, X_n), we are interested in estimating the missing mass M_n defined in (1), which turns out to be a jointly measurable function of P and (X_1, …, X_n), as proved in Proposition 5.1. Given an estimator M̂_n of M_n, we will measure its statistical performance by using the multiplicative loss function defined in (2). As discussed in the introduction, this loss function is suitable for studying theoretical properties of parameters or functionals taking small values, and it has already been used in previous works on missing mass estimation, e.g., Ohannessian and Dahleh (2012), Mossel and Ohannessian (2015) and Ben-Hamou et al. (2017).
A sequence of estimators (M̂_n)_{n≥1} is said to be consistent for M_n under a given parameter space and the loss function (2) if the loss incurred by the estimator converges in probability to zero under all points in the parameter space. Formally, (M̂_n)_{n≥1} is consistent if, for every distribution P in the parameter space and every ε > 0,
Mossel and Ohannessian (2015) proved this result by exploiting a coupling of two generalized (dithered) geometric distributions. In Section 4 we present an alternative proof of Theorem 2.1. While the proof of Mossel and Ohannessian (2015) has the merit of being constructive, our approach has the merit of being simpler, and it provides a new way to tackle this type of problem, relying mainly on Bayesian nonparametric techniques. Similar Bayesian nonparametric arguments will then be crucial in studying the convergence rates and minimax rates of the Good-Turing estimator under the class of regularly varying P.
Roughly speaking, Theorem 2.1 shows that, for an asymptotic result to hold uniformly over a set of possible distributions, the parameter space must be restricted to a suitable subclass. In particular, from the proof of Theorem 2.1 we see that some conditions have to be imposed on the tail decay of the elements of the parameter space. Specifically, from the proof of Theorem 2.1 we deduce that there are no consistent estimators for the class of distributions sampled from a Dirichlet process. From Kingman (1975) (Equation 65), we have that, if P is sampled from a Dirichlet process, its sequence of ordered masses decays geometrically fast as j → ∞. Therefore, the tail of P has approximately exponential form, resembling a geometric distribution. Indeed, a geometric distribution was used in Ohannessian and Dahleh (2012) as an example to prove that the Good-Turing estimator can be inconsistent. Theorem 2.1 shows that, under this very light-tailed regime, any estimator of the missing mass, not just the Good-Turing one, fails to be consistent under the multiplicative loss. This motivates us to consider the class of distributions P having heavy enough tails, which will be the subject of the rest of this section.
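The light tail of Dirichlet-process draws can be checked numerically by stick-breaking: a draw from a Dirichlet process has weights w_k = v_k ∏_{j<k} (1 − v_j) with v_k i.i.d. Beta(1, θ). The sketch below is illustrative only, with an arbitrary choice θ = 1.0; Beta(1, θ) is sampled by inversion, v = 1 − u^{1/θ}:

```python
import random

def dirichlet_stick_breaking(theta, n_atoms, rng):
    """First n_atoms stick-breaking weights of a Dirichlet process DP(theta)."""
    weights, remaining = [], 1.0
    for _ in range(n_atoms):
        v = 1.0 - rng.random() ** (1.0 / theta)   # v ~ Beta(1, theta) by inversion
        weights.append(remaining * v)
        remaining *= 1.0 - v
    return weights

rng = random.Random(0)
w = sorted(dirichlet_stick_breaking(theta=1.0, n_atoms=200, rng=rng), reverse=True)

# The ordered weights decay roughly geometrically: successive ratios of the
# ordered masses stay below 1, i.e. the tail is exponentially light.
ratios = [w[k + 1] / w[k] for k in range(20, 40)]
print(min(ratios), max(ratios))
```

Such light tails are exactly the regime that Theorem 2.1 rules out, motivating the regularly varying (heavy-tailed) classes considered next.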
2.1 Consistency under regularly varying P
In this section we recall the Good-Turing estimator (Good (1953)) of the missing mass M_n, and we study its convergence rate and minimax risk for regularly varying P. The Good-Turing estimator uses the proportion of unique values in the sample to estimate the missing mass. Let C_{j,n} be the number of times the value s_j is observed in the sample (X_1, …, X_n), i.e.,

C_{j,n} = Σ_{1≤i≤n} 1(X_i = s_j).

Furthermore, let N_{r,n} and K_n be the number of values observed exactly r times and the total number of distinct values, respectively, observed in (X_1, …, X_n), i.e.,

N_{r,n} = Σ_{j≥1} 1(C_{j,n} = r)   and   K_n = Σ_{r≥1} N_{r,n}.

The Good-Turing estimator of M_n is defined in terms of the statistic N_{1,n}, that is,

M̂_n = N_{1,n} / n.
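The estimator is straightforward to compute from the sample; a minimal sketch of the form M̂_n = N_{1,n}/n just recalled:

```python
from collections import Counter

def good_turing_missing_mass(sample):
    """Good-Turing estimate of the missing mass: (number of singletons) / n."""
    counts = Counter(sample)
    n_1 = sum(1 for c in counts.values() if c == 1)   # N_{1,n}: values seen exactly once
    return n_1 / len(sample)

sample = ["a", "b", "a", "c", "d", "d", "e"]
# Singletons are b, c, e, so the estimate is 3/7.
print(good_turing_missing_mass(sample))
```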
Ohannessian and Dahleh (2012) first showed the inconsistency of the Good-Turing estimator under the choice of P being a geometric distribution. In the same paper, it is shown that, under the assumption that the tail of P decays to zero as a regularly varying function with parameter α ∈ (0, 1), the Good-Turing estimator is strongly consistent. This latter result was subsequently generalized in Ben-Hamou et al. (2017).
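This dichotomy is easy to reproduce in a small simulation. The sketch below is purely illustrative (the choices p_j ∝ j^{-2} for the heavy-tailed case, corresponding to α = 1/2, and p_j ∝ 2^{-j} for the geometric case, as well as the sample size, are arbitrary); it compares the multiplicative loss of the Good-Turing estimator under the two regimes:

```python
import random
from collections import Counter

def good_turing(sample):
    counts = Counter(sample)
    return sum(1 for c in counts.values() if c == 1) / len(sample)

def true_missing_mass(sample, probs):
    seen = set(sample)
    return sum(p for j, p in enumerate(probs) if j not in seen)

rng = random.Random(42)
n = 10000

# Heavy (power-law) tail: p_j proportional to j**-2, i.e. tail index alpha = 1/2.
power = [j ** -2.0 for j in range(1, 20001)]
Z = sum(power)
power = [p / Z for p in power]

# Light (geometric) tail: p_j proportional to 2**-j.
geom = [2.0 ** -j for j in range(1, 60)]
Zg = sum(geom)
geom = [p / Zg for p in geom]

losses = {}
for name, probs in [("power-law", power), ("geometric", geom)]:
    sample = rng.choices(range(len(probs)), weights=probs, k=n)
    m, m_hat = true_missing_mass(sample, probs), good_turing(sample)
    losses[name] = abs(m_hat / m - 1.0)
    print(name, "multiplicative loss:", losses[name])
```

A single run is of course only suggestive, but on typical runs the loss is modest in the power-law case and fluctuates wildly in the geometric one, in line with the results just described.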
The assumption of regularly varying P is a generalization of power-law tail decay, adding flexibility through the introduction of the slowly varying function L. Power-law distributions are observed in the empirical distributions of many quantities in different applied areas, and their study has attracted a lot of interest in recent years. For extensive discussions of power laws in empirical data and their properties, the reader is referred to Mitzenmacher (2014), Goldwater et al. (2006), Newman (2003), Clauset et al. (2009) and Sornette (2006). Restricting the parameter space to probability distributions having regularly varying tails is not a mere technical assumption; on the contrary, it represents a natural subset of the parameter space to consider, which we expect to contain the true data generating distribution in many different applications.
To move into the concrete setting of regular variation (Bingham et al. (1987)), for every P we define a counting measure ν on (0, ∞) as ν = Σ_{j≥1} δ_{p_j}, with corresponding tail function ν̄ defined as ν̄(x) = ν([x, ∞)) = #{j ≥ 1 : p_j ≥ x} for all x > 0. Then a distribution P is said to be regularly varying with parameter α ∈ (0, 1) if

ν̄(x) ~ x^{−α} L(1/x)   as x → 0,    (6)

where L is a slowly varying function depending on P. We denote by 𝒫_α the set of all regularly varying distributions on [0, 1] with parameter α. From (6) it is clear that such a class includes distributions having power-law tail decay, which correspond to the particular case of L being constant, i.e. ν̄(x) ≍ x^{−α} as x → 0. We denote the class of distributions having power-law tail decay by 𝒫'_α. In the following results, we will restrict our attention to the estimation problem under the restricted parameter spaces 𝒫_α and 𝒫'_α.
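The tail condition above can be checked numerically for a concrete power-law distribution. In the sketch below (an illustration with the arbitrary choice α = 1/2, i.e. p_j ∝ j^{-1/α} = j^{-2}), the rescaled counting tail ν̄(x)·x^α is roughly constant as x → 0, as required when the slowly varying factor is constant:

```python
def tail_count(probs, x):
    """Counting tail function: number of atoms whose mass is at least x."""
    return sum(1 for p in probs if p >= x)

alpha = 0.5
raw = [j ** (-1.0 / alpha) for j in range(1, 100001)]   # p_j proportional to j^{-2}
Z = sum(raw)
probs = [p / Z for p in raw]

for x in (1e-4, 1e-5, 1e-6):
    print(x, tail_count(probs, x) * x ** alpha)   # roughly constant in x
```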
From Ohannessian and Dahleh (2012) it is known that the Good-Turing estimator is consistent under all P belonging to the regularly varying class 𝒫_α. Focusing attention on the power-law class 𝒫'_α, in the next proposition we refine the result of Ohannessian and Dahleh (2012) by studying the rate at which the multiplicative loss of the Good-Turing estimator converges to zero. A sequence (r_n)_{n≥1} is a convergence rate of an estimator M̂_n for the distribution P if

P^(n)( ℓ(M̂_n, M_n) ≥ z_n r_n ) → 0 as n → ∞

for all sequences z_n diverging to infinity. The next proposition identifies the rate of convergence of the Good-Turing estimator. The proof is omitted because it follows from Proposition 2.2 below together with a simple application of Markov's inequality.
As a further result on the convergence rate of the Good-Turing estimator, in Theorem 2.2 we show that the convergence rate achieved by the Good-Turing estimator is actually almost the best convergence rate that any estimator of M_n can achieve. Specifically, for any other estimator, it is possible to find a distribution in the class for which the rate of convergence is no faster than that of the Good-Turing estimator, up to a slowly varying function.
For any estimator M̂_n, there exists P such that for every n
Therefore, the convergence rate of M̂_n cannot be faster than the rate above.
Proposition 2.1 and Theorem 2.2 together show that the Good-Turing estimator achieves the best convergence rate up to, possibly, a slowly varying function. In particular, if the distribution P has power-law decay, the two rates match and the Good-Turing estimator achieves the best rate possible. Moreover, because the Good-Turing estimator does not depend on α, it follows that it is actually rate adaptive for the class of power-law distributions. However, for a general regularly varying P we do not know whether the two rates of Proposition 2.1 and Theorem 2.2 may be improved to make them match.
As a final result, in the next theorem we consider the asymptotic minimax estimation risk for the missing mass under the loss function (2) and with the power-law parameter space. Theorem 2.3 provides a lower bound on the estimation risk of this statistical problem, thereby bounding the minimax rate from below.
Let 𝒫'_α be the class of discrete distributions on [0, 1] with power-law tail function and let ℓ denote the multiplicative loss function (2). Then, there exists a positive constant C such that
where the infimum is taken over all possible estimators M̂_n.
The lower bound of Theorem 2.3 can be used to derive the minimax rate, by matching it with appropriate upper bounds for specific estimators of the missing mass. This lower bound trivially still holds for any parameter space larger than the power-law class and, therefore, the theorem also provides a lower bound on the estimation risk under the larger regularly varying parameter space. In the next proposition, we show that for a fixed distribution P, the Good-Turing estimator achieves the best possible rate of Theorem 2.3 up to a slowly varying term.
Let M̂_n be the Good-Turing estimator and let P be a regularly varying distribution. Then, there exists a finite constant C such that for every n

where L is the slowly varying function specific to P appearing in (5).
Extending Proposition 2.2 to hold uniformly over the class is an open problem, and it probably requires a careful control over the size of the class. Indeed, the classes of distributions we are considering are defined through the asymptotic properties of their elements, while to obtain minimax results we need control at every fixed sample size. Even though Proposition 2.2 does not directly provide the minimax rate of the Good-Turing estimator, it still provides a sanity check for its asymptotic risk. Specifically, Proposition 2.2 implies that for every P in the class,
Moreover, with a minor change at the beginning of the proof of Theorem 2.3, we can also prove that for every estimator and every sequence diverging to infinity, we can find an element of the class along which the corresponding lower bound holds. This leads us to conjecture that the Good-Turing estimator should be a rate optimal minimax estimator.
In this paper we have considered the problem of consistent estimation of the missing mass under a suitable multiplicative loss function. We have presented an alternative, and simpler, proof of the result by Mossel and Ohannessian (2015) on the impossibility of a distribution-free estimation of the missing mass. Our proof relies on novel arguments from Bayesian nonparametric statistics, which are then exploited to study convergence rates and minimax rates of the Good-Turing estimator under the class of regularly varying P. In Proposition 2.1 and Theorem 2.2 it has been shown that, within the power-law class, the Good-Turing estimator achieves the best convergence rate possible, while for the larger regularly varying class this rate is the best possible up to a slowly varying function. An open problem is to understand whether this additional slowly varying term is intrinsic to the problem, or whether our results can be improved to make the rate of the Good-Turing estimator match the best possible rate also within the class of regularly varying distributions. Under the restricted parameter spaces, in Theorem 2.3 we have provided a lower bound for the asymptotic risk. This bound can be used to compare estimators from a minimax point of view, by finding suitable upper bounds matching the lower bound rate. In particular, in Proposition 2.2 we have shown that the asymptotic rate of the risk of the Good-Turing estimator matches the lower bound rate, up to a slowly varying function. However, the rate of Proposition 2.2 is a pointwise result, holding for a fixed P. An open problem is to extend Proposition 2.2 to the uniform case, considering the supremum of the risk over the whole class. This extension probably requires a careful analysis and control of the size of the parameter space. Work on this is ongoing.
In this section we will prove all the theorems stated in Section 2. The proofs of some technical auxiliary results are postponed to Appendix 5. We start with a simple lemma that will be useful in the sequel.
For and , implies
Proof. Let be positive real numbers, and suppose . Straightforwardly, we have that
4.1 Proof of Theorem 2.1
We are going to show that for every estimator , there exists such that
and, therefore, there exists such that does not converge to zero in probability.
First, note that for , Lemma 4.1 implies that
so it is sufficient to show that there exists such that for every estimator ,
We will prove that (13) holds for all n (and therefore (11) holds for any estimator). Let Π denote the Dirichlet process measure (Ferguson (1973)), with base measure on [0, 1], which we choose to be uniform. Now, we can lower bound the supremum in (13) by an average over P with respect to Π and then swap the order of integration by Fubini's theorem; therefore
where the first inequality follows since we can lower bound the supremum by an average; the second from the reverse Fatou lemma; the equality comes from swapping the marginal of P and the conditional of (X_1, …, X_n) given P with the marginal of (X_1, …, X_n) and the conditional of P given (X_1, …, X_n); and the last inequality follows since we are considering the infimum over all possible values of the estimator. Also recall that, when P is distributed as Π, the marginal of (X_1, …, X_n) is a generalized Pólya urn, while the conditional of P given (X_1, …, X_n) is again a Dirichlet process (see Theorem 4.6 and Subsection 4.1.4 of Ghosal and van der Vaart (2017)).
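The generalized Pólya urn appearing in this argument (the Blackwell-MacQueen marginal of the Dirichlet process) is easy to simulate; a sketch, with the concentration parameter θ an arbitrary illustrative input:

```python
import random

def polya_urn_sample(theta, n, rng):
    """Blackwell-MacQueen urn: marginal law of n draws when P ~ DP(theta).

    Draw i+1 is a brand-new value with probability theta / (theta + i);
    otherwise it repeats one of the first i draws, chosen uniformly.
    """
    draws, next_label = [], 0
    for i in range(n):
        if rng.random() < theta / (theta + i):
            draws.append(next_label)         # a new value, with a fresh label
            next_label += 1
        else:
            draws.append(rng.choice(draws))  # repeat a past draw
    return draws

rng = random.Random(1)
draws = polya_urn_sample(theta=1.0, n=1000, rng=rng)
# With theta = 1 the number of distinct values grows like log(n),
# reflecting the exponentially light tails of Dirichlet-process draws.
print(len(set(draws)))
```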
where . We are now going to lower bound the probability of the event on the right hand side of (14). First let us consider and
Therefore, for all and . Plugging this estimate into (14), we obtain
and the right-hand side is strictly positive for all .
4.2 Proof of Theorem 2.2
Let (z_n)_{n≥1} be any non-negative sequence as in the statement of the theorem. We will show that for any estimator M̂_n,
Let us denote by Π_α the law of a stable process on [0, 1] of parameter α. This is a subordinator with stable Lévy intensity; see Kingman (1975), Lijoi and Prünster (2010) and Pitman (2006) for details and additional references. Because of the polynomial decay of its jumps, the normalized stable process samples probability measures belonging to the regularly varying class with parameter α. Now we can upper bound the infimum in (17) by an average with respect to Π_α,
where the last equality follows by applying Fatou’s Lemma.
Take n large enough so that . Let us denote by the marginal law of the observations under an α-stable process prior, when P is integrated out, i.e. the probability measure on defined as for all . We swap the integration of the marginal of P and the conditional of (X_1, …, X_n) given P with the marginal of (X_1, …, X_n) and the conditional of P given (X_1, …, X_n), and then apply Lemma 4.1 to obtain
where denotes the posterior of given the sample. Therefore, taking , we can upper bound the quantity appearing on the l.h.s. of (17) by
We will upper-bound the two terms of the sum in (18) separately. Let us focus on the first term of (18). Let n be large enough so that and . From Proposition 5.2, under the posterior, the missing mass M_n is distributed as a Beta random variable. Let us denote , , and for ease of notation we will simply write and in the following calculations. Also let
be the cumulative distribution function of the beta random variable. From Proposition 5.2 in the Appendix, we have that
Consider the function defined by
Notice that and that . Therefore, the function reaches its maximum at a point (denoted so for ease of notation) satisfying
where denotes the density function of the distribution. On the event , we have , and so is bell-shaped with second inflexion point,
Therefore, it is non-decreasing on the interval and, as a consequence, is non-negative on , from which we can deduce that is non-decreasing on the same interval. Now, since , it follows that on . Therefore,
We can now upper-bound as follows: