A Good-Turing estimator for feature allocation models

02/27/2019
by   Fadhel Ayed, et al.

Feature allocation models generalize species sampling models by allowing every observation to belong to more than one species, now called features. Under the popular Bernoulli product model for feature allocation, given n samples, we study the problem of estimating the missing mass M_n, namely the expected number of hitherto unseen features that would be observed if one additional individual were sampled. This is motivated by numerous applied problems where the sampling procedure is expensive, in terms of time and/or financial resources, and further sampling can be motivated only by the possibility of recording new, previously unobserved features. We introduce a simple, robust and theoretically sound nonparametric estimator M̂_n of M_n. M̂_n turns out to have the same analytic form as the popular Good-Turing estimator of the missing mass in species sampling models, with the difference that the two estimators have different ranges. We show that M̂_n admits a natural interpretation both as a jackknife estimator and as a nonparametric empirical Bayes estimator; we give provable guarantees for the performance of M̂_n in terms of minimax rate optimality; and we provide an interesting connection between M̂_n and the Good-Turing estimator for species sampling. Finally, we derive non-asymptotic confidence intervals for M̂_n, which are easily computable and do not rely on any asymptotic approximation. Our approach is illustrated with synthetic data and SNP data from the ENCODE sequencing genome project.
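
As a rough illustration of the "same analytic form" remark, the sketch below assumes the estimator takes the classical Good-Turing shape: the number of features observed in exactly one of the n samples, divided by n. The abstract does not spell out the formula, so this is an assumption rather than the authors' stated definition. The function name `good_turing_missing_mass` and the representation of each sample as a set of feature labels are hypothetical choices made for this example.

```python
from collections import Counter
from typing import Hashable, Iterable, Sequence


def good_turing_missing_mass(samples: Sequence[Iterable[Hashable]]) -> float:
    """Good-Turing-style estimate of the missing mass for feature data.

    Assumption (not stated in the abstract): the estimate is
    (number of features seen in exactly one sample) / n.
    Each element of `samples` lists the features displayed by one individual.
    """
    n = len(samples)
    if n == 0:
        raise ValueError("at least one sample is required")

    # Count, for every feature, how many individuals display it.
    # set(...) guards against a feature being listed twice for one individual.
    feature_counts = Counter(
        feature for sample in samples for feature in set(sample)
    )

    # Features with frequency one play the role of "singletons".
    singletons = sum(1 for count in feature_counts.values() if count == 1)
    return singletons / n


# Four individuals: "a" appears three times, "b", "c", "d" once each.
example = [{"a", "b"}, {"a", "c"}, {"a"}, {"d"}]
estimate = good_turing_missing_mass(example)  # 3 singletons / 4 samples

# A single individual with three features gives an estimate above 1.
single = good_turing_missing_mass([{"a", "b", "c"}])  # 3 singletons / 1 sample
```

The second call hints at why the ranges differ: in species sampling each observation contributes one species, so the singleton proportion cannot exceed 1, whereas an individual may display several new features, so under this assumed form the estimate is not bounded by 1.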


