Consistent estimation of the missing mass for feature models

by Fadhel Ayed, et al.

Feature models are popular in machine learning and have recently been used to solve many unsupervised learning problems. In these models, every observation is endowed with a finite set of features, usually selected from an infinite collection (F_j)_{j ≥ 1}. Every observation displays feature F_j with an unknown probability p_j. A statistical problem inherent to these models is how to estimate, given an initial sample, the conditional expected number of hitherto unseen features that will be displayed in a future observation. This is usually referred to as the missing mass problem. In this work we prove that, under a suitable multiplicative loss function and without any assumptions on the parameters p_j, no universally consistent estimator of the missing mass exists. In the second part of the paper, we focus on a special class of heavy-tailed probabilities (p_j)_{j ≥ 1}, which are common in many real applications, and we show that, within this restricted class, the nonparametric estimator of the missing mass suggested by Ayed et al. (2017) is strongly consistent. As a byproduct, we derive concentration inequalities for the missing mass and for the number of features observed with a specified frequency in a sample of size n.
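To make the quantities in the abstract concrete, the following is a minimal simulation sketch, not the authors' code. It assumes that the missing mass is the sum of p_j over features not yet seen in the sample, and that the nonparametric estimator is the number of features seen exactly once divided by n. Both are assumptions about the definitions, which the abstract does not spell out. The function name `simulate_missing_mass`, the truncation to finitely many features, and the choice p_j = 0.5 · j^(-1.5) are hypothetical and for illustration only.

```python
import random


def simulate_missing_mass(probs, n, seed=0):
    """Simulate n observations of a (truncated) feature model.

    Each observation displays feature j independently with probability
    probs[j]. Returns (missing_mass, estimate), where
      missing_mass = sum of probs[j] over features never displayed, and
      estimate     = (number of features displayed exactly once) / n.
    The estimator form is an assumption, not taken from the abstract.
    """
    rng = random.Random(seed)
    counts = [0] * len(probs)
    for _ in range(n):
        for j, p in enumerate(probs):
            if rng.random() < p:
                counts[j] += 1
    missing_mass = sum(p for p, c in zip(probs, counts) if c == 0)
    singletons = sum(1 for c in counts if c == 1)
    return missing_mass, singletons / n


# Hypothetical heavy-tailed probabilities, truncated at 2000 features.
probs = [0.5 * j ** -1.5 for j in range(1, 2001)]
missing_mass, estimate = simulate_missing_mass(probs, n=200)
print(f"missing mass: {missing_mass:.4f}, estimate: {estimate:.4f}")
```

Rerunning with larger n and different seeds gives an informal look at how closely the estimate tracks the simulated missing mass for this particular heavy-tailed choice. It says nothing about the general case, for which the abstract states that no universally consistent estimator exists.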




