Optimal estimation of high-order missing masses, and the rare-type match problem

06/26/2023
by   Stefano Favaro, et al.

Consider a random sample $(X_1,\ldots,X_n)$ from an unknown discrete distribution $P=\sum_{j\geq 1}p_j\delta_{s_j}$ on a countable alphabet $\mathbb{S}$, and let $(Y_{n,j})_{j\geq 1}$ be the empirical frequencies of the distinct symbols $s_j$ in the sample. We consider the problem of estimating the $r$-order missing mass, which is a discrete functional of $P$ defined as $\theta_r(P;\mathbf{X}_n)=\sum_{j\geq 1}p_j^r I(Y_{n,j}=0)$. This is a generalization of the missing mass, whose estimation is a classical problem in statistics and has been the subject of numerous studies in both theory and methods. First, we introduce a nonparametric estimator of $\theta_r(P;\mathbf{X}_n)$ and a corresponding non-asymptotic confidence interval, obtained through concentration properties of $\theta_r(P;\mathbf{X}_n)$. Then, we investigate minimax estimation of $\theta_r(P;\mathbf{X}_n)$, which is the main contribution of our work. We show that minimax estimation is not feasible over the class of all discrete distributions on $\mathbb{S}$, and not even for distributions with regularly varying tails, which only guarantee that our estimator is consistent for $\theta_r(P;\mathbf{X}_n)$. This leads us to introduce the stronger assumption of second-order regular variation for the tail behaviour of $P$, which is proved to be sufficient for minimax estimation of $\theta_r(P;\mathbf{X}_n)$, making the proposed estimator an optimal minimax estimator of $\theta_r(P;\mathbf{X}_n)$. Our interest in the $r$-order missing mass arises from forensic statistics, where the estimation of the 2-order missing mass appears in connection with the estimation of the likelihood ratio $T(P,\mathbf{X}_n)=\theta_1(P;\mathbf{X}_n)/\theta_2(P;\mathbf{X}_n)$, known as the "fundamental problem of forensic mathematics". We present theoretical guarantees for nonparametric estimation of $T(P,\mathbf{X}_n)$.
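To make the definition concrete, the following is a minimal simulation sketch that evaluates the $r$-order missing mass directly from its formula. It is not the estimator proposed in the paper: it assumes $P$ is known, which is only possible in a simulation. The function name `missing_mass`, the truncated power-law choice of $P$, the alphabet size `K`, and the sample size are all illustrative assumptions.

```python
import random
from collections import Counter


def missing_mass(probs, sample, r=1):
    """Evaluate theta_r(P; X_n) = sum_j p_j^r * I(Y_{n,j} = 0).

    probs  : list of probabilities p_j, indexed by symbol j (assumed known here)
    sample : list of observed symbol indices
    r      : order of the missing mass
    """
    counts = Counter(sample)  # empirical frequencies Y_{n,j}
    # Sum p_j^r over the symbols that never appear in the sample.
    return sum(p ** r for j, p in enumerate(probs) if counts[j] == 0)


# Hypothetical P: power-law weights on K symbols (an assumption for
# illustration; the paper works with a countable alphabet).
K = 1000
weights = [1.0 / (j + 1) ** 2 for j in range(K)]
total = sum(weights)
probs = [w / total for w in weights]

rng = random.Random(0)  # fixed seed so the run is reproducible
sample = rng.choices(range(K), weights=probs, k=200)

theta_1 = missing_mass(probs, sample, r=1)  # classical missing mass
theta_2 = missing_mass(probs, sample, r=2)  # 2-order missing mass
ratio = theta_1 / theta_2                   # T(P, X_n) as defined in the abstract

print(theta_1, theta_2, ratio)
```

Because every $p_j \leq 1$, the 2-order missing mass can never exceed the 1-order missing mass, so the ratio computed here is at least 1 whenever some symbol is unseen.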
