Optimal estimation of high-order missing masses, and the rare-type match problem
Consider a random sample $(X_1,\ldots,X_n)$ from an unknown discrete distribution $P=\sum_{j\geq 1}p_j\delta_{s_j}$ on a countable alphabet $\mathbb{S}$, and let $(Y_{n,j})_{j\geq 1}$ be the empirical frequencies of the distinct symbols $s_j$ in the sample. We consider the problem of estimating the $r$-order missing mass, a discrete functional of $P$ defined as $\theta_r(P;\mathbf{X}_n)=\sum_{j\geq 1}p_j^r I(Y_{n,j}=0)$. This is a generalization of the missing mass, whose estimation is a classical problem in statistics and the subject of numerous studies in both theory and methods. First, we introduce a nonparametric estimator of $\theta_r(P;\mathbf{X}_n)$ and a corresponding non-asymptotic confidence interval, obtained through concentration properties of $\theta_r(P;\mathbf{X}_n)$. Then, we investigate minimax estimation of $\theta_r(P;\mathbf{X}_n)$, which is the main contribution of our work. We show that minimax estimation is not feasible over the class of all discrete distributions on $\mathbb{S}$, and not even over the class of distributions with regularly varying tails, an assumption that only guarantees that our estimator is consistent for $\theta_r(P;\mathbf{X}_n)$. This leads us to introduce the stronger assumption of second-order regular variation for the tail behaviour of $P$, which is proved to be sufficient for minimax estimation of $\theta_r(P;\mathbf{X}_n)$, making the proposed estimator an optimal minimax estimator of $\theta_r(P;\mathbf{X}_n)$. Our interest in the $r$-order missing mass arises from forensic statistics, where the estimation of the 2-order missing mass appears in connection with the estimation of the likelihood ratio $T(P,\mathbf{X}_n)=\theta_1(P;\mathbf{X}_n)/\theta_2(P;\mathbf{X}_n)$, known as the "fundamental problem of forensic mathematics". We present theoretical guarantees for nonparametric estimation of $T(P,\mathbf{X}_n)$.
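To make the definition concrete, the following minimal sketch evaluates the $r$-order missing mass and the ratio $T$ directly from their definitions, for a toy distribution that is assumed to be known and finitely supported. It is an oracle computation for illustration only, not the estimator proposed in the paper; the function name `missing_mass`, the dictionary `p`, and the toy sample are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def missing_mass(p: Dict[Hashable, float], sample: Sequence[Hashable], r: int = 1) -> float:
    """Evaluate theta_r(P; X_n) = sum_j p_j^r * I(Y_{n,j} = 0).

    Assumes P is known and has finite support (an illustrative
    simplification; the abstract allows a countable alphabet).
    """
    counts = Counter(sample)  # Y_{n,j}: empirical frequency of each symbol
    # Sum p_j^r over the symbols that never appear in the sample.
    return sum(pj ** r for s, pj in p.items() if counts[s] == 0)


# Hypothetical toy distribution and sample; "c" and "d" are unobserved.
p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
sample = ["a", "b", "a", "a"]

theta1 = missing_mass(p, sample, r=1)  # 0.125 + 0.125
theta2 = missing_mass(p, sample, r=2)  # 0.125**2 + 0.125**2
ratio = theta1 / theta2                # T(P, X_n) = theta_1 / theta_2

print(theta1, theta2, ratio)
```

In practice $P$ is unknown, so these quantities cannot be computed this way and must be estimated from the sample alone, which is the problem the abstract addresses.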