# The minimax risk in testing the histogram of discrete distributions for uniformity under missing ball alternatives

We consider the problem of testing the fit of a discrete sample of items from many categories to the uniform distribution over the categories. As a class of alternative hypotheses, we consider the removal of an ℓ_p ball of radius ϵ around the uniform rate sequence for p ≤ 2. We deliver a sharp characterization of the asymptotic minimax risk when ϵ → 0 as the number of samples and the number of dimensions go to infinity, for testing based on the histogram of occurrences (number of absent categories, singletons, collisions, ...). For example, for p = 1 and in the limit of a small expected number of samples n compared to the number of categories N (the "sub-linear" regime), the minimax risk R^*_ϵ asymptotes to 2 Φ̅(n ϵ^2/√(8N)), where Φ̅(x) is the normal survival function. Empirical studies over a range of problem parameters show that this estimate is accurate in finite samples, and that our test is significantly better than the chi-squared test or a test that only uses collisions. Our analysis is based on the asymptotic normality of histogram ordinates, the equivalence of the minimax setting to a Bayesian one, and the reduction of a multi-dimensional optimization problem to a one-dimensional problem.
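As a rough illustration, the p = 1 sub-linear expression 2 Φ̅(n ϵ^2/√(8N)) quoted above can be evaluated numerically, and the histogram of occurrences can be computed from a sample. The sketch below is not the authors' code: the function names `normal_sf`, `asymptotic_minimax_risk_l1`, and `occurrence_histogram` are hypothetical, the formula is taken as stated in the abstract, and the sample is assumed to be a sequence of category labels.

```python
import math
from collections import Counter


def normal_sf(x: float) -> float:
    """Standard normal survival function, 1 - Phi(x), via the complementary error function."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def asymptotic_minimax_risk_l1(n: float, N: int, eps: float) -> float:
    """Evaluate 2 * normal_sf(n * eps^2 / sqrt(8 N)), the p = 1 sub-linear
    expression quoted in the abstract (n: expected sample size, N: categories)."""
    return 2.0 * normal_sf(n * eps**2 / math.sqrt(8.0 * N))


def occurrence_histogram(sample, N: int) -> dict:
    """Map m -> number of categories observed exactly m times.

    m = 0 counts absent categories, m = 1 singletons, m >= 2 collisions.
    Assumes `sample` contains labels drawn from N possible categories.
    """
    counts = Counter(sample)            # occurrences per observed category
    hist = Counter(counts.values())     # how many categories share each count
    hist[0] = N - len(counts)           # categories that never appeared
    return dict(hist)


if __name__ == "__main__":
    # Illustrative parameter values only; they are not taken from the paper.
    print(asymptotic_minimax_risk_l1(n=1000, N=100_000, eps=0.5))
    print(occurrence_histogram([0, 0, 1, 2, 2, 2], N=5))
```

The expression equals 1 at ϵ = 0 and decreases as n ϵ^2/√N grows, which matches the intuition that alternatives farther from uniform, or larger samples, make the test easier.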
