Testing Data Binnings
Motivated by the question of data quantization and "binning," we revisit the problem of identity testing of discrete probability distributions. Identity testing (a.k.a. one-sample testing), a fundamental and by now well-understood problem in distribution testing, asks, given a reference distribution (model) 𝐪 and samples from an unknown distribution 𝐩, both over [n]={1,2,…,n}, whether 𝐩 equals 𝐪, or is significantly different from it. In this paper, we introduce the related question of 'identity up to binning,' where the reference distribution 𝐪 is over k ≪ n elements: the question is then whether there exists a suitable binning of the domain [n] into k intervals such that, once "binned," 𝐩 is equal to 𝐪. We provide nearly tight upper and lower bounds on the sample complexity of this new question, showing both a quantitative and qualitative difference with the vanilla identity testing one, and answering an open question of Canonne (2019). Finally, we discuss several extensions and related research directions.
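To make the notion of "identity up to binning" concrete, here is a minimal Python sketch of the noiseless version of the question, in which the unknown distribution is given explicitly as a probability vector instead of through samples. It is not the testing algorithm of the paper, which only has sample access and must tolerate statistical error. The function names `bin_distribution` and `find_binning`, the tolerance `tol`, and the representation of a binning by its interval end points are all illustrative assumptions, as is the simplifying assumption that every entry of the reference distribution is strictly positive.

```python
from itertools import accumulate


def bin_distribution(p, cuts):
    """Sum p over consecutive intervals.

    `cuts` lists the (exclusive) end index of each interval, so
    cuts = [2, 4] means the intervals p[0:2] and p[2:4].
    """
    binned, start = [], 0
    for end in cuts:
        binned.append(sum(p[start:end]))
        start = end
    return binned


def find_binning(p, q, tol=1e-9):
    """Return interval end points that turn p into q, or None if none exist.

    Assumes p and q are probability vectors (each sums to 1) and that
    the entries of q are strictly positive (illustrative assumption).
    A valid binning exists exactly when every cumulative sum of q is
    also attained as a prefix sum of p, so a single greedy pass suffices.
    """
    prefix = list(accumulate(p))
    cuts, i, target = [], 0, 0.0
    for mass in q[:-1]:
        target += mass
        # Advance to the first prefix sum that reaches the target mass.
        while i < len(prefix) and prefix[i] < target - tol:
            i += 1
        # The target must be hit exactly (up to tol), not overshot.
        if i == len(prefix) or abs(prefix[i] - target) > tol:
            return None
        i += 1
        cuts.append(i)
    # The last interval takes whatever remains of the domain.
    cuts.append(len(p))
    return cuts


p = [0.1, 0.2, 0.3, 0.4]
print(find_binning(p, [0.3, 0.7]))  # intervals {1,2} and {3,4}
print(find_binning(p, [0.5, 0.5]))  # no interval binning works
```

In the first call the cumulative mass 0.3 of the reference distribution is attained after the first two domain elements, so a binning exists; in the second, no prefix of the domain has mass 0.5, so none does. The testing problem studied in the paper is harder precisely because these prefix masses are not known and must be inferred from samples.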