# Likelihood-free hypothesis testing

Consider the problem of testing Z ∼ ℙ^⊗m vs Z ∼ ℚ^⊗m from m samples. Generally, to achieve a small error rate it is necessary and sufficient to have m ≍ 1/ϵ^2, where ϵ measures the separation between ℙ and ℚ in total variation (𝖳𝖵). Achieving this, however, requires complete knowledge of the distributions ℙ and ℚ and can be done, for example, using the Neyman-Pearson test. In this paper we consider a variation of the problem, which we call likelihood-free (or simulation-based) hypothesis testing, where access to ℙ and ℚ (which are a priori only known to belong to a large non-parametric family 𝒫) is given through n i.i.d. samples from each. We demonstrate the existence of a fundamental trade-off between n and m given by nm ≍ n_𝖦𝗈𝖥^2(ϵ,𝒫), where n_𝖦𝗈𝖥 is the minimax sample complexity of testing between the hypotheses H_0: ℙ = ℚ vs H_1: 𝖳𝖵(ℙ,ℚ) ≥ ϵ. We show this for three non-parametric families 𝒫: β-smooth densities over [0,1]^d, the Gaussian sequence model over a Sobolev ellipsoid, and the collection of distributions on a large alphabet [k] with pmfs bounded by c/k for fixed c. The test that we propose (based on the L^2-distance statistic of Ingster) simultaneously achieves all points on the trade-off curve for these families. In particular, when m ≫ 1/ϵ^2 our test requires the number of simulation samples n to be orders of magnitude smaller than what is needed for density estimation with accuracy ≍ ϵ (under 𝖳𝖵). This demonstrates the possibility of testing without fully estimating the distributions.
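To make the setting concrete, here is a minimal sketch of a likelihood-free test on a large alphabet [k], the third family above. This is an illustrative L²-distance comparison, not the paper's exact statistic: it forms empirical pmfs from the n simulation samples of each of ℙ and ℚ, and declares whichever is closer in L² to the empirical pmf of the m test samples. All names and parameter values below are hypothetical choices for the toy example.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_lf_test(sim_p, sim_q, z, k):
    """Hedged sketch of a likelihood-free test on alphabet [k]:
    declare 'P' if the empirical pmf of the test sample z is closer
    in squared L^2 distance to the empirical pmf of the P-simulations
    than to that of the Q-simulations, else declare 'Q'."""
    p_hat = np.bincount(sim_p, minlength=k) / len(sim_p)
    q_hat = np.bincount(sim_q, minlength=k) / len(sim_q)
    z_hat = np.bincount(z, minlength=k) / len(z)
    d_p = np.sum((z_hat - p_hat) ** 2)
    d_q = np.sum((z_hat - q_hat) ** 2)
    return 'P' if d_p < d_q else 'Q'

# Toy instance: pmfs on [k] bounded by c/k, with TV(P, Q) = eps.
k = 1000
eps = 0.2
p = np.full(k, 1.0 / k)
q = p.copy()
q[: k // 2] += 2 * eps / k   # still O(1/k) per entry
q[k // 2:] -= 2 * eps / k    # TV(p, q) = eps
n, m = 50_000, 20_000        # simulation and test sample sizes
sim_p = rng.choice(k, size=n, p=p)
sim_q = rng.choice(k, size=n, p=q)
z = rng.choice(k, size=m, p=p)  # ground truth: Z ~ P^{⊗m}
print(l2_lf_test(sim_p, sim_q, z, k))
```

Note the trade-off at play: because the pmfs are c/k-bounded, the L² separation ‖p − q‖₂² = 4ϵ²/k is tiny, yet the statistic's noise also shrinks with k, so moderate n and m suffice, far fewer than the ≍ k/ϵ² samples density estimation to 𝖳𝖵-accuracy ϵ would require.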
