Fast Two-Sample Testing with Analytic Representations of Probability Measures

06/15/2015
by Kacper Chwialkowski, et al.

We propose a class of nonparametric two-sample tests with a cost linear in the sample size. Two tests are given, both based on an ensemble of distances between analytic functions representing each of the distributions. The first test uses smoothed empirical characteristic functions to represent the distributions, the second uses distribution embeddings in a reproducing kernel Hilbert space. Analyticity implies that differences in the distributions may be detected almost surely at a finite number of randomly chosen locations/frequencies. The new tests are consistent against a larger class of alternatives than the previous linear-time tests based on the (non-smoothed) empirical characteristic functions, while being much faster than the current state-of-the-art quadratic-time kernel-based or energy distance-based tests. Experiments on artificial benchmarks and on challenging real-world testing problems demonstrate that our tests give a better power/time tradeoff than competing approaches, and in some cases, better outright power than even the most expensive quadratic-time tests. This performance advantage is retained even in high dimensions, and in cases where the difference in distributions is not observable with low order statistics.
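
To make the idea of comparing two distributions at a finite number of random locations concrete, here is a minimal sketch of a linear-time statistic in the spirit of the second (kernel embedding) test. It is an illustration under stated assumptions, not the authors' implementation: the Gaussian kernel, the bandwidth, the small `ridge` term, the equal sample sizes, and all function names are choices made for this example. Each pair of observations is mapped to the difference of kernel evaluations at J locations, and the sample mean of these differences is normalized by their sample covariance (a Hotelling-type quadratic form). The cost is linear in the number of samples n for fixed J.

```python
import math
import random


def gaussian_kernel(x, t, bandwidth=1.0):
    """Gaussian kernel between two points given as tuples (an assumed choice)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, t))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def mean_embedding_statistic(xs, ys, locations, bandwidth=1.0, ridge=1e-8):
    """Hotelling-type statistic n * mean^T cov^{-1} mean on kernel differences.

    Assumes len(xs) == len(ys); `ridge` is a hypothetical stabilizer added to
    the covariance diagonal for this sketch.
    """
    n = len(xs)
    if n != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    num_loc = len(locations)
    # One row per sample pair: kernel differences at every test location.
    z = [
        [gaussian_kernel(x, t, bandwidth) - gaussian_kernel(y, t, bandwidth)
         for t in locations]
        for x, y in zip(xs, ys)
    ]
    mean = [sum(row[j] for row in z) / n for j in range(num_loc)]
    cov = [
        [
            sum((row[i] - mean[i]) * (row[j] - mean[j]) for row in z) / (n - 1)
            + (ridge if i == j else 0.0)
            for j in range(num_loc)
        ]
        for i in range(num_loc)
    ]
    weights = solve_linear(cov, mean)
    return n * sum(m_j * w_j for m_j, w_j in zip(mean, weights))


# Toy usage: one-dimensional Gaussians, J = 5 random locations.
rng = random.Random(0)
n_samples, n_locations = 1000, 5
locations = [(rng.gauss(0.0, 1.0),) for _ in range(n_locations)]
xs = [(rng.gauss(0.0, 1.0),) for _ in range(n_samples)]
ys_same = [(rng.gauss(0.0, 1.0),) for _ in range(n_samples)]
ys_shift = [(rng.gauss(1.0, 1.0),) for _ in range(n_samples)]

stat_same = mean_embedding_statistic(xs, ys_same, locations)
stat_shift = mean_embedding_statistic(xs, ys_shift, locations)
```

When both samples come from the same distribution, a statistic of this form is approximately chi-squared distributed with J degrees of freedom for large n, so it can be compared against a chi-squared quantile; `stat_shift` should be far larger than `stat_same` in the toy example above.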

research
09/19/2019

Comparing distributions: ℓ_1 geometry improves kernel two-sample testing

Are two sets of observations drawn from the same distribution? This prob...
research
01/14/2023

Compress Then Test: Powerful Kernel Testing in Near-linear Time

Kernel two-sample testing provides a powerful framework for distinguishi...
research
08/04/2015

Adaptivity and Computation-Statistics Tradeoffs for Kernel and Distance based High Dimensional Two Sample Testing

Nonparametric two sample testing is a decision theoretic problem that in...
research
05/22/2016

Interpretable Distribution Features with Maximum Testing Power

Two semimetrics on probability distributions are proposed, given as the ...
research
10/15/2016

An Adaptive Test of Independence with Analytic Kernel Embeddings

A new computationally efficient dependence measure, and an adaptive stat...
research
05/02/2012

Hypothesis testing using pairwise distances and associated kernels (with Appendix)

We provide a unifying framework linking two classes of statistics used i...
research
06/18/2022

Efficient Aggregated Kernel Tests using Incomplete U-statistics

We propose a series of computationally efficient, nonparametric tests fo...
