Adaptivity and Computation-Statistics Tradeoffs for Kernel and Distance based High Dimensional Two Sample Testing

08/04/2015
by Aaditya Ramdas, et al.

Nonparametric two sample testing is a decision theoretic problem that involves identifying differences between two random variables without making parametric assumptions about their underlying distributions. We refer to the most common settings as mean difference alternatives (MDA), for testing differences only in first moments, and general difference alternatives (GDA), for testing for any difference in distributions. A large number of test statistics have been proposed for both these settings. This paper connects three classes of statistics: high dimensional variants of Hotelling's t-test, statistics based on Reproducing Kernel Hilbert Spaces, and energy statistics based on pairwise distances. We ask the question: how much statistical power do popular kernel and distance based tests for GDA have when the unknown distributions differ in their means, compared to specialized tests for MDA? We formally characterize the power of popular tests for GDA like the Maximum Mean Discrepancy with the Gaussian kernel (gMMD) and bandwidth-dependent variants of the Energy Distance with the Euclidean norm (eED) in the high-dimensional MDA regime. Some practically important properties include: (a) eED and gMMD have asymptotically equal power; furthermore, they enjoy a free lunch because, while they are additionally consistent for GDA, they also have the same power as specialized high-dimensional t-test variants for MDA. All these tests are asymptotically optimal (including matching constants) under MDA for spherical covariances, according to simple lower bounds. (b) The power of gMMD is independent of the kernel bandwidth, as long as it is larger than the choice made by the median heuristic. (c) There is a clear and smooth computation-statistics tradeoff for linear-time, subquadratic-time and quadratic-time versions of these tests, with more computation resulting in higher power.
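
To make the two GDA statistics named in the abstract concrete, here is a minimal Python sketch of quadratic-time, unbiased estimates of the Gaussian-kernel MMD and the Euclidean energy distance, plus a median-heuristic bandwidth. It is an illustration under stated assumptions, not the authors' implementation. The function names are hypothetical. The kernel is parameterized as exp(-||x - y||^2 / (2 * bandwidth^2)), which is one common convention. The bandwidth-dependent eED variants studied in the paper are not shown, only the standard energy distance. The permutation-based p-value is an added assumption for calibration and is not taken from the abstract.

```python
import numpy as np


def pairwise_sq_dists(A, B):
    """Squared Euclidean distances between rows of A and rows of B."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.maximum(sq, 0.0)  # clip tiny negatives caused by round-off


def median_bandwidth(X, Y):
    """Median heuristic: median pairwise distance over the pooled sample."""
    Z = np.vstack([X, Y])
    d2 = pairwise_sq_dists(Z, Z)
    upper = np.triu_indices_from(d2, k=1)  # distinct pairs only
    return np.sqrt(np.median(d2[upper]))


def gmmd2_unbiased(X, Y, bandwidth):
    """Quadratic-time unbiased estimate of squared MMD with a Gaussian kernel."""
    n, m = len(X), len(Y)
    gamma = 1.0 / (2.0 * bandwidth ** 2)  # assumed kernel parameterization
    Kxx = np.exp(-gamma * pairwise_sq_dists(X, X))
    Kyy = np.exp(-gamma * pairwise_sq_dists(Y, Y))
    Kxy = np.exp(-gamma * pairwise_sq_dists(X, Y))
    np.fill_diagonal(Kxx, 0.0)  # drop i == j terms for unbiasedness
    np.fill_diagonal(Kyy, 0.0)
    return Kxx.sum() / (n * (n - 1)) + Kyy.sum() / (m * (m - 1)) - 2.0 * Kxy.mean()


def energy_distance(X, Y):
    """Quadratic-time unbiased estimate of the Euclidean energy distance."""
    n, m = len(X), len(Y)
    Dxx = np.sqrt(pairwise_sq_dists(X, X))
    Dyy = np.sqrt(pairwise_sq_dists(Y, Y))
    Dxy = np.sqrt(pairwise_sq_dists(X, Y))
    np.fill_diagonal(Dxx, 0.0)  # self-distances are exactly zero
    np.fill_diagonal(Dyy, 0.0)
    return 2.0 * Dxy.mean() - Dxx.sum() / (n * (n - 1)) - Dyy.sum() / (m * (m - 1))


def permutation_pvalue(stat, X, Y, n_perm=200, seed=0):
    """Permutation p-value for a two-sample statistic stat(X, Y) (assumed calibration)."""
    rng = np.random.default_rng(seed)
    Z = np.vstack([X, Y])
    n = len(X)
    observed = stat(X, Y)
    exceed = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(Z))  # random relabeling of the pooled sample
        if stat(Z[idx[:n]], Z[idx[n:]]) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)
```

A typical call would compute `bw = median_bandwidth(X, Y)` once on the pooled sample and then pass `lambda A, B: gmmd2_unbiased(A, B, bw)` or `energy_distance` to `permutation_pvalue`. Each statistic costs time quadratic in the sample size, which corresponds to the most expensive end of the computation-statistics tradeoff described above.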

research
11/23/2014

On the High-dimensional Power of Linear-time Kernel Two-Sample Testing under Mean-difference Alternatives

Nonparametric two sample testing deals with the question of consistently...
research
09/08/2015

On Wasserstein Two Sample Testing and Related Families of Nonparametric Tests

Nonparametric two sample or homogeneity testing is a decision theoretic ...
research
06/15/2015

Fast Two-Sample Testing with Analytic Representations of Probability Measures

We propose a class of nonparametric two-sample tests with a cost linear ...
research
03/22/2017

Testing and Learning on Distributions with Symmetric Noise Invariance

Kernel embeddings of distributions and the Maximum Mean Discrepancy (MMD...
research
02/19/2019

Interpoint Distance Based Two Sample Tests in High Dimension

In this paper, we study a class of two sample test statistics based on i...
research
12/05/2022

Testing for Regression Heteroskedasticity with High-Dimensional Random Forests

Statistical inference for high-dimensional regression heteroskedasticity...
research
02/13/2020

Bayesian Kernel Two-Sample Testing

In modern data analysis, nonparametric measures of discrepancies between...
