Classification Accuracy as a Proxy for Two Sample Testing

02/06/2016
by   Aaditya Ramdas, et al.

When data analysts train a classifier and check whether its accuracy is significantly different from random guessing, they are implicitly and indirectly performing a hypothesis test (two-sample testing). It is therefore important to ask whether this indirect method of testing is statistically optimal. Hypothesis tests attempt to maximize statistical power subject to a bound on the allowable false positive rate, while prediction attempts to minimize statistical risk on future, unseen data; we study whether a predictive approach is prudent when the ultimate aim is testing. We formalize this problem by considering the two-sample mean-testing setting, where one must determine whether the means of two Gaussians (with known and equal covariance) are the same, but the analyst does so indirectly by checking whether the accuracy achieved by Fisher's LDA classifier is significantly different from chance. Unexpectedly, we find that the asymptotic power of LDA's sample-splitting classification accuracy is actually minimax rate-optimal in terms of problem-dependent parameters. Since prediction is commonly thought to be harder than testing, it might come as a surprise to some that solving a harder problem does not create an information-theoretic bottleneck for the easier one. On the flip side, even though the power is rate-optimal, our derivation suggests that it may be worse by a small constant factor; hence practitioners must be wary of using (admittedly flexible) prediction methods on disguised testing problems.
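The indirect procedure described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `lda_accuracy_test`, the identity default for the known covariance, the half-and-half split, and the one-sided normal approximation to the null distribution of held-out accuracy are all assumptions made here for concreteness.

```python
import math
import numpy as np


def lda_accuracy_test(x, y, cov=None, seed=0):
    """Two-sample mean test via held-out accuracy of Fisher's LDA.

    Hypothetical helper for illustration. `cov` is the known, shared
    covariance (assumed to be the identity if not given).
    Returns (held-out accuracy, approximate one-sided p-value).
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # Subsample to equal class sizes so that chance accuracy is exactly 1/2.
    n = min(len(x), len(y))
    x = x[rng.permutation(len(x))[:n]]
    y = y[rng.permutation(len(y))[:n]]

    # Sample splitting: first half trains the classifier, second half tests it.
    half = n // 2
    x_tr, x_te = x[:half], x[half:]
    y_tr, y_te = y[:half], y[half:]

    d = x.shape[1]
    cov = np.eye(d) if cov is None else np.asarray(cov, dtype=float)

    # Fisher's LDA with known covariance: w = cov^{-1} (mean_x - mean_y),
    # with the decision boundary through the midpoint of the class means.
    mean_x, mean_y = x_tr.mean(axis=0), y_tr.mean(axis=0)
    w = np.linalg.solve(cov, mean_x - mean_y)
    midpoint = 0.5 * (mean_x + mean_y)

    correct_x = np.sum((x_te - midpoint) @ w > 0)
    correct_y = np.sum((y_te - midpoint) @ w <= 0)
    m = len(x_te) + len(y_te)
    accuracy = (correct_x + correct_y) / m

    # Under the null, with balanced held-out classes, accuracy has mean 1/2
    # and variance at most 1/(4m); a normal approximation with that variance
    # gives an (assumed) one-sided p-value for "better than chance".
    z = (accuracy - 0.5) / math.sqrt(0.25 / m)
    p_value = 0.5 * math.erfc(z / math.sqrt(2.0))
    return float(accuracy), float(p_value)
```

A call such as `lda_accuracy_test(x, y)` on two arrays of shape `(n, d)` returns the held-out accuracy and the approximate p-value; rejecting when the p-value falls below the chosen level corresponds to declaring the accuracy significantly better than chance. Other null calibrations (for example an exact binomial or permutation test) would fit the same template.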


