On high-dimensional modifications of some graph-based two-sample tests

06/06/2018
by   Soham Sarkar, et al.

Testing for the equality of two high-dimensional distributions is a challenging problem, and it becomes even more challenging when the sample size is small. Over the last few decades, several graph-based two-sample tests have been proposed in the literature, and these can be used for data of arbitrary dimension. Most of these test statistics are computed from pairwise Euclidean distances among the observations. However, because pairwise Euclidean distances concentrate in high dimensions, these tests perform poorly in many high-dimensional problems. Some of them can even have power below the nominal level when the scale difference between the two distributions dominates the location difference. To overcome these limitations, we introduce a new class of dissimilarity indices and use it to modify some popular graph-based tests. These modified tests use the distance concentration phenomenon to their advantage and, as a result, outperform the corresponding tests based on the Euclidean distance in a wide variety of examples. We establish the high-dimensional consistency of these modified tests under fairly general conditions. Analyzing several simulated and real data sets, we demonstrate their usefulness in high dimension, low sample size situations.
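
The distance concentration mentioned in the abstract can be illustrated with a small simulation. The sketch below is not the authors' method and does not implement their dissimilarity indices; it assumes two Gaussian samples, N(0, I_d) and N(0, sigma^2 I_d), that differ only in scale, and the function names and the choices n = 20 and sigma = 2 are hypothetical. Under these assumptions, the law of large numbers implies that Euclidean distances divided by sqrt(d) settle near sqrt(2) within the first sample, sqrt(2 sigma^2) within the second, and sqrt(1 + sigma^2) between the two.

```python
import numpy as np


def scaled_distances(a, b):
    """Euclidean distances between rows of a and rows of b, divided by sqrt(d)."""
    d = a.shape[1]
    sq = (a ** 2).sum(axis=1)[:, None] + (b ** 2).sum(axis=1)[None, :] - 2.0 * (a @ b.T)
    return np.sqrt(np.maximum(sq, 0.0) / d)  # clip tiny negatives caused by rounding


def concentration_summary(dim, n=20, sigma=2.0, seed=0):
    """Mean and standard deviation of scaled distances for a scale-only difference.

    Illustrative assumption: X ~ N(0, I_dim) and Y ~ N(0, sigma^2 I_dim).
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, dim))
    y = sigma * rng.standard_normal((n, dim))
    upper = np.triu_indices(n, k=1)  # distinct pairs only, skipping the zero diagonal
    groups = {
        "within_x": scaled_distances(x, x)[upper],
        "within_y": scaled_distances(y, y)[upper],
        "between": scaled_distances(x, y).ravel(),
    }
    return {name: (float(v.mean()), float(v.std())) for name, v in groups.items()}


if __name__ == "__main__":
    for dim in (10, 100, 10000):
        print(dim, concentration_summary(dim))
```

As the dimension grows, the spread of each group of distances shrinks, so every pair in a group looks almost equally far apart. With sigma > 1, the between-sample distances also fall below the within-sample distances of the more dispersed sample, so its observations tend to sit closer to points of the other sample than to each other. This is one way to see how a graph built on Euclidean distances can mislead a test when the scale difference dominates.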


research
09/30/2019

A New Framework for Distance and Kernel-based Metrics in High Dimensions

The paper presents new metrics to quantify and test for (i) the equality...
research
12/16/2022

On High Dimensional Behaviour of Some Two-Sample Tests Based on Ball Divergence

In this article, we propose some two-sample tests based on ball divergen...
research
01/13/2022

How I learned to stop worrying and love the curse of dimensionality: an appraisal of cluster validation in high-dimensional spaces

The failure of the Euclidean norm to reliably distinguish between nearby...
research
01/08/2020

On a Generalization of the Average Distance Classifier

In high dimension, low sample size (HDLSS) settings, the simple average d...
research
09/21/2015

Significance Analysis of High-Dimensional, Low-Sample Size Partially Labeled Data

Classification and clustering are both important topics in statistical l...
research
08/30/2020

diproperm: An R Package for the DiProPerm Test

High-dimensional low sample size (HDLSS) data sets emerge frequently in ...
research
04/03/2023

Synthesis parameter effect detection using quantitative representations and high dimensional distribution distances

Detection of effects of the parameters of the synthetic process on the m...
