Accelerated Computation of a High Dimensional Kolmogorov-Smirnov Distance

06/25/2021
by Alex Hagen, et al.

Statistical testing is widespread and critical in a variety of scientific disciplines. The advent of machine learning and the growth of computing power have increased interest in the analysis and statistical testing of multidimensional data. We extend the powerful Kolmogorov-Smirnov two sample test to a high dimensional form in a manner similar to Fasano (Fasano, 1987). We call our result the d-dimensional Kolmogorov-Smirnov test (ddKS) and provide three novel contributions. First, we develop an analytical equation for the significance of a given ddKS score. Second, we provide an algorithm for computing ddKS on modern computing hardware that is of constant time complexity for small sample sizes and dimensions. Third, we provide two approximate calculations of ddKS: one that reduces the time complexity to linear at larger sample sizes, and another that reduces the time complexity to linear with increasing dimension. We perform a power analysis of ddKS and its approximations on a corpus of datasets and compare against other common high dimensional two sample tests and distances: Hotelling's T^2 test and Kullback-Leibler divergence. Our ddKS test performs well for all datasets, dimensions, and sizes tested, whereas the other tests and distances fail to reject the null hypothesis on at least one dataset. We therefore conclude that ddKS is a powerful multidimensional two sample test for general use, and that it can be calculated quickly and efficiently using our parallel or approximate methods. Open source implementations of all methods described in this work are located at https://github.com/pnnl/ddks.
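
To make the idea of a d-dimensional Kolmogorov-Smirnov distance concrete, the following is a minimal sketch of a naive orthant-counting statistic in the spirit of Fasano's extension. It is not the authors' implementation and does not use the API of the ddks package; the function names `orthant_fractions` and `ddks_distance_sketch` are hypothetical, and the choice to use every point of both samples as an orthant origin is an assumption made for illustration. The sketch does not reproduce the accelerated or approximate methods, nor the significance calculation, described in the abstract.

```python
import numpy as np


def orthant_fractions(points, origin):
    """Fraction of `points` in each of the 2^d orthants centred on `origin`."""
    n, d = points.shape
    # Encode the orthant of each point as an integer: bit k is set
    # when coordinate k is greater than or equal to origin[k].
    bits = (points >= origin).astype(np.int64)
    codes = bits @ (1 << np.arange(d))
    counts = np.bincount(codes, minlength=2 ** d)
    return counts / n


def ddks_distance_sketch(pred, true):
    """Naive d-dimensional KS-style distance (hypothetical helper).

    Assumption: every point of both samples is tried as an orthant origin,
    and the statistic is the largest absolute difference in orthant
    fractions between the two samples over all origins and orthants.
    """
    pred = np.asarray(pred, dtype=float)
    true = np.asarray(true, dtype=float)
    best = 0.0
    for origin in np.vstack([pred, true]):
        diff = np.abs(orthant_fractions(pred, origin)
                      - orthant_fractions(true, origin))
        best = max(best, float(diff.max()))
    return best


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.normal(size=(100, 3))
    b = rng.normal(loc=0.5, size=(100, 3))
    print(ddks_distance_sketch(a, b))
```

This sketch loops over all pooled points and allocates 2^d orthant counts per origin, so its cost grows quadratically with sample size and exponentially with dimension. That growth is the kind of cost that motivates parallel and approximate calculations.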
