Partial identification of kernel based two sample tests with mismeasured data

08/07/2023
by Ron Nafshi, et al.

Nonparametric two-sample tests such as the Maximum Mean Discrepancy (MMD) are often used to detect differences between two distributions in machine learning applications. However, most of the existing literature assumes that error-free samples from the two distributions of interest are available. We relax this assumption and study the estimation of the MMD under ϵ-contamination, where a possibly non-random proportion ϵ of one distribution is erroneously grouped with the other. We show that under ϵ-contamination, the typical estimate of the MMD is unreliable. Instead, we study partial identification of the MMD and characterize sharp upper and lower bounds that contain the true, unknown MMD. We propose a method to estimate these bounds and show that its estimates converge to the sharpest possible bounds on the MMD as the sample size increases, at a rate faster than that of alternative approaches. Using three datasets, we empirically validate that our approach is superior to the alternatives: it gives tight bounds with a low false coverage rate.
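
To make the setting concrete, the following is a minimal sketch of the standard (biased, V-statistic) estimate of the squared MMD, together with a toy contamination step. It is not the bound estimator proposed in the paper. The function names `rbf_kernel`, `mmd2_biased`, and `contaminate`, the Gaussian kernel with a fixed bandwidth, and the choice to move a uniformly random fraction of points are all assumptions made for illustration; the abstract allows the contaminated proportion to be non-random.

```python
import numpy as np


def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix between rows of a and rows of b."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))


def mmd2_biased(x, y, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared MMD between samples x and y."""
    k_xx = rbf_kernel(x, x, bandwidth).mean()
    k_yy = rbf_kernel(y, y, bandwidth).mean()
    k_xy = rbf_kernel(x, y, bandwidth).mean()
    return k_xx + k_yy - 2.0 * k_xy


def contaminate(x, y, eps, rng):
    """Hypothetical contamination: relabel a fraction eps of y as belonging to x.

    Assumption: the moved points are chosen uniformly at random. The setting
    described in the abstract also covers non-random choices.
    """
    n_move = int(round(eps * len(y)))
    order = rng.permutation(len(y))
    moved, kept = order[:n_move], order[n_move:]
    return np.vstack([x, y[moved]]), y[kept]


rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=(200, 2))  # sample from the first distribution
y = rng.normal(loc=1.0, scale=1.0, size=(200, 2))  # sample from the second distribution

x_obs, y_obs = contaminate(x, y, eps=0.2, rng=rng)

print("MMD^2 estimate, clean samples:       ", mmd2_biased(x, y))
print("MMD^2 estimate, contaminated samples:", mmd2_biased(x_obs, y_obs))
```

Comparing the two printed values shows how the naive estimate shifts when only the mislabeled samples are observed, which is the situation that motivates bounding the MMD rather than reporting a single point estimate.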

research
11/23/2014

On the High-dimensional Power of Linear-time Kernel Two-Sample Testing under Mean-difference Alternatives

Nonparametric two sample testing deals with the question of consistently...
research
07/10/2020

Learning Entangled Single-Sample Gaussians in the Subset-of-Signals Model

In the setting of entangled single-sample distributions, the goal is to ...
research
02/23/2018

Exponentially Consistent Kernel Two-Sample Tests

Given two sets of independent samples from unknown distributions P and Q...
research
10/28/2019

Testing Equivalence of Clustering

In this paper, we test whether two datasets share a common clustering st...
research
09/26/2013

Estimating Undirected Graphs Under Weak Assumptions

We consider the problem of providing nonparametric confidence guarantees...
research
04/05/2021

Efficiency Lower Bounds for Distribution-Free Hotelling-Type Two-Sample Tests Based on Optimal Transport

The Wilcoxon rank-sum test is one of the most popular distribution-free ...
research
02/07/2023

A Bipartite Ranking Approach to the Two-Sample Problem

The two-sample problem, which consists in testing whether independent sa...
