A Statistical Testing Procedure for Validating Class Labels

06/04/2020
by   Melissa C. Key, et al.
Motivated by an open problem of validating protein identities in label-free shotgun proteomics workflows, we present a testing procedure to validate class/protein labels using available measurements across instances/peptides. More generally, we present a solution to the problem of identifying instances that are deemed, based on some distance (or quasi-distance) measure, to be outliers relative to the subset of instances assigned to the same class. The proposed procedure is non-parametric and requires no specific distributional assumption on the measured distances. The only assumption underlying the testing procedure is that measured distances between instances within the same class are stochastically smaller than measured distances between instances from different classes. The test is shown to simultaneously control the Type I and Type II error probabilities whilst also controlling the overall error probability of the repeated testing invoked in the validation procedure of the initial class labeling. The theoretical results are supplemented with results from an extensive numerical study, simulating a typical setup for labeling validation in proteomics workflow applications. These results illustrate the applicability and viability of our method. Even with up to 25% of instances mislabeled, our testing procedure maintains high specificity and greatly reduces the proportion of mislabeled instances.
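To make the underlying assumption concrete, the following is a minimal Python sketch of how "within-class distances are smaller than between-class distances" can be turned into a flag for suspect labels. It is not the testing procedure proposed in the paper and carries none of its error-control guarantees. The function name `flag_suspect_labels`, the median comparison rule, and the toy data are all assumptions introduced purely for illustration.

```python
import statistics


def euclidean(a, b):
    """Euclidean distance between two equal-length coordinate tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def flag_suspect_labels(points, labels, distance):
    """Return indices whose label looks inconsistent with the distances.

    Illustrative heuristic only (hypothetical, not the paper's test):
    an instance is flagged when its median distance to the other
    instances carrying the same label is larger than its median
    distance to the instances carrying a different label.
    """
    flagged = []
    for i, (p, label) in enumerate(zip(points, labels)):
        within = [
            distance(p, q)
            for j, (q, other) in enumerate(zip(points, labels))
            if j != i and other == label
        ]
        between = [
            distance(p, q)
            for q, other in zip(points, labels)
            if other != label
        ]
        # Skip instances that have no same-class or no other-class peers.
        if not within or not between:
            continue
        if statistics.median(within) > statistics.median(between):
            flagged.append(i)
    return flagged


# Toy data: two well-separated clusters on a line (assumed for illustration).
points = [(0.1 * i, 0.0) for i in range(10)]
points += [(10.0 + 0.1 * i, 0.0) for i in range(10)]
clean_labels = ["A"] * 10 + ["B"] * 10

# Deliberately mislabel the first instance of cluster A as class B.
noisy_labels = list(clean_labels)
noisy_labels[0] = "B"

flagged_clean = flag_suspect_labels(points, clean_labels, euclidean)
flagged_noisy = flag_suspect_labels(points, noisy_labels, euclidean)
print(flagged_clean, flagged_noisy)
```

Any distance or quasi-distance function can be passed in place of `euclidean`. The sketch compares medians only to keep the example short; a formal test along the lines described in the abstract would replace that comparison with a rule calibrated to control error probabilities.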


