Evaluation of Protein-protein Interaction Predictors with Noisy Partially Labeled Data Sets

by Haohan Wang, et al.

Protein-protein interaction (PPI) prediction is an important problem in machine learning and computational biology. However, no data set exists for training or evaluation in which all instances are accurately labeled. Instead, what is available are instances of the positive class (with possibly noisy labels) and no instances of the negative class. The lack of negative-class data is typically handled with the observation that randomly chosen protein pairs have a nearly 100% chance of being negative, as only 1 in 1,500 protein pairs is expected to be an interacting pair. In this paper, we focus on the problem that the lack of accurately labeled testing data sets for PPI prediction may lead to biased evaluation results. We first show that not acknowledging the inherent skew in the interactome (i.e., the rare occurrence of positive instances) leads to an over-estimated accuracy of the predictor. We then show that, under the belief that positive interactions are a rare category, using random pairs of proteins (excluding known interacting ones) as the negative testing data set can lead to an under-estimated evaluation result. We formalize these two problems to validate the above claims and, based on the formalization, propose a balancing method that cancels out the over-estimation with the under-estimation. Finally, our experiments validate the theoretical aspects and show that this balanced evaluation can recover the exact performance without gold-standard data sets.
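The two biases described in the abstract can be illustrated with simple arithmetic. The sketch below is not the paper's formalization or its balancing method. It assumes a hypothetical predictor with a fixed true-positive rate and false-positive rate (the values 0.8 and 0.01 are invented for illustration) and uses the 1 in 1,500 interaction rate quoted above. The function names `precision` and `apparent_fpr` are likewise assumptions introduced here.

```python
# Illustrative sketch only: all rates below are hypothetical, not results from the paper.

def precision(tpr: float, fpr: float, pos_fraction: float) -> float:
    """Expected precision when a fraction `pos_fraction` of test pairs is truly positive."""
    true_pos = tpr * pos_fraction
    false_pos = fpr * (1.0 - pos_fraction)
    return true_pos / (true_pos + false_pos)


def apparent_fpr(tpr: float, fpr: float, hidden_pos_fraction: float) -> float:
    """False-positive rate measured on random 'negatives' when a fraction of them
    are in fact undiscovered interactions (assumed to be detected at rate `tpr`)."""
    return (1.0 - hidden_pos_fraction) * fpr + hidden_pos_fraction * tpr


TPR, FPR = 0.8, 0.01          # hypothetical predictor
INTERACTION_RATE = 1 / 1500   # rate quoted in the abstract

# Over-estimation: a balanced 1:1 test set ignores the skew of the interactome.
balanced_precision = precision(TPR, FPR, 0.5)
skewed_precision = precision(TPR, FPR, INTERACTION_RATE)

# Under-estimation: hidden positives among sampled negatives count as false positives.
measured_fpr = apparent_fpr(TPR, FPR, INTERACTION_RATE)

print(f"precision on balanced test set: {balanced_precision:.4f}")
print(f"precision at 1:1,500 skew:      {skewed_precision:.4f}")
print(f"true FPR {FPR:.5f} vs. FPR measured on random negatives {measured_fpr:.5f}")
```

Under these assumed rates, the same predictor looks far better on a balanced test set than at the natural skew, while correct predictions on unlabeled interacting pairs inflate the measured false-positive rate.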







Training large margin host-pathogen protein-protein interaction predictors

Detection of protein-protein interactions (PPIs) plays a vital role in m...

A statistical Testing Procedure for Validating Class Labels

Motivated by an open problem of validating protein identities in label-f...

Network and Sequence-Based Prediction of Protein-Protein Interactions

Background: Typically, proteins perform key biological functions by inter...

ProteinNet: a standardized data set for machine learning of protein structure

Rapid progress in deep learning has spurred its application to bioinform...

Multitask Protein Function Prediction Through Task Dissimilarity

Automated protein function prediction is a challenging problem with dist...

Adaptive Positive-Unlabelled Learning via Markov Diffusion

Positive-Unlabelled (PU) learning is the machine learning setting in whi...

Random graphs with node and block effects: models, goodness-of-fit tests, and applications to biological networks

Many popular models from the networks literature can be viewed through a...