Evaluation of Protein-protein Interaction Predictors with Noisy Partially Labeled Data Sets

09/18/2015
by Haohan Wang, et al.

Protein-protein interaction (PPI) prediction is an important problem in machine learning and computational biology. However, no data set exists for training or evaluation in which all instances are accurately labeled. Instead, what is available are instances of the positive class (with possibly noisy labels) and no instances of the negative class. The lack of negative-class data is typically handled with the observation that a randomly chosen protein pair has a nearly 100% chance of being non-interacting, since only 1 in 1,500 protein pairs is expected to be an interacting pair. In this paper, we focused on the problem that the lack of accurately labeled testing data sets for PPI prediction may lead to biased evaluation results. We first showed that not acknowledging the inherent skew in the interactome (i.e., the rare occurrence of positive instances) leads to an over-estimated accuracy of the predictor. We then showed that, given the belief that positive interactions are a rare category, using random protein pairs (excluding known interacting pairs) as the negative testing data set could lead to an under-estimated evaluation result. We formalized these two problems to validate the above claims and, based on the formalization, proposed a balancing method that cancels out the over-estimation with the under-estimation. Finally, our experiments validated the theoretical analysis and showed that this balanced evaluation could estimate the exact performance without gold-standard data sets.
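The over-estimation caused by ignoring class skew can be illustrated with a little arithmetic. The sketch below is not the paper's formalization or its balancing method; it only applies Bayes' rule to a hypothetical predictor whose true-positive and false-positive rates (`TPR`, `FPR`) are assumed values chosen for illustration. It compares the precision one would measure on a balanced test set with the precision expected when only 1 in 1,500 pairs interacts.

```python
def precision_at_prior(tpr: float, fpr: float, positive_prior: float) -> float:
    """Expected precision of a predictor when positives occur with the given prior."""
    true_pos = tpr * positive_prior            # mass of correctly flagged positives
    false_pos = fpr * (1.0 - positive_prior)   # mass of negatives flagged as positive
    return true_pos / (true_pos + false_pos)


# Hypothetical operating point (assumed for illustration, not taken from the paper).
TPR, FPR = 0.90, 0.05

balanced = precision_at_prior(TPR, FPR, 0.5)       # 1:1 test set, skew ignored
skewed = precision_at_prior(TPR, FPR, 1 / 1500)    # 1 interacting pair in 1,500

print(f"precision on a balanced test set: {balanced:.3f}")
print(f"precision at a 1-in-1,500 prior:  {skewed:.3f}")
```

The same predictor looks far better on the balanced test set than it would on pairs drawn at the natural interaction rate, which is the kind of bias the abstract describes.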

