Evaluation of Protein-protein Interaction Predictors with Noisy Partially Labeled Data Sets

09/18/2015
by Haohan Wang, et al.

Protein-protein interaction (PPI) prediction is an important problem in machine learning and computational biology. However, there is no data set for training or evaluation in which all instances are accurately labeled. Instead, what is available are instances of the positive class (with possibly noisy labels) and no instances of the negative class. The lack of negative class data is typically handled with the observation that randomly chosen protein pairs have a nearly 100% chance of being negative, since only 1 in 1,500 protein pairs is expected to be an interacting pair. In this paper, we focus on the problem that the lack of accurately labeled testing data sets in PPI prediction may lead to biased evaluation results. We first show that not acknowledging the inherent skew in the interactome (i.e., the rare occurrence of positive instances) leads to an over-estimated accuracy of the predictor. We then show that, given the belief that positive interactions are a rare category, using random pairs of proteins (excluding known interacting proteins) as the negative testing data set can lead to an under-estimated evaluation result. We formalize these two problems to validate the above claims and, based on the formalization, propose a balancing method that cancels out the over-estimation with the under-estimation. Finally, our experiments validate the theoretical aspects and show that this balanced evaluation can evaluate the exact performance without gold-standard data sets.
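
The two biases described in the abstract can be illustrated with a small numerical sketch. This is not the paper's formalization or its balancing method; it is a toy calculation under stated assumptions. The predictor's true positive rate and false positive rate (`TPR`, `FPR`) are hypothetical values chosen for illustration, and the fraction of unlabeled interactions hidden in a randomly sampled "negative" set is assumed to equal the 1 in 1,500 prevalence quoted above.

```python
def precision(tpr, fpr, prevalence):
    """Expected precision of a predictor with fixed true/false positive
    rates on data where `prevalence` is the fraction of true positives."""
    tp = tpr * prevalence
    fp = fpr * (1.0 - prevalence)
    return tp / (tp + fp)


def observed_fpr(tpr, fpr, hidden_positive_rate):
    """False positive rate measured on a randomly sampled 'negative' set
    in which a fraction `hidden_positive_rate` are really interactions.
    Correct predictions on those hidden positives are counted as errors."""
    return hidden_positive_rate * tpr + (1.0 - hidden_positive_rate) * fpr


# Hypothetical predictor quality (assumption, not taken from the paper).
TPR, FPR = 0.80, 0.01
# Prevalence of interacting pairs quoted in the abstract.
SKEW = 1.0 / 1500.0

# Ignoring the skew: a balanced test set makes precision look high.
balanced_precision = precision(TPR, FPR, 0.5)
# Acknowledging the skew: the same predictor at realistic prevalence.
skewed_precision = precision(TPR, FPR, SKEW)
# Noisy negatives: hidden interactions inflate the measured error rate
# (assumes the hidden-positive fraction equals SKEW).
noisy_fpr = observed_fpr(TPR, FPR, SKEW)

print(f"precision on balanced test set: {balanced_precision:.4f}")
print(f"precision at 1:1500 prevalence: {skewed_precision:.4f}")
print(f"true FPR: {FPR:.5f}, FPR measured on noisy negatives: {noisy_fpr:.5f}")
```

In this sketch the first gap (balanced versus skewed precision) corresponds to the over-estimation, and the second gap (true versus measured false positive rate) corresponds to the under-estimation. The two effects push the measured performance in opposite directions, which is the intuition behind cancelling one with the other.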

