Efficient online learning for large-scale peptide identification

05/08/2018
by Xijun Liang, et al.

Motivation: Post-database searching is a key procedure in peptide identification with tandem mass spectrometry (MS/MS) strategies; it refines the peptide-spectrum matches (PSMs) generated by database search engines. Although many statistical and machine learning-based methods have been developed to improve the accuracy of peptide identification, challenges remain on large-scale datasets and on datasets with an extremely large proportion of false positives (hard datasets). A more efficient learning strategy is required to improve the performance of peptide identification on these challenging datasets.

Results: In this work, we present an online learning method that addresses the challenges remaining for existing peptide identification algorithms. We propose a cost-sensitive learning model that uses different loss functions for decoy and target PSMs. A larger penalty is imposed for wrongly selecting decoy PSMs than for errors on target PSMs, so the new model can reduce its false discovery rate on hard datasets. We also design an online learning algorithm, OLCS-Ranker, to solve the proposed learning model. Rather than taking all training samples at once, OLCS-Ranker feeds only one training sample into the learning model at each round. As a result, the memory requirement is significantly reduced for large-scale problems. Experimental studies show that OLCS-Ranker outperforms benchmark methods, such as CRanker and Batch-CS-Ranker, in terms of accuracy and stability. Furthermore, OLCS-Ranker is 15 to 85 times faster than the CRanker method on large datasets.

Availability and implementation: OLCS-Ranker software is available at no charge for non-commercial use at https://github.com/Isaac-QiXing/CRanker.
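The two ideas described in the Results (a heavier penalty on decoy errors, and one training sample per round) can be illustrated with a minimal sketch. This is not the OLCS-Ranker implementation: it assumes a plain linear scorer trained with a cost-weighted hinge loss, and the names `online_cost_sensitive_train`, `psm_score`, `c_decoy`, `c_target`, and `lr` are hypothetical choices made only for illustration.

```python
import random


def psm_score(w, b, x):
    """Linear score of one PSM feature vector x (hypothetical helper)."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def online_cost_sensitive_train(samples, c_decoy=2.0, c_target=1.0, lr=0.1, seed=0):
    """One online pass over (features, label) pairs.

    label is +1 for a target PSM and -1 for a decoy PSM.
    c_decoy > c_target makes a margin violation on a decoy cost more,
    which is the cost-sensitive part of this sketch (an assumption,
    not the loss used by the paper).
    """
    order = list(range(len(samples)))
    random.Random(seed).shuffle(order)

    w = [0.0] * len(samples[0][0])
    b = 0.0
    for i in order:
        # Only one sample is touched per round, so the update itself
        # needs memory proportional to the feature dimension only.
        x, y = samples[i]
        if y * psm_score(w, b, x) < 1.0:  # hinge margin violated
            cost = c_target if y > 0 else c_decoy
            step = lr * cost * y
            w = [wj + step * xj for wj, xj in zip(w, x)]
            b += step
    return w, b
```

A call such as `w, b = online_cost_sensitive_train(samples)` returns the learned weights, and `psm_score(w, b, x)` then ranks new PSMs. Raising `c_decoy` relative to `c_target` makes each decoy violation move the model further, which is one simple way to trade recall for a lower false discovery rate.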

research
11/24/2020

WeiPS: a symmetric fusion model framework for large-scale online learning

The recommendation system is an important commercial application of mach...
research
08/30/2022

Compound Figure Separation of Biomedical Images: Mining Large Datasets for Self-supervised Learning

With the rapid development of self-supervised learning (e.g., contrastiv...
research
10/24/2014

Online and Stochastic Gradient Methods for Non-decomposable Loss Functions

Modern applications in sensitive domains such as biometrics and medicine...
research
02/03/2021

HiCOPS: High Performance Computing Framework for Tera-Scale Database Search of Mass Spectrometry based Omics Data

Database-search algorithms, that deduce peptides from Mass Spectrometry ...
research
10/29/2014

Faster graphical model identification of tandem mass spectra using peptide word lattices

Liquid chromatography coupled with tandem mass spectrometry, also known ...
research
12/16/2019

Progressive Learning Algorithm for Efficient Person Re-Identification

This paper studies the problem of Person Re-Identification (ReID) for lar...
