HiCOPS: High Performance Computing Framework for Tera-Scale Database Search of Mass Spectrometry based Omics Data

02/03/2021
by   Muhammad Haseeb, et al.

Database-search algorithms, which deduce peptides from mass spectrometry (MS) data, have sought to improve computational efficiency in order to support larger and more complex systems biology studies. Existing serial and high-performance computing (HPC) search engines, otherwise highly successful, are known to scale poorly as the theoretical search space grows to meet the increased complexity of modern non-model, multi-species MS-based omics analysis. Consequently, the bottleneck for computational techniques is the communication cost of moving data between levels of the memory hierarchy or between processing units, not the arithmetic operations. This post-Moore change in architecture, together with the demands of modern systems biology experiments, has dampened the overall effectiveness of existing HPC workflows. We present a novel, efficient parallel computational method and its implementation on distributed-memory architectures in a peptide identification tool called HiCOPS, which enables a more than 100-fold improvement in speed over most existing HPC proteome database search tools. HiCOPS empowers the supercomputing database search concept for comprehensive identification of peptides and all their modified forms within a reasonable time frame. We demonstrate this by searching gigabytes of experimental MS data against terabytes of databases, where HiCOPS completes peptide identification in a few minutes using 72 parallel nodes (1728 cores), compared with several weeks required by existing state-of-the-art tools using 1 node (24 cores): 100 minutes vs. 5 weeks, a 500x speedup. Finally, we formulate a theoretical framework for our overhead-avoiding strategy and report superior performance evaluation results for key metrics including execution time, CPU utilization, speedups, and I/O efficiency. The software will be made available at: hicops.github.io
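
The headline figures quoted in the abstract can be checked with a few lines of arithmetic. The sketch below is only an illustration of that calculation, not part of HiCOPS: the function and variable names are hypothetical, and it assumes "5 weeks" means five full 7-day weeks of wall-clock time.

```python
# Back-of-the-envelope check of the speedup quoted in the abstract.
# Assumption: "5 weeks" is 5 full 7-day weeks of wall-clock time.

MINUTES_PER_WEEK = 7 * 24 * 60  # 10,080 minutes


def speedup(baseline_minutes: float, parallel_minutes: float) -> float:
    """Ratio of baseline run time to parallel run time."""
    return baseline_minutes / parallel_minutes


# Figures taken from the abstract.
baseline_nodes, baseline_cores = 1, 24
hicops_nodes, hicops_cores = 72, 1728

baseline_minutes = 5 * MINUTES_PER_WEEK  # existing tool, 1 node
hicops_minutes = 100                     # HiCOPS, 72 nodes

observed_speedup = speedup(baseline_minutes, hicops_minutes)
cores_per_node = hicops_cores // hicops_nodes

# Speedup divided by the increase in node count. The baseline is a
# different tool, so this is a cross-tool ratio, not a scaling
# efficiency of HiCOPS itself.
speedup_per_added_node = observed_speedup / (hicops_nodes / baseline_nodes)

print(f"speedup: {observed_speedup:.0f}x")
print(f"cores per node: {cores_per_node}")
print(f"speedup per node-count increase: {speedup_per_added_node:.1f}")
```

Under that assumption the ratio comes to 504, which matches the rounded "500x" in the abstract, and both configurations use 24 cores per node.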


research
07/31/2021

Communication-avoiding micro-architecture to compute Xcorr scores for peptide identification

Database algorithms play a crucial part in systems biology studies by id...
research
11/24/2020

Comprehensive and Sensitive Proteogenomics Data Analysis Strategy based on Complementary Multi-Stage Database Search

Proteogenomics provide opportunities for proteomic validation of gene st...
research
11/15/2022

Massively Parallel Open Modification Spectral Library Searching with Hyperdimensional Computing

Mass spectrometry, commonly used for protein identification, generates a...
research
09/29/2020

Communication Lower-Bounds for Distributed-Memory Computations for Mass Spectrometry based Omics Data

Mass spectrometry based omics data analysis require significant time and...
research
11/18/2018

Algorithmic complexity in Computational Biology

Computational problems can be classified according to their algorithmic ...
research
05/08/2018

Efficient online learning for large-scale peptide identification

Motivation: Post-database searching is a key procedure in peptide identif...
research
06/26/2021

GSmart: An Efficient SPARQL Query Engine Using Sparse Matrix Algebra – Full Version

Efficient execution of SPARQL queries over large RDF datasets is a topic...
