Extreme-scale many-against-many protein similarity search

03/03/2023
by Oguz Selvitopi, et al.

Similarity search is one of the most fundamental computations regularly performed on ever-growing protein datasets. Scalability is of paramount importance for uncovering novel phenomena that occur only at very large scales. We harness over 20,000 GPUs on the Summit system to perform all-vs-all protein similarity search on one of the largest publicly available datasets, containing 405 million proteins, in less than 3.5 hours, cutting the time-to-solution for many use cases from weeks to hours. The variability of protein sequence lengths, together with the sparsity of the space of pairwise comparisons, makes this a challenging problem in distributed memory. Because the application must construct and maintain a data structure holding indices to all other sequences, it has a huge memory footprint that makes it hard to scale to larger problem sizes. We overcome this memory limitation with innovative matrix-based blocking techniques, without introducing additional load imbalance.
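
As a rough illustration of why blocking helps with the memory footprint, the sketch below finds candidate sequence pairs one block of the pairwise comparison space at a time, so that the index and the pair counts held in memory cover only the current block. It is a toy, single-process example and not the authors' implementation: the shared k-mer criterion, the function names `kmers` and `candidate_pairs_blocked`, and the parameters `k`, `block_size`, and `min_shared` are all assumptions made for illustration.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of distinct length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def candidate_pairs_blocked(seqs, k=3, block_size=2, min_shared=1):
    """Yield (i, j, shared) for pairs i < j sharing >= min_shared k-mers.

    The pairwise space is visited block by block (upper triangle only),
    so the inverted index and the counts cover one block at a time.
    Hypothetical sketch: for simplicity the k-mer sets of all sequences
    are still kept in memory here.
    """
    n = len(seqs)
    kmer_sets = [kmers(s, k) for s in seqs]
    blocks = [range(s, min(s + block_size, n)) for s in range(0, n, block_size)]

    for bi, rows in enumerate(blocks):
        for cols in blocks[bi:]:
            # Inverted index (k-mer -> sequence ids) for this column block only.
            index = defaultdict(list)
            for j in cols:
                for km in kmer_sets[j]:
                    index[km].append(j)

            # Count shared k-mers for pairs that fall inside this block.
            counts = defaultdict(int)
            for i in rows:
                for km in kmer_sets[i]:
                    for j in index.get(km, ()):
                        if i < j:
                            counts[(i, j)] += 1

            for (i, j), shared in sorted(counts.items()):
                if shared >= min_shared:
                    yield i, j, shared


if __name__ == "__main__":
    toy = ["MKVLA", "KVLAG", "GGGGG", "AMKVL"]
    for pair in candidate_pairs_blocked(toy, k=3, block_size=2):
        print(pair)
```

Each yielded pair would then be passed to an alignment step, which is omitted here. Choosing a smaller `block_size` lowers the peak size of the per-block index and counts at the cost of visiting more blocks; the set of pairs found does not depend on the block size.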


