Parallel and Scalable Precise Clustering for Homologous Protein Discovery

08/28/2019
by   Stuart Byma, et al.

This paper presents a new, parallel implementation of clustering and demonstrates its utility in greatly speeding up the process of identifying homologous proteins. Clustering is a technique to reduce the number of comparisons needed to find similar pairs in a set of n elements, such as protein sequences. Precise clustering ensures that each pair of similar elements appears together in at least one cluster, so that similarities can be identified by all-to-all comparison within each cluster rather than on the full set. This paper introduces ClusterMerge, a new algorithm for precise clustering that uses transitive relationships among the elements to enable parallel and scalable implementations of this approach. We apply ClusterMerge to the important problem of finding similar amino acid sequences in a collection of proteins. ClusterMerge identifies 99.8% of the similar pairs found by a full O(n^2) comparison, with only half as many operations. More importantly, ClusterMerge is highly amenable to parallel and distributed computation. Our implementation achieves a speedup of 604× on 768 cores (1400× faster than a comparable single-threaded clustering implementation) and a strong scaling efficiency of 90%.
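The idea behind precise clustering can be illustrated with a toy sketch. The code below is not ClusterMerge and does not build clusters at all; it only shows why, once a precise clustering is available, all-to-all comparison within each cluster recovers the same similar pairs as a full comparison while doing fewer comparisons. The function names, the example sequences, the hand-made clusters, and the Hamming-distance similarity test are all assumptions made for illustration; real protein similarity would rely on alignment scoring.

```python
from itertools import combinations


def hamming_similar(a, b, max_diff=1):
    """Toy similarity (assumption): equal length and at most max_diff mismatches."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) <= max_diff


def similar_pairs_full(items, is_similar):
    """Baseline: compare every pair in the full set, n*(n-1)/2 comparisons."""
    pairs, comparisons = set(), 0
    for a, b in combinations(sorted(items), 2):
        comparisons += 1
        if is_similar(a, b):
            pairs.add((a, b))
    return pairs, comparisons


def similar_pairs_clustered(clusters, is_similar):
    """All-to-all comparison inside each cluster only."""
    pairs, seen, comparisons = set(), set(), 0
    for cluster in clusters:
        for a, b in combinations(sorted(cluster), 2):
            if (a, b) in seen:
                continue  # already compared in an overlapping cluster
            seen.add((a, b))
            comparisons += 1
            if is_similar(a, b):
                pairs.add((a, b))
    return pairs, comparisons


# Hypothetical sequences and a hand-made precise clustering of them:
# every similar pair shares at least one cluster.
sequences = ["MKTA", "MKTV", "MKSV", "GHLE", "GHLD"]
clusters = [{"MKTA", "MKTV", "MKSV"}, {"GHLE", "GHLD"}]

full_pairs, full_cmp = similar_pairs_full(sequences, hamming_similar)
clustered_pairs, clustered_cmp = similar_pairs_clustered(clusters, hamming_similar)

print(full_pairs == clustered_pairs)  # same similar pairs either way
print(full_cmp, clustered_cmp)        # comparisons: full set vs. within clusters
```

In this toy case the clustering is precise by construction, so no similar pair is lost. The hard part, which the sketch leaves out entirely, is building such clusters cheaply and in parallel.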


