MS-BioGraphs: Sequence Similarity Graph Datasets

08/31/2023
by Mohsen Koohi Esfahani, et al.

Progress in High-Performance Computing in general, and in High-Performance Graph Processing in particular, depends heavily on the availability of publicly accessible, relevant, and realistic datasets. To sustain this progress, we (i) investigate and optimize the process of generating large sequence similarity graphs as an HPC challenge and (ii) demonstrate this process by creating MS-BioGraphs, a new family of publicly available, real-world, edge-weighted graph datasets with up to 2.5 trillion edges, which is 6.6 times larger than the largest recently published graph. The largest graph is created by matching (i.e., all-to-all similarity aligning) 1.7 billion protein sequences. The MS-BioGraphs family also includes seven subgraphs of different sizes and direction types. We describe the two main challenges we faced in generating large graph datasets and our solutions, namely (i) optimizing the data structures and algorithms for this multi-step process and (ii) a WebGraph parallel compression technique. We present a comparative study of the structural characteristics of MS-BioGraphs. The datasets are available online at https://blogs.qub.ac.uk/DIPSA/MS-BioGraphs .
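To make the idea of an edge-weighted sequence similarity graph concrete, the following is a minimal toy sketch, not the authors' pipeline. It assumes a k-mer Jaccard score as a stand-in for real sequence alignment, an arbitrary similarity threshold, and a CSR-style (offsets, neighbors, weights) layout. The function names `kmers`, `similarity`, and `build_similarity_graph` are hypothetical.

```python
from itertools import combinations


def kmers(seq, k=3):
    """Return the set of length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a, b, k=3):
    """Toy score: Jaccard index of k-mer sets (assumed stand-in for alignment)."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 0.0


def build_similarity_graph(seqs, threshold=0.2, k=3):
    """Compare every pair of sequences and return CSR-style arrays.

    Vertex i is seqs[i]; an undirected weighted edge (stored in both
    directions) links i and j when their score reaches the threshold.
    """
    adj = [[] for _ in seqs]
    for i, j in combinations(range(len(seqs)), 2):
        w = similarity(seqs[i], seqs[j], k)
        if w >= threshold:
            adj[i].append((j, w))
            adj[j].append((i, w))

    offsets, neighbors, weights = [0], [], []
    for row in adj:
        for j, w in sorted(row):
            neighbors.append(j)
            weights.append(w)
        offsets.append(len(neighbors))
    return offsets, neighbors, weights


if __name__ == "__main__":
    # Made-up example sequences, for illustration only.
    seqs = ["MKTAYIAK", "MKTAYIAR", "GGGGGGGG"]
    print(build_similarity_graph(seqs))
```

The quadratic pairwise loop is only workable for tiny inputs. At the scale described above (1.7 billion sequences), this step is the HPC challenge the abstract refers to.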


