Parallel String Graph Construction and Transitive Reduction for De Novo Genome Assembly

10/20/2020
by   Giulia Guidi, et al.
0

One of the most computationally intensive tasks in computational biology is de novo genome assembly, the decoding of the sequence of an unknown genome from redundant and erroneous short sequences. A common assembly paradigm identifies overlapping sequences, simplifies their layout, and creates consensus. Despite many algorithms developed in the literature, the efficient assembly of large genomes is still an open problem. In this work, we introduce new distributed-memory parallel algorithms for overlap detection and layout simplification steps of de novo genome assembly, and implement them in the diBELLA 2D pipeline. Our distributed memory algorithms for both overlap detection and layout simplification are based on linear-algebra operations over semirings using 2D distributed sparse matrices. Our layout step consists of performing a transitive reduction from the overlap graph to a string graph. We provide a detailed communication analysis of the main stages of our new algorithms. diBELLA 2D achieves near linear scaling with over 80 efficiency for the human genome, reducing the runtime for overlap detection by 1.2-1.3x for the human genome and 1.5-1.9x for C. elegans compared to the state-of-the-art. Our transitive reduction algorithm outperforms an existing distributed-memory implementation by 10.5-13.3x for the human genome and 18-29x for the C. elegans. Our work paves the way for efficient de novo assembly of large genomes using long reads in distributed memory.

READ FULL TEXT

page 1

page 8

page 9

research
07/10/2022

Distributed-Memory Parallel Contig Generation for De Novo Long-Read Genome Assembly

De novo genome assembly, i.e., rebuilding the sequence of an unknown gen...
research
09/13/2023

GPU Scheduler for De Novo Genome Assembly with Multiple MPI Processes

De Novo Genome assembly is one of the most important tasks in computatio...
research
11/10/2020

A step towards neural genome assembly

De novo genome assembly focuses on finding connections between a vast am...
research
06/30/2018

Fast Characterization of Segmental Duplications in Genome Assemblies

Segmental duplications (SDs), or low-copy repeats (LCR), are segments of...
research
05/22/2018

copMEM: Finding maximal exact matches via sampling both genomes

Genome-to-genome comparisons require designating anchor points, which ar...
research
10/13/2022

Fast genomic optical map assembly algorithm using binary representation

Reducing the cost of sequencing genomes provided by next-generation sequ...
research
06/01/2022

Learning to Untangle Genome Assembly with Graph Convolutional Networks

A quest to determine the complete sequence of a human DNA from telomere ...

Please sign up or login with your details

Forgot password? Click here to reset