Parallel String Graph Construction and Transitive Reduction for De Novo Genome Assembly

10/20/2020
by Giulia Guidi, et al.

One of the most computationally intensive tasks in computational biology is de novo genome assembly, the decoding of the sequence of an unknown genome from redundant and erroneous short sequences. A common assembly paradigm identifies overlapping sequences, simplifies their layout, and creates consensus. Despite the many algorithms developed in the literature, the efficient assembly of large genomes is still an open problem. In this work, we introduce new distributed-memory parallel algorithms for the overlap detection and layout simplification steps of de novo genome assembly, and implement them in the diBELLA 2D pipeline. Our distributed-memory algorithms for both overlap detection and layout simplification are based on linear-algebra operations over semirings using 2D distributed sparse matrices. Our layout step consists of performing a transitive reduction from the overlap graph to a string graph. We provide a detailed communication analysis of the main stages of our new algorithms. diBELLA 2D achieves near-linear scaling with over 80% efficiency for the human genome, reducing the runtime for overlap detection by 1.2-1.3x for the human genome and 1.5-1.9x for C. elegans compared to the state-of-the-art. Our transitive reduction algorithm outperforms an existing distributed-memory implementation by 10.5-13.3x for the human genome and 18-29x for C. elegans. Our work paves the way for efficient de novo assembly of large genomes using long reads in distributed memory.
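
To make the idea of a transitive reduction expressed as a semiring matrix operation more concrete, here is a minimal single-machine sketch in Python. It is not the diBELLA 2D implementation, which works on 2D distributed sparse matrices. The dict-of-dicts layout, the function names `min_plus_square` and `transitive_reduction_round`, the use of edge weights as suffix (overhang) lengths, and the `fuzz` tolerance are all assumptions made for illustration. The sketch squares the overlap matrix over the (min, +) semiring to find the shortest two-hop path between each pair of reads, then drops any direct edge that such a path already explains.

```python
import math


def min_plus_square(R):
    """Square a sparse matrix over the (min, +) semiring.

    R is a dict of dicts: R[i][j] is the weight of edge i -> j.
    The result N satisfies N[i][k] = min over j of R[i][j] + R[j][k].
    """
    N = {}
    for i, row in R.items():
        out = {}
        for j, w_ij in row.items():
            for k, w_jk in R.get(j, {}).items():
                if k == i:
                    continue  # ignore two-hop paths that return to the start
                cand = w_ij + w_jk
                if cand < out.get(k, math.inf):
                    out[k] = cand
        N[i] = out
    return N


def transitive_reduction_round(R, fuzz=0):
    """Drop every edge i -> k that a two-hop path explains.

    Assumption: an edge is transitive when the best two-hop path is no
    longer than the direct edge plus a tolerance `fuzz`.
    """
    N = min_plus_square(R)
    return {
        i: {k: w for k, w in row.items() if N[i].get(k, math.inf) > w + fuzz}
        for i, row in R.items()
    }


# Hypothetical toy overlap graph: r1 -> r3 is implied by r1 -> r2 -> r3.
overlap = {
    "r1": {"r2": 3, "r3": 7},
    "r2": {"r3": 4},
    "r3": {},
}
string_graph = transitive_reduction_round(overlap)
print(string_graph)
```

In this toy input the edge from `r1` to `r3` is removed because the path through `r2` has the same total weight. A single round only considers two-hop paths, so a general graph may need repeated rounds until no edge is removed.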

01/13/2018

Scalable De Novo Genome Assembly Using Pregel

De novo genome assembly is the process of stitching short DNA sequences ...
11/10/2020

A step towards neural genome assembly

De novo genome assembly focuses on finding connections between a vast am...
06/30/2018

Fast Characterization of Segmental Duplications in Genome Assemblies

Segmental duplications (SDs), or low-copy repeats (LCR), are segments of...
05/22/2018

copMEM: Finding maximal exact matches via sampling both genomes

Genome-to-genome comparisons require designating anchor points, which ar...
09/19/2018

Extreme Scale De Novo Metagenome Assembly

Metagenome assembly is the process of transforming a set of short, overl...
02/04/2021

Optimal Construction of Hierarchical Overlap Graphs

Genome assembly is a fundamental problem in Bioinformatics, where for a ...
04/06/2020

SOPanG 2: online searching over a pan-genome without false positives

The pan-genome can be stored as elastic-degenerate (ED) string, a recent...