Distributed-Memory Parallel Contig Generation for De Novo Long-Read Genome Assembly

07/10/2022
by Giulia Guidi, et al.

De novo genome assembly, i.e., rebuilding the sequence of an unknown genome from redundant and erroneous short sequences, is a key but computationally intensive step in many genomics pipelines. The exponential growth of genomic data is increasing the computational demand and requires scalable, high-performance approaches. In this work, we present a novel distributed-memory algorithm that, starting from a string graph representation of the genome and using sparse matrices, generates the contig set, i.e., overlapping sequences that form a map representing a region of a chromosome. Using a matrix abstraction, we mask branches in the string graph and compute the connected components to group genomic sequences that belong to the same linear chain (i.e., contig). Then, we perform multiway number partitioning to minimize the load imbalance in local assembly, i.e., the concatenation of sequences from a given contig. Based on the assignment obtained by partitioning, we apply the induced subgraph function to redistribute sequences between processes, resulting in a set of local sparse matrices. Finally, we traverse each matrix using depth-first search to concatenate sequences. Our algorithm shows good scaling, with parallel efficiency up to 80%, and promising results in terms of coverage and assembly quality. Our contig generation algorithm localizes the assembly process to significantly reduce the amount of computation spent on this step. Our work is a step forward for efficient de novo long-read assembly of large genomes in distributed memory.
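
The steps named in the abstract can be illustrated with a small single-process sketch. It is not the authors' distributed sparse-matrix implementation: the edge-list graph, the rule that a "branch" is any read with more than two neighbors, the use of read counts as contig sizes, and the greedy largest-first heuristic for multiway number partitioning are all assumptions made for illustration, and every function name is hypothetical.

```python
from collections import Counter, defaultdict
import heapq


def mask_branches(edges, n):
    """Drop every edge that touches a vertex of degree > 2 (assumed branch rule)."""
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return [(u, v) for u, v in edges if deg[u] <= 2 and deg[v] <= 2]


def connected_components(edges, n):
    """Label each vertex with the smallest vertex id of its component (union-find)."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[max(ru, rv)] = min(ru, rv)
    return [find(x) for x in range(n)]


def partition_contigs(sizes, nprocs):
    """Greedy largest-first heuristic: give the next-largest contig to the least-loaded process."""
    heap = [(0, p) for p in range(nprocs)]
    heapq.heapify(heap)
    assignment = {}
    for cid, size in sorted(sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        load, p = heapq.heappop(heap)
        assignment[cid] = p
        heapq.heappush(heap, (load + size, p))
    return assignment


def walk_chain(edges, members):
    """Depth-first walk of one linear chain, starting from an endpoint when one exists."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    ends = [m for m in members if len(adj[m]) <= 1]
    start = min(ends) if ends else min(members)  # a cycle has no endpoint
    order, seen, stack = [], set(), [start]
    while stack:
        u = stack.pop()
        if u in seen:
            continue
        seen.add(u)
        order.append(u)
        stack.extend(w for w in adj[u] if w not in seen)
    return order


if __name__ == "__main__":
    # Toy string graph over 7 reads; read 2 has three neighbors, so it is a branch.
    edges = [(0, 1), (1, 2), (2, 3), (2, 4), (4, 5), (5, 6)]
    masked = mask_branches(edges, 7)
    labels = connected_components(masked, 7)
    sizes = Counter(labels)  # contig size = number of reads (assumption)
    assignment = partition_contigs(sizes, nprocs=2)
    for cid, proc in sorted(assignment.items()):
        members = [r for r, lab in enumerate(labels) if lab == cid]
        print(f"contig {cid} -> process {proc}, read order {walk_chain(masked, members)}")
```

In this toy graph, masking removes the three edges incident to read 2, which leaves the chains 0-1 and 4-5-6 plus two isolated reads. The read order returned by the walk is the order in which sequences would be concatenated. Actual concatenation would also need overlap coordinates, which this sketch leaves out.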

research
10/20/2020

Parallel String Graph Construction and Transitive Reduction for De Novo Genome Assembly

One of the most computationally intensive tasks in computational biology...
research
01/27/2020

diBELLA: Distributed Long Read to Long Read Alignment

We present a parallel algorithm and scalable implementation for genome a...
research
09/19/2018

Extreme Scale De Novo Metagenome Assembly

Metagenome assembly is the process of transforming a set of short, overl...
research
08/14/2020

PANDA: Processing-in-MRAM Accelerated De Bruijn Graph based DNA Assembly

Spurred by widening gap between data processing speed and data communica...
research
06/07/2022

Fast Exact String to D-Texts Alignments

In recent years, aligning a sequence to a pangenome has become a central...
research
11/10/2020

A step towards neural genome assembly

De novo genome assembly focuses on finding connections between a vast am...
research
09/13/2021

Specified Certainty Classification, with Application to Read Classification for Reference-Guided Metagenomic Assembly

Specified Certainty Classification (SCC) is a new paradigm for employing...
