Genome assembly, a universal theoretical framework: unifying and generalizing the safe and complete algorithms

11/25/2020
by   Massimo Cairo, et al.
0

Genome assembly is a fundamental problem in Bioinformatics, requiring to reconstruct a source genome from an assembly graph built from a set of reads (short strings sequenced from the genome). A notion of genome assembly solution is that of an arc-covering walk of the graph. Since assembly graphs admit many solutions, the goal is to find what is definitely present in all solutions, or what is safe. Most practical assemblers are based on heuristics having at their core unitigs, namely paths whose internal nodes have unit in-degree and out-degree, and which are clearly safe. The long-standing open problem of finding all the safe parts of the solutions was recently solved by a major theoretical result [RECOMB'16]. This safe and complete genome assembly algorithm was followed by other works improving the time bounds, as well as extending the results for different notions of assembly solution. But it remained open whether one can be complete also for models of genome assembly of practical applicability. In this paper we present a universal framework for obtaining safe and complete algorithms which unify the previous results, while also allowing for easy generalizations to assembly problems including many practical aspects. This is based on a novel graph structure, called the hydrostructure of a walk, which highlights the reachability properties of the graph from the perspective of the walk. The hydrostructure allows for simple characterizations of the existing safe walks, and of their new practical versions. Almost all of our characterizations are directly adaptable to optimal verification algorithms, and simple enumeration algorithms. Most of these algorithms are also improved to optimality using an incremental computation procedure and a previous optimal algorithm of a specific model.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
02/04/2021

Optimal Construction of Hierarchical Overlap Graphs

Genome assembly is a fundamental problem in Bioinformatics, where for a ...
research
11/10/2020

A step towards neural genome assembly

De novo genome assembly focuses on finding connections between a vast am...
research
07/04/2016

Cell assemblies at multiple time scales with arbitrary lag constellations

Hebb's idea of a cell assembly as the fundamental unit of neural informa...
research
02/24/2020

From omnitigs to macrotigs: a linear-time algorithm for safe walks – common to all closed arc-coverings of a directed graph

A partial solution to a problem is called safe if it appears in all solu...
research
12/09/2021

Unique Assembly Verification in Two-Handed Self-Assembly

One of the most fundamental and well-studied problems in Tile Self-Assem...
research
05/06/2021

Self-Replication via Tile Self-Assembly (extended abstract)

In this paper we present a model containing modifications to the Signal-...
research
06/01/2022

Learning to Untangle Genome Assembly with Graph Convolutional Networks

A quest to determine the complete sequence of a human DNA from telomere ...

Please sign up or login with your details

Forgot password? Click here to reset