Simulating the DNA String Graph in Succinct Space

01/29/2019
by   Diego Díaz-Domínguez, et al.

Converting a set of sequencing reads into a lossless compact data structure that encodes all the relevant biological information is a major challenge. The classical approaches are to build the string graph or the de Bruijn graph, and each has advantages over the other depending on the application. Still, the ideal setting would be an index of the reads that is easy to build and can be adapted to any type of biological analysis. In this paper, we propose a new data structure, which we call rBOSS, that gets close to that ideal. In practice, rBOSS is a de Bruijn graph, but it simulates de Bruijn graphs of any length up to k and can compute overlaps of size at least m between the labels of the nodes, where k and m are parameters. If we choose k equal to the size of the reads, then we can simulate a complete string graph. Like most BWT-based structures, rBOSS is unidirectional, but it exploits the reverse-complement property of DNA to simulate bidirectionality with some time-space trade-offs. We implemented a genome assembler on top of rBOSS to demonstrate its usefulness. Our experimental results show that, using k = 100, rBOSS can assemble 185 MB of reads in less than 15 minutes, using 110 MB in total. It produces contigs with a mean size over 10,000, which is twice the size obtained by using a pure de Bruijn graph of fixed length k.
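
The two notions the abstract relies on, reverse complements and suffix-prefix overlaps of size at least m, can be illustrated with a minimal sketch. This is a brute-force illustration on plain strings, not the rBOSS structure itself (which is succinct and BWT-based); the function names `reverse_complement`, `longest_overlap`, and `overlap_edges` are hypothetical and chosen only for this example.

```python
# Naive illustration only: rBOSS answers these queries on a succinct
# BWT-based index, whereas this sketch compares plain strings directly.

# Translation table for the DNA alphabet (assumes reads contain only A, C, G, T).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def longest_overlap(a: str, b: str, m: int) -> int:
    """Length of the longest suffix of `a` equal to a prefix of `b`.

    Returns 0 if no such overlap of length at least `m` exists.
    """
    # Try candidate lengths from longest to shortest, never below max(m, 1).
    for length in range(min(len(a), len(b)), max(m, 1) - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0


def overlap_edges(reads: list[str], m: int) -> list[tuple[int, int, int]]:
    """Brute-force string-graph-style edges (i, j, overlap) between distinct reads."""
    edges = []
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                ov = longest_overlap(a, b, m)
                if ov:
                    edges.append((i, j, ov))
    return edges
```

For instance, calling `overlap_edges(reads + [reverse_complement(r) for r in reads], m)` would consider both strands of every read, which is one simple way to see why reverse complements allow a unidirectional structure to recover overlaps in both directions.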


research
05/25/2018

Strong link between BWT and XBW via Aho-Corasick automaton and applications to Run-Length Encoding

The boom of genomic sequencing makes compression of set of sequences ine...
research
02/28/2022

Minimal Absent Words on Run-Length Encoded Strings

A string w is called a minimal absent word (MAW) for another string T if...
research
07/08/2020

String Indexing for Top-k Close Consecutive Occurrences

The classic string indexing problem is to preprocess a string S into a c...
research
12/22/2021

On the Reverse-Complement String-Duplication System

Motivated by DNA storage in living organisms, and by known biological mu...
research
04/16/2019

Dynamic Packed Compact Tries Revisited

Given a dynamic set K of k strings of total length n whose characters ar...
research
06/21/2022

The Complexity of the Co-Occurrence Problem

Let S be a string of length n over an alphabet Σ and let Q be a subset o...
research
03/24/2019

The Size of a t-Digest

A t-digest is a compact data structure that allows estimates of quantile...
