An Index for Sequencing Reads Based on The Colored de Bruijn Graph

08/06/2019
by   Diego Díaz-Domínguez, et al.
Universidad de Chile
0

In this article, we show how to transform a colored de Bruijn graph (dBG) into a practical index for processing massive sets of sequencing reads. Similar to previous works, we encode an instance of a colored dBG of the set using BOSS and a color matrix C. To reduce the space requirements, we devise an algorithm that produces a smaller and more sparse version of C. The novelties in this algorithm are (i) an incomplete coloring of the graph and (ii) a greedy coloring approach that tries to reuse the same colors for different strings when possible. We also propose two algorithms that work on top of the index; one is for reconstructing reads, and the other is for contig assembly. Experimental results show that our data structure uses about half the space of the plain representation of the set (1 Byte per DNA symbol) and that more than 99

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

08/26/2020

Vertex Ordering Algorithms for Graph Coloring Problem

Graph coloring is a fundamental problem in combinatorics with many appli...
03/21/2022

Brooks' Theorem in Graph Streams: A Single-Pass Semi-Streaming Algorithm for Δ-Coloring

Every graph with maximum degree Δ can be colored with (Δ+1) colors using...
05/12/2021

Distributed Graph Coloring Made Easy

In this paper we present a deterministic CONGEST algorithm to compute an...
11/23/2019

Hardness of some variants of the graph coloring game

Very recently, a long-standing open question proposed by Bodlaender in 1...
06/24/2019

Hierarchical Colorings of Cographs

Cographs are exactly hereditarily well-colored graphs, i.e., the graphs ...
12/08/2019

Deep Learning-based Hybrid Graph-Coloring Algorithm for Register Allocation

Register allocation, which is a crucial phase of a good optimizing compi...
09/15/2020

Scatterplot Selection Applying a Graph Coloring Problem

Scatterplot selection is an effective approach to represent essential po...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

A set of sequencing reads is a massive collection of overlapping short strings that together encode the sequence of a DNA sample. Analyzing this kind of data allows scientists to uncover complex biological processes that otherwise could not be studied. There are many ways for extracting information from a set of reads (see [27] for review). However, in most of the cases, the process can be reduced to build a de Bruijn graph (dBG) of the collection and then search for graph paths that spell segments of the source DNA (see  [6, 15, 28] for some examples).

Briefly, a dBG is a directed labeled graph that stores the transitions of the substrings of size , or kmers, in . Constructing it is relatively simple, and the resulting graph usually uses less space than the input text. Nevertheless, this data structure is lossy, so it is not always possible to know if the label of a path matches a substring of the source DNA. The only paths that fulfill this property are those in which all nodes, except the first and last, have indegree and outdegree one [16]. Still, they represent just a fraction of the complete dBG.

More branched parts of the graph are also informative, but traverse them requires extra information to avoid spelling incorrect sequences. A simple solution is to augment the dBG with colors, in other words, we assign a particular color to every string , and then we store the same in every edge that represents a kmer of . In this way, we can walk over the graph always following the successor node colored with the same color of the current node.

The idea of coloring dBGs was first proposed by Iqbal et al. [15]. Their data structure, however, contemplated a union dBG built from several string collections, with colors assigned to the collections rather than particular strings. Considering the potential applications of colored dBGs, Boucher et al. [4] proposed a succinct version of the data structure of Iqbal et al. In their index, called VARI, the topology of the graph is encoded using  [5], and the colors are stored separately from the dBG in a binary matrix , in which the rows represent the kmers and the columns represent the colors. Since the work of Boucher et al., some authors have tried to compress and manipulate even further; including that of [2], [25], [13], while others, such as [21] and [22] have proposed methods to store compressed and dynamic versions .

An instance of a colored dBG for a single set can also be encoded using a color matrix. The only difference though is that the number of columns is proportional to the number of sequences in . Assigning a particular color to every sequence is not a problem if the collection is of small or moderated size. However, massive datasets are rather usual in Bioinformatics, so even using a succinct representation of might not be enough. One way to reduce the number of columns is to reuse colors for those sequences that do not share any kmer in the dBG. Alpanahi et al. [1] addressed this problem, and showed that it is unlikely that the minimum-size coloring can be approximated in polynomial time.

Alpanahi et al. also proposed a heuristic for recoloring the colored dBG of a set of sequences that, in practice, dramatically reduces the number of colors when

is a set of sequencing reads. Their coloring idea, however, might still produce incorrect sequences, so the applications of their version of the colored dBG are still limited.

Our Contributions. In this article, we show how to use a colored dBG to store and analyze a collection of sequencing reads succinctly. Similarly to VARI, we use and the color matrix to encode the data. However, we reduce the space requirements by partially coloring the dBG and greedily reusing the same colors for different reads when possible. We also propose two algorithms that work on top of the data structure, one for reconstructing the reads directly from the dBG and other for assembling contigs. We believe that these two algorithms can serve as a base to perform Bioinformatics analyses in compressed space. Our experimental results show that on average, the percentage of nodes in that need to be colored is about 12.4%, the space usage of the whole index is about half the space of the plain representation of (1 Byte/DNA symbol), and that more than 99% of the original reads can be reconstructed from the index.

2 Preliminaries

DNA strings. A DNA sequence is a string over the alphabet (which we map to ), where every symbol represents a particular nucleotide in a DNA molecule. The DNA complement is a permutation that reorders the symbols in exchanging a with t and c with g. The reverse complement of , denoted , is a string transformation that reverses and then replaces every symbol by its complement . For technical convenience we add to the so-called dummy symbol $, which is always mapped to 1.

de Bruijn graph. A de Bruijn graph (dBG) [9] of the string collection is a labeled directed graph that encodes the transitions between the substrings of size of , where is a parameter. Every node is labeled with a unique substring of . Two nodes and are connected by a directed edge if the suffix of overlaps the prefix of and the -string resulted from the overlap exists as substring in . The label of the edge is the last symbol of the label of node .

Rank and select data structures. Given a sequence of elements over the alphabet , with and , returns the number of times the element occurs in , while returns the position of the th occurrence of in . For binary alphabets, can be represented in bits so that rank and select are solved in constant time [8]. When has 1s, a compressed representation using bits, still solving the operations in constant time, is of interest [26].

BOSS index.

The data structure [5] is a succinct representation for dBGs based on the Burrows-Wheeler Transform ([7]. In this index, the labels of the nodes are regarded as rows in a matrix and sorted in reverse lexicographical order, i.e., strings are read from right to left. Suffixes and prefixes in of size below

are also included in the matrix by padding them with

$ symbols in the right size (for suffixes) or the left side (for prefixes). These padded nodes are also called dummy. The last column of the matrix is stored as an array , with being the number of labels lexicographically smaller than any other label ending with character . Additionally, the symbols of the outgoing edges of every node are sorted and then stored together in a single array

. A bit vector

is also set to mark the position in of the first outgoing symbol of each node. The complete index is thus composed of the vectors , , and . It can be stored in bits, where is the zero-order empirical entropy [23, Sec 2.3]. This space is reached with a Huffman-shaped Wavelet Tree [18] for , a compressed bitmap [26] for (as it is usually very dense), and a plain array for . Bowe et al. [5] defined the following operations over to navigate the graph:

  • outdegree: number of outgoing edges of .

  • forward: node reached by following an edge from labeled with .

  • indegree: number of incoming edges of .

  • backward: list of the nodes with an outgoing edge to .

  • nodeLabel: label of node .

The first four operations can be answered in time while the last one takes time. For our purposes, we also define the following operations:

  • forward_r: node reached by following the r-th outgoing edge of in lexicographical order.

  • label2Node: identifier in of the node labeled with the -string .

The function forward_r is a small variation forward, and it maintains the original time, while the function label2Node is the opposite of nodeLabel, but it also maintain its complexity in time.

Graph coloring.

The problem of coloring a graph consists of assigning an integer to each node such that i) no adjacent nodes have the same color and ii) is minimal. The coloring is complete if all the nodes of the graph are assigned with one color, and it is proper if constraint i) is met for each node. The chromatic number of a graph , denoted by , is the minimum number of colors required to generate a coloring that is complete and proper. A coloring using exactly colors is considered to be optimal. Determining if there is a feasible -coloring for is well known to be NP-complete, while the problem of inferring is NP-hard [17].

Colored dBG.

The first version of the colored dBG [15] was described as a union graph built from several dBGs of different string collections. The edges in that encode the kmers of the i-th collection are assigned the color . The compacted version of this graph [4] represents the topology of using the index and the colors using a binary matrix , where the position is set to true if the kmer represented by the i-th edge in the ordering of is assigned color . The rows of are then stored using the compressed representation for bit vectors of [26], or using Elias-Fano encoding [11, 10, 24] if the rows are very sparse. In the single set version of the colored dBG, the colors are assigned to every string. Therefore, the number of columns in grows with the size of the collection. Alipanahi et al. [1] noticed that we could reduce the space of by using the same colors in those strings that have no common kmers. This new problem was named the CDBG-recoloring, and formally stated as follows; given a set of strings and its dBG , find the minimum number of colors such that i) every string is assigned one color and ii) strings having two or more kmers in common in cannot have the same color. Alipanahi et al. [1] showed that an instance of the Graph-Coloring problem can be reduced in polynomial time to another instance of the CDBG-recoloring problem. Thus, any algorithm that finds , also finds the minimum number of colors for dBG . However, they also proved that the decision version of CDBG-Recoloring is NP-complete.

3 Definitions

Let be a collection of DNA sequencing reads, and let be a collection of size that contains the strings in along with their DNA reverse complements. The dBG of order constructed from is defined as , and an instance of for is denoted as , where and include the dummy nodes and their edges. For simplicity, we will refer to just as . A node in is considered an starting node if its label is of the form $, where $ is a dummy symbol and is a prefix of one or more sequences in . Equivalently, a node is considered an ending node if its label is of the form $, with $ being a dummy and being a suffix of one or more sequences in . Nodes whose labels do not contain dummy symbols are considered solid, and solid nodes with at least one predecessor node with outdegree more than one are considered critical. For practical reasons, we define two extra functions, isStarting and isEnding that are used to check if a node is starting or ending respectively.

A walk over the dBG of is a sequence where are nodes and are edges, connecting with . is a path if all the nodes are different, except possibly the first and the last. In such case, is said to be a cycle. A sequence is unambiguous if there is a path in whose label matches the sequence of and if no pair of colored nodes in share a predecessor node . In any other case, is ambiguous. Finally, the path that spells the sequence of is said to be safe if every one of its branching nodes has only one successor colored with the color of .

We assume that is a factor-free set, i.e., no is also a substring of another sequence , with .

4 Coloring a dBG of reads

In this section, we define a coloring scheme for that generates a more succinct color matrix, and that allows us to reconstruct and assemble unambiguous sequences of . We use the dBG of because most of the Bioinformatic analyses require the inspection of the reverse complements of the reads. Unlike previous works, the rows in represent the nodes in instead of the edges.

A partial coloring.

We make more sparse by coloring only those nodes in the graph that are strictly necessary for reconstructing the sequences. We formalize this idea with the following lemma:

Lemma 1

For the path of an unambiguous sequence to be safe we have to color the starting node that encodes the prefix of , the ending node that encodes the suffix of and the critical nodes in the path.

Proof

We start a walk from using the following rules: (i) if the current node in the walk has outdegree one, then we follow its only outgoing edge, (ii) if is a branching node, i.e., it has outdegree more than one, then we inspect its successor nodes and follow the one colored with the same color of and (iii) if is equal to , then we stop the traversal.∎

Note that the successor nodes of a branching node are critical by definition, so they are always colored. On the other hand, nodes with outdegree one do not require a color inspection because they have only one possible way out.

Coloring the nodes and for every is necessary; otherwise, it would be difficult to know when a path starts or ends. Consider, for example, using the solid nodes that represent the prefix and the suffix of as starting and ending points respectively. It might happen that the starting point of can also be a critical point of another sequence . If we start a reconstruction from and pick the color of , then we will generate an incomplete sequence. A similar argument can be used for ending nodes. The concepts associated with our coloring idea are depicted in Figures 2A and 2B.

Unsafe coloring.

As explained in Section 2, we can use the recoloring idea of [1] to reduce the number of columns in . Still, using the same colors for unrelated strings is not safe for reconstructing unambiguous sequences.

Lemma 2

Using the same color for two unambiguous sequences that do not share any substring might produce an unsafe path for or .

Proof

Assume there is another pair of sequences that do not share any subsequence either, to which we assign them color . Suppose that the paths of and crosses the paths of and such that the resulting dBG topology resembles a grid. In other words, if has the edge and has the edge , then has the edge and has the edge . In this scenario, will have two successors, node from the path of and some other node from the path of . Both and are critical by definition so they will be colored with and respectively. The problem is that node is also a critical node for , so it will also have color . The reason is that , a node that precedes , appears in and . As a consequence, the path of is no longer safe because one of its nodes ( in this example) has to successors colored with . A similar argument can be made for and color . Figure 1 depicts the idea of this proof. ∎

When spurious edges connect paths of unrelated sequences that are assigned the same color (as the in the proof of Lemma 2), we can generate chimeric strings if, by error, we follow one of those edges. In the algorithm, we solve this problem by assigning different colors to those strings with sporadic edges, even if they do not share any substring.

Figure 1: Example of unsafe paths produced by a graph recoloring. (A) The dBG generated from the unambiguous sequences , , and . Every texture represents the path of a specific string. (B) Recolored dBG. Sequences and are assigned the same color (light gray) as they do not share any substring. Similarly, sequences and are assigned another color (horizontal lines) as they do not share any sequence neither. Nodes and are those mentioned in the Proof of Lemma  2. The sequences of and cannot be reconstructed as their paths become unsafe after the recoloring.

Safer and greedy coloring.

Our greedy coloring algorithm starts by marking in a bitmap the nodes of that need to be colored (starting, ending and critical). After that, we create an array of entries. Every with will contain a dynamic vector that stores the colors of the j-th colored node in the ordering. We also add support to to map a node to its array of colors in . Thus, its position can be inferred as .

The only inputs we need for the algorithm are , and . For every we proceed as follows; we append a dummy symbol at the ends of the string, and then use the function label2Node to find the node labeled with the prefix of . Note that this prefix will map a starting node as we append dummies to . From , we begin a walk on the graph and follow the edges whose symbols match the characters in the suffix . Note now that the last node we visit in this walk is an ending node that maps the suffix of . As we move through the edges, we store in an array the starting, ending, and critical nodes associated with . Additionally, we push into another array the neighbor nodes of the walk that need to be inspected to assign a color to . The rules for pushing elements into are as follows; i) if is a node in the path of with outdegree more than one, then we push all its successor nodes into , ii) if is a node in the path of with indegree more than one, then we visit every predecessor node of , and if has outdegree more than one, then we push into the successor nodes of . Once we finish the traversal, we create a hash map and fill it with the colors that were previously assigned to the nodes in and . After that, we pick the smallest color that is not in the keys of , and push it to every array with . After we process all the sequences in , the final set of colors is represented by the values in . The whole processing of coloring a is described in detail by the procedure greedyCol in Pseudocode 1.

The construction of the sets and is independent for every string in , so it can be done in parallel. However, the construction of the hash map and the assignment of the color to the elements of has to be performed sequentially as all the sequences in need concurrent access to .

Figure 2: Succinct colored dBG. (A) The topology of the graph. Colors and textures represent the paths that spell the input sequences of the dBG. Numbers over the nodes are their identifiers. Nodes 4,1,6 and 10 are starting nodes (darker borders). Nodes 11,13 and 12 are ending nodes and nodes 3,9,11 and 5 are critical. (B) Our greedy coloring algorithm. (C) The binary matrix that encodes the colors of Figure B. The left side is in its uncompressed form and the right side is our succinct version of using the arrays ,, and .

Ambiguous sequences.

Our scheme, however, cannot safely retrieve sequences that are ambiguous.

Lemma 3

Ambiguous sequences of cannot be reconstructed safely from the color matrix and .

Proof

Assume that collection is composed just by one string , where is a repeated substring and are two different symbols in . Consider also that the kmer size for is . The instance of will have a node labeled with , with two outgoing edges, whose symbols are and . Given our coloring scheme, the successor nodes of will be both colored with the same color. As a consequence, if during a walk we reach node , then we will get stuck because there is not enough information to decide which is the correct edge to follow (both successor nodes have the same color).∎

A sequence will be ambiguous if it has the same pattern in two different contexts. Another case in which is ambiguous is when a spurious edge connects an uncolored node of with two or more critical nodes in the same path. Note that unlike unambiguous sequences with spurious edges, an ambiguous sequence will always be encoded by an unsafe path, regardless of the recoloring algorithm. In general, the number of ambiguous sequences will depend on the value we use for .

5 Compressing the colored dBG

The pair can be regarded as a compact representation of , where the empty rows were discarded. Every , with , is a row with at least one value, and every color , with , is a column. However, is not succinct enough to make it practical. We are still using a computer word for every color of . Besides, we need extra words to store the pointers for the lists in .

We compress by using an idea similar to the one implemented in to store the edges of the dBG. The first step is to sort the colors of every list . Because the greedy coloring generates a set of unique colors for every node, each becomes an array of strictly increasing elements after the sorting. Thus, instead of storing the values explicitly, we encode them as deltas, i.e., . After transforming , we concatenate all its values into one single list and create a bit map to mark the first element of every in . We store using Elias-Fano encoding [10, 11] and using the compressed representation for bit maps of [26]. Finally, we add support to to map a range of elements in to an array in . The complete representation of the color matrix now becomes (see Figure 2C). The complete index of the colored dBG is thus composed of our version of and . We now formalize the idea of retrieving the colors of a node from the succinct representation of .

  • getColors: list of colors assigned to node .

Theorem 5.1

the function getColors computes in time the colors assigned to node .

Proof

We first compute the rank of node within the colored nodes. This operation is carried out with . After retrieving , we obtain the range in where the values of lie. For this purpose, we perform two operations over , . Finally, we scan the range in , and as we read the values, we incrementally reconstruct the colors from the deltas. All the rank and select operations takes , and reading the entries from takes , because retrieving an element from an Elias-Fano-encoded array also takes . In conclusion, computing the colors of takes .∎

6 Algorithms for the colored dBG

Reconstructing unambiguous sequences.

We describe now an online algorithm that works on top of our index and that reconstructs all the unambiguous sequences in . We cannot tell, however, if a reconstructed string was present in the original set or if it was its reverse complement . This is not really a problem, because a sequence and its reverse complement are equivalent in most of the Bioinformatic analyses.

The algorithm receives as input a starting node . It first computes an array with the colors assigned to using the function getColors (see Section 5), and then sets a string . For every color , the algorithm performs the following steps; initializes two temporary variables, an integer and string , and then begins a graph walk from . If the outdegree of is one, then the next node in the walk is the successor node . On the other hand, if the outdegree of is more than one, then the algorithm inspects all the successor nodes of to check which one of them is the node colored with . If there is only one such , then it sets . This procedure continues until becomes an ending node. During the walk, the edge symbols are pushed into . When an ending node is reached, the algorithm reports as the reconstructed sequence.

If at some point during a walk, the algorithm reaches a node with outdegree more than one, and with more than one successor colored with , then aborts the reconstruction of the string as the path is unsafe for color . Then, it returns to and continues with the next sequence. The complete procedure is detailed in the function buildSeqs of Algorithm 2.

Assembling contigs.

Our coloring scheme for the dBG allows us to report sequences that represent the overlap of two or more strings of . There are several ways in which a set of sequences can be arranged such that they form valid overlaps, but in practice, we are not interested in all such combinations. What we want is to compute only those union strings that describe real segments of the underlying genome of , a.k.a contig sequences. In this work we do not go deep into the complexities of contig assembly (see [16, 19, 14, 20] for some review). Instead, we propose a simple heuristic, that work on top of our index, and that it is aimed to produce contigs that are longer than those produced by uncolored dBGs.

Similar to buildSeqs, this method traverses the graph to reconstruct the contigs. During the process, it uses the color information to weight the outgoing edges on the fly

, and thus, inferring which is the most probable path that matches a real segment of the source DNA.

The algorithm receives as input a starting node and initializes a set and hash map . Both data structures are used to store information about the strings that belong to the contig of . A read is identified in the index as a pair , where is a color assigned to and is the starting node of its path. contains the reads already traversed while contains the active reads. The algorithm also initializes a string and pushes every pair with . After that, it begins a walk from and pushes into the symbols of the edges it visits. For every new node reached during the walk, the algorithm checks if one of its predecessor nodes, say , is a starting node. If so, then for every sets if does not exist in . On the other hand, if one of the successors of , say , is an ending node, then for every sets and then removes the entry . After updating and , it selects one of the outgoing edges of to continue the walk. For this purpose, the algorithm uses the following rules; (i) if has outdegree one, then it takes its only outgoing edge, (ii) if has outdegree more than one, then it inspects how the colors in distribute among the successors of . If there is only one successor node of , say , colored with at least fraction of the colors of , where is a parameter, then the algorithm follows , and removes from the colors of the other successor nodes of .

The algorithm will stop if; (i) there is no such that meet the threshold, (ii) there is more than one successor of with the same color or (iii) has outdegree one, but the successor node is an ending node. After finishing the walk, the substring is reported as the contig. The procedure contigAssm in Pseudocode 3 describes in detail the contig assembly algorithm, and a graphical example is shown in Figure 3.

Figure 3: Example of the assembly of a contig using our index. (A) Inexact overlap of four sequences. The circle to the left of every string represents its color in the dBG. Light gray symbols are mismatches in the overlap. (B) The colored dBG of the sequences. Circles with darker borders are starting and ending nodes. Light gray values over the starting nodes are their identifiers. The contig assembly begins in node 5 (denoted with a dashed arrow) and the threshold to continue the extension is set to 0.5. The state of the hash map when the walk reaches a branching node (dashed circles) is depicted below the graph. The assembly ends in the right-most branching node as it has not a successor node that contains at least 50% of the colors in . The final contig is shown as a light grey path over the graph, and its sequence is stored in .

7 Experiments

We use a set of reads generated from the E.coli genome111http://spades.bioinf.spbau.ru/spades_test_datasets/ecoli_mc to test the ideas described in this article. The raw file was in FASTQ format and contained 14,214,324 reads of 100 characters long each. We preprocessed the file by removing sequencing errors using the tool of [3], and discarding reads with N symbols. The preprocessing yielded a data set of 8,655,214 reads (a FASTQ file of 2GB). Additionally, we discarded sequencing qualities and the identifiers of the reads as they are not considered in our data structure. From the resulting set (a text file of 833.67 MB), we created another set that considers the elements in and their reverse complements.

Our version of the colored dBG, the algorithm for greedy coloring and the algorithms for reconstructing and assembly reads were implemented in C++222https://bitbucket.org/DiegoDiazDominguez/colored_bos/src/master, on top of the SDSL-lite library [12]. In our implementation, arrays and are precomputed beforehand to store the colors directly to them, because using the dynamic list is not cache-friendly. Additionally, all our code, except the algorithm for contig assembly, can be executed using multiple threads.

We built six instances of our index using as input. We choose different values for , from 25 to 50 in steps of five. The coloring of every one of these instances was carried out using eight threads. Statistics about the graph topologies are shown in Table 1, and statistics about the coloring process are shown in Table 2. In every instance, we reconstructed the unambiguous reads (see Table 2). Additionally, we generated an FM-index of to locate the reconstructed reads and check that they were real sequences. All the tests were carried out on a machine with Debian 4.9, 252 GB of RAM and processor Intel(R) Xeon(R) Silver @ 2.10GHz, with 32 cores.

8 Results

The average compression rate achieved by our index is 1.89, meaning that, in all the cases, the data structure used about half the space of the plain representation of the reads (see Table 1). We also note that the smaller the value for , the greater the size of the index. This behavior is expected as the dBG becomes denser when we decrease . Thus, we have to store a higher number of colors per node.

k Total number Number of Number of Index Compression
of nodes solid nodes edges size rate
25 106,028,714 11,257,781 120,610,151 446.38 1.86
30 142,591,410 11,425,646 157,186,548 443.82 1.87
35 179,167,289 11,561,630 193,773,251 441.18 1.88
40 215,751,326 11,667,364 230,365,635 438.23 1.90
45 252,337,929 11,743,320 266,958,709 435.30 1.91
50 288,925,674 11,791,640 303,552,318 432.13 1.92
Table 1: Statistics about the different colored dBGs generated in the experiments. The index size is expressed in MB and considers the space of plus the space of our succinct version of . The compression rate was calculated as the space of the plain representation of the reads (833.67 MB) divided by the index size.

The number of colors of every instance is several orders of magnitude smaller than the number of reads, being the instance with more colors (6552) and the instance with the fewest (1689). Even though the fraction of colored nodes in every instance is small, the percentage of the index space used by the color matrix is still high ( 73% on average). Regarding the time for coloring the graph, it seems to be reasonable for practical purposes if we use several threads. In fact, building, filling and compacting took 5,015 seconds on average, and the value decreases if we increment . The working space, however, is still considerable. We had memory peaks ranging from 3.03 GB to 4.3 GB, depending on the value for (see Table 2).

The process of reconstructing the reads yielded a small number of ambiguous sequences in all the instances (2,760 sequences on average), and decreases with higher values of , especially for values above 40 (see Table 2).

k Number of Number of Color space Ambiguous Elapsed Memory
colored nodes colors usage sequences Time peak
25 21,882,874 6,552 83.03 1904 5,835 4,391
30 21,907,324 4,944 79.14 1502 5,551 4,119
35 21,926,687 2,924 75.27 1224 5,131 3,847
40 21,942,083 2,064 71.40 1054 4,872 3,575
45 21,954,138 1,888 67.51 714 4,507 3,303
50 21,964,947 1,689 63.58 176 4,199 3,030
Table 2: Statistics about our greedy coloring algorithm. The column “Color space usage” refers to the percentage of the index space used by our succinct version of . Elapsed time and memory peak are expressed in seconds and MB, respectively, and both consider only the process of building, filling, and compacting the color matrix.

9 Conclusions and further work

Experimental results shows our data structure is succinct, and that has a practical use. Still, we believe that a more careful algorithm for constructing the index is still necessary to reduce the memory peaks during the coloring. Further compaction of the color matrix can be achieved by using more elaborated compression techniques. However, this extra compression can increase the construction time of the colored dBG and produce a considerable slow down in the algorithms that work on top of it for extracting information from the reads. Comparison of our results with other similar data structures is difficult for the moment. Most of the indexes based on colored dBGs were not designed to handle huge sets of colors like ours and the greedy recoloring of

[1] does not scale well and needs extra information for reconstructing the reads. Still, it is a promising approach that, with further work, can be used in the future as a base for performing Bioinformatics analyses in compressed space.

References

  • [1] B. Alipanahi, A. Kuhnle, and C. Boucher (2018) Recoloring the colored de bruijn graph. In Proc. 25th International Symposium on String Processing and Information Retrieval (SPIRE), pp. 1–11. Cited by: §1, §2, §4, §9.
  • [2] F. Almodaresi, P. Pandey, and R. Patro (2017) Rainbowfish: a succinct colored de Bruijn graph representation. In Proc. 17th International Workshop on Algorithms in Bioinformatics (WABI), pp. article 18. Cited by: §1.
  • [3] A. Bankevich, S. Nurk, D. Antipov, A. A. Gurevich, M. Dvorkin, A. S. Kulikov, V. M. Lesin, S. I. Nikolenko, S. Pham, A. D. Prjibelski, et al. (2012) SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. Journal of Computational Biology 19 (5), pp. 455–477. Cited by: §7.
  • [4] Boucher,Christina, A. Bowe, T. Gagie, S. J. Puglisi, and K. Sadakane (2015) Variable-order de Bruijn graphs. In Proc. 25th Data Compression Conference (DCC), pp. 383–392. Cited by: §1, §2.
  • [5] A. Bowe, Onodera,Taku, K. Sadakane, and T. Shibuya (2012) Succinct de Bruijn graphs. In Proc. 12th International Workshop on Algorithms in Bioinformatics (WABI), pp. 225–235. Cited by: §1, §2.
  • [6] N. Bray, H. Pimentel, P. Melsted, and L. Pachter (2016) Near-optimal probabilistic rna-seq quantification. Nature Biotechnology 34 (5), pp. 525–527. Cited by: §1.
  • [7] M. Burrows and D. Wheeler (1994) A block sorting lossless data compression algorithm. Technical report Technical Report 124, Digital Equipment Corporation. Cited by: §2.
  • [8] D. Clark (1996) Compact PAT trees. Ph.D. Thesis, University of Waterloo, Canada. Cited by: §2.
  • [9] N. G. de Bruijn (1946) A combinatorial problem. Koninklijke Nederlandse Akademie v. Wetenschappen 49 (49), pp. 758–764. Cited by: §2.
  • [10] P. Elias (1974) Efficient storage and retrieval by content and address of static files. Journal of the ACM 21 (2), pp. 246–260. Cited by: §2, §5.
  • [11] R. M. Fano (1971) On the number of bits required to implement an associative memory. Massachusetts Institute of Technology. Cited by: §2, §5.
  • [12] S. Gog, T. Beller, A. Moffat, and M. Petri (2014) From theory to practice: plug and play with succinct data structures. In Proc. 13th International Symposium on Experimental Algorithms (SEA), pp. 326–337. Cited by: §7.
  • [13] G. Holley, R. Wittler, and J. Stoye (2015) Bloom filter trie – a data structure for pan-genome storage. In Proc. 15th International Workshop on Algorithms in Bioinformatics (WABI), pp. 217–230. Cited by: §1.
  • [14] R. M. Idury and M. S. Waterman (1995) A new algorithm for DNA sequence assembly. Journal of Computational Biology 2 (2), pp. 291–306. Cited by: §6.
  • [15] Z. Iqbal, M. Caccamo, I. Turner, P. Flicek, and G. McVean (2012) De novo assembly and genotyping of variants using colored de Bruijn graphs. Nature Genetics 44 (2), pp. 226–232. Cited by: §1, §1, §2.
  • [16] J. D. Kececioglu and E. W. Myers (1995) Combinatorial algorithms for DNA sequence assembly. Algorithmica 13 (1), pp. 7–51. Cited by: §1, §6.
  • [17] R. Lewis (2015) A guide to graph colouring. Springer. Cited by: §2.
  • [18] V. Mäkinen and G. Navarro (2005) Succinct suffix arrays based on run-length encoding. Nordic Journal of Computing 12 (1), pp. 40–66. Cited by: §2.
  • [19] P. Medvedev, K. Georgiou, G. Myers, and M. Brudno (2007) Computability of models for sequence assembly. In Proc. 7th International Workshop on Algorithms in Bioinformatics (WABI), pp. 289–301. Cited by: §6.
  • [20] P. Medvedev, S. Pham, M. Chaisson, G. Tesler, and P. Pevzner (2011) Paired de bruijn graphs: a novel approach for incorporating mate pair information into genome assemblers. Journal of Computational Biology 18 (11), pp. 1625–1634. Cited by: §6.
  • [21] H. Mustafa, A. Kahles, M. Karasikov, and G. Raetsch (2017) Metannot: a succinct data structure for compression of colors in dynamic de bruijn graphs. bioRxiv, pp. article 236711. Cited by: §1.
  • [22] H. Mustafa, I. Schilken, M. Karasikov, C. Eickhoff, G. Rätsch, and A. Kahles (2018) Dynamic compression schemes for graph coloring. Bioinformatics 35 (3), pp. 407–414. Cited by: §1.
  • [23] G. Navarro (2016) Compact data structures: a practical approach. Cambridge University Press. Cited by: §2.
  • [24] D. Okanohara and K. Sadakane (2007) Practical entropy-compressed rank/select dictionary. In Proc. 9th Workshop on Algorithm Engineering and Experiments (ALENEX), pp. 60–70. Cited by: §2.
  • [25] P. Pandey, F. Almodaresi, M. A. Bender, M. Ferdman, R. Johnson, and R. Patro (2018) Mantis: a fast, small, and exact large-scale sequence-search index. Cell Systems 7 (2), pp. 201–207. Cited by: §1.
  • [26] R. Raman, V. Raman, and S. R. Satti (2007) Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets. ACM Transactions on Algorithms 3 (4), pp. article 43. Cited by: §2, §2, §2, §5.
  • [27] J. Reuter, D. Spacek, and M. Snyder (2015) High-throughput sequencing technologies. Molecular Cell 58 (4), pp. 586–597. Cited by: §1.
  • [28] L. Salmela, R. Walve, E. Rivals, and E. Ukkonen (2016) Accurate self-correction of errors in long reads using de bruijn graphs. Bioinformatics 33 (6), pp. 799–806. Cited by: §1.

10 Appendix

Appendix 0.A Pseudocodes

1:procedure greedyCol(,,,) is a dBG, is a bitmap, is a string and is array of lists
2:      append dummy symbols at the ends of
3:     
4:     
5:     
6:     for each  do traverse the dBG path of
7:         
8:         if  then
9:              for  to  do
10:                                          
11:         
12:         if  then
13:              for  to  do
14:                  
15:                  
16:                  if  then
17:                       for  to  do
18:                                                                                            
19:         if  is true then
20:                        
21:               
22:     
23:     
24:     for each  do compute the colors already used
25:         for each  do
26:               true               
27:      minimum color not in
28:     for each  do color the nodes
29:               
Algorithm 1 Function greedyCol
1:procedure buildSeqs(,) is a colored dBG and is a starting node
2:      list of rebuilt sequences
3:      getColors(,)
4:      initialize an string with the label of
5:     for each  do
6:          temporal dBG node
7:          false
8:         while isEnding(,) is false and amb is false do
9:               outdegree(,)
10:              if  is  then
11:                   push the new symbol into
12:                  
13:              else
14:                  
15:                  for  to  do check which successors of has color
16:                       if  then
17:                           
18:                                                                     
19:                  if  then more than one successor has color
20:                        true                                          
21:         if  not true then
22:                             
23:     return
Algorithm 2 Function buildSeqs
1:procedure contigAssm(,,) is a starting node and is a threshold
2:     
3:     
4:     for each  do
5:               
6:     while true do
7:         if  then
8:              
9:              if  then add new reads to the contig
10:                  for each  do
11:                       if  is not true then
12:                                                                                            
13:         if  then
14:              
15:              for  to  do compute the most probable successor node
16:                  
17:                  if  then discard reads ending at
18:                       for each  do
19:                                                   
20:                       
21:                  else
22:                       
23:                        weight the successor node
24:                       if  then
25:                           
26:                           
27:                           
28:                           break                                                        
29:              if  isthen break no successor has the minimum weight               
30:         else
31:              
32:              if  then break               
33:                             
34:     return
Algorithm 3 Function contigAssm