Space-efficient merging of succinct de Bruijn graphs

02/07/2019 ∙ by Lavinia Egidi, et al. ∙ Universidade de São Paulo Università del Piemonte Orientale 0

We propose a new algorithm for merging succinct representations of de Bruijn graphs introduced in [Bowe et al. WABI 2012]. Our algorithm is inspired by the approach introduced by Holt and McMillan [Bionformatics 2014, ACM-BCB 2014] for merging Burrow-Wheeler transforms. In addition to its input, our algorithm uses only four bitvectors of working space and all data is accessed sequentially making the algorithm suitable for external memory execution. Our algorithm can also merge Colored succinct de Bruijn graphs, and can compute the Variable Order succinct representation from the plain representation of the input graphs. Combining our merging algorithm with recent results on de Bruijn graph construction, we provide a space efficient procedure for building succinct representations of de Bruijn graphs for very large collections.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

The de Bruijn graph for a collection of strings is a key data structure in genome assembly. After the seminal work of Bowe et al [5], many succinct representations of this data structure have been proposed in the literature [2, 3, 4, 15] offering more and more functionalities still using a fraction of the space required to store the input collection uncompressed.

In this paper we consider the problem of merging two existing succinct representations of different collections in order to build the succinct  graph for the union collection. Since from the  graph is a lossy representation and it cannot be used to recover the original collection, the alternative to merging is storing a copy of the collection to be used for building new  graphs.

Recently, Muggli et al. [14] have proposed a merging algorithm for colored  graphs and have shown the effectiveness of the merging approach for the construction of  graphs for very large datasets. The algorithm in [14] is based on a MSD Radix Sort procedure of the graph nodes and its running time is , where is the total number of edges, the order of the  graph and the total number of colors. In this paper we present a new merging algorithm based on LSD Radix Sort which is inspired by a BWT merging algorithm introduced by Holt and McMillan [9, 10] and later improved in [7]. Our algorithm has the same time complexity of the one in [14] but is space efficient, in that it only uses four bitvectors of working space, and accesses all data by sequential scans, so it is suitable for execution in external memory. In addition, our algorithm can compute, with no additional cost, the (Longest Common Suffix) information thus making it possible to construct succinct Variable Order  graph representations [4]. Combining our merging algorithm with recent results on external memory  graph construction [6], we provide a space efficient procedure for building succinct representations of  graphs for very large collections.

2 Notation

Given an alphabet of size and a collection of strings over , we prepend to each string copies of a symbol which is lexicographically smaller than any other symbol. The order- graph for the collection is a directed edge-labeled graph containing a node for every unique -mer appearing in one of the strings of . For each node we denote by its associated -mer, where are symbols. The graph contains an edge , with label , iff one of the strings in contains a -mer with prefix and suffix . The edge therefore represents the -mer . Note that each node has at most outgoing edges and all edges incoming to node have label .

2.1 BOSS succinct representation

In 2012, Bowe et al. [5] introduced a succinct representation for the de Bruijn graphs, herein referred to as BOSS representation, for the authors’ initials. The authors showed how to implement the graph in small space supporting fast navigation operations. The BOSS representation of the graph is defined by considering the set of nodes sorted according to the co-lexicographic order of the associated strings. Hence, if denotes the string reversed, the nodes are ordered so that

(1)

By construction the first node is and all are distinct. For each node , , we define as the sorted sequence of symbols on the edges leaving from node ; if has out-degree zero we set . Finally, we define

  1. as the concatenation ;

  2. as the bitvector such that iff corresponds to the label of the edge such that has the smallest rank among the nodes that have an edge going to node ;

  3. as the bitvector such that iff he outgoing edges in and have different source nodes, we define .

It is easy to see that the length of the arrays , , and is equal to the number of edges plus the number of nodes with out-degree 0. In addition, the number of ’s in is equal to the number of nodes , and the number of ’s in is equal to the number of nodes with positive in-degree, which is since is the only node with in-degree 0. Note that there is a natural one-to-one correspondence between the indices such that and the the set : in this correspondence where is the destination node of the edge associated to the entry . See example in Figs. 1 and 2.

Figure 1: graph for TACACT, TACTCA, GACTCA
Figure 2: BOSS representation for for TACACT, TACTCA, GACTCA
Property .

The map is order preserving in the following sense: if then

(2)

In [5] it is shown that enriching the arrays , , and with data structures supporting constant time rank and select operations [16, 8], we can efficiently navigate the graph . The authors defined the following basic queries returns the node reached from by an edge with label , if no such node exists, it returns ; returns the nodes with an edge from with ; and returns the last symbol of .

2.2 Colored BOSS

The colored graph [11] is an extension of the graphs for a multiset of individual graphs, where each edge is associated with a set of “colors” that indicates which graphs contain that edge.

The BOSS representation for a set of graphs contains the union of all individual graphs. The colors of each edge are stored in a two-dimensional binary array , such that indicates that the -th edge is present in the individual graph . There are different compression alternatives for the color matrix that support fast operations [15, 2, 13]. Recently, Alipanah et al. [1] presented a different approach to reduce the size of by recoloring.

2.3 Variable-order BOSS

The order (dimension) of a graph is an important parameter for genome assembling algorithms. The graph can be very small and uninformative when is small, whereas it can become too large with unrelated parts when is large.

To add flexibility to the BOSS representation, Boucher et al. [4] suggest to enrich the BOSS representation of an order- graph with the length of the longest common suffix () between the consecutive nodes sorted according to Eq. 1. These lengths are stored in a wavelet tree using additional bits. The authors show that this enriched representation supports navigation on all  graphs of order and that it is even possible to vary the order of the graph on the fly during the navigation up to the maximum value .

The between and is equivalent to the length of the longest common prefix () between their reverses and . The (or ) between the nodes can be computed during the -mer sorting phase. In the following we denote by VO-BOSS the variable order succinct graph consisting of the BOSS representations enriched with the information.

3 Merging plain BOSS representations

Suppose we are given the BOSS representation of two graphs and obtained respectively from the collections of strings and . In this section we show how to compute the BOSS representation for the union collection . The procedure does not change in the general case when we are merging an arbitrary number of graphs. Let and denote respectively the (uncompressed) graphs for and , and let

their respective set of nodes sorted in colexicographic order. Hence, with the notation of the previous section we have

(3)

We observe that the -mers in the collection are simply the union of the -mers in and . To build the graph for we need therefore to: 1) merge the nodes in and according to the colexicographic order of their associated -mers, 2) recognize when two nodes in and refer to the same -mer, and 3) properly merge and update the bitvectors , and , . The following sections are devoted to the solutions of each one of these subproblems.

3.1 Phase 1: Merging -mers

The main technical difficulty is that in the BOSS representation the -mers associated to each node are not directly available. Our algorithm will essentially reconstruct them using the symbols associated to the graph edges; note that to this end it suffices to consider the edges such that the corresponding entry in or is equal to .

Following these edges, first we can recover the last symbol of each -mer, following them a second time we could recover the last two symbols of each -mer and so on. However, to save space we do not explicitly maintain the -mers; instead, using the ideas from [9, 10] we compute a bitvector representing how the -mers in and should be merged according to the colexicographic order. In addition, we maintain just enough information, to recognize when two -mers are identical, so that their incoming and outgoing edges can be merged.

To this end, our algorithm executes iterations of the code shown in Fig. 3. For , during iteration , we compute a bitvector containing 0’s and 1’s such that satisfies the following property

Property .

For and the -th 0 precedes the -th 1 in if and only if . ∎

Property 3.1 states that if we merge the nodes from and according to the bitvector the corresponding -mers will be sorted according to the lexicographic order restricted to the first symbols of each reversed -mer. As a consequence, will provide us the colexicographic order of all the nodes in and . To prove that Property 3.1 holds, we first define and show that it satisfies the property, then we prove that for the code in Fig. 3 computes that still satisfies Property 3.1.

For let and denote respectively the number of nodes in and whose associated -mers end with symbol . These values can be computed with a single scan of (resp. ) considering only the symbols (resp. ) such that (resp. ). By construction, it is

where the two 1’s account for the nodes , and whose associated -mer is . We define

the first pair 01 in accounts for and ; each group accounts for the nodes ending with symbol . Note that, apart from the first two symbols, can be logically partitioned into subarrays one for each alphabet symbol. For let

then the subarray corresponding to starts at position and has size . Note that, as a consequence of (3), the -th 0 (resp. -th 1) belongs to the subarray associated to symbol iff (resp. ).

To see that satisfies Property 3.1, observe that the -th 0 precedes -th 1 iff the -th 0 belongs to a subarray corresponding to a symbol not larger than the symbol corresponding to the subarray containing the -th 1; this implies .

The bitvectors computed by the algorithm in Fig. 3 can be logically divided into the same subarrays we defined for . Because of how the array is initialized and updated, we see that every time we read a symbol at line 14 the corresponding bit , identifying the graph containing , is written in the portion of corresponding to (line 16). The only exception are the first two entries of which are written at line 6 which corresponds to the nodes and . We treat these nodes differently since they are the only ones with in-degree zero. For all other nodes, we implicitly use the one-to-one correspondence between entries with and nodes with positive in-degree.

Note that lines 810 and 1722 of the algorithm are related to the computation of the array that is used in the following section and do not influences the computation of  or Property 3.1. We can now formally prove the following result.

For , the array computed by the algorithm in Fig. 3 satisfies Property 3.1.

Proof.

To prove the “if” part, let denote two indexes such that is the -th 0 and is the -th 1 in for some and . We need to show that .

Assume first . The hypothesis implies , since otherwise during iteration  the -th 1 would have been written in a subarray of preceding the one where the -th 0 is written. Hence as claimed.

Assume now . In this case during iteration the -th 0 and the -th 1 are both written to the subarray of associated to symbol . Let , denote respectively the value of the main loop variable  in the procedure of Fig. 3 when the entries and are written. Since each subarray in is filled sequentially, the hypothesis implies . By construction and . Say is the -th 0 in and is the -th 1 in . By the inductive hypothesis on it is

(4)

By construction there is an edge labeled from to and from to hence

therefore

using (4) we conclude that as claimed.

For the “only if” part assume for some and . We need to prove that in the -th 0 precedes the -th 1. If the proof is immediate. If then

Let and be such that and . By induction hypothesis, in the -th 0 precedes the -th 1.

During phase , the -th 0 in is written to position when processing the -th 0 of , and the -th 1 in is written to position when processing the -th 1 of . Since in the -th 0 precedes the -th 1 and since and both belongs to the subarray of corresponding to the symbol , their relative order does not change and the -th 0 precedes the -th 1 as claimed. ∎

 

1:for  to  do
2:      Init array
3:      Init array
4:end for
5:; Init counters for and
6: First two entries correspond to
7:for  to  do
8:     if  and  then
9:          A new block of is starting
10:     end if
11:      Read bit from
12:     repeat Current node is from graph
13:         if  then
14:               Get symbol from outgoing edges
15:               Get destination for according to symbol
16:               Copy bit to
17:              if  then
18:                   Update block id for symbol
19:                  if  then Check if already marked
20:                        A new block of will start here
21:                  end if
22:              end if
23:         end if
24:     until  Exit if was last child
25:end for

 

Figure 3: Main procedure for merging compact graphs.

3.2 Phase 2: Recognizing identical -mers

Once we have determined, via the bitvector , the colexicographic order of the -mers, we need to determine when two -mers are identical since in this case we have to merge their outgoing and incoming edges. Note that two identical -mers will be consecutive in the colexicographic order and they will necessarily belong one to and the other to .

Following Property 3.1, and a technique introduced in [7], we identify the -th 0 in with and the -th 1 in with . Property 3.1 is equivalent to state that we can logically partition into -blocks

(5)

such that each block corresponds to a set of -mers which are prefixed by the same length- substring. Note that during iterations the -mers within an -block will be rearranged, and sorted according to longer and longer prefixes, but they will stay within the same block.

In the algorithm of Fig. 3, in addition to , our algorithm maintains an integer array , such that at the end of iteration  it is if and only if a block of starts at position . At the beginning of the algorithm we set . During iteration  new block boundaries are established as follows. At line 8 identify each existing block with its starting position. Then, at lines 1722, if the entry has the form , while has the form with and belonging to different blocks, then we know that is the starting position of an -block. Note that we write to only if no other value has been previously written there. This ensures that is the smallest position in which the string in and differ, or equivalently, is the LCP between the strings in and . The above observations are summarized in the following Lemma, which is a generalization to  graphs of an analogous result for BWT merging established in Corollary 4 in [7].

After iteration  of our merging algorithm for if then is the LCP between the reverse -mers associated to and , while if their LCP is equal to , hence such -mers are equal.∎

The above lemma shows that using array we can establish when two -mers are equal and consequently the associated graph nodes should be merged.

3.3 Phase 3: Building the succinct representation for the union graph

We now show how to compute the succinct representation of the union graph , consisting of the arrays , , , given the succinct representations of and and the arrays and .

The arrays , , are initially empty and we will fill them in a single sequential pass. For we consider the values and . If then the -mer associated to , say is identical to the -mer associated to , say . In this case we recover from and the labels of the edges outgoing from and , we compute their union and write them to (we assume the edges are in the lexicographic order), writing at the same time the representation of the resulting out-degree of the new node to . If instead , then the -mer associated to is unique and we copy the information of its outgoing edges and out-degree directly to and .

When we write the symbol we simultaneously write the bit according to the following strategy. If the symbol is the first occurrence of after a value , with , then we set , otherwise we set . The rationale is that if no values with occur between two nodes, then the associated (reversed) -mers have a common LCP of length and therefore if they both have an outgoing edge labelled with they reach the same node and only the first one should have .

4 Implementation details and analysis

Let denote the sum of number of nodes in and , and let denote the sum of the number of edges. The -mer merging algorithm as described executes in time a first pass over the arrays , , and , to compute the values for and initialize the arrays and (Phase 1). Then, the algorithm executes iterations of the code in Fig. 3 each iteration taking time. Finally, still in time the algorithm computes the succinct representation of the union graph (Phases 2 and 3). The overall running time is therefore .

The space usage of the algorithm, in addition to the input and the output consists of bits for two instances of the array (for the current and for the previous iteration ), plus bits for the array. Note however, that during iteration we only need to check whether is equal to 0, , or some value within 0 and . Similarly, for the computation of we need to distinguish between the cases where is equal to 0, or some value . Therefore, we can save space replacing with an array containing two bits per entry representing the four possible states . During iteration , the values in are used instead of the ones in as follows: An entry corresponds to , an entry corresponds to an entry . In addition, if is even, an entry corresponds to and an entry corresponds to ; while if

is odd the correspondence is

, . The reason for this apparently involved scheme, first introduced in [6], is that during phase , an entry in can be modified either before or after we have read it at Line 8. Using this technique, we get that the total working space of the algorithm, i.e., the space in addition to the input and the output is bits.

Note that during Phase 1, at each iteration , the arrays , , , and are read sequentially from beginning to end. At the same time, the arrays and are written sequentially but into different partitions whose starting positions are the values in  and are the same for each iteration. Thus, if we split and into different files all accesses are sequential and Phase 1 is suitable for execution in external memory using only words of RAM for the arrays , , and . Since during Phases 2 and 3 all input and output arrays are accessed sequentially in linear time we can summarize our analysis as follows.

The merging of two succinct representations of  graphs takes time and bits of working space. The algorithm can be executed in external memory using bits of RAM and sequential IOs.∎

Note that the -mer merging algorithm uses only symbols in and corresponding to the 1’s in and (line 13 in Fig. 3). Before starting Phase 1 we can make a copy of such arrays in temporary arrays and , so that each successive iteration takes time. The overall running time can be therefore reduced to , at the cost of using additional bits.

The merging of two succinct representations of  graphs can be done in time and bits of working space. The algorithm can be executed in external memory using bits of RAM and sequential IOs. ∎

5 Merging other BOSS representations

Our algorithm can be easily generalized to merge colored/variable-order BOSS representations, described in Sections 2.22.3.

Given the colored BOSS representation of two graphs and , the corresponding color matrices and have size and , respectively. We initially create a new color matrix of size with all entries empty. During the merging of the union graph (Phase 3), for , we write the colors of the edges associated to to the corresponding line in possibly merging the colors when we find nodes with identical -mers in time, with . To make sure that colors from and do not intersect in the new graph we just need to add the constant (the number of distinct colors in ) to any color id coming from the matrix .

The merging of two succinct representations of colored  graphs takes time and bits of working space, where is . ∎

We now show that we can compute the variable order VO-BOSS representation of the union of two graphs and given their plain, eg. non variable order, BOSS representations. For the VO-BOSS representation we need the array for the nodes in the union graph , , . We notice that after merging the -mers of and with the algorithm in Fig. 3 (Phase 1) the values in already provide the LCP information between the reverse labels of all consecutive nodes (Lemma 3.2). When building the union graph during Phase 3, for , the between two consecutive nodes, say and , is equal to the of their reverses and , which is given by whenever (if then and nodes and should be merged). Note that in this case we cannot use the compact (2-bit) representation of suggested in Section 4, however we can re-use the space of to store the array.

The merging of two succinct representations of variable order  graphs takes time and bits of working space. ∎

6 Space-efficient construction of succinct de Bruijn graphs

Using our merging algorithm it is straightforward to design a complete space-efficient algorithm to construct succinct graphs.

Assume we are given a string collection of total length , and the desired order , and the amount of available RAM . First, we split into smaller subcollections , such that we can compute the BWT and LCP array of each subcollection in linear time in RAM using bytes, using e.g. the suffix sorting algorithm gSACA-K [12]. For each subcollection we then compute, and write to disk, the BOSS representation of its graph using the algorithm described in [6, Section 5.3]. Since these are linear algorithms the overall cost of this phase is time and sequential IOs.

Finally, we merge all graphs into a single BOSS representation of the union graph with the external memory variant of the merging algorithm (Section 3). Since the number of subcollections is , a total of merging rounds will suffices to get the BOSS representation of the union graph.

Given a collection of strings collection of total length , the construction of the succinct order graph takes time using words of RAM.∎

Note that our construction algorithm can be extended to generate colored/variable order variants of the graph, since the last merging step can compute these variants given plain BOSS representations as input (see Section 5). Finally, we observe that we can also update the graph built for a collection with new data without reconstructing the complete graph. In order to do that, we can compute the succinct graph for the new strings (via the BWT and LCP array) and them merge the new graph into the larger graph for using the merging algorithm.

References

  • [1] Bahar Alipanahi, Alan Kuhnle, and Christina Boucher. Recoloring the colored de Bruijn graph. In SPIRE, volume 11147 of Lecture Notes in Computer Science, pages 1–11. Springer, 2018.
  • [2] Fatemeh Almodaresi, Prashant Pandey, and Robert Patro. Rainbowfish: A succinct colored de Bruijn graph representation. In WABI, volume 88 of LIPIcs, pages 18:1–18:15. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, 2017.
  • [3] Djamal Belazzougui, Travis Gagie, Veli Mäkinen, Marco Previtali, and Simon J. Puglisi. Bidirectional variable-order de Bruijn graphs. International Journal of Foundations of Computer Science, 29(08):1279–1295, 2018. doi:10.1142/s0129054118430037.
  • [4] Christina Boucher, Alexander Bowe, Travis Gagie, Simon J. Puglisi, and Kunihiko Sadakane. Variable-order de Bruijn graphs. In DCC, pages 383–392. IEEE, 2015.
  • [5] Alexander Bowe, Taku Onodera, Kunihiko Sadakane, and Tetsuo Shibuya. Succinct de Bruijn graphs. In WABI, volume 7534 of Lecture Notes in Computer Science, pages 225–235. Springer, 2012.
  • [6] Lavinia Egidi, Felipe Alves Louza, Giovanni Manzini, and Guilherme P. Telles. External memory BWT and LCP computation for sequence collections with applications. In WABI, volume 113 of LIPIcs, pages 10:1–10:14. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, 2018.
  • [7] Lavinia Egidi and Giovanni Manzini. Lightweight BWT and LCP merging via the Gap algorithm. In SPIRE, volume 10508 of Lecture Notes in Computer Science, pages 176–190. Springer, 2017.
  • [8] P. Ferragina, G. Manzini, V. Mäkinen, and G. Navarro. Compressed representations of sequences and full-text indexes. ACM Transactions on Algorithms, 3(2), 2007.
  • [9] James Holt and Leonard McMillan. Constructing Burrows-Wheeler transforms of large string collections via merging. In BCB, pages 464–471. ACM, 2014.
  • [10] James Holt and Leonard McMillan. Merging of multi-string BWTs with applications. Bioinformatics, 30(24):3524–3531, 2014.
  • [11] Zamin Iqbal, Mario Caccamo, Isaac Turner, Paul Flicek, and Gil McVean. De novo assembly and genotyping of variants using colored de Bruijn graphs. Nature Genetics, 44(2):226–232, 2012. doi:10.1038/ng.1028.
  • [12] Felipe A. Louza, Simon Gog, and Guilherme P. Telles. Inducing enhanced suffix arrays for string collections. Theor. Comput. Sci., 678:22–39, 2017.
  • [13] Shoshana Marcus, Hayan Lee, and Michael C. Schatz. Splitmem: a graphical algorithm for pan-genome analysis with suffix skips. Bioinformatics, 30(24):3476–3483, 2014.
  • [14] Martin D Muggli and Christina Boucher. Succinct de Bruijn graph construction for massive populations through space-efficient merging. bioRxiv, 2017. URL: https://www.biorxiv.org/content/early/2017/12/06/229641, arXiv:https://www.biorxiv.org/content/early/2017/12/06/229641.full.pdf, doi:10.1101/229641.
  • [15] Martin D. Muggli, Alexander Bowe, Noelle R. Noyes, Paul S. Morley, Keith E. Belk, Robert Raymond, Travis Gagie, Simon J. Puglisi, and Christina Boucher. Succinct colored de Bruijn graphs. Bioinformatics, 33(20):3181–3187, 2017.
  • [16] R. Raman, V. Raman, and S.S. Rao. Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets. ACM Transactions on Algorithms, 3(4), 2007.