Lightweight merging of compressed indices based on BWT variants

03/04/2019 · by Lavinia Egidi, et al.

In this paper we propose a flexible and lightweight technique for merging compressed indices based on variants of the Burrows-Wheeler transform (BWT), thus addressing the need for algorithms that compute compressed indices over large collections using a limited amount of working memory. Merge procedures make it possible to use an incremental strategy for building large indices based on merging indices for progressively larger subcollections. Starting with a known lightweight algorithm for merging BWTs [Holt and McMillan, Bioinformatics 2014], we show how to modify it in order to merge, or compute from scratch, the Longest Common Prefix (LCP) array as well. We then extend our technique to the merging of compressed tries and of circular/permuterm compressed indices, two compressed data structures for which there were hitherto no known merging algorithms.

1 Introduction

The Burrows Wheeler transform (BWT), originally introduced as a tool for data compression [4], has found application in the compact representation of many different data structures. After the seminal works [31] showing that the BWT can be used as a compressed full text index for a single string, many researchers have proposed variants of this transformation for string collections [5, 24], trees [9, 10], graphs [3, 27, 35], and alignments [30, 29]. See [13] for an attempt to provide a unified view of these variants.

In this paper we consider the problem of constructing compressed indices for string collections based on BWT variants. A compressed index is obviously most useful when working with very large amounts of data. Therefore, a fundamental requirement for construction algorithms, in order to be of practical use, is that they are lightweight in the sense that they use a limited amount of working space, i.e. space in addition to the space used for the input and the output. Indeed, the construction of compressed indices in linear time and small working space is an active and promising area of research, see [1, 12, 28] and references therein.

A natural approach when working with string collections is to build the indexing data structure incrementally, that is, for progressively larger subcollections. For example, when additional data should be added to an already large index, the incremental construction appears much more reasonable, and often works better in practice, than rebuilding the complete index from scratch, even when the from-scratch option has better theoretical bounds. Indeed, in [33] and [26] the authors were able to build the largest indices in their respective fields using the incremental approach.

Along this path, Holt and McMillan [16, 15] proposed a simple and elegant algorithm, that we call the H&M algorithm from now on, for merging BWTs of collections of sequences. For collections of total size n, their fastest version takes O(n · avelcp) time, where avelcp is the average length of the longest common prefix between suffixes in the collection. The average LCP is O(n) in the worst case, but it is O(log n) for random strings and for many real world datasets [22]. However, even when avelcp = O(log n), the H&M algorithm is not theoretically optimal, since computing the BWT from scratch takes O(n) time. Despite its theoretical shortcomings, because of its simplicity and small space usage, the H&M algorithm is competitive in practice for collections with relatively small average LCP. In addition, since the H&M algorithm accesses all data by sequential scans, it has been adapted to work on very large collections in external memory [16].

In this paper we revisit the H&M algorithm and we show that its main technique can be adapted to solve the merging problem for three different compressed indices based on the BWT.

First, in Section 4 we describe a procedure to merge, in addition to the BWTs, the Longest Common Prefix (LCP) arrays of string collections. The LCP array is often used to provide additional functionalities to indices based on the BWT [31], and the issue of efficiently computing and storing LCP values has recently received much attention [14, 20]. Our algorithm has the same complexity as the H&M algorithm.

Next, in Section 5 we describe a procedure for merging compressed labelled trees (tries) as produced by the eXtended BWT transform (XBWT) [9, 10]. This result is particularly interesting since at the moment there are no time- and space-optimal algorithms for computing the XBWT from scratch. Our algorithm takes time proportional to the number of nodes in the output tree times the average node height.

Finally, in Section 6 we describe algorithms for merging compressed indices for circular patterns [17], and compressed permuterm indices [11]. The time complexity of these algorithms is proportional to the total collection size times the average circular LCP, a notion that naturally extends the LCP to the modified lexicographic order used for circular strings.

Our algorithms are based on the H&M technique specialized to the particular features of the different compressed indices given as input. They all make use of techniques to recognize blocks of the input that become irrelevant for the computation and skip them in successive iterations. Because of this skipping of irrelevant blocks we call our merging procedures Gap algorithms. Our algorithms are all lightweight in the sense that, in addition to the input and the output, they use only a few bit arrays of working space plus the space needed for handling the irrelevant blocks. The latter amount of space can be significant for pathological inputs, but in practice we found it takes between 2% and 9% of the overall space, depending on the alphabet size.

The Gap algorithms share with the H&M algorithm the feature of accessing all data by sequential scans and are therefore suitable for implementation in external memory. In [7] an external memory version of the Gap algorithm for merging BWT and LCP arrays is engineered, analyzed, and extensively tested on collections of DNA sequences. The results reported there show that the external memory version of Gap outperforms the known external memory algorithms for BWT/LCP computation when the average LCP of the collection is relatively small or when the strings of the input collection have widely different lengths.

To the best of our knowledge, the problem of incrementally building compressed indices via merging has been previously addressed only in [34] and [26]. Sirén presents in [34] an algorithm that maintains a BWT-based compressed index in RAM and incrementally merges new collections into it. The algorithm is the first that makes it possible to build indices for terabytes of data without using a specialized machine with a lot of RAM. However, Sirén's algorithm is specific to a particular compressed index (which does not use the LCP array), while ours can be more easily adapted to build different flavors of compressed indices, as shown in this paper. In [26] the authors present a merge algorithm for colored de Bruijn graphs. Their algorithm is also inspired by the H&M algorithm, and the authors report a threefold reduction in working space compared to state-of-the-art methods that build colored de Bruijn graphs from scratch. Inspired by the techniques introduced in this paper, we are currently working on an improved de Bruijn graph merging algorithm [6] that also supports the construction of succinct Variable Order de Bruijn graph representations [2].

2 Background

Let t[1, n] denote a string of length n over an alphabet Σ of constant size σ. We write t[i, j] to denote the substring t[i] t[i+1] ··· t[j]. If j ≥ n we assume t[i, j] = t[i, n]. If i > j or i > n, then t[i, j] is the empty string. Given two strings t and s we write t ⪯ s (t ≺ s) to denote that t is lexicographically (strictly) smaller than s. We denote by LCP(t, s) the length of the longest common prefix between t and s.

The suffix array sa[1, n] associated to t is the permutation of [1, n] giving the lexicographic order of t's suffixes, that is, for i = 1, ..., n-1, t[sa[i], n] ≺ t[sa[i+1], n]. The longest common prefix array lcp[1, n+1] is defined for i = 2, ..., n by

    lcp[i] = LCP(t[sa[i-1], n], t[sa[i], n]);    (1)

the lcp array stores the length of the longest common prefix between lexicographically consecutive suffixes. For convenience we define lcp[1] = lcp[n+1] = -1. We also define the maximum and average LCP as:

    maxlcp = max_i lcp[i],    avelcp = (1/n) Σ_{i=2..n} lcp[i].    (2)

The Burrows-Wheeler transform bwt[1, n] of t is defined by

    bwt[i] = t[n] if sa[i] = 1,    bwt[i] = t[sa[i] - 1] otherwise.

bwt is best seen as the permutation of t in which the position of t[j] coincides with the lexicographic rank of t[j+1, n] (or of t[1, n] if j = n) in the suffix array. We call t[j+1, n] the context of t[j]. See Figure 1 for an example.

XBWT-style tables are not needed here; Figure 1 consists of three tables. For t0 = abcab$0:

    lcp  bwt  context
    -1   b    $0
     0   c    ab$0
     2   $0   abcab$0
     0   a    b$0
     1   a    bcab$0
     0   b    cab$0
    -1

For t1 = aabcabc$1:

    lcp  bwt  context
    -1   c    $1
     0   $1   aabcabc$1
     1   c    abc$1
     3   a    abcabc$1
     0   a    bc$1
     2   a    bcabc$1
     0   b    c$1
     1   b    cabc$1
    -1

For the pair t0, t1:

    id  lcp01  bwt01  context
    0   -1     b      $0
    1    0     c      $1
    1    0     $1     aabcabc$1
    0    1     c      ab$0
    1    2     c      abc$1
    0    3     $0     abcab$0
    1    5     a      abcabc$1
    0    0     a      b$0
    1    1     a      bc$1
    0    2     a      bcab$0
    1    4     a      bcabc$1
    1    0     b      c$1
    0    1     b      cab$0
    1    3     b      cabc$1
        -1

Figure 1: LCP array and BWT for t0 = abcab$0 and t1 = aabcabc$1, and multi-string BWT bwt01 and corresponding LCP array lcp01 for the same strings. Column id shows, for each entry of bwt01, whether it comes from t0 or t1.
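To make these definitions concrete, the following minimal Python sketch (ours, for illustration only; quadratic time, 0-based indices) computes sa, lcp and bwt directly from the definitions above. The function name sa_lcp_bwt and the use of '$' as terminator are our choices; running it on t = abcab$ reproduces the left table of Figure 1.

    # Direct, naive implementation of the definitions of sa, lcp and bwt (0-based).
    def sa_lcp_bwt(t):
        n = len(t)
        sa = sorted(range(n), key=lambda i: t[i:])      # suffix array
        bwt = ''.join(t[(i - 1) % n] for i in sa)       # symbol preceding each suffix (t[n-1] when sa[i] = 0)
        lcp = [-1] * (n + 1)                            # lcp[0] = lcp[n] = -1 by convention
        for r in range(1, n):
            a, b = t[sa[r - 1]:], t[sa[r]:]
            l = 0
            while l < min(len(a), len(b)) and a[l] == b[l]:
                l += 1
            lcp[r] = l
        return sa, lcp, bwt

    sa, lcp, bwt = sa_lcp_bwt("abcab$")
    # bwt == "bc$aab",  lcp == [-1, 0, 2, 0, 1, 0, -1]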

The longest common prefix (LCP) array and the Burrows-Wheeler transform (BWT) can be generalized to the case of multiple strings. Historically, the first such generalization is the circular BWT [24] considered in Section 6. Here we consider the generalization proposed in [5], which is the one most used in applications. Let t0[1, n0] and t1[1, n1] be such that t0[n0] = $0 and t1[n1] = $1, where $0 < $1 are two symbols not appearing elsewhere in t0 and t1 and smaller than any other symbol. Let sa01[1, n0+n1] denote the suffix array of the concatenation t0 t1. The multi-string BWT of t0 and t1, denoted by bwt01[1, n0+n1], is defined by

    bwt01[i] = t0[n0]                  if sa01[i] = 1
    bwt01[i] = t0[sa01[i] - 1]         if 1 < sa01[i] ≤ n0
    bwt01[i] = t1[n1]                  if sa01[i] = n0 + 1
    bwt01[i] = t1[sa01[i] - n0 - 1]    otherwise.

In other words, bwt01[i] is the symbol preceding the i-th suffix in lexicographic order, with the exception that if sa01[i] = 1 then bwt01[i] = $0 and if sa01[i] = n0 + 1 then bwt01[i] = $1. Hence, bwt01[i] is always a character of the string (t0 or t1) containing the i-th smallest suffix (see again Fig. 1). The above notion of multi-string BWT can be immediately generalized to define bwt1···k for a family of distinct strings t1, ..., tk. Essentially, bwt1···k is a permutation of the symbols of t1, ..., tk such that the position in bwt1···k of ti[j] is given by the lexicographic rank of its context ti[j+1, ni] (or ti[1, ni] if j = ni).

Given the concatenation t0 t1 and its suffix array sa01, we consider the corresponding LCP array lcp01[1, n0+n1+1] defined as in (1) (see again Fig. 1). Note that, for i = 2, ..., n0+n1, lcp01[i] gives the length of the longest common prefix between the contexts of bwt01[i] and bwt01[i-1]. This definition can be immediately generalized to a family of strings to define the LCP array lcp1···k associated to the multi-string BWT bwt1···k.
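A direct, quadratic-time Python sketch of the multi-string case follows (again ours, for illustration): it builds bwt01 and lcp01 by sorting the suffixes of the concatenation t0 t1, using two distinct terminator characters so that $0 < $1 are smaller than every other symbol. The function name and the choice of the control characters '\x01' and '\x02' are our assumptions; the input strings are assumed to contain only printable characters.

    # Naive construction of bwt01 and lcp01 from the definitions (0-based).
    def multi_string_bwt_lcp(s0, s1, d0='\x01', d1='\x02'):
        x = s0 + d0 + s1 + d1                 # the concatenation t0 t1
        n0, n = len(s0) + 1, len(s0) + len(s1) + 2
        sa01 = sorted(range(n), key=lambda i: x[i:])
        bwt01, lcp01, prev = [], [], None
        for i in sa01:
            if i == 0:
                bwt01.append(d0)              # suffix = whole t0 t1: use t0's terminator
            elif i == n0:
                bwt01.append(d1)              # first suffix of t1: use t1's terminator
            else:
                bwt01.append(x[i - 1])        # symbol preceding the suffix
            cur = x[i:]
            if prev is None:
                lcp01.append(-1)
            else:
                l = 0
                while l < min(len(prev), len(cur)) and prev[l] == cur[l]:
                    l += 1
                lcp01.append(l)
            prev = cur
        return ''.join(bwt01), lcp01

Applying it to the strings of Figure 1 (abcab and aabcabc) and printing both terminators as $ yields bwt01 = bc$cc$aaaaabbb and lcp01 = -1 0 0 1 2 3 5 0 1 2 4 0 1 3, matching the rightmost table.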

2.1 The H&M Algorithm

In [16] Holt and McMillan introduced a simple and elegant algorithm, which we call the H&M algorithm, to merge multi-string BWTs (unless explicitly stated otherwise, in the following we use H&M to refer to the algorithm from [16], and not to its variant proposed in [15]). Because it is the starting point for our results, we now briefly recall its main properties.

Given t0 and t1, the H&M algorithm computes the multi-string BWT bwt01. The computation does not explicitly need t0 and t1, but only the (multi-string) BWTs to be merged. For simplicity of notation we describe the algorithm assuming we are merging two single-string BWTs bwt0 = bwt(t0) and bwt1 = bwt(t1); the same algorithm works in the general case with multi-string BWTs in input. Note also that the algorithm can be easily adapted to merge more than two (multi-string) BWTs at the same time.

Computing bwt01 amounts to sorting the symbols of bwt0 and bwt1 according to the lexicographic order of their contexts, where the context of symbol bwt0[i] (resp. bwt1[j]) is t0[sa0[i], n0] (resp. t1[sa1[j], n1]), and sa0, sa1 denote the suffix arrays of t0 and t1. By construction, the symbols in bwt0 and bwt1 are already sorted by context, hence to compute bwt01 we only need to merge bwt0 and bwt1 without changing the relative order of the symbols within the two input sequences.

The H&M algorithm works in successive iterations. After the h-th iteration the entries of bwt0 and bwt1 are sorted on the basis of the first h symbols of their context. More formally, the output of the h-th iteration is a binary vector Z^(h) containing n0 0's and n1 1's and such that the following property holds.

Property 2.1.

For i = 1, ..., n0 and j = 1, ..., n1, the i-th 0 precedes the j-th 1 in Z^(h) if and only if

    t0[sa0[i], sa0[i]+h-1] ⪯ t1[sa1[j], sa1[j]+h-1]    (3)

(recall that according to our notation if sa0[i]+h-1 > n0 then t0[sa0[i], sa0[i]+h-1] coincides with t0[sa0[i], n0], and similarly for t1).∎

Following Property 2.1 we identify the i-th 0 in Z^(h) with bwt0[i] and the j-th 1 in Z^(h) with bwt1[j], so that Z^(h) encodes a permutation of bwt01. Property 2.1 is equivalent to stating that we can logically partition Z^(h) into blocks

    Z^(h)[1, ℓ1], Z^(h)[ℓ1+1, ℓ2], ..., Z^(h)[ℓs+1, n0+n1]    (4)

such that each block corresponds to a set of symbols of bwt01 whose contexts are prefixed by the same length-h string (the symbols with a context shorter than h are contained in singleton blocks). Within each block the symbols of bwt0 precede those of bwt1, and the context of any symbol in block Z^(h)[ℓi+1, ℓ(i+1)] is lexicographically smaller than the context of any symbol in block Z^(h)[ℓj+1, ℓ(j+1)] with i < j.

The H&M algorithm initially sets Z^(0) = 0^n0 1^n1: since the context of every symbol is prefixed by the same length-0 string (the empty string), there is a single block containing all symbols. At iteration h the algorithm computes Z^(h) from Z^(h-1) using the procedure in Figure 2. The following lemma is a restatement of Lemma 3.2 in [16] using our notation (see [8] for a proof in our notation).

 

 1: Initialize array F[1, σ]
 2: k0 ← 1; k1 ← 1                               ▹ Init counters for bwt0 and bwt1
 3: for k ← 1 to n0 + n1 do
 4:     b ← Z^(h-1)[k]                           ▹ Read bit b from Z^(h-1)
 5:     if b = 0 then c ← bwt0[k0++] else c ← bwt1[k1++]   ▹ Get symbol from bwt0 or bwt1 according to b
 6:     if c ≠ $ then
 7:         j ← F[c]++                           ▹ Get destination for b according to symbol c
 8:     else
 9:         j ← b + 1                            ▹ The end-marker of bwt_b goes to position b + 1
10:     end if
11:     Z^(h)[j] ← b                             ▹ Copy bit b to Z^(h)
12: end for

Figure 2: Main loop of algorithm H&M for computing Z^(h) given Z^(h-1). Array F is initialized so that F[c] contains the number of occurrences of symbols smaller than c in bwt0 and bwt1, plus one. Note that the bits stored in Z^(h) immediately after reading symbol c are stored consecutively in the range of positions reserved for contexts starting with c.

For h ≥ 1, the bit vector Z^(h) computed by the procedure in Figure 2 satisfies Property 2.1.∎
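As an illustration of the iteration just described, the following Python sketch (ours, not the authors' implementation) implements the logic of Figure 2 for two single-string BWTs, with 0-based indices. It assumes both end-markers are stored as '$' but that the marker of bwt0 is considered smaller than that of bwt1, so their two bits always occupy the first two positions of Z^(h).

    # One H&M iteration: compute Z^(h) from Z^(h-1) for the merge of bwt0 and bwt1.
    def hm_iteration(bwt0, bwt1, Z_prev):
        n = len(bwt0) + len(bwt1)
        # F[c] = first free destination for symbol c; slots 0 and 1 are reserved for the end-markers.
        counts = {}
        for c in bwt0 + bwt1:
            if c != '$':
                counts[c] = counts.get(c, 0) + 1
        F, start = {}, 2
        for c in sorted(counts):
            F[c] = start
            start += counts[c]
        Z_next = [0] * n
        k0 = k1 = 0                           # next unread positions in bwt0 / bwt1
        for k in range(n):
            b = Z_prev[k]                     # read bit b from Z^(h-1)
            if b == 0:
                c = bwt0[k0]; k0 += 1
            else:
                c = bwt1[k1]; k1 += 1
            if c != '$':
                j = F[c]; F[c] += 1           # destination determined by symbol c
            else:
                j = b                         # end-marker of bwt_b goes to slot b
            Z_next[j] = b                     # copy bit b to Z^(h)
        return Z_next

Starting from Z^(0) = 0^n0 1^n1 and applying hm_iteration repeatedly (maxlcp + 1 applications always suffice) yields the interleave vector that tells how bwt0 and bwt1 must be merged.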

 

 1: Initialize arrays F[1, σ] and Block_id[1, σ]
 2: k0 ← 1; k1 ← 1                               ▹ Init counters for bwt0 and bwt1
 3: for k ← 1 to n0 + n1 do
 4:     if B[k] ≠ 0 and B[k] ≠ h then
 5:         id ← k                               ▹ A new block of Z^(h-1) is starting
 6:     end if
 7:     b ← Z^(h-1)[k]                           ▹ Read bit b from Z^(h-1)
 8:     if b = 0 then c ← bwt0[k0++] else c ← bwt1[k1++]   ▹ Get symbol from bwt0 or bwt1 according to b
 9:     if c ≠ $ then
10:         j ← F[c]++                           ▹ Get destination for b according to symbol c
11:     else
12:         j ← b + 1                            ▹ The end-marker of bwt_b goes to position b + 1
13:     end if
14:     Z^(h)[j] ← b                             ▹ Copy bit b to Z^(h)
15:     if Block_id[c] ≠ id then
16:         Block_id[c] ← id                     ▹ Update block id for symbol c
17:         if B[j] = 0 then                     ▹ Check if already marked
18:             B[j] ← h                         ▹ A new block of Z^(h) will start here
19:         end if
20:     end if
21: end for

Figure 3: Main loop of the H&M algorithm modified for the computation of the B values. At Line 1, for each symbol c we set Block_id[c] ← -1 and initialize F[c] as in Figure 2. At the beginning of the algorithm we initialize the array B[1, n0+n1+1] as B = 1 0^(n0+n1-1) 1.

3 Computing LCP values with the H&M algorithm

Our first result is to show that with a simple modification to the H&M algorithm it is possible to compute the LCP array lcp01, in addition to merging bwt0 and bwt1. Our strategy consists in keeping explicit track of the logical blocks we have defined for Z^(h) and represented in (4). We maintain an integer array B[1, n0+n1+1] such that at the end of iteration h it is B[k] ≠ 0 if and only if a block of Z^(h) starts at position k. The use of this integer array is shown in Figure 3. Note that: initially we set B = 1 0^(n0+n1-1) 1; once an entry in B becomes nonzero it is never changed; during iteration h we only write to B the value h; because of the test at Line 4, the values written during iteration h influence the algorithm only in subsequent iterations. In order to identify new blocks, we maintain an array Block_id[1, σ] such that Block_id[c] is the id of the block of Z^(h-1) to which the last seen occurrence of symbol c belonged.

The following lemma shows that the nonzero values of B at the end of iteration h mark the boundaries of the logical blocks of Z^(h).

For any h ≥ 0, let ℓ, m be such that 1 ≤ ℓ ≤ m ≤ n0+n1 and

    lcp01[ℓ] < h,  lcp01[m+1] < h,  and  lcp01[k] ≥ h for k = ℓ+1, ..., m.    (5)

Then, at the end of iteration h the array B is such that

    B[ℓ] ≠ 0,  B[m+1] ≠ 0,  and  B[k] = 0 for k = ℓ+1, ..., m,    (6)

and Z^(h)[ℓ, m] is one of the blocks in (4).

Proof.

We prove the result by induction on h. For h = 0, hence before the execution of the first iteration, (5) is only valid for ℓ = 1 and m = n0+n1 (recall that we defined lcp01[1] = lcp01[n0+n1+1] = -1). Since initially B = 1 0^(n0+n1-1) 1, our claim holds.

Suppose now that (5) holds for some h > 0, and let α denote the length-h common prefix of the contexts in positions ℓ, ..., m; by (5), α is a common prefix of the suffixes starting at positions sa01[ℓ], sa01[ℓ+1], ..., sa01[m], and no other suffix of t0 t1 is prefixed by α. By Property 2.1 the 0's and 1's in Z^(h)[ℓ, m] correspond to the same set of suffixes. That is, if ℓ ≤ v ≤ m and Z^(h)[v] is the i-th 0 (resp. j-th 1) of Z^(h), then the suffix t0[sa0[i], n0] (resp. t1[sa1[j], n1]) is prefixed by α.

To prove (6) we start by showing that, if ℓ < k ≤ m, then at the end of iteration h-1 it is B[k] = 0. To see this, observe that the range [ℓ, m] is part of a (possibly) larger range [ℓ', m'] containing all the suffixes prefixed by the length-(h-1) prefix of α. By inductive hypothesis, at the end of iteration h-1 it is B[k] = 0 for ℓ' < k ≤ m', which proves our claim since ℓ' ≤ ℓ and m ≤ m'.

To complete the proof, we need to show that during iteration h: (i) we do not modify B[ℓ+1, m], and (ii) we write a nonzero value to B[ℓ] and B[m+1] if they do not already contain one. Let c = α[1] and β = α[2, h], so that α = cβ. Consider the range [ℓ', m'] containing the suffixes prefixed by β. By inductive hypothesis, at the end of iteration h-1 it is

    B[ℓ'] ≠ 0,  B[m'+1] ≠ 0,  and  B[k] = 0 for k = ℓ'+1, ..., m'.    (7)

During iteration h, the bits in Z^(h)[ℓ, m] (and the corresponding entries of B) are possibly changed only when we are scanning the region Z^(h-1)[ℓ', m'] and we find an entry Z^(h-1)[v], ℓ' ≤ v ≤ m', such that the corresponding symbol in bwt0 or bwt1 is c. Note that by (7), as soon as k reaches ℓ' the variable id changes and becomes different from all values stored in Block_id. Hence, at the first occurrence of symbol c the value h will be stored in B[ℓ] (Line 18), unless a nonzero value is already there. Again because of (7), during the scanning of Z^(h-1)[ℓ', m'] the variable id does not change, so subsequent occurrences of c will not cause a nonzero value to be written to B[ℓ+1, m]. Finally, as soon as we leave region Z^(h-1)[ℓ', m'] and k reaches m'+1, the variable id changes again and at the next occurrence of c a nonzero value will be stored in B[m+1]. If there are no more occurrences of c after we leave region Z^(h-1)[ℓ', m'], then either position m+1 is the first position whose context is prefixed by a symbol strictly larger than c, or m+1 = n0+n1+1. In the former case B[m+1] gets a nonzero value at iteration 1; in the latter case B[m+1] gets a nonzero value when we initialize the array B. ∎

For k = 2, ..., n0+n1, if lcp01[k] = ℓ, then starting from the end of iteration ℓ+1 it is B[k] = ℓ+1.

Proof.

By Lemma 3 we know that B[k] can become nonzero only at the end of iteration ℓ+1. Since at the end of iteration ℓ it is still B[k] = 0, during iteration ℓ+1 B[k] gets the value ℓ+1, which is never changed in successive iterations.∎

The above corollary suggests the following algorithm to compute bwt01 and lcp01: repeat the procedure of Figure 3 until the iteration in which all entries of B[2, n0+n1] have become nonzero. At that point Z^(h) describes how bwt0 and bwt1 should be merged to get bwt01, and lcp01[k] = B[k] - 1 for k = 2, ..., n0+n1. The above strategy requires a number of iterations equal to the maximum of the lcp01 values plus one, each one taking O(n0+n1) time, for an overall complexity of O(n · maxlcp01), where n = n0+n1 and maxlcp01 = max_k lcp01[k]. Note that in addition to the space for the input and the output, the algorithm only uses two bit arrays (one for the current and one for the next Z) and a constant number of counters (the arrays F and Block_id). Summing up, we have the following result.

Given bwt(t0) and bwt(t1), the algorithm in Figure 3 computes bwt01 and lcp01 in O(n · maxlcp) time and O(n) bits of working space, where n = n0+n1 and maxlcp is the maximum LCP of the concatenation t0 t1.∎
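The following self-contained Python sketch (ours, not the authors' implementation) puts Section 3 together: it iterates the procedure of Figure 3 until every entry of B is nonzero and then assembles bwt01 and lcp01. It assumes two single-string BWTs whose end-markers are written '$', with the marker of bwt0 treated as the smaller symbol; indexing is 0-based and B has one entry per merged position plus a sentinel.

    def merge_bwt_lcp(bwt0, bwt1):
        n0, n1 = len(bwt0), len(bwt1)
        n = n0 + n1
        Z = [0] * n0 + [1] * n1                  # Z^(0): a single block
        B = [1] + [0] * (n - 1) + [1]            # boundary sentinels correspond to the -1 lcp entries
        h = 0
        while any(B[k] == 0 for k in range(1, n)):
            h += 1
            # F[c] = first free destination for symbol c (slots 0 and 1 reserved for the end-markers)
            counts = {}
            for c in bwt0 + bwt1:
                if c != '$':
                    counts[c] = counts.get(c, 0) + 1
            F, nxt = {}, 2
            for c in sorted(counts):
                F[c] = nxt
                nxt += counts[c]
            block_id = {}                        # last block of Z^(h-1) in which each symbol was seen
            Z_new = [0] * n
            k0 = k1 = 0
            cur = -1                             # id of the current block of Z^(h-1)
            for k in range(n):
                if B[k] != 0 and B[k] != h:
                    cur = k                      # a new block of Z^(h-1) starts here
                b = Z[k]
                if b == 0:
                    c = bwt0[k0]; k0 += 1
                else:
                    c = bwt1[k1]; k1 += 1
                if c != '$':
                    j = F[c]; F[c] += 1
                    key = c
                else:
                    j = b                        # end-marker of bwt_b goes to slot b
                    key = ('$', b)               # the two end-markers are conceptually distinct symbols
                Z_new[j] = b
                if block_id.get(key) != cur:
                    block_id[key] = cur          # first occurrence of this symbol in the current block
                    if B[j] == 0:
                        B[j] = h                 # a new block of Z^(h) starts at j
            Z = Z_new
        # Assemble the merged BWT and the LCP array from Z and B.
        bwt01, lcp01 = [], []
        k0 = k1 = 0
        for k in range(n):
            if Z[k] == 0:
                bwt01.append(bwt0[k0]); k0 += 1
            else:
                bwt01.append(bwt1[k1]); k1 += 1
            lcp01.append(-1 if k == 0 else B[k] - 1)
        return ''.join(bwt01), lcp01

For the strings of Figure 1, calling merge_bwt_lcp("bc$aab", "c$caaabb") reproduces the merged BWT and LCP array shown there.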

4 The Gap BWT/LCP merging Algorithm

The Gap algorithm, as well as its variants described in the following sections, is based on the notion of monochrome blocks.

If B[ℓ] ≠ 0 and B[m+1] ≠ 0, and B[k] = 0 for ℓ < k ≤ m, we say that the block Z^(h)[ℓ, m] is monochrome if it contains only 0's or only 1's.∎

Since a monochrome block only contains suffixes from either t0 or t1, whose relative order is known, it does not need to be further modified. If, in addition, the LCP arrays lcp0 and lcp1 of t0 and t1 are given in input, then also the LCP values inside monochrome blocks are known without further processing. This intuition is formalized by the following lemmas.

If at the end of iteration h the bit vector Z^(h) contains only monochrome blocks, we can compute bwt01 and lcp01 in O(n0+n1) time from Z^(h), lcp0, and lcp1.

Proof.

By Property 2.1, if we identify the i-th 0 in Z^(h) with bwt0[i] and the j-th 1 with bwt1[j], the only elements which could be not correctly sorted by context are those within the same block. However, if the blocks are monochrome all the elements of a block belong to either bwt0 or bwt1, so their relative order is correct.

To compute lcp01 we observe that if B[k] ≠ 0 then, by (the proof of) Corollary 3, it is lcp01[k] = B[k] - 1. If instead B[k] = 0, we are inside a block, hence the suffixes in positions k-1 and k belong to the same string t0 or t1 and their LCP is directly available in lcp0 or lcp1.∎

Notice that the lazy strategy of not completely processing monochrome blocks makes it impossible to compute LCP values from scratch. In this case, in order to compute lcp01 it is necessary that the algorithm also takes lcp0 and lcp1 as input.
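The following sketch (ours) makes the preceding lemma concrete: given a bit vector Z made only of monochrome blocks, the boundary marks B, and the input arrays lcp0 and lcp1, the merged BWT and LCP array are assembled in one pass. It uses the 0-based conventions of the previous sketches and assumes lcp01[k] = B[k] - 1 at block boundaries, as established by Corollary 3.

    def assemble_from_monochrome(Z, B, bwt0, bwt1, lcp0, lcp1):
        n = len(Z)
        bwt01, lcp01 = [], []
        k0 = k1 = 0
        for k in range(n):
            if Z[k] == 0:
                bwt01.append(bwt0[k0]); k0 += 1
            else:
                bwt01.append(bwt1[k1]); k1 += 1
            if k == 0:
                lcp01.append(-1)
            elif B[k] != 0:
                lcp01.append(B[k] - 1)           # block boundary: LCP value recorded during the merge
            else:
                # inside a monochrome block: positions k-1 and k come from the same string,
                # so the LCP is the one already stored in lcp0 or lcp1
                lcp01.append(lcp0[k0 - 1] if Z[k] == 0 else lcp1[k1 - 1])
        return ''.join(bwt01), lcp01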

Suppose that, at the end of iteration h, Z^(h)[ℓ, m] is a monochrome block. Then, for g > h, Z^(g)[ℓ, m] = Z^(h)[ℓ, m], and processing Z^(h)[ℓ, m] during iteration h+1 creates a set of monochrome blocks in Z^(h+1).

Proof.

The first part of the lemma follows from the observation that subsequent iterations of the algorithm will only reorder the values within a block (and possibly create new sub-blocks); but if a block is monochrome, the reordering will not change its actual content.

For the second part, we observe that during iteration h+1, as k goes from ℓ to m, the algorithm writes to Z^(h+1) the same value b which is in Z^(h)[ℓ, m]. Hence, a new monochrome block will be created for each distinct symbol encountered (in bwt0 or bwt1) as k goes through the range [ℓ, m].∎

The lemma implies that, if block Z^(h)[ℓ, m] is monochrome at the end of iteration h, starting from iteration h+2 processing the range [ℓ, m] will not change the bit vector with respect to the previous iteration. Indeed, by the lemma the monochrome blocks created in iteration h+1 do not change in subsequent iterations (in a subsequent iteration a monochrome block can be split into sub-blocks, but the actual content of the bit vector does not change). The above observation suggests that, after we have processed block [ℓ, m] in iteration h+1, we can mark it as irrelevant and avoid processing it again. As the computation goes on, more and more blocks become irrelevant. Hence, at the generic iteration h, instead of processing the whole Z^(h-1) we process only the blocks which are still "active" and skip irrelevant blocks. Adjacent irrelevant blocks are merged, so that between two active blocks there is at most one irrelevant block (the gap after which the algorithm is named). The overall structure of a single iteration is shown in Figure 4. The algorithm terminates when there are no more active blocks, since this implies that all blocks have become monochrome and by Lemma 4 we are able to compute bwt01 and lcp01.

 

1:if (next block is irrelevant) then
2:     skip it
3:else
4:     process block
5:     if (processed block is monochrome) then
6:         mark it irrelevant
7:     end if
8:end if
9:if (last two blocks are irrelevant) then
10:     merge them
11:end if

 

Figure 4: Main loop of the Gap algorithm. The processing of active blocks at Line 4 is done as in Lines 7–20 of Figure 3.

We point out that at Line 2 of the Gap algorithm we cannot simply skip an irrelevant block ignoring its content. To keep the algorithm consistent we must correctly update the global variables of the main loop, i.e. the array F and the pointers k0 and k1 of Figure 3. To this end, a simple approach is to store for each irrelevant block the number of occurrences occ(c) of each symbol c in it and the pair (z0, z1) providing the number of 0's and 1's in the block (recall that an irrelevant block may consist of adjacent monochrome blocks coming from different strings). When the algorithm reaches an irrelevant block, F, k0 and k1 are updated setting F[c] ← F[c] + occ(c) for each c, k0 ← k0 + z0, and k1 ← k1 + z1. The above scheme for handling irrelevant blocks is simple and effective for most applications. However, for a large non-constant alphabet it would imply a multiplicative slowdown proportional to σ. In [8, Sect. 4] we present a different scheme for large alphabets with a smaller slowdown.
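A minimal sketch (our assumed structure, not the authors' C implementation) of this bookkeeping: each irrelevant block stores its per-symbol occurrence counts and its number of 0's and 1's, which is all that is needed to advance F, k0 and k1 when the block is skipped.

    from collections import Counter
    from dataclasses import dataclass, field

    @dataclass
    class IrrelevantBlock:
        occ: Counter = field(default_factory=Counter)  # occurrences of each symbol in the block
        z0: int = 0                                    # number of 0-bits (symbols from bwt0)
        z1: int = 0                                    # number of 1-bits (symbols from bwt1)

        def absorb(self, other):
            """Merge an adjacent irrelevant block into this one (the 'gap')."""
            self.occ += other.occ
            self.z0 += other.z0
            self.z1 += other.z1

    def skip_block(block, F, k0, k1):
        """Advance the scan counters past an irrelevant block without reading it."""
        for c, cnt in block.occ.items():
            F[c] += cnt
        return k0 + block.z0, k1 + block.z1

Merging adjacent records with absorb keeps at most one irrelevant block between two active ones, as required by the main loop of Figure 4.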

We point out that our Gap algorithm is related to the H&M variant with O(n · avelcp) time complexity described in [15, Sect. 2.1]: indeed, the sorting operations are essentially the same in the two algorithms. The main difference is that Gap keeps explicit track of the irrelevant blocks while H&M keeps explicit track of the active blocks (called buckets in [15]): this difference makes the non-sorting operations completely different. An advantage of working with irrelevant blocks is that they can be easily merged, while this is not the case for the active blocks in H&M. Of course, the main difference is that Gap merges simultaneously BWT and LCP values.

Given bwt(t0) and bwt(t1), let n = n0 + n1. The Gap algorithm computes bwt01 and lcp01 in O(n · avelcp) time, where avelcp is the average LCP of the string t0 t1. The working space is O(n) bits, plus the space used for handling irrelevant blocks.

Proof.

For the running time we reason as in [15] and observe that the sum, over all iterations, of the lengths of the active blocks is O(n · avelcp). The time bound follows by observing that at any iteration the cost of processing an active block of length ℓ is bounded by O(ℓ) time.

For the analysis of the working space we observe that for the array B we can use the space of the output LCP array, hence the working space consists only of the bits for the two instances of the array Z (the current and the next one) and a constant number of counters (the arrays F and Block_id).∎

It is unfortunately impossible to give a clean bound for the space needed for keeping track of irrelevant blocks. Our scheme uses O(σ) words per block, but in the worst case we can have Θ(n0+n1) blocks. Although such a worst case is rather unlikely, it is important to have some form of control on this additional space. We use the following simple heuristic: we choose a threshold t and we keep track of an irrelevant block only if its size is at least t. This strategy introduces a O(t) time slowdown, but ensures that there are at most (n0+n1)/t irrelevant blocks simultaneously. The experiments in the next section show that in practice the space used to keep track of irrelevant blocks is less than 10% of the total.

Note that also in [15] the authors faced the problem of limiting the memory used to keep track of the active blocks. They suggested the heuristic of keeping track of active blocks only after a fixed number of initial iterations (chosen according to their dataset).

    Name       Size (GB)   σ     Max Len   Ave Len    Max LCP   Ave LCP
    Pacbio     6.24        5     40212     9567.43    1055      17.99
    Illumina   7.60        6     103       102.00     102       27.53
    Wiki-it    4.01        210   553975    4302.84    93537     61.02
    Proteins   6.11        26    35991     410.22     25065     100.60
Table 1: Collections used in our experiments, sorted by average LCP. Columns 4 and 5 refer to the lengths of the single documents. Pacbio are NGS reads from a D. melanogaster dataset. Illumina are NGS reads from the Human ERA015743 dataset. Wiki-it are pages from the Italian Wikipedia. Proteins are protein sequences from Uniprot. Collections and source files are available at https://people.unipmn.it/manzini/gap.
    Name       Subcoll.   gSACA-K+Φ     Gap (increasing block threshold t →)
                          time          time   space    time   space    time   space
    Pacbio        7       0.46          0.41   4.35     0.46   4.18     0.51   4.09
    Illumina      4       0.48          0.93   3.31     1.02   3.16     1.09   3.08
    Wiki-it       5       0.41          3.07   6.55
    Proteins      4       0.59          3.90   4.55     5.18   4.29     7.05   4.15
Table 2: For each collection we report the number of subcollections, the average running time of gSACA-K+Φ in seconds per symbol, and the running time (secs) and space usage (bytes) per symbol of Gap for different values of the threshold parameter t. All tests were executed on a desktop with 32GB RAM and eight Intel i7 3.40GHz CPUs, using a single CPU in each experiment.

4.1 Experimental Results

We have implemented the Gap algorithm in C and tested it on the collections shown in Table 1, which have documents of different size, LCP, and alphabet size. We represented LCP values with the minimum possible number of bytes for each collection: 1 byte for Illumina, 2 bytes for Pacbio and Proteins, and 4 bytes for Wiki-it. We always used 1 byte for each BWT value and 1 byte to represent a pair of Z arrays, using 4 bits for each entry, so that the tested implementation can merge simultaneously up to 16 BWTs.

Referring to Table 2, we split each collection into subcollections of size less than 2GB and we computed the multi-string SA of each subcollection using gSACA-K [23]. From the SA we computed the multi-string BWT and LCP arrays using the Φ algorithm [19] (implemented in gSACA-K). This computation used 13 bytes per input symbol. Then, we merged the subcollections' BWTs and LCPs using Gap with different values of the parameter t, which determines the size of the smallest irrelevant block we keep track of. Since skipping a block takes time proportional to σ, regardless of t Gap never keeps track of blocks smaller than σ; therefore for Wiki-it we performed a single experiment where the smallest irrelevant block size was σ = 210.

From the results in Table 2 we see that Gap’s running time is indeed roughly proportional to the average LCP. For example, Pacbio and Illumina collections both consist of DNA reads but, despite Pacbio reads being longer and having a larger maximum LCP, Gap is twice as fast on them because of the smaller average LCP. Similarly, Gap is faster on Wiki-it than on Proteins despite the latter collection having a smaller alphabet and shorter documents.

As expected, the parameter t offers a time-space tradeoff for the Gap algorithm. In the space reported in Table 2, the fractional part is the peak space usage for irrelevant blocks, while the integral value is the space used by the BWTs and the arrays B and Z. For example, for Wiki-it we use 1 byte per symbol for the BWTs, 4 bytes for the LCP values (the B array), 1 byte for the pair of Z arrays, and the remaining 0.55 bytes are mainly used for keeping track of irrelevant blocks. This is a relatively high value, about 9% of the total space, since in our current implementation the storage of a block grows linearly with the alphabet size. For DNA sequences and the largest tested threshold, the cost of storing blocks is less than 3% of the total without a significant slowdown in the running time.

For completeness, we tested the H&M implementation from [15] on the Pacbio collection. The running time was 14.57 secs per symbol, with a smaller space usage per symbol. These values are only partially significant for several reasons: H&M computes the BWT from scratch, hence doing also the work of gSACA-K; H&M does not compute the LCP array, hence the lower space usage; and the algorithm is implemented in Cython, which makes it easier to use in a Python environment but is not as fast and space efficient as C.

4.2 Merging only BWTs

If we are not interested in LCP values but only need to merge BWTs, we can still use Gap instead of H&M to do the computation in O(n · avelcp) time. In that case, however, the use of the integer array B recording LCP values is wasteful. We can save space by replacing it with an array that uses just two bits per entry, representing four possible states {0, 1, 2, 3}. The rationale for this is that, if we are not interested in LCP values, the entries of B are only used in Line 4 of Fig. 3, where it is tested whether they are different from 0 or h.

During iteration h, the two-bit values are used instead of the ones in B as follows: an entry 0 corresponds to B[k] = 0, and an entry 1 corresponds to a nonzero B[k] written before iteration h-1. If h is even, an entry 2 corresponds to B[k] = h and an entry 3 corresponds to B[k] = h-1; if h is odd the correspondence is 2 ↔ h-1, 3 ↔ h. The array is initialized as 1 0^(n0+n1-1) 1, and it is updated appropriately in Lines 13–14. The reason for this apparently involved scheme is that during iteration h an entry can be modified either before or after we read it at Line 4. The resulting code is shown in Fig. 5. Using the two-bit array we can still define (and skip) monochrome blocks and therefore achieve the O(n · avelcp) complexity.

 

 4: if B[k] ≠ 0 and B[k] ≠ 2 then
 5:     id ← k                               ▹ A new block of Z^(h-1) is starting
 6: end if
 7: if B[k] = 3 then
 8:     B[k] ← 1                             ▹ Mark the block as old
 9: end if
10:     ⋮
13: if B[j] = 0 then                         ▹ Check if already marked
14:     B[j] ← 2                             ▹ A new block of Z^(h) will start here
15: end if

Figure 5: Modification of the H&M algorithm to use a two-bit array instead of the integer array B. The code shows the case of h even; if h is odd, the value 2 is replaced by 3 and vice versa.

Notice that, by Corollary 3, the entry associated to position k changes from 0 to 2 or 3 during iteration lcp01[k] + 1. Hence, if every time we do such a change we write the pair ⟨k, h-1⟩ to an external file, when the merging is complete the file contains all the information required to compute the LCP array lcp01, even if we do not know lcp0 and lcp1. This idea has been introduced and investigated in [7].

5 Merging compressed tries

 
XBWT of the left trie (strings aa#, ab#, aca#, bc#):

    Last  L  Π
     0    a  ε
     1    b
     0    a  a
     0    b
     1    c
     1    #  aa
     1    #  aca
     1    c  b
     1    #  ba
     1    a  ca
     1    #  cb

XBWT of the center trie (strings aac#, ab#, ba#):

    Last  L  Π
     0    a  ε
     1    b
     0    a  a
     1    b
     1    c  aa
     1    #  ab
     1    a  b
     1    #  ba
     1    #  caa

XBWT of the right trie (union of the two sets of strings):

    Last  L  Π
     0    a  ε
     1    b
     0    a  a
     0    b
     1    c
     0    #  aa
     1    c
     1    #  ab
     1    #  aca
     0    a  b
     1    c
     1    #  ba
     1    a  ca
     1    #  caa
     1    #  cb

Figure 6: The trie containing the strings aa#, ab#, aca#, bc# (left), the trie containing aac#, ab#, ba# (center), and the trie containing the union of the two sets of strings (right). Below each trie we show the corresponding XBWT representation.

Tries [21] are a fundamental data structure for representing a collection of distinct strings. A trie consists of a rooted tree in which each edge is labeled with a symbol in the input alphabet, and each string is represented by a path from the root to one of the leaves. To simplify the algorithms, and to ensure that no string is a prefix of another one, it is customary to add a special symbol # at the end of each string (in this and in the following section we purposely use a special symbol # different from $: the reason is that $ is commonly used for sorting purposes, while # simply represents a symbol different from the ones in Σ). Tries for different sets of strings are shown in Figure 6. For any trie node u we write h(u) to denote its height, that is the length of the path from the root to u. We define the height of the trie as the maximum node height, and the average height as the sum of the node heights divided by the number of trie nodes.

The eXtended Burrows-Wheeler Transform [10, 25, 32] is a generalization of the BWT designed to compactly represent any labeled tree T. To define xbwt(T), to each internal node u of T we associate the string obtained by concatenating the symbols on the edges of the upward path from u to the root of T. If T has m internal nodes we have m strings overall; let Π denote the array containing such strings sorted lexicographically. Note that Π[1] is always the empty string corresponding to the root of T. For i = 1, ..., m, let C_i denote the set of symbols labeling the edges exiting from the node corresponding to Π[i]. We define the array L as the concatenation of the arrays C_1, ..., C_m. If T has e edges (and therefore e+1 nodes), it is |L| = e, and L contains symbols from Σ and occurrences of #. To keep an explicit representation of the intervals C_i within L, we define a binary array Last such that Last[j] = 1 iff L[j] is the last symbol of some interval C_i. See Figure 6 for a complete example.

In [9] it is shown that the two arrays ⟨Last, L⟩ are sufficient to represent T, and that if they are enriched with data structures supporting constant time rank and select operations, they can be used for efficient upward and downward navigation and for substring search in T. The fundamental property for efficient navigation and search is that there is a one-to-one correspondence between the symbols in L different from # and the strings in Π different from the empty string. The correspondence is order preserving in the sense that the i-th occurrence of symbol c in L corresponds to the i-th string in Π starting with c. For example, in Figure 6 (right) the third a in L corresponds to the third string in Π starting with a, namely ab. Note that ab is the string associated to the node reached by following the edge associated to the third a in L.
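To make the ⟨Last, L⟩ representation concrete, here is a small Python sketch (ours, not an algorithm from the paper) that builds it naively for the trie of a set of #-terminated strings by listing the internal nodes and sorting their upward-path strings; the function name xbwt_of_strings is illustrative.

    def xbwt_of_strings(strings, terminator='#'):
        # children[p] = symbols labelling the edges leaving the node whose root-to-node path is p.
        children = {}
        for s in strings:
            assert s.endswith(terminator)
            for i in range(len(s)):
                children.setdefault(s[:i], set()).add(s[i])
        # Pi = upward-path strings (reversed root-to-node paths) of the internal nodes, sorted.
        nodes = sorted(children, key=lambda p: p[::-1])
        Pi, L, Last = [], [], []
        for p in nodes:
            Pi.append(p[::-1])
            labels = sorted(children[p])
            for j, c in enumerate(labels):
                L.append(c)
                Last.append(1 if j == len(labels) - 1 else 0)
        return Pi, L, Last

    # The left trie of Figure 6:
    Pi, L, Last = xbwt_of_strings(["aa#", "ab#", "aca#", "bc#"])
    # Pi   == ['', 'a', 'aa', 'aca', 'b', 'ba', 'ca', 'cb']
    # L    == ['a','b', 'a','b','c', '#', '#', 'c', '#', 'a', '#']
    # Last == [ 0,  1,   0,  0,  1,   1,   1,   1,   1,   1,   1]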

In this section, we consider the problem of merging two distinct XBWTs. More formally, let T0 (resp. T1) denote the trie containing the set of strings S0 (resp. S1), and let T01 denote the trie containing the strings in the union S0 ∪ S1 (see Figure 6). Note that T01 might contain fewer than |S0| + |S1| strings: if the same string appears in both S0 and S1, it will be represented in T01 only once. Given xbwt(T0) and xbwt(T1) we want to compute the XBWT representation of the trie T01.

We observe that if we had at our disposal the sorted string arrays Π0 and Π1, then the construction of xbwt(T01) could be done as follows: first, we merge lexicographically the strings in Π0 and Π1; then we scan the resulting sorted array of strings. During the scan (a code sketch follows the list below):

  • if we find a string appearing only once, then it corresponds to an internal node belonging to either T0 or T1; the labels on the outgoing edges can be simply copied from the appropriate range of L0 or L1.

  • if we find two consecutive equal strings, they correspond respectively to an internal node in T0 and to one in T1. The corresponding node in T01 has a set of outgoing edges equal to the union of the edges of those nodes in T0 and T1: thus, the labels on the outgoing edges are the union of the symbols in the appropriate ranges of L0 and L1.
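The following sketch (ours) implements this idealized two-case scan, under the unrealistic assumption that the sorted arrays Π0 and Π1 are explicitly available; the helper name merge_xbwt_with_Pi is hypothetical.

    def merge_xbwt_with_Pi(Pi0, L0, Last0, Pi1, L1, Last1):
        def intervals(L, Last):
            """Split L into the per-node intervals delimited by Last."""
            out, cur = [], []
            for c, last in zip(L, Last):
                cur.append(c)
                if last:
                    out.append(cur); cur = []
            return out
        iv0, iv1 = intervals(L0, Last0), intervals(L1, Last1)
        i = j = 0
        Pi01, L01, Last01 = [], [], []
        while i < len(Pi0) or j < len(Pi1):
            if j == len(Pi1) or (i < len(Pi0) and Pi0[i] < Pi1[j]):
                s, labels = Pi0[i], iv0[i]; i += 1          # node only in T0: copy its labels
            elif i == len(Pi0) or Pi1[j] < Pi0[i]:
                s, labels = Pi1[j], iv1[j]; j += 1          # node only in T1: copy its labels
            else:
                s = Pi0[i]
                labels = set(iv0[i]) | set(iv1[j])          # same node in both tries: union of labels
                i += 1; j += 1
            Pi01.append(s)
            labels = sorted(labels)
            for t, c in enumerate(labels):
                L01.append(c)
                Last01.append(1 if t == len(labels) - 1 else 0)
        return Pi01, L01, Last01

Running it on the two XBWTs of Figure 6 produces the representation of the union trie shown on the right; the point of the rest of this section is to obtain the same interleaving without ever materializing Π0 and Π1.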

Although the arrays Π0 and Π1 are not available, by properly modifying the H&M algorithm we can compute how their elements would be interleaved by the merge operation. Let …, …, and similarly