DeepAI
Log In Sign Up

Space-Efficient Construction of Compressed Suffix Trees

08/12/2019
by   Nicola Prezza, et al.
University of Pisa
0

We show how to build several data structures of central importance to string processing, taking as input the Burrows-Wheeler transform (BWT) and using small extra working space. Let n be the text length and σ be the alphabet size. We first provide two algorithms that enumerate all LCP values and suffix tree intervals in O(nσ) time using just o(nσ) bits of working space on top of the input BWT. Using these algorithms as building blocks, for any parameter 0 < ϵ≤ 1 we show how to build the PLCP bitvector and the balanced parentheses representation of the suffix tree topology in O(n(σ + ϵ^-1· n)) time using at most nσ·(ϵ + o(1)) bits of working space on top of the input BWT and the output. In particular, this implies that we can build a compressed suffix tree from the BWT using just succinct working space (i.e. o(nσ) bits) and any time in Θ(nσ) + ω(n n). This improves the previous most space-efficient algorithms, which worked in O(n) bits and O(n n) time. We also consider the problem of merging BWTs of string collections, and provide a solution running in O(nσ) time and using just o(nσ) bits of working space. An efficient implementation of our LCP construction and BWT merge algorithms use (in RAM) as few as n bits on top of a packed representation of the input/output and process data as fast as 2.92 megabases per second.

READ FULL TEXT VIEW PDF

page 1

page 2

page 3

page 4

01/16/2019

Space-Efficient Computation of the LCP Array from the Burrows-Wheeler Transform

We show that the Longest Common Prefix Array of a text collection of tot...
12/13/2017

Closing in on Time and Space Optimal Construction of Compressed Indexes

Fast and space-efficient construction of compressed indexes such as comp...
02/28/2018

Fast Lempel-Ziv Decompression in Linear Space

We consider the problem of decompressing the Lempel-Ziv 77 representatio...
03/04/2019

Lightweight merging of compressed indices based on BWT variants

In this paper we propose a flexible and lightweight technique for mergin...
02/14/2019

Conversion from RLBWT to LZ77

Converting a compressed format of a string into another compressed forma...
10/16/2018

Nearly Optimal Space Efficient Algorithm for Depth First Search

We design a space-efficient algorithm for performing depth-first search ...
03/05/2018

Optimal Substring-Equality Queries with Applications to Sparse Text Indexing

We consider the problem of encoding a string of length n from an alphabe...

1 Introduction and Related Work

The increasingly-growing production of large string collections—especially in domains such as biology, where new generation sequencing technologies can nowadays generate Gigabytes of data in few hours—is lately generating much interest towards fast and space-efficient algorithms able to index this data. The Burrows-Wheeler Transform [9] and its extension to sets of strings [28, 1] is becoming the gold-standard in the field: even when not compressed, its size is one order of magnitude smaller than classic suffix arrays (while preserving many of their indexing capabilities). This generated considerable interest towards fast and space-efficient BWT construction algorithms [1, 22, 35, 6, 15, 25, 33, 35]. As a result, the problem of building the BWT is well understood to date. The fastest algorithm solving this problem operates in sublinear time and bits of space on a binary text of length by exploiting word parallelism [25]. The authors also provide a conditional lower bound suggesting that this running time might be optimal. The most space-efficient algorithm terminates in time and uses just bits of space (succinct) on top of the input and output [33], where is the alphabet’s size. In the average case, this running time can be improved to on constant-sized alphabets while still operating within succinct space [35].

In some cases, a BWT alone is not sufficient to complete efficiently particular string-processing tasks. For this reason, the functionalities of the BWT are often extended by flanking to it additional structures such as the Longest Common Prefix (LCP) array [11] (see e.g. [36, 37, 18, 43] for bioinformatic applications requiring this additional component). A disadvantage of the LCP array is that it requires bits to be stored in plain form. To alleviate this problem, usually the PLCP array [40]—an easier-to-compress permutation of the LCP array—is preferred. The PLCP relies on the idea of storing LCP values in text order instead of suffix array order. As shown by Kasai et al. [24], this permutation is almost increasing () and can thus be represented in just bits in a bitvector known as the PLCP bitvector. More advanced applications might even require full suffix tree functionality. In such cases, compressed suffix trees [17, 41] (CSTs) are the preferred choice when the space is at a premium. A typical compressed suffix tree is formed by a compressed suffix array (CSA), the PLCP bitvector, and a succinct representation of the suffix tree topology [41] (there exist other designs, see Ohlebusch et al. [34] for an exhaustive survey). To date, several practical algorithms have been developed to solve the task of building de novo such additional components [11, 19, 20, 8, 12, 2, 23, 44], but little work has been devoted to the task of computing them from the BWT in little working space (internal and external). Considering the advanced point reached by state-of-the-art BWT construction algorithms, it is worth to explore whether such structures can be built more efficiently starting from the BWT, rather than from the raw input text.

CSA As far as the CSA is concerned, this component can be easily built from the BWT using small space as it is formed (in its simplest design) by just a BWT with rank/select functionality enhanced with a suffix array sampling, see also [2].

LCP We are aware of only one work building the LCP array in small space from the BWT: Beller et al. [5] show how to build the LCP array in time and bits of working space on top of the input BWT and the output. Other works [30, 2] show how to build the LCP array directly from the text in time and bits of space (compact).

PLCP Kärkkäinen et al. [23] show that the PLCP bitvector can be built in time using bits of working space on top of the text, the suffix array, and the output PLCP. Kasai at al.’s lemma also stands at the basis of a more space-efficient algorithm from Välimäki et al. [44], which computes the PLCP from a CSA in time using constant working space on top of the CSA and the output. Belazzougui [2] recently presented an algorithm for building the PLCP bitvector from the text in optimal time and compact space ( bits).

Suffix tree topology The remaining component required to build a compressed suffix tree (in the version described by Sadakane [41]) is the suffix tree topology, represented either in BPS [31] (balanced parentheses) or DFUDS [7] (depth first unary degree sequence), using bits. As far as the BPS representation is concerned, Hon et al. [21] show how to build it from a CSA in time and compact space for any constant . Belazzougui [2] improves this running time to the optimal , still working within compact space. Välimäki et al. [44] describe a linear-time algorithm that improves the space to bits on top of the LCP array (which however needs to be represented in plain form), while Ohlebusch et al. [34] show how to build the DFUDS representation of the suffix tree topology in time using bits of working space on top of a structure supporting access to LCP array values in time.

Summing up, the situation for building compressed suffix trees from the BWT is the following: algorithms working in optimal linear time require bits of working space. Algorithms reducing this space to (on top of a CSA) are only able to build the suffix tree topology within time, which is with the current best techniques, and the PLCP bitvector in time. No algorithm can build all the three CST components within bits of working space on top of the input BWT and the output. Combining the most space-efficient existing algorithms, the following two trade-offs can therefore be achieved for building all compressed suffix tree components from the BWT:

  • bits of working space and time, or

  • bits of working space and time.

Our contributions

In this paper, we give new space-time trade-offs that allow building the CST’s components in smaller working space (and in some cases even faster) with respect to the existing solutions. We start by combining Beller et al.’s algorithm [5] with the suffix-tree enumeration procedure of Belazzougui [2] to obtain an algorithm that enumerates (i) all pairs , and (ii) all suffix tree intervals in time using just bits of working space on top of the input BWT. We use this procedure to obtain algorithms that build (working space is on top of the input BWT and the output):

  1. The LCP array of a string collection in time and bits of working space (see Section 5).

  2. the PLCP bitvector and the BPS representation of the suffix tree topology in time and bits of working space, for any user-defined parameter (see Section 7 and 8).

  3. The BWT of the union of two string collections of total size in time and bits of working space, given the BWTs of the two collections as input (see Section 9).

Contribution (1) is the first showing that the LCP array can be induced from the BWT using succinct working space for any alphabet size.

Contribution (2) can be used to build a compressed suffix tree from the BWT using just bits of working space and any time in —for example, , for any . On small alphabets, this improves both working space and running time of existing -bits solutions.

Also contribution (3) improves the state-of-the-art, due to Belazzougui et al. [2, 3]. In those papers, the authors show how to merge the BWTs of two texts and obtain the BWT of the collection in time and bits of working space for any  [3, Thm. 7]. When , this running time is the same as our result (3), but the working space is much higher on small alphabets.

We implemented and tested our algorithms (1, 3) on DNA alphabet. Our tools use (in RAM) as few as bits on top of a packed representation of the input/output, and process data as fast as megabases per second.

Contributions (1, 3) are part of a preliminary version [38] of this paper. This paper also extends such results with the suffix tree interval enumeration procedure and with the algorithms of contribution (2) for building the PLCP bitvector and the BPS representation of the suffix tree topology.

2 Basic Concepts

Let be a finite ordered alphabet of size with , where denotes the standard lexicographic order. Given a text we denote by its length . We assume that the input text is terminated by the special symbol (terminator) , which does not appear elsewhere in . We use to denote the empty string. A factor (or substring) of is written as with . When declaring an array , we use the same notation to indicate that the array has entries indexed from to . A right-maximal substring of is a string for which there exist at least two distinct characters such that and occur in .

The suffix array SA of a string (see [39] for a survey) is an array containing the permutation of the integers that arranges the starting positions of the suffixes of into lexicographical order, i.e., for all , .

The inverse suffix array is the inverse permutation of , i.e., if and only if .

The Burrows-Wheeler Transform of a string is a reversible transformation that permutates its symbols, i.e. if or otherwise.

In some of our results we deal with string collections. There exist some natural extensions of the suffix array and the Burrows-Wheeler Transform to a collection of strings.

Let be a string collection of total length , where each is terminated by a character (the terminator) lexicographically smaller than all other alphabet’s characters. In particular, a collection is an ordered multiset, and we denote .

We define lexicographic order among the strings’ suffixes in the usual way, except that, only while sorting, each terminator of the -th string is considered (implicitly) a different symbol , with if and only if . Equivalently, in case of equal suffixes ties are broken by input’s order: if , then we define if and only if .

The generalized suffix array (see [42, 11, 26]) of is an array of pairs such that is the -th lexicographically smallest suffix of strings in , where we break ties by input position (i.e. in the notation above). Note that, if the collection is formed by a single string , then the first component in ’s pairs is always equal to 1, and the second components form the suffix array of . We denote by , also referred to as suffix array (SA) interval of , or simply -interval, the maximal pair such that all suffixes in are prefixed by . We use the same notation with the suffix array of a single string . Note that the number of suffixes lexicographically smaller than in the collection is . We extend this definition also to cases where is not present in the collection: in this case, the (empty) range is and we still require that is the number of suffixes lexicographically smaller than in the collection (or in the string).

The extended Burrows-Wheeler Transform [28, 1] of is the character array defined as , where .

To simplify notation, we indicate with “” both the Burrows-Wheeler Transform of a string and of a string collection. The used transform will be clear from the context.

The longest common prefix (LCP) array of a string [27] (resp. a collection of strings, see [11, 26, 12]) is an array storing the length of the longest common prefixes between two consecutive suffixes of (resp. ) in lexicographic order (with ). When applied to a string collection, we take the longest common prefix of two equal suffixes of length to be equal to (i.e. as if their terminators were different).

Given two collections of total length , the Document Array of their union is the binary array such that if and only if the -th smallest suffix comes from . When merging suffixes of the two collections, ties are broken by collection number (i.e. suffixes of are smaller than suffixes of in case of ties).

The -array of a string (or collection) is an array such that contains the number of characters lexicographically smaller than in , plus one ( will be clear from the context). Equivalently, is the starting position of suffixes starting with in the suffix array of the string. When (or any of its permutations) is represented with a balanced wavelet tree, then we do not need to store explicitly , and can be computed in time with no space overhead on top of the wavelet tree (see [16]). Function returns the number of characters equal to in . When is represented by a wavelet tree, rank can be computed in time.

Function , where is the extended Burrows-Wheeler transform of a string collection and is the suffix array interval of some string appearing as a substring of some element of , returns all suffix array intervals of strings , with , that occur in . When is represented with a balanced wavelet tree, we can implement this function so that it terminates in time per returned interval [5]

. The function can be made to return the output intervals on-the-fly, one by one (in an arbitrary order), without the need to store them all in an auxiliary vector, with just

bits of additional overhead in space [5] (this requires to DFS-visit the sub-tree of the wavelet tree induced by ; the visit requires only bits to store the current path in the tree).

An extension of the above function that navigates in parallel two BWTs is immediate. Function takes as input two ranges of a string on the BWTs of two collections, and returns the pairs of ranges on the two BWTs corresponding to all left-extensions of () such that appears in at least one of the two collections. To implement this function, it is sufficient to navigate in parallel the two wavelet trees as long as at least one of the two intervals is not empty.

Let be a string. The function returns the set of distinct alphabet characters different than the terminator in . Also this function can be implemented in time per returned element when is represented with a wavelet tree (again, this requires a DFS-visit of the sub-tree of the wavelet tree induced by ).

is the function that, given the suffix array interval of a string occurring in the collection, returns the suffix array interval of by using the BWT of the collection [14]. This function requires access to array and rank support on , and runs in time when is represented with a balanced wavelet tree.

To conclude, our algorithms will take as input a wavelet tree representing the BWT. As shown in the next lemma by Claude et al., this is not a restriction:

 
Lemma 1
([10]).

Given a word-packed string of length on alphabet , we can replace it with its wavelet matrix [10] in time using bits of additional working space.

Wavelet matrices [10] are a particular space-efficient representation of wavelet trees taking bits of space and supporting all their operations within the same running times. Since the output of all our algorithms will take at least bits, it will always be possible to re-use a portion of the output’s space (before computing it) to fit the extra bits required by Lemma 1.

3 Belazzougui’s Enumeration Algorithm

In [2], Belazzougui showed that a BWT with rank and range distinct functionality (see Section 2) is sufficient to enumerate in small space a rich representation of the internal nodes of the suffix tree of a text . For the purposes of this article, we assume that the BWT is represented using a wavelet tree (whereas Belazzougui’s original result is more general), and thus that all queries take time.

 
Theorem 1
(Belazzougui [2]).

Given the Burrows-Wheeler Transform of a text represented with a wavelet tree, we can enumerate the following information for each distinct right-maximal substring of : (i) , and (ii) for all such that occurs in . The process runs in time and uses bits of working space on top of the BWT.

To keep the article self-contained, in this section we describe the algorithm at the core of the above result. Remember that explicit suffix tree nodes correspond to right-maximal substrings. The first idea is to represent any substring (not necessarily right-maximal) as follows. Let be the alphabetically-sorted character array such that is a substring of for all , where is the number of right-extensions of . We require to be also complete: if is a substring of , then . Let moreover be the array such that is the starting position of (the range of) in the suffix array of for , and is the end position of in the suffix array of . The representation for is (differently from [2], we omit from the representation and we add ; these modifications will turn useful later):

Note that, if is not right-maximal nor a text suffix, then is followed by distinct characters in and the above representation is still well-defined. When is right-maximal, we will also say that is the representation of a suffix tree explicit node (i.e. the node reached by following the path labeled from the root).

Weiner Link Tree Visit

The enumeration algorithm works by visiting the Weiner Link tree of starting from the root’s representation, that is, , where (see Section 2 for a definition of the -array) and are the sorted alphabet’s characters. Since the suffix tree and the Weiner link tree share the same set of nodes, this is sufficient to enumerate all suffix tree nodes. The visit uses a stack storing representations of suffix tree nodes, initialized with . At each iteration, we pop the head from the stack and we push such that is right-maximal in . To keep the stack’s size under control, once computed for the right-maximal left-extensions of we push them on the stack in decreasing order of range length (i.e. the node with the smallest range is pushed last). This guarantees that the stack will always contain at most elements [2]. Since each element takes bits to be represented, the stack’s size never exceeds bits.

Computing Weiner Links

We now show how to efficiently compute the node representation from for the characters such that is right-maximal in . In [2, 3] this operation is supported efficiently by first enumerating all distinct characters in each range for , using function (see Section 2). Equivalently, for each we want to list all distinct left-extensions of . Note that, in this way, we may also visit implicit suffix tree nodes (i.e. some of these left-extensions could be not right-maximal). Stated otherwise, we are traversing all explicit and implicit Weiner links. Since the number of such links is linear [2, 4] (even including implicit Weiner links111To see this, first note that the number of right-extensions of that have only one left-extension is at most equal to the number of right-extensions of ; globally, this is at most the number of suffix tree’s nodes (linear). Any other right-extension that has at least two distinct left-extensions and is, by definition, left maximal and corresponds therefore to a node in the suffix tree of the reverse of . It follows that all left-extensions of can be charged to an edge of the suffix tree of the reverse of (again, the number of such edges is linear).), globally the number of distinct characters returned by operations is . An implementation of on wavelet trees is discussed in [5] with the procedure getIntervals (this procedure actually returns more information: the suffix array range of each ). This implementation runs in time per returned character. Globally, we therefore spend time using a wavelet tree. We now need to compute for all left-extensions of and keep only the right-maximal ones. Let and be the function that returns the representations of such strings (used in Line 12 of Algorithm 1). This function can be implemented by observing that

where for , and noting that and are available in . Note also that we do not actually need to know the value of characters to compute the ranges of each ; this is the reason why we can omit from . Using a wavelet tree, the above operation takes time. By the above observations, the number of strings such that is right-maximal is bounded by . Overall, computing for all left-extensions of all right-maximal strings takes therefore time. Within the same running time, we can check which of those extensions is right maximal (i.e. those such that ), sort them in-place by interval length (we always sort at most node representations, therefore also sorting takes globally time), and push them on the stack.

4 Beller et al.’s Algorithm

The second ingredient used in our solutions is the following result, due to Beller et al (we slightly re-formulate their result to fit our purposes, read below for a description of the differences):

 
Theorem 2
(Beller et al.[5]).

Given the Burrows-Wheeler Transform of a text represented with a wavelet tree, we can enumerate all pairs in time using bits of working space on top of the BWT.

Theorem 2 represents the state of the art for computing the LCP array from the BWT. Also Beller et al.’s algorithm works by enumerating a (linear) subset of the BWT intervals. LCP values are induced from a particular visit of those intervals. Belazzougui’s and Beller et al.’s algorithms have, however, two key differences which make the former more space-efficient on small alphabets, while the latter more space-efficient on large alphabets: (i) Beller et al. use a queue (FIFO) instead of a stack (LIFO), and (ii) they represent -intervals with just the pair of coordinates and the value . In short, while Beller et al.’s queue might grow up to size , the use of intervals (instead of the more complex representation used by Belazzougui) makes it possible to represent it using bitvectors of length . On the other hand, the size of Belazzougui’s stack can be upper-bounded by , but its elements take more space to be represented.

We now describe in detail Beller et al.’s result. We keep a bitvector such that if and only if the pair has not been output yet. In their original algorithm, Beller et al. use the LCP array itself to mark undefined LCP entries. In our case, we don’t want to store the whole LCP array (for reasons that will be clear in the next sections) and thus we only record which LCP values have been output. Bitvector accounts for the additional bits used by Theorem 2 with respect to the original result described in [5]. At the beginning, for all . Beller et al.’s algorithm starts by inserting in the queue the triple , where the first two components are the BWT interval of (the empty string) and the third component is its length. From this point, the algorithm keeps performing the following operations until the queue is empty. We remove the first (i.e. the oldest) element from the queue, which (by induction) is the interval and length of some string : and . Using operation  [5] (see Section 2) we left-extend the BWT interval with the characters in , obtaining the triples corresponding to the strings . For each such triple , if and then we set , we output the LCP pair and push on the queue. Importantly, note that we can push the intervals returned by in the queue in any order; as discussed in Section 2, this step can be implemented with just bits of space overhead with a DFS-visit of the wavelet tree’s sub-tree induced by (i.e. the intervals are not stored temporarily anywhere: they are pushed as soon as they are generated).

Queue implementation

To limit space usage, Beller et al. use the following queue representations. First note that, at each time point, the queue’s triples are partitioned into a (possibly empty) sequence with associated length (i.e. the third element in the triples) , followed by a sequence with associated length , for some . To simplify the description, let us assume that these two sequences are kept as two distinct queues, indicated in the following as and . At any stage of the algorithm, we pop from and push into . It follows that there is no need to store strings’ lengths in the triples themselves (i.e. the queue’s elements become just ranges), since the length of each element in is . When is empty, we create a new empty queue , pop from , and push into (and so on). Beller et al. represent as follows. While pushing elements in , as long as its size does not exceed we represent it as a vector of pairs (of total size at most bits). This representation supports push/pop operations in (amortized) constant time and takes at most bits of space. As soon as ’s size exceeds , we switch to a representation that uses two packed bitvectors of length storing, respectively, the left- and right-most boundaries of the ranges in the queue. Note that this representation can be safely used since the pairs in are suffix array ranges of strings of some fixed length , therefore there cannot be overlapping intervals. Pushing an interval into such a queue takes constant time (it just requires setting two bits). Popping all the intervals, on the other hand, can easily be implemented in time by scanning the bitvectors and exploiting word-parallelism (see [5] for all details). Since Beller et al.’s procedure visits SA intervals, will exceed size for at most values of . It follows that also with this queue representation pop operations take amortized constant time.

Time complexity

It is easy to see that the algorithm inserts in total a linear number of intervals in the queue since an interval is inserted only if , and successively is set to . Clearly, this can happen at most times. In [5] the authors moreover show that, even when counting the left-extensions of those intervals (computed after popping each interval from the queue), the total number of generated intervals stays linear. Overall, the algorithm runs therefore in time (as discussed in Section 2, runs in time per returned element).

5 Enumerating LCP values

In this section we prove our first main result: how to enumerate LCP pairs using succinct working space on top of a wavelet tree representing the BWT. Later we will use this procedure to build the LCP and PLCP arrays in small space on top of a plain representation of the BWT. We give our lemma in the general form of string collections, which will require adapting the algorithms seen in the previous sections to this more general setting. Our first observation is that Theorem 1, extended to string collections as described below, can be directly used to enumerate LCP pairs using just bits of working space on top of the input and output. We combine this procedure with an extended version of Beller et al.’s algorithm working on string collections in order to get small working space for all alphabets. Algorithms 1 and 2 report our complete procedure; read below for an exhaustive description. We obtain our first main result:

 
Lemma 2
.

Given a wavelet tree for the Burrows-Wheeler Transform of a collection of total length on alphabet , we can enumerate all pairs in time using bits of working space on top of the BWT.

Proof.

If then and our extension of Theorem 1 gives us additional working space. If then and we can use our extension to string collections of Theorem 2, which yields extra working space . Note that, while we used the threshold , any threshold of the form , with would work. The only constraint is that , since otherwise for the working space would become for constant (not good since we aim at ). ∎

We now describe all the details of our extensions of Theorems 1 and 2 used in the proof of Lemma 2. Procedure BGOS(BWT) in Line 2 of Algorithm 1 is a call to Beller et al.’s algorithm, modified as follows. First, we enumerate the LCP pairs for all . Then, we push in the queue for all and start the main algorithm. Note moreover that (see Section 2) from now on we never left-extend ranges with .

Recall that each string of a text collection is ended by a terminator (common to all strings). Consider now the LCP and GSA arrays of . We divide LCP values in two types. Let , with , indicate that the -th suffix in the lexicographic ordering of all suffixes of strings in is . A LCP value is of node type when the -th and -th suffixes are distinct: , where and . Those two suffixes differ before the terminator is reached in both suffixes (it might be reached in one of the two suffixes, however); we use the name node-type because and are the last and first suffix array positions of the ranges of two adjacent children of some suffix tree node, respectively (i.e. the node corresponding to string ). Note that it might be that one of the two suffixes, or , is the string “”. Similarly, a leaf-type LCP value is such that the -th and -th suffixes are equal: . We use the name leaf-type because, in this case, it must be the case that , where is the suffix array range of some suffix tree leaf (it might be that since there might be repeated suffixes in the collection). Note that, in this case, could coincide with . Entry escapes the above classification, so we output it separately.

Our idea is to compute first node-type and then leaf-type LCP values. We argue that Beller et al.’s algorithm already computes the former kind of LCP values. When this algorithm uses too much space (i.e. on small alphabets), we show that Belazzougui’s enumeration strategy can be adapted to reach the same goal: by the very definition of node-type LCP values, they lie between children of some suffix tree node , and their value corresponds to the string depth of . This strategy is described in Algorithm 1. Function in Line 12 takes as input the representation of a suffix tree node and returns all explicit nodes reached by following Weiner links from (an implementation of this function is described in Section 3). Leaf-type LCP values, on the other hand, can easily be computed by enumerating intervals corresponding to suffix tree leaves. To reach this goal, it is sufficient to enumerate ranges of suffix tree leaves starting from and recursively left-extending with backward search with characters different from whenever possible. For each range obtained in this way, we set each entry to the string depth (terminator excluded) of the corresponding leaf. This strategy is described in Algorithm 2. In order to limit space usage, we use again a stack or a queue to store leaves and their string depth (note that each leaf takes bits to be represented): we use a queue when , and a stack otherwise. The queue is the same used by Beller et al.[5] and described in Section 4. This guarantees that the bit-size of the queue/stack never exceeds bits: since leaves take just bits to be represented and the stack’s size never contains more than leaves, the stack’s bit-size never exceeds when . Similarly, Beller et al’s queue always takes at most bits of space, which is for . Note that in Lines 18-21 we can afford storing temporarily the resulting intervals since, in this case, the alphabet’s size is small enough.

To sum up, our full procedure works as follows: (1), we output node-type LCP values using procedure Node-Type described in Algorithm 1, and (2) we output leaf-type LCP values using procedure Leaf-Type described in Algorithm 2.

1:if  then
2:      Run Beller et al.’s algorithm
3:else
4:      Initialize new stack
5:      Push representation of
6:     while  do
7:          Pop highest-priority element
8:          Number of children of ST node
9:         for  do
10:              output Output LCP value
11:         end for
12:          Follow Weiner Links
13:          Sort by interval length
14:         for  do
15:               Push representations
16:         end for
17:     end while
18:end if
Algorithm 1 Node-Type(BWT)
1:for  do
2:     output
3:end for
4:if  then
5:      Initialize new queue
6:else
7:      Initialize new stack
8:end if
9: Push range of terminator and LCP value 0
10:while  do
11:      Pop highest-priority element
12:     for  do
13:         output Output LCP inside range of ST leaf
14:     end for
15:     if  then
16:          Pairs interval,
17:     else
18:         
19:          Sort by interval length
20:         for  do
21:               Push in order of decreasing length
22:         end for
23:     end if
24:end while
Algorithm 2 Leaf-Type(BWT)

The correctness, completeness, and complexity of our procedure are proved in the following Lemma:

 
Lemma 3
.

Algorithms 1 and 2 correctly output all LCP pairs of the collection in time using bits of working space on top of the input BWT.

Proof.

Correctness - Algorithm 1. We start by proving that Beller et al.’s procedure in Line 2 of Algorithm 1 (procedure BGOS(BWT)) outputs all the node-type LCP entries correctly. The proof proceeds by induction on the LCP value and follows the original proof of [5]. At the beginning, we insert in the queue all -intervals, for . For each such interval we output . It is easy to see that after this step all and only the node-type LCP values equal to 0 have been correctly computed. Assume, by induction, that all node-type LCP values less than or equal to have been correctly output, and that we are about to extract from the queue the first triple having length . For each extracted triple with length associated to a string , consider the triple associated to one of its left-extensions . If has been computed, i.e. if , then we have nothing to do. However, if , then it must be the case that (i) the corresponding LCP value satisfies , since by induction we have already computed all node-type LCP values smaller than or equal to , and (ii) is of node-type, since otherwise the BWT interval of would also include position . On the other hand, it cannot be the case that since otherwise the -interval would include position . We therefore conclude that must hold.

Completeness - Algorithm 1. The above argument settles correctness; to prove completeness, assume that, at some point, and the value of to be computed and output is . We want to show that we will pull a triple from the queue corresponding to a string (note that and, moreover, could end with ) such that one of the left-extensions of satisfies , for some . This will show that, at some point, we will output the LCP pair . We proceed by induction on . Note that we separately output all LCP values equal to 0. The base case is easy: by the way we initialized the queue, , for all , are the first triples we pop. Since we left-extend these ranges with all alphabet’s characters except , it is easy to see that all LCP values equal to 1 have been output. From now on we can therefore assume that we are working on LCP values equal to , i.e. , for and . Let be the length- left-extension of such that . Since, by our initial hypothesis, , the collection contains also a suffix lexicographically larger than and such that . But then, it must be the case that (it cannot be smaller by the existence of and it cannot be larger since ). By inductive hypothesis, this value was set after popping a triple corresponding to string , left-extending with , and pushing in the queue. This ends the completeness proof since we showed that is in the queue, so at some point we will pop it, extend it with , and output . If the queue uses too much space, then Algorithm 1 switches to a stack and Lines 4-15 are executed instead of Line 2. Note that this pseudocode fragment corresponds to Belazzougui’s enumeration algorithm, except that now we also set LCP values in Line 10. By the enumeration procedure’s correctness, we have that, in Line 10, is the SA-range of a right-maximal string with , and is the first position of the SA-range of , with , where are all the (sorted) right-extensions of . Then, clearly each LCP value in Line 10 is of node-type and has value , since it is the LCP between two strings prefixed by and . Similarly, completeness of the procedure follows from the completeness of the enumeration algorithm. Let be of node-type. Consider the prefix of length of the -th suffix in the lexicographic ordering of all strings’ suffixes. Since , the -th suffix is of the form , with , and is right-maximal. But then, at some point our enumeration algorithm will visit the representation of , with . Since is the first position of the range of , we have that for some , and Line 10 correctly outputs the LCP pair .

Correctness and completeness - Algorithm 2. Proving correctness and completeness of this procedure is much easier. It is sufficient to note that the while loop iterates over all ranges of strings ending with and not containing anywhere else (note that we start from the range of and we proceed by recursively left-extending this range with symbols different than ). Then, for each such range we conclude that is equal to , i.e. the string depth of the corresponding string (excluding the final character ). By their definition, all leaf-type LCP values are correctly computed in this way.

Complexity - Algorithm 1. If , then we run Beller et al’s algorithm, which terminates in time and uses bits of additional working space. Otherwise, we perform a linear number of operations on the stack since, as observed in Section 3, the number of Weiner links is linear. By the same analysis of Section 3, the operation in Line 12 takes amortized time on wavelet trees, and sorting in Line 13 (using any comparison-sorting algorithm sorting integers in time) takes time. Note that in this sorting step we can afford storing in temporary space nodes since this takes additional space bits. All these operations sum up to time. Since the stack always takes at most bits and , the stack’s size never exceeds bits.

Complexity - Algorithm 2. Note that, in the while loop, we start from the interval of and recursively left-extend with characters different than until this is possible. It follows that we visit the intervals of all strings of the form such that does not appear inside . Since these intervals form a cover of , their number (and therefore the number of iterations in the while loop) is also bounded by . This is also the maximum number of operations performed on the queue/stack. Using Beller et al.’s implementation for the queue and a simple vector for the stack, each operation takes constant amortized time. Operating on the stack/queue takes therefore overall time. For each interval popped from the queue/stack, in Line 13 we output LCP values. As observed above, these intervals form a cover of and therefore Line 13 is executed no more than times. Line 18 takes time . Finally, in Line 19 we sort at most intervals. Using any fast comparison-based sorting algorithm, this costs overall at most time.

As far as the space usage of Algorithm 2 is concerned, note that we always push just pairs interval/length ( bits) in the queue/stack. If , we use Beller et al.’s queue, taking at most bits of space. Otherwise, the stack’s size never exceeds elements, with each element taking bits. This amounts to bits of space usage. Moreover, in Lines 18-19 it holds so we can afford storing temporarily all intervals returned by in bits. ∎

Combining Lemma 2 and Lemma 1, we obtain:

 
Theorem 3
.

Given the word-packed Burrows-Wheeler Transform of a collection of total length on alphabet , we can build the LCP array of the collection in time using bits of working space on top of the BWT.

6 Enumerating Suffix Tree Intervals

In this section we show that the procedures described in Section 5 can be used to enumerate all suffix tree intervals—that is, the suffix array intervals of all right-maximal text substrings—taking as input the BWT of a text. Note that in this section we consider just simple texts rather than string collections as later we will use this procedure to build the compressed suffix tree of a text.

When , we can directly use Belazzougui’s procedure (Theorem 1), which already solves the problem. For larger alphabets, we modify Beller et al’s procedure (Theorem 2) to also generate suffix tree’s intervals as follows.

When , we modify Beller et al.’s procedure to enumerate suffix tree intervals using bits of working space, as follows. We recall that (see Section 4), Beller et al’s procedure can be conveniently described using two separate queues: and . At each step, we pop from an element with and for some string , left-extend the range with all , obtaining the ranges and, only if , set , output the LCP pair , and push into . Note that, since we have that the -th and -th smallest suffixes start, respectively, with and for some , where . This implies that is right-maximal. It is also clear that, from the completeness of Beller et al.’s procedure, all right-maximal text substrings are visited by the procedure, since otherwise the LCP values equal to inside would not be generated. It follows that, in order to generate all suffix tree intervals once, we need two extra ingredients: (i) whenever we pop from an element corresponding to a string , we also need the range of , and (ii) we need to quickly check if a given range of a right-maximal substring has already been output. Point (ii) is necessary since, using only the above procedure (augmented with point (i)), will be output for each of its right-extensions (except the lexicographically largest, which does not cause the generation of an LCP pair).

Remember that, in order to keep space usage under control (i.e. bits), we represent as a standard queue of pairs if and only if . For now, let us assume that the queue size does not exceed this quantity (the other case will be considered later). In this case, to implement point (i) we simply augment queue pairs as , where for some . When left-extending with a character , we also left-extend with , obtaining . Let . At this point, if we do the following:

  1. we set ,

  2. we push in , and

  3. if has not already been generated, we output .

Note that steps (1) and (2) correspond to Beller et al.’s procedure. The test in step (3) (that is, point (ii) above) can be implemented as follows. Note that a suffix array range can be identified unambiguously by the two integers and . Note also that we generate suffix tree intervals in increasing order of string depth (i.e. when popping elements from , we output suffix array intervals of string depth ). It follows that we can keep a bitvector of length recording in whether or not the suffix array interval of the string of length whose first coordinate is has already been output. Each time we change the value of a bit