Closing in on Time and Space Optimal Construction of Compressed Indexes

12/13/2017 ∙ by Dominik Kempa, et al. ∙ Helsingin yliopisto

Fast and space-efficient construction of compressed indexes such as the compressed suffix array (CSA) and compressed suffix tree (CST) has been a major open problem until recently, when Belazzougui [STOC 2014] described an algorithm able to build both of these data structures in $O(n)$ (randomized; later improved by the same author to deterministic) time and $O(n/\log_\sigma n)$ words of space, where $n$ is the length of the string and $\sigma$ is the alphabet size. Shortly after, Munro et al. [SODA 2017] described another deterministic construction using the same time and space based on different techniques. It has remained an elusive open problem since then whether these bounds are optimal or, assuming non-wasteful text encoding, a construction achieving $O(n/\log_\sigma n)$ time and space is possible. In this paper we provide the first algorithm that can achieve these bounds. We show a deterministic algorithm that constructs CSA and CST using $O(n/\log_\sigma n + r \log^{11} n)$ time and $O(n/\log_\sigma n + r \log^{10} n)$ working space, where $r$ is the number of runs in the Burrows-Wheeler transform of the input text. As one of the applications of our techniques we show how to compute the LZ77 parsing in $O(n/\log_\sigma n + r \log^{11} n + z \log^{10} n)$ time and $O(n/\log_\sigma n + r \log^{9} n)$ space, which is optimal for highly repetitive strings.

1 Introduction

The invention of the FM-index [18, 17] and the compressed suffix array [26, 27] at the turn of the millennium has shaped the field of string algorithms for nearly two decades. For the first time, algorithms could use the full potential of powerful indexing structures in space asymptotically equal to that needed by the text, $n \log \sigma$ bits or $n/\log_\sigma n$ words, where $n$ is the text length and $\sigma$ is its alphabet size. Dozens of papers followed the two seminal papers, proposing various improvements, generalizations, and practical implementations (see [51, 52, 8] for excellent surveys). These indexes are now widespread, both in theory, where they provide "off-the-shelf" small-space indexing structures, and in practice (particularly bioinformatics), where they achieved massive success: to date, the FM-index-based read aligner "Bowtie" [46] has been cited over ten thousand times.

The other approach to indexing, recently gaining popularity due to the quick increase in the amount of highly repetitive data such as software repositories and genomic databases (see [24] for a recent survey), is designing indexes specialized for repetitive strings. The first such index [39] was based on the Lempel-Ziv (LZ77) parsing [59], one of the most popular compression algorithms (used, e.g., in gzip or 7-zip), which also has numerous applications outside of data compression such as detecting regularities [15, 28, 44], approximating the smallest grammar [55, 11], or the smallest string attractor [41]. Many improvements to the basic scheme have been proposed since then [22, 9, 6, 21, 3, 2, 1], and now the performance of LZ-based indexes is often on par with the FM-index or CSA [16]. In recent years, other compression paradigms have crossed over into the field of indexing. Most notably, Mäkinen et al. [47] have shown that applying run-length compression to FM-index and CSA yields excellent practical performance (see also [58]). The most recent addition to this family is the version of run-length compressed CSA (RLCSA) of Gagie et al. [24], who showed how to effectively store the samples necessary to report occurrences during pattern matching queries, which allowed them to achieve optimal pattern matching and substring extraction query times within space bounded in terms of $r$, the number of runs in the Burrows-Wheeler transform (BWT) [10] of the text. The value of $r$ (next to $z$, the size of the LZ77 parsing of the text) is a common measure of the repetitiveness of the text. In theory it allows achieving polynomial or even exponential compression ratio, and in practice it compresses similarly to LZ77 [24]. Furthermore, Mäkinen et al. [47] proved that in the model where several identical copies of a given text are subject to point mutations (such as indels), the value of $r$ is provably small under probabilistic assumptions matching the biological processes in the genome.

Although the literature on indexes is very rich, comparatively few results exist for efficient index construction. A gradual improvement [45, 29, 30] in the construction of the compressed suffix array culminated with the recent work of Belazzougui [5], who described a (randomized) linear-time construction with optimal working space of $O(n/\log_\sigma n)$ words. Shortly after, an alternative (and deterministic) construction was proposed by Munro et al. [50]. This solves the problem for large alphabets (i.e., when $\log \sigma = \Theta(\log n)$) but leaves open the existence of an algorithm running in $O(n/\log_\sigma n)$ time, which can be up to a factor of $\Theta(\log n)$ smaller for small alphabets.

Our contribution.

In this paper we propose the first algorithms constructing the Burrows-Wheeler transform (BWT), the permuted longest-common-prefix (PLCP) array, and the LZ77 parsing, in optimal time and space for highly repetitive strings. More precisely, all our construction algorithms achieve runtime and construction space of $O(n/\log_\sigma n + r\,\mathrm{polylog}\, n)$, where $r$ is the number of runs in the BWT of the input text.

Virtually every index developed during the last 20 years is based on one or more of these components: all variants of the (run-length compressed) FM-index rely on BWT, and (run-length) compressed suffix arrays/trees rely on $\Psi$ (which is easily derived from BWT). These indexes also often include the PLCP array [57]. On the other hand, LZ77-based and grammar-based indexes rely on the computation of the LZ77 parsing.

Our results have particularly important implications for bioinformatics, where most of the data is highly repetitive and over a small (DNA) alphabet. Furthermore, our results and techniques imply a time- and space-optimal solution for a range of fundamental problems arising in sequence analysis and data compression:

  • We show how to compute the Lyndon factorization [12] (also known as Standard factorization), a fundamental tool in string combinatorics [4]. More precisely, for this problem we obtain an algorithm running in time and space (this includes the output factorization). Since  [20], the above is indeed of the form .

  • We then show how to construct the RLCSA of Gagie et al. [24]. More precisely we demonstrate how to construct a data structure that grants a -time random-access to (any segment of consecutive values of) suffix array in time and working space. We also show that our construction algorithm can be generalized to other arrays (e.g., ISA, LCP, PLCP, ).

  • Lastly, we show time- and space-optimal solutions to a number of “textbook” problems. As two typical problems we consider computing the longest substring occurring at least times, and the problem of computing the number of distinct substrings of the input text. For example, for the latter we obtain an algorithm running in time and working space.

Note that all the algorithms are time- and space-optimal (i.e., they achieve $O(n/\log_\sigma n)$ running time and space usage) under the assumption that the input text is compressible by a polylogarithmic factor, i.e., that $n/r$ is at least polylogarithmic in $n$. Since $r$ can be polynomially or even exponentially smaller than $n$, our assumption on the required (polylogarithmic) compression ratio of the input text is very weak. Furthermore, we remark that for the sake of presentation, we have not optimized the exponents in the polylogarithmic terms.

2 Preliminaries

We assume a word-RAM model with a word of $\Theta(\log n)$ bits and with all usual arithmetic and logic operations taking constant time. Unless explicitly specified otherwise, all space complexities are given in words. All our algorithms are deterministic.

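Throughout we consider a string $T[1..n]$ of $n$ symbols drawn from an integer alphabet of size $\sigma$. For $i \in [1..n]$, we write $T[i..n]$ to denote the suffix of $T$ starting at position $i$, that is $T[i..n] = T[i]T[i+1]\cdots T[n]$. We will often refer to suffix $T[i..n]$ simply as “suffix $i$”. We define the rotation of $T$ as a string $T[i..n]T[1..i-1]$, for some $i \in [1..n]$.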

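The suffix array [48, 25] SA of $T$ is an array $\mathrm{SA}[1..n]$ which contains a permutation of the integers $[1..n]$ such that $T[\mathrm{SA}[1]..n] \prec T[\mathrm{SA}[2]..n] \prec \cdots \prec T[\mathrm{SA}[n]..n]$, where $\prec$ denotes the lexicographical order. In other words, $\mathrm{SA}[j] = i$ iff $T[i..n]$ is the $j$-th suffix of $T$ in ascending lexicographical order. The inverse suffix array ISA is the inverse permutation of SA, that is $\mathrm{ISA}[i] = j$ iff $\mathrm{SA}[j] = i$. Conceptually, $\mathrm{ISA}[i]$ tells us the position of suffix $i$ in SA. The array $\Phi$ (see [34]) is defined by $\Phi[i] = \mathrm{SA}[\mathrm{ISA}[i]-1]$, that is, the suffix $\Phi[i]$ is the immediate lexicographical predecessor of the suffix $i$.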
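As a concrete illustration of these definitions, the following Python sketch builds SA, ISA and $\Phi$ for a small string by naive sorting. It is not taken from the paper: it uses 0-based indexing (the arrays in the text are 1-based), the function names are hypothetical, and the quadratic-time sort is chosen for readability only.

```python
def suffix_array(text):
    """Naive suffix array: sort suffix start positions by the suffix they start."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def inverse_suffix_array(sa):
    """ISA[i] is the lexicographic rank of suffix i, i.e. the inverse permutation of SA."""
    isa = [0] * len(sa)
    for rank, pos in enumerate(sa):
        isa[pos] = rank
    return isa


def phi_array(sa):
    """phi[SA[r]] = SA[r-1]; left as None for the lexicographically smallest suffix."""
    phi = [None] * len(sa)
    for rank in range(1, len(sa)):
        phi[sa[rank]] = sa[rank - 1]
    return phi


text = "banana"
sa = suffix_array(text)
isa = inverse_suffix_array(sa)
phi = phi_array(sa)
print(sa, isa, phi)
```

For "banana" the sorted suffixes are a, ana, anana, banana, na, nana, so suffix 3 ("ana") has suffix 5 ("a") as its immediate predecessor.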

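Let $\mathrm{lcp}(i, j)$ denote the length of the longest-common-prefix (LCP) of suffix $i$ and suffix $j$. The longest-common-prefix array [48, 40], $\mathrm{LCP}[1..n]$, is defined such that $\mathrm{LCP}[i] = \mathrm{lcp}(\mathrm{SA}[i], \mathrm{SA}[i-1])$ for $i \in [2..n]$ and $\mathrm{LCP}[1] = 0$. The permuted LCP array [34] $\mathrm{PLCP}[1..n]$ is the LCP array permuted from the lexicographical order into the text order, i.e., $\mathrm{PLCP}[\mathrm{SA}[i]] = \mathrm{LCP}[i]$ for $i \in [1..n]$. Then $\mathrm{PLCP}[i] \geq \mathrm{PLCP}[i-1] - 1$ for all $i \in [2..n]$.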

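The succinct PLCP array [56, 34] $\mathrm{PLCP}_{\mathrm{succ}}[1..2n]$ represents the PLCP array using $2n$ bits. Specifically, $\mathrm{PLCP}_{\mathrm{succ}}[j] = 1$ if $j = 2i + \mathrm{PLCP}[i]$ for some $i \in [1..n]$, and $\mathrm{PLCP}_{\mathrm{succ}}[j] = 0$ otherwise. Notice that the value $2i + \mathrm{PLCP}[i]$ is unique for each $i$. Any lcp value can be recovered by the equation $\mathrm{PLCP}[i] = \mathrm{select}(i) - 2i$, where $\mathrm{select}(i)$ returns the location of the $i$-th 1-bit in $\mathrm{PLCP}_{\mathrm{succ}}$. The select query can be answered in $O(1)$ time given a precomputed data structure of $o(n)$ bits [13, 49].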
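The encoding above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions rather than the paper's implementation: indexing is 0-based (so bit $2i + \mathrm{PLCP}[i]$ is set for $i = 0, \ldots, n-1$), the select query is a linear scan instead of a constant-time structure, and all function names are hypothetical.

```python
def lcp_length(text, i, j):
    """Length of the longest common prefix of suffixes i and j."""
    k = 0
    while max(i, j) + k < len(text) and text[i + k] == text[j + k]:
        k += 1
    return k


def plcp_array(text):
    """PLCP[SA[r]] = lcp(SA[r], SA[r-1]), computed naively."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    plcp = [0] * len(text)
    for rank in range(1, len(sa)):
        plcp[sa[rank]] = lcp_length(text, sa[rank], sa[rank - 1])
    return plcp


def plcp_bitvector(plcp):
    """Set bit 2i + PLCP[i] for every text position i (0-based)."""
    bits = [0] * (2 * len(plcp))
    for i, value in enumerate(plcp):
        bits[2 * i + value] = 1
    return bits


def select1(bits, k):
    """Position of the k-th 1-bit, counting k from 0 (linear scan for clarity)."""
    seen = -1
    for pos, bit in enumerate(bits):
        if bit:
            seen += 1
            if seen == k:
                return pos
    raise IndexError("fewer than k+1 one-bits")


plcp = plcp_array("banana")
bits = plcp_bitvector(plcp)
recovered = [select1(bits, i) - 2 * i for i in range(len(plcp))]
print(plcp, bits, recovered)
```

Because $\mathrm{PLCP}[i] \geq \mathrm{PLCP}[i-1] - 1$, the positions $2i + \mathrm{PLCP}[i]$ are strictly increasing in $i$, which is why the $i$-th 1-bit identifies the $i$-th value.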

The Burrows–Wheeler transform [10] $\mathrm{BWT}[1..n]$ of $T$ is defined by $\mathrm{BWT}[i] = T[\mathrm{SA}[i]-1]$ if $\mathrm{SA}[i] > 1$ and otherwise $\mathrm{BWT}[i] = \$$, where $\$$ is a special symbol that does not appear elsewhere in the text, smaller than all other symbols in the text. Let $\mathcal{M}$ denote the $n \times n$ matrix whose rows are all the rotations of $T$ in lexicographical order. We denote the rows by $\mathcal{M}[i]$, $i \in [1..n]$. The alternative and equivalent definition of BWT is to define BWT as the last column of $\mathcal{M}$.
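The two definitions can be compared directly on a small input. The sketch below is illustrative only and makes one simplifying assumption not spelled out in the text: the sentinel `$` is appended to the string itself, so that sorting rotations and sorting suffixes give the same order. Function names are hypothetical.

```python
from itertools import groupby


def bwt_from_suffix_array(text):
    """BWT[r] is the symbol preceding the r-th smallest suffix.

    `text` is assumed to end with a unique, smallest sentinel, so text[-1]
    supplies the sentinel for the suffix starting at position 0.
    """
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    return "".join(text[i - 1] for i in sa)


def bwt_from_rotations(text):
    """Last column of the matrix of lexicographically sorted rotations."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(row[-1] for row in rotations)


def run_length_encode(s):
    """Collapse equal-letter runs into (symbol, length) pairs."""
    return [(symbol, len(list(group))) for symbol, group in groupby(s)]


bwt = bwt_from_suffix_array("banana$")
runs = run_length_encode(bwt)
print(bwt, runs)
```

The number of pairs returned by `run_length_encode` is the quantity called $r$ throughout the paper.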

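We define $\mathrm{LF}[i] = j$ iff $\mathrm{SA}[j] = \mathrm{SA}[i] - 1$, except when $\mathrm{SA}[i] = 1$, in which case $\mathrm{LF}[i] = \mathrm{ISA}[n]$. By $\Psi$ we denote the inverse of LF. Let $C[c]$, for symbol $c$, be the number of symbols in $T$ lexicographically smaller than $c$. The function $\mathrm{rank}(S, c, i)$, for string $S$, symbol $c$, and integer $i$, returns the number of occurrences of $c$ in $S[1..i]$. Whenever clear from the context we will omit the string $S$ from the list of arguments to the rank function. It is well known that $\mathrm{LF}[i] = C[\mathrm{BWT}[i]] + \mathrm{rank}(\mathrm{BWT}, \mathrm{BWT}[i], i)$. From the formula for LF we obtain the following fact.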

Lemma 1

Assume that $\mathrm{BWT}[b..e]$ is a run of the same symbol and let $i \in [b..e]$. Then, $\mathrm{LF}[i] = \mathrm{LF}[b] + (i - b)$.

If $y$ is the rank (i.e., the number of smaller suffixes) of string $X$ among all suffixes of $T$, then $C[c] + \mathrm{rank}(\mathrm{BWT}, c, y)$ is the rank of string $cX$ among the suffixes of $T$. This is called backward search [17].
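The LF formula and a backward-search step can be checked on a toy example. The following sketch assumes 0-based indexing and a sentinel-terminated text, so `rank(bwt, c, i)` counts occurrences in `bwt[0:i]` (exclusive of position `i`); with these conventions the formulas above carry over unchanged. All names are hypothetical and the rank query is a plain scan.

```python
def bwt_and_sa(text):
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    return "".join(text[i - 1] for i in sa), sa


def c_array(bwt):
    """C[c] = number of symbols in the text that are smaller than c."""
    counts = {}
    for ch in bwt:
        counts[ch] = counts.get(ch, 0) + 1
    c, total = {}, 0
    for ch in sorted(counts):
        c[ch] = total
        total += counts[ch]
    return c


def rank(bwt, ch, i):
    """Occurrences of ch in bwt[0:i] (naive scan)."""
    return bwt[:i].count(ch)


def lf(bwt, c, i):
    return c[bwt[i]] + rank(bwt, bwt[i], i)


def count_occurrences(bwt, c, pattern):
    """Backward search: maintain the rank interval [lo, hi) of suffixes prefixed by
    the already processed part of the pattern, prepending one symbol at a time."""
    lo, hi = 0, len(bwt)
    for ch in reversed(pattern):
        if ch not in c:
            return 0
        lo = c[ch] + rank(bwt, ch, lo)
        hi = c[ch] + rank(bwt, ch, hi)
        if lo >= hi:
            return 0
    return hi - lo


text = "banana$"
bwt, sa = bwt_and_sa(text)
c = c_array(bwt)
lf_values = [lf(bwt, c, i) for i in range(len(bwt))]
print(lf_values, count_occurrences(bwt, c, "ana"))
```

Each update of `lo` or `hi` is exactly one application of the rule "rank of $cX$ equals $C[c] + \mathrm{rank}(\mathrm{BWT}, c, y)$".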

We say that an lcp value $\mathrm{LCP}[i] = \mathrm{PLCP}[\mathrm{SA}[i]]$ is reducible if $\mathrm{BWT}[i] = \mathrm{BWT}[i-1]$ and irreducible otherwise. The significance of reducibility is summarized in the following two lemmas.

Lemma 2 ([34])

If $\mathrm{PLCP}[i]$ is reducible, then $\mathrm{PLCP}[i] = \mathrm{PLCP}[i-1] - 1$ and $\Phi[i] = \Phi[i-1] + 1$.

Lemma 3 ([34, 33])

The sum of all irreducible lcp values is at most $n \log n$.

By $r$ we denote the number of equal-letter runs in BWT. By the above, $r$ is also the number of irreducible lcp values. It is well known that for highly repetitive strings, the value of $r$ is notably smaller than $n$.

3 Augmenting RLBWT

In this section we present a number of extensions of run-length compressed BWT that we will use later. Each extension expands the functionality of RLBWT while maintaining small space usage and, crucially, admits a low construction time and working space.

3.1 Rank and select support

One of the basic operations we will need during construction are rank and select queries on BWT. We now show that a run-length compressed BWT (potentially over large alphabet) can be quickly augmented with a data structure capable of answering these queries in BWT-runs space.

Theorem 3.1

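Let $T$ be a string over alphabet of size $\sigma$ such that BWT of $T$ contains $r$ runs. Given RLBWT of $T$ we can add $O(r)$ space so that, given any position $i$ and any symbol $c$, values $\mathrm{rank}(c, i)$ and $\mathrm{select}(c, i)$ on BWT can each be computed in $O(\log r)$ time. The data structure can be constructed in $O(r \log r)$ time using $O(r)$ working space.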

Proof

Assume that the input RLBWT contains the starting position and the symbol associated with each run. We start by augmenting each run with its length. We then sort all BWT-runs using the head-symbol as the primary key, and the start of the run in the BWT as the secondary key. We then augment each run of symbol $c$ starting at position $b$ with the value $\mathrm{rank}(c, b-1)$, i.e., the total length of all runs of $c$ preceding it in the sorted order. Using the list, both queries can be easily answered in $O(\log r)$ time using binary search. The list and all the auxiliary information can be computed in $O(r \log r)$ time and $O(r)$ extra space. ∎
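The construction in this proof can be sketched as follows. This is a minimal illustration under assumptions of our own: runs are given as `(start, symbol, length)` triples with 0-based starts, `rank(c, i)` counts occurrences in `BWT[0:i]`, `select(c, j)` returns the position of the `j`-th occurrence counting from 0, and the class and method names are hypothetical.

```python
from bisect import bisect_left, bisect_right
from itertools import groupby


def make_runs(bwt):
    """Run-length encode a BWT string into (start, symbol, length) triples."""
    runs, pos = [], 0
    for symbol, group in groupby(bwt):
        length = len(list(group))
        runs.append((pos, symbol, length))
        pos += length
    return runs


class RunRankSelect:
    def __init__(self, runs):
        # Sort by head symbol first, then by start position in the BWT.
        ordered = sorted(runs, key=lambda run: (run[1], run[0]))
        self.starts = [start for start, _, _ in ordered]
        self.lengths = [length for _, _, length in ordered]
        self.by_start, self.by_count, self.before = [], [], []
        prev, total = None, 0
        for start, symbol, length in ordered:
            if symbol != prev:
                prev, total = symbol, 0
            self.by_start.append((symbol, start))
            self.by_count.append((symbol, total))
            self.before.append(total)  # length of earlier runs of the same symbol
            total += length

    def rank(self, c, i):
        """Number of occurrences of c in BWT[0:i], via one binary search."""
        k = bisect_left(self.by_start, (c, i)) - 1
        if k < 0 or self.by_start[k][0] != c:
            return 0
        return self.before[k] + min(self.lengths[k], i - self.starts[k])

    def select(self, c, j):
        """Position of the j-th occurrence of c (j counted from 0)."""
        k = bisect_right(self.by_count, (c, j)) - 1
        if k < 0 or self.by_count[k][0] != c or j - self.before[k] >= self.lengths[k]:
            raise IndexError("symbol has fewer than j+1 occurrences")
        return self.starts[k] + j - self.before[k]


bwt = "annb$aa"
index = RunRankSelect(make_runs(bwt))
print(index.rank("a", 6), index.select("a", 2))
```

Both queries touch only the sorted list of runs, so the space stays proportional to the number of runs rather than to the length of the BWT.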

3.2 LF/Ψ and backward search support

We now show that with the help of the above rank/select data structures we can support more complicated navigational queries, namely, given any $i \in [1..n]$ such that $\mathrm{SA}[i] = j$ we can compute $\mathrm{ISA}[j-1]$ (i.e., $\mathrm{LF}[i]$) and $\mathrm{ISA}[j+1]$ (i.e., $\Psi[i]$). Note that none of the queries will require the knowledge of $j$. As a consequence of the previous section we also obtain an efficient support for backward search on RLBWT.

Theorem 3.2

Let $T$ be a string over alphabet of size $\sigma$ such that BWT of $T$ contains $r$ runs. Given RLBWT of $T$ we can add $O(r)$ space so that, given any $i \in [1..n]$, values $\mathrm{LF}[i]$ and $\Psi[i]$ can each be computed in $O(\log r)$ time. The data structure can be constructed in $O(r \log r)$ time using $O(r)$ working space.

Proof

To support LF-queries it suffices to sample the LF-value for the first position of each run. Then to compute $\mathrm{LF}[i]$ we first binary-search in $O(\log r)$ time the list of BWT-runs for the run containing position $i$ and then return the answer using Lemma 1. To compute the samples we first create a sorted list of characters occurring in $T$ by sorting the head-symbols of all runs and removing duplicates. We then perform two scans of the runs, the first to compute the frequency of all symbols (on which we then perform the prefix sum), and the second in which we compute LF-values for run-heads. The total time spent is $O(r \log r)$ and the working space does not exceed $O(r)$ words.

To support $\Psi$-queries we first need a list containing, for each symbol $c$ occurring in $T$, the total frequency of all symbols smaller than $c$. This list can be computed as above in $O(r \log r)$ time. At query time it allows determining in $O(\log r)$ time the symbol $c$ following $\mathrm{BWT}[i]$ in text and the number $k$ such that this is the $k$-th occurrence of $c$ in the first column of BWT-matrix. To compute $\Psi[i]$ it then remains to find the $k$-th occurrence of $c$ in BWT. Thus, applying Theorem 3.1 finalizes the construction. ∎
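The LF part of this proof (sampling LF at run heads and extending inside a run via Lemma 1) can be sketched as below. The sketch assumes 0-based positions and runs given as `(start, symbol, length)` triples in BWT order; the helper names are hypothetical and not from the paper.

```python
from bisect import bisect_right
from itertools import groupby


def make_runs(bwt):
    runs, pos = [], 0
    for symbol, group in groupby(bwt):
        length = len(list(group))
        runs.append((pos, symbol, length))
        pos += length
    return runs


def build_lf_samples(runs):
    """Two scans over the runs: symbol frequencies first, then LF at run heads."""
    freq = {}
    for _, symbol, length in runs:
        freq[symbol] = freq.get(symbol, 0) + length
    smaller, total = {}, 0          # prefix sums over the sorted alphabet
    for symbol in sorted(freq):
        smaller[symbol] = total
        total += freq[symbol]
    seen = dict.fromkeys(freq, 0)   # occurrences of each symbol met so far
    starts, heads = [], []
    for start, symbol, length in runs:
        starts.append(start)
        heads.append(smaller[symbol] + seen[symbol])
        seen[symbol] += length
    return starts, heads


def lf_query(starts, heads, i):
    """Binary-search the run containing i, then apply LF[i] = LF[b] + (i - b)."""
    k = bisect_right(starts, i) - 1
    return heads[k] + (i - starts[k])


bwt = "annb$aa"
starts, heads = build_lf_samples(make_runs(bwt))
lf_values = [lf_query(starts, heads, i) for i in range(len(bwt))]
print(heads, lf_values)
```

Only one sampled value per run is stored, which is what keeps the structure within space proportional to the number of runs.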

Theorem 3.3

Let $T$ be a string over alphabet of size $\sigma$ such that BWT of $T$ contains $r$ runs. Given RLBWT of $T$ we can add $O(r)$ space so that, given a rank $y$ (i.e., a number of smaller suffixes) of some string $X$ among the suffixes of $T$, for any symbol $c$ we can compute in $O(\log r)$ time the rank of string $cX$. The data structure can be constructed in $O(r \log r)$ time using $O(r)$ working space.

Proof

First we augment RLBWT with rank-support using Theorem 3.1. It then remains to build a data structure that allows us to determine, for any symbol $c$, the total number of occurrences of symbols smaller than $c$ in $T$. To achieve this we construct a sorted list (similarly to Theorem 3.2) of symbols occurring in $T$, together with their counts, and perform a prefix-sum on the counts. The construction altogether takes $O(r \log r)$ time and $O(r)$ extra space, and enables performing a step of the backward search in $O(\log r)$ time. ∎

3.3 Suffix-rank support

In this subsection we describe a non-trivial extension of RLBWT that will allow us to efficiently merge two RLBWTs during the BWT construction algorithm. We start by defining a generalization of BWT-runs and stating their basic properties.

Let denote the length of the longest common suffix of strings and . We define the array [37] as for and , where is the BWT-matrix (i.e., a matrix containing sorted rotations of string ). Let be an integer. We say that a range of BWT is a -run if , , and for any , . By this definition, a BWT run is a 1-run. Let and . Then, is exactly the set of starting positions of -runs.

Lemma 4 ([37])

For any ,

Since is the inverse of LF we obtain that for any , . Thus, the set can be efficiently computed by iterating each of the starting positions of BWT-runs times using and taking a union of all visited positions. From the above we see that , which implies that the number of -runs satisfies .
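The procedure just described (start from the BWT-run heads, apply the inverse of LF repeatedly, and take the union of visited positions) can be tried on small inputs. The sketch below is an illustration under assumptions of our own reading of the stripped definitions: the LCS array entry for a row is the longest common suffix of that row and the previous row of the sorted-rotation matrix (0 for the first row), and the starting positions of $k$-runs are exactly the rows whose LCS entry is smaller than $k$. Indexing is 0-based, the text ends with a unique sentinel, and all function names are hypothetical.

```python
def sorted_rotations(text):
    return sorted(text[i:] + text[:i] for i in range(len(text)))


def common_suffix_length(x, y):
    k = 0
    while k < len(x) and x[-1 - k] == y[-1 - k]:
        k += 1
    return k


def lcs_array(rows):
    """LCS[i] = longest common suffix of rows i and i-1; LCS[0] = 0."""
    return [0] + [common_suffix_length(rows[i], rows[i - 1]) for i in range(1, len(rows))]


def k_run_starts_by_definition(lcs, k):
    return {i for i, value in enumerate(lcs) if value < k}


def k_run_starts_by_psi(bwt, k):
    """Union of the BWT-run heads and their images under 1, ..., k-1 steps of Psi."""
    n = len(bwt)
    # Stable sort of BWT positions by symbol yields the inverse of LF directly.
    psi = sorted(range(n), key=lambda i: (bwt[i], i))
    frontier = {i for i in range(n) if i == 0 or bwt[i] != bwt[i - 1]}
    starts = set(frontier)
    for _ in range(k - 1):
        frontier = {psi[i] for i in frontier}
        starts |= frontier
    return starts


rows = sorted_rotations("banana$")
bwt = "".join(row[-1] for row in rows)
lcs = lcs_array(rows)
print(lcs, sorted(k_run_starts_by_psi(bwt, 2)))
```

Since each of the $k$ rounds contributes at most one position per BWT-run, the union has at most $rk$ elements, which matches the bound on the number of $k$-runs stated above.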

Theorem 3.4

Let , be two strings over alphabet for such that BWT of (resp. ) contains (resp. ) runs. Given RLBWTs of and it is possible, for any parameter , to build a data structure of size that can, given a rank (i.e., the number of smaller suffixes) of some suffix among suffixes of , compute the rank of among suffixes of in time. The data structure can be constructed in time and working space.

Proof

We start by augmenting both RLBWTs with and LF support (Theorem 3.2). We furthermore augment RLBWT of with the backward-search support (Theorem 3.3). This requires time and space.

We then compute a (sorted) set of starting positions of -runs for both RLBWTs. For this requires answering -queries which takes time in total, and then sorting the resulting set of positions in time. Analogous processing for takes time. The starting positions of all -runs require space in total.

Next, for any -run we compute and store the associated symbols. Furthermore we also store the value , but only for -runs of . Due to simple generalization of Lemma 1, this will allow us to compute the value for any . In total this requires answering LF-queries and hence takes time. The space needed to store all symbols is .

We then lexicographically sort all length- strings associated with -runs (henceforth called -substrings) and assign to each run the rank of the associated substring in the sorted order. Importantly, -substrings of and are sorted together. These ranks will be used as order-preserving names for -substrings. We use an LSD-radix sort with a stable comparison-based sort for each digit hence the sorting takes time. The working space does not exceed . After the names are computed, we discard the substrings.

We now observe that order-preserving names for -substrings allow us to perform backward-search symbols at a time. We build a rank-support data structure analogous to the one from Theorem 3.1 for names of -substrings of . We also add support for computing the total number of occurrences of names smaller than a given name. This takes time and space. Then, given a rank of suffix among suffixes of , we can compute the rank of suffix among suffixes of in time by backwards-search on using as a position, and the name of -substring preceding as a symbol.

We now use the above multi-symbol backward search to compute the rank of every suffix of the form among suffixes of . We start from the shortest suffix and increase the length by in every step. During the computation we also maintain the rank of the current suffix of among suffixes of . This allows us to efficiently compute the name of the preceding -substring. The rank can be updated using values stored with each -run of . Thus, for each of the suffixes of we obtain a pair of integers (, ), denoting its rank among the suffixes of and . We store these pairs as a list sorted by . Computing the list takes time. After the list is computed we discard all data structures associated with -runs.

Using the above list of ranks we can answer the query from the claim as follows. Starting with , we compute a sequence of positions in the BWT of by iterating on . For each position we can check in time whether that position is in the list of ranks. Because we evenly sampled text positions, one of these positions has to correspond to the suffix of for which we computed the rank in the previous step. Suppose we found such position after steps, i.e., we now have a pair (, ) such that is the rank of among suffixes of . We then perform steps of the standard backward search starting from rank in the BWT of using symbols , …, . This takes time. ∎

4 Constructing BWT

In this section we show that given the non-wasteful encoding of text over alphabet of size using words of space, we can compute the non-wasteful encoding of BWT of in time and space, where is the number of equal-letter runs in the BWT of .

4.1 Algorithm overview

The basic scheme of our algorithm follows the BWT construction algorithm of Hon et al. [30] but we perform the main steps differently. Assume for simplicity that for some integer . The algorithm works in rounds, where . In the -th round, , we interpret the string as a string over superalphabet of size , i.e., we group symbols of the original string into supersymbols consisting of original symbols. We denote this string as . Our working encoding of BWT will be run-length encoding. The input to the -th round, , is the run-length compressed BWT of , and the output is the run-length compressed BWT of . The rounds are executed in decreasing order of . The final output is the run-length compressed BWT of , which we then convert into non-wasteful encoding taking words of space.

For the -th round, we observe that and hence to compute BWT of it suffices to first run any of the linear-time algorithms for constructing the suffix array [35, 54, 43, 42] for and then naively compute the RLBWT from the suffix array. This takes time and space.

Let for some and suppose we are given the RLBWT of . Let be the string of length created by grouping together symbols for all , and let be the analogously constructed string for pairs . Clearly we have . Furthermore, it is easy to see that the BWT of can be obtained by merging BWTs of and , and discarding (more significant) half of the bits in the encoding of each symbol.

The construction of RLBWT for consists of two steps: (1) first we compute the RLBWT of from RLBWT of , and then (2) merge RLBWT of and into RLBWT of .

4.2 Computing BWT of

In this section we assume that for some and that we are given the RLBWT of of size . Denote the size of RLBWT of by . We will show that RLBWT of can be computed in time using working space.

Recall that both and are over alphabet . Each of the symbols in that alphabet can be interpreted as a concatenation of two symbols in the alphabet . Let be the symbol of either or and assume that for some . By major subsymbol of we denote a symbol (equal to ) from encoded by the more significant half of bits encoding , and by minor subsymbol we denote symbol encoded by remaining bits (equal to ).

We first observe that by enumerating all runs of the RLBWT of in increasing order of their minor subsymbols (and in case of ties, in the increasing order of run beginnings), we obtain (on the remaining bits) the minor subsymbols of the BWT of in correct order. Such enumeration could easily be done in time and working space. To obtain the missing (major) part of the encoding of symbols in the BWT of , it suffices to perform the LF-step for each of the runs in the BWT of in the sorted order above (i.e., by minor subsymbol), and look up the minor subsymbols in the resulting range of BWT of .

The problem with the above approach is the running time. While it indeed produces correct RLBWT of , having to scan all runs in the range of BWT of obtained by performing the LF-step on each of the runs of could be prohibitively high. To address this we first construct a run-length compressed sequence of minor subsymbols extracted from BWT of and use it to extract minor subsymbols of BWT of in total time proportional to the number of runs in the BWT of .

Theorem 4.1

Given RLBWT of of size we can compute RLBWT of in time and working space, where is the size of RLBWT of .

Proof

The whole process requires scanning the BWT of to create a run-length compressed encoding of minor subsymbols, adding the LF support to (the original) RLBWT of , sorting the runs in RLBWT of by the minor subsymbol, and executing LF-queries on the BWT of , which altogether takes . All other operations take time proportional to . The space never exceeds . ∎

4.3 Merging BWTs of and

As in the previous section, we assume that for some and that we are given the RLBWT of of size and RLBWT of of size . We will show how to use these to efficiently compute the RLBWT of in time and space.

We start by observing that to obtain BWT of it suffices to merge the BWT of and BWT of and discard all major subsymbols in the resulting sequence. The original algorithm of Hon et al. [30] achieves this by performing backward search. This requires time and hence is too expensive in our case.

Instead we employ the following observation. Suppose we have already computed the first runs of the BWT of and let the next unmerged character in the BWT of be a part of a run of symbol . Let be the analogous symbol from the BWT of . Further, let (resp. ) be the minor subsymbol of (resp. ). If then either all symbols in the current run in the BWT of (restricted to minor subsymbols) or all symbols in the current run in the (also restricted) BWT of will belong to the next run in the BWT of . Assuming we can determine the order between any two arbitrary suffixes of and given their ranks in the respective BWTs, we could consider both cases and in each perform a binary search to find the exact length of -th run in the BWT of . We first locate the end of the run of (resp. ) in the BWT of (resp. ) restricted to minor subsymbols; this can be done after preprocessing input BWTs without increasing the time/space of the merging. We then find the largest suffix of (resp. ) not greater than the suffix at the end of the run in the BWT of . Importantly, the time to compute the next run in the BWT of does not depend on the number of times the suffixes in that run alternate between and . The case is handled similarly, except we do not need to locate the end of each run. The key property of this algorithm is that the number of pattern searches is .

Thus, the merging problem can be reduced to the problem of efficient comparison of suffixes of and . To achieve that we augment both RLBWTs of and with the suffix-rank support data structure from Section 3.3. This will allow us to determine, given a rank of any suffix of , the number of smaller suffixes of and vice-versa, thus eliminating even the need for binary search. Our aim is to achieve space and construction time, thus we apply Theorem 3.4 with .

Theorem 4.2

Given the RLBWT of of size and the RLBWT of of size we can compute the RLBWT of in time and using working space, where is the size of the output RLBWT of .

Proof

Constructing the suffix-rank support for and with takes time and working space. The resulting data structures occupy space and answer suffix-rank queries in time. To compute the RLBWT of we perform suffix-rank queries for a total of time. ∎

4.4 Putting things together

To bound the size of RLBWTs in intermediate rounds, consider the -th round where for we group each symbols of to obtain the string of length and let be the number of runs in the BWT of . Recall now the construction of generalized BWT-runs from Section 3.3 and observe that the symbols of comprising each supersymbol are the same as the substring corresponding to -run containing suffix in the BWT of . It is easy to see that the corresponding suffixes of are in the same lexicographic order as the suffixes of . Thus, is bounded by the number of -runs in the BWT of , which by Section 3.3 is bounded by . Hence, the size of the output RLBWT of the -th round does not exceed . The analogous analysis shows that the size of RLBWT of has the same upper bound as .

Theorem 4.3

Given a non-wasteful encoding of string over alphabet of size using words, the non-wasteful encoding of BWT of can be computed in time and working space, where is the number of runs in the BWT of .

Proof

The -th round of the algorithm takes time and working space and produces a BWT of taking words of space. Consider the -th round of the algorithm for and let , and and denote the sizes of RLBWT of and respectively. By the above discussion we have . Thus, by Theorem 4.1 and Theorem 4.2 the -th round takes time and the working space does not exceed words, where , and we used the fact that for , . Hence over all rounds we spend time and never use more than space. Finally, it is easy to convert RLBWT into the non-wasteful encoding in time. ∎

Thus, we obtained a time- and space-optimal construction algorithm for BWT under a weak assumption on repetitiveness of the input: .

5 Constructing PLCP

In this section we show that given the run-length compressed representation of BWT of it is possible to compute the bitvector in time and working space. The resulting bitvector takes bits, or words of space. This result complements the result from the previous section. After we construct RLBWT, before converting to non-wasteful encoding, we can use it to construct the PLCP bitvector – another component of CSA and CST.

The key observation used to construct the PLCP values is that it suffices to only compute the irreducible LCP values. Then, by Lemma 2, all other values can be quickly deduced. This significantly simplifies the problem because it is known (Lemma 3) that the sum of irreducible LCP values is bounded by .
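The observation can be illustrated on a toy input: compute lcp values only at irreducible positions and fill in every other PLCP entry with the rule $\mathrm{PLCP}[i] = \mathrm{PLCP}[i-1] - 1$ from Lemma 2. The sketch below is not the paper's algorithm (irreducible values are found by naive comparison, not via names of runs); it assumes 0-based indexing, a sentinel-terminated text, and hypothetical function names.

```python
def lcp_length(text, i, j):
    k = 0
    while max(i, j) + k < len(text) and text[i + k] == text[j + k]:
        k += 1
    return k


def plcp_via_irreducible(text):
    """Compute lcp values only where the preceding symbols differ, then propagate."""
    n = len(text)
    sa = sorted(range(n), key=lambda i: text[i:])
    irreducible = {}
    for rank in range(n):
        i = sa[rank]
        if rank == 0:
            irreducible[i] = 0                      # smallest suffix: no predecessor
        elif text[i - 1] != text[sa[rank - 1] - 1]:  # BWT symbols differ (text[-1] is the sentinel)
            irreducible[i] = lcp_length(text, i, sa[rank - 1])
    plcp = []
    for i in range(n):
        if i in irreducible:
            plcp.append(irreducible[i])
        else:
            plcp.append(plcp[i - 1] - 1)            # reducible: Lemma 2
    return plcp, irreducible


plcp, irreducible = plcp_via_irreducible("banana$")
print(plcp, sorted(irreducible.items()))
```

On "banana$" only five values are computed by comparison, one per run of the BWT "annb$aa"; the remaining two follow from the decrement rule.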

The main idea of the construction is to compute (as in Theorem 3.4) names of -runs for . This will allow us to compare symbols at a time and thus quickly compute a lower bound for large irreducible LCP values. Before we can use this, we need to augment the BWT with the support for SA/ISA queries. Note that in the next section we show how to efficiently build (in similar time and space) a support for the same queries but only taking up space.

5.1 Computing SA/ISA support

Suppose that we are given a run-length compressed BWT of taking space. Let be an integer. Assume for simplicity that is a multiple of . We start by computing the sorted list of starting positions of all -runs, similarly to Theorem 3.4. This requires augmenting the RLBWT with the LF/Ψ support first and in total takes time and working space. We then compute and store, for the first position of each -run , the value of . This will allow us to efficiently compute for any .

We then locate the occurrence of the symbol in BWT and perform iterations of on . By definition of LF, the position visited after iterations of is equal to , i.e., . For any such we save the pair into a list. When we finish the traversal we sort the list by the first component (assume this list is called ). We then create the copy of the list (call it ) and sort it by the second component. Creating the lists takes time and they occupy space. After the lists are computed we discard samples associated with all runs. Having these lists allows us to efficiently compute or for any as follows.

To compute we find (in constant time, as we in fact can store in an array) the pair in such that . We then perform steps of LF on position . The total query time is thus .

To compute we perform steps of LF (each taking time) on position . Due to the way we sampled SA/ISA values, one of the visited positions has to be the first component in the list. For each position we can check this in time. Suppose we found a pair after steps, i.e., a pair is in . This implies , i.e., . The query time is .
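The sampling scheme of this subsection can be sketched as follows. The sketch simplifies several things and should be read as an assumption-laden illustration: LF is stored as a plain array and applied one step at a time (the paper takes larger jumps using the values stored with runs), the two sorted lists are replaced by Python dictionaries, indexing is 0-based, the text ends with a unique sentinel, and the last text position is always sampled so that the text length need not be a multiple of the sampling rate `tau`. All names are hypothetical.

```python
def bwt_and_sa(text):
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    return "".join(text[i - 1] for i in sa), sa


def lf_array(bwt):
    """LF as an explicit array, obtained by stably sorting BWT positions."""
    order = sorted(range(len(bwt)), key=lambda i: (bwt[i], i))
    lf = [0] * len(bwt)
    for first_col_row, last_col_row in enumerate(order):
        lf[last_col_row] = first_col_row
    return lf


def build_samples(lf, tau):
    """Walk the text backwards with LF, recording every tau-th (text position, row)."""
    n = len(lf)
    row_of_text, text_of_row = {}, {}
    row, pos = 0, n - 1              # row 0 holds the sentinel suffix at position n-1
    for _ in range(n):
        if pos % tau == 0 or pos == n - 1:
            row_of_text[pos] = row
            text_of_row[row] = pos
        row, pos = lf[row], pos - 1
    return row_of_text, text_of_row


def isa_query(j, row_of_text, lf, tau):
    """Start at the nearest sampled position at or after j and step back with LF."""
    n = len(lf)
    pos = min(-(-j // tau) * tau, n - 1)
    row = row_of_text[pos]
    for _ in range(pos - j):
        row = lf[row]
    return row


def sa_query(i, text_of_row, lf):
    """Apply LF until a sampled row is reached; each step moves one position left."""
    steps = 0
    while i not in text_of_row:
        i = lf[i]
        steps += 1
    return text_of_row[i] + steps


text = "mississippi$"
bwt, sa = bwt_and_sa(text)
lf = lf_array(bwt)
row_of_text, text_of_row = build_samples(lf, 4)
print([sa_query(i, text_of_row, lf) for i in range(len(text))])
```

Either query performs fewer than `tau` LF steps, so the sampling rate trades the number of stored pairs against query time.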

Theorem 5.1

Given RLBWT of text of size we can, for any , build a data structure taking space that, for any , can answer query in time and query in time. The construction takes time and working space.

5.2 Computing irreducible LCP values

We start by augmenting the RLBWT with the SA/ISA support as explained in the previous section using . The resulting data structure answers SA/ISA queries in time. We then compute -runs and their names using the technique introduced in Theorem 3.4 for .

Given any we can check if using the above names as follows. First compute and using the ISA support. Then compare the names of -substrings preceding these two suffixes. Thus, comparing two arbitrary substrings of of length , given their text positions, takes time.

The above toolbox allows computing all irreducible LCP values as follows. For any such that is irreducible (such can be recognized by checking if belongs to a BWT-run different than ) we compute and . We then have . We start by computing the lower-bound for using the names of -substrings. Since the sum of irreducible LCP values is bounded by , over all irreducible LCP values this will take time. Finishing the computation of each LCP value requires at most symbol comparisons. This can be done by following for both pointers as long as the preceding symbols (found in the BWT) are equal. Over all irreducible LCP values, finishing the computation takes time.

Theorem 5.2

Given the RLBWT of of size , the bitvector (or the list of pairs storing all irreducible LCP values in text-order) can be computed in time and working space.

Proof

Adding the SA/ISA support using takes time and working space (Theorem 5.1). The resulting structure needs space and answers SA/ISA queries in time.

Computing the names takes time and working space (see the proof of Theorem 3.4). The names need space. Then, by the above discussion, computing all irreducible LCP values takes time.

Converting the list of irreducible LCP values (in text order) into requires SA-queries and hence can be done in time. ∎

By combining with Theorem 4.3 we obtain the following result.

Theorem 5.3

Given a non-wasteful encoding of string over alphabet of size using words, the bitvector (or the list of pairs storing all irreducible LCP values in text-order) can be computed in time and working space, where is the number of runs in the BWT of .

6 Optimal construction of run-length compressed CSA

In this section we show how our techniques can be used to build, in optimal time and space, the run-length compressed suffix array (RLCSA) recently proposed by Gagie et al. [24]. They observed that both the text and BWT have a bidirectional parse of size , where is the number of runs in the BWT, and that this property transfers to other arrays (such as or ), except that they first need to be differentially encoded. They use a locally-consistent parsing to grammar-compress these arrays and describe the necessary augmentations to achieve fast decoding of the original values. They also describe a data structure called a block tree (see also [23] for an early, unidirectional, version) that allows random access to any length- substring of text in time. The block tree takes space.

The structures described below differ slightly from the original index proposed by Gagie et al. [24]. Rather than compressing the differentially-encoded arrays, they directly exploit the bidirectional-parse structure present in all the arrays. They can be thought of as multi-ary block trees [7] modified to work with arrays indexed in “lex-order” instead of the original “text-order”. Our main contribution, however, is the time- and space-optimal construction of the above data structures for highly repetitive inputs.

6.1 Small-space SA support

The data structure.

The data structure is parametrized by an integer parameter . Assume we are given RLBWT of size for text . The data structure is organized into levels. The main idea is, at any given level, to store pointers for each BWT-run that will reduce the SA-access query for positions nearby that BWT-run boundary into the SA-access query at some different position but closer (by at least a factor of ) to some run boundary. Level describes the allowed proximity of the query, so a pointer from level will always lead to level . Upon reaching the last level, the SA value at each run boundary is stored explicitly.

More precisely, for level , let and let be one of the runs in the BWT (and be the preceding run, ). Consider non-overlapping blocks of size , evenly spread around position , i.e., , , except we omit blocks not contained in . For each block we store the smallest (called LF-distance) such that there exists at least one such that is the beginning of the run in the BWT of . We also store the value (called LF-shortcut), both as an absolute value in and as a pointer to the BWT-run that contains it. Due to a simple generalization of Lemma 1, this allows us to compute for any .

To access we proceed as follows. Assume first that is not more than positions from the closest run boundary. We first find the run that contains . We then follow the LF-shortcuts starting at level 1 down to the last level. After every step the distance to the closest run boundary is reduced by a factor . Thus, after steps the current position is equal to the boundary of some run . Let denote the total length of LF-distances of the used shortcuts. Since is stored we can now answer the query as . To handle positions further than from the nearest run boundary, we add a lookup table such that stores the LF-shortcut for block .
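The query relies on the fact that each LF step moves one text position back, so the SA value at the starting row equals the SA value stored at the run boundary that is finally reached plus the number of LF steps taken. The Python sketch below illustrates only this identity, in a degenerate form that is an assumption of this example rather than the multi-level structure described above: it treats every row as its own block, applies LF one step at a time instead of following precomputed shortcuts, and stores SA only at the first row of each BWT run. All names are hypothetical and positions are 0-based.

```python
def suffix_array(text):
    """Naive suffix sorting; adequate only for a small illustration."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def lf_mapping(bwt):
    """LF[i] = (number of symbols smaller than bwt[i]) + rank of bwt[i] at row i."""
    counts = {}
    for c in bwt:
        counts[c] = counts.get(c, 0) + 1
    first, total = {}, 0
    for c in sorted(counts):
        first[c] = total
        total += counts[c]
    seen, lf = {}, []
    for c in bwt:
        lf.append(first[c] + seen.get(c, 0))
        seen[c] = seen.get(c, 0) + 1
    return lf


def sa_at_run_starts(sa, bwt):
    """Keep the SA value only at the first row of every BWT run."""
    return {i: sa[i] for i in range(len(bwt)) if i == 0 or bwt[i] != bwt[i - 1]}


def sa_access(i, lf, samples):
    """Walk LF from row i to a run start; the step count is the LF-distance."""
    distance = 0
    while i not in samples:
        i = lf[i]
        distance += 1
    # The row of suffix 0 has BWT symbol '$' and starts its own run, so the
    # walk always stops before the text position could wrap around.
    return samples[i] + distance


text = "abracadabra$"
sa = suffix_array(text)
bwt = "".join(text[p - 1] for p in sa)  # text[-1] is '$' for the row of suffix 0
lf = lf_mapping(bwt)
samples = sa_at_run_starts(sa, bwt)
recovered = [sa_access(i, lf, samples) for i in range(len(text))]
```

The structure in this section avoids the step-by-step walk by storing, per block and level, the LF-distance and the position reached, so that a query follows one shortcut per level.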

The above data structure can be generalized to extract segments of , for any and , faster than single SA-accesses, that would cost . The main modification is that at level we instead consider blocks of size , evenly spread around position , each overlapping the next by exactly symbols, i.e.,