Faster Queries on BWT-runs Compressed Indexes

06/09/2020
by   Takaaki Nishimoto, et al.

Although a significant number of compressed indexes for highly repetitive strings have been proposed thus far, developing compressed indexes that support faster queries remains a challenge. The run-length Burrows-Wheeler transform (RLBWT) is a lossless data compression method that combines a reversible permutation of an input string with run-length encoding, and it has become a popular research topic in string processing. Recently, Gagie et al. presented the r-index, an efficient compressed index on the RLBWT whose space usage does not depend on the text length. In this paper, we present a new compressed index on the RLBWT, which we call r-index-f, that improves on the r-index to support faster locate queries. We introduce a novel division of the RLBWT into blocks, which we call a balanced BWT-sequence: the RLBWT of a string is divided into several blocks, and a parent-child relationship between each pair of blocks is defined. In addition, we present a novel backward search algorithm on balanced BWT-sequences, resulting in faster locate queries on r-index-f. We also present new algorithms for count queries, extract queries, decompression and prefix search on r-index-f.


1 Introduction

A text index represents a string in a compressed format that supports locate queries (i.e., computing all the positions at which a query string occurs in the string). The Burrows-Wheeler transform (BWT) [burrows1994block] is a lossless data compression method based on a reversible permutation of an input string, and a wide variety of applications of text indexes on the BWT [DBLP:journals/jacm/FerraginaM05, DBLP:journals/talg/FerraginaMMN07] have been proposed (e.g., [DBLP:conf/dcc/ProchazkaH14, DBLP:journals/bmcbi/WangHMJ18, Healy2003AnnotatingLG]).

A highly repetitive string is a string that contains many repetitions. Examples include human genomes, version-controlled documents and source code in repositories. A significant number of text indexes on various compressed formats for highly repetitive strings have been proposed thus far (e.g., SLP-index [DBLP:conf/spire/ClaudeN12a], LZ-indexes [DBLP:conf/latin/ChristiansenE18, DBLP:conf/latin/GagieGKNP14], BT-indexes [DBLP:journals/corr/abs-1811-12779, DBLP:journals/tcs/NavarroP19]). For a large collection of highly repetitive strings, the most powerful and efficient compressed format is the run-length BWT (RLBWT) [burrows1994block], which is a BWT compressed by run-length encoding. Mäkinen et al. [DBLP:journals/jcb/MakinenNSV10] presented the first text index on the RLBWT, named RLFM-index, which solves locate queries by using a backward search algorithm on the RLBWT. The space usage of RLFM-index depends on the string length: its space and time bounds for locate queries are expressed in terms of the text length $n$, the query length $m$, the number $r$ of runs in the RLBWT of the text, the alphabet size $\sigma$, the number $occ$ of occurrences of the query in the text, and a parameter $s$. Recently, Gagie et al. [10.1145/3375890] presented the r-index, which reduces the space usage of RLFM-index to a space that does not depend on the string length. The r-index solves locate queries more space-efficiently, in only $O(r)$ words of space, with a time bound that also involves the machine word size $w$.

In this paper, we present a new text index on the RLBWT, which we call r-index-f, that improves on the r-index to support faster locate queries. We introduce a novel division of the RLBWT into blocks, which we call a balanced BWT-sequence: the RLBWT of a string is divided into several blocks, and a parent-child relationship between each pair of blocks is defined. In addition, we present a novel backward search algorithm on balanced BWT-sequences, resulting in faster locate queries using $O(r)$ words of space. r-index-f also supports the following fast queries on strings, which are summarized in Table 1.

  • Count query: r-index-f can return the number of occurrences of a query in an input string using $O(r)$ words of space, in time that depends on neither the string length nor the number of runs in the RLBWT.

  • Extract query: r-index-f can return substrings starting at a given position bookmarked beforehand in a string in time per character and words of space, where is the number of bookmarked positions. Solving extract queries is also known as the bookmarking problem [DBLP:conf/latin/GagieGKNP14, DBLP:conf/spire/CordingGW16].

  • Decompression: r-index-f decompresses the original string in time and words of space. The decompression is fastest among algorithms on compressed indexes in words of space.

  • Prefix search: r-index-f can return the strings in a set that include a query as their prefixes in time and words of space, where is the number of output strings and is the number of runs in the RLBWT of a string made by concatenating the strings in .

This paper is organized as follows. In Section 2, we introduce several important notions used in this paper. In Section 3, we present the balanced BWT-sequence and r-index-f. Count and locate queries on r-index-f are also presented. Important properties of the balanced BWT-sequence are proven in Section 4. Section 5 presents the other queries that r-index-f can support.

(i) Locate query Space (words) Time
RLFM-index [DBLP:journals/jcb/MakinenNSV10]
r-index [10.1145/3375890]
r-index-f (This study)
(ii) Count query Space (words) Time
RLFM-index [DBLP:journals/jcb/MakinenNSV10]
r-index [10.1145/3375890]
r-index-f (This study)
(iii) Extract query Space (words) Time per character
Gagie et al.[DBLP:conf/latin/GagieGKNP14]
Cording et al.[DBLP:conf/spire/CordingGW16]
This study
(iv) Decompression Space (words) Time
Lauther and Lukovszki [DBLP:journals/algorithmica/LautherL10]
Golynski et al.[DBLP:conf/soda/GolynskiMR06]
Predecessor queries [DBLP:conf/esa/BelazzouguiN12]
This study
(v) Prefix search Space (words) Time
Compact trie
Z-fast trie [DBLP:conf/spire/BelazzouguiBV10] expected
Packed c-trie [DBLP:journals/ieicet/TakagiISA17] expected
c-trie++ [DBLP:journals/corr/abs-1904-07467] expected
This study
Table 1: Summary of methods of (i) locate and (ii) count queries on RLBWT, (iii) extract query (i.e., the bookmarking problem), (iv) decompression of BWT or RLBWT and (v) prefix search, where is the length of the input string , is the length of a given string , is the number of all occurrences of in , is the alphabet size of , is the machine word size, is the number of runs in the RLBWT of , is a parameter, is the size of a compressed grammar deriving , is the number of input positions for the bookmarking problem, is a set of strings of total length , is the number of the strings in such that each string has as a prefix and is the number of runs in the RLBWT of a string made by concatenating the strings in .

2 Preliminaries

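Let $\Sigma$ be an ordered alphabet of size $\sigma$, $T$ be a string of length $n$ over $\Sigma$ and $|T|$ be the length of $T$. Let $T[i]$ be the $i$-th character of $T$ (i.e., $T = T[1], T[2], \ldots, T[n]$) and $T[i..j]$ be the substring of $T$ that begins at position $i$ and ends at position $j$. For two strings, $T$ and $P$, $T \prec P$ means that $T$ is lexicographically smaller than $P$. We assume that (i) and (ii) the last character of string $T$ is a special character $\$$ not occurring on substring $T[1..n-1]$ and $\$ \prec c$ holds for any character $c \in \Sigma \setminus \{\$\}$. For two integers, $b$ and $e$ ($b \leq e$), interval $[b, e]$ is a set $\{b, b+1, \ldots, e\}$. $Occ(T, P)$ denotes all the occurrence positions of a string $P$ in a string $T$, i.e., $Occ(T, P) = \{ i \mid i \in [1, n - |P| + 1] \mbox{ s.t. } P = T[i..(i + |P| - 1)] \}$. Let $C$ be an array of size $\sigma$ such that $C[c]$ is the number of occurrences of characters lexicographically smaller than $c \in \Sigma$ in string $T$, i.e., $C[c] = |\{ i \mid i \in [1, n] \mbox{ s.t. } T[i] \prec c \}|$.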

Our computation model is a unit-cost word RAM with a machine word size of bits. We evaluate the space complexity in terms of the number of machine words. A bitwise evaluation of space complexity can be obtained with a multiplicative factor. We use log base 2 throughout this paper if the logarithmic base is not indicated.

2.1 Predecessor, rank, count and locate queries

For an integer and a set of integers, a predecessor query returns the number of elements that are no more than in  (i.e., ). Belazzougui and Navarro [DBLP:conf/esa/BelazzouguiN12] proposed a predecessor data structure that solved the predecessor query for on in time and with words of space for the size of the universe of elements. Constructing predecessor data structures takes time and words of space by processing  [10.1145/3375890].

A rank query on a string returns the number of occurrences of character in , i.e., . Belazzougui and Navarro [DBLP:conf/esa/BelazzouguiN12] also proposed a rank data structure solving a rank query on in time and with words of space. Constructing the rank data structure takes time and words of working space by processing  [10.1145/3375890].

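A count query on string $T$ returns the number of occurrences of a given string $P$ in $T$, i.e., the number of positions $i$ such that $T[i..i+|P|-1] = P$. Similarly, a locate query on string $T$ returns all the starting positions of $P$ in $T$, i.e., every such position $i$.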
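The four queries defined in this subsection can be illustrated with a minimal reference sketch. It uses naive scans rather than the succinct data structures cited above, 0-based positions rather than the 1-based positions of the text, and function names (`predecessor_count`, `rank`, `locate`, `count`) that are chosen here for illustration only.

```python
from bisect import bisect_right


def predecessor_count(sorted_values, x):
    """Number of elements in sorted_values that are no more than x."""
    return bisect_right(sorted_values, x)


def rank(text, c, i):
    """Number of occurrences of character c among the first i characters."""
    return text[:i].count(c)


def locate(text, pattern):
    """All 0-based starting positions of pattern in text."""
    m = len(pattern)
    return [i for i in range(len(text) - m + 1) if text[i:i + m] == pattern]


def count(text, pattern):
    """Number of occurrences of pattern in text."""
    return len(locate(text, pattern))
```

These definitions are only meant as a specification against which a compressed index can be checked; each call takes time linear in the input.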

2.2 Suffix array (SA), sa-interval and LF function

The suffix array (SA) [DBLP:journals/siamcomp/ManberM93] of a string $T$ is an integer array of size $n$ such that $SA[i]$ stores the starting position of the $i$-th suffix of $T$ in lexicographical order. Formally, $SA$ is a permutation of $[1, n]$ such that $T[SA[1]..n] \prec T[SA[2]..n] \prec \cdots \prec T[SA[n]..n]$ holds. Each value in the SA is called an sa-value.

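The suffix array interval (sa-interval) of a string $P$ is an interval $[b, e]$ on the SA such that $SA[b..e]$ represents all the occurrence positions of $P$ in string $T$; that is, for any integer $p \in [1, n]$, $T[p..p+|P|-1] = P$ holds if and only if $p \in \{SA[b], SA[b+1], \ldots, SA[e]\}$.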

$LF$ is the function that, for a given integer $i \in [1, n]$, returns the position with sa-value $SA[i] - 1$ on the SA (i.e., $SA[LF(i)] = SA[i] - 1$) if $SA[i] \neq 1$; otherwise, it returns the position with sa-value $n$ (i.e., $SA[LF(i)] = n$).
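As a small illustration of the suffix array and the LF function defined above, the following sketch builds both by brute force. It assumes 0-based positions, so the wrap-around case is sa-value 0 mapping to sa-value n - 1 instead of 1 mapping to n, and the helper names are hypothetical.

```python
def suffix_array(text):
    """0-based starting positions of the suffixes of text in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def lf_mapping(sa):
    """LF[i] is the index j with sa[j] == sa[i] - 1 (0 wraps to n - 1)."""
    n = len(sa)
    inverse = [0] * n  # inverse[p] is the index of sa-value p in sa
    for j, p in enumerate(sa):
        inverse[p] = j
    return [inverse[(p - 1) % n] for p in sa]
```

For the string `banana$` this gives the suffix array `[6, 5, 3, 1, 0, 4, 2]`. Sorting whole suffixes is quadratic in the worst case and is used here only to keep the definition visible.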

2.3 BWT and run-length BWT (RLBWT)

Figure 1: Left figure illustrates the BWT, SA, LF function, F and the sorted circular strings of . Middle and right figures illustrate the two cases for toehold lemma (Lemma 1).

The BWT [burrows1994block] of a string $T$ is an array $L$ built by permuting $T$ as follows: (i) all the $n$ rotations of $T$ are sorted in lexicographical order; (ii) for any $i \in [1, n]$, $L[i]$ is the last character of the $i$-th rotation in the sorted order. Similarly, for any $i \in [1, n]$, $F[i]$ is the first character of the $i$-th rotation in the sorted order. Formally, let $L[i] = T[SA[LF(i)]]$ and $F[i] = T[SA[i]]$ for any $i \in [1, n]$. The left figure in Figure 1 illustrates the BWT, SA, LF function, and $F$ of a string.

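The BWT has the following two properties. Let $L$ denote the BWT of $T$. First, for any integer $i \in [1, n]$, $LF(i)$ is equal to the number of characters that are lexicographically smaller than character $L[i]$ plus the rank of $L[i]$ on the BWT. Thus, $LF(i) = C[L[i]] + rank(L, L[i], i)$ holds for $i \in [1, n]$, where $C[c]$ is the number of characters in $T$ that are lexicographically smaller than $c$ and $rank(L, c, i)$ is the number of occurrences of $c$ in $L[1..i]$. This is because, for any pair of distinct integers $i, j \in [1, n]$, $LF(i) < LF(j)$ holds if and only if either of the following conditions holds: (i) $L[i] \prec L[j]$ or (ii) $L[i] = L[j]$ and $i < j$.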
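The first property can be checked on a small example. The sketch below builds the BWT from sorted rotations and then evaluates LF purely from the BWT, as the number of smaller characters plus the rank of the current character. It is 0-based, so the rank counts occurrences strictly before the current position, and the function names are illustrative assumptions.

```python
def bwt_by_rotations(text):
    """Last column of the lexicographically sorted rotations of text."""
    n = len(text)
    rotations = sorted(text[i:] + text[:i] for i in range(n))
    return "".join(rotation[-1] for rotation in rotations)


def lf_from_bwt(bwt):
    """LF computed as C[c] plus the number of earlier occurrences of c."""
    smaller = {}  # smaller[c] = number of characters in bwt smaller than c
    for index, ch in enumerate(sorted(bwt)):
        smaller.setdefault(ch, index)
    seen = {}  # occurrences of each character met so far
    lf = []
    for ch in bwt:
        k = seen.get(ch, 0)
        lf.append(smaller[ch] + k)
        seen[ch] = k + 1
    return lf
```

Because equal characters keep their relative order under LF, the resulting array is always a permutation of the positions.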

Second, for a character $c$ and string $P$, the starting and ending positions of the sa-interval of $cP$ are respectively equal to the results obtained by applying the LF function to the first and last occurrences of character $c$ on the sa-interval of $P$ on the BWT. This is because (i) $P$ is a prefix of any suffix in the sa-interval of $P$, and (ii) $L[i]$ is the previous character of suffix $T[SA[i]..n]$. Thus, the sa-interval of $cP$ for any character $c$ can be computed from the sa-interval of $P$ by backward search as follows.

[Backward search [DBLP:journals/jacm/FerraginaM05]] Let and be the sa-intervals of and for string and character , respectively. The following three statements hold: (i) for any integer only if ; (ii) and if where ; (iii) only if .
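A minimal sketch of backward search on an uncompressed BWT may help make the lemma concrete. It assumes 0-based half-open intervals `[b, e)` and answers rank queries by scanning, so it shows the interval arithmetic only, not the running time of any index discussed in this paper. The function name is hypothetical.

```python
def backward_search(bwt, pattern):
    """Return the 0-based half-open sa-interval (b, e) of pattern, or None."""
    smaller = {}  # smaller[c] = number of characters in bwt smaller than c
    for index, ch in enumerate(sorted(bwt)):
        smaller.setdefault(ch, index)
    b, e = 0, len(bwt)  # sa-interval of the empty string
    for c in reversed(pattern):
        if c not in smaller:
            return None
        # Map the first and last occurrence of c in bwt[b:e] through LF.
        b = smaller[c] + bwt[:b].count(c)
        e = smaller[c] + bwt[:e].count(c)
        if b >= e:
            return None  # c does not occur in the current interval
    return b, e
```

With the BWT `annb$aa` of `banana$`, the pattern `ana` yields the interval `(2, 4)`, whose length 2 is the answer to the count query.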

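The RLBWT of $T$ is the BWT encoded by run-length encoding, i.e., the RLBWT is a partition of the BWT $L$ into $r$ substrings $L_1, L_2, \ldots, L_r$ such that each substring $L_i$ is a maximal repetition of the same character in $L$ (i.e., all characters of $L_i$ are equal, and adjacent substrings $L_i$ and $L_{i+1}$ consist of different characters). We call each $L_i$ a run. The RLBWT can be stored in $O(r)$ words because we can represent each run in $O(1)$ words. In Figure 1, the RLBWT of the string in the left table is .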
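Run-length encoding of a BWT can be sketched in a few lines. The representation as a list of (character, length) pairs is an assumption made for illustration; the paper does not prescribe a concrete layout.

```python
from itertools import groupby


def run_length_encode(bwt):
    """Split bwt into maximal runs, returned as (character, length) pairs."""
    return [(ch, len(list(group))) for ch, group in groupby(bwt)]


def run_length_decode(runs):
    """Inverse of run_length_encode."""
    return "".join(ch * length for ch, length in runs)
```

The number of pairs returned is the number of runs, so each run indeed fits in a constant number of machine words.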

Toehold lemma [DBLP:journals/algorithmica/PolicritiP18] enables us to compute all the occurrences of on string using a sampled suffix array of size and the sa-interval of string for any character , which is formalized as follows.

[Toehold lemma [DBLP:journals/algorithmica/PolicritiP18]] Let  () and  () be the sa-intervals of and for string and character , respectively. The following two cases hold: (i) If , for any integer ; (ii) otherwise, there exists an integer such that holds and contains the starting or ending position of  (i.e., or ).

In case (i), any character on interval in the BWT represents an occurrence of  (i.e., for any ) because interval occurs on a run of character in the RLBWT. Thus, we can compute an occurrence of if we know an occurrence of because is an occurrence of .

In case (ii), the interval contains the starting or ending position of a run of character because the interval on the BWT contains at least two distinct characters. Thus, we can compute at least one occurrence position of by storing sa-values on the starting and ending positions of runs in the RLBWT because is an occurrence of for sa-value on the starting or ending position of a run of character such that the position is contained in interval . The middle and right figures in Figure 1 show an example of the two cases.
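The two cases of the toehold lemma can be traced with a small sketch that keeps one position with a known sa-value (the toehold) during backward search and stores sa-values only at the first and last position of each run. It is 0-based with inclusive intervals, uses naive scans for rank, and all names are hypothetical; it is an illustration of the lemma under these assumptions, not the data structure of any cited index.

```python
def one_occurrence(text, pattern):
    """Return one 0-based occurrence of pattern in text, or None."""
    n = len(text)
    sa = sorted(range(n), key=lambda i: text[i:])
    bwt = "".join(text[p - 1] for p in sa)
    # sa-values sampled only at the first and last position of every run.
    samples = {i: sa[i] for i in range(n)
               if i == 0 or i == n - 1
               or bwt[i] != bwt[i - 1] or bwt[i] != bwt[i + 1]}
    smaller = {}
    for index, ch in enumerate(sorted(bwt)):
        smaller.setdefault(ch, index)

    def lf(i):
        return smaller[bwt[i]] + bwt[:i].count(bwt[i])

    b, e = 0, n - 1           # inclusive sa-interval of the empty string
    k, value = 0, samples[0]  # toehold: position k in [b, e] and its sa-value
    for c in reversed(pattern):
        if c not in smaller:
            return None
        nb = smaller[c] + bwt[:b].count(c)
        ne = smaller[c] + bwt[:e + 1].count(c) - 1
        if nb > ne:
            return None
        if ne - nb == e - b:
            # Case (i): every character of bwt[b..e] is c, keep the toehold.
            k, value = lf(k), (value - 1) % n
        else:
            # Case (ii): some run of c starts or ends inside [b, e].
            j = next(i for i in range(b, e + 1)
                     if bwt[i] == c and i in samples)
            k, value = lf(j), (samples[j] - 1) % n
        b, e = nb, ne
    return value
```

In case (ii) a sampled position always exists, because an interval that contains both the character and some other character must contain a boundary of a run of that character.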

2.4 Data structure computing adjacent sa-values

Gagie et al. [10.1145/3375890] proposed a data structure enabling computations of adjacent sa-values and from a given sa-value in time and words of space. Let for a given position with sa-value  (i.e., ). Let be an inverse function of that returns for a given sa-value (i.e., ). The data structure was used for supporting locate queries in [10.1145/3375890].
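To make the role of adjacent sa-values concrete, the sketch below tabulates the two maps directly from a full suffix array and uses one of them to enumerate an sa-interval from its first sa-value. Storing the maps as dictionaries of size n is an assumption for illustration only; the data structure of Gagie et al. achieves the same mapping in space proportional to the number of runs. Positions are 0-based and the names are hypothetical.

```python
def adjacent_value_maps(sa):
    """Return (prev_map, next_map) over sa-values.

    prev_map[sa[i]] == sa[i - 1] and next_map[sa[i]] == sa[i + 1].
    """
    n = len(sa)
    prev_map = {sa[i]: sa[i - 1] for i in range(1, n)}
    next_map = {sa[i]: sa[i + 1] for i in range(n - 1)}
    return prev_map, next_map


def enumerate_interval(first_value, length, next_map):
    """List the sa-values of an interval given its first sa-value."""
    values = [first_value]
    for _ in range(length - 1):
        values.append(next_map[values[-1]])
    return values
```

With the suffix array `[6, 5, 3, 1, 0, 4, 2]` of `banana$`, an interval of length 2 starting at sa-value 3 enumerates to `[3, 1]`, the two occurrences of `ana`.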

3 r-index-f

In this section, we present r-index-f, a new text index on RLBWT supporting faster count and locate queries. Formally, we show the following theorem. There exists a text index of words supporting count and locate queries on a string in and time, respectively, where is a given string with count or locate query and . We can construct the data structure in time and words of working space by processing the RLBWT of .

r-index-f is built on the notion of a novel partition of the BWT of string , which we call a balanced BWT-sequence. The balanced BWT-sequence is introduced in Section 3.1. We present r-index-f and queries in Section 3.2. Due to space limitations, we present a construction algorithm of r-index-f in Appendix A.

3.1 Balanced BWT-sequence

Figure 2: Examples of a BWT-sequence and the corresponding F-sequence.

We introduce an important notion of the balanced BWT-sequence. A BWT-sequence of a string is a sequence of strings satisfying the following two conditions: (i) the concatenation of the strings is equal to the BWT of  (i.e., ); (ii) each string is a repetition of a character (i.e., ). is defined as the set of the starting positions of strings in , i.e., and for .

For the BWT of string , two examples of BWT-sequences are and , where denotes repetitions of the same character .

F-sequence is a permutation of BWT-sequence obtained by applying the LF function to each string in . The concatenation of the F-sequence is equal to permutation by the first property of BWT (i.e., ). Formally, let be the permutation of such that holds. Then, holds for any integer . We call each string in and the F-sequence phrase. is defined as the set of the starting positions of phrases in  (i.e., is the starting position of phrase ). Figure 2 illustrates the F-sequences for two BWT-sequences and .

A parent and children relationship between phrases can be defined. Children of the -th phrase are defined as a set of phrases such that the starting position of each phrase in the set is contained in interval , where interval is equal to the interval of phrase corresponding to by the LF function (i.e., , where ). We define as a set of the starting positions of the children of the -th phrase in BWT-sequence  (i.e., ). For in Figure 2, , , , and .

A BWT-sequence of string is balanced if (i) any phrase has at most three children (i.e., for any integer ), and (ii) the number of phrases in is at most  (i.e., ). The left BWT-sequence in Figure 2 is balanced because any phrase has at most two children and the size of the sequence is no more than . On the other hand, the right BWT-sequence is not balanced because the fifth phrase in the BWT-sequence has four children.

There must exist at least one balanced BWT-sequence of any string as follows. The following two statements hold: (i) there exists a balanced BWT-sequence of any string ; and (ii) we can construct a balanced BWT-sequence of in time and words of working space by processing the RLBWT of .

Proof.

See Section 4. ∎

3.2 Data structures and search algorithms

Our text index consists of the following six data structures:

  1. A balanced BWT-sequence of ;

  2. Five arrays of size storing (1) , , , (2) , , , , (3) , , , , (4) , , , , and (5) , , , ;

  3. A rank data structure for a string such that each -th character is the first character of phrase  (i.e., );

  4. An array of size such that stores the index of the phrase in containing character  (i.e., ).

  5. The data structure for the functions and introduced in Section 2.4;

  6. An array of size such that stores a 5-tuple for sa-interval of character .

As shown in Theorem 3.1, the space usage of those data structures is words in total, which we prove in Appendix A.

3.2.1 Fast predecessor query on the BWT-sequence

Figure 3: Two phrases , for position , and phrase for  (left). The sa-intervals of two strings and on a balanced BWT-sequence (right).

Solving a predecessor query on set (i.e., ) is essential for count and locate queries. Thus, we present an algorithm that returns in constant time for a given position and . A linear search can be adopted for solving predecessor queries by leveraging the balanced BWT-sequence. Phrases are searched one by one for integer stored in  (i.e., ), and phrase is found as a solution. This is made possible because contains the starting position of and position is contained in two phrases and  (i.e., ). Computation complexity is time by using the array storing the starting positions of phrases in . This computation complexity can be bounded by  (i.e., ) because any phrase in a balanced BWT-sequence has at most three children. Formally, the following lemma holds.

for any integer and . An example of three phrases , and is presented in the left figure in Figure 3.

In the remaining section, we present two new algorithms for count and locate queries using the predecessor query.

3.2.2 Count query

A count query computes the length of an sa-interval for a given query using a backward search on . Let be the sa-interval of string and let be the sa-interval of string for character . Given a 5-tuple , the backward search on returns a 4-tuple . Our backward search algorithm consists of the following two steps: (i) compute and by Lemma 3.2.1; (ii) compute by rank queries on string .

Unless is empty, is equal to the number of character in plus the number of the characters lexicographically less than in  (i.e., ). This is because the characters of correspond one by one to the first characters of the phrases in by the LF function, and the first characters of the phrases in are lexicographically ordered by permutation .

The right figure in Figure 3 illustrates the relation between and the first characters of the phrases in . Similarly, is equal to . Thus, we compute and using two rank queries on . We also verify whether is empty because the above algorithm can return incorrect and if is empty. is empty only if substring does not contain character , and hence we complete the verification by two rank queries on .

Next, can be computed as if because interval completely contains phrase corresponding to  (i.e., ). Otherwise, can be computed as because corresponds to and the -th character of is the first occurrence of character in . Similarly, integer is computed. Thus, we can compute and in constant time after computing and . Our backward search algorithm runs in time. The following two lemmas hold.

The following two statements hold: (i) only if .. contains character  (i.e., ); (ii) and unless .

Assume that . Then, if ; otherwise . Similarly, if ; otherwise .

The sa-interval of a given string can be computed by executing the backward search times. Thus, a count query can be computed in time.

3.2.3 Locate query

A locate query enumerates all the sa-values in the sa-interval of using two functions, and . The sa-interval of is computed by the count query, and and are recursively computed for position by applying and to , respectively.

Let be an sa-interval of string and let be an sa-interval of string for an integer , where . A basic idea of our algorithm is to compute and for a given 7-tuple by the toehold lemma in Lemma 1, where and are two positions such that and . We compute and by executing the algorithm for each , which is explained next.

The toehold lemma ensures and if two sa-intervals are the same size (i.e., ). In addition, it ensures because corresponds to by the LF function for any integer . If , our algorithm returns and . Otherwise, contains the starting or ending position of phrase because contains the starting or ending position of phrase corresponding to in this case, where