1 Introduction
A text index represents a string in a compressed format that supports locate queries (i.e., computing all the positions at which a query occurs in a string). The Burrows-Wheeler transform (BWT) [burrows1994block] is a reversible permutation of an input string used for lossless data compression, and a wide variety of applications for text indexes on BWT [DBLP:journals/jacm/FerraginaM05, DBLP:journals/talg/FerraginaMMN07] have been proposed (e.g., [DBLP:conf/dcc/ProchazkaH14, DBLP:journals/bmcbi/WangHMJ18, Healy2003AnnotatingLG]).
A highly repetitive string is a string that includes many repetitions. Examples include human genomes, version-controlled documents and source code in repositories. A significant number of text indexes on various compressed formats for highly repetitive strings have been proposed thus far (e.g., SLP-index [DBLP:conf/spire/ClaudeN12a], LZ-indexes [DBLP:conf/latin/ChristiansenE18, DBLP:conf/latin/GagieGKNP14], BT-indexes [DBLP:journals/corr/abs181112779, DBLP:journals/tcs/NavarroP19]). For a large collection of highly repetitive strings, the most powerful and efficient is the run-length BWT (RLBWT) [burrows1994block], which is a BWT compressed by run-length encoding. Mäkinen et al. [DBLP:journals/jcb/MakinenNSV10] presented the first text index on RLBWT, named RLFM-index, which solves locate queries by using a backward search algorithm on RLBWT. While the space usage of RLFM-index depends on string length, RLFM-index can solve locate queries in words of space and time for length of text , length of query, number of runs in RLBWT of and alphabet size , number of occurrences of a query in and parameter . Recently, Gagie et al. [10.1145/3375890] presented r-index, which reduces the space usage of RLFM-index to a space not dependent on string length. The r-index can solve locate queries more space-efficiently in only words of space and time for machine word size .
In this paper, we present a new text index on RLBWT, which we call r-index-f, in which r-index is improved for faster locate queries. We introduce a novel division of RLBWT into blocks, which we call a balanced BWT-sequence, as follows: the RLBWT of a string is divided into several blocks, and a parent-child relationship between each pair of blocks is defined. In addition, we present a novel backward search algorithm on balanced BWT-sequences, resulting in faster locate queries in time and words of space. Furthermore, r-index-f can support the following fast queries on strings in its applications, which are summarized in Table 1.

Count query: r-index-f can return the number of occurrences of a query in an input string in words of space and time that is not dependent on string length and the number of runs in RLBWT.

Extract query: r-index-f can return substrings starting at a given position bookmarked beforehand in a string in time per character and words of space, where is the number of bookmarked positions. Solving extract queries is also known as the bookmarking problem [DBLP:conf/latin/GagieGKNP14, DBLP:conf/spire/CordingGW16].

Decompression: r-index-f decompresses the original string in time and words of space. The decompression is the fastest among algorithms on compressed indexes in words of space.

Prefix search: r-index-f can return the strings in a set that include a query as their prefixes in time and words of space, where is the number of output strings and is the number of runs in the RLBWT of a string made by concatenating the strings in .
This paper is organized as follows. In Section 2, we introduce several important notions used in this paper. In Section 3, we present the balanced BWT-sequence and r-index-f. Count and locate queries on r-index-f are also presented. Important properties of the balanced BWT-sequence are proven in Section 4. Section 5 presents the other queries that r-index-f can support.
(i) Locate query  Space (words)  Time

RLFM-index [DBLP:journals/jcb/MakinenNSV10]
r-index [10.1145/3375890]
r-index-f (This study)

(ii) Count query  Space (words)  Time

RLFM-index [DBLP:journals/jcb/MakinenNSV10]
r-index [10.1145/3375890]
r-index-f (This study)

(iii) Extract query  Space (words)  Time per character

Gagie et al. [DBLP:conf/latin/GagieGKNP14]
Cording et al. [DBLP:conf/spire/CordingGW16]
This study

(iv) Decompression  Space (words)  Time

Lauther and Lukovszki [DBLP:journals/algorithmica/LautherL10]
Golynski et al. [DBLP:conf/soda/GolynskiMR06]
Predecessor queries [DBLP:conf/esa/BelazzouguiN12]
This study

(v) Prefix search  Space (words)  Time

Compact trie
Z-fast trie [DBLP:conf/spire/BelazzouguiBV10]  expected
Packed c-trie [DBLP:journals/ieicet/TakagiISA17]  expected
c-trie++ [DBLP:journals/corr/abs190407467]  expected
This study
2 Preliminaries
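Let $\Sigma$ be an ordered alphabet of size $\sigma$, $T$ be a string of length $n$ over $\Sigma$ and $|T|$ be the length of $T$. Let $T[i]$ be the $i$-th character of $T$ (i.e., $T = T[1], T[2], \ldots, T[n]$) and $T[i..j]$ be the substring of $T$ that begins at position $i$ and ends at position $j$. For two strings, $T$ and $P$, $T \prec P$ means that $T$ is lexicographically smaller than $P$. We assume that (i) and (ii) the last character of string $T$ is a special character \$ not occurring on substring $T[1..n-1]$ and $\$ \prec c$ holds for any character $c \in \Sigma \setminus \{\$\}$. For two integers, $b$ and $e$ ($b \leq e$), interval $[b, e]$ is the set $\{b, b+1, \ldots, e\}$. $Occ(T, P)$ denotes all the occurrence positions of a string $P$ in a string $T$, i.e., $Occ(T, P) = \{i \in [1, n - |P| + 1] \mid P = T[i..i+|P|-1]\}$. Let $C$ be an array of size $\sigma$ such that $C[c]$ is the number of occurrences of characters lexicographically smaller than $c \in \Sigma$ in string $T$, i.e., $C[c] = |\{i \in [1, n] \mid T[i] \prec c\}|$.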
Our computation model is a unit-cost word RAM with a machine word size of bits. We evaluate the space complexity in terms of the number of machine words. A bitwise evaluation of space complexity can be obtained with a multiplicative factor. We use log base 2 throughout this paper if the logarithmic base is not indicated.
2.1 Predecessor, rank, count and locate queries
For an integer $x$ and a set $S$ of integers, a predecessor query returns the number of elements of $S$ that are no more than $x$ (i.e., $|\{y \in S \mid y \leq x\}|$). Belazzougui and Navarro [DBLP:conf/esa/BelazzouguiN12] proposed a predecessor data structure that solves the predecessor query for $x$ on $S$ in time and with words of space for the size of the universe of elements. Constructing predecessor data structures takes time and words of space by processing $S$ [10.1145/3375890].
A rank query on a string $T$ returns the number of occurrences of character $c$ in $T[1..i]$, i.e., $|\{j \in [1, i] \mid T[j] = c\}|$. Belazzougui and Navarro [DBLP:conf/esa/BelazzouguiN12] also proposed a rank data structure solving a rank query on $T$ in time and with words of space. Constructing the rank data structure takes time and words of working space by processing $T$ [10.1145/3375890].
A count query on string $T$ returns the number of occurrences of a given string $P$ in $T$, i.e., $|\{i \mid T[i..i+|P|-1] = P\}|$. Similarly, a locate query on string $T$ returns all the starting positions of $P$ in $T$, i.e., $\{i \mid T[i..i+|P|-1] = P\}$.
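The four queries above can be pinned down with naive reference implementations. The following is only a sketch for checking definitions on small inputs: the function names are hypothetical, positions are 1-based as in the text, and none of these routines achieves the time or space bounds of the cited data structures.

```python
from bisect import bisect_right


def pred_count(sorted_set, x):
    """Number of elements of the sorted list sorted_set that are <= x."""
    return bisect_right(sorted_set, x)


def rank(text, c, i):
    """Number of occurrences of character c in the first i characters of text."""
    return text[:i].count(c)


def locate(text, pattern):
    """All 1-based starting positions of pattern in text (naive scan)."""
    m = len(pattern)
    return [i + 1 for i in range(len(text) - m + 1) if text[i:i + m] == pattern]


def count(text, pattern):
    """Number of occurrences of pattern in text."""
    return len(locate(text, pattern))
```

For example, `locate("banana$", "ana")` yields the positions 2 and 4, so the corresponding count is 2.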
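2.2 Suffix array (SA), sa-interval and LF function
The suffix array (SA) [DBLP:journals/siamcomp/ManberM93] of a string $T$ is an integer array of size $n = |T|$ such that $SA[i]$ stores the starting position of the $i$-th suffix of $T$ in lexicographical order. Formally, SA is a permutation of $[1, n]$ such that $T[SA[1]..n] \prec \cdots \prec T[SA[n]..n]$ holds. Each value in SA is called an sa-value.
The suffix array interval (sa-interval) of a string $P$ is an interval $[b, e]$ on SA such that $SA[b..e]$ represents all the occurrence positions of $P$ in string $T$; that is, for any integer $p \in [1, n]$, $T[p..p+|P|-1] = P$ holds if and only if $p \in \{SA[b], SA[b+1], \ldots, SA[e]\}$.
LF is the function that returns the position with sa-value $SA[i] - 1$ on SA (i.e., $SA[LF(i)] = SA[i] - 1$) for a given integer $i \in [1, n]$ if $SA[i] \neq 1$; otherwise, it returns the position with sa-value $n$ (i.e., $SA[LF(i)] = n$).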
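As a concrete reading of these definitions, the sketch below builds SA by sorting suffixes and derives the LF function directly from SA. It is a quadratic-time illustration with hypothetical function names, assuming the text ends with a unique smallest terminator such as `$`; it is not how an index would compute either structure.

```python
def suffix_array(text):
    """1-based starting positions of the suffixes of text in lexicographic order."""
    return sorted(range(1, len(text) + 1), key=lambda p: text[p - 1:])


def lf_function(sa):
    """Return a list whose entry i-1 holds LF(i) for each position i in 1..n.

    LF(i) is the position j with SA[j] = SA[i] - 1, or with SA[j] = n
    when SA[i] = 1.
    """
    n = len(sa)
    position_of = {value: j for j, value in enumerate(sa, start=1)}
    return [position_of[v - 1] if v != 1 else position_of[n] for v in sa]
```

On `"banana$"` this gives SA = 7, 6, 4, 2, 1, 5, 3, and the suffixes starting with `"ana"` occupy the sa-interval [3, 4].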
2.3 BWT and run-length BWT (RLBWT)
The BWT [burrows1994block] of a string $T$ is an array $L$ built by permuting $T$ as follows: (i) all the rotations of $T$ are sorted in lexicographical order; (ii) $L[i]$ for any $i \in [1, n]$ is the last character of the $i$-th rotation in the sorted order. Similarly, $F[i]$ for any $i \in [1, n]$ is the first character of the $i$-th rotation in the sorted order. Formally, let $L[i] = T[SA[LF(i)]]$ and $F[i] = T[SA[i]]$ for any $i \in [1, n]$. The left figure in Figure 1 illustrates the BWT, SA, LF function, and $F$ of a string.
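The rotation-based definition translates almost literally into code. This is a minimal sketch with a hypothetical function name, assuming the input already ends with a unique smallest terminator; sorting all rotations explicitly is only practical for short strings.

```python
def bwt_and_first_column(text):
    """Return (L, F): the last and first characters of the sorted rotations of text."""
    n = len(text)
    rotations = sorted(text[i:] + text[:i] for i in range(n))
    last_column = "".join(rotation[-1] for rotation in rotations)
    first_column = "".join(rotation[0] for rotation in rotations)
    return last_column, first_column
```

For `"banana$"` the sorted rotations give the BWT `annb$aa` and the first column `$aaabnn`.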
BWT has the following two properties. First, for any integer $i \in [1, n]$, $LF(i)$ is equal to the number of characters that are lexicographically smaller than character $L[i]$ plus the rank of $L[i]$ on the BWT. Thus, $LF(i) = C[c] + rank(L, c, i)$ holds for $c = L[i]$, where $C[c]$ is the number of characters smaller than $c$ and $rank(L, c, i)$ is the number of occurrences of $c$ in $L[1..i]$. This is because $LF(i) < LF(j)$ only if either of the following conditions holds: (i) $L[i] \prec L[j]$ or (ii) $L[i] = L[j]$ and $i < j$ for any pair of integers $i, j \in [1, n]$.
Second, for a character $c$ and string $P$, the starting and ending positions of the sa-interval of $cP$ are respectively equal to the results obtained by applying the LF function to the first and last occurrences of character $c$ on the sa-interval of $P$ on BWT. This is because (i) $P$ is a prefix of any suffix in the sa-interval of $P$, and (ii) $L[i]$ is the previous character of suffix $T[SA[i]..n]$. Thus, the sa-interval of $cP$ for any character $c$ can be computed from the sa-interval of $P$ by backward search as follows.
[Backward search [DBLP:journals/jacm/FerraginaM05]] Let and be the sa-intervals of and for string and character , respectively. The following three statements hold: (i) for any integer only if ; (ii) and if where ; (iii) only if .
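Backward search can be illustrated on an uncompressed BWT using the count of smaller characters and naive rank. The sketch below is an assumption-laden illustration, not the algorithm of this paper: function names are hypothetical, intervals are 1-based and inclusive, and rank is computed by scanning, so each step takes time linear in the BWT length.

```python
def build_c(bwt):
    """Map each character c to the number of characters in bwt smaller than c."""
    c_array, total = {}, 0
    for ch in sorted(set(bwt)):
        c_array[ch] = total
        total += bwt.count(ch)
    return c_array


def backward_step(bwt, c_array, b, e, c):
    """Map the sa-interval [b, e] of P to that of cP; None if cP does not occur."""
    if c not in c_array:
        return None
    new_b = c_array[c] + bwt[:b - 1].count(c) + 1   # LF of first c in [b, e]
    new_e = c_array[c] + bwt[:e].count(c)           # LF of last c in [b, e]
    return (new_b, new_e) if new_b <= new_e else None


def sa_interval(bwt, pattern):
    """sa-interval of pattern, processing its characters from right to left."""
    c_array = build_c(bwt)
    b, e = 1, len(bwt)
    for c in reversed(pattern):
        step = backward_step(bwt, c_array, b, e, c)
        if step is None:
            return None
        b, e = step
    return b, e
```

With the BWT `annb$aa` of `"banana$"`, searching for `"ana"` visits the intervals [2, 4], [6, 7] and finally [3, 4], so the pattern occurs twice.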
The RLBWT of is the BWT encoded by run-length encoding, i.e., RLBWT is a partition of into substrings such that each substring is a maximal repetition of the same character in (i.e., and ). We call each a run. RLBWT can be stored in words because we can represent each run in words. In Figure 1, the RLBWT of the string in the left table is .
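Run-length encoding of a BWT is straightforward to sketch. The representation chosen here, a list of (character, run length) pairs, and the function name are assumptions made for illustration.

```python
from itertools import groupby


def run_length_bwt(bwt):
    """Partition bwt into maximal runs, returned as (character, length) pairs."""
    return [(ch, len(list(group))) for ch, group in groupby(bwt)]
```

For the BWT `annb$aa` this produces five runs: `a`, `nn`, `b`, `$`, `aa`.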
The toehold lemma [DBLP:journals/algorithmica/PolicritiP18] enables us to compute all the occurrences of on string using a sampled suffix array of size and the sa-interval of string for any character , which is formalized as follows.
[Toehold lemma [DBLP:journals/algorithmica/PolicritiP18]] Let () and () be the sa-intervals of and for string and character , respectively. The following two cases hold: (i) If , for any integer ; (ii) otherwise, there exists an integer such that holds and contains the starting or ending position of (i.e., or ).
In case (i), any character on interval in the BWT represents an occurrence of (i.e., for any ) because interval occurs on a run of character in the RLBWT. Thus, we can compute an occurrence of if we know an occurrence of because is an occurrence of .
In case (ii), the interval contains the starting or ending position of a run of character because the interval on the BWT contains at least two distinct characters. Thus, we can compute at least one occurrence position of by storing sa-values on the starting and ending positions of runs in the RLBWT because is an occurrence of for sa-value on the starting or ending position of a run of character such that the position is contained in interval . The middle and right figures in Figure 1 show an example of the two cases.
2.4 Data structure computing adjacent sa-values
Gagie et al. [10.1145/3375890] proposed a data structure enabling computations of adjacent sa-values and from a given sa-value in time and words of space. Let for a given position with sa-value (i.e., ). Let be an inverse function of that returns for a given sa-value (i.e., ). The data structure was used for supporting locate queries in [10.1145/3375890].
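The role of the two functions can be illustrated with explicit lookup tables built from a full suffix array. This sketch is not the data structure of Gagie et al.: it uses space linear in the text length, and the names `phi` and `phi_inv` as well as the direction assigned to each (previous and next sa-value in SA order, respectively) are assumptions made for illustration.

```python
def adjacent_sa_maps(sa):
    """Return (phi, phi_inv) as dictionaries keyed by sa-value.

    phi maps SA[i] to SA[i - 1]; phi_inv maps SA[i] to SA[i + 1].
    The first and last sa-values have no predecessor and successor,
    respectively, so they are absent from the corresponding dictionary.
    """
    phi = {sa[i]: sa[i - 1] for i in range(1, len(sa))}
    phi_inv = {sa[i]: sa[i + 1] for i in range(len(sa) - 1)}
    return phi, phi_inv
```

Given one sa-value inside an sa-interval, repeated lookups in either direction recover the neighbouring sa-values without storing SA positions.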
3 r-index-f
In this section, we present r-index-f, a new text index on RLBWT supporting faster count and locate queries. Formally, we show the following theorem. There exists a text index of words supporting count and locate queries on a string in and time, respectively, where is a given string with count or locate query and . We can construct the data structure in time and words of working space by processing the RLBWT of .
r-index-f is built on the notion of a novel partition of the BWT of string , which we call a balanced BWT-sequence. The balanced BWT-sequence is introduced in Section 3.1. We present r-index-f and queries in Section 3.2. Due to space limitations, we present a construction algorithm of r-index-f in Appendix A.
3.1 Balanced BWT-sequence
We introduce an important notion of the balanced BWT-sequence. A BWT-sequence of a string is a sequence of strings satisfying the following two conditions: (i) the concatenation of the strings is equal to the BWT of (i.e., ); (ii) each string is a repetition of a character (i.e., ). is defined as the set of the starting positions of strings in , i.e., and for .
For the BWT of string , two examples of BWT-sequences are and , where denotes repetitions of the same character .
The F-sequence is a permutation of BWT-sequence obtained by applying the LF function to each string in . The concatenation of the F-sequence is equal to permutation by the first property of BWT (i.e., ). Formally, let be the permutation of such that holds. Then, holds for any integer . We call each string in and the F-sequence a phrase. is defined as the set of the starting positions of phrases in (i.e., is the starting position of phrase ). Figure 2 illustrates the F-sequences for two BWT-sequences and .
A parent and children relationship between phrases can be defined. Children of the th phrase are defined as a set of phrases such that the starting position of each phrase in the set is contained in interval , where interval is equal to the interval of phrase corresponding to by the LF function (i.e., , where ). We define as a set of the starting positions of the children of the th phrase in BWT-sequence (i.e., ). For in Figure 2, , , , and .
A BWT-sequence of string is balanced if (i) any phrase has at most three children (i.e., for any integer ), and (ii) the number of phrases in is at most (i.e., ). The left BWT-sequence in Figure 2 is balanced because any phrase has at most two children and the size of the sequence is no more than . On the other hand, the right BWT-sequence is not balanced because the fifth phrase in the BWT-sequence has four children.
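The parent and children relationship can be checked mechanically on small inputs. The sketch below counts, for each phrase, how many phrase starting positions fall inside the interval that the LF function maps the phrase to. It covers only condition (i) of the definition; the bound on the number of phrases in condition (ii) is not checked. The phrase representation as (character, length) pairs and the function name are assumptions made for illustration.

```python
from bisect import bisect_left, bisect_right


def children_counts(phrases):
    """Number of children of each phrase in a BWT-sequence.

    phrases is a list of (character, length) pairs whose concatenation is
    the BWT. Positions are 1-based.
    """
    bwt = "".join(ch * length for ch, length in phrases)
    starts, position = [], 1
    for _, length in phrases:
        starts.append(position)
        position += length
    counts = []
    for (ch, length), start in zip(phrases, starts):
        smaller = sum(1 for x in bwt if x < ch)        # characters smaller than ch
        lf_start = smaller + bwt[:start].count(ch)     # LF(start)
        lf_end = lf_start + length - 1                 # a phrase maps to one interval
        inside = bisect_right(starts, lf_end) - bisect_left(starts, lf_start)
        counts.append(inside)
    return counts
```

Because the LF function is a permutation, every phrase start lies in exactly one mapped interval, so the counts always sum to the number of phrases. Condition (i) holds when `max(children_counts(phrases)) <= 3`.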
There must exist at least one balanced BWT-sequence of any string as follows. The following two statements hold: (i) there exists a balanced BWT-sequence of any string ; and (ii) we can construct a balanced BWT-sequence of in time and words of working space by processing the RLBWT of .
Proof.
See Section 4. ∎
3.2 Data structures and search algorithms
Our text index consists of the following six data structures:

A balanced BWT-sequence of ;

Five arrays of size storing (1) , , , (2) , , , , (3) , , , , (4) , , , , and (5) , , , ;

A rank data structure for a string such that each th character is the first character of phrase (i.e., );

An array of size such that stores the index of the phrase in containing character (i.e., ).

The data structure for the functions and introduced in Section 2.4;

An array of size such that stores a 5-tuple for sa-interval of character .
As shown in Theorem 3.1, the space usage of those data structures is words in total, which we prove in Appendix A.
3.2.1 Fast predecessor query on the BWT-sequence
Solving a predecessor query on set (i.e., ) is essential for count and locate queries. Thus, we present an algorithm that returns in constant time for a given position and . A linear search can be adopted for solving predecessor queries by leveraging the balanced BWT-sequence. Phrases are searched one by one for integer stored in (i.e., ), and phrase is found as a solution. This is made possible because contains the starting position of and position is contained in two phrases and (i.e., ). The computational complexity is time by using the array storing the starting positions of phrases in . This complexity can be bounded by (i.e., ) because any phrase in a balanced BWT-sequence has at most three children. Formally, the following lemma holds.
for any integer and . An example of three phrases , and is presented in the left figure in Figure 3.
In the remaining section, we present two new algorithms for count and locate queries using the predecessor query.
3.2.2 Count query
A count query computes the length of an sa-interval for a given query using a backward search on . Let be the sa-interval of string and let be the sa-interval of string for character . Given a 5-tuple , the backward search on returns a 4-tuple . Our backward search algorithm consists of the following two steps: (i) compute and by Lemma 3.2.1; (ii) compute by rank queries on string .
Unless is empty, is equal to the number of character in plus the number of the characters lexicographically less than in (i.e., ). This is because the characters of correspond one by one to the first characters of the phrases in by the LF function, and the first characters of the phrases in are lexicographically ordered by permutation .
The right figure in Figure 3 illustrates the relation between and the first characters of the phrases in . Similarly, is equal to . Thus, we compute and using two rank queries on . We also verify whether is empty because the above algorithm can return incorrect and if is empty. is empty only if substring does not contain character , and hence we complete the verification by two rank queries on .
Next, can be computed as if because interval completely contains phrase corresponding to (i.e., ). Otherwise, can be computed as because corresponds to and the th character of is the first occurrence of character in . Similarly, integer is computed. Thus, we can compute and in constant time after computing and . Our backward search algorithm runs in time. The following two lemmas hold.
The following two statements hold: (i) only if .. contains character (i.e., ); (ii) and unless .
Assume that . Then, if ; otherwise . Similarly, if ; otherwise .
The sa-interval of a given string can be computed by executing the backward search times. Thus, a count query can be computed in time.
3.2.3 Locate query
A locate query enumerates all the sa-values in the sa-interval of using two functions, and . The sa-interval of is computed by the count query, and and are recursively computed for position by applying and to , respectively.
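The enumeration step can be sketched independently of how the first sa-value is obtained. The code below assumes that one sa-value at the start of the sa-interval and the interval length are already known, and that a callable returning the next sa-value in SA order is available; here that callable is a toy dictionary built from an explicit suffix array, which stands in for the structure of Section 2.4. All names are hypothetical.

```python
def enumerate_sa_values(first_value, interval_length, next_value):
    """Return the sa-values of an sa-interval in SA order.

    first_value is the sa-value at the start of the interval,
    interval_length is the number of positions in the interval, and
    next_value maps SA[i] to SA[i + 1].
    """
    if interval_length <= 0:
        return []
    values = [first_value]
    for _ in range(interval_length - 1):
        values.append(next_value(values[-1]))
    return values


# Toy stand-in for the adjacent sa-value structure, built from an explicit SA.
toy_sa = [7, 6, 4, 2, 1, 5, 3]   # suffix array of "banana$"
toy_next = {toy_sa[i]: toy_sa[i + 1] for i in range(len(toy_sa) - 1)}.__getitem__

# The sa-interval of "ana" is [3, 4], so start from SA[3] and take 2 values.
occurrences = enumerate_sa_values(toy_sa[2], 2, toy_next)
```

The number of calls to the successor function equals the number of occurrences minus one, which is why the cost of each such call dominates the reporting phase of a locate query.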
Let be an sa-interval of string and let be an sa-interval of string for an integer , where . A basic idea of our algorithm is to compute and for a given 7-tuple by the toehold lemma in Lemma 1, where and are two positions such that and . We compute and by executing the algorithm for each , which is explained next.
The toehold lemma ensures and if two sa-intervals are the same size (i.e., ). In addition, it ensures because corresponds to by the LF function for any integer . If , our algorithm returns and . Otherwise, contains the starting or ending position of phrase because contains the starting or ending position of phrase corresponding to in this case, where