Conversion from RLBWT to LZ77

02/14/2019 ∙ by Takaaki Nishimoto, et al.

Converting a compressed format of a string into another compressed format without explicit decompression is one of the central research topics in string processing. We discuss the problem of converting the run-length Burrows-Wheeler Transform (RLBWT) of a string to Lempel-Ziv 77 (LZ77) phrases of the reversed string. The first results with Policriti and Prezza's conversion algorithm [Algorithmica 2018] were $O(n \log r)$ time and $O(r)$ working space for length of the string $n$, number of runs $r$ in the RLBWT, and number of LZ77 phrases $z$. Recent results with Kempa's conversion algorithm [SODA 2019] are $O(n / \log_{\sigma} n + r \log^{9} n + z \log^{9} n)$ time and $O(n / \log_{\sigma} n + r \log^{8} n)$ working space for the alphabet size $\sigma$ of the RLBWT. In this paper, we present a new conversion algorithm by improving Policriti and Prezza's conversion algorithm, in which general-purpose dynamic data structures are used. We argue that these dynamic data structures can be replaced and present new data structures for faster conversion. The time and working space of our conversion algorithm with new data structures are $O(n \min\{\log\log n, \sqrt{\log r / \log\log r}\})$ and $O(r)$, respectively.


1 Introduction

Converting a compressed format of a string into another compressed format without explicit decompression is one of the central research topics in string processing. Examples are conversions from the Lempel-Ziv 77 (LZ77) phrases of a string into a grammar-based encoding [10, 16], from a grammar-based encoding of a string into LZ78 phrases [1, 2], and from a grammar-based encoding of a string into another grammar-based encoding [17]. Such conversion is beneficial when one intends to convert a compressed string to a different compressed format without decompressing it.

LZ77 parsing, proposed in 1976 [13], is one of the most popular lossless data compression algorithms and is a greedy partition of a string such that each phrase is a previous occurrence of a substring or a character not occurring previously in the string. The run-length Burrows-Wheeler transform (RLBWT) [5] is a recent popular lossless data compression algorithm with a run-length encoded permutation of a string.

Policriti and Prezza [15] proposed the first conversion algorithm from the RLBWT of an input string to the LZ77 phrases of the reversed string. The basic idea of this algorithm is to carry out backward searches on the RLBWT and find a previous occurrence of each phrase using red-black trees storing a sampled suffix array and a dynamic data structure for solving the searchable partial sums with indels problem (e.g., [4, 9]). Since these data structures are updated frequently while scanning the string in the RLBWT format, the running time and working space are $O(n \log r)$ and $O(r)$ words, respectively, for string length $n$ and number of runs $r$ (i.e., the number of maximal runs of the same character).

Kempa [12] recently presented a conversion algorithm from the RLBWT to LZ77, which runs in $O(n / \log_{\sigma} n + r \log^{9} n + z \log^{9} n)$ time and $O(n / \log_{\sigma} n + r \log^{8} n)$ working space for the number $z$ of LZ77 phrases and the alphabet size $\sigma$ of the string in the RLBWT format. While the algorithm is fast, especially when $r$ and $z$ are small, its working space is larger than that of Policriti and Prezza’s algorithm in many cases.

In this paper, we present a new conversion algorithm from the RLBWT to LZ77 by improving Policriti and Prezza’s algorithm. Their algorithm adopts dynamic data structures for four queries: backward search, the LF function, access queries on the RLBWT, and the so-called range more than query (RMTQ). We argue that these dynamic data structures can be replaced for answering those queries and present new data structures for faster conversion. Our algorithm runs in $O(n \min\{\log\log n, \sqrt{\log r / \log\log r}\})$ time and $O(r)$ working space, which is faster than their algorithm while using the same working space (see Table 1 for a summary of conversion algorithms).

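Algorithm | Conversion time | Working space (words)
Policriti and Prezza (Thm. 7) [15] | $O(n \log r)$ | $O(r)$
Kempa (Thm. 7.3) [12] | $O(n / \log_{\sigma} n + r \log^{9} n + z \log^{9} n)$ | $O(n / \log_{\sigma} n + r \log^{8} n)$
This study | $O(n \min\{\log\log n, \sqrt{\log r / \log\log r}\})$ | $O(r)$
Table 1: Summary of conversion algorithms from the RLBWT to LZ77.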

2 Preliminaries

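Let $\Sigma$ be an ordered alphabet of size $\sigma$, $T$ be a string of length $n$ over $\Sigma$, and $|T|$ be the length of $T$. Let $T[i]$ be the $i$-th character of $T$ and $T[i..j]$ be the substring of $T$ that begins at position $i$ and ends at position $j$. $T[i..]$ denotes the suffix of $T$ beginning at position $i$, i.e., $T[i..n]$. Let $T^R$ be the reversed string of $T$, i.e., $T^R = T[n] T[n-1] \cdots T[1]$. For two integers $b$ and $e$ ($b \leq e$), $[b, e]$ represents $\{b, b+1, \ldots, e\}$. For two strings $T$ and $P$, $T \prec P$ means that $T$ is lexicographically smaller than $P$. $Occ(T, P)$ denotes all the occurrence positions of string $P$ in string $T$, i.e., $Occ(T, P) = \{i \mid P = T[i..i+|P|-1], 1 \leq i \leq n - |P| + 1\}$. A right occurrence of substring $T[i..j]$ is a subsequent occurrence position of $T[i..j]$ in $T$, i.e., any $p \in Occ(T, T[i..j]) \cap [i+1, n]$.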

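For a string $T$, character $c$, and integer $i$, $rank_c(T, i)$ returns the number of occurrences of $c$ in $T[1..i]$, i.e., $rank_c(T, i) = |\{j \in [1, i] \mid T[j] = c\}|$. $access(T, i)$ returns $T[i]$. $select_c(T, i)$ returns the position of the $i$-th occurrence of $c$ in $T$. If the number of occurrences of $c$ in $T$ is smaller than $i$, it returns $|T| + 1$, i.e., $select_c(T, i) = \min(\{j \mid rank_c(T, j) = i\} \cup \{|T| + 1\})$, where $\min$ returns the minimum value in a given set.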
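As a concrete reading of these three queries, the following minimal sketch implements them naively in Python with 1-based positions. The function names and the sentinel `len(s) + 1` that `select` returns when there are too few occurrences are assumptions made for illustration; a practical index would answer these queries without scanning the string.

```python
def rank(s: str, c: str, i: int) -> int:
    """Number of occurrences of c among the first i characters of s."""
    return s[:i].count(c)


def select(s: str, c: str, i: int) -> int:
    """1-based position of the i-th occurrence of c in s.

    Returns len(s) + 1 (an assumed sentinel) if c occurs fewer than i times.
    """
    seen = 0
    for pos, ch in enumerate(s, start=1):
        if ch == c:
            seen += 1
            if seen == i:
                return pos
    return len(s) + 1


def access(s: str, i: int) -> str:
    """Character at 1-based position i of s."""
    return s[i - 1]
```

Note that `rank` and `select` are inverse in the sense that `rank(s, c, select(s, c, i)) == i` whenever `c` occurs at least `i` times in `s`.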

For an integer $x$ and set $S$ of integers, a predecessor query $pred(S, x)$ returns the number of elements that are no more than $x$ in $S$, i.e., $pred(S, x) = |\{y \in S \mid y \leq x\}|$. A predecessor data structure of $S$ supports predecessor queries on $S$. For an integer array $A$ and two positions $b$ and $e$ ($b \leq e$) on $A$, a range maximum query (RMQ) returns the maximum value in $A[b..e]$, i.e., $\max\{A[i] \mid i \in [b, e]\}$, where $\max$ returns the maximum value in a given set.
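The two queries can be illustrated with a short Python sketch. It assumes the set is kept as a sorted list and answers the predecessor query by binary search, which takes logarithmic time; the range maximum query is a plain linear scan. Both are stand-ins for the faster static structures cited in the text, and the function names are hypothetical.

```python
from bisect import bisect_right


def pred(sorted_values: list, x: int) -> int:
    """Number of elements of the sorted list that are no more than x."""
    return bisect_right(sorted_values, x)


def rmq(values: list, b: int, e: int) -> int:
    """Maximum of values[b..e], with 1-based inclusive positions."""
    return max(values[b - 1:e])
```

With this counting convention, `pred` applied to the sorted starting positions of a partition directly yields the 1-based index of the block that contains a given position.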

Our computation model is a unit-cost word RAM with a machine word size of bits. We evaluate the space complexity in terms of the number of machine words. A bitwise evaluation of space complexity can be obtained with a multiplicative factor.

2.1 Suffix Array (SA) and SA interval

The suffix array ([14] of string is an integer array of size such that stores the starting position of -th suffix of in lexicographical order. Formally, is the permutation of such that holds. For example, and . The left figure in Figure 1 depicts the sorted suffixes of and of .

Since suffixes in the suffix array are sorted in the lexicographical order, suffixes with prefix occur continuously on an interval in the suffix array. We call this interval the SA interval of . Formally, the SA interval of string is interval such that holds for all . For the above example, the SA intervals of si and p are and , respectively.
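The definitions of the suffix array and the SA interval can be checked with a naive Python sketch. The example text `mississippi$`, with `$` as a unique smallest terminator, is an assumption chosen for illustration, and the construction simply sorts all suffixes rather than using a linear-time algorithm.

```python
def suffix_array(t: str) -> list:
    """1-based starting positions of the suffixes of t in lexicographic order."""
    return sorted(range(1, len(t) + 1), key=lambda i: t[i - 1:])


def sa_interval(t: str, sa: list, p: str):
    """1-based inclusive interval of suffix array ranks whose suffixes start with p.

    Returns None if p does not occur in t.
    """
    hits = [k for k, i in enumerate(sa, start=1) if t[i - 1:].startswith(p)]
    return (hits[0], hits[-1]) if hits else None


text = "mississippi$"
sa = suffix_array(text)
```

Because the suffixes are sorted, the ranks collected in `hits` are always contiguous, which is why reporting only the first and last rank suffices.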


Figure 1: Example for (left), , (center), and LF function (right) of .

2.2 BWT and backward search

The Burrows-Wheeler Transform (BWT) [5] of string $T$ is a permutation $L$ of $T$ obtained as follows. We sort all the rotations of $T$ in lexicographical order and take the last character of each rotation in sorted order. Formally, $L$ is the permutation of $T$ such that for all $i \in [1, n]$, $L[i] = T[SA[i] - 1]$ holds if $SA[i] \neq 1$ and $L[i] = T[n]$ holds otherwise. Similarly, let $F$ be a permutation of $T$ that consists of the first characters of the rotations in sorted order, i.e., $F[i] = T[SA[i]]$ for all $i \in [1, n]$. The middle table in Figure 1 represents $L$, $F$, and the sorted rotations of the example string.

A property of the BWT is that the $i$-th occurrence of character $c$ in $L$ (the BWT) corresponds to the $i$-th occurrence of $c$ in $F$ (the sorted first characters). In other words, let $x$ and $y$ be the positions of the $i$-th occurrence of $c$ in $L$ and $F$, respectively. Then $L[x]$ and $F[y]$ are the character at the same position in $T$. The LF function receives a position $x$ in $L$ as input and returns the corresponding position $y$ in $F$. Since $F$ consists of the sorted characters, $LF(x) = C[c] + rank_c(L, x)$ holds for all $x \in [1, n]$, where $c = L[x]$ and $C[c]$ is the number of occurrences of characters lexicographically less than $c$ in $L$.

Backward search computes the SA interval of for a given SA interval of and character using the BWT of . Function takes SA interval of and character as input and returns SA interval of . We compute backward search using , , and queries on . The LF function receives the first and last occurrences of in and returns the first and last positions of the SA interval of because represents the character preceding on for an integer . We can compute the first and last occurrences of using and queries on .

Formally, let and ; and hold if the length of the SA interval for is not zero.
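The mechanics of the LF function and backward search can be sketched in Python on an uncompressed BWT. This is a naive illustration under stated assumptions: the text ends with a unique smallest terminator `$`, positions are 1-based, counting is done by linear scans, and the function names are hypothetical. It is not the compressed data structure developed in this paper.

```python
def bwt(t: str) -> str:
    """BWT of t, assuming t ends with a unique smallest terminator."""
    sa = sorted(range(len(t)), key=lambda i: t[i:])  # 0-based suffix array
    # t[i - 1] is the character preceding suffix i; for i == 0 it wraps to t[-1].
    return "".join(t[i - 1] for i in sa)


def lf(l: str, i: int) -> int:
    """Map 1-based position i in the BWT to the matching position in F."""
    c = l[i - 1]
    smaller = sum(1 for ch in l if ch < c)  # characters smaller than c
    return smaller + l[:i].count(c)


def backward_search(l: str, b: int, e: int, c: str):
    """Given the SA interval [b, e] of P, return the SA interval of cP.

    Returns None if cP does not occur.
    """
    smaller = sum(1 for ch in l if ch < c)
    new_b = smaller + l[:b - 1].count(c) + 1
    new_e = smaller + l[:e].count(c)
    return (new_b, new_e) if new_b <= new_e else None


L = bwt("mississippi$")
```

Starting from the full interval `(1, len(L))` and feeding the characters of a pattern from right to left narrows the interval one character at a time, which is how the pattern `si` is located in the usage checked by the accompanying test.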

2.2.1 Run-length encoding and RLBWT

For a string $T$, the run-length encoding of $T$ is a partition of $T$ into substrings such that each substring is a maximal repetition of the same character in $T$. We call each such substring a run.

The RLBWT of $T$ is the BWT of $T$ encoded by the run-length encoding. The RLBWT is stored in $O(r)$ space because each run in the RLBWT can be encoded into a pair $(c, \ell)$, where $c$ is the character and $\ell$ is the length of the run. We call such a representation the compressed form of the RLBWT.
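The compressed form can be illustrated with a few lines of Python. The sketch below assumes the BWT is available as a plain string and encodes each run as a (character, length) pair; the input string is the BWT of the assumed example text `mississippi$`.

```python
from itertools import groupby


def run_length_encode(s: str) -> list:
    """Encode s as a list of (character, run length) pairs, one per maximal run."""
    return [(c, len(list(group))) for c, group in groupby(s)]


def run_length_decode(runs: list) -> str:
    """Inverse of run_length_encode."""
    return "".join(c * length for c, length in runs)


runs = run_length_encode("ipssm$pissii")
```

The number of pairs in `runs` is the quantity the text calls the number of runs.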

2.3 LZ77

For a string , LZ77 parsing [13] of the reversed greedily partitions into substrings (phrases) in right-to-left order such that each phrase is either (i) copied from a subsequent substring in  (target phrase) or (ii) an explicit character (character phrase). We denote LZ77 phrases of the reversed as .

Formally, let be the ending position of for , i.e., . Then , and is for if is a new character (i.e., ); otherwise, is the longest suffix of , which has right occurrences in  (i.e., ).

We can store LZ77 phrases in space because we encode a target phrase into the pair of the right occurrence and length of the phrase. We call such representation the compressed form of LZ77. For example, let . Then , and the compressed form of is .
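For reference, the phrases of the reversed string can be computed by a naive greedy procedure that first reverses the string explicitly and then parses it left to right. The sketch below is only a baseline for checking small examples, not the conversion algorithm of this paper. It assumes that a copied phrase may overlap its source (any earlier starting position counts as an occurrence) and encodes a target phrase as a (source, length) pair with 0-based positions in the reversed string; both the encoding and the function names are assumptions.

```python
def lz77(s: str) -> list:
    """Greedy LZ77 parsing of s, left to right, allowing overlapping sources.

    A character phrase is returned as the character itself; a target phrase
    is returned as (source position, length) with 0-based positions.
    """
    phrases = []
    i, n = 0, len(s)
    while i < n:
        length, src = 0, -1
        while i + length < n:
            j = s.find(s[i:i + length + 1])  # leftmost occurrence in s
            if j >= i:                       # no occurrence starting before i
                break
            src, length = j, length + 1
        if length == 0:
            phrases.append(s[i])
            i += 1
        else:
            phrases.append((src, length))
            i += length
    return phrases


def lz77_of_reverse(t: str) -> list:
    """LZ77 phrases of the reversed string of t."""
    return lz77(t[::-1])
```

Each step rescans the string, so the running time is far from the bounds discussed in this paper; the value of the sketch is that its output is easy to verify by hand.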

3 Policriti and Prezza’s conversion algorithm

Data: the RLBWT of and the position of on
Result:
;
  /* Initialization */
while  do
       /* Let P be T[x..x+l-1] */
       ;
        /* Access T[x] */
       ;
        /* Compute the SA interval of P */
       ;
       if  then /* Any right occurrence of P was not found */
             if  then
                   ;
                  
            else
                   ;
                   ;
                  
            ;
            
      else
             ;
            
      
Algorithm 1 Policriti and Prezza’s conversion algorithm.

Policriti and Prezza’s conversion algorithm [15] converts a compressed string of in the RLBWT format to another compressed string of in the LZ77 format while using data structures in space. The data structures support four queries: backward search, the LF function, access queries on the RLBWT , and RMTQ on the suffix array of . The RMTQ takes value , interval , and array as inputs and returns a value larger than on interval in .

The algorithm extracts the original string from in right-to-left order using the LF function and access queries on and computes LZ77 phrases sequentially using backward search and RMTQ. In each step, the algorithm extracts a suffix of the original string (i.e., current extracted string) then outputs the LZ77 phrase called current pattern in the suffix. In each step, the following two conditions for the current pattern are guaranteed: (i) the current pattern has at least one right occurrence or is the string of length ; (ii) the length of the current pattern is no less than that of the following current pattern (i.e., the next computed LZ77 phrase).

For computing the current extracted string in each step, the algorithm computes (i) the next character preceding the current extracted string, (ii) computes the SA interval corresponding to the current pattern using the backward search, and (iii) finds any right occurrence of the current pattern in the SA interval using RMTQ. If such right occurrence of the extended pattern does not exist, the algorithm outputs the current pattern as the next LZ77 phrase. When the length of the current pattern is , the next phrase is the next character. The algorithm repeats the above step until it extracts the whole string and outputs LZ77 phrases of . Algorithm 1 shows a pseudo code of the algorithm.

The algorithm uses two dynamic data structures: one for supporting backward search, the LF function, and access queries on the RLBWT in $O(\log r)$ time and $O(r)$ space; the other for supporting RMTQ in $O(\log r)$ time and $O(r)$ space, which is detailed in the next subsection.

3.1 RMTQ data structure

Figure 2: Left figure illustrates partitioned suffix array (PSA). Bold horizontal lines on suffix array represent partitions on PSA; hence, PSA is . Right figure illustrates three subintervals used in Section 4.1.

Policriti and Prezza presented an RMTQ data structure for fixed , which can be updated when is decremented. The construction for RMTQ data structure partitions the suffix array of into subarrays for every run in by the LF function, resulting in subarrays in total. Since the -th occurrence of any character in corresponds to the -th occurrence of in , the characters on every run in also occur continuously on . The can be partitioned into substrings such that each substring corresponds to a run in . We call such subarrays partitioned suffix arrays (PSAs). The left in Figure 2 shows an example for the PSA of in Figure 1.

The data structure does not store the whole PSA. Instead, it stores only the first and last values larger than on each subarray of the PSA and their corresponding positions on it. The red-black tree is used to store those positions. We call such a first position (respectively, last position) on the -th subarray for fixed a -open position (respectively, -close position) on the -th subarray, which is denoted as (respectively, ).

using the data structure is divided into two cases according to the relationship between the query interval and PSA: (A) there exists a subarray of the PSA completely including interval  (i.e., holds); and (B) such a subarray does not exist. Both cases are detailed as follows.

For case (B), the query interval is partitioned into some subarrays of the PSA and contains either a prefix or suffix of each subarray. Therefore, the interval contains at least one of the -open and -close positions if and only if the interval contains a value larger than . We select an answer of RMTQ from the -open positions and -close positions on subarrays in the PSA in time using the red-black tree storing the set of -open and -close positions.

For case (A), the is computed using its computation result in the previous iteration of Algorithm 1. The query interval represents the SA interval of a string , and the length of is at least because the SA interval of a character does not satisfy case (A). This means that the query interval in the previous iteration represents the SA interval of , and Algorithm 1 computes the answer  () in the previous iteration. Thus, in case (A), is the answer for because is the character on such that is at the same position on the suffix array.

Formally, let , be the starting position of in  (i.e., ), and be the permutation of such that holds. Then the PSA of is subarrays such that holds for all . Let and for and . Let be the set of -open and close positions in the suffix array of , i.e, . Then the following lemma holds. [[15]] Let be a substring of starting at a position and be the SA interval of . (1) In case (A), the length of is at least 2 and can return , where is the SA interval of . (2) In case (B), if holds, then can return the value at any position in ; otherwise holds.

4 Data structures for faster conversion

Result:
;
  /* Get the index of the subarray on the position b */
;
  /* Get the index of the subarray on the position e */
;
;
;
if  then
       return ;
      
else if  then
       return ;
      
else if  and  then
       return ;
      
else
       return ;
      
Algorithm 2 Our algorithm for case (B).

We improve the query time of the data structures for backward search, the LF function, access queries on the RLBWT, and RMTQ for case (B) by presenting two novel data structures: one supports the RMTQ for case (B), and the other supports backward search, the LF function, and access queries on the RLBWT. Our data structures use static predecessor data structures internally and improve the four query times by choosing the predecessor data structure with the fastest (estimated) query time. The results for our data structures are summarized in Table 1.

4.1 RMTQ data structure in case (B)

Our RMTQ data structure for fixed consists of four arrays of length : -open array , -close array , max value array , and starting position array . Data structures for the RMQ and predecessor query are built on and , respectively.

The -th element of the -open array (respectively, -close array) stores the pair of -open position (respectively, -close position) and its corresponding value on the -th subarray of the PSA. The -open and -close arrays can be updated when is decremented. The -th element of the max value array stores the maximum value on the -th subarray of the PSA. The -th element of the starting position array stores the starting position of the -th subarray on the PSA (i.e., ).

Our algorithm for RMTQ ( algorithm) consists of three parts: (i) it divides a given query interval into at most three subintervals; (ii) computes the RMTQ for each subinterval; and (iii) returns the final answer of the RMTQ for a given interval using answers for subintervals. The three subintervals are defined as follows: the first subinterval is on the first subarray of the PSA in the query interval, the second subinterval is on the last subarray of the PSA in the query interval, and the third subinterval is on the remaining subarrays. The right figure in Figure 2 illustrates those three subintervals. We compute the three subintervals using predecessor queries on for a given query interval.

The first subinterval is on a suffix of the first subarray; hence, the first subinterval has a value larger than if and only if the subinterval contains the -close position of the first subarray. Similarly, the second subinterval has a value larger than if and only if the subinterval contains the -open position of the last subarray. The third subinterval is on middle subarrays; hence, the third subinterval has a value larger than if and only if the maximal value is larger than on the subarrays. Therefore, we compute the RMTQ for the first subinterval (respectively, the second subinterval) by accessing the element of the first subarray on the -close array (respectively, the element of the last subarray on the -open array). We also compute the RMTQ for the third subinterval using the RMQ whose query interval covers the third subinterval on .

Algorithm 2 shows a pseudo code of our RMTQ algorithm. Since we can compute the RMQ in constant time (e.g., [7]), the query time of our RMTQ algorithm depends on the performance of predecessor queries on . We can also convert -open and close arrays to -open and close arrays by changing at most two elements because the conversion can change the only element of the subarray containing .
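The three-subinterval query for case (B) can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: positions are 0-based, the partition is given as a sorted list `starts` of subarray starting positions (any partition works for the logic, whereas the paper derives it from the runs of the RLBWT), the arrays are rebuilt from scratch for a fixed threshold `x` instead of being updated incrementally, the range maximum over middle subarrays is a linear scan standing in for a constant-time RMQ structure, and the query returns a position whose value exceeds `x`. All names are hypothetical.

```python
from bisect import bisect_right


def build(sa: list, starts: list, x: int):
    """Per-subarray open positions, close positions, and maximum values."""
    bounds = starts + [len(sa)]
    open_pos, close_pos, max_val = [], [], []
    for lo, hi in zip(bounds, bounds[1:]):
        big = [p for p in range(lo, hi) if sa[p] > x]
        open_pos.append(big[0] if big else None)    # first value larger than x
        close_pos.append(big[-1] if big else None)  # last value larger than x
        max_val.append(max(sa[lo:hi]))
    return open_pos, close_pos, max_val


def rmtq_case_b(b, e, x, starts, open_pos, close_pos, max_val):
    """Position p in [b, e] with value larger than x, or None if there is none."""
    i = bisect_right(starts, b) - 1  # subarray containing b
    j = bisect_right(starts, e) - 1  # subarray containing e
    if i >= j:
        raise ValueError("case (A): the interval lies inside one subarray")
    # First subinterval: a suffix of subarray i.
    if close_pos[i] is not None and close_pos[i] >= b:
        return close_pos[i]
    # Second subinterval: a prefix of subarray j.
    if open_pos[j] is not None and open_pos[j] <= e:
        return open_pos[j]
    # Third subinterval: the subarrays strictly between i and j.
    if j - i > 1:
        k = max(range(i + 1, j), key=lambda m: max_val[m])
        if max_val[k] > x:
            return open_pos[k]
    return None
```

The only non-constant work per query is the two binary searches on `starts` and the scan over the middle subarrays, which mirrors the observation in the text that the query time is dominated by the predecessor queries once a constant-time RMQ is available.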

Formally, let and be the ranks of the subarray of the PSA of which contains the positions and , respectively, i.e., and . For , let be the array of size such that for , if holds; otherwise . Similarly, let be the array of size such that for , if ; otherwise . Let be the integer array of length such that stores the maximal value in the -th subarray of the PSA of , i.e., for all . Let be the query time of the predecessor query on a set of size from a universe by a predecessor data structure of space. Then the following lemmas hold. In case (B), holds if and only if there exists an answer of in , or . Our data structure for case (B) can compute in time. The space usage is space.

Proof.

We construct the predecessor data structure for , which supports predecessor queries in time, and the RMQ data structure for which supports the RMQ in time using [7]. Then Lemma 4.1 holds by Algorithm 2. ∎

For given position of a on the suffix array of  (i.e., ), we can convert and to and in time using the predecessor data structure for .

Proof.

and hold for all , where . Therefore, we appropriately update and for . ∎

4.2 Data structure for backward search, LF, and access queries

We leverage the static data structures presented by Gagie et al. [8] for backward search, the LF function, and access queries on the RLBWT instead of Policriti and Prezza’s dynamic data structure. The static data structures answer the three queries by executing only a constant number of predecessor queries, so the three queries can be faster than those using Policriti and Prezza’s dynamic data structure.

Since the time for the predecessor query on the static data structures is proportional to the alphabet size of the input RLBWT, the alphabet size slightly increases the query times. For faster queries, we replace one of the static data structures for the predecessor queries depending on the alphabet size with an array of size . We obtain the following lemma. Let be the construction time for the predecessor data structure of space, which supports predecessor queries in time. We can construct the data structure of space, which supports , , and queries for in time, by processing the RLBWT of in time and working space.

Proof.

See Appendix. ∎

4.3 Improved Policriti and Prezza’s algorithm

Algorithm 1 using our data structures requires space, which can be when , e.g., . To bound the space usage to , we reduce the alphabet size of the RLBWT to at most by renumbering characters in .

We modify Algorithm 1 as follows: (i) We replace each character in with the rank of in  (i.e., ) and construct the new RLBWT over the alphabet of at most . We call the converted string the shrunk string of . (ii) We convert to LZ77 phrases of the string recovered from using Algorithm 1. (iii) We recover the LZ77 phrases of from that of using the inverse array , where the -th element of the array stores the original character of rank  (i.e., holds for any ). The modified algorithm works correctly because (1) our backward search queries receive only characters that appear in the RLBWT, (2) is the RLBWT of the shrunk string of , and (3) the form of LZ77 phrases is independent of the alphabet, i.e., we obtain the -th LZ77 phrase of by mapping characters in the -th LZ77 phrase of into the original characters.
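Steps (i) and (iii) of the modification can be sketched in Python on the compressed form of the RLBWT. The sketch assumes runs are given as (character, length) pairs and uses a comparison sort of the distinct characters, which is a simplification; the proof below obtains the ranks by a different construction. The names `shrink_runs` and `inverse` are hypothetical.

```python
def shrink_runs(runs: list):
    """Replace each run character by its 1-based rank among the distinct characters.

    Returns the shrunk runs and the inverse array, where inverse[k - 1] is the
    original character of rank k.
    """
    inverse = sorted({c for c, _ in runs})
    rank_of = {c: k for k, c in enumerate(inverse, start=1)}
    shrunk = [(rank_of[c], length) for c, length in runs]
    return shrunk, inverse


def restore_character(inverse: list, k: int):
    """Original character for rank k, used to map character phrases back."""
    return inverse[k - 1]
```

Because the renumbering preserves the relative order of characters, run lengths and run boundaries are unchanged, and only character phrases need to be mapped back through the inverse array.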

Finally, we obtain a conversion algorithm from RLBWT to LZ77 in time and space, and the algorithm depends on the performance of the static predecessor data structure. There exist two predecessor data structures such that (1) and hold [3] and (2) and hold [6]. Since we can compute and by processing the RLBWT in time, we choose the faster predecessor data structure between those predecessor data structures. Therefore, we obtain our results of our data structures, which are listed in Table 1.

Formally, the following lemmas and theorem hold. The following statements hold. (1) We can compute the shrunk string of and the inverse array in time and working space. (2) The is the RLBWT of the shrunk string of . (3) The and hold for and , where is the number of LZ77 phrase of . (4) We can convert the compressed form of to that of in constant time using for all .

Proof.

(1) We construct the string that consists of first characters in the run-length encoding of  (i.e., ) and construct the suffix array of in time and working space [11]. We construct the shrunk string of and using the suffix array and construct using . (2) Let and be the suffix arrays of and . Then holds for all . Therefore, the RLBWT of is . (3) This holds because holds for two integers . (4) If is a target phrase, we return the phrase as . Otherwise, we return as . ∎

We can construct the data structure of Lemma 4.1 in time and working space using the RLBWT data structure of Lemma 4.2.

Proof.

See Appendix. ∎

There exists a conversion algorithm from RLBWT to LZ77 in time and space.

Proof.

We already have described our algorithm in Section 4.3. Each step of Algorithm 1 additionally needs to update , and determine either case (A) or (B) for a given query interval. The total time is using predecessor queries on and Lemma 4.1. Therefore, Theorem 4.3 holds by Lemmas 4.14.24.3, and 4.3. ∎

5 Conclusion

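We presented a new conversion algorithm from RLBWT to LZ77 whose running time is governed by the query time of a static predecessor data structure. By leveraging the fastest static predecessor data structure using $O(r)$ space, we obtain a conversion algorithm that runs in $O(n \min\{\log\log n, \sqrt{\log r / \log\log r}\})$ time. This improves the previous result of $O(n \log r)$ time and $O(r)$ space.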

We have the following open problem: can we achieve the conversion from RLBWT to LZ77 in time and space? It is difficult to achieve the time complexity with our approach because any predecessor data structure for a set using words of bits requires query time in the worst case [3], where is the number of elements in the set and is the universe of elements. Kempa’s conversion algorithm can run faster than our algorithm, but it requires working space in the worst case. Thus, a new approach is required to solve this open problem.

References

  • [1] Hideo Bannai, Pawel Gawrychowski, Shunsuke Inenaga, and Masayuki Takeda. Converting SLP to LZ78 in almost linear time. In Proceedings of CPM, pages 38–49, 2013.
  • [2] Hideo Bannai, Shunsuke Inenaga, and Masayuki Takeda. Efficient LZ78 factorization of grammar compressed text. In Proceedings of SPIRE, pages 86–98, 2012.
  • [3] Paul Beame and Faith E. Fich. Optimal bounds for the predecessor problem and related problems. J. Comput. Syst. Sci., 65(1):38–72, 2002.
  • [4] Philip Bille, Anders Roy Christiansen, Patrick Hagge Cording, Inge Li Gørtz, Frederik Rye Skjoldjensen, Hjalte Wedel Vildhøj, and Søren Vind. Dynamic relative compression, dynamic partial sums, and substring concatenation. Algorithmica, 80(11):3207–3224, 2018.
  • [5] Michael Burrows and David J Wheeler. A block-sorting lossless data compression algorithm. Technical report, 1994.
  • [6] Johannes Fischer and Pawel Gawrychowski. Alphabet-dependent string searching with wexponential search trees. In Proceedings of CPM, pages 160–171, 2015.
  • [7] Johannes Fischer and Volker Heun. Space-efficient preprocessing schemes for range minimum queries on static arrays. SIAM J. Comput., 40(2):465–492, 2011.
  • [8] Travis Gagie, Gonzalo Navarro, and Nicola Prezza. Optimal-time text indexing in BWT-runs bounded space. In Proceedings of SODA, pages 1459–1477, 2018.
  • [9] Wing-Kai Hon, Kunihiko Sadakane, and Wing-Kin Sung. Succinct data structures for searchable partial sums with optimal worst-case performance. Theor. Comput. Sci., 412(39):5176–5186, 2011.
  • [10] Artur Jez. A really simple approximation of smallest grammar. Theor. Comput. Sci., 616:141–150, 2016.
  • [11] Juha Kärkkäinen and Peter Sanders. Simple linear work suffix array construction. In Proceedings of ICALP, pages 943–955, 2003.
  • [12] Dominik Kempa. Optimal construction of compressed indexes for highly repetitive texts. In Proceedings of SODA, pages 1344–1357, 2019.
  • [13] Abraham Lempel and Jacob Ziv. On the complexity of finite sequences. IEEE Transactions on information theory, 22(1):75–81, 1976.
  • [14] Udi Manber and Eugene W. Myers. Suffix arrays: A new method for on-line string searches. SIAM J. Comput., 22(5):935–948, 1993.
  • [15] Alberto Policriti and Nicola Prezza. LZ77 computation based on the run-length encoded BWT. Algorithmica, 80(7):1986–2011, 2018.
  • [16] Wojciech Rytter. Application of Lempel-Ziv factorization to the approximation of grammar-based compression. Theor. Comput. Sci., 302(1-3):211–222, 2003.
  • [17] Kensuke Sakai, Tatsuya Ohno, Keisuke Goto, Yoshimasa Takabatake, Tomohiro I, and Hiroshi Sakamoto. Repair in compressed space and time. CoRR, abs/1811.01472, 2018.

Appendix A: The proof of Lemma 4.2

We can compute and queries using , rank, select, access queries for and construct by processing the RLBWT of in time and working space. Therefore, we give the data structures for rank, select, and access queries for by the following lemmas. We can construct the data structure of space that supports queries for in time by processing the RLBWT of in time and working space.

Proof.

We compute rank queries for a character with a constant number of accesses to elements of two arrays and and a constant number of predecessor queries on . Array is the array storing sorted starting positions of runs of in , and is the integer array such that stores the rank of the first character of the -th run of character . We can construct and in time and working space by processing the RLBWT of and predecessor data structures for rank queries in time. Therefore, the lemma holds.

Formally, let be the number of runs of character in  (i.e., ). Array stores the starting position of the -th run of character for all and , for and and . If holds, then holds; otherwise, holds, where . Since represents the length of the -th run of character , we can compute using predecessor queries, i.e., holds if and only if holds, where . ∎
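The idea of answering rank queries from per-character run information and one predecessor query can be sketched in Python. The sketch makes several assumptions for illustration: runs are (character, length) pairs, positions are 1-based, the per-character arrays are kept in dictionaries, the second array stores the number of occurrences before each run rather than the rank of its first character, and binary search stands in for the static predecessor data structure. The names are hypothetical.

```python
from bisect import bisect_right


def build_rank_index(runs: list):
    """For each character: sorted run starts, occurrences before each run, total count."""
    starts, before, total = {}, {}, {}
    pos = 1
    for c, length in runs:
        starts.setdefault(c, []).append(pos)
        before.setdefault(c, []).append(total.get(c, 0))
        total[c] = total.get(c, 0) + length
        pos += length
    return starts, before, total


def rank_rl(index, c: str, i: int) -> int:
    """Number of occurrences of c among the first i characters of the encoded string."""
    starts, before, total = index
    if c not in starts:
        return 0
    k = bisect_right(starts[c], i)  # runs of c starting at or before position i
    if k == 0:
        return 0
    start = starts[c][k - 1]
    next_before = before[c][k] if k < len(before[c]) else total[c]
    run_length = next_before - before[c][k - 1]
    # Position i may lie inside the run or beyond its end.
    return before[c][k - 1] + min(i - start + 1, run_length)
```

Each query touches a constant number of array entries after the single binary search, so the predecessor query is the only step whose cost depends on the number of runs.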

We can construct the data structure of space that supports queries for in time by processing the RLBWT of in time and working space.

Proof.

We compute select queries for a character using three arrays , , and and the predecessor on . When holds, holds. Since represents the number of s in , we can compute using . We can construct and in time and working space by processing the RLBWT of and predecessor data structures for select queries in time. Therefore, the lemma holds. ∎

We can construct the data structure of space that supports access queries for in time by processing the RLBWT of in time and working space.

Proof.

We compute access queries using two arrays and and the predecessor on . Array is the sorted starting positions of runs in  (i.e., ), and is the first characters of runs in  (i.e., ). We can construct and in time by processing the RLBWT. Since we can compute by for a given integer , the lemma holds. ∎

Appendix B: The proof of Lemma 4.3

Proof.

Our data structure for RMTQ consists of , the RMQ data structure for , and the predecessor data structure for . We construct using -integer sequence. The -integer sequence is the subarray of such that subarray is on the run of a character in , i.e., and . We construct -integer sequence, , -integer sequence in using the LF function since holds for two positions and such that and hold. We obtain by concatenating the sequences. The total time is and the working space is .

We construct using the function and predecessor on in time since represents a position on the suffix array of for . We construct and in time since and . We construct the RMQ data structure for in time and working space using [7]. Therefore Lemma 4.3 holds since . ∎