1 Introduction
The increase in the amount of highly compressible data requiring efficient processing in recent years, particularly in the area of computational genomics [3, 4], has caused a spike of interest in dictionary compression. Its main idea is to reduce the size of the representation of data by finding repetitions in the input and encoding them as references to other occurrences. Among the most popular methods are Lempel–Ziv (LZ77) compression [31], the run-length Burrows–Wheeler transform (RLBWT) [5, 15], and grammar compression [6]. Although in theory LZ77 and RLBWT are separated by at most a factor of $\mathcal{O}(\log^2 n)$ (where $n$ is the length of the input text) [14, 19], the gap in practice is usually noticeable (as also confirmed by our experiments). RLBWT is the largest of the three representations in practice, but it is also the most versatile, supporting powerful suffix array and suffix tree queries [15]. LZ77, on the other hand, is the smallest, but its functionality includes only the easier LCE, random-access, and pattern matching queries [1, 7, 12, 13, 20]. Grammar compression occupies the middle ground between the two, supporting queries similar to LZ77 [25]. Navarro gives a comprehensive overview of these and related representations in [25, 26].
A major practical concern with these representations – RLBWT in particular – is how to construct them efficiently. Past efforts have focused on engineering efficient general algorithms for constructing the BWT and LZ77 [10, 17, 2, 16], but these are not applicable to the terabyte-scale datasets routinely found, e.g., in modern genomics [4]. Specialized algorithms for highly repetitive datasets have only been investigated recently. Boucher et al. [4] proposed a method for the efficient construction of the RLBWT using the concept of prefix-free parsing. The same problem was approached by Policriti and Prezza, and by Ohno et al. [28, 27], using a different approach based on a dynamic representation of the RLBWT. These methods represent the state of the art in the practical construction of the RLBWT.
A different approach to the construction of the RLBWT was recently proposed in [19]. The idea is to first compute the (exact or approximate) LZ77 parsing of the text, and then convert this representation into an RLBWT. Crucially, the LZ77 → RLBWT conversion takes only $\mathcal{O}(z\,\mathrm{polylog}\,n)$ time, where $z$ is the number of phrases in the parsing (here $\mathrm{polylog}\,n$ denotes $\log^{c} n$ for any constant $c$), i.e., it runs not only in compressed space but also in compressed time. The computational bottleneck is therefore shifted to the easier problem of computing or approximating the LZ77 parsing, which is the only step taking $\Omega(n)$ time. Internally, this new pipeline consists of three steps: text → (approximate) LZ77 → grammar → RLBWT, unsurprisingly aligning with the gradual increase in the size and complexity of these representations. Kosolobov et al. [23] recently proposed a fast and space-efficient algorithm to approximate LZ77, called ReLZ. The second and third steps in the pipeline, from the LZ77 parsing to the RLBWT, have not been implemented. The only known algorithm to convert LZ77 into a grammar in compressed time was proposed by Rytter [29], and is based on a special type of grammars called AVL grammars, whose distinguishing feature is that all subtrees in the parse tree satisfy the AVL property: the tree heights of the children of every nonterminal do not differ by more than one. The algorithm is rather complex, and until now has been considered impractical.
Our Contribution.
Our main contribution is a series of practical improvements to the basic variant of Rytter’s algorithm, and a fast and space-efficient implementation of this improved algorithm. Compared to the basic variant, ours produces a grammar that is always at least five times smaller, and crucially, the same holds for all intermediate grammars computed during the algorithm, yielding very low peak RAM usage. The resulting grammar is also no more than twice the size of the grammars produced by the best existing grammar compressors, such as RePair [24]. We further demonstrate that combining our new improved algorithm with ReLZ opens up a new path to the construction of the RLBWT. Our preliminary experiments indicate that at least a threefold speedup and the same level of reduction in RAM usage are possible.
The key algorithmic idea in our variant is to delay the merging of intermediate AVL grammars as much as possible, to avoid creating nonterminals that are then unused in the final grammar. We dub this variant lazy AVL grammars. We additionally incorporate Karp–Rabin fingerprints [18] to rewrite parts of the grammar on the fly and further reduce the grammar size. We describe two distinct versions of this technique, greedy and optimal, and demonstrate that both lead to reductions in the grammar size. As a side result of independent interest, we describe a fast and space-efficient data structure for the dynamic predecessor problem, in which the inserted key is always larger than all other elements currently in the set.
2 Preliminaries
Strings.
For any string $S$, we write $S[i..j]$, where $1 \le i, j \le |S|$, to denote a substring of $S$. If $i > j$, we assume $S[i..j]$ to be the empty string $\varepsilon$. By $S[i..j)$ we denote $S[i..j-1]$. Throughout the paper, we consider a string (text) $T[1..n]$ of $n$ symbols from an integer alphabet $\Sigma$. By $\mathrm{LCE}(i, j)$ we denote the length of the longest common prefix of suffixes $T[i..n]$ and $T[j..n]$.
Karp–Rabin Fingerprints.
Let $q$ be a prime number, and let $r \in [0..q)$ be chosen uniformly at random. The Karp–Rabin fingerprint of a string $S$ is defined as $\Phi(S) = \big(\sum_{i=1}^{|S|} S[i] \cdot r^{|S|-i}\big) \bmod q$. Clearly, if $S = S'$ then $\Phi(S) = \Phi(S')$. On the other hand, if $S \neq S'$ and $|S| = |S'|$, then $\Phi(S) \neq \Phi(S')$ with probability at least $1 - |S|/q$ [9]. In our algorithm we compare only substrings of $T$ of equal length. Thus, the number of different possible substring comparisons is less than $n^3$, and hence for any positive constant $c$, we can set $q$ to be a prime larger than $n^{c+4}$ (but still small enough to fit into $\mathcal{O}(1)$ words) to make the fingerprint function perfect with probability at least $1 - n^{-c}$.
LZ77 Compression.
We say that a substring $T[i..i+\ell)$ is a previous factor if it has an earlier occurrence in $T$, i.e., there exists a position $i' < i$ such that $T[i'..i'+\ell) = T[i..i+\ell)$. An LZ77-like factorization of $T$ is a factorization $T = f_1 f_2 \cdots f_z$ into non-empty phrases such that every phrase $f_j$ with $|f_j| > 1$ is a previous factor. Every phrase $f_j = T[i..i+\ell)$ that is a previous factor is encoded as a pair $(i', \ell)$, where $i' < i$ is such that $T[i'..i'+\ell) = T[i..i+\ell)$. If there are multiple choices for such $i'$, we choose one arbitrarily. The occurrence $T[i'..i'+\ell)$ is called the source of the phrase $f_j$. If $f_j = T[i..i+\ell)$ is not a previous factor, it is encoded as a pair $(T[i], 0)$.
The LZ77 factorization [31] (or the LZ77 parsing) of a string $T$ is an LZ77-like factorization constructed by greedily parsing $T$ from left to right into the longest possible phrases. More precisely, the $j$th phrase $f_j$ is the longest previous factor starting at position $1 + |f_1 \cdots f_{j-1}|$. If no previous factor starts there, then $f_j$ consists of a single symbol. We denote the number of phrases in the LZ77 parsing by $z$. For example, the text $T = \texttt{bbabaababababaababa}$ has LZ77 parsing $\texttt{b} \cdot \texttt{b} \cdot \texttt{a} \cdot \texttt{ba} \cdot \texttt{aba} \cdot \texttt{bababa} \cdot \texttt{ababa}$ with $z = 7$ phrases, and is encoded as the sequence $(\texttt{b},0), (1,1), (\texttt{a},0), (2,2), (3,3), (7,6), (6,5)$.
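The greedy parsing above can be sketched directly from the definition. The following is a quadratic-time illustration (practical implementations use suffix arrays or similar indexes); sources are allowed to overlap the phrase they encode, i.e., this is the self-referential variant.

```python
# A naive sketch of the greedy (self-referential) LZ77 parsing defined above.

def lz77_parse(t: str):
    """Return the phrases as pairs: (i', len) for previous factors
    (1-based source position), or (char, 0) for fresh symbols."""
    phrases, i, n = [], 0, len(t)
    while i < n:
        best_len, best_src = 0, 0
        for src in range(i):            # candidate source start (must be earlier)
            l = 0
            # the source may overlap the phrase itself (self-referential parsing)
            while i + l < n and t[src + l] == t[i + l]:
                l += 1
            if l > best_len:
                best_len, best_src = l, src
        if best_len == 0:
            phrases.append((t[i], 0))   # not a previous factor
            i += 1
        else:
            phrases.append((best_src + 1, best_len))
            i += best_len
    return phrases
```

On the example text this reproduces the parsing stated above: `lz77_parse("bbabaababababaababa")` yields the seven phrases encoded as `(b,0), (1,1), (a,0), (2,2), (3,3), (7,6), (6,5)`.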
Grammar Compression.
A context-free grammar is a tuple $G = (N, \Sigma, R, S)$, where $N$ is a finite set of nonterminals, $\Sigma$ is a finite set of terminals, and $R \subseteq N \times (N \cup \Sigma)^{*}$ is a set of rules. We assume $N \cap \Sigma = \emptyset$ and $S \in N$. The nonterminal $S$ is called the starting symbol. If $(A, \gamma) \in R$ then we write $A \to \gamma$. The language of $G$ is the set of strings over $\Sigma$ obtained by starting with $S$ and repeatedly replacing nonterminals with their expansions, according to the rules in $R$.
A grammar is called a straight-line grammar (SLG) if for any $A \in N$ there is exactly one production with $A$ on the left side, and all nonterminals can be ordered as $A_1, \ldots, A_{|N|}$ so that if $A_j$ occurs in the production of $A_i$ then $j < i$, i.e., the graph of grammar rules is acyclic. The unique $\gamma$ such that $A \to \gamma$ is called the definition of $A$ and is denoted $\mathrm{rhs}(A)$. In any SLG, for any $A \in N$ there exists exactly one string over $\Sigma$ that can be obtained from $A$. We call such a string the expansion of $A$, and denote it $\exp(A)$. We define the parse tree of $A \in N \cup \Sigma$ as a rooted ordered tree $\mathcal{T}(A)$, where each node $v$ is associated to a symbol $s(v) \in N \cup \Sigma$. The root of $\mathcal{T}(A)$ is a node $\rho$ such that $s(\rho) = A$. If $A \in \Sigma$ then $\rho$ has no children. If $A \in N$ and $\mathrm{rhs}(A) = B_1 \cdots B_k$, then $\rho$ has $k$ children and the subtree rooted at the $i$th child is a copy of $\mathcal{T}(B_i)$. The parse tree of $G$ is defined as $\mathcal{T}(G) = \mathcal{T}(S)$. The height of any $A \in N \cup \Sigma$ is defined as the height of $\mathcal{T}(A)$, and denoted $\mathrm{height}(A)$.
The idea of grammar compression is, given a text $T$, to compute a small SLG $G$ such that $L(G) = \{T\}$. The size of the grammar is measured by the total length of all definitions, and denoted $|G| = \sum_{A \in N} |\mathrm{rhs}(A)|$. Clearly, it is easy to encode any $G$ in $\mathcal{O}(|G|)$ space: pick an ordering of the nonterminals and write down the definitions of all variables, with nonterminals replaced by their number in the order.
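As an illustration of these definitions, the following sketch builds a tiny SLG and computes $\exp(\cdot)$ and $|G|$. The rule encoding (terminals as one-character strings, nonterminals as integers) is our own illustrative choice.

```python
# A tiny straight-line grammar with L(G) = {"abab"}:
#   rule 0 -> ab, rule 1 -> 0 0, starting symbol 1.
from functools import lru_cache

SLG_RULES = {0: ["a", "b"], 1: [0, 0]}

@lru_cache(maxsize=None)
def expansion(symbol):
    """exp(A): the unique string of terminals derivable from A."""
    if isinstance(symbol, str):            # terminals expand to themselves
        return symbol
    return "".join(expansion(s) for s in SLG_RULES[symbol])

def grammar_size(rules):
    """|G|: the total length of all definitions."""
    return sum(len(rhs) for rhs in rules.values())
```

Memoization (`lru_cache`) avoids re-deriving shared nonterminals, although materializing $\exp(S)$ still takes time proportional to its length, which can be exponential in $|G|$.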
3 AVL Grammars and the Basic Algorithm
An SLG is said to be in Chomsky normal form if, for every $A \in N$, it holds that $\mathrm{rhs}(A) = BC$ for some $B, C \in N$, or $\mathrm{rhs}(A) = c$, where $c \in \Sigma$. An SLG in Chomsky normal form is called a straight-line program (SLP). Rytter [29] defines an AVL grammar as an SLP that satisfies the following extra condition: for every $A \in N$ such that $\mathrm{rhs}(A) = BC$, it holds $|\mathrm{height}(B) - \mathrm{height}(C)| \le 1$. This condition guarantees that for every $A \in N$ (in particular for $S$), it holds $\mathrm{height}(A) = \mathcal{O}(\log |\exp(A)|)$ [29, Lemma 1].
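The AVL condition is easy to verify mechanically. The following sketch computes parse-tree heights and checks the balance condition; the rule encoding ($\mathrm{rhs}(A)$ is either a character or a pair of nonterminals) is again our own illustrative choice.

```python
def heights(rules):
    """height(A) for every nonterminal; terminals have height 0,
    so a nonterminal defined by a terminal has height 1."""
    h = {}
    def height(a):
        if a not in h:
            rhs = rules[a]
            if isinstance(rhs, str):           # rhs(A) = c
                h[a] = 1
            else:                              # rhs(A) = BC
                h[a] = 1 + max(height(rhs[0]), height(rhs[1]))
        return h[a]
    for a in rules:
        height(a)
    return h

def is_avl(rules):
    """Check |height(B) - height(C)| <= 1 for every rule A -> BC."""
    h = heights(rules)
    return all(
        isinstance(rhs, str) or abs(h[rhs[0]] - h[rhs[1]]) <= 1
        for rhs in rules.values()
    )
```

For example, a perfectly balanced SLP for `abab` passes the check, while a left-leaning chain for `aaaa` fails it.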
The main result presented in [29] is an algorithm that, given a non-self-referential version (in which $T[i..i+\ell)$ is a previous factor only if there exists $i'$ with $i' + \ell \le i$ such that $T[i'..i'+\ell) = T[i..i+\ell)$) of some LZ77-like parsing for a text of length $n$ consisting of $z$ phrases, computes in $\mathcal{O}(z \log n)$ time an AVL grammar generating $T$ and satisfying $|G| = \mathcal{O}(z \log n)$. Rytter’s construction was extended to allow self-references in [19, Theorem 6.1]. Our implementation of the basic as well as the improved Rytter’s algorithm works for the self-referential variant, but for simplicity here we only describe the algorithm for the non-self-referential variant.
The algorithm in [29] works in $z$ steps. It maintains a dynamically changing AVL grammar $G$ such that after the $j$th step is complete, there exists a nonterminal $A_j$ in $G$ such that $\exp(A_j) = f_1 \cdots f_j$, where $f_1 f_2 \cdots f_z$ is the input LZ77-like factorization. This implies that at the end there exists a nonterminal expanding to $T$. The algorithm does not delete any nonterminals between steps. At the end, it may perform an optional pruning of the grammar to remove the nonterminals not present in the parse tree $\mathcal{T}(G)$. This reduces the grammar size but not the peak memory usage of the algorithm.
The algorithm uses the following three procedures, each of which adds a nonterminal with a desired expansion to the grammar $G$, along with a bounded number of extra nonterminals:

$\mathsf{AddLetter}(c)$: Given $c \in \Sigma$, add a nonterminal $A$ with $\mathrm{rhs}(A) = c$ to the grammar $G$.

$\mathsf{Merge}(A, B)$: Given the identifiers of nonterminals $A$ and $B$ existing in $G$, add a nonterminal $C$ to $G$ that satisfies $\exp(C) = \exp(A)\exp(B)$. The difficulty of this operation is ensuring that the updated $G$ remains an AVL grammar. Simply setting $\mathrm{rhs}(C) = AB$ would violate the AVL condition in most cases. Instead, the algorithm performs a procedure similar to the concatenation of AVL trees [21, p. 474], taking $\mathcal{O}(1 + |\mathrm{height}(A) - \mathrm{height}(B)|)$ time and introducing $\mathcal{O}(1 + |\mathrm{height}(A) - \mathrm{height}(B)|)$ extra nonterminals.

$\mathsf{AddSubstring}(A, i, j)$: Given the identifier of a nonterminal $A$ existing in $G$, and two positions $i, j$ satisfying $1 \le i \le j \le |\exp(A)|$, add a nonterminal $C$ to $G$ that satisfies $\exp(C) = \exp(A)[i..j]$. To explain how this is achieved, we define the auxiliary procedure $\mathsf{Decompose}(A, i, j)$ that, given the same parameters as above, returns a sequence of nonterminals $B_1, \ldots, B_k$ existing in $G$ and satisfying $\exp(B_1) \cdots \exp(B_k) = \exp(A)[i..j]$. The nonterminals are found by performing two root-to-leaf traversals in the parse tree $\mathcal{T}(A)$. This takes $\mathcal{O}(\mathrm{height}(A))$ time and ensures $k = \mathcal{O}(\mathrm{height}(A))$. It is easy to see that given $B_1, \ldots, B_k$, we can now obtain $C$ in $\mathcal{O}(\mathrm{height}(A)^2)$ time using $\mathsf{Merge}$. In [29], it was however shown that if we always choose the shortest nonterminal to merge with its neighbor, the total runtime is $\mathcal{O}(\mathrm{height}(A))$ and only $\mathcal{O}(\mathrm{height}(A))$ extra nonterminals are introduced.
Using the above three procedures, the algorithm in [29] works as follows. Suppose we have already processed the leftmost $j - 1$ phrases. The $j$th step begins by creating a nonterminal $A$ satisfying $\exp(A) = f_j$. If $|f_j| = 1$ and $f_j$ is not a previous factor, the algorithm uses the procedure $\mathsf{AddLetter}$. Otherwise, $A$ is obtained as the output of $\mathsf{AddSubstring}(A_{j-1}, i', i' + |f_j| - 1)$, where $A_{j-1}$ is the nonterminal satisfying $\exp(A_{j-1}) = f_1 \cdots f_{j-1}$ and $T[i'..i'+|f_j|)$ is the source of phrase $f_j$. Finally, $A_j$ is obtained as the output of $\mathsf{Merge}(A_{j-1}, A)$. A single iteration thus takes $\mathcal{O}(\log n)$ time and adds $\mathcal{O}(\log n)$ extra nonterminals, for a total of $\mathcal{O}(z \log n)$ nonterminals over all steps.
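The traversal behind $\mathsf{Decompose}$ (the name we use above) can be sketched recursively: with precomputed expansion lengths, every call descends one level of the parse tree, so only $\mathcal{O}(\mathrm{height}(A))$ nodes are visited, mirroring the two root-to-leaf traversals. The rule encoding is the same illustrative one as before, and minimality of $k$ is not attempted.

```python
# SLP for "abab": 0 -> a, 1 -> b, 2 -> (0, 1), 3 -> (2, 2).
SLP_RULES = {0: "a", 1: "b", 2: (0, 1), 3: (2, 2)}
SLP_LEN = {0: 1, 1: 1, 2: 2, 3: 4}     # |exp(A)| for every nonterminal

def decompose(a, i, j):
    """Return nonterminals B_1..B_k with exp(B_1)...exp(B_k) = exp(a)[i..j]
    (positions 1-based, inclusive)."""
    n = SLP_LEN[a]
    if i == 1 and j == n:
        return [a]                      # the whole symbol is covered
    b, c = SLP_RULES[a]                 # a -> BC (a -> char is caught above)
    nb = SLP_LEN[b]
    if j <= nb:                         # interval lies inside exp(B)
        return decompose(b, i, j)
    if i > nb:                          # interval lies inside exp(C)
        return decompose(c, i - nb, j - nb)
    # interval straddles the B/C boundary: a suffix of B plus a prefix of C
    return decompose(b, i, nb) + decompose(c, 1, j - nb)
```

For instance, `decompose(3, 2, 3)` covers the middle substring `ba` of `abab` with the two existing symbols for `b` and `a`.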
4 Optimized Algorithm
Lazy Merging.
We start by observing that the main reason responsible for the large final grammar produced by the algorithm in Section 3 is the requirement that at the end of each step $j$, there exists a single nonterminal $A_j$ satisfying $\exp(A_j) = f_1 \cdots f_j$. We relax this requirement, and instead require only that at the end of step $j$, there exists a sequence of nonterminals $R_1, \ldots, R_m$ (the roots) such that $\exp(R_1) \cdots \exp(R_m) = f_1 \cdots f_j$. The algorithm explicitly maintains these nonterminals as a sequence of pairs $(e_1, R_1), \ldots, (e_m, R_m)$, where $e_t = \sum_{t'=1}^{t} |\exp(R_{t'})|$ is the ending position of $\exp(R_t)$ in $T$. The modified algorithm uses the following new procedures:

$\mathsf{MergeEnclosed}(i, j)$: Given two positions $i, j$ satisfying $1 \le i \le j \le n$, add to $G$ a nonterminal $C$ satisfying $\exp(C) = \exp(R_p) \cdots \exp(R_q)$, where $p$ and $q$ delimit the maximal range of roots whose expansions are entirely contained in $T[i..j]$. The positions $p$ and $q$ are found using binary search. The pairs of the sequence at positions between $p$ and $q$ are removed and replaced with a pair $(e_q, C)$. In other words, this procedure merges all the nonterminals from the current roots sequence whose expansion is entirely inside the given interval $[i..j]$. Merging of $R_p, \ldots, R_q$ is performed pairwise, using the $\mathsf{Merge}$ procedure. Note, however, that there is no dependence between the heights of adjacent nonterminals (in particular, they do not form a bitonic sequence, as in the algorithm in Section 3), and moreover, their number $q - p + 1$ is not bounded. To minimize the number of newly introduced nonterminals, we thus employ a greedy heuristic that always chooses the nonterminal with the smallest height among the remaining elements, and merges it with its shorter neighbor. We use a list to keep pointers between neighbors and a binary heap to maintain the heights. The merging thus runs in $\mathcal{O}((q - p + 1) \log n)$ time.
$\mathsf{DecomposeWithRoots}(i, j)$: Given two positions $i, j$ satisfying $1 \le i \le j \le n$, this procedure returns a sequence of nonterminals $B_1, \ldots, B_k$ satisfying $\exp(B_1) \cdots \exp(B_k) = T[i..j]$. First, it computes the positions $p$ and $q$, as defined in the description of $\mathsf{MergeEnclosed}$ above. It then returns the result of $\mathsf{Decompose}$ applied to the root overlapping position $i$, followed by $R_p, \ldots, R_q$, followed by the result of $\mathsf{Decompose}$ applied to the root overlapping position $j$ (appropriately handling the boundary cases, which for clarity we ignore here). In other words, this procedure finds a sequence of nonterminals that uses as many roots from the sequence $R_1, \ldots, R_m$ as possible, and runs the standard $\mathsf{Decompose}$ only for the boundary roots. Letting $k$ denote the size of the output, it runs in $\mathcal{O}(k + \log n)$ time.
Using the above additional procedures, our algorithm works as follows. Suppose that we have already processed the first $j - 1$ phrases. The $j$th step begins by computing a sequence of nonterminals $B_1, \ldots, B_k$ satisfying $\exp(B_1) \cdots \exp(B_k) = f_j$. If $|f_j| = 1$, we proceed as in Section 3. Otherwise, we first call $\mathsf{MergeEnclosed}(i', i' + |f_j| - 1)$, where $T[i'..i' + |f_j|)$ is the source of $f_j$. The sequence $B_1, \ldots, B_k$ is then obtained as the result of $\mathsf{DecomposeWithRoots}(i', i' + |f_j| - 1)$. Finally, $B_1, \ldots, B_k$ is appended to the roots sequence.
The above algorithm runs in $\mathcal{O}(z \log^2 n)$ time. To see this, note that first calling $\mathsf{MergeEnclosed}$ ensures that the output size of $\mathsf{DecomposeWithRoots}$ is $\mathcal{O}(\log n)$. Thus, each step appends only $\mathcal{O}(\log n)$ nonterminals to the roots sequence. The total time spent in $\mathsf{MergeEnclosed}$ is thus bounded by $\mathcal{O}(z \log^2 n)$, dominating the time complexity.
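The greedy merging order used inside $\mathsf{MergeEnclosed}$ can be simulated on heights alone. In the sketch below, a merge of heights $h_1, h_2$ is assumed to produce a root of height $\max(h_1, h_2) + 1$ (a worst-case simplification; the actual AVL concatenation can keep the height at $\max(h_1, h_2)$ and also creates the extra helper nonterminals). A doubly linked list provides the neighbor pointers and a binary heap the minimum, as described above.

```python
import heapq

def greedy_merge_order(heights):
    """Simulate the merge order; return (final height, list of merges)."""
    n = len(heights)
    h = list(heights)
    prev = list(range(-1, n - 1))        # doubly linked list over positions
    nxt = list(range(1, n + 1))
    alive = [True] * n
    heap = [(h[i], i) for i in range(n)]
    heapq.heapify(heap)
    merges = []
    for _ in range(n - 1):
        while True:                      # discard stale or dead heap entries
            hv, i = heapq.heappop(heap)
            if alive[i] and hv == h[i]:
                break
        p, q = prev[i], nxt[i]
        if p < 0:                        # no left neighbor
            j = q
        elif q >= n:                     # no right neighbor
            j = p
        else:                            # pick the shorter of the two
            j = p if h[p] <= h[q] else q
        a, b = (j, i) if j < i else (i, j)   # merged root keeps left slot a
        merges.append((h[a], h[b]))
        h[a] = max(h[a], h[b]) + 1       # worst-case height of the concat
        alive[b] = False
        nxt[a] = nxt[b]
        if nxt[b] < n:
            prev[nxt[b]] = a
        heapq.heappush(heap, (h[a], a))
    root = next(i for i in range(n) if alive[i])
    return h[root], merges
```

For example, on heights `[1, 5, 2]` the lowest root (height 1) is merged into its only neighbor first, and the result is then merged with the remaining root.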
Utilizing Karp–Rabin Fingerprints.
Our second technique is designed to detect the situation in which the algorithm adds a nonterminal $A$ to $G$ when there already exists some $A' \in N$ such that $\exp(A') = \exp(A)$. For any $A \in N$, we define $\Phi(A) = \Phi(\exp(A))$. Let us assume that there are no collisions between fingerprints. Suppose that given a nonterminal $A$, we can quickly compute $\Phi(A)$, and that given some fingerprint $\phi$, we can check if there exists $A \in N$ such that $\Phi(A) = \phi$. There are two places in the above algorithm (using lazy merging) where we utilize this to reduce the number of nonterminals:

Whenever during the greedy merge in $\mathsf{MergeEnclosed}$ we are about to call $\mathsf{Merge}$ for a pair of adjacent nonterminals $A$ and $B$, we instead first compute the fingerprint of their concatenation, and if there already exists $C \in N$ such that $\Phi(C) = \Phi(\exp(A)\exp(B))$, we use $C$ instead, avoiding the call to $\mathsf{Merge}$ and the creation of extra nonterminals.

Before appending the nonterminals $B_1, \ldots, B_k$ to the roots sequence at the end of the step, we check if there exists an equivalent but shorter sequence $B'_1, \ldots, B'_{k'}$, i.e., such that $\exp(B'_1) \cdots \exp(B'_{k'}) = \exp(B_1) \cdots \exp(B_k)$ and $k' < k$. We utilize that $k = \mathcal{O}(\log n)$, and run a quadratic algorithm (based on dynamic programming) to find the optimal (shortest) equivalent sequence. We then use that equivalent sequence in place of $B_1, \ldots, B_k$.
Observe that the above techniques cannot be applied during $\mathsf{Merge}$ itself, as the equivalent nonterminal could have a notably shorter/taller parse tree, violating the AVL property. Note also that the algorithm remains correct even when checking whether there exists $A$ having $\Phi(A) = \phi$ returns false negatives. Our implementation uses a parameter $p$, which is the probability of storing the mapping from $\Phi(A)$ to $A$ when adding the nonterminal $A$. The value of $p$ is one of the main parameters controlling the time-space tradeoff of the algorithm, as well as the size of the final grammar. The mapping of fingerprints to nonterminals is implemented as a hash table.
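The quadratic dynamic program for shortening $B_1, \ldots, B_k$ can be sketched as follows. The interfaces here (`fp` mapping nonterminals to fingerprints, `table` mapping sampled fingerprints back to nonterminals, and the constants `Q`, `R`) are our own illustrative assumptions matching the description above, not the exact interface of the implementation.

```python
Q, R = (1 << 61) - 1, 1_000_003       # illustrative fingerprint constants

def shorten(seq, fp, length, table):
    """Shortest sequence of existing nonterminals equivalent to seq."""
    k = len(seq)
    # phi[a][b] = fingerprint of exp(B_{a+1}) ... exp(B_b)
    phi = [[0] * (k + 1) for _ in range(k + 1)]
    for a in range(k):
        cur = 0
        for b in range(a, k):
            cur = (cur * pow(R, length[seq[b]], Q) + fp[seq[b]]) % Q
            phi[a][b + 1] = cur
    # best[b] = shortest equivalent sequence for the prefix B_1 ... B_b
    best = [None] * (k + 1)
    best[0] = []
    for b in range(1, k + 1):
        for a in range(b):
            if best[a] is None:
                continue
            # a span of length 1 is always realizable by the original symbol;
            # a longer span only if a sampled nonterminal matches it
            piece = seq[a] if b - a == 1 else table.get(phi[a][b])
            if piece is None:
                continue
            cand = best[a] + [piece]
            if best[b] is None or len(cand) < len(best[b]):
                best[b] = cand
    return best[k]

# Tiny demo: exp(1)="a", exp(2)="b", exp(3)="c"; nonterminals 5 and 9 with
# exp(5)="ab", exp(9)="abc" already exist and their fingerprints are sampled.
def _phi(s):
    v = 0
    for ch in s:
        v = (v * R + ord(ch)) % Q
    return v

FP = {1: _phi("a"), 2: _phi("b"), 3: _phi("c"), 5: _phi("ab"), 9: _phi("abc")}
LEN = {1: 1, 2: 1, 3: 1, 5: 2, 9: 3}
TABLE = {FP[5]: 5, FP[9]: 9}
```

The DP runs over the $k + 1$ boundaries of the sequence, so it performs $\mathcal{O}(k^2)$ fingerprint lookups, matching the quadratic bound stated above.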
5 Implementation Details
Storing Sequences.
Our implementation stores many sequences in which insertion only happens at the end (e.g., the sequence of nonterminals, which are never deleted during the algorithm). A standard approach to this is to use a dynamic array, i.e., a plain array that doubles its capacity once it gets full. Such an implementation achieves amortized $\mathcal{O}(1)$ insertion time, but suffers from a high peak RAM usage. On all systems we tried, the reallocation call that extends the array is not in-place. Since the peak RAM usage is critical in our implementation, we thus implemented our own dynamic array that, instead of a single allocated block, keeps a larger number of blocks (we use 32). This significantly reduces the peak RAM usage. We found the slowdown in the access time to be negligible.
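The idea can be sketched as follows (in Python for brevity; the actual implementation is in C++). The fixed block size below is an illustrative simplification of the 32-block layout used by the implementation: growing the array allocates one new block, so the whole contents are never copied and the transient doubling of memory is avoided.

```python
class BlockedArray:
    """Append-only array stored as a list of fixed-size blocks."""

    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.blocks = []
        self.n = 0

    def append(self, x):
        if self.n % self.block_size == 0:     # current block full (or empty)
            self.blocks.append([None] * self.block_size)
        self.blocks[self.n // self.block_size][self.n % self.block_size] = x
        self.n += 1

    def __getitem__(self, i):
        if not 0 <= i < self.n:
            raise IndexError(i)
        return self.blocks[i // self.block_size][i % self.block_size]

    def __len__(self):
        return self.n
```

Random access costs one division and two indexations, which is the negligible slowdown mentioned above.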
Implementation of the Roots Sequence.
The roots sequence undergoes predecessor queries, deletions (at arbitrary positions), and insertions (only at the end). Rather than using an off-the-shelf dynamic predecessor data structure (such as a balanced BST), we exploit the fact that insertions happen only at the end, as follows.
All roots are stored as a static sequence that only undergoes insertions at the end (using the space-efficient dynamic array implementation described above). Deleted elements are marked as deleted, but remain physically in the array. The predecessor query is implemented using binary search, with skipping of the elements marked as deleted. To ensure that the predecessor queries are efficient, we keep a counter of accesses to deleted elements. Once it reaches the current array size, we run a “garbage collector” that scans the whole array left to right and removes all the elements marked as deleted, eliminating all gaps. This way, the predecessor query is still efficient, except that its complexity becomes amortized.
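A minimal sketch of this structure follows (keys as plain integers; the real implementation stores (position, nonterminal) pairs). Keys arrive in increasing order, deletions are tombstones, and a counter of tombstone hits triggers the compaction, making predecessor queries amortized-efficient.

```python
from bisect import bisect_right

class AppendOnlyPredecessor:
    """Keys are inserted in strictly increasing order; deletions are lazy."""

    def __init__(self):
        self.keys, self.dead = [], []
        self.hits = 0                     # accesses to deleted slots

    def append(self, key):
        assert not self.keys or key > self.keys[-1]
        self.keys.append(key)
        self.dead.append(False)

    def delete(self, key):
        i = bisect_right(self.keys, key) - 1
        if i >= 0 and self.keys[i] == key:
            self.dead[i] = True

    def predecessor(self, x):
        """Largest live key <= x, or None."""
        i = bisect_right(self.keys, x) - 1
        while i >= 0 and self.dead[i]:    # skip tombstones
            self.hits += 1
            i -= 1
        result = self.keys[i] if i >= 0 else None
        if self.hits >= len(self.keys):   # amortized "garbage collection"
            live = [k for k, d in zip(self.keys, self.dead) if not d]
            self.keys, self.dead = live, [False] * len(live)
            self.hits = 0
        return result
```

Because the total work of all sweeps is bounded by the total number of tombstone accesses, each query costs $\mathcal{O}(\log m)$ amortized plus the skipping already paid for.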
Computing $\Phi(A)$ and $|\exp(A)|$ for $A \in N$.
During the algorithm, we often need to query the value of $\Phi(A)$ for some nonterminal $A$. In our implementation we utilize 64-bit fingerprints, and hence storing the value $\Phi(A)$ for every nonterminal is expensive. We thus only store $\Phi(A)$ for a subset of the nonterminals. The number of such elements is relatively small. To compute $\Phi(A)$ for any other $A$, we first obtain $\exp(A)$, and then compute the fingerprint from scratch. This operation is one of the most expensive in our algorithm, and hence whenever possible we avoid doing repeated queries.
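The fingerprint arithmetic involved can be sketched as follows. The key fact is that $\Phi$ composes over concatenation, so the fingerprint of a concatenation can be combined from the fingerprints and lengths of the parts without materializing the expansions. The constants below are illustrative stand-ins (the implementation uses 64-bit fingerprints, as noted above).

```python
Q = (1 << 61) - 1                     # an illustrative prime modulus q
R = 1_000_003                         # stand-in for the random base r in [0..q)

def fingerprint(s):
    """Phi(S) = (sum_i S[i] * r^(|S|-i)) mod q, via Horner's rule."""
    phi = 0
    for ch in s:
        phi = (phi * R + ord(ch)) % Q
    return phi

def concat_fingerprint(phi_b, phi_c, len_c):
    """Phi(exp(B) exp(C)) from Phi(B), Phi(C), and |exp(C)|."""
    return (phi_b * pow(R, len_c, Q) + phi_c) % Q
```

For instance, the fingerprint of a candidate merge of adjacent nonterminals $A$ and $B$ in Section 4 can be obtained as `concat_fingerprint(Phi(A), Phi(B), |exp(B)|)` once both fingerprints are known.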
As for the values $|\exp(A)|$, we observe that in most cases they fit in a single byte. Thus, we designate only a single byte for each value, and whenever $|\exp(A)|$ does not fit in it, we look up the value in an auxiliary array ordered by the number of the nonterminal, with access implemented using binary search.
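A sketch of this scheme follows (the sentinel value and layout are our illustrative choices): lengths below 255 live in a byte array indexed by nonterminal number, while larger ones are redirected to a sorted overflow array searched by binary search. Since nonterminals are appended in increasing order of their numbers, the overflow array stays sorted for free.

```python
from bisect import bisect_left

class ExpLengths:
    SENTINEL = 255

    def __init__(self):
        self.small = bytearray()     # one byte per nonterminal
        self.big = []                # sorted (nonterminal id, length) pairs

    def append(self, length):        # nonterminals are numbered in order
        if length < self.SENTINEL:
            self.small.append(length)
        else:
            self.small.append(self.SENTINEL)
            self.big.append((len(self.small) - 1, length))  # ids only grow

    def get(self, a):
        v = self.small[a]
        if v != self.SENTINEL:
            return v
        i = bisect_left(self.big, (a, 0))
        return self.big[i][1]
```

Since overflows are rare, the structure uses roughly one byte per nonterminal while still answering every query exactly.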
6 Experimental Results
Algorithms.
We performed experiments using the following algorithms:

BasicAVLG, our implementation of the algorithm to convert an LZ-like parsing into an AVL grammar, proposed by Rytter [29] and outlined in Section 3. Self-overlapping phrases are handled as in the full version of [19, Theorem 6.1]. The implementation uses the space-efficient dynamic arrays described in Section 5. This implementation is our baseline.

LazyAVLG, our implementation of the improved version of BasicAVLG, utilizing lazy merging and Karp–Rabin fingerprints, as described in Section 4. This is the main contribution of our paper. In some of the experiments below, we consider the algorithm with different values of the probability $p$ of sampling the Karp–Rabin hash of a nonterminal; our default value is chosen in the first experiment below. Our implementation (including also BasicAVLG) is available at https://github.com/dominikkempa/lz77-to-slp.

BigBWT, a semi-external algorithm constructing the RLBWT from the input text, proposed by Boucher et al. [4]. As shown in [4], if the input text is highly compressible, the working space of BigBWT is sublinear in the text length. We use the implementation from https://gitlab.com/manzai/Big-BWT.

ReLZ, an external-memory algorithm due to Kosolobov et al. [23] that, given a text on disk, constructs its LZ-like parsing [23]. The algorithm is faster than the currently best algorithms computing the exact LZ77 parsing. In practice, the ratio between the size of the resulting parsing and the size of the LZ77 parsing usually does not exceed 1.5. Its working space is fully tunable and can be specified arbitrarily. We use the implementation from https://gitlab.com/dvalenzu/ReLZ.

RePair, a linear-time algorithm to construct an SLG from the text, proposed by Larsson and Moffat [24]. Although no good upper bound is known on its output size, RePair produces grammars that in practice are smaller than those of any other grammar-compression method [24]. Its main drawback is that most implementations need space proportional to the text length (with a large constant) [11], and hence are not applicable to massive datasets. The only implementation using small space is [22], but it is not practical. There is also work on running RePair on a compressed input [30], but since it already requires the text as a grammar, it is not applicable in our case. In our experiments we therefore use RePair only as a baseline for the achievable grammar size. We use our own implementation, verified to produce grammars comparable to other implementations. We note that there exists recent work on optimizing RePair by utilizing maximal repeats [11]. The decrease in the grammar size, however, requires a potentially more expensive encoding that includes the length of the expansion for every nonterminal. For simplicity, we therefore use the basic version of RePair.
Table 1: Statistics of the testfiles used in the experiments ($n$ = text length, $\sigma$ = alphabet size, $r$ = number of runs in the BWT, $z$ = number of phrases in the LZ77 parsing).

File name        | n               | σ   | r           | n/r     | z          | n/z
cere             | 461 286 644     | 5   | 11 574 640  | 39.85   | 1 700 630  | 271.24
coreutils        | 205 281 778     | 236 | 4 684 459   | 43.82   | 1 446 468  | 141.91
einstein.en.txt  | 467 626 544     | 139 | 290 238     | 1611.18 | 89 467     | 5 226.80
influenza        | 154 808 555     | 15  | 3 022 821   | 51.21   | 769 286    | 201.23
dna.001.1        | 104 857 600     | 5   | 1 716 807   | 61.07   | 308 355    | 340.05
english.001.2    | 104 857 600     | 106 | 1 449 518   | 72.33   | 335 815    | 312.24
proteins.001.1   | 104 857 600     | 21  | 1 278 200   | 82.03   | 355 268    | 295.15
sources.001.2    | 104 857 600     | 98  | 1 213 427   | 86.41   | 294 994    | 355.45
chr19.1000       | 59 125 116 167  | 5   | 45 927 063  | 1287.37 | 7 423 960  | 7964.09
kernel           | 137 438 953 472 | 229 | 129 506 377 | 1061.25 | 30 222 602 | 4547.55
All implementations are in C++ and are largely sequential, allowing for a constant number of additional threads used for asynchronous I/O.
We also considered OnlineRLBWT, an algorithm proposed by Ohno et al. [27] that, given a text in a right-to-left streaming fashion, constructs its run-length compressed BWT (RLBWT) using working space proportional to the size of the output (the implementation is available from https://github.com/itomomoti/OnlineRlbwt). In a preliminary experiment we determined that, while using only about a third of the memory of BigBWT (on the 16 GiB prefix of the kernel testfile), the algorithm was about 10x slower than BigBWT.
Experimental Platform and Datasets.
We performed experiments on a machine equipped with two twelve-core 2.2 GHz Intel Xeon E5-2650 v4 CPUs with 30 MiB L3 cache and 512 GiB of RAM. The machine used distributed storage achieving an I/O rate >220 MiB/s (read/write).
The OS was Linux (CentOS 7.7, 64-bit) running kernel 3.10.0. All programs were compiled using g++ version 4.8.5 with the -O3 -DNDEBUG -march=native options. All reported runtimes are wall-clock (real) times. The machine had no other significant CPU tasks running. To measure the peak RAM usage of the programs we used the /usr/bin/time -v command.
The statistics of the testfiles used in our experiments are shown in Table 1. Shorter versions of files used in the scalability experiments are prefixes of the full files. We used the files from the Pizza & Chili repetitive corpus available at http://pizzachili.dcc.uchile.cl/repcorpus.html. We chose a sample of 8 real and pseudo-real files. Since all these files are relatively small (less than 512 MiB), we additionally include 2 large repetitive files that we assembled ourselves:

chr19.1000, a concatenation of 1000 versions of Human chromosome 19. The sequences were obtained from the 1000 Genomes Project [8]. One copy consists of roughly 58 million symbols.

kernel, a concatenation of 10.7 million source files from over 300 versions of the Linux kernel (see http://www.kernel.org/).
Karp–Rabin Sampling Rate.
The key parameter that controls the runtime, peak RAM usage, and the size of the grammar in our algorithm is the probability $p$ of including the Karp–Rabin fingerprint of a nonterminal in the hash table. In our first experiment, we study the performance of our LazyAVLG implementation for different values of the parameter $p$. We tested a range of values of $p$ between 0 and 1, and for each we measured the algorithm’s time and memory usage, and the size of the final grammar. The results are given in Figure 1.
While utilizing the Karp–Rabin fingerprints (i.e., setting $p > 0$) can notably reduce the final grammar size (up to 40% for the cere file), it is not worth using large values of $p$: doing so quickly increases the peak RAM usage (e.g., by about 2.3x for the sources.001.2 testfile), and this increase is not repaid by a significant further reduction of the grammar. The fourth panel in Figure 1 provides some insight into the reason for this. It shows the percentage of cases where, during the greedy merging of nonterminals enclosed by the source of the phrase, the algorithm is able to avoid merging two nonterminals and instead use an existing nonterminal. Having some fingerprints in the hash table turns out to be enough to avoid creating between 4–14% of the new nonterminals, but having more does not lead to a significant difference. Since peak RAM usage is the likely limiting factor for this algorithm (a slower algorithm is still usable, but exhaustion of RAM can prevent it from running entirely), we chose a small value of $p$ as the default in our implementation. This is the value used in the next two experiments.
Grammar Size.
In our second experiment, we compare the size of the grammar produced by LazyAVLG to BasicAVLG and RePair. In the comparison we also include the size of the grammar obtained by running BasicAVLG and then removing all nonterminals not reachable from the root. We ran the experiments on 8 of the 10 testfiles, as running RePair on the large files is prohibitively time-consuming. The results are given in Figure 2.
LazyAVLG produces grammars that are between 1.59x and 2.64x larger than RePair (1.95x on average). The resulting grammar is always at least 5x smaller than that produced by BasicAVLG, and also always smaller than BasicAVLG (pruned). Importantly, the RAM usage of our conversion is proportional to the size of the final grammar, whereas the algorithm computing the pruned version of BasicAVLG must first obtain the initial large grammar, increasing the peak RAM usage. This is a major practical concern, as described next.
Application in the Construction of BWT.
In our third and main experiment, we evaluate the potential of the method to construct the RLBWT by first computing/approximating the LZ77 parsing of the text $T$, and then converting the resulting compressed representation of $T$ into an RLBWT in $\mathcal{O}(z\,\mathrm{polylog}\,n)$ time (where $z$ denotes the number of factors) using the algorithm presented in [19]. The first step of this algorithm is the conversion of the input LZ-compressed text into a straight-line grammar. To our knowledge, the only known approach that achieves a runtime within the $\mathcal{O}(z\,\mathrm{polylog}\,n)$ bound is the algorithm of Rytter [29] studied in this paper. Thus, if the approach of [19] has a chance of being practical, the conversion step needs to be efficient. We remark that this step is a necessary but not a sufficient condition, and the remaining components of the algorithm in [19] need to first be engineered before the practicality of this approach is fully determined.
We compared the computational performance of BigBWT, ReLZ, and our implementation of LazyAVLG with the default value of $p$. We evaluated the runtime and peak RAM usage of these three algorithms on successively longer prefixes of the large testfiles (chr19.1000 and kernel). The RAM use of ReLZ was set to match the RAM use of BigBWT on the shortest input prefix. To allow a comparison with different methods in the future, we evaluated LazyAVLG on the LZ77 parsing rather than on the output of ReLZ. Thus, to obtain the performance of the pipeline ReLZ + LazyAVLG, one should multiply the runtime and RAM usage of LazyAVLG by the approximation ratio of ReLZ. We found this ratio to remain small on all kernel and chr19.1000 prefixes (with the peak reached on the largest prefixes). This puts the RAM use of LazyAVLG on the output of ReLZ still below that of ReLZ, and its runtime still well below that of the other two programs. The results are given in Figure 3.
At a RAM usage set to stay below that of BigBWT, the runtime of ReLZ is always below that of BigBWT. The reduction is by a factor of at least three for all prefixes of chr19.1000, and by at least 25% for all prefixes of kernel, with the gap decreasing with the prefix length. The runtime of LazyAVLG stays significantly below that of both other methods. More importantly, this also holds for LazyAVLG’s peak RAM usage. This shows that the construction of the RLBWT via LZ-parsing has the potential to be the fastest in practice, especially given that the approximation of LZ77 is the only step requiring $\Omega(n)$ time in the pipeline from the text to the RLBWT proposed in [19]. We further remark that although the construction of the RLBWT has received much attention in recent years, practical approximation of LZ77 is a relatively unexplored topic, potentially offering significant further speedups via new techniques or parallelization. This may be an easier task since, unlike BWT construction algorithms, LZ-approximation algorithms need not be exact.
References
 [1] Djamal Belazzougui, Travis Gagie, Paweł Gawrychowski, Juha Kärkkäinen, Alberto Ordóñez Pereira, Simon J. Puglisi, and Yasuo Tabei. Queries on LZbounded encodings. In DCC, pages 83–92. IEEE, 2015. doi:10.1109/DCC.2015.69.
 [2] Timo Bingmann, Johannes Fischer, and Vitaly Osipov. Inducing suffix and LCP arrays in external memory. In ALENEX, pages 88–102. SIAM, 2013. doi:10.1137/1.9781611972931.8.
 [3] Christina Boucher, Travis Gagie, Tomohiro I, Dominik Köppl, Ben Langmead, Giovanni Manzini, Gonzalo Navarro, Alejandro Pacheco, and Massimiliano Rossi. PHONI: Streamed matching statistics with multigenome references. In DCC, pages 193–202. IEEE, 2021. doi:10.1109/DCC50243.2021.00027.
 [4] Christina Boucher, Travis Gagie, Alan Kuhnle, Ben Langmead, Giovanni Manzini, and Taher Mun. Prefixfree parsing for building big BWTs. Algorithms Mol. Biol., 14(1):13:1–13:15, 2019. doi:10.1186/s1301501901485.
 [5] Michael Burrows and David J. Wheeler. A blocksorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, Palo Alto, California, 1994.
 [6] Moses Charikar, Eric Lehman, Ding Liu, Rina Panigrahy, Manoj Prabhakaran, Amit Sahai, and Abhi Shelat. The smallest grammar problem. Trans. Inf. Theory, 51(7):2554–2576, 2005. doi:10.1109/TIT.2005.850116.
 [7] Francisco Claude and Gonzalo Navarro. Selfindexed grammarbased compression. Fundam. Informaticae, 111(3):313–337, 2011. doi:10.3233/FI2011565.
 [8] The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature, 526:68–74, 2015. doi:10.1038/nature15393.
 [9] Martin Dietzfelbinger, Joseph Gil, Yossi Matias, and Nicholas Pippenger. Polynomial hash functions are reliable. In ICALP, pages 235–246, 1992. doi:10.1007/3540557199_77.
 [10] Paolo Ferragina, Travis Gagie, and Giovanni Manzini. Lightweight data indexing and compression in external memory. Algorithmica, 63(3):707–730, 2012.
 [11] Isamu Furuya, Takuya Takagi, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, and Takuya Kida. MRRePair: Grammar compression based on maximal repeats. In DCC, pages 508–517. IEEE, 2019. doi:10.1109/DCC.2019.00059.
 [12] Travis Gagie, Pawel Gawrychowski, Juha Kärkkäinen, Yakov Nekrich, and Simon J. Puglisi. LZ77based selfindexing with faster pattern matching. In LATIN, pages 731–742. Springer, 2014. doi:10.1007/9783642544231_63.
 [13] Travis Gagie, Paweł Gawrychowski, Juha Kärkkäinen, Yakov Nekrich, and Simon J. Puglisi. A faster grammarbased selfindex. In LATA, pages 240–251. Springer, 2012. doi:10.1007/9783642283321_21.
 [14] Travis Gagie, Gonzalo Navarro, and Nicola Prezza. On the approximation ratio of LempelZiv parsing. In LATIN, pages 490–503. Springer, 2018. doi:10.1007/9783319774046_36.
 [15] Travis Gagie, Gonzalo Navarro, and Nicola Prezza. Fully functional suffix trees and optimal text searching in BWTruns bounded space. J. ACM, 67(1):1–54, apr 2020. doi:10.1145/3375890.
 [16] Juha Kärkkäinen, Dominik Kempa, and Simon J. Puglisi. LempelZiv parsing in external memory. In DCC, pages 153–162. IEEE, 2014. doi:10.1109/DCC.2014.78.
 [17] Juha Kärkkäinen, Dominik Kempa, and Simon J. Puglisi. Parallel external memory suffix sorting. In CPM, pages 329–342. Springer, 2015. doi:10.1007/9783319199290_28.
 [18] Richard M. Karp and Michael O. Rabin. Efficient randomized patternmatching algorithms. IBM J. Res. Dev., 31(2):249–260, 1987. doi:10.1147/rd.312.0249.
 [19] Dominik Kempa and Tomasz Kociumaka. Resolution of the BurrowsWheeler Transform conjecture. In FOCS, pages 1002–1013. IEEE, 2020. Full version: https://arxiv.org/abs/1910.10631. doi:10.1109/FOCS46700.2020.00097.
 [20] Dominik Kempa and Nicola Prezza. At the roots of dictionary compression: String attractors. In STOC, pages 827–840. ACM, 2018. doi:10.1145/3188745.3188814.
 [21] Donald E. Knuth. The Art of Computer Programming, Vol. III: Sorting and Searching, 2nd Ed. Addison-Wesley, 1998.
 [22] Dominik Köppl, Tomohiro I, Isamu Furuya, Yoshimasa Takabatake, Kensuke Sakai, and Keisuke Goto. RePair in small space. Algorithms, 14(1):5, 2021. doi:10.3390/a14010005.
 [23] Dmitry Kosolobov, Daniel Valenzuela, Gonzalo Navarro, and Simon J. Puglisi. LempelZivlike parsing in small space. Algorithmica, 82(11):3195–3215, 2020. doi:10.1007/s00453020007226.
 [24] N. Jesper Larsson and Alistair Moffat. Offline dictionarybased compression. Proceedings of the IEEE, 88(11):1722–1732, 2000.
 [25] Gonzalo Navarro. Indexing highly repetitive string collections, part I: Repetitiveness measures. ACM Comput. Surv., 54(2):29:1–29:31, 2021. doi:10.1145/3434399.
 [26] Gonzalo Navarro. Indexing highly repetitive string collections, part II: Compressed indexes. ACM Comput. Surv., 54(2):26:1–26:32, 2021. doi:10.1145/3432999.
 [27] Tatsuya Ohno, Kensuke Sakai, Yoshimasa Takabatake, Tomohiro I, and Hiroshi Sakamoto. A faster implementation of online RLBWT and its application to LZ77 parsing. J. Discrete Alg., 5253:18–28, 2018. doi:10.1016/j.jda.2018.11.002.
 [28] Alberto Policriti and Nicola Prezza. LZ77 computation based on the runlength encoded BWT. Algorithmica, 80(7):1986–2011, 2018. doi:10.1007/s004530170327z.
 [29] Wojciech Rytter. Application of Lempel–Ziv factorization to the approximation of grammarbased compression. Theor. Comput. Sci., 302(1–3):211–222, 2003. doi:10.1016/S03043975(02)007776.
 [30] Kensuke Sakai, Tatsuya Ohno, Keisuke Goto, Yoshimasa Takabatake, Tomohiro I, and Hiroshi Sakamoto. RePair in compressed space and time. In DCC, pages 518–527. IEEE, 2019. doi:10.1109/DCC.2019.00060.
 [31] Jacob Ziv and Abraham Lempel. A universal algorithm for sequential data compression. Trans. Inf. Theory, 23(3):337–343, 1977. doi:10.1109/TIT.1977.1055714.