In this paper we consider the following problem: given an LZ77 representation of a string , decompress and output it as a stream in left-to-right order (without storing it explicitly). Our goal is to solve this problem in as little space as possible, i.e. close to the size of the compressed input string. This problem is fundamental and of great relevance in domains characterized by the production of huge amounts of repetitive data, where information has to be analyzed on-the-fly due to limitations in storage resources.
The folklore solution for the Lempel-Ziv decompression problem achieves optimal linear time, but requires random access to the whole string. A better solution is to convert LZ77 into a Straight-Line Program (i.e. a context-free grammar generating the text) of size . This conversion can be performed in optimal space and time . Then, the entire text can be decompressed and streamed in optimal time using just the space of the grammar. The problem has also been recently considered in  in the context of external-memory algorithms. To the best of our knowledge, no other attempts to solve the problem have been described in the literature. In particular, no solutions using space are known.
Our solution even works for decompressing any specified subsequence of with little overhead on top of the optimal running time and working space. As an application, we show that our techniques yield improved results for pattern matching problems on LZ77-compressed text.
1.1 Our contributions
The main contribution of this paper is to show that LZ77 decompression can be performed in linear space (in the compressed input’s size) and near-optimal time (i.e. almost linear in the length of the extracted subsequence). Even better, we provide two smooth space-time trade-offs that enable us to achieve either optimal time or linear space, or both when the alphabet size is constant. The first trade-off is particularly appealing on small alphabets, while the second dominates the first on large alphabets.
We formalize the LZ77 decompression problem as follows. The input consists of the LZ77 representation of a text and a list of text substrings encoded as pairs: . We decompress these substrings and output them (e.g. to a stream or to disk) character-by-character in the order . Since both the input strings and the output can be streamed (for example, from/to disk) we only count the working space used on top of the input and the output. Let the quantity denote the total number of characters to be extracted. Our main results are summarized in the following two theorems:
Let be a string of length from an alphabet of size compressed into an LZ77 representation with phrases. For any parameter , we can decompress any substrings of with total length in time using space.
Let be a string of length compressed into an LZ77 representation with phrases. For any parameter , we can decompress any substrings of with total length in time using space.
Theorems 1.1 and 1.1 lead to a series of new and non-trivial bounds for different algorithmic problems on LZ77. For instance, we provide a smooth time-space trade-off for decompressing in time using space for any constant . This is the optimal linear time and space for constant-sized alphabets. By combining Theorem 1.1 with the technique based on grammars, we show how to decompress in optimal time using space. Both bounds are strict improvements over the previous best complexity of time and space. See Section 4 and Corollaries 4 and 4 for details.
Our results also imply new trade-offs for the pattern matching and approximate pattern matching problems on LZ77-compressed texts. By showing how our techniques can be combined with existing pattern matching results we obtain the following:
Let be a string of length compressed into an LZ77 representation with phrases, let be a pattern of length and let be an algorithm that can detect an (approximate) occurrence of in (with at most errors) given and in time and space. Then, we can solve the same task in time and space. If reports all occurrences using time and space, then we can report all occurrences in time and space.
Let be a streaming algorithm that reports all (approximate) occurrences of a pattern (with at most errors) in a stream of length in time and space. Then, we can report all occurrences of in the LZ77 representation of a string in either:
time and space or
time and space.
The best known algorithm for detecting if pattern occurs in a string given and uses time and space. If we plug this into Theorem 1.1 we obtain time and space thereby reducing the factor in the space to at the cost of slightly increasing the time.
We also obtain new trade-offs for reporting all approximate occurrences of with at most errors. For example, if we plug in the Landau-Vishkin and Cole-Hariharan [17, 5] algorithms we can solve the problem in time using space for constant-sized alphabets or space for general alphabets. The previous best solution has the same time complexity but uses space. We refer to Section 5 for more details.
1.2 Related work
While the Lempel-Ziv decompression problem has not been studied much in the literature (probably due to the existing near-optimal solution based on grammars), fast LZ77 compression in small working space has lately attracted a lot of research in the field of compressed computation [8, 9, 19, 20]. Note that the problem considered in this paper differs subtly from the related random access problem, where the aim is to build a data structure taking space as close as possible to words and supporting efficient access queries to single characters. Unlike in our case, an overhead of at query time is not acceptable for random access queries; despite intense research in this direction [1, 16, 3, 10, 21, 4], it is still an open problem whether efficient random access queries (e.g. with logarithmic delay) can be achieved in space. The LZ77 decompression problem can be seen as a batched version of the random access problem. Under our setting, it is conceivable that we can break the time-space barrier imposed by the random access problem (and, indeed, our results show that this is the case).
We assume a standard unit-cost RAM model with word size and that the input is from an integer alphabet , and we measure space complexity in words unless otherwise specified. A string of length is a sequence of symbols from an alphabet of size . The string denoted is called a substring of . Let denote the empty string and let when . To ease the notation, let if and if . Let be shorthand for the interval and let be a special symbol that never occurs in the input text. A straight-line program (SLP) is an acyclic grammar in Chomsky normal form, i.e., a grammar in which every production expands either to two nonterminals or to a single terminal, and which generates exactly one string.
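As a concrete illustration of the SLP definition (a sketch of ours, not from the paper), the following expands a Chomsky-normal-form SLP, represented as a dictionary mapping each nonterminal to its pair of right-hand-side symbols, into the single string it generates; symbols absent from the dictionary are treated as terminals:

```python
def expand(slp, sym):
    """Expand symbol `sym` of an SLP in Chomsky normal form.
    `slp` maps each nonterminal to a pair (left, right); any symbol
    not appearing as a key is a terminal and expands to itself."""
    rule = slp.get(sym)
    if rule is None:
        return sym                        # terminal character
    left, right = rule
    return expand(slp, left) + expand(slp, right)
```

Since the grammar is acyclic and generates one string only, `expand` is well defined; a balanced SLP of height h additionally allows extracting any single character in O(h) time by descending from the root, which is the property exploited in Section 3.2.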
2.1 Lempel-Ziv 77 Algorithm
For simplicity of exposition we use the scheme given by Farach & Thorup . Map into and assume that is prefixed by in the negative positions, i.e. for and .
An LZ77 representation [18, 22] of is a string of the form . Let and , for . For to be a valid LZ77 representation of , we require that and that for . This guarantees that represents and clearly is uniquely defined in terms of .
We refer to the substring as the phrase of the representation, the substring as the source of the phrase and as the member of . We note that the restriction for all implies that a source and a phrase cannot overlap and thus we do not handle representations that are self-referential.
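To make the scheme concrete, here is a minimal decompressor sketch (ours, not the paper's) for such a non-self-referential parse, using integer characters 0..σ−1 and the alphabet-prefix convention: each phrase is a pair (s, l) copying l characters starting at position s of the alphabet-prefixed text, and every source ends before its phrase begins:

```python
def lz77_decompress(phrases, sigma):
    """Decompress a non-self-referential LZ77 parse in the
    Farach-Thorup scheme. `buf` holds the alphabet prefix followed by
    the text decoded so far; since each source lies strictly before its
    phrase, a plain slice copy is always valid."""
    buf = list(range(sigma))              # alphabet prefix at positions 0..sigma-1
    for s, l in phrases:
        buf.extend(buf[s:s + l])          # copy the phrase from its source
    return buf[sigma:]                    # drop the alphabet prefix
```

For example, over the alphabet {0, 1} the parse (0,1)(1,1)(2,2) decodes to 0,1,0,1: the first two phrases copy single characters out of the alphabet prefix, and the third copies the two characters just produced.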
2.2 Mergeable Dictionary
The Mergeable Dictionary problem is to maintain a dynamic collection of disjoint sets of elements from an ordered universe under the operations:
: Splits into two sets and . is removed from while and are inserted.
: Creates . is inserted into while and are removed.
[Iacono & Özkan ] There exists a data structure for the Mergeable Dictionary Problem supporting any sequence of split and merge operations in worst case time using space. We need an extended version of the mergeable dictionary that also supports the following operation:
for some such that for each : Creates the set . is removed from while is inserted.
Now, there is no guarantee that the sets in remain disjoint, so we lift this restriction and allow sets to store multiple elements with the same key. Iacono & Özkan  note that their data structure is easily extended to support the shift operation.
[Iacono & Özkan ] There exists a data structure for the Mergeable Dictionary Problem supporting any sequence of split, merge and shift operations in worst case time using space.
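For intuition only, the interface can be mimicked with a deliberately naive structure (sorted Python lists, linear-time operations); the class and method names here are our own and do not reflect the Iacono–Özkan implementation or its bounds:

```python
class NaiveMergeableDict:
    """Toy mergeable dictionary: each named set is a sorted list of
    keys. All operations are linear time; illustrative only."""

    def __init__(self):
        self.sets = {}

    def make(self, name, keys):
        self.sets[name] = sorted(keys)

    def split(self, name, pivot, left, right):
        # keys <= pivot go to `left`, the rest to `right`; `name` is removed
        s = self.sets.pop(name)
        self.sets[left] = [k for k in s if k <= pivot]
        self.sets[right] = [k for k in s if k > pivot]

    def merge(self, a, b, out):
        # union of two sets; duplicates are kept, as required once shift
        # can make formerly disjoint sets collide
        self.sets[out] = sorted(self.sets.pop(a) + self.sets.pop(b))

    def shift(self, name, delta):
        # add delta to every key; relative order is preserved
        self.sets[name] = [k + delta for k in self.sets[name]]
```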
3 LZ77 Induced Context
In this section we present the centerpiece of our algorithm. It builds on the fundamental property of LZ77 compression that any substring of a phrase also occurs in the source of that phrase. Our technique is to store a short substring around every phrase, which we call its context. The contexts are stored in a compressed form that allows faster substring extraction than that of LZ77. We then take advantage of this property when extracting a substring of by splitting it into short chunks, which in turn are extracted by repeatedly mapping them to the source of the phrase they are part of. Eventually, they end up as a substring of a context, from where they can be extracted efficiently.
In the following section we show how to map a set of short substrings of into a set of identical substrings that are all contained in the context of some phrase. The technique resembles what Farach & Thorup refer to as winding in . We show new applications of the technique and obtain better time complexity by using the mergeable dictionary result by Iacono & Özkan  presented in Section 2.2.
In Section 3.1 we show how to obtain an LZ77 parse for the subsequence of that includes only the context of every phrase. This representation can then be efficiently transformed into an SLP as shown in Section 3.2, using the online construction algorithm by Charikar et al.  or Rytter .
Recall that we assume is prefixed by the alphabet in the negative positions, that , and that is the starting position in of the phrase.
Let be a positive integer. The -context of a string (induced by the LZ77 representation of ) is the set of positions where either or there is some such that . If positions through are in the -context of , then we simply say “ is in the -context of ”.
Let be a positive integer. The -context string of , denoted , is the subsequence of that includes if and only if is in the -context of . We denote with the unique position in where such a position is mapped to (i.e. ).
We show how to map positions from to .
Let be an LZ77 representation of a string of length with phrases and let be a positive integer. Given sorted positions, in the -context of we can compute in time and space.
Let be the number of positions inside the -th LZ77 phrase that are not in the -context of .
Let be three integers initialized as follows: , , and . We keep the following two invariants:
if , then is the smallest integer such that , i.e. position is in the -th phrase, and
is the number of positions such that is in the -context of (i.e. is the length of the prefix of containing characters from ).
It is clear that (i) and (ii) hold at the beginning of our procedure. We now show how to iterate through the LZ77 phrases and compute the desired output in one pass.
Assume that we already computed (or none of them if ). To compute , we check whether , i.e. whether is in the -th phrase. If not, we find the phrase containing as follows. We set , and repeat until we find a value of that satisfies . It is clear that, at each step, is still the length of the prefix of containing characters from (i.e. invariant (ii) is maintained).
Once such a is found, we compute by simply adding to the relative position of inside its phrase, and subtract from this quantity if is within characters from the end of the phrase. In more detail, if , then . Otherwise, . The correctness of this computation is guaranteed by the way we defined in property (ii).
Note that is again the smallest integer such that (invariant (i)), so we can proceed with the same strategy to compute .
Overall, the algorithm runs in time and space.
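A simplified sketch of this one-pass mapping (our own rendering, ignoring the alphabet prefix in the negative positions): phrase j spans [bounds[j], bounds[j+1]), its K-context consists of its first and last K characters, and the "gap" of a phrase is max(0, length − 2K); the running variable `skipped` plays the role of the prefix-length invariant (ii):

```python
def map_to_context(phrase_starts, n, K, positions):
    """Map sorted text positions (each assumed to lie in the K-context,
    i.e. within the first or last K characters of its phrase) to their
    positions in the K-context string."""
    bounds = phrase_starts + [n]          # bounds[j+1] is the end of phrase j
    out = []
    j, skipped = 0, 0                     # current phrase; middle chars skipped so far
    for x in positions:
        while bounds[j + 1] <= x:         # advance to the phrase containing x
            skipped += max(0, (bounds[j + 1] - bounds[j]) - 2 * K)
            j += 1
        gap = max(0, (bounds[j + 1] - bounds[j]) - 2 * K)
        if x - bounds[j] < K:             # in the K-prefix of phrase j
            out.append(x - skipped)
        else:                             # in the K-suffix: also skip this phrase's middle
            out.append(x - skipped - gap)
    return out
```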
We use as shorthand for whenever is clear from context. The following properties follow from the definitions and Lemma 1 but will come in handy later on:
If are positions in the -context of and then .
If is in the -context of then
We now consider the following problem: given a substring of length at most , find a pair of integers such that , and is in the -context of . The following algorithm solves this problem for a set of substrings using space and time.
Let be an LZ77 representation of a string of length with phrases and let be a positive integer. The input is substrings of given as pairs of integers denoting start and end position: where for all . Let be a mergeable dictionary as given by Lemma 2.2 representing a single empty set .
For each of the pairs we insert an element at position in . Each element has associated its rank (i.e. its rank among the input pairs) as satellite information.
We now consider the members of one by one in reverse order. Member is processed as follows:
If , skip to the next member.
After processing all members, scan each set in to retrieve all the elements. Let denote the new position in of element . We sort elements according to their rank , and output the pairs where .
Correctness Let denote the position in of element at any point of the algorithm. We now show that before and after considering member for any element . Initially, so this is trivially true before the first iteration.
Assume by induction that this is true before considering member . If or , then will not be changed when considering member . Otherwise, , and thus is a substring of which also occurs at the same relative position in . Now is shifted such that thereby maintaining the relative position inside the two identical strings and it follows that still represents after considering member and thus also before considering member .
We now show that, for any element , the string is in the -context of after considering the last member. Observe that when considering member , every element positioned in is shifted to a position less than , because by definition. As we are considering the members in reverse order, this means that every element must end in a position such that either there is some such that or which concludes the proof of correctness.
Complexity Inserting elements in at positions in the range takes time. For every member of we do dictionary operations. All positions remain in the range thus this also takes total time . We can easily compute and store in time and space. Finally, the elements are sorted by their rank in time thus the total time is . We never store more than the elements, thus the total space is .
Algorithm 1 proves the following lemma:
Let be an LZ77 representation of a string of length with phrases and let be a positive integer. Given substrings of as pairs of integers where we can find pairs of integers such that , and is in the -context of using space and time.
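The idea behind Algorithm 1 can be illustrated by a per-substring "winding" walk (a naive sketch with hypothetical parameter names, not the batched mergeable-dictionary version): keep replacing the substring with its identical copy in the source of the phrase containing it until it lies within K characters of a phrase boundary:

```python
import bisect

def wind_to_context(starts, sources, n, K, a, L):
    """Map S[a, a+L) (with L <= K) to an equal substring in the
    K-context. starts[j]/sources[j] give the start of phrase j and of
    its source; the parse is non-self-referential, so every source
    lies strictly before its phrase."""
    bounds = starts + [n]
    while True:
        i = bisect.bisect_left(bounds, a)          # first boundary >= a
        for u in bounds[max(0, i - 1):i + 1]:      # nearest boundaries around a
            if u - K <= a and a + L <= u + K:
                return a                           # substring sits in the K-context
        j = bisect.bisect_right(starts, a) - 1     # phrase wholly containing S[a, a+L)
        a = sources[j] + (a - starts[j])           # jump to the identical copy
```

Because every source precedes its phrase, the position strictly decreases and the walk terminates; the mergeable dictionary lets the paper perform all these walks together, shifting whole sets of positions at once instead of one substring at a time.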
3.1 LZ77 Compressed Context
It is possible to obtain an LZ77 representation of the string directly from the LZ77 representation of . Informally, the idea is to split every phrase of into two new phrases consisting of respectively the first and last characters of the phrase. In order to find a source for these phrases, we use Algorithm 1 which finds an identical string that also occurs in .
We now give an algorithm that constructs an LZ77 representation of given the LZ77 parse of .
First we construct relevant pairs of integers representing substrings of by considering the members of one by one in order. Member is processed as follows:
If : Let be a relevant pair.
If : Let and be relevant pairs.
Otherwise : Let and be relevant pairs.
Each of the relevant pairs represents a prefix or a suffix of a phrase. The concatenation of these phrase prefixes and suffixes in left-to-right order is exactly the string . Let be a relevant pair created when considering the member of . Then we say that is the related source pair and clearly .
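The relevant pairs can be generated with a simple scan; in this sketch of ours the short-phrase cases are collapsed into a single pair covering the whole phrase, which concatenates to the same context string:

```python
def relevant_pairs(phrase_starts, n, K):
    """For each phrase emit the pairs (start, end), inclusive, covering
    its K-prefix and K-suffix; a phrase of length <= 2K is emitted as a
    single pair, since it lies entirely in the K-context."""
    bounds = phrase_starts + [n]
    pairs = []
    for j in range(len(phrase_starts)):
        u, v = bounds[j], bounds[j + 1]
        if v - u <= 2 * K:
            pairs.append((u, v - 1))       # whole phrase is in the context
        else:
            pairs.append((u, u + K - 1))   # K-prefix
            pairs.append((v - K, v - 1))   # K-suffix
    return pairs
```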
Note that the related source pairs might not be in the -context. We now use Algorithm 1 to find a pair of integers for each related source pair such that , and is in the -context of . We give the pairs in order of creation and this order is preserved by Algorithm 1. If is the output of Algorithm 1 then is the member of where and is computed using Lemma 3.
Correctness Let be the relevant pairs in order of creation, be the related source pairs, be the output of Algorithm 1, and let .
Our goal is to show that is a valid LZ77 representation of , that is: (i) the concatenation of the phrases of yields , (ii) phrases of are equal to their sources, and (iii) phrases of do not overlap their sources. Note that consists of at most phrases.
(i-ii) It follows directly from Definition 1 that the concatenation of the strings represented by the relevant pairs in order of creation is . Since and, by Lemma 1, then we also have that . Now, observe that since and are in the -context of then by Property 2 we have and . This proves properties (i) and (ii).
(iii) By definition of the LZ77 factorization of and since the substring represented by the pair is entirely contained in a phrase we must have and therefore, by Lemma 1, . But this means that and therefore, by Property 1, , i.e. property (iii) holds.
Complexity For every member of we create at most two relevant substrings, taking total time and space. Applying Algorithm 1 takes time and space. We can easily compute and store in time and space, and computing for every substring reported by Algorithm 1 takes total time and space using Lemma 1, thus the total time is and the total space is .
In summary, we have the following lemma:
Let be an LZ77 representation of a string of length with phrases. We can construct an LZ77 representation of with phrases in time and space.
3.2 SLP and Word Compressed Context
In this section we consider how to store the -context string of induced by an LZ77 in compressed form.
Our first solution is to decompress the context string efficiently and store it using word packing. First, we construct the LZ77 representation of using Lemma 2 and decompress it naively. Constructing the representation takes time while decompressing it takes linear time in its length . A string of length can be stored in words using word packing. We obtain:
Let be a string of length from an alphabet of size compressed into an LZ77 representation with phrases, and let be a positive integer. We can construct and store the -context of in time and space.
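Word packing itself is elementary; the following sketch of ours packs a string over an alphabet of size σ into an integer using ⌈log2 σ⌉ bits per character, so n characters occupy O(n log σ / w) machine words:

```python
def pack(chars, sigma):
    """Pack a sequence over the alphabet {0..sigma-1} into one integer,
    using w = ceil(log2 sigma) bits per character. Returns (packed, w)."""
    w = max(1, (sigma - 1).bit_length())
    x = 0
    for i, c in enumerate(chars):
        x |= c << (i * w)                 # character i occupies bits [i*w, (i+1)*w)
    return x, w

def unpack(x, w, n):
    """Recover n characters of w bits each from the packed integer."""
    mask = (1 << w) - 1
    return [(x >> (i * w)) & mask for i in range(n)]
```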
As an alternative solution, we show how to store the context string as an SLP. First we need the following theorem:
Let be a string of length compressed into an LZ77 representation with phrases, and let be a positive integer. We can build a balanced SLP for with height in time and space.
4 LZ77 Decompression
We now describe how to apply the techniques described in the previous section to extract arbitrary substrings of .
We first show how to extract a substring of length . Let be a string of length compressed into an LZ77 representation with phrases and let be a positive integer that we will fix later.
Split the string into consecutive substrings of length and process a batch of substrings at a time in left-to-right order. There are batches of substrings. A batch of substrings is processed in time using Lemma 3 thereby finding a substring in the -context of for every substring in the batch.
Using Corollary 3.2 these substrings can be extracted in time. Thus the time to extract and output all batches is while the time to construct the SLP is . The total space is . If we instead use Lemma 3.2 the time is unchanged while the space becomes .
Generalizing to substrings of total length the procedure is similar. We split each string into substrings of length (or less, if the string’s length is not a multiple of ) and process a batch of such substrings at a time. In total, there are no more than substrings of length at most . Now there are batches of substrings thus the total time becomes using space with Corollary 3.2 which proves Theorem 1.1. Using Lemma 3.2 the time remains the same while the space again becomes . Fixing for any constant the time is while the space is which proves Theorem 1.1.
For any parameter , we can decompress in time using space.
When decompressing the entire text, the number of substrings is and the total length is . First, we compute and with a simple scan of the LZ77 representation of . If , then we use Theorem 1.1. This solution runs in time and uses space.
Otherwise , but then it must be the case that (inequality follows from replacing with ). This means that the entire string can be packed in words, so we can decompress it naively in time and space. ∎
Note that, in particular, Corollary 4 achieves the optimal time and space when is constant. On large alphabets, we can further improve upon this result:
We can decompress in optimal time using space.
We compute and with a scan of the LZ77 representation of . If , then we use Theorem 1.1 with . This solution runs in time and uses space .
Otherwise . In this case, we simply build a grammar for of size using . The time to build the grammar is . We use this grammar to stream the text in optimal time. ∎
5 Applications in Pattern Matching
In this section we show how our techniques can be applied as a black box in combination with existing pattern matching results.
Let be a string of length and let be a pattern of length . The classical pattern matching problem is to report all starting positions of occurrences of in . In the approximate pattern matching problem we are given an error threshold in addition to and . The goal is to find all starting positions of substrings of that are within distance of under some metric, e.g. edit distance where the distance is the number of edit operations required to convert the substring to . When considering the compressed pattern matching problem, the string is given in some compressed form. Sometimes, we are only interested in whether or not occurs in .
Pattern matching on LZ77 compressed texts usually takes advantage of the property that any substring of a phrase also occurs in the source of the phrase. This means that if an occurrence of is contained in a single phrase, then there must also be an occurrence in the source of that phrase. The implication is that the occurrences of can be split into two categories: the ones that overlap two or more phrases and the ones that are contained inside a single phrase, usually referred to as primary and secondary occurrences, respectively . The secondary occurrences can be found from the primary ones in time and space , where is the number of phrases in the LZ77 representation of and is the total number of occurrences of in .
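A naive rendering of this primary-to-secondary propagation (quadratic time; the cited solutions achieve the stated bounds with interval data structures) processes the phrases left to right so that copies of copies are also reported; all parse values below are hypothetical:

```python
def all_occurrences(starts, sources, n, m, primary):
    """Derive all occurrences of a length-m pattern from the primary
    ones: any occurrence lying fully inside the source of a phrase is
    copied to the corresponding offset inside that phrase. Since every
    source precedes its phrase, a left-to-right pass also finds copies
    of previously discovered copies."""
    bounds = starts + [n]
    occ = set(primary)
    for j in range(len(starts)):
        s, l = sources[j], bounds[j + 1] - starts[j]
        # snapshot the current set: copy every occurrence inside this source
        for p in [p for p in occ if s <= p and p + m <= s + l]:
            occ.add(starts[j] + (p - s))
    return sorted(occ)
```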
Approximate pattern matching in small space
Consider the $-padded context string of , denoted , obtained by replacing each of the maximal substrings of that are not in the -context by a single copy of the symbol $. This string has length . Observe that all the primary occurrences of in are in this string, that any occurrence of in this string corresponds to a unique primary or secondary occurrence of in , and that we can map these occurrences to their positions in in time and space using the same technique as in Lemma 1, but adding the gap lengths instead of subtracting them.
We now prove Theorem 1.1. Let be the (approximate) pattern matching algorithm from the theorem using time and space. The idea is to run algorithm on the LZ77 representation of to find all the primary occurrences.
Our algorithm works as follows. First create the LZ77 representation of using Algorithm 2. If two consecutive phrases of are both induced by the same phrase of of length or more we add a phrase between them representing only symbol . This is easy to do as part of Algorithm 2 without changing its complexity and the result is exactly an LZ77 representation of with phrases. The pattern occurs in if and only if it occurs in . Thus we can run algorithm on to detect an (approximate) occurrence of in . Constructing takes time and space thus the total time becomes time and space.
All primary occurrences are found by finding all occurrences of in , mapping them to their position in and filtering out the secondary occurrences. Hereafter, all the secondary occurrences can be found in time and space . Mapping and filtering also takes time and space. Thus, if algorithm reports all (approximate) occurrences of in in time and space we can report all (approximate) occurrences of in in time and space.
We now prove Theorem 1.1. Let be the streaming algorithm from the theorem that reports all (approximate) occurrences of in a stream of length using time and space. The idea is to run algorithm on the string which we will stream in chunks to find all the primary occurrences. We use the same technique as above to first filter the primary occurrences and then find all the secondary occurrences in time and space.
We stream the string consisting of substrings of total length . We can easily compute when to output a $ during the substring extraction, thus using Theorems 1.1 or 1.1 we can stream in either time using space or time using space. The total time therefore becomes either:
time and space or
time and space.
Compressed Existence Gawrychowski  shows how to decide if occurs in given and the LZ77 representation of using time and space. Applying Theorem 1.1 we get the following: We can detect an occurrence of a pattern of length given the LZ77 representation of in time using space.
Approximate Pattern Matching By combining the Landau-Vishkin and Cole-Hariharan [17, 5] algorithms, all approximate occurrences with at most errors on a stream of length can be found in time using space. Gagie et al.  show how this algorithm can be used to solve the same problem given an LZ77 representation of in time and space. Applying Theorem 1.1 to the combined Landau-Vishkin and Cole-Hariharan algorithm we get the following new trade-offs: We can report all approximate occurrences of a pattern of length with errors given the LZ77 representation with phrases of a string of length in:
time using space or
time using space.
-  Djamal Belazzougui, Travis Gagie, Pawel Gawrychowski, Juha Kärkkäinen, Alberto Ordónez, Simon J Puglisi, and Yasuo Tabei. Queries on LZ-bounded encodings. In Data Compression Conference (DCC), 2015, pages 83–92. IEEE, 2015.
-  Djamal Belazzougui, Juha Kärkkäinen, Dominik Kempa, and Simon J Puglisi. Lempel-Ziv Decoding in External Memory. In International Symposium on Experimental Algorithms, pages 63–74. Springer, 2016.
-  Philip Bille, Gad M Landau, Rajeev Raman, Kunihiko Sadakane, Srinivasa Rao Satti, and Oren Weimann. Random access to grammar-compressed strings. In Proceedings of the twenty-second annual ACM-SIAM symposium on Discrete Algorithms, pages 373–389. Society for Industrial and Applied Mathematics, 2011.
-  M. Charikar, E. Lehman, D. Liu, R. Panigrahy, M. Prabhakaran, A. Sahai, and A. Shelat. The smallest grammar problem. IEEE Transactions on Information Theory, 51(7):2554–2576, 2005.
-  Richard Cole and Ramesh Hariharan. Approximate string matching: A simpler faster algorithm. SIAM Journal on Computing, 31(6):1761–1782, 2002. URL: https://doi.org/10.1137/S0097539700370527, arXiv:https://doi.org/10.1137/S0097539700370527, doi:10.1137/S0097539700370527.
-  Maxime Crochemore, Lucian Ilie, and William F Smyth. A simple algorithm for computing the Lempel Ziv factorization. In Data Compression Conference, 2008. DCC 2008, pages 482–488. IEEE, 2008.
-  M. Farach and M. Thorup. String Matching in Lempel—Ziv Compressed Strings. Algorithmica, 20(4):388–404, 1998. doi:10.1007/PL00009202.
-  Johannes Fischer, Travis Gagie, Paweł Gawrychowski, and Tomasz Kociumaka. Approximating LZ77 via small-space multiple-pattern matching. In Algorithms-ESA 2015, pages 533–544. Springer, 2015.
-  Johannes Fischer, I Tomohiro, and Dominik Köppl. Lempel Ziv Computation In Small Space (LZ-CISS). In Annual Symposium on Combinatorial Pattern Matching, pages 172–184. Springer, 2015.
-  T. Gagie, P. Gawrychowski, and S. J. Puglisi. Approximate pattern matching in LZ77-compressed texts. Journal of Discrete Algorithms, 32:64–68, 2015.
-  Paweł Gawrychowski. Pattern matching in Lempel-Ziv compressed strings: Fast, simple, and deterministic. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 6942 LNCS:421–432, 2011. arXiv:arXiv:1104.4203v1, doi:10.1007/978-3-642-23719-5_36.
-  John Iacono. Personal communication. 2018.
-  John Iacono and Özgür Özkan. Mergeable dictionaries. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 6198 LNCS(PART 1):164–175, 2010. arXiv:arXiv:1002.4248v2, doi:10.1007/978-3-642-14165-2_15.
-  Juha Kärkkäinen, Dominik Kempa, and Simon J Puglisi. Linear time Lempel-Ziv factorization: Simple, fast, small. In Annual Symposium on Combinatorial Pattern Matching, pages 189–200. Springer, 2013.
-  Juha Kärkkäinen and Esko Ukkonen. Lempel-Ziv parsing and sublinear-size index structures for string matching. In Proc. 3rd WSP, pages 141–155. Carleton University Press, 1996.
-  S. Kreft and G. Navarro. On compressing and indexing repetitive sequences. Theoretical Computer Science, 483:115–133, 2013.
-  Gad M Landau and Uzi Vishkin. Fast parallel and serial approximate string matching. Journal of Algorithms, 10(2):157 – 169, 1989. URL: http://www.sciencedirect.com/science/article/pii/0196677489900102, doi:https://doi.org/10.1016/0196-6774(89)90010-2.
-  Abraham Lempel and Jacob Ziv. On the complexity of finite sequences. Information Theory, IEEE Transactions on, 22(1):75–81, 1976.
-  Alberto Policriti and Nicola Prezza. Fast online Lempel-Ziv factorization in compressed space. In International Symposium on String Processing and Information Retrieval, pages 13–20. Springer, 2015.
-  Alberto Policriti and Nicola Prezza. LZ77 computation based on the run-length encoded BWT. Algorithmica, pages 1–26, 2017.
-  Wojciech Rytter. Application of Lempel-Ziv factorization to the approximation of grammar-based compression. Theoretical Computer Science, 302(1-3):211–222, 2003. doi:10.1016/S0304-3975(02)00777-6.
-  Jacob Ziv and Abraham Lempel. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, 23(3):337–343, May 1977. doi:10.1109/TIT.1977.1055714.