Lempel-Ziv-like Parsing in Small Space

03/05/2019
by Daniel Valenzuela et al.

Lempel-Ziv (LZ77 or, briefly, LZ) is one of the most effective and widely used compressors for repetitive texts. However, the existing efficient methods for computing the exact LZ parsing need linear or close to linear space to index the input text while the parsing is built, which is prohibitive for long inputs. An alternative is Relative Lempel-Ziv (RLZ), which indexes only a fixed reference sequence whose size can be controlled. Deriving the reference sequence by sampling the text yields reasonable compression ratios for RLZ, but performance is not always competitive with that of LZ and depends heavily on the similarity of the reference to the text. In this paper we introduce ReLZ, a technique that uses RLZ as a preprocessor to approximate the LZ parsing using little memory. RLZ is first used to produce a sequence of phrases, and these are regarded as metasymbols that are input to LZ for a second-level parsing on a (most often) drastically shorter sequence. This parsing is finally translated into one on the original sequence. We analyze the new scheme and prove that, like LZ, it achieves the kth-order empirical entropy compression n H_k + o(n log σ) with k = o(log_σ n), where n is the input length and σ is the alphabet size. In fact, we prove this entropy bound not only for ReLZ but for a wide class of LZ-like encodings. Then, we establish a lower bound on the ReLZ approximation ratio, showing that the number of phrases in it can be Ω(log n) times larger than the number of phrases in LZ. Our experiments show that ReLZ is orders of magnitude faster than other alternatives for computing the (exact or approximate) LZ parsing, at the reasonable price of an approximation factor below 2.0 in practice, and sometimes below 1.05, relative to the size of LZ.
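
The three stages described above (RLZ against a reference, LZ over the resulting metasymbols, translation back to the original text) can be illustrated with a small sketch. It is not the paper's implementation: it assumes the reference is simply a prefix of the text, it uses naive quadratic matching instead of a compact index, and it compares metasymbols by their string content. The names `lz_parse`, `rlz_parse`, `relz_parse` and `decode` are hypothetical and chosen for illustration only.

```python
def lz_parse(seq):
    """Naive greedy LZ77 over any indexable sequence (slow, for illustration).
    Returns (src, length) pairs; length 0 marks a literal at that position."""
    phrases, i, n = [], 0, len(seq)
    while i < n:
        best_src, best_len = -1, 0
        for j in range(i):                      # try every earlier start
            l = 0
            while i + l < n and seq[j + l] == seq[i + l]:
                l += 1                          # the source may overlap the phrase
            if l > best_len:
                best_src, best_len = j, l
        phrases.append((best_src, best_len))
        i += max(best_len, 1)
    return phrases


def rlz_parse(text, ref):
    """Greedy RLZ: cut text into longest substrings that occur in ref.
    Returns (pos_in_ref, string); pos is -1 for a character absent from ref."""
    out, i = [], 0
    while i < len(text):
        l = 0
        while i + l < len(text) and text[i:i + l + 1] in ref:
            l += 1
        if l == 0:
            out.append((-1, text[i]))
            i += 1
        else:
            s = text[i:i + l]
            out.append((ref.find(s), s))
            i += l
    return out


def relz_parse(text, ref_len):
    """Approximate LZ parse of text: RLZ first, then LZ over the metasymbols."""
    ref = text[:ref_len]                        # assumption: reference = prefix
    # One metasymbol per reference character, then one per RLZ phrase.
    meta = [(-1, c) for c in ref] + rlz_parse(text[ref_len:], ref)
    symbols = [s for _, s in meta]
    starts = [0]                                # text offset of each metasymbol
    for s in symbols:
        starts.append(starts[-1] + len(s))
    out, k = [], 0
    for src, length in lz_parse(symbols):       # second-level parsing
        if length > 0:                          # copy of earlier metasymbols
            out.append(("copy", starts[src], starts[src + length] - starts[src]))
            k += length
        else:                                   # metasymbol seen for the first time
            pos, s = meta[k]
            out.append(("lit", s) if pos < 0 else ("copy", pos, len(s)))
            k += 1
    return out


def decode(phrases):
    """Rebuild the text from ("lit", ch) and ("copy", src, length) phrases."""
    out = []
    for p in phrases:
        if p[0] == "lit":
            out.append(p[1])
        else:
            for t in range(p[2]):               # char by char handles overlaps
                out.append(out[p[1] + t])
    return "".join(out)
```

In this sketch only the reference prefix is searched at character level. The second-level parse works on the metasymbol sequence, which is typically much shorter than the text when the reference resembles the rest of the input. Calling `decode(relz_parse(text, ref_len))` should return `text` for any `ref_len`, and the number of phrases can be compared with `len(lz_parse(text))` to observe the approximation factor on small inputs.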


