On the Approximation Ratio of Greedy Parsings

03/26/2018
by Gonzalo Navarro, et al.

Shannon's entropy is a clear lower bound for statistical compression. The situation is not so well understood for dictionary-based compression. A plausible lower bound is b, the least number of phrases of a general bidirectional parse of a text, where phrases can be copied from anywhere else in the text. Since computing b is NP-complete, a popular gold standard is z, the number of phrases in the Lempel-Ziv parse of the text, which is the optimal one when phrases can be copied only from the left. While z can be computed in linear time with a greedy algorithm, almost nothing has been known for decades about its approximation ratio with respect to b. In this paper we prove that z = O(b log(n/b)), where n is the text length. We also show that the bound is tight as a function of n, by exhibiting a string family where z = Ω(b log n). Our upper bound is obtained by building a run-length context-free grammar based on a locally consistent parsing of the text. Our lower bound is obtained by relating b with r, the number of equal-letter runs in the Burrows-Wheeler transform of the text. We proceed by observing that Lempel-Ziv is just one particular case of greedy parse, and introduce a new parse where phrases can only be copied from lexicographically smaller text locations. We prove that the size v of the smallest parse of this kind has properties similar to z, including the same approximation ratio with respect to b. Interestingly, we also show that v = O(r), whereas r = o(z) holds on some particular classes of strings. On our way, we prove other relevant bounds between compressibility measures.
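As a rough illustration of two of the measures named in the abstract, the sketch below computes z with a naive greedy left-to-right parse and r by sorting all rotations of the text. It is not the authors' algorithm: the function names `lz_parse` and `bwt_runs`, the `$` sentinel, the quadratic-time scanning, and the convention that an unseen character forms a one-character phrase are all assumptions made for this example. The linear-time computation mentioned in the abstract needs suffix-based data structures that are omitted here.

```python
def lz_parse(text):
    """Greedy parse (assumed convention): each phrase is the longest prefix of
    the remaining text that also starts at an earlier position (source and
    target may overlap), or a single literal character if there is none."""
    phrases = []
    i, n = 0, len(text)
    while i < n:
        best_len = 0
        for j in range(i):  # candidate source start, strictly to the left of i
            k = 0
            while i + k < n and text[j + k] == text[i + k]:
                k += 1
            best_len = max(best_len, k)
        step = max(best_len, 1)  # fall back to one literal character
        phrases.append(text[i:i + step])
        i += step
    return phrases


def bwt_runs(text, sentinel="$"):
    """Return the Burrows-Wheeler transform of text + sentinel and its number
    of equal-letter runs. Assumes the sentinel does not occur in text and
    sorts before every other character."""
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    last = "".join(rot[-1] for rot in rotations)
    runs = sum(1 for i, c in enumerate(last) if i == 0 or c != last[i - 1])
    return last, runs


if __name__ == "__main__":
    for t in ["aaaa", "abab", "banana"]:
        z = len(lz_parse(t))
        bwt, r = bwt_runs(t)
        print(t, "z =", z, "phrases =", lz_parse(t), "BWT =", bwt, "r =", r)
```

On such short inputs the values of z and r say little about the asymptotic bounds; the sketch is only meant to make the definitions concrete.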


research
10/04/2019

Towards a Definitive Measure of Repetitiveness

Unlike in statistical compression, where Shannon's entropy is a definiti...
research
12/11/2018

LZRR: LZ77 Parsing with Right Reference

Lossless data compression has been widely studied in computer science. O...
research
02/06/2023

Optimal LZ-End Parsing is Hard

LZ-End is a variant of the well-known Lempel-Ziv parsing family such tha...
research
03/05/2019

Lempel-Ziv-like Parsing in Small Space

Lempel-Ziv (LZ77 or, briefly, LZ) is one of the most effective and widel...
research
06/02/2021

On the approximation ratio of LZ-End to LZ77

A family of Lempel-Ziv factorizations is a well-studied string structure...
research
10/30/2017

At the Roots of Dictionary Compression: String Attractors

A well-known fact in the field of lossless text compression is that high...
research
10/18/2022

Computing MEMs on Repetitive Text Collections

We consider the problem of computing the Maximal Exact Matches (MEMs) of...
