Entropy bounds for grammar compression

04/23/2018
by   Michał Gańczorz, et al.

In grammar compression we represent a string as a context-free grammar. This model is popular in both theoretical and practical applications due to its simplicity, good compression rate, and suitability for processing the compressed representation. In practice, achieving compression requires encoding such a grammar as a binary string, and a few such encodings are commonly used. We bound the size of these encodings for several compression methods, including the well-known Re-Pair algorithm. For Re-Pair we prove that its standard encoding, which combines entropy coding with a special encoding of the grammar, achieves 1.5|S|H_k(S). We also show that by stopping after some iteration we can achieve |S|H_k(S). The latter is particularly important, as it explains a phenomenon observed in practice: introducing too many nonterminals causes the bit-size to grow. We generalize our approach to other compression methods, such as Greedy and a wide class of irreducible grammars, and to other bit encodings (including the naive one, which uses fixed-length codes). Our approach not only proves the bounds but also partially explains why Greedy and Re-Pair are much better in practice than other grammar-based methods. Finally, we show that for a wide family of dictionary compression methods (including grammar compressors) Ω(nk log σ / log_σ n) bits of redundancy are required. This shows a separation between context-based/BWT methods and dictionary compression algorithms, since for the former there exist methods whose redundancy does not depend on n, but only on k and σ.
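
To make the quantities in the abstract concrete, the following Python sketch computes |S|H_k(S) under the standard definition of k-th order empirical entropy and sets it beside the bit-size of a naive fixed-length encoding of a grammar built by a simplified Re-Pair-style procedure. The function names (`empirical_entropy_bits`, `repair`, `expand`, `naive_bits`), the `max_rules` early-stopping parameter, and the cost model in `naive_bits` are illustrative assumptions. They are not the paper's construction, nor the encodings it analyses.

```python
import math
from collections import Counter, defaultdict


def empirical_entropy_bits(s, k):
    """Return |S| * H_k(S) in bits, using the standard definition:
    for every length-k context w, take the zeroth-order entropy of the
    characters that follow w, weighted by how often w occurs."""
    contexts = defaultdict(Counter)
    for i in range(k, len(s)):
        contexts[s[i - k:i]][s[i]] += 1
    total = 0.0
    for followers in contexts.values():
        n_ctx = sum(followers.values())
        for count in followers.values():
            total -= count * math.log2(count / n_ctx)
    return total


def repair(s, max_rules=None):
    """Simplified Re-Pair-style compressor (a sketch, not the real algorithm):
    repeatedly replace the most frequent adjacent pair with a fresh
    nonterminal. Pair counts include overlapping occurrences, which a
    careful Re-Pair implementation would treat differently.
    max_rules (hypothetical parameter) stops the process early."""
    seq = list(s)
    rules = {}
    while max_rules is None or len(rules) < max_rules:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, freq = pairs.most_common(1)[0]
        if freq < 2:
            break
        nonterminal = ("N", len(rules))  # tuples cannot clash with characters
        rules[nonterminal] = pair
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(nonterminal)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules


def expand(seq, rules):
    """Decompress: expand nonterminals left to right using an explicit stack."""
    out = []
    stack = list(reversed(seq))
    while stack:
        symbol = stack.pop()
        if symbol in rules:
            left, right = rules[symbol]
            stack.append(right)
            stack.append(left)
        else:
            out.append(symbol)
    return "".join(out)


def naive_bits(seq, rules):
    """Assumed cost model: every symbol in the start string and in each
    rule body is written with one fixed-length code."""
    symbols = set(seq) | set(rules)
    for left, right in rules.values():
        symbols.update((left, right))
    width = max(1, math.ceil(math.log2(len(symbols))))
    return width * (len(seq) + 2 * len(rules))
```

Calling `repair(s, max_rules=r)` for increasing `r` and comparing `naive_bits(*repair(s, max_rules=r))` with `empirical_entropy_bits(s, k)` gives a rough, informal way to watch how the encoded size changes as nonterminals are added.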


