
# Entropy bounds for grammar compression

In grammar compression we represent a string as a context-free grammar. This model is popular in both theoretical and practical applications due to its simplicity, good compression rate and suitability for processing of the compressed representations. In practice, achieving compression requires encoding such a grammar as a binary string, and a few such encodings are commonly used. We bound the size of such encodings for several compression methods, including the well-known Re-Pair algorithm. For Re-Pair we prove that its standard encoding, which is a combination of entropy coding and a special encoding of the grammar, achieves 1.5|S|H_k(S). We also show that by stopping after some iteration we can achieve |S|H_k(S). The latter is particularly important, as it explains the phenomenon observed in practice that introducing too many nonterminals causes the bit-size to grow. We generalize our approach to other compression methods, like Greedy or the wide class of irreducible grammars, and to other bit encodings (including the naive one, which uses fixed-length codes). Our approach not only proves the bounds but also partially explains why Re-Pair and Greedy are much better in practice than other grammar-based methods. Finally, we show that for a wide family of dictionary compression methods (including grammar compressors) Ω(nk log σ / log_σ n) bits of redundancy are required. This shows a separation between context-based/BWT methods and dictionary compression algorithms, as for the former there exist methods whose redundancy does not depend on n, but only on k and σ.

01/10/2019

## 1 Introduction

Grammar compression is a type of dictionary compression in which we represent the input as a (context-free) grammar that produces exactly the input string. Variants of grammar compression achieve competitive compression ratios [27]. Its simple inductive structure makes it particularly suitable for analysis and algorithmic design. Its close ties to Lempel-Ziv-type compression methods make grammar compression a good abstraction and an intermediate interface for these types of algorithms. Recently, there has been a strong trend in algorithmic design to develop algorithms that work directly on the compressed data without the need for a full decompression; among compression methods, grammar compression is particularly suitable for such an approach. Lastly, algorithms for grammar-compressed data can be used in a compress-and-compute paradigm, in which we compute the grammar-compressed representation of the data and then process it in this succinct representation; see [28] for a recent survey.

The problem of computing the smallest grammar (in terms of the number of symbols in the productions) is known to be NP-complete [5, 36, 4]. This led to the development of approximation algorithms [33, 5, 22, 35] as well as heuristical algorithms [27, 32, 2, 24, 44, 40]. From the practical point of view, the approximation algorithms have their drawbacks: they achieve only logarithmic approximation, in practice they are inferior to heuristics, and it seems that minimizing the bit-size and minimizing the number of symbols are not the same thing. On the other hand, heuristics perform comparably to other dictionary-based methods [27]. Note that heuristical algorithms routinely apply Huffman coding to their output [27, 2]. This is also standard for many other compression methods.

The apparent success of heuristical algorithms fuelled theoretical research that tried to explain or evaluate their performance. One branch of this research tried to establish their approximation ratios when the size is calculated as the number of symbols [5, 20]. On the other hand, in the spirit of compressors, attempts were made to estimate the bit-size of their output. In this way, grammar compressors could be compared not only among themselves but also with other compressors.

In general, estimating precisely the bit-size of any compressor is hard. However, this is not the case for compressors based on the (higher-order) entropy, which include Huffman coding, arithmetic coding, PPM, and others. In their case, the bit-size of the output is very close to the ($k$-order) empirical entropy of the input string. Thus instead of comparing to those compressors, we can compare the size of the output with the $k$-order entropy of the input string. Moreover, in practice, it seems that higher-order entropy is a good estimate of the possible compression rate of data, and entropy-based compression is widely used. Such an analysis was carried out for the BWT [29], LZ78 [25] and LZ77 [37] compression methods. In some sense, this approach generalised classic results from information theory on LZ-algorithms coding for ergodic sources [42, 44]; similar work was also performed for a large class of grammar compressors [24].

Despite the wide popularity of grammar-based methods, not many results that link their performance to the $k$-order entropy were known, with one notable exception: Re-Pair was shown to achieve $2|S|H_k(S) + o(|S|\log\sigma)$ bits for $k = o(\log_\sigma |S|)$ [30]. Note that this result holds without the Huffman coding of the output, which is used in practice.

#### Our contribution

We start by proving bounds for the Re-Pair [27] and Greedy [2] compressors in terms of the $k$-order empirical entropy. Then we show that our methods can be generalized to a wide family of so-called irreducible grammars [24]. Our results extend to other grammars that have similar properties. We consider several encodings of the output, which are exactly those (or closely related to those) used in practice; in particular, we consider Huffman coding.

The main technical tool is a generalization of a result by Kosaraju and Manzini [25, Lemma 2.3], which in turn generalizes Ziv’s Lemma [8, Section 13.5.5]. For any parsing $Y_S$ of $S$:

$$|Y_S|H_0(Y_S) \le |S|H_k(S) + |Y_S|k\log\sigma + |L|H_0(L), \tag{1}$$

where $L$ is the string of lengths of consecutive phrases in $Y_S$. Compared to [25, Lemma 2.3], our proof is more intuitive and simpler, and it removes the greatest pitfall of the previous result: the dependency on the highest frequency of a phrase in the parsing. Furthermore, it can be used to estimate the size of the Huffman-coded output, i.e. what is truly used in practice, which was not possible using previously known methods.

Using (1) we show that Re-Pair stopped at the right moment achieves $|S|H_k(S)$ bits plus lower-order terms. Moreover, at this moment the dictionary is small, with its size governed by the constant hidden in the lower-order term. This implies that strings produced by Re-Pair and related methods have a small alphabet. On the other hand, many compression algorithms, like LZ78, do not have this property [44]. Then we prove that in general Re-Pair’s output size can be bounded by $1.5|S|H_k(S)$ bits plus lower-order terms. One of the crucial steps is to use (1) to lower-bound the entropy of the string at a certain iteration.

Stopping the compressor during its runtime seems counter-intuitive but it is consistent with empirical observations [15, 13] and our results shed some light on the reasons behind this phenomenon. Furthermore, there are approaches that suggest partial decompression of grammar-compressors in order to achieve better (bit) compression rate [3].

For Greedy we give the same bounds as for Re-Pair: $1.5|S|H_k(S)$ bits (plus lower-order terms) using an entropy coder, and $|S|H_k(S)$ bits (plus lower-order terms) if it is stopped after an appropriate number of iterations; we also bound its size without Huffman coding. The last result seems of practical importance, as each iteration of Greedy requires linear time, so stopping early reduces the running time and should yield a comparable, if not better, compression rate. No such results were known before.

Then we apply our methods to general class of irreducible grammars [24] and show that Huffman coding of an irreducible grammar uses at most bits and at most without this encoding. No such general bounds were known before.

In a sense, the upper bound from (1) can be made constructive: for any $l$ we show how to find a parsing $Y_S$ into phrases of length $l$ such that:

$$|Y_S|H_0(Y_S) \le \frac{|S|}{l}\sum_{i=0}^{l-1} H_i(S) + O(\log|S|). \tag{2}$$

This has direct applications to text encodings with random access based on parsing the string into equal phrases [11] and improves their performance guarantee from to for .

Finally, we present lower bounds that apply to algorithms that parse the input in a “natural way”, this includes not only considered grammar compressors and compressed text representations [14, 11, 7], but also most of the dictionary compression methods. The main idea is the observation that for some inputs (2) is indeed tight. We construct a family of strings, which can be viewed as a generalization of de Bruijn strings, for which any such algorithm cannot perform better than in several meanings: the constant at cannot be improved to be lower than , the additive term cannot be made smaller, and lifting the assumption that implies that the coefficient at must be larger than . The constructed family of strings has interesting properties in terms of entropy and can be of independent interest.

## 2 Strings and their parsing

A string is a sequence of elements, called letters, from a finite set, called the alphabet. It is denoted as $w = w_1 w_2 \cdots w_{|w|}$, where each $w_i$ is a letter; the length of such a $w$ is $|w|$. The size of the alphabet is usually denoted as $\sigma$; the alphabet is usually not named explicitly, as it is clear from the context or not needed, and $\Sigma$ is used when some name is needed. We often analyse words over different alphabets. For any two words $w, w'$, $ww'$ denotes their concatenation. By $w[i..j]$ we denote $w_i w_{i+1} \cdots w_j$; this is a subword of $w$. By $\epsilon$ we denote the empty word. For a pair of words $v, w$, $|w|_v$ denotes the number of different (possibly overlapping) subwords of $w$ equal to $v$; if $v = \epsilon$ then for uniformity of presentation we set $|w|_\epsilon = |w|$. Usually $S$ denotes the input string and $n$ its length.

A grammar compression represents an input string $S$ as a context-free grammar that generates the unique string $S$. The right-hand side of the start nonterminal is called a starting string (often denoted as ) and by the grammar () we mean the collection of other rules, together they are called the full grammar and denoted as . For a nonterminal we denote its rule right-hand side by . For a grammar or full grammar their right-hand sides, denoted as and , are the concatenations of strings that are the right-hand sides of all productions in , for we also concatenate . By we denote the number of nonterminals. If all right-hand sides of a grammar consist of two symbols, then this grammar is in CNF. The expansion of a nonterminal in a grammar is the string generated by this nonterminal in this grammar. All reasonable grammar compressors guarantee that no two nonterminals have the same expansion; we implicitly assume this in the rest of the paper. We say that a grammar (full grammar) (, respectively) is small, if (, respectively). This matches the folklore information-theoretic lower bound on the size of a grammar for a string of length $n$.

In practice, the starting string and the grammar may be encoded in different ways, especially when is in CNF, hence we make a distinction between these two. Note that for many (though not all) grammar compressors both theoretical considerations and proofs as well as practical evaluation show that the size of the grammar is considerably smaller than the size of the starting string.

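A parsing of a string $S$ is any representation $S = y_1 y_2 \cdots y_c$, where each $y_i$ is nonempty and is called a phrase. We denote a parsing as $Y_S = y_1, \ldots, y_c$ and treat it as a word of length $c$ over the alphabet $\{y_1, \ldots, y_c\}$; in particular $|Y_S| = c$ is its size. Then $L = |y_1|, |y_2|, \ldots, |y_c|$ is the string of phrase lengths, and we treat it as a word over the alphabet $\{1, \ldots, |S|\}$.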

The idea of parsing is closely related to dictionary compression methods, as most of these methods pick some substring and replace it with a new symbol, thus creating a phrase in a parsing. Examples include the Lempel-Ziv algorithms. In grammar methods, which are a special case of dictionary compression, the parsing is often induced by the starting string.

Given a word $w$, its $k$-order empirical entropy is

$$H_k(w) = -\frac{1}{|w|}\sum_{\substack{v:\,|v|=k \\ a:\ \text{letter}}} |w|_{va}\log\left(\frac{|w|_{va}}{|w|_v}\right),$$

with the convention that the summand is $0$ whenever $|w|_{va} = 0$ or $|w|_v = 0$. We are mostly interested in the entropy $H_k(S)$ of the input string $S$ and in $H_0(Y_S)$ for a parsing $Y_S$ of $S$. The former is a natural measure of the input, to which we shall compare the size of the output of the grammar compressor, and the latter corresponds to the size of the entropy coding of the starting string returned by the grammar compressor.
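As an illustration, the definition above can be evaluated directly by counting substrings. The following is a minimal Python sketch, not code from the paper; the function name `empirical_entropy` is an assumption made for this example. It counts occurrences of each context $v$ (possibly overlapping, including the occurrence at the very end of $w$) and uses $|w|_\epsilon = |w|$ for $k = 0$.

```python
from collections import Counter
from math import log2


def empirical_entropy(w: str, k: int) -> float:
    """k-order empirical entropy H_k(w) in bits per letter (illustrative helper)."""
    n = len(w)
    if n == 0 or k >= n:
        return 0.0
    # |w|_{va}: occurrences of every substring of length k + 1.
    ext = Counter(w[i:i + k + 1] for i in range(n - k))
    # |w|_v: occurrences of every context of length k; |w|_epsilon = |w| for k = 0.
    if k == 0:
        ctx = {"": n}
    else:
        ctx = Counter(w[i:i + k] for i in range(n - k + 1))
    # Each summand is |w|_{va} * log(|w|_v / |w|_{va}), which is nonnegative.
    bits = sum(c * log2(ctx[va[:k]] / c) for va, c in ext.items())
    return bits / n


if __name__ == "__main__":
    for k in range(3):
        print(k, empirical_entropy("abracadabra", k))
```

For the word `abab` this gives $H_0 = 1$ bit per letter and $H_1 = 0.25$, since only the context `b` is followed by a letter that is not fully determined by the counts.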

## 3 Entropy bound for string parsing

In this Section, we make a connection between the entropy of a parsing $Y_S$ of a string $S$, i.e. $|Y_S|H_0(Y_S)$, and the $k$-order empirical entropy $|S|H_k(S)$; this can be seen as a refinement and strengthening of results that relate $|Y_S|$ to $|S|H_k(S)$ [25, Lemma 2.3]: our result establishes upper bounds when the phrases of $Y_S$ are encoded using an entropy coder, while the previous one used the trivial encoding of $Y_S$, which assigns the same number of bits to each letter.

Theorem 3 yields that the entropy of any parsing is bounded by $|S|H_k(S)$ plus some additional summands, which depend on the size of the parsing and the entropy of the lengths of its phrases. In particular, it eliminates the main drawback of the previous result [25, Lemma 2.3], which also had a dependency on the frequency of the most frequent phrase.

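[cf. [25, Lemma 2.3]] Let $S$ be a string, $Y_S$ its parsing, and $L$ the string of lengths of consecutive phrases of $Y_S$. Then:

$$|Y_S|H_0(Y_S) \le |S|H_k(S) + |Y_S|k\log\sigma + |L|H_0(L).$$

Moreover, for a sufficiently small parsing and sufficiently small $k$ the additional summands are of lower order, and the same bound applies to the Huffman coding of $Y_S$.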

For Theorem 3 implies that any parsing is within summand of . This also gives upper bound on entropy increase of any parsing. Interestingly, the second bound holds for any parsing , assuming it is small enough.

When we can choose the parsing, the upper bound can be improved, even when we are restricted to parsings with phrases of (almost) fixed length.

Let $S$ be a string over alphabet $\Sigma$. Then for any integer $l$ we can construct a parsing $Y_S$ of size $|Y_S| \le \lceil |S|/l \rceil + 1$ satisfying:

$$|Y_S|H_0(Y_S) \le \frac{|S|}{l}\sum_{i=0}^{l-1} H_i(S) + O(\log|S|).$$

All phrases except the first and last one have length $l$.

Note that this second result, unlike the first theorem of this section, does not hold for every parsing; it only claims that a carefully chosen parsing can have smaller entropy than a naive one.

Theorem 3 gives better bounds for compressed text representation which parse text into equal length phrases, such as [14, 11]. Those methods consider naive parsing (i.e. into consecutive phrases of length ) and Theorem 3 suggest that we can obtain better guarantees if mean of entropies is smaller than .
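To make the difference between a naive and a chosen fixed-length parsing concrete, here is a small Python sketch. It is an illustration under an explicit assumption: it simply tries every shift of the first phrase boundary and keeps the parsing with the smallest $|Y_S|H_0(Y_S)$. The construction used in the paper is given in its Appendix and may differ; the names `shifted_parsing` and `best_shifted_parsing` are hypothetical.

```python
from collections import Counter
from math import log2


def h0_total(seq):
    """|seq| * H_0(seq) in bits, for any sequence of hashable items."""
    n = len(seq)
    return sum(c * log2(n / c) for c in Counter(seq).values())


def shifted_parsing(s, l, offset):
    """Parse s into phrases of length l, after a first phrase of length `offset`.

    The last phrase may be shorter than l; offset = 0 is the naive parsing.
    """
    phrases = [s[:offset]] if offset else []
    phrases += [s[i:i + l] for i in range(offset, len(s), l)]
    return phrases


def best_shifted_parsing(s, l):
    """Among the l possible shifts, return the parsing with smallest |Y|H_0(Y)."""
    return min((shifted_parsing(s, l, d) for d in range(l)), key=h0_total)


if __name__ == "__main__":
    s = "a" + "bc" * 6
    print(h0_total(shifted_parsing(s, 2, 0)))  # naive parsing
    print(h0_total(best_shifted_parsing(s, 2)))  # best shift
```

On the string `a` followed by six copies of `bc`, the naive parsing with $l = 2$ cuts every `bc` in half, while shifting the boundary by one letter yields six identical phrases and a much smaller zero-order entropy.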

The proofs follow a couple of simple steps. First, we recall a strengthening of the known fact that entropy lower-bounds the size of the encoding of the string using any prefix code: instead of assigning natural lengths (of codes) to letters of $Y_S$ we can assign them some probabilities $p$ (and think of $\log\frac{1}{p}$ as the “length” of the prefix code). Then we define the probabilities of phrases in the parsing in such a way that they relate to higher-order entropies. Depending on the result, we look either at a fixed $k$-letter context or at the $(i-1)$-letter context for the $i$-th letter of a phrase. This already yields the first claim of Theorem 3; to obtain the second we substitute the estimation on the parsing size and make some basic estimation on the entropy of the lengths, which is textbook knowledge [8, Lemma 13.5.4].

#### Entropy estimation

The following technical Lemma strengthens the well-known fact that entropy lower-bounds the size of the prefix code encoding; it is a simple corollary of Gibbs’ inequality, see [1] for a proof.

[[1]] Let $w$ be a string over alphabet $\Gamma$ and $p: \Gamma \to \mathbb{R}^{+}$ be a function such that $\sum_{a \in \Gamma} p(a) \le 1$. Then:

$$|w|H_0(w) \le -\sum_{a \in \Gamma} |w|_a \log p(a).$$

To use Lemma 3 to prove both theorems of this section we need to devise an appropriate valuation for the phrases of the parsing. Instead of assigning a single value to each phrase we assign one to each individual letter in each phrase; the value of a phrase is then the product of the values of its consecutive letters.

In the case of the theorem on chosen parsings, to the $j$-th letter of a phrase $y$ we assign the probability of this letter occurring in its $(j-1)$-letter context in $S$, i.e. we assign $\frac{|S|_{y[1..j]}}{|S|_{y[1..j-1]}}$. Thus we can think that we encode the letter using the $(j-1)$-th order entropy. The difference in the case of the theorem for arbitrary parsings is that we assign $\frac{1}{\sigma}$ to each of the first $k$ letters of the phrase, and to each remaining letter the probability of it occurring in its $k$-letter context. The phrase costs are simply the negated logarithms of the phrase probabilities, and can be viewed as the cost of a bit-encoding of a phrase.

The idea of assigning values to phrases was used before, for instance by Kosaraju and Manzini in estimations of entropy of LZ77 and LZ78 [25], yet their definition depends on symbols preceding the phrase. This idea was adapted from methods used to estimate entropy of the source model [42], see also [8, Sec. 13.5]. Similar idea was used in [14] in construction of compressed text representations. Their solution used constructive argument: it calculated bit strings for each phrase using arithmetic coding and context modeller. Later it was observed that arithmetic coder and context modeller can be replaced with Huffman encoding [11]. Still, both these representations were based on assumption that text is parsed into short (i.e ) phrases of equal length and used specific compression methods in the proof.

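[Phrase probability, parsing cost]

Given a string $S$ and its parsing $Y_S = y_1, \ldots, y_c$, the phrase probability $P(y_i)$ and the $k$-bounded phrase probability $P_k(y_i)$ are:

$$P(y_i) = \prod_{j=1}^{|y_i|} \frac{|S|_{y_i[1..j]}}{|S|_{y_i[1..j-1]}} \quad\text{and}\quad P_k(y_i) = \frac{1}{\sigma^{\min(|y_i|,k)}} \prod_{j=k+1}^{|y_i|} \frac{|S|_{y_i[j-k..j]}}{|S|_{y_i[j-k..j-1]}}$$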

where those are if, respectively, some or . Observe that the definition also holds for , as we assumed that when , and .

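The phrase cost and parsing cost are $C(y_i) = -\log P(y_i)$ and $C(Y_S) = \sum_{i=1}^{|Y_S|} C(y_i)$. Similarly, the $k$-bounded phrase cost and $k$-bounded parsing cost are $C_k(y_i) = -\log P_k(y_i)$ and $C_k(Y_S) = \sum_{i=1}^{|Y_S|} C_k(y_i)$.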

When comparing the cost $C_k(Y_S)$ with $|S|H_k(S)$, the latter always uses the $k$-order entropy for each symbol, while $C_k(Y_S)$ uses $\log\sigma$ bits on each of the first $k$ letters of each phrase; thus intuitively it loses up to $\log\sigma$ bits on each of those letters.

Let $S$ be a string and $Y_S$ its parsing. Then $C_k(Y_S) \le |S|H_k(S) + |Y_S|k\log\sigma$.

Let $S$ be a string. Then for any $l$ there exists a parsing $Y_S$ such that each phrase except the first and last has length exactly $l$ and $C(Y_S) \le \frac{|S|}{l}\sum_{i=0}^{l-1} H_i(S) + O(\log|S|)$.

#### Parsings and entropy

Ideally, we would like to plug the phrase probabilities for $Y_S$ into Lemma 3 and so obtain that the parsing cost upper-bounds the entropy coding of $Y_S$, i.e. $|Y_S|H_0(Y_S) \le C(Y_S)$. But the assumption of Lemma 3 (that the values of the function $p$ sum to at most $1$) may not hold, as we can have phrases of different lengths and so their probabilities can mix. Thus we also take into account the lengths of the phrases: we multiply the probability of a phrase $y$ by the probability of $|y|$, i.e. the frequency of $|y|$ in $L$. After simple calculations, we conclude that $|Y_S|H_0(Y_S)$ is upper-bounded by the parsing cost plus the entropy of the lengths, $|L|H_0(L)$.

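Let $S$ be a string over a $\sigma$-size alphabet, $Y_S$ its parsing, and $L$ the string of lengths of consecutive phrases of $Y_S$. Then:

$$|Y_S|H_0(Y_S) \le C(Y_S) + |L|H_0(L) \quad\text{and}\quad |Y_S|H_0(Y_S) \le C_k(Y_S) + |L|H_0(L).$$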
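The two inequalities can be checked numerically on small inputs. The Python sketch below is an illustration only, with hypothetical helper names (`occ`, `parsing_cost`, `bounded_parsing_cost`); it computes the parsing costs from the phrase probabilities defined earlier in this section, using $|S|_\epsilon = |S|$, and compares them with $|Y_S|H_0(Y_S)$.

```python
from collections import Counter
from math import log2


def occ(s, v):
    """|s|_v: number of possibly overlapping occurrences of v in s (|s| if v is empty)."""
    if not v:
        return len(s)
    return sum(1 for i in range(len(s) - len(v) + 1) if s.startswith(v, i))


def h0_total(seq):
    """|seq| * H_0(seq) in bits."""
    n = len(seq)
    return sum(c * log2(n / c) for c in Counter(seq).values())


def parsing_cost(s, phrases):
    """C(Y_S): the j-th letter of a phrase is charged in its (j-1)-letter context."""
    cost = 0.0
    for y in phrases:
        for j in range(1, len(y) + 1):
            cost += log2(occ(s, y[:j - 1]) / occ(s, y[:j]))
    return cost


def bounded_parsing_cost(s, phrases, k, sigma):
    """C_k(Y_S): log(sigma) for the first k letters, k-letter contexts afterwards."""
    cost = 0.0
    for y in phrases:
        cost += min(len(y), k) * log2(sigma)
        for j in range(k + 1, len(y) + 1):  # j is 1-based, as in the definition
            ctx_and_letter = y[j - k - 1:j]  # y[j-k..j]
            cost += log2(occ(s, ctx_and_letter[:-1]) / occ(s, ctx_and_letter))
    return cost


if __name__ == "__main__":
    s = "abracadabra"
    phrases = ["ab", "ra", "c", "ad", "abra"]
    lengths = [len(y) for y in phrases]
    print(h0_total(phrases), parsing_cost(s, phrases) + h0_total(lengths))
```

The summand $|L|H_0(L)$ matters here: phrase probabilities of phrases with different lengths need not sum to at most one, so the cost alone is not always an upper bound.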

When (and so ) the entropy is the entropy of , thus summand on the right-hand is necessary.

Now, the combination of above claims gives directly the proof of Theorem 3 and Theorem 3, see Appendix for the full proof. The entropy of lengths is always within factor, and for small enough within factor; moreover in case of Theorem 3 it can be bounded by . This follows textbook arguments [8, Lemma 13.5.4].

## 4 Entropy upper bounds on grammar compression

In this Section we use Theorem 3 to bound the size of grammars returned by popular compressors: Re-Pair [27], Greedy [2] and a general class of methods that produce irreducible grammars. We consider a couple of natural and simple bit-encodings of the grammars, those include naive encoding (which takes bits per grammar symbol), Huffman coding and so-called incremental coding, which is popular for grammars in CNF. Interestingly, we obtain bounds for a constant even for naive encoding.

### 4.1 Encoding of grammars

There are different ways to encode the (full) grammar thus we first discuss the possible encodings and give some estimations on their sizes. All considered encoding assume a linear order on nonterminals: if contains then . In this way we can encode the rule as a sequence of nonterminals on its right-hand side, in particular instead of storing the nonterminal names we store positions in the above ordering.

The considered encodings are simple and natural, and they correspond very closely or exactly to the encodings used in grammar compressors like Re-Pair [27] or Sequitur [31]. Some other algorithms, e.g. Greedy [2], use specialized encodings, but at some point they still encode the grammar using an entropy coder with some additional information, or assign codes to each nonterminal in the grammar. Thus most of these custom encodings are roughly equivalent to (or not better than) entropy coding.

Encoding of CNF grammars deserves special attention and is a problem investigated on its own. It is known that grammar in CNF can be encoded using bits [38, 39], which is close to the information theoretic lower bound of  [38]. On the other hand, in heuristical compressors simpler encodings are used, for instance, Re-Pair was implemented and tested with several encodings, which were based on division of nonterminals into groups where if and only if and or and , for some , where is the input alphabet. Then each group is encoded separately. Even though no theoretic bounds were given, these encodings come close to the lower bound, though some only on average.

The above encodings are difficult to analyse due to heuristical optimisations, for the sake of completeness we analyse an incremental encoding, which is a simplified version of one of the original methods used to encode Re-Pair output. It matches the theoretical lower bound except a larger constant hidden in .

To be precise, we consider the following encodings:

fully naive

We concatenate the right-hand sides of the full grammar. Then each nonterminal and letter are assigned bitcodes of the same length. We store for each , as it is often small, it is sufficient to store it in unary.

naive

The starting string is entropy-coded, the rules are coded as in the fully-naive variant.

entropy-coded

The rules are concatenated with the starting string and they are coded using an entropy coder. We also store for each nonterminal .

incremental

We use this encoding only for CNF grammars, though it can be extended to general grammars. It has additional requirements on the order on letters and nonterminals: if and are the first symbols in productions for then . Given any grammar, this property can be achieved by permuting the nonterminals, but we must drop the assumption that right hand side of a given nonterminal occurs before in the sequence. Then grammar can be viewed as a sequence of nonterminals: . We encode differences with Elias -codes, and ’s naively using bits.

The incremental encoding can be generalized to grammars not in CNF, as it is possible to transform any grammar to such a form.

We upper-bound the sizes of grammars under various encodings; the first two estimations are straightforward, while the third one requires some calculations. For the entropy coding, the estimation in Lemma 4.1 requires a nontrivial argument.

Let be a string and a full grammar that generates it. Then:

fully naive

uses at most bits.

naive

uses at most bits.

incremental

uses at most bits.

The proof idea of Lemma 4.1 is to show that is a parsing of , with at most additional symbols, and then apply Theorem 3. The latter requires that different nonterminals have different expansions, all practical grammar compressors have this property. Let be a string over an alphabet of size , and a full grammar generating it, where no two nonterminals have the same expansion. Denote . If then the entropy coding of is

### 4.2 Re-Pair

Re-Pair is one of the best-known grammar compression heuristics. It starts with the input string $S$ and in each step replaces a most frequent pair $ab$ in $S$ with a new symbol $X$ and adds a rule $X \to ab$. Re-Pair is simple and fast, and provides a compression ratio better than some of the standard dictionary methods like gzip [27]. It has found use in various applications [16, 23, 6, 12, 41, 17].
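The replacement loop can be sketched in a few lines of Python. This is a simplified, quadratic-time illustration and not the linear-time implementation of [27]; the representation of nonterminals as tuples `("N", i)`, the parameter `max_rules` (which models stopping the compressor early) and the tie-breaking between equally frequent pairs are all assumptions of this sketch.

```python
from collections import Counter


def count_pairs(seq):
    """Count non-overlapping occurrences of adjacent pairs (aaa contains aa once)."""
    counts, last = Counter(), {}
    for i in range(len(seq) - 1):
        p = (seq[i], seq[i + 1])
        if p[0] == p[1] and last.get(p) == i - 1:
            continue  # overlaps the previously counted occurrence of the same pair
        counts[p] += 1
        last[p] = i
    return counts


def replace_pair(seq, pair, new):
    """Replace occurrences of `pair` by `new`, scanning left to right."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def repair(s, max_rules=None):
    """Return (working string, rules); stops early after max_rules rules if given."""
    seq, rules = list(s), {}
    while max_rules is None or len(rules) < max_rules:
        counts = count_pairs(seq)
        if not counts:
            break
        pair, c = counts.most_common(1)[0]
        if c < 2:
            break  # no pair repeats, nothing left to replace
        nt = ("N", len(rules))  # hypothetical naming of fresh nonterminals
        rules[nt] = pair
        seq = replace_pair(seq, pair, nt)
    return seq, rules


def expand(sym, rules):
    """Expansion of a symbol: the string it generates."""
    if sym not in rules:
        return sym
    a, b = rules[sym]
    return expand(a, rules) + expand(b, rules)


if __name__ == "__main__":
    seq, rules = repair("abracadabra abracadabra")
    print(len(seq), len(rules))
```

Passing `max_rules` gives a working string with a smaller alphabet at the price of a longer working string, which is the trade-off analysed in this section.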

We prove that Re-Pair stopped at right moment achieves entropy (plus some smaller terms), using any of the three: naive, incremental or entropy encoding. To this end we show that Re-Pair reduces the input string to length for appropriate and stop the algorithm when the string gets below this size. This follows by estimations on number of possible different substrings in the input string. Then on one hand the grammar constructed so far is of size and on the other side Theorem 3 yields that entropy coding of the current string is plus some smaller terms. This property gives an advantage over other methods, as it ensures small alphabet size of the string to encode and in practice encoding of a large dictionary is costly. Even though the Theorem 4.2 states explicitly the values of and ,, one is function of the other, see proofs in the Appendix.

We refer to the current state of $S$, i.e. $S$ with some pairs replaced, as the working string.

Let , , be a string over -size alphabet, . When the size of the working string of Re-Pair is first below then the number of nonterminals in the grammar is at most and the entropy coding of the working string is at most ; such a point always exists and the bit-size of Re-Pair stopped at this point is at most for: naive, entropy and incremental encoding.

Theorem 4.2 says that Re-Pair achieves the $k$-order entropy when stopped at the appropriate time. What is surprising is that continuing to the end can lead to a worse compression rate. In fact, in practice limiting the size of the dictionary for Re-Pair, as well as for similar methods, results in a better compression rate for larger files [15, 13]. This is not obvious; in particular, Larsson and Moffat [27] believed that it is the other way around, and this belief was supported by results on smaller-size data. This is partially explained by Theorem 4.2, in which we give a bound on Re-Pair run to the end with incremental or entropy encoding; a bound for fully naive encoding was known earlier [30].

Before proving Theorem 4.2 we first show a simple example, see Lemma 4.2, that demonstrates that the factor from Theorem 4.2 is tight, assuming certain encodings, even for . The construction employs large alphabets, i.e. . Observe that our results assume that , which implies that for polynomial alphabets they hold only for . Also, such large alphabets do not reflect practical cases when is much smaller than . Yet, in case of grammar compression this example gives some valuable intuition: replacing the substring decreases the size counted in symbols but may not always decrease bit encoding size, as we have to store some additional information, regarding replaced string, which can be costly, depending on the encoding method.

There exist a family of strings such that Re-Pair with both incremental and entropy encoding uses at least bits, assuming that we encode the grammar of size (i.e. with nonterminals) using at least bits. Moreover , which implies that the cost of encoding, denoted by , satisfies .

###### Proof of Lemma 4.2.

Fix and an alphabet . Consider the word which contains all letters with in-between, first in order from to , then in order  to :

Detailed calculations are provided in the Appendix. ∎

The example from Lemma 4.2 shows that at some iteration the bit encoding of Re-Pair can increase. Even though it requires a large alphabet and is somewhat artificial, we cannot ensure that a similar instance does not occur at some iteration of Re-Pair, as the size of the alphabet of the working string increases. In the above example the size of the grammar was significant. This is the only way the bit size can increase, as by Theorem 3 adding new symbols does not significantly increase the entropy encoding of the working string.

The main observation needed for the proof of Theorem 4.2 is that if at some point the grammar size is significant, then the entropy of the working string is also large. In such a case the grammar transformations do not increase the overall cost of the encodings too much. The crucial element in this reasoning is the usage of Theorem 3 to lower-bound the $k$-order entropy of the input string, showing that the entropy at the desired point is indeed large.

Let , be a string over a -size alphabet, . Then the size of Re-Pair output is at most for the incremental and the entropy encoding.

### 4.3 Irreducible grammars and their properties

Kieffer et al. introduced the concept of irreducible grammars [24], which formalise the idea that there is no immediate way to make the grammar smaller. Many heuristics fall into this category: Sequential, Sequitur, LongestMatch, and Greedy, even though some were invented before the notion was introduced [5]. They also developed an encoding of such grammars which was used as a universal code for finite state sources. A full grammar is irreducible if:

1. (IG1) no two nonterminals have the same expansion;

2. (IG2) every nonterminal, except the starting symbol, occurs at least twice in the right-hand sides of the full grammar;

3. (IG3) no pair occurs twice (without overlaps) in the starting string and the right-hand sides of the rules.
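The three conditions are easy to test mechanically. Below is a minimal Python sketch of such a check; the representation (a starting string given as a list of symbols, rules given as a dictionary from nonterminal names to lists of symbols, terminals being single characters) and the function name `is_irreducible` are assumptions made for this illustration.

```python
from collections import Counter


def expand(sym, rules):
    """Expansion of a symbol: the string of terminals it generates."""
    if sym not in rules:
        return sym
    return "".join(expand(x, rules) for x in rules[sym])


def is_irreducible(start, rules):
    """Check the three irreducibility conditions for a full grammar."""
    sides = [list(start)] + [list(rhs) for rhs in rules.values()]

    # Condition 1: no two nonterminals have the same expansion.
    expansions = [expand(nt, rules) for nt in rules]
    if len(set(expansions)) != len(expansions):
        return False

    # Condition 2: every nonterminal occurs at least twice on the right-hand sides.
    uses = Counter(x for side in sides for x in side)
    if any(uses[nt] < 2 for nt in rules):
        return False

    # Condition 3: no pair occurs twice without overlap on the right-hand sides.
    pairs = Counter()
    for side in sides:
        last = {}
        for i in range(len(side) - 1):
            p = (side[i], side[i + 1])
            if p[0] == p[1] and last.get(p) == i - 1:
                continue  # overlapping occurrence such as the second aa in aaa
            pairs[p] += 1
            last[p] = i
    return all(c < 2 for c in pairs.values())


if __name__ == "__main__":
    print(is_irreducible(["A", "A"], {"A": ["a", "b"]}))  # grammar for abab
    print(is_irreducible(list("abab"), {}))  # pair ab repeats
```

In the second call the grammar consists of the starting string `abab` alone, which violates the third condition because the pair `ab` occurs twice.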

Unfortunately, most irreducible grammars have the same issue as Re-Pair: they can introduce new symbols without decreasing the entropy of the starting string while increasing the bit size of the grammar. In particular, the example from Lemma 4.2 applies to irreducible grammars (i.e. any irreducible grammar generating has at least nonterminals).

Ideally, for an irreducible grammar we would like to repeat the argument as for Re-Pair: we can stop at some of the iteration (or decompress some nonterminals as in [3]) such that the grammar is small and has a small, i.e. , number of nonterminals. It turns out that there are examples of irreducible grammars which do not have this property, Example 4.3 gives such a grammar. Moreover, grammar compressors that work in a top-down manner, like LongestMatch, tend to produce such grammars. Lastly, the grammar in Example 4.3 has size , which is the best possible estimation for the size of irreducible grammars. Consider the grammar where each production represents binary string: The above example can be generalized for binary string of any length. Decompressing any set of nonterminals such that only nonterminals remain yields a grammar of size

Still, we are able to prove positive results assuming our encodings, though with worse constant. Similarly, as in the proof of Theorem 4.2, we use Theorem 3 to lower bound the entropy of in the naive case, thus it seems, that using only previously known tools [25] such bounds could not be obtained. Let be a string over -size alphabet, Then the size of the entropy coding of any irreducible full grammar generating is at most .

Let be a string over an alphabet of size , . The size of fully naive coding of any irreducible full grammar generating is at most .

### 4.4 Greedy

Greedy [2] can be viewed as a non-binary Re-Pair: in each round it replaces a substring in , obtaining , such that is smallest possible. It is known to produce small grammars in practice, both in terms of nonterminal size and bit size. Its asymptotic construction time so far has been only bounded by . Moreover, it is notorious for being hard to analyse in terms of approximation ratio [5, 20].

Greedy has similar properties as Re-Pair: the frequency of the most frequent pair does not decrease and the size of the grammar can be estimated in terms of this frequency. In particular, there always is a point of its execution in which the number of nonterminals is and the full grammar is of size . The entropy encoding at this time yields , while Greedy run till the end achieves using entropy coding and using fully naive encoding, so the same as in the case of Re-Pair.

In practice stopping Greedy early is beneficial: as in Re-Pair, there can be a point after which adding new symbols does not decrease the bit-size of the output. Indeed, it was suggested [2] to stop Greedy as soon as the decrease of grammar size is small enough, yet this was not experimentally evaluated. Moreover, as the time needed for one iteration is linear, we can stop after iterations, ending with running time; again, our analysis suggests that the factor depends on the constant in the additional summand.

The proofs for Greedy are similar in spirit to those for Re-Pair, yet the technical details differ. In particular, we use the fact that Greedy is irreducible, and that conditions (IG1)–(IG2) are satisfied at every iteration.

Let , be a string over -size alphabet, . When the size of the full grammar (i.e. ) produced by Greedy is first below then the number of nonterminals of is at most and the bit-size of entropy coding of is at most . For any string such point always exists.

[cf. [30]] Let , be a string over -size alphabet, . Then the size of the fully naive encoding of full grammar produced by Greedy is at most .

Let , be a string over -size alphabet, . Then the size of entropy encoding of full grammar produced by Greedy is at most .

## 5 Lower bound on parsing-based methods

All upper bounds presented in previous sections assumed and have an additive term . In this section, we show that both are unavoidable. To this end, we construct a family of strings for which the bounds of Theorem 3 are tight and explain how this implies that the conditions on and the additive term cannot be strengthened. In particular, this answers (negatively) the question from [30] whether we can prove similar bounds for Re-Pair when .

#### Generalized de Bruijn words

The constructed family of words generalizes de Bruijn strings, which, for a given alphabet and order , contain each word exactly once as a substring.

For every , , there exists a string over alphabet of size of length such that:

1. for ;

2. for ;

3. no word of length occurs more than once in .

For the promised family are constructed from de Bruijn strings by appropriate letter merges. For those strings the frequency of each substring depends (almost) only on its length, thus the bounds on the entropy are easy to show. For larger we make an inductive (on ) construction, which is similar to the construction of de Bruijn strings: we construct a graph with edges labelled with letters, and the desired strings correspond to Eulerian cycles in this graph. More precisely, the st graph is exactly the line graph of the th one. We guarantee that the frequency of words depends only on their lengths, though the exact condition is more involved than in the case of de Bruijn strings. For : . For the word is .
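For orientation, the classical construction that this generalizes can be sketched as follows: a de Bruijn string is read off an Eulerian cycle in a graph whose nodes are words of length k-1 and whose edges are labelled with letters. This is the textbook construction (via Hierholzer's algorithm), not the generalized family from the theorem; the function name `de_bruijn_eulerian` and the choice of start node are assumptions made for the example, and k is assumed to be at least 2.

```python
from itertools import product


def de_bruijn_eulerian(alphabet, k):
    """Return a linear string in which every word of length k over
    `alphabet` occurs exactly once (k >= 2 is assumed)."""
    nodes = ["".join(p) for p in product(alphabet, repeat=k - 1)]
    # Each node has one outgoing edge per letter; an edge labelled a
    # leads from node v to node (v + a)[1:].
    unused = {v: list(alphabet) for v in nodes}
    start = nodes[0]
    stack, cycle = [start], []
    # Hierholzer's algorithm: walk along unused edges, backtrack when stuck.
    while stack:
        v = stack[-1]
        if unused[v]:
            a = unused[v].pop()
            stack.append((v + a)[1:])
        else:
            cycle.append(stack.pop())
    cycle.reverse()
    # The label of the edge entering a node is that node's last letter.
    return start + "".join(v[-1] for v in cycle[1:])
```

The returned string has length `len(alphabet)**k + k - 1`; dropping its last k-1 letters gives the usual cyclic de Bruijn sequence.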

#### Natural parsers

The strings from Theorem 5 are now used to show lower bounds on the size of parsings produced by various algorithms. Clearly, the lower bounds cannot apply to all parsings, as one can take a parsing into a single phrase. Thus we consider the “natural” parsings, in which a word can be made a phrase if it occurs twice or is short.

An algorithm is a natural-parser if given a string over an alphabet of size it produces its parsing such that for each phrase , of either , or ; moreover, it encodes using at least bits. Note that phrases of length that occur once are allowed, as for them and .

Re-Pair, algorithms producing irreducible grammars, LZ78 and non self-referencing LZ77 (with appropriate encoding) are natural parsers.

Natural parsers on words defined in Theorem 5 cannot do much better than the mean of entropies, which gives general bounds on algorithms inducing natural parsers.

Let A be a natural parser. Let be a non-negative and integer function of and satisfying, for every , , where denotes value of for and . Then for any there exists an infinite family of strings , where , of increasing length, such that the bit-size of the output of A on is at least:

$$|S|H_k(S) + \frac{\rho|S|(\log\sigma - 2\lambda)}{2} \ge (1+\rho)|S|H_k(S) - \lambda|S|,$$

where and .

There are several consequences of Theorem 5 for natural parsers. First, they cannot go below bits on each string, and if they achieve the entropy (on each string) then an additive term of bits is needed.

Let be a function of , where , and A be a natural parsing algorithm. Then there exists an infinite family of strings of increasing length such that for each , , the size of the output generated by A on is at least: .

Secondly, extending the bounds to for a constant implies that (without a constant coefficient) is not achievable. This gives a (negative) answer to the question asked in [30] whether we can prove results for Re-Pair when .

Let be a function of such that , and A be a natural parsing based algorithm. Then there exists an infinite family of strings of increasing length such that for each , , if A achieves bits then .

Lastly, Theorem 3 is tight for natural parsers.

For any there exists an infinite family of strings such that if then no parsing with phrases shorter than achieves , for .

## 6 Conclusions and open problems

The lower bounds provided in Section 5 hold for specific types of algorithms. Yet, there are algorithms achieving bits for where , e.g. the ones based on BWT [10]. Moreover, -order PPM-based methods should also encode words defined above efficiently, as many of them use adaptive arithmetic coding for each context separately. Can we generalize the techniques so that they provide some bounds also for those scenarios?

There are parsing based compressed text representations [18] achieving for , where ; but they encode parsing using -order entropy coders. This comes at a cost, as such representations do not allow for retrieval of substrings of length in constant time, which is possible for indexes based on parsings and using -order entropy coding [14, 11]. Can we estimate the time-space tradeoffs?

We considered bounds , where . Kosaraju and Manzini [25] also considered the stronger notion of coarse optimality, in which they require that . They developed coarse optimal algorithms only for . It is not known whether similar results can be obtained for grammar compression, though our lower bounds provide some insights. On the one hand, there are examples of small entropy strings on which most grammar compressors perform badly [5, 19]; on the other hand, there are exceptions, e.g. Greedy.

It should be possible to extend the construction of words from Theorem 5 such that for any constant we have for and for , for example by starting the construction with de Bruijn words over an alphabet larger than binary. This would prove that we cannot hope for a bound of for .

## References

• [1] Janos Aczél. On Shannon’s inequality, optimal coding, and characterizations of Shannon’s and Rényi’s entropies. In Symposia Mathematica, volume 15, pages 153–179, 1973.
• [2] Alberto Apostolico and Stefano Lonardi. Some theory and practice of greedy off-line textual substitution. In Data Compression Conference, 1998. DCC’98. Proceedings, pages 119–128. IEEE, 1998.
• [3] Martin Bokler and Eric Hildebrandt. Rule reduction—a method to improve the compression performance of grammatical compression algorithms. AEU-International Journal of Electronics and Communications, 65(3):239–243, 2011.
• [4] Katrin Casel, Henning Fernau, Serge Gaspers, Benjamin Gras, and Markus L. Schmid. On the complexity of grammar-based compression. In ICALP, LNCS. Springer, 2016. accepted.
• [5] Moses Charikar, Eric Lehman, Ding Liu, Rina Panigrahy, Manoj Prabhakaran, Amit Sahai, and Abhi Shelat. The smallest grammar problem. IEEE Trans. Information Theory, 51(7):2554–2576, 2005.
• [6] Francisco Claude and Gonzalo Navarro. A fast and compact web graph representation. In Nivio Ziviani and Ricardo A. Baeza-Yates, editors, SPIRE, volume 4726 of LNCS, pages 118–129. Springer, 2007.
• [7] Francisco Claude and Gonzalo Navarro. Improved grammar-based compressed indexes. In International Symposium on String Processing and Information Retrieval, pages 180–192. Springer, 2012.
• [8] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience, 2006.
• [9] Paolo Ferragina, Fabrizio Luccio, Giovanni Manzini, and S Muthukrishnan. Structuring labeled trees for optimal succinctness, and beyond. In Foundations of Computer Science, 2005. FOCS 2005. 46th Annual IEEE Symposium on, pages 184–193. IEEE, 2005.
• [10] Paolo Ferragina and Giovanni Manzini. Compression boosting in optimal linear time using the Burrows-Wheeler transform. In Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’04, pages 655–663, Philadelphia, PA, USA, 2004. Society for Industrial and Applied Mathematics.
• [11] Paolo Ferragina and Rossano Venturini. A simple storage scheme for strings achieving entropy bounds. Theor. Comput. Sci., 372(1):115–121, 2007.
• [12] Johannes Fischer, Veli Mäkinen, and Gonzalo Navarro. Faster entropy-bounded compressed suffix trees. Theor. Comput. Sci., 410(51):5354–5364, 2009.
• [13] Michał Gańczorz and Artur Jeż. Improvements on re-pair grammar compressor. 2017 Data Compression Conference (DCC), pages 181–190, 2017.
• [14] Rodrigo González and Gonzalo Navarro. Statistical encoding of succinct data structures. In Moshe Lewenstein and Gabriel Valiente, editors, Combinatorial Pattern Matching, 17th Annual Symposium, CPM 2006, Barcelona, Spain, July 5-7, 2006, Proceedings, volume 4009 of Lecture Notes in Computer Science, pages 294–305. Springer, 2006.
• [15] Rodrigo González and Gonzalo Navarro. Compressed text indexes with fast locate. In Annual Symposium on Combinatorial Pattern Matching, pages 216–227. Springer, 2007.
• [16] Rodrigo González and Gonzalo Navarro. Compressed text indexes with fast locate. In Bin Ma and Kaizhong Zhang, editors, CPM, volume 4580 of LNCS, pages 216–227. Springer, 2007.
• [17] Rodrigo González, Gonzalo Navarro, and Héctor Ferrada. Locally compressed suffix arrays. J. Exp. Algorithmics, 19:1.1:1.1–1.1:1.30, January 2015.
• [18] Roberto Grossi, Rajeev Raman, Srinivasa Rao Satti, and Rossano Venturini. Dynamic compressed strings with random access. In Automata, Languages, and Programming - 40th International Colloquium, ICALP 2013, Riga, Latvia, July 8-12, 2013, Proceedings, Part I, pages 504–515. Springer, 2013.
• [19] Danny Hucke, Artur Jeż, and Markus Lohrey. Approximation ratio of repair. CoRR, abs/1703.06061, 2017.
• [20] Danny Hucke, Markus Lohrey, and Carl Philipp Reh. The smallest grammar problem revisited. In Shunsuke Inenaga, Kunihiko Sadakane, and Tetsuya Sakai, editors, SPIRE, volume 9954 of LNCS, pages 35–49, 2016.
• [21] Guy Joseph Jacobson. Succinct Static Data Structures. PhD thesis, Carnegie Mellon University, Pittsburgh, PA, USA, 1988. AAI8918056.
• [22] Artur Jeż. A really simple approximation of smallest grammar. Theoretical Computer Science, 616:141–150, 2016.
• [23] Takuya Kida, Tetsuya Matsumoto, Yusuke Shibata, Masayuki Takeda, Ayumi Shinohara, and Setsuo Arikawa. Collage system: a unifying framework for compressed pattern matching. Theor. Comput. Sci., 1(298):253–272, 2003.
• [24] John C. Kieffer and En-Hui Yang. Grammar-based codes: A new class of universal lossless source codes. IEEE Transactions on Information Theory, 46(3):737–754, 2000.
• [25] S. Rao Kosaraju and Giovanni Manzini. Compression of low entropy strings with Lempel-Ziv algorithms. SIAM J. Comput., 29(3):893–911, 1999.
• [26] Sebastian Kreft and Gonzalo Navarro. Self-indexing based on LZ77. In Annual Symposium on Combinatorial Pattern Matching, pages 41–54. Springer, 2011.
• [27] N. Jesper Larsson and Alistair Moffat. Offline dictionary-based compression. In Data Compression Conference, pages 296–305. IEEE Computer Society, 1999.
• [28] Markus Lohrey. Algorithmics on SLP-compressed strings: A survey. Groups Complexity Cryptology, 4(2):241–299, 2012.
• [29] Giovanni Manzini. An analysis of the Burrows-Wheeler transform. J. ACM, 48(3):407–430, May 2001.
• [30] Gonzalo Navarro and Luís M. S. Russo. Re-Pair achieves high-order entropy. In DCC, page 537. IEEE Computer Society, 2008.
• [31] Craig G. Nevill-Manning and Ian H. Witten. Compression and explanation using hierarchical grammars. The Computer Journal, 40(2 and 3):103–116, 1997.
• [32] Craig G. Nevill-Manning and Ian H. Witten. Identifying hierarchical structure in sequences: A linear-time algorithm. J. Artif. Intell. Res. (JAIR), 7:67–82, 1997.
• [33] Wojciech Rytter. Application of Lempel-Ziv factorization to the approximation of grammar-based compression. Theor. Comput. Sci., 302(1-3):211–222, 2003.
• [34] Kunihiko Sadakane and Roberto Grossi. Squeezing succinct data structures into entropy bounds. In Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm, pages 1230–1239. Society for Industrial and Applied Mathematics, 2006.
• [35] Hiroshi Sakamoto. A fully linear-time approximation algorithm for grammar-based compression. J. Discrete Algorithms, 3(2-4):416–430, 2005.
• [36] James A. Storer and Thomas G. Szymanski. The macro model for data compression. In Richard J. Lipton, Walter A. Burkhard, Walter J. Savitch, Emily P. Friedman, and Alfred V. Aho, editors, STOC, pages 30–39. ACM, 1978.
• [37] Wojciech Szpankowski. Asymptotic properties of data compression and suffix trees. IEEE transactions on Information Theory, 39(5):1647–1659, 1993.
• [38] Yasuo Tabei, Yoshimasa Takabatake, and Hiroshi Sakamoto. A succinct grammar compression. In Annual Symposium on Combinatorial Pattern Matching, pages 235–246. Springer, 2013.
• [39] Yoshimasa Takabatake, Yasuo Tabei, and Hiroshi Sakamoto. Variable-length codes for space-efficient grammar-based compression. In International Symposium on String Processing and Information Retrieval, pages 398–410. Springer, 2012.
• [40] Michal Vasinek and Jan Platos. Prediction and evaluation of zero order entropy changes in grammar-based codes. Entropy, 19(5):223, 2017.
• [41] Raymond Wan. Browsing and Searching Compressed Documents. PhD thesis, Department of Computer Science and Software Engineering, University of Melbourne, 2003.
• [42] Aaron D. Wyner and Jacob Ziv. The sliding-window Lempel-Ziv algorithm is asymptotically optimal. Proceedings of the IEEE, 82(6):872–877, Jun 1994.
• [43] Jacob Ziv and Abraham Lempel. A universal algorithm for sequential data compression. IEEE Trans. Information Theory, 23(3):337–343, 1977.
• [44] Jacob Ziv and Abraham Lempel. Compression of individual sequences via variable-rate coding. IEEE Trans. Information Theory, 24(5):530–536, 1978.

## Appendix A Additional material for Section 2

For simplicity if then .

We extend the notion to sets of words, i.e. .

The size of the nonterminal is the length of its right-hand side, for a grammar or full grammar the denote the lengths of concatenations of strings in , respectively.

## Appendix B Additional proofs for section 3

###### Proof of Lemma 3.

The lemma follows from the definition of and . The can be viewed in such a way that each letter occurring in context contributes to the sum. Now observe that is almost the same as , but for the first letters of each phrase instead of summand we have . ∎
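The view of the empirical entropy as a sum of per-letter contributions, one for each letter in its preceding context, can be made concrete with a short sketch. It assumes one common convention for k-th order empirical entropy, in which the first k letters of the string (which have no full context) contribute nothing; boundary conventions differ slightly between papers. The name `empirical_entropy_bits` is hypothetical.

```python
import math
from collections import Counter, defaultdict


def empirical_entropy_bits(s, k):
    """Return |S| * H_k(S) in bits for string s and context length k.

    Each position i >= k contributes -log2 of the empirical probability
    of the letter s[i] given the k letters that precede it.
    """
    followers = defaultdict(Counter)
    for i in range(k, len(s)):
        followers[s[i - k:i]][s[i]] += 1
    total = 0.0
    for counts in followers.values():
        context_total = sum(counts.values())
        for c in counts.values():
            # c positions share the same contribution log2(context_total / c).
            total -= c * math.log2(c / context_total)
    return total
```

For instance, on `"abababab"` the order-0 value is 8 bits, while the order-1 value is 0 because each letter is fully determined by its predecessor.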

###### Proof of Lemma 3.

Consider parsings of , where in the first phrase has letters and the other phrases are of length , except maybe the last phrase (to streamline the argument, we add an empty phrase to ). Denote We estimate the sum . The cost of each of the first phrases is upper-bounded by , as the cost of any phrase is at most :

$$C(y) = -\log\prod_{j=1}^{|y|}\frac{|S|_{y[1\ldots j]}}{|S|_{y[1\ldots j-1]}} = -\log\left(\frac{|S|_{y}}{|S|_{\epsilon}}\right) \le \log(n/1).$$
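The telescoping in the bound above can be checked numerically with a small sketch. It assumes that the count of a word is its number of (possibly overlapping) occurrences in the string and that the empty word is counted once per letter, i.e. n times; the names `occurrences` and `phrase_cost` are hypothetical.

```python
import math


def occurrences(s, w):
    """Number of (possibly overlapping) occurrences of w in s.
    Assumed convention: the empty word occurs len(s) times."""
    if not w:
        return len(s)
    m = len(w)
    return sum(1 for i in range(len(s) - m + 1) if s[i:i + m] == w)


def phrase_cost(s, y):
    """Cost of phrase y as a sum over its prefixes, in bits.
    y is assumed to occur in s, so every count is positive."""
    cost = 0.0
    for j in range(1, len(y) + 1):
        cost -= math.log2(occurrences(s, y[:j]) / occurrences(s, y[:j - 1]))
    return cost
```

With `s = "abracadabra"` and `y = "abra"` the consecutive ratios are 5/11, 2/5, 1 and 1, so the product collapses to 2/11 and the cost equals log2(11/2), which is below log2(11).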

Then without the costs of these phrases is:

$$-\sum_{i=0}^{l-1}\sum_{p=1}^{|Y_S^i|}\sum_{j=1}^{|y_{i,p}|}\log\frac{|S|_{y_{i,p}[1..j]}}{|S|_{y_{i,p}[1..j-1]}} \tag{3}$$

and we claim that this is exactly . Together with the estimation of the cost of the first phrases this yields the claim, as

$$\sum_{i=0}^{l-1}C(Y_S^i) \le l\log n + |S|\sum_{i=0}^{l-1}H_i(S)$$

and so one of the parsings has cost that is at most the right-hand side divided by .

To see that (3) is indeed the sum of entropies, observe that for each position of the word we count of probability of this letter occurring in preceding -letter context exactly once: this is clear for , as the consecutive parsings are offset by one position; for observe that in the letter at position we count of probability of this letter occurring in preceding