On Stricter Reachable Repetitiveness Measures*

05/28/2021 ∙ by Gonzalo Navarro, et al.

The size b of the smallest bidirectional macro scheme, which is arguably the most general copy-paste scheme to generate a given sequence, is considered to be the strictest reachable measure of repetitiveness. It is strictly lower-bounded by measures like γ and δ, which are known or believed to be unreachable and to capture the entropy of repetitiveness. In this paper we study another sequence generation mechanism, namely compositions of a morphism. We show that these form another plausible mechanism to characterize repetitive sequences and define NU-systems, which combine such a mechanism with macro schemes. We show that the size ν≤ b of the smallest NU-system is reachable and can be o(δ) for some string families, thereby implying that the limit of compressibility of repetitive sequences can be even smaller than previously thought. We also derive several other results characterizing ν.


1 Introduction

The study of repetitiveness measures, and of suitable measures of compressibility of repetitive sequences, has recently attracted interest thanks to the surge of repetitive text collections in areas like Bioinformatics, and versioned software and document collections. A recent survey [15] identifies a number of those measures, separating those that are reachable (i.e., any sequence can be represented within that space) from those that are not, which are still useful as lower bounds.

Reachable measures are, for example, the size $g$ of the smallest context-free grammar that generates the sequence [8], the size $c$ of the smallest collage system that generates the sequence [7] (which generalizes grammars), the number $z$ of phrases of the Lempel-Ziv parse of the sequence [10], or the number $b$ of phrases of a bidirectional macro scheme that represents the sequence [19]. Such a macro scheme cuts the sequence into phrases so that each phrase is either an explicit symbol or can be copied from elsewhere in the sequence, in a way that no cyclic dependencies are introduced. As such, macro schemes are the ultimate measure of what can be obtained by “copy-paste” mechanisms, which characterize repetitive sequences well.

Other measures are designed as lower bounds on the compressibility of repetitive sequences: $\gamma$ is the size of the smallest string attractor for the sequence [6] and $\delta$ is a measure derived from the string complexity [17, 3].

In asymptotic terms, it holds $\delta \le \gamma \le b \le c \le z \le g$ and, except for $c \le z$, there are string families where each measure is asymptotically smaller than the next. The recent result by Bannai et al. [2], showing that there exists a string family where $\gamma = o(b)$, establishes a clear separation between unreachable lower bounds ($\delta$, $\gamma$) and reachable measures ($b$ and the larger ones).

Concretely, Bannai et al. show that $\gamma = O(1)$ and $b = \Theta(\log n)$ for the Thue-Morse family, defined as $t_0 = 0$ and $t_{k+1} = t_k \overline{t_k}$, where $\overline{t_k}$ is $t_k$ with $0$s converted to $1$s and vice versa. This family is a well-known example of the fixed point of a morphism $\varphi$, defined in this case by the rules $0 \to 01$ and $1 \to 10$. Then, $t_k$ is simply $\varphi^k(0)$. This representation of the words in the family is of size $O(1)$, and each word can be easily produced in optimal time by iterating the morphism.
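As a small illustration of the two views of the Thue-Morse family, the following sketch iterates the standard Thue-Morse morphism ($0 \to 01$, $1 \to 10$) and compares the result with the complement-based recurrence. The function names are chosen here for illustration and do not come from the paper.

```python
def apply_morphism(rules, word, k=1):
    """Apply the morphism given by `rules` (symbol -> string) to `word`, k times."""
    for _ in range(k):
        word = "".join(rules[c] for c in word)
    return word


# Standard Thue-Morse morphism.
THUE_MORSE = {"0": "01", "1": "10"}


def thue_morse_by_complement(k):
    """t_0 = 0 and t_{k+1} = t_k followed by the bitwise complement of t_k."""
    t = "0"
    for _ in range(k):
        t += t.translate(str.maketrans("01", "10"))
    return t


if __name__ == "__main__":
    for k in range(6):
        assert apply_morphism(THUE_MORSE, "0", k) == thue_morse_by_complement(k)
    print(apply_morphism(THUE_MORSE, "0", 4))  # 0110100110010110
```

The morphism itself is a constant-size description; only the iteration count changes from one word of the family to the next.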

Iterating a small morphism is arguably a relevant mechanism to define repetitive sequences. Intuitively, any short repetition that arises along the generation of a long string turns into a longer repetition in the final string, steps later. More formally, if a morphism is $k$-uniform (i.e., all its rules are of fixed length $k$), then the resulting sequence is so-called $k$-automatic [1] and its prefixes have an attractor of size $O(\log n)$ [18]. That is, many small morphisms lead to sequences with low measures of repetitiveness. Further, in the Thue-Morse family, morphisms lead to a reachable measure of repetitiveness that is $O(1)$, below what can be achieved with copy-paste mechanisms.

In this paper we further study this formalism. First, we define macro systems, a grammar-like extension that we prove equivalent to bidirectional macro schemes. We then study deterministic Lindenmayer systems [11, 12], a grammar-like mechanism generating infinite strings via iterated morphisms; they are stopped at some level to produce a finite string. We combine both systems into what we call NU-systems. The size $\nu$ (“nu”) of the smallest NU-system is always reachable and $\nu = O(b)$. Further, we show that there are string families where $\nu = o(\delta)$, thereby showing that $\delta$ is no longer a lower bound for the compressibility of repetitive sequences if we include other plausible mechanisms to represent them. We present several other results that help characterize the new measure $\nu$.

2 Basic Concepts

2.1 Terminology

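Let $\Sigma$ be a set of symbols, called the alphabet. A string $w$ of length $n$ (also denoted as $|w|$ when needed) is a concatenation of $n$ symbols from $\Sigma$; in particular the string of length $0$ is denoted by $\varepsilon$. The set of $n$-length concatenations of symbols from $\Sigma$ is denoted $\Sigma^n$, and the set of strings over $\Sigma$ is defined as $\Sigma^* = \bigcup_{n \ge 0} \Sigma^n$; we also define $\Sigma^+ = \Sigma^* \setminus \{\varepsilon\}$. We juxtapose strings ($xy$) or combine them with the dot operator ($x \cdot y$) to denote their concatenation. A string $x$ is a prefix of $w$ if $w = xz$, a suffix of $w$ if $w = yx$, and a substring of $w$ if $w = yxz$, for some $y, z \in \Sigma^*$. Let $w[1:n]$ denote an $n$-length string. Then $w[i]$ is the $i$-th symbol of $w$, and $w[i:j]$ is the substring $w[i] w[i+1] \cdots w[j]$ if $i \le j$, and $\varepsilon$ if $i > j$.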

2.2 Parsing based schemes

Probably the most popular measure of repetitiveness is the number $z$ of phrases in the so-called Lempel-Ziv parse of a word $w[1:n]$ [10]. In such a parse, $w$ is partitioned optimally into phrases $w = x_1 \cdots x_z$, so that every $x_i$ is either of length $1$ or it appears starting to the left in $w$ (so the phrase is copied from some source at its left). This parsing can be computed in $O(n)$ time.
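The parse just described can be sketched with a naive greedy scan. This is only an illustration under the assumption that a source may overlap the phrase it defines; it runs in superlinear time and is not the linear-time algorithm mentioned above. The function name `lz_parse` is hypothetical.

```python
def lz_parse(w):
    """Greedy Lempel-Ziv-style parse of w.

    Each phrase is either a single symbol or the longest prefix of the
    remaining text that also occurs starting at an earlier position
    (the source may overlap the phrase itself).
    """
    phrases, i, n = [], 0, len(w)
    while i < n:
        length = 0
        # w.find(sub, 0, end) only reports occurrences fully inside w[0:end];
        # with end = i + length, a candidate of length + 1 symbols must start
        # before position i.
        while i + length < n and w.find(w[i:i + length + 1], 0, i + length) != -1:
            length += 1
        length = max(length, 1)  # an unseen symbol forms a phrase of length 1
        phrases.append(w[i:i + length])
        i += length
    return phrases


if __name__ == "__main__":
    print(lz_parse("aaaa"))  # ['a', 'aaa']
    print(lz_parse("abab"))  # ['a', 'b', 'ab']
```

The number of phrases returned plays the role of the measure counted by the parse.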

Storer and Szymanski [19] introduced bidirectional macro schemes, which allow sources to appear to the left or to the right of their phrases, as long as circular dependencies are avoided. We follow the definition by Bannai et al. [2].

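Let $w[1:n]$ be a string. A bidirectional macro scheme of size $b$ for $w$ is a sequence $B = ((x_1, s_1), \ldots, (x_b, s_b))$ satisfying $w = x_1 \cdots x_b$ and $x_i = w[s_i : s_i + |x_i| - 1]$ if $|x_i| > 1$, and $s_i = \bot$ if $|x_i| = 1$. We denote the starting position of $x_i$ in $w$ by $p_i = 1 + \sum_{j < i} |x_j|$. The function $f : [1..n] \cup \{\bot\} \to [1..n] \cup \{\bot\}$,

$$f(i) = \begin{cases} \bot & \text{if } i = \bot \text{ or } s_j = \bot, \\ s_j + i - p_j & \text{otherwise,} \end{cases} \qquad \text{where } p_j \le i < p_{j+1},$$

is induced by the macro scheme. For $B$ to be a valid bidirectional macro scheme it must hold that, for each $i \in [1..n]$, there exists some $k \ge 1$ satisfying $f^k(i) = \bot$. Therefore, it suffices with the values $s_i$ and $|x_i|$, plus $x_i$ where $s_i = \bot$, to recover $w$.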
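To make the decoding idea concrete, here is a minimal sketch that recovers a string from a bidirectional macro scheme by following source pointers until an explicit symbol is reached. The input encoding (a list of `(length, source)` pairs, where `source` is either a 1-based position or a one-character literal) is an assumption made for this example, not the paper's notation.

```python
def decode_bms(phrases):
    """Decode a bidirectional macro scheme.

    phrases: list of (length, source). `source` is a 1-based start position
    of the copied text, or a one-character string for an explicit symbol
    (in which case length must be 1).
    """
    n = sum(length for length, _ in phrases)
    jump = [None] * (n + 1)     # jump[i]: position that position i copies from
    literal = [None] * (n + 1)  # literal[i]: explicit symbol stored at i
    pos = 1
    for length, source in phrases:
        if isinstance(source, str):
            if length != 1 or len(source) != 1:
                raise ValueError("explicit phrases must have length 1")
            literal[pos] = source
        else:
            for offset in range(length):
                jump[pos + offset] = source + offset
        pos += length

    out = []
    for i in range(1, n + 1):
        j, steps = i, 0
        while literal[j] is None:
            j = jump[j]
            steps += 1
            if steps > n:  # more than n hops means a position was revisited
                raise ValueError("cyclic dependency: invalid macro scheme")
        out.append(literal[j])
    return "".join(out)


if __name__ == "__main__":
    # Both copied phrases take their source from the right.
    print(decode_bms([(2, 3), (2, 5), (1, "a"), (1, "b")]))  # ababab
```

A scheme such as `[(1, 1)]`, whose only phrase copies itself, is rejected because the pointer chain never reaches an explicit symbol.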

We call $b$ the number of elements in the smallest bidirectional macro scheme generating a given string $w$. There are string families where $b = o(z)$ [16]. While $z$ is computed in linear time, computing $b$ is NP-hard.

2.3 Grammars and generalizations

The size $g$ of the smallest context-free grammar generating (only) a word $w$ [8] is a relevant measure of repetitiveness. Such a grammar has exactly one rule per nonterminal, and those can be sorted so that the right-hand sides mention only terminals and previously listed nonterminals. The size of the grammar is the sum of the lengths of the right-hand sides of the rules. The expansion of a nonterminal is the string of terminals it generates; the word defined by the grammar is the expansion of its last listed nonterminal.

More formally, a grammar over the alphabet $\Sigma$ of terminals is a sequence of nonterminals $A_1, \ldots, A_r$, with a rule $A_i \to B_{i,1} \cdots B_{i,k_i}$, each $B_{i,j}$ being a terminal or some nonterminal in $A_1, \ldots, A_{i-1}$. The expansion of a terminal $a$ is $\exp(a) = a$, and that of a nonterminal $A_i$ is $\exp(A_i) = \exp(B_{i,1}) \cdots \exp(B_{i,k_i})$. The grammar represents the string $w = \exp(A_r)$, and its size is $k_1 + \cdots + k_r$.
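The ordered-rule view of a grammar translates directly into a bottom-up expansion. The sketch below assumes a hypothetical encoding in which each right-hand side is a list whose entries are either terminal strings or integer indices of previously listed nonterminals.

```python
def expand_grammar(rules):
    """Expand a grammar given as an ordered list of right-hand sides.

    rules[i] is a list whose items are terminals (str) or indices j < i of
    previously listed nonterminals (int). Returns the expansion of the last
    nonterminal and the grammar size (sum of right-hand-side lengths).
    """
    expansions = []
    for i, rhs in enumerate(rules):
        parts = []
        for sym in rhs:
            if isinstance(sym, int):
                if not 0 <= sym < i:
                    raise ValueError("rules may only mention earlier nonterminals")
                parts.append(expansions[sym])
            else:
                parts.append(sym)
        expansions.append("".join(parts))
    size = sum(len(rhs) for rhs in rules)
    return expansions[-1], size


if __name__ == "__main__":
    # A_1 -> a b, A_2 -> A_1 A_1, A_3 -> A_2 A_2 a
    word, size = expand_grammar([["a", "b"], [0, 0], [1, 1, "a"]])
    print(word, size)  # ababababa 7
```

Note that the expanded word can be exponentially longer than the grammar, so materializing every intermediate expansion is only reasonable for small examples.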

Composition systems were introduced by Gasieniec et al. [5]. Those add the ability to reference any prefix or suffix of the expansion of a previous nonterminal (and, thus, substrings as prefixes of suffixes). Let us use the more general form, allowing terms $A_j[l:r]$ where $\exp(A_j[l:r]) = \exp(A_j)[l:r]$.

Kida et al. [7] extended composition systems with run-length terms of the form $A \to B^t$, so that $\exp(A) = \exp(B)^t$, the expansion of $B$ concatenated $t$ times. They called this extension a collage system. We call $c$ the size of the smallest collage system generating a word $w$, and it always holds $b = O(c)$ (at least if the collage system is internal, that is, every $\exp(A)$ appears in $w$) and $c = O(z)$. There are string families where $b = o(c)$ [16], and where $c = o(g)$. Computing $g$ (and, probably, $c$ too) is NP-hard.

2.4 Lower bounds

Kempa and Prezza introduced the concept of string attractor [6], which yields an abstract measure that lower-bounds all the previous reachable measures.

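Let $w[1:n]$ be a string. A string attractor for $w$ is a set of positions $\Gamma \subseteq [1..n]$ where for every substring $w[i:j]$ there exists a copy $w[i':j']$ (i.e., $w[i:j] = w[i':j']$) and a position $p \in \Gamma$ with $i' \le p \le j'$. The measure $\gamma$ is defined as the cardinality of the smallest of such attractors for a given string $w$, and it always holds that $\gamma = O(b)$. Further, a string family where $\gamma = o(b)$ exists [2].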
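The attractor condition (every substring has some occurrence that covers a chosen position) can be checked directly on small strings. The following brute-force sketch is only meant to illustrate the definition; it is far too slow for real inputs, and the function names are hypothetical.

```python
from itertools import combinations


def is_attractor(w, positions):
    """Check whether `positions` (1-based) is a string attractor for w."""
    n = len(w)
    gamma = set(positions)
    for length in range(1, n + 1):
        covered = set()
        # Collect every substring of this length having an occurrence that
        # covers at least one attractor position.
        for start in range(1, n - length + 2):
            end = start + length - 1
            if any(start <= p <= end for p in gamma):
                covered.add(w[start - 1:end])
        for start in range(1, n - length + 2):
            if w[start - 1:start + length - 1] not in covered:
                return False
    return True


def smallest_attractor_size(w):
    """Exhaustive search for the smallest attractor size (tiny strings only)."""
    n = len(w)
    for size in range(1, n + 1):
        for candidate in combinations(range(1, n + 1), size):
            if is_attractor(w, candidate):
                return size
    return 0  # the empty string needs no positions


if __name__ == "__main__":
    print(is_attractor("abab", {2, 3}))     # True
    print(is_attractor("abab", {2}))        # False: no 'a' covers position 2
    print(smallest_attractor_size("aaaa"))  # 1
```

The exhaustive search is exponential in the string length, which is in line with attractors being used as a lower-bound device rather than as a practical representation.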

Kociumaka et al. [3] used the string complexity $P_w$ of $w$ to define a measure called $\delta$. Let $P_w(k)$ be the number of distinct substrings of length $k$ in $w$. Then $\delta = \max \{ P_w(k)/k : k \in [1..n] \}$. This measure is computed in $O(n)$ time and it always holds that $\delta \le \gamma$; there are string families where $\delta = o(\gamma)$ [9]. While $\delta$ is unreachable in some string families, any string can be represented in $O(\delta \log \frac{n}{\delta})$ space [9]. Measure $\delta$ has been proposed as a lower bound on the compressibility of repetitive strings, which we question in this paper.
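The measure $\delta$ is the maximum, over all lengths $k$, of the number of distinct length-$k$ substrings divided by $k$. A direct quadratic-time sketch of that definition follows; it is not the linear-time algorithm referred to above, and the function name is an assumption.

```python
from fractions import Fraction


def delta(w):
    """Return max over k of (number of distinct length-k substrings) / k."""
    n = len(w)
    if n == 0:
        return Fraction(0)
    return max(
        Fraction(len({w[i:i + k] for i in range(n - k + 1)}), k)
        for k in range(1, n + 1)
    )


if __name__ == "__main__":
    print(delta("aaaa"))  # 1
    print(delta("abab"))  # 2
```

Exact fractions are used so that values for different $k$ are compared without rounding.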

2.5 Morphisms over strings

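We explain some general concepts about morphisms acting over strings [1, 14]. A monoid $(M, \cdot, e)$ is a set $M$ with an associative operation $\cdot$ and a neutral element $e$ satisfying $e \cdot x = x \cdot e = x$ for every $x \in M$. We write $xy$ for $x \cdot y$ and say that $M$ is a monoid, instead of $(M, \cdot, e)$. A morphism of monoids is a function $\varphi : M_1 \to M_2$, where $M_1$ and $M_2$ are monoids, $\varphi(xy) = \varphi(x)\varphi(y)$ for every $x, y \in M_1$, and $\varphi(e_1) = e_2$.

Let $\Sigma$ be a set of symbols, and $\cdot$ the concatenation of strings. Then $\Sigma^*$ is a monoid with string concatenation, called the free monoid. A morphism of free monoids $\varphi : \Sigma_1^* \to \Sigma_2^*$ is defined completely just by specifying $\varphi$ on the symbols of $\Sigma_1$. If $\Sigma_1 = \Sigma_2$, then $\varphi$ is called an automorphism, and is iterable. We define the $k$-iteration (or composition) of $\varphi$ over $x$ as $\varphi^k(x)$, where $\varphi^0(x) = x$ and $\varphi^k(x) = \varphi(\varphi^{k-1}(x))$.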

Let be a morphism of free monoids. We define , , and . We say is expanding if , non-erasing if , and -uniform if , for every . A coding is a 1-uniform morphism. We say is prolongable on if for a non-empty string .

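Let $\varphi$ be an automorphism on $\Sigma^*$. Let $\varphi$ be prolongable on $a$, so $\varphi(a) = ax$. Then, $\varphi^\omega(a) = a\,x\,\varphi(x)\,\varphi^2(x) \cdots$ is the unique fixed point of $\varphi$ starting with $a$, that is, $\varphi(\varphi^\omega(a)) = \varphi^\omega(a)$ [14]. Words constructed in this fashion are called purely morphic words. If we apply a coding to them, we obtain morphic words. A morphic word obtained from a $k$-uniform morphism is said to be $k$-automatic [1].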
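When a morphism is prolongable on a symbol, each iterate is a prefix of the next one, so prefixes of the fixed point can be produced by iterating until the word is long enough. The sketch below illustrates this with the Fibonacci morphism ($a \to ab$, $b \to a$) and the Thue-Morse morphism; the helper name and the growth check are assumptions made for the example.

```python
def fixed_point_prefix(rules, a, n):
    """Return the first n symbols of the fixed point of `rules` starting at a.

    Requires the morphism to be prolongable on a: rules[a] = a + x, x nonempty.
    """
    if not (rules[a].startswith(a) and len(rules[a]) > 1):
        raise ValueError("morphism is not prolongable on the given symbol")
    word = a
    while len(word) < n:
        longer = "".join(rules[c] for c in word)
        if len(longer) == len(word):
            raise ValueError("iterates stopped growing; prefix is not defined")
        assert longer.startswith(word)  # each iterate extends the previous one
        word = longer
    return word[:n]


FIBONACCI = {"a": "ab", "b": "a"}
THUE_MORSE = {"0": "01", "1": "10"}

if __name__ == "__main__":
    print(fixed_point_prefix(FIBONACCI, "a", 8))   # abaababa
    print(fixed_point_prefix(THUE_MORSE, "0", 8))  # 01101001
```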

3 Macro Systems

Our first contribution is the definition of macro systems, a generalization of composition systems we prove to be as powerful as bidirectional macro schemes. That is, the smallest macro system generating a given string is of size $O(b)$.

Definition 1

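A macro system is a tuple $M = (V, \Sigma, R, S)$, where $V$ is a finite set of symbols called the variables, $\Sigma$ is a finite set of symbols disjoint from $V$ called the terminals, $R$ is the set of rules (exactly one per variable)

$$R : V \to (V \cup \Sigma \cup \{A[l:r] \mid A \in V,\ l, r \in \mathbb{N}\})^*,$$

and $S \in V$ is the initial variable. If $R(A) = \alpha$ is the rule for $A$, we also write $A \to \alpha$. The symbols $A[l:r]$ are called extractions. The rule $A \to \varepsilon$ is permitted only for $A = S$. The size of a macro system is the sum of the lengths of the right-hand sides of the rules, $size(M) = \sum_{A \in V} |R(A)|$.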

We now define the string generated by a macro system as the expansion of its initial symbol, $\exp(S)$. Such expansions are defined as follows.

Definition 2

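Let $M = (V, \Sigma, R, S)$ be a macro system. The expansion of a symbol is a string over $\Sigma$ defined inductively as follows:

  • If $a \in \Sigma$ then $\exp(a) = a$.

  • If $S \to \varepsilon$, then $\exp(S) = \varepsilon$.

  • If $A \to B_1 \cdots B_k$ is a rule, then $\exp(A) = \exp(B_1) \cdots \exp(B_k)$.

  • $\exp(A[l:r]) = \exp(A)[l:r]$ (this second $[l:r]$ denotes substring).

We say that the macro system is valid if there is a single solution $w \in \Sigma^*$ for $\exp(S)$. We say that the macro system generates the string $w = \exp(S)$.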

Note that a macro system looks very similar to a composition system; however, it does not impose an order so that each symbol references only previous ones. This algorithm determines the string generated by a macro system, if any:

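  1. Compute $|\exp(A)|$ for every nonterminal $A$, using the rules:

    • If $a \in \Sigma$, then $|\exp(a)| = 1$.

    • If $A \to B_1 \cdots B_k$, then $|\exp(A)| = |\exp(B_1)| + \cdots + |\exp(B_k)|$.

    • $|\exp(A[l:r])| = r - l + 1$.

    This must generate a system of equations without loops (otherwise the macro system is invalid), which is then trivially solved.

  2. Replace every symbol $A[l:r]$ by $A[l] \cdots A[r]$; we use $A[i]$ to denote $A[i:i]$.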

  3. Replace every $A[i]$, if $A \to B_1 \cdots B_k$, iterating until obtaining a terminal:

    • Let $p_j = |\exp(B_1)| + \cdots + |\exp(B_j)|$, for $0 \le j \le k$.

    • Let $j$ be such that $p_{j-1} < i \le p_j$.

    • If $B_j \in V$, replace $A[i]$ by $B_j[i - p_{j-1}]$.

    • Otherwise replace $A[i]$ by $B_j$.

  4. If the process to replace any $A[i]$ falls in a loop (i.e., we return to $A[i]$), then the system has no unique solution and thus it is invalid. Otherwise, we are left with a classical context-free grammar without extractions, and compute $\exp(S)$ in the classical way.
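The steps above can be sketched as follows. This is an illustrative variant rather than a literal transcription: instead of rewriting right-hand sides, it resolves each output position lazily by chasing extractions until a terminal is reached. The rule encoding (terminals as one-character strings, variables as dictionary keys, extractions as `(variable, l, r)` tuples with 1-based inclusive ranges) is an assumption of this example.

```python
def expand_macro_system(rules, start):
    """Return the string generated by a macro system, or raise ValueError.

    rules: dict mapping each variable to a list of items. An item is a
    terminal (a string that is not a key of `rules`), a variable (a key of
    `rules`), or a tuple (var, l, r) denoting the extraction var[l:r].
    """
    lengths, computing = {}, set()

    def item_length(item):
        if isinstance(item, tuple):
            return item[2] - item[1] + 1
        return length(item) if item in rules else 1

    def length(var):
        # Step 1: solve the length equations, rejecting circular ones.
        if var not in lengths:
            if var in computing:
                raise ValueError("circular length equations")
            computing.add(var)
            lengths[var] = sum(item_length(item) for item in rules[var])
            computing.discard(var)
        return lengths[var]

    def symbol_at(var, i):
        # Steps 2-4: follow position i of var until a terminal is found.
        seen = set()
        while True:
            if (var, i) in seen:
                raise ValueError("extraction loop: no unique solution")
            seen.add((var, i))
            for item in rules[var]:
                size = item_length(item)
                if i <= size:
                    break
                i -= size
            else:
                raise ValueError("position outside the expansion")
            if isinstance(item, tuple):
                var, i = item[0], item[1] + i - 1
            elif item in rules:
                var = item  # i is already the offset inside this variable
            else:
                return item

    return "".join(symbol_at(start, i) for i in range(1, length(start) + 1))


if __name__ == "__main__":
    print(expand_macro_system({"S": ["a", ("S", 1, 4)]}, "S"))       # aaaaa
    print(expand_macro_system({"S": [("S", 3, 4), "a", "b"]}, "S"))  # abab
```

In the first example the single rule copies from its own expansion, which behaves like a run; the second copies from the right, which an ordered composition system would not allow.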

Note that a rule like solves only for , just like the run-length symbol of collage systems. For example, and generates as follows:

This shows that macro systems are at least as powerful as collage systems. But they can be asymptotically smaller. For example, the smallest collage system generating the Fibonacci string (where , , and ) is of size [16, Thm. 32]. Instead, we can mimic a bidirectional macro scheme of size [16, Lem. 35] with a constant-sized macro system generating : if

is odd and

if is even (where ). For example, for the system is and we extract as follows, using that , , , and :

In general, we can prove that a restricted class of our macro systems is equivalent to bidirectional macro schemes.

Definition 3

A macro system generating is internal if appears in for every . We use to denote the size of the smallest internal macro system generating .

Theorem 3.1

It always holds that .

Proof

Let be the smallest bidirectional macro scheme generating . We construct a macro system with a single rule , where is the single terminal if , and the extraction symbol if not.

We now show that this macro system is valid. After we execute step 2 of our algorithm, the length of the resulting string (which we call ) is already : it has only terminals and symbols of the form . Note that this implies that in the bidirectional macro scheme. In every step, we replace each such by . Since the macro scheme is valid, for each there is a finite such that , and thus becomes a terminal symbol after steps. ∎

Theorem 3.2

For every internal macro system of size there is a bidirectional macro scheme of size .

Proof

An internal macro system generating can always be transformed into one with a single rule for the initial symbol. Let be such that . We can then replace every occurrence of by , and every occurrence of by , on the right-hand sides of all the rules. In particular, the rule defining will now contain terminals and symbols of the form , and thus all the other nonterminals can be deleted.

From the resulting macro system , where , we can derive a bidirectional macro scheme , as follows: if is a terminal, then is that terminal and . Otherwise, is of the form and then and . The resulting scheme is valid, because our algorithm extracts any after a finite number of steps, which is then the such that . ∎

That is, bidirectional macro schemes are equivalent to internal macro systems. General macro systems can be asymptotically smaller in principle, though we have not found an example where this happens.

4 Deterministic Lindenmayer Systems

In this section we study a mechanism for generating infinite sequences called deterministic Lindenmayer Systems [11, 12], which build on morphisms. We adapt those systems to generate finite repetitive strings. Those systems are, in essence, grammars with only nonterminals, which typically generate longer and longer strings, in a levelwise fashion. For our purposes, we will also specify at which level $d$ to stop the generation process and the length $n$ of the string to generate. The generated string is then the $n$-length prefix of the sequence of nonterminals obtained at level $d$. We adapt, in particular, the variant called CD0L-systems, though we will use the generic name L-systems for simplicity.

Definition 4

An L-system is a tuple $L = (V, R, S, \tau, d, n)$, where $V$ is a finite set of symbols called variables, $R : V \to V^+$ is the set of rules, $S \in V^*$ is a sequence of variables called the axiom, $\tau : V \to V$ is a coding, $d$ is the level where to stop, and $n$ is the length of the string to generate.

An L-system produces levels of strings $L_i$, starting from $L_0 = S$ at level 0. Each level replaces every variable $A$ from the previous level by $R(A)$, that is, $L_{i+1} = R(L_i)$ if we identify $R$ with its homomorphic extension. The generated string is $\tau(L_d)[1:n]$, seeing $\tau$ as its homomorphic extension.

The size of an L-system is $size(L) = |S| + \sum_{A \in V} |R(A)|$. We call $\ell$ the size of the smallest L-system generating a string $w$.
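The generation process of an L-system (rewrite every variable level by level, stop at a given level, apply the coding, keep a prefix) can be sketched in a few lines. The parameter names mirror the roles described in the definition but are otherwise assumptions of this example; truncating each level to the requested length relies on the rules being non-erasing.

```python
def run_l_system(rules, axiom, coding, d, n):
    """Generate the string of an L-system.

    rules: variable -> sequence of variables (non-erasing).
    axiom: sequence of variables.  coding: variable -> output symbol.
    d: level at which to stop.     n: length of the prefix to output.
    """
    level = list(axiom)
    for _ in range(d):
        # Only the first n symbols can matter: since rules are non-erasing,
        # each symbol contributes at least one symbol to the next level.
        level = [s for v in level[:n] for s in rules[v]]
    return "".join(coding[v] for v in level[:n])


if __name__ == "__main__":
    fib = {"a": "ab", "b": "a"}
    identity = {"a": "a", "b": "b"}
    print(run_l_system(fib, "a", identity, 4, 8))  # abaababa

    # Thue-Morse over variables A, B, with a coding to the binary alphabet.
    tm = {"A": "AB", "B": "BA"}
    to_bits = {"A": "0", "B": "1"}
    print(run_l_system(tm, "A", to_bits, 3, 8))    # 01101001
```

The rule table and the coding stay fixed while the level and length change, which is what makes such descriptions small for whole families of strings.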

L-systems then represent strings by iterating a non-erasing automorphism. Somewhat surprisingly, we now exhibit a string family where $\ell = o(\delta)$, thus L-systems are a reachable mechanism to generate strings that can be asymptotically smaller than what was considered to be a stable lower bound.

Theorem 4.1

There exist string families where $\ell = o(\delta)$.

Proof

Consider the L-system where , , , , , , and . The family of strings is formed by all those generated by the systems , where . It is clear that all the strings in this family share the value .

The first strings of the family generated by this system (i.e., its levels ) are , , , , and so on. It is easy to see by induction that level contains s and s, so the string is of length .

More importantly, one can see by induction that levels start with and contain all the strings of the form for . This is true for level . Then, in level the strings become , which contains , and the first yields , containing .

Consider now the number of -length distinct substrings in , for . Each distinct substring , for , yields at least distinct -length substrings (containing at different offsets; no single -length substring may contain two of those). These add up to distinct -length substrings, and thus on the string .∎

On the other hand, L-systems are always reachable, which yields the immediate result that $\ell$ and $\delta$ are incomparable.

Theorem 4.2

There exist string families where $\delta = o(\ell)$.

Proof

Kociumaka et al. [9, Thm. 2] exhibit a string family of elements with , so it needs bits, that is, space, to be represented with any method. Therefore in this family, because there are only distinct L-systems of size . ∎

Those strings are formed by s, replacing them by s at single arbitrary positions between and for every . While such a string is easily generated by a composition system of size , we could only produce L-systems of size generating it. We now prove bounds between L-systems and context-free grammars.

Theorem 4.3

For any L-system of size generating , there is a context-free grammar of size generating . If the morphism represented by is expanding, then the grammar is of size .

Proof

Consider the derivation tree for in : the root children are at level 0, and if is a node at level , then the children of are the elements in , at level . The nodes in each level spell out .

We create a grammar where contains the initial symbol and, for each variable of the L-system, nonterminals . The terminals of the grammar are the set of L-system variables, . Then, for each L-system rule appearing in level , we add the grammar rule . Further, for each rule appearing in level , we add the grammar rule . Finally, if is the L-system axiom, we add the grammar rule for its initial symbol.

It is clear that the grammar is of size at most and it generates . If every rule is of size larger than , and , then the prefix of is generated from the first symbol of , which can then be made the axiom and reduced to . In this case, the grammar is of size .∎

For example, consider our L-system and . A grammar simulating a generation of levels contains the rules , , , , , and . Note how the grammar uses the level subindices to control the point where the L-system should stop.

On the other hand, while we believe that composition systems can be smaller than L-systems, we can prove that L-systems are not larger than grammars.

Theorem 4.4

It always holds that $\ell = O(g)$.

Proof

Consider a grammar of height generating . We define the L-system , where contains all the rules in except the one for . We also include in the rules for all . The coding is the identity function.

It is clear that this L-system produces the same derivation tree of , reaching terminals at some level. Those remain intact up to the last level, , thanks to the rules . At this point the L-system has derived .

The size of the L-system is that of plus , which is of the same order because every symbol appears on some right-hand side (if not, we do not need to create the rule for that symbol). ∎
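The idea of this proof (reuse the grammar rules as L-system rules and add identity rules so that terminals survive until the last level) can be illustrated as follows. This sketch keeps the rule of the start symbol and uses the start symbol itself as the axiom, which may differ in detail from the construction in the proof; the grammar encoding and function names are assumptions.

```python
def grammar_to_l_system(grammar, start):
    """Turn a grammar (variable -> nonempty list of symbols) into L-system parts.

    Returns (rules, axiom, height): the grammar rules plus one identity rule
    a -> a per terminal, the axiom [start], and the grammar height.
    """
    terminals = {s for rhs in grammar.values() for s in rhs if s not in grammar}
    rules = {var: list(rhs) for var, rhs in grammar.items()}
    rules.update({a: [a] for a in terminals})  # terminals rewrite to themselves

    def height(sym):
        if sym in terminals:
            return 0
        return 1 + max(height(s) for s in grammar[sym])

    return rules, [start], height(start)


def run_levels(rules, axiom, d):
    """Rewrite the axiom for d levels and return the resulting string."""
    level = list(axiom)
    for _ in range(d):
        level = [s for v in level for s in rules[v]]
    return "".join(level)


if __name__ == "__main__":
    grammar = {"S": ["A", "B", "A"], "A": ["a", "b"], "B": ["A", "c"]}
    rules, axiom, h = grammar_to_l_system(grammar, "S")
    print(h, run_levels(rules, axiom, h))  # 3 ababcab
```

In this example the grammar has size 7 and the derived rule set has size 10, the difference being the three identity rules, one per terminal.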

The following simple result characterizes a class of morphisms generating families with constant-sized L-systems.

Theorem 4.5

Let $\varphi : \Sigma^* \to \Sigma^*$ be a non-erasing automorphism over free monoids, and $s \in \Sigma^*$. Then $\ell = O(1)$ on the family $\{\varphi^d(s) \mid d \ge 0\}$.

Proof

We can easily simulate on the L-system of fixed size, with and . The system generates and, as grows, it does not change its size. ∎

This implies that $\ell = O(1)$ on families of $d$-iterations of the Thue-Morse morphism, the Fibonacci morphism, images of $k$-uniform morphisms (i.e., morphisms generating $k$-automatic words [1]), and standard Sturmian morphisms [4]. More generally, $\ell$ is $O(1)$ on the set of prefixes of any morphic word.

5 NU-Systems

We now define a mechanism that combines both macro systems and L-systems, yielding a computable measure $\nu$ that is reachable and strictly better than $b$.

Definition 5

A NU-system is a tuple , which is understood in the same way as L-systems, except that we extend rules with extractions, that is, and

The symbol means to expand variable for levels and then extract from the string at level , recursively expanding extractions if necessary. This counts as a single expansion (one level) of a rule, that is, the levels in the NU-system belong to . We also use to denote the whole level of . The size of the NU-system is . We call the size of the smallest NU-system generating a string .

Just as macro systems, a NU-system is valid only if it does not introduce circular dependencies. Let be the maximum value across every rule in the NU-system. The following algorithm determines the string generated by the system, if any:

  1. Compute for every variable and level , using the rules:

    • .

    • If and , then .

    • Replace on the previous summands .

    This generates a system of equations without loops, which is trivially solved.

  2. Replace every symbol in by ; we use to denote .

  3. Expand the rules, starting from the axiom, level by level as in L-systems. Handle the symbols as follows:

    1. Replace every (so if the NU-system is correct) by .

    2. Replace every , if and , as follows:

      • Let , for .

      • Let be such that .

      • Replace by .

    3. Return to (a) until the extraction symbol disappears.

Note that the symbol in step 3(b) can in turn be of the form ; we must then extract before continuing the extraction of . If, along the expansion, we return again to the original , then the system has no unique solution and thus it is invalid. This is computable because the number of possible combinations is bounded by .

We now show that NU-systems are at least as powerful as macro systems and L-systems.

Theorem 5.1

It always holds that .

Proof

It always holds because L-systems are a particular case of NU-systems. With respect to , let be a minimal macro system generating . Then we construct a NU-System where is the identity and , which upper-bounds the height of the derivation tree. Each level of will simulate the sequence of extractions that lead from each to its corresponding terminal in the macro system.

For each we define the rule in . For each rule in , we define the rule in , where if , and if . It is not hard to see that the NU-System simulates the macro system , and its size is . ∎

For example, consider our previous macro system and . The corresponding NU-system would have the rules , , , and . The derivation is then generated as follows:

=
.

Our new measure $\nu$ is then reachable, strictly better than $b$ and incomparable with $\delta$. It is likely, however, that computing $\nu$ (i.e., finding the smallest NU-system generating a given string $w$) is NP-hard.

NU-systems easily allow us to concatenate and compose automorphisms.

Theorem 5.2

Let and be NU-systems generating and , respectively. Then there are NU-systems of size that generate and the composition of and , which is the string generated by with axiom , .

Proof

Let and be disjoint copies of and , respectively, and let , and be variants that operate on instead of . We build a NU-system for , where , where and are new symbols. Let , plus the rules for . The axiom is then . Finally, the mapping on is , and for . It is easy to see that generates .

To generate the composition, should contain the image of by , but still is disjoint from . The axiom is . The mapping on is . On we use . On , we use . The depth is . ∎

The theorem allows a family to have $\nu = O(1)$, by finding a finite collection of families generated by fixed non-erasing automorphisms, and then joining them using a finite number of set unions, concatenations and morphism compositions.

6 Future work

We leave a number of open questions. We know , but it is unknown if ; if so, then would be reachable. We know , but it is unknown if ; we suspect it is not, but in general we lack mechanisms to prove lower bounds on or . We also know , but not if it can be strictly better. We also do not know if these measures are monotone, and if they are actually NP-hard to compute (they are likely so).

We could prove that , and thus the lower bound , if every L-system could be made expanding, but this is also unknown. This, for example, would prove that the stretch we found for a family of strings is the maximum possible.

7 Conclusions

Extending the study of repetitiveness measures, from parsing-based to morphism-based mechanisms, opens a wide range of possibilities for the study of repetitiveness. There is already a lot of theory behind morphisms, waiting to be exploited in the quest for a golden measure of repetitiveness.

We first generalized composition systems to macro systems, showing that a restriction of them, called internal macro systems, is equivalent to bidirectional macro schemes, the lowest reachable measure of repetitiveness considered in the literature. It is not yet known if general macro systems are more powerful.

We then showed how morphisms, and measures based on mechanisms capturing that concept called L-systems (and variations), can be strictly better than $\delta$ for some string families, thereby questioning the validity of $\delta$ as a lower bound for reachable repetitiveness measures. L-systems are never larger than context-free grammars, but probably not always as small as composition systems.

Finally, we proposed a novel mechanism of compression aimed at unifying parsing and morphisms as repetitiveness sources, called NU-systems, which builds on macro systems and L-systems. NU-systems can combine copy-paste, recurrences, and concatenations and compositions of morphisms. The size $\nu$ of the smallest NU-system generating a string is a relevant measure of repetitiveness because it is reachable, computable, always in $O(b)$ and sometimes $o(\delta)$.

A simple lower bound capturing the idea of recurrence on a string, and lower bounding , just like captures the idea of copy-paste and strictly lower bounds , would be of great interest when studying morphism-based measures. For infinite strings, there exist concepts like recurrence constant and appearance constant [1], but an adaptation, or another definition, is needed for finite strings. Besides, like Lindenmayer systems, NU-systems could be used to model other repetitive structures beyond strings that appear in biology, like the growth of plants and fractals. In this sense, they can be compared with tree grammars; the relation between NU-systems and TSLPs [13], for example, deserves further study.

References

  • [1] Allouche, J.P., Shallit, J.: Automatic Sequences: Theory, Applications, Generalizations. Cambridge University Press (2003)
  • [2] Bannai, H., Funakoshi, M., I, T., Koeppl, D., Mieno, T., Nishimoto, T.: A separation of $\gamma$ and $b$ via Thue–Morse words. CoRR 2104.09985 (2021)
  • [3] Christiansen, A.R., Ettienne, M.B., Kociumaka, T., Navarro, G., Prezza, N.: Optimal-time dictionary-compressed indexes. ACM Trans. Alg. 17(1), art. 8 (2020)
  • [4] de Luca, A.: Standard Sturmian morphisms. Theor. Comp. Sci. 178(1), 205–224 (1997)
  • [5] Gasieniec, L., Karpinski, M., Plandowski, W., Rytter, W.: Efficient algorithms for Lempel-Ziv encoding. In: Proc. SWAT. pp. 392–403 (1996)
  • [6] Kempa, D., Prezza, N.: At the roots of dictionary compression: String attractors. In: Proc. 50th STOC. pp. 827–840 (2018)
  • [7] Kida, T., Matsumoto, T., Shibata, Y., Takeda, M., Shinohara, A., Arikawa, S.: Collage system: a unifying framework for compressed pattern matching. Theor. Comp. Sci. 298(1), 253–272 (2003)
  • [8] Kieffer, J.C., Yang, E.H.: Grammar-based codes: A new class of universal lossless source codes. IEEE Trans. Inf. Theory 46(3), 737–754 (2000)
  • [9] Kociumaka, T., Navarro, G., Prezza, N.: Towards a definitive measure of repetitiveness. In: Proc. 14th LATIN. pp. 207–219 (2020)
  • [10] Lempel, A., Ziv, J.: On the complexity of finite sequences. IEEE Trans. Inf. Theory 22(1), 75–81 (1976)
  • [11] Lindenmayer, A.: Mathematical models for cellular interactions in development I. Filaments with one-sided inputs. J. Theor. Biol. 18(3), 280–299 (1968)
  • [12] Lindenmayer, A.: Mathematical models for cellular interactions in development II. Simple and branching filaments with two-sided inputs. J. Theor. Biol. 18(3), 300–315 (1968)
  • [13] Lohrey, M.: Grammar-based tree compression. In: Proc. DLT. pp. 46–57 (2015)
  • [14] Lothaire, M.: Algebraic Combinatorics on Words. Cambridge University Press (2002)
  • [15] Navarro, G.: Indexing highly repetitive string collections, part I: Repetitiveness measures. ACM Comp. Surv. 54(2), article 29 (2021)
  • [16] Navarro, G., Ochoa, C., Prezza, N.: On the approximation ratio of ordered parsings. IEEE Trans. Inf. Theory 67(2), 1008–1026 (2021)
  • [17] Raskhodnikova, S., Ron, D., Rubinfeld, R., Smith, A.: Sublinear algorithms for approximating string compressibility. Algorithmica 65(3), 685–709 (2013)
  • [18] Shallit, J.: String attractors for automatic sequences. CoRR 2012.06840 (2020)
  • [19] Storer, J.A., Szymanski, T.G.: Data compression via textual substitution. J. ACM 29(4), 928–951 (1982)