The notion of string attractor has been recently introduced and studied in [19, 7] to find a common principle underlying the main techniques of dictionary-based compression. It is a subset of the text’s positions such that all distinct factors have an occurrence crossing at least one of the string attractor’s elements. On the one hand, the problem of finding the smallest string attractor of a word has been proved to be NP-complete; on the other hand, dictionary compressors can be interpreted as algorithms approximating the smallest string attractor for a given word. Moreover, approximation rates with respect to the smallest string attractor can be derived for most known compressors. In particular, compressors based on the Burrows-Wheeler Transform and dictionary-based compressors are considered.
The Burrows-Wheeler Transform (BWT) is a reversible transformation that was introduced in 1994 in the field of Data Compression, and it is also widely used for self-indexing data structures. It has several combinatorial properties that make it a versatile tool in several contexts and applications [20, 12, 16, 22, 15].
Dictionary-based compressors are mainly based on a technique that originated in two theoretical papers of Ziv and Lempel [24, 25]. Such compressors, which are able to combine compression power and compression/decompression speed, are based on a paper in which combinatorial properties of word factorization are explored.
In this paper we explore the notion of string attractor from a combinatorial point of view. In particular, we compute the size of a smallest string attractor for infinite families of words that are well known in the field of Combinatorics on Words: standard Sturmian words (and some of their extensions to bigger alphabets), Thue-Morse words and de Bruijn words. We show that the size of the smallest string attractor for standard Sturmian words is $2$ and that one such attractor consists of two consecutive positions. For the de Bruijn words the size of the smallest string attractor grows asymptotically as $\frac{n}{\log n}$, where $n$ is the length of the word. We show a string attractor of size $\log n$ for Thue-Morse words and we conjecture that this size is minimum. From the results presented in the paper, we believe that the distribution of the positions in the smallest string attractor of a word, in addition to its size, can provide some interesting information about the combinatorial properties of the word itself. For this reason the notion of string attractor can provide hints for defining new methods and measures to investigate the combinatorial complexity of words.
Let $\Sigma = \{a_1, a_2, \ldots, a_\sigma\}$ be a finite ordered alphabet with $a_1 < a_2 < \cdots < a_\sigma$, where $<$ denotes the standard lexicographic order. We denote by $\Sigma^*$ the set of words over $\Sigma$. Given a finite word $w = w_1 w_2 \cdots w_n \in \Sigma^*$ with each $w_i \in \Sigma$, the length of $w$, denoted $|w|$, is equal to $n$.
Given a finite word $w = w_1 w_2 \cdots w_n$ with each $w_i \in \Sigma$, a factor of $w$ is written as $w[i, j] = w_i \cdots w_j$ with $1 \leq i \leq j \leq n$. A factor of type $w[1, j]$ is called a prefix, while a factor of type $w[i, n]$ is called a suffix. We also denote by $w[i]$ the $i$-th letter in $w$ for any $1 \leq i \leq n$.
We denote by $w^R$ the reversal of $w$, given by $w^R = w_n \cdots w_2 w_1$. If $w$ is a word that reads the same in either direction, i.e. if $w = w^R$, then $w$ is called a palindrome.
We say that two words $x, y \in \Sigma^*$ are conjugate if $x = uv$ and $y = vu$, where $u, v \in \Sigma^*$. Conjugacy between words is an equivalence relation over $\Sigma^*$.
Given a finite word $w$, $w^k$ denotes the word obtained by concatenating $k$ copies of $w$. A nonempty word $w$ is primitive if $w = u^h$ implies $w = u$ and $h = 1$. A word $w$ is periodic if there exists a positive integer $p$ such that $w[i] = w[j]$ if $i \equiv j \pmod{p}$. The integer $p$ is called a period of $w$.
A Lyndon word is a primitive word which is the minimum in its conjugacy class, with respect to the lexicographic order relation. We call Lyndon conjugate of a primitive word $w$ the conjugate of $w$ that is a Lyndon word.
The Burrows-Wheeler Transform of a word $w$ is a permutation $bwt(w)$ of the symbols in $w$, obtained as the concatenation of the last symbol of each conjugate in the list of the lexicographically sorted conjugates of $w$.
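The definition above can be illustrated with a minimal Python sketch. The function names `conjugates` and `bwt` are chosen here for illustration only, and the quadratic sorting of all rotations is meant for small examples, not as an efficient construction.

```python
def conjugates(w):
    """All conjugates (rotations) of w, one per starting position."""
    return [w[i:] + w[:i] for i in range(len(w))]


def bwt(w):
    """Concatenate the last letters of the lexicographically sorted conjugates of w."""
    return "".join(c[-1] for c in sorted(conjugates(w)))


print(bwt("abaab"))   # bbaaa
print(bwt("banana"))  # nnbaaa
```

The output is always a permutation of the input letters, since each conjugate contributes exactly one letter of the word.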
The LZ factorization of a word $w$ is its factorization $w = p_1 \cdots p_z$ built left to right in a greedy way by the following rule: each new factor (also called an LZ77 phrase) $p_i$ is either the leftmost occurrence of a letter in $w$ or the longest prefix of $p_i \cdots p_z$ which occurs, as a factor, in $p_1 \cdots p_{i-1}$.
3 String Attractor of a word
A string attractor of a word $w \in \Sigma^n$ is a set of $\gamma$ positions $\Gamma = \{j_1, \ldots, j_\gamma\}$ such that every factor $w[i, j]$ has an occurrence $w[i', j'] = w[i, j]$ with $j_k \in [i', j']$, for some $j_k \in \Gamma$.
Simply put, a string attractor for a word $w$ is a set of positions in $w$ such that all distinct factors of $w$ have an occurrence crossing at least one of the attractor’s elements.
Note that, trivially, any set that contains a string attractor for $w$ is a string attractor for $w$ as well. Note also that a word can have different string attractors that are not included in each other. We are interested in finding a smallest string attractor, i.e. a string attractor with a minimum number of elements. We denote by $\gamma^*(w)$ the size of a smallest string attractor for $w$. Note that all the factors consisting of a single letter must be covered, and therefore $\gamma^*(w) \geq |\Sigma|$ whenever every letter of $\Sigma$ occurs in $w$.
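For small words the definition can be checked directly by exhaustive search. The following Python sketch is only an illustration: the names `is_string_attractor` and `smallest_string_attractor` are hypothetical, positions are 1-based as in the text, and the search is exponential, so it is usable only on very short words.

```python
from itertools import combinations


def is_string_attractor(w, positions):
    """True if every distinct factor of w has an occurrence crossing
    at least one of the given 1-based positions."""
    n = len(w)
    factors = {w[i:j] for i in range(n) for j in range(i + 1, n + 1)}
    crossing = {
        w[i:j]
        for i in range(n)
        for j in range(i + 1, n + 1)
        # w[i:j] occupies the 1-based positions i+1, ..., j
        if any(i < p <= j for p in positions)
    }
    return factors == crossing


def smallest_string_attractor(w):
    """First string attractor of minimum size, by brute force."""
    n = len(w)
    for size in range(1, n + 1):
        for candidate in combinations(range(1, n + 1), size):
            if is_string_attractor(w, candidate):
                return set(candidate)
    return set()


print(smallest_string_attractor("abaab"))  # {2, 3}
```

On the word `abaab` the search returns two positions, matching the lower bound given by the number of distinct letters.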
Let be a word on the alphabet . A string attractor for is for instance . Note that, in order to have a string attractor, position can be removed from , since all the factors that cross position have a different occurrence that crosses a different position in . Therefore is also a string attractor for with a smaller number of elements. The positions of are underlined in
is also a smallest string attractor since . Then . Remark that the sets and are also string attractors for . It is easy to verify that the set is not a string attractor since the factor does not intersect any position in .
The following two propositions, proved in , are useful to derive a lower bound on the value of .
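Let $\Gamma$ be a string attractor for the word $w$. Then, $w$ contains at most $k|\Gamma|$ distinct factors of length $k$, for every $1 \leq k \leq |w|$.
Let $w \in \Sigma^*$ and let $l$ be the length of its longest repeated factor. Then it holds $\gamma^*(w) \geq \frac{|w| - l}{l + 1}$.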
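A counting bound of this kind translates into a simple lower-bound computation. The sketch below assumes the usual form of the first proposition, namely that a string attractor $\Gamma$ allows at most $k|\Gamma|$ distinct factors of length $k$ (each position is crossed by at most $k$ windows of length $k$); the function names are hypothetical.

```python
from math import ceil


def distinct_factors(w, k):
    """Set of distinct factors of w having length k."""
    return {w[i:i + k] for i in range(len(w) - k + 1)}


def attractor_lower_bound(w):
    """Largest value of ceil(#distinct factors of length k / k) over all k.
    Under the assumed bound, no string attractor of w can be smaller."""
    return max(
        ceil(len(distinct_factors(w, k)) / k) for k in range(1, len(w) + 1)
    )


print(attractor_lower_bound("abaab"))   # 2
print(attractor_lower_bound("abcabc"))  # 3
```

The bound is not always attained, but it is cheap to compute and is the kind of argument used for words with many distinct factors.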
The following proposition gives an upper bound for $\gamma^*$ of a concatenation of words, when $\gamma^*$ of the single words is known.
Let $u$ and $v$ be two words, then $\gamma^*(uv) \leq \gamma^*(u) + \gamma^*(v) + 1$.
The bound defined in the previous proposition is tight. In fact, let and be two words in which the positions of the smallest string attractors are underlined. If we consider , the underlined positions represent one of the smallest string attractors for , as one can verify.
The following proposition gives an upper and lower bound for , when a power of a given word is considered.
Let . Then .
The upper bound given by Proposition 4 is tight. In fact consider the word . It is easy to check that the only smallest string attractors for are and . In order to find the smallest string attractor for , we remark that neither nor (nor any string attractor obtained from them by moving some position from the first to the second occurrence of ) covers all the new factors that appear after the concatenation. In particular is not covered by , and is not covered by . A way to get the smallest string attractor for is to add to or the position corresponding either to the end of the first occurrence of or the beginning of the second occurrence. For instance, is a smallest string attractor for .
Remark that can be equal to although different positions for the string attractor may have to be chosen. For instance, let be a word whose smallest string attractor is (the underlined letters). Then has a string attractor of cardinality . Remark that is not a string attractor for .
A straightforward consequence of Proposition 4 is the following:
If and are conjugate words, then .
Consider the word . Then a smallest string attractor for is , i.e. . Consider its conjugate . Its smallest string attractor is , i.e. . Note that the Lyndon word does not necessarily have the smallest among the conjugates. In fact, for instance, the Lyndon conjugate of is , and it is easy to verify that one of its smallest string attractors is .
4 Approximating a string attractor via compressors
In  the authors show that many of the most well-known compression schemes reducing the text’s size by exploiting its repetitiveness can induce string attractors whose sizes are bounded by the repetitiveness measures associated with such compressors. In particular, straight-line programs, the Run-Length Burrows-Wheeler Transform, macro schemes, collage systems, and the compact directed acyclic word graph are considered. Here we report some results related to the Burrows-Wheeler Transform, collage systems, and Lempel-Ziv 77 (which is a particular macro scheme) that provide upper bounds on the size of the smallest string attractor for a given word. Such bounds will be used in the next sections to compute the string attractors for known families of finite words.
The first theorem, proved in , states a connection between a string attractor of a word $w$ and the runs of equal letters in the BWT of $w$.
Let $w$ be a word and let $r$ be the number of equal-letter runs in $bwt(w\$)$, where $\$$ is a symbol different from the ones in the alphabet $\Sigma$ and assumed smaller than any symbol in $\Sigma$. Then, $w$ has a string attractor of size $r$.
In particular, in the proof of Theorem 4.1 the string attractor is constructed by considering the positions of the symbols in $w$ that correspond, in the output of the transformation, to the first occurrence of a symbol in each run (or, equivalently, the last occurrence of a symbol in each run).
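The construction described above can be sketched in Python for small words. This is an illustration under stated assumptions, not the construction as given in the cited proof: the sentinel is the character `$` (assumed absent from the word and smaller than its letters), the rotations are sorted naively, the position of the sentinel itself is discarded, and all function names are hypothetical. A brute-force checker is included so that the result can be verified.

```python
def is_string_attractor(w, positions):
    """Brute-force check of the definition (1-based positions)."""
    n = len(w)
    factors = {w[i:j] for i in range(n) for j in range(i + 1, n + 1)}
    crossing = {
        w[i:j]
        for i in range(n)
        for j in range(i + 1, n + 1)
        if any(i < p <= j for p in positions)
    }
    return factors == crossing


def bwt_run_attractor(w, sentinel="$"):
    """Positions of w whose letter opens an equal-letter run in bwt(w + sentinel)."""
    t = w + sentinel
    n = len(t)
    # Sort rotation start indices; the rotation starting at k ends with t[k - 1].
    order = sorted(range(n), key=lambda k: t[k:] + t[:k])
    last = [(k - 1) % n for k in order]  # 0-based text index of each BWT letter
    attractor = set()
    for row, idx in enumerate(last):
        if row == 0 or t[idx] != t[last[row - 1]]:
            attractor.add(idx + 1)  # convert to a 1-based position
    attractor.discard(n)  # the sentinel position lies outside w
    return attractor


gamma = bwt_run_attractor("abaab")
print(sorted(gamma), is_string_attractor("abaab", gamma))  # [3, 4, 5] True
```

For `abaab` the transform of `abaab$` is `bba$aa`, with four runs, and dropping the sentinel leaves three positions. The bound is therefore not necessarily tight, since this word also admits a string attractor of size 2.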
The following result, proved in , states the relationship between a string attractor of a word $w$ and the number of phrases in the LZ parsing of $w$.
Given a word $w$, there exists a string attractor of $w$ of size equal to the number of phrases of its LZ parsing.
By using the previous result, a string attractor can be constructed by considering the set of positions at the end of each phrase.
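A minimal sketch of this construction follows. It assumes the greedy, non-overlapping variant of the LZ factorization (each phrase is a new letter or the longest prefix of the remaining suffix that occurs in the part already parsed); the function names are hypothetical and the implementation is quadratic or worse, intended for small examples only.

```python
def is_string_attractor(w, positions):
    """Brute-force check of the definition (1-based positions)."""
    n = len(w)
    factors = {w[i:j] for i in range(n) for j in range(i + 1, n + 1)}
    crossing = {
        w[i:j]
        for i in range(n)
        for j in range(i + 1, n + 1)
        if any(i < p <= j for p in positions)
    }
    return factors == crossing


def lz_factorize(w):
    """Greedy left-to-right parsing into phrases."""
    phrases, i, n = [], 0, len(w)
    while i < n:
        length = 0
        # Extend the phrase while it still occurs in the already parsed prefix.
        while i + length < n and w[i:i + length + 1] in w[:i]:
            length += 1
        length = max(length, 1)  # a letter never seen before forms its own phrase
        phrases.append(w[i:i + length])
        i += length
    return phrases


def phrase_end_positions(w):
    """1-based end position of every phrase of the parsing."""
    ends, total = set(), 0
    for phrase in lz_factorize(w):
        total += len(phrase)
        ends.add(total)
    return ends


print(lz_factorize("abaababa"))          # ['a', 'b', 'a', 'aba', 'ba']
print(sorted(phrase_end_positions("abaababa")))  # [1, 2, 3, 6, 8]
```

The reason the phrase ends work is that the leftmost occurrence of any factor cannot lie strictly inside a phrase without touching its last position, because such a phrase is a copy of earlier text.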
A collage system is a set of rules of four possible types:
$X \to a$: nonterminal $X$ expands to a terminal $a$;
$X \to AB$: nonterminal $X$ expands to $AB$, with $A$ and $B$ nonterminals different from $X$;
$X \to R^{\ell}$: nonterminal $X$ expands to nonterminal $R$ repeated $\ell$ times;
$X \to K[l..r]$: nonterminal $X$ expands to a substring of the expansion of nonterminal $K$.
Let $G$ be a collage system of size $g$ generating a word $w$. Then $w$ has a string attractor of size at most $g$.
By using this theorem one can easily find the minimum string attractor for a particular class of very “regular” strings, as shown in the following corollary.
Let be a word. If is of the form (where all are different symbols in the alphabet), then is a string attractor of minimum size for , so .
5 Minimum size string attractors
In this section, we analyze the words whose smallest string attractor has size equal to the size of the alphabet, that is, the minimum possible size. In the following two subsections we distinguish the case of binary words and the case of words over alphabets with more than two letters.
5.1 Binary words
In this subsection we focus on an infinite family of finite binary words whose minimum string attractor has size $2$.
Standard Sturmian words are a very well known family of binary words that are the basic bricks used for the construction of infinite Sturmian words, in the sense that every characteristic Sturmian word is the limit of a sequence of standard words (cf. Chapter 2 of ). These words have a multitude of characterizations and appear as extremal cases in a wide range of contexts [9, 3]. In this paper two of their characterizations are particularly useful, namely a special decomposition into palindromic words and an extremal property on the periods of the word that is closely related to Fine and Wilf’s theorem (cf. [13, 14]). More formally, standard Sturmian words can be defined in the following way, which is a natural generalization of the definition of the Fibonacci word. Let $q_0, q_1, \ldots$ be any sequence of natural integers such that $q_0 \geq 0$ and $q_i > 0$ ($i = 1, 2, \ldots$), called the directive sequence. The sequence $\{s_n\}_{n \geq 0}$ can be defined inductively as follows: $s_0 = b$, $s_1 = a$, $s_{n+1} = s_n^{q_{n-1}} s_{n-1}$, for $n \geq 1$. We denote by $Stand$ the set of all words $s_n$, $n \geq 0$, constructed for any directive sequence of integers.
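The inductive definition can be sketched as follows, assuming the usual recurrence $s_0 = b$, $s_1 = a$, $s_{n+1} = s_n^{q_{n-1}} s_{n-1}$ for a directive sequence $q_0, q_1, \ldots$. The function name `standard_words` is chosen for illustration.

```python
def standard_words(directive):
    """Return [s_0, s_1, ..., s_{m+1}] for the directive sequence
    (q_0, ..., q_{m-1}), with s_0 = 'b', s_1 = 'a' and
    s_{n+1} = s_n repeated q_{n-1} times, followed by s_{n-1}."""
    words = ["b", "a"]
    for n, q in enumerate(directive, start=1):
        words.append(words[n] * q + words[n - 1])
    return words


# The directive sequence 1, 1, 1, ... yields the finite Fibonacci words.
print(standard_words([1, 1, 1, 1]))
# ['b', 'a', 'ab', 'aba', 'abaab', 'abaababa']
print(standard_words([2, 1]))
# ['b', 'a', 'aab', 'aaba']
```

The lengths of the Fibonacci words produced this way are consecutive Fibonacci numbers, which is why this family is the standard first example.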
Furthermore, another characterization of standard Sturmian words is related to the Burrows-Wheeler Transform (BWT) since, for binary alphabets, the application of the BWT to standard Sturmian words produces a total clustering of all the instances of any character (cf. ), as reported in the following theorem.
Theorem 5.1 ()
Let $w \in \{a, b\}^*$. Then $w$ is a conjugate of a word in $Stand$ if and only if $bwt(w) = b^p a^q$ with $\gcd(p, q) = 1$.
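The clustering property can be checked experimentally on small words. The sketch below assumes the usual statement of the theorem, namely that the transform has the form $b^p a^q$ with $p$ and $q$ coprime; the helper names are hypothetical, and the finite Fibonacci words are used as sample standard Sturmian words.

```python
from math import gcd


def bwt(w):
    """Last letters of the lexicographically sorted conjugates of w."""
    return "".join(c[-1] for c in sorted(w[i:] + w[:i] for i in range(len(w))))


def has_sturmian_bwt(w):
    """True if bwt(w) has the form b^p a^q with gcd(p, q) = 1."""
    t = bwt(w)
    p, q = t.count("b"), t.count("a")
    return t == "b" * p + "a" * q and gcd(p, q) == 1


def fibonacci_words(count):
    """First `count` finite Fibonacci words, starting from 'b' and 'a'."""
    words = ["b", "a"]
    while len(words) < count:
        words.append(words[-1] + words[-2])
    return words


for w in fibonacci_words(6)[2:]:
    print(w, bwt(w), has_sturmian_bwt(w))
# ab ba True
# aba baa True
# abaab bbaaa True
# abaababa bbbaaaaa True
```

Words that are not conjugates of standard words fail the test: `abab` gives `bbaa`, where the exponents are not coprime, and `aabb` gives `baba`, which is not clustered.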
In the following theorem, for each standard Sturmian word, we identify a string attractor whose positions are closely related to particular decompositions of such words depending on their periodicity. In particular, we recall that (cf. ), where is the set of all words having two periods and such that and . Given a word , we denote by its prefix of length , belonging to the set , uniquely defined by the previous equality. By using a property of words in (cf. ), , where are characters and and are uniquely determined palindromes. So, a standard Sturmian word can be decomposed as . We call such factorizations of decompositions.
By using decompositions, a standard Sturmian word can be decomposed as . For each with , let be the length of the longest palindromic proper prefix of . Then the set or the set is a minimum string attractor for .
Let us suppose that . By using a property of words in (cf. ), , where are characters and and are uniquely determined palindromes. Let us suppose that . So, . Firstly we suppose that . This means that . From a result in  and are the smallest and the greatest conjugates in the lexicographic order, respectively. By Theorem 5.1 and by using an argument similar to the proof of Theorem 4.1, a string attractor can be constructed by considering the positions corresponding to the end of each run. It is possible to see that such positions correspond to the two characters following the prefix of length . If , then . In this case and are the smallest and the greatest conjugates in the lexicographic order, respectively. In this case the ending positions of each run in the output of correspond to the two characters following the prefix . So, the positions in the string attractor are . The case can be proved analogously by considering the starting characters of each run in the clustered output of . ∎
Given the standard Sturmian word , the decompositions of are , then is a (smallest) string attractor for , since . Given , its decompositions are and . So, is a string attractor.
The previous theorem shows an infinite family of finite binary words such that the size of the smallest string attractor is minimum. In Example 7 we provide some binary words that are not standard Sturmian words and still have a string attractor of minimum size. An open question is to characterize all the binary words with a string attractor of size $2$. Furthermore, we remark that for the standard Sturmian words one can construct string attractors whose positions are consecutive. This fact could be related to the number of distinct factors appearing in the words. It could be interesting to investigate which classes of binary words have minimum size string attractors containing consecutive positions.
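The remark on consecutive positions can be explored by brute force on short words. The following sketch lists all pairs of consecutive positions that form a string attractor; the function names are hypothetical and the finite Fibonacci words serve as sample standard Sturmian words.

```python
def is_string_attractor(w, positions):
    """Brute-force check of the definition (1-based positions)."""
    n = len(w)
    factors = {w[i:j] for i in range(n) for j in range(i + 1, n + 1)}
    crossing = {
        w[i:j]
        for i in range(n)
        for j in range(i + 1, n + 1)
        if any(i < p <= j for p in positions)
    }
    return factors == crossing


def consecutive_attractors(w):
    """All pairs (i, i + 1) of 1-based positions that are string attractors of w."""
    return [
        (i, i + 1)
        for i in range(1, len(w))
        if is_string_attractor(w, (i, i + 1))
    ]


for w in ["ab", "aba", "abaab", "abaababa"]:
    print(w, consecutive_attractors(w))
```

For instance, the pair $(2, 3)$ is a string attractor of `abaab` and the pair $(4, 5)$ is a string attractor of `abaababa`. The same function can be run on arbitrary binary words to look for candidates for the open question above.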
A possible set of smallest string attractor of the words or is . Moreover, if we consider the word or a smallest string attractor is .
5.2 The case of bigger alphabets
We now analyze the string attractors for words over more than two letters by generalizing the standard Sturmian words. Numerous generalizations of Sturmian sequences have been introduced for an alphabet with more than $2$ letters. Among them, one natural generalization is given by the episturmian sequences, which are defined by using the palindromic closure property of Sturmian sequences (cf. ). Here we consider some special prefixes of episturmian sequences that are balanced, called finite epistandard words. We recall that a word $w$ is balanced if, for any symbol $a$, the numbers of $a$’s in two factors of $w$ of the same length differ at most by $1$, and it is circularly balanced if each of its conjugates is balanced. In the case of a binary alphabet they correspond to the standard Sturmian words.
We focus on the circularly balanced epistandard words defined in Theorem 5.3 (see [18, 22]), which can be built via the iterated palindromic closure function. The iterated palindromic closure function, denoted by $Pal$, is defined recursively as follows. Set $Pal(\varepsilon) = \varepsilon$ and, for any word $w$ and letter $x$, define $Pal(wx) = (Pal(w)x)^{(+)}$, where $w^{(+)}$, the palindromic right-closure of $w$, is the (unique) shortest palindrome having $w$ as a prefix (see ). Circularly balanced epistandard words (up to letter permutation), like the standard Sturmian words, are perfectly clustered words under application of the BWT, i.e. the BWT produces a new word that has the minimum number of clusters (see also [23, 21] for more details about perfectly clustered words on more letters).
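The recursive definition can be turned into a short Python sketch, assuming the standard reading of the palindromic right-closure: if $w = us$ with $s$ the longest palindromic suffix of $w$, then the closure is $w$ followed by the reversal of $u$. The function names are chosen here for illustration.

```python
def right_closure(w):
    """Shortest palindrome having w as a prefix."""
    for i in range(len(w) + 1):
        suffix = w[i:]
        if suffix == suffix[::-1]:  # first hit is the longest palindromic suffix
            return w + w[:i][::-1]
    return w  # not reached: the empty suffix is a palindrome


def iterated_palindromic_closure(directive):
    """Pal(empty) = empty and Pal(w x) = right_closure(Pal(w) + x)."""
    result = ""
    for letter in directive:
        result = right_closure(result + letter)
    return result


print(iterated_palindromic_closure("ab"))   # aba
print(iterated_palindromic_closure("aba"))  # abaaba
print(iterated_palindromic_closure("abc"))  # abacaba
```

Every value produced is a palindrome by construction, and each one is a prefix of the next when the directive word is extended by one letter.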
Any circularly balanced finite epistandard word belongs to one of the following three families (up to letter permutation):
, with , where and ;
, with , where and ;
, where .
For each epistandard word, we find a possible string attractor and show that its size is $|\Sigma|$.
If $w$ is a circularly balanced epistandard word, then the minimum size of a string attractor of $w$ is $|\Sigma|$.
The proof of the theorem uses similar arguments as in Theorem 5.2 and the notion of palindromic closure. It will be detailed in the full version of the paper.
We consider some circularly balanced epistandard words and their corresponding Lyndon words. A smallest string attractor of
(type 1) , obtained by , is , whereas for its Lyndon conjugate it is .
(type 2 for and ) , obtained by , is and of its Lyndon conjugate is .
(type 2 for and ) , obtained by , is and of its Lyndon conjugate is
(type 2 for and ) , obtained by , is and of its Lyndon conjugate is .
(type 3) , obtained by , is and of its Lyndon conjugate is .
Note that, in this example, one has an attractor for each symbol of the alphabet and the position of such attractor in the epistandard word coincides with the position where the last occurrence letter appears during the palindromic right-closure. Note also that, in the case of Lyndon words, for each letter , one has an attractor at the position of the last occurrence of in the run of the output of the BWT.
Unfortunately, the authors in  show that there exist words that do not belong to the families in Theorem 5.3 and are perfectly clustered via the BWT. For instance, the perfectly clustered word is not a finite epistandard word and a possible string attractor is , whereas the perfectly clustered word is a finite epistandard word but it is not balanced, and a possible string attractor is . The characterization of all perfectly clustered words via the BWT is out of the scope of this paper.
The problem of characterizing all words over an alphabet of more than two letters whose smallest string attractor has size equal to the cardinality of the alphabet remains open.
6 String Attractors in Thue-Morse Words
In this section we consider the problem of finding a smallest string attractor for the family of finite binary Thue-Morse words. Thue-Morse words are a sequence of words obtained by the iterated application of a morphism as described below.
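Let us consider the alphabet $\Sigma = \{a, b\}$ and the morphism $\varphi: \Sigma^* \to \Sigma^*$ such that $\varphi(a) = ab$ and $\varphi(b) = ba$. Let us denote by $t_n = \varphi^n(a)$ the $n$-th iterate of the morphism, which is called the $n$-th Thue-Morse word.
It is easy to verify that, for each $n \geq 0$, the $n$-th Thue-Morse word has length $2^n$. The $n$-th Thue-Morse words for can be found in Figure 1.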
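The iteration of the morphism can be sketched in a few lines of Python, assuming the usual Thue-Morse morphism $a \mapsto ab$, $b \mapsto ba$ applied $n$ times to the letter $a$. The function name `thue_morse` is chosen for illustration.

```python
def thue_morse(n):
    """n-th Thue-Morse word: the morphism a -> ab, b -> ba applied n times to 'a'."""
    rules = {"a": "ab", "b": "ba"}
    word = "a"
    for _ in range(n):
        word = "".join(rules[letter] for letter in word)
    return word


for n in range(5):
    print(n, thue_morse(n))
# 0 a
# 1 ab
# 2 abba
# 3 abbabaab
# 4 abbabaabbaababba
```

Each application of the morphism doubles the length, which gives the length $2^n$ stated above.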
By using a result in  on the enumeration of factors in Thue-Morse words the following lower bound is proved.
Let be the -th Thue-Morse word with . Then .
The following proposition provides the recursive structure of the -th Thue-Morse words by using the rules of a context-free grammar.
The -th Thue-Morse word is obtained by the following grammar:
by taking as axiom the non terminal symbol .
The grammar described in Proposition 6 contains rules for the Thue-Morse word of length . Therefore, by using a result proved in , reported here as Theorem 4.3, it is possible to construct a string attractor for having size .
In the last part of this section we exhibit a string attractor for $t_n$ of size $n$. Our conjecture is that $\gamma^*(t_n) = n$.
A string attractor of the -th Thue-Morse word, with is
Let be a factor of , with . Then, admits an occurrence of crossing a position in the set .
Proof (Theorem 6.1)
The thesis is proved by induction on . If , it is easy to check (see Fig. 1) that is a string attractor for .
Let us suppose that is a string attractor for . We show that a string attractor for can be obtained by applying to the following two operations:
ADD(), that adds the new position
MOVE(,) that replaces the position with .
Such operations are described, for , in Fig. 1. Let be a factor of . If is also a factor of , then by Lemma 1 has at least one occurrence crossing a position in the set . We can suppose that is a factor of that does not appear in and that does not cross any position in . It means that has to be a factor of . In particular, such a factor exists. In fact, let and be the first and the last character of , then has only one occurrence in . It follows from the fact that and that the Thue-Morse words have no overlapping factors. Moreover, the occurrence of crosses the position .∎
7 Attractors in de Bruijn words
A de Bruijn sequence (or word) of order $k$ on an alphabet $\Sigma$ of size $\sigma$ is a circular sequence in which every possible length-$k$ string on $\Sigma$ occurs exactly once as a substring.
De Bruijn words are widely studied in combinatorics on words, and all of them can be constructed by considering all the Eulerian walks on de Bruijn graphs. All the de Bruijn sequences of order $k$ over an alphabet of size $\sigma$ have length $\sigma^k$. For instance the (circular) word is a de Bruijn word of order 4 over the alphabet . In fact one can verify that all strings of length 4 over appear as a factor of just once.
Since we are here interested in linear rather than cyclic words, it is easy to verify that, in order to obtain linear words containing all the factors of length $k$ exactly once, it is sufficient to consider any linearization of the circular de Bruijn word of order $k$ (that is, we cut the circular word in any position to get a linear word) and concatenate it with a word equal to its own prefix of length $k-1$. Therefore its length is $\sigma^k + k - 1$. We call such words linear de Bruijn sequences (or words). For instance the linear de Bruijn word corresponding to the circular one in the above example is the word . Remark that the length of is .
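The linearization step can be illustrated with the following sketch. The circular word is generated here with the classical recursive construction based on concatenating Lyndon words, which is one of several possible constructions and is swapped in for the Eulerian-walk approach mentioned in the text; the function names are hypothetical.

```python
def circular_de_bruijn(alphabet, k):
    """A circular de Bruijn word of order k, built by the classical
    recursive Lyndon-word concatenation method."""
    sigma = len(alphabet)
    a = [0] * (k + 1)
    sequence = []

    def extend(t, p):
        if t > k:
            if k % p == 0:
                sequence.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            extend(t + 1, p)
            for j in range(a[t - p] + 1, sigma):
                a[t] = j
                extend(t + 1, t)

    extend(1, 1)
    return "".join(alphabet[i] for i in sequence)


def linear_de_bruijn(alphabet, k):
    """Cut the circular word and append its own prefix of length k - 1."""
    circular = circular_de_bruijn(alphabet, k)
    return circular + circular[:k - 1]


w = linear_de_bruijn("ab", 2)
print(w, len(w))  # aabba 5
```

In the linear word every string of length $k$ occurs exactly once as a factor, and the length is $\sigma^k + k - 1$, as stated above.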
In  the following two theorems are proved.
The number of phrases in an LZ parsing of a sequence of length over an alphabet of size satisfies:
Let be a de Bruijn sequence of order and length over an alphabet of size (). Then
Let be a de Bruijn sequence of order and length over an alphabet of size (). Then the cardinality of a smallest string attractor for satisfies:
This means that $\gamma^*(w)$ for a de Bruijn word $w$ of length $n$ grows asymptotically as $\frac{n}{\log n}$, corresponding to the worst case for the size of a smallest string attractor of any word of the same length.
Notice that the lower bound is intuitively expected: since all the factors of length $k$ appear only once in $w$, two consecutive positions in any string attractor cannot be more than $k$ apart.
For instance one can verify that for a smallest string attractor is .
8 Conclusion and Open Problems
In this paper we have studied the notion of string attractor from a combinatorial point of view. We have given an explicit construction of a string attractor for the infinite families of standard Sturmian words and Thue-Morse words. For standard Sturmian words, by using their combinatorial properties, the construction gives a smallest string attractor whose size is $2$. String attractors of minimum size can also be constructed for circularly balanced epistandard words. The question of characterizing all the words whose smallest string attractor has size equal to the cardinality of the alphabet is open. For Thue-Morse words, the size of the attractor is logarithmic with respect to the length of the word. We conjecture that such size is minimum and we leave it as an open problem. Based on the results presented in the paper, two research directions could be explored. Some standard Sturmian words and Thue-Morse words are generated by morphisms. It could be interesting to find which properties of the morphism determine a smallest string attractor of constant size. Moreover, we plan to study how the distribution of the positions in the smallest string attractor is related to the combinatorial structure of the words.
Finally, the size of the smallest string attractor could be used to define a new function to measure the complexity of infinite words. It could be interesting to investigate how such a measure is related to other known complexity measures, such as the factor complexity.
S. Mantaci, G. Rosone and M. Sciortino are partially supported by the project MIUR-SIR CMACBioSeq (“Combinatorial methods for analysis and compression of biological sequences”) grant n. RBSI146R5L.
- [1] Berstel, J., de Luca, A.: Sturmian words, Lyndon words and trees. Theoret. Comput. Sci. 178(1), 171–203 (1997)
- [2] Brlek, S.: Enumeration of factors in the Thue-Morse word. Discrete Applied Mathematics 24(1-3), 83–96 (1989)
- [3] Castiglione, G., Restivo, A., Sciortino, M.: Circular Sturmian words and Hopcroft’s algorithm. Theoretical Computer Science 410(43), 4372–4381 (2009)
- [4] Fraenkel, A.S.: Complementing and exactly covering sequences. Journal of Combinatorial Theory, Series A 14(1), 8–20 (1973). https://doi.org/10.1016/0097-3165(73)90059-9
- [5] Justin, J.: Episturmian morphisms and a Galois theorem on continued fractions. RAIRO-Theor. Inf. Appl. 39(1), 207–215 (2005). https://doi.org/10.1051/ita:2005012
- [6] Kempa, D., Prezza, N.: At the roots of dictionary compression: String attractors. CoRR abs/1710.10964 (2017), http://arxiv.org/abs/1710.10964
- [7] Kempa, D., Prezza, N.: At the roots of dictionary compression: string attractors. In: Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2018, Los Angeles, CA, USA, June 25-29, 2018. pp. 827–840. ACM (2018)
- [8] Kida, T., Matsumoto, T., Shibata, Y., Takeda, M., Shinohara, A., Arikawa, S.: Collage system: a unifying framework for compressed pattern matching. Theoret. Comput. Sci. 298(1), 253–272 (2003)
- [9] Knuth, D., Morris, Jr., J., Pratt, V.: Fast pattern matching in strings. SIAM Journal on Computing 6(2), 323–350 (1977)
- [10] Lempel, A., Ziv, J.: On the complexity of finite sequences. IEEE Transactions on Information Theory 22(1), 75–81 (1976). https://doi.org/10.1109/TIT.1976.1055501
- [11] Lothaire, M.: Applied Combinatorics on Words (Encyclopedia of Mathematics and its Applications). Cambridge University Press, New York, NY, USA (2005)
- [12] Louza, F.A., Telles, G.P., Gog, S., Zhao, L.: Computing Burrows-Wheeler Similarity Distributions for String Collections. In: SPIRE 2018. Lecture Notes in Computer Science, vol. 11147, pp. 285–296. Springer (2018)
- [13] de Luca, A., Mignosi, F.: Some combinatorial properties of Sturmian words. Theoretical Computer Science 136(2), 361–385 (1994)
- [14] de Luca, A.: Combinatorics of standard Sturmian words. LNCS, vol. 1261, pp. 249–267. Springer (1997)
- [15] Mantaci, S., Restivo, A., Rosone, G., Sciortino, M., Versari, L.: Measuring the clustering effect of BWT via RLE. Theor. Comput. Sci. 698, 79–87 (2017)
- [16] Mantaci, S., Restivo, A., Sciortino, M.: Burrows-Wheeler transform and Sturmian words. Information Processing Letters 86, 241–246 (2003)
- [17] Paquin, G.: On a generalization of Christoffel words: epichristoffel words. Theoretical Computer Science 410(38), 3782–3791 (2009). https://doi.org/10.1016/j.tcs.2009.05.014
- [18] Paquin, G., Vuillon, L.: A characterization of balanced episturmian sequences. The Electronic Journal of Combinatorics 14(1), Research paper R33, 12 p. (2007)
- [19] Prezza, N.: String attractors. CoRR abs/1709.05314 (2017), http://arxiv.org/abs/1709.05314
- [20] Prezza, N., Pisanti, N., Sciortino, M., Rosone, G.: SNPs detection by eBWT positional clustering. Algorithms for Molecular Biology 14(1), 3:1–3:13 (2019)
- [21] Restivo, A., Rosone, G.: Burrows-Wheeler transform and palindromic richness. Theoretical Computer Science 410(30-32), 3018–3026 (2009)
- [22] Restivo, A., Rosone, G.: Balancing and clustering of words in the Burrows-Wheeler transform. Theoretical Computer Science 412(27), 3019–3032 (2011)
- [23] Simpson, J., Puglisi, S.J.: Words with simple Burrows-Wheeler transforms. Electronic Journal of Combinatorics 15 (article R83, 2008)
- [24] Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE Trans. Inf. Theor. 23(3), 337–343 (1977)
- [25] Ziv, J., Lempel, A.: Compression of individual sequences via variable-length coding. IEEE Trans. Inf. Theor. 24, 530–536 (1978)