Two major themes in combinatorics on words are power avoidance and subword complexity (also called factor complexity or just complexity). In power avoidance, the main goals are to construct infinite words avoiding various kinds of repetitions (see, e.g., ), and to count or estimate the number of length-$n$ finite words avoiding these repetitions (see, e.g., ). In subword complexity, the main goal is to find explicit formulas for, or estimate, the number of distinct blocks of length $n$ appearing in a given infinite word (see, e.g., ). In this paper we combine these two themes, beginning a systematic study of infinite binary and ternary words satisfying power avoidance restrictions. We follow two interlaced lines of research. First, given a power avoidance restriction, we study the set of infinite words satisfying this restriction, focusing on upper and lower bounds on their subword complexity, and examples of words of “large” and “small” complexity. Second, given a subword complexity restriction, we seek lower bounds on the powers avoided by infinite words of restricted complexity, and words attaining these bounds. We also try to cover the remaining white spots with open questions and conjectures. Most of the results are gathered in Table 1; precise definitions are given below in Section 1.1.
|Avoidance restriction|Small complexity|Large complexity|
|all overlap-free|minimum: Thue-Morse word|maximum: twisted Thue-Morse word|
|symmetric overlap-free|minimum: Thue-Morse word|maximum: Thue-Morse word|
|-power-free|minimum: Thue-Morse word|maximum: no; upper bound:|
|symmetric -power-free|minimum: Thue-Morse word|maximum: Thue-Morse word|
|-power-free|Thue-Morse word is not of minimum complexity|exponential|
|-power-free|minimum(?): (new word)| |
|-power-free|minimum: (Fibonacci word)| |
|symmetric -power-free|minimal(?) growth: (Arshon word)| |
|symmetric -power-free|minimal growth: (1-3-bonacci word)| |
|square-free|minimum(?): ternary Thue word|exponential|
|symmetric square-free|minimum: (1-2-bonacci word)| |
|-power-free|minimum(?): (new word)| |
The paper is organized as follows. After definitions, we study $\alpha$-power-free infinite binary words for $\alpha \le 7/3$ in Section 2. The number of distinct blocks of length $n$ in this case is quite small and all infinite words are closely related to the Thue-Morse word. We provide answers for most (but not all) questions about complexity of these words. After that, in Section 3 we study low-complexity $\alpha$-power-free infinite binary and ternary words for the case where $\alpha$ is big enough to provide sets of length-$n$ blocks of size exponential in $n$. Here we leave more white spots, but still obtain quite significant results, especially about square-free ternary words. Finally, in Section 4 we briefly consider high-complexity infinite words and relate the existence of words of “very high” complexity to an old problem by Restivo and Salemi.
1.1 Definitions and notation
Throughout we let $\Sigma_k$ denote the $k$-letter alphabet $\{0, 1, \ldots, k-1\}$. By $\Sigma_k^*$ we mean the set of all finite words over $\Sigma_k$, including the empty word $\varepsilon$. By $\Sigma_k^\omega$ we mean the set of all one-sided right-infinite words over $\Sigma_k$; throughout the paper they are referred to as “infinite words”. The length of a finite word $x$ is denoted by $|x|$.
If $x = uvw$, for possibly empty words $u, v, w$, then we say that $u$ is a prefix of $x$, $w$ is a suffix of $x$, and $v$ is a subword (or factor) of $x$. A prefix (resp., suffix, factor) $v$ of $x$ is proper if $v \ne x$. A factor $v$ of $x$ can correspond to different factorizations of $x$: $x = u_1 v w_1 = u_2 v w_2 = \cdots$. Each factorization corresponds to an occurrence of $v$ in $x$; the position of an occurrence is the length of the prefix of $x$ preceding $v$. Thus occurrences of $v$ (and, in particular, occurrences of letters) are linearly ordered by their positions, and we may speak about the “first” or “next” occurrence. An infinite word is recurrent if every factor has infinitely many occurrences.
Any map () can be uniquely extended to all finite and infinite words over by putting , where for all . Such extended maps are called morphisms. A morphism is a coding if it maps letters to letters.
The Thue-Morse word
$$\mathbf{t} = t_0 t_1 t_2 \cdots = 0110100110010110\cdots$$
is a well-studied infinite word with many equivalent definitions (see, e.g., ). The first is that $t_n$ counts the number of $1$’s, modulo $2$, in the binary expansion of $n$. The second is that $\mathbf{t}$ is the fixed point, starting with $0$, of the morphism $\mu$ sending $0$ to $01$ and $1$ to $10$.
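The two definitions can be compared on a finite prefix. The following is a minimal sketch; the function names are illustrative and not taken from the paper.

```python
def thue_morse_by_bits(length):
    """Letter n is the number of 1's in the binary expansion of n, modulo 2."""
    return "".join(str(bin(n).count("1") % 2) for n in range(length))


def thue_morse_by_morphism(length):
    """Iterate the morphism 0 -> 01, 1 -> 10 on the word 0 until it is long enough."""
    image = {"0": "01", "1": "10"}
    word = "0"
    while len(word) < length:
        word = "".join(image[letter] for letter in word)
    return word[:length]


print(thue_morse_by_bits(16))  # 0110100110010110
print(thue_morse_by_bits(1000) == thue_morse_by_morphism(1000))  # True
```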
A language is a set of finite words over . A language is said to be factorial if implies that every factor of is also in . If is a one-sided or two-sided infinite word, then by we mean the factorial language of all finite factors of . We call a language symmetric if for any bijective coding . An infinite word is symmetric if is symmetric. For example, the Thue-Morse word is symmetric.
The so-called subword complexity or factor complexity of an infinite word $\mathbf{w}$ is the function $p_{\mathbf{w}}(n)$ that maps $n$ to the number of distinct subwords (factors) of length $n$ in $\mathbf{w}$. If the context is clear, we just write $p(n)$. A more general notion is the growth function (or combinatorial complexity, or census function) of a language $L$; it is the function $C_L(n)$ counting the number of words in $L$ of length $n$. Thus, $p_{\mathbf{w}}(n)$ is the growth function of the language of all finite factors of $\mathbf{w}$. These complexity functions can be roughly classified by their growth rate $\limsup_{n\to\infty} (C_L(n))^{1/n}$; for factorial languages, the $\limsup$ can be replaced by $\lim$. Exponential (resp., subexponential) words and languages have growth rate $>1$ (resp., $1$); growth rate $0$ implies a finite language. Infinite words constructed by some regular procedure (e.g., those generated by morphisms) usually have small, often linear, complexity (see, e.g., ).
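We say that an infinite word $\mathbf{u}$ has minimum (resp., maximum) subword complexity in a set of words $S$ if $p_{\mathbf{u}}(n) \le p_{\mathbf{v}}(n)$ (resp., $p_{\mathbf{u}}(n) \ge p_{\mathbf{v}}(n)$) for every word $\mathbf{v} \in S$ and every $n \ge 0$.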
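For experiments, subword complexity can be estimated by counting the distinct blocks of a long finite prefix. The sketch below does this for the Thue-Morse word. It assumes the prefix is long enough to contain every factor of the lengths considered, which is safe here because the word is recurrent and the lengths are tiny compared with the prefix; the helper names are illustrative.

```python
def thue_morse(length):
    """Prefix of the Thue-Morse word (parity of the number of 1 bits of n)."""
    return "".join(str(bin(n).count("1") % 2) for n in range(length))


def factors(word, n):
    """Set of distinct blocks of length n occurring in a finite word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def complexity(word, n):
    """Number of distinct length-n blocks in a finite word."""
    return len(factors(word, n))


prefix = thue_morse(4096)
print([complexity(prefix, n) for n in range(1, 9)])
# [2, 4, 6, 10, 12, 16, 20, 22]
```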
An integer power of a word is a word of the form . If is nonempty then by we mean the infinite word . Integer powers can be generalized to fractional powers as follows: by , for a real number , we mean the prefix of length of the infinite word . If is a finite word and is the shortest word such that is a prefix of , then the ratio is called the exponent of . The critical (or local) exponent of a finite or infinite word is the supremum of the exponents of its factors. Thus, for example, the French word contentent (as in ils se contentent) has a suffix that is the ’th power of the word nte, as well as the ’th power of this word; the exponent of the word contentent is 1, and its critical exponent is .
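The exponent and the critical exponent of a finite word can be computed by brute force directly from these definitions. The sketch below uses exact rational arithmetic and checks the word contentent; the function names are illustrative.

```python
from fractions import Fraction


def smallest_period(word):
    """Length of the shortest z such that `word` is a prefix of zzz..."""
    n = len(word)
    for p in range(1, n + 1):
        if all(word[i] == word[i + p] for i in range(n - p)):
            return p
    raise ValueError("word must be nonempty")


def exponent(word):
    """Ratio of the length of the word to its smallest period."""
    return Fraction(len(word), smallest_period(word))


def critical_exponent(word):
    """Maximum exponent over all nonempty factors of a finite word."""
    return max(
        exponent(word[i:j])
        for i in range(len(word))
        for j in range(i + 1, len(word) + 1)
    )


print(exponent("contentent"))           # 1
print(exponent("ntentent"))             # 8/3
print(critical_exponent("contentent"))  # 8/3
```

The search is quartic in the length of the word, so it is only meant for short examples.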
We say a finite or infinite word is -power-free or avoids -powers if it has no factors that are -powers for . Similarly, a finite or infinite word is -power-free or avoids -powers if it has no factors that are -powers for . In what follows, we use only the term “-power”, assuming that is either a number or a “number with a +”. We write for the language of all finite -ary -power-free words. The growth functions of these power-free languages were studied in a number of papers; see the survey  for details. For our study, we need the following rough classification of growth functions of infinite power-free languages [28, 16, 18, 33, 40]: the languages for
have polynomial growth functions, while all other power-free languages are conjectured to have exponential growth functions. This conjecture has been proved for all power-free languages over the alphabets of even size up to 10 and of odd size up to 101. The “polynomial plateau” of binary power-free languages possesses several distinctive properties due to their intimate connection to the Thue-Morse word (see, e.g., [37, Section 2.2]).
An infinite word $\mathbf{w}$ is called periodic if it has a suffix $x^\omega$ for some nonempty word $x$; otherwise, it is called aperiodic. Obviously, all power-free words are aperiodic. The minimum subword complexity of an aperiodic word is $n+1$, reached by the class of Sturmian words .
A finite word from a language is right extendable in if for every integer there is a word such that and . Left extendability is defined in a dual way. Further, is two-sided extendable in if for every integer there are words such that and . We write
Note that all factors of an infinite word are right extendable in . It is known  that for every language the languages and have the same growth rate as .
For a word $w$ over $\Sigma_2$, we say that we flip a letter $a$ if we replace it with $1-a$. The word obtained from $w$ by flipping all letters is the complement of $w$.
2 Minimum and maximum subword complexity in small languages
The $2^+$-power-free words are commonly called overlap-free due to the following equivalent characterization: a word $w$ contains an $\alpha$-power with $\alpha > 2$ if and only if two different occurrences of some factor in $w$ overlap. It has been known since Thue  that the Thue-Morse word $\mathbf{t}$ is overlap-free. The Thue-Morse morphism $\mu$ ($0 \to 01$, $1 \to 10$) satisfies the following very strong property.
Lemma 1 ().
For every real $\alpha > 2$, an arbitrary word $u$ avoids $\alpha$-powers iff the word $\mu(u)$ does.
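Lemma 1 can be sanity-checked by brute force on short binary words. The sketch below takes the morphism to be the Thue-Morse morphism 0 -> 01, 1 -> 10, reads "avoids alpha-powers" as "has no factor of exponent at least alpha", and tests a handful of sample thresholds above 2. The thresholds and helper names are choices made for this illustration, and a finite check of this kind is of course no substitute for the proof.

```python
from fractions import Fraction
from itertools import product

MU = {"0": "01", "1": "10"}


def mu(word):
    """Apply the Thue-Morse morphism letter by letter."""
    return "".join(MU[letter] for letter in word)


def exponent(word):
    """Length of the word divided by its smallest period."""
    n = len(word)
    period = next(
        p for p in range(1, n + 1)
        if all(word[i] == word[i + p] for i in range(n - p))
    )
    return Fraction(n, period)


def critical_exponent(word):
    """Maximum exponent over all nonempty factors of a finite word."""
    return max(
        exponent(word[i:j])
        for i in range(len(word))
        for j in range(i + 1, len(word) + 1)
    )


def avoids(word, alpha):
    """True if no factor of `word` has exponent >= alpha."""
    return critical_exponent(word) < alpha


def lemma_holds_for(word, alphas):
    """Compare power avoidance of a word and of its image for each threshold."""
    return all(avoids(word, a) == avoids(mu(word), a) for a in alphas)


# Sample thresholds, all strictly greater than 2 (hypothetical test values).
alphas = [Fraction(7, 3), Fraction(5, 2), Fraction(3), Fraction(4)]
words = ["".join(w) for n in range(1, 7) for w in product("01", repeat=n)]
print(all(lemma_holds_for(w, alphas) for w in words))
```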
Moreover, all binary -power-free (in particular, overlap-free) words can be expressed in terms of the morphism . Below is the “infinite” version of a well-known result proved by Restivo and Salemi  for overlap-free words and extended by Karhumäki and Shallit  to all -power-free words. The second statement of this lemma was proved in .
Let be an infinite -power-free binary word, be an integer. Then is uniquely representable in the form
where is also an infinite -power-free binary word, . Moreover, for every the condition implies either , or , or , for some .
Every -power-free infinite binary word contains, as factors, all elements of . In particular, this is true of every overlap-free word.
Corollary 4.
The Thue-Morse word has the minimum subword complexity among all binary $7/3$-power-free (in particular, overlap-free) infinite words.
We now consider the analogue of the Thue-Morse sequence, where for each $n$ we count the number of $0$’s, mod 2 (instead of the number of $1$’s, mod 2), in the binary representation of $n$. By convention, we assume that the binary expansion of 0 is the empty word. We call this word
$$\mathbf{t}' = 0010011010010110\cdots$$
the twisted Thue-Morse word. The word was mentioned in  and has appeared previously in the study of overlap-free and -power-free words . It is the image, under the coding , , of the fixed point of the morphism , , , and is known to be overlap-free.
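The digit-counting definition of the twisted Thue-Morse word translates directly into code, and overlap-freeness can be tested on a finite prefix. In the sketch below an overlap is taken to be a factor of length 2p+1 with period p; the function names are illustrative, and checking a prefix does not prove anything about the infinite word.

```python
def twisted_thue_morse(length):
    """Letter n is the number of 0's in the binary expansion of n, modulo 2.

    The binary expansion of 0 is taken to be the empty word.
    """
    letters = []
    for n in range(length):
        expansion = bin(n)[2:] if n > 0 else ""
        letters.append(str(expansion.count("0") % 2))
    return "".join(letters)


def has_overlap(word):
    """True if some factor of length 2p+1 has period p, for some p >= 1."""
    n = len(word)
    for p in range(1, n // 2 + 1):
        run = 0  # current number of consecutive positions i with word[i] == word[i+p]
        for i in range(n - p):
            run = run + 1 if word[i] == word[i + p] else 0
            if run > p:  # p+1 consecutive matches span a factor of length 2p+1
                return True
    return False


print(twisted_thue_morse(16))                     # 0010011010010110
print(has_overlap(twisted_thue_morse(2048)))      # False
```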
We now state one of our main results. The proof follows after a series of preliminary statements.
Theorem 5.
The twisted Thue-Morse word has maximum subword complexity among all overlap-free infinite binary words, and is the unique word with this property, up to complement.
The uniqueness of , stated in Theorem 5, differs strikingly from the situation with minimum complexity, described by Corollary 4. Namely, the language of factors , and thus the subword complexity , is shared by a continuum of infinite words. The explicit construction of all such words can be found in . (Precisely, Section 2 of  describes two-sided infinite words with the language of factors , but all their suffixes have exactly the same factors.)
A word is minimal forbidden for a factorial language if , while all proper factors of belong to . Every overlap-free infinite word having a factor not in contains a minimal forbidden word for ; moreover, every such word appearing in is right extendable in the language . These forbidden words are classified in the following lemma.
Lemma 8 ().
Let be the last letter of . The minimal forbidden words for are exactly the following words and their complements:
Among these, only the words (and their complements) are right extendable for .
This claim is also easy to prove using the Walnut prover .
Lemma 10.
Let be an infinite overlap-free binary word and let . If or for some word , then . In particular, the words have, in total, at most one occurrence in .
By induction on . First, consider the case . We have . Since is overlap free, the letter following in is 1. On the other hand, if is nonempty, it must end with either or , and in both cases has an overlap. So must be empty.
Now the induction step. Assume the claimed result is true for ; we prove it for . Let and assume . The factorization (1) of implies for some infinite word , which is overlap-free by Lemma 1. Next, note that for some nonempty word . Indeed, begins with . If ends with , then contains the factor , preceded by some letter(s); this certainly creates an overlap. Thus
Applying the inductive hypothesis to , we obtain . Having and , we obtain . Since has the maximum possible length, the inductive hypothesis guarantees that both words and contain overlaps. Then the word has some prefix that ends with 0, and also some prefix that ends with 1 (e.g., can be a word of the form ). Hence has the prefix that ends with 1, and the prefix that ends with 0. This means that both words and contain overlaps, and thus . So we have . This contradicts our assumption and thus proves the inductive step. ∎
Recall that a factor of is (right) -special if both and are factors of . The set of all special factors of is denoted by . We will omit when it is clear from the context. We use the following familiar fact:
The number is the first difference of the subword complexity of : .
Consider the function mapping every word from of length to its prefix of length . Each special factor of of length has two preimages, while each non-special factor of length has a single preimage. ∎
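The counting argument can be observed numerically: on a long prefix of the Thue-Morse word, the number of special factors of length n matches the first difference of the block counts. This is a sketch under the assumption that the prefix contains every factor of the lengths involved and that each of them reappears away from the end of the prefix; the names are illustrative.

```python
def thue_morse(length):
    """Prefix of the Thue-Morse word (parity of the number of 1 bits of n)."""
    return "".join(str(bin(n).count("1") % 2) for n in range(length))


def factors(word, n):
    """Set of distinct blocks of length n occurring in a finite word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def special_factors(word, n):
    """Length-n blocks u such that both u0 and u1 occur in the word."""
    longer = factors(word, n + 1)
    return {u for u in factors(word, n) if u + "0" in longer and u + "1" in longer}


prefix = thue_morse(4096)
for n in range(1, 9):
    difference = len(factors(prefix, n + 1)) - len(factors(prefix, n))
    print(n, len(special_factors(prefix, n)), difference)
```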
For all and every infinite binary word we have
Since for every binary word, below we restrict our attention to for . For example,
as was first computed in . As an infinite word over , this sequence looks like
where each subsequent block of equal letters is twice the size of the previous block of the same letter.
Let be overlap-free. By Corollary 4, all -special factors are -special, so for all . We call -special factor irregular if it is not -special.
Lemma 13.
For an overlap-free infinite binary word , all -special factors are Thue-Morse factors.
Proof of Theorem 5.
Let be an overlap-free infinite binary word (the case is parallel, so we omit it). We show the following four facts:
at every position in , the first occurrence of at most one irregular -special factor begins;
for every , there is an irregular -special factor with the first occurrence beginning at position of ;
if is an irregular -special factor with the first occurrence beginning at position of an overlap-free word , then ;
the inequality is strict for at least one value of .
(i) Let be an irregular -special factor. By Lemma 13, ; but either or is not a Thue-Morse factor by definition of irregularity. W.l.o.g., . Then some suffix of is a minimal forbidden word for . By Lemma 8, this suffix equals for some . So , where by Lemma 10. Thus, the first occurrence of in is followed by 0 (and ), while all other occurrences of are followed by 1 (and ). Now assume that some proper prefix of is also an irregular -special factor. Since , the occurrence of as a prefix of is not the first occurrence of in . Hence the first occurrences of each two irregular -special factors begin in different positions.
(ii) From (3) it is easy to see that for every even the prefix of of length equals , where for and has the common suffix with for . Similarly, for every odd such a prefix equals , where has the common suffix with . According to the above description of the irregular special factors, has first occurrences of irregular special factors beginning at each position:
(iii) If is an irregular -special factor and its first occurrence is followed by , then is not Thue-Morse and thus has, by Lemma 8, the suffix or for some (because is Thue-Morse). Hence by Lemma 10, is a suffix of the prefix of , where . Therefore the factor of length occurs in for the first time at position at most (). Inverting this, the factor such that () is of length at least . But this number is exactly the length of for every ; see (6). Since is the shortest irregular special factor, we proved the required statement.
(iv) Note that the lengths of irregular special factors, given in (6), and the first letter of allow one to restore the whole word in a unique way, and this word is . Hence, for any other word starting with we have for some .
To finish the proof, consider the function that maps every -special factor to itself and every irregular factor to the irregular factor . The function is well-defined by (ii) and injective by (i). By (iii), never increases the length of the word. Finally, (iv) implies that decreases the length of some . Together, these facts imply that for every ,
and the inequality is strict for . The claim of the theorem is now immediate from (4). ∎
As a consequence of the proof, we can determine a closed form for the subword complexity of .
The number of special -factors of length is given by the formula
The proposition states that the sequence () can be written as the following word over :
Comparing this to (5), we see that some 2’s have been changed to 3’s. This means an additional -special factor for each corresponding length, and this factor must be irregular. According to (6), the set of all positions of 3’s indeed coincides with the set of lengths of irregular -special factors, thus proving the proposition. ∎
The maximum factor complexity of a binary overlap-free infinite word is the factor complexity of the twisted Thue-Morse word and is given, for , by the formula
A table of the first few values of the subword complexity of follows:
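Such values can be recomputed by brute force from long prefixes. The sketch below counts distinct blocks in prefixes of the Thue-Morse word and of the twisted Thue-Morse word (number of 0's in the binary expansion, modulo 2, with the expansion of 0 empty) and prints them side by side. It assumes that prefixes of length 4096 already contain every factor of length at most 12; no specific values are asserted here beyond what the code prints.

```python
def thue_morse(length):
    """Prefix of the Thue-Morse word (parity of the number of 1 bits of n)."""
    return "".join(str(bin(n).count("1") % 2) for n in range(length))


def twisted_thue_morse(length):
    """Prefix of the twisted word (parity of the number of 0 bits of n)."""
    return "".join(
        str((bin(n)[2:].count("0") if n > 0 else 0) % 2) for n in range(length)
    )


def complexity(word, n):
    """Number of distinct length-n blocks in a finite word."""
    return len({word[i:i + n] for i in range(len(word) - n + 1)})


t = thue_morse(4096)
tw = twisted_thue_morse(4096)
for n in range(1, 13):
    print(n, complexity(t, n), complexity(tw, n))
```

For instance, the block 00100 is a prefix of the twisted word but is not a Thue-Morse factor, so the two counts already differ at length 5.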
2.1 Beyond overlap-free words
Now we turn to the case of -power-free infinite binary words for arbitrary from the half-open interval . By Corollary 4, the Thue-Morse word and its complement are words of minimum complexity. Theorem 16 below shows that the asymptotic growth of subword complexity belongs to a very small range. Theorem 17 demonstrates that there is no -power-free infinite binary word of maximum complexity.
Theorem 16.
Let . Every infinite binary word avoiding -powers has linear subword complexity. Moreover, for every one has .
First assume that for some word and some . Then also avoids -powers by Lemma 1. According to Lemma 8, the shortest word that can be a factor of , but does not occur in , is either or . Hence every factor of that is contained in four consecutive blocks of the form , , is a factor of . Thus the shortest factor of that is not in has the length at least . (If contains , this factor is from Lemma 8.) So we have
By (8), all factors of of length are Thue-Morse factors, so we have
This upper bound is a bit loose; to tighten it, consider . By Lemma 2, . Let (w.l.o.g., ). Then
Since 0100110 is a Thue-Morse factor, so is . Hence the number of length words in is at most . Applying the second statement of Lemma 2 to , we easily obtain . Therefore,
Next let and let , . If or , then is not -power free. Otherwise, is a Thue-Morse factor, as well as the suffix of . Then all length- words of , that are not in , begin in to the left of . But , so we again have (9). For the same reason we obtain (9) in the case .
From the proof of Theorem 16 we have the following: for arbitrary -power-free infinite words ,
if , then ;
if , we write and see that ; the only length factor of that is possibly not in is the prefix of ;
if , we write and analyze cases to get for and for . Further, we note that every word having the property and beginning with 0 is of the following form:
and for all in the considered range (the two “additional” factors are those beginning at positions 0 and 1). On the other hand, every word with the property , starting with 0, has the form
and one has (the only additional factor 10110010 begins at position 2), (the additional factors begin at positions 1 and 2), and for (the additional factors begin at positions 0, 1, and 2). We see that reaches the maximum possible complexity for but not for , while reaches this maximum for but not for .
Thus we have proved the following result:
Theorem 17.
There is no -power-free infinite binary word of maximum complexity.
If there is no maximum subword complexity, it makes sense to look at some sort of “asymptotically maximal” complexity. For a function of linear growth, let its linear growth constant be . For overlap-free infinite binary words, the maximum subword complexity has linear growth constant , as can be easily derived from (7). From (2) we see that such a constant for the Thue-Morse word is . By Theorem 16, this means that the linear growth constants of all -power-free infinite binary words are upper bounded by 4.
Open Question 18.
What is the maximum linear growth constant for the subword complexity of a -power-free infinite binary word? Which words have such complexity?
2.2 The symmetric case
The only possible subword complexity function of a -power-free symmetric infinite binary word is the function .
Let be a -power-free infinite binary word such that . By Corollary 4, contains a factor that is not Thue-Morse. By Lemma 8, contains one of the factors , or their complements. Assume that contains . Taking the representation (1) of , we see that has the suffix . If the word contains the factor , it contains either or , which is impossible because is -power free. Thus, cannot occur in to the right of an occurrence of . Similarly, if contains , then the suffix of cannot contain without containing the -power .
Repeating the same argument for the factors in we conclude that contains neither simultaneously, nor simultaneously. So is not symmetric. Hence every -power-free symmetric infinite binary word has subword complexity . ∎
3 Small subword complexity in big languages
In this section we study binary and ternary words. Note the interconnection of the results over and through the encodings of words in both directions. To ease the reading, we denote words over by capital letters and other words by small letters.
Our study follows two related questions about small subword complexity:
Given a pair , how small can the subword complexity of an -power-free infinite -ary word be?
Given an integer and a function , what is the smallest power that can be avoided by an infinite -ary word with a subword complexity bounded above by ?
3.1 Ternary square-free words
Among $\alpha$-power-free infinite ternary words, the most interesting are square-free (= 2-power-free) words, the existence of which was established by Thue , and in particular $(7/4)^+$-power-free words, because $(7/4)^+$ is the minimal power that can be avoided by an infinite ternary word, as was shown by Dejean .
Consider the ternary Thue word , which is the fixed point of the morphism defined by :
This word has critical exponent 2, which is not reached, so is square-free. Also, has two alternative definitions through the Thue-Morse word. The first definition says that for any , is the number of zeroes between the ’th and ’th occurrences of 1 in . The second definition is
The definition (11) easily implies a bijection between the length- factors of and length- factors of for any . Hence, for all , and one can use formula (2). In , the complexity of was computed directly from the morphism .
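The relation between the ternary Thue word and the Thue-Morse word can be explored on prefixes. The morphism formula is not legible in this copy, so the sketch below assumes the usual choice 0 -> 012, 1 -> 02, 2 -> 1, and compares its fixed point with the word obtained by counting zeroes between consecutive 1's of the Thue-Morse word. It then compares block counts of length n in the ternary word with block counts of length n+1 in the Thue-Morse word for small n >= 2. All names are illustrative.

```python
def thue_morse(length):
    """Prefix of the Thue-Morse word (parity of the number of 1 bits of n)."""
    return "".join(str(bin(n).count("1") % 2) for n in range(length))


def ternary_thue_by_morphism(length):
    """Iterate the assumed morphism 0 -> 012, 1 -> 02, 2 -> 1 starting from 0."""
    image = {"0": "012", "1": "02", "2": "1"}
    word = "0"
    while len(word) < length:
        word = "".join(image[letter] for letter in word)
    return word[:length]


def ternary_thue_from_thue_morse(tm_prefix):
    """Number of zeroes between consecutive occurrences of 1 in a Thue-Morse prefix."""
    ones = [i for i, letter in enumerate(tm_prefix) if letter == "1"]
    return "".join(str(b - a - 1) for a, b in zip(ones, ones[1:]))


def complexity(word, n):
    """Number of distinct length-n blocks in a finite word."""
    return len({word[i:i + n] for i in range(len(word) - n + 1)})


t = thue_morse(4096)
T = ternary_thue_from_thue_morse(t)
print(T[:24])                                   # 012021012102012021020121
print(T == ternary_thue_by_morphism(len(T)))    # True
print([(complexity(T, n), complexity(t, n + 1)) for n in range(2, 8)])
```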
Conjecture 20.
The ternary Thue word has the minimum subword complexity over all square-free ternary infinite words.
The above conjecture is supported by the following result, showing that is the only candidate for a square-free word of minimum complexity.
Theorem 21.
If a word has minimum subword complexity over all square-free ternary infinite words, then , where is a bijective coding.
Before proving this theorem and presenting further results, we need to recall an encoding technique introduced in  by the second author as a development of a particular case of Pansiot’s encoding . In what follows, $a, b, c$ are unspecified pairwise distinct letters from $\Sigma_3$. Ternary square-free words contain three-letter factors of the form $aba$, called jumps (of one letter over another). Jumps occur quite often: if a square-free word $U$ has a jump at position $i$, then the next jump in $U$ occurs at one of the positions $i+2$ ($abaca$), $i+3$ ($abacbc$), or $i+4$ ($abacbab$). Note that a jump at position $i+1$ would mean that $U$ has the square $abab$ at position $i$, while no jump up to position $i+4$ would lead to the square $bacbac$ at position $i+1$. Also note that a jump in a square-free word can be uniquely reconstructed from the previous (or the next) jump and the distance between them. Thus,
a square-free ternary word can be uniquely reconstructed from the following information: the leftmost jump, its position, the sequence of distances between successive jumps, and, for finite words only, the number of positions after the last jump.
The property () allows one to encode square-free words by walks in the weighted graph shown in Fig. 1. The weight of an edge is the number of positions between the positions of two successive jumps. A square-free word is represented by the walk visiting the vertices in the order in which jumps occur when reading left to right. If the leftmost jump occurs in at position , then we add the edge of length to the beginning of the walk; in this case the walk begins at an edge, not a vertex. A symmetric procedure applies to the end of if is finite. By (), we can omit the vertices (except for the first one), keeping just the weights of edges and marking the “hanging” edges in the beginning and/or the end. Due to symmetry, we can omit even the first vertex, retaining all information about up to renaming the letters. The result is a word over with two additional bits of information (whether the first/last letters are marked; for infinite words, only one bit is needed). This word is called a codewalk of and denoted by . For example, here is some prefix of (with first letters of jumps written in boldface) and the corresponding prefix of its codewalk (the marked letter is underlined):
Note that two words have the same codewalk if and only if they are images of each other under bijective codings and thus have the same structure and the same properties related to power-freeness. A codewalk is closed if it corresponds to a closed walk without hanging edges in ; e.g., 212212 is closed and 212 is not.
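The interior part of a codewalk is easy to extract from a finite word: locate the jumps and record, for each pair of successive jumps, the number of positions strictly between their starting positions. The sketch below does only this. It ignores the marked "hanging" edges at the ends, and it generates a prefix of the ternary Thue word with the morphism 0 -> 012, 1 -> 02, 2 -> 1, which is an assumption of this illustration; the function names are hypothetical.

```python
def ternary_thue(length):
    """Prefix of the fixed point of the assumed morphism 0 -> 012, 1 -> 02, 2 -> 1."""
    image = {"0": "012", "1": "02", "2": "1"}
    word = "0"
    while len(word) < length:
        word = "".join(image[letter] for letter in word)
    return word[:length]


def jump_positions(word):
    """Positions i where word[i:i+3] has the form aba with a != b."""
    return [
        i for i in range(len(word) - 2)
        if word[i] == word[i + 2] and word[i] != word[i + 1]
    ]


def codewalk_weights(word):
    """Edge weights between successive jumps (hanging edges are not included)."""
    positions = jump_positions(word)
    return [b - a - 1 for a, b in zip(positions, positions[1:])]


prefix = ternary_thue(24)
print(prefix)                    # 012021012102012021020121
print(jump_positions(prefix))    # [2, 5, 7, 10, 14, 18, 21]
print(codewalk_weights(prefix))  # [2, 1, 2, 3, 3, 2]
```

On a square-free word every weight produced this way lies in {1, 2, 3}, in line with the bound on distances between successive jumps.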
Clearly, not all walks in the weighted graph correspond to square-free words. However, there is a strong connection between square-freeness of a word and forbidden factors in its codewalk. Combining several results of , we get the following lemma. (More restrictions can be added to statement 2 of this lemma, but we do not need them in our proofs.)
Lemma 22 ().
If a codewalk has (a) no factors , and (b) no factors of the form , where , and the codewalk is closed, then the word with this codewalk is square free.
The codewalk of a square-free word has no proper factors and for all , such that is a closed codewalk. In particular, such a codewalk contains no squares of closed codewalks.
Proof of Theorem 21.
Thue  showed that a square-free infinite ternary word contains all six factors of the form and all six factors of the form . As for the jumps, Thue proved that any two factors from different parts of the graph in Fig. 1 can be absent. (This is an optimal result since it is easy to see that at least two jumps from each part must be present.) All three possible cases (up to symmetry) with two absent jumps are depicted in Fig. LABEL:k33abc. We say that a square-free ternary word is of type , , if it lacks two jumps connected by an edge of weight . Note that has no factors 010 and 212 and thus is of type 3.
Assume that a square-free infinite ternary word has the minimum subword complexity among all such words. Then avoids two jumps, otherwise . W.l.o.g., the two missing jumps are those indicated in Fig. LABEL:k33abc, depending on the type of (if this is not the case, we replace with its image under an appropriate bijective coding). Note that all four remaining jumps occur in infinitely often, since any suffix of is square-free and thus must contain four jumps. Hence has eight factors of length 4, containing a jump. Let us compute . For this, we need to consider the factors without jumps. First let have type 1. 1021 and 2012 are factors of because they are the only right extensions of 102 and 201, respectively. If the factor 0120 is absent, then it is impossible to move by the edge of weight 2 from the vertex 101 to the vertex 202; since the codewalk of cannot contain 11 or 333 by Lemma 22 this means that the codewalk of always passes the cycle in Fig. LABEL:k33abca in the same direction. But this means that the codewalk of contains squares of closed walks, which is also impossible by Lemma 22. So the word 0120 must be a factor of . The same argument works for the factors 0210, 1201, 2102, which are responsible, respectively, for the moves from 202 to 101; from 212 to 121; and from 121 to 212. Thus .
If has type 2, the same argument works. Namely, 2012 and 2102 occur in as unique extensions of 201 and 210, respectively. The remaining four factors 0120, 0210, 1021, and 2012 are responsible for the moves by the edges of weight 3; if one of them is absent, this leads to squares of closed walks, thus violating Lemma 22. Hence we again have . The situation changes if has type 3: then each of the words 1021, 1201 can occur in only as the prefix (1021 is followed by 0, and thus can be preceded neither by 0 nor by 2; similarly for 1201). The remaining factors must be present by the same argument as in the previous cases. Note that avoids 1021 and 1201, and thus . So must have type 3 and . (Moreover, and have the same factors up to length 4.) To prove the theorem, it is enough to show that ; the equality then follows by minimality of complexity of .
Consider the language of all finite codewalks that can be read in the graph in Fig. LABEL:k33abcc and correspond to square-free words. We define two sequences of codewalks by induction:
Below we prove the following three statements, which immediately imply the desired inclusion . We include in the preimage only the words of type 3 avoiding the factors 010 and 212, as in Fig. LABEL:k33abcc.
for every ;
(i) Note that the codewalks , , and are closed for all . This fact immediately follows by induction from the definition (for the base case, one may consult Fig. LABEL:k33abcc).
We prove by induction that for each some suffix of is an infinite product of blocks and . For the base case note that has no factors 11 by Lemma 22. Hence, once the codewalk reaches one of the vertices 020, 202 (see Fig. LABEL:k33abcc), it moves between these two vertices forever along the paths labeled by and ; each path occurs infinitely many times because is aperiodic. The walks , and are obviously closed.
Now we proceed with the inductive step. First let . The factors and