# On the longest common subsequence of Thue-Morse words

The length a(n) of the longest common subsequence of the n'th Thue-Morse word and its bitwise complement is studied. An open problem suggested by Jean Berstel in 2006 is to find a formula for a(n). In this paper we prove new lower bounds on a(n) by explicitly constructing a common subsequence between the Thue-Morse words and their bitwise complement. We obtain the lower bound a(n) = 2^n(1-o(1)), saying that when n grows large, the fraction of omitted symbols in the longest common subsequence of the n'th Thue-Morse word and its bitwise complement goes to 0. We further generalize to any prefix of the Thue-Morse sequence, where we prove similar lower bounds.

## 1 Introduction

The Thue-Morse sequence is a well-known sequence in mathematics and computer science, with many interesting properties. It has a great deal of self-symmetry, but is at the same time cube-free and overlap-free (for a more in-depth introduction to the Thue-Morse sequence, see, for instance, Allouche and Shallit (Allouche et al., 1999)).

In 2006, Jean Berstel (2006) formulated the problem of finding the length $a(n)$ of the longest common subsequence between the $n$'th Thue-Morse word and its bitwise complement. By bitwise complement we mean replacing $0$ with $1$ and $1$ with $0$. This paper primarily studies $a(n)$ (sequence A297618 in the Online Encyclopedia of Integer Sequences (Sloane)). Since the Thue-Morse words are the prefixes of length $2^n$, for some $n$, of the Thue-Morse sequence, a natural generalization is to consider prefixes of other lengths. This paper also studies $b(n)$, the length of the longest common subsequence between the length-$n$ prefix of the Thue-Morse sequence and its bitwise complement (sequence A320847).

###### Example 1.0.

The first few values of $a(n)$ and $b(n)$ are:

| $n$ | $a(n)$ | $b(n)$ |
|---|---|---|
| 1 | 1 | 0 |
| 2 | 2 | 1 |
| 3 | 5 | 1 |
| 4 | 12 | 2 |
| 5 | 26 | 3 |
| 6 | 54 | 4 |
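The small values above can be reproduced by brute force. The following is a minimal sketch, not part of the paper's argument: it assumes the bit-count characterization of the Thue-Morse sequence (stated later as Proposition 2.8) and the identity $a(n) = b(2^n)$, and uses the standard quadratic dynamic program for the longest common subsequence. The function names are illustrative.

```python
def thue_morse_prefix(n):
    """First n symbols of the Thue-Morse sequence (parity of the 1-bits of i)."""
    return "".join(str(bin(i).count("1") % 2) for i in range(n))


def complement(word):
    """Bitwise complement: swap 0s and 1s."""
    return word.translate(str.maketrans("01", "10"))


def lcs_length(u, v):
    """Length of a longest common subsequence, row-by-row dynamic program."""
    prev = [0] * (len(v) + 1)
    for cu in u:
        cur = [0]
        for j, cv in enumerate(v, start=1):
            if cu == cv:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def b(n):
    """LCS length of the length-n prefix and its complement."""
    w = thue_morse_prefix(n)
    return lcs_length(w, complement(w))


def a(n):
    """LCS length of the n'th Thue-Morse word and its complement."""
    return b(2 ** n)


print([a(n) for n in range(1, 5)])
print([b(n) for n in range(1, 7)])
```

The dynamic program takes time quadratic in the word length, so this approach is only practical for small $n$; the word for $a(n)$ already has length $2^n$.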

To show a lower bound for $a(n)$, it suffices to construct a common subsequence of the Thue-Morse words and their bitwise complements. This is what is done in this paper, using the symmetries of the sequence. In particular, we provide a recursive construction for such a common subsequence, which has length at least $2^n(1-o(1))$.

This new lower bound is interesting as it means that $a(n)/2^n$ goes to $1$; that is, when $n$ grows large the longest common subsequence omits only a vanishingly small fraction of symbols.

## 2 Setup

There are many equivalent definitions of the Thue-Morse sequence and Thue-Morse words. We will define them using morphisms.

###### Definition 2.0.

A morphism over an alphabet $\Sigma$ is a function $h : \Sigma^* \to \Sigma^*$ that satisfies $h(xy) = h(x)h(y)$ (concatenation) for all $x, y \in \Sigma^*$. Note that this means $h$ is uniquely defined by its behaviour on $\Sigma$.

###### Definition 2.0.

Let $\mu$ denote the morphism on $\{0,1\}$ defined by $\mu(0) = 01$ and $\mu(1) = 10$.

There are some basic properties that follow directly from the definition.

###### Proposition 2.4.

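If $x, y \in \{0,1\}^*$ and $n \ge 0$ then

1. $\mu^n(\overline{x}) = \overline{\mu^n(x)}$, where $\overline{x}$ denotes taking the bitwise complement of $x$ (i.e., swapping 0s and 1s).

2. $\mu^n(xy) = \mu^n(x)\mu^n(y)$.

3. $|\mu^n(x)| = 2^n |x|$.

4. $\mu^{n+1}(0) = \mu^n(0)\mu^n(1)$ and $\mu^{n+1}(1) = \mu^n(1)\mu^n(0)$.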

###### Proof.

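(i) follows from the symmetry (between $0$ and $1$) in the definition of $\mu$. (ii) holds for all morphisms. (iii) follows from an induction argument since $|\mu(x)| = 2|x|$ for every binary string $x$. (iv) can be seen from $\mu^{n+1}(0) = \mu^n(\mu(0)) = \mu^n(01) = \mu^n(0)\mu^n(1)$, and symmetrically for $\mu^{n+1}(1)$. ∎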
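The four properties are easy to check numerically on small inputs. The sketch below is only a sanity check, not a proof: it assumes $\mu$ is the morphism $0 \mapsto 01$, $1 \mapsto 10$, and the helper names `mu`, `mu_power` and `complement` are illustrative.

```python
def mu(word):
    """Apply the morphism 0 -> 01, 1 -> 10 to every symbol of word."""
    return "".join("01" if c == "0" else "10" for c in word)


def mu_power(n, word):
    """Apply mu to word n times."""
    for _ in range(n):
        word = mu(word)
    return word


def complement(word):
    """Bitwise complement: swap 0s and 1s."""
    return word.translate(str.maketrans("01", "10"))


x, y = "0110", "10"  # arbitrary binary test strings
for n in range(6):
    # (i) mu^n commutes with taking complements
    assert mu_power(n, complement(x)) == complement(mu_power(n, x))
    # (ii) mu^n respects concatenation
    assert mu_power(n, x + y) == mu_power(n, x) + mu_power(n, y)
    # (iii) each application of mu doubles the length
    assert len(mu_power(n, x)) == 2 ** n * len(x)
    # (iv) recurrence for the Thue-Morse words
    assert mu_power(n + 1, "0") == mu_power(n, "0") + mu_power(n, "1")
    assert mu_power(n + 1, "1") == mu_power(n, "1") + mu_power(n, "0")
```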

###### Definition 2.0.

We call $\mu^n(0)$ the $n$'th Thue-Morse word. We also say the Thue-Morse sequence, denoted by $\mathbf{t}$, is the unique fixed point of $\mu$ (extended to the domain of infinite binary strings) beginning with a $0$. See Allouche et al. (1999) for why such a fixed point exists and is unique.

###### Definition 2.0.

Denote by $a(n)$ the length of the longest common subsequence of $\mu^n(0)$ and $\mu^n(1)$. Similarly, denote by $b(n)$ the length of the longest common subsequence of the prefix of length $n$ of the Thue-Morse sequence and its bitwise complement.

###### Example 2.0.

The first few Thue-Morse words are

$$\mu^0(0) = 0, \quad \mu^1(0) = 01, \quad \mu^2(0) = 0110, \quad \text{and} \quad \mu^3(0) = 01101001.$$

The Thue-Morse sequence starts as follows: $\mathbf{t} = 0110100110010110\cdots$.

###### Remark.

The Thue-Morse words are sometimes defined by the recurrence relation in Proposition 2.4 part (iv), and then the Thue-Morse sequence as the infinite application of this rule. We see that the $n$'th Thue-Morse word is the prefix of length $2^n$ of the Thue-Morse sequence. This also means that $a(n) = b(2^n)$.

We also need the following proposition, for which the proof can be found in (Allouche et al., 1999).

###### Proposition 2.8.

If $t_0, t_1, t_2, \ldots$ are the symbols of the Thue-Morse sequence we have $t_{2n} = t_n$ and $t_{2n+1} = \overline{t_n}$ for all $n \ge 0$. Moreover, $t_n$ equals the parity of the number of “1” bits in the binary representation of $n$.

###### Corollary 2.9.

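The $2i$'th digit of $\mu^n(0)$ is the same as the $(2i+1)$'th digit of $\mu^n(1)$ (where we use zero-indexing).

###### Proof.

The $2i$'th digit of $\mu^n(0)$ is $t_{2i} = t_i$, and the $(2i+1)$'th digit of $\mu^n(1)$ is $\overline{t_{2i+1}} = \overline{\overline{t_i}} = t_i$, by the above proposition. ∎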
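As a quick numerical illustration of the proposition and its corollary, the sketch below takes the bit-parity description as the definition of $t_i$ and checks the two recurrences and the digit identity on the word $\mu^6(0)$. The choice $n = 6$ and the helper names are arbitrary assumptions made for the example.

```python
def mu_power(n, word):
    """Apply the morphism 0 -> 01, 1 -> 10 to word n times."""
    for _ in range(n):
        word = "".join("01" if c == "0" else "10" for c in word)
    return word


def t(i):
    """i'th Thue-Morse symbol as the parity of the 1-bits of i."""
    return bin(i).count("1") % 2


n = 6
X = mu_power(n, "0")  # the n'th Thue-Morse word
Y = mu_power(n, "1")  # its bitwise complement

# The n'th Thue-Morse word agrees with the bit-parity description.
assert X == "".join(str(t(i)) for i in range(2 ** n))

for i in range(2 ** (n - 1)):
    # Recurrences t_{2i} = t_i and t_{2i+1} = complement of t_i.
    assert t(2 * i) == t(i)
    assert t(2 * i + 1) == 1 - t(i)
    # Corollary: digit 2i of mu^n(0) equals digit 2i+1 of mu^n(1).
    assert X[2 * i] == Y[2 * i + 1]
```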

## 3 Construction of a common subsequence

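We are now ready for a construction of a common subsequence between $\mu^n(0)$ and $\mu^n(1)$ when $n = 2^k$ is a power of $2$. We call this common subsequence $CS(k)$, and define it recursively.

When $k = 0$, we let $CS(0) = 0$, a subsequence of $\mu^1(0) = 01$ and $\mu^1(1) = 10$.

For $k \ge 1$, $CS(k)$ will be defined recursively as follows.

Let $n = 2^k$ and $m = 2^{k-1}$. Say $X = \mu^n(0)$ and $Y = \mu^n(1)$, that is, we are constructing $CS(k)$ as a common subsequence of $X$ and $Y$. Write $X$ and $Y$ as concatenations of $2^m$ blocks of size $2^m$ (this is possible since $|X| = |Y| = 2^n = 2^m \cdot 2^m$), say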

$$X = x_0 x_1 \cdots x_{2^m-1}, \qquad Y = y_0 y_1 \cdots y_{2^m-1}.$$

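Since $X = \mu^m(\mu^m(0))$, each $x_i$ is one of $\mu^m(0)$ or $\mu^m(1)$. Similarly each $y_i$ is one of $\mu^m(0)$ or $\mu^m(1)$. It is also worth noting that $x_i = \mu^m(0)$ if the $i$'th digit of $\mu^m(0)$ is $0$, and similarly $y_i = \mu^m(0)$ if the $i$'th digit of $\mu^m(1)$ is $0$.

Now we compare $x_i$ to $y_{i+1}$ for $0 \le i < 2^m - 1$, and find a common subsequence $c_i$ between them.

• When $i$ is even, $x_i = y_{i+1}$ by Corollary 2.9, so we take $c_i = x_i$.

• When $i$ is odd, either $x_i$ and $y_{i+1}$ are the same, or one is $\mu^m(0)$ and the other is $\mu^m(1)$. If they are the same we take $c_i = x_i$, otherwise $c_i = CS(k-1)$.

We then let $CS(k)$ be the concatenation of the $c_i$'s.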
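The recursive construction can be written down directly. The following is a minimal sketch under the reading given above (blocks of size $2^m$ with $m = 2^{k-1}$, block $x_i$ matched against block $y_{i+1}$, and the last block of $X$ left unmatched); the function names are illustrative and the code is only practical for small $k$, since the words have length $2^{2^k}$.

```python
def mu_power(n, word):
    """Apply the morphism 0 -> 01, 1 -> 10 to word n times."""
    for _ in range(n):
        word = "".join("01" if c == "0" else "10" for c in word)
    return word


def is_subsequence(s, w):
    """True if s can be obtained from w by deleting symbols."""
    remaining = iter(w)
    return all(c in remaining for c in s)


def cs(k):
    """Common subsequence CS(k) of mu^(2^k)(0) and mu^(2^k)(1)."""
    if k == 0:
        return "0"
    m = 2 ** (k - 1)
    size = 2 ** m  # block size, which is also the number of blocks
    X, Y = mu_power(2 * m, "0"), mu_power(2 * m, "1")
    x = [X[i * size:(i + 1) * size] for i in range(size)]
    y = [Y[i * size:(i + 1) * size] for i in range(size)]
    previous = cs(k - 1)
    # Equal blocks are taken whole; complementary blocks fall back to CS(k-1).
    parts = [x[i] if x[i] == y[i + 1] else previous for i in range(size - 1)]
    return "".join(parts)


for k in range(4):
    word0, word1 = mu_power(2 ** k, "0"), mu_power(2 ** k, "1")
    assert is_subsequence(cs(k), word0) and is_subsequence(cs(k), word1)
    print(k, len(cs(k)), len(word0))
```

The same string $CS(k-1)$ is used whether the pair of blocks is $(\mu^m(0), \mu^m(1))$ or $(\mu^m(1), \mu^m(0))$, since being a common subsequence is symmetric in the two words.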

###### Example 3.0.

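The common subsequences $CS(0)$, $CS(1)$ and $CS(2)$ are underlined below:

$$
\begin{array}{lrl}
CS(0): & \mu^1(0) = & \underline{0}\,1 \\
 & \mu^1(1) = & 1\,\underline{0} \\
CS(1): & \mu^2(0) = & \underline{01}\;\;10 \\
 & \mu^2(1) = & 10\;\;\underline{01} \\
CS(2): & \mu^4(0) = & \underline{0110}\;\;10\underline{01}\;\;\underline{1001}\;\;0110 \\
 & \mu^4(1) = & 1001\;\;\underline{0110}\;\;\underline{01}10\;\;\underline{1001}
\end{array}
$$
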
###### Remark.

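$CS(k)$ is not necessarily the longest common subsequence. For example

$$
\begin{array}{rl}
\mu^4(0) = & \underline{011}0\;\;\underline{1001}\;\;\underline{100}1\;\;\underline{01}10 \\
\mu^4(1) = & 1\underline{0}0\underline{1}\;\;0\underline{110}\;\;\underline{0110}\;\;1\underline{001}
\end{array}
$$

is the longest common subsequence between $\mu^4(0)$ and $\mu^4(1)$, which has length $12$, while $|CS(2)| = 10$.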

## 4 Analysis of length

In this section we analyse the length of the common subsequence constructed in the previous section.

###### Definition 4.0.

For an integer $k \ge 0$, let $f(k) = 2^{2^k} - |CS(k)|$ be the number of symbols omitted by the common subsequence $CS(k)$.

###### Remark.

$f(0) = 1$, as $CS(0) = 0$ has length $1$.

When constructing $CS(k+1)$, all the even indexed blocks (of size $2^{2^k}$) in $X$ are chosen to be in $CS(k+1)$. So only the odd indexed blocks can contribute to $f(k+1)$. The last block will be completely omitted, and for the other blocks in odd positions we either miss $f(k)$ symbols if matching with $CS(k)$ recursively, or miss nothing if choosing to include the complete block. This leads us to the following lemma.

###### Lemma 4.12.

For every integer $k \ge 0$,

$$f(k+1) \le 2^{2^k} + \left(2^{2^k-1} - 1\right) f(k).$$

###### Proof.

The last block has size $2^{2^k}$, and there are $2^{2^k-1} - 1$ other odd indexed blocks, and in each we miss at most $f(k)$ symbols. So the lemma follows from the above discussion. ∎

We are now ready to prove an upper bound on $f(k)$.

###### Lemma 4.13.

For every integer $k \ge 0$, $f(k) \le 2^{2^k-k+1} - 2$.

###### Proof.

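We proceed by induction on $k$.

The inequality clearly holds for $k = 0$ since $f(0) = 1 \le 2 = 2^{2^0-0+1} - 2$.

Now suppose the inductive assertion holds for $k = s$, that is $f(s) \le 2^{2^s-s+1} - 2$. Using Lemma 4.12 and the induction hypothesis we have

$$
\begin{aligned}
f(s+1) &\le 2^{2^s} + \left(2^{2^s-1} - 1\right) f(s) \\
&\le 2^{2^s} + \left(2^{2^s-1} - 1\right)\left(2^{2^s-s+1} - 2\right) \\
&= 2^{2^s} + 2^{2^s-1+2^s-s+1} - 2^{2^s-1}\cdot 2 - 2^{2^s-s+1} + 2 \\
&= 2^{2^{s+1}-(s+1)+1} - 2^{2^s-s+1} + 2.
\end{aligned}
$$

Note that $2^{2^s-s+1} \ge 4$ for all integers $s \ge 0$, since $2^s - s \ge 1$ for all integers $s \ge 0$. Thus

$$
\begin{aligned}
f(s+1) &\le 2^{2^{s+1}-(s+1)+1} - 2^{2^s-s+1} + 2 \\
&\le 2^{2^{s+1}-(s+1)+1} - 2.
\end{aligned}
$$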

This concludes the induction proof. ∎
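The induction can be sanity-checked numerically by iterating the recurrence of Lemma 4.12 with equality, which gives the worst case the lemma allows. This is a sketch only: the starting value $1$ corresponds to $f(0) = 1$, and the name `worst_case_omissions` is illustrative.

```python
def worst_case_omissions(max_k):
    """Iterate g(k+1) = 2^(2^k) + (2^(2^k - 1) - 1) * g(k) from g(0) = 1."""
    g = [1]
    for k in range(max_k):
        g.append(2 ** (2 ** k) + (2 ** (2 ** k - 1) - 1) * g[k])
    return g


def lemma_bound(k):
    """Right-hand side of Lemma 4.13: 2^(2^k - k + 1) - 2."""
    return 2 ** (2 ** k - k + 1) - 2


g = worst_case_omissions(7)
for k, value in enumerate(g):
    # Any f satisfying Lemma 4.12 is at most g, so this checks the lemma's bound.
    assert value <= lemma_bound(k)
print(g[:5])
```

Python integers have arbitrary precision, so the doubly exponential values cause no overflow for the small range of $k$ used here.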

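By Lemma 4.13 it follows that $f(k) \le 2^{2^k-k+1}$ for all $k \ge 0$. This means that the length of our constructed common subsequence of $\mu^n(0)$ and $\mu^n(1)$, where $n = 2^k$, must be at least $2^{2^k} - 2^{2^k-k+1}$. This proves the following theorem.

###### Theorem 4.14.

For $k \ge 0$ and $n = 2^k$:

$$|CS(k)| \ge 2^n\left(1 - \frac{1}{n/2}\right) = 2^{2^k}\left(1 - \frac{1}{2^{k-1}}\right).$$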

## 5 Extension to all n

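Up to this point we have only considered the common subsequence of $\mu^n(0)$ and $\mu^n(1)$ where $n = 2^k$ for some $k$. We wish to extend our construction to work for arbitrary $n$.

If $n \ge 1$, then say $2^k \le n < 2^{k+1}$ for some integer $k \ge 0$. Write

$$\mu^n(0) = \mu^{n-2^k}\left(\mu^{2^k}(0)\right), \qquad \mu^n(1) = \mu^{n-2^k}\left(\mu^{2^k}(1)\right).$$

Since powers of $\mu$ commute, this is saying that $\mu^n(0)$ (and likewise $\mu^n(1)$) can be written as $2^{n-2^k}$ blocks, where each block is either $\mu^{2^k}(0)$ or $\mu^{2^k}(1)$, and corresponding blocks of $\mu^n(0)$ and $\mu^n(1)$ are complements of each other. We can concatenate $2^{n-2^k}$ copies of the subsequence $CS(k)$ to obtain a common subsequence of $\mu^n(0)$ and $\mu^n(1)$, i.e., we use our previous construction for each of the blocks independently. Using Theorem 4.14 we see that the length of this common subsequence is at least $2^{n-2^k} \cdot 2^{2^k}\left(1 - \frac{1}{2^{k-1}}\right) \ge 2^n\left(1 - \frac{1}{n/4}\right)$, since $2^k > n/2$ by choice of $k$. We thus get a similar result as Theorem 4.14 for arbitrary $n$.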

###### Theorem 5.15.

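For every $n \ge 1$, there exists a common subsequence between $\mu^n(0)$ and $\mu^n(1)$ with length at least

$$2^n\left(1 - \frac{1}{n/4}\right).$$

###### Corollary 5.16.

$a(n) = 2^n(1-o(1))$, or more generally $a(n) = 2^n\left(1 - O\left(\frac{1}{n}\right)\right)$.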

We can generalize the result further to all prefixes of the Thue-Morse sequence. Let $T_n$ be the prefix of length $n$ of the Thue-Morse sequence, and $\overline{T_n}$ its bitwise complement. Based on the binary representation of the number $n$, $T_n$ and $\overline{T_n}$ can be split up into at most $\lfloor \log_2(n) \rfloor + 1$ blocks, each with a size which is a power of $2$. We will assume the blocks are in order of decreasing size, so that a block of size $2^k$ is either $\mu^k(0)$ or $\mu^k(1)$. Then common subsequences satisfying the inequality in Theorem 5.15 for these blocks can be concatenated to form a common subsequence between $T_n$ and $\overline{T_n}$. To bound the length of this common subsequence we use the following lemma:

###### Lemma 5.17.

$\sum_{k=1}^{s} \frac{2^k}{k} \le \frac{2^{s+2}}{s} - 1$ for all $s \ge 1$.

###### Proof.

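We prove the inequality by induction on $s$.

For $s = 1$ we have $2 \le 7$, and for $s = 2$ we have $4 \le 7$.

Now suppose $s \ge 2$ and $\sum_{k=1}^{s} \frac{2^k}{k} \le \frac{2^{s+2}}{s} - 1$. This means that

$$\sum_{k=1}^{s+1} \frac{2^k}{k} = \sum_{k=1}^{s} \frac{2^k}{k} + \frac{2^{s+1}}{s+1} \le \frac{2^{s+2}}{s} - 1 + \frac{2^{s+1}}{s+1} = \frac{2^{s+1}(3s+2)}{s(s+1)} - 1 \le \frac{2^{s+1}(4s)}{s(s+1)} - 1 = \frac{2^{s+3}}{s+1} - 1,$$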

which concludes the induction proof. ∎
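The inequality of the lemma, $\sum_{k=1}^{s} 2^k/k \le 2^{s+2}/s - 1$, can be checked with exact rational arithmetic for a range of $s$. This is a sketch for illustration only; the cut-off $s \le 60$ is an arbitrary choice, and the names `lhs` and `rhs` are illustrative.

```python
from fractions import Fraction


def lhs(s):
    """Left-hand side: sum of 2^k / k for k = 1, ..., s."""
    return sum(Fraction(2 ** k, k) for k in range(1, s + 1))


def rhs(s):
    """Right-hand side: 2^(s+2) / s - 1."""
    return Fraction(2 ** (s + 2), s) - 1


for s in range(1, 61):
    assert lhs(s) <= rhs(s)

# The two base cases used in the proof.
print(lhs(1), rhs(1))
print(lhs(2), rhs(2))
```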

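Now we continue to analyse the common subsequence between the prefix $T_n$ and its complement $\overline{T_n}$. This subsequence omits at most $\frac{2^{k+2}}{k}$ symbols for the block of size $2^k$ (by Theorem 5.15). There is at most one block of size $2^k$ for each $k$. The potential block of size $1$ will miss at most one symbol. Hence at most

$$1 + \sum_{k=1}^{\lfloor \log_2(n) \rfloor} \frac{2^{k+2}}{k} = 1 + 4\sum_{k=1}^{\lfloor \log_2(n) \rfloor} \frac{2^k}{k}$$

symbols are omitted, which by Lemma 5.17 is at most

$$1 + 4\left(\frac{2^{\lfloor \log_2(n) \rfloor + 2}}{\lfloor \log_2(n) \rfloor} - 1\right) = \frac{2^{\lfloor \log_2(n) \rfloor + 4}}{\lfloor \log_2(n) \rfloor} - 3 \le \frac{n}{\lfloor \log_2(n) \rfloor / 16}.$$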

This proves the following theorem.

###### Theorem 5.18.

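For all $n \ge 2$, there exists a common subsequence between the length-$n$ prefix $T_n$ of the Thue-Morse sequence and its bitwise complement $\overline{T_n}$ with length at least

$$n\left(1 - \frac{1}{\lfloor \log_2(n) \rfloor / 16}\right).$$

###### Corollary 5.19.

$b(n) = n(1-o(1))$, or more generally $b(n) = n\left(1 - O\left(\frac{1}{\log n}\right)\right)$.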
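The block decomposition behind this theorem can be illustrated in code. The sketch below splits the length-$n$ prefix according to the binary representation of $n$, largest block first, and checks that every block of size $2^k$ is one of $\mu^k(0)$ or $\mu^k(1)$. The helper names and the tested range $n \le 300$ are assumptions made for the example.

```python
def mu_power(n, word):
    """Apply the morphism 0 -> 01, 1 -> 10 to word n times."""
    for _ in range(n):
        word = "".join("01" if c == "0" else "10" for c in word)
    return word


def thue_morse_prefix(n):
    """First n symbols of the Thue-Morse sequence (parity of the 1-bits of i)."""
    return "".join(str(bin(i).count("1") % 2) for i in range(n))


def split_into_blocks(n):
    """Split the length-n prefix into blocks of sizes 2^k, one per 1-bit of n."""
    prefix = thue_morse_prefix(n)
    blocks, pos = [], 0
    for k in reversed(range(n.bit_length())):
        if (n >> k) & 1:
            blocks.append((k, prefix[pos:pos + 2 ** k]))
            pos += 2 ** k
    return blocks


for n in range(1, 301):
    blocks = split_into_blocks(n)
    # At most floor(log2 n) + 1 blocks, which equals n.bit_length().
    assert len(blocks) <= n.bit_length()
    for k, block in blocks:
        assert block in (mu_power(k, "0"), mu_power(k, "1"))

print(split_into_blocks(6))
```

Ordering the blocks by decreasing size matters: each block then starts at a position divisible by its own size, which is what makes it a Thue-Morse word or its complement.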

## 6 Strengthening the analysis

The constructed common subsequence $CS(k)$, and the generalizations in the previous section, do in fact have a slightly better asymptotic behaviour than what was proven.

The length analysis was based on Lemma 4.12, which states that $f(k+1) \le 2^{2^k} + \left(2^{2^k-1} - 1\right) f(k)$. This inequality is only tight when $x_i \ne y_{i+1}$ for all odd $i$, using the same notation as in Section 3. However, we can get a better bound on $f(k+1)$ in terms of $f(k)$ by estimating how many of the blocks $x_i$ and $y_{i+1}$ are equal for odd $i$.

###### Lemma 6.20.

If $t_0, t_1, t_2, \ldots$ are the digits of the Thue-Morse sequence, then $t_n = t_{n+1}$ if and only if $n$ written in binary ends with a block of $1$'s with odd length.

###### Proof.

We use Proposition 2.8. $t_n = t_{n+1}$ if and only if $n$ and $n+1$ have the same number of “$1$” bits modulo 2, when written in binary. This condition is equivalent to $n$ ending with a block of $1$'s of odd length when written in binary. ∎

###### Lemma 6.21.

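Let $eq(n) = \left|\{\, i : 0 \le i < 2^n - 1,\ t_i = t_{i+1} \,\}\right|$. Then

$$eq(n) = \begin{cases} \frac{1}{3}(2^n - 1) & \text{if } n \text{ is even} \\ \frac{1}{3}(2^n - 2) & \text{if } n \text{ is odd.} \end{cases}$$

###### Proof.

For a fixed $n$, we count how many $n$-bit numbers (except $2^n - 1$) end with a block of $1$'s of odd length. We can fix the $n$-bit number to end with a “$0$” followed by $2j+1$ “$1$”s, for different values of $j \ge 0$, which leaves $n - 2j - 2$ free bits. This works as we do not want to count $2^n - 1$, which is the $n$-bit binary number with all “1”s.

So if $n$ is even, $eq(n) = \sum_{j=0}^{n/2-1} 2^{n-2j-2} = \frac{1}{3}(2^n - 1)$.

If $n$ is odd, then $eq(n) = \sum_{j=0}^{(n-3)/2} 2^{n-2j-2} = \frac{1}{3}(2^n - 2)$. ∎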
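Both lemmas can be checked by brute force for small $n$. The sketch below assumes $eq(n)$ counts the indices $0 \le i < 2^n - 1$ with $t_i = t_{i+1}$, and computes $t_i$ as the parity of the 1-bits of $i$; the function names are illustrative.

```python
def t(i):
    """i'th Thue-Morse symbol as the parity of the 1-bits of i."""
    return bin(i).count("1") % 2


def ends_with_odd_block_of_ones(i):
    """True if the binary expansion of i ends in an odd number of 1s."""
    trailing = 0
    while i & 1:
        trailing += 1
        i >>= 1
    return trailing % 2 == 1


def eq_bruteforce(n):
    """Count indices 0 <= i < 2^n - 1 with t_i == t_{i+1}."""
    return sum(1 for i in range(2 ** n - 1) if t(i) == t(i + 1))


def eq_closed_form(n):
    """Closed form: (2^n - 1)/3 for even n, (2^n - 2)/3 for odd n."""
    return (2 ** n - 1) // 3 if n % 2 == 0 else (2 ** n - 2) // 3


# Lemma 6.20: equal neighbours correspond to an odd trailing block of 1s.
for i in range(4096):
    assert (t(i) == t(i + 1)) == ends_with_odd_block_of_ones(i)

# Lemma 6.21: the brute-force count agrees with the closed form.
for n in range(1, 13):
    assert eq_bruteforce(n) == eq_closed_form(n)
```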

By Proposition 2.8 we see that

$$x_{2i+1} = y_{2i+2} \iff t_{2i+1} = \overline{t_{2i+2}} \iff \overline{t_i} = \overline{t_{i+1}} \iff t_i = t_{i+1}.$$

By Lemma 6.21 we thus know that when constructing $CS(k+1)$, exactly $eq(2^k-1)$ of the odd indexed blocks will already be equal. Hence exactly $2^{2^k-1} - 1 - eq(2^k-1)$ of the pairs will need to be recursively matched using $CS(k)$. This leads to the following improved version of Lemma 4.12:

###### Lemma 6.22.

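For every integer $k \ge 1$,

$$f(k+1) = 2^{2^k} + \left(2^{2^k-1} - 1 - eq(2^k-1)\right) f(k) = 2^{2^k} + \left(2^{2^k-1} - 1 - \tfrac{1}{3}\left(2^{2^k-1} - 2\right)\right) f(k).$$

###### Corollary 6.23.

Let $w = \log_2 3$. For every integer $k \ge 1$,

$$f(k+1) \le 2^{2^k} + 2^{2^k-w} f(k).$$

###### Proof.

If $k \ge 1$, we have by the lemma

$$f(k+1) = 2^{2^k} + \left(2^{2^k-1} - 1 - \tfrac{1}{3}\left(2^{2^k-1} - 2\right)\right) f(k) \le 2^{2^k} + \tfrac{2}{3}\, 2^{2^k-1} f(k) = 2^{2^k} + 2^{2^k-w} f(k).$$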

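By a similar induction proof as in Lemma 4.13 we get a new upper bound on $f(k)$.

###### Theorem 6.24.

Let $w = \log_2 3$. For every integer $k \ge 1$, $f(k) \le 2^{2^k-wk+3} - 6$.

###### Proof.

We proceed by induction on $k$.

It is easy to verify that the inequality holds for $k = 1$ and $k = 2$.

Now suppose the inductive assertion holds for $k = s \ge 2$, that is $f(s) \le 2^{2^s-ws+3} - 6$. Using Corollary 6.23 and the induction hypothesis we have

$$
\begin{aligned}
f(s+1) &\le 2^{2^s} + 2^{2^s-w} f(s) \\
&\le 2^{2^s} + 2^{2^s-w}\left(2^{2^s-ws+3} - 6\right) \\
&= 2^{2^s} + 2^{2^s-w+2^s-ws+3} - 2 \cdot 2^{2^s} \\
&= 2^{2^{s+1}-w(s+1)+3} - 2^{2^s} \\
&\le 2^{2^{s+1}-w(s+1)+3} - 6
\end{aligned}
$$

since $2^{2^s} \ge 6$ when $s \ge 2$. This concludes the induction proof. ∎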
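The improved bound can be compared against the exact values of $f(k)$ obtained from the recurrence of Lemma 6.22. This sketch assumes the starting value $f(1) = 2$ (from $|CS(1)| = 2$ in Example 3.0) and uses $2^{-wk} = 3^{-k}$ for $w = \log_2 3$, so that the bound $2^{2^k - wk + 3} - 6$ becomes the exact rational $8 \cdot 2^{2^k} / 3^k - 6$. The names are illustrative.

```python
from fractions import Fraction


def omitted_counts(max_k):
    """Exact f(k) for 1 <= k <= max_k via the recurrence of Lemma 6.22."""
    f = {1: 2}  # assumption: f(1) = 2^2 - |CS(1)| = 4 - 2
    for k in range(1, max_k):
        half = 2 ** (2 ** k - 1)          # number of odd indexed blocks
        equal_blocks = (half - 2) // 3    # eq(2^k - 1), exact since 2^k - 1 is odd
        recursive_blocks = half - 1 - equal_blocks
        f[k + 1] = 2 ** (2 ** k) + recursive_blocks * f[k]
    return f


def improved_bound(k):
    """2^(2^k - w*k + 3) - 6 with w = log2(3), as an exact fraction."""
    return Fraction(8 * 2 ** (2 ** k), 3 ** k) - 6


f = omitted_counts(7)
for k, value in f.items():
    assert value <= improved_bound(k)
print([f[k] for k in range(1, 5)])
```

Working with `Fraction` avoids floating-point rounding in the comparison, which matters because the quantities grow doubly exponentially in $k$.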

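This means that the length of the common subsequence $CS(k)$ is

$$2^{2^k} - f(k) \ge 2^{2^k} - 2^{2^k-wk+3} = 2^{2^k}\left(1 - \frac{1}{2^{wk}/8}\right) = 2^{2^k}\left(1 - \frac{1}{3^k/8}\right).$$

This asymptotic behaviour propagates through the other generalizations, and we obtain slightly better versions of Corollaries 5.16 and 5.19:

$a(n) = 2^n\left(1 - O\left(\frac{1}{n^w}\right)\right)$ and $b(n) = n\left(1 - O\left(\frac{1}{\log^w n}\right)\right)$, where $w = \log_2 3$.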

## 7 Acknowledgment

I thank Jeffrey Shallit for telling me about the problem.

## References

• Allouche et al. (1999) J.-P. Allouche, J. Shallit, The ubiquitous Prouhet-Thue-Morse sequence. Sequences and Their Applications: Proceedings of SETA ’98, Springer-Verlag, 1999, pp. 1-16
• Jean Berstel (2006) Jean Berstel, Combinatorics on Words: Examples and Problems, 2006.
• Sloane N. J. A. Sloane, Online Encyclopedia of Integer Sequences.