# Generalized de Bruijn words and the state complexity of conjugate sets

We consider a certain natural generalization of de Bruijn words, and use it to compute the exact maximum state complexity for the language consisting of the conjugates of a single word.

## 1 Introduction

Let $x, y$ be words. We say $x$ and $y$ are conjugates if one is a cyclic shift of the other; equivalently, if there exist words $u, v$ such that $x = uv$ and $y = vu$. For example, the English words listen and enlist are conjugates.

The set of all conjugates of a word $w$ is denoted by $C(w)$. Thus, for example, $C(\mathtt{eat}) = \{\mathtt{eat}, \mathtt{tea}, \mathtt{ate}\}$. We also write $C(L)$ for the set of all conjugates of elements of the language $L$.
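The definition of conjugacy can be made concrete with a minimal Python sketch. The function names `conjugates` and `are_conjugates` are illustrative choices, not notation from the paper.

```python
def conjugates(w: str) -> set[str]:
    """Return the set of all cyclic shifts (conjugates) of the word w."""
    if not w:
        return {w}
    return {w[i:] + w[:i] for i in range(len(w))}


def are_conjugates(x: str, y: str) -> bool:
    """x and y are conjugates iff x = uv and y = vu for some words u, v."""
    return len(x) == len(y) and y in conjugates(x)


print(sorted(conjugates("eat")))            # ['ate', 'eat', 'tea']
print(are_conjugates("listen", "enlist"))   # True
```

Using a set means that a periodic word such as `abab` yields fewer than `len(w)` conjugates, which matters later when counting the words of the conjugate language.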

For a regular language $L$ let $\mathrm{sc}(L)$ denote the state complexity of $L$: the number of states in the smallest complete DFA accepting $L$. State complexity is sometimes also called quotient complexity [3]. The state complexity of the cyclic shift operation for arbitrary regular languages was studied in Maslov’s pioneering 1970 paper [17]. More recently, Jirásková and Okhotin [14] improved Maslov’s bound, and Jirásek and Jirásková studied the state complexity of the conjugates of prefix-free languages [13].

In this note we investigate the state complexity of the finite language $C(w)$, over all words $w$ of length $n$. Clearly $\mathrm{sc}(C(w))$ achieves its minimum, namely $n+2$, at words of the form $a^n$. By considering random words, it seems likely that $\mathrm{sc}(C(w))$ can grow quadratically in $n$.

Our main result makes this precise:

###### Theorem 1.

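Let $\Sigma_k$ be an alphabet of cardinality $k \geq 2$, and let $n \geq 1$ be an integer. Define $i = \lfloor \log_k n \rfloor$. Then

$$\max_{w \in \Sigma_k^n} \mathrm{sc}(C(w)) = 2r + n(n - 2i - 1) + 1,$$

where $r = (k^{i+1} - 1)/(k - 1)$.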

Furthermore, we characterize those words achieving this maximum.

Our theorem depends on a certain natural generalization of de Bruijn words, of independent interest, which is introduced in the next section.
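As a quick sanity check, the closed form can be evaluated directly. The sketch below uses the expression $2r + n(n-2i-1) + 1$ exactly as it appears in Corollary 9 and Theorem 10, with $i = \lfloor \log_k n \rfloor$ and $r = (k^{i+1}-1)/(k-1)$; the helper names `floor_log` and `max_state_complexity` are hypothetical.

```python
def floor_log(k: int, n: int) -> int:
    """Largest i with k**i <= n, using integers only to avoid float rounding."""
    i = 0
    while k ** (i + 1) <= n:
        i += 1
    return i


def max_state_complexity(k: int, n: int) -> int:
    """Evaluate 2r + n(n - 2i - 1) + 1 for alphabet size k >= 2 and length n >= 1."""
    i = floor_log(k, n)
    r = (k ** (i + 1) - 1) // (k - 1)   # nodes in a complete k-ary tree with i+1 levels
    return 2 * r + n * (n - 2 * i - 1) + 1


print([max_state_complexity(2, n) for n in range(1, 9)])
# [3, 5, 7, 11, 15, 21, 29, 39]
```

For $n = 1$ the value is 3, matching the three states (initial, accepting, dead) needed for a one-letter word.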

## 2 Generalized de Bruijn words

De Bruijn words (also called de Bruijn sequences) have a long history [8, 16, 10, 4, 5], and have been extremely well studied [9, 18]. Let $\Sigma_k$ denote the $k$-letter alphabet $\{0, 1, \ldots, k-1\}$. Traditionally, there are two distinct ways of thinking about these words: for integers $k \geq 2$, $n \geq 1$, they are

• the words $w$ of length $k^n + n - 1$ having each word of length $n$ over $\Sigma_k$ exactly once as a factor; or

• the words $w$ of length $k^n$ having each word of length $n$ over $\Sigma_k$ exactly once as a factor, when $w$ is considered as a “circular word”, or “necklace”, where the word “wraps around” at the end back to the beginning.

For example, for $k = 2$ and $n = 4$, the word

 0000111101100101000

is an example of the first interpretation and

 0000111101100101

is an example of the second.

In this paper, we are concerned with the second (circular) interpretation of de Bruijn words. According to the definition, such words exist only for lengths of the form $k^n$. Is there a sensible way to generalize this class of words so that one could speak fruitfully of (generalized) de Bruijn words of every length?

One natural way to do so is to use the notion of subword complexity (also called factor complexity or just complexity). For $0 \leq i \leq N$ let $\gamma_i(w)$ denote the number of distinct length-$i$ factors of the word $w$ (considered circularly). For all words $w$ of length $N$ over $\Sigma_k$, there is a natural upper bound on $\gamma_i(w)$ for $0 \leq i \leq N$, as follows:

$$\gamma_i(w) \leq \min(k^i, N). \tag{1}$$

This is immediate, since there are at most $k^i$ words of length $i$ over $\Sigma_k$, and there are at most $N$ positions where a word could begin in $w$ (considered circularly).

Ordinary de Bruijn words are then precisely those words $w$ of length $k^n$ for which $\gamma_n(w) = k^n$. But even more is true: $w$ also achieves the upper bound in (1) for all $i \leq k^n$. To see this, note that if $i < n$, then every word of length $i$ occurs as a prefix of some word of length $n$, and every word of length $n$ is guaranteed to appear in $w$. On the other hand, all factors of length $i > n$ are distinct, because the length-$n$ prefixes are all distinct.

This motivates the following definition:

###### Definition 2.

A word $w$ of length $N$ over a $k$-letter alphabet is said to be a generalized de Bruijn word if $\gamma_i(w) = \min(k^i, N)$ for $0 \leq i \leq N$.
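The definition translates almost verbatim into code. The following is a minimal sketch, assuming words are Python strings over the digits `0` to `k-1`; the names `gamma` and `is_generalized_de_bruijn` are illustrative.

```python
def gamma(w: str, i: int) -> int:
    """Number of distinct length-i factors of w, read circularly (0 <= i <= len(w))."""
    doubled = w + w                      # doubling handles the wrap-around
    return len({doubled[p:p + i] for p in range(len(w))})


def is_generalized_de_bruijn(w: str, k: int) -> bool:
    """Check that the circular factor complexity meets min(k**i, N) for every i."""
    n = len(w)
    return all(gamma(w, i) == min(k ** i, n) for i in range(n + 1))


# The circular de Bruijn word of order 4 quoted above attains the bound at every length.
print(is_generalized_de_bruijn("0000111101100101", 2))   # True
# Distinct length-4 factors alone are not enough: this word has only 6 factors of length 3.
print(is_generalized_de_bruijn("00001111", 2))           # False
```

The second call shows why the definition asks for the bound at every length rather than at a single one.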

###### Example 3.

Table 1 gives the lexicographically least de Bruijn words for a two-letter alphabet, for lengths to , and the number of such words (counted up to cyclic shift). This forms sequence A317586 in the On-Line Encyclopedia of Integer Sequences (OEIS) [20]. The second author has computed these numbers up to .

The main result of this section is the following.

###### Theorem 4.

For all integers $k \geq 2$ and $N \geq 1$ there exists a generalized de Bruijn word of length $N$ over a $k$-letter alphabet.

###### Proof.

For $k = 2$ the proof can be found in [19], although strangely it is not explicitly stated anywhere in the paper. (Lemma 3 implies it.)

For $k > 2$ we can derive this result from a paper by Lempel [15]. Lempel proved that for all $k \geq 2$, $n \geq 1$, and $1 \leq N \leq k^n$, there exists a circular word of length $N$ over a $k$-letter alphabet for which the factors of size $n$ are distinct. (Also see [11, 6].) However, as stated, this result is not strong enough for our purposes. For example, there are circular words, such as $00001111$ of length $8$, having $8$ distinct factors of length $4$, but only $6$ distinct factors of length $3$. For our purposes, then, we need a stronger version of the result, which can nevertheless be obtained from a further analysis of Lempel’s proof.

An Euler graph is a directed graph in which, for each vertex $v$, the indegree of $v$ is equal to the outdegree of $v$. By a closed chain we mean a sequence of edges $(v_1, v_2)$, $(v_2, v_3)$, $(v_3, v_4)$, …, $(v_t, v_1)$, where each edge is distinct, but vertices may be repeated. Each closed chain forms an Euler graph and each connected Euler graph admits a closed chain containing all its edges.

Let $G$ be the $k$-ary de Bruijn graph of order $n$. This is a directed graph where the vertices are the words of length $n$, and edges join a word $x$ to a word $y$ if $x = az$ and $y = zb$ for some letters $a, b$ and a word $z$. So every vertex of $G$ has $k$ incoming edges, and $k$ outgoing edges, and therefore $G$ is a regular graph of degree $k$. Building a generalized de Bruijn word of length $N$, where $k^n \leq N \leq k^{n+1}$, over a $k$-letter alphabet then amounts to constructing a closed chain of length $N$ in $G$ that visits every vertex.

One of Lempel’s main results ([15, Theorem 1]) states that such a closed chain exists, but does not mention explicitly whether it visits every vertex. In the proof, the chain is obtained by constructing a connected Euler graph using [15, Lemma 6]. Now, the analysis of the proof of [15, Lemma 6] shows that the constructed Euler graph is not only connected (which is the explicit concern of the lemma) but also spanning. The closed chain is eventually obtained as a complement of a graph $H$, where $H$ is an Euler graph contained in $G$ such that the degree of each vertex in $H$ is at most $k-1$. Therefore, its complement is obviously spanning. ∎
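For small parameters the existence statement can also be confirmed by exhaustive search. The sketch below is a brute-force check, not the graph-theoretic construction used in the proof; it assumes the alphabet is written with the digits `0` to `k-1`, and the function names are hypothetical.

```python
from itertools import product


def gamma(w: str, i: int) -> int:
    """Number of distinct length-i circular factors of w."""
    doubled = w + w
    return len({doubled[p:p + i] for p in range(len(w))})


def is_generalized_de_bruijn(w: str, k: int) -> bool:
    return all(gamma(w, i) == min(k ** i, len(w)) for i in range(len(w) + 1))


def least_generalized_de_bruijn(k: int, length: int):
    """Lexicographically least generalized de Bruijn word of this length, or None."""
    alphabet = "0123456789"[:k]
    # product() enumerates tuples in lexicographic order for a sorted alphabet
    for letters in product(alphabet, repeat=length):
        w = "".join(letters)
        if is_generalized_de_bruijn(w, k):
            return w
    return None


for length in range(1, 9):
    print(length, least_generalized_de_bruijn(2, length))
```

The search is exponential in the length, so it is only useful as a check on short words.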

###### Remark 5.

We have not been able to find this precise notion of generalized de Bruijn word in the literature anywhere, although there are some papers that come very close. For example, Iványi [12] considered the analogue of Eq. (1) for ordinary (non-circular) words. He called a word supercomplex if the analogue of the upper bound (1) is attained not only for , but also for all prefixes of . However, binary supercomplex words do not exist past length . The third author also considered the analogue of Eq. (1) for ordinary words [19]. However, Lemma 3 of that paper actually implies the existence of our generalized (circular) de Bruijn words of every length over a binary alphabet, although this was not stated explicitly. Anisiu, Blázsik, and Kása [2] discussed a related concept: namely, those length- words for which where denotes the number of distinct length- factors of (here considered in the ordinary sense, not circularly). Also see [7].

We now turn to an alternative characterization of our generalized de Bruijn words.

###### Proposition 6.

A word $w \in \Sigma_k^N$ is a generalized de Bruijn word iff both of the following hold:

• $\gamma_r(w) = k^r$; and

• $\gamma_{r+1}(w) = N$,

where $r = \lfloor \log_k N \rfloor$.

###### Proof.

A generalized de Bruijn word trivially has these properties, and it is easy to see that the two properties imply the bound in Eq. (1). ∎
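The equivalence of the two characterizations can be checked exhaustively for short words. This is a sketch under the assumption that the two conditions are $\gamma_r(w) = k^r$ and $\gamma_{r+1}(w) = N$ with $r = \lfloor \log_k N \rfloor$; all function names are illustrative.

```python
from itertools import product


def gamma(w: str, i: int) -> int:
    """Number of distinct length-i circular factors of w."""
    doubled = w + w
    return len({doubled[p:p + i] for p in range(len(w))})


def floor_log(k: int, n: int) -> int:
    """Largest r with k**r <= n (integer arithmetic only)."""
    r = 0
    while k ** (r + 1) <= n:
        r += 1
    return r


def by_definition(w: str, k: int) -> bool:
    """Full check: the bound min(k**i, N) is attained at every length i."""
    return all(gamma(w, i) == min(k ** i, len(w)) for i in range(len(w) + 1))


def by_two_conditions(w: str, k: int) -> bool:
    """Short check: only the two critical lengths r and r + 1 are examined."""
    r = floor_log(k, len(w))
    return gamma(w, r) == k ** r and gamma(w, r + 1) == len(w)


def characterizations_agree(k: int, max_length: int) -> bool:
    alphabet = "0123456789"[:k]
    return all(
        by_definition("".join(t), k) == by_two_conditions("".join(t), k)
        for n in range(1, max_length + 1)
        for t in product(alphabet, repeat=n)
    )


print(characterizations_agree(2, 9))   # True
```

The short check needs only two factor counts, which is the practical benefit of the alternative characterization.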

We now count the total number of factors of a generalized de Bruijn word. This is a generalization of Theorem 2 of [19] to all $k \geq 2$, adapted for the case of circular words.

###### Proposition 7.

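If $w$ is a generalized de Bruijn word of length $N$ over $\Sigma_k$, then

$$\sum_{0 \leq i \leq N} \gamma_i(w) = \frac{k^{r+1} - 1}{k - 1} + N(N - r),$$

where $r = \lfloor \log_k N \rfloor$.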

###### Proof.

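We have

$$\sum_{0 \leq i \leq N} \gamma_i(w) = \sum_{0 \leq i \leq N} \min(k^i, N) = \sum_{0 \leq i \leq r} k^i + \sum_{r < i \leq N} N = \frac{k^{r+1} - 1}{k - 1} + N(N - r).$$

∎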
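As a worked instance of this count, take the binary circular de Bruijn word $w = 0000111101100101$ from the earlier example, so that $k = 2$, $N = 16$ and $r = \lfloor \log_2 16 \rfloor = 4$. The lengths $0$ through $4$ contribute powers of $2$, and each of the remaining $12$ lengths contributes $16$:

$$\sum_{0 \leq i \leq 16} \gamma_i(w) = (1 + 2 + 4 + 8 + 16) + 12 \cdot 16 = 31 + 192 = 223 = \frac{2^{5} - 1}{2 - 1} + 16\,(16 - 4).$$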

## 3 State complexity

###### Theorem 8.

Let be an alphabet of cardinality . Suppose , and suppose . Define and . If then .

###### Proof.

A level is a set of nodes at a particular distance from the root. The complete $k$-ary tree of $i+1$ levels therefore corresponds to words of length $\leq i$, and the total number of nodes in this tree is $r = (k^{i+1}-1)/(k-1)$.

The language $L$ can be accepted by a DFA with the following topology: there is a complete $k$-ary tree of $i+1$ levels rooted at the initial state. At the very next level there are at most $m$ nodes, and these nodes form the roots of at most $m$ chains of $n-2i-1$ nodes each. These chains need not be disjoint, but will be in the worst case. At the end, there is another complete $k$-ary tree of $i+1$ levels culminating in a single accepting state. Finally, there is also a single non-accepting state that captures all transitions not yet defined. The total number of states is therefore at most $2r + m(n-2i-1) + 1$.

Define

 X =Σ≤i ∪ {x : i<|x|

More formally, the states of our DFA are , a “dead” state; , for ; and , for all with . The states correspond to prefixes of words of and the states correspond to suffixes of words of .

The initial state is .

The transitions are given by for and and , if and ; for and . All other transitions go to .

Finally, the unique final state is . ∎

This construction is illustrated in Figure 1 for , , , , , , and

 L={000010100000,000101100010,011110100001,100110011111,101011110111,110100100110,110101010011,110110101101,111001100101,111110110100}.
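For the language $L$ listed above, the state count of this topology can be tallied by hand. The values below are read off from the list itself ($10$ binary words of length $12$), and $i = \lfloor \log_2 10 \rfloor = 3$ is an assumption consistent with how $i$ is chosen in Corollary 9:

$$k = 2,\quad n = 12,\quad m = 10,\quad i = 3,\quad r = \frac{2^{4} - 1}{2 - 1} = 15,$$

$$2r + m(n - 2i - 1) + 1 = 2 \cdot 15 + 10 \cdot (12 - 6 - 1) + 1 = 30 + 50 + 1 = 81.$$

The two trees account for $30$ states, the ten chains of five nodes for $50$, and the dead state for the last one. Since three of the listed words share the prefix $1101$, some chains merge in this example and the minimal DFA for $L$ can have fewer than $81$ states.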

As a corollary, we now get an upper bound on $\mathrm{sc}(C(x))$:

###### Corollary 9.

If $x$ is a word of length $n$ over a $k$-letter alphabet, then

$$\mathrm{sc}(C(x)) \leq 2r + n(n - 2i - 1) + 1,$$

where $r = (k^{i+1} - 1)/(k - 1)$ and $i = \lfloor \log_k n \rfloor$.

###### Proof.

Let be a word of length , and let . Then . In Theorem 8 take and . Set and . The inequality holds in all cases except and ; this case can be checked separately. We therefore get , as desired. ∎

It now remains to prove that there exist words that achieve this upper bound. In fact, such words are exactly the generalized de Bruijn words defined in Section 2.

###### Theorem 10.

A length-$n$ word $x$ over a $k$-letter alphabet satisfies

$$\mathrm{sc}(C(x)) = 2r + n(n - 2i - 1) + 1,$$

where $r = (k^{i+1} - 1)/(k - 1)$ and $i = \lfloor \log_k n \rfloor$, iff $x$ is a generalized de Bruijn word.

###### Proof.

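Suppose $x$ is a generalized de Bruijn word. We first show that there are $2r + n(n-2i-1) + 1$ inequivalent words for the Myhill-Nerode equivalence relation associated with $C(x)$. This will show $\mathrm{sc}(C(x)) \geq 2r + n(n-2i-1) + 1$ and hence, by Corollary 9, that $\mathrm{sc}(C(x)) = 2r + n(n-2i-1) + 1$.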

Representatives of the Myhill-Nerode classes can be classified as follows:

(a) all the words of length $\leq i$;

(b) all the factors of conjugates of $x$ of length $j$, for $i < j < n-i$;

(c) for each word $y$ of length $\leq i$, the lexicographically least factor $z$ of $x$ (considered circularly) of length $n - |y|$ for which $zy \in C(x)$;

(d) the single equivalence class corresponding to words that are not prefixes of any word in $C(x)$.

There are $r$ words in (a), $n(n-2i-1)$ words in (b), $r$ words in (c), and one class in (d).

We need to see that these are all inequivalent. Since all the words in $C(x)$ are of length $n$, no two factors of different lengths can be equivalent. It therefore suffices to examine pairs of words of identical length.

In group (a), let $u, v$ be two distinct words of length $j \leq i$. Since $x$, considered circularly, contains all factors of length $\leq i$, it contains $u$ and $v$ as factors. Let $x'$ (resp., $x''$) be a conjugate of $x$ with prefix $u$ (resp., $v$). Then $x' = ut$ for some word $t$. If both $ut$ and $vt$ occur in $C(x)$, we would have two separate occurrences of $t$ in $x$ (considered circularly), which is impossible since $t$ is of length $n - j \geq n - i$ and $x$ has $n$ distinct factors of this length (considered circularly). So $u$ and $v$ are inequivalent under Myhill-Nerode. This gives $r$ equivalence classes.

In group (b), let $u, v$ be two distinct factors of $x$ (considered circularly) of length $j$ with $i < j < n-i$. Since $x$ is of length $n$ and contains $n$ distinct factors of length $i+1$, the first $i+1$ symbols of $u$ (resp., $v$) uniquely determine the position of $u$ (resp., $v$) within $x$ (considered as a circular word). So there is a unique $t$ such that $ut \in C(x)$, and similarly, there is a unique $t'$ such that $vt' \in C(x)$. Just as in case (a), since $u \neq v$ and $|t| = |t'| = n - j > i$, we see that $t \neq t'$, so $u$ and $v$ are inequivalent. This gives $n(n-2i-1)$ equivalence classes.

In group (c), for each word $y$ of length $j \leq i$, let $z_y$ be the lexicographically least word of length $n-j$ such that $z_y y \in C(x)$. (We know such a word exists because each such $y$ is a factor of $x$, considered circularly.) Let $y, y'$ be distinct words of length $j$. Then since $|z_y| = n - j \geq n - i$, the word $z_y$ occurs in exactly one location in $x$, considered circularly, and there it must be followed by $y$. So $z_y y' \notin C(x)$, so $z_y$ and $z_{y'}$ are inequivalent under Myhill-Nerode. This gives $r$ equivalence classes.

Now let us prove the reverse direction. Suppose $x$ is such that $\mathrm{sc}(C(x)) = 2r + n(n-2i-1) + 1$. Then from the upper bound in Corollary 9 and the construction of Theorem 8 from which it is derived, we know that all the words corresponding to the states of the automaton in Theorem 8 are pairwise inequivalent under Myhill-Nerode. But there are $k^i$ such words of length $i$ and $n$ such words of length $i+1$. Hence, by Proposition 6, $x$ is a generalized de Bruijn word. ∎
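The statement of Theorem 10 can be tested by brute force on short binary words. The sketch below computes the state complexity of a finite language by counting distinct Myhill-Nerode residuals (one per state, with the empty residual playing the role of the dead state). It is an exhaustive check under the formula $2r + n(n-2i-1) + 1$, not the construction from the proof, and every function name is hypothetical.

```python
from itertools import product


def conjugates(w: str) -> set[str]:
    return {w[i:] + w[:i] for i in range(len(w))}


def state_complexity(language: set[str]) -> int:
    """States of the minimal complete DFA of a nonempty finite language."""
    residuals = set()
    for word in language:
        for cut in range(len(word) + 1):
            prefix = word[:cut]
            # residual of the prefix: all suffixes that complete it to a word of the language
            residuals.add(frozenset(v[cut:] for v in language if v.startswith(prefix)))
    return len(residuals) + 1            # +1 for the dead state (empty residual)


def gamma(w: str, i: int) -> int:
    doubled = w + w
    return len({doubled[p:p + i] for p in range(len(w))})


def is_generalized_de_bruijn(w: str, k: int) -> bool:
    return all(gamma(w, i) == min(k ** i, len(w)) for i in range(len(w) + 1))


def bound(k: int, n: int) -> int:
    i = 0
    while k ** (i + 1) <= n:
        i += 1
    r = (k ** (i + 1) - 1) // (k - 1)
    return 2 * r + n * (n - 2 * i - 1) + 1


def check(k: int, n: int) -> bool:
    """True iff the maximum equals the bound and is attained exactly by the generalized de Bruijn words."""
    words = ["".join(t) for t in product("0123456789"[:k], repeat=n)]
    sc = {w: state_complexity(conjugates(w)) for w in words}
    best = max(sc.values())
    maximizers = {w for w in words if sc[w] == best}
    special = {w for w in words if is_generalized_de_bruijn(w, k)}
    return best == bound(k, n) and maximizers == special


print(state_complexity(conjugates("0011")))   # 11
print(all(check(2, n) for n in range(1, 8)))
```

Counting residuals directly avoids building and minimizing an automaton, which keeps the check short at the cost of quadratic work per word.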

For $k = 2$ the maximum state complexity of $C(x)$ over length-$n$ words $x$ is given in Table 2. It is sequence A316936 in the OEIS [20].

We do not currently know an accurate asymptotic expression for the number of generalized de Bruijn words of length $N$, except in a few simple cases. If $k = 2$ and $N = 2^n$, then it follows from known results [1] that this number is (counted up to cyclic shift) $2^{2^{n-1}-n}$.

For $k = 2$, a generalized de Bruijn word of length $2^n + 1$ corresponds to a closed path in the de Bruijn graph that visits one vertex exactly twice and all others exactly once. This implies that the additional edge is a loop. Therefore, each generalized de Bruijn word of length $2^n + 1$ is obtained from an ordinary de Bruijn word of length $2^n$ by replacing a factor $a^n$ with $a^{n+1}$, where $a$ is a letter. It follows that the number of such words is $2^{2^{n-1}-n+1}$. A similar argument yields the same number of generalized de Bruijn words of length $2^n - 1$.
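These counts can be checked by enumeration in the binary case. The sketch below counts generalized de Bruijn words up to cyclic shift by keeping only the lexicographically least rotation of each circular word. The comparison value $2^{2^{n-1}-n+1}$ (twice the number of ordinary binary de Bruijn words) is an assumption derived from the replacement argument above, and the function names are illustrative.

```python
from itertools import product


def gamma(w: str, i: int) -> int:
    doubled = w + w
    return len({doubled[p:p + i] for p in range(len(w))})


def is_generalized_de_bruijn(w: str, k: int) -> bool:
    return all(gamma(w, i) == min(k ** i, len(w)) for i in range(len(w) + 1))


def count_up_to_shift(length: int, k: int = 2) -> int:
    """Number of generalized de Bruijn words of this length, one per circular word."""
    count = 0
    for letters in product("0123456789"[:k], repeat=length):
        w = "".join(letters)
        least_rotation = min(w[i:] + w[:i] for i in range(length))
        if w == least_rotation and is_generalized_de_bruijn(w, k):
            count += 1
    return count


for n in (2, 3):
    expected = 2 ** (2 ** (n - 1) - n + 1)
    print(n, count_up_to_shift(2 ** n - 1), count_up_to_shift(2 ** n + 1), expected)
```

For $n = 2$ and $n = 3$ both neighbouring lengths give the same count, in line with the paragraph above; longer lengths quickly become too expensive for this naive enumeration.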

Beyond these cases, these kinds of considerations already become very complex. We leave this as a challenging open problem for the reader.

## References

• [1] T. van Aardenne-Ehrenfest and N. G. de Bruijn. Circuits and trees in oriented linear graphs. Simon Stevin 28 (1951), 203–217.
• [2] M.-C. Anisiu, Z. Blázsik, and Z. Kása. Maximal complexity of finite words. Pure Math. Appl. 13 (2002), 39–48.
• [3] J. Brzozowski. Quotient complexity of regular languages. J. Automata, Languages, and Combinatorics 15 (2010), 71–89.
• [4] N. G. de Bruijn. A combinatorial problem. Proc. Konin. Neder. Akad. Wet. 49 (1946), 758–764.
• [5] N. G. de Bruijn. Acknowledgement of priority to C. Flye Sainte-Marie on the counting of circular arrangements of $2^n$ zeros and ones that show each $n$-letter word exactly once. Technical Report 75-WSK-06, Department of Mathematics and Computing Science, Eindhoven University of Technology, The Netherlands, June 1975.
• [6] T. Etzion. An algorithm for generating shift-register cycles. Theoret. Comput. Sci. 44 (1986), 209–224.
• [7] A. Flaxman, A. W. Harrow, and G. B. Sorkin. Strings with maximally many distinct subsequences and substrings. Electronic J. Combinatorics 11(1) (2004), #R8 (electronic).
• [8] C. Flye Sainte-Marie. Question 48. L’Intermédiaire Math. 1 (1894), 107–110.
• [9] H. Fredricksen. A survey of full length nonlinear shift register cycle algorithms. SIAM Review 24 (1982), 195–221.
• [10] I. J. Good. Normal recurring decimals. J. London Math. Soc. 21 (1946), 167–169.
• [11] F. Hemmati and D. J. Costello, Jr. An algebraic construction for $q$-ary shift register sequences. IEEE Trans. Comput. 27 (1978), 1192–1195.
• [12] A. Iványi. On the $d$-complexity of words. Ann. Univ. Sci. Budapest. Sect. Comput. 8 (1987), 69–90.
• [13] J. Jirásek and G. Jirásková. Cyclic shift on prefix-free languages. In A. A. Bulatov and A. M. Shur, editors, CSR 2013, Vol. 7913 of Lecture Notes in Computer Science, pp. 246–257. Springer-Verlag, 2013.
• [14] G. Jirásková and A. Okhotin. State complexity of cyclic shift. RAIRO Inform. Théor. App. 42 (2008), 335–360.
• [15] A. Lempel. $m$-ary closed sequences. J. Combin. Theory 10 (1971), 253–258.
• [16] M. H. Martin. A problem in arrangements. Bull. Amer. Math. Soc. 40 (1934), 859–864.
• [17] A. N. Maslov. Estimates of the number of states of finite automata. Dokl. Akad. Nauk SSSR 194(6) (1970), 1266–1268. In Russian. English translation in Soviet Math. Dokl. 11 (5) (1970), 1373–1375.
• [18] A. Ralston. De Bruijn sequences — a model example of the interaction of discrete mathematics and computer science. Math. Mag. 55 (1982), 131–143.
• [19] J. Shallit. On the maximum number of distinct factors of a binary string. Graphs and Combinatorics 9 (1993), 197–200.
• [20] N. J. A. Sloane et al. The on-line encyclopedia of integer sequences. Available online at https://oeis.org, 2019.