# Extending de Bruijn sequences to larger alphabets

A circular de Bruijn sequence of order n in an alphabet of k symbols is a sequence in which each sequence of length n occurs exactly once. In this work we show that for each circular de Bruijn sequence v of order n in an alphabet of k symbols there is another circular de Bruijn sequence w also of order n in an alphabet with one more symbol, that is an alphabet of k + 1 symbols, such that v is a subsequence of w and in between any two successive occurrences of the new symbol in w there are at most n + 2k-2 consecutive symbols of v. We give an algorithm that receives as input such a sequence v and outputs a sequence w. We also give a much faster algorithm that receives as input such a sequence v and outputs a sequence w, but the new symbol may not be evenly spread out.


## 1 Introduction and statement of results

A rotation is the operation that moves the final symbol of a finite sequence to the first position while shifting all other symbols to the next position, or the composition of this operation with itself an arbitrary number of times. A circular sequence is the equivalence class of a sequence under rotations. We write $[w]$ to denote the circular sequence formed by the rotations of $w$.

A subsequence of a sequence $a_1 a_2 \ldots a_n$ is a sequence $b_1 b_2 \ldots b_m$ defined by $b_i = a_{n_i}$ for $i = 1, \ldots, m$, where $n_1 < n_2 < \cdots < n_m$ is an increasing sequence of indices. The same applies to circular words, assuming any starting position. For example, , and [5,6,1,2] are subsequences of .

A circular de Bruijn sequence of order $n$ on a size-$k$ alphabet $A$ is a circular sequence of size $k^n$ in which every possible size-$n$ sequence on $A$ occurs exactly once as a contiguous subsequence [8, 13]. See [5] for a fine presentation and history. To denote the set of circular de Bruijn sequences of order $n$ in an alphabet of $k$ symbols we write $\mathcal{B}(k,n)$. For example, $[1,1,0,0,0,1,0,1]$ is in $\mathcal{B}(2,3)$.
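
The definition can be checked mechanically for small parameters. The following is a minimal sketch, assuming the alphabet is $\{0, \ldots, k-1\}$ and the sequence is given as a Python list read circularly; the function name `is_de_bruijn` is hypothetical and not part of the paper.

```python
from itertools import product

def is_de_bruijn(seq, k, n):
    """True if seq, read circularly, contains every word of size n
    over {0, ..., k-1} exactly once as a contiguous subsequence."""
    if len(seq) != k ** n:
        return False
    m = len(seq)
    # One window of size n starts at every position of the circular sequence.
    windows = {tuple(seq[(i + j) % m] for j in range(n)) for i in range(m)}
    return windows == set(product(range(k), repeat=n))

print(is_de_bruijn([1, 1, 0, 0, 0, 1, 0, 1], 2, 3))  # True
print(is_de_bruijn([0, 0, 0, 1, 0, 1, 1, 0], 2, 3))  # False: 000 occurs twice
```

Because the sequence has exactly $k^n$ windows, comparing the set of windows with the set of all words is enough to ensure that each word occurs exactly once.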

In this note we show that for any given circular de Bruijn sequence $v$ of order $n$ in an alphabet of $k$ symbols there is another circular de Bruijn sequence $w$ of order $n$ but in an alphabet of $k+1$ symbols such that $v$ is a subsequence of $w$ and such that in between two successive occurrences of the new symbol in $w$ there are at most $n+2k-2$ consecutive symbols of $v$. We provide an algorithm that given such an input sequence $v$ produces the output sequence $w$. We also give a much faster algorithm that receives as input such a sequence $v$ and outputs a sequence $w$ without the guarantee of the fair distribution of the new symbol. Thus, Theorems 1 and 2 stated below are the main results of this note:

###### Theorem 1.

Given a circular de Bruijn sequence $v$ in $\mathcal{B}(k,n)$ there is a circular de Bruijn sequence $w$ in $\mathcal{B}(k+1,n)$ such that $v$ is a subsequence of $w$ and for any $n+2k-1$ consecutive symbols in $w$ there is at least one occurrence of the new symbol. Moreover, there is an algorithm that given as input such a sequence $v$ generates the sequence $w$ after performing $O(k^{3n-2})$ mathematical operations and it uses $O((k+1)^n)$ space.

For example, given the sequence

 v=[1,1,0,0,0,1,0,1]

the following sequence

 w=[1,2,2,2,1,2,1,1,1,0,0,2,2,0,2,0,0,0,1,2,0,1,0,2,1,0,1]

satisfies the conditions of the theorem for $n=3$ and $k=2$ in the alphabet $\{0,1,2\}$ where the new symbol is the symbol $2$. The symbol $2$ occurs $9$ times in $w$ and given any $6$ consecutive symbols in $w$ there is at least one occurrence of the symbol $2$.
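
The three properties of this example can be verified by brute force. The sketch below is only an illustration; the helper names (`windows`, `is_circular_subsequence`, `longest_run_without`) are hypothetical, and the new symbol is assumed to be `2`.

```python
def windows(seq, n):
    """All contiguous windows of size n of seq, read circularly."""
    m = len(seq)
    return [tuple(seq[(i + j) % m] for j in range(n)) for i in range(m)]

def is_circular_subsequence(v, w):
    """True if v is a (not necessarily contiguous) subsequence of some rotation of w."""
    def is_subsequence(a, b):
        it = iter(b)
        return all(x in it for x in a)  # consumes b from left to right
    return any(is_subsequence(v, w[j:] + w[:j]) for j in range(len(w)))

def longest_run_without(seq, s):
    """Longest circular run of consecutive symbols different from s."""
    m = len(seq)
    if s not in seq:
        return m
    start = seq.index(s)
    best = run = 0
    for i in range(1, m + 1):
        if seq[(start + i) % m] == s:
            best, run = max(best, run), 0
        else:
            run += 1
    return best

v = [1, 1, 0, 0, 0, 1, 0, 1]
w = [1, 2, 2, 2, 1, 2, 1, 1, 1, 0, 0, 2, 2, 0, 2, 0, 0, 0, 1, 2, 0, 1, 0, 2, 1, 0, 1]

print(len(set(windows(w, 3))) == 3 ** 3)  # True: w is in B(3,3)
print(is_circular_subsequence(v, w))      # True
print(w.count(2))                         # 9
print(longest_run_without(w, 2))          # 5, that is n + 2k - 2
```

A longest run of $5$ symbols without the new symbol means that every window of $6 = n + 2k - 1$ consecutive symbols contains it.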

It is not hard to see that given a sequence $v$ in $\mathcal{B}(k,n)$ there is a sequence $w$ in $\mathcal{B}(k+1,n)$ such that $v$ is a subsequence of $w$. But we aim to guarantee that the new symbol is fairly distributed along the extended de Bruijn sequence $w$. The first difficulty is to mathematically define this condition. The second difficulty is to prove the existence of such an extended sequence and to provide an elegant and fast algorithm to construct it.

In addition to classical elements from graph theory such as de Bruijn graphs, Eulerian cycles and graph transformations, we use the Edmonds-Karp algorithm [9, 6]. The output sequence obtained by our algorithm has size $(k+1)^n$. Thus, Theorem 1 states that the algorithm is practically cubic on the output size and this time complexity is dominated by the Edmonds-Karp time complexity when operating on a graph with $2k^{n-1}$ vertices and $(k+2)k^{n-1}$ edges.

In case we ask for no guarantee on the distribution of the new symbol in the extended sequence, we obtain a faster algorithm.

###### Theorem 2.

There is an algorithm that given a circular de Bruijn sequence $v$ in $\mathcal{B}(k,n)$ generates a circular de Bruijn sequence $w$ in $\mathcal{B}(k+1,n)$ such that $v$ is a subsequence of $w$, after performing at most $O(n^2 (k+1)^n)$ mathematical operations and it uses $O((k+1)^n)$ space.

The sequence generated by the algorithm given in Theorem 2 has size $(k+1)^n$. Thus, the time complexity of this solution is just above the size of the output. Precisely, for each symbol of the generated sequence this second algorithm performs a number of operations that is the square of the logarithm of the size of the output sequence. The proof of Theorem 2 is elementary and it formalizes a natural intuition on how to extend a de Bruijn sequence to a larger alphabet. We shall see that the algorithm is greedy, making just some computations on each step.

The extension problem to a larger alphabet is dual to the extension problem studied by Becher and Heiber in [3], where they considered the problem of extending a sequence $v$ in $\mathcal{B}(k,n)$ to a sequence $w$ in $\mathcal{B}(k,n+1)$ such that $v$ is a suffix of $w$. Theorems 1 and 2 in this note appear in [7].

It is possible to conceive the problem of extension to a larger alphabet for particular families of de Bruijn sequences. Gabriel Thibeault in [15] proved that the lexicographically greatest de Bruijn sequence in $\mathcal{B}(k,n)$ is the suffix of the lexicographically greatest sequence in $\mathcal{B}(k+1,n)$. Thus, for the lexicographically greatest de Bruijn sequence in $\mathcal{B}(k,n)$ there is a very simple solution to the problem stated in Theorem 2, which is to construct the lexicographically greatest de Bruijn sequence in $\mathcal{B}(k+1,n)$, and this can be done with a greedy algorithm. A fast version of the algorithm for the lexicographically greatest de Bruijn sequence was obtained by Amram, Ashlagi, Rubin, Svoray, Schwartz and Weiss [2]. Schwartz, Svoray and Weiss recently considered in [14] the extension to a larger alphabet for the lexicographically greatest de Bruijn sequence.

It seems interesting to study the extension problem to a larger alphabet for other salient families. For instance, the semi-perfect de Bruijn sequences of Repke and Rytter [12], which satisfy that each of the prefixes (large enough) has the largest possible number of distinct words. Or the perfect sequences of Alvarez, Becher, Ferrari and Yuhjtman [1], which contain each word of a given length a fixed number of times, each occurrence starting at a different position modulo that number. Or the subtler nested perfect sequences of Mordechay Levin [11, Theorem 2], see also [4].

The document is organized as follows. In Section 2 we present the classical material on de Bruijn graphs and we fix the notation. In Section 3 we give the proof of Theorem 2 because it is simpler than that of Theorem 1. In Section 4 we elaborate the definition of fair distribution of the new symbol in the extended sequence and we devote Section 5, the last section of the paper, to the proof of Theorem 1.

## 2 de Bruijn graphs and trees of petals

Fix a finite alphabet $A$. Without loss of generality, when we consider an alphabet of $k$ symbols we assume $A = \{0, \ldots, k-1\}$. As usual, we write $A^n$ to denote the set of words of size $n$ whose symbols belong to $A$. In the sequel we use the terms word and sequence interchangeably.

A de Bruijn graph $G(k,n)$ is a directed graph $(V,E)$ where $V$ is the set of words of size $n$ on a size-$k$ alphabet $A$ and whose set of edges $E$ is the set of pairs $(v,w)$ for $v = a_1 \ldots a_n$ and $w = a_2 \ldots a_n b$ with $b \in A$. Thus, the graph has $k^n$ vertices and $k^{n+1}$ edges, it is strongly connected and every vertex has the same in-degree and out-degree.
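
As an illustration of this definition, the graph can be stored as a dictionary that maps each vertex to its labelled outgoing edges. This is a minimal sketch assuming the alphabet $\{0, \ldots, k-1\}$ and words represented as tuples; `de_bruijn_graph` is a hypothetical helper name.

```python
from collections import Counter
from itertools import product

def de_bruijn_graph(k, n):
    """Map every word v of size n to {b: successor}, where the edge labelled b
    goes from a_1...a_n to a_2...a_n b."""
    return {v: {b: (v + (b,))[1:] for b in range(k)}
            for v in product(range(k), repeat=n)}

g = de_bruijn_graph(2, 2)
edges = [(v, b, w) for v, out in g.items() for b, w in out.items()]
in_degree = Counter(w for _, _, w in edges)

print(len(g), len(edges))              # 4 8, that is k^n and k^(n+1)
print(set(in_degree.values()) == {2})  # True: every in-degree equals k
print(g[(0, 1)][1])                    # (1, 1)
```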

Each circular de Bruijn sequence in $\mathcal{B}(k,n)$ can be constructed by taking a Hamiltonian cycle on the graph $G(k,n)$, given that each vertex of the graph is a word of size $n$ in an alphabet of $k$ symbols. Moreover, since the line graph of $G(k,n-1)$ is $G(k,n)$, each circular de Bruijn sequence in $\mathcal{B}(k,n)$ can be constructed as an Eulerian cycle in $G(k,n-1)$. For example, in the graph if one traverses the edge labelled from , one arrives at thereby indicating the presence of the contiguous subsequence in the de Bruijn sequence.
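
The correspondence with Eulerian cycles can be sketched with Hierholzer's algorithm, a standard technique that the text does not prescribe. The code walks $G(k,n-1)$ implicitly and reads off the edge labels; the function name and the choice of the all-zero starting vertex are assumptions made for illustration.

```python
from itertools import product

def de_bruijn_from_eulerian_cycle(k, n):
    """Return the edge labels of an Eulerian cycle of G(k, n-1),
    which form a circular de Bruijn sequence in B(k, n)."""
    # unused[v] lists the labels of the outgoing edges of v not yet traversed.
    unused = {v: list(range(k)) for v in product(range(k), repeat=n - 1)}
    start = (0,) * (n - 1)
    stack = [(start, None)]  # (vertex, label of the edge used to reach it)
    labels = []
    while stack:
        v, label = stack[-1]
        if unused[v]:
            b = unused[v].pop()
            stack.append(((v + (b,))[1:], b))
        else:
            # Dead end: this vertex is finished, emit the edge that led here.
            stack.pop()
            if label is not None:
                labels.append(label)
    labels.reverse()
    return labels

seq = de_bruijn_from_eulerian_cycle(2, 3)
print(len(seq))  # 8
```

The particular sequence obtained depends on the order in which outgoing edges are tried; any order yields an Eulerian cycle because every vertex has equal in-degree and out-degree and the graph is strongly connected.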

Notice that $G(k,n)$ is a subgraph of $G(k+1,n)$. To see that, observe that the vertices of the first graph are all the possible size-$n$ words in a size-$k$ alphabet and the vertices of the second graph are those of the first one plus all the possible size-$n$ words in an alphabet of size $k+1$ with at least one occurrence of the new symbol. Also, the edges of the second graph are the same as the ones from the first graph plus the ones representing words with at least one occurrence of the new symbol. This means that we can add vertices and edges to $G(k,n)$ and obtain $G(k+1,n)$. This motivates the following definition of the augmenting graph $D(k+1,n)$.

If $w$ is a word on alphabet $A$ and $a$ is a symbol of $A$ we write $|w|_a$ to denote the number of occurrences of $a$ in $w$. Similarly, if $u$ is a word we write $|w|_u$ to denote the number of occurrences of $u$ in $w$.

###### Definition 3 (Augmenting graph).

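Let $\hat{A} = A \cup \{s\}$ where $A$ has $k$ symbols and $s$ is a symbol not in $A$. We define the augmenting graph $D(k+1,n) = (V,E)$ where

$$V = \hat{A}^n, \qquad E = \{(v,w) : \text{if } v = a_1 \ldots a_n \text{ then } w = a_2 \ldots a_n b \text{ where } b \in \hat{A} \text{ and } (|v|_s > 0 \text{ or } |w|_s > 0)\}.$$

Figure 1: The edges of the graph $D(3,2)$ are shown in dashed lines.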

To prove Theorems 1 and 2 we have to transform a given de Bruijn sequence over $k$ symbols into a de Bruijn sequence of the same order over $k+1$ symbols in such a way that the first one is a subsequence of the second one. Thus, given an Eulerian cycle in $G(k,n)$ we need to construct an Eulerian cycle in $G(k+1,n)$ where we preserve the relative order of the edges in $G(k,n)$. In the augmenting graph each of the vertices of $G(k,n)$ has exactly one incoming and one outgoing edge. Also observe that the outgoing edge is always labelled with the new symbol. So, the only way to define the expected Eulerian cycle in $G(k+1,n)$ is by interleaving disjoint cycles of the augmenting graph on each of the vertices of $G(k,n)$. We will concentrate on some particular disjoint cycles in the augmenting graph that we call petals. In order to do that, we use the following proposition.

###### Proposition 4.

Fix an integer $n$ greater or equal to . The set of edges in $G(k,n)$ can be partitioned into a set of cycles identified by the circular words of size $n+1$.

###### Proof.

First observe that we can unequivocally identify an edge of $G(k,n)$ by concatenating the label of the vertex it leaves with the label of that edge. Thus, each edge of $G(k,n)$ is identified with a word of size $n+1$. This word also identifies a circular word of size $n+1$, which is the class of all the rotations of this word. Now notice that each circular word of size $n+1$ corresponds to exactly one cycle in $G(k,n)$. Thus the partition of the set of words of size $n+1$ into the equivalence classes given by the rotations of these words determines a partition of the set of edges in $G(k,n)$ into cycles. ∎

Figure 2: Given a size-2 alphabet, there are 4 circular words of size 3: $[000]$, $[001]$, $[011]$ and $[111]$, each one associated with a cycle in G(2,2).

In the following proposition we write $\uplus$ to denote the disjoint union of two sets. It states that the augmenting graph $D(k+1,n)$ contains the set of cycles associated to circular words of size $n+1$ with at least one occurrence of the new symbol. It is immediate to verify that the proposition holds.

###### Proposition 5.

Let $C_k$ be the set of cycles in $G(k,n)$ associated to the circular words of size $n+1$ in an alphabet of $k$ symbols. Let $C_{k+1}$ be the set of cycles in $G(k+1,n)$ associated to the circular words of size $n+1$ in an alphabet of $k+1$ symbols. Then $C_{k+1} = C_k \uplus C_D$, where $C_D$ is the set of cycles associated to the circular words of size $n+1$ with at least one occurrence of the new symbol.
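
Propositions 4 and 5 can be checked by brute force for small parameters. The sketch below groups words into rotation classes, using the lexicographically least rotation as a representative (a choice made here for convenience, not taken from the text), and then separates the classes that contain the new symbol, assumed to be `2`.

```python
from itertools import product

def circular_words(k, size):
    """Group all words of the given size over {0, ..., k-1} into rotation classes.
    Keys are the least rotations; values are the sets of distinct rotations."""
    classes = {}
    for w in product(range(k), repeat=size):
        canon = min(w[i:] + w[:i] for i in range(size))
        classes.setdefault(canon, set()).add(w)
    return classes

small = circular_words(2, 3)  # cycles of G(2,2)
large = circular_words(3, 3)  # cycles of G(3,2)
with_new = {c for c in large if 2 in c}
without_new = {c for c in large if 2 not in c}

print(len(small), len(large), len(with_new))  # 4 11 7
print(without_new == set(small))              # True: the union is disjoint
print(sum(len(m) for m in small.values()))    # 8, the number of edges of G(2,2)
```

Each class contributes one cycle whose number of edges equals the number of distinct rotations of the word, so constant words such as 000 give loops of a single edge.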

We are now ready to define a petal.

###### Definition 6 (Petal).

A petal of $G(k,n)$ is a cycle of cycles in $D(k+1,n)$ associated to circular words of size $n+1$ that traverses only one vertex of $G(k,n)$.

Figure 3: Two examples of petals of a G(2,2) graph. The left figure has a petal for the vertex 01 that only contains one cycle, the one associated to the circular word . The right figure has a petal for the vertex 10 that contains three cycles associated to the circular words ,  and .

We aim to define the wanted Eulerian cycle in $G(k+1,n)$ as the given cycle in $G(k,n)$ interleaved with the petals of the augmenting graph $D(k+1,n)$.

The difficulty lies in determining how to define petals using every edge of $D(k+1,n)$ and also how to interleave these petals in the given cycle to make sure that the occurrences of the new symbol are fairly distributed to satisfy the requirement of Theorem 1.

###### Definition 7 (Petals tree).

Let be an alphabet with cardinality with , a circular de Bruijn sequence in and such that . We define the Petals tree as a rooted tree subgraph of the directed graph where

$$V = \{[w] : w \in A^n \text{ and } |w|_s \geq 1\}, \qquad E = \{([v],[w]) : v,w \in A^n,\ \exists u \in A^{n-1},\ |v|_u > 0,\ |w|_u > 0,\ |w|_s = |v|_s + 1\} \cup \{(r,[v]) : |v|_s = 1\}.$$

The vertices with distance $1$ to the root have exactly one occurrence of the symbol $s$, and each vertex of the tree with distance $d$ to the root has exactly $d$ occurrences of the new symbol.

Figure 4: A possible t(3,3,2) tree. The root r determines four petals, one for each branch. The first petal has the circular word , the second has , the third has , ,  and  and the fourth has .

When two vertices are connected in the Petals tree they have a common contiguous subsequence of size $n-1$. We shall define a cycle that goes through several connected cycles. In order to compose two cycles we traverse the first circular word until we find a common vertex such that the next edge in the first cycle is not labelled with the new symbol $s$. Observe that has the same number of occurrences of as . Consequently, an edge labelled with $s$ that starts from the common vertex corresponds to a circular word with more occurrences of the symbol $s$.

Figure 5: On the left we have the circular words  and  from the Petals tree and their associated cycles on the right. The first circular word has one occurrence of the symbol 2 and the second one has two. Their associated cycles have the common vertex 02. Suppose we traverse the first cycle starting from the vertex 21. We would go through the edges 0 and 2 until we get to the common vertex 02. At that point, we start traversing the second cycle starting with the symbol 2, which guarantees a circular word with two occurrences of the symbol 2. We traverse 2, 0 and 2. After that, we finish the first cycle with the label 1.

## 3 Proof of Theorem 2

### 3.1 Extending a de Bruijn sequence to a larger alphabet

We give an algorithm that formalizes a common intuition on how to extend a de Bruijn sequence to a larger alphabet. Consider the graph $G(k+1,n-1)$ for de Bruijn sequences of order $n$ and an alphabet of $k+1$ symbols.

Every de Bruijn sequence in $\mathcal{B}(k,n)$ is associated to an Eulerian cycle in $G(k,n-1)$ and to a cycle in $G(k+1,n-1)$. The idea is to traverse this cycle and, greedily, at each vertex use the outgoing edge labelled with the new symbol to extend the cycle. We already introduced a tool to traverse the de Bruijn graph $G(k+1,n-1)$ using a Petals tree, starting with a de Bruijn sequence that represents an Eulerian cycle in the de Bruijn graph $G(k,n-1)$. There are three possibilities on each step. Consider the current vertex and its outgoing edge labelled with the new symbol, which together determine a circular word. One possibility is that this circular word has not been traversed. In this case, we start traversing the new cycle. Another possibility is that it is the current circular word. In this case we keep traversing the same circular word. The last possibility is that it has already been traversed. In this case we ignore this circular word.

In the following example we perform the first six steps of the algorithm just described. Assume as input a de Bruijn sequence . We begin the traversal in vertex and immediately try to add a circular word with one occurrence of the symbol . The circular word of the vertex and label is the . We traverse the edge to the vertex . Then again we try to find a new circular word by traversing another edge labelled . The vertex  with the edge labelled  determines the circular word and we traverse the edge  to the vertex . Again, we search for a new circular word. The vertex  with the edge labelled  determines the circular word . We traverse the edge and get to the same vertex. Now, the circular word is already used, so we have to go to the next edge of the current circular word. We traverse the edge labelled to the vertex . Again, the circular word is already in use, so we continue. When we get to the vertex  we again can start a new circular word, the .

Figure 6: First steps of the algorithm with a B(2,3) de Bruijn sequence  as input. We show the circular words added on the Petals tree.

Figure 7: Next steps of the algorithm with a B(2,3) de Bruijn sequence  as input.

### 3.2 An Algorithm to prove Theorem 2

Algorithm 8 takes a de Bruijn sequence $v$ in $\mathcal{B}(k,n)$ and returns a de Bruijn sequence $w$ in $\mathcal{B}(k+1,n)$ such that $v$ is a subsequence of $w$, where the new symbol $s$ occurs $(k+1)^{n-1}$ times in $w$.

The main idea of the algorithm is to traverse an array with the original sequence, adding petals and cycles whenever possible. We first determine the alphabet size and the order of the de Bruijn sequence. To find the alphabet size $k$ we just have to count how many different symbols the sequence has. We can get the order $n$ of the sequence by solving $|v| = k^n$. Then we make a copy of the original sequence that we will modify to get the extended sequence. Several variables keep track of the state. One variable keeps track of the current position in the array and represents the edges that we already traversed. Another variable indicates in which vertex we are placed at each step. To make sure that we do not traverse any cycle more than once we have to keep track of every edge of a traversed cycle. We can reduce space by just keeping track of the vertices such that their outgoing edge labelled with the new symbol belongs to a traversed cycle. We keep track of them in an array. In this way we can unequivocally decide whether or not we should add a cycle at each vertex.

The main loop of Algorithm 8 iterates through every edge of the original sequence, adding cycles. On each vertex at the current position of the array we have two possibilities. If we already added the circular word determined by the concatenation of the vertex and the new symbol, we ignore that circular word, increment the position and go to the next vertex in the sequence. If we have not already added that circular word, we have to add it. To do that, we traverse each edge of the cycle, add them to the sequence at the current position, and for those labelled with the new symbol $s$ we mark the vertex they leave in the array of used vertices.

Extra care is taken in writing the edges of the cycles. Notice that the cycle associated to a word of size $n$ does not always have $n$ edges. There are as many edges as distinct rotations of the word. The algorithm starts adding each edge of the cycle until it reaches the original vertex. Once the cycle is formed we place it in the current position and keep moving forward. This process continues until we reach the last position of the array.

Figure 8: The first steps of the algorithm for the B(2,3) sequence 11000101.

###### Lemma 9.

Algorithm 8 has time complexity $O(n^2 (k+1)^n)$ and space complexity $O((k+1)^n)$ where $k$ is the size of the alphabet and $n$ is the order of the input de Bruijn sequence.

###### Proof.

To calculate the space complexity observe that there are two big arrays. The array of used vertices has size $(k+1)^{n-1}$ since it has a slot for each vertex of the graph on the $(k+1)$-sized alphabet. But the actual output, the sequence, will grow up to size $(k+1)^n$. To calculate the time complexity of the main cycle observe that we iterate $(k+1)^n$ times, which is the number of edges for the increased alphabet and also the final size of the sequence array. Then, for each vertex there can be a cycle to add. Adding a cycle has time complexity $O(n^2)$. This is because we iterate through the edges of the cycle (up to $n$ edges) and for each of those edges we check for the equality of words of size $n$. Then the main cycle has time complexity $O(n^2 (k+1)^n)$. ∎
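
The existence claim of Theorem 2 can also be illustrated with a different and simpler technique than Algorithm 8: splicing closed trails into the given Eulerian cycle, in the style of Hierholzer's algorithm. The sketch below is not the paper's algorithm and does not use petals; it assumes $k \geq 2$, the alphabet $\{0, \ldots, k-1\}$ and the new symbol $s = k$, and the names `closed_trail` and `extend` are hypothetical.

```python
from itertools import product

def closed_trail(start, unused):
    """Consume unused edges reachable from start and return the labels of a
    closed trail that begins and ends at start (Hierholzer's algorithm)."""
    stack = [(start, None)]  # (vertex, label of the edge used to reach it)
    labels = []
    while stack:
        u, label = stack[-1]
        if unused[u]:
            b = unused[u].pop()
            stack.append(((u + (b,))[1:], b))
        else:
            stack.pop()
            if label is not None:
                labels.append(label)
    labels.reverse()
    return labels

def extend(v, k):
    """Given v in B(k, n), return w in B(k+1, n) with v as a subsequence."""
    n = 1
    while k ** n < len(v):  # recover the order n from |v| = k^n
        n += 1
    s = k  # the new symbol
    # Edges of G(k+1, n-1) that are not edges of G(k, n-1): the word formed by
    # the source vertex and the label contains s.
    unused = {u: [b for b in range(k + 1) if b == s or s in u]
              for u in product(range(k + 1), repeat=n - 1)}
    u = tuple(v[len(v) - (n - 1):])  # vertex reached just before v[0]
    w = []
    for a in v:
        w.extend(closed_trail(u, unused))  # splice new edges at this vertex
        w.append(a)                        # then follow the original edge
        u = (u + (a,))[1:]
    return w

w = extend([1, 1, 0, 0, 0, 1, 0, 1], 2)
print(len(w))  # 27
```

Since the new edges are only inserted at vertices of the original cycle, the original edges keep their relative order, so `v` remains a subsequence of the result. As with Theorem 2, nothing here spreads the new symbol evenly.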

## 4 Fair distribution of the new symbol

Given an Eulerian cycle in the graph $G(k,n-1)$ we created an Eulerian cycle in the graph $G(k+1,n-1)$ with the property that it preserves the order of the edges in $G(k,n-1)$. We achieved this by placing petals of the augmenting graph on each vertex of $G(k,n-1)$. Remember that each vertex in $G(k,n-1)$ has $k$ incoming and $k$ outgoing edges. That means that we have $k$ options to place a petal for each vertex in the Eulerian cycle. Only petals have edges labelled with the new symbol $s$ and no edge in $G(k,n-1)$ is labelled with $s$. So in order to have a fair distribution of the symbol $s$ we need to interleave each petal in an appropriate part of the cycle. This motivates the following definition.

###### Definition 10 (Section of a cycle).

Given an Eulerian cycle $e_0, \ldots, e_{k^n-1}$ in $G(k,n-1)$, the section $j$ of the cycle is a list of vertices of $G(k,n-1)$ composed by the head of each edge $e_i$ of the cycle such that $jk \leq i < (j+1)k$.

A de Bruijn graph $G(k,n-1)$ has $k^{n-1}$ vertices and $k^n$ edges, so an Eulerian cycle in $G(k,n-1)$ has $k^{n-1}$ sections with $k$ vertices in each section. Given that there are the same number of sections and vertices, we would like to choose one vertex from each section to place the petal in a way that every vertex is used exactly once. Each section has $k$ vertices and each vertex in $G(k,n-1)$ belongs to $k$ sections, not necessarily different.
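
The sections of a given sequence can be computed directly. The sketch below assumes $k \geq 2$, that the edge $e_i$ is the one labelled with the $i$-th symbol of the sequence, and that its head is the vertex it arrives at, namely the last $n-1$ symbols read; these conventions are assumptions chosen because they reproduce the sections listed for the sequence 11000101 in the caption of Figure 9. The name `sections` is hypothetical.

```python
def sections(v, k):
    """Split the Eulerian cycle of v into consecutive blocks of k edges and
    list, for each block, the vertex reached by each of its edges."""
    n = 1
    while k ** n < len(v):  # recover the order n from |v| = k^n
        n += 1
    m = n - 1               # vertices of G(k, n-1) are words of size n-1
    size = len(v)
    heads = [tuple(v[(i - m + 1 + j) % size] for j in range(m))
             for i in range(size)]
    return [heads[j * k:(j + 1) * k] for j in range(size // k)]

for j, sec in enumerate(sections([1, 1, 0, 0, 0, 1, 0, 1], 2)):
    print(j, sec)
# 0 [(1, 1), (1, 1)]
# 1 [(1, 0), (0, 0)]
# 2 [(0, 0), (0, 1)]
# 3 [(1, 0), (0, 1)]
```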

###### Definition 11 (Petals Distribution graph).

Given an Eulerian cycle in a de Bruijn graph $G(k,n-1)$, the Petals Distribution graph $PD(k,n-1)$ is a $k$-regular bipartite graph in which the vertices of $G(k,n-1)$ and the sections of the cycle are the two vertex classes and the edges are the set of pairs $(v,j)$ such that the vertex $v$ belongs to the section $j$.

Figure 9: The de Bruijn sequence  has four sections: the section 0 has the vertex 11 twice, the section 1 has the vertices 10 and 00, the section 2 has the vertices 00 and 01, and the section 3 has the vertices 10 and 01.

Figure 10: The Petals Distribution graph for the de Bruijn sequence . The left figure shows the possible sections for each vertex. The right figure shows a possible assignment of those vertices and sections.

Figure 11: Flow network for PD(2,2) where each edge has capacity 1.

Given a graph , a matching in is a set of edges such that no two edges share a common vertex. A vertex is matched if it is an endpoint of one of the edges in the matching. A perfect matching is a matching which matches all vertices of the graph.

###### Lemma 12.

For every Petals Distribution graph there is a perfect matching.

###### Proof.

Let $G$ be a finite bipartite graph with bipartite sets $X$ and $Y$. For a set $S$ of vertices in $X$, let $N(S)$ denote the neighborhood of $S$ in $G$, that is, the set of all vertices in $Y$ adjacent to some element of $S$. Hall's marriage theorem [10] states that there is a matching that entirely covers $X$ if and only if for every subset $S$ of $X$, $|S| \leq |N(S)|$. Let $X$ be the set of vertices of the original graph and $Y$ the set of vertices for the sections. For any $S \subseteq X$ such that $|S| = m$, the sum of the degrees of the vertices is $km$. Given that the degree for any vertex in $Y$ is $k$, these $km$ edges reach at least $m$ vertices of $Y$, so we have that $|N(S)| \geq m$. Then there is a matching that entirely covers $X$. Furthermore, as $|X| = |Y|$, the matching is perfect. ∎

In order to compute the perfect matching in a Petals Distribution graph we can use any method for computing the maximum flow in a network. We introduce a source vertex and a sink vertex, and add an edge from the source to each vertex of the de Bruijn graph and an edge from each section vertex to the sink. We assign capacity $1$ to each of the edges of the flow network. We can see that the maximum flow of the network is $k^{n-1}$, so the saturated edges between the two vertex classes form a perfect matching.
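
For unit capacities, a maximum flow in this network is the same as a maximum bipartite matching, so a small instance can be solved with plain augmenting paths. The sketch below uses that simpler technique in place of Edmonds-Karp; the function name `perfect_matching` is hypothetical and the input is the list of sections from the caption of Figure 9.

```python
def perfect_matching(section_lists):
    """Assign to each section index a distinct vertex that belongs to it,
    using augmenting paths. Raises ValueError if no perfect matching exists."""
    owner = {}  # vertex -> index of the section currently matched to it

    def try_assign(j, seen):
        for vertex in section_lists[j]:
            if vertex in seen:
                continue
            seen.add(vertex)
            # Take a free vertex, or re-route the section that holds it.
            if vertex not in owner or try_assign(owner[vertex], seen):
                owner[vertex] = j
                return True
        return False

    for j in range(len(section_lists)):
        if not try_assign(j, set()):
            raise ValueError("no perfect matching")
    return {j: vertex for vertex, j in owner.items()}

figure_9_sections = [
    [(1, 1), (1, 1)],
    [(1, 0), (0, 0)],
    [(0, 0), (0, 1)],
    [(1, 0), (0, 1)],
]
matching = perfect_matching(figure_9_sections)
print(matching[0])  # (1, 1): the only vertex available in section 0
```

Several perfect matchings may exist, so only the assignment forced by the data is shown. Each recursive call plays the role of one augmenting path in the flow network.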

###### Lemma 13.

Given a de Bruijn sequence in $\mathcal{B}(k,n)$, for any $n+2k-1$ consecutive symbols in the new sequence there is at least one occurrence of $s$.

###### Proof.

First notice that each section has $k$ vertices. Two petals can be at most $2k-1$ edges away, which corresponds to placing one petal on the first vertex of one section and another in the last vertex of the next section. Also remember that for any given vertex $a_1 \ldots a_{n-1}$ of $G(k,n-1)$ in $G(k+1,n-1)$ the outgoing edge is labelled with the new symbol $s$ and that determines a cycle associated with the circular word $[a_1 \ldots a_{n-1} s]$. In consequence, the tail vertex of the last edge in that cycle is $s a_1 \ldots a_{n-2}$. This means that there is an edge labelled $s$ exactly $n-1$ edges before the end of the petal. In consequence, between the last occurrence of $s$ in a petal and the first occurrence of $s$ in the next petal there can be at most $n+2k-2$ edges. The vertices of the petals have at least one symbol $s$, therefore we are guaranteed that given any $n$ consecutive edges of a petal there is at least one occurrence of $s$. ∎

## 5 Proof of Theorem 1

Algorithm 14 takes a de Bruijn sequence $v$ in $\mathcal{B}(k,n)$ and returns a de Bruijn sequence $w$ in $\mathcal{B}(k+1,n)$ such that $v$ is a subsequence of $w$, the new symbol $s$ occurs $(k+1)^{n-1}$ times in $w$ and given any $n+2k-1$ consecutive symbols in $w$ there is at least one occurrence of $s$. This algorithm is similar to Algorithm 8, but it balances the occurrences of the new symbol $s$. For this purpose, we have to find a maximum flow for the Petals Distribution graph. We use the Edmonds-Karp algorithm as described before to determine a vertex from each section to start a petal. Then we store that assignment in an array.

In addition to the steps of Algorithm 8, we now keep track of the position in the original sequence, that means, how many edges of the original sequence we have already traversed. That is used to determine which is the actual section and therefore what petal should be placed next.

In the main loop of Algorithm 14 we iterate through every edge of the original sequence adding cycles. On each vertex we check if we can add a cycle. But in this case, if the current vertex belongs to the original graph then adding a cycle implies starting a petal. For that reason, in those cases we have to check if the current vertex can start a petal for the current section, otherwise we do not add the cycle. If the vertex does not belong to the original graph then to add a cycle we just have to check that such cycle has not been already used, because we are not starting a petal. The rest of the algorithm works in the same way as the Algorithm 8.

###### Lemma 15.

For an input circular de Bruijn sequence in $\mathcal{B}(k,n)$, Algorithm 14 produces a circular de Bruijn sequence in $\mathcal{B}(k+1,n)$ performing at most $O(k^{3n-2})$ operations and using $O((k+1)^n)$ space.

###### Proof of Lemma 15.

The space complexity of Algorithm 14 is the same as the one for Algorithm 8 given that the only addition in space is the array of assignments, which has size $k^{n-1}$ and is therefore smaller than $(k+1)^n$. Regarding time complexity, note that the search of the maximum flow is the most expensive operation of the algorithm. To see this, remember that the running time of the Edmonds-Karp algorithm is polynomial in the number of vertices $|V|$ and edges $|E|$ of the flow network, see [9, 6]. In our case, the vertices of the flow graph are the vertices of the original de Bruijn graph and the section vertices, so $|V| = 2k^{n-1}$.

Also notice that there are $k$ edges in the flow graph associated to each vertex of the original de Bruijn graph. So $|E| = (k+2)k^{n-1}$ and then the Edmonds-Karp time complexity is

$$O\big((2k^{n-1})^2 \cdot (k+2) \cdot k^{n-1}\big) = O(k^{3n-2}).$$

This is higher than the main cycle time complexity

$$O(n^2 (k+1)^n).$$

This completes the proof. ∎

Acknowledgements. This research was supported by grant PICT-2014-3260 from Agencia Nacional de Promoción Científica y Tecnológica, Argentina. Becher is a researcher in Laboratoire International Associé SINFIN Université Paris Diderot-CNRS/Universidad de Buenos Aires-CONICET.

## References

• [1] Nicolás Álvarez, Verónica Becher, Pablo Ferrari, and Sergio Yuhjtman. Perfect necklaces. Advances in Applied Mathematics, 80:48–61, 2016.
• [2] Gal Amram, Yair Ashlagi, Amir Rubin, Yotam Svoray, Moshe Schwartz, and Gera Weiss. An efficient shift rule for the prefer-max de Bruijn sequence. Discrete Mathematics, 342(1):226–232, 2019.
• [3] Verónica Becher and Pablo Ariel Heiber. On extending de Bruijn sequences. Information Processing Letters, 111(18):930–932, 2011.
• [4] Verónica Becher and Olivier Carton. Normal numbers and nested perfect necklaces. Journal of Complexity, page in press, 2019.
• [5] Jean Berstel and Dominique Perrin. The origins of combinatorics on words. European Journal of Combinatorics, 28(3):996–1022, 2007.
• [6] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms. MIT Press, 2009.
• [7] Lucas Cortés. Extending de Bruijn sequences to larger alphabets, 13 December 2018. Tesis de Licenciatura en Ciencias de la Computación, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires. Director: Verónica Becher.
• [8] Nicolaas G. de Bruijn. A combinatorial problem. Nederl. Akad. Wetensch., Proc., 49:758–764 = Indagationes Math. 8, 461–467 (1946), 1946.
• [9] Jack Edmonds and Richard M. Karp. Theoretical improvements in algorithmic efficiency for network flow problems. Journal of the ACM, 19(2):248–264, 1972.
• [10] Philip Hall. On representatives of subsets. Journal of the London Mathematical Society, 10, 1935.
• [11] Mordechay B. Levin. On the discrepancy estimate of normal numbers. Acta Arithmetica, 88(2):99–111, 1999.
• [12] Damian Repke and Wojciech Rytter. On semi-perfect de Bruijn words. Theoretical Computer Science, 720:55–63, 2018.
• [13] Camille Flye Sainte-Marie. Question 48. L’interm. des math., 1:107–110, 1894.
• [14] Moshe Schwartz, Yotam Svoray, and Gera Weiss. On embedding de Bruijn sequences by increasing the alphabet size. arXiv:1906.06157, 2019.
• [15] Gabriel Thibeault. Greatest de Bruijn sequences in many colors, ongoing 2018. Tesis de Licenciatura en Ciencias de la Computación, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires. Director: Verónica Becher.