An Efficient Generalized Shift-Rule for the Prefer-Max De Bruijn Sequence

01/30/2018 ∙ by Gal Amram, et al. ∙ 0

One of the fundamental ways to construct De Bruijn sequences is by using a shift-rule. A shift-rule receives a word as an argument and computes the digit that appears after it in the sequence. An optimal shift-rule for an (n,k)-De Bruijn sequence runs in time O(n). We propose an extended notion we name a generalized-shift-rule, which receives a word, w, and an integer, c, and outputs the c digits that come after w. An optimal generalized-shift-rule for an (n,k)-De Bruijn sequence runs in time O(n+c). We show that, unlike in the case of a shift-rule, a time-optimal generalized-shift-rule allows the entire sequence to be constructed efficiently. We provide a time-optimal generalized-shift-rule for the well-known prefer-max and prefer-min De Bruijn sequences.

1 Introduction

De Bruijn sequences were rediscovered many times over the years, starting in 1894 with Flye-Sainte Marie [10], and finally by De Bruijn himself in 1946 [5]. For two positive integers, $n$ and $k$, an $(n,k)$-De Bruijn ($(n,k)$-DB, for abbreviation) sequence is a cyclic sequence over the alphabet $\{0,\dots,k-1\}$ in which every word of length $n$ over $\{0,\dots,k-1\}$ appears exactly once as a subword. It is cyclic in the sense that some words are generated by concatenating the suffix of length $m<n$ of the sequence with the prefix of length $n-m$.

A construction for a family of $(n,k)$-DB sequences is an algorithm that receives the two arguments, $n$ and $k$ (occasionally, $k$ is fixed and only $n$ is given as argument), and outputs an $(n,k)$-DB sequence. Obviously, a trivial time lower bound for a construction is $\Omega(k^n)$, as this is the exact length of an $(n,k)$-DB sequence. Many constructions for various families of De Bruijn sequences are known (for example, [1, 2, 6, 9, 14, 13, 16, 21, 24, 25]), and some of them are also time optimal.

A particularly famous family of $(n,k)$-DB sequences is the prefer-max family [11, 21], which is constructed by the well-known “granddaddy” greedy algorithm [21] (see also [17, Section 7.2.1.1]). The algorithm constructs the sequence digit by digit, where at each step, the maximal value is added to the initial segment constructed so far (assuming the alphabet is linearly ordered), so that the new suffix of length $n$ does not appear elsewhere. A symmetric approach produces the prefer-min DB sequence. Besides this highly inefficient algorithm, many other constructions for the prefer-max and prefer-min sequences have been proposed in the literature. A classical result by Fredricksen and Kessler [12], and Fredricksen and Maiorana [13], shows that the prefer-max sequence is in fact a concatenation of certain (Lyndon) words, a result we use in this work. This block construction was later proved to be time optimal in [23]. Another efficient block concatenation construction was suggested in [22].

A common and important way of generating DB sequences is by using a shift-rule (also named a shift-register). A shift-rule for an $(n,k)$-DB sequence receives a word of length $n$, $w$, as an input, and outputs the digit that follows $w$ in the sequence. Here, $n$ and $k$ are parameters of the algorithm. Obviously, a shift-rule must run in $\Omega(n)$ time since it must read every digit in its input to produce the correct output. Shift-rules are important since, unlike block constructions, they can be applied to words that appear in the middle of the sequence.

Several efficient shift-rules for DB sequences are known for $k=2$ (see [24] for a comprehensive list). However, only recently were efficient shift-rules discovered for non-binary sequences. Sawada et al. [25] introduced a new family of DB sequences and provided a linear time shift-rule for these sequences. Amram et al. [2] introduced an efficient shift-rule for the famous prefer-max and prefer-min DB sequences.

We note that, generally, a construction for a DB sequence provides an exponential-time shift-rule, since, on many inputs, it is required to construct almost the whole sequence to find the desired digit the shift-rule should output. On the other hand, a shift-rule for a DB sequence provides a construction in $O(nk^n)$ time, by finding the next digits one by one, which is not an optimal approach.

We see that neither of these two methods, a general construction and a shift-rule, dominates the other, and we propose here a third way, which generalizes both methods, that we name a Generalized-Shift-Rule (GSR for abbreviation). A GSR for an $(n,k)$-DB sequence is an algorithm that receives two arguments: a word, $w$, of length $n$, and an integer, $c$. The GSR outputs the $c$ letters that follow $w$ in the sequence. Since the algorithm must read its input and must write $c$ letters, $\Omega(n+c)$ is a trivial time lower bound for a GSR. An optimal GSR provides an optimal shift-rule when used only with $c=1$. In addition, an optimal GSR provides an optimal construction by invoking it with $c=k^n$, or by invoking it $k^n/n$ times with $c=n$ (for example).
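To make the interface concrete, the following minimal sketch implements a GSR by brute force on an explicitly given cyclic sequence. It is only a reference specification and not the algorithm of this paper: it scans the whole sequence, so its running time is proportional to the sequence length rather than to the length of the input and output. The function name `gsr_bruteforce` and the list-of-digits representation are assumptions made for illustration.

```python
def gsr_bruteforce(seq, w, c):
    """Return the c digits that follow the word w in the cyclic sequence seq.

    seq: list of digits, read cyclically (assumed to be a De Bruijn sequence)
    w:   list of digits (an n-word)
    c:   number of digits to output
    """
    total, n = len(seq), len(w)
    for start in range(total):
        # Compare w with the cyclic window of length n that begins at `start`.
        if all(seq[(start + j) % total] == w[j] for j in range(n)):
            return [seq[(start + n + j) % total] for j in range(c)]
    raise ValueError("w does not occur in the cyclic sequence")


# The prefer-min sequence for n = 3, k = 2, written out by hand.
PREFER_MIN_3_2 = [0, 0, 0, 1, 0, 1, 1, 1]

shift = gsr_bruteforce(PREFER_MIN_3_2, [0, 1, 0], 1)    # c = 1 acts as a shift-rule
wrapped = gsr_bruteforce(PREFER_MIN_3_2, [1, 1, 0], 3)  # w wraps around the end
```

With `c` equal to the sequence length, the same call reproduces the whole sequence (as a rotation that starts right after `w`).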

Although a GSR is defined here for the first time, researchers have noted the advantages behind this notion, and mentioned that their shift-rules possess the properties we seek in this paper. In [25], Sawada et al. described a shift-rule with $O(1)$-amortized time per bit. As this seems to contradict the trivial time lower bound mentioned earlier, this statement requires clarification. The shift-rule proposed in [25] has the interesting property that after using it once, it can be invoked more times and, by carefully retaining data from one invocation to another, it can produce the next digits in $O(1)$-amortized time. Hence, in fact, Sawada et al. noted that their shift-rule also forms a time-optimal GSR. A similar remark can be found in [24].

In this paper, we present an optimal GSR for the well-known prefer-max and prefer-min DB sequences. Our construction relies on the classical result of [13], namely, that the prefer-max $(n,k)$-DB sequence and the prefer-min $(n,k)$-DB sequence can be constructed by concatenating certain Lyndon words. Our GSR construction takes advantage of this result in the following manner.

The prefer-min sequence is a concatenation of Lyndon words: $L_0L_1\cdots L_{m-1}$. In Section 4 we note that a GSR can be constructed by solving a similar problem, named filling-the-gap. The problem is to find a word, $x$, that completes $w$ into a suffix of $L_0\cdots L_i$ for some $i$. If $x$ is of length smaller than $c$, we use another algorithm, presented in Section 3, which finds the Lyndon words that follow $L_i$ in the block-construction of the prefer-min DB sequence. In Section 5 we present a filling-the-gap algorithm which provides, as explained, a GSR for the prefer-min and prefer-max sequences. In addition, in Section 2 we define notations used throughout the paper, and a discussion is given in Section 6.

2 Preliminaries

For an integer $k>0$, consider the alphabet $[k]=\{0,\dots,k-1\}$, ordered naturally by $0<1<\cdots<k-1$. Hence, the set of all words over $[k]$, $[k]^*$, is totally ordered by the lexicographic order, which we simply denote by $<$. We say that a word, $w$, is an $n$-word if $|w|=n$. Furthermore, the $i$-prefix of $w$ is the prefix of $w$ of length $i$, and the $i$-suffix of $w$ is the suffix of $w$ of length $i$. These notions are defined, of course, only when $i\le|w|$.

A word, $w'$, is a rotation of a word, $w$, if $w=xy$ and $w'=yx$. In addition, $w'$ is a non-trivial rotation of $w$ if $x\neq\varepsilon$ and $y\neq\varepsilon$. Note that a word can be equal to some of its non-trivial rotations. This happens when $w=x^t$ for some non-empty word, $x$, and an integer $t>1$. In this case, $w$ is said to be periodic. A Lyndon word [20] is a non-empty word that is strictly smaller than all its non-trivial rotations. Hence, in particular, a Lyndon word is aperiodic.
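The definitions of this paragraph translate directly into code. The sketch below represents words as Python tuples of digits (an assumption made for illustration; tuples of equal length compare lexicographically) and tests the Lyndon property naively by examining every non-trivial rotation, which takes quadratic time.

```python
def nontrivial_rotations(w):
    """All rotations yx of w = xy in which both x and y are non-empty."""
    return [w[i:] + w[:i] for i in range(1, len(w))]


def is_periodic(w):
    """A word is periodic if it equals one of its non-trivial rotations."""
    return any(r == w for r in nontrivial_rotations(w))


def is_lyndon(w):
    """Non-empty and strictly smaller than every non-trivial rotation."""
    return len(w) > 0 and all(w < r for r in nontrivial_rotations(w))
```

For instance, `(0, 0, 1)` is a Lyndon word, while `(0, 1, 0, 1)` is periodic and therefore not a Lyndon word.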

The prefer-max $(n,k)$-DB sequence is a cyclic sequence constructed by a greedy algorithm that starts with $0^n$, and repeatedly adds the largest possible digit in $[k]$ so that no $n$-word appears twice as a subword of this sequence, until the sequence length is $k^n$, and then rotates the obtained sequence to the left $n$ times. As an example, for $n=3$ and $k=2$, the greedy procedure produces the sequence $00011101$, and the prefer-max $(3,2)$-DB sequence is $11101000$. Analogously, the prefer-min $(n,k)$-DB sequence is produced by a greedy process which starts with $(k-1)^n$, and repeatedly concatenates the smallest possible digit, so that no repetition occurs, and afterwards rotates the resulting sequence to the left $n$ times.

We note that the prefer-min $(n,k)$-DB sequence and the prefer-max $(n,k)$-DB sequence can be derived one from the other, by replacing each digit, $\sigma$, with $k-1-\sigma$. Therefore, a GSR for one of these sequences can be easily transformed into a GSR for the other one as well. We present here a GSR algorithm for the prefer-min $(n,k)$-DB sequence.
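The greedy definitions and the digit-replacement relation can be checked on small parameters. The sketch below is a direct, exponential-time transcription of the greedy process, assuming $n,k\ge 2$, a start word made of $n$ copies of the extreme digit, and a final left rotation by $n$ positions; the function names are chosen here for illustration.

```python
def greedy_db(n, k, prefer_max=True):
    """Greedy prefer-max / prefer-min sequence over {0..k-1}, as a list of digits."""
    start = 0 if prefer_max else k - 1
    digits = range(k - 1, -1, -1) if prefer_max else range(k)
    seq = [start] * n
    seen = {tuple(seq)}                       # n-words already used
    while len(seq) < k ** n:
        tail = tuple(seq[len(seq) - n + 1:])  # last n-1 digits
        for d in digits:                      # try digits in preference order
            if tail + (d,) not in seen:
                seen.add(tail + (d,))
                seq.append(d)
                break
        else:
            raise RuntimeError("greedy step has no admissible digit")
    return seq[n:] + seq[:n]                  # rotate left n times


def complement(seq, k):
    """Replace every digit s by k-1-s."""
    return [k - 1 - s for s in seq]


def is_de_bruijn(seq, n, k):
    """Every n-word occurs exactly once as a cyclic window of seq."""
    total = len(seq)
    windows = {tuple(seq[(i + j) % total] for j in range(n)) for i in range(total)}
    return total == k ** n and len(windows) == total
```

For $n=3$ and $k=2$, `greedy_db(3, 2, True)` yields `11101000` and `greedy_db(3, 2, False)` yields `00010111`, and each is the complement of the other.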

From this point on, we refer to $n$ and $k$ as fixed, unknown parameters, larger than $1$ (to avoid trivialities). We measure the time complexity of all algorithms given here in terms of the parameter $n$, assuming that arithmetic operations can be computed in constant time, regardless of the size of the numbers involved.

Let $\mathcal{L}_0,\mathcal{L}_1,\dots$ be the (finite) sequence of all Lyndon words over $[k]$ of length at most $n$, ordered lexicographically. Let $m$ be the number of all Lyndon words over $[k]$ whose length divides $n$. Let $L_0,\dots,L_{m-1}$ be an enumeration of all Lyndon words whose length divides $n$, ordered lexicographically.

For a Lyndon word $L_i$, let $r_i=n/|L_i|$. Since $|L_i|$ divides $n$, $r_i$ is a positive integer. Note that every $L_i$ is equal to some $\mathcal{L}_j$ where $i\le j$. The main result of [13] (with a straightforward adaptation) is:

Theorem 1.

The prefer-min $(n,k)$-DB sequence is: $L_0L_1\cdots L_{m-1}$.

As an example, for $n=3$ and $k=2$ we concatenate in increasing order all Lyndon words of length one or three. We get the following sequence, decomposed into Lyndon words: $0\cdot001\cdot011\cdot1$.

As said, our strategy in constructing a GSR for the prefer-min sequence is to fill the gap between the input, $w$, and the end of a word $L_i$ in the sequence, and then concatenate Lyndon words until we find the required $c$ digits that follow $w$. To this end, we refer to the sequence $L_0,\dots,L_{m-1}$ as cyclic, meaning that, for $i\ge m$ and $j=i \bmod m$, we set $L_i=L_j$.
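Theorem 1 can be tried out on small parameters. The sketch below enumerates all Lyndon words of length at most $n$ in lexicographic order with the standard successor step (extend periodically to length $n$, strip trailing maximal digits, increase the last digit), keeps those whose length divides $n$, and concatenates them. The function names are assumptions made for illustration.

```python
def lyndon_words_up_to(n, k):
    """Yield all Lyndon words over {0..k-1} of length <= n, in lexicographic order."""
    w = [0]
    while w:
        yield tuple(w)
        period = len(w)
        w = [w[i % period] for i in range(n)]  # extend periodically to length n
        while w and w[-1] == k - 1:            # strip trailing maximal digits
            w.pop()
        if w:
            w[-1] += 1                         # increase the last digit


def prefer_min_blocks(n, k):
    """The Lyndon words whose length divides n, in lexicographic order."""
    return [w for w in lyndon_words_up_to(n, k) if n % len(w) == 0]


def prefer_min_by_concatenation(n, k):
    """Concatenate the blocks, as in Theorem 1."""
    return [d for block in prefer_min_blocks(n, k) for d in block]
```

For $n=3$ and $k=2$ the blocks are `0`, `001`, `011`, `1`, giving `00010111`; for $n=3$ and $k=3$ there are eleven blocks with a total length of $27$.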

3 An Efficient Lnext() Algorithm

As a first step in constructing a GSR algorithm, we analyze a relatively simple case. By Theorem 1, for every $i<m$, $L_0\cdots L_i$ is a prefix of the prefer-min sequence. We consider the case where we are given an $n$-word, $w$, that happens to be a suffix of $L_0\cdots L_i$. To find the next $c$ digits, we need to find words $L_{i+1},\dots,L_{i+j}$ so that the length of this sequence is at least $c$.

For dealing with this restricted case, we design an algorithm that computes, efficiently, the function $L_i\mapsto L_{i+1}$. Moreover, for technical reasons that will arise later, we also want to apply the algorithm to Lyndon words whose length does not necessarily divide $n$. Therefore, for a Lyndon word, $w$, we define $Lnext(w)$ to be the lexicographically smallest $L_i$ such that $w<L_i$, and $Lnext(k-1)=L_0$ (that is, $Lnext(k-1)=0$). In this section, we present an $Lnext$ algorithm with $O(n)$ time complexity.

Proposition 2.

Algorithm 3 computes $Lnext$ in $O(n)$ time.

In [8], Duval describes an algorithm to build the next Lyndon word from a given one.

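Input: A Lyndon word, $w$ (other than the single digit $k-1$)
Output: the Lyndon word that follows $w$ among the Lyndon words of length at most $n$

1: $x\leftarrow w^tw'$, where $w'$ is a proper prefix of $w$ and $|w^tw'|=n$
2: remove the largest suffix of $x$ of the form $(k-1)^l$
3: increase the last digit of $x$ by one
4: return $x$
Algorithm 1 Duval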

In [8], Duval proved:

Theorem 3.

On every input $\mathcal{L}_i$, Algorithm 1 returns $\mathcal{L}_{i+1}$ in $O(n)$ time.

Note that the length of the output of the algorithm may not divide $n$. However, we use it to construct a naive algorithm which achieves that goal, with a runtime complexity of $O(n^2)$. We first describe this naive version, and then improve it to run in linear time.
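The three steps of Algorithm 1 can be sketched in a few lines. Words are represented here as tuples of digits, and the name `duval_next` as well as the explicit parameters `n` and `k` are assumptions made for illustration.

```python
def duval_next(w, n, k):
    """Next Lyndon word of length <= n after the Lyndon word w.

    w must be a Lyndon word over {0..k-1} other than the single digit k-1.
    """
    # Step 1: repeat w periodically until the length is exactly n.
    x = [w[i % len(w)] for i in range(n)]
    # Step 2: remove the longest suffix made only of the maximal digit k-1.
    while x and x[-1] == k - 1:
        x.pop()
    if not x:
        raise ValueError("w is the last Lyndon word; it has no successor")
    # Step 3: increase the last remaining digit by one.
    x[-1] += 1
    return tuple(x)
```

Starting from `(0,)` with $n=3$ and $k=2$, repeated calls produce `001`, `01`, `011`, `1`, which is the full list of binary Lyndon words of length at most three.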

Input: a Lyndon word,
Output:

1:if  then
2:     return
3:end if
4:
5:if  then
6:     return
7:end if
8:if  then
9:     
10:end if// now
11:
12:while  do
13:     ,
14:end while
15:return .
Algorithm 2 Naive-Lnext

Note that, for every Lyndon word $w$, Algorithm 2 outputs $Lnext(w)$. At each iteration of the loop in lines 12-14, the algorithm invokes Duval’s algorithm, until it finds a Lyndon word whose length divides $n$. This establishes a worst case runtime of $O(n^2)$.

The reader may note that the if instructions in lines 5-7 and lines 8-10 can be omitted. However, we aim to construct a linear-time algorithm, and we do that by modifying the while loop. Then, it will be important that the loop acts on words whose length is larger than $n/2$. Thus, lines 5-7 and 8-10 are added to simplify the comparison between this naive version and our linear time version.
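The naive approach can be sketched as follows. This is a simplified variant rather than a line-by-line transcription of Algorithm 2: it omits the two preparatory if instructions and simply repeats Duval's step until the length of the current word divides $n$. The helper `duval_next` and all parameter names are assumptions made for illustration.

```python
def duval_next(w, n, k):
    """Next Lyndon word of length <= n after the Lyndon word w (w != (k-1,))."""
    x = [w[i % len(w)] for i in range(n)]
    while x and x[-1] == k - 1:
        x.pop()
    x[-1] += 1
    return tuple(x)


def naive_lnext(w, n, k):
    """Smallest Lyndon word above w whose length divides n (cyclically)."""
    if w == (k - 1,):
        return (0,)              # wrap around to the first Lyndon word
    w = duval_next(w, n, k)
    while n % len(w) != 0:       # skip words whose length does not divide n
        w = duval_next(w, n, k)
    return w
```

For $n=4$ and $k=2$, `naive_lnext((0, 0, 0, 1), 4, 2)` skips `001` and returns `0011`. The input itself may have a length that does not divide $n$, as in `naive_lnext((0, 0, 1), 4, 2)`.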

To improve the runtime of this algorithm, we identify cases in which the outcome of several loop iterations can be computed directly. These are the cases in which calling Duval’s algorithm again and again results in concatenating the same sequence several times. For illustration, consider the word , where . The Lyndon words that follow are:

Instead of applying times, we can save time by computing , and go to without traversing all words in that list. This allows us to compute in linear time, as we do in Algorithm 3.

Input: a Lyndon word,
Output:

1:if  then
2:     return
3:end if
4:
5:if  then
6:     return
7:end if
8:if  then
9:     
10:end if// now
11:
12:while  do
13:     
14:      prefix of of length
15:      the word obtained from by removing its suffix that includes only occurrences of , and increasing the last digit by
16:     
17:     ,
18:end while
19:return
Algorithm 3 Lnext

To prove the correctness of Algorithm 3, we show that Algorithm 2 and Algorithm 3 produce the same output on every legal input. We start with two observations, derived from Duval’s algorithm.

Corollary 4.

, if , then .

Corollary 5.

, if , then .

From these two observations we can deduce the following conclusion, discussing the similarity between the two algorithms when entering the while loop:

Corollary 6.

The following invariant holds for both Algorithm 2 and Algorithm 3 whenever the while loop starts: and .

The next lemma shows that every execution of the while loop in Algorithm 3 corresponds to several executions of the while loop in Algorithm 2.

Lemma 7.

Let be a Lyndon word so that . Let be as in lines 9-12 of Algorithm 3. Then, for each :

  1. is a Lyndon word.

  2. If , then

Proof.

The proof is by induction on . If , item 1 holds as is a Lyndon word, and if item 1 holds by applying the induction hypothesis on item 2 with . It remains to prove that item 2 holds thus we assume that .

Write for a word and . Hence, the -prefix of , , is for some . Namely, for some word, , so that (note that since , the -prefix of is defined). Thus, . Let . Note that since , (this holds because is the largest integer so that ).

To summarize, and . Hence, the -prefix of is for some . Therefore, is obtained by concatenating to , removing the suffix: and increasing by one. Namely, , as required. ∎

Lemma 7 states that each execution of the while loop of Algorithm 3 corresponds to executions of the while loop of Algorithm 2. Therefore, we conclude:

Corollary 8.

For every Lyndon word $w$, the output of Algorithm 2 is equal to the output of Algorithm 3.

It is left to prove that our runtime is linear. For this purpose, consider an execution of Algorithm 3 on input , and assume that (otherwise, the algorithm terminates after steps). Let us denote by and the values assigned to variables and , respectively, at the -th iteration of the while loop. In addition, let be the value of variable before entering the while loop for the first time, and let denote the number of loop iterations at the execution.

Lemma 9.

For ,

Proof.

By induction on . The base case is trivial, as . For the induction step, take and note that , where and is the largest integer so that . Therefore, and hence, by the induction hypothesis we get: . ∎

Relying on this lemma, we can now analyze the runtime of our algorithm, and prove Proposition 2.

Proof.

By Corollary 8, the algorithm computes correctly. We shall prove that the algorithm runs in time. If , the execution terminates in constant number of steps and we are done. Otherwise, by Lemma 9, we have:

  1. .

  2. For every , .

In each iteration of the while loop, finding and are the most time-consuming steps, each costs . Therefore, the global runtime is . ∎

4 A GSR Algorithm Based on a Reduction to FTG

The fact that $Lnext$ can be computed efficiently is useful for designing an efficient GSR algorithm. Given an $n$-word, $w$, assume that $w$ is a suffix of $L_0\cdots L_i$. In this case, several invocations of our $Lnext$ algorithm produce the $c$-word that follows $w$ in the prefer-min sequence. To take this approach in general, it is first required to find a Lyndon word $L_i$ and a word $x$ so that $wx$ is a suffix of $L_0\cdots L_i$. This implies that a GSR algorithm for the prefer-min sequence can be derived from a solution to another problem we propose in this section, which we name Filling-The-Gap (FTG for abbreviation).

Definition 10.

For an $n$-word, $w$, $ftg(w)=(x,L_i)$ if

  1. $wx$ is a suffix of $L_0\cdots L_i$.

  2. If $wx'$ is a suffix of $L_0\cdots L_j$, then $i\le j$.

We leave it to the reader to verify that $ftg$ is well defined, meaning that for every $n$-word, only a single pair, $(x,L_i)$, satisfies the conditions of Definition 10. We remark that it is possible that $ftg(w)=(x,L_i)$ where $i\ge m$. This occurs in the case where $w$ is a concatenation of a suffix of the prefer-min sequence with a prefix of it. For example, if , then since is a suffix of .

Note that $ftg(w)$ can be trivially computed by concatenating Lyndon words and searching for $w$. However, this naive solution is highly inefficient as $w$ may appear anywhere in the prefer-min sequence. Hence, for constructing an efficient GSR in the way described above, we need an efficient FTG-algorithm.

There is another issue concerning the suggested approach which requires attention. If $ftg(w)=(x,L_i)$ and $|x|<c$, then for computing the $c$-word that comes after $w$ we need to invoke Algorithm 3 several times. It is required to explain why the number of Lyndon words we concatenate is proportional to the length of the suffix we seek. More precisely, we need to show that all invocations of Algorithm 3 together consume $O(n+c)$ time. This is settled by the next lemma, which claims that there are no two consecutive words, $L_i,L_{i+1}$, both of length smaller than $n$:

Lemma 11.

For every $i<m-1$, if $|L_i|<n$, then $|L_{i+1}|=n$.

Proof.

Write and , for . Since and , . By Corollary 4, . Hence, if , then , since is the only number in range , which is larger than and divides . Otherwise, the length of each word among is, in particular, smaller than . Thus, by Corollary 5, , and since divides , by the same argument as before, we get that . ∎
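The claim that two consecutive blocks are never both shorter than $n$ can be checked empirically for small parameters. The sketch below lists the Lyndon words whose length divides $n$ in lexicographic order and inspects every adjacent pair within one pass over the list; the wrap-around pair, from the last block back to the first, is not examined. The function names are assumptions made for illustration, and $n,k\ge 2$ is assumed.

```python
def lyndon_blocks(n, k):
    """Lyndon words over {0..k-1} whose length divides n, in lexicographic order."""
    blocks, w = [], [0]
    while w:
        if n % len(w) == 0:
            blocks.append(tuple(w))
        # Standard successor step: extend periodically, strip k-1's, increment.
        w = [w[i % len(w)] for i in range(n)]
        while w and w[-1] == k - 1:
            w.pop()
        if w:
            w[-1] += 1
    return blocks


def short_blocks_are_isolated(n, k):
    """True if no two adjacent blocks in the list both have length < n."""
    blocks = lyndon_blocks(n, k)
    return all(len(a) == n or len(b) == n for a, b in zip(blocks, blocks[1:]))
```

For $n=4$ and $k=2$ the blocks are `0`, `0001`, `0011`, `01`, `0111`, `1`, and each short block is followed by a block of length four.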

We can now present, in Algorithm 4, a GSR algorithm based on a reduction to the FTG problem.

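Input: $w$ ($|w|=n$), $c$
Output: a word of length $c$ that appears after $w$ at prefer-min

1: $(x,L)\leftarrow ftg(w)$
2: while $|x|<c$ do
3:      $L\leftarrow Lnext(L)$
4:      $x\leftarrow xL$
5: end while
6: return the $c$-prefix of $x$
Algorithm 4 generalized_shift_rule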

Consider the while loop in Algorithm 4 and use Lemma 11 to conclude that after two loop iterations, which consume $O(n)$ time, $|x|$ increases by at least $n$ digits. It follows that the loop halts in $O(n+c)$ steps and hence, we get the following:

Proposition 12.

Assume that $ftg(w)$ can be computed in $O(n)$ time. Then, Algorithm 4 forms a GSR for the prefer-min $(n,k)$-DB sequence with $O(n+c)$ time complexity.
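The reduction can be sketched end to end if the two efficient subroutines are replaced by brute-force stand-ins. In the sketch below, `ftg_bruteforce` concatenates blocks until the input word appears, and `naive_lnext` repeats Duval's step; neither meets the time bounds of this paper. The reading of the filling-the-gap pair as "the first block in which an occurrence of $w$ ends, together with the remainder of that block" is an assumption, as are all names.

```python
def duval_next(w, n, k):
    """Next Lyndon word of length <= n after the Lyndon word w (w != (k-1,))."""
    x = [w[i % len(w)] for i in range(n)]
    while x and x[-1] == k - 1:
        x.pop()
    x[-1] += 1
    return tuple(x)


def naive_lnext(w, n, k):
    """Next Lyndon word whose length divides n, wrapping from k-1 back to 0."""
    if w == (k - 1,):
        return (0,)
    w = duval_next(w, n, k)
    while n % len(w) != 0:
        w = duval_next(w, n, k)
    return w


def ftg_bruteforce(w, n, k):
    """Return (x, L): w ends inside block L, and x is the rest of that block."""
    w = tuple(w)
    block, s = (0,), [0]
    while len(s) <= 2 * k ** n:  # two passes suffice for wrapped words
        # Try every end position of w that falls inside the newest block.
        for end in range(max(n, len(s) - len(block) + 1), len(s) + 1):
            if tuple(s[end - n:end]) == w:
                return tuple(s[end:]), block
        block = naive_lnext(block, n, k)
        s.extend(block)
    raise ValueError("w is not an n-word over the alphabet")


def generalized_shift_rule(w, c, n, k):
    """The c digits that follow w in the prefer-min sequence."""
    x, block = ftg_bruteforce(w, n, k)
    x = list(x)
    while len(x) < c:            # append whole blocks until c digits exist
        block = naive_lnext(block, n, k)
        x.extend(block)
    return x[:c]
```

For $n=3$ and $k=2$, the word `010` ends inside the block `011`, so the gap is `11`, and one more block supplies the third digit of `generalized_shift_rule((0, 1, 0), 3, 3, 2)`.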

5 An FTG Algorithm

In this section we construct an efficient FTG-algorithm. This is done in two steps. First, we define the notion of a cover of an $n$-word $w$, and show how a cover for $w$ can be transformed into $ftg(w)$ efficiently. Then, we show how to find a cover for an $n$-word, $w$, in linear time.

5.1 Finding $ftg(w)$ By Means of a Cover

The FTG problem, applied to an $n$-word, $w$, is to extend $w$ into a suffix of $L_0\cdots L_i$. For solving this problem, we introduce a similar notion.

Definition 13.

For an -word, , if

  1. is a suffix of .

  2. If is a suffix of , then .

In addition, we say that is covered by , if for some word .

Here too, we leave it to the reader to verify that is well-defined. We focus on -words different from for technical reasons, as it allows us to provide a simpler presentation of our results. Otherwise, many parts of our analysis would have to be rephrased, and some proofs rewritten, to include more details. However, it is simple to show that the FTG algorithm we provide at the end of this section works for every -word.

The two notions, and are similar, but different. The difference between these notions can be bridged by observing that if is a cover for , then is a subword of . To clear this issue, we deal with the relationships between two consecutive Lyndon words in the following Lemma.

Lemma 14.

For every , if , then , for some -word .

Proof.

Let and . By an examination of Algorithm 1, and since does not start with (the only such word is , and ), we conclude that , where is constructed by removing the suffix of that includes only occurrences of , and increasing the last digit of the obtained word by one. A simple inductive argument shows that for , for some non-empty words . By Lemma 11, thus , and is an -word. ∎

Note that it follows that for each , is a prefix of prefer-min, as for some , . Now we turn to deal with the relationships between and .

Lemma 15.

Assume that for an -word, .

  1. If , where is the -prefix of .

  2. Otherwise, , where is the -suffix of .

Proof.

We start by proving the first item. As , is a suffix of . Since , is a suffix of . Moreover, the minimality of guarantees that is not a subword of thus as required.

We turn to prove the second item, in which is a suffix of and . Then, it follows that since otherwise we get the false equation: . Moreover, as , . Hence, Lemma 14 can be invoked and we get that where is the -suffix of . As a result, is a suffix of . Furthermore, since is a suffix of where , is not a suffix of . Therefore, as required. ∎

Using the above, Algorithm 5 transforms $cover(w)$ into $ftg(w)$ in linear time.

Input: a pair
Output:

1:
2:if  then
3:     -prefix of
4:     return
5:end if
6:-suffix of
7:return
Algorithm 5 cover_to_FTG

We conclude this section with the next corollary. Its first item follows by Lemma 15, and its second item follows immediately from the code.

Corollary 16.

Let be an -word.

  1. If and , then Algorithm 5 returns on the input .

  2. If , then Algorithm 5 returns on the input .

In the next subsection we show how to compute efficiently.

5.2 Computing

In this section we show how to compute $cover(w)$ efficiently. Assume that an $n$-word, $w$, is covered by . Then, $w$ is a subword of . As described below in Algorithm 6, in some cases, we compute and use it to find , and in other cases we compute directly the suffix of that follows $w$. The way this goal is achieved relies on the analysis we provide here, which we divide into two parts. First, we show how to construct from , by concatenating certain words to . Then, we present a structural characterization of which will help us to compute .

5.2.1 Modifying into

Assume that an -word, is covered by . Hence, is a subword of , but not of . Clearly, Lemma 14 implies that is a prefix of , but what is the difference between these two sequences? The first aim of our analysis is to show how to construct from , by concatenating a suffix to .

Definition 17.

Let . We define a sequence of words: and a sequence of indices: by induction, where indicates the amount of characters left to calculate in step : Write and assume that were defined, together with .

  • If , and we are done.

  • Otherwise, is obtained as follows: take the prefix of of size , remove its suffix that includes only occurrences of , and increase the last digit by one. In addition, let be .

We now show how the words serve as building blocks for constructing from . We divide the analysis into three lemmas, to deal with the different cases.

Lemma 18.

Take , and consider the words , as defined in Definition 17. If , then:

  1. .

  2. For each , .

Proof.

The first item follows immediately from the definition of . For proving the second item, note that . But since , . Let be the -prefix of , where and . Hence, . Also, by Definition 17, , which completes the proof. ∎

Lemma 19.

Take , and consider the words as defined in Definition 17. If and , then:

  1. .

  2. and .

Proof.

For the first item, note that by Definition 17 and by the fact that (namely, ), . Moreover, , as . We turn to prove the second item. Write , where . Hence, since , . We show now by induction that for every . The induction basis trivially holds, as . Assume, now, that and . Since , . But . Since , it follows that . Therefore, the -prefix of is for some . It follows that as required. Moreover, . Thus, which proves that . ∎

Lemma 20.

Take , and consider the words , as defined in Definition 17. If and , then:

  1. .

  2. Let be the maximal integer such that (equality cannot hold since ). Then, and .

  3. For as defined above, if , then

Proof.

As in the two former lemmas, the first item trivially holds by Definition 17. For the second item, Let be maximal such that . Then, . By using the same argument as in the proof of item 2 of Lemma 19, it can be shown that . We prove that . Since , thus . By the maximality of