Arbitrary-length analogs to de Bruijn sequences

08/17/2021 ∙ by Abhinav Nellore, et al. ∙ The University of Texas at Austin

Let α be a length-L cyclic sequence of characters from a size-K alphabet 𝒜 such that the number of occurrences of any length-m string on 𝒜 as a substring of α is ⌊ L / K^m ⌋ or ⌈ L / K^m ⌉. When L = K^N for any positive integer N, α is a de Bruijn sequence of order N, and when L ≠ K^N, α shares many properties with de Bruijn sequences. We describe an algorithm that outputs some α for any combination of K ≥ 2 and L ≥ 1 in O(L) time using O(L log K) space. This algorithm extends Lempel's recursive construction of a binary de Bruijn sequence. An implementation written in Python is available at https://github.com/nelloreward/pkl.


1 Introduction

1.1 Preliminaries

This paper is concerned with necklaces, otherwise known as circular strings or circular words. A necklace is a cyclic sequence of characters; each character has a direct predecessor and a direct successor, but no character begins or ends the sequence. So if is said to be a necklace, and refer to the same necklace. In the remainder of this paper, the term string exclusively refers to a sequence of characters with a first character and a last character. A substring of a necklace is a string of contiguous characters whose length does not exceed the necklace’s length. So the set of length- substrings of the necklace is . A rotation of a necklace is a substring whose length is precisely the necklace’s length, and a prefix of a string is any substring starting at the string’s first character. So can be called a rotation of a necklace, and is a prefix of that rotation.

A de Bruijn sequence of order N on a size-K alphabet 𝒜 is a length-K^N necklace that includes every possible length-N string on 𝒜 as a substring [saint1894solution, de1946combinatorial, de1951circuits, de1975acknowledgement]. There are (K!)^(K^(N−1)) / K^N distinct de Bruijn sequences of order N on 𝒜 [de1951circuits]. (See the appendix for a brief summary of the curious history of de Bruijn sequences.) An example for and is the length- necklace

A de Bruijn sequence of order N on 𝒜 is optimally short in the sense that its length is K^N, and there are K^N possible length-N strings on 𝒜. But more is true: because any length-m string on 𝒜 is a prefix of each of K^(N−m) length-N strings on 𝒜 when m ≤ N, the sequence has precisely K^(N−m) occurrences of that length-m string as a substring. So in the example above, there are occurrences of , occurrences of , occurrences of , and occurrence of . Note by symmetry, K^(N−m) is also the expected number of occurrences of any length-m string on 𝒜 as a substring of a necklace of length K^N formed by drawing each of its characters uniformly at random from 𝒜. More generally, by symmetry, L / K^m is the expected number of occurrences of any length-m string on 𝒜 for m ≤ L as a substring of a necklace of arbitrary length L formed by drawing each of its characters uniformly at random from 𝒜.
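
The counting argument above can be checked by brute force. The following is a minimal sketch; the helper name `substring_counts` and the particular order-3 binary de Bruijn sequence are chosen here for illustration and are not taken from the paper.

```python
from collections import Counter

def substring_counts(necklace, m):
    """Count occurrences of every length-m substring of a cyclic sequence (m <= length)."""
    n = len(necklace)
    doubled = list(necklace) * 2  # doubling lets windows wrap around the end
    return Counter(tuple(doubled[i:i + m]) for i in range(n))

# Illustrative binary de Bruijn sequence of order 3 (K = 2, N = 3).
K, N = 2, 3
de_bruijn = [0, 0, 0, 1, 0, 1, 1, 1]

# For each m <= N, collect the distinct occurrence counts seen among length-m strings.
distinct_counts = {m: set(substring_counts(de_bruijn, m).values()) for m in range(N + 1)}
```

Under these assumptions, every length-m string should appear, and `distinct_counts[m]` should contain the single value K^(N−m).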

1.2 P_L^(K)-sequences

Consider a necklace defined as follows.

Definition 1.1 (P_L^(K)-sequence).

A P_L^(K)-sequence is a length-L necklace on a size-K alphabet 𝒜 such that the number of occurrences of any length-m string on 𝒜 for m ≤ L as a substring of the necklace is ⌊L / K^m⌋ or ⌈L / K^m⌉.

This paper proves by construction that a P_L^(K)-sequence exists for any combination of K ≥ 2 and L ≥ 1, giving an algorithm for sequence generation that runs in O(L) time using O(L log K) space.
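
Definition 1.1 translates directly into a brute-force test. The sketch below is an assumption-laden illustration: the function name `is_pkl_sequence` is hypothetical (it is not claimed to exist in the pkl repository), characters are assumed to be integers in `range(K)`, and the running time is far from the O(L) of the paper's generator.

```python
from collections import Counter

def is_pkl_sequence(necklace, K):
    """Check the floor/ceiling occurrence condition of Definition 1.1 for m = 0..L."""
    L = len(necklace)
    doubled = list(necklace) * 2  # doubling lets windows wrap around the end
    for m in range(L + 1):
        counts = Counter(tuple(doubled[i:i + m]) for i in range(L))
        lo = L // K ** m
        hi = -(-L // K ** m)  # ceiling division on integers
        # A string that never occurs has count 0, which is allowed only if lo == 0.
        if lo > 0 and len(counts) < K ** m:
            return False
        if any(c < lo or c > hi for c in counts.values()):
            return False
    return True
```

For instance, the binary necklaces 001 and 00011 pass this check, while 00001 fails because 0 occurs four times where at most three occurrences are allowed.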

When L = K^N for any positive integer N, ⌊L / K^m⌋ = ⌈L / K^m⌉ = K^(N−m) for m ≤ N, and a P_L^(K)-sequence collapses to a de Bruijn sequence of order N. When L ≠ K^N, a P_L^(K)-sequence is a natural interpolative generalization of a de Bruijn sequence: it is a necklace for which the number of occurrences of any length-m string on 𝒜 for m ≤ L as a substring differs by less than one from its expected value for a length-L necklace formed by drawing each of its characters uniformly at random from 𝒜. When this expected value is an integer, ⌊L / K^m⌋ = ⌈L / K^m⌉, and the number of occurrences of any length-m string on 𝒜 as a substring of a given P_L^(K)-sequence is equal to the number of occurrences of any other length-m string on 𝒜 as a substring of that sequence. When this expected value is not an integer, a P_L^(K)-sequence comes as close as it can to achieving the same end, as formalized in the proposition below.

Proposition 1.1.

Consider a P_L^(K)-sequence α on a size-K alphabet 𝒜. Load across length-m strings on 𝒜 for m ≤ L is balanced in α as follows.

  1. When L / K^m is an integer, each length-m string on 𝒜 occurs precisely L / K^m times as a substring of α.

  2. When L / K^m is not an integer, each of (L mod K^m) length-m strings on 𝒜 occurs precisely ⌈L / K^m⌉ times as a substring of α, and each of K^m − (L mod K^m) length-m strings on 𝒜 occurs precisely ⌊L / K^m⌋ times as a substring of α.

Proof.

Item 1 is manifestly true from ⌊L / K^m⌋ = ⌈L / K^m⌉ = L / K^m. To see why item 2 is true, consider the system of Diophantine equations

x ⌊L / K^m⌋ + y ⌈L / K^m⌉ = L,
x + y = K^m.    (1)

Above, x represents the number of length-m strings on 𝒜 for which there are ⌊L / K^m⌋ occurrences each as a substring of α, and y represents the number of length-m strings on 𝒜 for which there are ⌈L / K^m⌉ occurrences each as a substring of α. The first equation says the total number of occurrences of length-m strings as substrings of α is L, and the second says there is a total of K^m length-m strings on 𝒜. Note the equations hold only when L / K^m is nonintegral, that is, when ⌈L / K^m⌉ = ⌊L / K^m⌋ + 1. In this case, substituting the second equation into the first gives K^m ⌊L / K^m⌋ + y = L, so it is easily verified the unique solution to the system is x = K^m − (L mod K^m) and y = L mod K^m. ∎
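
The split in item 2 can be compared against direct counts on a small case. This is a sketch under stated assumptions: the binary necklace 00011 is an illustrative sequence picked here (it satisfies Definition 1.1 for K = 2 and L = 5), and `load_split` and `observed_split` are hypothetical helper names.

```python
from collections import Counter

def load_split(L, K, m):
    """Predicted (strings at the floor count, strings at the ceiling count) when K**m does not divide L."""
    r = L % K ** m
    return K ** m - r, r

def observed_split(necklace, K, m):
    """Measure the same pair by counting length-m substrings of the cyclic sequence."""
    L = len(necklace)
    doubled = list(necklace) * 2
    counts = Counter(tuple(doubled[i:i + m]) for i in range(L))
    lo, hi = L // K ** m, -(-L // K ** m)
    absent = K ** m - len(counts)  # strings with zero occurrences
    at_floor = sum(1 for c in counts.values() if c == lo) + (absent if lo == 0 else 0)
    at_ceiling = sum(1 for c in counts.values() if c == hi)
    return at_floor, at_ceiling

necklace, K = [0, 0, 0, 1, 1], 2
L = len(necklace)
# L = 5 is odd, so L / K**m is nonintegral for every m >= 1.
comparison = [(load_split(L, K, m), observed_split(necklace, K, m)) for m in range(1, L + 1)]
```

Each pair in `comparison` should hold two equal tuples; for m = 2, for example, three strings occur once and one string (00) occurs twice.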

An example for and is the sequence

(2)

To see why, note for and is for , for , between and for , and between and for any . Further, the sequence (2) contains, as a substring, precisely

  1. occurrences of each string in the set ;

  2. occurrences of each string in the set ;

  3. occurrences of each string in the set , which is of size , and occurrence of each string in the set , which is of size ;

  4. occurrence of each string in the set

    which is of size , and occurrences of each of the set of length- strings on not in , which is of size ; and

  5. or occurrences of any length- string for due to item 4 above.

1.3 P_L^(K)-sequences vs. other de Bruijn-like sequences

Two other arbitrary-length generalizations of de Bruijn sequences have appeared in the literature:

  1. What we call a Lempel-Radchenko sequence is a length-L necklace on a size-K alphabet 𝒜 such that every length-⌈log_K L⌉ string on 𝒜 has at most one occurrence as a substring of the necklace. As recounted by Yoeli in [yoeli1963counting], according to Radchenko and Filippov in [radchenko1959shifting], the existence of binary Lempel-Radchenko sequences of any length was first proved by Radchenko in his unpublished 1958 University of Leningrad PhD dissertation [radchenko1958code]. Other binary-case existence proofs were furnished by (1) Yoeli himself in [yoeli1961nonlinear] and [yoeli1962binary]; (2) Bryant, Heath, and Killick in [bryant1962counting] based on the work [heath1961chain] of Heath and Gribble; and (3) Golomb, Welch, and Goldstein in [golomb1959cycles]. Explicit constructions of arbitrary-length binary Lempel-Radchenko sequences were given by Etzion in 1986 [etzion1986algorithma]. In brief, Etzion’s approach is to join necklaces derived from the pure cycling register, potentially overshooting the target length L, and subsequently remove substrings as necessary in the resulting sequence according to specific rules to achieve the target length. This takes time per bit generated and uses space.

    The existence of Lempel-Radchenko sequences of any length for any alphabet size was proved in 1971 by Lempel in [lempel1971m]. In the special case where the alphabet size is a power of a prime number, one of two approaches for sequence construction effective at any length may be used: either (1) pursue the algebraic construction described by Hemmati and Costello in their 1978 paper [hemmati1978algebraic], or (2) cut out a length-L stretch of contiguous sequence from a de Bruijn sequence longer than L generated by a linear feedback function, as described in Chapter 7, Section 5 of Golomb’s text [golomb2017shift]. In his 2000 paper [landsberg2000feedback], Landsberg built on Golomb’s technique, explaining in the appendix how to use it to construct a Lempel-Radchenko sequence on an alphabet of arbitrary size. The idea is to decompose the desired alphabet size into a product of powers of pairwise-distinct primes, construct length-L sequences on alphabets of sizes equal to factors in this product with Golomb’s technique, and finally write a particular linear superposition of the sequences. The time and space requirements of Hemmati and Costello’s construction, when optimized, have gone unstudied in the literature. In general, Golomb’s technique gives a length-L Lempel-Radchenko sequence in time using space, and Landsberg’s generalization multiplies these complexities by the number of factors in the prime power decomposition of the alphabet size. Etzion suggested in his 1986 paper [etzion1986algorithma] that, using results from [etzion1986algorithmb], his algorithm generating a binary Lempel-Radchenko sequence could be extended to generate a Lempel-Radchenko sequence for any alphabet size, but he did not do so explicitly. It is reported on Joe Sawada’s website [sawadasite] that in their recent unpublished manuscript [gundogan2021cut], Gündoǧan, Sawada, and Cameron extend Etzion’s construction to arbitrary alphabet sizes, streamlining it so it generates each character in time using space. Sawada’s website further includes an implementation in C.

  2. A generalized de Bruijn sequence is a length-L Lempel-Radchenko sequence on a size-K alphabet 𝒜 such that every length-⌊log_K L⌋ string on 𝒜 is a substring of the sequence. Generalized de Bruijn sequences were recently introduced by Gabric, Holub, and Shallit in [gabric2019generalized, gabric2021maximal]. These papers also prove generalized de Bruijn sequences exist for any combination of K ≥ 2 and L ≥ 1. No work to date has given explicit constructions of arbitrary-length generalized de Bruijn sequences.

We prove the following.

Theorem 1.2.

A P_L^(K)-sequence is a generalized de Bruijn sequence and therefore also a Lempel-Radchenko sequence.

Proof.

Let α be a P_L^(K)-sequence on a size-K alphabet 𝒜. The theorem is true if and only if

  1. every length-⌈log_K L⌉ substring of α occurs precisely once in the sequence, and

  2. every length-⌊log_K L⌋ string on 𝒜 is a substring of α.

Item 1 is true because from Definition 1.1, α has ⌊L / K^⌈log_K L⌉⌋ ≤ 1 or ⌈L / K^⌈log_K L⌉⌉ = 1 occurrences of any length-⌈log_K L⌉ string on 𝒜 as a substring. Item 2 is true because from Definition 1.1, α has ⌊L / K^⌊log_K L⌋⌋ ≥ 1 or ⌈L / K^⌊log_K L⌋⌉ ≥ 1 occurrences of any length-⌊log_K L⌋ string on 𝒜 as a substring. ∎
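
The two weaker notions compared above can also be tested by brute force. The sketch below assumes integer characters in `range(K)`; all function names are hypothetical, and the ternary necklace 0011 is an example constructed here (not one of the paper's examples) of a Lempel-Radchenko sequence that is not a generalized de Bruijn sequence.

```python
from collections import Counter

def ceil_log(L, K):
    """Smallest n with K**n >= L, computed on integers to avoid float error."""
    n = 0
    while K ** n < L:
        n += 1
    return n

def floor_log(L, K):
    """Largest n with K**n <= L, computed on integers."""
    n = 0
    while K ** (n + 1) <= L:
        n += 1
    return n

def window_counts(necklace, m):
    """Occurrence counts of length-m substrings of a cyclic sequence."""
    doubled = list(necklace) * 2
    return Counter(tuple(doubled[i:i + m]) for i in range(len(necklace)))

def is_lempel_radchenko(necklace, K):
    """Every string of length ceil(log_K L) occurs at most once."""
    n = ceil_log(len(necklace), K)
    return max(window_counts(necklace, n).values()) <= 1

def is_generalized_de_bruijn(necklace, K):
    """Lempel-Radchenko, and every string of length floor(log_K L) occurs."""
    n = floor_log(len(necklace), K)
    return is_lempel_radchenko(necklace, K) and len(window_counts(necklace, n)) == K ** n
```

The binary necklace 00011, which satisfies Definition 1.1, passes both checks, in line with Theorem 1.2. The necklace 0011 read on a ternary alphabet passes only the first, since the character 2 never occurs.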

P_L^(K)-sequences are more tightly constrained than generalized de Bruijn sequences and Lempel-Radchenko sequences. A length-L Lempel-Radchenko sequence imposes no requirements regarding presence or absence of particular strings as substrings; it simply requires that the number of distinct length-⌈log_K L⌉ substrings is L. A length-L generalized de Bruijn sequence on 𝒜 goes a step further, requiring not only this distinctness, but also the presence of every string on 𝒜 of length at most ⌊log_K L⌋ as a substring. A P_L^(K)-sequence goes yet another step further, requiring not only this presence, but also specific incidences of strings as substrings that, as best as they can, try not to bias the sequence toward inclusion of any one length-m string over another. This requirement makes P_L^(K)-sequences, in general, more de Bruijn-like than Lempel-Radchenko sequences and generalized de Bruijn sequences.

An example (borrowed from [gabric2021maximal]) of a Lempel-Radchenko sequence that is not a generalized de Bruijn sequence and therefore also not a -sequence for and is

(3)

In this case, , and indeed, there is precisely one occurrence in (3) of every length- substring of (3). But , and in (3) just of length- strings on occur as substrings; the sequence is missing . An example of a generalized de Bruijn sequence that is not a -sequence for and is

(4)

Again, and . Now, not only is there precisely one occurrence in (4) of every length- substring of (4), but also all length- strings on occur as substrings. However, (4) should have occurrences of each of and as substrings to be a -sequence, and it has occurrences of and occurrences of . This imbalance of s and s leads to further violations of constraints on -sequences at other substring lengths. Another example of a generalized de Bruijn sequence that is not a -sequence, this time on the nonbinary alphabet and for , is

(5)

(This sequence was constructed by Landsberg in [landsberg2000feedback] using Golomb’s technique from [golomb2017shift] as an example of a Lempel-Radchenko sequence.) Note and , every length- substring occurs exactly once, and every length- string on is present as a substring. But (5) should have, for , precisely or occurrences of every length- string on as a substring, and there is only occurrence of as a substring of (5).

1.4 de Bruijn sequence constructions vs. de Bruijn-like sequence constructions

Unlike the current situation with de Bruijn-like sequences of arbitrary length, there is a veritable cornucopia of elegant constructions of de Bruijn sequences. Excellent summaries of many of these are provided on Sawada’s website [sawadasite]. They include

  1. greedy constructions. Prominent examples are the prefer-largest/prefer-smallest [martin1934problem], prefer-same [eldert1958shifting, fredricksen1982survey, alhakim2021revisiting], and prefer-opposite [alhakim2010simple] algorithms;

  2. shift rules. A shift rule maps a length- substring of a de Bruijn sequence of order to the next length- substring of the sequence. Shift rules are often simple, economical, and efficient; examples generating each character of a de Bruijn sequence in amortized constant time using space are [sawada2016surprisingly, huang1990new, etzion1984algorithms, fredricksen1975class] in the binary case and [sawada2017simple] in the -ary case. See [amram2019efficient, jansen1991efficient, gabric2018framework, gabric2019successor, yang1992construction, chang2019general, zhu2021efficiently] for other specific rules;

  3. concatenation rules. The best-known example, obtained by Fredricksen and Maiorana in 1978 [fredricksen1978necklaces], joins all Lyndon words on an ordered alphabet of size whose lengths divide the desired order in lexicographic order to form the lexicographically smallest (i.e., “granddaddy”) de Bruijn sequence of that order on that alphabet. (Also see [ford1957cyclic] for Ford’s independent work generating this sequence.) The sequence is obtained in amortized constant time per character using space with the efficient Lyndon word generation approach of Ruskey, Savage, and Wang [ruskey1992generating], which builds on Fredricksen, Kessler, and Maiorana’s papers [fredricksen1978necklaces, fredricksen1986algorithm]. Dragon, Hernandez, Sawada, Williams, and Wong recently discovered that joining the Lyndon words in colexicographic order instead also outputs a particular de Bruijn sequence, the “grandmama” sequence [dragon2016grandmama, dragon2018constructing]. A generic concatenation approach using colexicographic order is developed in [gabric2017bruijn, gabric2018constructing];

  4. recursive constructions. Broadly, these approaches are based on transforming a de Bruijn sequence into a de Bruijn sequence of higher order, where the transformation can be implemented recursively. They fall into two principal classes:

    1. the constructions of Mitchell, Etzion, and Paterson in [mitchell1996method]

      , which interleave punctured and padded variants of a binary de Bruijn sequence of order

      and modify the result slightly to obtain a binary de Bruijn sequence of order . If starting with a known binary de Bruijn sequence, this process takes amortized time per output bit while using additional space. The constructions are notable for being efficiently decodable—that is, the position of any given string on occurring exactly once in the sequence as a substring can be retrieved in time polynomial in ;

    2. constructions based on Lempel’s D-morphism (otherwise known as Lempel’s homomorphism) [lempel1970homomorphism], whose inverse lifts any length- necklace on a size- alphabet to up to necklaces on . When is a de Bruijn sequence of order , the necklaces to which it is lifted may be joined to form a de Bruijn sequence of order . Efficient implementations constructing binary de Bruijn sequences of arbitrary order by repeated application of Lempel’s D-morphism are given by Annexstein [annexstein1997generating] as well as Chang, Park, Kim, and Song [chang1999efficient]; in general, a length- binary de Bruijn sequence is generated in time using space. Lempel confined attention to the binary case in [lempel1970homomorphism]. An extension to alphabets of arbitrary size was first written by Ronse in [ronse1984feedback] and also developed by Tuliani in [tuliani2001bruijn]; it was further generalized by Alhakim and Akinwande in [alhakim2011recursive]. See [games1983generalized, alhakim2017stretching] for other generalizations as well as [tuliani2001bruijn] for a decodable de Bruijn sequence construction exploiting both interleaving and Lempel’s D-morphism.
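
As a concrete instance of the concatenation rules in item 3, the following sketch uses the widely circulated recursive formulation of the Fredricksen-Maiorana approach, which appends Lyndon words whose lengths divide the order in lexicographic order. It illustrates that rule only and is not the construction developed in this paper; the function name is chosen here.

```python
def de_bruijn_by_lyndon_words(k, n):
    """Lexicographically smallest de Bruijn sequence of order n on {0, ..., k-1}."""
    a = [0] * (k * n)
    sequence = []

    def extend(t, p):
        if t > n:
            # a[1:p+1] is a Lyndon word exactly when its length p divides n.
            if n % p == 0:
                sequence.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            extend(t + 1, p)
            for j in range(a[t - p] + 1, k):
                a[t] = j
                extend(t + 1, t)

    extend(1, 1)
    return sequence
```

For k = 2 and n = 3 this returns the necklace 00010111, obtained by concatenating the Lyndon words 0, 001, 011, and 1.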

It is possible construction techniques for de Bruijn sequences have been more easily uncovered than for their arbitrary-length cousins as traditionally defined precisely because de Bruijn sequences are more tightly constrained. But P_L^(K)-sequences are similarly constrained.

1.5 Our contribution

This paper introduces the concept of P_L^(K)-sequences. Further, it extends recursive de Bruijn sequence constructions based on Lempel’s D-morphism [lempel1970homomorphism, ronse1984feedback, tuliani2001bruijn, alhakim2011recursive], giving an algorithm that outputs a P_L^(K)-sequence on the alphabet {0, 1, …, K − 1} for any combination of K ≥ 2 and L ≥ 1 in O(L) time using O(L log K) space. The essence of our approach is to lengthen each of d_i longest runs of the same nonzero character by a single character at the ith step before lifting, where the d_i are the digits of the desired length L of the P_L^(K)-sequence when expressed in base K. Finally, this paper is accompanied by Python code at https://github.com/nelloreward/pkl implementing our algorithm.

We were motivated to study arbitrary-length generalizations of de Bruijn sequences by [nellore2021invertible], which introduces nength, an analog to the Burrows-Wheeler transform [burrows1994block] for offline string matching in labeled digraphs. In a step preceding the transform, a digraph with edges labeled on one alphabet is augmented with a directed cycle that (1) includes every vertex of the graph and (2) matches a de Bruijn-like sequence on a different alphabet. This vests each vertex with a unique tag along the cycle. But if the de Bruijn-like sequence is an arbitrary Lempel-Radchenko or generalized de Bruijn sequence, some vertices may be significantly more identifiable than others when locating matches to query strings in the graph using its nength, biasing performance. So in general, it is reasonable to arrange that the directed cycle matches a P_L^(K)-sequence, which distributes identifiability across vertices as evenly as possible.

The remainder of this paper is organized as follows. The next section develops our algorithm for generating P_L^(K)-sequences, proving space and performance guarantees. The third and final section lists some open questions.

2 Generating P_L^(K)-sequences

2.1 Additional notation and conventions

In the development that follows, necklaces are represented by lowercase Greek letters adorned with tildes such as and , and strings are represented by unadorned lowercase Greek letters such as and . A necklace or string’s length or a set’s size is denoted using . So is the length of the necklace , and is the size of the set . Necklaces and strings may be in indexed families, where for example in , specifies the family member. Further, a necklace or string may be written as a function of another necklace or string. So denotes that the string is a function of the necklace . When any function’s argument is clear from context, that argument may be dropped with prior warning. So may be written as, simply, .

The operation of joining two necklaces and at a string to form a new necklace refers to cycle joining, described in Chapter 6 of Golomb’s text [golomb2017shift]. is obtained by concatenating rotations of and that share the prefix . So if and , joining and at gives . There may be more than one occurrence of as a substring of at least one of and , so there may be more than one way to join them at . Any way is permitted in such a case. Note joining and at preserves length- substring occurrence frequencies for .
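
Cycle joining as described above can be sketched in a few lines. The helper names are hypothetical, necklaces are represented as Python lists, and the two binary necklaces in the usage note are illustrative inputs chosen here rather than the paper's example.

```python
from collections import Counter

def rotate_to_prefix(necklace, prefix):
    """Return a rotation of the necklace that starts with the given prefix."""
    n = len(necklace)
    doubled = list(necklace) * 2
    for i in range(n):
        if doubled[i:i + len(prefix)] == list(prefix):
            return doubled[i:i + n]
    raise ValueError("prefix is not a substring of the necklace")

def join(first, second, shared):
    """Join two necklaces at a shared substring by concatenating matching rotations."""
    return rotate_to_prefix(first, shared) + rotate_to_prefix(second, shared)

def window_counts(necklace, m):
    """Occurrence counts of length-m substrings of a cyclic sequence."""
    doubled = list(necklace) * 2
    return Counter(tuple(doubled[i:i + m]) for i in range(len(necklace)))
```

Joining 00001 and 11110 at the shared string 01 gives 0100001111. Because every window of length at most three that crosses a seam sees only characters of the shared prefix beyond the seam, the substring counts of the joined necklace at those lengths equal the combined counts of the two inputs.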

For any positive integer ,

While results are obtained for sequences on the alphabet here, they may be translated to any size- alphabet by appropriate substitution of characters. When a string or necklace is initially declared to be on the alphabet , but an expression for one of its characters is written such that , that character should be interpreted as . This is simply the remainder of floored division of by . Put another way, expressions for characters of strings on respect arithmetic modulo . For example, if the first character of a string on the alphabet is specified as an expression that equals , that character is .

Individual characters comprising strings are often expressed in terms of variables, so a necklace or string may be written as a comma-separated list of characters enclosed by parentheses, where in the necklace case, is included as a subscript. For example, for , if is said to be on the alphabet , it is the string , while if is said to be on , it is the necklace . Bracket notation is used to refer to a specific character of a string or necklace. So refers to the character at index of . Further, characters of a string are indexed in order, so appears directly after in . and refer, respectively, to the first and last characters of the string . For a necklace, the choice of the character at index is arbitrary, but in a parenthetical representation of that necklace, the character at index always comes first. So an arbitrary length- necklace always equals

but not necessarily

A valid character index of a string is confined to , but a valid character index of a length- necklace is any integer , with the stipulation

A string or necklace on can be summed with any integer by adding that integer to each of its characters modulo . So for an integer and a length- necklace ,

Finally, is used as a shorthand for the length- string , is used as a shorthand for the length- string , and is used as a shorthand for the length- necklace . In a slight abuse of notation, a variable representing a string such as , , or can take the place of a character in a parenthetical representation of a string or necklace. So if is said to be a substring of a string on the alphabet , that substring is .

2.2 Lempel’s lift

Lempel’s lift, defined below, realizes the simplest -ary version of Lempel’s D-morphism [lempel1970homomorphism, ronse1984feedback, tuliani2001bruijn, alhakim2011recursive] in inverse form.

Definition 2.1 (Lempel’s lift).

Consider a length- necklace on the alphabet . Lempel’s lift of , denoted by , is the indexed family of necklaces on specified by

(6)

Above, is the smallest positive integer such that is divisible by , and .

The remainder of this subsection (i.e., Section 2.2) abbreviates functions of given above by dropping it as an argument. For example, is written rather than .

Observe that is a discrete integral of , with the constant of integration. The number specified in Definition 2.1 is the smallest positive integer such that integrating a cycle of a total of times gives a cycle of . Conversely, is uniquely determined by a discrete derivative of , which eliminates the constant of integration:

Above, the power on the LHS denotes is concatenated with itself times.

Note the sum of the lengths of the necklaces comprising Lempel’s lift of a length-L necklace on a size-K alphabet is KL. Other properties of the lift pertinent to constructing P_L^(K)-sequences are as follows.

Lemma 2.1.

Suppose is a length- necklace on the alphabet , and is an integer satisfying . Suppose is a length- string on , and is the length- string on given by

(7)

Then occurs times as a substring of if and only if occurs times as a substring of the necklaces comprising Lempel’s lift of . When , is the length- string occurring as a substring at every character of .

Proof.

Start constructing a given by integrating from its character index up to character index . If occurs as a substring of at index , it follows from (6) and (7) that occurs as a substring of at its character index for some , and vice versa. For , continue integrating past its character index for another characters to encounter again. This time, how is defined in terms of the sum of ’s characters implies ’s presence as a substring of at index is a necessary and sufficient condition for ’s presence as a substring of at its character index . More generally, occurs as a substring of at its character index if and only if occurs as a substring of at its character index for , and all occurrences of in Lempel’s lift of for which the difference between and is divisible by are in . An occurrence of at any other value of is easily seen from (6) to be at a corresponding character index of for particular and . So there is an invertible map from the set of distinct occurrences of as a substring of into the set of distinct occurrences of as a substring of Lempel’s lift of for , giving the lemma. ∎

Lemma 2.2.

The number of occurrences of a given length- string on the alphabet for as a substring in the family of necklaces comprising Lempel’s lift of a -sequence on is or .

Proof.

By definition, any length- string on has or occurrences as a substring of a -sequence on . By Lemma 2.1, for , the length- string as defined in (7) thus occurs or times as a substring in Lempel’s lift of a -sequence on . Since any length- string on is a prefix of length- strings on , multiply each of and by to arrive at the lemma. ∎
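
The lift and the balance property of Lemma 2.2 can be explored numerically. The sketch below implements one reading of the description in this subsection: each lifted necklace is a running sum modulo K of the input, started at a constant of integration c, and the input cycle is traversed p times, where p is the smallest positive integer for which p times the character sum is divisible by K. The indexing of the family by c in `range(K // p)` and both function names are assumptions made here; equation (6) of the paper may order or index the family differently.

```python
from collections import Counter
from math import gcd

def lempel_lift(necklace, K):
    """Discrete integrals of a necklace modulo K, one per constant of integration."""
    L = len(necklace)
    total = sum(necklace) % K
    p = K // gcd(total, K)  # smallest positive p with p * total divisible by K
    family = []
    for c in range(K // p):
        lifted, running = [], c
        for i in range(p * L):
            lifted.append(running)
            running = (running + necklace[i % L]) % K
        family.append(lifted)
    return family

def family_is_balanced(family, K):
    """Check floor/ceiling counts of (total length) / K**m across the whole family."""
    total = sum(len(n) for n in family)
    shortest = min(len(n) for n in family)
    for m in range(shortest + 1):
        counts = Counter()
        for necklace in family:
            doubled = necklace * 2
            counts.update(tuple(doubled[i:i + m]) for i in range(len(necklace)))
        lo, hi = total // K ** m, -(-total // K ** m)
        if lo > 0 and len(counts) < K ** m:
            return False
        if any(c < lo or c > hi for c in counts.values()):
            return False
    return True
```

For the binary necklace 001 the lift is the single necklace 000111, and for 00011 it is the pair 00001 and 11110. In both cases the lengths sum to KL and the family passes the balance check for the substring lengths tested.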

2.3 Algorithm and analysis

In this subsection (Section 2.3), is reserved to denote a -sequence. Moreover, when a function from Definition 2.1 is invoked, and it has as an argument, that function is abbreviated by dropping the . For example, now refers to .

Lemma 2.2 suggests a way to obtain a P_{KL}^(K)-sequence from a P_L^(K)-sequence: join the necklaces in its Lempel’s lift strategically to ensure the numbers of occurrences of specific strings as substrings do not violate the parameters of Definition 1.1. Below, the procedure LiftAndJoin includes an explicit prescription, and Theorem 2.4 proves it works. They are preceded by a requisite lemma extending the discussion of cycle joining from Section 2.1.

Lemma 2.3.

Consider two necklaces and on the alphabet , and suppose the length- string is a substring of each of them. For every , suppose further that no length- string is a substring of each of and , and no length- string is a substring of each of and . Finally, suppose every length- string on occurs either zero times or one time as a substring of the family . Then

  1. every length- string on occurs either zero times or one time as a substring of the necklace formed by joining and at , and

  2. every length- string for occurs the same number of times as a substring of as it does as a substring of .

Proof.

For , suppose the length- string occurs (1) in as a substring of the length- string , and (2) in as a substring of the length- string . Join and at these occurrences of to obtain the necklace . The operation replaces and with and while affecting the occurrence frequencies of no other length- strings as substrings and no length- strings as substrings for . But cannot occur elsewhere as a substring of because if it does, then either or is a substring of each of and , a contradiction. By a parallel argument, cannot occur elsewhere in . The lemma follows. ∎

//  Returns the P_{KL}^(K)-sequence formed by joining the necklaces
//  comprising Lempel’s lift of an input P_L^(K)-sequence on the
//  alphabet , with a clarifying input. Here, .
1:procedure LiftAndJoin(, )
2:     Construct Lempel’s lift of .
3:     if  then  // Case 1
4:         return
5:     end if
6:     if  is a substring of  then  // Case 2
7:         Find such that is a substring of each of and .
8:         Initialize to .
9:         for  do
10:              Set to the result of joining and at for .
11:         end for
12:         return
13:     end if
14:     Construct the join graph defined in Theorem 2.4.  // Case 3
15:     Initialize to the necklace represented by an arbitrary vertex .
16:     Starting at , perform a depth-first traversal of the connected component of for which , where at each vertex in reached by walking across a given edge in , the necklace represented by that vertex is joined with at the string labeling that edge, and the result is assigned to .
17:     if  then  // Case 3a
18:         return
19:     end if
20:     Find such that is a substring of each of and .  // Case 3b
21:     Initialize to .
22:     for  do
23:         Set to the result of joining and at for .
24:     end for
25:     return
26:end procedure
Theorem 2.4.

Given a -sequence on the alphabet , suppose . Consider Lempel’s lift of , and define the join graph as an undirected graph with vertices such that

  1. the vertex represents for , and

  2. an edge in labeled by a length- string of the form or for extends between vertex and vertex if and only if that string occurs as a substring of each of and for .

Then the length- necklace output by LiftAndJoin with and as inputs is a -sequence.

Proof.

Follow the logic of the LiftAndJoin pseudocode to prove it returns a P_{KL}^(K)-sequence. To start, Line 2 constructs Lempel’s lift of the input necklace, which is composed of necklaces that together have precisely the same number of occurrences of any length-m string on the alphabet as a substring that a P_{KL}^(K)-sequence does, according to Lemma 2.2. To join the necklaces, various cases are handled in order of increasing difficulty:

  1. (Lines 3-5) This is the most straightforward case, where Lempel’s lift has precisely one necklace. By Lemma 2.2 and by definition of a -sequence, the sole necklace is a -sequence, and it is returned (Line 4).

  2. (Line 6-13) In this case, and is a substring of so that by Lemma 2.1, is a substring of for at least one . Consequently, is a substring of each of and for and . Progressively joining a necklace under construction with the th member of Lempel’s lift of at for (Lines 8-11) and running from to preserves occurrence frequencies of all strings on whose lengths do not exceed . Since by Lemma 2.2 a length- string occurs either once or never as a substring of Lempel’s lift of , a string whose length exceeds occurs either once or never as a substring of the joined necklace. So that joined necklace is a -sequence, and it is returned (Line 12). When is a de Bruijn sequence (i.e., for ), Case 2 is the -ary extension [ronse1984feedback, tuliani2001bruijn, alhakim2011recursive] of the original join prescription of the paper [lempel1970homomorphism] by Lempel introducing his D-morphism.

  3. (Lines 14-25) Because a length- string on need not occur as a substring of , may not be a substring of . This bars the availability of Lempel’s join of Case 2. LiftAndJoin then looks for the closest alternative. By definition of a -sequence, is necessarily a substring of , and so by Lemma 2.1, is a substring of each necklace in Lempel’s lift of for some . So Line 14 assembles the graph encoding all possible joins at strings of the form or for . Consider any connected component of . A depth-first traversal of prescribes a sequence of joins, which are performed to obtain a single necklace (Line 16). Two cases are then considered.

    1. (Lines 17-19) In this case, there is just one connected component of . Since each join was performed at a length- string, by an argument parallel to that of Case 2, is a -sequence, and it is returned (Line 18).

    2. (Lines 20-25) If there are multiple connected components of , by symmetry, is related to any other connected component by translation modulo . More precisely, applying to each vertex , to each edge extending between and , and to each edge label corresponding to gives a different connected component, where and addition operations are performed modulo . It follows that for every , gives the result of a sequence of joins prescribed by a different connected component of . Because each join was performed at a length- string, the necklaces together have the same occurrence frequency of any length- string on as does Lempel’s lift of for . That occurrence frequency is or for , as it therefore also is for . Because possible joins at strings of the form or for were exhausted by prior joins, Lemma 2.3 guarantees that joins of the at strings of the form for preserve the occurrence frequency of any length- string on for while ensuring that when , the occurrence frequency of a length- string remains either or . So when all necklaces in are joined as on Lines 21-24, the result is a -sequence, and it is returned (Line 25). Note the joins are performed in exact analogy to those of Case 2.

The output of LiftAndJoin is thus a P_{KL}^(K)-sequence. ∎

Repeated application of LiftAndJoin to a P_L^(K)-sequence outputs a sequence of the same kind whose length is L multiplied by a power of K. But this operation alone does not afford the expressive capacity to build up such a sequence of arbitrary target length starting from one of length less than K, in the same way that an arbitrary positive integer cannot be written as a power of K times a positive integer less than K. A mechanic for extending the length of the sequence by up to K − 1 between applications of LiftAndJoin is required, where the length of the extension is determined by an appropriate digit from the base-K representation of the target length. The mechanic used in the iterative procedure GeneratePKL below, which outputs a P_L^(K)-sequence for any combination of K ≥ 2 and L ≥ 1, extends a given longest run of a nonzero character by a single character. Theorem 2.5 proves this approach works.

//  Returns a P_L^(K)-sequence on the alphabet given and
//   as inputs. Here, .
1:procedure GeneratePKL(, )
2:     Compute the digits of L in its base-K representation as specified by .
3:     Initialize the necklace to .
4:     for  do
5:         Set to LiftAndJoin(, ).
6:         Set to the extension of by characters as obtained by replacing a substring with for every .
7:     end for
8:     return
9:end procedure
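
The length bookkeeping behind GeneratePKL is Horner's rule on the base-K digits of the target length: each lift multiplies the current length by K, and each extension adds the next digit. The sketch below tracks lengths only and builds no sequences; both function names are hypothetical, and reading the digits from most significant to least significant is an assumption about how the loop indexes them.

```python
def base_k_digits(L, K):
    """Digits of a positive integer L in base K, most significant first."""
    digits = []
    while L > 0:
        digits.append(L % K)
        L //= K
    return digits[::-1]

def length_schedule(L, K):
    """Lengths reached after initialization and after each lift-then-extend step."""
    digits = base_k_digits(L, K)
    lengths = [digits[0]]  # the initial necklace has length equal to the leading digit
    for d in digits[1:]:
        lengths.append(lengths[-1] * K + d)  # lift multiplies by K, extension adds d
    return lengths
```

For L = 12 and K = 2 the digits are 1, 1, 0, 0 and the schedule of lengths is 1, 3, 6, 12, so three lift-and-extend steps reach the target.
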
Theorem 2.5.

GeneratePKL outputs a P_L^(K)-sequence for any combination of K ≥ 2 and L ≥ 1.

Proof.

Use the notation to denote the value of after Line 3 of GeneratePKL is executed and the notation to denote the value of after step of the for loop of GeneratePKL. Prove the theorem by induction, showing that if is a -sequence of length , and occurs times as a substring of for all , then is a -sequence of length , and occurs times as a substring of for all . The base case for the induction holds: , as initialized on Line 3, is the -sequence of length , in which occurs as a substring times for and times for . Now suppose that is a -sequence of length , and occurs times as a substring of for all . Then for every , occurs times as a substring of LiftAndJoin, obtained on Line 5. This follows from

  1. Lemma 2.1, which says there are occurrences of as a substring of a necklace if and only if there are occurrences of as a substring in Lempel’s lift of that necklace, and

  2. how all joins of necklaces in Lempel’s lift prescribed by LiftAndJoin, including those permitted by Lemma 2.3, do not affect occurrences of substrings of the form .

The extension performed on Line 6 increases the number of occurrences of , for from to without affecting the numbers of occurrences of any other length- strings as substrings for . The longest string of 0s is never extended, and the number of occurrences of