Prefix normal words are binary words with the property that no factor has more 1s than the prefix of the same length. For example, $11001010$ is prefix normal, but $11001101$ is not, since the factor $1101$ has too many 1s. These words were introduced in earlier work, originally motivated by the problem of Jumbled Pattern Matching [5, 20, 15, 2, 9, 3, 10, 1, 14, 18, 12].
Prefix normal words have, however, proved to have diverse other connections [7, 6, 8]. Among these, it has been shown that prefix normal words form a bubble language [22, 24, 23], a family of binary languages which includes Lyndon words, $k$-ary Dyck words, necklaces, and other important classes of binary words. These languages have efficient generation algorithms (here, efficient means that the cost per output word should be small; in the best case, this cost is constant amortized time, CAT), and can be listed as (combinatorial) Gray codes, i.e. listings in which successive words differ by a constant number of operations. More recently, connections of the language of prefix normal words to the Binary Reflected Gray Code have been discovered, and prefix normal words have proved to be applicable to certain graph problems. Moreover, three different sequences related to prefix normal words are present in the On-Line Encyclopedia of Integer Sequences (OEIS): A194850 (the number of prefix normal words of length $n$), A238109 (a list of prefix normal words over the alphabet $\{1,2\}$), and A238110 (equivalence class sizes of words with the same prefix normal form, a related concept).
In this paper, we present a new recursive generation algorithm for prefix normal words of fixed length. In combinatorial generation, the aim is to find a way of efficiently listing (but not necessarily outputting) each one of a given class of combinatorial objects. Even though the number of these objects may be very large, typically exponential, in many situations it is necessary to be able to examine each one of them: this is when combinatorial generation algorithms are needed. The latest volume 4A of Donald Knuth's The Art of Computer Programming devotes over 200 pages to combinatorial generation of basic combinatorial patterns, such as permutations and bitstrings, and much more is planned on the topic.
The previous generation algorithm for prefix normal words of length $n$ runs in amortized linear time per word, while it was conjectured there that its running time is actually amortized $O(\log n)$ per word, a conjecture which is still open. Our new algorithm recursively generates all prefix normal words from a seed word, applying two operations, which we call bubble and flip. Its running time is $O(n)$ per word, and it allows new insights into properties of prefix normal words. It can be applied (a) to produce all prefix normal words of fixed length, or (b) to produce all prefix normal words of fixed length sharing the same critical prefix. (The critical prefix of a binary word is the first run of 1s followed by the first run of 0s.) This could help in proving a conjecture formulated in earlier work, namely that the expected critical prefix length of an $n$-length prefix normal word is $O(\log n)$. Moreover, it could prove useful in counting prefix normal words of fixed length: it is easy to see that this number grows exponentially; however, neither a closed form nor a generating function is known. Finally, a slight change in the algorithm produces a (combinatorial) Gray code on prefix normal words of length $n$.
While both algorithms generate prefix normal words recursively, they differ in fundamental ways. The previous algorithm is an application of a general schema for generating bubble languages, using a language-specific oracle. It generates separately the sets of prefix normal words with fixed weight $d$, i.e. all prefix normal words of length $n$ containing $d$ 1s. The computation tree is not binary, since each word $w$ can have up to $t$ children, where $t$ is the number of 0s in the first run of 0s of $w$. The algorithm uses an additional linear size data structure which it inherits from the parent node and modifies for the current node. A basic feature of the computation tree is that all words have the same fixed suffix: for the subtree rooted in the word $w = 1^s0^t\gamma$, all nodes are of the form $v\gamma$, for some $v$.
In contrast, our new algorithm generates all prefix normal words of length $n$ (except for $0^n$ and $10^{n-1}$) in one single recursive call, starting from $110^{n-2}$. The computation tree is binary, since each word can have at most two children, namely the one produced by the operation bubble, and the one produced by flip. Finally, for certain words $w$, the words in the subtree rooted in $w$ have the same critical prefix as $w$. This last property allows us to explore the sets of prefix normal words with fixed critical prefix.
In the final part of the paper, we prove some surprising results about extending prefix normal words. Note that if $w$ is prefix normal, then so is $w0$, but not necessarily $w1$. We introduce infinite prefix normal words and show that repeatedly applying the flip-operation used by the new algorithm (in a modified version which extends finite words) produces, in the limit, an ultimately periodic infinite prefix normal word. Moreover, we are able to predict both the length and the density of the period, and give an upper bound on when the period will appear.
Part of the results of the present paper were presented in a preliminary form in .
A (finite) binary word (or string) $w$ is a finite sequence of elements from $\Sigma = \{0,1\}$. We denote the $i$'th character of $w$ by $w_i$, and its length by $|w|$. Note that we index words from $1$. The empty word, denoted $\varepsilon$, is the unique word with length $0$. The set of binary words of length $n$ is denoted $\Sigma^n$ and the set of all finite words by $\Sigma^*$. For two words $u, v$, we write $uv$ for their concatenation. For an integer $k \ge 0$ and $u \in \Sigma^*$, $u^k$ denotes the $k|u|$-length word $uu \cdots u$ ($k$-fold concatenation of $u$). If $w = uxv$, with $u, x, v \in \Sigma^*$ (possibly empty), then $u$ is called a prefix, $x$ a factor (or substring), and $v$ a suffix of $w$. We denote by $w_i \cdots w_j$, for $i \le j$, the factor of $w$ spanning the positions $i$ through $j$. For a word $u$, we write $|u|_1$ for the number of 1s in $u$. We denote by $\le_{\mathrm{lex}}$ the lexicographic order between words.
We denote by $\mathrm{pref}_i(w)$ the prefix of $w$ of length $i$, and by $P_w(i) = |\mathrm{pref}_i(w)|_1$ the number of 1s in the prefix of length $i$. (In the context of succinct indexing, this function is often called $\mathrm{rank}_1(w,i)$.) If clear from the context, we write $P(i)$ for $P_w(i)$.
Definition 1 (Prefix normal words, prefix normal condition).
A word $w \in \Sigma^*$ is called prefix normal if, for all factors $u$ of $w$, $|u|_1 \le P_w(|u|)$. We denote the set of all finite prefix normal words by $\mathcal{L}$, and the set of prefix normal words of length $n$ by $\mathcal{L}_n$. Given a binary word $w$, we say that a factor $u$ of $w$ satisfies the prefix normal condition if $|u|_1 \le P_w(|u|)$.
The word is not prefix normal because the factor violates the prefix normal condition.
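Definition 1 can be checked directly by comparing every factor with the prefix of the same length. The following is a minimal brute-force sketch, assuming words are represented as Python strings over the characters `0` and `1`; the function name `is_prefix_normal` is ours and chosen for illustration only. It takes quadratic time and is meant as a reference check, not as part of the generation algorithm.

```python
def is_prefix_normal(w: str) -> bool:
    """Return True if no factor of w has more 1s than the prefix of the same length."""
    n = len(w)
    # ones[i] = number of 1s in the prefix of length i, i.e. P_w(i)
    ones = [0] * (n + 1)
    for i, c in enumerate(w):
        ones[i + 1] = ones[i] + (c == "1")
    for length in range(1, n + 1):
        for start in range(n - length + 1):
            # number of 1s in the factor w[start : start + length]
            if ones[start + length] - ones[start] > ones[length]:
                return False
    return True


print(is_prefix_normal("11001010"))  # True
print(is_prefix_normal("11001101"))  # False: factor 1101 has three 1s, prefix 1100 only two
```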
It is easy to see that the number of prefix normal words grows exponentially, by noting that $1^n w$ is prefix normal for any $w$ of length $n$. In Table 1, we list all prefix normal words of small length. Finding the number of prefix normal words of length $n$ is a challenging open problem for which only partial results are known. The cardinalities of $\mathcal{L}_n$ for small $n$ can be found in the On-Line Encyclopedia of Integer Sequences (OEIS) as sequence A194850.
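The exponential lower bound can be checked experimentally for small lengths. The sketch below is an illustration under our own naming (`is_prefix_normal`, `count_prefix_normal`): it verifies by exhaustive enumeration that every word of the form $1^n w$ with $|w| = n$ is prefix normal, so that there are at least $2^n$ prefix normal words of length $2n$, and prints the first few counts.

```python
from itertools import product


def is_prefix_normal(w: str) -> bool:
    """Brute-force test: no factor has more 1s than the prefix of equal length."""
    n = len(w)
    ones = [0] * (n + 1)  # ones[i] = number of 1s among the first i characters
    for i, c in enumerate(w):
        ones[i + 1] = ones[i] + (c == "1")
    return all(
        ones[start + length] - ones[start] <= ones[length]
        for length in range(1, n + 1)
        for start in range(n - length + 1)
    )


def count_prefix_normal(n: int) -> int:
    """Count the prefix normal words of length n by exhaustive enumeration."""
    return sum(is_prefix_normal("".join(bits)) for bits in product("01", repeat=n))


# Every word 1^n w with |w| = n is prefix normal, hence at least 2^n words of length 2n.
for n in range(1, 6):
    assert all(is_prefix_normal("1" * n + "".join(bits)) for bits in product("01", repeat=n))
    assert count_prefix_normal(2 * n) >= 2 ** n

print([count_prefix_normal(n) for n in range(1, 6)])  # [2, 3, 5, 8, 14]
```

Exhaustive enumeration is only feasible for very small $n$; it is used here solely as a sanity check.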
Next, we give some basic facts about prefix normal words which will be needed in the following.
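Fact 1 (Basic facts about prefix normal words).
(i) If $w \in \mathcal{L}_n$, then either $w = 0^n$ or $w_1 = 1$.
(ii) $w \in \mathcal{L}$ if and only if $\mathrm{pref}_i(w) \in \mathcal{L}$ for $i = 1, \ldots, |w|$.
(iii) If $w \in \mathcal{L}$, then $w0^i \in \mathcal{L}$ for all $i = 1, 2, \ldots$
(iv) Let $w \in \mathcal{L}$. Then $w1 \in \mathcal{L}$ if and only if for all $0 \le i < |w|$, we have $|w_{|w|-i+1} \cdots w_{|w|}|_1 < P_w(i+1)$.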
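The extension facts can be tested against the definition for short words. The following sketch assumes the reading of the last item in which a prefix normal word $w$ can be extended by a 1 exactly when every suffix of $w$ of length $i$ has strictly fewer 1s than the prefix of length $i+1$, for $0 \le i < |w|$. The names `can_append_one` and `is_prefix_normal` are hypothetical helpers introduced here.

```python
from itertools import product


def is_prefix_normal(w: str) -> bool:
    """Brute-force test of the prefix normal condition for all factors."""
    n = len(w)
    ones = [0] * (n + 1)
    for i, c in enumerate(w):
        ones[i + 1] = ones[i] + (c == "1")
    return all(
        ones[start + length] - ones[start] <= ones[length]
        for length in range(1, n + 1)
        for start in range(n - length + 1)
    )


def can_append_one(w: str) -> bool:
    """Suffix criterion for a prefix normal w: for all 0 <= i < |w|, the suffix
    of length i has strictly fewer 1s than the prefix of length i + 1."""
    n = len(w)
    ones = [0] * (n + 1)
    for i, c in enumerate(w):
        ones[i + 1] = ones[i] + (c == "1")
    return all(ones[n] - ones[n - i] < ones[i + 1] for i in range(n))


# Compare the criterion with the definition on all prefix normal words up to length 8.
for n in range(1, 9):
    for bits in product("01", repeat=n):
        w = "".join(bits)
        if is_prefix_normal(w):
            assert is_prefix_normal(w + "0")                      # appending 0 is always safe
            assert can_append_one(w) == is_prefix_normal(w + "1")  # appending 1 needs the criterion

print(can_append_one("110"), can_append_one("1001"))  # True False
```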
We will define several operations on binary words in this paper. For an operation $\mathrm{op}: \Sigma^* \to \Sigma^*$, we denote by $\mathrm{op}^{(i)}$ the $i$'th iteration of $\mathrm{op}$. We denote by $\mathrm{op}^*(w)$ the set of words obtainable from $w$ by a finite number of applications of $\mathrm{op}$.
Finally, we introduce the critical prefix of a word. The length of the critical prefix plays an important role in the analysis of the previous generation algorithm for prefix normal words.
Definition 2 (Critical prefix).
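Given a non-empty binary word $w$, it can be uniquely written in the form $w = 1^s0^t\gamma$, where $s, t \ge 0$, $s = 0$ implies $t > 0$, and $\gamma \in 1\Sigma^* \cup \{\varepsilon\}$. We refer to $1^s0^t$ as the critical prefix of $w$.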
For example, the critical prefix of is , that of is , while the critical prefix of is .
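The decomposition of Definition 2 is straightforward to compute. The following is a small sketch under the assumption that words are Python strings over `0` and `1`; the function name `critical_prefix` and the sample words are ours, chosen for illustration.

```python
def critical_prefix(w: str) -> str:
    """Return the first run of 1s followed by the first run of 0s of a non-empty word."""
    if not w:
        raise ValueError("the critical prefix is defined for non-empty words only")
    s = len(w) - len(w.lstrip("1"))        # length of the leading run of 1s (may be 0)
    rest = w[s:]
    t = len(rest) - len(rest.lstrip("0"))  # length of the run of 0s that follows
    return w[: s + t]


print(critical_prefix("1101001"))  # 110
print(critical_prefix("111"))      # 111
print(critical_prefix("00101"))    # 00
```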
In earlier work, it was conjectured that the expected length of the critical prefix of a prefix normal word of length $n$ is $O(\log n)$. This conjecture is still open. In Section 3.3, we will see how to adapt our algorithm to generate all prefix normal words with critical prefix $1^s0^t$ in one run.
To close this section, we briefly discuss combinatorial Gray codes. Recall that a Gray code is a listing of all bitstrings (or binary words) of length $n$ such that two successive words differ by exactly one bit. In other words, a Gray code is a sequence $w^{(1)}, w^{(2)}, \ldots, w^{(2^n)}$ of the words in $\Sigma^n$ such that $d_H(w^{(i)}, w^{(i+1)}) = 1$ for $i = 1, \ldots, 2^n - 1$, where $d_H(x,y) = |\{1 \le j \le n : x_j \ne y_j\}|$ is the Hamming distance between two equal-length words $x$ and $y$.
This definition has been generalized in several ways; we give a definition following [21, ch. 5].
Definition 3 (Combinatorial Gray Code).
Given a set of combinatorial objects $S$ and a relation $C$ on $S$ (the closeness relation), a combinatorial Gray code for $S$ is a listing $s_1, s_2, \ldots, s_k$ of the elements of $S$, such that $(s_i, s_{i+1}) \in C$ for $i = 1, \ldots, k-1$. If we also require that $(s_k, s_1) \in C$, then the code is called cyclic.
In particular, given a listing of the elements of a binary language $L \subseteq \Sigma^n$, such that each two subsequent words have Hamming distance bounded by a constant, then this listing is a combinatorial Gray code for $L$. Note that the specifier 'combinatorial' is often dropped, so the term Gray code is frequently used in this more general sense.
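As a concrete instance of the classical definition, the sketch below lists the Binary Reflected Gray Code of length $n \ge 1$ using the standard formula $i \oplus \lfloor i/2 \rfloor$ and checks that successive words, including the last and the first, have Hamming distance 1. The function names are ours.

```python
def hamming(x: str, y: str) -> int:
    """Hamming distance between two words of equal length."""
    if len(x) != len(y):
        raise ValueError("words must have equal length")
    return sum(a != b for a, b in zip(x, y))


def reflected_gray_code(n: int) -> list[str]:
    """Binary Reflected Gray Code on all words of length n >= 1."""
    return [format(i ^ (i >> 1), f"0{n}b") for i in range(2 ** n)]


code = reflected_gray_code(4)
assert len(set(code)) == 16                                    # every word appears exactly once
assert all(hamming(a, b) == 1 for a, b in zip(code, code[1:]))  # successive words differ in one bit
assert hamming(code[-1], code[0]) == 1                          # the listing is cyclic

print(reflected_gray_code(2))  # ['00', '01', '11', '10']
```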
3 The Bubble-Flip algorithm
In this section we present our new generation algorithm for all prefix normal words of a given length. We show that the words are generated in lexicographic order. We also show how our procedure can be easily adapted to generate all prefix normal words of a given length with the same critical prefix.
3.1 The algorithm
Let $w \in \Sigma^n$. We let $r(w)$ be the largest index $r$ such that $w_r = 1$, if it exists, and $\infty$ otherwise. We will use the following operations on prefix normal words:
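Definition 4 (Operation $\mathrm{flip}$).
Given $w \in \Sigma^n$ and $1 \le j \le n$, we define $\mathrm{flip}(w,j)$ to be the binary word obtained by changing the $j$-th character in $w$, i.e., $\mathrm{flip}(w,j) = w_1 \cdots w_{j-1}\,\overline{w_j}\,w_{j+1} \cdots w_n$, where $\overline{x}$ is $1 - x$.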
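Definition 5 (Operation $\mathrm{bubble}$).
Given $w \in \Sigma^n \setminus \{0^n\}$ and $r = r(w) < n$, we define $\mathrm{bubble}(w) = w_1 \cdots w_{r-1}\,01\,w_{r+2} \cdots w_n$, i.e., the word obtained from $w$ by shifting the rightmost 1 one position to the right.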
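The two operations can be sketched on Python strings as follows. Positions are 1-based as in the text; the helper `rightmost_one` returns `None` when the word contains no 1, which is an implementation choice of this sketch and not notation from the paper.

```python
from typing import Optional


def rightmost_one(w: str) -> Optional[int]:
    """1-based position of the last 1 in w, or None if w contains no 1."""
    pos = w.rfind("1")
    return None if pos < 0 else pos + 1


def flip(w: str, j: int) -> str:
    """Complement the j-th character of w (1-based)."""
    if not 1 <= j <= len(w):
        raise ValueError("position out of range")
    flipped = "1" if w[j - 1] == "0" else "0"
    return w[: j - 1] + flipped + w[j:]


def bubble(w: str) -> str:
    """Shift the rightmost 1 of w one position to the right."""
    r = rightmost_one(w)
    if r is None or r == len(w):
        raise ValueError("bubble needs a rightmost 1 that is not in the last position")
    return w[: r - 1] + "01" + w[r + 1 :]


print(flip("1100", 3))   # 1110
print(bubble("1100"))    # 1010
print(bubble("1010"))    # 1001
```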
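We start by giving a simple characterization of those flip-operations which preserve prefix normality.
Lemma 1. Let $w \in \mathcal{L}_n$ such that $r = r(w) < n$ and let $j$ be an index with $r < j \le n$. Then $\mathrm{flip}(w,j)$ is not prefix normal if and only if there exists a $1 \le k < r$ such that $|w_{r-k+1} \cdots w_r|_1 = P_w(k)$ and $w_{k+1} \cdots w_{k+j-r} = 0^{j-r}$.
Proof. Let $w' = \mathrm{flip}(w,j)$. If there exists a $k$ such that $|w_{r-k+1} \cdots w_r|_1 = P_w(k)$ and $w_{k+1} \cdots w_{k+j-r} = 0^{j-r}$, then for the factor $u = w'_{r-k+1} \cdots w'_j$ of $w'$, we have $|u|_1 = P_w(k) + 1$ and $P_{w'}(|u|) = P_w(k+j-r) = P_w(k)$, thus $w'$ is not prefix normal.
Conversely, note that $w' \in \mathcal{L}$ if and only if $w_1 \cdots w_{j-1}1 \in \mathcal{L}$, by Fact 1 (ii) and (iii). If $w_1 \cdots w_{j-1}1 \notin \mathcal{L}$, then, by Fact 1 (iv), there exists a suffix $v$ of $w_1 \cdots w_{j-1}$ such that $|v|_1 \ge P_w(|v|+1)$. Clearly, $v$ cannot be shorter than $j - r$, since then $|v|_1 = 0 < P_w(|v|+1)$, since $w$ is prefix normal and contains at least one 1. So $v$ spans the position of the last 1 of $w$. Let us write $v = w_{r-k+1} \cdots w_r 0^{j-r-1}$, with $k \ge 1$. So we have $P_w(k+j-r) \le |v|_1 = |w_{r-k+1} \cdots w_r|_1 \le P_w(k)$, implying $|w_{r-k+1} \cdots w_r|_1 = P_w(k)$ by monotonicity of $P_w$. Moreover, again by the monotonicity of $P_w$, we get $P_w(k+j-r) = P_w(k)$, which implies that the factor $w_{k+1} \cdots w_{k+j-r}$ consists of only 0s. ∎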
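Definition 6 (Phi).
Let $w \in \mathcal{L}_n$. Let $r = r(w)$. Define $\varphi(w)$ as the minimum $j$ such that $r < j \le n$ and $\mathrm{flip}(w,j)$ is prefix normal, and $\varphi(w) = n+1$ if no such $j$ exists.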
For the word , we have , since the words and both violate the prefix normal condition, for the prefixes of length and , respectively.
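Lemma 2. Let $w \in \mathcal{L}_n$ and let $r = r(w)$. Let $m$ be the maximum length of a run of zeros following a prefix of $w$ which has the same number of 1s as the suffix of $w_1 \cdots w_r$ of the same length. Formally,
$$m = \max\{\ell \mid \exists\, 1 \le k < r:\ P_w(k) = |w_{r-k+1} \cdots w_r|_1 \text{ and } w_{k+1} \cdots w_{k+\ell} = 0^{\ell}\},$$
where we set the maximum of the empty set to $0$. Then, $\varphi(w) = \min(r+m+1,\, n+1)$.
Proof. We first show that $\varphi(w) \le \min(r+m+1,\, n+1)$. We can assume that $r+m+1 \le n$, for otherwise the desired inequality holds by definition. Let $j = r+m+1$. Then, there are no $k$ such that $P_w(k) = |w_{r-k+1} \cdots w_r|_1$ and $w_{k+1} \cdots w_{k+j-r} = 0^{j-r}$. Thus, by Lemma 1, we have that $\mathrm{flip}(w,j) \in \mathcal{L}$, hence $\varphi(w) \le j$.
Let now $k, \ell$ be indices attaining the maximum in the definition of $m$, i.e., $\ell = m$, $P_w(k) = |w_{r-k+1} \cdots w_r|_1$, and $w_{k+1} \cdots w_{k+m} = 0^m$. Let $r < j \le \min(r+m, n)$; then for $k$ we have $P_w(k) = |w_{r-k+1} \cdots w_r|_1$ and $w_{k+1} \cdots w_{k+j-r} = 0^{j-r}$. Then, by Lemma 1, $\mathrm{flip}(w,j) \notin \mathcal{L}$. Hence $\varphi(w) \ne j$ for all such $j$, and in particular $\varphi(w) \ge \min(r+m+1,\, n+1)$, which completes the proof. ∎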
Algorithm 1 implements the idea of Lemma 2 to compute $\varphi$. For a given prefix normal word $w$, it finds the position $r$ of the rightmost 1 in $w$. Then, for each length $k$ such that the number of 1s in $w_{r-k+1} \cdots w_r$ is the same as the number of 1s in $w_1 \cdots w_k$ (each maintained by a counter), the algorithm counts the number of 0s in $w$ following $w_1 \cdots w_k$ and sets $m$ to the maximum of the lengths of such runs of 0s. By Lemma 2 and the definition of $m$, it follows that $\varphi(w)$ is equal to $\min(r+m+1,\, n+1)$, as correctly returned by Algorithm 1. It is not hard to see that the algorithm has linear running time, since the two while-loops are only executed as long as the loop index is smaller than $r$, and this index increases at each iteration of either loop. Therefore, the total number of iterations of the two loops together is upper bounded by $r$. Thus, we have proved the following lemma:
Lemma 3. For $w \in \mathcal{L}_n$, Algorithm 1 computes $\varphi(w)$ in $O(r(w))$, hence $O(n)$ time.
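The characterization of Lemma 2 translates into a linear-time computation. The sketch below is not a transcription of Algorithm 1: instead of two interleaved while-loops it precomputes, for every position, the length of the run of 0s that starts right after it, which is an implementation choice made here for brevity. It assumes $\varphi(w) = \min(r + m + 1, n + 1)$ with $m$ defined as the longest run of 0s following a prefix $w_1 \cdots w_k$, $1 \le k < r$, that has as many 1s as the suffix of $w_1 \cdots w_r$ of length $k$. All function names are ours, and the result is cross-checked against a brute-force evaluation of Definition 6.

```python
from itertools import product


def is_prefix_normal(w: str) -> bool:
    """Brute-force test of the prefix normal condition for all factors."""
    n = len(w)
    ones = [0] * (n + 1)
    for i, c in enumerate(w):
        ones[i + 1] = ones[i] + (c == "1")
    return all(
        ones[start + length] - ones[start] <= ones[length]
        for length in range(1, n + 1)
        for start in range(n - length + 1)
    )


def phi(w: str) -> int:
    """phi(w) in O(n) time for a prefix normal word w that contains a 1."""
    n = len(w)
    r = w.rfind("1") + 1                      # 1-based position of the rightmost 1
    ones = [0] * (n + 1)                      # ones[i] = P_w(i)
    for i, c in enumerate(w):
        ones[i + 1] = ones[i] + (c == "1")
    zero_run = [0] * (n + 1)                  # zero_run[k] = number of 0s at positions k+1, k+2, ...
    for k in range(n - 1, -1, -1):
        zero_run[k] = zero_run[k + 1] + 1 if w[k] == "0" else 0
    m = 0
    for k in range(1, r):
        if ones[k] == ones[r] - ones[r - k]:  # prefix of length k and suffix of w_1..w_r agree in weight
            m = max(m, zero_run[k])
    return min(r + m + 1, n + 1)


def phi_brute_force(w: str) -> int:
    """Definition 6 evaluated directly: smallest j > r with flip(w, j) prefix normal."""
    n = len(w)
    r = w.rfind("1") + 1
    for j in range(r + 1, n + 1):
        if is_prefix_normal(w[: j - 1] + "1" + w[j:]):
            return j
    return n + 1


for n in range(1, 10):
    for bits in product("01", repeat=n):
        w = "".join(bits)
        if "1" in w and is_prefix_normal(w):
            assert phi(w) == phi_brute_force(w)

print(phi("1100"), phi("1010"))  # 3 5
```

For $w = 1010$ the value $5 = n + 1$ signals that no position to the right of the last 1 can be flipped without violating prefix normality.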
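The next lemma gives the basis of our algorithm: applying either of the two operations $\mathrm{flip}$ or $\mathrm{bubble}$ to a prefix normal word results in another prefix normal word.
Lemma 4. Let $w \in \mathcal{L}_n$ with $|w|_1 \ge 2$. Then the following holds:
a) for every $j$ such that $\varphi(w) \le j \le n$, $\mathrm{flip}(w,j)$ is prefix normal, and
b) if $r(w) < n$, then $\mathrm{bubble}(w)$ is prefix normal.
Proof. Part a) follows from Lemma 1, since an index $k$ witnessing that $\mathrm{flip}(w,j)$ is not prefix normal also witnesses this for every position between $r(w)+1$ and $j$, in particular for $\varphi(w)$. For b), let $r = r(w)$ and $r' = \max\{i < r \mid w_i = 1\}$, i.e., $r'$ is the position of the penultimate 1 of $w$. Let $w' = w_1 \cdots w_{r'}0^{n-r'}$. By Fact 1 we have that $w' \in \mathcal{L}_n$. Moreover, $\varphi(w') \le r$, since $\mathrm{flip}(w',r) = w \in \mathcal{L}_n$. Therefore, by a) we have that $\mathrm{bubble}(w) = \mathrm{flip}(w', r+1) \in \mathcal{L}_n$. ∎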
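Definition 7 ($\mathcal{PN}(w)$).
Given $w \in \mathcal{L}_n$ with $r = r(w)$, we define $\mathcal{PN}(w)$ as the set of all prefix normal words $v$ of length $n$ such that $v = w_1 \cdots w_{r-1}\gamma$ for some $\gamma$ with $|\gamma|_1 > 0$. Formally,
$$\mathcal{PN}(w) = \{v \in \mathcal{L}_n \mid v = w_1 \cdots w_{r-1}\gamma,\ |\gamma|_1 > 0\}.$$
We will use the convention that $\mathcal{PN}(\mathrm{bubble}(w)) = \emptyset$ if $r(w) = n$, and $\mathcal{PN}(\mathrm{flip}(w,\varphi(w))) = \emptyset$ if $\varphi(w) = n+1$, since then $\mathrm{bubble}(w)$ resp. $\mathrm{flip}(w,\varphi(w))$ are undefined.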
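Lemma 5. Given $w \in \mathcal{L}_n$, we have
$$\mathcal{PN}(w) = \mathcal{PN}(\mathrm{bubble}(w)) \cup \{w\} \cup \mathcal{PN}(\mathrm{flip}(w,\varphi(w))).$$
Moreover, these three sets are pairwise disjoint.
Proof. It is easy to see that the sets $\mathcal{PN}(\mathrm{bubble}(w))$, $\{w\}$, $\mathcal{PN}(\mathrm{flip}(w,\varphi(w)))$ are pairwise disjoint.
The inclusion $\mathcal{PN}(\mathrm{bubble}(w)) \cup \{w\} \cup \mathcal{PN}(\mathrm{flip}(w,\varphi(w))) \subseteq \mathcal{PN}(w)$ follows from the definition of $\mathcal{PN}$ (Def. 7) for each of the words $\mathrm{bubble}(w)$ and $\mathrm{flip}(w,\varphi(w))$.
Now let $v \in \mathcal{PN}(w)$ and $v \ne w$, and let $r = r(w)$. We argue by cases according to the character $v_r$.
Case 1. $v_r = 0$. Then, $v = w_1 \cdots w_{r-1}0\gamma$ for some $\gamma$ such that $|\gamma|_1 > 0$. Since $\mathrm{bubble}(w) = w_1 \cdots w_{r-1}010^{n-r-1}$, it follows that $v \in \mathcal{PN}(\mathrm{bubble}(w))$.
Case 2. $v_r = 1$. Then, since $v \ne w$, we also have that $|v_{r+1} \cdots v_n|_1 > 0$. Therefore, $v = w_1 \cdots w_r0^{j-r-1}1\gamma'$ for some $\gamma'$ and some $j$ such that $r < j \le n$.
Let $v' = w_1 \cdots w_r0^{j-r-1}10^{n-j}$. Since $v \in \mathcal{L}_n$, we have that $v' \in \mathcal{L}_n$ by Fact 1 (ii) and (iii). Moreover, $v' = \mathrm{flip}(w,j)$, hence $j \ge \varphi(w)$. Therefore, $v = w_1 \cdots w_r0^{\varphi(w)-r-1}\gamma''$ for some $\gamma''$ with $|\gamma''|_1 > 0$. This, by definition, means that $v \in \mathcal{PN}(\mathrm{flip}(w,\varphi(w)))$. ∎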
We are now ready to give an algorithm computing all words in the set $\mathcal{PN}(w)$ for a prefix normal word $w$. The pseudocode is given in Algorithm 2. The procedure generates recursively the set $\mathcal{PN}(w)$ as the union of $\mathcal{PN}(\mathrm{bubble}(w))$, $\{w\}$, and $\mathcal{PN}(\mathrm{flip}(w,\varphi(w)))$. The call to the visiting subroutine is a placeholder indicating that the algorithm has generated a new word in $\mathcal{PN}(w)$, which could be printed, or examined, or processed, as required. By Lemma 5 we know that this subroutine is executed for each word in $\mathcal{PN}(w)$ exactly once.
In order to ease the running time analysis, we next introduce a tree on $\mathcal{PN}(w)$. This tree coincides with the computation tree of Generate($w$), but it will be useful to argue about it independently of the algorithm.
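Definition 8 (Tree on $\mathcal{PN}(w)$).
Let $w \in \mathcal{L}_n$. Then we denote by $\mathcal{T}(w)$ the rooted binary tree with node set $\mathcal{PN}(w)$, root $w$, and for a node $v$, (1) the left child of $v$ defined as empty if $r(v) = n$ and as $\mathrm{bubble}(v)$ otherwise, and (2) the right child of $v$ as empty if $\varphi(v) = n+1$, and as $\mathrm{flip}(v,\varphi(v))$ otherwise.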
The tree has the following easy-to-see properties.
Observation 1 (Properties of $\mathcal{T}(w)$).
(1) There are three types of nodes: the root $w$, bubble-nodes (left children), and flip-nodes (right children).
(2) The leftmost descendant of $w$ has maximal depth, namely $n - r$, where $r = r(w)$.
(3) If a node has a right child, then it also has a left child.
(4) If a node $v$ has no right child, then no descendant of $v$ has a right child. Thus in this case, the subtree rooted in $v$ is a path of length $n - r$, consisting only of bubble-nodes, where $r = r(v)$.
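The next lemma gives correctness, the generation order, and running time of algorithm Generate.
Lemma 6. For $w \in \mathcal{L}_n$, Algorithm 2 generates all prefix normal words in $\mathcal{PN}(w)$ in lexicographic order in $O(n)$ time per word.
Proof. Algorithm 2 recursively generates first all words in $\mathcal{PN}(\mathrm{bubble}(w))$, then the word $w$, and finally the words in $\mathcal{PN}(\mathrm{flip}(w,\varphi(w)))$. As we saw above (Lemma 5), these sets form a partition of $\mathcal{PN}(w)$, hence every word is generated exactly once. Moreover, by definition of $\mathcal{PN}$, for every $v \in \mathcal{PN}(\mathrm{bubble}(w))$ it holds that $v = w_1 \cdots w_{r-1}0\gamma$ with $r = r(w)$ and $|\gamma|_1 > 0$, thus it follows that $v <_{\mathrm{lex}} w$. In addition, for every $v' \in \mathcal{PN}(\mathrm{flip}(w,\varphi(w)))$ it holds that $v' = w_1 \cdots w_r0^{\varphi(w)-r-1}\gamma'$, where $r = r(w)$ and $|\gamma'|_1 > 0$, thus $w <_{\mathrm{lex}} v'$. Since these relations hold at every level of the recursion, it follows that the words are generated by Algorithm 2 in lexicographic order.
For the running time, note that in each node $v$, the algorithm spends $O(n)$ time on the computation of $\varphi(v)$ (Lemma 3), and if $r(v) < n$, at most another $O(n)$ time on computing $\mathrm{bubble}(v)$, and finally, if $\varphi(v) \le n$, at most a further $O(n)$ time on computing $\mathrm{flip}(v,\varphi(v))$. This gives a total running time of $O(n \cdot |\mathcal{PN}(w)|)$, so amortized $O(n)$ time per word. We now show that it actually runs in $O(n)$ time per word.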
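Notice that the algorithm performs an in-order traversal of the tree $\mathcal{T}(w)$. Given a node $v$, the next node visited by the algorithm is given by:
$$\mathrm{next}(v) = \begin{cases} \text{the leftmost descendant of the right child of } v & \text{if } v \text{ has a right child},\\ \mathrm{parent}(v) & \text{if } v \text{ has no right child and is a left child},\\ \mathrm{parent}(u), \text{ where } u \text{ is the closest ancestor of } v \text{ which is a left child} & \text{if } v \text{ has no right child and is a right child}. \end{cases}$$
In all three cases, the algorithm first computes $\varphi(v)$, taking $O(n)$ time by Lemma 3. In the first case, it then descends down to the leftmost descendant of the right child, which takes $n - \varphi(v)$ bubble operations, in $O(n)$ time. In the second case, the parent is reached by one operation (moving the last 1 one position to the left if $v$ is a left child, and flipping the last 1 if $v$ is a right child), taking $O(1)$ time. Finally, in the third case, we have up to depth of $v$ many steps of the latter kind, each taking constant time, so again in total $O(n)$ time. In all three cases, we get a total of $O(n)$ time before the next word is visited. ∎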
Now we are ready to present the full algorithm generating all prefix normal words of length $n$, see Algorithm 3 (Bubble-Flip). It first visits the two prefix normal words $0^n$ and $10^{n-1}$, and then generates recursively all words in $\mathcal{L}_n$ containing at least two 1s, from the starting word $110^{n-2}$.
Theorem 1. The Bubble-Flip algorithm generates all prefix normal words of length $n$, in lexicographic order, and in $O(n)$ time per word.
Proof. Recall that by Fact 1(i) every prefix normal word of length $n$, other than $0^n$, has 1 as its first character. It is easy to see that there is only one prefix normal word of length $n$ with a single 1, namely $10^{n-1}$. Moreover, by Fact 1(i) and the definition of $\mathcal{PN}$, the set of all prefix normal words of length $n$ with at least two 1s coincides with $\mathcal{PN}(110^{n-2})$. By Lemma 6, this set is generated by Generate($110^{n-2}$) in lexicographic order and in $O(n)$ time per word. Noting that prepending $0^n$ and $10^{n-1}$ preserves the lexicographic order concludes the proof. ∎
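The recursion described above can be sketched in a few lines. This is an illustration, not the paper's pseudocode: the value $\varphi(w)$ is computed here by trying positions one by one with a brute-force prefix normality test, which is simpler but slower than the linear bound of Lemma 3, and Python generators play the role of the visiting subroutine. The names `generate` and `bubble_flip` are ours, and $n \ge 1$ is assumed.

```python
from itertools import product
from typing import Iterator


def is_prefix_normal(w: str) -> bool:
    """Brute-force test of the prefix normal condition for all factors."""
    n = len(w)
    ones = [0] * (n + 1)
    for i, c in enumerate(w):
        ones[i + 1] = ones[i] + (c == "1")
    return all(
        ones[start + length] - ones[start] <= ones[length]
        for length in range(1, n + 1)
        for start in range(n - length + 1)
    )


def phi(w: str) -> int:
    """Smallest j > r(w) such that flipping position j keeps w prefix normal, else n + 1."""
    n = len(w)
    r = w.rfind("1") + 1
    for j in range(r + 1, n + 1):
        if is_prefix_normal(w[: j - 1] + "1" + w[j:]):
            return j
    return n + 1


def generate(w: str) -> Iterator[str]:
    """In-order traversal: PN(bubble(w)), then w itself, then PN(flip(w, phi(w)))."""
    n = len(w)
    r = w.rfind("1") + 1
    if r < n:                                   # left child: bubble(w)
        yield from generate(w[: r - 1] + "01" + w[r + 1 :])
    yield w                                     # stands in for the visiting subroutine
    j = phi(w)
    if j <= n:                                  # right child: flip(w, phi(w))
        yield from generate(w[: j - 1] + "1" + w[j:])


def bubble_flip(n: int) -> Iterator[str]:
    """List all prefix normal words of length n >= 1."""
    yield "0" * n
    yield "1" + "0" * (n - 1)
    if n >= 2:
        yield from generate("11" + "0" * (n - 2))


print(list(bubble_flip(4)))
# ['0000', '1000', '1001', '1010', '1100', '1101', '1110', '1111']
```

The recursion depth is bounded by $n$, since the position of the rightmost 1 strictly increases along every root-to-leaf path.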
3.2 Listing $\mathcal{L}_n$ as a combinatorial Gray code
The algorithm Generate (Algorithm 2) performs an in-order traversal of the nodes of the tree $\mathcal{T}(w)$. If instead we do a post-order traversal, we get a combinatorial Gray code of $\mathcal{PN}(w)$, as we will show next. First note that the change in the traversal order can be achieved by moving line 4 in Algorithm 2 to the end of the code, resulting in Algorithm 4.
Lemma 7. In a post-order traversal of $\mathcal{T}(w)$, two consecutive words have Hamming distance at most $3$.
Proof. Let $v$ be some node visited during the traversal of $\mathcal{T}(w)$. If $v$ is a flip-node, then the next node in the listing will be its parent node $u$. Since $v = \mathrm{flip}(u,\varphi(u))$, $v$ is at Hamming distance $1$ from $u$. Otherwise $v$ is a bubble-node, i.e. $v = x010^k$ and its parent is $u = x10^{k+1}$ for some word $x$ and integer $k \ge 0$. If $v$ has no right sibling, then the next node visited is its parent, at Hamming distance $2$ from $v$. Else the next node is the leftmost descendant of $v$'s right sibling, i.e. $x10^k1$, and the Hamming distance to $v$ is at most $3$. ∎
Theorem 2. The Bubble-Flip algorithm using a post-order traversal produces a cyclic combinatorial Gray code on $\mathcal{L}_n$, generating each word in time $O(n)$.
Proof. By Lemma 7, Generate2 produces a combinatorial Gray code. By visiting the two words $0^n$ and $10^{n-1}$ first, followed by Generate2($110^{n-2}$), we get a combinatorial Gray code on all of $\mathcal{L}_n$. The last word in this code is the root $110^{n-2}$ and $d_H(110^{n-2}, 0^n) = 2$, thus this code is also cyclic.
Since only the order of the tree traversal changed w.r.t. the previous algorithm, it follows immediately that the algorithm visits $\mathcal{L}_n$ in amortized $O(n)$ time per word, since the overall running time is, as before, $O(n \cdot |\mathcal{L}_n|)$.
To see that the time to visit the next word is $O(n)$, we distinguish two cases according to the type of node. If $v$ is a flip-node, then the next node is its parent, taking $O(1)$ time to reach. If $v$ is a bubble-node, then we have to check whether it has a right sibling by computing $\varphi(u)$, where $u$ is the parent of $v$, in $O(n)$ time. If $\varphi(u) = n+1$, then the next node is $u$. If $\varphi(u) \le n$, then we have to reach the leftmost descendant of $\mathrm{flip}(u,\varphi(u))$, passing along the way only bubble-nodes. This takes $O(n)$ time, so altogether $O(n)$ time for the node $v$. ∎
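The effect of switching to a post-order traversal can be observed experimentally. The following sketch, with names of our choosing and a brute-force computation of $\varphi$ as a simplifying assumption, lists the words in post-order and checks that consecutive words, as well as the last and the first word, are at Hamming distance at most 3 for small $n$.

```python
from itertools import product
from typing import Iterator


def is_prefix_normal(w: str) -> bool:
    """Brute-force test of the prefix normal condition for all factors."""
    n = len(w)
    ones = [0] * (n + 1)
    for i, c in enumerate(w):
        ones[i + 1] = ones[i] + (c == "1")
    return all(
        ones[start + length] - ones[start] <= ones[length]
        for length in range(1, n + 1)
        for start in range(n - length + 1)
    )


def phi(w: str) -> int:
    """Smallest j > r(w) such that flipping position j keeps w prefix normal, else n + 1."""
    n = len(w)
    r = w.rfind("1") + 1
    for j in range(r + 1, n + 1):
        if is_prefix_normal(w[: j - 1] + "1" + w[j:]):
            return j
    return n + 1


def generate_post_order(w: str) -> Iterator[str]:
    """Post-order traversal: left subtree (bubble), right subtree (flip), then w."""
    n = len(w)
    r = w.rfind("1") + 1
    if r < n:
        yield from generate_post_order(w[: r - 1] + "01" + w[r + 1 :])
    j = phi(w)
    if j <= n:
        yield from generate_post_order(w[: j - 1] + "1" + w[j:])
    yield w


def gray_code(n: int) -> list[str]:
    """Post-order listing of all prefix normal words of length n >= 1."""
    words = ["0" * n, "1" + "0" * (n - 1)]
    if n >= 2:
        words.extend(generate_post_order("11" + "0" * (n - 2)))
    return words


def hamming(x: str, y: str) -> int:
    return sum(a != b for a, b in zip(x, y))


for n in range(2, 11):
    code = gray_code(n)
    assert all(hamming(a, b) <= 3 for a, b in zip(code, code[1:]))
    assert hamming(code[-1], code[0]) <= 3    # cyclic closure

print(gray_code(4))
# ['0000', '1000', '1001', '1010', '1101', '1111', '1110', '1100']
```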
3.3 Prefix normal words with given critical prefix
Recall Definition 2. It was conjectured in earlier work that the average length of the critical prefix taken over all prefix normal words of length $n$ is $O(\log n)$. Using the Bubble-Flip algorithm, we can generate all prefix normal words with a given critical prefix $1^s0^t$, which could prove useful in proving or disproving this conjecture. Moreover, if we succeed in counting prefix normal words with critical prefix $1^s0^t$, then this could lead to an enumeration of $\mathcal{L}_n$, another open problem on prefix normal words.
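In the following lemma, we present a characterization of prefix normal words of length $n$ with the same critical prefix in terms of our generation algorithm. For $s, t \ge 0$, let us denote by $\mathcal{L}_n(s,t)$ the set of all prefix normal words of length $n$ and critical prefix $1^s0^t$. Note that there is only one prefix normal word whose critical prefix has $s = 0$, namely $0^n$.
Lemma 8. Fix $s \ge 1$ and $t \ge 1$ with $s + t < n$, and let $u = 1^s0^t10^{n-s-t-1}$. Then, $\mathcal{L}_n(s,t) = \{u\} \cup \mathcal{PN}(\mathrm{flip}(u,\varphi(u)))$.
Proof. If $\varphi(u) = n+1$, then clearly $\mathcal{L}_n(s,t) = \{u\}$. Otherwise, with $r = r(u) = s+t+1$,
$$\mathcal{L}_n(s,t) = \{v \in \mathcal{L}_n \mid v = 1^s0^t1\gamma\} = \{u\} \cup \{v \in \mathcal{L}_n \mid v = 1^s0^t10^{\varphi(u)-r-1}\gamma',\ |\gamma'|_1 > 0\} = \{u\} \cup \mathcal{PN}(\mathrm{flip}(u,\varphi(u))),$$
where the first equality holds by definition of critical prefix, the second by definition of $\varphi$, and the third by definition of $\mathcal{PN}$. ∎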
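The characterization can be tested for small lengths by comparing the word $u = 1^s0^t10^{n-s-t-1}$ together with its right subtree against a brute-force filter on the critical prefix. The sketch below assumes $s, t \ge 1$ and $s + t < n$; the function names are ours, and $\varphi$ is again computed by brute force for simplicity.

```python
from itertools import product
from typing import Iterator


def is_prefix_normal(w: str) -> bool:
    """Brute-force test of the prefix normal condition for all factors."""
    n = len(w)
    ones = [0] * (n + 1)
    for i, c in enumerate(w):
        ones[i + 1] = ones[i] + (c == "1")
    return all(
        ones[start + length] - ones[start] <= ones[length]
        for length in range(1, n + 1)
        for start in range(n - length + 1)
    )


def phi(w: str) -> int:
    """Smallest j > r(w) such that flipping position j keeps w prefix normal, else n + 1."""
    n = len(w)
    r = w.rfind("1") + 1
    for j in range(r + 1, n + 1):
        if is_prefix_normal(w[: j - 1] + "1" + w[j:]):
            return j
    return n + 1


def generate(w: str) -> Iterator[str]:
    """In-order traversal of the subtree rooted in w."""
    n = len(w)
    r = w.rfind("1") + 1
    if r < n:
        yield from generate(w[: r - 1] + "01" + w[r + 1 :])
    yield w
    j = phi(w)
    if j <= n:
        yield from generate(w[: j - 1] + "1" + w[j:])


def critical_prefix(w: str) -> str:
    """First run of 1s followed by the first run of 0s."""
    s = len(w) - len(w.lstrip("1"))
    rest = w[s:]
    return w[: s + len(rest) - len(rest.lstrip("0"))]


def words_with_critical_prefix(n: int, s: int, t: int) -> list[str]:
    """The word u = 1^s 0^t 1 0^(n-s-t-1) together with its right subtree."""
    u = "1" * s + "0" * t + "1" + "0" * (n - s - t - 1)
    result = [u]
    j = phi(u)
    if j <= n:
        result.extend(generate(u[: j - 1] + "1" + u[j:]))
    return result


for n in range(3, 10):
    all_pn = ["".join(b) for b in product("01", repeat=n) if is_prefix_normal("".join(b))]
    for s in range(1, n - 1):
        for t in range(1, n - s):
            expected = [w for w in all_pn if critical_prefix(w) == "1" * s + "0" * t]
            assert sorted(words_with_critical_prefix(n, s, t)) == expected

print(words_with_critical_prefix(5, 2, 1))  # ['11010', '11011']
```

Counting the elements returned for each pair $(s,t)$ gives the class sizes for a fixed length without enumerating all of $\Sigma^n$.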
In Fig. 2, we give a sketch of the placement of some of the sets with the same critical prefix within $\mathcal{T}(110^{n-2})$, which, as the reader will recall, contains all prefix normal words of length $n$ except $0^n$ and $10^{n-1}$. The nodes in the tree are labelled with the corresponding generated word, and we have highlighted the subtrees corresponding to three of these sets. Let us take a closer look at the set of words with critical prefix $1^s0^t$. The word $1^s0^t10^{n-s-t-1}$ is reached starting from the root $110^{n-2}$, traversing $s-1$ right branches (i.e. flip-branches), passing through the word $1^{s+1}0^{n-s-1}$, and then traversing $t$ left branches (i.e. bubble-branches). The set is then equal to the word $1^s0^t10^{n-s-t-1}$ together with its right subtree.
Apart from revealing the recursive structure of sets of prefix normal words with the same critical prefix, the Bubble-Flip algorithm allows us to collect experimental data on the size of for different values of and . We give some of these numbers, for and small values of , see Table 2. It was already known  that, for , the average critical prefix length, taken over all , is approximately ; with the new algorithm we are able to generate more precise data. In Fig. 3, we plot the relative number of prefix normal words with a given critical prefix length, for lengths and .