Generating a Gray code for prefix normal words in amortized polylogarithmic time per word

03/05/2020 · Péter Burcsi et al.

A prefix normal word is a binary word with the property that no substring has more 1s than the prefix of the same length. By proving that the set of prefix normal words is a bubble language, we can exhaustively list all prefix normal words of length n as a combinatorial Gray code, where successive strings differ by at most two swaps or bit flips. This Gray code can be generated in O(log^2 n) amortized time per word, while the best generation algorithm hitherto has O(n) running time per word. We also present a membership tester for prefix normal words, as well as a novel characterization of bubble languages.


1 Introduction

A binary word w of length n is prefix normal if, for all k, no substring of w of length k has more 1s than the prefix of w of length k. For example, the following are the 14 prefix normal words of length 5:
00000, 10000, 10001, 10010, 10100, 11000, 10101, 11001, 11010, 11100, 11011, 11101, 11110, 11111.

The word 10110, for instance, is not prefix normal, because the substring 11 has more 1s than the prefix 10; similarly, 01111 is not prefix normal, because the substring 1111 has more 1s than the prefix 0111 of length 4. The number of prefix normal words for n = 1, 2, 3, 4, 5 is

2, 3, 5, 8, 14,

respectively. This enumeration sequence is included as sequence A194850 in The On-Line Encyclopedia of Integer Sequences (OEIS) [30]. It is not difficult to show that pnw(n), the number of prefix normal words of length n, grows exponentially. Some bounds and partial enumeration results were presented in [12], and it was conjectured there that pnw(n) = 2^{n - Θ(log^2 n)}. This conjecture was recently proved by Balister and Gerke in [5]. Finding a closed form formula or generating function for pnw(n), however, remains an open problem.

Prefix normal words were originally introduced in [18] by two of the current authors, in the context of Binary Jumbled Pattern Matching (BJPM): Given a binary string s of length n and a pair of non-negative integers (x, y), decide whether s has a substring with x 1s and y 0s. While the online version of this problem can be solved naively in O(n) time per query, the indexed version has attracted much attention during the past decade [8, 9, 24, 22, 3, 13, 4, 14, 2, 21, 23, 17, 1]. As was shown in [18, 12], every binary word can be assigned two canonical prefix normal words, called its prefix normal forms, which can then be used to answer BJPM queries in constant time.

1.1 Our contributions

In this paper, we deal with the question of generating all prefix normal words of a given length n. In combinatorial generation, the aim is to exhaustively list all instances of a combinatorial object. Typically, the number of these instances grows exponentially, and the running time is measured per object generated, excluding the time needed to output the objects. For an introduction to combinatorial generation, see [26].

The current best generation algorithm for prefix normal words runs in O(n) time per word [15]. Our algorithm improves on this considerably, using O(log^2 n) amortized time per word. (This algorithm was originally presented in [11], where we proved a weaker amortized running time per word and conjectured a polylogarithmic bound; based on the result of [5] on the asymptotic number of prefix normal words, we are now able to prove the O(log^2 n) amortized running time per word.) It is based on the theory of bubble languages [27, 28, 31], an interesting class of binary languages defined by the following property: L is a bubble language if, for every word w in L, replacing the first occurrence of 01 (if any) by 10 results in another word in L [27, 28]. Many important languages are bubble languages, including binary necklaces, Lyndon words, and k-ary Dyck words. (Those languages are actually first-10 bubble, while prefix normal words are first-01 bubble: the difference is simply exchanging the roles of 0 and 1 in the definition, see [27, 28].) A generic generation algorithm for bubble languages was given in [28], yielding cool-lex Gray codes for each subset of a bubble language containing all strings of a fixed length and weight (number of 1s). In general, a (combinatorial) Gray code is an exhaustive listing of all instances of a combinatorial object such that successive objects in the listing are "close" in some well-defined sense. In the case of cool-lex order, successive strings differ by at most two swaps. The generic algorithm's efficiency depends only on a language-dependent subroutine called an oracle, which in the best case leads to CAT (constant amortized time) generation algorithms.

In the following, we show that the set of all prefix normal words forms a bubble language; it is the first new and interesting language shown to be a bubble language since the original exposition [27, 28]. We develop an oracle for prefix normal words and apply the generic generation algorithm to obtain a cool-lex ordering of the prefix normal words of length n and weight d. Concatenating these lists in increasing order of weight, we obtain a Gray code for all prefix normal words of length n in which successive words differ by at most two swaps or by a swap and a bit flip. We then present an optimized oracle for prefix normal words, and, based on recent results from [5], we prove that our new generation algorithm runs in O(log^2 n) amortized time per word. Even though the previous O(n) time per word algorithm of [15] also provided a Gray code for prefix normal words (albeit with respect to a different measure of closeness), we achieve a very considerable improvement in running time.

As an example, the listing of the 41 prefix normal words of length 7 that results from our algorithm, partitioned by weight, is given in Table 1.

0000000 1000000 1010000 1101000 1101100 1110110 1110111 1111111
1001000 1010100 1110100 1111010 1111011
1000100 1100100 1101010 1101101 1111101
1000010 1010010 1100110 1110101 1111110
1000001 1100010 1110010 1101011
1100000 1010001 1101001 1110011
1001001 1010101 1111001
1100001 1100101 1111100
1110000 1100011
1110001
1111000
Table 1: All prefix normal words of length 7 as output by our algorithm. Each column contains the words of one weight, from weight 0 (leftmost column) to weight 7 (rightmost column).

A second contribution of this paper is a new characterization of bubble languages. We show that bubble languages can be described in terms of a closure property in the computation tree of a simple recursive generation algorithm for all binary strings. We believe that this view could aid other researchers in applying the powerful tool of bubble languages and their accompanying Gray codes. In fact, it was the discovery that prefix normal words formed a bubble language that led to an efficient generation algorithm and Gray code for our language.


The final part of the paper deals with membership testing, i.e. deciding whether a given binary word is prefix normal. Several quadratic-time membership testers for prefix normal words were given in [18, 12]. The best worst-case time tester can be obtained by using the connection to Indexed Binary Jumbled Pattern Matching (BJPM), for which the current best algorithm, by Chan and Lewenstein, runs in O(n^{1.864}) time [13]. We present a new membership tester for prefix normal words which applies a simple two-phase approach and is conjectured to run, on average, in strictly subquadratic time, where the average is taken over all words of length n.

1.2 Related work

In addition to the connection to jumbled indexing, prefix normal words are also increasingly being studied for their own sake. Enumeration and language-theoretic results were given by Burcsi et al. in [12], and Balister and Gerke strengthened some results in [5]: in particular, they proved a conjecture about the asymptotic growth behaviour of the number of prefix normal words and gave a new result about the size of the equivalence classes. Cicalese et al. gave a generation algorithm in [15], with linear running time per word, and studied infinite prefix normal words in [16]. Prefix normal words and prefix normal forms have been applied to a certain family of graphs by Blondin-Massé et al. [7], and were shown to pertain to a new class of languages connected to the Reflected Binary Gray Code by Sawada et al. [29]. Very recently, Fleischmann et al. presented some results on the size of the equivalence classes in [FKNP20].

1.3 Overview

The paper is organized as follows. In Section 2, we give the necessary terminology and some basic facts about prefix normal words, and we develop a result on the average critical prefix length of a prefix normal word. This result will later be used in the analysis of our generation algorithm. In Section 3, we give a simple generation algorithm which, based on the result of [5], is shown to run in amortized O(n) time per word. In Section 4, we present our novel view of bubble languages. In Section 5, we introduce our new generation algorithm, which uses the bubble framework. In Section 6, we present the new membership tester. We close with some open problems in Section 7.

2 Preliminaries

A binary word (or string) w over Σ = {0, 1} is a finite sequence of elements from Σ. Its length is denoted by |w|, and the i-th symbol of w by w_i, for 1 ≤ i ≤ |w|. We denote by Σ^n the set of words over Σ of length n, by Σ^* the set of all words over Σ, and by ε the empty word. If w = uv for some words u and v, we say that u is a prefix of w and v is a suffix of w. A substring of w is a prefix of a suffix of w. A binary language is any subset of Σ^*. We denote by |w|_c the number of occurrences in w of the character c ∈ {0, 1}. The number of 1s in w, |w|_1, is also called the weight of w. For a binary language L, let L(n) denote the subset of all strings in L with length n, and L(n, d) that of all strings in L with length n and weight d.

We denote by swap(w, i, j) the string obtained from w by exchanging the characters in positions i and j.

We define combinatorial Gray codes, following [26, ch. 5]: Given a set S of combinatorial objects and a relation ~ on S (the closeness relation), a combinatorial Gray code for S is a listing s_1, s_2, ..., s_N of the elements of S such that s_i ~ s_{i+1} for 1 ≤ i ≤ N-1. If we also require that s_N ~ s_1, then the code is called cyclic.

2.1 Prefix normal words

Let w be a binary word. For each 1 ≤ i ≤ |w|, we define P_w(i), the weight of the prefix of w of length i, and F_w(i), the maximum weight of length-i substrings of w. The function F is sometimes called the maximum-ones function, while in the context of compact data structures, the function P is often called rank_1 [25].

Definition 2.1

A word w is called prefix normal if for all 1 ≤ i ≤ |w|, F_w(i) = P_w(i), i.e. if no substring of w has more 1s than the prefix of the same length. We denote by PN the language of prefix normal words, and by pnw(n) = |PN(n)| the number of prefix normal words of length n.

In [18, 12] it was shown that for every word w there exists a unique word w', called its prefix normal form, such that for all i, F_w(i) = P_{w'}(i), and w' is prefix normal. We give the formal definition:

Definition 2.2

Given a word w, the prefix normal form of w, denoted PNF(w), is the prefix normal word w' given by P_{w'}(i) = F_w(i), for 1 ≤ i ≤ |w|. Two words w and v are prefix normal equivalent if PNF(w) = PNF(v).

As an example, the word 01101 has the maximum-ones function F = 1, 2, 2, 3, 3, as can be checked easily. It is furthermore not difficult to see that for all w and i, either F_w(i+1) = F_w(i) or F_w(i+1) = F_w(i) + 1. Thus the sequence of first differences of F yields a binary word, in this case the word 11010 = PNF(01101). In Table 2 we list all prefix normal words of length 5, each followed by the set of binary words with this prefix normal form.

w      Words with this prefix normal form           w      Words with this prefix normal form
00000  {00000}                                      11001  {11001, 10011}
10000  {10000, 01000, 00100, 00010, 00001}          11010  {11010, 10110, 01101, 01011}
10001  {10001}                                      11100  {11100, 01110, 00111}
10010  {10010, 01001}                               11011  {11011}
10100  {10100, 01010, 00101}                        11101  {11101, 10111}
11000  {11000, 01100, 00110, 00011}                 11110  {11110, 01111}
10101  {10101}                                      11111  {11111}
Table 2: All prefix normal words of length 5 and their equivalence classes.
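
The functions P, F and the prefix normal form are straightforward to compute from the definitions. The following Python sketch (our own illustration; the function names are not from the paper) does so in O(n^2) time, can be used to check the example above and the entries of Table 2, and doubles as a simple quadratic-time membership tester (cf. Algorithm 4 in Section 5).

def max_ones(w):
    """F_w: F[k] = maximum number of 1s over all substrings of w of length k."""
    n = len(w)
    F = [0] * (n + 1)
    for i in range(n):
        ones = 0
        for j in range(i, n):
            ones += w[j] == "1"
            length = j - i + 1
            if ones > F[length]:
                F[length] = ones
    return F

def prefix_normal_form(w):
    """PNF(w): the word whose length-k prefix contains exactly F_w(k) ones,
    i.e. the sequence of first differences of F_w (Definition 2.2)."""
    F = max_ones(w)
    return "".join("1" if F[k] > F[k - 1] else "0" for k in range(1, len(w) + 1))

def is_prefix_normal(w):
    """A word is prefix normal iff it equals its own prefix normal form."""
    return prefix_normal_form(w) == w

# The example from the text: F_{01101} = 1, 2, 2, 3, 3 and PNF(01101) = 11010.
assert max_ones("01101")[1:] == [1, 2, 2, 3, 3]
assert prefix_normal_form("01101") == "11010"
assert is_prefix_normal("11010") and not is_prefix_normal("01101")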

The next lemma lists some properties of prefix normal words which will be needed in the following. Proofs can be found in [12].

Lemma 2.3 ([12])

Let w be a binary word.

  1. w is prefix normal if and only if all of its prefixes are prefix normal.

  2. If w is prefix normal, then so is w0.

  3. Let w be prefix normal. Then the word w1 is prefix normal if and only if, for every proper suffix u of w, |u|_1 < P_w(|u| + 1).

  4. w is prefix normal if and only if PNF(w) = w.

We refer the interested reader to [12] for more on prefix normal words.

2.2 Critical prefix length

It will often be useful to write binary words as w = 1^s 0^t γ, where s, t ≥ 0 and γ is either empty or a binary word beginning with 1. In other words, s is the length of the first, possibly empty, 1-run of w, t is the length of the first 0-run, and γ is the remaining, possibly empty, suffix. Note that this representation is unique.

Definition 2.4

Let w ∈ Σ^n, w = 1^s 0^t γ, where s, t ≥ 0 and γ is either empty or begins with 1. We refer to 1^s 0^t as w's critical prefix, and denote by cr(w) = s + t the critical prefix length of w.

For example, the critical prefix length of 1100101 is 4, that of 0001011 is 3, and that of 1111111 is 7.

Lemma 2.5

The expected critical prefix length of a binary string of length n is 3 - (n+3)/2^n.

Proof. Let w ∈ Σ^n be chosen uniformly at random, and let X = cr(w) be the random variable given by its critical prefix length. We have that X = n if and only if w = 1^s 0^{n-s}, with 0 ≤ s ≤ n, so X = n for exactly n+1 words, and P(X = n) = (n+1)/2^n. Otherwise X = k for some 1 ≤ k ≤ n-1, and then w begins with 1^s 0^{k-s} 1 for some 0 ≤ s ≤ k-1; since the probability of having a fixed prefix of length k+1 is 1/2^{k+1}, we get P(X = k) = k/2^{k+1}. Therefore,

E[X] = sum_{k=1}^{n-1} k^2/2^{k+1} + n(n+1)/2^n = sum_{k≥1} k^2/2^{k+1} - sum_{k≥n} k^2/2^{k+1} + (n^2+n)/2^n = 3 - (n^2+2n+3)/2^n + (n^2+n)/2^n = 3 - (n+3)/2^n,

where we have used in the last two equations that sum_{k≥1} k^2/2^{k+1} = 3, and that the tail of the infinite sum has the closed form sum_{k≥n} k^2/2^{k+1} = (n^2+2n+3)/2^n.

Remark: It can be shown in a similar way that the expected critical prefix length of a randomly chosen infinite binary word is 3.

Lemma 2.6

The sequence (a_n) given by the sums a_n = sum of cr(w) over all words w of length n obeys the recurrence a_{n+1} = 2 a_n + n + 2, with a_1 = 2.

Proof. Consider all strings of length n and what happens to their critical prefix length if one character is appended. For those w with cr(w) < n, it just stays the same, and since we get two new strings w0 and w1, it is counted twice. The remaining strings are either of the form 1^s 0^{n-s} with n-s ≥ 1, in which case appending a 0 will increase the critical prefix length by 1, and appending a 1 will not; there are n many of these. Else the string is 1^n, in which case appending a 0 or a 1 will both increase the critical prefix length by 1. So altogether we get a_{n+1} = 2 a_n + n + 2.

Incidentally, the sequence (a_n) is also listed in the OEIS [30], together with a second-order recurrence for a_n. That it also obeys the first-order recurrence of Lemma 2.6 can be seen by computing the difference of consecutive terms and substituting the second-order recurrence for both.
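
As a quick sanity check of Lemmas 2.5 and 2.6 (a brute-force verification of our own, not part of the paper), one can sum cr(w) over all words of small length and compare against the equivalent closed form 3·2^n - (n+3) and against the recurrence:

from itertools import product

def critical_prefix_length(w):
    """cr(w) = s + t, where w = 1^s 0^t gamma and gamma is empty or starts with 1."""
    s = 0
    while s < len(w) and w[s] == "1":
        s += 1
    t = 0
    while s + t < len(w) and w[s + t] == "0":
        t += 1
    return s + t

prev = None
for n in range(1, 12):
    a_n = sum(critical_prefix_length("".join(v)) for v in product("01", repeat=n))
    assert a_n == 3 * 2**n - (n + 3)          # equivalent to E[cr] = 3 - (n+3)/2^n
    if prev is not None:
        assert a_n == 2 * prev + (n - 1) + 2  # the recurrence of Lemma 2.6
    prev = a_n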

Next we show that the expected critical prefix length of the prefix normal form of a randomly chosen word is Θ(log n). Note that this is not the same as the expected critical prefix length of a random prefix normal word, due to the fact that the sizes of the equivalence classes of the prefix normal equivalence vary considerably (see [18], and Thm. 2 in [5]).

Lemma 2.7

Given a random word w ∈ Σ^n, let w' = PNF(w) be the prefix normal form of w. Then the expected critical prefix length of w' is Θ(log n).

Proof. Let PNF(w) = 1^s 0^t γ, with γ either empty or beginning with 1, and consider the random variables S = s and T = t. It is known that the expected maximum length of a run in a random word of length n is Θ(log n) [19]. Clearly, s equals the length of the longest run of 1s in w, thus E[S] = Θ(log n). What about T? Consider a 1-run of w of maximum length s. If w has more than s 1s, then there is a substring of w consisting of this 1-run, one further 1, and the 0s between them; the number of 0s in this substring is an upper bound on t. Since these 0s form one single run, their number is again O(log n) in expectation. If w has exactly s 1s, then PNF(w) = 1^s 0^{n-s}, so t is at most n. The number of words with at most one 1-run is O(n^2). So we have E[T] = O(log n) + n · O(n^2)/2^n = O(log n), and altogether E[cr(PNF(w))] = E[S] + E[T] = Θ(log n).

It is easy to see that the number of prefix normal words grows exponentially (just note that 1^{|v|} v is prefix normal for every word v). Balister and Gerke [5] recently proved a conjecture from [12] about the asymptotic number of prefix normal words:

Theorem 2.8 ([5], Thm. 1)

The number of prefix normal words of length n is 2^{n - Θ(log^2 n)}.

We will use this theorem to prove an upper bound on the expected value of cr(w) for prefix normal words w. We need the following lemma.

Lemma 2.9

Let c > 0, and suppose that pnw(n) ≥ 2^{n - c log^2 n}. Let X be a random variable taking the value cr(w) for a prefix normal word w of length n chosen uniformly at random. Then E[X] = O(log^2 n).

Proof. Set k = ⌈(c+2) log^2 n⌉. Consider all prefix normal words w with cr(w) ≤ k. These contribute at most k to E[X]. Now consider all prefix normal words with cr(w) > k. There are at most (k+2) 2^{n-k-1} binary words with cr(w) > k, since these words must begin with one of the k+2 patterns 1^a 0^{k+1-a}, 0 ≤ a ≤ k+1, and therefore, there are at most this many prefix normal words with cr(w) > k. Each prefix normal word can only contribute at most n/pnw(n) to the average. So the contribution to the average summed over all prefix normal words with cr(w) > k is at most n (k+2) 2^{n-k-1}/pnw(n) ≤ n (k+2) 2^{c log^2 n - k} ≤ n (k+2) 2^{-2 log^2 n}, which is o(1), since 2^{2 log^2 n} grows faster than any polynomial in n, and hence negligible. Therefore E[X] ≤ k + o(1) = O(log^2 n).

Theorem 2.10

The expected length of the critical prefix of a prefix normal word of length n is O(log^2 n).

Proof. By Theorem 2.8, we know that there exists a constant c > 0 such that, for sufficiently large n, pnw(n) ≥ 2^{n - c log^2 n}. Applying Lemma 2.9, we get E[cr(w)] = O(log^2 n), where w ranges uniformly over all prefix normal words of length n.

3 A Simple Generation Algorithm for Prefix Normal Words

Our first generation algorithm uses Lemma 2.3: (1) a word is prefix normal if and only if all of its prefixes are prefix normal; (2) if w is prefix normal, so is w0, but not necessarily w1; and (3) w1 is prefix normal if and only if, for every proper suffix u of w, the number of 1s in u is strictly less than P_w(|u| + 1). Words w for which w1 is not prefix normal are called extension critical. Thus, whether a word w is extension critical can be tested in time linear in |w|.

We can therefore generate all prefix normal words of length n by iteratively generating all prefix normal words of length i, for i = 1, ..., n-1, and extending each one by a 0 only if it is extension critical, or by both a 0 and a 1 if it is not. This yields a computation tree whose leaves are precisely the prefix normal words of length n. We refer to this algorithm as the Simple Generation Algorithm.
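
The following Python sketch (ours; the names are not from the paper) implements the Simple Generation Algorithm directly, using the extension-critical test of Lemma 2.3(3):

def ones_in_prefixes(w):
    """P[k] = number of 1s in the length-k prefix of w, for k = 0..|w|."""
    P = [0]
    for c in w:
        P.append(P[-1] + (c == "1"))
    return P

def is_extension_critical(w):
    """For a prefix normal word w, decide whether w + '1' is NOT prefix normal,
    using Lemma 2.3(3): w1 is prefix normal iff every proper suffix u of w
    satisfies |u|_1 < P_w(|u| + 1)."""
    P = ones_in_prefixes(w)
    suffix_ones = 0
    for ell in range(len(w)):              # proper suffixes, from the empty one upwards
        if ell > 0:
            suffix_ones += w[len(w) - ell] == "1"
        if suffix_ones >= P[ell + 1]:
            return True
    return False

def simple_generation(n):
    """List all prefix normal words of length n (Simple Generation Algorithm)."""
    words, stack = [], [""]
    while stack:
        w = stack.pop()
        if len(w) == n:
            words.append(w)
            continue
        stack.append(w + "0")              # appending 0 always preserves prefix normality
        if not is_extension_critical(w):
            stack.append(w + "1")          # appending 1 only if w is not extension critical
    return words

# pnw(1..5) = 2, 3, 5, 8, 14 (OEIS A194850)
assert [len(simple_generation(n)) for n in range(1, 6)] == [2, 3, 5, 8, 14]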

Theorem 3.1

The Simple Generation Algorithm generates all prefix normal words of length n in amortized O(n) time per word.

Proof. Notice that an extension-critical test is performed in each inner node, taking O(i) time if the node is at depth i. An inner node at depth i corresponds to a prefix normal word of length i, so the number of tests equals the total number of prefix normal words of length smaller than n. From Theorem 2.8 (Balister and Gerke [5]) it follows that most prefix normal words are not extension critical; in particular, for sufficiently large n, more than half of all prefix normal words of a given length can be extended by 1 as well as by 0. Therefore, pnw(i+1) ≥ (3/2) pnw(i), and by induction pnw(n-j) ≤ (2/3)^j pnw(n), implying that the total time taken by the algorithm to generate all prefix normal words of length n is at most proportional to sum_{j≥1} n (2/3)^j pnw(n) = O(n · pnw(n)), i.e. amortized O(n) per word.

4 Bubble Languages and Combinatorial Generation

In this section we give a brief introduction to bubble languages, mostly summarising results from [27, 28]. However, our presentation is different in that it presents the generation of a bubble language as a restriction of an algorithm for generating all binary words. This view also yields a new characterization of bubble languages in terms of the computation tree of this generation algorithm (Prop. 4.2).

Definition 4.1 ([27, 28])

A binary language L is called a first-01 bubble language if, for every word w ∈ L, exchanging the first occurrence of 01 (if any) by 10 results in another word in L. It is called a first-10 bubble language if, for every word w ∈ L, exchanging the first occurrence of 10 (if any) by 01 results in another word in L. If not further specified, by bubble language we mean a first-01 bubble language.

For example, the languages of binary Lyndon words and necklaces are first-10 bubble languages. As was shown in [27], a language is a bubble language if and only if each of its fixed-weight subsets is a bubble language. This implies that for generating a bubble language, it suffices to generate its fixed-weight subsets.

Next we consider combinatorial generation of binary strings. Let w be a binary string of length n, let d be its weight, and let p_1 < p_2 < ... < p_d denote the positions of the 1s in w. Clearly, we can obtain w from the word 1^d 0^{n-d} with the following algorithm: first swap the last 1 with the 0 in position p_d, then swap the (d-1)-st 1 with the 0 in position p_{d-1}, etc. Note that every 1 is moved at most once, and in particular, once the i-th 1 is moved into position p_i, the suffix w_{p_i} ... w_n remains fixed for the rest of the algorithm.

These observations lead to the Recursive Swap Generation Algorithm (Algorithm 1). Starting from a string of the form 1^s 0^t γ, the call RecursiveSwap(s, t) recursively generates all binary strings with the fixed suffix γ, preceded by every string of length s+t and weight s. In particular, the call RecursiveSwap(d, n-d), starting from 1^d 0^{n-d}, generates all binary strings of length n with weight d. The algorithm swaps the last 1 of the first 1-run with each of the 0s of the first 0-run, thereby generating a new string each time, for which it makes a recursive call. During the execution of the algorithm, the current string resides in a global array b = b_1 b_2 ... b_n. The function Swap(i, j) swaps the values stored in b_i and b_j. In the subroutine Visit() we can print the contents of this array, or increment a counter, or check some property of the current string. Crucially, Visit() is called exactly once for every string.

1: global array b = b_1 b_2 ... b_n
2: procedure RecursiveSwap(s, t)
3:     if s > 0 and t > 0 then
4:         for i ← 1 to t do
5:             Swap(s, s+i)
6:             RecursiveSwap(s-1, i)
7:             Swap(s, s+i)
8:     Visit()
9:
10: for d ← 0 to n do
11:     b ← 1^d 0^{n-d}; RecursiveSwap(d, n-d)
Algorithm 1 Recursive Swap Generation Algorithm to generate all binary strings of length n.
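
For concreteness, the following Python sketch (ours; the array is 0-indexed, unlike the 1-indexed pseudocode above) mirrors Algorithm 1:

from itertools import product

def cool_lex(n, d):
    """All binary strings of length n and weight d, in cool-lex order
    (post-order traversal of the computation tree T(n, d))."""
    b = [1] * d + [0] * (n - d)      # the global array, initialized to 1^d 0^(n-d)
    out = []

    def recursive_swap(s, t):
        # the current string is 1^s 0^t gamma; gamma stays fixed within this call
        if s > 0 and t > 0:
            for i in range(1, t + 1):
                b[s - 1], b[s - 1 + i] = b[s - 1 + i], b[s - 1]   # Swap(s, s+i)
                recursive_swap(s - 1, i)                          # child 1^(s-1) 0^i 1 0^(t-i) gamma
                b[s - 1], b[s - 1 + i] = b[s - 1 + i], b[s - 1]   # undo the swap
        out.append("".join(map(str, b)))                          # Visit()

    recursive_swap(d, n - d)
    return out

def all_strings(n):
    """All binary strings of length n, weight class by weight class."""
    return [w for d in range(n + 1) for w in cool_lex(n, d)]

assert sorted(all_strings(4)) == ["".join(v) for v in product("01", repeat=4)]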

Let T(n, d) denote the computation tree of RecursiveSwap(d, n-d). As an example, Fig. 1 illustrates one such tree (ignore for now the highlighted words). In slight abuse of notation, in the following we identify a node with the string it represents. The depth of T(n, d) equals d, the number of 1s, while the maximum degree (number of children) is n-d, the number of 0s. Consider the subtree rooted at a node w = 1^s 0^t γ: its depth is s and the maximum degree of its nodes is t; the number of children of w itself is exactly t, and w's i-th child is swap(w, s, s+i) = 1^{s-1} 0^i 1 0^{t-i} γ. Note that the suffix γ remains unchanged in the entire subtree; that this subtree is isomorphic to the computation tree T(s+t, s); and that the critical prefix length strictly decreases along any downward path in the tree. The algorithm performs a post-order traversal of the tree, yielding a listing of the strings of length n with weight d in what is referred to as cool-lex order [31, 28, 27].

Figure 1: A computation tree of the Recursive Swap Generation Algorithm. Prefix normal words in bold.
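
As the original figure is not reproduced here, the following small example of our own may serve as a substitute: the computation tree T(4, 2), with prefix normal words marked by an asterisk, is

1100*
├── 1010*
│   └── 0110
└── 1001*
    ├── 0101
    └── 0011

The post-order traversal visits 0110, 1010, 0101, 0011, 1001, 1100, which is the cool-lex order of the six strings of length 4 and weight 2. Note that the three prefix normal words form a subtree rooted in 1100, as asserted by Prop. 4.2 below.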

We can express the bubble language property in terms of the computation tree as follows:

Proposition 4.2

A language L is a bubble language if and only if, for every n and d, its fixed-weight subset L(n, d) is closed w.r.t. parents and left siblings in the computation tree T(n, d) of the Recursive Swap Generation Algorithm. In particular, if L(n, d) ≠ ∅, then it forms a subtree of T(n, d) rooted in 1^d 0^{n-d}.

Proof. Follows immediately from the definition of bubble languages.

Using Prop. 4.2, the Recursive Swap Generation Algorithm can be applied to generate any fixed-weight bubble language L(n, d), as long as we have a way of deciding, for a node w already known to be in L, which is its rightmost child (if any) that is still in L. If such a child exists, and it is the i-th child swap(w, s, s+i), then the bubble property ensures that all children to its left are also in L. Thus, line 4 in Algorithm 1 can simply be replaced by "for i ← 1 to (index of the rightmost child in L) do".

The framework provided in [27, 28] to list the strings in L(n, d) for a given bubble language L can thus be viewed as a restriction of the Recursive Swap Generation Algorithm: Given a string b = 1^s 0^t γ in L, compute the largest integer i such that swap(b, s, s+i) is in L, in other words, the rightmost child of node b which is still in L, called the bubble upper bound. (In [27, 28], actually a "bubble lower bound" is computed. Because we feel it simplifies the discussion, here we introduce the related "bubble upper bound"; the bubble lower bound is equal to t minus the bubble upper bound.) This simple framework is outlined in Algorithm 2 for a given bubble language L, where the current word is stored globally in the array b. The function Oracle(s, t) returns the bubble upper bound for the current word b = 1^s 0^t γ with respect to L. The membership tester Member(w) returns true if and only if w ∈ L. The initial call is GenBubble(d, n-d) with b initialized to 1^d 0^{n-d}.

1: global array b = b_1 b_2 ... b_n
2: function Oracle(s, t)
3:     i ← 0
4:     while i < t and Member(swap(b, s, s+i+1)) do i ← i + 1
5:     return i
6:
7: procedure GenBubble(s, t)
8:     if s > 0 and t > 0 then
9:         for i ← 1 to Oracle(s, t) do
10:             Swap(s, s+i)
11:             GenBubble(s-1, i)
12:             Swap(s, s+i)
13:     Visit()
Algorithm 2 Generic algorithm to list L(n, d) for a given bubble language L in cool-lex order.
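
The generic scheme is easy to prototype. The following Python sketch (ours; `member` stands for an arbitrary membership tester for L) mirrors Algorithm 2, probing the children from left to right to compute the bubble upper bound:

def gen_bubble(n, d, member):
    """Cool-lex listing of L(n, d) for a first-01 bubble language L,
    where member(word) decides membership in L.  A sketch of the
    Oracle/GenBubble scheme of Algorithm 2, not an optimized version."""
    b = [1] * d + [0] * (n - d)
    out = []

    def as_word():
        return "".join(map(str, b))

    def oracle(s, t):
        # bubble upper bound: number of children of the current node that are in L
        i = 0
        while i < t:
            b[s - 1], b[s + i] = b[s + i], b[s - 1]    # try child i+1
            ok = member(as_word())
            b[s - 1], b[s + i] = b[s + i], b[s - 1]    # undo
            if not ok:
                break
            i += 1
        return i

    def gen(s, t):
        if s > 0 and t > 0:
            for i in range(1, oracle(s, t) + 1):
                b[s - 1], b[s - 1 + i] = b[s - 1 + i], b[s - 1]
                gen(s - 1, i)
                b[s - 1], b[s - 1 + i] = b[s - 1 + i], b[s - 1]
        out.append(as_word())

    if member(as_word()):      # the root 1^d 0^(n-d); empty listing if L(n, d) is empty
        gen(d, n - d)
    return out

Plugging in a prefix normality test for member (such as is_prefix_normal from the sketch in Section 2) and concatenating the lists gen_bubble(7, d, is_prefix_normal) for d = 0, ..., 7 reproduces the listing of Table 1.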

It was further shown in [27] that cool-lex order, the order in which the generic algorithm visits the strings of L(n, d), gives a Gray code. This can be seen on the tree as follows:

Lemma 4.3

Let w be a node in the computation tree T(n, d). Then each of the following can be obtained from w by a single swap operation: (a) any sibling of w, (b) the parent of w, and (c) any node on the leftmost path in the subtree rooted in w.

Proof. Let u and v be siblings, and let their parent be x = 1^s 0^t γ. Then there exist i ≠ j such that u = swap(x, s, s+i) and v = swap(x, s, s+j). Then v = swap(u, s+i, s+j), while x = swap(u, s, s+i), proving (a) and (b). For (c), let w = 1^s 0^t γ and v = 1^{s-k} 0 1^k 0^{t-1} γ for some 0 ≤ k ≤ s; then v = swap(w, s-k+1, s+1).

We now report the main result on bubble languages from [27, 28], for which we give a proof using Prop. 4.2.

Proposition 4.4 ([27, 28])

Any fixed-length bubble language L(n), where L(n, d) ≠ ∅ for all 0 ≤ d ≤ n, can be generated such that subsequent strings differ by at most two swaps, or by a swap and a bit flip. Given a membership tester Member() which runs in time t(n), this generation algorithm takes O(t(n)) amortized time per word.

Proof. For a fixed-weight subset L(n, d), let T_L(n, d) denote the subtree of T(n, d) corresponding to L(n, d). Note that in a post-order traversal of T_L(n, d), the successor of a node u is either its parent, or the leftmost descendant in T_L(n, d) of the next sibling of u.

By Prop. 4.2, we have that the leftmost descendant of any node of T_L(n, d) lies on the leftmost path of that node's subtree in T(n, d). Thus, by Lemma 4.3, the successor of u can be reached from u in one or two swaps.

By concatenating the lists for weights d = 0, 1, ..., n, the procedure GenerateAll() shown in Algorithm 3 will exhaustively list L(n) for a given bubble language L. To see the Gray code property, notice that for any weight d, the last string visited is 1^d 0^{n-d}, while the first string visited for the next weight is the leftmost descendant of 1^{d+1} 0^{n-d-1}, i.e. a string of the form 1^{d+1-k} 0 1^k 0^{n-d-2}, which is at most one swap and one bit flip away from 1^d 0^{n-d}.

For the running time, notice that for each w ∈ L(n, d), we do at most b+1 membership tests, where b is the bubble upper bound for w. The successful tests can be charged to the children of w, while the possible last unsuccessful test can be charged to w itself.

1: global array b = b_1 b_2 ... b_n
2: procedure GenerateAll(n)
3:     for d ← 0 to n do
4:         b ← 1^d 0^{n-d}
5:         GenBubble(d, n-d)
Algorithm 3 A Gray code to exhaustively list L(n) for a given bubble language L.

Remark: It is even possible to give a cyclic Gray code for L(n), by listing the fixed-weight subsets first for the odd weights (in increasing order), followed by the even weights (in decreasing order).

The oracle of Algorithm 2 applies a simple membership tester to compute the bubble upper bound for a given word in L. However, we do not actually need a general membership tester, since all we want to know is which of the children of a node already known to be in L are also in L; moreover, the membership tester is allowed to use other information, which it can build up iteratively while examining earlier nodes. In the next section, we will apply this method to the language of prefix normal words.

5 A Gray Code for Prefix Normal Words

In this section, we prove that the set of prefix normal words is a bubble language. Then, using the bubble framework and applying a basic quadratic-time membership tester, we show how to generate all words in PN(n, d) in Gray code order. By concatenating these lists for all weights in increasing order, we obtain an algorithm to list PN(n) as a Gray code in O(n^2) amortized time per word. By then providing an enhanced membership tester for prefix normal words specific to the bubble framework, we further show how this Gray code can be generated in O(log^2 n) amortized time per word.

Theorem 5.1

The language PN of prefix normal words is a bubble language.

Proof. Let w be a prefix normal word containing an occurrence of 01. Let w' be the word obtained from w by replacing the first occurrence of 01 with 10. Then w = 1^s 0^t 1 γ and w' = 1^s 0^{t-1} 1 0 γ, for some s, t ≥ 1 and γ ∈ Σ^*. Let v be a substring of w'. We have to show that |v|_1 ≤ P_{w'}(|v|).

Note that for every i, P_{w'}(i) ≥ P_w(i). In fact, P_{w'}(s+t) = P_w(s+t) + 1, and P_{w'}(i) = P_w(i) for every i ≠ s+t. Now if v does not contain position s+t, then either v is contained in the prefix 1^s 0^{t-1}, in which case v is a substring of w, or v lies within positions s+t+1, ..., n, in which case removing a possible leading 0 from v leaves a substring of γ, hence of w, with the same number of 1s as v; in both cases |v|_1 ≤ P_w(|v|) ≤ P_{w'}(|v|). If v contains both positions s+t and s+t+1, then the substring of w occupying the same positions has the same number of 1s as v, and again |v|_1 ≤ P_w(|v|) ≤ P_{w'}(|v|). Otherwise v contains position s+t but not position s+t+1, i.e. v is a suffix of w'_1 ... w'_{s+t} = 1^s 0^{t-1} 1. We can assume that v is a proper suffix, since otherwise v is a prefix of w' and the claim is trivial. If v = 0^b 1 for some b ≤ t-1, then |v|_1 = 1 ≤ P_{w'}(|v|), since w' begins with 1. Else v = 1^a 0^{t-1} 1 with 1 ≤ a < s; let u be the substring of w of the same length as v and starting one position earlier (in other words, u is obtained by shifting v to the left by one position). Then u = 1^{a+1} 0^{t-1} and |u|_1 = |v|_1, thus |v|_1 ≤ P_w(|v|) ≤ P_{w'}(|v|). In all cases |v|_1 ≤ P_{w'}(|v|), hence w' is prefix normal.

Since there is a membership tester for prefix normal words that runs in O(n^2) time, e.g. the one described in Algorithm 4, the aforementioned Gray codes for both PN(n, d) and PN(n) can be generated in O(n^2) amortized time per word (Prop. 4.4). We show a computation tree in Fig. 1, with prefix normal words in bold. The complete listing for length 7 is given in Table 1.

1: Input: the word w = w_1 w_2 ... w_n
2: function Member(w)
3:     P[0] ← 0
4:     for k ← 1 to n do P[k] ← P[k-1] + w_k
5:     for i ← 1 to n do
6:         c ← 0
7:         for j ← i to n do
8:             c ← c + w_j
9:             if c > P[j - i + 1] then return False
10:     return True
Algorithm 4 Test whether w ∈ PN in O(n^2) time.

5.1 A More Efficient Approach

Now we develop a more efficient membership tester for PN that is specific to the tests required by an oracle within the bubble framework. In particular, membership tests are only made on strings of the form swap(w, s, s+i), given that w = 1^s 0^t γ is already known to be prefix normal.

Lemma 5.2

Let be a word in where , and . Let for some . Then is not in if and only if either

  1. , or

  2. .

Figure 2: Illustration of Lemma 5.2. On the right we have the two cases of a substring (in gray) of which may violate the prefix normal property.

Proof. The proof is illustrated in Fig. 2. Note that .

() The prefix of of has ones. If , then it must be that contains a substring of length , or less, with at least ones. Similarly, if then also contains a substring of length with at least ones. Thus, if either or then is not in .

() Assume is not in . Then there is a shortest substring with length in such that . Clearly and since is minimal . Suppose . Then can have at most ones since and thus , a contradiction. Thus . We now consider three cases for . If then since , . But since , this means . Thus . Suppose . If then the prefix of of length overlaps with , i.e. we can write and for some non-empty containing the swapped . Since , this implies that also has more s than the prefix of the same length, a contradiction to our choice of . Thus . Since starts with and it must be that . By extending to have length we have . Finally, suppose . Then because , is a substring of and hence a substring of . Since we have . Since for all and , and , it must be that . For each of these possible values for , . Thus which means . Finally, since the length of is at least , we also have . Considering all cases, we must either have , or .

Let denote the value . By maintaining as GenBubble iterates through the prefix normal words, we can apply the previous lemma to optimize a membership tester. Pseudocode is given in Algorithm 5. This function requires the passing of the variables and from the function Oracle, recalling that the current string is stored globally.

1:
2:function MemberPN()
3:          first 1 accounted for by the (proposed) swap of a 1 to
4:     for  to  do
5:         if  then                 
6:     if