On Prefix Normal Words

05/31/2018 ∙ by Gabriele Fici, et al. ∙ Université Nice Sophia Antipolis Bielefeld University 0

We present a new class of binary words: the prefix normal words. They are defined by the property that for any given length k, no factor of length k has more a's than the prefix of the same length. These words arise in the context of indexing for jumbled pattern matching (a.k.a. permutation matching or Parikh vector matching), where the aim is to decide whether a string has a factor with a given multiplicity of characters, i.e., with a given Parikh vector. Using prefix normal words, we give the first non-trivial characterization of binary words having the same set of Parikh vectors of factors. We prove that the language of prefix normal words is not context-free and is strictly contained in the language of pre-necklaces, which are prefixes of powers of Lyndon words. We discuss further properties and state open problems.


1 Introduction

Given a finite word $w$ over a finite ordered alphabet $\Sigma$, the Parikh vector of $w$ is defined as the vector of multiplicities of the characters in $w$. In recent years, Parikh vectors have been increasingly studied, in particular Parikh vectors of factors (substrings) of words, motivated by applications in computational biology, e.g. mass spectrometry [10, 9, 5, 1]. Among the new problems introduced in this context is that of jumbled pattern matching (a.k.a. permutation matching or Parikh vector matching), whose decision variant is the task of deciding whether a given word $w$ (the text) has a factor with a given Parikh vector (the pattern). In [8], Cicalese et al. showed that in order to answer decision queries for binary words, it suffices to know, for each $k$, the maximum and minimum number of $a$'s in a factor of length $k$. Thus it is possible to create an index of size $O(n)$ of a text of length $n$, which contains, for every $k$, the maximum and minimum number of $a$'s in a factor of length $k$, and which allows answering decision queries in constant time.

In this paper, we introduce a new class of binary words, prefix normal words. They are defined by the property that for any given length $k$, no factor of length $k$ appearing in the word has more $a$'s than the prefix of the word of the same length. For example, the word $ababbaabaabbbaaabbab$ is not a prefix normal word, because the factor $aa$ has more $a$'s than the prefix of the same length, $ab$.

We show that for every binary word $w$, there is a prefix normal word $w'$ such that, for every $k$, the maximum numbers of $a$'s in a factor of length $k$ coincide for $w$ and $w'$ (Theorem 2.1). We refer to $w'$ as the prefix normal form of $w$ (with respect to $a$).

Given a word $w$, a factor Parikh vector of $w$ is the Parikh vector of a factor of $w$. An interesting characterization of words with the same multi-set of factor Parikh vectors was given recently by Acharya et al. [1]. In this paper, we give the first non-trivial characterization of the set of factor Parikh vectors, by showing that two words have the same set of factor Parikh vectors if and only if their prefix normal forms, with respect to $a$ and $b$, both coincide (Theorem 2.2).

We explore the language of prefix normal words and its connection to other known languages. Among other things, we show that this language is not context-free (Theorem 3.1) by adapting a proof of Berstel and Boasson [3] for Lyndon words, and that it is properly included in the language of pre-necklaces, the prefixes of powers of Lyndon words (Theorem 4.1). We close with a number of open problems.

Connection to Indexed Jumbled Pattern Matching. The current fastest algorithms for computing an index for the binary jumbled pattern matching problem were concurrently and independently developed by Burcsi et al. [6] and Moosa and Rahman [14]. In order to compute an index of a text of length $n$, both used a reduction to min-plus convolution, for which the current best algorithms have a runtime of $O(n^2/\log n)$. Very recently, Moosa and Rahman [13] introduced an algorithm with runtime $O(n^2/\log^2 n)$ which uses word-RAM operations. Our characterization of the set of factor Parikh vectors in terms of prefix normal forms yields a new approach to the problem of indexed jumbled pattern matching: Given the prefix normal forms of a word $w$ of length $n$, the index for the jumbled pattern matching problem can be computed in linear time $O(n)$. This implies that any algorithm for computing the prefix normal form with runtime $o(n^2/\log n)$ will result in an improvement for the indexing problem for binary jumbled pattern matching. Since, on the other hand, the prefix normal forms can be computed from the index in $O(n)$ time, we also have that a lower bound for the computation of the prefix normal form would yield a lower bound for the binary jumbled pattern matching problem.

2 The Prefix Normal Form

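We fix the ordered alphabet $\Sigma = \{a, b\}$, with $a < b$. A word $w = w_1 \cdots w_n$ over $\Sigma$ is a finite sequence of elements from $\Sigma$. Its length $n$ is denoted by $|w|$. We denote the empty word by $\epsilon$. For any $1 \le i \le |w|$, the $i$-th symbol of a word $w$ is denoted by $w_i$. As is standard, we denote by $\Sigma^n$ the words over $\Sigma$ of length $n$, and by $\Sigma^*$ the set of finite words over $\Sigma$. Let $w \in \Sigma^*$. If $w = uv$ for some $u, v \in \Sigma^*$, we say that $u$ is a prefix of $w$ and $v$ is a suffix of $w$. A factor of $w$ is a prefix of a suffix of $w$ (or, equivalently, a suffix of a prefix). We denote by $\mathrm{Pref}(w)$, $\mathrm{Suff}(w)$, $\mathrm{Fact}(w)$ the set of prefixes, suffixes, and factors of the word $w$, respectively.

For a letter $x \in \Sigma$, we denote by $|w|_x$ the number of occurrences of $x$ in the word $w$. The Parikh vector of a word $w$ over $\Sigma$ is defined as $p(w) = (|w|_a, |w|_b)$. The Parikh set of $w$ is $\Pi(w) = \{p(v) \mid v \in \mathrm{Fact}(w)\}$, the set of Parikh vectors of the factors of $w$.

Finally, given a word $w$ over $\Sigma$ and a letter $x \in \Sigma$, we denote by $P_x(w, i)$ the number of $x$'s in the prefix of length $i$ of $w$, and by $pos_x(w, i)$ the position of the $i$'th $x$ in $w$, i.e., $pos_x(w, i) = \min\{k : |w_1 \cdots w_k|_x = i\}$. When $w$ is clear from the context, we also write $P_x(i)$ and $pos_x(i)$. Note that in the context of succinct indexing, these functions are frequently called rank and select, cf. [15]: We have $rank_x(w, i) = P_x(w, i)$ and $select_x(w, i) = pos_x(w, i)$.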

Definition 1

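Let $w \in \Sigma^*$. We define, for each $0 \le i \le |w|$,

$$F_a(w, i) = \max\{|v|_a : v \in \mathrm{Fact}(w) \cap \Sigma^i\},$$

the maximum number of $a$'s in a factor of $w$ of length $i$. When no confusion can arise, we also write $F_a(i)$ for $F_a(w, i)$. The function $F_b(w, \cdot)$ is defined analogously by taking $b$ in place of $a$.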

Example 1

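Take $w = ababbaabaabbbaaabbab$. In Table 1, we give the values of $F_a(w, i)$ and $F_b(w, i)$ for $0 \le i \le 20$.

$i$          0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
$F_a(w,i)$   0  1  2  3  3  4  4  4  5  5  6  7  7  7  8  8  9  9  9 10 10
$F_b(w,i)$   0  1  2  3  3  3  4  4  5  5  6  6  7  7  7  8  8  9  9 10 10
Table 1: The sequences $F_a(w, \cdot)$ and $F_b(w, \cdot)$ for the word $w = ababbaabaabbbaaabbab$.
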
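
As an illustration of Definition 1, the following minimal Python sketch computes the maximum number of occurrences of a letter over all factors of each length by brute force over prefix sums. The function name `max_letter_counts` is ours, and the example word is an assumption: it is the 20-letter word whose values agree with both rows of Table 1.

```python
def max_letter_counts(w, letter="a"):
    """Return [F(w, 0), ..., F(w, n)], where F(w, k) is the maximum number of
    occurrences of `letter` in a factor (substring) of w of length k."""
    n = len(w)
    # prefix[i] = number of occurrences of `letter` in w[:i]
    prefix = [0] * (n + 1)
    for i, c in enumerate(w):
        prefix[i + 1] = prefix[i] + (c == letter)
    # A factor of length k starting at offset j contains prefix[j+k] - prefix[j] letters.
    return [
        max(prefix[j + k] - prefix[j] for j in range(n - k + 1))
        for k in range(n + 1)
    ]


w = "ababbaabaabbbaaabbab"  # assumed example word, consistent with Table 1
print(max_letter_counts(w, "a"))
print(max_letter_counts(w, "b"))
```

This quadratic-time computation is only meant to make the definition concrete; it is not one of the sub-quadratic index constructions cited in the introduction.
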
Lemma 1

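Let $w \in \Sigma^*$. The function $F_a(w, \cdot)$ has the following property:

$$F_a(j) - F_a(i) \le F_a(j - i) \quad \text{for all } 0 \le i \le j \le |w|.$$

Proof

Assume otherwise. Then there are indices $i < j$ such that $F_a(j) - F_a(i) > F_a(j - i)$. Let $v \in \mathrm{Fact}(w)$ be a word that realizes $F_a(j)$, i.e., $|v| = j$ and $|v|_a = F_a(j)$. Let us write $v = xy$ with $|x| = i$. Then for the word $y$, we have $|y|_a = |v|_a - |x|_a \ge F_a(j) - F_a(i) > F_a(j - i)$, in contradiction to the definition of $F_a$, since $|y| = j - i$. ∎

We are now ready to show that for every word $w$ there is a word $w'$ which realizes the function $F_a(w, \cdot)$ as its prefix function $P_a(w', \cdot)$.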

Theorem 2.1

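Let $w \in \Sigma^*$. Then there exists a unique word $w'$ s.t. for all $0 \le i \le |w|$, $F_a(w', i) = F_a(w, i) = P_a(w', i)$. We call this word the prefix normal form of $w$ (with respect to $a$), and denote it $\mathrm{PNF}_a(w)$. Analogously, there exists a unique word $w''$, such that for all $0 \le i \le |w|$, $F_b(w'', i) = F_b(w, i) = P_b(w'', i)$, the prefix normal form of $w$ with respect to $b$, denoted $\mathrm{PNF}_b(w)$.

Proof

We only give the proof for $w'$. The construction of $w''$ is analogous. It is easy to see that for $1 \le k \le |w|$, one has either $F_a(w, k) = F_a(w, k-1)$ or $F_a(w, k) = 1 + F_a(w, k-1)$. Now define the word $w'$ by

$$w'_k = \begin{cases} a & \text{if } F_a(w, k) = 1 + F_a(w, k-1), \\ b & \text{if } F_a(w, k) = F_a(w, k-1), \end{cases}$$

for every $1 \le k \le |w|$.

By construction, we have $P_a(w', k) = F_a(w, k)$ for every $0 \le k \le |w|$. We still need to show that $P_a(w', k) = F_a(w', k)$ for all $k$, i.e., that $w'$ is in prefix normal form. By definition, $P_a(w', k) \le F_a(w', k)$ for all $k$. Now let $u \in \mathrm{Fact}(w')$, $|u| = k$, and $u = w'_{i+1} \cdots w'_j$. Then $|u|_a = P_a(w', j) - P_a(w', i) = F_a(w, j) - F_a(w, i) \le F_a(w, j - i) = F_a(w, k) = P_a(w', k)$, where the inequality holds by Lemma 1. We have thus proved that $F_a(w', k) \le P_a(w', k)$, and we are done. ∎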
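
The construction in the proof can be sketched in a few lines of Python: compute the maximum counts for every length and emit the chosen letter exactly where the count increases. The names `max_letter_counts` and `prefix_normal_form` are ours, and the quadratic brute force stands in for whatever index construction one prefers; this is a sketch, not the authors' implementation.

```python
def max_letter_counts(w, letter):
    """Maximum number of occurrences of `letter` in a factor of each length 0..len(w)."""
    n = len(w)
    prefix = [0] * (n + 1)
    for i, c in enumerate(w):
        prefix[i + 1] = prefix[i] + (c == letter)
    return [
        max(prefix[j + k] - prefix[j] for j in range(n - k + 1))
        for k in range(n + 1)
    ]


def prefix_normal_form(w, letter="a"):
    """Build the word whose prefix counts of `letter` equal the maximum counts of w.

    Position k carries `letter` exactly when the maximum count grows from
    length k-1 to length k; otherwise it carries the other letter.
    """
    other = "b" if letter == "a" else "a"
    F = max_letter_counts(w, letter)
    return "".join(
        letter if F[k] == F[k - 1] + 1 else other
        for k in range(1, len(w) + 1)
    )


w = "ababbaabaabbbaaabbab"  # assumed example word, consistent with Table 1
print(prefix_normal_form(w, "a"))
print(prefix_normal_form(w, "b"))
```

Applying `prefix_normal_form` twice returns the same word as applying it once, and a word and its reversal yield the same result, since both operations leave the set of factor counts unchanged.
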

Example 2

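Let $w = ababbaabaabbbaaabbab$. The prefix normal forms of $w$ are the words

$$\mathrm{PNF}_a(w) = aaababbabaabbababbab$$

and

$$\mathrm{PNF}_b(w) = bbbaababababaabababa.$$

The operators $\mathrm{PNF}_a$ and $\mathrm{PNF}_b$ are idempotent operators, that is, if $u = \mathrm{PNF}_x(w)$ then $\mathrm{PNF}_x(u) = u$, for any $x \in \Sigma$. Also, for any $w \in \Sigma^*$ and $x \in \Sigma$, it holds that $\mathrm{PNF}_x(w) = \mathrm{PNF}_x(\tilde{w})$, where $\tilde{w}$ is the reversal of $w$.

The prefix normal forms of a word allow one to determine the Parikh vectors of the factors of the word, as we will show in Theorem 2.2. We first recall the following lemma from [8], where we say that a Parikh vector $q$ occurs in a word $w$ if $w$ has a factor $v$ with $p(v) = q$.

Lemma 2 (Interval Lemma, Cicalese et al. [8])

Let $w \in \Sigma^*$. Fix $1 \le k \le |w|$. If the Parikh vectors $(x_1, k - x_1)$ and $(x_2, k - x_2)$ both occur in $w$, then so does $(y, k - y)$ for any $x_1 \le y \le x_2$.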

The lemma can be proved with a simple sliding window argument.
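
The sliding window argument can be made concrete with a small Python sketch (the function name `window_counts` is ours). Moving a window of fixed length one position to the right drops one letter and gains one letter, so the number of $a$'s changes by at most one; hence every value between the minimum and the maximum is attained by some window.

```python
def window_counts(w, k, letter="a"):
    """Number of occurrences of `letter` in each length-k window of w, left to right."""
    if not 1 <= k <= len(w):
        raise ValueError("k must satisfy 1 <= k <= len(w)")
    count = w[:k].count(letter)
    counts = [count]
    for i in range(k, len(w)):
        # The window gains w[i] on the right and loses w[i - k] on the left.
        count += (w[i] == letter) - (w[i - k] == letter)
        counts.append(count)
    return counts


counts = window_counts("ababbaabaabbbaaabbab", 5)
# Consecutive windows differ by at most one occurrence ...
assert all(abs(x - y) <= 1 for x, y in zip(counts, counts[1:]))
# ... so the attained counts form a full integer interval.
assert set(counts) == set(range(min(counts), max(counts) + 1))
```
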

Theorem 2.2

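Let $w, w'$ be words over $\Sigma$. Then $\Pi(w) = \Pi(w')$ if and only if $\mathrm{PNF}_a(w) = \mathrm{PNF}_a(w')$ and $\mathrm{PNF}_b(w) = \mathrm{PNF}_b(w')$.

Proof

Let $f_a(w, k)$ denote the minimum number of $a$'s in a factor of $w$ of length $k$. As a direct consequence of Lemma 2, we have that for a Parikh vector $q = (x, y)$, $q \in \Pi(w)$ if and only if $f_a(w, x + y) \le x \le F_a(w, x + y)$. Thus for two words $w, w'$, we have $\Pi(w) = \Pi(w')$ if and only if $F_a(w, \cdot) = F_a(w', \cdot)$ and $f_a(w, \cdot) = f_a(w', \cdot)$. It is easy to see that for all $k$, $f_a(w, k) = k - F_b(w, k)$, thus the last statement is equivalent to $F_a(w, \cdot) = F_a(w', \cdot)$ and $F_b(w, \cdot) = F_b(w', \cdot)$. This holds if and only if $\mathrm{PNF}_a(w) = \mathrm{PNF}_a(w')$ and $\mathrm{PNF}_b(w) = \mathrm{PNF}_b(w')$, and the claim is proved. ∎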
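
To illustrate how the two prefix normal forms act as an index for the decision problem, here is a hedged Python sketch. It derives the maximum and minimum number of $a$'s per factor length from the two forms in linear time and then answers a query in constant time. All function names are ours, the empty factor is counted as a factor, and the brute-force helpers exist only to build the forms and to cross-check the answers.

```python
def prefix_normal_form(w, letter):
    """Brute-force prefix normal form of w with respect to `letter` (quadratic time)."""
    other = "b" if letter == "a" else "a"
    n = len(w)
    prefix = [0]
    for c in w:
        prefix.append(prefix[-1] + (c == letter))
    out, previous = [], 0
    for k in range(1, n + 1):
        best = max(prefix[j + k] - prefix[j] for j in range(n - k + 1))
        out.append(letter if best == previous + 1 else other)
        previous = best
    return "".join(out)


def build_index(pnf_a, pnf_b):
    """Linear-time index: min and max number of a's in a factor of each length."""
    max_a, max_b = [0], [0]
    for ca, cb in zip(pnf_a, pnf_b):
        max_a.append(max_a[-1] + (ca == "a"))
        max_b.append(max_b[-1] + (cb == "b"))
    # The fewest a's in a length-k factor is k minus the most b's.
    min_a = [k - max_b[k] for k in range(len(max_b))]
    return min_a, max_a


def occurs(index, x, y):
    """Constant-time decision query: does some factor have x a's and y b's?"""
    min_a, max_a = index
    k = x + y
    return x >= 0 and y >= 0 and k < len(max_a) and min_a[k] <= x <= max_a[k]


def parikh_set_bruteforce(w):
    """All Parikh vectors of factors of w, including the empty factor."""
    n = len(w)
    return {
        (w[i:j].count("a"), w[i:j].count("b"))
        for i in range(n + 1)
        for j in range(i, n + 1)
    }


w = "ababbaabaabbbaaabbab"  # assumed example word
index = build_index(prefix_normal_form(w, "a"), prefix_normal_form(w, "b"))
print(occurs(index, 4, 1), occurs(index, 5, 0))
```
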

There is a simple geometrical construction for computing the prefix normal forms of a word $w$, and hence, by Theorem 2.2, the set of Parikh vectors occurring in $w$. An example of this construction is given in Fig. 1.

Draw in the Euclidean plane the word $w$ by linking, for every $0 \le i \le |w|$, the points $(i, d_i)$, where $d_i$ is the difference between the number of $a$'s and the number of $b$'s in the prefix of $w$ of length $i$. That is, draw $w$ by representing each letter $a$ by an upper unit diagonal and each letter $b$ by a lower unit diagonal, starting from the origin $(0, 0)$.

Then draw all the suffixes of $w$ in the same way, always starting from the origin. The region of the plane so delineated is in fact the region of points $(x, y)$ such that there exists a factor $v$ of $w$ such that $|v| = x$ and $|v|_a - |v|_b = y$. Hence $\left(\frac{x+y}{2}, \frac{x-y}{2}\right)$ belongs to $\Pi(w)$.

The region is connected by Lemma 2, in the sense that all internal points belong to $\Pi(w)$ under the above correspondence. The prefix normal forms $\mathrm{PNF}_a(w)$ and $\mathrm{PNF}_b(w)$ are obtained by connecting the upper and the lower points of the region, respectively.

Figure 1: The word $w$, its prefix normal forms $\mathrm{PNF}_a(w)$ and $\mathrm{PNF}_b(w)$, and the region delineated by $\Pi(w)$, the Parikh set of $w$.

3 The Language of Prefix Normal Words

In this section, we take a closer look at those words which are in prefix normal form, which we refer to as prefix normal words. For simplicity of exposition, from now on we only refer to prefix normality with respect to the letter $a$.

Definition 2

A prefix normal word is a word $w \in \Sigma^*$ such that for every $0 \le i \le |w|$, $F_a(w, i) = P_a(w, i)$. That is, a word such that $w = \mathrm{PNF}_a(w)$. We denote by $L$ the language of prefix normal words.

The following proposition gives some characterizations of prefix normal words. Recall that $P_a(w, i)$ is the number of $a$'s in the prefix of length $i$, and $pos_a(w, i)$ is the position of the $i$'th $a$. When no confusion can arise, we write simply $P(i)$ and $pos(i)$. In particular, we have $P(pos(i)) = i$ and $pos(P(i)) \le i$.

Proposition 1

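Let $w \in \Sigma^*$. The following properties are equivalent:

  1. $w$ is a prefix normal word;

  2. $\forall\, i, j$ where $0 \le i \le j \le |w|$, we have $P(j) - P(i) \le P(j - i)$;

  3. $\forall\, v \in \mathrm{Fact}(w)$ such that $|v|_a = i$, we have $|v| \ge pos(i)$;

  4. $\forall\, i, j \ge 1$ such that $i + j - 1 \le |w|_a$, we have $pos(i) + pos(j) - 1 \le pos(i + j - 1)$.

Proof

(1) $\Rightarrow$ (2). Follows from Lemma 1, since $P(k) = F_a(w, k)$ for all $k$.

(2) $\Rightarrow$ (3). Assume otherwise. Then there exists $v \in \mathrm{Fact}(w)$ s.t. $|v| < pos(k)$, where $k = |v|_a$. Let $v = w_{i+1} \cdots w_j$, thus $|v| = j - i$. Then $P(j) - P(i) = k$. But $P(j - i) < k$, a contradiction.

(3) $\Rightarrow$ (4). Again assume that the claim does not hold. Then there are $i, j$ s.t. $pos(i + j - 1) < pos(i) + pos(j) - 1$. Let $k = pos(j)$ and $l = pos(i + j - 1)$ and define $v = w_k \cdots w_l$. Then $v$ has $i$ many $a$'s. But $|v| = l - k + 1 < pos(i)$, in contradiction to (3).

(4) $\Rightarrow$ (1). Let $v \in \mathrm{Fact}(w)$, $|v|_a = i$. We have to show that $P(|v|) \ge i$. This is equivalent to showing that $pos(i) \le |v|$. Let $v = w_{l+1} \cdots w_r$, thus $|v| = r - l$. Let $j = P(l) + 1$, thus the first $a$ in $v$ is the $j$'th $a$ of $w$. Note that we have $pos(j) \ge l + 1$ and $pos(i + j - 1) \le r$. By the assumption, we have $pos(i) \le pos(i + j - 1) - pos(j) + 1 \le r - l = |v|$. ∎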
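
A test for prefix normality based on the positions of the $a$'s can be sketched as follows. The inequality used below, $pos(i) + pos(j) - 1 \le pos(i + j - 1)$ for all $i, j \ge 1$ with $i + j - 1 \le |w|_a$, is our reading of condition 4 and should be treated as an assumption; the function names are ours, and a direct check of Definition 2 is included for cross-validation.

```python
from itertools import product


def is_prefix_normal_by_positions(w):
    """Check pos(i) + pos(j) - 1 <= pos(i + j - 1) for all admissible i, j.

    Runs in time quadratic in the number of a's of w.
    """
    # pos[i] = 1-based position of the i-th 'a'; pos[0] is an unused placeholder.
    pos = [0] + [k + 1 for k, c in enumerate(w) if c == "a"]
    m = len(pos) - 1  # number of a's in w
    for i in range(1, m + 1):
        for j in range(1, m - i + 2):  # guarantees i + j - 1 <= m
            if pos[i] + pos[j] - 1 > pos[i + j - 1]:
                return False
    return True


def is_prefix_normal_by_definition(w):
    """Directly compare every factor with the prefix of the same length."""
    n = len(w)
    for k in range(1, n + 1):
        in_prefix = w[:k].count("a")
        if any(w[j:j + k].count("a") > in_prefix for j in range(n - k + 1)):
            return False
    return True


# Cross-check the two tests on all binary words of length at most 8.
for n in range(9):
    for letters in product("ab", repeat=n):
        word = "".join(letters)
        assert is_prefix_normal_by_positions(word) == is_prefix_normal_by_definition(word)
```
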

We now give some simple facts about the language $L$.

Proposition 2

Let $L$ be the language of prefix normal words.

  1. $L$ is prefix-closed, that is, any prefix of a word in $L$ is a word in $L$.

  2. If $w \in L$, then any word of the form $a^k w$ or $w b^k$, $k \ge 0$, also belongs to $L$.

  3. Let $|w|_a < 3$. Then $w \in L$ iff either $w = b^n$ for some $n \ge 0$ or the first letter of $w$ is $a$.

  4. Let $w \in \Sigma^*$. Then there exist infinitely many $v$ such that $vw \in L$.

Proof

The claims 1., 2., 3. follow easily from the definition. For 4., note that for any $n \ge |w|$, the word $a^n w$ belongs to $L$. ∎

We now deal with the question of how a prefix normal word can be extended to the right into another prefix normal word.

Lemma 3

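Let $w \in L$. Then $wa \in L$ if and only if for every $0 \le k < |w|$ the suffix of $w$ of length $k$ has fewer $a$'s than the prefix of $w$ of length $k + 1$.

Proof

Suppose $wa \in L$. Fix $0 \le k < |w|$ and let $v$ be the suffix of $w$ of length $k$. By definition of $L$ one has $|va|_a \le P_a(w, k + 1)$, and therefore $|v|_a < P_a(w, k + 1)$.

Conversely, let $va$ be the suffix of $wa$ of length $k + 1$. Since $w \in L$ one has $|v|_a \le P_a(w, k)$. We cannot have $|v|_a = P_a(w, k)$ and $w_{k+1} = b$ since by hypothesis we must have $|v|_a < P_a(w, k + 1)$. Thus either $|v|_a < P_a(w, k)$ or $w_{k+1} = a$. In both cases we have then $|va|_a \le P_a(w, k + 1)$. Since no suffix of $wa$ has more $a$'s than the prefix of $wa$ of the same length, and since $w \in L$, it follows that $wa \in L$. ∎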
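
Read as an algorithm, the lemma gives a way to test a word letter by letter. The sketch below assumes the statement in the form "the suffix of length $k$ has fewer $a$'s than the prefix of length $k+1$, for every $0 \le k < |w|$" and uses the fact that appending $b$ always preserves prefix normality. Function names are ours; the definition-based test is included only for cross-checking.

```python
from itertools import product


def can_append_a(w):
    """Given a prefix normal word w, decide whether w + 'a' is prefix normal.

    For every 0 <= k < len(w), the suffix of w of length k must contain
    strictly fewer a's than the prefix of w of length k + 1.
    """
    n = len(w)
    in_prefix = 0  # a's in the prefix of length k + 1
    in_suffix = 0  # a's in the suffix of length k
    for k in range(n):
        in_prefix += w[k] == "a"
        if not in_suffix < in_prefix:
            return False
        in_suffix += w[n - 1 - k] == "a"
    return True


def is_prefix_normal_online(w):
    """Quadratic-time test that reads w from left to right."""
    for i, c in enumerate(w):
        # Appending 'b' keeps a prefix normal word prefix normal,
        # so only an appended 'a' needs to be checked.
        if c == "a" and not can_append_a(w[:i]):
            return False
    return True


def is_prefix_normal_by_definition(w):
    """Reference test: no factor has more a's than the prefix of the same length."""
    n = len(w)
    return all(
        w[j:j + k].count("a") <= w[:k].count("a")
        for k in range(1, n + 1)
        for j in range(n - k + 1)
    )


for n in range(9):
    for letters in product("ab", repeat=n):
        word = "".join(letters)
        assert is_prefix_normal_online(word) == is_prefix_normal_by_definition(word)
```
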

We close this section by proving that $L$ is not context-free. Our proof is an easy modification of the proof that Berstel and Boasson gave for the fact that the language of binary Lyndon words is not context-free [3].

Theorem 3.1

$L$ is not context-free.

Proof

Recall that Ogden's iteration lemma (see e.g. [2]) states that, for every context-free language $K$ there exists an integer $N$ such that, for any word $w \in K$ and for any choice of at least $N$ distinguished positions in $w$, there exists a factorization $w = xuyvz$ such that

  1. either $x, u, y$ each contain at least one distinguished position, or $y, v, z$ each contain at least one distinguished position;

  2. the word $uyv$ contains at most $N$ distinguished positions;

  3. for any $n \ge 0$, the word $x u^n y v^n z$ is in $K$.

Now, assume that the language $L$ is context-free, and consider the word $w = a^{N+1} b a^N b a^{N+1}$ where $N$ is the constant of Ogden's Lemma. It is easy to see that $w \in L$. Distinguish the central run of $N$ letters $a$. We claim that for every factorization $w = xuyvz$ of $w$, pumping $u$ and $v$ eventually results in a word in which the first run of $a$'s is not the longest one. Such a word cannot belong to $L$.

If $x, u, y$ each contain at least one distinguished position, then $u$ is non-empty and it is contained in the central run of $a$'s. Now observe that every word obtained by pumping $u$ and $v$ is prefixed by $a^{N+1} b$. Pumping $u$ and $v$, we then get a word $a^{N+1} b a^m z'$, for some word $z'$, where $m > N + 1$. This word is not in $L$.

Suppose now that $y, v, z$ each contain at least one distinguished position. Then $v$ is non-empty and it is contained in the central run of $a$'s.

If $u$ is contained in the first run of $a$'s and it is non-empty, then, pumping down, one gets a word of the form $a^k b a^m b a^{N+1}$ with $k < N + 1$. This word is not in $L$. In all other cases ($u$ is contained in the second run of $a$'s and it is non-empty, or $u = \epsilon$, or $u$ contains the first $b$ of $w$), every word obtained by pumping $u$ and $v$ is prefixed by $a^{N+1} b$. Again, pumping $u$ and $v$, we obtain a word in which the first run of $a$'s is not the longest one. ∎

4 Prefix Normal Words vs. Lyndon Words

In this section, we explore the relationship between the language of prefix normal words and some known classes of words defined by means of lexicographic properties.

A Lyndon word is a word which is lexicographically (strictly) smaller than any of its proper non-empty suffixes. Equivalently, is a Lyndon word if it is the (strictly) smallest, in the lexicographic order, among its conjugates, i.e., for any factorization , with non-empty words, one has that the word is lexicographically greater than [12]. Note that, by definition, a Lyndon word is primitive, i.e., it cannot be written as for a and . Let us denote by the set of Lyndon words over . One has that and . For example, the word belongs to but is not a Lyndon word since it is not primitive. An example of Lyndon word which is not in prefix normal form is .

A power of a Lyndon word is called a prime word [11] or necklace (see [4] for more details and references on this definition).

Let us denote by $PL$ the set of prefixes of powers of Lyndon words, also called sesquipowers (or fractional powers) of Lyndon words [7], or preprime words [11], or also pre-necklaces [16]. It is easy to see that $PL$ is in fact the set of prefixes of Lyndon words plus the powers of the letter $b$.

The next proposition shows that any prefix normal word different from a power of the letter $b$ is a prefix of a Lyndon word.

Proposition 3

Let $w \in L$ with $|w|_a > 0$. Then the word $w b^{|w|}$ is a Lyndon word.

Proof

We have to prove that any non-empty proper suffix of $w b^{|w|}$ is greater than $w b^{|w|}$. Suppose by contradiction that there exists a non-empty suffix $v$ of $w b^{|w|}$ that is smaller than $w b^{|w|}$, and let $u$ be the longest common prefix between $v$ and $w b^{|w|}$. This implies that $u$ is followed by different letters when it appears as prefix of $v$ and as prefix of $w b^{|w|}$. Since we supposed that $v$ is smaller than $w b^{|w|}$, we conclude that $ua$ is prefix of $v$ and $ub$ is prefix of $w b^{|w|}$. Since $ua$ is a factor of $w b^{|w|}$ ending with $a$, $ua$ must be a factor of $w$ and therefore $ub$ is a prefix of $w$. Thus the factor $ua$ of $w$ has one more $a$ than the prefix $ub$ of $w$, contradicting the fact that $w$ is a prefix normal word. ∎
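
The statement can be checked exhaustively for short words. The sketch below assumes that the word in question is $w$ followed by $|w|$ copies of $b$ (the exponent is our assumption, since it is not legible in this copy) and searches for counterexamples among all prefix normal words up to a given length. Function names are ours.

```python
from itertools import product


def is_prefix_normal(w):
    """No factor of w has more a's than the prefix of the same length."""
    n = len(w)
    return all(
        w[j:j + k].count("a") <= w[:k].count("a")
        for k in range(1, n + 1)
        for j in range(n - k + 1)
    )


def is_lyndon(w):
    """A non-empty word is Lyndon if it is strictly smaller than all its proper suffixes."""
    return len(w) > 0 and all(w < w[i:] for i in range(1, len(w)))


def counterexamples(max_len):
    """Prefix normal words w containing an 'a' for which w + 'b' * len(w) is not Lyndon."""
    bad = []
    for n in range(1, max_len + 1):
        for letters in product("ab", repeat=n):
            w = "".join(letters)
            if "a" in w and is_prefix_normal(w) and not is_lyndon(w + "b" * n):
                bad.append(w)
    return bad


print(counterexamples(8))
```

Python's built-in string comparison is lexicographic with `"a" < "b"`, which matches the order fixed on the alphabet.
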

We can now state the following result.

Theorem 4.1

Every prefix normal word is a pre-necklace. That is, $L \subset PL$.

Proof

If $w$ is of the form $b^n$, $n \ge 1$, then $w$ is a power of the Lyndon word $b$. Otherwise, $w$ contains at least one $a$ and the claim follows by Proposition 3. ∎

The languages $L$ and $PL$, however, do not coincide. The shortest word in $PL$ that does not belong to $L$ is $aabbabaa$. Below we give the table of the number of words in $L$ of each length, up to 16, compared with that of pre-necklaces. This latter sequence is listed in Neil Sloane's On-Line Encyclopedia of Integer Sequences [17].

$n$                    1 2 3 4  5  6  7  8   9  10  11  12   13   14   15   16
$|L \cap \Sigma^n|$    2 3 5 8 14 23 41 70 125 218 395 697 1273 2279 4185 7568
$|PL \cap \Sigma^n|$   2 3 5 8 14 23 41 71 127 226 412 747 1377 2538 4720 8800
Table 2: The number of words in $L$ and in $PL$ for each length up to 16.
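
The first columns of Table 2 can be reproduced by brute force. The sketch below is an assumption-laden illustration: a pre-necklace is tested as a prefix of a power of a Lyndon word whose length does not exceed that of the word, which suffices because every prefix of a Lyndon word is itself such a fractional power. Function names are ours.

```python
from itertools import product


def is_prefix_normal(w):
    """No factor of w has more a's than the prefix of the same length."""
    n = len(w)
    return all(
        w[j:j + k].count("a") <= w[:k].count("a")
        for k in range(1, n + 1)
        for j in range(n - k + 1)
    )


def is_lyndon(w):
    """Strictly smaller than every proper non-empty suffix."""
    return len(w) > 0 and all(w < w[i:] for i in range(1, len(w)))


def is_pre_necklace(w):
    """w is a prefix of a power of some Lyndon word of length at most len(w)."""
    n = len(w)
    return any(
        is_lyndon(w[:p]) and (w[:p] * (n // p + 1))[:n] == w
        for p in range(1, n + 1)
    )


def words(n):
    """All binary words of length n over {a, b}, in lexicographic order."""
    return ("".join(t) for t in product("ab", repeat=n))


for n in range(1, 9):
    prefix_normal = sum(is_prefix_normal(w) for w in words(n))
    pre_necklaces = sum(is_pre_necklace(w) for w in words(n))
    print(n, prefix_normal, pre_necklaces)

# Pre-necklaces of length 8 that are not prefix normal.
print([w for w in words(8) if is_pre_necklace(w) and not is_prefix_normal(w)])
```
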

5 The Prefix Normal Equivalence

The prefix normal form induces an equivalence relation on $\Sigma^*$, namely $u \equiv v$ if and only if $\mathrm{PNF}_a(u) = \mathrm{PNF}_a(v)$. In Table 3, we give all prefix normal words of length 4, and their equivalence classes.

class                      card.
{aaaa}                     1
{aaab, baaa}               2
{aaba, abaa}               2
{aabb, baab, bbaa}         3
{abab, baba}               2
{abba}                     1
{abbb, babb, bbab, bbba}   4
{bbbb}                     1
Table 3: The classes of words of length 4 having the same prefix normal form.

An interesting question is how to characterize two words that have the same prefix normal form. The classes of this equivalence do not seem to follow regular patterns. For example, the words , , , , , all have the same prefix normal form , so that no simple statement about the lengths of the runs of the two letters seems to provide a characterization of the classes. This example also shows that the prefix normal form of a word is in general more complicated than just a rotation of the word or of its reversal (which is in fact the case for small lengths).

Recall that for word length $n \le 16$, we listed the number of equivalence classes in Table 2. The sizes of the equivalence classes seem to exhibit an irregular behaviour. We report in Table 4, for each of the equivalence classes for words of length 8, the prefix normal form and the number of words in the class. Furthermore, we report the cardinality of the largest class of words for each length up to 16 (Table 5).

card. card. card. card.
1 6 6 2
2 2 9 4
2 6 2 1
3 4 8 6
2 8 4 4
4 1 10 2
2 4 3 6
4 2 4 2
2 6 3 2
4 2 8 5
3 4 2 4
6 2 2 3
2 6 6 2
4 1 1 1
2 4 4 8
5 2 2 1
2 4 7
4 2 2
Table 4: The cardinalities of the classes of words of length 8 having the same prefix normal form. There are 7 classes of cardinality 1, 24 classes of cardinality 2, 5 classes of cardinality 3, 16 classes of cardinality 4, 2 classes of cardinality 5, 9 classes of cardinality 6, 1 class of cardinality 7, 4 classes of cardinality 8, 1 class of cardinality 9 and 1 class of cardinality 10.

$n$          1 2 3 4 5 6 7  8  9 10 11 12 13 14 15  16
max. card.   1 2 3 4 5 6 8 10 12 18 24 30 40 60 80 111
Table 5: The maximum cardinality of a class of words having the same prefix normal form.
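
For small lengths, the equivalence classes can be enumerated directly by grouping all binary words by their prefix normal form. The following Python sketch does this by brute force (function names are ours); it is exponential in the word length and only meant for experimenting with short words.

```python
from collections import defaultdict
from itertools import product


def prefix_normal_form(w, letter="a"):
    """Brute-force prefix normal form of w with respect to `letter`."""
    other = "b" if letter == "a" else "a"
    n = len(w)
    prefix = [0]
    for c in w:
        prefix.append(prefix[-1] + (c == letter))
    out, previous = [], 0
    for k in range(1, n + 1):
        best = max(prefix[j + k] - prefix[j] for j in range(n - k + 1))
        out.append(letter if best == previous + 1 else other)
        previous = best
    return "".join(out)


def equivalence_classes(n):
    """Map each prefix normal word of length n to the list of words sharing that form."""
    groups = defaultdict(list)
    for letters in product("ab", repeat=n):
        w = "".join(letters)
        groups[prefix_normal_form(w)].append(w)
    return dict(groups)


for form, members in sorted(equivalence_classes(4).items()):
    print(form, members, len(members))

# Largest class size for each length up to 8.
print([max(len(c) for c in equivalence_classes(n).values()) for n in range(1, 9)])
```
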

6 Conclusion and Open Problems

In this paper, we introduced the prefix normal form of a binary word. This construction arises in the context of indexing for jumbled pattern matching and provides a characterization of the set of factor Parikh vectors of a binary word. We then investigated the language of words which are in prefix normal form (w.r.t. the letter $a$).

Many open problems remain. Among these, the questions regarding the prefix normal equivalence were explored in Section 5: How can we characterize two words that have the same prefix normal form? Can we say anything about the number or the size of equivalence classes for a given word length?

Although we showed that the language $L$ is strictly contained in the language of pre-necklaces (prefixes of powers of Lyndon words), we were not able to find a formula for enumerating prefix normal words. A possible direction for attacking this problem would be finding a characterization of those pre-necklaces which are not prefix normal. Indeed, an enumerative formula for the pre-necklaces is known [17].

Another open problem is to find an algorithm for testing whether a word is in prefix normal form. The best offline algorithms at the moment are the ones for computing an index for the jumbled pattern matching problem, and thus the prefix normal form [6, 14, 13]; these have running time $O(n^2/\log n)$ for a word of length $n$, or $O(n^2/\log^2 n)$ in the word-RAM model. However, testing may be easier than actually computing the PNF. Note also that Lemma 3 gives us an online testing algorithm, with time complexity $O(n^2)$. Another, similar, online testing algorithm is provided by condition 4. of Proposition 1, with running time $O(|w|_a^2)$, which is, of course, again $O(n^2)$ in general.

Acknowledgements.

We would like to thank Bill Smyth for interesting discussions on prefix normal words. We are also grateful to an anonymous referee who helped us improve the proof of Theorem 3.1.

References

  • [1] J. Acharya, H. Das, O. Milenkovic, A. Orlitsky, and S. Pan. Reconstructing a string from its substring compositions. In IEEE International Symposium on Information Theory, ISIT 2010. Proceedings, pages 1238–1242, 2010.
  • [2] J. Berstel and L. Boasson. Context-free languages. In Handbook of Theoretical Computer Science, Volume B: Formal Models and Semantics (B), pages 59–102. Elsevier, 1990.
  • [3] J. Berstel and L. Boasson. The set of Lyndon words is not context-free. Bull. Eur. Assoc. Theor. Comput. Sci. EATCS, 63:139–140, 1997.
  • [4] J. Berstel and D. Perrin. The origins of combinatorics on words. Eur. J. Comb., 28:996–1022, 2007.
  • [5] S. Böcker. Simulating multiplexed SNP discovery rates using base-specific cleavage and mass spectrometry. Bioinformatics, 23(2):5–12, 2007.
  • [6] P. Burcsi, F. Cicalese, G. Fici, and Zs. Lipták. On Table Arrangements, Scrabble Freaks, and Jumbled Pattern Matching. In P. Boldi and L. Gargano, editors, 5th International Conference on Fun with Algorithms, FUN 2010. Proceedings, volume 6099 of Lecture Notes in Comput. Sci., pages 89–101. Springer, 2010.
  • [7] J. Champarnaud, G. Hansel, and D. Perrin. Unavoidable sets of constant length. Internat. J. Algebra Comput., 14:241–251, 2004.
  • [8] F. Cicalese, G. Fici, and Zs. Lipták. Searching for Jumbled Patterns in Strings. In J. Holub and J. Zdárek, editors, Prague Stringology Conference, PSC 2009. Proceedings, pages 105–117. Czech Tech. Univ. in Prague, 2009.
  • [9] M. Cieliebak, T. Erlebach, Zs. Lipták, J. Stoye, and E. Welzl. Algorithmic complexity of protein identification: combinatorics of weighted strings. Discrete Appl. Math., 137(1):27–46, 2004.
  • [10] R. Eres, G. M. Landau, and L. Parida. Permutation pattern discovery in biosequences. J. Comput. Biol., 11(6):1050–1060, 2004.
  • [11] D. E. Knuth. Generating All Tuples and Permutations. The Art of Computer Programming, Vol. 4, Fascicle 2. Addison-Wesley, 2005.
  • [12] M. Lothaire. Algebraic Combinatorics on Words. Encyclopedia of Mathematics and its Applications. Cambridge Univ. Press, 2002.
  • [13] T. M. Moosa and M. S. Rahman. Sub-quadratic time and linear size data structures for permutation matching in binary strings. J. Discrete Algorithms. To appear.
  • [14] T. M. Moosa and M. S. Rahman. Indexing permutations for binary strings. Inf. Process. Lett., 110:795–798, 2010.
  • [15] G. Navarro and V. Mäkinen. Compressed full-text indexes. ACM Comput. Surv., 39(1), 2007.
  • [16] F. Ruskey, C. Savage, and T. M. Y. Wang. Generating necklaces. J. Algorithms, 13(3):414–430, 1992.
  • [17] N. J. A. Sloane. The On-Line Encyclopedia of Integer Sequences. Available electronically at http://oeis.org. Sequence A062692.