Longest Property-Preserved Common Factor

10/04/2018 ∙ by Lorraine A. K. Ayad, et al. ∙ King's College London, University of Pisa

In this paper we introduce a new family of string processing problems. We are given two or more strings and we are asked to compute a factor common to all strings that preserves a specific property and has maximal length. Here we consider three fundamental string properties: square-free factors, periodic factors, and palindromic factors under three different settings, one per property. In the first setting, we are given a string x and we are asked to construct a data structure over x answering the following type of on-line queries: given string y, find a longest square-free factor common to x and y. In the second setting, we are given k strings and an integer 1 < k'≤ k and we are asked to find a longest periodic factor common to at least k' strings. In the third setting, we are given two strings and we are asked to find a longest palindromic factor common to the two strings. We present linear-time solutions for all settings. We anticipate that our paradigm can be extended to other string properties or settings.


1 Introduction

In the longest common factor problem, also known as the longest common substring problem, we are given two strings $x$ and $y$, each of length at most $n$, and we are asked to find a maximal-length string occurring in both $x$ and $y$. This is a classical and well-studied problem in computer science arising out of different practical scenarios. It can be solved in $O(n)$ time and space [10, 18] (see also [21, 26]). Recently, the same problem has been extensively studied under distance metrics; that is, the sought factors (one from $x$ and one from $y$) must be at distance at most $k$ and have maximal length [8, 28, 27, 2, 25, 24] (and references therein).

In this paper we initiate a new related line of research. We are given two or more strings and our goal is to compute a factor common to all strings that preserves a specific property and has maximal length. An analogous line of research was introduced in [11]. It focuses on computing a subsequence (rather than a factor) common to all strings that preserves a specific property and has maximal length. Specifically, in [11, 3, 19], the authors considered computing a longest common palindromic subsequence and in [20] computing a longest common square subsequence.

We consider three fundamental string properties: square-free factors, periodic factors, and palindromic factors [23] under three different settings, one per property. In the first setting, we are given a string $x$ and we are asked to construct a data structure over $x$ answering the following type of on-line queries: given string $y$, find a longest square-free factor common to $x$ and $y$. In the second setting, we are given $k$ strings and an integer $1 < k' \le k$ and we are asked to find a longest periodic factor common to at least $k'$ strings. In the third setting, we are given two strings and we are asked to find a longest palindromic factor common to the two strings. We present linear-time solutions for all settings. We anticipate that our paradigm can be extended to other string properties or settings.

1.1 Definitions and Notation

An alphabet $\Sigma$ is a non-empty finite ordered set of letters of size $\sigma = |\Sigma|$. In this work we consider that $\sigma = O(1)$ or that $\Sigma$ is a linearly-sortable integer alphabet. A string $x$ on an alphabet $\Sigma$ is a sequence of elements of $\Sigma$. The set of all strings on an alphabet $\Sigma$, including the empty string $\varepsilon$ of length $0$, is denoted by $\Sigma^*$. For any string $x$, we denote by $x[i..j]$ the substring (sometimes called factor) of $x$ that starts at position $i$ and ends at position $j$. In particular, $x[0..j]$ is the prefix of $x$ that ends at position $j$, and $x[i..|x|-1]$ is the suffix of $x$ that starts at position $i$, where $|x|$ denotes the length of $x$. A string $uu$, $u \in \Sigma^+$, is called a square. A square-free string is a string that does not contain a square as a factor.

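A period of $x[0..|x|-1]$ is a positive integer $p$ such that $x[i] = x[i+p]$ holds for all $0 \le i < |x| - p$. The smallest period of $x$ is denoted by $per(x)$. String $u$ is called periodic if and only if $per(u) \le |u|/2$. A run of string $x$ is an interval $[i, j]$ such that for the smallest period $p = per(x[i..j])$ it holds that $2p \le j - i + 1$ and the periodicity cannot be extended to the left or right, i.e., $i = 0$ or $x[i-1] \ne x[i+p-1]$, and, $j = |x|-1$ or $x[j-p+1] \ne x[j+1]$.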
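These definitions can be checked on small inputs with a brute-force sketch. It is not the linear-time runs computation cited later in the text; the function names `smallest_period`, `is_periodic` and `runs` are illustrative, and positions are assumed to be 0-indexed.

```python
def smallest_period(u: str) -> int:
    """Smallest p >= 1 such that u[i] == u[i + p] for all 0 <= i < len(u) - p."""
    n = len(u)
    for p in range(1, n + 1):
        if all(u[i] == u[i + p] for i in range(n - p)):
            return p
    return 0  # only reached for the empty string


def is_periodic(u: str) -> bool:
    """A non-empty string is periodic iff its smallest period is at most half its length."""
    return len(u) > 0 and 2 * smallest_period(u) <= len(u)


def runs(x: str):
    """All runs of x as (i, j, p) triples, found by testing every interval."""
    n = len(x)
    found = []
    for i in range(n):
        for j in range(i + 1, n):
            p = smallest_period(x[i:j + 1])
            if 2 * p > j - i + 1:
                continue  # the period does not repeat at least twice
            left_maximal = i == 0 or x[i - 1] != x[i + p - 1]
            right_maximal = j == n - 1 or x[j + 1] != x[j - p + 1]
            if left_maximal and right_maximal:
                found.append((i, j, p))
    return found


print(runs("ababaab"))
```

The sketch takes roughly cubic time, so it is only meant for validating small examples against the definition.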

We denote the reversal of $x$ by string $x^R$, i.e. $x^R = x[|x|-1]x[|x|-2] \ldots x[0]$. A string $p$ is said to be a palindrome if and only if $p = p^R$. If factor $x[i..j]$, $0 \le i \le j \le n-1$, of string $x$ of length $n$ is a palindrome, then $\frac{i+j}{2}$ is the center of $x[i..j]$ in $x$ and $\frac{j-i+1}{2}$ is the radius of $x[i..j]$. In other words, a palindrome is a string that reads the same forward and backward, i.e. a string $p$ is a palindrome if $p = y a y^R$ where $y$ is a string, $y^R$ is the reversal of $y$ and $a$ is either a single letter or the empty string. Moreover, $x[i..j]$ is called a palindromic factor of $x$. It is said to be a maximal palindrome if there is no other palindrome in $x$ with center $\frac{i+j}{2}$ and larger radius. Hence $x$ has exactly $2n-1$ maximal palindromes. A maximal palindrome $p$ of $x$ can be encoded as a pair $(c, r)$, where $c$ is the center of $p$ in $x$ and $r$ is the radius of $p$.
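As an illustration of the count of $2n-1$ maximal palindromes, the following sketch expands around every integer and half-integer center. It is a quadratic-time stand-in for a linear-time method; the name `maximal_palindromes` and the convention of reporting an empty even-length palindrome as the interval `(i, i - 1)` are assumptions made here.

```python
def maximal_palindromes(x: str):
    """Return one maximal palindrome per center as a 0-indexed interval (i, j).

    Centers are c2 / 2 for c2 = 0, ..., 2n - 2. Even values of c2 give
    odd-length palindromes; odd values give even-length ones, which may be
    empty and are then reported as (i, i - 1).
    """
    n = len(x)
    out = []
    for c2 in range(2 * n - 1):
        # Start from a single letter (even c2) or the empty string (odd c2).
        i, j = (c2 + 1) // 2, c2 // 2
        # Grow symmetrically while the flanking letters agree.
        while i > 0 and j < n - 1 and x[i - 1] == x[j + 1]:
            i, j = i - 1, j + 1
        out.append((i, j))
    return out


pals = maximal_palindromes("ababaa")
print(len(pals), pals)
```

For a string of length 6 the list has 11 entries, one per center, which matches the $2n-1$ count.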

1.2 Algorithmic Toolbox

The maximum number of runs in a string of length $n$ is less than $n$ [4], and, moreover, all runs can be computed in $O(n)$ time [22, 4].

The suffix tree $ST(x)$ of a non-empty string $x$ of length $n$ is a compact trie representing all suffixes of $x$. $ST(x)$ can be constructed in $O(n)$ time [14]. We can analogously define and construct the generalised suffix tree $GST$ for a set of strings. We assume the reader is familiar with these data structures.

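The matching statistics capture all matches between two strings $x$ and $y$ [7]. More formally, the matching statistics of a string $y$ with respect to a string $x$ is an array $MS[0..|y|-1]$, where $MS[i]$ is a pair $(\ell_i, p_i)$ such that (i) $y[i..i+\ell_i-1]$ is the longest prefix of $y[i..|y|-1]$ that is a factor of $x$; and (ii) $x[p_i..p_i+\ell_i-1] = y[i..i+\ell_i-1]$. Matching statistics can be computed in $O(|y| \log \sigma)$ time for $y$ by using $ST(x)$ [18, 6, 16].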
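The definition can be tried out with a naive sketch that replaces the suffix tree by Python's built-in `str.find`. The name `matching_statistics` is illustrative, and breaking ties towards the leftmost occurrence in the first string is an assumption of this sketch, not part of the definition.

```python
def matching_statistics(y: str, x: str):
    """For each i, return (length, position) of the longest prefix of y[i:]
    that occurs in x, together with one position in x where it occurs."""
    ms = []
    for i in range(len(y)):
        length, pos = 0, 0  # pos is meaningless when length == 0
        # Extend the match one letter at a time while it still occurs in x.
        while i + length < len(y):
            p = x.find(y[i:i + length + 1])
            if p < 0:
                break
            length, pos = length + 1, p
        ms.append((length, pos))
    return ms


print(matching_statistics("babababbaaab", "aababaababb"))
```

This runs in far more than linear time and is only intended for checking small cases by hand.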

Given a rooted tree $T$ with $n$ leaves coloured from $1$ to $m$, $m \le n$, the colour set size problem is finding, for each internal node $u$ of $T$, the number of different leaf colours in the subtree rooted at $u$. In [10], the authors present an $O(n)$-time solution to this problem.
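To make the problem statement concrete, here is a naive sketch that merges colour sets bottom-up. It is not the linear-time method of [10]; the tree encoding (a `children` dictionary and a `colour` dictionary for leaves) and the function name are assumptions chosen for illustration.

```python
def colour_set_sizes(children, colour, root=0):
    """Return {internal node: number of distinct leaf colours below it}.

    children maps a node to the list of its children (leaves are absent or
    map to an empty list); colour maps each leaf to its colour.
    """
    sizes = {}

    def visit(u):
        kids = children.get(u, [])
        if not kids:
            return {colour[u]}  # a leaf contributes its own colour
        seen = set()
        for v in kids:
            seen |= visit(v)  # union of the colour sets of the subtrees
        sizes[u] = len(seen)
        return seen

    visit(root)
    return sizes


# Node 0 has children 1 and 2; node 1 has leaf children 3 and 4.
print(colour_set_sizes({0: [1, 2], 1: [3, 4]}, {2: 3, 3: 1, 4: 1}))
```

Merging sets this way can take quadratic time on unbalanced trees, which is the cost the cited linear-time solution avoids.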

In the weighted ancestor problem, introduced in [15], we consider a rooted tree $T$ with an integer weight function $\mu$ defined on the nodes. We require that the weight of the root is zero and the weight of any other node is strictly larger than the weight of its parent. A weighted ancestor query, given a node $v$ and an integer value $w \le \mu(v)$, asks for the highest ancestor $u$ of $v$ such that $\mu(u) \ge w$, i.e., such an ancestor $u$ that $\mu(u) \ge w$ and $\mu(u)$ is the smallest possible. When $T$ is the suffix tree of a string $x$ of length $n$, we can locate the locus of any factor $x[i..j]$ of $x$ using a weighted ancestor query. We define the weight of a node of the suffix tree as the length of the string it represents. Thus a weighted ancestor query can be used for the terminal node corresponding to $x[i..n-1]$ to create (if necessary) and mark the node that corresponds to $x[i..j]$. Given a collection $Q$ of weighted ancestor queries on a weighted tree $T$ on $n$ nodes with integer weights up to $n^{O(1)}$, all the queries in $Q$ can be answered off-line in $O(n + |Q|)$ time [5].
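A single weighted ancestor query can be answered naively by walking up the tree, as in the sketch below. This is not the off-line batch algorithm of [5]; the `parent` and `weight` arrays and the function name are assumptions, and a node is treated as an ancestor of itself.

```python
def weighted_ancestor(parent, weight, v, w):
    """Highest ancestor u of v (possibly v itself) with weight[u] >= w.

    Assumes weight[root] == 0, weights strictly increase from parent to
    child, and w <= weight[v].
    """
    u = v
    # Climb while the parent still satisfies the weight threshold.
    while parent[u] is not None and weight[parent[u]] >= w:
        u = parent[u]
    return u


# A path 0 - 1 - 2 - 3 with weights 0, 2, 5, 9.
parent = [None, 0, 1, 2]
weight = [0, 2, 5, 9]
print(weighted_ancestor(parent, weight, 3, 4))
```

Each query costs time proportional to the depth of the node here, whereas the batch method cited in the text answers all queries in linear total time.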

2 Square-Free-Preserved Matching Statistics

In this section, we introduce the square-free-preserved matching statistics problem and provide a linear-time solution. In the square-free-preserved matching statistics problem we are given a string $x$ of length $n$ and we are asked to construct a data structure over $x$ answering the following type of on-line queries: given string $y$, find the longest square-free prefix of $y[i..|y|-1]$ that is a factor of $x$, for all $0 \le i < |y|$. (For related work see [12].) We represent the answer using an integer array $SQMS[0..|y|-1]$ of lengths, but we can trivially modify our algorithm to report the actual factors. It should be clear that a maximum element in $SQMS$ gives the length of some longest square-free factor common to $x$ and $y$.

Construction. Our data structure over string $x$ consists of the following:

  • An integer array $L[0..n-1]$, where $L[i]$ stores the length of the longest square-free factor starting at position $i$ of string $x$.

  • The suffix tree $ST(x)$ of string $x$.

The idea for constructing array $L$ efficiently is based on the following crucial observation.

Observation 1.

If $x[i..n-1]$ contains a square then $L[i] + 1$, for all $0 \le i < n$, is the length of the shortest prefix of $x[i..n-1]$ (factor $f$) containing a square. In fact, the square is a suffix of $f$, otherwise $f$ would not have been the shortest. If $x[i..n-1]$ does not contain a square then $L[i] = n - i$.

We thus shift our focus to computing the shortest such prefixes. We start by considering the runs of $x$. Specifically, we consider squares in $x$ observing that a run $[\ell, r]$ with period $p$ contains $r - \ell - 2p + 2$ squares of length $2p$ with the leftmost one starting at position $\ell$. Let $e = \ell + 2p - 1$ denote the ending position of the leftmost such square of the run. In order to find, for all $i$'s, the shortest prefix of $x[i..n-1]$ containing a square $s$, and thus compute $L[i]$, we have two cases:

  1. $s$ is part of a run $[\ell, r]$ in $x$ that starts after $i$. In particular, $s = x[\ell..e]$ such that $\ell > i$, $e = \ell + 2p - 1$, and $e$ is minimal. In this case the shortest factor has length $e - i + 1$; we store this value in an integer array $C[i]$. If no run starts after position $i$ we set $C[i] = \infty$. To compute $C$, after computing in $O(n)$ time all the runs of $x$ with their $p$ and $e$ [22, 4], we sort them by $\ell$. A right-to-left scan after this sorting associates to $i$ the smallest $e$ over all runs with $\ell > i$.

  2. $s$ is part of a run $[\ell, r]$ in $x$ and $\ell \le i$. This implies that if $i + 2p - 1 \le r$ then a square starts at $i$ and we store the length of the shortest such square in an integer array $S[i]$. If no square starts at position $i$ we set $S[i] = \infty$. Array $S$ can be constructed in $O(n)$ time by applying the algorithm of [13].

Since we do not know which of the two cases holds, we compute both $C[i]$ and $S[i]$. By Observation 1, if $C[i] = S[i] = \infty$ ($x[i..n-1]$ does not contain a square) we set $L[i] = n - i$; otherwise ($x[i..n-1]$ contains a square) we set $L[i] = \min\{C[i], S[i]\} - 1$.

Finally, we build the suffix tree $ST(x)$ of string $x$ in $O(n)$ time [14]. This completes our construction.
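The two-case construction can be sketched as follows, with brute force standing in for the linear-time runs and shortest-square computations cited above. The array names `C`, `S` and `L` and all function names are labels chosen for this illustration; only the right-to-left scan mirrors the described procedure directly.

```python
INF = float("inf")


def smallest_period(u):
    n = len(u)
    return next(p for p in range(1, n + 1)
                if all(u[i] == u[i + p] for i in range(n - p)))


def runs(x):
    """Brute-force runs of x as (start, end, period) triples."""
    n, out = len(x), []
    for i in range(n):
        for j in range(i + 1, n):
            p = smallest_period(x[i:j + 1])
            if (2 * p <= j - i + 1
                    and (i == 0 or x[i - 1] != x[i + p - 1])
                    and (j == n - 1 or x[j + 1] != x[j - p + 1])):
                out.append((i, j, p))
    return out


def build_arrays(x):
    n = len(x)
    # Case 2: S[i] is the length of the shortest square starting at i.
    S = [next((2 * p for p in range(1, (n - i) // 2 + 1)
               if x[i:i + p] == x[i + p:i + 2 * p]), INF)
         for i in range(n)]
    # end_at[s]: smallest end e = s + 2p - 1 of a leftmost square of a run starting at s.
    end_at = [INF] * n
    for s, _, p in runs(x):
        end_at[s] = min(end_at[s], s + 2 * p - 1)
    # Case 1: right-to-left scan keeping the smallest e among runs starting after i.
    C, best = [INF] * n, INF
    for i in range(n - 1, -1, -1):
        if best != INF:
            C[i] = best - i + 1
        best = min(best, end_at[i])
    # Combine the two cases as in Observation 1.
    L = [n - i if min(C[i], S[i]) == INF else min(C[i], S[i]) - 1
         for i in range(n)]
    return C, S, L


C, S, L = build_arrays("aababaababb")
print(L)
```

On the string `aababaababb` the sketch yields the lengths 1, 3, 3, 3, 2, 1, 3, 3, 2, 1, 1, which can be verified by hand against the definition of the longest square-free factor starting at each position.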

Querying. We rely on the following fact for answering the queries efficiently.

Fact 1.

Every factor of a square-free string is square-free.

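Let string $y$ be an on-line query. Using $ST(x)$, we compute the matching statistics $MS$ of $y$ with respect to $x$. For each $i$, $MS[i] = (\ell_i, p_i)$ indicates that $x[p_i..p_i+\ell_i-1] = y[i..i+\ell_i-1]$. This computation can be done in $O(|y| \log \sigma)$ time [18, 6]. By applying Fact 1, we can answer any query in $O(|y| \log \sigma)$ time for $y$ by setting $SQMS[i] = \min\{\ell_i, L[p_i]\}$, for all $0 \le i < |y|$.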
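The query step amounts to one minimum per position. The sketch below assumes the matching statistics (as pairs of length and position) and the precomputed array of longest square-free lengths are already available; the input values are those of the strings aababaababb and babababbaaab, and the function name is illustrative.

```python
def sqms_from_matching_statistics(ms, L):
    """Cap each matched length by the longest square-free length at the
    position in x where the match occurs."""
    return [min(length, L[pos]) if length > 0 else 0 for length, pos in ms]


# Longest square-free lengths for x = aababaababb.
L = [1, 3, 3, 3, 2, 1, 3, 3, 2, 1, 1]
# Matching statistics of y = babababbaaab with respect to x.
MS = [(4, 2), (5, 1), (4, 2), (5, 6), (4, 7), (3, 8),
      (2, 9), (3, 4), (2, 0), (3, 0), (2, 1), (1, 2)]

SQMS = sqms_from_matching_statistics(MS, L)
print(SQMS, max(SQMS))
```

The maximum of the resulting array is 3, the length of a longest square-free factor common to the two strings.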

We arrive at the following result.

Theorem 1.

Given a string $x$ of length $n$ over an alphabet of size $\sigma$, we can construct a data structure of size $O(n)$ in time $O(n)$, answering $SQMS$ on-line queries in $O(|y| \log \sigma)$ time.

Proof.

The time complexity of our algorithm follows from the above discussion.

We next show the correctness of our algorithm. Let us first show the correctness of computing array $L$. The square contained in the shortest prefix of $x[i..n-1]$ (containing a square) starts by definition either at $i$ or after $i$. If it starts at $i$ this is correctly computed by the algorithm of [13] which assigns the length of the shortest such square in $S[i]$. If it starts after $i$ it must be the leftmost square of another run by the runs definition. $C[i]$ stores the length of the shortest prefix containing such a square. Then by Observation 1, $L[i]$ is computed correctly.

It suffices to show that, if $w$ is the longest square-free substring common to $x$ and $y$ occurring at position $i_x$ in $x$ and at position $i_y$ in $y$, then (i) $MS[i_y] = (\ell, i_x)$ with $\ell \ge |w|$ and $x[i_x..i_x+\ell-1] = y[i_y..i_y+\ell-1]$; (ii) $w$ is a prefix of $x[i_x..i_x+L[i_x]-1]$; and (iii) $SQMS[i_y] = |w|$. Case (i) directly follows from the correctness of the matching statistics algorithm. For Case (ii), since $w$ occurs at $i_x$ and is square-free, $|w| \le L[i_x]$. For Case (iii), since $w$ is square-free we have to show that $|w| = \min\{\ell, L[i_x]\}$. We know from (i) that $|w| \le \ell$ and from (ii) that $|w| \le L[i_x]$. If $\min\{\ell, L[i_x]\} = \ell$, then $w$ cannot be extended because the possibly longer than $\ell$ square-free string occurring at $i_x$ does not occur in $y$, and in this case $|w| = \ell$. Otherwise, if $\min\{\ell, L[i_x]\} = L[i_x]$ then $w$ cannot be extended because it is no longer square-free, and in this case $|w| = L[i_x]$. Hence we conclude that $SQMS[i_y] = |w|$. The statement follows. ∎

The following example provides a complete overview of the workings of our algorithm.

Example 1.

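Let $x = $ aababaababb and $y = $ babababbaaab. The length of a longest common square-free factor is 3, and the factors are bab and aba. The arrays computed for $x$ are shown first ($\infty$ marks positions where no such square exists), followed by the matching statistics and the answer for the query $y$.

i      0  1  2  3  4  5  6  7  8  9  10
x[i]   a  a  b  a  b  a  a  b  a  b  b
C[i]   5  6  5  4  3  5  5  4  3  ∞  ∞
S[i]   2  4  4  6  ∞  2  4  ∞  ∞  2  ∞
L[i]   1  3  3  3  2  1  3  3  2  1  1

j        0      1      2      3      4      5      6      7      8      9      10     11
y[j]     b      a      b      a      b      a      b      b      a      a      a      b
MS[j]    (4,2)  (5,1)  (4,2)  (5,6)  (4,7)  (3,8)  (2,9)  (3,4)  (2,0)  (3,0)  (2,1)  (1,2)
SQMS[j]  3      3      3      3      3      2      1      2      1      1      2      1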

3 Longest Periodic-Preserved Common Factor

In this section, we introduce the longest periodic-preserved common factor problem and provide a linear-time solution. In the longest periodic-preserved common factor problem, we are given $k \ge 2$ strings $x_1, x_2, \ldots, x_k$ of total length $N$ and an integer $1 < k' \le k$, and we are asked to find a longest periodic factor common to at least $k'$ strings. In what follows we present two different algorithms to solve this problem. We represent the answer LPCF by the length of a longest factor, but we can trivially modify our algorithms to report an actual factor. Our first algorithm, denoted by lPcf, works as follows.

  1. Compute the runs of string $x_j$, for all $1 \le j \le k$.

  2. Construct the generalised suffix tree $GST$ of $x_1, x_2, \ldots, x_k$.

  3. For each string $x_j$ and for each run $[\ell, r]$ with period $p$ of $x_j$, augment GST with the explicit node spelling $x_j[\ell..r]$, decorate it with $p$, and mark it as a candidate node. This can be done as follows: for each run $[\ell, r]$ of $x_j$, for all $1 \le j \le k$, find the leaf corresponding to $x_j[\ell..|x_j|-1]$ and answer the weighted ancestor query in GST with weight $r - \ell + 1$. Moreover, mark as candidates all explicit nodes spelling a prefix of length $d$ of any run $[\ell, r]$ with $2p \le d$.

  4. Mark as good the nodes of the tree having at least $k'$ different colours on the leaves of the subtree rooted there. Let aGST be this augmented tree.

  5. Return LPCF as the string depth of a candidate node in aGST which is also a good node, and that has maximal string depth (if any, otherwise return 0).
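For testing an implementation of the steps above on small inputs, a brute-force reference that works directly from the problem definition can be useful. It does not use a suffix tree or runs; the function names and the list-of-strings interface are assumptions made for this sketch.

```python
def is_periodic(u: str) -> bool:
    """True iff u has some period p with 2p <= len(u)."""
    n = len(u)
    return any(all(u[i] == u[i + p] for i in range(n - p))
               for p in range(1, n // 2 + 1))


def lpcf_bruteforce(strings, k_prime):
    """Length of a longest periodic factor occurring in at least k_prime strings."""
    owners = {}
    for idx, s in enumerate(strings):
        for i in range(len(s)):
            for j in range(i + 1, len(s) + 1):
                # Record which input strings contain the factor s[i:j].
                owners.setdefault(s[i:j], set()).add(idx)
    best = 0
    for w, who in owners.items():
        if len(who) >= k_prime and len(w) > best and is_periodic(w):
            best = len(w)
    return best


print(lpcf_bruteforce(["ababbabba", "ababaab"], 2))
```

The reference enumerates all factors explicitly, so its running time is polynomial of high degree and it should only be used as a correctness check.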

Theorem 2.

Given $k$ strings of total length $N$ on alphabet $\Sigma$, and an integer $1 < k' \le k$, algorithm lPcf returns LPCF in time $O(N)$.

Proof.

Let us assume wlog that $k' = k$, and let $w$ with period $p$ be the longest periodic factor common to all strings. By the construction of aGST (Steps 1-4), the path spelling $w$ leads to a good node $n_w$ as $w$ occurs in all the strings. We make the following observation.

Observation 2.

Each periodic factor $w$ with period $p$ of string $x$ is a factor of $x[i..j]$, where $[i, j]$ is a run with period $p$.

By Observation 2, in all strings, $w$ is included in a run having the same period. Observe that for at least one of the strings, there is a run ending with $w$, otherwise we could extend $w$ obtaining a longer periodic common factor (similarly, for at least one of the strings, there is a run starting with $w$). Therefore $n_w$ is both a good and a candidate node. By definition, $n_w$ is at string depth at least $2p$ and, by construction, LPCF is the string depth of a deepest such node; thus $|w|$ will be returned by Step 5.

As for the time complexity, Step 1 [22, 4] and Step 2 [14] can be done in $O(N)$ time. Since the total number of runs is less than $N$ [4], Step 3 can be done in $O(N)$ time using off-line weighted ancestor queries [5] to mark the runs as candidate nodes; and then a post-order traversal to mark their ancestor explicit nodes as candidates, if their string-depth is at least $2p$ for any run $[\ell, r]$ with period $p$. The size of the aGST is still in $O(N)$. Step 4 can be done in $O(N)$ time [10]. Step 5 can be done in $O(N)$ time by a post-order traversal of aGST. ∎

The following example provides a complete overview of the workings of our algorithm.

Example 2.

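Consider $x = $ ababbabba, $y = $ ababaab, and $k = k' = 2$. The runs of $x$ are: $[0, 3]$ with period $2$, $[1, 8]$ with period $3$, $[3, 4]$ with period $1$, and $[6, 7]$ with period $1$; those of $y$ are $[0, 4]$ with period $2$ and $[4, 5]$ with period $1$. Fig 1 shows aGST for $x$, $y$, and $k' = 2$. Algorithm lPcf outputs LPCF $= 4$, with $p = 2$, as the node spelling abab is the deepest good one that is also a candidate.

Figure 1: aGST for $x = $ ababbabba, $y = $ ababaab, and $k = k' = 2$.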

We next present a second algorithm to solve this problem with the same time complexity but without the use of off-line weighted ancestor queries. The algorithm works as follows.

  1. Compute the runs of string $x_j$, for all $1 \le j \le k$.

  2. Construct the generalised suffix tree $GST$ of $x_1, x_2, \ldots, x_k$.

  3. Mark as good the nodes of GST having at least $k'$ different colours on the leaves of the subtree rooted there.

  4. Compute and store, for every leaf node, the nearest ancestor that is good.

  5. For each string $x_j$ and for each run $[\ell, r]$ with period $p$ of $x_j$, check the nearest good ancestor for the leaf corresponding to $x_j[\ell..|x_j|-1]$. Let $d$ be the string-depth of the nearest good ancestor. Then:

    1. If $r - \ell + 1 \le d$, the entire run is also good.

    2. If $r - \ell + 1 > d$, check if $2p \le d$, and if so the string for the good ancestor is periodic.

  6. Return LPCF as the maximal string depth found in Step 5 (if any, otherwise return 0).
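The logic of the second algorithm can be imitated without a suffix tree, as a sketch under stated assumptions: runs are found by brute force, and the string depth of the nearest good ancestor of a leaf is replaced by a direct search for the longest prefix of the suffix that occurs in at least $k'$ of the strings. All function names are illustrative, and the running time is nowhere near linear.

```python
def smallest_period(u):
    n = len(u)
    return next(p for p in range(1, n + 1)
                if all(u[i] == u[i + p] for i in range(n - p)))


def runs(x):
    """Brute-force runs of x as (start, end, period) triples."""
    n, out = len(x), []
    for i in range(n):
        for j in range(i + 1, n):
            p = smallest_period(x[i:j + 1])
            if (2 * p <= j - i + 1
                    and (i == 0 or x[i - 1] != x[i + p - 1])
                    and (j == n - 1 or x[j + 1] != x[j - p + 1])):
                out.append((i, j, p))
    return out


def lpcf_by_runs(strings, k_prime):
    def good_depth(suffix):
        # Stand-in for the string depth of the nearest good ancestor.
        d = 0
        while (d < len(suffix)
               and sum(suffix[:d + 1] in s for s in strings) >= k_prime):
            d += 1
        return d

    best = 0
    for x in strings:
        for l, r, p in runs(x):
            d = good_depth(x[l:])
            run_len = r - l + 1
            if run_len <= d:
                best = max(best, run_len)  # the entire run is good
            elif 2 * p <= d:
                best = max(best, d)        # a periodic prefix of the run is good
    return best


print(lpcf_by_runs(["ababaa", "bababb"], 2))
```

For the two strings ababaa and bababb the sketch returns 4, corresponding to the common periodic factors abab and baba.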

Figure 2: GST for $x = $ ababaa, $y = $ bababb, and $k = k' = 2$. Good nodes are marked red.

Let us analyse this algorithm. Let us assume wlog that $k' = k$, and let $w$ with period $p$ be the longest periodic factor common to all strings. By the construction of GST (Steps 1-3), the path spelling $w$ leads to a good node as $w$ occurs in all the strings.

By Observation 2, in all strings, $w$ is included in a run having the same period. Observe that for at least one of the strings, there is a run starting with $w$, otherwise we could extend $w$ obtaining a longer periodic common factor. So the algorithm should check, for each run, if there is a periodic-preserved common prefix of the run and take the longest such prefix. LPCF is the string depth of a deepest good node spelling a periodic factor; thus $|w|$ will be returned by Step 6.

As for the time complexity, Step 1 [22, 4] and Step 2 [14] can be done in $O(N)$ time. Step 3 can be done in $O(N)$ time [10] and Step 4 can be done in $O(N)$ time by using a tree traversal. Since the total number of runs is less than $N$ [4], Step 5 can be done in $O(N)$ time. We thus arrive at Theorem 2 with a different algorithm.

The following example provides a complete overview of the workings of our algorithm.

Example 3.

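Consider $x = $ ababaa, $y = $ bababb, and $k = k' = 2$. The runs of $x$ are: $[0, 4]$ with period $2$ and $[4, 5]$ with period $1$; those of $y$ are $[0, 4]$ with period $2$ and $[4, 5]$ with period $1$. Fig 2 shows GST for $x$, $y$, and $k' = 2$. Consider the run $[0, 4]$ of $x$. The nearest good node of leaf spelling $x[0..|x|-1]$ is the node spelling abab. We have that $r - \ell + 1 = 5 > d = 4$, and $2p = 4 \le d$. The algorithm outputs LPCF $= 4$ as abab is a longest periodic-preserved common factor. Another longest periodic-preserved common factor is baba.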

4 Longest Palindromic-Preserved Common Factor

In this section, we introduce the longest palindromic-preserved common factor problem and provide a linear-time solution. In the longest palindromic-preserved common factor problem, we are given two strings $x$ and $y$, and we are asked to find a longest palindromic factor common to the two strings. (For related work in a dynamic setting see [17, 1].) We represent the answer LPALCF by the length of a longest factor, but we can trivially modify our algorithm to report an actual factor. Our algorithm is denoted by lPalcf. In the description below, for clarity, we consider odd-length palindromes only. (Even-length palindromes can be handled in an analogous manner.)

  1. Compute the maximal odd-length palindromes of $x$ and the maximal odd-length palindromes of $y$.

  2. Collect the factors $x[i..i']$ of $x$ (resp. the factors $y[j..j']$ of $y$) such that $i$ ($j$) is the center of an odd-length maximal palindrome of $x$ ($y$) and $i'$ ($j'$) is the ending position of the odd-length maximal palindrome centered at $i$ ($j$).

  3. Create a lexicographically sorted list $\mathcal{L}$ of these strings from $x$ and $y$.

  4. Compute the longest common prefix of consecutive entries (strings) in $\mathcal{L}$.

  5. Let $\ell$ be the maximal length of longest common prefixes between any string from $x$ and any string from $y$. For odd lengths, return LPALCF $= 2\ell - 1$.
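The five steps can be sketched for odd-length palindromes as follows. Plain sorting and letter-by-letter prefix comparison stand in for the linear-time data structure cited in the proof, palindromes are found by expanding around each center, and all function names are assumptions of this illustration.

```python
def odd_half_palindromes(x: str):
    """For each center c, the factor from c to the end of the maximal
    odd-length palindrome centered at c (Steps 1 and 2)."""
    n, out = len(x), []
    for c in range(n):
        r = 0
        while c - r - 1 >= 0 and c + r + 1 < n and x[c - r - 1] == x[c + r + 1]:
            r += 1
        out.append(x[c:c + r + 1])
    return out


def lcp(u: str, v: str) -> int:
    k = 0
    while k < len(u) and k < len(v) and u[k] == v[k]:
        k += 1
    return k


def lpalcf_odd(x: str, y: str) -> int:
    # Step 3: one sorted list, each entry tagged with its source string.
    tagged = sorted([(s, 0) for s in odd_half_palindromes(x)]
                    + [(s, 1) for s in odd_half_palindromes(y)])
    # Steps 4 and 5: the best cross-source prefix is attained by some pair
    # of neighbours with different tags in the sorted order.
    best = 0
    for (u, a), (v, b) in zip(tagged, tagged[1:]):
        if a != b:
            best = max(best, lcp(u, v))
    return 2 * best - 1 if best > 0 else 0


print(lpalcf_odd("ababaa", "bababb"))
```

Restricting the comparison to neighbouring entries is sufficient because the longest common prefix of two entries in a sorted list is the minimum over the consecutive pairs lying between them.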

Theorem 3.

Given two strings $x$ and $y$ on alphabet $\Sigma$, algorithm lPalcf returns LPALCF in time $O(|x| + |y|)$.

Proof.

The correctness of our algorithm follows directly from the following observation.

Observation 3.

Any longest palindromic-preserved common factor is a factor of a maximal palindrome of $x$ with the same center and a factor of a maximal palindrome of $y$ with the same center.

Step 1 can be done in $O(|x| + |y|)$ time [18]. Step 2 can be done in $O(|x| + |y|)$ time by going through the set of maximal palindromes computed in Step 1. Step 3 and Step 4 can be done in $O(|x| + |y|)$ time by constructing the data structure of [9]. Step 5 can be done in $O(|x| + |y|)$ time by going through the list of computed longest common prefixes. ∎

The following example provides a complete overview of the workings of our algorithm.

Example 4.

Consider $x = $ ababaa and $y = $ bababb. In Step 1 we compute all maximal palindromes of $x$ and $y$. Considering odd-length palindromes gives the following factors (Step 2) from $x$: $x[0..0] = $ a, $x[1..2] = $ ba, $x[2..4] = $ aba, $x[3..4] = $ ba, $x[4..4] = $ a, and $x[5..5] = $ a. The analogous factors from $y$ are: $y[0..0] = $ b, $y[1..2] = $ ab, $y[2..4] = $ bab, $y[3..4] = $ ab, $y[4..4] = $ b, and $y[5..5] = $ b. We sort these strings lexicographically and compute the longest common prefix information (Steps 3-4). We find that $\ell = 2$: the maximal longest common prefixes are ba and ab, denoting that aba and bab are the longest palindromic-preserved common factors of odd length. In fact, algorithm lPalcf outputs LPALCF $= 3$ as aba and bab are the longest palindromic-preserved common factors of any length.

5 Final Remarks

In this paper, we introduced a new family of string processing problems. The goal is to compute factors common to a set of strings preserving a specific property and having maximal length. We showed linear-time algorithms for square-free, periodic, and palindromic factors under three different settings. We anticipate that our paradigm can be extended to other string properties or settings.

Acknowledgements

We would like to acknowledge an anonymous reviewer of a previous version of this paper who suggested the second linear-time algorithm for computing the longest periodic-preserved common factor. Solon P. Pissis and Giovanna Rosone are partially supported by the Royal Society project IE 161274 “Processing uncertain sequences: combinatorics and applications”. Giovanna Rosone and Nadia Pisanti are partially supported by the project Italian MIUR-SIR CMACBioSeq (“Combinatorial methods for analysis and compression of biological sequences”) grant n. RBSI146R5L.

References

  • [1] Amihood Amir, Panagiotis Charalampopoulos, Solon P. Pissis, and Jakub Radoszewski. Longest common factor made fully dynamic. CoRR, abs/1804.08731, 2018.
  • [2] Lorraine A. K. Ayad, Carl Barton, Panagiotis Charalampopoulos, Costas S. Iliopoulos, and Solon P. Pissis. Longest common prefixes with k-errors and applications. In SPIRE, volume 11147 of LNCS, pages 27–41. Springer, 2018.
  • [3] Sang Won Bae and Inbok Lee. On finding a longest common palindromic subsequence. Theoretical Computer Science, 710:29–34, 2018. Advances in Algorithms & Combinatorics on Strings (Honoring 60th birthday for Prof. Costas S. Iliopoulos).
  • [4] Hideo Bannai, Tomohiro I, Shunsuke Inenaga, Yuto Nakashima, Masayuki Takeda, and Kazuya Tsuruta. The “runs” theorem. SIAM Journal on Computing, 46(5):1501–1514, 2017.
  • [5] Carl Barton, Tomasz Kociumaka, Chang Liu, Solon P. Pissis, and Jakub Radoszewski. Indexing weighted sequences: Neat and efficient. CoRR, abs/1704.07625, 2017.
  • [6] Djamal Belazzougui and Fabio Cunial. Indexed matching statistics and shortest unique substrings. In Edleno Silva de Moura and Maxime Crochemore, editors, 21st International Symposium on String Processing and Information Retrieval (SPIRE), volume 8799 of LNCS, pages 179–190, 2014.
  • [7] W. I. Chang and E. L. Lawler. Sublinear approximate string matching and biological applications. Algorithmica, 12(4):327–344, 1994.
  • [8] Panagiotis Charalampopoulos, Maxime Crochemore, Costas S. Iliopoulos, Tomasz Kociumaka, Solon P. Pissis, Jakub Radoszewski, Wojciech Rytter, and Tomasz Walen. Linear-time algorithm for long LCF with k mismatches. In CPM, volume 105 of LIPIcs, pages 23:1–23:16. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, 2018.
  • [9] Panagiotis Charalampopoulos, Costas S. Iliopoulos, Chang Liu, and Solon P. Pissis. Property suffix array with applications. In Michael A. Bender, Martin Farach-Colton, and Miguel A. Mosteiro, editors, LATIN 2018: Theoretical Informatics - 13th Latin American Symposium, Buenos Aires, Argentina, April 16-19, 2018, Proceedings, volume 10807 of Lecture Notes in Computer Science, pages 290–302. Springer, 2018.
  • [10] Lucas Chi Kwong Hui. Color set size problem with applications to string matching. In Combinatorial Pattern Matching, pages 230–243. Springer Berlin Heidelberg, 1992.
  • [11] Shihabur Rahman Chowdhury, Md. Mahbubul Hasan, Sumaiya Iqbal, and M. Sohel Rahman. Computing a longest common palindromic subsequence. Fundam. Inf., 129(4):329–340, 2014.
  • [12] Marius Dumitran, Florin Manea, and Dirk Nowotka. On prefix/suffix-square free words. In Costas S. Iliopoulos, Simon J. Puglisi, and Emine Yilmaz, editors, 22nd International Symposium, on String Processing and Information Retrieval (SPIRE), volume 9309 of LNCS, pages 54–66, 2015.
  • [13] Jean-Pierre Duval, Roman Kolpakov, Gregory Kucherov, Thierry Lecroq, and Arnaud Lefebvre. Linear-time computation of local periods. Theoretical Computer Science, 326(1):229–240, 2004.
  • [14] Martin Farach. Optimal suffix tree construction with large alphabets. In 38th Annual Symposium on Foundations of Computer Science (FOCS), pages 137–143, 1997.
  • [15] Martin Farach and S. Muthukrishnan. Perfect hashing for strings: Formalization and algorithms. In 7th Symposium on Combinatorial Pattern Matching (CPM), pages 130–140. 1996.
  • [16] Maria Federico and Nadia Pisanti. Suffix tree characterization of maximal motifs in biological sequences. Theor. Comput. Sci., 410(43):4391–4401, 2009.
  • [17] Mitsuru Funakoshi, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda. Longest substring palindrome after edit. In Gonzalo Navarro, David Sankoff, and Binhai Zhu, editors, Annual Symposium on Combinatorial Pattern Matching (CPM 2018), volume 105 of Leibniz International Proceedings in Informatics (LIPIcs), pages 12:1–12:14, Dagstuhl, Germany, 2018. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik.
  • [18] Dan Gusfield. Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology. Cambridge University Press, 1997.
  • [19] Shunsuke Inenaga and Heikki Hyyrö. A hardness result and new algorithm for the longest common palindromic subsequence problem. Information Processing Letters, 129:11–15, 2018.
  • [20] Takafumi Inoue, Shunsuke Inenaga, Heikki Hyyrö, Hideo Bannai, and Masayuki Takeda. Computing longest common square subsequences. In 29th Symposium on Combinatorial Pattern Matching (CPM), volume 105 of LIPIcs, pages 15:1–15:13, 2018.
  • [21] Tomasz Kociumaka, Tatiana A. Starikovskaya, and Hjalte Wedel Vildhøj. Sublinear space algorithms for the longest common substring problem. In Algorithms - ESA 2014 - 22th Annual European Symposium, Wroclaw, Poland, September 8-10, 2014. Proceedings, pages 605–617, 2014.
  • [22] Roman Kolpakov and Gregory Kucherov. Finding maximal repetitions in a word in linear time. In 40th Symposium on Foundations of Comp Science, pages 596–604, 1999.
  • [23] M. Lothaire. Applied Combinatorics on Words. Encyclopedia of Mathematics and its Applications. Cambridge University Press, 2005.
  • [24] Pierre Peterlongo, Nadia Pisanti, Frédéric Boyer, Alair Pereira do Lago, and Marie-France Sagot. Lossless filter for multiple repetitions with hamming distance. J. Discr. Alg., 6(3):497–509, 2008.
  • [25] Pierre Peterlongo, Nadia Pisanti, Frédéric Boyer, and Marie-France Sagot. Lossless filter for finding long multiple approximate repetitions using a new data structure, the bi-factor array. In 12th International Symposium String Processing and Information Retrieval, 12th International Conference (SPIRE), pages 179–190, 2005.
  • [26] Tatiana A. Starikovskaya and Hjalte Wedel Vildhøj. Time-space trade-offs for the longest common substring problem. In 24th Symposium on Combinatorial Pattern Matching (CPM), pages 223–234, 2013.
  • [27] Sharma V. Thankachan, Chaitanya Aluru, Sriram P. Chockalingam, and Srinivas Aluru. Algorithmic framework for approximate matching under bounded edits with applications to sequence analysis. In RECOMB, volume 10812 of LNCS, pages 211–224, 2018.
  • [28] Sharma V. Thankachan, Alberto Apostolico, and Srinivas Aluru. A provably efficient algorithm for the k-mismatch average common substring problem. Journal of Computational Biology, 23(6):472–482, 2016.