The longest common substring of two strings $X$ and $Y$ is a longest string that is a substring of both $X$ and $Y$. It is well known that the problem can be solved in linear time, using the generalized suffix tree of $X$ and $Y$ [19, 12].
Ayad et al. [1, 2] proposed a class of problems called longest common property preserved substring, where the aim is to find the longest substring that has some property and is common to a subset of the input strings. They considered several problems in two different settings.
In the first, on-line, setting, one is given a string $T$ that can be pre-processed, and the problem is to answer, for any given query string $S$, the longest square-free substring that is common to both $T$ and $S$. Their solution takes $O(|T|)$ time for preprocessing and $O(|S| \log \sigma)$ time for queries, where $\sigma$ is the alphabet size.
In the second, off-line, setting, one is given a set of $k$ strings of total length $N$ and a positive integer $1 \le k' \le k$, and the problem is to find the longest periodic substring, as well as the longest palindromic substring, that is common to at least $k'$ of the strings. Their solution works in $O(N)$ time and space. However, it does not (at least directly) give a solution for the on-line setting.
In this paper, we consider a generalized and unified on-line setting, where we are given a set $W$ of $k$ strings with total length $N$ that can be pre-processed, and the problem is to answer, for any given query string $S$ and positive integer $k' \le k$, the longest property preserved substring that is common to $S$ and at least $k'$ of the strings. We give solutions for the following properties in this setting, all working with $O(N)$ time and space preprocessing, and $O(|S| \log \sigma)$ time and $O(1)$ working space for answering queries: squares, periodic substrings, palindromes, and Lyndon words. Furthermore, we note that solutions for the off-line setting can be obtained by using our solutions for the on-line setting. We also note that our algorithms can be modified to remove the $\log \sigma$ factor in the off-line setting.
As related work, the off-line version of property preserved subsequences has been considered for some properties: polynomial-time algorithms are known for computing the longest common square subsequence [14] and the longest common palindromic subsequence [3] of two strings.
2.1 Strings
Let $\Sigma$ be a set of symbols, or alphabet, and $\Sigma^*$ the set of strings over $\Sigma$. We assume a constant or linearly-sortable integer alphabet (note that a string over a general ordered alphabet can be transformed into a string over an integer alphabet in $O(n \log \sigma)$ time), and use $\sigma$ to denote the size of the alphabet, i.e., $\sigma = |\Sigma|$. For any string $w$, let $|w|$ denote its length. For any integer $1 \le i \le |w|$, $w[i]$ is the $i$-th symbol of $w$, and for any integers $1 \le i \le j \le |w|$, $w[i..j] = w[i] \cdots w[j]$. For convenience, $w[i..j]$ is the empty string when $i > j$. If a string $w$ satisfies $w = xyz$, then the strings $x$, $y$, $z$ are respectively called a prefix, substring, and suffix of $w$. A prefix (resp. substring, suffix) is called a proper prefix (resp. substring, suffix) if it is shorter than $w$.
Let $w^R$ denote the reverse of $w$, i.e., $w^R = w[|w|] \cdots w[1]$. A string $w$ is said to be a palindrome if $w = w^R$. A string $w$ is a square if $w = uu$ for some string $u$, called the root of $w$. A string $w$ is primitive if there does not exist any string $u$ such that $w = u^k$ for some integer $k \ge 2$. A square is called a primitively rooted square if its root is primitive. An integer $p$, $1 \le p \le |w|$, is called a period of a string $w$ if $w[i] = w[i+p]$ for all $1 \le i \le |w| - p$. A string $w$ is periodic if the smallest period of $w$ is at most $|w|/2$. A run in a string $T$ is a maximal periodic substring, i.e., a periodic substring $T[i..j]$ with smallest period $p$ is a run if, for any $i' \le i$ and $j' \ge j$ with $i' < i$ or $j' > j$, the smallest period of $T[i'..j']$ is not $p$. The following is the well-known (weak) periodicity lemma concerning periods:
Lemma 1 ((Weak) Periodicity Lemma [8])
If $p$ and $q$ are two periods of a string $w$, and $p + q \le |w|$, then $\gcd(p, q)$ is also a period of $w$.
Let $\prec$ denote a total order on $\Sigma$, as well as the lexicographic order it induces on $\Sigma^*$. That is, for any two strings $x, y \in \Sigma^*$, $x \prec y$ if and only if either $x$ is a proper prefix of $y$, or there exist strings $u, v, w \in \Sigma^*$ and symbols $a \prec b$ such that $x = uav$ and $y = ubw$. A string is a Lyndon word [17] if it is lexicographically smaller than any of its non-empty proper suffixes.
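As a concrete reference for these definitions, the following brute-force predicates (our own illustration, not part of the paper's algorithms) spell out the string properties used throughout:

```python
def is_palindrome(w):
    # w equals its reverse
    return w == w[::-1]

def is_square(w):
    # w = uu for some non-empty string u (the root of w)
    n = len(w)
    return n > 0 and n % 2 == 0 and w[: n // 2] == w[n // 2 :]

def smallest_period(w):
    # smallest p with w[i] == w[i + p] for all valid i
    n = len(w)
    return next(p for p in range(1, n + 1)
                if all(w[i] == w[i + p] for i in range(n - p)))

def is_periodic(w):
    # the smallest period is at most |w| / 2
    return len(w) > 0 and smallest_period(w) <= len(w) / 2

def is_primitive(w):
    # w is not u^k for any string u and integer k >= 2
    n = len(w)
    return n > 0 and not any(n % p == 0 and w == w[:p] * (n // p)
                             for p in range(1, n))

def is_lyndon(w):
    # strictly smaller than all of its non-empty proper suffixes
    return len(w) > 0 and all(w < w[i:] for i in range(1, len(w)))
```

For example, "aabab" is a Lyndon word, while "abab" is a square whose root "ab" is primitive, so "abab" is a primitively rooted square (and also periodic, with smallest period 2).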
2.2 Suffix Trees
The suffix tree [19] of a string $T$, denoted $\mathsf{STree}(T)$, is a compacted trie of the set of non-empty suffixes of $T\$$, where $\$$ denotes a unique symbol that does not occur in $T$. More precisely, it is 1) a rooted tree whose edges are labeled by non-empty strings, such that 2) the concatenations of edge labels on root-to-leaf paths correspond to all and only the suffixes of $T\$$, and 3) any non-leaf node has at least two children, and the first letters of the labels on the edges to its children are distinct.
A node in $\mathsf{STree}(T)$ is called an explicit node, while a position on an edge corresponding to a proper prefix of the edge label is called an implicit node. For a (possibly implicit) node $v$ in $\mathsf{STree}(T)$, let $\mathit{str}(v)$ denote the string obtained by concatenating the edge labels on the path from the root to $v$. The locus of a string $w$ in $\mathsf{STree}(T)$ is the (possibly implicit) node $v$ such that $\mathit{str}(v) = w$. Each explicit node $v$ of the suffix tree with $\mathit{str}(v) = cw$ for some symbol $c$ can be augmented with a suffix link that points to the node $u$ such that $\mathit{str}(u) = w$. It is easy to see that because $v$ is an explicit node, $u$ is also always an explicit node.
It is well known that the suffix tree (and suffix links) can be represented in $O(|T|)$ space and constructed in $O(|T| \log \sigma)$ time [19, 18] for general ordered alphabets, or in $O(|T|)$ time for constant [19, 18] or linearly-sortable integer alphabets [7]. The suffix tree can also be defined for a set $W = \{T_1, \ldots, T_k\}$ of strings, denoted $\mathsf{STree}(W)$, and again can be constructed in $O(N \log \sigma)$ time for general ordered alphabets or in $O(N)$ time for constant or linearly-sortable integer alphabets, where $N$ is the total length of the strings. This is done by considering and building the suffix tree for the string $T_1 \$_1 \cdots T_k \$_k$ and pruning edges below any $\$_i$, where each $\$_i$ is a unique symbol not occurring elsewhere. It is also easy to process $\mathsf{STree}(W)$ so as to compute, for each explicit node $v$, the length $|\mathit{str}(v)|$, as well as an occurrence $(\mathit{id}(v), \mathit{occ}(v))$ of $\mathit{str}(v)$ in $W$, where $T_{\mathit{id}(v)}[\mathit{occ}(v)..\mathit{occ}(v)+|\mathit{str}(v)|-1] = \mathit{str}(v)$. Also, these values can be computed in constant time for any implicit node, given the values for the closest descendant explicit node.
We will later use the following lemma to efficiently add explicit nodes corresponding to the loci of a set of substrings that are of interest.
Lemma 2 ([16, Corollary 7.3])
Given a collection $Q$ of substrings of a string $T$ of length $n$, each represented by an occurrence in $T$, we can compute the locus of each substring in the suffix tree of $T$ in $O(n + |Q|)$ total time. Moreover, these loci can be made explicit in $O(|Q|)$ extra time.
For a string $T$ of length $n$, a longest common extension query, given positions $1 \le i, j \le n$, asks for the length of the longest common prefix of $T[i..n]$ and $T[j..n]$. It is known that $T$ can be pre-processed in $O(n)$ time so that any longest common extension query can be answered in $O(1)$ time (e.g., [9]).
2.3 Matching Statistics
For two strings $S$ and $T$, the matching statistics of $S$ with respect to $T$ is an array $\mathit{MS}$ of length $|S|$, where $\mathit{MS}[i] = \max\{ l \ge 0 \mid S[i..i+l-1] \text{ is a substring of } T \}$ for any $1 \le i \le |S|$. That is, for each position $i$ of $S$, $\mathit{MS}[i]$ is the length of the longest prefix of $S[i..|S|]$ that occurs in $T$. The concept of matching statistics can be generalized to a set of strings. For a set $W = \{T_1, \ldots, T_k\}$ of strings, a string $S$, and a positive integer $k' \le k$, the $k'$-matching statistics of $S$ with respect to $W$ is an array $\mathit{MS}_{k'}$ of length $|S|$, where $\mathit{MS}_{k'}[i] = \max\{ l \ge 0 \mid S[i..i+l-1] \text{ is a substring of at least } k' \text{ strings in } W \}$. That is, for each position $i$ of $S$, $\mathit{MS}_{k'}[i]$ is the length of the longest prefix of $S[i..|S|]$ that occurs in at least $k'$ of the strings in $W$.
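A direct, quadratic-time rendering of these definitions may help fix the notation (the function names are ours; the algorithm of Section 3 computes the same arrays via the generalized suffix tree):

```python
def matching_statistics(S, T):
    # MS[i]: length of the longest prefix of S[i:] occurring in T
    ms = []
    for i in range(len(S)):
        l = 0
        while i + l < len(S) and S[i : i + l + 1] in T:
            l += 1
        ms.append(l)
    return ms

def k_matching_statistics(S, W, k):
    # MS_k'[i]: length of the longest prefix of S[i:] occurring
    # in at least k of the strings in W
    ms = []
    for i in range(len(S)):
        l = 0
        while i + l < len(S) and sum(S[i : i + l + 1] in T for T in W) >= k:
            l += 1
        ms.append(l)
    return ms
```

For example, `matching_statistics("nanaba", "banana")` is `[4, 3, 2, 1, 2, 1]`, and `k_matching_statistics("ana", ["banana", "panama", "nab"], 2)` is `[3, 2, 1]`.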
2.4 Longest Common Property Preserved Substring Queries
Let a function $\Pi : \Sigma^* \to \{\mathit{true}, \mathit{false}\}$ be called a property function. In this paper, we will consider the property functions $\Pi_{\mathrm{sqfree}}$, $\Pi_{\mathrm{sq}}$, $\Pi_{\mathrm{per}}$, $\Pi_{\mathrm{pal}}$, and $\Pi_{\mathrm{Lyn}}$, which return $\mathit{true}$ if and only if a string is respectively square-free, a square, a periodic string, a palindrome, or a Lyndon word.
The following is the on-line version of the problem considered in [1, 2], where a solution was given for the longest common square-free substring, i.e., $\Pi = \Pi_{\mathrm{sqfree}}$.
Problem 1 (Longest common property preserved substring query)
Let $\Pi$ be a property function. Consider a string $T$ which can be pre-processed. For a query string $S$, compute the longest string that is a substring of both $T$ and $S$, and also satisfies $\Pi$.
The following is the generalized version of the on-line setting that we consider in this paper.
Problem 2 (Generalized longest common property preserved substring query)
Let $\Pi$ be a property function. Consider a set $W = \{T_1, \ldots, T_k\}$ of strings that can be pre-processed. For a query string $S$ and positive integer $k' \le k$, compute the longest substring of $S$ that is a substring of at least $k'$ strings in $W$, and also satisfies $\Pi$.
Below is a summary of our results. Here, the working space of a query is the amount of memory that is required in excess of the data structure constructed in the pre-processing. All memory other than the working space can be considered read-only.
Theorem 2.1
For any property function $\Pi \in \{\Pi_{\mathrm{sq}}, \Pi_{\mathrm{per}}, \Pi_{\mathrm{pal}}, \Pi_{\mathrm{Lyn}}\}$, Problem 2 can be answered in $O(|S| \log \sigma)$ time and $O(1)$ working space by constructing an $O(N)$-space data structure in $O(N)$ time, where $N$ is the total length of the strings in $W$.
We further note that our algorithms do not require random access to $S$ during a query, and thus work in the same time/space bounds even if each symbol of $S$ is given in a streaming fashion.
The proof of Theorem 2.1 for each property function is given in the next section.
In this section we present our algorithms. The outline of our solutions is as follows.
The preprocessing consists of the following steps:
1. Construct the generalized suffix tree $\mathsf{STree}(W)$ of $W$.
2. For each explicit node $v$ of $\mathsf{STree}(W)$, compute the number of strings in $W$ that contain $\mathit{str}(v)$ as a substring [13].
3. Process $\mathsf{STree}(W)$ and construct a data structure so that, given any (possibly implicit) node of $\mathsf{STree}(W)$, we can efficiently find a candidate for the solution.
Then, queries are answered as follows:
4. For each position $i$ in $S$, compute $\mathit{MS}_{k'}[i]$, i.e., the $k'$-matching statistics of $S$ with respect to $W$, as the locus of $S[i..i+\mathit{MS}_{k'}[i]-1]$ in $\mathsf{STree}(W)$.
5. For each such locus, compute a candidate using the data structure computed in Step 3 of the preprocessing.
6. Output the longest string computed in the previous step.
Using $\mathsf{STree}(W)$, Step 4, i.e., computing each locus of the substring of $S$ that corresponds to each element of the $k'$-matching statistics of $S$ with respect to $W$ (i.e., the locus of $S[i..i+\mathit{MS}_{k'}[i]-1]$ for each position $i$), can be done in $O(|S| \log \sigma)$ time and $O(1)$ working space, with a minor modification to the algorithm for computing the matching statistics between single strings [12, Theorem 7.8.1]. The algorithm for a single string $T$ first traverses the suffix tree of $T$ with $S$ as long as possible, to compute the locus corresponding to $\mathit{MS}[1]$; let this prefix be $S[1..\mathit{MS}[1]]$. Given the locus of $S[i..i+\mathit{MS}[i]-1]$ for some $i$, the suffix link of its closest explicit ancestor is used in order to first efficiently find the locus of $S[i+1..i+\mathit{MS}[i]-1]$. Then, the suffix tree is further traversed to obtain the locus of $S[i+1..i+1+\mathit{MS}[i+1]-1]$. The time bound for the traversal follows from a well-known amortized analysis which considers the depth of the traversed nodes, similar to that in the on-line construction of suffix trees [18]. For computing $\mathit{MS}_{k'}$, we can simply imagine that subtrees below edges leading to a node $v$ of $\mathsf{STree}(W)$ are pruned if $\mathit{str}(v)$ is not contained in at least $k'$ strings of $W$. This can be done by aborting the traversal of $\mathsf{STree}(W)$ when we encounter such an edge, by using the information obtained in Step 2. It is easy to see that the algorithm still works in the same time bound, since the suffix link of any remaining node still points to a node that is not pruned (i.e., if a string is contained in at least $k'$ strings, any suffix of it is also contained in the same strings). Thus, we can visit each locus corresponding to $\mathit{MS}_{k'}$ in $O(|S| \log \sigma)$ total time and $O(1)$ working space.
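The amortized analysis above rests on two simple facts, which can be sanity-checked by brute force (helper names are ours): matching statistics never drop by more than one from one position to the next, and the "pruned" tree is closed under suffix links, because any string occurring in at least $k'$ strings has all of its suffixes occurring in at least the same strings.

```python
def k_ms(S, W, k):
    # brute-force k'-matching statistics (quadratic; for checking only)
    return [max((l for l in range(len(S) - i + 1)
                 if sum(S[i : i + l] in T for T in W) >= k), default=0)
            for i in range(len(S))]

W = ["aabab", "cabab", "abba"]
S = "ababba"
for k in (1, 2, 3):
    ms = k_ms(S, W, k)
    # (1) losing the first symbol shortens the longest common prefix
    #     by at most one: MS_k'[i+1] >= MS_k'[i] - 1
    assert all(ms[i + 1] >= ms[i] - 1 for i in range(len(ms) - 1))

# (2) suffix closure: every suffix of a substring occurs in (at least)
#     the same strings of W as the substring itself
for w in {S[i:j] for i in range(len(S)) for j in range(i, len(S) + 1)}:
    strs = {T for T in W if w in T}
    for t in range(len(w)):
        assert strs <= {T for T in W if w[t:] in T}
```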
The crux of our algorithm is therefore in the details of Step 3 and Step 5: designing what the candidates are and how to compute them given the locus of $S[i..i+\mathit{MS}_{k'}[i]-1]$. Notice that the solution is the longest string which satisfies $\Pi$ and is a substring of $S[i..i+\mathit{MS}_{k'}[i]-1]$ for some position $i$.
3.1 Square-free Substrings
Ayad et al. [1, 2] gave a solution to the on-line longest common square-free substring query problem for a single string (Problem 1), with $O(|T|)$ time and space preprocessing and $O(|S| \log \sigma)$ time and $O(|S|)$ space queries. We note that their algorithm easily extends to the generalized version (Problem 2). The only difference is that $\mathit{MS}_{k'}$ is computed instead of $\mathit{MS}$, which can be done in the same time and space, as described above. We leave the other details to [1, 2].
As mentioned in the introduction, Ayad et al. [1, 2] also considered longest common periodic substrings, but in the off-line setting. However, their algorithm is not readily extendible to the on-line setting. It relies on the fact that the ending position of a longest common periodic substring must coincide with an ending position of some run in the set of strings, and solves the problem by computing all loci corresponding to runs in the strings of $W$. To utilize this observation in the on-line setting, the loci of all runs in the query string $S$ would have to be identified in $\mathsf{STree}(W)$, which seems difficult to do in time not depending on $N$.
3.2 Squares
Here, we first show how to efficiently solve the problem in the on-line setting for squares, and then we extend that solution to obtain an efficient solution for periodic substrings. Below is an important property of squares which we exploit.
Lemma 3 ([10, Theorem 1])
A given position can be the right-most occurrence of at most two distinct squares.
It follows from the above lemma that the number of distinct squares in a given string is linear in its length. It also gives us the following corollary.
Corollary 1
There can be at most two implicit nodes on an edge of a suffix tree which correspond to squares.
Proof. The right-most occurrence of a square is the maximum position corresponding to the leaves in the subtree rooted at (the locus of) the square. Since implicit nodes that correspond to squares on the same edge share the right-most occurrence, a third such implicit node would contradict Lemma 3. ∎
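Lemma 3 and the bound behind Corollary 1 can be checked by brute force; here "right-most occurrence" refers to the maximal starting position of an occurrence, and the code is our own illustration rather than part of the algorithm:

```python
def distinct_squares(T):
    # all distinct substrings of T of the form uu
    return {T[i:j] for i in range(len(T)) for j in range(i + 2, len(T) + 1)
            if (j - i) % 2 == 0 and T[i:(i + j) // 2] == T[(i + j) // 2:j]}

def rightmost_start(T, w):
    # maximal starting position of an occurrence of w in T
    return max(i for i in range(len(T)) if T.startswith(w, i))

for T in ["aabaabaa", "abaababaabaab", "aaaaaaaa", "mississippi"]:
    starts = {}
    for sq in distinct_squares(T):
        starts.setdefault(rightmost_start(T, sq), []).append(sq)
    # at most two distinct squares share a right-most starting position ...
    assert all(len(v) <= 2 for v in starts.values())
    # ... hence the number of distinct squares is at most 2|T|
    assert len(distinct_squares(T)) <= 2 * len(T)
```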
For squares, we first compute the loci in $\mathsf{STree}(W)$ of all distinct squares that are substrings of strings in $W$, and we make them explicit nodes in $\mathsf{STree}(W)$. Note that these additional nodes will not have suffix links, but since there are only a constant number of them on each edge of the original suffix tree (Corollary 1), they only add a constant factor to the amortized analysis when computing Step 4. The loci of all squares can be computed in $O(N)$ total time using the algorithm of [4]. We further add to each explicit node in $\mathsf{STree}(W)$ (including the ones we introduced above) a pointer to the nearest ancestor that is the locus of a square (see Fig. 1 for an example of the pointers). Notice that a node that is the locus of a square is explicit and points to itself. This can also be done in linear time by a simple depth-first traversal on $\mathsf{STree}(W)$.
The candidate, which we will compute in Step 5 for each locus $v$, is the longest square that is a prefix of $\mathit{str}(v)$. It can be determined for each locus by using the pointers: when $v$ is an explicit node, the pointer of $v$ gives the answer; when $v$ is an implicit node, the pointer of the parent of $v$ gives the answer. The longest such square over all loci is the answer to the query. This is because the longest common square must be a prefix of the string corresponding to the $k'$-matching statistics of some position. Thus, we have a solution as claimed in Theorem 2.1 for $\Pi_{\mathrm{sq}}$.
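For intuition, the following brute-force reference (ours, and quadratic, unlike the pointer-based query above) answers Problem 2 for $\Pi_{\mathrm{sq}}$ directly:

```python
def longest_common_square(S, W, k):
    # longest substring of S that is a square and occurs in >= k strings of W
    best = ""
    for i in range(len(S)):
        for j in range(i + 2, len(S) + 1):
            w = S[i:j]
            if (len(w) % 2 == 0 and w[: len(w) // 2] == w[len(w) // 2 :]
                    and sum(w in T for T in W) >= k and len(w) > len(best)):
                best = w
    return best

W = ["aabab", "cababb", "abb"]
print(longest_common_square("ababa", W, 2))   # abab
```

Here "abab" and "baba" are the only squares in the query string, and only "abab" occurs in at least two of the strings.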
3.3 Periodic Substrings
Next, we extend the solution for squares to periodic substrings as follows.
We first explain the data structure, which is again an augmented $\mathsf{STree}(W)$. For each primitively rooted square substring $uu$, we make the locus of $uu$ in $\mathsf{STree}(W)$ an explicit node. (The non-primitively rooted squares are redundant, since they lead to the same periodic substrings.) Furthermore, we also make explicit the deepest locus $v$ in $\mathsf{STree}(W)$ obtained by periodically extending a primitively rooted square $uu$, i.e., the deepest $v$ such that $uu$ is a prefix of $\mathit{str}(v)$ and the smallest period of $\mathit{str}(v)$ is $|u|$. We add to each explicit node a pointer to the nearest explicit ancestor (including itself) that is a locus of some square or its extension. If an explicit node is an extension of a square, it will also hold information to identify which square it is an extension of (e.g., its period). Note that the pointer of an explicit node that lies between a square and its extension will point to itself. Fig. 2 shows an example of $\mathsf{STree}(T)$ for a single string $T$, where loci corresponding to squares and their rightmost-maximal extensions are depicted.
We first show how the above augmentation of $\mathsf{STree}(W)$ can be executed in $O(N)$ time. We first make explicit all loci of squares as in Section 3.2, which can be done in $O(N)$ time. Then, we start from the locus of each primitively rooted square $uu$ and extend the periodicity of the square towards the leaves of $\mathsf{STree}(W)$ as deep as possible. For each explicit node we encounter during this extension, the pointer will point to itself, and the node will also store the period $|u|$ of the underlying square. The total cost of this extension can be bounded as follows. Due to the periodicity lemma (Lemma 1), any locus of a primitively rooted square or its extension cannot be an extension of a shorter square; if it were, a proper divisor of the period of the longer square would also be a period of that square, and would contradict that the longer square is primitively rooted. Thus, we can naturally discard all non-primitive squares by doing the extensions in increasing order of the lengths of the squares (which can be obtained in linear time by radix sort). During the extension, if the square is non-primitive, then the pointers of the nodes will already have been determined by a square with a shorter period, and this situation can be detected. From the above arguments, any edge of $\mathsf{STree}(W)$ is traversed at most once, and thus the extension can be done by traversing a total of $O(N)$ nodes and edges. Because we know the period of the square we wish to extend, and, for any locus in $\mathsf{STree}(W)$, an occurrence in some string of $W$, we can compute the extension in constant time per edge of $\mathsf{STree}(W)$ by using longest common extension queries. Therefore, the total time for building the augmented $\mathsf{STree}(W)$ is $O(N)$.
Queries can be answered in the same way as for squares, except for a small modification. For any locus $v$, if $v$ is an explicit node, then the pointer of $v$ gives the answer. When $v$ is an implicit node, we check $v$'s parent and its immediate descendant, i.e., the two explicit nodes surrounding $v$ on the edge. If these nodes are both extensions originating from the same square (i.e., their labels have the same period), then the answer is $\mathit{str}(v)$ itself, since it is also an extension of the square. Otherwise, the pointer of the parent node provides the answer.
Thus, we have a solution as claimed in Theorem 2.1 for $\Pi_{\mathrm{per}}$.
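Again as a brute-force cross-check (our own code, not the augmented tree), one can verify both the query semantics for $\Pi_{\mathrm{per}}$ and the structural fact the data structure exploits: every periodic string starts with a square of its own period whose root is primitive, and the tree extends that square along its period:

```python
def smallest_period(w):
    # smallest p such that w[i] == w[i + p] for all valid i
    n = len(w)
    return next(p for p in range(1, n + 1)
                if all(w[i] == w[i + p] for i in range(n - p)))

def is_periodic(w):
    return len(w) > 0 and smallest_period(w) <= len(w) / 2

def longest_common_periodic(S, W, k):
    # longest periodic substring of S occurring in >= k strings of W
    cands = [S[i:j] for i in range(len(S)) for j in range(i + 1, len(S) + 1)
             if is_periodic(S[i:j]) and sum(S[i:j] in T for T in W) >= k]
    return max(cands, key=len, default="")

ans = longest_common_periodic("cababab", ["xabababy", "zababq", "bb"], 2)
p = smallest_period(ans)                  # ans == "abab", p == 2
root = ans[:p]
# the answer starts with a square of its own period ...
assert ans[: 2 * p] == root * 2
# ... and the root of that square is primitive
assert all(root != root[:d] * (p // d) for d in range(1, p) if p % d == 0)
```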
3.4 Palindromes
It is well known that the number of non-empty distinct palindromes in a string of length $n$ is at most $n$, since only the longest palindrome starting at a given position can have its right-most occurrence starting there.
Lemma 4 ([6, Proposition 1])
A given position can be the right-most occurrence of at most one distinct palindrome.
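Both the lemma and the resulting bound are easy to confirm by brute force (our own check; "right-most occurrence" again refers to the maximal starting position):

```python
def distinct_palindromes(T):
    # all distinct non-empty palindromic substrings of T
    return {T[i:j] for i in range(len(T)) for j in range(i + 1, len(T) + 1)
            if T[i:j] == T[i:j][::-1]}

def rightmost_start(T, w):
    # maximal starting position of an occurrence of w in T
    return max(i for i in range(len(T)) if T.startswith(w, i))

for T in ["abacaba", "aabaa", "mississippi"]:
    pals = distinct_palindromes(T)
    # at most n non-empty distinct palindromes ...
    assert len(pals) <= len(T)
    # ... because each position is the right-most starting position
    # of at most one of them
    starts = [rightmost_start(T, w) for w in pals]
    assert len(starts) == len(set(starts))
```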
All distinct palindromes in a string can be computed in linear time [11]. The locus in $\mathsf{STree}(W)$ of each palindrome can then be computed in $O(N)$ total time by Lemma 2. The rest is the same as for squares: we make all loci corresponding to palindromes explicit nodes, and do a linear-time depth-first traversal on $\mathsf{STree}(W)$ to make pointers to the nearest ancestor that is the locus of a palindrome. As in the case of squares, we can bound the number of palindrome loci on an edge of the original suffix tree.
Corollary 2
There can be at most one implicit node on an edge of a suffix tree that corresponds to a palindrome.
Proof. Analogous to the proof of Corollary 1. ∎
The rest of the algorithm and analysis is the same; thus we obtain a solution for $\Pi_{\mathrm{pal}}$ of Theorem 2.1.
3.5 Lyndon Words
For Lyndon words, we use the following result by Kociumaka [15]. A minimal suffix query (MSQ) on a string $T$, given indices $i, j$ such that $1 \le i \le j \le |T|$, determines the lexicographically smallest non-empty suffix of $T[i..j]$.
Lemma 5 ([15, Theorem 17])
For any string $T$ of length $n$, there exists a data structure of size $O(n)$ which can be constructed in $O(n)$ time, and which can answer minimal suffix queries on $T$ in constant time.
Using Lemma 5, we can find the longest Lyndon word ending at a given position.
For any string $w$, the lexicographically smallest non-empty suffix of $w$ is the longest Lyndon word that is a suffix of $w$.
From the definition of Lyndon words, it is clear that the minimal suffix must be a Lyndon word, and that a longer suffix cannot be a Lyndon word. ∎
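This observation is easy to check by brute force (our own sketch; the minimal suffix here plays the role of an MSQ answer):

```python
def is_lyndon(w):
    # strictly smaller than all of its non-empty proper suffixes
    return len(w) > 0 and all(w < w[i:] for i in range(1, len(w)))

def minimal_suffix(w):
    # lexicographically smallest non-empty suffix of w
    return min(w[i:] for i in range(len(w)))

for w in ["banana", "aababb", "dcba", "abaab"]:
    u = minimal_suffix(w)
    # the minimal suffix is a Lyndon word ...
    assert is_lyndon(u)
    # ... and no longer suffix of w is a Lyndon word
    assert all(not is_lyndon(w[i:]) for i in range(len(w) - len(u)))
```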
For Step 3, we process all strings in $W$ using Lemma 5, in $O(N)$ total time and space, so that MSQs can be answered in constant time.
For Step 5, the situation is a bit different from squares, periodic strings, and palindromes, in that we can have multiple candidates for each position $i$ rather than just one. For convenience, let $e_i = i + \mathit{MS}_{k'}[i] - 1$ denote the ending position of the match starting at position $i$, with $e_0 = 0$. For each $1 \le i \le |S|$, suppose we have obtained the locus $v_i$ of $S[i..e_i]$, as well as the previous locus $v_{i-1}$ of $S[i-1..e_{i-1}]$. Notice that $e_{i-1} \le e_i$, and that for all positions $j$ such that $e_{i-1} < j \le e_i$, $S[i..j]$ is the longest substring of $S$ that ends at position $j$ and is a substring of at least $k'$ strings in $W$. As mentioned in Section 2.2, we can obtain an occurrence $(\mathit{id}(v_i), \mathit{occ}(v_i))$ of $\mathit{str}(v_i)$ such that $T_{\mathit{id}(v_i)}[\mathit{occ}(v_i)..\mathit{occ}(v_i)+\mathit{MS}_{k'}[i]-1] = S[i..e_i]$. Then, we use MSQs on the substrings $T_{\mathit{id}(v_i)}[\mathit{occ}(v_i)..\mathit{occ}(v_i)+j-i]$ for all $e_{i-1} < j \le e_i$, which is equivalent to using MSQs on the substrings $S[i..j]$ for all such $j$. The longest Lyndon word obtained over all the queries is therefore the longest Lyndon word that is a substring of $S$ and of at least $k'$ strings in $W$: any such Lyndon word ending at position $j$ is a suffix of $S[i..j]$, and is thus no longer than the minimal suffix of $S[i..j]$, which is itself a common Lyndon word. Since we perform $e_i - e_{i-1}$ MSQs for each position $i$ of $S$, the total number of MSQs is $O(|S|)$, which takes $O(|S|)$ time. Thus, we have a solution as claimed in Theorem 2.1 for $\Pi_{\mathrm{Lyn}}$.
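The following sketch (ours, with 0-based indices and brute-force stand-ins for the suffix tree and MSQ structures) mirrors the query: for each newly covered ending position it takes the minimal suffix of the longest common substring ending there, and agrees with a direct enumeration:

```python
def is_lyndon(w):
    return len(w) > 0 and all(w < w[i:] for i in range(1, len(w)))

def k_ms(S, W, k):
    # brute-force k'-matching statistics
    return [max((l for l in range(len(S) - i + 1)
                 if sum(S[i : i + l] in T for T in W) >= k), default=0)
            for i in range(len(S))]

def longest_common_lyndon(S, W, k):
    # reference answer (length) by direct enumeration
    subs = {S[i:j] for i in range(len(S)) for j in range(i + 1, len(S) + 1)}
    cands = [w for w in subs if is_lyndon(w) and sum(w in T for T in W) >= k]
    return max(map(len, cands), default=0)

def via_minimal_suffixes(S, W, k):
    ms = k_ms(S, W, k)
    best, prev_end = 0, 0
    for i in range(len(S)):
        end = i + ms[i]                     # exclusive end of the match at i
        for j in range(max(prev_end, i) + 1, end + 1):
            # S[i:j] is the longest common substring ending at j; its
            # minimal suffix is the longest common Lyndon word ending at j
            u = min(S[t:j] for t in range(i, j))
            best = max(best, len(u))
        prev_end = max(prev_end, end)
    return best

W = ["aababb", "xaabay", "babba"]
for k in (1, 2, 3):
    assert via_minimal_suffixes("aabab", W, k) == longest_common_lyndon("aabab", W, k)
```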
3.6 Solutions in the Off-line Setting
We note that a solution for the on-line setting gives a solution for the off-line setting, since, for any $k' \le k$, we can query with the string $S = T_1 \$_1 \cdots T_k \$_k$, where each $\$_i$ is again a symbol that doesn't appear elsewhere. Since $|S| = O(N)$, the preprocessing time is $O(N)$, and the query time is $O(N \log \sigma)$.
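As a small illustration of this reduction (with brute-force stand-ins for the on-line structure, and a separator choice that is ours), concatenating the strings with unique separators and issuing a single query yields the off-line answer, here for $\Pi_{\mathrm{pal}}$:

```python
def online_longest_common_palindrome(S, W, k):
    # stand-in for the on-line query: longest palindromic substring of S
    # occurring in >= k strings of W (returns its length)
    cands = [S[i:j] for i in range(len(S)) for j in range(i + 1, len(S) + 1)
             if S[i:j] == S[i:j][::-1] and sum(S[i:j] in T for T in W) >= k]
    return max(map(len, cands), default=0)

def offline_longest_common_palindrome(W, k):
    # query once with T_1 $_1 T_2 $_2 ... T_k $_k; the separators occur
    # nowhere else, so no common substring can cross them
    seps = [chr(0xE000 + i) for i in range(len(W))]   # private-use symbols
    S = "".join(T + s for T, s in zip(W, seps))
    return online_longest_common_palindrome(S, W, k)

W = ["xababay", "zababaq", "abaw"]
assert offline_longest_common_palindrome(W, 2) == 5   # "ababa"
assert offline_longest_common_palindrome(W, 3) == 3   # "aba"
```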
Furthermore, we can remove the $\log \sigma$ factor by processing $\mathsf{STree}(W)$ for level ancestor queries. A level ancestor query, given a node $v$ of a tree and an integer $d$, answers the ancestor of $v$ at (node) depth $d$. It is known that level ancestor queries can be answered in constant time, after linear-time preprocessing of the tree (e.g., [5]). The $\log \sigma$ factor came from determining which child of a branching node we needed to follow when traversing $\mathsf{STree}(W)$ with some suffix of the query string. Since, in this case, we can identify the leaf in $\mathsf{STree}(W)$ that corresponds to the current suffix being traversed (i.e., some suffix of some $T_i$ in $W$), we can use a level ancestor query to determine, in constant time, the child of the branching node that is an ancestor of that leaf, thus getting rid of the $\log \sigma$ factor.
We considered the generalized on-line variant of the longest common property preserved substring problem proposed by Ayad et al. [1, 2], and 1) unified the two problem settings, and 2) proposed algorithms for several properties, namely, squares, periodic substrings, palindromes, and Lyndon words. For all these properties, we can answer queries in $O(|S| \log \sigma)$ time and $O(1)$ working space, with $O(N)$ time and space preprocessing.
This work was supported by JSPS KAKENHI Grant Numbers JP18K18002 (YN), JP17H01697 (SI), JP16H02783 (HB), and JP18H04098 (MT). Tomasz Kociumaka was supported by ISF grants no. 824/17 and 1278/16 and by an ERC grant MPM under the EU’s Horizon 2020 Research and Innovation Programme (grant no. 683064).
-  Ayad, L.A.K., Bernardini, G., Grossi, R., Iliopoulos, C.S., Pisanti, N., Pissis, S.P., Rosone, G.: Longest property-preserved common factor. In: Gagie, T., Moffat, A., Navarro, G., Cuadros-Vargas, E. (eds.) String Processing and Information Retrieval - 25th International Symposium, SPIRE 2018, Lima, Peru, October 9-11, 2018, Proceedings. Lecture Notes in Computer Science, vol. 11147, pp. 42–49. Springer (2018). https://doi.org/10.1007/978-3-030-00479-8_4, https://doi.org/10.1007/978-3-030-00479-8_4
-  Ayad, L.A.K., Bernardini, G., Grossi, R., Iliopoulos, C.S., Pisanti, N., Pissis, S.P., Rosone, G.: Longest property-preserved common factor. CoRR abs/1810.02099 (2018), http://arxiv.org/abs/1810.02099
-  Bae, S.W., Lee, I.: On finding a longest common palindromic subsequence. Theoretical Computer Science 710, 29–34 (2018)
-  Bannai, H., Inenaga, S., Köppl, D.: Computing all distinct squares in linear time for integer alphabets. In: Kärkkäinen, J., Radoszewski, J., Rytter, W. (eds.) 28th Annual Symposium on Combinatorial Pattern Matching, CPM 2017, July 4-6, 2017, Warsaw, Poland. LIPIcs, vol. 78, pp. 22:1–22:18. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik (2017). https://doi.org/10.4230/LIPIcs.CPM.2017.22, https://doi.org/10.4230/LIPIcs.CPM.2017.22
-  Bender, M.A., Farach-Colton, M.: The level ancestor problem simplified. Theor. Comput. Sci. 321(1), 5–12 (2004). https://doi.org/10.1016/j.tcs.2003.05.002, https://doi.org/10.1016/j.tcs.2003.05.002
-  Droubay, X., Justin, J., Pirillo, G.: Episturmian words and some constructions of de Luca and Rauzy. Theor. Comput. Sci. 255(1-2), 539–553 (2001)
-  Farach, M.: Optimal suffix tree construction with large alphabets. In: 38th Annual Symposium on Foundations of Computer Science, FOCS ’97, Miami Beach, Florida, USA, October 19-22, 1997. pp. 137–143. IEEE Computer Society (1997). https://doi.org/10.1109/SFCS.1997.646102, https://doi.org/10.1109/SFCS.1997.646102
-  Fine, N.J., Wilf, H.S.: Uniqueness theorems for periodic functions. Proceedings of the American Mathematical Society 16(1), 109–114 (1965), http://www.jstor.org/stable/2034009
-  Fischer, J., Heun, V.: Theoretical and practical improvements on the rmq-problem, with applications to LCA and LCE. In: Lewenstein, M., Valiente, G. (eds.) Combinatorial Pattern Matching, 17th Annual Symposium, CPM 2006, Barcelona, Spain, July 5-7, 2006, Proceedings. Lecture Notes in Computer Science, vol. 4009, pp. 36–48. Springer (2006). https://doi.org/10.1007/11780441_5, https://doi.org/10.1007/11780441_5
-  Fraenkel, A.S., Simpson, J.: How many squares can a string contain? J. Comb. Theory, Ser. A 82(1), 112–120 (1998). https://doi.org/10.1006/jcta.1997.2843, https://doi.org/10.1006/jcta.1997.2843
-  Groult, R., Prieur, É., Richomme, G.: Counting distinct palindromes in a word in linear time. Information Processing Letters 110(20), 908–912 (2010)
-  Gusfield, D.: Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press (1997)
-  Hui, L.C.K.: Color set size problem with application to string matching. In: Apostolico, A., Crochemore, M., Galil, Z., Manber, U. (eds.) Combinatorial Pattern Matching, Third Annual Symposium, CPM 92, Tucson, Arizona, USA, April 29 - May 1, 1992, Proceedings. Lecture Notes in Computer Science, vol. 644, pp. 230–243. Springer (1992). https://doi.org/10.1007/3-540-56024-6_19, https://doi.org/10.1007/3-540-56024-6_19
-  Inoue, T., Inenaga, S., Hyyrö, H., Bannai, H., Takeda, M.: Computing longest common square subsequences. In: Navarro, G., Sankoff, D., Zhu, B. (eds.) Annual Symposium on Combinatorial Pattern Matching, CPM 2018, July 2-4, 2018 - Qingdao, China. LIPIcs, vol. 105, pp. 15:1–15:13. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik (2018). https://doi.org/10.4230/LIPIcs.CPM.2018.15, https://doi.org/10.4230/LIPIcs.CPM.2018.15
-  Kociumaka, T.: Minimal suffix and rotation of a substring in optimal time. In: Grossi, R., Lewenstein, M. (eds.) 27th Annual Symposium on Combinatorial Pattern Matching, CPM 2016, June 27-29, 2016, Tel Aviv, Israel. LIPIcs, vol. 54, pp. 28:1–28:12. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik (2016). https://doi.org/10.4230/LIPIcs.CPM.2016.28, https://doi.org/10.4230/LIPIcs.CPM.2016.28
-  Kociumaka, T., Kubica, M., Radoszewski, J., Rytter, W., Walen, T.: A linear time algorithm for seeds computation. CoRR abs/1107.2422 (2011), http://arxiv.org/abs/1107.2422v2
-  Lyndon, R.C.: On Burnside’s problem. Transactions of the American Mathematical Society 77(2), 202–215 (1954)
-  Ukkonen, E.: On-line construction of suffix trees. Algorithmica 14(3), 249–260 (1995). https://doi.org/10.1007/BF01206331, https://doi.org/10.1007/BF01206331
-  Weiner, P.: Linear pattern matching algorithms. In: 14th Annual Symposium on Switching and Automata Theory, Iowa City, Iowa, USA, October 15-17, 1973. pp. 1–11. IEEE Computer Society (1973). https://doi.org/10.1109/SWAT.1973.13, https://doi.org/10.1109/SWAT.1973.13