On Longest Common Property Preserved Substring Queries

06/13/2019
by   Kazuki Kai, et al.
Kyushu University
University of Warsaw

We revisit the problem of longest common property preserving substring queries introduced by Ayad et al. (SPIRE 2018, arXiv 2018). We consider a generalized and unified on-line setting, where we are given a set X of k strings of total length n that can be pre-processed so that, given a query string y and a positive integer k'≤ k, we can determine the longest substring of y that satisfies some specific property and is common to at least k' strings in X. Ayad et al. considered the longest square-free substring in an on-line setting and the longest periodic and palindromic substring in an off-line setting. In this paper, we give efficient solutions in the on-line setting for finding the longest common square, periodic, palindromic, and Lyndon substrings. More precisely, we show that X can be pre-processed in O(n) time resulting in a data structure of O(n) size that answers queries in O(|y|σ) time and O(1) working space, where σ is the size of the alphabet, and the common substring must be a square, a periodic substring, a palindrome, or a Lyndon word.

1 Introduction

The longest common substring of two strings x and y is a longest string that is a substring of both x and y. It is well known that the problem can be solved in linear time, using the generalized suffix tree of x and y [19, 12].
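
As a concrete illustration of the base problem (not of the linear-time suffix-tree solution cited above), the following is a small brute-force Python sketch; the function name and the O(|x||y|) dynamic programming are ours.

    def longest_common_substring(x: str, y: str) -> str:
        """Return a longest string that is a substring of both x and y.

        Brute-force O(|x|*|y|) dynamic programming; the suffix-tree solution
        referenced in the text achieves linear time."""
        best_len, best_end = 0, 0
        prev = [0] * (len(y) + 1)   # prev[j] = longest common suffix of x[:i-1] and y[:j]
        for i in range(1, len(x) + 1):
            cur = [0] * (len(y) + 1)
            for j in range(1, len(y) + 1):
                if x[i - 1] == y[j - 1]:
                    cur[j] = prev[j - 1] + 1
                    if cur[j] > best_len:
                        best_len, best_end = cur[j], j
            prev = cur
        return y[best_end - best_len:best_end]

    assert longest_common_substring("banana", "ananas") == "anana"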

Ayad et al. [1, 2] proposed a class of problems called longest common property preserved substring, where the aim is to find the longest substring that has some property and is common to a subset of the input strings. They considered several problems in two different settings.

In the first, on-line, setting, one is given a string x that can be pre-processed, and the problem is to answer, for any given query string y, the longest square-free substring that is common to both x and y. Their solution takes O(|x|) time for preprocessing and O(|y|σ) time for queries, where σ is the alphabet size.

In the second, off-line, setting, one is given a set X of k strings of total length n and a positive integer k' ≤ k, and the problem is to find the longest periodic substring, as well as the longest palindromic substring, that is common to at least k' of the strings. Their solution works in O(n) time and space. However, it does not (at least directly) give a solution for the on-line setting.

In this paper, we consider a generalized and unified on-line setting, where we are given a set X of k strings with total length n that can be pre-processed, and the problem is to answer, for any given query string y and positive integer k' ≤ k, the longest property preserved substring that is common to y and at least k' of the strings. We give solutions for the following properties in this setting, all working with O(n) time and space preprocessing, and O(|y|σ) time and O(1) working space for answering queries: squares, periodic substrings, palindromes, and Lyndon words. Furthermore, we note that solutions for the off-line setting can be obtained by using our solutions for the on-line setting. We also note that our algorithms can be modified to remove the σ factor in the off-line setting.

As related work, the off-line version of property preserved subsequences has been considered for some properties: the longest common square subsequence of two strings can be computed in polynomial time [14], and so can the longest common palindromic subsequence [3].

2 Preliminaries

2.1 Strings

Let Σ be a set of symbols, or alphabet, and Σ* the set of strings over Σ. We assume a constant or linearly-sortable integer alphabet (note that a string on a general ordered alphabet can be transformed into a string on an integer alphabet in O(n log σ) time) and use σ to denote the size of the alphabet, i.e., σ = |Σ|. For any string w, let |w| denote its length. For any integer 1 ≤ i ≤ |w|, w[i] is the ith symbol of w, and for any integers 1 ≤ i ≤ j ≤ |w|, w[i..j] = w[i] ⋯ w[j]. For convenience, w[i..j] is the empty string when i > j. If a string w satisfies w = xyz, then the strings x, y, z are respectively called a prefix, substring, and suffix of w. A prefix (resp. substring, suffix) is called a proper prefix (resp. substring, suffix) if it is shorter than the string.

Let w^R denote the reverse of w, i.e., w^R = w[|w|] ⋯ w[1]. A string w is said to be a palindrome if w = w^R. A string w is a square if w = uu for some string u, called the root of w. A string w is primitive if there does not exist any string u such that w = u^j for some integer j ≥ 2. A square is called a primitively rooted square if its root is primitive. An integer p, 1 ≤ p ≤ |w|, is called a period of a string w if w[i] = w[i+p] for all 1 ≤ i ≤ |w| − p. A string w is periodic if the smallest period of w is at most |w|/2. A run in a string is a maximal periodic substring, i.e., a periodic substring w[i..j] with smallest period p is a run if, for any substring w[i'..j'] with i' ≤ i, j ≤ j', and either i' < i or j < j', the smallest period of w[i'..j'] is not p. The following is the well-known (weak) periodicity lemma concerning periods:

Lemma 1 ((Weak) Periodicity Lemma [8])

If p and q are two periods of a string w, and p + q ≤ |w|, then gcd(p, q) is also a period of w.
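
The lemma can be checked exhaustively on small examples; the following sketch (ours, purely illustrative) verifies it over all short binary strings.

    from itertools import product
    from math import gcd

    def periods(w: str) -> set:
        """All periods p, 1 <= p <= |w|, of w (i.e., w[i] == w[i+p] wherever both exist)."""
        return {p for p in range(1, len(w) + 1)
                if all(w[i] == w[i + p] for i in range(len(w) - p))}

    # If p and q are periods of w and p + q <= |w|, then gcd(p, q) is also a period.
    for n in range(1, 12):
        for w in map("".join, product("ab", repeat=n)):
            ps = periods(w)
            for p in ps:
                for q in ps:
                    if p + q <= n:
                        assert gcd(p, q) in ps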

Let ≺ denote a total order on Σ, as well as the lexicographic order on Σ* it induces. That is, for any two strings x, y ∈ Σ*, x ≺ y if and only if either x is a proper prefix of y, or there exist strings u, v, v' ∈ Σ* and symbols a, b ∈ Σ such that x = uav, y = ubv', and a ≺ b. A string w is a Lyndon word [17] if it is lexicographically smaller than any of its non-empty proper suffixes.

2.2 Suffix Trees

The suffix tree STree(w) [19] of a string w is a compacted trie of the set of non-empty suffixes of w$, where $ denotes a unique symbol that does not occur in w. More precisely, it is 1) a rooted tree whose edges are labeled by non-empty strings, 2) the concatenations of edge labels on root-to-leaf paths correspond to all and only the suffixes of w$, 3) any non-leaf node has at least two children, and the first letters of the labels on the edges to its children are distinct.

A node in STree(w) is called an explicit node, while a position on an edge corresponding to a proper prefix of the edge label is called an implicit node. For a (possibly implicit) node v in STree(w), let str(v) denote the string obtained by concatenating the edge labels on the path from the root to v. The locus of a string u in STree(w) is the (possibly implicit) node v in STree(w) such that str(v) = u. Each explicit node v of the suffix tree can be augmented with a suffix link, which points to the node u such that str(u) = str(v)[2..|str(v)|]. It is easy to see that because v is an explicit node, u is also always an explicit node.

It is well known that the suffix tree (and suffix links) can be represented in O(|w|) space and constructed in O(|w| log σ) time [18] for general ordered alphabets, or in O(|w|) time for constant [19] or linearly-sortable integer alphabets [7]. The suffix tree can also be defined for a set X = {x_1, …, x_k} of strings, and again can be constructed in O(n log σ) time for general ordered alphabets or in O(n) time for constant or linearly-sortable integer alphabets, where n is the total length of the strings. This is done by considering and building the suffix tree STree(X) for the string x_1 $_1 x_2 $_2 ⋯ x_k $_k, where each $_i is a distinct symbol not occurring in X, and pruning the edges below any $_i. It is also easy to process STree(X) to compute, for each explicit node v, the length |str(v)|, as well as an occurrence of str(v) in some x_i ∈ X. Also, these values can be computed in constant time for any implicit node, given the values for the closest descendant explicit node.
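
As a toy stand-in for STree(X) (ours; an uncompacted suffix trie rather than the O(n)-size compacted suffix tree described above), the following sketch also records, for each node v, which strings of X contain str(v), which is the kind of per-node information the algorithms of Section 3 attach to explicit nodes.

    class TrieNode:
        __slots__ = ("children", "strings")
        def __init__(self):
            self.children = {}    # symbol -> TrieNode
            self.strings = set()  # indices i such that x_i contains str(v)

    def suffix_trie(X):
        """Uncompacted suffix trie of a set of strings X (toy stand-in for STree(X)).

        Quadratic size in the worst case; the compacted suffix tree used in the
        paper takes O(n) space and O(n) construction time."""
        root = TrieNode()
        for i, x in enumerate(X):
            for start in range(len(x)):
                node = root
                for c in x[start:]:
                    node = node.children.setdefault(c, TrieNode())
                    node.strings.add(i)
        return root

    def count_containing(root, u: str) -> int:
        """Number of strings in X that contain u as a substring."""
        node = root
        for c in u:
            if c not in node.children:
                return 0
            node = node.children[c]
        return len(node.strings)

    X = ["abaab", "aabba", "babab"]
    root = suffix_trie(X)
    assert count_containing(root, "ab") == 3 and count_containing(root, "aba") == 2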

We will later use the following lemma to efficiently add explicit nodes corresponding to the loci of a set of substrings that are of interest.

Lemma 2 ([16, Corollary 7.3])

Given a collection of substrings of a string w of length n, each represented by an occurrence in w, we can compute the locus of each substring in the suffix tree of w in O(n + q) total time, where q is the number of substrings in the collection. Moreover, these loci can be made explicit in O(n + q) extra time.

For a string w of length n, a longest common extension query, given positions 1 ≤ i, j ≤ n, asks for the length of the longest common prefix of w[i..n] and w[j..n]. It is known that the string can be pre-processed in O(n) time so that any longest common extension query can be answered in O(1) time (e.g., [9]).
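
A naive version of the longest common extension query (ours) that follows the definition directly; it spends O(n) time per query, whereas after the linear-time preprocessing of [9] each query can be answered in constant time.

    def lce(w: str, i: int, j: int) -> int:
        """Length of the longest common prefix of w[i..] and w[j..] (0-based i, j).

        Naive O(n)-per-query version of the longest common extension query."""
        ell = 0
        while i + ell < len(w) and j + ell < len(w) and w[i + ell] == w[j + ell]:
            ell += 1
        return ell

    assert lce("abaababa", 0, 3) == 3  # "abaababa" vs "ababa": common prefix "aba"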

2.3 Matching Statistics

For two strings x and y, the matching statistics of y with respect to x is an array MS_y of length |y|, where

MS_y[j] = max{ ℓ ≥ 0 : y[j..j+ℓ−1] is a substring of x }

for any 1 ≤ j ≤ |y|. That is, for each position j of y, MS_y[j] is the length of the longest prefix of y[j..|y|] that occurs in x. The concept of matching statistics can be generalized to a set of strings. For a set X = {x_1, …, x_k} of strings and a string y, the k'-matching statistics of y with respect to X is an array MS_y^{k'} of length |y|, where

MS_y^{k'}[j] = max{ ℓ ≥ 0 : y[j..j+ℓ−1] is a substring of at least k' strings in X }.

That is, for each position j of y, MS_y^{k'}[j] is the length of the longest prefix of y[j..|y|] that occurs in at least k' of the strings in X.
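
The following brute-force sketch (ours) computes the k'-matching statistics directly from the definition; Section 3 computes the same array in O(|y|σ) time using the (pruned) generalized suffix tree.

    def k_matching_statistics(X, y: str, k_prime: int):
        """MS[j] = length of the longest prefix of y[j:] that is a substring of
        at least k_prime strings in X (0-based positions). Brute force."""
        def occ(u):
            return sum(u in x for x in X)
        MS = []
        for j in range(len(y)):
            ell = 0
            while j + ell < len(y) and occ(y[j:j + ell + 1]) >= k_prime:
                ell += 1
            MS.append(ell)
        return MS

    X = ["abaab", "aabba", "babab"]
    assert k_matching_statistics(X, "ababba", 2) == [3, 2, 2, 1, 2, 1]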

2.4 Longest Common Property Preserved Substring Queries

Let a function Π : Σ* → {true, false} be called a property function. In this paper, we will consider the property functions Π_sqfree, Π_sq, Π_per, Π_pal, and Π_Lyn, which return true if and only if a string is respectively square-free, a square, a periodic string, a palindrome, or a Lyndon word.
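
For concreteness, here are direct, intentionally naive Python predicates for these five properties (names are ours; the Lyndon test assumes the usual lexicographic order on characters).

    def is_square(w: str) -> bool:
        """w = uu for some non-empty u."""
        half = len(w) // 2
        return len(w) > 0 and len(w) % 2 == 0 and w[:half] == w[half:]

    def is_square_free(w: str) -> bool:
        """No non-empty substring of w is a square."""
        return not any(is_square(w[i:j])
                       for i in range(len(w)) for j in range(i + 2, len(w) + 1))

    def smallest_period(w: str) -> int:
        """Smallest p >= 1 with w[i] == w[i+p] for all valid i."""
        return next(p for p in range(1, len(w) + 1)
                    if all(w[i] == w[i + p] for i in range(len(w) - p)))

    def is_periodic(w: str) -> bool:
        """The smallest period of w is at most |w|/2."""
        return len(w) > 0 and smallest_period(w) <= len(w) / 2

    def is_palindrome(w: str) -> bool:
        return w == w[::-1]

    def is_lyndon(w: str) -> bool:
        """w is lexicographically smaller than all of its non-empty proper suffixes."""
        return len(w) > 0 and all(w < w[i:] for i in range(1, len(w)))

    assert is_square("abab") and not is_square("aba")
    assert is_periodic("abab") and not is_periodic("abc")
    assert is_lyndon("aab") and not is_lyndon("aba")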

The following is the on-line version of the problem considered in [1], where a solution was given for the longest common square-free substring, i.e., Π = Π_sqfree.

Problem 1 (Longest common property preserved substring query)

Let Π be a property function. Consider a string x which can be pre-processed. For a query string y, compute the longest string that is a substring of both x and y, and also satisfies Π.

The following is the generalized version of the on-line setting that we consider in this paper.

Problem 2 (Generalized longest common property preserved substring query)

Let Π be a property function. Consider a set X = {x_1, …, x_k} of strings that can be pre-processed. For a query string y and positive integer k' ≤ k, compute the longest substring of y that is a substring of at least k' strings in X, and also satisfies Π.
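
For reference, a brute-force solver for Problem 2 (ours, and far from the bounds claimed in Theorem 2.1 below) that tries every substring of y and takes the property as a predicate:

    def longest_common_property_substring(X, y: str, k_prime: int, prop) -> str:
        """Longest substring of y that satisfies prop and is a substring of at
        least k_prime strings in X; returns '' if none exists. Brute force."""
        best = ""
        for i in range(len(y)):
            for j in range(i + 1, len(y) + 1):
                u = y[i:j]
                if len(u) > len(best) and prop(u) and sum(u in x for x in X) >= k_prime:
                    best = u
        return best

    X = ["abaab", "aabba", "babab"]
    # Longest palindrome that is a substring of y and of at least 2 strings in X:
    assert longest_common_property_substring(X, "ababba", 2, lambda u: u == u[::-1]) == "aba"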

Below is a summary of our results. Here, the working space of the query is the amount of memory that is required in excess of the data structure constructed in the pre-processing. All memory other than the working space can be considered as read-only.

Theorem 2.1

For any of the property functions Π defined in Section 2.4, Problem 2 can be answered in O(|y|σ) time and O(1) working space by constructing an O(n)-size data structure in O(n) time, where n is the total length of the strings in X.

We further note that our algorithms do not require random access to y during a query and thus work in the same time/space bounds even if each symbol of y is given in a streaming fashion.

The proof of the Theorem for each property function is given in the next section.

3 Algorithms

In this section we present our algorithms. The outline of our solutions is as follows.

The preprocessing consists of the following steps:

  1. Construct the generalized suffix tree STree(X) of X.

  2. For each explicit node v of STree(X), compute the number of strings in X that contain str(v) as a substring.

  3. Process STree(X) and construct a data structure so that, given any locus in STree(X), we can efficiently find a candidate for the solution.

Then, queries are answered as follows:

  4. For each position j in y, compute MS_y^{k'}[j], i.e., the k'-matching statistics of y with respect to X, as the locus v_j of y[j..j+MS_y^{k'}[j]−1] in STree(X).

  5. For each such locus v_j, compute a candidate using the data structure constructed in Step 3 of the pre-processing.

  6. Output the longest string computed in the previous step (a naive instantiation of this outline is sketched below).
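
The following sketch (ours) instantiates the outline with every step done naively: the k'-matching statistics are computed by brute force, and the candidate at each position j is the longest prefix of y[j..j+MS_y^{k'}[j]−1] satisfying the property, found by scanning prefixes rather than via the augmented suffix trees developed below (prop is a predicate standing in for the property function).

    def query(X, y: str, k_prime: int, prop) -> str:
        """Naive instantiation of the query outline (Steps 4-6)."""
        def occ(u):
            return sum(u in x for x in X)
        best = ""
        for j in range(len(y)):
            # Step 4 (naively): MS[j] = longest prefix of y[j:] occurring in >= k' strings of X.
            ms = 0
            while j + ms < len(y) and occ(y[j:j + ms + 1]) >= k_prime:
                ms += 1
            # Step 5 (naively): longest prefix of y[j:j+ms] satisfying prop
            # (only lengths beating the current best need to be checked).
            for ell in range(ms, len(best), -1):
                if prop(y[j:j + ell]):
                    best = y[j:j + ell]
                    break
        # Step 6: report the longest candidate over all positions.
        return best

    X = ["abaab", "aabba", "babab"]
    assert query(X, "ababba", 2, lambda u: u == u[::-1]) == "aba"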

As mentioned in Section 2.2, Step 1 can be performed in O(n) time and space. The task of Step 2 is known as the color set size problem, and can also be executed in O(n) time [13].

Using STree(X), Step 4, i.e., computing each locus v_j of the substring of y that corresponds to each element of the k'-matching statistics of y with respect to X (i.e., y[j..j+MS_y^{k'}[j]−1]), can be done in O(|y|σ) total time and O(1) working space, with a minor modification to the algorithm for computing the matching statistics between single strings [12, Theorem 7.8.1]. The algorithm for a single string x first traverses the suffix tree with y as long as possible to compute the locus corresponding to the longest prefix of y that occurs in x. Let this prefix be y[1..MS_y[1]]. Given the locus of y[j..j+MS_y[j]−1] for some j, the suffix link of its closest ancestor explicit node is used in order to first efficiently find the locus of y[j+1..j+MS_y[j]−1]. Then, the suffix tree is further traversed to obtain the locus of y[j+1..j+MS_y[j+1]]. The time bound for the traversal follows from a well-known amortized analysis which considers the depth of the traversed nodes, similar to that in the on-line construction of suffix trees [18]. For computing MS_y^{k'}, we can simply imagine that subtrees below edges leading to a node v of STree(X) are pruned if str(v) is not contained in at least k' strings of X. This can be done by aborting the traversal of STree(X) when we encounter such an edge, by using the information obtained in Step 2. It is easy to see that the algorithm still works in the same time bound, since the suffix link of any remaining node still points to a node that is not pruned (i.e., if a string is contained in at least k' strings, its suffixes are also contained in the same strings). Thus, we can visit each locus v_j corresponding to MS_y^{k'}[j] in O(|y|σ) total time and O(1) working space.

The crux of our algorithm is therefore in the details of Step 3 and Step 5: designing what the candidates are and how to compute them given v_j. Notice that the solution is the longest string which satisfies Π and is a substring of y[j..j+MS_y^{k'}[j]−1] for some j.

3.1 Square-free substrings

Ayad et al. [1, 2] gave a solution to the on-line longest common square-free substring query problem for a single string (Problem 1), with O(|x|) time and space preprocessing and O(|y|σ) time and O(1) working space per query. We note that their algorithm easily extends to the generalized version (Problem 2). The only difference lies in that MS_y^{k'} is computed instead of MS_y, which can be done in O(|y|σ) time and O(1) working space, as described above. We leave the other details to [1, 2].

3.2 Squares

As mentioned in the introduction, Ayad et al. [1, 2] also considered longest common periodic substrings, but in the off-line setting. However, their algorithm is not readily extendible to the on-line setting. It relies on the fact that the ending position of a longest common periodic substring must coincide with an ending position of some run in the set of strings, and solves the problem by computing all loci corresponding to runs in X. To utilize this observation in the on-line setting, the loci of all runs of the query string y would have to be identified in STree(X), which seems difficult to do in time not depending on n.

Here, we first show how to efficiently solve the problem in the on-line setting for squares, and then we extend that solution to obtain an efficient solution for periodic substrings. Below is an important property of squares which we exploit.

Lemma 3 ([10, Theorem 1])

A given position can be the right-most occurrence of at most two distinct squares.

It follows from the above lemma that the number of distinct squares in a given string is linear in its length [10]. Also, it gives us the following Corollary.

Corollary 1

There can be at most two implicit nodes on an edge of a suffix tree that correspond to squares.

Proof

The right-most occurrence of a square is given by the maximum position among those corresponding to the leaves in the subtree rooted at its locus. Since implicit nodes that correspond to squares on the same edge share this right-most occurrence, a third such implicit node would contradict Lemma 3. ∎
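
The count implied by Lemma 3 (each position is the right-most occurrence of at most two distinct squares, hence fewer than 2n distinct squares in a string of length n) can be checked on small examples with a naive enumeration (ours):

    from itertools import product

    def distinct_squares(w: str) -> set:
        """Set of distinct square substrings uu of w (naive enumeration)."""
        return {w[i:i + 2 * l]
                for l in range(1, len(w) // 2 + 1)
                for i in range(len(w) - 2 * l + 1)
                if w[i:i + l] == w[i + l:i + 2 * l]}

    # Fewer than 2|w| distinct squares, over all short binary strings:
    for n in range(1, 13):
        for w in map("".join, product("ab", repeat=n)):
            assert len(distinct_squares(w)) < 2 * n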

For squares, we first compute the loci in STree(X) of all distinct squares that are substrings of strings in X, and we make them explicit nodes in STree(X). Note that these additional nodes will not have suffix links, but since there are only a constant number of them on each edge of the original suffix tree, they will only add a constant factor to the amortized analysis of Step 4. The loci of all squares can be computed in O(n) total time using the algorithm of [4]. We further add to each explicit node in STree(X) (including the ones we introduced above) a pointer to the nearest ancestor that is the locus of a square (see Fig. 1 for an example of the pointers). Notice that a node that is the locus of a square is explicit and points to itself. This can also be done in linear time by a simple depth-first traversal of STree(X).

Figure 1: Example of STree(x) for a single string x, augmented with nodes and pointers for Π_sq. The solid dark circles show the loci corresponding to squares, which are made explicit. The dotted arrows show the pointers, which point to the nearest ancestor that is a square. Pointers which point to the root (i.e., when no non-empty prefix is a square) are omitted.

The candidate, which we will compute in Step 5 for each locus v_j, is the longest square that is a prefix of str(v_j). It can be determined for each locus by using the pointers. When v_j is an explicit node, the pointer of v_j gives the answer. When v_j is an implicit node, the pointer of the parent of v_j gives the answer. The longest such square over all loci is the answer to the query. This is because the longest common square must be a prefix of the string corresponding to the k'-matching statistics at some position. Thus, we have a solution as claimed in Theorem 2.1 for Π_sq.

3.3 Periodic Substrings

Next, we extend the solution for squares to periodic substrings as follows.

We first explain the data structure, which is again an augmented STree(X). For each primitively rooted square substring uu, we make the locus of uu in STree(X) an explicit node. (The non-primitively rooted squares are redundant, since they lead to the same periodic substrings.) Furthermore, we also make explicit each deepest locus v in STree(X) obtained by periodically extending a primitively rooted square uu, i.e., uu is a prefix of str(v) and the smallest period of str(v) is |u|. We add to each explicit node a pointer to the nearest explicit ancestor (including itself) that is the locus of some square or its extension. If an explicit node is an extension of a square, it also holds information to identify which square it is an extension of (e.g., its period). Note that the pointer of an explicit node that lies between a square and its extension will point to itself. Fig. 2 shows an example of STree(x) for a single string x, where the loci corresponding to squares and their maximal extensions are depicted.

Figure 2: Example of STree(x) for a single string x, augmented with nodes and pointers for Π_per. The solid dark circles show the loci corresponding to squares, which are made explicit, and the grey circles show the loci corresponding to their extensions, where the ones with a solid border are made explicit. The dotted arrows show the pointers, which point to the nearest ancestor (longest prefix) that is periodic. The pointers which point to the root are omitted. The total number of solid dark circles, as well as of grey circles with a solid border, is O(|x|). The total number of implicit grey circles without a solid border is not necessarily O(|x|), but since they occur consecutively below a solid node, they can be represented in O(|x|) space.

We first show how the above augmentation of STree(X) can be executed in O(n) time. We first make explicit all loci of squares as in Section 3.2, which can be done in O(n) time. Then, we start from the locus of each primitively rooted square and extend the periodicity of the square towards the leaves of STree(X) as deep as possible. For each explicit node we encounter during this extension, the pointer will point to itself, and the node will also store the period of the underlying square. The total cost of this extension can be bounded as follows. Due to the periodicity lemma (Lemma 1), any locus of a primitively rooted square or its extension cannot be an extension of a shorter square; if it were, a (proper) divisor of the period of the longer square would also be a period of that square, which would contradict that the longer square is primitively rooted. Thus, we can naturally discard all non-primitive squares by doing the extensions in increasing order of the lengths of the squares (which can be obtained in linear time by radix sort). During the extension, if the square is non-primitive, then the pointers of the nodes will already have been determined by a square with a shorter period, and this situation can be detected. From the above arguments, any edge of STree(X) is traversed at most once, and thus the extension can be done by traversing a total of O(n) nodes and edges. Because we know the period of the square we wish to extend, and, for any locus in STree(X), an occurrence in some x_i ∈ X, we can compute the extension in O(1) time per edge of STree(X) by using longest common extension queries. Therefore, the total time for building the augmented STree(X) is O(n).
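
The per-branch extension step can be pictured on a plain string: given an occurrence of a primitively rooted square with root length p, extend it to the right for as long as period p is preserved. The sketch below (ours) does this by single-character comparisons; in the construction above, one longest common extension query replaces the comparisons along a whole suffix-tree edge.

    def extend_square(w: str, i: int, p: int) -> str:
        """Given that w[i:i+2*p] is a square with root length p, return the longest
        substring of w starting at i with period p (its periodic extension).

        Assumes the root w[i:i+p] is primitive, so that p is the smallest period."""
        assert w[i:i + p] == w[i + p:i + 2 * p]
        j = i + 2 * p
        while j < len(w) and w[j] == w[j - p]:
            j += 1
        return w[i:j]

    # The square "abab" (root "ab", p = 2) at position 0 of "ababababc"
    # extends periodically to "abababab".
    assert extend_square("ababababc", 0, 2) == "abababab"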

Queries can be answered in the same way as for squares, except for a small modification. For any locus v_j, if v_j is an explicit node, then the pointer of v_j gives the answer. When v_j is an implicit node, we check the parent of v_j (its closest explicit ancestor) and its immediate explicit descendant. If these nodes are both extensions originating from the same square (i.e., their labels have the same period), then the answer is str(v_j) itself, since it is also an extension of that square. Otherwise, the pointer of the parent node provides the answer.

Thus, we have a solution as claimed in Theorem 2.1 for Π_per.

3.4 Palindromes

It is well known that the number of non-empty distinct palindromes in a string of length n is at most n, since only the longest palindrome starting at some position can be the right-most occurrence of that palindrome in the string.
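
This bound is easy to confirm on small examples with a naive enumeration (ours):

    from itertools import product

    def distinct_palindromes(w: str) -> set:
        """Set of distinct non-empty palindromic substrings of w (naive)."""
        return {w[i:j] for i in range(len(w)) for j in range(i + 1, len(w) + 1)
                if w[i:j] == w[i:j][::-1]}

    # At most n distinct non-empty palindromes in a string of length n:
    for n in range(1, 13):
        for w in map("".join, product("ab", repeat=n)):
            assert len(distinct_palindromes(w)) <= n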

Lemma 4 ([6, Proposition 1])

A given position can be the right-most occurrence of at most one distinct palindrome.

All distinct palindromes in a string can be computed in linear time [11]. The locus of each palindrome in STree(X) can then be computed in O(n) time by Lemma 2. The rest is the same as for squares; we make all loci corresponding to palindromes explicit nodes, and do a linear-time depth-first traversal of STree(X) to make pointers to the nearest ancestor that is a palindrome. As in the case of squares, we can bound the number of palindromes which can lie on an edge of the original suffix tree.

Corollary 2

There can be at most one implicit node on an edge of a suffix tree that corresponds to a palindrome.

Proof

Analogous to the proof of Corollary 1. ∎

The rest of the algorithm and analysis is the same, and thus we obtain a solution for Π_pal of Theorem 2.1.

3.5 Lyndon Words

For Lyndon words, we use the following result by Kociumaka [15]. A minimal suffix query (MSQ) on a string w, given indices i, j such that 1 ≤ i ≤ j ≤ |w|, determines the lexicographically smallest suffix of w[i..j].

Lemma 5 ([15, Theorem 17])

For any string w of length n, there exists a data structure of O(n) size which can be constructed in O(n) time and which can answer minimal suffix queries on w in constant time.

Using Lemma 5, we can find the longest Lyndon word ending at a given position.

Lemma 6

For any string w, the lexicographically smallest suffix of w is the longest Lyndon word that is a suffix of w.

Proof

From the definition of Lyndon words, it is clear that the minimal suffix must be a Lyndon word (each of its proper non-empty suffixes is also a suffix of w and is therefore lexicographically larger), and that a longer suffix cannot be a Lyndon word (the minimal suffix would be a proper suffix of it that is lexicographically smaller). ∎
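
Lemma 6 gives a direct way to read off the longest Lyndon word ending at each position: take the lexicographically minimal suffix of the prefix ending there. The sketch below (ours) checks this on a small example, computing each minimal suffix by brute force instead of the constant-time queries of Lemma 5.

    def minimal_suffix(w: str) -> str:
        """Lexicographically smallest non-empty suffix of w (brute force)."""
        return min(w[i:] for i in range(len(w)))

    def is_lyndon(w: str) -> bool:
        return len(w) > 0 and all(w < w[i:] for i in range(1, len(w)))

    # By Lemma 6, the minimal suffix of w[:e] is the longest Lyndon word that is a
    # suffix of w[:e], i.e., the longest Lyndon word ending at position e.
    w = "abaababa"
    for e in range(1, len(w) + 1):
        ms = minimal_suffix(w[:e])
        assert is_lyndon(ms)
        assert all(not is_lyndon(w[i:e]) for i in range(e - len(ms)))  # no longer Lyndon suffix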

For Step 3, we process all strings in X so that MSQs can be answered in constant time.

For Step 5, the situation is a bit different from squares, periodic strings, and palindromes, in that we can have multiple candidates for each v_j rather than just one. For convenience, let e_j = j + MS_y^{k'}[j] − 1 for 1 ≤ j ≤ |y|, and let e_0 = 0. For each j, suppose we have obtained the locus v_j of y[j..e_j], and the next locus v_{j+1} of y[j+1..e_{j+1}]. Notice that e_j ≤ e_{j+1}, and that for all positions e such that e_j < e ≤ e_{j+1}, y[j+1..e] is the longest substring of y that ends at e and is a substring of at least k' strings in X. As mentioned in Section 2.2, we can obtain an occurrence of str(v_{j+1}), i.e., a pair (i, p) such that str(v_{j+1}) = x_i[p..p+e_{j+1}−j−1]. Then, we use MSQs on the substrings x_i[p..p+e−j−1] such that e_j < e ≤ e_{j+1}, which is equivalent to using MSQs on the substrings y[j+1..e] for all such e. The longest Lyndon suffix obtained over all the queries is therefore the longest Lyndon word that is a substring of y[j..e_j] for some j. Since we perform e_{j+1} − e_j MSQs for each position j of y, the total number of MSQs is at most |y|, which takes O(|y|) time. Thus, we have a solution as claimed in Theorem 2.1 for Π_Lyn.

3.6 Solutions in the Off-line Setting

We note that a solution for the on-line setting gives a solution for the off-line setting, since, for any k', we can consider the query string y = x_1 $ x_2 $ ⋯ $ x_k, where $ is again a symbol that doesn't appear elsewhere. Since |y| = O(n), the preprocessing time is O(n), and the query time is O(nσ).

Furthermore, we can remove the σ factor by processing STree(X) for level ancestor queries. A level ancestor query, given a node v of a tree and an integer d, returns the ancestor of v at (node) depth d. It is known that level ancestor queries can be answered in constant time, after linear-time preprocessing of the tree (e.g., [5]). The σ factor came from determining which child of a branching node we need to follow when traversing STree(X) with some suffix of y. Since, in this case, we can identify the leaf in STree(X) that corresponds to the current suffix of y (i.e., some suffix of some x_i in X) that is being traversed, we can use a level ancestor query to determine, in constant time, the child of the branching node that is an ancestor of that leaf, thus getting rid of the σ factor.

4 Conclusion

We considered the generalized on-line variant of the longest common property preserved substring problem proposed by Ayad et al. [1, 2], and 1) unified the two problem settings, and 2) proposed algorithms for several properties, namely, squares, periodic substrings, palindromes, and Lyndon words. For all these properties, we can answer queries in O(|y|σ) time and O(1) working space, with O(n)-time and O(n)-space preprocessing.

Acknowledgments

This work was supported by JSPS KAKENHI Grant Numbers JP18K18002 (YN), JP17H01697 (SI), JP16H02783 (HB), and JP18H04098 (MT). Tomasz Kociumaka was supported by ISF grants no. 824/17 and 1278/16 and by an ERC grant MPM under the EU’s Horizon 2020 Research and Innovation Programme (grant no. 683064).

References