Linear-size Suffix Tries for Parameterized Strings

02/01/2019
by   Katsuhito Nakashima, et al.

In this paper, we propose a new indexing structure for parameterized strings, called parameterized linear-size suffix tries, by generalizing linear-size suffix tries for ordinary strings. Two parameterized strings are said to match if there is a bijection between symbols that makes the two coincide. Parameterized linear-size suffix tries are applicable to the parameterized pattern matching problem, which is to decide whether the input text has a substring that matches the input pattern. The size of our proposed structure is linear in the text size, and with it our algorithm solves the problem in time linear in the pattern size. Our proposed data structure can be seen as a compacted version of a parameterized suffix trie and an alternative to a parameterized suffix tree.

1 Introduction

The pattern matching problem is to check whether a pattern string occurs in a text string. To solve it efficiently, numerous text indexing structures have been proposed. Suffix trees are among the most widely used of these data structures and support many applications, including several variants of pattern matching problems [5, 11]. They can be seen as a compacted type of suffix tries, where two branching nodes that have no other branching nodes between them in a suffix trie are directly connected in the suffix tree. The new edges have a reference to an interval of the text so that the original path label of the suffix trie can be recovered. Recently, Crochemore et al. [6] proposed a new indexing tree structure, called a linear-size suffix trie (LST), which is another compacted variant of a suffix trie. Like a suffix tree, an LST replaces paths consisting only of non-branching nodes by edges, but unlike a suffix tree, the original path labels are recovered by referring to other edge labels in the LST itself. LSTs may use less memory space than suffix trees for indexing the same strings, and may be used as an alternative to suffix trees for various applications not limited to the pattern matching problem.

On the other hand, different types of pattern matching have been proposed and intensively studied. The variant this paper is concerned with is the parameterized pattern matching problem, introduced by Baker [2]. Considering two disjoint sets of symbols $\Sigma$ and $\Pi$, we call a string over $\Sigma \cup \Pi$ a parameterized string (p-string). In the parameterized pattern matching problem, given p-strings $T$ and $P$, we must check whether $T$ has a substring that can be transformed into $P$ by applying a one-to-one function that renames symbols in $\Pi$. Parameterized pattern matching is motivated by applications in software maintenance [1, 2, 3], plagiarism detection [9], the analysis of gene structure [14], and so on. Similarly to the basic string matching problem, several indexing structures that support parameterized pattern matching have been proposed, such as parameterized suffix trees [2], structural suffix trees [14], parameterized suffix arrays [7, 12] and parameterized position heaps [8, 10].

In this paper, we propose a new indexing structure for p-strings, which we call a parameterized linear-size suffix trie (PLST). A PLST is a variant of a suffix tree for prev-encoded [2] suffixes of a p-string. We show that the size of a PLST is $O(n)$ and give an algorithm that solves the parameterized pattern matching problem in $O(m)$ time given a pattern and a PLST, where $n$ is the length of the text and $m$ is the length of the pattern.

2 Preliminaries

2.1 Basic definitions and notation

We denote the set of all non-negative integers by $\mathcal{N}$.

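Let $\Delta$ be an alphabet. For a string $w = xyz \in \Delta^*$, $x$, $y$, and $z$ are called a prefix, a substring, and a suffix of $w$, respectively. The length of $w$ is denoted by $|w|$ and the $i$-th symbol of $w$ is denoted by $w[i]$ for $1 \le i \le |w|$. The substring of $w$ that begins at position $i$ and ends at position $j$ is denoted by $w[i:j]$ for $1 \le i \le j \le |w|$. For convenience, we abbreviate $w[1:i]$ to $w[:i]$ and $w[i:|w|]$ to $w[i:]$ for $1 \le i \le |w|$. The empty string is denoted by $\varepsilon$, that is, $|\varepsilon| = 0$. Moreover, let $w[i:j] = \varepsilon$ if $i > j$. For a string $u$ and an extension $uv$, we write $\mathsf{str}(u, uv) = v$. For a nonempty string $w = ax$ with $a \in \Delta$ and $x \in \Delta^*$, the string obtained by removing the first symbol is $x = w[2:]$.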

Throughout this paper, we fix two alphabets $\Sigma$ and $\Pi$. We call elements of $\Sigma$ constant symbols and those of $\Pi$ parameter symbols. An element of $\Sigma^*$ is called a constant string and that of $(\Sigma \cup \Pi)^*$ is called a parameterized string, or p-string for short. We assume that the sizes of $\Sigma$ and $\Pi$ are constant.

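Given two p-strings $w_1$ and $w_2$ of length $n$, $w_1$ and $w_2$ are a parameterized match or p-match, denoted by $w_1 \approx w_2$, if there is a bijection $f$ on $\Sigma \cup \Pi$ such that $f(a) = a$ for any $a \in \Sigma$ and $f(w_1[i]) = w_2[i]$ for all $1 \le i \le n$ [2]. We can determine whether $w_1 \approx w_2$ or not by using an encoding called prev-encoding defined as follows.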

Definition 1 (Prev-encoding [2]).

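For a p-string $w$ of length $n$ over $\Sigma \cup \Pi$, the prev-encoding for $w$, denoted by $\langle w \rangle$, is defined to be a string over $\Sigma \cup \mathcal{N}$ of length $n$ such that for each $1 \le i \le n$,

$$\langle w \rangle[i] = \begin{cases} w[i] & \text{if } w[i] \in \Sigma, \\ 0 & \text{if } w[i] \in \Pi \text{ and } w[i] \neq w[j] \text{ for all } 1 \le j < i, \\ i - j & \text{if } w[i] \in \Pi,\ w[i] = w[j] \text{ and } w[i] \neq w[k] \text{ for all } j < k < i. \end{cases}$$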

We call strings over $\Sigma \cup \mathcal{N}$ pv-strings.

For any p-strings $w_1$ and $w_2$, $w_1 \approx w_2$ if and only if $\langle w_1 \rangle = \langle w_2 \rangle$. For example, given and , and are p-matches by such that and , where .
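The prev-encoding and the resulting p-match test can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `prev_encode`, `p_match` and `PARAMS`, and the sample strings, are assumptions made for the example, and positions are 0-based as usual in Python.

```python
def prev_encode(w, params):
    """Return the prev-encoding of the p-string w as a list."""
    last_seen = {}   # parameter symbol -> index of its most recent occurrence
    encoded = []
    for i, c in enumerate(w):
        if c in params:
            # 0 on the first occurrence, otherwise the distance to the previous one
            encoded.append(i - last_seen[c] if c in last_seen else 0)
            last_seen[c] = i
        else:
            encoded.append(c)   # constant symbols are copied unchanged
    return encoded


def p_match(w1, w2, params):
    """Two p-strings p-match iff their prev-encodings are equal."""
    return prev_encode(w1, params) == prev_encode(w2, params)


PARAMS = set("xyz")   # hypothetical parameter alphabet; every other symbol is a constant
print(prev_encode("xaxyy", PARAMS))        # [0, 'a', 2, 0, 1]
print(p_match("xaxyy", "zazxx", PARAMS))   # True
```

Comparing encodings avoids searching for the bijection explicitly: equal encodings imply equal lengths and a consistent renaming of parameter symbols.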

We define parameterized pattern matching as follows.

Definition 2 (Parameterized pattern matching [2]).

Given two p-strings, text $T$ and pattern $P$, decide whether $T$ has a substring that p-matches $P$.

For example, considering a text and a pattern over and , has two substrings and that p-match .
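As a baseline for comparison with index-based matching, the problem can be solved naively by prev-encoding every window of the text. The sketch below is illustrative only: `p_occurrences` and the sample text are hypothetical, positions are 0-based, and the running time is proportional to the product of the text and pattern lengths, which is what an index is meant to avoid.

```python
def prev_encode(w, params):
    """Prev-encoding: constants are kept, parameters become 0 or a distance."""
    last_seen, encoded = {}, []
    for i, c in enumerate(w):
        if c in params:
            encoded.append(i - last_seen[c] if c in last_seen else 0)
            last_seen[c] = i
        else:
            encoded.append(c)
    return encoded


def p_occurrences(text, pattern, params):
    """Return all 0-based start positions where a substring of text p-matches pattern."""
    m = len(pattern)
    target = prev_encode(pattern, params)
    return [i for i in range(len(text) - m + 1)
            if prev_encode(text[i:i + m], params) == target]


# Hypothetical example: x, y, z are parameters; a and the sentinel $ are constants.
print(p_occurrences("xaxyay$", "zaz", set("xyz")))   # [0, 3]
```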

Throughout this paper, we assume that a text $T$ ends with a sentinel symbol $\$$, which occurs nowhere else in $T$.

2.2 Suffix tries, suffix trees, and linear-size suffix tries

This subsection briefly reviews tree structures for indexing all the substrings of a constant string $T \in \Sigma^*$.

The suffix trie of $T$ is a tree with nodes corresponding to all the substrings of $T$. Figure 1 (a) shows an example of a suffix trie. Throughout this paper, we identify a node with its corresponding string for explanatory convenience. Note, however, that a node does not explicitly store its corresponding string. For each nonempty substring $ua$ of $T$ where $a \in \Sigma$, we have an edge from $u$ to $ua$ labeled with $a$. Then by reading the labels on the path from the root to a node $u$, one can obtain the string to which the node $u$ corresponds. The path label from a node $u$ to a descendant $uv$ is then $v$. Since $T$ can have quadratically many distinct substrings, the size of the suffix trie is quadratic in $|T|$ in the worst case.

The suffix tree of $T$ is a tree obtained from the suffix trie by removing all non-branching internal nodes and replacing each path with no branching nodes by a single edge whose label refers to a corresponding interval of the text $T$. That is, the label on an edge $(u, v)$ is a pair $(i, j)$ such that $T[i:j]$ is the path label from $u$ to $v$ in the suffix trie. Since the number of branching nodes is linear in $|T|$, the size of the suffix tree is linear in $|T|$.
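To make the size gap concrete, the following sketch builds a suffix trie as nested dictionaries and counts, next to the total number of nodes, how many nodes would survive compaction (the root, the leaves and the branching nodes). The helper names `suffix_trie` and `node_counts` are hypothetical and the quadratic-time construction is only meant for small inputs.

```python
def suffix_trie(text):
    """Build the suffix trie of text as nested dicts (one dict per node)."""
    root = {}
    for i in range(len(text)):
        node = root
        for c in text[i:]:
            node = node.setdefault(c, {})
    return root


def node_counts(root):
    """Return (all nodes, nodes kept after compaction: root, leaves, branching nodes)."""
    total = kept = 0
    stack = [(root, True)]
    while stack:
        node, is_root = stack.pop()
        total += 1
        if is_root or len(node) != 1:   # exactly one child means non-branching internal
            kept += 1
        stack.extend((child, False) for child in node.values())
    return total, kept


print(node_counts(suffix_trie("aaa$")))   # (8, 7)
```

On longer texts the first count grows much faster than the second, which is the motivation for compacted variants.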

An important auxiliary map on nodes is called suffix links, denoted by $\mathsf{SL}$, which is defined by $\mathsf{SL}(aw) = w$ for each node $aw$ with $a \in \Sigma$ and $w \in \Sigma^*$.

Figure 1: (a) The suffix trie for a string . (b) The linear-size suffix trie for . Solid and broken arrows represent the edges and suffix links, respectively. The LST keeps only the first symbol (black) on each edge, while the succeeding symbols (orange) are discarded. Big white and small black circles represent nodes of Type 1 and Type 2, respectively. The signs represent the 1-bit flag. If a node has sign, the edge has a path label of length greater than 1 in where is the parent node of in .

The linear-size suffix trie (LST) [6] of a string $T$ is another compact variant of a suffix trie (see Figure 1 (b)). An LST suppresses (most) non-branching nodes and replaces paths with edges like a suffix tree, but the labels of those new edges do not refer to intervals of the input text. Each edge $(u, v)$ retains only the first symbol of the original path label from $u$ to $v$. To recover the original label, we refer to another edge or a path in the LST itself using a suffix link, using the fact that the path label from $u$ to $v$ coincides with the path label between the nodes reached from $u$ and $v$ by their suffix links. The reference will be recursive, but eventually one can regain the original path label by collecting those retained symbols. For this purpose, the LST keeps some non-branching internal nodes of the suffix trie and thus it may have more nodes than the suffix tree, but its size is still linear in $|T|$. Let us classify the nodes of the suffix trie into Type 1, Type 2 and others as follows, among which Type 1 nodes are exactly those of the suffix tree and Type 1 and Type 2 nodes together constitute the nodes of the LST.

  1. Type 1 nodes are either a leaf or a branching node.

  2. Type 2 nodes are non-branching internal nodes whose suffix link points at a Type 1 node.

Each edge $(u, v)$ has a 1-bit flag that tells whether the original path label from $u$ to $v$ consists of a single symbol. If it is the case, one knows the complete label. Otherwise, one needs to follow the suffix link to regain the other symbols. An LST uses suffix links to regain the original path label in the suffix trie. If we had only Type 1 nodes, then for some edge $(u, v)$ there might be a branching node between the nodes that the suffix links of $u$ and $v$ point to, which would make it difficult to uniquely regain the original path label. With Type 2 nodes, no such branching node exists for any edge $(u, v)$. Then it is enough to go straight down from the node reached by the suffix link of $u$ to regain the original path label.

2.3 Parameterized suffix tries and parameterized suffix trees

Figure 2: The parameterized suffix tree for where and . Broken blue arrows denote suffix links. Some suffix links do not point to a branching node (an example is shown with a bold broken arrow).

For a p-string , a prev-encoded substring (pv-substring) of is the prev-encoding of a substring of . The set of pv-substrings of is denoted by .

A parameterized suffix trie of , denoted by , is the trie that represents all the pv-substrings of . The size of is .
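A parameterized suffix trie can be sketched by inserting the prev-encoding of every suffix into a trie. Matching a pattern then amounts to walking down from the root along the prev-encoded pattern. The code below is an uncompacted, quadratic-size illustration under assumed names (`build_pstrie`, `p_occurs`); it is not the linear-size structure proposed in this paper.

```python
def prev_encode(w, params):
    """Prev-encoding: constants are kept, parameters become 0 or a distance."""
    last_seen, encoded = {}, []
    for i, c in enumerate(w):
        if c in params:
            encoded.append(i - last_seen[c] if c in last_seen else 0)
            last_seen[c] = i
        else:
            encoded.append(c)
    return encoded


def build_pstrie(text, params):
    """Trie (nested dicts) of the prev-encodings of all suffixes of text."""
    root = {}
    for i in range(len(text)):
        node = root
        for sym in prev_encode(text[i:], params):
            node = node.setdefault(sym, {})
    return root


def p_occurs(root, pattern, params):
    """Decide whether some substring of the indexed text p-matches pattern."""
    node = root
    for sym in prev_encode(pattern, params):
        if sym not in node:
            return False
        node = node[sym]
    return True


trie = build_pstrie("xaxyay$", set("xyz"))   # hypothetical text; x, y, z are parameters
print(p_occurs(trie, "zaz", set("xyz")))     # True
print(p_occurs(trie, "zz", set("xyz")))      # False
```

This works because the prev-encoding of a prefix of a suffix is a prefix of the prev-encoding of that suffix, so every pv-substring is spelled by some root path.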

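For a pv-string $u \in (\Sigma \cup \mathcal{N})^*$, the re-encoding for $u$, denoted by $\langle u \rangle$, is defined to be the pv-string of length $|u|$ such that for each $1 \le i \le |u|$,

$$\langle u \rangle[i] = \begin{cases} 0 & \text{if } u[i] \in \mathcal{N} \text{ and } u[i] \ge i, \\ u[i] & \text{otherwise.} \end{cases}$$

We then have $\langle \langle w \rangle[i:j] \rangle = \langle w[i:j] \rangle$ for any p-string $w$ and positions $i \le j$.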
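Re-encoding can be illustrated as follows: a numeric symbol that points to a position before the start of the string is reset to 0. The sketch assumes pv-strings are Python lists mixing constant characters and integers; `re_encode` and `prev_encode` are hypothetical helper names, and the final loop checks by brute force that re-encoding a slice of a prev-encoding equals the prev-encoding of the corresponding substring.

```python
def prev_encode(w, params):
    """Prev-encoding: constants are kept, parameters become 0 or a distance."""
    last_seen, encoded = {}, []
    for i, c in enumerate(w):
        if c in params:
            encoded.append(i - last_seen[c] if c in last_seen else 0)
            last_seen[c] = i
        else:
            encoded.append(c)
    return encoded


def re_encode(u):
    """Reset to 0 every integer that refers to a position before the start of u."""
    # idx is 0-based, so the 1-based position is idx + 1
    return [0 if isinstance(s, int) and s >= idx + 1 else s
            for idx, s in enumerate(u)]


params = set("xyz")            # hypothetical parameter alphabet
w = "xaxyy"
print(prev_encode(w, params))              # [0, 'a', 2, 0, 1]
print(re_encode(prev_encode(w, params)[2:]))   # [0, 0, 1]

# Brute-force check of the identity on every substring of w
for i in range(len(w)):
    for j in range(i, len(w) + 1):
        assert re_encode(prev_encode(w, params)[i:j]) == prev_encode(w[i:j], params)
```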

Usually suffix links are defined on nodes of suffix trees but it is convenient to have “virtual suffix links” on all nodes but the root of , i.e., all the nonempty substrings of , as well. For a nonempty pv-string , let denote the re-encoding of the string obtained by deleting the first symbol. This operation on strings will define real suffix links in indexing structures for parameterized strings based on parameterized suffix tries. Differently from constant strings, does not necessarily imply . What we have is .
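The effect of the virtual suffix link on path labels can be seen on a tiny example. In the sketch below, `sl` is a hypothetical helper that drops the first symbol and re-encodes the rest. The label from a node to its extension changes after applying `sl` to both endpoints, which cannot happen for constant strings.

```python
def re_encode(u):
    """Reset to 0 every integer that refers to a position before the start of u."""
    return [0 if isinstance(s, int) and s >= idx + 1 else s
            for idx, s in enumerate(u)]


def sl(u):
    """Virtual suffix link: delete the first symbol, then re-encode."""
    if not u:
        raise ValueError("the empty string has no suffix link")
    return re_encode(u[1:])


# Hypothetical pv-strings: uv is the prev-encoding of a p-string such as "xax".
u = [0, "a"]
uv = [0, "a", 2]

label_before = uv[len(u):]               # label from u to uv
label_after = sl(uv)[len(sl(u)):]        # label from sl(u) to sl(uv)
print(label_before, label_after)         # [2] [0]
```

The symbol 2 referred to the deleted first symbol, so it becomes 0 once that symbol is gone.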

A parameterized suffix tree (p-suffix tree) [2] of , denoted by , is a compacted variant of the parameterized suffix trie. Figure 2 shows an example of a p-suffix tree. Like the suffix tree for a constant string over , is obtained from by removing non-branching internal nodes and giving each edge as a label references to some interval of the original text . Lee et al. [13] showed that can be built online in randomized time by using suffix links, which connect nodes and .

In a suffix tree, the suffix link of a branching node necessarily points to a branching node. That is, if is a node, then so is . However, in a parameterized suffix tree, there are branching nodes whose “suffix links” do not point to a branching node. Figure 2 shows an example, where the node (red circle) is a branching node in but is not.

This does not matter for pattern matching using p-suffix trees, although suffix links are important for construction. However, it is critical in the parameterized linear-size suffix trie, since we need to recursively follow suffix links to recover the original label. We will discuss this point in more detail in the next section.
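Branching nodes whose virtual suffix link lands on a non-branching node can be found by brute force on small texts. The following sketch enumerates all pv-substrings, computes the number of children of each, and reports the nonempty branching nodes whose image under `sl` has fewer than two children. All function names and the sample text are assumptions for illustration, and the root is ignored here.

```python
def prev_encode(w, params):
    """Prev-encoding: constants are kept, parameters become 0 or a distance."""
    last_seen, encoded = {}, []
    for i, c in enumerate(w):
        if c in params:
            encoded.append(i - last_seen[c] if c in last_seen else 0)
            last_seen[c] = i
        else:
            encoded.append(c)
    return encoded


def sl(u):
    """Virtual suffix link on tuples: drop the first symbol and re-encode."""
    return tuple(0 if isinstance(s, int) and s >= idx + 1 else s
                 for idx, s in enumerate(u[1:]))


def pv_substrings(text, params):
    """All pv-substrings of text as tuples, including the empty one (the root)."""
    nodes = {()}
    for i in range(len(text)):
        enc = prev_encode(text[i:], params)
        nodes.update(tuple(enc[:j]) for j in range(1, len(enc) + 1))
    return nodes


def branching_with_nonbranching_sl(text, params):
    nodes = pv_substrings(text, params)
    children = {u: 0 for u in nodes}
    for u in nodes:
        if u:
            children[u[:-1]] += 1        # every node is a child of its longest proper prefix
    return {u for u in nodes
            if u and children[u] >= 2 and children[sl(u)] < 2}


found = branching_with_nonbranching_sl("xaxbxay$", set("xyz"))
print((0, "a") in found)   # True
```

In this text the node `(0, 'a')` has the two children `2` and `0`, while its suffix-link image `('a',)` has the single child `0`, because both children collapse to 0 once the first symbol is removed.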

3 Parameterized linear-size suffix tries

We now introduce our indexing tree structures for p-strings, which we call parameterized linear-size suffix tries (PLSTs), based on linear-size suffix tries reviewed in Section 2.2. An example of a PLST is shown in Figure 3. There are two difficulties in extending LSTs to deal with p-strings. We want to know for an edge , but

  1. it is not necessarily that ,

  2. there is a branching node of such that is not a branching node.

The first one is caused by the fact that rather than . Then, the path label referenced by the suffix link may not give exactly what we want. The second one is critical to regain the original path label in the suffix trie. If we do not have in our indexing structure, we cannot use the technique of “reduction by suffix links” to regain the original path label in the suffix trie.

3.1 Definition and properties of parameterized linear-size suffix tries

Figure 3: (a) The parameterized suffix trie and (b) the PLST for () where and . White, black and double white circles represent nodes of Type 1, Type 2 and Type 3, respectively. The numbers in rhombus represent re-encoding signs. The PLST keeps only the first symbol (black) on each edge, while the succeeding symbols (orange) are discarded.

Let be the set of nodes of . The set of nodes of the PLST for is a subset of , which is partitioned as . Nodes in are called Type  for . The definition of Type 1 and Type 2 nodes follows the one for original LSTs [6].

  1. A node is Type 1 if is a leaf or a branching internal node in ,

  2. A node is Type 2 if and .

However, those nodes are not sufficient, since there can be a node such that , for which the technique of “reduction by suffix links” fails to recover for a child of . Let us call a node bad if . One idea to overcome this problem might be to add to for all and so that is closed under , where and . However, the number of those additional nodes will be as we show in Appendix A.1. Our solution is to give explicitly on the path from to when is a bad node, without using suffix links. We add the following nodes of Type 3.

  1. A node is Type 3 if and the parent of is either a bad Type 1 or Type 3 node.

We will show in Section 3.3 that . We say that is good if . Otherwise, it is bad, including the case where . Note that the root is a Type 1 bad node.

Edges of are trivially determined: we have as an edge if and only if and there is no proper nonempty prefix of such that . The label of the edge is and is called the -child of and denoted by .

For good nodes, we retain the definition of a suffix link. For bad nodes, we leave the suffix link undefined.

Figure 4: Illustrating how re-encoding signs are given. Each number in a rhombus represents the re-encoding sign at a node. For instance, for the node , we have , because and the parent of is of length . For the node , .

The following properties are easily obtained from the definitions.

Observation 1.

For any edge in , if is bad, then .

If is good, we want to recover for an edge of using suffix links. An important observation is that the equation , which was a key property to regain the original label in (non-parameterized) LSTs, does not necessarily hold for PLSTs. Figure 4 shows an example, where ; the third symbol in is re-encoded to in , because the first symbol of , that is referenced by the symbol , is cut out in . Fortunately, the possible difference between and is limited.

Observation 2.

Any prev-encoded substring of text has at most one position such that . For such a position , we have . Thus, such a position is unique in for each edge in .

For each edge , we add the re-encoding sign, defined below, so that we can regain from .

Definition 3 (Re-encoding sign).

For each node , let be the parent of . Define re-encoding sign

The re-encoding sign is uniquely defined by Observation 2. Figure 4 shows an example of re-encoding signs. Lemma 1 follows immediately from Observation 2 and Definition 3.

Lemma 1.

Let be an edge in such that is not the root. Then for any , . If , then and .

Lemma 1 tells how to recover from using the re-encoding sign at .
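The recovery step can be sketched under one reading of the definitions, which is an assumption of this example: the re-encoding sign of a node stores the 1-based position inside the edge label whose symbol refers to the first symbol of the node's string, and 0 if there is no such position. The names `re_sign` and `recover_label` are hypothetical, a node is given as a pv-string `v` together with the depth `d` of its non-root parent, and the sample pv-strings are chosen for illustration.

```python
def re_encode(u):
    """Reset to 0 every integer that refers to a position before the start of u."""
    return [0 if isinstance(s, int) and s >= idx + 1 else s
            for idx, s in enumerate(u)]


def sl(v):
    """Virtual suffix link: delete the first symbol, then re-encode."""
    return re_encode(v[1:])


def re_sign(v, d):
    """Position i (1-based) in the label v[d:] that points at v's first symbol, else 0."""
    for i, s in enumerate(v[d:], start=1):
        # position d + i of v refers to position 1 exactly when its value is d + i - 1
        if isinstance(s, int) and s == d + i - 1:
            return i
    return 0


def recover_label(v, d):
    """Rebuild the label v[d:] from the label between sl(parent) and sl(v); needs d >= 1."""
    label = sl(v)[d - 1:]            # sl(parent) has length d - 1
    sign = re_sign(v, d)
    if sign:
        label[sign - 1] = d + sign - 1   # undo the single re-encoded symbol
    return label


v = [0, 0, 2, 2]                     # prev-encoding of a p-string such as "xyxy"
print(re_sign(v, 2), recover_label(v, 2))   # 1 [2, 2]
print(re_sign(v, 3), recover_label(v, 3))   # 0 [2]
```

Only one symbol can differ because at most one position of a pv-substring refers to its first position.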

In summary, consists of four kinds of nodes, good Type 1, bad Type 1, Type 2 (all good), and Type 3 (all bad). If is a good node, has its depth, suffix link and re-encoding sign, i.e., the triple , where . Here we use the notation to emphasize that the suffix link is the pointer to the node corresponding to the string rather than the string itself. Therefore, it requires only a constant amount of memory space. If is bad, does not have a suffix link, i.e., has the triple . Each edge has a label .

3.2 Parameterized pattern matching with parameterized linear-size suffix tries

This subsection presents our algorithm for solving the parameterized pattern matching problem as an application of PLSTs. The function P-Match of Algorithm 1 takes a prev-encoded string and a node in and checks whether there is such that . If it is the case, it returns the least extension of such that . In other words, is a prefix of , where should be itself if . Otherwise, it returns . If a p-string pattern p-matches substrings of at positions , then will be a node whose descendant leaves are exactly .

For an input pair , if , then P-Match returns , as it is required. Otherwise, it first tries to regain for the -child of , if has such a child. At first, suppose . We would like to know whether . If , it means that we have already confirmed that . Then we just go down to and recursively call the function with . If , we cannot know from the edge itself what is except for its first symbol . To recover , we use the suffix link of . Since , is a good node by Observation 1, and thus is defined. If , we have by Lemma 1. In this case we simply call . Otherwise, by Lemma 1, we have if and only if and , where

for . Therefore, the recursive call of returns iff . We note that it may be the case that , but this does not matter for our algorithm. The recursive call checks whether but is not an argument and not used. If returns a node, then and thus we continue matching by calling .

The above discussion is still valid when . If or , then is a prefix of iff is a prefix of . Otherwise, is a prefix of iff and is a prefix of . Thus the recursion is justified. If returns a node, then is a prefix of and we call , which returns .

Algorithm 1: The function P-Match.

Lemma 2.

We can decide whether $T$ has a substring that p-matches $P$ using Algorithm 1.

Figure 5: The fast link drawn by the red broken arrow is obtained by skipping intermediate nodes visited by the suffix links drawn by blue broken arrows. We jump from to by using the fast link .

Let us discuss the time complexity of Algorithm 1. Suppose that is called. It can be the case and either or where . In this case, the algorithm simply calls , where the first argument has not changed from the preceding call. Such recursion may be repeated many times. Figure 5 shows an example of such an edge where . The same problem and a solution have already been discussed by Crochemore et al. [6] for (non-parameterized) LSTs. Following them, we introduce fast links as follows, which allow us to skip recursions that preserve the first argument.

Definition 4 (Fast link).

For each edge such that , the fast link for is defined to be where is the smallest integer satisfying either

  1. , or,

  2. ,

where for .

Algorithm 1 will run in linear time once the suffix link followed in its recursive call is replaced by the corresponding fast link. When , we have . When , we change the -th symbol of , which must be a positive integer, to . Therefore, the number of fast links we follow is bounded by . In the example of Figure 5, we jump from to .

Theorem 1.

Given the parameterized linear-size suffix trie of a text $T$ and a pattern $P$ of length $m$, we can decide whether $T$ has a substring that p-matches $P$ in $O(m)$ time.

3.3 The size of parameterized linear-size suffix tries

We now show that the size of is linear with respect to the length of a text . First, we show a linear upper bound on the number of nodes of . The nodes of Type 1 appear in the p-suffix tree, so they are at most  [2]. It is enough to show that the numbers of nodes of Type 2 and Type 3 are linearly bounded as well. We relegate proofs of lemmas to Appendices.

Lemma 3.

The number of Type 2 nodes in is smaller than .

We also show an upper bound on the number of Type 3 nodes.

Lemma 4 (Baker [4]).

Any bad node has exactly two children and .

Let us say that a bad node governs a Type 3 node if all the nodes between them are Type 3. Every Type 3 node is governed by a unique bad node.

Lemma 5.

The number of Type 3 nodes that a bad node governs is at most for

where is the length of the longest common prefix of strings and .

Lemma 6.

The number of Type 3 nodes in is smaller than .

The number of edges and their labels, as well as the number of suffix links, depths and re-encoding signs for nodes, is asymptotically bounded above by the number of nodes in the PLST. Therefore, we can conclude that the size of the PLST is $O(n)$.

Theorem 2.

Given a p-string $T$ over $\Sigma \cup \Pi$ of length $n$, the size of the parameterized linear-size suffix trie of $T$ is $O(n)$.

4 Conclusion and future work

In this paper, we presented an indexing structure called a parameterized linear-size suffix trie for the parameterized pattern matching problem. The size of the parameterized linear-size suffix trie for $T$ is $O(n)$, where $n$ is the length of $T$. We presented an algorithm that solves the problem in $O(m)$ time, where $m$ is the length of an input pattern $P$.

Parameterized suffix trees by Lee et al. [13] keep the text, and each edge has a triple of positions of the text to recover the prev-encoded substring of the text. Our PLSTs do not keep the text, and each node has a triple of its depth, suffix link, and re-encoding sign. Thus, parameterized linear-size suffix tries may use less memory than parameterized suffix trees.

For PLSTs to be useful for various applications, like computing the longest common substrings, an efficient algorithm for constructing PLSTs is required. This is left for future work.

References

  • [1] Brenda S. Baker. A program for identifying duplicated code. Computing Science and Statistics, 24:49–57, 1992.
  • [2] Brenda S. Baker. A theory of parameterized pattern matching: algorithms and applications. In Proc. 25th annual ACM symposium on Theory of computing, pages 71–80, 1993. doi:10.1145/167088.167115.
  • [3] Brenda S. Baker. Parameterized pattern matching: Algorithms and applications. Journal of Computer and System Sciences, 52(1):28–42, 1996. doi:10.1006/jcss.1996.0003.
  • [4] Brenda S. Baker. Parameterized duplication in strings: Algorithms and an application to software maintenance. SIAM Journal on Computing, 26(5):1343–1362, 1997.
  • [5] M. Crochemore and W. Rytter. Jewels of Stringology: Text Algorithms. World Scientific, 2003.
  • [6] Maxime Crochemore, Chiara Epifanio, Roberto Grossi, and Filippo Mignosi. Linear-size suffix tries. Theoretical Computer Science, 638:171–178, 2016.
  • [7] Satoshi Deguchi, Fumihito Higashijima, Hideo Bannai, Shunsuke Inenaga, and Masayuki Takeda. Parameterized suffix arrays for binary strings. In Proceedings of the Prague Stringology Conference 2008, pages 84–94, Czech Technical University in Prague, Czech Republic, 2008.
  • [8] Diptarama, Takashi Katsura, Yuhei Otomo, Kazuyuki Narisawa, and Ayumi Shinohara. Position heaps for parameterized strings. In 28th Annual Symposium on Combinatorial Pattern Matching (CPM 2017), pages 8:1–8:13, 2017.
  • [9] Kimmo Fredriksson and Maxim Mozgovoy. Efficient parameterized string matching. Information Processing Letters, 100(3):91–96, 2006. doi:10.1016/j.ipl.2006.06.009.
  • [10] Noriki Fujisato, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, and Masayuki Takeda. Right-to-left online construction of parameterized position heaps. In Prague Stringology Conference 2018, pages 91–102, 2018.
  • [11] Dan Gusfield. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press, 1997.
  • [12] Tomohiro I, Satoshi Deguchi, Hideo Bannai, Shunsuke Inenaga, and Masayuki Takeda. Lightweight parameterized suffix array construction. In Combinatorial Algorithms (IWOCA 2009), pages 312–323, Berlin, Heidelberg, 2009. Springer Berlin Heidelberg. doi:10.1007/978-3-642-10217-2_31.
  • [13] Taehyung Lee, Joong Chae Na, and Kunsoo Park. On-line construction of parameterized suffix trees for large alphabets. Information Processing Letters, 111(5):201–207, 2011.
  • [14] Tetsuo Shibuya. Generalization of a suffix tree for RNA structural pattern matching. Algorithmica, 39(1):1–19, 2004. doi:10.1007/s00453-003-1067-9.

Appendix A

A.1 The virtual suffix link closure of branching nodes is too big

We show that the total number of nodes of the form for some cannot be linearly bounded by . Let us consider a text

where and for each . Note that . Here

is a Type 1 node, since . Then the set has elements. Therefore, we cannot keep our indexing structure in linear size. Figure 6 illustrates the case of , where twelve additional nodes are created.

Figure 6: An example demonstrating that the virtual suffix link closure of Type 1 nodes has too many elements, where () with and . Big and small red circles represent bad nodes in and newly added nodes not in , respectively.

A.2 Proof of Lemma 3

Lemma.

The number of Type 2 nodes in is smaller than .

Proof.

Let us consider a virtual suffix link chain in starting from with , i.e., . has such chains and every internal node of appears in at least one chain. If a chain has two distinct Type 2 nodes and with , since is a Type 1 node by definition, one can always find a Type 1 node between them.

Define a binary relation between and by

and let . Since is a partial function from branching nodes to Type 2 nodes, we have . By the above argument on a chain, each chain has at most one Type 2 node such that . Since there are chains, we have . All in all, . ∎

A.3 Proof of Lemma 5

Lemma.

The number of Type 3 nodes that a bad node governs is at most for

where is the length of the longest common prefix of strings and .

Figure 7: A bad Type 1 node governs two Type 3 nodes on each of the two branches. Here we have , and .

Proof.

Figure 7 may help understanding the following proof. By Lemma 4, a bad node has just two children. It is enough to show that the number of Type 3 descendants of each child that governs is at most . Let . Since is non-branching, we have . By definition, there is such that

That is, . Therefore, , which is a descendant of a child of . Hence, the number of Type 3 descendants of each child that governs is at most . ∎

A.4 Proof of Lemma 6

Lemma.

The number of nodes of Type 3 in is smaller than .

Proof.

The function in Lemma 5 is an injection. Thus, it is enough to show that for . Since implies , we have for all . One can easily show by induction on that for . Particularly for , this means . ∎