Right-to-left online construction of parameterized position heaps

08/03/2018
by   Noriki Fujisato, et al.
KYUSHU UNIVERSITY

Two strings of equal length are said to parameterized match if there is a bijection that maps the characters of one string to those of the other string, so that the two strings become identical. The parameterized pattern matching problem is, given two strings T and P, to find the occurrences of substrings in T that parameterized match P. Diptarama et al. [Position Heaps for Parameterized Strings, CPM 2017] proposed an indexing data structure called parameterized position heaps, and gave a left-to-right online construction algorithm. In this paper, we present a right-to-left online construction algorithm for parameterized position heaps. For a text string T of length n over two kinds of alphabets Σ and Π of respective size σ and π, our construction algorithm runs in O(n log(σ + π)) time with O(n) space. Our right-to-left parameterized position heaps support pattern matching queries in O(m log(σ + π) + mπ + pocc) time, where m is the length of a query pattern P and pocc is the number of occurrences to report. Our construction and pattern matching algorithms are as efficient as Diptarama et al.'s algorithms.

1 Introduction

Text indexing is the task of preprocessing a text string so that subsequent pattern matching queries can be answered efficiently. To date, numerous text indexing structures for exact pattern matching have been proposed, ranging from classical data structures such as suffix trees [14], directed acyclic word graphs [2, 3], and suffix arrays [10], to more advanced ones such as compressed suffix arrays [8] and the FM index [7], just to mention a few.

Ehrenfeucht et al. [6] proposed a text indexing structure called position heaps. Ehrenfeucht et al.’s position heap is constructed in a right-to-left online manner, where a new node is incrementally inserted to the current position heap for each decreasing position i = n, ..., 1 in the input string T of length n. In other words, Ehrenfeucht et al.’s position heap is defined over a sequence ⟨ε, T[n..], ..., T[1..]⟩ of the suffixes of T in increasing order of their length, where ε is the empty string of length 0. Kucherov [9] proposed another variant of position heaps. Kucherov’s position heap is constructed in a left-to-right online manner, where a new node is incrementally inserted to the current position heap for each increasing position i = 1, ..., n. In other words, Kucherov’s position heap is defined over a sequence ⟨T[1..], ..., T[n..], ε⟩ of the suffixes of T in decreasing order of their length. We will call Ehrenfeucht et al.’s position heap the RL position heap, and Kucherov’s position heap the LR position heap. Both the RL and LR position heaps for a text string T of length n require O(n) space and can be constructed in O(n log σ) time, where σ is the alphabet size. By augmenting the RL and LR position heaps of T with auxiliary links called maximal reach pointers, pattern matching queries can be answered in O(m log σ + occ) time, where m is the length of a query pattern P and occ is the number of occurrences of P in T.

Nakashima et al. [12] proposed position heaps for a set of strings that is given as a reversed trie, and proposed an algorithm that constructs the position heap of a given trie in time and space, where is the size of the input trie. Later, the same authors showed how to construct the position heap of a trie in time and space, for integer alphabets of size polynomially bounded in  [13].

Baker [1] introduced the parameterized pattern matching problem, which seeks the occurrences of substrings of the text T that have the “same” structure as the given pattern P. Parameterized pattern matching is motivated by, e.g., software maintenance and plagiarism detection [1]. More formally, we consider two distinct alphabets Σ and Π, and we call an element of (Σ ∪ Π)* a p-string. The parameterized pattern matching problem is, given two p-strings T and P, to find all occurrences of substrings of T that can be transformed to P by a bijection from Σ ∪ Π to Σ ∪ Π which is identity for Σ. For instance, if and where and , then the positions to output are and . To see why, observe that for the substring there is a bijection , , , , and that maps the substring to . Also, observe that for the other substring , there is a bijection , , , , and that maps the substring to as well.

Among the various algorithms and indexing structures for parameterized pattern matching (see [11] for a survey), we focus on Diptarama et al.’s parameterized position heaps [5]. Diptarama et al.’s parameterized position heaps are based on Kucherov’s LR position heaps, which are constructed in a left-to-right online manner. Let us call their structure the LR p-position heap. Diptarama et al. showed how to construct the LR p-position heap for a given text T of length n in O(n log(σ + π)) time with O(n) space, where σ = |Σ| and π = |Π|. They also showed that the LR p-position heap augmented with maximal reach pointers can support parameterized pattern matching queries in O(m log(σ + π) + mπ + pocc) time, where pocc is the number of occurrences to report.

In this paper, we propose RL p-position heaps, which are constructed in a right-to-left online manner. We show how to construct our RL p-position heap for a given text string T of length n in O(n log(σ + π)) time with O(n) space. Our construction algorithm is based on Ehrenfeucht et al.’s construction algorithm for RL position heaps [6], and Weiner’s suffix tree construction algorithm [14]. Namely, we use reversed suffix links defined for the nodes of RL p-position heaps. The key to our algorithm is how to label the reversed suffix links, which will be clarified in Definition 3. Using our RL p-position heap augmented with maximal reach pointers, one can perform parameterized pattern matching queries in O(m log(σ + π) + mπ + pocc) time.

2 Preliminaries

2.1 Notations on strings

Let Σ and Π be disjoint sets called a static alphabet and a parameterized alphabet, respectively. Let σ = |Σ| and π = |Π|. An element of Σ is called an s-character, and that of Π is called a p-character. In the sequel, both an s-character and a p-character are sometimes simply called a character. An element of Σ* is called a string, and an element of (Σ ∪ Π)* is called a p-string. The length of a (p-)string w, denoted |w|, is the number of characters contained in w. The empty string ε is a string of length 0, namely, |ε| = 0. For a (p-)string w = xyz, x, y, and z are called a prefix, substring, and suffix of w, respectively. The set of prefixes, substrings, and suffixes of a (p-)string w is denoted by Prefix(w), Substr(w), and Suffix(w), respectively. The i-th character of a (p-)string w is denoted by w[i] for 1 ≤ i ≤ |w|, and the substring of a (p-)string w that begins at position i and ends at position j is denoted by w[i..j] for 1 ≤ i ≤ j ≤ |w|. For convenience, let w[i..j] = ε if j < i. Also, let w[i..] = w[i..|w|] for any 1 ≤ i ≤ |w|.

2.2 Parameterized pattern matching

For any p-string and , let . Two p-strings x and y of length k each are said to parameterized match (p-match) iff there is a bijection f on Σ ∪ Π such that f(a) = a for any a ∈ Σ and f(x[i]) = y[i] for all 1 ≤ i ≤ k. For instance, if and , then and p-match since there is a bijection such that , , , , and and . We write x ≈ y iff x and y p-match.

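The previous encoding of a p-string x of length k is a sequence prev(x) of length k such that the first occurrence of each p-character is replaced with 0 and any other occurrence of a p-character is replaced by the distance to the previous occurrence of the same p-character in x, and each s-character remains the same. More formally, prev(x) is a sequence over Σ ∪ {0, ..., k − 1} of length k such that for each 1 ≤ i ≤ k,

prev(x)[i] = x[i]   if x[i] ∈ Σ,
prev(x)[i] = 0      if x[i] ∈ Π and x[i] ≠ x[j] for any 1 ≤ j < i,
prev(x)[i] = i − j  if x[i] ∈ Π and j = max{j′ | x[j′] = x[i], 1 ≤ j′ < i}.

Observe that x ≈ y iff prev(x) = prev(y). Using the same example as above, we have that .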
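
The previous encoding can be computed in a single left-to-right pass that remembers the last position of each p-character. The following is a minimal Python sketch; the function name `prev_encode`, the representation of Σ as a Python set, and the example strings are assumptions made for illustration and do not come from the paper.

```python
def prev_encode(w, static_alphabet):
    """Return the previous encoding of the p-string w as a list.

    Characters in static_alphabet (s-characters) are copied unchanged;
    every other character is treated as a p-character.
    """
    last_seen = {}  # p-character -> index of its most recent occurrence
    encoded = []
    for i, c in enumerate(w):
        if c in static_alphabet:
            encoded.append(c)
        elif c in last_seen:
            encoded.append(i - last_seen[c])  # distance to the previous occurrence
            last_seen[c] = i
        else:
            encoded.append(0)  # first occurrence of this p-character
            last_seen[c] = i
    return encoded


# Hypothetical example with s-characters {a, b} and p-characters {x, y, z}.
print(prev_encode("xaxyby", {"a", "b"}))  # [0, 'a', 2, 0, 'b', 2]
print(prev_encode("zazxbx", {"a", "b"}))  # [0, 'a', 2, 0, 'b', 2]
```

The two example strings have the same encoding, which is exactly the condition for them to p-match.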

Let T and P be p-strings of length n and m, respectively, where n ≥ m. The parameterized pattern matching problem is to find all positions i in T such that T[i..i+m−1] ≈ P.
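
As a baseline for the problem statement, the following Python sketch solves parameterized pattern matching naively by comparing the previous encoding of every length-m window of the text with that of the pattern. It is only a reference implementation with quadratic running time, not the index-based method of this paper; the function names and the example input are assumptions made for illustration.

```python
def prev_encode(w, static_alphabet):
    """Previous encoding of w; characters outside static_alphabet are p-characters."""
    last_seen, encoded = {}, []
    for i, c in enumerate(w):
        if c in static_alphabet:
            encoded.append(c)
        else:
            encoded.append(i - last_seen[c] if c in last_seen else 0)
            last_seen[c] = i
    return encoded


def naive_p_match_positions(text, pattern, static_alphabet):
    """Return all 1-based positions i such that text[i..i+m-1] p-matches pattern."""
    m = len(pattern)
    target = prev_encode(pattern, static_alphabet)
    return [
        i + 1
        for i in range(len(text) - m + 1)
        # Each window is re-encoded from scratch, since the encoding of a
        # substring is not simply a slice of the encoding of the text.
        if prev_encode(text[i:i + m], static_alphabet) == target
    ]


# Hypothetical example with s-characters {a} and p-characters {x, y, z}.
print(naive_p_match_positions("xaxyay", "zaz", {"a"}))  # [1, 4]
```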

3 Parameterized position heaps

Let S = ⟨s_1, ..., s_k⟩ be a sequence of strings such that for any 1 < i ≤ k, s_i ∉ Prefix(s_j) for any 1 ≤ j < i. For convenience, we assume that s_1 = ε.

Definition 1 (Sequence hash trees [4]).

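The sequence hash tree of a sequence S = ⟨s_1, ..., s_k⟩ of strings, denoted SHT(S) = SHT(S)^k, is a trie structure that is recursively defined as follows: Let SHT(S)^i = (V_i, E_i). Then

SHT(S)^i = ({ε}, ∅)                                         if i = 1,
SHT(S)^i = (V_{i−1} ∪ {p_i}, E_{i−1} ∪ {(q_i, c, p_i)})   if 2 ≤ i ≤ k,

where q_i is the longest prefix of s_i which satisfies q_i ∈ V_{i−1}, c = s_i[|q_i| + 1], and p_i is the shortest prefix of s_i which satisfies p_i ∉ V_{i−1}.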

Note that since we have assumed that each s_i is not a prefix of s_j for any 1 ≤ j < i, the new node p_i and the new edge (q_i, c, p_i) always exist for each 2 ≤ i ≤ k. Clearly SHT(S) contains k nodes (including the root).
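
The recursive definition translates directly into an insertion loop: for each string, follow existing edges as far as possible and add one new node for the next character. The sketch below is a minimal Python illustration; the function name, the nested-dictionary trie, and the example sequence are assumptions made for this example.

```python
def sequence_hash_tree(strings):
    """Build the sequence hash tree of strings, where strings[0] is the empty string.

    Returns (root, nodes): root is a nested dict mapping edge labels to child
    dicts, and nodes lists the path label of each node in insertion order.
    """
    root = {}
    nodes = [()]  # the root represents the empty string
    for s in strings[1:]:
        node, depth = root, 0
        # Follow the longest prefix of s that is already a node.
        while depth < len(s) and s[depth] in node:
            node = node[s[depth]]
            depth += 1
        if depth == len(s):
            raise ValueError("a string is a prefix of an earlier string")
        # Insert the shortest prefix of s that is not yet a node.
        node[s[depth]] = {}
        nodes.append(tuple(s[:depth + 1]))
    return root, nodes


# Hypothetical example sequence of k = 5 strings.
root, nodes = sequence_hash_tree(["", "a", "ba", "aba", "baba"])
print(nodes)  # [(), ('a',), ('b',), ('a', 'b'), ('b', 'a')]
```

The resulting tree has exactly k = 5 nodes, one per input string, as stated above.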

In what follows, we will define our indexing data structure for a text p-string T of length n. Let S_T = ⟨ε, prev(T[n..]), ..., prev(T[1..])⟩ be the sequence of previous encoded suffixes of T arranged in increasing order of their length. It is clear that prev(T[i..]) ∉ Prefix(prev(T[j..])) for any 1 ≤ i ≤ n and for any i < j ≤ n, since prev(T[i..]) is longer than prev(T[j..]). Hence we can naturally define the sequence hash tree for S_T, and we obtain our data structure:

Definition 2 (Parameterized position heaps).

The parameterized position heap (p-position heap) for a p-string T, denoted PPH(T), is the sequence hash tree of S_T, i.e., PPH(T) = SHT(S_T).
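
Definition 2 can be followed literally to obtain a simple (quadratic-time) construction: encode each suffix from scratch, from the shortest to the longest, and insert it into the trie. The Python sketch below is such a naive reference construction, not the algorithm proposed in this paper. The function names, the use of path labels (tuples) as node keys, the label n + 1 for the root, and the example text are assumptions made for illustration.

```python
def prev_encode(w, static_alphabet):
    """Previous encoding of w; characters outside static_alphabet are p-characters."""
    last_seen, encoded = {}, []
    for i, c in enumerate(w):
        if c in static_alphabet:
            encoded.append(c)
        else:
            encoded.append(i - last_seen[c] if c in last_seen else 0)
            last_seen[c] = i
    return encoded


def naive_p_position_heap(text, static_alphabet):
    """Return a dict mapping each text position to the path label of its node.

    The root (empty path label) is stored under the hypothetical key n + 1.
    """
    n = len(text)
    node_of = {n + 1: ()}
    existing = {()}
    for i in range(n, 0, -1):  # suffixes in increasing order of length
        s = prev_encode(text[i - 1:], static_alphabet)
        node = ()
        # Longest prefix of prev(T[i..]) that is already represented.
        while node + (s[len(node)],) in existing:
            node = node + (s[len(node)],)
        new_node = node + (s[len(node)],)
        existing.add(new_node)
        node_of[i] = new_node
    return node_of


# Hypothetical example with s-characters {a} and p-characters {x, y}.
heap = naive_p_position_heap("xaxyay", {"a"})
for position in sorted(heap):
    print(position, heap[position])
```

For the example text of length 6 this produces 7 nodes, one for the root and one for each text position.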

Figure 1: To the left is the list of for p-string of length , where and . To the right is an illustration for . The underlined prefix of each in the left list denotes the longest prefix of that was inserted to and hence, the node with id represents this underlined prefix of .

See Figure 1 for an example of our p-position heap.

Note that we can obtain by adding at the beginning of . This also means that for each . Hence, we can construct by processing the input string from right to left. We remark that we can easily compute from in a total of time for all using extra space, e.g., by maintaining a balanced search tree that stores the distinct p-characters that have occurred in and records the leftmost occurrences of these p-characters in the nodes.
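
The remark on computing the encodings of the suffixes from right to left can be illustrated as follows. When a p-character is prepended, only one existing entry changes: the entry at its previous leftmost occurrence, which turns from 0 into the distance to the new front position. The Python sketch below is a minimal illustration; it uses a dictionary instead of a balanced search tree, copies the current encoding at every step purely for display, and all names are assumptions made for this example.

```python
def suffix_encodings_right_to_left(text, static_alphabet):
    """Yield (i, prev(T[i..])) for i = n, ..., 1 (1-based positions)."""
    n = len(text)
    enc = [None] * n  # enc[p:] always holds the encoding of the current suffix
    leftmost = {}     # p-character -> leftmost 0-based position seen so far
    for p in range(n - 1, -1, -1):
        c = text[p]
        if c in static_alphabet:
            enc[p] = c
        else:
            if c in leftmost:
                # The old leftmost occurrence is no longer the first one.
                enc[leftmost[c]] = leftmost[c] - p
            enc[p] = 0
            leftmost[c] = p
        yield p + 1, list(enc[p:])  # copy only for illustration


# Hypothetical example with s-characters {a} and p-characters {x, y}.
for i, code in suffix_encodings_right_to_left("xaxyay", {"a"}):
    print(i, code)
```

Each step performs a constant number of dictionary operations and array updates, apart from the copy made for printing.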

Diptarama et al. [5] proposed another version of the parameterized position heap, defined for a sequence of previous encoded suffixes of the input p-string T arranged in decreasing order of their length. Since their algorithm processes T from left to right, we sometimes call their structure a left-to-right p-position heap (LR p-position heap), while we call our PPH(T) a right-to-left p-position heap (RL p-position heap) since our construction algorithm processes T from right to left.

For any p-string x ∈ (Σ ∪ Π)*, we say that x is represented by PPH(T) iff PPH(T) has a path which starts from the root and spells out prev(x).

Lemma 1.

For any p-string T of length n, PPH(T) consists of exactly n + 1 nodes. Also, there is a one-to-one correspondence between the positions 1, ..., n in T and the non-root nodes of PPH(T).

Proof.

Initially, PPH(ε) consists only of the root that represents ε. For each 1 ≤ i ≤ n, since prev(T[i..]) ∉ Prefix(prev(T[j..])) for any i < j ≤ n, it is clear that there is a prefix of prev(T[i..]) that is not represented by PPH(T[i+1..]). Therefore, when we construct PPH(T[i..]) from PPH(T[i+1..]), exactly one node is inserted, which corresponds to position i. ∎

Let V_T be the set of nodes of PPH(T). Based on Lemma 1, we define a bijection id : V_T → {1, ..., n + 1} such that id(r) = n + 1 for the root r and id(v) = i iff v was the node that was inserted when constructing PPH(T[i..]) from PPH(T[i+1..]).

Unlike our RL p-position heap, Diptarama et al.’s LR p-position heap can have double nodes to which two positions of the text p-string are associated.

We remark that the pattern matching algorithm of Diptarama et al. [5] can be applied to our RL p-position heap PPH(T) for a text p-string T, and this way one can solve the parameterized pattern matching problem in O(m log(σ + π) + mπ + pocc) time, where pocc is the number of positions i in text T such that the pattern p-string P of length m and the corresponding substring T[i..i+m−1] p-match. We note that since our RL p-position heap does not have double nodes, the pattern matching algorithm can be somewhat simplified.

The following lemma is an analogue to Lemma 6 of [5] for Diptarama et al.’s LR p-position heap.

Lemma 2.

For any p-string x ∈ (Σ ∪ Π)*, if x is represented by PPH(T), then for any substring y of x, y is represented by PPH(T).

Proof.

The lemma can be shown in a similar way to Lemma 6 of [5]. For the sake of completeness, we provide a full proof below.

First, we show that for any proper prefix of with , is represented by . It follows from the definition of previous encoding that , and hence is a prefix of . Since is represented by and , is also represented by .

Now it suffices for us to show that for any proper suffix of with , is represented by , since then we can inductively apply the above discussion for the prefixes. By the above discussions for the prefixes of , there exist positions in such that for . By the definition of , the root has an out-going edge labeled by , and this is the base case for our induction. Since , we have . Now since and , is a prefix of . This implies that if is represented by , then is also represented by . By induction, we have that is represented by . Applying the same argument inductively, it is immediate that with are also represented by . ∎

In the next section, we show how to construct our RL p-position heap for an input text p-string T of length n in O(n log(σ + π)) time and O(n) space.

4 Right to left construction of parameterized position heaps

In this section, we present our algorithm which constructs PPH(T) of a given p-string T in a right-to-left online manner. The key to our construction algorithm is the use of reversed suffix links, which will be defined in the following subsection.

4.1 Reversed suffix links

For convenience, we will sometimes identify each node v of PPH(T) with the path label from the root to v. In our right-to-left online construction of PPH(T), we use the reversed suffix links, which are a generalization of the Weiner links that are used in the right-to-left construction of the suffix tree [14] for (standard) string matching:

Definition 3 (Reversed suffix links).

For any node of and a character , let

Figure 2: Illustration of the reversed suffix links of with the same p-string as in Figure 1. The reversed suffix links and their labels are shown in red.

It is clear that by taking one link from a node, the node depth (and hence the string length) increases by exactly one.

Observe that the first case of the definition of is a direct extension of the Weiner links, where points to the node that is obtained by prepending to . The second case, however, is a special case that arises in parameterized pattern matching. The following lemma ensures that our reversed suffix links are well defined:

Lemma 3.

For any node in and a character , let , where is a node of . Then, for any string such that , .

Proof.

In the first case of the definition of where , we have . Hence, .

In the second case of the definition of where , we have , which implies that and for any . Thus, . ∎

The next proposition shows that there is a monotonicity in the labels of the reversed suffix links that come from the nodes in the same path of .

Proposition 1.

Suppose there is a reversed suffix link of a node with . Let be any ancestor of . Then, if , has a reversed suffix link . Also, if and , then has a reversed suffix link , and if and , then has a reversed suffix link .

Proof.

It suffices for us to show that the lemma holds for the parent of , since then the lemma inductively holds for any ancestor of . Note that . Let .

If , then . Hence, the parent of is . Therefore, there is a reversed suffix link .

If and , then it follows from the definition of that and . Since , we have that and . Thus is represented by . Consequently, there is a reversed suffix link .

If and , then it follows from the definition of that and . Thus is represented by . Consequently, there is a reversed suffix link . ∎

4.2 Adding a new node

Our algorithm processes a given p-string T of length n from right to left and maintains PPH(T[i..]) in decreasing order of i = n, ..., 1. Initially, we begin with PPH(ε), which consists of the root r representing the empty string ε. For convenience, we use an auxiliary node ⊥ as a parent of the root r, and create reversed suffix links rsl(a, ⊥) = r for every a ∈ Σ ∪ {0}.

Now suppose we have constructed PPH(T[i+1..]) for 1 ≤ i ≤ n, and we will update it to PPH(T[i..]). In so doing, we begin with node v such that id(v) = i + 1. We know the locus of this node since v is the node that was inserted at the last step when PPH(T[i+1..]) was constructed from PPH(T[i+2..]). Note also that this node is a leaf in PPH(T[i+1..]). We climb up the path from v until finding its lowest ancestor u that satisfies the following. There are three cases:

  1. If T[i] ∈ Σ, then u is the lowest ancestor of v such that rsl(T[i], u) is defined.

  2. If T[i] ∈ Π and T[i] ≠ T[j] for any i < j ≤ n, then u is the lowest ancestor of v such that rsl(0, u) is defined.

  3. Otherwise, let a = j − i where j is the smallest position such that j > i and T[j] = T[i]. Then u is the lowest ancestor of v such that rsl(a, u) is defined if it exists, and u is the lowest ancestor of v such that rsl(0, u) is defined otherwise.

Let u′ be the node of PPH(T[i+1..]) that is pointed to by the reversed suffix link of u as above. Then, we create a new node v′ as a child of u′ such that id(v′) = i. The new edge (u′, v′) is labeled by prev(T[i..])[|u′| + 1]. We repeat the above procedure for all positions i in T in decreasing order. See also Figure 3 for concrete examples.
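
The update procedure above can be sketched in Python as follows. This is an illustrative reading of the three cases, not the authors' implementation: the class `Node`, the helper names, the use of dictionaries for edges and reversed suffix links (instead of balanced search trees), and the label n + 1 for the root are all assumptions made for this example. The auxiliary parent of the root is modelled by `None`, and the single new reversed suffix link created per step is the one described in Section 4.3.

```python
class Node:
    def __init__(self, node_id, parent, label):
        self.id, self.parent, self.label = node_id, parent, label
        self.depth = 0 if parent is None else parent.depth + 1
        self.children = {}  # edge label -> child
        self.rsl = {}       # link label -> target node (reversed suffix links)


def link_label(c, a, depth, static_alphabet):
    """Label of the reversed suffix link to use at a node of the given depth."""
    if c in static_alphabet:
        return c                                    # Case 1
    return a if a is not None and a <= depth else 0  # Cases 2 and 3


def build_rl_p_position_heap(text, static_alphabet):
    """Return a dict mapping ids 1..n (and n + 1 for the root) to nodes."""
    n = len(text)
    root = Node(n + 1, None, None)
    nodes = {n + 1: root}
    enc = [None] * n  # enc[p:] holds the previous encoding of the current suffix
    leftmost = {}     # p-character -> leftmost 0-based position seen so far
    v = root
    for p in range(n - 1, -1, -1):
        c, a = text[p], None
        if c in static_alphabet:
            enc[p] = c
        else:
            if c in leftmost:
                a = leftmost[c] - p  # distance to the next occurrence of c
                enc[leftmost[c]] = a
            enc[p] = 0
            leftmost[c] = p
        # Climb from v to the lowest ancestor u that has the required link.
        u, below = v, None
        while u is not None and link_label(c, a, u.depth, static_alphabet) not in u.rsl:
            below, u = u, u.parent
        # u is None models the auxiliary node, whose links all point to the root.
        target = root if u is None else u.rsl[link_label(c, a, u.depth, static_alphabet)]
        new = Node(p + 1, target, enc[p + target.depth])
        target.children[new.label] = new
        nodes[p + 1] = new
        # The only new reversed suffix link points to the new node.
        below.rsl[link_label(c, a, below.depth, static_alphabet)] = new
        v = new
    return nodes


def path_label(node):
    """Spell out the edge labels on the path from the root to node."""
    labels = []
    while node.parent is not None:
        labels.append(node.label)
        node = node.parent
    return tuple(reversed(labels))


# Hypothetical example with s-characters {a} and p-characters {x, y}.
heap = build_rl_p_position_heap("xaxyay", {"a"})
for i in sorted(heap):
    print(i, path_label(heap[i]))
```

On the example text the sketch produces the same nodes as inserting the previous encoded suffixes one by one from the shortest to the longest.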

Figure 3: A snapshot of updating for with the same p-string as in Figures 1 and 2. First, we update (upper left) to (upper right). Since and , we first try to find the lowest ancestor of the node with id that has a reversed suffix link labeled with by climbing up the path. However, it does not exist, and then we arrive at the lowest ancestor with id whose depth is  (). Hence the second sub-case of Case 3 is applied, and using its reversed suffix link we move to the node with id . The new node with id is inserted as its child. Next, we update (upper right) to (lower left). Since and , we first try to find the lowest ancestor of the node with id that has a reversed suffix link labeled with by climbing up the path, and we arrive at the node with id . Hence the first sub-case of Case 3 is applied, and using its reversed suffix link we move to the node with id . The new node with id is inserted as its child. Finally, we update (lower left) to (lower right). Since , Case 1 is applied. Thus we try to find the lowest ancestor of the node with id that has a reversed suffix link labeled with by climbing up the path, and we arrive at the root. Using its reversed suffix link, we move to the node with id . The new node with id is inserted as its child.

Lemma 4.

The above algorithm correctly updates PPH(T[i+1..]) to PPH(T[i..]).

Proof.

Note that and are prefixes of . Let be the character in that is used in the reversed suffix link as above.

In Cases 1 and 2 above, we have or . Then it is clear that is a prefix of . Since is the lowest ancestor of for which is defined, is the longest prefix of that is represented by . Hence, the new node and its incoming edge labeled by are correctly inserted.

Consider Case 3 above. We first try to find in the first sub-case, where . If it exists, then is the lowest ancestor of such that is defined, and thus . It now follows from Lemma 2 that is the longest prefix of that is represented by . Hence, the new node and its incoming edge labeled by are correctly inserted in this sub-case. It is clear that in the first sub-case is at least of depth . Hence, if we arrive at the ancestor of of depth without encountering the lowest ancestor satisfying the condition of the first sub-case, then we try to find the lowest ancestor of that has a reversed suffix link labeled by (second sub-case). Thus, by a similar argument to Case 2, the new node and its incoming edge labeled by are correctly inserted in this second sub-case. ∎

4.3 Adding a new reversed suffix link

After inserting the new node , we need to maintain the reversed suffix links corresponding to .

Lemma 5.

There is exactly one reversed suffix link that points to the new node in . Moreover, this reversed suffix link comes from the ancestor of of depth .

Proof.

Suppose on the contrary that there are two distinct nodes and each of which has a reversed suffix link pointing to . The label of any reversed suffix link that points to is uniquely determined by the path label from the root to . Therefore, the reversed suffix links of and that point to are both labeled by the same symbol. This means that , however, this contradicts the definition of the p-position heap. Hence, there is at most one node which has a reversed suffix link that points to .

Let be the ancestor of of depth . Also, let , namely, is the text character that corresponds to the label of the edge that is on the path from the root to , and to the label of the new edge . If and is the smallest position in such that , then is labeled with while is labeled with . Otherwise, the label of the new edge must be equal to that of . It follows from the definition of reversed suffix links that in both cases the reversed suffix link to comes from . ∎

Lemma 6.

There is no reversed suffix link that comes from the new node in .

Proof.

Suppose on the contrary that there is a reversed suffix link from in , and let be the node that is pointed by this reversed suffix link. Notice that . Let be the suffix of for which this node was inserted, namely, . By Lemma 2, for any substring of , is represented by , and hence it is also represented by since . Recall that , which implies that the node existed already in . However, this contradicts that is the node that was inserted when was updated to . ∎

Due to Lemmas 5 and 6, there is only one reversed suffix link that is newly inserted in .

4.4 Complexity analysis

Lemma 7.

The proposed algorithm runs in a total of O(n log(σ + π)) time with O(n) space.

Proof.

For each , the algorithm updates to . The update begins with node such that , and climbs up the path to . It takes a reversed suffix link from and moves to of depth , and the new node of depth with is inserted. Hence the total number of nodes visited when updating to is . Thus, the total number of nodes visited for all sums up to . At each node that we visit, it takes time to search for the corresponding reversed suffix link, as well as inserting a new edge. Hence, the total time cost is .

It is clear that the number of nodes in is , including the root and the auxiliary node . It follows from Lemmas 5 and 6 that the number of reversed suffix links coming out from the root, the internal nodes, and the leaves is . As for the reversed suffix links that come from to the root, we add a new reversed suffix link labeled with only if and for any . This way, we can maintain these reversed suffix links from in an online manner, using space. ∎
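
The counting argument in the proof can be written as a telescoping sum. The following is a sketch under notation introduced here for illustration only: d_i denotes the depth of the node inserted for position i (with d_{n+1} = 0 for the root), and c is a small constant accounting for the nodes touched besides the climbed path. Since the new node lies two levels below the node where the climb stops, the climb at step i covers d_{i+1} − d_i + O(1) nodes.

```latex
\sum_{i=1}^{n} \bigl( d_{i+1} - d_i + c \bigr)
  \;=\; d_{n+1} - d_1 + c\,n
  \;\le\; c\,n
  \;=\; O(n),
\qquad \text{since } d_{n+1} = 0 \text{ and } d_1 \ge 1 .
```

Multiplying the number of visited nodes by the logarithmic cost of one lookup or insertion in a balanced search tree gives the stated time bound.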

We have proven the following theorem, which is the main result of this paper.

Theorem 1.

For an input p-string T of length n, the proposed algorithm constructs PPH(T[i..]) in a right-to-left online manner for i = n, ..., 1, in a total of O(n log(σ + π)) time with O(n) space.

5 Parameterized pattern matching with augmented PPH(T)

Ehrenfeucht et al. [6] introduced maximal reach pointers, which are used for efficient pattern matching queries on position heaps. Diptarama et al. [5] introduced maximal reach pointers for their LR p-position heaps, and showed how to perform pattern matching queries in O(m log(σ + π) + mπ + pocc) time, where m is the length of a given pattern p-string and pocc is the number of occurrences to report. We can naturally extend the notion of maximal reach pointers to our RL p-position heaps, as follows:

Definition 4 (Maximal reach pointers).

For each position 1 ≤ i ≤ n in T, the maximal reach pointer of the node v with id(v) = i points to the deepest node u of PPH(T) such that u is a prefix of prev(T[i..]).

We denote by mrp(i) the pointer of node v such that id(v) = i. The augmented PPH(T) is PPH(T) with the maximal reach pointers of all nodes. For simplicity, if mrp(i) points to the node with id i, then we omit this pointer. See Figure 4 for an example of maximal reach pointers and augmented PPH(T).