1 Introduction
Text indexing is the task of preprocessing a text string so that subsequent pattern matching queries can be answered efficiently. To date, numerous text indexing structures for exact pattern matching have been proposed, ranging from classical data structures such as suffix trees [14], directed acyclic word graphs [2, 3], and suffix arrays [10], to more advanced ones such as compressed suffix arrays [8] and the FM-index [7], to mention just a few.
Ehrenfeucht et al. [6] proposed a text indexing structure called position heaps. Ehrenfeucht et al.’s position heap is constructed in a right-to-left online manner, where a new node is incrementally inserted into the current position heap for each decreasing position $i = n, \ldots, 1$ in the input string $T$ of length $n$. In other words, Ehrenfeucht et al.’s position heap is defined over a sequence of the suffixes of $T$ in increasing order of their length, starting from the empty string $\varepsilon$ of length $0$. Kucherov [9] proposed another variant of position heaps. Kucherov’s position heap is constructed in a left-to-right online manner, where a new node is incrementally inserted into the current position heap for each increasing position $i = 1, \ldots, n$. In other words, Kucherov’s position heap is defined over a sequence of the suffixes of $T$ in decreasing order of their length. We call Ehrenfeucht et al.’s position heap the RL position heap, and Kucherov’s position heap the LR position heap. Both the RL and LR position heaps for a text string $T$ of length $n$ require $O(n)$ space and can be constructed in $O(n \log \sigma)$ time, where $\sigma$ is the alphabet size. By augmenting the RL and LR position heaps of $T$ with auxiliary links called maximal reach pointers, pattern matching queries can be answered in $O(m \log \sigma + occ)$ time, where $m$ is the length of a query pattern $P$ and $occ$ is the number of occurrences of $P$ in $T$.
Nakashima et al. [12] proposed position heaps for a set of strings that is given as a reversed trie, and proposed an algorithm that constructs the position heap of a given trie in time and space, where is the size of the input trie. Later, the same authors showed how to construct the position heap of a trie in time and space, for integer alphabets of size polynomially bounded in [13].
Baker [1] introduced the parameterized pattern matching problem, which seeks the occurrences of substrings of the text that have the “same” structure as the given pattern . Parameterized pattern matching is motivated by, e.g., software maintenance and plagiarism detection [1]. More formally, we consider two distinct alphabets and , and we call an element over a p-string. The parameterized pattern matching problem is, given two p-strings and , to find all occurrences of substrings of that can be transformed to by a bijection from to which is the identity for . For instance, if and where and , then the positions to output are and . To see why, observe that for the substring there is a bijection , , , , and that maps the substring to . Also, observe that for the other substring , there is a bijection , , , , and that maps the substring to as well.
Among the various algorithms and indexing structures for parameterized pattern matching (see [11] for a survey), we focus on Diptarama et al.’s parameterized position heaps [5]. Diptarama et al.’s parameterized position heaps are based on Kucherov’s LR position heaps, which are constructed in a left-to-right online manner. Let us call their structure the LR p-position heap. Diptarama et al. showed how to construct the LR p-position heap for a given text of length in time with space, where and . They also showed that the LR p-position heap augmented with maximal reach pointers can support parameterized pattern matching queries in time, where is the number of occurrences to report.
In this paper, we propose RL p-position heaps, which are constructed in a right-to-left online manner. We show how to construct our RL p-position heap for a given text string of length in time with space. Our construction algorithm is based on Ehrenfeucht et al.’s construction algorithm for RL position heaps [6] and on Weiner’s suffix tree construction algorithm [14]. Namely, we use reversed suffix links defined for the nodes of RL p-position heaps. The key to our algorithm is how to label the reversed suffix links, which will be clarified in Definition 3. Using our RL p-position heap augmented with maximal reach pointers, one can perform parameterized pattern matching queries in time.
2 Preliminaries
2.1 Notations on strings
Let and be disjoint sets called a static alphabet and a parameterized alphabet, respectively. Let and . An element of is called an s-character, and that of is called a p-character. In the sequel, both an s-character and a p-character are sometimes simply called a character. An element of is called a string, and an element of is called a p-string. The length of a (p-)string is the number of characters contained in . The empty string is a string of length 0, namely, . For a (p-)string , , and are called a prefix, substring, and suffix of , respectively. The set of prefixes, substrings, and suffixes of a (p-)string is denoted by , , and , respectively. The th character of a (p-)string is denoted by for , and the substring of a (p-)string that begins at position and ends at position is denoted by for . For convenience, let if . Also, let for any .
2.2 Parameterized pattern matching
For any p-string and , let . Two p-strings and of length each are said to parameterized match (p-match) iff there is a bijection on such that for any and for all . For instance, if and , then and p-match since there is a bijection such that , , , , and and . We write iff and p-match.
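The p-match definition can be checked directly by trying to build the required bijection one character pair at a time. The following Python sketch is only an illustration: the function name `p_match` and the argument `static` (the set of s-characters; every other character is treated as a p-character) are assumptions of this example, not notation from the paper.

```python
def p_match(x, y, static):
    """Return True iff x and y p-match, where `static` is the set of s-characters.

    A partial injective map on the p-characters always extends to a bijection
    on a finite alphabet, so it suffices to check consistency in both directions.
    """
    if len(x) != len(y):
        return False
    forward, backward = {}, {}
    for a, b in zip(x, y):
        if a in static or b in static:
            # The bijection must be the identity on s-characters.
            if a != b:
                return False
            continue
        # Each p-character must map to one image and have one preimage.
        if forward.setdefault(a, b) != b or backward.setdefault(b, a) != a:
            return False
    return True
```

For example, with s-characters `a` and `b`, the strings `xaybx` and `zaxbz` p-match (via x to z, y to x), whereas `xaybx` and `zaxby` do not.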
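The previous encoding of a p-string $x$, denoted $\mathsf{prev}(x)$, is a sequence of length $|x|$ in which the first occurrence of each p-character is replaced with $0$, any other occurrence of a p-character is replaced by the distance to its previous occurrence in $x$, and each s-character remains the same. More formally, $\mathsf{prev}(x)$ is a sequence over $\Sigma \cup \{0, \ldots, |x|-1\}$ of length $|x|$ such that for each $1 \le i \le |x|$,

$$\mathsf{prev}(x)[i] = \begin{cases} x[i] & \text{if } x[i] \in \Sigma, \\ 0 & \text{if } x[i] \in \Pi \text{ and } x[i] \neq x[j] \text{ for any } 1 \le j < i, \\ i - j & \text{if } x[i] \in \Pi,\ x[i] = x[j] \text{ for some } 1 \le j < i, \text{ and } x[i] \neq x[k] \text{ for any } j < k < i. \end{cases}$$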
Observe that two p-strings p-match iff their previous encodings are identical. Using the same example as above, we have that .
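The previous encoding described above can be computed in one left-to-right scan that remembers the last position of each p-character. The sketch below is illustrative: the name `prev_encode` and the argument `params` (the set of p-characters) are assumptions of this example, and s-characters are kept as Python strings while distances are stored as integers.

```python
def prev_encode(x, params):
    """Previous encoding of x; `params` is the set of p-characters."""
    last = {}   # p-character -> index of its most recent occurrence
    out = []
    for i, c in enumerate(x):
        if c in params:
            # First occurrence becomes 0, later ones the distance backwards.
            out.append(i - last[c] if c in last else 0)
            last[c] = i
        else:
            # s-characters are left unchanged.
            out.append(c)
    return out
```

Two p-strings that p-match, such as `xaybx` and `zaxbz` over p-characters `x`, `y`, `z`, receive the same encoding, which is what makes the encoding usable as a canonical form.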
Let $T$ and $P$ be p-strings of length $n$ and $m$, respectively, where $n \ge m$. The parameterized pattern matching problem is to find all positions $i$ in $T$ such that $T[i..i+m-1]$ and $P$ p-match.
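As a baseline for the problem just stated, one can compare the previous encoding of every length-$m$ window of the text with that of the pattern. The sketch below is a naive quadratic-time reference, not the heap-based method of this paper; `p_occurrences`, `prev_encode`, and `params` are names assumed for this example, and reported positions are 1-indexed.

```python
def prev_encode(x, params):
    """Previous encoding of x; `params` is the set of p-characters."""
    last, out = {}, []
    for i, c in enumerate(x):
        if c in params:
            out.append(i - last[c] if c in last else 0)
            last[c] = i
        else:
            out.append(c)
    return out


def p_occurrences(text, pattern, params):
    """Return all 1-indexed positions where a window of text p-matches pattern."""
    m = len(pattern)
    target = prev_encode(pattern, params)
    # Each window is re-encoded from scratch, so the window's own first
    # occurrences are encoded as 0 regardless of the text to its left.
    return [i + 1
            for i in range(len(text) - m + 1)
            if prev_encode(text[i:i + m], params) == target]
```

Re-encoding each window is what makes this correct: slicing the encoding of the whole text would leave distances that point outside the window.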
3 Parameterized position heaps
Let be a sequence of strings such that for any , for any . For convenience, we assume that .
Definition 1 (Sequence hash trees [4]).
The sequence hash tree of a sequence of strings, denoted , is a trie structure that is recursively defined as follows: Let . Then
where is the longest prefix of which satisfies , , and is the shortest prefix of which satisfies .
Note that since we have assumed that each is not a prefix of for any , the new node and new edge always exist for each . Clearly contains nodes (including the root).
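The recursive definition amounts to inserting, for each string in turn, the shortest prefix that is not yet in the trie. A minimal Python sketch follows, assuming the trie is stored as nested dictionaries and that the leading empty string of the sequence is represented by the root itself (so it is not passed in); the name `sequence_hash_tree` is hypothetical.

```python
def sequence_hash_tree(strings):
    """Build a sequence hash tree as nested dicts: node = {character: child node}."""
    root = {}
    for s in strings:
        node = root
        for c in s:
            if c not in node:
                # Shortest prefix of s not yet represented: add one node and stop.
                node[c] = {}
                break
            node = node[c]
        else:
            # Reached only if every prefix of s is already represented,
            # which the assumption on the sequence is meant to rule out.
            raise ValueError(f"no new node can be created for {s!r}")
    return root
```

Each string contributes exactly one new node, so a sequence of $k$ non-empty strings yields a trie with $k$ non-root nodes.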
In what follows, we will define our indexing data structure for a text p-string of length . Let be the sequence of previous encoded suffixes of arranged in increasing order of their length. It is clear that for any and for any . Hence we can naturally define the sequence hash tree for , and we obtain our data structure:
Definition 2 (Parameterized position heaps).
The parameterized position heap (p-position heap) for a p-string , denoted , is the sequence hash tree of , i.e., .
See Figure 1 for an example of our p-position heap.
Note that we can obtain by adding at the beginning of . This also means that for each . Hence, we can construct by processing the input string from right to left. We remark that we can easily compute from in a total of time for all using extra space, e.g., by maintaining a balanced search tree that stores the distinct p-characters that have occurred in and records the leftmost occurrences of these p-characters in the nodes.
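The right-to-left update of the previous encoding can be sketched as follows: prepending a p-character adds a new $0$ at the front and, if that p-character already occurs in the current suffix, turns the $0$ at its leftmost occurrence into the distance between the two positions. In this illustrative Python sketch a plain dictionary stands in for the balanced search tree mentioned above, and the function name `suffix_prev_encodings` is an assumption. Copying each encoding into the result makes the sketch quadratic overall; only the update step itself is constant-time.

```python
def suffix_prev_encodings(text, params):
    """Return the previous encodings of the suffixes of text, shortest first."""
    n = len(text)
    rev = []        # rev[k] holds the code of text index n-1-k (0-indexed)
    leftmost = {}   # p-character -> leftmost index within the current suffix
    result = []
    for i in range(n - 1, -1, -1):
        c = text[i]
        if c in params:
            if c in leftmost:
                j = leftmost[c]
                # The old first occurrence now has a predecessor at index i.
                rev[n - 1 - j] = j - i
            leftmost[c] = i
            rev.append(0)
        else:
            rev.append(c)
        result.append(rev[::-1])
    return result
```

All other entries stay untouched, because every non-first occurrence already refers to a predecessor inside the old suffix.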
Diptarama et al. [5] proposed another version of the parameterized position heap, defined for a sequence of previous encoded suffixes of the input p-string arranged in decreasing order of their length. Since their algorithm processes the input p-string from left to right, we sometimes call their structure a left-to-right p-position heap (LR p-position heap), while we call ours a right-to-left p-position heap (RL p-position heap) since our construction algorithm processes the input p-string from right to left.
For any p-string , we say that is represented by iff has a path which starts from the root and spells out .
Lemma 1.
For any string of length , consists of exactly nodes. Also, there is a one-to-one correspondence between the positions in and the non-root nodes of .
Proof.
Initially, consists only of the root that represents . For each , since for any , it is clear that there is a prefix of that is not represented by . Therefore, when we construct from , then exactly one node is inserted, which corresponds to position . ∎
Let be the set of nodes of . Based on Lemma 1, we define a bijection such that for the root and iff was the node that was inserted when constructing from .
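Lemma 1 and the position labelling of the nodes can be observed on small inputs with a naive construction that inserts the previous-encoded suffixes from shortest to longest. The sketch below is not the construction algorithm of Section 4; it represents the trie as a dictionary from path labels (tuples) to ids, and the names `build_pph` and `prev_encode`, as well as the choice of `None` as the id of the root, are assumptions of this example.

```python
def prev_encode(x, params):
    """Previous encoding of x; `params` is the set of p-characters."""
    last, out = {}, []
    for i, c in enumerate(x):
        if c in params:
            out.append(i - last[c] if c in last else 0)
            last[c] = i
        else:
            out.append(c)
    return out


def build_pph(text, params):
    """Naive RL p-position heap: {path label (tuple): 1-indexed position}."""
    heap = {(): None}                      # the root; its id is a placeholder
    for i in range(len(text), 0, -1):      # positions n, n-1, ..., 1
        code = tuple(prev_encode(text[i - 1:], params))
        for k in range(1, len(code) + 1):
            if code[:k] not in heap:
                heap[code[:k]] = i         # shortest prefix not yet represented
                break
    return heap
```

For the text `xaxyx` with p-characters `x` and `y`, this yields six entries: the root plus one node for each of the five positions.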
Unlike our RL p-position heap, Diptarama et al.’s LR p-position heap can have double nodes, to which two positions of the text p-string are associated.
We remark that the pattern matching algorithm of Diptarama et al. [5] can be applied to our RL p-position heap for a text p-string , and this way one can solve the parameterized pattern matching problem in time, where is the number of positions in text such that the pattern p-string of length and the corresponding substring p-match. We note that since our RL p-position heap does not have double nodes, the pattern matching algorithm can be somewhat simplified.
The following lemma is an analogue of Lemma 6 of [5] for Diptarama et al.’s LR p-position heap.
Lemma 2.
For any if is represented by , then for any substring of , is represented by .
Proof.
The lemma can be shown in a similar way to Lemma 6 of [5]. For the sake of completeness, we provide a full proof below.
First, we show that for any proper prefix of with , is represented by . It follows from the definition of previous encoding that , and hence is a prefix of . Since is represented by and , is also represented by .
Now it suffices for us to show that for any proper suffix of with , is represented by , since then we can inductively apply the above discussion for the prefixes. By the above discussions for the prefixes of , there exist positions in such that for . By the definition of , the root has an outgoing edge labeled by , and this is the base case for our induction. Since , we have . Now since and , is a prefix of . This implies that if is represented by , then is also represented by . By induction, we have that is represented by . Applying the same argument inductively, it is immediate that with are also represented by . ∎
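The closure property of Lemma 2 can be checked mechanically on small examples. In the sketch below, a heap is represented only by the set of path labels of its nodes, and the previous encoding of a substring is obtained from a slice of a longer encoding by resetting to $0$ every distance that points before the start of the slice. The names `heap_labels`, `reencode`, and `closed_under_substrings` are hypothetical, and the construction is the naive one, not the algorithm of Section 4.

```python
def prev_encode(x, params):
    """Previous encoding of x; `params` is the set of p-characters."""
    last, out = {}, []
    for i, c in enumerate(x):
        if c in params:
            out.append(i - last[c] if c in last else 0)
            last[c] = i
        else:
            out.append(c)
    return out


def heap_labels(text, params):
    """Path labels (tuples) of all nodes of the naively built RL p-position heap."""
    labels = {()}
    for i in range(len(text) - 1, -1, -1):       # suffixes, shortest first
        code = tuple(prev_encode(text[i:], params))
        for k in range(1, len(code) + 1):
            if code[:k] not in labels:
                labels.add(code[:k])
                break
    return labels


def reencode(piece):
    """Previous encoding of a substring, given the matching slice of an encoding."""
    # A distance larger than the local index refers to a position before the slice.
    return tuple(0 if isinstance(c, int) and c > k else c
                 for k, c in enumerate(piece))


def closed_under_substrings(labels):
    """True iff every substring of every label, once re-encoded, is also a label."""
    return all(reencode(w[a:b]) in labels
               for w in labels
               for a in range(len(w))
               for b in range(a + 1, len(w) + 1))
```

On the text `xaxyx` with p-characters `x` and `y`, the check succeeds; for instance the suffix `(0, 2)` of the label `(0, 0, 2)` re-encodes to `(0, 0)`, which is itself a node.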
In the next section, we show how to construct our RL p-position heap for an input text p-string of length in time and space.
4 Right-to-left construction of parameterized position heaps
In this section, we present our algorithm which constructs of a given p-string in a right-to-left online manner. The key to our construction algorithm is the use of reversed suffix links, which will be defined in the following subsection.
4.1 Reversed suffix links
For convenience, we will sometimes identify each node of with the path label from the root to . In our right-to-left online construction of , we use the reversed suffix links, which are a generalization of the Weiner links that are used in right-to-left construction of the suffix tree [14] for (standard) string matching:
Definition 3 (Reversed suffix links).
For any node of and a character , let
It is clear that taking one reversed suffix link from a node increases the node depth (and hence the string length) by exactly one.
Observe that the first case of the definition of is a direct extension of the Weiner links, where points to the node that is obtained by prepending to . The second case, however, is a special case that arises in parameterized pattern matching. The following lemma ensures that our reversed suffix links are well defined:
Lemma 3.
For any node in and a character , let , where is a node of . Then, for any string such that , .
Proof.
In the first case of the definition of where , we have . Hence, .
In the second case of the definition of where , we have , which implies that and for any . Thus, . ∎
The next proposition shows that there is a monotonicity in the labels of the reversed suffix links that come from the nodes in the same path of .
Proposition 1.
Suppose there is a reversed suffix link of a node with . Let be any ancestor of . Then, if , has a reversed suffix link . Also, if and , then has a reversed suffix link , and if and , then has a reversed suffix link .
Proof.
It suffices for us to show that the proposition holds for the parent of , since then the proposition inductively holds for any ancestor of . Note that . Let .
If , then . Hence, the parent of is . Therefore, there is a reversed suffix link .
If and , then it follows from the definition of that and . Since , we have that and . Thus is represented by . Consequently, there is a reversed suffix link .
If and , then it follows from the definition of that and . Thus is represented by . Consequently, there is a reversed suffix link . ∎
4.2 Adding a new node
Our algorithm processes a given p-string of length from right to left and maintains in decreasing order of . Initially, we begin with which consists of the root representing the empty string . For convenience, we use an auxiliary node as a parent of the root , and create reversed suffix links for every .
Now suppose we have constructed for , and we will update it to . In so doing, we begin with node such that . We know the locus of this node since is the node that was inserted at the last step when was constructed from . Note also that this node is a leaf in . We climb up the path from until finding its lowest ancestor that satisfies the following. There are three cases:

If , then is the lowest ancestor of such that is defined.

If and for any , then is the lowest ancestor of such that is defined.

Otherwise, let where is the smallest position such that and . Then is the lowest ancestor of such that is defined if it exists, and is the lowest ancestor of such that is defined otherwise.
Let be the node of that is pointed to by the reversed suffix link of as above. Then, we create a new node as a child of such that . The new edge is labeled by . We repeat the above procedure for all positions in in decreasing order. See also Figure 3 for concrete examples.
Lemma 4.
The above algorithm correctly updates to .
Proof.
Note that and are prefixes of . Let be the character in that is used in the reversed suffix link as above.
In Cases 1 and 2 above, we have or . Then it is clear that is a prefix of . Since is the lowest ancestor of for which is defined, is the longest prefix of that is represented by . Hence, the new node and its incoming edge labeled by are correctly inserted.
Consider Case 3 above. We first try to find in the first subcase, where . If it exists, then is the lowest ancestor of such that is defined, and thus . It now follows from Lemma 2 that is the longest prefix of that is represented by . Hence, the new node and its incoming edge labeled by are correctly inserted in this subcase. It is clear that in the first subcase is at least of depth . Hence, if we arrive at the ancestor of of depth without encountering the lowest ancestor satisfying the condition of the first subcase, then we try to find the lowest ancestor of that has a reversed suffix link labeled by (second subcase). Thus, by a similar argument to Case 2, the new node and its incoming edge labeled by are correctly inserted in this second subcase. ∎
4.3 Adding a new reversed suffix link
After inserting the new node , we need to maintain the reversed suffix links corresponding to .
Lemma 5.
There is exactly one reversed suffix link that points to the new node in . Moreover, this reversed suffix link comes from the ancestor of of depth .
Proof.
Suppose on the contrary that there are two distinct nodes and each of which has a reversed suffix link pointing to . The label of any reversed suffix link that points to is uniquely determined by the path label from the root to . Therefore, the reversed suffix links of and that point to are both labeled by the same symbol. This means that ; however, this contradicts the definition of the p-position heap. Hence, there is at most one node which has a reversed suffix link that points to .
Let be the ancestor of of depth . Also, let , namely, is the text character that corresponds to the label of the edge that is on the path from the root to , and to the label of the new edge . If and is the smallest position in such that , then is labeled with while is labeled with . Otherwise, the label of the new edge must be equal to that of . It follows from the definition of reversed suffix links that in both cases the reversed suffix link to comes from . ∎
Lemma 6.
There is no reversed suffix link that comes from the new node in .
Proof.
Suppose on the contrary that there is a reversed suffix link from in , and let be the node that is pointed to by this reversed suffix link. Notice that . Let be the suffix of for which this node was inserted, namely, . By Lemma 2, for any substring of , is represented by , and hence it is also represented by since . Recall that , which implies that the node existed already in . However, this contradicts that is the node that was inserted when was updated to . ∎
4.4 Complexity analysis
Lemma 7.
The proposed algorithm runs in a total of time with space.
Proof.
For each , the algorithm updates to . The update begins with node such that , and climbs up the path to . It takes a reversed suffix link from and moves to of depth , and the new node of depth with is inserted. Hence the total number of nodes visited when updating to is . Thus, the total number of nodes visited for all sums up to . At each node that we visit, it takes time to search for the corresponding reversed suffix link, as well as inserting a new edge. Hence, the total time cost is .
It is clear that the number of nodes in is , including the root and the auxiliary node . It follows from Lemmas 5 and 6 that the number of reversed suffix links coming out from the root, the internal nodes, and the leaves is . As for the reversed suffix links that come from to the root, we add a new reversed suffix link labeled with only if and for any . This way, we can maintain these reversed suffix links from in an online manner, using space. ∎
We have proven the following theorem, which is the main result of this paper.
Theorem 1.
For an input p-string of length , the proposed algorithm constructs in a right-to-left online manner for , in a total of time with space.
5 Parameterized pattern matching with augmented
Ehrenfeucht et al. [6] introduced maximal reach pointers, which are used for efficient pattern matching queries on position heaps. Diptarama et al. [5] introduced maximal reach pointers for their LR p-position heaps, and showed how to perform pattern matching queries in time, where is the length of a given pattern p-string and is the number of occurrences to report. We can naturally extend the notion of maximal reach pointers to our RL p-position heaps, as follows:
Definition 4 (Maximal reach pointers).
For each position in , the maximal reach pointer of the node with points to the deepest node of such that is a prefix of .
We denote by the pointer of node such that . The augmented is with the maximal reach pointers of all nodes. For simplicity, if points to the node with id , then we omit this pointer. See Figure 4 for an example of maximal reach pointers and augmented .
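Definition 4 can be illustrated by computing the pointers naively: for each position, descend from the root along the previous encoding of the corresponding suffix as far as the finished heap allows, and record the id of the node reached. The Python sketch below takes quadratic time and is meant only to make the definition concrete; `maximal_reach`, `build_pph`, and `prev_encode` are names assumed for this example, and the heap is stored as a dictionary from path labels to ids.

```python
def prev_encode(x, params):
    """Previous encoding of x; `params` is the set of p-characters."""
    last, out = {}, []
    for i, c in enumerate(x):
        if c in params:
            out.append(i - last[c] if c in last else 0)
            last[c] = i
        else:
            out.append(c)
    return out


def build_pph(text, params):
    """Naive RL p-position heap: {path label (tuple): 1-indexed position}."""
    heap = {(): None}
    for i in range(len(text), 0, -1):
        code = tuple(prev_encode(text[i - 1:], params))
        for k in range(1, len(code) + 1):
            if code[:k] not in heap:
                heap[code[:k]] = i
                break
    return heap


def maximal_reach(text, params):
    """Map each position i to the id of the deepest node that is a prefix of
    the previous encoding of the suffix starting at i."""
    heap = build_pph(text, params)
    mrp = {}
    for i in range(1, len(text) + 1):
        code = tuple(prev_encode(text[i - 1:], params))
        k = 0
        while k < len(code) and code[:k + 1] in heap:
            k += 1
        mrp[i] = heap[code[:k]]    # k >= 1, since node i itself lies on this path
    return mrp
```

For the text `xaya` with p-characters `x` and `y`, position 3 points to the node with id 1, while every other position points to its own node, which is the case where the pointer is omitted.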