Generalized Dictionary Matching under Substring Consistent Equivalence Relations

09/17/2019
by   Diptarama Hendrian, et al.
Tohoku University

Given a set of patterns called a dictionary and a text, the dictionary matching problem is the task of finding all occurrence positions of all patterns in the text. The dictionary matching problem can be solved efficiently by using the Aho-Corasick algorithm. Recently, Matsuoka et al. [TCS, 2016] proposed a generalization of the pattern matching problem under substring consistent equivalence relations and presented a generalization of the Knuth-Morris-Pratt algorithm to solve this problem. An equivalence relation ≈ is a substring consistent equivalence relation (SCER) if for two strings X, Y, X ≈ Y implies |X| = |Y| and X[i:j] ≈ Y[i:j] for all 1 ≤ i ≤ j ≤ |X|. In this paper, we propose a generalization of the dictionary matching problem and present a generalization of the Aho-Corasick algorithm for dictionary matching under SCERs. We present an algorithm that constructs SCER automata and an algorithm that performs dictionary matching under an SCER by using the automata. Moreover, we analyze the time and space complexity of our algorithms with respect to the size of the input strings.



1 Introduction

The pattern matching problem is one of the most fundamental problems in string processing and has been extensively studied due to its wide range of applications. Given a text T of length n and a pattern P of length m, the pattern matching problem is to find all occurrence positions of P in T. A naive approach is to compare every length-m substring of T with P, which takes O(nm) time. One of the algorithms that solve this problem in linear time and space is the Knuth-Morris-Pratt (KMP) algorithm [12]. The KMP algorithm constructs an O(m)-space array as a failure function by preprocessing the pattern in O(m) time, then uses the failure function to perform pattern matching in O(n) time.
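For concreteness, the KMP failure function and matching loop described above can be sketched as follows (an illustrative, standard implementation; the function names are ours):

```python
def kmp_failure(pattern):
    """fail[j] = length of the longest proper prefix of pattern[:j+1]
    that is also a suffix of pattern[:j+1]."""
    fail = [0] * len(pattern)
    k = 0
    for j in range(1, len(pattern)):
        while k > 0 and pattern[j] != pattern[k]:
            k = fail[k - 1]          # fall back along the failure function
        if pattern[j] == pattern[k]:
            k += 1
        fail[j] = k
    return fail

def kmp_match(text, pattern):
    """Return all 0-based occurrence positions of pattern in text."""
    fail, k, occ = kmp_failure(pattern), 0, []
    for i, c in enumerate(text):
        while k > 0 and c != pattern[k]:
            k = fail[k - 1]
        if c == pattern[k]:
            k += 1
        if k == len(pattern):
            occ.append(i - k + 1)
            k = fail[k - 1]          # continue searching for overlaps
    return occ
```

The preprocessing touches each pattern position a constant amortized number of times, and the matching loop reads the text once, giving the O(m) and O(n) bounds stated above.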

Many variants of the pattern matching problem have been studied for various applications, such as parameterized pattern matching [5] for detecting duplication in source code, order-preserving pattern matching [11, 13] for numerical analysis, permuted pattern matching [10] for multi-sensor data, and so on [15]. In order to solve these problems efficiently, the KMP algorithm has been extended to each of them: parameterized pattern matching [3], order-preserving pattern matching [11, 13], permuted pattern matching [6, 8], and Cartesian tree matching [15].

Recently, Matsuoka et al. [14] defined a general pattern matching problem under substring consistent equivalence relations. An equivalence relation ≈ is a substring consistent equivalence relation (SCER) [14] if for two strings X and Y, X ≈ Y implies |X| = |Y| and X[i:j] ≈ Y[i:j] for all 1 ≤ i ≤ j ≤ |X|. The equivalence relations used in parameterized pattern matching, order-preserving pattern matching, and permuted pattern matching are SCERs. Matsuoka et al. proposed a generalized KMP algorithm that can solve pattern matching under any SCER and analyzed its time complexity. They also showed periodicity properties of strings under SCERs.

The dictionary matching problem is the task of finding all occurrence positions of multiple patterns in a text. Given a set of patterns called a dictionary D, we can find the occurrence positions of all patterns in a text T by performing pattern matching for each pattern in the dictionary; however, this approach reads the text multiple times. Aho and Corasick [1] proposed an algorithm that performs dictionary matching in linear time by extending the failure function of the KMP algorithm. The Aho-Corasick (AC) algorithm constructs an automaton (called an AC-automaton) from D and then uses this automaton to find the occurrences of all patterns in the text. The AC-automaton of D uses O(d) space and can be constructed in O(d log σ) time, where d is the total length of the patterns in D and σ is the alphabet size. By using an AC-automaton, all occurrences of patterns in T can be found by reading T only once, which takes O(n log σ + occ) time, where occ is the number of occurrences. Similarly to the KMP algorithm, the AC algorithm has also been extended to perform dictionary matching on several variants of strings [7, 8, 9, 10, 11]. In order to perform dictionary matching efficiently, these extended AC algorithms encode the patterns in a dictionary and create an automaton from the encoded patterns instead of the patterns themselves.
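As a baseline for the generalization proposed below, here is a minimal sketch of the classical AC-automaton, with the goto function implemented as per-state associative arrays, failure links computed by breadth-first search, and output sets propagated along failure links (illustrative code; all names are ours):

```python
from collections import deque

def build_ac(patterns):
    """Build goto (trie), failure links, and output sets of an AC-automaton.
    States are integers; state 0 is the root."""
    goto, fail, out = [{}], [0], [set()]
    for p in patterns:
        s = 0
        for c in p:
            if c not in goto[s]:
                goto.append({}); fail.append(0); out.append(set())
                goto[s][c] = len(goto) - 1
            s = goto[s][c]
        out[s].add(p)
    queue = deque(goto[0].values())       # root children have fail = 0
    while queue:
        s = queue.popleft()
        for c, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and c not in goto[f]:
                f = fail[f]
            fail[t] = goto[f][c] if c in goto[f] and goto[f][c] != t else 0
            out[t] |= out[fail[t]]        # inherit outputs along failure links
    return goto, fail, out

def ac_search(text, goto, fail, out):
    """Yield (0-based start position, pattern) for every occurrence."""
    s = 0
    for i, c in enumerate(text):
        while s and c not in goto[s]:
            s = fail[s]
        s = goto[s].get(c, 0)
        for p in out[s]:
            yield (i - len(p) + 1, p)
```

On the classic example dictionary {he, she, his, hers}, searching the text "ushers" reports "she" at position 1 and "he", "hers" at position 2 (0-based).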

In this paper, we propose a generalization of the Aho-Corasick algorithm for dictionary matching under SCERs. The proposed algorithm encodes the patterns in the dictionary, and then constructs an automaton with a failure function, called a substring consistent equivalence relation automaton (SCER automaton), from the encoded strings. We present an algorithm to construct SCER automata and show how to perform dictionary matching by using them. Supposing that a string of length n can be encoded in E(n) time and that a substring of an already encoded string can be re-encoded in R(m) time, we show that the size of the SCER automaton is O(d) and that it can be constructed in O(E(d) + d(R(m) + log σ_Π)) time, where d is the sum of the pattern lengths in the dictionary, m is the length of the longest pattern in the dictionary, and σ_Π is the alphabet size of the encoded strings. Moreover, we show that dictionary matching under an SCER can be performed in O(E(n) + n(R(m) + log σ_Π) + occ) time by using SCER automata, where n is the length of the text and occ is the number of occurrences. By using our algorithm, we can perform dictionary matching under any SCER for which we have a prefix encoding.

2 Preliminaries

Let Σ and Π be integer alphabets, and let Σ* (resp. Π*) be the set of all strings over Σ (resp. Π). The empty string ε is the string of length 0. We assume that the size of any symbol in Σ and Π is constant and that a comparison of any two symbols in Σ or Π can be done in constant time. For a string X, |X| denotes the length of X. For 1 ≤ i ≤ j ≤ |X|, X[i] denotes the i-th character of X and X[i:j] denotes the substring of X that starts at i and ends at j. Let X[:j] denote the prefix of X that ends at j and X[i:] denote the suffix of X that starts at i. For convenience, we define X[i:j] = ε if i > j. Note that ε is a substring, a prefix, and a suffix of any string. For a string X, let Prefix(X) denote the set of all prefixes of X. For two strings X and Y, we denote by XY (or X · Y) the concatenation of X and Y.

Let D = {P_1, P_2, …, P_r} be a set of patterns, called a dictionary. Let r = |D| denote the number of patterns in D and d denote the total length of the patterns in D. For a dictionary D, Prefix(D) is the set of all prefixes of the patterns in D.

Figure 1: An example AC-automaton of a dictionary. The solid arrows represent the goto function, the dashed blue arrows represent the failure function, and the sets of strings represent the output function.

The Aho-Corasick automaton [1] of D, denoted AC(D), consists of a set of states and three functions: the goto, failure, and output functions. Each state of AC(D) corresponds to a prefix in Prefix(D). The goto function of AC(D) is defined as goto(x, c) = xc for any x, xc ∈ Prefix(D) and c ∈ Σ. The failure function of AC(D) is defined as fail(x) = y, where y is the longest proper suffix of x such that y ∈ Prefix(D). The output function of AC(D) is defined as out(x) = {P ∈ D : P is a suffix of x}. Figure 1 shows an example of an AC-automaton. We will define a generalization of AC-automata later in Section 3.

Next, we define the class of equivalence relations considered in this paper, called substring consistent equivalence relations.

Definition 1 (Substring consistent equivalence relation (SCER)  [14]).

An equivalence relation ≈ is a substring consistent equivalence relation (SCER) if for two strings X and Y, X ≈ Y implies |X| = |Y| and X[i:j] ≈ Y[i:j] for all 1 ≤ i ≤ j ≤ |X|.

We say that X ≈-matches Y iff X ≈ Y. For instance, the matching relations in parameterized pattern matching [5], order-preserving pattern matching [11, 13], and permuted pattern matching [10] are SCERs, while the matching relations in indeterminate string pattern matching [4] and function matching [2] are not.
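As an illustration, the order-preserving relation can be checked naively, and the substring-consistency condition of Definition 1 can be verified on a small example (a quadratic-time sketch for exposition only; the names are ours):

```python
def op_match(x, y):
    """Order-preserving match: x and y agree on the relative order
    (and equalities) of every pair of positions."""
    if len(x) != len(y):
        return False
    return all((x[i] < x[j]) == (y[i] < y[j]) and (x[i] == x[j]) == (y[i] == y[j])
               for i in range(len(x)) for j in range(len(x)))

# Substring consistency: since x op-matches y, every aligned pair of
# substrings x[i:j], y[i:j] op-matches as well.
x, y = [3, 1, 4, 1, 5], [30, 10, 40, 10, 50]
assert op_match(x, y)
assert all(op_match(x[i:j], y[i:j])
           for i in range(len(x)) for j in range(i, len(x) + 1))
```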

Matsuoka et al. [14] defined occurrences of a pattern in a text under an SCER ≈, which we use to define dictionary matching under SCERs as follows.

Definition 2 (-occurrence).

For two strings T and P, a position i, with 1 ≤ i ≤ |T| − |P| + 1, is an ≈-occurrence of P in T iff T[i:i+|P|−1] ≈ P.

By using the above definition we define the dictionary matching under SCERs.

Definition 3 (-dictionary matching).

Given a dictionary D and a text T, the dictionary matching with respect to an SCER ≈ (≈-dictionary matching) is the task of finding all ≈-occurrences of P in T for all P ∈ D.

In order to perform some variants of dictionary matching fast, encodings are applied to the strings. For instance, the prev-encoding is used for parameterized pattern matching [9] and the nearest neighbor encoding is used for order-preserving pattern matching [11]. Following this previous research, we generalize these encodings to SCERs as follows.

Definition 4 (-prefix encoding).

Let Σ and Π be alphabets. We say an encoding function enc : Σ* → Π* is a prefix encoding with respect to an SCER ≈ (≈-prefix encoding) if (1) for any string X, |enc(X)| = |X|, (2) for any string X and 1 ≤ j ≤ |X|, enc(X)[1:j] = enc(X[1:j]), and (3) for two strings X and Y, X ≈ Y iff enc(X) = enc(Y).

We can easily confirm that both the prev-encoding and the nearest neighbor encoding are prefix encodings. By using a ≈-prefix encoding, we can check whether X ≈ Y just by checking whether enc(X) = enc(Y). Therefore, ≈-dictionary matching can be performed fast by using prefix-encoded strings.
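For example, the prev-encoding used in parameterized pattern matching maps each position to the distance to the previous occurrence of the same symbol (0 if none). The sketch below, which assumes every symbol is a parameter symbol, illustrates conditions (2) and (3) of Definition 4 (names are ours):

```python
def prev_encode(x):
    """prev-encoding: position i stores the distance to the previous
    occurrence of the same symbol, or 0 if there is none."""
    last, out = {}, []
    for i, c in enumerate(x):
        out.append(i - last[c] if c in last else 0)
        last[c] = i
    return out

x = "abab"
# Condition (2): the encoding of a prefix is the prefix of the encoding.
assert all(prev_encode(x)[:j] == prev_encode(x[:j]) for j in range(len(x) + 1))
# Condition (3): two strings parameterized-match iff their encodings are equal.
assert prev_encode("abab") == prev_encode("xyxy") == [0, 0, 2, 2]
```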

For a string X and a prefix encoding enc, we simply write enc(X) for the encoded string of X. For a dictionary D, let enc(D) = {enc(P) : P ∈ D} and let Prefix(enc(D)) denote the set of all prefixes of the strings in enc(D).

Throughout the paper, let T be a text of length n, D be a dictionary, r be the number of patterns in D, d be the total length of the patterns in D, and m be the length of the longest pattern in D. Let Π be the co-domain of a ≈-prefix encoding enc and σ_Π be the alphabet size of the encoded strings. For a string X of length n, suppose that enc(X) can be computed in E(n) time. Assuming that enc(X) has been computed, suppose we can re-encode a substring, i.e., compute enc(X[i:j]) from enc(X), in R(j − i + 1) time.

3 SCER automata

In this section, we propose automata for the ≈-dictionary matching problem, called substring consistent equivalence relation automata (SCERAs). First, we describe the definition and properties of the SCERA for a dictionary D, then show the size of the SCERA of D with respect to the size of D. After that, we propose a ≈-dictionary matching algorithm using SCERAs and show the time complexity of the proposed algorithm. Lastly, we present an algorithm to construct SCERAs and show its time complexity.

3.1 Definition and properties

For a dictionary D, the substring consistent equivalence relation automaton of D, denoted SCERA(D), consists of a set of states and three functions, namely the goto, failure, and output functions.

The set of states of SCERA(D) is defined as Prefix(enc(D)), the set of all prefixes of the encoded patterns.

Each state of SCERA(D) corresponds to a prefix of enc(P) for some P ∈ D, and thus we can identify each state with its corresponding prefix. Since the number of prefixes of the patterns in enc(D) is at most d + 1, the number of states of SCERA(D) is bounded as follows.

Lemma 1.

For a dictionary D of total length d, the number of states of SCERA(D) is O(d).

Next, we define the functions of SCERA(D). First, the goto function of SCERA(D) is defined as follows.

Definition 5 (Goto function).

The goto function of SCERA(D) is defined by goto(u, c) = uc for any u, uc ∈ Prefix(enc(D)) and c ∈ Π; goto(u, c) is undefined otherwise.

Intuitively, goto extends a state by one encoded symbol. The states and the goto function form a trie of all encoded patterns in enc(D). For two states u and v such that v = uc for some c ∈ Π, we say that u is the parent of v and v is a child of u. For convenience, for a state u and a string y ∈ Π*, let goto(u, ε) = u and goto(u, cy) = goto(goto(u, c), y).

Next, the failure function of SCERA(D) is defined as follows.

Definition 6 (Failure function).

The failure function of SCERA(D) is defined by fail(enc(x)) = enc(x[i:]) for the smallest i ≥ 2 such that enc(x[i:]) ∈ Prefix(enc(D)), where enc(x) is a state of SCERA(D).

In other words, for any state enc(x), fail(enc(x)) = enc(x[i:]) iff x[i:] is the longest proper suffix of x such that enc(x[i:]) ∈ Prefix(enc(D)). Moreover, by the definition of prefix encodings, x[i:] is also the longest proper suffix of x that ≈-matches a prefix of a pattern in D. For convenience, we define fail^k recursively, namely fail^0(u) = u and fail^k(u) = fail(fail^(k−1)(u)).

The failure function has the following properties.

Lemma 2.

For any state enc(x) and any index i, if enc(x[i:]) is a state, then there is k ≥ 0 such that fail^k(enc(x)) = enc(x[i:]).

Proof.

Straightforward by the definition. ∎

Lemma 3.

Consider two states u and v with v = fail(u). If v ≠ ε, there is k ≥ 1 such that fail^k(u') is the parent of v, where u' is the parent of u.

Proof.

Let u = enc(x) and v = enc(x[i:]). From the condition v ≠ ε, we have i ≤ |x|. Since enc(x[i:]) ∈ Prefix(enc(D)), clearly enc(x[i:|x|−1]) ∈ Prefix(enc(D)), and enc(x[i:|x|−1]) is the parent of v. By Lemma 2, enc(x[i:|x|−1]) = fail^k(enc(x[:|x|−1])) for some k. ∎

Lemma 2 implies that every re-encoded suffix enc(x[i:]) of x such that enc(x[i:]) ∈ Prefix(enc(D)) can be found by executing the failure function recursively starting from enc(x). Moreover, Lemma 3 implies that for any state u such that fail(u) ≠ ε, the parent of fail(u) can be found by executing the failure function recursively starting from the parent of u.

Lastly, the output function of SCERA(D) is defined as follows.

Definition 7 (Output function).

The output function of SCERA(D) is defined by out(enc(x)) = {P ∈ D : enc(P) = enc(x[i:]) for some i}.

For a state enc(x), out(enc(x)) is the set of patterns that ≈-match some suffix of x.

The output function has the following properties.

Lemma 4.

For any P ∈ out(enc(x)), enc(P) = fail^k(enc(x)) for some k ≥ 0.

Proof.

From the definition of out and of prefix encodings, P ≈ x[i:] and enc(P) = enc(x[i:]) for some i. By Lemma 2, enc(x[i:]) = fail^k(enc(x)) for some k. ∎

Lemma 5.

For any state u, if u = enc(P) for some P ∈ D, then out(u) = out(fail(u)) ∪ {P}. Otherwise, out(u) = out(fail(u)).

Proof.

Assume there is a pattern P such that P ∈ out(fail(u)) but P ∉ out(u). By Lemma 4, there exists k such that enc(P) = fail^k(fail(u)) = fail^(k+1)(u). Since fail^(k+1)(u) is the re-encoding of a suffix of the string corresponding to u, we have P ∈ out(u), which contradicts the assumption.

Next, assume there is a pattern P such that P ∈ out(u) but P ∉ out(fail(u)) and enc(P) ≠ u. By Lemma 4, there is k ≥ 0 such that enc(P) = fail^k(u). If k ≥ 1, enc(P) = fail^(k−1)(fail(u)) implies P ∈ out(fail(u)), which contradicts the assumption. Therefore, the remaining possibility is k = 0, which implies enc(P) = u. ∎

Lemma 5 implies that we can compute out(u) by copying out(fail(u)) and adding P if u = enc(P) for some P ∈ D. We will utilize Lemma 5 to construct the output function efficiently.

Implementation and space complexity

We now describe how to implement SCER automata and show the space complexity of the implementation. First, the goto function can be implemented by using an associative array at each state. We have the following lemma on the space required to implement the goto function of SCERA(D).

Lemma 6.

Assume that the size of any symbol in Π is constant. The goto function of SCERA(D) can be implemented in O(d) space.

Proof.

The number of associative arrays used to implement goto equals the number of states. Since there exists exactly one pair (u, c) such that goto(u, c) = v for each state v other than the root, the total size of the associative arrays is linear in the number of states. Therefore, goto can be implemented in O(d) space by Lemma 1. ∎

Next, the failure function can be implemented by using a state pointer on each state.

Lemma 7.

For a dictionary D, the failure function of SCERA(D) can be implemented in O(d) space.

Proof.

Since fail(u) is defined for each state u, fail can be implemented using one state pointer per state. Therefore, fail can be implemented in O(d) space by Lemma 1. ∎

Lastly, similarly to the original AC-automata, the output function can be implemented in linear space by using lists.

Lemma 8.

For a dictionary D, the output function of SCERA(D) can be implemented in O(d) space.

Proof.

Each state u stores a pair (p, q), where p is the pattern number of P if u = enc(P) for some P ∈ D and nil otherwise, and q is a pointer to the state fail^k(u) for the smallest k ≥ 1 such that out(fail^k(u)) ≠ ∅ if such k exists, and nil otherwise. Since the number of states is O(d) by Lemma 1, out can be implemented in O(d) space. ∎
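The pair-based representation in this proof can be sketched as follows, where pat_id[s] plays the role of p (the pattern number stored at state s, or None) and out_link[s] the role of q (the next state with a nonempty output set, or None); the names are ours, for illustration:

```python
def collect_out(s, pat_id, out_link):
    """Enumerate all pattern numbers output at state s by following the
    per-state (pattern number, next-output-state) pairs, without ever
    materializing the output sets."""
    while s is not None:
        if pat_id[s] is not None:
            yield pat_id[s]
        s = out_link[s]

# Tiny hand-built example: state 2 outputs pattern 1 itself and inherits
# pattern 0 from state 1 via its output link.
pat_id = [None, 0, 1]
out_link = [None, None, 1]
assert list(collect_out(2, pat_id, out_link)) == [1, 0]
```

Enumerating out(u) this way costs time proportional to the number of patterns reported, which is what the O(occ) term in the matching bound below relies on.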

From Lemmas 1, 6, 7, and 8, we get the following theorem.

Theorem 1.

Assume that the size of any symbol in Π is constant. For a dictionary D of total size d, SCERA(D) can be implemented in O(d) space.

3.2 Dictionary matching using SCERA

Algorithm 1: Dictionary matching by using SCERA(D).

In this section, we describe how to use SCER automata for dictionary matching. Algorithm 1 shows a dictionary matching algorithm that uses SCERA(D). In order to simplify the algorithm, we use an auxiliary state ⊥ such that goto(⊥, c) = ε for any c ∈ Π and fail(ε) = ⊥. For any state u, let depth(u) be the depth of u, i.e., the length of the shortest path from ε to u.

The algorithm starts with ε as the active state and 1 as the active position. The algorithm reads the encoded text from left to right, updating the active state and the active position. Let u be the active state and i be the active position. The algorithm looks for a transition from u labeled with the re-encoded symbol at position i. If such a transition exists, the algorithm updates the active state to the target of the transition and increments i; after updating the active state, it outputs all patterns in the output set of the new active state. Otherwise, if the transition does not exist, the algorithm updates the active state to fail(u) without updating i. The algorithm repeats these operations until it has read the whole text.
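For concreteness, the sketch below instantiates this matching procedure, together with the goto and failure construction of Section 3.3, for one particular SCER: the parameterized-matching relation with the prev-encoding. Here re-encoding is the O(1) rule that an encoded value pointing before the current window becomes 0. The code is illustrative and all names are ours:

```python
from collections import deque

def prev_encode(x):
    """prev-encoding: distance to the previous occurrence of the symbol, or 0."""
    last, out = {}, []
    for i, c in enumerate(x):
        out.append(i - last[c] if c in last else 0)
        last[c] = i
    return out

def reencode(v, window):
    """Re-encode one value for a shorter window: a back-pointer reaching
    outside the window becomes 0 (first occurrence inside the window)."""
    return v if v <= window else 0

def build_scera(patterns):
    """Trie over prev-encoded patterns, failure links by BFS (Lemma 3),
    output sets inherited along failure links (Lemma 5)."""
    goto, fail, depth, out = [{}], [0], [0], [set()]
    for p in patterns:
        s = 0
        for v in prev_encode(p):
            if v not in goto[s]:
                goto.append({}); fail.append(0)
                depth.append(depth[s] + 1); out.append(set())
                goto[s][v] = len(goto) - 1
            s = goto[s][v]
        out[s].add(p)
    q = deque(goto[0].values())
    while q:
        s = q.popleft()
        for v, t in goto[s].items():
            q.append(t)
            f = fail[s]
            while True:
                w = reencode(v, depth[f])        # re-encode for the shorter window
                if w in goto[f] and goto[f][w] != t:
                    fail[t] = goto[f][w]
                    break
                if f == 0:
                    fail[t] = 0
                    break
                f = fail[f]
            out[t] |= out[fail[t]]
    return goto, fail, depth, out

def scera_search(text, goto, fail, depth, out):
    """Follow goto on re-encoded symbols; fall back along failure links."""
    s = 0
    for i, v in enumerate(prev_encode(text)):
        while True:
            w = reencode(v, depth[s])
            if w in goto[s]:
                s = goto[s][w]
                break
            if s == 0:
                break
            s = fail[s]
        for p in out[s]:
            yield (i - len(p) + 1, p)            # 0-based start position
```

For example, with the dictionary {aa, ab}, the text "xxy" yields "aa" at position 0 (since "xx" parameterized-matches "aa") and "ab" at position 1 (since "xy" parameterized-matches "ab").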

Lemma 9.

Let u be the active state and i be the active position. Then u = enc(T[j:i−1]) for the smallest j such that enc(T[j:i−1]) ∈ Prefix(enc(D)); that is, u corresponds to the longest suffix of T[:i−1] whose re-encoding is a state of SCERA(D).

Proof.

We prove the lemma by induction on the steps of the algorithm. Initially, u = ε and i = 1, and thus u corresponds to the longest suffix of T[:0] = ε, so the claim holds at the start.

Assume that u corresponds to the longest suffix of T[:i−1] whose re-encoding is in Prefix(enc(D)). Let u' be the next active state and i' be the next active position. If the transition from u on the re-encoded symbol at position i exists, we have u' = enc(T[j:i]) and i' = i + 1; by the assumption, u' corresponds to the longest suffix of T[:i] whose re-encoding is in Prefix(enc(D)). Otherwise, the algorithm moves to fail^k(u) for the smallest k such that the required transition exists. Let T[j':i] be the longest suffix of T[:i] whose re-encoding is in Prefix(enc(D)). By the definition of SCERA(D), enc(T[j':i−1]) ∈ Prefix(enc(D)), and by Lemma 2, enc(T[j':i−1]) = fail^k'(u) for some k'. Since no smaller number of failure steps admits a transition, the state reached is enc(T[j':i]), which implies the correctness of Lemma 9. ∎

Theorem 2.

Given SCERA(D) and a text T of length n, Algorithm 1 outputs all occurrence positions of all patterns of D in T in O(E(n) + n(R(m) + log σ_Π) + occ) time, where E(n) is the time required to encode T, R(m) is the time required to re-encode substrings of T of length at most m, and occ is the number of occurrences of all patterns in T.

Proof.

First, we show the correctness of the algorithm. Assume there is an occurrence position j of a pattern P ∈ D in T that is not output by the algorithm. Consider the step at which the active position i equals j + |P|. By Lemma 9, the active state u corresponds to the longest suffix of T[:i−1] whose re-encoding is in Prefix(enc(D)). Since T[j:i−1] ≈ P, the suffix T[j:i−1] is no longer than the suffix corresponding to u. By the definition of the output function, P ∈ out(u). Therefore, the algorithm outputs j, which contradicts the assumption.

Next, we show the time complexity of the algorithm. The encoding of T can be computed in E(n) time. For each position of the text, the depth of the active state increases by at most one, and thus the depth increases by at most n in total. Since the depth of the active state decreases by at least one each time fail is executed, fail is executed at most n times. Moreover, each time goto is executed, either the depth of the active state increases by one or fail is executed, and thus goto is executed O(n) times. Since we need to re-encode a symbol each time goto is executed and one goto step can be executed in O(log σ_Π) time, the algorithm takes O(n(R(m) + log σ_Π)) time to execute goto in total. In order to output the occurrence positions, the algorithm takes O(n) time to check whether there is any occurrence and O(occ) time to output the occurrence positions. ∎

3.3 Constructing SCERA

Algorithm 2: Construction of the goto function of SCERA(D).

Algorithm 3: Construction of the failure function of SCERA(D).

Algorithm 4: Construction of the output function of SCERA(D).

In this section, we describe an algorithm to construct SCERA(D). We divide the algorithm into three parts: the goto function, failure function, and output function construction algorithms.

First, the goto function construction algorithm is shown in Algorithm 2. Initially, the algorithm computes enc(P) for all P ∈ D, then constructs the root state ε and the auxiliary state ⊥. Next, for each pattern P ∈ D, the algorithm finds the longest prefix of enc(P) that already exists in the current automaton. After that, the algorithm creates states corresponding to the remaining prefixes, from the shortest to the longest. After creating each state, the algorithm updates the goto function, adds a label to the state, and computes the depth of the state.

Lemma 10.

Given a dictionary D, Algorithm 2 constructs the goto function of SCERA(D) in O(E(d) + d log σ_Π) time.

Proof.

By assumption, the dictionary can be encoded in E(d) time. The operations in the inner loop are executed O(d) times, and each goto step can be computed in O(log σ_Π) time by binary search. ∎

Next, we describe how to compute the failure function of SCERA(D). Algorithm 3 shows the algorithm for computing the failure function. The algorithm computes the failure function by breadth-first search, using the property in Lemma 3.

Consider computing fail(u) for a state u with parent s. Since the algorithm computes the failure function by breadth-first search, fail(s) has already been computed. By Lemma 3, there is k such that fail^k(s) is the parent of fail(u), or fail(u) = ε. We can thus find fail(u) by executing the failure function recursively from s and checking whether the required transition exists.

Lemma 11.

Given a dictionary D and the goto function of SCERA(D), Algorithm 3 constructs the failure function of SCERA(D) in O(E(d) + d(R(m) + log σ_Π)) time.

Proof.

The dictionary can be encoded in E(d) time. The running time of Algorithm 3 is bounded by the total number of executions of fail. Since each execution of fail decreases the depth of the current state by at least one, while each created state increases it by one, fail is executed O(d) times in total. The goto function is executed each time fail is executed. Since we need to re-encode a substring each time fail is executed and each goto step can be executed in O(log σ_Π) time, the algorithm takes O(E(d) + d(R(m) + log σ_Π)) time in total. ∎

Lastly, Algorithm 4 shows an algorithm to compute the output function of SCERA(D). The algorithm first adds P to out(enc(P)) for each P ∈ D. Next, the algorithm updates the output function by breadth-first search, using the property in Lemma 5.

Consider computing out(u). Since the algorithm computes the output function by breadth-first search, out(fail(u)) has already been computed. By Lemma 5, we can compute out(u) by adding P to out(fail(u)) if u = enc(P) for some P ∈ D, or by copying out(fail(u)) otherwise.

Lemma 12.

Given a dictionary D and the goto and failure functions of SCERA(D), Algorithm 4 constructs the output function of SCERA(D) in O(E(d) + d log σ_Π) time.

Proof.

The dictionary can be encoded in E(d) time. Clearly, the loops are executed O(d) times in total, and each goto step can be executed in O(log σ_Π) time. Therefore, Algorithm 4 runs in O(E(d) + d log σ_Π) time. ∎

From Lemmas 10, 11, and 12, we get the following theorem.

Theorem 3.

Given a dictionary D, SCERA(D) can be constructed in O(E(d) + d(R(m) + log σ_Π)) time.

4 Concluding remark

In this paper, we presented a generalization of the dictionary matching problem under SCERs and proposed a generalization of the Aho-Corasick algorithm for this problem. The algorithm encodes the patterns by using a ≈-prefix encoding and then constructs an SCER automaton from the encoded strings.

We believe that a ≈-prefix encoding exists for any SCER ≈, in which case our algorithm could be used for ≈-dictionary matching under any SCER; proving this remains an open problem.

References

  • [1] Aho, A.V., Corasick, M.J.: Efficient string matching: an aid to bibliographic search. Communications of the ACM 18(6), 333–340 (1975)
  • [2] Amir, A., Aumann, Y., Lewenstein, M., Porat, E.: Function matching. SIAM Journal on Computing 35(5), 1007–1022 (2006)
  • [3] Amir, A., Farach, M., Muthukrishnan, S.: Alphabet dependence in parameterized matching. Information Processing Letters 49(3), 111–115 (1994)
  • [4] Antoniou, P., Crochemore, M., Iliopoulos, C., Jayasekera, I., Landau, G.: Conservative string covering of indeterminate strings. In: Prague Stringology Conference 2008. pp. 108–115 (2008)
  • [5] Baker, B.S.: Parameterized Pattern Matching: Algorithms and Applications. Journal of Computer and System Sciences 52(1), 28–42 (1996)
  • [6] Diptarama, Ueki, Y., Narisawa, K., Shinohara, A.: KMP Based Pattern Matching Algorithms for Multi-Track Strings. In: Proceedings of Student Research Forum Papers and Posters at SOFSEM2016. pp. 100–107 (2016)
  • [7] Diptarama, Yoshinaka, R., Shinohara, A.: Fast Full Permuted Pattern Matching Algorithms on Multi-track Strings. In: Prague Stringology Conference (PSC) 2016. pp. 7–21 (2016)
  • [8] Hendrian, D., Ueki, Y., Narisawa, K., Yoshinaka, R., Shinohara, A.: Permuted Pattern Matching Algorithms on Multi-Track Strings. Algorithms 12(4), 73:1–20 (2019)
  • [9] Idury, R.M., Schäffer, A.A.: Multiple matching of parameterized patterns. Theoretical Computer Science 154(2), 203–224 (1996)
  • [10] Katsura, T., Narisawa, K., Shinohara, A., Bannai, H., Inenaga, S.: Permuted Pattern Matching on Multi-track Strings. In: SOFSEM 2013: Theory and Practice of Computer Science. pp. 280–291 (2013)
  • [11] Kim, J., Eades, P., Fleischer, R., Hong, S.H., Iliopoulos, C.S., Park, K., Puglisi, S.J., Tokuyama, T.: Order-preserving matching. Theoretical Computer Science 525, 68–79 (2014)
  • [12] Knuth, D.E., Morris, Jr., J.H., Pratt, V.R.: Fast Pattern Matching in Strings. SIAM Journal on Computing 6(2), 323–350 (1977)
  • [13] Kubica, M., Kulczyński, T., Radoszewski, J., Rytter, W., Waleń, T.: A linear time algorithm for consecutive permutation pattern matching. Information Processing Letters 113(12), 430–433 (2013)
  • [14] Matsuoka, Y., Aoki, T., Inenaga, S., Bannai, H., Takeda, M.: Generalized pattern matching and periodicity under substring consistent equivalence relations. Theoretical Computer Science 656, 225–233 (2016)
  • [15] Park, S.G., Amir, A., Landau, G.M., Park, K.: Cartesian Tree Matching and Indexing. In: 30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019). pp. 16:1–16:14 (2019)