Fast Multiple Pattern Cartesian Tree Matching

11/05/2019
by Geonmo Gu, et al.

Cartesian tree matching is the problem of finding all substrings in a given text which have the same Cartesian trees as that of a given pattern. In this paper, we deal with Cartesian tree matching for the case of multiple patterns. We present two fingerprinting methods, i.e., the parent-distance encoding and the binary encoding. By combining an efficient fingerprinting method with a conventional multiple string matching algorithm, we can efficiently solve multiple pattern Cartesian tree matching. We propose three practical algorithms for multiple pattern Cartesian tree matching based on the Wu-Manber algorithm, the Rabin-Karp algorithm, and the Alpha Skip Search algorithm, respectively. In the experiments we compare our solutions against the previous algorithm [18]. Our solutions run faster than the previous algorithm as the pattern lengths increase. In particular, our algorithm based on Wu-Manber runs up to 33 times faster.



1 Introduction

Gu, Song, and Park were supported by the Collaborative Genome Program for Fostering New Post-Genome Industry through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT and Future Planning (No. NRF-2014M3C9A3063541).

Cartesian tree matching is the problem of finding all substrings in a given text which have the same Cartesian trees as that of a given pattern. For instance, given the text T and the pattern P in Figure 1(a), P has the same Cartesian tree as the substring of T starting at position 8. Among many generalized matchings, Cartesian tree matching is analogous to order-preserving matching [13, 15, 5, 9] in the sense that both deal with the relative order between numbers. Accordingly, both of them can be applied to time series data such as stock price analysis, but Cartesian tree matching can sometimes be more appropriate than order-preserving matching in finding patterns [18].

In this paper, we deal with Cartesian tree matching for the case of multiple patterns. Although finding multiple different patterns is interesting by itself, multiple pattern Cartesian tree matching can also be applied to finding one meaningful pattern when that pattern is represented by multiple Cartesian trees: suppose we are looking for the double-top pattern [17]. The two Cartesian trees in Figure 1(b) are required to identify the pattern, where the relative order between the two tops causes the difference. In general, the more complex the pattern is, the more Cartesian trees of the same length are required (e.g., the head-and-shoulders pattern [17] requires four Cartesian trees).

Figure 1: Cartesian tree matching: multiple Cartesian trees are required for the double-top pattern. (a) Cartesian tree matching. (b) Left: double-top patterns. Right: corresponding Cartesian trees.

Recently, Park et al. [18] introduced (single pattern) Cartesian tree matching, multiple pattern Cartesian tree matching, and Cartesian tree indexing, with their respective algorithms. They proposed the parent-distance representation, which has a one-to-one mapping with Cartesian trees, and gave linear-time solutions for the problems, utilizing the representation and existing string algorithms, i.e., the KMP algorithm, the Aho-Corasick algorithm, and a suffix tree construction algorithm. Song et al. [19] proposed new representations of Cartesian trees, and proposed practically fast algorithms for Cartesian tree matching based on the framework of filtering and verification.

Extensive work has been done to develop algorithms for multiple pattern matching, which is one of the fundamental problems in computer science [20, 11, 16]. Aho and Corasick [1] presented a linear-time algorithm based on an automaton. Commentz-Walter [6] presented an algorithm that combines the Aho-Corasick algorithm and the Boyer-Moore technique [3]. Crochemore et al. [8] proposed an algorithm that combines the Aho-Corasick automaton and a Directed Acyclic Word Graph, which runs in linear time in the worst case and in O(n log(m)/m) time in the average case, where m is the length of the shortest pattern. Rabin and Karp [12] proposed an algorithm that runs in linear time on average and in O(nM) time in the worst case, where M is the sum of the lengths of all patterns. Charras et al. [4] proposed an algorithm called Alpha Skip Search, which can efficiently handle both a single pattern and multiple patterns. Wu and Manber [22] presented an algorithm that uses an extension of the Boyer-Moore-Horspool technique.

In this paper we present practically fast algorithms for multiple pattern Cartesian tree matching. We present three algorithms based on Wu-Manber, Rabin-Karp, and Alpha Skip Search. All of them use the filtering and verification approach, where filtering relies on efficient fingerprinting methods of a string. Two fingerprinting methods are presented, i.e., the parent-distance encoding and the binary encoding. By combining an efficient fingerprinting method with a conventional multiple string matching algorithm, we can efficiently solve multiple pattern Cartesian tree matching. In the experiments we compare our solutions against the previous algorithm [18], which is based on the Aho-Corasick algorithm. Our solutions run faster than the previous algorithm. In particular, our algorithm based on Wu-Manber runs up to 33 times faster.

2 Problem Definition

2.1 Notation

A string S is a sequence of characters drawn from an alphabet Σ, which is a set of integers. We assume that a comparison between any two characters can be done in constant time. For a string S, S[i] represents the i-th character of S, and S[i..j] represents the substring of S starting at position i and ending at position j.

A Cartesian tree [21] is a binary tree derived from a string. Specifically, the Cartesian tree CT(S) for a string S[1..n] can be uniquely defined as follows:

  • If S is an empty string, CT(S) is an empty tree.

  • If S is not empty and S[i] is the minimum value in S, CT(S) is the tree with S[i] as the root, CT(S[1..i−1]) as the left subtree, and CT(S[i+1..n]) as the right subtree. If there is more than one minimum value, we choose the leftmost one as the root.
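To make the definition concrete, the following Python sketch builds a Cartesian tree directly from the recursive definition above (our own illustration, not code from the paper; it uses 0-based indices and a naive O(n²) recursion rather than a linear-time construction):

```python
def cartesian_tree(s):
    # Cartesian tree of s as nested tuples (root_index, left, right);
    # the leftmost minimum is chosen as the root, matching the paper's
    # tie-breaking rule (0-based indices; naive O(n^2) construction).
    def build(lo, hi):
        if lo >= hi:
            return None
        r = min(range(lo, hi), key=lambda i: (s[i], i))  # leftmost minimum
        return (r, build(lo, r), build(r + 1, hi))
    return build(0, len(s))
```

Two strings match under Cartesian tree matching exactly when this function returns equal tree shapes for them.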

Given two strings S and P, where |S| ≥ |P| = m, we say that P matches S at position i if CT(S[i..i+m−1]) = CT(P). For example, given T and P in Figure 1(a), P matches T at position 8. We also say that S[i..i+m−1] is a match of P in S.

Cartesian tree matching is the problem of finding all the positions in the text at which the text has the same Cartesian tree as a given pattern.

Definition 1

(Cartesian tree matching [18]) Given two strings, text T[1..n] and pattern P[1..m], find every position i (1 ≤ i ≤ n − m + 1) such that CT(T[i..i+m−1]) = CT(P[1..m]).

2.2 Multiple Pattern Cartesian Tree Matching

Cartesian tree matching can be extended to the case of multiple patterns. Multiple pattern Cartesian tree matching is the problem of finding all the matches in the text which have the same Cartesian trees as at least one of the given patterns.

Definition 2

(Multiple pattern Cartesian tree matching [18]) Given a text T[1..n] and patterns P_1, P_2, …, P_k, find every position in the text which matches at least one pattern, i.e., every position i such that CT(T[i..i+|P_j|−1]) = CT(P_j) for some 1 ≤ j ≤ k.

3 Fingerprinting Methods

Fingerprinting is a technique that maps a string to a much shorter form of data, such as a bit string or an integer. In Cartesian tree matching, we can use fingerprints to filter out unpromising matching positions with low computational cost.

In this section we introduce two fingerprinting methods, i.e., the parent-distance encoding and the binary encoding, for the purpose of representing information about a Cartesian tree as an integer. The two encodings make use of the parent-distance representation and the binary representation, respectively, both of which are strings that represent Cartesian trees.

3.1 Parent-distance Encoding

In order to represent Cartesian trees efficiently, Park et al. proposed the parent-distance representation [18], which is another form of the all nearest smaller values problem [2].

Definition 3

(Parent-distance representation) Given a string S[1..n], the parent-distance representation of S is an integer string PD(S)[1..n], which is defined as follows:

PD(S)[i] = i − max{ j : S[j] ≤ S[i], 1 ≤ j < i }, and PD(S)[i] = 0 if no such j exists.   (1)

Intuitively, PD(S)[i] stores the distance between position i and the parent of S[i] in the Cartesian tree of S[1..i]. For example, the parent-distance representation of S = (3, 1, 4, 1, 5, 9, 2) is PD(S) = (0, 0, 1, 2, 1, 1, 3). The parent-distance representation has a one-to-one mapping to the Cartesian tree [18], and so if two strings have the same parent-distance representations, the two strings also have the same Cartesian trees. The parent-distance representation of a string can be computed in linear time [18]. Note that PD(S)[i] holds a value between 0 and i − 1 by definition, and PD(S)[1] = 0 at all times.
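The linear-time computation mentioned above can be sketched with a nearest-smaller-values stack (a hypothetical Python illustration with 0-based indices; the paper's own procedure may differ in detail):

```python
def parent_distance(s):
    # PD(s): distance from each position to the nearest position on its
    # left holding a smaller-or-equal value, or 0 if none exists.
    # Each index is pushed and popped at most once, so the total time
    # is linear in len(s).
    pd, stack = [], []
    for i, v in enumerate(s):
        while stack and s[stack[-1]] > v:
            stack.pop()
        pd.append(i - stack[-1] if stack else 0)
        stack.append(i)
    return pd
```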

With the parent-distance representation, we can define a fingerprint encoding function that maps a string to an integer, using the factorial number system [14].

Definition 4

(Parent-distance Encoding) Given a string S[1..n], the encoding function f_PD, which maps S into an integer within the range [0, n! − 1], is defined as follows:

f_PD(S) = Σ_{i=1..n} PD(S)[i] · (i − 1)!   (2)

The parent-distance encoding maps a string into a unique integer according to its parent-distance representation. That is, given two strings S_1 and S_2 of the same length, f_PD(S_1) = f_PD(S_2) if and only if PD(S_1) = PD(S_2). This holds because each digit satisfies PD(S)[i] ≤ i − 1, so Equation (2) is a number in the factorial number system, in which every digit string corresponds to a distinct integer. The encoding function can be computed in O(n) time, since PD(S) can be computed in linear time. For a long string the fingerprint may not fit in a machine word, so we select a prime number, divide the fingerprint by it, and use the residue instead of the actual fingerprint. A similar encoding function was used to solve the multiple pattern order-preserving matching problem [10].
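A direct, unoptimized sketch of the encoding (names and 0-based indexing are ours; the residue computation for long strings is included as an optional `mod` argument):

```python
def pd_encode(s, mod=None):
    # Fingerprint sum_i PD(s)[i] * (i-1)!  (factorial number system);
    # distinct parent-distance strings of equal length map to distinct
    # integers in [0, n!-1]. mod is an optional prime for long strings.
    pd, stack = [], []
    for i, v in enumerate(s):                 # parent-distance representation
        while stack and s[stack[-1]] > v:
            stack.pop()
        pd.append(i - stack[-1] if stack else 0)
        stack.append(i)
    fp, fact = 0, 1
    for j, d in enumerate(pd):                # weight j! for 0-based position j
        fp += d * fact
        fact *= j + 1
        if mod:
            fp %= mod
            fact %= mod
    return fp
```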

3.2 Binary Encoding

For order-preserving matching, the representation of a string as a binary string was first presented by Chhabra and Tarhio [5]. Recently, Song et al. made use of a binary representation for Cartesian tree matching as follows [19].

Definition 5

(Binary representation) Given an n-length string S, the binary representation b(S) of length n − 1 is defined as follows: for 1 ≤ i ≤ n − 1,

b(S)[i] = 0 if S[i] ≤ S[i+1], and b(S)[i] = 1 otherwise.   (3)

Given two strings S_1 and S_2 of the same length, the binary representations b(S_1) and b(S_2) are the same if the Cartesian trees CT(S_1) and CT(S_2) are the same [19]. Obviously, the Cartesian tree has a many-to-one mapping to the binary representation. Thus, two strings whose binary representations are the same may not have the same Cartesian trees, but two strings whose Cartesian trees are the same always have the same binary representations.

A fingerprint encoding function can be defined using the binary representation.

Definition 6

(Binary Encoding) Given a string S[1..n], the encoding function f_b, which maps S into an integer within the range [0, 2^{n−1} − 1], is defined as follows:

f_b(S) = Σ_{i=1..n−1} b(S)[i] · 2^{n−1−i}   (4)

Since f_b(S) is a polynomial in 2, it can be efficiently computed in linear time using Horner's rule [7]. Moreover, a fingerprint computed by the binary encoding can be reused when two strings overlap, which will be discussed in Appendix 0.A.3.2. As with the parent-distance encoding, in case the fingerprint does not fit in a machine word, we select a prime number, divide the fingerprint by it, and use the residue instead of the actual fingerprint.
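The two functions below sketch the binary representation and its Horner-rule encoding in Python (our own illustration with 0-based indices; the bit orientation is an assumption on our part, since either orientation yields an equivalent filter):

```python
def binary_rep(s):
    # Bit i is 0 when s[i] <= s[i+1] and 1 otherwise (one possible
    # orientation; with the leftmost-minimum rule, equal neighbours
    # behave like an ascent).
    return [0 if s[i] <= s[i + 1] else 1 for i in range(len(s) - 1)]

def binary_encode(s, mod=None):
    # The binary representation read as a number via Horner's rule,
    # optionally reduced modulo a prime for long strings.
    fp = 0
    for bit in binary_rep(s):
        fp = fp * 2 + bit
        if mod:
            fp %= mod
    return fp
```

Note that two strings with the same Cartesian tree always get equal fingerprints, while the converse may fail, which is exactly what a filtering step needs.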

4 Fast Multiple Pattern Cartesian Tree Matching Algorithms

In this section we introduce three algorithms for multiple pattern Cartesian tree matching. Each of them consists of preprocessing and search. In the preprocessing step, hash tables are built using fingerprints of the patterns. In the search step, the filtering and verification approach is adopted. To filter out unpromising matching positions, a fingerprinting method is applied to either length-m substrings of the text, where m is the length of the shortest pattern, or much shorter length-q substrings of the text (we discuss how to set the block size q in Section 4.4). Then each candidate pattern is verified by an efficient comparison method (see Appendix 0.A.3.1).

4.1 Algorithm Based on Wu-Manber

1: procedure Preprocessing(P_1, …, P_k, q)
2:     m ← min{|P_1|, …, |P_k|}
3:     Initialize each entry of HASH to an empty list
4:     Initialize each entry of SHIFT to m − q + 1
5:     for i ← 1 to k do
6:         for j ← q to m do
7:             fp ← f(P_i[j−q+1..j])
8:             if SHIFT[fp] > m − j then
9:                 SHIFT[fp] ← m − j
10:         fp ← f(P_i[m−q+1..m])
11:         HASH[fp] ← HASH[fp] ∪ {P_i}
12: procedure Search(T[1..n])
13:     pos ← 1
14:     while pos ≤ n − m + 1 do
15:         fp ← f(T[pos+m−q..pos+m−1])
16:         for each P_i in HASH[fp] do
17:             if P_i matches T at position pos then
18:                 output (pos, i)
19:         pos ← pos + max(SHIFT[fp], 1)
Algorithm 1 Algorithm based on Wu-Manber

Algorithm 1 shows the pseudo-code of an algorithm for multiple pattern Cartesian tree matching based on the Wu-Manber algorithm [22]. The algorithm uses two hash tables, HASH and SHIFT. Both tables are indexed by the fingerprint of a length-q string, called a block. Either the parent-distance encoding or the binary encoding is used to compute the fingerprint. Given patterns P_1, …, P_k, let m be the length of the shortest pattern. HASH maps a fingerprint fp of a block to the list of patterns P_i such that the fingerprint of the last block in P_i's length-m prefix is the same as fp. For a block B and a fingerprint encoding function f, HASH is defined as follows:

HASH[f(B)] = { P_i : f(P_i[m−q+1..m]) = f(B), 1 ≤ i ≤ k }   (5)

SHIFT maps a fingerprint fp of a block to the amount of a valid shift when the block appears in the text. The shift value is determined by the rightmost occurrence, in terms of the fingerprint, of a block among the length-m prefixes of the patterns. For a fingerprint v and a fingerprint encoding function f, we define the rightmost occurrence rmo(v) as follows:

rmo(v) = max{ j : f(P_i[j−q+1..j]) = v, q ≤ j ≤ m, 1 ≤ i ≤ k }   (6)

Then SHIFT is defined as follows:

SHIFT[v] = m − rmo(v) if rmo(v) is defined, and m − q + 1 otherwise.   (7)

In the preprocessing step, we build HASH and SHIFT (as described in Algorithm 1). In the search step, we scan the text from left to right, computing the fingerprint of a length-q substring of the text to get a list of patterns from HASH. Let pos be the current scanning position of the text. We compute the fingerprint fp of T[pos+m−q..pos+m−1], and get the list of patterns in the entry HASH[fp]. If the list is not empty, each pattern in it is verified by an efficient comparison method (see Appendix 0.A.3.1). Consider P_i in the list. The comparison method verifies whether P_i matches T at position pos. After verifying all patterns in the list, the scanning position is shifted by SHIFT[fp] (at least 1).

The worst-case time complexity of Algorithm 1 is O(n(M + q)), where M is the total pattern length, q is the block size, and n is the length of the text (consider, e.g., a text and patterns all of whose characters are equal). On the other hand, the best-case time complexity of Algorithm 1 is O(nq/(m − q + 1)).
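The whole filtering-and-verification pipeline can be sketched as follows (an illustrative Python reconstruction, not the authors' implementation: it fixes the binary encoding as the fingerprint, uses a naive tree-shape comparison in place of the verification method of Appendix 0.A.3.1, and works with 0-based indices):

```python
from collections import defaultdict

def binary_encode(s):
    # Binary-encoding fingerprint of s (one possible bit orientation).
    fp = 0
    for i in range(len(s) - 1):
        fp = fp * 2 + (0 if s[i] <= s[i + 1] else 1)
    return fp

def same_cartesian_tree(a, b):
    # Naive verification stand-in: compare Cartesian tree shapes directly
    # (leftmost minimum breaks ties, as in the paper).
    def shape(s, lo, hi):
        if lo >= hi:
            return None
        r = min(range(lo, hi), key=lambda i: (s[i], i))
        return (r - lo, shape(s, lo, r), shape(s, r + 1, hi))
    return len(a) == len(b) and shape(a, 0, len(a)) == shape(b, 0, len(b))

def wm_search(text, patterns, q=4):
    # Wu-Manber-style multiple Cartesian tree matching (0-based sketch).
    m = min(len(p) for p in patterns)
    q = min(q, m)
    shift = defaultdict(lambda: m - q + 1)   # default: block never occurs
    hash_tbl = defaultdict(list)
    for p in patterns:
        for j in range(q, m + 1):            # blocks inside the length-m prefix
            fp = binary_encode(p[j - q:j])
            shift[fp] = min(shift[fp], m - j)
        hash_tbl[binary_encode(p[m - q:m])].append(p)   # last block of prefix
    out, pos = [], 0
    while pos + m <= len(text):
        fp = binary_encode(text[pos + m - q:pos + m])   # block ending the window
        for p in hash_tbl.get(fp, []):
            if pos + len(p) <= len(text) and same_cartesian_tree(text[pos:pos + len(p)], p):
                out.append((pos, p))
        pos += max(shift[fp], 1)
    return out
```

The sketch reports (position, pattern) pairs; on small inputs a brute-force scan over all windows yields the same answer.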

4.2 Algorithm Based on Rabin-Karp

Algorithm 2 in the Appendix shows the pseudo-code of an algorithm for multiple pattern Cartesian tree matching based on the Rabin-Karp algorithm [12]. The algorithm uses one hash table, namely HASH. HASH is defined similarly as in Algorithm 1, except that we consider length-m prefixes instead of blocks and we use only the binary encoding for fingerprinting. For a string S and the binary encoding function f_b, HASH is defined as follows:

HASH[f_b(S)] = { P_i : f_b(P_i[1..m]) = f_b(S), 1 ≤ i ≤ k }   (8)

In the preprocessing step, we build HASH. In the search step, we shift the window one position at a time, and compute the fingerprint of a length-m substring of the text to get candidate patterns by using HASH. Again, each candidate pattern is verified by an efficient comparison method.

Given the fingerprint at position i of the text, the next fingerprint at position i + 1 can be computed in constant time if we use the binary encoding as the fingerprinting method. Let the former fingerprint be f_i and the latter one be f_{i+1}, and let b_T denote the binary representation of the text. Then,

f_{i+1} = 2 · (f_i − b_T[i] · 2^{m−2}) + b_T[i+m−1]   (9)

Subtracting b_T[i] · 2^{m−2} removes the leftmost bit from f_i, multiplying the result by 2 shifts the number to the left by one position, and adding b_T[i+m−1] brings in the appropriate rightmost bit.
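A small Python sketch of this sliding update (our own illustration, with 0-based indices and the bit orientation assumed earlier; each step does constant work after the first window):

```python
def rolling_fingerprints(text, m):
    # All binary-encoding fingerprints of length-m windows of text,
    # each next one derived from the previous in O(1) time.
    bits = [0 if text[i] <= text[i + 1] else 1 for i in range(len(text) - 1)]
    fp = 0
    for b in bits[:m - 1]:          # fingerprint of the first window
        fp = fp * 2 + b
    fps = [fp]
    for i in range(1, len(text) - m + 1):
        # drop the leftmost bit, shift left, bring in the new rightmost bit
        fp = 2 * (fp - bits[i - 1] * (1 << (m - 2))) + bits[i + m - 2]
        fps.append(fp)
    return fps
```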

The worst-case time complexity of Algorithm 2 is O(nM) (consider, e.g., a text and patterns all of whose characters are equal). The best-case time complexity is O(n), since the fingerprint f_{i+1} at position i + 1 can be computed in O(1) time using Equation (9).

4.3 Algorithm Based on Alpha Skip Search

Algorithm 3 in the Appendix shows the pseudo-code of an algorithm for multiple pattern Cartesian tree matching based on Alpha Skip Search [4]. Recall that a length-q string is called a block. The algorithm uses a hash table POS that maps the fingerprint of a block to the list of its occurrences in all length-m prefixes of the patterns. Either the parent-distance encoding or the binary encoding is used for fingerprinting. For a block B and a fingerprint encoding function f, POS is defined as follows:

POS[f(B)] = { (i, j) : f(P_i[j..j+q−1]) = f(B), 1 ≤ i ≤ k, 1 ≤ j ≤ m−q+1 }   (10)

In the preprocessing step, we build POS. In the search step, we scan the text from left to right, computing the fingerprint of a length-q substring of the text to get the list of pairs (i, j), each meaning that the fingerprint of P_i[j..j+q−1] is the same as that of the length-q substring of the text. Verification using an efficient comparison method is performed for each pair in the list. Note that the algorithm always shifts by m − q + 1.

The worst-case time complexity of Algorithm 3 is O(nM), where M is the total pattern length, q is the block size, and n is the length of the text (consider, e.g., a text and patterns all of whose characters are equal). On the other hand, the best-case time complexity of Algorithm 3 is O(nq/(m − q + 1)), since the algorithm always shifts by m − q + 1.

4.4 Selecting the Block Size

The size q of the block affects the running times of Algorithms 1 and 3. A longer block leads to a lower probability of candidate pattern occurrences, so it decreases the verification time. On the other hand, a longer block increases the overhead required for computing fingerprints. Thus, it is important to set a block size appropriate for each algorithm.

In order to set a block size, we first study the matching probability of two strings in terms of Cartesian trees. Assume that the numbers in the strings are independent and identically distributed, and that there are no identical numbers within any length-n string.

Lemma 1

Given two strings S_1[1..n] and S_2[1..n], the probability p(n) that S_1 and S_2 have the same Cartesian tree is given by the recurrence formula, where p(0) = 1 and p(1) = 1, as follows:

p(n) = (1/n²) · Σ_{i=1..n} p(i−1) · p(n−i)   (11)

We have the following upper bound on the matching probability.

Theorem 4.1

Assume that the numbers are independent and identically distributed, and that there are no identical numbers within any length-n string. Given two strings S_1[1..n] and S_2[1..n], the probability that the two strings match, in terms of Cartesian trees, is at most 1/2^{n−1}, i.e., p(n) ≤ 1/2^{n−1}.

We set the block size q = ⌈log₂(km)⌉ + 1 if ⌈log₂(km)⌉ + 1 ≤ m; otherwise we set q = m, where k is the number of patterns and m is the length of the shortest pattern, in order to get a low probability of a match and a relatively short block size with respect to m. By Theorem 4.1, if we set q = ⌈log₂(km)⌉ + 1, then p(q) ≤ 1/(km).
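Under this selection rule (as reconstructed here), the block size can be computed directly; the helper below is our own illustration, not the paper's code:

```python
import math

def block_size(k, m):
    # q = ceil(log2(k*m)) + 1, capped at the shortest pattern length m,
    # so that the matching probability of one block is at most 1/(k*m)
    # by the 1/2^(q-1) bound of Theorem 4.1.
    return min(m, math.ceil(math.log2(k * m)) + 1)
```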

5 Experiments

We conduct experiments to evaluate the performance of the proposed algorithms against the previous algorithm. We compare algorithms based on Aho-Corasick (AC) [18], Wu-Manber (WM), Rabin-Karp (RK), and Alpha Skip Search (AS). By default, all our algorithms use the optimization techniques introduced in Appendix 0.A.3, except the min-index filtering method, which is evaluated separately in the experiments. In particular, in order to compare the fingerprinting methods and to see the effect of the min-index filtering method, we compare variants of our algorithms. The following algorithms are evaluated.

  • AC: multiple pattern Cartesian tree matching algorithm based on Aho-Corasick [18].

  • WMP: algorithm based on Wu-Manber that uses the parent-distance encoding as a fingerprinting method.

  • WMB: algorithm based on Wu-Manber that uses the binary encoding as a fingerprinting method. The algorithm reuses fingerprints when adjacent blocks overlap in q − 1 characters (i.e., when the text shifts by one position), where q is the block size.

  • WMBM: WMB that exploits additional min-index filtering in Appendix 0.A.3.3.

  • RK: algorithm based on Rabin-Karp that uses the binary encoding as a fingerprinting method.

  • ASB: algorithm based on Alpha Skip Search that uses the binary encoding as a fingerprinting method. The algorithm reuses fingerprints when adjacent blocks overlap in q − 1 characters.

All algorithms are implemented in C++. Experiments are conducted on a machine with Intel Xeon E5-2630 v4 2.20GHz CPU and 128GB memory running CentOS Linux.

The total time includes the preprocessing time for building data structures and the search time. To evaluate an algorithm, we run it 100 times and measure the average total time in milliseconds.

We randomly build a text of length 10,000,000 where the alphabet size is 1,000. A pattern is extracted from the text at a random position.

5.1 Evaluation on the Equal Length Patterns

Figure 2: Evaluation on the length of the patterns. Left: patterns of equal length. Right: patterns of different lengths.

We first conduct experiments with sets of patterns of the same length. Figures 2(a), 2(c), 2(e), and Table 1 show the results, where k is the number of patterns and the x-axis represents the length m of the patterns. As the length of the patterns increases, WMB, WMBM, and ASB become the fastest algorithms due to long shift lengths, low verification time, and a lightweight fingerprinting method. WMBM and WMB outperform AC by up to 33 times (k = 10 and m = 256). ASB outperforms AC by up to 28 times (k = 10 and m = 256). RK outperforms AC by up to 3 times (k = 100 and m = 16). When the length of the patterns is extremely short, however, AC is the clear winner (m = 4). In this case, the other algorithms naïvely compare the greater part of the patterns for each position of the text. WMP performs visibly worse when m ≤ 8 due to this extreme situation and the overhead of its fingerprinting method. Since short patterns are more likely to have the same Cartesian trees, the proposed algorithms are sometimes faster when k = 100 than when k = 50 due to the grouping technique in Appendix 0.A.3. Comparing WMB and WMBM, the min-index filtering method is more effective when there are many short patterns (k = 100 and m ≤ 8).

k m AC WMP WMB WMBM RK ASB
10 4 129.46 303.093 176.249 166.04 165.351 147.889
8 142.114 241.573 83.6087 92.1517 69.0761 88.5753
16 138.79 93.5485 30.7575 33.4786 57.5673 39.4656
32 160.921 42.6767 12.3674 13.3405 115.497 21.5187
64 156.562 25.2625 7.59158 8.29616 115.381 11.0158
128 145.905 15.0862 5.0663 5.97869 115.296 7.03843
256 157.123 9.00974 4.69218 4.81503 102.995 5.43152
50 4 130.961 345.84 257.453 209.683 229.506 267.698
8 203.431 651.249 193.496 173.894 181.898 150.484
16 197.931 145.531 58.6471 59.2581 63.8881 68.6459
32 201.09 59.732 21.66 22.8856 115.723 30.24
64 197.544 30.9735 9.86238 10.6876 115.721 14.7707
128 203.944 18.0982 6.73188 6.9642 116.156 9.65942
256 221.186 12.0733 6.57459 6.66625 103.055 8.05778
100 4 132.263 346.139 264.371 209.588 229.396 267.633
8 225.327 681.149 319.767 231.097 278.165 264.218
16 211.893 180.281 70.2239 67.93 67.3007 85.3792
32 229.12 68.7025 24.4567 25.7314 115.032 36.4216
64 227.275 34.1059 11.6273 12.3154 116.446 17.1575
128 233.471 20.4809 9.49517 9.43364 115.08 12.6862
256 254.042 15.563 7.66052 7.5831 103.943 9.98069
Table 1: Evaluation on the patterns of equal length. Total time in ms.
k interval AC WMP WMB WMBM RK ASB
10 [8, 32] 152.628 240.46 97.506 103.019 65.6954 97.5208
[16, 64] 153.663 95.9347 30.7831 33.076 50.4686 35.7311
[32, 128] 150.329 44.4056 12.1087 13.629 103.051 19.3249
[64, 256] 147.741 25.5997 7.22873 7.83777 102.949 10.1762
50 [8, 32] 205.042 724.675 201.416 190.008 180.04 169.276
[16, 64] 196.745 149.612 60.3754 61.1807 54.4075 70.1237
[32, 128] 206.627 61.7051 18.5565 20.2259 104.028 27.9782
[64, 256] 203.731 31.6943 9.79816 10.678 104.11 15.3719
100 [8, 32] 217.625 757.974 331.015 250.613 300.803 304.732
[16, 64] 228.42 180.796 60.9719 63.0149 55.602 79.0707
[32, 128] 228.194 71.0881 22.5928 24.1753 104.574 33.8765
[64, 256] 237.803 35.1944 11.8472 12.4182 104.79 19.3238
Table 2: Evaluation on the patterns of different lengths. Total time in ms.

5.2 Evaluation on the Different Length Patterns

We compare the algorithms with sets of patterns of different lengths. Figures 2(b), 2(d), 2(f), and Table 2 show the results. The length of each pattern is randomly selected in an interval, i.e., [8, 32], [16, 64], [32, 128], or [64, 256]. After a length is selected, a pattern is extracted from the text at a random position. When there are many short patterns, i.e., k = 100 and patterns of length 8–32, AC is the fastest due to the short minimum pattern length.

When the length of the shortest pattern is sufficiently long, however, the proposed algorithms outperform AC. Specifically, WMB outperforms AC by up to 20 times (k = 10 and patterns of length 64–256). ASB outperforms AC by up to 14 times (k = 10 and patterns of length 64–256). RK outperforms AC by up to 4 times (k = 100 and patterns of length 16–64).

5.3 Evaluation on the Real Dataset

Figure 3: Evaluation on the Seoul temperatures dataset.
k m AC WMP WMB WMBM RK ASB
10 4 6.46631 20.9454 12.6187 10.2736 10.9732 11.8492
8 6.53721 14.3666 7.37876 7.14195 5.57104 8.00697
16 7.76917 7.8657 4.57646 4.85934 2.78754 4.5365
32 8.18157 3.89075 2.06438 2.27235 6.73496 5.99976
64 7.60696 4.37882 2.60346 2.7861 7.06377 3.01767
128 7.84501 1.34436 0.643153 0.743147 7.19664 1.86238
256 9.47242 0.88061 0.337183 0.36453 7.22575 0.850833
50 4 6.1634 22.5166 15.2899 11.4285 13.4452 14.4079
8 7.47185 33.9852 12.581 11.9699 10.2026 11.0986
16 9.53764 17.3211 11.0096 10.48 5.12495 15.234
32 9.80261 6.14176 5.79404 6.21041 6.90745 9.44348
64 9.82792 4.34029 4.09002 4.16979 7.15372 6.4055
128 11.6782 2.40814 1.91395 2.1363 7.34409 3.99501
256 14.7849 2.54673 1.5183 1.67897 7.47328 3.50649
100 4 6.15083 23.0344 16.4377 11.9024 14.58 15.904
8 8.11009 35.6604 16.5101 14.9557 13.8331 15.015
16 10.5246 22.2591 14.8361 14.3679 7.22885 21.4713
32 11.5976 8.5304 9.03897 9.25709 7.05395 13.7257
64 11.8653 5.6808 5.67174 5.92024 7.35357 9.04152
128 13.6058 3.71476 3.36717 3.74349 7.50687 6.83653
256 22.7509 4.3859 2.38758 2.66045 7.73048 5.58111
Table 3: Evaluation on the Seoul temperatures dataset. Total time in ms.

We conduct an experiment on a real dataset, a time series of Seoul temperatures. The Seoul temperatures dataset consists of 658,795 integers referring to the hourly temperatures in Seoul (multiplied by ten) in the years 1907–2019 [19]. In general, temperatures rise during the day and fall at night; therefore, the Seoul temperatures dataset has more matches than random datasets when patterns are extracted from the text. Figure 3 and Table 3 show the results on the Seoul temperatures dataset with sets of patterns of the same length. As the pattern length grows, the proposed algorithms run much faster than AC. For short patterns (m ≤ 8), AC is the fastest algorithm: AC is up to twice as fast as WMBM (k = 100 and m = 4) and 1.7 times as fast as RK (k = 10 and m = 4). For moderate-length patterns (m = 16 to 32), RK is up to 2.8 times faster than AC (k = 10 and m = 16), and WMB is up to 4 times faster than AC (k = 10 and m = 32). For relatively long patterns (m ≥ 64), all the proposed algorithms outperform AC. Specifically, WMB, WMBM, ASB, and WMP outperform AC by up to 28, 26, 11, and 10 times, respectively (k = 10 and m = 256), and RK outperforms AC by up to 2.9 times (k = 100 and m = 256).

References

  • [1] A. V. Aho and M. J. Corasick (1975) Efficient string matching: an aid to bibliographic search. Communications of the ACM 18 (6), pp. 333–340. Cited by: §1.
  • [2] O. Berkman, B. Schieber, and U. Vishkin (1993) Optimal doubly logarithmic parallel algorithms based on finding all nearest smaller values. Journal of Algorithms 14 (3), pp. 344–370. Cited by: §3.1.
  • [3] R. S. Boyer and J. S. Moore (1977) A fast string searching algorithm. Communications of the ACM 20 (10), pp. 762–772. Cited by: §1.
  • [4] C. Charras, T. Lecroq, and J. D. Pehoushek (1998) A very fast string matching algorithm for small alphabets and long patterns. In Annual Symposium on Combinatorial Pattern Matching, pp. 55–64. Cited by: §1, §4.3.
  • [5] T. Chhabra and J. Tarhio (2014) Order-preserving matching with filtration. In International Symposium on Experimental Algorithms, pp. 307–314. Cited by: §1, §3.2.
  • [6] B. Commentz-Walter (1979) A string matching algorithm fast on the average. In International Colloquium on Automata, Languages, and Programming, pp. 118–132. Cited by: §1.
  • [7] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein (2001) Introduction to Algorithms, Second Edition. MIT Press. Cited by: §3.2.
  • [8] M. Crochemore, A. Czumaj, L. Gasieniec, T. Lecroq, W. Plandowski, and W. Rytter (1999) Fast practical multi-pattern matching. Information Processing Letters 71 (3-4), pp. 107–113. Cited by: §1.
  • [9] A. Ganguly, W. Hon, K. Sadakane, R. Shah, S. V. Thankachan, and Y. Yang (2016) Space-efficient dictionaries for parameterized and order-preserving pattern matching. In 27th Annual Symposium on Combinatorial Pattern Matching (CPM), pp. 2:1–2:12. Cited by: §1.
  • [10] M. Han, M. Kang, S. Cho, G. Gu, J. S. Sim, and K. Park (2015) Fast multiple order-preserving matching algorithms. In International Workshop on Combinatorial Algorithms, pp. 248–259. Cited by: §3.1.
  • [11] N. Hua, H. Song, and T. Lakshman (2009)

    Variable-stride multi-pattern matching for scalable deep packet inspection

    .
    In IEEE INFOCOM 2009, pp. 415–423. Cited by: §1.
  • [12] R. M. Karp and M. O. Rabin (1987) Efficient randomized pattern-matching algorithms. IBM journal of research and development 31 (2), pp. 249–260. Cited by: §1, §4.2.
  • [13] J. Kim, P. Eades, R. Fleischer, S. Hong, C. S. Iliopoulos, K. Park, S. J. Puglisi, and T. Tokuyama (2014) Order-preserving matching. Theoretical Computer Science 525, pp. 68–79. Cited by: §1.
  • [14] D. E. Knuth (2014) The art of computer programming, volume 2: seminumerical algorithms. Addison-Wesley Professional. Cited by: §3.1.
  • [15] M. Kubica, T. Kulczyński, J. Radoszewski, W. Rytter, and T. Waleń (2013) A linear time algorithm for consecutive permutation pattern matching. Information Processing Letters 113 (12), pp. 430–433. Cited by: §1.
  • [16] H. Liao, C. R. Lin, Y. Lin, and K. Tung (2013) Intrusion detection system: a comprehensive review. Journal of Network and Computer Applications 36 (1), pp. 16–24. Cited by: §1.
  • [17] J. N. Liu and R. W. Kwong (2007) Automatic extraction and identification of chart patterns towards financial forecast. Applied Soft Computing 7 (4), pp. 1197–1208. Cited by: §1.
  • [18] S. Park, A. Amir, G. M. Landau, and K. Park (2019) Cartesian tree matching and indexing. In 30th Annual Symposium on Combinatorial Pattern Matching (CPM), pp. 16:1–16:14. Cited by: Fast Multiple Pattern Cartesian Tree Matching, §0.A.3.4, §1, §1, §1, §3.1, §3.1, 1st item, §5, Definition 1, Definition 2.
  • [19] S. Song, C. Ryu, S. Faro, T. Lecroq, and K. Park (2019) Fast Cartesian tree matching algorithms. Accepted to SPIRE 2019, https://arxiv.org/abs/1908.04937. Cited by: §0.A.3.1, §1, §3.2, §3.2, §5.3.
  • [20] T. Song, W. Zhang, D. Wang, and Y. Xue (2008) A memory efficient multiple pattern matching architecture for network security. In IEEE INFOCOM 2008-The 27th Conference on Computer Communications, pp. 166–170. Cited by: §1.
  • [21] J. Vuillemin (1980) A unifying look at data structures. Communications of the ACM 23 (4), pp. 229–239. Cited by: §2.1.
  • [22] S. Wu and U. Manber (1994) A fast algorithm for multi-pattern searching. Technical report. TR-94-17, Department of Computer Science, University of Arizona. Cited by: §1, §4.1.

Appendix 0.A Appendix

1: Input: text T of length n, patterns P_1, …, P_k
2: m ← length of the shortest pattern
3: procedure Preprocessing
4:     for i ← 1 to k do
5:         f ← fingerprint of the length-m prefix of P_i
6:         insert P_i into the hash table under key f
7: procedure Search
8:     pos ← 1
9:     while pos ≤ n − m + 1 do
10:         f ← fingerprint of the block T[pos .. pos+m−1]
11:         for each pattern P_i stored under key f do
12:             if T[pos .. pos+|P_i|−1] matches P_i then
13:                 output (pos, i)
14:         pos ← pos + 1
Algorithm 2 Algorithm based on Rabin-Karp
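The filtering-and-verification structure of this algorithm can be sketched in Python as follows. This is our own illustrative simplification, not the authors' implementation: it uses the parent-distance representation of [18] both as the fingerprint key and for naive verification (the optimized verification of §0.A.3.1 avoids computing any representation of the text), and the names `parent_distance` and `rabin_karp_ctm` are ours.

```python
def parent_distance(s):
    """Parent-distance representation [18]: pd[i] is the distance from i to
    the nearest previous position j with s[j] <= s[i], or 0 if none exists."""
    stack, pd = [], []
    for i, v in enumerate(s):
        while stack and s[stack[-1]] > v:
            stack.pop()
        pd.append(i - stack[-1] if stack else 0)
        stack.append(i)
    return tuple(pd)

def rabin_karp_ctm(text, patterns):
    """Report (position, pattern index) pairs where a substring of text has
    the same Cartesian tree as a pattern.  Filtering keys each pattern by the
    parent-distance encoding of its length-m prefix, m = shortest length."""
    m = min(len(p) for p in patterns)
    table = {}
    for idx, p in enumerate(patterns):
        table.setdefault(parent_distance(p[:m]), []).append(idx)
    out = []
    for pos in range(len(text) - m + 1):
        key = parent_distance(text[pos:pos + m])       # filtering stage
        for idx in table.get(key, ()):
            p = patterns[idx]
            if pos + len(p) <= len(text) and \
               parent_distance(text[pos:pos + len(p)]) == parent_distance(p):
                out.append((pos, idx))                 # verification stage
    return out
```

For example, `rabin_karp_ctm([3,1,4,1,5,9,2,6], [[2,1,3]])` reports the three substrings whose Cartesian trees match that of `[2,1,3]`.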
1: Input: text T of length n, patterns P_1, …, P_k
2: m ← length of the shortest pattern
3: procedure Preprocessing
4:     ℓ ← block length
5:     for i ← 1 to k do
6:         for j ← 1 to m − ℓ + 1 do
7:             f ← fingerprint of the block P_i[j .. j+ℓ−1]
8:             store the pair (i, j) under key f
9: procedure Search
10:     pos ← m − ℓ + 1
11:     while pos ≤ n − ℓ + 1 do
12:         f ← fingerprint of the block T[pos .. pos+ℓ−1]
13:         for each pair (i, j) stored under key f do
14:             if T[pos−j+1 .. pos−j+|P_i|] matches P_i then
15:                 output (pos − j + 1, i)
16:         pos ← pos + m − ℓ + 1
Algorithm 3 Algorithm based on Alpha Skip Search

0.a.1 Proof of Lemma 1

Proof

Fix an index i. The probability that the i-th number is the root (i.e., the minimum) of one string is 1/n, so, since the two strings are independent, the probability that both strings have their i-th numbers as their roots is 1/n · 1/n = 1/n². Summing the probabilities for 1 ≤ i ≤ n gives the probability n · 1/n² = 1/n. ∎
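The probability computed in this proof, that two random strings have their roots (minimum values) at the same index, can be checked empirically. The following Python simulation is our own illustration, not part of the paper; the function name and parameters are ours.

```python
import random

def same_root_probability(n, trials=100_000, seed=42):
    """Estimate the probability that two random length-n permutations have
    their minimum at the same index, i.e. the same Cartesian tree root.
    The estimate should be close to 1/n."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        x = rng.sample(range(n), n)
        y = rng.sample(range(n), n)
        if x.index(min(x)) == y.index(min(y)):
            hits += 1
    return hits / trials
```

With `n = 8`, the estimate settles near 1/8 = 0.125.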

0.a.2 Proof of Theorem 4.1

Proof

We prove the theorem by induction.

In the base case, the claimed equality follows directly from the definitions, so the theorem holds there.

Assume that the theorem holds for all smaller cases; we show that it then holds for the next case:

(12)

Therefore the claimed equality holds in this case as well, which completes the induction. ∎

0.a.3 Optimization Techniques

0.a.3.1 Optimizing Naïve Verification

An efficient verification method is essential for the three proposed algorithms because they all adopt the filtering and verification approach. We employ the verification method introduced by Song et al. [19], which is based on the global-parent representation of a string S: for each position i, it stores the index of the parent of S[i] in the Cartesian tree of S, where the parent of the root is defined to be the root itself. Two strings have the same Cartesian trees if and only if their global-parent representations agree at every position under the conditions given in [19]. Once the global-parent representation of a pattern has been computed, we can verify in linear time whether a text substring has the same Cartesian tree as the pattern by checking these conditions; notably, no representation of the text is needed. In our algorithms, the global-parent representations of the patterns are computed and stored in advance, and verification is done by the above method without computing any representation of the text.
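For illustration, a minimal Python sketch of the global-parent representation and a simplified equality check follows. This sketch computes the representation of both strings for clarity, whereas the optimized verification of Song et al. [19] avoids computing any representation of the text; the names `global_parent` and `same_cartesian_tree` are ours.

```python
def global_parent(s):
    """Global-parent representation: gp[i] is the index of the parent of
    position i in the Cartesian tree of s (the root is its own parent)."""
    n = len(s)
    gp = list(range(n))
    stack = []  # indices on the rightmost path, values increasing
    for i in range(n):
        last = None
        while stack and s[stack[-1]] > s[i]:
            last = stack.pop()
        if last is not None:
            gp[last] = i          # popped subtree becomes the left child of i
        gp[i] = stack[-1] if stack else i   # right child of stack top, or root
        stack.append(i)
    return gp

def same_cartesian_tree(x, y):
    """Simplified check: compare global-parent representations directly."""
    return len(x) == len(y) and global_parent(x) == global_parent(y)
```

For example, `[3,1,2]` and `[5,2,4]` share a Cartesian tree (root in the middle), while `[1,3,2]` does not match them.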

0.a.3.2 Reusing Fingerprint of Binary Encoding

In Algorithm 2, successive fingerprints can be computed in constant time by Equation (9) when using the binary encoding. Likewise, we can reuse a previous fingerprint to create the current fingerprint in Algorithms 1 and 3 as well: when two blocks overlap, the new fingerprint can be obtained by applying Equation (9) once per shifted position. Our experimental study showed that reusing fingerprints is most efficient when the shift is one; thus, we reuse fingerprints only when the text shifts by one position. It is worth mentioning that we do not reuse fingerprints of the parent-distance encoding, because a single shift can change multiple characters of the parent-distance representation, countervailing the effect of reusing.
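The reuse mechanism can be illustrated with a generic rolling fingerprint over a bit sequence. The sketch below is our own simplification: it assumes the binary encoding has already produced a bit array `bits`, and it uses a plain shift-and-mask window fingerprint in place of the paper's Equation (9); `q` denotes the window size.

```python
def rolling_fingerprints(bits, q):
    """Fingerprints of all length-q windows of a bit sequence, each obtained
    from the previous one in constant time: shift left, append the incoming
    bit, and mask off the outgoing high-order bit."""
    fp = 0
    for b in bits[:q]:                       # fingerprint of the first window
        fp = (fp << 1) | b
    out = [fp]
    mask = (1 << q) - 1
    for i in range(q, len(bits)):
        fp = ((fp << 1) | bits[i]) & mask    # one update per shifted position
        out.append(fp)
    return out
```

For example, the windows of `[1,0,1,1,0]` with `q = 3` yield the fingerprints 0b101, 0b011, and 0b110.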

0.a.3.3 Additional Filtering via Min-index

In the filtering stage of an algorithm, we may further filter out candidate patterns by additional filtering methods. We introduce a simple filtering method based on the index of the minimum value (min-index). Since two strings have the same Cartesian trees only if the indices of the minimum values (roots) of the two strings are the same, we may first compare the min-index before we verify each candidate pattern retrieved by a fingerprint. To this end, in the preprocessing step we store, for each input pattern, the min-index of the last block of its length-m prefix, where m is the length of the shortest pattern. In the search step, the fingerprint and the min-index of a block in the text are computed at the same time. Among the patterns retrieved by the fingerprint, only those whose stored min-index equals that of the block in the text are verified. The information of the root is not represented by the binary representation, but it is represented by the parent-distance representation. Therefore, this additional filtering method is effective only when we use the binary encoding.
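A minimal sketch of this filter, with our own illustrative names (`min_index`, `min_index_filter`, `stored_min_index`):

```python
def min_index(block):
    """Index of the minimum value in a block: the root of its Cartesian tree."""
    return min(range(len(block)), key=lambda i: block[i])

def min_index_filter(candidates, text_block, stored_min_index):
    """Keep only the candidate patterns whose stored min-index equals that of
    the current text block; only the survivors go on to full verification."""
    root = min_index(text_block)
    return [p for p in candidates if stored_min_index[p] == root]
```

For example, with a text block `[5, 2, 7]` (min-index 1), only candidates whose stored min-index is 1 survive the filter.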

0.a.3.4 Grouping Patterns Having the Same Cartesian Trees

Since the input patterns are strings, some of them may have the same Cartesian trees. The Aho-Corasick based algorithm [18] assembles such patterns into a single state of its automaton, while the algorithms presented in this paper do not do so explicitly. In our implementation, we group the patterns having the same Cartesian trees to avoid redundant computation. This process is particularly beneficial for short input patterns.
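Grouping can be implemented by keying the patterns on a canonical representation of their Cartesian trees; the parent-distance representation of [18] serves this purpose, since two strings have the same Cartesian tree exactly when their parent-distance representations are equal. A sketch with our own names:

```python
from collections import defaultdict

def parent_distance(s):
    """Parent-distance representation [18], used here as a canonical key:
    pd[i] is the distance from i to the nearest previous position j with
    s[j] <= s[i], or 0 if none exists."""
    stack, pd = [], []
    for i, v in enumerate(s):
        while stack and s[stack[-1]] > v:
            stack.pop()
        pd.append(i - stack[-1] if stack else 0)
        stack.append(i)
    return tuple(pd)

def group_patterns(patterns):
    """Group pattern indices whose Cartesian trees coincide, so that each
    group is filtered and verified only once."""
    groups = defaultdict(list)
    for idx, p in enumerate(patterns):
        groups[parent_distance(p)].append(idx)
    return list(groups.values())
```

For example, `[1,2,3]` and `[4,5,9]` fall into one group (both strictly increasing), while `[2,1,3]` forms its own.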