1 Introduction
^{†}^{†}Gu, Song, and Park were supported by the Collaborative Genome Program for Fostering New Post-Genome Industry through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT and Future Planning (No. NRF-2014M3C9A3063541).

Cartesian tree matching is the problem of finding all substrings in a given text which have the same Cartesian trees as that of a given pattern. For instance, given the text and pattern in Figure 0(a), the pattern has the same Cartesian tree as a substring of the text. Among many generalized matchings, Cartesian tree matching is analogous to order-preserving matching [13, 15, 5, 9] in the sense that both deal with the relative order between numbers. Accordingly, both of them can be applied to time series data such as stock price analysis, but Cartesian tree matching can sometimes be more appropriate than order-preserving matching for finding patterns [18].
In this paper, we deal with Cartesian tree matching for the case of multiple patterns. Although finding multiple different patterns is interesting by itself, multiple pattern Cartesian tree matching can also be applied to finding one meaningful pattern when that pattern is represented by multiple Cartesian trees. Suppose we are looking for the double-top pattern [17]. The two Cartesian trees in Figure 0(b) are required to identify the pattern, where the relative order between the two peaks causes the difference. In general, the more complex the pattern is, the more Cartesian trees of the same length are required (e.g., the head-and-shoulders pattern [17] requires four Cartesian trees).
Recently, Park et al. [18] introduced (single pattern) Cartesian tree matching, multiple pattern Cartesian tree matching, and Cartesian tree indexing, together with their respective algorithms. They proposed the parent-distance representation, which has a one-to-one mapping with Cartesian trees, and gave linear-time solutions for the problems, utilizing the representation and existing string algorithms, namely the KMP algorithm, the Aho-Corasick algorithm, and a suffix tree construction algorithm. Song et al. [19] proposed new representations of Cartesian trees, and gave practically fast algorithms for Cartesian tree matching based on the framework of filtering and verification.
Extensive work has been done to develop algorithms for multiple pattern matching, which is one of the fundamental problems in computer science [20, 11, 16]. Aho and Corasick [1] presented a linear-time algorithm based on an automaton. Commentz-Walter [6] presented an algorithm that combines the Aho-Corasick algorithm and the Boyer-Moore technique [3]. Crochemore et al. [8] proposed an algorithm that combines the Aho-Corasick automaton and a Directed Acyclic Word Graph, which runs in linear time in the worst case and in sublinear time on average with respect to the length of the shortest pattern. Rabin and Karp [12] proposed a fingerprint-based algorithm that runs in linear time on average. Charras et al. [4] proposed an algorithm called Alpha Skip Search, which can efficiently handle both a single pattern and multiple patterns. Wu and Manber [22] presented an algorithm that uses an extension of the Boyer-Moore-Horspool technique.

In this paper we present practically fast algorithms for multiple pattern Cartesian tree matching. We present three algorithms based on Wu-Manber, Rabin-Karp, and Alpha Skip Search. All of them use the filtering and verification approach, where filtering relies on efficient fingerprinting methods for a string. Two fingerprinting methods are presented: the parent-distance encoding and the binary encoding. By combining an efficient fingerprinting method with a conventional multiple string matching algorithm, we can efficiently solve multiple pattern Cartesian tree matching. In the experiments we compare our solutions against the previous algorithm [18], which is based on the Aho-Corasick algorithm. Our solutions run faster than the previous algorithm; in particular, our algorithm based on Wu-Manber runs up to 33 times faster.
2 Problem Definition
2.1 Notation
A string is a sequence of characters drawn from an alphabet Σ, which is a set of integers. We assume that a comparison between any two characters can be done in constant time. For a string S, S[i] represents the i-th character of S, and S[i..j] represents the substring of S starting at position i and ending at position j.
A Cartesian tree [21] is a binary tree derived from a string. Specifically, the Cartesian tree CT(S) for a string S[1..n] can be uniquely defined as follows:

If S is an empty string, CT(S) is an empty tree.

If S is not empty and S[i] is the minimum value of S, CT(S) is the tree with S[i] as the root, CT(S[1..i−1]) as the left subtree, and CT(S[i+1..n]) as the right subtree. If there is more than one minimum value, we choose the leftmost one as the root.
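The recursive definition above can be realized iteratively in linear time. Below is a minimal Python sketch (ours, not from the paper) that builds the Cartesian tree with a stack holding the rightmost path; the class and function names are our own.

```python
class Node:
    def __init__(self, index):
        self.index = index   # 1-based position in the string
        self.left = None
        self.right = None

def cartesian_tree(s):
    """Build the Cartesian tree of s in O(n) time.

    The stack holds the rightmost path of the tree built so far.  Popping
    only on strictly larger values makes the leftmost minimum the root,
    matching the tie-breaking rule above.
    """
    stack = []
    for i, value in enumerate(s, start=1):
        node = Node(i)
        last = None
        while stack and s[stack[-1].index - 1] > value:
            last = stack.pop()
        node.left = last                 # popped subtree becomes left child
        if stack:
            stack[-1].right = node       # new node extends the rightmost path
        stack.append(node)
    return stack[0] if stack else None   # bottom of the stack is the root
```

For example, for the string (3, 1, 2, 5, 4) the root is at position 2, the position of the minimum.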
Given two strings S1 and S2, where m = |S2| ≤ |S1|, we say that S2 matches S1 at position i if CT(S1[i..i+m−1]) = CT(S2). For example, given the text T and the pattern P in Figure 0(a), P matches T at position 8. We also say that T[8..8+m−1] is a match of P in T.
Cartesian tree matching is the problem of finding all the matches in the text which have the same Cartesian trees as a given pattern.
Definition 1
(Cartesian tree matching [18]) Given a text T[1..n] and a pattern P[1..m], find every position 1 ≤ i ≤ n − m + 1 such that CT(T[i..i+m−1]) = CT(P).
2.2 Multiple Pattern Cartesian Tree Matching
Cartesian tree matching can be extended to the case of multiple patterns. Multiple pattern Cartesian tree matching is the problem of finding all the matches in the text which have the same Cartesian trees as at least one of the given patterns.
Definition 2
(Multiple pattern Cartesian tree matching [18]) Given a text T[1..n] and patterns P_1, P_2, …, P_k, find every position in the text which matches at least one pattern, i.e., whose substring has the same Cartesian tree as that of at least one pattern.
3 Fingerprinting Methods
Fingerprinting is a technique that maps a string to a much shorter form of data, such as a bit string or an integer. In Cartesian tree matching, we can use fingerprints to filter out unpromising matching positions with low computational cost.
In this section we introduce two fingerprinting methods, the parent-distance encoding and the binary encoding, for the purpose of representing information about a Cartesian tree as an integer. The two encodings make use of the parent-distance representation and the binary representation, respectively, both of which are strings that represent Cartesian trees.
3.1 Parentdistance Encoding
In order to represent Cartesian trees efficiently, Park et al. proposed the parent-distance representation [18], which is an alternative form of the all nearest smaller values [2].
Definition 3
(Parent-distance representation) Given a string S[1..n], the parent-distance representation of S is an integer string PD(S)[1..n], which is defined as follows:

(1)  PD(S)[i] = i − max{ j : 1 ≤ j < i, S[j] ≤ S[i] } if such j exists, and PD(S)[i] = 0 otherwise.
Intuitively, PD(S)[i] stores the distance between S[i] and its parent in CT(S[1..i]). For example, the parent-distance representation of the string S = (3, 1, 2, 5, 4) is PD(S) = (0, 0, 1, 1, 2); for instance, PD(S)[5] = 2 because S[3] = 2 is the parent of S[5] = 4 in CT(S[1..5]). The parent-distance representation has a one-to-one mapping to the Cartesian tree [18], and so if two strings have the same parent-distance representations, they also have the same Cartesian trees. The parent-distance representation of a string can be computed in linear time [18]. Note that PD(S)[i] holds a value between 0 and i − 1 by definition, and PD(S)[1] = 0 at all times.
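The linear-time computation of the parent-distance representation can be sketched as follows (our Python rendering, not the paper's pseudocode), using a stack of candidate parents:

```python
def parent_distance(s):
    """Return PD(S) for a string s, in O(n) time.

    The stack keeps 1-based indices whose values form an increasing
    sequence; the nearest remaining index j < i with s[j] <= s[i] is the
    parent of position i in CT(s[1..i]).
    """
    pd = []
    stack = []
    for i, value in enumerate(s, start=1):
        while stack and s[stack[-1] - 1] > value:
            stack.pop()                      # larger values cannot be parents
        pd.append(i - stack[-1] if stack else 0)
        stack.append(i)
    return pd
```

Each index is pushed and popped at most once, so the total running time is linear.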
With the parent-distance representation, we can define a fingerprint encoding function that maps a string to an integer, using the factorial number system [14].
Definition 4
(Parent-distance Encoding) Given a string S[1..n], the encoding function E_PD, which maps S into an integer within the range [0, n! − 1], is defined as follows:

(2)  E_PD(S) = Σ_{i=1}^{n} PD(S)[i] · (i − 1)!
The parent-distance encoding maps a string into a unique integer according to its parent-distance representation. That is, given two strings S1 and S2 of the same length, E_PD(S1) = E_PD(S2) if and only if PD(S1) = PD(S2). This is because Equation (2) interprets PD(S) as the digits of a number in the factorial number system, which is valid due to the fact that 0 ≤ PD(S)[i] ≤ i − 1. The encoding function can be computed in O(n) time, since PD(S) can be computed in linear time. For a long string, the fingerprint may not fit in a word, so we select a prime number by which we divide the fingerprint, and use the residue instead of the actual fingerprint. A similar encoding function was used to solve the multiple pattern order-preserving matching problem [10].
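As a sketch (with our own function names, and the parent-distance routine inlined for self-containment), the encoding of Definition 4 with an optional modular reduction might look like:

```python
from math import factorial

def parent_distance(s):
    """PD(S) in linear time, using a stack of candidate parent indices."""
    pd, stack = [], []
    for i, value in enumerate(s, start=1):
        while stack and s[stack[-1] - 1] > value:
            stack.pop()
        pd.append(i - stack[-1] if stack else 0)
        stack.append(i)
    return pd

def pd_encode(s, prime=None):
    """Parent-distance encoding: PD(S) read as factorial-number-system digits.

    Since 0 <= PD(S)[i] <= i - 1, each digit is valid in the factorial
    number system, so the encoding is injective on parent-distance
    representations of a fixed length.
    """
    fp = sum(d * factorial(i) for i, d in enumerate(parent_distance(s)))
    return fp % prime if prime is not None else fp
```

Two strings with the same Cartesian tree, such as (3, 1, 2, 5, 4) and (30, 10, 20, 50, 40), get the same fingerprint.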
3.2 Binary Encoding
For order-preserving matching, the representation of a string as a binary string was first presented by Chhabra and Tarhio [5]. Recently, Song et al. made use of a binary representation for Cartesian tree matching as follows [19].
Definition 5
(Binary representation) Given a length-n string S[1..n], the binary representation B(S) of length n − 1 is defined as follows: for 1 ≤ i ≤ n − 1,

(3)  B(S)[i] = 1 if S[i] < S[i+1], and B(S)[i] = 0 otherwise.
Given two strings S1 and S2 of the same length, the binary representations B(S1) and B(S2) are the same if the Cartesian trees CT(S1) and CT(S2) are the same [19]. The mapping from Cartesian trees to binary representations, however, is many-to-one. Thus, two strings whose binary representations are the same may not have the same Cartesian trees, but two strings whose Cartesian trees are the same always have the same binary representations.
A fingerprint encoding function can be defined using the binary representation.
Definition 6
(Binary Encoding) Given a string S[1..n], the encoding function E_B, which maps S into an integer within the range [0, 2^{n−1} − 1], is defined as follows:

(4)  E_B(S) = Σ_{i=1}^{n−1} B(S)[i] · 2^{n−1−i}
Since E_B(S) is a polynomial in 2, it can be computed in linear time using Horner’s rule [7]. Moreover, a fingerprint computed by the binary encoding can be reused when two strings overlap, which will be discussed in Appendix 0.A.3.2. As with the parent-distance encoding, in case the fingerprint does not fit in a word, we select a prime number by which we divide the fingerprint, and use the residue instead of the actual fingerprint.
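The two definitions can be sketched together as follows; note that the bit convention (1 when the left neighbor is smaller) is our assumption of one of the two symmetric choices:

```python
def binary_repr(s):
    """B(S): one bit per adjacent pair of characters (length n - 1)."""
    return [1 if a < b else 0 for a, b in zip(s, s[1:])]

def binary_encode(s, prime=None):
    """Binary-encoding fingerprint, accumulated with Horner's rule."""
    fp = 0
    for bit in binary_repr(s):
        fp = fp * 2 + bit
        if prime is not None:
            fp %= prime                  # keep the fingerprint word-sized
    return fp
```

Strings with the same Cartesian tree always produce the same bit string, so they also produce the same fingerprint.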
4 Fast Multiple Pattern Cartesian Tree Matching Algorithms
In this section we introduce three algorithms for multiple pattern Cartesian tree matching. Each of them consists of a preprocessing step and a search step. In the preprocessing step, hash tables are built using fingerprints of the patterns. In the search step, the filtering and verification approach is adopted. To filter out unpromising matching positions, a fingerprinting method is applied to either length-m substrings of the text, where m is the length of the shortest pattern, or much shorter length-q substrings of the text (we will discuss how to set q in Section 4.4). Then each candidate pattern is verified by an efficient comparison method (see Appendix 0.A.3.1).
4.1 Algorithm Based on WuManber
Algorithm 1 shows the pseudocode of an algorithm for multiple pattern Cartesian tree matching based on the Wu-Manber algorithm [22]. The algorithm uses two hash tables, HASH and SHIFT. Both tables use fingerprints of length-q strings, called blocks. Either the parent-distance encoding or the binary encoding is used to compute the fingerprints. Given patterns P_1, …, P_k, let m be the length of the shortest pattern. HASH maps a fingerprint fp of a block to the list of patterns P_i such that the fingerprint of the last block in P_i’s length-m prefix is the same as fp. For a block B and a fingerprint encoding function f, HASH is defined as follows:

(5)  HASH[f(B)] = { P_i : f(P_i[m−q+1..m]) = f(B), 1 ≤ i ≤ k }
SHIFT maps a fingerprint fp of a block to the amount of a valid shift when the block appears in the text. The shift value is determined by the rightmost occurrence of the block, in terms of fingerprints, among the length-m prefixes of the patterns. For a block B and a fingerprint encoding function f, we define the rightmost occurrence rm(B) as follows:

(6)  rm(B) = max{ j : f(P_i[j−q+1..j]) = f(B), q ≤ j ≤ m, 1 ≤ i ≤ k }

Then SHIFT is defined as follows:

(7)  SHIFT[f(B)] = m − rm(B) if rm(B) is defined, and SHIFT[f(B)] = m − q + 1 otherwise.
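As an illustration (our own simplified Python, with the binary encoding standing in as the block fingerprint f), the construction of HASH and SHIFT might look like:

```python
def bin_fp(block):
    """Stand-in block fingerprint: the binary encoding of the block."""
    fp = 0
    for a, b in zip(block, block[1:]):
        fp = fp * 2 + (1 if a < b else 0)
    return fp

def build_tables(patterns, q):
    """Build HASH and SHIFT over all blocks of the length-m prefixes."""
    m = min(len(p) for p in patterns)
    default = m - q + 1                    # shift for blocks that never occur
    hash_tbl, shift_tbl = {}, {}
    for i, p in enumerate(patterns):
        for j in range(q, m + 1):          # j = right end of a block (1-based)
            fp = bin_fp(p[j - q:j])
            shift_tbl[fp] = min(shift_tbl.get(fp, default), m - j)
            if j == m:                     # last block of the length-m prefix
                hash_tbl.setdefault(fp, []).append(i)
    return hash_tbl, shift_tbl, default
```

A block fingerprint seen at the end of some prefix gets shift 0; a block never seen in any prefix allows the maximal shift of m − q + 1.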
In the preprocessing step, we build HASH and SHIFT (as described in Algorithm 1). In the search step, we scan the text from left to right, computing the fingerprint of a length-q substring of the text to get a list of patterns from HASH. Let index i be the current scanning position of the text, i.e., the right end of the current length-m window. We compute the fingerprint fp of T[i−q+1..i], and get the list of patterns in the entry HASH[fp]. If the list is not empty, each pattern in it is verified by an efficient comparison method (see Appendix 0.A.3.1); for a pattern P_j in the list, the comparison method verifies whether P_j matches T at position i − m + 1. After verifying all patterns in the list, the text is shifted by SHIFT[fp].
4.2 Algorithm Based on RabinKarp
Algorithm 2 in the Appendix shows the pseudocode of an algorithm for multiple pattern Cartesian tree matching based on the Rabin-Karp algorithm [12]. The algorithm uses one hash table, namely HASH. HASH is defined similarly as in Algorithm 1, except that we consider length-m prefixes instead of blocks and we use only the binary encoding for fingerprinting. For a string S of length m and the binary encoding function E_B, HASH is defined as follows:

(8)  HASH[E_B(S)] = { P_i : E_B(P_i[1..m]) = E_B(S), 1 ≤ i ≤ k }
In the preprocessing step, we build HASH. In the search step, we shift the window one position at a time, and compute the fingerprint of each length-m substring of the text to get candidate patterns by using HASH. Again, each candidate pattern is verified by an efficient comparison method.
Given the fingerprint at position i of the text, the fingerprint at position i + 1 can be computed in constant time if we use the binary encoding as the fingerprinting method. Let the former fingerprint be f_i and the latter one be f_{i+1}. Then,

(9)  f_{i+1} = 2 · (f_i − B(T)[i] · 2^{m−2}) + B(T)[i+m−1]
Subtracting B(T)[i] · 2^{m−2} removes the leftmost bit from f_i, multiplying the result by 2 shifts the number to the left by one position, and adding B(T)[i+m−1] brings in the appropriate rightmost bit.
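Ignoring the modular reduction for clarity, the update can be sketched as a small Python illustration of Equation (9), where bits is the binary representation B(T) of the text:

```python
def roll(fp, bits, i, m):
    """Fingerprint of the window starting at i+1 from the one starting at i.

    A length-m window covers m-1 bits; bits[i] (0-based) is its leftmost
    bit, with weight 2**(m-2), and bits[i+m-1] is the incoming bit.
    """
    return 2 * (fp - bits[i] * (1 << (m - 2))) + bits[i + m - 1]
```

Combined with a precomputed 2^{m−2} and a modular reduction, each text position costs O(1).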
4.3 Algorithm Based on Alpha Skip Search
Algorithm 3 in the Appendix shows the pseudocode of an algorithm for multiple pattern Cartesian tree matching based on Alpha Skip Search [4]. Recall that a length-q string is called a block. The algorithm uses a hash table POS that maps the fingerprint of a block to the list of its occurrences in all length-m prefixes of the patterns. Either the parent-distance encoding or the binary encoding is used for fingerprinting. For a block B and a fingerprint encoding function f, POS is defined as follows:

(10)  POS[f(B)] = { (i, j) : f(P_i[j..j+q−1]) = f(B), 1 ≤ j ≤ m − q + 1, 1 ≤ i ≤ k }
In the preprocessing step, we build POS. In the search step, we scan the text from left to right, computing the fingerprint of a length-q substring of the text to get the list of pairs (i, j), each meaning that the fingerprint of P_i[j..j+q−1] is the same as that of the substring of the text. Verification using an efficient comparison method is performed for each pair in the list. Note that the algorithm always shifts the window by m − q + 1.
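The POS table can be sketched as follows (our Python, with the binary encoding inlined as the block fingerprint):

```python
def bin_fp(block):
    """Stand-in block fingerprint: the binary encoding of the block."""
    fp = 0
    for a, b in zip(block, block[1:]):
        fp = fp * 2 + (1 if a < b else 0)
    return fp

def build_pos(patterns, q):
    """Map each block fingerprint to its occurrences (pattern, start)
    in all length-m prefixes, with 1-based pattern and start indices."""
    m = min(len(p) for p in patterns)
    pos = {}
    for i, p in enumerate(patterns, start=1):
        for j in range(1, m - q + 2):
            pos.setdefault(bin_fp(p[j - 1:j + q - 1]), []).append((i, j))
    return pos
```

Each retrieved pair (i, j) tells the verifier where inside pattern P_i the current text block is supposed to occur.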
4.4 Selecting the Block Size
The size q of the block affects the running time of Algorithms 1 and 3. A longer block leads to a lower probability of candidate pattern occurrences, so it decreases the verification time. On the other hand, a longer block increases the overhead of computing fingerprints. Thus, it is important to set a block size appropriate for each algorithm.
In order to set the block size, we first study the matching probability of two strings in terms of Cartesian trees. Assume that the numbers are independent and identically distributed, and that there are no identical numbers within any window under comparison.
Lemma 1
Given two strings S1[1..n] and S2[1..n], the probability p(n) that S1 and S2 have the same Cartesian tree is given by the following recurrence, where p(0) = 1 and p(1) = 1:

(11)  p(n) = Σ_{i=1}^{n} (1/n²) · p(i−1) · p(n−i)
We have the following upper bound on the matching probability.
Theorem 4.1
Assume that the numbers are independent and identically distributed, and that there are no identical numbers within any window under comparison. Given two strings S1[1..n] and S2[1..n], the probability that the two strings match, in terms of Cartesian trees, is at most (1/3)^{(n−1)/2}, i.e., p(n) ≤ (1/3)^{(n−1)/2}.
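As a sanity check (ours, not part of the paper), the recurrence of Lemma 1 can be evaluated with exact rational arithmetic and compared against a bound of the form p(n) ≤ (1/3)^{(n−1)/2}, squaring both sides to avoid irrational powers:

```python
from fractions import Fraction

def match_probs(n):
    """p(0..n) from the recurrence p(n) = sum_i p(i-1) * p(n-i) / n^2."""
    p = [Fraction(1), Fraction(1)]          # p(0) = p(1) = 1
    for size in range(2, n + 1):
        total = sum(p[i - 1] * p[size - i] for i in range(1, size + 1))
        p.append(total / size**2)
    return p

probs = match_probs(12)
assert probs[2] == Fraction(1, 2)
assert probs[3] == Fraction(2, 9)
# p(n)^2 <= (1/3)^(n-1) for all checked n
assert all(probs[n] ** 2 <= Fraction(1, 3) ** (n - 1) for n in range(1, 13))
```

Exact fractions avoid floating-point error, so the inequality check is rigorous for the lengths tested.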
We set the block size q = 2⌈log₃ k⌉ + 1 if m ≥ 2⌈log₃ k⌉ + 1; otherwise we set q = m, where k is the number of patterns and m is the length of the shortest pattern, in order to get a low probability of a match and a relatively short block size with respect to m. By Theorem 4.1, if we set q = 2⌈log₃ k⌉ + 1, then p(q) ≤ (1/3)^{⌈log₃ k⌉} ≤ 1/k.
5 Experiments
We conduct experiments to evaluate the performance of the proposed algorithms against the previous algorithm. We compare algorithms based on Aho-Corasick (AC) [18], Wu-Manber (WM), Rabin-Karp (RK), and Alpha Skip Search (AS). By default, all our algorithms use the optimization techniques introduced in Appendix 0.A.3, except the min-index filtering method, which is evaluated separately. In particular, in order to compare the fingerprinting methods and see the effect of the min-index filtering method, we compare variants of our algorithms. The following algorithms are evaluated.

AC: multiple pattern Cartesian tree matching algorithm based on Aho-Corasick [18].

WMP: algorithm based on Wu-Manber that uses the parent-distance encoding as the fingerprinting method.

WMB: algorithm based on Wu-Manber that uses the binary encoding as the fingerprinting method. The algorithm reuses fingerprints when adjacent blocks overlap q − 1 characters (i.e., when the text shifts by one position), where q is the block size.

WMBM: WMB with the additional min-index filtering of Appendix 0.A.3.3.

RK: algorithm based on Rabin-Karp that uses the binary encoding as the fingerprinting method.

ASB: algorithm based on Alpha Skip Search that uses the binary encoding as the fingerprinting method. The algorithm reuses fingerprints when adjacent blocks overlap q − 1 characters.
All algorithms are implemented in C++. Experiments are conducted on a machine with an Intel Xeon E5-2630 v4 2.20GHz CPU and 128GB memory running CentOS Linux.
The total time includes the preprocessing time for building data structures and the search time. To evaluate an algorithm, we run it 100 times and measure the average total time in milliseconds.
We randomly build a text of length 10,000,000 where the alphabet size is 1,000. A pattern is extracted from the text at a random position.
5.1 Evaluation on the Equal Length Patterns
We first conduct experiments with sets of patterns of the same length. Figures 1(a), 1(c), 1(e), and Table 1 show the results, where k is the number of patterns and the x-axis represents the length m of the patterns. As the length of the patterns increases, WMB, WMBM, and ASB become the fastest algorithms due to long shift lengths, low verification time, and a lightweight fingerprinting method. WMBM and WMB outperform AC by up to 33 times (k = 10 and m = 256). ASB outperforms AC by up to 28 times (k = 10 and m = 256). RK outperforms AC by up to 3 times (k = 100 and m = 16). When the length of the patterns is extremely short (m = 4), however, AC is the clear winner. In this case, the other algorithms naïvely compare the greater part of the patterns at each position of the text. WMP performs visibly worse when m ≤ 8 due to this extreme situation and the overhead of its fingerprinting method. Since short patterns are more likely to have the same Cartesian trees, the proposed algorithms are sometimes faster when m = 4 than when m = 8 due to the grouping technique in Appendix 0.A.3. Comparing WMB and WMBM, the min-index filtering method is more effective when there are many short patterns (k = 100 and m = 8).
Table 1. Average total time (ms) for equal-length patterns.

k    m    AC       WMP      WMB       WMBM      RK       ASB
10   4    129.46   303.093  176.249   166.04    165.351  147.889
10   8    142.114  241.573  83.6087   92.1517   69.0761  88.5753
10   16   138.79   93.5485  30.7575   33.4786   57.5673  39.4656
10   32   160.921  42.6767  12.3674   13.3405   115.497  21.5187
10   64   156.562  25.2625  7.59158   8.29616   115.381  11.0158
10   128  145.905  15.0862  5.0663    5.97869   115.296  7.03843
10   256  157.123  9.00974  4.69218   4.81503   102.995  5.43152
50   4    130.961  345.84   257.453   209.683   229.506  267.698
50   8    203.431  651.249  193.496   173.894   181.898  150.484
50   16   197.931  145.531  58.6471   59.2581   63.8881  68.6459
50   32   201.09   59.732   21.66     22.8856   115.723  30.24
50   64   197.544  30.9735  9.86238   10.6876   115.721  14.7707
50   128  203.944  18.0982  6.73188   6.9642    116.156  9.65942
50   256  221.186  12.0733  6.57459   6.66625   103.055  8.05778
100  4    132.263  346.139  264.371   209.588   229.396  267.633
100  8    225.327  681.149  319.767   231.097   278.165  264.218
100  16   211.893  180.281  70.2239   67.93     67.3007  85.3792
100  32   229.12   68.7025  24.4567   25.7314   115.032  36.4216
100  64   227.275  34.1059  11.6273   12.3154   116.446  17.1575
100  128  233.471  20.4809  9.49517   9.43364   115.08   12.6862
100  256  254.042  15.563   7.66052   7.5831    103.943  9.98069
Table 2. Average total time (ms) for different-length patterns.

k    interval   AC       WMP      WMB      WMBM     RK       ASB
10   [8, 32]    152.628  240.46   97.506   103.019  65.6954  97.5208
10   [16, 64]   153.663  95.9347  30.7831  33.076   50.4686  35.7311
10   [32, 128]  150.329  44.4056  12.1087  13.629   103.051  19.3249
10   [64, 256]  147.741  25.5997  7.22873  7.83777  102.949  10.1762
50   [8, 32]    205.042  724.675  201.416  190.008  180.04   169.276
50   [16, 64]   196.745  149.612  60.3754  61.1807  54.4075  70.1237
50   [32, 128]  206.627  61.7051  18.5565  20.2259  104.028  27.9782
50   [64, 256]  203.731  31.6943  9.79816  10.678   104.11   15.3719
100  [8, 32]    217.625  757.974  331.015  250.613  300.803  304.732
100  [16, 64]   228.42   180.796  60.9719  63.0149  55.602   79.0707
100  [32, 128]  228.194  71.0881  22.5928  24.1753  104.574  33.8765
100  [64, 256]  237.803  35.1944  11.8472  12.4182  104.79   19.3238
5.2 Evaluation on the Different Length Patterns
We compare the algorithms with sets of patterns of different lengths. Figures 1(b), 1(d), 1(f), and Table 2 show the results. Each pattern length is randomly selected in an interval, i.e., [8, 32], [16, 64], [32, 128], or [64, 256]. After a length is selected, a pattern is extracted from the text at a random position. When there are many short patterns (k = 100 and patterns of length 8–32), AC is the fastest due to the short minimum pattern length.
When the length of the shortest pattern is sufficiently long, however, the proposed algorithms outperform AC. Specifically, WMB outperforms AC by up to 20 times (k = 10 and patterns of length 64–256). ASB outperforms AC by up to 14 times (k = 10 and patterns of length 64–256). RK outperforms AC by up to 4 times (k = 100 and patterns of length 16–64).
5.3 Evaluation on the Real Dataset
Table 3. Average total time (ms) on the Seoul temperatures dataset.

k    m    AC       WMP      WMB       WMBM      RK       ASB
10   4    6.46631  20.9454  12.6187   10.2736   10.9732  11.8492
10   8    6.53721  14.3666  7.37876   7.14195   5.57104  8.00697
10   16   7.76917  7.8657   4.57646   4.85934   2.78754  4.5365
10   32   8.18157  3.89075  2.06438   2.27235   6.73496  5.99976
10   64   7.60696  4.37882  2.60346   2.7861    7.06377  3.01767
10   128  7.84501  1.34436  0.643153  0.743147  7.19664  1.86238
10   256  9.47242  0.88061  0.337183  0.36453   7.22575  0.850833
50   4    6.1634   22.5166  15.2899   11.4285   13.4452  14.4079
50   8    7.47185  33.9852  12.581    11.9699   10.2026  11.0986
50   16   9.53764  17.3211  11.0096   10.48     5.12495  15.234
50   32   9.80261  6.14176  5.79404   6.21041   6.90745  9.44348
50   64   9.82792  4.34029  4.09002   4.16979   7.15372  6.4055
50   128  11.6782  2.40814  1.91395   2.1363    7.34409  3.99501
50   256  14.7849  2.54673  1.5183    1.67897   7.47328  3.50649
100  4    6.15083  23.0344  16.4377   11.9024   14.58    15.904
100  8    8.11009  35.6604  16.5101   14.9557   13.8331  15.015
100  16   10.5246  22.2591  14.8361   14.3679   7.22885  21.4713
100  32   11.5976  8.5304   9.03897   9.25709   7.05395  13.7257
100  64   11.8653  5.6808   5.67174   5.92024   7.35357  9.04152
100  128  13.6058  3.71476  3.36717   3.74349   7.50687  6.83653
100  256  22.7509  4.3859   2.38758   2.66045   7.73048  5.58111
We conduct experiments on a real dataset, a time series of Seoul temperatures. The Seoul temperatures dataset consists of 658,795 integers referring to the hourly temperatures in Seoul (multiplied by ten) in the years 1907–2019 [19]. In general, temperatures rise during the day and fall at night. Therefore, the Seoul temperatures dataset has more matches than random datasets when patterns are extracted from the text. Figure 3 and Table 3 show the results on the Seoul temperatures dataset with sets of patterns of the same length. As the pattern length grows, the proposed algorithms run much faster than AC. For short patterns (m = 4), AC is the fastest algorithm; AC is up to twice as fast as WMBM (k = 100 and m = 4) and 1.7 times as fast as RK (k = 10 and m = 4). For moderate-length patterns (8 ≤ m ≤ 32), RK is up to 2.8 times faster than AC (k = 10 and m = 16), and WMB is up to 4 times faster than AC (k = 10 and m = 32). For relatively long patterns (m ≥ 64), all the proposed algorithms outperform AC. Specifically, WMB, WMBM, ASB, and WMP outperform AC by up to 28, 26, 11, and 10 times, respectively (k = 10 and m = 256), and RK outperforms AC by up to 2.9 times (k = 100 and m = 256).
References
 [1] (1975) Efficient string matching: an aid to bibliographic search. Communications of the ACM 18 (6), pp. 333–340.
 [2] (1993) Optimal doubly logarithmic parallel algorithms based on finding all nearest smaller values. Journal of Algorithms 14 (3), pp. 344–370.
 [3] (1977) A fast string searching algorithm. Communications of the ACM 20 (10), pp. 762–772.
 [4] (1998) A very fast string matching algorithm for small alphabets and long patterns. In Annual Symposium on Combinatorial Pattern Matching, pp. 55–64.
 [5] (2014) Order-preserving matching with filtration. In International Symposium on Experimental Algorithms, pp. 307–314.
 [6] (1979) A string matching algorithm fast on the average. In International Colloquium on Automata, Languages, and Programming, pp. 118–132.
 [7] (2001) Introduction to Algorithms, Second Edition. MIT Press.
 [8] (1999) Fast practical multi-pattern matching. Information Processing Letters 71 (3–4), pp. 107–113.
 [9] (2016) Space-efficient dictionaries for parameterized and order-preserving pattern matching. In 27th Annual Symposium on Combinatorial Pattern Matching (CPM), pp. 2:1–2:12.
 [10] (2015) Fast multiple order-preserving matching algorithms. In International Workshop on Combinatorial Algorithms, pp. 248–259.
 [11] (2009) Variable-stride multi-pattern matching for scalable deep packet inspection. In IEEE INFOCOM 2009, pp. 415–423.
 [12] (1987) Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development 31 (2), pp. 249–260.
 [13] (2014) Order-preserving matching. Theoretical Computer Science 525, pp. 68–79.
 [14] (2014) The Art of Computer Programming, Volume 2: Seminumerical Algorithms. Addison-Wesley Professional.
 [15] (2013) A linear time algorithm for consecutive permutation pattern matching. Information Processing Letters 113 (12), pp. 430–433.
 [16] (2013) Intrusion detection system: a comprehensive review. Journal of Network and Computer Applications 36 (1), pp. 16–24.
 [17] (2007) Automatic extraction and identification of chart patterns towards financial forecast. Applied Soft Computing 7 (4), pp. 1197–1208.
 [18] (2019) Cartesian tree matching and indexing. In 30th Annual Symposium on Combinatorial Pattern Matching (CPM), pp. 16:1–16:14.
 [19] (2019) Fast Cartesian tree matching algorithms. Accepted to SPIRE 2019, https://arxiv.org/abs/1908.04937.
 [20] (2008) A memory-efficient multiple pattern matching architecture for network security. In IEEE INFOCOM 2008, The 27th Conference on Computer Communications, pp. 166–170.
 [21] (1980) A unifying look at data structures. Communications of the ACM 23 (4), pp. 229–239.
 [22] (1994) A fast algorithm for multi-pattern searching. Technical Report TR-94-17, Department of Computer Science, University of Arizona.
Appendix 0.A Appendix
0.A.1 Proof of Lemma 1
Proof
When the i-th numbers are the roots of both CT(S1[1..n]) and CT(S2[1..n]), the probability that CT(S1) = CT(S2) is p(i−1) · p(n−i), since the left subtrees, over strings of length i−1, and the right subtrees, over strings of length n−i, must match independently. Since there are n distinct numbers, the probability that both S1 and S2 have their i-th numbers as the minimum, and hence as the roots, is 1/n · 1/n = 1/n². Summing the probabilities for i = 1, …, n gives the recurrence for p(n). ∎
0.A.2 Proof of Theorem 4.1
Proof
We prove the theorem by induction on n.
If n ≤ 2, then p(1) = 1 = (1/3)^0 and p(2) = 1/2 ≤ (1/3)^{1/2}. Therefore, the theorem holds when n ≤ 2.
Let us assume that the theorem holds for every length i ≤ n − 1, where n ≥ 3, and show that it holds for n.
(12)  p(n) = (1/n²) Σ_{i=1}^{n} p(i−1) p(n−i) ≤ (1/n²) Σ_{i=1}^{n} (1/3)^{(i−2)/2} (1/3)^{(n−i−1)/2} = (1/n) · (1/3)^{(n−3)/2} ≤ (1/3) · (1/3)^{(n−3)/2} = (1/3)^{(n−1)/2}, where the last inequality uses 1/n ≤ 1/3 for n ≥ 3.
Therefore, we have proved that p(n) ≤ (1/3)^{(n−1)/2}. ∎
0.A.3 Optimization Techniques
0.A.3.1 Optimizing Naïve Verification
An efficient verification method is essential for the proposed three algorithms because they all adopt the filtering and verification approach. We employ the verification method introduced by Song et al. [19]. They first introduce the notion of the global-parent representation GP(S) of a string S[1..n], where GP(S)[i] stores the index of the parent of S[i] in CT(S). For example, the global-parent representation of the string S = (3, 1, 2, 5, 4) is GP(S) = (2, 2, 2, 5, 3). Note that the parent of the root is the root itself. Two strings S1 and S2 have the same Cartesian trees if and only if a simple position-wise condition on S1 and GP(S2) holds at every position [19]; no representation of S1 itself is needed. After the global-parent representation of a pattern is computed, we can verify whether a text window matches the pattern in linear time by checking the condition at each position. In our algorithms, the global-parent representations of the patterns are computed and stored in advance, and verification is done by the above method without computing any representation of the text.
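Under the assumption of distinct values in the window, one concrete way to realize such a linear-time check is the following sketch (our own formulation, not necessarily the exact condition of [19]): a window matches the pattern's Cartesian tree exactly when, at every position, the window value is at least the window value at the parent index given by the pattern's global-parent representation.

```python
def global_parent(s):
    """GP(S) with 1-based indices; the root stores its own index."""
    n = len(s)
    gp = [0] * (n + 1)
    stack = []                            # rightmost path of CT(s[1..i])
    for i in range(1, n + 1):
        last = 0
        while stack and s[stack[-1] - 1] > s[i - 1]:
            last = stack.pop()
        if last:
            gp[last] = i                  # last popped node becomes i's child
        if stack:
            gp[i] = stack[-1]             # i hangs off the current rightmost path
        stack.append(i)
    gp[stack[0]] = stack[0]               # overall minimum is the root
    return gp[1:]

def verify(window, gp):
    """True iff CT(window) has the shape encoded by gp (distinct values)."""
    return all(window[g - 1] <= w for w, g in zip(window, gp))
```

This works because a Cartesian tree is the unique binary tree over positions 1..n whose in-order traversal is 1, …, n and whose values satisfy the heap property, so checking the heap property against the pattern's parent indices suffices.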
0.A.3.2 Reusing Fingerprints of the Binary Encoding
In Algorithm 2, successive fingerprints can be computed in constant time by Equation (9) when using the binary encoding. Likewise, we can reuse a previous fingerprint to create the current fingerprint in Algorithms 1 and 3 as well. This can be done by applying Equation (9) repeatedly when two blocks of size q overlap. Our experimental study showed that reusing fingerprints only when the overlap is q − 1 (i.e., when the text shifts by one position) is the most efficient, so we reuse fingerprints only in that case. It is worth mentioning that we do not reuse fingerprints of the parent-distance encoding, because multiple characters of the parent-distance representation can change after just one shift, countervailing the effect of reusing.
0.A.3.3 Additional Filtering via Min-Index
In the filtering stage of an algorithm, we may further filter out candidate patterns by additional filtering methods. We introduce a simple filtering method based on the index of the minimum value (min-index). Since two strings have the same Cartesian trees only if the indices of the minimum values (roots) of the two strings are the same, we may first compare min-indices before we verify each candidate pattern retrieved by a fingerprint. To this end, for each input pattern P_i, we store the min-index of the last block P_i[m−q+1..m] of its length-m prefix in the preprocessing step, where m is the length of the shortest pattern. In the search step, the fingerprint and the min-index of a block in the text are computed at the same time. Among the patterns retrieved by the fingerprint, only those patterns P_i are verified whose stored min-index is the same as that of the block in the text. The information about the root is not represented by the binary representation, but it is represented by the parent-distance representation. Therefore, this additional filtering method is effective only when we use the binary encoding.
0.A.3.4 Grouping Patterns Having the Same Cartesian Trees
Since the input patterns are strings, some of them may have the same Cartesian trees. The Aho-Corasick-based algorithm [18] assembles such patterns into one state of its automaton, while the algorithms presented in this paper do not do so explicitly. In our implementation, we group the patterns having the same Cartesian trees, so as to avoid redundant computation. This is particularly beneficial for short input patterns.