1 Introduction

†† Gu, Song, and Park were supported by the Collaborative Genome Program for Fostering New Post-Genome Industry through the National Research Foundation of Korea (NRF), funded by the Ministry of Science, ICT and Future Planning (No. NRF-2014M3C9A3063541).
Cartesian tree matching is the problem of finding all substrings in a given text that have the same Cartesian tree as that of a given pattern. For instance, given the text T and the pattern P in Figure 0(a), P has the same Cartesian tree as a substring of T. Among the many generalized matchings, Cartesian tree matching is analogous to order-preserving matching [13, 15, 5, 9] in the sense that both deal with the relative order between numbers. Accordingly, both can be applied to time series data such as stock prices, but Cartesian tree matching can sometimes be more appropriate than order-preserving matching for finding patterns.
In this paper, we deal with Cartesian tree matching in the case of multiple patterns. Although finding multiple different patterns is interesting in itself, multiple pattern Cartesian tree matching can also be applied to finding one meaningful pattern when that pattern is represented by multiple Cartesian trees. Suppose we are looking for the double-top pattern. The two Cartesian trees in Figure 0(b) are required to identify the pattern, where the difference between them is caused by the relative order between two values. In general, the more complex the pattern is, the more Cartesian trees of the same length are required (e.g., the head-and-shoulders pattern requires four Cartesian trees).
Recently, Park et al. introduced (single pattern) Cartesian tree matching, multiple pattern Cartesian tree matching, and Cartesian tree indexing, together with their respective algorithms. They proposed the parent-distance representation, which has a one-to-one mapping with Cartesian trees, and gave linear-time solutions for the problems utilizing the representation and existing string algorithms, i.e., the KMP algorithm, the Aho-Corasick algorithm, and a suffix tree construction algorithm. Song et al. proposed new representations of Cartesian trees, and presented practically fast algorithms for Cartesian tree matching based on the framework of filtering and verification.
Extensive work has been done on algorithms for multiple pattern matching, which is one of the fundamental problems in computer science [20, 11, 16]. Aho and Corasick presented a linear-time algorithm based on an automaton. Commentz-Walter presented an algorithm that combines the Aho-Corasick algorithm and the Boyer-Moore technique. Crochemore et al. proposed an algorithm that combines the Aho-Corasick automaton and a Directed Acyclic Word Graph, which runs in linear time in the worst case and is sublinear on average, where the average-case time depends on the length of the shortest pattern. Rabin and Karp proposed an algorithm that runs in linear time on average, where the worst-case time depends on the sum of the lengths of all patterns. Charras et al. proposed an algorithm called Alpha Skip Search, which can efficiently handle both a single pattern and multiple patterns. Wu and Manber presented an algorithm that uses an extension of the Boyer-Moore-Horspool technique.
In this paper we present practically fast algorithms for multiple pattern Cartesian tree matching. We present three algorithms, based on Wu-Manber, Rabin-Karp, and Alpha Skip Search, respectively. All of them use the filtering and verification approach, where filtering relies on efficient fingerprinting methods for a string. Two fingerprinting methods are presented: the parent-distance encoding and the binary encoding. By combining an efficient fingerprinting method with a conventional multiple string matching algorithm, we can efficiently solve multiple pattern Cartesian tree matching. In the experiments we compare our solutions against the previous algorithm, which is based on the Aho-Corasick algorithm. Our solutions run faster than the previous algorithm; in particular, our algorithm based on Wu-Manber runs up to 33 times faster.
2 Problem Definition
A string is a sequence of characters drawn from an alphabet Σ, which is a set of integers. We assume that a comparison between any two characters can be done in constant time. For a string S, S[i] represents the i-th character of S, and S[i..j] represents the substring of S starting at position i and ending at position j.
2.1 Cartesian Tree Matching

A Cartesian tree is a binary tree derived from a string. Specifically, the Cartesian tree CT(S) for a string S[1..n] can be uniquely defined as follows:

If S is an empty string, CT(S) is an empty tree.

If S is not empty and S[i] is the minimum value in S, CT(S) is the tree with S[i] as the root, CT(S[1..i−1]) as the left subtree, and CT(S[i+1..n]) as the right subtree. If there is more than one minimum value, we choose the leftmost one as the root.
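The recursive definition above can also be realized iteratively in linear time with a stack that maintains the rightmost spine of the tree. The following sketch is our own illustration (0-indexed, not pseudo-code from the paper); it returns the parent index of each position, with the root pointing to itself.

```python
def cartesian_tree_parents(s):
    """Parent index of each position in the Cartesian tree of s (0-indexed).

    The root is its own parent.  The strict '>' in the pop test keeps the
    leftmost minimum as the ancestor on ties, matching the definition above.
    """
    parent = list(range(len(s)))
    stack = []  # indices on the rightmost spine, values non-decreasing
    for i in range(len(s)):
        last = None
        while stack and s[stack[-1]] > s[i]:
            last = stack.pop()
        if last is not None:
            parent[last] = i       # popped spine becomes i's left subtree
        if stack:
            parent[i] = stack[-1]  # i becomes the right child of the top
        stack.append(i)
    return parent
```

For example, for the string (3, 1, 2) the minimum 1 is the root with 3 and 2 as its children; each index is pushed and popped at most once, so the construction is O(n).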
Given two strings T[1..n] and P[1..m], where m ≤ n, we say that P matches T at position i if CT(T[i..i+m−1]) = CT(P). For example, given T and P in Figure 0(a), P matches T at position 8. We also say that T[i..i+m−1] is a match of P in T.
Cartesian tree matching is the problem of finding all the substrings of the text that have the same Cartesian tree as that of a given pattern.
(Cartesian tree matching) Given a text T[1..n] and a pattern P[1..m], find every position i such that CT(T[i..i+m−1]) = CT(P).
2.2 Multiple Pattern Cartesian Tree Matching
Cartesian tree matching can be extended to the case of multiple patterns. Multiple pattern Cartesian tree matching is the problem of finding all the matches in the text which have the same Cartesian trees as at least one of the given patterns.
(Multiple pattern Cartesian tree matching) Given a text T[1..n] and patterns P_1, P_2, …, P_k, find every position in the text which matches at least one pattern, i.e., whose substring has the same Cartesian tree as that of at least one pattern.
3 Fingerprinting Methods
Fingerprinting is a technique that maps a string to a much shorter form of data, such as a bit string or an integer. In Cartesian tree matching, we can use fingerprints to filter out unpromising matching positions with low computational cost.
In this section we introduce two fingerprinting methods, the parent-distance encoding and the binary encoding, for the purpose of representing information about a Cartesian tree as an integer. The two encodings make use of the parent-distance representation and the binary representation, respectively, both of which are strings that represent Cartesian trees.
3.1 Parent-distance Encoding
(Parent-distance representation) Given a string S[1..n], the parent-distance representation of S is an integer string PD(S)[1..n], which is defined as follows:

PD(S)[i] = i − max{ j : 1 ≤ j < i, S[j] ≤ S[i] } if such j exists, and PD(S)[i] = 0 otherwise.
Intuitively, PD(S)[i] stores the distance between S[i] and the parent of S[i] in CT(S[1..i]). For example, the parent-distance representation of the string (3, 1, 2) is (0, 0, 1). The parent-distance representation has a one-to-one mapping to the Cartesian tree, and so two strings have the same parent-distance representations if and only if they have the same Cartesian trees. The parent-distance representation of a string can be computed in linear time. Note that PD(S)[i] holds a value between 0 and i−1 by definition.
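The linear-time computation can be sketched with a stack of indices whose values form a non-decreasing sequence (a 0-indexed illustration of our own, not the paper's pseudo-code):

```python
def parent_distance(s):
    """PD[i] = i - (nearest j < i with s[j] <= s[i]); 0 if no such j exists."""
    pd = [0] * len(s)
    stack = []  # candidate parent indices; their values are non-decreasing
    for i in range(len(s)):
        while stack and s[stack[-1]] > s[i]:
            stack.pop()  # a larger value can never be the parent of s[i]
        if stack:
            pd[i] = i - stack[-1]
        stack.append(i)
    return pd
```

Each index is pushed and popped at most once, so the computation runs in O(n); strings with the same Cartesian tree yield the same output.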
With the parent-distance representation, we can define a fingerprint encoding function that maps a string to an integer, using the factorial number system.
(Parent-distance Encoding) Given a string S[1..n], the encoding function fp, which maps S into an integer within the range [0, n! − 1], is defined as follows:

fp(S) = Σ_{i=1}^{n} PD(S)[i] · (i−1)!
The parent-distance encoding maps a string into a unique integer according to its parent-distance representation. That is, given two strings S1 and S2 of the same length, fp(S1) = fp(S2) if and only if PD(S1) = PD(S2). This is because the factorial number system gives a unique representation, owing to the fact that 0 ≤ PD(S)[i] ≤ i−1. The encoding function can be computed in linear time, since PD(S) can be computed in linear time. For a long string, the fingerprint may not fit in a word, so we select a prime number, divide the fingerprint by it, and use the residue instead of the actual fingerprint. A similar encoding function was used to solve the multiple pattern order-preserving matching problem.
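One natural instantiation (our own sketch) weights the parent-distance digits by factorials, fp(S) = Σ_i PD(S)[i] · (i−1)!, which is injective because the digit at weight (i−1)! never exceeds i−1. The optional modulus q is an illustrative assumption for fitting the fingerprint in a word.

```python
from math import factorial

def pd_fingerprint(s, q=None):
    """Factorial-number-system fingerprint of the parent-distance string."""
    # Parent-distance representation (0-indexed), as in Section 3.1.
    pd, stack = [0] * len(s), []
    for i in range(len(s)):
        while stack and s[stack[-1]] > s[i]:
            stack.pop()
        if stack:
            pd[i] = i - stack[-1]
        stack.append(i)
    # Weight digit i (0-indexed) by i!; injective since pd[i] <= i.
    fp = sum(d * factorial(i) for i, d in enumerate(pd))
    return fp % q if q is not None else fp
```

For instance, (3, 1, 2) has PD = (0, 0, 1), so its fingerprint is 1 · 2! = 2, while (1, 2, 3) with PD = (0, 1, 1) maps to 1 · 1! + 1 · 2! = 3.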
3.2 Binary Encoding
For order-preserving matching, the representation of a string as a binary string was first presented by Chhabra and Tarhio. Recently, Song et al. made use of a binary representation for Cartesian tree matching, as follows.
(Binary representation) Given a string S of length n, the binary representation B(S) of length n−1 is defined for each position 1 ≤ i ≤ n−1 as follows:
Given two strings S1 and S2, the binary representations B(S1) and B(S2) are the same if the Cartesian trees CT(S1) and CT(S2) are the same. Obviously, the Cartesian tree has a many-to-one mapping to the binary representation. Thus, two strings whose binary representations are the same may not have the same Cartesian trees, but two strings whose Cartesian trees are the same always have the same binary representations.
A fingerprint encoding function can be defined using the binary representation.
(Binary Encoding) Given a string S of length n, the encoding function fpb, which maps S into an integer within the range [0, 2^{n−1} − 1], is defined as follows:

fpb(S) = Σ_{i=1}^{n−1} B(S)[i] · 2^{n−1−i}
Since fpb is a polynomial, it can be efficiently computed in linear time using Horner's rule. Moreover, a fingerprint computed by the binary encoding can be reused when two strings overlap, which will be discussed in Appendix 0.A.3.2. As with the parent-distance encoding, in case the fingerprint does not fit in a word, we select a prime number and use the residue of the fingerprint modulo that prime instead of the actual fingerprint.
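As a sketch, we use the comparison of adjacent characters as the binary representation: the outcome of S[i] ≤ S[i+1] is determined by the Cartesian tree, so equal trees imply equal bit strings, though not conversely. This choice is our own assumption for illustration; the exact representation of Song et al. may differ in details. The bit string is evaluated as a binary number with Horner's rule:

```python
def binary_fingerprint(s, q=None):
    """Read the n-1 comparison bits of s as a binary number (Horner's rule)."""
    fp = 0
    for i in range(len(s) - 1):
        bit = 1 if s[i] <= s[i + 1] else 0  # determined by the Cartesian tree
        fp = fp * 2 + bit                   # Horner step
        if q is not None:
            fp %= q
    return fp
```

For example, (3, 1, 2) yields the bits (0, 1) and hence the fingerprint 1, as does (30, 10, 20), which has the same Cartesian tree.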
4 Fast Multiple Pattern Cartesian Tree Matching Algorithms
In this section we introduce three algorithms for multiple pattern Cartesian tree matching. Each of them consists of a preprocessing step and a search step. In the preprocessing step, hash tables are built using fingerprints of the patterns. In the search step, the filtering and verification approach is adopted. To filter out unpromising matching positions, a fingerprinting method is applied to either length-m substrings of the text, where m is the length of the shortest pattern, or much shorter length-q substrings of the text (we discuss how to set the block size q in Section 4.4). Then each candidate pattern is verified by an efficient comparison method (see Appendix 0.A.3.1).
4.1 Algorithm Based on Wu-Manber
Algorithm 1 shows the pseudo-code of an algorithm for multiple pattern Cartesian tree matching based on the Wu-Manber algorithm. The algorithm uses two hash tables, HASH and SHIFT. Both tables are keyed by the fingerprint of a length-q string, called a block. Either the parent-distance encoding or the binary encoding is used to compute the fingerprint. Given patterns P_1, …, P_k, let m be the length of the shortest pattern. HASH maps a fingerprint fp of a block to the list of patterns P_i such that the fingerprint of the last block in P_i's length-m prefix is equal to fp. For a block and a fingerprint encoding function, HASH is defined as follows:
SHIFT maps a fingerprint fp of a block to the amount of a valid shift when a block with that fingerprint appears in the text. The shift value is determined by the rightmost occurrence of the block, in terms of its fingerprint, among the length-m prefixes of the patterns. For a block and a fingerprint encoding function, we define the rightmost occurrence as follows:
Then SHIFT is defined as follows:
In the preprocessing step, we build HASH and SHIFT (as described in Algorithm 1). In the search step, we scan the text from left to right, computing the fingerprint of a length-q substring of the text to get a list of patterns from HASH. Let i be the current scanning position of the text. We compute the fingerprint fp of the block ending at position i, and get the list of patterns in the entry HASH[fp]. If the list is not empty, each pattern in the list is verified by an efficient comparison method (see Appendix 0.A.3.1), which checks whether the pattern matches the text at the corresponding position. After verifying all patterns in the list, the text is shifted by SHIFT[fp] (or by one position when SHIFT[fp] = 0).
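The preprocessing and search described above can be sketched as follows. This is an illustration under our own conventions (0-indexed positions, an adjacent-comparison binary fingerprint for blocks, and exact parent-distance comparison for verification), and it assumes q ≤ m:

```python
def pd(s):
    # parent-distance representation; equal iff Cartesian trees are equal
    out, stack = [0] * len(s), []
    for i in range(len(s)):
        while stack and s[stack[-1]] > s[i]:
            stack.pop()
        if stack:
            out[i] = i - stack[-1]
        stack.append(i)
    return out

def block_fp(b):
    # binary fingerprint of a block (adjacent comparisons, Horner's rule)
    fp = 0
    for i in range(len(b) - 1):
        fp = fp * 2 + (1 if b[i] <= b[i + 1] else 0)
    return fp

def wu_manber_ct(text, patterns, q):
    m = min(len(p) for p in patterns)   # length of the shortest pattern
    default = m - q + 1                 # shift for blocks absent in prefixes
    HASH, SHIFT = {}, {}
    for idx, p in enumerate(patterns):
        for j in range(m - q + 1):      # blocks of the length-m prefix
            fp = block_fp(p[j:j + q])
            SHIFT[fp] = min(SHIFT.get(fp, default), m - q - j)
            if j == m - q:              # last block of the prefix
                HASH.setdefault(fp, []).append(idx)
    out = []
    pos = m - q                         # left end of the scanned block
    while pos + q <= len(text):
        fp = block_fp(text[pos:pos + q])
        shift = SHIFT.get(fp, default)
        if shift == 0:                  # candidate: verify patterns in HASH
            start = pos + q - m
            for idx in HASH.get(fp, []):
                p = patterns[idx]
                if start + len(p) <= len(text) and \
                        pd(text[start:start + len(p)]) == pd(p):
                    out.append((start, idx))
            shift = 1
        pos += shift
    return out
```

The returned pairs give the starting position of a match and the index of the matched pattern; a non-zero SHIFT entry lets the scan skip text positions without any verification.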
4.2 Algorithm Based on Rabin-Karp
Algorithm 2 in the Appendix shows the pseudo-code of an algorithm for multiple pattern Cartesian tree matching based on the Rabin-Karp algorithm. The algorithm uses one hash table, namely HASH. HASH is defined similarly as in Algorithm 1, except that we consider length-m prefixes instead of blocks and we use only the binary encoding for fingerprinting. For a string and the binary encoding function, HASH is defined as follows:
In the preprocessing step, we build HASH. In the search step, we shift the window one position at a time, and compute the fingerprint of a length-m substring of the text to get candidate patterns using HASH. Again, each candidate pattern is verified by an efficient comparison method.
Given a fingerprint at position i of the text, the next fingerprint at position i+1 can be computed in constant time if we use the binary encoding as the fingerprinting method. Let the former fingerprint be f_i, the latter be f_{i+1}, and let b_j denote the j-th bit of the binary representation of the text. Then,

f_{i+1} = 2 · (f_i − 2^{m−2} · b_i) + b_{i+m−1}.  (9)
Subtracting 2^{m−2} · b_i removes the leftmost bit from f_i, multiplying the result by 2 shifts the number to the left by one position, and adding b_{i+m−1} brings in the appropriate rightmost bit.
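For illustration, the constant-time update can be checked directly. The sketch below uses the adjacent-comparison bit string as the binary representation (our own assumption) and omits the modulus for clarity:

```python
def window_bits(s):
    """Comparison bits b[i] = 1 iff s[i] <= s[i+1]."""
    return [1 if s[i] <= s[i + 1] else 0 for i in range(len(s) - 1)]

def fp_of(bits):
    """Bit string read as a binary number."""
    v = 0
    for b in bits:
        v = 2 * v + b
    return v

def roll(fp, dropped_bit, new_bit, m):
    """Constant-time update; a length-m window has m-1 fingerprint bits."""
    return 2 * (fp - dropped_bit * 2 ** (m - 2)) + new_bit
```

Rolling the fingerprint of the window starting at position i to that of position i+1 drops the bit comparing the first two window characters and appends the bit comparing the last two characters of the new window.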
4.3 Algorithm Based on Alpha Skip Search
Algorithm 3 in the Appendix shows the pseudo-code of an algorithm for multiple pattern Cartesian tree matching based on Alpha Skip Search. Recall that a length-q string is called a block. The algorithm uses a hash table POS that maps the fingerprint of a block to a list of its occurrences in all length-m prefixes of the patterns. Either the parent-distance encoding or the binary encoding is used for fingerprinting. For a block and a fingerprint encoding function, POS is defined as follows:
In the preprocessing step, we build POS. In the search step, we scan the text from left to right, computing the fingerprint of a length-q substring of the text to get a list of pairs, each meaning that a block of some pattern's length-m prefix has the same fingerprint as the substring of the text. Verification using an efficient comparison method is performed for each pair in the list. Note that the algorithm always shifts by m − q + 1.
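A compact sketch of this search (again under our own conventions: 0-indexed positions, adjacent-comparison fingerprints, parent-distance verification; it assumes q ≤ m). The probe positions are spaced m − q + 1 apart, so every length-m window of the text contains exactly one probed block:

```python
def pd(s):
    # parent-distance representation; equal iff Cartesian trees are equal
    out, stack = [0] * len(s), []
    for i in range(len(s)):
        while stack and s[stack[-1]] > s[i]:
            stack.pop()
        if stack:
            out[i] = i - stack[-1]
        stack.append(i)
    return out

def block_fp(b):
    # binary fingerprint of a block (adjacent comparisons, Horner's rule)
    fp = 0
    for i in range(len(b) - 1):
        fp = fp * 2 + (1 if b[i] <= b[i + 1] else 0)
    return fp

def alpha_skip_ct(text, patterns, q):
    m = min(len(p) for p in patterns)
    POS = {}                              # fingerprint -> (pattern, offset)
    for idx, p in enumerate(patterns):
        for j in range(m - q + 1):        # blocks of the length-m prefix
            POS.setdefault(block_fp(p[j:j + q]), []).append((idx, j))
    out = set()
    pos = m - q
    while pos + q <= len(text):
        for idx, j in POS.get(block_fp(text[pos:pos + q]), []):
            start, p = pos - j, patterns[idx]
            if start >= 0 and start + len(p) <= len(text) and \
                    pd(text[start:start + len(p)]) == pd(p):
                out.add((start, idx))
        pos += m - q + 1                  # the fixed shift
    return sorted(out)
```

Each retrieved pair aligns the pattern so that its stored block coincides with the probed block of the text before verification.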
4.4 Selecting the Block Size
A longer block leads to a lower probability of candidate pattern occurrences, so it decreases the verification time. On the other hand, a longer block increases the overhead of computing fingerprints. Thus, it is important to set a block size appropriate for each algorithm.
In order to set the block size, we first study the matching probability of two strings in terms of Cartesian trees. Assume that the numbers are independent and identically distributed, and that there are no identical numbers within any length-n string.
Given two strings S1[1..n] and S2[1..n], the probability p(n) that S1 and S2 have the same Cartesian tree can be defined by the recurrence formula, where p(0) = p(1) = 1, as follows:

p(n) = Σ_{i=1}^{n} (1/n²) · p(i−1) · p(n−i).
We have the following upper bound on the matching probability.
Assume that the numbers are independent and identically distributed, and that there are no identical numbers within any length-n string. Given two strings S1[1..n] and S2[1..n], the probability that the two strings match, in terms of Cartesian trees, is at most (1/2)^{n−1}, i.e., p(n) ≤ (1/2)^{n−1}.
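The recurrence and the exponential decrease can be checked numerically. As a sketch (our own illustration), the following implements p(n) with p(0) = p(1) = 1, where both roots fall on position i with probability 1/n² and the subtrees then match independently, and verifies p(n) ≤ (1/2)^{n−1} for small n:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def match_prob(n):
    """p(n): probability that two random strings share a Cartesian tree."""
    if n <= 1:
        return 1.0
    # both roots at position i with probability 1/n^2; subtrees independent
    return sum(match_prob(i - 1) * match_prob(n - i)
               for i in range(1, n + 1)) / n ** 2
```

For example, p(2) = 1/2 and p(3) = 2/9, both within the bound.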
We set the block size q based on the number of patterns k and the length m of the shortest pattern, in order to get a low probability of a match together with a relatively short block size with respect to m; by Theorem 4.1, the matching probability of a block decreases exponentially as q grows.
5 Experiments

We conduct experiments to evaluate the performance of the proposed algorithms against the previous algorithm. We compare algorithms based on Aho-Corasick (AC), Wu-Manber (WM), Rabin-Karp (RK), and Alpha Skip Search (AS). By default, all our algorithms use the optimization techniques introduced in Appendix 0.A.3, except the min-index filtering method, which is evaluated separately. In particular, in order to compare the fingerprinting methods and to see the effect of the min-index filtering method, we compare variants of our algorithms. The following algorithms are evaluated.
AC: the multiple pattern Cartesian tree matching algorithm based on Aho-Corasick.
WMP: algorithm based on Wu-Manber that uses the parent-distance encoding as a fingerprinting method.
WMB: algorithm based on Wu-Manber that uses the binary encoding as a fingerprinting method. The algorithm reuses fingerprints when adjacent blocks overlap in q − 1 characters (i.e., when the text shifts by one position), where q is the block size.
WMBM: WMB that exploits additional min-index filtering in Appendix 0.A.3.3.
RK: algorithm based on Rabin-Karp that uses the binary encoding as a fingerprinting method.
ASB: algorithm based on Alpha Skip Search that uses the binary encoding as a fingerprinting method. The algorithm reuses fingerprints when adjacent blocks overlap in q − 1 characters.
All algorithms are implemented in C++. Experiments are conducted on a machine with Intel Xeon E5-2630 v4 2.20GHz CPU and 128GB memory running CentOS Linux.
The total time includes the preprocessing time for building data structures and the search time. To evaluate an algorithm, we run it 100 times and measure the average total time in milliseconds.
We randomly build a text of length 10,000,000 where the alphabet size is 1,000. A pattern is extracted from the text at a random position.
5.1 Evaluation on Equal-Length Patterns
We first conduct experiments with sets of patterns of the same length. Figures 1(a), 1(c), 1(e), and Table 1 show the results, where k is the number of patterns and the x-axis represents the length of the patterns. As the length of the patterns increases, WMB, WMBM, and ASB become the fastest algorithms due to long shift lengths, low verification time, and a lightweight fingerprinting method. WMBM and WMB outperform AC by up to 33 times. ASB outperforms AC by up to 28 times. RK outperforms AC by up to 3 times. When the patterns are extremely short, however, AC is the clear winner. In this case, the other algorithms naïvely compare the greater part of each pattern at every position of the text. WMP performs visibly worse for extremely short patterns because of this situation and the overhead of its fingerprinting method. Since short patterns are more likely to have the same Cartesian trees, the proposed algorithms are sometimes faster for a larger number of patterns, owing to the grouping technique in Appendix 0.A.3. Comparing WMB and WMBM, the min-index filtering method is more effective when there are many short patterns.
5.2 Evaluation on Different-Length Patterns
We compare the algorithms with sets of patterns of different lengths. Figures 1(b), 1(d), 1(f), and Table 2 show the results. The length of each pattern is randomly selected in an interval, i.e., [8, 32], [16, 64], [32, 128], or [64, 256]. After a length is selected, a pattern is extracted from the text at a random position. When there are many short patterns, i.e., patterns of length 8–32, AC is the fastest due to the short minimum pattern length.
When the length of the shortest pattern is sufficiently long, however, the proposed algorithms outperform AC. Specifically, WMB outperforms AC by up to 20 times (patterns of length 64–256). ASB outperforms AC by up to 14 times (patterns of length 64–256). RK outperforms AC by up to 4 times (patterns of length 16–64).
5.3 Evaluation on the Real Dataset
We conduct experiments on a real dataset, a time series of Seoul temperatures. The dataset consists of 658,795 integers referring to the hourly temperatures in Seoul (multiplied by ten) in the years 1907–2019. In general, temperatures rise during the day and fall at night. Therefore, the Seoul temperatures dataset has more matches than random datasets when patterns are extracted from the text. Figure 3 and Table 3 show the results on the Seoul temperatures dataset with sets of patterns of the same length. As the pattern length grows, the proposed algorithms run much faster than AC. For short patterns, AC is the fastest algorithm; it is up to 2 times faster than WMBM and 1.7 times faster than RK. For moderate-length patterns, RK is up to 2.8 times faster than AC, and WMB is up to 4 times faster than AC. For relatively long patterns, all the proposed algorithms outperform AC. Specifically, WMB, WMBM, ASB, and WMP outperform AC by up to 28, 26, 11, and 10 times, respectively, and RK outperforms AC by up to 2.9 times.
References

- A. V. Aho and M. J. Corasick (1975) Efficient string matching: an aid to bibliographic search. Communications of the ACM 18(6), pp. 333–340.
- O. Berkman, B. Schieber, and U. Vishkin (1993) Optimal doubly logarithmic parallel algorithms based on finding all nearest smaller values. Journal of Algorithms 14(3), pp. 344–370.
- R. S. Boyer and J. S. Moore (1977) A fast string searching algorithm. Communications of the ACM 20(10), pp. 762–772.
- C. Charras, T. Lecroq, and J. D. Pehoushek (1998) A very fast string matching algorithm for small alphabets and long patterns. In Annual Symposium on Combinatorial Pattern Matching, pp. 55–64.
- T. Chhabra and J. Tarhio (2014) Order-preserving matching with filtration. In International Symposium on Experimental Algorithms, pp. 307–314.
- B. Commentz-Walter (1979) A string matching algorithm fast on the average. In International Colloquium on Automata, Languages, and Programming, pp. 118–132.
- T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein (2001) Introduction to Algorithms, second edition. MIT Press.
- M. Crochemore, A. Czumaj, L. Gąsieniec, T. Lecroq, W. Plandowski, and W. Rytter (1999) Fast practical multi-pattern matching. Information Processing Letters 71(3–4), pp. 107–113.
- (2016) Space-efficient dictionaries for parameterized and order-preserving pattern matching. In 27th Annual Symposium on Combinatorial Pattern Matching (CPM), pp. 2:1–2:12.
- (2015) Fast multiple order-preserving matching algorithms. In International Workshop on Combinatorial Algorithms, pp. 248–259.
- (2009) Variable-stride multi-pattern matching for scalable deep packet inspection. In IEEE INFOCOM 2009, pp. 415–423.
- R. M. Karp and M. O. Rabin (1987) Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development 31(2), pp. 249–260.
- J. Kim, P. Eades, R. Fleischer, S.-H. Hong, C. S. Iliopoulos, K. Park, S. J. Puglisi, and T. Tokuyama (2014) Order-preserving matching. Theoretical Computer Science 525, pp. 68–79.
- D. E. Knuth (2014) The Art of Computer Programming, Volume 2: Seminumerical Algorithms. Addison-Wesley Professional.
- M. Kubica, T. Kulczyński, J. Radoszewski, W. Rytter, and T. Waleń (2013) A linear time algorithm for consecutive permutation pattern matching. Information Processing Letters 113(12), pp. 430–433.
- (2013) Intrusion detection system: a comprehensive review. Journal of Network and Computer Applications 36(1), pp. 16–24.
- (2007) Automatic extraction and identification of chart patterns towards financial forecast. Applied Soft Computing 7(4), pp. 1197–1208.
- S. G. Park, A. Amir, G. M. Landau, and K. Park (2019) Cartesian tree matching and indexing. In 30th Annual Symposium on Combinatorial Pattern Matching (CPM), pp. 16:1–16:14.
- S. Song, C. Ryu, S. Faro, T. Lecroq, and K. Park (2019) Fast Cartesian tree matching algorithms. Accepted to SPIRE 2019, https://arxiv.org/abs/1908.04937.
- (2008) A memory efficient multiple pattern matching architecture for network security. In IEEE INFOCOM 2008 – The 27th Conference on Computer Communications, pp. 166–170.
- J. Vuillemin (1980) A unifying look at data structures. Communications of the ACM 23(4), pp. 229–239.
- S. Wu and U. Manber (1994) A fast algorithm for multi-pattern searching. Technical Report TR-94-17, Department of Computer Science, University of Arizona.
Appendix 0.A Appendix
0.A.1 Proof of Lemma 1

When the i-th numbers are the roots of both CT(S1) and CT(S2), the probability that CT(S1) = CT(S2) is p(i−1) · p(n−i). Since there are n distinct numbers in each string, the probability that both S1 and S2 have their i-th numbers as the minima (and thus as the roots) is 1/n². Summing the probabilities for i = 1, …, n gives the probability p(n). ∎
0.A.2 Proof of Theorem 4.1

We prove the theorem by induction on n.

If n = 1, p(1) = 1 = (1/2)⁰; if n = 2, p(2) = 1/2 = (1/2)¹. Therefore, the theorem holds when n ≤ 2.

Let us assume that the theorem holds for every length smaller than n (n ≥ 3), and show that it holds for n:

p(n) = (1/n²) Σ_{i=1}^{n} p(i−1) p(n−i) ≤ (1/n²) [2 · (1/2)^{n−2} + (n−2) · (1/2)^{n−3}] = (4(n−1)/n²) · (1/2)^{n−1} ≤ (1/2)^{n−1},

since 4(n−1) ≤ n² for every n. Therefore, we have proved that p(n) ≤ (1/2)^{n−1}. ∎
0.A.3 Optimization Techniques

0.A.3.1 Optimizing Naïve Verification
An efficient verification method is essential for the three proposed algorithms because they all adopt the filtering and verification approach. We employ the verification method introduced by Song et al. They first introduce the notion of the global-parent representation GP(S) of a string S, where GP(S)[i] stores the index of the parent of S[i] in CT(S). For example, the global-parent representation of the string (3, 1, 2) is (2, 2, 2). Note that the parent of the root is the root itself. Two strings have the same Cartesian trees if and only if, for every position, the characters of one string satisfy the order constraints imposed by the global-parent representation of the other; hence we do not need to compute any representation of the second string. After the global-parent representation of a pattern is computed, we can verify whether a text window matches the pattern in linear time by checking these conditions. In our algorithms, the global-parent representations of the patterns are computed and stored in advance, and verification is done by the above method without computing any representation of the text.
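A sketch of this verification (0-indexed, and our own reconstruction of the order constraints: a non-root position must hold a strictly larger value than its parent position when it lies to the left of the parent, and at least as large a value when it lies to the right, reflecting the leftmost-minimum tie-breaking):

```python
def global_parent(s):
    """GP[i] = parent index of position i in CT(s); the root points to itself."""
    gp, stack = list(range(len(s))), []
    for i in range(len(s)):
        last = None
        while stack and s[stack[-1]] > s[i]:
            last = stack.pop()
        if last is not None:
            gp[last] = i
        if stack:
            gp[i] = stack[-1]
        stack.append(i)
    return gp

def verify(window, gp):
    """Does `window` have the Cartesian tree described by `gp`?"""
    for i, j in enumerate(gp):
        if i == j:
            continue                           # root: nothing to check
        if i < j and not window[i] > window[j]:
            return False                       # left subtree: strictly larger
        if i > j and not window[i] >= window[j]:
            return False                       # right subtree: at least as large
    return True
```

Only the pattern's representation is ever computed; the text window is inspected directly, which is what makes the verification cheap inside the search loop.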
0.A.3.2 Reusing Fingerprints of the Binary Encoding
In Algorithm 2, successive fingerprints can be computed in constant time by Equation (9) when using the binary encoding. Likewise, we can reuse a previous fingerprint to compute the current fingerprint in Algorithms 1 and 3 as well. This can be done by applying Equation (9) repeatedly when two blocks of size q overlap. Our experimental study showed that reusing fingerprints is most efficient when adjacent blocks overlap in q − 1 characters. Thus, we reuse fingerprints only when the text shifts by one position. It is worth mentioning that we do not reuse fingerprints of the parent-distance encoding, because multiple characters of the parent-distance representation can change after just one shift, countervailing the benefit of reuse.
0.A.3.3 Additional Filtering via Min-index
In the filtering stage of an algorithm, we may further filter out candidate patterns by additional filtering methods. We introduce a simple filtering method based on the index of the minimum value (min-index). Since two strings have the same Cartesian trees only if the indices of the minimum values (roots) of the two strings are the same, we may first compare the min-index before we verify each candidate pattern retrieved by a fingerprint. To this end, for each input pattern P_i, we store in the preprocessing step the min-index of the last block in P_i's length-m prefix, where m is the length of the shortest pattern. In the search step, the fingerprint and the min-index of a block in the text are computed at the same time. Among the patterns retrieved by the fingerprint, only those patterns whose stored min-index is the same as that of the block in the text are verified. The information about the root is not captured by the binary representation, but it is captured by the parent-distance representation. Therefore, this additional filtering method is effective only when we use the binary encoding.
0.A.3.4 Grouping Patterns Having the Same Cartesian Trees
Since the input patterns are strings, some of them may have the same Cartesian trees. The Aho-Corasick algorithm assembles those patterns into a single state of its automaton, while the algorithms presented in this paper do not do so explicitly. In our implementation, we group the patterns having the same Cartesian trees, so as to avoid redundant computation. This is particularly beneficial for short input patterns.