1 Introduction
String matching is one of the fundamental problems in computer science. There are generalized matching problems such as parameterized matching [PARA1, PARA2], swapped matching [SWAP1, SWAP2], overlap matching [Overlap], jumbled matching [JUMBLE], and so on. These problems are characterized by how a match is defined, which depends on the application domain. In particular, order-preserving matching [OPM, OPM2, OPM3] and Cartesian tree matching [CTM] deal with the order relations between numbers.
The Cartesian tree [CT] is a tree data structure that represents a string, focusing on the order relations between the elements of the string. Park et al. [CTM] introduced a notion of match called Cartesian tree matching: the problem of finding all substrings of a text which have the same Cartesian tree as that of a pattern. Like order-preserving matching, Cartesian tree matching can be applied to finding patterns in time series data such as share prices in stock markets, but it is sometimes more appropriate, as indicated in [CTM]. Fig. 1 shows an example of Cartesian tree matching: the Cartesian tree of a substring of the text is the same as that of the pattern, so the substring matches. Note that if we use order-preserving matching instead of Cartesian tree matching, the substring does not match the pattern.
String matching algorithms have been developed over the years. To speed up the search phase of string matching, algorithms based on automata and bit-parallelism were proposed [AOSO, SBNDM]. In recent years, the SIMD instruction set architecture gave rise to packed string matching, where one can compare packed data elements in parallel. In the last few years, many solutions for order-preserving matching have been proposed. Given a text of length n and a pattern of length m, Kubica et al. [OPM3] and Kim et al. [OPM] gave linear-time solutions based on the KMP algorithm. Cho et al. [CHO] presented an algorithm using the Boyer–Moore approach. Chhabra and Tarhio [FilterOPM] presented a new practical solution based on filtration, and Chhabra et al. [SIMDOPM] gave a filtration algorithm using the Boyer–Moore–Horspool approach and SIMD instructions. Cantone et al. [OrderOPM] proposed filtration methods using the neighborhood representation and SIMD instructions. These filtration methods [FilterOPM, SIMDOPM, OrderOPM] take sublinear time on average.
In this paper we introduce two new representations, the prefix-parent representation and the prefix-child representation, which can be used to decide whether two strings have the same Cartesian tree. Using these representations, we improve the running time of the previous Cartesian tree matching algorithm in [CTM]. We also present a binary filtration method for Cartesian tree matching, together with an efficient verification technique based on the global-parent representation. Within this framework of binary filtration and efficient verification, any known string matching algorithm [SkipSearch, HORSPOOL, SBNDM] can serve as the filtration step for Cartesian tree matching. In addition, we present a SIMD solution for Cartesian tree matching based on the global-parent representation, which is suitable for short patterns. Our experiments show that known string matching algorithms combined with our binary filtration and efficient verification yield fast algorithms for Cartesian tree matching.
This paper is organized as follows. In Section 2, we describe notations and the problem definition. In Section 3, we present an improved lineartime algorithm using new representations. In Section 4, we present the framework of binary filtration and efficient verification. In Section 5, we present a SIMD solution for short patterns. In Section 6, we give the experimental results of the previous algorithm and the proposed algorithms.
2 Preliminaries
2.1 Basic notations
A string is defined as a finite sequence of elements over an alphabet Σ. In this paper, we assume that Σ has a total order ≤. For a string S, S[i] represents the i-th element of S, and S[i..j] represents the substring of S from the i-th element to the j-th element. If i > j, then S[i..j] is an empty string.
We will say S[i] ≺ S[j] if and only if S[i] < S[j], or S[i] and S[j] have the same value and i < j. Note that ≺ is a strict total order on the elements of the string, even when equal values occur. Unless stated otherwise, the minimum is defined with respect to ≺, so it is always unique.
2.2 Cartesian tree matching
A string S can be associated with its corresponding Cartesian tree CT(S) [CT] according to the following rules:

If S is an empty string, then CT(S) is an empty tree.

If S[1..n] is not empty and S[i] is the minimum among S[1..n] (with respect to ≺), then CT(S) is the tree with S[i] as the root, CT(S[1..i−1]) as the left subtree, and CT(S[i+1..n]) as the right subtree.
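The recursive definition above translates directly into code. Below is a minimal Python sketch (our own rendering, not from the paper): it is 0-indexed and breaks ties toward the earlier position, so the minimum is always unique.

```python
def cartesian_tree(s, lo=0, hi=None):
    """Return the Cartesian tree of s[lo:hi] as nested (index, left, right)
    tuples, or None for an empty string.  Direct transcription of the
    recursive definition (O(n^2) worst case; a stack gives O(n))."""
    if hi is None:
        hi = len(s)
    if lo >= hi:
        return None
    # position of the minimum; ties broken toward the earlier index
    i = min(range(lo, hi), key=lambda j: (s[j], j))
    return (i, cartesian_tree(s, lo, i), cartesian_tree(s, i + 1, hi))
```

For example, cartesian_tree([3, 1, 2]) places index 1 (the minimum) at the root, index 0 in the left subtree, and index 2 in the right subtree.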
Cartesian tree matching is the problem of finding all substrings of the text which have the same Cartesian tree as that of the pattern. Formally, Park et al. [CTM] define it as follows:
Definition 1
(Cartesian tree matching) Given a text T[1..n] and a pattern P[1..m], find every position 1 ≤ i ≤ n − m + 1 such that CT(T[i..i+m−1]) = CT(P[1..m]).
Instead of building the Cartesian tree for every position in the text to solve Cartesian tree matching, Park et al. [CTM] use the following representation for a Cartesian tree.
Definition 2
(Parent-distance representation) Given a string S[1..n], the parent-distance representation of S is a function PD_S(i), which is defined as follows: PD_S(i) = i − max{j : 1 ≤ j < i, S[j] ≺ S[i]} if such an index j exists, and PD_S(i) = 0 otherwise.
Since the parent-distance representation has a one-to-one mapping to the Cartesian tree [CTM], it can replace the Cartesian tree without any loss of information.
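The parent-distance representation can be computed in linear time with a stack of indices whose values are non-decreasing from bottom to top (strictly increasing under ≺). A minimal Python sketch, with our own function name and arrays 1-indexed like the definition:

```python
def parent_distance(s):
    """Return the 1-indexed parent-distance array of s (index 0 unused).
    pd[i] = i - j for the nearest j < i with s[j] "preceding" s[i]
    (ties favor the earlier element), or 0 if no such j exists."""
    pd = [0] * (len(s) + 1)
    stack = []  # candidate indices; their values are non-decreasing
    for i in range(1, len(s) + 1):
        # pop indices whose values are strictly larger than s[i-1];
        # an equal earlier value still precedes s[i-1], so it stays
        while stack and s[stack[-1] - 1] > s[i - 1]:
            stack.pop()
        pd[i] = i - stack[-1] if stack else 0
        stack.append(i)
    return pd
```

Two strings have the same Cartesian tree exactly when their parent-distance arrays coincide, e.g. parent_distance([10, 20, 15]) == parent_distance([1, 9, 5]).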
3 Fast linear Cartesian tree matching
The previous algorithm for Cartesian tree matching due to Park et al. [CTM] is based on the KMP algorithm [KMP]. It converts the pattern and the text to their parent-distance representations and finds matches with the KMP algorithm. To compute the parent-distance representations of substrings of the text on the fly, however, they used a deque data structure. We improve the text search phase of the previous algorithm by removing the overhead of computing parent-distance representations, including the deque operations.
In the text search phase of the previous algorithm, the parent-distance of each text element is computed to check whether the current alignment of the pattern can be extended by one element. We can perform this check directly, without computing the parent-distances of text elements, by using the following representations: the prefix-parent representation and the prefix-child representation.
Definition 3
(Prefix-parent representation) Given a string S[1..n], the prefix-parent representation of S is a function PP_S(i), which is defined as follows: PP_S(i) = max{j : 1 ≤ j < i, S[j] ≺ S[i]} if such an index j exists, and PP_S(i) = 0 otherwise. In other words, S[PP_S(i)] is the parent of S[i] in CT(S[1..i]), and S[i] is the root of CT(S[1..i]) when PP_S(i) = 0.
Since PD_S(i) = i − PP_S(i) whenever PP_S(i) ≠ 0 (and PD_S(i) = 0 otherwise), the prefix-parent representation also has a one-to-one mapping to the Cartesian tree.
Definition 4
(Prefix-child representation) Given a string S[1..n], the prefix-child representation of S is a function PC_S(i), which is defined as follows: PC_S(1) = 0, and for 1 < i ≤ n, PC_S(i) = 0 if PP_S(i) = i − 1, and otherwise PC_S(i) is the index of the minimum element of S[PP_S(i)+1..i−1] (with respect to ≺).
In other words, S[PC_S(i)] is a child of S[i], because S[i] is the root of CT(S[PP_S(i)+1..i]) when PC_S(i) ≠ 0, and S[PC_S(i)] is the root of CT(S[PP_S(i)+1..i−1]). When PP_S(i) = i − 1, there is no child of S[i] in CT(S[1..i]), and thus we set PC_S(i) as 0.
Fig. 2 shows the prefix-parent representation (resp. the prefix-child representation) of a string by arrows. The arrow starting from S[i] indicates S[PP_S(i)] (resp. S[PC_S(i)]). If PP_S(i) = 0 (resp. PC_S(i) = 0), we omit the arrow.
The advantage of using the prefix-parent representation and the prefix-child representation is that we can check whether each text element matches the corresponding pattern element in constant time, without computing its parent-distance [CTM].
Theorem 3.1
Given two strings S and S′, assume that S[1..i−1] and S′[1..i−1] have the same prefix-parent representations. If S′[PP_S(i)] ≺ S′[i] whenever PP_S(i) ≠ 0, and S′[i] ≺ S′[PC_S(i)] whenever PC_S(i) ≠ 0, then S[1..i] and S′[1..i] have the same prefix-parent representations, and vice versa.
Proof
If i = 1, S[1..i] and S′[1..i] always have the same prefix-parent representation. Now let’s assume i > 1. There are three cases, in each of which we show that PP_{S′}(i) = PP_S(i).

Case PP_S(i) = 0: Since S[PC_S(i)] is the minimum element in S[1..i−1] and S[1..i−1], S′[1..i−1] have the same prefix-parent representations, S′[PC_S(i)] is also the minimum element in S′[1..i−1]. Therefore, if S′[i] ≺ S′[PC_S(i)] holds, then we have PP_{S′}(i) = 0.

Case PP_S(i) = i − 1: Since S′[i−1] ≺ S′[i], we have PP_{S′}(i) = i − 1.

Case 0 < PP_S(i) < i − 1: Since S[PC_S(i)] is the minimum element in S[PP_S(i)+1..i−1] and S[1..i−1], S′[1..i−1] have the same prefix-parent representations, S′[PC_S(i)] is also the minimum element in S′[PP_S(i)+1..i−1]. Therefore, if S′[PP_S(i)] ≺ S′[i] and S′[i] ≺ S′[PC_S(i)] hold, then PP_{S′}(i) = PP_S(i).
The converse is trivial by the definitions of PP_S and PC_S. ∎
With the prefix-parent representation and the prefix-child representation of the pattern P, we can simplify the text search. For each text element, we can check the condition of Theorem 3.1 by comparing it with the text elements whose indices correspond to PP_P and PC_P in the pattern. Using this idea, we don’t have to compute the parent-distances of the text. Algorithm 1 describes this search. We compute the failure function in the same way as [CTM] does.
Given a string S[1..n], we can compute the prefix-child representation and the prefix-parent representation simultaneously in linear time using a stack. PP_S(i) = j means that S[i] ≺ S[k] for every j < k < i, and the same kind of property holds for PC_S. On the stack, therefore, we maintain only the indices which can still become the prefix-parent of a later element while scanning S from S[1] to S[n]. Suppose that indices j_1 < j_2 < ⋯ < j_k are on the stack (from bottom to top) when we are computing PP_S(i) and PC_S(i). Then S[j_1] ≺ S[j_2] ≺ ⋯ ≺ S[j_k] forms an increasing subsequence of S. When we consider the new index i, we pop the indices repeatedly until the top index j satisfies S[j] ≺ S[i]. If there exists such an index j, we set PP_S(i) = j and PC_S(i) to the index popped last. (If no index was popped, then PC_S(i) = 0.) Otherwise, S[i] is the minimum element in S[1..i], and thus PP_S(i) = 0 and PC_S(i) is the index popped last (the root of CT(S[1..i−1])). Finally, we push i onto the stack. Algorithm 2 describes the algorithm to compute PP_S and PC_S simultaneously.
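The stack procedure above can be sketched in a few lines of Python (a hedged rendering of the idea, not the paper's Algorithm 2; arrays are 1-indexed with slot 0 unused):

```python
def prefix_parent_child(s):
    """Return 1-indexed prefix-parent (pp) and prefix-child (pc) arrays.
    Comparing (value, index) pairs breaks ties toward the earlier
    element, matching the strict order used for Cartesian trees."""
    n = len(s)
    pp = [0] * (n + 1)
    pc = [0] * (n + 1)
    stack = []  # indices of the right spine of CT(s[1..i-1])
    for i in range(1, n + 1):
        last_popped = 0
        while stack and (s[stack[-1] - 1], stack[-1]) > (s[i - 1], i):
            last_popped = stack.pop()
        pp[i] = stack[-1] if stack else 0  # nearest preceding smaller element
        pc[i] = last_popped                # root of the subtree s[i] absorbs
        stack.append(i)
    return pp, pc
```

For [3, 1, 2] this yields PP = (0, 0, 2) and PC = (0, 1, 0): the minimum at position 2 takes position 1 as its prefix-child, and position 3 hangs below position 2.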
4 Fast Cartesian tree matching with filtration
In this section we present a practical solution based on filtration. Our solution for Cartesian tree matching consists of two phases: filtration and verification. First, the text is filtered with some exact string matching algorithm applied to a binary representation. In the second phase, the potential candidates are verified using the global-parent representation.
4.1 Filtration
In the filtration phase, a string is translated into a binary representation as follows.
Definition 5
(Binary representation) Given a string S[1..n], the binary representation of S is a binary string β_S of length n − 1, which is defined as follows: β_S(i) = 1 if PD_S(i) = 1, and β_S(i) = 0 otherwise, for each 2 ≤ i ≤ n.
One can easily check whether β_S(i) = 1 or not by comparing S[i−1] and S[i]: β_S(i) = 1 if and only if S[i−1] ≺ S[i]. The following theorem proves that the binary representation can be used to filter a text T to search for all Cartesian tree matching occurrences of a pattern P.
Theorem 4.1
Let T and P be two strings of lengths n and m, respectively, and let β_T and β_P be the binary representations associated with T and P, respectively. If CT(T[i..i+m−1]) = CT(P), then β_T(i+j−1) = β_P(j) for 2 ≤ j ≤ m.
Proof
The parent-distance representation has a one-to-one mapping to the Cartesian tree. Therefore, if CT(T[i..i+m−1]) = CT(P), then the parent-distance representations of T[i..i+m−1] and P agree at every position. If the parent-distances agree, then β_T(i+j−1) = β_P(j) for 2 ≤ j ≤ m, since each bit records whether the parent-distance equals 1. ∎
Theorem 4.1 guarantees that any standard exact string matching algorithm can be used as a filtration procedure. As the exact string matching algorithm returns the matches of β_P in β_T, these matches are the only possible candidates for Cartesian tree matching, which should then be verified.
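To illustrate the filtration phase, the following Python sketch (our own helper names and 0-indexed positions) computes binary representations and uses Python's built-in str.find as a stand-in for an arbitrary exact string matching algorithm:

```python
def binary_repr(s):
    """Binary representation as a '0'/'1' string: one bit per adjacent
    pair, '1' iff s[i] <= s[i+1] (ties favor the earlier element)."""
    return "".join("1" if a <= b else "0" for a, b in zip(s, s[1:]))

def filter_candidates(text, pattern):
    """0-indexed starting positions surviving the binary filtration.
    Every Cartesian-tree occurrence is among them (Theorem 4.1), but
    survivors are only candidates and still need verification."""
    bt, bp = binary_repr(text), binary_repr(pattern)
    out, k = [], bt.find(bp)
    while k != -1:
        out.append(k)
        k = bt.find(bp, k + 1)
    return out
```

For text [3, 1, 2, 0, 5, 4] and pattern [2, 0, 5] the filtration keeps positions 0 and 2; here both happen to be true Cartesian tree matches, but in general false positives can survive.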
Cantone et al. [OrderOPM] presented two filtration methods other than the binary representation to solve order-preserving matching. They used the property that P does not match T at position i if there are two positions j and k such that the order relation between T[i+j−1] and T[i+k−1] does not agree with that between P[j] and P[k]. Thus the comparison result between any two positions can be used for filtration. In Cartesian tree matching, however, even if there exist such j and k, the corresponding Cartesian trees can still be the same when j and k are not adjacent. Therefore, we cannot use these filtration methods for Cartesian tree matching.
4.2 Verification
In the verification phase, we have to check whether the candidates found by the filtration phase are actual matches. This check can be done using the prefix-parent and prefix-child representations by Theorem 3.1, which takes two comparisons per element. In order to reduce the number of comparisons to one, we introduce another representation as follows.
Definition 6
(Global-parent representation) Given a string S[1..n], the global-parent representation of S is a function GP_S(i), which is defined as follows: GP_S(i) = j if there exists an index j such that PC_S(j) = i, and GP_S(i) = PP_S(i) otherwise.
GP_S is well-defined because there is at most one index j which satisfies PC_S(j) = i; in fact, S[GP_S(i)] is the parent of S[i] in CT(S[1..n]). Fig. 2 shows the global-parent representation by arrows. The arrow starting from S[i] indicates the global parent of S[i]. If GP_S(i) = 0, we omit the arrow.
Theorem 4.2
Two strings S and S′ of the same length have the same Cartesian trees if and only if S′[GP_S(i)] ≺ S′[i] for all i such that GP_S(i) ≠ 0.
Proof
We will prove that S′[GP_S(i)] ≺ S′[i] for all i with GP_S(i) ≠ 0 if and only if PP_S(i) = PP_{S′}(i) for all i; since the prefix-parent representation has a one-to-one mapping to the Cartesian tree, the latter is equivalent to CT(S) = CT(S′).
(⇐) It is trivial by the definition of GP_S, since PP_S = PP_{S′} implies GP_S = GP_{S′}.
(⇒) Assume S′[GP_S(i)] ≺ S′[i] for all i with GP_S(i) ≠ 0. For any i, we first show S′[i] ≺ S′[PC_S(i)] whenever PC_S(i) ≠ 0, and then we show S′[PP_S(i)] ≺ S′[i] whenever PP_S(i) ≠ 0; by Theorem 3.1 and induction on i, it follows that PP_S(i) = PP_{S′}(i) for all i.

(Proof of S′[i] ≺ S′[PC_S(i)]) There are two cases: PC_S(i) = 0 and PC_S(i) ≠ 0. If PC_S(i) = 0, then the claim holds trivially. Otherwise, since GP_S(PC_S(i)) = i by the definition of GP, the assumption gives S′[i] ≺ S′[PC_S(i)]. Therefore, the claim holds.

(Proof of S′[PP_S(i)] ≺ S′[i]) If GP_S(i) = PP_S(i), then the claim follows directly from the assumption. So we only have to consider the case that there is j which satisfies PC_S(j) = i. Let i = j_1, j_2, …, j_k be a sequence such that PC_S(j_{t+1}) = j_t for 1 ≤ t < k, and there is no j which satisfies PC_S(j) = j_k. Since j_1 < j_2 < ⋯ < j_k is a strictly increasing sequence, such a sequence always exists. Note that GP_S(j_t) = j_{t+1} except for t = k, and that PP_S(j_{t+1}) = PP_S(j_t) for each t, because S[j_t] is the minimum element in S[PP_S(j_{t+1})+1..j_{t+1}−1]. On the sequence, there may or may not exist t such that PP_S(j_t) = 0.
Suppose that there exists some t such that PP_S(j_t) = 0. Since PP_S(j_t) is the same for all t, we have PP_S(j_t) = 0 for all t; in particular, PP_S(i) = 0. Thus the claim holds trivially.
Now we consider the case that PP_S(j_t) ≠ 0 for all t. Then we have S′[j_{t+1}] ≺ S′[j_t] for 1 ≤ t < k by the assumption, since GP_S(j_t) = j_{t+1}. We now show S′[PP_S(j_k)] ≺ S′[j_k] as follows. Since there is no j with PC_S(j) = j_k, we have GP_S(j_k) = PP_S(j_k), and the assumption gives S′[PP_S(j_k)] ≺ S′[j_k]. Hence, we have S′[PP_S(j_k)] ≺ S′[j_k] ≺ ⋯ ≺ S′[j_1] = S′[i]. Since PP_S(j_k) = PP_S(i), it follows that S′[PP_S(i)] ≺ S′[i]. Therefore, the claim holds. ∎
By Theorem 4.2, we only have to compare each element once in the verification phase. For a potential candidate obtained from the filtration phase, say one starting at position q of the text, we compare T[q+GP_P(i)−1] and T[q+i−1] for i from 1 to m. The candidate is discarded when there exists i with GP_P(i) ≠ 0 such that T[q+GP_P(i)−1] ⊀ T[q+i−1].
We compute the global-parent representation using a stack, as in computing the prefix-parent and the prefix-child representations. The only difference is that we first set GP_S(i) as PP_S(i), and then if we later find j such that PC_S(j) = i, we update GP_S(i) to j.
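A sketch of the global-parent computation and the one-comparison-per-element verification (our own Python rendering with hypothetical names; pattern arrays are 1-indexed, text positions 0-indexed):

```python
def global_parent(s):
    """1-indexed global-parent array of s (0 = root of the Cartesian
    tree).  Start from the prefix-parent and update gp[j] to i whenever
    s[j] becomes the prefix-child of a later element s[i]."""
    gp, stack = [0] * (len(s) + 1), []
    for i in range(1, len(s) + 1):
        last = 0
        while stack and (s[stack[-1] - 1], stack[-1]) > (s[i - 1], i):
            last = stack.pop()
        gp[i] = stack[-1] if stack else 0  # tentative: the prefix-parent
        if last:
            gp[last] = i                   # s[last] is the prefix-child of s[i]
        stack.append(i)
    return gp

def verify(text, q, pattern, gp):
    """Check a candidate at 0-indexed text position q with one comparison
    per element (Theorem 4.2): the global-parent element must precede its
    child, ties broken toward the smaller pattern index."""
    for i in range(1, len(pattern) + 1):
        g = gp[i]
        if g == 0:
            continue
        a, b = text[q + g - 1], text[q + i - 1]
        if not (a < b or (a == b and g < i)):
            return False
    return True
```

Combined with the filtration of Section 4.1, each surviving candidate position q is accepted exactly when verify(text, q, pattern, gp) holds.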
4.3 Sublinear time on average
The proof of sublinearity is similar to the analysis of order-preserving matching with filtration [FilterOPM]. Let’s assume that the elements in the pattern and the text are independent of each other and uniformly distributed. The verification phase takes time proportional to the pattern length times the number of potential candidates. When the alphabet size is σ, the probability that β_T(i) = 1 (i.e., the probability that T[i−1] ≺ T[i]) is (σ² + σ)/2σ² = (σ + 1)/2σ, since there are σ² pairs and σ pairs among them have equal elements. Similarly, the probability that β_T(i) = 0 is (σ − 1)/2σ, and the same holds for the pattern. Therefore, the probability that a bit of β_T equals the corresponding bit of β_P is ((σ+1)² + (σ−1)²)/4σ² = (σ² + 1)/2σ². As the pattern length increases, the number of potential candidates decreases exponentially, and the verification time approaches zero. Hence, the filtration time dominates. So if the filtration method takes a sublinear time in the average case, the total algorithm takes a sublinear time in the average case, too.
4.4 SIMD instructions
When we use the Boyer–Moore–Horspool algorithm [HORSPOOL] or the Alpha skip search algorithm [SkipSearch] as the filtration method, we pack four 32-bit numbers or sixteen 8-bit numbers into a register, as in order-preserving matching algorithms [SIMDOPM, OrderOPM]. Each pair of corresponding packed data elements can be compared in parallel using streaming SIMD extensions (SSE) [SSE]. In the case of 32-bit integers, for example, we compute four adjacent comparisons in parallel as in Algorithm 3, where instruction _mm_loadu_si128 loads four 32-bit integers from memory into a 128-bit register, instruction _mm_cmpgt_epi32 compares four pairs of packed 32-bit integers and returns the results of the comparisons in a 128-bit register, instruction _mm_castsi128_ps casts the integer type to the float type, and instruction _mm_movemask_ps selects only the most significant bits of the 4 floats. Comparing sixteen pairs of 8-bit numbers can be done similarly.
5 SIMD solution for short patterns
In this section we present an algorithm that works when the alphabet consists of 1-byte characters and the pattern length m is at most 16. As shown in Section 4.2, we test the global-parent condition T[i+GP_P(j)−1] ≺ T[i+j−1], for every j with GP_P(j) ≠ 0, to check for an occurrence at position i of the text T.
Let W be a word of 16 bytes containing the current window of the text. For each shift d, we define W_d (the word obtained from W by shifting |d| positions to the left or to the right, depending on the sign of d) as follows: the p-th byte of W_d is W[p − d] whenever 0 ≤ p − d < 16.
For a fixed d, we can find all positions p which satisfy W[p−d] ≺ W[p] in parallel by comparing W_d to W using SIMD instructions. The positions satisfying the conditions for all pattern positions are the occurrences of the pattern. The details of the algorithm are as follows. We test whether W[p−d] ≺ W[p] for all p in parallel using a packed byte comparison, taking one shift d = j − GP_P(j) per distinct value. (In order to get only the significant bits of a comparison result, we use instruction _mm_movemask_epi8.) Then we compute the bitwise AND of the resulting masks, each shifted according to its pattern position. Finally, we report a match at position i of the text if the corresponding bit of the result is 1.
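The mask computation can be emulated in scalar Python as follows (a sketch of the idea only, with our own function names; in the real algorithm each mask for a shift d comes from one packed byte comparison of W_d with W followed by _mm_movemask_epi8):

```python
def global_parent(s):
    # 1-indexed global-parent array (0 = root); see Section 4.2
    gp, stack = [0] * (len(s) + 1), []
    for i in range(1, len(s) + 1):
        last = 0
        while stack and (s[stack[-1] - 1], stack[-1]) > (s[i - 1], i):
            last = stack.pop()
        gp[i] = stack[-1] if stack else 0
        if last:
            gp[last] = i
        stack.append(i)
    return gp

def window_matches(pattern, window):
    """0-indexed start positions of Cartesian-tree occurrences of the
    pattern inside one window.  One comparison mask is built per distinct
    shift d = j - GP_P(j); mask[d][p] records W[p-d] "precedes" W[p],
    with ties broken by position (d > 0 means the earlier byte wins)."""
    m, w = len(pattern), len(window)
    gp = global_parent(pattern)
    shifts = {j - gp[j] for j in range(1, m + 1) if gp[j]}
    mask = {d: [0 <= p - d < w and
                (window[p - d] < window[p] or
                 (window[p - d] == window[p] and d > 0))
                for p in range(w)]
            for d in shifts}
    # AND the masks, shifted by the pattern position they serve
    return [k for k in range(w - m + 1)
            if all(mask[j - gp[j]][k + j - 1]
                   for j in range(1, m + 1) if gp[j])]
```

For pattern [2, 0, 5] only the shifts −1 and +1 occur, so two vector comparisons cover all three pattern positions in every candidate alignment of the window at once.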
Example 1
Let’s consider an example of the pattern and the window of the text . We observe that since , . Moreover we do not need to compute , since . Hence we compute , , and .
=  10,  12,  16,  15,  06,  14,  09,  12,  11,  14,  09,  17,  12,  13,  12,  10  

=  12,  16,  15,  06,  14,  09,  12,  11,  14,  09,  17,  12,  13,  12,  10  
=  0,  0,  1,  1,  0,  1,  0,  1,  0,  1,  0,  1,  0,  1,  1,   
=  10,  12,  16,  15,  06,  14,  09,  12,  11,  14,  09,  17,  12,  13,  12,  10  
=  10,  12,  16,  15,  06,  14,  09,  12,  11,  14,  09,  17,  12,  13  
=  ,  ,  1,  1,  0,  0,  1,  0,  1,  1,  0,  1,  1,  0,  1,  1 
=  10,  12,  16,  15,  06,  14,  09,  12,  11,  14,  09,  17,  12,  13,  12,  10  
=  10,  12,  16,  15,  06,  14,  09,  12,  11,  14,  09,  17,  12,  13,  12  
=  ,  1,  1,  0,  0,  1,  0,  1,  0,  1,  0,  1,  0,  1,  0,  0 
The final result can be computed as follows:
=  0,  0,  1,  1,  0,  1,  0,  1,  0,  1,  0,  1,  0,  1,  1,    
=  1,  1,  0,  1,  0,  1,  0,  1,  0,  1,  0,  1,  1,  ,  0,  0  
=  1,  0,  0,  1,  0,  1,  1,  0,  1,  1,  0,  1,  1,  0,  0,  0  
=  0,  1,  0,  1,  0,  1,  0,  1,  0,  1,  0,  0,  0,  0,  0,  0  
=  0,  0,  0,  1,  0,  1,  0,  0,  0,  1,  0,  0,  0,  0,  0,  0 
Therefore, we can report 3 matches. After we have tested a window of the text, we shift the current window to the right by 17 − m positions. This algorithm takes O(m) SIMD instructions per window.
6 Experiments
Dataset  m  KMP-CT  IKMP-CT  SBNDM-CT (q = 2, 4, 6)  BMH-CT (q = 4, 8, 12, 16)  SKS-CT (q = 4, 8, 12, 16)  PM-CT
Random  5  10.52  6.84  4.99  4.42  4.17  3.31  
int  9  10.71  6.83  2.71  2.31  1.95  1.95  1.64  1.91  2.26  
17  10.69  6.83  1.39  1.34  0.95  1.31  0.80  0.86  1.60  1.13  0.45  0.61  3.91  
33  10.69  6.83  0.72  0.70  0.65  1.07  0.51  0.51  1.01  0.76  0.32  0.30  0.48  
65  10.71  6.83  0.72  0.71  0.66  0.98  0.44  0.43  0.71  0.61  0.27  0.24  0.28  
Seoul  5  5.08  3.07  2.67  2.91  2.52  2.27  
temp  9  5.11  3.14  1.56  1.45  1.55  1.55  1.23  1.27  1.77  
17  5.51  3.12  0.89  0.81  0.71  1.10  0.62  0.63  0.84  0.88  0.44  0.49  2.55  
33  5.56  3.12  0.49  0.48  0.45  0.84  0.40  0.34  0.41  0.68  0.32  0.20  0.25  
65  5.52  3.11  0.48  0.48  0.46  0.77  0.26  0.19  0.28  0.57  0.25  0.13  0.12  
Random  5  10.24  6.86  4.80  4.44  3.95  3.22  0.50  
char  7  10.32  6.86  3.53  2.89  4.47  2.39  2.40  0.84  
9  10.34  6.85  2.65  2.32  1.94  1.74  1.24  1.91  1.47  1.32  
13  10.32  6.85  1.75  1.68  1.10  1.23  0.70  0.68  1.34  0.45  1.15  3.76  
17  10.35  6.86  1.28  1.25  0.87  1.04  0.52  0.49  0.79  1.04  0.27  0.32  1.64  
33  10.34  6.85  0.61  0.60  0.54  0.78  0.29  0.26  0.43  0.66  0.16  0.09  0.11  
65  10.36  6.86  0.63  0.63  0.55  0.74  0.20  0.17  0.27  0.47  0.13  0.04  0.05 
In this section we conduct experiments comparing the following algorithms.

KMP-CT: algorithm of Park, Amir, Landau, and Park [CTM]

IKMP-CT: our improved linear-time algorithm based on the prefix-parent and prefix-child representations (Section 3)

PM-CT: our SIMD solution for short patterns (Section 5)

SBNDM-CT: algorithm based on the SBNDM filtration [SBNDM], as implemented by Faro and Lecroq [SMART], applied to the binary representations of the text and the pattern (Section 4.1), with verification using the global-parent representation (Section 4.2). (The following algorithms have the same framework as SBNDM-CT; only SBNDM is replaced by another filtration method.)

BMH-CT: algorithm based on the q-gram Boyer–Moore–Horspool filtration using SIMD instructions [HORSPOOL, QGRAM, SIMDOPM]

SKS-CT: algorithm based on the q-gram Alpha skip search filtration using SIMD instructions [SkipSearch, OrderOPM]
We tested on two random datasets and one real dataset, a time series of Seoul temperatures. The first random dataset consists of 10,000,000 random integers. The second random dataset consists of 10,000,000 random characters. The Seoul temperatures dataset consists of 658,795 integers referring to the hourly temperatures in Seoul (multiplied by ten) in the years 1907–2019. In general, temperatures rise during the day and fall at night; therefore, the Seoul temperatures dataset has more matches than the random datasets. We picked 100 random patterns per pattern length from the random datasets and 1000 random patterns per pattern length from the Seoul temperatures dataset.
The experimental environments and parameters are as follows. All algorithms were implemented in C++11 and compiled with GNU C++ compiler version 4.8.5, using the -O3 and -msse4 options. The experiments were performed on CentOS Linux 7 with 128GB RAM and an Intel Xeon E5-2630 processor.
Table 1 shows the total execution times of the Cartesian tree matching algorithms for random patterns (including preprocessing). The best results are boldfaced. We choose the best results on the random character dataset from each algorithm regardless of q and present them in Fig. 3 (except KMP-CT, for readability). Our linear-time algorithm IKMP-CT improves upon algorithm KMP-CT of [CTM] by about 35%. On the random character dataset, PM-CT is the fastest algorithm for short patterns. However, as the pattern length grows, the algorithms based on filtration are much faster in practice. SKS-CT is the fastest algorithm in most cases. When the pattern length is 9, BMH-CT with 8-grams is the fastest algorithm, irrespective of the dataset. As the pattern length grows, SKS-CT with 12-grams becomes the fastest algorithm.
Regardless of the data type, the results are largely consistent. In detail, however, there are several differences. First, the filtration algorithms, especially SKS-CT, are relatively slower on the Seoul temperatures dataset, because it contains more matches. Second, when q is large, BMH-CT and SKS-CT are faster on the random character dataset than on the random integer dataset, because the maximum number of comparisons that can be performed in parallel is 16 for characters but only 4 for integers.
Acknowledgments. Song, Ryu and Park were supported by the Collaborative Genome Program for Fostering New Post-Genome Industry through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT and Future Planning (No. NRF-2014M3C9A3063541).