Fast Cartesian Tree Matching

Cartesian tree matching is the problem of finding all substrings of a given text which have the same Cartesian tree as that of a given pattern. So far there is one linear-time solution for Cartesian tree matching, which is based on the KMP algorithm. We improve the running time of the previous solution by introducing new representations. We present the framework of a binary filtration method and an efficient verification technique for Cartesian tree matching. Any exact string matching algorithm can be used as a filtration on our framework. We also present a SIMD solution for Cartesian tree matching suitable for short patterns. By experiments we show that known string matching algorithms, combined with our framework of binary filtration and efficient verification, produce algorithms with good performance for Cartesian tree matching.


1 Introduction

String matching is one of the fundamental problems in computer science. There are generalized matchings such as parameterized matching [PARA1, PARA2], swapped matching [SWAP1, SWAP2], overlap matching [Overlap], jumbled matching [JUMBLE], and so on. These problems are characterized by how a match is defined, which depends on the application domains of the problems. In particular, order-preserving matching [OPM, OPM2, OPM3] and Cartesian tree matching [CTM] deal with the order relations between numbers.

The Cartesian tree [CT] is a tree data structure that represents a string, focusing on the order relations between elements of the string. Park et al. [CTM] introduced a matching metric called Cartesian tree matching: the problem of finding all substrings of a text which have the same Cartesian tree as that of a pattern. Like order-preserving matching, Cartesian tree matching can be applied to finding patterns in time series data such as share prices in stock markets, but it may sometimes be more appropriate, as indicated in [CTM]. Fig. 1 shows an example of Cartesian tree matching. Suppose and . The Cartesian tree of substring is the same as that of . Note that if we use order-preserving matching instead of Cartesian tree matching as a metric, does not match .

Figure 1: An example of Cartesian tree matching, and the Cartesian tree corresponding to the pattern.

Many string matching algorithms have been developed over the years. To speed up the search phase of string matching, algorithms based on automata and bit-parallelism were developed [AOSO, SBNDM]. More recently, the SIMD instruction set architecture gave rise to packed string matching, where packed data elements can be compared in parallel. In the last few years, many solutions for order-preserving matching have been proposed. Given a text of length and a pattern of length , Kubica et al. [OPM3] and Kim et al. [OPM] gave time solutions based on the KMP algorithm. Cho et al. [CHO] presented an algorithm using the Boyer-Moore approach. Chhabra and Tarhio [FilterOPM] presented a practical solution based on filtration, and Chhabra et al. [SIMDOPM] gave a filtration algorithm using the Boyer-Moore-Horspool approach and SIMD instructions. Cantone et al. [OrderOPM] proposed filtration methods using the -neighborhood representation and SIMD instructions. These filtration methods [FilterOPM, SIMDOPM, OrderOPM] take sublinear time on average.

In this paper we introduce two new representations, the prefix-parent representation and the prefix-child representation, which can be used to decide whether two strings have the same Cartesian tree. Using these representations, we improve the running time of the previous Cartesian tree matching algorithm in [CTM]. We also present a binary filtration method for Cartesian tree matching, together with an efficient verification technique based on the global-parent representation. On the framework of our binary filtration and efficient verification, any known string matching algorithm [SkipSearch, HORSPOOL, SBNDM] can be applied as a filtration for Cartesian tree matching. In addition, we present a SIMD solution for Cartesian tree matching based on the global-parent representation, which is suitable for short patterns. We conduct experiments comparing many algorithms for Cartesian tree matching; they show that known string matching algorithms, combined with our framework of binary filtration and efficient verification, produce algorithms with good performance for Cartesian tree matching.

This paper is organized as follows. In Section 2, we describe notations and the problem definition. In Section 3, we present an improved linear-time algorithm using new representations. In Section 4, we present the framework of binary filtration and efficient verification. In Section 5, we present a SIMD solution for short patterns. In Section 6, we give the experimental results of the previous algorithm and the proposed algorithms.

2 Preliminaries

2.1 Basic notations

A string is defined as a finite sequence of elements in an alphabet . In this paper, we will assume that has a total order . For a string , represents the th element of , and represents a substring of from the th element to the th element. If then is an empty string.

We will say , if and only if , or and have the same value with . Note that (as elements of the string) if and only if . Unless stated otherwise, the minimum is defined by .

2.2 Cartesian tree matching

A string can be associated with its corresponding Cartesian tree [CT] according to the following rules:

  • If is an empty string, then is an empty tree.

  • If is not empty and is the minimum value among , then is the tree with as the root, as the left subtree, and as the right subtree.
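The recursive construction above can be sketched directly in code. The block below is our own illustration (0-indexed, not from the paper): it builds the tree by taking the leftmost minimum as the root, and a shape-only view of the tree is what Cartesian tree matching compares.

```python
def cartesian_tree(s):
    """Build the Cartesian tree of s following the recursive definition:
    the (leftmost) minimum becomes the root, the prefix before it the
    left subtree, and the suffix after it the right subtree."""
    def build(lo, hi):            # half-open range [lo, hi)
        if lo >= hi:
            return None           # empty string -> empty tree
        root = min(range(lo, hi), key=lambda i: s[i])  # leftmost minimum
        return (root, build(lo, root), build(root + 1, hi))
    return build(0, len(s))

def shape(node):
    """Forget the stored positions and keep only the tree shape, which
    is what 'having the same Cartesian tree' compares."""
    if node is None:
        return None
    _, left, right = node
    return (shape(left), shape(right))
```

For example, `[10, 12, 16]` and `[1, 2, 3]` have the same shape (a chain growing to the right), while `[2, 1, 3]` does not.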

Cartesian tree matching is the problem of finding all substrings of the text which have the same Cartesian tree as that of the pattern. Formally, Park et al. [CTM] define it as follows:

Definition 1

(Cartesian tree matching) Given two strings text and pattern , find every such that .

Instead of building the Cartesian tree for every position in the text to solve Cartesian tree matching, Park et al. [CTM] use the following representation for a Cartesian tree.

Definition 2

(Parent-distance representation) Given a string , the parent-distance representation of is a function , which is defined as follows:

Since the parent-distance representation has a one-to-one mapping to the Cartesian tree [CTM], it can replace the Cartesian tree without any loss of information.
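As an illustration of Definition 2, here is a sketch (ours, 0-indexed, assuming the convention that the parent-distance is 0 when no parent exists) of a linear-time, stack-based computation, plus a naive quadratic matcher built on it. The matcher only fixes the semantics of the problem; it is not the paper's algorithm.

```python
def parent_distance(s):
    """Parent-distance representation (0-indexed sketch): pd[i] is the
    distance from i back to the nearest j < i with s[j] <= s[i], or 0
    if no such j exists.  A stack of indices whose values increase from
    bottom to top gives linear time overall."""
    pd, stack = [0] * len(s), []
    for i, v in enumerate(s):
        while stack and s[stack[-1]] > v:
            stack.pop()
        pd[i] = i - stack[-1] if stack else 0
        stack.append(i)
    return pd

def ct_naive_match(text, pattern):
    """Naive Cartesian tree matching per Definition 1: because the
    parent-distance representation is in one-to-one correspondence with
    the Cartesian tree, a window matches iff its representation equals
    the pattern's.  O(n*m), for illustration only."""
    m, pd_p = len(pattern), parent_distance(pattern)
    return [i for i in range(len(text) - m + 1)
            if parent_distance(text[i:i + m]) == pd_p]
```

For instance, in the text `[10, 12, 16, 15, 6, 14]` the pattern `[1, 2, 3]` (an increasing chain) matches only the window starting at position 0.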

3 Fast linear Cartesian tree matching

The previous algorithm for Cartesian tree matching due to Park et al. [CTM] is based on the KMP algorithm [KMP]. They converted the pattern and the text to parent-distance representations and found matches using the KMP algorithm. To compute the parent-distance representations of substrings of the text using space, however, they used a deque data structure. We improve the text search phase of the previous algorithm by removing the overhead of computing parent-distance representations, including the deque operations.

In the text search phase of the previous algorithm, the parent-distance of each element in is computed to check whether it matches when we know that matches . We can do this directly, without computing the parent-distances of text elements, by using the following two representations: the prefix-parent representation and the prefix-child representation.

Figure 2: The prefix-parent, prefix-child, and global-parent representations of an example string.
Definition 3

(prefix-parent representation) Given a string , the prefix-parent representation of is a function , which is defined as follows:

Since the parent-distance of each position can be recovered from its prefix-parent (the parent-distance is the difference between the position and its prefix-parent when the prefix-parent exists, and 0 otherwise), the prefix-parent representation also has a one-to-one mapping to the Cartesian tree.

Definition 4

(prefix-child representation) Given a string , the prefix-child representation of is a function , which is defined as follows: , and for ,

In other words, is a child of , because is the root of when , and is the root of when . When , there is no child of in , and thus we set as .

Fig. 2 shows the prefix-parent representation (resp. the prefix-child representation) of string by arrows. The arrow starting from indicates (resp. ). If (resp. ), we omit the arrow.

The advantage of using the prefix-child representation and the prefix-parent representation is that we can check whether each text element matches the corresponding pattern element in constant time without computing its parent-distance [CTM].

1:procedure CARTESIAN-TREE-MATCH()
2:      PREFIX-PARENT-CHILD-REP()
3:      FAILURE-FUNC()
4:     
5:     for  to  do
6:         while  do
7:              if  then
8:                  break
9:              else
10:                                          
11:         
12:         if  then
13:              print “Match occurred at
14:                             
Algorithm 1 Text search of Cartesian tree matching
Theorem 3.1

Given two strings and , assume that and have the same prefix-parent representations. If , then and have the same prefix-parent representations, and vice versa.

Proof

If , and always have the same prefix-parent . Now let’s assume . There are three cases, in each of which we show that .

  1. Case : Since is the minimum element in and for , is also the minimum element in . Therefore, if holds, then we have .

  2. Case : Since , we have .

  3. Case : Since is the minimum element in and for , is also the minimum element in . Therefore, if holds, then .

It is trivial by definitions of and . ∎

With the prefix-parent representation and the prefix-child representation of pattern , we can simplify the text search. For each element , we can check by comparing with the elements in whose indices correspond to and in . Using this idea, we don’t have to compute . Algorithm 1 describes the algorithm to do this. We compute the failure function in the same way as [CTM] does.

Given a string , we can compute the prefix-child representation and the prefix-parent representation simultaneously in linear time using a stack. means that for . The same is true for . On the stack, therefore, we maintain only ’s which satisfy for while scanning from to . Suppose that are on the stack when we are computing and . (We assume that .) Then, forms an increasing subsequence of . When we consider a new index , we pop the indices repeatedly until we have . If there exists such an index , we set and . (If , then .) Otherwise, is the minimum element in , and thus and . Finally, we push onto the stack. Algorithm 2 describes the algorithm to compute and simultaneously.

1:procedure PREFIX-PARENT-CHILD-REP()
2:     
3:     for  to  do
4:         
5:         while  is not empty do
6:              
7:              if  then
8:                  break               
9:              
10:                        
11:         
12:         if  is empty then
13:              
14:         else
15:                        
16:               
17:     return
Algorithm 2 Computing prefix-parent and prefix-child representations
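The procedure above can be sketched as follows (our 0-indexed rendition of Algorithm 2, with None playing the role of the paper's 0; equal values are tie-broken so that the earlier index counts as smaller):

```python
def prefix_parent_child(s):
    """Compute the prefix-parent and prefix-child representations
    together in linear time.  The stack stores the right spine of the
    Cartesian tree of the prefix seen so far; its values increase from
    bottom to top."""
    n = len(s)
    pp, pc = [None] * n, [None] * n
    stack = []
    for i, v in enumerate(s):
        last = None
        while stack and s[stack[-1]] > v:   # equal values stay: the
            last = stack.pop()              # earlier index is "smaller"
        pp[i] = stack[-1] if stack else None  # nearest j < i, s[j] <= s[i]
        pc[i] = last   # root of the popped segment becomes i's left child
        stack.append(i)
    return pp, pc
```

For `s = [3, 1, 2]`: position 1 (value 1) is the root of the prefix tree with position 0 as its child, and position 2 hangs off position 1, so `pp = [None, None, 1]` and `pc = [None, 0, None]`.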

4 Fast Cartesian tree matching with filtration

In this section we present a practical solution based on filtration. Our solution for Cartesian tree matching consists of two phases: filtration and verification. First, the text is filtered with an exact string matching algorithm using the binary representation. In the second phase, the potential candidates are verified using the global-parent representation.

4.1 Filtration

In the filtration phase, a string is translated into a binary representation as follows.

Definition 5

(binary representation) Given a string , the binary representation of is a binary string of length , which is defined as follows:

for each .

One can easily check whether is true or not by comparing and : if and only if . The following theorem proves that the binary representation can be used to filter a text to search for all Cartesian tree matching occurrences of a pattern .

Theorem 4.1

Let and be two strings of lengths and , respectively, and let and be the binary representations associated with and , respectively. If , then for .

Proof

The prefix-parent representation has a one-to-one mapping to the Cartesian tree. Therefore, if , then for . If , then for .

Theorem 4.1 guarantees that any standard exact string matching algorithm can be used as a filtration procedure. As the exact string matching algorithm returns the matches of in , these matches are the only possible candidates for Cartesian tree matching, and they must then be verified.
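A minimal sketch of the filtration phase (ours; we assume the convention that bit i of the binary representation is 1 iff the element at position i is less than or equal to the next one, and we let Python's str.find stand in for an arbitrary exact string matcher):

```python
def binary_rep(s):
    """Binary representation of s (length len(s) - 1).  Assumed
    convention: bit i is '1' iff s[i] <= s[i+1], i.e., it records the
    order relation between adjacent positions."""
    return ''.join('1' if s[i] <= s[i + 1] else '0'
                   for i in range(len(s) - 1))

def filter_candidates(text, pattern):
    """Filtration phase: run any exact string matcher over the binary
    representations.  The result is a superset of the true Cartesian
    tree occurrences and must still be verified."""
    bt, bp = binary_rep(text), binary_rep(pattern)
    out, i = [], bt.find(bp)
    while i != -1:               # collect all (possibly overlapping) hits
        out.append(i)
        i = bt.find(bp, i + 1)
    return out
```

For the text `[10, 12, 16, 15, 6, 14]` the binary representation is `"11001"`, and the increasing pattern `[1, 2, 3]` (binary `"11"`) yields the single candidate position 0.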

Cantone et al. [OrderOPM] presented two filtration methods other than the binary representation to solve order-preserving matching. They used the property that does not match at position if there are two positions and such that does not hold. Thus any comparison result between two positions can be used for filtration. In Cartesian tree matching, however, even if there exist such and , the corresponding Cartesian trees can be the same when . Therefore, we cannot use these filtration methods for Cartesian tree matching.

4.2 Verification

In the verification phase, we have to check whether the candidates found by the filtration phase are actual matches. By Theorem 3.1, this check can be done using the prefix-parent and prefix-child representations, which takes two comparisons per element. In order to reduce the number of comparisons to one per element, we introduce another representation as follows.

Definition 6

(Global-parent representation) Given a string , the global-parent representation of is a function , which is defined as follows:

is well-defined because there is at most one which satisfies . Fig. 2 shows the global-parent representation by arrows. The arrow starting from indicates the global parent of . If , we omit the arrow.

Theorem 4.2

Two strings and have the same Cartesian trees if and only if for all .

Proof

We will prove that for all if and only if for all .

It is trivial by definition of .

Assume for all . For any , we first show , and then we show .

  1. (Proof of ) There are two cases: and . If , then holds trivially. Otherwise, since , . Therefore, holds.

  2. (Proof of ) If , then . So we only have to consider the case that there is which satisfies . Let be a sequence such that , and there is no which satisfies . Since is a strictly increasing sequence, such always exists. Note that except for . On the sequence, there may or may not exist such that .

    Suppose that there exists some such that . Since , is the minimum element in , and so . Proceeding inductively, for all . Thus holds trivially.

    Now we consider the case that for all . Then, we have by the assumption that for all . We now show as follows. Since , is the minimum element in , and . Hence, we have . Inductively, we can show that . Therefore, holds.

By Theorem 4.2, we only have to compare once for each element in the verification phase. For a potential candidate obtained from the filtration phase (say, it starts from ), we compare and from to . The candidate is discarded when there exists such that .

We compute the global-parent representation using a stack, as in computing the prefix-parent and the prefix-child representations. The only difference is that first we set as , and then if we find such that we update to .
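The stack computation can be sketched as follows (our 0-indexed reading, with None for "no parent"): each position starts out with its prefix-parent as its global parent, and when a later, smaller element pops it off the stack, the global parent is updated to whichever candidate is nearer in value, ties broken by position. A one-comparison-per-element verifier in the spirit of Theorem 4.2 is shown alongside; both are sketches of our reading of the representation, not the paper's code.

```python
def global_parent(s):
    """Global-parent representation, read as: gp[i] is the parent of
    node i in the Cartesian tree of the whole string (None for the
    root).  gp[i] is initialized to the prefix-parent from the stack,
    then updated to a later popper j when j is nearer in value."""
    gp, stack = [None] * len(s), []
    for j, v in enumerate(s):
        while stack and s[stack[-1]] > v:
            i = stack.pop()
            # keep the larger of the left candidate gp[i] and popper j,
            # comparing by (value, position) to break ties by index
            if gp[i] is None or (s[gp[i]], gp[i]) < (v, j):
                gp[i] = j
        gp[j] = stack[-1] if stack else None
        stack.append(j)
    return gp

def verify(window, gp):
    """One comparison per element: every element must compare no
    smaller than its global parent, ties broken by position."""
    return all(g is None or (window[g], g) <= (window[i], i)
               for i, g in enumerate(gp))
```

With the pattern `[1, 2, 3]` (global parents `[None, 0, 1]`), the candidate window `[10, 12, 16]` passes verification while `[12, 16, 15]` fails at its last element.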

4.3 Sublinear time on average

The proof of sublinearity is similar to the analysis of order-preserving matching with filtration [FilterOPM]. Let's assume that the elements in the pattern and the text are independent of each other and that their distribution is uniform. The verification phase takes time proportional to the pattern length times the number of potential candidates. When the alphabet size is , the probability that (i.e., the probability that ) is , since there are pairs and pairs among them have equal elements. Similarly, the probability that is , and the same holds for . Therefore, the probability that is . As the pattern length increases, the number of potential candidates decreases exponentially, and the verification time approaches zero. Hence, the filtration time dominates: if the filtration method takes sublinear time in the average case, the total algorithm does too.
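The decay of the candidate count can also be observed empirically. The toy experiment below (ours, with an assumed bit convention of '1' iff the next element is no smaller) counts filtration candidates for random byte patterns of growing length in a random byte text; the count shrinks by roughly half with each extra pattern position, so verification cost quickly becomes negligible.

```python
import random

def binary_rep(s):
    # assumed convention: bit i is '1' iff s[i] <= s[i+1]
    return ''.join('1' if s[i] <= s[i + 1] else '0'
                   for i in range(len(s) - 1))

def count_candidates(text_bits, pattern):
    """Number of filtration hits of the pattern's binary representation
    in the precomputed binary representation of the text."""
    bp = binary_rep(pattern)
    return sum(text_bits.startswith(bp, i)
               for i in range(len(text_bits) - len(bp) + 1))

random.seed(0)                      # deterministic toy experiment
text = [random.randrange(256) for _ in range(20000)]
bits = binary_rep(text)
counts = {m: count_candidates(bits, [random.randrange(256) for _ in range(m)])
          for m in (4, 8, 12)}
# counts decrease sharply as the pattern length m grows
```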

4.4 SIMD instructions

When we use the Boyer-Moore-Horspool algorithm [HORSPOOL] and the Alpha skip search algorithm [SkipSearch] as the filtration method, we pack four 32-bit numbers or sixteen 8-bit numbers into a register, as in order-preserving matching algorithms [SIMDOPM, OrderOPM]. Each pair of corresponding packed data elements can be compared in parallel using streaming SIMD extensions (SSE) [SSE]. In the case of 32-bit integers, for example, we compute , , , and in parallel as in Algorithm 3, where instruction _mm_loadu_si128((__m128i *)()) loads four 32-bit integers from memory into a 128-bit register, instruction _mm_cmpgt_epi32(, ) compares four pairs of packed 32-bit integers and returns the comparison results in a 128-bit register, instruction _mm_castsi128_ps casts the integer type to the float type, and instruction _mm_movemask_ps selects only the most significant bits of the four floats. Comparing sixteen pairs of 8-bit numbers can be done similarly.

1:procedure CompareUsingSIMD()
2:     __m128i _mm_loadu_si128((__m128i *)())
3:     __m128i _mm_loadu_si128((__m128i *)())
4:     __m128i _mm_cmpgt_epi32(, )
5:     return _mm_movemask_ps(_mm_castsi128_ps())
Algorithm 3 Compare integers in parallel
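To make the resulting bit layout concrete, here is a scalar emulation (ours, plain Python rather than SIMD) of what _mm_cmpgt_epi32 followed by _mm_movemask_ps computes for four 32-bit lanes:

```python
def cmpgt_movemask(a, b):
    """Emulate _mm_cmpgt_epi32 followed by _mm_movemask_ps: lane k of
    the comparison result is all ones iff a[k] > b[k], and the movemask
    collects each lane's sign bit, so bit k of the returned integer is
    set exactly when a[k] > b[k]."""
    return sum(1 << k for k, (x, y) in enumerate(zip(a, b)) if x > y)
```

For example, `cmpgt_movemask([5, 1, 7, 2], [3, 4, 6, 2])` sets bits 0 and 2 (lanes where the left element is strictly greater), giving `0b0101`.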

5 SIMD solution for short patterns

In this section we present an algorithm that works when the alphabet consists of 1-byte characters and the pattern length is at most 16. As shown in Section 4.2, we test for to check for an occurrence at position of the text .

Let be a word of 16 bytes containing the current window of the text, i.e., . For , we define (word obtained from by shifting positions to the left or to the right, depending on the sign of ) as follows:

For fixed , we can find the positions which satisfy for in parallel by comparing to using SIMD instructions. The satisfying positions for all are the occurrences of the pattern. The details of the algorithm are as follows. We test whether for in parallel using the SIMD instruction for or for . (In order to get only significant bits when computing , we use instruction _mm_movemask_epi8.) Then we compute . Finally, we report a match at position of the text if .

Example 1

Let’s consider an example of the pattern and the window of the text . We observe that since , . Moreover we do not need to compute , since . Hence we compute , , and .

= 10, 12, 16, 15, 06, 14, 09, 12, 11, 14, 09, 17, 12, 13, 12, 10
= 12, 16, 15, 06, 14, 09, 12, 11, 14, 09, 17, 12, 13, 12, 10
= 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, -
= 10, 12, 16, 15, 06, 14, 09, 12, 11, 14, 09, 17, 12, 13, 12, 10
= 10, 12, 16, 15, 06, 14, 09, 12, 11, 14, 09, 17, 12, 13
= -, -, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1
= 10, 12, 16, 15, 06, 14, 09, 12, 11, 14, 09, 17, 12, 13, 12, 10
= 10, 12, 16, 15, 06, 14, 09, 12, 11, 14, 09, 17, 12, 13, 12
= -, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0

The final result can be computed as follows:

= 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, -
= 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, -, 0, 0
= 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0
= 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0
= 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0

Therefore, we can report 3 matches. After we have tested a window of the text, we shift the current window to the right by positions. This algorithm takes SIMD instructions.

6 Experiments

Dataset KMP IKMP SBNDMCT BMHCT SKSCT PM
CT CT 2 4 6 4 8 12 16 4 8 12 16 CT
Random 5 10.52 6.84 4.99 4.42 4.17 3.31
int 9 10.71 6.83 2.71 2.31 1.95 1.95 1.64 1.91 2.26
17 10.69 6.83 1.39 1.34 0.95 1.31 0.80 0.86 1.60 1.13 0.45 0.61 3.91
33 10.69 6.83 0.72 0.70 0.65 1.07 0.51 0.51 1.01 0.76 0.32 0.30 0.48
65 10.71 6.83 0.72 0.71 0.66 0.98 0.44 0.43 0.71 0.61 0.27 0.24 0.28
Seoul 5 5.08 3.07 2.67 2.91 2.52 2.27
temp 9 5.11 3.14 1.56 1.45 1.55 1.55 1.23 1.27 1.77
17 5.51 3.12 0.89 0.81 0.71 1.10 0.62 0.63 0.84 0.88 0.44 0.49 2.55
33 5.56 3.12 0.49 0.48 0.45 0.84 0.40 0.34 0.41 0.68 0.32 0.20 0.25
65 5.52 3.11 0.48 0.48 0.46 0.77 0.26 0.19 0.28 0.57 0.25 0.13 0.12
Random 5 10.24 6.86 4.80 4.44 3.95 3.22 0.50
char 7 10.32 6.86 3.53 2.89 4.47 2.39 2.40 0.84
9 10.34 6.85 2.65 2.32 1.94 1.74 1.24 1.91 1.47 1.32
13 10.32 6.85 1.75 1.68 1.10 1.23 0.70 0.68 1.34 0.45 1.15 3.76
17 10.35 6.86 1.28 1.25 0.87 1.04 0.52 0.49 0.79 1.04 0.27 0.32 1.64
33 10.34 6.85 0.61 0.60 0.54 0.78 0.29 0.26 0.43 0.66 0.16 0.09 0.11
65 10.36 6.86 0.63 0.63 0.55 0.74 0.20 0.17 0.27 0.47 0.13 0.04 0.05
Table 1: Execution times in seconds for random patterns in texts (random datasets: 100 patterns; Seoul temperatures dataset: 1000 patterns).
Figure 3: Execution times for the random character dataset.

In this section we conduct experiments comparing the following algorithms.

  • KMPCT: algorithm of Park, Amir, Landau, and Park [CTM]

  • IKMPCT: our improved linear-time algorithm based on prefix-parent and prefix-child representations (Section 3)

  • PMCT: SIMD solution for short patterns (Section 5)

  • SBNDMCT: algorithm based on the SBNDM filtration [SBNDM], as implemented by Faro and Lecroq [SMART], applied to the binary representations of the text and the pattern (Section 4.1), with verification using the global-parent representation (Section 4.2). (The following algorithms share the same framework as SBNDMCT; only SBNDM is replaced by another filtration method.)

  • BMHCT: algorithm based on the -gram Boyer-Moore-Horspool filtration using SIMD instructions [HORSPOOL, QGRAM, SIMDOPM]

  • SKSCT: algorithm based on the -gram Alpha skip search filtration using SIMD instructions [SkipSearch, OrderOPM]

We tested on two random datasets and one real dataset, a time series of Seoul temperatures. The first random dataset consists of 10,000,000 random integers, and the second of 10,000,000 random characters. The Seoul temperatures dataset consists of 658,795 integers giving the hourly temperatures in Seoul (multiplied by ten) in the years 1907-2019. In general, temperatures rise during the day and fall at night; therefore, the Seoul temperatures dataset has more matches than the random datasets. We picked 100 random patterns per pattern length from the random datasets and 1000 random patterns per pattern length from the Seoul temperatures dataset.

The experimental environment and parameters are as follows. All algorithms were implemented in C++11 and compiled with the GNU C++ compiler version 4.8.5 using the -O3 and -msse4 options. The experiments were performed on a CentOS Linux 7 system with 128GB RAM and an Intel Xeon E5-2630 processor.

Table 1 shows the total execution times of the Cartesian tree matching algorithms for random patterns (including preprocessing). The best results are boldfaced. We choose the best results of each algorithm on the random character dataset, regardless of , and present them in Fig. 3 (omitting KMPCT for readability). Our linear-time algorithm IKMPCT improves upon the algorithm KMPCT of [CTM] by about 35%. On the random character dataset, PMCT is the fastest algorithm for short patterns. As the pattern length grows, however, algorithms based on filtration are much faster in practice. SKSCT is the fastest algorithm in most cases. When the pattern length is 9, BMHCT using 8-grams is the fastest algorithm, irrespective of the dataset. As the pattern length grows, SKSCT using 12-grams becomes the fastest algorithm.

Regardless of the data type, the results are largely consistent. In detail, however, there are several differences. First, the filtration algorithms, especially the SKSCT algorithms, are relatively slower on the Seoul temperatures dataset, because it contains more matches. Second, when is large, the BMHCT and SKSCT algorithms are faster on the random character dataset than on the random integer dataset, because the maximum number of elements that can be compared in parallel is 16 for the character dataset but only 4 for the integer dataset.

Acknowledgments. Song, Ryu and Park were supported by the Collaborative Genome Program for Fostering New Post-Genome Industry through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT and Future Planning (No. NRF-2014M3C9A3063541).

References