Cartesian Tree Matching and Indexing

We introduce a new metric of match, called Cartesian tree matching, which means that two strings match if they have the same Cartesian trees. Based on Cartesian tree matching, we define single pattern matching for a text of length n and a pattern of length m, and multiple pattern matching for a text of length n and k patterns of total length m. We present an O(n+m) time algorithm for single pattern matching, and an O((n+m) log k) deterministic time or O(n+m) randomized time algorithm for multiple pattern matching. We also define an index data structure called Cartesian suffix tree, and present an O(n) randomized time algorithm to build the Cartesian suffix tree. Our efficient algorithms for Cartesian tree matching use a representation of the Cartesian tree, called the parent-distance representation.


1 Introduction

String matching is one of the fundamental problems in computer science, and it has many practical applications. In many applications string matching has variants derived from exact matching (which can be collectively called generalized matching), such as order-preserving matching [19, 20, 22], parameterized matching [4, 7, 8], jumbled matching [9], overlap matching [3], pattern matching with swaps [2], and so on. These problems are characterized by how a match is defined, which depends on the application domain of each problem. In financial markets, for example, people want to find some patterns in the time series data of stock prices. In this case, they are more interested in patterns of price fluctuation than in the exact prices themselves [15]. Therefore, we need a definition of match which is appropriate to handle such cases.

Figure 1: Example pattern and its corresponding Cartesian tree

The Cartesian tree [27] is a tree data structure that represents an array, only focusing on the results of comparisons between numeric values in the array. In this paper we introduce a new metric of match, called Cartesian tree matching, which means that two strings match if they have the same Cartesian trees. If we model the time series stock prices as a numerical string, we can find a desired pattern from the data by solving a Cartesian tree matching problem. For example, let’s assume that the pattern we want to find looks like the picture on the left of Figure 1, which is a common pattern called the head-and-shoulder [15] (in fact there are two versions of the head-and-shoulder: one is the picture in Figure 1 and the other is the picture reversed). The picture on the right of Figure 1 is the Cartesian tree corresponding to the pattern on the left. Cartesian tree matching finds every position of the text which has the same Cartesian tree as the picture on the right of Figure 1.

Even though order-preserving matching [19, 20, 22] can also be applied to finding patterns in time series data, Cartesian tree matching may be more appropriate than order-preserving matching in finding patterns. For instance, let's assume that we are looking for the pattern in Figure 1 in time series stock prices. An important characteristic of the pattern is that the price hits the bottom (head), and it has two shoulders before and after the head. But the relative order between the two shoulders (i.e., which one is higher) does not matter. If we model this pattern for order-preserving matching, then order-preserving matching imposes a relative order between the two shoulders. Moreover, it imposes an unnecessary order between the two valleys on either side of the head. Hence, order-preserving matching may not be able to find such a pattern in time series data. In contrast, the pattern in Figure 1 can be represented by one Cartesian tree, and therefore Cartesian tree matching is a more appropriate metric in such cases.

In this paper we define string matching problems based on Cartesian tree matching: single pattern matching for a text of length n and a pattern of length m, and multiple pattern matching for a text of length n and k patterns of total length m, and we present efficient algorithms for them. We also define an index data structure called the Cartesian suffix tree, as in the cases of parameterized matching and order-preserving matching [8, 13], and present an efficient algorithm to build the Cartesian suffix tree. To obtain efficient algorithms for Cartesian tree matching, we define a representation of the Cartesian tree, called the parent-distance representation.

In Section 2 we give basic definitions for Cartesian tree matching. In Section 3 we propose an O(n + m) time algorithm for single pattern matching. In Section 4 we present an O((n + m) log k) deterministic time or O(n + m) randomized time algorithm for multiple pattern matching. In Section 5 we define the Cartesian suffix tree, and present an O(n) randomized time algorithm to build the Cartesian suffix tree of a string of length n.

2 Problem Definition

2.1 Basic notations

A string is a sequence of characters drawn from an alphabet Σ, which is a set of integers. We assume that the comparison between any two characters can be done in O(1) time. For a string S, S[i] represents the i-th character of S, and S[i..j] represents the substring of S starting at position i and ending at position j.

2.2 Cartesian tree matching

A string S[1..n] can be associated with its corresponding Cartesian tree C(S) according to the following rules [27]:

  • If S is an empty string, C(S) is an empty tree.

  • If S[1..n] is not empty and S[i] is the minimum value among S[1..n], C(S) is the tree with S[i] as the root, C(S[1..i−1]) as the left subtree, and C(S[i+1..n]) as the right subtree. If there are two or more minimum values, we choose the leftmost one as the root.

Since each character in string S corresponds to a node in Cartesian tree C(S), we can treat each character as a node in the Cartesian tree.
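To make the construction concrete, the following is a small Python sketch (ours, not part of the paper) that builds C(S) directly from the recursive definition above, breaking ties by taking the leftmost minimum. It runs in O(n^2) worst-case time, which is fine for illustration; Node and build_cartesian are illustrative names.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Node:
        index: int                      # position of the character in S
        left: Optional["Node"] = None
        right: Optional["Node"] = None

    def build_cartesian(S, lo=0, hi=None):
        """Return the root of C(S[lo:hi]) (0-indexed), or None if empty."""
        if hi is None:
            hi = len(S)
        if lo >= hi:
            return None
        # the leftmost minimum becomes the root, as in the definition
        m = min(range(lo, hi), key=lambda i: (S[i], i))
        return Node(m, build_cartesian(S, lo, m), build_cartesian(S, m + 1, hi))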

Cartesian tree matching is the problem of finding all substrings of the text which have the same Cartesian tree as a given pattern. Formally, we define it as follows:

(Cartesian tree matching) Given two strings, text T[1..n] and pattern P[1..m], find every position 1 ≤ i ≤ n − m + 1 such that C(T[i..i+m−1]) = C(P[1..m]).

For example, consider searching a text T for the pattern P of Figure 1; Cartesian tree matching finds every position i such that C(T[i..i+m−1]) = C(P). Note that a matched window need not be a match in order-preserving matching [20, 22], because the relative order of characters in different subtrees (e.g., the two shoulders) may differ between the window and the pattern, but it is a match in Cartesian tree matching.

3 Single Pattern Matching in O(n + m) Time

3.1 Parent-distance representation

In order to solve Cartesian tree matching without building every possible Cartesian tree, we propose an efficient representation to store the information about Cartesian trees, called the parent-distance representation.

(Parent-distance representation) Given a string S[1..n], the parent-distance representation of S is an integer string PD_S[1..n], which is defined as follows:

    PD_S[i] = i − max{ j : 1 ≤ j < i and S[j] ≤ S[i] }, or 0 if no such j exists.

For example, the parent-distance representation of string S = (3, 1, 5, 2, 4) is PD_S = (0, 0, 1, 2, 1). Note that S[i − PD_S[i]] in Definition 3.1 represents the parent of S[i] in Cartesian tree C(S[1..i]). Furthermore, if there is no such j, PD_S[i] = 0 and S[i] is the root of Cartesian tree C(S[1..i]).
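As a sanity check of the definition, the following naive Python sketch (ours) computes PD_S literally from Definition 3.1 in quadratic time; Section 3.2 gives the linear-time computation.

    def parent_dist_naive(S):
        n = len(S)
        PD = [0] * n
        for i in range(n):
            # nearest j < i with S[j] <= S[i]; PD[i] stays 0 if none exists
            for j in range(i - 1, -1, -1):
                if S[j] <= S[i]:
                    PD[i] = i - j
                    break
        return PD

    # parent_dist_naive([3, 1, 5, 2, 4]) == [0, 0, 1, 2, 1]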

Theorem 3.1 shows that the parent-distance representation is in one-to-one correspondence with the Cartesian tree, so it can substitute for the Cartesian tree without any loss of information.

Theorem 3.1. Two strings S_1 and S_2 have the same Cartesian trees if and only if S_1 and S_2 have the same parent-distance representations.

Proof.

If two strings have different lengths, they have different Cartesian trees and different parent-distance representations, so the theorem holds. Therefore, we only have to consider the case where S_1 and S_2 have the same length. Let n be the common length of S_1 and S_2. We prove the theorem by induction on n.

If n = 1, S_1 and S_2 always have the same Cartesian trees, each consisting of only one node. Furthermore, they have the same parent-distance representation (0). Therefore, the theorem holds when n = 1.

Let's assume that the theorem holds for strings of length n − 1, and show that it holds for strings of length n.

(⇒) Assume that S_1 and S_2 have the same Cartesian trees (i.e., C(S_1) = C(S_2)). There are two cases.

  • If S_1[n] and S_2[n] are not roots of the Cartesian trees, let S_1[i] be the parent of S_1[n], and S_2[j] the parent of S_2[n]. Since C(S_1) = C(S_2), we have n − i = n − j, i.e., i = j. If we remove S_1[n] from Cartesian tree C(S_1), we obtain the tree C(S_1[1..n−1]), where the left subtree of S_1[n] is attached to its parent S_1[i]. If we remove S_2[n] from C(S_2), we obtain C(S_2[1..n−1]) in the same way. Since C(S_1) = C(S_2), we get C(S_1[1..n−1]) = C(S_2[1..n−1]), and therefore PD_{S_1[1..n−1]} = PD_{S_2[1..n−1]} by the induction hypothesis. Since PD_{S_1}[n] = n − i and PD_{S_2}[n] = n − j, we have PD_{S_1} = PD_{S_2}.

  • If S_1[n] and S_2[n] are roots, we remove S_1[n] and S_2[n] to get C(S_1[1..n−1]) and C(S_2[1..n−1]). Since C(S_1) = C(S_2), we have C(S_1[1..n−1]) = C(S_2[1..n−1]), and therefore PD_{S_1[1..n−1]} = PD_{S_2[1..n−1]} by the induction hypothesis. Since PD_{S_1}[n] = PD_{S_2}[n] = 0 in this case, we get PD_{S_1} = PD_{S_2}.

(⇐) Assume that S_1 and S_2 have the same parent-distance representations (i.e., PD_{S_1} = PD_{S_2}). Since PD_{S_1[1..n−1]} = PD_{S_2[1..n−1]}, we have C(S_1[1..n−1]) = C(S_2[1..n−1]) by the induction hypothesis. From C(S_1[1..n−1]), we can derive C(S_1) as follows. If PD_{S_1}[n] ≠ 0, let S_1[i] be the character S_1[n − PD_{S_1}[n]]. We insert S_1[n] into C(S_1[1..n−1]) so that the parent of S_1[n] is S_1[i] and the original right subtree of S_1[i] becomes the left subtree of S_1[n]. If PD_{S_1}[n] = 0, S_1[n] becomes the root of C(S_1) and C(S_1[1..n−1]) becomes the left subtree of S_1[n]. We derive C(S_2) from C(S_2[1..n−1]) in the same way. Since C(S_1[1..n−1]) = C(S_2[1..n−1]) and PD_{S_1}[n] = PD_{S_2}[n], we can conclude that C(S_1) = C(S_2).

Therefore, we have proved that there is a one-to-one mapping between Cartesian trees and parent-distance representations. ∎
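Theorem 3.1 is also easy to check empirically. The following randomized test (ours; it reuses build_cartesian and parent_dist_naive from the sketches above) verifies on random strings that tree equality and parent-distance equality coincide.

    import random

    def shape(node):
        # encode only the shape of the tree, ignoring the stored positions
        return None if node is None else (shape(node.left), shape(node.right))

    for _ in range(10000):
        n = random.randint(1, 8)
        S1 = [random.randint(0, 4) for _ in range(n)]
        S2 = [random.randint(0, 4) for _ in range(n)]
        same_tree = shape(build_cartesian(S1)) == shape(build_cartesian(S2))
        same_pd = parent_dist_naive(S1) == parent_dist_naive(S2)
        assert same_tree == same_pd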

3.2 Computing parent-distance representation

Given a string S[1..n], we can compute the parent-distance representation PD_S in linear time using a stack, as in [13, 14]. The main idea is that if two characters S[j] and S[j′] with j < j′ satisfy S[j] ≥ S[j′], then S[j] cannot be the parent of S[i] for any i > j′. Therefore, while scanning S from left to right, we only keep on the stack those characters S[j] that have no such S[j′]; the stored characters form a non-decreasing subsequence of S. When we consider a new value, therefore, we pop the values larger than the new value, find the parent of the new value at the top of the stack, and push the new value and its index onto the stack. Algorithm 1 describes the algorithm to compute PD_S.

1:procedure PARENT-DIST-REP(S[1..n])
2:     ST ← an empty stack
3:     for i ← 1 to n do
4:         while ST is not empty do
5:              (value, index) ← the top of ST
6:              if value ≤ S[i] then
7:                  break
8:              pop from ST
9:         if ST is empty then
10:              PD_S[i] ← 0
11:         else
12:              PD_S[i] ← i − index
13:         push (S[i], i) into ST
14:     return PD_S
Algorithm 1 Computing parent-distance representation of a string
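The following is a direct Python transcription of Algorithm 1 (ours, 0-indexed). The stack ST holds (value, index) pairs whose values form a non-decreasing subsequence of S.

    def parent_dist_rep(S):
        PD = [0] * len(S)
        ST = []                           # stack of (value, index) pairs
        for i, v in enumerate(S):
            while ST and ST[-1][0] > v:   # values > S[i] can never be a parent
                ST.pop()
            PD[i] = i - ST[-1][1] if ST else 0
            ST.append((v, i))
        return PD

Each character is pushed and popped at most once, so the total running time is O(n).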

Furthermore, given the parent-distance representation of a string S, we can compute the parent-distance representation of any substring S[i..j] easily. To compute PD_{S[i..j]}[k], we only have to check whether the parent of the corresponding character is within S[i..j] or not (i.e., the parent is outside the substring if PD_S[i+k−1] ≥ k):

    PD_{S[i..j]}[k] = PD_S[i+k−1] if PD_S[i+k−1] < k, and 0 otherwise.    (1)

For example, the parent-distance representation of string S = (3, 1, 5, 2, 4) is PD_S = (0, 0, 1, 2, 1). For the substring S[3..5] = (5, 2, 4), we can use the above equation and compute the value at each position in constant time, getting PD_{S[3..5]} = (0, 0, 1).
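In code, Equation (1) is a constant-time lookup (a sketch, ours; window_pd is an illustrative name and i is 0-indexed):

    def window_pd(PD, i, k):
        """Parent distance of the k-th character (1-based) of the window S[i..j]."""
        d = PD[i + k - 1]
        return d if d < k else 0   # a parent outside the window becomes a root

    # With PD = parent_dist_rep([3, 1, 5, 2, 4]) = [0, 0, 1, 2, 1], the window
    # (5, 2, 4) gives [window_pd(PD, 2, k) for k in (1, 2, 3)] == [0, 0, 1].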

3.3 Failure function

We can define a failure function similar to the one used in the KMP algorithm [21].

(Failure function) The failure function of a string S[1..m] is an integer string F[1..m] such that:

    F[i] = max{ k : 0 ≤ k < i and C(S[i−k+1..i]) = C(S[1..k]) }.

That is, F[i] is the largest k < i such that the prefix of S[1..i] of length k and the suffix of S[1..i] of length k have the same Cartesian trees. For example, assuming that S = (3, 1, 5, 2, 4), the corresponding failure function is F = (0, 1, 1, 2, 3). We can see that F[5] = 3 from C(S[1..3]) = C(S[3..5]): both substrings have parent-distance representation (0, 0, 1). We will present an algorithm to compute the failure function of a given string in Section 3.5.

1:procedure CARTESIAN-TREE-MATCH(T[1..n], P[1..m])
2:     PD_P ← PARENT-DIST-REP(P)
3:     F ← FAILURE-FUNC(P)
4:     len ← 0
5:     DQ ← an empty deque
6:     for i ← 1 to n do
7:         Pop elements (value, index) from the back of DQ such that value > T[i]
8:         while len ≠ 0 do
9:              if PD_{T[i−len..i]}[len+1] = PD_P[len+1] then
10:                  break
11:              else
12:                  len ← F[len]
13:                  Delete elements (value, index) from the front of DQ such that index < i − len
14:         len ← len + 1
15:         push (T[i], i) into the back of DQ
16:         if len = m then
17:              print “Match occurred at i − m + 1”
18:              len ← F[len]
19:              Delete elements (value, index) from the front of DQ such that index ≤ i − len
Algorithm 2 Text search of Cartesian tree matching

3.4 Text search

As in the original KMP text search algorithm, we can use the failure function in order to achieve linear time text search: scan the text from left to right, and use the failure function every time we find a mismatch between the text and the pattern. We apply this idea to Cartesian tree matching.

In order to perform a text search using O(m) space, we compute the parent-distance representation of the text online as we read the text, so that we don't need to store the parent-distance representation of the whole text, which would cost O(n) space. Furthermore, among the text characters which are matched with the pattern, we only have to store the elements that form a non-decreasing subsequence, by using a deque (instead of the stack in Section 3.2) in order to delete elements in front. Using this idea, we can keep the size of the deque always smaller than or equal to m. Therefore, we can perform the text search using O(m) space. Algorithm 2 shows the text search algorithm of Cartesian tree matching. In line 9 we need to compute PD_{T[i−len..i]}[len+1]. If the deque is empty, then PD_{T[i−len..i]}[len+1] = 0. Otherwise, let (value, index) be the element at the back of the deque; then PD_{T[i−len..i]}[len+1] = i − index. This computation takes constant time. Just before line 14, we do not compare PD_{T[i−len..i]}[len+1] and PD_P[len+1] when len = 0, because a window of length one always matches. Therefore, we can safely perform line 14.
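The following Python sketch (ours) puts Section 3.2 and this section together; it transcribes Algorithm 2 with 0-indexed strings. Here parent_dist_rep is the Algorithm 1 sketch above, and failure_func is the Algorithm 3 transcription given in Section 3.5 below. The deque DQ only ever holds pairs from the current window, so the extra space is O(m).

    from collections import deque

    def cartesian_tree_match(T, P):
        m = len(P)
        PD_P = parent_dist_rep(P)
        F = failure_func(P)               # F[l-1] = failure value for prefix length l
        length = 0                        # P[1..length] matches T[i-length..i-1]
        DQ = deque()                      # (value, index) pairs, values non-decreasing
        matches = []
        for i, v in enumerate(T):
            while DQ and DQ[-1][0] > v:   # line 7: pop values that exceed T[i]
                DQ.pop()
            while length:
                # line 9: parent distance of T[i] within the current window
                pd = i - DQ[-1][1] if DQ else 0
                if pd == PD_P[length]:    # compare with PD_P[len+1] (1-based)
                    break
                length = F[length - 1]    # line 12: follow the failure function
                while DQ and DQ[0][1] < i - length:   # line 13: trim the front
                    DQ.popleft()
            length += 1                   # line 14: T[i] extends the match
            DQ.append((v, i))             # line 15
            if length == m:               # lines 16-19: report and fall back
                matches.append(i - m + 1)
                length = F[length - 1]
                while DQ and DQ[0][1] <= i - length:
                    DQ.popleft()
        return matches

For example, cartesian_tree_match([3, 1, 5, 2, 4, 6], [1, 3, 2]) returns [1] (0-indexed), since (1, 5, 2) is the only window whose Cartesian tree equals that of (1, 3, 2).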

3.5 Computing failure function

We compute the failure function in a way similar to the text search, as in the KMP algorithm. However, we can compute the parent-distance representation PD_P of the whole pattern in O(m) time before we compute the failure function. Hence we don't need a deque, and the computation is slightly simpler than the text search: the comparison in line 7 is done in constant time by Equation (1). Algorithm 3 shows the procedure to compute the failure function.

1:procedure FAILURE-FUNC(P[1..m])
2:     PD_P ← PARENT-DIST-REP(P)
3:     F[1] ← 0
4:     len ← 0
5:     for i ← 2 to m do
6:         while len ≠ 0 do
7:              if PD_{P[i−len..i]}[len+1] = PD_P[len+1] then
8:                  break
9:              else
10:                  len ← F[len]
11:         len ← len + 1
12:         F[i] ← len
Algorithm 3 Computing failure function in Cartesian tree matching
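A Python transcription of Algorithm 3 (ours, 0-indexed; window_pd is the Equation (1) helper from Section 3.2):

    def failure_func(P):
        m = len(P)
        PD = parent_dist_rep(P)
        F = [0] * m                # F[i] = failure value for the prefix P[0..i]
        length = 0
        for i in range(1, m):
            while length:
                # parent distance of P[i] within the window P[i-length..i]
                if window_pd(PD, i - length, length + 1) == PD[length]:
                    break
                length = F[length - 1]
            length += 1
            F[i] = length
        return F

    # failure_func([3, 1, 5, 2, 4]) == [0, 1, 1, 2, 3]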

3.6 Correctness and time complexity

Since our algorithm for Cartesian tree matching, including the text search and the computation of the failure function, follows the KMP algorithm, it is easy to see that our algorithm correctly finds all occurrences (in the sense of Cartesian tree matching) of the pattern in the text. Since our algorithm checks one character of the parent-distance representation in constant time, it takes O(n) time for the text search and O(m) time to compute the failure function, as in the KMP algorithm. Therefore, our algorithm requires O(n + m) time for Cartesian tree matching, using O(m) space.

3.7 Cartesian tree signature

There is an alternative representation of Cartesian trees, called the Cartesian tree signature [14]. The Cartesian tree signature of S[1..n] is an array s = (s_1, …, s_n) such that s_i equals the number of elements popped from the stack in the i-th iteration of Algorithm 1. Since the total number of pops is at most n − 1, the signature can be represented as a bit string of length less than 2n, which is a succinct representation of a Cartesian tree. For example, the Cartesian tree signature of string (3, 1, 5, 2, 4) is (0, 1, 0, 1, 0); encoding each s_i in unary followed by a delimiter bit gives the bit string 0100100.
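A short sketch (ours) that computes the signature by counting pops in Algorithm 1; the unary bit encoding used here (s_i ones followed by a zero, giving n plus at most n − 1 bits) is one natural choice, not necessarily the exact encoding of [14].

    def cartesian_signature(S):
        sig, ST = [], []
        for v in S:
            pops = 0
            while ST and ST[-1] > v:   # same pop rule as Algorithm 1
                ST.pop()
                pops += 1
            sig.append(pops)
            ST.append(v)
        bits = "".join("1" * p + "0" for p in sig)
        return sig, bits

    # cartesian_signature([3, 1, 5, 2, 4]) == ([0, 1, 0, 1, 0], "0100100")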

We can use this representation to perform Cartesian tree matching. While we compute the Cartesian tree signature, we store one more array D, which is defined as follows: if S[i] is never popped from the stack, D[i] = 0. Otherwise, let S[j] be the value that pops S[i] out of the stack, and D[i] = j − i. For string (3, 1, 5, 2, 4), we have D = (1, 0, 1, 0, 0).

Using array D, we can delete one character at the front of string S in constant time. In order to get the Cartesian tree signature and its corresponding D for S[2..n], we do the following: if D[1] ≠ 0, the character S[1] is popped in iteration 1 + D[1], so we decrease s_{1+D[1]} by one and erase s_1 from s; if D[1] = 0, we just erase s_1. After that, we delete D[1] from D. For example, if we want to delete one character at the front of (3, 1, 5, 2, 4), we decrease s_2 by one, and delete s_1 and D[1]. This results in s = (0, 0, 1, 0) and D = (0, 1, 0, 0), which are the correct Cartesian tree signature and its corresponding array for (1, 5, 2, 4). In this way, we can perform Algorithm 2 using the Cartesian tree signature. Computing the failure function can also be done in a similar way.

Note that the Cartesian tree signature can represent a Cartesian tree using less space than the parent-distance representation, but it needs the auxiliary array D to perform string matching, and D uses the same space as the parent-distance representation. For Cartesian tree matching, therefore, this approach uses more space than Algorithm 2.

4 Multiple Pattern Matching in O((n + m) log k) Time

In this section we extend Cartesian tree matching to the case of multiple patterns. Definition 4.1 gives the formal definition of multiple pattern matching.

(Multiple pattern Cartesian tree matching) Given a text T[1..n] and patterns P_1, P_2, …, P_k, where m = |P_1| + |P_2| + … + |P_k|, multiple pattern Cartesian tree matching is to find every position in the text which matches at least one pattern, i.e., which has the same Cartesian tree as that of at least one pattern.

We modify the Aho-Corasick algorithm [1], using the parent-distance representation defined in Section 3.1, to do multiple pattern matching in O((n + m) log k) time.

Figure 2: Aho-Corasick automaton built from the parent-distance representations of three example patterns

4.1 Constructing the Aho-Corasick automaton

Instead of using the patterns themselves in the Aho-Corasick automaton, we use their parent-distance representations to build the automaton. Each node in the automaton corresponds to a prefix of the parent-distance representation of some pattern. We maintain two integers idx(v) and len(v) for every node v, such that v corresponds to the parent-distance representation of the pattern prefix P_{idx(v)}[1..len(v)]. If more than one index is possible, we store the smallest one. Each node also has a state transition function g, which takes an integer as input and returns the next node, or reports that there is no such node. We can construct the trie and the state transition function of every node in O(m log k) time, assuming that we use a balanced binary search tree to implement the transition function of each node. Figure 2 shows an Aho-Corasick automaton for three patterns, where we use the parent-distance representations of the patterns to construct the automaton.

1:procedure MULTIPLE-FAILURE-FUNC(P_1, P_2, …, P_k)
2:     for i ← 1 to k do
3:         PD_{P_i} ← PARENT-DIST-REP(P_i)
4:     rt ← the root of a trie built with all PD_{P_i}
5:     for v ∈ breadth-first traversal of the trie, excluding rt do
6:         idx ← idx(v)
7:         len ← len(v)
8:         F(v) ← rt
9:         u ← the parent of v in the trie
10:         while u ≠ rt do
11:              u ← F(u)
12:              l ← len(u)
13:              x ← PD_{P_idx[len−l..len]}[l+1]
14:              if g(u, x) exists then
15:                  F(v) ← g(u, x)
16:                  break
Algorithm 4 Computing failure function in multiple pattern matching

The failure function of the Aho-Corasick automaton is defined as follows. Let v be a node in the automaton, and let S_v be the substring that node v represents in the trie. Let S′ be the longest proper suffix of S_v which matches (in the sense of Cartesian tree matching) a prefix of some pattern P_i. The failure function of v is defined as the node u that represents this prefix (i.e., F(v) = u). The dotted lines in Figure 2 show the failure function of each node. Note that the parent-distance representation of S′ may not be a suffix of the parent-distance representation of S_v: when the leading characters of S_v are removed, a character whose parent is cut off gets parent distance 0 by Equation (1).

Algorithm 4 computes the failure function of the trie. As in the original Aho-Corasick algorithm, we traverse the trie in breadth-first order (except the root) and compute the failure function. The main difference between Algorithm 4 and the Aho-Corasick algorithm is at line 13, where we decide the next character to match. According to the definition of the trie, v corresponds to the parent-distance representation of P_idx[1..len], and so the parent of v corresponds to the parent-distance representation of P_idx[1..len−1]. In the while loop from line 10 to 16, u corresponds to the parent-distance representation of some suffix of P_idx[1..len−1], because u is a node that can be reached from the parent of v by following the failure links. Since u corresponds to some string of length l, we can conclude that u represents P_idx[len−l..len−1]. We want to check whether P_idx[len−l..len] matches some node in the trie, so we should check whether u has a transition with x = PD_{P_idx[len−l..len]}[l+1]. If u has the transition g(u, x), it corresponds to P_idx[len−l..len], and we can conclude that F(v) = g(u, x). If u doesn't have such a transition, there is no node that represents P_idx[len−l..len], and thus we have to continue the loop.

For example, suppose that we compute the failure function of a node v in Figure 2. We start from u, the parent of v, which represents P_idx[1..len−1], and repeatedly replace u by F(u). For each candidate u of length l, we compute x = PD_{P_idx[len−l..len]}[l+1] by Equation (1) and check whether the transition g(u, x) exists. If it does not, we continue the while loop with the next failure link; if it does, we conclude that F(v) = g(u, x). Note that the character x may change during the while loop, because it depends on the length l of the current candidate u; this is not the case in the Aho-Corasick algorithm, where the mismatched character is fixed.

While computing the failure function, we can also compute the output function in the same way as the Aho-Corasick algorithm. The output function of node v is the set of patterns which match (in the sense of Cartesian tree matching) some suffix of the string that v represents. This function is used to output all possible matches at each node.
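The following Python sketch (ours) assembles Section 4.1: the trie over parent-distance representations, the failure links of Algorithm 4, and the output function. Plain dictionaries stand in for the balanced binary search trees (or hash tables) of the analysis, and parent_dist_rep is the Algorithm 1 sketch from Section 3.2.

    from collections import deque

    def build_ct_automaton(patterns):
        PDs = [parent_dist_rep(P) for P in patterns]
        g = [{}]           # g[v]: transitions (parent-distance value -> child)
        info = [(0, 0)]    # info[v] = (idx, len): v represents PD of P_idx[1..len]
        parent, F, out = [0], [0], [set()]
        for p, PD in enumerate(PDs):                  # build the trie
            v = 0
            for depth, c in enumerate(PD):
                if c not in g[v]:
                    g.append({}); info.append((p, depth + 1))
                    parent.append(v); F.append(0); out.append(set())
                    g[v][c] = len(g) - 1
                v = g[v][c]
            out[v].add(p)
        Q = deque(g[0].values())
        while Q:                                      # breadth-first traversal
            v = Q.popleft()
            Q.extend(g[v].values())
            p, ln = info[v]
            u = parent[v]
            while u != 0:                             # lines 10-16 of Algorithm 4
                u = F[u]
                l = info[u][1]
                # next character by Equation (1): PD of P_p[ln-l..ln] at position l+1
                x = PDs[p][ln - 1] if PDs[p][ln - 1] <= l else 0
                if x in g[u]:
                    F[v] = g[u][x]
                    break
                if u == 0:
                    break
            out[v] |= out[F[v]]    # inherit outputs along the failure link
        return g, F, out, info

Text search then walks this automaton over the parent-distance characters of the text, computed online as in Section 4.2.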

4.2 Multiple pattern matching

Using the automaton defined above, we can solve multiple pattern Cartesian tree matching in O((n + m) log k) time. The text search algorithm is essentially the same as that of the Aho-Corasick algorithm, following the trie and using the failure links in case of a mismatch. As in the single pattern case, we compute the parent-distance representation of the text online in the same way as Algorithm 2 (using a deque) to ensure O(m) space. The time complexity of our multiple pattern Cartesian tree matching is O((n + m) log k) using O(m) space, where the log k factor is due to the binary search tree in each node: since there can be at most k outgoing edges from each node, we can perform an operation on the binary search tree in O(log k) time. Combined with the time-complexity analysis of the Aho-Corasick algorithm, this shows that our algorithm has the time complexity O((n + m) log k). We can reduce the time complexity further, to O(n + m) randomized time, by using hashing instead of a binary search tree [12].

5 Cartesian Suffix Tree in O(n) Randomized Time

In this section we apply the notion of Cartesian tree matching to the suffix tree, as in the cases of parameterized matching and order-preserving matching [8, 13]. We first define the Cartesian suffix tree, and show that it can be built in O(n) randomized time, or in O(n log n) worst-case time, using the result of Cole and Hariharan [12].

5.1 Defining Cartesian suffix tree

The Cartesian suffix tree of a string is an index data structure with which we can find an occurrence of (the Cartesian tree of) a given pattern of length m in O(m) expected time if the children of each node are stored in hash tables, or in O(m log n) worst-case time if they are stored in balanced binary search trees, where n is the length of the text string. In order to store the information of Cartesian suffix trees efficiently, we again use the parent-distance representation from Section 3.1. Definition 5.1 gives the formal definition of the Cartesian suffix tree.

(Cartesian suffix tree) Given a string S[1..n], the Cartesian suffix tree of S is the compacted trie built with the strings PD_{S[i..n]} $ for every 1 ≤ i ≤ n (where the special character $ is concatenated to the end of each PD_{S[i..n]}) and the one-character string $.

Note that we append a special character $ to the end of each parent-distance representation to ensure that no string is a prefix of another string.

Figure 3: Cartesian suffix tree of an example string

Figure 3 shows an example Cartesian suffix tree. Each edge actually stores a suffix number i together with start and end positions l and r, instead of the parent-distance representation itself, so that the edge label is the substring PD_{S[i..n]}[l..r]. Since several suffixes may share the same parent-distance prefix, a node can correspond to more than one substring of S, and the edge going into it may store the suffix number and positions of any one of them.

5.2 Constructing Cartesian suffix tree

There are several algorithms that efficiently construct the suffix tree, such as McCreight's algorithm [24] and Ukkonen's algorithm [26]. However, the distinct right context property [16, 8] should hold in order to apply these algorithms, which means that the suffix link of every internal node should point to an explicit node. The Cartesian suffix tree does not have the distinct right context property: removing the first character of a suffix can change later values of its parent-distance representation (a character whose parent is cut off becomes a root by Equation (1)), so the suffix link of an internal node, such as the marked node in Figure 3, may lead to an implicit node in the middle of an edge.

In order to handle this issue, we use an algorithm due to Cole and Hariharan [12]. This algorithm can construct a compacted trie for a quasi-suffix collection, which satisfies the following properties:

  1. A quasi-suffix collection is a set of strings s_1, s_2, …, s_n, where the length of s_i is n − i + 1.

  2. For any two different strings s_i and s_j, s_i is not a prefix of s_j.

  3. For any i and j, if s_i and s_j have a common prefix of length l, then s_{i+1} and s_{j+1} have a common prefix of length at least l − 1.

The collection of strings in Definition 5.1 (the n parent-distance representations of the suffixes of S, each followed by $, plus the string $) is a quasi-suffix collection. The first two properties are trivial. Furthermore, if PD_{S[i..n]} and PD_{S[j..n]} have a common prefix of length l, i.e., PD_{S[i..i+l−1]} = PD_{S[j..j+l−1]}, we can show that PD_{S[i+1..i+l−1]} = PD_{S[j+1..j+l−1]} by Equation (1). Therefore, PD_{S[i+1..n]} and PD_{S[j+1..n]} have a common prefix of length l − 1 or more, showing that the third property holds.

One more thing we need in order to apply Cole and Hariharan's algorithm is a character oracle, which returns the j-th character of s_i in constant time. We can support such queries in constant time using Equation (1), once the parent-distance representation PD_S of the whole string has been computed.
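In code, the oracle is a one-liner on top of Equation (1) (a sketch, ours; DOLLAR is a stand-in for the special character, and i = n + 1 denotes the lone string $):

    DOLLAR = -1    # any value outside the range of parent distances

    def oracle(PD, n, i, j):
        """j-th character (1-based) of the i-th string PD_{S[i..n]} $ ."""
        if j == n - i + 2:            # one past the suffix: the appended $
            return DOLLAR
        d = PD[i + j - 2]             # PD_S[i+j-1] in the paper's 1-based notation
        return d if d < j else 0      # Equation (1)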

Since we have all the properties needed to apply Cole and Hariharan's algorithm, we can construct a Cartesian suffix tree in O(n) randomized time using O(n) space [12]. In the worst case, it can be built in O(n log n) time by using binary search trees instead of hash tables to store the children of each node in the suffix tree, because the alphabet size of parent-distance representations is O(n). We can also modify our algorithm to construct a Cartesian suffix tree online, using the ideas in [23, 25].

6 Conclusion

We have defined Cartesian tree matching and the parent-distance representation of a Cartesian tree. We developed a linear time algorithm for single pattern matching, and an O((n + m) log k) deterministic time or O(n + m) randomized time algorithm for multiple pattern matching. Finally, we defined an index data structure called the Cartesian suffix tree, and showed that it can be constructed in O(n) randomized time. We believe that the notion of Cartesian tree matching, which is a new metric on string matching and indexing over numeric strings, can be used in many applications.

There have been many works on approximate generalized matching. For example, there are results on approximate order-preserving matching [11], approximate jumbled matching [10], approximate swapped matching [5], and approximate parameterized matching [6, 18]. There are also results on computing the period of a generalized string, such as computing the period in the order-preserving model [17]. Since Cartesian tree matching is introduced for the first time in this paper, many problems, including approximate matching and computing the period in the Cartesian tree matching model, are future research topics.

Acknowledgments

S.G. Park and K. Park were supported by the Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (No. 2018-0-00551, Framework of Practical Algorithms for NP-hard Graph Problems). A. Amir and G.M. Landau were partially supported by the Israel Science Foundation grant 571/14, and Grant No. 2014028 from the United States-Israel Binational Science Foundation (BSF).

References