KV-match: An Efficient Subsequence Matching Approach for Large Scale Time Series

10/02/2017
by   Jiaye Wu, et al.
Tsinghua University
FUDAN University

Time series data have exploded due to the popularity of new applications, such as data center management and IoT. Time series database management systems (TSDBs) have emerged to store and query large volumes of time series data. Subsequence matching is critical in many time series mining algorithms, and extensive approaches have been proposed. However, the shift to distributed storage systems and the performance gap make these approaches incompatible with TSDBs. To fill this gap, we propose a new index structure, KV-index, and the corresponding matching algorithm, KV-match. KV-index is a file-based structure which can be easily implemented on local files, HDFS or HBase tables. The KV-match algorithm probes the index efficiently with a few sequential scans. Moreover, two optimization techniques, window reduction and window reordering, are proposed to further accelerate the processing. To support queries of arbitrary length, we extend KV-match to KV-match_DP, which utilizes multiple indexes of varied window lengths to process the query simultaneously. A two-dimensional dynamic programming algorithm is proposed to find the optimal query segmentation. We implement our approach on both local files and HBase tables, and conduct extensive experiments on synthetic and real-world datasets. Results show that our index is of comparable size to the popular tree-style index, while our query processing is orders of magnitude more efficient.


I Introduction

Time series data are pervasive across almost all human endeavors, including medicine, finance and science. Consequently, there is enormous interest in querying and mining time series data [1, 2].

The subsequence matching problem is a core subroutine of many time series mining algorithms. Specifically, given a long time series X, for any query series Q and a distance threshold ε, the subsequence matching problem finds all subsequences of X whose distance to Q falls within the threshold ε.

FRM [3] is the pioneering work on subsequence matching. Many approaches have been proposed since, either to improve efficiency [4, 5] or to deal with various distance functions [6, 7], such as Euclidean distance and Dynamic Time Warping. However, all these approaches only consider the raw subsequence matching problem (RSM for short). In recent years, researchers have realized the importance of subsequence normalization [8]: it is more meaningful to compare z-normalized subsequences than raw ones. UCR Suite [8] is the state-of-the-art approach for the normalized subsequence matching problem (NSM for short).

The NSM approach suffers from two drawbacks. First, it needs to scan the full time series X, which is prohibitively expensive for long time series. For example, on a long time series, UCR Suite needs more than 100 seconds to process a query of length 1,000. The authors of [8] analyze why it is impossible to build an index for the NSM problem. Second, an NSM query may output results that do not satisfy the user's intent. The reason is that NSM fully ignores offset shifting and amplitude scaling. However, in real-world applications, the extent of offset shifting and amplitude scaling may represent a specific physical mechanism or state. Users often want to find only subsequences in a state similar to the query's. We illustrate this with an example.

Fig. 1: Illustrative example of cNSM. (a) PAMAP time series. (b) Aligned normalized subsequences. (c) Information of Q: offset 877, length 17,124. (d) Results of NSM, as (offset, distance) pairs: (252,492, 117.78), (97,458, 130.80), (34,562, 138.12), (161,416, 149.37), (134,456, 164.88), (296,063, 166.74).

Example 1. The time series in Fig. 1(a) comes from the Physical Activity Monitoring for Aging People (PAMAP) dataset [1], collected from the z-accelerometer at the hand position. The monitored person conducts various activities alternately, like sitting, standing and running. Each activity lasts for about 3 minutes, and the data collection frequency is 100 Hz. We use one subsequence corresponding to the lying activity as the query (Q in Fig. 1(c)) to find other "lying" subsequences. We issue an NSM query, and Fig. 1(d) lists the top results. Unfortunately, all top-4 results correspond to other activities: some correspond to the sitting activity, while others correspond to the breaking activity. Although there are desired results corresponding to the lying activity, they are ranked out of the top 20. We show the aligned normalized subsequences in Fig. 1(b); it is difficult to distinguish them after normalization.

By observing Fig. 1(a), one can filter out the undesired results easily by adding an additional constraint: the output subsequences should have a mean value similar to that of Q. In fact, this new type of NSM query, NSM plus some constraints, is useful in many applications. We list two of them as follows:

  • (Industry application) In the wind power generation field, a LIDAR system can provide preview information of wind disturbances [9]. The Extreme Operating Gust (EOG) is a typical gust pattern, a phenomenon of dramatic change of wind speed in a short period. Fig. 2 shows a typical EOG pattern. This pattern is important because it may damage the turbine. All EOG pattern occurrences have a similar shape, and their fluctuation degree falls within a certain range, because the wind speed cannot be arbitrarily high. If we hope to find all EOG pattern occurrences in the historical data, we can use a typical EOG pattern as the query, plus a constraint on the range of the values.

  • (IoT application) When a container truck goes over a bridge, the strain meter embedded in the bridge will demonstrate a specific fluctuation pattern. The value range of the pattern depends on the weight of the truck. If we have one occurrence of the pattern as a query, we can additionally set a mean value range as the constraint to search for container trucks whose weight falls within a certain range.

Note that the above applications cannot be handled by an RSM query, because the existing offset shifting and amplitude scaling force us to set a very large distance threshold, which will cause many false positive results.

Furthermore, to verify the universality of this new query type, we investigate the motif pairs in some popular real-world time series benchmarks. Motif mining [2] is an important time series mining task, which finds a pair (or set) of subsequences with minimal normalized distance. For each motif subsequence pair, we show the relative mean value difference and the ratio of standard deviations in Fig. 3. We can see that although these pairs are found without any constraint (like an NSM query), both the mean value and the standard deviation of motif subsequences are very similar. So we can find these pairs by a cNSM query, an NSM query plus a small constraint.

Fig. 2: EOG pattern
Fig. 3: Motif example

In this paper, we formally define a new subsequence matching problem, called the constrained normalized subsequence matching problem (cNSM for short). Two constraints, one on the mean value and the other on the standard deviation, are added to the traditional NSM problem. One exemplar cNSM query looks like "given a query Q with mean value μ_Q and standard deviation σ_Q, and thresholds ε, α and β, return all subsequences S which satisfy: (1) 1/α ≤ σ_S/σ_Q ≤ α; (2) |μ_S − μ_Q| ≤ β; (3) ED(Ŝ, Q̂) ≤ ε". With the constraints, the cNSM problem provides a knob to flexibly control the degree of offset shifting (represented by the mean value) and amplitude scaling (represented by the standard deviation). Moreover, the cNSM problem offers us the opportunity to build an index for normalized subsequence matching.

Challenges. Solving the cNSM problem faces the following challenges. First, how can we process the cNSM query efficiently? A straightforward approach is to first apply UCR Suite to find unconstrained results, and then use mean value and standard deviation constraints to prune the unqualified ones. However, it still needs to scan the full series. Can we build an index and process the query more efficiently?

Second, users often conduct similar-subsequence search in an exploratory and interactive fashion. They may try different distance functions, like Euclidean distance or Dynamic Time Warping, and may issue RSM and cNSM queries simultaneously. Can we build a single index to support all these query types?

Contributions. Besides proposing the cNSM problem, we also have the following contributions.

  • We present the filtering conditions for four query types, RSM-ED, RSM-DTW, cNSM-ED and cNSM-DTW, and prove the correctness. The conditions enable us to build index and meanwhile guarantee no false dismissals.

  • We propose a new index structure, KV-index, and the query processing approach, KV-match, to support all these query types. The biggest advantage is that we can process various types of queries efficiently with a single index. Moreover, KV-match only needs a few sequential scans of the index, instead of the many random accesses of tree nodes in a traditional R-tree index, which makes it much more efficient.

  • To support queries of arbitrary length efficiently, we extend KV-match to KV-match_DP, which utilizes multiple indexes with different window lengths. We conduct extensive experiments, and the results verify the efficiency and effectiveness of our approach.

The rest of the paper is organized as follows. We present the preliminary knowledge and problem statements in Section II. In Section III we introduce the theoretical foundation and motivate the approach. Sections IV and V describe our index structure, index building algorithm and query processing algorithm. Section VI extends our method to use multi-level indexes with different window lengths. Our implementation details are described in Section VII. The experimental results are presented in Section VIII and we discuss related works in Section IX. Finally, we conclude the paper and look into the future work in Section X.

II Preliminary Knowledge

In this section, we introduce the definition of time series and other useful notations.

II-A Definitions and Problem Statement

Notation | Description
X | a time series
S | a length-m subsequence of X starting at offset i
X̂ | the normalized series of time series X
W_j | the j-th length-w disjoint window of Q
μ_j | the mean value of the j-th disjoint window
σ_j | the standard deviation of the j-th disjoint window
WI | a window interval containing continuous window positions
R_j | a set of window intervals satisfying the criterion for W_j
CS_j, CS | a set of candidates for W_j, and for all windows
n_r, n_p | the number of window intervals and window positions
TABLE I: Frequently used notations

A time series is a sequence of ordered values, denoted as X = (x_1, x_2, …, x_n), where n is the length of X. A length-m subsequence of X is a shorter time series, denoted as S = (x_i, x_{i+1}, …, x_{i+m−1}), where 1 ≤ i ≤ n − m + 1.

For any subsequence S, μ_S and σ_S are the mean value and standard deviation of S respectively. Thus the normalized series of S, denoted as Ŝ, is Ŝ = ((s_1 − μ_S)/σ_S, (s_2 − μ_S)/σ_S, …, (s_m − μ_S)/σ_S).
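For instance, z-normalization as defined above can be sketched in a few lines (a minimal helper of our own, using the population standard deviation; the handling of constant subsequences is our own convention, not specified in the paper):

```python
import math

def z_normalize(s):
    """Return the z-normalized copy of sequence s: subtract the mean,
    then divide by the (population) standard deviation."""
    n = len(s)
    mu = sum(s) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in s) / n)
    if sigma == 0:           # constant subsequence: normalization undefined
        return [0.0] * n     # our own convention for this sketch
    return [(x - mu) / sigma for x in s]
```

The normalized series always has mean 0 and standard deviation 1, which is what makes offset and amplitude information disappear in NSM.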

Our work supports two common distance measures, Euclidean distance and Dynamic Time Warping, defined as follows.

Euclidean Distance (ED): Given two length-m sequences A and B, their distance is ED(A, B) = sqrt(Σ_{i=1}^{m} (a_i − b_i)^2).

Dynamic Time Warping (DTW): Given two length-m sequences A and B, their distance is DTW(A, B) = sqrt(D(A, B)), where

D(A, B) = 0, if A = B = ∅; ∞, if exactly one of A, B is ∅; (a_1 − b_1)^2 + min{ D(suf(A), suf(B)), D(suf(A), B), D(A, suf(B)) }, otherwise,   (1)

where ∅ represents the empty series and suf(A) = (a_2, …, a_m) is the suffix subsequence of A.

In DTW, the warping path is a sequence of matrix elements representing the optimal alignment of the two series: element (i, j) on the path represents that a_i is aligned to b_j. To reduce the computation complexity, we use the Sakoe-Chiba band [10] to restrict the width of warping, denoted as ρ: any pair (i, j) on the path should satisfy |i − j| ≤ ρ. When ρ = 0, DTW degenerates into ED.
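To make the definition concrete, here is a small dynamic-programming sketch of banded DTW (our own helper; squared point costs with a final square root, matching the ED definition above):

```python
import math

def dtw(a, b, rho):
    """DTW distance between equal-length sequences a and b under a
    Sakoe-Chiba band of half-width rho (path cells satisfy |i - j| <= rho)."""
    m = len(a)
    INF = float("inf")
    # d[i][j] = accumulated squared cost of aligning a[:i] with b[:j]
    d = [[INF] * (m + 1) for _ in range(m + 1)]
    d[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(max(1, i - rho), min(m, i + rho) + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            d[i][j] = cost + min(d[i - 1][j - 1], d[i - 1][j], d[i][j - 1])
    return math.sqrt(d[m][m])
```

With `rho = 0` only diagonal cells are reachable, so the result coincides with ED, as stated above; widening the band can only lower (or preserve) the distance.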

We aim to support subsequence matching for both the raw subsequence and the normalized subsequence simultaneously. The problem statements are given here.

Raw Subsequence Matching (RSM): Given a long time series X, a query sequence Q of length m (m ≪ n) and a distance threshold ε, find all subsequences S of length m from X which satisfy D(S, Q) ≤ ε. In this case, we say that S and Q are in ε-match.

Normalized Subsequence Matching (NSM): Given a long time series X, a query sequence Q and a distance threshold ε, find all subsequences S of length m from X which satisfy D(Ŝ, Q̂) ≤ ε, where Ŝ and Q̂ are the normalized series of S and Q respectively.

The cNSM problem adds two constraints to the NSM problem. Thresholds α (α ≥ 1) and β (β ≥ 0) are introduced to constrain the degree of amplitude scaling and offset shifting respectively.

Constrained Normalized Subsequence Matching (cNSM): Given a long time series X, a query sequence Q, a distance threshold ε, and the constraint thresholds α and β, find all subsequences S of length m from X which satisfy

1/α ≤ σ_S/σ_Q ≤ α,  |μ_S − μ_Q| ≤ β,  D(Ŝ, Q̂) ≤ ε.

The larger α and β, the looser the constraint. In this case, we say that S and Q are in (ε, α, β)-match.

The distance function D is either ED or DTW. In this paper, we build an index to support four types of queries simultaneously: RSM-ED, RSM-DTW, cNSM-ED and cNSM-DTW.
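As an illustration, the three cNSM conditions for the ED case can be checked brute-force on a single candidate (our own predicate, assuming α ≥ 1 bounds the scaling ratio and β bounds the mean difference, as above); the index exists precisely to avoid evaluating this for most subsequences:

```python
import math

def cnsm_ed_match(s, q, eps, alpha, beta):
    """Check whether subsequence s and query q are in (eps, alpha, beta)-match
    under ED: scaling/shifting constraints plus normalized distance."""
    def stats(x):
        mu = sum(x) / len(x)
        sigma = math.sqrt(sum((v - mu) ** 2 for v in x) / len(x))
        return mu, sigma

    mu_s, sig_s = stats(s)
    mu_q, sig_q = stats(q)
    if not (1 / alpha <= sig_s / sig_q <= alpha):  # amplitude-scaling constraint
        return False
    if abs(mu_s - mu_q) > beta:                    # offset-shifting constraint
        return False
    ns = [(v - mu_s) / sig_s for v in s]
    nq = [(v - mu_q) / sig_q for v in q]
    dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(ns, nq)))
    return dist <= eps                             # normalized-distance test
```

A shifted copy of the query passes as long as the shift stays within β, while NSM alone would accept arbitrarily large shifts.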


III Theoretical Foundation and Approach Motivation

In this section, we establish the theoretical foundation of our approach. We propose a condition to filter the unqualified subsequences. For all four types of queries, the conditions share the same format, which enables us to support all query types with a single index.

Specifically, for the query Q and a subsequence S, both of length m, we segment them into aligned disjoint windows of the same length w. The j-th window of Q (or S) is denoted as W_j^Q (or W_j^S), 1 ≤ j ≤ p, where p = ⌊m/w⌋; that is, W_j^Q = (q_{(j−1)·w+1}, …, q_{j·w}).

For each window, we hope to find one or more features, based on which we can construct the filtering condition. In this work, we choose to utilize one single feature, the mean value of the window. The advantages are twofold. First, with a single feature, we can build a one-dimensional index, which greatly improves the efficiency of index retrieval. Second, the mean value allows us to design conditions for both RSM and cNSM queries.

We denote the mean values of W_j^Q and W_j^S as μ_j^Q and μ_j^S. The condition consists of p ranges. The j-th one is denoted as [LR_j, UR_j] (1 ≤ j ≤ p). If S is a qualified subsequence, for any j, μ_j^S must fall within [LR_j, UR_j]. If any μ_j^S is outside its range, we can filter S out safely.

III-A RSM-ED Query Processing

In this section, we first present the condition for the simplest case, RSM-ED query, and then illustrate our approach.

Lemma 1.

If S and Q are in ε-match under the ED measure, that is, ED(S, Q) ≤ ε, then each μ_j^S must satisfy

μ_j^Q − ε/√w ≤ μ_j^S ≤ μ_j^Q + ε/√w,  1 ≤ j ≤ p.   (2)
Proof.

Based on the ED definition, we have

ED^2(S, Q) ≥ Σ_{j=1}^{p} ED^2(W_j^S, W_j^Q),

where the inequality holds because the p disjoint windows cover at most all of S and Q. According to the corollary in [11],

ED^2(W_j^S, W_j^Q) ≥ w · (μ_j^S − μ_j^Q)^2.

If ED(S, Q) ≤ ε, after inequality transformation it should hold that w · (μ_j^S − μ_j^Q)^2 ≤ ε^2 for every j, so we get Eq. (2). ∎

Fig. 4: Illustrative example

Now we illustrate our approach with the example in Fig. 4. X is a long time series, and Q is the query sequence of length m. The goal is to find all length-m subsequences from X which satisfy ED(S, Q) ≤ ε. The window length parameter w is set to 50. We split Q into three disjoint windows of length 50, W_1^Q, W_2^Q and W_3^Q (we can ignore the remaining part without sacrificing correctness, since Lemma 1 is a necessary condition for RSM). According to Lemma 1, for any qualified subsequence S, the mean value of the j-th disjoint window must fall within the range [μ_j^Q − ε/√w, μ_j^Q + ε/√w] (1 ≤ j ≤ 3). To facilitate finding the windows satisfying this condition, we build the index as follows. We compute the mean values of all sliding windows of X, and build a sorted list of (mean value, position) entries. With this structure, we find the candidates in two steps. First, for each window W_j^Q, we obtain all sliding windows whose mean values fall within its range by a single sequential scan operation. We denote the found windows for W_j^Q as R_j. Then, we generate the final candidates by intersecting the aligned windows in R_1, R_2 and R_3.

In Fig. 4, the sliding windows in R_1, R_2 and R_3 are marked with "triangle", "cross" and "circle" respectively. Only one candidate remains: the subsequence whose first, second and third disjoint windows fall in R_1, R_2 and R_3 respectively.
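The two-step lookup above can be sketched as follows. This toy version keeps sliding-window means in a plain list instead of a sorted structure, and assumes the Lemma 1 range has the form μ_j^Q ± ε/√w as reconstructed above; all names are ours:

```python
def sliding_means(x, w):
    """Mean of every length-w sliding window of x, via a rolling sum (O(n))."""
    s = sum(x[:w])
    means = [s / w]
    for i in range(1, len(x) - w + 1):
        s += x[i + w - 1] - x[i - 1]
        means.append(s / w)
    return means

def probe(x, q, eps, w):
    """Candidate offsets whose every disjoint window mean lies within
    mu_j^Q +/- eps/sqrt(w), per Lemma 1 (necessary condition only)."""
    p = len(q) // w
    means = sliding_means(x, w)
    r = eps / w ** 0.5
    cands = None
    for j in range(p):
        mu_q = sum(q[j * w:(j + 1) * w]) / w
        # offsets i such that the window starting at i + j*w qualifies
        ok = {i - j * w for i, m in enumerate(means)
              if abs(m - mu_q) <= r and i - j * w >= 0}
        cands = ok if cands is None else cands & ok
    return sorted(i for i in cands if i + p * w <= len(x))
```

The intersection across the p windows corresponds to aligning the "triangle", "cross" and "circle" positions in Fig. 4; surviving offsets still require verification against the actual distance.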

III-B Range for cNSM-ED Query

We solve the cNSM problem based on KV-index as well. For a given query Q, we determine whether a subsequence S is in (ε, α, β)-match with Q by checking the raw subsequence directly. Specifically, we achieve this goal by designing the range of μ_j^S for each query window W_j^Q. For any subsequence S, if any μ_j^S falls outside this range, S cannot be in (ε, α, β)-match with Q and we can filter S out safely. We illustrate the idea with a small example in which β is set to 0 for simplicity: by simple calculation we obtain the legal range of μ_1^S, and any length-4 subsequence S whose μ_1^S falls outside it can be pruned without checking the cNSM condition itself.

Now we formally give the range for the cNSM-ED query. Let μ_S and μ_Q be the global mean values of S and Q, σ_S and σ_Q be their standard deviations, and Ŝ and Q̂ be the normalized S and Q respectively.

Lemma 2.

If S and Q are in (ε, α, β)-match under the ED measure, that is, ED(Ŝ, Q̂) ≤ ε, then μ_j^S satisfies

LR_j ≤ μ_j^S ≤ UR_j,   (3)

where the bounds LR_j and UR_j are derived in the proof below.

Proof.

Based on the normalized ED definition, we have

Let and , where and . If , it holds that

According to the corollary in [11], similar to Lemma 1, for the j-th window we have

By simple transformation, for any specific pair of , we can get a range of as follows,

For ease of description, we assign and .

The final range should be

As illustrated in Fig. 5, the rectangle represents the whole legal range of the two quantities involved. Apparently, both bounds increase monotonically over it. As for the lower bound, we have two cases,

  • In the first case, the bound increases monotonically and attains its minimum at the corner point shown in Fig. 5;

  • In the second case, the bound decreases monotonically and attains its minimum at the opposite corner point shown in Fig. 5.

So

Note that the formula means the bound is attained at one of the two corner points.

Similarly, we can infer the maximal value of μ_j^S in the following two cases,

  • In the first case, the maximum is attained at the corner point shown in Fig. 5.

  • In the second case, the maximum is attained at the opposite corner point shown in Fig. 5.

So

Fig. 5: Legal Range of
Fig. 6: Index Structure

III-C Range for RSM-DTW and cNSM-DTW Query

Before introducing the ranges, we first review the query envelope and the lower bound of DTW distance, LB_PAA [12]. To deal with the DTW measure, given a length-m query Q, the query envelope consists of two length-m series, L and U, as the lower and upper envelope respectively. The i-th elements of L and U, denoted as l_i and u_i, are defined as

l_i = min_{|i−j| ≤ ρ} q_j,  u_i = max_{|i−j| ≤ ρ} q_j.

LB_PAA is defined based on the query envelope. L and U are split into p length-w disjoint windows, W_j^L and W_j^U (1 ≤ j ≤ p), whose mean values are denoted as μ_j^L and μ_j^U respectively. For any length-m subsequence S, LB_PAA is as follows,

LB_PAA(S, Q) = sqrt( Σ_{j=1}^{p} w · c_j ),  where c_j = (μ_j^S − μ_j^U)^2 if μ_j^S > μ_j^U; (μ_j^S − μ_j^L)^2 if μ_j^S < μ_j^L; 0 otherwise,   (4)

which satisfies LB_PAA(S, Q) ≤ DTW(S, Q) [12].
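The envelope and LB_PAA above can be sketched as follows (a standard LB_Keogh-style construction; the function names and the clipping of the band at the series boundaries are our own):

```python
import math

def envelope(q, rho):
    """Lower/upper DTW envelope of q under a Sakoe-Chiba band of width rho:
    l_i = min(q[i-rho .. i+rho]), u_i = max(q[i-rho .. i+rho])."""
    m = len(q)
    lo = [min(q[max(0, i - rho):i + rho + 1]) for i in range(m)]
    up = [max(q[max(0, i - rho):i + rho + 1]) for i in range(m)]
    return lo, up

def lb_paa(s, q, rho, w):
    """PAA-style lower bound of DTW(s, q) built from the window means of
    the envelope, following Eq. (4) above."""
    lo, up = envelope(q, rho)
    p = len(q) // w
    total = 0.0
    for j in range(p):
        seg = slice(j * w, (j + 1) * w)
        mu_s = sum(s[seg]) / w
        mu_l = sum(lo[seg]) / w
        mu_u = sum(up[seg]) / w
        if mu_s > mu_u:            # above the upper envelope mean
            total += w * (mu_s - mu_u) ** 2
        elif mu_s < mu_l:          # below the lower envelope mean
            total += w * (mu_s - mu_l) ** 2
    return math.sqrt(total)
```

Because only window means of S appear in Eq. (4), this bound is exactly what a mean-value index can evaluate without touching the raw data.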

Now we give the ranges for RSM and cNSM under the DTW measure in turn.

Lemma 3.

If S and Q are in ε-match under the DTW measure, that is, DTW(S, Q) ≤ ε, then μ_j^S satisfies

μ_j^L − ε/√w ≤ μ_j^S ≤ μ_j^U + ε/√w.   (5)
Proof.

See Appendix A.

Lemma 4.

If S and Q are in (ε, α, β)-match under the DTW measure, that is, DTW(Ŝ, Q̂) ≤ ε, then μ_j^S satisfies

LR_j ≤ μ_j^S ≤ UR_j,   (6)

where the bounds LR_j and UR_j for the DTW case are derived in Appendix B.

Proof.

See Appendix B.

Analysis. We provide the ranges of mean value for all four query types, which means that we can support all of them with a single index. When processing different query types, the only difference is to use different ranges of μ_j^S. This property is beneficial for exploratory search tasks.

IV KV-index

In this section, we present our index structure KV-index, and the index building algorithm.

IV-A Index Structure

The index structure in Fig. 4 has approximately as many entries as there are points in X, which causes a huge space cost. To avoid that, we propose a more compact index structure which utilizes the data locality property, that is, the values of adjacent time points tend to be close. In consequence, the mean values of adjacent sliding windows will be similar too.

Logically, KV-index consists of ordered rows of key-value pairs. The key of the i-th row, denoted as K_i, is a range of mean values of sliding windows, that is, K_i = [K_i.l, K_i.r), where K_i.l and K_i.r are the left and right endpoints of the mean value range respectively. It is a left-closed-right-open range, and the ranges of adjacent rows are disjoint.

The corresponding value, denoted as V_i, is the set of sliding windows whose mean values fall within K_i. To facilitate the expression, we represent each window by its position, that is, the sliding window starting at position t is represented simply as t. To further save space and facilitate the matching algorithm, we organize the window positions in V_i as follows. The positions are sorted in ascending order, and consecutive ones are merged into a window interval, denoted as WI. So V_i consists of one or more sorted and non-overlapping window intervals.

Definition 1 (Window Interval).

We combine the l-th to r-th length-w sliding windows of X into a window interval WI = [l, r], which contains the set of sliding windows starting at positions t with l ≤ t ≤ r.

In the following descriptions, we use t ∈ WI to denote a window position t belonging to the window interval WI, that is, WI.l ≤ t ≤ WI.r. Moreover, we use WI.l, WI.r and WI.s to denote the left boundary, the right boundary and the size of interval WI respectively. The overall number of window intervals in V_i is denoted as n_r(V_i), and the number of window positions as n_p(V_i). Formally, we have

n_r(V_i) = |V_i|,   (7)
n_p(V_i) = Σ_{WI ∈ V_i} (WI.r − WI.l + 1) = Σ_{WI ∈ V_i} WI.s.   (8)

Fig. 6 shows the KV-index for Fig. 4. The first row indicates that there exist three sliding windows whose mean values fall within the first range. In the second row, three windows are organized into two intervals: one containing two consecutive positions, and one special interval containing only a single window position.
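The value layout above, merging a sorted run of consecutive positions into window intervals, can be sketched as:

```python
def to_intervals(positions):
    """Merge a sorted list of window positions into non-overlapping
    window intervals [l, r] of consecutive positions."""
    intervals = []
    for pos in positions:
        if intervals and pos == intervals[-1][1] + 1:
            intervals[-1][1] = pos        # extend the current interval
        else:
            intervals.append([pos, pos])  # start a new single-position interval
    return [tuple(iv) for iv in intervals]
```

A position run of length k thus costs two integers instead of k, which is where the locality property pays off.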

To facilitate query processing, KV-index also contains a meta table, in which each entry is a quadruple recording the key range of a row together with the offset and length of that row in the index file. Due to its small size, we can load the meta table into memory before processing the query. With the meta table, we can quickly determine the offset and the length of a scan operation by a simple binary search.

Physically, KV-index can be implemented as a local file, an HDFS file or an HBase table because of its simple format. In this work, we implement two versions, a local file version and an HBase table version. In general, if a file system or a database supports the "scan" operation with start-key and end-key parameters, it can support KV-index. We provide details about the index implementation in Section VII.

IV-B Index Building Algorithm

We build the index with two steps. First, we build an index in which all rows use the equal-width range of the mean values. Second, because data distribution is not balanced among rows, we merge adjacent rows to optimize the index. We first introduce a basic in-memory algorithm, which works for moderate data size. Then we discuss how to extend it to very large data scale.

In the first step, we pre-define a parameter d, which represents the range width of the mean values. The range of the i-th row will be [i·d, (i+1)·d), where i is an integer. We read the series X sequentially. A circular array is used to maintain the current length-w sliding window, and its mean value is computed on the fly. If the mean value of the previous window falls in some range K, and the mean value of the current window falls in K as well, we extend the current WI by advancing its right boundary by one. Otherwise, a new single-position interval will be added into the row determined by the current mean value.

The equal-width ranges can cause a zigzag pattern across adjacent rows: two neighboring rows may each hold many short intervals whose positions interleave. Apparently, a better way is to merge such rows so that neighboring intervals combine into longer ones.

In the second step, we merge adjacent rows with a greedy algorithm. We check the rows beginning from the first pair. Let the current rows be the i-th and the (i+1)-th. The merging condition is whether the number of intervals after merging, relative to before, is smaller than a pre-defined threshold. The rationale is that we merge rows in which a large number of intervals are neighboring. If the two rows are merged, the new key is the union of their key ranges, and the new value is the union of their interval lists. Moreover, all neighboring window intervals from the two rows are merged into one interval.

The merge operation is actually a union of two ordered interval sequences, which can be implemented efficiently, similar to the merge step of merge-sort. Since each window interval is examined exactly once, its time complexity is linear in the number of intervals.
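The union of two rows' ordered interval lists can be sketched in the merge-sort style just described (our own helper; intervals are (l, r) tuples, and intervals that touch or overlap are coalesced):

```python
def union_rows(v1, v2):
    """Union of two ordered lists of disjoint window intervals, coalescing
    intervals that touch or overlap. One linear pass over both lists."""
    merged = []
    i = j = 0
    while i < len(v1) or j < len(v2):
        # pick the remaining interval with the smaller left boundary
        if j >= len(v2) or (i < len(v1) and v1[i][0] <= v2[j][0]):
            cur = v1[i]; i += 1
        else:
            cur = v2[j]; j += 1
        if merged and cur[0] <= merged[-1][1] + 1:
            # touches or overlaps the last emitted interval: coalesce
            merged[-1] = (merged[-1][0], max(merged[-1][1], cur[1]))
        else:
            merged.append(cur)
    return merged
```

When two rows hold interleaving short intervals, the output collapses them into a few long ones, which is exactly the saving the greedy merging step is after.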

If the size of the index exceeds the memory capacity, we build the index as follows. In the first step, we divide the time series into segments, and build the fixed-width-range index for each segment in turn. After all segments are processed, we merge the rows of different segments. The second step visits index rows sequentially, so it can also be divided into sub-tasks. Since each step can be divided into sub-tasks, the whole index building algorithm can be easily adapted to a distributed environment, like MapReduce.

Complexity analysis. The process of building KV-index consists of two steps, generating rows with the fixed width, and merging them into varied-width ones. The first step scans all data in a streaming fashion, computes the mean values, and inserts entries into a hash table. Note that the mean value of each sliding window can be computed from that of the previous window in O(1) time, so the cost of the first step is O(n). In the second step, we examine adjacent rows and merge them if necessary. Since the intervals are ordered within each row, the merge operation is similar to merge-sort, and its total cost is linear in the overall number of intervals, which is at most O(n). In summary, the complexity of building the index is O(n).

All previous index-based approaches, like FRM and General Match, are based on the R-tree, whose building cost is super-linear in the number of indexed windows [13]. Moreover, they use the DFT to transform each length-w window of X, which adds a further transformation cost on top of the scan. Therefore, building KV-index is more efficient.

V KV-match

In this section, we present the matching algorithm KV-match, whose pseudo-code is shown in Algorithm 1.

V-A Overview

Initially, given query Q, we segment it into p disjoint windows of length w (p = ⌊m/w⌋), and compute their mean values μ_j^Q (Line 1). We assume that m is an integral multiple of w; if not, we keep the longest prefix which is a multiple of w. According to the analysis in Section III, the rest can be ignored safely.

The main matching process consists of two phases:

  1. Index-probing (Lines 2-12): For each window W_j^Q, we fetch a list of consecutive rows from KV-index according to the lemmas in Section III. Based on these rows, we generate a set of subsequence candidates, denoted as CS.

  2. Post-processing (Lines 13-18): All subsequences in CS are verified by fetching the data and computing the actual distance.

Note that all four types of queries share the same matching process; the only difference is that, in the index-probing phase, each window has a different row range for different query types, as introduced in Section III.

1: segment Q into p disjoint windows and compute μ_1^Q, …, μ_p^Q
2: for j ← 1 to p do
3:     compute the range [LR_j, UR_j] of μ_j^S according to the query type
4:     R_j ← Scan(KV-index, LR_j, UR_j)
5:     for all rows V ∈ R_j do
6:         add the window intervals of V to IS_j
7:     sort the window intervals in IS_j in ascending order
8:     CS_j ← ∅
9:     for all WI ∈ IS_j do
10:        CS_j.add(WI left-shifted by (j − 1)·w)
11:    if j = 1 then CS ← CS_j
12:    else CS ← CS ∩ CS_j
13: RS ← ∅
14: for all WI ∈ CS do
15:     scan the candidate subsequences covered by WI from the data
16:     for each candidate S do
17:         if D(S, Q) ≤ ε (and the cNSM constraints hold) then ▷ extra test for cNSM
18:             RS.add(S)
19: return RS
Algorithm 1 MatchSubsequence(Q, ε, α, β)

V-B Window Interval Generation

For each window W_j^Q, we first calculate the range [LR_j, UR_j] of μ_j^S according to the query type. Then we visit KV-index with a single scan operation, which obtains a list of consecutive rows, denoted as R_j, whose key ranges together cover [LR_j, UR_j]. Note that the first row (or the last row) of R_j may contain mean values out of the range. However, this only brings negative candidates, without missing any positive ones.

We denote the set of all window intervals in R_j as IS_j, and use WI ∈ IS_j to indicate that window interval WI belongs to IS_j. Also, for any window position t in some WI ∈ IS_j, we write t ∈ IS_j.

According to Eq. (7) and Eq. (8), we denote the number of window intervals in IS_j as n_r(IS_j), and the number of window positions as n_p(IS_j). Note that the window intervals in IS_j are disjoint from each other. To facilitate the subsequent "intersection" operation, we sort these intervals in ascending order (Line 7).

V-C The Matching Algorithm

Based on IS_j (1 ≤ j ≤ p), we generate the final candidate set CS with an "intersection" operation. We first introduce the concept of the candidate set for W_j^Q, denoted as CS_j (1 ≤ j ≤ p). For window W_1^Q, any window position t in IS_1 maps to the candidate subsequence starting at t. Therefore, the candidate set for W_1^Q, CS_1, is composed of all positions in IS_1. CS_1 is still organized as a sequence of ordered non-overlapping window intervals, like IS_1.

For W_2^Q, each window position in IS_2 also corresponds to a candidate subsequence. However, position t in IS_2 corresponds to the candidate subsequence starting at t − w, because W_2^Q is its second disjoint window. So the candidate set for W_2^Q, denoted as CS_2, can be obtained by left-shifting each window position in IS_2 by w. Similarly, CS_3 is obtained by left-shifting the positions in IS_3 by 2w. In general, for window W_j^Q (1 ≤ j ≤ p), the candidate set is CS_j = { t − (j − 1)·w | t ∈ IS_j }.

The shifting offset for W_j^Q is thus (j − 1)·w. All candidate sets CS_j (1 ≤ j ≤ p) are still organized as ordered sequences of non-overlapping window intervals. Moreover, it can be easily inferred that n_r(CS_j) = n_r(IS_j) and n_p(CS_j) = n_p(IS_j).

By combining the lemmas in Section III and the definition of CS_j, we obtain two important properties,

Property 1.

If position t is not contained in some CS_j (1 ≤ j ≤ p), then the subsequence starting at t and Q are not matched.

Property 2.

If the subsequence starting at position t and Q are matched, then t belongs to all candidate sets, that is, t ∈ ∩_{j=1}^{p} CS_j.

Now we present our approach to intersect the CS_j's to generate the final CS. It consists of p rounds (Lines 2-12). In the first round, we fetch R_1 from the index, and generate IS_1 and CS_1. We initialize CS as CS_1. In the second round, we fetch R_2, and generate CS_2 by shifting all window intervals in IS_2 by w (Lines 9-10). Then we intersect CS with CS_2 to obtain the up-to-date CS (Line 12). Because all intervals in CS, as well as in CS_2, are ordered, the intersection can be executed by sequentially intersecting the window intervals of CS and CS_2, quite similar to the merge step of merge-sort, with linear complexity. In general, during the j-th round, we intersect CS_j with the CS of the last round, and generate the up-to-date CS. After p rounds, we obtain the final candidate set CS.

Fig. 7: Example of the matching algorithm

We illustrate the algorithm with the example in Fig. 7. IS_1 contains three intervals, and IS_2 contains three intervals. CS_1 (or CS_2) contains all the intervals covered by IS_1 (or IS_2): CS_1 equals IS_1, while CS_2 is generated by left-shifting IS_2 by the offset w. Then we intersect CS_1 and CS_2 to get the CS of the second round, which is composed of two intervals.
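The round-by-round intersection of ordered interval lists can be sketched as follows (our own helper; each list holds disjoint (l, r) tuples in ascending order):

```python
def intersect(cs, cj):
    """Intersect two ordered lists of disjoint window intervals by a
    single merge-sort-style sequential pass (as in the j-th round)."""
    out = []
    i = j = 0
    while i < len(cs) and j < len(cj):
        lo = max(cs[i][0], cj[j][0])
        hi = min(cs[i][1], cj[j][1])
        if lo <= hi:                 # overlapping part survives
            out.append((lo, hi))
        # advance whichever interval ends first
        if cs[i][1] < cj[j][1]:
            i += 1
        else:
            j += 1
    return out
```

Each interval of either input is visited once, so a full round costs time linear in the number of intervals, matching the complexity claim above.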

In phase 2, according to CS, we fetch data to generate the final qualified results (Line 13-18). Formally, for each window interval WI in CS, we fetch the subsequences