I Introduction
Time series data are pervasive across almost all human endeavors, including medicine, finance and science. Consequently, there is enormous interest in querying and mining time series data [1, 2].
The subsequence matching problem is a core subroutine of many time series mining algorithms. Specifically, given a long time series X, a query series Q and a distance threshold ε, the subsequence matching problem finds all subsequences of X whose distance to Q falls within the threshold ε.
FRM [3] is the pioneering work on subsequence matching. Many approaches have since been proposed, either to improve efficiency [4, 5] or to deal with various distance functions [6, 7], such as Euclidean distance and Dynamic Time Warping. However, all these approaches only consider the raw subsequence matching problem (RSM for short). In recent years, researchers have realized the importance of subsequence normalization [8]: it is often more meaningful to compare the z-normalized subsequences instead of the raw ones. UCR Suite [8] is the state-of-the-art approach to the normalized subsequence matching problem (NSM for short).
The NSM approach suffers from two drawbacks. First, it needs to scan the full time series X, which is prohibitively expensive for long series. For example, on a very long time series, UCR Suite needs more than 100 seconds to process a query of length 1,000. [8] analyzed why it is impossible to build an index for the NSM problem. Second, an NSM query may output results that do not satisfy the user's intent. The reason is that NSM fully ignores offset shifting and amplitude scaling. In real-world applications, however, the extent of offset shifting and amplitude scaling may represent a specific physical mechanism or state, and users often only hope to find subsequences in a similar state as the query. We illustrate this with an example.
[Fig. 1. (a) PAMAP time series. (b) Aligned normalized subsequences. (c) Information of Q: offset 877, length 17,124. (d) Results of NSM (offset / distance): 252,492 / 117.78; 97,458 / 130.80; 34,562 / 138.12; 161,416 / 149.37; 134,456 / 164.88; 296,063 / 166.74.]
Example 1. The time series in Fig. 1(a) comes from the Physical Activity Monitoring for Aging People (PAMAP) dataset [1], collected from a z-accelerometer at the hand position. The monitored person conducts various activities alternately, such as sitting, standing and running. Each activity lasts for about 3 minutes, and the data collection frequency is 100 Hz. We use one subsequence corresponding to the lying activity as the query (Q in Fig. 1(c)) to find other "lying" subsequences. We issue an NSM query, and Fig. 1(d) lists the top results. Unfortunately, all top-4 results correspond to other activities: two of them correspond to the sitting activity, while the other two correspond to the breaking activity. The desired results, which do correspond to the lying activity, are ranked out of the top-20. We show the normalized query and several results in Fig. 1(b); it is difficult to distinguish them after normalization.
By observing Fig. 1(a), one can filter out the undesired results easily by adding an additional constraint: the output subsequences should have a mean value similar to that of Q. In fact, this new type of NSM query, NSM plus some constraints, is useful in many applications. We list two of them as follows.

(Industry application) In the wind power generation field, a LIDAR system can provide preview information of wind disturbances [9]. An Extreme Operating Gust (EOG) is a typical gust pattern: a dramatic change of wind speed within a short period. Fig. 3 shows a typical EOG pattern. This pattern is important because it may damage the turbine. All EOG occurrences have a similar shape, and their fluctuation degree falls within a certain range, because the wind speed cannot be arbitrarily high. If we hope to find all EOG occurrences in the historical data, we can use a typical EOG pattern as the query, plus a constraint on the range of the values.

(IoT application) When a container truck goes over a bridge, the strain meter embedded in the bridge demonstrates a specific fluctuation pattern. The value range of the pattern depends on the weight of the truck. If we have one occurrence of the pattern as a query, we can additionally set a mean value range as the constraint, to search for container trucks whose weight falls within a certain range.
Note that the above applications cannot be handled by an RSM query, because the offset shifting and amplitude scaling force us to set a very large distance threshold, which causes many false positive results.
Furthermore, to verify the universality of this new query type, we investigate the motif pairs in some popular real-world time series benchmarks. Motif mining [2] is an important time series mining task, which finds a pair (or set) of subsequences with minimal normalized distance. For each motif subsequence pair, we show the relative mean value difference (ΔMean) and the ratio of standard deviations (ΔStd) in Fig. 3. We can see that although these pairs are found without any constraint (like an NSM query), both the mean values and the standard deviations of the paired motif subsequences are very similar. So we could also find these pairs by a cNSM query, an NSM query plus a small constraint.

In this paper, we formally define a new subsequence matching problem, called the constrained normalized subsequence matching problem (cNSM for short). Two constraints, one on the mean value and the other on the standard deviation, are added to the traditional NSM problem. An exemplar cNSM query looks like "given a query Q with mean value μ_Q and standard deviation σ_Q, return the subsequences S which satisfy: (1) ED(Q̂, Ŝ) ≤ ε; (2) σ_Q/α ≤ σ_S ≤ α·σ_Q; (3) μ_Q − β ≤ μ_S ≤ μ_Q + β". With the constraints, the cNSM problem provides a knob to flexibly control the degree of offset shifting (represented by the mean value) and amplitude scaling (represented by the standard deviation). Moreover, the cNSM problem offers us the opportunity to build an index for normalized subsequence matching.
Challenges. Solving the cNSM problem faces the following challenges. First, how can we process a cNSM query efficiently? A straightforward approach is to first apply UCR Suite to find the unconstrained results, and then use the mean value and standard deviation constraints to prune the unqualified ones. However, this still needs to scan the full series. Can we build an index and process the query more efficiently?
Second, users often conduct similar-subsequence search in an exploratory and interactive fashion. They may try different distance functions, like Euclidean distance or Dynamic Time Warping, and they may issue RSM and cNSM queries simultaneously. Can we build a single index to support all these query types?
Contributions. Besides proposing the cNSM problem, we make the following contributions.

We present the filtering conditions for four query types, RSM-ED, RSM-DTW, cNSM-ED and cNSM-DTW, and prove their correctness. The conditions enable us to build an index while guaranteeing no false dismissals.

We propose a new index structure, KV-index, and a query processing approach, KV-match, to support all these query types. The biggest advantage is that we can process various types of queries efficiently with a single index. Moreover, KV-match only needs a small number of sequential scans of the index, instead of the many random accesses of tree nodes in a traditional R-tree index, which makes it much more efficient.

To support queries of arbitrary length efficiently, we extend KV-match with multiple indexes of different window lengths. We conduct extensive experiments, and the results verify the efficiency and effectiveness of our approach.
The rest of the paper is organized as follows. We present the preliminary knowledge and problem statements in Section II. In Section III we introduce the theoretical foundation and motivate our approach. Sections IV and V describe our index structure, the index building algorithm and the query processing algorithm. Section VI extends our method to use multi-level indexes with different window lengths. Our implementation details are described in Section VII. The experimental results are presented in Section VIII, and we discuss related work in Section IX. Finally, we conclude the paper and look into future work in Section X.
II Preliminary Knowledge
In this section, we introduce the definition of time series and other useful notations.
II-A Definitions and Problem Statement
Notation  Description

X  a time series
S_{i,m}  a length-m subsequence of X starting at offset i
X̂  the normalized series of time series X
W_i  the i-th length-w disjoint window of X
μ_{W_i}  the mean value of the i-th disjoint window of X
σ_{W_i}  the standard deviation of the i-th disjoint window of X
WI  a window interval containing continuous window positions
P_i  a set of window intervals satisfying the criterion for Q_i
CS_i, CS  a set of candidates for Q_i, and for all windows
N_i, n_i  the number of window intervals and window positions
A time series is a sequence of ordered values, denoted as X = (x_1, x_2, …, x_n), where n is the length of X. A length-m subsequence of X is a shorter time series, denoted as S_{i,m} = (x_i, x_{i+1}, …, x_{i+m−1}), where 1 ≤ i ≤ n − m + 1.
For any subsequence S, μ_S and σ_S are the mean value and standard deviation of S respectively. The normalized series of S, denoted as Ŝ, is Ŝ = ((s_1 − μ_S)/σ_S, (s_2 − μ_S)/σ_S, …, (s_m − μ_S)/σ_S).
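As a concrete illustration of the definition above, here is a minimal z-normalization sketch in Python (the function name is ours, not from the paper; the constant-sequence case is handled by returning zeros, one common convention):

```python
def z_normalize(s):
    """Return the z-normalized copy of sequence s: (s_j - mu) / sigma."""
    n = len(s)
    mu = sum(s) / n
    var = sum((v - mu) ** 2 for v in s) / n
    sigma = var ** 0.5
    if sigma == 0:                      # constant sequence: normalization undefined
        return [0.0] * n
    return [(v - mu) / sigma for v in s]
```

After normalization, the sequence has mean 0 and standard deviation 1, which is why offset shifting and amplitude scaling disappear.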
Our work supports two common distance measures, Euclidean distance and Dynamic Time Warping, defined as follows.
Euclidean Distance (ED): Given two length-m sequences A = (a_1, …, a_m) and B = (b_1, …, b_m), their distance is ED(A, B) = sqrt(Σ_{i=1}^{m} (a_i − b_i)²).
Dynamic Time Warping (DTW): Given two length-m sequences A and B, their distance is

DTW(∅, ∅) = 0;  DTW(A, ∅) = DTW(∅, B) = ∞;
DTW(A, B) = sqrt( (a_1 − b_1)² + min{ DTW²(suf(A), suf(B)), DTW²(suf(A), B), DTW²(A, suf(B)) } )   (1)

where ∅ represents the empty series and suf(A) = (a_2, …, a_m) is a suffix subsequence of A.
In DTW, the warping path is defined as a matrix of alignment pairs representing the optimal alignment of the two series: a pair (i, j) on the path represents that a_i is aligned to b_j. To reduce the computation complexity, we use the Sakoe-Chiba band [10] to restrict the width of warping, denoted as ρ: any pair (i, j) on the path should satisfy |i − j| ≤ ρ. When ρ = 0, DTW degenerates into ED.
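The two measures can be sketched as follows. This is our own illustrative implementation, not the paper's code: DTW is computed by the standard O(m·ρ) dynamic program over accumulated squared differences, with a final square root so that the ρ = 0 case coincides with ED exactly:

```python
import math

def euclidean(a, b):
    """ED between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw(a, b, rho):
    """DTW under a Sakoe-Chiba band of half-width rho.
    d[i][j] = best accumulated squared cost aligning a[:i] with b[:j];
    cells outside the band stay infinite. With rho = 0 this equals ED."""
    m = len(a)
    INF = float("inf")
    d = [[INF] * (m + 1) for _ in range(m + 1)]
    d[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(max(1, i - rho), min(m, i + rho) + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            d[i][j] = cost + min(d[i - 1][j - 1], d[i - 1][j], d[i][j - 1])
    return math.sqrt(d[m][m])
```

Widening the band can only decrease the distance, so dtw(a, b, ρ) ≤ euclidean(a, b) for any ρ ≥ 0.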
We aim to support subsequence matching for both the raw subsequence and the normalized subsequence simultaneously. The problem statements are given here.
Raw Subsequence Matching (RSM): Given a long time series X, a query sequence Q of length m (m ≤ n) and a distance threshold ε (ε ≥ 0), find all length-m subsequences S of X which satisfy dist(S, Q) ≤ ε. In this case, we say that S and Q are in match.
Normalized Subsequence Matching (NSM): Given a long time series X, a query sequence Q and a distance threshold ε, find all length-m subsequences S of X which satisfy dist(Ŝ, Q̂) ≤ ε, where Ŝ and Q̂ are the normalized series of S and Q respectively.
The cNSM problem adds two constraints to the NSM problem: thresholds α (α ≥ 1) and β (β ≥ 0) are introduced to constrain the degree of amplitude scaling and offset shifting respectively.
Constrained Normalized Subsequence Matching (cNSM): Given a long time series X, a query sequence Q, a distance threshold ε, and the constraint thresholds α and β, find all length-m subsequences S of X which satisfy
(1) dist(Ŝ, Q̂) ≤ ε;  (2) 1/α ≤ σ_S/σ_Q ≤ α;  (3) |μ_S − μ_Q| ≤ β.
The larger α and β are, the looser the constraints. In this case, we say that S and Q are in match.
The distance measure dist is either ED or DTW. In this paper, we build a single index to support four types of queries: RSM-ED, RSM-DTW, cNSM-ED and cNSM-DTW.
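The three cNSM conditions can be checked directly on a candidate subsequence. The sketch below (our own helper, assuming ED as the distance measure and α ≥ 1, β ≥ 0) makes the definition operational:

```python
import math

def is_cnsm_match(q, s, epsilon, alpha, beta):
    """Check the three cNSM conditions for candidate s against query q:
    (1) ED of the normalized sequences is within epsilon,
    (2) 1/alpha <= sigma_s / sigma_q <= alpha   (amplitude scaling),
    (3) |mu_s - mu_q| <= beta                   (offset shifting)."""
    def stats(v):
        mu = sum(v) / len(v)
        sigma = math.sqrt(sum((x - mu) ** 2 for x in v) / len(v))
        return mu, sigma

    mu_q, sig_q = stats(q)
    mu_s, sig_s = stats(s)
    if not (sig_q / alpha <= sig_s <= alpha * sig_q):   # condition (2)
        return False
    if abs(mu_s - mu_q) > beta:                         # condition (3)
        return False
    dist = math.sqrt(sum(((a - mu_q) / sig_q - (b - mu_s) / sig_s) ** 2
                         for a, b in zip(q, s)))        # condition (1)
    return dist <= epsilon
```

For example, s = 2·q + 5 has the same normalized shape as q, so it matches whenever α and β admit the doubled scale and the shifted mean, and is rejected once β is tightened.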
III Theoretical Foundation and Approach Motivation
In this section, we establish the theoretical foundation of our approach. We propose conditions to filter unqualified subsequences. For all four query types, the conditions share the same format, which enables us to support all of them with a single index.
Specifically, for the query Q and a subsequence S of length m, we segment both into p aligned disjoint windows of the same length w. The i-th window of Q (or S) is denoted as Q_i (or S_i), 1 ≤ i ≤ p, that is, Q_i = (q_{(i−1)w+1}, …, q_{iw}).
For each window, we hope to find one or more features based on which we can construct the filtering condition. In this work, we choose to utilize a single feature, the mean value of the window. The advantages are twofold. First, with a single feature we can build a one-dimensional index, which greatly improves the efficiency of index retrieval. Second, the mean value allows us to design conditions for both RSM and cNSM queries.
We denote the mean values of Q_i and S_i as μ_{Q_i} and μ_{S_i}. The condition consists of p ranges. The i-th one is denoted as [L_i, U_i] (1 ≤ i ≤ p). If S is a qualified subsequence, then for any i, μ_{S_i} must fall within [L_i, U_i]. If any μ_{S_i} is outside its range, we can filter S safely.
III-A RSM-ED Query Processing
In this section, we first present the condition for the simplest case, the RSM-ED query, and then illustrate our approach.
Lemma 1.
If S and Q are in match under the ED measure, that is, ED(S, Q) ≤ ε, then each μ_{S_i} must satisfy

μ_{Q_i} − ε/√w ≤ μ_{S_i} ≤ μ_{Q_i} + ε/√w.   (2)
Now we illustrate our approach with the example in Fig. 4. X is a long time series, and Q is the query sequence. The goal is to find all subsequences S of X of the same length as Q which satisfy ED(S, Q) ≤ ε. The window length w is set to 50. We split Q into three disjoint windows of length 50, Q_1, Q_2 and Q_3 (we can ignore the remaining part without sacrificing correctness, since Lemma 1 is a necessary condition for RSM). According to Lemma 1, for any qualified subsequence S, the mean value of the i-th disjoint window S_i must fall within the range [μ_{Q_i} − ε/√w, μ_{Q_i} + ε/√w] (1 ≤ i ≤ 3). To facilitate finding the windows satisfying this condition, we build the index as follows. We compute the mean values of all sliding windows W_j = (x_j, …, x_{j+w−1}), denoted as μ_{W_j}, and build a sorted list of (μ_{W_j}, j) entries. With this structure, we find the candidates in two steps. First, for each window Q_i, we obtain all sliding windows whose mean values fall within [μ_{Q_i} − ε/√w, μ_{Q_i} + ε/√w] by a single sequential scan. We denote the found windows for Q_i as P_i. Then, we generate the final candidates by intersecting the windows in P_1, P_2 and P_3.
In Fig. 4, the sliding windows in P_1, P_2 and P_3 are marked with "triangle", "cross" and "circle" respectively. Only one candidate subsequence survives, because only one position has its first window in P_1, its second window in P_2 and its third window in P_3.
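The two-step procedure above can be sketched end to end. This is our own illustrative code, not the paper's implementation: it keeps the sorted (mean, position) list in memory, answers each window's range query with binary search, and intersects the position sets after shifting each window back to its subsequence start. Lemma 1 is only a necessary condition, so the result is a candidate set that may still contain false positives:

```python
import bisect, math

def rsm_ed_candidates(x, q, w, epsilon):
    """Candidate start positions for RSM-ED via the Lemma 1 mean filter.
    Position j survives only if, for every disjoint query window i, the
    sliding window of x starting at j + i*w has a mean within
    epsilon/sqrt(w) of the corresponding query-window mean."""
    n, m = len(x), len(q)
    p = m // w                                  # number of disjoint windows
    r = epsilon / math.sqrt(w)
    prefix = [0.0]                              # prefix sums for O(1) window means
    for v in x:
        prefix.append(prefix[-1] + v)
    means = sorted(((prefix[j + w] - prefix[j]) / w, j)
                   for j in range(n - w + 1))
    keys = [mv for mv, _ in means]
    cands = None
    for i in range(p):
        mu_qi = sum(q[i * w:(i + 1) * w]) / w
        lo = bisect.bisect_left(keys, mu_qi - r)
        hi = bisect.bisect_right(keys, mu_qi + r)
        # shift window positions back to subsequence start positions
        pos = {means[k][1] - i * w for k in range(lo, hi)}
        cands = pos if cands is None else cands & pos
    return sorted(j for j in cands if 0 <= j <= n - m)
```

Every true match is guaranteed to appear in the output; each candidate must still be verified against the raw data.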
III-B Range for cNSM-ED Query
We solve the cNSM problem based on KV-index as well. For the given query Q, we determine whether a subsequence S matches Q by checking the raw subsequence directly. Specifically, we achieve this goal by designing the range [L_i, U_i] for each query window Q_i. For any subsequence S, if any μ_{S_i} falls outside this range, S cannot match Q and we can filter it safely. The intuition is as follows (to make it simple, one can set β as 0): the cNSM constraints bound σ_S and μ_S, and together with the distance threshold on the normalized series they bound how far each window mean μ_{S_i} can deviate from the corresponding query window mean. If some μ_{S_i} is forced outside that bound, the normalized distance between Ŝ and Q̂ already exceeds ε, which violates the cNSM condition, and no further checking is needed.
Now we formally give the range for the cNSM-ED query. Let μ_Q and μ_S be the global mean values of Q and S, σ_Q and σ_S be the standard deviations, and Q̂ and Ŝ be the normalized Q and S respectively.
Lemma 2.
If S and Q are in match under the ED measure, that is, ED(Q̂, Ŝ) ≤ ε together with the cNSM constraints, then μ_{S_i} satisfies

L_i ≤ μ_{S_i} ≤ U_i   (3)

where
L_i = min{ (σ_Q/α)·A_i, α·σ_Q·A_i } + μ_Q − β,  with A_i = μ_{Q̂_i} − ε/√w,
U_i = max{ (σ_Q/α)·B_i, α·σ_Q·B_i } + μ_Q + β,  with B_i = μ_{Q̂_i} + ε/√w.
Proof.
Based on the normalized ED definition, ED(Q̂, Ŝ) ≤ ε implies, by Lemma 1 applied to the normalized series, that
μ_{Q̂_i} − ε/√w ≤ μ_{Ŝ_i} ≤ μ_{Q̂_i} + ε/√w.
Let x = σ_S and y = μ_S, where x ∈ [σ_Q/α, α·σ_Q] and y ∈ [μ_Q − β, μ_Q + β] by the cNSM constraints. Since μ_{S_i} = x·μ_{Ŝ_i} + y, for any specific pair (x, y) we can get a range of μ_{S_i} as follows,
x·(μ_{Q̂_i} − ε/√w) + y ≤ μ_{S_i} ≤ x·(μ_{Q̂_i} + ε/√w) + y.
For ease of description, we assign A_i = μ_{Q̂_i} − ε/√w and B_i = μ_{Q̂_i} + ε/√w.
The final range should be
[ min_{x,y} (x·A_i + y), max_{x,y} (x·B_i + y) ].
As illustrated in Fig. 6, the rectangle represents the whole legal range of x and y. Let f_L(x, y) = x·A_i + y and f_U(x, y) = x·B_i + y. Apparently, both f_L and f_U increase monotonically with y. As for x, we have two cases:

If A_i ≥ 0, f_L increases monotonically with x; f_L is minimal when x = σ_Q/α and y = μ_Q − β, which is represented by one corner point in Fig. 6;

If A_i < 0, f_L decreases monotonically with x; f_L is minimal when x = α·σ_Q and y = μ_Q − β, which is represented by the other corner point in Fig. 6.

So L_i = min{ (σ_Q/α)·A_i, α·σ_Q·A_i } + μ_Q − β, and symmetrically U_i = max{ (σ_Q/α)·B_i, α·σ_Q·B_i } + μ_Q + β. Note that the min (max) formula means that the optimal x is either σ_Q/α or α·σ_Q. ∎
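The closed-form range can be computed in a few lines. This sketch follows the derivation above (symbol names A and B as in the proof; it assumes α ≥ 1, β ≥ 0 and w > 0):

```python
import math

def cnsm_ed_range(mu_qhat_i, epsilon, w, mu_q, sigma_q, alpha, beta):
    """Range [L_i, U_i] of the candidate window mean for cNSM-ED.
    A and B bound the normalized window mean mu_Shat_i; the raw mean
    mu_Si = sigma_S * mu_Shat_i + mu_S is then extremized over
    sigma_S in [sigma_q/alpha, alpha*sigma_q] and
    mu_S in [mu_q - beta, mu_q + beta]."""
    r = epsilon / math.sqrt(w)
    A = mu_qhat_i - r
    B = mu_qhat_i + r
    L = min(sigma_q / alpha * A, alpha * sigma_q * A) + mu_q - beta
    U = max(sigma_q / alpha * B, alpha * sigma_q * B) + mu_q + beta
    return L, U
```

The min/max automatically handle the two sign cases of A and B discussed in the proof.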
III-C Ranges for RSM-DTW and cNSM-DTW Queries
Before introducing the ranges, we first review the query envelope and the lower bound of DTW distance, LB_PAA [12]. To deal with the DTW measure, given the length-m query Q, the query envelope consists of two length-m series, L and U, as the lower and upper envelope respectively. The i-th elements of L and U, denoted as l_i and u_i, are defined as
l_i = min(q_{i−ρ}, …, q_{i+ρ}),  u_i = max(q_{i−ρ}, …, q_{i+ρ}).
LB_PAA is defined based on the query envelope. L and U are split into p length-w disjoint windows, L_i and U_i, in which L_i = (l_{(i−1)w+1}, …, l_{iw}) and U_i = (u_{(i−1)w+1}, …, u_{iw}) (1 ≤ i ≤ p). The mean values of L_i and U_i are denoted as μ_{L_i} and μ_{U_i} respectively. For any length-m subsequence S, LB_PAA is as follows,

LB_PAA(Q, S) = sqrt( Σ_{i=1}^{p} w · c_i ),  where c_i = (μ_{S_i} − μ_{U_i})² if μ_{S_i} > μ_{U_i}; c_i = (μ_{S_i} − μ_{L_i})² if μ_{S_i} < μ_{L_i}; c_i = 0 otherwise,   (4)

which satisfies LB_PAA(Q, S) ≤ DTW(Q, S) [12].
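The envelope and the bound can be sketched as follows (our own illustrative code; the envelope windows are truncated at the sequence boundaries, one common convention):

```python
import math

def envelope(q, rho):
    """Lower/upper envelope of q under a Sakoe-Chiba band of half-width rho."""
    m = len(q)
    lo = [min(q[max(0, i - rho):min(m, i + rho + 1)]) for i in range(m)]
    up = [max(q[max(0, i - rho):min(m, i + rho + 1)]) for i in range(m)]
    return lo, up

def lb_paa(q, s, w, rho):
    """LB_PAA lower bound of DTW(q, s): compare each disjoint window mean
    of s against the corresponding window means of the query envelope."""
    lo, up = envelope(q, rho)
    p = len(q) // w
    total = 0.0
    for i in range(p):
        seg = slice(i * w, (i + 1) * w)
        mu_s = sum(s[seg]) / w
        mu_lo = sum(lo[seg]) / w
        mu_up = sum(up[seg]) / w
        if mu_s > mu_up:
            total += w * (mu_s - mu_up) ** 2
        elif mu_s < mu_lo:
            total += w * (mu_s - mu_lo) ** 2
    return math.sqrt(total)
```

A sequence whose window means stay inside the envelope means contributes nothing to the bound, which is what makes the per-window mean ranges of Lemma 3 possible.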
Now we give the ranges for RSM and cNSM under the DTW measure in turn.
Lemma 3.
If S and Q are in match under the DTW measure, that is, DTW(S, Q) ≤ ε, then μ_{S_i} satisfies

μ_{L_i} − ε/√w ≤ μ_{S_i} ≤ μ_{U_i} + ε/√w.   (5)
Proof.
See Appendix A.
Lemma 4.
If S and Q are in match under the DTW measure, that is, DTW(Q̂, Ŝ) ≤ ε together with the cNSM constraints, then μ_{S_i} satisfies

L_i ≤ μ_{S_i} ≤ U_i   (6)

where
L_i = min{ (σ_Q/α)·A_i, α·σ_Q·A_i } + μ_Q − β,  with A_i = μ_{L̂_i} − ε/√w,
U_i = max{ (σ_Q/α)·B_i, α·σ_Q·B_i } + μ_Q + β,  with B_i = μ_{Û_i} + ε/√w,
and L̂ and Û denote the envelope of the normalized query Q̂.
Proof.
See Appendix B.
Analysis. We have provided the ranges of the window mean value for all four query types, which means that we can support all of them with a single index. When processing different query types, the only difference is using different ranges [L_i, U_i]. This property is beneficial for exploratory search tasks.
IV KV-index
In this section, we present our index structure, KV-index, and the index building algorithm.
IV-A Index Structure
The index structure in Fig. 4 contains one entry for each sliding window, which causes a huge space cost. To avoid that, we propose a more compact index structure which utilizes the data locality property: the values of adjacent time points are often close, and in consequence the mean values of adjacent sliding windows are similar too.
Logically, KV-index consists of ordered rows of key-value pairs. The key of the i-th row, denoted as key_i, is a range of mean values of sliding windows, that is, key_i = [key_i.l, key_i.r), where key_i.l and key_i.r are the left and right endpoints of the mean value range respectively. It is a left-closed-right-open range, and the ranges of adjacent rows are disjoint.
The corresponding value, denoted as value_i, is the set of sliding windows whose mean values fall within key_i. To simplify the expression, we represent each window by its position, that is, we represent sliding window W_j by j. To further save space and facilitate the subsequence matching algorithm, we organize the window positions in value_i as follows: the positions are sorted in ascending order, and consecutive ones are merged into a window interval, denoted as WI. So value_i consists of one or more sorted and non-overlapping window intervals.
Definition 1 (Window Interval).
We combine the l-th to r-th consecutive length-w sliding windows of X into a window interval WI = [l, r], which contains the set of sliding windows {W_l, W_{l+1}, …, W_r}, where l ≤ r.
In the following descriptions, we use j ∈ WI to denote that window position j belongs to the window interval WI, that is, WI.l ≤ j ≤ WI.r. Moreover, we use WI.l, WI.r and |WI| to denote the left boundary, the right boundary and the size of interval WI respectively. The overall number of window intervals in value_i is denoted as N_i, and the number of window positions as n_i. Formally, we have
N_i = |{ WI : WI ∈ value_i }|,   (7)
n_i = Σ_{WI ∈ value_i} (WI.r − WI.l + 1).   (8)
Fig. 6 shows the KV-index for the series in Fig. 4. The first row indicates that there exist three sliding windows whose mean values fall within its key range. In the second row, three windows are organized into two intervals, so its N_i is 2 while its n_i is 3. Note that a window interval may contain one single window position.
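The equal-width first stage of the index can be sketched as follows (our own illustrative code, not the paper's implementation; d is the range width of Section IV-B, and each row's value is kept as an ordered list of inclusive intervals [l, r] of window positions):

```python
from collections import defaultdict

def build_kv_index(x, w, d):
    """Equal-width stage of a KV-index sketch: key k covers mean values in
    [k*d, (k+1)*d); the value is an ordered list of window intervals [l, r]
    of sliding-window positions whose means fall in that range."""
    prefix = [0.0]                      # prefix sums for O(1) window means
    for v in x:
        prefix.append(prefix[-1] + v)
    index = defaultdict(list)
    for j in range(len(x) - w + 1):
        mu = (prefix[j + w] - prefix[j]) / w
        k = int(mu // d)
        rows = index[k]
        if rows and rows[-1][1] == j - 1:   # consecutive position: extend interval
            rows[-1][1] = j
        else:                               # otherwise start a new interval
            rows.append([j, j])
        # intervals stay sorted because positions arrive in ascending order
    return dict(index)
```

Because adjacent windows tend to have similar means, long runs of consecutive positions collapse into single intervals, which is exactly the locality the structure exploits.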
To facilitate query processing, KV-index also contains a meta table, in which each entry is a quadruple describing one row: its key range, together with the offset and length of the row in the index file. Due to its small size, we can load the meta table into memory before processing queries. With the meta table, we can quickly determine the offset and the length of a scan operation by simple binary search.
Physically, KV-index can be implemented as a local file, an HDFS file or an HBase table, thanks to its simple format. In this work, we implement two versions, a local file version and an HBase table version. In general, any file system or database that supports a "scan" operation with start-key and end-key parameters can host KV-index. We provide details about the index implementation in Section VII.
IV-B Index Building Algorithm
We build the index in two steps. First, we build an index in which all rows use equal-width ranges of the mean values. Second, because the data distribution is not balanced among rows, we merge adjacent rows to optimize the index. We first introduce a basic in-memory algorithm, which works for moderate data sizes, and then discuss how to extend it to very large data scales.
In the first step, we predefine a parameter d, which represents the range width of the mean values. The range of the k-th row is [k·d, (k+1)·d). We read the series X sequentially. A circular array is used to maintain the current length-w sliding window W_j, and its mean value μ_{W_j} is computed on the fly. If the mean value of the previous window W_{j−1} falls in range [k·d, (k+1)·d), and the mean value of the current window W_j also falls in it, we modify the current WI by changing its right boundary from j−1 to j. Otherwise, a new interval [j, j] is added to the row determined by μ_{W_j}.
The equal-width ranges can cause a zigzag pattern across adjacent rows: two adjacent rows may each contain many short intervals that interleave with each other. Apparently, a better way is to merge these two rows so that the interleaving intervals combine into fewer, longer ones.
In the second step, we merge adjacent rows with a greedy algorithm. We check the rows beginning from the first two. Let the current rows be the i-th and (i+1)-th. The merging condition is whether the number of window intervals after merging, relative to the number before merging, is smaller than a predefined parameter. The rationale is that we merge rows in which a large number of intervals are neighboring. If rows i and i+1 are merged, the new key is [key_i.l, key_{i+1}.r), and the new value is the union of value_i and value_{i+1}. Moreover, all neighboring window intervals from value_i and value_{i+1} are merged into single intervals.
The merge operation is actually a union of two ordered interval sequences, which can be implemented efficiently, similar to the merge step of the mergesort algorithm. Since each window interval is examined exactly once, its time complexity is linear in the number of intervals.
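The union step can be sketched as follows (our own code; intervals are inclusive [l, r] pairs, and "neighboring" means adjacent or overlapping):

```python
def merge_rows(a, b):
    """Union of two ordered, disjoint window-interval lists, mergesort-style.
    Adjacent or overlapping intervals are fused into one; each input
    interval is examined exactly once, so the cost is linear."""
    out = []
    i = j = 0
    while i < len(a) or j < len(b):
        # pick the interval with the smaller left boundary
        if j >= len(b) or (i < len(a) and a[i][0] <= b[j][0]):
            cur = a[i]; i += 1
        else:
            cur = b[j]; j += 1
        if out and cur[0] <= out[-1][1] + 1:    # neighboring: fuse
            out[-1][1] = max(out[-1][1], cur[1])
        else:
            out.append(list(cur))
    return out
```

For example, merging a row containing [1,3] and [10,12] with one containing [4,6] and [8,9] fuses the interleaved intervals into [1,6] and [8,12].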
If the index size exceeds the memory capacity, we build it as follows. In the first step, we divide the time series into segments, and build the fixed-width-range index for each segment in turn. After all segments are processed, we merge the rows of the different segments. The second step visits the index rows sequentially, so it can also be divided into subtasks. Since each step can be divided into subtasks, the whole index building algorithm can be easily adapted to a distributed environment, like MapReduce.
Complexity analysis. Building KV-index consists of two steps: generating rows with fixed-width ranges, and merging them into varied-width ones. The first step scans all data in a streaming fashion, computes the mean values, and inserts entries into a hash table. Note that the mean value of W_j can be computed incrementally from that of W_{j−1} at O(1) cost, so the cost of the first step is O(n). In the second step, we examine adjacent rows and merge them if necessary. Since the intervals are ordered within each row, the merge operation is similar to merge sort, and its cost is linear in the total number of intervals, which is at most O(n). In summary, the complexity of building the index is O(n).
All previous index-based approaches, like FRM and General Match, are based on the R-tree, whose building cost is O(n log n) [13]. Moreover, they use DFT to transform each length-w sliding window of X, which adds further cost. Therefore, building KV-index is more efficient.
V KV-match
In this section, we present the matching algorithm, KV-match, whose pseudocode is shown in Algorithm 1.
V-A Overview
Initially, given query Q, we segment it into p disjoint windows of length w, and compute their mean values (Line 1). We assume that m is an integral multiple of w. If not, we keep the longest prefix whose length is a multiple of w; according to the analysis in Section III, the rest can be ignored safely.
The main matching process consists of two phases: an index-probing phase, which generates candidate positions by scanning the index, and a post-processing phase, which verifies each candidate against the raw data.
Note that all four query types share the same matching process; the only difference is that in the index-probing phase, different query types use different row ranges for each window, as introduced in Section III.
V-B Window Interval Generation
For each window Q_i, we first calculate the range of μ_{S_i}, [L_i, U_i], according to the query type. Then we visit KV-index with a single scan operation, which obtains a list of consecutive rows, from the row whose key range covers L_i to the row whose key range covers U_i. Note that the first (or last) row may contain mean values outside the range. However, this only brings negative candidates, without missing any positive one.
We denote the set of all window intervals in these rows as P_i, and use WI ∈ P_i to indicate that window interval WI belongs to P_i. For any window position j in such a WI, μ_{W_j} falls within the key ranges of the scanned rows.
According to Eq. (7) and Eq. (8), we denote the number of window intervals in P_i as N_i, and the number of window positions as n_i. Note that the window intervals in P_i are disjoint with each other. To facilitate the following "intersection" operation, we sort these intervals in ascending order (Line 7).
V-C The Matching Algorithm
Based on P_i (1 ≤ i ≤ p), we generate the final candidate set CS with an "intersection" operation. We first introduce the candidate set for P_i, denoted as CS_i (1 ≤ i ≤ p). For window Q_1, any window position j in P_1 maps to a candidate subsequence S_{j,m}. Therefore, the candidate set CS_1 is composed of all positions in P_1. CS_1 is still organized as a sequence of ordered non-overlapping window intervals, like P_1.
For P_2, each window position also corresponds to a candidate subsequence. However, position j in P_2 corresponds to the candidate subsequence S_{j−w,m}, because W_j is its second disjoint window. So the candidate set CS_2 can be obtained by left-shifting each window position in P_2 by w. Similarly, CS_3 is obtained by left-shifting the positions in P_3 by 2w. In general, for window Q_i (1 ≤ i ≤ p), the candidate set is CS_i = { j − (i−1)·w : j ∈ P_i }.
The shifting offset for P_i is (i−1)·w. All candidate sets CS_i (1 ≤ i ≤ p) are still organized as ordered sequences of non-overlapping window intervals. Moreover, it can be easily inferred that CS_i contains the same number of window intervals and window positions as P_i.
By combining the lemmas in Section III and the definition of CS_i, we obtain two important properties.
Property 1.
If position j is not contained in some CS_i (1 ≤ i ≤ p), then S_{j,m} and Q are not in match.
Property 2.
If S_{j,m} and Q are in match, then position j belongs to all candidate sets, that is, j ∈ ∩_{i=1}^{p} CS_i.
Now we present our approach to intersect the CS_i's to generate the final CS. It consists of p rounds (Lines 2-12). In the first round, we fetch P_1 from the index, generate CS_1, and initialize CS as CS_1. In the second round, we fetch P_2 and generate CS_2 by shifting all window intervals in P_2 by w (Lines 9-10). Then we intersect CS with CS_2 to obtain the up-to-date CS (Line 12). Because the intervals in CS, as well as in CS_2, are ordered, the intersection can be executed by sequentially intersecting the window intervals of CS and CS_2, which is quite similar to the mergesort algorithm and has linear complexity. In general, during the i-th round, we intersect CS_i with the CS of the last round to generate the up-to-date CS. After p rounds, we obtain the final candidate set CS.
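The shift-and-intersect rounds can be sketched as follows (our own illustrative code; intervals are inclusive [l, r] pairs, and window_sets holds P_1, …, P_p as ordered disjoint interval lists):

```python
def shift(intervals, offset):
    """Left-shift every interval by offset (candidate-set alignment)."""
    return [[l - offset, r - offset] for l, r in intervals]

def intersect(a, b):
    """Intersection of two ordered, disjoint interval lists, mergesort-style."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        l = max(a[i][0], b[j][0])
        r = min(a[i][1], b[j][1])
        if l <= r:
            out.append([l, r])
        if a[i][1] < b[j][1]:   # advance the list that ends first
            i += 1
        else:
            j += 1
    return out

def candidate_set(window_sets, w):
    """Intersect P_1..P_p after shifting P_i by (i-1)*w, round by round."""
    cs = window_sets[0]
    for i, p_i in enumerate(window_sets[1:], start=1):
        cs = intersect(cs, shift(p_i, i * w))
    return cs
```

Each round touches every interval of both lists at most once, which is where the sequential-scan efficiency of KV-match comes from.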
We illustrate the algorithm with the example in Fig. 7, in which both P_1 and P_2 contain three intervals. CS_1 equals P_1, while CS_2 is generated by left-shifting the intervals of P_2 with offset w. We then intersect CS_1 and CS_2 to get the CS of the second round.