Exact Indexing of Time Series under Dynamic Time Warping

02/11/2020 ∙ by Zhengxin Li, et al.

Dynamic time warping (DTW) is a robust similarity measure of time series. However, it does not satisfy the triangle inequality and has high computational complexity, severely limiting its applications in similarity search on large-scale datasets. Usually, we resort to lower bounding distances to speed up similarity search under DTW. Unfortunately, there is still a lack of an effective lower bounding distance that can measure unequal-length time series and has desirable tightness. In this paper, we propose a novel lower bounding distance LB_Keogh+, which is a seamless combination of sequence extension and LB_Keogh. It can be used for unequal-length sequences and has low computational complexity. Besides, LB_Keogh+ can extend sequences to an arbitrary suitable length, without significantly reducing tightness. Next, based on LB_Keogh+, an exact index of time series under DTW is devised. Then, we introduce several theorems and complete the relevant proofs to guarantee no false dismissals in our similarity search. Finally, extensive experiments are conducted on real-world datasets. Experimental results indicate that our proposed method can perform similarity search of unequal-length sequences with high tightness and good pruning power.


1 Introduction

With the rapid development of information technology, time series now pervade almost every field of human activity, such as finance [Sezer2018Algorithmic], traffic [Hong2018A], medicine [Pradhan2017Association], meteorology [Wang2018A], hydrology [Vesakoski2017Arctic] and multimedia [Zhao2017Real]. Similarity search is one of the most fundamental problems in time series data mining [Qin2018Salient]. It can be applied in many scenarios, for example, searching for stocks with similar fluctuations [Rubio2017Improving] or looking for patients with similar EEG [Chaovalitwongse2006EEG].

Any discussion of similarity search must start with the similarity measure. The most common similarity measures of time series are Euclidean distance and DTW distance. Euclidean distance is parameter-free and has linear complexity, but it is sensitive to shifting and scaling along the time axis [2017articleDTW]. DTW is able to handle shifting and scaling by searching for an optimal match between the points of two sequences. Besides, DTW can measure the similarity of time series with different lengths, and achieve high matching precision [Jeong2011Weighted]. Although dozens of alternatives have been considered, there is increasing evidence that DTW distance is the best measure in most domains [Ding2008].

However, the high computational complexity of DTW limits the applications of similarity search on large-scale datasets. Therefore, research on efficient similarity search under DTW is important in both theory and practice. In this paper, we propose a novel similarity search method under DTW, which can effectively improve the search efficiency for time series with different lengths and guarantee no false dismissals.

The remainder of the paper is organized as follows. In Section 2, we formulate the similarity search problem, give a brief review of related work and state our motivation for this work. In Section 3, we propose a novel method of sequence extension, and combine it with LB_Keogh to form our lower bounding distance LB_Keogh+. In Section 4, we devise an efficient index of time series and give the procedure of similarity search under DTW. In Section 5, we conduct extensive experiments to evaluate the validity of the proposed method. Finally, we draw the conclusions and present future work in Section 6.

2 Background

A time series is a continuous set of observations (), arriving in time sequence [gorecki2018classification]. Since time series from a dataset usually have the same time intervals, they can be simply denoted as , where is the length of the sequence.

Given two time series , , their DTW distance is defined as [berndt1994using]:

(1)

where is the base distance:

(2)

In this paper, we mainly focus on $\varepsilon$-range search, which can be described as: given a query sequence $Q$ and a time series dataset, we need to retrieve every candidate sequence $C$ from the dataset such that $DTW(Q, C) \le \varepsilon$.

Since DTW has high computational complexity, the retrieval efficiency of sequential scan is usually unacceptable. Even worse, DTW does not satisfy the triangle inequality, making it difficult to devise an appropriate index of time series under DTW distance. Most methods therefore devise lower bounding distances to build an index of time series. Then, the index and the lower bounding distances are used to speed up similarity search under DTW. The key to these methods is their lower bounding distances.

There are three desirable properties of a lower bounding distance [Nguyen2012Comparing]. First, it must not incur false dismissals: the lower bounding distance must be smaller than or equal to the DTW distance. Second, it must be fast to compute: we would like its computational complexity to be linear in the length of the sequences. Third, it must be relatively tight: the lower bounding distance should be close to the DTW distance.

Among existing lower bounding distances, LB_Kim [Kim2001An], LB_Yi [yi1998efficient] and LB_Keogh [keogh2005exact] are the most representative techniques. They all have linear computational complexity and can guarantee no false dismissals.

LB_Kim uses $L_\infty$ instead of $L_1$ or $L_2$ as its base distance. Unlike most other methods, the DTW distance defined by LB_Kim is not the sum of the base distance. Based on its defined DTW distance, LB_Kim extracts four points (the first and last points, the maximum and minimum points) from each sequence, and uses these feature points to calculate the lower bounding distance. LB_Kim can be used for sequences of unequal length.

Given two time series, LB_Yi chooses a sequence as the criterion. Then, it extracts some points of the other sequence to calculate the lower bounding distance, such that these extracted points are larger than the maximum value, or smaller than the minimum value of the criterion sequence. LB_Yi can also handle unequal-length sequences.

LB_Keogh extracts the upper and lower boundary sequences from the query sequence $Q$, by making use of the global or local constraint of the warping path. The area enclosed by the upper and lower boundary sequences is called the envelope. Based on the points of the candidate sequence $C$ that do not fall into the envelope, LB_Keogh calculates the lower bounding distance between $Q$ and $C$.

LB_Keogh has been verified to have higher tightness and pruning power than LB_Kim and LB_Yi. Therefore, it has been recognized as the best one and attracts extensive attention. Since then, Zhu [zhu2003warping], Zhou [zhou2011boundary] and Li [Li2014Extensions], etc. respectively proposed their improvements on LB_Keogh to further enhance tightness and pruning power.

However, LB_Keogh and its various improved methods can only deal with time series of the same length. This defect alone is enough to seriously limit their practical application, because it is difficult to require time series in the real world to have the same length. In addition, the defect of LB_Keogh forces DTW distance to abandon its unique advantage of being able to measure unequal-length sequences.

From this perspective, we try to propose a novel method of sequence extension to remove this defect of LB_Keogh; and then, we give the procedure of index building and similarity search; finally, relevant proofs are completed to guarantee no false dismissals of our proposed method. Basic notations in the work are summarized in Table I.

3 Proposed lower bounding distance of DTW

In this section, the calculation mechanism of DTW distance is analyzed. Then, we propose a novel lower bounding distance LB_Keogh+, which is a seamless combination of sequence extension and LB_Keogh. After that, we introduce several theorems and complete the relevant proofs to guarantee no false dismissals of LB_Keogh+.

DTW distance
Base distance of DTW
Subsequence of from the -th to the -th element
The extended sequence of by our proposed method
PAA form of a sequence
Query sequence and its extended sequence
Candidate sequence and its extended sequence
Warping matrix
Warping path
The -th element of a warping path
The upper and lower boundary sequences of
The upper and lower boundary sequences of
The length after sequence extension
The number of sequences in a dataset
TABLE I: The notations in this work.

3.1 The warping path of dynamic time warping

To calculate the DTW distance of two sequences $Q = (q_1, \ldots, q_m)$ and $C = (c_1, \ldots, c_n)$, we usually construct an $m \times n$ warping matrix [Salvador2007Toward], where the element $(i, j)$ is the base distance $d(q_i, c_j)$ between $q_i$ and $c_j$. A warping path is a set of elements in the warping matrix, as illustrated in Fig. 1(c), which is denoted as:

$$W = (w_1, w_2, \ldots, w_K) \tag{3}$$

where $w_k = (i, j)_k$ denotes the matching relationship between $q_i$ and $c_j$. Thus, a warping path defines a kind of mapping between all the points of $Q$ and $C$, as illustrated in Fig. 1(b)-(c). Besides, a warping path is subject to the following constraints [Morel2017Time].

Boundary condition: $w_1 = (1, 1)$ and $w_K = (m, n)$. A warping path should start from $(1, 1)$ and finish with $(m, n)$.

Monotonicity: Given $w_k = (i, j)$ and $w_{k+1} = (i', j')$, we have $i' - i \ge 0$ and $j' - j \ge 0$. This requires that a warping path must increase monotonically in the time dimension.

Continuity: Given $w_k = (i, j)$ and $w_{k+1} = (i', j')$, we have $i' - i \le 1$ and $j' - j \le 1$. This restricts the allowable steps in a warping path to adjacent elements (including diagonally adjacent elements).

Fig. 1: A warping path in the warping matrix [keogh2005exact].

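For two sequences $Q$ and $C$, there are many warping paths that satisfy the above three conditions. DTW corresponds to the optimal path $W = (w_1, \ldots, w_K)$ that minimizes the total cumulative distance:

$$DTW(Q, C) = \min_{W} \sum_{k=1}^{K} d(w_k) \tag{4}$$

where $w_k = (i, j)_k$ and $d(w_k)$ is the base distance between $q_i$ and $c_j$.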
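The minimization in Eq. (4) is usually carried out by dynamic programming over the warping matrix. The following is a minimal sketch, assuming the absolute difference as the base distance (the paper's Eq. (2) may use a different one); the function name `dtw` and the `base` parameter are illustrative and not taken from the paper.

```python
import math


def dtw(q, c, base=lambda x, y: abs(x - y)):
    """Unconstrained DTW: minimum cumulative base distance over all warping paths."""
    m, n = len(q), len(c)
    # D[i][j] is the cost of the best path from cell (1, 1) to cell (i, j).
    # Row 0 and column 0 are sentinels that enforce the boundary condition.
    D = [[math.inf] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Monotonicity and continuity: a cell is reached from the cell
            # above, to the left, or diagonally above-left.
            step = min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
            D[i][j] = base(q[i - 1], c[j - 1]) + step
    return D[m][n]


# The repeated first point of the second sequence is absorbed by warping.
warped = dtw([1, 2, 3, 4], [1, 1, 2, 3, 4])
```

The table has one entry per cell of the warping matrix, which is where the quadratic cost of DTW comes from.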

Theorem 1.

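Given two sequences $Q$ and $C$, there is a warping path $W' = (w'_1, \ldots, w'_{K'})$ in the warping matrix. If $\sum_{k=1}^{K'} d(w'_k) = v$, then we have $DTW(Q, C) \le v$.

Proof.

In the warping matrix formed by $Q$ and $C$, we can find an optimal warping path $W^* = (w^*_1, \ldots, w^*_{K})$. According to the definition of the optimal warping path, we can derive the following inequality:

$$\sum_{k=1}^{K} d(w^*_k) \le \sum_{k=1}^{K'} d(w'_k) \tag{5}$$

According to Eq. (4), we can further infer $DTW(Q, C) = \sum_{k=1}^{K} d(w^*_k)$. Because $\sum_{k=1}^{K'} d(w'_k) = v$, we can derive that $DTW(Q, C) \le v$ holds. ∎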

3.2 Additional constraints on a warping path

Some warping paths satisfying the above constraints may have pathological shape, where a small section of one sequence maps onto a large section of another. To avoid these pathological shapes, global and local constraints are introduced to define the feasible scope of a warping path, called the warping window.

Sakoe-Chiba band and Itakura parallelogram are the most frequently used global constraints, as illustrated in Fig. 2. Their warping windows are respectively a band and a parallelogram in the diagonal direction. Local constraints define the permissible steps by the current position of a warping path. In some cases, they can be reinterpreted as global constraints. Here, we do not elaborate further.

(a) Sakoe-Chiba band
(b) Itakura-Parallelogram
Fig. 2: Global constraints on warping path [keogh2005exact].

Without loss of generality, we mainly consider the Sakoe-Chiba band in this paper, which seems to be the most common constraint used in practice [Rabiner1978Considerations][Dynamic1978programming]. It constrains every element $w_k = (i, j)_k$ of a warping path such that $|i - j| \le r$, where $r$ is a constant specifying the matching range of each point. Thus, the difference of the lengths of any two sequences should satisfy the following theorem.

Theorem 2.

Given two sequences $Q$, $C$, every warping path is constrained by the Sakoe-Chiba band with the constant $r$. When we calculate $DTW(Q, C)$, the length difference between $Q$ and $C$ should not be greater than $r$.

Proof.

We prove the theorem by contradiction.

Let $m$ and $n$ denote the lengths of $Q$ and $C$. We assume the length difference between $Q$ and $C$ is greater than $r$ when calculating $DTW(Q, C)$. That is, for the two sequences $Q$ and $C$, we have $m - n > r$ or $n - m > r$.

According to the boundary condition of a warping path, every warping path should finish with $(m, n)$, which means $q_m$ must match $c_n$.

The warping path is also subject to the constraint of the Sakoe-Chiba band. If $q_m$ matches $c_n$, we can infer $|m - n| \le r$. Further, we have $m - n \le r$ and $n - m \le r$, which contradicts the previous assumption.

Therefore, we can deduce that the difference of the lengths of any two sequences should not be greater than $r$. ∎

In the rest of this paper, we suppose the length difference between a query sequence and any candidate sequence is not greater than $r$.
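The effect of the Sakoe-Chiba band, and the length condition of Theorem 2, can be observed directly in a banded version of the dynamic program. This is a minimal sketch, assuming the absolute difference as base distance; `dtw_band` is an illustrative name, and the example sequences are made up for the demonstration.

```python
import math


def dtw_band(q, c, r):
    """DTW restricted to warping-matrix cells (i, j) with |i - j| <= r."""
    m, n = len(q), len(c)
    D = [[math.inf] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0.0
    for i in range(1, m + 1):
        # Cells outside the band keep the value inf and are never used.
        for j in range(max(1, i - r), min(n, i + r) + 1):
            step = min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
            D[i][j] = abs(q[i - 1] - c[j - 1]) + step
    return D[m][n]


q = [0.0, 1.0, 2.0, 1.0, 0.0, 1.0]   # length 6
c = [0.0, 2.0, 1.0]                  # length 3

feasible = dtw_band(q, c, r=3)       # length difference 3 <= r
infeasible = dtw_band(q, c, r=2)     # length difference 3 > r, no admissible path
```

When the length difference exceeds the band constant, the final cell lies outside the band and the sketch returns `math.inf`, which is the situation Theorem 2 rules out. Widening the band can only add admissible paths, so the returned distance never increases with a larger constant.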

Theorem 3.

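Given two sequences $Q$ and $C$, under the constraints $r_1$ and $r_2$ of the Sakoe-Chiba band, we respectively calculate their DTW distances $DTW_{r_1}(Q, C)$ and $DTW_{r_2}(Q, C)$. If we have $r_1 \le r_2$, then $DTW_{r_2}(Q, C) \le DTW_{r_1}(Q, C)$ holds.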

Proof.

For two sequences and , we assume is the optimal path under the constraint of Sakoe-Chiba band. According to Eq. (4), we have

(6)

When the width of the warping window expands from to (), must be a feasible path in the warping matrix, as illustrated in Fig. 2(a).

That is, we can find a warping path under the constraint . And we have deduced that Eq. (6) holds. According to Theorem 1, we can infer Theorem 3 holds. ∎

3.3 Lower bounding distance

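Given a query sequence $Q$ with elements $q_i$, its upper and lower boundary sequences $U$ and $L$ are respectively defined as:

$$U_i = \max(q_{i-r}, \ldots, q_{i+r}), \qquad L_i = \min(q_{i-r}, \ldots, q_{i+r}) \tag{7}$$

where $r$ is the constant that comes with the Sakoe-Chiba band. The query sequence $Q$ is enclosed in the region formed by $U$ and $L$, as illustrated in Fig. 3. The region between $U$ and $L$ is called the envelope.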

Fig. 3: A query sequence and its upper and lower boundary sequences.

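For any sequence $C$ in a dataset, provided that $C$ has the same length $n$ as $Q$, the lower bounding distance of DTW between $Q$ and $C$ can be defined as [keogh2005exact]:

$$LB\_Keogh(Q, C) = \sum_{i=1}^{n} \begin{cases} d(c_i, U_i) & \text{if } c_i > U_i \\ d(c_i, L_i) & \text{if } c_i < L_i \\ 0 & \text{otherwise} \end{cases} \tag{8}$$

LB_Keogh calculates the sum of the base distance between every point of $C$ that does not fall into the envelope and the corresponding point in the nearest boundary sequence. Fig. 4 visualizes the meaning of LB_Keogh, where the shaded areas represent the parts that are accumulated by the base distance.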
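The envelope and the lower bound described above can be sketched in a few lines. The code assumes the absolute difference as base distance and clips the envelope window at the ends of the sequence; the names `envelope`, `lb_keogh` and `dtw_band` are illustrative, and the banded DTW is included only to compare the bound against the true distance.

```python
import math


def envelope(q, r):
    """Upper and lower boundary sequences of q for a Sakoe-Chiba constant r."""
    n = len(q)
    upper = [max(q[max(0, i - r):i + r + 1]) for i in range(n)]
    lower = [min(q[max(0, i - r):i + r + 1]) for i in range(n)]
    return upper, lower


def lb_keogh(q, c, r):
    """Sum of distances from the points of c outside the envelope of q to the nearest boundary."""
    if len(q) != len(c):
        raise ValueError("LB_Keogh needs sequences of the same length")
    upper, lower = envelope(q, r)
    total = 0.0
    for ci, u, lo in zip(c, upper, lower):
        if ci > u:
            total += ci - u      # point above the envelope
        elif ci < lo:
            total += lo - ci     # point below the envelope
    return total


def dtw_band(q, c, r):
    """DTW restricted to cells (i, j) with |i - j| <= r."""
    m, n = len(q), len(c)
    D = [[math.inf] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(max(1, i - r), min(n, i + r) + 1):
            D[i][j] = abs(q[i - 1] - c[j - 1]) + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[m][n]


q = [0.0, 1.0, 3.0, 2.0, 0.0, -1.0, 0.0, 1.0]
c = [2.0, 4.0, 4.0, 0.0, -3.0, -3.0, 2.0, 3.0]
bound, exact = lb_keogh(q, c, r=1), dtw_band(q, c, r=1)
```

Computing the bound takes one pass over the candidate once the envelope is known, whereas the banded DTW fills a strip of the warping matrix.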

Fig. 4: A visual intuition of LB_Keogh

In addition, LB_Keogh satisfies the following theorem. Please refer to [keogh2005exact] for detailed proof.

Theorem 4.

For two sequences $Q$ and $C$ of the same length, if every warping path is constrained by the Sakoe-Chiba band with a constant $r$, the inequality $LB\_Keogh(Q, C) \le DTW(Q, C)$ holds.

When LB_Keogh is used for similarity search, Theorem 4 ensures that there are no false dismissals. Besides, it has low computational complexity and good tightness. Therefore, LB_Keogh is so far the best lower bounding distance of DTW. However, it can only measure sequences of the same length, severely limiting its application in practical scenarios.

3.4 Technique of sequence extension

In order to solve the intractable problem above, we propose a novel method of sequence extension. Given two sequences , , we add the same arbitrary constant after them to construct two sequences of the same length:

(9)

where the constraint means that we should add at least one element to the longer sequence.

In form, the extended equal-length sequences can be directly used in LB_Keogh. More importantly, we need to further analyze whether the proposed method can guarantee no false dismissals when LB_Keogh on the extended sequences is used for similarity search.

Theorem 5.

Given two sequences $Q$, $C$, they are extended to the equal-length sequences $Q'$, $C'$ by Eq. (9). Then, the following inequality holds:

$$DTW(Q', C') \le DTW(Q, C) \tag{10}$$
Proof.

For the original sequences and , we assume that is the optimal warping path, as illustrated in Fig. 5(a). According to Eq. (4), we have

(11)

For the extended equal-length sequences and , we can simply construct a warping path , as illustrated in Fig. 5(b).

(12)

The front part of is exactly the same as , contained in the red rectangle of Fig. 5(b). The back part of is formed by and , which are the extended parts of and , as shown in the blue rectangle of Fig. 5(b).

(a) The optimal path of and
(b) A warping path of and
Fig. 5: A warping path of the extended equal-length sequences.

Because , from we can infer:

(13)

For the whole warping path , we have:

(14)

We substitute Eq. (11) and Eq. (13) into Eq. (14):

(15)

That is, we can find a warping path in the warping matrix formed by and . The total cumulative distance of is . According to Theorem 1, we can deduce that Eq. (10) holds. ∎
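Theorem 5 can be checked numerically on a small example. The sketch below assumes the absolute difference as base distance and uses 0 as the appended constant; `extend` and `dtw_band` are illustrative names, and the two sequences are made up for the demonstration.

```python
import math


def extend(seq, length, a=0.0):
    """Append the constant a to seq until it reaches the requested length."""
    if length <= len(seq):
        raise ValueError("the extended length must exceed the original length")
    return list(seq) + [a] * (length - len(seq))


def dtw_band(q, c, r):
    """DTW restricted to cells (i, j) with |i - j| <= r."""
    m, n = len(q), len(c)
    D = [[math.inf] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(max(1, i - r), min(n, i + r) + 1):
            D[i][j] = abs(q[i - 1] - c[j - 1]) + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[m][n]


q = [1.0, 3.0, 4.0, 2.0, 0.0, 1.0]   # length 6
c = [1.0, 4.0, 3.0, 0.0, 2.0]        # length 5
r = 2

# At least one element is added to the longer sequence.
length = max(len(q), len(c)) + 1
q_ext, c_ext = extend(q, length), extend(c, length)

original = dtw_band(q, c, r)
extended = dtw_band(q_ext, c_ext, r)
```

The appended elements match each other at zero cost, so the path built in the proof costs exactly as much as the optimal path of the original sequences, and the optimum over the extended sequences cannot be larger.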

Combining Theorem 5 with Theorem 4, we have $LB\_Keogh(Q', C') \le DTW(Q', C') \le DTW(Q, C)$, so the proposed method of sequence extension can guarantee no false dismissals. Under the constraint in Eq. (9), we need to further discuss another important problem: how does the extended length affect similarity search?

Theorem 6.

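Given two sequences $Q$, $C$, they are extended to the equal-length sequences $Q'$, $C'$ by Eq. (9). Then we further append the same constant $a$ the same number of times to $Q'$ and $C'$, to obtain two new sequences of the same length:

$$Q'' = (Q', a, \ldots, a), \qquad C'' = (C', a, \ldots, a) \tag{16}$$

Then, the following two relations hold:

$$DTW(Q'', C'') = DTW(Q', C') \tag{17}$$
$$LB\_Keogh(Q'', C'') = LB\_Keogh(Q', C') \tag{18}$$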
Proof.

For the original sequences and , we assume their extended equal-length sequences are denoted as , .

Because , and should add at least one element to the end of and . Without loss of generality, we assume and add one element behind to get . Correspondingly, we add elements behind to get .

Then we can get the optimal matching relations of all points from and , as illustrated in Fig. 6(a), which correspond to the optimal warping path. We get by calculating the sum of the base distance between all those matching points.

Further, we add the same number of elements behind and , to form another extended equal-length sequences , . It’s worth noting that can be any positive integer under the constraint , which means that the length of sequence extension in Eq. (16) is arbitrary.

We can find the optimal matching relations of all points from and in the process of calculating . To simplify the analysis, we divide and into three contiguous segments, as illustrated in Fig. 6(b).

(a) The first extended sequences ,
(b) The second extended sequences ,
Fig. 6: The calculation of DTW distance for extended sequences.

Segment a. This segment is formed by , . When calculating , due to the constraint of Sakoe-Chiba band, all the points from and in this segment will not be affected by the extended sequences , . For all the points of and in this segment, their optimal matching relations are the same as those of and .

Segment b. This segment is formed by , . The points from and in this segment have the possibility to be affected by the extended sequences , . In the process of calculating DTW, any point on a sequence can replicate themselves. Because and . The extended sequences , can be understood as self-replications of and . For all the points of and in this segment, the sum of the base distances of these points is equal to the counterpart of and .

Segment c. This segment is formed by , . Because . For all the points of and in this segment, the sum of the base distance of these points is equal to 0.

Because is calculated by the sum of the base distance between all those matching points, corresponding to the optimal warping path. To sum up the above analysis, we can infer Eq. (17) holds.

We assume the upper and lower boundary sequences of are denoted as:

(19)

where , are defined in Eq. (7). Then we have

(20)

Similarly, the upper and lower boundary sequences of are denoted as , . According to the definition of in Eq. (16), we can derive that the first elements of , are just and :

(21)

According to the definition of LB_Keogh, we have

(22)

We substitute Eq. (20) into Eq. (22):

(23)

Because , according to the definition of upper and lower boundary sequences in Eq. (7), we have

(24)

Therefore, we can deduce that Eq. (18) holds.

Similarly, we can prove that Theorem 6 still holds when the constraint of the Sakoe-Chiba band is removed. ∎

From Theorem 6, we can get an important property of the proposed method of sequence extension. If the original sequences $Q$, $C$ are extended to $Q'$, $C'$ by Eq. (9), neither $DTW(Q', C')$ nor $LB\_Keogh(Q', C')$ depends on the extended length.
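This independence from the extended length can be observed by extending the same pair of sequences to several lengths. The sketch assumes the absolute difference as base distance, 0 as the appended constant, and an envelope window clipped at the sequence ends; all function names are illustrative.

```python
import math


def extend(seq, length, a=0.0):
    """Append the constant a to seq until it reaches the requested length."""
    return list(seq) + [a] * (length - len(seq))


def lb_keogh(q, c, r):
    """LB_Keogh with absolute differences for two sequences of the same length."""
    total = 0.0
    for i, ci in enumerate(c):
        window = q[max(0, i - r):i + r + 1]
        upper, lower = max(window), min(window)
        total += max(ci - upper, lower - ci, 0.0)
    return total


def dtw_band(q, c, r):
    """DTW restricted to cells (i, j) with |i - j| <= r."""
    m, n = len(q), len(c)
    D = [[math.inf] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(max(1, i - r), min(n, i + r) + 1):
            D[i][j] = abs(q[i - 1] - c[j - 1]) + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[m][n]


q = [1.0, 3.0, 4.0, 2.0, 0.0, 1.0]
c = [1.0, 4.0, 3.0, 0.0, 2.0]
r = 2
shortest = max(len(q), len(c)) + 1   # minimum admissible extended length

dtw_values, lb_values = [], []
for extra in range(5):
    q_ext = extend(q, shortest + extra)
    c_ext = extend(c, shortest + extra)
    dtw_values.append(dtw_band(q_ext, c_ext, r))
    lb_values.append(lb_keogh(q_ext, c_ext, r))
```

Both lists contain a single repeated value in this example, which is why the extended length can be chosen for convenience, for instance to make it divisible by the number of PAA segments.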

4 Similarity search under dynamic time warping

In order to improve the efficiency of similarity search, it is necessary to resort to indexes. If the raw time series are organized directly by spatial indexes, the performance of similarity search degrades seriously because of their high dimensionality. In this section, we reduce the dimension of time series, and organize them with spatial indexes. Then, we present the procedure of $\varepsilon$-range search.

4.1 Index of time series

In a dataset, any candidate sequence $C$ of length $l$ can be represented by an $N$-dimensional vector $\bar{C}$, where the $i$-th element is defined as:

$$\bar{c}_i = \frac{N}{l} \sum_{j = \frac{l}{N}(i-1)+1}^{\frac{l}{N} i} c_j \tag{25}$$

The transition from $C$ to $\bar{C}$ is called piecewise aggregate approximation (PAA). In practical applications, $l$ may not be exactly an integer multiple of $N$. We can simply solve the problem by using our sequence extension method. Detailed analysis will be carried out in the section of experiments.
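PAA replaces each of the equal-sized segments of a sequence by its mean. A minimal sketch follows; `paa` and `extend_to_multiple` are illustrative names, and padding with the constant 0 up to the next multiple of the segment count is one possible way to apply the sequence extension here, not necessarily the authors' choice.

```python
def paa(seq, n_segments):
    """Piecewise aggregate approximation: the mean of each equal-sized segment."""
    length = len(seq)
    if length % n_segments != 0:
        raise ValueError("sequence length must be a multiple of the number of segments")
    size = length // n_segments
    return [sum(seq[k * size:(k + 1) * size]) / size for k in range(n_segments)]


def extend_to_multiple(seq, n_segments, a=0.0):
    """Pad seq with the constant a up to the next larger multiple of n_segments."""
    target = (len(seq) // n_segments + 1) * n_segments
    return list(seq) + [a] * (target - len(seq))


series = [float(x) for x in range(1, 11)]    # length 10, not divisible by 4
padded = extend_to_multiple(series, 4)       # length 12
reduced = paa(padded, 4)                     # 4-dimensional representation
```

The reduced vectors are what a spatial index such as an R-tree would store in place of the full sequences.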

Given two sequences , , we transform them into equal-length sequences , . Then we have

(26)

The distance between and is defined as:

(27)

We have the following inequality. Please refer to [Keogh2001Dimensionality][Yi2000Fast] for detailed proof.

(28)

For any candidate sequence in a dataset, we extend it to the sequence of length by our proposed method. Then, is transformed into -dimensional vector by PAA. Thus, we can make use of spatial indexes to organize these -dimensional vectors.

Without loss of generality, we use R-Tree to organize the PAA form of candidate sequences. Supposing is a leaf node of R-Tree, the MBR (Minimum Bounding Rectangle) related to the leaf node is denoted as , where , are the lower and upper boundaries of MBR. Any candidate sequence, the PAA form of which is contained in the MBR, will be included in the leaf node .

4.2 The procedure of similarity search

We extract the features of a query sequence and take them as the input of $\varepsilon$-range search. For a query sequence $Q$, we extend it to the sequence $Q'$ of length $l$ by our proposed method. Next, we obtain the upper and lower boundary sequences of $Q'$, denoted as $U$ and $L$. Then, $U$ and $L$ are transformed into $N$-dimensional vectors $\bar{U}$ and $\bar{L}$ by PAA.

In order to complete similarity search, we need to introduce another two lower bounding distances: LB_PAA and LB_MBR.

LB_PAA between and is defined as:

(29)

According to Eq. (28), we have

(30)
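The relation between a PAA-level bound and LB_Keogh can be illustrated as follows. Since the exact form of Eq. (29) depends on the base distance, this sketch makes its own assumptions: the absolute difference as base distance, and a bound built from the segment means of the candidate and of the two boundary sequences, scaled by the segment size. Under these assumptions the PAA-level value never exceeds LB_Keogh; the function names are illustrative.

```python
def envelope(q, r):
    """Upper and lower boundary sequences of q for a Sakoe-Chiba constant r."""
    n = len(q)
    upper = [max(q[max(0, i - r):i + r + 1]) for i in range(n)]
    lower = [min(q[max(0, i - r):i + r + 1]) for i in range(n)]
    return upper, lower


def paa(seq, n_segments):
    """Mean of each equal-sized segment (length must divide evenly)."""
    size = len(seq) // n_segments
    return [sum(seq[k * size:(k + 1) * size]) / size for k in range(n_segments)]


def lb_keogh(q, c, r):
    """LB_Keogh with absolute differences."""
    upper, lower = envelope(q, r)
    return sum(max(ci - u, lo - ci, 0.0) for ci, u, lo in zip(c, upper, lower))


def lb_paa(q, c, r, n_segments):
    """Assumed PAA-level bound: segment size times the per-segment envelope violation."""
    upper, lower = envelope(q, r)
    u_bar, l_bar, c_bar = paa(upper, n_segments), paa(lower, n_segments), paa(c, n_segments)
    size = len(c) // n_segments
    return size * sum(max(cb - ub, lb - cb, 0.0) for cb, ub, lb in zip(c_bar, u_bar, l_bar))


q = [0.0, 1.0, 3.0, 2.0, 0.0, -1.0, 0.0, 1.0]
c = [2.0, 4.0, 4.0, 0.0, -3.0, -3.0, 2.0, 3.0]
coarse = lb_paa(q, c, r=1, n_segments=4)
fine = lb_keogh(q, c, r=1)
```

Averaging within a segment can only hide envelope violations, never create them, so the coarse bound is looser than LB_Keogh but is computed in the reduced dimension.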

LB_MBR between and MBR is defined as:

(31)

Given the extended query sequence and MBR , for any extended candidate sequence contained in , we have the following inequality:

(32)

For detailed proof, please refer to [keogh2005exact]. The procedure of $\varepsilon$-range search is summarized in Algorithm 1.

Input: a query sequence $Q$, the time series dataset, the root node $T$ of the R-tree, and the distance threshold $\varepsilon$.
Output: the result set R of $\varepsilon$-range search.
Initialize $Q'$, $U$, $L$, $\bar{U}$, $\bar{L}$;
if $T$ is a non-leaf node then
   for each child node $T_c$ of $T$ do
      // $M$ is the MBR corresponding to node $T_c$.
      if LB_MBR($Q'$, $M$) $\le \varepsilon$ then
         RangeSearch($T_c$);
      end if
   end for
else
   for each PAA point $\bar{C}$ in $T$ do
      if LB_PAA($Q'$, $\bar{C}$) $\le \varepsilon$ then
         retrieve the original sequence $C$ from the dataset;
         if DTW($Q$, $C$) $\le \varepsilon$ then
            add $C$ to R;
         end if
      end if
   end for
end if
Algorithm 1 RangeSearch()
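The filter-and-refine logic of the leaf branch of Algorithm 1 can be sketched without an R-tree. The code below swaps the tree traversal for a linear scan over the PAA vectors, so it shows only the pruning and refinement steps, not the index. It assumes the absolute difference as base distance, 0 as the appended constant, and a PAA-level bound built from segment means; the dataset is synthetic and all names are illustrative.

```python
import math
import random


def extend(seq, length, a=0.0):
    """Append the constant a to seq until it reaches the requested length."""
    return list(seq) + [a] * (length - len(seq))


def paa(seq, n_seg):
    """Mean of each equal-sized segment."""
    size = len(seq) // n_seg
    return [sum(seq[k * size:(k + 1) * size]) / size for k in range(n_seg)]


def envelope(q, r):
    """Upper and lower boundary sequences of q."""
    n = len(q)
    upper = [max(q[max(0, i - r):i + r + 1]) for i in range(n)]
    lower = [min(q[max(0, i - r):i + r + 1]) for i in range(n)]
    return upper, lower


def dtw_band(q, c, r):
    """DTW restricted to cells (i, j) with |i - j| <= r."""
    m, n = len(q), len(c)
    D = [[math.inf] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(max(1, i - r), min(n, i + r) + 1):
            D[i][j] = abs(q[i - 1] - c[j - 1]) + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[m][n]


def range_search(query, dataset, eps, r, length, n_seg):
    """Prune with a PAA-level bound, then refine with DTW on the original sequences."""
    upper, lower = envelope(extend(query, length), r)
    u_bar, l_bar = paa(upper, n_seg), paa(lower, n_seg)
    size = length // n_seg
    result = []
    for idx, cand in enumerate(dataset):
        # In an index, this PAA vector would be precomputed and stored.
        c_bar = paa(extend(cand, length), n_seg)
        bound = size * sum(max(cb - ub, lb - cb, 0.0) for cb, ub, lb in zip(c_bar, u_bar, l_bar))
        if bound > eps:
            continue                         # pruned without computing DTW
        if dtw_band(query, cand, r) <= eps:
            result.append(idx)
    return result


rng = random.Random(0)
dataset = [[float(rng.randint(0, 9)) for _ in range(rng.randint(12, 15))] for _ in range(50)]
query = [float(rng.randint(0, 9)) for _ in range(14)]
hits = range_search(query, dataset, eps=20.0, r=3, length=16, n_seg=4)
```

Because the bound never exceeds the DTW distance of the original sequences, the returned indices coincide with those of a brute-force scan that computes DTW for every candidate.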

4.3 Analysis of effectiveness and complexity

So far, we have introduced many kinds of lower bounding distances. LB_PAA and LB_MBR are used for similarity search on spatial indexes, and LB_Keogh is the basis for the two methods. Fig. 7 illustrates the relationship between different distances. For two original sequences , , if , we can get and these lower bounding distances are all less than or equal to . Therefore, our proposed similarity search can guarantee no false dismissals.

Fig. 7: The relationship between different distances.

The complexity of DTW distance between $Q$ and $C$ is $O(mn)$, where $m$, $n$ are the lengths of $Q$ and $C$ respectively. In our proposed method, the computational cost of sequence extension is very low and can even be ignored. The complexity of LB_Keogh, LB_PAA and LB_MBR is at most linear in the length of sequences. Compared with sequential scan under DTW distance, our similarity search can effectively improve the retrieval efficiency.

5 Experimental evaluation

Our experiments are carried out on a PC with Intel Core i7-8550U CPU and 16 GB RAM, running with Matlab R2018a. The proposed method is evaluated on 10 benchmark datasets coming from the UCR time series repository [UCRDataSet], where the lengths of the sequences on a dataset are all the same. To obtain sequences of different lengths, we truncate a random length at the end of every sequence, such that any pair of the sequences on a dataset satisfies Theorem 2.

These datasets cover a wide range of applications, such as energy, medicine, image matching, motion recognition, etc. The average lengths of the sequences in these datasets vary from 23 to 683. More information of the datasets is shown in Table II.

In our experiments, the constraint of the Sakoe-Chiba band is set to 10% of the length of the longest original sequence in a dataset, because this value appears to be the most commonly used in the literature.

5.1 The evaluation of validity for sequence extension

Due to space limitation, we just choose GunPoint dataset to evaluate the validity of sequence extension. It involves one female actor and one male actor making a motion with their hand, and contains two classes: Gun-Draw and Point, as illustrated in Fig. 8.

ID Dataset Length Av.length Class Sample
1 ItalyPowerDemand 22–24 23 2 1096
2 SyntheticControl 54–60 57 6 600
3 ECG5000 126–140 133 5 5000
4 GunPoint 135–150 143 2 200
5 WordSynonyms 243–270 256 25 905
6 Words50 243–270 257 50 905
7 Symbols 359–398 379 6 1020
8 Yoga 384–426 405 2 3300
9 ShapesAll 461–512 486 60 1200
10 Computers 648–720 683 2 500
TABLE II: The details of the benchmark datasets.

For Gun-Draw the actors have their hands by their sides. They draw a replica gun from a hip-mounted holster, point it at a target for approximately one second, then return the gun to the holster, and their hands to their sides. For Point the actors have their gun by their sides. They point with their index fingers to a target for approximately one second, and then return their hands to their sides. For both classes, we record the centroid of the actor's right hand along the X-axis; the two classes appear to be highly correlated.

Fig. 8: Description of GunPoint dataset.

We extend all the sequences to the same minimum length, according to Eq. (9). Then, we randomly choose one as the query sequence $Q'$, and calculate the DTW distance between $Q'$ and every candidate sequence $C'$. Correspondingly, the DTW distance between the original query $Q$ and every original sequence $C$ is also calculated. In Fig. 9, we can see that $DTW(Q', C')$ is always less than or equal to $DTW(Q, C)$, which verifies Eq. (10) in Theorem 5. In fact, there is almost no difference between $DTW(Q', C')$ and $DTW(Q, C)$, except for two candidate sequences.

Fig. 9: Results of sequence extension on GunPoint dataset.

We continue to increase the extended length to observe its effect on DTW distance. Fig. 10 shows the average values of $DTW(Q, C)$ and $DTW(Q', C')$ between the query sequence and every candidate sequence. With the increase of the extended length, the average value of $DTW(Q', C')$ remains the same. It indicates that the extension length does not affect DTW distance, which verifies Eq. (17) in Theorem 6.

Fig. 10: Results of increasing sequence length on GunPoint dataset.

We repeat the experiments above on the other datasets. Fig. 11 illustrates the experimental results, where the sequences in a dataset are all extended to the same minimum length. On all the benchmark datasets, the average values of $DTW(Q, C)$ and $DTW(Q', C')$ are close to each other. This means that sequence extension has little effect on DTW distance.

Fig. 11: Results of sequence extension on 10 benchmark datasets.

5.2 Comparison of lower bounding distances

Our proposed lower bounding distance is a seamless combination of sequence extension and LB_Keogh. Thus, we denote it as LB_Keogh+ in the paper. We compare LB_Keogh+ with LB_Kim and LB_Yi, which can also be used for unequal-length sequences.

We first evaluate LB_Keogh+ by its tightness, a value in [0, 1] defined as the ratio of the lower bounding distance to the DTW distance. The larger the tightness is, the better the lower bounding distance is. The tightness of LB_Keogh+ is written as:

$$\mathrm{Tightness} = \frac{LB\_Keogh^{+}(Q, C)}{DTW(Q, C)} \tag{33}$$

For an $\varepsilon$-range search on a dataset, we compute the average tightness between the query sequence $Q$ and each candidate sequence $C$. Fig. 12 and Table III illustrate the average tightness over 100 $\varepsilon$-range searches on every dataset.

Fig. 12: Tightness comparison on 10 benchmark datasets.

We can see that the tightness of LB_Keogh+ is clearly greater than that of LB_Kim and LB_Yi on every dataset, and LB_Kim has the lowest tightness on most datasets except for ItalyPowerDemand. The reason is that LB_Keogh+ makes more points of the sequences participate in the calculation of the lower bounding distance, whereas LB_Kim only chooses four feature points. When the sequence length is short, as in ItalyPowerDemand, the tightness of LB_Kim is acceptable. With the increase of sequence length, its tightness decreases sharply.

Dataset LB_Keogh+ LB_Yi LB_Kim
ItalyPowerDemand 0.3459 0.1083 0.2105
SyntheticControl 0.3225 0.0251 0.0678
ECG5000 0.3959 0.1419 0.0952
GunPoint 0.5823 0.5787 0.0396
WordSynonyms 0.3986 0.0798 0.0162
Words50 0.3939 0.0786 0.0122
Symbols 0.4839 0.0302 0.0128
Yoga 0.3810 0.0431 0.0124
ShapesAll 0.3749 0.0527 0.0092
Computers 0.4711 0.3131 0.0115
TABLE III: Results of tightness comparison.

Pruning power, a value in [0, 1], is another important indicator to evaluate lower bounding distances, which is defined as:

$$P = \frac{N_{\mathrm{pruned}}}{N_{\mathrm{total}}} \tag{34}$$

where $N_{\mathrm{total}}$ is the number of sequences whose DTW distance is calculated by the sequential scan method, and $N_{\mathrm{pruned}}$ is the number of sequences that do not require the calculation of DTW distance when a lower bounding distance is used. The larger $P$ is, the better the filtering effect of a lower bounding distance is.
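Both evaluation measures reduce to simple ratios once the lower bounds and DTW distances are available. In this minimal sketch, a sequence counts as pruned when its lower bound exceeds the threshold; the function names and the sample values are illustrative and do not come from the experiments.

```python
def tightness(lower_bound, dtw_distance):
    """Ratio of a lower bound to the DTW distance (defined for a positive DTW distance)."""
    if dtw_distance <= 0:
        raise ValueError("tightness needs a positive DTW distance")
    return lower_bound / dtw_distance


def pruning_power(lower_bounds, eps):
    """Fraction of candidates whose lower bound already exceeds the threshold eps."""
    pruned = sum(1 for lb in lower_bounds if lb > eps)
    return pruned / len(lower_bounds)


lower_bounds = [0.5, 2.0, 3.5, 1.0]          # hypothetical bounds for four candidates
power = pruning_power(lower_bounds, eps=1.5)
ratio = tightness(3.0, 4.0)
```

A larger threshold leaves fewer lower bounds above it, which matches the observation that pruning power falls as the search range grows.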

We calculate the average pruning power over 100 $\varepsilon$-range searches on the GunPoint, ECG5000, Yoga and Computers datasets.

ECG5000 dataset is a 20-hour long ECG downloaded from Physionet. The data are pre-processed in two steps: extract each heartbeat; make each heartbeat equal length using interpolation. After that, 5,000 heartbeats are randomly selected. The patients have severe congestive heart failure and the class values are obtained by automated annotation.

Yoga dataset is obtained by capturing two actors transiting between yoga poses in front of a green screen, as illustrated in Fig. 13. Each image was converted to a one dimensional series by finding the outline and measuring the distance of the outline to the centre. The problem is to discriminate between one actor (male) and another (female).

Fig. 13: Description of Yoga dataset.

The Computers dataset is taken from data recorded as part of a government-sponsored study called Powering the Nation. The intention is to collect behavioural data about how consumers use electricity within the home to help reduce the UK's carbon footprint. The data contains readings from 251 households, sampled in two-minute intervals over a month. Classes are Desktop and Laptop.

Experimental results are shown in Fig. 14, where the height of a bar represents the ratio of the number of retrieved sequences to the total number of sequences. We can see that the pruning power of LB_Keogh+ outperforms that of LB_Kim and LB_Yi, and LB_Kim hardly works except when $\varepsilon$ is small. The reason is that LB_Kim uses $L_\infty$ instead of $L_1$ or $L_2$ as its base distance. Unlike most other methods, the DTW distance defined by LB_Kim is not the sum of the base distance, which makes it unfair to compare LB_Kim with other methods.

(a) GunPoint
(b) ECG5000
(c) Yoga
(d) Computers
Fig. 14: Comparison of pruning power on four benchmark datasets.

When $\varepsilon$ is small, the pruning power of LB_Keogh+ is acceptable. With the increase of $\varepsilon$, its pruning power gradually decreases to zero. The reason is that $\varepsilon$ determines the number of sequences filtered out by LB_Keogh+. In particular, when $\varepsilon$ is large enough, LB_Keogh+ completely loses its filtering effect. Generally speaking, $\varepsilon$-range search only needs to find a small number of sequences from a dataset, for example, 10% of the total number of sequences. That is, $\varepsilon$ is usually not large in practice. Therefore, LB_Keogh+ has desirable pruning power in practical applications.

5.3 Discussion on the number of subsegments in PAA

In this subsection, we discuss the effect of changing the number of subsegments in PAA. The tightness of LB_PAA is defined as:

(35)

We extend all the sequences of a dataset to the same length $L_{max}$, chosen such that this length is divisible by more integers. The results of sequence extension are shown in Table IV. Then, we calculate the average tightness over 100 $\varepsilon$-range searches, as illustrated in Fig. 15.

Dataset Lmax Dataset Lmax
ItalyPowerDemand 26 Words50 272
SyntheticControl 63 Symbols 399
ECG5000 144 Yoga 429
GunPoint 152 ShapesAll 513
WordSynonyms