Multimedia retrieval is a common application which requires finding objects similar to a query object. We consider examples such as gesture recognition with modern virtual reality motion controllers, GPS tracking, speech recognition, and classification of handwritten letters, where the objects are multi-dimensional time series.
In many cases, similarity search is performed using a distance function on the time series, where small distances imply similar time series. A nearest neighbor query to the query time series can be a $k$-nearest neighbor ($k$-NN) query or an $\varepsilon$-nearest neighbor ($\varepsilon$-NN) query. A $k$-NN query retrieves the $k$ most similar time series. An $\varepsilon$-NN query retrieves all time series with a distance of at most $\varepsilon$.
A common application of $k$-NN queries is solving classification tasks in machine learning [9, 1]. For example, if all data objects (here time series) are labeled, a $1$-NN classifier basically assigns the label of a nearest neighbor to the unlabeled query object.
In our examples, the time series of the same class (e. g. the same written character or the same gesture) follow the same path in space, but have some temporal displacements. Tracking the GPS coordinates of two cars driving the same route from A to B is an illustrative example. We want these tracks (i. e. 2-dimensional time series) to be recognized as similar, although driving style, traffic lights, and traffic jams might result in temporal differences. Time warping distance functions such as dynamic time warping (DTW) [18], the dog-keeper distance (DK) [11], and the edit distance with real penalties (ERP) [8] respect this semantic requirement. They map pairs of time series representing the same trajectory to small distances.
However, we are usually not interested in the exact distance values but in the set of the nearest neighbors to our query time series. During such a nearest neighbor query, a common approach for improving the runtime is pruning as many distance computations as possible using lower bounds to the distance function: If the lower bound already exceeds a certain threshold (e. g. the largest distance to our nearest neighbors found so far), the expensive distance function yields a larger value as well and thus, its computation can be skipped.
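The pruning idea can be sketched as a linear scan for a single nearest neighbor. This is only an illustration: the callables `lower_bound` and `distance` are hypothetical placeholders for any cheap lower bound and its expensive distance function, and the counter of computed distances is added here to make the effect visible.

```python
from math import inf

def nearest_neighbor_with_pruning(query, dataset, lower_bound, distance):
    """Linear scan for one nearest neighbor that skips candidates whose
    cheap lower bound cannot beat the best distance found so far."""
    best_index, best_dist = None, inf
    computed = 0  # number of expensive distance computations
    for index, candidate in enumerate(dataset):
        # lower_bound <= distance, so this candidate cannot improve the result
        if lower_bound(query, candidate) >= best_dist:
            continue
        computed += 1
        dist = distance(query, candidate)
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index, best_dist, computed
```

With numbers as stand-ins for time series, `distance = |q - x|` and `lower_bound = |q - x| / 2`, the scan over `[4, 10, 3.5]` for the query `3` skips the expensive computation for `10`.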
E. Keogh proposed one of the state-of-the-art algorithms for nearest neighbor queries with DTW on 1-dimensional time series. His main contribution is the lower bound $\mathrm{LB}_{\mathrm{Keogh}}$ [17, 14], which prunes many expensive DTW computations during a linear scan.
The contributions of this paper are the following:
We introduce a canonical extension of $\mathrm{LB}_{\mathrm{Keogh}}$ to multi-dimensional time series in Section 3, i. e. a lower bound to DTW on multi-dimensional time series which equals $\mathrm{LB}_{\mathrm{Keogh}}$ on 1-dimensional time series.
1.2 Basic Notation
We denote the natural numbers including zero with $\mathbb{N}$ and the real numbers with $\mathbb{R}$. Elements of a $k$-dimensional vector $v \in \mathbb{R}^k$ are accessed using subindices, i. e. $v_3$ is the third element of the vector. Sequences (here also called time series) are usually written using capital letters, e. g. $S = (s_1, \ldots, s_n)$ is a sequence of length $n$. Suppose $s_i \in \mathbb{R}^k$, then $s_{i,j}$ denotes the $j$-th element of the $i$-th vector in the sequence $S$. The projection to the $j$-th dimension is denoted via $S^j$, i. e. $S^j = (s_{1,j}, \ldots, s_{n,j})$. The Euclidean norm of a vector $v$ is denoted via $\|v\|$, thus $\|v - w\|$ denotes the Euclidean distance between $v$ and $w$. In general, we denote distance functions via d.
We denote $[l, u]$ as the 1-dimensional interval from $l$ to $u$. We denote the Cartesian product using the symbol $\times$, thus $[l_1, u_1] \times \cdots \times [l_k, u_k]$ denotes the set of vectors $v$ with $l_j \le v_j \le u_j$ for $1 \le j \le k$.
2.1 Time Warping Distance Functions
Dynamic time warping (DTW) is a distance function on time series. Its benefit is a dynamic time alignment, which makes it robust against time distortion or temporal displacements. Distance functions on time series which are robust against time distortion are from here on called time warping distance functions.
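For a formal definition, let $S = (s_1, \ldots, s_m)$ and $T = (t_1, \ldots, t_n)$ be two time series and d a distance function for the elements of the time series. DTW is defined recursively:
$$\mathrm{DTW}(S, T) = d(s_m, t_n) + \min\{\mathrm{DTW}(S', T'),\ \mathrm{DTW}(S', T),\ \mathrm{DTW}(S, T')\}, \qquad \mathrm{DTW}((), ()) = 0, \quad \mathrm{DTW}(S, ()) = \mathrm{DTW}((), T) = \infty,$$
where $S' = (s_1, \ldots, s_{m-1})$ and $T' = (t_1, \ldots, t_{n-1})$. The dog-keeper distance (DK) is similar to DTW. It only differs by taking the maximum distance along a warping path instead of the sum.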
Well known algorithms computing DTW and DK in quadratic time exist [18, 11]. These algorithms first build the cross product of both time series using the distance function d on the elements. The resulting distance matrix consists of the entries $d(s_i, t_j)$, where the distance of the first elements, $d(s_1, t_1)$, is in the bottom left cell and the distance of the last elements is in the top right cell. After that, the algorithms compute the cheapest path from the bottom left to the top right, cell by cell. For each cell, they replace the value with a combination of that value and the smallest value of one of the possible predecessors: the left, the lower, and the left lower neighbor cell, yielding the warping matrix. Unfortunately, unless the Strong Exponential Time Hypothesis fails, no algorithm computes any of these distance functions in strongly subquadratic time in the worst case [6, 5].
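The cell-by-cell computation can be sketched as follows. This is a minimal illustration and not the implementation used in the experiments: it assumes the Euclidean distance on the elements (via `math.dist`), keeps only two rows of the warping matrix, and the helper name `warp` and its `combine` parameter are introduced here only to show that DTW and DK differ in a single operation.

```python
from math import dist, inf

def warp(S, T, combine, d=dist):
    """Fill the warping matrix row by row, keeping only two rows in memory.
    combine(cost, best) merges a cell's distance with its best predecessor."""
    m, n = len(S), len(T)
    prev = [inf] * (n + 1)
    prev[0] = 0.0  # aligning two empty prefixes costs nothing
    for i in range(1, m + 1):
        cur = [inf] * (n + 1)
        for j in range(1, n + 1):
            # predecessors: lower left, lower, and left neighbor cell
            best = min(prev[j - 1], prev[j], cur[j - 1])
            cur[j] = combine(d(S[i - 1], T[j - 1]), best)
        prev = cur
    return prev[n]

def dtw(S, T, d=dist):
    """Sum of element distances along the cheapest warping path."""
    return warp(S, T, lambda cost, best: cost + best, d)

def dog_keeper(S, T, d=dist):
    """Maximum element distance along the best warping path."""
    return warp(S, T, max, d)
```

For example, `S = [(0,), (1,), (2,)]` and `T = [(0,), (0,), (1,), (2,)]` follow the same path with a temporal displacement, so both functions return zero.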
2.2 Sakoe Chiba Band
The Sakoe Chiba band changes the semantics of DTW by constraining the possible paths in the warping matrix to a diagonal band of a certain bandwidth. Thus, the warping of the time is constrained. On the other hand, the computation time is decreased since a huge part of the distance matrix does not need to be considered. Still, the runtime complexity remains quadratic because only a fixed ratio of the distance matrix is “cut off”.
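A banded variant of the dynamic program might look like the sketch below. It assumes 1-dimensional time series, squared differences as element cost, and a band given as a half-width `w` in cells (the symbol `w` and the function name are chosen here for illustration). Cells outside the band keep the value infinity, so they are never part of a path.

```python
from math import inf

def dtw_band(S, T, w, cost=lambda a, b: (a - b) ** 2):
    """DTW restricted to cells (i, j) with |i - j| <= w."""
    m, n = len(S), len(T)
    prev = [inf] * (n + 1)
    prev[0] = 0.0
    for i in range(1, m + 1):
        cur = [inf] * (n + 1)
        # only the cells inside the diagonal band are visited
        for j in range(max(1, i - w), min(n, i + w) + 1):
            best = min(prev[j - 1], prev[j], cur[j - 1])
            cur[j] = cost(S[i - 1], T[j - 1]) + best
        prev = cur
    return prev[n]
```

With `w = 0` only the diagonal is allowed, so the result is the sum of the pointwise costs. If the lengths differ by more than `w`, no path fits into the band and the function returns infinity.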
2.3 Keogh's Lower Bound for DTW
Cheap lower bounds are used to prune many complete DTW computations and thus improve the overall computation time of a nearest neighbor search: If the lower bound already exceeds a desired threshold, then the final result of the expensive DTW distance function will be larger as well. Keogh proposed one of the most successful lower bounding functions to DTW [14, 17]. His lower bound depends on the Sakoe Chiba band, which constrains finding the best time alignments to a maximum time distance of a certain bandwidth.
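Basically, the idea of his lower bound is the following (cf. Figure 1 for an illustration): Consider two time series $S = (s_1, \ldots, s_n)$ and $T = (t_1, \ldots, t_n)$ and map each index $i$ to the interval $[L_i, U_i]$ of all values of $T$ which can be aligned to $s_i$ within the time range (i. e. Sakoe Chiba band) of width $w$:
$$L_i = \min\{t_j : |i - j| \le w\}, \qquad U_i = \max\{t_j : |i - j| \le w\}.$$
Summing up the squared distances of the $s_i$ to each interval $[L_i, U_i]$ is a lower bound to DTW since DTW aligns each $s_i$ to at least one of the values within $[L_i, U_i]$, i. e.:
$$\mathrm{LB}_{\mathrm{Keogh}}(S, T) = \sum_{i=1}^{n} \begin{cases} (s_i - U_i)^2 & \text{if } s_i > U_i \\ (L_i - s_i)^2 & \text{if } s_i < L_i \\ 0 & \text{otherwise} \end{cases} \ \le\ \mathrm{DTW}(S, T).$$
The computation of the intervals takes linear time when using the algorithm of Daniel Lemire [15]. Hence, the computation of $\mathrm{LB}_{\mathrm{Keogh}}$ is linear in the length of the time series as well.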
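A direct transcription of this idea for 1-dimensional time series of equal length could look as follows. Note the assumptions: the half-width of the band is passed as `w`, and the envelope is computed naively by scanning each window, which costs time proportional to the length times the band width. The linear-time envelope computation mentioned above is not reproduced here.

```python
def envelope(T, w):
    """Lower and upper envelope of T: minimum and maximum of the values
    within w positions around each index (naive window scan)."""
    n = len(T)
    lower = [min(T[max(0, i - w): i + w + 1]) for i in range(n)]
    upper = [max(T[max(0, i - w): i + w + 1]) for i in range(n)]
    return lower, upper

def lb_keogh(S, T, w):
    """Sum of squared distances of each s_i to the interval [L_i, U_i]."""
    lower, upper = envelope(T, w)
    total = 0.0
    for s, lo, hi in zip(S, lower, upper):
        if s > hi:
            total += (s - hi) ** 2
        elif s < lo:
            total += (lo - s) ** 2
        # values inside the interval contribute nothing
    return total
```

For `S = [3, 3, 3, 3]` and `T = [0, 1, 2, 2]`, a band of half-width 1 gives the upper envelope `[1, 2, 2, 2]` and thus a bound of 4 + 1 + 1 + 1 = 7, while a band of half-width 0 degenerates to the sum of squared pointwise differences, 15.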
3 Multi-Dimensional Time Series
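Consider two time series $S = (s_1, \ldots, s_n)$ and $T = (t_1, \ldots, t_n)$ with $s_i, t_j \in \mathbb{R}^k$, i. e. $S$ and $T$ are multi-dimensional time series, and let $d(s_i, t_j) = \|s_i - t_j\|$ be the Euclidean distance of the two vectors $s_i$ and $t_j$. We extend the lower bound of DTW canonically using the interpretation presented in Section 2.3: For each $s_i$ we find the minimal bounding box to all values $t_j$ within the Sakoe Chiba band around $i$. Summing up the squared distances of the $s_i$ to their bounding boxes is again a lower bound to DTW.
For a formal definition, let $\mathrm{MBB}_i = [l_{i,1}, u_{i,1}] \times \cdots \times [l_{i,k}, u_{i,k}]$ be the minimal axis aligned bounding box of the values $t_j$ within the band around $i$. The vectors $l_i$ and $u_i$ are the lower left and upper right corners, respectively. The distance of a vector to the bounding box is the distance to the nearest element within the bounding box:
$$d(s_i, \mathrm{MBB}_i) = \min_{x \in \mathrm{MBB}_i} \|s_i - x\|.$$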
Considering the definition of DTW, the following function is a lower bound to DTW:
The distance function
is hard to compute but can be estimated:
As it turns out, the estimation is not only a canonical extension of the lower bound proposed by Keogh, but it also can be computed by using his proposed algorithm on the different dimensions of the time series. Please note, that holds especially for 1-dimensional time series.
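The relation between the bounding boxes and the per-dimension computation can be checked with a small sketch. It is an illustration under assumptions and not necessarily the exact function defined above: time series have equal length, the band half-width is `w`, the squared Euclidean distance to an axis aligned box is obtained by clamping the vector into the box, and all function names are hypothetical.

```python
def lb_keogh_1d(S, T, w):
    """Keogh-style bound for 1-dimensional series (naive envelope)."""
    total = 0.0
    for i, s in enumerate(S):
        window = T[max(0, i - w): i + w + 1]
        lo, hi = min(window), max(window)
        total += max(lo - s, 0.0, s - hi) ** 2
    return total

def box_bound(S, T, w):
    """Sum of squared distances of each s_i to the bounding box of its window."""
    k = len(T[0])
    total = 0.0
    for i, s in enumerate(S):
        window = T[max(0, i - w): i + w + 1]
        lo = [min(t[j] for t in window) for j in range(k)]
        hi = [max(t[j] for t in window) for j in range(k)]
        # the nearest element of the box is s clamped into the box
        nearest = [min(max(s[j], lo[j]), hi[j]) for j in range(k)]
        total += sum((s[j] - nearest[j]) ** 2 for j in range(k))
    return total

def per_dimension_bound(S, T, w):
    """Sum of the 1-dimensional bounds over all projections."""
    k = len(T[0])
    return sum(
        lb_keogh_1d([s[j] for s in S], [t[j] for t in T], w) for j in range(k)
    )
```

In this sketch both computations agree because the squared distance to an axis aligned box decomposes into independent contributions per dimension.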
Curse Of Dimensionality
Consider the bounding box of a subsequence as illustrated in Figure 2. For 2-dimensional time series, the following situation might happen: The time series moves along the left and then along the bottom edge of the bounding box. The query point however is at the top right vertex of the bounding box. Hence, a perfect alignment of DTW would still result in a distance
Simplifying the distance function to the bounding box results in
Thus, there is a clear divergence of DTW and the lower bound. With increasing dimensionality, there is more space for the time series to sneak past the query point, i. e. the probability for this situation to happen increases. For this reason, we claim that the tightness of the lower bound (i. e. the ratio of the lower bound to DTW) gets worse with increasing dimensionality. This effect is similar to that of the Curse of Dimensionality, thus we call it the same.
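The situation can be reproduced numerically. The coordinates below are made up for illustration: a 2-dimensional subsequence runs down the left edge and along the bottom edge of the unit square, and the query point sits at the top right vertex.

```python
import math

def nearest_point_distance(query, window):
    """Distance to the closest element of the subsequence itself."""
    return min(math.dist(query, t) for t in window)

def box_distance(query, window):
    """Distance to the axis aligned bounding box of the subsequence."""
    k = len(query)
    lo = [min(t[j] for t in window) for j in range(k)]
    hi = [max(t[j] for t in window) for j in range(k)]
    return math.sqrt(
        sum(max(lo[j] - query[j], 0.0, query[j] - hi[j]) ** 2 for j in range(k))
    )

# left edge, then bottom edge of the unit square
window = [(0.0, 1.0), (0.0, 0.5), (0.0, 0.0), (0.5, 0.0), (1.0, 0.0)]
query = (1.0, 1.0)  # top right vertex

print(nearest_point_distance(query, window))  # 1.0
print(box_distance(query, window))            # 0.0
```

Any alignment has to pair the query point with an element at distance at least 1, whereas the bounding box contains the query point and contributes nothing to the lower bound.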
The goal of this section is the proof of Theorem 3.1 which claims the existence of the Curse of Dimensionality for the lower bound. Since we want to assume as little as possible about the data sets, we assume that the time series consist of independent and identically distributed elements. Section 5 confirms the theoretical results experimentally. For proving Theorem 3.1, we first need some technical lemmata.
Let be independent identically distributed random variables for and . Furthermore, let . Then,
Let be the mean and be the variance of the identically distributed variables . The following inequality holds using calculation rules for expected values:
Let . Since we only consider finite data sets for our nearest neighbor queries, holds obviously. For a theoretical analysis on data sets with infinitely many elements, this property is a necessary prerequisite in order to estimate the denominator using the Berry-Esseen Theorem. Basically, the Berry-Esseen Theorem bounds how fast the distribution of a normalized sum of independent random variables converges to a normal distribution as the number of summands increases. We also apply the well known Markov inequality to estimate the denominator:
where the first inequality is the Markov inequality and the second inequality is derived from the Berry-Esseen Theorem which also claims the existence of the constant .
Let be independent identically distributed random variables for such that for each . Then, for .
The expected value can be calculated as follows:
Since is Lebesgue measurable and for , the last integral converges to zero for .
Let and be two time series in with length . Then
Since we do not know anything about the time series, we assume that their elements are independent identically distributed variables. For simplicity, we even assume that the distance between any two elements is a random variable, i. e.
are independent identically distributed random variables. With a Sakoe Chiba band of width ( constant) we get
4 Dog-Keeper Distance
Practical implementations of DTW speed up the computation by stopping early when a lower bound already exceeds a certain threshold. For example, while processing the warping matrix during the computation of DTW, the minimum value of a column or row is a lower bound to the final distance value. The lower bound proposed by Keogh promises even tighter values to the final distance value.
Since DTW sums up values along a warping path through the distance matrix, the values computed early are less likely to exceed a certain threshold than the values computed later. This observation does not hold when computing the DK distance. This insight yields the idea that the matrix filled out during the computation of the DK distance might be very sparse.
Therefore, our approach to improving the computation time of the DK distance is to compute the distance matrix using a sparse matrix algorithm (cf. Section 4.2). However, to avoid computing most of the cells of this matrix, we need a low threshold beforehand. Such a threshold is found using a cheap upper bound to the DK distance. Specifically, we propose a greedy algorithm to the time warping alignment problem (cf. Section 4.1).
To explain the algorithms in detail, consider two time series $S = (s_1, \ldots, s_m)$ and $T = (t_1, \ldots, t_n)$.
4.1 Greedy Dog-Keeper
The greedy dog-keeper distance (GDK) starts by aligning $s_1$ and $t_1$. It then successively steps from the current pair $(s_i, t_j)$ to the one of the next pairs $(s_{i+1}, t_j)$, $(s_i, t_{j+1})$, or $(s_{i+1}, t_{j+1})$ with the lowest distance. When it reaches the alignment of $s_m$ to $t_n$, the maximum of the distances along the chosen path yields the final distance. Algorithm 1 provides the pseudo code for the algorithm.
So far, the GDK distance matches whole sequences against each other. In order to support subsequence matching, we alter the algorithm by first finding the best match of $s_1$ to any $t_j$ (name it $t_{j_0}$). Then, we run GDK starting at $s_1$ and $t_{j_0}$ and stop the computation as soon as $s_m$ is aligned to one of the $t_j$. For details, refer to Algorithm 2.
Both algorithms have linear complexity, since the while loop runs at most $m + n$ times.
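The whole matching variant can be sketched in a few lines. The sketch below is not a transcription of Algorithm 1: it assumes non-empty sequences, the Euclidean distance on the elements, and an arbitrary tie-breaking rule (the tuple order of the candidates).

```python
from math import dist

def greedy_dog_keeper(S, T, d=dist):
    """Greedy upper bound to the dog-keeper distance (whole matching)."""
    m, n = len(S), len(T)
    i, j = 0, 0
    result = d(S[0], T[0])
    while i < m - 1 or j < n - 1:
        candidates = []
        if i < m - 1:
            candidates.append((d(S[i + 1], T[j]), i + 1, j))
        if j < n - 1:
            candidates.append((d(S[i], T[j + 1]), i, j + 1))
        if i < m - 1 and j < n - 1:
            candidates.append((d(S[i + 1], T[j + 1]), i + 1, j + 1))
        # step to the neighboring pair with the lowest distance
        step, i, j = min(candidates)
        result = max(result, step)
    return result
```

Each iteration advances at least one of the two indices, which is why the loop terminates after at most $(m - 1) + (n - 1)$ steps. Since the greedy path is one particular warping path, its maximum can never be smaller than the DK distance.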
4.2 Sparse Dog-Keeper Distance
The sparse dog-keeper algorithm essentially works the same as the original algorithm, except that it only visits those neighbor cells of the distance matrix having a value not larger than a given threshold. Algorithm 3 provides the pseudo code for the algorithm: Similar to the original algorithm, we compute the (sparse) warping matrix column by column (cf. Line 10). The variables and store the indices of cells to visit in the current and next column, respectively. If the value at the current cell of the matrix is not larger than the threshold (cf. Line 13), we also need to visit the right, upper right, and upper cells (cf. Lines 15 and 16). The actual values within the columns are stored in (current column) and (previous column). After we finished the computation of a column, we prepare the variables to enter the next column (cf. Lines 19 to 26).
The algorithm performs subsequence search if and only if true is passed for the parameter SUB. It differs from the whole matching version by considering each position of the super sequence as a possible start of a match (cf. Line 24) and by considering each position as a possible end of a match (cf. Line 19).
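To make the idea of visiting only cells below the threshold concrete, here is a simplified whole matching sketch. It is not Algorithm 3: it has no SUB parameter, it stores each column as a dictionary from row index to value, and the names `prev`, `cur`, and `todo` are chosen here for illustration. Cells whose value exceeds the threshold are never stored and therefore never extend a path.

```python
from math import dist, inf

def sparse_dog_keeper(S, T, threshold, d=dist):
    """Return DK(S, T) if it does not exceed threshold, otherwise infinity."""
    m, n = len(S), len(T)
    prev = {}  # row index -> value, for stored cells of the previous column
    for j in range(n):
        cur = {}
        if j == 0:
            todo = [0]
        else:
            # right and upper right neighbors of the stored cells
            todo = sorted({i for r in prev for i in (r, r + 1) if i < m})
        pos = 0
        while pos < len(todo):
            i = todo[pos]
            pos += 1
            if i == 0 and j == 0:
                best = 0.0
            else:
                best = min(prev.get(i - 1, inf), prev.get(i, inf),
                           cur.get(i - 1, inf))
            value = max(d(S[i], T[j]), best)
            if value <= threshold:
                cur[i] = value
                # the upper neighbor in the same column becomes reachable
                if i + 1 < m and (pos == len(todo) or todo[pos] != i + 1):
                    todo.insert(pos, i + 1)
        if not cur:
            return inf  # no path can stay below the threshold
        prev = cur
    return prev.get(m - 1, inf)
```

The sketch is exact for every stored cell: a cell value of at most the threshold requires a predecessor with a value of at most the threshold, and exactly those predecessors are stored.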
4.3 Nearest Neighbor Search
Our nearest neighbor search algorithm requires finding a good upper bound. Hence, we first loop over all time series and find the lowest upper bound using the GDK distance. Then, we perform a common nearest neighbor search by scanning the time series.
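The two passes can be outlined as follows. The function is a sketch with hypothetical parameters: `upper_bound(query, x)` stands for a cheap upper bound such as the GDK distance, and `bounded_distance(query, x, threshold)` stands for a thresholded distance computation which returns infinity as soon as the distance exceeds the threshold.

```python
from math import inf

def nearest_neighbor(query, dataset, upper_bound, bounded_distance):
    """Two-pass nearest neighbor search driven by an upper bound."""
    # first pass: the lowest upper bound yields the initial threshold
    threshold = min(upper_bound(query, x) for x in dataset)
    best_index, best_dist = None, inf
    # second pass: scan with a threshold that shrinks with every improvement
    for index, x in enumerate(dataset):
        value = bounded_distance(query, x, threshold)
        if value < best_dist:
            best_index, best_dist = index, value
            threshold = value
    return best_index, best_dist
```

Because the initial threshold is an upper bound to the distance of at least one element, the second pass always finds a neighbor as long as the data set is not empty.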
In Section 5.1, we evaluate Theorem 3.1 (which claims the existence of the curse of dimensionality for the multi-dimensional lower bound) experimentally on synthetic data sets which have a parameter for setting the dimensionality. We confirm our observations on real world data sets.
We compare DTW (with the multi-dimensional lower bound) against our implementation of the DK distance in terms of runtime in Section 5.2. We focus on the subsequence matching algorithms since this is the more challenging task.
The retrieval quality is of high importance when considering an alternative distance function for nearest neighbor queries. Therefore, we close the evaluation by comparing the accuracy of the DK and DTW distance functions on nearest neighbor tasks in Section 5.3.
Synthetic Data Sets
Due to space limitations, we cannot show results for all parameters of the data set generators. The following set of parameters yields the most interesting results: If not mentioned otherwise explicitly, we generate data with distortion , radius , distinguish classes, and representatives per class using the RAM generator and distinguish classes and representatives per class using the CBF generator.
5.1 Curse Of Dimensionality
Theorem 3.1 claims that the tightness of the lower bound function to DTW gets worse with increasing dimensionality. We evaluate the theorem experimentally on data sets generated by the two synthesizers CBF and RAM. Figure 3 illustrates examples for time series generated by the RAM and CBF synthesizers.
Our implementation of the multi-dimensional lower bound is based on $\mathrm{LB}_{\mathrm{Keogh}}$ from the UCR Suite [17]. We skipped the normalization on some data sets to improve runtime and accuracy. While adapting the UCR Suite to work on multi-dimensional time series, we checked that the runtime remained stable for 1-dimensional time series. Thus, we made sure that all of our results for runtime comparison are not implementation dependent. We ran all experiments on one core of an Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz with 24GB Memory.
Figure 5 demonstrates that the tightness of the lower bound to DTW decreases down to zero for increasing length, which confirms Theorem 3.1. The theorem also claims that the tightness drops to a constant value for increasing dimensionality but constant length. With Figure 5 we confirm this claim for the RAM data sets and show even more that the tightness converges (drops down) already for moderate dimensionality (e. g. 3-dimensional time series). We could observe the same results for the CBF data sets (cf. Figure 6), although there the dimensionality has a greater impact on the tightness while the length has a smaller impact.
We show the effects on the pruning power in the second heat maps of both Figures 5 and 6. Their black lines (lines of same values within each heat map) look rather similar to the black lines in their corresponding heat maps on the left side. They confirm the correlation between the dropping of the tightness of the lower bound and the dropping of the pruning power within a nearest neighbor search. Still, the dimensionality has a larger impact on the pruning power than on the tightness.
The reason is the curse of dimensionality on retrieval tasks. It basically says that the variance of distance values between random elements of a data set decreases with increasing dimensionality. Thus, even if the tightness of a lower bound is equal on two distinct data sets, pruning is less probable in the higher dimensional data set. Figure 4 illustrates an example with a query, the nearest neighbor found so far during the search, and the next element to examine. If the lower bound to the next element is larger than the distance to the nearest neighbor found so far, then we can prune the next element. Assume that the tightness of the lower bound is equal in the low dimensional and the high dimensional data set. Since the distances from the query to the nearest neighbor found so far and to the next element converge on data sets with increasing dimensionality (this is the curse of dimensionality), it is more probable in higher dimensionality that the lower bound stays below the distance to the nearest neighbor found so far. The lower bound cannot be used for pruning in these cases. Please note that this insight holds for any lower bound driven approach.
5.2 Computation Time
For the evaluation of the runtime of our DK algorithm, we consider the runtime of DTW with its lower bound as the baseline. In Figures 7 and 8 we show the speedup of our DK implementation against DTW with its lower bound on the RAM and CBF data sets, respectively. The speedup decreases for longer time series while increasing with growing dimensionality. Figure 9 confirms this observation on an even larger parameter set. Thus, our DK implementation outperforms DTW with its lower bound on rather short and multi-dimensional time series.
To ensure that our results do not depend on the generated synthetic data sets, we repeated the same experiments on the following real world data sets from the UCI Machine Learning Repository [10]: Character Trajectories (CT), Activity Recognition system based on Multisensor data fusion (AReM) [16], EMG Physical Action (EMGPA), Australian Sign Language 2 (ASL) [13], Arabic Spoken Digits (ASD), and Vicon Physical Action (VICON). We also used the ECG data proposed by Keogh for querying in one very long time series. Columns 1 and 2 of Table 1 show that the speedup increases with growing dimensionality. Thus, we demonstrate that our implementation of the DK distance outperforms DTW with its lower bound in terms of computation time, especially on multi-dimensional time series.
|data set|dim.|speedup|acc. DTW|acc. DK|
|AReM (no Z-normalization)|6| | | |
5.3 Retrieval Quality
Figures 10 and 11 reveal that the accuracy of both distance functions decreases on CBF while increasing on RAM with growing dimensionality. While DK outperforms DTW on the CBF data set, it loses on the RAM data set.
Columns 1, 3 and 4 of Table 1 also show that there is no clear winner regarding accuracy.
We introduced a canonical extension of Keogh’s lower bound for DTW to multi-dimensional time series and proved its correctness. Not only do lower bounds suffer from the curse of dimensionality even if their tightness remains constant, but we also proved that the tightness of this extension decreases with growing dimensionality. On the other hand, we proposed an alternative algorithm for the computation of the long-known DK distance, which is similar to DTW. Please note that the DK distance satisfies the triangle inequality and can thus be used in metric indexes [12, 2].
In our evaluation, we confirmed our theory that the multi-dimensional lower bound, as an extension of $\mathrm{LB}_{\mathrm{Keogh}}$, suffers from the curse of dimensionality. We could show that our implementation of DK outperforms DTW with this lower bound on multi-dimensional synthesized data sets as well as real world data sets in terms of computation time by more than one order of magnitude. However, there is no clear winner regarding the accuracy in retrieval tasks. Hence, we propose to stay with DTW and $\mathrm{LB}_{\mathrm{Keogh}}$ on 1-dimensional time series while choosing the DK distance on multi-dimensional time series.
We would like to thank Jochen Taeschner for his help on this work.
- [1] Miscellaneous Clustering Methods, pages 215–255. Wiley-Blackwell, 2011.
- [2] J. P. Bachmann and J.-C. Freytag. Dynamic Time Warping and the (Windowed) Dog-Keeper Distance, pages 127–140. Springer International Publishing, Cham, 2017.
- [3] J. P. Bachmann and J.-C. Freytag. High Dimensional Time Series Generators, 2018. arXiv:1804.06352v1.
- [4] A. Backurs and P. Indyk. Edit distance cannot be computed in strongly subquadratic time (unless SETH is false). CoRR, abs/1412.0348, 2014.
- [5] K. Bringmann. Why walking the dog takes time: Frechet distance has no strongly subquadratic algorithms unless SETH fails. CoRR, abs/1404.1448, 2014.
- [6] K. Bringmann and M. Künnemann. Quadratic conditional lower bounds for string problems and dynamic time warping. CoRR, abs/1502.01063, 2015.
- [7] E. Chávez, G. Navarro, R. Baeza-Yates, and J. L. Marroquín. Searching in metric spaces. ACM Comput. Surv., 33(3):273–321, Sept. 2001.
- [8] L. Chen and R. Ng. On the marriage of lp-norms and edit distance. In Proceedings of the Thirtieth International Conference on Very Large Data Bases - Volume 30, VLDB ’04, pages 792–803. VLDB Endowment, 2004.
- [9] D. Coomans and D. Massart. Alternative k-nearest neighbour rules in supervised pattern recognition: Part 1. k-nearest neighbour classification by using alternative voting rules. Analytica Chimica Acta, 136:15–27, 1982.
- [10] D. Dheeru and E. Karra Taniskidou. UCI machine learning repository, 2017.
- [11] T. Eiter and H. Mannila. Computing discrete Fréchet distance. Technical report, Technische Universität Wien, 1994.
- [12] M. R. Fréchet. Sur quelques points du calcul fonctionnel. 22. Rendiconti del Circolo Matematico di Palermo, 1906.
- [13] M. W. Kadous. Temporal Classification: Extending the Classification Paradigm to Multivariate Time Series. PhD thesis, New South Wales, Australia, Australia, 2002. AAI0806481.
- [14] E. Keogh and C. A. Ratanamahatana. Exact indexing of dynamic time warping. Knowledge and Information Systems, 7(3):358–386, 2005.
- [15] D. Lemire. Faster retrieval with a two-pass dynamic-time-warping lower bound. Pattern Recogn., 42(9):2169–2180, Sept. 2009.
- [16] F. Palumbo, C. Gallicchio, R. Pucci, and A. Micheli. Human activity recognition using multisensor data fusion based on reservoir computing. 8:87–, 03 2016.
- [17] T. Rakthanmanon, B. Campana, A. Mueen, G. Batista, B. Westover, Q. Zhu, J. Zakaria, and E. Keogh. Searching and mining trillions of time series subsequences under dynamic time warping. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’12, pages 262–270, New York, NY, USA, 2012. ACM.
- [18] H. Sakoe and S. Chiba. Readings in speech recognition. In A. Waibel and K.-F. Lee, editors, Readings in Speech Recognition, chapter Dynamic Programming Algorithm Optimization for Spoken Word Recognition, pages 159–165. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1990.
- [19] M. Vlachos, M. Hadjieleftheriou, D. Gunopulos, and E. Keogh. Indexing multi-dimensional time-series with support for multiple distance measures. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’03, pages 216–225, New York, NY, USA, 2003. ACM.