Dynamic Time Warping (DTW) is an algorithm to align temporal sequences with possible local non-linear distortions, and has been widely applied to audio, video and graphics data alignments. DTW is essentially a point-to-point matching method under some boundary and temporal consistency constraints. Although DTW obtains a global optimal solution, it does not necessarily achieve locally sensible matchings. Concretely, two temporal points with entirely dissimilar local structures may be matched by DTW. To address this problem, we propose an improved alignment algorithm, named shape Dynamic Time Warping (shapeDTW), which enhances DTW by taking point-wise local structural information into consideration. shapeDTW is inherently a DTW algorithm, but additionally attempts to pair locally similar structures and to avoid matching points with distinct neighborhood structures. We apply shapeDTW to align audio signal pairs having ground-truth alignments, as well as artificially simulated pairs of aligned sequences, and obtain quantitatively much lower alignment errors than DTW and its two variants. When shapeDTW is used as a distance measure in a nearest neighbor classifier (NN-shapeDTW) to classify time series, it beats DTW on 64 out of 84 UCR time series datasets, with significantly improved classification accuracies. By using a properly designed local structure descriptor, shapeDTW improves accuracies by more than 10% on 18 datasets. To the best of our knowledge, shapeDTW is the first distance measure under the nearest neighbor classifier scheme to significantly outperform DTW, which had been widely recognized as the best distance measure to date. Our code is publicly accessible at: https://github.com/jiapingz/shapeDTW.
Dynamic time warping (DTW) is an algorithm to align temporal sequences, which has been widely used in speech recognition, human motion animation, human activity recognition and time series classification. DTW allows temporal sequences to be locally shifted, contracted and stretched, and under some boundary and monotonicity constraints, it searches for a global optimal alignment path. DTW is essentially a point-to-point matching algorithm, but it additionally enforces temporal consistencies among matched point pairs. If we distill the matching component from DTW, the matching is executed by checking the similarity of two points based on their Euclidean distance. Yet, matching points based solely on their coordinate values is unreliable and prone to error; therefore, DTW may generate perceptually nonsensical alignments, which wrongly pair points with distinct local structures (see Fig.1 (c)). This partially explains why the nearest neighbor classifier under the DTW distance measure is less interpretable than the shapelet classifier: although DTW does achieve a global minimal score, the alignment process itself takes no local structural information into account, possibly resulting in an alignment with little semantic meaning. In this paper, we propose a novel alignment algorithm, named shape Dynamic Time Warping (shapeDTW), which enhances DTW by incorporating point-wise local structures into the matching process. As a result, we obtain perceptually interpretable alignments: similarly-shaped structures are preferentially matched based on their degree of similarity. We further quantitatively evaluate alignment paths against the ground-truth alignments, and shapeDTW achieves much lower alignment errors than DTW on both simulated and real sequence pairs. An alignment example by shapeDTW is shown in Fig.1 (d).
Point matching is a well studied problem in the computer vision community, widely known as image matching. In order to search for corresponding points in two distinct images taken from the same scene, a quite naive way is to compare their pixel values. But the pixel value at a point lacks spatial neighborhood context, making it less discriminative for that point; e.g., a tree leaf pixel from one image may have exactly the same RGB values as a grass pixel from the other image, but these two pixels are not corresponding pixels and should not be matched. Therefore, a routine for image matching is to describe points by their surrounding image patches, and then compare the similarities of point descriptors. Since point descriptors designed in this way encode image structures around local neighborhoods, they are more distinctive and discriminative than single pixel values. In early days, raw image patches were used as point descriptors, and now more powerful descriptors like SIFT are widely adopted since they capture local image structures very well and are invariant to image scale and rotation.
Intuitively, local neighborhood patches make points more discriminative from other points, while matching based on RGB pixel values is brittle and results in high false positives. However, the matching component in the traditional DTW bears the same weakness as image matching based on single pixel values, since similarities between temporal points are measured by their coordinates, instead of by their local neighborhoods. An analogous remedy for temporal matching hence is: first encode each temporal point by some descriptor, which captures local subsequence structural information around that point, and then match temporal points based on the similarity of their descriptors. If we further enforce temporal consistencies among matchings, then comes the algorithm proposed in the paper: shapeDTW.
shapeDTW is a temporal alignment algorithm, which consists of two sequential steps: (1) represent each temporal point by some shape descriptor, which encodes structural information of local subsequences around that point; in this way, the original time series is converted into a sequence of descriptors; (2) use DTW to align two sequences of descriptors. Since the first step takes linear time while the second step is a typical DTW, which takes quadratic time, the total time complexity is quadratic, indicating that shapeDTW has the same computational complexity as DTW. However, compared with DTW and its variants (derivative Dynamic Time Warping (dDTW) and weighted Dynamic Time Warping (wDTW)), it has two clear advantages: (1) shapeDTW obtains lower alignment errors than DTW/dDTW/wDTW on both artificially simulated aligned sequence pairs and real audio signals; (2) the nearest neighbor classifier under the shapeDTW distance measure (NN-shapeDTW) significantly beats NN-DTW on 64 out of 84 UCR time series datasets. NN-shapeDTW outperforms NN-dDTW/NN-wDTW significantly as well. Our shapeDTW time series alignment procedure is shown in Fig. 2.
Extensive empirical experiments have shown that a nearest neighbor classifier with the DTW distance measure (NN-DTW) is the best choice to date for most time series classification problems, since no alternative distance measure outperforms DTW significantly [34, 30, 28]. However, in this paper, the proposed temporal alignment algorithm, shapeDTW, if used as a distance measure under the nearest neighbor classifier scheme, significantly beats DTW. To the best of our knowledge, shapeDTW is the first distance measure that outperforms DTW significantly.
Our contributions are several fold: (1) we propose a temporal alignment algorithm, shapeDTW, which is as efficient as DTW (dDTW, wDTW) but achieves quantitatively better alignments than DTW (dDTW, wDTW); (2) Working under the nearest neighbor classifier as a distance measure to classify 84 UCR time series datasets, shapeDTW, under all tested shape descriptors, outperforms DTW significantly; (3) shapeDTW provides a quite generic alignment framework, and users can design new shape descriptors adapted to their domain data characteristics and then feed them into shapeDTW for alignments.
Since shapeDTW is developed for sequence alignment, here we first review research work related to sequence alignment. DTW is a typical sequence alignment algorithm, and there are many ways to improve DTW to obtain better alignments. Traditionally, we could enforce global warping path constraints to prevent pathological warpings, and typical global warping constraints include the Sakoe-Chiba band and the Itakura Parallelogram. Similarly, we could choose to use different step patterns in different applications: apart from the widely used step pattern “symmetric1”, there are other popular step patterns like “symmetric2”, “asymmetric” and “RabinerJuangStepPattern”. However, how to choose an appropriate warping band constraint and a suitable step pattern depends on our prior knowledge of the application domain.
There are several recent works to improve DTW alignment. To get the intuitively correct “feature to feature” alignment between two sequences, derivative dynamic time warping (dDTW) computes first-order derivatives of time series sequences, and then aligns the two derivative sequences by DTW. Weighted DTW (wDTW) is a penalty-based DTW, which takes the phase difference between two points into account when computing their distances. Batista et al. proposed a complexity-invariant distance measure, which essentially rectifies an existing distance measure (e.g., Euclidean, DTW) by multiplying it by a complexity correction factor. Although they achieve improved results on some datasets by rectifying the DTW measure, they do not modify the original DTW algorithm. Another line of work proposes to learn a distance metric, and then align temporal sequences by DTW under this new metric. One major drawback is the requirement of ground-truth alignments for metric learning, because in reality true alignments are usually unavailable. A further approach utilizes time series local structure information to constrain the search of the warping path: it introduces a SIFT-like feature point detector and descriptor to detect and match salient feature points from two sequences first, and then uses matched point pairs to regularize the search scope of the warping path. Its major motivation is to improve the computational efficiency of dynamic time warping by enforcing band constraints on the potential warping paths, such that the full accumulative distance matrix between the two sequences does not have to be computed.
Our method differs from this feature-point approach in the following aspects: first, we have no notion of feature points, while feature points are key to that algorithm, since they help to regularize the downstream DTW; second, our algorithm aims to achieve better alignments, while that algorithm attempts to improve the computational efficiency of the traditional DTW. Other work focuses on improving the efficiency of the nearest neighbor classifier under the DTW distance measure, but keeps the traditional DTW algorithm unchanged.
Our algorithm, shapeDTW, is different from the above works in that: we measure similarities between two points by computing similarities between their local neighborhoods, while all the above works compute the distance between two points based on their single-point y-values (derivatives).
Since shapeDTW can be applied to classify time series (e.g., NN-shapeDTW), we review representative time series classification algorithms. One approach uses the popular Bag-of-Words model to represent time series instances, and then classifies the representations under the nearest neighbor classifier. Concretely, it discretizes time series into local SAX words, and uses the histogram of SAX words as the time series representation. Another approach first extracts class-membership discriminative shapelets, and then learns a decision tree classifier based on distances between shapelets and time series instances. A third approach first represents time series using recurrence plots, and then measures the similarity between recurrence plots using the Campana-Keogh (CK-1) distance (PRCD); the PRCD distance is used as the distance measure under the one-nearest neighbor classifier to do classification. A bag-of-features framework to classify time series has also been introduced: it uses a supervised codebook to encode time series instances, and then uses a random forest classifier to classify the encoded time series. Another method first encodes time series as a bag-of-patterns, and then uses a polynomial kernel SVM to do the classification. Zhao and Itti proposed to first encode time series by a 2nd order encoding method, Fisher Vectors, and then classify the encoded time series by a linear kernel SVM. In their paper, subsequences are sampled from both feature points and flat regions.
shapeDTW is different from the above works in that shapeDTW is developed to align temporal sequences, but can be further applied to classify time series. In contrast, all the above works are developed to classify time series, and they are incapable of aligning temporal sequences in their current form. Since time series classification is only one application of shapeDTW, we compare NN-shapeDTW against the above time series classification algorithms in the supplementary materials.
In this section, we introduce a temporal alignment algorithm, shapeDTW. First we introduce DTW briefly.
DTW is an algorithm to search for an optimal alignment between two temporal sequences. It returns a distance measure for gauging similarities between them. Sequences are allowed to have local non-linear distortions in the time dimension, and DTW handles local warpings to some extent. DTW is applicable to both univariate and multivariate time series, and here for simplicity we introduce DTW in the case of univariate time series alignment.
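A univariate time series $\mathcal{T}$ is a sequence of real values, i.e., $\mathcal{T} = (t_1, t_2, \dots, t_L)^T$. Given two sequences $\mathcal{P}$ and $\mathcal{Q}$ of possibly different lengths $L_P$ and $L_Q$, namely $\mathcal{P} = (p_1, p_2, \dots, p_{L_P})^T$ and $\mathcal{Q} = (q_1, q_2, \dots, q_{L_Q})^T$, let $D \in \mathbb{R}^{L_P \times L_Q}$ be the pairwise distance matrix between sequences $\mathcal{P}$ and $\mathcal{Q}$, where $D_{i,j}$ is the distance between $p_i$ and $q_j$. One widely used pairwise distance measure is the Euclidean distance, i.e., $D_{i,j} = |p_i - q_j|$. The goal of temporal alignment between $\mathcal{P}$ and $\mathcal{Q}$ is to find two sequences of indices $\alpha$ and $\beta$ of the same length $l$ ($l \geq \max(L_P, L_Q)$), which match index $\alpha(i)$ in the time series $\mathcal{P}$ to index $\beta(i)$ in the time series $\mathcal{Q}$, such that the total cost along the matching path $\sum_{i=1}^{l} D_{\alpha(i),\beta(i)}$ is minimized. The alignment path $(\alpha, \beta)$ is constrained to satisfy boundary, monotonicity and continuity conditions [32, 20, 12]:
$$\begin{aligned} &\alpha(1) = \beta(1) = 1, \quad \alpha(l) = L_P, \quad \beta(l) = L_Q, \\ &(\alpha(i+1), \beta(i+1)) - (\alpha(i), \beta(i)) \in \{(1,0), (1,1), (0,1)\}. \end{aligned} \tag{1}$$
Given an alignment path $(\alpha, \beta)$, we define two warping matrices $W^P \in \{0,1\}^{l \times L_P}$ and $W^Q \in \{0,1\}^{l \times L_Q}$ for $\mathcal{P}$ and $\mathcal{Q}$ respectively, such that $W^P(i, \alpha(i)) = 1$, otherwise $W^P(i, j) = 0$, and similarly $W^Q(i, \beta(i)) = 1$, otherwise $W^Q(i, j) = 0$. Then the total cost along the matching path is equal to $\| W^P \mathcal{P} - W^Q \mathcal{Q} \|_1$, thus searching for the optimal temporal matching can be formulated as the following optimization problem:
$$\arg\min_{l,\, W^P,\, W^Q} \| W^P \mathcal{P} - W^Q \mathcal{Q} \|_1 \tag{2}$$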
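The alignment problem above is usually solved by dynamic programming. The following is a minimal sketch, not the authors' published implementation: it assumes the absolute difference as the point-wise distance and the three unit transitions allowed by the boundary, monotonicity and continuity conditions. The function name `dtw` and the 0-based index pairs in the returned path are illustrative choices.

```python
from typing import List, Sequence, Tuple


def dtw(p: Sequence[float], q: Sequence[float]) -> Tuple[float, List[Tuple[int, int]]]:
    """Align two non-empty sequences; return (total cost, warping path)."""
    n, m = len(p), len(q)
    inf = float("inf")
    # acc[i][j] is the minimal cumulative cost of aligning p[:i] with q[:j].
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(p[i - 1] - q[j - 1])
            acc[i][j] = local + min(acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1])

    # Backtrack from the last cell to the first one along cheapest predecessors.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        candidates = [
            (acc[i - 1][j - 1], i - 1, j - 1),
            (acc[i - 1][j], i - 1, j),
            (acc[i][j - 1], i, j - 1),
        ]
        _, i, j = min(candidates)
        path.append((i - 1, j - 1))
    path.reverse()
    return acc[n][m], path
```

For example, `dtw([1, 2, 3], [1, 2, 2, 3])` has zero cost, because the repeated value in the second sequence is absorbed by matching it twice to the same point of the first sequence.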
DTW finds a global optimal alignment under certain constraints, but it does not necessarily achieve locally sensible matchings. Here we incorporate local shape information around each point into the dynamic programming matching process, resulting in more semantically meaningful alignment results, i.e., points with similar local shapes tend to be matched while those with dissimilar neighborhoods are unlikely to be matched. shapeDTW consists of two steps: (1) represent each temporal point by some shape descriptor; and (2) align two sequences of descriptors by DTW. We first introduce the shapeDTW alignment framework, and in the next section, we introduce several local shape descriptors.
Given a univariate time series $\mathcal{T} = (t_1, t_2, \dots, t_L)^T$, shapeDTW begins by representing each temporal point $t_i$ by a shape descriptor $d_i \in \mathbb{R}^m$, which encodes structural information of temporal neighborhoods around $t_i$; in this way, the original real value sequence $\mathcal{T}$ is converted to a sequence of shape descriptors of the same length, i.e., $d = (d_1, d_2, \dots, d_L)^T$. shapeDTW then aligns the transformed multivariate descriptor sequences by DTW, and at last the alignment path between descriptor sequences is transferred to the original univariate time series sequences. We give implementation details of shapeDTW:
Given a univariate time series of length $L$, e.g., $\mathcal{T} = (t_1, t_2, \dots, t_L)^T$, we first extract a subsequence $s_i$ of length $l$ from each temporal point $t_i$. The subsequence $s_i$ is centered on $t_i$, with its length $l$ typically much smaller than $L$ ($l \ll L$). Note that we have to pad both ends of $\mathcal{T}$ by $\lfloor l/2 \rfloor$ with duplicates of $t_1$ ($t_L$) to make subsequences sampled at endpoints well defined. Now we obtain a sequence of subsequences, i.e., $\mathcal{S} = (s_1, s_2, \dots, s_L)^T$, with $s_i$ corresponding to the temporal point $t_i$. Next, we design shape descriptors to express subsequences, under the goal that similarly-shaped subsequences have similar descriptors while differently-shaped subsequences have distinct descriptors. The shape descriptor of subsequence $s_i$ naturally encodes local structural information around the temporal point $t_i$, and is named the shape descriptor of the temporal point $t_i$ as well. Designing a shape descriptor boils down to designing a mapping function $\mathcal{F}(\cdot)$, which maps subsequence $s_i$ to shape descriptor $d_i$, i.e., $d_i = \mathcal{F}(s_i)$, so that similarity between descriptors can be measured simply with the Euclidean distance. Different mapping functions define different shape descriptors, and one straightforward mapping function is the identity function $\mathcal{I}(\cdot)$; in this case, $d_i = \mathcal{I}(s_i) = s_i$, i.e., the subsequence itself acts as the local shape descriptor. Given a shape descriptor computation function $\mathcal{F}(\cdot)$, we convert the subsequence sequence $\mathcal{S}$ to a descriptor sequence $d = (d_1, d_2, \dots, d_L)^T$, i.e., $d = \mathcal{F}(\mathcal{S}) = (\mathcal{F}(s_1), \mathcal{F}(s_2), \dots, \mathcal{F}(s_L))^T$. At last, we use DTW to align two descriptor sequences and transfer the warping path to the original univariate time series.
Given two univariate time series $\mathcal{P}$ and $\mathcal{Q}$, let $d^P \in \mathbb{R}^{L_P \times m}$ and $d^Q \in \mathbb{R}^{L_Q \times m}$ be their shape descriptor sequences respectively. shapeDTW alignment is equivalent to solving the optimization problem:
$$\arg\min_{l,\, \tilde{W}^P,\, \tilde{W}^Q} \| \tilde{W}^P d^P - \tilde{W}^Q d^Q \|_{1,2} \tag{3}$$
where $\tilde{W}^P$ and $\tilde{W}^Q$ are warping matrices of $d^P$ and $d^Q$, and $\| \cdot \|_{1,2}$ is the $\ell_1/\ell_2$-norm of a matrix, i.e., $\| M \|_{1,2} = \sum_i \| M_i \|_2$, where $M_i$ is the $i$-th row of matrix $M$. Program 3 is a multivariate time series alignment problem, and can be effectively solved by dynamic programming in time $O(L_P \times L_Q)$. The key difference between DTW and shapeDTW is that DTW measures similarities between $p_i$ and $q_j$ by their Euclidean distance $|p_i - q_j|$, while shapeDTW uses the Euclidean distance between their shape descriptors, i.e., $\| d^P_i - d^Q_j \|_2$, as the similarity measure. shapeDTW essentially handles local non-linear warping, since it is inherently DTW, and, on the other hand, it prefers matching points with similar neighborhood structures to points with similar values. The shapeDTW algorithm is described in Algo.1.
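The two steps of the framework can be sketched as follows. This is an illustrative sketch, not the published code: it assumes the raw subsequence as the shape descriptor, pads by duplicating the end points, and uses the three unit transitions for the dynamic program. All function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def raw_descriptors(series: Sequence[float], length: int) -> List[Vector]:
    """Step 1: one raw-subsequence descriptor per temporal point."""
    half = length // 2
    # Duplicate the first and last values so end-point windows are well defined.
    padded = [series[0]] * half + list(series) + [series[-1]] * half
    # For odd `length`, window i of the padded series is centered on point i.
    return [padded[i:i + length] for i in range(len(series))]


def euclidean(a: Vector, b: Vector) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_on_descriptors(a: List[Vector], b: List[Vector]) -> Tuple[float, List[Tuple[int, int]]]:
    """Step 2: ordinary DTW, with Euclidean distance between descriptors."""
    n, m = len(a), len(b)
    inf = float("inf")
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = euclidean(a[i - 1], b[j - 1])
            acc[i][j] = local + min(acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1])
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        _, i, j = min(
            (acc[i - 1][j - 1], i - 1, j - 1),
            (acc[i - 1][j], i - 1, j),
            (acc[i][j - 1], i, j - 1),
        )
        path.append((i - 1, j - 1))
    path.reverse()
    return acc[n][m], path


def shape_dtw(p: Sequence[float], q: Sequence[float], length: int = 5):
    """Align p and q; the descriptor path is also the path for the raw series."""
    return dtw_on_descriptors(raw_descriptors(p, length), raw_descriptors(q, length))
```

Because each descriptor keeps the index of the point it was computed from, the warping path between descriptor sequences can be read directly as a path between the original series.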
shapeDTW provides a generic alignment framework, and users can design shape descriptors adapted to their domain data characteristics and feed them into shapeDTW for alignments. Here we introduce several general shape descriptors, each of which maps a subsequence $s_i$ to a vector representation $d_i$, i.e., $d_i = \mathcal{F}(s_i)$.
The length of subsequences $l$ defines the size of neighborhoods around temporal points. When $l = 1$, no neighborhood information is taken into account. With increasing $l$, larger neighborhoods are considered, and in the extreme case when $l = L$ ($L$ is the length of the time series), subsequences sampled from different temporal points become the same, i.e., the whole time series, in which case shape descriptors of different points resemble each other too much, making temporal points less identifiable by shape descriptors. In practice, $l$ is set to some appropriate value. But in this section, we first let $l$ be any positive integer ($l \geq 1$), which does not affect the definition of shape descriptors. In Sec.6, we will experimentally explore the sensitivity of NN-shapeDTW to the choice of $l$.
The raw subsequence $s_i$ sampled around point $t_i$ can be directly used as the shape descriptor of $t_i$, i.e., $d_i = \mathcal{I}(s_i) = s_i$, where $\mathcal{I}(\cdot)$ is the identity function. Although simple, it inherently captures the local subsequence shape and helps to disambiguate points with similar values but different local shapes.
Piecewise aggregate approximation (PAA) is introduced in [18, 36] to approximate time series. Here we use it to approximate subsequences. Given an $l$-dimensional subsequence $s_i$, it is divided into $m$ ($m \leq l$) equal-length intervals; the mean value of temporal points falling within each interval is calculated, and the vector of these mean values gives the approximation of $s_i$ and is used as the shape descriptor $d_i$ of $s_i$, i.e., $d_i = \mathrm{PAA}(s_i)$.
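A minimal sketch of the PAA descriptor is given below. The function name `paa` is illustrative, and the handling of lengths that are not divisible by the number of intervals (nearly equal intervals via integer division) is an assumption rather than something specified in the text.

```python
from typing import List, Sequence


def paa(subsequence: Sequence[float], num_intervals: int) -> List[float]:
    """Mean of each of `num_intervals` contiguous intervals of the subsequence."""
    n = len(subsequence)
    if not 1 <= num_intervals <= n:
        raise ValueError("num_intervals must be between 1 and len(subsequence)")
    descriptor = []
    for k in range(num_intervals):
        # Integer arithmetic gives nearly equal intervals when n % num_intervals != 0.
        start = k * n // num_intervals
        end = (k + 1) * n // num_intervals
        chunk = subsequence[start:end]
        descriptor.append(sum(chunk) / len(chunk))
    return descriptor
```

For instance, `paa([1, 2, 3, 4, 5, 6], 3)` averages the pairs (1, 2), (3, 4) and (5, 6).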
Discrete Wavelet Transform (DWT) is another widely used technique to approximate time series instances. Again, here we use DWT to approximate subsequences. Concretely, we use a Haar wavelet basis to decompose each subsequence $s_i$ into 3 levels. The detail wavelet coefficients of all three levels and the approximation coefficients of the third level are concatenated to form the approximation, which is used as the shape descriptor $d_i$ of $s_i$, i.e., $d_i = \mathrm{DWT}(s_i)$.
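The decomposition can be sketched in a few lines. This sketch makes several assumptions that the text does not fix: the subsequence length is a multiple of $2^3$ (so no boundary padding is needed), the orthonormal Haar filters are used, the detail sign is "first minus second", and the coefficients are ordered from coarse to fine. The function name is hypothetical.

```python
import math
from typing import List, Sequence


def haar_dwt_descriptor(subsequence: Sequence[float], levels: int = 3) -> List[float]:
    """Concatenate level-`levels` approximation and all detail coefficients."""
    if len(subsequence) % (2 ** levels) != 0:
        raise ValueError("length must be a multiple of 2**levels in this sketch")
    root2 = math.sqrt(2.0)
    approx = list(subsequence)
    details = []  # details[0] is the finest level
    for _ in range(levels):
        pairs = [(approx[i], approx[i + 1]) for i in range(0, len(approx), 2)]
        details.append([(a - b) / root2 for a, b in pairs])
        approx = [(a + b) / root2 for a, b in pairs]
    descriptor = approx
    # Append detail coefficients from the coarsest level to the finest one.
    for level_details in reversed(details):
        descriptor = descriptor + level_details
    return descriptor
```

With this layout the descriptor has the same number of entries as the subsequence, and a constant subsequence produces zero detail coefficients.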
All the above three shape descriptors encode local shape information inherently. However, they are not invariant to y-shift. To be concrete, given two subsequences $p, q$ of exactly the same shape, where $p$ is y-shifted relative to $q$, e.g., $p = q + \Delta \cdot \mathbf{1}$, where $\Delta$ is the magnitude of the y-shift, their shape descriptors under Raw-Subsequence, PAA and DWT differ approximately by $\Delta$ as well, i.e., $d(p) \approx d(q) + \Delta \cdot \mathbf{1}$. Although magnitudes do help time series classification, it is also desirable that similarly-shaped subsequences have similar descriptors. Here we further exploit three shape descriptors in experiments, Slope, Derivative and HOG1D, which are invariant to y-shift.
Slope is extracted as a feature and used in time series classification in [4, 8]. Here we use it to represent subsequences. Given an $l$-dimensional subsequence $s_i$, it is divided into $m$ ($m \leq l$) equal-length intervals. Within each interval, we employ the total least squares (TLS) line fitting approach to fit a line to the points falling within that interval. By concatenating the slopes of the fitted lines from all intervals, we obtain an $m$-dimensional vector representation, which is the slope representation of $s_i$, i.e., $d_i = \mathrm{Slope}(s_i)$.
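A sketch of the Slope descriptor follows, using the closed-form slope of the orthogonal (total least squares) line fit. It assumes unit time steps within each interval and nearly equal intervals obtained by integer division; both function names are illustrative.

```python
import math
from typing import List, Sequence


def tls_slope(values: Sequence[float]) -> float:
    """Slope of the total-least-squares line through (0, v0), (1, v1), ..."""
    n = len(values)
    mean_x = (n - 1) / 2.0
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    syy = sum((y - mean_y) ** 2 for y in values)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    if sxy == 0.0:
        # No linear trend: the fitted line follows the axis with larger spread.
        return 0.0 if sxx >= syy else math.inf
    return (syy - sxx + math.sqrt((syy - sxx) ** 2 + 4.0 * sxy ** 2)) / (2.0 * sxy)


def slope_descriptor(subsequence: Sequence[float], num_intervals: int) -> List[float]:
    """One TLS slope per interval, concatenated into a vector."""
    n = len(subsequence)
    bounds = [k * n // num_intervals for k in range(num_intervals + 1)]
    return [tls_slope(subsequence[bounds[k]:bounds[k + 1]]) for k in range(num_intervals)]
```

Adding a constant to every value leaves `syy` and `sxy` unchanged, which is why this descriptor is invariant to y-shift.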
Similar to Slope, Derivative is y-shift invariant if it is used to represent shapes. Given a subsequence $s$, its first-order derivative sequence is $s'$, where $s'$ is the first-order derivative of $s$ with respect to time $t$. To keep consistent with the derivatives used in derivative Dynamic Time Warping (dDTW), we follow their formula to compute numeric derivatives.
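The sketch below uses the derivative estimate commonly associated with dDTW, namely the average of the slope to the left neighbor and the slope between the two neighbors; treating the end points by copying the adjacent estimate is an assumption of this sketch. The function name is illustrative.

```python
from typing import List, Sequence


def derivative_descriptor(subsequence: Sequence[float]) -> List[float]:
    """Numeric first-order derivative of a subsequence (same length as input)."""
    n = len(subsequence)
    if n < 3:
        raise ValueError("need at least three points")
    deriv = [0.0] * n
    for i in range(1, n - 1):
        left = subsequence[i] - subsequence[i - 1]
        across = (subsequence[i + 1] - subsequence[i - 1]) / 2.0
        deriv[i] = (left + across) / 2.0
    # End points have only one neighbor, so reuse the nearest interior estimate.
    deriv[0] = deriv[1]
    deriv[-1] = deriv[-2]
    return deriv
```

Since only differences of values enter the formula, shifting the subsequence along the y-axis does not change the descriptor.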
HOG1D was introduced to represent 1D time series sequences. It inherits key concepts from the histogram of oriented gradients (HOG) descriptor, and uses concatenated gradient histograms to represent shapes of temporal sequences. Similarly to the Slope and Derivative descriptors, HOG1D is invariant to y-shift as well.
In experiments, we divide a subsequence into 2 non-overlapping intervals, compute gradient histograms (under 8 bins) in each interval and concatenate the two histograms as the HOG1D descriptor (a 16D vector) of that subsequence. We refer interested readers to the original HOG1D paper for computation details. We have to emphasize that the original authors introduce a global scaling factor $\sigma$ and tune it using all training sequences; here, we fix $\sigma$ to be 0.1 in all our experiments, therefore HOG1D computation on one subsequence takes only linear time $O(l)$, where $l$ is the length of that subsequence. See our published code for details.
Shape descriptors like HOG1D, Slope and Derivative are invariant to y-shift. However, in the application of matching two subsequences, y-magnitudes may sometimes be important cues as well, e.g., DTW relies on point-wise magnitudes for alignments. Shape descriptors like Raw-Subsequence, PAA and DWT encode magnitude information, thus they complement y-shift invariant descriptors. By fusing pure-shape capturing and magnitude-aware descriptors, the compound descriptor has the potential to become more discriminative of subsequences. In the experiments, we generate compound descriptors by concatenating two complementary descriptors, i.e., $d = (d_A, \gamma d_B)$, where $\gamma$ is a weighting factor to balance the two simple descriptors, and $d$ is the generated compound descriptor.
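The concatenation can be sketched as below. Which of the two descriptors carries the weight, and the value of the weight, are assumptions made here for illustration only; the function and parameter names are hypothetical.

```python
from typing import List, Sequence


def compound_descriptor(
    first: Sequence[float], second: Sequence[float], weight: float
) -> List[float]:
    """Concatenate two descriptors, scaling the second one by `weight`."""
    return list(first) + [weight * value for value in second]
```

Because descriptors are compared with the Euclidean distance, the weight directly controls how much the second descriptor contributes to the squared distance between two compound descriptors.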
Here we adopt the “mean absolute deviation” measure used in the audio literature to quantify the proximity between two alignment paths. “Mean absolute deviation” is defined as the mean distance between two alignment paths, which is proportional to the area between the two paths. Intuitively, two spatially proximate paths have a small area between them, and therefore a low “mean absolute deviation”. Formally, given a reference sequence $\mathcal{P}$, a target sequence $\mathcal{Q}$ and two alignment paths $p, q$ between them, the mean absolute deviation between $p$ and $q$ is calculated as $\delta(p, q) = A(p, q) / L_P$, where $A(p, q)$ is the area between $p$ and $q$ and $L_P$ is the length of the reference sequence $\mathcal{P}$. Fig. 3 shows two alignment paths $p, q$, the blue and red curves, between $\mathcal{P}$ and $\mathcal{Q}$. $A(p, q)$ is the area of the slashed region, and in practice, it is computed by counting the number of cells falling within it. Here a cell $(i, j)$ refers to the position $(i, j)$ in the pairwise distance matrix between $\mathcal{P}$ and $\mathcal{Q}$.
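One simple way to approximate this score is sketched below. It is an assumption of this sketch, not the cell-counting procedure itself: for every reference index, each path is summarized by the mean target index it is matched to, and the area is taken as the sum of absolute differences between these means. Paths are lists of 0-based (reference index, target index) pairs, and the function names are illustrative.

```python
from collections import defaultdict
from typing import Dict, Sequence, Tuple

Path = Sequence[Tuple[int, int]]


def _mean_target_per_reference(path: Path) -> Dict[int, float]:
    """Average target index matched to each reference index."""
    matched = defaultdict(list)
    for ref_idx, target_idx in path:
        matched[ref_idx].append(target_idx)
    return {ref_idx: sum(t) / len(t) for ref_idx, t in matched.items()}


def mean_absolute_deviation(path_a: Path, path_b: Path, ref_length: int) -> float:
    """Approximate area between two warping paths, divided by the reference length."""
    a = _mean_target_per_reference(path_a)
    b = _mean_target_per_reference(path_b)
    # Valid warping paths visit every reference index at least once.
    area = sum(abs(a[i] - b[i]) for i in range(ref_length))
    return area / ref_length
```

Identical paths give a score of zero, and the score grows as one path drifts away from the other along the target axis.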
We test shapeDTW for sequence alignment and time series classification extensively on 84 UCR time series datasets and the Bach10 dataset. For sequence alignment, we compare shapeDTW against DTW and its other variants both qualitatively and quantitatively: specifically, we first visually compare alignment results returned by shapeDTW and DTW (and its variants), and then quantify their alignment path qualities on both synthetic and real data. Concretely, we simulate aligned pairs by artificially scaling and stretching original time series sequences, align those pairs by shapeDTW and DTW (and its variants), and then evaluate the alignment paths against the ground-truth alignments. We further evaluate the alignment performances of shapeDTW and DTW (and its variants) on audio signals, which have ground-truth point-to-point alignments. For time series classification, since it is widely recognized that the nearest neighbor classifier with the DTW distance measure (NN-DTW) is very effective and hard to beat [34, 2], we use the nearest neighbor classifier as well to test the effectiveness of shapeDTW (NN-shapeDTW), and compare NN-shapeDTW against NN-DTW. We further compare NN-shapeDTW against six other state-of-the-art classification algorithms in the supplementary materials.
We evaluate sequence alignments qualitatively in Sec. 6.1.2 and quantitatively in Sec. 6.1.3 and Sec. 6.1.4. We compare shapeDTW against DTW, derivative Dynamic Time Warping (dDTW) and weighted Dynamic Time Warping (wDTW). dDTW first computes derivative sequences, and then aligns them by DTW. wDTW uses a weighted distance, instead of the regular distance, to compute distances between points, and the weight accounts for the phase differences between points. wDTW is essentially a DTW algorithm. Here, both dDTW and wDTW are variants of the original DTW. Before the evaluation, we briefly introduce some popular step patterns in DTW.
The step pattern in DTW defines the allowed transitions between matched pairs, and the corresponding weights. In both Program 2 (DTW) and Program 3 (shapeDTW), we use the default step pattern, whose recursion formula is $D(i, j) = d(i, j) + \min\{D(i-1, j-1),\, D(i, j-1),\, D(i-1, j)\}$, where $d(i, j)$ is the local distance between the $i$-th and $j$-th points and $D(i, j)$ is the accumulated cost. In the following alignment experiments, we try other well-known step patterns as well, and we follow the naming convention of the existing literature to name these step patterns. Five popular step patterns, “symmetric1”, “symmetric2”, “symmetric5”, “asymmetric” and “rabinerJuang”, are listed in Fig. 4. Step pattern (a), “symmetric1”, is the one used by shapeDTW in all the following alignment and classification experiments, and we will not explicitly mention that in the following text.
We plot alignment results by shapeDTW and DTW/dDTW, and evaluate them visually. shapeDTW under 5 shape descriptors, Raw-Subsequence, PAA, DWT, Derivative and HOG1D, obtains similar alignment results; here we choose Derivative as a representative to report results, with the subsequence length set to be 30. Here, shapeDTW, DTW and dDTW all use step pattern (a) in Fig. 4.
Time series with rich local features: time series with rich local features, such as those in the “OSUleaf” dataset (bottom row in Fig.5), have many bumps and valleys; DTW becomes quite brittle when aligning such sequences, since it matches two points based on their single-point y-magnitudes. Because a single magnitude value does not incorporate local neighborhood information, it is hard for DTW to discriminate a peak point $p$ from a valley point $v$ with the same magnitude, although $p$ and $v$ have dramatically different local shapes. dDTW bears a similar weakness to DTW, since it matches points based on their derivative differences and does not take local neighborhoods into consideration either. On the contrary, shapeDTW distinguishes peaks from valleys easily by their highly different local shape descriptors. Since shapeDTW takes both non-linear warping and local shapes into account, it gives more perceptually interpretable and semantically sensible alignments than DTW (dDTW). Some typical alignment results of time series from the feature-rich datasets “OSUleaf” and “Fish” are shown in Fig.5.
We simulate aligned sequence pairs by scaling and stretching original time series. Then we run shapeDTW and DTW (and its variants) to align the simulated pairs, and compare their alignment paths against the ground-truth. In this section, shapeDTW is run under the fixed settings: (1) fix the subsequence length to be 30, (2) use Derivative as the shape descriptor and (3) use “symmetric1” as the step-pattern.
Concretely, given a time series $\mathcal{T}$ of length $L$, we simulate a new time series $\hat{\mathcal{T}}$ by locally scaling and stretching $\mathcal{T}$. The simulation consists of two sequential steps: (1) scaling: scale $\mathcal{T}$ point-wise, resulting in a new time series $\mathcal{T} \odot S$, where $S$ is a positive scale vector with the same length as $\mathcal{T}$, and $\odot$ is the point-wise multiplication operator; (2) stretching: randomly choose $\alpha$ percent of the temporal points from the scaled series, stretch each chosen point by a random length $\tau$, and obtain the new time series $\hat{\mathcal{T}}$. $\hat{\mathcal{T}}$ and $\mathcal{T}$ are a simulated alignment pair, with the ground-truth alignment known from the simulation process. The simulation algorithm is described in Alg. 2.
One caveat is that scaling an input time series by a random scale vector can make the resulting time series perceptually quite different from the original one, such that simulated alignment pairs make little sense. Therefore, in practice, a scale vector $S$ should be smooth, i.e., adjacent elements in $S$ cannot be arbitrary; instead, they should be similar in magnitude, so that adjacent temporal points from the original time series are scaled by a similar amount. In experiments, we first use a random process, which is similar to Brownian motion, to initialize scale vectors, and then recursively smooth them. The scale vector generation algorithm is shown in Alg. 2. As seen, adjacent scales are initialized to differ by at most 1 (i.e., $|S(i+1) - S(i)| \leq 1$), such that the first-order derivatives are bounded and initialized scale vectors do not change abruptly. Initialized scale vectors usually have local bumps, and we further recursively apply cumulative summation and sine-squashing, as described in the algorithm, to smooth the scale vectors. Finally, the smoothed scale vectors are linearly squashed into a positive range $[a, b]$.
After non-uniformly scaling an input time series by a scale vector, we obtain a scale-transformed new sequence, and then we randomly pick $\alpha$ percent of the points from the new sequence and stretch each of them by some random amount $\tau$. Stretching at point $t_i$ by some amount $\tau$ means duplicating $t_i$ $\tau$ times.
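The scaling and stretching steps, together with the ground-truth alignment they induce, can be sketched as follows. Several details are assumptions of this sketch rather than settings from the text: the scale vector is passed in ready-made, a stretched point receives between 1 and `max_stretch` extra copies drawn uniformly, and a seed is used for reproducibility. All names are illustrative.

```python
import random
from typing import List, Sequence, Tuple


def simulate_aligned_pair(
    series: Sequence[float],
    scales: Sequence[float],
    percent: float,
    max_stretch: int,
    seed: int = 0,
) -> Tuple[List[float], List[Tuple[int, int]]]:
    """Return (simulated series, ground-truth path of (original, simulated) indices)."""
    rng = random.Random(seed)
    # Step 1: point-wise scaling.
    scaled = [value * scale for value, scale in zip(series, scales)]
    # Step 2: choose `percent` percent of the points and duplicate each of them.
    count = round(len(scaled) * percent / 100.0)
    chosen = set(rng.sample(range(len(scaled)), count))
    simulated: List[float] = []
    truth: List[Tuple[int, int]] = []
    for i, value in enumerate(scaled):
        extra = rng.randint(1, max_stretch) if i in chosen else 0
        for _ in range(1 + extra):
            # Every copy of point i aligns with original index i.
            truth.append((i, len(simulated)))
            simulated.append(value)
    return simulated, truth
```

The returned path is monotone and continuous by construction, so it can be compared directly with the path produced by an alignment algorithm.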
Using training data from each UCR dataset as the original time series, we simulate their alignment pairs by running Alg. 2. Since there are 27,136 training time series instances from 84 UCR datasets, we simulate 27,136 aligned pairs in total. We fix most simulation parameters, and the stretching percentage $\alpha$ is the only flexible parameter we will vary. Typical scale vectors and simulated alignment pairs are shown in Fig. 6. The scale vectors are smooth and the simulated time series are both scaled and stretched, compared with the original ones.
We run shapeDTW and DTW/dDTW/wDTW to align simulated pairs, and compare alignment paths against the ground-truth in terms of “Mean Absolute Deviation” scores. DTW and dDTW are parameter-free, but wDTW has one tuning parameter $g$ (see Eq. (3) in their paper), which controls the curvature of the logistic weight function. However, when aligning only two sequences, $g$ cannot be tuned and has to be pre-defined from experience. Here we fix $g$ to be 0.1, which is the approximate mean value of the optimal $g$ in the original paper. For the purpose of comparing the alignment qualities of different algorithms, we use the default step pattern, (a) in Fig. 4, for both shapeDTW and DTW/dDTW/wDTW, but we further evaluate the effects of different step patterns in the following experiments.
We simulate alignment pairs by stretching raw time series by several different amounts, and report the alignment qualities of shapeDTW and DTW/dDTW/wDTW under each stretching amount in terms of the mean of the “Mean Absolute Deviation” scores over the 27,136 simulated pairs. The results, shown in Fig. 7, indicate that shapeDTW consistently achieves lower alignment errors than DTW/dDTW/wDTW across stretching amounts. shapeDTW almost halves the alignment errors achieved by dDTW, although dDTW already outperforms its two competitors, DTW and wDTW, by a large margin.
Choosing a suitable step pattern is a traditional way to improve sequence alignments, and making the right choice usually requires domain knowledge. Here, instead of choosing an optimal step pattern, we run DTW/dDTW/wDTW under all 5 step patterns in Fig. 4 and compare their alignment performances against shapeDTW. As in the above experiments, we simulate aligned pairs under different amounts of stretching, report alignment errors under the different step patterns in terms of the mean of the “Mean Absolute Deviation” scores over the 27,136 simulated pairs, and plot the results in Fig. 8. As seen, different step patterns yield different alignment qualities. In our case, the step patterns “symmetric1” and “asymmetric” have similar alignment performances and reach lower alignment errors than the other 3 step patterns. Nevertheless, shapeDTW still beats DTW/dDTW/wDTW (under the “symmetric1” and “asymmetric” step patterns) by some margin.
Table I. Mean Absolute Deviation from the ground-truth alignments. The mean and standard deviation of the “Mean Absolute Deviation” scores on each dataset are documented, with smaller means and stds in bold font. shapeDTW achieves lower “Mean Absolute Deviation” scores than dDTW on 56 datasets, showing its clear advantage for time series alignment.
From the above simulation experiments, we observe that dDTW (under the step patterns “symmetric1” and “asymmetric”) performs closest to shapeDTW. Here we simulate aligned pairs with a given average amount of stretching, run dDTW (under the “symmetric1” step pattern) and shapeDTW alignments, and report the “Mean Absolute Deviation” scores in Table I. shapeDTW has lower “Mean Absolute Deviation” scores on 56 datasets, and the mean “Mean Absolute Deviation” over the 84 datasets is 1.68 for shapeDTW and 2.75 for dDTW, indicating that shapeDTW achieves much lower alignment errors. This shows a clear superiority of shapeDTW over dDTW for sequence alignment.
The key difference between shapeDTW and DTW/dDTW/wDTW is whether the neighborhood is taken into account when measuring the similarity between two points. We demonstrate that taking local neighborhood information into account (shapeDTW) does benefit the alignment.
Notes: before running shapeDTW and the DTW variants, the two sequences in a simulated pair are z-normalized; when computing “Mean Absolute Deviation”, we choose the original time series as the reference sequence, i.e., we divide the area between the two alignment paths by the length of the original time series.
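The two steps in these notes can be sketched as below. The discretization of the "area between two alignment paths" is an assumption of this example: for every index of the reference sequence, the matched indices on each path are averaged, and the absolute differences are summed. The function names are hypothetical.

```python
import numpy as np

def z_normalize(x):
    """Shift to zero mean and scale to unit standard deviation."""
    x = np.asarray(x, dtype=float)
    std = x.std()
    return (x - x.mean()) / std if std > 0 else x - x.mean()

def mean_absolute_deviation(path_a, path_b, ref_length):
    """Approximate area between two alignment paths, divided by the length of
    the reference sequence. Paths are sequences of (ref index, other index)
    pairs, and each reference index must occur at least once on each path."""
    def row_profile(path):
        path = np.asarray(path)
        sums = np.zeros(ref_length)
        counts = np.zeros(ref_length)
        # Average the matched indices for every reference index.
        np.add.at(sums, path[:, 0], path[:, 1])
        np.add.at(counts, path[:, 0], 1)
        return sums / counts
    gap = np.abs(row_profile(path_a) - row_profile(path_b))
    return float(gap.sum() / ref_length)
```

For example, the paths `[(0, 0), (1, 1), (2, 2), (3, 3)]` and `[(0, 0), (1, 2), (2, 3), (3, 3)]` on a reference of length 4 differ by one index at two reference points, giving a score of 2 / 4 = 0.5.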
We have shown the superiority of shapeDTW in aligning synthesized alignment pairs. In this section, we further empirically demonstrate its effectiveness in aligning audio signals that have ground-truth alignments.
The Bach10 dataset consists of audio recordings of 10 pieces of Bach’s Chorales, as well as their MIDI scores and the ground-truth alignment between the audio and the MIDI score. MIDI scores are symbolic representations of audio files, and by aligning symbolic MIDI scores with audio recordings, we can perform musical information retrieval from MIDI input data. Many previous works used DTW to align MIDI to audio sequences [16, 9, 12]. They typically converted the MIDI data into audio as a first step, so that the problem boils down to audio-to-audio alignment, which is then solved by DTW. We follow this convention and convert MIDI to audio first, but run shapeDTW instead for the alignments.
Each piece of music is approximately 30 seconds long. In our experiments, we segment both the audio and the MIDI-converted audio into frames of 46 ms with a hop size of 23 ms, and extract features from each 46 ms frame window. In this way, the audio is represented as a multivariate time series whose length equals the number of frames and whose dimension equals the feature dimension. There are many potential choices of frame features, but how to select and combine features optimally to improve the alignment is beyond the scope of this paper; we refer interested readers to [21, 12]. Without loss of generality, we use Mel-frequency cepstral coefficients (MFCCs) as features, due to their common usage and good performance in speech recognition and musical information retrieval. In our experiments, we use the first 5 MFCCs.
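The framing step can be illustrated with a short sketch. The function name `frame_signal` and the rounding of frame and hop sizes to whole samples are assumptions of this example; the feature extraction itself (MFCCs in our experiments) would be applied to each returned frame and is not shown.

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_ms=46.0, hop_ms=23.0):
    """Cut a 1-D signal into overlapping frames; returns (n_frames, frame_len)."""
    signal = np.asarray(signal, dtype=float)
    # Convert durations to sample counts (rounded to whole samples).
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop_len = int(round(sample_rate * hop_ms / 1000.0))
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    if n_frames < 1:
        raise ValueError("signal is shorter than one frame")
    # Row k holds the samples starting at offset k * hop_len.
    idx = np.arange(frame_len)[None, :] + hop_len * np.arange(n_frames)[:, None]
    return signal[idx]
```

Stacking one feature vector per frame then yields the multivariate time series: one row per frame, one column per feature.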
After MIDI-to-audio conversion and MFCC feature extraction, MIDI files and audio recordings are represented as 5-dimensional multivariate time series. A typical audio signal, the MIDI-converted audio signal, and their 5D MFCC features are shown in Fig. 9. We align the 5D MFCC sequences by shapeDTW. Although shapeDTW is designed for univariate time series alignment, it extends naturally to the multivariate case: first extract a subsequence from each temporal point, then encode the subsequences by shape descriptors, so that the raw multivariate time series is converted to a descriptor sequence. In the multivariate case, each extracted subsequence is multi-dimensional, with the same dimension as the raw time series. To compute the shape descriptor of a multi-dimensional subsequence, we compute the shape descriptor of each dimension independently, concatenate all of them, and use the result as the shape representation of that subsequence.
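The multivariate extension can be sketched as follows. The helper names are hypothetical, the padding of both ends by repeating the end points is an assumption of this example, and a plain first-order difference stands in for the derivative-based descriptor.

```python
import numpy as np

def sample_subsequences(series, length):
    """(T, D) series -> (T, length, D) windows centered on each temporal point.
    Both ends are padded by repeating the end points (assumption)."""
    series = np.asarray(series, dtype=float)
    if series.ndim == 1:
        series = series[:, None]
    left = length // 2
    right = length - left - 1
    padded = np.pad(series, ((left, right), (0, 0)), mode="edge")
    idx = np.arange(len(series))[:, None] + np.arange(length)[None, :]
    return padded[idx]

def derivative_descriptor(subsequence):
    """Stand-in descriptor: first-order differences of a 1-D subsequence."""
    return np.diff(subsequence)

def multivariate_descriptors(series, length=20):
    """Describe each dimension independently and concatenate the results."""
    windows = sample_subsequences(series, length)
    rows = []
    for window in windows:
        per_dim = [derivative_descriptor(window[:, d]) for d in range(window.shape[1])]
        rows.append(np.concatenate(per_dim))
    return np.array(rows)
```

With a 5-dimensional input and subsequences of length 20, each temporal point is represented by a 5 × 19 = 95-dimensional descriptor in this sketch, and the resulting descriptor sequence can be aligned by ordinary multivariate DTW.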
We compare alignments by shapeDTW against DTW/dDTW, all of which use the “symmetric1” step pattern. The length of subsequences in shapeDTW is fixed to 20 (we tried 5, 10 and 30 as well and achieved quite similar results), and Derivative is used as the shape descriptor. The alignment qualities in terms of “Mean Absolute Deviation” on the 10 Chorales are plotted in Fig. 9. To be consistent with the convention in the audio community, we actually report the mean-delayed-second between the alignment paths and the ground truth, which is computed by dividing “Mean Absolute Deviation” by the sampling rate of the audio signal. shapeDTW outperforms dDTW/DTW on 9/10 MIDI-to-audio alignments. This shows that taking local neighborhood information into account does benefit the alignment.
We compare NN-shapeDTW with NN-DTW on 84 UCR time series datasets for classification. Since these datasets have standard partitions of training and test data, we experiment with these given partitions and report classification accuracies on the test data.
In the above section, we explored the influence of different step patterns, but here both DTW and shapeDTW use the widely adopted step pattern “symmetric1” (Fig. 4 (a)) with no temporal window constraints to align sequences.
NN-DTW: each test time series is compared against the training set, and the label of the training time series with the minimal DTW distance to that test time series determines the predicted label. All training and testing time series are z-normalized in advance.
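A minimal sketch of this baseline classifier, assuming the “symmetric1” recursion (each cell adds its local cost to the minimum of its three predecessors), no window constraint, and Euclidean point-wise distances. The function names are hypothetical.

```python
import numpy as np

def z_normalize(x):
    x = np.asarray(x, dtype=float)
    std = x.std()
    return (x - x.mean()) / std if std > 0 else x - x.mean()

def dtw_distance(a, b):
    """DTW distance between two (possibly multivariate) sequences."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.ndim == 1:
        a = a[:, None]
    if b.ndim == 1:
        b = b[:, None]
    # Pairwise Euclidean distances between all points of a and b.
    cost = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
    return float(acc[n, m])

def nn_dtw_predict(train_series, train_labels, test_series):
    """Label of the training series with the smallest DTW distance."""
    query = z_normalize(test_series)
    dists = [dtw_distance(query, z_normalize(s)) for s in train_series]
    return train_labels[int(np.argmin(dists))]
```

The double loop is quadratic in the sequence length, which is acceptable for illustration but slow for large datasets.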
shapeDTW: we test all 5 shape descriptors. We z-normalize time series in advance, sample subsequences from the time series, and compute 3 magnitude-aware shape descriptors (Raw-Subsequence, PAA and DWT) and 2 y-shift invariant shape descriptors (Slope and HOG1D). The parameter settings for the 5 shape descriptors are: (1) the length of subsequences sampled around temporal points is fixed to 30, so the Raw-Subsequence descriptor is a 30D vector; (2) PAA and Slope use 5 equal-length intervals, so they have dimensionality 5; (3) as mentioned, HOG1D uses 8 bins and 2 non-overlapping intervals, with the scale factor fixed to 0.1, so HOG1D is a 16D vector representation.
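To make the difference between magnitude-aware and y-shift invariant descriptors concrete, the sketch below computes PAA and Slope on a single subsequence with 5 equal-length intervals. The function names are hypothetical, and fitting a least-squares line per interval is an assumed way of obtaining the slopes.

```python
import numpy as np

def paa_descriptor(subsequence, n_intervals=5):
    """Piecewise aggregate approximation: the mean of each interval."""
    chunks = np.array_split(np.asarray(subsequence, dtype=float), n_intervals)
    return np.array([chunk.mean() for chunk in chunks])

def slope_descriptor(subsequence, n_intervals=5):
    """Slope of a least-squares line fitted to each interval."""
    chunks = np.array_split(np.asarray(subsequence, dtype=float), n_intervals)
    return np.array([np.polyfit(np.arange(len(chunk)), chunk, 1)[0]
                     for chunk in chunks])
```

Adding a constant to the subsequence shifts every PAA entry by that constant but leaves the Slope entries unchanged, which is what y-shift invariance means here.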
NN-shapeDTW: first transform each training/testing time series to a shape descriptor sequence, and in this way, original univariate time series are converted into multivariate descriptor time series. Then apply NN-DTW on the multivariate time series to predict labels.
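The two-step procedure can be sketched for the Raw-Subsequence descriptor as below. The function names are hypothetical, both ends are padded by repeating the end points (an assumption of this example), and the alignment uses the “symmetric1” recursion with backtracking to recover the path.

```python
import numpy as np

def raw_subsequence_descriptors(series, length=30):
    """(T,) series -> (T, length) matrix of windows centered on each point."""
    series = np.asarray(series, dtype=float)
    left = length // 2
    padded = np.pad(series, (left, length - left - 1), mode="edge")
    idx = np.arange(len(series))[:, None] + np.arange(length)[None, :]
    return padded[idx]

def dtw_path(a, b):
    """Multivariate DTW: returns the distance and the alignment path."""
    cost = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
    # Backtrack from the last cell to the first one.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        _, i, j = min((acc[i - 1, j - 1], i - 1, j - 1),
                      (acc[i - 1, j], i - 1, j),
                      (acc[i, j - 1], i, j - 1))
        path.append((i - 1, j - 1))
    return float(acc[n, m]), path[::-1]

def shape_dtw(series_a, series_b, length=30):
    """Align two univariate series through their descriptor sequences."""
    desc_a = raw_subsequence_descriptors(series_a, length)
    desc_b = raw_subsequence_descriptors(series_b, length)
    return dtw_path(desc_a, desc_b)
```

The returned path indexes the original temporal points, so it can be transferred directly to the raw series. With `length=1`, each descriptor is the point itself and the sketch reduces to ordinary DTW.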
NN-shapeDTW vs. NN-DTW: we compare NN-shapeDTW, under the 4 shape descriptors Raw-Subsequence, PAA, DWT and HOG1D, with NN-DTW, and plot their classification accuracies on the 84 datasets in Fig. 10. shapeDTW outperforms (including ties) DTW on 64/63/64/61 (Raw-Subsequence/PAA/DWT/HOG1D) datasets, and the p-values of the Wilcoxon signed rank test between the performances of NN-shapeDTW and NN-DTW show that shapeDTW under all 4 descriptors performs significantly better than DTW. Compared with DTW, shapeDTW has a preceding shape descriptor extraction step, which takes approximately $O(l \cdot L)$ time, where $l$ and $L$ are the lengths of the subsequence and the time series, respectively. Since generally $l \ll L$, the total time complexity of shapeDTW is $O(L^2)$, which is the same as DTW. By trading off a slight amount of time and space, shapeDTW brings large accuracy gains.
Since PAA and DWT are approximations of Raw-Subsequence and perform similarly to it under the nearest neighbor classifier, we choose Raw-Subsequence as a representative for the following analysis. The Raw-Subsequence shape descriptor loses on 20 datasets: on 18 of them the losses are minor, and on the other 2 datasets, “Computers” and “Synthetic-control”, the losses are larger. Time series instances from these 2 datasets either have high-frequency spikes or have many abrupt direction changes, which makes them closely resemble noisy signals. Possibly, comparing the similarity of two points using their noisy neighborhoods is not as good as using their single coordinate values (DTW), since a temporal neighborhood may accumulate and magnify noise.
HOG1D loses on 23 datasets: on 18 of them the losses are minor, and on the other 5 datasets, “CBF”, “Computers”, “ItalyPowerDemand”, “Synthetic-control” and “Wine”, the losses are larger. On visual inspection, time series from “Computers”, “CBF” and “Synthetic-control” are spiky and bumpy, which makes them highly non-smooth and makes the first-order-derivative based descriptor HOG1D inappropriate for representing their local structures. Time series instances from “ItalyPowerDemand” have length 24, while we sample subsequences of length 30 from each point; HOG1D descriptors from different local points are therefore almost the same and no longer discriminate local structures, which makes shapeDTW inferior to DTW. Although HOG1D loses on more datasets than Raw-Subsequence, HOG1D boosts accuracies by more than 10% on 18 datasets, compared with 12 datasets for Raw-Subsequence. On the datasets “OSUleaf” and “BirdChicken”, the accuracy gains are especially high. On closer inspection of these two datasets, we find that different classes have membership-discriminative local patterns (a.k.a. shapelets), but these patterns differ only slightly among classes. The Raw-Subsequence shape descriptor cannot capture these minor differences well, while HOG1D is more sensitive to shape variations since it calculates derivatives.
Both Raw-Subsequence and HOG1D bring significant accuracy gains, but they boost accuracies to different extents on the same dataset. This indicates the importance of designing domain-specific shape descriptors. Nevertheless, we show that even with simple, dataset-independent shape descriptors, we still obtain significant improvements over DTW. The classification error rates of DTW, Raw-Subsequence and HOG1D on the 84 datasets are documented in Table II.
Superiority of compound shape descriptors: as mentioned in Sec. 4, a compound shape descriptor obtained by fusing two complementary descriptors may inherit benefits from both, and become even more discriminative of subsequences. As an example, we concatenate the y-shift invariant descriptor HOG1D and the magnitude-aware descriptor DWT using equal weights, resulting in the compound descriptor HOG1D+DWT. We then evaluate the classification performances of the 3 descriptors under the nearest neighbor classifier, and plot the comparisons in Fig. 11. HOG1D+DWT outperforms (including ties) HOG1D/DWT on 66/51 (out of 84) datasets, and the p-values of the Wilcoxon signed rank hypothesis test between the performances of HOG1D+DWT and HOG1D (DWT) show that the compound descriptor significantly outperforms the individual descriptors. Compound descriptors can also be generated by weighted concatenation, with weights tuned by cross-validation on training data, but this is beyond the scope of this paper.
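Weighted concatenation itself is simple, as the following sketch shows. Raw windows and their first-order differences are used here only as stand-ins for a magnitude-aware and a y-shift invariant descriptor; they are not DWT or HOG1D, and all function names are hypothetical.

```python
import numpy as np

def sliding_windows(series, length):
    """(T,) series -> (T, length) windows, end points repeated as padding."""
    series = np.asarray(series, dtype=float)
    left = length // 2
    padded = np.pad(series, (left, length - left - 1), mode="edge")
    idx = np.arange(len(series))[:, None] + np.arange(length)[None, :]
    return padded[idx]

def compound_descriptors(desc_a, desc_b, weight_a=1.0, weight_b=1.0):
    """Concatenate two descriptor sequences row by row, with weights."""
    return np.hstack([weight_a * np.asarray(desc_a, dtype=float),
                      weight_b * np.asarray(desc_b, dtype=float)])

# Stand-in descriptors for an example series.
series = np.sin(np.linspace(0.0, 6.0, 40))
windows = sliding_windows(series, 8)           # magnitude-aware stand-in
differences = np.diff(windows, axis=1)         # y-shift invariant stand-in
compound = compound_descriptors(windows, differences)
```

Under the Euclidean distance, the squared distance between two compound descriptors is the weighted sum of the squared distances of their parts (weights squared), so the weights directly control how much each component contributes to the matching cost.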
Texas Sharpshooter plot: although NN-shapeDTW performs better than NN-DTW, knowing this is not useful unless we can tell in advance on which problems it will be more accurate. Here we use the Texas Sharpshooter plot to show when NN-shapeDTW has superior performance on the test set as predicted from its performance on the training set, compared with NN-DTW. We run leave-one-out cross-validation on the training data to measure the accuracies of NN-shapeDTW and NN-DTW, and calculate the expected gain: accuracy(NN-shapeDTW)/accuracy(NN-DTW). We then measure the actual accuracy gain using the test data. The Texas Sharpshooter plots between Raw-Subsequence/HOG1D and DTW on the 84 datasets are shown in Fig. 12. Points that fall in the TP and TN regions are those for which we can confidently predict that our algorithm will be superior/inferior to NN-DTW. There are 7/7 points falling inside the FP region for the descriptors Raw-Subsequence/HOG1D, respectively, but they represent only minor losses in actual accuracy.
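The four regions of the plot follow directly from the two gains. In the sketch below, the function names are hypothetical, and treating a gain of exactly 1 as "no gain" is an assumed tie-breaking convention.

```python
def accuracy_gain(acc_new, acc_baseline):
    """Ratio of the two accuracies; values above 1 favor the new method."""
    return acc_new / acc_baseline

def sharpshooter_region(expected_gain, actual_gain):
    """Classify one dataset into TP, FP, FN or TN.

    expected_gain: ratio measured by cross-validation on the training data.
    actual_gain:   ratio measured on the test data.
    """
    if expected_gain > 1.0:
        # We predicted an improvement ...
        return "TP" if actual_gain > 1.0 else "FP"
    # ... or we predicted no improvement.
    return "FN" if actual_gain > 1.0 else "TN"
```

For instance, a dataset with training accuracies 0.88 (NN-shapeDTW) and 0.80 (NN-DTW) has an expected gain of 1.1; if the test accuracies are 0.84 and 0.80, the actual gain is 1.05 and the dataset lands in the TP region.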
In the above experiments, we showed that shapeDTW outperforms DTW both qualitatively and quantitatively. However, we are still left with one free parameter: the size of the neighborhood, i.e., the length of the subsequence sampled at each temporal point of the time series. When the subsequence length is 1, shapeDTW (under the Raw-Subsequence shape descriptor) degenerates to DTW; when the subsequence length approaches the length of the time series, subsequences sampled at different points become almost identical, making points unidentifiable by their shape descriptors. This shows the importance of setting an appropriate subsequence length. However, without dataset-specific domain knowledge, it is hard to determine the length intelligently. Here, instead, we explore the sensitivity of the classification accuracies to different subsequence lengths. We conduct experiments on 42 old UCR datasets.
We use Raw-Subsequence as the shape descriptor and NN-shapeDTW as the classifier. We let the length of subsequences vary from 5 to 100 with stride 5, i.e., we repeat the classification experiment on each dataset 20 times, each time setting the subsequence length to $5i$, where $i$ is the index of the experiment ($i = 1, 2, \dots, 20$). The test accuracies of the 20 experiments are shown as a box plot (Fig. 13). On 33 out of 42 datasets, even the worst performance of NN-shapeDTW is better than DTW, indicating that shapeDTW performs well over a wide range of neighborhood sizes.
Table II. Classification error rates on 84 UCR datasets.
We have proposed a new temporal sequence alignment algorithm, shapeDTW, which achieves quantitatively better alignments than DTW and its variants. shapeDTW is also a quite generic framework: users can design their own local subsequence descriptors and fit them into shapeDTW. We showed experimentally that shapeDTW under the nearest neighbor classifier obtains significantly higher classification accuracies than NN-DTW. Therefore, NN-shapeDTW sets a new accuracy baseline for further comparison.
This work was supported by the National Science Foundation (grant number CCF-1317433), the Office of Naval Research (N00014-13-1-0563) and the Army Research Office (W911NF-11-1-0046 and W911NF-12-1-0433). The authors affirm that the views expressed herein are solely their own, and do not represent the views of the United States government or any agency thereof.