Convolutional Neural Networks (CNN) are a family of machine learning algorithms that have gained importance in recent years and have demonstrated success on difficult problems in research areas such as machine vision and time series classification. An in-depth review of Artificial Neural Networks (ANN) and CNNs is available in DBLP:journals/corr/Schmidhuber14 ; Goodfellow-et-al-2016 . The fundamental component of a CNN is the convolution layer, which learns, jointly with the other components, a set of parameters that enables the entire network to achieve its learning task. The convolution layer readily combines with other typical ANN layers, such as fully connected and pooling layers, to provide a rich end-to-end approach for developing deep learning models that can be trained using gradient descent methods. Deep learning methods such as CNNs alleviate in most cases the need for elaborate pre-processing of the inputs, and in particular the need to hand-engineer features, which typically requires expert domain knowledge and is time-consuming and error-prone.
Dynamic Time Warping (DTW) 1163055 is an algorithm that aligns two temporal sequences under the assumption of local deformations in time, such as local accelerations and decelerations. Since its inception it has been used extensively to solve problems in many domains where temporal sequences are involved. Combining DTW with k-NN for classification has consistently obtained state-of-the-art results in many reported temporal sequence classification tasks WangYO16 ; ratanamahatana2004everything ; IsmailFawaz2019 .
In this work a subset of the family of CNNs is considered where the input is limited to either a univariate or a multivariate temporal sequence, such as time series data or any data that can be converted into a temporal sequence. These CNNs are referred to as 1-D CNNs. A novel DTW convolution layer (DTW-CNN) is proposed that combines an algorithm similar to DTW with the standard 1-D convolution layer to improve the generalisation capabilities of the standard convolution layer under local deformations in time. The proposed DTW-CNN is evaluated on a number of time series classification tasks to demonstrate the effects of replacing a standard 1-D convolution layer with a DTW convolution layer. In summary, the key contributions of this work are:
- The proposal of the DTW-CNN layer, a novel 1-D convolution layer that is designed to be invariant to local deformations.
- Embedding a non-parametric warping aspect of temporal sequence similarity directly in deep networks.
In this section a review of the Dynamic Time Warping algorithm is given to set the notation and naming conventions used in subsequent sections.
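Dynamic Time Warping 1163055 is a method to calculate a mapping between two temporal sequences such that the misalignment between the two sequences is minimised. Let $X = (x_1, \dots, x_n)$ be a sequence of length $n$ and $Y = (y_1, \dots, y_m)$ be a sequence of length $m$ such that $x_i \in \mathbb{R}^d$ and $y_j \in \mathbb{R}^d$. The two sequences can be aligned using DTW as follows. First construct a local distance matrix $d \in \mathbb{R}^{n \times m}$ with entries $d(i, j) = \lVert x_i - y_j \rVert$. A warping path $P = (p_1, \dots, p_K)$ is an ordered mapping with $p_k = (i_k, j_k)$ that associates an element $x_{i_k}$ with $y_{j_k}$ under the following constraints:
- Boundary condition: $p_1 = (1, 1)$ and $p_K = (n, m)$.
- Monotonicity condition: $i_k \geq i_{k-1}$ and $j_k \geq j_{k-1}$.
- Continuity condition: $i_k - i_{k-1} \leq 1$ and $j_k - j_{k-1} \leq 1$.
The mapping that minimises the misalignment between the two sequences is a path through the matrix $d$ that minimises the warping cost:
$$P^* = \arg\min_{P} \sum_{k=1}^{K} d(p_k)$$
where $d(p_k)$ is the matrix entry indexed by the $k$-th element of a warping path $P$ and $K$ is the length of the path $P$. The path $P^*$ is referred to as the optimal path.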
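DTW utilises a dynamic programming equation to find the optimal path with complexity $O(nm)$, where $n$ and $m$ are the lengths of the two sequences, or lower using heuristics or approximations prune_dtw ; Salvador:2007:TAD:1367985.1367993 ; lucky_dtw ; msdtw . In the most straightforward approach, also referred to as unconstrained DTW, it does so by calculating a cumulative cost matrix $D$ from the local distances $d(i, j)$ as follows:
- First element: $D(1, 1) = d(1, 1)$.
- First row and first column: $D(1, j) = d(1, j) + D(1, j-1)$ and $D(i, 1) = d(i, 1) + D(i-1, 1)$.
- All other elements: $D(i, j) = d(i, j) + \min\{D(i-1, j-1),\ D(i-1, j),\ D(i, j-1)\}$.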
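The cumulative cost recursion of unconstrained DTW can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in this work: the function name `dtw`, the use of the absolute difference as the local distance, and the padding of the cumulative matrix with an infinite border are all assumptions made for the example.

```python
import math


def dtw(x, y):
    """Unconstrained DTW between two 1-D sequences.

    Returns the minimal warping cost and one optimal path as 1-based (i, j) pairs.
    The absolute difference is an assumed choice of local distance.
    """
    n, m = len(x), len(y)
    # Cumulative cost matrix with an extra border row/column set to infinity.
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(x[i - 1] - y[j - 1])
            D[i][j] = local + min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])

    # Backtrack from (n, m) to (1, 1), always moving to the cheapest predecessor.
    i, j = n, m
    path = [(i, j)]
    while (i, j) != (1, 1):
        _, i, j = min(
            (D[i - 1][j - 1], i - 1, j - 1),
            (D[i - 1][j], i - 1, j),
            (D[i][j - 1], i, j - 1),
        )
        path.append((i, j))
    path.reverse()
    return D[n][m], path


# The second sequence repeats one value, so a zero-cost alignment exists.
cost, path = dtw([0, 1, 2, 3], [0, 1, 1, 2, 3])
print(cost, path)
```

The returned path satisfies the boundary, monotonicity and continuity conditions by construction, since every backtracking step moves by at most one index in each dimension.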
This formulation of the DTW algorithm has a number of drawbacks:
- Paths can map very short sections of one sequence to long sections of the other, resulting in unrealistic and in many cases undesirable deformations.
- The number of paths grows exponentially with the lengths of the two sequences.
- Path lengths are at most $n + m - 1$, where $n$ and $m$ are the lengths of the two sequences, which represents a significant deviation from the length of the original sequences.
To alleviate these issues and to reduce computational complexity, additional constraints and modifications have been proposed to the DTW algorithm:
- Slope condition: a limitation on the “gradient” of the path to mitigate unrealistic local alignments of short sections to long sections. This is achieved by limiting the number of consecutive steps that can be taken in one direction before a step in the other direction must be taken 1163055 .
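As an illustration of how such constraints restrict the search, the following sketch limits the cumulative cost computation to a band around the main diagonal, in the spirit of the adjustment window condition referred to in later sections. The slope condition itself is not implemented here. The function name `dtw_window`, the parameter name `r` for the window size, and the absolute difference as local distance are assumptions made for the example.

```python
import math


def dtw_window(x, y, r):
    """DTW cost where only cells with |i - j| <= r may be visited.

    r = 0 restricts the path to the main diagonal; larger values allow more warping.
    Returns infinity when no path fits inside the band.
    """
    n, m = len(x), len(y)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        # Only columns inside the band around the diagonal are evaluated.
        for j in range(max(1, i - r), min(m, i + r) + 1):
            local = abs(x[i - 1] - y[j - 1])
            D[i][j] = local + min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])
    return D[n][m]


x = [0, 1, 2, 3, 3]
y = [0, 0, 1, 2, 3]
print(dtw_window(x, y, 0))  # diagonal only
print(dtw_window(x, y, 1))  # one step of warping allowed
```

A smaller window evaluates fewer cells, which is the computational saving mentioned in the text, at the price of excluding alignments that need more warping than the band allows.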
3 Problem Description
When dealing with temporal sequences it is fair to assume that the same kind of deformations that motivated the development of the DTW algorithm could also be relevant in the calculation of the dot product (“convolution”) in the first convolution layer, and possibly in subsequent layers. Consider a particular filter in a 1-D convolution layer: it is modified during the training process so that its response is maximised when a certain feature is present in a section of the input. However, when local deformations are present in the input, resulting in a local phase mismatch, the response of the filter could decrease. Therefore it is desirable for convolution layers to be invariant to such deformations, which could in turn improve the generalisation capabilities of a CNN for temporal sequence analysis.
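The effect described above can be seen in a toy example. The filter and inputs below are hypothetical values chosen only for illustration: the same feature, delayed by one time step, produces a weaker dot product response.

```python
def response(w, x):
    """Dot product of a filter with an input section of equal length."""
    return sum(wi * xi for wi, xi in zip(w, x))


w = [0.0, 1.0, 2.0, 1.0, 0.0]         # hypothetical filter tuned to a small bump
aligned = [0.0, 1.0, 2.0, 1.0, 0.0]   # the bump, in phase with the filter
deformed = [0.0, 0.0, 1.0, 2.0, 1.0]  # the same bump delayed by one step

print(response(w, aligned))   # 6.0
print(response(w, deformed))  # 4.0
```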
4 Proposed Method
Let $w$ be a filter of length $n$ and $x$ be a section of the 1-D input of equal length starting at index $t$. Let the product matrix be $M = w x^{\top}$, with entries $M_{ij} = w_i x_j$. Having calculated the product matrix $M$, it is possible using an approach similar to DTW to find the optimal path in $M$ under the same conditions (Boundary, Monotonicity, Continuity) as defined in section 2. Formally we would like to calculate:
$$P^* = \arg\max_{P} \; f(P) \sum_{k=1}^{K} M_{p_k} \tag{4}$$
where $M_{p_k}$ is the matrix entry indexed by the $k$-th element of a warping path $P$, $K$ is the length of the path $P$ and $f$ is a normalising function.
As opposed to DTW, the path $P^*$ is the optimal path that maximises the filter response to the input signal under the deformations that can be realised within the limits of the DTW conditions. In many cases increasing the number of terms in (4) will lead to larger sums and therefore longer paths are more likely to emerge as $P^*$. Therefore the normalising function $f$ is required to normalise the cost in relation to the number of terms in the path.
A basic normalising function assigns an equal weight to each of the terms and thus maintains the full alignment of the two signals while adjusting for the path's length:
$$f(P) = \frac{1}{K}$$
where $K$ is the length of the path $P$. Therefore equation (4) can be rewritten as:
$$P^* = \arg\max_{P} \frac{1}{K} \sum_{k=1}^{K} M_{p_k} \tag{6}$$
This weighting scheme is referred to as symmetric and effectively calculates the mean response of the filter under different alignments as defined by each path $P$.
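To make the symmetric scheme concrete, the sketch below searches for the path with the highest mean product between a filter and an input section. It does not attempt to reproduce the dynamic programming procedure of this work; it simply enumerates every path that satisfies the boundary, monotonicity and continuity conditions, which is feasible only for short filters. The names `all_paths` and `best_symmetric_path` and the numeric values are hypothetical.

```python
def all_paths(n, m):
    """Yield every warping path from (0, 0) to (n - 1, m - 1) as 0-based (i, j) pairs."""
    def extend(path):
        i, j = path[-1]
        if (i, j) == (n - 1, m - 1):
            yield list(path)
            return
        # Allowed steps: diagonal, advance the filter index, advance the input index.
        for di, dj in ((1, 1), (1, 0), (0, 1)):
            ni, nj = i + di, j + dj
            if ni < n and nj < m:
                path.append((ni, nj))
                yield from extend(path)
                path.pop()

    yield from extend([(0, 0)])


def best_symmetric_path(w, x):
    """Return (mean response, path) maximising (1 / K) * sum of w[i] * x[j] over the path."""
    best = None
    for path in all_paths(len(w), len(x)):
        score = sum(w[i] * x[j] for i, j in path) / len(path)
        if best is None or score > best[0]:
            best = (score, path)
    return best


w = [0.0, 1.0, 2.0, 1.0, 0.0]   # hypothetical filter
x = [0.0, 0.0, 1.0, 2.0, 1.0]   # hypothetical input section, out of phase with w
score, path = best_symmetric_path(w, x)
print(score, path)
```

Since the main diagonal is itself one of the enumerated paths, the value found this way is never lower than the mean of the standard element-wise products.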
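The cost of a path as defined in equation (4) is by definition a sum of products where all elements of $w$ and $x$ are included. By rearranging the terms and grouping by elements of $w$ the calculation of the cost of path $P$ can be rewritten as:
$$\sum_{i=1}^{n} w_i \sum_{j : (i, j) \in P} a_{ij} x_j$$
where $\sum_{j} a_{ij} = 1$. One possible way to define $a_{ij}$ is as follows:
$$a_{ij} = \begin{cases} 1 / \left|\{ j' : (i, j') \in P \}\right| & \text{if } (i, j) \in P \\ 0 & \text{otherwise} \end{cases}$$
The cost formulation in equation (9) can be seen as a weighted alignment of $x$ onto $w$ and is referred to as x onto w in subsequent sections. There is however a natural alternative rearrangement of equation (4) that can be seen as a weighted alignment of $w$ onto $x$ and is defined by regrouping the cost of the path by elements of $x$ with an alternative normalising function:
$$\sum_{j=1}^{n} x_j \sum_{i : (i, j) \in P} b_{ij} w_i$$
where $\sum_{i} b_{ij} = 1$. In a similar fashion one possible way to define $b_{ij}$ is:
$$b_{ij} = \begin{cases} 1 / \left|\{ i' : (i', j) \in P \}\right| & \text{if } (i, j) \in P \\ 0 & \text{otherwise} \end{cases}$$
In this case the weighting is fully determined by the elements of $P$ and the normalising function $f$. Note that equation (13) is differentiable in respect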
For example, given the path :
these are the matrices corresponding for each of the formulations:
Note that in all cases the non-zero elements of the matrix are the path's indices, and moreover in the x onto w formulation the rows sum to 1, in the w onto x formulation the columns sum to 1, and in the symmetric formulation the entire matrix sums to 1.
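The three weighting matrices can be built mechanically from a path. The sketch below uses a hypothetical path that is not taken from this work, and it assumes that rows index filter elements and columns index input elements, so that the x onto w matrix is the one whose rows sum to 1. The function name `path_matrices` is likewise an assumption.

```python
def path_matrices(path, n):
    """Build the symmetric, x-onto-w and w-onto-x matrices for a 0-based warping path."""
    # Number of path entries in each row (filter index) and column (input index).
    per_row = [sum(1 for i, _ in path if i == r) for r in range(n)]
    per_col = [sum(1 for _, j in path if j == c) for c in range(n)]
    K = len(path)
    sym = [[0.0] * n for _ in range(n)]
    x_onto_w = [[0.0] * n for _ in range(n)]
    w_onto_x = [[0.0] * n for _ in range(n)]
    for i, j in path:
        sym[i][j] = 1.0 / K                 # equal weight for every path entry
        x_onto_w[i][j] = 1.0 / per_row[i]   # each row sums to 1
        w_onto_x[i][j] = 1.0 / per_col[j]   # each column sums to 1
    return sym, x_onto_w, w_onto_x


# Hypothetical path through a 4 x 4 product matrix.
path = [(0, 0), (1, 1), (1, 2), (2, 3), (3, 3)]
sym, x_onto_w, w_onto_x = path_matrices(path, 4)
for name, matrix in (("symmetric", sym), ("x onto w", x_onto_w), ("w onto x", w_onto_x)):
    print(name)
    for row in matrix:
        print(row)
```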
Regardless of the formulation chosen, the output of the filter is then defined as:
$$y = \sigma\left(w^{\top} A^* x + b\right) \tag{14}$$
where $\sigma$ is a non-linear activation, $A^*$ is the matrix corresponding with the optimal path $P^*$ and $b$ is an optional bias scalar. The path following the main diagonal of the matrix $M$ corresponds with the identity matrix and thus reduces the output of equation (14) to the standard convolution layer filter output. To generate a feature map, equation (14) is applied to multiple filters and input sections by repeating the process for each filter/input section combination. Note that nothing prevents the use of a parameter sharing regime to generate the feature maps. Lastly, in practice the output of the filter in equation (14) can be computed more efficiently than in the naive form described above, since $A^*$ is relatively sparse.
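A small sketch of the filter output may help here. It assumes the output has the form activation(w^T A x + b), where A is the weighting matrix of the selected path, and it uses ReLU as the activation. The function names and the numeric values are hypothetical. The second function evaluates the same quantity from the non-zero entries only, which is one way to exploit the sparsity mentioned above.

```python
def relu(z):
    return max(0.0, z)


def filter_output(w, A, x, b=0.0, activation=relu):
    """Naive form: activation(w^T A x + b) with a dense path matrix A."""
    Ax = [sum(a * xj for a, xj in zip(row, x)) for row in A]
    return activation(sum(wi * v for wi, v in zip(w, Ax)) + b)


def filter_output_sparse(w, x, weighted_path, b=0.0, activation=relu):
    """Same value computed from the non-zero entries, given as (i, j, a_ij) triples."""
    return activation(sum(w[i] * a * x[j] for i, j, a in weighted_path) + b)


w = [1.0, 2.0, 3.0]
x = [4.0, 5.0, 6.0]

# The diagonal path gives the identity matrix and hence the standard dot product.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(filter_output(w, identity, x))  # 32.0

# A hypothetical warped path in which w[1] is aligned with both x[1] and x[2].
A = [[1.0, 0.0, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]]
entries = [(0, 0, 1.0), (1, 1, 0.5), (1, 2, 0.5), (2, 2, 1.0)]
print(filter_output(w, A, x), filter_output_sparse(w, x, entries))  # 33.0 33.0
```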
The proposed DTW-CNN layer expands on the hyperparameters that need to be configured for a standard 1-D CNN layer, such as the number of filters, the filter size and the stride. The added hyperparameters are the choice of alignment between the signals (x onto w, w onto x, or symmetric), the adjustment window condition size, and the slope. According to Dau:2018:ODT:3238306.3238325 the optimal adjustment window condition size in relation to accuracy is data dependent, and in some rare cases does not even matter. Smaller window sizes are desirable since at the very least they result in fewer calculations to compute the DTW; intuitively, excessive window sizes can also result in extreme and unrealistic deformations. Historically the speech recognition community used a 10% warping constraint, however in ratanamahatana2004everything it is claimed that even 10% is too large for real-world data.
Experimental results are given in section 6 to demonstrate the impact of the choice of alignment and the adjustment window condition size on the accuracy for some classification tasks. The slope condition impact was not looked into in this work.
Technically there are no limitations on where a DTW-CNN layer can be used along the computation graph. However, considering the class of deformations it is designed to deal with, it seems reasonable that it would be most useful as the first layer immediately following the input layer, so that it can compute activations that are invariant to local out-of-phase deformations in the input. In addition, it seems reasonable to separate multivariate signals and feed each channel through a separate DTW-CNN, since it is feasible for different channels to have local deformations that differ in nature and/or location. The individual feature maps output by each DTW-CNN can be concatenated into a single feature map or kept separate before feeding into the next layer(s). This approach is somewhat similar to the MC-DCNN architecture suggested in Zheng2014TimeSC .
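The per-channel arrangement can be sketched as follows. For brevity a standard 1-D filter response stands in for the DTW-CNN response, so the example only illustrates how channels are processed separately and how their feature maps are concatenated. All names and values are hypothetical.

```python
def conv1d(channel, w, stride=1):
    """Standard 'valid' 1-D filter responses; a stand-in for a DTW-CNN filter."""
    n = len(w)
    return [
        sum(wi * xi for wi, xi in zip(w, channel[s:s + n]))
        for s in range(0, len(channel) - n + 1, stride)
    ]


def per_channel_feature_maps(signal, filters_per_channel, stride=1):
    """Apply each channel's own filters and concatenate all resulting feature maps.

    signal: list of channels, each a list of samples.
    filters_per_channel: one list of filters per channel.
    """
    maps = []
    for channel, filters in zip(signal, filters_per_channel):
        maps.extend(conv1d(channel, w, stride) for w in filters)
    return maps


signal = [[1, 2, 3, 4], [0, 1, 0, 1]]            # two channels
filters = [[[1, 1]], [[1, -1], [0, 1]]]          # one filter for channel 0, two for channel 1
print(per_channel_feature_maps(signal, filters))  # [[3, 5, 7], [-1, 1, -1], [1, 0, 1]]
```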
There are a number of considerations in regard to the training of networks that contain DTW-CNN layers. The first is the impact of the transformation $A^* x$, where $A^*$ denotes the matrix corresponding with the optimal path, on the gradients calculated during the backpropagation algorithm. This can be reasoned about by rewriting equation (14) as $\sigma\left(w^{\top} (A^* x) + b\right)$, which follows from the associativity property of matrix multiplication. This highlights that the gradient updates are similar to those of a standard CNN in relation to the weights; however, the inputs to the filter are the product $A^* x$, a deterministic transformation of the input dependent on $w$ and $x$. Given that $w$ is fixed during the forward and backward propagation, $w$ can be considered as a parameter of the path cost function (4), thus making the cost function and the choice of $A^*$ a function of $x$ only during each and every training iteration. This is consistent with the calculation during inference, where $w$ is constant and therefore the path cost function (4) is a function of $x$ only.
To give another explanation that may be more insightful, it is possible to consider an alternative brute force implementation of equation (14) by a static computation graph. Let $\mathcal{A}$ be the set of all matrices corresponding with the set of all possible paths given a choice of normalising function $f$ and a filter of length $n$. Instead of calculating $A^*$ by the dynamic programming method, a graph is constructed where the input $x$ is multiplied by each and every $A \in \mathcal{A}$, followed by a max operation such that $y = \sigma\left(\max_{A \in \mathcal{A}} w^{\top} A x + b\right)$. The max operation performs a gradient flow selection, channelling the backpropagation of gradients through the branch of the graph corresponding with the maximal value of all such products. It is also clear that all operations are differentiable with respect to $w$, therefore enabling standard backpropagation of gradients.
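The gradient selection argument can be checked numerically on a small case. The sketch below evaluates the maximum over all paths with the symmetric normalisation, takes the gradient with respect to the filter from the selected path only, and compares it with a finite difference estimate. The function names and the values of the filter and input are hypothetical, and the comparison assumes the maximising path is unique, which holds for the values chosen.

```python
def all_paths(n, m):
    """Yield every warping path from (0, 0) to (n - 1, m - 1) as 0-based (i, j) pairs."""
    def extend(path):
        i, j = path[-1]
        if (i, j) == (n - 1, m - 1):
            yield list(path)
            return
        for di, dj in ((1, 1), (1, 0), (0, 1)):
            if i + di < n and j + dj < m:
                path.append((i + di, j + dj))
                yield from extend(path)
                path.pop()

    yield from extend([(0, 0)])


def max_response(w, x):
    """Return (value, gradient w.r.t. w) of the maximum symmetric path response."""
    best_value, best_path = None, None
    for path in all_paths(len(w), len(x)):
        value = sum(w[i] * x[j] for i, j in path) / len(path)
        if best_value is None or value > best_value:
            best_value, best_path = value, path
    # Only the selected branch contributes: the gradient is A* x for the best path.
    grad = [0.0] * len(w)
    for i, j in best_path:
        grad[i] += x[j] / len(best_path)
    return best_value, grad


w, x = [1.0, 2.0, 3.0], [3.0, 1.0, 2.0]
value, grad = max_response(w, x)

# Finite difference estimate of the same gradient.
eps = 1e-6
numeric = []
for k in range(len(w)):
    shifted = list(w)
    shifted[k] += eps
    numeric.append((max_response(shifted, x)[0] - value) / eps)
print(value, grad, numeric)
```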
5 Related work
There exists a huge body of work related to CNNs and to the DTW algorithm, covering both theoretical aspects and the solution of specific problems using these methods. Due to the volume of work it is impossible to provide a complete review of papers related to CNNs and to DTW, and instead the reader is referred to a number of summary resources and to the references included therein. A review of deep learning for time series classification is given in IsmailFawaz2019 . A comparison of different methods for time series classification, including a few CNN architectures, is available in WangYO16 . An overview of deep learning in neural networks is given in DBLP:journals/corr/Schmidhuber14 . A review of various deep learning techniques for time series analysis is available in gamboa2017deep . A comparative benchmark and a review of various methods for time series classification is available in DBLP:journals/corr/BagnallBLL16 . A survey of similarity measures for time series is given in Serr__2014 , which determined that DTW and Time Warped Edit Distance (TWED) are the two best performing measures.
In cui2016multiscale it is suggested to augment inputs by down-sampling and applying low-pass filters to the inputs before using multiple independent filters for each generated input. With this approach the augmentation operation is fixed, and furthermore the approach has the drawback of increasing the number of parameters in the model. Invariant Scattering Convolution Networks DBLP:journals/corr/abs-1203-1513 compute a translation invariant image representation which is stable to deformations, utilising wavelet transform convolutions with non-linear modulus and averaging operators. NeuralWarp jointly learns a deep embedding of the time series with a warping neural network that learns to align values in the latent space grabocka2018neuralwarp ; however, the warping function is soft and has no constraints imposed, which is likely to result in unrealistic warping. Insights in regard to the optimal warping window size are provided in Dau:2018:ODT:3238306.3238325 , where it is determined that the optimal window size depends on both the data and the dataset size. In addition Dau:2018:ODT:3238306.3238325 proposes methods to learn the warping window size and demonstrates that setting the warping window size correctly closes most or all of the improvement gap claimed by the more sophisticated methods proposed in recent years.
6 Experimental Results
To evaluate different aspects of the proposed method four time series datasets are used:
1. The Photometric LSST Astronomical Time-series Classification Challenge (http://www.timeseriesclassification.com/description.php?Dataset=LSST) 181000000:online ; The:UEA:UCR . The test and train sets are merged, shuffled and split.
2. Crop (http://www.timeseriesclassification.com/description.php?Dataset=Crop) wei_tan_2017 ; The:UEA:UCR . The test and train sets are swapped.
3. InsectWingbeatSound (http://www.timeseriesclassification.com/description.php?Dataset=InsectWingbeatSound) DBLP:journals/corr/ChenWBMK14 ; The:UEA:UCR . The test and train sets are merged, shuffled and split; only observations with exactly seven channels are included.
4. Time Series Land Cover Classification Challenge (https://sites.google.com/site/dinoienco/tiselc).
All experiments shared the following basic architecture and settings:
1. For each channel in the input either a standard convolution layer or a DTW convolution layer is created that takes as input the entire sequence for that particular channel, configured with a learned bias and ReLU activation.
2. Pooling layer configured with a pool size of 2 and a stride of 2.
3. Convolution layer operating on the entire input volume, configured with 128 filters of size 5, a stride of 1, a learned bias and ReLU activation.
4. Fully connected layer with 512 units, followed by batch normalisation, ReLU activation and a dropout layer with 0.5 drop probability.
5. Fully connected layer with 256 units, followed by batch normalisation, ReLU activation and a dropout layer with 0.5 drop probability.
6. Fully connected logits layer.
7. Softmax cross-entropy layer.
For optimisation, Adam Adam with the default parameter values is employed. Note that no attempt was made to find an optimal architecture or optimal hyperparameter settings for the experiments; the aim is merely to measure the relative accuracy across different hyperparameter settings, and between the DTW-CNN and a standard CNN, under this basic network configuration. Lastly, unless explicitly stated otherwise, when DTW-CNN is used it is used during both training and inference.
6.3 Hyperparameter experimentation
The methodology used for evaluating the impact of different hyperparameter settings is straightforward and is based on comparing the accuracy of a classification task. The evaluation controls for the other factors by fixing the data, the architecture and the other hyperparameters when running each experiment. To evaluate the impact of different hyperparameter settings, the classification accuracy on the test set is calculated and recorded periodically, each time a number of epochs have been fed through the network for training.
6.3.2 When to apply the DTW transform?
An interesting aspect to consider is whether it is best to apply the DTW transformation only in training, only in inference, or in both. If applying it only in inference is sufficient to obtain good results, then the computational complexity during training is substantially reduced. The rationale for applying it only during inference is that, having learnt the filters' weights of a standard CNN in relation to the data, it might still be useful to align the inputs against the weights during inference to account for local deformations that may be present in the input, and by doing so improve accuracy. Three different configurations in relation to the application of the DTW transform are trialled:
- Both during training and during inference.
- Only during training.
- Only during inference.
For this purpose dataset 1 (LSST) was used. The first DTW-CNN layer (for each channel) was configured to have 8 filters of size 7 with a stride of 1, the warping window was set to 1, a symmetric weighting as per equation (6) was applied, and a mini-batch size of 100 was used in all experiments. The feature maps output by the individual DTW-CNN layers were concatenated before pooling. The results indicate that the maximum accuracy is achieved by a large margin when the DTW transform is applied both in training and inference, and that the worst result is obtained when it is applied only in inference, as illustrated in figure 1. These results can be explained by hypothesising that applying the DTW transform only during inference is equivalent to attempting inference on data that contain substantial local deformations that were not commonly present during training. This demonstrates the adverse effect such deformations have on a standard CNN: the discriminative model learns a mapping from the unwarped inputs, whereas the warped inputs seem to be substantially different for the tested data, resulting in substantially reduced accuracy. When the transform is applied only during training, however, the warping maps similar inputs that have local deformations onto one another, which is equivalent to a form of “pre-clustering” of the inputs and in turn restricts learning to a subspace of the actual training data.
6.3.3 Adjustment window condition
To evaluate the impact of modifying the adjustment window condition hyperparameter, the accuracy obtained for different window sizes was measured while holding all other hyperparameters fixed for datasets 1 (LSST) and 2 (Crop). The warping window was set to values ranging from 1 to 4, a mapping of x onto w as per equation (9) was applied, and a mini-batch size of 50 was used in all experiments. The results demonstrate that the network trained successfully for all values.
For the LSST dataset the first DTW-CNN layer was configured to have 8 filters of size 7 with a stride of 1 for each channel. In this setting all window sizes gave about the same accuracy. It is notable that larger window sizes reduced the number of training steps required until convergence by roughly 30%, as illustrated in figure 2.
For the Crop dataset the first DTW-CNN layer was configured to have 64 filters of size 7 with a stride of 2. The results show that the network trained successfully for all values and that gave the best accuracy by a relatively small margin as illustrated in figure 2.
The results demonstrate that the optimal window size is data dependent and that, whilst overall accuracy may not increase with a larger window, in the context of the DTW-CNN larger window sizes may contribute to faster convergence.
6.3.4 Normalising choice
In section 4 three approaches to normalise the path cost are described: symmetric, x onto w, and w onto x.
To evaluate the effect of modifying the normalising choice the accuracy obtained for different normalising methods was measured while holding all other hyperparameters fixed for the datasets 1 (LSST) and 3 (Insect). For the LSST dataset the first DTW-CNN layer was configured to have 32 filters of size 7 with a stride of 1 for each channel and with the warping window . For the Insect dataset the first DTW-CNN layer was configured to have 10 filters of size 7 with a stride of 5 for each channel and with the warping window .
The results indicate that the optimal choice of path cost normalisation is dependent on the dataset and possibly other factors as illustrated in figure 3.
6.4 Comparison against a standard CNN
The methodology used for evaluating the effectiveness of the proposed DTW-CNN layer is straightforward and is based on comparing the accuracy of a classification task when using a standard 1-D convolution layer against when using the proposed DTW-CNN layer. The evaluation controls for the other factors by fixing the data, the architecture and the hyperparameters when running each experiment. To be explicit note that when DTW-CNN is used it is used during both training and inference.
The classification accuracy on the test set is calculated and recorded periodically, each time a number of epochs have been fed through the network. Three metrics are calculated to estimate the performance of the DTW-CNN layer against the standard 1-D convolution layer:
- Mean accuracy $i$ to $j$: the mean accuracy reported on the test set from training iteration $i$ to iteration $j$.
- Std accuracy $i$ to $j$: the standard deviation of accuracy reported on the test set from training iteration $i$ to iteration $j$.
- Max accuracy $i$ to $j$: the maximum accuracy reported on the test set from training iteration $i$ to iteration $j$.
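The three metrics can be computed from a log of test accuracies as in the sketch below. The function name `window_metrics`, the log format and the logged values are hypothetical, and the population standard deviation is an assumed choice since the text does not state which estimator is used.

```python
import statistics


def window_metrics(accuracy_log, start, end):
    """Mean, standard deviation and maximum of test accuracy between two iterations.

    accuracy_log: list of (training_iteration, test_accuracy) pairs.
    Both window bounds are inclusive.
    """
    values = [acc for iteration, acc in accuracy_log if start <= iteration <= end]
    return statistics.mean(values), statistics.pstdev(values), max(values)


# Hypothetical log: accuracy recorded every 344 iterations.
log = [(344, 0.5), (688, 0.6), (1032, 0.7), (4472, 0.9)]
print(window_metrics(log, 344, 4128))
```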
The following tables summarise the results obtained for the metrics described in 6.4.2 when replacing the first CNN layer with a DTW-CNN layer. The results demonstrate that the classification accuracy for the DTW-CNN layer is mostly better, and never significantly worse, than that of a CNN in the tested settings, where the magnitude of improvement in accuracy varies across datasets. The variance in accuracy as training progresses when employing the DTW-CNN layer is comparable to that of the standard CNN layer, indicating that standard optimisers do not experience significant problems converging with the DTW-CNN layer included in the graph. Moreover, merely increasing the number of filters and/or their length does not bridge the gap in results between the two types of layers, implying that the DTW-CNN is capable of learning representations that fundamentally go beyond the capacity of the standard CNN layer given data with certain characteristics.
| Iterations | DTW-CNN mean | DTW-CNN std | DTW-CNN max | CNN mean | CNN std | CNN max |
| 344 - 4128 | 0.5127 | 0.094 | 0.635 | 0.4894 | 0.0776 | 0.6021 |
| 4472 - 8256 | 0.5925 | 0.0413 | 0.6293 | 0.554 | 0.0327 | 0.5921 |
| 8600 - 12384 | 0.6129 | 0.0208 | 0.6479 | 0.5599 | 0.0537 | 0.6171 |
| 12728 - 16512 | 0.6014 | 0.0270 | 0.6293 | 0.548 | 0.0513 | 0.6043 |
| 16856 - 20640 | 0.5969 | 0.0364 | 0.6236 | 0.5547 | 0.043 | 0.6079 |

| Iterations | DTW-CNN mean | DTW-CNN std | DTW-CNN max | CNN mean | CNN std | CNN max |
| 344 - 4128 | 0.5405 | 0.0885 | 0.6336 | 0.5081 | 0.0737 | 0.5979 |
| 4472 - 8256 | 0.6039 | 0.0299 | 0.6343 | 0.529 | 0.0667 | 0.6114 |
| 8600 - 12384 | 0.6057 | 0.0399 | 0.64 | 0.5665 | 0.0201 | 0.5979 |
| 12728 - 16512 | 0.6184 | 0.0201 | 0.6357 | 0.5434 | 0.0562 | 0.6007 |
| 16856 - 20640 | 0.617 | 0.0126 | 0.635 | 0.5660 | 0.0401 | 0.6171 |

| Iterations | DTW-CNN mean | DTW-CNN std | DTW-CNN max | CNN mean | CNN std | CNN max |
| 1845 - 18450 | 0.7463 | 0.0066 | 0.7538 | 0.7353 | 0.0065 | 0.7485 |
| 20295 - 36900 | 0.7524 | 0.0021 | 0.7572 | 0.7381 | 0.0034 | 0.7432 |
| 38745 - 55350 | 0.7527 | 0.0051 | 0.758 | 0.74 | 0.005 | 0.7458 |
| 57195 - 73800 | 0.7501 | 0.0036 | 0.7562 | 0.7402 | 0.0033 | 0.7465 |
| 75645 - 92250 | 0.7524 | 0.0026 | 0.7565 | 0.7426 | 0.0025 | 0.7471 |

| Iterations | DTW-CNN mean | DTW-CNN std | DTW-CNN max | CNN mean | CNN std | CNN max |
| 1845 - 18450 | 0.7499 | 0.0034 | 0.7557 | 0.7422 | 0.0044 | 0.751 |
| 20295 - 36900 | 0.7511 | 0.0039 | 0.7585 | 0.7381 | 0.0019 | 0.7457 |
| 38745 - 55350 | 0.7525 | 0.0028 | 0.7568 | 0.7414 | 0.0038 | 0.7449 |
| 57195 - 73800 | 0.7524 | 0.0026 | 0.7546 | 0.7446 | 0.0019 | 0.7473 |
| 75645 - 92250 | 0.7533 | 0.0034 | 0.7585 | 0.7425 | 0.0058 | 0.7503 |

| Iterations | DTW-CNN mean | DTW-CNN std | DTW-CNN max | CNN mean | CNN std | CNN max |
| 3268 - 52288 | 0.9195 | 0.01869 | 0.9343 | 0.919 | 0.02072 | 0.9342 |
| 55556 - 104576 | 0.9356 | 0.00199 | 0.9391 | 0.9372 | 0.00179 | 0.9399 |
| 107844 - 153596 | 0.9394 | 0.00214 | 0.9441 | 0.9401 | 0.00227 | 0.9435 |
| 156864 - 205884 | 0.9417 | 0.00167 | 0.9441 | 0.9425 | 0.00171 | 0.9454 |
| 209152 - 258172 | 0.9435 | 0.00117 | 0.9454 | 0.9435 | 0.00122 | 0.9454 |
In this article a novel DTW-CNN layer is proposed that combines a 1-D convolution layer with an algorithm similar to DTW to align the convolution kernel against the inputs. The DTW-CNN layer enables embedding a non-parametric warping of temporal sequences for increasing similarity directly in deep networks. Combining the similarity warping with learned kernel weights results in an overall warping path that is optimised for the learning task. It can therefore expand on the generalisation capabilities and the capacity of the standard 1-D convolution layer where local sequential deformations are present in the input. The results demonstrate that the DTW-CNN exceeds or matches the standard CNN layer in terms of the maximum accuracy achieved on a number of time series classification tasks. In addition, the impact of different hyperparameter settings is demonstrated on different datasets, and the results support the conclusions of previous work in relation to the choice of DTW parameter values.
The author wishes to thank Patrick Peursum for his insightful remarks throughout this research.
-  [1810.00001] the photometric LSST astronomical time-series classification challenge (PLAsTiCC): Data set. (Accessed on 10/22/2018).
-  A. Bagnall, J. Lines, W. Vickers, and E. Keogh. The UEA & UCR time series classification repository.
-  A. J. Bagnall, A. Bostrom, J. Large, and J. Lines. The great time series classification bake off: An experimental evaluation of recently proposed algorithms. extended version. CoRR, abs/1602.01711, 2016.
-  J. Bruna and S. Mallat. Invariant scattering convolution networks. CoRR, abs/1203.1513, 2012.
-  Y. Chen, A. Why, G. E. A. P. A. Batista, A. Mafra-Neto, and E. J. Keogh. Flying insect classification with inexpensive sensors. CoRR, abs/1403.2654, 2014.
-  Z. Cui, W. Chen, and Y. Chen. Multi-scale convolutional neural networks for time series classification, 2016.
-  H. A. Dau, D. F. Silva, F. Petitjean, G. Forestier, A. Bagnall, A. Mueen, and E. Keogh. Optimizing dynamic time warping’s window width for time series data mining applications. Data Min. Knowl. Discov., 32(4):1074–1120, July 2018.
-  J. C. B. Gamboa. Deep learning for time-series analysis, 2017.
-  I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.
-  J. Grabocka and L. Schmidt-Thieme. Neuralwarp: Time-series similarity with warping networks, 2018.
-  H. Ismail Fawaz, G. Forestier, J. Weber, L. Idoumghar, and P.-A. Muller. Deep learning for time series classification: a review. Data Mining and Knowledge Discovery, 33(4):917–963, Jul 2019.
-  F. Itakura. Minimum prediction residual principle applied to speech recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 23(1):67–72, February 1975.
-  D. P. Kingma and J. Ba. Adam: A method for stochastic optimization, 2014. arXiv:1412.6980. Published as a conference paper at the 3rd International Conference for Learning Representations, San Diego, 2015.
-  M. Müller, H. Mattes, and F. Kurth. An efficient multiscale approach to audio synchronization. In Proceedings of the 6th International Conference on Music Information Retrieval, pages 192–197, 2006.
-  C. A. Ratanamahatana and E. Keogh. Everything you know about dynamic time warping is wrong. In Third Workshop on Mining Temporal and Sequential Data. Citeseer, 2004.
-  H. Sakoe and S. Chiba. Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 26(1):43–49, February 1978.
-  S. Salvador and P. Chan. Toward accurate dynamic time warping in linear time and space. Intell. Data Anal., 11(5):561–580, Oct. 2007.
-  J. Schmidhuber. Deep learning in neural networks: An overview. CoRR, abs/1404.7828, 2014.
-  J. Serrà and J. L. Arcos. An empirical evaluation of similarity measures for time series classification. Knowledge-Based Systems, 67:305–314, Sep 2014.
-  D. Silva and G. Batista. Speeding up all-pairwise dynamic time warping matrix calculation. pages 837–845, 06 2016.
-  S. Spiegel, B. Jain, and S. Albayrak. Fast time series classification under lucky time warping distance. 03 2014.
-  Z. Wang, W. Yan, and T. Oates. Time series classification from scratch with deep neural networks: A strong baseline. CoRR, abs/1611.06455, 2016.
-  C. Wei Tan. Dataset: Time series indexing (tsi), Jan 2017.
-  Y. Zheng, Q. Liu, E. Chen, Y. Ge, and J. L. Zhao. Time series classification using multi-channels deep convolutional neural networks. In WAIM, 2014.