Deep Neural Network Ensembles for Time Series Classification

by H. Ismail Fawaz, et al.

Deep neural networks have revolutionized many fields such as computer vision and natural language processing. Inspired by this recent success, deep learning has started to show promising results for Time Series Classification (TSC). However, neural networks still lag behind the state-of-the-art TSC algorithms, which are currently ensembles of 37 non deep learning based classifiers. We attribute this gap in performance to the lack of neural network ensembles for TSC. Therefore, in this paper we show how an ensemble of 60 deep learning models can significantly improve upon the current state-of-the-art performance of neural networks for TSC when evaluated over the UCR/UEA archive, the largest publicly available benchmark for time series analysis. Finally, we show how our proposed Neural Network Ensemble (NNE) is the first time series classifier to outperform COTE while reaching similar performance to the current state-of-the-art ensemble HIVE-COTE.




I Introduction

Time series data are omnipresent in many practical data science applications, ranging from health care [IsmailFawaz2018evaluating] and stock market predictions [anghinoni2018time] to social media analysis [xu2018mnrd] and human activity recognition [xi2018deep]. Since 2006, time series analysis has been considered one of the most challenging problems in data mining [yang200610], and a more recent poll showed that 48% of data experts had analyzed time series data during their career, ahead of text and images [piatetsky2014].

Time Series Classification (TSC) tasks differ from traditional classification tasks by the natural temporal ordering of their attributes [bagnall2017the]. To tackle this problem, a large body of research was dedicated to coupling and enhancing time series similarity measures with a Nearest Neighbor (NN) classifier [dau2017judicious, gharghabi2018ultra]. In [lines2015time], ten elastic distances were compared to the traditional Dynamic Time Warping (DTW) algorithm, and no single measure was found to outperform the classic NN coupled with DTW (NN-DTW) for TSC. These findings motivated the authors to construct a single Elastic Ensemble (EE) classifier that includes all eleven similarity measures and achieves a significant improvement over the individual classifiers [lines2015time]. Hence, recent contributions focused on ensembling different discriminant classifiers such as decision trees (random forest) and Support Vector Machines (SVMs) [bostrom2015binary] on different data representation techniques such as the shapelet transform [bostrom2015binary] or DTW features [kate2016using]. These ideas gave rise to the Collective Of Transformation-based Ensembles (COTE) [bagnall2016time] and its extended version HIVE-COTE [lines2018time], where 37 different classifiers were ensembled over multiple time series data transformation techniques in order to reach current state-of-the-art performance for TSC [bagnall2017the].
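To make the NN-DTW baseline mentioned above concrete, the following is a minimal sketch of a one-nearest-neighbor classifier using an unconstrained DTW distance. It is an illustration only and not the implementation evaluated in the cited works; the function names `dtw_distance` and `nn_dtw_predict` are hypothetical, and no warping window is applied.

```python
import math


def dtw_distance(a, b):
    """Dynamic Time Warping distance between two univariate series (no warping window)."""
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative squared cost of aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return math.sqrt(cost[n][m])


def nn_dtw_predict(train_series, train_labels, query):
    """Return the label of the training series closest to `query` under DTW."""
    best = min(range(len(train_series)),
               key=lambda k: dtw_distance(train_series[k], query))
    return train_labels[best]
```

Because DTW allows one point to align with several points of the other series, two series that differ only by a local stretch can have a distance of zero, which the Euclidean distance cannot express for series of different lengths.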

With the advent of deep neural networks into industrial and commercial applications such as self-driving cars [qiu2018multi] and speech recognition systems [liul2018stochastic], time series data mining practitioners started investigating the application of deep learning to TSC problems [wang2017time]. In our recent empirical study [IsmailFawaz2018deep], we showed how deep Convolutional Neural Networks (CNNs) are able to achieve results that are not significantly different from current state-of-the-art algorithms for TSC problems when evaluated over the 85 time series datasets from the UCR/UEA archive [ucrarchive, bagnall2017the]. Outside the UCR/UEA benchmark, deep neural networks have seen some very successful applications such as evaluating surgical skills from multivariate time series [IsmailFawaz2018evaluating] and recognizing human activities from wearable sensor data [xi2018deep]. These results suggest that building upon deep learning based solutions for TSC could further improve the current state-of-the-art performance of deep neural networks.

One way of improving neural network based classifiers is to build an ensemble of deep learning models. This idea seems very interesting for TSC tasks since the state-of-the-art is moving towards ensembled solutions [lines2018time, lines2015time, bagnall2017the, baydogan2013a]. In addition, deep neural network ensembles seem to achieve very promising results in many supervised machine learning domains such as skin lesion detection [goyal2018deep], facial expression recognition [wen2017ensemble] and automatic bucket filling [dadhich2018predicting].

Fig. 1: Ensemble of deep convolutional neural networks for time series classification.

Therefore, we propose to ensemble the current state-of-the-art deep learning models for TSC developed in [IsmailFawaz2018deep], by constructing one model composed of 60 different deep neural networks: 6 different architectures [wang2017time, zheng2014time, zhao2017convolutional, serra2018towards], each one with 10 different initial weight values. By evaluating on the 85 datasets from the UCR/UEA archive, we demonstrate a significant improvement over the individual classifiers while also reaching very similar performance to HIVE-COTE, the current state-of-the-art ensemble of 37 non deep learning based time series classifiers. Finally, inspired by the recent success of transfer learning for TSC [IsmailFawaz2018transfer], we replace the ensemble of randomly initialized networks with an ensemble constructed out of models fine-tuned from 84 different source datasets, which showed a significant improvement for TSC problems.

The paper is organized as follows: we first explain the background material, then present our different techniques for ensembling deep neural networks. We then describe and discuss our results before drawing a final conclusion with our future directions.

II Background

In this section we describe the current state of research in neural networks for TSC and then present some work related to ensembling neural network classifiers.

II-A Neural networks for time series classification

Since AlexNet [krizhevsky2012imagenet] won the ImageNet [russakovsky2015imagenet] competition in 2012 with a significant improvement in accuracy compared to previous state-of-the-art approaches, the computer vision field has been revolutionized, with many deep neural network papers being published every year to solve image recognition and object localization problems [lecun2015deep]. In addition, sequential data mining tasks such as natural language processing and speech recognition are being tackled with deep convolutional, recurrent and generative adversarial neural networks [li2018text, wang2018hierarchical].

Inspired by this recent success of deep learning models, researchers started adopting these complex machine learning techniques to solve the underlying task of Time Series Classification [IsmailFawaz2018deep, IsmailFawaz2019adversarial]. Specifically, Wang et al. [wang2017time] showed very promising results, where a Fully Convolutional Neural network (FCN) and a Residual Network (ResNet) were designed to reach COTE’s performance when evaluated on 44 datasets from the UCR/UEA archive [ucrarchive, bagnall2017the]. Moreover, in our recent empirical evaluation of deep learning models for TSC [IsmailFawaz2018deep], we managed to reinforce these findings by testing FCN and ResNet on 85 datasets from the UCR/UEA archive. In fact, similar to two dimensional data (images), one dimensional convolutions, when slid over an input time series, enable a non-linear transformation of the data. By applying backpropagation over a cascade of several convolutional layers with many filters, the network is able to learn a time invariant hierarchical representation of the input time series which is potentially useful for classification. For more detail about how these convolutions are applied to one dimensional time series data, we refer the interested reader to our recent survey of deep learning for time series classification.
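As a rough illustration of how a one dimensional convolution is slid over a time series, the sketch below applies a single filter followed by a ReLU non-linearity. The filter values are hypothetical (in a trained network they are learned by backpropagation), and the function name `conv1d_valid` is ours, not taken from the paper's code.

```python
def conv1d_valid(series, kernel, bias=0.0):
    """Slide `kernel` over `series` (no padding, stride 1), then apply ReLU."""
    k = len(kernel)
    out = []
    for t in range(len(series) - k + 1):
        z = sum(series[t + i] * kernel[i] for i in range(k)) + bias
        out.append(max(0.0, z))
    return out


kernel = [-1.0, 0.0, 1.0]  # hypothetical filter that responds to upward slopes
series = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0, 0.0]
shifted = [0.0, 0.0, 0.0, 1.0, 2.0, 1.0, 0.0]  # same bump, one step later

activations = conv1d_valid(series, kernel)
shifted_activations = conv1d_valid(shifted, kernel)
```

The same filter produces the same peak response for both inputs, only at a shifted position, which is the sense in which the learned features do not depend on where a pattern occurs in time.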


Different variants of CNNs were proposed for TSC and validated on the UCR/UEA archive. Multi-scale CNN (MCNN) [cui2016multi] was among the first deep learning architectures to be evaluated for domain agnostic TSC. In [leguennec2016data], Time LeNet (t-LeNet) was proposed as an adaptation of the famous LeNet architecture, which was originally proposed for document recognition [lecun1998gradient]. Multi-Channels Deep Convolutional Neural Networks (MCDCNN) [zheng2014time] and Time-CNN [zhao2017convolutional] were originally proposed for multivariate TSC; however, in [IsmailFawaz2018deep] we have shown how they can be easily extended to univariate TSC. One last CNN model called Encoder was proposed in [serra2018towards], where FCN was extended to include the attention mechanism. In addition to the aforementioned neural network architectures, the classical Multi-Layer Perceptron (MLP) was considered as a baseline architecture in [wang2017time]. Finally, in addition to this pool of deep CNNs for TSC, we should mention a non-convolutional recurrent model called Time Warping Invariant Echo State Networks (TWIESN) [tanisaro2016time], which showed promising results on different datasets in the archive [IsmailFawaz2018deep].

In [IsmailFawaz2018deep], we showed how ResNet, FCN and Encoder won on 41, 18 and 10 datasets respectively (see Table I), suggesting that no single network outperforms all the others on the whole benchmark. This motivates ensembling the decisions of these deep learning classifiers, which is the main contribution of this paper: showing how an ensemble of different deep neural networks can outperform every individual classifier and reach new state-of-the-art performance for TSC.

II-B Neural networks ensemble

Constructing an ensemble of many deep learning classifiers has been shown to achieve high performance in many different fields. In [goyal2018deep], an ensemble of two neural networks was adopted: (1) Inception-v4 and (2) Inception-ResNet-v2. Both of these classifiers are learned with a joint meta-learning approach in an end-to-end manner. A forest CNN was proposed in [lee2017ensemble] for image classification, where similarly to random forest, the ensemble is constructed by replacing the individual nodes with a CNN, and the classifier’s final decision is taken by a majority voting scheme over the decisions of the individual trees in the forest. Another ensemble of CNNs for facial expression recognition was proposed in [wen2017ensemble], where each individual classifier was trained independently to output a probability for each class and the ensemble’s final decision was taken using a probability-based fusion method. In [dadhich2018predicting], an ensemble of neural networks was found to outperform other hybrid machine learning ensembles when solving an automatic bucket filling problem. Finally, in [ienco2018semi], deep auto-encoders were ensembled in order to learn an unsupervised latent representation of the input data over multiple resolutions, thus improving the quality of the produced clusters.

Although ensembling deep neural networks yields better decisions in almost all use cases, we did not find any approach using a neural network ensemble for domain agnostic TSC. Perhaps the work in [jin2016ensemble] is the closest to ours: a neural network based ensemble was used to perform biomedical TSC, with individual architectures constructed using domain knowledge specific to the classification problem at hand, such as choosing the filter length with local and distorted views. In addition, our recent work on ensembling two deep learning models (with or without data augmentation) showed how the ensemble classifier was able to significantly outperform the individual model [IsmailFawaz2018data]. Therefore, we decided to further explore ensembling deep neural networks for TSC, by combining multiple deep learning models in different settings.

III Methods

In this section, we start by presenting the six different architectures composing our ensembles of neural networks. For completeness, we describe the random initialization technique adopted for all models. Finally, we present a transfer learning based alternative to randomly initializing the weights of the networks.

III-A Architectures

The average rank of the six chosen deep learning classifiers over the 85 datasets from the UCR/UEA archive [ucrarchive, bagnall2017the] is listed in Table I. All of these architectures were implemented in a common framework during our empirical study [IsmailFawaz2018deep], which originally contained 9 different deep learning approaches for TSC. However, only 6 out of these 9 approaches were probabilistic classifiers, whereas the three other classifiers performed a hard prediction, meaning an input time series is assigned a specific class rather than a probability distribution over all the classes in a dataset. Therefore, we chose to only ensemble the 6 probabilistic models, thus allowing us to combine the networks by averaging the a posteriori probability for each class over the individual classifiers’ outputs. Finally, we present a brief description of these 6 different architectures and refer the interested reader to a more thorough explanation in the corresponding papers. All hyperparameters can be found in


Approach                            Rank    Wins
ResNet [wang2017time]               1.88    41
FCN [wang2017time]                  2.49    18
Encoder [serra2018towards]          3.34    10
MLP [wang2017time]                  4.08    4
Time-CNN [zhao2017convolutional]    4.38    4
MCDCNN [zheng2014time]              4.83    3

TABLE I: Average rank and number of wins of the six classifiers constituting the Neural Network Ensemble for time series classification over the 85 datasets from the UCR/UEA archive.
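For readers unfamiliar with the two columns of Table I, the sketch below shows one way to compute an average rank and a win count from a table of per-dataset accuracies. It is an assumption-laden illustration rather than the exact procedure behind Table I: ties share the average of the tied ranks, and a tie for first place is counted as a win for every tied classifier. The function names are hypothetical.

```python
def ranks_with_ties(accuracies):
    """Rank classifiers on one dataset: 1 = best; tied classifiers share the average rank."""
    order = sorted(range(len(accuracies)), key=lambda i: -accuracies[i])
    ranks = [0.0] * len(accuracies)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and accuracies[order[end + 1]] == accuracies[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # average of the 1-based positions pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def average_ranks_and_wins(acc_table):
    """acc_table[d][k] = accuracy of classifier k on dataset d."""
    n_clf = len(acc_table[0])
    totals = [0.0] * n_clf
    wins = [0] * n_clf
    for row in acc_table:
        for k, r in enumerate(ranks_with_ties(row)):
            totals[k] += r
        best = max(row)
        for k, acc in enumerate(row):
            if acc == best:
                wins[k] += 1  # assumption: a tie for first counts as a win for each
    return [t / len(acc_table) for t in totals], wins


# Toy example with 3 classifiers and 3 datasets (made-up accuracies).
toy = [[0.9, 0.8, 0.7], [0.6, 0.6, 0.5], [0.7, 0.9, 0.8]]
avg_ranks, wins = average_ranks_and_wins(toy)
```

A lower average rank is better, so a classifier can have the best rank without winning on the majority of the datasets.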

III-A1 Multi-Layer Perceptron

(MLP) is the simplest form of deep neural networks and was proposed in [wang2017time] as a baseline architecture for TSC. The architecture contains three hidden layers, with each one fully connected to the output of its previous layer. The main characteristic of this architecture is the use of a Dropout layer [srivastava2014a] to reduce overfitting. One disadvantage is that since the input time series is fully connected to the first hidden layer, the temporal information in a time series is lost [IsmailFawaz2018deep].

III-A2 Fully Convolutional Neural Network

(FCN), originally proposed in [wang2017time], is considered a competitive architecture yielding the second best results when evaluated on the UCR/UEA archive (see Table I). This network comprises three convolutional layers, each one performing a non-linear transformation of the input time series. A global average pooling operation is used before the final softmax classifier, thus drastically reducing the number of parameters in the network and allowing an architecture that is invariant to the length of the input time series. The latter characteristic motivated us to apply a transfer learning technique in [IsmailFawaz2018transfer] and to ensemble the resulting neural networks, which is discussed later in Section III-C.
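The length invariance provided by global average pooling can be seen in a small sketch: whatever the number of time steps, each filter's activations are reduced to a single mean, so the input to the softmax layer always has one value per filter. The feature map values below are made up and the function name is hypothetical.

```python
def global_average_pooling(feature_maps):
    """feature_maps[f] = activations of filter f over time; returns one mean per filter."""
    return [sum(fm) / len(fm) for fm in feature_maps]


short = [[1.0, 3.0], [0.0, 2.0]]                      # 2 filters, 2 time steps
long_ = [[1.0, 3.0, 2.0, 2.0], [0.0, 2.0, 1.0, 1.0]]  # 2 filters, 4 time steps

pooled_short = global_average_pooling(short)
pooled_long = global_average_pooling(long_)
```

Both pooled vectors have length 2, so the same final layer can be applied to series of different lengths, which is what makes the convolutional weights reusable across datasets.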

III-A3 Residual Network

(ResNet) was originally proposed in [wang2017time] and showed similar performance to FCN when evaluated on 44 datasets from the archive. However, when evaluated over the 85 datasets, ResNet significantly outperformed FCN (see Table I). The main characteristic of ResNet is the addition of residual connections, which enable a direct flow of the gradient.
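A residual connection can be sketched in a few lines: the block's input is added back to the output of its transformation before the activation. This is a simplified illustration that assumes the input and the transformed output have the same shape; it is not the ResNet architecture of [wang2017time], and `residual_block` is a hypothetical name.

```python
def relu(values):
    return [max(0.0, v) for v in values]


def residual_block(x, transform):
    """y = ReLU(transform(x) + x): the input is added back through a shortcut."""
    fx = transform(x)
    return relu([f + xi for f, xi in zip(fx, x)])


# A transformation that outputs zeros (for example, layers that have learned nothing yet).
zero_transform = lambda v: [0.0 for _ in v]

passed_through = residual_block([1.0, 2.0], zero_transform)
```

Even when the transformation contributes nothing, the input still reaches the output through the shortcut, and the same additive path lets the gradient reach earlier layers directly during backpropagation.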


III-A4 Encoder

(Encoder) was originally proposed in [serra2018towards] as a hybrid CNN that modifies the FCN architecture [wang2017time] by mainly adding a Dropout layer [srivastava2014a] and an attention mechanism. The latter operation enables Encoder to learn to localize which regions of the input time series are useful for a certain class identification.

III-A5 Multi-Channels Deep Convolutional Neural Networks

(MCDCNN) was originally proposed in [zheng2014time] for multivariate TSC and adapted to univariate data in [IsmailFawaz2018deep]. It consists of a traditional CNN, where each convolutional layer is followed by a max pooling operation, and a traditional fully connected layer is used before the final softmax classifier.

III-A6 Time Convolutional Neural Network

(Time-CNN) was originally proposed for univariate as well as multivariate TSC [zhao2017convolutional]. Similarly to MCDCNN, this network is a traditional CNN with one major exception: the use of the mean squared error instead of the traditional categorical cross-entropy loss function, which has been used by all the deep learning approaches we have mentioned so far. Therefore, for Time-CNN, the sum of the output class probabilities is not guaranteed to be equal to one.
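To see why class outputs may not sum to one, the sketch below compares a softmax output layer with independent sigmoid units applied to the same pre-activations. The sigmoid output layer is an assumption made for illustration (it is a common pairing with a mean squared error loss); the text above only states which loss Time-CNN uses.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(logits):
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


logits = [2.0, 1.0, -1.0]  # made-up pre-activations for three classes

softmax_sum = sum(softmax(logits))
sigmoid_sum = sum(sigmoid(logits))
```

The softmax outputs always sum to one by construction, whereas each sigmoid unit is squashed independently, so their sum can be above or below one.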

III-B Ensembling models with random initial weights

In the previous subsection, we described the architecture of six different classifiers. The weights of each network are initialized randomly using Glorot’s uniform initialization method [glorot2010understanding]. This technique ensures a uniform distribution of the initial weight values. However, due to non-convexity, networks with the same architecture but different initial weights could yield different validation accuracy. In [choromanska2015loss], the authors showed that deeper networks are much more stable with respect to this randomness. This would suggest that ensembling relatively shallow architectures would yield a larger improvement in accuracy than ensembling deeper architectures. Fortunately, for low dimensional time series data, current state-of-the-art architectures are much shallower than their counterpart networks for high dimensional images. Therefore, we believe that we can leverage this instability of neural networks for time series data by ensembling the decisions taken by the same network with different random initializations, using the following equation:


$$\hat{y}_{i,c} = \frac{1}{n} \sum_{j=1}^{n} \sigma_c(x_i, \theta_j), \quad \forall c \in [1, C]$$

with $\hat{y}_{i,c}$ denoting the ensemble’s output probability of having the input time series $x_i$ belonging to class $c$, which is equal to the logistic output $\sigma_c$ averaged over the $n$ randomly initialized models, where $\theta_j$ denotes the weights of the $j$-th model and $C$ the number of classes. We should note that training an ensemble of the same architecture with different initial weight values has been shown to improve neural networks’ performance on many computer vision problems [wen2017ensemble]; however, we did not encounter any previous work that combines such classifiers for TSC.
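The averaging described above can be sketched as follows, assuming each trained model has already produced a probability vector for the same input time series. The names `ensemble_probabilities` and `ensemble_predict` are hypothetical and the probabilities are made up.

```python
def ensemble_probabilities(model_outputs):
    """model_outputs[j][c] = probability of class c given by model j for one input series."""
    n = len(model_outputs)
    n_classes = len(model_outputs[0])
    return [sum(out[c] for out in model_outputs) / n for c in range(n_classes)]


def ensemble_predict(model_outputs):
    """Predicted class = index of the largest averaged probability."""
    y_hat = ensemble_probabilities(model_outputs)
    return max(range(len(y_hat)), key=lambda c: y_hat[c])


# Three models: two weakly favor class 0, one is very confident about class 1.
outputs = [[0.55, 0.45], [0.60, 0.40], [0.05, 0.95]]
averaged = ensemble_probabilities(outputs)
predicted = ensemble_predict(outputs)
```

In this toy case a majority vote over hard predictions would choose class 0, while averaging the probabilities chooses class 1 because it takes each model's confidence into account.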

III-C Transfer learning

An alternative to training a deep classifier from scratch is to fine-tune a model that has already been pre-trained on a related or unrelated task [IsmailFawaz2018transfer]. This process is called transfer learning: the network is first trained on a source dataset, then the final layer is removed and replaced with a new randomly initialized softmax layer whose number of neurons is equal to the number of classes in the target dataset. The pre-trained model is then fine-tuned (re-trained) on the target dataset’s training set. With 85 datasets in the archive, each target dataset has 84 potential source datasets, which motivated us to ensemble the decisions of these 84 FCN models.
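The layer replacement step can be sketched with a toy model stored as a dictionary. Everything about this representation is hypothetical (the keys, the function names and the use of Glorot's uniform method for the new layer are assumptions made for illustration); the fine-tuning itself, which requires a training loop, is not shown.

```python
import math
import random


def glorot_uniform(fan_in, fan_out, rng):
    """Glorot uniform: weights drawn from U(-limit, limit), limit = sqrt(6 / (fan_in + fan_out))."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)] for _ in range(fan_in)]


def prepare_for_fine_tuning(source_model, n_target_classes, seed=0):
    """Keep the pre-trained feature layers and attach a fresh softmax layer."""
    rng = random.Random(seed)
    n_features = len(source_model["softmax_weights"])  # inputs to the final layer
    return {
        "feature_layers": source_model["feature_layers"],  # transferred from the source dataset
        "softmax_weights": glorot_uniform(n_features, n_target_classes, rng),
        "softmax_bias": [0.0] * n_target_classes,
    }


# Toy source model: 128 pooled features feeding a 5-class softmax layer.
source = {
    "feature_layers": ["conv1", "conv2", "conv3"],
    "softmax_weights": [[0.1] * 5 for _ in range(128)],
    "softmax_bias": [0.0] * 5,
}
target = prepare_for_fine_tuning(source, n_target_classes=3)
```

Repeating this for each available source dataset gives one fine-tuned model per source, whose output probabilities can then be averaged like those of any other ensemble member.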

IV Results

In this section we present the results of different ensembling schemes when evaluated on the 85 datasets from the UCR/UEA archive [ucrarchive, bagnall2017the], which is currently the largest publicly available benchmark for time series analysis. In order to compare multiple classifiers over several datasets, following the recommendations in [demsar2006statistical], we first perform the Friedman test to reject the null hypothesis. For the post-hoc analysis, following the recent recommendations in [benavoli2016should], we abandoned the average rank comparison in favor of a pairwise statistical comparison: the Wilcoxon signed-rank test with Holm’s alpha correction. Finally, we used a critical difference diagram [demsar2006statistical] to visualize the results of these statistical tests projected onto the average rank axis, with a thick horizontal line showing a clique of classifiers that are not significantly different (see Figure 2 for an example of such a diagram). All experiments were conducted on a hybrid cluster of more than 60 NVIDIA GPUs comprising GTX 1080 Ti, Tesla K20, K40 and K80 cards. Note that the code, the raw results and all the pre-trained models are publicly available on the paper’s companion repository.
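Holm's correction for the pairwise comparisons can be sketched as a step-down procedure over the p-values, which are assumed here to have already been produced by the pairwise Wilcoxon signed-rank tests. The significance level of 0.05 used as a default is an assumption, as are the function name and the example p-values.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm's step-down procedure; returns one reject (True) or keep (False) flag per hypothesis."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for step, idx in enumerate(order):
        # The k-th smallest p-value (k = step + 1) is compared with alpha / (m - k + 1).
        if p_values[idx] <= alpha / (m - step):
            reject[idx] = True
        else:
            break  # once one hypothesis is kept, all larger p-values are kept too
    return reject


# Made-up p-values for four pairwise comparisons.
decisions = holm_reject([0.01, 0.04, 0.03, 0.005])
```

Pairs whose hypothesis is kept are the ones that would be joined by a thick horizontal line in a critical difference diagram.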

IV-A Ensembling randomly initialized models

By ensembling randomly initialized networks, we are able to achieve a significant improvement in accuracy. Figure 2 shows a critical difference diagram where ten different random initializations of ResNet did not yield significantly different results. However, by ensembling these different networks, we were able to demonstrate a significant improvement in the average rank over the 85 datasets. We should note that the latter phenomenon was also observed for the five other neural networks described in Section III. Finally, we should emphasize that ensembling improves the stability of ResNet in terms of accuracy, in other words it reduces the bias due to the initial weight values as well as the randomness induced by gradient descent based optimization.

Fig. 2: Critical difference diagram showing the pairwise statistical comparison of ten ResNets with random initializations as well as one ResNet ensemble composed of these ten individual neural networks.

IV-B Ensembling all neural networks

After demonstrating that an ensemble of neural networks is better than a single classifier, we sought to answer the following question: could a hybrid ensemble of randomly initialized networks achieve even better performance? Figure 3 shows a critical difference diagram containing six ensembles of homogeneous networks as well as the hybrid ensemble of all available networks. The latter classifier contains sixty different networks: each architecture (six in total) is initialized with ten different random weight values. The results show that the ensemble of all networks achieved the best average rank among all classifiers. However, the statistical test failed to find any significant difference between the full ensemble and the individual ResNet and FCN ensembles. This would suggest that the full ensemble is highly affected by the poor performance of Time-CNN, MLP and MCDCNN. The latter classifiers showed the worst average ranks without any significant difference between them, thus suggesting that removing them would yield even better performance.

Fig. 3: Critical difference diagram showing the pairwise statistical comparison of six architectures ensembled with ten different random initializations each, as well as one ensemble containing the six models.

IV-C Neural Network Ensemble

The results in the previous section suggest that carefully choosing the classifiers in the pool would yield a better ensemble. Therefore, we construct a Neural Network Ensemble (NNE) comprised solely of ResNet, FCN and Encoder. These three architectures were the only ones to yield significantly different results when a homogeneous ensemble was adopted (Figure 3). Further investigations suggested that FCN performs better than ResNet on electrocardiography datasets [IsmailFawaz2018deep], which motivates combining these two classifiers in order to have a robust algorithm that improves the accuracy over the whole archive. However, for small datasets such as DiatomSizeReduction, both FCN and ResNet overfitted the dataset very easily with 30% test accuracy [IsmailFawaz2018data], whereas Encoder managed to achieve very good performance with a 92% accuracy, which implies that a combination of ResNet, FCN and Encoder would yield better accuracy on a wide range of TSC datasets. Figure 4 shows how NNE is able to outperform an ensemble of pure ResNets with 45 wins and 18 ties on 85 datasets from the archive. We believe that the combination of an FCN with ResNet and Encoder enables the classifier to benefit respectively from the residual linear connections and the attention mechanism.

Fig. 4: The Neural Network Ensemble (NNE) composed of ResNet, FCN and Encoder is significantly better than an ensemble of pure ResNets.

To further understand how NNE performs with respect to current state-of-the-art TSC algorithms, we illustrate in Figure 5 a critical difference diagram containing NNE and seven other non deep learning based classifiers: (1) NN-DTW corresponds to the nearest neighbor coupled with the Dynamic Time Warping distance; (2) EE is an ensemble of nearest neighbor classifiers with eleven elastic distances; (3) BOSS corresponds to the ensemble Bag-of-SFA-Symbols; (4) ST is another ensemble of off-the-shelf classifiers computed over the Shapelet Transform data domain; (5) PF or Proximity Forest is an ensemble of decision trees coupled with eleven elastic distances; finally (6) COTE and (7) HIVE-COTE are two ensembles of respectively 35 and 37 classifiers using multiple data transformation techniques. The results for these classifiers were taken from [bagnall2017the] except for PF, whose results were taken from the original paper [lucas2018proximity]. Figure 5 clearly shows how our NNE is able to reach state-of-the-art performance for TSC, suggesting that CNNs are able to extract one dimensional discriminant features useful for classification in an end-to-end manner, as opposed to the hand-engineered features used by HIVE-COTE such as the Discrete Fourier Transform, DTW features and the Shapelet Transform.

Fig. 5: Critical difference diagram showing the pairwise statistical comparison of current state-of-the-art algorithms with the Neural Network Ensemble (NNE) added to the pool.

IV-D Ensembling fine-tuned models

Figure 6 shows that ensembling fine-tuned FCNs is significantly better than ensembling randomly initialized FCN models that are trained from scratch. However, this transfer learning based ensemble outperformed neither the ResNet ensemble nor NNE. These results show that the choice of architecture is crucial and suggest that an ensemble of transferred ResNets could demonstrate even better performance than an ensemble of pure ResNets or NNE.

Fig. 6: Ensembling fine-tuned models is significantly better than ensembling randomly initialized FCN models that are trained from scratch.

V Conclusion

In this paper, we showed how ensembling deep neural networks can achieve state-of-the-art performance for time series classification. We showed that it is almost always beneficial to ensemble randomly initialized models rather than to choose one trained neural network out of the ensemble. Finally, we investigated an ensemble of transferred deep CNNs, which demonstrated even better performance than ensembling randomly initialized networks. In the future, we would like to consider a meta-learning approach where the logistic outputs of individual deep learning models are fed to a meta-network that learns to map these inputs to the correct prediction.


The authors would like to thank the providers of the UCR/UEA benchmark datasets, as well as NVIDIA Corporation for the GPU Grant and the Mésocentre of Strasbourg for providing access to the GPU cluster.