InceptionTime: Finding AlexNet for Time Series Classification

09/11/2019 ∙ by Hassan Ismail Fawaz, et al. ∙ UHA ∙ Monash University

Time series classification (TSC) is the area of machine learning interested in learning how to assign labels to time series. The last few decades of work in this area have led to significant progress in the accuracy of classifiers, with the state of the art now represented by the HIVE-COTE algorithm. While extremely accurate, HIVE-COTE is infeasible to use in many applications because of its very high training time complexity in O(N^2*T^4) for a dataset with N time series of length T. For example, it takes HIVE-COTE more than 72,000s to learn from a small dataset with N=700 time series of short length T=46. Deep learning, on the other hand, has now received enormous attention because of its high scalability and state-of-the-art accuracy in computer vision and natural language processing tasks. Deep learning for TSC has only very recently started to be explored, with the first few architectures developed over the last 3 years only. The accuracy of deep learning for TSC has been raised to a competitive level, but has not quite reached the level of HIVE-COTE. This is what this paper achieves: outperforming HIVE-COTE's accuracy together with scalability. We take an important step towards finding the AlexNet network for TSC by presenting InceptionTime---an ensemble of deep Convolutional Neural Network (CNN) models, inspired by the Inception-v4 architecture. Our experiments show that InceptionTime slightly outperforms HIVE-COTE with a win/draw/loss on the UCR archive of 40/6/39. Not only is InceptionTime more accurate, but it is much faster: InceptionTime learns from that same dataset with 700 time series in 2,300s but can also learn from a dataset with 8M time series in 13 hours, a quantity of data that is fully out of reach of HIVE-COTE.


1 Introduction

Recent times have seen an explosion in the magnitude and prevalence of time series data. Industries varying from health care (Forestier et al., 2018; Lee et al., 2018; Ismail Fawaz et al., 2019d) and social security (Yi et al., 2018) to human activity recognition (Yuan et al., 2018) and remote sensing (Pelletier et al., 2019), all now produce time series datasets of previously unseen scale—both in terms of time series length and quantity. This growth also means an increased dependence on automatic classification of time series data, and ideally, algorithms with the ability to do this at scale.

These problems, known as time series classification (TSC), differ significantly from traditional supervised learning for structured data, in that the algorithms should be able to handle and harness the temporal information present in the signal (Bagnall et al., 2017). It is easy to draw parallels from this scenario to computer vision problems such as image classification and object localization, where successful algorithms learn from the spatial information contained in an image. Put simply, the time series problem is essentially the same class of problem, just with one less dimension. Yet despite this similarity, the current state-of-the-art algorithms from the two fields share little resemblance (Ismail Fawaz et al., 2019b).

Deep learning has a long history (in machine learning terms) in computer vision (LeCun et al., 1998) but its popularity exploded with AlexNet (Krizhevsky et al., 2012), after which it has been unquestionably the most successful class of algorithms (LeCun et al., 2015). Conversely, deep learning has only recently started to gain popularity amongst time series data mining researchers (Ismail Fawaz et al., 2019b). This is emphasized by the fact that the Residual Network (ResNet), which is currently considered the state-of-the-art neural network architecture for TSC when evaluated on the UCR archive (Dau et al., 2018), was originally proposed merely as a baseline model for the underlying task (Wang et al., 2017). Given the similarities in the data, it is easy to suggest that there is much potential improvement for deep learning in TSC.

In this paper, we take an important step towards finding the equivalent of ‘AlexNet’ for TSC by presenting InceptionTime, a novel deep learning ensemble for TSC. InceptionTime achieves state-of-the-art accuracy when evaluated on the UCR archive (currently the largest publicly available repository for TSC (Dau et al., 2018)) while also possessing the ability to scale to a magnitude far beyond that of its strongest competitor.

InceptionTime is an ensemble of five deep learning models for TSC, each created by cascading multiple Inception modules (Szegedy et al., 2015), having the same architecture but different initial weight values. Figure 1 illustrates the details of an Inception module, where multiple filters are applied simultaneously to an input time series. The module includes filters of varying lengths, which, as we will show, allows the network to automatically extract relevant features from both long and short time series.

Figure 1: Inside our Inception module for time series classification

After presenting InceptionTime and its results, we perform an analysis of the architectural hyperparameters of deep neural networks—depth, filter length, number of filters—and the characteristics of the Inception module—the bottleneck and residual connection, in order to provide insight into why this model is so successful. In fact, we construct networks with filters larger than have ever been explored for computer vision tasks, taking direct advantage of the fact that time series exhibit one less dimension than images.

The remainder of this paper is structured as follows: first we start by presenting the background and related work in Section 2. We then proceed to Section 3 to explain the InceptionTime architecture and its main building block—the Inception module. Section 4 contains the details of our experimental setup. In Section 5, we show that InceptionTime produces state-of-the-art accuracy on the UCR archive, the TSC benchmark, while also presenting a runtime comparison with its nearest competitor. In Section 6, we provide a detailed hyperparameter study that provides insight into the choices made when designing our proposed neural network. Finally we conclude the paper in Section 7 and give directions for further research on deep learning for TSC.

2 Related work

In this section, we start with some preliminary definitions for ease of understanding, before presenting the current state-of-the-art algorithms for TSC. We end by providing a deeper background for designing neural network architectures for domain-agnostic TSC problems.

2.1 Time series classification

Definition 1

An $M$-dimensional Multivariate Time Series (MTS) $X = [x_1, x_2, \dots, x_T]$ consists of $T$ ordered elements $x_i \in \mathbb{R}^M$.

Definition 2

A univariate time series $X$ of length $T$ is simply an MTS with $M = 1$.

Definition 3

$D = \{(X_1, Y_1), (X_2, Y_2), \dots, (X_N, Y_N)\}$ is a dataset containing a collection of pairs $(X_i, Y_i)$ where $X_i$ could either be a univariate or multivariate time series with $Y_i$ as its label.

The task of classifying time series data consists of learning a classifier on $D$ in order to map from the space of possible inputs $X$ to a probability distribution over the classes $Y$. For many years, the leading classifier for TSC was the nearest neighbor algorithm coupled with the Dynamic Time Warping similarity measure (NN-DTW) (Bagnall et al., 2017). Much research has subsequently focused on finding alternative similarity measures (Marteau, 2009; Stefan et al., 2013; Keogh and Pazzani, 2001; Vlachos et al., 2006), however none have been found to significantly outperform NN-DTW on the UCR Archive (Lines and Bagnall, 2015). This research informed one current state-of-the-art method, named Elastic Ensemble (EE), which is an ensemble of 11 nearest neighbor classifiers each coupled with a different similarity measure (Lines and Bagnall, 2015). While this algorithm produces state-of-the-art accuracy, its use on large datasets is limited by its training complexity, with some of its parameter searches being in $O(N^2 \cdot T^3)$. Following this line of research, all recent successful classification algorithms for time series data are ensemble-based models. Furthermore, to tackle EE’s huge training time, Lucas et al. (2019) proposed a tree-based ensemble called Proximity Forest (PF) that uses EE’s distances as a splitting criterion while replacing the parameter searches with random sampling.

The Bag-of-SFA-Symbols (BOSS) is based on an ensemble of nearest neighbor classifiers coupled with a bespoke Euclidean distance computed on the frequency histograms obtained from the SFA discretization (Schäfer, 2015b). BOSS has a high training complexity of $O(N^2 \cdot T^2)$, which the authors identified as a shortcoming and attempted to address with subsequent scalable variations of the algorithm in Schäfer (2015a) and Schäfer and Leser (2017), however neither of these reached state-of-the-art accuracy. Another type of ensemble classifier is the shapelet-based algorithm, such as in Hills et al. (2014), where discriminative subsequences (shapelets) are extracted from the training set and fed to off-the-shelf classifiers such as SVMs and random forests. The shapelet transform has a training complexity of $O(N^2 \cdot T^4)$ and thus, again, has little potential to scale to large datasets.

More recently, Bagnall et al. (2016) noted that there is no single time series transformation technique (such as shapelets or SFA) that significantly dominates the others, showing that constructing an ensemble of different classifiers over different time series representations, called COTE, significantly improves the accuracy. Lines et al. (2016) extended COTE with a hierarchical voting scheme, which further improves the decision taken by the ensemble. Named the Hierarchical Vote Collective of Transformation-Based Ensembles (HIVE-COTE), it represents the current state-of-the-art accuracy when evaluated on the UCR archive, however its practicality is hindered by its huge training complexity of order $O(N^2 \cdot T^4)$. This is highlighted by the extensive experiments in Lucas et al. (2019) where PF showed competitive performance with COTE and HIVE-COTE, while having a runtime that is orders of magnitude lower. Deep learning models, which we will discuss in detail in the following subsection, also significantly beat the runtime of HIVE-COTE by trivially leveraging GPU parallel computation abilities. A comprehensive review of recent methods for TSC can be found in Bagnall et al. (2017).

2.2 Deep learning for time series classification

Since the recent success of deep learning techniques in supervised learning such as image recognition (Zhang et al., 2018) and natural language processing (Guan et al., 2019), researchers started investigating these complex machine learning models for TSC (Wang et al., 2017; Cui et al., 2016; Ismail Fawaz et al., 2019a). In particular, Convolutional Neural Networks (CNNs) have shown promising results for TSC. Given an input MTS, a convolutional layer consists of sliding one-dimensional filters over the time series, thus enabling the network to extract non-linear discriminant features that are time-invariant and useful for classification. By cascading multiple layers, the network is able to further extract hierarchical features that should in theory improve the network’s prediction. Note that given an input univariate time series, by applying several one-dimensional filters, the outcome can be considered an MTS whose length is preserved and whose number of dimensions is equal to the number of filters applied at this layer. More details on how deep CNNs are being adapted for one-dimensional time series data can be found in Ismail Fawaz et al. (2019b). The rest of this subsection is dedicated to describing what is currently being explored in deep learning for TSC.
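The sliding-filter operation described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function name `conv1d_same`, the zero "same" padding, and the two example filters are assumptions made for the sketch, and the non-linear activation that normally follows a convolution is omitted.

```python
def conv1d_same(series, filters):
    """Slide 1D filters over a multivariate series (stride 1, zero 'same' padding).

    series:  list of T time steps, each a list of M channel values.
    filters: list of F filters, each a list of k taps, each tap a list of M weights.
    Returns a list of T time steps, each holding F values (one per filter).
    """
    T = len(series)
    M = len(series[0])
    out = []
    for t in range(T):
        step = []
        for filt in filters:
            k = len(filt)
            left = (k - 1) // 2  # zero padding on the left so the output keeps length T
            acc = 0.0
            for j in range(k):
                idx = t - left + j
                if 0 <= idx < T:  # positions outside the series count as zeros
                    for c in range(M):
                        acc += filt[j][c] * series[idx][c]
            step.append(acc)
        out.append(step)
    return out


# Univariate series of length 6 (M = 1) and two hypothetical filters of length 3.
x = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
moving_sum = [[1.0], [1.0], [1.0]]
difference = [[-1.0], [0.0], [1.0]]
y = conv1d_same(x, [moving_sum, difference])
print(len(y), len(y[0]))  # 6 2
```

The output keeps the input length (6) and has one dimension per filter (2), which is the sense in which a filtered univariate series can be read as an MTS.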

Multi-scale Convolutional Neural Networks (MCNN) (Cui et al., 2016) and Time LeNet (Le Guennec et al., 2016) are considered among the first architectures to be validated on a domain-agnostic TSC benchmark such as the UCR archive. These models were inspired by image recognition modules, which hindered their accuracy, mainly because of the use of progressive pooling layers that were added primarily for computational feasibility when dealing with image data (Sabour et al., 2017). Consequently, Fully Convolutional Neural Networks (FCNs) were shown to achieve great performance without the need to add pooling layers to reduce the input data’s dimensionality (Wang et al., 2017). More recently, it has been shown that deeper CNN models coupled with residual connections such as ResNet can further improve the classification performance (Ismail Fawaz et al., 2019b). In essence, the lower dimensionality of time series data enabled the exploration of many complex models that are not computationally feasible for image recognition problems: for example, removing the pooling layers that throw away valuable information in favour of reducing the model’s complexity. In this paper, we propose an Inception-based network that applies several convolutions with various filter lengths. In contrast to networks designed for images, we are able to explore filters 10 times longer than recent Inception variants for image recognition tasks (Szegedy et al., 2017).

Inception was first proposed by Szegedy et al. (2015) for end-to-end image classification. The network has since evolved to become Inceptionv4, where Inception was coupled with residual connections to further improve the performance (Szegedy et al., 2017). As for TSC, a relatively competitive Inception-based approach was proposed in Karimi-Bidhendi et al. (2018), where time series were transformed to images using Gramian Angular Difference Field (GADF), and finally fed to an Inception model that had been pre-trained for (standard) image recognition. Unlike this feature engineering approach, by adopting end-to-end learning from raw time series data, a one-dimensional Inception model was used for Supernovae classification using the light flux of a region in space as an input MTS for the network (Brunel et al., 2019). However, the authors limited the conception of their Inception architecture to the one proposed by Google for ImageNet (Szegedy et al., 2017). In our work, we explore much larger filters than any previously proposed network for TSC in order to reach state-of-the-art performance on the UCR benchmark.

3 Method

In this section, we start by describing the proposed architecture we call InceptionTime for classifying time series data. Specifically, we detail the main component of our network: the Inception module. We then present our proposed model InceptionTime which consists of an ensemble of 5 different Inception networks initialized randomly. Finally, we adapt the concept of Receptive Field for time series data.

3.1 Inception Network: a novel architecture for TSC

The composition of an Inception network classifier contains two different residual blocks, as opposed to ResNet, which is comprised of three. For the Inception network, each block is comprised of three Inception modules rather than traditional fully convolutional layers. Each residual block’s input is transferred via a shortcut linear connection to be added to the next block’s input, thus mitigating the vanishing gradient problem by allowing a direct flow of the gradient (He et al., 2016). Figure 2 depicts an Inception network’s architecture showing 6 different Inception modules stacked one after the other.

Figure 2: Our Inception network for time series classification

As for the Inception module, Figure 1 illustrates the inside details of this operation. Consider the input to be an MTS with $M$ dimensions. The first major component of the Inception module is called the “bottleneck” layer. This layer performs an operation of sliding $m$ filters of length 1 with a stride equal to 1. This will transform the time series from an MTS with $M$ dimensions to an MTS with $m \ll M$ dimensions, thus significantly reducing the dimensionality of the time series as well as the model’s complexity and mitigating overfitting problems for small datasets. Note that for visualization purposes, Figure 1 illustrates a bottleneck layer with $m = 1$. Finally, we should mention that this bottleneck technique allows the Inception network to have much longer filters than ResNet (almost ten times) with roughly the same number of parameters to be learned, since without the bottleneck layer, the filters will have $M$ dimensions compared to $m \ll M$ when using the bottleneck layer. The second major component of the Inception module is sliding multiple filters of different lengths simultaneously on the same input time series. For example in Figure 1, three different convolutions with length $l \in \{10, 20, 40\}$ are applied to the input MTS, which is technically the output of the bottleneck layer. In addition, to make the model invariant to small perturbations, we introduce another parallel MaxPooling operation, followed by a bottleneck layer to reduce the dimensionality. The output of the sliding MaxPooling window is computed by taking the maximum value in the given window of the time series. Finally, the output of each independent parallel convolution/MaxPooling is concatenated to form the output MTS of the current Inception module.
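The sliding MaxPooling step can be illustrated with a short sketch. It assumes a univariate input, a window of 3, a stride of 1, and truncated windows at the borders so that the output keeps the input length; none of these settings is stated in this section, and `max_pool_same` is a hypothetical helper name.

```python
def max_pool_same(series, window=3):
    """Sliding maximum with stride 1; windows are truncated at the series borders."""
    T = len(series)
    left = (window - 1) // 2
    out = []
    for t in range(T):
        lo = max(0, t - left)
        hi = min(T, t - left + window)
        out.append(max(series[lo:hi]))  # maximum value inside the current window
    return out


pooled = max_pool_same([0.1, 0.9, 0.2, 0.3, 0.8])
print(pooled)  # [0.9, 0.9, 0.9, 0.8, 0.8]
```

A peak shifted by one time step produces nearly the same pooled output, which is the small-perturbation invariance the parallel MaxPooling branch is meant to provide.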

By stacking multiple Inception modules and training the weights (filters’ values) via backpropagation, the network is able to extract latent hierarchical features of multiple resolutions thanks to the use of filters with various lengths. For completeness, we specify the exact number of filters for our proposed Inception module: 3 sets of filters each with 32 filters of length $l \in \{10, 20, 40\}$ with MaxPooling added to the mix, thus making the total number of filters per layer equal to $32 \times 4 = 128 = M$, the dimensionality of the output MTS.
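To make the effect of the bottleneck on filter dimensionality concrete, the following sketch counts the convolution weights of a single Inception module. It rests on several assumptions that this section does not state: a bottleneck width of 32, filter lengths of 10, 20 and 40, and no bias or normalization parameters. The function name and arguments are hypothetical, and the counts are per module for an illustrative 128-channel input, not the network totals discussed in the experiments.

```python
def inception_module_weights(in_channels, bottleneck_size=32, n_filters=32,
                             filter_lengths=(10, 20, 40), use_bottleneck=True):
    """Count convolution weights in one Inception module (biases and normalization ignored)."""
    total = 0
    conv_in = in_channels
    if use_bottleneck:
        # Bottleneck: bottleneck_size filters of length 1 spanning all input channels.
        total += in_channels * 1 * bottleneck_size
        conv_in = bottleneck_size
    # Parallel convolutions of different lengths applied to the same input.
    for length in filter_lengths:
        total += conv_in * length * n_filters
    # MaxPooling branch followed by its own length-1 convolution.
    total += in_channels * 1 * n_filters
    # Concatenated output: one group of n_filters per branch.
    out_channels = n_filters * (len(filter_lengths) + 1)
    return total, out_channels


with_bottleneck, out_channels = inception_module_weights(128, use_bottleneck=True)
without_bottleneck, _ = inception_module_weights(128, use_bottleneck=False)
print(out_channels)  # 128
print(with_bottleneck, without_bottleneck)
```

Under these assumptions the long filters span 32 channels instead of 128, which is where the saving comes from.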

3.2 InceptionTime: a neural network ensemble for TSC

Our proposed state-of-the-art InceptionTime model is an ensemble of 5 Inception networks, with each prediction given an even weight. In fact, during our experimentation, we have noticed that a single Inception network exhibits high standard deviation in accuracy, which is very similar to ResNet’s behavior (Ismail Fawaz et al., 2019c). We believe that this variability comes from both the randomly initialized weights and the stochastic optimization process itself. This was an important finding for us, previously observed in Scardapane and Wang (2017), as rather than training only one, potentially very good or very poor, instance of the Inception network, we decided to leverage this instability through ensembling, creating InceptionTime. The following equation explains the ensembling of predictions made by a network with different initializations:

$$\hat{y}_{i,c} = \frac{1}{n} \sum_{j=1}^{n} \sigma_c(x_i, \theta_j) \quad \forall c \in [1, C] \qquad (1)$$

with $\hat{y}_{i,c}$ denoting the ensemble’s output probability of having the input time series $x_i$ belonging to class $c$, which is equal to the logistic output $\sigma_c$ averaged over the $n$ randomly initialized models. More details on ensembling neural networks for TSC can be found in Ismail Fawaz et al. (2019c). As for the proposed model in this paper, we chose the number of individual classifiers to be equal to $5$, which is justified in Section 5.
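The even-weight averaging of equation 1 can be sketched as follows. The per-model probabilities are made up for illustration, and `ensemble_probabilities` is a hypothetical helper, not a function from the companion repository.

```python
def ensemble_probabilities(model_outputs):
    """Average per-class probabilities over n models for one input series.

    model_outputs: list of n lists, each holding C class probabilities.
    """
    n = len(model_outputs)
    n_classes = len(model_outputs[0])
    return [sum(output[c] for output in model_outputs) / n for c in range(n_classes)]


# Five hypothetical networks (different initial weights), three classes.
outputs = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
]
avg = ensemble_probabilities(outputs)
predicted_class = max(range(len(avg)), key=lambda c: avg[c])
print(predicted_class)  # 0
```

In this made-up example the third network disagrees with the others, yet the averaged prediction is still class 0, which illustrates how ensembling dampens the variability of individual runs.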

3.3 Receptive field

Figure 3: Receptive field illustration for a two-layer CNN

The concept of Receptive Field (RF) is an essential tool for understanding deep CNNs (Luo et al., 2016). Unlike fully-connected networks or Multi-Layer Perceptrons, a neuron in a CNN depends only on a region of the input signal. This region in the input space is called the receptive field of that particular neuron. For computer vision problems this concept was extensively studied, such as in Liu et al. (2018) where the authors compared the effective and theoretical receptive fields of a CNN for image segmentation.

For temporal data, the receptive field can be considered as a theoretical value that measures the maximum field of view of a neural network in a one-dimensional space: the larger it is, the better the network becomes (in theory) in detecting longer patterns. We now provide the definition of the RF for time series data, which is later used in our experiments. Suppose that we are sliding convolutions with a stride equal to $1$. The formula to compute the RF for a network of depth $d$ with each layer having a filter length equal to $k_i$ with $i \in [1, d]$ is:

$$1 + \sum_{i=1}^{d} (k_i - 1) \qquad (2)$$

By analyzing equation 2 we can clearly see that adding two layers to the initial set of $d$ layers will only slightly increase the value of $RF$. In fact, in this case, if the old $RF$ value is equal to $RF'$, the new value will be equal to $RF' + 2 \times (k - 1)$. Conversely, by increasing the filter length $k_i$, $\forall i \in [1, d]$, by 2, the new $RF$ value will be equal to $RF' + 2 \times d$. This is rather expected since by increasing the filter length for all layers, we are actually increasing the $RF$ for each layer in the network. Figure 3 illustrates the RF for a two-layer CNN.
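As a worked example of the receptive field computation, the sketch below assumes stride-1 convolutions, for which the RF is 1 plus the sum of (filter length minus 1) over all layers. The depth of 10 and filter length of 3 are hypothetical values chosen only to make the arithmetic easy to follow.

```python
def receptive_field(filter_lengths):
    """Receptive field of stacked stride-1 convolutions: 1 + sum(k_i - 1)."""
    return 1 + sum(k - 1 for k in filter_lengths)


base = receptive_field([3] * 10)    # depth d = 10, filter length k = 3
deeper = receptive_field([3] * 12)  # two extra layers: adds 2 * (k - 1)
longer = receptive_field([5] * 10)  # every filter 2 longer: adds 2 * d
print(base, deeper, longer)  # 21 25 41
```

Here two extra layers add 4 to the RF while lengthening every filter by 2 adds 20. More generally, which change contributes more depends on whether $k - 1$ or $d$ is larger.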

In this paper, we chose to focus on the RF concept since it is known for computer vision problems that larger RFs are required to capture more context for object recognition (Luo et al., 2016). Following the same line of thinking, we hypothesize that detecting larger patterns from very long one-dimensional time series data requires larger receptive fields.

Figure 4: Example of a synthetic binary time series classification problem

4 Experimental setup

First, we detail the method to generate our synthetic dataset, which is later used in our architecture and hyperparameter study. For testing our different deep learning methods, we created our own synthetic TSC dataset. The goal was to be able to control the length of the time series data as well as the number of classes and their distribution in time. To this end, we start by generating a univariate time series using uniformly distributed noise sampled between 0.0 and 0.1. Then in order to assign this synthetic random time series to a certain class, we inject a pattern with an amplitude equal to 1.0 in a pre-defined region of the time series. This region will be specific to a certain class, therefore by changing the placement of this pattern we can generate an unlimited number of classes, whereas the random noise will allow us to generate an unlimited number of time series instances per class. One final note is that we have fixed the length of the pattern to be equal to 10% of the length of the synthetic time series. An example of a synthetic binary TSC problem is depicted in Figure 4.
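The generation procedure above could be sketched as follows. This is an illustration under stated assumptions, not the authors' generator: the pattern is modelled as a flat segment set to 1.0, and each class is assigned an evenly spaced start position, neither of which is specified in the text. The name `make_synthetic_series` is hypothetical.

```python
import random


def make_synthetic_series(length, n_classes, label, rng):
    """Uniform noise in [0.0, 0.1] plus an amplitude-1.0 pattern in a class-specific region."""
    series = [rng.uniform(0.0, 0.1) for _ in range(length)]
    pattern_length = length // 10  # the pattern spans 10% of the series
    # Assumption: class regions are evenly spaced; they do not overlap for up to 10 classes.
    start = label * (length // n_classes)
    for t in range(start, start + pattern_length):
        series[t] = 1.0  # assumption: the pattern is a flat segment at amplitude 1.0
    return series


rng = random.Random(0)  # fixed seed so the example is reproducible
dataset = [(make_synthetic_series(100, 2, label, rng), label)
           for label in (0, 1) for _ in range(3)]
print(len(dataset))  # 6
```

Changing `length` controls the series length, while `n_classes` and `label` control where the discriminative pattern sits in time.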

All deep neural networks were trained by leveraging the parallel computation of a remote cluster of more than 60 GPUs comprised of GTX 1080 Ti, Tesla K20, K40 and K80. Local testing and development was performed on an NVIDIA Quadro P6000. The latter graphics card was also used for computing the training time of a model. When evaluating on univariate time series, we have used the UCR archive (Dau et al., 2018), which is the largest publicly available archive for TSC. The models were trained/tested using the original training/testing splits provided in the archive. To study the effect of different hyperparameters and architectural designs, we used the synthetic dataset whose generation is described in detail in the previous paragraph. All time series data were z-normalized (including the synthetic series) to have a mean equal to zero and a standard deviation equal to one. This is considered a common best practice before classifying time series data (Bagnall et al., 2017). Finally, we should note that all models are trained using the Adam optimization algorithm (Kingma and Ba, 2015) and all weights are initialized randomly using Glorot’s uniform technique (Glorot and Bengio, 2010).
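The per-series normalization mentioned above (zero mean, unit standard deviation) can be sketched with the standard library. The use of the population standard deviation and the handling of constant series are assumptions; the text does not specify either.

```python
import statistics


def z_normalize(series):
    """Rescale one time series to zero mean and unit standard deviation."""
    mean = statistics.fmean(series)
    std = statistics.pstdev(series)  # assumption: population standard deviation
    if std == 0.0:
        return [0.0 for _ in series]  # assumption: constant series map to all zeros
    return [(value - mean) / std for value in series]


normalized = z_normalize([1.0, 2.0, 3.0, 4.0])
print(normalized)
```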

Similarly to Ismail Fawaz et al. (2019b), when comparing with the state-of-the-art results published in Bagnall et al. (2017) we used the deep learning model’s median test accuracy over the different runs. Following the recommendations in Demšar (2006) we adopted the Friedman test (Friedman, 1940) in order to reject the null hypothesis. We then performed the pairwise post-hoc analysis recommended by Benavoli et al. (2016) where we replaced the average rank comparison by a Wilcoxon signed-rank test with Holm’s alpha (5%) correction (Garcia and Herrera, 2008). To visualize this type of comparison we used a critical difference diagram proposed by Demšar (2006), where a thick horizontal line shows a cluster of classifiers (a clique) that are not significantly different in terms of accuracy.

In order to allow for the time series community to build upon and verify our findings, the source code for all these experiments was made publicly available on our companion repository (https://github.com/hfawaz/InceptionTime). In addition, upon the acceptance of the paper, we will provide the pre-trained deep learning models, thus allowing data mining practitioners to leverage these models in a transfer learning setting (Ismail Fawaz et al., 2018).

5 Experiments: InceptionTime

Figure 5: Critical difference diagram showing the performance of InceptionTime compared to the current state-of-the-art classifiers of time series data.

In this section, we present the results of our proposed novel classifier called InceptionTime, evaluated on the 85 datasets of the UCR archive. We note that throughout the paper (unless specified otherwise) InceptionTime refers to an ensemble of 5 Inception networks, while the “InceptionTime($x$)” notation is used to denote an ensemble of $x$ Inception networks.

Figure 5 illustrates the critical difference diagram with InceptionTime added to the mix of the current state-of-the-art classifiers for time series data, whose results were taken from Bagnall et al. (2017). We can see here that our InceptionTime ensemble reaches competitive accuracy with the class-leading algorithm HIVE-COTE, an ensemble of 37 TSC algorithms with a hierarchical voting scheme (Lines et al., 2016). While the two algorithms share the same clique on the critical difference diagram, the trivial GPU parallelization of deep learning models makes learning our InceptionTime model a substantially easier task than training the 37 different classifiers of HIVE-COTE, whose implementation does not trivially leverage the GPUs’ computational power. We therefore propose here that InceptionTime should be considered as the new state of the art for TSC as its accuracy is equal to that of HIVE-COTE (see Figure 6) while being much faster (see Figures 7 and 8).

Figure 6: Accuracy plot showing how our proposed InceptionTime model is not significantly different than HIVE-COTE.

To further visualize the difference between InceptionTime and HIVE-COTE, Figure 6 depicts the accuracy plot of InceptionTime against HIVE-COTE for each of the 85 UCR datasets. The results show a Win/Tie/Loss of 40/6/39 in favor of InceptionTime, however the difference is not statistically significant as previously discussed. From Figure 6, we can also easily spot the two datasets for which InceptionTime noticeably under-performs (in terms of accuracy) with respect to HIVE-COTE: Wine and Beef. These two datasets contain spectrography data from different types of beef/wine, with the goal being to determine the correct type of meat/wine using the recorded time series data. Recently, transfer learning has been shown to significantly increase the accuracy for these two datasets, especially when fine-tuning from a dataset with similar time series data (Ismail Fawaz et al., 2018). Our results suggest that further potential improvements may be available for InceptionTime when applying a transfer learning approach, as recent findings in Kashiparekh et al. (2019) show that the various filter lengths of the Inception modules benefit more from fine-tuning than networks with a static filter length.
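A Win/Tie/Loss count like the one reported above is a per-dataset comparison of two accuracy lists. The sketch below uses made-up accuracies on four datasets purely for illustration; they are not results from the paper, and `win_tie_loss` is a hypothetical helper.

```python
def win_tie_loss(acc_a, acc_b):
    """Count the datasets where classifier A beats, ties with, or loses to classifier B."""
    wins = sum(a > b for a, b in zip(acc_a, acc_b))
    ties = sum(a == b for a, b in zip(acc_a, acc_b))
    losses = sum(a < b for a, b in zip(acc_a, acc_b))
    return wins, ties, losses


# Made-up accuracies on four datasets, for illustration only.
classifier_a = [0.90, 0.85, 0.70, 0.99]
classifier_b = [0.88, 0.85, 0.75, 0.95]
result = win_tie_loss(classifier_a, classifier_b)
print(result)  # (2, 1, 1)
```

Such counts only summarize the direction of each comparison, which is why the paper pairs them with a statistical test before claiming a significant difference.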

Figure 7: Training time as a function of the series length for the InlineSkate dataset.
Figure 8: Training time as a function of the training set size for the SITS dataset.

Now that we have demonstrated that our proposed technique is able to reach the current state-of-the-art accuracy for TSC problems, we will further investigate the time complexity of our model. Note that during the following experiments, we ran our ensemble on a single Nvidia Quadro P6000 in a sequential manner, meaning that for InceptionTime, 5 different Inception networks were trained one after the other. Therefore we did not make use of our remote cluster of GPUs. First we start by investigating how our algorithm scales with respect to the length of the input time series. Figure 7 shows the training time versus the length of the input time series. For this experiment, we used the InlineSkate dataset with an exponential re-sampling. We can clearly see that InceptionTime’s training time increases almost linearly with an increase in the time series’ length, unlike HIVE-COTE, whose execution is almost two orders of magnitude slower. Having shown that InceptionTime is significantly faster when dealing with long time series, we now proceed to evaluating the training time with respect to the number of time series in a dataset. To this end, we used a Satellite Image Time Series dataset (Tan et al., 2017). The data contain approximately one million time series, each of length 46 and labelled as one of 24 possible land-use classes (e.g. ‘wheat’, ‘corn’, ‘plantation’, ‘urban’). From Figure 8 we can easily see how our InceptionTime is an order of magnitude faster than HIVE-COTE, and the trend suggests that this difference will only continue to grow, rendering InceptionTime a clear favorite classifier in the Big Data era.

Figure 9: Plot showing how InceptionTime significantly outperforms ResNet(5).

The pairwise accuracy plot in Figure 9 compares InceptionTime to a model we call ResNet(5), which is an ensemble of 5 different ResNet networks (Ismail Fawaz et al., 2019c). We found that InceptionTime showed a significant improvement over its neural network competitor, the previous best deep learning ensemble for TSC. Specifically, our results show a Win/Tie/Loss of 53/7/25 in favor of InceptionTime against ResNet(5) with a statistically significant $p$-value, suggesting the significant gain in performance is mainly due to improvements in our proposed Inception network architecture. Interestingly, we found that InceptionTime showed a huge improvement in accuracy compared to ResNet on the DiatomSizeReduction dataset. Further investigations showed that the main improvement is from using a batch size larger than 1 (which is the case for the ResNet model for this specific dataset). Therefore we cannot attribute the improvement on DiatomSizeReduction to our novel architecture. Nevertheless, the results over all 85 datasets suggest that InceptionTime significantly improves the accuracy of deep learning models for TSC.

Figure 10: Critical difference diagram showing the effect of the number of individual classifiers in the InceptionTime ensemble.

In order to better understand the effect of the randomness on the accuracy of our neural networks, we present in Figure 10 the critical difference diagram of different InceptionTime($x$) ensembles with $x$ denoting the number of individual networks in the ensemble. Note that InceptionTime(1) is equivalent to a single Inception network and InceptionTime is equivalent to InceptionTime(5). By observing Figure 10 we notice how there is no significant improvement when $x \geq 5$, which is why we chose to use an ensemble of size 5, to minimize the classifiers’ training time.

6 Architectural Hyperparameter study

In this section, we will further investigate the hyperparameters of our deep learning architecture and the characteristics of the Inception module in order to provide insight for practitioners looking at optimizing neural networks for TSC. First, we start by investigating the batch size hyperparameter, since this will greatly influence training time of all of our models. Then we investigate the effectiveness of residual and bottleneck connections, both of which are present in InceptionTime. After this we will experiment on model depth, filter length, and number of filters. In all experiments the default values for InceptionTime are: batch size 64; depth 6; filter length 40; and, number of filters 32.

6.1 Batch size

Figure 11: Critical difference diagram showing the effect of the batch size hyperparameter value over InceptionTime’s average rank.

We started by investigating the batch size hyperparameter, since this will greatly influence the training time of our models. The critical difference diagram in Figure 11 shows how the batch size affects the performance of InceptionTime. The thick horizontal line connecting the different models indicates that there is no significant difference between them when evaluated on the 85 datasets, with a slight advantage for the default InceptionTime (batch size equal to 64). Finally, since we did not observe any significant impact on accuracy, we did not study the effect of this hyperparameter on the simulated dataset; instead, we fixed the batch size to 128 for time efficiency when experimenting with InceptionTime on the simulated dataset below.
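Critical difference diagrams such as Figure 11 place each model at its average rank over all datasets. The sketch below shows one way to compute such average ranks, assuming rank 1 is given to the most accurate model on a dataset and tied models share the mean of the ranks they span. The function name `average_ranks` and the accuracies are hypothetical.

```python
def average_ranks(accuracies):
    """accuracies[d][m] is the accuracy of model m on dataset d.

    Returns the mean rank of each model (rank 1 = most accurate).
    Tied models share the average of the ranks they span.
    """
    n_models = len(accuracies[0])
    totals = [0.0] * n_models
    for row in accuracies:
        order = sorted(range(n_models), key=lambda m: -row[m])
        pos = 0
        while pos < n_models:
            end = pos
            while end + 1 < n_models and row[order[end + 1]] == row[order[pos]]:
                end += 1
            shared = (pos + end) / 2 + 1  # mean of the 1-based ranks pos+1 .. end+1
            for m in order[pos:end + 1]:
                totals[m] += shared
            pos = end + 1
    return [t / len(accuracies) for t in totals]


# Two hypothetical datasets, three hypothetical batch-size variants.
print(average_ranks([[0.9, 0.8, 0.8], [0.7, 0.9, 0.6]]))  # [1.5, 1.75, 2.75]
```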

6.2 Bottleneck and residual connections

Figure 12: Accuracy plot for InceptionTime with/without the bottleneck layer.

In Ismail Fawaz et al. (2019b), ResNet achieved the best classification accuracy when evaluated on the 85 datasets, and as a result we chose to look at the specific characteristic of this architecture: its residual connections. Additionally, we tested one of the defining characteristics of Inception: the bottleneck feature. For the simulated dataset, we did not observe any significant impact of these two components; we therefore proceed with experimenting on the 85 datasets from the UCR archive.

Figure 12 shows the pairwise accuracy plot comparing InceptionTime with/without the bottleneck. Similar to the experiments on the simulated dataset, we did not find any significant variation in accuracy when adding or removing the bottleneck layer.

In fact, using a Wilcoxon Signed-Rank test we found that InceptionTime with the bottleneck layer is only slightly, and not significantly, better than InceptionTime without it. In terms of accuracy alone, these results give no reason to use a bottleneck layer; however, we should note that the major benefit of this layer is to significantly decrease the number of parameters in the network. In this case, InceptionTime with the bottleneck contains almost half the number of parameters to be learned, and given that it does not significantly decrease accuracy, we chose to retain its usage. In a more general sense, these experiments suggest that choosing whether or not to use a bottleneck layer is a matter of finding a balance between a model’s accuracy and its complexity.

Figure 13: Accuracy plot for InceptionTime with/without the residual connections.

To test the residual connections, we simply removed them from InceptionTime. Without any shortcut connection, InceptionTime becomes a plain deep convolutional neural network with stacked Inception modules. Figure 13 shows that the residual connections have a minimal effect on accuracy when evaluated over all 85 datasets in the UCR archive.

This result was unsurprising given that, for computer vision tasks, residual connections are known to improve the convergence rate of the network but not to alter its test accuracy (Szegedy et al., 2017). However, for some datasets in the archive, the residual connections showed neither an improvement nor a deterioration of the network’s convergence. This could be linked to other factors that are specific to these data, such as the complexity of the dataset.

One example of interest was a significant decrease in InceptionTime’s accuracy on the ShapeletSim dataset when the residual component was removed. This is a synthetic dataset, designed specifically for shapelet discovery algorithms, with shapelets (discriminative subsequences) of different lengths (Hills et al., 2014). Further investigations on this dataset indicated that InceptionTime without the residual connections suffered from severe overfitting.

While not the case here, some research has observed benefits of skip, dense or residual connections (Huang et al., 2017). Given this, and the small amount of labeled data available in TSC compared to computer vision problems, we believe that whether to include residual connections should be studied independently in each case. Finally, we should note that the residual connection has a minimal impact on the network’s complexity (Szegedy et al., 2017).

6.3 Depth

Most of deep learning’s success in image recognition tasks has been attributed to how ‘deep’ the architectures are (LeCun et al., 2015). Consequently, we decided to further investigate how the number of layers affects a network’s accuracy. Unlike the previous hyperparameters, we present here the results on the simulated dataset. Apart from the depth parameter, we used the default values of InceptionTime. For the dataset we fixed the number of training instances to 128 and the number of classes to 2 (see Figure 4 for an example). The only dataset parameter we varied was the length of the input time series.

Figure 14: Inception network’s accuracy over the simulated dataset, with respect to the network’s depth as well as the length of the input time series.

Figure 14 illustrates how the model’s accuracy varies with respect to the network’s depth when classifying datasets of time series with different lengths. Our initial hypothesis was that longer time series can potentially contain longer patterns and thus should require longer receptive fields in order for the network to separate the classes in the dataset. In terms of depth, this means that longer input time series should yield better results with deeper networks. Indeed, Figure 14 shows this trend: deeper networks deliver better results for longer time series.

In order to further see how much effect the depth of a model has on real TSC datasets, we decided to implement deeper and shallower InceptionTime models, with 9 and 3 Inception modules, respectively. In fact, compared with the original architecture proposed by Wang et al. (2017), the deeper (shallower) version of InceptionTime contains one additional (one fewer) residual block, each comprised of three Inception modules. With this change, the deeper (shallower) InceptionTime model contains roughly double (half) the number of parameters to be learned. Figure 15 depicts the critical difference diagram comparing the deeper and shallower InceptionTime models to the original InceptionTime.

Figure 15: Critical difference diagram showing how the network’s depth affects InceptionTime’s average rank.

Unlike the experiments on the simulated dataset, we did not manage to improve the network’s performance by simply increasing its depth. There may be many reasons for this, but the most likely is that deeper networks need more data to achieve high generalization capabilities (LeCun et al., 2015). Since the UCR archive does not contain datasets with a huge number of training instances, the deeper version of InceptionTime overfitted the majority of the datasets and exhibited a small, insignificant decrease in performance. On the other hand, the shallower version of InceptionTime suffered from a significant decrease in accuracy (see InceptionTime-3 in Figure 15). This suggests that a shallower architecture has a significantly smaller RF, and thus achieves lower accuracy on the overall UCR archive.

From these experiments we can conclude that increasing the RF by adding more layers will not necessarily result in an improvement of the network’s performance, particularly for datasets with a small training set. However, one benefit of tuning the network’s depth is that it allows choosing an RF that is long enough to achieve good results without suffering from overfitting.

We therefore proceed by experimenting with varying the RF by varying the filter length.

6.4 Filter length

In order to test the effect of the filter length, we start by analyzing how the length of a time series influences the accuracy of the model when tuning this hyperparameter. In these experiments we fixed the number of training time series to 128 and the number of classes to 2. Figure 16 illustrates the results of this experiment.

Figure 16: Inception network’s accuracy over the simulated dataset, with respect to the filter length as well as the input time series length.

We can easily see that as the length of the time series increases, a longer filter is required to produce accurate results. This is explained by the fact that longer kernels are able to capture longer patterns, with higher probability, than shorter ones can. Thus, we can safely say that longer kernels almost always improve accuracy.

Figure 17: Inception network’s accuracy over the simulated dataset, with respect to the receptive field as well as the input time series length.

In addition to visualizing the accuracy as a function of both depth (Figure 14) and filter length (Figure 16), we plot the accuracy as a function of the RF for the simulated time series datasets of various lengths. Figure 17 confirms the previous observation that longer patterns require longer RFs, with filter length clearly having a higher impact on accuracy than the network’s depth.
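To make the link between depth, filter length and RF concrete, the sketch below computes the RF of a stack of one-dimensional convolutions. It assumes stride 1 and no dilation, and it assumes that the RF of each Inception module is governed by its longest filter; the function names are hypothetical.

```python
def receptive_field(filter_lengths):
    """RF of a stack of stride-1, undilated 1D convolutions: 1 + sum(k_i - 1)."""
    return 1 + sum(k - 1 for k in filter_lengths)


def inception_rf(depth, longest_filter):
    # Assumption: only the longest filter of each module determines the RF.
    return receptive_field([longest_filter] * depth)


for depth in (3, 6, 9):
    print(depth, inception_rf(depth, 40))
# 3 118
# 6 235
# 9 352
```

Under these assumptions, both the depth and the filter length increase the RF linearly, so the same RF can be reached with few long filters or many short ones.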

There is a downside to longer filters, however: they can overfit small datasets, as longer filters significantly increase the number of parameters in the network. To investigate this, we again extend our experiments to the real data from the UCR archive, allowing us to verify whether long kernels tend to overfit the datasets when a limited amount of training data is available.

Figure 18: Critical difference diagram showing the effect of the filter length hyperparameter value over InceptionTime’s average rank.

To further verify whether the previous observations generalize beyond the simulated dataset, we decided to train and evaluate InceptionTime versions containing both long and short filters on the UCR archive. Where the original InceptionTime contained filters of length {10,20,40}, the two models we are testing here contain filters of length {16,32,64} and {2,4,8}. Figure 18 illustrates a critical difference diagram showing that InceptionTime with longer filters slightly decreases the network’s performance in terms of accurately classifying the time series datasets. We can therefore conclude that the results from the simulated dataset do generalize (to some extent) to real datasets: longer filters will improve the model’s performance as long as there is enough training data to mitigate the overfitting phenomenon.
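The growth in the number of parameters with the filter length can be illustrated with a short calculation for the three filter-length sets above. This is a sketch under stated assumptions: each convolution is assumed to receive 32 input channels and to output 32 filters, biases are ignored, and the function names are hypothetical.

```python
def conv1d_params(filter_length, in_channels, out_channels, bias=False):
    """Number of learnable parameters of a single 1D convolution layer."""
    weights = filter_length * in_channels * out_channels
    return weights + (out_channels if bias else 0)


def branch_params(filter_lengths, in_channels=32, out_channels=32):
    """Total parameters of the parallel convolutions of one module."""
    return sum(conv1d_params(k, in_channels, out_channels) for k in filter_lengths)


for lengths in ((2, 4, 8), (10, 20, 40), (16, 32, 64)):
    print(lengths, branch_params(lengths))
# (2, 4, 8) 14336
# (10, 20, 40) 71680
# (16, 32, 64) 114688
```

Under these assumptions, the parameter count of the convolutions is proportional to the sum of the filter lengths, which is why longer filters are more prone to overfitting small training sets.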

In summary, we can confidently state that increasing the receptive field of a model by adopting longer filters will help the network in learning longer patterns present in longer time series. However there is an accompanying disclaimer that it may negatively impact the accuracy for some datasets due to overfitting.

6.5 Number of filters

Figure 19: Inception network’s accuracy over the simulated dataset, with respect to the number of filters as well as the number of classes.

To provide some directions on how the number of filters affects the performance of the network, we experimented with varying this hyperparameter with respect to the number of classes in the dataset. To generate new classes in the simulated data, we varied the position of the patterns; for example, to create data with three classes, we inject patterns of the same length at three different positions. For this series of experiments, we fixed the length of the time series to 256.
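A minimal sketch of such a generator is given below. It only illustrates the idea of injecting a pattern at a class-specific position: the noise level, the pattern shape (a unit step), the pattern length and the evenly spaced positions are assumptions made for this example, and the function name is hypothetical.

```python
import random


def make_simulated_dataset(n_per_class, n_classes, length=256, pattern_length=16, seed=0):
    """Noise series with one pattern injected at a class-specific position."""
    rng = random.Random(seed)
    # Spread the injection positions evenly along the series.
    stride = (length - pattern_length) // max(n_classes - 1, 1)
    series, labels = [], []
    for label in range(n_classes):
        start = label * stride
        for _ in range(n_per_class):
            x = [rng.gauss(0.0, 0.1) for _ in range(length)]
            for t in range(start, start + pattern_length):
                x[t] += 1.0  # hypothetical pattern: a unit step added to the noise
            series.append(x)
            labels.append(label)
    return series, labels


X, y = make_simulated_dataset(n_per_class=2, n_classes=3)
print(len(X), len(X[0]), y)  # 6 256 [0, 0, 1, 1, 2, 2]
```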

Figure 19 depicts the network’s accuracy with respect to the number of filters for datasets with a differing number of classes. Our prior intuition was that the more classes, or variability, present in the training set, the more features are required to be extracted in order to discriminate the different classes, and this will necessitate a greater number of filters. This is confirmed by the trend displayed in Figure 19, where the datasets with more classes require more filters to be learned in order to be able to accurately classify the input time series.

Figure 20: Critical difference diagram showing how the network’s width affects InceptionTime’s average rank.

After observing on the synthetic dataset that the number of filters significantly affects the performance of the network, we asked ourselves if the current implementation of InceptionTime could benefit/lose from a naive increase/decrease in the number of filters per layer. Our proposed InceptionTime model contains 32 filters per Inception module’s component, while for these experiments we tested two ensembles with 64 and 16 filters per layer, approximately doubling and halving the number of filters per layer, respectively. Figure 20 illustrates a critical difference diagram showing how increasing the number of filters per layer significantly deteriorated the accuracy of the network, whereas decreasing the number of filters did not significantly affect the accuracy. It appears that our InceptionTime model contains enough filters to separate the classes of the 85 UCR datasets, of which some have up to 60 classes (ShapesAll dataset).

Increasing the number of filters also has another side effect: it causes an explosion in the number of parameters in the network. The wider InceptionTime contains four times as many parameters as the original implementation. We therefore conclude that naively increasing the number of filters is actually detrimental, as it will drastically increase the network’s complexity and eventually cause overfitting.
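The fourfold increase follows from the fact that the weights of a convolution scale with the product of its input and output channels. The sketch below assumes a layer whose input and output widths both scale with the number of filters and ignores biases and the first layer (whose input is the raw series); the function names are hypothetical.

```python
def conv1d_weights(filter_length, in_channels, out_channels):
    """Number of weights of a 1D convolution layer (biases ignored)."""
    return filter_length * in_channels * out_channels


def width_scaling(n_filters, base_filters=32, filter_length=40):
    """Weight ratio of a layer whose input and output widths both equal n_filters,
    relative to the same layer with base_filters."""
    base = conv1d_weights(filter_length, base_filters, base_filters)
    return conv1d_weights(filter_length, n_filters, n_filters) / base


print(width_scaling(64))  # 4.0
print(width_scaling(16))  # 0.25
```

Doubling the width therefore multiplies the number of weights by roughly four, while halving it divides the number of weights by roughly four.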

7 Conclusion

Deep learning for time series classification still lags behind neural networks for image recognition in terms of experimental studies and architectural designs. In this paper, we fill this gap by introducing InceptionTime, inspired by the recent success of Inception-based networks for various computer vision tasks. We ensemble these networks to produce new state-of-the-art results for TSC on the 85 datasets of the UCR archive. Our approach is highly scalable, two orders of magnitude faster than current state-of-the-art models such as HIVE-COTE. The magnitude of this speed up is consistent across both Big Data TSC repositories as well as longer time series with high sampling rate. We further investigate the effects on overall accuracy of various hyperparameters of the CNN architecture. For these, we go far beyond the standard practices for image data, and design networks with long filters. We look at these by using a simulated dataset and frame our investigation in terms of the definition of the receptive field for a CNN for TSC. In the future, we would like to explore how to design deep neural networks for multivariate TSC while investigating more recent architectural advancements that are being published each year for computer vision tasks.

Acknowledgements.
The authors would like to thank the creators and providers of the datasets. The authors would also like to thank NVIDIA Corporation for the GPU Grant and the Mésocentre of Strasbourg for providing access to the cluster. This work was supported by the ANR TIMES project (grant ANR-17-CE23-0015) of the French Agence Nationale de la Recherche. François Petitjean is the recipient of an Australian Research Council Discovery Early Career Award (project number DE170100037) funded by the Australian Government. This material is based upon work supported by the Air Force Office of Scientific Research, Asian Office of Aerospace Research and Development (AOARD) under award number FA2386-18-1-4030.

References

  • A. Bagnall, J. Lines, J. Hills, and A. Bostrom (2016) Time-series classification with COTE: the collective of transformation-based ensembles. In International Conference on Data Engineering, pp. 1548–1549. Cited by: §2.1.
  • A. Bagnall, J. Lines, A. Bostrom, J. Large, and E. Keogh (2017) The great time series classification bake off: a review and experimental evaluation of recent algorithmic advances. Data Mining and Knowledge Discovery 31 (3), pp. 606–660. Cited by: §1, §2.1, §2.1, §4, §4, §5.
  • A. Benavoli, G. Corani, and F. Mangili (2016) Should we really use post-hoc tests based on mean-ranks?. Journal of Machine Learning Research 17 (1), pp. 152–161. Cited by: §4.
  • A. Brunel, J. Pasquet, J. Pasquet, N. Rodriguez, F. Comby, D. Fouchez, and M. Chaumont (2019) A CNN adapted to time series for the classification of Supernovae. In Electronic Imaging, Cited by: §2.2.
  • Z. Cui, W. Chen, and Y. Chen (2016) Multi-scale convolutional neural networks for time series classification. ArXiv. External Links: 1603.06995 Cited by: §2.2, §2.2.
  • H. A. Dau, A. Bagnall, K. Kamgar, C. M. Yeh, Y. Zhu, S. Gharghabi, C. A. Ratanamahatana, and E. Keogh (2018) The UCR time series archive. ArXiv. Cited by: §1, §1, §4.
  • J. Demšar (2006) Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7, pp. 1–30. Cited by: §4.
  • G. Forestier, F. Petitjean, P. Senin, F. Despinoy, A. Huaulmé, H. Ismail Fawaz, J. Weber, L. Idoumghar, P. Muller, and P. Jannin (2018) Surgical motion analysis using discriminative interpretable patterns. Artificial Intelligence in Medicine 91, pp. 3 – 11. Cited by: §1.
  • M. Friedman (1940) A comparison of alternative tests of significance for the problem of rankings. The Annals of Mathematical Statistics 11 (1), pp. 86–92. Cited by: §4.
  • S. Garcia and F. Herrera (2008) An extension on “statistical comparisons of classifiers over multiple data sets” for all pairwise comparisons. Journal of Machine Learning Research 9, pp. 2677–2694. Cited by: §4.
  • X. Glorot and Y. Bengio (2010) Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics, Vol. 9, pp. 249–256. Cited by: §4.
  • C. Guan, X. Wang, Q. Zhang, R. Chen, D. He, and X. Xie (2019) Towards a deep and unified understanding of deep neural models in NLP. In International Conference on Machine Learning, pp. 2454–2463. Cited by: §2.2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §3.1.
  • J. Hills, J. Lines, E. Baranauskas, J. Mapp, and A. Bagnall (2014) Classification of time series by shapelet transformation. Data Mining and Knowledge Discovery 28 (4), pp. 851–881. Cited by: §2.1, §6.2.
  • G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708. Cited by: §6.2.
  • H. Ismail Fawaz, G. Forestier, J. Weber, L. Idoumghar, and P. Muller (2018) Transfer learning for time series classification. In IEEE International Conference on Big Data, pp. 1367–1376. Cited by: §4, §5.
  • H. Ismail Fawaz, G. Forestier, J. Weber, L. Idoumghar, and P. Muller (2019a) Adversarial attacks on deep neural networks for time series classification. In IEEE International Joint Conference on Neural Networks, Cited by: §2.2.
  • H. Ismail Fawaz, G. Forestier, J. Weber, L. Idoumghar, and P. Muller (2019b) Deep learning for time series classification: a review. Data Mining and Knowledge Discovery. Cited by: §1, §1, §2.2, §2.2, §4, §6.2.
  • H. Ismail Fawaz, G. Forestier, J. Weber, L. Idoumghar, and P. Muller (2019c) Deep neural network ensembles for time series classification. In IEEE International Joint Conference on Neural Networks, Cited by: §3.2, §5.
  • H. Ismail Fawaz, G. Forestier, J. Weber, F. Petitjean, L. Idoumghar, and P. Muller (2019d) Automatic alignment of surgical videos using kinematic data. In Artificial Intelligence in Medicine, pp. 104–113. Cited by: §1.
  • S. Karimi-Bidhendi, F. Munshi, and A. Munshi (2018) Scalable classification of univariate and multivariate time series. In IEEE International Conference on Big Data, pp. 1598–1605. Cited by: §2.2.
  • K. Kashiparekh, J. Narwariya, P. Malhotra, L. Vig, and G. Shroff (2019) ConvTimeNet: a pre-trained deep convolutional neural network for time series classification. In IEEE International Joint Conference on Neural Networks, Cited by: §5.
  • E. J. Keogh and M. J. Pazzani (2001) Derivative dynamic time warping. In Proceedings of the 2001 SIAM International Conference on Data Mining, pp. 1–11. Cited by: §2.1.
  • D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In International Conference on Learning Representations, Cited by: §4.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems, pp. 1097–1105. Cited by: §1.
  • A. Le Guennec, S. Malinowski, and R. Tavenard (2016) Data augmentation for time series classification using convolutional neural networks. In ECML/PKDD Workshop on Advanced Analytics and Learning on Temporal Data, Cited by: §2.2.
  • Y. LeCun, Y. Bengio, and G. Hinton (2015) Deep learning. Nature 521, pp. 436–444. Cited by: §1, §6.3, §6.3.
  • Y. LeCun, L. Bottou, G. B. Orr, and K. Müller (1998) Efficient backprop. In Neural Networks: Tricks of the Trade, This Book is an Outgrowth of a 1996 NIPS Workshop, pp. 9–50. Cited by: §1.
  • W. Lee, S. Park, W. Joo, and I. Moon (2018) Diagnosis prediction via medical context attention networks using deep generative modeling. In IEEE International Conference on Data Mining, pp. 1104–1109. Cited by: §1.
  • J. Lines, S. Taylor, and A. Bagnall (2016) HIVE-COTE: the hierarchical vote collective of transformation-based ensembles for time series classification. In IEEE International Conference on Data Mining, pp. 1041–1046. Cited by: §2.1, §5.
  • J. Lines and A. Bagnall (2015) Time series classification with ensembles of elastic distance measures. Data Mining and Knowledge Discovery 29 (3), pp. 565–592. Cited by: §2.1.
  • Y. Liu, J. Yu, and Y. Han (2018) Understanding the effective receptive field in semantic image segmentation. Multimedia Tools and Applications 77 (17), pp. 22159–22171. Cited by: §3.3.
  • B. Lucas, A. Shifaz, C. Pelletier, L. O’Neill, N. Zaidi, B. Goethals, F. Petitjean, and G. I. Webb (2019) Proximity forest: an effective and scalable distance-based classifier for time series. Data Mining and Knowledge Discovery 33 (3), pp. 607–635. Cited by: §2.1, §2.1.
  • W. Luo, Y. Li, R. Urtasun, and R. Zemel (2016) Understanding the effective receptive field in deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 4898–4906. Cited by: §3.3, §3.3.
  • P. Marteau (2009) Time warp edit distance with stiffness adjustment for time series matching. IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (2), pp. 306–318. Cited by: §2.1.
  • C. Pelletier, G. I. Webb, and F. Petitjean (2019) Temporal convolutional neural network for the classification of satellite image time series. Remote Sensing 11 (5), pp. 523. Cited by: §1.
  • S. Sabour, N. Frosst, and G. E. Hinton (2017) Dynamic routing between capsules. In Advances in Neural Information Processing Systems, pp. 3856–3866. Cited by: §2.2.
  • S. Scardapane and D. Wang (2017) Randomness in neural networks: an overview. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 7 (2), pp. e1200. Cited by: §3.2.
  • P. Schäfer (2015a) Scalable time series classification. Data Mining and Knowledge Discovery, pp. 1–26. Cited by: §2.1.
  • P. Schäfer and U. Leser (2017) Fast and accurate time series classification with WEASEL. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pp. 637–646. Cited by: §2.1.
  • P. Schäfer (2015b) The boss is concerned with time series classification in the presence of noise. Data Mining and Knowledge Discovery 29 (6), pp. 1505–1530. Cited by: §2.1.
  • A. Stefan, V. Athitsos, and G. Das (2013) The move-split-merge metric for time series. IEEE Transactions on Knowledge and Data Engineering 25 (6), pp. 1425–1438. Cited by: §2.1.
  • C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi (2017) Inception-v4, inception-resnet and the impact of residual connections on learning. In AAAI Conference on Artificial Intelligence, Cited by: §2.2, §2.2, §6.2, §6.2.
  • C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9. Cited by: §1, §2.2.
  • C. W. Tan, G. I. Webb, and F. Petitjean (2017) Indexing and classifying gigabytes of time series under time warping. In Proceedings of the 2017 SIAM International Conference on Data Mining, pp. 282–290. Cited by: §5.
  • M. Vlachos, M. Hadjieleftheriou, D. Gunopulos, and E. Keogh (2006) Indexing multidimensional time-series. The VLDB Journal—The International Journal on Very Large Data Bases 15 (1), pp. 1–20. Cited by: §2.1.
  • Z. Wang, W. Yan, and T. Oates (2017) Time series classification from scratch with deep neural networks: a strong baseline. In International Joint Conference on Neural Networks, pp. 1578–1585. Cited by: §1, §2.2, §2.2, §6.3.
  • F. Yi, Z. Yu, F. Zhuang, X. Zhang, and H. Xiong (2018) An integrated model for crime prediction using temporal and spatial factors. In IEEE International Conference on Data Mining, pp. 1386–1391. Cited by: §1.
  • Y. Yuan, G. Xun, F. Ma, Y. Wang, N. Du, K. Jia, L. Su, and A. Zhang (2018) Muvan: a multi-view attention network for multivariate temporal data. In IEEE International Conference on Data Mining, pp. 717–726. Cited by: §1.
  • C. Zhang, W. Tavanapong, G. Kijkul, J. Wong, P. C. de Groen, and J. Oh (2018) Similarity-based active learning for image classification under class imbalance. In IEEE International Conference on Data Mining, pp. 1422–1427. Cited by: §2.2.