
Enhancing Transformer Efficiency for Multivariate Time Series Classification

Most current multivariate time series (MTS) classification algorithms focus on improving the predictive accuracy. However, for large-scale (either high-dimensional or long-sequential) time series (TS) datasets, there is an additional consideration: to design an efficient network architecture to reduce computational costs such as training time and memory footprint. In this work we propose a methodology based on module-wise pruning and Pareto analysis to investigate the relationship between model efficiency and accuracy, as well as its complexity. Comprehensive experiments on benchmark MTS datasets illustrate the effectiveness of our method.





1 Introduction

Time series (TS) data is ubiquitous, occurring in healthcare [1], stock market [2], astronomy [3], and many other domains  [4, 5]. With the advance of sensing techniques, TS classification across wide-ranging domains has gained much interest during the past decade [6, 7].

The availability of the UCR/UEA time series benchmark datasets [7] has led to an abundance of TS classification algorithms [8, 9, 10, 11, 12]. Classification accuracy has been the key metric used to evaluate existing methods [13]. However, the high accuracy of these algorithms often comes at the cost of high computational complexity [14]. In natural language processing (NLP) and computer vision (CV), it is commonly assumed that high accuracy requires top-performing models with millions or billions of parameters, whose training is computationally intensive and can take days or weeks on many parallel GPUs or TPUs. Such intensive training also makes a model difficult to retrain when further performance improvements are sought. Likewise, for large-scale time series data with high dimensionality or long sequence length, it is challenging to balance predictive accuracy against training efficiency.

In this work, we propose a method to investigate the relationship between model efficiency and its effectiveness, as well as its complexity for MTS classification. The model architecture is based on Transformer and Fourier transform. We use 18 benchmark MTS datasets for evaluation. Comprehensive experiments are conducted on all datasets, including ablation study of each module of the network and module-by-module pruning in terms of accuracy, training speed, and model size. Experimental results demonstrate the competitive performance of our proposed architecture compared with current state-of-the-art methods. Ablation studies identify the main contributors to the predictive performance, such as multi-head self-attention and Fourier transform. In addition, module-wise pruning of the network reveals the trade-off between model efficiency and effectiveness, as well as model efficiency and complexity. Finally, we conduct Pareto analysis to examine the trade-off between efficiency and performance.

The main contributions of this paper are highlighted as follows:

  1. To the best of our knowledge, this is the first paper to perform Pareto analysis to investigate the relationship between efficiency and accuracy.

  2. Through module-by-module pruning, comprehensive experimental results indicate an evident trade-off between model efficiency and its effectiveness, as well as its complexity.

  3. We employ Pareto analysis to investigate the relationship between model efficiency and performance. Such analysis methods can provide general guidance for researchers on how to select efficient model configurations, which can be applied to any model architecture.

The remainder of this paper is organized as follows. Section 2 describes related work of Transformer and Fourier transform on time series analysis and existing methods on model efficiency improvement. The network architecture is outlined in Section 3. Section 4 discusses datasets and experiments on 18 benchmark datasets, including ablation studies, module-wise pruning and Pareto efficiency visualization. Finally, our conclusions are presented in Section 5.

2 Related Work

Neural Networks for Time Series Classification. Currently, most TS classification algorithms can be divided into three categories: feature-based [15], distance-based [16], and neural network based methods [6]. Here, we focus only on neural network based methods. With the advance of deep learning, two popular frameworks, CNN and RNN, have been widely applied in TS classification tasks. [17] combined Fully Convolutional Networks (FCN) and Residual Networks (ResNet) for univariate time series classification. [18] developed a group-constrained method, which combines a CNN with an RNN. More recent works such as InceptionTime [19], TapNet [20], and TST [12] have been proposed for TS classification. For additional deep learning methods, we refer readers to [6].

Fourier Transform in Time Series. The Fourier transform (FT) has been an important tool in time series analysis for decades [21], and is widely used for applications such as anomaly detection [22], periodicity detection [23], and similarity measures [24]. The FT converts a TS from time domain to frequency domain, and uses Fourier coefficients to represent the original data. For the TS classification task, FTs have been used indirectly in disparate applications. For instance, [25] utilizes the FT to filter noisy data for vegetation type classification, and [26] uses the FT as a feature extraction technique to classify electroencephalography (EEG) data. However, none of the above methods apply the FT directly to TS classification, particularly in the context of neural networks. In contrast, we aim to apply the discrete FT and its inverse as modules of a deep learning framework. The unparameterized FT can reduce the computational cost of the network to some extent.

Transformer Networks for Time Series Classification. With the exemplary performance of the Transformer architecture [27] in NLP and CV, researchers in the time series community began exploring Transformers in TS classification in specific domains [28, 29]. More recent works have generalized Transformer frameworks for MTS classification.  [12] adopts a Transformer encoder architecture for unsupervised representation learning of MTS.  [30] explored an extension of the current Transformer architecture by gating, which merges two towers for MTS classification. In contrast, we propose to generalize a mixing framework which utilizes both Transformer and FT. By replacing some self-attention sublayers with FT, the computational complexity can be reduced.

Model Training Efficiency. Due to the increasing size of both models and training data, many works have focused on improving model training efficiency through parameter reduction, such as DenseNet [31] and EfficientNet [32], training speed improvement including NFNets [33] and BotNet [34], or both [35]. One of the most common techniques to improve network efficiency is model pruning. Early works focused on non-structured methods. For instance, [36, 37] proposed to remove individual weight values. Recent works focused more on structured methods, such as channel weight pruning based on norm [38].

3 Methodology

In this section, we present our network architecture, which contains all of the modules for potential model pruning. The overall model structure is illustrated in Figure 1.

Figure 1: An overview of the full model framework. Our architecture is based on Transformer and Fourier transform. Following the sequence embedding, we apply a 2D discrete Fourier transform (particularly Fast Fourier transform) to convert the TS features from the time domain to the frequency domain, a multi-head self-attention layer, and a 2D inverse discrete Fourier transform to map the features back to the time domain. Then we employ a Global Average Pooling (GAP) layer to average the output of the MTS over the entire time dimension. Finally, a Softmax layer is used for the multi-class MTS classification task.

Input Embeddings. Input embeddings are commonly used in NLP models, which map relatively low-dimensional vectors to high-dimensional vectors to facilitate sequence modeling [39]. Correspondingly, an embedding for TS sequence is required to capture the dependencies among different features without considering the temporal information [40]. Our framework employs a 1D convolutional layer to obtain the K-dimensional embeddings at each time step.

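Discrete Fourier Transform. The Fourier transform decomposes a function of time into its constituent frequencies. For clarity, we first consider the 1D discrete Fourier transform (DFT). Given a sequence of complex numbers $\{x_n\}$ with $n \in \{0, \dots, N-1\}$, the 1D DFT is defined by

$$X_k = \sum_{n=0}^{N-1} x_n \, e^{-\frac{2\pi i}{N} nk},$$

where $0 \le k \le N-1$. Given the DFT $X_k$, the original sequence $x_n$ can be recovered by the inverse DFT (IDFT)

$$x_n = \frac{1}{N} \sum_{k=0}^{N-1} X_k \, e^{\frac{2\pi i}{N} nk}.$$

The 2D DFT is a direct extension of the 1D DFT, obtained by alternately performing the 1D DFT on the row and column dimensions. Given a 2D signal $x_{n_1, n_2}$ with $n_1 \in \{0, \dots, N_1-1\}$ and $n_2 \in \{0, \dots, N_2-1\}$, the 2D DFT is given by

$$X_{k_1, k_2} = \sum_{n_1=0}^{N_1-1} \sum_{n_2=0}^{N_2-1} x_{n_1, n_2} \, e^{-2\pi i \left( \frac{n_1 k_1}{N_1} + \frac{n_2 k_2}{N_2} \right)}.$$

Similar to the 1D IDFT, the 2D DFT is invertible via the 2D IDFT,

$$x_{n_1, n_2} = \frac{1}{N_1 N_2} \sum_{k_1=0}^{N_1-1} \sum_{k_2=0}^{N_2-1} X_{k_1, k_2} \, e^{2\pi i \left( \frac{n_1 k_1}{N_1} + \frac{n_2 k_2}{N_2} \right)}.$$

To compute the DFT efficiently, the Fast Fourier Transform (FFT) algorithm takes advantage of the periodicity and symmetry properties of the complex exponential $e^{-\frac{2\pi i}{N}}$ such that the computational complexity of the DFT reduces from $O(N^2)$ to $O(N \log N)$, regardless of dimension.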
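As a minimal illustration of the definitions above, the following sketch implements the naive 1D DFT and IDFT directly from their sums and builds the 2D transform by applying the 1D transform along rows and then along columns. The function names (`dft`, `idft`, `dft2`, `idft2`) and the toy signal are illustrative assumptions, not part of the described model; in practice an FFT routine from a numerical library would replace the quadratic-time loops.

```python
import cmath


def dft(x):
    """Naive O(N^2) 1D DFT of a sequence of (complex) numbers."""
    n_pts = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / n_pts) for n in range(n_pts))
        for k in range(n_pts)
    ]


def idft(X):
    """Naive inverse 1D DFT, including the 1/N normalization."""
    n_pts = len(X)
    return [
        sum(X[k] * cmath.exp(2j * cmath.pi * k * n / n_pts) for k in range(n_pts)) / n_pts
        for n in range(n_pts)
    ]


def _rows_then_columns(grid, transform):
    """Apply a 1D transform to every row, then to every column."""
    rows = [transform(list(row)) for row in grid]
    cols = [transform(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]  # transpose back to row-major


def dft2(grid):
    """2D DFT as successive 1D DFTs over rows and columns."""
    return _rows_then_columns(grid, dft)


def idft2(grid):
    """2D inverse DFT as successive 1D inverse DFTs."""
    return _rows_then_columns(grid, idft)


# Toy 2 x 3 signal (hypothetical values) and a round trip through both transforms.
signal = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
spectrum = dft2(signal)
recovered = idft2(spectrum)
```

The zero-frequency coefficient `spectrum[0][0]` equals the sum of all entries of the signal, and `recovered` matches `signal` up to floating-point error, which is the invertibility property stated above.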

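Multi-head Attention. The multi-head attention (MHA) mechanism, the major component of the Transformer architecture [27], allows the model to jointly attend to information from different representation subspaces at different positions. MHA is defined as:

$$\mathrm{MHA}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h) \, W^O,$$

where $Q, K, V \in \mathbb{R}^{n \times d}$ are input embedding matrices, $n$ is the sequence length, $d$ is the embedding dimension, and $h$ is the number of heads. Each head is defined as:

$$\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V) = \mathrm{softmax}\left( \frac{Q W_i^Q \, (K W_i^K)^\top}{\sqrt{d_k}} \right) V W_i^V,$$

where $W_i^Q \in \mathbb{R}^{d \times d_k}$, $W_i^K \in \mathbb{R}^{d \times d_k}$, $W_i^V \in \mathbb{R}^{d \times d_v}$, and $W^O \in \mathbb{R}^{h d_v \times d}$ are parameter matrices to be learned.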
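The following sketch spells out multi-head attention as defined in [27] in plain Python: each head projects the input, applies scaled dot-product attention, and the head outputs are concatenated and projected once more. It is only an illustration under stated assumptions: the helper names, the tiny input, and the hand-picked projection matrices are hypothetical, and a real model would use learned weights and a tensor library.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention for a single head; returns (output, weights)."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v), weights


def multi_head_attention(x, heads, w_o):
    """Self-attention with one (W_q, W_k, W_v) triple per head and output projection w_o."""
    outputs = [
        attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))[0]
        for w_q, w_k, w_v in heads
    ]
    # Concatenate the head outputs along the feature dimension for every time step.
    concat = [sum((out[t] for out in outputs), []) for t in range(len(x))]
    return matmul(concat, w_o)


# Hypothetical input: n = 3 time steps, d = 2 features, h = 2 heads with d_k = d_v = 1.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pick_first = [[1.0], [0.0]]   # projection keeping the first feature
pick_second = [[0.0], [1.0]]  # projection keeping the second feature
heads = [(pick_first, pick_first, pick_first), (pick_second, pick_second, pick_second)]
w_o = [[1.0, 0.0], [0.0, 1.0]]  # identity output projection

out = multi_head_attention(x, heads, w_o)
_, weights = attention(x, x, x)
```

Each row of `weights` sums to one, and `out` has the same shape as the input (3 time steps by 2 features), as expected for a self-attention sublayer.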

Global Average Pooling. Global average pooling involves calculating the average value of all of the elements in a feature map. It is mainly used to reduce the number of learnable parameters.

Batch Normalization. Instead of using layer normalization as in Transformer-related architectures in NLP, we consider the necessity of applying batch normalization to each block shown in Figure 1. Compared to layer normalization, batch normalization can mitigate the effect of outlier values in time series data, which do not appear in text representations.

Activation Function. As in many Transformer-based architectures derived from [27], we consider the necessity of applying the GELU activation function to each module shown in Figure 1.

Feedforward Neural Network. A position-wise feedforward neural network (FFN) is applied with two 1D convolutional layers with kernel size 1, and a GELU activation function in between.

4 Experiments

In this section, we describe benchmark MTS datasets [7] used for experimental evaluation, the experimental setup, and corresponding results.

4.1 Datasets

We select a set of 18 publicly available benchmark datasets from the UCR/UEA classification archive: AtrialFibrillation (AF), BasicMotions (BM), Cricket (CR), DuckDuckGeese (DDG), Epilepsy (EP), EthanolConcentration (EC), ERing (ER), FingerMovements (FM), HandMovementDirection (HMD), Handwriting (HW), Heartbeat (HB), Libras (LIB), NATOPS (NATO), PEMS-SF (PEMS), RacketSports (RS), SelfRegulationSCP1 (SRS1), SelfRegulationSCP2 (SRS2), and UWaveGestureLibrary (UW). The main characteristics of each dataset are summarised in Table 1. All of the datasets have been split into training and testing sets by default. Thus, there are no preprocessing steps for these data. The predictive performance on all datasets is evaluated in terms of accuracy.


Dataset Code Train Size Test Size Dimensions Length Classes
AtrialFibrillation AF 15 15 2 640 3
BasicMotions BM 40 40 6 100 4
Cricket CR 108 72 6 1197 12
DuckDuckGeese DDG 50 50 1345 270 5
Epilepsy EP 137 138 3 206 4
EthanolConcentration EC 261 263 3 1751 4
ERing ER 30 270 4 65 6
FingerMovements FM 316 100 28 50 2
HandMovementDirection HMD 160 74 10 400 4
Handwriting HW 150 850 3 152 26
Heartbeat HB 204 205 61 405 2
Libras LIB 180 180 2 45 15
NATOPS NATO 180 180 24 51 6
PEMS-SF PEMS 267 173 963 144 7
RacketSports RS 151 152 6 30 4
SelfRegulationSCP1 SRS1 268 293 6 896 2
SelfRegulationSCP2 SRS2 200 180 7 1152 2
UWaveGestureLibrary UW 120 320 3 315 8


Table 1: Summary of the 18 UCR/UEA datasets used in experimentation.

4.2 Setup

We set aside a portion of the default training set as the validation set, which we used to select the best collection of hyperparameters. All experiments were implemented in PyTorch [41] on one GTX 1080 Ti GPU. We minimized the cross entropy loss with the Adam [42] optimizer for training. The hyperparameter search space for each dataset is listed in Table 2. Note that the batch size choice is limited by the available GPU memory.


Hyperparameter Search Space
Learning rate [1e-3, 5e-3, 1e-4, 5e-4, 1e-5, 5e-5]
Dropout rate [0.1, 0.2, 0.3]
Batch size [8, 16, 32]
Number of heads [4, 8, 16]
Number of FFT layers [0, 1, 2, 3, 4]
Number of IFFT layers [0, 1, 2, 3, 4]
Number of MHA layers [0, 1, 2, 3, 4]
Number of Feedforward layers [0, 1, 2, 3, 4]


Table 2: Hyperparameter search space of the model on each dataset. If the number of layers of a module is equal to 0, then this module is removed in the pruned model.
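To make the size of this search space concrete, the sketch below enumerates it as a Cartesian product and lists which modules a given configuration removes (a layer count of 0). The dictionary keys and helper names are illustrative assumptions, and the text does not state whether the full grid or only a subset of it was evaluated.

```python
import math
from itertools import product

# Search space transcribed from Table 2; the key names are hypothetical.
search_space = {
    "learning_rate": [1e-3, 5e-3, 1e-4, 5e-4, 1e-5, 5e-5],
    "dropout": [0.1, 0.2, 0.3],
    "batch_size": [8, 16, 32],
    "num_heads": [4, 8, 16],
    "fft_layers": [0, 1, 2, 3, 4],
    "ifft_layers": [0, 1, 2, 3, 4],
    "mha_layers": [0, 1, 2, 3, 4],
    "ffn_layers": [0, 1, 2, 3, 4],
}


def iter_configs(space):
    """Yield every configuration of the grid as a dict."""
    keys = list(space)
    for values in product(*(space[key] for key in keys)):
        yield dict(zip(keys, values))


def removed_modules(config):
    """Modules whose layer count is 0 are absent from the pruned model."""
    suffix = "_layers"
    return [
        name[: -len(suffix)].upper()
        for name, count in config.items()
        if name.endswith(suffix) and count == 0
    ]


# Total number of grid points: 6 * 3 * 3 * 3 * 5 * 5 * 5 * 5.
num_configs = math.prod(len(values) for values in search_space.values())

first_config = next(iter_configs(search_space))
first_removed = removed_modules(first_config)
```

The full grid contains 101,250 configurations per dataset, which illustrates why exhaustively evaluating every configuration quickly becomes impractical for larger models.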


Dataset Acc. Full EMBED FFT IFFT MHA FFN GAP BN ACT
AF Mean 0.667 0.600 0.400 0.467 0.400 0.667 0.533 0.600 0.667
Std. 0.003 0.005 0.005 0.004 0.003 0.006 0.006 0.004 0.003
BM Mean 0.975 0.950 0.725 0.775 0.750 0.900 0.925 0.900 0.950
Std. 0.008 0.010 0.012 0.009 0.012 0.010 0.014 0.009 0.011
CR Mean 0.987 0.958 0.875 0.861 0.833 0.889 0.944 0.972 0.944
Std. 0.007 0.009 0.012 0.008 0.012 0.006 0.009 0.012 0.008
DDG Mean 0.580 0.580 0.440 0.420 0.380 0.520 0.560 0.560 0.580
Std. 0.016 0.017 0.020 0.016 0.014 0.016 0.016 0.014 0.016
EP Mean 0.986 0.978 0.891 0.913 0.899 0.949 0.971 0.956 0.971
Std. 0.014 0.013 0.016 0.014 0.014 0.012 0.014 0.013 0.015
EC Mean 0.456 0.445 0.376 0.395 0.365 0.418 0.441 0.445 0.452
Std. 0.003 0.002 0.003 0.003 0.004 0.002 0.004 0.003 0.002
ER Mean 0.963 0.956 0.896 0.889 0.885 0.892 0.948 0.952 0.956
Std. 0.006 0.007 0.006 0.006 0.008 0.005 0.006 0.007 0.005
FM Mean 0.640 0.620 0.490 0.520 0.500 0.600 0.590 0.610 0.620
Std. 0.009 0.008 0.007 0.008 0.010 0.008 0.009 0.010 0.011
HMD Mean 0.486 0.446 0.365 0.351 0.338 0.406 0.459 0.432 0.473
Std. 0.018 0.016 0.020 0.017 0.018 0.019 0.018 0.016 0.020
HW Mean 0.529 0.514 0.471 0.473 0.468 0.506 0.506 0.512 0.514
Std. 0.006 0.007 0.006 0.005 0.007 0.007 0.008 0.007 0.006
HB Mean 0.771 0.766 0.683 0.707 0.688 0.751 0.756 0.766 0.756
Std. 0.014 0.015 0.014 0.017 0.015 0.016 0.014 0.015 0.016
LIB Mean 0.917 0.906 0.822 0.827 0.839 0.889 0.894 0.906 0.911
Std. 0.009 0.011 0.012 0.010 0.012 0.013 0.011 0.009 0.010
NATO Mean 0.844 0.833 0.728 0.739 0.750 0.772 0.811 0.833 0.833
Std. 0.005 0.004 0.005 0.007 0.006 0.005 0.006 0.004 0.006
PEMS Mean 0.908 0.884 0.815 0.809 0.803 0.867 0.879 0.896 0.896
Std. 0.013 0.012 0.014 0.016 0.014 0.013 0.013 0.014 0.012
RS Mean 0.914 0.901 0.796 0.816 0.803 0.855 0.908 0.901 0.908
Std. 0.021 0.020 0.020 0.018 0.019 0.021 0.020 0.021 0.019
SRS1 Mean 0.915 0.894 0.836 0.823 0.819 0.853 0.887 0.894 0.901
Std. 0.005 0.007 0.006 0.006 0.005 0.007 0.006 0.005 0.005
SRS2 Mean 0.600 0.594 0.522 0.533 0.516 0.578 0.583 0.588 0.594
Std. 0.002 0.003 0.002 0.001 0.004 0.002 0.003 0.003 0.002
UW Mean 0.922 0.906 0.844 0.850 0.841 0.875 0.894 0.897 0.903
Std. 0.006 0.008 0.009 0.006 0.007 0.008 0.006 0.007 0.007


Table 3: Ablation study of the testing accuracy on 18 datasets when one module at a time is removed while the others are left unchanged. Each experiment is repeated with different random seeds, and the results are reported as mean and standard deviation. The first column of values shows the accuracy of the full model with all modules included; each of the remaining columns shows the accuracy when the module of that column is removed from the model. Bold indicates the module whose removal causes the largest loss in accuracy, and underlining indicates the module whose removal causes the smallest loss in accuracy.

4.3 Module Settings

Based on Section 3, we define the following eight modules of the network for further analysis: input embedding (EMBED), fast Fourier transform (FFT), inverse fast Fourier transform (IFFT), multi-head attention (MHA), feedforward neural network (FFN), global average pooling (GAP), batch normalization (BN), and activation function (ACT). The corresponding abbreviations of each module are shown in parentheses.

4.4 Ablation Study

First, we conduct ablation studies to analyze the contribution of each module to the predictive performance. The contribution of a module is measured by removing it from the full network while the other modules remain intact. The fine-tuned results on 18 datasets are shown in Table 3. In every column other than the full-model column, the lower the accuracy, the larger the contribution of the removed module, and vice versa. The accuracy of the unpruned model on each dataset (Table 3, full-model column) is competitive with current state-of-the-art methods [7]. Among the eight modules, MHA and FFT are most often the largest contributors to the predictive performance. For MTS data, the correlations between different dimensions across all time steps are important to consider; MHA is able to capture these feature correlations and therefore influences the accuracy to a large extent. The FFT, a core tool of signal processing and of time series analysis more generally, extracts the frequency information embedded in the data, which provides a more direct representation than the original time-domain data. In contrast, EMBED, BN, and ACT are most often the smallest contributors. Although these operations are important for training the model, they influence the testing accuracy only marginally compared with MHA and FFT.

To clearly demonstrate the influence of each module on the predictive performance and efficiency of the network, the averaged testing accuracy loss and the corresponding efficiency improvement for each module (compared with the unpruned model) over all datasets are presented in Figure 2. Here, efficiency is measured through the product of the training time per epoch and the number of learnable parameters: the higher the product, the lower the efficiency. Because the datasets are highly diverse with respect to sequence length, number of samples, and dimensionality, the average loss in accuracy for each module shows a high variance in Figure 2, as the extent of the performance loss depends on dataset characteristics. The modules MHA, FFT, and IFFT have a notable influence on the model performance on average (21.9%, 20.1%, and 17.7% loss in accuracy, respectively). For modules such as BN, EMBED, and ACT, removal brings about minimal accuracy loss compared to other modules (3.6%, 2.7%, and 1.6%, respectively). Meanwhile, comparing Figure 2(a) and Figure 2(b) shows that a module with a larger impact on the predictive performance does not necessarily yield a larger efficiency improvement when removed. For instance, the computationally inexpensive FFT influences the predictive performance to a large extent. In contrast, although the computational cost of BN is high, its contribution to the performance is marginal.


Figure 2: (a) Average testing accuracy loss across all datasets when one module at a time is removed while the other modules remain in the network. Removing MHA, FFT, or IFFT causes a high percentage of accuracy loss, so these modules have the largest influence on the predictive performance. In comparison, BN, EMBED, and ACT have only a marginal influence. (b) Corresponding average efficiency improvement across all datasets when one module is removed from the network while the other modules remain intact.
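The two quantities compared in Figure 2 can be sketched as follows. The cost product follows the definition given in the text (training time per epoch times number of learnable parameters). Expressing both the efficiency improvement and the accuracy loss as relative changes with respect to the unpruned model is an assumption made here for illustration, and the timing and parameter counts are hypothetical; only the two accuracies are taken from the FingerMovements rows of the tables.

```python
def cost(train_time_per_epoch_s, num_params):
    """Cost product from the text: a higher value means a less efficient model."""
    return train_time_per_epoch_s * num_params


def efficiency_improvement(full, pruned):
    """Relative reduction of the cost product from the full to the pruned model (assumed form)."""
    return (cost(*full) - cost(*pruned)) / cost(*full)


def accuracy_loss(full_acc, pruned_acc):
    """Relative accuracy loss with respect to the full model (assumed form)."""
    return (full_acc - pruned_acc) / full_acc


# Hypothetical measurements: (seconds per epoch, number of learnable parameters).
full_model = (2.0, 100_000)
pruned_model = (1.5, 60_000)

gain = efficiency_improvement(full_model, pruned_model)
# Accuracies for FingerMovements: full model 0.640, MHA removed 0.500.
loss = accuracy_loss(0.640, 0.500)
```

With these inputs the cost product drops from 200,000 to 90,000, a relative improvement of 0.55, while the relative accuracy loss is about 0.219. Plotting such pairs for every module gives the kind of comparison summarized in the two panels.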

4.5 Module-by-Module Pruning

Next, we explore the relationship between efficiency (defined as in Section 4.4) and effectiveness (predictive performance). Based on the contribution of each module to the performance loss shown in Figure 2, we perform module-by-module pruning, removing modules in order from the most significant to the least significant contributor to accuracy (MHA, FFT, IFFT, FFN, GAP, BN, EMBED, ACT). We evaluate the pruning effect in two respects: effectiveness, measured by testing accuracy, and efficiency, measured by the average training time per epoch in seconds and the number of learnable parameters. Due to limited space, we show the testing accuracy of only some datasets in Table 4 and their efficiency results in Figure 3. After the entire MHA module is removed, the number of learnable parameters shrinks drastically, as does the accuracy (Table 4, MHA column). The representation capability of the pruned network is damaged because the number of parameters is a key factor in what the network can represent. The pace of accuracy loss and parameter reduction then slows as subsequent modules are removed, since FFT and IFFT have no learnable parameters. The remaining modules carry far fewer parameters than the MHA module and, based on Figure 2, their effects on the predictive performance are moderate. Hence, the curves in Figure 3 are relatively flat after MHA. We further investigate the change in accuracy under module-wise pruning on all datasets, as shown in Figure 4. The performance variation differs widely across datasets. For datasets such as AF, BM, and DDG, pruning has a large impact on performance, which may be due to the very limited number of training samples. Conversely, for datasets such as HB, LIB, and SRS1, pruning has little further effect once the MHA module has been removed (within 1%).


Dataset Acc. Unpruned MHA FFT IFFT FFN GAP BN EMBED ACT
EC Mean 0.456 0.365 0.363 0.363 0.361 0.358 0.354 0.354 0.354
Std. 0.003 0.004 0.004 0.002 0.004 0.003 0.003 0.003 0.003
NATO Mean 0.844 0.750 0.750 0.744 0.739 0.733 0.733 0.728 0.728
Std. 0.005 0.006 0.003 0.004 0.006 0.005 0.006 0.004 0.005
FM Mean 0.640 0.500 0.495 0.495 0.493 0.493 0.490 0.490 0.490
Std. 0.009 0.010 0.011 0.010 0.008 0.009 0.011 0.010 0.011
SRS1 Mean 0.915 0.819 0.817 0.816 0.814 0.814 0.812 0.812 0.812
Std. 0.005 0.005 0.003 0.003 0.004 0.006 0.003 0.004 0.005


Table 4: Module-wise pruning results for datasets EC, NATO, FM, and SRS1. Columns MHA through ACT give the accuracy after the module in that column has been removed from the model architecture, in addition to the modules in the columns to its left. Experiments are repeated with different random seeds, and the accuracy results are reported as mean and standard deviation. Bold indicates that removing the module causes a large accuracy loss compared to the unpruned model. After MHA, the decrease in accuracy is small and stable.


Figure 3: Module-wise results for changes in terms of number of parameters and training time per epoch on four datasets: EC, NATO, FM, SRS1.


Figure 4: Change in accuracy (%) from module-by-module pruning across all datasets. The order of datasets shown from (a) to (c) corresponds to Table 3.

Overall, based on the above module-by-module pruning scheme, we observe that as the effectiveness (predictive performance) of the network increases, the corresponding efficiency (training speed and model size) generally decreases. This evident cost–benefit trade-off between efficiency and effectiveness raises a key question for researchers: how to find efficient model settings while maintaining the “equilibrium” between these two aspects. This problem is discussed in Section 4.7.
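The cumulative pruning procedure of this section can be sketched as a simple schedule: modules are removed one after another in the stated order, and the accuracy drop attributed to each step is the difference between consecutive entries of a pruning curve. The function names are illustrative assumptions; the accuracy values are the EthanolConcentration means reported in Table 4.

```python
# Pruning order stated in the text, from most to least significant contributor.
PRUNING_ORDER = ["MHA", "FFT", "IFFT", "FFN", "GAP", "BN", "EMBED", "ACT"]


def pruning_schedule(order):
    """Yield (removed_module, remaining_modules) for cumulative pruning."""
    remaining = list(order)
    for module in order:
        remaining.remove(module)
        yield module, list(remaining)


def stepwise_drops(order, accuracies):
    """Map each module to the accuracy drop observed when it is removed in turn.

    `accuracies` holds the unpruned accuracy followed by one value per pruning step.
    """
    drops = [round(before - after, 3) for before, after in zip(accuracies, accuracies[1:])]
    return dict(zip(order, drops))


schedule = list(pruning_schedule(PRUNING_ORDER))

# EthanolConcentration mean accuracies from Table 4 (unpruned, then after each removal).
ec_accuracy = [0.456, 0.365, 0.363, 0.363, 0.361, 0.358, 0.354, 0.354, 0.354]
ec_drops = stepwise_drops(PRUNING_ORDER, ec_accuracy)
largest_drop_module = max(ec_drops, key=ec_drops.get)
```

For this dataset the first step (removing MHA) accounts for a drop of 0.091, while every later step changes the accuracy by at most 0.004, which matches the flat curves described for the modules following MHA.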

4.6 Efficiency vs. Complexity

Here, we explore the relationship between network efficiency and complexity. In general, the more complex a model is, the less efficient it is. The network’s efficiency is defined in the same way as previous sections, in terms of the training time and the number of parameters. Meanwhile, we define the complexity of the model as the stacking of modules. Contrary to model pruning, we stack each module based on their influence on the predictive performance, from the least significant contributor to the most significant contributor (ACT, EMBED, BN, GAP, FFN, IFFT, FFT, MHA) to accuracy. Our empirical results in Figure 5 shed light on the trade-off between model efficiency and complexity. As can be seen in Figure 5, as more modules are stacked over the network, the corresponding computational efficiency decreases. All datasets illustrate similar trends.


Figure 5: Trade-off between network efficiency and complexity across all datasets. Due to the notable differences in dataset sizes, the computation of efficiency is normalized for each dataset. The order of datasets shown from (a) to (c) corresponds to Table 3.

4.7 Pareto Analysis for Trade-Off Exploration Between Efficiency and Performance/effectiveness

We define the model efficiency in terms of the reciprocal of the product between training time per epoch and the number of parameters. Thus, the higher the reciprocal, the higher the efficiency. To explore the relationship between model efficiency and performance, we employ Pareto analysis [43]. Pareto efficiency represents a state for which improving the performance as measured by one criterion would worsen the performance as measured by another criterion. We choose the FingerMovements and Heartbeat datasets to obtain the Pareto frontiers, where the set of points on the front correspond to Pareto-efficient solutions. We have two objectives: maximize the efficiency; maximize the accuracy. Figure 6 shows the result of Pareto fronts for both datasets in blue, where the red points are Pareto-efficient solutions. The scattered cyan points are randomly sampled experimental data from all different configurations. The Pareto analysis provides us with a principled approach for choosing efficient network settings, while exploring the trade-off between efficiency and performance. Specifically, we can identify the extent of computational resources that is required in order for a model to achieve a certain performance. Conversely, we can identify how well a model can perform, given a certain amount of resources.

[FingerMovements.] [Heartbeat.]

Figure 6: Pareto efficiency visualization of the FingerMovements and Heartbeat datasets. The scattered cyan points, the marked red points, and the blue curve represent randomly sampled experimental data, Pareto-efficient solutions, and Pareto-efficient frontiers, respectively.
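A Pareto front for the two objectives used here (maximize efficiency, maximize accuracy) can be extracted with a direct dominance check, as in the sketch below. Efficiency follows the definition of this section (the reciprocal of training time per epoch times number of parameters); the function names and the five configurations are hypothetical and serve only to illustrate the technique.

```python
def efficiency(train_time_per_epoch_s, num_params):
    """Reciprocal of the cost product: a higher value means a more efficient model."""
    return 1.0 / (train_time_per_epoch_s * num_params)


def pareto_front(points):
    """Return the (efficiency, accuracy) points not dominated by any other point.

    A point is dominated if another point is at least as good in both objectives
    and strictly better in at least one; both objectives are maximized.
    """
    front = []
    for i, (eff_i, acc_i) in enumerate(points):
        dominated = any(
            eff_j >= eff_i and acc_j >= acc_i and (eff_j > eff_i or acc_j > acc_i)
            for j, (eff_j, acc_j) in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append((eff_i, acc_i))
    return sorted(front)  # ordered by increasing efficiency


# Hypothetical configurations: (seconds per epoch, number of parameters, accuracy).
configs = [
    (2.0, 100_000, 0.64),
    (1.0, 50_000, 0.60),
    (1.5, 80_000, 0.58),
    (0.5, 10_000, 0.50),
    (0.6, 20_000, 0.45),
]
points = [(efficiency(t, p), acc) for t, p, acc in configs]
front = pareto_front(points)
front_accuracies = [acc for _, acc in front]
```

Three of the five configurations are Pareto-efficient; the other two are dominated because some configuration is both more efficient and more accurate. Reading the front from left to right shows how much accuracy must be given up for each gain in efficiency. The pairwise check is quadratic in the number of configurations, which is adequate for a few thousand sampled settings.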

5 Discussion

In this work, we propose a methodology to investigate the relationship between model efficiency and effectiveness, as well as its complexity. The method is applied to a mixing network based on Transformer and Fourier transform for MTS classification. Extensive experiments are conducted on 18 MTS datasets, including ablation studies on different modules of the network and module-by-module pruning evaluated in terms of the predictive performance, training speed, and the number of learnable parameters. The network achieves competitive performance compared to current best-performing methods. Ablation studies indicate that self-attention and Fourier transform are the largest contributors to the model performance across all datasets. Furthermore, through sequential pruning of each module, we observed the efficiency–effectiveness and the efficiency–complexity trade-offs of the network. Through Pareto analysis, we show how to choose efficient settings of the network, while investigating the performance–efficiency trade-off through visualization of the Pareto fronts. We note that for far more complex models applied to large-scale data, finite computational resources make it impractical to run experiments for all possible configurations of the model. In these cases, given a reasonable number of experiments, techniques such as regression can be used to estimate the performance of a large number of randomly generated model settings. Pareto analysis can then be performed on these estimates to evaluate the efficiency–performance trade-off and to guide researchers in adjusting the model settings to improve efficiency and effectiveness accordingly.


  • [1] Li-wei, H. L., Adams, R. P., Mayaud, L., et al.: A physiological time series dynamics-based approach to patient monitoring and outcome prediction. IEEE journal of biomedical and health informatics, vol. 19, pp. 1068-1076 (2014).
  • [2] Liu, H., Long, Z.: An improved deep learning model for predicting stock market price time series. Digital Signal Processing, vol. 102, pp. 102741 (2020).
  • [3] Fu, T. C.: A review on time series data mining. Engineering Applications of Artificial Intelligence, vol. 24, pp. 164-181 (2011).
  • [4] Gao, B., Li, X., Woo, W. L., et al.: Physics-based image segmentation using first order statistical properties and genetic algorithm for inductive thermography imaging. IEEE Transactions on Image Processing, vol. 27, pp. 2160-2175 (2017).

  • [5] Hu, B., Gao, B., Woo, W. L., et al.: A Lightweight Spatial and Temporal Multi-Feature Fusion Network for Defect Detection. IEEE Transactions on Image Processing, vol. 30, pp. 472-486 (2020).
  • [6] Fawaz, H. I., Forestier, G., Weber, J., et al.: Deep learning for time series classification: a review. Data mining and knowledge discovery, vol. 33, pp. 917-963 (2019).
  • [7] Ruiz, A. P., Flynn, M., Large, J., et al.: The great multivariate time series classification bake off: a review and experimental evaluation of recent algorithmic advances. Data Mining and Knowledge Discovery, vol. 35, pp. 401-449 (2021).
  • [8] Hüsken, M., Stagge, P.: Recurrent neural networks for time series classification. Neurocomputing, vol. 50, pp. 223-235 (2003).
  • [9] Zhao, B., Lu, H., Chen, S., et al.: Convolutional neural networks for time series classification. Journal of Systems Engineering and Electronics, vol. 28, pp. 162-169 (2017).

  • [10] Lines, J., Taylor, S., Bagnall, A.: Time series classification with HIVE-COTE: The hierarchical vote collective of transformation-based ensembles. ACM Transactions on Knowledge Discovery from Data, vol. 12 (2018).
  • [11] Dempster, A., Petitjean, F., Webb, G. I.: ROCKET: exceptionally fast and accurate time series classification using random convolutional kernels. Data Mining and Knowledge Discovery, vol. 34, pp. 1454-1495 (2020).
  • [12] Zerveas, G., Jayaraman, S., Patel, D., et al.: A transformer-based framework for multivariate time series representation learning. In: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pp. 2114-2124 (2021).
  • [13] Lines, J., Bagnall, A.: Time series classification with ensembles of elastic distance measures. Data Mining and Knowledge Discovery, vol. 29, pp. 565-592 (2015).
  • [14] Schäfer, P.: Scalable time series classification. Data Mining and Knowledge Discovery, vol. 30, pp. 1273-1298 (2016).
  • [15] Fulcher, B. D., Jones, N. S.: Highly comparative feature-based time-series classification. IEEE Transactions on Knowledge and Data Engineering, vol. 26, pp. 3026-3037 (2014).
  • [16] Abanda, A., Mori, U., Lozano, J. A.: A review on distance based time series classification. Data Mining and Knowledge Discovery, vol. 33, pp. 378-412 (2019).
  • [17] Wang, Z., Yan, W., Oates, T.: Time series classification from scratch with deep neural networks: A strong baseline. In: 2017 International Joint Conference on Neural Networks, pp. 1578-1585 (2017).
  • [18] Lin, S., Runger, G. C.: GCRNN: Group-constrained convolutional recurrent neural network. IEEE Transactions on Neural Networks and Learning Systems, vol. 29, pp. 4709-4718 (2017).
  • [19] Fawaz, H. I., Lucas, B., Forestier, G., et al.: InceptionTime: Finding AlexNet for time series classification. Data Mining and Knowledge Discovery, vol. 34, pp. 1936-1962 (2020).
  • [20] Zhang, X., Gao, Y., Lin, J., et al.: TapNet: Multivariate time series classification with attentional prototypical network. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 6845-6852 (2020).
  • [21] Bloomfield, P.: Fourier analysis of time series: an introduction (2004).
  • [22] Ren, H., Xu, B., Wang, Y., et al.: Time-series anomaly detection service at Microsoft. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 3009-3017 (2019).
  • [23] Puech, T., Boussard, M., D’Amato, A., et al.: A fully automated periodicity detection in time series. In: International Workshop on Advanced Analysis and Learning on Temporal Data, pp. 43-54 (2019).
  • [24] Janacek, G. J., Bagnall, A. J., Powell, M.: A likelihood ratio distance measure for the similarity between the Fourier transform of time series. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 737-743 (2005).
  • [25] Geerken, R., Zaitchik, B., Evans, J. P.: Classifying rangeland vegetation type and coverage from NDVI time series using Fourier Filtered Cycle Similarity. International Journal of Remote Sensing, vol. 26, pp. 5535-5554 (2005).
  • [26] Samiee, K., Kovacs, P., Gabbouj, M.: Epileptic seizure classification of EEG time-series using rational discrete short-time Fourier transform. IEEE Transactions on Biomedical Engineering, vol. 62, pp. 541-552 (2014).
  • [27] Vaswani, A., Shazeer, N., Parmar, N., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5998-6008 (2017).
  • [28] Oh, J., Wang, J., Wiens, J.: Learning to exploit invariances in clinical time-series data using sequence transformer networks. In: Machine Learning for Healthcare Conference, pp. 332-347 (2018).
  • [29] Zhao, Y., Hong, Q., Zhang, X., et al.: BERTSurv: BERT-Based Survival Models for Predicting Outcomes of Trauma Patients. arXiv preprint arXiv:2103.10928 (2021).
  • [30] Liu, M., Ren, S., Ma, S., et al.: Gated Transformer Networks for Multivariate Time Series Classification. arXiv preprint arXiv:2103.14438 (2021).
  • [31] Huang, G., Liu, Z., Van Der Maaten, L., et al.: Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700-4708 (2017).
  • [32] Tan, M., Le, Q.: EfficientNet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105-6114 (2019).
  • [33] Brock, A., De, S., Smith, S. L., et al.: High-performance large-scale image recognition without normalization. arXiv preprint arXiv:2102.06171 (2021).
  • [34] Srinivas, A., Lin, T. Y., Parmar, N., et al.: Bottleneck transformers for visual recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16519-16529 (2021).
  • [35] Tan, M., Le, Q. V.: EfficientNetV2: Smaller models and faster training. arXiv preprint arXiv:2104.00298 (2021).
  • [36] LeCun, Y., Denker, J. S., Solla, S. A.: Optimal brain damage. In: Advances in Neural Information Processing Systems, pp. 598-605 (1990).
  • [37] Han, S., Pool, J., Tran, J., et al.: Learning both weights and connections for efficient neural networks. arXiv preprint arXiv:1506.02626 (2015).
  • [38] Li, H., Kadav, A., Durdanovic, I., et al.: Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710 (2016).
  • [39] Kim, Y.: Convolutional Neural Networks for Sentence Classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1746-1751 (2014).
  • [40] Song, H., Rajan, D., Thiagarajan, J. J., et al.: Attend and diagnose: Clinical time series analysis using attention models. In: Thirty-Second AAAI Conference on Artificial Intelligence (2018).
  • [41] Paszke, A., Gross, S., Massa, F., et al.: PyTorch: An imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems, pp. 8026-8037 (2019).
  • [42] Kingma, D. P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
  • [43] Censor, Y.: Pareto optimality in multiobjective problems. Applied Mathematics and Optimization, vol. 4, pp. 41-59 (1977).