1 Introduction
Time series (TS) data is ubiquitous, occurring in healthcare [1], stock market [2], astronomy [3], and many other domains [4, 5]. With the advance of sensing techniques, TS classification across wide-ranging domains has gained much interest during the past decade [6, 7].
The availability of the UCR/UEA time series benchmark datasets [7] has led to an abundance of TS classification algorithms [8, 9, 10, 11, 12]. Classification accuracy has been the key metric used to evaluate existing methods [13]. However, the high accuracy of these algorithms often comes at the cost of high computational complexity [14]. In natural language processing (NLP) and computer vision (CV), it is commonly assumed that high accuracy requires top-performing models with millions or billions of parameters, whose training is computationally intensive and can take days or weeks on many parallel GPUs or TPUs. Such intensive training also makes a model difficult to retrain for further performance improvement. Likewise, for large-scale time series data with high dimensionality or long sequence length, it is challenging to balance predictive accuracy and training efficiency.
In this work, we propose a method to investigate the relationship between model efficiency and effectiveness, as well as model complexity, for multivariate time series (MTS) classification. The model architecture is based on the Transformer and the Fourier transform. We use 18 benchmark MTS datasets for evaluation. Comprehensive experiments are conducted on all datasets, including an ablation study of each module of the network and module-by-module pruning, evaluated in terms of accuracy, training speed, and model size. Experimental results demonstrate the competitive performance of our proposed architecture compared with current state-of-the-art methods. Ablation studies identify the main contributors to the predictive performance, such as multi-head self-attention and the Fourier transform. In addition, module-wise pruning of the network reveals the trade-off between model efficiency and effectiveness, as well as between model efficiency and complexity. Finally, we conduct Pareto analysis to examine the trade-off between efficiency and performance.
The main contributions of this paper are highlighted as follows:
- To the best of our knowledge, this is the first paper to perform Pareto analysis to investigate the relationship between efficiency and accuracy.
- Through module-by-module pruning, comprehensive experimental results indicate an evident trade-off between model efficiency and its effectiveness, as well as its complexity.
- We employ Pareto analysis to investigate the relationship between model efficiency and performance. Such analysis methods can provide general guidance for researchers on how to select efficient model configurations, which can be applied to any model architecture.
The remainder of this paper is organized as follows. Section 2 describes related work on Transformers and the Fourier transform in time series analysis, as well as existing methods for improving model efficiency. The network architecture is outlined in Section 3. Section 4 discusses datasets and experiments on 18 benchmark datasets, including ablation studies, module-wise pruning, and Pareto efficiency visualization. Finally, our conclusions are presented in Section 5.
2 Related Work
Neural Networks for Time Series Classification. Currently, most TS classification algorithms can be divided into three categories: feature-based [15], distance-based [16], and neural network based methods [6]. Here, we focus only on neural network based methods. With the advances in deep learning, two popular frameworks, CNN and RNN, have been widely applied in TS classification tasks. [17] combined Fully Convolutional Networks (FCN) and Residual Networks (ResNet) for univariate time series classification. [18] developed a group-constrained method, which combines a CNN with an RNN. More recent works such as InceptionTime [19], TapNet [20], and TST [12] have been proposed for TS classification. For additional deep learning methods, we refer readers to [6].
Fourier Transform in Time Series. The Fourier transform (FT) has been an important tool in time series analysis for decades [21], and is widely used for applications such as anomaly detection [22], periodicity detection [23], and similarity measures [24]. The FT converts a TS from the time domain to the frequency domain, and uses Fourier coefficients to represent the original data. For the TS classification task, FTs have been used indirectly in disparate applications. For instance, [25] utilizes the FT to filter noisy data for vegetation type classification, and [26] uses the FT as a feature extraction technique to classify electroencephalography (EEG) data. However, none of the above methods apply the FT directly to TS classification, particularly in the context of neural networks. In contrast, we aim to apply the discrete FT and its inverse as modules of a deep learning framework. The unparameterized FT can reduce the computational cost of the network to some extent.
Transformer Networks for Time Series Classification. With the exemplary performance of the Transformer architecture [27] in NLP and CV, researchers in the time series community began exploring Transformers in TS classification in specific domains [28, 29]. More recent works have generalized Transformer frameworks for MTS classification. [12] adopts a Transformer encoder architecture for unsupervised representation learning of MTS. [30] explored an extension of the current Transformer architecture by gating, which merges two towers for MTS classification. In contrast, we propose to generalize a mixing framework which utilizes both Transformer and FT. By replacing some self-attention sublayers with FT, the computational complexity can be reduced.
Model Training Efficiency. Due to the increasing size of both models and training data, many works have focused on improving model training efficiency through parameter reduction, such as DenseNet [31] and EfficientNet [32], training speed improvement including NFNets [33] and BotNet [34], or both [35]. One of the most common techniques to improve network efficiency is model pruning. Early works focused on non-structured methods. For instance, [36, 37] proposed to remove individual weight values. Recent works focused more on structured methods, such as channel weight pruning based on norm [38].
3 Methodology
In this section, we present our network architecture, which contains all of the modules for potential model pruning. The overall model structure is illustrated in Figure 1.

Figure 1: An overview of the full model framework. Our architecture is based on the Transformer and the Fourier transform. Following the sequence embedding, we apply a 2D discrete Fourier transform (specifically, the fast Fourier transform) to convert the TS features from the time domain to the frequency domain, a multi-head self-attention layer, and a 2D inverse discrete Fourier transform to map the features back to the time domain. Then we employ a Global Average Pooling (GAP) layer to average the output of the MTS over the entire time dimension. Finally, a Softmax layer is used for the multi-class MTS classification task.
Input Embeddings. Input embeddings are commonly used in NLP models, which map relatively low-dimensional vectors to high-dimensional vectors to facilitate sequence modeling [39]. Correspondingly, an embedding for a TS sequence is required to capture the dependencies among different features without considering the temporal information [40]. Our framework employs a 1D convolutional layer to obtain the K-dimensional embeddings at each time step.
Discrete Fourier Transform. The Fourier transform decomposes a function of time into its constituent frequencies. For clarity, we first consider the 1D discrete Fourier transform (DFT). Given a sequence of $N$ complex numbers $\{x_n\}$ with $n \in [0, N-1]$, the 1D DFT is defined by
$$X_k = \sum_{n=0}^{N-1} x_n W_N^{nk}, \quad 0 \le k \le N-1,$$
where $W_N = e^{-2\pi i / N}$. Given the DFT $X_k$, the original sequence $x_n$ can be recovered by the inverse DFT (IDFT)
$$x_n = \frac{1}{N} \sum_{k=0}^{N-1} X_k W_N^{-nk}.$$
The 2D DFT is a direct extension of the 1D DFT, obtained by successively performing the 1D DFT on the row and column dimensions. Given a 2D signal $x(m, n)$ with $0 \le m \le M-1$ and $0 \le n \le N-1$, the 2D DFT is given by
$$X(u, v) = \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} x(m, n)\, W_M^{um}\, W_N^{vn}.$$
Similar to the 1D IDFT, the 2D DFT is invertible via the 2D IDFT,
$$x(m, n) = \frac{1}{MN} \sum_{u=0}^{M-1} \sum_{v=0}^{N-1} X(u, v)\, W_M^{-um}\, W_N^{-vn}.$$
To compute the DFT efficiently, the Fast Fourier Transform (FFT) algorithm takes advantage of the periodicity and symmetry properties of $W_N$ such that the computational complexity of the DFT reduces from $O(N^2)$ to $O(N \log N)$, regardless of dimension.
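As a minimal sketch of these definitions, the following pure-Python code implements the naive 1D DFT and IDFT in the standard form X_k = sum_n x_n exp(-2*pi*i*n*k/N), and builds the 2D transform by applying the 1D transform along rows and then along columns. The function names (`dft`, `idft`, `dft2`, `idft2`) are illustrative assumptions and are not taken from the paper's implementation; the sums below cost O(N^2), whereas an FFT routine would be used in practice.

```python
import cmath


def dft(x):
    """Naive O(N^2) 1D DFT: X_k = sum_n x_n * W^(n*k), with W = exp(-2*pi*i/N)."""
    n_points = len(x)
    w = cmath.exp(-2j * cmath.pi / n_points)
    return [sum(x[n] * w ** (n * k) for n in range(n_points))
            for k in range(n_points)]


def idft(spectrum):
    """Inverse 1D DFT: x_n = (1/N) * sum_k X_k * W^(-n*k)."""
    n_points = len(spectrum)
    w_inv = cmath.exp(2j * cmath.pi / n_points)
    return [sum(spectrum[k] * w_inv ** (n * k) for k in range(n_points)) / n_points
            for n in range(n_points)]


def _rows_then_columns(matrix, transform):
    """Apply a 1D transform to every row, then to every column."""
    rows = [transform(list(row)) for row in matrix]
    cols = [transform(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]  # transpose back to row-major


def dft2(signal):
    """2D DFT obtained from successive 1D DFTs on rows and columns."""
    return _rows_then_columns(signal, dft)


def idft2(spectrum):
    """2D IDFT obtained from successive 1D IDFTs on rows and columns."""
    return _rows_then_columns(spectrum, idft)


coefficients = dft([1, 2, 3, 4])
recovered = idft(coefficients)
```

Applying `idft` to the output of `dft` recovers the input up to floating-point error, which mirrors the invertibility stated above.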
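Multi-head Attention. The multi-head attention (MHA) mechanism, the major component of the Transformer architecture [27], allows the model to jointly attend to information from different representation subspaces at different positions. MHA is defined as:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O,$$
where $Q, K, V \in \mathbb{R}^{n \times d}$ are input embedding matrices, $n$ is the sequence length, $d$ is the embedding dimension, and $h$ is the number of heads. Each head is defined as:
$$\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V) = \mathrm{softmax}\!\left(\frac{Q W_i^Q \left(K W_i^K\right)^{\top}}{\sqrt{d_k}}\right) V W_i^V,$$
where $W_i^Q, W_i^K, W_i^V \in \mathbb{R}^{d \times d_k}$ and $W^O \in \mathbb{R}^{h d_k \times d}$ are parameter matrices to be learned, and $d_k$ is the per-head dimension.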
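The mechanism can be sketched in a few lines of pure Python, assuming the standard scaled dot-product form from [27]: each head projects the inputs, computes softmax-normalized attention weights, and the head outputs are concatenated and projected. The helper names (`matmul`, `softmax`, `attention`, `multi_head`) and the toy weights are hypothetical and only illustrate the shapes involved; this is not the paper's implementation.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention: softmax(q k^T / sqrt(d_k)) v."""
    d_k = len(k[0])
    k_transposed = [list(col) for col in zip(*k)]
    scores = matmul(q, k_transposed)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)


def multi_head(q, k, v, heads, w_o):
    """heads is a list of (W_q, W_k, W_v) triples; w_o is the output projection."""
    outputs = [attention(matmul(q, w_q), matmul(k, w_k), matmul(v, w_v))
               for w_q, w_k, w_v in heads]
    # Concatenate the per-head outputs along the feature dimension.
    concatenated = [sum((out[t] for out in outputs), []) for t in range(len(q))]
    return matmul(concatenated, w_o)


# Toy example: sequence length 3, embedding dimension 2, two heads of size 1.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
head_a = ([[1.0], [0.0]],) * 3   # hypothetical projection weights
head_b = ([[0.0], [1.0]],) * 3
identity = [[1.0, 0.0], [0.0, 1.0]]
output = multi_head(x, x, x, [head_a, head_b], identity)
```

The output keeps the input shape (sequence length by embedding dimension), which is what allows attention sublayers to be stacked or swapped for parameter-free Fourier sublayers.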
Global Average Pooling. Global average pooling calculates the average value of all of the elements in a feature map. It is mainly used to reduce the number of learnable parameters.
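To make the parameter saving concrete, the sketch below averages a feature map over the time dimension and compares the number of classifier weights with and without pooling. The length (1197) and class count (12) are those of the Cricket dataset in Table 1; the embedding size of 64 is an assumed value for illustration only, and the function names are hypothetical.

```python
def global_average_pool(features):
    """Average a (time steps x channels) feature map over the time dimension."""
    n_steps = len(features)
    return [sum(column) / n_steps for column in zip(*features)]


def classifier_weights(n_inputs, n_classes):
    """Number of weights in a linear classifier (biases ignored)."""
    return n_inputs * n_classes


pooled = global_average_pool([[1.0, 2.0], [3.0, 6.0]])

# Cricket: length 1197, 12 classes (Table 1); 64 channels is an assumption.
length, n_channels, n_classes = 1197, 64, 12
flatten_weights = classifier_weights(length * n_channels, n_classes)
gap_weights = classifier_weights(n_channels, n_classes)
```

Under these assumptions, a classifier on the flattened features needs 919,296 weights, while the pooled features need only 768, and the pooled size no longer depends on the sequence length.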
Batch Normalization. Instead of using layer normalization, as in Transformer-related architectures in NLP, we consider the necessity of applying batch normalization to each block shown in Figure 1. Compared to layer normalization, batch normalization can mitigate the effect of outlier values in time series data, which do not appear in text representations.
Activation Function. Following later Transformer-based models (the original Transformer architecture [27] uses ReLU), we consider the necessity of applying the GELU activation function to each module shown in Figure 1.
Feedforward Neural Network. A position-wise feedforward neural network (FFN) is applied with two 1D convolutional layers with kernel size 1, and a GELU activation function in between.
4 Experiments
In this section, we describe benchmark MTS datasets [7] used for experimental evaluation, the experimental setup, and corresponding results.
4.1 Datasets
We select a set of 18 publicly available benchmark datasets from the UCR/UEA classification archive: AtrialFibrillation (AF), BasicMotions (BM), Cricket (CR), DuckDuckGeese (DDG), Epilepsy (EP), EthanolConcentration (EC), ERing (ER), FingerMovements (FM), HandMovementDirection (HMD), Handwriting (HW), Heartbeat (HB), Libras (LIB), NATOPS (NATO), PEMS-SF (PEMS), RacketSports (RS), SelfRegulationSCP1 (SRS1), SelfRegulationSCP2 (SRS2), and UWaveGestureLibrary (UW). The main characteristics of each dataset are summarised in Table 1. All of the datasets have been split into training and testing sets by default. Thus, there are no preprocessing steps for these data. The predictive performance on all datasets is evaluated in terms of accuracy.
Dataset | Code | Train Size | Test Size | Dimensions | Length | Classes |
AtrialFibrillation | AF | 15 | 15 | 2 | 640 | 3 |
BasicMotions | BM | 40 | 40 | 6 | 100 | 4 |
Cricket | CR | 108 | 72 | 6 | 1197 | 12 |
DuckDuckGeese | DDG | 50 | 50 | 1345 | 270 | 5 |
Epilepsy | EP | 137 | 138 | 3 | 206 | 4 |
EthanolConcentration | EC | 261 | 263 | 3 | 1751 | 4 |
ERing | ER | 30 | 270 | 4 | 65 | 6 |
FingerMovements | FM | 316 | 100 | 28 | 50 | 2 |
HandMovementDirection | HMD | 160 | 74 | 10 | 400 | 4 |
Handwriting | HW | 150 | 850 | 3 | 152 | 26 |
Heartbeat | HB | 204 | 205 | 61 | 405 | 2 |
Libras | LIB | 180 | 180 | 2 | 45 | 15 |
NATOPS | NATO | 180 | 180 | 24 | 51 | 6 |
PEMS-SF | PEMS | 267 | 173 | 963 | 144 | 7 |
RacketSports | RS | 151 | 152 | 6 | 30 | 4 |
SelfRegulationSCP1 | SRS1 | 268 | 293 | 6 | 896 | 2 |
SelfRegulationSCP2 | SRS2 | 200 | 180 | 7 | 1152 | 2 |
UWaveGestureLibrary | UW | 120 | 320 | 3 | 315 | 8 |
Table 1: Main characteristics of the 18 benchmark MTS datasets.
4.2 Setup
We set aside a portion of the default training set as a validation set, which we used to select the best hyperparameters. All experiments were implemented in PyTorch [41] on one GTX 1080 Ti GPU. We minimized the cross-entropy loss with the Adam optimizer [42]. The hyperparameter search space for each dataset is listed in Table 2. Note that the batch size choice is limited by the available GPU memory.
Hyperparameters | Search Space |
learning rate | [1e-3, 5e-3, 1e-4, 5e-4, 1e-5, 5e-5] |
dropout rate | [0.1, 0.2, 0.3] |
batch size | [8, 16, 32] |
number of heads | [4, 8, 16] |
number of FFT layers | [0, 1, 2, 3, 4] |
number of IFFT layers | [0, 1, 2, 3, 4] |
number of MHA layers | [0, 1, 2, 3, 4] |
number of Feedforward layers | [0, 1, 2, 3, 4] |
Table 2: Hyperparameter search space for each dataset.
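To give a sense of the size of this search space, the sketch below enumerates every combination of the listed values with `itertools.product`. Whether the search was exhaustive or sampled is not stated in the text, so the exhaustive grid here is an assumption, and the dictionary keys are hypothetical names chosen for illustration.

```python
from itertools import product

# Values copied from the hyperparameter table; key names are illustrative.
search_space = {
    "learning_rate": [1e-3, 5e-3, 1e-4, 5e-4, 1e-5, 5e-5],
    "dropout_rate": [0.1, 0.2, 0.3],
    "batch_size": [8, 16, 32],
    "n_heads": [4, 8, 16],
    "n_fft_layers": [0, 1, 2, 3, 4],
    "n_ifft_layers": [0, 1, 2, 3, 4],
    "n_mha_layers": [0, 1, 2, 3, 4],
    "n_ffn_layers": [0, 1, 2, 3, 4],
}


def iter_configs(space):
    """Yield one dict per point of the full Cartesian grid."""
    keys = list(space)
    for values in product(*(space[key] for key in keys)):
        yield dict(zip(keys, values))


n_configs = sum(1 for _ in iter_configs(search_space))
first_config = next(iter_configs(search_space))
```

The full grid contains 6 x 3 x 3 x 3 x 5^4 = 101,250 configurations per dataset, which illustrates why training efficiency matters when many configurations must be evaluated.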
Dataset | Acc. | Unpruned | EMBED | FFT | IFFT | MHA | FFN | GAP | BN | ACT |
AF | Mean | 0.667 | 0.600 | 0.400 | 0.467 | 0.400 | 0.667 | 0.533 | 0.600 | 0.667 |
AF | Std. | 0.003 | 0.005 | 0.005 | 0.004 | 0.003 | 0.006 | 0.006 | 0.004 | 0.003 |
BM | Mean | 0.975 | 0.950 | 0.725 | 0.775 | 0.750 | 0.900 | 0.925 | 0.900 | 0.950 |
BM | Std. | 0.008 | 0.010 | 0.012 | 0.009 | 0.012 | 0.010 | 0.014 | 0.009 | 0.011 |
CR | Mean | 0.987 | 0.958 | 0.875 | 0.861 | 0.833 | 0.889 | 0.944 | 0.972 | 0.944 |
CR | Std. | 0.007 | 0.009 | 0.012 | 0.008 | 0.012 | 0.006 | 0.009 | 0.012 | 0.008 |
DDG | Mean | 0.580 | 0.580 | 0.440 | 0.420 | 0.380 | 0.520 | 0.560 | 0.560 | 0.580 |
DDG | Std. | 0.016 | 0.017 | 0.020 | 0.016 | 0.014 | 0.016 | 0.016 | 0.014 | 0.016 |
EP | Mean | 0.986 | 0.978 | 0.891 | 0.913 | 0.899 | 0.949 | 0.971 | 0.956 | 0.971 |
EP | Std. | 0.014 | 0.013 | 0.016 | 0.014 | 0.014 | 0.012 | 0.014 | 0.013 | 0.015 |
EC | Mean | 0.456 | 0.445 | 0.376 | 0.395 | 0.365 | 0.418 | 0.441 | 0.445 | 0.452 |
EC | Std. | 0.003 | 0.002 | 0.003 | 0.003 | 0.004 | 0.002 | 0.004 | 0.003 | 0.002 |
ER | Mean | 0.963 | 0.956 | 0.896 | 0.889 | 0.885 | 0.892 | 0.948 | 0.952 | 0.956 |
ER | Std. | 0.006 | 0.007 | 0.006 | 0.006 | 0.008 | 0.005 | 0.006 | 0.007 | 0.005 |
FM | Mean | 0.640 | 0.620 | 0.490 | 0.520 | 0.500 | 0.600 | 0.590 | 0.610 | 0.620 |
FM | Std. | 0.009 | 0.008 | 0.007 | 0.008 | 0.010 | 0.008 | 0.009 | 0.010 | 0.011 |
HMD | Mean | 0.486 | 0.446 | 0.365 | 0.351 | 0.338 | 0.406 | 0.459 | 0.432 | 0.473 |
HMD | Std. | 0.018 | 0.016 | 0.020 | 0.017 | 0.018 | 0.019 | 0.018 | 0.016 | 0.020 |
HW | Mean | 0.529 | 0.514 | 0.471 | 0.473 | 0.468 | 0.506 | 0.506 | 0.512 | 0.514 |
HW | Std. | 0.006 | 0.007 | 0.006 | 0.005 | 0.007 | 0.007 | 0.008 | 0.007 | 0.006 |
HB | Mean | 0.771 | 0.766 | 0.683 | 0.707 | 0.688 | 0.751 | 0.756 | 0.766 | 0.756 |
HB | Std. | 0.014 | 0.015 | 0.014 | 0.017 | 0.015 | 0.016 | 0.014 | 0.015 | 0.016 |
LIB | Mean | 0.917 | 0.906 | 0.822 | 0.827 | 0.839 | 0.889 | 0.894 | 0.906 | 0.911 |
LIB | Std. | 0.009 | 0.011 | 0.012 | 0.010 | 0.012 | 0.013 | 0.011 | 0.009 | 0.010 |
NATO | Mean | 0.844 | 0.833 | 0.728 | 0.739 | 0.750 | 0.772 | 0.811 | 0.833 | 0.833 |
NATO | Std. | 0.005 | 0.004 | 0.005 | 0.007 | 0.006 | 0.005 | 0.006 | 0.004 | 0.006 |
PEMS | Mean | 0.908 | 0.884 | 0.815 | 0.809 | 0.803 | 0.867 | 0.879 | 0.896 | 0.896 |
PEMS | Std. | 0.013 | 0.012 | 0.014 | 0.016 | 0.014 | 0.013 | 0.013 | 0.014 | 0.012 |
RS | Mean | 0.914 | 0.901 | 0.796 | 0.816 | 0.803 | 0.855 | 0.908 | 0.901 | 0.908 |
RS | Std. | 0.021 | 0.020 | 0.020 | 0.018 | 0.019 | 0.021 | 0.020 | 0.021 | 0.019 |
SRS1 | Mean | 0.915 | 0.894 | 0.836 | 0.823 | 0.819 | 0.853 | 0.887 | 0.894 | 0.901 |
SRS1 | Std. | 0.005 | 0.007 | 0.006 | 0.006 | 0.005 | 0.007 | 0.006 | 0.005 | 0.005 |
SRS2 | Mean | 0.600 | 0.594 | 0.522 | 0.533 | 0.516 | 0.578 | 0.583 | 0.588 | 0.594 |
SRS2 | Std. | 0.002 | 0.003 | 0.002 | 0.001 | 0.004 | 0.002 | 0.003 | 0.003 | 0.002 |
UW | Mean | 0.922 | 0.906 | 0.844 | 0.850 | 0.841 | 0.875 | 0.894 | 0.897 | 0.903 |
UW | Std. | 0.006 | 0.008 | 0.009 | 0.006 | 0.007 | 0.008 | 0.006 | 0.007 | 0.007 |
Table 3: Testing accuracy of the ablation study on the 18 datasets. Each experiment is repeated multiple times with different random seeds, and the results are shown as mean and standard deviation. Column Unpruned shows the accuracy of the full model with all modules included. Columns EMBED to ACT represent the accuracy when the module in that column is removed from the model. Bold indicates that the module contributes most to the loss in accuracy and underlining indicates that the module contributes least to the loss in accuracy when the module is removed.
4.3 Module Settings
Based on Section 3, we define the following eight modules of the network for further analysis: input embedding (EMBED), fast Fourier transform (FFT), inverse fast Fourier transform (IFFT), multi-head attention (MHA), feedforward neural network (FFN), global average pooling (GAP), batch normalization (BN), and activation function (ACT). The corresponding abbreviations of each module are shown in parentheses.
4.4 Ablation Study
First, we conduct ablation studies to analyze the contribution of each module to the predictive performance. The contribution of each module is obtained by removing that module from the full network while the other modules remain intact. The fine-tuned results on the 18 datasets are shown in Table 3. Starting from Column EMBED, the smaller the accuracy, the larger the module's contribution, and vice versa. The accuracy on each dataset for the unpruned model (Table 3, Column Unpruned) is competitive with current state-of-the-art methods [7]. Among the eight modules, MHA and FFT contribute most to the predictive performance: on every dataset in Table 3, the lowest accuracy occurs when one of these two modules is removed. For MTS data, the correlations between different dimensions across all time steps are important to consider. Hence, the MHA is able to capture different feature correlations and influences the accuracy to a large extent. The FFT, as the core of signal processing and, more generally, of time series analysis, extracts frequency information embedded in the data, which provides a more straightforward representation than the original time-domain data. In contrast, EMBED, BN, and ACT contribute least to the predictive performance: on every dataset, the highest accuracy after removal is attained by at least one of these three modules. Although these operations are important for the training of the model, they influence the testing accuracy only marginally compared with MHA and FFT.
To clearly demonstrate the influence of each module on the predictive performance and efficiency of the network, the averaged testing accuracy loss and the corresponding efficiency improvement for each module (compared with the unpruned model) over all datasets are presented in Figure 2. Here, efficiency is defined in terms of the product of training time per epoch and the number of learnable parameters: the higher the product, the lower the efficiency. Because the datasets are highly diverse with respect to sequence length, number of samples, and dimensionality, the average loss in accuracy for each module shows a high variance in Figure 2, as the extent of the performance loss depends on dataset characteristics. The modules MHA, FFT, and IFFT have a notable influence on the model performance on average (21.9%, 20.1%, and 17.7% loss in accuracy, respectively). For modules such as BN, EMBED, and ACT, removal brings about minimal accuracy loss compared to other modules (3.6%, 2.7%, and 1.6%, respectively). Meanwhile, comparing the two panels of Figure 2, a module with a larger impact on the predictive performance does not necessarily yield a larger efficiency improvement when removed. For instance, the computationally inexpensive FFT influences the predictive performance to a large extent. In contrast, although the computational cost of BN is high, its contribution to the performance is marginal.
Figure 2: Average testing accuracy loss and corresponding efficiency improvement for each module, relative to the unpruned model, over all datasets.
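The two quantities compared in this analysis can be sketched as follows. The cost is the product of training time per epoch and parameter count, as defined in the text; the relative normalizations used for the loss and the improvement are assumptions, since the exact formulas behind Figure 2 are not given. The NATOPS accuracies come from Table 3, while the timing and parameter values are hypothetical.

```python
def training_cost(seconds_per_epoch, n_parameters):
    """Product of training time per epoch and parameter count (higher = less efficient)."""
    return seconds_per_epoch * n_parameters


def relative_accuracy_loss(acc_full, acc_pruned):
    """Fraction of the full model's accuracy lost after removing a module (assumed form)."""
    return (acc_full - acc_pruned) / acc_full


def efficiency_improvement(cost_full, cost_pruned):
    """Fractional reduction in training cost after removing a module (assumed form)."""
    return 1.0 - cost_pruned / cost_full


# NATOPS accuracies from Table 3: full model vs. model without MHA.
loss_without_mha = relative_accuracy_loss(0.844, 0.750)

# Hypothetical timings and parameter counts, for illustration only.
cost_full = training_cost(seconds_per_epoch=2.0, n_parameters=100_000)
cost_pruned = training_cost(seconds_per_epoch=1.5, n_parameters=40_000)
gain_without_mha = efficiency_improvement(cost_full, cost_pruned)
```

Computing both numbers for each removed module gives one point per module, which is the kind of comparison the two panels of Figure 2 summarize.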
4.5 Module-by-Module Pruning
Next, we explore the relationship between efficiency (defined as in Section 4.4) and effectiveness (predictive performance). Based on the contribution of each module to the performance loss shown in Figure 2, we perform module-by-module pruning, removing modules in order from the most significant to the least significant contributor to accuracy (MHA, FFT, IFFT, FFN, GAP, BN, EMBED, ACT). We evaluate the effect of pruning in two aspects: effectiveness, measured by testing accuracy, and efficiency, measured by the average training time per epoch in seconds and the number of learnable parameters. Due to limited space, we only show the testing accuracy of some datasets in Table 4 and their efficiency results in Figure 3. We observe that after removing the entire MHA module, the number of learnable parameters shrinks drastically, and so does the accuracy (Table 4, Column MHA). The representation capability of the pruned network is damaged because the number of parameters is a key factor in the representational capacity of the network. Furthermore, the pace of accuracy loss and parameter reduction slows down with the removal of subsequent modules, as FFT and IFFT have no learnable parameters. The remaining modules carry far fewer parameters than the MHA module and, based on Figure 2, their effects on the predictive performance are moderate. Hence, the curves in Figure 3 are relatively flat following MHA.
We further investigate the extent of change in accuracy under module-wise pruning on all datasets, as shown in Figure 4. The performance variation differs widely across datasets. For datasets such as AF, BM, and DDG, model pruning has a great impact on performance. This may be due to the very limited number of training samples. Conversely, for datasets such as HB, LIB, and SRS1, further pruning has little effect after the MHA module is removed (within 1%).
Dataset | Acc. | Unpruned | MHA | FFT | IFFT | FFN | GAP | BN | EMBED | ACT |
EC | Mean | 0.456 | 0.365 | 0.363 | 0.363 | 0.361 | 0.358 | 0.354 | 0.354 | 0.354 |
EC | Std. | 0.003 | 0.004 | 0.004 | 0.002 | 0.004 | 0.003 | 0.003 | 0.003 | 0.003 |
NATO | Mean | 0.844 | 0.750 | 0.750 | 0.744 | 0.739 | 0.733 | 0.733 | 0.728 | 0.728 |
NATO | Std. | 0.005 | 0.006 | 0.003 | 0.004 | 0.006 | 0.005 | 0.006 | 0.004 | 0.005 |
FM | Mean | 0.640 | 0.500 | 0.495 | 0.495 | 0.493 | 0.493 | 0.490 | 0.490 | 0.490 |
FM | Std. | 0.009 | 0.010 | 0.011 | 0.010 | 0.008 | 0.009 | 0.011 | 0.010 | 0.011 |
SRS1 | Mean | 0.915 | 0.819 | 0.817 | 0.816 | 0.814 | 0.814 | 0.812 | 0.812 | 0.812 |
SRS1 | Std. | 0.005 | 0.005 | 0.003 | 0.003 | 0.004 | 0.006 | 0.003 | 0.004 | 0.005 |
Table 4: Testing accuracy under module-by-module pruning on selected datasets. From left to right, each column gives the accuracy after the module in that column is additionally removed.
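Figure 3: Efficiency (average training time per epoch and number of learnable parameters) under module-by-module pruning on selected datasets.
Figure 4: Change in accuracy under module-wise pruning on all datasets.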
Overall, based on the above module-by-module pruning scheme, we observe that as the effectiveness (predictive performance) of the network increases, the corresponding efficiency (training speed and model size) generally decreases. This evident cost–benefit trade-off between efficiency and effectiveness raises a key question for researchers: how to find efficient model settings while maintaining the “equilibrium” between these two aspects. This problem will be discussed in Section 4.7.
4.6 Efficiency vs. Complexity
Here, we explore the relationship between network efficiency and complexity. In general, the more complex a model is, the less efficient it is. The network’s efficiency is defined in the same way as in previous sections, in terms of the training time and the number of parameters. Meanwhile, we define the complexity of the model in terms of the stacking of modules. In contrast to model pruning, we stack the modules in order of their influence on the predictive performance, from the least significant to the most significant contributor to accuracy (ACT, EMBED, BN, GAP, FFN, IFFT, FFT, MHA). Our empirical results in Figure 5 shed light on the trade-off between model efficiency and complexity. As can be seen in Figure 5, as more modules are stacked onto the network, the corresponding computational efficiency decreases. All datasets show similar trends.
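Figure 5: Model efficiency (training time and number of parameters) as modules are stacked onto the network.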
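The pruning order of Section 4.5 and the stacking order of this section can be sketched as two simple schedules over the same module list. The module abbreviations and their order are taken from the text; the function names and the idea of representing a configuration as a tuple of module names are illustrative assumptions.

```python
# Most significant to least significant contributor, as listed in Section 4.5.
PRUNING_ORDER = ["MHA", "FFT", "IFFT", "FFN", "GAP", "BN", "EMBED", "ACT"]


def pruning_schedule(order):
    """Yield the remaining modules, starting unpruned, after each cumulative removal."""
    remaining = list(order)
    yield tuple(remaining)
    for module in order:
        remaining.remove(module)
        yield tuple(remaining)


def stacking_schedule(order):
    """Yield the stacked modules, adding the least significant contributor first."""
    stacked = []
    for module in reversed(order):
        stacked.append(module)
        yield tuple(stacked)


pruned_configs = list(pruning_schedule(PRUNING_ORDER))
stacked_configs = list(stacking_schedule(PRUNING_ORDER))
```

Each yielded configuration would then be trained and timed; the stacking schedule visits the modules in exactly the reverse of the pruning order.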
4.7 Pareto Analysis for Trade-Off Exploration Between Efficiency and Performance
We define the model efficiency as the reciprocal of the product of training time per epoch and the number of parameters. Thus, the higher the reciprocal, the higher the efficiency. To explore the relationship between model efficiency and performance, we employ Pareto analysis [43]. Pareto efficiency represents a state in which improving the performance as measured by one criterion would worsen the performance as measured by another criterion. We choose the FingerMovements and Heartbeat datasets to obtain the Pareto frontiers, where the points on the front correspond to Pareto-efficient solutions. We have two objectives: maximizing the efficiency and maximizing the accuracy. Figure 6 shows the resulting Pareto fronts for both datasets in blue, where the red points are Pareto-efficient solutions. The scattered cyan points are randomly sampled experimental data from all different configurations. The Pareto analysis provides a principled approach for choosing efficient network settings while exploring the trade-off between efficiency and performance. Specifically, we can identify the amount of computational resources required for a model to achieve a certain performance. Conversely, we can identify how well a model can perform given a certain amount of resources.
Figure 6: Pareto fronts for the FingerMovements and Heartbeat datasets.
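A minimal sketch of the Pareto-front extraction for these two objectives is shown below: a configuration is kept if no other configuration is at least as good in both efficiency and accuracy and strictly better in one. The efficiency definition follows the text; the configuration names and their timing, parameter, and accuracy values are hypothetical and do not come from the experiments.

```python
def efficiency(seconds_per_epoch, n_parameters):
    """Reciprocal of (training time per epoch x number of parameters)."""
    return 1.0 / (seconds_per_epoch * n_parameters)


def pareto_front(points):
    """Return names of non-dominated points; both objectives are maximized.

    points is a list of (name, efficiency, accuracy) triples.
    """
    front = []
    for name, eff, acc in points:
        dominated = any(
            other_eff >= eff and other_acc >= acc
            and (other_eff > eff or other_acc > acc)
            for _, other_eff, other_acc in points
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical configurations: (seconds per epoch, parameters, accuracy).
configs = {
    "A": (2.0, 100_000, 0.64),
    "B": (1.0, 40_000, 0.50),
    "C": (1.5, 90_000, 0.49),
    "D": (1.8, 95_000, 0.62),
}
points = [(name, efficiency(t, p), acc) for name, (t, p, acc) in configs.items()]
front = pareto_front(points)
```

In this toy example, configuration C is dominated by B (less efficient and less accurate), so only A, B, and D remain on the front; choosing among them depends on how much accuracy one is willing to trade for efficiency.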
5 Discussion
In this work, we propose a methodology to investigate the relationship between model efficiency and effectiveness, as well as model complexity. The method is applied to a mixing network based on the Transformer and the Fourier transform for MTS classification. Extensive experiments are conducted on 18 MTS datasets, including ablation studies on different modules of the network and module-by-module pruning, evaluated in terms of the predictive performance, training speed, and the number of learnable parameters. The network achieves competitive performance compared to current best-performing methods. Ablation studies indicate that self-attention and the Fourier transform are the largest contributors to the model performance across all datasets. Furthermore, through sequential pruning of each module, we observed the efficiency–effectiveness and the efficiency–complexity trade-offs of the network. Through Pareto analysis, we show how to choose efficient settings of the network, while investigating the performance–efficiency trade-off through visualization of the Pareto fronts. We note that for far more complex models applied to large-scale data, finite computational resources make it impractical to run experiments on all possible configurations of the model. In these cases, given a reasonable number of experiments, techniques such as regression can be used to generate a large number of random model settings and their corresponding model performance. Pareto analysis can then be performed to evaluate the efficiency–performance trade-off and to guide researchers in adjusting the model settings to improve efficiency and effectiveness accordingly.
References
- [1] Li-wei, H. L., Adams, R. P., Mayaud, L., et al.: A physiological time series dynamics-based approach to patient monitoring and outcome prediction. IEEE journal of biomedical and health informatics, vol. 19, pp. 1068-1076 (2014).
- [2] Liu, H., Long, Z.: An improved deep learning model for predicting stock market price time series. Digital Signal Processing, vol. 102, pp. 102741 (2020).
- [3] Fu, T. C.: A review on time series data mining. Engineering Applications of Artificial Intelligence, vol. 24, pp. 164-181 (2011).
- [4] Gao, B., Li, X., Woo, W. L., et al.: Physics-based image segmentation using first order statistical properties and genetic algorithm for inductive thermography imaging. IEEE Transactions on Image Processing, vol. 27, pp. 2160-2175 (2017).
- [5] Hu, B., Gao, B., Woo, W. L., et al.: A Lightweight Spatial and Temporal Multi-Feature Fusion Network for Defect Detection. IEEE Transactions on Image Processing, vol. 30, pp. 472-486 (2020).
- [6] Fawaz, H. I., Forestier, G., Weber, J., et al.: Deep learning for time series classification: a review. Data mining and knowledge discovery, vol. 33, pp. 917-963 (2019).
- [7] Ruiz, A. P., Flynn, M., Large, J., et al.: The great multivariate time series classification bake off: a review and experimental evaluation of recent algorithmic advances. Data Mining and Knowledge Discovery, vol. 35, pp. 401-449 (2021).
- [8] Hüsken, M., Stagge, P.: Recurrent neural networks for time series classification. Neurocomputing, vol. 50, pp. 223-235 (2003).
- [9] Zhao, B., Lu, H., Chen, S., et al.: Convolutional neural networks for time series classification. Journal of Systems Engineering and Electronics, vol. 28, pp. 162-169 (2017).
- [10] Lines, J., Taylor, S., Bagnall, A.: Time series classification with HIVE-COTE: The hierarchical vote collective of transformation-based ensembles. ACM Transactions on Knowledge Discovery from Data, vol. 12 (2018).
- [11] Dempster, A., Petitjean, F., Webb, G. I.: ROCKET: exceptionally fast and accurate time series classification using random convolutional kernels. Data Mining and Knowledge Discovery, vol. 34, pp. 1454-1495 (2020).
- [12] Zerveas, G., Jayaraman, S., Patel, D., et al.: A transformer-based framework for multivariate time series representation learning. In: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pp. 2114-2124 (2021).
- [13] Lines, J., Bagnall, A.: Time series classification with ensembles of elastic distance measures. Data Mining and Knowledge Discovery, vol. 25, pp. 565-592 (2015).
- [14] Schäfer, P.: Scalable time series classification. Data Mining and Knowledge Discovery, vol. 30, pp. 1273-1298 (2016).
- [15] Fulcher, B. D., Jones, N. S.: Highly comparative feature-based time-series classification. IEEE Transactions on Knowledge and Data Engineering, vol. 26, pp. 3026-3037 (2014).
- [16] Abanda, A., Mori, U., Lozano, J. A.: A review on distance based time series classification. Data Mining and Knowledge Discovery, vol. 33, pp. 378-412 (2019).
- [17] Wang, Z., Yan, W., Oates, T.: Time series classification from scratch with deep neural networks: A strong baseline. In: 2017 International joint conference on neural networks, pp. 1578-1585 (2017).
- [18] Lin, S., Runger, G. C.: GCRNN: Group-constrained convolutional recurrent neural network. IEEE transactions on neural networks and learning systems, vol. 29, pp. 4709-4718 (2017).
- [19] Fawaz, H. I., Lucas, B., Forestier, G., et al.: Inceptiontime: Finding alexnet for time series classification. Data Mining and Knowledge Discovery, vol. 34, pp. 1936-1962 (2020).
- [20] Zhang, X., Gao, Y., Lin, J., et al.: Tapnet: Multivariate time series classification with attentional prototypical network. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 6845-6852 (2020).
- [21] Bloomfield, P.: Fourier analysis of time series: an introduction (2004).
- [22] Ren, H., Xu, B., Wang, Y., et al.: Time-series anomaly detection service at microsoft. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 3009-3017 (2019).
- [23] Puech, T., Boussard, M., D’Amato, A., et al.: A fully automated periodicity detection in time series. In: International Workshop on Advanced Analysis and Learning on Temporal Data, pp. 43-54 (2019).
- [24] Janacek, G. J., Bagnall, A. J., Powell, M.: A likelihood ratio distance measure for the similarity between the fourier transform of time series. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 737-743 (2005).
- [25] Geerken, R., Zaitchik, B., Evans, J. P.: Classifying rangeland vegetation type and coverage from NDVI time series using Fourier Filtered Cycle Similarity. International Journal of Remote Sensing, vol. 26, pp. 5535-5554 (2005).
- [26] Samiee, K., Kovacs, P., Gabbouj, M.: Epileptic seizure classification of EEG time-series using rational discrete short-time Fourier transform. IEEE transactions on Biomedical Engineering, vol. 62, pp. 541-552 (2014).
- [27] Vaswani, A., Shazeer, N., Parmar, N., et al.: Attention is all you need. In: Advances in neural information processing systems, pp. 5998-6008 (2017).
- [28] Oh, J., Wang, J., Wiens, J.: Learning to exploit invariances in clinical time-series data using sequence transformer networks. In: Machine Learning for Healthcare Conference, pp. 332-347 (2018).
- [29] Zhao, Y., Hong, Q., Zhang, X., et al.: BERTSurv: BERT-Based Survival Models for Predicting Outcomes of Trauma Patients. arXiv preprint arXiv:2103.10928 (2021).
- [30] Liu, M., Ren, S., Ma, S., et al.: Gated Transformer Networks for Multivariate Time Series Classification. arXiv preprint arXiv:2103.14438 (2021).
- [31] Huang, G., Liu, Z., Van Der Maaten, L., et al.: Densely connected convolutional networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700-4708 (2017).
- [32] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105-6114 (2019).
- [33] Brock, A., De, S., Smith, S. L., et al.: High-performance large-scale image recognition without normalization. arXiv preprint arXiv:2102.06171 (2021).
- [34] Srinivas, A., Lin, T. Y., Parmar, N., et al.: Bottleneck transformers for visual recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16519-16529 (2021).
- [35] Tan, M., Le, Q. V.: Efficientnetv2: Smaller models and faster training. arXiv preprint arXiv:2104.00298 (2021).
- [36] LeCun, Y., Denker, J. S., Solla, S. A.: Optimal brain damage. In: Advances in neural information processing systems, pp. 598-605 (1990).
- [37] Han, S., Pool, J., Tran, J., et al.: Learning both weights and connections for efficient neural networks. arXiv preprint arXiv:1506.02626 (2015).
- [38] Li, H., Kadav, A., Durdanovic, I., et al.: Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710 (2016).
- [39] Kim, Y.: Convolutional Neural Networks for Sentence Classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1746-1751 (2014).
- [40] Song, H., Rajan, D., Thiagarajan, J. J., et al.: Attend and diagnose: Clinical time series analysis using attention models. In: Thirty-second AAAI conference on artificial intelligence (2018).
- [41] Paszke, A., Gross, S., Massa, F., et al.: Pytorch: An imperative style, high-performance deep learning library. In: Advances in neural information processing systems, pp. 8026-8037 (2019).
- [42] Kingma, D. P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
- [43] Censor, Y.: Pareto optimality in multiobjective problems. Applied Mathematics and Optimization, vol. 4, pp. 41-59 (1977).