Multilevel_Wavelet_Decomposition_Network_Pytorch
This project is the pytorch implementation version of Multilevel Wavelet Decomposition Network.
Recent years have witnessed an unprecedented rise of time series data in almost all kinds of academic and industrial fields. Various types of deep neural network models have been introduced to time series analysis, but the important frequency information still lacks effective modeling. In light of this, in this paper we propose a wavelet-based neural network structure called multilevel Wavelet Decomposition Network (mWDN) for building frequency-aware deep learning models for time series analysis. mWDN preserves the advantage of multilevel discrete wavelet decomposition in frequency learning while enabling the fine-tuning of all parameters under a deep neural network framework. Based on mWDN, we further propose two deep learning models called Residual Classification Flow (RCF) and multi-frequency Long Short-Term Memory (mLSTM) for time series classification and forecasting, respectively. The two models take all or part of the mWDN decomposed sub-series in different frequencies as input, and resort to the back propagation algorithm to learn all the parameters globally, which enables seamless embedding of wavelet-based frequency analysis into deep learning frameworks. Extensive experiments on 40 UCR datasets and a real-world user volume dataset demonstrate the excellent performance of our time series models based on mWDN. In particular, we propose an importance analysis method for mWDN based models, which successfully identifies those time-series elements and mWDN layers that are crucially important to time series analysis. This indicates the interpretability advantage of mWDN, and can be viewed as an in-depth exploration of interpretable deep learning.
A time series is a series of data points indexed in time order. Methods for time series analysis can be classified into two types: time-domain methods and frequency-domain methods (see https://en.wikipedia.org/wiki/Time_series).
Time-domain methods consider a time series as a sequence of ordered points and analyze correlations among them. Frequency-domain methods use transform algorithms, such as the discrete Fourier transform and the Z-transform, to transform a time series into a frequency spectrum, which can be used as features to analyze the original series.
In recent years, with the boom of deep learning, various types of deep neural network models have been introduced to time series analysis and have achieved state-of-the-art performance in many real-life applications (Wang et al., 2017c; Rajpurkar et al., 2017). Some well-known models include Recurrent Neural Networks (RNN) (Williams and Zipser, 1989) and Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997), which use memory nodes to model correlations of series points, and Convolutional Neural Networks (CNN), which use trainable convolution kernels to model local shape patterns (Zheng et al., 2016). Most of these models fall into the category of time-domain methods and do not leverage the frequency information of a time series, although some have begun to consider it in indirect ways (Cui et al., 2016; Koutnik et al., 2014).
Wavelet decompositions (Daubechies, 1992) are well-known methods for capturing features of time series in both the time and frequency domains. Intuitively, we can employ them as feature engineering tools for data preprocessing before deep modeling. While this loose coupling might improve the performance of raw neural network models (Liu et al., 2013), the two parts are not globally optimized because their parameters are inferred independently. How to integrate wavelet transforms into the framework of deep learning models remains a great challenge.
In this paper, we propose a wavelet-based neural network structure, named multilevel Wavelet Decomposition Network (mWDN), to build frequency-aware deep learning models for time series analysis. Similar to the standard Multilevel Discrete Wavelet Decomposition (MDWD) model (Mallat, 1989), mWDN can decompose a time series into a group of sub-series with frequencies ranked from high to low, which is crucial for capturing frequency factors for deep learning. Unlike MDWD with its fixed parameters, however, all parameters in mWDN can be fine-tuned to fit the training data of different learning tasks. In other words, mWDN can take advantage of both wavelet based time series decomposition and the learning ability of deep neural networks.
Based on mWDN, two deep learning models, i.e., Residual Classification Flow (RCF) and multi-frequency Long Short-Term Memory (mLSTM), are designed for time series classification (TSC) and forecasting (TSF), respectively. The key issue in TSC is to extract as many representative features as possible from time series. The RCF model therefore adopts the mWDN decomposed results in different levels as inputs, and employs a pipelined classifier stack to exploit features hidden in sub-series through residual learning methods. For the TSF problem, the key issue becomes inferring future states of a time series according to the hidden trends in different frequencies. Therefore, the mLSTM model feeds all mWDN decomposed sub-series in high frequencies into independent LSTM models, and ensembles all LSTM outputs for the final forecast. Note that all parameters of RCF and mLSTM, including the ones in mWDN, are trained using the back propagation algorithm in an end-to-end manner. In this way, the wavelet-based frequency analysis is seamlessly embedded into deep learning frameworks.
We evaluate RCF on 40 UCR time series datasets for TSC, and mLSTM on a real-world user-volume time series dataset for TSF. The results demonstrate their superiority over state-of-the-art baselines and the advantages of mWDN with trainable parameters. As an attempt at interpretable deep learning, we further propose an importance analysis method for mWDN based models, which successfully identifies those time-series elements and mWDN layers that are crucially important to the success of time series analysis. This indicates the interpretability advantage that mWDN gains by integrating wavelet decomposition for frequency factors.
Throughout the paper, we use lowercase symbols such as $a$, $b$ to denote scalars, bold lowercase symbols such as $\mathbf{a}$, $\mathbf{b}$ to denote vectors, bold uppercase symbols such as $\mathbf{A}$, $\mathbf{B}$ to denote matrices, and uppercase symbols such as $A$, $B$ to denote constants.
Multilevel Discrete Wavelet Decomposition (MDWD) (Mallat, 1989) is a wavelet based discrete signal analysis method, which can extract multilevel time-frequency features from a time series by decomposing the series into low and high frequency sub-series level by level.
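We denote the input time series as $\mathbf{x} = \{x_1, \ldots, x_t, \ldots, x_T\}$, and the low and high frequency sub-series generated in the $i$-th level as $\mathbf{x}^l(i)$ and $\mathbf{x}^h(i)$. In the $(i+1)$-th level, MDWD uses a low pass filter $\mathbf{l} = \{l_1, \ldots, l_k, \ldots, l_K\}$ and a high pass filter $\mathbf{h} = \{h_1, \ldots, h_k, \ldots, h_K\}$, $K \ll T$, to convolve the low frequency sub-series of the upper level as

$$a^l_n(i+1) = \sum_{k=1}^{K} x^l_{n+k-1}(i) \cdot l_k, \qquad a^h_n(i+1) = \sum_{k=1}^{K} x^l_{n+k-1}(i) \cdot h_k, \tag{1}$$

where $x^l_n(i)$ is the $n$-th element of the low frequency sub-series in the $i$-th level, and $\mathbf{x}^l(0)$ is set as the input series. The low and high frequency sub-series $\mathbf{x}^l(i)$ and $\mathbf{x}^h(i)$ in level $i$ are generated from the 1/2 down-sampling of the intermediate variable sequences $\mathbf{a}^l(i) = \{a^l_1(i), a^l_2(i), \ldots\}$ and $\mathbf{a}^h(i) = \{a^h_1(i), a^h_2(i), \ldots\}$.
The sub-series set $\mathcal{X}(i) = \{\mathbf{x}^h(1), \mathbf{x}^h(2), \ldots, \mathbf{x}^h(i), \mathbf{x}^l(i)\}$ is called the $i$-th level decomposed results of $\mathbf{x}$. Specifically, $\mathcal{X}(i)$ satisfies: 1) we can fully reconstruct $\mathbf{x}$ from $\mathcal{X}(i)$; 2) the frequency from $\mathbf{x}^h(1)$ to $\mathbf{x}^l(i)$ runs from high to low; 3) for different levels, $\mathcal{X}(i)$ has different time and frequency resolutions. As $i$ increases, the frequency resolution increases and the time resolution, especially for the low frequency sub-series, decreases.
Because the sub-series with different frequencies in $\mathcal{X}(i)$ keep the same order information as the original series $\mathbf{x}$, MDWD is regarded as a time-frequency decomposition.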
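The filtering and down-sampling of Eq. (1) can be illustrated with a short Python sketch. It is only a toy under stated assumptions: the series is zero-padded at its right edge so that every output index is defined, and the two-tap Haar filters replace the longer filters used later, purely to keep the numbers easy to follow. The function names `mdwd_level` and `mdwd` are hypothetical.

```python
def mdwd_level(x_low, low_filter, high_filter):
    """One decomposition level: filter as in Eq. (1), then keep every second sample."""
    K = len(low_filter)
    # Assumption: zero padding on the right so that n + k stays in range.
    padded = list(x_low) + [0.0] * (K - 1)
    a_low = [sum(padded[n + k] * low_filter[k] for k in range(K))
             for n in range(len(x_low))]
    a_high = [sum(padded[n + k] * high_filter[k] for k in range(K))
              for n in range(len(x_low))]
    # 1/2 down-sampling of the intermediate sequences.
    return a_low[::2], a_high[::2]


def mdwd(x, low_filter, high_filter, levels):
    """Return [x_h(1), ..., x_h(levels), x_l(levels)], ordered from high to low frequency."""
    result, current = [], list(x)
    for _ in range(levels):
        current, high = mdwd_level(current, low_filter, high_filter)
        result.append(high)
    result.append(current)
    return result


# Haar filters, chosen only for readability of the example.
S = 2 ** -0.5
HAAR_LOW, HAAR_HIGH = [S, S], [S, -S]

series = [4.0, 4.0, 2.0, 2.0, 6.0, 6.0, 0.0, 0.0]
subseries = mdwd(series, HAAR_LOW, HAAR_HIGH, levels=2)
for name, values in zip(["x_h(1)", "x_h(2)", "x_l(2)"], subseries):
    print(name, [round(v, 3) for v in values])
```

Because the toy series is piecewise constant over pairs of points, the first high frequency sub-series is zero, and each further level halves the length of the sub-series it produces.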
In this section, we propose a multilevel Wavelet Decomposition Network (mWDN), which approximately implements MDWD under a deep neural network framework.
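The structure of mWDN is illustrated in Fig. 1. As shown in the figure, the mWDN model hierarchically decomposes a time series using the following two functions

$$\mathbf{a}^l(i) = \sigma\left(\mathbf{W}^l(i)\,\mathbf{x}^l(i-1) + \mathbf{b}^l(i)\right), \qquad \mathbf{a}^h(i) = \sigma\left(\mathbf{W}^h(i)\,\mathbf{x}^l(i-1) + \mathbf{b}^h(i)\right), \tag{2}$$

where $\sigma(\cdot)$ is a sigmoid activation function, and $\mathbf{b}^l(i)$ and $\mathbf{b}^h(i)$ are trainable bias vectors initialized as close-to-zero random values. The functions in Eq. (2) have forms similar to the functions in Eq. (1) for MDWD. $\mathbf{x}^l(i)$ and $\mathbf{x}^h(i)$ again denote the low and high frequency sub-series of $\mathbf{x}$ generated in the $i$-th level, which are down-sampled from the intermediate variables $\mathbf{a}^l(i)$ and $\mathbf{a}^h(i)$ using an average pooling layer as $x^l_j(i) = \left(a^l_{2j}(i) + a^l_{2j-1}(i)\right)/2$.
In order to implement the convolution defined in Eq. (1), we set the initial values of the weight matrices $\mathbf{W}^l(i)$ and $\mathbf{W}^h(i)$ as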
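$$\mathbf{W}^l(i) = \begin{bmatrix} l_1 & l_2 & l_3 & \cdots & l_K & \epsilon & \cdots & \epsilon \\ \epsilon & l_1 & l_2 & \cdots & l_{K-1} & l_K & \cdots & \epsilon \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\ \epsilon & \epsilon & \epsilon & \cdots & \epsilon & \epsilon & l_1 & l_2 \\ \epsilon & \epsilon & \epsilon & \cdots & \epsilon & \epsilon & \epsilon & l_1 \end{bmatrix}, \tag{3}$$

$$\mathbf{W}^h(i) = \begin{bmatrix} h_1 & h_2 & h_3 & \cdots & h_K & \epsilon & \cdots & \epsilon \\ \epsilon & h_1 & h_2 & \cdots & h_{K-1} & h_K & \cdots & \epsilon \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\ \epsilon & \epsilon & \epsilon & \cdots & \epsilon & \epsilon & h_1 & h_2 \\ \epsilon & \epsilon & \epsilon & \cdots & \epsilon & \epsilon & \epsilon & h_1 \end{bmatrix}. \tag{4}$$

Obviously, $\mathbf{W}^l(i), \mathbf{W}^h(i) \in \mathbb{R}^{P \times P}$, where $P$ is the size of $\mathbf{x}^l(i-1)$. The $\epsilon$ in the weight matrices are random values that satisfy $|\epsilon| \ll |l|, \forall l \in \mathbf{l}$ and $|\epsilon| \ll |h|, \forall h \in \mathbf{h}$. We use the Daubechies 4 Wavelet (Rowe and Abbott, 1995) in our practice and set the filter coefficients $\mathbf{l}$ and $\mathbf{h}$ to those of this wavelet.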
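As a hedged illustration of Eq. (2) to Eq. (4), the following Python sketch builds the initial weight matrices and runs one mWDN layer on a short series. Several details are assumptions rather than facts from the text: the low pass coefficients are the 8-tap values that PyWavelets reports for `db4`, the high pass filter is derived from them by reversal with alternating signs, the biases are set to exactly zero instead of small random values, and the names `init_weight` and `mwdn_layer` are hypothetical.

```python
import math
import random

# Assumption: 8-tap low pass decomposition filter that PyWavelets lists for 'db4'.
DB4_LOW = [-0.010597401784997278, 0.032883011666982945, 0.030841381835986965,
           -0.18703481171888114, -0.02798376941698385, 0.6308807679295904,
           0.7148465705525415, 0.23037781330885523]
# Assumption: high pass filter obtained by reversing the low pass filter
# and alternating the signs.
DB4_HIGH = [(-1) ** (k + 1) * c for k, c in enumerate(reversed(DB4_LOW))]


def init_weight(filt, size, eps_scale=1e-4, rng=random):
    """Initial P x P matrix of Eq. (3)/(4): row n holds the filter starting at
    column n, every other entry is a small random epsilon."""
    W = [[rng.uniform(-eps_scale, eps_scale) for _ in range(size)]
         for _ in range(size)]
    for n in range(size):
        for k, coeff in enumerate(filt):
            if n + k < size:
                W[n][n + k] = coeff
    return W


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mwdn_layer(x_low, W_low, W_high, b_low, b_high):
    """Eq. (2) followed by average pooling over non-overlapping pairs."""
    def branch(W, b):
        a = [sigmoid(sum(w * x for w, x in zip(row, x_low)) + bias)
             for row, bias in zip(W, b)]
        return [(a[2 * j] + a[2 * j + 1]) / 2.0 for j in range(len(a) // 2)]
    return branch(W_low, b_low), branch(W_high, b_high)


rng = random.Random(0)
P = 16
W_l = init_weight(DB4_LOW, P, rng=rng)
W_h = init_weight(DB4_HIGH, P, rng=rng)
x = [math.sin(0.4 * t) for t in range(P)]
x_l1, x_h1 = mwdn_layer(x, W_l, W_h, [0.0] * P, [0.0] * P)
print(len(x_l1), len(x_h1))
```

In a trainable version the entries of `W_l`, `W_h` and the biases would be parameters of the network, so that training can move them away from the wavelet initialization when the data call for it.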
The task of TSC is to predict the unknown category label of a time series. A key issue of TSC is extracting distinguishing features from time series data. The decomposed results of mWDN are natural time-frequency features that can be used in TSC. In this subsection, we propose a Residual Classification Flow (RCF) network to exploit the potential of mWDN in TSC.
The framework of RCF is illustrated in Fig. 2. As shown in the figure, RCF contains many independent classifiers. The RCF model connects the sub-series generated by the $i$-th mWDN level, i.e., $\mathbf{x}^h(i)$ and $\mathbf{x}^l(i)$, with a forward neural network as

$$\mathbf{u}(i) = \psi\left(\mathbf{x}^h(i), \mathbf{x}^l(i), \theta^{\psi}\right), \tag{5}$$

where $\psi(\cdot)$ could be a multilayer perceptron, a convolutional network, or any other type of neural network, and $\theta^{\psi}$ represents the trainable parameters. Moreover, RCF adopts a residual learning method (He et al., 2016) to join $\mathbf{u}(\cdot)$ of all classifiers as

$$\hat{\mathbf{c}}(i) = S\left(\hat{\mathbf{c}}(i-1) + \mathbf{u}(i)\right), \tag{6}$$

where $S(\cdot)$ is a softmax classifier, and $\hat{\mathbf{c}}(i)$ is a predicted value of the one-hot encoding of the category label of the input series.
In the RCF model, the decomposed results of all mWDN levels, i.e. $\mathcal{X}(1), \ldots, \mathcal{X}(N)$, are involved. Because the decomposed results in different mWDN levels have different time and frequency resolutions (Mallat, 1989), the RCF model can fully exploit patterns of the input time series at different time/frequency resolutions. In other words, RCF employs a multi-view learning methodology to achieve high-performance time series classification.
Moreover, deep residual networks (He et al., 2016) were proposed to address the training difficulty that comes with deeper network structures. The RCF model inherits this merit. In Eq. (6), the $i$-th classifier makes its decision based on $\mathbf{u}(i)$ and the decision made by the $(i-1)$-th classifier, and can therefore learn the incremental knowledge that the $(i-1)$-th classifier does not have. Users can thus append residual classifiers one after another until classification performance stops improving.
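The residual chaining of Eq. (6) can be sketched in a few lines of Python. The sketch assumes that the per-level outputs of Eq. (5) are already available as plain score vectors, and that the decision preceding the first classifier is an all-zero vector; neither detail is stated in the text, and the function name `residual_classification_flow` is hypothetical.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def residual_classification_flow(level_scores):
    """level_scores[i] stands in for u(i+1); returns the decisions of all classifiers."""
    # Assumption: the decision before the first classifier is all zeros.
    c_hat = [0.0] * len(level_scores[0])
    outputs = []
    for u in level_scores:
        # Eq. (6): previous decision plus the new level's output, then softmax.
        c_hat = softmax([c + v for c, v in zip(c_hat, u)])
        outputs.append(c_hat)
    return outputs


# Hypothetical outputs of three level-wise networks on a 3-class problem.
scores = [[0.2, 0.1, 0.0], [0.0, 1.5, 0.3], [0.1, 2.0, -0.5]]
flow = residual_classification_flow(scores)
for i, c in enumerate(flow, start=1):
    print(i, [round(p, 3) for p in c])
```

Each classifier only has to supply a correction on top of the previous decision, which is what allows classifiers to be appended one at a time.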
In this subsection, we propose a multi-frequency Long Short-Term Memory (mLSTM) model based on mWDN for TSF. The design of mLSTM is based on the insight that the temporal correlations of points hidden in a time series are closely related to frequency. For example, large time scale correlations, such as long-term tendencies, usually lie in low frequencies, and small time scale correlations, such as short-term disturbances and events, usually lie in high frequencies. Therefore, we can divide a complicated TSF problem into many sub-problems of forecasting the sub-series decomposed by mWDN, which are relatively easier because the frequency components in the sub-series are simpler.
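Given a time series of infinite length, we open a sliding window of size $T$ from the past to the time $t$ as

$$\mathbf{x} = \{x_{t-T+1}, \ldots, x_{t-1}, x_t\}. \tag{7}$$

Using mWDN to decompose $\mathbf{x}$, we get the low and high frequency component series in the $i$-th level as

$$\mathbf{x}^l(i) = \left\{x^l_{t-\frac{T}{2^i}+1}(i), \ldots, x^l_{t-1}(i), x^l_t(i)\right\}, \qquad \mathbf{x}^h(i) = \left\{x^h_{t-\frac{T}{2^i}+1}(i), \ldots, x^h_{t-1}(i), x^h_t(i)\right\}. \tag{8}$$

As shown in Fig. 3, the mLSTM model uses the decomposed results of the last level, i.e., the sub-series in $\mathcal{X}(N) = \{\mathbf{x}^h(1), \mathbf{x}^h(2), \ldots, \mathbf{x}^h(N), \mathbf{x}^l(N)\}$, as the inputs of $N+1$ independent LSTM sub-networks. Every LSTM sub-network forecasts the future state of one sub-series in $\mathcal{X}(N)$. Finally, a fully connected neural network is employed to fuse the LSTM sub-networks as an ensemble for forecasting.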
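To make the bookkeeping of the sliding window concrete, the short Python sketch below extracts a window and lists the lengths of the sub-series that would be handed to the LSTM sub-networks. It assumes that the window size is divisible by two at every level, so that each level exactly halves the length; the helper names are hypothetical.

```python
def sliding_window(series, t, T):
    """The T most recent points up to and including index t (Eq. (7))."""
    return series[t - T + 1: t + 1]


def subseries_lengths(T, levels):
    """Lengths of x_h(1), ..., x_h(levels) and x_l(levels), assuming each
    level halves the length of its input."""
    lengths = [T // 2 ** i for i in range(1, levels + 1)]
    return lengths + [lengths[-1]]


history = list(range(100))          # hypothetical user-volume readings
window = sliding_window(history, t=49, T=32)
print(len(window), window[0], window[-1])
print(subseries_lengths(32, 3))     # one entry per LSTM sub-network
```

With three levels there are four sub-series, so four sub-networks would be trained, and the coarsest ones see only a few points per window.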
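In TSC applications, we adopt a deep supervision method to train the RCF model (Wang et al., 2015). Given a set of time series $\{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n\}$, we use cross-entropy as the loss metric and define the objective function of the $i$-th classifier as

$$\mathcal{J}^c(i) = -\frac{1}{n}\sum_{j=1}^{n} \mathbf{c}_j^{\top} \ln \hat{\mathbf{c}}_j(i), \tag{9}$$

where $\mathbf{c}_j$ is the one-hot encoding of $\mathbf{x}_j$'s real category, and $\hat{\mathbf{c}}_j(i)$ is the softmax output of the $i$-th classifier with the input $\mathbf{x}_j$. For an RCF with $M$ classifiers, the final objective function is a weighted sum of all $\mathcal{J}^c(i)$ (Wang et al., 2015):

$$\mathcal{J}^c = \sum_{i=1}^{M} \frac{i}{M}\,\mathcal{J}^c(i). \tag{10}$$

The result of the last classifier, $\hat{\mathbf{c}}(M)$, is used as the final classification result of RCF.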
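The deep supervision objective can be sketched as follows. The sketch assumes that the weight of the $i$-th of $M$ classifiers is $i/M$, so that later classifiers count more; the text only says that the sum is weighted, so this weighting is an assumption, as are the function names.

```python
import math


def cross_entropy(one_hot_labels, predictions):
    """Mean cross-entropy between one-hot labels and softmax outputs."""
    total = 0.0
    for c, c_hat in zip(one_hot_labels, predictions):
        total -= sum(ci * math.log(pi) for ci, pi in zip(c, c_hat) if ci)
    return total / len(one_hot_labels)


def deep_supervision_loss(one_hot_labels, per_classifier_predictions):
    """Weighted sum of the per-classifier losses (assumed weights i / M)."""
    M = len(per_classifier_predictions)
    return sum((i / M) * cross_entropy(one_hot_labels, preds)
               for i, preds in enumerate(per_classifier_predictions, start=1))


labels = [[1, 0], [0, 1]]
first = [[0.5, 0.5], [0.5, 0.5]]      # an undecided first classifier
second = [[0.9, 0.1], [0.2, 0.8]]     # a sharper second classifier
loss = deep_supervision_loss(labels, [first, second])
print(round(loss, 4))
```

Supervising every classifier, rather than only the last one, gives the early mWDN levels a direct training signal.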
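In TSF applications, we adopt a pre-training and fine-tuning method to train the mLSTM model. In the pre-training step, we use MDWD to decompose the real value of the future state to be predicted into wavelet components, i.e. $\mathbf{y}^p = \{\mathbf{y}^h(1), \mathbf{y}^h(2), \ldots, \mathbf{y}^h(N), \mathbf{y}^l(N)\}$, and combine the outputs of all LSTM sub-networks as $\hat{\mathbf{y}}^p$. The objective function of the pre-training step is then defined as

$$\mathcal{J}^p = \frac{1}{T}\sum_{t=1}^{T} \left\|\mathbf{y}^p - \hat{\mathbf{y}}^p\right\|_F^2, \tag{11}$$

where $\|\cdot\|_F$ is the Frobenius norm. In the fine-tuning step, we use the following objective function to train mLSTM based on the parameters learned in the pre-training step:

$$\mathcal{J}^f = \frac{1}{T}\sum_{t=1}^{T} \left(\hat{y} - y\right)^2, \tag{12}$$

where $\hat{y}$ is the future state predicted by mLSTM and $y$ is the real value.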
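We use the error back propagation (BP) algorithm to optimize the objective functions. Denoting $\theta$ as the parameters of the RCF or mLSTM model and $\mathcal{J}$ as the corresponding objective, the BP algorithm iteratively updates $\theta$ as

$$\theta \leftarrow \theta - \eta\,\frac{\partial \mathcal{J}(\theta)}{\partial \theta}, \tag{13}$$

where $\eta$ is an adjustable learning rate. The weight matrices $\mathbf{W}^l(i)$ and $\mathbf{W}^h(i)$ of mWDN are also trainable in Eq. (13). A problem of training parameters with preset initial values like $\mathbf{W}^l(i)$ and $\mathbf{W}^h(i)$ is that the model may "forget" the initial values (French, 1999) in the training process. To deal with this, we introduce two regularization terms into the objective function and therefore have

$$\mathcal{J}^{*} = \mathcal{J}(\theta) + \alpha\sum_{i}\left\|\mathbf{W}^l(i) - \tilde{\mathbf{W}}^l(i)\right\|_F^2 + \beta\sum_{i}\left\|\mathbf{W}^h(i) - \tilde{\mathbf{W}}^h(i)\right\|_F^2, \tag{14}$$

where $\tilde{\mathbf{W}}^l(i)$ and $\tilde{\mathbf{W}}^h(i)$ are the same matrices as $\mathbf{W}^l(i)$ and $\mathbf{W}^h(i)$ except that $\epsilon = 0$, and $\alpha$, $\beta$ are hyper-parameters which are set to empirical values. Accordingly, the BP algorithm iteratively updates the weight matrices of mWDN as

$$\mathbf{W}^l(i) \leftarrow \mathbf{W}^l(i) - \eta\left(\frac{\partial \mathcal{J}}{\partial \mathbf{W}^l(i)} + 2\alpha\left(\mathbf{W}^l(i) - \tilde{\mathbf{W}}^l(i)\right)\right), \qquad \mathbf{W}^h(i) \leftarrow \mathbf{W}^h(i) - \eta\left(\frac{\partial \mathcal{J}}{\partial \mathbf{W}^h(i)} + 2\beta\left(\mathbf{W}^h(i) - \tilde{\mathbf{W}}^h(i)\right)\right). \tag{15}$$

In this way, the weights in mWDN will converge to a point near the wavelet decomposition prior, unless wavelet decomposition is clearly inappropriate for the task.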
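The effect of the regularized update in Eq. (15) can be seen in a small Python sketch. It assumes a squared Frobenius-norm penalty, whose gradient is twice the difference between the current weights and the wavelet prior, and it uses a zero task gradient so that only the pull of the regularizer is visible. The function name and the numbers are hypothetical.

```python
def regularized_step(W, W_prior, grad, lr, alpha):
    """One gradient step on the task loss plus the pull towards the wavelet prior."""
    return [[w - lr * (g + 2.0 * alpha * (w - p))
             for w, p, g in zip(w_row, p_row, g_row)]
            for w_row, p_row, g_row in zip(W, W_prior, grad)]


W_prior = [[0.7, 0.0]]        # filter coefficient and an epsilon slot set to 0
W = [[1.0, 0.5]]              # weights that have drifted away from the prior
zero_grad = [[0.0, 0.0]]      # no task gradient, to isolate the regularizer

one_step = regularized_step(W, W_prior, zero_grad, lr=0.1, alpha=1.0)
for _ in range(50):
    W = regularized_step(W, W_prior, zero_grad, lr=0.1, alpha=1.0)
print(one_step, [[round(w, 4) for w in row] for row in W])
```

With a non-zero task gradient the weights settle where the two forces balance, which is the trade-off between prior and data that the hyper-parameters control.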
Err Rate | RNN | LSTM | MLP | FCN | ResNet | MLP-RCF | FCN-RCF | ResNet-RCF | Wavelet-RCF |
Adiac | 0.233 | 0.341 | 0.248 | 0.143 | 0.174 | 0.212 | 0.155 | 0.151 | 0.162 |
Beef | 0.233 | 0.333 | 0.167 | 0.25 | 0.233 | 0.06 | 0.03 | 0.06 | 0.06 |
CBF | 0.189 | 0.118 | 0.14 | 0 | 0.006 | 0.056 | 0 | 0 | 0.016 |
ChlorineConcentration | 0.135 | 0.16 | 0.128 | 0.157 | 0.172 | 0.096 | 0.068 | 0.07 | 0.147 |
CinCECGtorso | 0.333 | 0.092 | 0.158 | 0.187 | 0.229 | 0.117 | 0.014 | 0.084 | 0.011 |
CricketX | 0.449 | 0.382 | 0.431 | 0.185 | 0.179 | 0.321 | 0.216 | 0.297 | 0.211 |
CricketY | 0.415 | 0.318 | 0.405 | 0.208 | 0.195 | 0.254 | 0.172 | 0.301 | 0.192 |
CricketZ | 0.4 | 0.328 | 0.408 | 0.187 | 0.187 | 0.313 | 0.162 | 0.275 | 0.162 |
DiatomSizeReduction | 0.056 | 0.101 | 0.036 | 0.07 | 0.069 | 0.013 | 0.023 | 0.026 | 0.028 |
ECGFiveDays | 0.088 | 0.417 | 0.03 | 0.015 | 0.045 | 0.023 | 0.01 | 0.035 | 0.016 |
FaceAll | 0.247 | 0.192 | 0.115 | 0.071 | 0.166 | 0.094 | 0.098 | 0.126 | 0.076 |
FaceFour | 0.102 | 0.364 | 0.17 | 0.068 | 0.068 | 0.102 | 0.05 | 0.057 | 0.058 |
FacesUCR | 0.204 | 0.091 | 0.185 | 0.052 | 0.042 | 0.15 | 0.087 | 0.102 | 0.087 |
50words | 0.316 | 0.284 | 0.288 | 0.321 | 0.273 | 0.316 | 0.288 | 0.258 | 0.3 |
FISH | 0.126 | 0.103 | 0.126 | 0.029 | 0.011 | 0.086 | 0.021 | 0.034 | 0.026 |
GunPoint | 0.1 | 0.147 | 0.067 | 0 | 0.007 | 0.033 | 0 | 0.02 | 0 |
Haptics | 0.594 | 0.529 | 0.539 | 0.449 | 0.495 | 0.480 | 0.461 | 0.473 | 0.476 |
InlineSkate | 0.667 | 0.638 | 0.649 | 0.589 | 0.635 | 0.543 | 0.566 | 0.578 | 0.572 |
ItalyPowerDemand | 0.055 | 0.072 | 0.034 | 0.03 | 0.04 | 0.031 | 0.023 | 0.034 | 0.028 |
Lighting2 | 0 | 0 | 0.279 | 0.197 | 0.246 | 0.213 | 0.145 | 0.197 | 0.162 |
Lighting7 | 0.288 | 0.384 | 0.356 | 0.137 | 0.164 | 0.179 | 0.091 | 0.177 | 0.144 |
MALLAT | 0.119 | 0.127 | 0.064 | 0.02 | 0.021 | 0.058 | 0.044 | 0.046 | 0.024 |
MedicalImages | 0.299 | 0.276 | 0.271 | 0.208 | 0.228 | 0.251 | 0.164 | 0.188 | 0.206 |
MoteStrain | 0.133 | 0.167 | 0.131 | 0.05 | 0.105 | 0.105 | 0.076 | 0.032 | 0.05 |
NonInvasiveFatalECGThorax1 | 0.09 | 0.08 | 0.058 | 0.039 | 0.052 | 0.029 | 0.026 | 0.04 | 0.042 |
NonInvasiveFatalECGThorax2 | 0.069 | 0.071 | 0.057 | 0.045 | 0.049 | 0.056 | 0.028 | 0.033 | 0.048 |
OliveOil | 0.233 | 0.267 | 0.6 | 0.167 | 0.133 | 0.03 | 0 | 0 | 0.012 |
OSULeaf | 0.463 | 0.401 | 0.43 | 0.012 | 0.021 | 0.342 | 0.018 | 0.021 | 0.021 |
SonyAIBORobotSurface | 0.21 | 0.309 | 0.273 | 0.032 | 0.015 | 0.193 | 0.042 | 0.032 | 0.052 |
SonyAIBORobotSurfaceII | 0.219 | 0.187 | 0.161 | 0.038 | 0.038 | 0.092 | 0.064 | 0.083 | 0.072 |
StarLightCurves | 0.027 | 0.035 | 0.043 | 0.033 | 0.029 | 0.021 | 0.018 | 0.027 | 0.03 |
SwedishLeaf | 0.085 | 0.128 | 0.107 | 0.034 | 0.042 | 0.089 | 0.057 | 0.017 | 0.046 |
Symbols | 0.179 | 0.117 | 0.147 | 0.038 | 0.128 | 0.126 | 0.04 | 0.107 | 0.084 |
TwoPatterns | 0.005 | 0.001 | 0.114 | 0.103 | 0 | 0.070 | 0 | 0 | 0.005 |
uWaveGestureLibraryX | 0.224 | 0.195 | 0.232 | 0.246 | 0.213 | 0.213 | 0.218 | 0.194 | 0.162 |
uWaveGestureLibraryY | 0.335 | 0.265 | 0.297 | 0.275 | 0.332 | 0.306 | 0.232 | 0.296 | 0.241 |
uWaveGestureLibraryZ | 0.297 | 0.259 | 0.295 | 0.271 | 0.245 | 0.298 | 0.265 | 0.204 | 0.194 |
wafer | 0 | 0 | 0.004 | 0.003 | 0.003 | 0.003 | 0 | 0 | 0 |
WordsSynonyms | 0.429 | 0.343 | 0.406 | 0.42 | 0.368 | 0.391 | 0.338 | 0.387 | 0.314 |
yoga | 0.202 | 0.158 | 0.145 | 0.155 | 0.142 | 0.138 | 0.112 | 0.139 | 0.128 |
Winning times | 2 | 2 | 0 | 9 | 6 | 2 | 19 | 7 | 7 |
AVG arithmetic ranking | 7.425 | 6.825 | 7.2 | 4.025 | 4.55 | 5.15 | 2.175 | 3.375 | 3.075 |
AVG geometric ranking | 6.860 | 6.131 | 7.043 | 3.101 | 3.818 | 4.675 | 1.789 | 2.868 | 2.688 |
MPCE | 0.039 | 0.043 | 0.041 | 0.023 | 0.025 | 0.028 | 0.017 | 0.021 | 0.019 |
In this section, we evaluate the performance of the mWDN-based models in both the TSC and TSF tasks.
Experimental Setup. The classification performance was tested on 40 datasets of the UCR time series repository (Chen et al., 2015), with various competitors as follows:
MLP, FCN, and ResNet. These three models were proposed in (Wang et al., 2017c) as strong baselines on the UCR time series datasets. They have the same framework: an input layer, followed by three hidden basic blocks, and finally a softmax output. MLP adopts a fully-connected layer as its basic block, FCN and ResNet adopt a fully convolutional layer and a residual convolutional network, respectively, as their basic blocks.
MLP-RCF, FCN-RCF, and ResNet-RCF. The three models use the basic blocks of MLP/FCN/ResNet as the model of RCF in Eq. (5). We compare them with MLP/FCN/ResNet to verify the effectiveness of RCF.
Wavelet-RCF. This model has the same structure as ResNet-RCF but replaces the mWDN part with a standard MDWD with fixed parameters. We compare it with ResNet-RCF to verify the effectiveness of trainable parameters in mWDN.
For each dataset, we ran a model 10 times and reported the average classification error rate. To compare the overall performance on all 40 datasets, we further introduced Mean Per-Class Error (MPCE) as the performance indicator for each competitor (Wang et al., 2017c). Let $c_k$ denote the number of categories in the $k$-th dataset, $e_k$ the error rate of a model on that dataset, and $K$ the number of datasets. The MPCE of a model is then defined as

$$\mathrm{MPCE} = \frac{1}{K}\sum_{k=1}^{K}\frac{e_k}{c_k}. \tag{16}$$
Note that the factor of category amount is wiped out in MPCE. A smaller MPCE value indicates a better overall performance.
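As a small worked example of MPCE, the Python sketch below divides each error rate by the number of categories of its dataset and averages the results. The error rates and category counts are made up for illustration and do not come from Table 1.

```python
def mpce(error_rates, class_counts):
    """Mean per-class error: average of error rate divided by number of categories."""
    per_class = [e / c for e, c in zip(error_rates, class_counts)]
    return sum(per_class) / len(per_class)


# Hypothetical values: a 2-class and a 6-class dataset.
value = mpce(error_rates=[0.10, 0.30], class_counts=[2, 6])
print(round(value, 4))
```

Both toy datasets contribute the same per-class error here, even though their raw error rates differ by a factor of three, which is the normalization that the measure is meant to provide.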
Results & Analysis. Table 1 shows the experimental results, with the summarized information listed in the bottom lines. Note that the best performance for each dataset is highlighted in bold, and the second best is in italic. From the table, we have several interesting observations. Firstly, among all the competitors, FCN-RCF achieves the best performance in terms of both the largest number of wins (the best on 19 out of 40 datasets) and the smallest MPCE value. While the baseline FCN itself also achieves a satisfactory performance, with the second largest number of wins at 9 and a rather small MPCE value of 0.023, the gap to FCN-RCF is still rather big, implying a significant benefit from adopting our RCF framework. This is not an individual case; from Table 1, MLP-RCF performs much better than MLP on 37 datasets, and the number for ResNet-RCF against ResNet is 27. This indicates that RCF is indeed a general framework compatible with different types of deep learning classifiers and can improve TSC performance sharply.
Another observation comes from the comparison between Wavelet-RCF and ResNet-RCF. Table 1 shows that Wavelet-RCF achieved the second best overall performance on MPCE and AVG rankings, which indicates that the frequency information introduced by wavelet tools is very helpful for time series problems. It is clear from the table that ResNet-RCF outperforms Wavelet-RCF on most of the datasets. This strongly demonstrates the advantage of our RCF framework in adopting parameter-trainable mWDN under the deep learning architecture, rather than directly using the wavelet decomposition as a feature engineering tool. More technically speaking, compared with Wavelet-RCF, mWDN-based ResNet-RCF can achieve a good tradeoff between the frequency-domain prior and the likelihood of the training data. This illustrates why RCF based models can achieve much better results in the previous observation.
Summary. The above experiments demonstrate the superiority of RCF based models to some state-of-the-art baselines in the TSC tasks. The experiments also imply that the trainable parameters in a deep learning architecture and the strong priors from wavelet decomposition are two key factors for the success of RCF.
Experimental Setup. We tested the predictive power of mLSTM on a visitor volume prediction scenario (Wang et al., 2017b). The experiment adopts a real-life dataset named WuxiCellPhone, which contains user-volume time series of 20 cell-phone base stations located in downtown Wuxi over two weeks. Detailed information on the cell-phone data can be found in (Wang et al., 2018, 2017a; Song et al., 2017). The time granularity of a user-volume series is 5 minutes. In the experiments, we compared mLSTM with the following baselines:
SAE (Stacked Auto-Encoders), which has been used in various TSF tasks (Lv et al., 2015).
RNN (Recurrent Neural Networks) and LSTM (Long Short-Term Memory), which are specifically designed for time series analysis.
wLSTM, which has the same structure as mLSTM but replaces the mWDN part with a standard MDWD.
We use two metrics to evaluate the performance of the models, Mean Absolute Percentage Error (MAPE) and Root Mean Square Error (RMSE), which are defined as

$$\mathrm{MAPE} = \frac{1}{T}\sum_{t=1}^{T}\frac{\left|\hat{x}_t - x_t\right|}{x_t} \times 100\%, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{T}\sum_{t=1}^{T}\left(\hat{x}_t - x_t\right)^2}, \tag{17}$$

where $x_t$ is the real value of the $t$-th sample in a time series, and $\hat{x}_t$ is the predicted one. A smaller value of either metric means better performance.
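For reference, the two error measures can be computed as in the following Python sketch. It follows the usual definitions of MAPE and RMSE and assumes that no real value is zero, since MAPE divides by the real value; the sample numbers are hypothetical.

```python
import math


def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    return 100.0 * sum(abs(p - a) / abs(a)
                       for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean square error, in the units of the series."""
    return math.sqrt(sum((p - a) ** 2
                         for a, p in zip(actual, predicted)) / len(actual))


actual = [100.0, 200.0, 400.0]        # hypothetical user volumes
predicted = [110.0, 180.0, 400.0]
print(round(mape(actual, predicted), 3), round(rmse(actual, predicted), 3))
```

MAPE is scale free, so it treats the first two errors alike, while RMSE is dominated by the larger absolute miss on the second point.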
Results & Analysis. We compared the performance of the competitors in two TSF scenarios suggested in (Wang et al., 2016). In the first scenario, we predicted the average user volumes of a base station in subsequent periods. The length of the periods was varied from 5 to 30 minutes. Fig. 4 compares the performance averaged over the 20 base stations in one week. As can be seen, while all the models experience a gradual decrease in prediction error as the period length increases, mLSTM achieves the best performance compared with the baselines. In particular, the performance of mLSTM is consistently better than that of wLSTM, which again supports the introduction of mWDN for time series forecasting.
In the second scenario, we predicted the average user volumes in 5 minutes after a given time interval varying from 0 to 30 minutes. Fig. 5 compares the performance of mLSTM and the baselines. Unlike the trend we observed in Scenario I, the prediction errors in Fig. 5 generally increase along the x-axis because of the increasing uncertainty. From Fig. 5 we can see that mLSTM again outperforms wLSTM and the other baselines, which confirms the observations from Scenario I.
Summary. The above experiments demonstrate the superiority of mLSTM to the baselines. The mWDN structure adopted by mLSTM again becomes an important factor for the success.
In this section, we highlight the unique advantage of our mWDN model: interpretability. Since mWDN is embedded with a discrete wavelet decomposition, the outputs of the middle layers in mWDN, i.e., $\mathbf{x}^l(i)$ and $\mathbf{x}^h(i)$, inherit the physical meanings of wavelet decompositions. We here take two data sets for illustration: WuxiCellPhone used in Sect. 4.2 and ECGFiveDays used in Sect. 4.1. Fig. 6(a) shows a sample of the user number series of a cell-phone base station in one day, and Fig. 6(b) exhibits an electrocardiogram (ECG) sample.
Fig. 7 shows the outputs of mWDN layers in the mLSTM and RCF models fed with the two samples given in Fig. 6, respectively. In Fig. 7(a), we plot the outputs of the first three layers in the mLSTM model as different sub-figures. As can be seen, from $\mathbf{x}^h(1)$ to $\mathbf{x}^l(3)$, the outputs of the middle layers correspond to the frequency components of the input series running from high to low. A similar phenomenon can be observed in Fig. 7(b), where the outputs of the first three layers in the RCF model are presented. This phenomenon again indicates that the middle layers of mWDN inherit the frequency decomposition function of wavelets. This raises a question: can we evaluate quantitatively which layer or which frequency of a time series is more important to the final output of the mWDN based models? If so, this can provide valuable interpretability for our mWDN model.
We here introduce an importance analysis method for the proposed mWDN model, which aims to quantify the importance of each middle layer to the final output of the mWDN based models.
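We denote the problem of time series classification/forecasting using a neural network model as

$$p = \mathcal{M}(\mathbf{x}), \tag{18}$$

where $\mathcal{M}$ denotes the neural network, $\mathbf{x}$ denotes the input series, and $p$ is the prediction. Given a well-trained model $\mathcal{M}$, if a small disturbance to the $i$-th element $x_i \in \mathbf{x}$ can cause a large change to the output $p$, we say $\mathcal{M}$ is sensitive to $x_i$. Therefore, the sensibility of the network $\mathcal{M}$ to the $i$-th element $x_i$ of the input series is defined through the partial derivative of $\mathcal{M}(\mathbf{x})$ with respect to $x_i$ as follows:

$$S(x_i) = \left|\frac{\partial \mathcal{M}(\mathbf{x})}{\partial x_i}\right|. \tag{19}$$

Obviously, $S(x_i)$ is also a function of $x_i$ for a given model $\mathcal{M}$. Given a training data set $\mathbf{X} = \{\mathbf{x}^1, \ldots, \mathbf{x}^j, \ldots, \mathbf{x}^J\}$ with $J$ training samples, the importance of the $i$-th element of the input series to the model $\mathcal{M}$ is defined as

$$I(x_i) = \frac{1}{J}\sum_{j=1}^{J} S\left(x_i^j\right), \tag{20}$$

where $x_i^j$ is the value of the $i$-th element in the $j$-th training sample.
The importance definition in Eq. (20) can be extended to the middle layers in the mWDN model. Denoting $\mathbf{a}$ as an output of a middle layer in mWDN, the neural network $\mathcal{M}$ can be rewritten as

$$p = \mathcal{M}\left(\mathbf{a}(\mathbf{x})\right), \tag{21}$$

and the sensibility of $\mathcal{M}$ to $\mathbf{a}$ is then defined as

$$S(\mathbf{a}) = \left|\frac{\partial \mathcal{M}\left(\mathbf{a}(\mathbf{x})\right)}{\partial \mathbf{a}}\right|. \tag{22}$$

Given a training data set $\mathbf{X}$, the importance of $\mathbf{a}$ w.r.t. $\mathcal{M}$ is calculated as

$$I(\mathbf{a}) = \frac{1}{J}\sum_{j=1}^{J} S\left(\mathbf{a}(\mathbf{x}^j)\right). \tag{23}$$

The calculation of $\partial \mathcal{M}/\partial x_i$ and $\partial \mathcal{M}/\partial \mathbf{a}$ in Eq. (19) and Eq. (22) is given in the Appendix for concision. Eq. (20) and Eq. (23) respectively define the importance of a time-series element and an mWDN layer to an mWDN based model.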
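The importance of an input element can be approximated numerically, as in the Python sketch below. This is not the analytic calculation referred to in the Appendix: the sketch swaps in a central finite difference for the partial derivative, takes its absolute value, and averages over the samples. The toy model and the function names are hypothetical and stand in for a trained network with a scalar output.

```python
def sensitivity(model, x, i, eps=1e-4):
    """Central-difference estimate of |d model(x) / d x_i|."""
    up, down = list(x), list(x)
    up[i] += eps
    down[i] -= eps
    return abs(model(up) - model(down)) / (2.0 * eps)


def importance(model, dataset, i, eps=1e-4):
    """Average sensitivity to the i-th element over all samples."""
    return sum(sensitivity(model, x, i, eps) for x in dataset) / len(dataset)


def toy_model(x):
    # Hypothetical scalar model: weak dependence on the first element,
    # strong dependence on the last one, none on the middle one.
    return 0.1 * x[0] + 3.0 * x[-1] ** 2


dataset = [[1.0, 2.0, 1.0], [0.5, 0.0, -2.0]]
spectrum = [importance(toy_model, dataset, i) for i in range(3)]
print([round(v, 4) for v in spectrum])
```

The same routine could be pointed at the output of a middle layer instead of the raw input, by treating the layers after it as the model, which mirrors the extension from input elements to mWDN layers.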
Fig. 8 and Fig. 9 show the results of importance analysis. In Fig. 8, the mLSTM model trained on WuxiCellPhone in Sect. 4.2 is used. Fig. 8(b) exhibits the importance spectrum of all the elements, where the x-axis denotes the increasing timestamps and the colors in the spectrum denote the varying importance of the features: the redder, the more important. From the spectrum, we can see that the latest elements are more important than the older ones, which is quite reasonable in the scenario of time series forecasting and justifies the time value of information.
Fig. 8(a) exhibits the importance spectra of the middle layers listed from top to bottom in increasing order of frequency. Note that for the sake of comparison, we resize the outputs to the same length. From the figure, we can observe that i) the lower frequency layers at the top have higher importance, and ii) only the layers with higher importance exhibit the time value of the elements as in Fig. 8(b). These observations imply that the low frequency layers in mWDN are crucially important to the success of time series forecasting. This is not difficult to understand, since the information captured by low frequency layers often characterizes the essential tendency of human activities and is therefore of great use in revealing the future.
Fig. 9 depicts the importance spectra of the RCF model trained on the ECGFiveDays data set in Sect. 4.1. As shown in Fig. 9(b), the most important elements are located in the range from roughly 100 to 110 on the time axis, which is quite different from Fig. 8(b). To understand this, recall from Fig. 6(b) that this range corresponds to the T-Wave of the electrocardiogram, covering the period of the heart relaxing and preparing for the next contraction. It is generally believed that abnormalities in the T-Wave can indicate seriously impaired physiological functioning (see https://en.m.wikipedia.org/wiki/T_wave). As a result, the elements describing the T-Wave are more important to the classification task.
Fig. 9(a) shows the importance spectra of the middle layers, also listed from top to bottom in increasing order of frequency. Interestingly, the phenomenon is opposite to the one in Fig. 8(a); that is, the high frequency layers are more important to the classification task on ECGFiveDays. To understand this, note that the general trends of ECG curves captured by low frequency layers are very similar for everyone, whereas the abnormal fluctuations captured by high frequency layers carry the truly distinguishing information for identifying heart diseases. This also indicates the difference between a time-series classification task and a time-series forecasting task.
Summary. The experiments in this section demonstrate the interpretability advantage of the mWDN model stemming from the integration of wavelet decomposition and our proposed importance analysis method. It can also be regarded as an in-depth exploration towards solving the black box problem of deep learning.
Time Series Classification (TSC). The target of TSC is to assign a time series pattern to a specific category, e.g., to identify a word based on series of voice signals. Traditional TSC methods could be classified into three major categories: distance based, feature based, and ensemble methods (Cui et al., 2016). Distance based methods predict the category of a time series by comparing the distances or similarities to other labeled series. The widely used TSC distances includes the Euclidean distance and dynamic time warping (DTW) (Berndt and Clifford, 1994)
, and DTW with KNN classifier has been the state-of-the-art TSC method for a long time
(Keogh and Ratanamahatana, 2005). A defect of distance based TSC methods is the relatively high computational complexity. Feature based methods overcome this defect by training classifiers on deterministic features and category labels of time series. Traditional methods, however, usually depend on handcraft features as inputs, such as symbolic aggregate approximation and interval mean/deviation/slop (Lin et al., 2003; Deng et al., 2013). In recent years, automatic feature engineering was introduced to TSC, such as time series shapelets mining (Grabocka et al., 2014), attention (Qin et al., 2017) and deep learning based representative learning (Längkvist et al., 2014). Our study also falls in this area but with frequency awareness. The well-known ensemble methods for TSC include PROP (Lines and Bagnall, 2015), COTE (Bagnall et al., 2015), etc., which aim to improve classification performance via knowledge integration. As reported by some latest works (Cui et al., 2016; Wang et al., 2017c), however, existing ensemble methods are yet inferior to some distance based deep learning methods.Time Series Forecasting (TSF). TSF refers to predicting future values of a time series using past and present data, which is widely adopted in nearly all application domains (Wang et al., 2014b, a). A classic model is autoregressive integrated moving average (ARIMA) (Box and Pierce, 1970), with a great many variants, e.g., ARIMA with explanatory variables (ARIMAX) (Lee and Fambro, 1999) and seasonal ARIMA (SARIMA) (Williams and Hoel, 2003), to meet the requirements of various applications. In recent years, a tendency of TSF research is to introduce supervised learning methods, such as support vector regression (Jeong et al., 2013) and deep neural networks (Zhang, 2003), for modeling complicated non-linear correlations between past and future states of time series. 
Two well-known deep neural network structures for TSF are recurrent neural networks (RNN) (Connor et al., 1994) and long short-term memory (LSTM) (Gers et al., 2002). These works indicate that elaborate model design is crucially important for achieving excellent forecasting performance.
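To make the distance based TSC baseline mentioned above more concrete, the following is a minimal sketch of DTW combined with a 1-nearest-neighbour classifier. The function names `dtw_distance` and `nn_classify`, the absolute-difference local cost, and the toy series are illustrative assumptions rather than details taken from the cited works. The nested loops also show where the quadratic cost per pair of series comes from.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two 1D sequences.

    Assumption: the local cost is the absolute difference |a_i - b_j|.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessor alignments.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def nn_classify(query, train_series, train_labels):
    """Label of the training series closest to `query` under DTW (1-NN)."""
    best = min(range(len(train_series)),
               key=lambda k: dtw_distance(query, train_series[k]))
    return train_labels[best]


# Hypothetical toy data: two labeled shapes and a time-shifted query.
train = [[0, 0, 1, 2, 1, 0], [2, 2, 1, 0, 1, 2]]
labels = ["peak", "valley"]
predicted = nn_classify([0, 1, 2, 2, 1, 0], train, labels)
```

Each query requires one O(nm) dynamic program per training series, which is the computational burden that feature based methods try to avoid.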
Frequency Analysis of Time Series. Frequency analysis of time series data has been deeply studied by the signal processing community. Many classical methods, such as the Discrete Wavelet Transform (Mallat, 1989), the Discrete Fourier Transform (Harris, 1978), and the Z-Transform (Jury, 1964), have been proposed to analyze the frequency patterns of time series signals. In existing TSC/TSF applications, however, transforms are usually used as an independent data preprocessing step (Cui et al., 2016; Liu et al., 2013), which has no interaction with model training and therefore might not be optimized for TSC/TSF tasks from a global view. In recent years, some research works, such as Clockwork RNN (Koutnik et al., 2014) and SFM (Hu and Qi, 2017), have begun to introduce frequency analysis methodology into the deep learning framework. To the best of our knowledge, our study is among the very few works that embed wavelet time series transforms as a part of neural networks so as to achieve end-to-end learning.
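As an illustration of a wavelet transform used as a fixed preprocessing step, the sketch below implements a multilevel discrete wavelet decomposition with the Haar wavelet. The choice of Haar, the function names `haar_dwt` and `multilevel_haar`, and the toy input are assumptions made for this example. The filter coefficients are constants here, so nothing in the decomposition is adjusted during model training.

```python
import math


def haar_dwt(x):
    """One level of the Haar DWT.

    Returns (approximation, detail): the low- and high-frequency sub-series,
    each half the length of x. Assumes len(x) is even.
    """
    if len(x) % 2 != 0:
        raise ValueError("input length must be even")
    s = 1.0 / math.sqrt(2.0)  # fixed Haar filter coefficient
    approx = [s * (x[i] + x[i + 1]) for i in range(0, len(x), 2)]
    detail = [s * (x[i] - x[i + 1]) for i in range(0, len(x), 2)]
    return approx, detail


def multilevel_haar(x, levels):
    """Decompose repeatedly on the approximation branch.

    Returns [detail_1, ..., detail_L, approx_L], from the highest-frequency
    sub-series to the lowest-frequency one.
    """
    out = []
    approx = list(x)
    for _ in range(levels):
        approx, detail = haar_dwt(approx)
        out.append(detail)
    out.append(approx)
    return out


# Hypothetical toy signal of length 8, decomposed over two levels.
sub_series = multilevel_haar([4, 4, 2, 2, 6, 6, 0, 0], levels=2)
```

In a preprocessing pipeline of this kind, `sub_series` would be computed once and then handed to a separate classifier or forecaster.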
In this paper, we aimed at building frequency-aware deep learning models for time series analysis. To this end, we first designed a novel wavelet-based network structure called mWDN for frequency learning of time series, which can be seamlessly embedded into deep learning frameworks by making all parameters trainable. We further designed two deep learning models based on mWDN for time series classification and forecasting, respectively, and extensive experiments on abundant real-world datasets demonstrated their superiority to state-of-the-art competitors. As a step toward interpretable deep learning, we further proposed an importance analysis method for identifying important factors for time series analysis, which in turn verifies the interpretability merit of mWDN.
A time series forest for classification and feature extraction. Information Sciences 239 (2013), 142–153.
International Conference on Machine Learning. 1568–1577.
Proceedings of the 32nd AAAI Conference on Artificial Intelligence
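In a neural network model, the outputs of layer $l$ are connected as the inputs of layer $l+1$. According to the chain rule, the partial derivative of the model $\mathcal{M}$ with respect to the outputs of a middle layer can be calculated layer by layer as

$$\frac{\partial \mathcal{M}}{\partial a^{(l)}_i} = \sum_{j} \frac{\partial \mathcal{M}}{\partial a^{(l+1)}_j} \, \frac{\partial a^{(l+1)}_j}{\partial a^{(l)}_i}, \tag{24}$$

where $a^{(l)}_i$ is the $i$-th output of layer $l$. The proposed models contain three types of layers: convolutional, LSTM, and fully connected layers, which are discussed below.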
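The layer-by-layer chain rule can be checked numerically on a small example. The sketch below is an assumption-laden toy, not the proposed models: it uses a two-layer fully connected network with a tanh hidden layer and a sigmoid output, and the names `forward`, `input_gradient`, and `numeric_gradient` are hypothetical. It propagates the derivative from the output back to the input one layer at a time and compares the result with central finite differences.

```python
import math


def forward(x, W1, w2):
    """Toy model: h = tanh(W1 x), M = sigmoid(w2 . h). Returns (h, M)."""
    h = [math.tanh(sum(W1[j][i] * x[i] for i in range(len(x))))
         for j in range(len(W1))]
    z = sum(w2[j] * h[j] for j in range(len(h)))
    return h, 1.0 / (1.0 + math.exp(-z))


def input_gradient(x, W1, w2):
    """dM/dx computed layer by layer with the chain rule."""
    h, M = forward(x, W1, w2)
    # Output layer to hidden layer: dM/dh_j = sigmoid'(z) * w2_j
    dM_dh = [M * (1.0 - M) * w2[j] for j in range(len(h))]
    # Hidden layer to input: dM/dx_i = sum_j dM/dh_j * dh_j/dx_i,
    # with dh_j/dx_i = (1 - h_j^2) * W1[j][i] for the tanh activation.
    return [sum(dM_dh[j] * (1.0 - h[j] ** 2) * W1[j][i] for j in range(len(h)))
            for i in range(len(x))]


def numeric_gradient(x, W1, w2, eps=1e-6):
    """Central finite-difference estimate of dM/dx, used only as a check."""
    grads = []
    for i in range(len(x)):
        xp = list(x)
        xm = list(x)
        xp[i] += eps
        xm[i] -= eps
        grads.append((forward(xp, W1, w2)[1] - forward(xm, W1, w2)[1]) / (2 * eps))
    return grads


# Hypothetical weights and input, chosen only for illustration.
x = [0.5, -1.0]
W1 = [[0.3, -0.2], [0.1, 0.4]]
w2 = [0.7, -0.5]
analytic = input_gradient(x, W1, w2)
numeric = numeric_gradient(x, W1, w2)
```

The two gradients should agree up to finite-difference error. The same recursion applies to deeper models, with each layer type contributing its own local derivative.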
For convolutional layers, only the 1D convolution operation is used in our case. The output of the layer is a matrix with the size of , which is connected to neural matrix of the -th with a convolutional kernel in the size of . The partial derivative of to the output of the layer is calculated as
where denotes the -th element of the convolutional kernel, , and is the derivative of the activation function.
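A simplified version of the convolutional case can be sketched as follows. The example assumes a single input channel, a single kernel, stride 1, no padding, and a tanh activation, whereas the text above describes matrix-valued layer outputs. The names `conv1d_forward` and `conv1d_input_grad` are hypothetical. Each input element receives a contribution from every output position whose receptive field covers it, weighted by the matching kernel element and the activation derivative.

```python
import math


def conv1d_forward(x, w):
    """Valid 1D convolution (cross-correlation form), stride 1, tanh activation.

    Returns (z, y) with z_j = sum_k w_k * x_{j+k} and y_j = tanh(z_j).
    """
    K = len(w)
    z = [sum(w[k] * x[j + k] for k in range(K)) for j in range(len(x) - K + 1)]
    return z, [math.tanh(v) for v in z]


def conv1d_input_grad(x, w, upstream):
    """dM/dx given upstream[j] = dM/dy_j from the next layer."""
    z, y = conv1d_forward(x, w)
    grad = [0.0] * len(x)
    for j in range(len(z)):
        # dM/dz_j = dM/dy_j * tanh'(z_j), where tanh'(z_j) = 1 - y_j^2
        local = upstream[j] * (1.0 - y[j] ** 2)
        for k in range(len(w)):
            # dz_j/dx_{j+k} = w_k, so x_{j+k} accumulates local * w_k
            grad[j + k] += local * w[k]
    return grad


# Hypothetical input and kernel; upstream of ones corresponds to M = sum(y).
x = [0.2, -0.4, 0.9, 0.1, -0.3]
w = [0.5, -1.0, 0.25]
grad_x = conv1d_input_grad(x, w, [1.0, 1.0, 1.0])
```

Extending the sketch to several channels would add a sum over output channels inside the accumulation loop.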
For LSTM layers, we denote the output of an LSTM unit in layer at time as
where is calculated as
is the history state that is saved in the memory cell. Therefore, the partial derivative of to the is calculated as
where is an equation as
The derivative in the above equation is calculated as
For fully connected layers, the output . Then the partial derivative is equal to