1. Introduction
A time series is a series of data points indexed in time order (https://en.wikipedia.org/wiki/Time_series). Methods for time series analysis can be classified into two types: time-domain methods and frequency-domain methods. Time-domain methods treat a time series as a sequence of ordered points and analyze correlations among them. Frequency-domain methods use transform algorithms, such as the discrete Fourier transform and the Z-transform, to map a time series into a frequency spectrum, which can be used as features for analyzing the original series.
In recent years, with the boom of deep learning, various types of deep neural network models have been introduced to time series analysis and have achieved state-of-the-art performance in many real-life applications (Wang et al., 2017c; Rajpurkar et al., 2017). Well-known models include Recurrent Neural Networks (RNN) (Williams and Zipser, 1989) and Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997), which use memory nodes to model correlations of series points, and Convolutional Neural Networks (CNN), which use trainable convolution kernels to model local shape patterns (Zheng et al., 2016). Most of these models fall into the category of time-domain methods and do not leverage the frequency information of a time series, although some begin to consider it in indirect ways (Cui et al., 2016; Koutnik et al., 2014).
Wavelet decompositions (Daubechies, 1992) are well-known methods for capturing features of a time series in both the time and frequency domains. Intuitively, we can employ them as feature-engineering tools for data preprocessing before deep modeling. While this loosely coupled approach might improve the performance of raw neural network models (Liu et al., 2013), the two parts are not globally optimized, since their parameters are inferred independently. How to integrate wavelet transforms into the framework of deep learning models remains a great challenge.
In this paper, we propose a wavelet-based neural network structure, named multilevel Wavelet Decomposition Network (mWDN), to build frequency-aware deep learning models for time series analysis. Similar to the standard Multilevel Discrete Wavelet Decomposition (MDWD) (Mallat, 1989), mWDN can decompose a time series into a group of subseries with frequencies ranked from high to low, which is crucial for capturing frequency factors in deep learning. Different from MDWD, whose parameters are fixed, all parameters in mWDN can be fine-tuned to fit the training data of different learning tasks. In other words, mWDN takes advantage of both wavelet-based time series decomposition and the learning ability of deep neural networks.
Based on mWDN, two deep learning models, i.e., Residual Classification Flow (RCF) and multi-frequency Long Short-Term Memory (mLSTM), are designed for time series classification (TSC) and forecasting (TSF), respectively. The key issue in TSC is to extract as many representative features as possible from a time series. The RCF model therefore adopts the mWDN decomposed results at different levels as inputs, and employs a pipelined classifier stack to exploit the features hidden in subseries through residual learning. For the TSF problem, the key issue turns to inferring future states of a time series according to the hidden trends at different frequencies. Therefore, the mLSTM model feeds the mWDN decomposed subseries of different frequencies into independent LSTM models, and ensembles all LSTM outputs for the final forecast. Note that all parameters of RCF and mLSTM, including those of mWDN, are trained using the back propagation algorithm in an end-to-end manner. In this way, wavelet-based frequency analysis is seamlessly embedded into deep learning frameworks.
We evaluate RCF on 40 UCR time series datasets for TSC, and mLSTM on a real-world user-volume time series dataset for TSF. The results demonstrate their superiority over state-of-the-art baselines and the advantage of mWDN's trainable parameters. As a step toward interpretable deep learning, we further propose an importance analysis method for mWDN-based models, which successfully identifies the time-series elements and mWDN layers that are crucially important to the success of time series analysis. This indicates the interpretability advantage of mWDN gained by integrating wavelet decomposition for frequency factors.
2. Model
Throughout the paper, we use lowercase symbols such as x to denote scalars, bold lowercase symbols such as x to denote vectors, bold uppercase symbols such as W to denote matrices, and uppercase symbols such as T to denote constants.
2.1. Multilevel Discrete Wavelet Decomposition
Multilevel Discrete Wavelet Decomposition (MDWD) (Mallat, 1989) is a wavelet-based discrete signal analysis method that extracts multilevel time-frequency features from a time series by decomposing it into low- and high-frequency subseries level by level.
We denote the input time series as x = (x_1, x_2, …, x_T), and the low- and high-frequency subseries generated in the i-th level as x^l(i) and x^h(i). In the i-th level, MDWD uses a low-pass filter l = (l_1, l_2, …, l_K) and a high-pass filter h = (h_1, h_2, …, h_K), K ≪ T, to convolve the low-frequency subseries of the upper level as

(1)  a^l_n(i) = Σ_{k=1}^{K} x^l_{n+k-1}(i-1) · l_k,   a^h_n(i) = Σ_{k=1}^{K} x^l_{n+k-1}(i-1) · h_k,

where x^l_n(i-1) is the n-th element of the low-frequency subseries in the (i-1)-th level, and x^l(0) is set as the input series x. The low- and high-frequency subseries x^l(i) and x^h(i) in the i-th level are generated from the 1/2 down-sampling of the intermediate variable sequences a^l(i) and a^h(i).
The subseries set X(i) = {x^h(1), x^h(2), …, x^h(i), x^l(i)} is called the i-th level decomposed result of x. Specifically, X(i) satisfies: 1) we can fully reconstruct x from X(i); 2) the frequencies of the subseries in X(i), from x^h(1) to x^l(i), range from high to low; 3) X(i) has different time and frequency resolutions at different levels: as i increases, the frequency resolution increases while the time resolution, especially for the low-frequency subseries, decreases.
Because the subseries with different frequencies in X(i) keep the same order information as the original series x, MDWD is regarded as a time-frequency decomposition.
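To make the decomposition concrete, the following sketch implements a multilevel discrete wavelet decomposition in pure Python with the simple Haar filters l = (1/√2, 1/√2) and h = (1/√2, −1/√2), rather than the Daubechies filters used later in the paper; the function name and interface are illustrative only.

```python
import math

def mdwd(x, levels):
    """Multilevel discrete wavelet decomposition with Haar filters.

    Returns [high(1), high(2), ..., high(levels), low(levels)],
    mirroring the decomposed set described in the text.
    """
    inv_sqrt2 = 1.0 / math.sqrt(2.0)
    subseries, low = [], list(x)
    for _ in range(levels):
        # convolve with the low/high-pass filters, then 1/2 down-sample
        lo = [(low[2 * i] + low[2 * i + 1]) * inv_sqrt2 for i in range(len(low) // 2)]
        hi = [(low[2 * i] - low[2 * i + 1]) * inv_sqrt2 for i in range(len(low) // 2)]
        subseries.append(hi)
        low = lo
    subseries.append(low)
    return subseries
```

For a constant series every high-frequency subseries is zero, and each level halves the time resolution, matching the properties listed above.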
2.2. Multilevel Wavelet Decomposition Network
In this section, we propose the multilevel Wavelet Decomposition Network (mWDN), which approximately implements MDWD under a deep neural network framework.
The structure of mWDN is illustrated in Fig. 1. As shown in the figure, the mWDN model hierarchically decomposes a time series using the following two functions:

(2)  a^l(i) = σ(W^l(i) x^l(i-1) + b^l(i)),   a^h(i) = σ(W^h(i) x^l(i-1) + b^h(i)),

where σ(·) is a sigmoid activation function, and b^l(i) and b^h(i) are trainable bias vectors initialized as close-to-zero random values. The functions in Eq. (2) have forms similar to those in Eq. (1) for MDWD. Here x^l(i) and x^h(i) also denote the low- and high-frequency subseries generated in the i-th level, which are down-sampled from the intermediate variables a^l(i) and a^h(i) using an average pooling layer, e.g., x^l_n(i) = (a^l_{2n-1}(i) + a^l_{2n}(i)) / 2. In order to implement the convolution defined in Eq. (1), we set the initial values of the weight matrices W^l(i) and W^h(i) as
(3)  W^l(i) =
     | l_1  l_2  …  l_K  ε    …    ε   |
     | ε    l_1  l_2  …  l_K  …    ε   |
     | …                               |
     | ε    …    ε    l_1  l_2  …  l_K |

(4)  W^h(i) =
     | h_1  h_2  …  h_K  ε    …    ε   |
     | ε    h_1  h_2  …  h_K  …    ε   |
     | …                               |
     | ε    …    ε    h_1  h_2  …  h_K |

Obviously, W^l(i), W^h(i) ∈ R^{P×P}, where P is the size of x^l(i-1). The ε entries in the weight matrices are random values satisfying |ε| ≪ |l_k| and |ε| ≪ |h_k|, so that the initialized mWDN approximates MDWD. We use the Daubechies 4 wavelet (Rowe and Abbott, 1995) in our practice, with the filters l and h set to the standard Daubechies 4 low-pass and high-pass coefficients.
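A minimal sketch of one mWDN level in pure Python, using Haar filter coefficients for brevity; init_weight, sigmoid, and mwdn_level are hypothetical names, and a real implementation would use a deep learning framework with trainable tensors.

```python
import math
import random

def init_weight(filt, P, eps=1e-3):
    """P x P weight matrix: row n carries the filter shifted by n positions;
    all other entries are near-zero random values (the epsilon entries)."""
    W = [[random.uniform(-eps, eps) for _ in range(P)] for _ in range(P)]
    for n in range(P):
        for k, c in enumerate(filt):
            if n + k < P:
                W[n][n + k] = c
    return W

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mwdn_level(x, W, b):
    """One level: a = sigmoid(W x + b), then 1/2 average pooling."""
    a = [sigmoid(sum(W[n][j] * x[j] for j in range(len(x))) + b[n])
         for n in range(len(W))]
    return [(a[2 * i] + a[2 * i + 1]) / 2.0 for i in range(len(a) // 2)]
```

Applying mwdn_level repeatedly to the pooled low-frequency output reproduces the hierarchy of Fig. 1.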
2.3. Residual Classification Flow
The task of TSC is to predict the unknown category label of a time series, and a key issue is to extract distinguishing features from the time series data. The decomposed results of mWDN are natural time-frequency features that can serve this purpose. In this subsection, we propose a Residual Classification Flow (RCF) network to exploit the potential of mWDN for TSC.
The framework of RCF is illustrated in Fig. 2. As shown in the figure, RCF contains a number of independent classifiers. The RCF model connects the subseries generated by the i-th mWDN level, i.e., x^h(i) and x^l(i), with a forward neural network as

(5)  u_i = F_i(x^h(i) ⊕ x^l(i), θ_i),

where F_i could be a multi-layer perceptron, a convolutional network, or any other type of neural network, θ_i represents its trainable parameters, and ⊕ denotes concatenation. Moreover, RCF adopts a residual learning method (He et al., 2016) to join the outputs of all classifiers as

(6)  ĉ_i = softmax(u_1 + u_2 + … + u_i),

where softmax(·) is a softmax classifier, and ĉ_i is a predicted value of the one-hot encoding of the category label of the input series.
In the RCF model, the decomposed results of all mWDN levels, i.e., x^h(1), x^l(1), …, x^h(N), x^l(N), are involved. Because the decomposed results in different mWDN levels have different time and frequency resolutions (Mallat, 1989), the RCF model can fully exploit patterns of the input time series at different time/frequency resolutions. In other words, RCF employs a multi-view learning methodology to achieve high-performance time series classification.
Moreover, deep residual networks (He et al., 2016) were proposed to ease the training difficulty introduced by deeper network structures, and the RCF model inherits this merit. In Eq. (6), the i-th classifier makes its decision based on its own inputs and the decision made by the (i-1)-th classifier, and can thus learn the incremental knowledge that the (i-1)-th classifier does not have. Therefore, users can append residual classifiers one after another until the classification performance no longer increases.
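The residual joining of classifier outputs can be sketched as follows; this is an illustrative pure-Python rendering in which each classifier is any callable returning class scores, and the function names are ours.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def rcf_forward(level_features, classifiers):
    """Residual classification flow: the i-th prediction is built from the
    accumulated scores of all previous classifiers plus the i-th one."""
    acc, predictions = None, []
    for feats, clf in zip(level_features, classifiers):
        scores = clf(feats)
        acc = scores if acc is None else [a + s for a, s in zip(acc, scores)]
        predictions.append(softmax(acc))
    return predictions  # predictions[-1] is the final RCF output
```

Appending one more (features, classifier) pair extends the flow without retraining the earlier stages, matching the incremental construction described above.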
2.4. Multifrequency Long ShortTerm Memory
In this subsection, we propose a multi-frequency Long Short-Term Memory (mLSTM) model based on mWDN for TSF. The design of mLSTM is based on the insight that the temporal correlations hidden in a time series are closely related to frequency. For example, large-time-scale correlations, such as long-term tendencies, usually lie in the low-frequency components, while small-time-scale correlations, such as short-term disturbances and events, usually lie in the high-frequency components. Therefore, we can divide a complicated TSF problem into many subproblems of forecasting the subseries decomposed by mWDN, which are relatively easier because the frequency components of each subseries are simpler.
Given a time series with infinite length, we open a size-T sliding window on it from the past to the current time t as

(7)  x_t = (x_{t-T+1}, x_{t-T+2}, …, x_t).

Using mWDN to decompose x_t, we get the low- and high-frequency component series in the N-th level as

(8)  X_t(N) = {x^h_t(1), x^h_t(2), …, x^h_t(N), x^l_t(N)}.
As shown in Fig. 3, the mLSTM model uses the decomposed results of the last level, i.e., the subseries in X_t(N), as the inputs of independent LSTM subnetworks. Every LSTM subnetwork forecasts the future state of one subseries in X_t(N). Finally, a fully connected neural network is employed to fuse the outputs of the LSTM subnetworks into an ensemble forecast.
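Structurally, the ensemble can be sketched as below, with each trained LSTM subnetwork replaced by a naive last-value predictor and the fully connected fusion layer by a fixed linear combination; all names and weights are illustrative stand-ins, not the trained model.

```python
def mlstm_forecast(decomposed, fuse_weights):
    """Forecast from mWDN-decomposed subseries.

    decomposed  : list of subseries, one per frequency component
    fuse_weights: weights of the fusion layer (trained in the real model)
    """
    # stand-in for the per-subseries LSTM forecasts
    per_series = [s[-1] for s in decomposed]
    # stand-in for the fully connected fusion network
    return sum(w * p for w, p in zip(fuse_weights, per_series))
```

The point of the structure is that each subnetwork only ever sees one frequency component, so its forecasting subproblem is simpler than the original series.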
3. Optimization
In TSC applications, we adopt a deep supervision method to train the RCF model (Wang et al., 2015). Given a set of time series D = {(x_m, c_m)}_{m=1}^{M}, we use cross entropy as the loss metric and define the objective function of the j-th classifier as

(9)  ℓ_j = − Σ_{m=1}^{M} c_m^T log ĉ_j(x_m),

where c_m is the one-hot encoding of x_m's real category, and ĉ_j(x_m) is the softmax output of the j-th classifier with the input x_m. For an RCF with J classifiers, the final objective function is a weighted sum of all the ℓ_j (Wang et al., 2015):

(10)  L = Σ_{j=1}^{J} α_j ℓ_j,

where the weights α_j are hyper-parameters. The result of the last classifier, ĉ_J, is used as the final classification result of RCF.
In TSF applications, we adopt a pre-training and fine-tuning method to train the mLSTM model. In the pre-training step, we use MDWD to decompose the real values of the future states to be predicted into wavelet components, denoted as O, and combine the corresponding outputs of all LSTM subnetworks as Ô. The objective function of the pre-training step is then defined as

(11)  L_pre = || Ô − O ||_F^2,

where ||·||_F is the Frobenius norm. In the fine-tuning step, we use the following objective function to train mLSTM based on the parameters learned in the pre-training step:

(12)  L_fine = (x̂_{t+1} − x_{t+1})^2,

where x̂_{t+1} is the future state predicted by mLSTM and x_{t+1} is the real value.
We use the error back propagation (BP) algorithm to optimize the objective functions. Denoting θ as the parameters of the RCF or mLSTM model, the BP algorithm iteratively updates θ as

(13)  θ ← θ − λ ∂L/∂θ,

where λ is an adjustable learning rate. The weight matrices W^l(i) and W^h(i) of mWDN are also trainable in Eq. (13). A problem of training parameters with preset initial values like W^l(i) and W^h(i) is that the model may "forget" the initial values (French, 1999) during training. To deal with this, we introduce two regularization terms into the objective function and therefore have

(14)  L* = L + α Σ_i || W^l(i) − W̃^l(i) ||_F^2 + β Σ_i || W^h(i) − W̃^h(i) ||_F^2,

where W̃^l(i) and W̃^h(i) are the same matrices as the initial W^l(i) and W^h(i) except that ε = 0, and α and β are hyper-parameters set to empirical values. Accordingly, the BP algorithm iteratively updates the weight matrices of mWDN as

(15)  W^l(i) ← W^l(i) − λ (∂L/∂W^l(i) + 2α (W^l(i) − W̃^l(i))),
      W^h(i) ← W^h(i) − λ (∂L/∂W^h(i) + 2β (W^h(i) − W̃^h(i))).
In this way, the weights of mWDN will converge to a point near the wavelet-decomposition prior, unless wavelet decomposition is grossly inappropriate for the task.
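The regularized update amounts to an ordinary gradient step plus a pull toward the wavelet-initialized prior. A hedged one-matrix sketch, with a hypothetical function name and scalar hyper-parameters:

```python
def regularized_step(W, grad, W_prior, lam=0.1, alpha=0.01):
    """One BP update of an mWDN weight matrix, adding the prior-regularization
    gradient 2*alpha*(W - W_prior) to the ordinary loss gradient."""
    return [[W[i][j] - lam * (grad[i][j] + 2.0 * alpha * (W[i][j] - W_prior[i][j]))
             for j in range(len(W[0]))] for i in range(len(W))]
```

With a zero loss gradient the weights decay geometrically toward the prior, which is exactly the "remembering" behavior described above.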
Table 1. Classification error rates on 40 UCR datasets.
Dataset  RNN  LSTM  MLP  FCN  ResNet  MLP-RCF  FCN-RCF  ResNet-RCF  Wavelet-RCF
Adiac  0.233  0.341  0.248  0.143  0.174  0.212  0.155  0.151  0.162 
Beef  0.233  0.333  0.167  0.25  0.233  0.06  0.03  0.06  0.06 
CBF  0.189  0.118  0.14  0  0.006  0.056  0  0  0.016 
ChlorineConcentration  0.135  0.16  0.128  0.157  0.172  0.096  0.068  0.07  0.147 
CinCECGtorso  0.333  0.092  0.158  0.187  0.229  0.117  0.014  0.084  0.011 
CricketX  0.449  0.382  0.431  0.185  0.179  0.321  0.216  0.297  0.211 
CricketY  0.415  0.318  0.405  0.208  0.195  0.254  0.172  0.301  0.192 
CricketZ  0.4  0.328  0.408  0.187  0.187  0.313  0.162  0.275  0.162 
DiatomSizeReduction  0.056  0.101  0.036  0.07  0.069  0.013  0.023  0.026  0.028 
ECGFiveDays  0.088  0.417  0.03  0.015  0.045  0.023  0.01  0.035  0.016 
FaceAll  0.247  0.192  0.115  0.071  0.166  0.094  0.098  0.126  0.076 
FaceFour  0.102  0.364  0.17  0.068  0.068  0.102  0.05  0.057  0.058 
FacesUCR  0.204  0.091  0.185  0.052  0.042  0.15  0.087  0.102  0.087 
50words  0.316  0.284  0.288  0.321  0.273  0.316  0.288  0.258  0.3 
FISH  0.126  0.103  0.126  0.029  0.011  0.086  0.021  0.034  0.026 
GunPoint  0.1  0.147  0.067  0  0.007  0.033  0  0.02  0 
Haptics  0.594  0.529  0.539  0.449  0.495  0.480  0.461  0.473  0.476 
InlineSkate  0.667  0.638  0.649  0.589  0.635  0.543  0.566  0.578  0.572 
ItalyPowerDemand  0.055  0.072  0.034  0.03  0.04  0.031  0.023  0.034  0.028 
Lighting2  0  0  0.279  0.197  0.246  0.213  0.145  0.197  0.162 
Lighting7  0.288  0.384  0.356  0.137  0.164  0.179  0.091  0.177  0.144 
MALLAT  0.119  0.127  0.064  0.02  0.021  0.058  0.044  0.046  0.024 
MedicalImages  0.299  0.276  0.271  0.208  0.228  0.251  0.164  0.188  0.206 
MoteStrain  0.133  0.167  0.131  0.05  0.105  0.105  0.076  0.032  0.05 
NonInvasiveFatalECGThorax1  0.09  0.08  0.058  0.039  0.052  0.029  0.026  0.04  0.042 
NonInvasiveFatalECGThorax2  0.069  0.071  0.057  0.045  0.049  0.056  0.028  0.033  0.048 
OliveOil  0.233  0.267  0.6  0.167  0.133  0.03  0  0  0.012 
OSULeaf  0.463  0.401  0.43  0.012  0.021  0.342  0.018  0.021  0.021 
SonyAIBORobotSurface  0.21  0.309  0.273  0.032  0.015  0.193  0.042  0.032  0.052 
SonyAIBORobotSurfaceII  0.219  0.187  0.161  0.038  0.038  0.092  0.064  0.083  0.072 
StarLightCurves  0.027  0.035  0.043  0.033  0.029  0.021  0.018  0.027  0.03 
SwedishLeaf  0.085  0.128  0.107  0.034  0.042  0.089  0.057  0.017  0.046 
Symbols  0.179  0.117  0.147  0.038  0.128  0.126  0.04  0.107  0.084 
TwoPatterns  0.005  0.001  0.114  0.103  0  0.070  0  0  0.005 
uWaveGestureLibraryX  0.224  0.195  0.232  0.246  0.213  0.213  0.218  0.194  0.162 
uWaveGestureLibraryY  0.335  0.265  0.297  0.275  0.332  0.306  0.232  0.296  0.241 
uWaveGestureLibraryZ  0.297  0.259  0.295  0.271  0.245  0.298  0.265  0.204  0.194 
wafer  0  0  0.004  0.003  0.003  0.003  0  0  0 
WordsSynonyms  0.429  0.343  0.406  0.42  0.368  0.391  0.338  0.387  0.314 
yoga  0.202  0.158  0.145  0.155  0.142  0.138  0.112  0.139  0.128 
Winning times  2  2  0  9  6  2  19  7  7 
AVG arithmetic ranking  7.425  6.825  7.2  4.025  4.55  5.15  2.175  3.375  3.075 
AVG geometric ranking  6.860  6.131  7.043  3.101  3.818  4.675  1.789  2.868  2.688 
MPCE  0.039  0.043  0.041  0.023  0.025  0.028  0.017  0.021  0.019 
4. Experiments
In this section, we evaluate the performance of the mWDN-based models on both the TSC and TSF tasks.
4.1. Task I: Time Series Classification
Experimental Setup. The classification performance was tested on 40 datasets from the UCR time series repository (Chen et al., 2015), against the following competitors:

MLP, FCN, and ResNet. These three models were proposed in (Wang et al., 2017c) as strong baselines on the UCR time series datasets. They share the same framework: an input layer, followed by three hidden basic blocks, and finally a softmax output. MLP adopts fully connected layers as its basic blocks, while FCN and ResNet adopt fully convolutional layers and residual convolutional blocks, respectively.

MLP-RCF, FCN-RCF, and ResNet-RCF. These three models use the basic blocks of MLP/FCN/ResNet as the classifiers in Eq. (5). We compare them with MLP/FCN/ResNet to verify the effectiveness of RCF.

Wavelet-RCF. This model has the same structure as ResNet-RCF but replaces the mWDN part with a standard MDWD with fixed parameters. We compare it with ResNet-RCF to verify the effectiveness of the trainable parameters in mWDN.
For each dataset, we ran each model 10 times and report the average classification error rate. To compare the overall performance on all 40 datasets, we further introduce the Mean Per-Class Error (MPCE) as a performance indicator for each competitor (Wang et al., 2017c). Let c_k denote the number of categories in the k-th dataset and e_k the error rate of a model on that dataset. The MPCE of the model over the K datasets is then defined as

(16)  MPCE = (1/K) Σ_{k=1}^{K} e_k / c_k.

Note that the factor of category number is wiped out in MPCE, and a smaller MPCE value indicates a better overall performance.
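MPCE is straightforward to compute; the function name below is ours.

```python
def mpce(error_rates, category_counts):
    """Mean Per-Class Error: average of e_k / c_k over all K datasets."""
    assert len(error_rates) == len(category_counts)
    return sum(e / c for e, c in zip(error_rates, category_counts)) / len(error_rates)
```

Dividing each error rate by the category count prevents many-class datasets from dominating the average.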
Results & Analysis. Table 1 shows the experimental results, with summarized information listed in the bottom rows. Note that in the original table the best performance for each dataset is highlighted in bold and the second best in italic. From the table, we make several interesting observations. First, it is clear that among all the competitors, FCN-RCF achieves the best performance in terms of both the largest number of wins (the best on 19 out of 40 datasets) and the smallest MPCE value. While the baseline FCN itself also achieves a satisfactory performance (the second largest number of wins at 9 and a rather small MPCE value of 0.023), the gap to FCN-RCF is still rather big, implying a significant benefit from adopting our RCF framework. This is not an isolated case: from Table 1, MLP-RCF performs much better than MLP on 37 datasets, and the number for ResNet-RCF against ResNet is 27. This indicates that RCF is indeed a general framework compatible with different types of deep learning classifiers and can sharply improve TSC performance.
Another observation comes from the comparison between Wavelet-RCF and ResNet-RCF. Table 1 shows that Wavelet-RCF achieves the second-best overall performance in terms of MPCE and average rankings, which indicates that the frequency information introduced by wavelet tools is very helpful for time series problems. It is also clear from the table that ResNet-RCF outperforms Wavelet-RCF on most of the datasets. This demonstrates the advantage of our RCF framework in adopting a parameter-trainable mWDN under the deep learning architecture, rather than directly using wavelet decomposition as a feature engineering tool. More technically, compared with Wavelet-RCF, the mWDN-based ResNet-RCF can achieve a good tradeoff between the frequency-domain prior and the likelihood of the training data. This also explains why RCF-based models achieve much better results in the previous observation.
Summary. The above experiments demonstrate the superiority of RCF-based models over state-of-the-art baselines in TSC tasks. They also imply that the trainable parameters in a deep learning architecture and the strong priors from wavelet decomposition are two key factors in the success of RCF.
4.2. Task II: Time Series Forecasting
Experimental Setup. We tested the predictive power of mLSTM in a visitor-volume prediction scenario (Wang et al., 2017b). The experiment adopts a real-life dataset named WuxiCellPhone, which contains user-volume time series of 20 cellphone base stations located in downtown Wuxi during two weeks. Detailed information about the cellphone data can be found in (Wang et al., 2018, 2017a; Song et al., 2017). The time granularity of a user-volume series is 5 minutes. In the experiments, we compared mLSTM with the following baselines:

SAE (Stacked AutoEncoders), which has been used in various TSF tasks (Lv et al., 2015).

RNN (Recurrent Neural Networks) and LSTM (Long ShortTerm Memory), which are specifically designed for time series analysis.

wLSTM, which has the same structure as mLSTM but replaces the mWDN part with a standard MDWD.
We use two metrics to evaluate the performance of the models: Mean Absolute Percentage Error (MAPE) and Root Mean Square Error (RMSE), defined as

(17)  MAPE = (1/n) Σ_{i=1}^{n} |x_i − x̂_i| / x_i,   RMSE = sqrt( (1/n) Σ_{i=1}^{n} (x_i − x̂_i)^2 ),

where x_i is the real value of the i-th sample in a time series and x̂_i is the predicted one. Lower values of both metrics indicate better performance.
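Both metrics take only a few lines of Python (function names ours; MAPE assumes nonzero real values):

```python
def mape(real, pred):
    """Mean Absolute Percentage Error."""
    return sum(abs(r - p) / abs(r) for r, p in zip(real, pred)) / len(real)

def rmse(real, pred):
    """Root Mean Square Error."""
    return (sum((r - p) ** 2 for r, p in zip(real, pred)) / len(real)) ** 0.5
```

MAPE normalizes each error by the true magnitude, while RMSE penalizes large absolute deviations more heavily; reporting both gives a scale-free and a scale-aware view of the same forecasts.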
Results & Analysis. We compared the performance of the competitors in two TSF scenarios suggested in (Wang et al., 2016). In the first scenario, we predict the average user volume of a base station in a subsequent period, whose length varies from 5 to 30 minutes. Fig. 4 compares the performance averaged over the 20 base stations in one week. As can be seen, while all the models experience a gradual decrease in prediction error as the period length increases, mLSTM achieves the best performance among all competitors. In particular, mLSTM is consistently better than wLSTM, which again confirms the benefit of introducing mWDN for time series forecasting.
In the second scenario, we predict the average user volume in the 5 minutes after a given time interval varying from 0 to 30 minutes. Fig. 5 compares mLSTM with the baselines. Different from the trend observed in Scenario I, the prediction errors in Fig. 5 generally increase along the x-axis because of the increasing uncertainty. From Fig. 5 we can see that mLSTM again outperforms wLSTM and the other baselines, which confirms the observations from Scenario I.
Summary. The above experiments demonstrate the superiority of mLSTM over the baselines. The mWDN structure adopted by mLSTM is again an important factor in this success.
5. Interpretation
In this section, we highlight a unique advantage of our mWDN model: its interpretability. Since mWDN embeds a discrete wavelet decomposition, the outputs of its middle layers, i.e., x^l(i) and x^h(i), inherit the physical meanings of wavelet decompositions. We take two datasets for illustration: WuxiCellPhone used in Sect. 4.2 and ECGFiveDays used in Sect. 4.1. Fig. 6(a) shows a sample user-volume series of a cellphone base station in one day, and Fig. 6(b) exhibits an electrocardiogram (ECG) sample.
5.1. The Motivation
Fig. 7 shows the outputs of the mWDN layers in the mLSTM and RCF models fed with the two samples given in Fig. 6, respectively. In Fig. 7(a), we plot the outputs of the first three layers of the mLSTM model as separate subfigures. As can be seen, from the first layer to the third, the outputs of the middle layers correspond to frequency components of the input series running from high to low. A similar phenomenon can be observed in Fig. 7(b), where the outputs of the first three layers of the RCF model are presented. This again indicates that the middle layers of mWDN inherit the frequency decomposition function of wavelets. A natural question then arises: can we quantitatively evaluate which layer, i.e., which frequency of a time series, is more important to the final output of the mWDN-based models? If so, this would provide valuable interpretability for our mWDN model.
5.2. Importance Analysis
We here introduce an importance analysis method for the proposed mWDN model, which aims to quantify the importance of each middle layer to the final output of the mWDN based models.
We denote the problem of time series classification/forecasting with a neural network model as

(18)  ŷ = M(x),

where M denotes the neural network, x denotes the input series, and ŷ is the prediction. Given a well-trained model M, if a small disturbance of the i-th element x(i) causes a large change of the output ŷ, we say ŷ is sensitive to x(i). The sensitivity of the network to the i-th element of the input series is therefore defined as the partial derivative of ŷ with respect to x(i):

(19)  S(x(i)) = ∂ŷ / ∂x(i).

Obviously, S(x(i)) is also a function of x for a given model M. Given a training data set D with N training samples, the importance of the i-th element of the input series to the model is defined as

(20)  I(x(i)) = (1/N) Σ_{n=1}^{N} |S(x_n(i))|,

where x_n(i) is the value of the i-th element in the n-th training sample.
The importance definition in Eq. (20) can be extended to the middle layers of the mWDN model. Denoting a as the output of a middle layer in mWDN, the neural network can be rewritten as

(21)  ŷ = M'(x, a),

and the sensitivity of ŷ to a is then defined as

(22)  S(a) = ∂ŷ / ∂a.

Given a training data set D, the importance of a w.r.t. M is calculated as

(23)  I(a) = (1/N) Σ_{n=1}^{N} |S(a_n)|,

where a_n is the middle-layer output for the n-th training sample.
For concision, the calculations of S(x(i)) and S(a) in Eq. (19) and Eq. (22) are given in the Appendix. Eq. (20) and Eq. (23) respectively define the importance of a time-series element and of an mWDN layer to an mWDN-based model.
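When analytic derivatives are unavailable, the sensitivities can be approximated by central finite differences. This sketch treats the model as a black-box callable and is a stand-in for, not a reproduction of, the backpropagation-based computation in the Appendix; all names are ours.

```python
def sensitivity(model, x, i, h=1e-5):
    """Approximate d model(x) / d x[i] by a central difference."""
    xp, xm = list(x), list(x)
    xp[i] += h
    xm[i] -= h
    return (model(xp) - model(xm)) / (2.0 * h)

def importance(model, dataset, i):
    """Average absolute sensitivity of element i over the training data."""
    return sum(abs(sensitivity(model, x, i)) for x in dataset) / len(dataset)
```

For a differentiable model the same quantities would be read off the input gradients directly; the finite-difference form is merely the simplest way to probe a black box.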
5.3. Experimental Results
Fig. 8 and Fig. 9 show the results of the importance analysis. In Fig. 8, we use the mLSTM model trained on WuxiCellPhone in Sect. 4.2. Fig. 8(b) exhibits the importance spectrum of all the elements, where the x-axis denotes the increasing timestamps and the colors in the spectrum denote the varying importance of the features: the redder, the more important. From the spectrum, we can see that the latest elements are more important than the older ones, which is quite reasonable in the scenario of time series forecasting and justifies the time value of information.
Fig. 8(a) exhibits the importance spectra of the middle layers, listed from top to bottom in increasing order of frequency. Note that for the sake of comparison, we resize the outputs to the same length. From the figure, we can observe that 1) the lower-frequency layers at the top have higher importance, and 2) only the layers with higher importance exhibit the time value of the elements as in Fig. 8(b). These observations imply that the low-frequency layers in mWDN are crucially important to the success of time series forecasting. This is not difficult to understand, since the information captured by the low-frequency layers often characterizes the essential tendency of human activities and is therefore of great use in revealing the future.
Fig. 9 depicts the importance spectra of the RCF model trained on the ECGFiveDays dataset in Sect. 4.1. As shown in Fig. 9(b), the most important elements are located roughly in the range 100 to 110 of the time axis, which is quite different from Fig. 8(b). To understand this, recall from Fig. 6(b) that this range corresponds to the T-wave of the electrocardiogram, covering the period when the heart relaxes and prepares for the next contraction. It is generally believed that abnormalities in the T-wave can indicate seriously impaired physiological functioning (https://en.wikipedia.org/wiki/T_wave). As a result, the elements describing the T-wave are more important to the classification task.
Fig. 9(a) shows the importance spectra of the middle layers, again listed from top to bottom in increasing order of frequency. Interestingly, the phenomenon is opposite to the one in Fig. 8(a): the high-frequency layers are more important to the classification task on ECGFiveDays. To understand this, note that the general trends of ECG curves captured by the low-frequency layers are very similar for everyone, whereas the abnormal fluctuations captured by the high-frequency layers are the truly distinguishing information for heart disease identification. This also indicates the difference between a time-series classification task and a time-series forecasting task.
Summary. The experiments in this section demonstrate the interpretability advantage of the mWDN model, stemming from the integration of wavelet decomposition and our proposed importance analysis method. They can also be regarded as an in-depth exploration of the black-box problem of deep learning.
6. Related Works
Time Series Classification (TSC). The target of TSC is to assign a time series pattern to a specific category, e.g., to identify a word based on a series of voice signals. Traditional TSC methods can be classified into three major categories: distance-based, feature-based, and ensemble methods (Cui et al., 2016). Distance-based methods predict the category of a time series by comparing its distance or similarity to other labeled series. The widely used TSC distances include the Euclidean distance and dynamic time warping (DTW) (Berndt and Clifford, 1994), and DTW with a kNN classifier was the state-of-the-art TSC method for a long time (Keogh and Ratanamahatana, 2005). A defect of distance-based TSC methods is their relatively high computational complexity. Feature-based methods overcome this defect by training classifiers on deterministic features and category labels of time series. Traditional methods, however, usually depend on handcrafted features as inputs, such as symbolic aggregate approximation and interval mean/deviation/slope (Lin et al., 2003; Deng et al., 2013). In recent years, automatic feature engineering has been introduced to TSC, such as time-series shapelet mining (Grabocka et al., 2014), attention (Qin et al., 2017), and deep representation learning (Längkvist et al., 2014). Our study also falls into this area but with frequency awareness. Well-known ensemble methods for TSC include PROP (Lines and Bagnall, 2015) and COTE (Bagnall et al., 2015), which aim to improve classification performance via knowledge integration. As reported by some latest works (Cui et al., 2016; Wang et al., 2017c), however, existing ensemble methods are still inferior to some deep learning based methods.
Time Series Forecasting (TSF). TSF refers to predicting future values of a time series using past and present data, and is widely adopted in nearly all application domains (Wang et al., 2014b, a). A classic model is the autoregressive integrated moving average (ARIMA) (Box and Pierce, 1970), with a great many variants, e.g., ARIMA with explanatory variables (ARIMAX) (Lee and Fambro, 1999) and seasonal ARIMA (SARIMA) (Williams and Hoel, 2003), designed to meet the requirements of various applications. In recent years, a tendency of TSF research is to introduce supervised learning methods, such as support vector regression (Jeong et al., 2013) and deep neural networks (Zhang, 2003), for modeling the complicated nonlinear correlations between past and future states of time series.
Two well-known deep neural network structures for TSF are recurrent neural networks (RNN) (Connor et al., 1994) and long short-term memory (LSTM) (Gers et al., 2002). These works indicate that an elaborate model design is crucially important for achieving excellent forecasting performance.
Frequency Analysis of Time Series. Frequency analysis of time series data has been deeply studied by the signal processing community. Many classical methods, such as the discrete wavelet transform (Mallat, 1989), the discrete Fourier transform (Harris, 1978), and the Z-transform (Jury, 1964), have been proposed to analyze the frequency patterns of time series signals. In existing TSC/TSF applications, however, such transforms are usually used as an independent step in data preprocessing (Cui et al., 2016; Liu et al., 2013), which has no interaction with model training and therefore might not be optimized for the TSC/TSF task from a global view. In recent years, some works, such as Clockwork RNN (Koutnik et al., 2014) and SFM (Hu and Qi, 2017), have begun to introduce frequency analysis into the deep learning framework. To the best of our knowledge, our study is among the very few works that embed wavelet time series transforms as a part of neural networks so as to achieve end-to-end learning.
7. Conclusions
In this paper, we aimed at building frequency-aware deep learning models for time series analysis. To this end, we first designed a novel wavelet-based network structure called mWDN for frequency learning of time series, which can be seamlessly embedded into deep learning frameworks because all its parameters are trainable. We further designed two deep learning models based on mWDN for time series classification and forecasting, respectively, and extensive experiments on abundant real-world datasets demonstrated their superiority over state-of-the-art competitors. As a step toward interpretable deep learning, we further proposed an importance analysis method for identifying the factors important to time series analysis, which in turn verifies the interpretability merit of mWDN.
References
 Bagnall et al. (2015) Anthony Bagnall, Jason Lines, Jon Hills, and Aaron Bostrom. 2015. Time-series classification with COTE: the collective of transformation-based ensembles. IEEE TKDE 27, 9 (2015), 2522–2535.
 Berndt and Clifford (1994) Donald J Berndt and James Clifford. 1994. Using dynamic time warping to find patterns in time series. In KDD ’94, Vol. 10. Seattle, WA, 359–370.
 Box and Pierce (1970) George EP Box and David A Pierce. 1970. Distribution of residual autocorrelations in autoregressive-integrated moving average time series models. Journal of the American Statistical Association 65, 332 (1970), 1509–1526.
 Chen et al. (2015) Yanping Chen, Eamonn Keogh, Bing Hu, Nurjahan Begum, Anthony Bagnall, Abdullah Mueen, and Gustavo Batista. 2015. The UCR Time Series Classification Archive. www.cs.ucr.edu/~eamonn/time_series_data/.
 Connor et al. (1994) Jerome T Connor, R Douglas Martin, and Les E Atlas. 1994. Recurrent neural networks and robust time series prediction. IEEE T NN 5, 2 (1994), 240–254.
 Cui et al. (2016) Zhicheng Cui, Wenlin Chen, and Yixin Chen. 2016. Multiscale convolutional neural networks for time series classification. arXiv preprint arXiv:1603.06995 (2016).
 Daubechies (1992) Ingrid Daubechies. 1992. Ten lectures on wavelets. SIAM.

 Deng et al. (2013) Houtao Deng, George Runger, Eugene Tuv, and Martyanov Vladimir. 2013. A time series forest for classification and feature extraction. Information Sciences 239 (2013), 142–153.
 French (1999) Robert M French. 1999. Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences 3, 4 (1999), 128–135.
 Gers et al. (2002) Felix A Gers, Douglas Eck, and Jürgen Schmidhuber. 2002. Applying LSTM to time series predictable through time-window approaches. In Neural Nets WIRN Vietri-01. Springer, 193–200.
 Grabocka et al. (2014) Josif Grabocka, Nicolas Schilling, Martin Wistuba, and Lars Schmidt-Thieme. 2014. Learning time-series shapelets. In KDD ’14. ACM, 392–401.
 Harris (1978) Fredric J Harris. 1978. On the use of windows for harmonic analysis with the discrete Fourier transform. Proc. IEEE 66, 1 (1978), 51–83.
 He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In CVPR ’16. 770–778.
 Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.

 Hu and Qi (2017) Hao Hu and Guo-Jun Qi. 2017. State-Frequency Memory Recurrent Neural Networks. In International Conference on Machine Learning. 1568–1577.
 Jeong et al. (2013) Young-Seon Jeong, Young-Ji Byon, Manoel Mendonca Castro-Neto, and Said M Easa. 2013. Supervised weighting-online learning algorithm for short-term traffic flow prediction. IEEE T ITS 14, 4 (2013), 1700–1707.
 Jury (1964) Eliahu Ibraham Jury. 1964. Theory and Application of the z-Transform Method. (1964).
 Keogh and Ratanamahatana (2005) Eamonn Keogh and Chotirat Ann Ratanamahatana. 2005. Exact indexing of dynamic time warping. Knowledge and Information Systems 7, 3 (2005), 358–386.
 Koutnik et al. (2014) Jan Koutnik, Klaus Greff, Faustino Gomez, and Juergen Schmidhuber. 2014. A Clockwork RNN. In International Conference on Machine Learning. 1863–1871.
 Längkvist et al. (2014) Martin Längkvist, Lars Karlsson, and Amy Loutfi. 2014. A review of unsupervised feature learning and deep learning for time-series modeling. Pattern Recognition Letters 42 (2014), 11–24.
 Lee and Fambro (1999) Sangsoo Lee and Daniel Fambro. 1999. Application of subset autoregressive integrated moving average model for short-term freeway traffic volume forecasting. Transportation Research Record: Journal of the Transportation Research Board 1678 (1999), 179–188.
 Lin et al. (2003) Jessica Lin, Eamonn Keogh, Stefano Lonardi, and Bill Chiu. 2003. A symbolic representation of time series, with implications for streaming algorithms. In SIGMOD’03 workshop on Research issues in DMKD. ACM, 2–11.
 Lines and Bagnall (2015) Jason Lines and Anthony Bagnall. 2015. Time series classification with ensembles of elastic distance measures. Data Mining and Knowledge Discovery 29, 3 (2015), 565–592.
 Liu et al. (2013) Hui Liu, Hongqi Tian, Difu Pan, and Yanfei Li. 2013. Forecasting models for wind speed using wavelet, wavelet packet, time series and Artificial Neural Networks. Applied Energy 107 (2013), 191–208.
 Lv et al. (2015) Yisheng Lv, Yanjie Duan, Wenwen Kang, Zhengxi Li, and FeiYue Wang. 2015. Traffic flow prediction with big data: A deep learning approach. IEEE T ITS 16, 2 (2015), 865–873.
 Mallat (1989) Stephane G Mallat. 1989. A theory for multiresolution signal decomposition: the wavelet representation. IEEE T PAMI 11, 7 (1989), 674–693.
 Qin et al. (2017) Yao Qin, Dongjin Song, Haifeng Cheng, Wei Cheng, Guofei Jiang, and Garrison Cottrell. 2017. A dual-stage attention-based recurrent neural network for time series prediction. arXiv preprint arXiv:1704.02971 (2017).
 Rajpurkar et al. (2017) Pranav Rajpurkar, Awni Y Hannun, Masoumeh Haghpanahi, Codie Bourn, and Andrew Y Ng. 2017. Cardiologist-Level Arrhythmia Detection with Convolutional Neural Networks. arXiv preprint arXiv:1707.01836 (2017).
 Rowe and Abbott (1995) Alistair CH Rowe and Paul C Abbott. 1995. Daubechies wavelets and mathematica. Computers in Physics 9, 6 (1995), 635–648.
 Song et al. (2017) Xin Song, Yuanxin Ouyang, Bowen Du, Jingyuan Wang, and Zhang Xiong. 2017. Recovering Individual's Commute Routes Based on Mobile Phone Data. Mobile Information Systems 2017 (2017), 1–11.
 Wang et al. (2017a) Jingyuan Wang, Chao Chen, Junjie Wu, and Zhang Xiong. 2017a. No Longer Sleeping with a Bomb: A Duet System for Protecting Urban Safety from Dangerous Goods. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1673–1681.
 Wang et al. (2014a) Jingyuan Wang, Fei Gao, Peng Cui, Chao Li, and Zhang Xiong. 2014a. Discovering urban spatiotemporal structure from timeevolving traffic networks. In Proceedings of the 16th AsiaPacific Web Conference. Springer International Publishing, 93–104.
 Wang et al. (2016) Jingyuan Wang, Qian Gu, Junjie Wu, Guannan Liu, and Zhang Xiong. 2016. Traffic speed prediction and congestion source exploration: A deep learning method. In Data Mining (ICDM), 2016 IEEE 16th International Conference on. IEEE, 499–508.

 Wang et al. (2018) Jingyuan Wang, Xu He, Ze Wang, Junjie Wu, Nicholas Jing Yuan, Xing Xie, and Zhang Xiong. 2018. CD-CNN: A Partially Supervised Cross-Domain Deep Learning Model for Urban Resident Recognition. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence.
 Wang et al. (2017b) Jingyuan Wang, Yating Lin, Junjie Wu, Zhong Wang, and Zhang Xiong. 2017b. Coupling Implicit and Explicit Knowledge for Customer Volume Prediction. In Proceedings of the 31st AAAI Conference on Artificial Intelligence. 1569–1575.
 Wang et al. (2014b) Jingyuan Wang, Yu Mao, Jing Li, Zhang Xiong, and WenXu Wang. 2014b. Predictability of road traffic and congestion in urban areas. Plos One 10, 4 (2014), e0121825.
 Wang et al. (2015) Liwei Wang, ChenYu Lee, Zhuowen Tu, and Svetlana Lazebnik. 2015. Training deeper convolutional networks with deep supervision. arXiv preprint arXiv:1505.02496 (2015).
 Wang et al. (2017c) Zhiguang Wang, Weizhong Yan, and Tim Oates. 2017c. Time series classification from scratch with deep neural networks: A strong baseline. In IJCNN ’17. IEEE, 1578–1585.
 Williams and Hoel (2003) Billy M Williams and Lester A Hoel. 2003. Modeling and forecasting vehicular traffic flow as a seasonal ARIMA process: Theoretical basis and empirical results. Journal of Transportation Engineering 129, 6 (2003), 664–672.
 Williams and Zipser (1989) Ronald J Williams and David Zipser. 1989. A learning algorithm for continually running fully recurrent neural networks. Neural computation 1, 2 (1989), 270–280.
 Zhang (2003) G Peter Zhang. 2003. Time series forecasting using a hybrid ARIMA and neural network model. Neurocomputing 50 (2003), 159–175.
 Zheng et al. (2016) Yi Zheng, Qi Liu, Enhong Chen, Yong Ge, and J Leon Zhao. 2016. Exploiting multichannels deep convolutional neural networks for multivariate time series classification. Frontiers of Computer Science 10, 1 (2016), 96–112.
Appendix
In a neural network model, the outputs of the $l$-th layer are connected as the inputs of the $(l+1)$-th layer. According to the chain rule, the partial derivative of the model output $\hat{y}$ with respect to middle-layer outputs can be calculated layer by layer as

(24) $\frac{\partial \hat{y}}{\partial a_i^{(l)}} = \sum_j \frac{\partial \hat{y}}{\partial a_j^{(l+1)}} \cdot \frac{\partial a_j^{(l+1)}}{\partial a_i^{(l)}},$

where $a_i^{(l)}$ is the $i$-th output of the $l$-th layer. The proposed models contain three types of layers: the convolutional, LSTM, and fully connected layers, which are discussed below.
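The layer-by-layer rule in Eq. (24) can be sanity-checked numerically. The toy network below is our own illustration (random weights, not the paper's architecture): it compares the chain-rule gradient of the output with respect to a hidden activation against a finite-difference estimate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(1, 4))

def forward(x, a1_override=None):
    # Hidden layer a1 = sigmoid(W1 x); linear output y_hat = W2 a1.
    a1 = sigmoid(W1 @ x) if a1_override is None else a1_override
    return a1, (W2 @ a1).item()

x = rng.normal(size=3)
a1, y_hat = forward(x)

# Chain rule: d y_hat / d a1_i = sum_j (d y_hat / d a2_j) * W2[j, i];
# with a single linear output unit this is just the row of W2.
grad_chain = W2.ravel()

# Finite-difference estimate of the same derivative.
eps = 1e-6
grad_fd = np.empty(4)
for i in range(4):
    bumped = a1.copy()
    bumped[i] += eps
    grad_fd[i] = (forward(x, bumped)[1] - y_hat) / eps

print(np.allclose(grad_chain, grad_fd, atol=1e-4))  # True
```

The same check extends layer by layer: multiplying the per-layer Jacobians reproduces the derivative of the output with respect to any intermediate activation.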
For convolutional layers, only the 1D convolutional operation is used in our cases. The output of the $l$-th layer is connected to the $(l+1)$-th layer through a convolutional kernel of size $K$, i.e., $a_j^{(l+1)} = \sigma\big(z_j^{(l+1)}\big)$ with $z_j^{(l+1)} = \sum_{k=1}^{K} w_k\, a_{j+k-1}^{(l)} + b$. The partial derivative of $a_j^{(l+1)}$ with respect to the output of the $l$-th layer is calculated as

$\frac{\partial a_j^{(l+1)}}{\partial a_i^{(l)}} = w_{i-j+1}\, \sigma'\big(z_j^{(l+1)}\big), \qquad j \le i \le j + K - 1,$

where $w_k$ denotes the $k$-th element of the convolutional kernel and $\sigma'(\cdot)$ is the derivative of the activation function; the derivative is zero when $a_i^{(l)}$ lies outside the receptive field of $a_j^{(l+1)}$.
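The convolutional case can be made concrete with a short sketch (function and variable names are ours): a valid 1D convolution with a sigmoid activation, whose analytic derivative with respect to one input is checked against a numerical estimate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv1d_forward(a, w, b=0.0):
    """Valid 1D convolution: z_j = sum_k w_k * a_{j+k} + b (0-based), output sigmoid(z)."""
    K = len(w)
    z = np.array([np.dot(w, a[j:j + K]) + b for j in range(len(a) - K + 1)])
    return sigmoid(z), z

rng = np.random.default_rng(1)
a = rng.normal(size=10)   # outputs of layer l
w = rng.normal(size=3)    # kernel of size K = 3
out, z = conv1d_forward(a, w)

# Analytic derivative of output j w.r.t. input i: w[i - j] * sigmoid'(z_j),
# nonzero only for j <= i <= j + K - 1 (0-based indexing).
j, i = 2, 4
analytic = w[i - j] * out[j] * (1.0 - out[j])

# Numerical check via a small perturbation of input i.
eps = 1e-6
a_bumped = a.copy()
a_bumped[i] += eps
numeric = (conv1d_forward(a_bumped, w)[0][j] - out[j]) / eps

print(abs(analytic - numeric) < 1e-4)  # True
```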
For LSTM layers, we denote the output of an LSTM unit in layer $l+1$ at time $t$ as

$h_t = o_t \circ \tanh(c_t),$

where $c_t$ is calculated as

$c_t = f_t \circ c_{t-1} + i_t \circ g_t,$

with $i_t$, $f_t$, $o_t$ the input, forget, and output gates, $g_t$ the candidate state, and $x_t$ the input from the $l$-th layer; $c_{t-1}$ is the history state that is saved in the memory cell. Therefore, the partial derivative of $h_t$ with respect to the input $x_t$ is calculated as

$\frac{\partial h_t}{\partial x_t} = \tanh(c_t) \circ \frac{\partial o_t}{\partial x_t} + o_t \circ \big(1 - \tanh^2(c_t)\big) \circ \frac{\partial c_t}{\partial x_t},$

where $\frac{\partial c_t}{\partial x_t}$ is an equation as

$\frac{\partial c_t}{\partial x_t} = c_{t-1} \circ \frac{\partial f_t}{\partial x_t} + g_t \circ \frac{\partial i_t}{\partial x_t} + i_t \circ \frac{\partial g_t}{\partial x_t}.$

The gate derivatives in the above equation are calculated as

$\frac{\partial i_t}{\partial x_t} = i_t \circ (1 - i_t) \circ W_i, \quad \frac{\partial f_t}{\partial x_t} = f_t \circ (1 - f_t) \circ W_f, \quad \frac{\partial o_t}{\partial x_t} = o_t \circ (1 - o_t) \circ W_o, \quad \frac{\partial g_t}{\partial x_t} = \big(1 - g_t^2\big) \circ W_g,$

where $W_i$, $W_f$, $W_o$, and $W_g$ are the input weights of the corresponding gates.
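The LSTM derivative above can be checked numerically for a single scalar unit. The sketch below is our own illustration with randomly drawn parameters (names like `wi`, `ui` are ours): it computes the analytic dh/dx with the history state held fixed and compares it to a finite-difference estimate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_h(x, h_prev, c_prev, p):
    """Scalar LSTM unit: returns h_t = o_t * tanh(c_t) and the intermediates."""
    i = sigmoid(p['wi'] * x + p['ui'] * h_prev + p['bi'])   # input gate
    f = sigmoid(p['wf'] * x + p['uf'] * h_prev + p['bf'])   # forget gate
    o = sigmoid(p['wo'] * x + p['uo'] * h_prev + p['bo'])   # output gate
    g = np.tanh(p['wg'] * x + p['ug'] * h_prev + p['bg'])   # candidate state
    c = f * c_prev + i * g                                  # memory cell
    return o * np.tanh(c), (i, f, o, g, c)

rng = np.random.default_rng(2)
p = {k: rng.normal() for k in
     ('wi', 'ui', 'bi', 'wf', 'uf', 'bf', 'wo', 'uo', 'bo', 'wg', 'ug', 'bg')}
x, h_prev, c_prev = 0.3, 0.1, -0.2
h, (i, f, o, g, c) = lstm_h(x, h_prev, c_prev, p)

# Analytic dh/dx with h_prev and c_prev held fixed (the saved history state):
dc_dx = (c_prev * f * (1 - f) * p['wf']
         + g * i * (1 - i) * p['wi']
         + i * (1 - g ** 2) * p['wg'])
do_dx = o * (1 - o) * p['wo']
dh_dx = np.tanh(c) * do_dx + o * (1 - np.tanh(c) ** 2) * dc_dx

# Finite-difference check on the input x.
eps = 1e-6
h_eps, _ = lstm_h(x + eps, h_prev, c_prev, p)
print(abs(dh_dx - (h_eps - h) / eps) < 1e-4)  # True
```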
For fully connected layers, the output $a_j^{(l+1)} = \sigma\big(z_j^{(l+1)}\big)$ with $z_j^{(l+1)} = \sum_i w_{ji}\, a_i^{(l)} + b_j$. Then the partial derivative is equal to

$\frac{\partial a_j^{(l+1)}}{\partial a_i^{(l)}} = w_{ji}\, \sigma'\big(z_j^{(l+1)}\big).$