Sensors have become cheaper and more prevalent in recent years, which has motivated the broad use of large amounts of time series data. For example, non-invasive, continuous, high-resolution vital-sign data, such as electrocardiography (ECG) and photoplethysmography (PPG), are commonly used in hospital settings to monitor patients more closely and to optimize early care. Industrial time series help engineers predict potential failures and prepare for them early. We formulate these tasks as regular multivariate time series classification/learning problems. Compared with univariate time series, multivariate time series are more ubiquitous and provide more patterns and insight into the underlying phenomena, which helps improve classification performance. Therefore, multivariate time series classification is becoming increasingly important in a broad range of applications, such as industrial inspection and clinical monitoring.
Multivariate time series data are characterized not only by individual attributes, but also by the relationships between the attributes banko2012correlation . Such information is not captured by the similarity between the individual sequences weng2008classification . To deal with the classification problem on multivariate time series, several similarity measures, including Edit distance with Real Penalty (ERP) and Time Warping Edit Distance (TWED), have been summarized and tested on several benchmark datasets Lin2012pattern . Recently, a symbolic representation for multivariate time series classification (SMTS) was proposed; SMTS builds a tree learner with two ensembles to learn the segmentations and a high-dimensional codebook baydogan2014learning . Mining core features for early classification (MCFEC) along the sequence was proposed to capture the shapelets in each channel independently he2015early . While these methods provide new perspectives on handling multivariate data, some are time consuming (e.g. SMTS), and some are effective but cannot address the curse of dimensionality (e.g. distances on raw data).
Compared with the sequence-distance based approaches, feature-based approaches avoid tricky hand-crafted features because they learn a hierarchical feature representation from raw data automatically. However, feature learning approaches have so far been limited to the supervised setting, with few comparisons against distance-based learning approaches (e.g. zheng2014time ). The method described in wang2015pooling is simple but not fully automated, because the weighting scheme still has to be designed manually.
Our work provides a new perspective on learning hidden representations with deconvolutional networks in a self-supervised way, and hence on fully exploiting unlabeled data, especially when the data is large. We design the network structure to capture the cross-channel correlation with convolutions, forcing the pooling operation to perform dimension reduction along each position of the individual channels. Inspired by discretization approaches such as Symbolic Aggregate Approximation (SAX) and its variations lin2003symbolic ; sun2014improvement ; wang2015pooling and the Markov matrix wang2015imaging , we further show how this representation helps in classification and visualization tasks. A full comparison with sequence-distance based approaches is provided to demonstrate the effectiveness of our approach.
2 Background and Related Work
2.1 Deep Neural Networks
One successful deep learning architecture used in computer vision is the convolutional neural network (CNN) lecun1998gradient . CNNs exploit translational invariance by extracting features through receptive fields hubel1962receptive and learning with weight sharing, and have become the state-of-the-art approach in various image recognition and computer vision tasks krizhevsky2012imagenet .
The most exciting advances come from the exploration of unsupervised learning algorithms for generative models, such as Deep Belief Networks (DBN) and Denoising Auto-encoders (DA) hinton2006fast ; vincent2008extracting ; hausler2013temporal .
A training strategy inspired by recent work on optimization-based learning has been proposed to train complex neural networks for imputation tasks brakel2013training . A generalized Denoising Auto-encoder extends the theoretical framework and is applied to Deep Generative Stochastic Networks (DGSN) bengio2013generalized ; bengio2013deep .
Deconvolution and Topographic Independent Component Analysis (TICA) have been integrated as unsupervised pretraining approaches to learn more diverse features with complex invariance zeiler2010deconvolutional ; ngiam2010tiled ; wang2016efficient . We use deconvolution to capture both the temporal and cross-channel correlation in multivariate time series, rather than to pretrain a supervised model.
2.2 Discretization and Visualization for Time Series
Time series discretization is broadly used in symbolic-approximation based approaches. Aligned Cluster Analysis (ACA) was introduced as an unsupervised method to cluster the temporal patterns of human motion data zhou2008aligned . It is an extension of kernel k-means clustering but is computationally demanding. Persist is an unsupervised discretization method that maximizes the persistence measurement of each symbol morchen2006finding . Piecewise Aggregate Approximation (PAA) was proposed by Keogh et al. keogh2001dimensionality to reduce the dimensionality of time series, and was later extended to Symbolic Aggregate Approximation (SAX) lin2003symbolic . In SAX, each aggregated value produced by PAA is mapped to one of several equiprobable intervals under the standard normal distribution to produce a sequence of symbols. Among these symbolic approaches, SAX has become one of the de facto standards for discretizing time series and is at the core of many effective classification algorithms.
The principal idea of SAX is to smooth the input time series using Piecewise Aggregate Approximation (PAA) and to assign symbols to the PAA bins. The overall trend of the time series is thus extracted as a sequence of symbols.
The algorithm requires three parameters: window length $n$, number of symbols $w$ and alphabet size $a$. Different parameters lead to different representations of the time series. Given a normalized time series of length $L$, we first reduce the dimensionality by dividing it into non-overlapping sliding windows with skip size 1. Each sliding window is partitioned into $w$ subwindows. Mean values are computed to reduce volume and smooth the noise. The PAA values are then mapped to a probability density function, which is divided into several equiprobable segments. Letters starting from A to Z are assigned to each PAA value according to its corresponding segment (Figure 1).
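The SAX steps described above can be sketched in a few lines of Python. This is only a minimal illustration under our own assumptions, not the implementation used in the experiments: the function names (`znormalize`, `paa`, `gaussian_breakpoints`, `sax_word`, `sax_words`) are hypothetical, windows are taken as non-overlapping, and a value that falls exactly on a breakpoint is assigned to the lower segment.

```python
from statistics import NormalDist, mean, pstdev


def znormalize(series):
    """Standardize a series to zero mean and unit variance."""
    mu = mean(series)
    sigma = pstdev(series)
    if sigma == 0:
        return [0.0 for _ in series]
    return [(x - mu) / sigma for x in series]


def paa(window, num_symbols):
    """Piecewise Aggregate Approximation: mean of each sub-window.

    Assumes len(window) >= num_symbols so that no sub-window is empty.
    """
    n = len(window)
    means = []
    for k in range(num_symbols):
        start = k * n // num_symbols
        end = (k + 1) * n // num_symbols
        means.append(mean(window[start:end]))
    return means


def gaussian_breakpoints(alphabet_size):
    """Cut points that split the standard normal into equiprobable segments."""
    nd = NormalDist()
    return [nd.inv_cdf(i / alphabet_size) for i in range(1, alphabet_size)]


def sax_word(window, num_symbols, alphabet_size):
    """Map one (already normalized) window to a word of num_symbols letters."""
    breakpoints = gaussian_breakpoints(alphabet_size)
    letters = []
    for value in paa(window, num_symbols):
        index = sum(value > b for b in breakpoints)  # segment index, 0-based
        letters.append(chr(ord("A") + index))        # alphabet_size <= 26
    return "".join(letters)


def sax_words(series, window_length, num_symbols, alphabet_size):
    """Normalize the series and convert each non-overlapping window to a word."""
    z = znormalize(series)
    starts = range(0, len(z) - window_length + 1, window_length)
    return [sax_word(z[s:s + window_length], num_symbols, alphabet_size)
            for s in starts]
```

For instance, `sax_words(series, 4, 2, 4)` splits the normalized series into windows of four points, averages each window into two PAA values, and maps every value to one of four letters.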
On the other hand, reformulating time series as visual clues has attracted much attention in computer science and physics, where the discretization method plays an important role. A typical example is acoustic/speech data, which is commonly represented by Mel-frequency cepstral coefficients (MFCCs) or Perceptual Linear Prediction (PLP) to explicitly represent the temporal and frequency information. Researchers have also built different network structures from time series for visual inspection or for designing distance measures. Recurrence networks were proposed to analyze the structural properties of time series from complex systems donner2010recurrence ; donner2011recurrence . They build adjacency matrices from predefined recurrence functions to interpret the time series as complex networks. Silva et al. extended the recurrence plot paradigm to time series classification using compression distance silva2013time . Another way to build a weighted adjacency matrix is to extract transition dynamics from the first-order Markov matrix campanharo2011duality . Although these maps demonstrate distinct topological properties among different time series, it remains unclear how these topological properties relate to the original time series, since they have no exact inverse operations. wang2015imaging proposed a generalized Markovian encoding to map the complex correlations in the time series into images while preserving the temporal information.
To give an intuition of how our learned features are shaped, we simply build the Markov matrix to visualize the topology of the resulting complex networks, as described in campanharo2011duality .
3 Representation Learning Using Deconvolutional Networks
Deconvolutional networks have a mathematical form similar to that of convolutional networks. The difference is that deconvolutional networks contain the 'inverse' operations of convolution and pooling for reconstruction.
Convolutional layers connect multiple input activations within a filter window to a single activation. In contrast, deconvolutional layers associate a single input activation with multiple outputs (Figure 2). The output of a deconvolutional layer is an enlarged and dense feature map. In practice, we crop the boundary of the enlarged feature map to keep the size of the output map identical to that of the preceding unpooling layer.
The learned filters in deconvolutional layers act as bases that reconstruct the shape of the input. Thus, as in a convolutional network, a hierarchical structure of deconvolutional layers is used to capture different levels of shape detail. The low-level filters tend to capture detailed, fine-grained features, while the filters in higher layers tend to capture more abstract features. The network therefore directly takes specific shape information into account for multi-scale feature capturing, which is often ignored in other approaches based only on convolutional layers.
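For the deconvolution operation, the feature maps are calculated as

$$h_j = f\Big(\sum_i \mathrm{padding}(x_i) * W_{ij} + b_j\Big),$$

where $x_i$ represents the $i$-th element of the input and $h_j$ denotes the $j$-th filter map after convolution and activation. Here $W_{ij}$ and $b_j$ are the filter weights and bias, $*$ denotes convolution and $f$ is the activation function. The padding function pads $x_i$ with zeros to keep the output size the same as the input. After deconvolution, $h_j$ is processed through a pooling layer. The reconstruction $\hat{x}_i$ of $x_i$ is built from $h_j$ in a reversed procedure of convolution. The reconstruction takes the form

$$\hat{x}_i = f\Big(\sum_j \hat{h}_j * \tilde{W}_{ij} + c_i\Big),$$

where $c_i$ is the bias term for the reconstruction $\hat{x}_i$, $\hat{h}_j$ is the feature map extracted by the unpooling layer and $\tilde{W}_{ij}$ denotes the corresponding reconstruction filter. The gradient of each parameter can be obtained by back propagation and propagated to the preceding layers in an end-to-end manner until convergence.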
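As a rough illustration of the zero padding and cross-channel summation described above, the following Python sketch computes a single filter map from a multi-channel signal. It is a sketch under our own assumptions rather than the network code: the names `pad_zeros` and `feature_map` are hypothetical, one 1-D kernel per input channel is assumed, the kernel is slid without flipping (as is common in deep learning libraries), and ReLU is used as the activation.

```python
def relu(value):
    """Rectified linear activation."""
    return value if value > 0.0 else 0.0


def pad_zeros(signal, width):
    """Zero-pad so that sliding a kernel of this width keeps the input length."""
    left = (width - 1) // 2
    right = width - 1 - left
    return [0.0] * left + list(signal) + [0.0] * right


def feature_map(channels, kernels, bias):
    """Compute one filter map from several channels.

    channels: list of equally long 1-D signals (one per input channel)
    kernels:  list of equally long 1-D kernels (one per input channel)
    The per-channel responses are summed, so the map mixes temporal and
    cross-channel information before the bias and ReLU are applied.
    """
    length = len(channels[0])
    width = len(kernels[0])
    padded = [pad_zeros(channel, width) for channel in channels]
    output = []
    for t in range(length):
        total = bias
        for channel, kernel in zip(padded, kernels):
            total += sum(channel[t + k] * kernel[k] for k in range(width))
        output.append(relu(total))
    return output
```

Because of the padding, the returned map has the same length as each input channel, which is the property the padding function is meant to guarantee.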
Pooling in a convolutional network summarizes the activations in a receptive field with a single representative value to gain robustness to noise and translation. Although it helps classification by retaining only robust activations in upper layers, spatial information within a receptive field is lost during pooling. Such information loss may be critical for the precise feature learning that is required for reconstruction and classification.
Unpooling layers in a deconvolution network perform the reverse operation of pooling and restore the original size of the activations, as illustrated in Figure 3. To implement the unpooling operation, we record the locations of the maximum activations selected during pooling in the transposed variables, which are then used to place each activation back at its original pooled location. This unpooling strategy is particularly useful for reconstructing the structure of the input object. Note that the output of an unpooling layer is an enlarged yet sparse activation map, which might lose the expressiveness of complex features needed for reconstruction. To resolve this issue, deconvolution layers are used after the unpooling operation to densify the sparse activations through convolution-like operations with multiple learned filters.
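The pooling and unpooling pair described above can be illustrated with a short Python sketch on a single channel. The function names and the use of non-overlapping windows are our own assumptions for illustration; the recorded positions play the role of the stored locations of the maximum activations.

```python
def max_pool_with_switches(signal, size):
    """Max-pool non-overlapping windows and record where each maximum was."""
    pooled, switches = [], []
    for start in range(0, len(signal) - size + 1, size):
        window = signal[start:start + size]
        offset = max(range(size), key=lambda k: window[k])
        pooled.append(window[offset])
        switches.append(start + offset)  # absolute position of the maximum
    return pooled, switches


def unpool(pooled, switches, length):
    """Place each pooled activation back at its recorded position.

    All other positions stay zero, so the result is enlarged but sparse.
    """
    output = [0.0] * length
    for value, position in zip(pooled, switches):
        output[position] = value
    return output
```

For the signal `[1, 3, 2, 0, 5, 4]` with windows of size 2, pooling keeps `[3, 2, 5]` at positions `[1, 2, 4]`, and unpooling restores a length-6 map that is zero everywhere else. This sparsity is what the subsequent deconvolution layer is meant to densify.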
3.3 Deconvolution for Multivariate Time Series
In the proposed algorithm, the deconvolution network is a key component for precise feature learning on multivariate time series data. In contrast to the usual setting, where deconvolution and pooling are both performed with square kernels, our algorithm generates feature maps by applying deep deconvolution across the channels while pooling along each individual channel. The dense element-wise deconvolutional map is obtained by successive operations of unpooling, deconvolution, and rectification.
Figure 4 visualizes an example network structure layer by layer, which helps in understanding the internal operations of our deconvolution network. We can observe that deconvolution with multiple filters is applied to capture both the temporal and cross-channel correlation. Lower layers tend to capture the overall coarse configuration of the short-term signals (e.g. location and frequency), while more complex patterns are discovered in higher layers. Note that the pooling/unpooling layers and deconvolution play different roles in the construction of the learned features. Pooling/unpooling captures the significant information within a single channel by tracing each individual position with strong activations back to the signal space. As a result, it effectively reconstructs the detailed structure of the multivariate signals at finer resolutions. On the other hand, the learned filters in deconvolutional layers tend to capture the generic generating shapes. Through deconvolution and tied weights, the activations closely related to the generating distribution along each signal and across the channels are amplified, while noisy activations from other regions are effectively suppressed. By combining unpooling and deconvolution, our network is able to generate accurate reconstructions of the multivariate time series.
4 Visualization and Classification
To visualize, inspect and understand the learned representation, we discretize the final encoding in the hidden layers of the deconvolutional networks and convert it to a Markov matrix, which is then visualized as a complex network campanharo2011duality .
As shown in Figure 5, a time series is split into $Q$ quantiles, and each quantile $q_i$ is assigned to a node $n_i$ in the corresponding network. Nodes $n_i$ and $n_j$ are then connected by an arc whose weight $w_{ij}$ is given by the probability that a point in quantile $q_i$ is followed by a point in quantile $q_j$. Repeated transitions between quantiles result in arcs with larger weights, which are drawn as thicker lines. Note that the discretization is originally based on quantile bins. Since SAX assumes that time series tend to follow a Gaussian distribution, we use Gaussian mapping to replace the quantile bins for discretization.
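The construction of the weighted network above amounts to estimating a first-order Markov transition matrix from the discretized sequence. The Python sketch below is a minimal illustration under our own assumptions: the function names are hypothetical, the input values are assumed to be standardized already, and the Gaussian mapping is realized with equiprobable breakpoints of the standard normal distribution.

```python
from bisect import bisect_left
from statistics import NormalDist


def gaussian_bin_indices(values, num_bins):
    """Assign each standardized value to one of num_bins equiprobable Gaussian bins."""
    nd = NormalDist()
    breakpoints = [nd.inv_cdf(i / num_bins) for i in range(1, num_bins)]
    return [bisect_left(breakpoints, v) for v in values]


def transition_matrix(bins, num_bins):
    """Row-normalized counts of transitions between consecutive bins.

    Entry [i][j] estimates the probability that a point in bin i is
    followed by a point in bin j; it is the weight of the arc from
    node i to node j. Rows for bins that are never left stay all zero.
    """
    counts = [[0] * num_bins for _ in range(num_bins)]
    for current, following in zip(bins, bins[1:]):
        counts[current][following] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix
```

The resulting matrix can be read directly as a weighted adjacency matrix, so larger entries correspond to the thicker arcs in the network drawing.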
The deconvolution operation slides a window along time, which means the hidden representation retains a significant temporal component and therefore falls within the application domain of SAX and bag-of-words approaches. The bag-of-words dictionary built from the SAX words benefits from invariance to locality. Compared with the raw vector-based representation, these feature bags improve the classification performance because they fit the temporal correlation while increasing robustness against noise and outliers. In our experiments, we use both the raw hidden vector and the bag of SAX words for classification.
5 Experiments and Results
This section first describes the settings and results of representation learning with deconvolution. Then, we analyze and evaluate the proposed representation in classification and visualization tasks.
We primarily use two standard datasets that appear widely in the literature on multivariate time series (footnote 3: http://www.cs.cmu.edu/bobski/). The ECG dataset contains 200 samples with two channels, of which 133 are normal and 67 are abnormal. The length of an MTS sample is between 39 and 153. The wafer dataset contains 1194 samples, of which 1067 are normal and 127 are abnormal. The length of a sample is between 104 and 198. We preprocess each dataset by standardizing it and realigning all signals to the maximum length. All missing values are filled with 0. Table 1 gives the summary statistics of each dataset.
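The preprocessing described above can be sketched as follows. This is only an illustration under assumptions that the text does not fix: standardization is applied per channel of each sample, shorter signals are padded with zeros at the end, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def standardize(channel):
    """Scale one channel to zero mean and unit variance (assumed per channel)."""
    mu, sigma = mean(channel), pstdev(channel)
    if sigma == 0:
        return [0.0] * len(channel)
    return [(x - mu) / sigma for x in channel]


def pad_to_length(channel, length):
    """Fill the missing tail of a shorter signal with zeros (assumed end padding)."""
    return list(channel) + [0.0] * (length - len(channel))


def preprocess(samples):
    """samples: list of samples, each a list of channels (lists of floats)."""
    max_length = max(len(channel) for sample in samples for channel in sample)
    return [[pad_to_length(standardize(channel), max_length) for channel in sample]
            for sample in samples]
```

After this step every sample has the same length, so the whole dataset can be fed to the network as a fixed-size array.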
5.1 Representation Learning with Deconvolution
summarizes the detailed configuration of the proposed network. Our network has a symmetrical configuration of convolution and deconvolution networks centered around the output of the 2nd convolutional layer. The input and output layers correspond to the input signals and their reconstructions. We use ReLU as the activation function. The network is trained by Adadelta with learning rate and (footnote 4: Code is available at https://github.com/cauchyturing/Deconv_SAX).
Figures 6 and 7 show the reconstructions produced by our deconvolutional networks. Because the filters trained by deconvolution capture both temporal and cross-channel information, and through the combination of unpooling and deconvolution, our network is able to generate accurate reconstructions of the multivariate time series, which supports the expressiveness of the learned representations. As shown in Figure 8, after filtering by deconvolution and pooling/unpooling, the final encoding of each map learns a different representation independently. Diverse local patterns (shapes) of the time series are captured automatically. Through deconvolution, the filters determine the importance of each feature by considering both single-channel and cross-channel information.
For classification, we feed both the learned representation vector and the bag of SAX words into a linear SVM. Note that we use only the training data to train the representation with deconvolutional networks, and then generate the test representation with a single forward pass of the trained model on the test set. The SAX parameters (window length $n$, number of symbols $w$ and alphabet size $a$) are selected by leave-one-out cross-validation on the training set with Bayesian optimization snoek2012practical .
After discretization and symbolization, the bag-of-words dictionary is built by sliding a window of length $n$ along the sequence and converting each subsequence into a SAX word. The bag of words captures features that share the same structure across different instances, regardless of where they occur. The discretized features are the histogram of word counts in the bag of words.
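The bag-of-words step can be sketched in Python as below. This is a simplified illustration, not the experimental code: it assumes the sequence has already been converted to SAX letters, slides the window one symbol at a time, and uses hypothetical function names and a vocabulary supplied by the caller.

```python
from collections import Counter


def bag_of_words(symbols, word_length):
    """Count every word seen by a sliding window over a symbol sequence.

    Positions are discarded, so the bag is invariant to where a word occurs.
    """
    last_start = len(symbols) - word_length
    words = ("".join(symbols[i:i + word_length]) for i in range(last_start + 1))
    return Counter(words)


def histogram(bag, vocabulary):
    """Turn a bag into a fixed-length count vector over a shared vocabulary."""
    return [bag.get(word, 0) for word in vocabulary]
```

With a vocabulary shared across all instances, the count vectors have equal length and can be passed to a linear classifier.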
We compare our model with several of the best methods for multivariate time series classification in the recent literature, including Dynamic Time Warping (DTW), Edit Distance on Real sequence (EDR), Edit distance with Real Penalty (ERP) Lin2012pattern , SMTS baydogan2014learning , MCFEC he2015early and Pooling SAX wang2015pooling .
[Table 3. Columns: Avg. Degree | Modularity | Pagerank | Avg. Path Length]
Table 2 summarizes the classification results (footnote 5: The marked results are reported as the error rate of 10-fold cross-validation on the whole datasets (train + test)). Our model outperforms all other approaches. Even though the wafer dataset has 6 channels, our approach is still able to capture precise information through deconvolution and improve the classification performance. Because the datasets are small, a supervised deep learning model tends to overfit the labels; in our unsupervised feature learning framework, by contrast, the model takes advantage of the expressiveness that neural networks gain from their large number of weights to build a precise feature set. These precise features provide a better space for classification.
Another comparison is performed between the vector representation and the discretized bag-of-words representation from the deconvolutional networks (Table 4). Although discretization by SAX introduces more hyperparameters, both the cross-validation and test error rates are better than those of the feature vector. As analyzed above, the deconvolutional networks learn a representation that preserves high-order abstract temporal information. Through SAX and bag of words, this information is enhanced specifically for classification in a supervised way. Noise and outliers that are less useful for classification are removed, while the dependency on temporal locality is weakened by the bag of words. Thus, the bag of SAX words shows an advantage over the raw vector feature. Bayesian optimization greatly facilitates the hyperparameter search and converges fast.
[Table 4. Column headers: CV Train | Test | CV Train | Test]
5.3 Visualization and Statistical Analysis
has shown what the vector representation looks like as time series. To fully understand the representation learned through deconvolution and the effect of discretization through SAX, we flatten and discretize each feature map (which is fed to the classifier as input) and visualize them as complex networks to further inspect other statistical properties.
As for the number of discretization bins (or the alphabet size in our SAX settings), we set for the ECG dataset and for the wafer dataset. We use the hierarchical force-directed algorithm for the network layout hu2005efficient . As shown in Figures 9 and 10, the examples from both datasets show different network structures. For ECG, the normal sample tends to have a round-shaped layout, while the abnormal sample always has a narrow and winding structure. For wafer, the normal sample appears as a regular closed shape, whereas the structure of the abnormal sample is open, with thicker edges piercing through the border.
Table 3 summarizes four statistics of all the complex networks generated from the deconvolutional representations: average degree, modularity class blondel2008fast , Pagerank index langville2011google and average path length. Note that the Pagerank index here denotes the largest value in its propagation distribution. For the ECG dataset, all statistics differ significantly between the graphs with different labels at a rejection threshold of 0.05. For the wafer dataset, however, only the average path length shows a significant difference between the two labels. We think the reason is that thicker edges appear around the network structures, which indicates that the number of edges and their weights are highly skewed. This topological structure would not affect the other statistics, but would be reflected in the path length.
6 Conclusion and Future Work
We propose a new model based on deconvolutional networks and SAX discretization to learn representations for multivariate time series. Deconvolutional networks fully exploit the powerful expressiveness of deep neural networks in an unsupervised manner. We design a network structure specifically to capture the cross-channel correlation with deconvolution, forcing the pooling operation to perform dimension reduction along each position in the individual channels. SAX discretization is applied to the feature vectors to further extract a bag of features. We show how this representation and the bag of features help classification. A full comparison with sequence-distance based approaches is provided to demonstrate the effectiveness of our approach. We further build the Markov matrix from the discretized representation to visualize the time series as complex networks, which reveal additional statistical properties and clear class-specific structures with respect to different labels.
As future work, we plan to integrate a grammar induction approach on the deconvolutional SAX words to further infer the semantics of multivariate time series. We are also interested in designing advanced intelligent interfaces that enable humans to inspect, understand and guide the feature learning and semantic inference for mechanical multivariate signals.
Acknowledgments
This work is supported by the Natural Science Foundation of China (NSFC) under Grant Nos. 61472370 and 61170223; the National Key Technology Research and Development Program of China under Grant No. 2013BAH23F01; the Education Department of Henan Province under Grant No. 13A520453; and the Department of Science & Technology of Henan Province under Grant No. 142300410229.
References
- (1) Z. Bankó, J. Abonyi, Correlation based dynamic time warping of multivariate time series, Expert Systems with Applications 39 (17) (2012) 12814–12823.
- (2) X. Weng, J. Shen, Classification of multivariate time series using locality preserving projections, Knowledge-Based Systems 21 (7) (2008) 581–587.
- (3) J. Lin, S. Williamson, K. Borne, D. DeBarr, Pattern recognition in time series, Chapman & Hall, To appear, 2012.
- (4) G. He, Y. Duan, R. Peng, X. Jing, T. Qian, L. Wang, Early classification on multivariate time series, Neurocomputing 149 (2015) 777–787.
- (5) M. G. Baydogan, G. Runger, Learning a symbolic representation for multivariate time series classification, Data Mining and Knowledge Discovery (2014) 1–23.
- (6) Z. Wang, T. Oates, Pooling sax-bop approaches with boosting to classify multivariate synchronous physiological time series data., in: FLAIRS Conference, 2015, pp. 335–341.
- (7) Y. Zheng, Q. Liu, E. Chen, Y. Ge, J. L. Zhao, Time series classification using multi-channels deep convolutional neural networks, in: International Conference on Web-Age Information Management, Springer, 2014, pp. 298–310.
- (8) J. Lin, E. Keogh, S. Lonardi, B. Chiu, A symbolic representation of time series, with implications for streaming algorithms, in: Proceedings of the 8th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery, ACM, 2003, pp. 2–11.
- (9) Y. Sun, J. Li, J. Liu, B. Sun, C. Chow, An improvement of symbolic aggregate approximation distance measure for time series, Neurocomputing 138 (2014) 189–198.
- (10) Z. Wang, T. Oates, Imaging time-series to improve classification and imputation, arXiv preprint arXiv:1506.00327.
- (11) Y. Bengio, Learning deep architectures for AI, Foundations and Trends® in Machine Learning 2 (1) (2009) 1–127.
- (12) L. Deng, D. Yu, Deep learning: Methods and applications, Tech. Rep. MSR-TR-2014-21 (January 2014).
- (13) Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86 (11) (1998) 2278–2324.
- (14) D. H. Hubel, T. N. Wiesel, Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex, The Journal of physiology 160 (1) (1962) 106.
- (15) A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional neural networks, in: Advances in neural information processing systems, 2012, pp. 1097–1105.
- (16) G. Hinton, S. Osindero, Y.-W. Teh, A fast learning algorithm for deep belief nets, Neural computation 18 (7) (2006) 1527–1554.
- (17) P. Vincent, H. Larochelle, Y. Bengio, P.-A. Manzagol, Extracting and composing robust features with denoising autoencoders, in: Proceedings of the 25th international conference on Machine learning, ACM, 2008, pp. 1096–1103.
- (18) C. Häusler, A. Susemihl, M. P. Nawrot, M. Opper, Temporal autoencoding improves generative models of time series, arXiv preprint arXiv:1309.3103.
- (19) P. Brakel, D. Stroobandt, B. Schrauwen, Training energy-based models for time-series imputation, The Journal of Machine Learning Research 14 (1) (2013) 2771–2797.
- (20) Y. Bengio, L. Yao, G. Alain, P. Vincent, Generalized denoising auto-encoders as generative models, in: Advances in Neural Information Processing Systems, 2013, pp. 899–907.
- (21) Y. Bengio, E. Thibodeau-Laufer, Deep generative stochastic networks trainable by backprop, arXiv preprint arXiv:1306.1091.
- (22) D. Erhan, Y. Bengio, A. Courville, P.-A. Manzagol, P. Vincent, S. Bengio, Why does unsupervised pre-training help deep learning?, The Journal of Machine Learning Research 11 (2010) 625–660.
- (23) K. Grzegorczyk, M. Kurdziel, P. I. Wójcik, Encouraging orthogonality between weight vectors in pretrained deep neural networks, Neurocomputing 202 (2016) 84–90.
- (24) M. D. Zeiler, D. Krishnan, G. W. Taylor, R. Fergus, Deconvolutional networks, in: Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, IEEE, 2010, pp. 2528–2535.
- (25) J. Ngiam, Z. Chen, D. Chia, P. W. Koh, Q. V. Le, A. Y. Ng, Tiled convolutional neural networks, in: Advances in Neural Information Processing Systems, 2010, pp. 1279–1287.
- (26) Y. Wang, Z. Xie, K. Xu, Y. Dou, Y. Lei, An efficient and effective convolutional auto-encoder extreme learning machine network for 3d feature learning, Neurocomputing 174 (2016) 988–998.
- (27) F. Zhou, F. Torre, J. K. Hodgins, Aligned cluster analysis for temporal segmentation of human motion, in: Automatic Face & Gesture Recognition, 2008. 8th IEEE International Conference on, IEEE, 2008, pp. 1–7.
- (28) F. Mörchen, A. Ultsch, Finding persisting states for knowledge discovery in time series, in: From Data and Information Analysis to Knowledge Engineering, Springer, 2006, pp. 278–285.
- (29) E. Keogh, K. Chakrabarti, M. Pazzani, S. Mehrotra, Dimensionality reduction for fast similarity search in large time series databases, Knowledge and information Systems 3 (3) (2001) 263–286.
- (30) R. V. Donner, Y. Zou, J. F. Donges, N. Marwan, J. Kurths, Recurrence networks—a novel paradigm for nonlinear time series analysis, New Journal of Physics 12 (3) (2010) 033025.
- (31) R. V. Donner, M. Small, J. F. Donges, N. Marwan, Y. Zou, R. Xiang, J. Kurths, Recurrence-based time series analysis by means of complex network methods, International Journal of Bifurcation and Chaos 21 (04) (2011) 1019–1046.
- (32) D. F. Silva, V. Souza, M. De, G. E. Batista, Time series classification using compression distance of recurrence plots, in: Data Mining (ICDM), 2013 IEEE 13th International Conference on, IEEE, 2013, pp. 687–696.
- (33) A. S. Campanharo, M. I. Sirer, R. D. Malmgren, F. M. Ramos, L. A. N. Amaral, Duality between time series and networks, PloS one 6 (8) (2011) e23378.
- (34) J. Snoek, H. Larochelle, R. P. Adams, Practical bayesian optimization of machine learning algorithms, in: Advances in neural information processing systems, 2012, pp. 2951–2959.
- (35) Y. Hu, Efficient, high-quality force-directed graph drawing, Mathematica Journal 10 (1) (2005) 37–71.
- (36) V. D. Blondel, J.-L. Guillaume, R. Lambiotte, E. Lefebvre, Fast unfolding of communities in large networks, Journal of statistical mechanics: theory and experiment 2008 (10) (2008) P10008.
- (37) A. N. Langville, C. D. Meyer, Google’s PageRank and beyond: The science of search engine rankings, Princeton University Press, 2011.