Learning representations for multivariate time series with missing data using Temporal Kernelized Autoencoders

05/09/2018
by   Filippo Maria Bianchi, et al.

Learning compressed representations of multivariate time series (MTS) facilitates the analysis and processing of the data in the presence of noise, redundant information, and large numbers of variables and time steps. However, classic dimensionality reduction approaches are not designed to process sequential data, especially in the presence of missing values. In this work, we propose a novel autoencoder architecture based on recurrent neural networks to generate compressed representations of MTS, which may contain missing values and have variable lengths. Our autoencoder learns fixed-length vectorial representations whose pairwise similarities are aligned with a kernel function that operates in input space and handles missing values. This allows relationships to be preserved in the low-dimensional vector space even in the presence of missing values. To highlight the main features of the proposed autoencoder, we first investigate its performance in controlled experiments. Subsequently, we show how the learned representations can be exploited in both benchmark and real-world classification tasks on medical data. Finally, based on the proposed architecture, we conceive a framework for one-class classification and imputation of missing data in time series extracted from ECG signals.


1 Introduction

Real-valued multivariate time series (MTS) characterize the evolution of complex systems, and their analysis is a core component of many research fields and application domains chatfield2016analysis. MTS analysis should account for relationships across variables and time steps and, at the same time, deal with unequal time lengths and missing data langkvist2014review. Missing values, commonly found in real-world data such as electronic health records (EHR) che2017time, are usually filled with imputation techniques before processing the data. However, unless the values are missing completely at random heitjan1996distinguishing, imputation destroys information of potential interest contained in the missingness patterns. Furthermore, especially for large fractions of missing values, any imputation method can introduce a strong bias that influences the analysis outcome little2014statistical. A data-driven approach has recently been proposed to learn when to switch between two particular types of imputation Che2018, but it relies on strong assumptions that are suitable only for specific applications.

MTS data often contain noise and redundant information, and can be characterized by a large number of variables and time steps. In those cases, extracting the relevant information and generating compressed representations using dimensionality reduction techniques facilitates the analysis and processing of the data langkvist2014review; Keogh2001; DENG2013142. Dimensionality reduction has been a fundamental research topic in machine learning scholkopf1997kernel; fukumizu2004dimensionality; jenssen2010kernel; harandi2017dimensionality. However, classic approaches are not designed to process sequential data, especially in the presence of missing values.

In this paper, we propose a novel neural network architecture called Temporal Kernelized Autoencoder (TKAE) to learn compressed representations of real-valued MTS with unequal lengths and missing data. Our model is based on a deep autoencoder Hinton504 with recurrent layers, which generates a fixed-size vectorial representation of the input MTS. To better capture time dependencies, we implement the encoder with a bidirectional recurrent neural network 6638947. The final states of the forward and backward networks are combined by a dense nonlinear layer that reduces the dimensionality of the representation.

To avoid the undesired biases introduced by imputation, we train our architecture with a loss function that preserves pairwise similarities in the learned representations, even in the presence of missing data. This is achieved by a kernel alignment procedure kernelAE, which matches the dot products of the representations with a kernel similarity defined in the input space. To this end, we consider the recently proposed Time series Cluster Kernel (TCK) mikalsen2017time, which has been shown to capture well the relationships between MTS containing missing data.

The proposed architecture serves different purposes. When represented as ordinary vectors, MTS can be processed by non-sequential classification or unsupervised machine learning algorithms Xing:2010:BSS:1882471.1882478, and their indexing and retrieval become more efficient 7179089; chung2016unsupervised. Furthermore, the dimensionality of the data is reduced and, accordingly, models can potentially be trained with fewer samples. Contrary to other nonlinear dimensionality reduction techniques, the decoder provides an explicit mapping back to the input space. This can be used, for example, to design an anomaly detector based on the reconstruction error of the inputs pmlr-v48-zhai16, or to implement an imputation method that leverages the generalization capability of the decoder, rather than relying on a priori assumptions that may introduce stronger biases beaulieu2017missing.

TKAE is evaluated through several experiments on synthetic and benchmark datasets, and MTS extracted from real-world EHR. We first investigate the advantages of using recurrent layers to generate compressed MTS representations with respect to other dimensionality reduction methods. Then, we show the benefit of the kernel alignment for learning representations in classification settings with missing data. Finally, we exploit the capability of the TKAE decoder to impute missing data and build a one-class classifier for anomaly detection.

2 Methods

Sec. 2.1 and 2.2 provide the required background, describing respectively the Autoencoder (AE) and TCK, the selected kernel similarity for MTS with missing values. The details of our methodological contribution are presented in Sec. 2.3.

2.1 Autoencoder

The AE is a neural network traditionally conceived as a non-linear dimensionality reduction algorithm Hinton504, which has been further exploited to learn representations in deep architectures bengio2009learning and to pre-train neural network layers erhan2010does. An AE simultaneously learns two functions. The first one, called encoder, is a mapping from an input domain $\mathbb{R}^d$ to a hidden representation (code) in $\mathbb{R}^k$. The second function, called decoder, maps from $\mathbb{R}^k$ back to $\mathbb{R}^d$. The encoding and decoding functions are defined as

$\mathbf{z} = f_{enc}(\mathbf{x}; \theta_E), \qquad \tilde{\mathbf{x}} = f_{dec}(\mathbf{z}; \theta_D), \qquad (1)$

where $\mathbf{x} \in \mathbb{R}^d$, $\mathbf{z} \in \mathbb{R}^k$, and $\tilde{\mathbf{x}} \in \mathbb{R}^d$ denote a sample from the input space, its hidden representation, and its reconstruction given by the decoder, respectively. The encoder $f_{enc}$ is usually implemented by stacked dense layers of neurons equipped with sigmoidal activation functions. The decoder $f_{dec}$ often has an architecture symmetric to the encoder that operates in the reverse direction; when inputs are real-valued vectors, the decoder squashing nonlinearities are often replaced by linear activations vincent2010stacked. Finally, $\theta_E$ and $\theta_D$ are the trainable parameters of the encoder and decoder, respectively. In this work, we focus on AEs implemented with fully connected layers (later, recurrent layers will also be introduced). In those cases, the parameters are the connection weights and biases of each layer, i.e., $\theta_E = \{W_E, \mathbf{b}_E\}$ and $\theta_D = \{W_D, \mathbf{b}_D\}$. AEs are trained to minimize the discrepancy between the input $\mathbf{x}$ and its reconstruction $\tilde{\mathbf{x}}$. In the case of real-valued inputs, this is usually achieved by minimizing a loss implemented as the empirical Mean Squared Error (MSE).

In this paper, we focus on AEs with a “bottleneck”, which learn an under-complete representation of the input, i.e., $k < d$, retaining as much useful information as possible to allow an accurate reconstruction Hinton504. The learned lossy, compressed representation of the input can be exploited, e.g., in clustering and visualization tasks makhzani2015adversarial, or to train a classifier ng2016dual. The bottleneck already provides a strong regularization, as it limits the variance of the model. However, further regularization can be introduced by tying the encoder and decoder weights ($W_D = W_E^T$) or by adding a norm penalty to the loss function,

$L = \mathrm{MSE}(\mathbf{x}, \tilde{\mathbf{x}}) + \lambda \lVert W \rVert_2^2, \qquad (2)$

where $\lVert W \rVert_2^2$ is the L2 norm of all model weights and $\lambda$ is the hyperparameter controlling the contribution of the regularization term.
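
As an illustration, the following minimal NumPy sketch implements the encoder/decoder maps of Eq. (1) for a single dense layer (sigmoidal encoder, linear decoder) together with the regularized reconstruction loss of Eq. (2). The layer sizes, weight names, and the example values are placeholders of our own choosing, not the configuration used in the paper.

import numpy as np

def ae_forward(x, W_E, b_E, W_D, b_D):
    # Encoder/decoder maps of Eq. (1): sigmoidal encoder, linear decoder.
    # W_E has shape (k, d) and W_D has shape (d, k), with k < d (bottleneck).
    z = 1.0 / (1.0 + np.exp(-(W_E @ x + b_E)))   # hidden representation (code)
    x_rec = W_D @ z + b_D                        # reconstruction of the input
    return z, x_rec

def ae_loss(x, x_rec, weights, lam):
    # Loss of Eq. (2): reconstruction MSE plus an L2 penalty on all weights.
    return np.mean((x - x_rec) ** 2) + lam * sum(np.sum(W ** 2) for W in weights)

# Example with random parameters (d = 8, k = 3):
rng = np.random.default_rng(0)
d, k = 8, 3
W_E, b_E = rng.standard_normal((k, d)), np.zeros(k)
W_D, b_D = rng.standard_normal((d, k)), np.zeros(d)
x = rng.standard_normal(d)
z, x_rec = ae_forward(x, W_E, b_E, W_D, b_D)
loss = ae_loss(x, x_rec, [W_E, W_D], lam=1e-3)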

Recurrent neural networks (RNNs) are models that excel at capturing temporal dependencies in sequences elman1990finding; bianchi2017recurrent and are at the core of seq2seq models sutskever2014sequence. The latter learn fixed-size representations of sequences with unequal lengths and, at the same time, generate variable-length outputs. The main applications of RNN AEs based on the seq2seq idea have been on text DBLP:journals/corr/BowmanVVDJB15, speech NIPS2015_5653, and video data srivastava2015unsupervised. However, few efforts have been devoted so far to applying these architectures to real-valued MTS.

Modern seq2seq architectures implement a powerful mechanism called attention, which provides an inductive bias that facilitates modelling long-term dependencies and grants a more accurate decoding when the length of the input sequences varies considerably BahdanauCB14; kim2017structured; NIPS2017_7181. Rather than learning a single vector representation for the whole input sequence, a model with attention maintains all the encoder states generated over time, which are combined by a time-varying decoding vector at each decoding step. Therefore, models with attention provide a representation that is neither compact nor of fixed size and, hence, are not suitable for our purposes.

2.2 Time Series Cluster Kernel

The Time series Cluster Kernel (TCK) mikalsen2017time is an algorithmic procedure to compute kernel similarities among MTS containing missing data. The method is based on an ensemble learning approach that guarantees robustness with respect to hyperparameters. This ensures that TCK also works well in unsupervised settings, in which AEs usually operate, where it is not possible to tune the hyperparameters by means of supervised cross-validation. The base models in the ensemble are Gaussian mixture models (GMMs), which are fit to the dataset of MTS using a range of numbers of mixture components. By fitting GMMs with different numbers of mixtures, the TCK procedure generates partitions at different resolutions that capture both local and global structures in the data.

To further enhance diversity in the ensemble, each partition is evaluated on a random subset of MTS samples, MTS attributes (variates), and time segments, using random initializations and randomly chosen hyperparameters. This also contributes to robustness with respect to the selection of hyperparameters, such as the number of mixture components. To avoid imputation, missing data are analytically marginalized away in the likelihoods. To obtain the GMM posteriors, the likelihoods are multiplied with smooth priors, whose contribution becomes stronger as the percentage of missingness increases. TCK is then built by summing up, for each partition, the inner products between pairs of posterior assignments corresponding to different MTS. The details of TCK are provided in Appendix A.

2.3 Temporal Kernelized Autoencoder

Figure 1: Schematic representation of TKAE. Inputs are processed by a stacked bidirectional RNN encoder. The last states obtained in the forward and backward directions are combined by a dense layer to produce a fixed-size representation $\mathbf{z}$ of the input. $\mathbf{z}$ is used to initialize the state of the decoder, which is a stacked RNN operating in generative mode and is trained to reproduce the inputs by minimizing the reconstruction error. TKAE learns similarity-preserving representations of the inputs. In particular, the matrix containing the dot products of the representations of the MTS in the dataset is aligned, by means of a cost term $L_k$, to the kernel similarity matrix $K$. The kernel matrix is provided by the user as prior information to embed desired properties in the representations. In our case, the kernel alignment generates representations whose relationships account for missing data in the input.

The Temporal Kernelized Autoencoder (TKAE) is a novel AE architecture designed for MTS of unequal length with missing values. A schematic representation of TKAE is provided in Fig. 1. We assume an MTS to be represented as a matrix $X \in \mathbb{R}^{V \times T}$, where $V$ denotes the number of variables and $T$ is the number of time steps, which may vary across MTS. Analogously to seq2seq sutskever2014sequence, in TKAE the dense layers of standard AEs are replaced by recurrent layers, which process the input sequentially and update their hidden state at each time step according to the following map,

$\mathbf{h}_t = \phi(\mathbf{x}_t, \mathbf{h}_{t-1}; \theta_{enc}), \qquad (3)$

where $\theta_{enc}$ is the set of trainable parameters of the recurrent units. The recurrent layers are composed of either gated recurrent unit (GRU) cho2014learning or long short-term memory (LSTM) hochreiter1997long cells. The choice of the cell is usually guided by the task at hand chung2014empirical.

Conventional RNNs make use of previous inputs to build their current hidden representation 6638947. However, in applications like MTS classification, where the whole input is available at once, it is also possible to exploit the information contained in future inputs to generate the current state of the network. For this reason, the encoder is implemented as a stacked bidirectional RNN schuster1997bidirectional consisting of two RNNs working in parallel, each one composed of stacked layers of cells with transition function (3). While one RNN captures input dependencies going backward in time, the other processes the same input in reverse order, thus modeling relationships that go forward in time. After the whole input is processed, the final states of the forward and backward RNNs are denoted as $\mathbf{h}_f$ and $\mathbf{h}_b$, respectively. While $\mathbf{h}_f$ is influenced by past observations, $\mathbf{h}_b$ depends on future ones; hence, their combination can capture a wider range of temporal dependencies in the input. In TKAE, the combination is implemented with a dense nonlinear layer that produces an output vector $\mathbf{z}$, which is the fixed-size, vectorial representation of the MTS.
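
A minimal PyTorch sketch of such a bidirectional encoder is given below. The cell type, the number of layers, the tanh used in the combination layer, and all names are illustrative assumptions rather than the exact configuration used in the paper.

import torch
import torch.nn as nn

class BiRNNEncoder(nn.Module):
    """Stacked bidirectional LSTM encoder producing a fixed-size code z."""

    def __init__(self, n_vars, hidden_size, code_size, n_layers=2):
        super().__init__()
        self.rnn = nn.LSTM(n_vars, hidden_size, num_layers=n_layers,
                           batch_first=True, bidirectional=True)
        # Dense layer combining the final forward and backward states.
        self.to_code = nn.Linear(2 * hidden_size, code_size)

    def forward(self, x):                   # x: (batch, T, n_vars)
        _, (h_n, _) = self.rnn(x)           # h_n: (2 * n_layers, batch, hidden)
        h_fwd, h_bwd = h_n[-2], h_n[-1]     # last layer: forward / backward state
        return torch.tanh(self.to_code(torch.cat([h_fwd, h_bwd], dim=1)))

# Example: a batch of 4 MTS with 30 time steps and 12 variates, code size 10.
z = BiRNNEncoder(n_vars=12, hidden_size=10, code_size=10)(torch.randn(4, 30, 12))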

The decoder operates according to the following map,

$\mathbf{h}_t^{dec} = \psi(\tilde{\mathbf{x}}_{t-1}, \mathbf{h}_{t-1}^{dec}; \theta_{dec}), \qquad (4)$

where $\psi$ is a stacked RNN parametrized by $\theta_{dec}$ that operates in generative mode, processing the previously generated output as new input. To initialize the decoder, we let its initial state be $\mathbf{h}_0^{dec} = \mathbf{z}$ and its first input be $\tilde{\mathbf{x}}_0 = \mathbf{0}$, which corresponds to an “average input” if the MTS are standardized. The decoder iteratively produces outputs for $T$ steps, $T$ being the length of the input MTS. Unequal lengths are naturally handled, since the whole architecture is independent of $T$.

TKAE is trained end-to-end by means of stochastic gradient descent with scheduled sampling NIPS2015_5956. More specifically, during training the decoder input at time $t$ is, with probability $p_s$, the decoder output at time $t-1$ (inference mode) and, with probability $1 - p_s$, the desired output at time $t-1$ (teacher forcing). Since the desired output is not available during the test phase, the decoder generates test data operating only in generative mode ($p_s = 1$). In most of our experiments, scheduled sampling improved the training convergence speed, providing a practical motivation for our choice.
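
The sketch below (PyTorch, with hypothetical class and argument names) illustrates the generative decoder and the scheduled-sampling rule described above: the code z initializes the state, the first input is the zero vector, and at each step the previous model output is fed back with probability p_s, otherwise the desired output is used (teacher forcing).

import torch
import torch.nn as nn

class RNNDecoder(nn.Module):
    """Generative decoder sketch: a single LSTM cell with a linear output layer."""

    def __init__(self, n_vars, code_size):
        super().__init__()
        self.cell = nn.LSTMCell(n_vars, code_size)
        self.out = nn.Linear(code_size, n_vars)

    def forward(self, z, T, target=None, p_s=1.0):
        h, c = z, torch.zeros_like(z)                         # state initialized with the code z
        x_t = z.new_zeros(z.size(0), self.out.out_features)   # zero ("average") first input
        outputs = []
        for t in range(T):
            h, c = self.cell(x_t, (h, c))
            y_t = self.out(h)
            outputs.append(y_t)
            # Scheduled sampling: feed back the model output with probability p_s,
            # otherwise feed the desired output (teacher forcing). At test time
            # target is None, so the decoder runs purely generatively.
            use_own = target is None or torch.rand(1).item() < p_s
            x_t = y_t if use_own else target[:, t]
        return torch.stack(outputs, dim=1)                    # (batch, T, n_vars)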

Analogously to standard AEs, the RNNs in TKAE cannot process data with missing values. Therefore, those values are filled beforehand with some imputed value (zero, mean value, or last observed value) schafer2002missing. However, imputation injects biases into the data that may negatively affect the quality of the representations and conceal potentially useful information contained in the missingness patterns. To compensate for these shortcomings, we introduce a kernel alignment procedure kernelAE, which preserves the pairwise similarities of the inputs, encoded as a positive semi-definite matrix $K$, in the learned representations. $K$ can be any positive semi-definite matrix and is selected according to which properties one wants to embed into the representations. In our case, by choosing the TCK matrix as $K$, the relationships of the learned representations also account for missing data.

Kernel alignment is implemented by an additional regularization term in the loss function (2), which becomes

$L = \mathrm{MSE}(X, \tilde{X}) + \lambda \lVert W \rVert_2^2 + \alpha L_k. \qquad (5)$

$L_k$ is the kernel alignment cost, which computes the normalized Frobenius norm of the difference between two matrices: $K$, the prior kernel matrix, and $ZZ^T$, the matrix of dot products between the encodings of the input MTS. The term $L_k$ is defined as

$L_k = \left\lVert \frac{ZZ^T}{\lVert ZZ^T \rVert_F} - \frac{K}{\lVert K \rVert_F} \right\rVert_F, \qquad (6)$

where $Z$ is the matrix of hidden representations relative to the MTS in the dataset (or, more specifically, in the current mini-batch). Finally, $\alpha$ is a hyperparameter controlling the contribution of the alignment cost in the loss function.
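
In code, the alignment cost of Eq. (6) amounts to a normalized Frobenius distance between the Gram matrix of the codes and the prior kernel. A PyTorch sketch (variable names are ours) is:

import torch

def kernel_alignment_loss(Z, K):
    # Z: (batch, code_size) hidden representations of the MTS in the mini-batch.
    # K: (batch, batch) prior kernel matrix (e.g. the TCK similarities) for the
    # same mini-batch. Returns the normalized Frobenius distance of Eq. (6).
    ZZt = Z @ Z.t()
    return torch.linalg.norm(ZZt / torch.linalg.norm(ZZt) - K / torch.linalg.norm(K))

# Sketch of the full objective of Eq. (5) for one mini-batch:
# loss = mse(x_rec, x) + lam * l2_penalty + alpha * kernel_alignment_loss(Z, K_batch)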

3 Experiments

We perform three sets of experiments. First, in Sec. 3.1 we investigate the advantages and shortcomings of an AE with recurrent layers for learning MTS representations, with respect to other dimensionality reduction methods. In this case, we do not use kernel alignment ($\alpha = 0$) and we refer to TKAE simply as TAE. Secondly, in Sec. 3.2 we show how kernel alignment improves the learned representations when MTS contain missing data. Lastly, in Sec. 3.3 we present two case studies, where TKAE is used for one-class classification and for imputing missing data. Both applications exploit not only the TKAE hidden representation but also its decoder, as the results are computed in the input space.

In the following, we compare the proposed architecture with baseline methods for dimensionality reduction, such as a standard AE and PCA; the learned compressed representations have the same dimensionality in all models taken into account. In each experiment, we train the models for 5000 epochs with mini-batches containing 25 MTS using the Adam optimizer kingma2014adam. We independently standardize each variate of the MTS in all datasets. In each experiment and for each method, we identify the optimal hyperparameters with k-fold cross-validation evaluated on the reconstruction error (or, in general, on the unsupervised loss function) and we report the average results on the test set, obtained over independent runs. We consider only TKAE models with at most 3 hidden layers of either LSTM or GRU cells, as deeper models generally improve performance only slightly at the cost of greater complexity reimers2017optimal.

3.1 Evaluation of TAE representations

In this section, we compare the compressed representations yielded by TAE, a standard AE, and PCA, to investigate which types of data are better represented when processed by a recurrent architecture. Besides the benchmark datasets, we also consider synthetic data to study the properties of the different architectures in a controlled setting. We let $d$ be the input dimensionality; in TAE, $d = V$, as it processes each time step recursively. On the other hand, the MTS must be unrolled when processed by PCA and AE, since these expect vectors rather than sequences as inputs. Therefore, in AE and PCA $d = V \cdot T$, and the reconstructed outputs are folded back to match the original shape of the input sequence. We let $k$ be the size of the compressed representations, which corresponds to the number of RNN cells in each TAE layer, the size of the innermost layer in AE, and the number of principal components in PCA, respectively. In all experiments we use an AE architecture with 3 hidden layers; the number of neurons in the intermediate layers has been set after preliminary experiments and is not a sensitive hyperparameter. As a measure of performance, we consider the MSE between the original test elements and their reconstructions produced by each model.

Time series with different frequencies

Here, we show the capability of TAE to compress periodic signals having different frequencies and phases. We generate a dataset of sinusoids whose frequencies and phases are drawn at random, split into a training and a test set. The proposed task is closely related to the multiple superimposed oscillators, studied in pattern generation and frequency modulation sussillo2009generating. The optimal configurations are: AE with a nonlinear decoder; TAE with 2 layers of LSTM cells. On the test set, TAE obtains a lower reconstruction MSE than both AE and PCA.

Both PCA and AE process the entire time series at once. This may appear to be an advantage with respect to TAE, which stores information in memory over the whole sequence length before generating the final representation. Nonetheless, TAE yields a better reconstruction, since the AE (and PCA) architecture is unsuitable for this task. Indeed, in AEs the $t$-th time step of each MTS is always processed by the same input neuron. For periodic signals, the training procedure tries to couple neurons associated to time steps with the same phase, by associating similar weights to their connections. However, these couplings change whenever inputs have different frequencies (see Fig. 2).

Figure 2: Periodic inputs with different frequencies generate different activation patterns in AEs. It is not possible to learn connection weights that preserve the neuron couplings for every frequency.

Therefore, the standard AE training never converges as it is impossible to learn a model that generalizes well for each frequency. On the other hand, thanks to its recurrent architecture, TAE can naturally handle inputs of different frequencies since there is no pairing between structural parameters and time steps.

Fig. 3 shows the reconstruction of one sample time series.

Figure 3: Reconstructions obtained by PCA, AE, and TAE on a sample sinusoid, whose frequency and phase are randomly chosen.

The lower quality of the reconstructions yielded by AE and PCA is immediately noticeable. Additionally, since those methods are unable to reproduce the dynamics of each sample, they adopt a more conservative behavior and output signals with lower amplitudes that are closer (in a mean-square sense) to the “average” of all the random sinusoids in the dataset.

Time series of different lengths

While TKAE can process MTS with different lengths, standard AE and PCA require inputs of fixed size. The common workaround, also followed in this work, is to pad the shorter MTS with zeros mann2008smoothing. To systematically study the performance of the different methods when MTS have fixed or variable length, we generate data by integrating the following system of first-order Ordinary Differential Equations (ODE):

$\dot{\mathbf{x}}(t) = f\big(A\,\mathbf{x}(t)\big), \quad \mathbf{x}(t) \in \mathbb{R}^{V}, \qquad (7)$

where $A$ is a matrix with 50% sparsity and elements drawn uniformly at random. To guarantee the stability of the system, $A$ is rescaled to a prescribed spectral radius. The nonlinearity $f(\cdot)$ is applied component-wise and introduces nonlinear dependencies among the variables. An MTS is obtained by integrating (7) for $T$ steps, starting from a random initial condition $\mathbf{x}(0)$. Since a generic deterministic dynamical system can be described by an ODE system, these synthetic MTS represent many case studies well.
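
A possible NumPy sketch of this data generation, using explicit Euler integration and a tanh nonlinearity, is given below; the choice of nonlinearity, the sampling interval of the entries of $A$, the spectral radius, and the step size are assumptions, since they are not reported in this text.

import numpy as np

def generate_ode_mts(n_samples, V, T, spectral_radius=0.9, dt=0.1, seed=0):
    # Integrate x'(t) = f(A x(t)) with a sparse random matrix A rescaled to a
    # given spectral radius; f = tanh applied component-wise (assumed).
    rng = np.random.default_rng(seed)
    A = rng.uniform(-1.0, 1.0, size=(V, V))
    A[rng.random((V, V)) < 0.5] = 0.0                        # ~50% sparsity
    A *= spectral_radius / np.max(np.abs(np.linalg.eigvals(A)))
    data = np.empty((n_samples, T, V))
    for n in range(n_samples):
        x = rng.standard_normal(V)                           # random initial condition
        for t in range(T):
            data[n, t] = x
            x = x + dt * np.tanh(A @ x)                      # explicit Euler step
    return data

# Example: an ODEfix-like setting with fixed-length MTS.
mts = generate_ode_mts(n_samples=400, V=10, T=50)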

We generate two different datasets of MTS with $V$ variables, with 400 and 1000 samples for the training and test set, respectively. The first, ODEfix, contains MTS with the same length, while in the second, ODEvar, each MTS has a random length. We compare the reconstruction MSE of PCA, AE, and TAE. The optimal configurations for this task are: AE with a linear decoder; TAE with 1 LSTM layer. The average results over independent random generations of the data and initializations of AE and TAE are reported in Tab. 1.

Dataset   PCA    AE     TAE
ODEfix    0.018  0.004  0.060
ODEvar    0.718  0.676  0.185
Table 1: Average reconstruction MSE of MTS with fixed (ODEfix) and variable (ODEvar) length.

In ODEfix, both AE and PCA yield almost perfect reconstructions, which is expected given the simplicity of the task. However, they perform worse on ODEvar, despite the presence of many padded values and a consequently lower amount (on average) of information to encode in the compressed representation. On the other hand, TAE naturally deals with variable-length inputs: once the input sequence terminates, its state is no longer updated and, during training, neither are its weights.

Dealing with a large number of variates and time steps

To test the ability to learn compressed representations when the number of variates in the MTS increases, starting from (7) we generate 4 datasets ODE5, ODE10, ODE15, and ODE20, obtained by setting $V = 5, 10, 15$, and $20$, respectively. The number of time steps $T$ is fixed in each dataset. We fix the size $k$ of the compressed representations; TAE is configured with 2 layers of LSTM cells. We also include in the comparison an AE with tied weights in the (nonlinear) decoder, which has fewer parameters. Reconstruction errors are reported in Tab. 2. We notice that AE performs well on MTS characterized by low dimensionality, but performance degrades when $V$ assumes larger values. Since AE processes MTS unrolled into a one-dimensional vector, the input size grows quickly as $V$ increases (one additional variable increases the total input size by $T$). Accordingly, the number of parameters in the first dense layer scales up quickly, possibly leading to overfitting. We also notice that the tied-weights regularization, despite halving the number of trainable parameters, degrades performance in each case, possibly because it restricts the flexibility of the model too much. On the other hand, in TAE the model complexity changes slowly, as only a single input neuron is added for an additional input dimension. As a consequence, TAE is the best performing model when MTS have a large number of variates.

Dataset  TAE MSE  TAE #par  AE MSE  AE #par  AE (tw) MSE  AE (tw) #par  PCA MSE
ODE5     0.019    6130      0.04    31170    0.014        15870         0.007
ODE10    0.060    6780      0.04    61670    0.071        31370         0.018
ODE15    0.072    7430      0.106   92170    0.153        46870         0.174
ODE20    0.089    8080      0.121   122670   0.181        62370         0.211
Table 2: Average reconstruction MSE on the ODE task for different values of $V$, obtained by TAE, AE, AE with tied weights (tw), and PCA. For TAE and the AEs we also report the number of trainable parameters (#par).

To study the performance as the length of the MTS increases, we generate 8 datasets with the ODE system (7) by varying $T$ while keeping $V = 15$ fixed. In Fig. 4, we report the reconstruction errors and note that TAE performs poorly as $T$ increases. Especially when there are no temporal patterns in the data that can be exploited by the RNNs to model the inputs, as in this case, TAE is more effective when the MTS are short.

Figure 4: Reconstruction MSE when increasing the length of the MTS in ODE15. TAE performance degrades for large $T$.

Evaluation on benchmark classification datasets

Here we consider several classification benchmarks from Baydogan2016 and from the UCR and UCI repositories (www.cs.ucr.edu/~eamonn/time_series_data, archive.ics.uci.edu/ml/datasets.html), whose details are reported in Tab. 3. The datasets have been selected to cover a wide variety of cases in terms of training/test set size, number of variates $V$, number of classes, and (variable) lengths of the MTS. We evaluate the quality of the representations yielded by TAE, standard AE, and PCA by classifying them with a kNN classifier with Euclidean distance. For each method, the results, which also include the reconstruction MSE, are reported in Tab. 4. On the majority of the datasets, TAE produces compressed representations that not only best describe the classes of the original inputs (in terms of classification accuracy), but also provide the most accurate reconstruction. In a few cases, however, PCA achieves higher performance, which is even superior to the one obtained by the standard AE. Such a result may appear contradictory at first, since the AE is a non-linear extension of PCA and should perform at least equally well when properly trained. Nevertheless, on MTS datasets having large dimensionality (i.e., many variates) but few samples, it is difficult to obtain a good fit of the parameters of models such as AE and TAE.
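
For reference, this evaluation protocol boils down to a kNN classifier on the learned codes; a scikit-learn sketch follows. The value of k is not reported in this text, so k=1 below is only an assumption, and all variable names are placeholders.

from sklearn.neighbors import KNeighborsClassifier

def knn_accuracy(Z_train, y_train, Z_test, y_test, k=1):
    # Classify the compressed representations (codes) with kNN and Euclidean distance.
    knn = KNeighborsClassifier(n_neighbors=k, metric="euclidean")
    return knn.fit(Z_train, y_train).score(Z_test, y_test)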

Dataset      V   Train  Test  Classes  Tmin  Tmax  Source
ECG          1   500    4500  5        140   140   UCR
ECG2         2   100    100   2        39    152   UCR
Libras       2   180    180   15       45    45    Baydogan2016
Char. Traj.  3   300    2558  20       109   205   UCI
Wafer        6   298    896   2        104   198   UCR
Jp. Vow.     12  270    370   9        7     29    UCI
Arab. Dig.   13  6600   2200  10       4     93    UCI
Auslan       22  1140   1425  95       45    136   UCI
Table 3: Benchmark time series datasets. The columns report the number of attributes (V), the number of samples in the training and test sets, and the number of classes. Tmin is the length of the shortest MTS in the dataset and Tmax the length of the longest.
Dataset      k   PCA MSE  PCA ACC  AE MSE  AE ACC  AE dec.  AE λ    AE tw  TAE MSE  TAE ACC  TAE cell  TAE p_s  TAE λ
ECG          10  0.026    0.934    0.027   0.937   lin.     0       no     0.064    0.931    GRU1      1.0      0
ECG2         10  0.100    0.801    0.132   0.809   sig.     0.001   no     0.131    0.819    LSTM2     1.0      0.001
Libras       5   0.199    0.577    0.173   0.583   sig.     0.001   no     0.110    0.661    LSTM2     0.9      0.001
Char. Traj.  10  0.094    0.922    0.116   0.903   lin.     0.001   no     0.122    0.912    LSTM3     1.0      0.001
Wafer        10  0.139    0.928    0.163   0.910   lin.     0       no     0.088    0.919    LSTM2     1.0      0
Jp. Vow.     10  0.186    0.929    0.216   0.927   lin.     0.001   no     0.177    0.932    LSTM2     0.8      0.001
Arab. Dig.   15  0.542    0.965    0.554   0.926   sig.     0       no     0.418    0.984    GRU2      1.0      0
Auslan       10  0.245    0.538    0.338   0.344   lin.     0       yes    0.213    0.680    LSTM2     1.0      0
Table 4: Average reconstruction errors (MSE) and classification accuracy of the compressed representations (ACC) on the benchmark datasets; k is the size of the compressed representation. We report the optimal hyperparameters found with cross-validation. For AE: decoder activation (dec.), amount of regularization (λ), and tied weights in the decoder (tw). For TAE: type of cell and number of layers, probability of scheduled sampling (p_s), and amount of regularization (λ).
Figure 5: First two principal components of ECG, Character Trajectories (CHAR), and Wafer (WAF) datasets. PCA is computed on (i) input space, (ii) AE and (iii) TAE hidden representations. Samples are colored according to their class.

In particular, TAE does not obtain the best classification results on only three of the benchmark datasets: ECG, Character Trajectories, and Wafer. However, in these cases the classification accuracy achieved by TAE is only slightly inferior to that of the competitors, and TAE learns compressed representations that are qualitatively comparable to those of the other methods. This is illustrated in Fig. 5, which depicts the first two principal components of the learned representations; for TAE and AE, the components are computed in their hidden-state space. The representations yielded by TAE are characterized by a well-defined structure, and the class separation (qualitatively speaking) is comparable to that of the other methods.

3.2 Kernel alignment to handle missing data

So far, we have investigated the effectiveness of a recurrent AE architecture for generating compressed representations of MTS, compared with baseline methods for dimensionality reduction. Hereinafter, we demonstrate the effect of kernel alignment when MTS contain missing data. Specifically, we compare the representations learned by TAE ($\alpha = 0$) and TKAE ($\alpha > 0$) on a synthetic and a real-world classification problem.

First, we consider the Jp. Vow. dataset (see Tab. 3). This dataset does not originally contain missing data but, similarly to previous studies Baydogan2016; mikalsen2017time, we inject missing data in a controlled way by randomly removing a given percentage of the elements in the dataset. We vary such a percentage, each time evaluating the reconstruction MSE and the classification accuracy of the TAE and TKAE encodings with a kNN classifier. We apply zero imputation to replace missing data in the MTS. TAE and TKAE are configured with 2 layers of LSTM cells. In Fig. 6, we show the kernel matrix yielded by TCK and the matrix of dot products of the test-set representations for a large percentage of missing data. The dot-product matrix is very similar to the TCK matrix, as they are both characterized by a block structure indicating that intra-class similarities in the 9 classes are much higher than inter-class similarities.
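
The controlled injection of missing values and the zero imputation can be sketched as follows (NumPy, with helper names of our own choosing); the returned mask can also be reused for the masked reconstruction loss discussed in Sec. 3.3.

import numpy as np

def inject_missing(X, frac, seed=0):
    # X: dataset of MTS with shape (N, T, V). Randomly remove a fraction
    # `frac` of the entries; return the zero-imputed data and the binary
    # observation mask (1 = observed, 0 = removed).
    rng = np.random.default_rng(seed)
    mask = (rng.random(X.shape) >= frac).astype(float)
    return X * mask, mask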

Figure 6: Test set of Jp. Vow. with a large percentage of missing data. (a) Prior kernel matrix $K$ computed with TCK in the input space; (b) dot products of the representations learned by TKAE.

Fig. 7 shows how the classification accuracy and the reconstruction error of TAE and TKAE vary as we increase the amount of missing data. The classification accuracy (blue lines) does not decrease in TKAE when the data contain up to 50% missing values and is always higher than in TAE. At the largest percentage of missing data considered, TKAE still achieves a classification accuracy of 0.7, while for TAE it drops to 0.1. We also observe that the alignment does not compromise input reconstruction, since the MSE in TAE and TKAE (red lines) is almost identical. As a side note, the reconstruction MSE decreases for higher amounts of missingness, since there are more imputed values (which are constants) and hence less information to compress.

Figure 7: Classification accuracy and reconstruction MSE for TAE and TKAE when increasing missing values in Jp. Vow.

In a second experiment, we analyze medical data obtained from EHR in the form of blood measurements of patients who underwent gastrointestinal surgery at the University Hospital of North Norway in 2004–2012 2018arXiv180307879O. Each patient is represented by an MTS of blood sample measurements collected during the days after surgery. We consider the problem of classifying patients with and without surgical site infections from their blood samples. The dataset consists of MTS from both infected and non-infected patients. The original MTS contain missing data, corresponding to measurements not collected for a given patient, which are replaced using mean imputation.

Performance is assessed by classifying the representations generated by TAE and TKAE. We also include in the comparison the representations generated by PCA and a standard AE. TAE and TKAE are configured with 2 layers of 10 GRU cells; AE is configured with a linear decoder and no tied weights. Since the dataset is imbalanced, besides the classification accuracy we also report the F1 score in Tab. 5. We observe that the TKAE representations achieve the best accuracy and F1 score.

Method  Accuracy     F1 score
PCA     0.835        0.651
AE      0.846±0.013  0.675±0.034
TAE     0.853±0.021  0.682±0.022
TKAE    0.899±0.022  0.802±0.047
Table 5: Classification of the blood data. The F1 score is calculated considering infected patients as the “positive” class.

Fig. 8 depicts the first two principal components of the representations learned by TAE and TKAE and their densities, computed with a kernel density estimator. It is possible to recognize the effect of the kernel alignment, as the densities of the components relative to different classes become more separated.

Figure 8: First two principal components, and their densities, of the representations yielded by (a) TAE and (b) TKAE on the blood sample data.

3.3 Case studies

Imputation of missing data with TKAE

Dataset   Mean Imp. MSE  Mean Imp. CORR  LOCF MSE  LOCF CORR  DAE MSE      DAE CORR     TKAE MSE     TKAE CORR
ECG       0.883          0.702           0.393     0.884      0.157±0.004  0.956±0.001  0.151±0.003  0.956±0.001
Libras    0.505          0.666           0.085     0.949      0.050±0.001  0.970±0.001  0.029±0.002  0.978±0.002
Wafer     0.561          0.695           0.226     0.911      0.199±0.017  0.935±0.004  0.093±0.007  0.964±0.003
Jp. Vow.  0.502          0.699           0.084     0.954      0.132±0.001  0.926±0.000  0.114±0.003  0.938±0.001
Table 6: MSE and Pearson correlation (CORR) of the MTS whose missing values are imputed with different methods, computed with respect to the original MTS (without missing values).

In the presence of missing data, the reconstruction MSE in the loss function can be modified to account only for non-imputed values,

$\mathrm{MSE}_m(X, \tilde{X}) = \frac{\sum_{t,v} m_{t,v}\,(x_{t,v} - \tilde{x}_{t,v})^2}{\sum_{t,v} m_{t,v}}, \qquad (8)$

where $m_{t,v} = 0$ if $x_{t,v}$ is imputed and $m_{t,v} = 1$ otherwise. In this way, the decoder is not constrained to reproduce the imputed values and, instead, freely assigns values to the entries that are originally missing. Thus, we can exploit the generalization capability of the decoder to provide an alternative form of imputation, which depends on nonlinear relationships among the whole training data. A similar principle is followed by denoising AEs (DAEs) vincent2010stacked, which try to reconstruct the original input from a corrupted version where some entries are randomly removed, hence implementing a form of imputation beaulieu2016semi; gondara2017multiple.
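
A direct NumPy implementation of the masked reconstruction error of Eq. (8), with hypothetical argument names, is:

import numpy as np

def masked_mse(x, x_rec, mask):
    # x, x_rec: (T, V) original and reconstructed MTS.
    # mask: (T, V) with 0 where the value was imputed and 1 otherwise, so that
    # only the originally observed entries contribute to the error (Eq. 8).
    mask = mask.astype(float)
    return float(np.sum(mask * (x - x_rec) ** 2) / max(mask.sum(), 1.0))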

We randomly remove a fraction of the values from 4 of the datasets in Tab. 3 and compare the capability of TKAE to reconstruct the missing values with that of other imputation techniques. As baselines, we consider mean imputation, last occurrence carried forward (LOCF), and DAE imputation beaulieu2017missing. In Tab. 6 we report the MSE and the Pearson correlation (CORR) of the MTS with imputed missing values, with respect to the original MTS. For TKAE and DAE, we use the same configurations as in Tab. 4, but in TKAE we replace the reconstruction term with (8). In DAE, we apply a stochastic corruption of the inputs by randomly setting input values to 0. From the results, we observe that TKAE achieves a more accurate reconstruction of the true input on 3 of the 4 datasets. On Jp. Vow., instead, LOCF imputation retrieves the missing values with the highest accuracy. This can be explained by the fact that the MTS in this dataset contain very similar values that are repeated over several time intervals. However, also in this case TKAE performs better than DAE.
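
For completeness, the two simplest baselines of Tab. 6 can be sketched as follows (NumPy, per-MTS helpers with our own naming): mean imputation fills each variate with its observed mean, while LOCF propagates the last observed value forward.

import numpy as np

def mean_impute(x, mask):
    # x: (T, V) MTS; mask: 1 = observed, 0 = missing.
    x = x.copy()
    for v in range(x.shape[1]):
        obs = mask[:, v] == 1
        x[~obs, v] = x[obs, v].mean() if obs.any() else 0.0
    return x

def locf_impute(x, mask):
    # Last occurrence carried forward; a missing prefix falls back to the
    # first observed value of that variate (or 0 if the variate is never observed).
    x = x.copy()
    for v in range(x.shape[1]):
        obs_idx = np.where(mask[:, v] == 1)[0]
        last = x[obs_idx[0], v] if obs_idx.size else 0.0
        for t in range(x.shape[0]):
            if mask[t, v] == 1:
                last = x[t, v]
            else:
                x[t, v] = last
    return x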

One-class classification with TKAE

One-class classification and anomaly detection are applied in several domains, including healthcare, where non-nominal samples are scarce and often unavailable during training irigoien2014towards. Methods based on dimensionality reduction, such as AEs and energy-based models pmlr-v48-zhai16; sakurada2014anomaly, rely on the assumption that anomalous samples do not belong to the subspace containing the nominal data, which is learned during training. Therefore, the representations generated by the trained model for samples of a new, unseen class will arguably fail to capture important characteristics. Consequently, for those samples an AE yields large reconstruction errors, which we use as classification scores for the new class.
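
The resulting detector is simply a thresholding of per-sample reconstruction errors; a sketch of how the scores and the AUC reported in Tab. 7 would be computed is given below (the reconstruction function and variable names are placeholders, not part of the paper's code).

import numpy as np
from sklearn.metrics import roc_auc_score

def reconstruction_scores(reconstruct_fn, X_test):
    # reconstruct_fn: a callable mapping an MTS (T, V) to its reconstruction,
    # e.g. the composition of the trained TKAE encoder and decoder.
    # The per-sample reconstruction MSE is used as the anomaly score.
    return np.array([np.mean((x - reconstruct_fn(x)) ** 2) for x in X_test])

# scores = reconstruction_scores(tkae_reconstruct, X_test)   # hypothetical trained model
# auc = roc_auc_score(y_is_afib, scores)   # 1 = unseen class (A), 0 = nominal samples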

Method  AUC
OCSVM   0.713
IF      0.662±0.01
PCA     0.707
AE      0.712±0.001
TKAE    0.732±0.006
Table 7: AUC obtained by different one-class classification methods in detecting the MTS of the atrial fibrillation class, which is not present in the training set.

We process time series of peak-to-peak intervals extracted from ECGs in the 2017 Atrial Fibrillation challenge clifford2017af, which are divided into 4 classes: normal (N), atrial fibrillation (A), other symptoms (O), and noisy records. Following a commonly used procedure doi:10.3102/00346543074004525, we simulate missing data by randomly removing a fraction of the entries in each MTS, and we then exclude the samples of class A from the training set (these are considered non-nominal). We evaluate the performance of TKAE, AE, and PCA in detecting class A in a test set containing samples of all classes. As performance measure, we consider the area under the ROC curve (AUC) and we also compare against two baseline classifiers: one-class SVM (OCSVM) and Isolation Forest (IF). The optimal configurations are: TKAE with 1 layer of GRU cells; AE with a non-linear decoder and no tied weights; OCSVM with an RBF kernel; IF with a fixed contamination level. The results in Tab. 7 show that TKAE scores the highest AUC.

4 Conclusion

We proposed the Temporal Kernelized Autoencoder, an RNN-based model for representing MTS with missing values as fixed-size vectors. MTS with missing values are commonly found in domains such as healthcare, where missing values are caused by measurement errors, incorrect data entry, or lack of observations. Through a kernel alignment with the Time series Cluster Kernel, a similarity measure designed for MTS with missing data, we learned compressed representations that better preserve the pairwise relationships of the original inputs in the presence of missing values.

We demonstrated how the recurrent architecture of TKAE excels in encoding short MTS with many variables, characterized by different lengths and periodicity. We showed that the representations learned by TKAE can be used both in supervised and unsupervised tasks. Experimental results, contrasted with other dimensionality reduction techniques on several datasets of MTS, showed that the TKAE representations are classified more accurately, especially in presence of missing data.

To further evaluate the capabilities of the TKAE architecture, we considered two applications that exploit the decoder, which is learned as part of the optimization. Specifically, we built a framework based on dimensionality reduction and inverse mapping to the input space for imputing missing data and for one-class classification. We showed that TKAE is able to outperform competing methods on those tasks.

Acknowledgments

This work was partially funded by the Norwegian Research Council FRIPRO grant no. 239844 on developing the Next Generation Learning Machines. The authors would like to thank Arthur Revhaug, Rolv-Ole Lindsetmo and Knut Magne Augestad, all clinicians currently or formerly affiliated with the gastrointestinal surgery department at the University Hospital of North Norway, for preparing the blood samples dataset.

Appendix A Details of the TCK algorithm

An MTS $X$ is represented by a sequence of $V$ univariate time series (UTS) of length $T$, where $V$ and $T$ denote the dimension and length of $X$, respectively. Given a dataset of $N$ samples, $X^{(n)}$ denotes the $n$-th MTS and a binary MTS $R^{(n)}$ describes whether the realization $x^{(n)}_{v,t}$ in $X^{(n)}$ is observed ($r^{(n)}_{v,t} = 1$) or missing ($r^{(n)}_{v,t} = 0$).

DiagGMM

The TCK kernel matrix is built by first fitting diagonal covariance GMMs (DiagGMM) to the MTS dataset. Each DiagGMM component is parametrized by a time-dependent mean $\mu_{v,t}$ and a time-constant diagonal covariance $\Sigma = \mathrm{diag}(\sigma^2_1, \dots, \sigma^2_V)$, $\sigma^2_v$ being the variance of UTS $v$. Moreover, the data are assumed to be missing at random, i.e., the missing elements depend only on the observed values. Under these assumptions, missing data can be analytically integrated away rubin1976inference and the pdf of each incompletely observed MTS, given mixture component $g$ with parameters $\theta_g$, is

$p\big(X^{(n)} \mid R^{(n)}, \theta_g\big) = \prod_{v=1}^{V} \prod_{t=1}^{T} \mathcal{N}\big(x^{(n)}_{v,t} \mid \mu_{g,v,t}, \sigma^2_{g,v}\big)^{r^{(n)}_{v,t}}. \qquad (9)$

The conditional probabilities of the mixture components follow from Bayes' theorem,

$\pi^{(n)}_g = \frac{\pi_g\, p\big(X^{(n)} \mid R^{(n)}, \theta_g\big)}{\sum_{g'} \pi_{g'}\, p\big(X^{(n)} \mid R^{(n)}, \theta_{g'}\big)}, \qquad (10)$

where $\pi_g$ are the mixing coefficients. The parameters of the DiagGMM are trained by means of a maximum a posteriori expectation maximization algorithm, as described in mikalsen2017time.

Ensemble generation

To ensure diversity in the ensemble, each GMM has a different number of mixture components, chosen from a predefined interval, and is trained several times using random initial conditions and hyperparameters. Each ensemble member is thus indexed by a pair $q = (q_1, q_2)$, where $q_1$ indexes the initial conditions and hyperparameters and $q_2$ the number of mixture components. Moreover, each DiagGMM is trained on a subset of the original dataset, defined by a random set of the MTS samples, a random set of variables (attributes), and a randomly chosen time segment. The inner products of the posterior distributions from each mixture model are then added up to build the final TCK kernel matrix. Details are provided in Alg. 1.

1: Input: training set of MTS $\{X^{(n)}\}_{n=1}^{N}$, number of initializations, maximal number of mixture components.
2: Initialize kernel matrix $K = 0_{N \times N}$.
3: for each $q = (q_1, q_2)$, where $q_1$ indexes the random initialization and $q_2$ the number of mixture components do
4:     Compute posteriors $\pi^{(n)}(q)$ by applying maximum a posteriori expectation maximization mikalsen2017time to the DiagGMM with $q_2$ components and by randomly selecting,

  • hyperparameters,

  • a time segment,

  • a subset of attributes,

  • a subset of the MTS samples,

  • an initialization of the mixture parameters.

5:     Update kernel matrix, $K_{nm} \leftarrow K_{nm} + \pi^{(n)}(q)^{T} \pi^{(m)}(q)$.
6: end for
7: Output: TCK matrix $K$.
Algorithm 1 TCK kernel training
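
The following scikit-learn/NumPy sketch mirrors the structure of Alg. 1 in a simplified form: it fits diagonal GMMs on random subsets of time steps and variates and accumulates inner products of posterior assignments. It is only an illustration of the ensemble idea; the actual TCK uses MAP expectation maximization with analytic marginalization of the missing values and informative priors, which are not reproduced here, and all parameter values below are arbitrary.

import numpy as np
from sklearn.mixture import GaussianMixture

def tck_like_kernel(X, n_draws=30, max_components=10, seed=0):
    # X: (N, T, V) zero-imputed MTS dataset, with N comfortably larger than
    # max_components. Returns an (N, N) similarity matrix built by summing
    # inner products of GMM posterior assignments computed on random subsets
    # of time steps and variates.
    rng = np.random.default_rng(seed)
    N, T, V = X.shape
    K = np.zeros((N, N))
    for i in range(n_draws):
        g = int(rng.integers(2, max_components + 1))          # number of components
        t0 = int(rng.integers(0, T - 1))                      # random time segment
        t1 = int(rng.integers(t0 + 1, T + 1))
        v_idx = rng.choice(V, size=int(rng.integers(1, V + 1)), replace=False)
        feats = X[:, t0:t1, :][:, :, v_idx].reshape(N, -1)
        gmm = GaussianMixture(n_components=g, covariance_type="diag",
                              reg_covar=1e-3, random_state=i)
        post = gmm.fit(feats).predict_proba(feats)            # (N, g) posteriors
        K += post @ post.T                                    # accumulate inner products
    return K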

References