1 Introduction
Realvalued multivariate time series (MTS) allow to characterize the evolution of complex systems and their analysis is the core component in many research fields and application domains chatfield2016analysis . MTS analysis should account for relationships across variables and time steps, and, at the same time, deal with unequal time lengths and missing data langkvist2014review . Missing values, commonly found in realworld data such as electronic health records (EHR) che2017time , are usually filled with imputation techniques before processing the data. However, unless they are missing completely at random heitjan1996distinguishing , imputation destroys information of potential interest contained in the missingness patterns. Furthermore, especially for large fractions of missing values, each imputation method can introduce a strong kind of bias that influences the analysis outcome little2014statistical . A data driven approach has been recently proposed to learn when to switch between two particular types of imputation Che2018 , but it relies on strong assumptions that are suitable only for specific applications.
MTS data often contain noise, redundant information, and can be characterized by a large amount of variables and time steps. In those cases, extracting the relevant information and generating compressed representations using dimensionality reduction techniques facilitate the analysis and processing of the data langkvist2014review ; Keogh2001 ; DENG2013142 . Dimensionality reduction has been a fundamental research topic in machine learning scholkopf1997kernel ; fukumizu2004dimensionality ; jenssen2010kernel ; harandi2017dimensionality . However, classic approaches are not designed to process sequential data, especially in the presence of missing values.
In this paper, we propose a novel neural network architecture called Temporal Kernelized Autoencoder (TKAE) to learn compressed representations of realvalued MTS with unequal lengths and missing data. Our model is based on a deep Autoencoder Hinton504
with recurrent layers, which generates a fixedsize vectorial representation of the input MTS. To better capture time dependencies, we implement the encoder with a bidirectional recurrent neural network
6638947 . The final states of the forward and backward network are combined by a dense nonlinear layer that reduces the dimensionality of the representation.To avoid the undesired biases introduced by imputation, we train our architecture with a loss function that preserves pairwise similarities in the learned representations, even in presence of missing data. This is achieved by a kernel alignment procedure
kernelAE , which matches the dot products of the representations with a kernel similarity defined in the input space. To this end, we consider the recentlyproposed Time series Cluster Kernel (TCK) mikalsen2017time , which has been shown to capture well the relationships between MTS containing missing data.The proposed architecture serves different purposes. When represented as ordinary vectors, MTS can be processed by nonsequential classification or unsupervised machine learning algorithms Xing:2010:BSS:1882471.1882478 , and their indexing and retrieval is more efficient 7179089 ; chung2016unsupervised . Furthermore, the dimensionality of the data is reduced and, accordingly, models can potentially be trained with less samples. Contrarily to other nonlinear dimensionality reduction techniques, the decoder provides an explicit mapping back to the input space. This can be used, for example, to design an anomaly detector based on the reconstruction error of inputs pmlrv48zhai16 , or to implement an imputation method that leverages the generalization capability of the decoder reconstruction, rather than relying on apriori assumptions that may introduce stronger biases beaulieu2017missing .
TKAE is evaluated through several experiments on synthetic and benchmark datasets, and MTS extracted from realworld EHR. We first investigate the advantages of using recurrent layers to generate compressed MTS representations with respect to other dimensionality reduction methods. Then, we show the benefit of the kernel alignment for learning representations in classification settings with missing data. Finally, we exploit the capability of the TKAE decoder to impute missing data and build a oneclass classifier for anomaly detection.
2 Methods
Sec. 2.1 and 2.2 provide the required background, describing respectively the Autoencoder (AE) and TCK, the selected kernel similarity for MTS with missing values. The details of our methodological contribution are presented in Sec. 2.3.
2.1 Autoencoder
The AE is a neural network traditionally conceived as a nonlinear dimensionality reduction algorithm Hinton504 , which has been further exploited to learn representations in deep architectures bengio2009learning and to pretrain neural network layers erhan2010does . An AE simultaneously learns two functions; the first one, called encoder, is a mapping from an input domain,
, to a hidden representation (
code) in . The second function, decoder, maps from back to . The encoding and decoding functions are defined as(1) 
where , , and denote a sample from the input space, its hidden representation, and its reconstruction given by the decoder, respectively. The encoder
is usually implemented by stacked dense layers of neurons equipped with sigmoidal activation functions. The decoder
often has an architecture symmetric to the encoder that operates in reverse direction; when inputs are realvalued vectors, decoder squashing nonlinearities are often replaced by linear activations vincent2010stacked . Finally, and are the trainable parameters of the encoder and decoder, respectively. In this work, we focus on AEs implemented with fully connected layers (later, also recurrent layers will be introduced). In those cases, the parameters are the connection weights and biases of each layer, i.e., and . AEs are trained to minimize the discrepancy between the input and its reconstruction . In case of realvalued inputs, this is usually achieved by minimizing a loss implemented as the empirical Mean Squared Error (MSE).In this paper, we focus on AEs with a “bottleneck”, which learn an undercomplete representation of the input, i.e., , retaining as much useful information as possible to allow an accurate reconstruction Hinton504 . The learned lossy, compressed representation of the input can be exploited, e.g., in clustering and visualization tasks makhzani2015adversarial , or to train a classifier ng2016dual
. The bottleneck already provides a strong regularization as it limits the variance of the model. However, further regularization can be introduced by tying encoder and decoder weights (
) or by adding a norm penalty to the loss function(2) 
where is the norm of all model weights and
is the hyperparameter controlling the contribution of the regularization term.
Recurrent neural networks (RNNs) are models excelling in capturing temporal dependencies in sequences elman1990finding ; bianchi2017recurrent and are at the core of seq2seq models sutskever2014sequence . The latter learns fixedsize representations of sequences with unequal lengths and, at the same time, generates variablelength outputs. The main applications of RNN AEs based on the idea of seq2seq have been on text DBLP:journals/corr/BowmanVVDJB15 , speech NIPS2015_5653 , and video data srivastava2015unsupervised . However, few efforts have been devoted so far in applying these architectures to realvalued MTS.
Modern seq2seq architectures implement a powerful mechanism called attention, which provides an inductive bias that facilitate modelling longterm dependencies and grants a more accurate decoding if the length of the input sequences varies considerably BahdanauCB14 ; kim2017structured ; NIPS2017_7181 . Rather than learning a single vector representation for the whole input sequence, a model with attention maintains all the encoder states generated over time, which are combined by a timevarying decoding vector at each decoding step. Therefore, models with attention provide a representation that is neither compact nor of fixed size and, henceforth, are not suitable for our purposes.
2.2 Time Series Cluster Kernel
The Time series Cluster Kernel (TCK) mikalsen2017time
is an algorithmic procedure to compute kernel similarities among MTS containing missing data. The method is based on an ensemble learning approach that guarantees robustness with respect to hyperparameters. This ensures that the TCK also works well in unsupervised settings, the ones in which AE usually operates, where it is not possible to tune the hyperparameters by means of supervised crossvalidation. The base models in the ensemble are Gaussian mixture models (GMMs), which are fit to the dataset of MTS using a range of numbers of mixture components. By fitting GMMs with different numbers of mixtures, the TCK procedure generates partitions at different resolutions that capture both local and global structures in the data.
To further enhance diversity in the ensemble, each partition is evaluated on a random subset of MTS samples, MTS attributes (variates) and time segments, using random initializations and randomly chosen hyperparameters. This also contributes to provide robustness with respect to the selection hyperparameters, such as the number of mixture components. To avoid imputation, missing data are analytically marginalized away in the likelihoods. To obtain the GMM posteriors, the likelihoods are multiplied with smooth priors, whose contribution becomes stronger as the percentage of missingness increases. TCK is then built by summing up, for each partition, the inner products between pairs of posterior assignments corresponding to different MTS. The details of TCK are provided in A.
2.3 Temporal Kernelized Autoencoder
The Temporal Kernelized Autoencoder (TKAE) is a novel AE architecture designed for MTS of unequal length with missing values. A schematic representation of TKAE is provided in Fig. 1. We assume MTS to be represented as a matrix , where denotes the number of variables and is the number of time steps that may vary in each MTS. Analogously to seq2seq sutskever2014sequence , in TKAE the dense layers of standard AEs are replaced by recurrent layers, which process inputs sequentially and update their hidden state at each time step according to the following map,
(3) 
where
is the set of trainable parameters of the recurrent units. The recurrent layers are composed of either gated recurrent unit (GRU)
cho2014learningor long shortterm memory (LSTM)
hochreiter1997long cells. The choice of the cell is usually guided by the task at hand chung2014empirical .Conventional RNNs make use of previous inputs to build their current hidden representation 6638947 . However, in applications like MTS classification where the whole input is available at once, it is also possible to exploit the information contained in future inputs to generate the current state of the network. For this reason, the encoder is implemented as a stacked bidirectional RNN schuster1997bidirectional consisting of two RNNs working in parallel, each one with layers of cells and transition function (3). While one RNN captures input dependencies going backward in time, the other processes the same input but in reverse order, thus modeling relationships that go forward in time. After the whole input is processed, the final states of the forward and backward RNN are denoted as and , respectively. While is influenced by the past observations, depends on the future ones, hence their combination can capture a wider range of temporal dependencies in the input. In TKAE, the combination is implemented with a dense nonlinear layer , which produces an output vector . The latter, is the fixedsize, vectorial representation of the MTS.
The decoder operates according to the following map,
(4) 
where is a stacked RNN with layers parametrized by that operates in generative mode, processing the previously generated output as new input. To initialize the decoder, we let its initial state and first input , which corresponds to an “average input” if MTS are standardized. The decoder iteratively produces outputs for steps, being the length of the input MTS. Unequal lengths are naturally handled since the whole architecture is independent of .
TKAE is trained endtoend by means of stochastic gradient descent with scheduled sampling
NIPS2015_5956 . More specifically, during training the decoder input at timeis, with probability
, the decoder output at time (inference mode) and with probability the desired output at time (teacher forcing). Since the desired output is not available during the test phase, the decoder generates test data operating only in generative mode (). In most of our experiments, scheduled sampling improved the training convergence speed, providing a practical motivation for our choice.Analogously to standard AEs, the RNNs in TKAE cannot process data with missing values. Therefore, those are filled beforehand with some imputed value (0, mean value, last observed value) schafer2002missing . However, imputation injects biases in data that may negatively affect the quality of the representations and conceal potentially useful information contained in the missingness patterns. To compensate for these shortcomings, we introduce a kernel alignment procedure kernelAE , which allows to preserve the pairwise similarities of the inputs, encoded as a positive semidefinite matrix , in the learned representations. can be any positive semidefinite matrix and is selected according to which properties one wants to embed into the representations. In our case, by choosing the TCK matrix as , the relationships of the learned representations will also account for missing data.
Kernel alignment is implemented by an additional regularization term in the loss function (2), which becomes
(5) 
is the kernel alignment cost, which computes the normalized Frobenius norm of the difference between two matrices: , the prior kernel matrix, and , the dot product matrix between the encodings of the input MTS. The term is defined as
(6) 
where is the matrix of hidden representations relative to the MTS in the dataset (or, more specifically, in the current minibatch). Finally, is a hyperparameter controlling the contribution of the alignment cost in the loss function.
3 Experiments
We perform three sets of experiments. First, in Sec. 3.1 we investigate advantages and shortcomings of an AE with recurrent layers to learn MTS representations, with respect to other dimensionality reduction methods. In this case, we do not use kernel alignment () and we refer to TKAE simply as TAE. Secondly, in Sec. 3.2 we show how kernel alignment improves the learned representations when MTS contain missing data. Lastly, in Sec. 3.3 we present two casestudies, where TKAE is used for oneclass classification and for imputing missing data. Both applications exploit not only the TKAE hidden representation but also its decoder, as the results are computed in the input space.
In the following, we compare the proposed architecture with baseline methods for dimensionality reduction, such as a standard AE and PCA; the learned compressed representations have the same dimensionality in all models taken into account. In each experiment, we train the models for 5000 epochs with minibatches containing 25 MTS using the Adam optimizer
kingma2014adam with initial learning rate . We independently standardize each variate of the MTS in all datasets. In each experiment and for each method, we identify the optimal hyperparameters with fold crossvalidation evaluated on the reconstruction error (or, in general, on the unsupervised loss function) and we report the average results on the test set, obtained in independent runs. We consider only TKAE models with maximum 3 hidden layers of either LSTM or GRU cells, as deeper models generally improve performance slightly at the cost of greater complexity reimers2017optimal .3.1 Evaluation of TAE representations
In this section, we compare the compressed representations yielded by TAE, a standard AE, and PCA, to investigate which are the types of data that are better represented when processed by a recurrent architecture. Beside the benchmark datasets, we also consider synthetic data to study the properties of the different architectures in a controlled setting. We let be the input dimensionality; in TAE , as it processes recursively each single time step. On the other hand, the MTSs must be unrolled when processed by PCA and AE since they expect vectors as inputs rather than sequences. Therefore, in AE and PCA and then the reconstructed outputs are folded back to match the original shape of the input sequence. We let be the size of the compressed representations, which corresponds to the number of RNN cells in each TAE layer, the size of the innermost layer in AE, and the number of principal components in PCA, respectively. In all experiments we use an AE architecture with 3 hidden layers, ; the number of neurons in the intermediate layers () has been set after preliminary experiments and is not a sensitive hyperparameter (comparable results were obtained using or neurons). As measure of performance, we consider the MSE between the original test elements and their reconstructions produced by each model.
Time series with different frequencies
Here, we show the capability of TAE to compress periodic signals having different frequencies and phases. We generate a dataset of sinusoids , where are drawn from and . The proposed task is closely related to the multiple superimposed oscillators, studied in pattern generation and frequency modulation sussillo2009generating . The training and test set contain and samples, respectively. We let and the optimal configurations are: AE with nonlinear decoder and ; TAE with 2 layers of LSTM cells, , and . The reconstruction MSE on the test set is for PCA, for AE, and for TAE.
Both PCA and AE process the entire time series at once. This may appear an advantage with respect to TAE, which stores information in memory for the whole sequence length before generating the final representation. Nonetheless, TAE yields a better reconstruction since the AE (and PCA) architecture is unsuitable for this task. Indeed, in AEs the time step in each MTS is always processed by the same input neuron. For periodic signals, the training procedure tries to couple neurons associated to time steps with the same phase, by associating similar weights to their connections. However, these couplings always change if inputs have different frequencies (see Fig.2).
Therefore, the standard AE training never converges as it is impossible to learn a model that generalizes well for each frequency. On the other hand, thanks to its recurrent architecture, TAE can naturally handle inputs of different frequencies since there is no pairing between structural parameters and time steps.
Fig. 3 shows the reconstruction of one sample time series.
The lower quality of the reconstruction yielded by AE and PCA can be immediately noticed. Additionally, since those methods are unable to reproduce the dynamics of each sample, they rather adopt a more conservative setting and output signals with lower amplitudes that are closer (in a mean square sense) to the “average” of all the random sinusoids in the dataset.
Time series of different lengths
While TKAE can process MTS with different lengths, standard AE and PCA require inputs of fixed size. The common workaround, also followed in this work, is to pad the shorter MTS with zeros
mann2008smoothing. To systematically study the performance of the different methods when MTS have fixed or variable length, we generate data by integrating the following system of firstorder Ordinary Differential Equations (ODE):
(7) 
where , is a matrix with 50% sparsity and elements uniformly drawn in . To guarantee system stability, we set the spectral radius of to . is applied componentwise and introduces nonlinear dependencies among the variables. A MTS is obtained by integrating (7) for steps, starting from a random initial condition . Since a generic deterministic dynamical system can be described by a ODE system, these synthetic MTS represent well many casestudies.
We generate two different datasets of MTS with variables, each one with 400 and 1000 samples for training and test set, respectively. The first, ODEfix, contains MTS with same length , while in the second, ODEvar, each MTS has a random length . We let and compare the reconstruction MSE of PCA, AE, and TAE. The optimal configurations for this task are: AE with and linear decoder; TAE with 1 LSTM layer, , and . The average results for independent random generations of the data () and initialization of AE and TAE are reported in Tab. 1.
Dataset  PCA  AE  TKAE 

5pt. ODEfix  0.018  0.004  0.060 
ODEvar  0.718  0.676  0.185 
In ODEfix, both AE and PCA yield almost perfect reconstructions, which is expected due to simplicity of the task. However, they perform worse in ODEvar despite the presence of many padded values and a consequent lower amount (in average) of information to encode in the compressed representation. On the other hand, TAE naturally deals with variablelength inputs, since once the input sequence terminates its state is no longer updated, as well as its model weights during the training.
Dealing with large number of variates and time steps
In order to test the ability to learn compressed representations when the number of variates in the MTS increases, starting from (7) we generate 4 datasets ODE5, ODE10, ODE15, and ODE20, obtained by setting . The number of time steps is fixed to in each dataset. We let ; TAE is configured with 2 layers of LSTM and ; is in both AE and TAE. We also include in the comparison an AE with tied weights in the (nonlinear) decoder, which has fewer parameters. Reconstruction errors are reported in Tab. 2. We notice that AE performs well on MTS characterized by low dimensionality, but performance degrades when assumes larger values. Since AE processes MTS unrolled into a unidimensional vector, the input size grows quickly as increases (one additional variable increases the total input size by ). Accordingly, the number of parameters in the first dense layer scalesup quickly, possibly leading to model overfit. We also notice that the tied weights regularization, despite halving the number of trainable parameters, degrades performance in each case, possibly because it hinders too much the flexibility of the model. On the other hand, in TAE the model complexity changes slowly, as only one single neuron is added for an additional input dimension. As a consequence, TAE is the best performing model when MTS have a large number of variates.
Dataset  TAE  AE  AE (tw)  PCA  

MSE  #par  MSE  #par  MSE  #par  MSE  
5pt. ODE5  0.019  6130  0.04  31170  0.014  15870  0.007 
ODE10  0.060  6780  0.04  61670  0.071  31370  0.018 
ODE15  0.072  7430  0.106  92170  0.153  46870  0.174 
ODE20  0.089  8080  0.121  122670  0.181  62370  0.211 
To study the performance as the length of MTS increase, we generate 8 datasets with the ODE system (7) by varying , while keeping fixed . In Fig. 4, we report the reconstruction errors and note that TAE performs poorly as increases. Especially if there are no temporal patterns in the data that can be exploited by the RNNs to model the inputs, like in this case, TAE is more effective when MTSs are short.
Evaluation on benchmark classification datasets
Here we consider several classification benchmarks from Baydogan2016 , UCR, and UCI repositories^{1}^{1}1www.cs.ucr.edu/~eamonn/time_series_data, archive.ics.uci.edu/ml/datasets.html, whose details are reported in Tab. 3. The datasets have been selected in order to cover a wide variety of cases, in terms of training/test sets size, number of variates , number of classes and (variable) lengths of the MTS. We evaluate the quality of the representations yielded by TAE, standard AE, and PCA by classifying them with a NN classifier configured with and Euclidean distance. For each method the results, which include also the reconstruction MSE, are reported in Tab. 4. On the majority of the datasets TAE produces compressed representations that not only best describe the classes of the original inputs (in terms of classification accuracy), but also provide the most accurate reconstruction. In few cases, however, PCA achieves higher performance, which is even superior to the one obtained by the standard AE. Such a result may appear contradictory at first, since the AE is a nonlinear extension of PCA and should perform at least equally well; when properly trained. Nevertheless, in MTS datasets having large dimensionality (i.e., many variates) but few samples, it is difficult to obtain a good fit of the parameters in models such as AE and TAE.
Dataset  Train  Test  Classes  Source  

5pt. ECG  1  500  4500  5  140  140  UCR 
ECG2  2  100  100  2  39  152  UCR 
Libras  2  180  180  15  45  45  Baydogan2016 
Char.Traj.  3  300  2558  20  109  205  UCI 
Wafer  6  298  896  2  104  198  UCR 
Jp. Vow.  12  270  370  9  7  29  UCI 
Arab. Dig.  13  6600  2200  10  4  93  UCI 
Auslan  22  1140  1425  95  45  136  UCI 
Dataset  PCA  AE  TAE  

MSE  ACC  MSE  ACC  tw  MSE  ACC  cell  
5pt. ECG  10  0.026  0.934  0.027  0.937  lin.  0  no  0.064  0.931  GRU1  1.0  0 
ECG2  10  0.100  0.801  0.132  0.809  sig.  0.001  no  0.131  0.819  LSTM2  1.0  0.001 
Lbras  5  0.199  0.577  0.173  0.583  sig.  0.001  no  0.110  0.661  LSTM2  0.9  0.001 
Char. Traj.  10  0.094  0.922  0.116  0.903  lin.  0.001  no  0.122  0.912  LSTM3  1.0  0.001 
Wafer  10  0.139  0.928  0.163  0.910  lin.  0  no  0.088  0.919  LSTM2  1.0  0 
Jp. Vow.  10  0.186  0.929  0.216  0.927  lin.  0.001  no  0.177  0.932  LSTM2  0.8  0.001 
Arab. Dig.  15  0.542  0.965  0.554  0.926  sig.  0  no  0.418  0.984  GRU2  1.0  0 
Ausaln  10  0.245  0.538  0.338  0.344  lin.  0  yes  0.213  0.680  LSTM2  1.0  0 
In particular, TAE does not obtain the best classification results only on three of the benchmark datatsets: ECG, Character Trajectories, and Wafer. However, in these cases the classification accuracy achieved by TAE is only slightly inferior to the competitors and it can be shown that TAE learns compressed representations which are qualitatively comparable to the ones of the other methods. This is illustrated in Fig. 5, which depicts the first two principal components of the learned representations; for TAE and AE, the components are computed in their hiddenstate space. It is possible to observe that the representations yielded by TAE are characterized by a welldefined structure and the class separation (qualitatively speaking) is comparable to the other methods.
3.2 Kernel alignment to handle missing data
So far, we have investigated the effectiveness of a recurrent AE architecture to generate compressed representations of MTSs, with respect to baseline methods for dimensionality reduction. Hereinafter, we demonstrate the effect of kernel alignment when MTS contain missing data. Specifically, we compare the representations learned by TAE () and TKAE () in a synthetic and real classification problem.
First, we consider the Jp. Vow. dataset (see Tab. 3). This dataset does not originally contain missing data, but similarly to previous studies Baydogan2016 ; mikalsen2017time we inject missing data in a controlled way by randomly removing a certain percentage of elements from the dataset. We vary such percentage from to , evaluating each time the reconstruction MSE and classification accuracy of TAE and TKAE encodings using NN with . We apply zero imputation to replace missing data in the MTS. TAE and TKAE are configured with 2 LSTM cells, and . In TKAE, . In Fig. 6, we show the kernel matrix yielded by TCK and the dot products of the representations of the test set when of the data are missing. is very similar to the TCK matrix, as they are both characterized by a block structure indicating that intraclass similarities in the 9 classes are much higher than interclass similarities.
Fig. 7 shows how the classification accuracy and reconstruction error of TAE and TKAE vary as we increase the amount of missing data. The classification accuracy (blue lines) does not decrease in TKAE when the data contain up to 50% missing values and is always higher than in TAE. When of the data are missing, TKAE still achieves a classification accuracy of 0.7, while for TAE it drops to 0.1. We also observe that the alignment does not compromise input reconstruction, since the MSE in TAE and TKAE (red lines) is almost identical. As a side note, the reconstruction MSE decreases for higher amount of missingness since there are more imputed values (which are constants), hence less information to compress.
In a second experiment, we analyze medical data obtained from EHR in the form of blood measurements of patients undergoing a gastrointestinal surgery at the University Hospital of North Norway in 2004–2012 2018arXiv180307879O . Each patient is represented by a MTS of blood sample measurements collected for days after surgery. We consider the problem of classifying patients with and without surgical site infections from their blood samples. The dataset consists of MTS, of which pertain to infected patients. The original MTS contain missing data, corresponding to measurements not collected for a given patient, which are replaced with meanimputation.
Performance is assessed by classifying the representations generated by TAE and TKAE. We also include in the comparison the representations generated by PCA and a standard AE. We let ; TAE and TKAE are configured with 2 layers of 10 GRU cells, and ; we set in TKAE. AE is configured with and a linear decoder without tied weights. Since the dataset is imbalanced, beside classification accuracy in Tab. 5 we also report the F1 score. We can observe that TKAE representations achieves the best accuracy and F1 score.
Method  Accuracy  F1 score 

5pt. PCA  0.835  0.651 
AE  0.8460.013  0.6750.034 
TAE  0.8530.021  0.6820.022 
TKAE  0.8990.022  0.8020.047 
Fig. 8
depicts the first two principal components of the representations learned by TAE and TKAE and their densities, computed with a kernel density estimator. It is possible to recognize the effect of the kernel alignment, as the densities of the components relative to different classes become more separated.
3.3 Case studies
Imputation of missing data with TKAE
Dataset  Mean Imp.  LOCF  DAE  TKAE  

MSE  CORR  MSE  CORR  MSE  CORR  MSE  CORR  
5pt. ECG  0.883  0.702  0.393  0.884  0.1570.004  0.9560.001  0.1510.003  0.9560.001 
Libras  0.505  0.666  0.085  0.949  0.0500.001  0.9700.001  0.0290.002  0.9780.002 
Wafer  0.561  0.695  0.226  0.911  0.1990.017  0.9350.004  0.0930.007  0.9640.003 
Jp. Vow.  0.502  0.699  0.084  0.954  0.1320.001  0.9260.000  0.1140.003  0.9380.001 
In presence of missing data, the reconstruction MSE of the loss function can be modified to account only for nonimputed values,
(8) 
where if is imputed and 1 otherwise. In this way, the decoder is not constrained to reproduce the values that are imputed and, instead, freely assigns values to the entries that are originally missing. Thus, we can exploit the generalization capability of the decoder to provide an alternative form of imputation, which depends on nonlinear relationships among the whole training data. A similar principle is followed by denoising AEs (DAEs) vincent2010stacked , as they try to reconstruct the original input from a corrupted version where some entries are randomly removed, hence implementing a form of imputation beaulieu2016semi ; gondara2017multiple .
We randomly remove approximately of the values from 4 datasets in Tab. 3 and compare the capability of TKAE to reconstruct missing values with respect to other imputation techniques. As baseline, we consider mean imputation, last occurrence carried forward (LOCF), and DAE imputation beaulieu2017missing . In Tab. 6 we report MSE and Pearson correlation (CORR) of the MTS with imputed missing values, with respect to the original MTS. For TKAE and DAE, we use the same configurations from Tab. 4, but in TKAE we replace the term with (8) and we set . In DAE, we apply a stochastic corruption of the inputs, by setting input values with probability to 0. From the results, we observe that TKAE achieve a more accurate reconstruction of the true input in 3 of the 4 datasets. In Jp. Vow. instead, LOCF imputation allows to retrieve missing values with the highest accuracy. This can be explained by the fact that the MTS in the dataset contain very similar values that are repeated for several time intervals. However, it is possible to notice that also in this case TKAE performs better than DAE.
Oneclass classification with TKAE
Oneclass classification and anomaly detection are applied in several domains, including healthcare, where nonnominal samples are scarce and often unavailable during training irigoien2014towards
. The methods based on dimensionality reduction procedures, such as AEs and energy based models
pmlrv48zhai16 ; sakurada2014anomaly rely on the assumption that anomalous samples do not belong to the subspace containing nominal data, which is learned during training. Therefore, the representations generated by the trained model for samples of a new, unseen class will arguably fail to capture important characteristics. Consequently, for those samples an AE would yield large reconstruction errors, which we consider as the classification scores for the new class.Method  AUC 

5pt. OCSVM  0.713 
IF  0.6620.01 
PCA  0.707 
AE  0.7120.001 
TKAE  0.7320.006 
We process time series of peaktopeak intervals extracted from ECGs in the 2017 Atrial Fibrillation challenge clifford2017af , which are divided in 4 classes: normal (N), atrial fibrillation (A), other symptoms (O) and noisy records (). By following a commonly used procedure doi:10.3102/00346543074004525 , we simulate missing data by randomly removing approximately of the entries in each MTS and then we exclude samples of class A from the training set (which are then considered as nonnominal). We evaluated the performance of TKAE, AE, and PCA in detecting class A in a test set containing samples of all classes (N,A,O,). As performance measure, we considered the area under ROC curve (AUC) and compared the performance also with two baseline classifiers: oneclass SVM (OCSVM) and Isolation Forests (IF). The optimal configurations are: ; TKAE with 1 layer of GRU cells, , , and ; AE with nonlinear decoder, no tied weights, and ; OCSVM with rbf kernel width and ; IF with contamination . Results in Tab. 7 show that TKAE scores the highest AUC.
4 Conclusion
We proposed the Temporal Kernelized Autoencoder, an RNNbased model for representing MTS with missing values as fixedsize vectors. MTS with missing values are commonly found in domains such as healthcare and are caused by measurement errors, incorrect data entry or lack of observations. Through a kernel alignment with the Time series Cluster Kernel, a similarity measure designed for MTS with missing data, we learned compressed representations that better preserve the original input pairwise relationships in presence of missing values.
We demonstrated how the recurrent architecture of TKAE excels in encoding short MTS with many variables, characterized by different lengths and periodicity. We showed that the representations learned by TKAE can be used both in supervised and unsupervised tasks. Experimental results, contrasted with other dimensionality reduction techniques on several datasets of MTS, showed that the TKAE representations are classified more accurately, especially in presence of missing data.
To further evaluate the capabilities of the TKAE architecture, we considered two applications that exploit the decoder, which is learned as part of the optimization. Specifically, we built a framework based on dimensionality reduction and inverse mapping to the input space for imputing missing data and for oneclass classification. We showed that TKAE is able to outperform competing methods on those tasks.
Acknowledgments
This work was partially funded by the Norwegian Research Council FRIPRO grant no. 239844 on developing the Next Generation Learning Machines. The authors would like to thank Arthur Revhaug, RolvOle Lindsetmo and Knut Magne Augestad, all clinicians currently or formerly affiliated with the gastrointestinal surgery department at the University Hospital of North Norway, for preparing the blood samples dataset.
Appendix A Details of the TCK algorithm
A MTS is represented by a sequence of univariate time series (UTS) of length , , being and the dimension and length of , respectively. Given a dataset of samples, denotes the th MTS and a binary MTS describes whether the realisation in is observed () or is missing ().
DiagGMM
The TCK kernel matrix is built by first fitting diagonal covariance GMM (DiagGMM) to the MTS dataset. Each DiagGMM is parametrized by a timedependent mean and a timeconstant covariance matrix , being the variance of UTS . Moreover, the data is assumed to be missing at random, i.e. the missing elements are only dependent on the observed values. Under these assumptions, missing data can be analytically integrated away rubin1976inference and the pdf for each incompletely observed MTS is given by
(9) 
The conditional probabilities follows from Bayes’ theorem,
(10) 
The parameters of the DiagGMM are trained by means of a maximum a posteriori expectation maximization algorithm, as described in
mikalsen2017time .Ensemble generation
To ensure diversity in the ensemble, each GMM model has a different number of components from the interval and is trained times, using random initial conditions and hyperparameters. Specifically, denotes the index set of the initial conditions and hyperparameters (), and the number of components (). Moreover, each DiagGMM is trained on a subset of the original dataset, defined by a random set of the MTS samples, a random set of variables, and a randomly chosen time segment . The inner products of the posterior distributions from each mixture component are then added up to build the final TCK kernel matrix. Details are provided in Alg. 1.
References
References
 (1) C. Chatfield, The Analysis of Time Series: An Introduction, CRC press, 2016.

(2)
M. Längkvist, L. Karlsson, A. Loutfi, A review of unsupervised feature learning and deep learning for timeseries modeling, Pattern Recognition Letters 42 (2014) 11–24.
doi:10.1016/j.patrec.2014.01.008.  (3) Z. Che, S. Purushotham, D. Kale, W. Li, M. T. Bahadori, R. Khemani, Y. Liu, Time Series Feature Learning with Applications to Health Care, Springer International Publishing, Cham, 2017, pp. 389–409. doi:10.1007/9783319513942_20.
 (4) D. F. Heitjan, S. Basu, Distinguishing “missing at random” and “missing completely at random”, The American Statistician.
 (5) R. J. A. Little, D. B. Rubin, Statistical Analysis with Missing Data, John Wiley & Sons, 2014.
 (6) Z. Che, S. Purushotham, K. Cho, D. Sontag, Y. Liu, Recurrent neural networks for multivariate time series with missing values, Scientific Reports 8 (1) (2018) 6085.
 (7) E. Keogh, K. Chakrabarti, M. Pazzani, S. Mehrotra, Dimensionality reduction for fast similarity search in large time series databases, Knowledge and Information Systems 3 (3) (2001) 263–286. doi:10.1007/PL00011669.

(8)
H. Deng, G. Runger, E. Tuv, M. Vladimir, A time series forest for classification and feature extraction, Information Sciences 239 (2013) 142 – 153.
doi:https://doi.org/10.1016/j.ins.2013.02.030. 
(9)
B. Schölkopf, A. Smola, K.R. Müller, Kernel principal component analysis, in: W. Gerstner, A. Germond, M. Hasler, J.D. Nicoud (Eds.), Artificial Neural Networks — ICANN’97, Springer Berlin Heidelberg, Berlin, Heidelberg, 1997, pp. 583–588.

(10)
K. Fukumizu, F. R. Bach, M. Jordan, Dimensionality reduction for supervised learning with reproducing kernel hilbert spaces, Journal of Machine Learning Research.
 (11) R. Jenssen, Kernel entropy component analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (5) (2010) 847–860. doi:10.1109/TPAMI.2009.100.
 (12) M. Harandi, M. Salzmann, R. Hartley, Dimensionality reduction on SPD manifolds: The emergence of geometryaware methods, IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (1) (2018) 48–62. doi:10.1109/TPAMI.2017.2655048.
 (13) G. E. Hinton, R. R. Salakhutdinov, Reducing the dimensionality of data with neural networks, Science 313 (5786) (2006) 504–507. doi:10.1126/science.1127647.
 (14) A. Graves, A. r. Mohamed, G. Hinton, Speech recognition with deep recurrent neural networks, in: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 2013, pp. 6645–6649. doi:10.1109/ICASSP.2013.6638947.
 (15) M. Kampffmeyer, S. Løkse, F. M. Bianchi, R. Jenssen, L. Livi, Deep kernelized autoencoders, in: P. Sharma, F. M. Bianchi (Eds.), Image Analysis, Springer International Publishing, Cham, 2017, pp. 419–430.
 (16) K. Ø. Mikalsen, F. M. Bianchi, S. SogueroRuiz, R. Jenssen, Time series cluster kernel for learning similarities between multivariate time series with missing data, Pattern Recognition 76 (2018) 569–581. doi:10.1016/j.patcog.2017.11.030.
 (17) Z. Xing, J. Pei, E. Keogh, A brief survey on sequence classification, SIGKDD Explor. Newsl.
 (18) K. Levin, A. Jansen, B. V. Durme, Segmental acoustic indexing for zero resource keyword search, in: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 5828–5832. doi:10.1109/ICASSP.2015.7179089.
 (19) Y. Chung, C. Wu, C. Shen, H. Lee, L. Lee, Audio word2vec: Unsupervised learning of audio segment representations using sequencetosequence autoencoder, CoRR abs/1603.00982. arXiv:1603.00982.
 (20) S. Zhai, Y. Cheng, W. Lu, Z. Zhang, Deep structured energy based models for anomaly detection, in: Proceedings of the 33rd International Conference on International Conference on Machine Learning  Volume 48, ICML’16, JMLR.org, 2016, pp. 1100–1109.
 (21) B. K. BeaulieuJones, J. H. Moore, Missing data imputation in the electronic health record using deeply learned autoencoders, World Scientific, 2016, pp. 207–218. doi:10.1142/9789813207813_0021.
 (22) Y. Bengio, Learning deep architectures for AI, Foundation and Trends Machine Learning 2 (1) (2009) 1–127. doi:10.1561/2200000006.
 (23) D. Erhan, Y. Bengio, A. Courville, P.A. Manzagol, P. Vincent, S. Bengio, Why does unsupervised pretraining help deep learning?, Journal of Machine Learning Research 11 (2010) 625–660.

(24)
P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, P.A. Manzagol, Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion, Journal of Machine Learning Research 11 (Dec) (2010) 3371–3408.
 (25) A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, B. Frey, Adversarial autoencoders, arXiv preprint arXiv:1511.05644.
 (26) W. W. Y. Ng, G. Zeng, J. Zhang, D. S. Yeung, W. Pedrycz, Dual autoencoders features for imbalance classification problem, Pattern Recognition.
 (27) J. L. Elman, Finding structure in time, Cognitive science 14 (2) (1990) 179–211.
 (28) F. M. Bianchi, E. Maiorino, M. C. Kampffmeyer, A. Rizzi, R. Jenssen, Recurrent Neural Networks for ShortTerm Load Forecasting: An Overview and Comparative Analysis, Springer, 2017.
 (29) I. Sutskever, O. Vinyals, Q. V. Le, Sequence to sequence learning with neural networks, in: Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems 27, Curran Associates, Inc., 2014, pp. 3104–3112.
 (30) S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Józefowicz, S. Bengio, Generating sentences from a continuous space, CoRR abs/1511.06349. arXiv:1511.06349.
 (31) J. Chung, K. Kastner, L. Dinh, K. Goel, A. C. Courville, Y. Bengio, A recurrent latent variable model for sequential data, in: C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, R. Garnett (Eds.), Advances in Neural Information Processing Systems 28, Curran Associates, Inc., 2015, pp. 2980–2988.

(32)
N. Srivastava, E. Mansimov, R. Salakhutdinov, Unsupervised learning of video representations using lstms, in: Proceedings of the 32Nd International Conference on International Conference on Machine Learning  Volume 37, ICML’15, JMLR.org, 2015, pp. 843–852.
 (33) D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate, CoRR abs/1409.0473. arXiv:1409.0473.
 (34) Y. Kim, C. Denton, L. Hoang, A. M. Rush, Structured attention networks, in: International Conference on Learning Representations, 2017.
 (35) A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. u. Kaiser, I. Polosukhin, Attention is all you need, in: I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett (Eds.), Advances in Neural Information Processing Systems 30, Curran Associates, Inc., 2017, pp. 5998–6008.
 (36) K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, Y. Bengio, Learning phrase representations using RNN encoderdecoder for statistical machine translation, arXiv preprint arXiv:1406.1078.
 (37) S. Hochreiter, J. Schmidhuber, Long shortterm memory, Neural Computation 9 (8) (1997) 1735–1780.
 (38) J. Chung, C. Gulcehre, K. Cho, Y. Bengio, Empirical evaluation of gated recurrent neural networks on sequence modeling, arXiv preprint arXiv:1412.3555.
 (39) M. Schuster, K. K. Paliwal, Bidirectional recurrent neural networks, IEEE Transactions on Signal Processing 45 (11) (1997) 2673–2681.
 (40) S. Bengio, O. Vinyals, N. Jaitly, N. Shazeer, Scheduled sampling for sequence prediction with recurrent neural networks, in: Proceedings of the 28th International Conference on Neural Information Processing Systems, MIT Press, Cambridge, MA, USA, 2015, pp. 1171–1179.
 (41) J. L. Schafer, J. W. Graham, Missing data: Our view of the state of the art, Psychological Methods.
 (42) D. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980.
 (43) N. Reimers, I. Gurevych, Optimal hyperparameters for deep LSTMnetworks for sequence labeling tasks, arXiv preprint arXiv:1707.06799.
 (44) D. Sussillo, L. F. Abbott, Generating coherent patterns of activity from chaotic neural networks, Neuron 63 (4) (2009) 544–557. doi:10.1016/j.neuron.2009.07.018.
 (45) M. E. Mann, Smoothing of climate time series revisited, Geophysical Research Letters.
 (46) M. G. Baydogan, G. Runger, Time series representation and similarity based on local autopatterns, Data Mining and Knowledge Discovery 30 (2) (2016) 476–509. doi:10.1007/s106180150425y.
 (47) K. Øyvind Mikalsen, C. SogueroRuiz, F. M. Bianchi, A. Revhaug, R. Jenssen, An Unsupervised Multivariate Time Series Kernel Approach for Identifying Patients with Surgical Site Infection from Blood Samples, ArXiv eprintsarXiv:1803.07879.

(48)
B. K. BeaulieuJones, C. S. Greene, Semisupervised learning of the electronic health record for phenotype stratification, Journal of Biomedical Informatics.
 (49) L. Gondara, K. Wang, Multiple imputation using deep denoising autoencoders, arXiv preprint arXiv:1705.02737.
 (50) I. Irigoien, B. Sierra, C. Arenas, Towards application of oneclass classification methods to medical data, The Scientific World Journal.
 (51) M. Sakurada, T. Yairi, Anomaly detection using autoencoders with nonlinear dimensionality reduction, in: Proceedings of the MLSDA 2014 2Nd Workshop on Machine Learning for Sensory Data Analysis, MLSDA’14, ACM, New York, NY, USA, 2014, pp. 4:4–4:11. doi:10.1145/2689746.2689747.
 (52) G. Clifford, C. Liu, B. Moody, L. Lehman, I. Silva, Q. Li, A. E. Johnson, R. G. Mark, AF classification from a short single lead ecg recording: The physionet computing in cardiology challenge 2017, Computing in Cardiology.
 (53) J. L. Peugh, C. K. Enders, Missing data in educational research: A review of reporting practices and suggestions for improvement, Review of Educational Research 74 (4) (2004) 525–556. doi:10.3102/00346543074004525.
 (54) D. B. Rubin, Inference and missing data, Biometrika 63 (3) (1976) 581–592.