The application of machine learning and deep learning brought a significant impact in healthcare industry, improving diagnosis outcomes and changing the way of providing care to patients [cheng2016risk]. The main challenge that machine learning is asked to solve is to discover relevant structural patterns in clinical data, usually concealed and difficult to detect manually.
An important fraction of electronic health records are clinical measurements collected from patients over time, which are represented as multivariate time series (MTS) [che2016recurrent]. Several efforts have been devoted to learn informative and compact representations of MTS [che2017time], not only to improve the quality of the analysis, but also to manage the large amounts of data necessary to train deep learning models [miotto2016deep]. Furthermore, MTS are characterized by complex relationships across the variables and time that must be accounted in the analysis. However, most methods are designed to treat vectorial data and they cannot be trivially extended to capture such relationships.
The autoencoder (AE) is a type of neural network originally conceived as a non-linear dimensionality reduction algorithm[Hinton504], which has been further exploited to learn data representations in deep architectures [bengio2009learning]. AEs have been adopted to map time series data into codes
, which are real-typed vectors lying in a lower dimensional space[langkvist2014review].
Clinical measurements are often recorded at irregular frequencies that change for different patients, across variables, and over time. Hence, after discretizing time, the resulting MTS end up containing missing values [mikalsen2016learning]
. Missing values follow patterns that reflect medical conditions of the patients or decisions of the doctors and, therefore, are important to be included in the analysis. Since AE cannot process data containing missing values, those are usually replaced with imputation techniques that, however, cannot capture those patterns as they only fill blanks trying to introduce as less bias as possible. On the other hand, a recently proposed method, called Time series Cluster Kernel (TCK)[mikalsen2017time], computes an unsupervised kernel similarity between MTS with missing data. TCK leverages on the configurations of missingness patterns to improve the evaluation of the similarity.
In this work, we propose a completely unsupervised approach for learning compressed representations of MTS in presence of missing data. Towards that end, we utilize the deep kernelized autoencoder (dkAE) [kampffmeyer2017deep], a recently proposed architecture that embeds the properties of a given prior kernel in the code representation of an AE through kernel alignment. By introducing TCK as prior kernel, we extend the dkAE framework to time series. Moreover, due TCK’s properties, the relationships among the learned codes accounts for the presence of missing data, yielding a more discriminative representation of the data.
We apply our method to classify MTS of blood samples, relative to patients with site infections contracted after surgery and with a high percentage of missing data. We compare the classification results obtained on the representations learned by a standard AE with the ones of a dkAE implementing the alignment to TCK. Results indicate that the learned codes not only provide a compact vectorial representation, but the same classifier achieves better results when operates in our code space rather than in the input space.
2.1 Time series Cluster Kernel
The Time series Cluster Kernel [mikalsen2017time]
exploits the missing patterns in MTS to compute their similarities, rather than relying on imputation methods that may introduce strong biases. TCK implements an ensemble learning approach wherein the robustness to hyperparameters is ensured by joining the clustering results of many Gaussian mixture models (GMM) to form the final kernel. Hence, no critical hyperparameters must be tuned by the user.
To deal with missing data, the GMMs are extended using informative prior distributions [Marlin:2012:UPD:2110363.2110408]. The TCK matrix is built by fitting GMMs to the set of time series for a range of numbers of mixture components, to provide partitions with different resolutions that capture both local and global structures in the data. To enhance diversity in the ensemble, each partition is evaluated on a random subset of attributes and segments, using random initializations and randomly chosen hyperparameters. This also provides robustness in the hyperparameters selection. TCK is then built by summing (for each partition) the inner products between pairs of posterior distributions corresponding to different MTS.
AEs simultaneously learn two functions. The first one, encoder, provides a mapping from an input domain, , to a code domain,
, i.e., the hidden representation. The second function,decoder, maps from back to . In AEs with a single hidden layer, the encoding and decoding function are and , where , , and denote, respectively, a sample from the input space, its hidden representation (the code), and its reconstruction. While is usually implemented as a sigmoid, in the case inputs are real-valued vectors, the squashing nonlinearity in can be replaced by a linear activation. Finally, and are the weights and and the bias of the encoder and decoder, respectively.
To minimize the discrepancy between the input and its reconstruction, model parameters are learned by minimizing a reconstruction loss
By stacking more hidden layers an AE is capable of learning more complex representations by transforming inputs through multiple nonlinear transformations. In its native formulation, an AE can process vectorial data and, therefore, a MTS is flattened into a uni-dimensional vector when fed to the AE. Since an AE process inputs of same lengths, missing are filled with numeric values.
2.3 Deep Kernelized Autoencoder
A dkAE is trained by minimizing the loss function
where is the reconstruction loss in Eq. 1 and is a hyperparameter that balances the contribution of the two cost terms. If , becomes the traditional AE loss in Eq. 1. is the code loss that enforces similarity between two matrices: , the kernel matrix given as prior, and , the inner product matrix of codes associated to input data. A depiction of the training procedure is reported in Fig. 1.
can be implemented as the normalized Frobenius distance between and . Each matrix element in is given by and the code loss reads
By minimizing the normalized Frobenius distance from TCK, we indirectly include in the codes the information it captures about the missingness patterns and we improve the quality of the learned codes in presence of missing data.
The dkAE model is trained using mini-batches. Therefore, a training matrix is generated from the codes associated to the elements in the th mini-batch and distance is computed on the submatrix of related to the entries in the mini-batch .
We analyze blood measurements collected from patients undergoing a gastrointestinal surgery at University Hospital of North Norway in the years 2004–2012. Each patient in the dataset is represented by a MTS of blood samples extracted within days after surgery. The MTS contain measurements of variables, which are alanine aminotransferase, albumin and alkaline phosphatase, creatinine, CRP, hemoglobine, leukocytes, potassium, sodium and thrombocytes. We focus on a cohort of two classes of patients: the ones with and without surgical site infections. Dataset labels are assigned according to International Classification of Diseases and NOMESCO Classification of Surgical Procedures, relative to patients with severe postoperative complications.
Missing data in MTS correspond to measurements that are not collected for a given patient in one day of the observation period. Patients with less than two measurements are excluded from the cohort. We ended up with MTS, of which are patients with infections. The first of the datasets is used as training set and the rest as test set.
The dataset, the code implementing all the methods described in this paper, and a detailed description of experimental setup are publicly available111https://github.com/FilippoMB/TCK_AE.
To evaluate the effect of the alignment with TCK kernel, we compare the classification results obtained on the codes learned by standard AE and dkAE. Missing values are filled with three different imputation techniques: zero imputation (AE-z and dkAE-z), mean imputation (AE-m and dkAE-m) and last-value-carried-forward imputation (AE-l and dkAE-l). The codes are classified by a -NN with equipped with Euclidean distance. We also consider the results yielded in the input space by a NN with TCK similarity (TCK-i).
In Tab. 1
we report the mean and standard deviation of F1 score and area under the ROC curve (AUC) of the test set in 10 independent runs. For AE and dkAE we also report the mean squared error (MSE) between the encoder input and the decoder output. A low MSE of the reconstruction does not only guarantee to learn a better representation of the input, but it implies an accurate back-mapping from code to input space. In both AE without kernel alignment and dkAE with zero imputation and last-value-carried-forward we obtain the best and worst classification performance, respectively.
For each imputation method, codes learned by dkAE are classified more accurately and the reconstruction error does not increase even if the codes are aligned with the prior kernel. This demonstrate the importance of embedding into the codes the similarity information yielded by TCK, which captures missingness patterns. Indeed, those patterns are ignored if one relies solely on imputation, whose purpose is to fill missing entries introducing as less bias as possible. It is interesting to notice that the classification in the input space based on TCK similarity is slightly less accurate than the classification in the code space of dkAE. Therefore, dkAE not only yields codes of reduced dimensionality that can be handled more easily and processed faster, but they are discriminated easier than the inputs themselves from a simple classifier.
In Fig. 2 we visualize the first two PCA components of the test set, both in input and in the code spaces. We compute a linear PCA on the codes and on the TCK kernel matrix (this corresponds to compute kernel PCA in the input space using TCK as kernel). Coloring depends on the ground truth label and we observe the two classes to be better separated in the code space of dkAE. Interestingly, in dkAE we notice the same structure yield by kPCA in the input space with TCK as kernel. This demonstrate how the kernel alignment procedure successfully embed in the codes the properties of TCK, without compromising the precision of the decoder reconstruction. We underline that by using an AE rather than kPCA we avoid performing a costly eigendecomposition and we also learn the inverse mapping from the code to the input space, provided by the decoder.
In this paper, we proposed a novel approach for learning compressed vectorial representations of MTS with missing values, which are common in clinical records. This is achieved by combining a deep kernelized Autoencoder with TCK, a similarity measure for MTS that accounts for missingness patterns. We tackled the classification of blood samples from patients with postoperative infections, where data are MTS with a high percentage of missing data. Our results showed that by aligning the codes in the AE to TCK kernel matrix, we embed into the representation important information relative to the missingess patterns in the data and improve the classification outcome.