While the volume of electronic health records (EHR) data continues to grow, it remains rare for hospital systems to capture dense physiological data streams, even in the data-rich intensive care unit setting. Instead, it is common for the physiological time series data in electronic health records to be both sparse and irregularly sampled. An irregularly sampled time series is a sequence of samples with irregular intervals between their observation times. Irregularly sampled data are considered to be sparse when the intervals between successive observations are often large. The physiological data contained in EHRs represent multivariate time series consisting of the time course of multiple vital signs. In the multivariate setting, the additional issue of the lack of alignment in the observation times across physiological variables is also very common.
It is well understood that such data cause significant issues for standard supervised machine learning models that typically assume fully observed, fixed-size feature representations (Yadav et al., 2018). However, over the last several years, there has been significant progress in developing specialized models and architectures that can accommodate sparse and irregularly sampled time series as input (Marlin et al., 2012; Li and Marlin, 2015, 2016; Lipton et al., 2016; Che et al., 2018; Futoma et al., 2017)
. Of particular interest in the supervised learning setting are methods that perform end-to-end learning directly using multivariate sparse and irregularly sampled time series as input without the need for a separate interpolation or imputation step.
In this work, we present a new model architecture for supervised learning with multivariate sparse and irregularly sampled data: Interpolation-Prediction Networks. The architecture is based on the use of several semi-parametric interpolation layers organized into an interpolation network, followed by the application of a prediction network that can leverage any standard deep learning model.
The interpolation network allows for information contained in each input time series to contribute to the interpolation of all other time series in the model. The parameters of the interpolation and prediction networks are learned end-to-end via a composite objective function consisting of supervised and unsupervised components. The interpolation network serves the same purpose as the multivariate Gaussian process used in the work of Futoma et al. (2017), but remove the restrictions associated with the need for a positive definite covariance matrix.
Our approach also allows us to compute an explicit multi-timescale representation of the input time series, which we use to isolate information about transients (short duration events) from broader trends. Similar to the work of Lipton et al. (2016) and Che et al. (2018), our architecture also explicitly leverages a separate information channel related to the pattern of observed values. However, our representation uses a semi-parametric intensity function representation of this information that is more closely related to the work of Lasko (2014) on modeling medical event point processes.
We evaluate the proposed architecture on two tasks using the MIMIC-III data set (Johnson et al., 2016): mortality prediction and length of stay prediction. Our approach outperforms a variety of simple baseline models as well as the basic and advanced GRU models introduced by Che et al. (2018) on both tasks across several metrics.
2 Model Framework
We let represent a data set containing data cases. An individual data case consists of a single target value (discrete in the case of classification and real-valued in the case of regression), as well as a -dimensional, sparse and irregularly sampled multivariate time series . Different dimensions of the multivariate time series can have observations at different times, as well as different total numbers of observations . Thus, we represent time series for data case as a tuple where is the list of time points at which observations are defined and is the corresponding list of observed values.
The overall model architecture consists of two main components: an interpolation network and a prediction network. The interpolation network interpolates the multivariate, sparse, and irregularly sampled input time series against a set of reference time points . In this work, we propose a two-layer interpolation network with each layer performing a different type of interpolation.
The first interpolation layer in the interpolation network performs a semi-parametric univariate interpolation for each of the
time series separately. The interpolation is based on a radial basis function network. This results in a set of valuesfor each data case , each input time series , and each reference time point , and each interpolation channel as shown in Equation 1.
The second interpolation layer merges information across all of the time series at each reference time point by taking into account learned correlations across all time series. This results in a set of values for each data case , each input time series , each reference time point , and each interpolation channel as shown in Equation 2. Here, the terms represent the intensity of the observations on input dimension for data case . The more observations that occur near reference time point , the larger the value of . This final interpolation layer thus models the the interpolant as a weighted linear combination of the first layer interpolants, while focusing the combination on the time series for which data are actually available at nearby time points.
In the experiments presented in the next section, we use a total of three interpolation outputs () corresponding to a smooth interpolant to capture trends, a non-smooth interpolant to capture transients, and the intensity function to retain information about where observations occur. The smooth interpolant corresponds to the first interpolant . The collection of model parameters associated with this interpolant are the cross-dimension correlation coefficients , and the RBF network bandwidths for all and .
For the non-smooth interpolant, we start with a second interpolant . The collection of model parameters associated with this interpolant are the RBF network parameters for all . To accomplish the goal of having this component represent a less smooth interpolation than , we enforce the relationship for all for a value of greater than one. This ensures that the temporal similarity decays faster for the component intended to model transients. To further minimize any redundancy between and , we subtract the smooth interpolant from the non-smooth interpolant leaving the non-smooth residual: .
Finally, for the intensity function, we use . This component shares its RBF network parameters with . In the experiments, we study the end-to-end impact of each of these interpolation outputs.
The second component, the prediction network, takes the output of the interpolation network as its input and produces a prediction
for the target variable. The prediction network can consist of any standard supervised neural network architecture (fully-connected feedforward, convolutional, recurrent, etc). Thus, the architecture is fully modular with respect to the use of different prediction networks. AppendixA.1 shows the architecture of the proposed model.
To learn the model parameters, we use a composite objective function consisting of a supervised component and an unsupervised component. This is due to the fact that the supervised component alone is insufficient to learn reasonable parameters for the interpolation network given the amount of available training data. The unsupervised component used corresponds to an autoencoder-like loss function. However, the semi-parametric RBF interpolation layers have the ability to exactly fit the input points by setting the RBF kernel parameters to very large values. To avoid this solution and force the interpolation layers to learn to properly interpolate the input data, it is necessary to hold out some observed data pointsduring learning and then to compute the reconstruction loss only for these data points. This is a well-known problem with high-capacity autoencoders, and past work has used similar strategies to avoid the problem of trivially memorizing the input data without learning useful structure.
To implement the autoencoder component of the loss, we introduce a set of masking variables for each data point . If , then we remove the data point as an input to the interpolation network, and include the predicted value of this time point when assessing the autoencoder loss. We use the shorthand notation to represent the subset of values of that are masked out, and to represent the subset of values of that are not masked out. The value that we predict for a masked input at time point is the value of the smooth interpolant at that time point, calculated based on the un-masked input values: .
Using these definitions, we can now specify the learning problem for the proposed framework. We let be the loss for the prediction network (we use cross-entropy loss for classification and squared error for regression). We let be the interpolation network autoencoder loss (we use standard squared error). We also include regularizers for both the interpolation and prediction networks parameters.
where and are interpolation and prediction network respectively.
3 Experiments and Results
We test the model framework on publicly available MIMIC-III data set 111MIMIC-III dataset is publicly available at https://mimic.physionet.org/, a multivariate time series dataset consisting of sparse and irregularly sampled physiological signals (Johnson et al., 2016). More details on data extraction and sparsity are available in appendix A.2. We use mortality and length of stay prediction as example classification and regression tasks.
We compare the proposed model with a number of baseline approaches including off-the-shelf classification and regression models learned using basic features, as well as more recent approaches based on customized neural network models. For non-neural network baselines, we evaluate Logistic Regression(Hosmer Jr et al., 2013)2001)
, Support Vector Machines (SVM)(Cortes and Vapnik, 1995)
, Random Forest (RF)(Breiman, 2001) and AdaBoost (Freund and Schapire, 1997). Standard instances of all of these models require fixed-size feature representations. We use the mean of the available values in each of the physiological time series for a given admission record as the feature set. In addition, we compare to several existing deep learning baselines built on GRUs using simple interpolation or imputation approaches: GRU-M (missing observations replaced with global mean), GRU-F (missing observations replaced with last observation), GRU-S (input concatenated with masking variable to identify missingness and time interval indicating how long the particular variable is missing) and GRU-D Che et al. (2018) (similar to GRU-S except instead of just replacing the missing value with the last measurement, it is decayed over time towards the empirical mean). Training and implementation details can be found in appendix A.3.
We report the results from the 5-fold cross validation in terms of the average area under the ROC curve (AUC score), average area under precision-recall curve (AUPRC score) and average cross-entropy loss for classification task and average median absolute error and average fraction of explained variation score (EV) for regression task. We also report the standard deviation over cross validation folds. We note that in highly skewed datasets, as is the case with MIMIC-III, AUPRC(Davis and Goadrich, 2006) can give better insight into classification performance compared to the AUC score.
Performance on mortality and length of stay prediction tasks on MIMIC-III. Loss: Cross-Entropy Loss, MedAE: Median Absolute Error (in days), EV: Explained variance
compares the predictive performance of the mortality and length of stay prediction task on MIMIC-III. The proposed model consistently achieves the best average score over all the metrics. We note that a paired t-test indicates that the proposed model results in statistically significant improvements over all baseline modelswith respect to all the metrics except median absolute error. The version of the proposed model used in this experiment includes all three interpolation network outputs (smooth interpolation, transients, and intensity function).
4 Discussion and Conclusions
In this paper, we have presented a new framework for dealing with the problem of supervised learning in the presence of sparse and irregularly sampled time series. The proposed framework is fully modular. It uses an interpolation network to accommodate the complexity that results from using sparse and irregularly sampled data as supervised learning inputs, followed by the application of a prediction network that operates over the regularly spaced and fully observed, multi-channel output provided by the interpolation network. The proposed approach also addresses some difficulties with prior approaches including the complexity of the Gaussian process interpolation layers used in Li and Marlin (2016), and the lack of modularity in the approach of Che et al. (2018). Our framework also introduces novel elements including the use of semi-parametric, feed-forward interpolation layers, and the decomposition of an irregularly sampled input time series into multiple distinct information channels. Our results show statistically significant improvements for both classification and regression tasks over a range of baseline and state-of-the-art methods.
This work was supported by the National Science Foundation under Grant No. 1350522.
- Breiman  Leo Breiman. Random forests. Mach. Learn., 45(1):5–32, October 2001. ISSN 0885-6125. doi: 10.1023/A:1010933404324. URL https://doi.org/10.1023/A:1010933404324.
- Che et al.  Zhengping Che, Sanjay Purushotham, Kyunghyun Cho, David Sontag, and Yan Liu. Recurrent neural networks for multivariate time series with missing values. Scientific Reports, 8(1):6085, 2018. URL https://doi.org/10.1038/s41598-018-24271-9.
- Cortes and Vapnik  Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, Sep 1995. ISSN 1573-0565. doi: 10.1007/BF00994018. URL https://doi.org/10.1007/BF00994018.
- Davis and Goadrich  Jesse Davis and Mark Goadrich. The relationship between precision-recall and roc curves. In Proceedings of the 23rd International Conference on Machine Learning, ICML ’06, pages 233–240, New York, NY, USA, 2006. ACM. ISBN 1-59593-383-2. doi: 10.1145/1143844.1143874. URL http://doi.acm.org/10.1145/1143844.1143874.
- Freund and Schapire  Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci., 55(1):119–139, August 1997. ISSN 0022-0000. doi: 10.1006/jcss.1997.1504. URL http://dx.doi.org/10.1006/jcss.1997.1504.
- Futoma et al.  Joseph Futoma, Sanjay Hariharan, Katherine Heller, Mark Sendak, Nathan Brajer, Meredith Clement, Armando Bedoya, and Cara O’Brien. An improved multi-output gaussian process rnn with real-time validation for early sepsis detection. In Machine Learning for Healthcare Conference, pages 243–254, 2017.
- Hastie et al.  Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer Series in Statistics. Springer New York Inc., New York, NY, USA, 2001.
- Hosmer Jr et al.  David W Hosmer Jr, Stanley Lemeshow, and Rodney X Sturdivant. Applied logistic regression, volume 398. John Wiley & Sons, 2013.
- Johnson et al.  Alistair EW Johnson, Tom J Pollard, Lu Shen, H Lehman Li-wei, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. Mimic-iii, a freely accessible critical care database. Scientific data, 3:160035, 2016.
Thomas A Lasko.
Efficient inference of gaussian-process-modulated renewal processes
with application to medical event data.
Uncertainty in artificial intelligence: proceedings of the… conference. Conference on Uncertainty in Artificial Intelligence, volume 2014, page 469. NIH Public Access, 2014.
- Li and Marlin  Steven Cheng-Xian Li and Benjamin M Marlin. A scalable end-to-end gaussian process adapter for irregularly sampled time series classification. In Advances In Neural Information Processing Systems, pages 1804–1812, 2016.
- Li and Marlin  Steven Cheng-Xian Li and Benjmain M. Marlin. Classification of sparse and irregularly sampled time series with mixtures of expected Gaussian kernels and random features. In 31st Conference on Uncertainty in Artificial Intelligence, 2015.
- Lipton et al.  Zachary C Lipton, David Kale, and Randall Wetzel. Directly modeling missing data in sequences with rnns: Improved classification of clinical time series. In Machine Learning for Healthcare Conference, pages 253–270, 2016.
- Marlin et al.  Benjamin M. Marlin, David C. Kale, Robinder G. Khemani, and Randall C. Wetzel. Unsupervised pattern discovery in electronic health care data using probabilistic clustering models. In Proceedings of the 2nd ACM SIGHIT International Health Informatics Symposium, pages 389–398, 2012.
- Yadav et al.  Pranjul Yadav, Michael Steinbach, Vipin Kumar, and Gyorgy Simon. Mining electronic health records (ehrs): A survey. ACM Computing Surveys (CSUR), 50(6):85, 2018.
Appendix A Appendix
a.1 Model Architecture
a.2 Dataset Description
We evaluate our model framework on the publicly available MIMIC-III dataset . MIMIC-III is a de-identified dataset collected at Beth Israel Deaconess Medical Center from 2001 to 2012. It consists of approximately 58,000 hospital admission records. This data set contains sparse and irregularly sampled physiological signals, medications, diagnostic codes, in-hospital mortality, length of stay and more. We focus on predicting in-hospital mortality and length of stay using the first 48 hours of data. We extracted 12 standard physiological variables from each of the 53,211 records obtained after removing hospital admission records with length of stay less than 48 hours. Table 2 shows the features, sampling rates (per hour) and their missingness information computed using the union of all time stamps that exist in any dimension of the input time series.
In our experiments, each admission record corresponds to one data case . Each data case consists of a sparse and irregularly sampled time series with dimensions. Each dimension of corresponds to one of the 12 vital sign time series mentioned above. In the case of classification, is a binary indicator where indicates that the patient died at any point within the hospital stay following the first 48 hours and indicates that the patient was discharged at any point after the first 48 hours. There are 4310 (8.1%) patients with a mortality label. The complete data set is , and there are data cases. The goal in the classification task is to learn a classification function of the form where is a discrete value.
In the case of regression, is a real-valued regression target corresponding to the length of stay. Since the data set include some very long stay durations, we let represent the log of the length of stay in days for all models. We convert back from the log number of days to the number of days when reporting results. The complete data set is again with data cases (we again require 48 hours worth of data). The goal in the regression task is to learn a regression function of the form where is a continuous value.
a.3 Implementation Details
a.3.1 Proposed Model
The model is learned using the Adam optimization method in TensorFlow with gradients provided via automatic differentiation. However, the actual multivariate time series representation used during learning is based on the union of all time stamps that exist in any dimension of the input time series. Undefined observations are represented as zeros and a separate missing data mask is used to keep track of which time series have observations at each time point. Equations 1 & 2 are modified such that data that are not available are not taken into account at all. This implementation is exactly equivalent to the computations described in Equations 1 & 2, but support parallel computation across all dimensions of the time series for a given data case.
Finally, we note that the learning problem can be solved using a doubly stochastic gradient based on the use of mini batches combined with re-sampling the artificial missing data masks used in the interpolation loss. In practice, we randomly select of the observed data points to hold out from every input time series.
The Logistic Regression model is trained with cross entropy loss with regularization strength set to 1. The support vector classifier is used with a RBF kernel and trained to minimize the soft margin loss. We use the cross entropy loss on the validation set to select the optimal number of estimators in case of Adaboost and Random Forest. Similar to the classification setting, the optimal number of estimators for regression task in Adaboost and Random Forest is chosen on the basis of squared error on validation set.
We evaluate all models using a five-fold cross-validation estimate of generalization performance. In the classification setting, all the deep learning baselines are trained to minimize the cross entropy loss while the proposed model uses a composite loss consisting of cross-entropy loss and interpolation loss (with ) as described in section . In the case of the regression task, all baseline models are trained to minimize squared error and the proposed model is again trained with a composite loss consisting of squared error and interpolation loss.
For all of the GRU-based models, we use the already specified parameters . The models are learned using the Adam optimization. Early stopping is used on a validation set sub-sampled from the training folds. In the classification case, the final outputs of the GRU hidden units are used in a logistic layer that predicts the class. In the regression case, the final outputs of the GRU hidden units are used as input for a dense hidden layer with units, followed by a linear output layer.