1 Introduction
While the volume of electronic health records (EHR) data continues to grow, it remains rare for hospital systems to capture dense physiological data streams, even in the data-rich intensive care unit setting. Instead, it is common for the physiological time series data in electronic health records to be both sparse and irregularly sampled. An irregularly sampled time series is a sequence of samples with irregular intervals between their observation times. Irregularly sampled data are considered to be sparse when the intervals between successive observations are often large. The physiological data contained in EHRs represent multivariate time series consisting of the time course of multiple vital signs. In the multivariate setting, the additional problem of a lack of alignment in observation times across physiological variables is also very common.
It is well understood that such data cause significant issues for standard supervised machine learning models, which typically assume fully observed, fixed-size feature representations (Yadav et al., 2018). However, over the last several years, there has been significant progress in developing specialized models and architectures that can accommodate sparse and irregularly sampled time series as input (Marlin et al., 2012; Li and Marlin, 2015, 2016; Lipton et al., 2016; Che et al., 2018; Futoma et al., 2017). Of particular interest in the supervised learning setting are methods that perform end-to-end learning directly using multivariate sparse and irregularly sampled time series as input, without the need for a separate interpolation or imputation step.
In this work, we present a new model architecture for supervised learning with multivariate sparse and irregularly sampled data: Interpolation-Prediction Networks. The architecture is based on the use of several semi-parametric interpolation layers organized into an interpolation network, followed by the application of a prediction network that can leverage any standard deep learning model.
The interpolation network allows information contained in each input time series to contribute to the interpolation of all other time series in the model. The parameters of the interpolation and prediction networks are learned end-to-end via a composite objective function consisting of supervised and unsupervised components. The interpolation network serves the same purpose as the multivariate Gaussian process used in the work of Futoma et al. (2017), but removes the restrictions associated with the need for a positive definite covariance matrix.
Our approach also allows us to compute an explicit multi-timescale representation of the input time series, which we use to isolate information about transients (short-duration events) from broader trends. Similar to the work of Lipton et al. (2016) and Che et al. (2018), our architecture explicitly leverages a separate information channel related to the pattern of observed values. However, our representation uses a semi-parametric intensity function that is more closely related to the work of Lasko (2014) on modeling medical event point processes.
We evaluate the proposed architecture on two tasks using the MIMIC-III data set (Johnson et al., 2016): mortality prediction and length-of-stay prediction. Our approach outperforms a variety of simple baseline models, as well as the basic and advanced GRU models introduced by Che et al. (2018), on both tasks across several metrics.
2 Model Framework
We let $\mathcal{D} = \{(\mathbf{s}_n, y_n)\}_{n=1}^{N}$ represent a data set containing $N$ data cases. An individual data case consists of a single target value $y_n$ (discrete in the case of classification and real-valued in the case of regression), as well as a $D$-dimensional, sparse and irregularly sampled multivariate time series $\mathbf{s}_n$. Different dimensions $d$ of the multivariate time series can have observations at different times, as well as different total numbers of observations $L_{dn}$. Thus, we represent time series $d$ for data case $n$ as a tuple $\mathbf{s}_{dn} = (\mathbf{t}_{dn}, \mathbf{x}_{dn})$, where $\mathbf{t}_{dn} = [t_{1dn}, \dots, t_{L_{dn}dn}]$ is the list of time points at which observations are defined and $\mathbf{x}_{dn} = [x_{1dn}, \dots, x_{L_{dn}dn}]$ is the corresponding list of observed values.
The overall model architecture consists of two main components: an interpolation network and a prediction network. The interpolation network interpolates the multivariate, sparse, and irregularly sampled input time series against a set of reference time points $\mathbf{r} = [r_1, \dots, r_K]$. In this work, we propose a two-layer interpolation network, with each layer performing a different type of interpolation.
The first interpolation layer performs a semi-parametric univariate interpolation for each of the $D$ time series separately. The interpolation is based on a radial basis function (RBF) network. This results in a set of values $\hat{h}^{c}_{dn}(r_k)$ for each data case $n$, each input time series $d$, each reference time point $r_k$, and each interpolation channel $c$, as shown in Equation 1:

$$\hat{h}^{c}_{dn}(r_k) = \frac{\sum_{j=1}^{L_{dn}} w(r_k, t_{jdn}; \alpha_{cd})\, x_{jdn}}{\sum_{j=1}^{L_{dn}} w(r_k, t_{jdn}; \alpha_{cd})}, \qquad w(r, t; \alpha) = \exp\!\left(-\alpha (r - t)^2\right) \tag{1}$$
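As a concrete illustration of Equation 1, the following is a minimal NumPy sketch of the first-layer RBF interpolant (the function name and the test values are ours, not taken from a released implementation). Note how a very large bandwidth makes the interpolant pass exactly through the observations, which is the degenerate solution that the hold-out masking described later is designed to rule out:

```python
import numpy as np

def rbf_interpolate(t_obs, x_obs, r, alpha):
    """Semi-parametric univariate RBF interpolation (Equation 1).

    t_obs: (L,) observation times; x_obs: (L,) observed values;
    r: (K,) reference time points; alpha: kernel bandwidth (> 0).
    Returns the (K,) interpolant evaluated at the reference points.
    """
    # w[k, j] = exp(-alpha * (r_k - t_j)^2): similarity of each
    # reference point to each observation time.
    w = np.exp(-alpha * (r[:, None] - t_obs[None, :]) ** 2)
    return (w @ x_obs) / w.sum(axis=1)

# Three irregularly spaced observations interpolated onto a regular grid.
t = np.array([0.0, 0.7, 3.1])
x = np.array([1.0, 2.0, 0.5])
r = np.linspace(0.0, 4.0, 9)
h = rbf_interpolate(t, x, r, alpha=5.0)

# With an extremely large bandwidth the interpolant exactly fits the
# observations (the degenerate solution the hold-out masking prevents).
h_exact = rbf_interpolate(t, x, t, alpha=1e8)
```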
The second interpolation layer merges information across all $D$ time series at each reference time point by taking into account learned correlations $\rho_{dd'}$ across all time series. This results in a set of values $\hat{g}^{c}_{dn}(r_k)$ for each data case $n$, each input time series $d$, each reference time point $r_k$, and each interpolation channel $c$, as shown in Equation 2. Here, the terms $\lambda_{dn}(r_k) = \sum_{j=1}^{L_{dn}} w(r_k, t_{jdn}; \alpha_{1d})$ represent the intensity of the observations on input dimension $d$ for data case $n$. The more observations that occur near reference time point $r_k$, the larger the value of $\lambda_{dn}(r_k)$. This final interpolation layer thus models the interpolant as a weighted linear combination of the first-layer interpolants, while focusing the combination on the time series for which data are actually available at nearby time points:

$$\hat{g}^{c}_{dn}(r_k) = \frac{\sum_{d'=1}^{D} \rho_{dd'}\, \lambda_{d'n}(r_k)\, \hat{h}^{c}_{d'n}(r_k)}{\sum_{d'=1}^{D} \rho_{dd'}\, \lambda_{d'n}(r_k)} \tag{2}$$
In the experiments presented in the next section, we use a total of three interpolation outputs ($C = 3$): a smooth interpolant to capture trends, a non-smooth interpolant to capture transients, and the intensity function to retain information about where observations occur. The smooth interpolant corresponds to the first interpolant $\hat{g}^{1}_{dn}$. The collection of model parameters associated with this interpolant are the cross-dimension correlation coefficients $\rho_{dd'}$ and the RBF network bandwidths $\alpha_{1d}$ for all $d$ and $d'$.
For the non-smooth interpolant, we start with a second interpolant $\hat{g}^{2}_{dn}$. The collection of model parameters associated with this interpolant are the RBF network bandwidths $\alpha_{2d}$ for all $d$. To accomplish the goal of having this component represent a less smooth interpolation than $\hat{g}^{1}_{dn}$, we enforce the relationship $\alpha_{2d} = \kappa\, \alpha_{1d}$ for all $d$, for a value of $\kappa$ greater than one. This ensures that the temporal similarity decays faster for the component intended to model transients. To further minimize any redundancy between $\hat{g}^{1}_{dn}$ and $\hat{g}^{2}_{dn}$, we subtract the smooth interpolant from the non-smooth interpolant, leaving the non-smooth residual $\hat{g}^{2}_{dn}(r_k) - \hat{g}^{1}_{dn}(r_k)$.
Finally, for the intensity function, we use $\lambda_{dn}(r_k)$. This component shares its RBF network parameters with $\hat{g}^{1}_{dn}$. In the experiments, we study the end-to-end impact of each of these interpolation outputs.
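The three per-dimension outputs can be sketched in NumPy as follows (naming is ours; the cross-dimension correlation weighting of the second layer is omitted to keep the example univariate, and $\kappa$ here is an illustrative value). The two bandwidths are tied by the factor $\kappa > 1$, so the second kernel decays faster and its residual isolates transients:

```python
import numpy as np

def interpolation_channels(t_obs, x_obs, r, alpha, kappa=10.0):
    """Intensity, smooth, and transient channels for one time series.

    The non-smooth interpolant uses bandwidth kappa * alpha, so its
    temporal similarity decays faster; subtracting the smooth channel
    leaves a residual that captures short-duration transients.
    """
    def weights(a):
        return np.exp(-a * (r[:, None] - t_obs[None, :]) ** 2)

    w_lo = weights(alpha)            # slow-decaying kernel
    w_hi = weights(kappa * alpha)    # fast-decaying kernel
    intensity = w_lo.sum(axis=1)     # lambda(r_k): observation density
    smooth = (w_lo @ x_obs) / intensity
    nonsmooth = (w_hi @ x_obs) / w_hi.sum(axis=1)
    transient = nonsmooth - smooth   # non-smooth residual channel
    return intensity, smooth, transient

t = np.array([0.0, 0.5, 2.0])        # two observations early, one late
x = np.array([1.0, 3.0, 2.0])
r = np.linspace(0.0, 2.5, 6)
lam, sm, tr = interpolation_channels(t, x, r, alpha=2.0)
```

The intensity channel is larger where observations cluster (here, near $r = 0$), which is exactly the information the second layer uses to weight dimensions.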
The second component, the prediction network, takes the output of the interpolation network as its input and produces a prediction
for the target variable. The prediction network can consist of any standard supervised neural network architecture (fully-connected feed-forward, convolutional, recurrent, etc.). Thus, the architecture is fully modular with respect to the use of different prediction networks. Appendix A.1 shows the architecture of the proposed model.

To learn the model parameters, we use a composite objective function consisting of a supervised component and an unsupervised component. This is due to the fact that the supervised component alone is insufficient to learn reasonable parameters for the interpolation network given the amount of available training data. The unsupervised component corresponds to an autoencoder-like loss function. However, the semi-parametric RBF interpolation layers have the ability to exactly fit the input points by setting the RBF kernel bandwidths to very large values. To avoid this solution and force the interpolation layers to learn to properly interpolate the input data, it is necessary to hold out some observed data points during learning and then to compute the reconstruction loss only for these held-out points. This is a well-known problem with high-capacity autoencoders, and past work has used similar strategies to avoid trivially memorizing the input data without learning useful structure.

To implement the autoencoder component of the loss, we introduce a set of masking variables $m_{jdn}$ for each data point $(t_{jdn}, x_{jdn})$. If $m_{jdn} = 1$, then we remove the data point as an input to the interpolation network and include the predicted value of this time point when assessing the autoencoder loss. We use the shorthand notation $\mathbf{m}_n \odot \mathbf{s}_n$ to represent the subset of values of $\mathbf{s}_n$ that are masked out, and $(1 - \mathbf{m}_n) \odot \mathbf{s}_n$ to represent the subset of values of $\mathbf{s}_n$ that are not masked out. The value that we predict for a masked input at time point $t_{jdn}$ is the value of the smooth cross-channel interpolant at that time point, calculated based on the unmasked input values: $\hat{x}_{jdn} = \hat{g}^{1}_{dn}(t_{jdn})$.
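A minimal sketch of the hold-out masking used for the autoencoder loss follows (NumPy; the helper name and the 30% fraction are illustrative, not the setting used in the experiments):

```python
import numpy as np

rng = np.random.default_rng(0)

def holdout_mask(num_obs, frac):
    """Return a boolean mask m: m[j] = True marks an observation that is
    removed from the interpolation network's input and scored only by
    the reconstruction (autoencoder) loss."""
    mask = np.zeros(num_obs, dtype=bool)
    n_masked = max(1, int(round(frac * num_obs)))
    mask[rng.choice(num_obs, size=n_masked, replace=False)] = True
    return mask

t_all = np.linspace(0.0, 9.0, 10)        # observation times for one series
m = holdout_mask(t_all.size, frac=0.3)
t_input, t_target = t_all[~m], t_all[m]  # interpolator input vs. loss targets
```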
Using these definitions, we can now specify the learning problem for the proposed framework. We let $\ell_P$ be the loss for the prediction network (we use cross-entropy loss for classification and squared error for regression). We let $\ell_I$ be the interpolation network autoencoder loss (we use standard squared error). We also include regularizers for both the interpolation and prediction network parameters:

$$\theta^*, \phi^* = \operatorname*{argmin}_{\theta,\, \phi}\; \sum_{n=1}^{N} \ell_P\big(y_n,\, g_\phi(f_\theta((1 - \mathbf{m}_n) \odot \mathbf{s}_n))\big) + \delta\, \ell_I\big(\mathbf{m}_n \odot \mathbf{s}_n,\, \hat{\mathbf{s}}_n\big) + \delta_\theta \|\theta\|_2^2 + \delta_\phi \|\phi\|_2^2 \tag{3}$$

where $f_\theta$ and $g_\phi$ are the interpolation and prediction networks, respectively, and $\delta$, $\delta_\theta$, and $\delta_\phi$ are trade-off hyper-parameters.
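The composite objective can be sketched as the following scalar loss (regression case; the function name and default weights are ours):

```python
import numpy as np

def composite_loss(y, y_hat, x_heldout, x_recon, theta, phi,
                   delta=1.0, delta_theta=1e-4, delta_phi=1e-4):
    """Supervised loss + weighted interpolation (autoencoder) loss + L2
    regularizers on the interpolation (theta) and prediction (phi)
    network parameters."""
    pred_loss = np.mean((y - y_hat) ** 2)              # ell_P, squared error
    interp_loss = np.mean((x_heldout - x_recon) ** 2)  # ell_I on masked points
    reg = delta_theta * np.sum(theta ** 2) + delta_phi * np.sum(phi ** 2)
    return pred_loss + delta * interp_loss + reg

loss = composite_loss(
    y=np.array([1.0, 0.0]), y_hat=np.array([0.5, 0.5]),
    x_heldout=np.array([2.0]), x_recon=np.array([1.5]),
    theta=np.zeros(3), phi=np.zeros(3))
```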
3 Experiments and Results
We test the model framework on the publicly available MIMIC-III data set (Johnson et al., 2016),¹ a multivariate time series data set consisting of sparse and irregularly sampled physiological signals. More details on data extraction and sparsity are available in Appendix A.2. We use mortality and length-of-stay prediction as example classification and regression tasks.

¹The MIMIC-III data set is publicly available at https://mimic.physionet.org/.
We compare the proposed model with a number of baseline approaches, including off-the-shelf classification and regression models learned using basic features, as well as more recent approaches based on customized neural network models. For non-neural-network baselines, we evaluate logistic regression (Hosmer Jr et al., 2013; Hastie et al., 2001), support vector machines (SVM) (Cortes and Vapnik, 1995), random forests (RF) (Breiman, 2001), and AdaBoost (Freund and Schapire, 1997). Standard instances of all of these models require fixed-size feature representations. We use the mean of the available values in each of the physiological time series for a given admission record as the feature set. In addition, we compare to several existing deep learning baselines built on GRUs using simple interpolation or imputation approaches: GRU-M (missing observations replaced with the global mean), GRU-F (missing observations replaced with the last observation), GRU-S (input concatenated with a masking variable identifying missingness and the time interval indicating how long the variable has been missing), and GRU-D (Che et al., 2018) (similar to GRU-S, except that instead of simply carrying forward the last measurement, the missing value is decayed over time towards the empirical mean). Training and implementation details can be found in Appendix A.3.

We report results from 5-fold cross validation in terms of the average area under the ROC curve (AUC), average area under the precision-recall curve (AUPRC), and average cross-entropy loss for the classification task, and the average median absolute error (MedAE) and average fraction of explained variance (EV) for the regression task. We also report the standard deviation over cross-validation folds. We note that in highly skewed data sets, as is the case with MIMIC-III, AUPRC
(Davis and Goadrich, 2006) can give better insight into classification performance than the AUC score.

Table 1: Performance on mortality and length-of-stay prediction tasks on MIMIC-III. Classification metrics: AUC, AUPRC, cross-entropy loss. Regression metrics: MedAE (median absolute error, in days), EV (explained variance). Models compared: Log/LinReg, SVM, AdaBoost, RF, GRU-M, GRU-F, GRU-S, GRU-D, and the proposed model.
Table 1 compares the predictive performance on the mortality and length-of-stay prediction tasks on MIMIC-III. The proposed model consistently achieves the best average score across all metrics. We note that a paired t-test indicates that the proposed model yields statistically significant improvements over all baseline models with respect to all metrics except median absolute error. The version of the proposed model used in this experiment includes all three interpolation network outputs (smooth interpolation, transients, and intensity function).

4 Discussion and Conclusions
In this paper, we have presented a new framework for supervised learning in the presence of sparse and irregularly sampled time series. The proposed framework is fully modular. It uses an interpolation network to accommodate the complexity that results from using sparse and irregularly sampled data as supervised learning inputs, followed by the application of a prediction network that operates over the regularly spaced, fully observed, multi-channel output provided by the interpolation network. The proposed approach also addresses some difficulties with prior approaches, including the complexity of the Gaussian process interpolation layers used by Li and Marlin (2016) and the lack of modularity in the approach of Che et al. (2018). Our framework also introduces novel elements, including the use of semi-parametric, feed-forward interpolation layers and the decomposition of an irregularly sampled input time series into multiple distinct information channels. Our results show statistically significant improvements for both classification and regression tasks over a range of baseline and state-of-the-art methods.
Acknowledgements
This work was supported by the National Science Foundation under Grant No. 1350522.
References
 Breiman [2001] Leo Breiman. Random forests. Machine Learning, 45(1):5–32, October 2001. ISSN 0885-6125. doi: 10.1023/A:1010933404324. URL https://doi.org/10.1023/A:1010933404324.
 Che et al. [2018] Zhengping Che, Sanjay Purushotham, Kyunghyun Cho, David Sontag, and Yan Liu. Recurrent neural networks for multivariate time series with missing values. Scientific Reports, 8(1):6085, 2018. URL https://doi.org/10.1038/s41598-018-24271-9.
 Cortes and Vapnik [1995] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, September 1995. ISSN 1573-0565. doi: 10.1007/BF00994018. URL https://doi.org/10.1007/BF00994018.
 Davis and Goadrich [2006] Jesse Davis and Mark Goadrich. The relationship between precision-recall and ROC curves. In Proceedings of the 23rd International Conference on Machine Learning, ICML '06, pages 233–240, New York, NY, USA, 2006. ACM. ISBN 1-59593-383-2. doi: 10.1145/1143844.1143874. URL http://doi.acm.org/10.1145/1143844.1143874.
 Freund and Schapire [1997] Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, August 1997. ISSN 0022-0000. doi: 10.1006/jcss.1997.1504. URL http://dx.doi.org/10.1006/jcss.1997.1504.
 Futoma et al. [2017] Joseph Futoma, Sanjay Hariharan, Katherine Heller, Mark Sendak, Nathan Brajer, Meredith Clement, Armando Bedoya, and Cara O'Brien. An improved multi-output Gaussian process RNN with real-time validation for early sepsis detection. In Machine Learning for Healthcare Conference, pages 243–254, 2017.
 Hastie et al. [2001] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer Series in Statistics. Springer New York Inc., New York, NY, USA, 2001.
 Hosmer Jr et al. [2013] David W Hosmer Jr, Stanley Lemeshow, and Rodney X Sturdivant. Applied Logistic Regression, volume 398. John Wiley & Sons, 2013.
 Johnson et al. [2016] Alistair EW Johnson, Tom J Pollard, Lu Shen, Li-wei H Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. MIMIC-III, a freely accessible critical care database. Scientific Data, 3:160035, 2016.

 Lasko [2014] Thomas A Lasko. Efficient inference of Gaussian-process-modulated renewal processes with application to medical event data. In Uncertainty in Artificial Intelligence: Proceedings of the Conference on Uncertainty in Artificial Intelligence, volume 2014, page 469. NIH Public Access, 2014.
 Li and Marlin [2015] Steven Cheng-Xian Li and Benjamin M. Marlin. Classification of sparse and irregularly sampled time series with mixtures of expected Gaussian kernels and random features. In 31st Conference on Uncertainty in Artificial Intelligence, 2015.
 Li and Marlin [2016] Steven Cheng-Xian Li and Benjamin M Marlin. A scalable end-to-end Gaussian process adapter for irregularly sampled time series classification. In Advances in Neural Information Processing Systems, pages 1804–1812, 2016.
 Lipton et al. [2016] Zachary C Lipton, David Kale, and Randall Wetzel. Directly modeling missing data in sequences with RNNs: Improved classification of clinical time series. In Machine Learning for Healthcare Conference, pages 253–270, 2016.
 Marlin et al. [2012] Benjamin M. Marlin, David C. Kale, Robinder G. Khemani, and Randall C. Wetzel. Unsupervised pattern discovery in electronic health care data using probabilistic clustering models. In Proceedings of the 2nd ACM SIGHIT International Health Informatics Symposium, pages 389–398, 2012.
 Yadav et al. [2018] Pranjul Yadav, Michael Steinbach, Vipin Kumar, and Gyorgy Simon. Mining electronic health records (EHRs): A survey. ACM Computing Surveys (CSUR), 50(6):85, 2018.
Appendix A Appendix
A.1 Model Architecture
A.2 Dataset Description
We evaluate our model framework on the publicly available MIMIC-III data set (Johnson et al., 2016). MIMIC-III is a de-identified data set collected at Beth Israel Deaconess Medical Center from 2001 to 2012. It consists of approximately 58,000 hospital admission records. The data set contains sparse and irregularly sampled physiological signals, medications, diagnostic codes, in-hospital mortality, length of stay, and more. We focus on predicting in-hospital mortality and length of stay using the first 48 hours of data. We extracted 12 standard physiological variables from each of the 53,211 records obtained after removing hospital admission records with a length of stay of less than 48 hours. Table 2 shows the features, sampling rates (per hour), and their missingness information, computed using the union of all time stamps that exist in any dimension of the input time series.
Table 2: The 12 extracted physiological features: SpO2, HR, RR, SBP, DBP, Temp, TGCS, CRR, UO, FiO2, Glucose, and pH, with the number of missing values and hourly sampling rate for each.
Prediction Tasks
In our experiments, each admission record corresponds to one data case $(\mathbf{s}_n, y_n)$. Each data case consists of a sparse and irregularly sampled time series $\mathbf{s}_n$ with $D = 12$ dimensions. Each dimension $d$ of $\mathbf{s}_n$ corresponds to one of the 12 vital sign time series mentioned above. In the case of classification, $y_n$ is a binary indicator, where $y_n = 1$ indicates that the patient died at any point within the hospital stay following the first 48 hours and $y_n = 0$ indicates that the patient was discharged at any point after the first 48 hours. There are 4,310 (8.1%) patients with a positive mortality label. The complete data set is $\mathcal{D} = \{(\mathbf{s}_n, y_n)\}_{n=1}^{N}$, and there are $N = 53{,}211$ data cases. The goal in the classification task is to learn a classification function mapping $\mathbf{s}_n$ to $y_n$, where $y_n$ is a discrete value.
In the case of regression, $y_n$ is a real-valued regression target corresponding to the length of stay. Since the data set includes some very long stay durations, we let $y_n$ represent the log of the length of stay in days for all models. We convert back from the log number of days to the number of days when reporting results. The complete data set is again $\mathcal{D} = \{(\mathbf{s}_n, y_n)\}_{n=1}^{N}$ with $N = 53{,}211$ data cases (we again require 48 hours' worth of data). The goal in the regression task is to learn a regression function mapping $\mathbf{s}_n$ to $y_n$, where $y_n$ is a continuous value.
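The target transform and its inversion when reporting can be sketched as (variable names are ours):

```python
import numpy as np

los_days = np.array([2.5, 10.0, 60.0])  # raw lengths of stay, in days
y = np.log(los_days)                    # training target: log length of stay
reported = np.exp(y)                    # convert predictions back to days
```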
A.3 Implementation Details
A.3.1 Proposed Model
The model is learned using the Adam optimization method in TensorFlow, with gradients provided via automatic differentiation. The actual multivariate time series representation used during learning is based on the union of all time stamps that exist in any dimension of the input time series. Undefined observations are represented as zeros, and a separate missing-data mask is used to keep track of which time series have observations at each time point. Equations 1 and 2 are modified such that data that are not available are not taken into account at all. This implementation is exactly equivalent to the computations described in Equations 1 and 2, but supports parallel computation across all dimensions of the time series for a given data case.
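A NumPy sketch of this masked, union-of-time-stamps computation follows (the function name and shapes are ours): unobserved entries are stored as zeros, and a binary mask zeroes their kernel weights so they contribute nothing to either sum in Equation 1, letting all dimensions be processed in one vectorized pass.

```python
import numpy as np

def masked_rbf_interpolate(t_union, x, obs_mask, r, alpha):
    """Vectorized Equation 1 over all channels of one data case.

    t_union: (T,) union of all observation times; x, obs_mask: (D, T)
    values (zero-filled) and 0/1 observation mask; r: (K,) reference
    points; alpha: (D,) per-channel bandwidths. Returns (D, K).
    """
    # w[d, k, j]: kernel weight of union time j for channel d at r_k.
    w = np.exp(-alpha[:, None, None]
               * (r[None, :, None] - t_union[None, None, :]) ** 2)
    w = w * obs_mask[:, None, :]   # missing observations get zero weight
    return (w * x[:, None, :]).sum(-1) / np.maximum(w.sum(-1), 1e-12)

t_union = np.array([0.0, 1.0, 2.0])
x = np.array([[1.0, 0.0, 3.0],     # channel 0: unobserved at t=1
              [0.0, 2.0, 0.0]])    # channel 1: observed only at t=1
mask = np.array([[1.0, 0.0, 1.0],
                 [0.0, 1.0, 0.0]])
out = masked_rbf_interpolate(t_union, x, mask,
                             r=np.array([0.0, 2.0]),
                             alpha=np.array([5.0, 5.0]))
```

Because the zero-filled entries receive zero kernel weight, the result matches running the univariate interpolation on each channel's raw observation list separately.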
Finally, we note that the learning problem can be solved using a doubly stochastic gradient based on the use of mini-batches combined with re-sampling the artificial missing-data masks used in the interpolation loss. In practice, we randomly select a fixed percentage of the observed data points to hold out from every input time series.
A.3.2 Baselines
The logistic regression model is trained with cross-entropy loss, with the regularization strength set to 1. The support vector classifier is used with an RBF kernel and trained to minimize the soft-margin loss. We use the cross-entropy loss on the validation set to select the optimal number of estimators in the case of AdaBoost and random forests. Similarly, in the regression setting, the optimal number of estimators for AdaBoost and random forests is chosen on the basis of squared error on the validation set.
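As described in Section 3, the fixed-size features consumed by these baselines are simply the per-channel mean of whatever values were observed for an admission; with unobserved entries stored as NaN, this is a one-line reduction (NumPy sketch, our own example values):

```python
import numpy as np

# One admission: D=2 channels over the union of time stamps,
# with NaN marking unobserved entries.
series = np.array([[1.0, np.nan, 3.0],
                   [np.nan, 4.0, np.nan]])
features = np.nanmean(series, axis=1)  # mean of available values per channel
```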
We evaluate all models using a five-fold cross-validation estimate of generalization performance. In the classification setting, all the deep learning baselines are trained to minimize the cross-entropy loss, while the proposed model uses a composite loss consisting of the cross-entropy loss and the interpolation loss (weighted by $\delta$), as described in Section 2. In the regression task, all baseline models are trained to minimize squared error, and the proposed model is again trained with a composite loss consisting of squared error and the interpolation loss.
For all of the GRU-based models, we use the previously specified hyper-parameters (Che et al., 2018). The models are learned using the Adam optimization method. Early stopping is used on a validation set sub-sampled from the training folds. In the classification case, the final outputs of the GRU hidden units are used in a logistic layer that predicts the class. In the regression case, the final outputs of the GRU hidden units are used as input for a dense hidden layer, followed by a linear output layer.