I. Introduction
Electronic health records (EHRs) store longitudinal data consisting of patients' clinical observations in the intensive care unit (ICU). Despite the surge of interest in clinical research on EHRs, diverse challenges remain to be tackled, including high-dimensionality, temporality, sparsity, irregularity, and bias [cheng2016, yadav2018]. Specifically, sequences of multidimensional medical measurements are recorded irregularly, in terms of both variables and time. The reasons behind such missing measurements are diverse, including lack of collection, lack of documentation, or even recording faults [wells2013, cheng2016]. Since missingness carries essential information regarding a patient's health status, improper handling of missing values may introduce unintentional bias [wells2013, jones2017], yielding unreliable downstream analyses and verdicts.
Complete-case analysis is an approach that draws clinical outcomes by disregarding the missing values and relying only on the observed ones. However, excluding the missing data yields poor performance at high missing rates and also requires modeling each distinct dataset separately. In fact, those missing values reflect decisions made by healthcare providers [lipton2016]. Therefore, the missing values and their patterns contain information regarding a patient's health status [lipton2016] and correlate with the outcomes or target labels [che2018]. Thus, we resort to an imputation approach that exploits those missing patterns to improve the prediction of clinical outcomes as the downstream task.
There exist numerous strategies for imputing missing values in the literature. Generally, imputation methods can be classified into deterministic or stochastic approaches, depending on the use of randomness [brick1996]. Deterministic methods, such as mean [little1987] and median filling [acuna2004], produce only one possible value when estimating the missing values. However, it is desirable for an imputation model to generate values by considering the distribution of the available observed data, which leads us to stochastic imputation methods. The recent rise of deep learning models has offered potential solutions in this direction. Variational autoencoders (VAEs) [kingma2014] and generative adversarial networks (GANs) [goodfellow2014] exploit the latent distribution of high-dimensional incomplete data and generate comparable data points as estimates for the missing or corrupted values [nazabal2018, luo2018, jun2019]. However, such deep generative models are insufficient for estimating the missing values of multivariate time series, owing to their nature of ignoring the temporal relations between time points. On the other hand, recurrent neural networks (RNNs), which have proved to perform remarkably well in modeling sequential data, allow us to estimate the complete data by taking the temporal characteristics into account. In this vein, GRU-D [che2018] introduced a modified gated recurrent unit (GRU) cell to model missing patterns in the form of masking vectors and temporal delays. Likewise, BRITS [cao2018] exploited the temporal relations of bidirectional dynamics by considering feature correlations in estimating the missing values. Even though such models employ a stochastic approach for inferring and generating samples by utilizing both feature and temporal relations, they scarcely exploit the uncertainty in estimating the missing values of multivariate time series data. Since the imputation estimates are not thoroughly accurate, we may introduce a fidelity score, denoted by the uncertainty, which enhances the downstream task performance by emphasizing reliable information over less certain information [he2010, gemmeke2010, jun2019]. An imputation model can capture the aleatoric uncertainty in estimating the missing values by placing a distribution over the output of the model [kendall2017]. In particular, we would like to estimate the heteroscedastic aleatoric uncertainty, which is useful in cases where the observation noise varies with the input [kendall2017].

In this work, we define our primary task as the prediction of in-hospital mortality from clinical time series data. However, since such data are sparse and irregularly sampled, we devise an imputation model as the secondary problem to enhance the clinical outcome predictions. We propose a novel variational-recurrent imputation network (V-RIN), as illustrated in Fig. 1, which unifies the imputation and prediction networks for multivariate time series EHR data, governing both the correlations among variables and the temporal relations. Specifically, given the sparse data, the inference network of a VAE is employed to capture the data distribution in the latent space. From this, we employ a generative network to obtain the reconstructed data as the imputation estimates for the missing values, along with the uncertainty indicating the imputation fidelity score. Then, we integrate the temporal and feature correlations into a combined vector and feed it into a novel uncertainty-aware GRU in the recurrent imputation network. Finally, we obtain the mortality prediction as a clinical verdict from the complete imputed data. In general, our main contributions in this study are as follows:

We estimate the missing values by utilizing a deep generative model combined with a recurrent imputation network to jointly capture the feature correlations and temporal dynamics, while also yielding uncertainty estimates.

We effectively incorporate the uncertainty with the imputation estimates in our novel uncertainty-aware GRU cell for better prediction results.

We evaluate the effectiveness of the proposed models by training the imputation and prediction networks jointly in an end-to-end manner, achieving superior performance on real-world multivariate time series EHR data compared to competing state-of-the-art methods.
This study extends our preliminary work published in [jun2019]. Unlike the preceding study, we have expanded the proposed model by introducing more complex recurrent imputation networks, which utilize the uncertainties, instead of vanilla RNNs. We also include two additional real-world EHR datasets in our experiments, described in Section IV-A, and validate the robustness of our proposed networks by comparing them with existing state-of-the-art models in the literature (Section IV-C). Furthermore, we have conducted extensive experiments to discover the impact of utilizing the uncertainties in missing value estimation on the downstream task (Sections IV-D to IV-F).
The rest of the paper is organized as follows. In Section II, we discuss the works closely related to our proposed model in imputing the missing values. In Section III, we detail our proposed model. In Section IV, we report on the experimental results and analysis by comparing them with stateoftheart methods. Finally, we conclude the work in Section V.
II. Related Work
Imputation strategies have been extensively devised to resolve the issue of sparse high-dimensional time series data by means of statistical, machine learning, or deep learning methods. For instance, previous works exploited statistical attributes of the observed data, such as mean [little1987] and median filling [acuna2004], which clearly ignore the temporal relations and the correlations among variables. Among machine learning approaches, the expectation-maximization (EM) algorithm [dempster1977], k-nearest neighbors (KNN) [troyanskaya2001], and principal component analysis (PCA) [oba2003, mohamed2009] were proposed to consider the relationships among features either in the original or in the latent space. Furthermore, multiple imputation by chained equations (MICE) [white2011, azur2011] introduced variability by repeating the imputation process multiple times. However, these methods ignore the temporal relations, which are crucial attributes in time series modeling. Deep learning-based imputation models are more closely related to our proposed models. A previous study [nazabal2018] leveraged VAEs to generate stochastic imputation estimates by exploiting the distribution and correlations of features in the latent space; however, it ignored the temporal relations as well as the uncertainties. Recently, GP-VAE [fortuin2019] was proposed to obtain the latent representation by means of VAEs and to model the temporal dynamics in the latent space using a Gaussian process. However, since that model focuses solely on the imputation task, a separate model is required for any downstream outcome.
To deal with time series data, a series of RNN-based imputation models has been proposed. GRU-D [che2018] considered the temporal dynamics by incorporating the missing patterns, together with mean imputation and forward filling with past values, using a temporal decay factor. Similarly, GRUI [luo2018] trained RNNs using such a temporal decay factor and further incorporated an adversarial GAN scheme as the stochastic approach. Meanwhile, BRITS [cao2018] combined the feature correlations and temporal dynamics using bidirectional recurrence, which enhanced accuracy by estimating the missing values in both the forward and backward directions. By considering the delayed gradients of the missingness in both directions, it achieved more accurate missing value imputation. Likewise, M-RNN [yoon2017] utilized bidirectional recurrent dynamics by operating interpolation (intra-stream) and imputation (inter-stream). Although temporal dynamics and stochastic methods are considered in these models, the uncertainties were scarcely incorporated for imputation purposes.
As we are unsure of the actual values, we argue that these uncertainties are beneficial and can be utilized in estimating the missing values. Such uncertainty can be captured by accommodating a distribution over the model output [kendall2017]. For this purpose, we exploit VAEs [kingma2014] as Bayesian networks, which are able to model the data distribution. In this work, we introduce the uncertainty as the imputation fidelity of the estimates, which compensates for the potential impairment of the imputation estimates. Therefore, our model can emphasize reliable estimates while giving less attention to unreliable ones, as determined by their uncertainties. We thus expect to obtain better estimates of the missing values, leading to better prediction performance on the downstream task. However, since VAEs alone are not designed to model the temporal dynamics, we combine the model with a recurrent imputation network that further utilizes those uncertainties. Thus, our proposed model differs from the aforementioned models in that it integrates the imputation and prediction networks jointly, and the utilization of the uncertainties serves as the major motivation of our work.
III. Proposed Methods
Our proposed model architecture consists of two key networks, an imputation network and a prediction network, as depicted in Fig. 1. The imputation network is devised based on VAEs to capture the latent distribution of the sparse data by means of its inference network (i.e., the encoder). Then, the subsequent generative network of the VAEs (i.e., the decoder) estimates the reconstructed data distribution. We regard the reconstructed values as the imputation estimates, while exploiting the variances as the uncertainties to be further utilized in the subsequent recurrent imputation network.
The succeeding recurrent imputation network is built upon RNNs to model the temporal dynamics. For each time step, we employ a regression layer to model the temporal relations incorporated within the hidden states of the RNN cells when imputing the missing values. Since we also consider the time gap between observed values, we incorporate a temporal decay factor, which is applied to those hidden states, leading to decayed hidden states. Eventually, by systematically unifying the imputation estimates obtained from the VAEs and the RNNs, while utilizing the uncertainty, we expect to acquire more accurate estimates that consider both the feature correlations and the temporal relations over time, and thereby more reliable prediction outcomes. We describe each of the networks in the following sections after introducing the data representation.
III-A Data Representation
Given multivariate time series EHR data of N patients, the set of clinical observations and their corresponding labels is denoted as {(X^(n), y^(n))}_{n=1}^{N}. For each patient, we have X = {x_1, …, x_T} ∈ R^{T×D}, where T and D represent the number of time points and variables, respectively; x_t ∈ R^D denotes all observable variables at time point t, and x_t^d is the d-th element of x_t. In addition, each observation has the corresponding clinical label y ∈ {0, 1}, representing the clinical outcome. In our case, it denotes the in-hospital mortality of a patient, which makes this a binary classification problem. For the sake of clarity, we omit the superscript (n) hereafter.
As X is characterized by sparsity, we address the missing values by introducing a masking matrix M ∈ {0, 1}^{T×D}, indicating whether values are observed or missing. In addition, we define a new data representation X̃ to be fed into the model, where we initialize the missing values with zero [lipton2016, nazabal2018, jun2019] as follows:

x̃_t^d = x_t^d if m_t^d = 1, and x̃_t^d = 0 otherwise.
Another consideration in dealing with sparse data is that there exists a time gap between two observed values. This gap in fact carries a piece of essential information for estimating the missing values over time. Thus, we accommodate this information by further devising a time delay matrix Δ ∈ R^{T×D}, derived from the timestamp s_t of each measurement. As initialization, we fix δ_1^d = 0, while setting the time delay for the remaining time points by referring to the masking matrix M and the timestamp vector s as follows:

δ_t^d = s_t − s_{t−1} + δ_{t−1}^d if m_{t−1}^d = 0, and δ_t^d = s_t − s_{t−1} if m_{t−1}^d = 1, for t > 1.
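The delay rule above can be sketched in NumPy; the function name `time_delay` and the toy mask/timestamps are illustrative, not taken from the paper:

```python
import numpy as np

def time_delay(mask, stamps):
    """Time-delay matrix: for each variable, the time elapsed since it
    was last observed (accumulating across consecutive missing entries)."""
    T, D = mask.shape
    delta = np.zeros((T, D))
    for t in range(1, T):
        gap = stamps[t] - stamps[t - 1]
        # if the variable was observed at t-1 the delay resets to the gap,
        # otherwise it keeps accumulating on top of the previous delay
        delta[t] = np.where(mask[t - 1] == 1, gap, gap + delta[t - 1])
    return delta

mask = np.array([[1, 0],
                 [0, 0],
                 [1, 1]])          # 1 = observed, 0 = missing
stamps = np.array([0., 1., 3.])    # measurement timestamps s_t
delta = time_delay(mask, stamps)   # [[0, 0], [1, 1], [3, 3]]
```

Note how the second variable, missing at the first two time points, accumulates a delay of 3 by the third measurement.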
III-B VAE-based Imputation Network
Given the observations x̃_t at each time point t, we infer the latent representation z_t ∈ R^K, with K as its corresponding dimension, by making use of the inference network approximating the true posterior distribution p_θ(z_t | x̃_t). Intuitively, we assume that x̃_t is generated from some unobserved random variable z_t by a conditional distribution p_θ(x̃_t | z_t), while z_t is drawn from a prior distribution p(z_t), which can be interpreted as the hidden health status of the patient. In addition, we define the marginal likelihood as

p_θ(x̃_t) = ∫ p_θ(x̃_t, z_t) dz_t

by integrating out the joint distribution, defined as p_θ(x̃_t, z_t) = p_θ(x̃_t | z_t) p(z_t). However, in practice, this integral is analytically intractable. Since p_θ(z_t | x̃_t) = p_θ(x̃_t | z_t) p(z_t) / p_θ(x̃_t), the true posterior becomes intractable as well. Therefore, we approximate it with q_φ(z_t | x̃_t) using a Gaussian distribution N(μ_{z_t}, diag(σ²_{z_t})), where the mean and log-variance are obtained such that

[μ_{z_t}, log σ²_{z_t}] = f_φ(x̃_t),

where f_φ denotes the inference network with parameters φ. Furthermore, we apply the reparameterization trick [kingma2014] as z_t = μ_{z_t} + σ_{z_t} ⊙ ε, where ε ∼ N(0, I), with ⊙ denoting element-wise multiplication, thus making it possible to differentiate and train the model using standard gradient methods.
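The reparameterization trick can be illustrated with a minimal NumPy sketch (the function name and toy values are ours, not the paper's):

```python
import numpy as np

def reparameterize(mu, logvar, rng):
    """z = mu + sigma * eps with eps ~ N(0, I): the randomness is moved
    into an input noise variable, so gradients can flow through mu and
    logvar during training."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
# with a very small log-variance, sigma ~ 0 and the sample collapses to mu
z = reparameterize(mu, np.full(2, -100.0), rng)
```

In an actual VAE, `mu` and `logvar` would be the outputs of the inference network f_φ rather than fixed arrays.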
Furthermore, given this latent vector z_t, we estimate the reconstructed data distribution by means of the generative network g_θ with parameters θ as

[μ_{x_t}, σ²_{x_t}] = g_θ(z_t),

where μ_{x_t} and σ²_{x_t} denote the mean and variance of the reconstructed data distribution, respectively. We regard the mean as the estimate of the missing values and maintain the observed values in x_t by making use of the corresponding mask vector m_t as

x̄_t = m_t ⊙ x_t + (1 − m_t) ⊙ μ_{x_t}.

In the meantime, we regard the variance of the reconstructed data as the uncertainty to be further utilized in the recurrent imputation process. For this purpose, we introduce an uncertainty matrix U ∈ R^{T×D} with rows u_t. We quantify this uncertainty as the fidelity score of the missing value estimates. In particular, we set the corresponding uncertainty to zero if the value is observed, indicating full confidence in the observation, and use the reconstructed variance if the corresponding value is missing:

u_t = (1 − m_t) ⊙ σ²_{x_t}.

Finally, as the output of this VAE-based imputation network, we acquire the set {X̄, U}, denoting the imputed values and their corresponding uncertainties, respectively. Furthermore, to alleviate the bias of missing value estimation, we utilize this uncertainty in the following recurrent imputation network.
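The masking-based merge and the uncertainty matrix can be sketched as follows; `vae_impute` and the toy decoder outputs are illustrative assumptions:

```python
import numpy as np

def vae_impute(x, mask, mu_hat, var_hat):
    """Keep observed entries of x; fill missing ones with the decoder mean.
    Uncertainty is the decoder variance on missing entries, zero where
    the value was actually observed (full trust in observations)."""
    x_bar = mask * x + (1 - mask) * mu_hat
    u = (1 - mask) * var_hat
    return x_bar, u

x = np.array([5.0, 0.0, 2.0])        # zero-initialized missing value in slot 1
mask = np.array([1.0, 0.0, 1.0])
mu_hat = np.array([4.8, 1.5, 2.2])   # decoder mean (illustrative)
var_hat = np.array([0.1, 0.6, 0.2])  # decoder variance (illustrative)
x_bar, u = vae_impute(x, mask, mu_hat, var_hat)
# x_bar = [5.0, 1.5, 2.0], u = [0.0, 0.6, 0.0]
```

Only the missing slot receives the decoder's estimate and a nonzero uncertainty; observed slots pass through untouched.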
III-C Recurrent Imputation Network
The recurrent imputation network is built upon RNNs, where we further model the temporal relations in the imputed data and exploit the uncertainties. While both the GRU [cho2014] (depicted in Fig. 2a) and long short-term memory (LSTM) [hochreiter1997] are feasible choices, inspired by the previous work of GRU-D [che2018] (depicted in Fig. 2b), we employ a modified, uncertainty-aware GRU cell that further considers the uncertainty and the temporal decay factor, as depicted in Fig. 2c. Specifically, at each time step t, we produce the uncertainty decay factor β_t in Eq. (1) using a negative exponential rectifier to guarantee β_t ∈ (0, 1] [che2018, cao2018]:

β_t = exp(−max(0, W_β u_t + b_β)). (1)
We utilize this factor to emphasize reliable estimates and give less attention to uncertain ones. In particular, we first apply a fully-connected layer to x̄_t and element-wise multiply the result with the uncertainty decay factor as follows:

v_t = β_t ⊙ (W_v x̄_t + b_v). (2)
Note that we zero out the diagonal of the parameter W_v to enforce estimations based on the other features. Thus, we obtain v_t as the feature-based correlated estimate of the missing values.
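Eqs. (1)-(2) can be sketched in NumPy as below; the parameter matrices are random stand-ins for learned weights, and the function names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 3
W_beta, b_beta = rng.standard_normal((D, D)), np.zeros(D)
# zeroed diagonal: a variable is never regressed from itself
W_v = rng.standard_normal((D, D)) * (1 - np.eye(D))
b_v = np.zeros(D)

def uncertainty_decay(u):
    # Eq. (1)-style: the negative exponential rectifier keeps the factor
    # in (0, 1], shrinking toward 0 as the uncertainty grows
    return np.exp(-np.maximum(0.0, W_beta @ u + b_beta))

def feature_estimate(x_bar, u):
    # Eq. (2)-style: fully-connected layer on x_bar, scaled by the decay
    return uncertainty_decay(u) * (W_v @ x_bar + b_v)

beta = uncertainty_decay(np.zeros(D))  # zero uncertainty -> full trust (all ones)
v = feature_estimate(np.array([0.5, 1.0, -0.2]), np.array([0.0, 0.4, 0.0]))
```

With zero uncertainty the decay is exactly 1, so fully trusted estimates pass through unscaled.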
In addition, we further consider missing value estimates based on the temporal relations. For this purpose, we employ the time delay δ_t, an essential element for capturing the temporal relations and missing patterns of the data [che2018]. We exploit this information through the temporal decay factor as follows:

γ_t = exp(−max(0, W_γ δ_t + b_γ)). (3)
Meanwhile, by employing the GRU, we obtain the hidden state h_{t−1} as the comprehensive information compiled from the preceding sequence. We take advantage of the temporal decay factor to govern the influence of past observations embedded in the hidden states, in the form of decayed hidden states:

ĥ_{t−1} = γ_t ⊙ h_{t−1}. (4)
Thereby, given the decayed hidden state ĥ_{t−1}, we can estimate the current complete observation through regression:

x̂_t = W_h ĥ_{t−1} + b_h. (5)
In addition, we further refine those estimates by applying another regression operation:

ẑ_t = W_z x̂_t + b_z, (6)
again setting the diagonal of the parameter W_z to zeros so as to consider the feature-based estimation on top of the temporal relations from the previous hidden states.
Hence, we have a pair of imputed values (v_t, ẑ_t), corresponding to the missing value estimates obtained from the VAE by considering the uncertainties and from the recurrent imputation network, respectively. We then merge this information to obtain the combined vector c_t comprising both estimates by simply employing a convolution operation (∗):

c_t = W_c ∗ [v_t; ẑ_t]. (7)
Finally, we obtain the complete vector x_t^c by replacing the missing values with the combined estimates:

x_t^c = m_t ⊙ x_t + (1 − m_t) ⊙ c_t. (8)
In addition, we concatenate the complete vector with the corresponding mask and feed it into the modified GRU cell to obtain the subsequent hidden state:

h_t = GRU([x_t^c; m_t], ĥ_{t−1}). (9)
Lastly, to predict the in-hospital mortality as the clinical outcome, we utilize the last hidden state h_T to obtain the predicted label ŷ such that

ŷ = σ(W_y h_T + b_y), (10)

where σ denotes the sigmoid function.
Hereby, the weight matrices W_β, W_v, W_γ, W_h, W_z, W_c, and W_y, together with their corresponding biases, are the learnable parameters in our recurrent imputation network.
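Putting the recurrent imputation step together, the following NumPy sketch walks through Eqs. (3)-(10). All parameter names and sizes are illustrative; the Eq. (7) merge is approximated here by a simple convex blend (a stand-in for the convolution), and the GRU gates are written out explicitly rather than calling a framework cell:

```python
import numpy as np

rng = np.random.default_rng(1)
D, H = 3, 4                                   # toy feature / hidden sizes
g = lambda *s: rng.standard_normal(s) * 0.1   # small random stand-in weights

Wg, bg = g(H, D), np.zeros(H)                 # temporal decay, Eq. (3)
Wh, bh = g(D, H), np.zeros(D)                 # history regression, Eq. (5)
Wz = g(D, D) * (1 - np.eye(D)); bz = np.zeros(D)  # feature refinement, Eq. (6)
a = 0.5                                       # stand-in blend for Eq. (7)
Wr, Ur, br = g(H, 2 * D), g(H, H), np.zeros(H)    # GRU gates on [x_c; m]
Wu, Uu, bu = g(H, 2 * D), g(H, H), np.zeros(H)
Wn, Un, bn = g(H, 2 * D), g(H, H), np.zeros(H)
wy, by = g(H), 0.0                            # prediction head, Eq. (10)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def step(x, m, v, delta, h):
    """One uncertainty-aware recurrent imputation step (Eqs. (3)-(9)).
    v is the uncertainty-weighted feature-based estimate from Eq. (2)."""
    gamma = np.exp(-np.maximum(0.0, Wg @ delta + bg))   # Eq. (3)
    h_dec = gamma * h                                   # Eq. (4): decayed hidden state
    x_hat = Wh @ h_dec + bh                             # Eq. (5): history-based estimate
    z_hat = Wz @ x_hat + bz                             # Eq. (6): feature-based refinement
    c = a * v + (1 - a) * z_hat                         # Eq. (7) stand-in: merged estimate
    x_c = m * x + (1 - m) * c                           # Eq. (8): fill missing entries only
    inp = np.concatenate([x_c, m])                      # Eq. (9): GRU on [x_c; m]
    r = sigmoid(Wr @ inp + Ur @ h_dec + br)             # reset gate
    u_g = sigmoid(Wu @ inp + Uu @ h_dec + bu)           # update gate
    n = np.tanh(Wn @ inp + Un @ (r * h_dec) + bn)       # candidate state
    return x_c, (1 - u_g) * h_dec + u_g * n

x = np.array([1.0, 0.0, 2.0]); m = np.array([1.0, 0.0, 1.0])
x_c, h = step(x, m, v=np.array([0.9, 1.1, 2.1]),
              delta=np.array([0.0, 2.0, 0.0]), h=np.zeros(H))
y_hat = sigmoid(wy @ h + by)                            # Eq. (10): mortality probability
```

Note that observed entries of `x` survive the step unchanged; only the missing slot is replaced by the merged estimate.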
III-D Learning
We now describe the composite loss function, comprising the imputation and prediction loss terms, used to tune all model parameters jointly. This loss function accommodates both the VAEs and the recurrent imputation network. By means of VAEs, we define the loss function to maximize the variational evidence lower bound (ELBO), which comprises the reconstruction loss term and the Kullback-Leibler divergence:

L_VAE = Σ_{t=1}^{T} ( −E_{q_φ(z_t|x̃_t)}[log p_θ(x̃_t | z_t)] + D_KL(q_φ(z_t | x̃_t) ‖ p(z_t)) ) + λ‖W‖_1, (11)

where we add an ℓ1 regularization term to introduce sparsity into the network, with λ as its hyperparameter. Moreover, for each time step, we measure the difference between the observed data and the combined imputation estimates by the mean absolute error (MAE):

L_rec = Σ_{t=1}^{T} ‖m_t ⊙ (x_t − c_t)‖_1. (12)
Furthermore, we define the binary cross-entropy loss L_pred to evaluate the prediction of in-hospital mortality. Thus, we define the overall composite loss function as

L_total = λ_1 L_VAE + λ_2 L_rec + L_pred, (13)

where λ_1 and λ_2 are the hyperparameters representing the ratio between the VAE-based and recurrent imputation losses, respectively.
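As a rough NumPy sketch of these objectives, the snippet below expresses a masked MAE reconstruction term, the binary cross-entropy, and a weighted combination; the function names and the exact weighting arrangement (`lam1`, `lam2`) are illustrative assumptions, not the paper's verbatim formulation:

```python
import numpy as np

def masked_mae(x, c, mask):
    """Eq. (12)-style reconstruction loss: the combined estimates are
    penalized only on entries that were actually observed."""
    return np.abs(mask * (x - c)).sum() / max(mask.sum(), 1.0)

def bce(y, y_hat, eps=1e-12):
    """Binary cross-entropy for the in-hospital mortality prediction."""
    return -(y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps))

def composite_loss(l_vae, l_rec, l_pred, lam1, lam2):
    """Eq. (13)-style weighted sum: lam1/lam2 trade off the VAE-based
    and recurrent imputation terms against the prediction term."""
    return lam1 * l_vae + lam2 * l_rec + l_pred

l = masked_mae(np.array([1., 2., 3.]),   # ground truth
               np.array([1., 5., 4.]),   # combined estimates
               np.array([1., 0., 1.]))   # mask: middle entry is missing
# l = 0.5: only the observed entries (errors 0 and 1) are averaged
```

The masked average guarantees the reconstruction loss never penalizes the model on entries for which no ground truth exists.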
Note that our proposed model is also applicable to bidirectional dynamics. Such a scenario is carried out by feeding the data in both the forward and backward directions into the recurrent imputation network. By doing so, we make our proposed model a fair comparison to M-RNN [yoon2017], BRITS-I, and BRITS [cao2018], which adopt such a strategy to achieve better estimates of the missing values and the prediction outcomes. In bidirectional cases, we add a consistency loss to the aforementioned total loss L_total, as introduced in [cao2018], to impose consistent estimates for each time step in both directions:

L_cons = (1/T) Σ_{t=1}^{T} |x̂_t^f − x̂_t^b|, (14)

with x̂_t^f and x̂_t^b denoting the estimates from the forward and backward directions, respectively. Another hyperparameter could be introduced to weigh this consistency loss when optimizing the model.
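The consistency term in Eq. (14) amounts to a mean absolute discrepancy between the two directional passes; a minimal sketch (function name ours):

```python
import numpy as np

def consistency_loss(x_f, x_b):
    """Eq. (14)-style: mean absolute discrepancy between the forward
    and backward estimates over the time steps."""
    return np.abs(x_f - x_b).mean()

# forward and backward estimates agreeing on the first entry only
c = consistency_loss(np.array([1., 2.]), np.array([1., 4.]))  # 1.0
```

Driving this term to zero encourages both directional passes to converge on the same imputed value at every time step.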
Lastly, we use stochastic gradient descent in an end-to-end manner to optimize the model parameters during training. We summarize the overall training steps of our proposed framework in Algorithm 1.

IV. Experiments
IV-A Dataset and Implementation Setup
IV-A1 PhysioNet 2012 Challenge
PhysioNet (publicly available at https://physionet.org/content/challenge-2012/1.0.0/) [goldberger2000, ikaro2012] consists of 35 irregularly sampled clinical variables (e.g., heart and respiration rate, blood pressure, etc.) from 4,000 patients during their first 48 hours of medical care in the ICU. Note that we ignore the demographic information and categorical data types from this dataset and exploit only the clinical time series data. From those samples, we excluded three patients with no observations at all. We sampled the observations hourly, using the time window as the timestamps, and took the average of the values in cases of multiple measurements within a time window. This resulted in sparse EHR data with an average missing rate of 80.51%. Our aim is to predict the in-hospital mortality of patients, with 554 positive mortality labels (13.86%). As for the implementation setup on the PhysioNet dataset, we employed three layers of feed-forward networks for the inference network of VAEs, with K denoting the dimension of the latent representation. The generative network has the same numbers of hidden units as the inference network but in reverse order. We employed the hyperbolic tangent (tanh) as the nonlinear activation function for each hidden layer. Prior to those activation functions, we also applied batch normalization and dropout for the classification and imputation tasks, respectively. We employed the modified GRU for the recurrent imputation network.

IV-A2 MIMIC-III
The Medical Information Mart for Intensive Care (MIMIC-III) dataset (publicly available at https://physionet.org/content/mimiciii/1.4/) [johnson2016] consists of 53,432 ICU stays of adult patients at the Beth Israel Deaconess Medical Center in the period of 2001-2012. We selected 99 variables from several source tables, such as laboratory tests, inputs to patients, outputs from patients, and drug prescriptions, resulting in a cohort of 13,998 patients with 1,181 positive in-hospital mortality labels (8.44%). Moreover, from those irregular measurements, we further sampled the data into two-hourly samples for the first 48 hours of medical care, leading to an average missing rate of 93.92%. As in the case of PhysioNet, we took the average value if there existed multiple measurements. We referred to [che2018, purushotham2018] in preprocessing the MIMIC-III cohort. We employed three layers of feed-forward networks for the inference network of VAEs and an equal number of hidden units in reverse order for the generative network. Likewise, for each hidden layer, we used batch normalization, dropout for both the classification and imputation tasks, and the tanh activation function, and employed the modified GRU for the recurrent imputation network.
We trained the proposed model on both datasets using the Adam optimizer with mini-batches. For the imputation task, we fixed separate learning rates for PhysioNet and MIMIC-III, while a single learning rate was used for both datasets on the classification task. We set the ℓ1 regularization and weight decay hyperparameters equally. For bidirectional models, we additionally set the hyperparameter of the consistency loss.
IV-B Tasks
In this work, we validated the performance of our proposed models from two perspectives: (1) in-hospital mortality prediction (classification), and (2) missing value imputation.
IV-B1 Classification Task
Our primary goal in this work is to predict in-hospital mortality as a binary classification task. For this purpose, we report the test results on the in-hospital mortality prediction task from 5-fold cross-validation in terms of the average area under the ROC curve (AUC). Additionally, to measure the robustness of the models in dealing with the imbalanced data in both datasets, we also report the area under the precision-recall curve (AUPRC). We randomly removed samples of the training data under several removal scenarios, leaving the validation and test sets untouched for reporting the results.
IV-B2 Imputation Task
For the secondary task, we additionally evaluated the imputation performance on the missing values. For this task, we randomly removed a portion of the observed data from the training, validation, and test sets to serve as the ground truth. We then report the test results from the 5-fold cross-validation by measuring the MAE. In addition, we also measured another widely used imputation similarity metric, the mean relative error (MRE). Given x_i and x̂_i as the ground truth and the imputation estimate of the i-th item, respectively, with N ground-truth items in total, MAE and MRE are defined as

MAE = (1/N) Σ_{i=1}^{N} |x_i − x̂_i|, MRE = Σ_{i=1}^{N} |x_i − x̂_i| / Σ_{i=1}^{N} |x_i|.
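Both metrics are straightforward to compute; a minimal NumPy sketch with toy values:

```python
import numpy as np

def mae(truth, est):
    """Mean absolute error over the held-out ground-truth items."""
    return np.abs(truth - est).mean()

def mre(truth, est):
    """Mean relative error: the absolute error normalized by the
    magnitude of the ground truth."""
    return np.abs(truth - est).sum() / np.abs(truth).sum()

truth = np.array([1.0, 2.0, 3.0])
est = np.array([1.0, 2.0, 6.0])
# mae = 1.0 (total error 3 over 3 items), mre = 0.5 (error 3 over magnitude 6)
```

MRE complements MAE by putting the error on a scale relative to the data's magnitude, which makes results comparable across variables with different ranges.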
Dataset  Task  Metric  VRNN [chung2015]  VAE+RNN [jun2019]  V-RIN (Ours)  V-RIN-full (Ours)

PhysioNet  Classification  AUC
AUPRC
Imputation  MAE
MRE
MIMIC-III  Classification  AUC
AUPRC
Imputation  MAE
MRE
IV-C Comparative Models
We compared the performance of our proposed models on the aforementioned tasks with closely related, competing state-of-the-art models in the literature, grouping them into unidirectional and bidirectional models.
IvC1 Unidirectional Models

GRU-D [che2018] estimates the missing values by utilizing the informative missing patterns in the form of masking and a temporal decay factor within modified GRU cells.

RITS-I [cao2018] utilizes unidirectional dynamics that rely solely on the temporal relations through regression operations.

RITS [cao2018] builds on RITS-I by further taking the feature correlations into account. Furthermore, it utilizes the temporal decay factor to weigh between the feature-based and temporal-based estimates.

V-RIN-full (Ours) executes all operations in the proposed model, including the feature-based correlations, temporal relations, and uncertainty decay.
IV-C2 Bidirectional Models

M-RNN [yoon2017] exploits multi-directional RNNs, which perform both interpolation and imputation to infer the missing data.

BRITS-I [cao2018] extends RITS-I to handle bidirectional dynamics in estimating the missing values.

BRITS [cao2018] applies the bidirectional dynamics of RITS in handling the sparsity of the data. Both BRITS-I and BRITS additionally employ a consistency loss between the forward and backward directions in their attempt to estimate the missing values more precisely.

To make a fair comparison, we extended our proposed V-RIN and V-RIN-full into bidirectional models by means of the recurrent imputation networks. Similar to BRITS-I and BRITS, we further computed the consistency loss in Eq. (14) for both the forward and backward estimates.
Models  10%  5%

AUC  AUPRC  AUC  AUPRC
Unidirectional  GRU-D [che2018]
RITS-I [cao2018]
RITS [cao2018]
V-RIN (Ours)
V-RIN-full (Ours)
Bidirectional  M-RNN [yoon2017]
BRITS-I [cao2018]
BRITS [cao2018]
V-RIN (Ours)
V-RIN-full (Ours)
Models  10%  5%

AUC  AUPRC  AUC  AUPRC
Unidirectional  GRU-D [che2018]
RITS-I [cao2018]
RITS [cao2018]
V-RIN (Ours)
V-RIN-full (Ours)
Bidirectional  M-RNN [yoon2017]
BRITS-I [cao2018]
BRITS [cao2018]
V-RIN (Ours)
V-RIN-full (Ours)
IV-D Experimental Results: Ablation Studies
As part of the ablation studies, we report the performance of the unidirectional V-RIN and V-RIN-full models on the in-hospital mortality classification task. First, we investigated the effect of the hyperparameter pair in Eq. (13). These parameters act as a ratio weighing the imputation by the VAEs against the recurrent imputation network, so as to achieve optimal performance in estimating the missing values and classifying the clinical outcomes. For each parameter, we evaluated a set of values over a fixed range.
For PhysioNet, as illustrated in Fig. 3, we observed that in almost all combination settings, V-RIN-full achieved higher performance than V-RIN. We interpret these findings as evidence that introducing the uncertainty helps the model estimate the missing values, leading to better outcome classification. Both models were able to achieve high average AUC scores, obtained under different settings of the hyperparameter pair. From these results, we observed that the model favored an emphasis on the feature correlations over the temporal relations to obtain its best performance. For V-RIN, once we increased the weight on the recurrent imputation term, the classification performance degraded to some degree. In contrast, the performance of V-RIN-full improved considerably when we increased this parameter. We also carried out similar ablation studies on the MIMIC-III dataset, reported in the supplementary material. To summarize, for both datasets, we argue that both the feature and temporal relations are essential in estimating the missing values, in some latent proportion.
Table I presents the comparison of our model with closely related models on both the classification and imputation tasks: VRNN [chung2015], which integrates VAEs into each time step of RNNs, and VAE+RNN [jun2019], which employs VAEs followed by RNNs without incorporating the uncertainty. For VRNN, on both PhysioNet and MIMIC-III, the classification performance was the lowest among the reported models in terms of both AUC and AUPRC, and its imputation errors were high in comparison to both of our proposed models. For VAE+RNN, in comparison to VRNN, we noticed a considerable performance improvement on PhysioNet and MIMIC-III, especially in terms of AUPRC. VAE+RNN is, in fact, the model most closely related to ours in that it executes the imputation process by first exploiting the feature correlations followed by the temporal dynamics, in that exact order. However, [jun2019] employed vanilla RNNs instead of the recurrent imputation network, which is a novel extension in this study. By introducing the temporal decay in V-RIN, the model was able to learn the temporal dynamics more effectively, resulting in better AUC and AUPRC, as well as better imputation results in terms of MAE and MRE on both datasets, by a large margin. Finally, once we introduced the uncertainty incorporated in the recurrent imputation network of V-RIN-full, we observed a further significant enhancement of AUC, AUPRC, MAE, and MRE on both datasets. Thus, we conclude that utilizing both the temporal decay and the uncertainties is beneficial for both the imputation and classification tasks.
Models  10%  5%

MAE  MRE  MAE  MRE
Unidirectional  GRU-D [che2018]
RITS-I [cao2018]
RITS [cao2018]
V-RIN (Ours)
V-RIN-full (Ours)
Bidirectional  M-RNN [yoon2017]
BRITS-I [cao2018]
BRITS [cao2018]
V-RIN (Ours)
V-RIN-full (Ours)
Table V. Imputation performance (MAE and MRE) on MIMIC-III for 10% and 5% value removal.

Models                    MAE (10%)  MRE (10%)  MAE (5%)  MRE (5%)
Unidirectional:
  GRU-D [che2018]
  RITS-I [cao2018]
  RITS [cao2018]
  VRIN (Ours)
  VRIN-full (Ours)
Bidirectional:
  M-RNN [yoon2017]
  BRITS-I [cao2018]
  BRITS [cao2018]
  VRIN (Ours)
  VRIN-full (Ours)
IV-E Classification Result Analysis
We presented the experimental results of the in-hospital mortality prediction in comparison with the competing methods, in terms of average AUC and AUPRC, in Table II and Table III for PhysioNet and MIMIC-III, respectively. We reported evaluations of both unidirectional and bidirectional models for 10% and 5% removal on both datasets. For the unidirectional models on PhysioNet with 10% and 5% removal, our VRIN model achieved better AUC and AUPRC by a large margin compared with all comparative models, even RITS, which utilizes both the feature and temporal relations sequentially. The highest AUC was obtained by VRIN-full in both removal scenarios. As for the bidirectional models, all models improved their performance in the 10% and 5% scenarios, except M-RNN, which achieved the lowest results among all competing methods. Although M-RNN employs a similar strategy using bidirectional dynamics, it struggled to perform the task properly; both its AUC and AUPRC were lower than those of GRU-D, despite GRU-D using only the forward-directional data. The highest AUC was achieved by our VRIN-full in both the 10% and 5% removal scenarios. These findings reassure us that utilizing the uncertainty is truly beneficial in estimating the missing values.
As for MIMIC-III, we reported the classification performance in Table III. On this dataset, we observed patterns quite similar to PhysioNet, with our models outperforming all competing models. The imbalance ratio, missing ratios, and dimensionality of MIMIC-III are much greater than those of PhysioNet. We found that the performance of VRIN and VRIN-full was comparable to some extent. Among the unidirectional models with 10% removal, our VRIN-full achieved the highest AUC and AUPRC, whereas VRIN achieved the highest performance by a small margin among the bidirectional models. This contrasts with the 5% scenarios, where VRIN-full with bidirectional dynamics achieved the highest AUC and AUPRC.
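The two classification metrics reported above can be computed from scratch. A minimal NumPy sketch (AUC via the Mann-Whitney statistic and AUPRC via the average-precision estimator, which are common choices rather than necessarily the exact implementation used here):

```python
import numpy as np

def auc_score(y_true, y_score):
    """AUC as the probability that a random positive outranks a random negative."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

def auprc_score(y_true, y_score):
    """Area under the precision-recall curve via average precision."""
    order = np.argsort(-np.asarray(y_score))
    y = np.asarray(y_true)[order]
    tp = np.cumsum(y)                          # true positives at each threshold
    precision = tp / np.arange(1, len(y) + 1)
    return float((precision * y).sum() / y.sum())
```

Under the heavy class imbalance of mortality labels, AUPRC is the more sensitive of the two, which is why both metrics are reported side by side.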
IV-F Imputation Results Analysis
As the secondary task, we further evaluated the imputation performance of our proposed models against the comparative models. Table IV presents the experimental results on the PhysioNet dataset. GRU-D, which exploits the mean and temporal relations in imputing the values, struggled to achieve good performance. By using bidirectional dynamics, M-RNN obtained better results than GRU-D. However, both models are still inferior to the RITS variants in both directional scenarios. Overall, both of our proposed models demonstrated their imputation robustness on this dataset by consistently exhibiting the best performance in both the unidirectional and bidirectional scenarios. Table V presents the imputation performance on MIMIC-III. In general, the results show that our proposed models performed better than the comparative models, and comparably to BRITS among the bidirectional models for both the 10% and 5% removal scenarios.
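The imputation metrics in Tables IV and V can be sketched as follows; errors are computed only on the artificially removed entries, a standard protocol, though exact normalization conventions for MRE vary across papers:

```python
import numpy as np

def masked_mae_mre(x_true, x_imputed, eval_mask):
    """MAE and MRE over artificially removed entries only (eval_mask == 1).

    MRE here normalizes the total absolute error by the total magnitude of
    the ground truth, one common convention.
    """
    m = eval_mask.astype(bool)
    abs_err = np.abs(x_true[m] - x_imputed[m])
    mae = float(abs_err.mean())
    mre = float(abs_err.sum() / np.abs(x_true[m]).sum())
    return mae, mre
```

Because the removed values are known, these metrics directly measure reconstruction quality without relying on the downstream classifier.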
We further exploited layer-wise relevance propagation (LRP) [montavon2018] through a publicly available toolbox, iNNvestigate (https://github.com/albermax/innvestigate/) [alber2018], to examine the relevance of the features. Specifically, we employed the LRP rule [montavon2018, alber2018] to discover which features induce the activation of the neurons in carrying out the classification task. We depict the imputed values fed as input to the recurrent imputation network, and reveal the positive and negative relevance of each feature over time on both datasets, in Fig. 4. The observed values are illustrated as black circles. By means of the VAE, we obtained the imputation estimates, highlighted by the hollow diamond markers, alongside the uncertainties, illustrated by the shaded areas. Then, by employing the recurrent imputation network, we acquired the final estimates of the missing values, depicted as hollow circles. Overall, we observed that the relevances grew stronger toward the end of the time period, regardless of their sign. Furthermore, compared with the observed values, the missing-value estimates with high uncertainties tend to hold low relevances. This demonstrates the benefit of utilizing the uncertainty in the recurrent imputation network for the downstream task.

V Conclusion
In this study, we proposed a novel unified framework consisting of imputation and prediction networks for sparse high-dimensional multivariate time series. It combines a deep generative model with a recurrent model to capture the feature correlations and temporal relations for estimating the missing values while taking the uncertainty into account. We utilize the uncertainties as the fidelity of our estimates and incorporate them when predicting the clinical outcomes. We evaluated the effectiveness of the proposed model on the PhysioNet 2012 Challenge and MIMIC-III datasets as real-world EHR multivariate time series, demonstrating the superiority of our model in the in-hospital mortality prediction task compared with other state-of-the-art models in the literature.
Acknowledgment
This work was supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grants funded by the Korea government (MSIT) (No. 2017-0-01779, A machine learning and statistical inference framework for explainable artificial intelligence, and No. 2019-0-00079, Department of Artificial Intelligence (Korea University)).