Uncertainty-Aware Variational-Recurrent Imputation Network for Clinical Time Series

03/02/2020 ∙ by Ahmad Wisnu Mulyadi, et al. ∙ Korea University

Electronic health records (EHR) consist of longitudinal clinical observations portrayed with sparsity, irregularity, and high dimensionality, which become major obstacles in drawing reliable downstream clinical outcomes. Although a great number of imputation methods exist to tackle these issues, most of them ignore correlated features and temporal dynamics, and entirely set aside the uncertainty. Since missing value estimates involve the risk of being inaccurate, it is appropriate for a method to handle less certain information differently than reliable data. In that regard, we can use the uncertainty in estimating the missing values as a fidelity score to alleviate the risk of biased missing value estimates. In this work, we propose a novel variational-recurrent imputation network, which unifies an imputation network and a prediction network by taking into account the correlated features, the temporal dynamics, and the uncertainty. Specifically, we leverage a deep generative model for imputation, which is based on the distribution among variables, and a recurrent imputation network to exploit the temporal relations, in conjunction with the uncertainty. We validated the effectiveness of our proposed model on two publicly available real-world EHR datasets, PhysioNet Challenge 2012 and MIMIC-III, and compared the results with other competing state-of-the-art methods in the literature.




I Introduction

Electronic health records (EHRs) store longitudinal data consisting of patients’ clinical observations in the intensive care unit (ICU). Despite the surge of interest in clinical research on EHR, it still presents diverse challenges to be tackled, including high dimensionality, temporality, sparsity, irregularity, and bias [cheng2016, yadav2018]. Specifically, sequences of multidimensional medical measurements are recorded irregularly in terms of both variables and time. The reasons behind such missingness are diverse, ranging from lack of collection or documentation to outright recording faults [wells2013, cheng2016]. Since missingness carries essential information regarding a patient’s health status, improper handling of missing values might cause an unintentional bias [wells2013, jones2017], yielding unreliable downstream analyses and verdicts.

Complete-case analysis is an approach that draws clinical outcomes by disregarding the missing values and relying only on the observed values. However, excluding the missing data yields poor performance at high missing rates and also requires separate modeling for each distinct dataset. In fact, those missing values reflect decisions made by health-care providers [lipton2016]. Therefore, the missing values and their patterns contain information regarding a patient’s health status [lipton2016] and correlate with the outcomes or target labels [che2018]. Thus, we resort to an imputation approach that exploits those missing patterns to improve the prediction of clinical outcomes as the downstream task.

There exist numerous strategies for imputing missing values in the literature. Generally, imputation methods can be classified into deterministic or stochastic approaches, depending on the use of randomness [brick1996]. Deterministic methods, such as mean [little1987] and median filling [acuna2004], produce only one possible value when estimating the missing values. However, it is desirable for an imputation model to generate values or samples by considering the distribution of the available observed data. This leads us to the employment of stochastic imputation methods.

The recent rise of deep learning models has offered potential solutions in accommodating such circumstances. Variational autoencoders (VAEs) [kingma2014] and generative adversarial networks (GANs) [goodfellow2014] exploit the latent distribution of high-dimensional incomplete data and generate comparable data points as approximation estimates for the missing or corrupted values [nazabal2018, luo2018, jun2019]. However, such deep generative models are insufficient for estimating the missing values of multivariate time series, owing to their nature of ignoring temporal relations between time points. On the other hand, by virtue of recurrent neural networks (RNNs), which have proved to perform remarkably well in modeling sequential data, we can estimate the complete data by taking into account the temporal characteristics. In this approach, GRU-D [che2018] introduced a modified gated-recurrent unit (GRU) cell to model missing patterns in the form of masking vectors and temporal delays. Likewise, BRITS [cao2018] exploited the temporal relations of bidirectional dynamics by considering feature correlations in estimating the missing values.

Even though such models employed the stochastic approach for inferring and generating samples by utilizing both features and temporal relations, they scarcely exploited the uncertainty in estimating the missing values in multivariate time series data (i.e., since the imputation estimates are not thoroughly accurate, we may introduce a fidelity score denoted by the uncertainty, which enhances the downstream task performance by emphasizing the reliable information more than the less certain information) [he2010, gemmeke2010, jun2019]. We can build an imputation model that captures the aleatoric uncertainty in estimating the missing values by placing a distribution over the output of the model [kendall2017]. In particular, we would like to estimate the heteroscedastic aleatoric uncertainties, which are useful in cases where observation noises vary with the input.


Fig. 1: Architecture of the proposed model, which unifies two key networks, namely variational autoencoder and recurrent imputation network. The model considers the feature correlations, temporal relations, and the uncertainties in estimating the missing values to get a better prediction of clinical outcome. Refer to Section III for more details of the notation.

In this work, we define our primary task as the prediction of in-hospital mortality on clinical time series data. However, since such data are portrayed with sparse and irregularly-sampled characteristics, we devise an imputation model as the secondary problem to enhance the clinical outcome predictions. We propose a novel variational-recurrent imputation network (V-RIN), as illustrated in Fig. 1, which unifies the imputation and prediction networks for multivariate time series EHR data, governing both correlations among variables and temporal relations. Specifically, given the sparse data, an inference network of VAEs is employed to capture data distribution in the latent space. From this, we employ a generative network to obtain the reconstructed data as the imputation estimates for the missing values and the uncertainty indicating the imputation fidelity score. Then, we integrate the temporal and feature correlations into a combined vector and feed it into a novel uncertainty-aware GRU in the recurrent imputation network. Finally, we obtain the mortality prediction as a clinical verdict from the complete imputed data. In general, our main contributions in this study are as follows:

  • We estimate the missing values by utilizing a deep generative model combined with a recurrent imputation network to capture both feature correlations and the temporal dynamics jointly, yielding the uncertainty.

  • We effectively incorporate the uncertainty with the imputation estimates in our novel uncertainty-aware GRU cell for better prediction results.

  • We evaluate the effectiveness of the proposed models by training the imputation and prediction networks jointly in an end-to-end manner, achieving superior performance on real-world multivariate time series EHR data compared to other competing state-of-the-art methods.

This study extends the preliminary work published in [jun2019]. Unlike the preceding study, we have further expanded the proposed model by introducing more complex recurrent imputation networks, which utilize the uncertainties, instead of vanilla RNNs. We also include two additional real-world EHR datasets in our experiments, described in Section IV-A, and validate the robustness of our proposed networks by comparing them with existing state-of-the-art models in the literature (Section IV-C). Furthermore, we have conducted extensive experiments to discover the impact of utilizing the uncertainties in missing value estimation on the downstream task (Sections IV-D to IV-F).

The rest of the paper is organized as follows. In Section II, we discuss the works closely related to our proposed model in imputing the missing values. In Section III, we detail our proposed model. In Section IV, we report on the experimental results and analysis by comparing them with state-of-the-art methods. Finally, we conclude the work in Section V.

II Related Work

Imputation strategies have been extensively devised to resolve the issue of sparse high-dimensional time series data by means of statistics, machine learning, or deep learning methods. For instance, previous works exploited statistical attributes of the observed data, such as mean [little1987] and median filling [acuna2004], which clearly ignore the temporal relations and the correlations among variables. Among the machine learning approaches, the expectation-maximization (EM) algorithm, k-nearest neighbors (KNN), and principal component analysis (PCA) [oba2003, mohamed2009] were proposed by considering the relationships of the features either in the original or in the latent space. Furthermore, multiple imputation by chained equations (MICE) [white2011, azur2011] introduced variability by repeating the imputation process multiple times. However, these methods ignore the temporal relations, which are crucial attributes in time series modeling.

Deep learning-based imputation models are closely related to our proposed model. A previous study [nazabal2018] leveraged VAEs to generate stochastic imputation estimates by exploiting the distribution and correlations of features in the latent space. However, it ignored the temporal relations as well as the uncertainties. Recently, GP-VAE [fortuin2019] was proposed to obtain the latent representation by means of VAEs and to model temporal dynamics in the latent space using a Gaussian process. However, since that model is focused merely on the imputation task, a separate model is required for the downstream outcome.

To deal with time series data, a series of RNN-based imputation models were proposed. GRU-D [che2018] considered the temporal dynamics by incorporating the missing patterns, together with mean imputation and forward filling with past values using a temporal decay factor. Similarly, GRU-I [luo2018] trained the RNNs using such a temporal decay factor and further incorporated an adversarial scheme of GANs as the stochastic approach. In the meantime, BRITS [cao2018] was proposed to combine feature correlations and temporal dynamics using bidirectional recurrence, which enhanced accuracy by estimating missing values in both forward and backward directions. By considering the delayed gradients of the missingness in both directions, their models were able to achieve more accurate missing value imputation. Likewise, M-RNN [yoon2017] utilized bidirectional recurrent dynamics by operating interpolation (intra-stream) and imputation (inter-stream). Although temporal dynamics and stochastic methods were considered in these models, the uncertainties were scarcely incorporated for imputation purposes.

As we are unsure of the actual values, we argue that these uncertainties are beneficial and can be utilized in estimating the missing values. Such uncertainty can be captured by accommodating a distribution over the model output [kendall2017]. For this purpose, we exploit VAEs [kingma2014] as Bayesian networks, which are able to model the data distribution. In this work, we introduce the uncertainty as the imputation fidelity of the estimates, which compensates for the potential impairment of imputation estimates. Therefore, we assume that our model can rely on trustworthy estimates while giving less attention to the unreliable ones, as determined by their uncertainties. We expect to obtain better estimates of the missing values, leading to better prediction performance on the downstream task. However, since VAEs alone are not designed to model temporal dynamics, we combine the model with a recurrent imputation network that further utilizes those uncertainties. Thus, our proposed model differs from the aforementioned models in that it integrates the imputation and prediction networks jointly, and the utilization of the uncertainties serves as the major motivation of our work.

III Proposed Methods

Our proposed model architecture consists of two key networks – an imputation and a prediction network – as depicted in Fig. 1. The imputation network is devised based on VAEs to capture the latent distribution of the sparse data by means of its inference network (i.e., encoder). Then, the subsequent generative network of the VAEs (i.e., decoder) estimates the reconstructed data distribution. We regard its reconstructed values as the imputation estimates while exploiting its variances as the uncertainties, to be further utilized in the recurrent imputation network sequentially.

The succeeding recurrent imputation network is built upon RNNs to model the temporal dynamics. For each time step, we employ a regression layer to model the temporal relations incorporated within the hidden states of the RNN cell when imputing the missing values. As we also consider the time gap between observed values, we incorporate a time decay factor, which is applied to those hidden states, leading to decayed hidden states. Eventually, by systematically unifying the imputation estimates obtained from the VAEs and the RNNs, we expect to acquire more likely estimates by considering feature correlations and temporal relations over time, together with the utilization of the uncertainty. By doing so, we expect to acquire more reliable prediction outcomes. We describe each of the networks in the following sections after introducing the data representation.

III-A Data Representation

Given multivariate time series EHR data from a number of patients, a set of clinical observations and their corresponding labels is denoted as . For each patient, we have , where and represent the number of time points and variables, respectively; denotes all observable variables at time point , and is the -th element of the variables at time point . In addition, each observation has a corresponding clinical label , representing the clinical outcome. In our case, it denotes the in-hospital mortality of a patient, which makes this a binary classification problem. For the sake of clarity, we omit the patient superscript hereafter.

As the data is characterized by sparsity, we address the missing values by introducing a masking matrix , indicating whether values are observed or missing. In addition, we define a new data representation to be fed into the model, where we initialize the missing values with zero [lipton2016, nazabal2018, jun2019] as follows:
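The equation elided here presumably takes the standard form used in GRU-D- and BRITS-style models; a plausible reconstruction, with m denoting the mask and x̃ the zero-initialized input (symbols assumed), is:

```latex
m_t^d =
\begin{cases}
1, & \text{if } x_t^d \text{ is observed} \\
0, & \text{otherwise},
\end{cases}
\qquad
\tilde{x}_t^d =
\begin{cases}
x_t^d, & \text{if } m_t^d = 1 \\
0, & \text{otherwise}.
\end{cases}
```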

Another consideration in dealing with sparse data is that there exists a time gap between two observed values. Such gaps in fact carry essential information for estimating the missing values over time. Thus, we accommodate this information by further devising the time delay matrix , which is derived from the timestamps of the measurements. As initialization, we fix the time delay at the first time point, while setting the time delay for the remaining time points by referring to the masking matrix and the timestamp vector as follows:
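The elided definition of the time delay is presumably the recursive form used in GRU-D and BRITS, with s_t denoting the timestamp of the t-th measurement (symbols assumed):

```latex
\delta_t^d =
\begin{cases}
0, & t = 1 \\
s_t - s_{t-1}, & t > 1,\ m_{t-1}^d = 1 \\
s_t - s_{t-1} + \delta_{t-1}^d, & t > 1,\ m_{t-1}^d = 0.
\end{cases}
```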

Fig. 2: Graphical illustrations of comparing three different forms of GRU cells (: update gate, : reset gate).

III-B VAE-based Imputation Network

Given the observations at each time point, we infer the latent representation (with its corresponding dimension) by making use of the inference network, which approximates the true posterior distribution. Intuitively, we assume that each observation is generated from some unobserved random variable by some conditional distribution, while that variable is generated from a prior distribution, which can be interpreted as the hidden health status of the patient. In addition, we define the marginal likelihood by integrating the joint distribution over the latent variable.

However, in practice, this integral is analytically intractable, and consequently the true posterior becomes intractable as well. Therefore, we approximate it using a Gaussian distribution, whose mean and log-variance are produced by the inference network and its parameters. Furthermore, we apply the reparameterization trick [kingma2014], expressing the latent sample as the mean plus the element-wise product of the standard deviation and standard Gaussian noise, thus making it possible to differentiate and train the network using standard gradient methods.
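The reparameterization step described above can be sketched in plain Python; `mu` and `logvar` stand in for the per-dimension mean and log-variance produced by the inference network, and the names are illustrative only:

```python
import math
import random

def reparameterize(mu, logvar, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I), where
    sigma = exp(0.5 * logvar). Vectors are plain Python lists."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, logvar)]

# With a vanishingly small variance the sample collapses to the mean.
z = reparameterize([3.0, -1.0], [-100.0, -100.0], random.Random(0))
```

Because the randomness enters only through the external noise, gradients flow through the mean and variance, which is what makes the network trainable by standard gradient methods.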

Furthermore, given this latent vector, we estimate the reconstructed data by means of the generative network and its parameters, where the mean and variance of the reconstructed data distribution are obtained. We regard the mean as the estimate of the missing values and maintain the observed values by making use of the corresponding mask vector as

In the meantime, we regard the variance of the reconstructed data as the uncertainty, to be further utilized in the recurrent imputation process. For this purpose, we introduce an uncertainty matrix . We quantify this uncertainty as the fidelity score of the missing value estimates. In particular, we set the corresponding uncertainty to zero if the data is observed, indicating full confidence in the observation, and set it to the reconstruction variance if the corresponding value is missing as
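A plausible form of the elided definition, with \hat{\sigma}_t^{2,d} denoting the decoder variance (symbols assumed), is:

```latex
u_t^d =
\begin{cases}
0, & \text{if } m_t^d = 1 \\
\hat{\sigma}_t^{2,d}, & \text{otherwise}.
\end{cases}
```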

Finally, as an output of this VAE-based imputation network, we acquire the set denoting the imputed values and their corresponding uncertainty, respectively. Furthermore, to alleviate the bias of missing value estimations, we utilize this uncertainty in the following recurrent imputation network.

III-C Recurrent Imputation Network

The recurrent imputation network is based on RNNs, where we further model the temporal relations in the imputed data and exploit the uncertainties. While both the GRU [cho2014] (depicted in Fig. 2a) and long short-term memory (LSTM) [hochreiter1997] are feasible choices, inspired by the previous work of GRU-D [che2018] (depicted in Fig. 2b), we employ a modified, uncertainty-aware GRU cell that further considers the uncertainty and the temporal decay factor, as depicted in Fig. 2c. Graphical illustrations comparing these forms of GRU cells are given in Fig. 2.

Specifically, at each time step, we produce the uncertainty decay factor in Eq. (1) using a negative exponential rectifier to guarantee that it falls within (0, 1] [che2018, cao2018].
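Eq. (1) is elided here; following the negative exponential rectifier of GRU-D, it plausibly reads as follows, with the weight and bias names assumed:

```latex
\boldsymbol{\beta}_t = \exp\!\left(-\max\!\left(0,\ \mathbf{W}_\beta \mathbf{u}_t + \mathbf{b}_\beta\right)\right) \in (0, 1]
```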


We utilize such factors to emphasize the reliable estimates and give less attention to the uncertain ones. In particular, we first apply a fully-connected layer to the VAE estimates and element-wise multiply the result with the uncertainty decay factor as follows:


Note that we zero out the diagonal of the parameter matrix to enforce that each feature is estimated from the other features. Thus, we obtain the feature-based correlated estimates of the missing values.

In addition, we further consider the missing value estimates based on the temporal relations. For this purpose, we employ the time delay, which is an essential element for capturing the temporal relations and missing patterns of the data [che2018]. We exploit this information as the temporal decay factor as follows
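The elided temporal decay presumably mirrors the GRU-D formulation applied to the time delay (weight and bias names assumed):

```latex
\boldsymbol{\gamma}_t = \exp\!\left(-\max\!\left(0,\ \mathbf{W}_\gamma \boldsymbol{\delta}_t + \mathbf{b}_\gamma\right)\right)
```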


Meanwhile, by employing the GRU, we obtain the hidden state as the comprehensive information compiled from the preceding sequences. Thus, we take advantage of the temporal decay factor to govern the influence of past observations embedded in the hidden states, in the form of decayed hidden states as
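The decayed hidden state is then plausibly the element-wise product of the temporal decay factor and the previous hidden state (a reconstruction consistent with GRU-D; symbols assumed):

```latex
\tilde{\mathbf{h}}_{t-1} = \boldsymbol{\gamma}_t \odot \mathbf{h}_{t-1}
```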


Thereby, given the previous decayed hidden states, we can estimate the current complete observation through regression.


In addition, we further make use of those estimates by applying another operation:


again setting the diagonal of the parameter matrix to zero so as to consider the feature-based estimation on top of the temporal relations from the previous hidden states.

Hence, we have a pair of imputed values, corresponding to the missing value estimates obtained from the VAE by considering the uncertainties, and from the recurrent imputation network, respectively. We then merge this information to get a combined vector comprising both estimates by simply employing a convolution operation as


Finally, we obtain the complete vector by replacing the missing values with the combined estimates as
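The replacement step described above can be sketched in plain Python: observed entries are kept and missing ones are filled with the combined estimate. All names are illustrative:

```python
def complete_vector(x_init, mask, combined):
    """Keep observed entries (mask = 1) and fill missing ones
    (mask = 0) with the combined estimate, mirroring the
    masked-replacement step described in the text."""
    return [m * x + (1 - m) * c for x, m, c in zip(x_init, mask, combined)]

# Observed 5.0 is kept; the missing entry takes the combined estimate.
x_hat = complete_vector([5.0, 0.0], [1, 0], [4.8, 2.3])  # -> [5.0, 2.3]
```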


In addition, we concatenate the complete vector with the corresponding mask, and then feed it into the modified GRU cell to obtain the subsequent hidden states


Lastly, to predict the in-hospital mortality as the clinical outcome, we utilize the last hidden state to get the predicted label such that


Hereby, the weight matrices and bias vectors above are the learnable parameters of our recurrent imputation network.

input : clinical time series data
output : imputed values ; outcome prediction
while not converged do
      for t = 1 to T do
            // Eqs. (1) – (9)
      end for
      // Eq. (13)
end while
Algorithm 1: Algorithm of our proposed model

III-D Learning

We describe the composite loss function, comprising the imputation and prediction losses, used to tune all model parameters jointly. This loss function accommodates both the VAEs and the recurrent imputation network. By means of the VAEs, we define a loss function to maximize the variational evidence lower bound (ELBO), which comprises the reconstruction loss term and the Kullback-Leibler divergence. We add a regularization term, weighted by a hyperparameter, to introduce sparsity into the network. Moreover, for each time step, we measure the difference between the observed data and the combined imputation estimates by the mean absolute error (MAE) as
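The elided expression is the MAE restricted to observed entries; a minimal Python sketch (names illustrative) is:

```python
def masked_mae(x_obs, x_est, mask):
    """Mean absolute error computed only over observed entries
    (mask = 1), since ground truth exists only at those positions."""
    num = sum(m * abs(o - e) for o, e, m in zip(x_obs, x_est, mask))
    return num / sum(mask)

# Only the first two entries are observed: (0.5 + 0.0) / 2 = 0.25.
err = masked_mae([1.0, 2.0, 3.0], [1.5, 2.0, 9.9], [1, 1, 0])
```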



Furthermore, we define the binary cross-entropy loss function to evaluate the prediction of in-hospital mortality. Thus, we define the overall composite loss function as
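A plausible reconstruction of the elided composite loss, with α and β as the weighting hyperparameters discussed in the ablation studies (the exact form and symbols are assumptions):

```latex
\mathcal{L}_{\text{total}} = \alpha\,\mathcal{L}_{\text{VAE}} + \beta\,\mathcal{L}_{\text{MAE}} + \mathcal{L}_{\text{BCE}}
```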


where the two hyperparameters represent the weighting ratio between the respective loss terms.

Note that our proposed model can also consider bidirectional dynamics. Such a scenario is carried out by feeding both the forward and backward directions of the data into the recurrent imputation network. By doing so, we enable a fair comparison of our proposed model with M-RNN [yoon2017], BRITS-I, and BRITS [cao2018], which adopted this strategy to achieve better estimates of the missing values and the prediction outcomes. In the bidirectional case, we add to the aforementioned total loss a consistency loss, as introduced in [cao2018], to impose consistent estimates for each time step in both directions as
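The consistency loss of BRITS penalizes the discrepancy between the forward and backward estimates at each time step; a plausible reconstruction (symbols assumed) is:

```latex
\mathcal{L}_{\text{cons}} = \frac{1}{T} \sum_{t=1}^{T} \left\lVert \hat{\mathbf{x}}_t^{f} - \hat{\mathbf{x}}_t^{b} \right\rVert_1
```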


with and denoting the estimates from the forward and backward direction, respectively. Another hyperparameter could be introduced for this consistency loss to optimize the model.

Lastly, we use stochastic gradient descent in an end-to-end manner to optimize the model parameters during training. We summarize the overall training steps of our proposed framework in Algorithm 1.


IV Experiments

IV-A Dataset and Implementation Setup

IV-A1 PhysioNet 2012 Challenge

PhysioNet [goldberger2000, ikaro2012] (publicly available at https://physionet.org/content/challenge-2012/1.0.0/) consists of 35 irregularly sampled clinical variables (e.g., heart and respiration rate, blood pressure) from 4,000 patients during their first 48 hours of medical care in the ICU. Note that we ignore the demographic information and categorical data types from this dataset and exploit only the clinical time series data. From those samples, we excluded three patients with no observations at all. We sampled the observations hourly, using the time window as the timestamps, and took the average of values in cases of multiple measurements within a time window. This resulted in sparse EHR data with an average missing rate of 80.51%. Our aim is to predict the in-hospital mortality of patients, with 554 positive mortality labels (13.86%). As for the implementation setup on the PhysioNet dataset, we employed three layers of feedforward networks for the inference network of the VAEs, the last of which determines the dimension of the latent representation. The generative network has the same numbers of hidden units as the inference network, but in reverse order. We employed the hyperbolic tangent (tanh) as the non-linear activation function for each hidden layer. Prior to those activation functions, we also applied batch normalization and dropout, with separate rates for the classification and imputation tasks. We employed the modified GRU cell for the recurrent imputation network.

IV-A2 MIMIC-III

The Medical Information Mart for Intensive Care (MIMIC-III) [johnson2016] dataset (publicly available at https://physionet.org/content/mimiciii/1.4/) consists of 53,432 ICU stays of adult patients in the Beth Israel Deaconess Medical Center in the period of 2001–2012. We selected 99 variables from several source tables, such as laboratory tests, inputs to patients, outputs from patients, and drug prescriptions, resulting in a cohort of 13,998 patients with 1,181 positive in-hospital mortality labels (8.44%). Moreover, from those irregular measurements, we further sampled the data into two-hourly samples for the first 48 hours of medical care, leading to an average missing rate of 93.92%. As in the case of PhysioNet, we took the average value if there existed multiple measurements, and we referred to [che2018, purushotham2018] in pre-processing the MIMIC-III cohort. We employed three layers of feedforward networks for the inference network of the VAEs and an equal number of hidden units in reverse for the generative network. Likewise, for each hidden layer, we used batch normalization, dropout for both the classification and imputation tasks, and the tanh activation function, and we employed the modified GRU cell for the recurrent imputation network.

We trained the proposed model on both datasets using the Adam optimizer with mini-batches. For the imputation task, we fixed separate learning rates for PhysioNet and MIMIC-III, and used a common learning rate for both datasets on the classification task. We set the weight decay equally for both datasets. For the bidirectional models, we introduced an additional hyperparameter for the consistency loss.

IV-B Tasks

In this work, we validated the performance of our proposed models from two perspectives: (1) the in-hospital mortality prediction (classification), and (2) missing value imputation.

IV-B1 Classification Task

Our primary goal in this work is to predict in-hospital mortality as a binary classification task. For this purpose, we report the test results on the in-hospital mortality prediction task from 5-fold cross-validation in terms of the average area under the ROC curve (AUC). Additionally, to measure the robustness of the models in dealing with the imbalanced data portrayed in both datasets, we also report the area under the precision-recall curve (AUPRC). We randomly removed samples of the training data under several removal scenarios and left the validation and test sets untouched for reporting the results.

IV-B2 Imputation Task

For the secondary task, we additionally evaluated the imputation performance on the missing values. For this task, we randomly removed a portion of the observed data from all of the training, validation, and test sets to serve as the ground truth. We then report the test results from 5-fold cross-validation in terms of the MAE. In addition, we also measured another widely used imputation metric, the mean relative error (MRE). Given the ground truth and the imputation estimate of each held-out item, MAE and MRE are defined as
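The elided definitions are the standard ones; a minimal Python sketch (names illustrative) is:

```python
def mae(truth, est):
    # Mean absolute error over the held-out ground-truth items.
    return sum(abs(t - e) for t, e in zip(truth, est)) / len(truth)

def mre(truth, est):
    # Mean relative error: total absolute error normalized by the
    # total absolute magnitude of the ground truth.
    return (sum(abs(t - e) for t, e in zip(truth, est))
            / sum(abs(t) for t in truth))

truth, est = [2.0, 4.0], [1.0, 5.0]
```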

Fig. 3: Ablation studies on the impact of hyperparameter pair for the classification task on PhysioNet dataset.
Dataset Task Metric VRNN [chung2015] VAE+RNN [jun2019] V-RIN (Ours) V-RIN-full (Ours)
PhysioNet Classification AUC
Imputation MAE
MIMIC-III Classification AUC
Imputation MAE
TABLE I: Ablation study results on both PhysioNet and MIMIC-III datasets using the 10% removal scenario. *) For VAE+RNN [jun2019], the missing value estimates are obtained solely from the VAE reconstruction.

IV-C Comparative Models

We compared the performance of our proposed model on the aforementioned tasks with closely related competing state-of-the-art models in the literature, grouping them into unidirectional and bidirectional models.

IV-C1 Unidirectional Models

  • GRU-D [che2018] estimates the missing values by utilizing the informative missing value patterns in the form of the masking and time decay factor using modified GRU cells.

  • RITS-I [cao2018] utilizes unidirectional dynamics that rely solely on temporal relations through regression operations.

  • RITS [cao2018] is devised based on RITS-I by further taking into account the feature correlations. Furthermore, it utilizes the temporal decay as the factor to weigh between the feature-based and temporal-based estimates.

  • V-RIN (Ours) is our proposed model without the uncertainty decay factor. Specifically, we excluded Eq. (1) and omitted the element-wise multiplication with the uncertainty decay factor in Eq. (2).

  • V-RIN-full (Ours) executes all operations in the proposed model, including the feature-based correlations, the temporal relations, and the uncertainty decay utilization.

IV-C2 Bidirectional Models

  • M-RNN [yoon2017] exploits multi-directional RNNs, which execute both interpolation and imputation to infer the missing data.

  • BRITS-I [cao2018] extends RITS-I to handle bidirectional dynamics in estimating the missing values.

  • BRITS [cao2018] applies the bidirectional dynamics of RITS in handling the sparsity in the data. Both BRITS-I and BRITS additionally employ a consistency loss between the forward and backward directions in their attempt to estimate the missing values more precisely.

  • To make a fair comparison, we extended our proposed V-RIN and V-RIN-full into bidirectional models by means of the recurrent imputation networks. Similar to BRITS-I and BRITS, we further computed the consistency loss in Eq. (14) between the forward and backward estimates.

Models 10% 5%
Unidirectional GRU-D [che2018]
RITS-I [cao2018]
RITS [cao2018]
V-RIN (Ours)
V-RIN-full (Ours)
Bidirectional M-RNN [yoon2017]
BRITS-I [cao2018]
BRITS [cao2018]
V-RIN (Ours)
V-RIN-full (Ours)
TABLE II: Performance of in-hospital mortality prediction task on PhysioNet Dataset
Models 10% 5%
Unidirectional GRU-D [che2018]
RITS-I [cao2018]
RITS [cao2018]
V-RIN (Ours)
V-RIN-full (Ours)
Bidirectional M-RNN [yoon2017]
BRITS-I [cao2018]
BRITS [cao2018]
V-RIN (Ours)
V-RIN-full (Ours)
TABLE III: Performance of in-hospital mortality prediction task on MIMIC-III Dataset

IV-D Experimental Results: Ablation Studies

As part of the ablation studies, we report the performance of the unidirectional V-RIN and V-RIN-full models on the in-hospital mortality classification task. Firstly, we investigated the effect of the hyperparameter pair in Eq. (13). These parameters act as a ratio weighing the imputation by the VAEs against that by the recurrent imputation network, so as to achieve optimal performance in estimating the missing values and classifying the clinical outcomes. For each parameter, we defined a set of candidate values within a fixed range.

For PhysioNet, as illustrated in Fig. 3, we observed that in almost all combination settings, V-RIN-full achieved higher performance than V-RIN. We interpret these findings as evidence that introducing the uncertainty helps the model estimate the missing values, leading to better classification of the outcome. Both models achieved high performance in terms of their average AUC scores of for V-RIN and for V-RIN-full, obtained with settings of and , respectively. From these results, we observed that the model favored emphasizing the feature correlations over the temporal relations to obtain its best performance. For V-RIN, once we increased the , the classification performance degraded to some degree. In contrast, the performance of V-RIN-full improved considerably when we increased this parameter. We carried out similar ablation studies on the MIMIC-III dataset, reported in the supplementary material. To summarize, for both datasets, we argue that both the feature and temporal relations are essential in estimating the missing values, with some latent proportion between them.
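The ablation above amounts to a grid search over the loss-weight pair. A minimal sketch follows; since the actual hyperparameter symbols are elided in the text, `alpha` and `beta` here are hypothetical names, and `evaluate` stands in for training plus validation-AUC scoring:

```python
import itertools

def total_loss(vae_loss, rnn_loss, alpha, beta):
    # Weighted combination of the VAE imputation loss and the
    # recurrent-network loss, in the spirit of Eq. (13).
    # alpha/beta are hypothetical names for the elided pair.
    return alpha * vae_loss + beta * rnn_loss

def grid_search(candidates, evaluate):
    # Try every (alpha, beta) combination and keep the pair with
    # the highest validation score (e.g., AUC).
    return max(itertools.product(candidates, candidates),
               key=lambda ab: evaluate(*ab))
```

For example, `grid_search([0.1, 1.0], score_fn)` evaluates all four weight combinations and returns the best-scoring pair.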

Table I compares our model with closely related models on both the classification and imputation tasks: VRNN [chung2015], which integrates VAEs into each time step of an RNN, and VAE+RNN [jun2019], which employs VAEs followed by RNNs without incorporating the uncertainty. On PhysioNet and MIMIC-III, VRNN yielded the lowest classification performance among the reported models in terms of both AUC and AUPRC. In addition, its imputation errors were high in comparison with both of our proposed models. For VAE+RNN, we noticed a considerable improvement over VRNN on PhysioNet and MIMIC-III, especially in terms of AUPRC. VAE+RNN is, in fact, the model most closely related to ours in that it executes the imputation process by first exploiting the feature correlations and then the temporal dynamics, in that exact order. However, [jun2019] employed vanilla RNNs instead of a recurrent imputation network, which is a novel extension in this study. By introducing the temporal decay in V-RIN, the model learned the temporal dynamics more effectively, resulting in better AUC and AUPRC, as well as better imputation results in terms of MAE and MRE on both datasets, by a large margin. Finally, once we incorporated the uncertainty into the recurrent imputation network of V-RIN-full, we observed a significant enhancement in AUC, AUPRC, MAE, and MRE on both datasets. Thus, we conclude that utilizing both the temporal decay and the uncertainties is beneficial for both the imputation and classification tasks.
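The temporal decay credited above follows the GRU-D formulation [che2018], gamma_t = exp(-max(0, W_gamma * delta_t + b_gamma)), where delta_t is the time elapsed since each feature was last observed. A minimal elementwise sketch (diagonal weights assumed for simplicity; not the authors' exact layer):

```python
import numpy as np

def temporal_decay(delta, W_gamma, b_gamma):
    """GRU-D-style temporal decay [che2018]:
    gamma_t = exp(-max(0, W_gamma * delta_t + b_gamma)).
    Returns a factor in (0, 1] that shrinks toward 0 as the gap
    since the last observation grows, down-weighting stale values."""
    return np.exp(-np.maximum(0.0, W_gamma * delta + b_gamma))
```

A freshly observed feature (delta = 0, with zero bias) gets a decay factor of 1, while long-unobserved features are discounted exponentially.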

Models 10% 5%
Unidirectional GRU-D [che2018]
RITS-I [cao2018]
RITS [cao2018]
V-RIN (Ours)
V-RIN-full (Ours)
Bidirectional M-RNN [yoon2017]
BRITS-I [cao2018]
BRITS [cao2018]
V-RIN (Ours)
V-RIN-full (Ours)
TABLE IV: Performance of imputation task on PhysioNet Dataset
Models 10% 5%
Unidirectional GRU-D [che2018]
RITS-I [cao2018]
RITS [cao2018]
V-RIN (Ours)
V-RIN-full (Ours)
Bidirectional M-RNN [yoon2017]
BRITS-I [cao2018]
BRITS [cao2018]
V-RIN (Ours)
V-RIN-full (Ours)
TABLE V: Performance of imputation task on MIMIC-III Dataset

IV-E Classification Result Analysis

We present the experimental results of the in-hospital mortality prediction in comparison with other competing methods in terms of average AUC and AUPRC in Table II and Table III for PhysioNet and MIMIC-III, respectively. We report the evaluation of both unidirectional and bidirectional models under 10% and 5% removal on both datasets. For the unidirectional models on PhysioNet with 10% and 5% removal, our V-RIN model outperformed all comparative models in terms of AUC and AUPRC by a large margin, even RITS, which utilizes both the feature and temporal relations sequentially. The highest AUC was obtained by V-RIN-full in both removal scenarios. As for the bidirectional models, all models improved their performance in the 10% and 5% scenarios, except for M-RNN, which achieved the lowest results among all competing methods. Although M-RNN employs a similar strategy using bidirectional dynamics, it struggled to perform the task properly: both its AUC and AUPRC were lower than those of GRU-D, despite GRU-D using only forward-directional data. The highest AUC was achieved by our V-RIN-full with and for the 10% and 5% removal scenarios, respectively. These findings reaffirm that utilizing the uncertainty is truly beneficial in estimating the missing values.

As for MIMIC-III, we report the classification performance in Table III. On this dataset, we observed patterns quite similar to PhysioNet, with our models outperforming all competing models. The imbalance ratio, missing ratios, and dimensionality of MIMIC-III are much greater than those of PhysioNet. We found the performance of V-RIN and V-RIN-full to be comparable to some extent: among the unidirectional models with 10% removal, our V-RIN-full achieved the highest AUC and AUPRC, whereas V-RIN achieved the highest performance by a small margin among the bidirectional models. This contrasts with the 5% scenarios, where the bidirectional V-RIN-full achieved the highest AUC and AUPRC of and , respectively.
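The AUC reported throughout has a simple rank-based definition: the probability that a randomly chosen positive case is scored above a randomly chosen negative one (the Wilcoxon-Mann-Whitney statistic). A self-contained sketch for intuition (in practice a library implementation would be used):

```python
def auc_score(y_true, y_score):
    """Rank-based AUC: fraction of (positive, negative) pairs where
    the positive is scored higher; ties count as half a win."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A perfect ranker scores 1.0, a random one about 0.5; AUPRC is computed analogously from the precision-recall curve and is more sensitive to the class imbalance noted for MIMIC-III.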

Fig. 4: Analysis of the relevance of imputed values in carrying out the classification task, obtained by applying LRP [montavon2018, alber2018] to our recurrent imputation network. Positive and negative relevances are depicted in red and blue, respectively, while observed values are illustrated with black circles (), imputed values by the VAEs with hollow diamonds () with shaded areas as their corresponding uncertainties, and the combined estimates from the recurrent imputation network as hollow circles (). Color intensity is normalized to the maximum absolute relevance per feature over time.

IV-F Imputation Results Analysis

As the secondary task, we further evaluated the imputation performance of our proposed model against the comparative models. Table IV presents the experimental results on the PhysioNet dataset. GRU-D, which exploits the mean and temporal relations in imputing the values, struggled to achieve good performance. By using bidirectional dynamics, M-RNN obtained better results than GRU-D. However, both models remain inferior to the RITS variants in both directional scenarios. Overall, our proposed models demonstrated their imputation robustness on this dataset by consistently exhibiting the best performance in both the unidirectional and bidirectional scenarios. Table V presents the imputation performance on MIMIC-III. In general, the results show that our proposed models performed better than the comparative models, and comparably to BRITS among the bidirectional models for both the 10% and 5% removal scenarios.
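The MAE and MRE figures above are computed only on the artificially removed entries (the 10%/5% held-out values), not on positions that were missing to begin with. A minimal sketch under that assumption, with `eval_mask` marking the removed-then-imputed entries:

```python
import numpy as np

def masked_mae_mre(x_true, x_imputed, eval_mask):
    """MAE and MRE restricted to the artificially removed entries
    (eval_mask == 1). MRE normalizes the total absolute error by
    the total absolute magnitude of the ground-truth values."""
    m = eval_mask.astype(bool)
    abs_err = np.abs(x_true[m] - x_imputed[m])
    mae = abs_err.mean()
    mre = abs_err.sum() / np.abs(x_true[m]).sum()
    return mae, mre
```

Restricting the metrics this way is what makes the 10% and 5% removal scenarios comparable across models: every method is scored on exactly the same hidden ground-truth values.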

We further exploited layer-wise relevance propagation (LRP) [montavon2018] through a publicly available toolbox, iNNvestigate (https://github.com/albermax/innvestigate/) [alber2018], to examine the relevance of the features. Specifically, we employed the LRP- rule [montavon2018, alber2018] to discover which features induce the activation of the neurons in carrying out the classification task. We depicted the imputed values as the input to the recurrent imputation network and revealed the positive and negative relevance of each feature over time on both datasets in Fig. 4. The observed values are illustrated as black circles. By means of the VAEs, we obtained the imputation estimates, highlighted by the hollow diamond markers, alongside the uncertainties, illustrated with the shaded areas. Then, by employing the recurrent imputation network, we acquired the final estimates of the missing values, depicted as the hollow circles. Overall, we observed that the relevances grew stronger toward the end of the time period, regardless of their sign. Furthermore, compared to the observed values, the missing value estimates with high uncertainties tend to hold low relevance. This demonstrates the benefit of utilizing the uncertainty in the recurrent imputation networks for the downstream task.
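For intuition, the LRP rule family redistributes a layer's output relevance to its inputs in proportion to each input's contribution to the pre-activation. A minimal numpy sketch of the epsilon-stabilized rule for a single dense layer (an illustration of the principle, not the iNNvestigate implementation or the exact rule variant used here):

```python
import numpy as np

def lrp_epsilon(a, W, b, R_out, eps=1e-6):
    """Epsilon-stabilized LRP for one dense layer: input i receives
    relevance proportional to its contribution z_ij = a_i * W_ij to
    each output pre-activation, summed over outputs j."""
    z = a[:, None] * W                    # contributions, shape (in, out)
    denom = z.sum(axis=0) + b             # pre-activations per output
    denom = denom + eps * np.sign(denom)  # epsilon stabilizer
    return (z * (R_out / denom)[None, :]).sum(axis=1)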

V Conclusion

In this study, we proposed a novel unified framework consisting of imputation and prediction networks for sparse, high-dimensional multivariate time series. It combines a deep generative model with a recurrent model to capture the feature correlations and temporal relations for estimating the missing values while taking into account the uncertainty. We utilized the uncertainties as the fidelity of our estimation and incorporated them into predicting clinical outcomes. We evaluated the effectiveness of the proposed model on the PhysioNet 2012 Challenge and MIMIC-III datasets as real-world EHR multivariate time series data, demonstrating the superiority of our model in the in-hospital mortality prediction task compared to other state-of-the-art models in the literature.


This work was supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2017-0- 01779, A machine learning and statistical inference framework for explainable artificial intelligence, and No. 2019-0-00079, Department of Artificial Intelligence (Korea University)).