I. Introduction
Computational modeling in clinical applications has attracted growing interest with the realization that a quantitative understanding of a patient's pathophysiological progression is crucial to clinical studies [63]. With comprehensive and precise modeling, we can better understand a patient's state, offer more precise diagnoses and provide better individualized therapies [29]. Researchers are increasingly motivated to build more accurate computational models from multiple types of clinical data. However, missing values in clinical data challenge researchers who apply analytic techniques to modeling, as many of these techniques are designed for complete data.
Traditional strategies used in clinical studies to handle missing values include deleting records with missing values and imputing missing entries with mean values. However, deleting records with missing values, like other filtering strategies, can introduce biases [62] that impact modeling in many ways, thus limiting its generalizability. Mean imputation is widely used by researchers to handle missing values, but it has been shown to yield less effective estimates than many modern imputation techniques
[6, 64, 1, 61], such as maximum likelihood approaches and multiple imputation methods (e.g. multiple imputation by chained equations (MICE) [7]). Maximum likelihood and multiple imputation methods are based on solid statistical foundations and have become standard in the last few decades [20, 46]. Multiple imputation was originally proposed by Rubin [44], with the idea of replacing each missing value with multiple estimates that reflect the variation within each imputation model. The classic multiple imputation method treats the complete variables as predictors and the incomplete variables as outcomes, and uses multivariate regression models to estimate missing values [45, 48]. Another imputation approach relies on chained equations, where a sequence of regression models is fitted by treating one incomplete variable as the outcome and the other variables, with the best available estimates for their missing values, as predictors [43, 60, 59]. Multiple imputation has been successful in healthcare applications, where many imputation methods designed for various tasks build on it [21, 42, 52, 25, 36, 58, 13].
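To make the chained-equations idea concrete, here is a minimal single-chain sketch (our own illustration, not the MICE reference implementation): each incomplete column is regressed on all other columns and its missing entries are refilled with the fitted predictions, cycling until stable. Proper MICE additionally draws regression parameters and noise from their posterior and repeats the chain several times to produce multiple imputations.

```python
import numpy as np

def chained_equation_imputation(X, n_iters=10):
    """Single deterministic chain of chained-equation imputation:
    regress each incomplete column on all other columns and refill
    its missing entries with the fitted predictions."""
    X = X.astype(float).copy()
    missing = np.isnan(X)
    # initial fill: per-column means of the observed values
    col_means = np.nanmean(X, axis=0)
    for j in range(X.shape[1]):
        X[missing[:, j], j] = col_means[j]
    for _ in range(n_iters):
        for j in range(X.shape[1]):
            if not missing[:, j].any():
                continue
            obs = ~missing[:, j]
            # predictors: all other columns, plus an intercept
            A = np.column_stack([np.delete(X, j, axis=1), np.ones(len(X))])
            # least-squares fit on rows where column j is observed
            coef, *_ = np.linalg.lstsq(A[obs], X[obs, j], rcond=None)
            X[missing[:, j], j] = A[missing[:, j]] @ coef
    return X
```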
In recent years, additional imputation methods have been proposed. Although many imputation methods [61, 7, 51, 43, 32, 37, 71, 57] show their effectiveness in many applications, few of them are designed for time-series-based clinical data. These clinical data are usually multivariable time series, where patients have measurements of multiple laboratory tests at different times. Many methods are designed for cross-sectional imputation (measurements taken at the same time point) and do not consider temporal information that is useful in making predictions or imputing missing values. Ignoring informative temporal correlations and capturing only cross-sectional information may yield less effective imputation.
In order to address these limitations, we present a mixture-based multiple imputation model (MixMI) for clinical time series (the implementation of our model is available at: https://github.com/yxue/MixMI). Our model captures both cross-sectional information and temporal correlations to estimate missing values using mixture models. We model the distribution of measurements with a mixture composed of linear regression, to model cross-sectional correlations, and Gaussian processes (GPs), to capture temporal correlations. The problem with integrating a GP into a standard mixture model is that the GP models of all patient cases get the same mixing weight, while the confidence of GP predictions can vary greatly across patient cases. We overcome this problem by introducing individualized mixing weights for each patient case instead of assigning a fixed weight. We train our model using the Expectation-Maximization (EM) algorithm. We demonstrate the effectiveness of our model by comparing it with several state-of-the-art imputation algorithms on multiple clinical datasets.
Our main contributions are summarized as follows.
1. To the best of our knowledge, we are the first to build an imputation model for time series by integrating GPs within mixture models. We address the problem that the GP models of all patient cases get a fixed mixing weight by introducing individualized mixing weights.
2. We test the performance of our model on two real-world clinical datasets and several synthetic datasets. Our best model outperforms all comparison models, including several state-of-the-art imputation models. Using synthetic datasets, we also explore the properties of the data that benefit our model and/or the comparison models. Experiments show that our best model is robust to variation in these properties and outperforms the comparison models on all synthetic datasets.
The remainder of this paper is structured as follows. Section II discusses related work, and Section III describes the proposed method. Section IV describes the experimental setup, including dataset collection and the evaluation procedure. Section V discusses the computational results and underlying analyses. Section VI draws conclusions.
II. Related Work
Research on designing imputation methods for multivariable time series has attracted growing interest in recent decades. Previous studies generally fall into two categories. The first comprises methods that use linear or other simple parametric functions to estimate missing values. The second comprises methods that treat time series as smooth curves and estimate missing values using GPs or other nonparametric methods.
In the first category, multivariable time series are modeled with linear models [49, 34, 47] or other simple parametric functions [23, 3]. However, in these methods, the potential trajectories of variables are limited to linear or other simple parametric forms. Alternatively, many authors choose GPs or other nonparametric functions to model time series. Compared to linear models, GPs only impose locality constraints, under which close time points in a time series usually have close values. Therefore, GPs bring more flexibility in capturing the temporal trajectories of variables. However, directly applying GPs to imputation for multivariable time series usually yields less effective imputation due to the lack of consideration for the similarities and correlations across time series. Futoma et al. [17] extend the GP-based approach to imputation in the multivariable setting by applying multi-task Gaussian process (MTGP) prediction [5]. However, the quality of the estimated missing values relies on the estimation of the covariance structure among variables when using MTGP or other multi-task functional approaches [22, 10, 28]. To make a confident estimation of the covariance, these multi-task approaches often require a large number of time points with shared observations of the variables [5, 69]. Because many patients have records with only a limited number of time points, time series of inpatient clinical laboratory tests fall short of this requirement. Hori et al. [24]
extend MTGP to imputation of multi-environmental trial data in a third-order tensor. However, this approach is not applicable to clinical data with a large number of patients, since the covariance matrix of the observed values must be computed together with its inverse, which is intractable for our datasets.
Recently, Luo et al. [38]
explore the application of GPs to clinical data and propose an algorithm, 3-dimensional multiple imputation with chained equations (3DMICE), that combines the imputations from GP and traditional MICE based on weighting equations. However, the weighting equations are calculated based only on the standard deviations of the GP and MICE predictions for missing values. The weighting strategy is static and not optimized. We postulate that calculating weights by solving an optimization problem can help improve imputation quality. In our work, instead of the predictive mean matching used in
[38], we choose linear regression as one component of our model. Our model is also based on a statistical model and is thus statistically justified, which is not the case for [38]. The mixture model proposed in this work differs from mixtures of Gaussian processes (MGP) [56] in several ways. The mixing components in our model are not limited to Gaussian processes, bringing in the possibility of capturing correlations with various choices of parametric functions. More importantly, our model is designed for multivariable time series and captures both temporal and cross-sectional correlations. The latter is ignored in [56]
when MGP is applied to multivariable time series, which yields less effective predictions since some clinical tests have strong cross-sectional relations. Without a Gaussian process component, the standard Gaussian mixture model (GMM) is well studied and has been extended to imputation tasks
[14, 12, 66, 50]. These approaches still suffer from losing informative temporal correlations when imputing missing values in longitudinal data.

In order to effectively model the interaction between different aspects, we represent the data as a tensor, with each aspect being one mode. Therefore, our method can also be considered a tensor completion approach. Tensor completion problems are extensively studied. However, the classic tensor completion methods [15, 35, 54] focus on general tensors and usually do not consider temporal aspects. In recent years, many studies explore temporally augmented tensor completion for imputing missing values in time series [2, 70, 18, 53, 8]. These methods discretize time into evenly sampled intervals. However, because inpatient clinical laboratory tests are usually measured at varying intervals, assembling clinical data over regularly sampled time periods has several drawbacks, such as producing sparse tensors when time is discretized at fine granularity (e.g. every minute) while some laboratory tests are measured less frequently (e.g. daily). Furthermore, extending these methods to irregularly sampled time is not straightforward, requiring changes to design details and to the objective functions being optimized. Recently, Yang et al. [67] propose a tensor completion method that can deal with irregularly sampled time. However, most components of this approach are tailored to the characteristics of septic patients. We implement this approach with only its time-aware mechanism, which is general and applicable to our experimental settings; however, this approach is not effective in our experiments and is thus not included as a benchmark in this paper. Lately, Zhe et al. [72] propose a Bayesian nonparametric tensor decomposition model that captures temporal correlations between interacting events.
However, this approach is not directly applicable to continuous multivariable time series because it focuses on discrete events and captures temporal correlations between the occurrences of events.
Recurrent neural networks (RNNs) are another choice for capturing temporal relationships and have been studied in imputation tasks. Early studies [4, 55, 41] of RNN-based imputation models capture temporal correlations either within each time series or across all time series [68]. Yoon et al. [68] recently propose a novel multi-directional RNN that captures temporal correlations both within and across time series. Compared with traditional statistical models, neural-network-based models are usually harder to train and provide limited explainability, while their performance is not necessarily better. We use [68] as a benchmark since it is the most recent model. Besides imputation, researchers also attempt to build classification models with RNNs that utilize the missing-data patterns [11, 31, 9]. Instead of relying on the estimation of missing values, these models perform predictions by exploiting the missingness itself, and no separate imputation is required. These models cannot substitute for imputation methods since they do not provide imputed values.
III. Methodology
III-A Imputation Framework
In many predictive tasks on temporal clinical data, time series are often aligned into same-length sequences to derive more robust patient phenotypes through matrix decomposition or to discover feature groups by applying sequence/graph mining techniques [67, 65]. We model under this assumption. In this work, we use a tensor representation in which patients have the same number of time points. We represent the data collected from patients, laboratory tests and time points as two 3D tensors, a measurement tensor $\mathcal{X}$ and a time tensor $\mathcal{T}$, shown in Fig. 1. Each laboratory test measurement $x_{p,v,t}$ of patient $p$, variable (test) $v$ and time index $t$ is stored in the measurement tensor $\mathcal{X}$. Each time $t_{p,v,t}$, at which $x_{p,v,t}$ is measured, is stored in the time tensor $\mathcal{T}$.
Table I lists the main symbols we use throughout the paper. A missing value in the measurement tensor is denoted $x_{p,v,t} = \ast$, indicating that the value of test $v$ at time index $t$ for patient $p$ is missing; otherwise $x_{p,v,t}$ denotes an observed value. The time tensor is complete, since we only collect patient records at times when at least one laboratory test measurement is available. We assume we know the "prescribed" time at which a missing measurement should have been taken. In the matrix $\mathcal{T}_{:,:,t}$ at time index $t$, the measurement times $t_{p,v,t}$ and $t_{q,v,t}$ can differ when $p \neq q$, whereas for a given patient $p$ we have $t_{p,v,t} = t_{p,w,t}$ for all tests $v$ and $w$. That is, all tests for a particular patient are taken at the same time.
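As a minimal illustration of this representation (the flat record format here is our own assumption), the two tensors can be assembled as follows, with NaN marking missing measurements while the corresponding "prescribed" time is kept:

```python
import math

def build_tensors(records, n_patients, n_vars, n_times):
    """Assemble a measurement tensor X (NaN = missing) and a time
    tensor T from (patient, var, time_index, time, value) tuples,
    where value is None when the test was not taken but its
    'prescribed' time is known."""
    X = [[[math.nan] * n_times for _ in range(n_vars)] for _ in range(n_patients)]
    T = [[[math.nan] * n_times for _ in range(n_vars)] for _ in range(n_patients)]
    for p, v, t, time, value in records:
        T[p][v][t] = time          # the time tensor is complete
        if value is not None:
            X[p][v][t] = value     # observed measurement
    return X, T
```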
Disregarding the temporal dimension, the imputation problem is well studied. When one dimension is time, applying imputation methods that are not designed for time series requires disregarding the temporal aspect, i.e., ignoring the temporal correlations in the data. However, temporal trajectories can reveal a patient's underlying pathophysiological evolution, and modeling them can help to better estimate missing values. Because both cross-sectional information and temporal aspects can impact the estimation of missing measurements, we explore mixture models composed of several base models, each through either a cross-sectional or a temporal view. We introduce these base models in Section III-B.
Symbol  Definition

$\mathcal{X}$, $\mathcal{T}$  Measurement and time tensor
$\mathbf{x}_{p,v,\bar t}$  Measurements in fiber $\mathcal{X}_{p,v,:}$ excluding $x_{p,v,t}$
$\mathbf{t}_{p,v,\bar t}$  Times in fiber $\mathcal{T}_{p,v,:}$ excluding $t_{p,v,t}$
$M_{v,t}$  Mixture model for variable $v$ and time index $t$
$\mathbf{s}_{p,v,t}$  Concatenation of $\mathbf{x}_{p,\bar v,t}$ and $\mathbf{x}_{p,v,\bar t}$
$K(\cdot,\cdot)$  Covariance function over time pairs
$\mathcal{N}_k$  The $k$th prior multivariate normal distribution in $M_{v,t}$
$\mu_g$, $\sigma_g^2$  Predictive mean and variance of a GP model
$\Theta_{v,t}$  The set of all trainable parameters of $M_{v,t}$
$I$  The identity matrix
In our imputation framework, a mixture model is trained for each variable and time index. We use $M_{v,t}$ to denote the mixture model that imputes missing values of variable $v$ at time index $t$. The missing values in the fiber $\mathcal{X}_{:,v,t}$ are imputed by the optimized $M_{v,t}$. In each iteration of the algorithm, it is assumed that all other values are known and only $\mathcal{X}_{:,v,t}$ is imputed. This is wrapped in an outer loop. We call this procedure simple-pass tensor imputation, i.e., one pass through all $(v,t)$ pairs. Since several simple-pass tensor imputations are conducted, our approach can also be considered iterative imputation [19], which in turn can be regarded as a sampling-based approach in which a Gibbs sampler is used to approximate convergence. The convergence of iterative imputation methods can be quite fast, within a few iterations [7].
In detail, iterative imputation approaches start by replacing all missing data with simple guesses; we fill in all missing values with initial estimates drawn at random from the observed values. This procedure is called the initial imputation. Then we perform iterative tensor imputation on each copy of the tensor separately. The training procedure and the imputation of fibers are introduced in Sections III-C and III-D. We also rely on the concept of multiple imputation, where several iterative imputations are performed and the imputed values are averaged at the end. Each iterative imputation starts from a different initially imputed tensor and/or uses a different order of the $(v,t)$ pairs.
In summary, the algorithm creates several copies of $\mathcal{X}$, each filled with different random initial estimates. For each copy, we then perform simple-pass tensor imputations. Each simple pass loops over all $(v,t)$ pairs and uses $M_{v,t}$ to adjust $\mathcal{X}_{:,v,t}$. At the end of the imputations, the imputed values are averaged across all copies to yield the final imputed tensor.
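The loop structure above can be sketched as follows; `fit_and_impute` stands in for training and applying the per-(variable, time) model $M_{v,t}$, and all names are illustrative rather than the paper's:

```python
import copy
import random

def multiple_iterative_imputation(tensor, fit_and_impute,
                                  n_copies=3, n_passes=5, seed=0):
    """tensor[p][v][t] holds a float or None (missing).
    fit_and_impute(work, v, t) is assumed to train a model for (v, t)
    on the current tensor and return a list of P predictions."""
    rng = random.Random(seed)
    P, V, T = len(tensor), len(tensor[0]), len(tensor[0][0])
    missing = {(p, v, t) for p in range(P) for v in range(V)
               for t in range(T) if tensor[p][v][t] is None}
    copies = []
    for _ in range(n_copies):
        work = copy.deepcopy(tensor)
        # initial imputation: random draws from the observed values of
        # fiber (v, t); all-missing series are assumed excluded upstream
        for v in range(V):
            for t in range(T):
                observed = [tensor[p][v][t] for p in range(P)
                            if tensor[p][v][t] is not None]
                for p in range(P):
                    if (p, v, t) in missing:
                        work[p][v][t] = rng.choice(observed)
        # simple-pass tensor imputations, each with its own (v, t) order
        for _ in range(n_passes):
            order = [(v, t) for v in range(V) for t in range(T)]
            rng.shuffle(order)
            for v, t in order:
                preds = fit_and_impute(work, v, t)
                for p in range(P):
                    if (p, v, t) in missing:
                        work[p][v][t] = preds[p]
        copies.append(work)
    # average the imputed values across copies for the final tensor
    final = copy.deepcopy(tensor)
    for p, v, t in missing:
        final[p][v][t] = sum(c[p][v][t] for c in copies) / n_copies
    return final
```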
The whole imputation process thus involves one imputation model per variable and time index. We next focus first on the base models behind $M_{v,t}$ and then on the actual mixture model.
III-B Base Models
Our mixture models are composed of three components that are derived from two base models, linear regression and Gaussian processes. One component consists of GP models, and the other two components are linear models through two different views of the measurement tensor. Through a cross-sectional view, the tensor can be considered a vector of patient-by-variable matrices at different time indices. Through a temporal view, we can view the tensor as a vector of patient-by-time matrices for different variables.
III-B1 Linear model through the cross-sectional view
We can view the measurement tensor as a vector of patient-by-variable matrices. On the slice $\mathcal{X}_{:,:,t}$, we use a linear regression model to fit the target variable $v$ as a function of the other variables. The target values are modeled as

$x_{p,v,t} = \mathbf{x}_{p,\bar v,t}^{\top} \boldsymbol{\beta}_{v,t} + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma_{v,t}^{2}),$  (1)

where $\mathbf{x}_{p,\bar v,t}$ denotes the measurements of all variables other than $v$ for patient $p$ at time index $t$, $\boldsymbol{\beta}_{v,t}$ is the column vector of coefficients and $\sigma_{v,t}$ is the standard deviation of the error $\varepsilon$ of the regression model through the cross-sectional view for variable $v$ and time index $t$.
The likelihood distribution of $x_{p,v,t}$ is then given by

$p(x_{p,v,t} \mid \mathbf{x}_{p,\bar v,t}) = \mathcal{N}\big(x_{p,v,t} \mid \mathbf{x}_{p,\bar v,t}^{\top} \boldsymbol{\beta}_{v,t},\, \sigma_{v,t}^{2}\big).$  (2)

The training data consist of the observed target values $\{x_{p,v,t}\}_{p \in P_{o}}$ and the input data $\{\mathbf{x}_{p,\bar v,t}\}_{p \in P_{o}}$, where $P_{o}$ is the training patient set, which includes $p$ only if $x_{p,v,t}$ is observed.
III-B2 Linear model through the temporal view
In addition to the cross-sectional view, we can also view the measurement tensor as a vector of patient-by-time matrices. On the matrix $\mathcal{X}_{:,v,:}$, we use linear regression to model the measurements at time index $t$ against those at the other indices. The target values are modeled as

$x_{p,v,t} = \mathbf{x}_{p,v,\bar t}^{\top} \boldsymbol{\gamma}_{v,t} + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \tilde{\sigma}_{v,t}^{2}),$  (3)

where $\boldsymbol{\gamma}_{v,t}$ is the column vector of coefficients and $\tilde{\sigma}_{v,t}$ is the standard deviation of the error $\epsilon$ of the linear regression model through the temporal view for variable $v$ and time index $t$. The likelihood distribution of $x_{p,v,t}$ is given by

$p(x_{p,v,t} \mid \mathbf{x}_{p,v,\bar t}) = \mathcal{N}\big(x_{p,v,t} \mid \mathbf{x}_{p,v,\bar t}^{\top} \boldsymbol{\gamma}_{v,t},\, \tilde{\sigma}_{v,t}^{2}\big).$  (4)

The training data consist of the observed target values $\{x_{p,v,t}\}_{p \in P_{o}}$ and the input data $\{\mathbf{x}_{p,v,\bar t}\}_{p \in P_{o}}$.
III-B3 Gaussian processes through the temporal view
Gaussian processes are commonly used to capture the trajectories of variables, and are thus used in our mixture model to capture temporal correlations. Through the same temporal view as introduced above, on the matrix $\mathcal{X}_{:,v,:}$, we fit a GP to the time series of each patient. The target value is modeled as

$x_{p,v,t} = m_{v,t} + g\big(t_{p,v,t}\big),$  (5)

where $m_{v,t}$ is the overall mean of the model and $g$ is a Gaussian process with mean $0$ and a covariance matrix $K$ over time pairs $(t_{p,v,t}, t_{p,v,t'})$. Then the likelihood distribution of $x_{p,v,t}$ is written as

$p\big(x_{p,v,t} \mid \mathbf{x}_{p,v,\bar t}, \mathbf{t}_{p,v,\bar t}; \theta_{v,t}\big) = \mathcal{N}\big(x_{p,v,t} \mid \mu_{g},\, \sigma_{g}^{2}\big),$  (6)

where $\theta_{v,t}$ are the kernel parameters of the GP models, and the predictive mean and variance are given by $\mu_{g}$ and $\sigma_{g}^{2}$; see more details in Appendix B. For a given $v$ and $t$, the GP models of all patients share the same kernel parameters $\theta_{v,t}$.
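For concreteness, a self-contained sketch of single-series GP prediction with a squared-exponential kernel follows; the hyperparameter names and values are our assumptions, not the paper's fitted parameters:

```python
import numpy as np

def gp_predict(t_obs, y_obs, t_new, mean=0.0, lengthscale=1.0,
               signal_var=1.0, noise_var=1e-2):
    """GP regression on one time series with a squared-exponential
    kernel; returns the predictive mean and variance at t_new."""
    t_obs = np.asarray(t_obs, float)
    t_new = np.asarray(t_new, float)
    y = np.asarray(y_obs, float) - mean  # center on the overall mean

    def kern(a, b):
        d = a[:, None] - b[None, :]
        return signal_var * np.exp(-0.5 * (d / lengthscale) ** 2)

    K = kern(t_obs, t_obs) + noise_var * np.eye(len(t_obs))
    Ks = kern(t_new, t_obs)
    # standard GP predictive equations
    mu = mean + Ks @ np.linalg.solve(K, y)
    cov = kern(t_new, t_new) - Ks @ np.linalg.solve(K, Ks.T)
    return mu, np.diag(cov)
```

Note how the predictive variance grows away from the observed time points, which is exactly why a fixed mixing weight for the GP component is problematic across patients.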
III-C The Mixture Model
Given the likelihood distributions of all three components, we model the joint mixture distribution for variable $v$ and time index $t$ in the following way:

$p\big(x_{p,v,t}, \mathbf{s}_{p,v,t}\big) = \sum_{k=1}^{3} \pi_{k}\, p_{k}\big(x_{p,v,t}, \mathbf{s}_{p,v,t}\big),$  (7)

where we define $\mathbf{s}_{p,v,t}$ as the concatenation of $\mathbf{x}_{p,\bar v,t}$ and $\mathbf{x}_{p,v,\bar t}$, $p_{k}$ as the joint distribution of the $k$th component, and $\pi_{k}$ as the mixing weight of the $k$th component. This model can be interpreted as the joint distribution between the observed data $\mathbf{s}_{p,v,t}$ and the missing value $x_{p,v,t}$, which consists of a mixture of three distributions. The first one is modeled as

$p_{1}\big(x_{p,v,t}, \mathbf{s}_{p,v,t}\big) = p\big(x_{p,v,t} \mid \mathbf{x}_{p,\bar v,t}\big)\, \mathcal{N}_{1}\big(\mathbf{s}_{p,v,t}\big),$  (8)

where the conditional factor is the likelihood (2) of the cross-sectional linear component and $\mathcal{N}_{1}$ is the first prior multivariate normal distribution; the remaining two components follow the same logic. By marginalizing over $x_{p,v,t}$, the prior probability distribution of $\mathbf{s}_{p,v,t}$ is written as

$p\big(\mathbf{s}_{p,v,t}\big) = \sum_{k=1}^{3} \pi_{k}\, \mathcal{N}_{k}\big(\mathbf{s}_{p,v,t}\big),$  (9)

which is a mixture of Gaussians. It also follows that

$p\big(x_{p,v,t} \mid \mathbf{s}_{p,v,t}\big) = \sum_{k=1}^{3} \pi_{p,k,v,t}\, p_{k}\big(x_{p,v,t} \mid \mathbf{s}_{p,v,t}\big),$  (10)

where we define $\pi_{p,k,v,t} = \pi_{k}\, \mathcal{N}_{k}(\mathbf{s}_{p,v,t}) \big/ \sum_{j=1}^{3} \pi_{j}\, \mathcal{N}_{j}(\mathbf{s}_{p,v,t})$.
We train our mixture model on the observed target values by maximizing the log likelihood of the joint mixture distribution,

$\max_{\Theta_{v,t}} \sum_{p \in P_{o}} \log p\big(x_{p,v,t}, \mathbf{s}_{p,v,t}\big),$

where $\mathbf{s}_{p,v,t}$ is the training input data, defined as the concatenation of $\mathbf{x}_{p,\bar v,t}$ and $\mathbf{x}_{p,v,\bar t}$, and $\Theta_{v,t}$ is the set of all trainable parameters.

After training, the missing values are imputed using individualized (per-patient) mixing weights derived from the conditional distribution. The missing value $x_{p,v,t}$ is imputed by

$\hat{x}_{p,v,t} = \sum_{k=1}^{3} \pi_{p,k,v,t}\, \hat{x}_{p,v,t}^{(k)},$

where $\hat{x}_{p,v,t}^{(k)}$ is the prediction of the $k$th component and the individualized mixing weight $\pi_{p,k,v,t}$ of the $k$th component for patient $p$, variable $v$ and time index $t$ is defined as

$\pi_{p,k,v,t} = \frac{\pi_{k}\, \mathcal{N}_{k}\big(\mathbf{s}_{p,v,t}\big)}{\sum_{j=1}^{3} \pi_{j}\, \mathcal{N}_{j}\big(\mathbf{s}_{p,v,t}\big)},$  (11)

where $\mathbf{s}_{p,v,t}$ is the observed data, defined as the concatenation of $\mathbf{x}_{p,\bar v,t}$ and $\mathbf{x}_{p,v,\bar t}$.
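The individualized weighting is just a per-patient reweighting of the trained priors by how well each component's Gaussian explains that patient's observed data, followed by normalization; as a sketch (function and argument names are ours):

```python
def individualized_weights(priors, likelihoods):
    """Per-patient mixing weights: each trained prior weight is
    multiplied by the density the corresponding component assigns to
    the patient's observed data, then the products are normalized."""
    weighted = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(weighted)
    return [w / total for w in weighted]
```

A patient whose observed data are explained far better by the GP component than by the linear components thus gets a large GP weight, even if the trained prior weights are uniform.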
III-D Mixture Parameter Estimation
Let $\ell(\Theta_{v,t})$ be the log likelihood of the joint mixture distribution. Explicitly maximizing $\ell$ is hard. Instead, we use the EM algorithm to repeatedly construct a lower bound on $\ell$ and then optimize that lower bound. We first define a latent indicator variable $z_{p}$ that specifies which mixing component each data point comes from. Then we use Jensen's inequality to obtain the lower bound, which is given by

$\ell\big(\Theta_{v,t}\big) \geq \sum_{p \in P_{o}} \sum_{k=1}^{3} r_{p,k} \log \frac{\pi_{k}\, p_{k}\big(x_{p,v,t}, \mathbf{s}_{p,v,t}\big)}{r_{p,k}},$  (12)

where

$r_{p,k} = \frac{\pi_{k}\, p_{k}\big(x_{p,v,t}, \mathbf{s}_{p,v,t}\big)}{\sum_{j=1}^{3} \pi_{j}\, p_{j}\big(x_{p,v,t}, \mathbf{s}_{p,v,t}\big)}.$  (13)

In (12) and (13), the marginal distribution over $z_{p}$ is specified by the mixing coefficients $\pi_{k}$. We can view $r_{p,k}$ as the responsibility that component $k$ of the mixture model takes to "explain" data point $p$. We use the standard EM algorithm to maximize the lower bound; see more details about the estimation of the parameters in Appendix A.
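For intuition, the E-step/M-step cycle can be illustrated on a plain two-component 1-D Gaussian mixture (our simplification; the paper's M-step additionally refits the regression and GP components):

```python
import math

def em_gaussian_mixture(xs, n_iters=50):
    """EM for a two-component 1-D Gaussian mixture: the E-step computes
    responsibilities (as in Eq. (13)); the M-step applies the
    closed-form weighted updates that maximize the lower bound."""
    mu = [min(xs), max(xs)]   # crude initialization from the data range
    var = [1.0, 1.0]
    pi = [0.5, 0.5]
    for _ in range(n_iters):
        # E-step: responsibility of each component for each point
        resp = []
        for x in xs:
            dens = [pi[k] / math.sqrt(2 * math.pi * var[k])
                    * math.exp(-(x - mu[k]) ** 2 / (2 * var[k]))
                    for k in range(2)]
            s = sum(dens)
            resp.append([d / s for d in dens])
        # M-step: responsibility-weighted mixing weights, means, variances
        for k in range(2):
            nk = sum(r[k] for r in resp)
            pi[k] = nk / len(xs)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var[k] = max(sum(r[k] * (x - mu[k]) ** 2
                             for r, x in zip(resp, xs)) / nk, 1e-6)
    return pi, mu, var
```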
TABLE II: Overall MASE scores of all imputation models on the four datasets.

Dataset  GP  MTGP  M-RNN  GMM  MICE  3DMICE  MixMI-LL  MixMI
Real-world MIMIC (α = 0)  0.13072  0.12599  0.12512  0.11741  0.11763  0.11186  0.09304  0.09285
Synthetic MIMIC (α = 0.5)  0.10466  –  0.12320  0.11455  0.11573  0.09087  0.08612  0.08448
Synthetic MIMIC (α = 1)  0.09220  –  0.12201  0.11436  0.11561  0.07715  0.08427  0.07538
NMEDW  0.18353  0.17370  0.14607  0.13593  0.13718  0.13624  0.11600  0.11589 
MTGP cannot have multiple inputs and is thus not applicable to the synthetic datasets.
TABLE III: Variable-wise MASE comparison of all imputation models on the real-world MIMIC (α = 0) dataset.

Variable  GP  MTGP  M-RNN  GMM  MICE  3DMICE  MixMI-LL  MixMI

Chloride  0.12993  0.12549  0.10865  0.10650  0.10575  0.10836  0.08664  0.08603 
Potassium  0.11533  0.11256  0.11222  0.10963  0.10997  0.10822  0.09453  0.09442 
Bicarbonate  0.13196  0.12905  0.12371  0.12231  0.12275  0.11984  0.10302  0.10254 
Sodium  0.12525  0.11939  0.11557  0.10159  0.10138  0.10727  0.08787  0.08807 
Hematocrit  0.11436  0.10780  0.09189  0.06518  0.06558  0.06726  0.05486  0.05482 
Hemoglobin  0.14168  0.12997  0.06556  0.05774  0.05772  0.06301  0.05117  0.05103 
MCV  0.14215  0.13732  0.13798  0.13391  0.13474  0.13340  0.11634  0.11657 
Platelets  0.13855  0.13087  0.14369  0.14203  0.14236  0.12815  0.10090  0.10070 
WBC count  0.13963  0.13614  0.14583  0.14026  0.14068  0.13060  0.10934  0.10913 
RDW  0.14592  0.13668  0.15938  0.15778  0.15836  0.13897  0.11340  0.11340 
Blood urea nitrogen (BUN)  0.12358  0.12720  0.16345  0.15160  0.15189  0.11814  0.09479  0.09410 
Creatinine  0.13341  0.12803  0.14904  0.13211  0.13212  0.12217  0.10067  0.10014 
Glucose  0.12491  0.12420  0.11677  0.11771  0.11794  0.11921  0.10493  0.10501 
III-E The Imputation Model
The proposed mixture model provides flexibility in changing the base models and can be solved under various assumptions. For example, the full mixture model consists of all three base models introduced earlier. However, under the assumption of only linear correlations within time series, a mixture model composed of only the two linear components (denoted MixMI-LL) can be used. Although the choice of mixture can be determined empirically or based on expert knowledge of the data, we propose an imputation model (denoted MixMI) that chooses the type of mixture automatically.
In MixMI, for each variable and time index, we train both MixMI-LL and the full mixture model, and then select the one with the lower training error as the final mixture model used to perform imputation at inference. We use the absolute training error, instead of the likelihood, as the selection criterion because the likelihood of a mixture model with two linear components is not on the same scale as that of the full mixture model.
IV. Experimental Setup
IV-A Real-World Datasets
We collect two real-world datasets from the Medical Information Mart for Intensive Care (MIMIC-III) database [27] and the Northwestern Medicine Enterprise Data Warehouse (NMEDW). Each dataset contains inpatient results from 13 laboratory tests. These tests are quantitative, frequently measured on hospital inpatients and commonly used in downstream classification tasks [30, 9]. They are the same tests as those used in the imputation study of [38]. We organize the data by unique admissions and distinguish multiple admissions of the same patient. Each admission consists of time series of the 13 laboratory tests. Although our model is evaluated only on continuous variables, it is also applicable to categorical variables with a simple extension (e.g. through predictive mean matching [33]).

In both the MIMIC-III and NMEDW datasets, the length of the time series varies across admissions. To apply our imputation model to these datasets, we truncate the time series so that they have the same length, set to the average number of time points across all admissions. Before truncating, the average number of time points in the MIMIC-III dataset is 11. We first exclude admissions that have fewer than 11 time points, and then truncate the time series by removing measurements taken after the 11th time point. We also exclude admissions that contain a time series with no observed values. Our MIMIC-III dataset includes 26,154 unique admissions, and the missing rate is about 28.71%. The same data collection procedure is applied to the NMEDW dataset (approved by the Northwestern IRB), where we end up with 13,892 unique admissions with 7 time points each. The missing rate of the NMEDW dataset is 24.22%.
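The cohort construction steps above can be sketched as follows (the admission data structure is our own assumption):

```python
def build_cohort(admissions, target_len):
    """Keep admissions with at least target_len time points, truncate
    the rest to the first target_len points, and drop admissions in
    which some test has no observed value at all.
    admissions maps an admission id to {test_name: [value or None, ...]}."""
    cohort = {}
    for adm_id, series in admissions.items():
        if any(len(vals) < target_len for vals in series.values()):
            continue  # too few time points
        truncated = {test: vals[:target_len] for test, vals in series.items()}
        if any(all(v is None for v in vals) for vals in truncated.values()):
            continue  # some test series has no observed value
        cohort[adm_id] = truncated
    return cohort
```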
IV-B Synthetic Datasets
We create synthetic datasets to explore the properties of the data that might benefit our model and/or the comparison models. In the synthetic datasets, we augment the correlation between measurements and times. We do not augment correlations by imposing strong constraints on the time series, under which closer measurements would have closer values. Instead, we generate synthetic times by altering the real times so that the constraints in the synthetic data are "slightly" stronger than in the real-world data. We also introduce a scaling factor α to control the strength of the constraints in the synthetic data.
The synthetic datasets are generated from the real-world MIMIC-III dataset. We move two consecutive times of a time series closer together if the relative difference of the two consecutive measurements is smaller than the relative difference of the two consecutive times. The relative differences $r^{x}_{i}$ and $r^{t}_{i}$ of a time series are computed from the consecutive measurements $(x_{i}, x_{i+1})$ and the consecutive times $(t_{i}, t_{i+1})$, respectively. The scaling factor α controls how much we move the times. If α = 0, we do not move the times; in other words, the synthetic dataset at α = 0 is the same as the real-world MIMIC-III dataset. As α increases, stronger constraints are introduced into the synthetic data. The synthetic times of a time series are generated by shrinking the gap between consecutive times whenever $r^{x}_{i} < r^{t}_{i}$, with the amount of shrinkage growing with α.
If a time series has missing values, we first calculate the synthetic times for the observed measurements. Then we perform a linear interpolation between the real and synthetic times of the observed measurements to generate synthetic times for the missing measurements.
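Since the exact update equations are given above only in outline, the following sketch shows one plausible instantiation of the rule; the shrinkage formula here is our assumption, not the paper's. A time gap is shrunk, by an amount growing with α, whenever the value changes relatively less than the time does:

```python
def synthetic_times(times, values, alpha):
    """Shrink the gap between consecutive times, by an amount growing
    with alpha, whenever the relative change in value is smaller than
    the relative change in time. alpha = 0 leaves times unchanged.
    The shrinkage formula is illustrative, not the paper's."""
    new_times = [times[0]]
    for i in range(1, len(times)):
        dt = times[i] - times[i - 1]
        # relative differences of consecutive measurements and times
        rel_v = abs(values[i] - values[i - 1]) / (abs(values[i - 1]) or 1.0)
        rel_t = dt / (times[i - 1] or 1.0)
        if rel_v < rel_t:
            dt *= 1.0 - alpha * (1.0 - rel_v / rel_t)  # shrink the gap
        new_times.append(new_times[-1] + dt)
    return new_times
```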
IV-C Evaluation of Imputation Quality
We randomly mask 20% of the observed measurements in a dataset as missing and treat the masked values as the test set. The remaining observed values are used for training. We impute the originally missing and the masked values together, and compare the imputed values with the ground truth of the masked data to evaluate imputation performance. The mean absolute scaled error (MASE) [26] is used to measure the quality of imputation on the test set. MASE is a scale-free measure of prediction accuracy and is recommended by Hyndman et al. [26, 16] for measuring the accuracy of predictions for time series. We calculate the MASE for all variables and take a weighted average, according to the number of masked values of each variable, to get an overall MASE per dataset.
Let $M_{p,v}$ be the set, of cardinality $|M_{p,v}|$, of all time indices that have been masked for patient $p$ and variable $v$. Also let $(q_{1}, \ldots, q_{n_{p,v}})$ be the sequence, of length $n_{p,v}$, of all observed values for patient $p$ and variable $v$, and let $\hat{x}_{p,v,t}$ represent the imputed value. The MASE for variable $v$ is defined as

$\mathrm{MASE}_{v} = \frac{1}{\sum_{p} |M_{p,v}|} \sum_{p} \sum_{t \in M_{p,v}} \frac{\big|\hat{x}_{p,v,t} - x_{p,v,t}\big|}{\frac{1}{n_{p,v} - 1} \sum_{j=2}^{n_{p,v}} \big|q_{j} - q_{j-1}\big|}.$
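In code, the per-series building block of this metric (imputation MAE scaled by the in-sample naive one-step MAE of the observed series, following Hyndman's definition) might look like:

```python
def mase(masked_true, masked_pred, observed_series):
    """MASE for one (patient, variable) series: mean absolute
    imputation error divided by the mean absolute error of the naive
    previous-value forecast on the observed series."""
    mae = sum(abs(t - p) for t, p in zip(masked_true, masked_pred)) \
          / len(masked_true)
    naive = sum(abs(a - b) for a, b in
                zip(observed_series[1:], observed_series[:-1])) \
            / (len(observed_series) - 1)
    return mae / naive
```

The per-variable score then averages these ratios over patients, weighted by the number of masked values.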
To show the effectiveness of our imputation model MixMI and its variant MixMI-LL, we compare their MASE scores with those of six other imputation methods: (a) MICE [7] with 100 imputations, where the average of all imputations is used for evaluation; (b) pure Gaussian processes, where a GP model is fitted to the observed data of each time series using GPfit [39] in R and missing values are replaced with the predictions of the fitted GP models; (c) 3DMICE, a state-of-the-art imputation model [38], for which we obtain the authors' code and adapt it to our tensor representation; (d) multi-task Gaussian process prediction (MTGP) [5]; (e) a Gaussian mixture model for imputation (GMM) [14, 40]; and (f) M-RNN, a state-of-the-art RNN-based imputation model [68].
To tune the hyperparameters, if any, of these models, we mask out 20% of the observed measurements in the training set as a validation set and tune the hyperparameters on it. We introduce the number of iterative imputations and the number of simple passes per imputation as hyperparameters of our imputation model. In our experiments, all mixture models start with the same mixing weights, i.e., $\pi_{k} = 1/3$ for $k = 1, 2, 3$. The other trainable parameters are initialized directly from the data after the initial imputation and do not require manual specification.

We also evaluate our imputation model under a classification task on our MIMIC dataset and compare it with GRU-D [9], a state-of-the-art RNN model for classification with multivariable time series, which does not require a separate data imputation stage. We build an RNN model with the standard Gated Recurrent Unit as our classification model (GRU-MixMI), trained on data imputed by MixMI. GRU-D is trained on data without imputation. We predict whether a patient dies within 30 days after admission, instead of 48 hours as in [9], because of severe class imbalance in our cohort, where only 0.57% of patients have positive labels within 48 hours.

V. Results
V-A Performance Comparison
V-A1 Imputation task
Table II and Fig. 2 compare all imputation models on the four datasets. Table II shows the overall MASE score of each imputation model on each dataset. Fig. 2 compares all imputation models against 3DMICE by showing the percentage deviation of their MASE scores from that of 3DMICE; we select 3DMICE since it is a state-of-the-art benchmark model. We observe that MixMI outperforms all comparison models on all four datasets and is significantly better than the second-best model on each of them (p = .001, permutation test with 1000 replicates).
Table III shows a variable-wise comparison of the imputation models on the real-world MIMIC (α = 0) dataset. Our two imputation models outperform all comparison models on all variables, and MixMI is better than MixMI-LL on most variables. All models except GP and MTGP achieve a much lower error on Hematocrit and Hemoglobin than on the other variables. The reason is that these two variables are highly correlated: methods that capture the correlation between variables can reasonably infer missing values of Hematocrit from observed measurements of Hemoglobin, and vice versa. Compared to MICE, MixMI achieves even lower errors on these two variables, which indicates that the temporal correlations captured by our model help to better estimate missing values even when a more dominant cross-sectional correlation exists.
As shown in Table II, all models except MTGP benefit from increasing α, the scaling factor used to generate the synthetic data. The reason is that all these models take temporal aspects into account, and the measurements in the synthetic time series have stronger temporal correlations as α increases. MICE and GMM also benefit from the temporal correlations because we include time as a feature in addition to the variable features. We also tried excluding the times; however, experiments show that these two models perform better when the times are included. MixMI-LL is outperformed by 3DMICE when α increases to 1, while MixMI is robust to the variation of α in our current experimental settings.
We compare the running times of all models on the real-world MIMIC dataset. MixMI-LL (4.2 hours), GP (1.1 hours), MTGP (1.2 hours), GMM (2.2 hours) and M-RNN (0.9 hours) are the fastest models; 3DMICE (156.1 hours) is the slowest; MICE (77.5 hours) and MixMI (109.5 hours) are in the middle.
V-A2 Classification task
Our classification model GRU-MixMI is compared with GRU-D on data with different splits and different seeds. First, we make two copies of the original data and split each copy into the same training and test sets. One copy is then imputed by MixMI and used to train GRU-MixMI; the other is used by GRU-D directly, without imputation. Both models are trained and tested 5 times with different random seeds for the algorithms, and the area under the ROC curve (AUC) scores on the test sets are reported. This procedure is repeated 3 times with different splits. On average, GRU-MixMI ties with GRU-D in terms of AUC score. However, compared with models that perform imputation and prediction jointly, our imputation model, as a pure imputation method, leaves more room for other supervised and unsupervised tasks, for modeling choices, and for effective feature engineering.
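The AUC scores reported above can be computed directly from predicted risk scores via the rank (Mann-Whitney) formulation; a self-contained sketch:

```python
def auc_score(labels, scores):
    """AUC as the probability that a randomly chosen positive case
    receives a higher score than a randomly chosen negative case,
    counting ties as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Averaging this quantity over the 5 seeds and 3 splits yields the reported comparison.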
V-B Component Weights
The key information our imputation model uses to estimate missing values is the component weight, which quantifies the interaction of cross-sectional and temporal correlations. We study MixMI on the real-world MIMIC dataset and visualize how this interaction is captured by our model. Fig. 3 compares the component weights for all variables across all patients and times. When imputing Hematocrit and Hemoglobin, our model relies mostly (79.6% and 84.8%) on the linear model through the cross-sectional view because of the strong correlation between these two variables. Interestingly, most values of the GP component weight are lower than 10%, which indicates that the temporal correlations in these variables are not strong in our dataset. However, our model is still able to detect the weak temporal correlations and utilize them to improve imputation. Additionally, we observe that when predicting missing values at the beginning and the end of a time series, our model reasonably uses a lower GP weight than in the middle: on average, the GP weight at the first and last time indices is 13.9% lower than at indices in the middle. This is due to the use of GPs, which usually produce less confident estimates at the end points of a series. Furthermore, we observe that the GP weight increases as we impose stronger temporal correlations in the synthetic datasets, which further validates the ability of our model to capture this interaction.
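Concretely, each imputed value is a weighted blend of the two component estimates. The sketch below uses illustrative names and blends point estimates only; the actual model mixes full predictive distributions:

```python
def mix_estimate(linear_pred, gp_pred, gp_weight):
    """Blend the cross-sectional (linear) and temporal (GP) estimates
    for one missing entry. gp_weight is the mixing weight of the GP
    component for this patient and time index."""
    return (1.0 - gp_weight) * linear_pred + gp_weight * gp_pred
```

With a GP weight below 10%, as observed for most variables, the imputed value stays close to the linear (cross-sectional) estimate while still being nudged by the temporal one.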
V-C Individualized Weights
By introducing individualized (per-patient) mixing weights, defined in (11), we improve the MASE score from 0.08351 to 0.07538, an improvement of 9.73% over the model in which each mixture component has a fixed weight for all patient cases. The reason individualized weights outperform fixed weights in our model may be that they better approximate the responsibilities defined in (13).
In training, we can optimize the responsibility a component should take to “explain” an observed target value. However, when making inference, the responsibility each component should take to “explain” a missing value is unknown, because responsibilities depend on observed target values, according to (13). We therefore use the individualized mixing weights as an approximation of the responsibilities during inference. As defined in (11), the individualized mixing weights depend only on the inputs, so we can compute them when making inferences on the test set.
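For intuition, in a plain Gaussian mixture the responsibility of each component for an observed target is the weighted component density, normalized over all components; this is why it cannot be evaluated when the target is missing. A generic sketch (the paper's components are a linear model and a GP rather than fixed Gaussians):

```python
import math

def responsibilities(y, weights, means, variances):
    """Posterior responsibility of each mixture component for an
    observed target y: the weighted Gaussian density of each
    component, normalized so responsibilities sum to one."""
    dens = [w * math.exp(-(y - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
            for w, m, v in zip(weights, means, variances)]
    total = sum(dens)
    return [d / total for d in dens]
```

The dependence on y is explicit here: without an observed target, only the input-dependent mixing weights remain available.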
In a standard mixture model, we could use the average responsibility of the k-th component across all training patients as the fixed weight the k-th component contributes when imputing missing values for all test patients. However, patient time series can be very different, and the confidence of the predictions made by the GP component can vary widely across patient cases. A fixed weight cannot reflect such variation in prediction confidence.
We view an individualized mixing weight as an approximation of how much responsibility a component should take to impute the missing value for a particular patient case; it is tailored to each patient. To visualize how individualized mixing weights help to produce better estimates, in Fig. 4 we plot the distribution of the individualized weights of the GP component in the training set and compare it with the distribution of the optimized responsibility values. The responsibility the GP component should take varies considerably across patient cases, especially on the synthetic dataset, which implies that giving patients individualized mixing weights is more reasonable than a single fixed weight. We also observe that the individualized mixing weights closely mimic the distribution of the optimized responsibilities on the training set. The improvement of our model on the test set attests that the individualized weights approximate the responsibilities better than fixed weights do.
VI Conclusions
We present and demonstrate MixMI, a mixture-based imputation model for multivariable clinical time series. Our model captures both cross-sectional and temporal correlations in time series. We integrate Gaussian processes into a mixture model and introduce individualized mixing weights to further improve imputation accuracy. We show that MixMI provides more accurate imputation than several state-of-the-art imputation models. Although in this work our model is tested on inpatient clinical data, the proposed idea is general and should apply to any multivariable time series in which cross-sectional and temporal correlations interact.
References
 [1] (1996) Full information estimation in the presence of incomplete data. In Advanced Structural Equation Modeling: Issues and Techniques, G. A. Marcoulides and R. E. Schumacker (Eds.), Vol. 243, pp. 277. Cited by: §I.
 [2] (2014) Fast multivariate spatiotemporal analysis via low rank tensor learning. In Advances in Neural Information Processing Systems, pp. 3491–3499. Cited by: §II.
 [3] (2017) Handling missing data in multivariate time series using a vector autoregressive model-imputation (VARIM) algorithm. Neurocomputing. Cited by: §II.
 [4] (1996) Recurrent neural networks for missing or asynchronous data. In Advances in neural information processing systems, pp. 395–401. Cited by: §II.
 [5] (2008) Multi-task Gaussian process prediction. In Advances in Neural Information Processing Systems, pp. 153–160. Cited by: §II, §IVC.
 [6] (1994) Efficacy of the indirect approach for estimating structural equation models with missing data: a comparison of five methods. Structural Equation Modeling: A Multidisciplinary Journal 1 (4), pp. 287–316. Cited by: §I.
 [7] (2011) mice: Multivariate imputation by chained equations in R. Journal of Statistical Software 45 (3). Cited by: §I, §I, §IIIA, §IVC.
 [8] (2015) Facets: fast comprehensive mining of coevolving high-order time series. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 79–88. Cited by: §II.
 [9] (2018) Recurrent neural networks for multivariate time series with missing values. Scientific reports 8 (1), pp. 6085. Cited by: §II, §IVA, §IVC.
 [10] (2014) A functional data approach to missing value imputation and outlier detection for traffic flow data. Transportmetrica B: Transport Dynamics 2 (2), pp. 106–129. Cited by: §II.
 [11] (2016) Doctor AI: predicting clinical events via recurrent neural networks. In Machine Learning for Healthcare Conference, pp. 301–318. Cited by: §II.
 [12] (2012) Efficient EM training of Gaussian mixtures with missing data. arXiv preprint arXiv:1209.0521. Cited by: §II.
 [13] (2016) Multiple imputation for general missing data patterns in the presence of high-dimensional data. Scientific Reports 6, pp. 21689. Cited by: §I.
 [14] (2007) Imputation through finite Gaussian mixture models. Computational Statistics & Data Analysis 51 (11), pp. 5305–5316. Cited by: §II, §IVC.
 [15] (2015) Tucker factorization with missing data with application to lowrank tensor completion. Multidimensional Systems and Signal Processing 26 (3), pp. 677–692. Cited by: §II.
 [16] (2016) A note on the mean absolute scaled error. International Journal of Forecasting 32 (1), pp. 20–22. Cited by: §IVC.
 [17] (2017) Learning to detect sepsis with a multitask Gaussian process RNN classifier. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 1174–1182. Cited by: §II.
 [18] (2016) Uncovering the spatiotemporal dynamics of memes in the presence of incomplete information. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, pp. 1493–1502. Cited by: §II.
 [19] (2004) Parameterization and bayesian modeling. Journal of the American Statistical Association 99 (466), pp. 537–545. Cited by: §IIIA.
 [20] (2009) Missing data analysis: making it work in the real world. Annual Review of Psychology 60, pp. 549–576. Cited by: §I.
 [21] (2007) Multiple imputation for the comparison of two screening tests in twophase alzheimer studies. Statistics in Medicine 26 (11), pp. 2370–2388. Cited by: §I.
 [22] (2011) A functional multiple imputation approach to incomplete longitudinal data. Statistics in Medicine 30 (10), pp. 1137–1156. Cited by: §II.
 [23] (2012) MARSS: multivariate autoregressive statespace models for analyzing timeseries data. R Journal 4 (1). Cited by: §II.
 [24] (2016) Multi-task Gaussian process for imputing missing data in multi-trait and multi-environment trials. Theoretical and Applied Genetics 129 (11), pp. 2101–2115. Cited by: §II.
 [25] (2006) Survival analysis using auxiliary variables via nonparametric multiple imputation. Statistics in Medicine 25 (20), pp. 3503–3517. Cited by: §I.
 [26] (2006) Another look at measures of forecast accuracy. International Journal of Forecasting 22 (4), pp. 679–688. Cited by: §IVC.
 [27] (2016) MIMIC-III, a freely accessible critical care database. Scientific Data 3, pp. 160035. Cited by: §IVA.
 [28] (2014) A bayesian approach to functional mixedeffects modeling for longitudinal data with binomial outcomes. Statistics in Medicine 33 (18), pp. 3130–3146. Cited by: §II.
 [29] (2015) Ten things we have to do to achieve precision medicine. Science 349 (6243), pp. 37–38. Cited by: §I.
 [30] (1993) A new simplified acute physiology score (SAPS II) based on a European/North American multicenter study. JAMA 270 (24), pp. 2957–2963. Cited by: §IVA.
 [31] (2016) Directly modeling missing data in sequences with RNNs: improved classification of clinical time series. In Machine Learning for Healthcare Conference, pp. 253–270. Cited by: §II.
 [32] (2004) Robust likelihoodbased analysis of multivariate data with missing values. Statistica Sinica, pp. 949–968. Cited by: §I.
 [33] (1988) Missing-data adjustments in large surveys. Journal of Business & Economic Statistics 6 (3), pp. 287–296. Cited by: §IVA.
 [34] (2000) Multiple imputation and posterior simulation for multivariate missing data in longitudinal studies. Biometrics 56 (4), pp. 1157–1163. Cited by: §II.
 [35] (2015) Trace norm regularized CANDECOMP/PARAFAC decomposition with missing data. IEEE Transactions on Cybernetics 45 (11), pp. 2437–2448. Cited by: §II.
 [36] (2012) Doubly robust nonparametric multiple imputation for ignorable missing data. Statistica Sinica 22, pp. 149. Cited by: §I.
 [37] (2016) Using machine learning to predict laboratory test results. American Journal of Clinical Pathology 145 (6), pp. 778–788. Cited by: §I.
 [38] (2017) 3D-MICE: integration of cross-sectional and longitudinal imputation for multi-analyte longitudinal clinical data. Journal of the American Medical Informatics Association 25 (6), pp. 645–653. Cited by: §II, §IVA, §IVC.
 [39] (2015) GPfit: an R package for fitting a Gaussian process model to deterministic simulator outputs. Journal of Statistical Software 64 (1), pp. 1–23. Cited by: §IVC.
 [40] (2012) Machine learning: a probabilistic perspective. MIT press. Cited by: §IVC.
 [41] (2002) Speech recognition with missing data using recurrent neural nets. In Advances in Neural Information Processing Systems, pp. 1189–1195. Cited by: §II.
 [42] (2010) A comparison of multiple imputation and fully augmented weighted estimators for cox regression with missing covariates. Statistics in Medicine 29 (25), pp. 2592–2604. Cited by: §I.
 [43] (2001) A multivariate technique for multiply imputing missing values using a sequence of regression models. Survey Methodology 27 (1), pp. 85–96. Cited by: §I, §I.
 [44] (1978) Multiple imputations in sample surveys: a phenomenological Bayesian approach to nonresponse. In Proceedings of the Survey Research Methods Section of the American Statistical Association, Vol. 1, pp. 20–34. Cited by: §I.
 [45] (1987) Multiple imputation for nonresponse in surveys. John Wiley & Sons. Cited by: §I.
 [46] (2002) Missing data: our view of the state of the art. Psychological Methods 7 (2), pp. 147. Cited by: §I.
 [47] (2002) Computational strategies for multivariate linear mixedeffects models with missing values. Journal of Computational and Graphical Statistics 11 (2), pp. 437–457. Cited by: §II.
 [48] (1997) Analysis of incomplete multivariate data. Chapman and Hall/CRC. Cited by: §I.
 [49] (1997) A randomeffects model for multiple characteristics with possibly missing data. Journal of the American Statistical Association 92 (438), pp. 775–779. Cited by: §II.
 [50] (2018) Multivariate data imputation using Gaussian mixture models. Spatial Statistics 27, pp. 74–90. Cited by: §II.
 [51] (2011) MissForest—non-parametric missing value imputation for mixed-type data. Bioinformatics 28 (1), pp. 112–118. Cited by: §I.
 [52] (2011) Multiple imputation with diagnostics (mi) in r: opening windows into the black box. Journal of Statistical Software 45 (2), pp. 1–31. Cited by: §I.
 [53] (2017) Autoregressive tensor factorization for spatiotemporal predictions. IEEE International Conference on Data Mining. Cited by: §II.
 [54] (2010) (Website). Cited by: §II.
 [55] (1998) A solution for missing data in recurrent neural networks with an application to blood glucose prediction. In Advances in Neural Information Processing Systems, pp. 971–977. Cited by: §II.
 [56] (2001) Mixtures of Gaussian processes. In Advances in Neural Information Processing Systems, pp. 654–660. Cited by: §II.
 [57] (2001) Missing value estimation methods for DNA microarrays. Bioinformatics 17 (6), pp. 520–525. Cited by: §I.
 [58] (1999) Multiple imputation of missing blood pressure covariates in survival analysis. Statistics in Medicine 18 (6), pp. 681–694. Cited by: §I.
 [59] (2006) Fully conditional specification in multivariate imputation. Journal of Statistical Computation and Simulation 76 (12), pp. 1049–1064. Cited by: §I.
 [60] (2018) Flexible imputation of missing data. Chapman and Hall/CRC. Cited by: §I.
 [61] (2013) Comparison of imputation methods for missing laboratory data in medicine. BMJ Open 3 (8), pp. e002847. Cited by: §I, §I.
 [62] (2017) Biases introduced by filtering electronic health records for patients with “complete data”. Journal of the American Medical Informatics Association 24 (6), pp. 1134–1141. Cited by: §I.
 [63] (2012) Computational medicine: translating models to clinical care. Science Translational Medicine 4 (158), pp. 158rv11–158rv11. Cited by: §I.
 [64] (2000) Longitudinal and multigroup modeling with missing data. In Modeling longitudinal and multiplegroup data: Practical issues, applied approaches, and specific examples, T. D. Little, K. U. Schnabel, and J. Baumert (Eds.), pp. 219–240. Cited by: §I.
 [65] (2018) Predicting icu readmission using grouped physiological and medication trends. Artificial Intelligence in Medicine. Cited by: §IIIA.
 [66] (2015) Missing value imputation based on Gaussian mixture model for the internet of things. Mathematical Problems in Engineering 2015. Cited by: §II.
 [67] (2018) Timeaware subgroup matrix decomposition: imputing missing data using forecasting events. In 2018 IEEE International Conference on Big Data, pp. 1524–1533. Cited by: §II, §IIIA.
 [68] (2018) Estimating missing data in temporal data streams using multi-directional recurrent neural networks. IEEE Transactions on Biomedical Engineering 66 (5), pp. 1477–1490. Cited by: §II, §IVC.
 [69] (2005) Learning Gaussian processes from multiple tasks. In Proceedings of the 22nd International Conference on Machine Learning, pp. 1012–1019. Cited by: §II.
 [70] (2015) Accelerated online lowrank tensor learning for multivariate spatiotemporal streams. In Proceedings of the 32nd International Conference on Machine Learning (ICML15), pp. 238–247. Cited by: §II.
 [71] (2009) Extensions of the penalized spline of propensity prediction method of imputation. Biometrics 65 (3), pp. 911–918. Cited by: §I.
 [72] (2018) Stochastic nonparametric event-tensor decomposition. In Advances in Neural Information Processing Systems, pp. 6855–6865. Cited by: §II.
Appendix A Parameter Estimation in EM
In the E (Expectation) step, we calculate the responsibilities using the current values of the parameters:
In the M (Maximization) step, we re-estimate the parameters using the responsibilities computed in the E step:
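As a concrete illustration of the two steps, here is one EM iteration for a simple one-dimensional Gaussian mixture. This is a generic sketch of standard EM, not the exact update equations of our model, whose components include a linear regression and a GP:

```python
import math

def em_step(data, weights, mus, varis):
    """One EM iteration for a 1-D Gaussian mixture."""
    k = len(weights)
    # E step: responsibility of each component for each data point,
    # computed from the current parameter values.
    resp = []
    for x in data:
        joint = [weights[j]
                 * math.exp(-(x - mus[j]) ** 2 / (2 * varis[j]))
                 / math.sqrt(2 * math.pi * varis[j])
                 for j in range(k)]
        total = sum(joint)
        resp.append([g / total for g in joint])
    # M step: re-estimate weights, means, and variances from the
    # responsibilities (variance floored for numerical stability).
    n = len(data)
    nk = [sum(r[j] for r in resp) for j in range(k)]
    new_weights = [nk[j] / n for j in range(k)]
    new_mus = [sum(r[j] * x for r, x in zip(resp, data)) / nk[j]
               for j in range(k)]
    new_varis = [max(sum(r[j] * (x - new_mus[j]) ** 2
                         for r, x in zip(resp, data)) / nk[j], 1e-6)
                 for j in range(k)]
    return new_weights, new_mus, new_varis
```

Iterating this step monotonically increases the data likelihood until the parameters converge.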