The widespread growth of electronic health records (EHRs) has generated vast amounts of clinical data. Normally, an encounter-based patient EHR is longitudinal and contains clinical notes, diagnosis codes, medications, laboratory tests and vital signs, etc., which can be represented as multivariate time series (MTS). As a consequence, EHRs contain valuable information about the clinical observations depicting both patients’ health and care provided by physicians. However, the longitudinal and heterogeneous data sources make EHR analysis difficult from a computational perspective. In addition, the EHRs are often subject to a lack of completeness, implying that the MTS extracted from EHRs often contain massive missing data [sharafoddini2019new]. Missing data might, however, occur for different reasons. It could be that the physician orders lab tests, but because of an error some results are not recorded. On the other hand, it can also happen that the physician decides to not order lab tests because he thinks the patient is in good shape. In the first case, the missingness is ignorable, whereas in the latter case, the missing values and patterns potentially can contain rich information about the patient’s diseases and clinical outcomes. Efficient data-driven systems aiming to extract knowledge, perform predictive modeling, etc., must be capable of capturing this information.
Traditionally, missingness mechanisms have been divided into missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR). The main difference between these mechanisms consists in whether the missingness is ignorable (MCAR and MAR) or non-ignorable (MNAR) [donders2006review, rubin1976inference]. This traditional description of missingness mechanisms is, however, not always sufficient in medical applications as the missing patterns might be correlated with additional variables, such as e.g. a disease. This means that the distribution of the missing patterns for patients with a particular disease might be different than the corresponding distribution for patients without the disease, i.e. the missingness is informative [ghorbani2018embedding, wells2013strategies].
Several methods have been proposed to handle missing data in MTS [little2014statistical, schafer2002missing]. A simple approach is to create a complete dataset by discarding the MTS with missing data. Alternatively, one can do simple imputation of the missing values, e.g. using the last observation carried forward scheme (impute the last non-missing value for the following missing values) [shao2003last], zero-value imputation (replace missing values with zeros) or mean-value imputation (missing values are replaced with the mean of the observed data) [zhang2016missing]. A common limitation of these approaches is that they lead to additional bias, loss of precision, and they ignore uncertainty associated with the missing values [donders2006review]
. This problem is to some extent solved via multiple imputation methods, i.e. by creating multiple complete datasets using single imputation independently each time. Then, by training a classifier using an ensemble learning strategy, one can improve the performance compared to simple imputation. However, this imputation procedure is an ad-hoc solution as it is performed independently of the rest of the analysis and it ignores the potential predictive value of the missing patterns[MA2018297].
Due to the limitations of imputation methods, several research efforts have been devoted over the last years to deal with missing data in MTS using alternative strategies [8489716, bianchi2019learning, DBLP:journals/corr/ChePCSL16, Ghassemi, li2019vs, pmlr-v56-Lipton16, mikalsen2016learning, shukla2018interpolation]. Prominent examples are kernels, i.e. positive semi-definite time series similarity measures, such as the learned pattern similarity (LPS) [baydogan2016time] and the time series cluster kernel (TCK) [mikalsen2018time]
that can naturally deal with missing data. The former generalizes autoregressive models to local autopatterns, which capture the local dependency structure in the time series, and uses an ensemble learning (random forest) strategy in which a bag-of-words representation is created from the output of the leaf-nodes for each tree. TCK is also based on an ensemble learning approach and shares many properties with LPS. It is designed using an ensemble learning approach in which Bayesian mixture models form the base models. However, while LPS exploits the inherent missing data handling abilities of decision trees, TCK is a likelihood-based approach in which the incomplete dataset is analysed using maximum a posteriori expectation-maximization. An advantage of these methods, compared to e.g. multiple imputation that requires a careful selection of imputation model and parameters[schafer2002missing], is that the missing data are handled automatically and no additional tasks are left to the designer. Additionally, since the methods are based on ensemble learning, they are robust to hyperparameter choices. In particular, these properties are important in unsupervised settings, which frequently occur in medical applications where manual label annotation of large datasets often is not feasible [Halpernocw011, MIKALSEN2017105].
A shortcoming of these kernel methods is, however, that they cannot exploit informative missing patterns, which frequently occur in medical MTS, and unbiased predictions are only guaranteed for ignorable missingness as MAR is an underlying assumption. Recently, several studies have focused on modeling the informative or nonignorable missigness by analyzing the observed values as well as the indicators of missingness, concluding that the missing patterns can add more insights beyond the observed values [agniel2018biases, DBLP:journals/corr/ChePCSL16, pmlr-v56-Lipton16, sharafoddini2019new]. In this work, we present a novel time series cluster kernel, TCK, that also represents the missing patterns using binary indicator time series. By doing so, we obtain MTS consisting of both continuous and discrete attributes. However, we do not only concatenate the binary MTS to the real-valued MTS and analyse these data in a naive way. Instead, we take a statistically principled Bayesian approach [little2014statistical, MA2018297] and model the missingness mechanism more rigorously by introducing novel mixed mode Bayesian mixture models, which can effectively exploit information provided by the missing patterns as well as the temporal dependencies in the observed MTS. The mixed mode Bayesian mixture models are then used as base models in an ensemble learning strategy to form the TCK kernel. Experiments on three real-world datasets of patients described by longitudinal EHR data, demonstrate the effectiveness of the proposed method.
Time series cluster kernel to exploit informative missingness
Here we present the proposed TCK kernel. The kernel is learned using an ensemble learning strategy, i.e. by training individual base models which are combined into a composite kernel in the end. As base model we introduce a novel mixed mode Bayesian mixture model. Before we provide the details of this method, we describe the notation used throughout the paper.
Notation We define a multivariate time series (MTS) as a finite combination of univariate time series (UTS) of length , i.e. The dimensionality of the MTS is the same as the number of UTS, , and the length of is the same as the length of the UTS . A –dimensional MTS, , of length can be represented as a matrix in . Given a dataset of MTS, we denote as the -th MTS. In a dataset of incompletely observed MTS, the -th MTS is denoted by the pair , where is a binary MTS with entry if the realization is missing and if it is observed.
Mixed mode Bayesian mixture model
Let be a MTS generated from two modes, where is a V-variate real-valued MTS () and is a V-variate binary MTS (). In the mixture model it is assumed that is generated from a finite mixture density
where is the number of components, is the density of the components parametrized by , and are the mixing coefficients, and
. We formulate the mixture model in terms of a latent random variable
, described via the one-hot vectorwith marginal distribution given by The latent variable describes which cluster component the MTS belongs to, i.e. if belongs to cluster component and otherwise. The conditional is given by
and therefore it follows that the joint distribution is given by
We further assume that the parameters of each component are given by and
where is a density function given by
where is a time-dependent mean, and is a diagonal covariance matrix in which
is the variance of attribute. Hence, the covariance is assumed to be constant over time.
is a probability mass given by
where . The idea with this formulation is to use the Bernoulli term to capture information from the missing patterns and to capture the information from the observed data.
Using Bayes’ theorem we compute the conditional probability ofgiven , ,
To improve the capability of handling missing data, a Bayesian extension is introduced where informative priors are put over the parameters of the normal distribution as well as the Bernoulli distribution. This enforces that the cluster representatives become smooth over time even in the presence of large amounts of missing data and that the parameters of clusters with few MTS are similar to the overall mean. Towards this end, a kernel-based Gaussian prior is defined for the mean,
where are the empirical means and are the prior covariance matrices.
are empirical standard deviations andis a kernel matrix, whose elements are with , being user-defined hyperparameters. For the standard deviation
, an inverse Gamma distribution prior is introduced
is a hyperparameter. Further, we put a Beta distribution prior on
where , are hyperparameters. We let denote the set all of hyperparameters.
Given a dataset
, we estimate the parametersusing maximum a posteriori expectation maximization (MAP-EM) [dempster1977maximum]. The Q-function is computed as follows
The E-step in MAP-EM is the same as in maximum likelihood EM and consists in updating the posterior (Eq. (6)) using the current parameter estimates, whereas the M-step consists in maximizing the Q-function (Eq. (Mixed mode Bayesian mixture model) wrt. the parameters . I.e.,
Computing the derivatives of with respect to the parameters , , and leads to Alg. 1.
The TCK kernel
To compute the TCK kernel, we use the mixed mode Bayesian mixture model, described above, as the base model in an ensemble approach. Key to ensure that TCK will have statistical advantages (lower variance), computational advantages (less sensitive to local optima) as well as representational advantages (increased expressiveness) compared to the individual base models, is diversity and accuracy [Dietterich2000, hansen1990neural]. In general, this means that the base models should not do the same type of errors and each base model has to perform better than random guessing. Hence, to ensure diversity, we integrate multiple outcomes of the base model as it is trained under different, randomly chosen, settings (hyperparameters, initialization, subsampling). In more detail, the number of cluster components for the base models is sampled from a set of integers . For each number of cluster components , we apply different random initial conditions and sample hyperparameters uniformly as follows: , , , . We let be the index set keeping track of initial conditions and hyperparameters () as well as the number of components (). Each base model is trained on a random subset of MTS . To further increase the diversity, for each , we select random subsets of variables as well as random time segments . After having trained the individual base models using an embarrasingly parallel procedure, we compute a normalized sum of the inner products of the normalized posterior distributions from each mixture component to build the TCK kernel matrix. Details of the method are presented in Alg. 2, whereas Alg. 3 describes how to compute the kernel for MTS not seen during training.
To test the performance of the proposed kernel, we considered three clinical datasets of which the characteristics are summarized in Tab. 1. The variables and the corresponding missing rates in the three datasets are summarized in Tab. 2. A more detailed description of the datasets follows below.
|# of patients||4000||858||402|
|# attrib (Lab, vital)||28 (17, 11)||11 (11, 0)||11 (7, 4)|
|Length of MTS||48 (hours)||10 (days)||15 (days)|
|Positive class||874 (21.9%)||227 (26.5%)||31 (7.7%)|
|Av. missing rate||76.6 %||80.7%||83%|
|Variables (missing rate)|
|Phys||Albumin (0.99)||ALP (0.98)||AST (0.98)|
|Bilirubin (0.98)||BUN (0.93)||Creat. (0.93)|
|Diast. BP (0.11)||FiO2 (0.84)||GCS (0.68)|
|Glucose (0.93)||HCO3 (0.93)||HCT (0.91)|
|HR (0.10)||K (0.93)||Lactate (0.96)|
|MAP (0.12)||Mg (0.93)||Na (0.93)|
|PaCO2 (0.88)||PaO2 (0.88)||Platelets (0.93)|
|RespRate (0.76)||SaO2 (0.96)||Syst. BP (0.11)|
|Temp (0.63)||Urine (0.31)||WBC (0.93)|
|SSI||Albumin (0.79)||Amylase (0.95)||Creat. (0.87)|
|CRP (0.69)||Glucose (0.92)||Hb (0.65)|
|K (0.71)||Na (0.71)||Platelets (0.92)|
|Urea (0.94)||WBC (0.73)|
|AL||Albumin (0.80)||Creat. (0.94)||CRP (0.77)|
|Diast. BP (0.76)||Hb (0.71)||HR (0.76)|
|K (0.75)||Na (0.75)||Syst. BP (0.76)|
|Temp (0.66)||WBC (0.80)|
The PhysioNet dataset is collected from the PhysioNet Challenge 2012 [silva2012predicting]. We extracted the first part, which consists of 4000 patient records of patients from the intensive care units (ICUs) of various hospitals. Each patient stayed in the ICU for at least 48 hours and the records contain information about both vital signs and blood samples collected over time. We extracted all measurements taken within 48 hours after admission and aligned the MTS into same-length sequences using an hourly discretization. Variables with a missing rate higher than 99% were omitted, which led to a total of 28 variables. The classification task was to predict whether the patient was recovering from surgery, which is a task that also has been considered in other work [DBLP:journals/corr/ChePCSL16, li2019vs].
Surgical site infection This dataset contains data for 11 blood samples collected postoperatively for patients who underwent major abdominal surgery at the department of gastrointestinal surgery at a Norwegian university hospital in the years 2004-2012. The task considered was to detect surgical site infection (SSI), which is one of the most common types of nosocomial infections [lewis2013] and represents up to 30% of all hospital-acquired infections [magill2012prevalence]. Patients with no recorded lab tests during the period from postoperative day 1 until day 10 were removed from the cohort, which lead to a final cohort consisting of 858 patients. The average proportion of missing data in the cohort was . To identify the patients in the cohort who developed postoperative SSI and create ground truth labels, ICD-10 as well as NOMESCO Classification of Surgical Procedures codes related to severe postoperative complications were considered. Patients without these codes who also did not have a mention of the word “infection” in any of their postoperative text documents were considered as controls. This lead to a dataset consisting of 227 infected patients and 631 non-infected patients.
Anastomosis leakage Anastomosis leakage (AL) is potentially a serious complication that can occur after colon rectal cancer (CRC) surgery, of which one of the consequences is an increase in 30-day mortality [snijders2013anastomotic]. It is estimated that 5-15% of the patients who undergo surgery for CRC suffer from AL [branagan2005prognosis]. Recent studies have shown that both unstructured as well structured EHR data such as measurements of blood tests and vital signs could have predictive value for AL [soguero2016predicting, soguero2014support]. The dataset considered in this work contains only structured data and is collected from the same hospital as the SSI dataset. It contains physiological data (blood tests and vital signs) for the period from the day after surgery to day 15 for 402 patients who underwent CRC surgery. A total of 31 of these patients got AL after surgery, but there is no information available about exactly when it happened. The classification task considered here was to detect which of the patients got AL.
We considered the following experimental setup. We performed kernel principal component analysis (KPCA)[scholkopf1997kernel] using the proposed TCK
and then trained a kNN-classifier in the low dimensional space. The dimensionality of the KPCA-representation was set to 3 to also be able to visualize the embeddings, whereas we used 5-fold cross validation to set the number of neighbors
for the kNN-classifier. Performance was measured in terms of F1-score, sensitivity and specificity. Sensitivity is the fraction of correctly classified cases, whereas specificity is the fraction of controls that are correctly classified as negative. F1-score is the harmonic mean of precision and sensitivity, where precision is the fraction of actual positives among all those that are classified as positive cases.
We compared the performance of the proposed kernel to four baseline kernels, namely the linear kernel (Lin), the global alignment kernel (GAK) [cuturi2011fast], LPS and TCK. GAK is a positive semi-definite kernel formulation of the widely used, but non-metric, time series similarity measure called dynamic time warping (DTW) [Berndt:1994:UDT:3000850.3000887]. It has two hyperparameters, namely the kernel bandwidth and the triangular parameter, which have to be set by the user and it does not naturally deal with missing data and incomplete datasets, and therefore also requires a preprocessing step involving imputation. Therefore, we created a complete dataset using mean imputation for Lin and GAK (initial experiments showed that mean imputation worked better than last observation carried forward). In accordance with [Cuturi], for GAK we set the bandwidth to 0.1 times the median distance of all MTS in the training set scaled by the square root of the median length of all MTS, and the triangular parameter to 0.2 times the median length (Frobenius norm) of all MTS. In order to design baseline kernels that can exploit informative missingness, we also created baselines (referred to as Lin, GAK and LPS) by concatenating the binary indicator MTS to . LPS was run with default hyperparameters using the implementation provided by [Baydogan], with the exception that the minimal segment length was adjusted to account for the relatively short MTS in the datasets. For the TCK we let and , and, likewise, TCK was run with and . For all methods, except LPS which do not require standardization, we standardized each attribute to zero mean and unit standard deviation.
For PhysioNet and SSI, we did 5-fold cross validation to measure performance. The AL dataset is, however, highly imbalanced and therefore we employed an undersampling strategy by randomly sampling two patients from the negative class per positive case. 20% of this dataset was then set aside as a test set. This entire process was repeated 10 times and we reported mean and standard errors of the performance measures. The AL dataset is small, and for that reason the hyperparameters of the methods had to be adjusted accordingly. For TCKwe let and .
Results and discussion
|Phys.||Lin||0.329 0.049||0.812 0.007||0.328 0.052|
|GAK||0.313 0.021||0.801 0.019||0.309 0.012|
|LPS||0.511 0.069||0.948 0.010||0.600 0.068|
|TCK||0.411 0.053||0.833 0.022||0.408 0.049|
|Lin||0.556 0.054||0.939 0.013||0.625 0.046|
|GAK||0.566 0.032||0.941 0.010||0.636 0.038|
|LPS||0.611 0.060||0.939 0.009||0.667 0.041|
|TCK||0.699 0.034||0.980 0.007||0.789 0.033|
|SSI||Lin||0.480 0.041||0.878 0.041||0.529 0.072|
|GAK||0.639 0.064||0.921 0.036||0.687 0.025|
|LPS||0.687 0.044||0.929 0.040||0.730 0.028|
|TCK||0.683 0.071||0.922 0.025||0.719 0.069|
|Lin||0.700 0.012||0.944 0.030||0.755 0.024|
|GAK||0.718 0.073||0.940 0.026||0.761 0.036|
|LPS||0.652 0.056||0.925 0.015||0.701 0.036|
|TCK||0.775 0.017||0.929 0.010||0.786 0.015|
|AL||Lin||0.414 0.273||0.907 0.119||0.468 0.240|
|GAK||0.428 0.223||0.935 0.078||0.520 0.206|
|LPS||0.843 0.105||0.835 0.075||0.776 0.058|
|TCK||0.700 0.147||0.886 0.107||0.722 0.080|
|Lin||0.742 0.221||0.864 0.097||0.724 0.097|
|GAK||0.728 0.227||0.928 0.047||0.759 0.156|
|LPS||0.800 0.120||0.864 0.091||0.773 0.091|
|TCK||0.843 0.184||0.864 0.095||0.792 0.115|
Tab. 3 shows the performance of the TCK kernel, as well as the baseline methods, on the three real-world datasets. The first thing to notice is that the two kernels (Lin and GAK) that rely on imputation consistently perform much worse than the other kernels in terms of both F1-score across all datasets. We also note that these methods achieve a relatively high specificity. However, this is because they put too many patients in the negative class, which also leads to a high false negative rate and, consequently, a low sensitivity. The reasons could be that the imputation methods introduce biases and that the missingness mechanism is ignored.
The two kernels, TCK and LPS, that naturally handle the missing data perform better than the kernels that rely on imputation for all three datasets. TCK and LPS perform quite similarly across all 3 evaluation metrics for the SSI dataset, whereas LPS outperforms TCK on the PhysioNet and AL dataset. These methods probably perform better than the imputation methods because ignoring the missingness introduces less bias than replacing missing values with biased estimates. The performance of the baselines that account for informative missingness, Linand GAK, is considerably better than Lin and GAK, respectively, for all datasets. LPS also performs better than LPS on the PhysioNet datasets, whereas the performance of these two baselines is more or less equal on the two other datasets. The proposed TCK performs considerably better than all baselines, and in particular compared to TCK (the kernel which it is an improvement of) for the PhysioNet and SSI datasets in terms of F1-score, and it performs better or comparable than the other kernels on the AL dataset. This demonstrates that the missing patterns in clinical time series are often informative and the TCK can exploit this information very efficiently.
Fig. 1 shows the KPCA embeddings obtained using five kernels (Lin, GAK, LPS, TCK and TCK). In general, the classes are more separated in the representations obtained using TCK than in the other representations.
Limitations, future work and conclusions
In this paper, we presented a MTS kernel capable of exploiting informative missingness along with the temporal dependencies in the observed data. We showed that TCK can learn good representations that can be exploited both in supervised and unsupervised tasks, even when the percentage of missing data is high. In this work, the representations learned using TCK were evaluated visually (Fig. 1) and using a supervised scheme by training a classifier (Tab. 3). The experimental results suggested that TCK achieved superior performances compared to baselines on three real datasets.
The experiments presented in this work focused on binary classification tasks, both of patients at the ICU and patients who had undergone colonrectal cancer surgery. However, we believe that TCK is also a very good choice in applications where there is a lack of labels, which often is the case in medical applications, thanks to the ensemble learning strategy that makes the kernel robust to hyperparameter choices. In fact, since it is a kernel, it can be used in many different applications, including classification as well as clustering tasks, benefiting from the vast body of work in the field of kernel methods. In future work, we would like to test TCK in a realistic unsupervised task from the medical domain.
A limitation of TCK is that it is only designed for MTS of the same length. In further work, we would therefore like to design a time series cluster kernel that can also deal with varying length MTS. It should also be pointed out that if the missing patterns are not informative, i.e. the missingness is not correlated the particular medical condition(s) of interest, the performance gain of TCK compared to TCK is low. It is therefore an advantage if the user has some understanding about the underlying missingness mechanism in the data. On the other hand, our experiments on benchmark datasets (see Appendix) demonstrate that in cases when the missingness mechanism is almost ignorable (low correlation between missing patterns and labels), the performance of TCK is not worse than TCK.
The authors would like to thank K. Hindberg for assistance on extraction of EHR data and the physicians A. Revhaug, R.-O. Lindsetmo and K. M. Augestad for helpful guidance throughout the study.
Appendix – Synthetic benchmark datasets
To test how well TCK performs for a varying degree of informative missingness, we generated in total 16 synthetic datasets by randomly injecting missing data into 4 MTS benchmark datasets. The characteristics of the datasets are described in Tab. 4. We transformed all MTS in each dataset to the same length, , where T is given by Here, is the ceiling operator and is the length of the longest MTS in the original dataset.
The following procedure was used to create 8 synthetic datasets with missing data from the Wafer and Japanese vowels datasets. We randomly sampled a number for each attribute , where indicates that the attribute and the labels are positively correlated and negatively correlated. Thereafter, we sampled a missing rate from for each MTS and attribute. The parameter was tuned such that the Pearson correlation (absolute value) between the missing rates for the attributes and the labels took the values , respectively. By doing so, we could control the amount of informative missingness and because of the way we sampled , the missing rate in each dataset was around 50% independently of the Pearson correlation.
Further, the following procedure was used to create 8 synthetic datasets from the uWave and Character trajectories datasets, which both consist of only 3 attributes. We randomly sampled a number for each attribute . Attribute(s) with became negatively correlated with the labels by sampling from , whereas the attribute(s) with became positively correlated with the labels by sampling from . The parameter was computed in the same way as above. Then, we computed the mean of each attribute over the complete dataset and let each element with be missing with probability . This means that the probability of being missing is dependent on the value of the missing element, i.e. the missingness mechanism is MNAR within each class. Hence, this type of informative missingness is not the same as the one we created for the Wafer and Japanese vowels datasets.
Three baseline models were created. The first baseline, namely ordinary TCK, ignores the missingness mechanism. We created a second baseline, refered to as TCK, in which the missing patterns we modeled naively by concatenating the binary missing indicator MTS to the MTS and creating a new MTS with attributes. Then, ordinary TCK was trained on the datasets consisting of . In the third baseline, TCK, we investigated how well informative missingness can be captured by imputing zeros for the missing values and then training the TCK on the imputed data.
Tab. 5 shows the performance of the proposed TCK and the three baselines for all of the 16 synthetic datasets. We see that the proposed TCK achieves the best accuracy for 14 out of 16 datasets and is the only method which consistently has the expected behaviour, namely that the accuracy increases as the correlation between missing values and class labels increases. It can also be seen that the performance of TCK is similar to TCK when the amount of information in the missing patterns is low, whereas TCK is clearly outperformed when the informative missingness is high. This demonstrates that TCK can effectively exploit informative missingness.