Interpretable Outcome Prediction with Sparse Bayesian Neural Networks in Intensive Care

05/07/2019 ∙ by Anna-Lena Popkes, et al. ∙ University of Cambridge Microsoft 12

Clinical decision making is challenging because of pathological complexity, as well as large amounts of heterogeneous data generated as part of routine clinical care. In recent years, machine learning tools have been developed to aid this process. Intensive care unit (ICU) admissions represent the most data dense and time-critical patient care episodes. In this context, prediction models may help clinicians determine which patients are most at risk and prioritize care. However, flexible tools such as artificial neural networks (ANNs) suffer from a lack of interpretability limiting their acceptability to clinicians. In this work, we propose a novel interpretable Bayesian neural network architecture which offers both the flexibility of ANNs and interpretability in terms of feature selection. In particular, we employ a sparsity inducing prior distribution in a tied manner to learn which features are important for outcome prediction. We evaluate our approach on the task of mortality prediction using two real-world ICU cohorts. In collaboration with clinicians we found that, in addition to the predicted outcome results, our approach can provide novel insights into the importance of different clinical measurements. This suggests that our model can support medical experts in their decision making process.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 11

page 13

page 14

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Clinicians often need to make critical decisions, for example, about treatments or patient scheduling, based on available data and personal expertise. The increasing prevalence of electronic health records (EHRs) means that routinely collected medical data are increasingly available in machine readable form. In some areas such as the intensive care unit (ICU), the data density may be so large that it becomes difficult for clinicians to fully appreciate relationships and patterns in clinical records. At the same time, ICU patients are the most severely ill. Life-supporting treatments are not only expensive and limited but may be associated with potentially catastrophic side effects. It is in this context that appropriate treatments must be delivered in a time-critical manner based on accurate appraisal of all available information. A degree of automated data analysis may assist clinicians navigate the current ‘data deluge’ and make the best informed decisions. The recent success of artificial intelligence and machine learning in various real-world applications suggests that such technology can be the key to unlocking the full potential of medical data and help with making real-time decisions

(Shipp et al., 2002; Tan and Gilbert, 2003; Caruana et al., 2015; Bouton et al., 2016; Chen et al., 2017; Hamesse et al., 2018).

The high data density of the ICU is ideal for applying machine learning methods to assist with clinical decision making. All medical decision making is predicated on the prediction of future outcomes. Especially relevant to the ICU are predictions surrounding patient mortality. As a consequence, several machine learning studies published over the course of the last years focused on this task (Joshi and Szolovits, 2012; Celi et al., 2012; Ghassemi et al., 2014; Caballero B. and Akella, 2015; Meiring et al., 2018). Most of these studies concentrated on improving previously published measures of performance such as discrimination, specificity and sensitivity.

Deploying machine learning solutions in ICUs to support life/death decision making is challenging and requires high prediction accuracy. Artificial neural networks (ANNs) are powerful machine learning models that have been successful in several highly complex real-word tasks (Collobert et al., 2011; Boulanger-Lewandowski et al., 2012; Bojarski et al., 2016; Silver et al., 2016). The non-linearity of ANNs allows them to capture complex non-linear dependencies, a quality which often results in high predictive performance. Despite widespread success, predictions from ANNs lack interpretability. Instead, they often function as a black box. For example, after training an ANN on the task of outcome prediction it is difficult to determine which input features are relevant for making predictions. This is highly undesirable in the medical domain — making potentially life-changing decisions without being able to clearly justify them is unacceptable to both clinicians and patients. As a consequence, the application of ANNs in practice has been limited. Advancing the interpretability of such networks is a key step towards increasing their impact in healthcare applications.

In this work, we propose an interpretable machine learning model based on a Bayesian neural network (BNN) (MacKay, 1992; Hinton and Van Camp, 1993; Blundell et al., 2015; Hernández-Lobato and Adams, 2015; Louizos et al., 2017; Ghosh and Doshi-Velez, 2017) for outcome prediction in the ICU. Our proposed method offers not only the flexibility of ANNs but also interpretable predictions in terms of feature importance — inspecting the model parameters directly shows which features are essential for prediction and which are considered irrelevant. We choose a BNN as the underlying architecture because it explicitly models the uncertainty in a dataset, as well as in the model parameters and predictions. This replicates the inherently probabilistic nature of clinical decision making. We further utilize sparsity inducing prior distributions to allow for feature selection.

Technical Significance

We propose a novel model based on a BNN that automatically selects relevant features for the prediction task at hand. In particular, we propose the use of tied sparsity inducing prior distributions, where the same sparsity prior is shared among all weights connected to the same input feature. As a consequence, after training the model, some weights connected to an input feature will be close to zero, indicating that the corresponding input feature is not relevant for performing prediction. For the prior distribution we use the horseshoe prior because of its sparsity inducing and heavy-tailed nature. Moreover, we derive a variational inference method to estimate the proposed model. We verify the effectiveness of our method using a synthetic dataset and apply it to two real-world ICU cohorts: MIMIC-III

(Johnson et al., 2016) and CENTER-TBI (Maas et al., 2014).

Clinical Relevance

We contribute a machine learning model for outcome prediction in the ICU which provides fully probabilistic predictions as well as insights into the relevance of input features. Such results can help clinicians with making treatment decisions and communicating with patients’ families. The predictions and insights can further be used for resource allocation purposes, for example, to decide whether a patient should be discharged in order to accommodate a new patient. More importantly, our method provides insights into how predictions are made. In particular, it allows to determine which medical measures are relevant for the task of outcome prediction. This is an important advance because model interpretability is essential if critical decisions based on diagnostic support systems are to be accepted by both clinicians and the public.

The rest of this work is organized as follows. We review related literature in Section 2 and present technical details of our proposed BNN architecture in Section 3. We evaluate our proposed method on real-world ICU cohorts: MIMIC-III (Section 4.2) and CENTER-TBI (Section 4.3). The cohorts and the experimental results are presented in Section 4.

2 Related work

Sparsity in linear models

In linear models sparsity is typically induced using a suitable prior distribution over the model parameters. The well known LASSO (Least Absolute Shrinkage and Selection Operator) (Tibshirani, 1996) produces a sparse estimate of the parameters in a linear model, by regularizing the

norm of the parameter vector towards zero. A Bayesian interpretation of LASSO is discussed in

Park and Casella (2008), showing that the

regularizer can be interpreted as a Laplace prior over the parameters, and that the LASSO estimate is equivalent to a maximum a posteriori estimate of the linear coefficients given data.

Sparsity inducing prior distributions beyond the Laplace distribution have been proposed to improve feature selection results. As a prevalent example, the spike and slab prior places a mixture distribution on each parameter, comprising a point mass distribution at zero (the spike) and an absolutely continuous density (the slab) (Mitchell and Beauchamp, 1988). Since its introduction it has been employed in several settings (Ishwaran et al., 2005; Carbonetto et al., 2012; Scheipl et al., 2012; Hernández-Lobato et al., 2015a; Malsiner-Walli and Wagner, 2018)

. However, the spike and slab prior requires a careful choice of the mixture weights and the variance of the “slab”. Another popular choice for introducing sparsity is the horseshoe prior

(Carvalho et al., 2009), which assigns a half-Cauchy prior over the variance of the Gaussian prior over the parameters. The heavy tail of the horseshoe distribution allows coefficients associated with important features to remain large, while at the same time the tall spike at the origin encourages shrinkage of other parameters. Compared to the spike and slab prior, the horseshoe prior is more flexible because no mixing proportions need to be set. Furthermore, Bhadra et al. (2017) showed that the horseshoe prior is computationally more efficient. Therefore, in the rest of this work we employ the horseshoe prior as the sparsity inducing prior for better feature selection.

Sparsity in non-linear models

Less work has been directed towards the application of sparsity inducing priors in non-linear models. One of the first approaches was Automatic Relevance Determination applied to BNNs (MacKay et al., 1994), which fits the prior variance for each individual parameters by maximizing the marginal likelihood. However, this approach fails to scale to large datasets as it involves the inversion of large matrices.

Louizos et al. (2017) and Ghosh and Doshi-Velez (2017) applied a horseshoe prior to prune inactive hidden units from BNNs, thereby achieving better compression. The models were trained using variational inference (Jordan et al., 1999; Beal, 2003)

and the gradient of the loss function was approximated using unbiased Monte Carlo estimates. Our work uses the same inference techniques, however, we focus on selecting input features, not hidden units.

Outcome prediction in the medical domain

ICU clinicians are faced with a large variety of data types including EHRs, monitoring data and diagnostic data. The heterogeneity and volume of this data, along with short timescales over which clinical changes can occur, makes the ICU setting a good test case for many prediction problems in healthcare. Over the past years researchers have explored the application of artificial intelligence to ICU tasks, like the prediction of outcomes or length of stay (Desautels et al., 2016; Churpek et al., 2016; Vranas et al., 2017; Nemati et al., 2018; Meiring et al., 2018).

Because of its central importance to patients and clinicians, outcome prediction is one of the most widely studied tasks within this field. Models can be divided into two categories: those using only static features (Knaus et al., 1985; Le Gall et al., 1993; Lemeshow et al., 1993; Elixhauser et al., 1998; Steyerberg et al., 2008) and those utilizing information about the temporal evolution of features (Joshi and Szolovits, 2012; Ghassemi et al., 2014; Caballero B. and Akella, 2015; Harutyunyan et al., 2017; Che et al., 2018).

Most approaches based on static features model only linear relationships or rely on manual feature engineering. Manual feature engineering scales poorly, and prevents models from automatically discovering patterns in the data. Linear models are easy to interpret, because the importance of input features can directly be inferred from the magnitude of the associated model coefficients. This is appealing for transparent clinical decision making and highly desirable for real-world applications for two reasons. Firstly, decisions without a justification are unacceptable both medicolegally and to clinicians and patients. Secondly, interpretability allows the model to be queried to gain novel insights into data which may be biologically hypothesis-generating. However, the capacity of linear models is limited. In most real world problems the relationship between input features and target values is non-linear or may involve complex interactions between predictors. Consequently, more powerful approaches are needed to model such data well.

In this work, we propose a model for mortality prediction named HorseshoeBNN. In contrast to previous work (Joshi and Szolovits, 2012; Celi et al., 2012; Ghassemi et al., 2014; Caballero B. and Akella, 2015) our model is able to both capture non-linear relationships and learn which input features are important for prediction, thereby making the model interpretable.

3 Methods

In this section, we describe our proposed BNN architecture for mortality prediction, with a horseshoe prior that enables feature selection for better interpretability. We first revisit the BNN, a type of ANN which explicitly incorporates uncertainty by introducing distributions over the model parameters (Section 3.1). The specific form of prior distribution we employ, the horseshoe prior, is discussed in section 3.2. It induces sparsity in the first layer of the BNN. Section 3.2 describes the architecture of our model, the HorseshoeBNN. Section 3.3 addresses the computational methods used to implement it.

3.1 Bayesian Neural Networks

Given an observed dataset , we want to determine a model that fits the data well and generalizes to unseen cases. In the context of the mortality prediction task studied in this work, the vector consists of different medical measurements and

is a binary indicator of the outcome for a specific patient. In parametric modelling a popular model for such tasks is the ANN

(Rumelhart et al., 1986; Goodfellow et al., 2016)

, which defines highly non-linear functions by stacking multiple layers of simple non-linear transformations. As an example, a fully connected ANN with

hidden layers defines the function in the following way:

(1)

Here represents an activation function

which is usually a simple non-linear transformation, e.g. sigmoid or ReLU

(Nair and Hinton, 2010). Depending on the task, the non-linearity for the output layer may be an idenity function for regression, or a sigmoid/softmax function for binary/multiclass classification. Bias terms can also be included in each layer by defining . In the rest of this work we use to represent all parameters of an ANN, and we denote the function defined by Equation (1) as to emphasize the dependence of the function values on .

Instead of directly predicting the response with a deterministic function , BNNs start from a probabilistic description of the modelling task, and estimate the uncertainty of the parameters given the data. Concretely, the network parameters

are considered random variables, and a prior distribution

is selected to represent the prior belief of their configuration. Assuming that the observed data is independent and identically distributed (i.i.d.), the likelihood function of is defined as

(2)

where, in case of a binary classification task like the one presented in this work, the label is a scalar, and

(3)

For regression tasks, we have After observing the training data , a posterior distribution of the network weights is defined by Bayes’ rule

(4)

This posterior distribution represents the updated belief of how likely the network parameters are given the observations. With the posterior distribution one can predict the response of an unseen input using the predictive distribution:

(5)

3.2 The HorseshoeBNN: Feature Selection with Sparsity Inducing Priors

The prior distribution

captures the prior belief about which model parameters are likely to generate the target outputs

, before observing any data. When focusing on feature selection, sparsity inducing priors are of particular interest. In this work, we use a horseshoe prior (Carvalho et al., 2009), which in its simplest form can be described as

(6)

where

is the half-Cauchy distribution and

is a scale parameter. The probability density function of the horseshoe prior with

is illustrated in Figure 1. It has a sharp peak around zero and wide tails. This encourages shrinkage of weights that do not contribute to prediction, while at the same time allowing large weights to remain large.

Figure 1: Left: Probability density function of the horseshoe prior. Right: Illustration of a HorseshoeBNN. The prior distribution of the weights in the input layer is given by a horseshoe distribution. All weights for particular input feature share the same shrinkage parameter, allowing for feature selection. The prior of the weights of the second layer is given by a Gaussian.

For feature selection we propose a horseshoe prior for the first layer of a BNN by using a shared half-Cauchy distribution to control the shrinkage of weights connected to the same input feature. Specifically, denoting as the weight connecting the -th component of the input vector to the -th node in the first hidden layer, the associated horseshoe prior is given by

(7)

The layer-wide scale tends to shrink all weights in a layer, whereas the local shrinkage parameter allows for reduced shrinkage of all weights related to a specific input feature . As a consequence, certain features of the input vector are selected whereas others are ignored. For the bias node we use a Gaussian prior distribution. It is important to note that a simpler solution which keeps all feature weights small but does not encourage significant shrinkage would not be sufficient to perform feature selection — large weights in deeper layers of the network might increase the influence of irrelevant input features.

We model the prior of the weights in the second layer of the HorseshoeBNN by a Gaussian distribution, which prevents overfitting

(Blundell et al., 2015). The complete network architecture is given in Figure 1. Although we perform our experiments with a single hidden layer, the model can easily be enlarged by adding more hidden layers with Gaussian priors.

A direct parameterization of the half-Cauchy prior can lead to instabilities in variational inference for BNNs. Therefore, we follow Ghosh and Doshi-Velez (2017) to reparametrize the horseshoe prior using auxiliary parameters:

(8)

After adding the auxiliary variables to the Horseshoe prior, the prior over all the unobserved random variables is

3.3 Scalable Variational Inference for HorseshoeBNN

For most BNN architectures both the posterior distribution and the predictive distribution are intractable due to a lack of analytic forms for the integrals. To address this outstanding issue we fit a simpler distribution and later replace with in prediction. More specifically, we define

(9)

and use factorized Gaussian distributions for the weights in upper layers:

To ensure non-negativity of the shrinkage parameters, we consider a log-normal approximation to the posterior of and , i.e.

(10)

In the horseshoe prior (see Eq. 6) the weights and the scales and are strongly correlated. This leads to strong correlations in the posterior distribution with pathological geometries that are hard to approximate. Betancourt and Girolami (2015) and Ingraham and Marks (2016) show that this problem can be mitigated by reparametrizing the weights in the horseshoe layer as follows:

(11)

and equivalently, parametrizing the approximate distribution as

(12)

Because the log-likelihood term does not depend on or , one can show that the optimal approximations and

are inverse Gamma distributions with distributional parameters dependent on

(Ghosh and Doshi-Velez, 2017).

We fit the variational posterior by minimizing the Kullback-Leibler (KL) divergence . One can show that the KL divergence minimization task is equivalent to maximizing the evidence lower-bound (ELBO) (Jordan et al., 1999; Beal, 2003; Zhang et al., 2018)

Since the ELBO still lacks an analytic form due to the non-linearity of the BNN, we apply black box variational inference (Ranganath et al., 2014)

to compute an unbiased estimate of the ELBO by sampling

. More specifically, because the

distribution is constructed by a product of (log-)normal distributions, we apply the reparametrization trick

(Kingma and Welling, 2013; Rezende et al., 2014) to draw samples from the variational distribution: Furthermore, stochastic optimization techniques are employed to allow for mini-batch training, which enables the variational inference algorithm to scale to large datasets. Combining both, the doubly stochastic approximation to the ELBO is

which is used as the loss function for the stochastic gradient ascent training of the variational parameters .

4 Experiments and Results

In this section, we focus on evaluating our method on two real-world ICU cohorts: MIMIC-III (Johnson et al., 2016) and CENTER-TBI (Maas et al., 2014). We also present a validation study on a synthetic feature selection dataset in appendix A, showing that our approach can fully recover the ground truth set of features. The two ICU cohort evaluations share the same experimental set-up which is summarized below. Details of the two cohorts and experimental results are presented in Section 4.2 and 4.3 separately.

4.1 Experimental Set-Up

We compare our proposed HorseshoeBNN with three related models for outcome prediction in ICUs. In this section we summarize all evaluated models.

  • LinearGaussian: a linear model with a Gaussian prior distribution on all weights (Bishop, 2006). This is the most commonly used Bayesian model for prediction tasks.

  • GaussianBNN: a standard BNN with a Gaussian prior distribution on all weights (Blundell et al., 2015).

  • LinearHorseshoe: a linear model with a horseshoe prior distribution on all weights (Carvalho et al., 2009). This model extends the LinearGaussian model by using sparsity inducing prior distributions for performing feature selection.

  • HorseshoeBNN: our novel extension of the GaussianBNN with a tied horseshoe prior distribution on the weights in the first layer. This enables the model to perform feature selection.

All models are trained until convergence using 10 fold cross-validation and the ADAM optimizer (Kingma and Ba, 2014)

. We use 50 hidden units for the MIMIC-III cohort and 100 hidden units for CENTER-TBI. The full list of hyperparameter settings can be found in Appendix

B. The code for the experiments will be available at (link to be added).

4.2 Mortality prediction on MIMIC-III

Cohort

MIMIC-III (‘Medical Information Mart for Intensive Care’) (Johnson et al., 2016) is a comprehensive publicly available intensive care database collected from tertiary care hospitals. The data includes information about laboratory measurements, medications, notes from care providers and other features. In total, the database contains medical records of 7.5K ICU patients over 11 years. Since its introduction, it has been used both by the medical and machine learning community for various analyses (Desautels et al., 2016; Choi et al., 2017; Che et al., 2018; Kurniati et al., 2018).

Preprocessing

We preprocess MIMIC-III using code introduced in Harutyunyan et al. (2017), focusing on the task of mortality prediction. For all measurements we remove invalid feature values beyond the allowed range. The ranges are defined by medical experts as shown in Table 4 in Appendix B

. Because we do not focus on prediction based on dynamic features, we reduce the time series for each patient by computing the mean value of each feature in the first 48 hours of the patient’s stay in the ICU. We use mean imputation for missing values. The final cohort contains 17903 samples and 17 features. 86.5% of patients contained in the dataset survived (positive outcome), 13.5% deceased (negative outcome). A complete list of features is displayed in Table

4 in Appendix B.

Prediction Results

We compare the predictive performance in terms of error rate (lower is better), the area under the receiver operating characteristic curve (AUROC) (higher is better) and negative predictive log-likelihood (lower is better). The results are presented in Table

1

. The non-linear models (GaussianBNN, HorseshoeBNN) show a small improvement compared to the linear models (LinearGaussian, LinearHorseshoe). The HorseshoeBNN performs on par with or slightly better than the GaussianBNN, because feature selection may potentially help to reduce overfitting. In addition, we present the confusion matrix for each method in Figure

2. As illustrated in Figure 2, the non-linear models perform better due to an improvement in correctly predicting the outcome of deceased patients. The fact that all models perform worse on deceased than surviving patients can be explained by data imbalance.

Model Error rate AUROC Negative Log-Likelihood
LinearGaussian 0.129 0.008 0.807 0.013 0.321 0.013
GaussianBNN 0.123 0.008 0.830 0.014 0.304 0.013
LinearHorseshoe 0.130 0.008 0.807 0.013 0.320 0.013
HorseshoeBNN 0.122 0.006 0.831 0.013 0.304 0.012
Table 1:

Results of the different models for the task of mortality prediction tested on the MIMIC-III cohort. The mean value and one standard deviation of each metric over 10-fold cross-validation is presented.

Figure 2: Confusion matrices of the four different models trained on the MIMIC-III cohort over 10-fold cross-validation.

Interpretability Results and Clinical Relevance

An important part of our results is constituted by the interpretability of the model predictions. Both the LinearHorseshoe model and our proposed HorseshoeBNN are trained to attribute relevance to the different input features. For the LinearHorseshoe model each input feature is associated with a single distribution. Therefore, the magnitude of the mean of each distribution, shown in the middle panel of Figure 3, directly reflects the relative importance of the corresponding input feature. In the HorseshoeBNN each input feature is associated with a vector of 50 distributions (the number of hidden units). We plot the average of the mean value of these distributions in the right panel of Figure 3.

Figure 3: Left: percentage of missing data for the features in the MIMIC-III cohort. Middle: Norm of the weights of the LinearHorseshoe model, representing the relative importance of the corresponding features. Right: Norm of the weights of the HorseshoeBNN. The name of each input feature is given on the far left of the plots. Feature weights of zero indicate that the corresponding features are irrelevant for outcome prediction. All non-zero weights indicate that the corresponding features are relevant for predicting mortality.

When comparing the weight values we find that the two models agree on the importance of most features. Notable differences lie in the importance attributed to pH, systolic blood pressure and Glasgow coma scale.

Whereas the LinearHorseshoe model finds pH to be irrelevant, this is not the case for the HorseshoeBNN. This might be explained by the fact that the healthy range for pH is very narrow. Negative outcomes depend on pH values in a non-monotonic way — both too high and too low values are dangerous. Therefore, the non-linear model (i.e. the HorseshoeBNN) might be able to capture the importance of this feature better than the linear model.

For blood pressure, three values are recorded by the ICU: systolic, diastolic, and mean blood pressure. Any one of these values can be used to establish a baseline blood pressure for a given patient. The LinearHorseshoe model captures only one of them, namely diastolic blood pressure. In contrast, the HorseshoeBNN captures both the diastolic and systolic blood pressure. In the clinical context combining diastolic and systolic values allows to obtain additional information about the waveform of a patient’s blood pressure. This information can be of clinical relevance. Our results suggest that the HorseshoeBNN is able to recognize the importance of this additional information, whereas the LinearHorseshoe model is not.

The feature Glasgow coma scale total is selected only by the LinearHorseshoe model, but considered irrelevant by the HorseshoeBNN. It is important to note that the Glasgow coma scale components have a deterministic relationship: the total scale equals the sum of the verbal response, motor response and eye opening. This redundancy allows to model the data using any combination of the features. Two features, namely height and capillary refill rate are considered irrelevant by both models. However, it should be noted that these features are missing for a large percentage of patients in the cohort. Consequently, we cannot conclude anything about their importance.

Overall our results on feature selection suggest that, compared to a linear model, a non-linear model like our HorseshoeBNN can provide additional insights into the importance of input features.

4.3 Mortality prediction on CENTER-TBI

Cohort

The CENTER-TBI (Collaborative European NeuroTrauma Effectiveness Research in Traumatic Brain Injury) study is a longitudinal observational study that was conducted across Europe and Israel (Maas et al., 2014). The core study collected data from 5400 patients with a diagnosis of traumatic brain injury (TBI) for a period of up to two years post injury. Patients were divided into three strata with different clinical care paths. In this work we focus on the stratum with severest brain injury: patients seen in the emergency room and admitted to the ICU. The cohort contains a broad range of clinical data including baseline demographics, mechanism of injury, prehospital clinical course (e.g. vital signs and Glasgow coma scale), brain computed tomographic reports, any many other features.

Preprocessing

We predict mortality based on the features listed in Table 6 using release 1.0 of the CENTER-TBI cohort. We remove the data of patients for which no outcome was reported. To address missing values we use zero-imputation for binary features and mean imputation for continuous and ordinal features, as suggested by clinicians. The final cohort contains 1613 samples and 39 features. 75% of patients contained in the cohort survived, 25% deceased. A complete list of features is displayed in Table 6 in Appendix B.

Prediction Results

The resulting metric values of our experiment are summarized in Table 2 and Figure 4. We again observe that the HorseshoeBNN achieves slightly better metric values than the linear models (LinearGaussian, LinearHorseshoe). The GaussianBNN performs worse than all other models. This could be due to the large amount of noise in the data which the GaussianBNN might be modeling, thereby overfitting the data. In contrast, the HorseshoeBNN removes input features which makes the model less likely to overfit. The confusion matrices in Figure 4 show that BNN based models are better at predicting the outcome for deceased patients, which is consistent with our previous experiment.

Model Error rate AUROC Negative Log-Likelihood
LinearGaussian 0.185 0.024 0.871 0.041 0.393 0.057
GaussianBNN 0.195 0.033 0.869 0.042 0.390 0.063
LinearHorseshoe 0.180 0.025 0.874 0.041 0.383 0.046
HorseshoeBNN 0.179 0.026 0.873 0.041 0.380 0.045
Table 2: Results of the different models for the task of mortality prediction tested on the CENTER-TBI cohort. The mean value and one standard deviation of each metric over 10-fold cross-validation is presented.
Figure 4: Confusion matrices of the four different models trained on the CENTER-TBI cohort over 10-fold cross-validation.

Interpretability Results and Clinical Relevance

The relative importance of the different input features as determined by the LinearHorseshoe model and the HorseshoeBNN are shown in Figure 5. Features marked in bold are included in the IMPACT model (Steyerberg et al., 2008), a model commonly used for outcome prediction in TBI. As evident in the figure, there is little difference between the LinearHorseshoe model and the HorseshoeBNN. Both models attribute similar relevance to most features. Furthermore, we can observe a large overlap between features used in the IMPACT model and features considered important by our horseshoe models.

Figure 5: Left: ercentage of missing data for the features in the CENTER-TBI cohort. Middle: Norm of the weights of the LinearHorseshoe model, representing the relative importance of the corresponding features. Right: Norm of the weights of the HorseshoeBNN. The name of each input feature is given on the far left of the plots. Features included in the IMPACT model for outcome prediction in TBI are marked bold (Steyerberg et al., 2008). Feature weights of zero indicate that the corresponding features are irrelevant for outcome prediction. All non-zero weights indicate that the corresponding features are relevant for predicting mortality.

Interestingly, our models attribute little relevance to events of hypotension, although such events are considered important from a clinical perspective. A clinically likely explanation for this is that hypotension is typically unrecognised, leading to many false negatives for this feature. Similar to hypotension, both models consider the feature hemoglobin irrelevant for mortality prediction. This might be explained by the small role of hemoglobin in the IMPACT model. It is also possible that the small effect of hemoglobin is already accounted for by the features platelet count or international normalized ratio. Furthermore, both models consider certain features important that are not included in the IMPACT model. These features are: sex, heart rate, pH, international normalized ratio and absent basal cistern.

When comparing with the results for MIMIC-III we observe different results for the feature pH. Whereas both horseshoe models trained on CENTER-TBI determine pH to be important, this is not the case for MIMIC-III. For the latter only the non-linear model, that is, the HorseshoeBNN, selects this feature. This may be due to a difference between the patient groups contained in MIMIC-III and CENTER-TBI. This assumption is further supported by the distribution of pH values in the cohorts. A more detailed discussion of the findings, including figures is given in Appendix B.1. A further difference can be observed for the blood pressure features. This can again be explained by the heterogeneity of the datasets. Inspecting the relationship between systolic and diastolic blood pressure in both datasets shows that the two features are strongly correlated for CENTER-TBI but not for MIMIC-III. A more elaborate discussion can be found in Appendix B.1.

5 Conclusion and Future Work

In this work we proposed a novel model, the HorseshoeBNN, for performing interpretable patient outcome prediction. Our method extends traditional BNNs to perform feature selection using sparsity inducing prior distributions in a tied manner. Our architecture offers many advantages. Firstly, being based on a BNN, it represents a non-linear, fully probabilistic method which is highly compatible with the clinical decision making process. Secondly, with our proposed advances, the model is able to learn which input features are important for prediction, thereby making it interpretable which is highly desirable in the clinical domain.

We worked closely with clinicians and evaluated our model using two real-world ICU cohorts. We showed that our proposed HorseshoeBNN can provide additional insights about the importance of input features. Together with its ability to provide uncertainty estimates, the HorseshoeBNN could be used to support clinicians in their decision making process. In view of the high-dimensional complex nature of medical data and the high relevance of outcome prediction in healthcare, our method could be useful not only for ICUs but in any medical settings. Our work illustrates how a close collaboration between computational and clinical experts can lead to methodological advances suitable for translation into tools for patient benefit.

In future work, we will extend our model to be able to work with the entire time series (Yoon et al., 2018) as dynamic prediction is a key area of interest in the ICU. Utilizing information about the evolution of features over time might not only improve predictive accuracy, but could provide additional insights about temporal changes of measurement values.

Moreover, we will use more sophisticated methods for missing value imputation to obtain better predictive performance (Ma et al., 2018; Little and Rubin, 2019). Finally, contemporary digital healthcare is fundamentally transdisciplinary and we would like to continue working with medical experts to explore how to deploy our method in a real-world clinical setting.

Acknowledgements

CENTER-TBI is supported by The European Union FP 7th Framework program (grant 602150) with additional funding provided by the Hannelore Kohl Foundation (Germany) and by the non-profit organization One Mind For Research (directly to INCF).

References

  • Beal [2003] M. J. Beal.

    Variational algorithms for approximate Bayesian inference

    .
    PhD thesis, University of London, 2003.
  • Betancourt and Girolami [2015] M. Betancourt and M. Girolami. Hamiltonian Monte Carlo for hierarchical models. In Current trends in Bayesian methodology with applications, 2015.
  • Bhadra et al. [2017] A. Bhadra, J. Datta, N. G. Polson, and B. T. Willard. Lasso meets horseshoe. arXiv preprint arXiv:1706.10179, 2017.
  • Bishop [2006] C. M. Bishop. Pattern recognition and machine learning. Springer, 2006.
  • Blundell et al. [2015] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424, 2015.
  • Bojarski et al. [2016] M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, et al. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016.
  • Boulanger-Lewandowski et al. [2012] N. Boulanger-Lewandowski, Y. Bengio, and P. Vincent. Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. arXiv preprint arXiv:1206.6392, 2012.
  • Bouton et al. [2016] C. Bouton, A. Shaikhouni, N. Annetta, M. Bockbrader, et al. Restoring cortical control of functional movement in a human with quadriplegia. Nature, 533(7602):247, 2016.
  • Caballero B. and Akella [2015] Karla L. Caballero B. and R. Akella. Dynamically modeling patient’s health state from electronic medical records: A time series approach. In KDD Conference, pages 69–78. ACM, 2015.
  • Carbonetto et al. [2012] P. Carbonetto, M. Stephens, et al. Scalable variational inference for Bayesian variable selection in regression, and its accuracy in genetic association studies. Bayesian analysis, 7(1):73–108, 2012.
  • Caruana et al. [2015] R. Caruana, Y. Lou, J. Gehrke, P. Koch, M. Sturm, and N. Elhadad. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In SIGKDD, pages 1721–1730. ACM, 2015.
  • Carvalho et al. [2009] C. Carvalho, N. Polson, and J. Scott. Handling sparsity via the horseshoe. In Artificial Intelligence and Statistics, pages 73–80, 2009.
  • Celi et al. [2012] L. Celi, S. Galvin, G. Davidzon, J. Lee, D. Scott, and R. Mark. A database-driven decision support system: customized mortality prediction. Journal of personalized medicine, 2(4):138–148, 2012.
  • Che et al. [2018] Z. Che, S. Purushotham, K. N. Cho, D. Sontag, and Y. Liu. Recurrent neural networks for multivariate time series with missing values. Scientific reports, 8(1):6085, 2018.
  • Chen et al. [2017] M. Chen, Y. Hao, K. Hwang, L. Wang, and L. Wang. Disease prediction by machine learning over big data from healthcare communities. IEEE Access, 5:8869–8879, 2017.
  • Choi et al. [2017] E. Choi, M. T. Bahadori, Le Song, W. F. Stewart, and J. Sun.

    Gram: graph-based attention model for healthcare representation learning.

    In KDD Conference, pages 787–795. ACM, 2017.
  • Churpek et al. [2016] M. M. Churpek, T. C. Yuen, C. Winslow, D. O. Meltzer, M. W. Kattan, and D. P. Edelson. Multicenter comparison of machine learning methods and conventional regression for predicting clinical deterioration on the wards. Crit. Care Med., 44(2):368, 2016.
  • Collobert et al. [2011] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. Natural language processing (almost) from scratch. J. Mach. Learn. Res., 12(Aug):2493–2537, 2011.
  • Desautels et al. [2016] T. Desautels, J. Calvert, J. Hoffman, M. Jay, et al. Prediction of sepsis in the intensive care unit with minimal electronic health record data: a machine learning approach. JMIR medical informatics, 4(3), 2016.
  • Elixhauser et al. [1998] A. Elixhauser, C. Steiner, D. R. Harris, and R. M. Coffey. Comorbidity measures for use with administrative data. Medical care, pages 8–27, 1998.
  • Ghassemi et al. [2014] M. Ghassemi, T. Naumann, F. Doshi-Velez, N. Brimmer, R. Joshi, A. Rumshisky, and P. Szolovits. Unfolding physiological state: Mortality modelling in intensive care units. In KDD Conference, pages 75–84. ACM, 2014.
  • Ghosh and Doshi-Velez [2017] S. Ghosh and F. Doshi-Velez. Model selection in Bayesian neural networks via horseshoe priors. arXiv preprint arXiv:1705.10388, 2017.
  • Goodfellow et al. [2016] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.
  • Hamesse et al. [2018] C. Hamesse, R. Tu, P. Ackermann, H. Kjellström, and C. Zhang. Simultaneous measurement imputation and outcome prediction for Achilles tendon rupture rehabilitation. arXiv preprint arXiv:1810.03435, 2018.
  • Harutyunyan et al. [2017] H. Harutyunyan, H. Khachatrian, D. Kale, and A. Galstyan. Multitask learning and benchmarking with clinical time series data. arXiv preprint arXiv:1703.07771, 2017.
  • Hernández-Lobato et al. [2015a] D. Hernández-Lobato, J. M. Hernández-Lobato, and Z. Ghahramani. A probabilistic model for dirty multi-task feature selection. In ICML, pages 1073–1082, 2015a.
  • Hernández-Lobato and Adams [2015] J. M. Hernández-Lobato and R. Adams.

    Probabilistic backpropagation for scalable learning of Bayesian neural networks.

    In Proceedings of the 32nd International Conference on Machine Learning, pages 1861–1869, 2015.
  • Hernández-Lobato et al. [2015b] J. M. Hernández-Lobato, D. Hernández-Lobato, and A. Suárez.

    Expectation propagation in linear regression models with spike-and-slab priors.

    Machine Learning, 99(3):437–487, Jun 2015b.
  • Hinton and Van Camp [1993] G. E. Hinton and D. Van Camp. Keeping the neural networks simple by minimizing the description length of the weights. In

    Proceedings of the sixth annual conference on Computational learning theory

    , pages 5–13. ACM, 1993.
  • Ingraham and Marks [2016] J. B. Ingraham and D. S. Marks. Bayesian sparsity for intractable distributions. arXiv preprint arXiv:1602.03807, 2016.
  • Ishwaran et al. [2005] H. Ishwaran, J. S. Rao, et al. Spike and slab variable selection: frequentist and Bayesian strategies. The Annals of Statistics, 33(2):730–773, 2005.
  • Johnson et al. [2016] A. E. Johnson, T. J. Pollard, L. Shen, L. W. Lehman, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. A. Celi, and R. G. Mark. MIMIC-III, a freely accessible critical care database. Sci Data, 3:160035, May 2016.
  • Jordan et al. [1999] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. Machine learning, 37(2):183–233, 1999.
  • Joshi and Szolovits [2012] R. Joshi and P. Szolovits. Prognostic physiology: modeling patient severity in intensive care units using radial domain folding. In AMIA Annual Symposium Proceedings, volume 2012, page 1276. American Medical Informatics Association, 2012.
  • Kingma and Ba [2014] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Kingma and Welling [2013] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
  • Knaus et al. [1985] W. A. Knaus, E. A. Draper, D. P. Wagner, and J. E. Zimmerman. APACHE II: a severity of disease classification system. Crit. Care Med., 13(10):818–829, 1985.
  • Kurniati et al. [2018] A. Kurniati, G. Hall, D. Hogg, and O. Johnson. Process mining in oncology using the MIMIC-III dataset. In J. Phys. Conf. Ser., volume 971, page 012008. IOP Publishing, 2018.
  • Le Gall et al. [1993] J. Le Gall, S. Lemeshow, and F. Saulnier. A new simplified acute physiology score (SAPS II) based on a european/north american multicenter study. Jama, 270(24):2957–2963, 1993.
  • Lemeshow et al. [1993] S. Lemeshow, D. Teres, J. Klar, J. S. Avrunin, S. H. Gehlbach, and J. Rapoport. Mortality probability models based on an international cohort of intensive care unit patients. Jama, 270(20):2478–2486, 1993.
  • Little and Rubin [2019] R. J. A. Little and D. B. Rubin. Statistical analysis with missing data, volume 793. Wiley, 2019.
  • Louizos et al. [2017] C. Louizos, K. Ullrich, and M. Welling. Bayesian compression for deep learning. In NeurIPS, pages 3288–3298, 2017.
  • Ma et al. [2018] C. Ma, S. Tschiatschek, K. Palla, J. M. Hernández-Lobato, S. Nowozin, and C. Zhang. Eddi: Efficient dynamic discovery of high-value information with partial VAE. arXiv preprint arXiv:1809.11142, 2018.
  • Maas et al. [2014] A. I. R. Maas, D. K. Menon, E. W. Steyerberg, G. Citerio, F. Lecky, G. T. Manley, S. Hill, V. Legrand, and A. Sorgner. Collaborative european neurotrauma effectiveness research in traumatic brain injury (center-tbi) a prospective longitudinal observational study. Neurosurgery, 76(1):67–80, 2014.
  • MacKay [1992] D. J. C. MacKay. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448–472, 1992.
  • MacKay et al. [1994] D. J. C. MacKay et al. Bayesian nonlinear modeling for the prediction competition. ASHRAE transactions, 100(2):1053–1062, 1994.
  • Malsiner-Walli and Wagner [2018] G. Malsiner-Walli and H. Wagner. Comparing spike and slab priors for Bayesian variable selection. arXiv preprint arXiv:1812.07259, 2018.
  • Meiring et al. [2018] C. Meiring, A. Dixit, S. Harris, N. S. MacCallum, D. A. Brealey, P. J. Watkinson, A. Jones, S. Ashworth, R. Beale, S. J. Brett, M. Singer, and A. Ercole. Optimal intensive care outcome prediction over time using machine learning. PLoS ONE, 13(11):e0206862, 2018.
  • Mitchell and Beauchamp [1988] T. Mitchell and J. Beauchamp. Bayesian variable selection in linear regression. JASA, 83(404):1023–1032, 1988.
  • Nair and Hinton [2010] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10), pages 807–814, 2010.
  • Nemati et al. [2018] S. Nemati, A. Holder, F. Razmi, M. D. Stanley, G. D. Clifford, and T. G. Buchman. An interpretable machine learning model for accurate prediction of sepsis in the ICU. Crit. Care Med., 46(4):547–553, 2018.
  • Park and Casella [2008] T. Park and G. Casella. The Bayesian lasso. JASA, 103(482):681–686, 2008.
  • Ranganath et al. [2014] R. Ranganath, S. Gerrish, and D. Blei. Black box variational inference. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, pages 814–822, 2014.
  • Rezende et al. [2014] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, pages 1278–1286, 2014.
  • Rumelhart et al. [1986] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 323:9, 1986.
  • Scheipl et al. [2012] F. Scheipl, L. Fahrmeir, and T. Kneib. Spike-and-slab priors for function selection in structured additive regression models. JASA, 107(500):1518–1532, 2012.
  • Shipp et al. [2002] M. Shipp, K. Ross, P. Tamayo, A. Weng, et al. Diffuse large b-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nature medicine, 8(1):68, 2002.
  • Silver et al. [2016] D. Silver, A. Huang, C. Maddison, A. Guez, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484, 2016.
  • Steyerberg et al. [2008] E. W. Steyerberg, N. Mushkudiani, P. Perel, I. Butcher, et al. Predicting outcome after traumatic brain injury: development and international validation of prognostic scores based on admission characteristics. PLoS medicine, 5(8):e165, 2008.
  • Tan and Gilbert [2003] A. Tan and D. Gilbert. Ensemble machine learning on gene expression data for cancer classification. 2003.
  • Tibshirani [1996] R. Tibshirani. Regression shrinkage and selection via the Lasso. J. Royal Stat. Soc.: Series B (Methodological), 58(1):267–288, 1996.
  • Vranas et al. [2017] K. C. Vranas, J. K. Jopling, T. E. Sweeney, M. C. Ramsey, A. S. Milstein, C. G. Slatore, G. J. Escobar, and V. X. Liu. Identifying distinct subgroups of ICU patients: A machine learning approach. Crit. Care Med., 45(10):1607–1615, 2017.
  • Yoon et al. [2018] J. Yoon, W. R. Zame, and M. van der Schaar. Deep sensing: Active sensing using multi-directional recurrent neural networks. 2018.
  • Zhang et al. [2018] C. Zhang, J. Butepage, H. Kjellstrom, and S. Mandt. Advances in variational inference. IEEE TPAMI, 2018.

Appendix A Validation of Induced Sparsity

In this section, we verify the sparsity-inducing capacities of our LinearHorseshoe model. We repeat an experiment proposed in Hernández-Lobato et al. (2015b). In the experiment, a data matrix with 75 datapoints is sampled from the unit hypersphere. Target values are computed using , where represents Gaussian distributed noise with standard deviation 0.005. The 512-dimensional weight vector is sparse, that is, only 20 randomly selected components are non-zero.

Because the target values depend linearly on the data matrix , we use a linear model with a horseshoe prior to obtain an estimate of the weight vector. The estimate is given by the mean of the posterior distribution. We evaluate the model using the reconstruction error and find results similar to those reported in Hernández-Lobato et al. (2015b). Hyperparameter settings can be found in Table 3. The results of this validation experiment show that our model is capable of correctly reproducing sparse weight vectors.

Hyperparameter Value
Number of weight samples during training 10
Number of weight samples during testing 100
Batch size 64
Learning rate 0.001

Number of epochs

2000000
Global shrinkage parameter of Horseshoe prior 1.0
Local shrinkage parameter of Horseshoe prior 1.0
Table 3: Parameter and hyperparameter settings for prior experiment on induced sparsity
Figure 6: Signal reconstruction (bottom) of a sparse signal (top).

The average reconstruction error over twenty realizations of the experiment is . This error is slightly higher but of the same order of magnitude as the error of reported in Hernández-Lobato et al. (2015b). The difference can be explained by the fact that we use variational inference to approximate the posterior distribution whereas Hernández-Lobato et al. (2015b)

use Markov chain Monte Carlo techniques, which are known to give a more accurate approximation of the posterior distribution. An example of a sparse signal and the reconstructed weights is shown in Figure 

6.

Appendix B Experiments

Feature Unit Min threshold Max threshold
Height cm 250
Temperature °C 49
Blood pH - 6 8
Fraction of inspired oxygen - 0 1
Capillary refill time seconds 0 -
Heart rate bpm 0 300
Systolic blood pressure mmHg 275
Diastolic blood pressure mmHg 150
Mean blood pressure mmHg 190
Weight kg 0 250
Glucose mg/dL 0 1250
Respiratory rate number breaths per min 0 150
Oxygen saturation % 100
Glasgow coma scale eye response - 1 4
Glasgow coma scale motor response - 1 6
Glasgow coma scale verbal response - 1 5
Glasgow coma scale total - 1 15
Table 4: MIMIC-III feature ranges
For blood pH values higher than 14 (physically impossible), we assume that these are actually measures of hydrogen ion concentrations in nanomole/L. These values are converted to pH values, after which the threshold is applied.
Hyperparameter Value
Number of weight samples during training 10
Number of weight samples during testing 100
Batch size 64
Number of hidden units 50
Learning rate 0.001
Number of epochs 5000
Standard deviation of Gaussian prior 1.0
Global shrinkage parameter of Horseshoe prior 1.0
Local shrinkage parameter of Horseshoe prior 1.0
Table 5: Parameter and hyperparameter settings for MIMIC models
Category Features
General Age, gender, prior alcohol use, history of anti-coagulants
Injury Cause of injury, time of injury
Condition on arrival heart rate, respiratory rate, temperature, SpO
Systolic blood pressure, diastolic blood pressure
Arterial O tension, CO tension, pH
Assessment of airway, breathing, circulation
Episode of hypoxia or hypotension
Neurological assessment Glasgow Coma Score, pupil reaction
Initial imaging Marshall classification, depressed skull fracture
subarachnoid hemorrhage, midline shift
Absent basal cisterns, extradural hematoma
Blood chemistry tests Glucose, sodium, albumin, calcium, hemoglobin
hematocrit, white blood cell count, C-reactive protein,
Platelet count, International normalized ratio
activated partial thromboplastin time, fibrogen
Table 6: Features used to predict mortality on the CENTER-TBI dataset
Hyperparameter Value
Number of weight samples during training 10
Number of weight samples during testing 100
Batch size 64
Number of hidden units 100
Learning rate 0.001
Number of epochs 5000
Standard deviation of Gaussian prior 1.0
Global shrinkage parameter of Horseshoe prior 1.0
Local shrinkage parameter of Horseshoe prior 1.0
Table 7: Parameter and hyperparameter settings for CENTER-TBI models

b.1 Discussion of feature selection results

As discussed in section 4.3 we observed different results for the feature pH when comparing MIMIC-III and CENTER-TBI. Whereas both horseshoe models trained on CENTER-TBI determine pH to be important (see Figure 5), this is not the case for MIMIC-III. Here, only the non-linear model, that is, the HorseshoeBNN selects this feature (see Figure 3). This can be explained by the difference between the patient groups contained in MIMIC vs. CENTER-TBI. Whereas MIMIC contains a very heterogeneous group of patients, CENTER-TBI contains only patients that experienced some kind of trauma , and the outcome predictors will therefore be expected to be very different given this is a different disease. Figure 7 illustrates the distribution of pH in both datasets. When computing the KL-divergence between the distributions for patients that survived and patients that deceased we observe that the divergence is larger for CENTER-TBI. This might explain why both models determine pH to be important for CENTER-TBI, whereas only the non-linear model includes pH for MIMIC.

Figure 7: Distribution of the feature pH in the MIMIC and CENTER-TBI cohorts. Illustrated in red is the range of pH for healthy patients (7.35-7.45).

A further difference was observed for blood pressure. Both CENTER-TBI horseshoe models indicate that only systolic blood pressure is of relevance, whereas for MIMIC-III the HorseshoeBNN determined both systolic and diastolic blood pressure to be important. Again, this could be explained by the heterogeneity of the datasets. CENTER-TBI contains predominantly people with relatively isolated brain injuries not affecting the blood circulation. Therefore, blood pressure (absolute, e.g. systolic) is most important as this pressure perfuses the brain. In contrast, MIMIC-III includes a large number of patients in shock (e.g. with sepsis). The degree of this pathology, which would be less likely to occur in the CENTER-TBI patients, is clinically likely to be a strong determinant of outcome and would be expected to be reflected in a difference in diastolic blood pressure. This is further supported when inspecting the relationship between systolic and diastolic blood pressure in both datasets. Figure 8 shows that the two features are strongly correlated for CENTER-TBI but not for MIMIC. This might explain why only the systolic blood pressure is considered relevant for CENTER-TBI, but both the systolic and diastolic blood pressure for MIMIC.

Figure 8: Correlations between the features diastolic blood pressure and systolic blood pressure in the MIMIC and CENTER-TBI cohorts.