
Bayesian Modelling in Practice: Using Uncertainty to Improve Trustworthiness in Medical Applications

by   David Ruhe, et al.

The Intensive Care Unit (ICU) is a hospital department where machine learning has the potential to provide valuable assistance in clinical decision making. Classical machine learning models usually only provide point-estimates and no uncertainty of predictions. In practice, uncertain predictions should be presented to doctors with extra care in order to prevent potentially catastrophic treatment decisions. In this work we show how Bayesian modelling and the predictive uncertainty that it provides can be used to mitigate risk of misguided prediction and to detect out-of-domain examples in a medical setting. We derive analytically a bound on the prediction loss with respect to predictive uncertainty. The bound shows that uncertainty can mitigate loss. Furthermore, we apply a Bayesian Neural Network to the MIMIC-III dataset, predicting risk of mortality of ICU patients. Our empirical results show that uncertainty can indeed prevent potential errors and reliably identifies out-of-domain patients. These results suggest that Bayesian predictive uncertainty can greatly improve trustworthiness of machine learning models in high-risk settings such as the ICU.



1 Introduction

The Intensive Care Unit (ICU) is a resource-intensive environment where patients receive care that is not readily available elsewhere in the hospital. ICU patients usually are in life-threatening conditions. Therefore, adequate assessment of illness severity and expected effectiveness of interventions is of utmost importance for clinical decision making. Recent advancements in machine learning have cleared the path to deployment of machine learning based decision support software systems in critical areas like the ICU. Examples are predicting risk of readmission after discharge (Thoral et al., 2018), optimizing sepsis treatment (Raghu et al., 2017), or predicting mortality risk given the current state of the patient (Pirracchio et al., 2015). While these applications have proven to be effective, model trustworthiness remains elusive. Most machine learning models only provide point estimates of their parameters and corresponding predictions. A recent study has shown that these models can output high-risk predictions on datapoints that lie far from their observed dataset (Nguyen et al., 2015). Especially in a critical area like the ICU these flaws could result in catastrophes. For effective machine learning based decision support, we think it is crucial that machine learning methods output uncertainty estimates beside predictions. Bayesian modelling has the desirable property of expressing predictive uncertainty through stochasticity in its parameters. In this work, we show how Bayesian Neural Networks, through predictive uncertainty (hereafter referred to as uncertainty), can correctly identify predictions that are likely to be misguided. When the model encounters a datapoint that lies far from its observed set, a practitioner can be notified through the model’s uncertainty and observe that the patient has a combination of symptoms that hitherto has not been seen. Therefore, the practitioner should proceed with care.

In previous work, Leibig et al. (2017) were among the first to apply model uncertainty to healthcare by diagnosing diabetic retinopathy from fundus images. Similarly, Nair et al. (2018), Wang et al. (2019) and Orlando et al. (2019) estimate uncertainty in image analysis using MC Dropout (Gal, 2016). We extend previous work by making the following contributions.

  1. We provide mathematical bounds on the obtainable loss with respect to uncertainty.

  2. Through these bounds, model performance is directly related to uncertainty. Uncertainty prevents high-loss prediction errors.

  3. We show that BNNs can identify out-of-domain patients competently in a real use-case.

Additionally, we are (to the best of our knowledge) the first to use Bayes By Backprop (Blundell et al., 2015) instead of MC Dropout to estimate the parameters of the model distribution directly. This choice has two motivations. First, MC Dropout rates have to be carefully adjusted to obtain well-calibrated uncertainties. This requires tuning all dropout probabilities, which is infeasible for deep neural networks. Second, the MC Dropout approximate posterior does not contract with more data, and therefore the approach has been questioned (Osband, 2016). We also extend earlier publications by applying Bayesian uncertainty to (ICU) signal processing.

The rest of the paper is structured as follows. In section 2 we discuss the required background knowledge. In section 3 we elaborate on methodological decisions for signal processing and modelling. In section 4 we provide an overview of the results, showing that model uncertainty can mitigate the prediction loss and how uncertainty relates to out-of-domain observations. In section 5 we conclude and provide suggestions for future research directions.

2 Background

In this section we summarize important background theory about Bayesian modelling.

2.1 Bayesian Neural Networks

In deterministic modelling, given a dataset of observations $X$ and labels $Y$, we restrict the model parameters $w$ to a point estimate and optimize the likelihood function directly. That is, we try to find

$$w^{*} = \arg\max_{w} \, p(Y \mid X, w).$$

To capture model uncertainty, we strive to find the posterior distribution over the model parameters:

$$p(w \mid X, Y) = \frac{p(Y \mid X, w)\, p(w)}{p(Y \mid X)},$$

where $p(w)$ is a prior. The marginal likelihood is computed by marginalisation: $p(Y \mid X) = \int p(Y \mid X, w)\, p(w)\, dw$. Given many continuous model parameters this term cannot be evaluated analytically, which is the case when $w$ are the parameters of a (deep) neural network. Consequently, the posterior is intractable as well. Therefore, we approximate it with a tractable variational distribution $q_\theta(w)$ through a procedure called variational inference (Hinton & van Camp, 1993). The goal is to minimize the KL divergence between the approximate and the true posterior:

$$\theta^{*} = \arg\min_{\theta} \, \mathrm{KL}\left(q_\theta(w) \,\|\, p(w \mid X, Y)\right).$$

This KL divergence, too, is analytically intractable. Instead, we can minimize it by maximizing the evidence lower bound:

$$\mathrm{ELBO}(\theta) = \mathbb{E}_{q_\theta(w)}\left[\log p(Y \mid X, w)\right] - \mathrm{KL}\left(q_\theta(w) \,\|\, p(w)\right).$$

Since the KL divergence is non-negative, the evidence lower bound is a lower bound on the log marginal likelihood. Maximizing it therefore minimizes the gap between the true and the approximate posterior.
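The step from the KL divergence to the evidence lower bound can be made explicit with a short decomposition. The notation here is an assumption of this sketch: $w$ denotes the model parameters, $X$ and $Y$ the observations and labels, and $q_\theta(w)$ the variational distribution with parameters $\theta$.

```latex
\begin{aligned}
\log p(Y \mid X)
  &= \mathbb{E}_{q_\theta(w)}\!\left[\log \frac{p(Y \mid X, w)\, p(w)}{q_\theta(w)}\right]
     + \mathrm{KL}\!\left(q_\theta(w) \,\|\, p(w \mid X, Y)\right) \\
  &= \underbrace{\mathbb{E}_{q_\theta(w)}\!\left[\log p(Y \mid X, w)\right]
     - \mathrm{KL}\!\left(q_\theta(w) \,\|\, p(w)\right)}_{\text{evidence lower bound}}
     + \underbrace{\mathrm{KL}\!\left(q_\theta(w) \,\|\, p(w \mid X, Y)\right)}_{\ge\, 0}
\end{aligned}
```

The left-hand side does not depend on $\theta$, so any increase in the evidence lower bound is matched by an equal decrease in the KL divergence to the true posterior.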

In this paper, we used the approach of Blundell et al. (2015), coined Bayes By Backprop (BBB), to maximize the evidence lower bound. Predictions for a data point $x$ were made by sampling $N$ sets of model parameters $w_i \sim q_\theta(w)$ from the variational posterior distribution and taking the mean of the predictions of all sampled models. That is:

$$\hat{y} = \frac{1}{N} \sum_{i=1}^{N} p(y \mid x, w_i).$$

Predictive uncertainty was computed as the variance in the predictions:

$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} \left( p(y \mid x, w_i) - \hat{y} \right)^2.$$
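The sampling procedure can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' implementation: the `predict_fn` and `sample_weights` interfaces are hypothetical, and a toy logistic regression with a factorized Gaussian posterior stands in for the trained BNN. The variance is taken with a $1/N$ normalisation, which is an assumption.

```python
import numpy as np


def mc_predict(predict_fn, sample_weights, x, n_samples=100, seed=None):
    """Monte Carlo estimate of the predictive mean and variance.

    predict_fn(x, w)    -> probabilities in [0, 1] for weights w (hypothetical interface)
    sample_weights(rng) -> one draw of the weights from the variational posterior
    """
    rng = np.random.default_rng(seed)
    preds = np.stack([predict_fn(x, sample_weights(rng)) for _ in range(n_samples)])
    # The mean over sampled models is the prediction; the variance is the uncertainty.
    return preds.mean(axis=0), preds.var(axis=0)


# Toy stand-in for a BNN: logistic regression with a factorized Gaussian posterior.
post_mean = np.array([1.0, -2.0])
post_std = np.array([0.1, 0.5])


def sample_weights(rng):
    return post_mean + post_std * rng.standard_normal(post_mean.shape)


def predict_fn(x, w):
    return 1.0 / (1.0 + np.exp(-(x @ w)))


x = np.array([[0.5, 0.1],   # small inputs: weight noise barely moves the logit
              [3.0, 3.0]])  # large inputs: weight noise is amplified
y_hat, sigma2 = mc_predict(predict_fn, sample_weights, x, n_samples=500, seed=0)
```

In this toy setting the second input, which lies further from the origin, receives the larger predictive variance because the same weight noise produces a larger spread in its logit.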


3 Methods

3.1 Data and Preprocessing

Data was obtained from the Multiparameter Intelligent Monitoring in Intensive Care (MIMIC-III v1.4) database (Johnson et al., 2016). This dataset contains signal data for 46,520 patients and 58,976 ICU admissions. A subset of the patients in the MIMIC-III dataset are newborns; these were excluded from the dataset as newborns come with different characteristics and treatment requirements. For each patient, features were constructed by aggregating relevant clinical information, lab values and vital signs. The objective of the model was to find patients that have a high risk of mortality during the admission. More information on the task and data processing is included in the supplementary material.

3.2 Model Architecture

Following Blundell et al. (2015), we used a Gaussian scale mixture as our prior. We used two 128-neuron hidden layers, ReLU intermediate activation and Sigmoid final activation to obtain the probability distribution $p(y \mid x, w)$. All weight distributions were initialized with and . Adam was used as the optimizer with the configurations as suggested by the authors (Kingma & Ba, 2014). [1]

[1] Code is available at
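To make the architecture concrete, the following NumPy sketch draws one set of weights and runs a forward pass through a network of the described shape (25 input features as in Appendix C, two hidden layers of 128 units, ReLU, Sigmoid output). It follows the Bayes By Backprop parameterisation $\sigma = \log(1 + e^{\rho})$ from Blundell et al. (2015). The initial values, the mixture weight `pi`, and the prior scales `sigma1` and `sigma2` are placeholders chosen for illustration, not the values used in the paper.

```python
import numpy as np


def softplus(rho):
    # Numerically stable log(1 + exp(rho)); maps rho to a positive standard deviation.
    return np.logaddexp(0.0, rho)


def init_layer(rng, n_in, n_out, rho_init=-3.0):
    # Placeholder initialisation; the last row of each matrix holds the bias.
    return {
        "mu": 0.1 * rng.standard_normal((n_in + 1, n_out)),
        "rho": np.full((n_in + 1, n_out), rho_init),
    }


def sample_layer(rng, layer):
    # Reparameterisation: w = mu + sigma * eps with eps ~ N(0, 1).
    sigma = softplus(layer["rho"])
    return layer["mu"] + sigma * rng.standard_normal(layer["mu"].shape)


def forward(rng, layers, x):
    """One stochastic forward pass: a fresh weight sample for every layer."""
    h = x
    for i, layer in enumerate(layers):
        w = sample_layer(rng, layer)
        h = h @ w[:-1] + w[-1]
        if i < len(layers) - 1:
            h = np.maximum(h, 0.0)           # ReLU on hidden layers
    return 1.0 / (1.0 + np.exp(-h[:, 0]))    # Sigmoid on the single output unit


def log_scale_mixture_prior(w, pi=0.5, sigma1=1.0, sigma2=0.1):
    """Log density of the two-component zero-mean Gaussian scale mixture prior."""
    def log_normal(v, s):
        return -0.5 * np.log(2.0 * np.pi) - np.log(s) - 0.5 * (v / s) ** 2
    return np.logaddexp(np.log(pi) + log_normal(w, sigma1),
                        np.log1p(-pi) + log_normal(w, sigma2)).sum()


rng = np.random.default_rng(0)
layers = [init_layer(rng, 25, 128), init_layer(rng, 128, 128), init_layer(rng, 128, 1)]
x = rng.standard_normal((4, 25))
p_first = forward(rng, layers, x)
p_second = forward(rng, layers, x)   # differs from p_first: new weights were sampled
```

Repeating `forward` many times on the same input and aggregating the outputs yields the predictive mean and variance used throughout the paper.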

4 Results

We discuss our results in the following two subsections.

4.1 Uncertainty Mitigates Prediction Loss

Figure 1: The reachable loss region in a Bayesian binary classification problem.
Figure 2: Cumulative prediction loss related to the amount of uncertain testing data included in the analysis. The uncertainty effectively mitigates prediction loss for both the BNN (a) and the Gradient Boosting model (b).

In the supplementary material we derive the upper and lower bounds of the binary cross-entropy loss $\mathcal{L}_{\mathrm{BCE}}$ with respect to the predictive uncertainty $\sigma^2$:

$$-\log\left(\frac{1 + \sqrt{1 - 4\sigma^2}}{2}\right) \le \mathcal{L}_{\mathrm{BCE}} \le -\log\left(\frac{1 - \sqrt{1 - 4\sigma^2}}{2}\right). \tag{7}$$

These are depicted in figure 1. In the supplementary material we show how the data follows these bounds. Areas of both low and high loss cannot be reached with high uncertainty. In other words, when the variational posterior distribution is a good approximation of the true posterior, far-from-domain examples that cause high uncertainty will mathematically drive the loss away from the extreme values. From this line of thought, we hypothesize that uncertainty is able to mitigate prediction loss. Deterministic models can output very confident predictions on inputs that lie far from their training domain, yielding high performance loss. From equation 7 we see that this is not possible for a BNN. Additionally, a highly certain wrong prediction may be caused by errors in the data, for example mislabelled observations. If the variational posterior is not a good approximation of the true posterior, low uncertainty can still be achieved on wrong predictions, with harmful consequences. Therefore, we evaluate empirically.

Figure 2 illustrates how uncertainty relates to predictive performance. When the observations are sorted according to their predictive uncertainty, it can be seen that the loss increases superlinearly for each additional data point included. That is, the most uncertain data contribute disproportionately more to the loss than the most certain data do. In the supplementary material, we show that for the most certain data the area under the receiver operating characteristic curve (AUROC) approaches unity. Thus, low-uncertainty patients are more often classified correctly. Therefore, by restricting classification to certain data points, we can effectively employ predictive uncertainty to calibrate the performance of the model.
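The analysis behind this kind of curve can be sketched as follows: sort the test points by predictive variance, accumulate the per-point binary cross-entropy, and optionally keep only the most certain fraction. The function names and the toy arrays are illustrative assumptions, not taken from the paper's code.

```python
import numpy as np


def bce(y, p, eps=1e-12):
    """Per-example binary cross-entropy."""
    p = np.clip(p, eps, 1.0 - eps)
    return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))


def cumulative_loss_by_uncertainty(y, p_mean, p_var):
    """Cumulative loss when points are added from most to least certain."""
    order = np.argsort(p_var, kind="stable")
    losses = bce(y[order], p_mean[order])
    fraction = np.arange(1, len(y) + 1) / len(y)
    return fraction, np.cumsum(losses)


def keep_most_certain(p_var, fraction):
    """Indices of the given fraction of points with the lowest uncertainty."""
    k = int(np.ceil(fraction * len(p_var)))
    return np.argsort(p_var, kind="stable")[:k]


# Toy data: two confident correct predictions, two uncertain ones.
y = np.array([1.0, 0.0, 1.0, 0.0])
p_mean = np.array([0.9, 0.1, 0.6, 0.7])
p_var = np.array([0.01, 0.02, 0.20, 0.15])

fraction, cum_loss = cumulative_loss_by_uncertainty(y, p_mean, p_var)
certain_idx = keep_most_certain(p_var, 0.5)
```

Plotting `cum_loss` against `fraction` gives a curve of the type shown in figure 2; a superlinear shape indicates that the uncertain tail carries most of the loss.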

A peculiar finding is depicted in figure 2(b). We observe that uncertainty generalizes to a gradient boosting decision tree model, trained on the same data. In the case of tabular data like MIMIC, this result could prove to be particularly interesting, as neural networks are often outperformed by other non-linear models in similar tasks (Fernández-Delgado et al., 2014). This observation could mean that a Bayesian model can be deployed for the sole purpose of its uncertainty output, leaving the classification task to a second higher-performing model. The total obtained cumulative loss is lower for the tree-based model, meaning that it had more confident correct predictions. This is likely due to the fact that the prediction risk for the decision tree is not mathematically bound to uncertainty. Note that we have not investigated when this does or does not hold. This remains worthy of further investigation.

4.2 Detecting Out-of-Domain Patients

Figure 3: In plot (a) we depict the predictive uncertainty related to the predictions. In plot (b) we see how the BNN effectively identifies out-of-domain examples.

Plotting the relationship between predictions and uncertainty results in the moon-shaped scatter plot depicted in figure 3(a). This corresponds to the intuition given earlier, showing that extreme predictions are only made on observations with low uncertainty. In the central region of the graph, we see that the uncertainty has a wider spread, meaning that there are both observations with feature values occurring frequently (low uncertainty) and patients with sets of features that lie further from the previously observed domain. As expected, most of the mass concentrates around the surviving group of patients.

In the medical field, research is often conducted on a biased sample of the patient population, and therefore minorities are underrepresented (Baird, 1999; Swanson & Ward, 1995; Giuliano et al., 2000; Bonevski et al., 2014). Regrettably, models and treatments are still applied to the true population, including these minorities, which can lead to adverse results. In the MIMIC dataset, there is a group of newborns. These were set aside as out-of-domain patients and excluded from training. We replaced obvious differences like weight and age with training data averages. We observe in figure 3(b) that when the model is presented with newborns, it becomes much more uncertain on average. This means that a high-uncertainty output can be used to warn a practitioner that the (combination of) symptoms and vital signs have rarely been observed before, and that he or she should proceed with care. A plot for the newborns similar to figure 3(a) is given in the supplementary material. Since the characteristics of newborns greatly differ from the rest of the population, we also experimented with uncertainty on ethnic minorities. Among others, Carson et al. (1999) investigated the differences in responses to heart failure therapies between ethnicities. African American people did show different responses as a result of different features compared to the baseline. This motivated us to investigate model uncertainty on such an ethnic minority. After setting it apart, we observed that the BNN became 130% more uncertain on this group. Comparison to a deterministic baseline on these tasks is given in the supplementary material.
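A comparison of this kind can be sketched with two small helpers: one computes the relative increase in mean uncertainty on a held-out group, the other flags new patients whose uncertainty exceeds a high quantile of the in-domain uncertainties. The 95% quantile threshold and the toy numbers are assumptions made for illustration; the paper does not specify a warning rule.

```python
import numpy as np


def relative_increase(u_in_domain, u_held_out):
    """Relative change in mean uncertainty of the held-out group versus in-domain data."""
    base = np.mean(u_in_domain)
    return (np.mean(u_held_out) - base) / base


def flag_uncertain(u_new, u_reference, quantile=0.95):
    """True where a new patient's uncertainty exceeds the chosen in-domain quantile."""
    threshold = np.quantile(u_reference, quantile)
    return np.asarray(u_new) > threshold


# Toy predictive variances (not results from the paper).
u_in_domain = np.array([0.010, 0.020, 0.030])
u_held_out = np.array([0.040, 0.052, 0.046])

increase = relative_increase(u_in_domain, u_held_out)   # 1.3 means "130% more uncertain"
warnings = flag_uncertain(u_held_out, u_in_domain)
```

In a deployed setting the threshold would have to be chosen together with clinicians, since it trades off missed out-of-domain patients against unnecessary warnings.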

5 Conclusions

In this paper, we explored the application of Bayesian Neural Networks to improve the safety of machine-learning-based clinical decision support tools in critical areas such as the ICU. Following the findings of Blundell et al. (2015), we trained a BNN on the MIMIC critical care dataset. The objective of the model was to predict patient mortality given clinical observations and lab values. We derived bounds on cross-entropy loss with respect to predictive uncertainty. Through these bounds, uncertainty is able to mitigate performance risk and loss. Empirically, we showed that the loss of test set predictions increased superlinearly when the patients were sorted according to their corresponding uncertainty. Secondly, the results reveal that the uncertainty of the predictions increases significantly on out-of-domain patients. This suggests that in an applied setting a BNN can effectively identify patients outside of its previously observed domain. Overall, this work demonstrates that uncertainty is effective in enhancing model trustworthiness and mitigating prediction risk and loss in a critical setting like the ICU. The mathematical intuition as to why uncertainty generalizes to other models remains unclear. This can be an interesting direction for future research.

6 Acknowledgements

This research was funded by Pacmed BV. We would like to thank the authors of the MIMIC-III dataset for allowing us access and usage rights.


Appendix A

Figure 4: Illustration of how the data follows the bounds.

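In this section, we show the derivation of the bounds on the binary cross-entropy criterion relating to predictive uncertainty. Uncertainty is computed as the variance $\sigma^2$ in the predictions for each datapoint given samples from a Bayesian posterior $q_\theta(w)$. We start off with the Bhatia-Davis inequality on the variance of a distribution with mean $\mu$, given a distribution maximum $M$ and distribution minimum $m$:

$$\sigma^2 \le (M - \mu)(\mu - m).$$

Solving the equality case for $\mu$,

$$\mu = \frac{(M + m) \pm \sqrt{(M - m)^2 - 4\sigma^2}}{2},$$

gives us bounds for the mean with respect to the variance:

$$\frac{(M + m) - \sqrt{(M - m)^2 - 4\sigma^2}}{2} \le \mu \le \frac{(M + m) + \sqrt{(M - m)^2 - 4\sigma^2}}{2}. \tag{11}$$

In our case $\mu = \hat{y}$ is the mean prediction and $\sigma^2$ is the predictive variance. Each prediction is obtained from a Sigmoid activation, therefore $M = 1$ and $m = 0$. Plugging these values and the bounds in 11 into the binary cross-entropy criterion,

$$\mathcal{L}_{\mathrm{BCE}} = -\left[ y \log \mu + (1 - y) \log(1 - \mu) \right],$$

gives us a single bound (for both $y = 1$ and $y = 0$) on the loss w.r.t. the uncertainty:

$$-\log\left(\frac{1 + \sqrt{1 - 4\sigma^2}}{2}\right) \le \mathcal{L}_{\mathrm{BCE}} \le -\log\left(\frac{1 - \sqrt{1 - 4\sigma^2}}{2}\right).$$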

In figure 4 we depict how the data follows these bounds.
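The derivation can be checked numerically with a small sketch. The functions below evaluate the bounds for predictions in $[0, 1]$ (so $M = 1$, $m = 0$); the function names are illustrative. Note that the Bhatia-Davis inequality is guaranteed for the population variance (normalised by $1/N$), which is what `numpy.var` computes by default.

```python
import numpy as np


def mean_bounds(var):
    """Range of means compatible with a given variance for values in [0, 1]."""
    var = np.clip(var, 0.0, 0.25)          # the variance on [0, 1] cannot exceed 1/4
    root = np.sqrt(1.0 - 4.0 * var)
    return (1.0 - root) / 2.0, (1.0 + root) / 2.0


def loss_bounds(var):
    """Lower and upper bound on the binary cross-entropy of the mean prediction."""
    lo_mean, hi_mean = mean_bounds(var)
    with np.errstate(divide="ignore"):     # var == 0 gives an infinite upper bound
        return -np.log(hi_mean), -np.log(lo_mean)


# With variance 0.16 the mean must lie in [0.2, 0.8],
# so the loss lies between -log(0.8) and -log(0.2).
lower, upper = loss_bounds(0.16)

# At the maximal variance 0.25 both bounds collapse to -log(0.5).
lower_max, upper_max = loss_bounds(0.25)
```

As the variance grows, the interval of reachable losses narrows around $-\log(0.5)$, which is the reachable region that figure 1 depicts.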

Appendix B

In figure 5 we show that the uncertainty can calibrate performance in AUROC for both a BNN (a) and a Gradient Boosting model (b), motivating further research into the use of uncertainty in combination with already deployed models.

Figure 5: Predictive performance measured in AUROC related to the amount of uncertain data included in the analysis.

Appendix C

Patient data from the MIMIC-III database was used (Johnson et al., 2016). Vital signs, lab values and patient characteristics that were most abundantly available were gathered. Examples are blood pressure, potassium and age, respectively. To keep interpretability, we restricted ourselves to 25 clinically relevant features. Arterial and non-invasive blood pressures were combined, using the arterial blood pressure where possible. Feature values that lay further than 8 interquartile ranges away were regarded as outliers and removed. Labels were obtained directly from the MIMIC tables, regarding death during the last hospital admission as a positive label. 9,237 ICU stays were set apart for testing purposes, leaving 36,944 patients for training.
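Two of these preprocessing steps can be sketched in NumPy. Several details are assumptions of this sketch rather than statements from the paper: missing measurements are encoded as NaN, the outlier distance is measured from the median, and "removed" is implemented as setting the value to missing.

```python
import numpy as np


def combine_blood_pressure(arterial, non_invasive):
    """Prefer the arterial measurement; fall back to the non-invasive one if it is missing."""
    return np.where(np.isnan(arterial), non_invasive, arterial)


def mask_outliers(values, n_iqr=8.0):
    """Set values further than n_iqr interquartile ranges from the median to NaN."""
    q1, median, q3 = np.nanpercentile(values, [25, 50, 75])
    iqr = q3 - q1
    cleaned = np.asarray(values, dtype=float).copy()
    cleaned[np.abs(cleaned - median) > n_iqr * iqr] = np.nan
    return cleaned


# Toy measurements (not MIMIC data).
arterial = np.array([120.0, np.nan, 110.0])
non_invasive = np.array([118.0, 125.0, np.nan])
combined = combine_blood_pressure(arterial, non_invasive)

heart_rate = np.array([70.0, 72.0, 75.0, 71.0, 74.0, 73.0, 1000.0, np.nan])
cleaned = mask_outliers(heart_rate)   # the 1000.0 entry is far outside 8 IQRs
```

With a threshold as wide as 8 interquartile ranges, only implausible recordings such as sensor or entry errors are dropped, while clinically extreme but real values remain.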

Appendix D

In figure 6 we observe how the predictions on the newborns follow the same moon shape as the trained dataset. However, average uncertainty is much higher.

Figure 6:

Appendix E

Following the approach of Hendrycks & Gimpel (2016), we compare how well sigmoid probabilities are able to detect correct classifications and out-of-domain patients compared to Bayesian uncertainty.

BNN STD 83.7 97.7 32.6
NN Sigmoid 79.7 95.8 37.7
Table 1: Comparison of error and success detection between deterministic baseline and BNN.

BNN STD 75.9 80.5 69.7
NN Sigmoid 49.8 61.7 42.5
Table 2: Comparison of in and out of domain detection between deterministic baseline and BNN.
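A detection comparison of this type can be sketched as follows: each model provides a score per patient, and the area under the ROC curve measures how well that score separates the two groups. For the deterministic network the confidence is taken as $\max(p, 1 - p)$, the binary analogue of the maximum softmax probability; for the BNN the predictive standard deviation serves as the score. The rank-based AUROC below and the toy numbers are illustrative assumptions, not the evaluation code behind the tables.

```python
import numpy as np


def sigmoid_confidence(p):
    """Confidence of a sigmoid output: distance from the decision boundary side."""
    p = np.asarray(p, dtype=float)
    return np.maximum(p, 1.0 - p)


def auroc(scores, labels):
    """AUROC via the Mann-Whitney U statistic, with average ranks for ties."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    order = np.argsort(scores, kind="mergesort")
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    for value in np.unique(scores):        # tied scores share their average rank
        tied = scores == value
        ranks[tied] = ranks[tied].mean()
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels].sum() - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


# Toy example: predictive standard deviations for in-domain and out-of-domain patients.
std_in_domain = np.array([0.01, 0.02, 0.03])
std_out_of_domain = np.array([0.05, 0.02, 0.06])
scores = np.concatenate([std_in_domain, std_out_of_domain])
is_out_of_domain = np.array([0, 0, 0, 1, 1, 1])

ood_auroc = auroc(scores, is_out_of_domain)
```

A value of 0.5 corresponds to a score that carries no information about group membership, and 1.0 to perfect separation.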