Analyzing the Role of Model Uncertainty for Electronic Health Records

06/10/2019, by Michael W. Dusenberry et al.

In medicine, both ethical and monetary costs of incorrect predictions can be significant, and the complexity of the problems often necessitates increasingly complex models. Recent work has shown that changing just the random seed is enough for otherwise well-tuned deep neural networks to vary in their individual predicted probabilities. In light of this, we investigate the role of model uncertainty methods in the medical domain. Using RNN ensembles and various Bayesian RNNs, we show that population-level metrics, such as AUC-PR, AUC-ROC, log-likelihood, and calibration error, do not capture model uncertainty. Meanwhile, the presence of significant variability in patient-specific predictions and optimal decisions motivates the need for capturing model uncertainty. Understanding the uncertainty for individual patients is an area with clear clinical impact, such as determining when a model decision is likely to be brittle. We further show that RNNs with only Bayesian embeddings can be a more efficient way to capture model uncertainty compared to ensembles, and we analyze how model uncertainty is impacted across individual input features and patient subgroups.


1 Introduction

Machine learning has achieved great and increasing levels of success in the last several years on many well-known benchmark datasets. This has led to mounting interest in non-traditional problems and domains, each of which brings its own requirements. In medicine specifically, individualized predictions are of great importance to the field Council et al. (2011), and there can be severe costs for incorrect predictions and decisions due to the risk to human life and the associated ethical concerns Gillon (1994).

Existing state-of-the-art approaches using deep neural networks in medicine often make use of either a single model or an average over a small ensemble of models, focusing on improving the accuracy of probabilistic predictions Harutyunyan et al. (2017); Rajkomar et al. (2018a); Xu et al. (2018); Choi et al. (2018). These works, while focusing on capturing the data uncertainty, do not address the model uncertainty that is inherent in fitting deep neural networks. For example, when predicting patient mortality in an ICU setting, existing approaches might be able to achieve high AUC-ROC, but will be unable to differentiate between patients for whom the model is certain about its probabilistic prediction, and those for whom the model is fairly uncertain.

In this paper, we examine the use of model uncertainty specifically in the context of predictive medicine. Methods for capturing model uncertainty have seen many methodological advances in recent years, including reparameterization-based variational Bayesian neural networks (Blundell et al., 2015; Kucukelbir et al., 2017; Louizos and Welling, 2017), Monte Carlo dropout (Gal and Ghahramani, 2016), ensembles (Lakshminarayanan et al., 2017), and function priors (Hafner et al., 2018; Garnelo et al., 2018; Malinin and Gales, 2018). In order to directly impact clinical care, these methods raise several natural questions:

  • How do the realized functions in any of the approaches, such as individual models in the ensemble approach, compare in terms of metric performance such as AUC-PR, AUC-ROC, or log-likelihood?

  • If and how does model uncertainty assist in calibrating predictions?

  • What is the effect of uncertainty on predictions across patient subgroups, such as by race, gender, age, or length of stay?

  • Which feature values are responsible for the highest model uncertainty?

  • How does model uncertainty affect optimal decisions made under a given clinically-relevant cost function?

Contributions

Using sequence models on the MIMIC-III clinical dataset (Johnson et al., 2016), we make several important findings. For the ensembling approach of quantifying model uncertainty, we find that the models within the ensemble are nearly identical in terms of dataset-level metric performance, despite each model yielding different patient-specific predictions. In addition to the strong metric performance, the models in the ensemble also appear to be well-calibrated. It is therefore likely that any one of these models could be selected in practice if we were only using one model for our clinical problem. Furthermore, we see that predictive uncertainty due to model uncertainty extends into the space of optimal decisions. That is, models with nearly equivalent performance can disagree significantly on the final decision.

Overall, we show that model uncertainty is not captured by dataset-level (i.e., population-level) metrics, such as AUC-PR, AUC-ROC, log-likelihood, and calibration error. Rather, the significant variability in sample-specific (i.e., patient-specific) predictions and decisions motivates the importance of model uncertainty; this is an area with clear clinical impact. Additionally, we show that models with Bayesian embeddings can be a more efficient way to capture model uncertainty compared to ensembles, and we analyze how model uncertainty is impacted across individual input features and patient subgroups.

2 Background

Data uncertainty

Data uncertainty can be viewed as uncertainty regarding a given outcome due to incomplete information, and is also known as “output uncertainty” or “risk” Knight (1957). For binary tasks, this equates to a single probability value. More specifically, this can be described as

$y \mid x, \theta \sim \mathrm{Bernoulli}(\mu), \quad \mu = f(x; \theta), \qquad (1)$

where the model $f$, as a function of the inputs $x$ and parameters $\theta$, outputs the parameter $\mu$ for a Bernoulli distribution representing the conditional distribution $p(y \mid x, \theta)$ for the outcome $y$.

Model uncertainty

Model uncertainty can be viewed as uncertainty in the correct values of the parameters for the predictive outcome distribution due to not knowing the true function. For binary tasks, this equates to a distribution of plausible probability values for a Bernoulli distribution, corresponding to a set of plausible functions. More specifically, this can be described as

$\theta \sim p(\theta \mid \mathcal{D}), \quad \mu = f(x; \theta), \quad y \mid x, \theta \sim \mathrm{Bernoulli}(\mu), \qquad (2)$

where there is a distribution over the Bernoulli parameter $\mu$ for a given example $x$ that represents uncertainty in the true outcome distribution due to uncertainty in function space. For the remainder of the paper, we will use the phrase predictive uncertainty distribution to refer to this distribution over the parameter of the outcome distribution.

Deep Ensembles

Deep ensembles (Lakshminarayanan et al., 2017) is a method for quantifying model uncertainty. In this approach, an ensemble of deterministic[1] neural networks is trained by varying only the random seed of an otherwise well-tuned model. Given this ensemble, predictions can be made with each model for a given example $x$, where (for a binary task) each prediction is the probability parameter $\mu$ for the Bernoulli distribution over the outcome. The set of probabilistic predictions for the same example can then be viewed as an empirical distribution over $\mu$, where this distribution represents model uncertainty.

[1] We use the term “deterministic” to refer to the usual setup in which we optimize the parameter values of our function directly, yielding a trained model with fixed parameter values at test time.
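As a concrete illustration of this construction, the following NumPy sketch collects per-model predictions for one patient into an empirical predictive uncertainty distribution; the probability values and summary statistics are illustrative placeholders, not outputs of the paper's models.

```python
import numpy as np

# Predicted Bernoulli parameters for one patient from M trained ensemble
# members (illustrative values, not outputs of the paper's models).
ensemble_probs = np.array([0.12, 0.31, 0.22, 0.45, 0.18, 0.27, 0.39, 0.15])

# The set of M probabilities is treated as an empirical predictive
# uncertainty distribution over the Bernoulli parameter mu.
mean_risk = ensemble_probs.mean()        # marginalized (averaged) prediction
std_risk = ensemble_probs.std()          # spread due to model uncertainty
spread = ensemble_probs.max() - ensemble_probs.min()

print(f"mean={mean_risk:.3f}, std={std_risk:.3f}, max-min spread={spread:.3f}")
```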

Bayesian RNNs

Bayesian RNNs (Fortunato et al., 2017) are RNNs in which the parameters are represented by distributions. This allows us to express model uncertainty as uncertainty over the true values for the parameters in the model, i.e., “weight uncertainty” (Blundell et al., 2015). By introducing a distribution over all, or a subset, of the weights in the model, we can induce different functions, and thus different outcomes, through realizations of different weight values via draws from the posterior distributions. This allows us to empirically capture model uncertainty in the predictive uncertainty distribution by drawing samples from a trained Bayesian RNN for a given example.
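A minimal sketch of weight uncertainty, assuming a single affine layer with a factorized normal posterior over its weights; a Bayesian RNN applies the same reparameterized sampling to its recurrent (and possibly embedding and output) weight tensors. The names, toy dimensions, and softplus parameterization below are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Factorized normal posterior over a single weight matrix and bias
# (toy dimensions; in practice the parameters are learned variationally).
d_in, d_out = 4, 1
w_mu = rng.normal(size=(d_in, d_out))
w_rho = np.full((d_in, d_out), -3.0)     # softplus(rho) gives the std dev
b_mu = np.zeros(d_out)
b_rho = np.full(d_out, -3.0)

def sample_prediction(x):
    """Draw one function realization by sampling weights from the posterior."""
    w_sigma = np.log1p(np.exp(w_rho))
    b_sigma = np.log1p(np.exp(b_rho))
    w = w_mu + w_sigma * rng.normal(size=w_mu.shape)
    b = b_mu + b_sigma * rng.normal(size=b_mu.shape)
    return sigmoid(x @ w + b)

x = rng.normal(size=(1, d_in))
samples = np.concatenate([sample_prediction(x) for _ in range(200)]).ravel()
print(f"predictive uncertainty: mean={samples.mean():.3f}, std={samples.std():.3f}")
```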

3 Medical Uncertainty

3.1 Mortality Prediction From Medical Records

Similar to Rajkomar et al. (2018a), we train deep RNNs to predict patient mortality on MIMIC-III (Johnson et al., 2016), an EHR dataset collected from 46,520 patients admitted to ICUs at Beth Israel Deaconess Medical Center, of whom 9,974 expired during the encounter (i.e., roughly a 1:4 ratio between positive and negative samples). Our model embeds and aggregates a patient’s time-series features (e.g., medications, lab measures) and global features (e.g., gender, age), feeds them to one or more LSTM layers (Hochreiter and Schmidhuber, 1997), and follows that with fully-connected hidden and output affine layers, with a final sigmoid output for binary prediction. See the Appendix for more details.

Existing deep learning approaches in predictive medicine focus on capturing data uncertainty, namely accurately predicting the risk of a patient’s mortality (i.e., how likely is the patient to expire?). This work, on the other hand, also focuses on addressing the model uncertainty aspect of deep learning, namely the distribution over the risk of mortality for a patient (i.e., are there alternative risk predictions, and if so, how diverse are they?).

3.2 Predictive Uncertainty Distributions

Figure 1: A histogram of predictions from the RNN models in the ensemble for the probability of mortality for a given patient in the ICU. The predictions form a predictive uncertainty distribution for the patient, where the disagreement is due to model uncertainty. This is not captured when using a single model or an average over an ensemble.

Figure 2: A plot of the mean versus standard deviation of the predictive uncertainty distributions of the deterministic ensemble for positive and negative patients in the validation set. We find that the standard deviations are not simple functions of the mean, and are instead conditioned on each individual patient. For reference, we note that the variances of the distributions are generally lower than the Bernoulli variance curve.

To quantify model uncertainty for our mortality prediction task, we explore the use of deep RNN ensembles and various Bayesian RNNs. For the deep ensembles approach, we optimize for the ideal hyperparameter values for our RNN model via black-box Bayesian optimization Golovin et al. (2017), and then train replicas of the best model, where only the random seed differs between the replicas. At prediction time, we make predictions with all models for each patient. For the Bayesian RNNs, we take a variational inference approach by adapting our RNNs to use factorized weight posteriors, where each weight tensor $W$ in the model is represented by a normal distribution $q(W) = \mathcal{N}(\mu_W, \mathrm{diag}(\sigma_W^2))$ with learnable mean and diagonal covariance parameters. Normal distributions with zero mean and tunable standard deviation are used as weight priors. We train our models by minimizing the KL divergence

$\mathrm{KL}\left[\, q(\theta) \,\|\, p(\theta \mid \mathcal{D}) \,\right] = -\,\mathbb{E}_{q(\theta)}\!\left[\log p(\mathbf{y} \mid \mathbf{X}, \theta)\right] + \mathrm{KL}\left[\, q(\theta) \,\|\, p(\theta) \,\right] + \text{const}, \qquad (3)$

between the approximate weight posterior $q(\theta)$ and the true, but unknown, posterior $p(\theta \mid \mathcal{D})$, which overall equates to minimizing an expectation over the usual negative log-likelihood term plus a KL regularization term. To easily shift between the deterministic and Bayesian models, we make use of the Bayesian Layers (Tran et al., 2018) abstractions.
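The following sketch spells out one Monte Carlo estimate of this objective for a diagonal-normal posterior and a zero-mean normal prior, using the closed-form KL between normals; the predicted probabilities and posterior parameters are stand-ins rather than outputs of an actual RNN, and the annealing of the KL term used during training is omitted.

```python
import numpy as np

def kl_diag_normal(q_mu, q_sigma, p_sigma):
    """Closed-form KL( N(q_mu, diag(q_sigma^2)) || N(0, p_sigma^2 I) )."""
    return np.sum(
        np.log(p_sigma / q_sigma)
        + (q_sigma**2 + q_mu**2) / (2.0 * p_sigma**2)
        - 0.5
    )

def bernoulli_nll(y, mu, eps=1e-7):
    """Negative log-likelihood of binary labels under predicted probabilities."""
    mu = np.clip(mu, eps, 1.0 - eps)
    return -np.sum(y * np.log(mu) + (1.0 - y) * np.log(1.0 - mu))

# One Monte Carlo sample of the weights yields predicted probabilities `mu`;
# the loss is the (sampled) expected NLL plus the KL regularization term.
y = np.array([1, 0, 0, 1])
mu = np.array([0.8, 0.3, 0.1, 0.6])      # predictions under sampled weights (toy)
q_mu = np.array([0.10, -0.20, 0.05])     # posterior means (toy)
q_sigma = np.array([0.30, 0.25, 0.40])   # posterior standard deviations (toy)

loss = bernoulli_nll(y, mu) + kl_diag_normal(q_mu, q_sigma, p_sigma=1.0)
print(f"negative ELBO (one Monte Carlo sample): {loss:.3f}")
```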

Figure 1 visualizes the predictive uncertainty distribution for a single patient. We find that there is wide variability in predicted Bernoulli probabilities for some patients, which we will show is often of concern for clinical practice.

3.3 Optimal Decisions Via Sensitivity Requirements

While predicted probabilities of the binary mortality outcome are important, the key desire for clinical practice is to make a decision. Given a set of potential outcomes $\{C_k\}$, a set of conditional probabilities for the given outcomes, and the associated costs $L_{kj}$ of correctly and incorrectly predicting the correct outcome, an optimal decision can be determined by minimizing the expected decision loss

$\mathbb{E}[L] = \sum_{k} \sum_{j} \int_{\mathcal{R}_j} L_{kj}\, p(x, C_k)\, \mathrm{d}x, \qquad (4)$

where $\mathcal{R}_j$ is the decision region for assigning example $x$ to class $C_j$, and $p(x, C_k)$ is the joint density of $x$ and $C_k$ Bishop (2006).

Unfortunately, designing decision cost functions for clinical applications is difficult, and is a research problem in itself. That being said, we do have an alternative target that is already clinically relevant: sensitivity requirements. Often in clinical research, certain sensitivity (i.e., recall) levels must be met when making binary predictions in order for a model to be clinically relevant. The goal in these cases is to maximize the precision while still reaching the required recall level. Viewed as a decision cost function, the cost is infinite if the recall is below the target level, and is otherwise minimized as the precision is increased, where the optimized parameter is the global probability threshold.
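One straightforward way to implement this cost function is to sweep candidate probability thresholds, discard those that miss the recall requirement, and keep the one with the highest precision. The sketch below assumes synthetic labels and probabilities and an illustrative target recall of 0.9; it is not the paper's implementation.

```python
import numpy as np

def pick_threshold(y_true, probs, target_recall):
    """Choose the global probability threshold that maximizes precision
    subject to meeting the required recall (sensitivity) level."""
    best_tau, best_precision = None, -1.0
    for tau in np.unique(probs):
        preds = probs >= tau
        tp = np.sum(preds & (y_true == 1))
        fn = np.sum(~preds & (y_true == 1))
        fp = np.sum(preds & (y_true == 0))
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        if recall >= target_recall and precision > best_precision:
            best_tau, best_precision = tau, precision
    return best_tau

# Illustrative usage with synthetic labels and probabilities (not MIMIC-III).
rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=1000)
p = np.clip(rng.normal(0.3 + 0.4 * y, 0.2), 0.0, 1.0)
tau = pick_threshold(y, p, target_recall=0.9)
print(f"selected threshold: {tau:.3f}")
```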

For each of the models in our ensemble, we can optimize the sensitivity-based decision cost function and make optimal decisions for all examples. Thus, for each example, there will be a set of optimal decisions, forming a distribution. The optimal decision $d$ then becomes a random variable

$d \mid x \sim \mathrm{Bernoulli}(q), \qquad q = \frac{1}{M} \sum_{m=1}^{M} \mathbb{1}\!\left[\mu_m \geq \tau_m\right], \qquad (5)$

where $q$ is the percentage of the set of probability values $\{\mu_m\}$ for a given example that are greater than or equal to the corresponding optimized decision threshold $\tau_m$.
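A small sketch of this construction: each ensemble member applies its own optimized threshold to its predicted probability, and the fraction of positive votes gives the parameter of the decision distribution. The probabilities and thresholds below are placeholders.

```python
import numpy as np

# Per-model predicted probabilities for one patient and each model's
# separately optimized decision threshold (placeholder values).
ensemble_probs = np.array([0.12, 0.31, 0.22, 0.45, 0.18, 0.27, 0.39, 0.15])
thresholds     = np.array([0.20, 0.25, 0.30, 0.22, 0.28, 0.24, 0.26, 0.21])

decisions = (ensemble_probs >= thresholds).astype(int)  # one decision per model
q = decisions.mean()  # fraction of models deciding "positive", as in Eq. (5)
print(f"P(d = 1) ~= {q:.2f}, per-model decisions = {decisions}")
```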

4 Experiments

We perform four sets of experiments. First, we examine the relationship of individual model samples across clinical metrics, calibration, uncertainty distributions, and decision-making. Second, we examine where uncertainty in the model matters most. Third, we examine patterns in uncertainty across patient subgroups. Finally, we examine patterns in uncertainty across individual features.

4.1 Deep RNN Ensemble for Mortality Prediction


Metric         Mean     Std. Dev.
Val. AUC-PR    0.4496   0.0025
Val. AUC-ROC   0.8753   0.0019
Test AUC-PR    0.3886   0.0059
Test AUC-ROC   0.8623   0.0031
Table 1: Metric performance statistics for the mortality task across models in the deterministic RNN ensemble. The models are nearly identical in terms of dataset-level performance.

Clinical Metrics

We measure the metric performance of each model in our ensemble on the mortality task, where our clinically-relevant metrics are AUC-PR and AUC-ROC. Table 1 shows the mean and standard deviation of the metrics for the models in our deterministic ensemble on both the validation and test sets. For the ensemble, we find that the models are nearly equivalent in terms of performance, and it is highly likely that any one could have been selected in practice if we were only using one model for our clinical problem.


Calibration Measure   Mean     Std. Dev.
Val. ECE              0.0176   0.0040
Val. ACE              0.0210   0.0042
Test ECE              0.0162   0.0043
Test ACE              0.0233   0.0057
Table 2: Calibration error mean and standard deviation statistics for the models in the deterministic RNN ensemble (lower values are better). We find that the models are well-calibrated, thus limiting concerns about over- or under-confident predictions.

Calibration

A model is said to be perfectly calibrated if, for all examples for which the model produces the same predicted probability $p$ for some outcome, the percentage of those examples truly associated with the outcome is equal to $p$, across all values of $p$. If a model is systematically over- or under-confident, it can be difficult to reliably use its predicted probabilities for decision making. Using the expected calibration error (ECE; Naeini et al., 2015) as a tractable way to approximate the calibration of a model given a finite dataset, we measure the calibration of each of the models in our deterministic ensemble. We also make use of the adaptive calibration error (ACE; Nixon et al., 2019). Table 2 shows the mean and standard deviation of the calibration metrics for the models in our deterministic ensemble. The models are all well-calibrated, indicating that they could each be nearly equally reliable in practice.
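For concreteness, a standard equal-width-bin formulation of ECE is sketched below; the exact binning scheme (and the adaptive binning behind ACE) follows Naeini et al. (2015) and Nixon et al. (2019) in the paper and may differ in detail from this sketch.

```python
import numpy as np

def expected_calibration_error(y_true, probs, n_bins=10):
    """Equal-width-bin ECE: weighted average of |positive rate - confidence|."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i in range(n_bins):
        lo, hi = bins[i], bins[i + 1]
        if i < n_bins - 1:
            mask = (probs >= lo) & (probs < hi)
        else:
            mask = (probs >= lo) & (probs <= hi)
        if not mask.any():
            continue
        confidence = probs[mask].mean()
        positive_rate = y_true[mask].mean()   # empirical accuracy of the bin
        ece += mask.mean() * abs(positive_rate - confidence)
    return ece

# Illustrative usage with synthetic, well-calibrated predictions.
rng = np.random.default_rng(2)
p = rng.uniform(size=5000)
y = rng.binomial(1, p)
print(f"ECE ~= {expected_calibration_error(y, p):.4f}")  # should be small
```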

Predictive Uncertainty Distributions & Statistics

Knowing that the models in our ensemble are well-calibrated and effectively equivalent in terms of performance, we turn to making predictions for individual examples. As seen previously in Figure 1, the predictive uncertainty distributions can be wide for some patients. Figure 2 visualizes the means versus standard deviations of the predictive uncertainty distributions for the deterministic ensemble on all validation set examples. In contrast to the variance of a Bernoulli distribution, which is a simple function of the mean, we find that the standard deviations are patient-specific, and thus cannot be determined a priori.
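To make the comparison in Figure 2 concrete, the sketch below computes per-patient means and standard deviations from placeholder ensemble predictions and compares them against the Bernoulli standard deviation curve sqrt(mu(1 - mu)); the arrays stand in for the ensemble's actual outputs.

```python
import numpy as np

# Placeholder ensemble predictions with shape (n_models, n_patients).
rng = np.random.default_rng(5)
probs = np.clip(rng.normal(0.25, 0.15, size=(20, 500)), 0.0, 1.0)

means = probs.mean(axis=0)                    # per-patient mean risk
stds = probs.std(axis=0)                      # per-patient model uncertainty
bernoulli_std = np.sqrt(means * (1 - means))  # reference curve from Figure 2

# The ensemble spread is typically below the Bernoulli curve and is not a
# deterministic function of the mean: patients with similar means can differ.
frac_below = (stds < bernoulli_std).mean()
print(f"fraction of patients with std below the Bernoulli curve: {frac_below:.2f}")
```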

Figure 3: Left two: A set of predictive uncertainty distributions from our deterministic ensemble for two patients in the validation dataset on the mortality task. Right two: The corresponding optimal decision distributions. For some patients, such as the one on the left, the ensemble is relatively certain about the optimal decision, while for other patients, such as the one on the right, there is a large amount of uncertainty.

Optimal Decision Distributions & Statistics

In practice, model uncertainty is only important if it affects the decisions one would make. To test this, we optimize the recall-based decision cost function with respect to the probability threshold for each model separately to achieve the required recall level, and then make optimal decisions for each example with each of the models. Figure 3 visualizes how model uncertainty in probability space is realized in optimal decision space for two patients. We see that the model uncertainty does indeed extend into the optimal decision space, converting the optimal decision into a random variable. Furthermore, the decision distribution’s variance can be quite high, and knowing when this is the case is important in order to avoid the cost of incorrect decisions.

Additional Clinical Tasks

We additionally examined the role of model uncertainty with the deterministic ensemble on the Agency for Healthcare Research and Quality's Clinical Classifications Software (CCS) multiclass task, where ICD-9-CM diagnosis codes are categorized into 285 distinct categories. See the Appendix for details.

4.2 Bayesian RNNs for Mortality Prediction


Model                        Val. AUC-PR  Val. AUC-ROC  Val. log-lik.  Test AUC-PR  Test AUC-ROC  Test log-lik.
Deterministic Ensemble       0.4550       0.8774        -0.7113        0.3921       0.8646        -0.7148
Bayesian Embeddings          0.4551       0.8773        -0.7153        0.3965       0.8614        -0.7186
Bayesian Output              0.4358       0.8714        -0.7058        0.3679       0.8570        -0.7087
Bayesian Hidden+Output       0.4480       0.8749        -0.7118        0.3885       0.8606        -0.7149
Bayesian RNN+Hidden+Output   0.4384       0.8674        -0.7097        0.3844       0.8541        -0.7125
Fully Bayesian               0.4328       0.8691        -0.7100        0.3804       0.8549        -0.7133
Table 3: Metrics for marginalized predictions on the mortality task for an average over models in the deterministic RNN ensemble, and samples from each of the Bayesian RNN models. Held-out log-likelihood values are normalized by the number of dataset examples.

A natural question in practice is where precisely to place uncertainty in the model. To investigate this, we study Bayesian RNNs in which different subsets of the parameters are stochastic (a minimal sketch of the embeddings-only variant follows the list):

  • Bayesian Embeddings: A Bayesian RNN in which the embedding parameters are stochastic, and all other parameters are deterministic.

  • Bayesian Output: A Bayesian RNN in which the output layer parameters are stochastic, and all other parameters are deterministic.

  • Bayesian Hidden+Output: A Bayesian RNN in which the hidden and output layer parameters are stochastic, and all other parameters are deterministic.

  • Bayesian RNN+Hidden+Output: A Bayesian RNN in which the LSTM, hidden, and output layer parameters are stochastic, and all other parameters are deterministic.

  • Fully Bayesian: A Bayesian RNN in which all parameters are stochastic.
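As a rough sketch of the Bayesian Embeddings variant (the paper's implementation uses Bayesian Layers; the table sizes and parameterization below are illustrative assumptions), only the embedding table carries a factorized normal posterior, and each forward pass samples a new realization of the embedding vectors.

```python
import numpy as np

rng = np.random.default_rng(3)
vocab_size, embed_dim = 1000, 16

# Factorized normal posterior over the embedding table only; the RNN, hidden,
# and output parameters would remain deterministic point estimates.
emb_mu = rng.normal(scale=0.1, size=(vocab_size, embed_dim))
emb_rho = np.full((vocab_size, embed_dim), -4.0)   # softplus(rho) = std dev

def sample_embeddings(token_ids):
    """Sample one realization of the embedding vectors for a token sequence."""
    sigma = np.log1p(np.exp(emb_rho[token_ids]))
    return emb_mu[token_ids] + sigma * rng.normal(size=(len(token_ids), embed_dim))

tokens = np.array([5, 17, 999, 42])
# Different calls produce different embeddings, inducing different functions
# and hence a predictive uncertainty distribution downstream.
e1, e2 = sample_embeddings(tokens), sample_embeddings(tokens)
print(f"mean per-dimension shift between samples: {np.abs(e1 - e2).mean():.4f}")
```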

Table 3 displays the metrics over marginalized predictions for each of the Bayesian RNN models and the deterministic RNN ensemble on the mortality task. We find that the Bayesian Embeddings RNN model outperforms all other Bayesian variants and slightly outperforms the deterministic ensemble in terms of AUC-PR. Additionally, all of the Bayesian variants are either comparable to or better than the deterministic ensemble in terms of held-out log-likelihood.

Figure 4: Predictive uncertainty distributions of both the RNN with Bayesian embeddings and the deterministic RNN ensemble for individual patients. We find that the Bayesian model is qualitatively able to capture model uncertainty that is quite similar to that of the ensemble.

Figure 4 visualizes the predictive distributions of both the Bayesian RNN and the deterministic RNN ensemble for four individual patients. We find that the Bayesian model is qualitatively able to capture model uncertainty that is quite similar to that of the deterministic ensemble.

4.3 Patient Subgroup Analysis

We next turn to an exploration of the effects of model uncertainty across patient subgroups. For this analysis, we use the deterministic RNN ensemble. We split patients into subgroups by demographic characteristics, namely gender (male vs. female) or age (adult patients divided into quartiles, with neonates as a separate fifth group). We stratify our performance metrics by subgroup and examine correlations between these metrics to evaluate whether the ensemble models tend to specialize to one or more subgroups at the cost of performance on others. We find some evidence of this phenomenon: for example, AUC-PR for male patients is negatively correlated with AUC-PR for female patients (see Figure 5), and AUC-PR for the oldest quartile of adult patients is somewhat negatively correlated with AUC-PR for other adults and for neonates.

We also compare uncertainty metrics across subgroups, including standard deviation and range of the predictive uncertainty distributions and variance of the optimal decision distributions for patients in each subgroup. We find that all metrics are correlated with subgroup label prevalence: both uncertainty and mortality rate increase monotonically across age groups (Figure 5), and both are slightly higher in women than in men. These findings imply that random model variation during training may actually cause unintentional harm to certain patient populations, which may not be reflected in aggregate performance.
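A sketch of this analysis on placeholder data (random arrays standing in for the ensemble's predictions and patient demographics): stratified AUC-PR is computed per model and subgroup with scikit-learn, and the per-model metrics are correlated across subgroups with SciPy.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(4)

# Placeholder labels, gender indicator, and per-model predicted probabilities
# with shape (n_models, n_patients); real inputs would come from the ensemble.
n_models, n_patients = 20, 2000
y = rng.integers(0, 2, size=n_patients)
is_male = rng.integers(0, 2, size=n_patients).astype(bool)
probs = np.clip(rng.normal(0.3 + 0.4 * y, 0.25, size=(n_models, n_patients)), 0, 1)

# Stratified AUC-PR per model and subgroup.
aucpr_male = [average_precision_score(y[is_male], p[is_male]) for p in probs]
aucpr_female = [average_precision_score(y[~is_male], p[~is_male]) for p in probs]

# Correlation across ensemble members; a negative value suggests that models
# trade performance between the two subgroups.
r, _ = pearsonr(aucpr_male, aucpr_female)
print(f"Pearson r between male and female AUC-PR across models: {r:.3f}")
```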

Figure 5: Left: Model performance comparison on male vs. female patients. Each point represents stratified AUC-PR for a single model. The correlation coefficient is negative. Right: Summary of uncertainty measures within each age subgroup. On all measures, uncertainty increases monotonically with age. This corresponds to an increase in mortality rate with age, as positive cases are more uncertain on average.

4.4 Embedding Uncertainty Analysis

Figure 6: Correlation between the entropy of the Bayesian embedding distributions for free-text clinical notes and the associated word frequency. We find that rarer words are associated with higher model uncertainty, with some level of variance at a given frequency.

Lowest Entropy                            Highest Entropy
Word      Entropy    Frequency            Word              Entropy    Frequency
the       -82.5445   41803                24pm              -16.0790   336
and       -80.6055   42812                labwork           -16.0750   272
of        -80.2735   43191                colonial          -16.0690   198
no        -79.8994   43420                zoysn             -16.0601   269
tracing   -78.5988   32181                ht                -16.0523   515
is        -78.5553   42560                txcf              -15.9982   112
to        -77.6408   42365                arrangements      -15.9795   407
for       -76.8005   42972                parvus            -15.9773   132
with      -75.3513   42819                nas               -15.9164   251
in        -72.8006   42144                anesthesiologist  -15.8796   220
Table 4: Top and bottom 10 words in free-text clinical notes based on their associated Bayesian embedding distribution’s entropy, along with their frequency in the training dataset.

Another motivation for model uncertainty lies in understanding which feature values are most responsible for the variance of the predictive uncertainty distribution. Our *rnn with Bayesian embeddings model is particularly well suited for this task in that the uncertainty in embedding space directly corresponds to the predictive uncertainty distribution and represents uncertainty associated with the discrete feature values. Understanding model uncertainty associated with features can allow us to recognize particularly difficult examples and understand which feature values are leading to the difficulties. Additionally, it provides a means of determining the types of examples that could be beneficial to add to the training dataset for future updates to the model.

For this analysis, we focus on the free-text clinical notes found in the EHR. For each word in the notes vocabulary, we have an associated embedding distribution formulated as a multivariate normal. We rank each word by its level of model uncertainty, measured by its embedding distribution’s entropy. Table 4 lists the top and bottom 10 words, along with their frequencies in the training dataset. We find that common words, both subjectively and based on prevalence counts, have low entropy and thus limited model uncertainty, while rarer words have higher entropy, which corresponds to higher model uncertainty. We additionally measure the correlation between entropy and word frequency, as visualized in Figure 6. We find further confirmation that rarer words are associated with higher model uncertainty, although there is some variance at a given frequency.
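Because each embedding distribution is a diagonal-covariance normal, its differential entropy has a closed form that depends only on the per-dimension standard deviations, so ranking words reduces to summing log standard deviations. The sketch below uses a toy vocabulary and hand-picked standard deviations rather than learned posteriors.

```python
import numpy as np

def diag_normal_entropy(sigma):
    """Differential entropy of N(mu, diag(sigma^2)); independent of the mean."""
    d = sigma.shape[-1]
    return 0.5 * d * np.log(2.0 * np.pi * np.e) + np.sum(np.log(sigma), axis=-1)

# Toy vocabulary with per-dimension embedding std devs (illustrative values,
# not learned posteriors from the paper's model).
vocab = ["the", "and", "labwork", "parvus"]
sigmas = np.array([
    np.full(32, 0.02),   # common word: small posterior standard deviations
    np.full(32, 0.03),
    np.full(32, 0.40),   # rare word: larger posterior standard deviations
    np.full(32, 0.55),
])

entropies = diag_normal_entropy(sigmas)
for word, h in sorted(zip(vocab, entropies), key=lambda t: t[1]):
    print(f"{word:>10s}  entropy = {h:8.2f}")
```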

5 Discussion

In this work, we demonstrated the need for capturing model uncertainty in medicine and examined methods to do so. Our experiments showed multiple findings. For example, an ensemble of deterministic RNNs captures individualized uncertainty conditioned on each patient, while the models each maintained nearly equivalent clinically-relevant dataset-level metrics. As another example, we found that models need only be uncertain around the embeddings to remain competitive in performance, with the added benefit of being able to determine the level of model uncertainty associated with individual feature values. Furthermore, using model uncertainty methods, we examined patterns in uncertainty across patient subgroups, showing that models can exhibit higher levels of uncertainty for certain groups.

Future work includes designing more specific decision cost functions based on both quantified medical ethics Gillon (1994) and monetary axes, as well as exploring methods to reduce model uncertainty at both training and prediction time.

References

Appendix A CCS Multiclass Task


Metric                 Mean        Std. Dev.
Val. top-5 recall      0.7126      0.0071
Val. top-5 precision   0.1425      0.0014
Val. top-5 F1          0.2375      0.0024
Val. log-likelihood    -5.1040     0.0075
Val. ECE               0.0446      0.0072
Val. ACE               4.2189e-3   7.3111e-8
Test top-5 recall      0.7090      0.0088
Test top-5 precision   0.1418      0.0018
Test top-5 F1          0.2363      0.0029
Test log-likelihood    -5.1081     0.0083
Test ECE               0.0499      0.0082
Test ACE               4.2191e-3   7.6136e-8
Table 5: Metric performance and calibration statistics for the CCS task across models in the deterministic RNN ensemble. The models are nearly identical in terms of dataset-level performance.
Figure 7: Left two: A set of distributions for the maximum predicted probability from our deterministic ensemble for two patients in the validation dataset on the CCS task. Note the difference in x-axis scales. Right two: The corresponding distributions of classes associated with the max probabilities. Similar to the mortality task, for some patients, such as the one on the left, the ensemble is relatively certain about the predicted class (completely certain in this case), while for other patients, such as the one on the right, there is a larger amount of model uncertainty.

In addition to the binary mortality task, we also look at the multiclass CCS single-level task on the MIMIC-III dataset. The CCS single-level system categorizes ICD-9-CM codes into 285 distinct categories. For this task, we use the same deterministic ensemble setup as for the mortality task, but with 50 models. Table 5 displays the metric performance and calibration statistics for the ensemble on both the validation and test datasets. We find that, similar to the binary mortality task, the models are nearly equivalent in terms of performance. Figure 7 examines the distribution of maximum predicted probabilities over the CCS classes, along with the distribution of predicted classes associated with the maximum probabilities. Similar to the binary mortality task, this demonstrates the presence of model uncertainty in the multiclass clinical setting.

Appendix B Additional Training Details


Hyperparameter                              Range/Set
Batch size                                  {32, 64, 128, 256, 512}
Learning rate                               [0.00001, 0.1]
KL or regularization annealing steps        [1, 1e6]
Prior standard deviation (Bayesian only)    [0.135, 1.0]
Dense embedding dimension                   {16, 32, 64, 100, 128, 256, 512}
Embedding dimension multiplier              [0.5, 1.5]
RNN dimension                               {16, 32, 64, 128, 256, 512, 1024}
Number of RNN layers                        {1, 2, 3}
Hidden affine layer dimension               {0, 16, 32, 64, 128, 256, 512}
Bias uncertainty (Bayesian only)            {True, False}
Table 6: Hyperparameters and their associated search sets or ranges.

Model                        Batch size  Learn. rate  Anneal. steps  Prior std. dev.  Dense embed. dim.  Embed. dim. mult.  RNN dim.  Num. RNN layers  Hidden layer dim.  Bias uncert.
Deterministic Ensemble       256         3.035e-4     1              —                32                 0.858              1024      1                512                —
Bayesian Embeddings          256         1.238e-3     9.722e+5       0.292            32                 0.858              1024      1                512                False
Bayesian Output              256         1.647e-4     8.782e+5       0.149            32                 0.858              1024      1                512                False
Bayesian Hidden+Output       256         2.710e-4     9.912e+5       0.149            32                 0.858              1024      1                512                False
Bayesian RNN+Hidden+Output   512         1.488e-3     6.342e+5       0.252            32                 1.291              16        1                0                  True
Fully Bayesian               128         1.265e-3     9.983e+5       0.162            256                1.061              16        1                0                  True
Table 7: Model-specific hyperparameter values. Prior standard deviation and bias uncertainty apply to the Bayesian models only.

Our RNN model uses the same embedding logic as Rajkomar et al. [2018b] to embed sequential and contextual features. Sequential embeddings are bagged into 1-day blocks and fed into one or more LSTM layers. The final time-step output of the LSTM layers is concatenated with the contextual embeddings and fed into a hidden dense layer, and the output of that layer is then fed into an output dense layer yielding a single probability value. A ReLU non-linearity is used between the hidden and output dense layers, and default initializers in tf.keras.layers.* are used for all layers. More details on the training setup can be found in the code, which will be open-sourced.

In terms of hyperparameter optimization, we searched over the hyperparameters listed in Table 6 for the original deterministic RNN (all others in the ensemble differ only by the random seed) and each of the Bayesian models. Table 7 lists the final hyperparameters associated with each of the models presented in the paper.

Models were implemented using TensorFlow 1.13 Abadi et al. [2016], and trained on machines equipped with NVIDIA P100 GPUs using the Adam optimizer Kingma and Ba [2014]. MIMIC-III data were split into train, validation, and test sets in an 8:1:1 ratio.

Appendix C Additional Metrics and Statistics

Figure 8: Validation AUC-PR versus held-out log-likelihood values for the deterministic RNN ensemble on the mortality task. We find that there is no apparent correlation between the two metrics, likely due to the limited differences between the models.

In Figure 8, we examine the correlation between held-out log-likelihood and AUC-PR values for models in the deterministic RNN ensemble on the mortality task.

Figure 9: A histogram of differences between the maximum and minimum predicted probability values for each patient’s predictive uncertainty distribution. This shows that there is wide variability in predicted probabilities for some patients, and that negative patients have less variability on average.

In Figure 9, we plot the differences between the maximum and minimum predicted probability values for each patient’s predictive uncertainty distribution. We find that there is wide variability in predicted probabilities for some patients, and that negative patients have less variability on average.


Model                        Val. ECE  Val. ACE  Test ECE  Test ACE
Deterministic Ensemble       0.0157    0.0191    0.0157    0.0191
Bayesian Embeddings          0.0167    0.0194    0.0163    0.0221
Bayesian Output              0.0263    0.0217    0.0241    0.0279
Bayesian Hidden+Output       0.0194    0.0212    0.0173    0.0240
Bayesian RNN+Hidden+Output   0.0240    0.0228    0.0182    0.0247
Fully Bayesian               0.0226    0.0192    0.0178    0.0197
Table 8: Calibration error for marginalized predictions on the mortality task for an average over models in the deterministic RNN ensemble, and samples from each of the Bayesian RNN models. We find that marginalization slightly improves the calibration of the deterministic ensemble, and that the Bayesian models are comparably well-calibrated.

In Table 8, we measure the calibration of the marginalized predictions of our deterministic RNN ensemble and the Bayesian RNNs.

Figure 10: PR curves for all models in the deterministic ensemble. All of the curves are nearly identical, which is in line with the AUC-PR results.

In Figure 10, we plot the PR curves of all models in our deterministic RNN ensemble across the full dataset, along with error bars. We find that the PR curves are nearly identical for all models, and thus it again seems highly likely that any one of the models could have been selected if we were focused on our recall-based decision cost function.