1 Introduction
Machine learning has achieved great and increasing success in recent years on many well-known benchmark datasets. This has led to mounting interest in non-traditional problems and domains, each of which brings its own requirements. In medicine specifically, individualized predictions are of great importance to the field (Council et al., 2011), and there can be severe costs for incorrect predictions and decisions due to the risk to human life and the associated ethical concerns (Gillon, 1994).
Existing state-of-the-art approaches using deep neural networks in medicine often make use of either a single model or an average over a small ensemble of models, focusing on improving the accuracy of probabilistic predictions (Harutyunyan et al., 2017; Rajkomar et al., 2018a; Xu et al., 2018; Choi et al., 2018). These works, while focusing on capturing the data uncertainty, do not address the model uncertainty that is inherent in fitting deep neural networks. For example, when predicting patient mortality in an ICU setting, existing approaches might be able to achieve high AUC-ROC, but will be unable to differentiate between patients for whom the model is certain about its probabilistic prediction and those for whom it is fairly uncertain.
In this paper, we examine the use of model uncertainty specifically in the context of predictive medicine. Methods for quantifying model uncertainty have advanced considerably in recent years, including reparameterization-based variational Bayesian neural networks (Blundell et al., 2015; Kucukelbir et al., 2017; Louizos and Welling, 2017), Monte Carlo dropout (Gal and Ghahramani, 2016), ensembles (Lakshminarayanan et al., 2017), and function priors (Hafner et al., 2018; Garnelo et al., 2018; Malinin and Gales, 2018). In order to directly impact clinical care, model uncertainty methods raise several natural questions:

How do the realized functions in each of these approaches, such as individual models in the ensemble approach, compare in terms of metric performance such as AUC-PR, AUC-ROC, or log-likelihood?

Does model uncertainty assist in calibrating predictions, and if so, how?

What is the effect of uncertainty on predictions across patient subgroups, such as by race, gender, age, or length of stay?

Which feature values are responsible for the highest model uncertainty?

How does model uncertainty affect optimal decisions made under a given clinically relevant cost function?
Contributions
Using sequence models on the MIMIC-III clinical dataset (Johnson et al., 2016), we make several important findings. For the ensembling approach to quantifying model uncertainty, we find that the models within the ensemble are nearly identical in terms of dataset-level metric performance, despite each model yielding different patient-specific predictions. In addition to the strong metric performance, the models in the ensemble also appear to be well-calibrated. It is therefore likely that any one of these models could be selected in practice if we were only using one model for our clinical problem. Furthermore, we see that predictive uncertainty due to model uncertainty extends into the space of optimal decisions; that is, models with nearly equivalent performance can disagree significantly on the final decision.
Overall, we show that model uncertainty is not captured by dataset-level (i.e., population-level) metrics, such as AUC-PR, AUC-ROC, log-likelihood, and calibration error. Rather, the significant variability in sample-specific (i.e., patient-specific) predictions and decisions motivates the importance of model uncertainty; this is an area with clear clinical impact. Additionally, we show that models with Bayesian embeddings can be a more efficient way to capture model uncertainty compared to ensembles, and we analyze how model uncertainty is impacted across individual input features and patient subgroups.
2 Background
Data uncertainty
Data uncertainty can be viewed as uncertainty regarding a given outcome due to incomplete information, and is also known as "output uncertainty" or "risk" (Knight, 1957). For binary tasks, this equates to a single probability value. More specifically, this can be described as
\[
  p(y \mid x, \theta) = \operatorname{Bernoulli}(y \mid \mu), \qquad \mu = f(x; \theta), \tag{1}
\]
where the model $f$, as a function of the inputs $x$ and parameters $\theta$, outputs the parameter $\mu$ for a Bernoulli distribution representing the conditional distribution $p(y \mid x, \theta)$ for the outcome $y$.
Model uncertainty
Model uncertainty can be viewed as uncertainty in the correct values of the parameters for the predictive outcome distribution due to not knowing the true function. For binary tasks, this equates to a distribution of plausible probability values for a Bernoulli distribution, corresponding to a set of plausible functions. More specifically, this can be described as
\[
  \mu \sim p(\mu \mid x, \mathcal{D}), \qquad \mu = f(x; \theta), \quad \theta \sim p(\theta \mid \mathcal{D}), \tag{2}
\]
where there is a distribution over the Bernoulli parameter $\mu$ for a given example $x$ that represents uncertainty in the true outcome distribution due to uncertainty in function space. For the remainder of the paper, we use the phrase predictive uncertainty distribution to refer to this distribution over the parameter of the outcome distribution.
Deep Ensembles
Deep ensembles (Lakshminarayanan et al., 2017) is a method for quantifying model uncertainty. In this approach, an ensemble of deterministic neural networks is trained by varying only the random seed of an otherwise well-tuned model. (We use the term "deterministic" to refer to the usual setup in which we optimize the parameter values of our function directly, yielding a trained model with fixed parameter values at test time.) Given this ensemble, predictions can be made with each model for a given example, where (for a binary task) each prediction is the probability parameter for the Bernoulli distribution over the outcome. The set of probabilistic predictions for the same example can then be viewed as a distribution over the Bernoulli parameter, where this distribution represents model uncertainty.
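As a minimal sketch of this view, the set of per-model predicted probabilities for one patient is itself the empirical predictive uncertainty distribution; the probability values below are hypothetical stand-ins for the outputs of seed-varied ensemble members:

```python
import numpy as np

# Hypothetical per-model predicted mortality probabilities for one patient,
# one value per ensemble member (members differ only in random seed).
ensemble_probs = np.array([0.62, 0.48, 0.71, 0.55, 0.66])

# The set of per-model Bernoulli parameters *is* the empirical predictive
# uncertainty distribution for this patient.
mean_risk = ensemble_probs.mean()                      # marginalized prediction
spread = ensemble_probs.max() - ensemble_probs.min()   # model disagreement
```

Two patients with the same `mean_risk` can have very different `spread`, which is exactly the information a single model cannot provide.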
Bayesian RNNs
Bayesian RNNs (Fortunato et al., 2017) are RNNs in which the parameters are represented by distributions. This allows us to express model uncertainty as uncertainty over the true values of the parameters in the model, i.e., "weight uncertainty" (Blundell et al., 2015). By introducing a distribution over all, or a subset, of the weights in the model, we can induce different functions, and thus different outcomes, through realizations of different weight values via draws from the posterior distributions. This allows us to empirically capture model uncertainty in the predictive uncertainty distribution by drawing samples from a trained Bayesian RNN for a given example.
3 Medical Uncertainty
3.1 Mortality Prediction From Medical Records
Similar to Rajkomar et al. (2018a), we train deep RNNs to predict patient mortality on MIMIC-III (Johnson et al., 2016), an EHR dataset collected from 46,520 patients admitted to the ICU at Beth Israel Deaconess Medical Center, of whom 9,974 expired during the encounter (i.e., a 1:4 ratio between positive and negative samples). Our model embeds and aggregates a patient's time-series features (e.g., medications, lab measures) and global features (e.g., gender, age), feeds them to one or more LSTM layers (Hochreiter and Schmidhuber, 1997), and follows that with fully-connected hidden and output affine layers, with a final sigmoid output for binary prediction. See the Appendix for more details.
Existing deep learning approaches in predictive medicine focus on capturing data uncertainty, namely accurately predicting the risk of a patient's mortality (i.e., how likely is the patient to expire?). This work, on the other hand, also focuses on addressing the model uncertainty aspect of deep learning, namely the distribution over the risk of mortality for a patient (i.e., are there alternative risk predictions, and if so, how diverse are they?).
3.2 Predictive Uncertainty Distributions
To quantify model uncertainty for our mortality prediction task, we explore the use of deep RNN ensembles and various Bayesian RNNs. For the deep ensembles approach, we optimize the hyperparameter values for our RNN model via black-box Bayesian optimization (Golovin et al., 2017), and then train replicas of the best model; only the random seed differs between the replicas. At prediction time, we make predictions with all models for each patient. For the Bayesian RNNs, we take a variational inference approach by adapting our RNNs to use factorized weight posteriors, where the weight tensors in the models are represented by normal distributions with learnable mean and diagonal covariance parameters, $q(\theta) = \mathcal{N}(\theta \mid \mu, \operatorname{diag}(\sigma^2))$. Normal distributions with zero mean and tunable standard deviation are used as weight priors. We train our models by minimizing the KL divergence
\[
  \mathrm{KL}\big(q(\theta) \,\|\, p(\theta \mid \mathcal{D})\big) \tag{3}
\]
between the approximate weight posterior and the true, but unknown, posterior, which overall equates to minimizing an expectation over the usual negative log-likelihood term plus a KL regularization term. To easily shift between the deterministic and Bayesian models, we make use of the Bayesian Layers (Tran et al., 2018) abstractions.
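A rough sketch of one evaluation of this variational objective, assuming a single weight vector with a mean-field normal posterior and a zero-mean normal prior (all parameter and data values below are hypothetical, and the likelihood term uses a single Monte Carlo sample):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical variational parameters for one weight vector, plus the prior std.
mu = np.array([0.1, -0.3, 0.2])
sigma = np.array([0.5, 0.4, 0.6])
prior_std = 1.0

# Analytic KL( N(mu, diag(sigma^2)) || N(0, prior_std^2 I) ), summed over dims.
kl = np.sum(np.log(prior_std / sigma)
            + (sigma**2 + mu**2) / (2 * prior_std**2) - 0.5)

# One reparameterized weight sample: w = mu + sigma * eps, eps ~ N(0, I).
w = mu + sigma * rng.standard_normal(3)

# Monte Carlo negative log-likelihood for a toy binary example (y = 1).
x, y = np.array([1.0, 2.0, -1.0]), 1
p = 1 / (1 + np.exp(-(w @ x)))
nll = -(y * np.log(p) + (1 - y) * np.log(1 - p))

loss = nll + kl  # the quantity minimized during training
```

In practice the KL term is typically annealed and averaged over minibatches; the sketch only shows the structure of the objective.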
Figure 1 visualizes the predictive uncertainty distribution for a single patient. We find that there is a wide variability in predicted Bernoulli probabilities for some patients (with spreads as high as ), which we will show is often of concern for clinical practice.
3.3 Optimal Decisions Via Sensitivity Requirements
While predicted probabilities of the binary mortality outcome are important, the key desire in clinical practice is to make a decision. Given a set of potential outcomes $\{C_k\}$, the conditional probabilities for those outcomes, and the associated costs $L_{kj}$ of predicting class $C_j$ when the true outcome is $C_k$, an optimal decision can be determined by minimizing the expected decision loss
\[
  \mathbb{E}[L] = \sum_{k}\sum_{j} \int_{\mathcal{R}_j} L_{kj}\, p(x, C_k)\, dx, \tag{4}
\]
where $\mathcal{R}_j$ is the decision region for assigning example $x$ to class $C_j$, and $p(x, C_k)$ is the joint density of $x$ and $C_k$ (Bishop, 2006).
Unfortunately, designing decision cost functions for clinical applications is difficult, and is a research problem in itself. That being said, we do have an alternative target that is already clinically relevant: sensitivity requirements. Often in clinical research, certain sensitivity (i.e., recall) levels must be met when making binary predictions in order for a model to be clinically relevant. The goal in these cases is to maximize the precision while still reaching the required recall level. Viewed as a decision cost function, the cost is infinite if the recall is below the target level, and is otherwise minimized as the precision is increased, where the optimized parameter is the global probability threshold.
For each of the models in our ensemble, we can optimize the sensitivity-based decision cost function and make optimal decisions for all examples. Thus, for each example, there will be a set of optimal decisions, forming a distribution. The optimal decision then becomes a random variable
\[
  \hat{y} \sim \operatorname{Bernoulli}(\phi), \tag{5}
\]
where $\phi$ is the fraction of the set of probability values for a given example that are greater than or equal to the optimized decision threshold $t^{*}$.
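A minimal sketch of the procedure in Section 3.3, assuming hypothetical labels and per-model probabilities and a hypothetical 70% recall target: each model gets its own recall-feasible threshold, and the per-patient decision distribution is the fraction of models voting "positive" (the Bernoulli parameter of Eq. (5)).

```python
import numpy as np

def recall_threshold(probs, labels, target_recall=0.7):
    """Highest probability threshold whose recall still meets the target
    (a simple stand-in for the sensitivity-based cost function)."""
    for t in sorted(probs, reverse=True):  # candidate thresholds
        preds = probs >= t
        if preds[labels == 1].mean() >= target_recall:
            return t
    return 0.0

# Hypothetical validation predictions from 3 ensemble members.
labels = np.array([1, 0, 1, 1, 0, 0])
model_probs = np.array([
    [0.9, 0.2, 0.6, 0.7, 0.4, 0.1],
    [0.8, 0.3, 0.5, 0.9, 0.55, 0.2],
    [0.7, 0.1, 0.8, 0.6, 0.5, 0.3],
])

thresholds = np.array([recall_threshold(p, labels) for p in model_probs])

# Per-patient decision distribution: fraction of models voting "positive".
phi = (model_probs >= thresholds[:, None]).mean(axis=0)
```

Patients with `phi` near 0 or 1 get unanimous decisions, while intermediate values expose model disagreement that a single model would hide.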
4 Experiments
We perform four sets of experiments. First, we examine the relationship of individual model samples across clinical metrics, calibration, uncertainty distributions, and decisionmaking. Second, we examine where uncertainty in the model matters most. Third, we examine patterns in uncertainty across patient subgroups. Finally, we examine patterns in uncertainty across individual features.
4.1 Deep RNN Ensemble for Mortality Prediction
Metric  Mean  Standard Deviation
Val. AUC-PR  0.4496  0.0025
Val. AUC-ROC  0.8753  0.0019
Test AUC-PR  0.3886  0.0059
Test AUC-ROC  0.8623  0.0031
Clinical Metrics
We measure the metric performance of each model in our ensemble on the mortality task, where our clinically relevant metrics are the AUC-PR and AUC-ROC. Table 1 shows the mean and standard deviation of these metrics for the models in our deterministic ensemble on both the validation and test sets. We find that the models are nearly equivalent in terms of performance, and it is highly likely that any one could have been selected in practice if we were only using one model for our clinical problem.
Calibration Measure  Mean  Standard Deviation
Val. ECE  0.0176  0.0040
Val. ACE  0.0210  0.0042
Test ECE  0.0162  0.0043
Test ACE  0.0233  0.0057
Calibration
A model is said to be perfectly calibrated if, for all examples for which the model produces the same predicted probability $p$ for some outcome, the percentage of those examples truly associated with the outcome is equal to $p$, across all values of $p$. If a model is systematically over- or under-confident, it can be difficult to reliably use its predicted probabilities for decision making. Using the expected calibration error (ECE; Naeini et al., 2015) as a tractable way to approximate the calibration of a model given a finite dataset, we measure the calibration of each of the models in our deterministic ensemble. We also make use of the adaptive calibration error (ACE; Nixon et al., 2019). Table 2 shows the mean and standard deviation of the calibration metrics for the models in our deterministic ensemble. The models are all well-calibrated, indicating that they could each be nearly equally reliable in practice.
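A minimal equal-width-binned ECE for the positive class can be sketched as follows; the predictions and labels are a toy example, and real implementations often bin the max-class confidence instead of the positive-class probability:

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Equal-width-binned ECE: weighted average over bins of the gap
    between mean predicted probability and empirical positive rate."""
    bins = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            conf = probs[mask].mean()   # mean predicted probability in bin
            acc = labels[mask].mean()   # empirical positive rate in bin
            ece += mask.mean() * abs(acc - conf)
    return ece

# Toy predictions: well calibrated in the low bin, overconfident in the high bin.
probs = np.array([0.25, 0.25, 0.25, 0.25, 0.85, 0.85])
labels = np.array([0, 0, 0, 1, 1, 1])
ece = expected_calibration_error(probs, labels)
```

Here the 0.25 bin contributes nothing (predicted 0.25 vs. observed 1/4), while the 0.85 bin contributes its weight times |1.0 − 0.85|.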
Predictive Uncertainty Distributions & Statistics
Knowing that the models in our ensemble are wellcalibrated and effectively equivalent in terms of performance, we turn to making predictions for individual examples. As seen previously in Figure 1, the predictive uncertainty distributions can be wide for some patients. Figure 2 visualizes the means versus standard deviations of the predictive uncertainty distributions for the deterministic ensemble on all validation set examples. In contrast to the variance of a Bernoulli distribution, which is a simple function of the mean, we find that the standard deviations are patientspecific, and thus cannot be determined a priori.
Optimal Decision Distributions & Statistics
In practice, model uncertainty is only important if it affects the decisions one would make. To test this, we optimize the recall-based decision cost function with respect to the probability threshold for each model separately to achieve a fixed target recall, and then make optimal decisions for each example with each of the models. Figure 3 visualizes how model uncertainty in probability space is realized in optimal decision space for two patients. We see that model uncertainty does indeed extend into the optimal decision space, converting the optimal decision into a random variable. Furthermore, the decision distribution's variance can be quite high, and knowing when this is the case is important in order to avoid the cost of incorrect decisions.
Additional Clinical Tasks
We additionally examined the role of model uncertainty with the deterministic ensemble on the multiclass task of predicting CCS (Clinical Classifications Software; Agency for Healthcare Research and Quality) categories, in which ICD-9-CM diagnosis codes are grouped into 285 distinct categories. See the Appendix for details.
4.2 Bayesian RNNs for Mortality Prediction
Model  Val. AUC-PR  Val. AUC-ROC  Val. –  Test AUC-PR  Test AUC-ROC  Test –
Deterministic Ensemble  0.4550  0.8774  0.7113  0.3921  0.8646  0.7148
Bayesian Embeddings  0.4551  0.8773  0.7153  0.3965  0.8614  0.7186
Bayesian Output  0.4358  0.8714  0.7058  0.3679  0.8570  0.7087
Bayesian Hidden+Output  0.4480  0.8749  0.7118  0.3885  0.8606  0.7149
Bayesian RNN+Hidden+Output  0.4384  0.8674  0.7097  0.3844  0.8541  0.7125
Fully Bayesian  0.4328  0.8691  0.7100  0.3804  0.8549  0.7133
A natural question in practice is where precisely to place uncertainty in the model. To investigate this, we study Bayesian RNNs under a variety of priors:

Bayesian Embeddings: a Bayesian RNN in which the embedding parameters are stochastic, and all other parameters are deterministic.

Bayesian Output: a Bayesian RNN in which the output layer parameters are stochastic, and all other parameters are deterministic.

Bayesian Hidden+Output: a Bayesian RNN in which the hidden and output layer parameters are stochastic, and all other parameters are deterministic.

Bayesian RNN+Hidden+Output: a Bayesian RNN in which the LSTM, hidden, and output layer parameters are stochastic, and all other parameters are deterministic.

Fully Bayesian: a Bayesian RNN in which all parameters are stochastic.
Table 3 displays the metrics over marginalized predictions for each of the Bayesian RNN models and the deterministic RNN ensemble on the mortality task. We find that the Bayesian Embeddings model outperforms all other Bayesian variants and slightly outperforms the deterministic ensemble in terms of AUC-PR. Additionally, all of the Bayesian variants are comparable to or outperform the deterministic ensemble in terms of held-out log-likelihood.
Figure 4 visualizes the predictive distributions of both the Bayesian RNNs and the deterministic RNN ensemble for four individual patients. We find that the Bayesian model is qualitatively able to capture model uncertainty quite similar to that of the deterministic ensemble.
4.3 Patient Subgroup Analysis
We next turn to an exploration of the effects of model uncertainty across patient subgroups. For this analysis, we use the deterministic RNN ensemble. We split patients into subgroups by demographic characteristics, namely gender (male vs. female) or age (adult patients divided into quartiles, with neonates as a separate fifth group). We stratify our performance metrics by subgroup and examine correlations between these metrics to evaluate whether the ensemble models tend to specialize to one or more subgroups at the cost of performance on others. We find some evidence of this phenomenon: for example, AUC-PR for male patients is negatively correlated with AUC-PR for female patients (as measured by Pearson's $r$; see Figure 5), and AUC-PR for the oldest quartile of adult patients is somewhat negatively correlated with AUC-PR for other adults or for neonates.
We also compare uncertainty metrics across subgroups, including the standard deviation and range of the predictive uncertainty distributions and the variance of the optimal decision distributions for patients in each subgroup. We find that all metrics are correlated with subgroup label prevalence: both uncertainty and mortality rate increase monotonically across age groups (Figure 5), and both are slightly higher in women than in men. These findings imply that random model variation during training may cause unintentional harm to certain patient populations, which may not be reflected in aggregate performance.
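The subgroup correlation analysis above can be sketched as follows, with hypothetical per-model AUC-PR values standing in for the real stratified metrics:

```python
import numpy as np

# Hypothetical AUC-PR for two subgroups, one value per ensemble member.
aucpr_male   = np.array([0.44, 0.46, 0.45, 0.47, 0.43])
aucpr_female = np.array([0.46, 0.43, 0.45, 0.42, 0.47])

# Pearson correlation across ensemble members: a negative value suggests
# that members doing better on one subgroup tend to do worse on the other,
# i.e., seed-level specialization.
r = np.corrcoef(aucpr_male, aucpr_female)[0, 1]
```

The same computation applies to any pair of subgroup-stratified metrics (e.g., age quartiles, uncertainty statistics).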
4.4 Embedding Uncertainty Analysis
Lowest Entropy (Word / Entropy / Frequency)  |  Highest Entropy (Word / Entropy / Frequency)
the  -82.5445  41803  |  24pm  -16.0790  336
and  -80.6055  42812  |  labwork  -16.0750  272
of  -80.2735  43191  |  colonial  -16.0690  198
no  -79.8994  43420  |  zoysn  -16.0601  269
tracing  -78.5988  32181  |  ht  -16.0523  515
is  -78.5553  42560  |  txcf  -15.9982  112
to  -77.6408  42365  |  arrangements  -15.9795  407
for  -76.8005  42972  |  parvus  -15.9773  132
with  -75.3513  42819  |  nas  -15.9164  251
in  -72.8006  42144  |  anesthesiologist  -15.8796  220
Another motivation for model uncertainty lies in understanding which feature values are most responsible for the variance of the predictive uncertainty distribution. Our RNN with Bayesian embeddings is particularly well suited for this task, in that the uncertainty in embedding space directly corresponds to the predictive uncertainty distribution and represents uncertainty associated with the discrete feature values. Understanding model uncertainty associated with features can allow us to recognize particularly difficult examples and understand which feature values are leading to the difficulties. Additionally, it provides a means of determining the types of examples that could be beneficial to add to the training dataset for future updates to the model.
For this analysis, we focus on the free-text clinical notes found in the EHR. For each word in the notes vocabulary, we have an associated embedding distribution formulated as a multivariate normal. We rank each word by its level of model uncertainty, measured by its embedding distribution's entropy. Table 4 lists the 10 lowest- and highest-entropy words, along with their frequencies in the training dataset. We find that common words, both subjectively and based on prevalence counts, have low entropy and thus limited model uncertainty, while rarer words have higher entropy, corresponding to higher model uncertainty. We additionally measure the correlation between entropy and word frequency, as visualized in Figure 6. We find further confirmation that rarer words are associated with higher model uncertainty, although there is some variance in entropy at any given frequency.
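The entropy ranking above can be sketched for diagonal-Gaussian embeddings; the standard deviations below are hypothetical, and note that differential entropy grows with the posterior standard deviations (and can be negative when they are small, as for tightly-pinned frequent words):

```python
import numpy as np

def gaussian_entropy(sigma):
    """Differential entropy of a diagonal multivariate normal embedding;
    it depends only on the per-dimension standard deviations."""
    d = len(sigma)
    return 0.5 * d * np.log(2 * np.pi * np.e) + np.sum(np.log(sigma))

# Hypothetical learned posterior stddevs for two word embeddings (d = 4):
# a frequent word collapses to small sigma, a rare word stays near the prior.
sigma_common = np.full(4, 0.05)
sigma_rare = np.full(4, 0.8)

h_common = gaussian_entropy(sigma_common)
h_rare = gaussian_entropy(sigma_rare)
```

Ranking the vocabulary by this quantity surfaces the words whose embeddings the model is least certain about.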
5 Discussion
In this work, we demonstrated the need for capturing model uncertainty in medicine and examined methods for doing so. Our experiments yielded multiple findings. For example, an ensemble of deterministic RNNs captures individualized uncertainty conditioned on each patient, while the models each maintain nearly equivalent clinically relevant dataset-level metrics. As another example, we found that models need only be uncertain around the embeddings for competitive performance, with the added benefit of enabling us to determine the level of model uncertainty associated with individual feature values. Furthermore, using model uncertainty methods, we examined patterns in uncertainty across patient subgroups, showing that models can exhibit higher levels of uncertainty for certain groups.
Future work includes designing more specific decision cost functions based on both quantified medical ethics (Gillon, 1994) and monetary axes, as well as exploring methods to reduce model uncertainty at both training and prediction time.
References
 Council et al. [2011] National Research Council et al. Toward precision medicine: building a knowledge network for biomedical research and a new taxonomy of disease. National Academies Press, 2011.
 Gillon [1994] Raanan Gillon. Medical ethics: four principles plus attention to scope. BMJ, 309(6948):184, 1994.
 Harutyunyan et al. [2017] Hrayr Harutyunyan, Hrant Khachatrian, David C Kale, Greg Ver Steeg, and Aram Galstyan. Multitask learning and benchmarking with clinical time series data. arXiv preprint arXiv:1703.07771, 2017.
 Rajkomar et al. [2018a] Alvin Rajkomar, Eyal Oren, Kai Chen, Andrew M. Dai, Nissan Hajaj, Michaela Hardt, Peter J. Liu, Xiaobing Liu, Jake Marcus, Mimi Sun, Patrik Sundberg, Hector Yee, Kun Zhang, Yi Zhang, Gerardo Flores, Gavin E. Duggan, Jamie Irvine, Quoc Le, Kurt Litsch, Alexander Mossin, Justin Tansuwan, De Wang, James Wexler, Jimbo Wilson, Dana Ludwig, Samuel L. Volchenboum, Katherine Chou, Michael Pearson, Srinivasan Madabushi, Nigam H. Shah, Atul J. Butte, Michael D. Howell, Claire Cui, Greg S. Corrado, and Jeffrey Dean. Scalable and accurate deep learning with electronic health records. npj Digital Medicine, 1(1):18, May 2018a. ISSN 2398-6352. doi: 10.1038/s41746-018-0029-1.
 Xu et al. [2018] Yanbo Xu, Siddharth Biswal, Shriprasad R Deshpande, Kevin O Maher, and Jimeng Sun. Raim: Recurrent attentive and intensive model of multimodal patient monitoring data. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2565–2573. ACM, 2018.
 Choi et al. [2018] Edward Choi, Cao Xiao, Walter Stewart, and Jimeng Sun. Mime: Multilevel medical embedding of electronic health records for predictive healthcare. In Advances in Neural Information Processing Systems, pages 4547–4557, 2018.
 Blundell et al. [2015] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight Uncertainty in Neural Networks. arXiv.org, May 2015. URL http://arxiv.org/abs/1505.05424v2.
 Kucukelbir et al. [2017] Alp Kucukelbir, Dustin Tran, Rajesh Ranganath, Andrew Gelman, and David M Blei. Automatic differentiation variational inference. The Journal of Machine Learning Research, 18(1):430–474, 2017.
 Louizos and Welling [2017] Christos Louizos and Max Welling. Multiplicative normalizing flows for variational Bayesian neural networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 2218–2227. JMLR.org, 2017.
 Gal and Ghahramani [2016] Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In international conference on machine learning, pages 1050–1059, 2016.
 Lakshminarayanan et al. [2017] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. In Advances in Neural Information Processing Systems, volume stat.ML, December 2017. URL http://arxiv.org/abs/1612.01474v3.
 Hafner et al. [2018] Danijar Hafner, Dustin Tran, Alex Irpan, Timothy Lillicrap, and James Davidson. Reliable uncertainty estimates in deep neural networks using noise contrastive priors. arXiv preprint arXiv:1807.09289, 2018.
 Garnelo et al. [2018] Marta Garnelo, Jonathan Schwarz, Dan Rosenbaum, Fabio Viola, Danilo J Rezende, SM Eslami, and Yee Whye Teh. Neural processes. arXiv preprint arXiv:1807.01622, 2018.

 Malinin and Gales [2018] Andrey Malinin and Mark Gales. Predictive uncertainty estimation via prior networks. In Advances in Neural Information Processing Systems, pages 7047–7058, 2018.
 Johnson et al. [2016] Alistair E.W. Johnson, Tom J. Pollard, Lu Shen, Li-wei H. Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G. Mark. MIMIC-III, a freely accessible critical care database. Scientific Data, 3:160035, May 2016. ISSN 2052-4463. doi: 10.1038/sdata.2016.35. URL http://www.nature.com/articles/sdata201635.
 Knight [1957] Frank H. Knight. Risk, Uncertainty and Profit. New York, Kelley & Millman, 1957. URL https://mises.org/sites/default/files/Risk,%20Uncertainty,%20and%20Profit_4.pdf.
 Fortunato et al. [2017] Meire Fortunato, Charles Blundell, and Oriol Vinyals. Bayesian Recurrent Neural Networks. arXiv.org, April 2017. URL http://arxiv.org/abs/1704.02798v3.
 Hochreiter and Schmidhuber [1997] Sepp Hochreiter and Juergen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, November 1997. doi: 10.1162/neco.1997.9.8.1735.
 Golovin et al. [2017] Daniel Golovin, Benjamin Solnik, Subhodeep Moitra, Greg Kochanski, John Karro, and D. Sculley. Google Vizier: A Service for Black-Box Optimization. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’17), pages 1487–1495, Halifax, NS, Canada, 2017. ACM Press. ISBN 978-1-4503-4887-4. doi: 10.1145/3097983.3098043. URL https://ai.google/research/pubs/pub46180.pdf.
 Tran et al. [2018] Dustin Tran, Michael W. Dusenberry, Mark van der Wilk, and Danijar Hafner. Bayesian Layers: A Module for Neural Network Uncertainty. arXiv:1812.03973 [cs, stat], December 2018. URL http://arxiv.org/abs/1812.03973.
 Bishop [2006] Christopher M. Bishop. Pattern Recognition and Machine Learning. Information Science and Statistics. Springer, New York, corrected 8th printing 2009 edition, 2006. ISBN 9780387310732.

 Naeini et al. [2015] Mahdi Pakdaman Naeini, Gregory F. Cooper, and Milos Hauskrecht. Obtaining Well Calibrated Probabilities Using Bayesian Binning. In AAAI Conference on Artificial Intelligence, volume 2015, pages 2901–2907, January 2015. URL https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4410090/pdf/nihms679964.pdf.
 Nixon et al. [2019] Jeremy Nixon, Mike Dusenberry, Linchuan Zhang, Ghassen Jerfel, and Dustin Tran. Measuring Calibration in Deep Learning. arXiv:1904.01685 [cs, stat], April 2019. URL http://arxiv.org/abs/1904.01685.
 [24] Agency for Healthcare Research and Quality. Clinical Classifications Software (CCS) for ICD-9-CM. https://www.hcup-us.ahrq.gov/toolssoftware/ccs/ccsfactsheet.jsp. Accessed: 2019-05-20.
 Rajkomar et al. [2018b] Alvin Rajkomar, Michaela Hardt, Michael D. Howell, Greg Corrado, and Marshall H. Chin. Ensuring Fairness in Machine Learning to Advance Health Equity. Annals of Internal Medicine, 169(12):866, December 2018b. ISSN 0003-4819. doi: 10.7326/M18-1990. URL http://annals.org/article.aspx?doi=10.7326/M18-1990.
 Abadi et al. [2016] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: A system for largescale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016.
 Kingma and Ba [2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Appendix A CCS Multiclass Task
Metric  Mean  Standard Deviation
Val. top-5 recall  0.7126  0.0071
Val. top-5 precision  0.1425  0.0014
Val. top-5 F1  0.2375  0.0024
Val. log-likelihood  5.1040  0.0075
Val. ECE  0.0446  0.0072
Val. ACE  4.2189e-3  7.3111e-8
Test top-5 recall  0.7090  0.0088
Test top-5 precision  0.1418  0.0018
Test top-5 F1  0.2363  0.0029
Test log-likelihood  5.1081  0.0083
Test ECE  0.0499  0.0082
Test ACE  4.2191e-3  7.6136e-8
In addition to the binary mortality task, we also examine the multiclass CCS single-level task on the MIMIC-III dataset. The CCS single-level system categorizes ICD-9-CM codes into 285 distinct categories. For this task, we use the same deterministic ensemble setup as for the mortality task, but with 50 models. Table 5 displays the metric performance and calibration statistics for the ensemble on both the validation and test datasets. We find that, similar to the binary mortality task, the models are nearly equivalent in terms of performance. Figure 7 examines the distribution of maximum predicted probabilities over the CCS classes, along with the distribution of predicted classes associated with the maximum probabilities. Similar to the binary mortality task, this demonstrates the presence of model uncertainty in the multiclass clinical setting.
Appendix B Additional Training Details
Hyperparameter  Range/Set 

batch size  {32, 64, 128, 256, 512} 
learning rate  [0.00001, 0.1] 
KL or regularization annealing steps  [1, 1e6] 
prior standard deviation (Bayesian only)  [0.135, 1.0] 
Dense embedding dimension  {16, 32, 64, 100, 128, 256, 512} 
Embedding dimension multiplier  [0.5, 1.5] 
RNN dimension  {16, 32, 64, 128, 256, 512, 1024} 
Number of RNN layers  {1, 2, 3} 
Hidden affine layer dimension  {0, 16, 32, 64, 128, 256, 512} 
Bias uncertainty (Bayesian only)  {True, False} 
Model  Batch size  Learning rate  KL annealing steps  Prior std.  Dense embedding dim.  Embedding dim. multiplier  RNN dim.  RNN layers  Hidden affine dim.  Bias uncertainty
Deterministic  256  3.035e-4  1  –  32  0.858  1024  1  512  –
Bayesian Embeddings  256  1.238e-3  9.722e+5  0.292  32  0.858  1024  1  512  False
Bayesian Output  256  1.647e-4  8.782e+5  0.149  32  0.858  1024  1  512  False
Bayesian Hidden+Output  256  2.710e-4  9.912e+5  0.149  32  0.858  1024  1  512  False
Bayesian RNN+Hidden+Output  512  1.488e-3  6.342e+5  0.252  32  1.291  16  1  0  True
Fully Bayesian  128  1.265e-3  9.983e+5  0.162  256  1.061  16  1  0  True
Our RNN model uses the same embedding logic as Rajkomar et al. [2018b] to embed sequential and contextual features. Sequential embeddings are bagged into 1-day blocks and fed into one or more LSTM layers. The final time-step output of the LSTM layers is concatenated with the contextual embeddings and fed into a hidden dense layer, and the output of that layer is then fed into an output dense layer yielding a single probability value. A ReLU nonlinearity is used between the hidden and output dense layers, and default initializers in tf.keras.layers.* are used for all layers. More details on the training setup can be found in the code, which will be open-sourced.
In terms of hyperparameter optimization, we searched over the hyperparameters listed in Table 6 for the original deterministic RNN (all others in the ensemble differ only by the random seed) and for each of the Bayesian models. Table 7 lists the final hyperparameters associated with each of the models presented in the paper.
Appendix C Additional Metrics and Statistics
In Figure 8, we examine the correlation between held-out log-likelihood and AUC-PR values for models in the deterministic RNN ensemble on the mortality task.
In Figure 9, we plot the differences between the maximum and minimum predicted probability values for each patient’s predictive uncertainty distribution. We find that there is wide variability in predicted probabilities for some patients, and that negative patients have less variability on average.
Model  Val. ECE  Val. ACE  Test ECE  Test ACE
Deterministic Ensemble  0.0157  0.0191  0.0157  0.0191
Bayesian Embeddings  0.0167  0.0194  0.0163  0.0221
Bayesian Output  0.0263  0.0217  0.0241  0.0279
Bayesian Hidden+Output  0.0194  0.0212  0.0173  0.0240
Bayesian RNN+Hidden+Output  0.0240  0.0228  0.0182  0.0247
Fully Bayesian  0.0226  0.0192  0.0178  0.0197
In Table 8, we measure the calibration of the marginalized predictions of our deterministic RNN ensemble and the Bayesian RNNs.
In Figure 10, we plot the PR curves of all models in our deterministic RNN ensemble across the full dataset, along with error bars. We find that the PR curves are nearly identical for all models; thus it again seems highly likely that any one of the models could have been selected if we were focused on our recall-based decision cost function.