1. Introduction
Transparency is one way for an artificial intelligence (AI) system to show stakeholders that the system is trustworthy (o2018linking; weller2019transparency). Transparency includes a wide variety of efforts to provide stakeholders, such as model developers and end users, with relevant information about how a machine learning (ML) model works (weller2019transparency; bhatt2020explainable). One form of transparency is procedural transparency, which provides information about model development (e.g., code release, model cards, dataset details) (gebru2018datasheets; raji2019ml; arnold2019factsheets; mitchell2019model). Another form is algorithmic transparency, which exposes information about a model’s behavior to various stakeholders (ribeiro2016should; sundararajan2017axiomatic; koh2017understanding).
Most of the algorithmic transparency agenda for the Fairness, Accountability, and Transparency (FAccT) community has focused on explainability. Explainability attempts to provide reasons for a model’s behavior to stakeholders (bhatt2020explainable). However, understanding a model’s specific behavior might not be enough for stakeholders to gauge whether the model may be wrong or lack sufficient knowledge to solve the task at hand (bhatt2020machine). We argue that a complementary form of transparency is to estimate and communicate the uncertainty associated with model predictions.
There are multiple ways in which uncertainty can be characterized. In regression tasks, uncertainty is often expressed in terms of predictive variance. For example, when predicting the number of crimes in a given city, we could report the predicted number of crimes as a central estimate with a margin, $\hat{y} \pm \delta$, where “$\pm \delta$” represents a 95% confidence interval (capturing two standard deviations on either side of the central, mean estimate). The smaller the interval, the more certain the model. In classification tasks, probability scores are often used to capture how confident a model is in a specific prediction. For example, a classification model may predict that a person is at a high risk for developing diabetes given a high predicted probability (expressed as a percent chance) of diabetes.
There may be various sources of uncertainty in data-driven decision-making systems (hora1996aleatory; gal2016uncertainty). Aleatoric uncertainty is induced by inherent randomness (or noise) in the quantity we want to predict or in our input variables. Epistemic uncertainty can arise due to lack of sufficient data to learn the model precisely. The question of how to quantify different types of uncertainty and communicate them well has long been studied across many domains. For example, in the broader AI community, uncertainty has been leveraged in planning and reasoning tasks (pearman1985uncertainty; horvitz1986the; horvitz1989heuristic). In this paper, we seek to study how uncertainty affects learning models from data.
This work is structured as follows. In Section 2, we motivate uncertainty as a form of transparency by discussing three use cases: designing fairer models, informing decision-making, and calibrating trust in automated systems. In Section 3, we review possible sources of uncertainty and methods for uncertainty quantification. In Section 4, we describe how uncertainty can be leveraged in each use case from Section 2. Next, in Section 5, we discuss how to communicate uncertainty effectively. Finally, we discuss the importance of taking a user-centered approach to collecting requirements for uncertainty quantification and communication in Section 6.
2. Why Do We Care?
Well-calibrated uncertainty helps stakeholders understand when they should trust model predictions and helps developers address fairness issues in models. Uncertainty is crucial in the context of ML-assisted, or automated, decision-making. To that end, we discuss the use of uncertainty for obtaining fairer model outcomes, improved decision-making, and building trust in automation, using the following cancer diagnostics scenario for illustrative purposes.
Suppose we are tasked with diagnosing individuals as having breast cancer or not, as in (dua; curtis2012genomic). Given categorical and continuous characteristics about an individual (medical test results, family medical history, etc.), we estimate the probability of an individual having breast cancer. We can then apply a threshold to classify them into high- or low-risk groups. Specifically, we have been tasked with building ML-powered tools to help three distinct audiences: doctors, who will be assisted in making diagnoses; patients, who will be helped to understand their diagnoses; and review board members, who will be aided in reviewing doctors’ decisions across many hospitals.
Throughout the paper, we will refer back to the scenario above to discuss how uncertainty may arise in the design of an ML model and why, when well-calibrated and well-communicated, it can act as a useful form of transparency for stakeholders.
2.1. Fairness
Developers often aim to assess model fairness to mitigate or prevent unwanted biases. Uncertainty, if not properly quantified and considered in model development, can endanger these efforts. In our example of the breast cancer diagnostic tool, if the model is trained on data where young age groups are underrepresented, the model might be underspecified and would likely be biased towards lower error rates on older patients. Such dataset bias will manifest itself as epistemic uncertainty in the model itself. Yet, in some cases, uncertainty can also be leveraged to diagnose and improve the model. For example, one can use uncertainty to identify portions of the input space where the model is error-prone: regions of high epistemic uncertainty indicate where additional training data could improve model performance. Section 4 details different ways uncertainty interacts with bias in the data collection and modeling stages and how such biases can be mitigated by accounting for uncertainty.
2.2. Decision-making
End users of ML models (e.g., decision-makers using an ML-powered decision-support system) should consider the uncertainty of the model’s output if well-calibrated uncertainty measures are available. A user may have to decide whether to accept a model’s output in a given interaction or to delegate certain actions to a model. Treating all model predictions the same, independent of their uncertainty, can lead decision-makers to over-rely on the model in cases where it produces spurious outputs or to under-rely on the model in cases where model predictions are likely to be accurate. Instead, a doctor might inspect a model’s uncertainty estimates before leveraging the model’s output in making a diagnosis. In Section 4, we draw upon the literature of judgment and decision-making (JDM) to discuss the potential implications of showing uncertainty estimates to end users of ML models.
2.3. Trust in Automation
Communicating well-calibrated uncertainty can be seen as a sign of the model’s trustworthiness: this communication could in turn improve model adoption and user experience. However, high uncertainty, seemingly arbitrary uncertainty, or incomprehensible uncertainty information could be perceived negatively, spawn confusion, and impair user trust. If our model’s recommendations are always accompanied by large error bars (high uncertainty), doctors may choose to always override the model’s output, the model’s seeming imprecision resulting in an erosion of the doctors’ trust. In general, accurately measured and carefully communicated uncertainty estimates should aim to support users in calibrating and forming appropriate trust in an ML model. Appropriately calibrated trust is crucial for avoiding misuse of automated systems (lee2004trust). In Section 4, we review prior work on how users form trust in automated systems and discuss the potential impact of uncertainty for user trust in ML models.
3. Measuring Uncertainty
In ML, we use the term uncertainty to refer to our lack of knowledge about some outcome of interest. We use the tools of probability to quantify and reason about uncertainty. But what are probabilities? The Bayesian school of thought interprets probabilities as subjective degrees of belief in an outcome of interest occurring (mackay2003information). For frequentists, probabilities reflect how often we would observe the outcome if we were to repeat our observation multiple times (bayes_vs_frequentist; Bland1151). Fortunately for end users, uncertainty from Bayesian and frequentist methods conveys similar information in practice (frequentist_paramerter_estimation), and can be treated interchangeably in downstream tasks.
3.1. Metrics for Uncertainty
The metrics used to communicate uncertainty vary between research communities. As shown in Figure 1, a predictive distribution tells us about our model’s degree of belief in every possible prediction. Despite containing a lot of information (prediction modes, tails, etc.), a full predictive distribution may not always be desirable. In our cancer diagnostic scenario, we may want our automated system to abstain from making a prediction and instead request the assistance of a medical professional when its uncertainty is above a certain threshold. When deferring to a human expert, it may not matter to us whether the system believes the patient has cancer or not, just how uncertain it is. For this reason, summary statistics of the predictive distribution are often used to convey information about uncertainty. For classification, class probability intuitively communicates our degree of belief in an outcome. On the other hand, predictive entropy decouples our predictions from their uncertainty, only telling us about the latter. For regression, the predictive mean is often given together with error bars (written $\pm\sigma$). These commonly reflect the standard deviation or some percentiles of the predictive distribution. As discussed in Section 6, the best choice of uncertainty metric will be use case dependent. We show common summary statistics for uncertainty in Table 1 and further discuss them in Appendix A.
Table 1. Common summary statistics for uncertainty.
Task | Full Information | Summary Statistics
Regression | Predictive Distribution | Predictive Variance, Percentile (or quantile) Confidence Intervals
Classification | Predictive Probabilities | Predictive Entropy, Expected Entropy, Mutual Information, Variation Ratio
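The classification summary statistics listed in Table 1 can be sketched in a few lines of NumPy. This is a minimal illustration rather than the paper's implementation: it assumes we already have several sets of class probabilities for one input (for instance from different ensemble members), and the function name `classification_uncertainty` and its return keys are hypothetical.

```python
import numpy as np

def classification_uncertainty(probs, eps=1e-12):
    """Summary statistics for one input.

    probs: array of shape (n_models, n_classes); each row holds the class
    probabilities predicted by one sampled model (assumed input format).
    """
    probs = np.asarray(probs, dtype=float)
    mean_probs = probs.mean(axis=0)  # averaged predictive probabilities

    # Entropy of the averaged prediction: total uncertainty.
    predictive_entropy = -np.sum(mean_probs * np.log(mean_probs + eps))
    # Average entropy of each individual model's prediction.
    expected_entropy = -np.mean(np.sum(probs * np.log(probs + eps), axis=1))
    # Their difference measures disagreement between models.
    mutual_information = predictive_entropy - expected_entropy

    # Variation ratio: fraction of models that do not vote for the modal class.
    votes = probs.argmax(axis=1)
    mode_count = np.bincount(votes, minlength=probs.shape[1]).max()
    variation_ratio = 1.0 - mode_count / probs.shape[0]

    return {
        "predictive_entropy": float(predictive_entropy),
        "expected_entropy": float(expected_entropy),
        "mutual_information": float(mutual_information),
        "variation_ratio": float(variation_ratio),
    }

# Two models that agree the input is ambiguous vs. two confident models that disagree.
agree = classification_uncertainty([[0.5, 0.5], [0.5, 0.5]])
disagree = classification_uncertainty([[0.99, 0.01], [0.01, 0.99]])
```

Both cases give the same predictive entropy, but only the second has a large mutual information, which is why a single summary statistic may hide where the uncertainty comes from.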
3.2. The Different Sources of Uncertainty
While there can be many sources of uncertainty (van_der_bles_communicating_nodate), herein we focus on those types that we can quantify in ML models: aleatoric uncertainty (also known as indirect uncertainty) and epistemic uncertainty (also known as direct uncertainty) (der2009aleatory; gal2016uncertainty; Depeweg_thesis).
Aleatoric uncertainty stems from noise, or class overlap, in our data. Noise in the data is a consequence of unaccounted-for factors that introduce variability in the inputs or targets. Examples of this could be background noise in a signal detection scenario or the imperfect reliability of a medical test in our cancer diagnosis scenario. Aleatoric uncertainty is also known as irreducible uncertainty: it cannot be decreased by observing more data. If we wish to reduce aleatoric uncertainty, we may need to leverage different sources of data, e.g., switching to a more reliable clinical test. In practice, most ML models account for aleatoric uncertainty through the specification of a noise model or likelihood function. A homoscedastic noise model assumes that all of the input space is equally noisy, $y = f(x) + \epsilon$ with $\epsilon \sim p(\epsilon)$. However, this may not always be true. Returning to our medical scenario, consider using results from a clinical test which produces few false positives but many false negatives as an input to our model. A heteroscedastic noise assumption allows us to express aleatoric uncertainty as a function of our inputs, $y = f(x) + \epsilon$ with $\epsilon \sim p(\epsilon \mid x)$. Perhaps the most commonly used heteroscedastic noise models in deep learning are those induced by the sigmoid or softmax output layers. These enable almost any ML model to express aleatoric uncertainty (Appendix A.1).
Epistemic uncertainty stems from a lack of knowledge about which function best explains the data we have observed. There are two reasons why epistemic uncertainty may arise. Consider a scenario in which we employ a very complex model relative to the amount of training data available. We will be unable to properly constrain our model’s parameters. This means that, out of all the possible functions that our model can represent, we are unsure of which ones to choose. This uncertainty about a model’s parameters is known as model uncertainty. We might also be uncertain of whether we picked the correct model class in the first place. Perhaps we are using a linear predictor but the phenomenon we are trying to predict is nonlinear. This is known as model specification uncertainty or architecture uncertainty. Epistemic uncertainty can be reduced by collecting more data in input regions where the original training set was sparse. It is less common for ML models to capture epistemic uncertainty. Often, those that do are referred to as probabilistic models.
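To make the difference between the two noise assumptions for aleatoric uncertainty concrete, the sketch below scores the same synthetic data under a homoscedastic and a heteroscedastic Gaussian noise model. Everything here is an illustrative assumption: the data-generating function, the noise schedule `true_std`, and the helper `gaussian_nll` are not taken from the paper, and the mean function is treated as known so that only the noise model differs.

```python
import numpy as np

def gaussian_nll(y, mean, std):
    """Average negative log-likelihood of y under N(mean, std^2)."""
    return float(np.mean(0.5 * np.log(2 * np.pi * std**2)
                         + 0.5 * ((y - mean) / std) ** 2))

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=5000)

# Synthetic data whose noise level grows with the input (assumed for illustration).
true_std = 0.1 + 0.9 * x
mean = np.sin(2 * np.pi * x)
y = mean + true_std * rng.normal(size=x.shape)

# Homoscedastic model: one noise level shared by the whole input space.
homo_std = np.sqrt(np.mean((y - mean) ** 2))
nll_homo = gaussian_nll(y, mean, homo_std)

# Heteroscedastic model: the noise level is a function of the input.
nll_hetero = gaussian_nll(y, mean, true_std)
```

On data like this the heteroscedastic model attains the lower negative log-likelihood, because a single global noise level is too wide where the data are clean and too narrow where they are noisy.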
Given a probabilistic predictive model, aleatoric and epistemic uncertainties can be quantified separately, as described in Appendix A. We depict them both separately in Figure 2. Being aware of which regions of the input space present large aleatoric uncertainty can help ML practitioners identify issues in their data collection process. On the other hand, epistemic uncertainty tells us about which regions of input space we have yet to learn about. Thus, epistemic uncertainty is used to detect dataset shift (ovadia2019can), or adversarial datapoints (Bayesian_active_learning). It is also used to guide methods that require exploration such as active learning (houlsby2011bayesian), continual learning (2018variational), Bayesian optimisation (hernandez2014predictive), and reinforcement learning (Successor_Uncertainties).
3.3. Methods to Quantify Uncertainty
Most ML approaches involve a noise model, thus capturing aleatoric uncertainty. However, few are able to express epistemic uncertainty. When we say that a method is able to quantify uncertainty, we are implicitly referring to those that capture both epistemic and aleatoric uncertainty. These methods can be broadly classified into two categories: Bayesian approaches (welling2011bayesian; graves2011practical; chen2014stochastic; blundell2015weight; hernandez2015probabilistic; kingma2015variational; gal2016dropout; maddox2019simple; farquhar_radial_2020; zhang2020csgmcmc; antoran2020depth) and non-Bayesian, or frequentist, approaches (lakshminarayanan2017simple; lee2018simple; ahuja2019probabilistic; van2020uncertainty; lingkai2020sdenet; liu2020simple).
Bayesian methods explicitly define a hypothesis space of plausible models a priori (before observing any data) and use deductive logic to update these priors given the observed data. In parametric models, like Bayesian Neural Networks (BNNs) (mackay1992practical; neal1995bayesian), this is most often done by treating model weights as random variables instead of single values, and assigning them a prior distribution $p(\mathbf{w})$. Given some observed data $\mathcal{D}$, the conditional likelihood $p(\mathcal{D} \mid \mathbf{w})$ tells us how well each weight setting explains our observations. The likelihood is used to update the prior, yielding the posterior distribution over the weights $p(\mathbf{w} \mid \mathcal{D})$:
$$p(\mathbf{w} \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \mathbf{w})\, p(\mathbf{w})}{p(\mathcal{D})} \tag{1}$$
Predictions for test points $\mathbf{x}^*$ are made through marginalization: all possible weight configurations are considered, with each configuration’s predictions being weighted by that set of weights’ posterior density. The disagreement among the predictions from different plausible weight settings induces model (epistemic) uncertainty. Thus, the predictive posterior distribution:
$$p(y^* \mid \mathbf{x}^*, \mathcal{D}) = \int p(y^* \mid \mathbf{x}^*, \mathbf{w})\, p(\mathbf{w} \mid \mathcal{D})\, d\mathbf{w} \tag{2}$$
captures both epistemic and aleatoric uncertainty.
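For a model small enough that the posterior is available in closed form, the update and marginalization steps above can be sketched directly. The example below uses Bayesian linear regression with a Gaussian prior and a known Gaussian noise level, which is a deliberately simple stand-in for a BNN; the variable names (`prior_precision`, `noise_std`, `features`, `predictive`) and the synthetic data are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def features(x):
    """Map scalar inputs to [1, x], so the model is y = w0 + w1 * x + noise."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    return np.stack([np.ones_like(x), x], axis=1)

noise_std = 0.3        # assumed known aleatoric noise level
prior_precision = 1.0  # prior over weights: N(0, I / prior_precision)

x_train = rng.uniform(-1.0, 1.0, size=30)
y_train = 0.5 + 2.0 * x_train + noise_std * rng.normal(size=x_train.shape)
Phi = features(x_train)

# Posterior over weights (Bayes' rule in closed form for this conjugate model).
post_cov = np.linalg.inv(prior_precision * np.eye(2) + Phi.T @ Phi / noise_std**2)
post_mean = post_cov @ Phi.T @ y_train / noise_std**2

def predictive(x_test, n_samples=2000):
    """Monte Carlo approximation of the predictive posterior at x_test."""
    phi = features(x_test)
    w_samples = rng.multivariate_normal(post_mean, post_cov, size=n_samples)
    sampled_means = w_samples @ phi.T             # shape (n_samples, n_test)
    epistemic_var = sampled_means.var(axis=0)     # disagreement between weight samples
    aleatoric_var = np.full(phi.shape[0], noise_std**2)
    return sampled_means.mean(axis=0), epistemic_var, aleatoric_var

mean_in, epi_in, ale_in = predictive([0.0])    # inside the training range
mean_out, epi_out, ale_out = predictive([5.0])  # far outside the training range
```

In this sketch the aleatoric term is the same everywhere, while the epistemic term grows as the test point moves away from the training data; their sum is the total predictive variance.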
In recent years, the ML community has moved towards favoring NNs as their choice of model due to their flexibility and scalability to large amounts of data. Unfortunately, the more complicated the model, the more difficult it is to compute the exact posterior distribution over the weights. For NNs, it is analytically and computationally intractable (hernandez2015probabilistic). However, various approximations have been proposed. Among the most popular are variational inference (hinton1993keeping; blundell2015weight; gal2016dropout) and stochastic gradient MCMC (welling2011bayesian; chen2014stochastic; zhang2020csgmcmc). Methods that provide more faithful approximations, and thus more calibrated uncertainty estimates, tend to be more computationally intensive and scale worse to larger models. As a result, the best method will vary depending on the use case.
Often, the predictive distribution, Equation (2), is also intractable. In practice, for parametric models like NNs, it is approximated by making predictions with multiple plausible models, sampled from the (approximate) posterior. We see this in the leftmost plot from Figure 2: the predictions from different ensemble elements can be interpreted as samples which, when combined, approximate the predictive posterior (wilson2020bayesian). Different predictions tend to agree in the data-dense regions but they disagree elsewhere, yielding epistemic uncertainty. Note that, because we need to evaluate multiple models, sampling-based approximations of the predictive posterior incur additional computational cost at test time compared to non-probabilistic methods.
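The idea of combining predictions from multiple plausible models can be illustrated with a small ensemble. The sketch below is not the procedure used for Figure 2: as an assumed stand-in for posterior samples it fits cubic polynomials to bootstrap resamples of a synthetic dataset, and the helper names `fit_bootstrap_ensemble` and `ensemble_predict` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training data confined to the interval [-1, 1].
x_train = rng.uniform(-1.0, 1.0, size=40)
y_train = np.sin(3.0 * x_train) + 0.1 * rng.normal(size=x_train.shape)

def fit_bootstrap_ensemble(x, y, n_members=20, degree=3):
    """Fit one polynomial per bootstrap resample of the training set."""
    members = []
    for _ in range(n_members):
        idx = rng.integers(0, len(x), size=len(x))  # sample with replacement
        members.append(np.polyfit(x[idx], y[idx], deg=degree))
    return members

def ensemble_predict(members, x_test):
    """Return the ensemble mean and the spread (std) across members."""
    preds = np.stack([np.polyval(coeffs, x_test) for coeffs in members])
    return preds.mean(axis=0), preds.std(axis=0)

members = fit_bootstrap_ensemble(x_train, y_train)
x_test = np.array([0.0, 3.0])  # one data-dense point, one far from the data
mean, spread = ensemble_predict(members, x_test)
```

The members agree closely at the data-dense test point and diverge at the distant one, and every prediction requires evaluating all members, which is the extra test-time cost mentioned above.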
Also worth mentioning are Bayesian nonparametrics, such as Gaussian Processes (gp_book). These are easy to deploy and allow for exact probabilistic reasoning, making their predictions and uncertainty estimates very robust. Unfortunately, their computational cost grows cubically with the number of datapoints, making them a very strong choice of model only in the small data regime (roughly 5000 points or fewer). Otherwise, approximate algorithms, such as variational inference (matthews2017scalable), are required.
Frequentist methods do not specify a prior distribution over hypotheses. They exclusively consider how well the distribution over observations implied by each hypothesis matches the data. Here, uncertainty stems from how we would expect an outcome to change if we were to repeatedly sample different sets of data given our chosen hypothesis. Ensembles (Lobato2009PredictionBO) train multiple models in different ways to obtain multiple plausible fits. At test time, the disagreement between ensemble elements’ predictions yields model uncertainty, as shown in Figure 2. Currently, deep ensembles (lakshminarayanan2017simple) are one of the best-performing uncertainty quantification approaches for NNs (ashukha2019pitfalls), retaining calibration even under dataset shift (ovadia2019can). Unfortunately, the computational cost involved with running multiple models at both train and test time also makes ensembles one of the most expensive methods. Within frequentist methods, post-hoc approaches are especially attractive; they allow us to obtain uncertainty estimates from non-probabilistic models, independently of how these were trained. A principled way to do this is to leverage the curvature of a loss function around its optima. Optima in flatter regions suffer from less variance (hochreiter1997flat), suggesting our model’s parameters are well specified by the data. This is used to draw plausible weight samples by Resampling Uncertainty Estimation (RUE) (schulam2019can) and local ensembles (Madras2020Detecting). alaa2020discriminative adapt the Jackknife, a traditional frequentist method, to generating post-hoc confidence intervals in small-scale neural networks. We discuss additional uncertainty quantification approaches in Appendix C.
3.4. Uncertainty Evaluation
Calibration is a form of quality assurance for uncertainty estimates. It is not enough to provide larger error bars when our model is more likely to make a mistake. Our predictive distribution must reflect the true distribution of our targets. Recall our cancer diagnosis scenario, where the system declines to make a prediction when uncertainty is above a threshold, and a doctor is queried instead. Because the doctor’s time is limited, we design our system such that it only declines to make a prediction if it estimates that the probability of the prediction being wrong exceeds a chosen threshold. If, instead of being well-calibrated, our system is underconfident, we would over-query the doctor in situations where the AI’s prediction is correct. Overconfidence would result in us taking action on unreliable predictions: delivering unnecessary treatment or abstaining from providing necessary treatment.
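The deferral rule in this scenario can be written as a short function. This is a minimal sketch under stated assumptions: the model outputs a probability of cancer for each patient, the predicted label is obtained by thresholding at 0.5, and the tolerance `max_error_prob=0.05` is a hypothetical value chosen only for the example.

```python
import numpy as np

def predict_or_defer(prob_cancer, max_error_prob):
    """Return the predicted labels and a mask of cases to send to a doctor."""
    prob_cancer = np.asarray(prob_cancer, dtype=float)
    predicted_label = (prob_cancer >= 0.5).astype(int)
    # Model's own estimate of the chance that the predicted label is wrong.
    error_prob = np.minimum(prob_cancer, 1.0 - prob_cancer)
    defer = error_prob > max_error_prob
    return predicted_label, defer

labels, defer = predict_or_defer([0.02, 0.40, 0.97, 0.60], max_error_prob=0.05)
```

The rule is only as good as the probabilities it consumes: if they are overconfident, too few cases are deferred, and if they are underconfident, the doctor is queried more often than necessary.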
Calibration is orthogonal to accuracy. A model with a predictive distribution that matches the marginal distribution of the targets would be perfectly calibrated but would not provide any useful predictions. Thus, calibration is usually measured in tandem with accuracy through either a general fidelity metric (most often chosen to be a proper scoring rule (gneiting2007strictly)), which subsumes both objectives, or two separate metrics. The most common metrics of the former category are negative log-likelihood (NLL) and Brier score (brier1950verification). Of the latter type, expected calibration error (ECE) (naeini2015obtaining) is popularly used in classification scenarios. ECE segregates a model’s predictions into bins depending on their predictive probability. In our example, this would mean grouping all patients who have been assigned a probability of having cancer in the lowest range (e.g., between 0 and 0.1 when using 10 equal-width bins) into a first bin, those who have been assigned a probability in the next range (e.g., between 0.1 and 0.2) into the second bin, etc. For each bin, calibration error is the difference between the proportion of patients who actually have the disease and the average of the probabilities assigned to that bin. This is illustrated in Figure 2, where we use 10 bins. In this example, our model presents overconfidence in some bins and underconfidence in the others. We refer to Appendix B for a discussion of additional calibration metrics, including some for regression.
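The binning procedure described above can be sketched as follows. This is one common way to compute ECE for a binary problem, assuming equal-width bins over the predicted probability of the positive class and weighting each bin by the fraction of patients it contains; the function name and the toy inputs are illustrative.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Binary ECE with equal-width bins over the predicted probability."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges give bin indices 0 .. n_bins - 1.
    bin_ids = np.digitize(probs, edges[1:-1], right=True)

    ece = 0.0
    for b in range(n_bins):
        in_bin = bin_ids == b
        if not in_bin.any():
            continue
        observed_rate = labels[in_bin].mean()  # proportion who have the disease
        mean_prob = probs[in_bin].mean()       # average assigned probability
        ece += in_bin.mean() * abs(observed_rate - mean_prob)
    return float(ece)

# Ten patients assigned 0.8, of whom eight have the disease: calibrated.
ece_calibrated = expected_calibration_error([0.8] * 10, [1] * 8 + [0] * 2)
# Ten patients assigned 0.9, of whom only five have the disease: overconfident.
ece_overconfident = expected_calibration_error([0.9] * 10, [1] * 5 + [0] * 5)
```

Because the absolute gap is averaged, ECE by itself does not say whether a bin is overconfident or underconfident; inspecting the sign of `observed_rate - mean_prob` per bin recovers that information.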
A transparent ML model requires both well-calibrated uncertainty estimates and an effective way to communicate them to stakeholders. If uncertainty is not well-calibrated, our model cannot be transparent since its uncertainty estimates provide false information. Thus calibration is a precursor to using uncertainty as a form of transparency.
4. Using Uncertainty
In this section, we discuss the use of uncertainty based on the three motivations presented in Section 2. The different uses mentioned are not mutually exclusive; as such, they could all be leveraged simultaneously depending on the use case. In Section 6, we discuss the importance of gathering requirements on how stakeholders, both internal and external, will use uncertainty estimates.
4.1. Uncertainty and Fairness
An unfair ML model is one that exhibits unwanted bias towards or against one or more target classes. Here we discuss possible ways in which bias can be mitigated when uncertainty affects measurement of the features, data collection, and modeling.
Uncertainty in the data
Measurement bias, also known as feature noise, is a case of aleatoric uncertainty (defined in Section 3), and arises when one or more of the features in the data only represent a proxy for the features that were intended to be measured.
We briefly describe noise on two types of features.
First, noise in the sensitive attribute.
In contexts such as our running example, information on race and ethnicity of patients may not be collected (chen2019fairness).
Another example is answers provided by participants in a survey, who may have incentives to misreport their religious or political affiliations.
The experimental results of gupta2018proxy have shown that enforcing fairness constraints on a noisy sensitive attribute, without assumptions on the structure of the noise, is not guaranteed to lead to any improvement in the fairness properties of the classifier trained on that data. Subsequent papers have explored which assumptions are needed in order to obtain such guarantees.
When the noise only depends on the true unobserved value of the sensitive attribute (i.e., the noise follows the “mutually contaminated learning model” (scott2013classification)), the measures of demographic parity and equalized odds computed on the observed data are equal to the true metrics up to a scaling factor that depends on the noise rates (lamy2019noise); if the noise rates are known, then the true metrics can be directly estimated. When information on the protected class is unavailable, but it can be predicted from an auxiliary dataset, disparity measures are generally unidentifiable (chen2019fairness; kallus2019assessing).
Second, we discuss noise in the outcome. In a case similar to our running example, medical expenses were used as a proxy for illness, and the algorithm severely underestimated the prevalence of the illness for the population of Black patients (obermeyer2019dissecting). Interestingly, this bias has attracted less attention in the fairness community. Notably, jiang2020identifying; blum2019recovering have shown that fairness constraints are guaranteed to improve the properties of the classifier only under appropriate assumptions on the noise. The work of fogliato2020fairness has shown that even a small amount of noise can greatly impact the assessment of the fairness properties of the model. This type of uncertainty can be mitigated by an appropriately specified noise model.
Sampling bias is epistemic in nature and occurs when the observations in the available data are not representative of the true data distribution. Models trained in the presence of sampling bias could exhibit unwanted bias towards the underrepresented group. For example, the differential performance of gender classification systems across racial groups may be due to underrepresentation of Black individuals in the sample (buolamwini2018gender). Similarly, historical overpolicing of certain communities has unavoidable impacts on the data used to train algorithms for predictive policing (lum2016predict; ensign2018runaway). In our cancer diagnostics scenario, we would likely prefer to deploy a model that is trained on a population of patients from the same hospital that will treat future patients.
This type of bias, often called representation bias (suresh2019framework), can show up due to the difference in the relative distributions of the classes in the training data as compared to the population being represented. In addition, if there is a mismatch between the diversity representation in the data used to build an algorithm and the diversity of the target population, the algorithm can suffer from population bias (olteanu2019social). This problem also arises when deploying an algorithm trained on one population on a different population. Note that if sampling bias only consists of covariate shift (shimodaira2000improving) and the model is correctly specified, then it will not affect the quality of the model’s predictions. Even so, sample size may remain an issue. Additionally, in many domains, the sample size may not be large enough to assess the existence of biases (ethayarajh2020your).
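Returning to noise in the sensitive attribute discussed earlier in this subsection, the scaling relationship for demographic parity can be checked with simple arithmetic. The sketch assumes a mutually contaminated setting in which each observed group is a mixture of the two true groups, with hypothetical contamination rates `alpha` and `beta`; the positive-prediction rates 0.30 and 0.60 are made-up numbers for illustration.

```python
def observed_dp_gap(rate_group0, rate_group1, alpha, beta):
    """Demographic parity gap measured with noisy group labels.

    Assumed mixture model: the observed group 0 contains a fraction alpha of
    true group 1 members, and the observed group 1 contains a fraction beta
    of true group 0 members.
    """
    noisy_rate0 = (1.0 - alpha) * rate_group0 + alpha * rate_group1
    noisy_rate1 = beta * rate_group0 + (1.0 - beta) * rate_group1
    return noisy_rate1 - noisy_rate0

alpha, beta = 0.1, 0.2
true_gap = 0.60 - 0.30
noisy_gap = observed_dp_gap(0.30, 0.60, alpha=alpha, beta=beta)

# With known noise rates, dividing by (1 - alpha - beta) recovers the true gap.
recovered_gap = noisy_gap / (1.0 - alpha - beta)
```

Under these assumptions the observed gap is the true gap shrunk by the factor (1 - alpha - beta), so noisy group labels make a model look fairer than it is unless the noise rates are accounted for.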
Uncertainty in model specification
This type of uncertainty arises when the hypothesis class chosen for modeling the data may not contain the true data-generating process and could result in unwanted bias in model predictions. For example, we might prefer a simple and explainable model for cancer diagnostics that, mistakenly, does not account for the nonlinearity present in the data. Similarly, for ethical reasons we might choose to exclude the patient’s race as a predictor from the model, when this information could help improve its performance (vyas2020hidden). The hypothesis class or the family of functions used to fit the data is primarily determined using domain knowledge and preferences of the model designer. For these reasons, the resulting model should be seen only as an approximation of the true data-generating process (buja2019models). In addition, using different benchmarks for the measurement of an algorithm’s performance can lead to different choices in the final model. As a result, the trained classifier may not achieve high performance, even with potentially unlimited and rich data.
Bias arising from model uncertainty can potentially be mitigated by enlarging the hypothesis class, for example by using deep neural networks, when the datasets are sufficiently large. In general, this kind of bias is hard to disentangle from data uncertainty and therefore it can rarely be detected or analysed.
Uncertainty and bias mitigation
We now present some of the methods to mitigate data bias and the possible implications of using uncertainty. These methods are typically categorized as pre-, in-, or post-processing based on the stage at which the model learning is intervened upon.
Pre-processing techniques modify the distribution of the data the classifier will be trained on, either directly (calmon2017optimized) or in a low-dimensional representation space (zemel2013learning). Implicitly, these techniques reduce the uncertainty emerging from the features, outcome, or sensitive variables. Uncertainty estimation at this stage involves representing and comparing the training data distribution and random population samples with respect to target classes. The uncertainty measurements represented as distribution shifts of targeted classes between the training data and the population samples give a measure of training data bias which can be corrected by data augmentation of underrepresented classes or by collecting larger and richer datasets (chen2018my). The equalized odds post-processing method of hardt2016equality is guaranteed to reduce the bias of the classifier under an assumption on the noise in the sensitive attributes, namely the independence of the classifier prediction and the observed attribute, conditional on both the outcome and the true sensitive attribute (awasthi2020equalized).
In-processing methods modify the learning objective by introducing constraints through which the resulting classifier can achieve the desired fairness properties (agarwal2018reductions; zhang2018mitigating; donini2018empirical). Comparisons of the distribution of the features can be used to detect uncertainty in the model during training. When little or no information about the sensitive attribute is available, distributionally robust optimization can be used to enforce fairness constraints (hashimoto2018fairness; wang2020robust). Algorithmic fairness approaches that employ active learning to either acquire features (noriega2019active) or samples (anahideh2020fair) with accuracy and fairness objectives also fall under this category.
Post-processing techniques essentially modify the model’s predictions post-training to satisfy a chosen fairness criterion (hardt2016equality). Frameworks like (kamiran2012decision) can use uncertainty information to debias the output of such models. Here, the predictions that are associated with high uncertainty can be skewed to favor a sensitive class as a debiasing post-processing measure. Uncertainty in predictions can also be used to abstain from making decisions or defer the decisions to experts, which can lead to overall improvement in accuracy and fairness of the predictions (madras2018predict). An optimization-based approach is proposed in (wei2019optimized) to transform the scores with a suitable trade-off between utility of the predictions and fairness, with the assumption that the scores are well-calibrated.
Often there exist trade-offs between the different notions of fairness (corbett2017algorithmic; chouldechova2017fair; kleinberg2016inherent) (see Appendix E for the definitions of the commonly used fairness metrics). For example, it has been shown that calibration and equalized odds cannot be achieved simultaneously when the base rates of the sensitive groups are different (pleiss2017fairness). Interestingly, this impossibility can be overcome, as shown in (canetti2019soft), by deferring uncertain predictions to experts. It is important that the uncertainty measurements produced by the prediction models are meaningful and unbiased (romano2019malice) for reliable functioning of such bias mitigation methods and for communication to decision-makers.
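The post-processing idea of skewing high-uncertainty predictions towards a sensitive class can be sketched as a simple decision rule. This is a simplified illustration in the spirit of reject-option style post-processing, not the exact procedure of any cited work: the uncertainty band `margin`, the 0.5 decision threshold, and the function name are all assumptions.

```python
import numpy as np

def reject_option_postprocess(prob_favorable, is_unprivileged, margin=0.1):
    """Reassign only the predictions that fall in the uncertain band.

    prob_favorable: predicted probability of the favorable outcome.
    is_unprivileged: boolean mask marking members of the sensitive group.
    margin: half-width of the band around 0.5 treated as uncertain (assumed).
    """
    prob_favorable = np.asarray(prob_favorable, dtype=float)
    is_unprivileged = np.asarray(is_unprivileged, dtype=bool)

    decisions = (prob_favorable >= 0.5).astype(int)
    uncertain = np.abs(prob_favorable - 0.5) <= margin

    # Uncertain cases: favorable outcome for the unprivileged group,
    # unfavorable outcome for the privileged group.
    decisions[uncertain & is_unprivileged] = 1
    decisions[uncertain & ~is_unprivileged] = 0
    return decisions, uncertain

decisions, uncertain = reject_option_postprocess(
    prob_favorable=[0.45, 0.55, 0.90, 0.10],
    is_unprivileged=[True, False, False, True],
    margin=0.1,
)
```

Confident predictions are left untouched, so the intervention is confined to the region where the model itself signals that it may be wrong; the same `uncertain` mask could instead be used to defer those cases to an expert.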
4.2. Uncertainty and Decision-making
While ML can be used for many purposes, one use is to support or augment human decision-making. Depending on the context, the forms of decision support vary. For example, a model could recommend a product, suggest a risk score, detect potential abnormalities, predict a future event, etc. All of these situations require the end user to make a decision weighing uncertainty, whether the uncertainty is explicitly presented or not, asking themselves: Should I accept or rely on the model’s output? Sometimes when users are given output from multiple models, they may ask: Which model should I accept or rely on? These questions correspond to the prototypical tasks of decision-making under uncertainty/risk^1 as studied in the Judgment and Decision-Making (JDM) literature, i.e., action threshold decisions and multi-option choices (fischhoff2014communicating). While recent work has only begun to examine how uncertainty estimates might affect user interactions with ML models and decision task performance (zhang2020effect; arshad2015investigating), we highlight a few conclusions from the bulk of JDM literature that suggest how uncertainty estimates might be used in decision-making.
^1 Social scientists often use the term “risk” for chances of negative events known to the decision-maker, and “uncertainty” for the unmeasurable likelihood of events that are uncertain (but can be assessed by the decision-maker) (rakow2010risk; knight1921risk).
For classification tasks, decision-makers can use the probability score representing uncertainty as explicit risk information: the chance that the model prediction is wrong. Prospect Theory suggests that risk is not considered independently but together with the expected outcome (tversky_advances_1992; kahneman2013prospect). That is, a prediction with a small uncertainty but leading to a large loss might be perceived more negatively than a prediction with medium uncertainty for a small loss. How people value choices in terms of their risks and outcomes is also non-linear and asymmetrical. Specifically, when faced with a risky choice leading to gains, people are risk-averse; when the choice leads to losses, people are risk-seeking. Since uncertainty in ML-mediated decisions is more often than not framed as a gain (e.g., a 95% chance or confidence of the model being correct, instead of a 5% chance of the model being wrong), people may have a non-linear risk-averse tendency: As the stakes of the decision outcome increase, people’s tolerance for the magnitude of uncertainty, i.e., the acceptable level of confidence, could decrease at a faster rate (van2020effects).
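The asymmetry described above can be made concrete with the value and probability-weighting functions used in cumulative prospect theory. The sketch below is illustrative only: the functional forms and the default parameters (0.88, 2.25, 0.61) are the commonly cited estimates associated with (tversky_advances_1992), treated here as assumptions, and the names `value` and `weight` are hypothetical.

```python
def value(x: float, alpha: float = 0.88, beta: float = 0.88, lam: float = 2.25) -> float:
    """Subjective value of an outcome x relative to a reference point of 0.

    Concave for gains, convex for losses, and steeper for losses (loss aversion).
    Parameter defaults are assumed, commonly cited estimates.
    """
    if x >= 0:
        return x ** alpha
    return -lam * (-x) ** beta


def weight(p: float, gamma: float = 0.61) -> float:
    """Decision weight attached to a stated probability p (inverse-S shape)."""
    return p ** gamma / (p ** gamma + (1 - p) ** gamma) ** (1 / gamma)


# A sure gain of 50 versus a 50/50 gamble on 100 (probability weighting ignored here):
sure_gain = value(50)
gamble_gain = 0.5 * value(100)

# A sure loss of 50 versus a 50/50 gamble on losing 100:
sure_loss = value(-50)
gamble_loss = 0.5 * value(-100)
```

Under these assumed parameters the sure gain is valued above the gamble (risk aversion for gains), while the gamble is valued above the sure loss (risk seeking for losses), matching the pattern described in the text.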
How people actually assess risk, however, also depends on how the uncertainty estimates are communicated and perceived. Both lay people and experts trained in statistics rely on mental shortcuts, or heuristics, to interpret uncertainty (tversky1974judgment). This could lead to biased understanding or appraisal of uncertainty even if it is accurately measured. In Section 5, we discuss some of these biases and their implications for communication of uncertainty. There are also individual differences in one’s acceptable or preferred level of uncertainty (politi_communicating_2007), depending on many factors, such as expertise (experts might tolerate less uncertainty (heath1991preference)), personality (e.g., uncertainty orientation (sorrentino1988uncertainty)) and cognitive style (miller1987monitoring).
As discussed in Section 3, uncertainty in ML models may come from different sources. To our knowledge, the empirical understanding of how decision-makers make use of aleatory versus epistemic uncertainty is limited. There is some prior work in the JDM literature showing that in the face of epistemic uncertainty about the occurrence of future events, people may postpone their decision (tversky1992disjunction). Understanding how people react to different types of uncertainty, and uncertainty expressed as a range or confidence intervals as in regression tasks, could be important gaps to fill by future work on human-AI interaction.
4.3. Uncertainty and Trust Formation
While trust could be implicit in a decision to rely on a model’s suggestion, the communication of a model’s uncertainty can also affect people’s general trust in an ML system. At a high level, communicating uncertainty is a form of model transparency that can help gain people’s trust. However, a look into the underlying construct of trust, and how people form trust, paints a more complex picture of how end users and stakeholders might use uncertainty estimates to form trust in an ML system.
While not limited to MLpowered systems, the HCI and Human Factors communities have a long history of studying trust in automation (lee2004trust; hoff2015trust; korber2018theoretical). These models of trust often build on mayer1995integrative’s classic ABI model of interpersonal trust, which postulates that the perceived trustworthiness of the trustee is determined by three attributes: 1) Ability: The level of competencies that enable the trustee to have influence within the targeted domain. 2) Benevolence: The extent to which a trustee is perceived to want to do good to the trustor. 3) Integrity: The extent to which the trustee consistently adheres to a set of principles that the trustor finds acceptable. Taking into account some fundamental differences between interpersonal trust and trust in automation, lee2004trust adapted the ABI model to posit that trustworthiness of automated systems is determined by three underlying dimensions: Competence, Intention of Developers, and Predictability/Understandability. We speculate that communicating uncertainty estimates could be relevant to all three of these dimensions. If a model always has high uncertainty, it will harm the model’s perceived Competence. If a model shows uncertainty that could not be understood or expected, it will be negatively perceived in Predictability. If uncertainty is not communicated or intentionally miscommunicated, users or stakeholders will hold a negative opinion on the Intention of Developers.
To anticipate how uncertainty estimates and ways to communicate them could impact user trust, it is also useful to consider process models on how people develop trust. Rooted in information-processing and decision-making theories (petty1986elaboration; chaiken1999heuristic; kahneman2011thinking), process models differentiate between an analytic, or systematic, process of trust formation, and an affective, or heuristic, process of trust formation (lee2004trust; metzger2013credibility; sundar2008main). Specifically, the former process involves rational evaluation of a trustee’s characteristics, while the latter process relies on feelings or heuristics to form a quick judgment to (un)trust. When lacking either the ability or motivation to perform an analytic evaluation, people rely more on the affective or heuristic route (petty1986elaboration; sundar2019machine). While detailed probabilistic uncertainty estimates could facilitate analytic evaluation of model trustworthiness, it is important to note that users, especially lay people, might simply rely on some kind of heuristics or feelings that are invoked by the presentation of uncertainty information. For example, for some users the mere presence of uncertainty information could signal that the ML engineers are being transparent and sincere, which then enhances their trust (hovland1953communication). For others, uncertainty could invoke negative heuristics, such as being read as a lack of expertise (van_der_bles_communicating_nodate). Even the style of communication matters. Prior work suggests that politely communicating the existence of uncertainty could promote trust (parasuraman2004trust). How uncertainty estimates are processed for trust formation, and what kind of affective impact or heuristics related to trust they could invoke, remain open questions and merit future research.
Lastly, we highlight a non-trivial point: the goal of presenting uncertainty estimates to end users and stakeholders should be to support forming appropriate trust, rather than blindly enhancing trust. A well-measured and well-communicated uncertainty estimate should not only facilitate the calibration of overall trust in a system, but also the resolution of trust (lee2004trust; cohen1998trust), referring to how precisely the judgment of trust could differentiate types of model capabilities, for example in what situations the system is more or less trustworthy.
5. Communicating Uncertainty
Treating uncertainty as a form of transparency also requires accurately communicating it to the stakeholders. However, even well-calibrated uncertainty estimates could be perceived inaccurately by people because (a) people have varying levels of understanding about probability and statistics, and (b) human perception of uncertainty quantities is often biased by their decision-making heuristics. In this section, we will review some of these issues that hinder people’s understanding of uncertainty estimates and will discuss how various communication methods may help address these issues. We will first describe how to communicate uncertainty in the form of confidence or prediction probabilities for classification tasks, and then more broadly in the form of ranges, confidence intervals, or full distributions. We will then dive into a case study on the utility of uncertainty communication during the COVID-19 pandemic.
5.1. Issues in Understanding Uncertainty
Many application domains involve communicating uncertainty estimates to the general public to help them make decisions, e.g., weather forecasting, transit information system (kay_when_2016), medical diagnosis and interventions (politi_communicating_2007). One key issue in these applications is that a great deal of their audience may not have high numeracy skills and may not understand uncertainty correctly. In a survey (galesic_statistical_2010) conducted in 2010 on statistical numeracy across the US and Germany, it was found that many people do not understand relatively simple statements that involve statistics concepts. For example, 20% of the German and US participants could not say “which of the following numbers represents the biggest risk of getting a disease: 1%, 5%, or 10%,” and almost 30% could not answer whether 1 in 10, 1 in 100, or 1 in 1000 represents the largest risk. Another study (zikmundfisher_validation_2007) found that people’s numeracy skills significantly affect how well they comprehend risks. Many of the aforementioned decisionmaking scenarios involve highstake decisions, thus it is vital to find alternative ways to communicate uncertainty estimates to people with low numeracy skills.
Besides numeracy skills, research (cf. (kahneman2011thinking; reyna_numeracy_2008; spiegelhalter_risk_2017)) shows that humans in general suffer from a variety of cognitive biases, some of which hinder our understanding of uncertainties. One of them is called ratio bias, which refers to the phenomenon that people sometimes believe a ratio with a big numerator is larger than an equivalent ratio with a small numerator. For example, people may see 10/100 as indicating higher odds of having breast cancer than 1/10. This same phenomenon is sometimes manifested as an underweighting of the denominator, e.g., believing 9/11 is smaller than 10/13. This is also called denominator neglect.
In addition to ratio biases, people’s perception of probabilities is also distorted in that they tend to underweight high probabilities while overweighting low probabilities, and this distortion prevents people from making optimal decisions. zhang_ubiquitous_2012 showed that when people are asked to estimate probabilities or frequencies of events based on memory or visual observations, their estimates are distorted in a way that follows a log-odds transformation of the true probabilities. Research (tversky_advances_1992; zhang_designing_2015) also found that this bias occurs when people are asked to make decisions under risk and that their decisions imply such distortions. Therefore, when communicating probabilities, we need to be aware that people’s perception of high risks may be lower than the actual risk, while that of low risks may be higher than actual.
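One way to picture the log-odds distortion described above is a transformation that is linear in log-odds: the perceived log-odds are a weighted mix of the true log-odds and the log-odds of a fixed anchor probability. The sketch below is an illustration under that assumption, not a fitted model from the cited studies; the function names and the default values of `gamma` (slope) and `p0` (anchor) are hypothetical.

```python
import math


def logit(p: float) -> float:
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1 - p))


def inv_logit(z: float) -> float:
    """Inverse of logit: map log-odds back to a probability."""
    return 1 / (1 + math.exp(-z))


def perceived_probability(p: float, gamma: float = 0.6, p0: float = 0.4) -> float:
    """Distort a true probability p with a linear-in-log-odds transformation.

    gamma < 1 compresses probabilities toward the anchor p0, so small
    probabilities are overweighted and large ones underweighted.
    gamma and p0 are hypothetical illustration values.
    """
    return inv_logit(gamma * logit(p) + (1 - gamma) * logit(p0))


low = perceived_probability(0.05)    # perceived as larger than 0.05
high = perceived_probability(0.95)   # perceived as smaller than 0.95
```

Setting `gamma` to 1 recovers undistorted perception, which makes the parameter a convenient single knob for reasoning about how strongly an audience might compress stated probabilities.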
A different kind of cognitive bias that impacts people’s perception of uncertainty is framing (kahneman2011thinking). Framing has to do with how information is contextualized. Typically, people prefer options with positive framing (e.g., an 80% likelihood of surviving breast cancer) over an equivalent option with negative framing (e.g., a 20% likelihood of dying from breast cancer). This bias has an effect on how people perceive uncertainty information. A remedy to this bias is to always describe the uncertainty of both positive and negative outcomes, rather than relying on the audience to infer what’s left out of the description.
5.2. Communication Methods
Choosing the right communication methods can address some of the above issues. van_der_bles_communicating_nodate categorize the different ways of expressing uncertainty into nine groups with increasing precision, from explicitly denying that uncertainty exists to displaying a full probability distribution. While high-precision communication methods help experts understand the full scale of the uncertainty of the ML models, low-precision methods require less numeracy skill and can be used for lay people. In this paper, we focus on the pros and cons of the four more precise methods of communicating uncertainty: 1) describe the degree of uncertainty using a predefined categorization, 2) describe a numerical range, 3) show a summary of a distribution, and 4) show the full probability distribution. The first two methods can be communicated verbally, while the last two often require visualizations.
A predefined, ordered categorization of uncertainty and risk levels reduces the cognitive effort needed to comprehend uncertainty estimates, and therefore is particularly likely to help people with low numeracy skills (peters_numeracy_2007). A great example of how to appropriately use this technique is the GRADE guidelines (balshem_grade_2011), which introduce a fourcategory system, from high to very low, to rate the quality of evidence for medical treatments. GRADE has provided definitions for each category and a detailed description of the aspects of studies to evaluate for constructing quality ratings. Uncertainty ratings are also frequently used by financial agencies to communicate the overall risks associated with an investment instrument (dionisio2007entropy).
The main drawback of communicating uncertainty via predefined categories is that the audience, especially nonexperts, might not be aware of or even misinterpret the threshold criteria of the categories. Many studies have shown that although individuals have internally consistent interpretation of words for describing probabilities (e.g., likely, probably), these interpretations can vary substantially from one person to another (cf. (clark_verbal_1990; budescu_consistency_1985; lichtenstein_empirical_1967)). More recently, budescu_effective_2012 investigated how the general public interpret the uncertainty information in the climate change report published by the Intergovernmental Panel on Climate Change (IPCC). They found that people generally interpreted the IPCC’s categorical description of probabilities as less likely than the IPCC intended. For example, people took the word “very likely” as indicating a probability of around 60%, whereas the IPCC’s guideline specifies that it indicates a greater than 90% probability. To avoid such misinterpretation, both categorical and numerical forms of uncertainty should be communicated, when possible.
Though numbers and numerical ranges are more precise than categorical scales in communicating uncertainty, as discussed earlier, they are harder to understand for people with low numeracy and can induce ratio biases. However, a few techniques can be used to remediate these problems. First, to overcome the adverse effect of denominator neglect, it is important to present ratios with the same denominator so that they can be compared with just the numerator (spiegelhalter_visualizing_2011). Denominators that are powers of 10 are preferred since they are easier to compute. There are so far no conclusive findings on whether frequencies are easier to understand than ratios or percentages, but people do seem to perceive risk probabilities represented in the frequency format as showing higher risk than those represented in the percentage format (cf. (reyna_numeracy_2008)). Therefore, it is helpful to use a consistent format to represent probabilities, and if the audience underestimates risk levels, the frequency format may be preferred.
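The common-denominator advice above is simple to apply mechanically. As a small sketch (the function name `to_common_denominator` and the choice of 1000 as the base are assumptions for illustration), ratios can be rescaled to a shared power-of-10 denominator so that only the numerators need to be compared:

```python
from fractions import Fraction
from typing import List, Sequence, Tuple


def to_common_denominator(ratios: Sequence[Tuple[int, int]], base: int = 1000) -> List[int]:
    """Rescale each (numerator, denominator) pair to 'k in base', rounded to an integer."""
    return [round(Fraction(n, d) * base) for n, d in ratios]


# The ratio pairs discussed in Section 5.1.
ratios = [(1, 10), (10, 100), (9, 11), (10, 13)]
counts = to_common_denominator(ratios)
statements = [f"{k} in 1000" for k in counts]
# counts == [100, 100, 818, 769]
```

On a common base, 1/10 and 10/100 are visibly identical, and 9/11 is visibly larger than 10/13, which removes the opportunity for ratio bias and denominator neglect in these comparisons.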
Uncertainty estimates can also be represented with graphics, which have several advantages over verbal communications, such as attracting and holding the audience’s attention, revealing trends or patterns in the data, and evoking mental mathematical operations (lipkus_visual_1999). Commonly used visualizations include pie charts, bar charts, and more recently, icon arrays (Figure 2(a)). Pie charts are particularly useful for conveying proportions since all possible outcomes are depicted explicitly. However, it is more difficult to make accurate comparisons with pie charts than with bar charts because pie charts use areas to represent probabilities. Icon arrays vividly depict part-to-whole relationships, and because they show the denominator explicitly, they can be used to overcome ratio biases.
So far, what we have discussed pertains mostly to conveying uncertainty of a binary event, which takes the form of a single number (probability), whereas the uncertainty of a continuous variable or model prediction takes the form of a distribution. This latter type of uncertainty estimate can be communicated either as a series of summary statistics about the distribution, or directly as the full distribution. Commonly reported summary statistics include mean, median, confidence intervals, standard deviation, and quartiles (theory_of_statistics). These statistics are often depicted graphically as error bars and boxplots for univariate data, and two-dimensional error bars and bagplots (rousseeuw_bagplot_1999) for bivariate data. We describe these summary statistics and plots in detail in Appendix B. Error bars only have a few graphical elements and are hence relatively easy to interpret. However, since they have represented a range of different statistics in the past, they are ambiguous if presented without explicit labeling (wilke_fundamentals_2019). Error bars may also overly emphasize the range within the bar (correll_error_2014). Boxplots and bagplots are less popular in the mass media, and generally require some training to understand.
When presenting uncertainty about a single model prediction, it might be better to show the entire posterior predictive distribution, which can avoid overemphasis of the within-bar range and allow more granular visual inferences. Popular visualizations of distributions are histograms, density plots, and violin plots (which show multiple density plots side by side (hintze_violin_1998)), but they seem to be hard for an uninitiated audience to grasp. They are often mistaken as bivariate case-value plots in which the lines or bars denote values instead of frequencies (boels_conceptual_2019). More recently, kay_when_2016 developed quantile dot plots to convey distributions (see Figure 2(b) for an example). These plots use stacked dots, where each dot represents a group of cases, to approximate the data frequency at particular values. This method translates the abstract concept of probability distribution into a set of discrete outcomes, which are more familiar concepts to people who have not been trained in statistics. kay_when_2016’s study showed that people could more accurately derive probability estimates from quantile dot plots than from density plots.
One very different approach to conveying uncertainty is to individually show random draws from the probability distribution as a series of animation frames called hypothetical outcome plots (HOP) (hullman_hypothetical_2015). Similar to the quantile dot plots, HOPs accommodate the frequency view of uncertainty very well. In addition, showing events individually does not add any new visual encodings (such as the length of the bar or height of the stacked dots) and thus requires no additional learning from the viewers. hullman_hypothetical_2015 showed that this visualization enabled people to make more accurate comparisons of two or three random variables than error bars and violin plots, presumably because statistical inference based on multiple distribution plots requires special strategies while HOP does not. The drawbacks of HOP are: (a) it takes more time to show a representative sample of the distribution, and (b) it may incur high cognitive load since viewers need to mentally count and integrate frames. Nevertheless, because this method is easy to understand for people with low numeracy, similarly animated visualizations are frequently used in the mass media, e.g., (yau_years_2015; badger_income_2018).
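To make the quantile dot plot idea concrete, the sketch below computes where the dots would go for a hypothetical predictive distribution. It is a simplified reading of the technique, not the procedure of kay_when_2016: the normal distribution (a prediction of 12 with standard deviation 3), the choice of 20 dots, and the bin width are all assumptions, and drawing the stacked dots is left to a plotting library.

```python
from collections import Counter
from statistics import NormalDist
from typing import Dict, List


def quantile_dots(dist: NormalDist, n_dots: int = 20) -> List[float]:
    """Place n_dots at evenly spaced quantiles, so each dot stands for 1/n_dots probability."""
    return [dist.inv_cdf((i + 0.5) / n_dots) for i in range(n_dots)]


def stack_dots(dots: List[float], bin_width: float = 1.0) -> Dict[float, int]:
    """Bin the dots; each bin's count is the height of that stack in the plot."""
    counts = Counter(round(d / bin_width) * bin_width for d in dots)
    return dict(sorted(counts.items()))


def fraction_at_most(dots: List[float], threshold: float) -> float:
    """Probability estimate a viewer gets by counting dots at or below the threshold."""
    return sum(1 for d in dots if d <= threshold) / len(dots)


predictive = NormalDist(mu=12.0, sigma=3.0)  # hypothetical predictive distribution
dots = quantile_dots(predictive)
stacks = stack_dots(dots)
estimate = fraction_at_most(dots, 15.0)
```

Counting dots turns a question about area under a density curve into a question about discrete outcomes; here the counted estimate for the threshold 15 stays close to the exact value `predictive.cdf(15.0)`.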
The above methods are designed to communicate uncertainty around a single quantity, so they need to be extended for visualizing uncertainty around a range of predictions, such as those in time-series forecasting. The simplest form of such visualization is a quantile plot, which uses lines to connect predictions at equal quantiles of the uncertainty distribution across the output range. When used in time-series forecasting, such plots are called cone-of-uncertainty plots (see Figure 2(c)), in which the cone enlarges over time, indicating increasingly uncertain predictions. Gradient plots, or fan charts (see Figure 2(d)) in the context of time-series forecasting, can be used to show more granular changes in uncertainty, but they require extra visual encoding that may not be easily understood by the viewer. In contrast, spaghetti plots simply represent each model’s predictions as one line, while uncertainty can be inferred from the closeness of the lines. However, they might put too much emphasis on the lines themselves and deemphasize the range of the model predictions. Lastly, HOP can also be used to show uncertainty estimates over a range of predictions by showing each model’s predictions in an animation frame (wilke_fundamentals_2019).
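The quantile lines behind a cone-of-uncertainty plot or fan chart can be computed directly from a set of sampled forecast trajectories. The following is a minimal sketch under stated assumptions: the function name `quantile_bands`, the central interval levels, and the synthetic trajectories (whose spread grows linearly with the horizon) are all hypothetical, and the per-step quantiles use linear interpolation between sample values.

```python
from statistics import quantiles
from typing import Dict, List, Sequence, Tuple


def quantile_bands(
    sample_paths: Sequence[Sequence[float]],
    levels: Sequence[float] = (0.5, 0.95),
) -> Dict[float, List[Tuple[float, float]]]:
    """For each time step, return the (lower, upper) bounds of each central interval.

    sample_paths: equally long forecast trajectories, e.g. draws from one model
    or the outputs of several models.
    """
    horizon = len(sample_paths[0])
    bands: Dict[float, List[Tuple[float, float]]] = {level: [] for level in levels}
    for t in range(horizon):
        values = [path[t] for path in sample_paths]
        # 199 cut points: cuts[i] is the (i + 1) / 200 quantile of the samples.
        cuts = quantiles(values, n=200, method="inclusive")
        for level in levels:
            lower_idx = round((1 - level) / 2 * 200) - 1
            upper_idx = round((1 + level) / 2 * 200) - 1
            bands[level].append((cuts[lower_idx], cuts[upper_idx]))
    return bands


# Synthetic trajectories: 101 paths fanning out from 100 over three steps.
paths = [[100 + k * (t + 1) for t in range(3)] for k in range(-50, 51)]
bands = quantile_bands(paths)
widths_95 = [upper - lower for lower, upper in bands[0.95]]
```

Connecting the lower and upper bounds across time steps gives the cone; shading several nested levels gives a fan chart. In this synthetic example the 95% band widens at every step, which is the visual signature of increasingly uncertain predictions.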
5.3. COVID Case Study
Uncertainty communication as a form of transparency is pivotal to garnering public trust during a pandemic. The COVID-19 global pandemic is an exemplar setting: disease forecasts, amongst other tools, have become critical for health communication efforts (jewell2020predictive). In this setting, forecasts are being disseminated to governments, organizations, and individuals for policy, resource allocation, and personal-risk judgments and behavior (li2020estimated; eker2020validity; petropoulos2020forecasting). The United States Centers for Disease Control and Prevention (CDC) maintains Influenza Forecasting Centers of Excellence, which have recently turned to creating public-facing hubs for COVID-19 forecasts, with the purpose of integrating infectious disease forecasting into public health decision-making (ray2020ensemble). For example, we may want to forecast the number of deaths due to COVID-19. The CDC repository contains several individual models for this task. The types of models include the classic susceptible-infected-recovered infectious disease model, statistical models fit to case data, regression models with various types of regularization, and others. Each model has its own assumptions and approaches to computing and illustrating underlying predictive uncertainty. Figure 4 shows how uncertainty from one model is visualized as a predictive band with a mean estimate highlighted. Given that such models are disseminated widely to the public via the Internet and other channels, and their results can directly affect personal behavior and disease transmission, this setting exemplifies an opportunity for user-centered design in uncertainty expression. In particular, assessing the forms of uncertainty visualization can be useful (e.g., 95% confidence intervals versus 50% confidence intervals, versus showing multiple different models, etc.). Indeed, the forms of uncertainty in COVID-19 forecasts could be used to inform a study to systematically assess user-specific uncertainty needs (e.g., a municipal public health department may require the most conservative estimate for adequate resource allocation, while an individual may be more interested in trends to plan their own activities).
6. Uncertainty Requirements
Familiarity with the findings discussed in Section 5 above will be helpful for teams building uncertainty into ML workflows. Yet, none of these findings should be treated as conclusive when it comes to predicting how usable a given expression of uncertainty will be for different types of users facing different kinds of constraints in realworld settings. Instead, findings from the literature should be treated as fertile ground for generating hypotheses in need of testing with real users, ideally engaged in the concrete tasks of their typical workflow, and ideally doing so in realworld settings.
It is important to recognize just how diverse individual users are, and how different their social contexts can be. In our cancer diagnostic scenario, the needs and constraints of a doctor making a timepressed and highstakes medical decision using an MLpowered tool will likely be very different from those of a patient attempting to understand their diagnosis, and different again from those of an ML engineer reviewing model output in search of strategies for model improvement. Furthermore, if we zoom in on any one of these user populations, we still typically observe a tremendous diversity in skills, experience, environmental constraints, and so on. For example, among doctors, there can be big differences in terms of statistical literacy, openness to trusting MLpowered tools, time available to consume and decide on model output, and so on. These variations have important implications for designing effective tools.
To design and build an effective expression of uncertainty, we need to begin with an understanding of who the tool will be used by, what goal that user has, and what needs and constraints the user has. Frequently we also need to understand the organizational and social context in which a user is embedded. For example, we may need to understand how an organization calculates and processes risk, which can influence the design of human-in-the-loop processes, automation, where thresholds are set, and so on. This point is not a new one, and it is by no means unique to the field of ML. User-centered design (UCD), human-computer interaction (HCI), user experience (UX), human factors, and related fields have arisen as responses to this challenge across a wide range of product and tool design contexts (goodman2009three; goodman2012observing; preece2015interaction).
UCD and HCI have a firm footing in many software development contexts, yet they remain relatively neglected in the field of ML. Nevertheless, a growing body of research is beginning to demonstrate the importance of usercentered design for work on ML tools (e.g., (thieme2020machine; inkpen2019human)). For example, yang2016investigating draw on field research with healthcare decisionmakers to understand why an MLpowered tool that performed well in laboratory tests was rejected by clinicians in realworld settings. They found that users saw little need for the tool, lacked trust in its output, and faced environmental barriers that made it difficult to use. narayanan2018humans conduct a series of user tests for explainability to uncover which kinds of increases in explanation complexity have the greatest effect on the time it takes for users to achieve certain tasks. doshi2017towards propose a framework for evaluation of explainability that incorporates tests with users engaged in concrete and realistic tasks. From a practitioner’s perspective, lovejoy2018ux describes the usercentered design lessons learned by the team building Google Clips, an AIenabled camera designed to capture candid photographs of familiar people and animals. One of their key conclusions is that “[m]achine learning won’t figure out what problems to solve. If you aren’t aligned with a human need, you’re just going to build a very powerful system to address a very small — or perhaps nonexistent — problem.”
Research to uncover user goals, needs, and constraints can involve a wide spectrum of methods, including but not limited to indepth interviews, contextual inquiry, diary studies, card sorting studies, user tests, user journey mapping, and jobstobedone workshops with users (goodman2009three; goodman2012observing; rubin2008plan). It is helpful to divide user research into two buckets: 1) discovery research, which aims to understand what problem needs to be solved and for which type of user; and 2) evaluative research, which aims to understand how well our attempts to solve the given problem are succeeding with real users. Ideally, discovery research precedes any effort to build a solution, or at least occurs as early in the process as possible. Doing so helps the team focus on the right problem and right user type when considering possible solutions, and can help a team avoid costly investments that create little value for users. Which of the many methods a researcher uses in discovery and evaluative research will depend on many factors, including how easy it is to find relevant participants, how easy it is for the researcher to observe participants in the context of their daytoday work, how expensive and timeconsuming it is for the team to prototype potential solutions for the purposes of user testing, and so on. The key takeaway is that teams building uncertainty into ML workflows should do user research to understand what problem needs solving and for what type of user.
7. Conclusion
Throughout this paper, we have argued that uncertainty is a form of transparency and is pertinent to the FAccT community. We surveyed the machine learning, visualization/HCI, decisionmaking and fairness literature. We reviewed how to quantify uncertainty and leverage it in three use cases: (1) for developers reducing the unfairness of models, (2) for experts making decisions, and (3) for stakeholders placing their trust in ML models. We then described the methods for and pitfalls of communicating uncertainty, concluding with a discussion on how to collect requirements for leveraging uncertainty in practice. In summary, wellcalibrated uncertainty estimates improve ML model transparency. In addition to calibration, it is important that these estimates are applied coherently and communicated clearly to various stakeholders considering the use case at hand. Future work could study the interplay between FAccT topics and uncertainty. For example, one could explore how communicating uncertainty to a stakeholder affects their perception of a model’s fairness, or one could study how to best measure the calibration of uncertainty in regression settings. We hope this work inspires others to study uncertainty as transparency and to be mindful of uncertainty’s effects on models in deployment.
8. Acknowledgments
The authors would like to thank the following individuals for their advice, contributions, and/or support: James Allingham (University of Cambridge), McKane Andrus (Partnership on AI), Hudson Hongo (Partnership on AI), Terah Lyons (Partnership on AI), Elena Spitzer (Google), Kush Varshney (IBM), and Carroll Wainwright (Partnership on AI).
UB acknowledges support from DeepMind and the Leverhulme Trust via the Leverhulme Centre for the Future of Intelligence (CFI), and from the Partnership on AI. JA acknowledges support from Microsoft Research. AW acknowledges support from the David MacKay Newton research fellowship at Darwin College, The Alan Turing Institute under EPSRC grant EP/N510129/1 & TU/B/000074, and the Leverhulme Trust via CFI.
References
Appendix A Uncertainty Metrics
In this appendix we detail different metrics with which the uncertainty in a predictive distribution can be summarized. We distinguish between metrics that communicate aleatoric uncertainty, those that communicate epistemic uncertainty, and those that inform us about the combination of both. We also distinguish between the classification and regression setting. Recall that, as discussed in Section 5, predictive probabilities more intuitively communicate a model’s predictions and uncertainty to stakeholders than a continuous predictive distribution. Therefore, summary statistics for uncertainty might play a larger role when building transparent regression systems.
A.1. Classification setting
Notation: Let $\mathcal{D} = \{(x_n, y_n)\}_{n=1}^{N}$ represent a dataset consisting of $N$ samples, where for each example $n$, $x_n$ is the input and $y_n \in \{1, \dots, K\}$ is the ground-truth class label. Let $p(y \mid x, w)$ be the output from the parametric classifier with model parameters $w$. In probabilistic models, the output predictive distribution can be approximated from $T$ stochastic forward passes (Monte Carlo samples), as described in Section 3: $p(y \mid x, \mathcal{D}) \approx \frac{1}{T} \sum_{t=1}^{T} p(y \mid x, w_t)$, where $w_t \sim p(w \mid \mathcal{D})$.
Predictive entropy:
The entropy (shannon1948mathematical) of the predictive distribution is given by Equation 3. Predictive entropy represents the overall predictive uncertainty of the model, a combination of aleatoric and epistemic uncertainties (mukhoti2018evaluating).
(3) $H[y \mid x, \mathcal{D}] = -\sum_{k=1}^{K} \left( \frac{1}{T} \sum_{t=1}^{T} p(y = k \mid x, w_t) \right) \log \left( \frac{1}{T} \sum_{t=1}^{T} p(y = k \mid x, w_t) \right)$
In the case of point-estimate deterministic models, predictive entropy is given by Equation 4 and captures only the aleatoric uncertainty.
(4) $H[y \mid x, w] = -\sum_{k=1}^{K} p(y = k \mid x, w) \log p(y = k \mid x, w)$
The predictive entropy always takes non-negative values between 0 and $\log K$. Its maximum is attained when the probability of every class is $1/K$. Predictive entropy can be additively decomposed into aleatoric and epistemic components:
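Expected entropy:
The expectation of entropy obtained from multiple stochastic forward passes captures the aleatoric uncertainty.
(5) $\mathbb{E}_{p(w \mid \mathcal{D})}\left[ H[y \mid x, w] \right] \approx -\frac{1}{T} \sum_{t=1}^{T} \sum_{k=1}^{K} p(y = k \mid x, w_t) \log p(y = k \mid x, w_t)$
Mutual information:
The mutual information (shannon1948mathematical) between the posterior of model parameters and the targets captures epistemic uncertainty (houlsby2011bayesian; gal2016uncertainty). It is given by Equation 6.
(6) $I[y, w \mid x, \mathcal{D}] = H[y \mid x, \mathcal{D}] - \mathbb{E}_{p(w \mid \mathcal{D})}\left[ H[y \mid x, w] \right]$
The predictive entropy can be recovered as the addition of the expected entropy and mutual information:
$H[y \mid x, \mathcal{D}] = \mathbb{E}_{p(w \mid \mathcal{D})}\left[ H[y \mid x, w] \right] + I[y, w \mid x, \mathcal{D}]$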
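As an illustration of this decomposition, the following minimal sketch estimates the three quantities from sampled class probabilities. It uses only the Python standard library; the function names and the toy probabilities are assumptions made for the example, not part of the original work.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def decompose_entropy(mc_probs):
    """Split predictive entropy into aleatoric and epistemic parts.

    mc_probs: list of T class-probability vectors, one per stochastic
    forward pass, all for the same input x.
    """
    T = len(mc_probs)
    K = len(mc_probs[0])
    # Monte Carlo estimate of the predictive distribution p(y | x, D).
    mean_probs = [sum(p[k] for p in mc_probs) / T for k in range(K)]
    predictive = entropy(mean_probs)                  # total uncertainty
    expected = sum(entropy(p) for p in mc_probs) / T  # aleatoric part
    mutual_info = predictive - expected               # epistemic part
    return predictive, expected, mutual_info

# Two passes that disagree confidently: mostly epistemic uncertainty.
disagree = [[0.99, 0.01], [0.01, 0.99]]
# Two passes that agree on a 50/50 split: purely aleatoric uncertainty.
agree = [[0.5, 0.5], [0.5, 0.5]]
total_d, alea_d, epi_d = decompose_entropy(disagree)
total_a, alea_a, epi_a = decompose_entropy(agree)
```

Both toy inputs have the same predictive entropy ($\log 2$), yet the split differs: the first is dominated by mutual information, the second by expected entropy.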
Variation ratio:
Variation ratio (freeman1965) captures the disagreement of a model’s predictions across multiple stochastic forward passes, as given by Equation 7.
(7) $\mathrm{VR} = 1 - \frac{f}{T}$
where $f$ represents the number of times the output class (the most frequently predicted class) was predicted across the $T$ stochastic forward passes.
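A short sketch of the variation ratio, assuming the predicted class (argmax) of each stochastic forward pass has already been collected into a list; the name `variation_ratio` and the sample values are illustrative.

```python
from collections import Counter

def variation_ratio(sampled_classes):
    """1 - f / T, where f is the count of the most frequently predicted class."""
    T = len(sampled_classes)
    _, f = Counter(sampled_classes).most_common(1)[0]
    return 1.0 - f / T

# Predicted class (argmax) from each of T = 10 stochastic forward passes.
passes = [2, 2, 2, 1, 2, 0, 2, 2, 1, 2]
vr = variation_ratio(passes)
```

Here class 2 is predicted in 7 of the 10 passes, so the variation ratio is 0.3; it is 0 when all passes agree.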
A.2. Regression setting
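We assume the same notation as in the classification setting with the distinction that our targets are continuous. For generality, we employ heteroscedastic noise models. Recall that this means our aleatoric uncertainty may be different in different regions of input space. We assume a Gaussian noise model with its mean and variance predicted by parametric models: $p(y \mid x, w) = \mathcal{N}\left(y;\, \mu_w(x),\, \sigma_w^2(x)\right)$. Approximately marginalizing over $w$ with $T$ Monte Carlo samples induces a Gaussian mixture over outputs. Its mean is obtained as:
$\mu(x) \approx \frac{1}{T} \sum_{t=1}^{T} \mu_{w_t}(x)$
Aleatoric and Epistemic Variances:
There is no closed-form expression for the entropy of the mixture of Gaussians (GMM). Instead, we use the variance of the GMM as an uncertainty metric. It also decomposes into aleatoric and epistemic components ($\sigma_a^2$, $\sigma_e^2$):
$\sigma^2(x) = \underbrace{\mathbb{E}_{p(w \mid \mathcal{D})}\left[\sigma_w^2(x)\right]}_{\sigma_a^2(x)} + \underbrace{\mathrm{Var}_{p(w \mid \mathcal{D})}\left[\mu_w(x)\right]}_{\sigma_e^2(x)}$
These are also estimated with MC:
$\sigma_a^2(x) \approx \frac{1}{T} \sum_{t=1}^{T} \sigma_{w_t}^2(x), \qquad \sigma_e^2(x) \approx \frac{1}{T} \sum_{t=1}^{T} \mu_{w_t}(x)^2 - \left( \frac{1}{T} \sum_{t=1}^{T} \mu_{w_t}(x) \right)^2$
Here, $\sigma_e^2$ reflects model uncertainty (our lack of knowledge about $w$), while $\sigma_a^2$ tells us about the irreducible uncertainty or noise in our training data. Similarly to entropy, we can express uncertainty in regression as the addition of aleatoric and epistemic components.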
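The variance decomposition can be sketched in a few lines, assuming each Monte Carlo sample of the weights yields one predicted mean and one predicted noise variance for the input of interest. The function name and the numbers are hypothetical.

```python
def decompose_variance(means, variances):
    """Aleatoric/epistemic split of the variance of a Gaussian mixture.

    means, variances: per-sample predictive means and noise variances
    obtained from T Monte Carlo samples of the weights, for one input x.
    """
    T = len(means)
    mean = sum(means) / T
    aleatoric = sum(variances) / T                         # average predicted noise
    epistemic = sum(m * m for m in means) / T - mean ** 2  # spread of the means
    return mean, aleatoric, epistemic

mean, aleatoric, epistemic = decompose_variance(
    means=[1.0, 2.0, 3.0], variances=[0.5, 0.5, 0.5]
)
total_variance = aleatoric + epistemic
```

In this toy case the sampled models agree on the noise level (0.5) but disagree on the mean, which contributes an epistemic variance of 2/3.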
We now briefly discuss other common summary statistics to describe continuous predictive distributions (theory_of_statistics). Note that these do not admit simple aleatoric-epistemic decompositions.
Percentiles: Percentiles tell us about the values below which there is a certain probability of our targets falling. For example, if the 20th percentile of our predictive distribution is 5, this means that values $\leq 5$ take up 20% of the probability mass of our predictive distribution.
Confidence Intervals: A confidence interval communicates that, with a probability $p$, our quantity of interest will lie within the provided range $[a, b]$. Thus, the commonly used 95% confidence interval tells us that the 2.5th percentile of our predictive distribution corresponds to $a$ and the 97.5th percentile corresponds to $b$.
Quantiles: Quantiles divide the predictive distribution into sections of equal probability mass. Quartiles, which divide the predictive distribution into four parts, are the most common use of quantiles. The first quartile corresponds to the 25th percentile, the second to the 50th (or median), and the third to the 75th percentile.
The summary statistics above are often depicted in error bar plots, box plots, and violin plots, as shown in Figure 5. Error bar plots provide information about the spread of the predictive distribution. They can reflect variance (or standard deviation), percentiles, confidence intervals, quantiles, etc. Box plots tell us about our distribution’s shape by depicting its quartiles. Additionally, box plots often depict longer error bars, referred to as “whiskers,” which tell us about the heaviness of our distribution’s tails. Whiskers are most commonly chosen to be of length 1.5 × the interquartile range (Q3 − Q1) or to span extreme percentile values, e.g. the 2nd to the 98th percentile. Samples that fall outside of the range depicted by whiskers are treated as outliers and plotted individually. Although less popular, violin plots have been gaining some traction for summarizing large groups of samples. Violin plots depict the estimated shape of the distribution of interest (usually by applying kernel density estimation to samples). They combine this with a box plot that provides information about quartiles. However, unlike regular box plots, violin plots’ whiskers are often chosen to reflect the maximum and minimum values of the sampled population.
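The percentile, interval, and quartile summaries described in this appendix can be read off samples from the predictive distribution. The sketch below uses `statistics.quantiles` from the Python standard library; the standard normal samples are a stand-in assumption for a real model's predictive samples.

```python
import random
import statistics

random.seed(0)
# Hypothetical samples from a predictive distribution (standard normal here).
samples = [random.gauss(0.0, 1.0) for _ in range(10_000)]

# With n=100, statistics.quantiles returns the 99 cut points between percentiles.
percentiles = statistics.quantiles(samples, n=100)
p20 = percentiles[19]                    # 20th percentile
# With n=40, cut points fall every 2.5%, so the first and last bound a 95% interval.
cuts = statistics.quantiles(samples, n=40)
ci_low, ci_high = cuts[0], cuts[-1]      # 2.5th and 97.5th percentiles
q1, median, q3 = statistics.quantiles(samples, n=4)
iqr = q3 - q1                            # interquartile range used for box-plot whiskers
```

These are the same quantities that error bar plots and box plots display; a plotting library would typically compute them internally.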
Appendix B Calibration Metrics
This section describes existing metrics that reflect the calibration of predictive distributions for classification. There are no widely adopted calibration metrics for regression within the ML community. However, the use of calibration metrics for regression is common in other fields, such as econometrics. We discuss how these can be adapted to provide analogous information to popular ML classification calibration metrics. Calibration metrics should be computed on a validation set sampled independently from the data used to train the model being evaluated. In our cancer diagnosis scenario, this could mean collecting validation data from different hospitals than those used to collect the training data.
Test Log Likelihood (higher is better): This metric tells us about how probable it is that the validation targets were generated using the validation inputs and our model. It is a proper scoring rule (gneiting2007strictly) that depends on both the accuracy of predictions and their uncertainty. We can employ it in both classification and regression settings. Log-likelihood is also the most commonly used training objective for neural networks. The popular classification cross-entropy and regression mean squared error objectives represent maximum log-likelihood objectives under categorical and unit variance Gaussian noise models, respectively (bishop_pattern).
Brier Score (brier1950verification) (lower is better): Proper scoring rule that measures the accuracy of predictive probabilities in classification tasks. It is computed as the mean squared distance between predicted class probabilities and one-hot class labels:
$\mathrm{BS} = \frac{1}{N} \sum_{n=1}^{N} \sum_{k=1}^{K} \left( p(y = k \mid x_n, w) - \mathbb{1}[y_n = k] \right)^2$
Unlike log-likelihood, Brier score is bounded from above. Erroneous predictions made with high confidence are penalised less by Brier score than by log-likelihood. This can avoid outliers or misclassified inputs from having a dominant effect on experimental results. On the other hand, it makes Brier score less sensitive.
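To make the contrast between the two scoring rules concrete, here is a minimal sketch that computes both on a toy binary problem. The function names and the probabilities are illustrative assumptions.

```python
import math

def mean_log_likelihood(probs, labels):
    """Average log-probability assigned to the true class (higher is better)."""
    return sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)

def brier_score(probs, labels):
    """Mean squared distance to one-hot labels (lower is better)."""
    total = 0.0
    for p, y in zip(probs, labels):
        total += sum((p_k - (1.0 if k == y else 0.0)) ** 2 for k, p_k in enumerate(p))
    return total / len(labels)

labels = [0, 1]
mild = [[0.9, 0.1], [0.4, 0.6]]                # both predictions correct
overconfident = [[0.9, 0.1], [0.999, 0.001]]   # second is confidently wrong

brier_mild = brier_score(mild, labels)
brier_over = brier_score(overconfident, labels)
ll_over = mean_log_likelihood(overconfident, labels)
```

With this definition, each point contributes at most 2 to the Brier score, whereas the log-likelihood of the confidently wrong prediction decreases without bound as its probability for the true class approaches 0.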
Expected Calibration Error (ECE) (naeini2015obtaining) (lower is better): This metric is popularly used to evaluate the calibration of deep classification neural networks. ECE measures the difference between predictive confidence and empirical accuracy in classification. It is computed by dividing the [0,1] range into a set of bins $\{B_s\}_{s=1}^{S}$ and weighting the miscalibration in each bin by the number of points that fall into it, $|B_s|$:
$\mathrm{ECE} = \sum_{s=1}^{S} \frac{|B_s|}{N} \left| \mathrm{acc}(B_s) - \mathrm{conf}(B_s) \right|$
Here, $\mathrm{acc}(B_s) = \frac{1}{|B_s|} \sum_{n \in B_s} \mathbb{1}[\hat{y}_n = y_n]$ and $\mathrm{conf}(B_s) = \frac{1}{|B_s|} \sum_{n \in B_s} \max_k p(y = k \mid x_n, w)$, where $\hat{y}_n = \arg\max_k p(y = k \mid x_n, w)$ is the predicted class.
ECE is not a proper scoring rule. A perfect ECE score can be obtained by predicting the marginal distribution of class labels for every input. A well-calibrated predictor with poor accuracy would obtain low log-likelihood values (undesirable result) but also low ECE (desirable result). Although ECE works well for binary classification, the naive adaptation to the multiclass setting results in a disproportionate amount of class predictions being assigned to low probability bins, biasing results. nixon2019measuring and kull2019beyond propose alternatives that mitigate this issue.
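A minimal sketch of ECE with equal-width bins follows. The choice of 10 bins and the function name are assumptions of the example; the inputs are the confidence of each prediction and whether it was correct.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins on [0, 1].

    confidences: probability of the predicted class for each point.
    correct: 1 if the prediction was right, else 0.
    """
    N = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += len(members) / N * abs(accuracy - avg_conf)
    return ece

# Four predictions at 75% confidence, three of which are correct: calibrated.
calibrated = expected_calibration_error([0.75] * 4, [1, 1, 1, 0])
# Four predictions at 95% confidence with the same accuracy: overconfident.
overconfident = expected_calibration_error([0.95] * 4, [1, 1, 1, 0])
```

The first case gives an ECE of 0, the second an ECE of 0.2, the gap between 95% confidence and 75% accuracy.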
Expected Uncertainty Calibration Error (UCE) (laves2019well) (lower is better): This metric measures the difference in expectation between a model’s error and its uncertainty. The key difference from ECE is that this metric quantifies model miscalibration with respect to predictive uncertainty (using a single uncertainty summary statistic, Appendix A.1), whereas ECE quantifies model miscalibration with respect to confidence (the probability of the predicted class).
(8) $\mathrm{UCE} = \sum_{s=1}^{S} \frac{|B_s|}{N} \left| \mathrm{err}(B_s) - \mathrm{uncert}(B_s) \right|$
Here, $\mathrm{err}(B_s) = \frac{1}{|B_s|} \sum_{n \in B_s} \mathbb{1}[\hat{y}_n \neq y_n]$ and $\mathrm{uncert}(B_s) = \frac{1}{|B_s|} \sum_{n \in B_s} \tilde{u}_n$,
where $\tilde{u}_n \in [0, 1]$ is a normalized uncertainty summary statistic (defined in Appendix A.1).
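The following sketch mirrors the ECE computation but bins by uncertainty rather than confidence. Normalizing predictive entropy by $\log K$ is one possible choice of normalized statistic and is an assumption of this example, as are the function names.

```python
import math

def normalized_entropy(probs):
    """Predictive entropy divided by log K, so that it lies in [0, 1]."""
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs))

def uncertainty_calibration_error(uncertainties, errors, n_bins=10):
    """UCE: bin by normalized uncertainty, compare error rate with mean uncertainty."""
    N = len(uncertainties)
    bins = [[] for _ in range(n_bins)]
    for u, e in zip(uncertainties, errors):
        bins[min(int(u * n_bins), n_bins - 1)].append((u, e))
    uce = 0.0
    for members in bins:
        if members:
            mean_u = sum(u for u, _ in members) / len(members)
            error_rate = sum(e for _, e in members) / len(members)
            uce += len(members) / N * abs(error_rate - mean_u)
    return uce

probs = [[0.5, 0.5], [0.5, 0.5], [1.0, 0.0], [1.0, 0.0]]
errors = [1, 0, 0, 0]   # 1 marks a misclassified point
uncertainties = [normalized_entropy(p) for p in probs]
uce = uncertainty_calibration_error(uncertainties, errors)
```

The two maximally uncertain points have an error rate of 0.5 against a mean uncertainty of 1, and they make up half the data, so the UCE is 0.25.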
Conditional Probabilities for Uncertainty Evaluation (mukhoti2018evaluating) (higher is better): The conditional probabilities p(accurate | certain) and p(uncertain | inaccurate) were proposed by mukhoti2018evaluating to evaluate the quality of uncertainty estimates obtained from different probabilistic methods on semantic segmentation tasks, but they can be used for any classification task.
p(accurate | certain) measures the probability that the model is accurate on its output given that it is confident in it. p(uncertain | inaccurate) measures the probability that the model is uncertain about its output given that it has made an inaccurate prediction. Based on these two conditional probabilities, the patch accuracy versus patch uncertainty (PAvPU) metric is defined as below.
$p(\text{accurate} \mid \text{certain}) = \frac{n_{ac}}{n_{ac} + n_{ic}}, \qquad p(\text{uncertain} \mid \text{inaccurate}) = \frac{n_{iu}}{n_{ic} + n_{iu}}, \qquad \mathrm{PAvPU} = \frac{n_{ac} + n_{iu}}{n_{ac} + n_{au} + n_{ic} + n_{iu}}$
Here, $n_{ac}$, $n_{au}$, $n_{ic}$, $n_{iu}$ are the number of predictions that are accurate and certain (AC), accurate and uncertain (AU), inaccurate and certain (IC), and inaccurate and uncertain (IU), respectively.
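A small sketch of these three quantities, assuming each prediction has already been flagged as accurate or not, and as certain or not (for instance by thresholding an uncertainty score; the threshold choice is outside this example). The function name and data are hypothetical.

```python
def uncertainty_quality(accurate, certain):
    """Return p(accurate | certain), p(uncertain | inaccurate) and PAvPU.

    accurate, certain: parallel lists of booleans, one entry per prediction
    (or per patch in segmentation). Assumes at least one certain and one
    inaccurate prediction, so that no denominator is zero.
    """
    pairs = list(zip(accurate, certain))
    n_ac = sum(a and c for a, c in pairs)
    n_au = sum(a and not c for a, c in pairs)
    n_ic = sum(not a and c for a, c in pairs)
    n_iu = sum(not a and not c for a, c in pairs)
    p_acc_given_cert = n_ac / (n_ac + n_ic)
    p_unc_given_inacc = n_iu / (n_ic + n_iu)
    pavpu = (n_ac + n_iu) / (n_ac + n_au + n_ic + n_iu)
    return p_acc_given_cert, p_unc_given_inacc, pavpu

accurate = [True, True, True, False, False, True]
certain = [True, True, False, False, True, True]
p_ac, p_ui, pavpu = uncertainty_quality(accurate, certain)
```

For this toy data the counts are 3 AC, 1 AU, 1 IC, and 1 IU, giving 0.75, 0.5, and 4/6 respectively.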
Regression Calibration Metrics: We can extend ECE to regression settings, while avoiding the pathologies described by nixon2019measuring. We seek to assess how well our model’s predictive distribution describes the residuals obtained on the test set. It is not straightforward to define bins, like in standard ECE, because our predictive distribution might not have finite support. We apply the cumulative distribution function (CDF) of our predictive distribution to our test targets. If the predictive distribution describes the targets well, the transformed distribution should resemble a uniform distribution with support $[0, 1]$. This procedure is common for backtesting market risk models (market_risk_book).
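Regression Calibration Error (RCE) (lower is better): To assess the global similarity between our targets’ distribution and our predictive distribution, we separate the $[0, 1]$ interval into $S$ equal-sized bins $\{B_s\}_{s=1}^{S}$. We compute calibration error in each bin as the difference between the proportion of points that have fallen within that bin and $1/S$:
$\mathrm{RCE} = \sum_{s=1}^{S} \frac{|B_s|}{N} \left| \frac{1}{S} - \frac{|B_s|}{N} \right|, \qquad |B_s| = \sum_{n=1}^{N} \mathbb{1}\left[ \mathrm{CDF}_{p(y \mid x_n)}(y_n) \in B_s \right]$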
Tail Calibration Error (TCE) (lower is better): In cases of model misspecification, e.g. our noise model is Gaussian but our residuals are multimodal, RCE might become large due to this mismatch, even though the moments of our predictive distribution might be generally correct. We can exclusively assess how well our model predicts extreme values with a “frequency of tail losses” approach (tail_frequency_paper). Only considering calibration at the tails of the predictive distribution allows us to ignore shape mismatch between the predictive distribution and the true distribution over targets. Instead, we focus on our model’s capacity to predict on which inputs it is likely to make large mistakes. We specify two bins $\{B_0, B_1\}$, one at each tail end of our predictive distribution, and compute:
$\mathrm{TCE} = \sum_{s=0}^{1} \frac{|B_s|}{|B_0| + |B_1|} \left| \tau - \frac{|B_s|}{N} \right|, \qquad |B_0| = \sum_{n=1}^{N} \mathbb{1}\left[ \mathrm{CDF}_{p(y \mid x_n)}(y_n) < \tau \right], \qquad |B_1| = \sum_{n=1}^{N} \mathbb{1}\left[ \mathrm{CDF}_{p(y \mid x_n)}(y_n) \geq 1 - \tau \right]$
We specify the tail range of our distribution by selecting $\tau$, the probability mass assigned to each tail. Note that this is slightly different from tail_frequency_paper, who use a binomial test to assess whether a model’s predictive distribution agrees with the distribution over targets in the tails. RCE and TCE are not proper scoring rules. Additionally, they are only applicable to one-dimensional continuous target variables.
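The two regression metrics can be sketched as follows for Gaussian predictive distributions. This is an illustration under stated assumptions rather than a reference implementation: the tail parameter `tau` is treated as the probability mass expected in each tail, the handling of empty tails is a choice made for this sketch, and all function names are hypothetical.

```python
import math

def gaussian_cdf(y, mean, std):
    """CDF of N(mean, std^2) evaluated at y."""
    return 0.5 * (1.0 + math.erf((y - mean) / (std * math.sqrt(2.0))))

def regression_calibration_error(cdf_values, n_bins=10):
    """RCE: compare the share of CDF values in each bin with the ideal 1/S."""
    N = len(cdf_values)
    counts = [0] * n_bins
    for u in cdf_values:
        counts[min(int(u * n_bins), n_bins - 1)] += 1
    return sum(c / N * abs(1.0 / n_bins - c / N) for c in counts)

def tail_calibration_error(cdf_values, tau=0.1):
    """TCE: the same comparison restricted to two tails of mass tau each."""
    N = len(cdf_values)
    lower = sum(u < tau for u in cdf_values)
    upper = sum(u >= 1.0 - tau for u in cdf_values)
    if lower + upper == 0:
        # Assumption of this sketch: with empty tails, report the full deficit tau.
        return tau
    return sum(c / (lower + upper) * abs(tau - c / N) for c in (lower, upper))

# Ideal case: CDF values spread evenly over [0, 1].
uniform_values = [(i + 0.5) / 100 for i in range(100)]
# Overconfident model: unit-variance predictions, targets far from the mean.
tail_values = [gaussian_cdf(y, 0.0, 1.0) for y in (-3.0, 3.0, -2.5, 2.5)]

rce_ideal = regression_calibration_error(uniform_values)
tce_ideal = tail_calibration_error(uniform_values)
rce_over = regression_calibration_error(tail_values)
tce_over = tail_calibration_error(tail_values)
```

Evenly spread CDF values give zero error on both metrics, while the overconfident model, whose targets all land in the extreme bins, scores 0.4 on both.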
Appendix C Uncertainty Quantification Methods
In this appendix, we describe additional approaches to uncertainty quantification and calibration which were omitted from Section 3.
C.1. Bayesian Methods
Various approximate inference methods have been proposed for Bayesian uncertainty quantification in parametric models, such as deep neural networks. We refer to the resulting models as Bayesian Neural Networks (BNN), Figure 6. Variational inference (hinton1993keeping; graves2011practical; blundell2015weight; farquhar_radial_2020) approximates a complex probability distribution $p(w \mid \mathcal{D})$ with a simpler distribution $q_\phi(w)$, parameterized by variational parameters $\phi$, by minimizing the Kullback-Leibler (KL) divergence between the two, $\mathrm{KL}\left(q_\phi(w) \,\|\, p(w \mid \mathcal{D})\right)$. In practice, this is done by maximising the evidence lower bound (ELBO), as given by Equation 9.
(9) $\mathcal{L}(\phi) = \mathbb{E}_{q_\phi(w)}\left[ \log p(\mathcal{D} \mid w) \right] - \mathrm{KL}\left( q_\phi(w) \,\|\, p(w) \right)$
In Bayesian neural networks, this objective can be optimised with stochastic gradient descent (kingma2015variational). After optimization, $q_\phi(w)$ represents a distribution over plausible models which explain the data well. In mean-field variational inference, the approximate weight posterior is represented by a fully factorized distribution, most often chosen to be Gaussian. Some stochastic regularization techniques, originally designed to prevent overfitting, can also be interpreted as instances of variational inference. The popular Monte Carlo dropout (gal2016dropout) method is a form of variational inference which approximates the Bayesian posterior with a multiplicative Bernoulli distribution over sets of weights. The stochasticity introduced through minibatch sampling in batchnorm can also be seen in this light (teye2018bayesian). Stochastic weight averaging Gaussian (SWAG) (maddox2019simple) computes a Gaussian approximation to the posterior from checkpoints of a deterministic neural network’s stochastic gradient descent trajectory. These approaches are simple to implement but represent crude approximations to the Bayesian posterior. As such, the uncertainty estimates obtained by using these methods may suffer from some limitations (foong_expressivity).
Stochastic gradient MCMC (welling2011bayesian; chen2014stochastic; zhang2020csgmcmc) methods allow us to draw biased samples from the posterior distribution over NN parameters in a minibatch-friendly manner.
Recent work has started to explore performing Bayesian inference over the function space of NNs directly. sun2018functional use a stochastic NN as a variational approximation to the posterior over functions. They define and approximately optimize a functional ELBO. Variational_Implicit_Processes use a variant of the wake-sleep algorithm to approximate the predictive posterior of a neural network with a Gaussian Process as a surrogate model. The Spectral-normalized Neural Gaussian Process (SNGP) (liu2020simple) enables us to compute predictive uncertainty through input distance awareness, avoiding Monte Carlo sampling. Neural stochastic differential equation models (SDE-Net) (lingkai2020sdenet) provide ways to quantify uncertainties from the stochastic dynamical system perspective using Brownian motion. More recently, antoran2020depth introduce Depth Uncertainty, an approach that captures model specification uncertainty instead of model uncertainty. By marginalizing over network depth, their method is able to generate uncertainty estimates from a single forward pass.
C.2. Non-Bayesian Uncertainty Quantification and Calibration Methods
Deterministic uncertainty quantification (DUQ) (van2020uncertainty) builds upon ideas of radial basis function networks (lecun1998gradient), allowing one to obtain uncertainty, computed as the distance to a centroid in latent space, with a single forward pass. guo2017calibration show how using a validation set to learn a multiplicative scaling factor (known as a temperature) applied to output logits is a cheap post-hoc way to improve the calibration of NN models.
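Temperature scaling can be sketched as follows. The temperature here divides the logits before the softmax, and a simple grid search over candidate temperatures stands in for the gradient-based optimizer one would normally use; the grid, the function names, and the validation logits are all assumptions of this example.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of logits divided by the temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def negative_log_likelihood(logit_rows, labels, temperature):
    """Mean negative log-probability of the true labels at a given temperature."""
    return -sum(
        math.log(softmax(row, temperature)[y]) for row, y in zip(logit_rows, labels)
    ) / len(labels)

def fit_temperature(logit_rows, labels, candidates):
    """Pick the candidate temperature with the lowest validation NLL."""
    return min(candidates, key=lambda t: negative_log_likelihood(logit_rows, labels, t))

# Hypothetical validation logits: very confident, but wrong on one of four points.
val_logits = [[8.0, 0.0], [8.0, 0.0], [0.0, 8.0], [8.0, 0.0]]
val_labels = [0, 0, 1, 1]
grid = [0.5 + 0.25 * i for i in range(39)]   # temperatures from 0.5 to 10.0
temperature = fit_temperature(val_logits, val_labels, grid)
```

Because the toy model is right only 75% of the time while predicting with near-total confidence, the selected temperature is well above 1, which softens the predicted probabilities without changing the predicted classes.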
Appendix D Other Algorithmic Use Cases for Uncertainty
Epistemic uncertainty can be used to guide model-based search in applications such as active learning (houlsby2011bayesian; kirsch2019batchbald). In an active learning scenario, we are provided with a large amount of inputs but few or none of them are labelled. We assume labelling additional inputs has a large cost, e.g. querying a human medical professional. In order to train our model to make the best possible predictions, we would like to identify for which inputs it would be most useful to acquire labels. Here, each input’s epistemic uncertainty tells us about how much our model would learn from seeing its labels and therefore directly answers our question.
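A minimal sketch of such an acquisition step is shown below, scoring each unlabelled input by the mutual information between its label and the model parameters. The pool, the input names, and the function names are hypothetical; each input is assumed to come with class probabilities from several sampled models.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mutual_information(mc_probs):
    """Epistemic uncertainty of one input from T sampled class distributions."""
    T, K = len(mc_probs), len(mc_probs[0])
    mean_probs = [sum(p[k] for p in mc_probs) / T for k in range(K)]
    return entropy(mean_probs) - sum(entropy(p) for p in mc_probs) / T

def select_for_labelling(pool_predictions, budget):
    """Return the names of the `budget` unlabelled inputs with the highest score."""
    ranked = sorted(
        pool_predictions,
        key=lambda name: mutual_information(pool_predictions[name]),
        reverse=True,
    )
    return ranked[:budget]

# Hypothetical pool: input name -> class probabilities from T = 2 sampled models.
pool = {
    "scan_a": [[0.9, 0.1], [0.9, 0.1]],    # models agree and are confident
    "scan_b": [[0.5, 0.5], [0.5, 0.5]],    # noisy input, but models agree
    "scan_c": [[0.95, 0.05], [0.1, 0.9]],  # models disagree
}
query = select_for_labelling(pool, budget=1)
```

Only `scan_c` is selected: `scan_b` is just as uncertain overall, but its uncertainty is aleatoric, so a label would teach the model little.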
Uncertainty can also be baked directly into the model learning procedure. In rejection-option classification, models can explicitly abstain from making predictions for points on which they expect to underperform (bartlett2008classification). For example, our credit limit model may have high epistemic uncertainty when predicting on customers with specific attributes (e.g., extremely low incomes) that are underrepresented in the training data; as such, uncertainty can be leveraged to decide whether the model should defer to a human based on the observed income level. Section 4 reviews various ways to use uncertainty to improve ML models.
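The deferral idea can be illustrated with a simple post-hoc rule that abstains whenever epistemic uncertainty exceeds a threshold. This is a simplification for illustration and not the learned rejection rule of the cited work; the threshold, names, and numbers are hypothetical.

```python
def predict_or_defer(mean_probs, epistemic_uncertainty, threshold):
    """Return the predicted class, or None to defer the decision to a human."""
    if epistemic_uncertainty > threshold:
        return None
    return max(range(len(mean_probs)), key=lambda k: mean_probs[k])

# Hypothetical applicants: (predictive probabilities, epistemic uncertainty).
applicants = {
    "typical_income": ([0.2, 0.8], 0.02),
    "very_low_income": ([0.45, 0.55], 0.40),
}
decisions = {
    name: predict_or_defer(probs, unc, threshold=0.1)
    for name, (probs, unc) in applicants.items()
}
```

The applicant from the underrepresented region is deferred, while the typical applicant receives the model's prediction.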
Distributional shift, which occurs when the distribution of the test set is different from the training set distribution, is sometimes considered as a special case of epistemic uncertainty, although malinin2018predictive explicitly consider it as a third type of uncertainty: distributional uncertainty. ovadia2019can compare different uncertainty methods in the specific case of dataset shifts. These techniques all have in common that they use epistemic uncertainty estimates to improve the model by identifying regions of the input space where the model performs badly due to a lack of data. Some work also shows how using uncertainty techniques can directly improve the performance of a model and can outperform using the softmax probabilities (gal2016dropout; kendall2015bayesian; kendall2016modelling). miller2018dropout discuss how dropout sampling helped in an open-set object detection task where new unknown objects can appear in the frame. Some other papers propose their own uncertainty techniques and present how they outperform different baselines, often focusing on the out-of-distribution detection task (devries2018learning; lakshminarayanan2017simple; malinin2018predictive).
It is also important to note that sometimes a model’s output could be used for downstream decision-making tasks by another model, whether an ML model, or an operational research model which will optimize a plan based on a predicted class or quantity. Uncertainty estimates are essential to transparently inform the downstream models of the validity of the input. For instance, stochastic optimization can take distributional input to reduce the cost of the solution in the presence of uncertainty. More generally, in systems where models, humans, and/or heuristics are chained, it is crucial to understand how the uncertainty of each step interacts with each other, and how it impacts the overall uncertainty of the system.
Appendix E Fairness Definitions
Algorithmic fairness is complementary to algorithmic transparency. The ML community has attempted to define various notions of fairness statistically: see barocashardtnarayanan for an overview of the fairness literature. While there is no single definition of fairness for all contexts of deployment, many define ML fairness as the absence of any prejudice or favoritism toward an individual or a group based on their inherent or acquired characteristics (Mehrabi2019ASO). Unfairness can be the result of biases in (a) data used to build the algorithms or (b) the algorithms chosen to be implemented. We discuss some standard definitions of fairness here from the machine learning literature (hardt2016equality; beutel2017data). Note that kleinberg2018inherent finds that typically it is not possible to satisfy many fairness notions simultaneously. Let us assume we have a classifier $f$, which outputs a predicted outcome $\hat{y} = f(x) \in \{0, 1\}$ for some input $x$. Let $y$ be the actual outcome for $x$. Let $a \in \{0, 1\}$ be a binary sensitive attribute that is contained explicitly in or encoded implicitly in $x$. When we refer to groups, we mean the two sets that result from partitioning a dataset based on $a$ (i.e., if Group 1 was $\{x : a = 0\}$, Group 2 would be $\{x : a = 1\}$). Let $a = 0$ (Group 1) be considered unprivileged. Below are three common fairness metrics.

Demographic Parity (DP): A classifier is considered to be fair with regard to DP (also known as statistical parity) if the following quantity is close to 0:
$\left| P(\hat{y} = 1 \mid a = 0) - P(\hat{y} = 1 \mid a = 1) \right|$
That is, the predicted positive rates for both groups should be the same (dwork2012fairness).

Equal Opportunity (EQ): A classifier is considered to be fair with regard to EQ if the following quantity is close to 0:
$\left| P(\hat{y} = 1 \mid a = 0, y = 1) - P(\hat{y} = 1 \mid a = 1, y = 1) \right|$
That is, the true positive rates for both groups should be the same (hardt2016equality).

Equalized Odds (EO): A classifier is considered to be fair with regard to EO if the following quantity is close to 0:
$\left| P(\hat{y} = 1 \mid a = 0, y = \gamma) - P(\hat{y} = 1 \mid a = 1, y = \gamma) \right|, \quad \gamma \in \{0, 1\}$
That is, we want to equalize the true positive and false positive rates across groups (hardt2016equality). EO is satisfied if $\hat{y}$ and $a$ are independent conditional on $y$.
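The three metrics can be computed from labels, predictions, and group membership as in the sketch below. Summarizing equalized odds as the larger of the true positive and false positive rate gaps is a choice made for this example, and the function names and data are hypothetical.

```python
def rate(predictions, mask):
    """Share of positive predictions among the points selected by mask."""
    selected = [p for p, m in zip(predictions, mask) if m]
    return sum(selected) / len(selected)

def fairness_gaps(y_true, y_pred, group):
    """Demographic parity, equal opportunity and equalized odds gaps."""
    def gap(outcome=None):
        # One mask per group, optionally restricted to a given true outcome.
        masks = [
            [g == a and (outcome is None or y == outcome) for g, y in zip(group, y_true)]
            for a in (0, 1)
        ]
        return abs(rate(y_pred, masks[0]) - rate(y_pred, masks[1]))
    dp = gap()                                # difference in positive prediction rates
    eq = gap(outcome=1)                       # difference in true positive rates
    eo = max(gap(outcome=1), gap(outcome=0))  # worst of the TPR and FPR gaps
    return dp, eq, eo

y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 1, 1, 0, 0]
group = [0, 0, 0, 0, 1, 1, 1, 1]
dp, eq, eo = fairness_gaps(y_true, y_pred, group)
```

In this toy data the unprivileged group receives positive predictions at a rate of 0.25 versus 0.5, and its true positive rate is 0.5 versus 1.0, while the false positive rates match.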