1 Introduction
Complex machine learning models, particularly deep learning models, are now being applied to a variety of practical prediction problems ranging from diagnosis of medical images (Kermany et al., 2018) to autonomous driving (Du et al., 2017). As a result, software systems with embedded machine learning components are becoming increasingly common. Many of these models will be black boxes from the perspective of downstream users, such as models developed remotely by commercial entities and hosted as a service in the cloud (Yao et al., 2017). For a variety of reasons (legal, economic, competitive), users will often have no direct access to the detailed workings of the model, how the model was trained, or the training data.

In this context it is increasingly important for the user of a model to have accurate and robust assessments of the quality of a model’s predictions. However, as an example, “self-confidence” estimates provided by machine learning predictors can often be quite unreliable and miscalibrated (Zadrozny and Elkan, 2002; Kull et al., 2017; Ovadia et al., 2019). In particular, complex models such as deep networks with high-dimensional inputs (e.g., images and text) can be significantly overconfident in practice (Gal and Ghahramani, 2016; Guo et al., 2017; Lakshminarayanan et al., 2017). Furthermore, test set performance metrics from train-test splits might not be an accurate reflection of downstream performance due to factors such as label and distribution shift (e.g., Lipton et al. (2018); Hendrycks and Gimpel (2017); Ovadia et al. (2019)) and/or implicit optimistic bias (Recht et al., 2019).
Thus, downstream users of blackbox predictors will need the capability to carry out assessment separately and independently from the training and evaluation procedures used when fitting the model. This assessment could for example be conducted by organizations not involved in training the model, in a manner similar to the assessments of commercial products carried out by regulatory agencies. Additional motivations for independent assessment include legal requirements that may mandate independent assessment of models, the need to build trust on the part of a human consumer of model predictions, or situations where the predictor is being deployed in an environment with a different distribution over inputs and outputs compared to the environment the model was trained on.
In this paper we develop a Bayesian framework for performing this assessment that requires no internal access to the model being evaluated, only its output probabilities. We focus on classification models in particular, although the ideas are broadly applicable to prediction models in general. Figure 1 provides an illustrative example applying our approach to assess performance of a ResNet classifier on the CIFAR100 dataset, with detailed assessments of the classwise accuracy and calibration properties, along with posterior credible intervals measuring how certain we are in our assessment given the data available.
Assessment of model properties such as accuracy and calibration requires labeled data. In real-world deployment scenarios, this data is likely to be scarce and costly to collect; consider, for example, a pretrained classification model being deployed in a diagnostic imaging context in a particular hospital. With this in mind we also develop a framework for active assessment of blackbox classifiers, using techniques from active learning to efficiently select instances to label so that model deficiencies such as low accuracy or high-cost mistakes can be quickly identified.
In summary, our primary contributions are:

We propose a general framework for blackbox classifier assessment, using a Bayesian approach that is applicable to a range of performance-related metrics.

We illustrate the utility of the framework via Bayesian inference with posterior uncertainty for quantities such as classwise accuracy, expected calibration error (ECE), and reliability diagrams.

We develop a new framework called active assessment and demonstrate how this framework can be used to identify extreme classes in an online, label-efficient manner with significant gains over traditional random sampling methods.
2 Preliminaries
2.1 Notation
We consider classification problems with a feature vector $x$ and a class label $y \in \{1, \ldots, K\}$, e.g., classifying image pixels $x$ into one of $K$ classes. We assume access to a trained prediction model $M$ that makes predictions of $y$ given a feature vector $x$. In particular we assume that the model produces $K$ numerical scores, one per class, reflecting its confidence, typically in the form of a set of estimates of class-conditional probabilities $p_M(y = k \mid x)$. Such probability estimates can be obtained from a logistic classifier, from the softmax output layer of a neural network, from averages over leaf nodes in tree-based models, and so on. A notational aside: for probabilities that are being generated by the model we use subscript $M$, e.g., $p_M(y = k \mid x)$. When we refer to the actual true probability with respect to the underlying true distribution we drop the subscript, e.g., when using terms like $p(y \mid x)$ and $p(x)$ in computing expectations.

Under 0-1 classification loss, $\hat{y} = \arg\max_k p_M(y = k \mid x)$ will be the classifier’s label prediction for a particular input $x$. We can define $s(x) = p_M(y = \hat{y} \mid x)$ as the score of a model, as a function of $x$, i.e., the class probability that the model produces for its predicted class $\hat{y}$ given input $x$. This is sometimes also referred to as a model’s confidence in its prediction and can be viewed as a model’s own estimate of its accuracy when it predicts $\hat{y}$ given $x$. The model’s scores in general need not be perfectly calibrated, i.e., they need not match the true probabilities $p(y = \hat{y} \mid x)$.
2.2 Datasets and Classification Models
Table 1: Datasets and classification models.

Dataset        Mode   Size  Classes  Model
CIFAR100       Image  10K   100      ResNet110
ImageNet       Image  50K   1000     ResNet152
SVHN           Image  100K  10       ResNet152
20 Newsgroups  Text   7.5K  20       BERT_{BASE}
DBpedia        Text   70K   14       BERT_{BASE}
Assessment Datasets
We assess performance characteristics of neural models on several standard image and text classification datasets. The image datasets we use are: CIFAR100 (Krizhevsky and Hinton, 2009), SVHN (Netzer et al., 2011) and ImageNet (Russakovsky et al., 2015). The text datasets we use are: 20 Newsgroups (Lang, 1995) and DBpedia (Zhang et al., 2015). Detailed statistics are provided in Table 1. The assessment datasets are based on standard test sets used for each dataset in the literature.
Prediction Models
For image classification we use ResNet (He et al., 2016) architectures with either 110 layers (CIFAR100) or 152 layers (SVHN and ImageNet). For ImageNet we use the pretrained model provided by PyTorch, and for CIFAR100 and SVHN we use the pretrained model checkpoints provided at https://github.com/bearpaw/pytorch-classification. For text classification tasks we use fine-tuned BERT_{BASE} (Devlin et al., 2019) models. These models were all trained on standard training sets in the literature, independent from the datasets used for assessment. To facilitate reproducing our results we provide all of the model predictions used in our experiments at https://github.com/disiji/bayesian-blackbox.
3 Bayesian Assessment of Classification Metrics
We focus on the problem of assessing the performance of a model on data drawn from some unknown distribution $p(x, y)$ representing the environment where the model is being used. This joint distribution need not be the same as the distribution that the model was trained on. We are interested in particular in the situation where the model is a black box, where we can observe the inputs $x$ and the outputs $p_M(y = k \mid x)$, but don’t have any other information about its inner workings (for example about any internal parameters of the model). Specifically, in this paper, rather than learning a model itself we want to learn about the characteristics of a fixed model that is making predictions in a particular environment.

A natural approach to assessing the performance of a blackbox classifier is to adopt a Bayesian framework where we treat the metrics of interest (classification accuracy, calibration error) as unknown parameters that we estimate from (limited) labeled data drawn from the distribution $p(x, y)$.
3.1 Assessing Classwise Accuracy
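Beginning with classification accuracy, the marginal accuracy of a classification model is defined as $\mathcal{A} = \mathbb{E}_{p(x, y)}[\mathbb{I}(\hat{y} = y)]$. We also define regional accuracy over local regions of the input space. For any region $\mathcal{R}$ in the input space, regional accuracy is the marginal probability that the predicted label matches the true label, conditioned on $x \in \mathcal{R}$:

$$\mathcal{A}_{\mathcal{R}} = \int_{x \in \mathcal{R}} p(y = \hat{y} \mid x)\, p(x \mid x \in \mathcal{R})\, dx \tag{1}$$

We will use this as one of our main assessment tools.

In particular, we will focus on assessment of classwise accuracy, $\mathcal{A}_k$, the expected accuracy of the model whenever it predicts class $k$. This corresponds to having the input region be the classifier’s decision region $\mathcal{R}_k = \{x \mid \hat{y} = k\}$. To estimate the classwise accuracies from data, a standard approach would be to empirically approximate the integral above by sampling pairs $(x, y)$ from the conditional distribution $p(x, y \mid x \in \mathcal{R}_k)$. Equivalently, $\mathcal{A}_k$ can be modeled as an unknown Bernoulli parameter $\theta_k$, with draws $(x_i, y_i)$, conditioned on $\hat{y}_i = k$, leading to binary outcomes $z_i = \mathbb{I}(\hat{y}_i = y_i) \in \{0, 1\}$, with a frequency-based (maximum likelihood) estimate:

$$\hat{\theta}_k = \frac{1}{N_k} \sum_{i:\, \hat{y}_i = k} z_i \tag{2}$$

where $N_k$ is the number of labeled examples with predicted class $k$.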
It is natural to consider Bayesian estimation in this context, especially in situations where there is little labeled data available for assessment and/or where the number of regions is large. In particular, we can put a Beta prior on each classwise accuracy $\theta_k$, model the draws with a binomial likelihood, and produce Beta posteriors for each $\theta_k$. The Bayesian approach allows for uncertainty in our inferences about quantities such as $\theta_k$ as well as providing a basis for supporting techniques such as active selection of examples for labeling (discussed later in the paper). In situations where we have no a priori information about $\theta_k$ we can use a weak uninformative Beta prior; alternatively, we can use strong prior information (e.g., if the assessor believes the performance metrics reported by the model-builder) when available.
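As a minimal sketch of the beta-binomial update described above, the snippet below computes the Beta posterior for each predicted class from labeled counts and summarizes it with a posterior mean and a Monte Carlo 95% credible interval. The uniform Beta(1, 1) prior, the function names, and the example counts are illustrative assumptions, not values taken from the paper.

```python
import random

def beta_posterior(n_correct, n_labeled, alpha=1.0, beta=1.0):
    """Beta posterior parameters after n_correct hits in n_labeled labeled examples.

    alpha, beta: prior pseudocounts (Beta(1, 1) is an assumed default).
    """
    return alpha + n_correct, beta + (n_labeled - n_correct)

def summarize(a, b, n_samples=20000, seed=0):
    """Closed-form posterior mean plus a Monte Carlo 95% credible interval."""
    rng = random.Random(seed)
    draws = sorted(rng.betavariate(a, b) for _ in range(n_samples))
    lower = draws[int(0.025 * n_samples)]
    upper = draws[int(0.975 * n_samples) - 1]
    return a / (a + b), (lower, upper)

# Hypothetical counts per predicted class: (correct, labeled).
counts = {"cat": (18, 20), "lizard": (3, 10)}
for name, (r, n) in counts.items():
    a, b = beta_posterior(r, n)
    mean, (lower, upper) = summarize(a, b)
    print(f"{name}: posterior mean={mean:.3f}, 95% interval=({lower:.3f}, {upper:.3f})")
```

With few labels the interval is wide, which is the uncertainty that the frequency-based estimate alone does not convey.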
3.2 Assessing Calibration Performance
We can also assess calibration-related metrics for a classifier in a Bayesian fashion using any of various well-known calibration metrics (Kumar et al., 2019; Nixon et al., 2019). Here we focus on expected calibration error (ECE) given that it is among the most widely used calibration metrics in the machine learning literature (e.g., Guo et al. (2017); Ovadia et al. (2019)). We use the standard ECE binning procedure, with scores $s(x)$ aggregated into $B$ equal-width bins, denoting the $b$th bin or region as:

$$\mathcal{R}_b = \left\{ x \;\middle|\; s(x) \in \left(\tfrac{b-1}{B}, \tfrac{b}{B}\right] \right\} \tag{3}$$

where $b = 1, \ldots, B$ ($B = 10$ is often used in practice). Note that these are not the decision regions induced by the model that we discussed earlier but instead are a partition of the input space determined by the score $s(x)$ of the predicted class $\hat{y}$. The marginal ECE is defined as a weighted average of the absolute distance between the true accuracy $\theta_b$ and the average score $s_b$ per bin:

$$\text{ECE} = \sum_{b=1}^{B} p_b \, |\theta_b - s_b| \tag{4}$$

where $p_b$ is the probability of a score lying in bin $b$. The unknown accuracy of the model per bin is $\theta_b$, which can be viewed as a marginal accuracy over the region $\mathcal{R}_b$ in the input space corresponding to bin $b$, i.e., $\theta_b = p(y = \hat{y} \mid x \in \mathcal{R}_b)$.

To assess the marginal ECE (Eqn. 4), we put Beta priors over the $\theta_b$’s, $\theta_b \sim \text{Beta}(\alpha_b, \beta_b)$. Our default setting for the priors is a weak prior with the mean of the prior on the diagonal for each bin, i.e., we assume a priori that the model is calibrated but allow the data to easily overwhelm the prior if there is evidence that the model is not well-calibrated. The posterior distribution over ECE is a weighted average of the absolute value of shifted Beta posterior distributions corresponding to the individual $\theta_b$’s. The posterior is not available in closed form but Monte Carlo samples are straightforward to obtain.

As for accuracy, we can also model classwise ECE, $\text{ECE}_k$, by modifying the model described above to use regions that partition the input space by predicted class in addition to the model score.
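The Monte Carlo procedure for the ECE posterior can be sketched as follows: draw one accuracy per bin from its Beta posterior, form the weighted sum of absolute gaps to the bin's average score, and repeat. This is a rough illustration under stated assumptions: the bin weights are taken to be empirical bin frequencies, the prior is centered on the bin's average score with an assumed total pseudocount `prior_strength`, and all names are hypothetical.

```python
import random

def ece_posterior_samples(bins, prior_strength=2.0, n_samples=5000, seed=0):
    """Monte Carlo draws from the posterior over ECE.

    bins: list of (n_correct, n_labeled, mean_score), one tuple per score bin.
    prior_strength: assumed total pseudocount of each Beta prior.
    """
    rng = random.Random(seed)
    n_all = sum(n for _, n, _ in bins)
    samples = []
    for _ in range(n_samples):
        ece = 0.0
        for n_correct, n, score in bins:
            if n == 0:
                continue  # an empty bin gets zero empirical weight
            s = min(max(score, 1e-6), 1 - 1e-6)  # keep Beta parameters positive
            # Prior mean sits on the diagonal: Beta(strength * s, strength * (1 - s)).
            a = prior_strength * s + n_correct
            b = prior_strength * (1 - s) + (n - n_correct)
            theta = rng.betavariate(a, b)
            ece += (n / n_all) * abs(theta - score)
        samples.append(ece)
    return samples

# Hypothetical bins: an overconfident high-score bin and a roughly calibrated one.
draws = ece_posterior_samples([(60, 100, 0.9), (30, 50, 0.6)])
print(sum(draws) / len(draws))  # posterior mean of ECE
```

A histogram of `draws` gives the posterior density of ECE, and its percentiles give a credible interval.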
3.3 Experiments with Accuracy and Calibration Assessment
A simple illustration of our approach is provided in Figure 1. We plot mean posterior estimates and 95% credible intervals of classwise accuracy and ECE produced using predictions from a ResNet110 model on the entire CIFAR100 test set. The assessment shows that (a) model accuracy and calibration vary substantially across classes and (b) classes with low classwise accuracy also tend to be less well calibrated. We discuss this further in the supplemental material, where we show that a negative correlation between classwise accuracy and ECE is observed across all five datasets.
Another example of where we can apply Bayesian assessment is in assessing reliability diagrams for classifiers, a widely used tool for visually diagnosing model calibration (DeGroot and Fienberg, 1983; Niculescu-Mizil and Caruana, 2005) (e.g., Figure 2). These diagrams plot the empirical sample accuracy as a function of the model’s confidence $s(x)$. If the model is perfectly calibrated, then $p(y = \hat{y} \mid s(x) = s) = s$ and the diagram should plot the identity function on the diagonal. Any deviation away from the diagonal reflects miscalibration of the model. For a particular value $s$ along the x-axis, the corresponding y value is defined as $p(y = \hat{y} \mid s(x) = s)$. As we did in the previous section, we model the marginal accuracy within each bin $b$ as an unknown quantity $\theta_b$. We use a Beta prior over each $\theta_b$, then update this prior using a binomial likelihood over binary observations $z_i = \mathbb{I}(\hat{y}_i = y_i)$.
In Figure 2, the prior distribution of marginal accuracy within each bin is a Beta distribution with its mean on the diagonal and a fixed pseudocount. As the amount of data used increases, the credible intervals of the Bayesian reliability diagram (left column) get narrower, the posterior density of ECE (right column) converges to ground truth, and the uncertainty about ECE decreases. When the number of samples is small, e.g., with the same set of 100 randomly selected samples (row 1), the Bayesian estimate of ECE puts non-negligible probability mass on the ground truth marginal ECE, where “ground truth” refers to the marginal ECE computed with all labeled assessment data, while the frequentist method significantly overestimates ECE without any notion of uncertainty.

In Figure 3 we show the percentage error obtained for Bayesian mean posterior estimates (MPE) and frequentist estimates of marginal ECE as a function of the number of labeled data points (“queries”) across five datasets. The percentage is computed relative to the ground truth marginal ECE. The MPE is computed with Monte Carlo samples from the posterior distribution (an example of histograms of such samples is shown in Figure 2). At each step, we randomly draw and label a given number of queries from the pool of unlabeled data, and compute both a Bayesian and a frequentist estimate of marginal calibration error with these labeled data. We run the simulation 100 times and report the average over the runs, with the resulting error plotted as a percentage. The Bayesian method consistently has lower ECE estimation error, especially when the number of queries is small.
Figure 4: Identification of the lowest-accuracy predicted classes, comparing active learning (with Thompson sampling (TS)) with no active learning, across five datasets. The top and bottom rows use different numbers of target classes.
4 Active Bayesian Assessment
The results in the previous section illustrate how the Bayesian approach can be useful in obtaining assessments with uncertainty quantification, e.g., for practical deployment situations where labeled data is likely to be sparse, in contrast to typical results in the literature which assume the availability of large test datasets for assessing performance metrics. In this section we illustrate how we can further improve performance by extending our Bayesian framework to active assessment, allowing for model assessment to be performed by actively selecting examples for labeling in a data-efficient manner. This scenario is particularly relevant to problems where we have a potentially large pool of unlabeled examples available, and have limited resources for labeling (e.g., a human labeler). The question we address here is: if we can only label a limited number of examples from a larger pool of unlabeled examples, which examples should we select? We illustrate below how active data selection can be performed to solve the problem of identifying extreme classes (e.g., classes that rank first or last according to a given metric such as accuracy, calibration error, or expected cost).
4.1 Best Arm Identification
Thompson Sampling
The problem of identifying extreme classes can be formulated as a multi-armed bandit problem where the “arms” are predicted classes $k$, and the “reward” signal is a function of the model’s prediction $\hat{y}$ and true label $y$ (e.g., +1 every time $\hat{y} \neq y$ if the goal is to determine the least accurate predicted class). The popular Thompson sampling algorithm used to solve this class of problems (Thompson, 1933; Russo et al., 2018) readily lends itself to our Bayesian framework. The basic idea is, at each query, to sample from the posterior distribution of the evaluation metric, and select one data point to be labeled (from a pool of unlabeled examples) for the predicted class (or arm) with the highest/lowest sampled value.

We also experimented with a modified version of Thompson sampling called top-two Thompson sampling (TTTS), which has theoretical advantages for identifying the best arm in a pure exploration mode (Russo, 2016). This algorithm adds a resampling process to encourage more exploration: at each step, the second-most optimal arm may be randomly selected in place of the most optimal arm. We found that for the problems and datasets we investigated in this paper TS and TTTS gave very similar performance, so for simplicity we present results only for TS. A formal description of these algorithms is provided in the supplemental material.
There are a variety of other active learning algorithms (such as epsilon-greedy and UCB methods) that could also be used for active assessment. We found Thompson sampling to be more reliable and consistent in terms of efficiency than these methods across all five datasets (results and a sensitivity analysis for prior strength are in the supplementary material). We focus on the Bayesian/TS approach in our results below since our primary aim is to demonstrate the utility of active assessment compared to no active assessment.
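A rough sketch of Thompson sampling for finding the least accurate predicted class is given below. It is not the authors' implementation: labeling is simulated by revealing a stored 0/1 outcome, the Beta(1, 1) prior is an assumption, and the function and variable names are hypothetical.

```python
import random

def thompson_least_accurate(pool, n_queries, seed=0):
    """Actively label examples to find the predicted class with lowest accuracy.

    pool: dict mapping predicted class -> list of 0/1 outcomes (1 = correct);
          an outcome stays hidden until the example is queried.
    Returns a dict mapping class -> [alpha, beta] posterior parameters.
    """
    rng = random.Random(seed)
    post = {k: [1.0, 1.0] for k in pool}  # assumed Beta(1, 1) prior per arm
    remaining = {k: list(v) for k, v in pool.items()}
    for _ in range(n_queries):
        arms = [k for k in remaining if remaining[k]]
        if not arms:
            break  # every example has been labeled
        # Sample one accuracy per arm and pick the arm with the lowest draw.
        draws = {k: rng.betavariate(*post[k]) for k in arms}
        k = min(draws, key=draws.get)
        # "Label" one random unlabeled example predicted as class k.
        z = remaining[k].pop(rng.randrange(len(remaining[k])))
        post[k][0] += z
        post[k][1] += 1 - z
    return post

# Hypothetical pool: class "b" is much less accurate than "a" and "c".
pool = {"a": [1] * 380 + [0] * 20, "b": [1] * 80 + [0] * 320, "c": [1] * 360 + [0] * 40}
post = thompson_least_accurate(pool, n_queries=300)
print({k: round(a / (a + b), 3) for k, (a, b) in post.items()})
```

Most of the labeling budget ends up on the arm that looks worst, which is what makes the search label-efficient compared with uniform sampling.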
4.2 Best-$m$ Arms Identification
Another variant of an active Bayesian assessment framework is best-$m$ arms identification. (This is typically referred to as best-$K$ arms identification in the literature; we use the symbol $m$ to avoid overloading $K$.) This can be motivated for example by task allocation, e.g., finding the $m$ predicted classes that a model is least accurate on, so that whenever the model predicts one of these classes the prediction decision is handed instead to a more accurate predictor (e.g., a human). For example, suppose we have a dataset with $K$ equally likely classes (e.g., CIFAR100) and a budget where we can send 10% of our examples to a human to make predictions (and the other 90% are made by our blackbox model). One way to address this is to find the set of 10 predicted classes that the model is least accurate for and use the human to make predictions when $\hat{y}$ is in this set.
Identification of the best $m$ arms can be formulated as a multiple-play multi-armed bandit (MAB) problem. Komiyama et al. (2015) proposed the multiple-play Thompson sampling (MPTS) algorithm and proved that MPTS has the optimal regret upper bound when the reward is binary. This algorithm differs from standard Thompson sampling in that, at each step, data points for the top $m$ arms (according to a sample from the posterior) are labeled as opposed to just the top arm. A detailed algorithm description is provided in the supplemental material.
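The difference from standard Thompson sampling can be sketched in a few lines: one posterior draw per arm, then the $m$ arms with the most extreme draws are all queried. The snippet is an illustrative sketch for the "least accurate classes" setting with Beta posteriors; the function names and the example posteriors are hypothetical, and it does not reproduce the full algorithm of Komiyama et al. (2015).

```python
import random

def mpts_select(post, m, rng):
    """One multiple-play Thompson sampling step: draw once per arm, keep the m lowest."""
    draws = {k: rng.betavariate(a, b) for k, (a, b) in post.items()}
    return sorted(draws, key=draws.get)[:m]

def mpts_update(post, outcomes):
    """Update Beta posteriors in place.

    outcomes: dict arm -> 0/1 label outcome (1 = the prediction was correct).
    """
    for k, z in outcomes.items():
        a, b = post[k]
        post[k] = (a + z, b + 1 - z)

rng = random.Random(0)
# Hypothetical Beta posteriors over classwise accuracy for four predicted classes.
post = {"a": (90, 10), "b": (10, 90), "c": (20, 80), "d": (95, 5)}
chosen = mpts_select(post, m=2, rng=rng)
mpts_update(post, {k: 0 for k in chosen})  # pretend both labels revealed errors
print(chosen, post)
```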
4.3 Active Bayesian Assessment of Accuracy
We apply the active Bayesian assessment framework to the problem of determining the $m$ predicted classes with the lowest classwise accuracies, using the beta-binomial model described in Section 3.1. Figure 4 compares our active Bayesian assessment method to a traditional non-active assessment method (i.e., evaluating on a test set of uniformly drawn data points). Each algorithm was run 100 times on each of the datasets listed in Table 1. For evaluation, for each run, at each step, we identify the $m$ least accurate classes, according to the MPE of the posterior distribution for the Bayesian method and the frequency-based estimate for the non-active method.
The x-axis measures the number of queries made to the oracle. The y-axis measures the mean reciprocal rank (MRR) of the predicted top-$m$ classes:

$$\text{MRR} = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{\text{rank}_i} \tag{5}$$

where $\text{rank}_i$ is the predicted rank of the $i$th best class. Following standard practice, other classes in the best $m$ are ignored when computing rank so that $\text{MRR} = 1$ if the predicted top-$m$ classes match ground truth.
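One reading of this MRR definition is sketched below: for each ground-truth top class, its rank counts only the classes outside the ground-truth set that are predicted ahead of it. This is an interpretation of the "other classes in the best set are ignored" convention, and the function name and inputs are hypothetical.

```python
def mean_reciprocal_rank(predicted_order, true_top):
    """MRR of the ground-truth top classes under a predicted ordering.

    predicted_order: all classes, sorted from most to least extreme by the estimate.
    true_top: the ground-truth set of top classes.
    """
    true_top = set(true_top)
    total = 0.0
    for c in true_top:
        pos = predicted_order.index(c)
        # Only classes outside the ground-truth set push the rank down.
        ahead = sum(1 for other in predicted_order[:pos] if other not in true_top)
        total += 1.0 / (ahead + 1)
    return total / len(true_top)

# The two true classes occupy the first two predicted slots, in either order.
print(mean_reciprocal_rank(["c", "b", "a", "d"], {"b", "c"}))  # 1.0
# One wrong class is predicted ahead of both true classes.
print(mean_reciprocal_rank(["a", "b", "c", "d"], {"b", "c"}))  # 0.5
```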
Our results demonstrate that the active learning approach is much more effective at identifying the least accurate class, or the $m$ least accurate classes, relative to working with a randomly sampled test set. For example, for CIFAR100 and ImageNet, in all of the trials, the correct class is identified after querying 30% and 20% of the pool of unlabeled data respectively. In contrast, the non-active strategy only reaches an MRR of around 0.5 and 0.2 on these two datasets with the same amount of labeled data. In general, active assessment appears to achieve the largest gains in efficiency on datasets where the number of classes is large (e.g., CIFAR100 and ImageNet).
Figure 4 shows the results when the prior distribution of accuracy is uninformative. However, since a model’s confidence reflects the model’s self-assessment of accuracy, we could also use an informative prior by placing a prior distribution for accuracy that is centered around the model’s confidence per class. Figure 5 shows the results for two datasets comparing active assessment for an informative prior (green) and an uninformative prior (orange). We set the uninformative prior for classwise accuracy to be the same Beta distribution for each predicted class, and the informative prior to be a Beta distribution centered on $\bar{s}_k$, where $\bar{s}_k$ is the average model confidence (score) of all data points (which we can obtain using unlabeled data) for the predicted class $k$. The results in Figure 5 illustrate that the informative prior can be helpful when the prior captures the relative ordering of classwise accuracy well (e.g., ImageNet), but less helpful when the difference in classwise accuracy across classes is small and the classwise ordering reflected in the “self-assessment prior” is more likely to be in error (e.g., SVHN; classwise accuracies are provided in the supplementary material).
4.4 Active Bayesian Assessment of Confusion Matrices and Misclassification Costs
Accuracy assessment can be viewed as implicitly assigning a binary cost to model mistakes, i.e. a cost of 1 to incorrect predictions and a cost of 0 to correct predictions. In this sense, identifying the predicted class with lowest accuracy is equivalent to identifying the class with greatest expected cost. However, in real world applications, costs of different types of mistakes can vary drastically. For example, in autonomous driving applications, misclassifying a pedestrian as a crosswalk can have much more severe consequences than other misclassifications.
To deal with such situations, we extend our approach to incorporate an arbitrary cost matrix $\mathbf{C} = [c_{jk}]$, where $c_{jk}$ is the cost of predicting class $k$ for a data point whose true class is $j$. Conditioned on a predicted class $\hat{y} = k$, the true class label $y$ has a categorical distribution $\theta_{jk} = p(y = j \mid \hat{y} = k)$. We will refer to the $\theta_{jk}$ as confusion probabilities since they resemble the elements of a confusion matrix. The classwise expected cost for predicted class $k$ is given by:

$$C_k = \mathbb{E}_{p(y \mid \hat{y} = k)}[c_{yk}] = \sum_{j=1}^{K} c_{jk}\, \theta_{jk} \tag{6}$$

Similar to how we use a beta-binomial distribution to model accuracy, we can model these confusion probabilities using a Dirichlet-multinomial distribution: $\theta_{\cdot k} \sim \text{Dirichlet}(\alpha_{\cdot k})$. The same active querying approach described in the previous section can then be used to identify the class with the highest classwise expected cost (e.g., $\arg\max_k C_k$).

For prior distributions we evaluate two options. The first is an uninformative prior with equal pseudocounts for all true classes. The second is an informative prior based on the model’s own prediction scores. This informative prior is likely to be more useful for problems such as CIFAR100 where the number of confusion probabilities to estimate is large and observations are relatively sparse.
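As an illustrative sketch of the Dirichlet-multinomial model for one predicted class, the snippet below adds labeled confusion counts to prior pseudocounts, samples confusion probabilities from the resulting Dirichlet posterior, and evaluates the expected cost for each sample. The prior values, cost values, and function names are assumptions chosen for the example; Dirichlet draws are built from normalized Gamma draws.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one probability vector from Dirichlet(alphas) via normalized Gamma draws."""
    g = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(g)
    return [x / total for x in g]

def expected_cost_samples(confusion_counts, costs, prior, n_samples=5000, seed=0):
    """Posterior samples of the expected cost for a single predicted class.

    confusion_counts[j]: labeled examples with this predicted class and true class j.
    costs[j]: cost of making this prediction when the true class is j.
    prior[j]: Dirichlet pseudocount for true class j (an assumed choice).
    """
    rng = random.Random(seed)
    post = [p + n for p, n in zip(prior, confusion_counts)]
    out = []
    for _ in range(n_samples):
        theta = sample_dirichlet(post, rng)
        out.append(sum(c * t for c, t in zip(costs, theta)))
    return out

# Hypothetical 3-class example: the third kind of mistake is 10x more costly.
samples = expected_cost_samples([90, 5, 5], costs=[0.0, 1.0, 10.0], prior=[1 / 3] * 3)
print(sum(samples) / len(samples))  # posterior mean expected cost
```

In an active setting, one such draw per predicted class plays the role that the sampled accuracy plays in Thompson sampling.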
We experiment with two different cost matrices on the CIFAR100 dataset: (i) the cost of misclassifying a person (e.g., predicting fish when the true class is a woman, girl, boy, etc.) is more expensive than other mistakes, (ii) the cost of confusing a class with another superclass (e.g., a vehicle with a fish) is more expensive than the cost of mistaking labels within the same superclass (e.g., confusing shark with trout). We visualize these cost matrices in Figure 6(a). In Figure 6(b), we compare the performance of active and nonactive assessment at identifying the class with highest cost, averaged over 100 trials. We set the pseudocount of both priors to be 1, and the cost of expensive mistakes to be 10x the cost of other mistakes. For both of the cost matrices we find that the active approach is more effective than the nonactive approach, and that the informative prior is more effective than the uninformative prior. Even though the model is not wellcalibrated (e.g., see Figures 1 and 2) there is nonetheless valuable information about confusion probabilities available from the model’s estimates of classconditional probabilities.
To illustrate how the informative prior helps deal with sparsity, we plot samples from the posterior of the confusion probabilities $\theta_{jk}$ when the number of queries is 10, 1000, and 10000 in Figure 6(c). For the uninformative prior (top panel), even when all of the available data is used, there is still considerable uncertainty about the magnitude of the off-diagonal confusion probabilities. However, this is not the case for the informative prior (bottom panel) since the prior for the confusion probabilities more closely resembles the true confusion matrix.
Lastly, we investigate sensitivity to varying the relative cost of mistakes. We consistently observe that active assessment with an informative prior performs best, followed by active assessment with an uninformative prior and finally random sampling. Results are provided in the supplemental material.
4.5 Active Bayesian Assessment of Calibration
To identify classes for which the model is least calibrated (i.e., has highest classwise ECE), we use the Bayesian model we proposed in Section 3 to actively assess classwise ECE of a model across different predicted classes. In Figure 7 we plot the average mean reciprocal rank (MRR) from 100 independent runs. As with classwise accuracy, active assessment can identify the correct top-$m$ predicted classes much more efficiently than the non-active approach, as a function of the number of label queries. The improvement in efficiency is particularly significant when the classwise calibration performance has large variance across the classes, e.g., CIFAR100, ImageNet and 20 Newsgroups (additional details in the supplemental material).
5 Related Work
Prior work on using Bayesian ideas in the context of classifier assessment has tended to focus on very specific types of assessment. Goutte and Gaussier (2005) propose a framework for Bayesian estimation of precision, recall, and F-score in an information retrieval context, and Johnson et al. (2019) use Bayesian mixture models to provide posterior distributions of diagnostic metrics (such as true positive rates) for medical tests. Benavoli et al. (2017) develop a general Bayesian framework for comparing multiple classifiers as an alternative to more traditional null hypothesis significance testing. We contribute to this body of work in two significant ways. First, we expand the set of Bayesian diagnostics to a broader range of metrics, such as classwise accuracy and calibration metrics such as ECE and reliability diagrams. In addition, we develop approaches for the previously unstudied task of active assessment that are more label-efficient (and cost-effective) than traditional approaches which use fixed-size test sets or uniform sampling.

Other work has proposed frequentist methods for uncertainty quantification in an assessment context, e.g., resampling approaches such as the bootstrap for generating confidence intervals on calibration performance (Bröcker and Smith, 2007; Vaicenavicius et al., 2019). Our focus in this work is not to supplant these existing techniques, but instead to supplement them by providing an approach that can incorporate prior knowledge and that readily lends itself to active assessment.

While there is a large literature on active learning and multi-armed bandits (e.g., Settles (2012); Russo et al. (2018)), our paper is the first that applies these ideas to classifier assessment. In particular, our work builds on multi-armed bandit (MAB) inspired, pool-based active learning algorithms for data selection (Thompson, 1933; Russo, 2016; Komiyama et al., 2015). The techniques we use in this paper can in principle be replaced by any Bayesian active learning algorithm designed for MAB problems; determining the optimal active learning approach for model assessment is an interesting avenue for future research.
6 Conclusions
In this paper we described a Bayesian framework for assessing performance metrics of blackbox classifiers, developing inference procedures for classwise accuracy, expected cost, and calibration metrics such as ECE. In addition, we proposed a new framework called active assessment for label-efficient assessment of classifier performance, and demonstrated its performance across five well-known datasets for identification of extreme classes such as the least accurate, least calibrated, or highest cost.
There are a number of interesting and useful directions for future work, such as Bayesian estimation of continuous functions related to accuracy and calibration (rather than over regions). The framework can also be extended to assess a particular model operating in multiple environments using a Bayesian hierarchical approach, or to comparatively assess multiple models operating in the same environment. A related direction is to consider environments where humans are in the loop where, given a constraint on the number of problems that can be allocated to humans, the goal is to identify for which types of prediction problems human accuracy will most likely exceed model accuracy.
Appendix A: Classwise ECE and Accuracy are Negatively Correlated
Figure 8 shows scatter plots of classwise accuracy and ECE assessed with our proposed Bayesian method for the five datasets used in the paper. The assessment shows that model accuracy and calibration vary substantially across classes. For CIFAR100, ImageNet and 20 Newsgroups, the variance of classwise accuracy and ECE among all predicted classes is considerably greater than for the other two datasets. Figure 8 also illustrates that there is significant negative correlation between classwise accuracy and ECE across all five datasets, i.e., classes with low classwise accuracy also tend to be less well calibrated.
Appendix B: Bayesian Reliability Diagrams
Figure 9 shows Bayesian reliability diagrams for five datasets, based on different amounts of labeled data. We used a Beta prior for each bin, i.e., a weak prior centered on the diagonal. Rows 1 and 2 display reliability diagrams estimated using a smaller and a larger number of randomly selected examples, respectively. Row 3 displays diagrams estimated using the full set of available labeled examples for each dataset (e.g., the size column in Table 1).
With the full set of examples (row 3), the posterior means and the posterior 95% credible intervals are generally below the diagonal, i.e., we can infer with high confidence that the models are miscalibrated (and overconfident, to varying degrees) across all five datasets. For some bins where the scores are less than 0.5, the credible intervals are wide due to little data, and there is not enough information to determine with high confidence if the corresponding models are calibrated or not in these regions. With the smallest number of examples (row 1), the posterior uncertainty captured by the 95% credible intervals indicates that there is not yet enough information to determine whether the models are miscalibrated given so few labeled examples. With more examples (row 2) there is enough information to reliably infer that the CIFAR100 model is overconfident in all bins for scores above 0.3. For the remaining datasets the credible intervals are generally wide enough to encompass 0.5 for most bins, meaning that we do not have enough data to make reliable inferences about calibration, i.e., the possibility that the models are well-calibrated cannot be ruled out without acquiring more data.
Appendix C: Inferring Statistics of Interest via Monte Carlo Sampling
An additional benefit of the Bayesian framework is that we can draw samples from the posterior to infer other statistics of interest. Here we illustrate this method with two examples.
Bayesian Ranking via Monte Carlo Sampling
We can infer the Bayesian ranking of classes in terms of classwise accuracy or expected calibration error (ECE) by drawing samples from the posterior distributions. For instance, we can estimate the ranking of classwise accuracy of a model for CIFAR-100 by sampling classwise accuracies (from their respective posterior Beta densities) for each of the classes and then computing the rank of each class from the sampled accuracies. We run this experiment 10,000 times, and for each class we empirically estimate the distribution of its ranking. The MPE and 95% credible interval of the ranking per predicted class, for the top 10 and bottom 10 classes, are provided in Figure 10(a) for CIFAR-100.
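The ranking procedure can be sketched as follows. This is a minimal illustration that assumes each class's accuracy has a Beta posterior with known parameters; the function names `rank_posterior` and `rank_interval` are hypothetical and not taken from the paper.

```python
import random

def rank_posterior(alphas, betas, n_draws=10000, seed=0):
    """Monte Carlo distribution of each class's accuracy rank (1 = most accurate).

    alphas, betas: Beta posterior parameters, one pair per predicted class.
    Returns probs, where probs[c][r] estimates P(class c has rank r + 1).
    """
    rng = random.Random(seed)
    k = len(alphas)
    rank_counts = [[0] * k for _ in range(k)]
    for _ in range(n_draws):
        draw = [rng.betavariate(a, b) for a, b in zip(alphas, betas)]
        # Sort classes from highest to lowest sampled accuracy.
        order = sorted(range(k), key=lambda c: draw[c], reverse=True)
        for r, c in enumerate(order):
            rank_counts[c][r] += 1
    return [[n / n_draws for n in row] for row in rank_counts]

def rank_interval(rank_probs, mass=0.95):
    """Equal-tailed interval of ranks from one class's rank distribution."""
    tail = (1 - mass) / 2
    cum, lo, hi = 0.0, None, None
    for r, p in enumerate(rank_probs, start=1):
        cum += p
        if lo is None and cum >= tail:
            lo = r
        if hi is None and cum >= 1 - tail:
            hi = r
    return lo, hi
```

For example, `rank_interval(rank_posterior(alphas, betas)[c])` gives an approximate 95% interval for the rank of class `c`.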
Figure 10: (a) MCMC-based ranking of accuracy across predicted classes for CIFAR-100 (where 1 corresponds to the class with the highest accuracy). (b) Posterior probabilities of the most and least accurate predictions on CIFAR-100. The class with the highest classwise accuracy is somewhat uncertain, while the class with the lowest classwise accuracy is very likely lizard.
Posterior probabilities of the most and least accurate predictions
We can estimate the probability that a particular class such as lizard is the least accurate predicted class of CIFAR-100 by sampling classwise accuracies (from their respective posterior Beta densities) for each of the classes and then checking whether the sampled accuracy for lizard is the minimum of the sampled values. Running this experiment 10,000 times and averaging the results, we determine that there is a 68% chance that lizard is the least accurate class predicted by ResNet-110 on CIFAR-100. The posterior probabilities for other classes are provided in Figure 10(b), along with results for estimating which class has the highest classwise accuracy.
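A minimal sketch of this estimate is given below, again assuming a Beta posterior per class with known parameters. The function name `prob_extreme` and the example parameters in the usage note are hypothetical.

```python
import random

def prob_extreme(alphas, betas, n_draws=10000, seed=0):
    """Estimate, for every class, the posterior probability that it is the
    least accurate and the probability that it is the most accurate class."""
    rng = random.Random(seed)
    k = len(alphas)
    least = [0] * k
    most = [0] * k
    for _ in range(n_draws):
        draw = [rng.betavariate(a, b) for a, b in zip(alphas, betas)]
        # Count which class attains the minimum and maximum in this joint draw.
        least[min(range(k), key=draw.__getitem__)] += 1
        most[max(range(k), key=draw.__getitem__)] += 1
    return [n / n_draws for n in least], [n / n_draws for n in most]
```

Calling `p_least, p_most = prob_extreme([30, 33, 80], [30, 27, 20])` returns one probability per class in each list; the entry of `p_least` for a class plays the role of the 68% figure reported for lizard.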
Appendix D: Different Multi-Armed Bandit Algorithms
Below we provide brief descriptions and pseudocode for the variants of multi-armed bandit algorithms for best-arm and top-arms identification that we investigated in this paper, including Thompson sampling (TS), top-two Thompson sampling (TTTS), and multiple-play Thompson sampling (MPTS) (Thompson, 1933; Russo, 2016; Komiyama et al., 2015).
Best Arm Identification
Thompson sampling (TS) is a widely used method for online learning in multi-armed bandit problems. The algorithm samples actions according to the posterior probability that they are optimal. Top-two Thompson sampling (TTTS) is a modified version of TS that is tailored for best-arm identification and has some theoretical advantages. It adds a resampling step to encourage more exploration. Algorithms 1 and 2 describe the sampling process for identifying the most accurate predicted class with TS and TTTS, respectively. Both take as input the number of classes and the prior distribution over classwise accuracies.
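The two selection rules can be sketched in Python as follows. This is an illustrative reading of TS and TTTS in general, not a transcription of Algorithms 1 and 2. The callback `label_oracle(k, rng)`, which stands for labeling one example predicted as class `k` and reporting whether the prediction was correct, is hypothetical, as are the Beta(1, 1) prior, the `keep_prob` default of 0.5, and the `max_resample` cap.

```python
import random

def thompson_step(alpha, beta, rng):
    """TS: draw one sample per class and pick the class with the largest draw."""
    draw = [rng.betavariate(a, b) for a, b in zip(alpha, beta)]
    return max(range(len(draw)), key=draw.__getitem__)

def top_two_step(alpha, beta, rng, keep_prob=0.5, max_resample=100):
    """TTTS: keep the TS choice with probability keep_prob; otherwise
    resample until a different class comes out on top and pick that one."""
    first = thompson_step(alpha, beta, rng)
    if rng.random() < keep_prob:
        return first
    for _ in range(max_resample):
        second = thompson_step(alpha, beta, rng)
        if second != first:
            return second
    return first  # no challenger appeared within the resampling cap

def find_most_accurate(label_oracle, n_classes, budget, step=thompson_step, seed=0):
    """Spend `budget` label queries, then return the class with the highest
    posterior mean accuracy along with the final Beta parameters."""
    rng = random.Random(seed)
    alpha = [1.0] * n_classes  # Beta(1, 1) prior per class (an assumption)
    beta = [1.0] * n_classes
    for _ in range(budget):
        k = step(alpha, beta, rng)
        if label_oracle(k, rng):  # True if the queried prediction was correct
            alpha[k] += 1
        else:
            beta[k] += 1
    means = [a / (a + b) for a, b in zip(alpha, beta)]
    return max(range(n_classes), key=means.__getitem__), alpha, beta
```

Passing `step=top_two_step` switches the same loop from TS to TTTS, which spends a share of the queries on the strongest challenger rather than only on the current leader.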
Top Arms Identification
Multiple-play Thompson sampling (MPTS) is an extension of TS to the multiple-play multi-armed bandit problem, and it has a theoretical optimal regret guarantee with binary rewards. Algorithm 3 describes the sampling process for identifying the top-m arms with MPTS, where m is the number of best arms to identify.
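A hedged sketch of the multiple-play variant follows: in each round one value is drawn per class from its posterior, and the m classes with the largest draws are all queried. The callback `label_oracle(k, rng)` (labeling one example predicted as class `k` and returning whether the prediction was correct) and the Beta(1, 1) prior are assumptions for illustration; this is not a transcription of Algorithm 3.

```python
import random

def mpts_step(alpha, beta, m, rng):
    """Draw one sample per class and return the m classes with the largest draws."""
    draw = [rng.betavariate(a, b) for a, b in zip(alpha, beta)]
    return sorted(range(len(draw)), key=draw.__getitem__, reverse=True)[:m]

def find_top_m(label_oracle, n_classes, m, rounds, seed=0):
    """Run MPTS for `rounds` rounds (m queries per round) and return the m
    classes with the highest posterior mean accuracy, in index order."""
    rng = random.Random(seed)
    alpha = [1.0] * n_classes  # Beta(1, 1) prior per class (an assumption)
    beta = [1.0] * n_classes
    for _ in range(rounds):
        for k in mpts_step(alpha, beta, m, rng):
            if label_oracle(k, rng):  # True if the queried prediction was correct
                alpha[k] += 1
            else:
                beta[k] += 1
    means = [a / (a + b) for a, b in zip(alpha, beta)]
    top = sorted(range(n_classes), key=means.__getitem__, reverse=True)[:m]
    return sorted(top), alpha, beta
```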
Appendix E: Derivation of Classwise Expected Cost
Suppose we are given a model producing probability estimates , and costmatrix where is the cost of predicting class for a data point whose true class is . Conditioned on a predicted class , the true class label has a categorical distribution . The classwise expected cost for predicted class is given by,
We compute:
Since ,  
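As a hedged sketch of the classwise expected cost described in this appendix, the block below uses notation that is assumed here and may differ from the paper's: $c_{jk}$ is the cost of predicting class $k$ when the true class is $j$, $\theta_{jk}$ is the probability that the true class is $j$ given predicted class $k$, and $C_k$ is the expected cost for predicted class $k$.

```latex
\begin{align}
\theta_{jk} &= P(y = j \mid \hat{y} = k), \qquad \sum_{j=1}^{K} \theta_{jk} = 1, \\
C_k &= \mathbb{E}\left[ c_{yk} \mid \hat{y} = k \right] = \sum_{j=1}^{K} c_{jk}\, \theta_{jk}.
\end{align}
```

If one further assumes a Dirichlet posterior with parameters $\alpha_{1k}, \dots, \alpha_{Kk}$ over $(\theta_{1k}, \dots, \theta_{Kk})$, linearity of expectation gives the posterior mean $\mathbb{E}[C_k] = \sum_{j} c_{jk}\, \alpha_{jk} / \sum_{j} \alpha_{jk}$.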
Appendix F: Sensitivity Analysis for Hyperparameters
In Figure 11, we show Bayesian reliability diagrams for five datasets as the strength of the prior increases from 10 to 100. As the strength of the prior increases, it takes more labeled data to overcome the prior belief that the model is calibrated. In Figure 12, we show the MRR of the least accurate predicted classes as the strength of the prior increases from 2 to 10 to 100. In Figure 13, we show the MRR of the least calibrated predicted classes as the strength of the prior increases from 2 to 5 to 10. From these plots, the proposed approach appears to be relatively robust to the prior strength.
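The effect of prior strength on a single reliability-diagram bin can be made concrete with a short worked formula. The parametrization is an assumption for illustration: $s$ is the prior strength, $c_b$ is the confidence at the center of bin $b$ (so the prior mean lies on the diagonal), and $k_b$ of the $n_b > 0$ labeled examples in the bin are predicted correctly.

```latex
\begin{align}
\theta_b &\sim \mathrm{Beta}\big(s\,c_b,\; s\,(1 - c_b)\big), \\
\mathbb{E}[\theta_b \mid \text{data}] &= \frac{s\,c_b + k_b}{s + n_b}
  = \frac{s}{s + n_b}\, c_b + \frac{n_b}{s + n_b} \cdot \frac{k_b}{n_b}.
\end{align}
```

The posterior mean is a weighted average of the diagonal value $c_b$ and the empirical accuracy $k_b / n_b$, with weight $s / (s + n_b)$ on the prior, so the bin needs $n_b$ well above $s$ before the data dominate.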
We also investigate the sensitivity of the results to the relative cost of mistakes. Results are provided in Table 2. We consistently observe that active assessment with an informative prior performs the best, followed by active assessment with an uninformative prior and finally random sampling.
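| | Cost | Non-active | Uninformative Prior | Informative Prior |
|---|---|---|---|---|
| Human | 2 | 7.7K | 2.0K | 1.6K |
| Human | 5 | 8.9K | 3.1K | 2.1K |
| Human | 10 | 8.6K | 5.5K | 3.4K |
| Human | 20 | 7.1K | 5.2K | 2.5K |
| Superclass | 2 | 9.2K | 2.3K | 1.7K |
| Superclass | 5 | 9.9K | 2.4K | 2.3K |
| Superclass | 10 | 9.7K | 2.1K | 1.8K |
| Superclass | 20 | 9.6K | 2.5K | 2.2K |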
References
Time for a change: a tutorial for comparing multiple classifiers through Bayesian analysis. The Journal of Machine Learning Research 18 (1), pp. 2653–2688.
Increasing the reliability of reliability diagrams. Weather and Forecasting 22 (3), pp. 651–661.
The comparison and evaluation of forecasters. Journal of the Royal Statistical Society: Series D (The Statistician) 32 (1–2), pp. 12–22.
BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT 2019, Vol. 1, pp. 4171–4186.
Fused DNN: a deep neural network fusion approach to fast and robust pedestrian detection. In Winter Conference on Applications of Computer Vision, pp. 953–961.
Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In International Conference on Machine Learning, pp. 1050–1059.
A probabilistic interpretation of precision, recall and F-score, with implication for evaluation. In European Conference on Information Retrieval, pp. 345–359.
On calibration of modern neural networks. In International Conference on Machine Learning, pp. 1321–1330.
Deep residual learning for image recognition. In Computer Vision and Pattern Recognition, pp. 770–778.
A baseline for detecting misclassified and out-of-distribution examples in neural networks. In International Conference on Learning Representations.
Gold standards are out and Bayes is in: implementing the cure for imperfect reference tests in diagnostic accuracy studies. Preventive Veterinary Medicine 167, pp. 113–127.
Identifying medical diagnoses and treatable diseases by image-based deep learning. Cell 172 (5), pp. 1122–1131.
Optimal regret analysis of Thompson sampling in stochastic multi-armed bandit problem with multiple plays. In International Conference on Machine Learning, pp. 1152–1161.
Learning multiple layers of features from tiny images. Technical report, Citeseer.
Beta calibration: a well-founded and easily implemented improvement on logistic calibration for binary classifiers. In Artificial Intelligence and Statistics, pp. 623–631.
Verified uncertainty calibration. In Advances in Neural Information Processing Systems, pp. 3787–3798.
Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pp. 6402–6413.
Newsweeder: learning to filter netnews. In Machine Learning Proceedings, pp. 331–339.
Detecting and correcting for label shift with black box predictors. In International Conference on Machine Learning, pp. 3128–3136.
Reading digits in natural images with unsupervised feature learning. NIPS Workshop on Deep Learning and Unsupervised Feature Learning.
Predicting good probabilities with supervised learning. In International Conference on Machine Learning, pp. 625–632.
Measuring calibration in deep learning. In The IEEE Conference on Computer Vision and Pattern Recognition Workshops.
Can you trust your model’s uncertainty? Evaluating predictive uncertainty under dataset shift. In Advances in Neural Information Processing Systems, pp. 13969–13980.
Do ImageNet classifiers generalize to ImageNet? In International Conference on Machine Learning, pp. 5389–5400.
ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115 (3), pp. 211–252.
A tutorial on Thompson sampling. Foundations and Trends in Machine Learning 11 (1), pp. 1–96.
Simple Bayesian algorithms for best arm identification. In Conference on Learning Theory, pp. 1417–1418.
Active learning. Synthesis Lectures on AI and ML, Morgan & Claypool.
On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25 (3/4), pp. 285–294.
Evaluating model calibration in classification. In International Conference on Artificial Intelligence and Statistics, pp. 3459–3467.
Complexity vs. performance: empirical analysis of machine learning as a service. In Internet Measurement Conference, pp. 384–397.
Transforming classifier scores into accurate multiclass probability estimates. In International Conference on Knowledge Discovery and Data Mining, pp. 694–699.
Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, pp. 649–657.