Log In Sign Up

Evaluating Predictive Distributions: Does Bayesian Deep Learning Work?

by   Ian Osband, et al.

Posterior predictive distributions quantify uncertainties ignored by point estimates. This paper introduces The Neural Testbed, which provides tools for the systematic evaluation of agents that generate such predictions. Crucially, these tools assess not only the quality of marginal predictions per input, but also joint predictions given many inputs. Joint distributions are often critical for useful uncertainty quantification, but they have been largely overlooked by the Bayesian deep learning community. We benchmark several approaches to uncertainty estimation using a neural-network-based data generating process. Our results reveal the importance of evaluation beyond marginal predictions. Further, they reconcile sources of confusion in the field, such as why Bayesian deep learning approaches that generate accurate marginal predictions perform poorly in sequential decision tasks, how incorporating priors can be helpful, and what roles epistemic versus aleatoric uncertainty play when evaluating performance. We also present experiments on real-world challenge datasets, which show a high correlation with testbed results, and that the importance of evaluating joint predictive distributions carries over to real data. As part of this effort, we opensource The Neural Testbed, including all implementations from this paper.


page 6

page 7


Looking at the posterior: on the origin of uncertainty in neural-network classification

Bayesian inference can quantify uncertainty in the predictions of neural...

Evaluating High-Order Predictive Distributions in Deep Learning

Most work on supervised learning research has focused on marginal predic...

Evaluating Probabilistic Inference in Deep Learning: Beyond Marginal Predictions

A fundamental challenge for any intelligent system is prediction: given ...

Benchmarking Bayesian Deep Learning on Diabetic Retinopathy Detection Tasks

Bayesian deep learning seeks to equip deep neural networks with the abil...

Weight Pruning and Uncertainty in Radio Galaxy Classification

In this work we use variational inference to quantify the degree of epis...

Code Repositories

1 Introduction

Deep learning has emerged as the state-of-the-art approach across a number of application domains in which agents learn from large amounts of data (lecun2015deep). Neural networks are increasingly used not only to predict outcomes but also to inform decisions. Common approaches in deep learning produce point estimates but not uncertainty estimates, which are often required for effective decision-making. Bayesian deep learning extends the methodology to produce such uncertainty estimates (mackay1992practical; neal2012bayesian).

We consider agents that are trained on data pairs and subsequently generate predictions given new inputs. When presented with an input , a Bayesian neural network can generate a predictive distribution of the outcome that is yet to be observed. This distribution characterizes the agent’s uncertainty about . We refer to such a prediction as marginal to distinguish it from a joint predictive distribution over a list of prospective outcomes corresponding to inputs .

The importance of uncertainty estimation has motivated a great deal of research over recent years (KendallGal17WhatUncertainties). This research has produced a variety of agents that learn to generate predictive distributions. With this proliferation of alternatives, it is increasingly important to analyze and compare their performance (filos2019systematic; nado2021uncertainty). In this paper, we introduce new tools for systematic evaluation of such agents.

Our tools overcome several limitations faced by previous methods of evaluation. First, by focusing purely on predictive distributions, we allow for a unified treatment of approaches developed within the Bayesian neural network community and beyond. This sidesteps the question of whether any approach ‘is really Bayesian’ (wilson2020bayesian). Second, our tools evaluate the quality of higher-order joint predictions (). Until now, the Bayesian deep learning literature has focused almost exclusively on evaluating marginal predictions (wang2021beyond). Finally, we develop a neural-network-based data generating process for Bayesian deep learning that can be used to drive insight and algorithm development. Where research has focused on a small set of challenge datasets, this might introduce bias through overfitting via multiple iterations of algorithm development. We use these tools to compare hundreds of agent variants. Further, we show that performance on our synthetic data generating process data is highly correlated with performance on real-world challenge datasets. We opensource all code used in this paper as The Neural Testbed.

Our results reconcile several sources of confusion in the field. One concerns why particular approaches developed by the Bayesian deep learning community, such as Bayes-by-backprop, dropout, and deep ensembles, perform poorly in sequential decision tasks despite faring well based on evaluation metrics of that community

(osband2018rpf). Our results demonstrate that, while such methods produce accurate marginal predictions, they are no longer competitive when it comes to high-order joint predictions. Joint predictions play a critical role in sequential decision-making (lu2021BeyondMarginal).

Another puzzling issue is that state-of-the-art methods do not employ domain-specific priors. Whether Bayesian deep learning approaches should at all is a subject of controversy (wenzel2020good). We show that the benefits of domain-specific priors can be pronounced when evaluating high-order joint predictions, even where they are negligible for marginals.

We also help to resolve a point of philosophical debate within the deep learning community: the importance of epistemic versus aleatoric uncertainty111Epistemic uncertainty relates to knowledge (ancient Greek epistemeknowledge), as opposed to aleatoric uncertainty relating to chance (Latin aleadice) (der2009aleatory).. The strangeness of this distinction has even made its way into wider popular culture, as satirized in the XKCD comic of Figure 1 (munroe2021xkcd)

. For a given parametric model, we can clearly distinguish parameter uncertainty from noise, or reducible from irreducible uncertainty. However, from the perspective of a learning agent, the choice of model is subjective; different models can lead to the same marginal predictions. Our formulation provides a clear and objective way to assess the quality of predictive distributions, without reliance on this subjective distinction between knowledge and chance. Crucially, we show that this can be judged via the quality of joint predictions, but that marginals are not sufficient.

Figure 1: Epistemic or aleatoric? Does it matter?

It is worth mentioning another notable contribution of this work. The quality of a predictive distribution is commonly assessed in terms of cross-entropy loss. While this measure is well-defined for both marginal and joint predictions, to the best of our knowledge, the literature has only addressed computation in the former case. For high-order joint predictions, the straightforward approach would require computing sums over exponentially many values. To render this computationally tractable, we developed a novel approximation algorithm that leverages a random partitioning operation and Monte Carlo simulation. While this approach is motivated by concepts from high-dimensional geometry (Kaski1998DimensionalityRB; donoho2006compressed), we leave its analysis as a topic for future theoretical research.

2 Evaluating predictive distributions

In this section, we introduce notation for the standard supervised learning framework we will consider (classification) as well as our evaluation metric (the KL-loss). We also explain how we estimate the KL-loss for high-order joint predictions where exact computation is infeasible, using random partitions and Monte Carlo simulation.

2.1 Kullback–Leibler loss

Consider a sequence of pairs , where each

is a feature vector and each

is its target label. This sequence is i.i.d. conditioned on the environment

, which produces the data, and which we view as a latent random variable. We consider an agent that is uncertain about the environment and predicts class labels

given training data pairs and unlabelled feature vectors . From the agent’s perspective, each feature vector is generated i.i.d from a fixed distribution , and each class label is then drawn from .

We describe the agent’s predictions in terms of a generative model, parameterized by a vector that the agent learns from the training data . For any inputs , determines a predictive distribution, which could be used to sample imagined outcomes . We define the -order expected KL-loss by


where is independent of . The expectation is taken over all random variables, including the environment , the parameters , , and . Note that is equivalent to the widely used notion of cross-entropy loss, though offset by a quantity that is independent of (kullback1951information). For , assesses joint rather than the marginal predictions.

2.2 Marginal Versus Joint Predictions

Evaluating an agent’s ability to estimate uncertainty on joint instead of marginal predictions can result in very different answers. We provide a simple example that illustrates the point. Suppose the data

is generated by repeated tosses of a possibly biased coin with unknown probability

of heads.222We consider this coin as an illustrative model of more complex binary outcomes, such as whether a user will click on an ad, or whether a given mortgage will default on payments. Let , to indicate that there is no input, and let each outcome be or to indicate tails or heads, respectively. Consider two agents that, without any training, predict outcomes. Agent 1 assumes and models the outcome of each flip as pure chance. Agent 2 assumes that the coin is fully biased, meaning that , but assigns probabilities and to and .

Let and denote the outcomes imagined by the two agents. Despite their differing assumptions, the two agents generate identical marginal predictive distributions: . On the other hand, joint predictions greatly differ for large :

We can say that agent 1 attributes all uncertainty to aleatoric sources and agent 2, epistemic sources (although as Figure 1 alludes, there are many ways an agent can attribute sources of uncertainty). Evaluating marginal predictions cannot distinguish between the two possibilities, though for a specific prior distribution over , one agent could be right and the other wrong. One must evaluate joint predictions to make this distinction.

When it comes to decision-making, this distinction can be critical (lu2021BeyondMarginal). In a casino, under the first agent’s assumption, there is large upside and little risk on repeatedly betting on heads in the long run. However, if there is a 1/3 chance the coin will always land tails, as is the case in the second agent’s prediction, there is a ruinous risk to repeatedly betting heads. Evaluating joint predictions beyond marginals distinguishes these cases.

2.3 Computation of Kullback–Leibler loss

In contexts we will consider, it is not possible to compute exactly. As such, we will approximate via Monte Carlo simulation. This section provides a high level overview of our approach, we push the full details to Appendix A. Algorithm 1 outlines a basic approach to estimating with respect to a synthetic data generating process. The algorithm samples a set of environments and a training dataset for each environment. For each of these pairs, the agent is re-initialized, trained, and then tested on independent test data -samples. Note that each test data -sample includes data pairs. For each test data -sample, the likelihood of the environment is computed exactly, but that of the agent’s belief distribution is approximated. The estimate of is taken to be the sample mean of the log-likelihood-ratios (Algorithm 2).

1:for  do
2:     sample environment and training dataset, and train agent
3:     for  do
4:         sample a test data -sample with feature-label pairs
5:         compute likelihood of environment
6:         compute estimated likelihood of agent’s belief distribution      
7:return estimated log-likelihood-ratio
Algorithm 1 KL-Loss Computation

While the likelihood of an environment can be efficiently computed, that of an agent’s belief distribution poses a computational challenge. One approach is to estimate this likelihood via Monte Carlo simulation (Algorithm 3

). This produces unbiased estimates, which can be accurate when

is small. However, maintaining accuracy requires the number of samples to grow exponentially with , as discussed in Appendix A.1. To overcome this challenge, we propose a novel approach that estimates the likelihood of the agent’s beliefs via a combination of randomized partitioning and Monte Carlo simulation (Algorithm 4) (Kaski1998DimensionalityRB). We conjecture that, under suitable regularity conditions, this novel approach produces accurate estimates even when is large, but leave a formal analysis to future work. Even though Algorithm 1 is developed for a synthetic data generating process, it is straightforward to extend it to evaluate agents on real data. We outline our approach to real data in Section 5.1, with full details in Appendix A.2.

3 Benchmark agents

In this section we outline the baseline agents that we use to benchmark canonical approaches to uncertainty estimation in deep learning. Table 1 links to papers that introduce these agents, as well as the hyperparamters that we tuned to optimize their performance via gridsearch. In each case, we attempt to match ‘canonical’ implementations, which we open source at

agent description hyperparameters
mlp Vanilla MLP decay
ensemble ‘Deep Ensemble’ (lakshminarayanan2017simple) decay, ensemble size
dropout Dropout (Gal2016Dropout) decay, network, dropout rate
bbb Bayes by Backprop (blundell2015weight) prior mixture, network, early stopping
sgmcmc Stochastic Langevin MCMC (welling2011bayesian) learning rate, prior, momentum
ensemble+ Ensemble + prior functions (osband2018rpf) decay, ensemble size, prior scale, bootstrap
hypermodel Hypermodel (Dwaracherla2020Hypermodels) decay, prior, bootstrap, index dimension
Table 1: Summary of benchmark agents, full details in Appendix B.

In addition to these agent implementations, our opensource project contains all the evaluation code to reproduce the results of this paper. Our code is written in Python and makes use of Jax internally (jax2018github)

. However, our evaluation procedure is framework agnostic, and can equally be used with any Python package including Tensorflow, Pytorch or even SKlearn. Over the course of this paper, we have made extensive use of parallel computation to facilitate large hyperparameter sweeps over many problems. Nevertheless, the overall computational cost is relatively low by modern deep learning standards and relies only on standard CPU. For reference, evaluating the

mlp agent across all the problems in our testbed and real data requires less than 3 CPU-hours. We view our opensource effort as one of the major contributions of this work. We provide clear and strong baselines, together with an objective and accessible method for assessing uncertainty estimates beyond marginal distributions.

4 The Neural Testbed

In this section we introduce the Neural Testbed, a system for assessing and comparing agent performance. The Testbed implements synthetic data generating processes and streamlines the process of sampling data, training agents, and evaluating test performance by estimating KL-loss for marginal and high-order joint predictions. Since independent data can be generated for each execution, the Testbed can drive insight and multiple iterations of algorithm development without risk of overfitting to a fixed dataset. We begin by describing the simple generative model based around a random 2-layer MLP. We then apply this testbed to evaluate a comprehensive set of benchmark agents.

4.1 Synthetic data generating processes

By data generating process, we do not mean only the conditional distribution of data pairs but also the distribution of the environment

. The Testbed considers 2-dimensional inputs and binary classification problems, although the generating processes can be easily extended to any input dimension and number of classes. The Testbed offers three data generating processes distinguished by a “temperature” setting, which signifies the signal-to-noise ratio (SNR) regime of the generated data. The agent can be tuned separately for each setting. This reflects prior knowledge of whether the agent is operating in a high SNR regime such as image recognition or a low SNR regime such as weather forecasting.

To generate a model, the Testbed samples a 2-hidden-layer ReLU MLP with 2 output units, which are scaled by

and passed through a softmax function to produce class probabilities. The MLP is sampled according to standard Xavier initialization (glorot2010understanding), with the exception that biases in the first layer are drawn from . The inputs are drawn i.i.d. from . The agent is provided with the data generating process as prior knowledge.

In Section 2.1, we described KL-loss as a metric for evaluating performance of an agent. The Neural Testbed estimates KL-loss, with , for three temperature settings and several training dataset sizes. For each value of , the KL-losses are averaged to produce an aggregate performance measure. Further details concerning data generation and agent evaluation are offered in Appendix A.

Figure 2: Most Bayesian deep learning approaches do not significantly outperform a single MLP in marginal predictions (). Once we examine predictive distributions beyond marginals we see a clear difference in performance between our benchmark agents (). For each , the KL estimates are normalized by the KL of the MLP agent.
Figure 3: Agent robustness improves with bootstrapping. Figure 4: The benefits of additive prior functions are clear in the high tau, low data regime.

4.2 Performance in marginal predictions

We begin our evaluation of benchmark approaches to Bayesian deep learning in marginal predictions (). This setting has been the main focus of the Bayesian deep learning literature. Despite this focus, it is surprising to see in Figure 2 that none of the benchmark methods significantly outperform a well-tuned MLP baseline according to . Of course, there are many other metrics one might consider, but in this fundamental metric of prediction quality, the mlp agent presents a baseline that is difficult to outperform.

One of the keys to this result is that all of the agents are able to tune their hyperparameters, such as weight decay, to the SNR regime and number of training points. This matches the way deep learning systems are typically implemented in practice, with extensive hyperparameter tuning on validation data. This methodology has led many practitioners to doubt the usefulness of automatic tuning procedures such as bootstrap sampling (nixon2020bootstrapped). In Figure 4.1, we compare the performance of an ensemble+ agent that uses bootstrapping with and without the ability to tune the hyperparameters per problem setting. We see that bootstrap sampling is beneficial when the agent is expected to work robustly over a wide range of problem settings. However, the benefits are no longer apparent when the agent is allowed to tune its hyperparameters to individual tasks.

4.3 Performance beyond marginals

One of the key contributions of this paper is to evaluate predictive distributions beyond marginals. In Figure 2, the red bars show the results of benchmark agents evaluated on joint predictive distributions with . Unlike when evaluating on marginal predictions, where no method significantly outperforms a well-tuned MLP, the potential benefits afforded by Bayesian deep learning become clear when examining higher-order predictive distributions. Our results refute prior works’ claims that examining beyond marginals provides little new information (wang2021beyond).

Figure 2 shows that sgmcmc

is the top-performing agent overall. This should be reassuring to the Bayesian deep learning community and beyond. In the limit of large compute this agent should recover the ‘gold-standard’ of Bayesian inference, and it does indeed perform best

(welling2011bayesian). However, some of the most popular approaches in this field (ensemble, dropout) do not actually provide good approximations to the predictive distribution in . In fact, we see that even though Bayesian purists may deride ensemble+ and hypermodels as ‘not really Bayesian’, these methods actually provide much better approximations to the Bayesian posterior than ‘fully Bayesian’ VI approaches like bbb. We note too that while sgmcmc performs best, it also requires orders of magnitude more computation than competitive methods even in this toy setting (see Appendix C.2). As we scale to more complex environments, it may therefore be worthwhile to consider alternative approaches to approximate Bayesian inference.

For insight into where our top agents are able to outperform, we compare ensemble and ensemble+ under the medium SNR regime in Figures 4.1 and 5. These methods are identical, except for the addition of a randomized prior function (osband2018rpf). Figure 4.1 shows that, although these methods perform similarly in the quality of their marginal predictions (), the addition of a prior function greatly improves the quality of joint predictive distributions () in the low data regime. Figure 5 provides additional intuition into how the randomized prior functions are able to drive improved performance. Figure 4(a) shows a sampled generative model from our Testbed, with the training data shown in red and blue circles. Figure 4(b) shows the mean predictions and randomly sampled ensemble members from each agent (top=ensemble, bottom=ensemble+). We see that, although the agents mostly agree in their mean predictions, ensemble+ produces more diverse sampled outcomes enabled by the addition of randomized prior functions. In contrast, ensemble produces similar samples, which may explain why its performance is close to baseline mlp.

(a) True model.
(b) Agent samples: only ensemble+ produces diverse decision boundaries.
Figure 5: Visualization of the predictions of ensemble and ensemble+ agents.

5 Performance on real data

Section 4 provides a simple, sanitized testbed for clear insight to the efficacy of Bayesian deep learning techniques. However, most deep learning research is not driven by these sorts of synthetic generative models, but the ultimate goal of performing well on real datasets. In this section, we apply the same benchmark agents to a selection of small challenge datasets. We find that, on average, tuning agents for the synthetic problems leads to better performance on real data. We also find that, just as the synthetic testbed, agents that perform similarly in marginal predictions may be distinguished in the quality of their joint predictions.

5.1 Datasets

We focus on 10 benchmark datasets (3 feature-based, 7 image from pixels) drawn from the literature including Iris, MNIST, and CIFAR-10

(TFDS). This collection is not intended to be comprehensive, or to include the most challenging large-scale problems, but instead to represent some canonical real-world data that might reasonably be addressed with the MLP models of Section 4.1. We apply a basic pre-processing step to each dataset, normalizing input features and flattening observations. We push full details to Appendix D.1.

To assess performance in real datasets, we follow a similar procedure as Algorithm 1. The only difference is that since it is impossible to compute the likelihood of environment for real datasets, we compute the negative log-likelihood (NLL) rather than . Appendix A.2 provides further details. Note that NLL and are equivalent for agent comparison since they differ by a constant (see Equation 1). Furthermore, to allow for more direct comparison with the synthetic testbed, we also consider variants of each dataset where the number of training pairs is limited to less than the ‘full’ dataset size.

5.2 Synthetic data is predictive of real data

Recall that Figure 2 compares performance across an array of agents, assessed using our synthetic data generating process. Each agent’s hyperparameters were tuned by first enumerating a list of plausibly near-optimal choices and selecting the one that optimizes performance. Each of our real-world datasets can be viewed as generated by an environment sampled from an alternative data generating process. A natural question is whether performance on the synthetic data correlates with performance on the real-world data.

The table of Figure 5(a) displays results pertaining to each of our agents. For each agent, performance for each candidate hyperparameter setting was assessed on synthetic and real data, and the correlation across these pairs is reported. The left and right columns restrict attention to datasets with low and high volumes of training data, respectively. If a correlation were equal to , the hyperparameter setting that optimizes agent performance on real data would be identical to that on synthetic data. It is reassuring that the correlations are high, reflecting a strong degree of alignment, with the exception of bbb in low data regime, for which there appear to be pathological outcomes distorting performance for small training sets. The values in parentheses express th and th percentile confidence bounds, measured via the statistical bootstrap.

Figure 5(b) plots performance on real versus synthetic data for the high data regime. Each data point represents one agent-hyperparameter combination. If the correlation were equal to

, the combination that performs best on the synthetic data would also perform best on the real data. It is reassuring that the correlation is large, and the confidence interval between the

th and th percentiles small. Agent-hyperparameter combinations that perform better on the testbed tend to perform better on real data as well.

agent low data high data mlp 0.74 (0.57,0.85) 0.68 (0.38,0.99) ensemble 0.72 (0.52,0.85) 0.63 (0.34,0.96) dropout 0.77 (0.68,0.86) 0.78 (0.66,0.87) bbb -0.48 (-0.6,-0.35) 0.76 (0.68,0.83) sgmcmc 0.72 (0.53,0.85) 0.86 (0.79,0.92) ensemble+ 0.85 (0.63,0.98) 0.74 (0.3,0.97) hypermodel 0.52 (0.17,0.76) 0.33 (0.03,0.59)

(a) Correlation by agent by data regime.
(b) Correlation in high data regime.
Figure 6: Performance on the Testbed correlates with performance on real datasets.

5.3 Higher order predictions and informative priors

Our synthetic testbed can be helpful in driving innovations that carry over to real data. Section 5.2 indicated that performance on the Testbed is correlated with that on real-world data. We now repeat the observation from Figure 4.1 on real data; additive prior functions can significantly improve the accuracy of joint predictive distributions generated by ensembles. We show this by comparing the performance of ensemble+ with different forms of prior functions on benchmark datasets. We evaluate an ensemble with no prior function (none

), a random MLP prior (MLP), and a random linear function of a 2-dimensional latent representation as the prior, trained via variational autoencoder (VAE)

(kingma2013auto). We provide full details in Appendix D.3.

Figure 7 plots the improvement in NLL for the ensemble agent relative to a baseline MLP (lower is better), and breaks out the result for datasets=MNIST,Iris and . We can see that the results for Iris mirror our synthetic data almost exactly. The results for MNIST share some qualitative insights, but also reveal some important differences. For Iris none of the methods outperform the MLP baseline, but for we see significant benefits to an additive MLP prior in the low data regime. For MNIST we actually see benefits to ensembles, even without prior functions and even in the high data regime. This reveals some aspects of this real data that are not captured by our synthetic model, where we did not see this behaviour. For the random MLP prior gives a slight advantage, but the effect is much less pronounced. We hypothesize this is because, unlike the testbed, the MLP prior is not well-matched to the input image data. However, the VAE prior is able to provide significant benefit in the low data regime.333We hypothesize that appropriately initialized convnet architectures may be able to leverage image structure as noted in prior work (ulyanov2018deep). These benefits also carry over to Iris, even where random MLPs already provided signficant value. Designing architectures that offer useful priors for learning agents is an exciting area for future work.

Figure 7:

Prior functions provide significant benefit in the high tau, low data regime, just like the testbed. However, for image datasets random MLP priors provide relatively little benefit. Unsupervised pretraining can help to design useful priors in high dimensional data.

6 Conclusion

This paper highlights the need to evaluate predictive distributions beyond marginals. In addition to this conceptual contribution, we develop a suite of practical computational tools that can evaluate diverse approaches to uncertainty estimation. Together with these tools, we provide a neural-network-based data generating process that facilitates research and iteration beyond a small set of challenge datasets. We package these together as The Neural Testbed, including a variety of baseline agent implementations. We believe that this represents an exciting and valuable new benchmark for Bayesian deep learning and beyond.

We have already used this testbed to generate several new insights in this paper. We have shown many popular Bayesian deep learning approaches perform similarly in marginal predictions but quite differently in joint predictions. We reveal the importance of bootstrapping for parameter robustness, and also help reconcile the observed lack of improvement when tuned to specific datasets. We have shown that these insights from synthetic data can carry over to real datasets; that performance in these settings is correlated, that agents with similar marginal predictions can be distinguished by their joint predictions, and that suitable prior functions can play an important role in driving good performance.

The results in this paper are in some sense preliminary. The grand challenge for Bayesian deep learning is to provide effective uncertainty estimates in large, rich datasets. While we have demonstrated benefits to predictive evaluation beyond marginals only in the ‘low data’ regime and small-scale problems, we believe that they will extend more broadly to situations where new test inputs appear novel relative to training data. As such, we believe our core insights should carry over to the related problems of nonstationarity and covariate shift that plague modern deep learning systems. As an agent takes on more and more complex tasks, it will continue to run into new and unfamiliar settings and uncertain outcomes; as such, effective predictive distributions will be more important than ever.


Appendix A Testbed Pseudocode

We present the testbed pseudocode in this section. Specifically, Algorithm 2 is the pseudocode for our neural testbed, and Algorithm 3 and Algorithm 4 are two different approaches to estimate the likelihood of a test data -sample conditioned on an agent’s belief. Algorithm 3 is based on the standard Monte-Carlo estimation, while Algorithm 4 adopts a random partitioning approach. The presented testbed pseudocode works for any prior over the environment and any input distribution , including the ones described in Section 4.1. We also release full code and implementations at

In addition to presenting the testbed pseudocode, we also discuss some core technical issues in the neural testbed design. Specifically, Appendix A.1 discusses how to estimate the likelihood of an agent’s belief distribution; Appendix A.2 discusses how to extend the testbed to agent evaluation on real data; finally, Appendix A.3 explains our choices of experiment parameters.


the testbed requires the following inputs
  1. prior distribution over the environment , input distribution

  2. agent

  3. number of training data , test distribution order

  4. number of sampled problems , number of test data samples

  5. parameters for agent likelihood estimation, as is specified in Algorithm 3 and 4

for  do
     Step 1: sample environment and training data
         1. sample environment
         2. sample inputs i.i.d. from
         3. sample the training labels conditionally i.i.d. as
         4. choose the training dataset as
     Step 2: train agent
         train agent based on training dataset
     Step 3: compute likelihoods
         for do
            1. sample i.i.d. from
            2. generate conditionally independently as
            3. compute the likelihood under the environment as
            4. estimate the likelihood conditioned on the agent’s belief
            based on Algorithm 3 or 4 with test data -sample . return


Algorithm 2 Neural Testbed
  1. trained agent and number of Monte Carlo samples

  2. test data -sample

Step 1: sample models conditionally i.i.d. from
Step 2: estimate as
Algorithm 3 Monte Carlo Estimation of Likelihood of Agent’s Belief
  1. trained agent

  2. number of Monte Carlo samples

  3. number of hyperplanes

  4. test data -sample

Step 1: sample models conditionally i.i.d. from ; for each model , class , and , define
and , where is the CDF of the standard normal function. For each model , define a vector
Step 2: sample a matrix and a -dimensional vector , with each element/component sampled i.i.d. from . For each , compute
Step 3: partition the sampled models, with each cell indexed by and defined by
and assign a probability to each cell:
Step 4: and , estimate the probability of predicting conditioned on the cell:
Step 5: estimate as
Algorithm 4 Estimation of Likelihood of Agent’s Belief via Random Partitioning

a.1 Estimating Likelihood of Agent’s Belief Distribution

We have presented two algorithms to estimate the likelihood of a test data -sample conditioned on a trained agent: Algorithm 3 is based on the standard Monte Carlo estimation, while Algorithm 4 adopts an approach combining random partitioning and Monte Carlo estimation. In this subsection, we briefly discuss the pros and cons between these two algorithms, and provide some general guidelines on how to choose between them.

Algorithm 3 produces unbiased estimates of the likelihoods, which is usually accurate when is small (e.g. for ). However, maintaining accuracy might require the number of samples to grow exponentially with . The following is an illustrative example.

Example 1 (Uniform belief over deterministic models): Consider a scenario where the number of class labels is . We say a model is deterministic if for any feature vector ,

Obviously, for any test data -sample with , under a deterministic model , we have

When restricted to the inputs , there are distinguishable deterministic models. Assume the agent’s belief distribution is uniform over these distinguishable deterministic models, then for any , the likelihood of the agent’s belief distribution is

Now let’s consider Algorithm 3. When a model is sampled from the agent’s belief distribution, with probability ,

and with probability ,

Consequently, in expectation, we need the number of Monte Carlo samples to ensure that the estimate returned by Algorithm 3 is non-zero.

To overcome this challenge, we also propose a novel approach to estimate the likelihood of agent’s belief via a combination of randomized partitioning and Monte Carlo simulation, as is presented in Algorithm 4. This approach proceeds as follows. First, models are sampled from the agent’s belief distribution. For each sampled model, each test data input , and each class label , a predictive probability and its probit are computed, where

is the CDF of the standard normal distribution. For each sampled model, we also stack its probits into a probit vector

. Then, random hyperplanes are sampled and used to partition into cells. Stacked probit vectors place models in cells. Predictive distributions of models in each cell are averaged, and the likelihood is calculated based on these averages, with each cell weighted according to the number of models it contains.

The Neural Testbed applies Algorithm 4 with . Hence, some cells are assigned many models. We conjecture that, under suitable regularity conditions, models assigned to the same cell tend to generate similar predictions. If this is the case, this algorithm produces accurate estimates even when is large. We leave a formal analysis to future work.

Finally, we briefly discuss how to choose between Algorithm 3 and Algorithm 4. As a rule of thumb, we recommend to choose Algorithm 3 for and Algorithm 4 with the number of hyperplanes between and for .

a.2 Agent Evaluation on Real Data

Algorithm 2 (and its simplified version Algorithm 1) is developed for a synthetic data generating processes. We now discuss how to extend it to agent evaluation on real data. Consider a scenario with real datasets, and each dataset is further partitioned into a training dataset and a test dataset. The main difference between this scenario and a synthetic data generating process is that we cannot compute the likelihood of environment for real data. Thus, we compute the cross-entropy loss instead (see Equation 1). The computational approach is similar to Algorithm 1: for each real dataset, we use its training dataset to train an agent. Then, we sample test data -samples from the test dataset, and estimate the likelihoods of the agent’s belief distribution. The estimate of the cross-entropy loss is taken to be the sample mean of the negative log-likelihoods.

Note that when ranking agents, the cross-entropy loss and will lead to the same order of agents, since these two losses differ by a constant independent of the agent (see Equation 1).

a.3 Choices of Experiment Parameters

To apply Algorithm 2, we need to specify an input distribution and a prior distribution on the environment . Recall from Section 4.1 that we consider binary classification problems with input dimension . We choose , and we consider three environment priors distinguished by a temperature parameter that controls the signal-to-noise ratio (SNR) regime. We sweep over temperatures in . The prior distribution is induced by a distribution over MLPs with 2 hidden layers and ReLU activation. The MLP is distributed according to standard Xavier initialization, except that biases in the first layer are drawn from . The MLP outputs two units, which are divided by the temperature parameter and passed through the softmax function to produce class probabilities. The implementation of this generative model is in our open source code under the path /generative/

We now describe the other parameters we use in the Testbed. In Algorithm 2, we pick the order of predictive distributions , training dataset size , number of sampled problems , and number of testing data -samples . We apply Algorithm 3 for evaluation of and Algorithm 4 for evaluation of . In both Algorithms 3 and 4, we sample models from the agent. In Algorithm 4, we set the number of hyperplanes . The specification of the testbed parameters is in our open soucre code under the path /leaderboard/

On real datasets, we apply the same , , and . We set the number of hyperplanes in Algorithm 4.

Appendix B Agents

In this section, we describe the benchmark agents in Section 3

and the choice of various hyperparameters used in the implementation of these agents. The list of agents include MLP, ensemble, dropout, Bayes by backprop, stochastic Langevin MCMC, ensemble+ and hypermodel. We will also include other agents such as KNN, random forest, and deep kernel, but the performance of these agents was worse than the other benchmark agents, so we chose not to include them in the comparison in Section  

4. In each case, we attempt to match the “canonical” implementation. The complete implementation of these agents including the hyperparameter sweeps used for the Testbed are available at We make use of the Epistemic Neural Networks notation from (osband2021epistemic) in our code. We set the default hyperparameters of each agent to be the ones that minimize the aggregated KL score .

b.1 Mlp

The mlp agent learns a 2-layer MLP with 50 hidden units in each layer by minimizing the cross-entropy loss with weight regularization. The weight decay scale is chosen either to be or , where is the input dimension, is the temperature of the generative process and is the size of the training dataset. We sweep over . We implement the MLP agent as a special case of a deep ensemble (B.2). The implementation and hyperparameter sweeps for the mlp agent can be found in our open source code, as a special case of the ensemble agent, under the path /agents/factories/

b.2 Ensemble

We implement the basic “deep ensembles” approach for posterior approximation (lakshminarayanan2017simple). The ensemble agent learns an ensemble of MLPs by minimizing the cross-entropy loss with weight regularization. The only difference between the ensemble members is their independently initialized network weights. We chose the weight scale to be either or , where is the ensemble size, is the input dimension, is the temperature of the generative process, and is the size of the training dataset. We sweep over ensemble size and . We find that larger ensembles work better, but this effect is within margin of error after 10 elements. The implementation and hyperparameter sweeps for the ensemble agent can be found in our open source code under the path /agents/factories/

b.3 Dropout

We follow Gal2016Dropout to build a droput agent for posterior approximation. The agent applies dropout on each layer of a fully connected MLP with ReLU activation and optimizes the network using the cross-entropy loss combined with weight decay. The weight decay scale is chosen to be either or where is the dropping probability, is the input dimension, is the temperature of the data generating process, and is the size of the training dataset. We sweep over dropout rate , length scale (used for weight decay) , number of neural network layers , and hidden layer size . The implementation and hyperparameter sweeps for the dropout agent can be found in our open source code under the path /agents/factories/

b.4 Bayes-by-backprop

We follow blundell2015weight to build a bbb

agent for posterior approximation. We consider a scale mixture of two zero-mean Gaussian densities as the prior. The Gaussian densities have standard deviations

and , and they are mixed with probabilities and , respectively. We sweep over , , , learning rate , number of training steps , number of neural network layers , hidden layer size , and the ratio of the complexity cost to the likelihood cost , where is the input dimension and is the temperature of the data generating process. The implementation and hyperparameter sweeps for the bbb agent can be found in our open source code under the path /agents/factories/

b.5 Stochastic gradient Langevin dynamics

We follow welling2011bayesian to implement a sgmcmc

agent using stochastic gradient Langevin dynamics (SGLD). We consider two versions of SGLD, one with momentum and other without the momentum. We consider independent Gaussian prior on the neural network parameters where the prior variance is set to be

where is a hyperparameter that is swept over , is the input dimension, is the temperature of the data generating process, and is the size of the training dataset. We consider a constant learning rate that is swept over . For SGLD with momentum, the momentum decay term is always set to be . The number of training batches is with burn-in time of training batches. We save a model every 1000 steps after the burn-in time and use these models as an ensemble during the evaluation. The implementation and hyperparameter sweeps for the sgmcmc agent can be found in our open source code under the path /agents/factories/

b.6 Ensemble+

We implement the ensemble+ agent using deep ensembles with randomized prior functions (osband2018rpf) and bootstrap sampling (osband2015bootstrapped). Similar to the vanilla ensemble agent in Section B.2, we consider weight scale to be either or . We sweep over ensemble size and . The randomized prior functions are sampled exactly from the data generating process, and we sweep over prior scaling . In addition, we sweep over bootstrap type (none, exponential, bernoulli). We find that the addition of randomized prior functions is crucial for improvement in performance over vanilla deep ensembles in terms of the quality of joint predictions. We also find that bootstrap sampling improves agent robustness, although the advantage is less apparent when one is allowed to tune the weight decay for each task (see Figure 4.1). The implementation and hyperparameter sweeps for the ensemble+ agent can be found in our open source code under the path /agents/factories/

b.7 Hypermodel

We follow Dwaracherla2020Hypermodels to build a hypermodel agent for posterior approximation. We consider a linear hypermodel over a 2-layer MLP base model. We sweep over index dimension . The weight decay is chosen to be either or with , where is the input dimension, is the temperature of the data generating process, and is the size of the training dataset. We chose three different bootstrapping methods of none, exponential, bernoulli. We use an additive prior which is a linear hypermodel prior over an MLP base model, which is similar to the generating process, with number of hidden layers in , hidden units in each layer, and prior scale from . The implementation and hyperparameter sweeps for the hypermodel agent can be found in our open source code under the path /agents/factories/

b.8 Non-parametric classifiers

K-nearest neighbors (k-NN) (cover1967nearest)

and random forest classifiers

(friedman2017elements) are simple and cheap off-the-shelf non-parametric baselines (murphy2012machine; scikit-learn). The ‘uncertainty’ in these classifiers arises merely from the fact that they produce distributions over the labels and as such we do not expect them to perform well relative to more principled approaches. Moreover, these methods have no capacity to model for . For the knn agent we swept over the number of neighbors and the weighting of the contribution of each neighbor as either uniform or based on distance. For the random_forest agent we swept over the number of trees in the forest , and the splitting criterion which was either the Gini impurity coefficient or the information gain. To prevent infinite values in the KL we truncate the probabilities produced by these classifiers to be in the interval . The implementation and hyperparameter sweeps for the knn and random_forest agents can be found in our open source code under the paths /agents/factories/ and /agents/factories/

b.9 Gaussian process with learned kernel

A neural network takes input and produces output , where is a matrix,

is a bias vector, and

is the output of the penultimate layer of the neural network. In the case of classification the output

corresponds to the logits over the class labels,

i.e., . The neural network should learn a function that maps the input into a space where the classes are linearly distinguishable. In other words, the mapping that the neural network is learning can be considered a form of kernel (scholkopf2002learning), where the kernel function is simply . With this in mind, we can take a trained neural network and consider the learned mapping to be the kernel in a Gaussian process (GP) (rasmussen2003gaussian), from which we can obtain approximate uncertainty estimates. Concretely, let be the matrix corresponding to the , , vectors stacked row-wise and let denote the same quantity for the test set. Fix index to be a particular class index. A GP models the joint distribution over the dataset to be a multi-variate Gaussian, i.e.,

where models the noise in the training data measurement and , are the means under the GP. The conditional distribution is given by


and rather than use the GP to compute (which would not be possible since we do not oberve the true logits) we just take it to be the output of the neural network when evaluated on the test dataset. The matrix being inverted in the expression for has dimension , which may be quite large. We use the Sherman-Morrison-Woodbury identity to rewrite it as follows (woodbury1950inverting)

which instead involves the inverse of an matrix, which may be much smaller. If we perform a Cholesky factorization of positive definite matrix then the samples for all logits simultaneously can be drawn by first sampling , with each entry drawn IID from , then forming

The implementation and hyperparameter sweeps for the deep_kernel agent can be found in our open source code under the path /agents/factories/

b.10 Other agents

In our paper we have made a concerted effort to include representative and canonical agents across different families of Bayesian deep learning and adjacent research. In addition to these implementations, we performed extensive tuning to make sure that each agent was given a fair shot. However, with the proliferation of research in this area, it was not possible for us to evaluate all competiting approaches. We hope that, by opensourcing the Neural Testbed, we can allow researchers in the field to easily assess and compare their agents to these baselines.

For example, we highlight a few recent pieces of research that might be interesting to evaluate in our setting. Of course, there are many more methods to compare and benchmark. We leave this open as an exciting area for future research.

  • [leftmargin=*]

  • Neural Tangent Kernel Prior Functions (he2020bayesian). Proposes a specific type of prior function in ensemble+ inspired by connections to the neural tangent kernel.

  • Functional Variational Bayesian Neural Networks (sun2019functional). Applies variational inference directly to the function outputs, rather than weights like bbb.

  • Variational normalizing flows (rezende2015variational). Applies variational inference over a more expressive family than bbb.

  • No U-Turn Sampler (hoffman2014no). Another approach to sgmcmc that attempts to compute the posterior directly, computational costs can grow large.

Appendix C Testbed results

In this section, we provide the complete results of the performance of benchmark agents on the Testbed, broken down by the temperature setting, which controls the SNR, and the size of the training dataset. We select the best performing agent within each agent family and plot and with the performance of an MLP agent as a reference. We also provide a plot comparing the training time of different agents.

c.1 Performance breakdown

Figures 8 and 9 show the KL estimates evaluated on and , respectively. For each agent, for each SNR regime, for each number of training points we plot the average KL estimate from the Testbed. In each plot, we include the “baseline” mlp agent as a black dashed line to allow for easy comparison across agents. A detailed description of these benchmark agents can be found in Appendix B.

Figure 8: Performance of benchmark agents on the Testbed evaluated on , compared against the MLP baseline.
Figure 9: Performance of benchmark agents on the Testbed evaluated on , compared against the MLP baseline.

c.2 Training time

Figure 10 shows a plot comparing the and training time of different agents normalized with that of an MLP. The parameters of each agent are picked to maximize the . We can see that sgmcmc agent has the best performance, but at the cost of more training time (computation). Both ensemble+ and hypermodel agents have similar performance as sgmcmc with lower training time. We trained our agents on CPU only systems.

Figure 10: Normalized KL vs training time of different agents

Appendix D Real data

This section provides supplementary details regarding the experiments in Section 5. As before, we include full implementation and source code at

d.1 Datasets

Table 2 outlines the datasets included in our experiments. Unlike to the synthetic testbed, which evaluates agents over a range of SNR regimes, these datasets are generally all high SNR regime. We can see this since the top-performing agents in the literature are able to obtain high levels of classification accuracy on held out data; something that is impossible if the underlying system has high levels of noise.

dataset name type # classes input dimension # training pairs
iris structured 3 4 120
wine quality structured 11 11 3,918
german credit numeric structured 2 24 800
mnist image 10 784 60,000
fashion-mnist image 10 784 60,000
mnist-corrupted/shot-noise image 10 784 60,000
emnist/letters image 37 784 88,800
emnist/digits image 10 784 240,000
cmaterdb image 10 3,072 5,000
cifar10 image 10 3,072 50,000
Table 2: Summary of benchmark datasets used in the paper.

Each of these datasets is provided with a canonical training/test set of specific sizes. In order to examine performance in different data regimes we augment the default settings of Table 2 by also examining the performance of agents on these datasets with reduced training data. In a way that mirrors the testbed sweep of Section 4.1, we also look at settings where the training data is restricted to data points respectively.

d.2 Correlation

Figure 6 breaks down the correlation in performance between testbeds and real data. For the purposes of Table 5(a) we say that is the ‘low data’ regime, and the maximum training dataset size is the ‘high data’ regime. Our results show that, for each agent, for each data regime, performance of hyperparameters is correlated across settings.

One concern might be that while performance on real data overall is highly correlated, that this might not necessarily be the case for any individual dataset. Or, alternatively, that this correlation is driven by extremely strong relationships in one dataset that are not present in others. Figure 11 shows that this is not the case. In fact, for each of the datasets considered we have strong and positive correlation over agent-hyperparameter pairs. This gives us confidence that the results of Figure 5(b) are robust not only to choice of agent, but also to some reasonable choice of datasets.

Figure 11: Correlation in high data regime for different datasets.

d.3 Prior functions

We consider two different forms of prior functions for ensemble+: a random MLP of the input data and a random linear function of a 2-dimensional latent trained via variational autoencoder (VAE) (kingma2013auto). For the MLP prior, we tried both linear (MLP with no hidden layer) and MLP with hidden layers, and observed that the linear prior works better. To train the 2-dimensional latent, we considered a 2-layer (128, 64) MLP for the Gaussian encoder and a 2-layer (64, 128) MLP for the Bernoulli decoder. We trained the VAE using all unsupervised training data available for each dataset. After training the VAE for 10,000 steps, we used the output mean of the Gaussian encoder as the latent.