Uncertainty Baselines: Benchmarks for Uncertainty & Robustness in Deep Learning

06/07/2021 ∙ by Zachary Nado, et al. ∙ Google, University of Cambridge, University of Oxford, The University of Texas at Austin

High-quality estimates of uncertainty and robustness are crucial for numerous real-world applications, especially for deep learning which underlies many deployed ML systems. The ability to compare techniques for improving these estimates is therefore very important for research and practice alike. Yet, competitive comparisons of methods are often lacking due to a range of reasons, including: compute availability for extensive tuning, incorporation of sufficiently many baselines, and concrete documentation for reproducibility. In this paper we introduce Uncertainty Baselines: high-quality implementations of standard and state-of-the-art deep learning methods on a variety of tasks. As of this writing, the collection spans 19 methods across 9 tasks, each with at least 5 metrics. Each baseline is a self-contained experiment pipeline with easily reusable and extendable components. Our goal is to provide immediate starting points for experimentation with new methods or applications. Additionally we provide model checkpoints, experiment outputs as Python notebooks, and leaderboards for comparing results. Code available at https://github.com/google/uncertainty-baselines.




1 Introduction

Baselines on standardized benchmarks are crucial to machine learning research for measuring whether new ideas yield meaningful progress. However, reproducing the results from previous works can be extremely challenging, especially when only reading the paper text (Sinha et al., 2020; D’Amour et al., 2020). Having access to the code for experiments is much more useful, assuming it is well-documented and maintained. But even this is not enough. In fact, in retrospective analyses over a collection of works, authors often find that a simpler baseline works best in practice, due to flawed experiment protocols or insufficient tuning (Melis et al., 2017; Kurach et al., 2019; Bello et al., 2021; Nado et al., 2021).

There is a wide spectrum of experiment artifacts made available in papers. A popular approach is a GitHub dump of code used to run experiments, albeit lacking documentation and tests. At best, papers might provide actively maintained repositories with examples, model checkpoints, and ample documentation to extend the work. A single paper can only go so far however: without community standards, each paper’s codebase differs in experimental protocol and code organization, making it difficult to compare across papers within a common benchmark, let alone build jointly on top of multiple papers.

To address these challenges, we created the Uncertainty Baselines library. It provides high-quality implementations of baselines across many uncertainty and out-of-distribution robustness tasks. Each baseline is designed to be self-contained (i.e., minimal dependencies) and easily extensible. We provide numerous artifacts in addition to the raw code so that others can adapt any baseline to suit their workflow.

Related work. OpenAI Baselines (Dhariwal et al., 2017) is work in a similar spirit for reinforcement learning. Prior work on uncertainty and robustness benchmarks includes Riquelme et al. (2018); Filos et al. (2019); Hendrycks and Dietterich (2019); Ovadia et al. (2019); Dusenberry et al. (2020b). These all introduce a new task and evaluate a variety of baselines on that task. In practice, they are often unmaintained or focus on experimental insights rather than the software as the contribution. Our work provides an extensive set of benchmarks (in several cases, unifying the above ones), has a larger set of baselines across these benchmarks, and focuses on designing scalable, forkable, and well-tested code.

2 Uncertainty Baselines

Uncertainty Baselines sets up each benchmark as a choice of base model, training dataset, and a suite of evaluation metrics.

  1. Base models (architectures) include Wide ResNet 28-10 (Zagoruyko and Komodakis, 2016), ResNet-50 (He et al., 2016), BERT (Devlin et al., 2018), and simple MLPs.

  2. Training datasets include standard machine learning datasets – CIFAR (Krizhevsky et al., a, b), ImageNet (Russakovsky et al., 2015), and UCI (Dua and Graff, 2017) – as well as more real-world problems: CLINC Intent Detection (Larson et al., 2019), Kaggle’s Diabetic Retinopathy Detection (Filos et al., 2019), and Wikipedia Toxicity (Wulczyn et al., 2017). These span modalities such as tabular, text, and images.

  3. Evaluation includes predictive metrics such as accuracy, uncertainty metrics such as selective prediction and calibration error, compute metrics such as inference latency, and performance under out-of-distribution shifts.
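As one illustration of the uncertainty metrics listed above, expected calibration error can be sketched in a few lines. This is a minimal standard-library sketch, assuming equal-width confidence bins and top-label confidences; the function name and the default of 10 bins are illustrative choices, not the library's own metric implementation.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - confidence| per bin.

    confidences: predicted probability of the chosen label, each in [0, 1].
    correct: booleans, True where the chosen label matched the target.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    conf_sum = [0.0] * num_bins
    hit_sum = [0.0] * num_bins
    count = [0] * num_bins
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * num_bins), num_bins - 1)
        conf_sum[idx] += conf
        hit_sum[idx] += 1.0 if hit else 0.0
        count[idx] += 1

    ece = 0.0
    for total_conf, hits, size in zip(conf_sum, hit_sum, count):
        if size == 0:
            continue  # empty bins contribute nothing
        gap = abs(hits / size - total_conf / size)
        ece += (size / n) * gap
    return ece
```

A model that reports 0.9 confidence on four examples but gets only three right has a gap of 0.15 in that single bin, so the function returns roughly 0.15.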

As of this writing, we provide 19 baseline methods sweeping over standard and more recent strategies on 9 benchmarks. See Table 1 for the full list.


To make it easy for researchers to experiment on baselines (specifically, to fork them), we designed the baselines to be as modular as possible, with minimal non-standard dependencies. API-wise, Uncertainty Baselines provides little to no abstraction: datasets are light wrappers around TensorFlow Datasets (TFDS Team), models are Keras models, and training/test logic is in raw TensorFlow (Abadi et al., 2015). This allows new users to more easily run individual examples, or incorporate our datasets and/or models into their libraries. For out-of-distribution evaluation, we plug our trained models into Robustness Metrics (Djolonga et al., 2020). Appendix A illustrates how the modules fit together.


Uncertainty Baselines is framework-agnostic. The dataset and metric modules are NumPy-compatible, and interoperate in a performant manner with modern deep learning frameworks including TensorFlow, JAX, and PyTorch. For example, our baselines on the JFT-300M dataset use raw JAX, and we include a PyTorch Monte Carlo Dropout baseline on the Diabetic Retinopathy dataset. In practice, for ease of code and performance comparison, we choose a specific backend for each benchmark and develop all baselines under that backend (most often TensorFlow). Our JAX and PyTorch baselines demonstrate that implementation with other frameworks is supported and straightforward.

Hardware. All baselines run on CPU, GPU, or Google Cloud TPUs. Baselines are optimized for a default hardware configuration and often assume a memory requirement and number of chips (e.g., 1 GPU, or TPUv2-32) in order to reproduce the results. We employ the latest coding practices to fully utilize accelerator chips (Figure 1) so researchers can leverage the most performant baselines.

Figure 1: Performance analysis of a MIMO baseline on a TPUv3-32 using the TensorFlow Profiler. The runtime is optimized, bound only by model operations, an irreducible bottleneck for a given baseline. Our implementations have 100% utilization of the TPU devices.

Hyperparameters. Hyperparameters and other experiment configuration values easily number in the dozens for a given baseline. Uncertainty Baselines uses standard Python flags to specify hyperparameters, setting default values to reproduce best performance. Flags are simple, require no additional framework, and are easy to plug into other pipelines or extend. We also document the protocol to properly tune and evaluate baselines, a common source of discrepancy in papers.
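The flag-based configuration style described above can be sketched with the standard library. This sketch uses `argparse` as a stand-in for whichever flags package a baseline actually uses, and every flag name and default value below is hypothetical rather than taken from the repository.

```python
import argparse


def build_parser():
    """Declare hyperparameters as flags whose defaults encode one setting."""
    parser = argparse.ArgumentParser(description="Hypothetical baseline config")
    # All names and defaults here are illustrative placeholders.
    parser.add_argument("--seed", type=int, default=0)
    parser.add_argument("--base_learning_rate", type=float, default=0.1)
    parser.add_argument("--train_epochs", type=int, default=200)
    parser.add_argument("--dropout_rate", type=float, default=0.1)
    parser.add_argument("--output_dir", type=str, default="/tmp/baseline")
    return parser


def parse_config(argv=None):
    """Parse flags into a plain dict that is easy to log or pass along."""
    return vars(build_parser().parse_args(argv))


# Running with no flags yields the defaults; any flag can be overridden.
default_config = parse_config([])
override_config = parse_config(["--seed=3", "--base_learning_rate=0.05"])
```

Keeping the configuration as a flat dictionary of flags means a tuning script only needs to construct a list of `--name=value` strings to launch a trial.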

Reproducibility. All modules include testing, and all results are reported over multiple seeds. Computing metrics on trained models can be prohibitively expensive, let alone training from scratch. Therefore we also provide TensorBoard dashboards which include all training, tuning, and evaluation metrics (Appendix F). An example can be found here.
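Reporting over multiple seeds amounts to repeating a run with different seeds and summarizing the spread. The sketch below assumes a hypothetical `toy_run` function that stands in for a full training and evaluation run; its output is synthetic and carries no relation to any reported result.

```python
import random
import statistics


def toy_run(seed):
    """Placeholder for one seeded training run; returns a synthetic score."""
    rng = random.Random(seed)
    return rng.random()


def summarize_over_seeds(run_fn, seeds):
    """Run `run_fn` once per seed and report mean and sample std of the metric."""
    scores = [run_fn(seed) for seed in seeds]
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return {
        "mean": statistics.mean(scores),
        "std": std,
        "num_seeds": len(scores),
    }


summary = summarize_over_seeds(toy_run, seeds=range(10))
```

Because each run is a deterministic function of its seed, calling `summarize_over_seeds` again with the same seeds reproduces the same summary.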

Figure 2: 8 baselines evaluated on ImageNet, ImageNetV2 (matched frequency variant), ImageNet-C, and ImageNet-A. (top) Top-1 accuracy. (bottom) Expected calibration error. Results demonstrate the many baselines available with competitive performance.

Figure 3: ImageNet baseline applied to deferred prediction. In this task, one defers predictions according to the model’s confidence (left) or a desired data retention rate (right).

3 Results

To provide an example of Uncertainty Baselines’ features, we display baselines available on one of the 9 tasks: ImageNet. Figure 2 displays accuracy and calibration error across 8 baselines, evaluated both in- and out-of-distribution. (We omit a legend to avoid drawing comparisons among which specific baselines perform best; see our full leaderboards at baselines/imagenet/README.md to draw those insights.) Figure 3 provides an example of applying such baselines to a downstream task. Overall, the results demonstrate only a sampling of the repository’s capabilities. We’re excited to see new research already building on the baselines.
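The deferred prediction task in Figure 3 can be sketched as two small functions: one that defers by a confidence threshold and one that defers down to a target retention rate. This is a minimal sketch that assumes accuracy is measured only on the retained (non-deferred) examples; the function names are illustrative and the paper's exact evaluation protocol may differ.

```python
def accuracy_at_threshold(confidences, correct, threshold):
    """Accuracy over examples whose confidence is at least `threshold`.

    Returns None when every example is deferred.
    """
    kept = [hit for conf, hit in zip(confidences, correct) if conf >= threshold]
    if not kept:
        return None
    return sum(1 for hit in kept if hit) / len(kept)


def accuracy_at_retention(confidences, correct, retention_rate):
    """Accuracy over the most confident `retention_rate` fraction of examples."""
    if not 0.0 < retention_rate <= 1.0:
        raise ValueError("retention_rate must be in (0, 1]")
    # Sort example indices from most to least confident.
    order = sorted(range(len(confidences)), key=lambda i: confidences[i], reverse=True)
    num_kept = max(1, int(retention_rate * len(order)))
    kept = order[:num_kept]
    return sum(1 for i in kept if correct[i]) / num_kept
```

With confidences `[0.9, 0.8, 0.6, 0.4]` and correctness `[True, True, False, True]`, keeping half of the data retains only the two most confident examples, both of which are correct.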

We would like to acknowledge support for this project from the National Science Foundation (NSF grant IIS-9988642) and the Multidisciplinary Research Program of the Department of Defense (MURI N00014-00-1-0637). Work from the OATML Group was supported by Intel Labs and the Turing Institute.


Appendix A UML Diagram

Figure 4 displays a diagram of the various components involved in a CIFAR10 TensorFlow or a Diabetic Retinopathy PyTorch experiment.

Figure 4: The program structure involved in a CIFAR10 TensorFlow and a Diabetic Retinopathy PyTorch experiment. One instantiates a Dataset (Cifar10Dataset or DiabeticRetinopathyDataset) and model (wide_resnet or resnet50_torch) within an end-to-end training script. After training, we input saved model checkpoints into Robustness Metrics for evaluation.

Appendix B Supported Baselines

Dataset                                                Method
-----------------------------------------------------  ------------------------------------------------
CIFAR (Krizhevsky, 2009)                               BatchEnsemble (Wen et al., 2020)
                                                       Hyper-BatchEnsemble (Wenzel et al., 2020)
                                                       MIMO (Havasi et al., 2020)
                                                       Rank-1 BNN (Gaussian) (Dusenberry et al., 2020a)
                                                       Rank-1 BNN (Cauchy)
                                                       SNGP (Liu et al., 2020)
                                                       MC-Dropout (Gal and Ghahramani, 2016)
                                                       Deep ensemble (Lakshminarayanan et al., 2016)
                                                       Hyper-deep ensemble (Wenzel et al., 2020)
                                                       Variational Inference (Blundell et al., 2015)
                                                       Heteroscedastic (Collier et al., 2021)
CLINC (Larson et al., 2019)                            SNGP
Diabetic Retinopathy Detection (Filos et al., 2019)    MC-Dropout
                                                       Radial Bayesian Neural Networks (Farquhar et al., 2020)
                                                       Variational Inference
ImageNet (Russakovsky et al., 2015)                    MixUp (Carratino et al., 2020)
                                                       Rank-1 BNN (Gaussian)
                                                       Rank-1 BNN (Cauchy)
                                                       Hyper-deep ensemble
                                                       Variational Inference
MNIST (LeCun and Cortes, 2010)                         Variational Inference
Toxic Comments Detection (Conversation AI, 2017)       SNGP
                                                       Focal Loss (Mukhoti et al., 2019)
UCI (Dua and Graff, 2017)                              Variational Inference

Table 1: Currently implemented methods for each dataset, in addition to a deterministic baseline. See the repository for a more up-to-date list.

Appendix C Dataset Details

For CIFAR10 and CIFAR100, we padded the images with 4 pixels of 0’s before doing a random crop to 32x32 pixels, followed by a left-right flip with 50% chance. For ImageNet, we used ResNet preprocessing as described in He et al. (2016), but also support the common Inception preprocessing from Szegedy et al. (2015). All preprocessing is deterministic given a random seed, using tf.random.experimental.stateless_split and tf.random.experimental.stateless_fold_in. For the Diabetic Retinopathy benchmarks we used the Kaggle competition dataset as in Filos et al. (2019).
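The CIFAR augmentation described above (zero-pad by 4 pixels, random 32x32 crop, left-right flip with 50% chance) can be sketched in plain Python. This sketch treats an image as a single-channel list of rows and uses `random.Random` seeded per example as a stand-in for the stateless TensorFlow random ops named in the text; `example_rng` and `augment_image` are hypothetical helper names.

```python
import random


def example_rng(seed, index):
    """Derive a per-example generator from a global seed and example index."""
    return random.Random(f"{seed}-{index}")


def augment_image(image, rng, pad=4, crop_size=32):
    """Zero-pad, randomly crop, and randomly flip one single-channel image.

    `image` is a list of rows, each row a list of pixel values.
    """
    height, width = len(image), len(image[0])
    padded_width = width + 2 * pad
    padded = (
        [[0] * padded_width for _ in range(pad)]
        + [[0] * pad + list(row) + [0] * pad for row in image]
        + [[0] * padded_width for _ in range(pad)]
    )
    # randint is inclusive on both ends, so every valid offset is reachable.
    top = rng.randint(0, height + 2 * pad - crop_size)
    left = rng.randint(0, padded_width - crop_size)
    crop = [row[left:left + crop_size] for row in padded[top:top + crop_size]]
    if rng.random() < 0.5:
        crop = [row[::-1] for row in crop]  # left-right flip
    return crop
```

Because the generator depends only on the seed and the example index, the same example receives the same crop and flip on every run with that seed, which is the property the stateless ops provide in the original pipeline.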

Appendix D Model Details

For CIFAR10 and CIFAR100 we provide methods based on the Wide ResNet models, typically the Wide ResNet-28 size (Zagoruyko and Komodakis, 2016). For ImageNet and the Diabetic Retinopathy benchmarks, we provide methods based on the ResNet-50 model (He et al., 2016). For ImageNet we additionally use methods based on the EfficientNet models (Tan and Le, 2019). For the Toxic Comments and CLINC Intent Detection benchmarks, our methods are based on the BERT-Base model (Devlin et al., 2018).

Appendix E Hyperparameter Tuning

All image benchmarks were trained with Nesterov momentum (Sutskever et al., 2013), except for the EfficientNet models, which use RMSProp. The text benchmarks were trained with the AdamW optimizer (Loshchilov and Hutter, 2017). Unless otherwise noted, the image benchmarks used a linear warmup followed by a stepwise decay schedule, except for the EfficientNet models, which used a linear warmup followed by an exponential decay. The text benchmarks used a linear warmup followed by a linear decay.
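The two most common schedules mentioned above can be written as pure functions of the training step. This is a minimal sketch: the warmup convention (reaching the base rate on the last warmup step) and all parameter names are assumptions, and no numeric values here come from the paper.

```python
def warmup_step_decay_lr(step, base_lr, warmup_steps, decay_boundaries, decay_factor):
    """Linear warmup, then multiply by `decay_factor` at each boundary passed."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    num_decays = sum(1 for boundary in decay_boundaries if step >= boundary)
    return base_lr * decay_factor ** num_decays


def warmup_linear_decay_lr(step, base_lr, warmup_steps, total_steps):
    """Linear warmup, then linear decay to zero at `total_steps`."""
    if total_steps <= warmup_steps:
        raise ValueError("total_steps must exceed warmup_steps")
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)
```

For instance, with a base rate of 0.1, 5 warmup steps, boundaries at steps 10 and 20, and a decay factor of 0.1, the stepwise schedule gives 0.02 at step 0, 0.1 at step 5, and about 0.01 at step 10.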

For the CIFAR10, CIFAR100, ImageNet, Toxic Comments, and CLINC Intent Detection benchmarks, the papers for each method contain their tuning details.

Diabetic Retinopathy benchmark tuning details.

For the Diabetic Retinopathy benchmark, we also provide our tuning results so that others can more easily retune their own methods. We conducted two rounds of quasirandom search on several hyperparameters (learning rate, momentum, dropout, variational posteriors, L2 regularization), where the first round was a heuristically-picked larger search space and the second round was a hand-tuned smaller range around the better performing values. Each round was for 50 trials, and the final hyperparameters were selected using the final validation AUC from the second tuning round. We finally retrained this best hyperparameter setting on the combined train and validation sets.
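One round of the quasirandom search described above can be sketched with a Halton sequence, which is one common quasirandom generator; the text does not say which generator was used, so that choice is an assumption. The search-space format, the function names, and the example ranges are all hypothetical.

```python
import math

# One prime base per hyperparameter dimension.
PRIMES = (2, 3, 5, 7, 11, 13)


def halton(index, base):
    """Return element `index` (1-based) of the base-`base` van der Corput sequence."""
    result, fraction = 0.0, 1.0 / base
    while index > 0:
        result += (index % base) * fraction
        index //= base
        fraction /= base
    return result


def quasirandom_trials(search_space, num_trials):
    """Generate trials from `search_space`: name -> (low, high, 'linear' or 'log')."""
    names = sorted(search_space)
    if len(names) > len(PRIMES):
        raise ValueError("add more prime bases for this many hyperparameters")
    trials = []
    for trial in range(1, num_trials + 1):
        params = {}
        for name, base in zip(names, PRIMES):
            low, high, scale = search_space[name]
            u = halton(trial, base)  # quasirandom point in [0, 1)
            if scale == "log":
                log_low, log_high = math.log(low), math.log(high)
                params[name] = math.exp(log_low + u * (log_high - log_low))
            else:
                params[name] = low + u * (high - low)
        trials.append(params)
    return trials


# Hypothetical first-round space; a second round would reuse the same
# function with a narrower, hand-picked range around the best trials.
first_round = quasirandom_trials(
    {"learning_rate": (1e-4, 1e-1, "log"), "dropout": (0.0, 0.5, "linear")},
    num_trials=50,
)
```

Compared with independent uniform sampling, a low-discrepancy sequence spreads the 50 trials more evenly over the space, which is the usual motivation for quasirandom search.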

Appendix F Open-Source Data

The tuning and final metrics data for the Diabetic Retinopathy benchmarks can be found at the following URLs:

  • Deterministic 10 seeds

  • Dropout First Tuning

  • Dropout Final Tuning

  • Dropout 10 seeds

  • Variational Inference First Tuning

  • Variational Inference Final Tuning

  • Variational Inference 10 seeds

  • Radial BNN First Tuning

  • Radial BNN Final Tuning

  • Radial BNN 10 seeds