
Bayesian Layers: A Module for Neural Network Uncertainty
We describe Bayesian Layers, a module designed for fast experimentation with neural network uncertainty. It extends neural network libraries with layers capturing uncertainty over weights (Bayesian neural nets), pre-activation units (dropout), activations ("stochastic output layers"), and the function itself (Gaussian processes). With reversible layers, one can also propagate uncertainty from input to output, such as for flow-based distributions and constant-memory backpropagation. Bayesian Layers are a drop-in replacement for other layers, maintaining core features that one typically desires for experimentation. As a demonstration, we fit a 10-billion-parameter "Bayesian Transformer" on 512 TPUv2 cores, which replaces attention layers with their Bayesian counterpart.
12/10/2018 ∙ by Dustin Tran, et al.
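The core idea of a weight-uncertainty layer can be sketched in a few lines. This is a minimal NumPy illustration, not the Bayesian Layers API: a dense layer that keeps a Gaussian variational posterior over its weights and samples fresh weights on every forward pass via the reparameterization trick, so repeated calls on the same input expose the layer's uncertainty.

```python
import numpy as np

class BayesianDense:
    """Sketch of a dense layer with a Gaussian posterior over weights.

    Each forward pass samples weights via the reparameterization trick,
    so repeated calls on the same input give different outputs. Names and
    initialization are illustrative, not the Bayesian Layers API.
    """
    def __init__(self, in_dim, out_dim, rng=None):
        self.rng = rng or np.random.default_rng(0)
        # Variational posterior parameters: mean and log-std per weight.
        self.w_mean = self.rng.normal(0.0, 0.1, size=(in_dim, out_dim))
        self.w_log_std = np.full((in_dim, out_dim), -3.0)

    def __call__(self, x):
        eps = self.rng.normal(size=self.w_mean.shape)
        w = self.w_mean + np.exp(self.w_log_std) * eps  # reparameterization
        return x @ w

layer = BayesianDense(4, 2)
x = np.ones((1, 4))
samples = np.stack([layer(x) for _ in range(100)])
mean, spread = samples.mean(axis=0), samples.std(axis=0)
```

Because the layer has the same call signature as a deterministic dense layer, it can stand in as a drop-in replacement; only the training loss needs an extra KL term tying the posterior to a prior.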

Measuring Calibration in Deep Learning
The reliability of a machine learning model's confidence in its predictions is critical for high-risk applications. Calibration, the idea that a model's predicted probabilities of outcomes reflect the true probabilities of those outcomes, formalizes this notion. While analyzing the calibration of deep neural networks, we've identified core problems with the way calibration is currently measured. We design the Thresholded Adaptive Calibration Error (TACE) metric to resolve these pathologies and show that it outperforms other metrics, especially in settings where predictions beyond the maximum prediction that is chosen as the output class matter. There are many cases where what a practitioner cares about is the calibration of a specific prediction, and so we introduce a dynamic-programming-based Prediction Specific Calibration Error (PSCE) that smoothly considers the calibration of nearby predictions to give an estimate of the calibration error of a specific prediction.
04/02/2019 ∙ by Jeremy Nixon, et al.
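A simplified estimator in the spirit of TACE can be written directly from the description above: consider every class probability above a threshold (not just the argmax), bin them adaptively into equal-count bins, and average the per-bin gap between accuracy and confidence. This is a hedged sketch of the idea; the paper's exact estimator may differ.

```python
import numpy as np

def thresholded_adaptive_calibration_error(probs, labels,
                                           threshold=0.01, n_bins=10):
    """Sketch of a TACE-style metric: all class probabilities above
    `threshold` are binned into equal-count (adaptive) bins, and the
    weighted |accuracy - confidence| gap is averaged across bins."""
    k = probs.shape[1]
    onehot = np.eye(k)[labels]
    conf, hit = probs.ravel(), onehot.ravel()
    keep = conf > threshold          # look beyond the argmax class
    conf, hit = conf[keep], hit[keep]
    order = np.argsort(conf)         # adaptive binning: sort, then
    conf, hit = conf[order], hit[order]
    bins = np.array_split(np.arange(conf.size), n_bins)  # equal counts
    err = 0.0
    for b in bins:
        if b.size:
            err += abs(hit[b].mean() - conf[b].mean()) * b.size / conf.size
    return err
```

A perfectly confident, perfectly correct classifier scores zero under this metric, while overconfident secondary-class probabilities, which the maximum-probability variant ignores, are penalized.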

Reliable Uncertainty Estimates in Deep Neural Networks using Noise Contrastive Priors
Obtaining reliable uncertainty estimates of neural network predictions is a long-standing challenge. Bayesian neural networks have been proposed as a solution, but it remains open how to specify the prior. In particular, the common practice of a standard normal prior in weight space imposes only weak regularities, causing the function posterior to possibly generalize in unforeseen ways on out-of-distribution inputs. We propose noise contrastive priors (NCPs). The key idea is to train the model to output high uncertainty for data points outside of the training distribution. NCPs do so using an input prior, which adds noise to the inputs of the current mini-batch, and an output prior, which is a wide distribution given these inputs. NCPs are compatible with any model that represents predictive uncertainty, are easy to scale, and yield reliable uncertainty estimates throughout training. Empirically, we show that NCPs offer clear improvements as an addition to existing baselines. We demonstrate the scalability on the flight delays data set, where we significantly improve upon previously published results.
07/24/2018 ∙ by Danijar Hafner, et al.

Mesh-TensorFlow: Deep Learning for Supercomputers
Batch-splitting (data-parallelism) is the dominant distributed Deep Neural Network (DNN) training strategy, due to its universal applicability and its amenability to Single-Program-Multiple-Data (SPMD) programming. However, batch-splitting suffers from problems including the inability to train very large models (due to memory constraints), high latency, and inefficiency at small batch sizes. All of these can be solved by more general distribution strategies (model-parallelism). Unfortunately, efficient model-parallel algorithms tend to be complicated to discover, describe, and implement, particularly on large clusters. We introduce Mesh-TensorFlow, a language for specifying a general class of distributed tensor computations. Where data-parallelism can be viewed as splitting tensors and operations along the "batch" dimension, in Mesh-TensorFlow, the user can specify any tensor dimensions to be split across any dimensions of a multi-dimensional mesh of processors. A Mesh-TensorFlow graph compiles into an SPMD program consisting of parallel operations coupled with collective communication primitives such as Allreduce. We use Mesh-TensorFlow to implement an efficient data-parallel, model-parallel version of the Transformer sequence-to-sequence model. Using TPU meshes of up to 512 cores, we train Transformer models with up to 5 billion parameters, surpassing state-of-the-art results on the WMT'14 English-to-French translation task and the one-billion-word language modeling benchmark. Mesh-TensorFlow is available at https://github.com/tensorflow/mesh .
11/05/2018 ∙ by Noam Shazeer, et al.
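The "split any tensor dimension across a mesh" idea can be illustrated in plain NumPy, with the mesh simulated in a single process. This is a toy sketch of the layout concept, not the Mesh-TensorFlow API: the contracted dimension of a matmul is sharded so each simulated processor holds only a slice of the weights, and the partial products are combined by an Allreduce-style sum.

```python
import numpy as np

def sharded_matmul(x, w, mesh_size=4):
    """Toy illustration of the Mesh-TensorFlow idea: split the contracted
    dimension of x @ w across `mesh_size` simulated processors, let each
    compute a partial product from its shard, then combine the partials
    with an Allreduce (here, a sum). Real Mesh-TensorFlow compiles such
    layouts into SPMD programs over an actual processor mesh."""
    x_shards = np.array_split(x, mesh_size, axis=1)   # shard columns of x
    w_shards = np.array_split(w, mesh_size, axis=0)   # matching rows of w
    partials = [xs @ ws for xs, ws in zip(x_shards, w_shards)]
    return np.sum(partials, axis=0)                   # Allreduce across mesh
```

Sharding the model dimension this way is what lets weight matrices exceed single-device memory: each processor stores only `1/mesh_size` of `w`, at the cost of one collective per contracted dimension.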

Simple, Distributed, and Accelerated Probabilistic Programming
We describe a simple, low-level approach for embedding probabilistic programming in a deep learning ecosystem. In particular, we distill probabilistic programming down to a single abstraction: the random variable. Our lightweight implementation in TensorFlow enables numerous applications: a model-parallel variational autoencoder (VAE) with second-generation tensor processing units (TPUv2s); a data-parallel autoregressive model (Image Transformer) with TPUv2s; and a multi-GPU No-U-Turn Sampler (NUTS). For both a state-of-the-art VAE on 64x64 ImageNet and Image Transformer on 256x256 CelebA-HQ, our approach achieves an optimal linear speedup from 1 to 256 TPUv2 chips. With NUTS, we see a 100x speedup on GPUs over Stan and 37x over PyMC3.
11/05/2018 ∙ by Dustin Tran, et al.
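What "distilling probabilistic programming to the random variable" means can be sketched in a few lines of NumPy. This is an illustrative mock-up, not the actual implementation: a random variable couples a sampler with a log-density, and its sampled value stands in for the variable in ordinary tensor arithmetic.

```python
import numpy as np

class RandomVariable:
    """Sketch of the single abstraction: a random variable that carries a
    sampler and a log-density, and whose sampled value can be used like an
    ordinary tensor (illustrative, not the real implementation)."""
    def __init__(self, sample_fn, log_prob_fn, rng=None):
        self.rng = rng or np.random.default_rng(0)
        self._log_prob_fn = log_prob_fn
        self.value = sample_fn(self.rng)   # eagerly sampled value
    def log_prob(self, x):
        return self._log_prob_fn(x)
    def __add__(self, other):              # value stands in for the variable
        return self.value + other

def Normal(loc, scale, rng=None):
    return RandomVariable(
        lambda r: r.normal(loc, scale),
        lambda x: (-0.5 * np.log(2 * np.pi * scale**2)
                   - (x - loc)**2 / (2 * scale**2)),
        rng,
    )

z = Normal(0.0, 1.0)
downstream = z + 1.0        # composes with ordinary arithmetic
density = z.log_prob(0.0)   # while still exposing its distribution
```

Because modeling and inference both only need sampling and log-densities, this one abstraction is enough to drive VAEs, autoregressive models, and MCMC alike.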

Discrete Flows: Invertible Generative Models of Discrete Data
While normalizing flows have led to significant advances in modeling high-dimensional continuous distributions, their applicability to discrete distributions remains unknown. In this paper, we show that flows can in fact be extended to discrete events, and under a simple change-of-variables formula not requiring log-determinant-Jacobian computations. Discrete flows have numerous applications. We consider two flow architectures: discrete autoregressive flows that enable bidirectionality, allowing, for example, tokens in text to depend on both left-to-right and right-to-left contexts in an exact language model; and discrete bipartite flows that enable efficient non-autoregressive generation as in RealNVP. Empirically, we find that discrete autoregressive flows outperform autoregressive baselines on synthetic discrete distributions, an addition task, and Potts models; and bipartite flows can obtain competitive performance with autoregressive baselines on character-level language modeling for Penn Treebank and text8.
05/24/2019 ∙ by Dustin Tran, et al.
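The change-of-variables formula for discrete flows reduces to modular arithmetic: a transform of the form y = (loc + scale * x) mod K is a bijection on {0, ..., K-1} whenever scale and K are coprime, so no Jacobian term arises. A minimal sketch of this core transform and its inverse:

```python
import numpy as np

def discrete_flow(x, loc, scale, K):
    """Core transform behind discrete flows: y = (loc + scale * x) mod K,
    a bijection on {0,...,K-1} whenever scale and K are coprime, so no
    log-determinant-Jacobian computation is needed."""
    return (loc + scale * x) % K

def discrete_flow_inverse(y, loc, scale, K):
    scale_inv = pow(int(scale), -1, K)  # modular inverse (Python 3.8+)
    return (scale_inv * (y - loc)) % K

x = np.arange(8)                         # vocabulary of K = 8 tokens
y = discrete_flow(x, loc=3, scale=5, K=8)
assert np.array_equal(np.sort(y), x)     # the transform is a permutation
assert np.array_equal(discrete_flow_inverse(y, 3, 5, 8), x)
```

In the autoregressive variant, loc and scale are predicted per position from context; in the bipartite variant, half the positions parameterize the transform of the other half, as in RealNVP.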

Autoconj: Recognizing and Exploiting Conjugacy Without a Domain-Specific Language
Deriving conditional and marginal distributions using conjugacy relationships can be time-consuming and error-prone. In this paper, we propose a strategy for automating such derivations. Unlike previous systems which focus on relationships between pairs of random variables, our system (which we call Autoconj) operates directly on Python functions that compute log-joint distribution functions. Autoconj provides support for conjugacy-exploiting algorithms in any Python-embedded PPL. This paves the way for accelerating development of novel inference algorithms and structure-exploiting modeling strategies.
11/29/2018 ∙ by Matthew D. Hoffman, et al.
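As a concrete example of the kind of derivation being automated, here is the textbook Beta-Bernoulli conjugate update worked by hand. Autoconj's contribution is recognizing this structure automatically from the Python log-joint function rather than requiring the user (or a DSL) to derive it; the code below is the hand-derived result, not Autoconj itself.

```python
def beta_bernoulli_posterior(alpha, beta, data):
    """Hand-worked conjugacy example: a Beta(alpha, beta) prior on a
    Bernoulli parameter has a log-joint that is linear in
    (log p, log(1 - p)), so the posterior is again Beta, with observed
    counts added to the natural parameters."""
    heads = sum(data)
    tails = len(data) - heads
    return alpha + heads, beta + tails

post = beta_bernoulli_posterior(1.0, 1.0, [1, 1, 0, 1])  # Beta(4, 2)
```

Recognizing such exponential-family structure in an arbitrary log-joint is exactly what lets Autoconj emit closed-form complete conditionals and marginals without a domain-specific language.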

Analyzing the Role of Model Uncertainty for Electronic Health Records
In medicine, both ethical and monetary costs of incorrect predictions can be significant, and the complexity of the problems often necessitates increasingly complex models. Recent work has shown that changing just the random seed is enough for otherwise well-tuned deep neural networks to vary in their individual predicted probabilities. In light of this, we investigate the role of model uncertainty methods in the medical domain. Using RNN ensembles and various Bayesian RNNs, we show that population-level metrics, such as AUC-PR, AUC-ROC, log-likelihood, and calibration error, do not capture model uncertainty. Meanwhile, the presence of significant variability in patient-specific predictions and optimal decisions motivates the need for capturing model uncertainty. Understanding the uncertainty for individual patients is an area with clear clinical impact, such as determining when a model decision is likely to be brittle. We further show that RNNs with only Bayesian embeddings can be a more efficient way to capture model uncertainty compared to ensembles, and we analyze how model uncertainty is impacted across individual input features and patient subgroups.
06/10/2019 ∙ by Michael W. Dusenberry, et al.
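The population-versus-patient distinction is easy to see numerically. In this illustrative sketch (toy numbers, not the paper's data), two patients receive the identical ensemble-mean risk of 0.5, so any population-level metric treats them alike, yet the disagreement across ensemble members, a simple proxy for model uncertainty, differs sharply between them.

```python
import numpy as np

def patient_level_uncertainty(member_probs):
    """Given per-member predicted probabilities (members x patients),
    return the ensemble-mean risk and the spread across members. The mean
    feeds population-level metrics; the spread is a per-patient proxy for
    model uncertainty that those metrics cannot see."""
    member_probs = np.asarray(member_probs)
    return member_probs.mean(axis=0), member_probs.std(axis=0)

# Two patients with identical mean risk 0.5 but very different
# member disagreement (toy numbers for illustration).
probs = np.array([[0.5, 0.1],
                  [0.5, 0.9],
                  [0.5, 0.5]])
mean, spread = patient_level_uncertainty(probs)
```

A clinician acting on patient 1's prediction faces a brittle decision that patient 0's identical point estimate does not, which is the clinical case for patient-level uncertainty made above.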

Hierarchical Implicit Models and Likelihood-Free Variational Inference
Implicit probabilistic models are a flexible class of models defined by a simulation process for data. They form the basis for theories which encompass our understanding of the physical world. Despite this fundamental nature, the use of implicit models remains limited due to challenges in specifying complex latent structure in them, and in performing inference in such models with large data sets. In this paper, we first introduce hierarchical implicit models (HIMs). HIMs combine the idea of implicit densities with hierarchical Bayesian modeling, thereby defining models via simulators of data with rich hidden structure. Next, we develop likelihood-free variational inference (LFVI), a scalable variational inference algorithm for HIMs. Key to LFVI is specifying a variational family that is also implicit. This matches the model's flexibility and allows for accurate approximation of the posterior. We demonstrate diverse applications: a large-scale physical simulator for predator-prey populations in ecology; a Bayesian generative adversarial network for discrete data; and a deep implicit model for text generation.
02/28/2017 ∙ by Dustin Tran, et al.
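An implicit model in the predator-prey spirit can be sketched as a noisy discretized Lotka-Volterra simulator. This is a toy stand-in for the paper's ecology application (the dynamics, noise, and parameter names are assumptions): sampling trajectories given parameters is trivial, but the induced likelihood of a trajectory is intractable, which is precisely the setting LFVI is built for.

```python
import numpy as np

def predator_prey_simulator(theta, steps=50, x0=10.0, y0=5.0, dt=0.1,
                            rng=None):
    """Toy implicit model: a noisy Euler-discretized Lotka-Volterra
    simulator with theta = (birth, predation, death, growth). Easy to
    sample from, but with no tractable likelihood for the trajectory."""
    rng = rng or np.random.default_rng(0)
    a, b, c, d = theta
    x, y, traj = x0, y0, []
    for _ in range(steps):
        x = max(x + dt * (a * x - b * x * y) + rng.normal(0, 0.1), 0.0)
        y = max(y + dt * (d * x * y - c * y) + rng.normal(0, 0.1), 0.0)
        traj.append((x, y))
    return np.array(traj)

traj = predator_prey_simulator((1.0, 0.1, 1.5, 0.075))
```

Because only the simulator is available, posterior inference over theta must avoid density evaluations entirely, hence LFVI's ratio-estimation approach with an implicit variational family.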

Deep Probabilistic Programming
We propose Edward, a Turing-complete probabilistic programming language. Edward defines two compositional representations: random variables and inference. By treating inference as a first-class citizen, on a par with modeling, we show that probabilistic programming can be as flexible and computationally efficient as traditional deep learning. For flexibility, Edward makes it easy to fit the same model using a variety of composable inference methods, ranging from point estimation to variational inference to MCMC. In addition, Edward can reuse the modeling representation as part of inference, facilitating the design of rich variational models and generative adversarial networks. For efficiency, Edward is integrated into TensorFlow, providing significant speedups over existing probabilistic systems. For example, we show on a benchmark logistic regression task that Edward is at least 35x faster than Stan and 6x faster than PyMC3. Further, Edward incurs no runtime overhead: it is as fast as handwritten TensorFlow.
01/13/2017 ∙ by Dustin Tran, et al.
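What "fitting the same model with composable inference methods" means can be illustrated without Edward's actual API. In this plain-NumPy sketch (model and routines are illustrative), one log-joint function is the single model representation, and two interchangeable inference procedures, point estimation and a Metropolis MCMC sampler, both consume it unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(2.0, 1.0, size=200)

def log_joint(mu):
    """One model representation: mu ~ N(0, 10^2), data ~ N(mu, 1).
    Treating inference as first class means this same function can be
    handed to interchangeable inference routines (sketch, not Edward)."""
    log_prior = -mu**2 / (2 * 10.0**2)
    log_lik = -np.sum((data - mu)**2) / 2.0
    return log_prior + log_lik

def map_estimate(lo=-10.0, hi=10.0, n=10001):      # point estimation
    grid = np.linspace(lo, hi, n)
    return grid[np.argmax([log_joint(m) for m in grid])]

def metropolis(n=2000, step=0.2):                  # MCMC on the same model
    mu, out = 0.0, []
    for _ in range(n):
        prop = mu + rng.normal(0, step)
        if np.log(rng.uniform()) < log_joint(prop) - log_joint(mu):
            mu = prop
        out.append(mu)
    return np.array(out)
```

Swapping the inference routine changes nothing about the model code, which is the compositional separation the abstract describes.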

Variational Inference via χ-Upper Bound Minimization
Variational inference (VI) is widely used as an efficient alternative to Markov chain Monte Carlo. It posits a family of approximating distributions q and finds the closest member to the exact posterior p. Closeness is usually measured via a divergence D(q ‖ p) from q to p. While successful, this approach also has problems. Notably, it typically leads to underestimation of the posterior variance. In this paper we propose CHIVI, a black-box variational inference algorithm that minimizes D_χ(p ‖ q), the χ-divergence from p to q. CHIVI minimizes an upper bound of the model evidence, which we term the χ-upper bound (CUBO). Minimizing the CUBO leads to improved posterior uncertainty, and it can also be used with the classical VI lower bound (ELBO) to provide a sandwich estimate of the model evidence. We study CHIVI on three models: probit regression, Gaussian process classification, and a Cox process model of basketball plays. When compared to expectation propagation and classical VI, CHIVI produces better error rates and more accurate estimates of posterior variance.
11/01/2016 ∙ by Adji B. Dieng, et al.
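The sandwich estimate is easy to demonstrate on a conjugate toy model where the exact evidence is known. This sketch (toy model and variational choice are assumptions, not the paper's experiments) uses importance weights w = p(x, z)/q(z): the ELBO is E_q[log w], while CUBO_n = (1/n) log E_q[w^n] with n = 2 gives the χ^2 upper bound, so the true log evidence is pinched between the two.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_norm(x, mean, var):
    return -0.5 * np.log(2 * np.pi * var) - (x - mean)**2 / (2 * var)

# Toy model: z ~ N(0,1), x|z ~ N(z,1), observe x = 1.
# Exact evidence: p(x) = N(1; 0, 2).
x = 1.0
log_evidence = log_norm(x, 0.0, 2.0)

# A deliberately imperfect variational approximation q(z) = N(0.3, 0.8).
z = rng.normal(0.3, np.sqrt(0.8), size=200_000)
log_w = log_norm(z, 0.0, 1.0) + log_norm(x, z, 1.0) - log_norm(z, 0.3, 0.8)

elbo = log_w.mean()                          # classical lower bound
n = 2.0                                      # chi^2 case of CUBO_n
cubo = (np.log(np.mean(np.exp(n * (log_w - log_w.max()))))
        + n * log_w.max()) / n               # log-sum-exp for stability
```

Minimizing the CUBO over q's parameters drives the upper bound down toward the evidence, and because the χ-divergence penalizes q being narrower than p, the fitted q tends to overestimate rather than underestimate posterior variance.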