Bayesian Layers: A Module for Neural Network Uncertainty

12/10/2018 ∙ by Dustin Tran, et al.

We describe Bayesian Layers, a module designed for fast experimentation with neural network uncertainty. It extends neural network libraries with layers capturing uncertainty over weights (Bayesian neural nets), pre-activation units (dropout), activations ("stochastic output layers"), and the function itself (Gaussian processes). With reversible layers, one can also propagate uncertainty from input to output such as for flow-based distributions and constant-memory backpropagation. Bayesian Layers are a drop-in replacement for other layers, maintaining core features that one typically desires for experimentation. As demonstration, we fit a 10-billion parameter "Bayesian Transformer" on 512 TPUv2 cores, which replaces attention layers with their Bayesian counterpart.




1 Introduction

The rise of AI accelerators such as TPUs lets us utilize computation with FLOP/s and 4 TB of memory distributed across hundreds of processors (Jouppi et al., 2017). This lets us fit probabilistic models many orders of magnitude larger than the state of the art. In particular, we are motivated by research on priors and algorithms for Bayesian neural networks (e.g., Wen et al., 2018; Hafner et al., 2018), scaling up Gaussian processes (e.g., Salimbeni and Deisenroth, 2017; John and Hensman, 2018), and expressive distributions via invertible functions (e.g., Rezende and Mohamed, 2015). Unfortunately, while research with these methods is not limited by hardware, it is limited by software. This paper describes Bayesian Layers, an extension of neural network libraries which contributes one idea: instead of only deterministic functions as “layers”, enable distributions over functions. Our implementation extends Keras in TensorFlow (Chollet, 2016) and uses Edward2 (Tran et al., 2018) to operate with random variables.

1.1 Related Work

There have been many software developments for distributions over functions. Our work takes classic inspiration from Radford Neal’s software in 1995 to enable flexible modeling with both Bayesian neural nets and GPs (Neal, 1995). Modern software typically focuses on only one of these directions. For Bayesian neural nets, researchers have commonly coupled variational sampling in neural net layers (e.g., code and algorithms from Gal and Ghahramani (2016); Louizos and Welling (2017)). For Gaussian processes, there have been significant developments in libraries (Rasmussen and Nickisch, 2010; GPy, 2012; Vanhatalo et al., 2013; Matthews et al., 2017; Al-Shedivat et al., 2017; Gardner et al., 2018). Perhaps most similar to our work, Aboleth (Aboleth Developers, 2017) features variational BNNs and random feature approximations for GPs. Aside from API differences from all these works, our work tries to revive the spirit of enabling any function with uncertainty (whether that be, e.g., in the weights, activations, or the entire function) and to do so in a manner compatible with scalable deep learning ecosystems.


Historically, an older design of Bayesian Layers was implemented in TensorFlow Probability (TensorFlow Probability Developers, 2017); the current design exists in Tensor2Tensor (Vaswani et al., 2018). In the future as the API stabilizes, we hope Bayesian Layer ideas may make it into core TensorFlow Keras and/or TensorFlow Probability.

2 Bayesian Layers

In neural network libraries, architectures decompose as a composition of “layer” objects as the core building block (Collobert et al., 2011; Al-Rfou et al., 2016; Jia et al., 2014; Chollet, 2016; Chen et al., 2015; Abadi et al., 2015; S. and N., 2016). These layers capture both the parameters and computation of a mathematical function into a programmable class. In our work, we extend layers to capture “distributions over functions”, which we describe as a layer with uncertainty about some state in its computation—be it uncertainty in the weights, pre-activation units, activations, or the entire function. Each sample from the distribution instantiates a different function, e.g., a layer with a different weight configuration.

2.1 Bayesian Neural Network Layers

The Bayesian extension of any deterministic layer is to place a prior distribution over its weights and biases. These layers require several considerations. Figure 1 implements a Bayesian RNN.

Computing the integral

We need to compute often-intractable integrals over the weights and biases. We focus on two cases: the variational objective and the approximate predictive distribution. To enable different methods to estimate these integrals, we implement each estimator as its own Layer (Figure 3). The same Bayesian neural net can use entirely different computational graphs depending on the estimator (and therefore entirely different code). For example, sampling from the approximate posterior with reparameterization and running the deterministic layer computation is a generic way to evaluate integrals (Kingma and Welling, 2014).
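For reference, the two integrals can be written in their standard textbook form. The notation is an assumption, since the symbols did not survive in this text: $\theta$ denotes the weights and biases, $p(\theta)$ the prior, $q(\theta)$ the approximate posterior, and $f_\theta$ the network evaluated with parameters $\theta$.

```latex
\begin{align}
\mathcal{L}(q) &= \int q(\theta)\,\log p\big(y \mid f_\theta(x)\big)\,d\theta
  \;-\; \mathrm{KL}\big[\,q(\theta)\,\|\,p(\theta)\,\big], \\
q(y \mid x) &= \int q(\theta)\,p\big(y \mid f_\theta(x)\big)\,d\theta .
\end{align}
```

Under this reading, each estimator is a different way of approximating the integral over $\theta$, while the KL term is what the distribution regularizers described later account for.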

    # Each line is an alternative estimator for the same Bayesian dense layer.
    outputs = layers.DenseReparameterization(512, activation=tf.nn.relu)(inputs)
    outputs = layers.DenseFlipout(512, activation=tf.nn.relu)(inputs)
    outputs = layers.DenseQuadrature(512, activation=tf.nn.relu)(inputs)

Figure 3: Bayesian feedforward layer with different estimators, including reparameterization (Kingma and Welling, 2014), Flipout (Wen et al., 2018), and numerical quadrature.
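To make the reparameterization estimator concrete, here is a minimal sketch in plain Python rather than TensorFlow. It assumes a single scalar weight with a Gaussian approximate posterior; the function name and the toy integrand are illustrative and are not part of the Bayesian Layers API.

```python
import random


def reparameterized_expectation(f, mu, sigma, num_samples=20000, seed=0):
    """Monte Carlo estimate of E[f(w)] for w ~ Normal(mu, sigma**2).

    Each draw is written as w = mu + sigma * eps with eps ~ Normal(0, 1),
    so the randomness is separated from the parameters (mu, sigma).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)  # parameter-free noise
        w = mu + sigma * eps       # reparameterized weight sample
        total += f(w)              # deterministic "layer" computation
    return total / num_samples


# Toy check: for f(w) = w**2 the exact value is mu**2 + sigma**2.
estimate = reparameterized_expectation(lambda w: w * w, mu=1.0, sigma=0.5)
exact = 1.0 ** 2 + 0.5 ** 2
```

Because the noise does not depend on `mu` or `sigma`, a framework with automatic differentiation can backpropagate through `w` to both parameters, which is what makes this estimator generic.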


Signature

We’d like the Bayesian extension of a deterministic layer to retain its mandatory constructor arguments as well as its Tensor-in Tensor-out call signature. This avoids cognitive overhead, letting one easily swap layers (Figure 4).

    # Choose the Bayesian or the deterministic layer class with a single flag.
    if FLAGS.be_bayesian:
        Conv2D = layers.Conv2DReparameterization
    else:
        Conv2D = tf.keras.layers.Conv2D

    model = tf.keras.Sequential([
        Conv2D(32, kernel_size=5,
               strides=1,  # assumed: the value is missing in the source; 1 is the Keras default
               padding='SAME'),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Activation('relu'),
        Conv2D(32, kernel_size=5, strides=2, padding='SAME'),
        tf.keras.layers.BatchNormalization(),
        ...
    ])

Figure 4: Bayesian Layers are drop-in replacements for their deterministic counterparts.
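The drop-in property can be illustrated without TensorFlow. The sketch below uses two toy classes whose names mirror the layers above, but they are hypothetical stand-ins operating on Python lists, not the library implementations. The only point is that both share the same constructor argument and the same list-in list-out call.

```python
import random


class Dense:
    """Deterministic toy layer: y_j = sum_i x_i * w[i][j]."""

    def __init__(self, units, seed=0):
        self.units = units
        self._rng = random.Random(seed)
        self.w = None  # created lazily once the input size is known

    def _weights(self, n_in):
        if self.w is None:
            self.w = [[self._rng.gauss(0.0, 0.1) for _ in range(self.units)]
                      for _ in range(n_in)]
        return self.w

    def __call__(self, x):
        w = self._weights(len(x))
        return [sum(x[i] * w[i][j] for i in range(len(x)))
                for j in range(self.units)]


class DenseReparameterization(Dense):
    """Same constructor and call signature; weights are resampled on every call."""

    def __init__(self, units, seed=0, stddev=0.05):
        super().__init__(units, seed)
        self.stddev = stddev

    def __call__(self, x):
        mean = self._weights(len(x))
        sample = [[m + self.stddev * self._rng.gauss(0.0, 1.0) for m in row]
                  for row in mean]
        return [sum(x[i] * sample[i][j] for i in range(len(x)))
                for j in range(self.units)]


def build_model(be_bayesian):
    # Swapping the class is the only change between the two models.
    Layer = DenseReparameterization if be_bayesian else Dense
    return [Layer(4), Layer(2)]


def run(model, x):
    for layer in model:
        x = layer(x)
    return x


x = [1.0, 2.0, 3.0]
deterministic = build_model(be_bayesian=False)
bayesian = build_model(be_bayesian=True)
```

Calling `run(deterministic, x)` twice gives identical outputs, while `run(bayesian, x)` differs between calls because each call instantiates a different function.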

Distributions over parameters

To specify distributions, a natural idea is to overload the existing parameter initialization arguments in a Layer’s constructor, such as kernel_initializer and bias_initializer in Keras. These arguments are extended to accept callables that take metadata such as input shape and return a distribution over the parameter. Distribution initializers may carry trainable parameters, each with their own initializers (pointwise or distributional). The default initializer represents a trainable approximate posterior in a variational inference scheme.

For the distribution abstraction, we use Edward RandomVariables (Tran et al., 2018). They are Tensors augmented with distribution methods such as sample and log_prob; by default, numerical ops operate on its sample Tensor. Layers perform forward passes using deterministic ops and the RandomVariables.
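As a rough illustration of this abstraction, the following plain-Python sketch defines a toy random variable that carries `sample` and `log_prob` methods and behaves like its sample under arithmetic. The class `NormalRV` and the function `normal_initializer` are hypothetical names for this sketch; they are not the Edward2 or Keras APIs.

```python
import math
import random


class NormalRV:
    """Toy stand-in for a random variable that acts like its sample."""

    def __init__(self, loc, scale, seed=0):
        self.loc = loc
        self.scale = scale
        # The stored sample is what numerical operations see by default.
        self.value = self.sample(random.Random(seed))

    def sample(self, rng):
        return self.loc + self.scale * rng.gauss(0.0, 1.0)

    def log_prob(self, x):
        z = (x - self.loc) / self.scale
        return -0.5 * z * z - math.log(self.scale) - 0.5 * math.log(2.0 * math.pi)

    # Arithmetic falls through to the stored sample.
    def __float__(self):
        return self.value

    def __add__(self, other):
        return self.value + float(other)

    __radd__ = __add__

    def __mul__(self, other):
        return self.value * float(other)

    __rmul__ = __mul__


def normal_initializer(size, seed=0):
    """Hypothetical distribution initializer: shape metadata in, random variables out."""
    return [NormalRV(0.0, 1.0, seed=seed + i) for i in range(size)]


kernel = normal_initializer(3)
inputs = [0.5, -1.0, 2.0]
# The forward pass uses ordinary deterministic arithmetic on the random variables.
pre_activation = sum(w * x for w, x in zip(kernel, inputs))
```

The forward pass never mentions sampling explicitly, which is the design choice that lets a layer's computation stay identical to its deterministic counterpart.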

Distribution regularizers

The variational training objective requires the evaluation of a KL term, which penalizes deviations of the learned approximate posterior from the prior. Similar to distribution initializers, we overload the existing parameter regularization arguments in a layer’s constructor, such as kernel_regularizer and bias_regularizer in Keras. These arguments are extended to accept callables that take in the kernel or bias RandomVariables and return a scalar Tensor. By default, we use a KL divergence toward the standard normal distribution, which represents the penalty term common in variational Bayesian neural network training.

Explicitly returning regularizers in a Layer’s call ruins composability (see Signature above). Therefore Bayesian layers, like their deterministic counterparts, side-effect the computation: one queries an attribute to access any regularizers for, e.g., the loss function.
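A small plain-Python sketch of this side-effect pattern follows. It assumes a factorized Gaussian posterior over scalar weights, for which the KL divergence toward a standard normal has a closed form. The class `ToyBayesianLayer` and the placeholder data-fit term are illustrative assumptions, not library code.

```python
import math


def kl_normal_to_standard_normal(mu, sigma):
    """KL( N(mu, sigma^2) || N(0, 1) ) in closed form for one scalar weight."""
    return 0.5 * (sigma ** 2 + mu ** 2 - 1.0) - math.log(sigma)


class ToyBayesianLayer:
    """Hypothetical layer that records its KL penalty instead of returning it."""

    def __init__(self, mus, sigmas, regularizer=kl_normal_to_standard_normal):
        self.mus = mus
        self.sigmas = sigmas
        self.regularizer = regularizer
        self.losses = []  # queried later by the training code

    def __call__(self, x):
        # Side effect: store the penalty; the return value stays a plain output.
        self.losses = [sum(self.regularizer(m, s)
                           for m, s in zip(self.mus, self.sigmas))]
        # The forward pass uses the posterior means to keep this sketch deterministic.
        return sum(m * xi for m, xi in zip(self.mus, x))


layer = ToyBayesianLayer(mus=[0.0, 1.0], sigmas=[1.0, 1.0])
output = layer([2.0, 3.0])
neg_log_likelihood = (output - 3.5) ** 2  # placeholder data-fit term
loss = neg_log_likelihood + sum(layer.losses)
```

Because the penalty lives in an attribute, the call still maps one input to one output, so layers compose exactly as deterministic ones do.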

Figure 1 implements a Bayesian RNN; Appendix A implements a Bayesian CNN (ResNet-50).

2.2 Gaussian Process Layers

As opposed to representing distributions over functions through the weights, Gaussian processes represent distributions over functions by specifying the value of the function at different inputs. Recent advances have made Gaussian process inference computationally similar to Bayesian neural networks (Hensman et al., 2013). We only require a method to sample the function value at a new input, and evaluate KL regularizers, which allows GPs to be placed in the same framework as above. (More broadly, these ideas extend to stochastic processes. For example, we plan to implement a Poisson process layer for scalable point process modeling.) Figure 5 implements a deep GP.


Signature.

Use the same mandatory arguments as the equivalent deterministic layer (e.g., the number of units in a layer determines a GP’s output dimensionality; a GP’s kernel is an optional argument defaulting to squared exponential). Retain the Tensor-in Tensor-out call signature.

Computing the integral.

Use a separate class per estimator, such as GaussianProcess for exact integration, which is only possible in limited situations; SparseGaussianProcess for inducing variable approximations; and RandomFourierFeatures for projection approximations.

Distribution regularizers.

By default, include no regularizer for exact GPs, a KL regularizer on inducing outputs for sparse GPs, and a KL regularizer on weights for random projection approximations. These defaults reflect each inference method’s standard for training.
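To show why a random projection approximation ends up with a KL regularizer on weights, here is a sketch of random Fourier features for the squared exponential kernel in one dimension. It uses only the Python standard library; the class name echoes `RandomFourierFeatures` above but is a toy written for this example, with an assumed unit lengthscale.

```python
import math
import random


def squared_exponential(x, y, lengthscale=1.0):
    return math.exp(-0.5 * ((x - y) / lengthscale) ** 2)


class RandomFourierFeatures1D:
    """Toy feature map whose inner products approximate the kernel above."""

    def __init__(self, num_features, lengthscale=1.0, seed=0):
        rng = random.Random(seed)
        self.num_features = num_features
        # Frequencies follow the spectral density of the squared exponential kernel.
        self.omegas = [rng.gauss(0.0, 1.0 / lengthscale) for _ in range(num_features)]
        self.phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]

    def __call__(self, x):
        scale = math.sqrt(2.0 / self.num_features)
        return [scale * math.cos(w * x + b)
                for w, b in zip(self.omegas, self.phases)]


features = RandomFourierFeatures1D(num_features=5000)
x, y = 0.3, 1.1
approx = sum(a * b for a, b in zip(features(x), features(y)))
exact = squared_exponential(x, y)
```

With such features, a function draw is a linear model on `features(x)` with Gaussian weights, so the approximation reduces to a Bayesian linear layer and the natural penalty is a KL term on those weights.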

    batch_size = 256
    features, labels = load_spatial_data(batch_size)

    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(),  # no spatial knowledge
        layers.SparseGaussianProcess(units=256, num_inducing=512),
        layers.SparseGaussianProcess(units=256, num_inducing=512),
        layers.SparseGaussianProcess(units=10, num_inducing=512),
    ])
    predictions = model(features)
    neg_log_likelihood = tf.losses.mean_squared_error(labels=labels, predictions=predictions)
    kl = sum(model.losses)  # KL regularizers collected from each GP layer
    loss = neg_log_likelihood + kl
    train_op = tf.train.AdamOptimizer().minimize(loss)

Figure 5: Three-layer deep GP with variational inference (Salimbeni and Deisenroth, 2017; Damianou and Lawrence, 2013). We apply it for regression given batches of spatial inputs and vector-valued outputs. We flatten inputs to use the default squared exponential kernel; this naturally extends to pass in a more sophisticated kernel function.

2.3 Stochastic Output Layers

    def build_image_transformer(hparams):
        # input_shape is assumed to be defined elsewhere, e.g. (128, 128, 3) per the caption.
        inputs = tf.keras.layers.Input(shape=input_shape, dtype='float32')
        x = ChannelEmbedding(hparams.hidden_size)(inputs)
        x = tf.keras.layers.Reshape([-1, hparams.hidden_size])(x)
        x = tf.pad(x, [[0, 0], [1, 0], [0, 0]])[:, :-1, :]  # shift pixels right
        x = PositionalEmbedding(max_length=128*128*3)(x)
        x = tf.keras.layers.Dropout(0.3)(x)
        for _ in range(hparams.num_layers):
            # Masked local self-attention block with a residual connection.
            y = LocalAttention1D(hparams, attention_type='local_mask_right',
                                 q_padding='LEFT', kv_padding='LEFT')(x)
            x = LayerNormalization()(tf.keras.layers.Dropout(0.3)(y) + x)
            # Position-wise feedforward block with a residual connection.
            y = tf.keras.layers.Dense(hparams.filter_size, activation=tf.nn.relu)(x)
            y = tf.keras.layers.Dense(hparams.hidden_size, activation=None)(y)
            x = LayerNormalization()(tf.keras.layers.Dropout(0.3)(y) + x)
        x = layers.MixtureofLogistic(3, num_components=5)(x)
        x = layers.Discretize(x)
        model = tf.keras.Model(inputs=inputs, outputs=x, name='ImageTransformer')
        return model

    image_transformer = build_image_transformer(hparams)
    loss = -tf.reduce_sum(image_transformer(features).distribution.log_prob(features))
    train_op = tf.train.AdamOptimizer().minimize(loss)

Figure 6: Image Transformer with discretized logistic mixture (Parmar et al., 2018) over 128x128x3 features. We assume layers which don’t exist in Keras; functional versions are available in Tensor2Tensor (Vaswani et al., 2018).

In addition to uncertainty over the mapping defined by a layer, we may want to simply add stochasticity to the output. These outputs have a tractable distribution, and we often would like to access its properties: for example, maximum likelihood with an autoregressive network whose output is a discretized logistic mixture (Salimans et al., 2017) (Figure 6); or an auto-encoder with stochastic encoders and decoders (Appendix B). To implement stochastic output layers, we simply perform deterministic computations and return a RandomVariable. Because RandomVariables are Tensor-like objects, one can operate on them as if they were Tensors: composing stochastic output layers is valid. In addition, using such a layer as the last one in a network allows one to compute properties such as a network’s entropy or likelihood given data. (In previous figures, we used loss functions such as mean_squared_error. With stochastic output layers, we can replace them with a layer returning the likelihood and calling log_prob.)
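The idea of "deterministic computation, then return a distribution" can be sketched in plain Python with a single discretized logistic (one mixture component, for brevity). The class `DiscretizedLogistic`, the function `stochastic_output_layer`, and the way the location and scale are derived from the features are all assumptions made for this example.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


class DiscretizedLogistic:
    """Toy distribution over integer bins 0..num_bins-1 from a logistic(loc, scale)."""

    def __init__(self, loc, scale, num_bins=256):
        self.loc, self.scale, self.num_bins = loc, scale, num_bins

    def _cdf(self, t):
        return sigmoid((t - self.loc) / self.scale)

    def prob(self, k):
        # Edge bins absorb the tails so that the probabilities sum to one.
        upper = 1.0 if k == self.num_bins - 1 else self._cdf(k + 0.5)
        lower = 0.0 if k == 0 else self._cdf(k - 0.5)
        return upper - lower

    def log_prob(self, k):
        return math.log(self.prob(k))


def stochastic_output_layer(features):
    """Hypothetical layer: deterministic computation, then return a distribution."""
    loc = 100.0 + 10.0 * features[0]
    scale = math.exp(features[1])  # exponentiate to keep the scale positive
    return DiscretizedLogistic(loc, scale)


output = stochastic_output_layer([0.5, 1.0])
total = sum(output.prob(k) for k in range(256))
nll = -output.log_prob(105)  # negative log-likelihood of an observed pixel value
```

Since the layer returns an object with `log_prob`, the training loss is obtained by querying the output rather than by choosing a separate loss function.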

2.4 Reversible Layers

    batch_size = 256
    features = load_cifar10(batch_size)

    model = tf.keras.Sequential([
        layers.RealNVP(MADE(hidden_dims=[512, 512])),
        layers.RealNVP(MADE(hidden_dims=[512, 512], order='right-to-left')),
        layers.RealNVP(MADE(hidden_dims=[512, 512])),
    ])
    base = ed.Normal(loc=tf.zeros([batch_size, 32*32*3]), scale=1.)
    outputs = model(base)  # a RandomVariable transformed by the flow
    loss = -tf.reduce_sum(outputs.distribution.log_prob(features))
    train_op = tf.train.AdamOptimizer().minimize(loss)

Figure 7: A flow-based model for image generation (Dinh et al., 2017).

With random variables in layers, one can naturally capture invertible neural networks which propagate uncertainty from input to output. In particular, a reversible layer may take a RandomVariable as input and return another RandomVariable with the same distribution up to a volume change. The layer implements not only a forward pass but also a method reverse and optionally log_det_jacobian. (We implement layers.Discretize this way in Figure 6. It takes a continuous RandomVariable as input and returns a transformed variable with probabilities integrated over bins.) Figure 7 implements RealNVP (Dinh et al., 2017), which is a reversible layer parameterized by another network (here, MADE (Germain et al., 2015)). These ideas also extend to reversible networks that enable backpropagation without storing intermediate activations in memory during the forward pass (Gomez et al., 2017).
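The three methods of a reversible layer can be sketched with an affine coupling transform on plain Python lists. This is a simplified illustration, not the `layers.RealNVP` implementation: the `conditioner` function is a hypothetical stand-in for a network such as MADE, and inputs are assumed to have even length.

```python
import math


class AffineCoupling:
    """Toy reversible layer: the first half of the input parameterizes an
    affine transform of the second half."""

    def __init__(self, shift_and_log_scale):
        # Maps the untouched half to a pair of lists (shift, log_scale).
        self.shift_and_log_scale = shift_and_log_scale

    def __call__(self, x):
        half = len(x) // 2
        x1, x2 = x[:half], x[half:]
        shift, log_scale = self.shift_and_log_scale(x1)
        y2 = [v * math.exp(s) + t for v, s, t in zip(x2, log_scale, shift)]
        return x1 + y2

    def reverse(self, y):
        half = len(y) // 2
        y1, y2 = y[:half], y[half:]
        # y1 equals x1, so the same shift and scale can be recomputed exactly.
        shift, log_scale = self.shift_and_log_scale(y1)
        x2 = [(v - t) * math.exp(-s) for v, s, t in zip(y2, log_scale, shift)]
        return y1 + x2

    def log_det_jacobian(self, x):
        half = len(x) // 2
        _, log_scale = self.shift_and_log_scale(x[:half])
        return sum(log_scale)  # the Jacobian is triangular


def conditioner(x1):
    """Hypothetical stand-in for a network such as MADE."""
    shift = [math.tanh(v) for v in x1]
    log_scale = [0.5 * math.sin(v) for v in x1]
    return shift, log_scale


layer = AffineCoupling(conditioner)
x = [0.2, -1.3, 0.7, 2.0]
y = layer(x)
x_recovered = layer.reverse(y)
log_det = layer.log_det_jacobian(x)
```

The log-determinant is the volume change mentioned above: the log-density of the output equals the log-density of the input minus `log_det`.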

3 Experiments

We implemented a “Bayesian Transformer” for the One-Billion-Word Language Modeling Benchmark (Chelba et al., 2013). Using Mesh TensorFlow (Shazeer et al., 2018), we took a 5-billion parameter Transformer which reports a state-of-the-art perplexity of 23.1. We then augmented the model with priors over the projection matrices by replacing calls to a multihead-attention layer with its Bayesian counterpart (using the Flipout estimator). Figure 10 shows that we can fit models with over 10-billion parameters, utilizing up to 2500 TFLOPs on 512 TPUv2 cores. In attempting these scales, we were able to reach state-of-the-art perplexities at the cost of worse uncertainties than smaller Bayesian Transformers. We identified a number of challenges in scaling up Bayesian neural nets which we leave out of this work, and which we’re excited to explore for future research.

4 Limitations

While the framework we laid out tightly integrates deep Bayesian modelling into existing ecosystems, we have deliberately limited our scope, especially compared to fully-fledged probabilistic programming languages. In particular, our layers tie the model specification to the inference algorithm (typically, variational inference). The core assumption of this framework is the modularization of inference per layer. This makes inference procedures which depend on the full parameter space, such as Markov chain Monte Carlo, difficult to fit within the framework. Other inference methods such as EP variants (Bui et al., 2016; Hernández-Lobato and Adams, 2015) could fit if explicit separate representations of prior and approximating distributions are used, in effect, making a layer a special kind of random variable already used in probabilistic programming languages (e.g. Edward).


We thank James Hensman and Alex Matthews for motivating discussions around GPs and probabilistic programming; Sergio Guadarrama for the initial idea behind stochastic output layers; Josh Dillon, Dave Moore, Chris Suter, and Matt Hoffman for API discussions in an older design; and Eugene Brevdo, Martin Wicke, and Kevin Murphy for their feedback.


Appendix A Bayesian ResNet-50

See Figure 8.

    def conv_block(inputs, kernel_size, filters, strides=(2, 2)):
        """Residual block whose shortcut is a strided 1x1 convolution."""
        filters1, filters2, filters3 = filters
        x = layers.Conv2DFlipout(filters1, (1, 1), strides=strides,
                                 kernel_initializer='he_normal')(inputs)
        x = tf.keras.layers.BatchNormalization()(x)
        x = tf.keras.layers.Activation('relu')(x)
        x = layers.Conv2DFlipout(filters2, kernel_size, padding='SAME',
                                 kernel_initializer='he_normal')(x)
        x = tf.keras.layers.BatchNormalization()(x)
        x = tf.keras.layers.Activation('relu')(x)
        x = layers.Conv2DFlipout(filters3, (1, 1),
                                 kernel_initializer='he_normal')(x)
        x = tf.keras.layers.BatchNormalization()(x)
        shortcut = layers.Conv2DFlipout(filters3, (1, 1), strides=strides,
                                        kernel_initializer='he_normal')(inputs)
        shortcut = tf.keras.layers.BatchNormalization()(shortcut)
        x = tf.keras.layers.add([x, shortcut])
        x = tf.keras.layers.Activation('relu')(x)
        return x

    def identity_block(inputs, kernel_size, filters):
        """Residual block whose shortcut is the identity."""
        filters1, filters2, filters3 = filters
        x = layers.Conv2DFlipout(filters1, (1, 1),
                                 kernel_initializer='he_normal')(inputs)
        x = tf.keras.layers.BatchNormalization()(x)
        x = tf.keras.layers.Activation('relu')(x)
        x = layers.Conv2DFlipout(filters2, kernel_size, padding='SAME',
                                 kernel_initializer='he_normal')(x)
        x = tf.keras.layers.BatchNormalization()(x)
        x = tf.keras.layers.Activation('relu')(x)
        x = layers.Conv2DFlipout(filters3, (1, 1),
                                 kernel_initializer='he_normal')(x)
        x = tf.keras.layers.BatchNormalization()(x)
        x = tf.keras.layers.add([x, inputs])
        x = tf.keras.layers.Activation('relu')(x)
        return x

    def build_bayesian_resnet50(input_shape=None, num_classes=1000):
        inputs = tf.keras.layers.Input(shape=input_shape, dtype='float32')
        x = tf.keras.layers.ZeroPadding2D((3, 3))(inputs)
        x = layers.Conv2DFlipout(64, (7, 7), strides=(2, 2), padding='VALID',
                                 kernel_initializer='he_normal')(x)
        x = tf.keras.layers.BatchNormalization()(x)
        x = tf.keras.layers.Activation('relu')(x)
        x = tf.keras.layers.ZeroPadding2D((1, 1))(x)
        x = tf.keras.layers.MaxPooling2D((3, 3), strides=(2, 2))(x)
        x = conv_block(x, 3, [64, 64, 256], strides=(1, 1))
        x = identity_block(x, 3, [64, 64, 256])
        x = identity_block(x, 3, [64, 64, 256])
        x = conv_block(x, 3, [128, 128, 512])
        x = identity_block(x, 3, [128, 128, 512])
        x = identity_block(x, 3, [128, 128, 512])
        x = identity_block(x, 3, [128, 128, 512])
        x = conv_block(x, 3, [256, 256, 1024])
        x = identity_block(x, 3, [256, 256, 1024])
        x = identity_block(x, 3, [256, 256, 1024])
        x = identity_block(x, 3, [256, 256, 1024])
        x = identity_block(x, 3, [256, 256, 1024])
        x = identity_block(x, 3, [256, 256, 1024])
        x = conv_block(x, 3, [512, 512, 2048])
        x = identity_block(x, 3, [512, 512, 2048])
        x = identity_block(x, 3, [512, 512, 2048])
        x = tf.keras.layers.GlobalAveragePooling2D()(x)
        x = layers.DenseFlipout(num_classes)(x)
        model = tf.keras.models.Model(inputs, x, name='resnet50')
        return model

    bayesian_resnet50 = build_bayesian_resnet50()
    logits = bayesian_resnet50(features)
    neg_log_likelihood = tf.losses.sparse_softmax_cross_entropy(
        labels=labels, logits=logits, reduction=tf.losses.Reduction.MEAN)
    kl = sum(bayesian_resnet50.losses)  # include KL penalties, which are Layer side-effects
    loss = neg_log_likelihood + kl
    train_op = tf.train.AdamOptimizer().minimize(loss)

    # Alternatively, run the following instead of a manual train_op.
    bayesian_resnet50.compile(optimizer=tf.train.AdamOptimizer(),
                              loss='categorical_crossentropy',
                              metrics=['accuracy'])
    bayesian_resnet50.fit(features, labels, batch_size=32,
                          epochs=...)  # the epochs value is missing in the source

Figure 8: Bayesian ResNet-50.

Appendix B Vector-Quantized Variational Auto-Encoder

See Figure 9.

    base_depth = 128
    # latent_size is assumed to be defined elsewhere.

    encoder = tf.keras.Sequential([
        tf.keras.layers.Conv2D(base_depth, 5, 1, padding='SAME', activation=tf.nn.relu),
        tf.keras.layers.Conv2D(base_depth, 5, 2, padding='SAME', activation=tf.nn.relu),
        tf.keras.layers.Conv2D(2 * base_depth, 5, 1, padding='SAME', activation=tf.nn.relu),
        tf.keras.layers.Conv2D(2 * base_depth, 5, 2, padding='SAME', activation=tf.nn.relu),
        tf.keras.layers.Conv2D(4 * latent_size, 7, padding='VALID', activation=tf.nn.relu),
        tf.keras.layers.Flatten(),
        layers.VectorQuantizer(512, name='latent_code'),
    ])
    decoder = tf.keras.Sequential([
        tf.keras.layers.Conv2DTranspose(2 * base_depth, 7, padding='VALID', activation=tf.nn.relu),
        tf.keras.layers.Conv2DTranspose(2 * base_depth, 5, padding='SAME', activation=tf.nn.relu),
        tf.keras.layers.Conv2DTranspose(2 * base_depth, 5, 2, padding='SAME', activation=tf.nn.relu),
        tf.keras.layers.Conv2DTranspose(base_depth, 5, padding='SAME', activation=tf.nn.relu),
        tf.keras.layers.Conv2DTranspose(base_depth, 5, 2, padding='SAME', activation=tf.nn.relu),
        tf.keras.layers.Conv2DTranspose(base_depth, 5, padding='SAME', activation=tf.nn.relu),
        tf.keras.layers.Conv2D(3, 5, padding='SAME', activation=None),
        layers.Bernoulli(256*256*3, name='image'),  # likelihood
    ])
    encoded_features = encoder(features)
    reconstruction = decoder(encoded_features).distribution.log_prob(features)
    entropy = encoded_features.distribution.entropy()
    # Negative ELBO: minimizing it maximizes the log-likelihood plus the encoder entropy.
    loss = -(reconstruction + entropy)
    train_op = tf.train.AdamOptimizer().minimize(loss)

Figure 9: Vector-quantized variational auto-encoder (van den Oord et al., 2017) for 256x256 ImageNet. VQVAEs assume a uniform prior during training; we use a stochastic encoder.

Appendix C Bayesian Transformer

See Figure 10.

Figure 10: Bayesian Transformer implemented with model parallelism ranging from 8 TPUv2 shards (cores) to 512.