Global inducing point variational posteriors for Bayesian neural networks and deep Gaussian processes

05/17/2020 · Sebastian W. Ober, et al. · University of Cambridge, University of Bristol

Variational inference is a popular approach to reasoning about uncertainty in Bayesian neural networks (BNNs) and deep Gaussian processes (deep GPs). However, typical variational approximate posteriors for deep BNNs and GPs factorise across layers. This is a problematic assumption, because what matters in a deep BNN or GP is the input-output transformation defined by the full network, not the input-output transformation defined by an individual layer. We therefore propose an approximate posterior with dependencies across layers that seeks to jointly model the input-output transformation over the full network. Our approximate posterior is based on a "global" set of inducing points that are defined only at the input layer and propagated through the network. In showing that BNNs are a special case of deep GPs, we demonstrate that this approximate posterior can be used to infer both the weights of a BNN and the functions in a deep GP. Further, we consider a new correlated prior over the weights of a BNN, which, in combination with global inducing points, gives state-of-the-art performance for a variational Bayesian method on CIFAR-10 of 86.7%, without data augmentation or posterior tempering.


1 Introduction

Deep models, formed by stacking together many simpler layers, give rise to extremely powerful machine learning algorithms, from deep neural networks (DNNs) to deep Gaussian processes (deep GPs) [8, 34]. To reason about uncertainty in these models, one approach is to use variational inference (VI) [19]. In Bayesian neural networks (BNNs), VI requires the user to define a family of approximate posteriors over the weights, with the classical approach using the family of Gaussian distributions that factorise independently over each individual weight [18, 14, 6]. Later work has considered more complex approximate posteriors [27], for instance using a Matrix-Normal distribution as the approximate posterior for a full weight matrix [26, 33]. In contrast, deep GPs use an approximate posterior defined over functions — the standard approach is to specify the inputs and outputs at a finite number of “inducing” points [37, 40, 16, 28, 34]. Critically, for both BNNs and deep GPs, these approaches define approximate posteriors over functions separately at each layer, though some work has used expensive generic approaches to introduce global correlations [22, 32].

However, an approximate posterior that factorises across layers is problematic, because what matters for a deep model is the overall input-output transformation for the full model, not the input-output transformation for individual layers. This is particularly evident if we consider pervasive permutation symmetries in DNNs and deep GPs: if we permute the identities of hidden units in a layer, permuting all input and output connections at the same time, then the output of the network remains exactly the same [38, 4, 2]. This dramatically changes the input-output transformations for individual layers, while leaving the input-output transformation for the network as a whole unchanged. To model these symmetries in our approximate posteriors, we must introduce correlations between layers.

To develop an approximate posterior that reasons jointly over the input-output transformation defined by the full network, we extend classical inducing point methods for Gaussian processes. We describe classical methods as “local”, because they specify a posterior over a single layer by giving a series of example inputs and outputs (inducing points) for only that layer. In contrast, our approach uses a single “global” set of inducing inputs, defined only at the input layer, and we propagate these inducing points through the network as we sample BNN weights/GP functions. As such, the posterior over the weights/functions for a single layer will change dramatically as the inputs to that layer change due to variability in the lower-layer weights/functions. We show excellent performance on a wide range of tasks, including superior performance to previous GP-based approaches on datasets such as CIFAR-10, where we get a classification accuracy of 86.7% without data augmentation.

2 Background

A deep GP defines a prior over function values $F^\ell \in \mathbb{R}^{P \times N_\ell}$, where $\ell$ indexes the layer, $P$ is the number of input points, and $N_\ell$ is the “width” of this layer, by stacking layers of standard Gaussian processes,

$$P\big(Y, \{F^\ell\}_{\ell=1}^{L+1} \mid X\big) = P\big(Y \mid F^{L+1}\big) \prod_{\ell=1}^{L+1} P\big(F^\ell \mid F^{\ell-1}\big). \qquad (1)$$

Here, the input is $X = F^0$, the output is $Y$ (which could be continuous values for regression, or class-labels for classification), and the distribution over $F^\ell$ factorises into independent multivariate Gaussian distributions over each function,

$$P\big(F^\ell \mid F^{\ell-1}\big) = \prod_{\lambda=1}^{N_\ell} \mathcal{N}\big(f^\ell_\lambda;\; \mathbf{0},\; \mathbf{K}(F^{\ell-1})\big), \qquad (2)$$

where $f^\ell_\lambda$ is the $\lambda$th column of $F^\ell$, giving the activations of all datapoints for the $\lambda$th feature, and $\mathbf{K}(\cdot)$ is a function that computes the kernel-matrix from the features in the previous layer.
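As a concrete illustration of Eqs. (1)–(2), the following sketch samples from a deep GP prior layer by layer; the squared-exponential kernel, the widths, and all sizes are illustrative choices rather than anything specified in the paper.

```python
import torch

def kernel(F, lengthscale=1.0, variance=1.0, jitter=1e-6):
    # K(F^{l-1}): kernel matrix computed from the previous layer's features.
    d2 = torch.cdist(F, F).pow(2)
    return variance * torch.exp(-0.5 * d2 / lengthscale**2) + jitter * torch.eye(F.shape[0])

def sample_deep_gp_prior(X, widths):
    # Sample F^1, ..., F^{L+1} layer by layer; each column (feature) is an
    # independent multivariate Gaussian with covariance K(F^{l-1}), as in Eq. (2).
    F = X
    for width in widths:
        chol = torch.linalg.cholesky(kernel(F))
        F = chol @ torch.randn(X.shape[0], width)
    return F

X = torch.randn(50, 2)                                    # P = 50 inputs, 2 input features
print(sample_deep_gp_prior(X, widths=[10, 10, 1]).shape)  # torch.Size([50, 1])
```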

To define a variational approximate posterior, we augment the model with inducing points $U^\ell$, consisting of the function values corresponding to $M$ inducing locations $Z^{\ell-1}$ at each layer,

$$P\big(Y, \{F^\ell, U^\ell\}_{\ell=1}^{L+1} \mid X, \{Z^{\ell-1}\}_{\ell=1}^{L+1}\big) = P\big(Y \mid F^{L+1}\big) \prod_{\ell=1}^{L+1} P\big(F^\ell, U^\ell \mid F^{\ell-1}, Z^{\ell-1}\big), \qquad (3)$$

and because $F^\ell$ and $U^\ell$ are the function outputs corresponding to different inputs ($F^{\ell-1}$ and $Z^{\ell-1}$), they form a joint multivariate Gaussian distribution, analogous to Eq. (2),

$$P\big(F^\ell, U^\ell \mid F^{\ell-1}, Z^{\ell-1}\big) = \prod_{\lambda=1}^{N_\ell} \mathcal{N}\!\left( \begin{pmatrix} f^\ell_\lambda \\ u^\ell_\lambda \end{pmatrix};\; \mathbf{0},\; \begin{pmatrix} \mathbf{K}(F^{\ell-1}) & \mathbf{K}(F^{\ell-1}, Z^{\ell-1}) \\ \mathbf{K}(Z^{\ell-1}, F^{\ell-1}) & \mathbf{K}(Z^{\ell-1}) \end{pmatrix} \right). \qquad (4)$$

The standard approach [34] is to form an approximate posterior by conditioning the function values, $F^\ell$, on the inducing outputs, $U^\ell$,

$$Q\big(\{F^\ell, U^\ell\}_{\ell=1}^{L+1}\big) = \prod_{\ell=1}^{L+1} P\big(F^\ell \mid U^\ell, F^{\ell-1}, Z^{\ell-1}\big)\, Q\big(U^\ell\big), \qquad (5)$$

where $Q(U^\ell)$ is chosen to be a sensible parametric distribution (e.g. multivariate Gaussian). The variational parameters are thus the inducing inputs, $Z^{\ell-1}$, and the parameters of $Q(U^\ell)$, and they can be optimized by maximizing the evidence lower bound (ELBO):

$$\mathcal{L} = \mathbb{E}_{Q}\big[\log P\big(Y \mid F^{L+1}\big)\big] - \sum_{\ell=1}^{L+1} D_{\mathrm{KL}}\big(Q(U^\ell)\,\big\|\,P(U^\ell \mid Z^{\ell-1})\big). \qquad (6)$$
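A minimal sketch of the per-layer conditional in Eq. (5) for a single GP feature: given a Gaussian Q(u) at inducing inputs Z, it returns the mean and covariance of Q(f) at inputs X. The RBF kernel and all shapes are illustrative assumptions, not the paper's settings.

```python
import torch

def rbf(A, B, lengthscale=1.0, variance=1.0):
    return variance * torch.exp(-0.5 * torch.cdist(A, B).pow(2) / lengthscale ** 2)

def conditional(X, Z, m_u, S_u, jitter=1e-6):
    # Mean and covariance of Q(f) = integral of P(f | u) Q(u) du for one GP feature (Eq. 5),
    # with Q(u) = N(m_u, S_u) at the inducing inputs Z.
    Kuu = rbf(Z, Z) + jitter * torch.eye(Z.shape[0])
    Kfu, Kff = rbf(X, Z), rbf(X, X)
    A = torch.linalg.solve(Kuu, Kfu.T)                  # Kuu^{-1} Kuf
    mean = Kfu @ torch.linalg.solve(Kuu, m_u)
    cov = Kff - Kfu @ A + A.T @ S_u @ A
    return mean, cov

X, Z = torch.randn(100, 3), torch.randn(20, 3)          # data and inducing inputs (illustrative)
m_u, S_u = torch.zeros(20, 1), 0.1 * torch.eye(20)      # parameters of Q(u)
mean, cov = conditional(X, Z, m_u, S_u)
print(mean.shape, cov.shape)                            # torch.Size([100, 1]) torch.Size([100, 100])
```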

3 Methods

We make four key contributions. First, a new technique for defining approximate posteriors for GPs that is easily compatible with BNNs. Second, an efficient method for computing approximate posteriors by Bayesian linear regression in the convolutional setting. Third, a new approach to inducing points: “global” rather than “local”. Finally, a set of priors over neural network weights.

3.1 Compatible approximate posteriors for BNNs and GPs

We define our BNNs recursively by applying a nonlinearity, $\phi(\cdot)$, to the function values at the previous layer (i.e. the pre-activations/activities), then multiplying by a weight matrix (including biases), $W^\ell$,

$$F^\ell = \phi\big(F^{\ell-1}\big)\, W^\ell, \qquad (7)$$

where we have a prior over the weights in layer $\ell$ with $N_\ell$ hidden units:

$$P\big(W^\ell\big) = \prod_{\lambda=1}^{N_\ell} \mathcal{N}\big(w^\ell_\lambda;\; \mathbf{0},\; \tfrac{1}{N_{\ell-1}}\mathbf{I}\big). \qquad (8)$$

We note that this implicitly defines a prior over function values which is equivalent to a deep Gaussian process [26, 1],

$$P\big(F^\ell \mid F^{\ell-1}\big) = \prod_{\lambda=1}^{N_\ell} \mathcal{N}\big(f^\ell_\lambda;\; \mathbf{0},\; \mathbf{K}(F^{\ell-1})\big), \qquad (9)$$

where we let $\mathbf{K}(F^{\ell-1}) = \tfrac{1}{N_{\ell-1}}\,\phi(F^{\ell-1})\,\phi(F^{\ell-1})^{T}$. In principle, we should therefore be able to use the inducing point methods described in the background section. Indeed, this was attempted by Louizos and Welling [26], but they encountered an issue: the kernel/covariance matrix in Eq. (9) is rank-deficient if the number of data/inducing points is greater than the previous layer's width, $N_{\ell-1}$. Intuitively, each inducing input/output pair constrains the weights,

$$u^\ell_\lambda = \phi\big(Z^{\ell-1}\big)\, w^\ell_\lambda, \quad \text{or} \quad U^\ell = \phi\big(Z^{\ell-1}\big)\, W^\ell, \qquad (10)$$

and only so many constraints can possibly be jointly satisfied. This causes many practical issues, not least of which is that we must always use fewer inducing points than $N_{\ell-1}$ [26].
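A quick numerical illustration of this rank deficiency, with arbitrary sizes: the implied kernel is a scaled outer product of the post-nonlinearity features, so with more inducing points than previous-layer units it is necessarily singular.

```python
import torch

torch.manual_seed(0)
N_prev, M = 50, 200                        # previous-layer width and number of inducing points
phi = torch.relu(torch.randn(M, N_prev))   # post-nonlinearity features at the inducing points

# The BNN-implied kernel is a scaled outer product of the features, so its rank
# is at most N_prev: with M > N_prev inducing points it is singular.
K = phi @ phi.T / N_prev
print(torch.linalg.matrix_rank(K).item())  # prints 50, not 200
```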

As this solution is problematic, we instead consider a different family of approximate posteriors, formed by considering noisy inducing outputs. In particular, we consider an approximate posterior over $U^\ell$ induced by conditioning on “pseudo-data”, $v^\ell_\lambda$, which is presumed to be drawn from a diagonal Gaussian with learned precision $\mathbf{\Lambda}^\ell$:

$$Q\big(v^\ell_\lambda \mid u^\ell_\lambda\big) = \mathcal{N}\big(v^\ell_\lambda;\; u^\ell_\lambda,\; (\mathbf{\Lambda}^\ell)^{-1}\big). \qquad (11)$$

Concretely, the approximate posterior over $U^\ell$ becomes,

$$Q\big(U^\ell\big) \propto P\big(U^\ell \mid Z^{\ell-1}\big) \prod_{\lambda=1}^{N_\ell} \mathcal{N}\big(v^\ell_\lambda;\; u^\ell_\lambda,\; (\mathbf{\Lambda}^\ell)^{-1}\big), \qquad (12)$$

where $P(U^\ell \mid Z^{\ell-1})$ is the prior implied by Eq. (9). The approximate posterior over $u^\ell_\lambda$ becomes

$$Q\big(u^\ell_\lambda\big) = \mathcal{N}\big(u^\ell_\lambda;\; \mathbf{\Sigma}^\ell_{u}\mathbf{\Lambda}^\ell v^\ell_\lambda,\; \mathbf{\Sigma}^\ell_{u}\big), \quad \text{with} \quad \mathbf{\Sigma}^\ell_{u} = \big(\mathbf{K}(Z^{\ell-1})^{-1} + \mathbf{\Lambda}^\ell\big)^{-1}. \qquad (13)$$

This form for the approximate posterior over $U^\ell$ can be used as a direct replacement for the standard approximate posterior, $Q(U^\ell)$, and inserted into the ELBO (Eq. 6). Importantly, however, defining the posterior by conditioning on $v^\ell_\lambda$ gives us a trivial method for finding the equivalent approximate posterior over the weights of a BNN using Bayesian linear regression, without encountering the rank-deficiency issues in Louizos and Welling [26]. In particular,

$$Q\big(W^\ell\big) \propto P\big(W^\ell\big) \prod_{\lambda=1}^{N_\ell} \mathcal{N}\big(v^\ell_\lambda;\; \phi\big(Z^{\ell-1}\big) w^\ell_\lambda,\; (\mathbf{\Lambda}^\ell)^{-1}\big). \qquad (14)$$

Here, the pseudo-data likelihood is given by Eq. (11), and the resulting approximate posterior is

$$Q\big(w^\ell_\lambda\big) = \mathcal{N}\big(w^\ell_\lambda;\; \mathbf{\Sigma}^\ell_{w}\,\phi\big(Z^{\ell-1}\big)^{T}\mathbf{\Lambda}^\ell v^\ell_\lambda,\; \mathbf{\Sigma}^\ell_{w}\big), \qquad (15)$$
$$\mathbf{\Sigma}^\ell_{w} = \big(N_{\ell-1}\mathbf{I} + \phi\big(Z^{\ell-1}\big)^{T}\mathbf{\Lambda}^\ell\,\phi\big(Z^{\ell-1}\big)\big)^{-1}, \qquad (16)$$

which is equivalent to standard Bayesian linear regression. We can now use this expression in the standard BNN ELBO, defined in terms of weights. Note that, while we have focused on BNNs, our method of using noisy inducing points is applicable to deep GPs by applying Eqs. (13) and (6).
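The sketch below implements this Bayesian-linear-regression form of the weight posterior (Eqs. 14–16) for a single fully-connected layer. It assumes a fan-in-scaled prior as in Eq. (8) and a diagonal pseudo-data precision shared across output features; the paper's exact parameterisation may differ, and the sizes are illustrative.

```python
import torch

def layer_weight_posterior(phi_Z, v, prec, fan_in):
    # Bayesian linear regression of the pseudo-outputs v on the post-nonlinearity
    # inducing features phi_Z, under the prior w ~ N(0, I / fan_in) of Eq. (8).
    Lam_phi = prec[:, None] * phi_Z                       # Lambda @ phi(Z)
    post_prec = fan_in * torch.eye(phi_Z.shape[1]) + phi_Z.T @ Lam_phi
    cov = torch.linalg.inv(post_prec)                     # Eq. (16)
    mean = cov @ phi_Z.T @ (prec[:, None] * v)            # Eq. (15), all output columns at once
    return mean, cov

M, fan_in, out_features = 30, 20, 15                      # illustrative sizes
phi_Z = torch.relu(torch.randn(M, fan_in))                # features at the M inducing points
v = torch.randn(M, out_features)                          # learned noisy inducing outputs
log_prec = torch.zeros(M)                                 # learned log pseudo-data precisions
mean, cov = layer_weight_posterior(phi_Z, v, log_prec.exp(), fan_in)

# Reparameterised sample of the weights (covariance shared across output columns).
W = mean + torch.linalg.cholesky(cov) @ torch.randn_like(mean)
print(W.shape)                                            # torch.Size([20, 15])
```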

3.2 Efficient convolutional Bayesian linear regression

The previous section applies to fully connected networks. The extension to convolutional networks is straightforward: we create a feature-vector for each patch by flattening the spatial and channel dimensions together into a single vector. Thus, the feature-vectors have length in_channels × kernel_width × kernel_height, and the feature matrix contains patches_per_image × minibatch_size patches. Likewise, we now have inducing outputs at each location in all the inducing images, so these again number patches_per_image × minibatch_size. After explicitly extracting the patches, we can straightforwardly apply standard Bayesian linear regression.

However, explicitly extracting image patches is very memory intensive in a DNN. If we consider a standard convolution with a 3×3 convolutional kernel, then there is a patch centred at each pixel in the input image, meaning a factor of 9 increase in memory consumption. Instead, we note that computing the matrices required for linear regression does not require explicit extraction of image-patches. Instead, these matrices can be computed by taking the autocorrelation of the image/feature map, i.e. a convolution operation where we treat the image/feature map as both the inputs and the weights (Appendix A).
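The following sketch spells out the convolutional case using explicit patch extraction via torch.nn.functional.unfold — i.e. the memory-hungry route that the autocorrelation trick of Appendix A avoids. The fan-in-scaled prior, the single shared precision, and all sizes are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
M, C_in, C_out, H, W, K = 8, 3, 16, 10, 10, 3       # inducing images, channels, kernel size
nu = 10.0                                           # assumed scalar pseudo-data precision

U_in = torch.randn(M, C_in, H, W)                   # inducing inputs to this conv layer
V = torch.randn(M, C_out, H, W)                     # noisy inducing outputs (pseudo-data)

# Explicitly extract K x K patches: memory grows by a factor of K*K (9 here).
phi = F.unfold(U_in, kernel_size=K, padding=K // 2) # [M, C_in*K*K, H*W]
phi = phi.transpose(1, 2).reshape(-1, C_in * K * K) # one row per patch
v = V.permute(0, 2, 3, 1).reshape(-1, C_out)        # matching pseudo-outputs per location

# Bayesian linear regression with prior w ~ N(0, I / fan_in), fan_in = C_in*K*K.
fan_in = C_in * K * K
post_prec = fan_in * torch.eye(fan_in) + nu * phi.T @ phi
post_cov = torch.linalg.inv(post_prec)
post_mean = post_cov @ (nu * phi.T @ v)             # [fan_in, C_out]

# Sample convolutional weights from the posterior and use them in a forward pass.
W_sample = post_mean + torch.linalg.cholesky(post_cov) @ torch.randn(fan_in, C_out)
w_conv = W_sample.T.reshape(C_out, C_in, K, K)      # unfold uses channel-major ordering
out = F.conv2d(torch.randn(4, C_in, H, W), w_conv, padding=K // 2)
print(out.shape)                                    # torch.Size([4, 16, 10, 10])
```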

3.3 Global and local inducing points

The standard approach to inducing points, which we describe as “local”, is to optimize the inducing inputs, $Z^{\ell-1}$, and noisy inducing outputs, $v^\ell_\lambda$, separately for each layer. Instead, we consider “global” inducing points, where we optimize the noisy inducing outputs, $v^\ell_\lambda$, as before, but only the initial inducing inputs, $Z^0$. The inducing inputs at subsequent layers are obtained by propagating the initial inducing inputs, $Z^0$, through the network,

$$Z^{\ell} = \phi\big(Z^{\ell-1}\big)\, W^\ell, \qquad W^\ell \sim Q\big(W^\ell\big). \qquad (17)$$

Thus, the inducing inputs at layer $\ell$ are random variables which depend on the weights sampled from the approximate posterior at previous layers. Critically, we must therefore interleave computing and sampling from the approximate posterior over the weights at this layer (Eq. 14) and computing the inducing inputs for the next layer (see Algo. 1). As the inducing inputs to a layer depend on the weights sampled at previous layers, it is clear that global inducing points introduce correlations between layers, which do not exist when using a local inducing point scheme.

Parameters: initial inducing inputs, $Z^0$; inducing outputs and precisions, $\{v^\ell_\lambda, \mathbf{\Lambda}^\ell\}$, at all layers. Neural network inputs: $X$ (e.g. MNIST digits). Neural network outputs: (e.g. classification logits).

for each layer $\ell$ do
       Compute the mean and covariance over the weights at this layer.
       Sample the weights and compute this layer's ELBO contribution.
       Propagate the inputs and inducing points using the sampled weights.
Algorithm 1: Global inducing points for neural networks
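A compact sketch of Algorithm 1 for a fully-connected BNN: the data and the global inducing inputs are propagated together, and each layer's weight posterior is computed from the propagated inducing representation before sampling. The shapes, the ReLU nonlinearity, and the omission of biases and of the KL computation are simplifications, not the paper's exact implementation.

```python
import torch

def weight_posterior(phi_Z, v, prec, fan_in):
    # Bayesian linear regression of pseudo-outputs on inducing features (Sec. 3.1).
    P = fan_in * torch.eye(phi_Z.shape[1]) + phi_Z.T @ (prec[:, None] * phi_Z)
    cov = torch.linalg.inv(P)
    return cov @ phi_Z.T @ (prec[:, None] * v), cov

def global_inducing_forward(X, Z0, params, fan_ins):
    # Propagate the data X and the *global* inducing inputs Z0 together; each
    # layer's weight posterior is built from the propagated inducing features,
    # so the layers are coupled through the sampled weights (Algorithm 1).
    F, Z = X, Z0
    for l, ((v, log_prec), fan_in) in enumerate(zip(params, fan_ins)):
        act = torch.relu if l > 0 else (lambda t: t)      # no nonlinearity on the raw inputs
        phi_F, phi_Z = act(F), act(Z)
        mean, cov = weight_posterior(phi_Z, v, log_prec.exp(), fan_in)
        W = mean + torch.linalg.cholesky(cov) @ torch.randn_like(mean)
        # (the per-layer KL term against the prior N(0, I / fan_in) would be accumulated here)
        F, Z = phi_F @ W, phi_Z @ W                       # biases omitted for brevity
    return F

dims = [1, 50, 50, 1]                                     # two hidden layers of 50 units
X, Z0 = torch.randn(40, 1), torch.randn(10, 1)            # 40 datapoints, 10 global inducing inputs
params = [(torch.randn(10, d), torch.zeros(10)) for d in dims[1:]]
print(global_inducing_forward(X, Z0, params, fan_ins=dims[:-1]).shape)   # torch.Size([40, 1])
```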

3.4 Priors

We consider four priors in this work, which we refer to using the class names from the BNN library published with this paper. We are careful to ensure that all parameters in the model have both a prior and an approximate posterior, which is necessary to ensure that ELBOs are comparable across models.

First, we consider a Gaussian prior with fixed scale, NealPrior, so named because it is necessary to obtain meaningful results when considering infinite networks [30],

$$P\big(W^\ell\big) = \prod_{\lambda=1}^{N_\ell} \mathcal{N}\big(w^\ell_\lambda;\; \mathbf{0},\; \tfrac{1}{N_{\ell-1}}\mathbf{I}\big), \qquad (18)$$

though it bears strong similarities to the “He” initialization [15]. NealPrior is defined so as to ensure that the activations retain a sensible scale as they propagate through the network. We compare this to the standard unit-variance Gaussian prior (StandardPrior), which causes the activations to increase exponentially as they propagate through network layers (see Eq. 8):

$$P\big(W^\ell\big) = \prod_{\lambda=1}^{N_\ell} \mathcal{N}\big(w^\ell_\lambda;\; \mathbf{0},\; \mathbf{I}\big). \qquad (19)$$
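A small numerical illustration of the contrast between the two fixed-scale priors: drawing weights from each prior and propagating random inputs through a deep ReLU network shows activations keeping a sensible scale under NealPrior but growing exponentially under StandardPrior. The width, depth, and input batch below are arbitrary choices.

```python
import torch

torch.manual_seed(0)

def final_activation_scale(prior_var, width=100, depth=10):
    # Push standard-normal inputs through a deep ReLU net whose weights are
    # drawn from N(0, prior_var(width)) and report the final activation scale.
    x = torch.randn(128, width)
    for _ in range(depth):
        W = prior_var(width) ** 0.5 * torch.randn(width, width)
        x = torch.relu(x @ W)
    return x.std().item()

print(final_activation_scale(lambda n: 1.0 / n))  # NealPrior: stays within a sensible range
print(final_activation_scale(lambda n: 1.0))      # StandardPrior: blows up with depth
```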

Next, we consider ScalePrior, which defines a prior and approximate posterior over the scale,

(20)

where we parameterise the Gamma distributions by their shape and rate, and the shape and rate of the approximate posterior over the scale are non-negative learned parameters. Finally, we consider SpatialIWPrior, which allows for spatial correlations in the weights. In particular, we take the covariance to be the Kronecker product of an identity matrix over the channel dimensions and a Wishart-distributed matrix over the spatial dimensions,

(21)

where the degrees of freedom (a non-negative real number) and the positive-definite scale matrix are learned parameters of the approximate posterior (see Appendix B).

4 Results

We describe our experiments and results assessing the performance of global inducing points (‘gi’) against local inducing points (‘li’) and the fully factorised (‘fac’) approximating family. We additionally consider models where we use one method up to the last layer and another for the last layer, which may have computational advantages; we denote such models by the pair of methods, e.g. ‘fac gi’ for a fully factorised approximation with a global inducing output layer. While our experiments here focus on BNNs, we include results from experiments with deep GPs in the Appendices, where the reader can also find further experimental details and analysis.

Figure 1: Predictive distributions on the toy dataset. Shaded regions represent one standard deviation.

4.1 Toy

We demonstrate the use of local and global inducing point methods on a toy 1-D regression problem, comparing them with fully factorised VI and Hamiltonian Monte Carlo (HMC). Following Hernández-Lobato and Adams [17], we generate 40 input-output pairs, with the inputs sampled i.i.d. and the outputs given by a noisy nonlinear function of the inputs; we then normalize the inputs and outputs. Note that we introduce a ‘gap’ in the inputs, following recent work [12, 44, 11] that identifies the ability to express ‘in-between’ uncertainty as an important quality of approximate inference algorithms. We evaluate the inference algorithms using fully-connected BNNs with 2 hidden layers of 50 ReLU hidden units, using the NealPrior.

The predictive distributions for the toy experiment can be seen in Fig. 1. We observe that, of the variational methods, the global inducing method produces predictive distributions closest to HMC, with good uncertainty in the gap. The local inducing point method seems to perform the worst, struggling to provide a reasonable fit to the data. Meanwhile, the factorised method does fit the training data, but does not produce reasonable error bars.

4.2 Deep linear networks

Figure 2: ELBO for different approximate posteriors as we change network depth/width on a dataset generated using a linear Gaussian model. The blue line lies behind the black/red lines at two of the widths shown.

Part of the motivation for our work is that, by coupling the approximate posterior across functions at each layer, our method becomes robust to having lower layers drawn at random from the prior, which is known to give good performance in infinitely wide networks [23, 29, 1, 24]. We therefore considered data generated from a toy linear model: 5 input features are mapped to 1 output feature, where the 1000 training and 100 test inputs are drawn i.i.d. from a standard Gaussian, and the true outputs are generated using a weight-vector drawn i.i.d. from a Gaussian, with additive Gaussian observation noise. We then trained deep linear models with each approximate posterior and compared their effectiveness for different depths and widths. Critically, the definition of our toy model allowed us to evaluate the model evidence under the true data generating process, which forms an upper bound (in expectation) on the model evidence and ELBO of mismatched deep models.
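Because the toy model is linear-Gaussian, the reference value has a closed form. The sketch below generates data of the stated form and evaluates the exact log marginal likelihood of the true generative model; the prior and noise variances here are assumed placeholder values, not the ones used in the experiments.

```python
import torch
from torch.distributions import MultivariateNormal

torch.manual_seed(0)
N, D = 1000, 5
sigma2_w, sigma2_n = 1.0, 0.1          # assumed prior and noise variances (illustrative)

X = torch.randn(N, D)                  # training inputs, drawn i.i.d. from a standard Gaussian
w = sigma2_w ** 0.5 * torch.randn(D)   # true weight vector
y = X @ w + sigma2_n ** 0.5 * torch.randn(N)

# Exact log marginal likelihood of the true linear-Gaussian generative model,
# which (in expectation) upper-bounds the ELBO of any mismatched deep model.
cov = sigma2_w * X @ X.T + sigma2_n * torch.eye(N)
log_evidence = MultivariateNormal(torch.zeros(N), covariance_matrix=cov).log_prob(y)
print((log_evidence / N).item())       # reference value per datapoint
```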

We found that the ELBO for methods that factorise across layers — factorised and local inducing — drops rapidly as networks get deeper and wider (Fig. 2). This is undesirable behaviour, as we know that wide, deep networks are necessary for good performance on difficult machine learning tasks. In contrast, we found that methods with global inducing points at the last layer decay much more slowly with depth, and — as we would hope — perform better as networks get wider (consistent with the patterns observed in [1]). This was true both for methods that used random weights in the lower layers (rand gi; blue) and for global inducing points throughout (global inducing; red), but methods that used the fully factorised approximation in the lower layers (fac gi; green) performed poorly at some widths, which we believe to be due to optimization issues. This difference in behaviour with depth exists because methods with global inducing points can cope with drawing the lower-layer weights from the prior, which results in a small KL term for the lower layers, giving good values for the ELBO. In contrast, methods that factorise across layers cannot cope with drawing lower layers from the prior, and must therefore pay an extra cost in KL divergence at each layer.

4.3 UCI

Figure 3: Average test log likelihoods on the UCI datasets (in nats). Error bars represent one standard error. Shading represents different priors. We connect the factorised models with the fac gi models with a thin grey line as an aid for easier comparison. Further to the right is better.

To assess our methods in the regression setting, we benchmark them on the UCI datasets in Hernández-Lobato and Adams [17], popular benchmark regression datasets for BNNs. Each dataset uses 20 train-test ‘splits’ (except for protein, which only has 5 splits); we use the same splits as in Gal and Ghahramani [13]. We normalize the inputs and outputs to have zero mean and unit standard deviation. We consider four approximating families on two-layer fully-connected ReLU networks: fully factorised (with the local reparameterization trick [21]), local inducing, global inducing, and fully factorised with a global inducing output layer (which we refer to as ‘fac gi’). We also consider three priors: the standard prior, NealPrior, and ScalePrior. Finally, we perform a grid search over minibatch size and learning rate.

We display the average test log likelihoods for the un-normalized data in Fig. 3, where the dots and error bars represent the means and standard errors over the test splits, respectively (see Appendix D for further details and the ELBO and RMSE, and see Appendix E for deep GPs). For most datasets, fac gi outperforms the other methods for all priors. Importantly, however, it is reasonable to expect effective variational posteriors to be able to cope with learning the prior variance of the weights (ScalePrior). While this was always the case on the larger datasets (naval, power, protein), it was not the case on the smaller datasets (boston, concrete, energy, wine, yacht), where combining factorised or local inducing with ScalePrior gave pathologically poor results (see Appendix F for other pathological results for BNNs when using the first 500 examples from MNIST as a training set). In contrast, global inducing methods were robust to learned prior variances in all tested cases — indeed, they outperformed baseline methods in all cases except protein, where the difference was very small due to the very large dataset. Note that fac gi generally outperforms the full global inducing approximation; we hypothesize that this is due to optimization difficulties.

4.4 CIFAR-10

For CIFAR-10, we considered a ResNet-inspired model consisting of conv2d-relu-block-avgpool2-block-avgpool2-block-avgpool-linear, where the ResNet blocks consist of a shortcut connection in parallel with conv2d-relu-conv2d-relu, using 32 channels in all layers. In all our experiments, we used no data augmentation and 500 inducing points (see Appendix C for details about why our learning rates are high relative to those in the literature). We trained for 1000 epochs, tempering the KL term (with respect to the prior) in the ELBO with a multiplicative factor that increases from 0 to 1 over the first 100 epochs; importantly, we train for the final 900 epochs with an unmodified ELBO, so that our final results do not reflect a ‘cold posterior’ [43]. Our results are shown in Table 1. We achieved a predictive accuracy of 86.7% with global inducing points used for all layers and a spatial Inverse Wishart prior on the weights. While this is low in comparison with state-of-the-art finite networks, it compares very favourably with comparable Bayesian approaches, i.e. those without data augmentation or posterior sharpening; in particular, we compare against past work with deep GPs [35] and against work using infinite neural networks to define a GP [25] in Table 1. Remarkably, we are approaching the accuracy of sampling-based methods [43], which are in principle able to more closely approximate the true posterior. Furthermore, we see that global inducing performs best in terms of the ELBO (per datapoint) by a wide margin, demonstrating that it gets far closer to the true posterior than the other methods.

test log like. accuracy ELBO time (MNIST)
factorised -0.58 (-0.66) 80.27% (77.65%) -1.06 (-1.12) 19 s
local inducing -0.62 (-0.60) 78.96% (79.46%) -0.84 (-0.88) 33 s
fac gi -0.49 (-0.56) 83.33% (81.72%) -0.91 (-0.96) 25 s
global inducing -0.40 (-0.43) 86.70% (85.73%) -0.68 (-0.75) 65 s
Shi et al. [35]
Li et al. [25]
Shridhar et al. [36]
Wenzel et al. [43]
Table 1: CIFAR-10 classification accuracy. The first block shows our results using SpatialIWPrior, with ScalePrior in brackets. The next block shows comparable past results from GPs and BNNs. The final block shows non-comparable (sampling-based) methods. Dashes indicate that figures were either not reported or are not applicable. The time is reported per epoch with ScalePrior, and for MNIST rather than CIFAR-10, because of a known performance bug in the convolutions required in Sec. 3.2 for larger images (https://github.com/pytorch/pytorch/issues/35603).
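As a reference for the architecture used in this section, a minimal deterministic skeleton is sketched below. The 3×3 kernel size and the final global average pool are assumptions (they are not specified above), and in the actual model every conv/linear weight would carry a variational posterior rather than being a point estimate.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    # Shortcut connection in parallel with conv2d-relu-conv2d-relu (32 channels).
    def __init__(self, c=32, k=3):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Conv2d(c, c, k, padding=k // 2), nn.ReLU(),
            nn.Conv2d(c, c, k, padding=k // 2), nn.ReLU(),
        )

    def forward(self, x):
        return x + self.branch(x)

net = nn.Sequential(   # conv2d-relu-block-avgpool2-block-avgpool2-block-avgpool-linear
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    Block(), nn.AvgPool2d(2),
    Block(), nn.AvgPool2d(2),
    Block(), nn.AdaptiveAvgPool2d(1),
    nn.Flatten(), nn.Linear(32, 10),
)
print(net(torch.randn(2, 3, 32, 32)).shape)   # torch.Size([2, 10]) logits
```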

4.5 Related work

We were recently made aware of Ustyuzhaninov et al. [41], who independently developed the notion of global inducing points, calling it “inducing points as inducing locations”. Our contributions are distinct: they use toy examples to demonstrate how global inducing points differ from an alternative approach to introducing correlations across layers, which parameterises the joint distribution over all inducing points at all layers as a multivariate Gaussian. In contrast, we give a unified view of how global inducing points apply to both deep GPs and BNNs, give efficient convolutional extensions, show that global inducing points can be effective even when the lower layers are drawn at random from the prior, and give results on standard benchmarks including UCI and CIFAR-10.

Panos et al. [31] use a similar form for the covariance (Eq. 13) in a standard GP setting (not deep, not convolutional). However, they use a different form for the mean and do not provide an interpretation as conditioning on pseudo-data.

Finally, the typical approach to convolutional GPs is to consider a GP mapping from an image-patch to an output, and as such, the inducing points become a set of image-patches at each layer [42, 5, 9, 35]. We used this strategy for local inducing points. However, for global inducing points, the inducing patches are extracted from the feature-map that formed the output of the previous layer. Thus, we no longer have a set of distinct, uncoupled patches as the inducing points; instead, the inducing patches are coupled, because they are patches cut out of the same underlying feature-map.

5 Conclusions

We introduce the use of global inducing points for variational BNNs and deep GPs. The resulting approximate posterior is extremely flexible, even being robust to having the weights in one or more layers drawn entirely from the prior. We show that using global inducing points leads to improved performance, with better ELBOs, and to state-of-the-art performance for variational BNNs on CIFAR-10.

Broader Impact

As deep learning is increasingly deployed in safety-critical settings such as self-driving cars, it is essential to ensure that these systems understand when they are certain, and can safely take an action, and when they are uncertain, and should ask for further human input. The mathematical formalisation of uncertainty is Bayesian inference, so it is natural to ask whether we can use Bayes to reason about uncertainty in deep networks employed in safety-critical settings. However, this has proven difficult, with many approaches using very strong assumptions and approximations, making it difficult to trust the resulting uncertainty estimates. Here, we develop a better approximate posterior for Bayesian neural networks, hopefully leading to future practical improvements in uncertainty estimation in safety-critical settings.

We thank David R. Burt, Andrew Y. K. Foong, Siddharth Swaroop, and Carl E. Rasmussen for helpful discussions. SWO acknowledges the support of the Gates Cambridge Trust in funding his doctoral studies.

References

  • [1] L. Aitchison (2019) Why bigger is not always better: on finite and infinite neural networks. arXiv preprint arXiv:1910.08013. Cited by: §3.1, §4.2, §4.2.
  • [2] S. Amari, H. Park, and T. Ozeki (2006) Singularities affect dynamics of learning in neuromanifolds. Neural computation 18 (5), pp. 1007–1065. Cited by: §1.
  • [3] M. S. Bartlett (1933) On the theory of statistical regression. Proceedings of the Royal Society of Edinburgh 53, pp. 260–283. Cited by: Appendix B.
  • [4] C. M. Bishop (1995) Neural networks for pattern recognition. Oxford University Press. Cited by: §1.
  • [5] K. Blomqvist, S. Kaski, and M. Heinonen (2018) Deep convolutional gaussian processes. arXiv preprint arXiv:1810.03052. Cited by: §4.5.
  • [6] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra (2015) Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424. Cited by: Appendix F, §1.
  • [7] D. Chafaï (2015-10) Bartlett decomposition and other factorizations. External Links: Link Cited by: Appendix B.
  • [8] A. Damianou and N. Lawrence (2013) Deep gaussian processes. In Artificial Intelligence and Statistics, pp. 207–215. Cited by: §1.
  • [9] V. Dutordoir, M. van der Wilk, A. Artemev, M. Tomczak, and J. Hensman (2019) Translation insensitivity for deep convolutional gaussian processes. arXiv preprint arXiv:1902.05888. Cited by: §4.5.
  • [10] S. Farquhar, L. Smith, and Y. Gal (2020) Try depth instead of weight correlations: mean-field is a less restrictive assumption for deeper networks. arXiv preprint arXiv:2002.03704. Cited by: Appendix F.
  • [11] A. Y. Foong, D. R. Burt, Y. Li, and R. E. Turner (2019) Pathologies of factorised gaussian and mc dropout posteriors in bayesian neural networks. arXiv preprint arXiv:1909.00719. Cited by: §4.1.
  • [12] A. Y. Foong, Y. Li, J. M. Hernández-Lobato, and R. E. Turner (2019) ’In-between’ uncertainty in Bayesian neural networks. arXiv preprint arXiv:1906.11537. Cited by: §4.1.
  • [13] Y. Gal and Z. Ghahramani (2015) Dropout as a bayesian approximation: representing model uncertainty in deep learning. arXiv preprint arXiv:1506.02142. Cited by: §4.3.
  • [14] A. Graves (2011) Practical variational inference for neural networks. In Advances in neural information processing systems, pp. 2348–2356. Cited by: §1.
  • [15] K. He, X. Zhang, S. Ren, and J. Sun (2015) Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034. Cited by: §3.4.
  • [16] J. Hensman, A. Matthews, and Z. Ghahramani (2015) Scalable variational gaussian process classification. JMLR. Cited by: §1.
  • [17] J. M. Hernández-Lobato and R. Adams (2015) Probabilistic backpropagation for scalable learning of Bayesian neural networks. In International Conference on Machine Learning, pp. 1861–1869. Cited by: §4.1, §4.3.
  • [18] G. E. Hinton and D. Van Camp (1993) Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the Sixth Annual Conference on Computational Learning Theory, pp. 5–13. Cited by: §1.
  • [19] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul (1999) An introduction to variational methods for graphical models. Machine learning 37 (2), pp. 183–233. Cited by: §1.
  • [20] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: Appendix C.
  • [21] D. P. Kingma, T. Salimans, and M. Welling (2015) Variational dropout and the local reparameterization trick. In Advances in neural information processing systems, pp. 2575–2583. Cited by: Appendix D, §4.3.
  • [22] D. Krueger, C. Huang, R. Islam, R. Turner, A. Lacoste, and A. Courville (2017) Bayesian hypernetworks. arXiv preprint arXiv:1710.04759. Cited by: §1.
  • [23] J. Lee, Y. Bahri, R. Novak, S. S. Schoenholz, J. Pennington, and J. Sohl-Dickstein (2017) Deep neural networks as gaussian processes. arXiv preprint arXiv:1711.00165. Cited by: §4.2.
  • [24] J. Lee, L. Xiao, S. Schoenholz, Y. Bahri, R. Novak, J. Sohl-Dickstein, and J. Pennington (2019) Wide neural networks of any depth evolve as linear models under gradient descent. In Advances in neural information processing systems, pp. 8570–8581. Cited by: §4.2.
  • [25] Z. Li, R. Wang, D. Yu, S. S. Du, W. Hu, R. Salakhutdinov, and S. Arora (2019) Enhanced convolutional neural tangent kernels. arXiv preprint arXiv:1911.00809. Cited by: §4.4, Table 1.
  • [26] C. Louizos and M. Welling (2016) Structured and efficient variational deep learning with matrix gaussian posteriors. In International Conference on Machine Learning, pp. 1708–1716. Cited by: §1, §3.1, §3.1.
  • [27] C. Louizos and M. Welling (2017) Multiplicative normalizing flows for variational bayesian neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2218–2227. Cited by: §1.
  • [28] A. G. d. G. Matthews, J. Hensman, R. Turner, and Z. Ghahramani (2016) On sparse variational methods and the Kullback-Leibler divergence between stochastic processes. In Artificial Intelligence and Statistics, pp. 231–239. Cited by: §1.
  • [29] A. G. d. G. Matthews, M. Rowland, J. Hron, R. E. Turner, and Z. Ghahramani (2018) Gaussian process behaviour in wide deep neural networks. arXiv preprint arXiv:1804.11271. Cited by: §4.2.
  • [30] R. M. Neal (1996) Priors for infinite networks. In Bayesian Learning for Neural Networks, pp. 29–53. Cited by: §3.4.
  • [31] A. Panos, P. Dellaportas, and M. K. Titsias (2018) Fully scalable gaussian processes using subspace inducing inputs. arXiv preprint arXiv:1807.02537. Cited by: §4.5.
  • [32] N. Pawlowski, A. Brock, M. C. Lee, M. Rajchl, and B. Glocker (2017) Implicit weight uncertainty in neural networks. arXiv preprint arXiv:1711.01297. Cited by: §1.
  • [33] H. Ritter, A. Botev, and D. Barber (2018) A scalable laplace approximation for neural networks. In 6th International Conference on Learning Representations, ICLR 2018-Conference Track Proceedings, Vol. 6. Cited by: §1.
  • [34] H. Salimbeni and M. Deisenroth (2017) Doubly stochastic variational inference for deep Gaussian processes. In Advances in Neural Information Processing Systems, pp. 4588–4599. Cited by: Appendix E, Appendix E, Appendix E, Table 5, Table 6, Table 7, §1, §2.
  • [35] J. Shi, M. K. Titsias, and A. Mnih (2019) Sparse orthogonal variational inference for gaussian processes. arXiv preprint arXiv:1910.10596. Cited by: §4.4, §4.5, Table 1.
  • [36] K. Shridhar, F. Laumann, and M. Liwicki (2019) A comprehensive guide to Bayesian convolutional neural network with variational inference. arXiv preprint arXiv:1901.02731. Cited by: Table 1.
  • [37] E. Snelson and Z. Ghahramani (2006) Sparse gaussian processes using pseudo-inputs. In Advances in neural information processing systems, pp. 1257–1264. Cited by: §1.
  • [38] H. J. Sussmann (1992) Uniqueness of the weights for minimal feedforward nets with a given input-output map. Neural networks 5 (4), pp. 589–593. Cited by: §1.
  • [39] T. Tieleman and G. Hinton (2012) Lecture 6.5—RMSProp: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning 4 (2), pp. 26–31. Cited by: Appendix C.
  • [40] M. Titsias (2009) Variational learning of inducing variables in sparse gaussian processes. In Artificial Intelligence and Statistics, pp. 567–574. Cited by: §1.
  • [41] I. Ustyuzhaninov, I. Kazlauskaite, M. Kaiser, E. Bodin, N. D. Campbell, and C. H. Ek (2019) Compositional uncertainty in deep gaussian processes. arXiv preprint arXiv:1909.07698. Cited by: §4.5.
  • [42] M. Van der Wilk, C. E. Rasmussen, and J. Hensman (2017) Convolutional gaussian processes. In Advances in Neural Information Processing Systems, pp. 2849–2858. Cited by: §4.5.
  • [43] F. Wenzel, K. Roth, B. S. Veeling, J. Świątkowski, L. Tran, S. Mandt, J. Snoek, T. Salimans, R. Jenatton, and S. Nowozin (2020) How good is the bayes posterior in deep neural networks really?. arXiv preprint arXiv:2002.02405. Cited by: §4.4, Table 1.
  • [44] J. Yao, W. Pan, S. Ghosh, and F. Doshi-Velez (2019) Quality of uncertainty quantification for bayesian neural network inference. arXiv preprint arXiv:1906.09686. Cited by: §4.1.

Appendix A Efficient convolutional linear regression

Working in one dimension for simplicity, the standard form for a convolution in deep learning is

$$Y_{i,\mu,c'} = \sum_{\delta, c} W_{\delta, c, c'}\, X_{i,\mu+\delta,c}, \qquad (22)$$

where $X$ is the input image/feature-map, $Y$ is the output feature-map, $W$ is the convolutional weights, $i$ indexes images, $c$ and $c'$ index channels, $\mu$ indexes the location within the image, and $\delta$ indexes the location within the convolutional patch. Later, we will swap the identity of the “patch location” and the “image location”, and to facilitate this, we define them both to be centred on zero,

$$\mu \in \left\{-\tfrac{S-1}{2}, \dots, \tfrac{S-1}{2}\right\}, \qquad \delta \in \left\{-\tfrac{K-1}{2}, \dots, \tfrac{K-1}{2}\right\}, \qquad (23)$$

where $S$ is the size of an image and $K$ is the size of a patch, such that, for example, for a size-3 kernel, $\delta \in \{-1, 0, 1\}$.

To use the expressions for the fully-connected case, we can form a new input by cutting out each image patch,

(24)

thus,

(25)

Note that we have used commas to group pairs of indices ($i$ and $\mu$, and $\delta$ and $c$) that should be viewed together. Indeed, combining $i$ and $\mu$ into a single index and combining $\delta$ and $c$ into a single index, this expression can be viewed as standard matrix multiplication,

(26)

Finally, in our case note that we take,

(27)

While we can explicitly compute this matrix by extracting image patches, doing so imposes a very large memory cost (a factor of 9 for a 3×3 kernel with stride 1), because there are roughly as many patches as pixels in the image, and a patch requires 9 times the storage of a pixel. To implement convolutional linear regression with a more manageable memory cost, we can instead compute the matrices required for linear regression directly as convolutions of the input feature-maps with themselves, and as convolutions of the input and output feature-maps, which we describe here.

For linear regression (Eq. 15), we first need to compute,

(28)
rewriting this in terms of the original feature map $X$ (i.e. without explicitly cutting out image patches), we obtain,
(29)

This can be viewed directly as a convolution of $X$ with $X$, where we treat $X$ as the “convolutional weights”, $\delta$ as the location within the now very large “convolutional patch”, and $\mu$ as the location in the resulting output. Once we realise that the computation is a spatial convolution, it is possible to fit it into the standard convolution functions provided by deep-learning frameworks, albeit with some rearrangement of the tensors.
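To make the autocorrelation idea concrete, the sketch below computes, for every pair of channels, the correlation of the feature map with a shifted copy of itself over all offsets up to K−1, by passing the feature map to conv2d as the “convolutional weights”. This is only the building block: assembling the actual regression matrices from these lag maps (and the choice of zero versus circular padding) is not shown, and all sizes are illustrative.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
I, C, H, W, K = 4, 2, 8, 8, 3                    # images, channels, spatial size, kernel size
X = torch.randn(I, C, H, W)                      # feature map at the inducing points

pad = K - 1
Xp = F.pad(X, (pad, pad, pad, pad))              # zero padding here; circular padding would give
                                                 # the exact translation invariance used for the
                                                 # cross-term in the text
lags = torch.zeros(C, C, 2 * K - 1, 2 * K - 1)
for c in range(C):
    for cp in range(C):
        # Treat channel cp of each image as a full-image "kernel" and slide it over the
        # padded channel c; groups=I keeps images separate, and we then sum over images.
        out = F.conv2d(Xp[:, c].reshape(1, I, H + 2 * pad, W + 2 * pad),
                       X[:, cp].reshape(I, 1, H, W), groups=I)
        lags[c, cp] = out.sum(dim=1)[0]

# lags[c, cp, a, b] = sum over i, mu of X[i, c, mu + (a - pad, b - pad)] * X[i, cp, mu],
# and is only C^2 * (2K-1)^2 numbers, rather than a full set of explicitly extracted patches.
print(lags.shape)                                # torch.Size([2, 2, 5, 5])
```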

Next, we need to compute,

(30)
again, rewriting this in terms of the original feature map $X$ (i.e. without explicitly cutting out image patches), we obtain,
(31)

To treat this as a convolution, we first need exact translational invariance, which can be achieved by using circular boundary conditions. Note that circular boundary conditions are not typically used in neural networks for images, and we therefore only use circular boundary conditions to define the approximate posterior over weights. The variational framework does not restrict us to also using circular boundary conditions within our feedforward network, and as such, we instead use standard zero-padding. With exact translational invariance, we can write this expression directly as a convolution,

(32)
where
(33)

i.e. the set of offsets for a size-$K$ kernel; as before, we treat the feature map as the “convolutional weights”, the offset as the location within the “convolutional patch”, and the remaining index as the location in the resulting output “feature-map”.

Finally, note that this form offers considerable benefits in terms of memory consumption. In particular, the output matrices are usually quite small — the number of channels and the number of locations within a patch are both typically modest — giving a very manageable total size.

Appendix B Wishart distributions with real-valued degrees of freedom

The classical description of the Wishart distribution,

(34)

where the sample is a positive (semi-)definite matrix, states that the degrees of freedom is an integer, and that a sample can be generated by taking the product of matrices whose entries are generated i.i.d. from a standard Gaussian,

(35)

However, for the purposes of defining learnable approximate posteriors, we need to be able to sample, and to evaluate the probability density, when the degrees of freedom is a positive real number.

To do this, consider the alternative, much more efficient means of sampling from a Wishart distribution using the Bartlett decomposition [3]. The Bartlett decomposition gives the probability density for the Cholesky factor of a Wishart sample. In particular,

(36)
(37)
(38)

Here, the squared diagonal elements are usually considered to be sampled from a chi-squared distribution, but we have generalised this slightly, using the equivalent Gamma distribution, to allow for real-valued degrees of freedom. Following Chafaï [7], we need to change variables so as to work with the diagonal elements themselves rather than their squares,

(39)
(40)
(41)
(42)

Thus, the probability density for the Cholesky factor under the Bartlett sampling operation is,

(43)

To convert this to a distribution on the Wishart sample itself, we need the volume element for the transformation from the Cholesky factor to the full matrix,

(45)

which can be obtained directly by computing the log-determinant of the Jacobian for the transformation from the Cholesky factor to the full matrix, or by taking the ratio of Eq. (43) and the usual Wishart probability density. Thus,

(46)
(47)
breaking this down into separate components and performing straightforward algebraic manipulations,
(48)
(49)
(50)
as the index runs from $1$ to $p$, we can replace $i$ with $p - i + 1$, which runs over the same values,
(51)
(52)

Finally, using the definition of the multivariate Gamma function,

(53)

We get back the probability density for the standard Wishart,

(54)
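A sketch of the corresponding sampler: the Bartlett construction with the squared diagonal drawn from a Gamma distribution, which remains valid for any real degrees of freedom ν > p − 1. The scale matrix and ν below are placeholder values, and the log-density needed for the ELBO would be implemented from the expressions above.

```python
import torch
from torch.distributions import Gamma

def sample_wishart(scale_chol, nu):
    # Bartlett construction: W = (L A)(L A)^T with Sigma = L L^T, A lower-triangular,
    # standard-normal strict lower triangle, and A_ii^2 ~ Gamma((nu - i + 1) / 2, 1/2),
    # which generalises the usual chi-squared diagonal to real-valued nu.
    p = scale_chol.shape[0]
    A = torch.randn(p, p).tril(-1)
    shape = (nu - torch.arange(p, dtype=torch.float)) / 2.0   # (nu - i + 1)/2 for i = 1..p
    A = A + torch.diag(Gamma(shape, torch.full((p,), 0.5)).sample().sqrt())
    LA = scale_chol @ A
    return LA @ LA.T

Sigma = torch.eye(3) + 0.2                                    # example positive-definite scale
W = sample_wishart(torch.linalg.cholesky(Sigma), nu=torch.tensor(3.7))
print(W)
```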

Appendix C Parameter scaling for ADAM

The standard optimizer for variational BNNs is ADAM [20], which we also use. For simplicity, consider the similar RMSprop updates [39],

(55)

where the expectation over the squared gradients is approximated using a moving average of past gradients. Thus, absolute parameter changes will be of the order of the learning rate. This is fine if all the parameters have roughly the same order of magnitude, but becomes a serious problem if some of the parameters are very large and others are very small. For instance, if a parameter is much smaller than the learning rate, then a single ADAM step can easily double the parameter estimate, or change it from positive to negative. In contrast, if a parameter is much larger than the learning rate, then ADAM can only make proportionally much smaller changes to it. Thus, we need to ensure that all of our parameters have the same scale, especially as we mix methods, such as combining factorised and global inducing points. We therefore design all our new approximate posteriors (i.e. the inducing inputs and outputs) such that the parameters have a scale of around one. The key issue is that the mean weights in factorised methods tend to be quite small. To resolve this issue, we store scaled weights, and we divide these stored, scaled mean parameters by the fan-in as part of the forward pass,

$$\text{weights} = \frac{\text{scaled weights}}{\text{fan-in}}. \qquad (56)$$

This scaling forces us to use larger learning rates than are typically used.
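A minimal sketch of this reparameterisation for a single factorised-Gaussian layer: the stored mean parameters are kept at order one and divided by the fan-in inside the forward pass, so that the optimizer sees parameters of comparable magnitude. The treatment of the variances is an assumption for illustration.

```python
import torch
import torch.nn as nn

class ScaledMeanLinear(nn.Module):
    # Store "scaled" mean weights of order one and divide by the fan-in in the
    # forward pass (Eq. 56), so all parameters seen by the optimizer have a
    # similar scale.  Factorised-Gaussian posterior, biases omitted.
    def __init__(self, fan_in, fan_out):
        super().__init__()
        self.fan_in = fan_in
        self.scaled_mean = nn.Parameter(torch.randn(fan_in, fan_out))   # O(1) parameters
        self.log_std = nn.Parameter(torch.full((fan_in, fan_out), -3.0))

    def forward(self, x):
        mean = self.scaled_mean / self.fan_in                   # actual weight means are small
        W = mean + self.log_std.exp() * torch.randn_like(mean)  # reparameterised weight sample
        return x @ W

layer = ScaledMeanLinear(50, 50)
print(layer(torch.randn(8, 50)).shape)                          # torch.Size([8, 50])
```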

Appendix D UCI results with Bayesian neural networks

We performed a grid search to select the learning rate and minibatch size. For the fully factorised approximation we optimized for 25,000 gradient steps, and for the other methods we optimized for 5,000 gradient steps. For all methods we selected the hyperparameters that gave the best ELBO. Finally, note that for the fully factorised method we used the local reparameterization trick [21]; however, for fac gi we cannot do so, because the inducing point methods require that covariances be propagated through the network correctly.

The ELBOs and RMSEs for BNNs applied to UCI datasets are given in Fig. 4. Note that the ELBOs for global inducing methods are almost always better than those for baseline methods, often by a very large margin.

Figure 4: Test RMSEs and ELBOs per datapoint for UCI datasets.

Appendix E UCI results with deep Gaussian processes

Here, we attempted to match the setup in Salimbeni and Deisenroth [34] reasonably closely. In particular, we used 100 inducing points, and full-covariance observation noise. However, our parameterisation was still somewhat different from theirs, if nothing else because our approximate posterior was defined in terms of noisy function-values, while their approximate posterior was defined in terms of the function-values themselves.

As the original results in Salimbeni and Deisenroth [34] used different UCI-splits, and did not provide the ELBO, we reran their code https://github.com/ICL-SML/Doubly-Stochastic-DGP (changing the number of epochs and noise variance to reflect the values in the paper), which gave very similar log-likelihoods to those in the paper.

Our results show significant improvements over Salimbeni and Deisenroth [34] and over local inducing points if we consider the ELBO (Fig. 5; except for kin8nm, where local inducing appears to do slightly better). However, the picture is far less clear if we look at the test log-likelihood, which is expected because the test log-likelihood bears an unclear relationship to the quantity we are actually optimizing, the ELBO. For instance, if we look at smaller datasets such as energy, for which these patterns are most pronounced (Fig. 6), we find that the test log-likelihood rises initially, then falls, while the ELBO continues to increase. This suggests that our method's greater capacity for modelling uncertainty can sometimes be detrimental to improving the log-likelihood.

Figure 5: Average test log likelihoods and ELBOs per datapoint for UCI datasets. The numbers indicate the depths of the models.
Figure 6: ELBO and test-log-likelihood for energy UCI dataset across training. Red is global-inducing, black is local-inducing.

Appendix F MNIST 500

For MNIST, we considered a LeNet-inspired model consisting of two conv2d-relu-maxpool blocks, followed by conv2d-relu-linear, where the convolutions all have 64 channels. We trained all models using the same learning rate.

When training on very small datasets, such as the first 500 training examples of MNIST, we see a variety of pathologies emerge with standard methods. The key thing to remember for these datasets is that there is a meaningful sanity check for the ELBO. In particular, we could imagine a model that sets the distribution over all lower-layer parameters equal to the prior, and sets the top-layer parameters so as to ensure that the predictions are uniform. With 10 classes, this results in a test log-likelihood of −log 10 ≈ −2.30, and an ELBO of around (but not exactly) −log 10 per datapoint. We found that many combinations of the approximate posterior and prior converged to something like this baseline. Indeed, the only approximate posterior to escape this baseline for ScalePrior and SpatialIWPrior was global inducing points. This is because ScalePrior and SpatialIWPrior both offer the flexibility to shrink the prior variance, and hence shrink the weights towards zero, giving uniform predictions and a potentially zero KL-divergence. In contrast, NealPrior and StandardPrior do not offer this flexibility: you always have to pay something in KL-divergence in order to give uniform predictions. We believe that this is the reason that factorised performs better than expected with NealPrior, despite having an ELBO that is close to the baseline. The test log-likelihood and accuracy obtained by local inducing are harder to explain, despite it having an ELBO similar to factorised. For StandardPrior, all the ELBOs are far lower than the baseline, and far lower than those obtained with the other priors. Despite this, factorised and fac gi in combination with StandardPrior appear to transiently perform better in terms of predictive accuracy than any other method. These results should sound a note of caution whenever we try to use factorised approximate posteriors with fixed prior covariances [e.g. 6, 10].

Figure 7: The ELBO, test log-likelihoods and classification accuracy with different priors and approximate posteriors on a reduced MNIST dataset consisting of only the first 500 training examples.

We optimized the model parameters for 1000 epochs with 500 datapoints per minibatch.

Appendix G Tables of UCI Results

factorised    local inducing    global inducing    fac gi
boston - -2.74 0.03 -4.25 0.00 -2.66 0.03
NealPrior -2.76 0.04 -3.04 0.03 -2.72 0.05
ScalePrior -3.63 0.03 -3.46 0.04 -2.67 0.04
concrete - -3.17 0.02 -4.43 0.00 -3.26 0.01
NealPrior -3.21 0.01 -3.54 0.01 -3.29 0.01
ScalePrior -3.89 0.09 -3.55 0.01 -3.27 0.01
energy - -3.94 0.01 -0.82 0.02 -0.78 0.02
NealPrior -0.79 0.02 -2.58 0.01 -2.33 0.02
ScalePrior -2.55 0.01 -2.67 0.01 -0.81 0.02
kin8nm - 1.24 0.01 -0.17 0.00 1.22 0.01
NealPrior 1.26 0.01 1.19 0.01 1.24 0.01
ScalePrior 1.23 0.01 1.18 0.01 1.23 0.01
naval - 2.52 0.00 6.80 0.03 6.90 0.04
NealPrior 5.98 0.13 6.71 0.04 6.96 0.02
ScalePrior 6.99 0.03 2.80 0.00 6.89 0.05
power - -4.13 0.01 -2.84 0.01
NealPrior -2.85 0.01 -2.85 0.01 -2.82 0.01
ScalePrior -2.86 0.01 -2.85 0.01
protein - -3.35 0.00 -2.90 0.01 -2.87 0.00
NealPrior -2.89 0.00 -2.88 0.00 -2.87 0.00
ScalePrior -2.90 0.00 -2.88 0.00
wine - -0.98 0.01 -1.41 0.00
NealPrior -0.99 0.01 -1.02 0.01 -0.97 0.01
ScalePrior -1.22 0.01 -1.09 0.03
yacht - -1.41 0.05 -4.54 0.00 -1.53 0.05
NealPrior -1.58 0.04 -2.94 0.01 -1.15 0.01
ScalePrior -4.12 0.03 -4.12 0.03 -1.25 0.02
Table 2: Average test log likelihoods in nats for BNNs on UCI datasets (errors are 1 standard error)
factorised    local inducing    global inducing    fac gi
boston - 3.60 0.21 8.92 0.30 3.25 0.22
NealPrior 3.64 0.24 4.79 0.25 3.58 0.24
ScalePrior 9.03 0.26 7.42 0.37
concrete - 5.73 0.11 16.55 0.16 6.08 0.10
NealPrior 5.96 0.11 7.70 0.10 6.43 0.09
ScalePrior 12.66 1.04 7.85 0.10 6.25 0.10
energy - 0.51 0.01 10.32 0.09 0.51 0.02
NealPrior 0.51 0.01 2.89 0.06 2.48 0.05
ScalePrior 3.02 0.05 3.14 0.06 0.51 0.01
kin8nm - 0.26 0.00
NealPrior
ScalePrior
naval - 0.01 0.00
NealPrior
ScalePrior
power - 4.00 0.03 11.76 0.33 4.14 0.03
NealPrior 4.17 0.04 4.18 0.03
ScalePrior 4.06 0.04 4.22 0.03 4.16 0.03
protein - 6.14 0.01 4.39 0.02 4.26 0.02
NealPrior 4.38 0.02 4.33 0.00 4.25 0.02
ScalePrior 4.40 0.01 4.30 0.01 4.23 0.01
wine - 0.65 0.01 0.81 0.01
NealPrior 0.66 0.01 0.67 0.01
ScalePrior 0.82 0.01 0.72 0.02
yacht - 0.98 0.07 14.38 0.60 0.88 0.05
NealPrior 1.15 0.07 3.71 0.21 0.69 0.05
ScalePrior 14.55 0.59 14.50