dlwithbayes
Contains code for the NeurIPS 2019 paper "Practical Deep Learning with Bayesian Principles"
Bayesian methods promise to fix many shortcomings of deep learning, but they are impractical and rarely match the performance of standard methods, let alone improve them. In this paper, we demonstrate practical training of deep networks with natural-gradient variational inference. By applying techniques such as batch normalisation, data augmentation, and distributed training, we achieve similar performance in about the same number of epochs as the Adam optimiser, even on large datasets such as ImageNet. Importantly, the benefits of Bayesian principles are preserved: predictive probabilities are well-calibrated and uncertainties on out-of-distribution data are improved. This work enables practical deep learning while preserving benefits of Bayesian principles. A PyTorch implementation will be available as a plug-and-play optimiser.
PyTorch-SSO: Scalable Second-Order methods in PyTorch
Deep learning has been extremely successful in many fields such as computer vision
(Krizhevsky et al., 2012), speech processing (Hinton et al., 2012), and natural-language processing
(Mikolov et al., 2013), but it is also plagued with several issues that make its application difficult in many other fields. For example, it requires a large amount of high-quality data and it can overfit when the dataset is small. Similarly, sequential learning can cause forgetting of past knowledge (Kirkpatrick et al., 2017), and a lack of reliable confidence estimates and other robustness issues can make it vulnerable to adversarial attacks
(Bradshaw et al., 2017). Ultimately, due to such issues, the application of deep learning remains challenging, especially for applications where human lives are at risk. Bayesian principles have the potential to address such issues. For example, we can represent uncertainty using the posterior distribution, enable sequential learning using Bayes' rule, and reduce overfitting with Bayesian model averaging (Hoeting et al., 1999)
. The use of such Bayesian principles for neural networks has been advocated from very early on. Bayesian inference methods for neural networks were proposed as early as the late 80s and 90s, e.g., using MCMC methods
(Neal, 1995), Laplace's method (MacKay, 1991), and variational inference (VI) (Hinton and Van Camp, 1993; Barber and Bishop, 1998; Saul et al., 1996; Anderson and Peterson, 1987). Benefits of Bayesian principles are even discussed in machine-learning textbooks
(MacKay, 2003; Bishop, 2006). Despite this, they are rarely employed in practice, mainly due to computational concerns that unfortunately overshadow their theoretical advantages.

The difficulty lies in the computation of the posterior distribution, which is especially challenging for deep learning. Even approximate methods, such as VI and MCMC, have historically been difficult to scale to large datasets such as ImageNet (Russakovsky et al., 2015). Due to this, it is common to use less principled approximations, such as MC-dropout (Gal and Ghahramani, 2016b), even though they are not ideal when it comes to fixing the issues of deep learning. For example, MC-dropout is unsuitable for continual learning (Kirkpatrick et al., 2017) since its posterior approximation does not have mass over the whole weight space. It has also been found to perform poorly for sequential decision making (Riquelme et al., 2018). The form of the approximation used by such methods is usually rigid and cannot easily be improved, e.g., to other forms such as a mixture of Gaussians. The goal of this paper is to make more principled Bayesian methods, such as VI, practical for deep learning, thereby helping researchers tackle its key limitations.
We demonstrate practical training of deep networks by using recently proposed natural-gradient VI methods. These methods resemble the Adam optimiser, enabling us to leverage existing techniques for initialisation, momentum, batch normalisation, data augmentation, and distributed training. As a result, we obtain similar performance in about the same number of epochs as Adam when training many popular deep networks (e.g., LeNet-5, AlexNet, ResNet-18) on datasets such as CIFAR-10 and ImageNet. See Fig. 1 for ImageNet. The results show that, despite using an approximate posterior, the training methods preserve the benefits of Bayesian principles. Compared to standard deep-learning methods, the predictive probabilities are well-calibrated and uncertainties on out-of-distribution inputs are improved. Our work shows that practical deep learning is possible with Bayesian methods and aims to support further research in this area.
Related work. Previous VI methods, notably by Graves (2011) and Blundell et al. (2015)
, require significant implementation and tuning effort to perform well, e.g., on convolutional neural networks (CNNs). Slow convergence has also been found problematic for sequential problems
(Riquelme et al., 2018). There appear to be no reported results with complex networks on large problems, such as ImageNet. Our work solves these issues by borrowing deep-learning techniques and applying them to natural-gradient VI (Khan et al., 2018; Zhang et al., 2018).

In their paper, Zhang et al. (2018) also employed data augmentation and batch normalisation for a natural-gradient method called Noisy K-FAC (see Appendix A) and showed results for VGG on CIFAR-10. However, a mean-field method called Noisy Adam was found to be unstable with batch normalisation. In contrast, we show that a similar method, called Variational Online Gauss-Newton (VOGN), proposed by Khan et al. (2018), works well with such techniques. We show results for distributed training with Noisy K-FAC on ImageNet, but do not provide extensive comparisons since we find it difficult to tune. Many of our techniques can also be used to speed up Noisy K-FAC, which is promising.
Many other approaches have recently been proposed to compute posterior approximations by training deterministic networks (Ritter et al., 2018; Maddox et al., 2019; Mandt et al., 2017). Similarly to MC-dropout, their posterior approximations are not flexible, making it difficult to improve their accuracy. VI, on the other hand, offers a much more flexible alternative for applying Bayesian principles to deep learning.
The success of deep learning is partly due to the availability of scalable and practical methods for training deep neural networks (DNNs). Network training is formulated as an optimisation problem where a loss between the data and the DNN’s predictions is minimised. For example, in a supervised learning task with a dataset
of inputs $x_i$ and corresponding outputs $y_i$ of length $N$, we minimise a loss of the following form:

$$\bar{\ell}(w) := \frac{1}{N} \sum_{i=1}^{N} \ell(y_i, f_w(x_i)) + \delta\, w^\top w,$$

where $f_w(x)$ denotes the DNN outputs with weights $w$, $\ell(y, f)$ denotes a differentiable loss function between an output $y$ and its prediction $f$, and $\delta\, w^\top w$ is the regulariser.† (†This regulariser is sometimes set to 0 or a very small value.)

Deep learning relies on stochastic-gradient (SG) methods to minimise such loss functions. The most commonly used optimisers, such as stochastic gradient descent (SGD), RMSprop (Tieleman and Hinton, 2012), and Adam (Kingma and Ba, 2015), take the following form† (†Alternate versions with weight decay and momentum differ from this update (Loshchilov and Hutter, 2019). We present a form useful to establish the connection between SG methods and natural-gradient VI.), where all operations are element-wise:

$$w_{t+1} \leftarrow w_t - \alpha_t\, \frac{\hat{g}(w_t) + \delta w_t}{\sqrt{s_{t+1}} + \epsilon}, \qquad s_{t+1} \leftarrow (1-\beta_t)\, s_t + \beta_t\, \hat{g}(w_t)^2, \qquad (1)$$

where $t$ is the iteration, $\alpha_t > 0$ and $\beta_t > 0$ are learning rates, $\epsilon > 0$ is a small scalar, and $\hat{g}(w_t)$ is the stochastic gradient at $w_t$, defined as $\hat{g}(w) := \frac{1}{M} \sum_{i \in \mathcal{M}_t} \nabla_w \ell(y_i, f_w(x_i))$ using a minibatch $\mathcal{M}_t$ of $M$ data examples. This simple update scales extremely well and can be applied to very large problems. With techniques such as initialisation tricks, momentum, weight decay, batch normalisation, and data augmentation, it also achieves good performance for many problems.
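As a concrete illustration, here is a minimal NumPy sketch of this generic SG update applied to a toy quadratic loss; the hyperparameter values and the toy problem are our own choices, not from the paper.

```python
import numpy as np

# Minimal sketch of the generic SG update in (1): an Adam/RMSprop-style
# step with a moving average s of squared gradients. The toy quadratic
# loss 0.5*||w - 1||^2 and all hyperparameter values are illustrative.
def sg_step(w, s, grad, alpha=0.1, beta=0.1, delta=1e-4, eps=1e-8):
    s = (1 - beta) * s + beta * grad ** 2             # moving average of squared gradients
    w = w - alpha * (grad + delta * w) / (np.sqrt(s) + eps)
    return w, s

w, s = np.zeros(3), np.zeros(3)
for _ in range(200):
    grad = w - 1.0                                     # gradient of the toy loss
    w, s = sg_step(w, s, grad)
print(np.round(w, 2))  # close to the minimiser at 1 (shifted slightly by delta)
```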
In contrast, deep learning with Bayesian principles is computationally expensive. The posterior distribution can be obtained using Bayes' rule:

$$p(w \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid w)^{1/\tau}\, p(w)}{p(\mathcal{D})},$$

where $p(\mathcal{D} \mid w) := \prod_{i=1}^{N} p(y_i \mid f_w(x_i))$.† (†This is a tempered posterior (Vovk, 1990) setup, where $\tau$ is adjusted when we expect model misspecification and/or adversarial examples (Ghosal and Van der Vaart, 2017). Setting $\tau = 1$ recovers standard Bayesian inference.) This is costly due to the computation of the marginal likelihood $p(\mathcal{D})$, a high-dimensional integral that is difficult to compute for large networks. Variational inference (VI) is a principled approach to scalably estimate an approximation $q(w)$ to the posterior. The main idea is to employ a parametric approximation, e.g., a Gaussian $q(w) := \mathcal{N}(w \mid \mu, \Sigma)$ with mean $\mu$ and covariance $\Sigma$. The parameters $\mu$ and $\Sigma$ can then be estimated by maximising the evidence lower bound (ELBO):

$$\mathcal{L}(\mu, \Sigma) := \sum_{i=1}^{N} \mathbb{E}_{q}\big[\log p(y_i \mid f_w(x_i))\big] - \tau\, \mathbb{D}_{\mathrm{KL}}\big[q(w)\,\|\,p(w)\big], \qquad (2)$$

where $\mathbb{D}_{\mathrm{KL}}$ denotes the Kullback-Leibler divergence. By using more complex approximations, we can further reduce the approximation error, but at a computational cost. By formulating Bayesian inference as an optimisation problem, VI enables a practical application of Bayesian principles.
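To make the objective concrete, the following is a hedged NumPy sketch of a one-sample Monte-Carlo estimate of an ELBO of this form for a mean-field Gaussian, using the reparameterisation trick. The logistic-regression likelihood, the Gaussian prior N(0, I/delta), and the toy data are illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# One-sample Monte-Carlo ELBO estimate for q(w) = N(mu, diag(sigma^2))
# with prior p(w) = N(0, I/delta). The Bernoulli (logistic) likelihood
# and data below are illustrative assumptions.
def elbo_estimate(mu, log_sigma, X, y, delta=1.0, tau=1.0):
    sigma = np.exp(log_sigma)
    eps = rng.standard_normal(mu.shape)
    w = mu + sigma * eps                                   # reparameterised sample w ~ q
    logits = X @ w
    log_lik = np.sum(y * logits - np.log1p(np.exp(logits)))  # sum_i log p(y_i | f_w(x_i))
    # closed-form KL( N(mu, diag(sigma^2)) || N(0, I/delta) )
    kl = 0.5 * np.sum(delta * (mu ** 2 + sigma ** 2) - 1.0
                      - np.log(delta * sigma ** 2))
    return log_lik - tau * kl

X = rng.standard_normal((20, 3))
y = (X @ np.array([1.0, -1.0, 0.5]) > 0).astype(float)
value = elbo_estimate(np.zeros(3), np.full(3, -1.0), X, y)
```

Maximising this estimate over `mu` and `log_sigma` (e.g., with the SG update in (1)) is exactly the "VI as optimisation" viewpoint described above.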
Despite this, VI remains impractical for training large deep networks on large datasets. Existing methods, such as those of Graves (2011) and Blundell et al. (2015), apply popular SG methods to optimise the ELBO, yet fail to achieve reasonable performance on large problems. This is not surprising: the optimisation objectives for VI and deep learning are fundamentally different, and techniques used in one field do not necessarily lead directly to improvements in the other. However, it would be useful if we could exploit the tricks and techniques of deep learning to boost the performance of VI. The goal of this work is to do just that. We now describe our methods in detail.
In this paper, we propose natural-gradient VI methods for practical deep learning with Bayesian principles. The natural-gradient update takes a simple form when estimating exponential-family approximations (Khan and Nielsen, 2018; Khan and Lin, 2017). When $q(w)$ is an exponential-family distribution with natural parameters $\lambda$, the update of $\lambda$ is performed using the stochastic gradient of the expected regularised loss:

$$\lambda_{t+1} \leftarrow (1 - \tau\rho_t)\,\lambda_t - \rho_t\, \hat{\nabla}_{m}\, \mathbb{E}_{q_t}\big[\bar{\ell}(w)\big], \qquad (3)$$

where $\rho_t > 0$ is the learning rate, and we note that the stochastic gradients are computed with respect to $m$, the expectation parameters of $q$. The moving average above helps to deal with the stochasticity of the gradient estimates, and is very similar to the moving average used in deep learning (see (1)). When $\tau$ is set to 0, the update essentially minimises the regularised loss (see Section 5 in Khan et al. (2018)). These properties of natural-gradient VI make it an ideal candidate for deep learning.
Recent work by Khan et al. (2018) and Zhang et al. (2018) further shows that, when $q$ is Gaussian, the update (3) assumes a form that is strikingly similar to the update (1). For example, the Variational Online Gauss-Newton (VOGN) method of Khan et al. (2018) estimates a Gaussian with mean $\mu_t$ and a diagonal covariance matrix $\Sigma_t$ using the following update:

$$\mu_{t+1} \leftarrow \mu_t - \alpha_t\, \frac{\hat{g}(w_t) + \tilde{\delta}\mu_t}{s_{t+1} + \tilde{\delta}}, \qquad s_{t+1} \leftarrow (1 - \tau\rho_t)\, s_t + \rho_t\, \frac{1}{M} \sum_{i \in \mathcal{M}_t} \big(\nabla_w \ell_i(w_t)\big)^2, \qquad (4)$$

where $w_t \sim \mathcal{N}(w \mid \mu_t, \Sigma_t)$ with $\Sigma_t := \mathrm{diag}\big(1/(N(s_t + \tilde{\delta}))\big)$, $\tilde{\delta}$ is a rescaled regularisation parameter, and $\alpha_t, \rho_t$ are learning rates. Similarly to (1), the vector $s_t$ adapts the learning rate and is updated using a moving average.

A major difference in VOGN is that the update of $s_t$ is now based on a Gauss-Newton approximation (Graves, 2011), which uses $\frac{1}{M}\sum_{i} (\nabla_w \ell_i(w_t))^2$. This is fundamentally different from the SG update in (1), which instead uses the gradient magnitude $\big(\frac{1}{M}\sum_{i} \nabla_w \ell_i(w_t)\big)^2$ (Bottou et al., 2016). The first approach uses the sum outside the square while the second uses it inside. VOGN is therefore a second-order method and, similarly to a Newton method, does not need a square root over $s_{t+1}$, unlike (1). Implementing this step requires an additional calculation (see Appendix B), which makes VOGN a bit slower than Adam, but it is expected to give better variance estimates (see Theorem 1 in Khan et al. (2018)).

The main contribution of this paper is to demonstrate practical training of deep networks using VOGN. Since VOGN takes a similar form to SG methods, we can easily borrow existing deep-learning techniques to improve performance. We will now describe these techniques in detail. Pseudocode for VOGN is shown in Algorithm 1.
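The following NumPy sketch illustrates an update in the spirit of (4) on a toy problem: sample $w$ from the current Gaussian, average the per-example squared gradients (square inside the sum), and update the mean and scale. The toy per-example losses and all hyperparameter values are our own assumptions, not the paper's Algorithm 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# One VOGN-style step as in (4): Gauss-Newton-style scale update from
# *per-example squared* gradients, then a preconditioned mean update.
def vogn_step(mu, s, per_example_grads, N, alpha=0.05, rho=0.1, tau=1.0, delta_t=1e-2):
    g_bar = per_example_grads.mean(axis=0)                    # (1/M) sum_i g_i
    s = (1 - tau * rho) * s + rho * (per_example_grads ** 2).mean(axis=0)
    mu = mu - alpha * (g_bar + delta_t * mu) / (s + delta_t)
    return mu, s

# toy problem: per-example losses 0.5*||w - t_i||^2 with random targets t_i
N, D = 100, 3
targets = rng.standard_normal((N, D)) + 1.0
mu, s = np.zeros(D), np.ones(D)
for _ in range(500):
    idx = rng.choice(N, size=10, replace=False)               # minibatch of M = 10
    sigma = 1.0 / np.sqrt(N * (s + 1e-2))                     # diagonal posterior scale
    w = mu + sigma * rng.standard_normal(D)                   # w ~ N(mu, diag(sigma^2))
    g = w - targets[idx]                                      # per-example gradients
    mu, s = vogn_step(mu, s, g, N)
```

After training, `mu` sits close to the mean of the targets, while `s` tracks the average squared gradient and hence the posterior precision.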
Batch normalisation: Batch normalisation (Ioffe and Szegedy, 2015)
has been found to significantly speed up and stabilise the training of neural networks, and is widely used in deep learning. BatchNorm layers are inserted between neural-network layers. They help stabilise each layer's input distribution by normalising with a running average of the inputs' mean and variance. In our VOGN implementation, we simply use the existing implementation with default hyperparameter settings. Following Goyal et al. (2017), we do not apply L2 regularisation and weight decay to the BatchNorm parameters, nor do we maintain uncertainty over them. This straightforward application of batch normalisation works for VOGN.

Data augmentation: When training on image datasets, data augmentation (DA) techniques can improve performance drastically (Goyal et al., 2017). We consider two common real-time data augmentation techniques: random cropping and horizontal flipping. After randomly selecting a minibatch at each iteration, we use a randomly selected cropped version of all of its images, and each image in the minibatch has a chance of being horizontally flipped.
We find that directly applying DA gives slightly worse performance than expected, and also affects the calibration of the resulting uncertainty. However, DA increases the effective sample size. We therefore replace $N$ by an effective dataset size $\rho N$, where $\rho \geq 1$, improving performance (see step 2 in Algorithm 1). The reason for this performance boost might be the complex relationship between the regularisation $\delta$ and $N$. For the regularised loss $\bar{\ell}(w)$, the two are unidentifiable, i.e., we can scale the loss term by a constant and reduce $\delta$ by the same constant without changing the minimum. However, in a Bayesian setting (as in (2)), the two quantities are separate, and therefore changing the data might also change the optimal prior variance hyperparameter in a complicated way. This needs further theoretical investigation, but our simple fix of scaling $N$ seems to work well in the experiments.
We set $\rho$ by considering the specific DA techniques used. When training on CIFAR-10, the random-cropping DA step involves first padding the 32x32 images to size 40x40, and then taking randomly selected 28x28 crops. We consider this as effectively increasing the dataset size by a factor of 5 (4 images for the corners and one central image). The horizontal-flipping DA step doubles the dataset size (one dataset of unflipped images, one of flipped images). Combined, this gives $\rho = 10$. Similar arguments give the value of $\rho$ for the ImageNet DA techniques. Even though $\rho$ is another hyperparameter to set, we find that its precise value does not matter much. Typically, after setting an estimate for $\rho$, tuning $\delta$ a little seems to work well (see Appendix E).

Momentum and initialisation: It is well known that both momentum and good initialisation can improve the speed of convergence of SG methods in deep learning (Sutskever et al., 2013). Since VOGN is similar to Adam, we can implement momentum in a similar way; this is shown in step 17 of Algorithm 1, which also shows the momentum rate. We initialise the mean $\mu$ in the same way the weights are initialised in Adam (we use init.xavier_normal in PyTorch (Glorot and Bengio, 2010)). For the momentum term, we use the same initialisation as Adam (initialised to 0). VOGN requires an additional initialisation for the variance. For this, we first run a pass through the first minibatch, calculate the average of the squared gradients, and initialise the scale $s_0$ with it (see step 1 in Algorithm 1), which in turn determines the initial variance. For the tempering parameter $\tau$, we use a schedule where it is increased from a small value (e.g., 0.1) to 1. With these initialisation tricks, VOGN is able to mimic the convergence behaviour of Adam in the beginning.
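The scale initialisation described above can be sketched as follows; the linear-regression minibatch, the Xavier-style mean initialisation, and all constants are illustrative assumptions rather than the paper's exact Algorithm 1.

```python
import numpy as np

rng = np.random.default_rng(1)

# Sketch of the scale initialisation (step 1 of Algorithm 1): one pass
# over the first minibatch, average the squared per-example gradients,
# and use the result as s_0, which determines the initial variance.
X = rng.standard_normal((32, 5))                   # illustrative first minibatch
y = rng.standard_normal(32)
mu0 = rng.standard_normal(5) * 0.1                 # small Xavier-style init for the mean

residual = X @ mu0 - y
per_example_grads = residual[:, None] * X          # g_i for the loss 0.5*(x_i^T w - y_i)^2
s0 = (per_example_grads ** 2).mean(axis=0)         # initial scale vector

N, delta_t = 50_000, 1e-2                          # illustrative dataset size / regulariser
sigma0_sq = 1.0 / (N * (s0 + delta_t))             # implied initial diagonal variance
```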
Learning-rate scheduling: A common approach to quickly obtain high validation accuracy is to use a specific learning-rate schedule (Goyal et al., 2017). The learning rate (denoted by $\alpha_t$ in Algorithm 1) is regularly decayed by a factor (typically 10). The frequency and timing of this decay are usually prespecified. In VOGN, we use the same schedule used for Adam and find that it works well.
Distributed training: We also employ distributed training for VOGN to run large experiments quickly. We can parallelise computation both over data and over Monte-Carlo (MC) samples. Data parallelism is useful to split up large minibatches; the losses over multiple MC samples are then averaged. MC-sample parallelism is useful when the minibatch is small: we can copy the entire minibatch and process it on a single GPU per sample. Algorithm 1 and Figure 2 illustrate our distributed scheme. We use a combination of these two parallelism techniques, with different MC samples for different inputs. This theoretically lowers the variance during training (see Equation 5 in Kingma et al. (2015)), but sometimes requires averaging over multiple MC samples at the start of training to lower the variance sufficiently. Distributed training is crucial for fast training on large problems such as ImageNet.
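The combined data/MC-sample parallelism can be sketched (sequentially, without actual GPUs) as follows; the shapes, the toy quadratic gradients, and the four-worker split are illustrative assumptions, not the paper's distributed implementation.

```python
import numpy as np

rng = np.random.default_rng(2)

# Sketch of combined data / MC-sample parallelism: each simulated worker
# gets a shard of the minibatch and its *own* posterior sample, computes
# local gradient statistics, and the results are averaged (an all-reduce
# in a real distributed run). The toy quadratic gradients are illustrative.
def local_stats(shard, w):
    g = w - shard                                  # per-example gradients of 0.5*||w - x||^2
    return g.mean(axis=0), (g ** 2).mean(axis=0)

minibatch = rng.standard_normal((64, 4))
mu, sigma = np.zeros(4), 0.1 * np.ones(4)
shards = np.split(minibatch, 4)                    # data parallelism over 4 workers

stats = []
for shard in shards:                               # in practice, one GPU each
    w = mu + sigma * rng.standard_normal(4)        # a different MC sample per worker
    stats.append(local_stats(shard, w))

g_bar = np.mean([g for g, _ in stats], axis=0)     # "all-reduce": average gradients
s_bar = np.mean([s for _, s in stats], axis=0)     # "all-reduce": average squared gradients
```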
Implementation of the Gauss-Newton update in VOGN: As discussed earlier, VOGN uses the Gauss-Newton approximation, which differs from the Adam update: the gradients of individual data examples are first squared and then averaged (see step 12 in Algorithm 1, which implements the update for $s$ shown in (4)). We need extra computation to access the individual gradients (see Appendix B for details). Due to this computation, VOGN is roughly twice as slow as Adam or SGD (e.g., in Fig. 1). However, this is not a theoretical limitation, and it can be improved if a framework enables easy computation of individual gradients.
In this section, we present experiments on fitting several deep networks on CIFAR10 and ImageNet. Our experiments demonstrate practical training using VOGN on these benchmarks and show performance that is competitive with Adam and SGD. We also assess the quality of the posterior approximation, finding that the benefits of Bayesian principles are preserved.
CIFAR-10 (Krizhevsky and Hinton, 2009) contains 10 classes, with 50,000 images for training and 10,000 images for validation. For ImageNet, we train with 1.28 million training examples and validate on 50,000 examples, classifying between 1,000 classes. We use a large minibatch size and parallelise training across 128 GPUs (NVIDIA Tesla P100). On CIFAR-10, we compare the following methods: Adam and MC-dropout (Gal and Ghahramani, 2016a). For ImageNet, we also compare to SGD, K-FAC, and Noisy K-FAC. We do not consider Noisy K-FAC for the other comparisons since it is difficult to tune. We compare 3 architectures: LeNet-5, AlexNet, and ResNet-18. We only compare to Bayes by Backprop (BBB) (Blundell et al., 2015) on CIFAR-10 with LeNet-5 since it is difficult to tune for the other experiments. We carefully set the hyperparameters of all methods, following the best practice of large-scale distributed training (Goyal et al., 2017) as the initial point of our hyperparameter tuning. The full set of hyperparameters is in Appendix C.

We start by showing the effectiveness of momentum and batch normalisation in boosting the performance of VOGN. Figure 3 shows that these techniques significantly speed up convergence and improve both accuracy and log likelihoods.
Figures 1 and 4 compare the convergence of VOGN to Adam (for all experiments), SGD (on ImageNet), and MC-dropout (on the rest). VOGN shows similar convergence, and its performance is competitive with these methods. We also try BBB on LeNet-5, where it converges prohibitively slowly and performs very poorly; we were not able to successfully train other architectures with this approach. We found VOGN far simpler to tune, as we can borrow all the techniques used with Adam to boost performance. Figure 4 also shows the importance of DA in improving performance.
Table 1 gives a final comparison of train/validation accuracies, negative log likelihoods, epochs required for convergence, and runtime per epoch. The accuracies, log likelihoods, and numbers of epochs are comparable. Regarding runtime, VOGN is about twice as slow per epoch as Adam and SGD, since it requires the computation of individual gradients (see the discussion in Section 3). We clearly see that, by applying deep-learning techniques to VOGN, we can perform practical deep learning. This is not possible with methods such as BBB.
Due to the Bayesian nature of VOGN, there are some trade-offs to consider. Reducing the prior precision ($\delta$ in Algorithm 1) results in higher validation accuracy, but also a larger train-test gap (more overfitting). This is shown in Appendix E for VOGN with ResNet-18 on ImageNet. As expected, when the prior precision is small, performance is similar to non-Bayesian methods. We also show the effect of changing the effective dataset size ($\rho$ from Section 3) in Appendix E: given that we are going to tune the prior variance anyway, it is sufficient to set $\rho$ to the correct order of magnitude. Another trade-off concerns the number of Monte-Carlo (MC) samples, shown in Appendix F. Increasing the number of training MC samples (up to a limit) improves VOGN's convergence rate and stability, at increased computational cost. Increasing the number of MC samples during testing improves generalisation, as we obtain a better MC approximation of the posterior.
Finally, a few comments on the performance of the other methods. Adam regularly overfits the training set in most settings, with large train-test differences in both validation accuracy and log likelihood. The exception is LeNet-5, likely because the small architecture results in underfitting (consistent with the low validation accuracies obtained). In contrast to Adam, MC-dropout has a small train-test gap, usually smaller than VOGN's. However, we will see in Section 4.2 that this is due to underfitting. Moreover, the performance of MC-dropout is highly sensitive to the dropout rate (see Appendix D for a comparison of different dropout rates). On ImageNet, Noisy K-FAC also performs well. It is slower than VOGN but takes fewer epochs; overall, its wall-clock time is about the same as VOGN's.

Dataset / Architecture  Optimiser  Train / Validation accuracy (%)  NLL  Epochs  Time/epoch  ECE  AUROC
CIFAR10/ LeNet5 (no DA)  Adam  71.98 / 67.67  0.937  210  6.96  0.021  0.794  
BBB  66.84 / 64.61  1.018  800  11.43  0.045  0.784  
MCdropout  68.41 / 67.65  0.99  210  6.95  0.087  0.797  
VOGN  70.79 / 67.32  0.938  210  18.33  0.046  0.8  
CIFAR10/ AlexNet (no DA)  Adam  100.0 / 67.94  2.83  161  3.12  0.262  0.793  
MCdropout  97.56 / 72.20  1.077  160  3.25  0.140  0.818  
VOGN  98.68 / 66.49  1.12  160  9.98  0.024  0.796  
CIFAR10/ AlexNet  Adam  97.92 / 73.59  1.480  161  3.08  0.262  0.793  
MCdropout  80.65 / 77.04  0.667  160  3.20  0.114  0.828  
VOGN  81.15 / 75.48  0.703  160  10.02  0.016  0.832  
CIFAR10/ ResNet18  Adam  97.74 / 86.00  0.55  160  11.97  0.082  0.877  
MCdropout  88.23 / 82.85  0.51  161  12.51  0.166  0.768  
VOGN  91.62 / 84.27  0.477  161  53.14  0.040  0.876  
ImageNet/ ResNet18  SGD  82.63 / 67.79  1.38  90  44.13  0.067  0.856  
Adam  80.96 / 66.39  1.44  90  44.40  0.064  0.855  
MCdropout  72.96 / 65.64  1.43  90  45.86  0.012  0.856  
VOGN  73.87 / 67.38  1.37  90  76.04  0.029  0.854  
KFAC  83.73 / 66.58  1.493  60  133.69  0.097  0.856  
Noisy KFAC  72.28 / 66.44  1.44  60  179.27  0.080  0.852 
See the appendix for standard deviations. BBB is not parallelised (the other methods use 4 processes), with 1 MC sample used for the convolutional layers (VOGN uses 6 samples per process).

In this section, we compare the quality of the predictive probabilities of the various methods. For Bayesian methods, we compute these probabilities by averaging over samples from the posterior approximations (see Appendix G for details). For non-Bayesian methods, they are obtained using the point estimate of the weights. We compare the probabilities using the following metrics: validation negative log likelihood (NLL), area under the ROC curve (AUROC), and expected calibration error (ECE) (Naeini et al., 2015; Guo et al., 2017). For the first and third metrics, lower is better; for the second, higher is better. See Appendix G for an explanation of these metrics. Results are summarised in Table 1. Out of the 15 metrics (NLL, ECE, and AUROC on 5 dataset/architecture combinations), VOGN performs best or tied best on 10. On the other 5, VOGN is second best, with MC-dropout best on 4. The final metric shows Adam performing well on LeNet-5 (as argued earlier, the small architecture may result in underfitting). We also show calibration curves (DeGroot and Fienberg, 1983) in Figure 1 and Appendix H. Adam is consistently overconfident, with its calibration curve below the diagonal. Conversely, MC-dropout is usually underconfident. On ImageNet, MC-dropout performs well on ECE (all methods are very similar on AUROC), but this required an extensively tuned dropout rate (see Appendix D).
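For concreteness, here is a hedged sketch of an ECE computation of the kind described above: predictions are binned by confidence, and the gaps between average confidence and accuracy are averaged with bin-size weights. The number of bins and the toy predictions are our own choices.

```python
import numpy as np

# Sketch of expected calibration error (ECE): bin by confidence, then
# average |confidence - accuracy| per bin, weighted by bin size.
def ece(confidences, correct, n_bins=10):
    bins = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    total, err = len(confidences), 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            err += mask.sum() / total * gap
    return err

# a perfectly confident, always-correct classifier has zero ECE
conf = np.array([1.0, 1.0, 1.0, 1.0])
corr = np.array([1.0, 1.0, 1.0, 1.0])
print(ece(conf, corr))  # -> 0.0
```

An overconfident model (e.g., 90% confidence but 75% accuracy) yields a positive ECE, which is the pattern reported for Adam above.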
Our final result compares performance on out-of-distribution data. When testing on datasets that differ from the training datasets, predictions should be more uncertain. We use the experimental protocol from the literature (Hendrycks and Gimpel, 2017; Lee et al., 2018; DeVries and Taylor, 2018; Liang et al., 2018) to compare VOGN, Adam, and MC-dropout on CIFAR-10. We also borrow metrics from other works (Hendrycks and Gimpel, 2017; Lakshminarayanan et al., 2017), showing predictive-entropy histograms and reporting AUROC and FPR at 95% TPR. See Appendix I for further details on the datasets and metrics. Ideally, we want the entropy to be high on out-of-distribution data and low on in-distribution data. Our results are summarised in Figure 5 and Appendix I. On ResNet-18 and AlexNet, VOGN's predictive-entropy histograms show the desired behaviour: a spread of entropies for the in-distribution data, and high entropies for out-of-distribution data. Adam has many predictive entropies at zero, indicating that it tends to classify out-of-distribution data too confidently. Conversely, MC-dropout's predictive entropies are generally high (particularly in-distribution), indicating that MC-dropout has too much noise. On LeNet-5, we observe the same result as before: Adam and MC-dropout both perform well. The metrics (AUROC and FPR at 95% TPR) do not provide a clear story across architectures.
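The predictive-entropy score underlying these histograms can be sketched as follows; the shapes and the toy predictive distributions are illustrative.

```python
import numpy as np

# Sketch of the predictive-entropy score for OOD detection: average the
# MC-sample predictive distributions, then compute the entropy of the mean.
def predictive_entropy(mc_probs):
    p = mc_probs.mean(axis=0)                       # average over MC samples -> [classes]
    return -np.sum(p * np.log(np.clip(p, 1e-12, 1.0)))

uniform = np.full((5, 10), 0.1)                     # 5 MC samples, 10 classes, maximally uncertain
onehot = np.tile(np.eye(10)[0], (5, 1))             # always predicts class 0, fully confident

print(predictive_entropy(uniform))                  # maximal: log(10), about 2.30
print(predictive_entropy(onehot))                   # minimal: 0
```

A well-behaved Bayesian model should score near the top of this range on out-of-distribution inputs and lower on in-distribution inputs, which is the pattern reported for VOGN.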
We successfully train deep networks with a natural-gradient variational inference method, VOGN, on a variety of architectures and datasets, even scaling up to ImageNet. This is made possible by the similarity of VOGN to Adam, which enables us to boost performance by borrowing deep-learning techniques. Our accuracies and convergence rates are comparable to those of SGD and Adam. Unlike them, however, VOGN retains the benefits of Bayesian principles, with well-calibrated uncertainty and good performance on out-of-distribution data. Better uncertainty estimates open up a range of potential future experiments, for example small-data experiments, active learning, adversarial experiments, and sequential decision making or continual learning. Another potential avenue for research is structured covariance approximations.
Acknowledgements
We would like to thank Hikaru Nakata (Tokyo Institute of Technology) and Ikuro Sato (Denso IT Laboratory, Inc.) for their help on the PyTorch implementation. We are also thankful for the RAIDEN computing system and its support team at the RIKEN Center for AI Project which we used extensively for our experiments. This research used computational resources of the HPCI system provided by Tokyo Institute of Technology (TSUBAME3.0) through the HPCI System Research Project (Project ID:hp190122).
Noisy K-FAC [Zhang et al., 2018] attempts to approximate the structure of the full covariance matrix, and therefore its updates are a bit more involved than VOGN's (see Equation 4). Assuming a fully-connected layer, we denote its weight matrix by $W_\ell$. The Noisy K-FAC method estimates the parameters of a matrix-variate Gaussian distribution over $W_\ell$ using updates (5)-(6), in which two Kronecker factors are maintained as moving averages of the layer's activation statistics and of its pre-activation gradient statistics, together with an external damping factor. The covariance parameters of the matrix-variate Gaussian are set from the inverses of these (damped) factors. Similarly to the VOGN update in Equation 4, the gradients are scaled by these Kronecker-factored matrices, which are related to the precision matrix of the approximation.
Current codebases are optimised to directly return only the sum of gradients over the minibatch. In order to efficiently compute the Gauss-Newton (GN) approximation, we modify the backward pass to efficiently calculate the sum of squared gradients over the minibatch, extending the approach of Goodfellow [2015] to both convolutional and batch-normalisation layers.
Consider a convolutional filter $W$ with dimensions $[k, k]$ and inputs $X$ with dimensions $[M, C, H, W]$. Here $M$ is the size of the minibatch, $C$ is the number of channels, $H, W$ are the spatial dimensions, and $k$ is the filter size. Assuming stride 1 and zero padding for our convolutions, the filter $W$ acts on $[k, k]$ patches of $X$, shifting by 1 pixel sequentially. Let $\llbracket \cdot \rrbracket$ be an expansion (im2col) operator such that $\llbracket X \rrbracket$ is the input for the filter $W$, $S$ are the pre-activations, and $A$ are the activations. We can compute the gradients and the squared gradients of a loss $L$ as

$$\frac{\partial L}{\partial W} = \sum_{m=1}^{M} \llbracket x_m \rrbracket^\top \frac{\partial L}{\partial s_m}, \qquad \sum_{m=1}^{M} \Big( \llbracket x_m \rrbracket^\top \frac{\partial L}{\partial s_m} \Big)^2, \qquad (7)$$

where the square in the second expression is taken element-wise.
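For the simpler fully-connected case, the trick can be verified directly: each per-example weight gradient is an outer product of the input and the pre-activation gradient, so the minibatch sum of squared gradients needs no per-example loop. The toy shapes below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

# Squared-gradient trick for a fully-connected layer S = X @ W: the
# per-example gradient of W is outer(x_m, g_m), so the minibatch sum of
# *squared* gradients equals (X**2).T @ (G**2), with no per-example loop.
M, d_in, d_out = 8, 4, 3
X = rng.standard_normal((M, d_in))        # layer inputs
G = rng.standard_normal((M, d_out))       # pre-activation gradients dL/ds_m

fast = (X ** 2).T @ (G ** 2)              # [d_in, d_out] sum of squared gradients

slow = np.zeros((d_in, d_out))            # reference: explicit per-example loop
for m in range(M):
    slow += np.outer(X[m], G[m]) ** 2

print(np.allclose(fast, slow))            # -> True
```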
For a minibatch of size $M$, let $h_m$ be the pre-activation outputs of a layer in a deep neural network. Batch normalisation aims to normalise the pre-activation outputs of the layer to zero mean and unit variance. We define $\hat{h}_m$ as the batch-normalised pre-activations with learnable parameters $\gamma$ and $\beta$, given by

$$\hat{h}_m := \gamma\, \tilde{h}_m + \beta, \qquad \tilde{h}_m := \frac{h_m - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad (8)$$

where $\mu_B$ and $\sigma_B^2$ are the minibatch mean and variance. We can find the squared gradients of a loss function $L$ with respect to the parameters $\gamma$ and $\beta$ by

$$\sum_{m=1}^{M} \Big( \frac{\partial L}{\partial \hat{h}_m} \odot \tilde{h}_m \Big)^2, \qquad (9)$$

$$\sum_{m=1}^{M} \Big( \frac{\partial L}{\partial \hat{h}_m} \Big)^2, \qquad (10)$$

respectively, where $\odot$ denotes element-wise multiplication.
The pre-activation gradients can be obtained from the computation graph in PyTorch, as shown in Figure 6.
Layer-wise block-diagonal Gauss-Newton approximation. Despite using the method above, it is still intractable to compute the Gauss-Newton matrix (and its inverse) with respect to the weights of large-scale deep neural networks. We therefore apply two further approximations (Figure 7). First, we view the Gauss-Newton matrix as a layer-wise block-diagonal matrix. This corresponds to ignoring the correlations between the weights of different layers. Hence, for a network with $L$ layers, there are $L$ diagonal blocks, and $G_\ell$ is the diagonal block corresponding to the $\ell$-th layer ($\ell = 1, \dots, L$). Second, we approximate each diagonal block $G_\ell$ with $\hat{G}_\ell$, which is either a Kronecker-factored or a diagonal matrix. Using a Kronecker-factored matrix as $\hat{G}_\ell$ corresponds to K-FAC; a diagonal matrix corresponds to a mean-field approximation in that layer. By applying these two approximations, the update rule of the Gauss-Newton method can be written in a layer-wise fashion:

$$w_\ell \leftarrow w_\ell - \alpha\, \hat{G}_\ell^{-1}\, g_\ell, \qquad (11)$$

where $w_\ell$ are the weights in the $\ell$-th layer, $g_\ell$ is the corresponding gradient, and $\hat{G}_\ell$ is the Kronecker-factored or diagonal block approximation described above. Since the cost of computing $\hat{G}_\ell^{-1}$ is much cheaper than that of inverting the full Gauss-Newton matrix, our approximations make the Gauss-Newton method much more practical for deep learning.
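The computational saving of the Kronecker-factored form comes from a standard identity, $(A \otimes B)^{-1} = A^{-1} \otimes B^{-1}$: we only ever invert the two small factors rather than the large block. The following sketch verifies the identity on illustrative small factors.

```python
import numpy as np

rng = np.random.default_rng(4)

# Verify (A kron B)^-1 == A^-1 kron B^-1: inverting two small factors
# replaces one large inverse, which is why Kronecker-factored blocks are
# cheap. Sizes are illustrative.
def spd(n):
    Q = rng.standard_normal((n, n))
    return Q @ Q.T + n * np.eye(n)        # a well-conditioned SPD matrix

A, B = spd(4), spd(3)
big_inverse = np.linalg.inv(np.kron(A, B))              # one 12x12 inverse
factored = np.kron(np.linalg.inv(A), np.linalg.inv(B))  # two small inverses

print(np.allclose(big_inverse, factored))               # -> True
```

For a layer with an $n \times m$ weight matrix, this replaces an $O((nm)^3)$ inversion with $O(n^3 + m^3)$ work.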
In the distributed setting (see Figure 2), each parallel process (corresponding to 1 GPU) calculates the GN matrix for its local minibatch. Then, one GPU adds them together and calculates the inverse. This inversion step can also be parallelised after making the blockdiagonal approximation to the GN matrix. After inverting the GN matrix, the standard deviation is updated (line 9 in Algorithm 1), and sent to each parallel process, allowing each process to draw independently from the posterior.
In the Noisy K-FAC case, a similar distributed scheme is used, except that each parallel process now maintains both Kronecker factors (see Appendix A). When using K-FAC approximations to the Gauss-Newton blocks of the other layers, Osawa et al. [2018] empirically showed that the BatchNorm layers can be approximated with a diagonal matrix without loss of accuracy, and we find the same. We therefore use diagonal approximations for BatchNorm layers with K-FAC and Noisy K-FAC (see Table 2). For further details on how to efficiently parallelise K-FAC in the distributed setting, please see Osawa et al. [2018].
optimiser | convolution | fully-connected | Batch Normalisation
OGN | diagonal | diagonal | diagonal
VOGN | diagonal | diagonal | diagonal
K-FAC | Kronecker-factored | Kronecker-factored | diagonal
Noisy K-FAC | Kronecker-factored | Kronecker-factored | diagonal
Dataset / Architecture | Optimiser | Train acc. (%) | Train log loss | Val. acc. (%) | Val. log loss | ECE | AUROC | Epochs | Time/epoch
CIFAR-10 / LeNet-5 (no DA) | Adam | 71.98 ± 0.117 | 0.733 ± 0.021 | 67.67 ± 0.513 | 0.937 ± 0.012 | 0.021 ± 0.002 | 0.794 ± 0.001 | 210 | 6.96
 | BBB | 66.84 ± 0.003 | 0.957 ± 0.006 | 64.61 ± 0.331 | 1.018 ± 0.006 | 0.045 ± 0.005 | 0.784 ± 0.003 | 800 | 11.43
 | MC-dropout | 68.41 ± 0.581 | 0.870 ± 0.101 | 67.65 ± 1.317 | 0.99 ± 0.026 | 0.087 ± 0.009 | 0.797 ± 0.006 | 210 | 6.95
 | VOGN | 70.79 ± 0.763 | 0.880 ± 0.02 | 67.32 ± 1.310 | 0.938 ± 0.024 | 0.046 ± 0.002 | 0.8 ± 0.002 | 210 | 18.33
CIFAR-10 / AlexNet (no DA) | Adam | 100.0 ± 0 | 0.001 ± 0 | 67.94 ± 0.537 | 2.83 ± 0.02 | 0.262 ± 0.005 | 0.793 ± 0.001 | 161 | 3.12
 | MC-dropout | 97.56 ± 0.278 | 0.058 ± 0.014 | 72.20 ± 0.177 | 1.077 ± 0.012 | 0.140 ± 0.004 | 0.818 ± 0.002 | 160 | 3.25
 | VOGN | 98.68 ± 0.093 | 0.017 ± 0.005 | 66.49 ± 0.786 | 1.12 ± 0.01 | 0.024 ± 0.010 | 0.796 ± 0 | 160 | 9.98
CIFAR-10 / AlexNet | Adam | 97.92 ± 0.140 | 0.057 ± 0.006 | 73.59 ± 0.296 | 1.480 ± 0.015 | 0.262 ± 0.005 | 0.793 ± 0.001 | 161 | 3.08
 | MC-dropout | 80.65 ± 0.615 | 0.47 ± 0.052 | 77.04 ± 0.343 | 0.667 ± 0.012 | 0.114 ± 0.002 | 0.828 ± 0.002 | 160 | 3.20
 | VOGN | 81.15 ± 0.259 | 0.511 ± 0.039 | 75.48 ± 0.478 | 0.703 ± 0.006 | 0.016 ± 0.001 | 0.832 ± 0.002 | 160 | 10.02
CIFAR-10 / ResNet-18 | Adam | 97.74 ± 0.140 | 0.059 ± 0.012 | 86.00 ± 0.257 | 0.55 ± 0.01 | 0.082 ± 0.002 | 0.877 ± 0.001 | 160 | 11.97
 | MC-dropout | 88.23 ± 0.243 | 0.317 ± 0.045 | 82.85 ± 0.208 | 0.51 ± 0 | 0.166 ± 0.025 | 0.768 ± 0.004 | 161 | 12.51
 | VOGN | 91.62 ± 0.07 | 0.263 ± 0.051 | 84.27 ± 0.195 | 0.477 ± 0.006 | 0.040 ± 0.002 | 0.876 ± 0.002 | 161 | 53.14
ImageNet / ResNet-18 | SGD | 82.63 ± 0.058 | 0.675 ± 0.017 | 67.79 ± 0.017 | 1.38 ± 0 | 0.067 | 0.856 | 90 | 44.13
 | Adam | 80.96 ± 0.098 | 0.723 ± 0.015 | 66.39 ± 0.168 | 1.44 ± 0.01 | 0.064 | 0.855 | 90 | 44.40
 | MC-dropout | 72.96 | 1.12 | 65.64 | 1.43 | 0.012 | 0.856 | 90 | 45.86
 | VOGN | 73.87 ± 0.061 | 1.02 ± 0.01 | 67.38 ± 0.263 | 1.37 ± 0.01 | 0.0293 ± 0.001 | 0.8543 ± 0 | 90 | 76.04
 | K-FAC | 83.73 ± 0.058 | 0.571 ± 0.016 | 66.58 ± 0.176 | 1.493 ± 0.006 | 0.097 | 0.856 | 60 | 133.69
 | Noisy K-FAC | 72.28 | 1.075 | 66.44 | 1.44 | 0.080 | 0.852 | 60 | 179.27
Hyperparameters for training the various architectures on CIFAR-10 are given in Tables 6, 7, 8, 9, 10 and 11. Hyperparameters for training ResNet-18 on ImageNet are given in Table 4, with distributed-training-specific settings in Table 5. Please see Goyal et al. [2017] and Osawa et al. [2018] for best practices on these hyperparameter values.
optimiser  milestones  weight decay  L2 reg

SGD  1.25e-2  1.6  [30, 60, 80]  0.9      1e-4
Adam  1.25e-5  1.6e-3  [30, 60, 80]  0.1  0.001  1e-4
MC-dropout  1.25e-2  1.6  [30, 60, 80]  0.9      1e-4
VOGN  1.25e-5  1.6e-3  [30, 60, 80]  0.9  0.999
K-FAC  1.25e-5  1.6e-3  [15, 30, 45]  0.9  0.9    1e-4
Noisy K-FAC  1.25e-5  1.6e-3  [15, 30, 45]  0.9  0.9
optimiser  # GPUs

VOGN / Noisy K-FAC  4,096  128  1  1  5  1,281,167  133.3  2e-5
optimiser  weight decay  L2 reg

Adam  1e-3  0.1  0.001  1e-2
MC-dropout  1e-3  0.9      1e-4
VOGN  1e-2  0.9  0.999
optimiser  # GPUs

VOGN  128  4  6  0.1 1  1  50,000  100  2e-4 2e-3
optimiser  milestones  weight decay  L2 reg

Adam  1e-3  [80, 120]  0.1  0.001  1e-4
MC-dropout  1e-1  [80, 120]  0.9      1e-4
VOGN  1e-4  [80, 120]  0.9  0.999
optimiser  # GPUs

VOGN  128  8  3  0.5 1  10  50,000  0.5  5e-6 1e-5
optimiser  milestones  weight decay  L2 reg

Adam  1e-3  [80, 120]  0.1  0.001  5e-4
MC-dropout  1e-1  [80, 120]  0.9      1e-4
VOGN  1e-4  [80, 120]  0.9  0.999
optimiser  # GPUs

VOGN  256  8  5  1  10  50,000  50  1e-3
This appendix shows MC-dropout's sensitivity to the dropout rate. We tune MC-dropout as well as we can, finding a single rate that works best across all architectures trained on CIFAR-10 (see Figure 8 for the dropout rate's sensitivity on LeNet-5 as an example). On ResNet-18 trained on ImageNet, we find that MC-dropout is extremely sensitive to the dropout rate, with most settings performing badly; we therefore use a separately tuned rate for the MC-dropout experiments on ImageNet. This high sensitivity to the dropout rate is a practical drawback of MC-dropout as a method.
We show the effect of changing the prior variance (as defined in Algorithm 1) in Figures 10 and 11. Increasing the prior variance improves validation performance (accuracy and log likelihood). However, it also consistently increases the train-test gap when the other hyperparameters are held constant. As an example, training VOGN on ResNet-18 on ImageNet with a smaller prior variance gives train-test accuracy and log-likelihood gaps of 2.29 and 0.12 respectively. When the prior variance is increased, the respective train-test gaps grow to 6.38 and 0.34 (validation accuracy and validation log likelihood also increase; see Figure 10).
With increased prior variance, VOGN (and Noisy K-FAC) converge to solutions more like those of their non-Bayesian counterparts, where overfitting is an issue. This is as expected from Bayesian principles.
Figure 12 shows the combined effect of the dataset reweighting factor and the prior variance. When the reweighting factor is set to a value of the correct order of magnitude, it does not affect performance much: instead, we should tune the prior variance. This is our methodology for dealing with the reweighting factor. Note that we set the reweighting factor for ImageNet smaller than that for CIFAR-10 because the data-augmentation cropping step uses a higher portion of the initial image than in CIFAR-10: we crop images of size 224×224 from images of size 256×256.
In the paper, we report results for training ResNet-18 on ImageNet using 128 GPUs, with one independent Monte Carlo (MC) sample per process during training (mc = 128×1) and 10 MC samples per validation image (val_mc = 10). We now show that increasing either the number of training or testing MC samples improves performance (validation accuracy and log likelihood) at the cost of increased computation time; see Figure 13.
Increasing the number of training MC samples per process reduces noise during training. We observe this when training on CIFAR-10, where multiple MC samples per process are required to stabilise training. On ImageNet, the mini-batch size is much larger (4,096 instead of 256) and there are more parallel processes (128 rather than 8), so training with one MC sample per process is still stable. However, as shown in Figure 13, increasing the number of training MC samples per process from 1 to 2 speeds up convergence per epoch and reaches a better converged solution; the time per epoch (and hence total runtime) increases by approximately a factor of 1.5. Increasing the number of training MC samples per process to 3 does not significantly improve final test performance.
Increasing the number of testing MC samples from 10 to 100 (on the same trained model) also results in better generalisation: the train accuracy and log likelihood are unchanged, but the validation accuracy and log likelihood increase. However, as we run a full validation pass each epoch, increasing the number of validation MC samples also increases runtime.
These results show that users with more compute available can improve VOGN's performance by improving the MC approximation at train-time, test-time, or both (up to a limit).
We use several approaches to compare the uncertainty estimates obtained by each optimiser, following the same methodology for all optimisers: first, tune hyperparameters to obtain good accuracy on the validation set; then, test on uncertainty metrics. For multi-class classification problems, all of these metrics are based on the predictive probabilities. For non-Bayesian approaches, we compute the probabilities for a validation input x as p(y | x, w), where w is the weight vector of the DNN whose uncertainty we are estimating. For Bayesian methods, we can compute the predictive probabilities for each validation example as follows:

p(y | x) ≈ (1/K) Σ_{k=1}^{K} p(y | x, w_k),

where w_1, …, w_K are samples from the Gaussian approximation returned by a variational method. We use 10 MC samples at validation-time for VOGN and MC-dropout (the effect of changing the number of validation MC samples is shown in Appendix F). This increases the computational cost during testing for these methods compared to Adam or SGD.
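The MC-averaged predictive distribution can be sketched as follows. This is a minimal numpy sketch under simplifying assumptions: `mc_predictive` and `f` are illustrative names, `f(w, x)` stands for the network's logits given sampled weights, and the posterior is taken to be a fully factorised Gaussian with mean `mu` and standard deviation `sigma`.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mc_predictive(f, mu, sigma, x, n_mc=10, rng=None):
    """Average softmax outputs over weight samples w_k ~ N(mu, sigma^2)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    p = 0.0
    for _ in range(n_mc):
        w = rng.normal(mu, sigma)       # one posterior sample
        p = p + softmax(f(w, x))        # p(y | x, w_k)
    return p / n_mc

# usage: a linear "network" with D=3 inputs and K=4 classes
f = lambda w, x: x @ w
probs = mc_predictive(f, np.zeros((3, 4)), np.full((3, 4), 1e-6),
                      np.ones((2, 3)), n_mc=5)
```

A point-estimate method (Adam, SGD) corresponds to the special case of a single sample at the mean, which is why it is cheaper at test time.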
Using these predictive probabilities, we use three methods to compare uncertainties: validation log loss, AUROC, and calibration curves. We also compare uncertainty performance by looking at model outputs on out-of-distribution data.
Validation log likelihood. Log likelihood (or log loss) is a common uncertainty metric. We consider a validation set of N examples. For an input x_i, denote the true label by y_i, a one-of-K encoded vector with a 1 at the true label and 0 elsewhere. Denote the full vector of all validation labels by y. Similarly, denote the vector of all predicted probabilities by p, where p_ik is the predicted probability of class k for input x_i. The validation log likelihood is defined as

log p(y) = (1/N) Σ_{i=1}^{N} Σ_{k=1}^{K} y_ik log p_ik.   (13)
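The validation log likelihood is straightforward to compute; a minimal numpy sketch (the function name is illustrative, and it takes integer class labels rather than one-hot vectors, which is equivalent since y_ik selects a single term per example):

```python
import numpy as np

def validation_log_likelihood(probs, labels):
    """Average log likelihood of the true labels under predicted probabilities.

    probs: (N, K) predictive probabilities (rows sum to 1).
    labels: (N,) integer class indices.
    """
    n = probs.shape[0]
    # pick out log p_i,y_i for each example, then average over the set
    return np.log(probs[np.arange(n), labels]).mean()
```

Reported "validation NLL" values are the negative of this quantity.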
Tables 1 and 3 show the final validation (negative) log likelihoods. VOGN performs very well on this metric (aside from CIFAR-10/AlexNet, with or without DA, where MC-dropout performs best). Final validation log likelihoods are otherwise very similar, with VOGN typically matching the other best-performing optimisers (usually MC-dropout).
Area Under ROC curves (AUROC). We consider Receiver Operating Characteristic (ROC) curves for our multi-way classification tasks. One way to use uncertainty measurements is to discard uncertain examples by thresholding each validation input's predicted class' softmax output, marking those below the threshold as too ambiguous to classify. We can then consider the remaining validation inputs as either correctly or incorrectly classified, and calculate the True Positive Rate (TPR) and False Positive Rate (FPR) accordingly. The ROC curve is summarised by its Area Under Curve (AUROC), reported in Table 1. This metric is useful for comparing uncertainty performance in conjunction with the other metrics we use. The AUROC results are very similar between optimisers, particularly on ImageNet, although MC-dropout performs marginally better than the others, including VOGN. On all but one CIFAR-10 experiment (AlexNet, without DA), VOGN performs best or ties for best. Adam performs the worst overall, but is surprisingly good on CIFAR-10/ResNet-18.
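The AUROC described above can be computed without sweeping thresholds explicitly, via its rank-statistic (Mann-Whitney U) form. A minimal numpy sketch with an illustrative function name; it scores how well the predicted class' softmax output separates correct from incorrect classifications:

```python
import numpy as np

def auroc_correct_vs_incorrect(probs, labels):
    """AUROC for separating correctly from incorrectly classified inputs,
    using the predicted class' softmax output as the confidence score.

    probs: (N, K) predictive probabilities; labels: (N,) integer classes.
    """
    conf = probs.max(axis=1)                    # predicted class' probability
    correct = probs.argmax(axis=1) == labels
    pos, neg = conf[correct], conf[~correct]
    if len(pos) == 0 or len(neg) == 0:
        return float("nan")                     # AUROC undefined for one class
    # P(conf_correct > conf_incorrect), counting ties as 1/2
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))
```

An AUROC of 1.0 means every correct prediction is more confident than every incorrect one; 0.5 means confidence carries no information.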
Calibration Curves. Calibration curves [DeGroot and Fienberg, 1983] test how well-calibrated a model is by plotting the true accuracy as a function of the model's predicted confidence (we only consider the predicted class' softmax probability). A perfectly calibrated model follows the diagonal line on a calibration curve. We approximate this curve by binning the model's predictions, as is often done. We show calibration curves in Appendix H, as well as in Figure 1. We also consider the Expected Calibration Error (ECE) metric [Naeini et al., 2015, Guo et al., 2017], reported in Table 1. ECE is the expected gap between the true accuracy and the model's predicted confidence, averaged over all validation examples and again approximated by binning. Across all datasets and architectures, with the exception of LeNet-5 (which, as we have argued, underfits), VOGN usually has better calibration curves and better ECE than competing optimisers. Adam is consistently overconfident, with its calibration curve below the diagonal. Conversely, MC-dropout is usually underconfident, with too much noise, as mentioned earlier. The exception is ImageNet, where MC-dropout performs well: we extensively tuned the MC-dropout rate to achieve this (see Appendix D).
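The binned ECE computation can be sketched as follows (a minimal numpy sketch; the function name and the 15-bin default are illustrative choices, not the paper's exact settings):

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    """ECE: bin predictions by confidence, then average the per-bin
    |accuracy - confidence| gap weighted by bin population.

    probs: (N, K) predictive probabilities; labels: (N,) integer classes.
    """
    conf = probs.max(axis=1)                         # predicted class' probability
    correct = (probs.argmax(axis=1) == labels).astype(float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (conf > lo) & (conf <= hi)            # examples falling in this bin
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece
```

The same per-bin (confidence, accuracy) pairs also give the points of the calibration curve itself.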
We show calibration curves comparing VOGN, Adam and MC-dropout for the final trained models from Table 1. The calibration curve for ResNet-18 trained on ImageNet is in Figure 1. VOGN is extremely well-calibrated compared to the other two optimisers (except on LeNet-5, where all optimisers perform well).
We use experiments from the out-of-distribution testing literature [Hendrycks and Gimpel, 2017, Lee et al., 2018, DeVries and Taylor, 2018, Liang et al., 2018], comparing VOGN to Adam and MC-dropout. Using architectures trained on CIFAR-10 (LeNet-5, AlexNet and ResNet-18), we test on SVHN, LSUN (crop) and LSUN (resize) as out-of-distribution datasets, with the in-distribution data given by the validation set of CIFAR-10 (10,000 images). We use the entire training set of SVHN (73,257 examples, 10 classes) [Netzer et al., 2011]. The test set of LSUN (Large-scale Scene UNderstanding dataset [Yu et al., 2015], 10,000 images from 10 different scenes) is randomly cropped to obtain LSUN (crop), and downsampled to obtain LSUN (resize). These out-of-distribution datasets have no classes similar to CIFAR-10's.
Similar to the literature [Hendrycks and Gimpel, 2017, Lakshminarayanan et al., 2017], we use three metrics to test performance on out-of-distribution data. First, we plot histograms of predictive entropy for the in-distribution and out-of-distribution datasets, shown in Figures 5, 15, 16 and 17. Predictive entropy is given by H[p] = −Σ_k p_k log p_k. Ideally, on out-of-distribution data, a model would have high predictive entropy, indicating it is unsure which class the input image belongs to. In contrast, on in-distribution data, a good model should have many examples with low entropy, as it should be confident of many inputs' (correct) class. We also compare AUROC and the FPR at 95% TPR, reported in the same figures. By thresholding the most likely class' softmax output, we assign high-uncertainty images to an unknown class; this allows us to calculate the FPR and TPR, plot the ROC curve, and compute the AUROC.
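The predictive-entropy score used for these histograms is a one-liner; a minimal numpy sketch (the function name and the epsilon guard against log 0 are illustrative):

```python
import numpy as np

def predictive_entropy(probs, eps=1e-12):
    """Per-input predictive entropy H[p] = -sum_k p_k log p_k.

    probs: (N, K) predictive probabilities. High entropy is desirable on
    out-of-distribution inputs, low entropy on confident in-distribution ones.
    """
    return -(probs * np.log(probs + eps)).sum(axis=1)
```

Entropy ranges from 0 (all mass on one class) to log K (uniform), so histograms of this quantity over in- and out-of-distribution sets are directly comparable across models.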
We show results on AlexNet in Figures 15 and 16 (trained on CIFAR-10 with and without DA, respectively) and on LeNet-5 in Figure 17. Results on ResNet-18 are in Figure 5. These results are discussed in Section 4.2.
M.E.K., A.J., and R.E. conceived the original idea. This was also discussed with R.Y. and K.O., and then with S.S. and R.T. Eventually, all authors discussed and agreed on the main focus and ideas of this paper.
The first proof-of-concept was done by A.J. using LeNet-5 on CIFAR-10. This was then extended by K.O., who wrote the main PyTorch implementation, including the distributed version. R.E. fixed multiple issues in the implementation, and also pointed out an important issue regarding data augmentation. S.S., A.J., K.O., and R.E. together fixed this issue. K.O. conducted most of the large experiments (shown in Fig. 1 and 4). The results shown in Fig. 3 were produced by both K.O. and A.J. The BBB implementation was written by S.S.
The experiments in Section 4.2 were performed by A.J. and S.S. The main ideas behind the experiments were conceived by S.S., A.J., and M.E.K. with many helpful suggestions from R.T.
The main text of the paper was written by M.E.K. and S.S. The section on experiments was first written by S.S. and subsequently improved by A.J., K.O., and M.E.K. R.T. helped edit the manuscript. R.E. also helped in writing parts of the paper.
M.E.K. led the project with significant help from S.S. Computing resources and access to the HPCI systems were provided by R.Y.