Contains code for the NeurIPS 2019 paper "Practical Deep Learning with Bayesian Principles"
Bayesian methods promise to fix many shortcomings of deep learning, but they are impractical and rarely match the performance of standard methods, let alone improve them. In this paper, we demonstrate practical training of deep networks with natural-gradient variational inference. By applying techniques such as batch normalisation, data augmentation, and distributed training, we achieve similar performance in about the same number of epochs as the Adam optimiser, even on large datasets such as ImageNet. Importantly, the benefits of Bayesian principles are preserved: predictive probabilities are well-calibrated and uncertainties on out-of-distribution data are improved. This work enables practical deep learning while preserving benefits of Bayesian principles. A PyTorch implementation will be available as a plug-and-play optimiser.READ FULL TEXT VIEW PDF
Contains code for the NeurIPS 2019 paper "Practical Deep Learning with Bayesian Principles"
PyTorch-SSO: Scalable Second-Order methods in PyTorch
Deep learning has been extremely successful in many fields such as computer vision(Krizhevsky et al., 2012), speech processing (Hinton et al., 2012)2013), but it is also plagued with several issues that make its application difficult in many other fields. For example, it requires a large amount of high-quality data and it can overfit when dataset size is small. Similarly, sequential learning can cause forgetting of past knowledge (Kirkpatrick et al., 2017)
, and lack of reliable confidence estimates and other robustness issues can make it vulnerable to adversarial attacks(Bradshaw et al., 2017). Ultimately, due to such issues, application of deep learning remains challenging, especially for applications where human lives are at risk.
Bayesian principles have the potential to address such issues. For example, we can represent uncertainty using the posterior distribution, enable sequential learning using Bayes’ rule, and reduce overfitting with Bayesian model averaging Hoeting et al. (1999)1995), Laplace’s method (Mackay, 1991), and variational inference (VI) (Hinton and Van Camp, 1993; Barber and Bishop, 1998; Saul et al., 1996; Anderson and Peterson, 1987)
. Benefits of Bayesian principles are even discussed in machine-learning textbooks(MacKay, 2003; Bishop, 2006). Despite this, they are rarely employed in practice. This is mainly due to computational concerns which unfortunately overshadow their theoretical advantages.
The difficulty lies in the computation of the posterior distribution, which is especially challenging for deep learning. Even approximation methods, such as VI and MCMC, have historically been difficult to scale to large datasets such as ImageNet (Russakovsky et al., 2015). Due to this, it is common to use less principled approximations, such as MC-dropout (Gal and Ghahramani, 2016b), even though they are not ideal when it comes to fixing the issues of deep learning. For example, MC-dropout is unsuitable for continual learning Kirkpatrick et al. (2017) since its posterior approximation does not have mass over the whole weight space. It is also found to perform poorly for sequential decision making (Riquelme et al., 2018). The form of the approximation used by such methods is usually rigid and cannot be easily improved, e.g., to other forms such as a mixture of Gaussians. The goal of this paper is to make more principled Bayesian methods, such as VI, practical for deep learning, thereby helping researchers tackle key limitations of deep learning.
We demonstrate practical training of deep networks by using recently proposed natural-gradient VI methods. These methods resemble the Adam optimiser, enabling us to leverage existing techniques for initialisation, momentum, batch normalisation, data augmentation, and distributed training. As a result, we obtain similar performance in about the same number of epochs as Adam when training many popular deep networks (e.g., LeNet, AlexNet, ResNet) on datasets such as CIFAR-10 and ImageNet. See Fig. 1 for Imagenet. The results show that, despite using an approximate posterior, the training methods preserve the benefits of Bayesian principles. Compared to standard deep-learning methods, the predictive probabilities are well-calibrated and uncertainties on out-of-distribution inputs are improved. Our work shows that practical deep learning is possible with Bayesian methods and aims to support further research in this area.
, require significant implementation and tuning effort to perform well, e.g., on convolution neural networks (CNN). Slow convergence is found to be problematic for sequential problems(Riquelme et al., 2018). There appears to be no reported results with complex networks on large problems, such as ImageNet. Our work solves these issues by borrowing deep-learning techniques and applying them to natural-gradient VI (Khan et al., 2018; Zhang et al., 2018).
In their paper, Zhang et al. (2018) also employed data augmentation and batch normalisation for a natural-gradient method called Noisy K-FAC (see Appendix A) and showed results on VGG on CIFAR-10. However, a mean-field method called Noisy Adam was found to be unstable with batch normalisation. In contrast, we show that a similar method, called Variational Online Gauss-Newton (VOGN), proposed by Khan et al. (2018), works well with such techniques. We show results for distributed training with Noisy K-FAC on Imagenet, but do not provide extensive comparisons since we find it difficult to tune. Many of our techniques can be used to speed-up Noisy K-FAC too, which is promising.
Many other approaches have recently been proposed to compute posterior approximations by training deterministic networks (Ritter et al., 2018; Maddox et al., 2019; Mandt et al., 2017). Similarly to MC-dropout, their posterior approximations are not flexible, making it difficult to improve the accuracy of their approximations. On the other hand, VI offers a much more flexible alternative to apply Bayesian principles to deep learning.
The success of deep learning is partly due to the availability of scalable and practical methods for training deep neural networks (DNNs). Network training is formulated as an optimisation problem where a loss between the data and the DNN’s predictions is minimised. For example, in a supervised learning task with a datasetof inputs and corresponding outputs of length , we minimise a loss of the following form: , where , denotes the DNN outputs with weights ,
denotes a differentiable loss function between an outputand its prediction , and is the regulariser.††This regulariser is sometimes set to 0 or a very small value.2012), and Adam (Kingma and Ba, 2015), take the following form††Alternate versions with weight-decay and momentum differ from this update (Loshchilov and Hutter, 2019). We present a form useful to establish the connection between SG methods and natural-gradient VI. (all operations are element-wise):
where is the iteration, and are learning rates, is a small scalar, and is the stochastic gradients at defined as follows: using a minibatch of data examples. This simple update scales extremely well and can be applied to very large problems. With techniques such as initialisation tricks, momentum, weight-decay, batch normalisation, and data augmentation, it also achieves good performance for many problems.
In contrast, deep learning with Bayesian principles is computationally expensive. The posterior distribution can be obtained using Bayes’ rule: where .††This is a tempered posterior (Vovk, 1990) setup where is set when we expect model misspecification and/or adversarial examples (Ghosal and Van der Vaart, 2017). Setting recovers standard Bayesian inference. This is costly due to the computation of the marginal likelihood , a high-dimensional integral that is difficult to compute for large networks. Variational inference (VI) is a principled approach to scalably estimate an approximation to . The main idea is to employ a parametric approximation, e.g., a Gaussian with mean and covariance . The parameters and can then be estimated by maximising the evidence lower bound (ELBO):
denotes the Kullback-Leibler divergence. By using more complex approximations, we can further reduce the approximation error, but at a computational cost. By formulating Bayesian inference as an optimisation problem, VI enable a practical application of Bayesian principles.
Despite this, VI remains impractical for training large deep networks on large datasets. Existing methods, such as Graves (2011) and Blundell et al. (2015), apply popular SG methods to optimise the ELBO, yet they fail to get a reasonable performance on large problems. This is not surprising since the optimisation objectives for VI and deep learning are fundamentally different, and it is reasonable that techniques used in one field do not directly lead to improvements in the other. However, it will be useful if we can exploit the tricks and techniques of deep learning to boost performance of VI. The goal of this work is to do just that. We now describe our methods in detail.
In this paper, we propose natural-gradient VI methods for practical deep learning with Bayesian principles. The natural-gradient update takes a simple form when estimating exponential-family approximations (Khan and Nielsen, 2018; Khan and Lin, 2017). When , the update of the natural-parameter is performed by using the stochastic gradient of the expected regularised-loss:
where is the learning rate, and we note that the stochastic gradients are computed with respect to , the expectation parameters of . The moving average above helps to deal with the stochasticity of the gradient estimates, and is very similar to the moving average used in deep learning (see (1)). When is set to 0, the update essentially minimises the regularised loss (see Section 5 in Khan et al. (2018)). These properties of natural-gradient VI makes it an ideal candidate for deep learning.
Recent work by Khan et al. (2018) and Zhang et al. (2018) further show that, when is Gaussian, the update (3) assumes a form that is strikingly similar to the update (1). For example, the Variational Online Gauss-Newton (VOGN) method of Khan et al. (2018) estimates a Gaussian with mean and a diagonal covariance matrix using the following update:
where , with , , and are learning rates. Similarly to (1
), the vectoradapts the learning rate and is updated using a moving average.
A major difference in VOGN is that the update of is now based on a Gauss-Newton approximation Graves (2011) which uses . This is fundamentally different from the SG update in (1) which instead uses the gradient-magnitude (Bottou et al., 2016). The first approach uses the sum outside the square while the second approach uses it inside. VOGN is therefore a second-order method and, similarly to a Newton method, does not need a square-root over unlike in (1). Implementation of this step requires an additional calculation (see Appendix B
) which makes VOGN a bit slower than Adam, but this is expected to give better variance estimates (see Theorem 1 inKhan et al. (2018)).
The main contribution of this paper is to demonstrate practical training of deep networks using VOGN. Since VOGN takes a similar form to SG methods, we can easily borrow existing deep-learning techniques to improve performance. We will now describe these techniques in detail. Pseudo-code for VOGN is shown in Algorithm 1.
Batch normalisation: Batch normalisation (Ioffe and Szegedy, 2015)
has been found to significantly speed up and stabilise training of neural networks, and is widely used in deep learning. BatchNorm layers are inserted between neural network layers. They help stabilise each layer’s input distribution by normalising the running average of the inputs’ mean and variance. In our VOGN implementation, we simply use the existing implementation with default hyperparameter settings. We do not apply L2 regularisation and weight decay to BatchNorm parameters, like inGoyal et al. (2017), or maintain uncertainty over the BatchNorm parameters. This straightforward application of batch normalisation works for VOGN.
Data Augmentation: When training on image datasets, data augmentation (DA) techniques can improve performance drastically Goyal et al. (2017). We consider two common real-time data augmentation techniques: random cropping and horizontal flipping. After randomly selecting a minibatch at each iteration, we use a randomly selected cropped version of all images. Each image in the minibatch has a chance of being horizontally flipped.
We find that directly applying DA gives slightly worse performance than expected, and also affects the calibration of the resulting uncertainty. However, DA increases the effective sample size. We therefore modify it to be where , improving performance (see step 2 in Algorithm 1) The reason for this performance boost might be due to the complex relationship between the regularisation and . For the regularised loss , the two are unidentifiable, i.e., we can multiply by a constant and reduce by the same constant without changing the minimum. However, in a Bayesian setting (like in (2)), the two quantities are separate, and therefore changing the data might also change the optimal prior variance hyperparameter in a complicated way. This needs further theoretical investigations, but our simple fix of scaling seems to work well in the experiments.
by considering the specific DA techniques used. When training on CIFAR-10, the random cropping DA step involves first padding the 32x32 images to become of size 40x40, and then taking randomly selected 28x28 cropped images. We consider this as effectively increasing the dataset size by a factor of 5 (4 images for each corner, and one central image). The horizontal flipping DA step doubles the dataset size (one dataset of unflipped images, one for flipped images). Combined, this gives. Similar arguments for ImageNet DA techniques give . Even though is another hyperparameter to set, we find that its precise value does not matter much. Typically, after setting an estimate for , tuning a little seems to work well (see Appendix E).
Momentum and initialisation: It is well known that both momentum and good initialisation can improve the speed of convergence for SG methods in deep learning Sutskever et al. (2013). Since VOGN is similar to Adam, we can implement momentum in a similar way. This is shown in step 17 of Algorithm 1, where is the momentum rate. We initialise the mean in the same way the weights are initialised in Adam (we use init.xavier_normal in PyTorch (Glorot and Bengio, 2010)). For the momentum term , we use the same initialisation as Adam (initialised to 0). VOGN requires an additional initialisation for the variance . For this, we first run a forward pass through the first minibatch, calculate the average of the squared gradients and initialise the scale with it (see step 1 in Algorithm 1). This implies that the variance is initialised to . For the tempering parameter , we use a schedule where it is increased from a small value (e.g., 0.1) to 1. With these initialisation tricks, VOGN is able to mimic the convergence behaviour of Adam in the beginning.
Learning rate scheduling: A common approach to obtain high validation accuracies quickly is to use a specific learning rate schedule (Goyal et al., 2017). The learning rate (denoted by in Algorithm 1) is regularly decayed by a factor (typically a factor of 10). The frequency and timings of this decay are usually pre-specified. In VOGN, we use the same schedule used for Adam, finding that it works well.
Distributed training: We also employ distributed training for VOGN to perform large experiments quickly. We can parallelise computation both over data and Monte-Carlo (MC) samples. Data parallelism is useful to split up large minibatch sizes. This is followed by averaging over multiple MC samples and their losses on a single GPU. MC samples parallelism is useful when minibatch size is small, and we can copy the entire minibatch and process it on a single GPU. Algorithm 1 and Figure 2 illustrate our distributed scheme. We use a combination of these two parallelism techniques with different MC samples for different inputs. This theoretically lowers variance during training (see Equation 5 in Kingma et al. (2015)), but sometimes requires averaging over multiple MC samples at the start of training to lower the variance sufficiently. Distributed training is crucial for fast training on large problems such as ImageNet.
Implementation of the Gauss-Newton update in VOGN: As discussed earlier, VOGN uses the Gauss-Newton approximation, different to the Adam update. In this approximation, the gradients on individual data examples are first squared and then averaged afterwards (see step 12 in Algorithm 1 which implements the update for shown in (4)). We need extra computation to get access to individual gradients (see Appendix B for details). Due to this computation, VOGN is twice as slow as Adam or SGD (e.g., in Fig. 1). However, this is not a theoretical limitation and this can be improved if a framework enables an easy computation of the individual gradients.
In this section, we present experiments on fitting several deep networks on CIFAR-10 and ImageNet. Our experiments demonstrate practical training using VOGN on these benchmarks and show performance that is competitive with Adam and SGD. We also assess the quality of the posterior approximation, finding that the benefits of Bayesian principles are preserved.
CIFAR-10 (Krizhevsky and Hinton, 2009)
contains 10 classes with 50,000 images for training and 10,000 images for validation. For ImageNet, we train with 1.28 million training examples and validate on 50,000 examples, classifying between 1,000 classes. We used a large minibatch sizeand parallelise them across 128 GPUs (NVIDIA Tesla P100). We compare the following methods on CIFAR-10: Adam, MC-dropout Gal and Ghahramani (2016a). For ImageNet, we also compare to SGD, K-FAC, and Noisy K-FAC. We do not consider Noisy K-FAC for other comparisons since tuning is difficult. We compare 3 architectures: LeNet-5, AlexNet, ResNet-18. We only compare to Bayes by Backprop (BBB) Blundell et al. (2015) for CIFAR-10 with LeNet-5 since it is difficult to tune for other experiments. We carefully set the hyperparameters of all methods, following the best practice of large distributed training (Goyal et al., 2017) as the initial point of our hyperparameter tuning. The full set of hyperparameters is in Appendix C.
We start by showing the effectiveness of momentum and batch normalisation to boost performance of VOGN. Figure 3 shows that these methods significantly speed up convergence as well as performance for both accuracy and log likelihoods.
Figures 1 and 4 compare the convergence of VOGN to Adam (for all experiments), SGD (on ImageNet), and MC-dropout (on the rest). VOGN shows similar convergence and its performance is competitive with these methods. We also try BBB on LeNet-5, where it converges prohibitively slowly, performing very poorly. We are not able to successfully train other architectures using this approach. We found VOGN far simpler to tune as we can borrow all the techniques used with Adam to boost performance. Figure 4 also shows the importance of DA in improving performance.
Table 1 gives a final comparison of train/validation accuracies, negative log likelihoods, epochs required for convergence, and run-time per epoch. We can see that the accuracy, log likelihoods, and the number of epochs are comparable. Regarding run-time, VOGN is twice as slow per epoch when compared to Adam and SGD, since it requires computation of individual gradients (see the discussion in Section 3). We clearly see that by using deep-learning techniques on VOGN, we can perform practical deep learning. This is not possible with methods such as BBB.
Due to the Bayesian nature of VOGN, there are some trade-offs to consider. Reducing the prior precision ( in Algorithm 1) results in higher validation accuracy, but also larger train-test gap (more overfitting). This is shown in Appendix E for VOGN on ResNet-18 on ImageNet. As expected, when the prior precision is small, performance is similar to non-Bayesian methods. We also show the effect of changing the effective dataset size ( from Section 3) in Appendix E: note that, given we are going to tune the prior variance anyway, therefore it is sufficient to set to its correct order of magnitude. Another trade-off concerns the number of Monte-Carlo (MC) samples, shown in Appendix F. Increasing the number of training MC samples (up to a limit) improves VOGN’s convergence rate and stability at the cost of computation cost. Increasing the number of MC samples during testing improves generalisation as we have a better MC approximation of the posterior.
Finally, a few comments on the performance of the other methods. Adam regularly overfits the training set in most settings, with large train-test differences in both validation accuracy and log likelihood. The exception is LeNet-5, likely because the small architecture results in underfitting (this is consistent with the low validation accuracies obtained). In contrast to Adam, MC-dropout has small train-test gap, usually smaller than VOGN’s. However, we will see in Section 4.2 that this is because of underfitting. Moreover, the performance of MC-dropout is highly sensitive to the dropout rate (see Appendix D for a comparison of different dropout rates). On ImageNet, Noisy K-FAC performs well too. It is slower than VOGN, but it takes fewer epochs. Overall, wall clock time is about the same as VOGN.
|CIFAR-10/ LeNet-5 (no DA)||Adam||71.98 / 67.67||0.937||210||6.96||0.021||0.794|
|BBB||66.84 / 64.61||1.018||800||11.43||0.045||0.784|
|MC-dropout||68.41 / 67.65||0.99||210||6.95||0.087||0.797|
|VOGN||70.79 / 67.32||0.938||210||18.33||0.046||0.8|
|CIFAR-10/ AlexNet (no DA)||Adam||100.0 / 67.94||2.83||161||3.12||0.262||0.793|
|MC-dropout||97.56 / 72.20||1.077||160||3.25||0.140||0.818|
|VOGN||98.68 / 66.49||1.12||160||9.98||0.024||0.796|
|CIFAR-10/ AlexNet||Adam||97.92 / 73.59||1.480||161||3.08||0.262||0.793|
|MC-dropout||80.65 / 77.04||0.667||160||3.20||0.114||0.828|
|VOGN||81.15 / 75.48||0.703||160||10.02||0.016||0.832|
|CIFAR-10/ ResNet-18||Adam||97.74 / 86.00||0.55||160||11.97||0.082||0.877|
|MC-dropout||88.23 / 82.85||0.51||161||12.51||0.166||0.768|
|VOGN||91.62 / 84.27||0.477||161||53.14||0.040||0.876|
|ImageNet/ ResNet-18||SGD||82.63 / 67.79||1.38||90||44.13||0.067||0.856|
|Adam||80.96 / 66.39||1.44||90||44.40||0.064||0.855|
|MC-dropout||72.96 / 65.64||1.43||90||45.86||0.012||0.856|
|VOGN||73.87 / 67.38||1.37||90||76.04||0.029||0.854|
|K-FAC||83.73 / 66.58||1.493||60||133.69||0.097||0.856|
|Noisy K-FAC||72.28 / 66.44||1.44||60||179.27||0.080||0.852|
for standard deviations.BBB is not parallelised (other methods have 4 processes), with 1 MC sample used for the convolutional layers (VOGN uses 6 samples per process).
In this section, we compare the quality of the predictive probabilities for various methods. For Bayesian methods, we compute these probabilities by averaging over the samples from the posterior approximations (see Appendix G for details). For non-Bayesian methods, these are obtained using the point estimate of the weights. We compare the probabilities using the following metrics: validation negative log-likelihood (NLL), area under ROC (AUROC) and expected calibration curves (ECE) (Naeini et al., 2015; Guo et al., 2017). For the first and third metric, a lower number is better, while for the second, a higher number is better. See Appendix G for an explanation of these metrics. Results are summarised in Table 1. Out of the 15 metrics (NLL, ECE and AUROC on 5 dataset/architecture combinations), VOGN performs the best or tied best on 10. On the other 5, VOGN is second best, with MC-dropout best on 4. The final metric shows Adam performing well on LeNet-5 (as argued earlier, the small architecture may result in underfitting). We also show calibration curves (DeGroot and Fienberg, 1983) in Figure 1 and Appendix H. Adam is consistently over-confident, with its calibration curve below the diagonal. Conversely, MC-dropout is usually under-confident. On ImageNet, MC-dropout performs well on ECE (all methods are very similar on AUROC), but this required an excessively tuned dropout rate (see Appendix D).
Our final result is to compare performance on out-of-distribution datasets. When testing on datasets that are different from the training datasets, predictions should be more uncertain. We use experimental protocol from the literature (Hendrycks and Gimpel, 2017; Lee et al., 2018; DeVries and Taylor, 2018; Liang et al., 2018) to compare VOGN, Adam and MC-dropout on CIFAR-10. We also borrow metrics from other works (Hendrycks and Gimpel, 2017; Lakshminarayanan et al., 2017) and show predictive entropy histograms and also report AUROC and FPR at 95% TPR. See Appendix I for further details on the datasets and metrics. Ideally, we want the entropy to be high on out-of-distribution data and low on in-distribution data. Our results are summarised in Figure 5 and Appendix I. On ResNet-18 and AlexNet, VOGN’s predictive entropy histograms show the desired behaviour: a spread of entropies for the in-distribution data, and high entropies for out-of-distribution data. Adam has many predictive entropies at zero, indicating Adam tends to classify out-of-distribution data too confidently. Conversely, MC-dropout’s predictive entropies are generally high (particularly in-distribution), indicating MC-dropout has too much noise. On LeNet-5, we observe the same result as before: Adam and MC-dropout both perform well. The metrics (AUROC and FPR at 95% TPR) do not provide a clear story across architectures.
We successfully train deep networks with a natural-gradient variational inference method, VOGN, on a variety of architectures and datasets, even scaling up to ImageNet. This is made possible due to the similarity of VOGN to Adam, enabling us to boost performance by borrowing deep-learning techniques. Our accuracies and convergence rates are comparable to SGD and Adam. Unlike them, however, VOGN retains the benefits of Bayesian principles, with well-calibrated uncertainty and good performance on out-of-distribution data. Better uncertainty estimates open up a whole range of potential future experiments, for example, small data experiments, active learning, adversarial experiments, and sequential decision making or continual learning. Another potential avenue for research is to consider structured covariance approximations.
We would like to thank Hikaru Nakata (Tokyo Institute of Technology) and Ikuro Sato (Denso IT Laboratory, Inc.) for their help on the PyTorch implementation. We are also thankful for the RAIDEN computing system and its support team at the RIKEN Center for AI Project which we used extensively for our experiments. This research used computational resources of the HPCI system provided by Tokyo Institute of Technology (TSUBAME3.0) through the HPCI System Research Project (Project ID:hp190122).
Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pages 249–256, Chia Laguna Resort, Sardinia, Italy, 13–15 May 2010. PMLR. URL http://proceedings.mlr.press/v9/glorot10a.html.
Annual Conference on Computational Learning Theory, pages 5–13, 1993.
Noisy K-FAC [Zhang et al., 2018] attempts to approximate the structure of the full covariance matrix, and therefore the updates are a bit more involved than VOGN (see Equation 4). Assuming a fully-connected layer, we denote the weight matrix by
. The Noisy K-FAC method estimates the parameters of a matrix-variate Gaussian distributionby using the following updates:
where , is a vector of gradients with for all layers , is a vector of activations for all layers , , and with some external damping factor . The covariance parameters are set to and . Similarly to the VOGN update in Equation 4, the gradients are scaled by matrices and , which are related to the precision matrix of the approximation.
Current codebases are only optimised to directly return the sum of gradients over the minibatch. In order to efficiently compute the Gauss-Newton (GN) approximation, we modify the backward pass to efficiently calculate the sum of squared gradients over the minibatch, and extend the solution in Goodfellow  to both convolutional and batch normalisation layers.
Consider a convolutional filter W with dimensions [k,k] and inputs X
with dimensions [M, C, H, W]. Here M is the the size of the minibatch, C is the number of channels, H,W are the spatial dimensions, and k is the filter size. Assuming the stride to be 1 and 0 padding for our convolutions, the filterW will act on [k,k] patches of X shifting by 1 pixel sequentially.
Let be an expansion operator such that,
is the input for the filter W, S are pre-activations and A are the activations. We can compute the gradients and the square of gradients for a loss L as,
For a minibatch of size , let be the pre-activation output for any layer in a deep neural network. Batch normalisation aims to normalise the pre-activation outputs of the layer to zero mean and unit variance. We define as the batch normalised pre-activations with learnable parameters given by,
We can find the squared gradients of a loss function with respect to parameters and by,
The pre-activation gradients can be obtained from the compute graph in PyTorch as shown in Figure 6.
Layer-wise block-diagonal Gauss-Newton approximation. Despite using the method above, it is still intractable to compute the Gauss-Newton matrix (and its inverse) with respect to the weights of large-scale deep neural networks. We therefore apply two further approximations (Figure 7). First, we view the Gauss-Newton matrix as a layer-wise block-diagonal matrix. This corresponds to ignoring the correlation between the weights of different layers. Hence for a network with layers, there are diagonal blocks, and is the diagonal block corresponding to the -th layer (). Second, we approximate each diagonal block with , which is either a Kronecker-factored or diagonal matrix. Using a Kronecker-factored matrix as corresponds to K-FAC; a diagonal matrix corresponds to a mean-field approximation in that layer. By applying these two approximations, the update rule of the Gauss-Newton method can be written in a layer-wise fashion:
where is the weights in -th layer, and
Since the cost of computing is much cheaper compared to that of computing , our approximations make Gauss-Newton much more practical in deep learning.
In the distributed setting (see Figure 2), each parallel process (corresponding to 1 GPU) calculates the GN matrix for its local minibatch. Then, one GPU adds them together and calculates the inverse. This inversion step can also be parallelised after making the block-diagonal approximation to the GN matrix. After inverting the GN matrix, the standard deviation is updated (line 9 in Algorithm 1), and sent to each parallel process, allowing each process to draw independently from the posterior.
In the Noisy K-FAC case, a similar distributed scheme is used, except each parallel process now has both matrices and (see Appendix A). When using K-FAC approximations to the Gauss-Newton blocks for other layers, Osawa et al.  empirically showed that the BatchNorm layer can be approximated with a diagonal matrix without loss of accuracy, and we find the same. We therefore use diagonal with K-FAC and Noisy K-FAC in BatchNorm layers (see Table 2). For further details on how to efficiently parallelise K-FAC in the distributed setting, please see Osawa et al. .
|CIFAR-10/ LeNet-5 (no DA)||Adam||71.98 0.117||0.733 0.021||67.67 0.513||0.937 0.012||0.021 0.002||0.794 0.001||210||6.96|
|BBB||66.84 0.003||0.957 0.006||64.61 0.331||1.018 0.006||0.045 0.005||0.784 0.003||800||11.43|
|MC-dropout||68.41 0.581||0.870 0.101||67.65 1.317||0.99 0.026||0.087 0.009||0.797 0.006||210||6.95|
|VOGN||70.79 0.763||0.880 0.02||67.32 1.310||0.938 0.024||0.046 0.002||0.8 0.002||210||18.33|
|CIFAR-10/ AlexNet (no DA)||Adam||100.0 0||0.001 0||67.94 0.537||2.83 0.02||0.262 0.005||0.793 0.001||161||3.12|
|MC-dropout||97.56 0.278||0.058 0.014||72.20 0.177||1.077 0.012||0.140 0.004||0.818 0.002||160||3.25|
|VOGN||98.68 0.093||0.017 0.005||66.49 0.786||1.12 0.01||0.024 0.010||0.796 0||160||9.98|
|CIFAR-10/ AlexNet||Adam||97.92 0.140||0.057 0.006||73.59 0.296||1.480 0.015||0.262 0.005||0.793 0.001||161||3.08|
|MC-dropout||80.65 0.615||0.47 0.052||77.04 0.343||0.667 0.012||0.114 0.002||0.828 0.002||160||3.20|
|VOGN||81.15 0.259||0.511 0.039||75.48 0.478||0.703 0.006||0.016 0.001||0.832 0.002||160||10.02|
|CIFAR-10/ ResNet-18||Adam||97.74 0.140||0.059 0.012||86.00 0.257||0.55 0.01||0.082 0.002||0.877 0.001||160||11.97|
|MC-dropout||88.23 0.243||0.317 0.045||82.85 0.208||0.51 0||0.166 0.025||0.768 0.004||161||12.51|
|VOGN||91.62 0.07||0.263 0.051||84.27 0.195||0.477 0.006||0.040 0.002||0.876 0.002||161||53.14|
|ImageNet/ ResNet-18||SGD||82.63 0.058||0.675 0.017||67.79 0.017||1.38 0||0.067||0.856||90||44.13|
|Adam||80.96 0.098||0.723 0.015||66.39 0.168||1.44 0.01||0.064||0.855||90||44.40|
|VOGN||73.87 0.061||1.02 0.01||67.38 0.263||1.37 0.01||0.0293 0.001||0.8543 0||90||76.04|
|K-FAC||83.73 0.058||0.571 0.016||66.58 0.176||1.493 0.006||0.097||0.856||60||133.69|
Hyperparameters for training various architectures on CIFAR-10 are given in Tables 6, 7, 8, 9, 10 and 11. Hyperparameters for training ResNet-18 on ImageNet are given in Table 4, with distributed training specific settings in Table 5. Please see Goyal et al.  and Osawa et al.  for best practice on these hyperparameter values.
|optimiser||milestones||weight decay||L2 reg|
|SGD||1.25e-2||1.6||[30, 60, 80]||0.9||-||-||1e-4|
|Adam||1.25e-5||1.6e-3||[30, 60, 80]||0.1||0.001||1e-4||-|
|MC-dropout||1.25e-2||1.6||[30, 60, 80]||0.9||-||-||1e-4|
|VOGN||1.25e-5||1.6e-3||[30, 60, 80]||0.9||0.999||-||-|
|K-FAC||1.25e-5||1.6e-3||[15, 30, 45]||0.9||0.9||-||1e-4|
|Noisy K-FAC||1.25e-5||1.6e-3||[15, 30, 45]||0.9||0.9||-||-|
|VOGN / Noisy K-FAC||4,096||128||1||1||5||1,281,167||133.3||2e-5|
|optimiser||weight decay||L2 reg|
|VOGN||128||4||6||0.1 1||1||50,000||100||2e-4 2e-3|
|optimiser||milestones||weight decay||L2 reg|
|VOGN||128||8||3||0.5 1||10||50,000||0.5||5e-6 1e-5|
|optimiser||milestones||weight decay||L2 reg|
We show MC-dropout’s sensitivity to dropout rate, , in this Appendix. We tune MC-dropout as best as we can, finding that works best for all architectures trained on CIFAR-10 (see Figure 8 for the dropout rate’s sensitivity on LeNet-5 as an example). On ResNet-18 trained on ImageNet, we find that MC-dropout is extremely sensitive to dropout rate, with even performing badly. We therefore use for MC-dropout experiments on ImageNet. This high sensitivity to dropout rate is an issue with MC-dropout as a method.
We show the effect of changing the prior variance ( in Algorithm 1) in Figures 10 and 11. We can see that increasing the prior variance improves validation performance (accuracy and log likelihood). However, increasing prior variance also always increases the train-test gap, without exceptions, when the other hyperparameters are held constant. As an example, training VOGN on ResNet-18 on ImageNet with a prior variance of has train-test accuracy and log likelihood gaps of 2.29 and 0.12 respectively. When the prior variance is increased to , the respective train-test gaps increase to 6.38 and 0.34 (validation accuracy and validation log likelihood also increase, see Figure 10).
With increased prior variance, VOGN (and Noisy K-FAC) reach converged solutions more like their non-Bayesian counterparts, where overfitting is an issue. This is as expected from Bayesian principles.
Figure 12 shows the combined effect of the dataset reweighting factor and prior variance. When is set to a value in the correct order of magnitude, it does not affect performance so much: instead, we should tune . This is our methodology when dealing with . Note that we set for ImageNet to be smaller than that for CIFAR-10 because the data augmentation cropping step uses a higher portion of the initial image than in CIFAR-10: we crop images of size 224x224 from images of size 256x256.
In the paper, we report results for training ResNet-18 on ImageNet using 128 GPUs, with 1 independent Monte-Carlo (MC) sample per process during training (mc=128x1), and 10 MC samples per validation image (val_mc). We now show that increasing either of training or testing MC samples improves performance (validation accuracy and log likelihood) at the cost of increased computation time. See Figure 13.
Increasing the number of training MC samples per process reduces noise during training. This effect is observed when training on CIFAR-10, where multiple MC samples per process are required to stabilise training. On ImageNet, we have much larger minibatch size (4,096 instead of 256) and more parallel processes (128 not 8), and so training with 1 MC sample per process is still stable. However, as shown in Figure 13, increasing the number of training MC samples per process to from 1 to 2 speeds up convergence per epoch, and reaches a better converged solution. The time per epoch (and hence total runtime) also increases by approximately a factor of 1.5. Increasing the number of train MC samples per process to 3 does not increase final test performance significantly.
Increasing the number of testing MC samples from 10 to 100 (on the same trained model) also results in better generalisation: the train accuracy and log likelihood are unchanged, but the validation accuracy and log likelihood increase. However, as we run an entire validation on each epoch, increasing validation MC samples also increases run-time.
These results show that, if more compute is available to the user, they can improve VOGN’s performance by improving the MC approximation at either (or both) train-time or test-time (up to a limit).
We use several approaches to compare uncertainty estimates obtained by each optimiser. We follow the same methodology for all optimisers: first, tune hyperparameters to obtain good accuracy on the validation set. Then, test on uncertainty metrics. For multi-class classification problems, all of these are based on the predictive probabilities. For non-Bayesian approaches, we compute the probabilities for a validation input as , where is the weight vector of the DNN whose uncertainty we are estimating. For Bayesian methods, we can compute the predictive probabilities for each validation example as follows:
where are samples from the Gaussian approximation returned by a variational method. We use 10 MC samples at validation-time for VOGN and MC-dropout (the effect of changing number of validation MC samples is shown in Appendix F). This increases the computational cost during testing for these methods when compared to Adam or SGD.
Using the estimates , we use three methods to compare uncertainties: validation log loss, AUROC and calibration curves. We also compare uncertainty performance by looking at model outputs when exposed to out-of-distribution data.
Validation log likelihood. Log likelihood (or log loss) is a common uncertainty metric. We consider a validation set of examples. For an input , denote the true label by , a 1-of- encoded vector with 1 at the true label and 0 elsewhere. Denote the full vector of all validation outputs by . Similarly, denote the vector of all probabilities by , where . The validation log likelihood is defined as,
Tables 1 and 3 show final validation (negative) log likelihood. VOGN performs very well on this metric (aside from on CIFAR-10/AlexNet, with or without DA, where MC-dropout performs the best). All final validation log likelihoods are very similar, with VOGN usually performing similarly to the other best-performing optimisers (usually MC-dropout).
Area Under ROC curves (AUROC). We consider Receiver Operating Characteristic (ROC) curves for our multi-way classification tasks. A potential way that we may care about uncertainty measurements would be to discard uncertain examples by thresholding each validation input’s predicted class’ softmax output, marking them as too ambiguous to belong to a class. We can then consider the remaining validation inputs to either be correctly or incorrectly classified, and calculate the True Positive Rate (TPR) and False Positive Rate (FPR) accordingly. The ROC curve is summarised by its Area Under Curve (AUROC), reported in Table 1. This metric is useful to compare uncertainty performance in conjunction with the other metrics we use. The AUROC results are very similar between optimisers, particularly on ImageNet, although MC-dropout performs marginally better than the others, including VOGN. On all but one CIFAR-10 experiment (AlexNet, without DA), VOGN performs the best, or tied best. Adam performs the worst, but is surprisingly good in CIFAR-10/ResNet-18.
Calibration Curves. Calibration curves [DeGroot and Fienberg, 1983] test how well-calibrated a model is by plotting true accuracy as a function of the model’s predicted accuracy (we only consider the predicted class’ ). Perfectly calibrated models would follow the diagonal line on a calibration curve. We approximate this curve by binning the model’s predictions into bins, as is often done. We show calibration curves in Appendix H, as well as Figure 1. We can also consider the Expected Calibration Error (ECE) metric [Naeini et al., 2015, Guo et al., 2017], reported in Table 1. ECE calculates the expected error between the true accuracy and the model’s predicted accuracy, averaged over all validation examples, again approximated by using bins. Across all datasets and architectures, with the exception of LeNet-5 (which we have argued causes underfitting), VOGN usually has better calibration curves and better ECE than competing optimisers. Adam is consistently over-confident, with the calibration curve below the diagonal. Conversely, MC-dropout is usually under-confident, with too much noise, as mentioned earlier. The exception to this is on ImageNet, where MC-dropout performs well: we excessively tuned the MC-dropout rate to achieve this (see Appendix D).
We show calibration curves comparing VOGN, Adam and MC-dropout for final trained models from Table 1. The calibration curve for ResNet-18 trained on ImageNet is in Figure 1. VOGN is extremely well-calibrated compared to the other two optimisers (except for LeNet-5, where all optimisers peform well).
We use experiments from the out-of-distribution tests literature [Hendrycks and Gimpel, 2017, Lee et al., 2018, DeVries and Taylor, 2018, Liang et al., 2018], comparing VOGN to Adam and MC-dropout. Using trained architectures (LeNet-5, AlexNet and ResNet-18) on CIFAR-10, we test on SVHN, LSUN (crop) and LSUN (re-size) as out-of-distribution datasets, with the in-distribution data given by the validation set of CIFAR-10 (10,000 images). The entire training set of SVHN (73,257 examples, 10 classes) [Netzer et al., 2011]
is used. The test set of LSUN (Large-scale Scene UNderstanding dataset[Yu et al., 2015], 10,000 images from 10 different scenes) is randomly cropped to obtain LSUN (crop), and is down-sampled to obtain LSUN (re-size). These out-of-distribution datasets have no similar classes to CIFAR-10.
Similar to the literature [Hendrycks and Gimpel, 2017, Lakshminarayanan et al., 2017], we use 3 metrics to test performance on out-of-distribution data. Firstly, we plot histograms of predictive entropy for the in-distribution and out-of-distribution datasets, seen in Figure 5, 15, 16 and 17. Predictive entropy is given by . Ideally, on out-of-distribution data, a model would have high predictive entropy, indicating it is unsure of which class the input image belongs to. In contrast, for in-distribution data, good models should have many examples with low entropy, as they should be confident of many input examples’ (correct) class. We also compare AUROC and FPR at 95% TPR, also reported in the figures. By thresholding the most likely class’ softmax output, we assign high uncertainty images to belong to an unknown class. This allows us to calculate the FPR and TPR, allowing the ROC curve to be plotted, and the AUROC to be calculated.
M.E.K., A.J., and R.E. conceived the original idea. This was also discussed with R.Y. and K.O. and then with S.S. and R.T. Eventually, all authors discussed and agreed with the main focus and ideas of this paper.
The first proof-of-concept was done by A.J. using LeNet-5 on CIFAR-10. This was then extended by K.O. who wrote the main PyTorch implementation, including the distributed version. R.E. fixed multiple issues in the implementation, and also pointed out an important issue regarding data augmentation. S.S., A.J., K.O., and R.E. together fixed this issue. K.O. conducted most of the large experiments (shown in Fig. 1 and 4). The results shown in Fig. 3 was done by both K.O. and A.J. The BBB implementation was written by S.S.
The experiments in Section 4.2 were performed by A.J. and S.S. The main ideas behind the experiments were conceived by S.S., A.J., and M.E.K. with many helpful suggestions from R.T.
The main text of the paper was written by M.E.K. and S.S. The section on experiments was first written by S.S. and subsequently improved by A.J., K.O., and M.E.K. R.T. helped edit the manuscript. R.E. also helped in writing parts of the paper.
M.E.K. led the project with a significant help from S.S.. Computing resources and access to the HPCI systems were provided by R.Y.