Bayesian-Torch is a library of neural network layers and utilities extending the core of PyTorch to enable the user to perform stochastic variational inference in Bayesian deep neural networks
We propose Radial Bayesian Neural Networks: a variational distribution for mean field variational inference (MFVI) in Bayesian neural networks that is simple to implement, scalable to large models, and robust to hyperparameter selection. We hypothesize that standard MFVI fails in large models because of a property of the high-dimensional Gaussians used as posteriors. As variances grow, samples come almost entirely from a `soap-bubble' far from the mean. We show that the ad-hoc tweaks used previously in the literature to get MFVI to work served to stop such variances growing. Designing a new posterior distribution, we avoid this pathology in a theoretically principled way. Our distribution improves accuracy and uncertainty over standard MFVI, while scaling to large data where most other VI and MCMC methods struggle. We benchmark Radial BNNs in a real-world task of diabetic retinopathy diagnosis from fundus images, a task with 100x larger input dimensionality and model size compared to previous demonstrations of MFVI.READ FULL TEXT VIEW PDF
Bayesian-Torch is a library of neural network layers and utilities extending the core of PyTorch to enable the user to perform stochastic variational inference in Bayesian deep neural networks
Code to accompany the paper Radial Bayesian Neural Networks: Beyond Discrete Support In Large-Scale Bayesian Deep Learning
PyTorch implementation of 'Radial Bayesian Neural Networks: Beyond Discrete Support In Large-Scale Bayesian Deep Learning'
. But while the most exciting applications of deep learning use huge models, most of the work on Bayesian neural networks (BNNs)—which learn distributions over weights—focuses on expensive techniques and small networks with hundreds or thousands of parameters and low-dimensional problems (e.g., MNIST).
In principle, variational inference (VI) in BNNs should let us approximate rich posterior distributions over huge BNNs. With recent improvements (Graves, 2011; Kingma et al., 2014; Rezende et al., 2014; Blundell et al., 2015) and the ‘mean-field’ approximation that weight distributions are independent of each other, VI in BNNs is fast and has time complexity linear in the number of parameters. The trouble is: mean-field variational inference (MFVI) is hard to use, especially in big networks. Results are very sensitive to hyperparameters and initialization (Wu et al., 2019). Worse, researchers often rely on ad-hoc tweaks
to the loss function to make MFVI work whichside-step the variational inference arguments that motivated the approach in the first place. Without these ad-hoc tweaks, for MFVI, variances grow and training accuracy falls even as the loss improves (see figure 1). Setting aside concerns about whether a loss with ad-hoc tweaks is really VI at all, an ad-hoc fix that works for one setting might not work in another. This makes it hard to apply MFVI in real-world applications where extensive search over configurations is expensive and often comes with a heavy environmental cost.
Why is MFVI is so brittle? MFVI approximates the posterior distribution over weights using a multivariate Gaussian with a diagonal covariance matrix. But Gaussian distributions in high dimensions have well-known pathologies. Even though the highest probability density is at the mean, most of the probability mass isclustered in a ‘soap-bubble’ hypersphere. The location of the ‘soap-bubble’ depends on the variances and number of dimensions. Figure 3
shows the probability density function over the radius of a multivariate mean-field (factorized) Gaussian, for example as found in a 3x3 convolutional layer with 128 channels. As we see, samples from the approximate posterior are overwhelmingly likely to lie far from the mode, and this effect is stronger withlarge posterior variance and big networks.
In response, we propose the Radial BNN.
We adopt a simple approximate posterior that does not have ‘soap-bubble’ pathologies in high dimensions (figure 3, orange).
Our variational posterior is fully factorized in the radial or hyperspherical coordinate system.
We sample from a Gaussian distribution over the radial dimension, and a uniform distribution over angles
Gaussian distribution over the radial dimension, and a uniform distribution over angles. In contrast we call the standard MFVI multivariate Gaussian posterior Cartesian because it is fully factorized in a Cartesian basis. We derive a simple estimator for the ELBO loss for Radial BNNs for two useful priors. Unlike methods that relax the ‘mean-field’ assumption (Louizos and Welling, 2016; Sun et al., 2017, 2019; Oh et al., 2019; Wu et al., 2019), the computational complexity of training and inference in Radial BNNs remains linear in the number of parameters.
We show empirically in §3 that our Radial BNN, compared with ordinary MFVI:
has higher test accuracy and better calibrated uncertainty estimates;
does not exhibit the pathologies shown in figure 1 and needs no ad-hoc tweaks of the loss;
is robust to hyperparameter choice.
In §3.1 we use the ‘continual learning’ setting from Nguyen et al. (2018). A model must learn a series of tasks without forgetting, and with no access to old data. VI must learn a good approximation to the true posterior to serve as a prior when learning future tasks. This is a challenging test for BNNs, and requires a posterior with full support over the parameter space. §3.2 presents a large medical imaging dataset of retinal scans (see figure 3) used to refer patients to an expert in diabetic retinopathy pre-screening (Leibig et al., 2017). Uncertainty is important for identifying patients for whose images are inconclusive, perhaps because of camera artefacts. This dataset has ~200,000 input dimensions and a model with 1.3M parameters, orders of magnitude larger than experiments used to benchmark uncertainty in prior work.
A well-known property of multivariate Gaussians in high dimensions is that the probability mass concentrates in a ‘soap-bubble’—a narrow shell at a radius determined by the variance and the number of dimensions (e.g., (Bishop, 2006)). We see this by examining the probability density function of the multivariate Gaussian along its radius. Consider a -dimensional isotropic Gaussian. We examine a thin shell with thickness , which tends to zero, at distance between a sampled point, , and the mean of the multivariate Gaussian, . The probability density function over the radius is given by:
where is the surface area of a hypersphere in a -dimensional space. The term reflects the growing volume in shells away from the origin. For small , and for the large found in BNNs with many parameters, this term (red in figure 4) dominates and drives the probability density to zero. The exponential term reflects the Gaussian density (inverse shown in green in figure 4). For larger the exponential term drives the probability density to zero. Almost all the probability mass is in the ‘soap-bubble’ in the region where neither term dominates. We consider the isotropic case here for simplicity, but the non-isotropic Gaussian has a similar soap-bubble.111Oh et al. (2018) consider soap-bubbles in Bayesian optimization. But it has not been considered for MFVI.
In VI one proposes a parameterized variational distribution
over the weights of a neural network and minimizes the Kullback-Leibler divergencebetween the variational distribution and the true posterior over the weights given the dataset, (Jordan et al., 1999). Applying Jensen’s inequality and Bayes’ rule gives the negative evidence lower bound (ELBO) objective:
Minimizing this objective w.r.t. gives an estimate of the posterior distribution of the weights, , given the training data and a prior.
As a result of the soap-bubble effect, for high-dimensional weights with larger variance, the probability that any sample from is near the mode becomes overwhelmingly small. This is important because when variances are very small the BNN is just approximating an ordinary deterministic neural network. The KL-divergence term of equation (2) rewards larger variances because it penalizes low entropy posteriors, all else equal. The pressure towards a high variance is largest when the model is large because the KL term is proportional to the number of weights in the model.
At the same time, we observe that the design choices which previous authors have used to get good results from MFVI often serve to artificially stop the posterior variance growing. Principally, putting an ad-hoc term downweighting the KL divergence in the loss reduces the pressure from the loss that would drive variances away from their initialization. (Blundell et al., 2015; Fortunato et al., 2018). This combines with the fact that most authors initialize the variance of the parameters very small and then use a combination of learning rate decay and/or early stopping to stop the variances growing too large (Blundell et al., 2015; Fortunato et al., 2017; Nguyen et al., 2018). In practice, most implementations of MFVI make ad-hoc decisions that artificially keep the variance of the weights small, which avoids the ‘soap-bubble’ effect kicking in.222Some works may or may not use such tweaks. For example, Lipton et al. (2018) initialize with the prior’s , but do not report their prior. Houthooft et al. (2016) initialize and do not weight the KL term. But their biggest model has only 2 layers of 64 hidden units.
Motivated by the observation that the Cartesian MFVI posterior exhibits strange ‘soap-bubble’ behaviour as variances grow, and that ad-hoc fixes for MFVI act to constrain the variance, we introduce a simple posterior that has no ‘soap-bubble’ pathologies. Our posterior must, instead, continue to sample near the mode (as well as elsewhere) as the posterior variance grows.
Since the ‘soap-bubble’ depends on the distributional properties over the Euclidian distance between a mean and each sample, we focus on the hyperspherical or radial coordinate system for which this is the first dimension. A distribution without a soap bubble would have a p.d.f. over the radius which is maximal at the mean and monotonically decreasing away from it. A Gaussian distribution over radius is a simple distribution that satisfies those properties and (heuristically) has motivations from maximum entropy and limit theorems. A uniform distribution over angles allows simple analytical results. We frame this using the reparameterization trick(Rezende et al., 2014; Kingma et al., 2014; Blundell et al., 2015).333
We indicate vectors in a hyperspherical basis using superscriptwhile indicates a Cartesian basis.
where is the scalar radius and is the Cartesian multivariate unit Gaussian which would typically be used in MFVI. This uses the fact that sampling from a multivariate Gaussian divided by its norm samples uniformly from the unit hypersphere.
Relative to the Cartesian posterior, computing the forward pass only requires an extra normalization operation and multiplying by a scalar unit Gaussian. This leaves the time complexity unchanged from standard MFVI at , where is the number of weights. In practice, we found that because our method was much more robust to hyperparameters than the Cartesian baseline, the most accurate hyperparameter configuration we used in §3.2 was about four times faster for our radial model than the Cartesian baseline. In contrast, recent non-mean-field extensions to VI like Louizos and Welling (2016) and Sun et al. (2017) have higher time complexities. For example, Louizos and Welling (2016) uses a pseudo-data approximation which reduces their complexity to where is a pseudo-data count. But even for MNIST, they use up to 150 which becomes quite significant (this is their largest experiment). Sun et al. (2017) have the same complexity as Louizos and Welling (2016), depending on similar approximations and consider a maximum input dimension of only 16— over 50,000x smaller than our §3.2.
In order to compute the ELBO loss function we must compute a log-likelihood and KL-divergence. Using our radial approximate posterior does not change how we estimate the log-likelihood. However, the KL divergence between our posterior and prior requires a new estimator.
The KL-divergence in equation (2) splits into two terms:
Each of these is an expectation over the posterior. is the expectation of the log-prior, for which we continue to use a Cartesian multivariate Gaussian as in earlier work (to make comparisons as fair as possible). As a result, estimating may be done by Monte Carlo estimation in exactly the same way as in Blundell et al. (2015).
The entropy, however, is an expectation of the log probability density function of the new variational posterior and requires a new derivation. We show in Appendix A that the entropy term is equal to:
where sum over the weights. This is, up to a constant, the same as when using an ordinary Cartesian multivariate Gaussian. In Appendix B, we extend this derivation to the use of our radial distribution as a prior, which we use in the experiments of §3.1, where we use the posterior from training one model as a prior when training another.
Unlike previous work with alternative variational distributions, this makes our method easy to implement. Despite its simplicity, we show in the next section that, compared with Cartesian MFVI, our method is more robust to hyperparameter choice, more accurate, has better uncertainty estimates, and requires no ad-hoc weighting of the KL-divergence.
Our work is focused on large datasets and big models. That is where complicated full-covariance posteriors become intractable, and where the ‘soap-bubble’ pathology emerges. In §3.1, we consider a challenging problem for variational inference: the variational continual learning problem (Nguyen et al., 2018) on FashionMNIST (Xiao et al., 2017). This is a challenging test of a variational distribution’s ability to learn the true posterior.
Our most important evaluation is in §3.2. We show that on a large-scale diabetic retinopathy diagnosis image classification task: our radial posterior is more accurate, has better calibrated uncertainty, and is more robust to hyperparameters than the Cartesian baseline and therefore requires significantly fewer iterations and less experimenter time. In this setting, we have ~200,000 input dimensions and use a model with ~1.3M parameters. This is orders of magnitude larger than most other VI work, which largely considers only the UCI datasets used by Hernández-Lobato and Adams (2015) with between 4 and 16 input dimensions and using fewer than 2000 parameters, or MNIST which has only 784 input dimensions.
Continual learning is a problem setting where a sequence of tasks must be learned separately while a single model is carried from one task to the next but all data are discarded (Kirkpatrick et al., 2017). This is hard because neural networks tend to exhibit ‘catastrophic forgetting’ and lose performance on previous tasks. A number of authors have proposed prior-focused Bayesian approaches to continual learning in which the posterior at the end of learning a task becomes a prior when learning the next task (Kirkpatrick et al., 2017; Zenke et al., 2017; Chaudhry et al., 2018; Nguyen et al., 2018; Farquhar and Gal, 2018b; Ritter et al., 2018). In the case of exact Bayesian updating, this ought to balance the information learned from the datasets of each tasks. But for approximate methods we have no such guarantee. The better the posterior approximation, the better we might expect such prior-focused Bayesian approaches to work.
Variational Continual Learning (VCL), by Nguyen et al. (2018), applies MFVI to learning the posterior. Here, we use VCL as a problem setting to evaluate the quality of the posterior produced by MFVI. Note that we do not aim to solve the continual learning problem, but rather to demonstrate the effects of badly calibrated posterior uncertainty. A good posterior estimate should work as an effective prior and prevent forgetting. This setting is particularly relevant to variational inference, as other methods for estimating uncertainty in neural networks (such as Monte-Carlo dropout (Gal and Ghahramani, 2015) or ensembles (Lakshminarayanan et al., 2016)) cannot be straightforwardly used during training as a prior.
We consider a sequence of five tasks known as Split FashionMNIST (Nguyen et al., 2018; Farquhar and Gal, 2018a). FashionMNIST is a dataset of images of items of clothing or attire (shoes, t-shirts, handbags etc.) (Xiao et al., 2017)
. The first task is to classify the first 2 classes of FashionMNIST, then the next 2 etc. Each task has 12,000 images of 28x28 pixels as a training set and 2,000 as a test set. We examine a multi-headed model(Chaudhry et al., 2018; Farquhar and Gal, 2018a) in order to evaluate the quality of the posterior. The models are BNNs with four hidden layers with 200 weights in each (~250k parameters). We perform an extensive grid search over hyperparameters. Full hyperparameters and a more thorough description of the experimental settings, as well as results for the single-headed continual learning setting, are in Appendix C.1.
The radial posterior acts as a better prior, showing that it learns the true posterior better (figure 5). Our radial posterior maintains nearly perfect accuracy even after all five tasks. In contrast, the Cartesian posterior starts to have worse accuracy on the old tasks as training progresses. The Cartesian posterior approximation is not close enough to the true posterior to carry the right information to the next task.
We perform classification using the Kaggle diabetic retinopathy dataset (Kaggle, 2015). This is a dataset of ‘fundus’ images taken of the back of retinas. Diabetic retinopathy is graded in five stages, where 0 is healthy and 4 is the worst. Following Leibig et al. (2017), we distinguish the healthy (classes 0 and 1) from those that require medical observation and attention (2, 3, and 4). Images include left and right eyes separately, which are not considered as a pair by the models, and come from two different types of camera in many different physical locations. We also use model uncertainty to identify badly taken or confusing images and refer these patients to experts for more detailed examination.
The training set has 10992 RGB images of 256x256 pixels (downsampled from roughly 2k x 2k). There are 2748 validation and 10001 test images. We use a BNN similar to a scaled-down VGG-16 with ~1.3M parameters. Unlike some prior work using MFVI, we have not downweighted the KL-divergence during training. The models shown were selected based on best validation set accuracy with a random search over 86 different runs choosing randomly from plausible optimizer, learning rate, learning rate decay, batch size, number of variational samples per forward pass, and initial variance. Full hyperparameters and search strategy, preprocessing, and architecture are provided in Appendix C.3.
Area under the receiver operating characteristic curve (AUC-ROC) is the best measure of model performance on this task because the classes are unbalanced(Leibig et al., 2017). It lets us assess both the true positive and false positive rate. In figure 7, we show that the radial posterior has higher AUC-ROC than the Cartesian baseline no matter how much data referred to experts because the prediction is uncertain. Moreover, the gap widens as we refer more data to experts. This shows that the uncertainty information in the radial posterior is better calibrated than the Cartesian posterior. Test accuracy is shown in appendix C.4 and is similar. For this figure, we picked the best hyperparameters from our random search and trained an additional 5 models with those hyperparameters in order to ensure robustness to random seed.
We also found that the radial posterior was much more robust to hyperparameter variation (figure 7). 82% of hyperparameters tried for the Cartesian baseline resulted in barely any improvement over randomly guessing, compared with 39% for the radial posterior. 44% of configurations for our radial posterior reached good AUCs, compared with only 11% for the Cartesian posterior.
Why does the radial posterior outperform the Cartesian? As we showed in figure 1, for the Cartesian posterior the training accuracy falls even though the loss improves. As we discuss in §2, the model is accepting a low accuracy even on the training data because increasing the variance of the posterior benefits the loss more than increasing the accuracy. This creates particular problems for Cartesian MFVI because the ‘soap-bubble’ property means that samples from the posterior are far from the mode with very high probability. We support this here empirically. Figure 8, from the same run as figure 1, shows that the average parameter in the model grows similarly for both the Radial and Cartesian posterior during training (shading between the highest and lowest in the whole model). However, for the Cartesian posterior, the average distance between the sampled weights and the mode is about 100x higher. This aligns with our hypothesis that large weight variances are causing the ‘soap-bubble’ sampling dynamics kick in when the growing is not artificially limited.
. Markov Chain Monte Carlo (MCMC) methods are one way to approximate the intractable posterior distribution for BNNs(Neal, 1996; Welling and Teh, 2011). However, MCMC methods struggle with large datasets and generally fail to find more than a single mode (Neal, 1996). This motivates the use of variational inference (VI) in BNNs, which has the potential to scale BNNs to very large datasets. Moreover, we are particularly interested in variational distributions that have full support. Some scalable alternatives to MFVI such as Gal and Ghahramani (2015); Lakshminarayanan et al. (2016) have zero probability density almost everywhere, which makes it impossible to use for tasks like continual learning.
Graves (2011) propose the use of Monte Carlo estimates of the KL-divergence in the ELBO objective. Blundell et al. (2015) build on this by applying the reparameterization trick (Kingma et al., 2014; Rezende et al., 2014) by sampling from a fixed noise distribution, , and learning additive and multiplicative scaling parameters and such that the weights, w are given by
The variational posterior is then a multivariate Gaussian distribution with a diagonal covariance matrix (the ‘mean-field’ approximation) e.g., (Graves, 2011; Blundell et al., 2015; Nguyen et al., 2018; Lipton et al., 2018). We call this the ‘Cartesian’ posterior. The use of a Cartesian posterior has been motivated by the limit theorem style results (Walker, 1969; MacKay, 1992; Barber and Bishop, 1998). In the common case of a unit multivariate Gaussian prior the ELBO objective is a sum over the weights:
which, as we note in §2 has the same posterior entropy term as in our Radial BNN.
In practice, training BNNs with MFVI is difficult. For example, Wu et al. (2019) argue that it is sensitive to initialization and priors. Alternative VI approaches such as Monte Carlo Dropout (Gal and Ghahramani, 2015) have been proposed, but the finite support of the dropout distribution makes it difficult to use in applications such as continual learning (see §3.1). Instead, in situations which require a posterior with full support, adjustments to MFVI are often made which are not justified by the theory of variational inference are often made. Blundell et al. (2015)
weight the KL-divergence term of the loss so that, as each epoch goes on, the loss is dominated by the likelihood. Alternatively,Fortunato et al. (2018) use a method that is identical to MFVI with the reparameterization trick but with the KL-term of the loss removed. Other authors initialize the variance to be very small (e.g., in Nguyen et al. (2018)) as a way of counteracting the tendency of the variance to become very large as training goes on. Learning rate decay can also stop variances growing too much, e.g., Fortunato et al. (2017) have decayed learning rates to 0.3% of their initial values by the end of training while the variance gradient may still be non-negligible.
Researchers worried that the mean-field approximation is too constraining have explored alternative variational distributions. Louizos and Welling (2016); Sun et al. (2017, 2019) and Oh et al. (2019) have all introduced richer variational distributions which permit correlations between weights to be learned by the BNN. Oh et al. (2019) in particular have introduced a correlated posterior which is also decomposed in a hyperspherical basis. In a slightly different direction, Wu et al. (2019)
introduce a deterministic approximation of VI based on the central limit theorem. Our work differs from all of these because we do not try to introduce correlations between weights or rely on complicated distributions. Rather, we focus on a problem with sampling in high dimensions using large real-world models and introduce the simplest possible distribution that overcomes those problems. Note that although our radial distribution’s weights are correlated in the Cartesian basis, they fully factorize in the hyperspherical basis—requiring only a simple coordinate transformation and the same number of parameters. Radial BNNs can therefore be considered to be mean-field variational inference in an alternative coordinate system. The more complex methods have only been demonstrated on low dimensional problems. The largest experiment carried out in the papers presenting these non-mean-field VI approaches is MNIST and almost all stick to synthetic data and the UCI regression problems(Hernández-Lobato and Adams, 2015) with input dimension between 4 and 16. In contrast, our experimental settings involve thousands of times more parameters and input dimensions.
For variational auto-encoders, Davidson et al. (2019) introduced parameterizations of latent spaces with explicitly hyperspherical structure intended for settings where the ‘true’ latent structure was hyperspherical. In contrast to this, our variational posterior is not constrained to a hyperspherical manifold and does not reflect a hyperspherical structure of the latent space.
MFVI in neural networks has huge potential. But, until now, it has been difficult to use. We have introduced a radial MFVI posterior distribution with advantages over the standard Cartesian posterior. Our radial posterior produces models that are more accurate and have better uncertainty estimates, are more robust to hyperparameters, and do not require ad-hoc weights in the ELBO loss. At the same time, we do not rely on non-diagonal covariances or increased complexity, which means that our approach scales easily to large models. And unlike MC dropout or ensemble methods, our method does not have the drawback of finite support, which means that it can be applied to settings like continual learning. Radial BNNs offer simple, scalable, and robust mean-field variational inference.
Probabilistic Backpropagation for Scalable Learning of Bayesian Neural Networks.Proceedings of the 32nd International Conference on Machine Learning, 2015.
In this section, we show that the component of KL-divergence term of the loss which is the entropy of the posterior distribution over the weights can be estimated as:
where is an index over the weights of the model.
Throughout this section, when we write we are referring to the radial posterior’s noise distribution and omitting the subscript. The superscript indicates the basis.
We begin by applying the reparameterization trick [Kingma et al., 2014, Rezende et al., 2014]. Following the auxiliary variable formulation of Gal , we express the probability density function of with an auxiliary variable.
In equation (12), we have used a reparameterization trick transformation:
where and are parameters of the model and where is the standard transformation from hyperspherical into Cartesian coordinates.
Then, we perform a coordinate transformation from to using the Jacobian of the transformation and simplify using the independence of the probability density function in the hyperspherical coordinate system.
In the last line we have used the fact that allowing us to pull the determinant of this diagonal matrix out.
is the determinant of the Jacobian for the transformation from Cartesian to hyperspherical coordinates for which we use the result by Muleshkov and Nguyen :
We know that because the radial dimension in hyperspherical coordinates can be assumed positive without loss of generality. We also know for for the hyperspherical coordinate system. So we can simplify the signs:
The probability density function of the noise variable is separable into independent distributions. The distribution of is a unit Gaussian, one angular dimension is uniform between and , and the rest are uniform between and . The resulting probability density function is:
This first term is constant and therefore not needed for our optimization.
Inserting the probability density function from equation (26) into the second term of the loss we get
This second term is identical to the entropy of the Cartesian variational posterior typically used.
And for the third term we begin by expanding the logarithm and simplifying:
and then inserting the p.d.f. from equation (26) and solving analytically tractable integrals:
where is the Euler-Mascheroni constant. This is, again, constant and may be neglected for optimization.
As a result, we can minimize the entropy term of the loss simply by finding
In most of our experiments, we use a typical Cartesian prior in order to ensure comparability with prior work. However, in some settings, such as the Variational Continual Learning setting, it is useful to use the radial posterior as a prior. We begin similarly to the previous derivation, with all unchanged expect that we are estimating
The derivation proceeds similarly until equation (24), and the second and third terms are identical except the second term taking a product over elements of of the prior, not the posterior.
Evaluating the gradient of the log probability density function of the prior depends only on the radial term, since the distribution is uniform in all angular dimensions. We therefore find
Rather than solve the integral, we can estimate this as a Monte Carlo approximation:
By adding the three terms we estimate the cross-entropy term of the ELBO loss function.
We build on the code provided by Nguyen et al.  at https://github.com/nvcuong/variational-continual-learning
adapted for FashionMNIST. The FashionMNIST dataset was downloaded using pytorch’s built in vision datasets. The data were normalized by subtracting the training set mean and dividing by the training set standard deviation.
The classes are ordered in the conventional order. The model is initialized randomly—without pretraining the means (unlike Nguyen et al. ). The model is then trained on the first two classes. The weights are carried over to the next task and set as a prior, while the model is trained on the next two classes, and so on. Note that we perform the tasks in a multi-headed way—each task has its own output head. This may not be an ideal exemplar of the continual learning problem [Chaudhry et al., 2018, Farquhar and Gal, 2018a] but it forms an effective test of the posterior. We do not use coresets, unlike Nguyen et al. , as this would not form an effective test of the quality of the posterior.
Models are Bayesian MLPs with four hidden layers with 200 units in each. The prior for training was a unit multivariate Gaussian. Instead of optimizing directly we in fact optimize such that which guarantees that is always positive. Models are optimized using Amsgrad [Reddi et al., 2018] with learning rate 0.001 with shuffling and discarding final incomplete batches each epoch. We perform a grid search over the number of epochs each task is trained over (3, 5, 10, 15, 20, 60, 120) and batch sizes (1024, 2048, 10000). We used 90% of the standard training dataset (54000 points) as a training dataset, with 10% (6000 points) withheld as a validation dataset. We initialize to and use the initialization by He et al.  for the means. The radial posterior would work with a much larger , but we wanted to use comparable initializations for each. We optimized for average validation accuracy over all models on the final task. We used the standard 10000 points as a test dataset. The best configuration for the Cartesian posterior was found to be 60 epochs of batch size 1024 (note that this differs from the 120 epochs of batch size 12000 reported in Nguyen et al.  perhaps because they pretrain the means). The best configuration for the radial posterior was found to be 20 epochs of batch size 1024. We report the individual accuracies for each head on the test dataset.
Previous authors have noted that for continual learning the single-headed environment—where the model has a single output head shared over all tasks and must therefore identify the task as well as the output based on the input—is a much harder task, possibly more reflective of continual learning [Chaudhry et al., 2018, Farquhar and Gal, 2018a]. While the multi-headed setting suffices to demonstrate improvement to the posterior, we offer some results for the single-headed setting here in the appendix for the interest of continual learning specialists, though we do not find that our posterior solves the problem.
We perform a similar grid search as before, selecting the hyperparameters that offer the highest average validation set accuracy for the final model over all five tasks. Note that in our grid search each task gets the same hyperparameters, reflecting the idea that the task distribution is not known in advance.
Our radial posterior does not solve the continual learning single-headed problem, but it does show interestingly improved performance relative to the Cartesian baseline. As we show in figure 9, the radial posterior shows some remembering on old tasks (which includes identifying the task that the image comes from). Moreover it is able to maintain good accuracy on the newest task. Meanwhile, the hyperparameters that allow Cartesian baseline to optimize last-task average accuracy mean it learns a very uncertain model which has bad accuracy on the newest tasks. This is because hyperparameters that would let it learn a high-accuracy model for the newest task would cause it to forget everything it saw earlier.
The diabetic retinopathy data are publicly available at https://www.kaggle.com/c/diabetic-retinopathy-detection/data. There are 10992 training images, 2748 validation images and 10001 test images. We augment and preprocess them similarly to Leibig et al. 
, though we do not perform the Gaussian blur preprocessing that they do. We first downsample the images to 256x256. We randomly flip horizontally and vertically. Then randomly rotate 180 degrees in either direction. Then we pad by between 0 and 5% of the width and height and randomly crop back down to 256x256. We then randomly crop to between 90% and 110% of the image size, padding with zeros if needed. We finally resize again to 256x256 and normalize the means and standard deviations of each channel separately based on the training set means and standard deviations.
The model is based on the architecture used in Leibig et al. . It is loosely inspired by VGG-16, with a quarter the number of channels, except that it is a Bayesian neural network with mean and standard deviations for each weight, and that instead of fully connected networks at the end it uses a concatenated global mean and average pool. As a result, we do not use dropout—the Bayesian weights already act as stochastic regularizers. The prior for training was a unit multivariate Gaussian. (We also tried using the scale mixture prior used in Blundell et al.  and found it made no difference.) Instead of optimizing directly we in fact optimize such that which guarantees that is always positive. The first epoch only trained the means and uses a NLL loss function. This helps the optimization, but still allows the variances to train fully (unlike reweighting the KL-divergence). Thereafter we trained using the full ELBO loss over all parameters.
We performed an extensive random hyperparameter search. We tested each configuration for both the Cartesian and radial posterior. We tested each configuration for both an SGD optimizer and Amsgrad. When training with SGD we used Nesterov momentum 0.9 and uniformly sampled from 0.01, 0.001 and 0.0001 as learning rates, with a learning rate decay each epoch of either 1.0 (no decay), 0.98 or 0.96. When training with Amsgrad we uniformly sampled from learning rates of 0.001, 0.0001, and 0.00001 and did not use decay. We uniformly selected batch sizes from 16, 32, 64, 128, and 256. We uniformly selected the number of variational distribution samples used to estimate the loss from 1, 2, and 4. However, because we discarded all runs where there was insufficient graphics memory, we were only able to test up to 64x4 or 256x1 and batch sizes above 64 were proportionately less likely to appear in the final results. We selected the initial variance fromvalues of -6, -4, -2, or 0. We also tried reducing the number of convolutional channels by a factor of or and found that this did not seem to improve performance. We ran our hyperparameter search runs for 150 epochs. We selected the best hyperparameter configurations based on the best validation accuracy at any point during the training. We trained the models for 500 epochs but selected the models saved from 300 epochs as all models had started to overfit by the end of training. For the Cartesian posterior, this was using the SGD optimizer with learning rate 0.001, decay rate 0.98 every epoch, batch size 16, 4 variational samples for estimating the loss during training and of -6. This outperformed the others by a significant margin. Using our code on a V100 GPU with 8 vCPUs and an SSD this took slightly over 13 hours to train each model. For the radial posterior, this was the Adam optimizer with learning rate 0.0001, batch size 64, 1 variational sample for estimating the loss during training and a of -6. Using our code on the same GPU, this took slightly over 3h to run. However, for the radial posterior there were very many other configurations with similar validation accuracies (one of the advantages of the posterior).
Using these hyperparameter configurations we trained an additional 5 models using different random seeds. We then computed the test scores using a Monte Carlo estimate from averaging 64 samples from the variational distribution. We estimate the model’s uncertainty about a datapoint using the mutual information between the posterior’s parameters and the prediction on a datapoint. This estimate is used rank the datapoints in order of confidence and compute the model’s accuracy under the assumption of referring increasingly many points to medical experts. We report the area under the ROC curve as the most important measure of performance for this dataset.
For the experiment shown in figure 1, we have selected slightly different hyperparameters in order to train more quickly. For both models, we use Adam with learning rate 0.0001 and train for 300 epochs. The models have the number of channels of VGG-16. The models are trained with batch size 64 and 4 variational samples to estimate the loss. This was necessary because, for the Cartesian distribution, smaller numbers of samples led to numerical instability in the loss for large variances.
Test AUC-ROC is the most important measure of predictive effectiveness for the diabetic retinopathy dataset. For reasons of space, we do not report the test accuracy in the main body, but present it here in figure 10. The results are similar to those for AUC. The radial posterior has higher test accuracy. Moreover, the improvement widens as some fraction are referred to experts, which shows that it is better at estimating the uncertainty of the model for each datapoint.