Bayesian Neural Networks With Maximum Mean Discrepancy Regularization

03/02/2020 ∙ by Jary Pomponi, et al. ∙ 26

Bayesian Neural Networks (BNNs) are trained to optimize an entire distribution over their weights instead of a single set, having significant advantages in terms of, e.g., interpretability, multi-task learning, and calibration. Because of the intractability of the resulting optimization problem, most BNNs are either sampled through Monte Carlo methods, or trained by minimizing a suitable Evidence Lower BOund (ELBO) on a variational approximation. In this paper, we propose a variant of the latter, wherein we replace the Kullback-Leibler divergence in the ELBO term with a Maximum Mean Discrepancy (MMD) estimator, inspired by recent work in variational inference. After motivating our proposal based on the properties of the MMD term, we proceed to show a number of empirical advantages of the proposed formulation over the state-of-the-art. In particular, our BNNs achieve higher accuracy on multiple benchmarks, including several image classification tasks. In addition, they are more robust to the selection of a prior over the weights, and they are better calibrated. As a second contribution, we provide a new formulation for estimating the uncertainty on a given prediction, showing it performs in a more robust fashion against adversarial attacks and the injection of noise over their inputs, compared to more classical criteria such as the differential entropy.




I Introduction

Deep Neural Networks (DNNs) are currently the most widely used and studied models in the machine learning field, due to the large number of problems that can be solved very well with these architectures, such as image classification [he2015delving], speech processing [rethage2018wavenet], image generation [li2017mmd], and several others. Despite their empirical success, however, these models have a number of open research problems. Among them, how to quantify the uncertainty of an individual prediction remains challenging [kendall2017uncertainties]. A concrete measure of uncertainty is critical for several real-world applications, e.g., driverless vehicles, out-of-distribution detection, and medical applications [kwon2020uncertainty].

The principal approach to model the uncertainty in these models is based on Bayesian statistics. Bayesian Neural Networks (BNNs) model the parameters of a DNN as a probability distribution computed via the application of Bayes’ rule, instead of a single fixed point in the space of parameters [mackay1992practical]. Despite a wealth of theoretical and applied research, the challenge with these BNNs is that codifying a distribution over the weights remains difficult, mainly because: (1) the minimization problem is intractable in the general case, and (2) we need to specify prior knowledge (in the form of a prior distribution) over the parameters of the network [wenzel2020good]. On top of this, applying these principles to a Convolutional Neural Network (CNN) is even harder, because of the nature and the depth of these networks in practice.

In recent years, different approaches have been proposed to build and/or train a BNN, outlined later in Section I-B. In general, these approaches can either avoid the minimization problem altogether and sample from the posterior distribution [chen2015convergence], or they can solve the optimization problem in a restricted class of variational approximations [blundell2015weight]. The latter approach (referred to as Variational Inference, VI) has become extremely popular in recent years thanks to the possibility of straightforwardly leveraging the automatic differentiation routines common in deep learning frameworks [blei2017variational], and of avoiding a large number of sampling operations during the inference phase. However, the empirical results of BNNs remain sub-optimal in practice [wenzel2020good], and ample margins exist to further increase their accuracy, robustness to the choice of the prior distribution, and calibration of the classification models.

I-A Contributions of the work

In this paper, we partially address the aforementioned problems with two innovations related to BNNs. Firstly, we propose a modification of the commonly used optimization procedure in VI for BNNs. In particular, we build on recent work in variational auto-encoding [2017arXiv170602262Z] to propose a modification of the standard Evidence Lower BOund (ELBO) minimized during BNN training. In the proposed approach, we replace the Kullback-Leibler term on the variational approximation with a more flexible Maximum Mean Discrepancy (MMD) estimator [Gretton:2012:KTT:2188385.2188410]. After motivating our proposal, we perform an extensive empirical evaluation showing that the proposed BNN can significantly improve over the state-of-the-art in terms of classification accuracy, calibration, and robustness to the selection of the prior distribution. Secondly, we provide a new definition to measure the uncertainty of the prediction on a single point. Differently from previous state-of-the-art approaches [kwon2020uncertainty], our formulation provides a single scalar measure also in the multi-class case. In the experimental evaluation, we show that it performs better when defending against adversarial attacks on the BNN using a simple thresholding mechanism.

I-B Related work

I-B1 VI training for BNNs

The idea of applying Bayesian methods to neural networks has been studied widely over the years. The authors of [buntine1991bayesian] were the first to propose several Bayesian methods applied to neural networks, but the first VI method was only proposed in [hinton1993keeping], as a regularization approach. In [mackay2003information] and [neal2012bayesian] the posterior probability of the weights was investigated, in the first case using a Laplace approximation and in the second one using a Monte Carlo approach to train the networks. Only recently, the first practical VI training techniques were advanced in [graves2011practical]. In [blundell2015weight], this approach was extended and an unbiased way of updating the posterior was found. Dropout has also been proposed as an approximation of VI [gal2015dropout, kingma2015variational].

While these methods can be applied in general for most DNNs, few works were carried out in the context of image classification, due to the complexity and the depth of the networks involved in these tasks, combined with the inner difficulties of VI methods. In [gal2015bayesian] and [2019arXiv190102731S] the authors used Bayesian methods to train CNNs, while in [kendall2017uncertainties] and [kwon2020uncertainty] the authors proposed two alternatives that work also for CNNs, to measure the uncertainty of a classification, using the posterior distribution.

Almost all the works devoted to VI training of BNNs have considered the standard ELBO formulation [blei2017variational], where we minimize the sum of a likelihood term and the Kullback-Leibler (KL) divergence with respect to the variational approximation. Recently, however, several works have put forward alternative formulations of the ELBO in which the KL term is replaced by a different divergence [2017arXiv170602262Z]. The aim of this paper is to leverage these proposals to improve the training procedure of a Bayesian CNN and the estimation of the uncertainty of a classification.

I-B2 Uncertainty quantification in BNNs

Quantifying the uncertainty of a prediction is a fundamental task in modern deep learning. In the context of BNNs, the entropy provides a simple measure of uncertainty [leibig2017leveraging]. The work in [kendall2017uncertainties], however, analyzed the difference between aleatoric uncertainty (due to the noise in the data), and epistemic uncertainty (due to volatility in the model specification) [der2009aleatory, leibig2017leveraging, hullermeier2019aleatoric]. In order to properly model the former (which is not captured by standard entropy), they propose a modification of the BNN that also outputs an additional term necessary to quantify the aleatoric component. A further extension that does not require additional outputs is proposed in [kwon2020uncertainty]. Their formulation, however, does not allow for a simple scalar definition in the multi-class case.

II Bayesian Neural Networks

The core idea of Bayesian approaches is to estimate uncertainty using an entire distribution over the parameters, as opposed to the frequentist approach in which we estimate the solution of a problem as a fixed point. This is accomplished by using Bayes’ theorem:

$$p(w \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid w)\, p(w)}{p(\mathcal{D})} \qquad (1)$$

where $\mathcal{D}$ is a dataset and $w$ is a set of parameters that we want to estimate. A BNN is a neural network with a distribution over the parameters specified according to (1) [neal2012bayesian]. In particular, we can see a DNN as a function $f_w(x)$ that, given a sample $x$ and parameters $w$, computes the associated output $y$. Bayesian methods give us the possibility to have a distribution of functions (the posterior $p(w \mid \mathcal{D})$) for a particular dataset $\mathcal{D}$, starting from a prior belief on the shape of the functions (the prior $p(w)$) and its likelihood on a single point, defined as $p(y \mid x, w)$.

Once we have the posterior, the inference step consists in integrating over all the possible configurations of parameters:

$$p(y \mid x, \mathcal{D}) = \int p(y \mid x, w)\, p(w \mid \mathcal{D})\, dw \qquad (2)$$

This new equation represents a Bayesian Model Average (BMA): instead of choosing only one hypothesis (a single setting of the parameters $w$) we, ideally, want to use every possible setting of the parameters, weighted by its posterior probability. This process is called marginalization over the parameters $w$.

II-A Bayes by back-propagation

In general, the posterior in (1) is intractable. As outlined in Section I-B, several techniques can be used to handle this intractability, and in this paper we focus on VI approximations, as described next. VI is an alternative to Markov Chain Monte Carlo (MCMC) methods that approximates the posterior of Bayesian models faster than MCMC, but with fewer guarantees. For a complete review we refer to [blei2017variational]. Generally speaking, the nature of BNNs (e.g., highly non-convex minimization problem, millions of parameters, etc.) makes these models very challenging for standard Bayesian methods.

Bayes By Back-propagation (BBB, [graves2011practical, blundell2015weight]) is a VI method to fit a variational distribution $q(w \mid \theta)$ with variational parameters $\theta$ over the true posterior, from which the weights $w$ can be sampled. The set of variational parameters can easily be found by exploiting the back-propagation algorithm, as shown afterward. This approximate posterior answers queries about unseen data points: given a sample $x$ and variational parameters $\theta$, the prediction is obtained by taking the expectation with respect to the variational distribution. To make the process computationally viable, the expectation is generally approximated by sampling the weights from the posterior $T$ times; each set of weights gives us a DNN from which we predict the output, then the expectation is calculated as the average of all the predictions. Thus, Eq. (2) can be approximated as:

$$p(y \mid x, \mathcal{D}) \approx \frac{1}{T} \sum_{t=1}^{T} f_{w_t}(x) \qquad (3)$$

where $w_1, \dots, w_T \sim q(w \mid \theta)$ are the sets of sampled weights. In the most common case, the variational family is chosen as a diagonal Gaussian distribution over the weights of the network. In this case, the variational parameters are composed, for each weight of the DNN, of a mean $\mu$ and a value $\rho$, which is used to calculate the standard deviation of the parameter as $\sigma = \log(1 + \exp(\rho))$ (to ensure that the variance is always positive). Sampling a weight from the posterior is achieved by: $w = \mu + \sigma \epsilon$, where $\epsilon \sim \mathcal{N}(0, 1)$; this technique is called the re-parametrization trick [blei2017variational]. Note that there exist alternative ways to codify the posterior over the parameters, and we explore some simplifications in the experimental section.
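As an illustration of the sampling procedure and of the Monte Carlo average in Eq. (3), the following is a minimal NumPy sketch, not the authors' implementation. It assumes a single linear layer followed by a softmax as a stand-in for the DNN; the layer sizes, the initial value of `rho`, and all function names are hypothetical.

```python
import numpy as np

def softplus(rho):
    # log(1 + exp(rho)), computed stably; maps any real rho to a positive scale
    return np.logaddexp(0.0, rho)

def sample_weights(mu, rho, rng):
    # Re-parametrization trick: w = mu + sigma * eps, with eps ~ N(0, I)
    eps = rng.standard_normal(mu.shape)
    return mu + softplus(rho) * eps

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def predict(x, mu, rho, n_samples, rng):
    # Monte Carlo estimate of the predictive distribution:
    # average the outputs of n_samples networks drawn from the posterior
    probs = [softmax(x @ sample_weights(mu, rho, rng)) for _ in range(n_samples)]
    return np.mean(probs, axis=0)

rng = np.random.default_rng(0)
mu = rng.standard_normal((4, 3))   # hypothetical layer: 4 features, 3 classes
rho = np.full((4, 3), -3.0)        # hypothetical initial value for rho
x = rng.standard_normal((5, 4))    # 5 input samples
p_bar = predict(x, mu, rho, n_samples=10, rng=rng)
```

Each row of `p_bar` is an averaged probability vector; because every sampled network outputs a valid distribution, the average is one as well.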

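With this formulation, the parameters of the approximated posterior can be found by minimizing the Kullback-Leibler (KL, [kullback1951information]) divergence with respect to the true posterior:

$$\theta^* = \arg\min_\theta \mathrm{KL}\left[q(w \mid \theta) \,\|\, p(w \mid \mathcal{D})\right]$$

The optimal parameters are the ones that satisfy both the complexity of the dataset $\mathcal{D}$ and the prior distribution $p(w)$. The final objective function to minimize is:

$$\mathcal{L}(\theta) = \beta\, \mathrm{KL}\left[q(w \mid \theta) \,\|\, p(w)\right] - \mathbb{E}_{q(w \mid \theta)}\left[\log p(\mathcal{D} \mid w)\right] \qquad (4)$$

where $\beta$ is an additional scale factor to weight the two terms. This objective is the negative of the ELBO: minimizing it (i.e., maximizing the ELBO) is equivalent to minimizing the Kullback-Leibler divergence between the approximated posterior and the real one. We can also view this equation as a loss over the dataset plus a regularization term (the KL divergence).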
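To make the two terms of the objective concrete, here is a small sketch under simplifying assumptions: a diagonal Gaussian posterior and a zero-mean Gaussian prior with a shared standard deviation, for which the KL term has a closed form. BBB in general estimates this term by sampling, so the closed form is only a convenience for this special case; the function names are hypothetical.

```python
import numpy as np

def kl_diag_gaussian(mu, sigma, prior_sigma):
    # KL[N(mu, sigma^2) || N(0, prior_sigma^2)], summed over independent weights
    return np.sum(
        np.log(prior_sigma / sigma)
        + (sigma**2 + mu**2) / (2.0 * prior_sigma**2)
        - 0.5
    )

def negative_elbo(nll, mu, sigma, prior_sigma, beta):
    # Scaled regularization term plus the data term (negative log-likelihood)
    return beta * kl_diag_gaussian(mu, sigma, prior_sigma) + nll
```

When the posterior coincides with the prior the KL term vanishes and the objective reduces to the data loss, which is the sense in which the KL acts as a regularizer.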

The ELBO has limitations. One of them is that it might fail to learn an amortized posterior which correctly approximates the true posterior. This can happen in two cases: when the ELBO is minimized even though the posterior is inaccurate, and when the model capacity is not sufficient to achieve both a good posterior and a good fit to the data. For further information we refer to [alemi2017fixing] and [2017arXiv170602262Z].

II-B Measuring the uncertainty of a prediction

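As introduced in Section I-B2, the uncertainty of a prediction vector $p$ can be calculated in many ways. The most straightforward way is the entropy:

$$H(p) = -\sum_{c=1}^{C} p_c \log p_c \qquad (5)$$

where $p$ is the vector of probabilities over the $C$ classes. Combining this formulation with Eq. (3), the classification entropy can be calculated as:

$$H(y \mid x, \mathcal{D}) = -\sum_{c=1}^{C} \bar{p}_c \log \bar{p}_c \qquad (6)$$

where $\bar{p} = \frac{1}{T} \sum_{t=1}^{T} f_{w_t}(x)$ and $T$ is the number of weights sampled from the posterior. This entropy formulation allows the calculation of the uncertainty also for a BNN. However, a more suitable measure of uncertainty, exploiting the possibility of sampling the weights to calculate the cross uncertainty between the classes (a covariance matrix), can be formulated [kwon2020uncertainty]. To this end, we define the variance of the predictive distribution (3) as:

$$\mathrm{Var}_{p(y \mid x, \mathcal{D})}[y] = \int \left[\mathrm{diag}\left(\mathbb{E}_{p(y \mid x, w)}[y]\right) - \mathbb{E}_{p(y \mid x, w)}[y]\, \mathbb{E}_{p(y \mid x, w)}[y]^\top\right] p(w \mid \mathcal{D})\, dw + \int \left(\mathbb{E}_{p(y \mid x, w)}[y] - \mathbb{E}_{p(y \mid x, \mathcal{D})}[y]\right)\left(\mathbb{E}_{p(y \mid x, w)}[y] - \mathbb{E}_{p(y \mid x, \mathcal{D})}[y]\right)^\top p(w \mid \mathcal{D})\, dw \qquad (7)$$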
For further information about the derivation, we refer to [kwon2020uncertainty]. The first term in the variance formula is called aleatoric uncertainty, while the second one is the epistemic uncertainty [der2009aleatory]. The first quantity measures the inherent uncertainty of the dataset $\mathcal{D}$: it does not depend on the model, and more data might not reduce it. The second term instead incorporates the uncertainty of the model itself, and can be decreased by augmenting the dataset or by redefining the model. In [kendall2017uncertainties] and [kwon2020uncertainty] the authors have proposed different ways to approximate these quantities. In [kendall2017uncertainties] the authors constructed a BNN and used the mean $\hat{\mu}_t$ and the standard deviation $\hat{\sigma}_t$ of the logits, the output of the last layer before the softmax activation function, to calculate the variance:

$$\frac{1}{T} \sum_{t=1}^{T} \mathrm{diag}(\hat{\sigma}_t^2) + \frac{1}{T} \sum_{t=1}^{T} (\hat{\mu}_t - \bar{\mu})(\hat{\mu}_t - \bar{\mu})^\top$$

where $\bar{\mu} = \frac{1}{T} \sum_{t=1}^{T} \hat{\mu}_t$. In [kwon2020uncertainty] the authors highlighted the problems of this approach: it models the variability of the logits (and not the predictive probabilities), ignoring that the covariance matrix is a function of the mean vector; moreover, the aleatoric uncertainty does not reflect the correlation due to the diagonal matrix modeling. To overcome these limitations, they proposed an improvement:

$$\frac{1}{T} \sum_{t=1}^{T} \left[\mathrm{diag}(\hat{p}_t) - \hat{p}_t \hat{p}_t^\top\right] + \frac{1}{T} \sum_{t=1}^{T} (\hat{p}_t - \bar{p})(\hat{p}_t - \bar{p})^\top \qquad (8)$$

where $\hat{p}_t = f_{w_t}(x)$ and $\bar{p} = \frac{1}{T} \sum_{t=1}^{T} \hat{p}_t$. This formulation converges in probability to Eq. (7) as the number of samples $T$ increases. In the case of binary classification, with $\hat{p}_t$ the scalar probability of the positive class, the formula simplifies to:

$$\underbrace{\frac{1}{T} \sum_{t=1}^{T} \hat{p}_t (1 - \hat{p}_t)}_{\text{aleatoric}} + \underbrace{\frac{1}{T} \sum_{t=1}^{T} (\hat{p}_t - \bar{p})^2}_{\text{epistemic}} \qquad (9)$$

This definition is more viable because it calculates a scalar instead of a matrix, but it cannot be used directly if the problem involves more than two classes, unless all the probabilities other than the maximum one are collapsed into a single probability and the problem is treated as a binary one. In this paper, we also present a modified version of the definition (8), which can be used to evaluate the uncertainty as a scalar also in multiclass scenarios.
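The estimator of [kwon2020uncertainty] referred to as Eq. (8) can be sketched in a few lines. The snippet below is an illustrative NumPy version, assuming the input is an array of shape (T, C) holding the softmax outputs obtained from T sampled sets of weights; the function name is hypothetical.

```python
import numpy as np

def uncertainty_decomposition(probs):
    # probs: (T, C) array, one predicted probability vector per sampled weight set
    probs = np.asarray(probs, dtype=float)
    p_bar = probs.mean(axis=0)
    # Aleatoric term: average over t of diag(p_t) - p_t p_t^T
    aleatoric = np.mean([np.diag(p) - np.outer(p, p) for p in probs], axis=0)
    # Epistemic term: average over t of (p_t - p_bar)(p_t - p_bar)^T
    diff = probs - p_bar
    epistemic = diff.T @ diff / len(probs)
    return aleatoric, epistemic

aleatoric, epistemic = uncertainty_decomposition([[0.9, 0.1], [0.6, 0.4]])
```

In this two-class example the top-left entries of the two matrices coincide with the scalar aleatoric and epistemic terms of the binary formula, computed on the probability of the first class.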

III Proposed approaches

In this section, we introduce the proposed variations for the training of BNNs. Firstly, we outline a new way to approximate the weights’ posteriors, leading to a better posterior approximation, higher accuracy, and an easier minimization problem. Secondly, we provide an improvement of the measure of uncertainty (8), which is more suited for problems that are not binary classification tasks.

III-A Posterior approximation via Maximum Mean Discrepancy regularization

The MMD estimator was originally introduced as a non-parametric test for distinguishing samples from two separate distributions [Gretton:2012:KTT:2188385.2188410]. Formally, denote by $x$ and $x'$ two independent samples from a random variable with distribution $p$, by $y$ and $y'$ two independent samples from a random variable with distribution $q$, and by $k(\cdot, \cdot)$ a characteristic positive-definite kernel. The square of the MMD distance between the two distributions is defined as:

$$\mathrm{MMD}^2(p, q) = \mathbb{E}_{x, x'}[k(x, x')] - 2\, \mathbb{E}_{x, y}[k(x, y)] + \mathbb{E}_{y, y'}[k(y, y')]$$

We have that $\mathrm{MMD}^2(p, q) = 0$ if and only if $p = q$. Following [2017arXiv170602262Z] and [li2017mmd], we propose to replace the KL term in (4) with an MMD estimator, i.e., we propose to search for a variational set of parameters $\theta$ that minimizes the MMD distance with respect to the prior $p(w)$:

$$\theta^* = \arg\min_\theta \mathrm{MMD}^2\left(q(w \mid \theta),\, p(w)\right) \qquad (10)$$

In practice, the quantity can be estimated using finite samples from the two distributions. Given a sample $X$ from $p$ and a sample $Y$ from $q$, an unbiased estimator of $\mathrm{MMD}^2(p, q)$ is given by:

$$\widehat{\mathrm{MMD}}^2_u(X, Y) = \frac{1}{N(N-1)} \sum_{i \neq j} k(x_i, x_j) + \frac{1}{N(N-1)} \sum_{i \neq j} k(y_i, y_j) - \frac{2}{N^2} \sum_{i, j} k(x_i, y_j) \qquad (11)$$

where $x_i$ is the $i$-th element of $X$ (and similarly for $y_i$ and $Y$), and both vectors have size $N$. Using the unbiased version, the results can be negative if the two distributions are very close to each other. For this reason we use a different formulation in which, to speed up the convergence, we eliminate the negative part:

The idea of using the MMD distance in connection with neural network models was originally explored in [2015arXiv150503906K] and [DBLP:journals/corr/LiSZ15], which were mostly focused on generative models. The power of minimizing the MMD distance relies on the fact that it is equivalent to minimizing a distance between all the moments of the two distributions, under an affine kernel. In (10) we use this metric as a regularization approach, which minimizes the distance between the posterior over the parameters and the chosen prior. Summarizing, we propose to estimate the posterior by minimizing:

$$\mathcal{L}(\theta) = \beta\, \widehat{\mathrm{MMD}}(W_q, W_p) - \mathbb{E}_{q(w \mid \theta)}\left[\log p(\mathcal{D} \mid w)\right] \qquad (13)$$

where $\widehat{\mathrm{MMD}}$ denotes the MMD estimator introduced above, with $W_q \sim q(w \mid \theta)$ and $W_p \sim p(w)$ two samples of size $N$, the number of times that we sample from the two distributions, while the value $\beta$ is an additional scale factor to balance the classification loss and the posterior’s one; as in [blundell2015weight], we set $\beta$ to $\frac{2^{B-b}}{2^{B}-1}$, where $b$ is the current batch in the training phase and $B$ is the total number of batches; in this way, the first optimization steps are influenced by the prior more than the later ones, which are influenced mostly by the data samples.
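The unbiased estimator of the squared MMD and the batch-dependent scale factor of [blundell2015weight] can be sketched as follows. This is only an illustration under stated assumptions: it uses an RBF kernel on scalar weights with a hypothetical bandwidth `sigma=1.0`, and it implements the standard unbiased estimator, not the modified formulation used in the paper. Function names are hypothetical.

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    # k(a, b) = exp(-(a - b)^2 / (2 sigma^2)) for 1-D samples a and b
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-d2 / (2.0 * sigma**2))

def mmd2_unbiased(x, y, sigma=1.0):
    # x and y are 1-D samples of the same size, as assumed in the text
    n = len(x)
    kxx = rbf_kernel(x, x, sigma)
    kyy = rbf_kernel(y, y, sigma)
    kxy = rbf_kernel(x, y, sigma)
    # Within-sample sums exclude the diagonal terms (i == j)
    term_xx = (kxx.sum() - np.trace(kxx)) / (n * (n - 1))
    term_yy = (kyy.sum() - np.trace(kyy)) / (n * (n - 1))
    return term_xx + term_yy - 2.0 * kxy.mean()

def batch_weight(b, n_batches):
    # Scale factor 2^(B - b) / (2^B - 1) for batch b = 1, ..., B
    return 2.0 ** (n_batches - b) / (2.0**n_batches - 1.0)
```

For two samples drawn from the same distribution the estimate fluctuates around zero and may be slightly negative, which is the behaviour mentioned in the text; the batch weights decay geometrically and sum to one over an epoch.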

III-B Bayesian Cross Uncertainty (BCU)

In this section, we propose a modified version of the uncertainty measure formulated in Eq. (8), that we call Bayesian Cross Uncertainty (BCU).

The variance formulated in (8) gives us a $C \times C$ matrix, with $C$ the number of classes of our classification problem. Sometimes, it is useful to have a scalar value, which indicates the uncertainty of our prediction and that can be easily used or visualized.

The most straightforward approach to reduce a matrix to a scalar is to calculate its determinant: the sparser the matrix is, the closer the resulting value will be to zero. This approach comes with an inconvenience: in a binary classification problem, if we have, for a sample $x$, two vectors of predictions $p_1 = [1, 0]$, which codifies the absolute certainty of the prediction, and $p_2 = [0.5, 0.5]$, indicating that the network is maximally uncertain, the determinant is zero in both cases. To avoid these cases, we propose to modify the formulation of Eq. (8) as follows:


where $C$ is the number of classes and $I$ is the identity matrix. In this case, we have that the determinant of Eq. (14) is lower bounded when we have utmost confidence, and this bound is equal to the determinant of the matrix : . To calculate the upper bound we need to study when such a scenario could emerge. The possible scenarios in which we have the utmost uncertainty are the following: in the first one the network produces the same probability, $1/C$, for each class (utmost aleatoric uncertainty), while in the second one we have a sample that is classified $T$ times, with $T = C$, and at each prediction the network assigns a probability equal to $1$ to a different class, and zeros to the others (utmost epistemic uncertainty). In these cases, the upper bound is: . These two values can be used to normalize the result of Eq. (14) between zero, maximum certainty, and one, utmost uncertainty, given that this formulation ensures a bounded measure of uncertainty. In this way, the uncertainty is well defined for a BNN model, since it reaches its maximum only when one of the two terms, epistemic or aleatoric, reaches it. The final measure of uncertainty that we propose is the normalized version of (14):


$$\mathrm{BCU}(x) = \frac{u(x) - u_{\min}}{u_{\max} - u_{\min}} \qquad (15)$$

where $u(x)$ denotes the value of Eq. (14) for the sample $x$, and $u_{\min}$ and $u_{\max}$ are the minimum and maximum values as defined above. Furthermore, we define a way of discarding a sample based on the uncertainty of its classification. When the training of the DNN is over, we collect all the measures of uncertainty associated with the samples that have been classified correctly in a set that we call $U$. From this set of uncertainties $U$, we define a threshold as:


where $Q_1$ and $Q_3$ are, respectively, two functions that return the first and the third quartile of the set $U$, and $\gamma$ is a hyper-parameter. Once a threshold is calculated, a new sample can be discarded if its associated uncertainty exceeds it.

We underscore that this way of discarding images is not related to the formulation of variance in Eq. (7) or the BCU, nor to the BNNs, but can be used with every combination of DNN and measures of uncertainty.
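The degeneracy of the plain determinant discussed earlier in this section can be checked by hand. The short derivation below is a sketch that assumes a single prediction vector $p$ (so that the epistemic term vanishes) and uses only the aleatoric term of Eq. (8), $\mathrm{diag}(p) - p p^\top$; the all-ones vector $\mathbf{1}$ is notation introduced here for illustration.

```latex
% Certain prediction, p = [1, 0]:
\mathrm{diag}(p) - p p^\top
  = \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}
  - \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}
  = \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix},
  \qquad \det = 0.

% Maximally uncertain prediction, p = [0.5, 0.5]:
\mathrm{diag}(p) - p p^\top
  = \begin{bmatrix} 0.5 & 0 \\ 0 & 0.5 \end{bmatrix}
  - \begin{bmatrix} 0.25 & 0.25 \\ 0.25 & 0.25 \end{bmatrix}
  = \begin{bmatrix} 0.25 & -0.25 \\ -0.25 & 0.25 \end{bmatrix},
  \qquad \det = 0.25^2 - 0.25^2 = 0.

% General case: for any probability vector p with p^\top \mathbf{1} = 1,
\left(\mathrm{diag}(p) - p p^\top\right) \mathbf{1}
  = p - p\,(p^\top \mathbf{1}) = 0,
% so the matrix always has a zero eigenvalue and is singular.
```

Both extremes therefore map to the same value, which is why the raw determinant cannot separate a fully confident prediction from a maximally uncertain one.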

IV Neural networks calibration

BNNs are more suitable for a real world decision making application, due to the possibility to give an interval of confidence for the prediction, as explored in Section II-B. However, another important aspect in these scenarios, apart from the correctness of the predictions, is the ability of a model to provide a good calibration: the more the network is confident about a prediction, the more the probability associated with the predicted class label should reflect the likelihood of a correct classification.

In [niculescu2005predicting] the authors proved that shallow neural networks are typically well calibrated for a binary classification task. On the other hand, when considering deeper models, while the networks’ predictions become more accurate, due to the growing complexity, they also become less calibrated, as pointed out in [guo2017calibration]. In this work, we also analyze how calibrated BNNs are. In particular, we show in the experimental section that the proposed MMD estimator leads to better calibrated models.

Given a sample $x$, the associated ground truth label $y$, the predicted class $\hat{y}$ with its associated probability of correctness $\hat{p}$, we want that:

$$P(\hat{y} = y \mid \hat{p} = p) = p, \quad \forall p \in [0, 1]$$

This quantity cannot be computed with a finite set of samples, since $\hat{p}$ is a continuous random variable, but it can be approximated and visually represented (as proposed in [degroot1983comparison] and [niculescu2005predicting]) using the following formula:

$$\mathrm{acc}(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} \mathbb{1}(\hat{y}_i = y_i) \qquad (17)$$

where $\hat{y}_i$ and $y_i$ are the predicted and the true label for the sample $i$, and, having chosen the number of splits $M$ of the range $[0, 1]$ (each one of size $1/M$), we group the predictions into interval bins $B_m$. Each $B_m$ is the set of indices of samples with a prediction confidence that falls into the range $\left(\frac{m-1}{M}, \frac{m}{M}\right]$. Eq. (17) can be combined with a measure of confidence calculated as:

$$\mathrm{conf}(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} \hat{p}_i$$

to understand if a model is calibrated, which is true when $\mathrm{acc}(B_m) = \mathrm{conf}(B_m)$ for each bin $B_m$ with $m \in \{1, \dots, M\}$. These formulas provide a good visualization tool, namely the reliability diagram, but it is also useful to have a scalar value which summarizes the calibration statistics. The metric that we use is called Expected Calibration Error (ECE, [naeini2015obtaining]):

$$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right| \qquad (18)$$

where $n$ is the total number of samples. The resulting scalar gives us the calibration gap between a perfectly calibrated network and the evaluated one.
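The binning procedure behind the reliability diagram and the ECE can be sketched as below. This is a minimal NumPy illustration assuming equally sized bins that are open on the left and closed on the right; the function name and the default of 10 bins are hypothetical choices, not values taken from the paper.

```python
import numpy as np

def expected_calibration_error(confidences, predictions, labels, n_bins=10):
    confidences = np.asarray(confidences, dtype=float)
    correct = (np.asarray(predictions) == np.asarray(labels)).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Samples whose confidence falls into the bin (lo, hi]
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            acc = correct[in_bin].mean()       # accuracy of the bin
            conf = confidences[in_bin].mean()  # average confidence of the bin
            ece += in_bin.mean() * abs(acc - conf)
    return ece

ece = expected_calibration_error(
    confidences=[0.95, 0.95, 0.55, 0.55],
    predictions=[1, 1, 0, 0],
    labels=[1, 1, 0, 1],
)
```

In this toy example each of the two occupied bins holds half of the samples and has a gap of 0.05 between accuracy and confidence, so the weighted sum is 0.05.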

V Experiments

To evaluate the proposed VI method, we start with a toy regression task for visualization purposes, before moving to different datasets for image classification. We compare our proposed model with other state-of-the-art approaches. We put particular emphasis on evaluating different priors and seeing how this choice affects the final results (robustness), on studying the calibration of BNNs, and on how well our measure of uncertainty behaves when we want to discard images on which the network is uncertain (e.g., adversarial attacks). The code to replicate the experiments can be found in a public repository.

V-A Case study 1: Regression

Fig. 1: Panels: (a) DNN, (b) MC Dropout, (c) BBB, (d) MMD. The images show the results obtained on the heteroscedastic regression problem with additive noise equal to . The line represents the prediction, the smaller points are the training dataset, while the bigger ones are test points outside the training range, to check how the function evolves; we also show the variances of the prediction.

Fig. 2: Panels: (a) DNN, (b) MC Dropout, (c) BBB, (d) MMD. The images show the results obtained on the homoscedastic regression problem with additive noise equal to . The line represents the prediction, the smaller points are the training dataset, while the bigger ones are test points outside the training range, to check how the function evolves; we also show the variances of the prediction.

In this section, we evaluate the models on a toy regression problem, in which the networks should learn the underlying distribution of the points, and then be able to provide reasonable predictions even in regions outside the training one.

Each regression dataset is generated randomly using a Gaussian Process, with the RBF kernel, given a range in which the points lie, the number of points to generate and the variance of the additive noise. We generated two different kinds of regression problems: homoscedastic and heteroscedastic. In the first one, the variance is shared across all the random variables, while in the second one each random variable has its own. Each experiment consists in 100 training points in the range and 100 testing points outside this range. In the MMD experiments, we used the RBF kernel with to regularize the posterior.
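The data generation described above can be approximated with a draw from a Gaussian Process prior. The sketch below is an assumption-laden illustration: the input range, the kernel length scale, the jitter added for numerical stability, and the function name are all hypothetical, and it samples a single set of points instead of a joint train and test split.

```python
import numpy as np

def sample_gp_regression(n_points, x_range, noise_std, rng, length_scale=1.0):
    # Inputs drawn uniformly in the given range and sorted for plotting
    x = np.sort(rng.uniform(x_range[0], x_range[1], size=n_points))
    # RBF covariance matrix, with a small jitter on the diagonal
    d2 = (x[:, None] - x[None, :]) ** 2
    cov = np.exp(-d2 / (2.0 * length_scale**2)) + 1e-6 * np.eye(n_points)
    # One function draw from the GP prior via the Cholesky factor
    f = np.linalg.cholesky(cov) @ rng.standard_normal(n_points)
    # Scalar noise_std: homoscedastic; per-point array: heteroscedastic
    std = np.broadcast_to(noise_std, (n_points,))
    return x, f + std * rng.standard_normal(n_points)

rng = np.random.default_rng(0)
x_homo, y_homo = sample_gp_regression(100, (-1.0, 1.0), 0.1, rng)
x_hetero, y_hetero = sample_gp_regression(
    100, (-1.0, 1.0), np.linspace(0.01, 0.5, 100), rng
)
```

Passing a scalar standard deviation shares the noise level across all points, while passing one value per point gives each random variable its own noise, mirroring the two settings described in the text.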

We compare our approach with a standard DNN, the BBB method, and a network which uses the dropout layer to approximate the variational inference (proposed in [gal2015dropout]) by keeping the dropout turned on even during the test phase. This technique is called Monte Carlo Dropout (MC Dropout). For all the experiments we trained the network for 100 epochs using RMSprop with a learning rate equal to


As prior for BBB and MMD we used a Gaussian distribution

. We initialize the as proposed in [he2015delving] and , to keep the resulting weights around a value that guarantees the convergence of the optimization procedure. In these experiments we do not vary the prior distribution.

In Fig. 1 we show the results obtained from the heteroscedastic experiment with additive noise equal to . We can see that the BNN trained with MMD is the only one capable of reasonably estimating the interval of confidence in regions outside the training one. While the DNN and MC Dropout are too confident about their predictions, BBB gives the least confident predictions, but we can see that it fails to understand that the uncertainty should increase outside the training range. In Fig. 2 the results obtained on a homoscedastic regression problem are shown, with similar trends.

V-B Case study 2: Image classification

| Dataset | | | Neuron-wise | Weight-wise | Neuron-wise | Weight-wise |
|---|---|---|---|---|---|---|
| MNIST | 98.59 ± 0.08 | 98.30 ± 0.09 | 98.16 ± 0.02 | 28.17 ± 2.69 | 98.64 ± 0.061 | 98.84 ± 0.02 |
| CIFAR10 | 74.73 ± 0.36 | 75.56 ± 0.01 | 65.73 ± 0.50 | - | 75.24 ± 0.29 | 75.64 ± 0.12 |
| CIFAR100 | 39.89 ± 0.33 | 38.85 ± 0.20 | 35.31 ± 0.38 | - | 42.2 ± 0.51 | 42.36 ± 0.36 |

TABLE I: The table shows, for each method, the results and the associated standard deviation, both expressed in percentage, obtained on the classification benchmarks. Some results are missing because no combination of parameters led to convergence of the classification task.

In this section, we present the results obtained on image classification experiments. To the best of our knowledge, no competitive results on this field have been proposed using BNNs; the best results are present in [2019arXiv190102731S], in which the authors used the local re-parametrization trick, described in [kingma2015variational]: a technique in which the output of a layer is sampled instead of the weights. The main problem of this technique applied to the CNNs is that it doubles the number of operations inside a layer (e.g., in the CNN case we have two convolutions, one for the mean and the other for the variance of the layer’s output). For this reason, we believe that it is not computationally reasonable, especially with deeper architectures, and MMD could be a step towards better Bayesian CNNs.

Our main concern is to show that the MMD approach works even with a “bad” prior, which implies having little knowledge about the problem. For this purpose, we studied different priors: the Gaussian distribution, the Laplace distribution, the uniform distribution, and the Scaled Gaussian Mixture from [blundell2015weight]. In addition, we will evaluate the introduced measure of uncertainty under the Fast Gradient Sign Method (FGSM, [goodfellow2014explaining]). Finally, we evaluate the calibration of each network.
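For readers unfamiliar with FGSM, the attack perturbs an input by a step of size epsilon in the direction of the sign of the loss gradient with respect to the input. The following sketch illustrates it on a linear softmax classifier, chosen here only because its input gradient has a closed form; the model, the shapes, and the function names are hypothetical and unrelated to the networks used in the experiments.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(x, y_onehot, W):
    # Cross-entropy loss of a linear softmax classifier with logits W @ x
    return -np.sum(y_onehot * np.log(softmax(W @ x)))

def fgsm(x, y_onehot, W, epsilon):
    # For this model the gradient of the loss w.r.t. the input is W^T (p - y)
    grad_x = W.T @ (softmax(W @ x) - y_onehot)
    # Move every input component by epsilon along the sign of the gradient
    return x + epsilon * np.sign(grad_x)

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))   # 3 classes, 4 input features
x = rng.standard_normal(4)
y = np.array([1.0, 0.0, 0.0])
x_adv = fgsm(x, y, W, epsilon=0.1)
```

Because the cross-entropy of a linear softmax model is convex in the input, a step along the gradient sign cannot decrease the loss here; for deep networks the same step is only a first-order heuristic.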

To this end, we evaluate the methods on three datasets: the first is MNIST [lecun2010mnist], the second one is CIFAR10, and the last one is a harder version of CIFAR10 called CIFAR100, which contains the same number of images, but 100 classes instead of 10. For all the experiments, we used the Adam optimizer [kingma2014adam] with the learning rate set to

, and the weights initialized as in the regression experiments, to ensure a good gradient flow. For MNIST, we used a simple network composed by one CNN layer, with 64 kernels, followed by max pooling and two linear layers. For CIFAR10, we used a network composed by three blocks of convolutions and max pooling, respectively with

, and

kernels, followed by three linear layers; for CIFAR100, we used the same architecture but doubling the number of kernels. In all the architectures, the activation function is the ReLU.

We trained all the networks for 20 epochs; we also implemented an early stopping criterion, in which training is stopped if the validation score does not improve for 5 consecutive epochs. For BBB, MMD, and MC Dropout we sampled one set of weights during the training phase and 10 sets during the test phase. To have better statistics of the results, we repeated each experiment times.

Since the posterior over the weights doubles the number of parameters, we decided also to test a simplification of it, called neuron-wise posterior. This posterior is defined as , in which each weight has its own mean and the variance is given by a common variance scaled by a parameter which is defined neuron-wise. In this way, we have fewer parameters and the minimization problem could benefit from it.

V-B1 Prior choice

Neuron-wise Neuron-wise Weight-wise
66.26 75.43 75.43
33.28 74.59 75.04
- 75.47 75.64
52.84 75.47 75.32
12.11 74.90 75.30
- 75.58 75.47
- 74.46 74.89
66.23 74.67 75.70
66.23 75.6 75.93
66.23 74.95 74.89
66.23 74.67 75.70
66.23 75.60 75.93
66.23 74.95 74.89
66.23 74.95 74.89
66.23 74.95 74.89
TABLE II: The Table shows the preliminary results, on CIFAR10, about the robustness w.r.t. the prior choice. Some results are missing because no combination of parameters lead to convergence of the classification task.

We evaluated all the priors described above to understand how much the prior choice impacts the optimization problem and the final results. Only the results on CIFAR10 are shown, due to the large number of priors; the best results, for each method, are then used to train the models for all the experiments; the overall classification results will be presented later.

Table II shows the results obtained, on CIFAR10, with all the tested priors. It is clear that BBB fails to converge with spiky priors because the KL divergence forces the distributions to collapse on zero. A clear case of this behaviour can be observed with the Laplacian prior, as shown in Fig. 3.

In summary, MMD works better than BBB even with an uninformative prior, such as a uniform distribution that only gives a range for the parameters, because its sample-based nature allows more operating space than BBB. Moreover, Fig. 3 also shows that BNNs trained with MMD are capable of approximating a more complex posterior.
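The sample-based nature of the MMD term can be illustrated with a generic estimator: it only needs samples from the posterior and from the prior, so a uniform prior poses no difficulty. The sketch below uses the standard biased (V-statistic) estimate of the squared MMD with a Gaussian kernel; the kernel, the bandwidth, and the function names are illustrative assumptions and may differ from the estimator used in our experiments.

```python
import numpy as np

def rbf_kernel(a, b, bandwidth=1.0):
    """Gaussian kernel matrix between two 1-D arrays of samples."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mmd2(x, y, bandwidth=1.0):
    """Biased estimate of the squared MMD between samples x and y."""
    k_xx = rbf_kernel(x, x, bandwidth).mean()
    k_yy = rbf_kernel(y, y, bandwidth).mean()
    k_xy = rbf_kernel(x, y, bandwidth).mean()
    return k_xx + k_yy - 2.0 * k_xy

rng = np.random.default_rng(0)
posterior_samples = rng.normal(0.0, 0.3, size=500)  # stand-in for sampled weights
uniform_prior = rng.uniform(-1.0, 1.0, size=500)    # prior that only fixes a range
laplace_prior = rng.laplace(0.0, 0.1, size=500)     # peaked prior
print("MMD^2 vs uniform prior:", mmd2(posterior_samples, uniform_prior))
print("MMD^2 vs Laplace prior:", mmd2(posterior_samples, laplace_prior))
```

Because the estimate is computed from samples alone, no closed-form density ratio between posterior and prior is required, unlike the KL term in the ELBO.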

(a) MMD
(b) BBB
Fig. 3: The images show the posterior distribution of the weights obtained on CIFAR10 with the Laplacian prior. The BBB method fails when combined with this peaked prior, because it forces the distributions to converge to zero, neglecting the minimization problem associated with the classification.
(a) Discarded images while varying the threshold.
(b) Classification score while varying the threshold.
(c) Difference between the scores obtained.
Fig. 4: The images show, respectively, how many images are discarded, the score calculated over the samples that have not been discarded, and, in the last plot, the difference between the classification scores obtained using the BCU and the entropy-based thresholds. We tested different thresholds. The results are associated with the best model trained on CIFAR100 using the proposed MMD method, under the FGSM attack.
(a) Discarded images while varying the threshold.
(b) Classification score while varying the threshold.
(c) Difference between the scores obtained.
Fig. 5: The images show, respectively, how many images are discarded, the score calculated over the samples that have not been discarded, and, in the last plot, the difference between the classification scores obtained using the BCU and the entropy-based thresholds. We tested different thresholds. The results are associated with the best model trained on CIFAR10 with the MC Dropout approach, under the FGSM attack.
Dataset | DNN | MC Dropout | MC Dropout (no weight decay) | MMD (weight-wise)
MNIST | 0.73 ± 0.09 | 0.41 ± 0.11 | 0.49 ± 0.23 | 0.50 ± 0.08
CIFAR10 | 14.56 ± 0.32 | 3.43 ± 0.57 | 6.00 ± 0.17 | 5.93 ± 0.91
CIFAR100 | 13.11 ± 6.75 | 2.22 ± 0.63 | 5.92 ± 0.58 | 3.89 ± 1.76
TABLE III: The table shows, for each method, the calibration results with the associated standard deviation, measured as ECE score (%, Eq. (18)); lower is better.
(a) Reliability diagram of DNN.
(b) Reliability diagram of MC Dropout.
(c) Reliability diagram of MMD.
Fig. 6: The images show the reliability diagram for each method compared in Table III. In these images the correlation between the ECE score and the gap bars is shown visually. The methods are trained on CIFAR10.

V-B2 Classification results

Table I shows the results obtained in the classification experiments. The BBB method fails drastically when we use a weight-wise posterior, and it also fails to reach good performance when the posterior is neuron-wise and the dataset becomes harder (CIFAR100). In summary, the networks trained with the original ELBO loss fail when the models become bigger and the dataset harder; they are also more sensitive to the choice of the prior, as seen in Section V-B1.

V-B3 FGSM test

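In this test we compare the proposed BCU measure (14) with the normalized entropy formulation in (6), under the FGSM attack [goodfellow2014explaining], in which, given an image $x$ and its label $y$, we modify the image as $\tilde{x} = x + \epsilon \, \mathrm{sign}\left(\nabla_x \mathcal{L}(x, y)\right)$, where $\epsilon$ controls the strength of the attack and the gradient $\nabla_x \mathcal{L}(x, y)$ is computed through the input-output Jacobian of a randomly sampled network.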
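As an illustration of the attack, the sketch below applies the standard FGSM perturbation of [goodfellow2014explaining] to a linear softmax classifier, for which the gradient of the cross-entropy loss with respect to the input has a closed form. The linear model is only a stand-in for a network sampled from the posterior, and all names (`fgsm`, `input_gradient`, `cross_entropy`) are hypothetical.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

def cross_entropy(W, b, x, y):
    """Cross-entropy loss of the linear classifier on one sample."""
    return -np.log(softmax(W @ x + b)[y])

def input_gradient(W, b, x, y):
    """Gradient of the cross-entropy loss with respect to the input x."""
    p = softmax(W @ x + b)
    p[y] -= 1.0       # d loss / d logits = softmax - one_hot(y)
    return W.T @ p    # chain rule through logits = W x + b

def fgsm(W, b, x, y, epsilon):
    """Move each input component by epsilon in the direction that raises the loss."""
    return x + epsilon * np.sign(input_gradient(W, b, x, y))

rng = np.random.default_rng(0)
W, b = rng.standard_normal((10, 32)), np.zeros(10)  # stand-in for sampled weights
x, y = rng.uniform(0.0, 1.0, size=32), 3
x_adv = fgsm(W, b, x, y, epsilon=0.05)
print(cross_entropy(W, b, x, y), cross_entropy(W, b, x_adv, y))
```

Each pixel changes by at most `epsilon`, so the perturbed image stays close to the original while the loss of the sampled model does not decrease. With a deep network, the same gradient would be obtained by automatic differentiation.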

The purpose is to discard the images on which the network is less confident; therefore, we study how the threshold, defined as in Eq. (15), behaves when we change the uncertainty measure.

In Figs. 4 and 5, we show the results obtained, respectively, on CIFAR100 with MMD and on CIFAR10 with MC Dropout. We can see how the number of discarded images decreases exponentially when the threshold is applied to the entropy-based uncertainty measure; the score also drops, since more noisy images are evaluated instead of being discarded. This is due to the fact that the entropy measure does not take into consideration the correlation between the classes: only the distribution obtained using one set of weights is evaluated at a time, so the entropy does not capture the overall uncertainty across all the possible models, nor how a class can influence the others. Both of these pieces of information are taken into account by the proposed measure of uncertainty.
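The rejection procedure evaluated in these figures can be sketched as follows for the entropy baseline. The sketch assumes that a sample is discarded when its uncertainty exceeds the threshold and that the entropy is normalized by the logarithm of the number of classes; the function names and the toy `probs` array are hypothetical. Any other uncertainty score, such as the BCU measure, could be passed in place of the entropy.

```python
import numpy as np

def normalized_entropy(probs):
    """Entropy of each row of `probs`, scaled to [0, 1] by log(n_classes)."""
    p = np.clip(probs, 1e-12, 1.0)  # avoid log(0)
    return -(p * np.log(p)).sum(axis=1) / np.log(probs.shape[1])

def reject_and_score(probs, labels, uncertainty, threshold):
    """Discard samples whose uncertainty exceeds the threshold.

    Returns the number of discarded samples and the accuracy
    computed only on the samples that were kept.
    """
    keep = uncertainty <= threshold
    n_discarded = int((~keep).sum())
    if not keep.any():
        return n_discarded, float("nan")
    accuracy = float((probs[keep].argmax(axis=1) == labels[keep]).mean())
    return n_discarded, accuracy

probs = np.array([[0.98, 0.01, 0.01],
                  [0.34, 0.33, 0.33],
                  [0.05, 0.90, 0.05]])
labels = np.array([0, 1, 1])
entropy = normalized_entropy(probs)
print(reject_and_score(probs, labels, entropy, threshold=0.5))
```

In this toy case only the second, nearly uniform prediction is discarded, and both retained predictions are correct.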

V-B4 Network calibration

In this section, we evaluate the calibration of each network. To visually show the calibration of these models, we used reliability diagrams [degroot1983comparison, niculescu2005predicting]. Fig. 6 shows these diagrams, while Table III contains the results achieved in terms of ECE score. Only the networks that achieve a classification result near their best one, presented in Table I, are considered in this experiment; for this reason, the BNNs trained with the BBB method are not evaluated, since they do not reach competitive scores. To make the comparison fairer, we also compare two different versions of MC Dropout: the original one uses weight decay, which leads to a better ECE score (as pointed out in [guo2017calibration]), so we also trained an MC Dropout network without weight regularization. We can observe that the DNN never achieves a good calibration and that, while MC Dropout networks are well calibrated due to the weight decay, our method achieves a good calibration result even if no regularization is used. Comparing our method with MC Dropout without weight regularization, we find that our method achieves a better ECE score on CIFAR10 and CIFAR100 and a comparable one on MNIST. In summary, the BNNs trained using MMD are, in general, well calibrated and do not require external normalization techniques to achieve it.
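For reference, the ECE score can be computed with the common equal-width binning scheme described in [guo2017calibration]. The sketch below is a generic implementation under that assumption; the number of bins and the function name are illustrative and may differ from the exact setting of Eq. (18).

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE with equal-width confidence bins (returned as a fraction)."""
    confidences = probs.max(axis=1)      # confidence of the predicted class
    predictions = probs.argmax(axis=1)
    correct = (predictions == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lower, upper in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lower) & (confidences <= upper)
        if in_bin.any():
            # Gap between accuracy and mean confidence inside the bin,
            # weighted by the fraction of samples that fall into it.
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece
```

Multiplying the returned value by 100 gives the percentage reported in Table III. The per-bin gaps are the quantities drawn as gap bars in the reliability diagrams of Fig. 6.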

VI Conclusion

In this paper, we proposed a new VI method to approximate the posterior over the weights of a BNN, which uses the MMD distance as a regularization metric between the posterior and the prior. This method has advantageous characteristics compared to other VI methods such as MC Dropout and BBB. First, the BNNs trained with this technique achieve better results, and they are capable of approximating a more complex posterior. Second, it is more robust to the prior choice than BBB, an important aspect in these models. Third, this method, if combined with the right prior, can lead to a very well calibrated network that also achieves good performance.

We also proposed and tested a new method to calculate the classification uncertainty of a BNN. We showed that this measure, combined with a threshold-based rejection technique, behaves better than the entropy measure when discarding samples on which the BNN is less certain, leading to a better score on noisy samples.

Our MMD method suggests interesting lines of further research, in which a BNN can be trained using VI methods that involve a regularization term different from the KL divergence, leading to better and more interesting posteriors.