
Robustness of Bayesian Neural Networks to Gradient-Based Attacks

Vulnerability to adversarial attacks is one of the principal hurdles to the adoption of deep learning in safety-critical applications. Despite significant efforts, both practical and theoretical, the problem remains open. In this paper, we analyse the geometry of adversarial attacks in the large-data, overparametrized limit for Bayesian Neural Networks (BNNs). We show that, in the limit, vulnerability to gradient-based attacks arises as a result of degeneracy in the data distribution, i.e., when the data lies on a lower-dimensional submanifold of the ambient space. As a direct consequence, we demonstrate that in the limit BNN posteriors are robust to gradient-based adversarial attacks. Experimental results on the MNIST and Fashion MNIST datasets with BNNs trained with Hamiltonian Monte Carlo and Variational Inference support this line of argument, showing that BNNs can display both high accuracy and robustness to gradient-based adversarial attacks.



1 Introduction

Adversarial attacks are small, potentially imperceptible perturbations of test inputs that can lead to catastrophic misclassifications in high-dimensional classifiers such as deep neural networks (NN). Since the seminal work of Szegedy et al. (2013), adversarial attacks have been intensively studied, and even state-of-the-art deep learning models, trained on very large data sets, have been shown to be susceptible to such attacks (Goodfellow et al., 2014). In the absence of effective defenses, the widespread existence of adversarial examples has raised serious concerns about the security and robustness of models learned from data (Biggio and Roli, 2018). As a consequence, the development of machine learning models that are robust to adversarial perturbations is an essential pre-condition for their application in safety-critical scenarios, where model failures have already led to fatal accidents (Yadron and Tynan, 2016).

Many attack strategies are based on identifying directions of high variability in the loss function by evaluating gradients w.r.t. input points (see, e.g., Goodfellow et al. (2014); Madry et al. (2017)). Since such variability can be intuitively linked to uncertainty in the prediction, Bayesian Neural Networks (BNNs) (Neal, 2012) have recently been suggested as a more robust deep learning paradigm, a claim that has found some empirical support (Feinman et al., 2017; Gal and Smith, 2018; Bekasov and Murray, 2018; Liu et al., 2018). However, neither the source of this robustness nor its general applicability is well understood mathematically.

In this paper we show a remarkable property of BNNs: in a suitably defined large data limit, we prove that the gradients of the expected loss function of a BNN w.r.t. the input points vanish. As a consequence, in the limit BNNs are provably immune to gradient-based adversarial attacks.

We verify our theoretical findings on various BNN architectures trained with both Hamiltonian Monte Carlo (HMC) and Variational Inference (VI) on both the MNIST and Fashion MNIST data sets, empirically showing that the magnitude of the gradients indeed decreases when more posterior samples are taken. We also show that two popular, highly effective gradient-based attack strategies are unsuccessful on BNNs. Finally, we conduct a large-scale experiment on thousands of different networks, showing that for BNNs high accuracy correlates with high robustness to gradient-based adversarial attacks, contrary to what is observed for networks trained via standard Stochastic Gradient Descent (SGD) (Zhang et al., 2019).

In summary, this paper makes the following contributions:

  • We provide a theoretical framework to analyse adversarial robustness of BNNs in the large data limit.

  • We show that, in this limit, the posterior average of the gradients of the loss function vanishes, providing robustness against gradient-based attacks.

  • We substantiate our arguments empirically with a large-scale experiment, showing that BNNs are immune to the well-known accuracy-robustness trade-off.

1.1 Related Work

The robustness of BNNs to adversarial examples has already been observed by Gal and Smith (2018); Bekasov and Murray (2018). In particular, in (Bekasov and Murray, 2018) the authors define Bayesian adversarial spheres and empirically show that, for BNNs trained with HMC, adversarial examples tend to have high uncertainty, while in (Gal and Smith, 2018) sufficient conditions for idealised BNNs to avoid adversarial examples are derived. However, it is unclear how such conditions could be checked in practice, as it would require one to check that the BNN architecture is invariant under all the symmetries of the data.

Empirical methods to detect adversarial examples for BNNs that utilise pointwise uncertainty have been introduced in (Li and Gal, 2017; Feinman et al., 2017; Rawat et al., 2017). However, these approaches have largely relied on Monte Carlo dropout as a posterior inference approximation, which can be fooled by attacks that generate adversarial examples with small uncertainty (Carlini and Wagner, 2017). We discuss why these methods are fooled, within our framework, in Sections 5 and 6. Statistical techniques for the quantification of adversarial robustness of BNNs have been introduced by Cardelli et al. (2019a) and employed in (Michelmore et al., 2019) to detect erroneous behaviours in the context of autonomous driving. Furthermore, in (Ye and Zhu, 2018) a Bayesian approach has been considered in the context of adversarial training, where the authors showed improved performance with respect to other, non-Bayesian, adversarial training approaches.

2 Gradient Based Adversarial Attacks

Gradient-based attacks are among the most employed techniques for fast testing of NNs in adversarial settings. The basic principle is to perform gradient ascent on the loss to identify candidate adversarial perturbations. Briefly, in the adversarial setting, instead of minimizing the loss function w.r.t. the NN weights on a given set (as in the training phase), the objective of the gradient method becomes loss maximization w.r.t. the input coordinates, while predictions of the network are performed with fixed weights. In other words, the gradient descent on the weights of the network becomes a local gradient ascent on the network input.

Specifically, let $f(x, w)$ be a NN with input $x$ and network parameters (weights) $w$, and denote by $L(x, w)$ the associated loss function. Given an input point $x^*$ and a strength (i.e. maximum perturbation magnitude) $\epsilon > 0$ of the attack, the worst-case adversarial perturbation can be defined as the point $\tilde{x}$ around $x^*$ that maximises the loss:

$$\tilde{x} := \arg\max_{\tilde{x} \,:\, \|\tilde{x} - x^*\| \le \epsilon} L(\tilde{x}, w).$$

If the network prediction on $\tilde{x}$ differs from the original prediction on $x^*$, then $\tilde{x}$ is called an adversarial example. As $L(x, w)$ is non-linear, this poses a non-linear optimisation problem for which several approximate solution methods have been proposed (Biggio and Roli, 2018). While the results discussed in Section 3 hold for any gradient-based method, in the experiments reported in Section 6 we directly look at the Fast Gradient Sign Method (FGSM) (Goodfellow et al., 2014) and the Projected Gradient Descent method (PGD) (Madry et al., 2017). The former, FGSM, works by approximating $\tilde{x}$ by taking an $\epsilon$-step in the direction of the sign of the gradient in $x^*$. In the case of $\ell_\infty$ perturbations, that is:

$$\tilde{x} = x^* + \epsilon \, \mathrm{sgn}\big(\nabla_x L(x^*, w)\big).$$

PGD is based on an iterative generalisation of FGSM (Madry et al., 2017). It starts from a random perturbation in an $\ell_\infty$-ball of radius $\epsilon$ around the input sample $x^*$, then performs a gradient step (with step size $\alpha$) in the direction of the greatest loss, projecting the point so obtained onto the $\epsilon$-$\ell_\infty$-ball centred in $x^*$, and iterates the gradient step to improve the approximations $x_t$ of $\tilde{x}$:

$$x_{t+1} = \mathrm{Proj}_{x^*, \epsilon}\big(x_t + \alpha \, \mathrm{sgn}(\nabla_x L(x_t, w))\big).$$
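The two attacks can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the experimental code of the paper: the model is a toy logistic classifier whose input gradient is available in closed form, and the names `loss_and_grad`, `fgsm`, `pgd`, the step size `alpha` and all numerical values are hypothetical choices made for the example.

```python
import math
import random


def loss_and_grad(x, w, y):
    """Cross-entropy loss of a toy logistic model and its gradient w.r.t. the input x."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    p = 1.0 / (1.0 + math.exp(-z))            # predicted probability of class 1
    loss = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    grad_x = [(p - y) * wi for wi in w]       # dL/dx = (p - y) * w
    return loss, grad_x


def sign(v):
    """Sign of a scalar: -1, 0 or +1."""
    return (v > 0) - (v < 0)


def fgsm(x, w, y, eps):
    """One eps-step along the sign of the input gradient (l_inf FGSM)."""
    _, g = loss_and_grad(x, w, y)
    return [xi + eps * sign(gi) for xi, gi in zip(x, g)]


def pgd(x, w, y, eps, alpha, steps, rng):
    """Iterated signed-gradient ascent, projected back onto the l_inf ball around x."""
    xt = [xi + rng.uniform(-eps, eps) for xi in x]   # random start inside the ball
    for _ in range(steps):
        _, g = loss_and_grad(xt, w, y)
        xt = [v + alpha * sign(gi) for v, gi in zip(xt, g)]
        # Projection onto the l_inf ball is coordinate-wise clipping.
        xt = [min(max(v, xi - eps), xi + eps) for v, xi in zip(xt, x)]
    return xt


w = [2.0, -1.0, 0.5]          # fixed (already trained) weights, chosen arbitrarily
x, y = [0.5, 0.2, -0.1], 1    # clean input and its label
x_fgsm = fgsm(x, w, y, eps=0.1)
x_pgd = pgd(x, w, y, eps=0.1, alpha=0.02, steps=20, rng=random.Random(0))
print(loss_and_grad(x, w, y)[0], loss_and_grad(x_fgsm, w, y)[0], loss_and_grad(x_pgd, w, y)[0])
```

Note that the weights stay fixed throughout: only the input is updated, which is the sense in which gradient descent on the weights becomes gradient ascent on the input.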

We stress that attacks that do not rely directly on the gradient of the network under consideration have also been developed (Papernot et al., 2017; Ilyas et al., 2018; Wicker et al., 2018).

3 Bayesian Neural Networks and Adversarial Attacks

Bayesian modelling aims to capture the intrinsic epistemic uncertainty of data models by defining ensembles of predictors (see e.g. (Barber, 2012)); it does so by turning algorithm parameters (and consequently also predictions) into random variables. In a NN scenario (Neal, 2012), one starts with a prior measure $p(w)$ over the network weights $w$. The fit of the network with weights $w$ to the data $D$ is assessed through the likelihood $p(D|w)$ (Bishop, 2006). Bayesian inference then combines likelihood and prior via Bayes theorem to obtain a posterior measure on the space of weights

$$p(w|D) \propto p(D|w) \, p(w). \qquad (1)$$


Maximising the likelihood function w.r.t. the weights is in general equivalent to minimising the loss function in standard NNs; indeed, standard training of NNs can be viewed as an approximation to Bayesian inference which replaces the posterior distribution with a delta function at its mode.

It should be noted that obtaining the posterior distribution exactly is impossible for non-linear/non-conjugate models such as NNs. Asymptotically exact samples from the posterior distribution can be obtained via procedures such as Hamiltonian Monte Carlo (HMC); approximate samples can be obtained more cheaply via Variational Inference (VI).

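Irrespective of the posterior inference method of choice, Bayesian predictions at a new input $x^*$ are obtained from an ensemble of $n$ NNs, each with its individual weights drawn from the posterior distribution $p(w|D)$:

$$f(x^*|D) = \langle f(x^*, w) \rangle_{p(w|D)} \simeq \frac{1}{n} \sum_{i=1}^{n} f(x^*, w_i), \qquad w_i \sim p(w|D), \qquad (2)$$

where $\langle \cdot \rangle_p$ denotes expectation w.r.t. the distribution $p$. The ensemble of NNs is the predictive distribution of the BNN.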
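The Monte Carlo form of the predictive distribution amounts to averaging the outputs of the sampled networks. The sketch below assumes a toy "network" (one linear layer followed by a softmax) and uses Gaussian noise around a fixed weight matrix as a stand-in for genuine posterior samples; the names `network`, `predictive` and `w_mean` are hypothetical and serve only to illustrate the averaging step.

```python
import math
import random


def softmax(z):
    m = max(z)                                  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def network(x, w):
    """Toy 'network': a linear layer followed by a softmax; w has one row per class."""
    return softmax([sum(wkj * xj for wkj, xj in zip(row, x)) for row in w])


def predictive(x, weight_samples):
    """Monte Carlo estimate of the posterior predictive: average the n network outputs."""
    outputs = [network(x, w) for w in weight_samples]
    n = len(outputs)
    return [sum(out[k] for out in outputs) / n for k in range(len(outputs[0]))]


rng = random.Random(0)
# Stand-in for posterior samples (an assumption): Gaussian noise around a fixed weight matrix.
w_mean = [[1.0, -0.5], [-1.0, 0.5], [0.0, 0.0]]
weight_samples = [[[m + rng.gauss(0.0, 0.3) for m in row] for row in w_mean] for _ in range(50)]
probs = predictive([0.8, -0.2], weight_samples)
print(probs)
```

With a single weight sample the estimate reduces to the output of one deterministic network, which is the delta-function approximation mentioned above.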

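Attacks against a BNN are attacks against the predictive distribution (2). The FGSM attack then becomes

$$\tilde{x} = x^* + \epsilon \, \mathrm{sgn}\big( \langle \nabla_x L(x^*, w) \rangle_{p(w|D)} \big) \simeq x^* + \epsilon \, \mathrm{sgn}\Big( \frac{1}{n} \sum_{i=1}^{n} \nabla_x L(x^*, w_i) \Big), \qquad (3)$$

where the final expression is a Monte Carlo approximation with $n$ samples $w_i$ drawn from the posterior $p(w|D)$. Expressions for PGD or other gradient-based attacks are analogous.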
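As a rough sketch of the Monte Carlo form of the attack, the snippet below averages the input gradients of several sampled networks before taking the sign. It assumes a toy logistic model with a closed-form input gradient; `grad_x`, `bnn_fgsm` and the three weight vectors are hypothetical stand-ins for a real BNN and its posterior samples.

```python
import math


def grad_x(x, w, y):
    """Input gradient of the cross-entropy loss for a toy logistic model p = sigmoid(w . x)."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    p = 1.0 / (1.0 + math.exp(-z))
    return [(p - y) * wi for wi in w]


def sign(v):
    """Sign of a scalar: -1, 0 or +1."""
    return (v > 0) - (v < 0)


def bnn_fgsm(x, weight_samples, y, eps):
    """FGSM against the ensemble: step along the sign of the sample-averaged gradient."""
    grads = [grad_x(x, w, y) for w in weight_samples]
    n = len(grads)
    mean_grad = [sum(g[j] for g in grads) / n for j in range(len(x))]
    x_adv = [xj + eps * sign(gj) for xj, gj in zip(x, mean_grad)]
    return x_adv, mean_grad


x, y = [0.5, 0.3], 1
# Hypothetical posterior samples; a real BNN would supply these via HMC or VI.
weight_samples = [[2.0, 1.5], [1.5, -1.0], [2.5, 0.2]]
x_adv, mean_grad = bnn_fgsm(x, weight_samples, y, eps=0.1)
print(mean_grad, x_adv)
```

Because only the sign of the averaged gradient enters the update, the attack direction can differ from the one obtained against any individual network in the ensemble.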

4 Adversarial robustness of Bayesian predictive distributions

Equation (3) suggests a possible explanation for the observed robustness of BNNs to adversarial attacks: the averaging under the posterior might lead to cancellations in the final expectation. It turns out that this averaging property is intimately related to the geometry of the so-called data manifold $M_D \subset \mathbb{R}^d$, i.e. the support of the data generating distribution $p(D)$. The key result that we leverage is a recent breakthrough (Du et al., 2018; Rotskoff and Vanden-Eijnden, 2018; Mei et al., 2018) which proved global convergence of (stochastic) gradient descent (at the distributional level) in the overparametrised, large data limit. We refer to the original publications for precise definitions; we define as a fully trained, overparametrized BNN an ensemble of NNs satisfying the conditions in (Rotskoff and Vanden-Eijnden, 2018) and at full convergence of the training algorithm.

We now prove our main result:

Theorem 1.

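Let $f(x, w)$ be a fully trained overparametrized BNN on a prediction problem with data manifold $M_D \subset \mathbb{R}^d$ and posterior weight distribution $p(w|D)$. Assuming $M_D \in C^\infty$ almost everywhere, in the large data limit we have a.e. on $M_D$

$$\langle \nabla_x L(x, w) \rangle_{p(w|D)} = 0.$$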


By the definition of the FGSM attack in Equation (3) and other gradient-based attacks, Theorem 1 directly implies that, in the limit, any gradient-based attack will be ineffective against a BNN. The proof of Theorem 1 is instructive; we divide it into two parts in the following.

Dimensionality of the data manifold

An important result proved in (Du et al., 2018; Rotskoff and Vanden-Eijnden, 2018; Mei et al., 2018) is that, at convergence, overparametrised NNs provably achieve zero loss on the whole data manifold in the infinite data limit. An immediate consequence of this result is the following

Lemma 1.

Let $f(x, w)$ be a fully trained overparametrized NN on a prediction problem with data manifold $M_D \subset \mathbb{R}^d$. Let $x^* \in M_D$ s.t. $B_d(x^*, \epsilon) \subset M_D$, with $B_d(x^*, \epsilon)$ the $d$-dimensional ball centred at $x^*$ of radius $\epsilon$ for some $\epsilon > 0$. Then $f(x, w)$ is robust to gradient-based attacks at $x^*$ in the large data limit.

This is a trivial consequence of the network achieving zero loss on the data manifold (Du et al., 2018; Rotskoff and Vanden-Eijnden, 2018; Mei et al., 2018), since the function $f$ would be locally constant at $x^*$.

A corollary of Lemma 1 is

Corollary 1.

Let $f(x, w)$ be a fully trained overparametrized NN on a prediction problem with data manifold $M_D \subset \mathbb{R}^d$ smooth a.e. If $f$ is vulnerable to gradient-based attacks almost everywhere in $M_D$ in the infinite data limit, then $\dim(M_D) < d$.

This corollary confirms the widely held conjecture that adversarial attacks originate from degeneracies of the data manifold (Goodfellow et al., 2014; Fawzi et al., 2018). In fact, it had already been noticed empirically (Khoury and Hadfield-Menell, 2018) that adversarial perturbations often arise in directions which are normal to the data manifold. The higher the codimension of the data manifold in the embedding space, the more likely it is that a randomly selected direction is normal to it. The suggestion that lower-dimensional data structures might be ubiquitous in NN problems is also corroborated by recent results (Goldt et al., 2019) showing that the characteristic training dynamics of NNs are intimately linked to data lying on a lower-dimensional manifold. Notice that the implication is only one way; it is perfectly possible for the data manifold to be low dimensional and still not vulnerable at many points. In fact, vulnerability at one input implies low dimensionality in a neighbourhood, and the strong assumption of vulnerability a.e. is needed purely to prove low dimensionality of the full data manifold.

Notice that the assumption of smoothness a.e. for the data manifold is needed to avoid pathologies in the data distribution (e.g. its support being a closed but dense subset of $\mathbb{R}^d$). Additionally, this assumption guarantees that the dimensionality of $M_D$ is locally constant; we will henceforth focus on the special case whereby $\dim(M_D)$ is constant everywhere; generalisation to the piecewise constant case is trivial.

A consequence of Corollary 1 is that the gradient of the loss function is orthogonal to the data manifold: since the loss vanishes identically on $M_D$ in the limit, its derivative along any direction tangent to $M_D$ is zero, so that $\nabla_x L(x, w) = \nabla_\perp L(x, w)$, where $\nabla_\perp$ denotes the gradient projected onto the normal subspace of $M_D$ at $x$.

Bayesian averaging of normal gradients

In order to complete the proof of Theorem 1, we therefore need to show that the normal gradient $\nabla_\perp L(x, w)$ has expectation zero under the posterior distribution:

$$\langle \nabla_\perp L(x, w) \rangle_{p(w|D)} = 0.$$

The key to this result is the fact that, assuming a uniform prior on the weights $w$, all NNs that agree on the data manifold will by definition receive the same posterior weight in the ensemble, since they achieve exactly the same likelihood. It therefore remains to prove the following symmetry of the normal gradient:

Lemma 2.

Let $f(x, w)$ be a fully trained overparametrized NN on a prediction problem with data manifold $M_D \subset \mathbb{R}^d$ and normal gradient field $v(x) = \nabla_\perp L(x, w)$. Then, in the infinite data limit, there exists a set of weights $w'$ such that $f(x, w') = f(x, w)$ on $M_D$ and $\nabla_\perp L(x, w') = -v(x)$.

The proof of this lemma rests on the observation that finding a suitable function $f(x, w')$ is equivalent to solving the Cauchy boundary value problem specified by zero loss on the data manifold and normal gradient field $-v(x)$. Since we are in the overparametrized, large data limit, any such function will be realisable as a NN with a suitable choice of weights $w'$.
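The symmetry behind this lemma can be illustrated on a toy problem. The sketch below is an assumption-laden illustration, not part of the proof: the data manifold is taken to be the line $x_2 = 0$ in $\mathbb{R}^2$, the model is a logistic classifier with weights $(a, b)$, and the mirrored network simply flips the sign of $b$. Since $b$ multiplies a coordinate that is zero on the manifold, both networks make identical predictions there (and so receive the same likelihood), while their normal gradients are exactly opposite.

```python
import math


def prob(x, w):
    """Toy classifier on R^2: p(y = 1 | x) = sigmoid(a * x1 + b * x2) with w = (a, b)."""
    a, b = w
    return 1.0 / (1.0 + math.exp(-(a * x[0] + b * x[1])))


def loss_grad(x, w, y):
    """Gradient of the cross-entropy loss w.r.t. the input x = (x1, x2)."""
    a, b = w
    r = prob(x, w) - y
    return (r * a, r * b)      # (tangential, normal) components when the data lie on x2 = 0


w = (3.0, 2.0)
w_mirror = (3.0, -2.0)          # same behaviour on the manifold, mirrored off it

# Points on the assumed data manifold {x2 = 0}.
manifold = [(t / 10.0, 0.0) for t in range(-20, 21)]
for x in manifold:
    y = 1 if x[0] > 0 else 0
    g, g_mirror = loss_grad(x, w, y), loss_grad(x, w_mirror, y)
    assert prob(x, w) == prob(x, w_mirror)      # identical predictions, hence identical likelihood
    assert g[1] == -g_mirror[1]                 # opposite normal gradients
    assert (g[1] + g_mirror[1]) / 2 == 0.0      # the pair averages to zero in the normal direction

# Off the manifold the two networks disagree.
print(prob((1.0, 0.5), w), prob((1.0, 0.5), w_mirror))
```

In this toy the tangential component of the gradient does not vanish, because the classifier is far from the zero-loss limit assumed by the theory; only the cancellation in the normal direction is being illustrated.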

Figure 1: The expected loss gradients of BNNs exhibit a vanishing behaviour when increasing the number of samples from the posterior predictive distribution. We show example images from MNIST (top row) and Fashion MNIST (bottom row) and their expected loss gradients w.r.t. networks trained with HMC (left) and VI (right). To the right of the images we plot a heat map of gradient values. The figures on the left demonstrate the vanishing behaviour of gradients w.r.t. a posterior distribution approximated with HMC, whereas the figures on the right demonstrate the vanishing behaviour of gradients w.r.t. a posterior distribution approximated with VI.

5 Consequences and limitations

Theorem 1 has the natural consequence of protecting BNNs against all gradient-based attacks, due to the vanishing expectation of the gradients in the limit. Its proof also sheds light on a number of observations made in recent years. Before moving on to empirically validating the theorem, it is worth reflecting on some of its implications and limitations:

  • Theorem 1 holds in a specific thermodynamic limit; however, we expect the averaging effect of BNN gradients to still provide considerable protection in conditions where the network architecture and the amount of data lead to high accuracy and strong expressivity. In practice, high accuracy might be a good indicator of robustness.

  • Theorem 1 holds when the ensemble is drawn from the true posterior distribution; nevertheless it is not obvious (and likely not true) that the posterior distribution is the sole ensemble with the zero averaging property of the gradients. Cheaper approximate Bayesian inference methods which retain ensemble predictions such as VI may in practice provide good protection.

  • Theorem 1 is proven under the assumption of uniform priors; in practice, (vague) Gaussian priors are more frequently used for computational reasons. Once again, unless the priors are too informative, we do not expect a major deviation from the idealised case.

  • Gaussian Processes (Williams and Rasmussen, 2006) are equivalent to infinitely wide BNNs with a single hidden layer, and therefore constitute overparametrized BNNs by definition (although scaling their training to the large data limit might be problematic). Theorem 1 provides theoretical backing to recent empirical observations of their adversarial robustness (Blaas et al., 2019; Cardelli et al., 2019b).

  • While the Bayesian posterior ensemble may not be the only randomization to provide protection, it is clear that some simpler randomizations such as bootstrap will be ineffective, as noted empirically in (Bekasov and Murray, 2018). This is because bootstrap resampling introduces variability along the data manifold, rather than randomising in orthogonal directions. In this sense, the often repeated mantra that bootstrap is an approximation to Bayesian inference is strikingly inaccurate when the data distribution has zero measure support. Similarly, we don’t necessarily expect gradient smoothing approaches to be successful (Athalye et al., 2018), since the type of smoothing performed by Bayesian inference is specifically informed by the geometry of the data manifold.

6 Empirical Results

In this section we investigate the applicability and relevance of our theoretical results on different BNN architectures. We train a variety of BNNs on the MNIST and Fashion MNIST (Xiao et al., 2017) datasets, and evaluate posterior distributions using HMC (asymptotically exact but more expensive), as well as cheaper approximations through VI. We explicitly verify the zero-averaging property of gradients implied by Theorem 1 (Section 6.1), and show empirically (Section 6.2) that FGSM and PGD attacks on BNNs fail to perform better than a random attack. Finally, in Section 6.3 we analyse how robustness and accuracy are correlated on thousands of different neural network architectures trained with HMC, VI and standard Stochastic Gradient Descent (SGD).

6.1 Evaluation of the Gradient of the Loss for BNNs

To investigate the practical relevance of Theorem 1 in the finite data setting, we consider two large fully-connected BNNs with Gaussian priors placed over all of the network weights and biases, evaluated on the popular MNIST and Fashion MNIST benchmarks. We train a network with two hidden layers (1024 neurons per layer, approximately 1.8 million parameters in total) with HMC, and a network with three hidden layers (512 neurons per layer, i.e., almost 1 million parameters) with VI (using KL minimization). Both BNNs reach high test set accuracy on MNIST and on Fashion MNIST. The numbers of parameters and high accuracies suggest that this practical scenario is close to the thermodynamic limit of Theorem 1.

First, we examine the behaviour of the component-wise expectation of the loss gradient as more samples from the posterior distribution $p(w|D)$ are used in the expectation computation. In line with Theorem 1, we expect gradient expectations to shrink the more posterior samples we use. Figure 1 shows anecdotal evidence of this trend on four example images from both MNIST and Fashion MNIST, for BNNs trained with HMC (left half of the figure) and VI (right half of the figure). The heatmaps show the components of the expected loss gradient when 1, 10 and 100 posterior samples are used in the computation, demonstrating a clear component-wise decrease of the gradients as we increase the number of samples and, hence, better approximate the posterior expectation.

We then systematically examine convergence of all the components of the expected loss gradient in Figure 2 for both HMC (top row) and VI (bottom row). Each dot represents one of the components of the expected loss gradient computed on 1,000 test images, for a total of 784,000 gradient components used to show their empirical distribution for each number of samples used to approximate the expectation (x-axis of the plots).

For both HMC and VI the magnitude of the gradient components drops quickly and tends to stabilize around zero at about 100 samples. The residual variance is to be expected, since we are conducting our experiments on a finite approximation of the limiting regime.
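The shrinking of an averaged gradient component with the number of posterior samples can be mimicked with a purely statistical toy. The sketch below assumes that each per-network gradient component has zero mean under the posterior (which is what Theorem 1 asserts in the limit) and, as a further simplifying assumption, that it is standard Gaussian; no neural network is involved and the function names are hypothetical.

```python
import random
import statistics


def averaged_component(n, rng):
    """Average of n per-sample gradient components, each drawn with zero mean."""
    return sum(rng.gauss(0.0, 1.0) for _ in range(n)) / n


def typical_magnitude(n, rng, repeats=200):
    """Mean absolute value of the n-sample average over independent repetitions."""
    return statistics.mean(abs(averaged_component(n, rng)) for _ in range(repeats))


rng = random.Random(0)
magnitudes = {n: typical_magnitude(n, rng) for n in (1, 10, 100)}
for n, m in magnitudes.items():
    print(n, round(m, 3))
```

Under these assumptions the typical magnitude decays roughly like one over the square root of the number of samples, so a residual spread around zero at finite sample sizes is expected even when the expectation itself is exactly zero.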

Figure 2: The component values of the expected loss gradients approach zero as the number of samples from the posterior distribution increases. For each fixed number of samples, the figure shows 784 gradient components for 1000 different test images, from both the MNIST and Fashion MNIST datasets. The gradients are computed on HMC (a) and VI (b) trained BNNs. We inset a plot with a more suitable y-axis range in the case of HMC/MNIST in order to better visualize the trend of convergence.

6.2 Gradient-Based Attacks for BNNs

Showing that gradient cancellation occurs does not directly imply that the network predictions are robust to gradient-based attacks in the finite case. For example, FGSM attacks are crafted such that the direction of the manipulation is given only by the sign of the expectation of the loss gradient and not by its magnitude. Thus, even if the entries of the expectation drop to an infinitesimal magnitude but maintain the correct sign, FGSM will produce the same attack direction.

In order to test the implications of vanishing gradients on the robustness of the posterior predictive distribution against gradient-based attacks, we compare FGSM and PGD to a random attack. The random attack simply draws a random value from the set $\{-\epsilon, \epsilon\}$ for each input dimension and treats this as the resulting perturbation on a test image.
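A random baseline of this kind takes only a few lines. The sketch below assumes that the perturbation of each input dimension is drawn uniformly from the two values $-\epsilon$ and $+\epsilon$, mirroring the per-coordinate magnitude of an FGSM step; the function name and the example input are hypothetical, and clipping to a valid pixel range is omitted for brevity.

```python
import random


def random_attack(x, eps, rng):
    """Perturb every input dimension by -eps or +eps, chosen uniformly at random."""
    return [xi + rng.choice((-eps, eps)) for xi in x]


rng = random.Random(0)
x = [0.0, 0.25, 0.5, 0.75, 1.0]     # stand-in for a flattened test image
x_rand = random_attack(x, eps=0.1, rng=rng)
print(x_rand)
```

The perturbation has the same per-coordinate size as an FGSM step but carries no information about the loss, which is what makes it a natural reference point.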

In Figure 3 we compare the effectiveness of FGSM, PGD and of the random attack as the number of samples drawn from the posterior distribution increases. For the evaluation we consider a notion of robustness, which we call softmax difference. Given $n$ posterior samples $w_1, \dots, w_n$ from a BNN, let $\hat{f}_n(x) = \frac{1}{n} \sum_{i=1}^{n} f(x, w_i)$ be the estimator of the expected output of the BNN computed over $n$ samples drawn from the BNN posterior $p(w|D)$; then the softmax difference is defined as:

$$\frac{1}{N} \sum_{j=1}^{N} \big\| \hat{f}_n(x_j) - \hat{f}_n(\tilde{x}_j) \big\|.$$

That is, the softmax difference computes the average distance, over $N$ test points $x_j$ with adversarial counterparts $\tilde{x}_j$, of the variation of the softmax layer caused by the adversarial attack of a network whose posterior distribution is estimated with $n$ weights. In Figure 3 we consider the cases where FGSM and PGD attacks are evaluated over the full posterior or only with respect to the weights used to craft the attack (‘fixed FGSM’ and ‘fixed PGD’). In other words, in ‘fixed’ attacks the attacker has access to all posterior samples used to evaluate expectations (in particular, the case of a single sample is equivalent to a deterministic NN). The fixed sample case converges to the full inference case as the number of samples goes to $\infty$.
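As a minimal sketch, the softmax difference can be computed from two lists of ensemble-averaged output vectors, one for the clean test points and one for their attacked counterparts. The choice of the $\ell_\infty$ norm as the distance is an assumption made for this example, and the function name and the numbers are hypothetical.

```python
def softmax_difference(clean_outputs, adv_outputs):
    """Average, over test points, of the l_inf distance between clean and attacked outputs."""
    assert len(clean_outputs) == len(adv_outputs)
    total = 0.0
    for clean, adv in zip(clean_outputs, adv_outputs):
        total += max(abs(c - a) for c, a in zip(clean, adv))
    return total / len(clean_outputs)


# Hypothetical ensemble-averaged softmax outputs for N = 2 test points and 2 classes.
clean_outputs = [[0.9, 0.1], [0.2, 0.8]]
adv_outputs = [[0.6, 0.4], [0.2, 0.8]]
diff = softmax_difference(clean_outputs, adv_outputs)
robustness = 1.0 - diff            # higher means more robust
print(diff, robustness)
```

A value of the difference near zero means the attack barely moves the predictive distribution, so one minus the difference can be read as a robustness score between 0 and 1.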

Figure 3 shows the results of this analysis for all networks considered. For convenience, we plot 1 − (softmax difference) so that higher values indicate high robustness and low values indicate low robustness. Notice that the random attack is totally independent of the number of samples, so it results in a horizontal line. As expected, fixed FGSM and fixed PGD with a single posterior sample are highly effective attacks, as this is identical to the deterministic case. When more samples are considered, the effectiveness rapidly decreases towards that of a random attack.

When considering the full posterior to evaluate the attack, both of these attacks perform worse than (or on par with) random attacks. This suggests that, while the attack direction w.r.t. one sampled weight can easily attack a deterministic network, the gradient cancellation of Theorem 1 leads to attacks on different networks cancelling. In other words, only a few predictors in the ensemble are vulnerable in the direction of any one attack. In Figure 3, we also observe that as we move away from the idealised case (HMC/MNIST) to a more approximate posterior on a less well behaved dataset (VI/Fashion MNIST), the attacks against the posterior become more effective, which is to be expected given that the conditions of Theorem 1 are further from being satisfied. Moreover, approximate inference (VI) being slightly more susceptible to attack is consistent with our previous empirical observations in Figure 1 and Figure 2, which suggest that the gradients under VI converge to zero slightly more slowly than under HMC.

In the ideal case, we expect, via Theorem 1, that FGSM and PGD would perform identically to a random attack. As the number of samples increases, if the expectation of the loss gradient vanishes so does any potentially detected attack direction; thus any small residual noise in the loss gradient might be tantamount to random noise.

Figure 3: FGSM and PGD attacks perform worse than a random attack on VI BNNs in terms of the softmax robustness. The attacks are produced on 500 different images from MNIST (first row) and Fashion MNIST (second row), both attacked with an increasing number of posterior samples. Fixed sample robustness refers to 500 images as well and begins to show convergence to the full inference case as the number of samples goes to infinity.

Figure 4: Robustness-Accuracy trade-off on the MNIST dataset for networks trained with HMC (left), VI (right) and SGD. While a marked trade-off between accuracy and robustness occurs for deterministic networks trained with SGD, for HMC and VI the experiments show a positive correlation between accuracy and robustness. The boxplots show the correlation between model capacity and robustness.

Figure 5: Robustness-Accuracy trade-off on the Fashion MNIST dataset for networks trained with HMC (left), VI (right) and SGD. The boxplots show the correlation between model capacity and robustness. Notice that different attack strengths ($\epsilon$) are used for the three methods according to their average robustness.

6.3 Robustness Accuracy Analysis in Deterministic and Bayesian Neural Networks

In the discussion of Section 5, we suggested that, as a consequence of Theorem 1, high accuracy might be related to high robustness to gradient-based attacks in Bayesian settings; this would run counter to what has been commonly observed for deterministic neural networks trained with SGD (Zhang et al., 2019). In this section, we look at an array of 1500 different BNN architectures trained with HMC and VI on both the MNIST and Fashion MNIST datasets and experimentally evaluate their accuracy/robustness trade-off on FGSM attacks, as compared to that obtained with comparable deterministic NNs trained via SGD-based methods.

The results of these analyses are plotted in Figures 4 and 5 for MNIST and Fashion MNIST respectively. Each dot in the scatter plots represents the results obtained for one specific network architecture trained with SGD (blue dots), HMC (pink dots in plots (a)) and VI (pink dots in plots (b)). As already reported in the literature (Zhang et al., 2019), we observe a marked trade-off between accuracy and robustness (i.e., 1 - softmax difference) for deterministic networks, with highly accurate networks also being fragile to FGSM, and vice versa. Interestingly, this trend is fully reversed for BNNs trained with HMC (plots (a) in Figures 4-5), where we find that as networks become more accurate, they also become more robust to FGSM attacks.

To further examine this trend we inspect how the number of parameters affects the robustness of the BNNs. In the case of HMC, we plot (boxplots in plots (a) of Figures 4-5) the number of neurons in the network against the robustness of the resulting posterior and find an increasing trend in robustness as we increase the number of neurons in the network. For VI we observe a trend with model size only when training on MNIST, where model robustness may increase as the width of the layers increases, although wider layers can also lead to poor robustness (which may be indicative of mode collapse). This is in line with what we observed in the previous two sections: as the network approaches the over-parametrised limit, the conditions for Theorem 1 are approximately met and the network is protected against gradient-based attacks.

As expected, the trade-off behaviours are less obvious for the BNNs trained with VI and on Fashion MNIST, that is, with a more approximate inference method on a more complex dataset. In particular, in plot (b) of Figure 5 we find that, similarly to the deterministic case, robustness seems to have a negative correlation with accuracy also for BNNs. However, we should note that in this particular case VI was the most robust in terms of its ability to withstand attacks with high magnitude ($\epsilon$ = 0.15). So, while we report that robustness is not positively correlated with accuracy in this case, we do find that the tested networks trained with VI obtain greater robustness to gradient-based attacks.

7 Conclusions

The quest for robust, data-driven models is an essential component towards the construction of AI-based technologies. In this respect, we believe that the fact that Bayesian ensembles of NNs can evade a broad class of adversarial attacks will be of great relevance.

While promising, this result comes with some significant limitations. First and foremost, performing Bayesian inference in large non-linear models is extremely challenging. While in our hands cheaper approximations such as VI also enjoyed a degree of adversarial robustness, albeit reduced, there are no guarantees that this will hold in general. To this end, we hope that this result will spark renewed interest in the pursuit of efficient Bayesian inference algorithms.

Secondly, our theoretical results hold in a thermodynamic limit which is never realised in practice. More worryingly, we currently have no rigorous diagnostics to understand how near we are to the limit case, and can only reason about this empirically. We notice here that several other studies (Bekasov and Murray, 2018; Li and Gal, 2017; Feinman et al., 2017; Rawat et al., 2017) have focused on pointwise uncertainty to detect adversarial behaviour; while this does not appear relevant in the limit scenario, it might be a promising indicator of robustness in finite data conditions.

Thirdly, our empirical evaluation focused on two attack strategies which directly utilise gradients. More complex gradient-based attacks (Carlini and Wagner, 2016; Papernot et al., 2017; Moosavi-Dezfooli et al., 2016), as well as non-gradient-based or query-based attacks (Ilyas et al., 2018; Wicker et al., 2018), also exist. Evaluating the robustness of BNNs against these attacks would also be interesting.
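To make concrete what an attack that directly utilises gradients looks like against a Bayesian ensemble, the following is a minimal sketch, not the experimental setup of this paper. It assumes a fast-gradient-sign-style step (in the spirit of Goodfellow et al., 2014), a toy logistic model in place of a neural network, and a hypothetical set of posterior weight samples drawn from a hand-picked Gaussian; the names `weight_samples`, `expected_loss_gradient` and `fgsm` are illustrative only.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss_gradient(w, x, y):
    """Gradient of the cross-entropy loss w.r.t. the input x for p = sigmoid(w . x)."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return [(p - y) * wi for wi in w]


def expected_loss_gradient(weight_samples, x, y):
    """Monte Carlo average of the input gradient over the weight samples."""
    total = [0.0] * len(x)
    for w in weight_samples:
        g = loss_gradient(w, x, y)
        total = [t + gi for t, gi in zip(total, g)]
    return [t / len(weight_samples) for t in total]


def sign(v):
    return (v > 0) - (v < 0)


def fgsm(weight_samples, x, y, eps):
    """One signed-gradient step of size eps on the averaged loss gradient."""
    g = expected_loss_gradient(weight_samples, x, y)
    return [xi + eps * sign(gi) for xi, gi in zip(x, g)]


rng = random.Random(0)
# Hypothetical "posterior" (an assumption for illustration): the first weight is
# well determined by the data, the second is unconstrained and centred at zero.
weight_samples = [[2.0 + 0.1 * rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(500)]

x, y = [1.0, 0.0], 1
grad = expected_loss_gradient(weight_samples, x, y)
x_adv = fgsm(weight_samples, x, y, eps=0.15)

# Average magnitude of the per-sample gradient along the unconstrained direction,
# for comparison with the magnitude of the averaged gradient grad[1].
mean_abs_g1 = sum(abs(loss_gradient(w, x, y)[1]) for w in weight_samples) / len(weight_samples)
print("averaged gradient:", grad)
print("mean |per-sample gradient| in direction 2:", mean_abs_g1)
print("adversarial input:", x_adv)
```

In this toy setting the per-sample gradients along the unconstrained direction have mixed signs and largely cancel when averaged, whereas the gradient along the well-determined direction does not. This only illustrates the averaging mechanism; it says nothing about how close a real BNN is to the limit regime.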

Finally, the proof of our main result highlighted a profound connection between adversarial vulnerability and the geometry of data manifolds; it was this connection that enabled us to show that randomisation might be an effective way to provide robustness in the high dimensional context. We hope that this connection will inspire novel algorithmic strategies which can offer adversarial protection at a cheaper computational cost.

References

  • A. Athalye, N. Carlini, and D. Wagner (2018) Obfuscated gradients give a false sense of security: circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420. Cited by: 5th item.
  • D. Barber (2012) Bayesian reasoning and machine learning. Cambridge University Press. Cited by: §3.
  • A. Bekasov and I. Murray (2018) Bayesian adversarial spheres: bayesian inference and adversarial examples in a noiseless setting. arXiv preprint arXiv:1811.12335. Cited by: §1.1, §1, 5th item, §7.
  • B. Biggio and F. Roli (2018) Wild patterns: ten years after the rise of adversarial machine learning. Pattern Recognition 84, pp. 317–331. Cited by: §1, §2.
  • C. M. Bishop (2006) Pattern recognition and machine learning (information science and statistics). Springer-Verlag, Berlin, Heidelberg. External Links: ISBN 0387310738 Cited by: §3.
  • A. Blaas, L. Laurenti, A. Patane, L. Cardelli, M. Kwiatkowska, and S. Roberts (2019) Robustness quantification for classification with gaussian processes. arXiv preprint arXiv:1905.11876. Cited by: 4th item.
  • L. Cardelli, M. Kwiatkowska, L. Laurenti, N. Paoletti, A. Patane, and M. Wicker (2019a) Statistical guarantees for the robustness of bayesian neural networks. arXiv preprint arXiv:1903.01980. Cited by: §1.1.
  • L. Cardelli, M. Kwiatkowska, L. Laurenti, and A. Patane (2019b) Robustness guarantees for bayesian inference with gaussian processes. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 7759–7768. Cited by: 4th item.
  • N. Carlini and D. Wagner (2016) Towards evaluating the robustness of neural networks. arXiv preprint arXiv:1608.04644. Cited by: §7.
  • N. Carlini and D. Wagner (2017) Adversarial examples are not easily detected: bypassing ten detection methods. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pp. 3–14. Cited by: §1.1.
  • S. S. Du, J. D. Lee, H. Li, L. Wang, and X. Zhai (2018) Gradient descent finds global minima of deep neural networks. arXiv preprint arXiv:1811.03804. Cited by: §4, §4, §4.
  • A. Fawzi, H. Fawzi, and O. Fawzi (2018) Adversarial vulnerability for any classifier. In Advances in Neural Information Processing Systems, pp. 1178–1187. Cited by: §4.
  • R. Feinman, R. R. Curtin, S. Shintre, and A. B. Gardner (2017) Detecting adversarial samples from artifacts. arXiv preprint arXiv:1703.00410. Cited by: §1.1, §1, §7.
  • Y. Gal and L. Smith (2018) Sufficient conditions for idealised models to have no adversarial examples: a theoretical and empirical study with bayesian neural networks. arXiv preprint arXiv:1806.00667. Cited by: §1.1, §1.
  • S. Goldt, M. Mézard, F. Krzakala, and L. Zdeborová (2019) Modelling the influence of data structure on learning in neural networks. arXiv preprint arXiv:1909.11500. Cited by: §4.
  • I. J. Goodfellow, J. Shlens, and C. Szegedy (2014) Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572. Cited by: §1, §1, §2, §4.
  • A. Ilyas, L. Engstrom, A. Athalye, and J. Lin (2018) Black-box adversarial attacks with limited queries and information. arXiv preprint arXiv:1804.08598. Cited by: §2, §7.
  • M. Khoury and D. Hadfield-Menell (2018) On the geometry of adversarial examples. arXiv preprint arXiv:1811.00525. Cited by: §4.
  • Y. Li and Y. Gal (2017) Dropout inference in bayesian neural networks with alpha-divergences. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2052–2061. Cited by: §1.1, §7.
  • X. Liu, Y. Li, C. Wu, and C. Hsieh (2018) Adv-bnn: improved adversarial defense through robust bayesian neural network. arXiv preprint arXiv:1810.01279. Cited by: §1.
  • A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu (2017) Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083. Cited by: §1, §2.
  • S. Mei, A. Montanari, and P. Nguyen (2018) A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences 115 (33), pp. E7665–E7671. Cited by: §4, §4, §4.
  • R. Michelmore, M. Wicker, L. Laurenti, L. Cardelli, Y. Gal, and M. Kwiatkowska (2019) Uncertainty quantification with statistical guarantees in end-to-end autonomous driving control. arXiv preprint arXiv:1909.09884. Cited by: §1.1.
  • S. Moosavi-Dezfooli, A. Fawzi, and P. Frossard (2016) Deepfool: a simple and accurate method to fool deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2574–2582. Cited by: §7.
  • R. M. Neal (2012) Bayesian learning for neural networks. Vol. 118, Springer Science & Business Media. Cited by: §1, §3.
  • N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. B. Celik, and A. Swami (2017) Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia conference on computer and communications security, pp. 506–519. Cited by: §2, §7.
  • A. Rawat, M. Wistuba, and M. Nicolae (2017) Adversarial phenomenon in the eyes of bayesian deep learning. arXiv preprint arXiv:1711.08244. Cited by: §1.1, §7.
  • G. M. Rotskoff and E. Vanden-Eijnden (2018) Neural networks as interacting particle systems: asymptotic convexity of the loss landscape and universal scaling of the approximation error. arXiv preprint arXiv:1805.00915. Cited by: §4, §4, §4.
  • C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus (2013) Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199. Cited by: §1.
  • M. Wicker, X. Huang, and M. Kwiatkowska (2018) Feature-guided black-box safety testing of deep neural networks. In International Conference on Tools and Algorithms for the Construction and Analysis of Systems, pp. 408–426. Cited by: §2, §7.
  • C. K. Williams and C. E. Rasmussen (2006) Gaussian processes for machine learning. Vol. 2, MIT press Cambridge, MA. Cited by: 4th item.
  • H. Xiao, K. Rasul, and R. Vollgraf (2017) Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747. Cited by: §6.
  • D. Yadron and D. Tynan (2016) Tesla driver dies in first fatal crash while using autopilot mode. the Guardian 1. Cited by: §1.
  • N. Ye and Z. Zhu (2018) Bayesian adversarial learning. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 6892–6901. Cited by: §1.1.
  • H. Zhang, Y. Yu, J. Jiao, E. P. Xing, L. E. Ghaoui, and M. I. Jordan (2019) Theoretically principled trade-off between robustness and accuracy. arXiv preprint arXiv:1901.08573. Cited by: §1, §6.3, §6.3.