The Limitations of Model Uncertainty in Adversarial Settings

12/06/2018 ∙ by Kathrin Grosse, et al. ∙ CISPA

Machine learning models are vulnerable to adversarial examples: minor perturbations to input samples intended to deliberately cause misclassification. Many defenses have led to an arms race; we thus study a promising, recent trend in this setting, Bayesian uncertainty measures. These measures allow a classifier to provide principled confidence and uncertainty for an input, where the latter refers to how unusual the input is. We focus on Gaussian processes (GP), a classifier providing such principled uncertainty and confidence measures. Using correctly classified benign data as a baseline, GP's intrinsic uncertainty and confidence deviate for misclassified benign samples and misclassified adversarial examples. We therefore introduce high-confidence-low-uncertainty (HCLU) adversarial examples: adversarial examples crafted to maximize GP confidence and minimize GP uncertainty. Visual inspection shows that HCLU adversarial examples are still malicious, and resemble the original rather than the target class. HCLU adversarial examples also transfer to other classifiers. We focus on transferability to other algorithms providing uncertainty measures, and find that a Bayesian neural network confidently misclassifies HCLU adversarial examples. We conclude that uncertainty and confidence, even in the Bayesian sense, can be circumvented by both white-box and black-box attackers.







I Introduction

Machine learning classifiers are used for various purposes in a variety of research areas ranging from health to security. Given a data set, for example a collection of labeled programs (as malicious and benign), a classifier is trained to predict a label for an unseen program. More concretely, for new programs, this classifier is able to predict if they are likely to be malicious or not.

(a) Spam data.
(b) MNIST vs .
Fig. 1: Uncertainty (gray/black) in Gaussian processes, visualized using the first two principal components. As opposed to Bayesian neural networks (visualized by [33]), uncertainty increases smoothly when moving away from the training data (here depicted in red). For readability, we upper bound the uncertainty in the plot at a multiple of the largest uncertainty measured for a training point.

However, classifiers have been shown to be vulnerable to a number of different attacks [20, 22, 11, 7]. One attack, called evasion or adversarial examples, presents a direct threat to classification: an attacker slightly modifies a program (which keeps its malicious functionality), yet the classifier outputs that it is likely to be benign [35, 39, 14]. Also in the area of computer vision, adversarial examples can be generated, often leading to visually indistinguishable images which are misclassified by state-of-the-art models.


While a range of defenses against these attacks has been developed, they mostly provide an empirical mitigation against adversarial examples and are often broken later on [6, 2]. The latter approach exploits the fact that robustness guarantees are often limited to small changes in the input data, and classifiers are therefore still vulnerable to larger changes. Despite the greater degree of change, these examples remain undetectable to humans. Further, such adversarial examples often exhibit a high confidence score. This score refers to a high softmax output, and such examples are therefore known as high-confidence adversarial examples. As pointed out by [9], however, a model can still be uncertain (in a Bayesian sense) in its predictions even if the output of the softmax is high. Hence, this interpretation of the model's output does not necessarily reflect the actual model confidence.

In contrast, models based on Bayesian inference offer a mathematically grounded framework to reason about predictive confidence and uncertainty. Recent works study the uncertainty and robustness of Bayesian neural networks [18, 33, 3]. Yet, the uncertainty measures of Bayesian neural networks (like the decision functions of their non-Bayesian deep counterparts) are not very smooth [33]. We visualize the intrinsic uncertainty measures of Gaussian processes (GP) in Figure 1, and find that they increase smoothly when moving away from the training data.

Additionally, GPs treat classification as sampling from a distribution over infinitely many potential decision functions. Hence, they form an implicit ensemble of classifiers. Recent mitigations leverage such ensembles, or a combination of several decision boundaries [38, 30].

We conclude that GP represent an effective mechanism for two recent trends in the research of adversarial examples: Bayesian uncertainty measures and ensembles.

Our contributions are the following:

  • We investigate GP uncertainty measures in the presence of adversarial data. To this end, we apply several classifiers (BNN, DNN, SVM) on a range of tasks (Malware, spam, handwritten digit recognition) using different attacks (including the attacks from Carlini & Wagner and attacks specifically targeting GP). We observe that either uncertainty or confidence scores of adversarial examples deviate from the values of correctly classified benign data.

  • We are the first to introduce high-confidence-low-uncertainty (HCLU) adversarial examples, which simultaneously maximize confidence and minimize uncertainty. HCLU adversarial examples are still malicious: when visualized, these examples still resemble their original class.

  • We investigate the transferability of HCLU examples, and observe that both uncertainty and confidence transfer from GP to BNN. We conclude that Bayesian uncertainty measures, albeit promising, can be circumvented in both white- and black-box settings.

II Background

In this section, we first describe the attacker we consider. We then give a formal definition of the classification task in which our attacker is rooted, before we formalize the algorithms studied in this work, Gaussian processes and Bayesian neural networks. Finally, we review evasion attacks, i.e., algorithms that generate adversarial examples against classifiers.

II-A Threat Model

A classifier is trained on labeled data, and after training is able to predict, for a given unknown data point, which class it belongs to. An attacker in this setting can, given that she can slightly change the data point, always obtain a classification according to her wishes [35, 39, 14].

This pressing threat has been tackled by many researchers, yet found to lead to an arms race [6]. We want to investigate a promising solution to this problem, which has been brought up by [34]: Bayesian uncertainty. In contrast to common machine learning classifiers, there are classifiers which are able to express their confidence in the classification output they generate. Such confidence differs from, for example, deep neural networks, which obtain a probability that is a mere normalization of their output. Instead, algorithms such as Gaussian processes or Bayesian neural networks are able to express in probabilistic terms how confident they are, and how unusual (compared to the training data) the input is that they were presented with.

The question we want to answer in this work is whether these algorithms are able to alleviate the threat of manipulated data. To conduct the corresponding study, however, we first need to define the attacker we consider.

In the white-box setting, the attacker has unrestricted access to the model, its parameters and its training data. A black-box adversary has access to neither the model nor its parameters. A gray-box adversary has no detailed knowledge about the model, but might know some parameters or which classifier was applied. Analogously, the attacker might have full knowledge about the data that was used to train the model, partial knowledge (some data points, or the meaning of features), or no knowledge at all. Further, recent work has shown that black-box or gray-box access allows an attacker to replicate the model, yielding a white-box attacker [37].

Adversarial Capabilities

In our motivating study, a range of attackers is considered: black-box attackers, gray-box attackers, and white-box attackers. In a second study, we study an attacker that has white-box knowledge about both model and data. We propose this powerful attacker in an effort to study a worst-case scenario. Finally, to conclude our study on uncertainty, we consider a gray-box attacker that is only aware that uncertainty is applied, but not which model provides the uncertainty.

II-B Mathematical Notation for Classification

We consider the general setting of classification, where we are given a labeled set of N training instances in the form of feature vectors X = {x_1, …, x_N}, where each individual feature vector x_i is composed of d features (with x_i ∈ R^d). Each x_i is associated with a label y_i ∈ {1, …, K}, where K is the number of classes. The goal of classification is to adapt the parameters or weights w of a classifier F such that, given further test samples x', F predicts a label y' such that y' = y: the predicted label corresponds to the correct label.

To conclude, we briefly review two classifiers which we use, however do not focus on, in this paper. Deep neural networks (DNN) are layered classifiers where the input x is propagated through a parametrized layer f_1. The output of this layer is then used as input for the following layer, f_2, and so on. The output of the last layer is commonly referred to as the logits, and is fed into a softmax function to obtain a normalized (non-Bayesian) probability.
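As a sketch of the layered computation just described, the following minimal NumPy snippet propagates an input through two parametrized layers and normalizes the resulting logits with a softmax; the ReLU activations and the weight values are illustrative assumptions, not the paper's architecture:

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability; the result sums to 1.
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def forward(x, weights, biases):
    # Propagate x through parametrized layers with ReLU activations;
    # the raw output of the last layer are the logits.
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(0.0, W @ h + b)
    return weights[-1] @ h + biases[-1]  # logits

# Toy two-layer network on a 4-dimensional input.
logits = forward(np.ones(4),
                 [np.eye(4), np.ones((2, 4))],
                 [np.zeros(4), np.zeros(2)])
probs = softmax(logits)  # normalized (non-Bayesian) probability
```

Note that `probs` is a mere renormalization of the logits; as discussed above, it carries no Bayesian notion of uncertainty.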

The second classifier is the support vector machine (SVM). The geometric interpretation of the SVM is a classifier which, given a similarity metric or kernel, computes a decision boundary that maximizes the distance to both classes. The training points used to determine this maximal distance are called support vectors. In our study, we use an SVM with a linear and an RBF kernel.

II-C Gaussian Processes

We focus on Gaussian process classification (GPC), a classifier which provides intrinsic uncertainty estimates. In a nutshell, GPC is based on a parametrized similarity function which is configured during training to separate the training data. Classification at test time is then carried out by weighting the test point’s distance to all training points and their corresponding labels. To give an example, a test point

can then be classified due to one, very close point of one class, or due to several, less close points of agreeing class.

(a) Untrained (or prior of) Gaussian process.
(b) Untrained (or prior of) Gaussian process classifier.
(c) Training data and resulting Gaussian process.
(d) Training data and resulting Gaussian process classifier.
Fig. 2: Illustration of Gaussian process classification in one dimension: the x-axis corresponds to the feature, the y-axis to confidence. Gaussian processes can be interpreted as an infinite number of classification functions which are then restricted to the ones matching the data. Plots (a) and (b) show the untrained or prior distribution over decision functions. Plots (c) and (d) show the resulting Gaussian process (plot c) and classifier (plot d) after observations. The thick black line is the mean, the blue background denotes the standard deviation. The thin lines are functions sampled from the GP. Note that the possible deviation is smallest close to the decision boundary next to the observations.

In more detail, the unadapted, initial parameters of the similarity function are called the prior. In a probabilistic view, this initial GP entails all possible decision functions, as depicted in Figure 2(a). More concretely, a GP is the distribution over these decision functions, where we assume this distribution to be Gaussian. Writing K(X,X) to denote the similarity of the training data X to itself, measured by the kernel k and parametrized by θ, we formalize the Gaussian distribution of our GP as

f ∼ N(0, K(X,X)),     (1)

where we observe that, as in Figure 2(a), the mean is zero and the variance of the GP is defined using the similarity function k. This distribution entails all decision functions, and thus provides the classification output. We skip the details of the optimization of θ here, and assume that we have already found fitting parameters for our training data. The resulting GP is depicted in Figure 2(c): the distribution over decision functions is now constrained to functions aligning with the training data. As we can see in Figure 2(c), the joint mean (black line) of this distribution can be used for classification.

To derive the prediction of a GP, we extend Equation 1 to the test data X*. We set K = K(X,X) in matrix notation (analogously K* = K(X,X*), K** = K(X*,X*)) and rewrite Equation 1 as

[y; f*] ∼ N(0, [[K, K*], [K*ᵀ, K**]]),     (2)

where we denote the labels of the training data as y and the unknown test labels the GP has to predict as f*. We now formalize the prediction of the GP, which is obtained by reordering Equation 2 to solve for f*, as

f̄* = K*ᵀ K⁻¹ y = Σᵢ Σⱼ k(x*, xᵢ) (K⁻¹)ᵢⱼ yⱼ,     (3)

where we indeed obtain the weighted output we initially referred to. The explicit sums in the second part illustrate this weighting during classification.

Since the distribution we consider is a Gaussian, we can further derive the predictive variance (depicted as blue areas in Figure 2(c)). Analogous to the mean, the prediction of the variance is obtained by reordering Equation 2 to solve for the variance of f*, obtaining

V[f*] = K** − K*ᵀ K⁻¹ K*,     (4)

where we observe that the (predicted) labels do not matter for the variance: the variance is a class-independent measure. It describes how close the test data is, in general, to the training data. We use the terms variance and uncertainty interchangeably throughout this paper, and likewise mean and confidence.
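Equations 3 and 4 can be sketched directly in NumPy. The snippet below is a toy illustration, not the GPy-based setup used in the paper: regression-style labels, an RBF kernel, and a small jitter term on the diagonal are our assumptions. It shows the predictive mean as a weighting of training labels, and the label-independent variance growing for a test point far from the training data:

```python
import numpy as np

def rbf(A, B, lengthscale=1.0, variance=1.0):
    # k(x, x') = sigma^2 * exp(-||x - x'||^2 / (2 * l^2))
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-d2 / (2.0 * lengthscale ** 2))

def gp_predict(X, y, Xs, jitter=1e-6):
    K = rbf(X, X) + jitter * np.eye(len(X))  # K(X, X)
    Ks = rbf(X, Xs)                          # K(X, X*)
    Kss = rbf(Xs, Xs)                        # K(X*, X*)
    Kinv = np.linalg.inv(K)
    mean = Ks.T @ Kinv @ y                   # Eq. 3: weighted labels
    cov = Kss - Ks.T @ Kinv @ Ks             # Eq. 4: label-independent
    return mean, np.diag(cov)

X = np.array([[0.0], [1.0]])   # two training points
y = np.array([-1.0, 1.0])      # their labels
mean, var = gp_predict(X, y, np.array([[0.0], [5.0]]))
# mean[0] recovers the label of the training point at x = 0, while
# var grows for the far-away test point at x = 5: uncertainty
# increases smoothly away from the training data.
```

In practice one would use a Cholesky decomposition rather than an explicit inverse, and the Laplace approximation discussed below for classification labels; the matrix algebra, however, mirrors Equations 2–4.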

We have ignored so far that the outputs of the GP might be arbitrarily large. Also, strictly speaking, the labels of a classification task are not drawn from a Gaussian distribution. We thus add a link function (somewhat analogous to the softmax used in deep neural networks) to normalize the output of the GP (as depicted in Figure 2(b)). Skipping the details, this is handled via the Laplace approximation, which approximates the resulting non-Gaussian posterior with a Gaussian to enable classification while taking the previously named issues into account. We can, however, still access the unnormalized mean and variance, which are then called the latent mean and latent variance. For further details, we refer the interested reader to Rasmussen and Williams [28].

Before we turn to Bayesian neural networks, we want to address the choice of the similarity metric k. The most common similarity metric, used throughout this work, is the RBF kernel. The similarity between two data points x and x' is defined as

k(x, x') = σ² exp(−‖x − x'‖² / (2ℓ²)),     (5)

where the distance between the two points is re-scaled by the lengthscale ℓ and the variance σ². These two parameters, ℓ and σ², jointly form the parameters θ which are adapted during training. Because of the exponential function, the similarity approaches zero as the (re-scaled) distance gets larger. This property is called abating, and is useful for outlier detection or open-set tasks [32]. In our setting, it ensures for example that the uncertainty smoothly increases when moving away from the training data.
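A minimal sketch of Equation 5 and its abating property (function and parameter names are illustrative):

```python
import numpy as np

def rbf_kernel(x, xp, lengthscale=1.0, variance=1.0):
    # k(x, x') = sigma^2 * exp(-||x - x'||^2 / (2 * l^2))
    d2 = np.sum((np.asarray(x) - np.asarray(xp)) ** 2)
    return variance * np.exp(-d2 / (2.0 * lengthscale ** 2))

near = rbf_kernel([0.0], [0.1])    # close to a "training point"
far = rbf_kernel([0.0], [10.0])    # far away from it
# The similarity abates towards 0 as the distance grows: far is
# numerically almost 0, while near stays close to variance = 1.
```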

II-D Bayesian Neural Networks

Bayesian neural networks (BNN) are, analogous to deep neural networks (DNN), functions that consist of parametrized layers. In contrast to (non-Bayesian) DNN, however, the parameters of a BNN are not treated as fixed values to be optimized, but are seen as random variables. Each variable then has an initial, or prior, distribution. In contrast to the GP described in the previous subsection, however, it is not possible to integrate out the uncertainty exactly. The uncertainty measures are thus approximated, for example using variational inference. For more details on BNN, we refer the interested reader to Smith and Gal [33].

II-E Evading Classification

To evade a trained classifier F at test time, we compute a small change δ for a sample x such that

F(x + δ) ≠ F(x), with ‖δ‖ minimal,     (6)

where the adversarial example x' = x + δ is classified as a different class than the original input, and δ is as small as possible. The minimality constraint can be loosened to obtain a maximum-confidence example, for which δ is larger but the classifier is also more confident in the classification. An attacker can also specify which class the adversarial example should be classified as, yielding a targeted attack. Since we only consider binary settings in this paper, the latter distinction is superfluous.

Before we detail the attacks used in this work, we want to address how δ is measured. There are three common distance metrics in the literature:

  • The L0 metric counts the number of changed features, formally ‖δ‖₀ = |{i : δᵢ ≠ 0}|. An application of this measure is binary data, since any valid change to a feature always has size 1.

  • The L2 metric is the Euclidean distance ‖δ‖₂ = (Σᵢ δᵢ²)^{1/2}. Optimizing this norm favors many individual, small changes, as might be desirable in image data: small changes are hard to notice. This distance is also used in the covariance metric of the GP (compare Equation 5).

  • The L∞ metric measures the largest change introduced, and can be formalized as ‖δ‖∞ = maxᵢ |δᵢ|. It is particularly well-suited for image data, as small changes are not perceived and can thus be applied to many features (or pixels).
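The three metrics can be computed directly for a perturbation; a short NumPy sketch with made-up toy values:

```python
import numpy as np

x = np.array([0.1, 0.5, 0.9, 0.3])       # original sample (toy values)
x_adv = np.array([0.1, 0.8, 0.9, 0.2])   # perturbed sample (toy values)
delta = x_adv - x

l0 = int(np.count_nonzero(delta))      # number of changed features
l2 = float(np.linalg.norm(delta, 2))   # Euclidean length of the change
linf = float(np.abs(delta).max())      # largest single change
```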

Many algorithms exist for creating adversarial examples. We briefly recap the algorithms that we rely on in our evaluation. All presented algorithms, if not stated otherwise, target deep neural network models. The first method we review is the fast gradient sign method (FGSM) [11]. This method is formalized as

x' = x + ε · sign(∇ₓ J(x, y)),

where ε parametrizes the strength of the perturbation, and ∇ₓ J(x, y) denotes the gradient of the model's loss with respect to the input. FGSM implicitly minimizes the L∞ norm, as the same change is applied to all features. Further, FGSM has been extended to SVM in [26], and is extended to GPC in this work.
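A minimal sketch of the FGSM update, assuming the loss gradient has already been computed; the gradient values here are made up for illustration, not taken from a real model:

```python
import numpy as np

def fgsm(x, grad_loss, eps):
    # x' = x + eps * sign(dJ/dx): every feature with a non-zero gradient
    # moves by exactly eps, so the perturbation's L_inf norm is eps.
    return x + eps * np.sign(grad_loss)

x = np.array([0.2, 0.7, 0.5])
grad = np.array([0.9, -0.1, 0.0])   # hypothetical loss gradient
x_adv = fgsm(x, grad, eps=0.1)
```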

Further, we apply the Jacobian-based saliency map approach (JSMA) [27]. JSMA is based on the derivative of the model's output with respect to its inputs (in the case of deep neural networks, this gradient can be computed using either the normalized sigmoid output or the unnormalized logits; in this paper, the second variant of the attack is applied). We review the definition given in [7], and define the attack for two pixels i and j. Here, we use αᵢⱼ to denote the gradient of features i and j with respect to the specified target class t, and βᵢⱼ for the corresponding gradients of all other classes.

In a nutshell, αᵢⱼ denotes how much changing i and j affects the output of the target class, whereas βᵢⱼ measures the effect on all other outputs. JSMA then picks

(i*, j*) = arg max over (i, j) of (−αᵢⱼ · βᵢⱼ), subject to αᵢⱼ > 0 and βᵢⱼ < 0,

where the feature pair with the strongest joint effect is chosen (first term) which increases the output for the target class (second condition) and decreases the output for all other classes (third condition). The search is executed iteratively until misclassification is achieved or a set threshold is exceeded. Depending on the implementation, JSMA optimizes one of the above metrics; we use one variant for DNN and another in our JSMA variant for GP.
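The pair-selection rule can be sketched as follows; the brute-force search and the variable names are illustrative, and the gradient values are made-up toys rather than outputs of a real model:

```python
import numpy as np

def jsma_pick_pair(grad_target, grad_others):
    # grad_target[i]: dF_t/dx_i for the target class t.
    # grad_others[i]: sum over classes c != t of dF_c/dx_i.
    # For every feature pair (i, j), alpha must be positive (changing
    # the pair helps the target class) and beta negative (it hurts all
    # other classes); among those, pick the pair maximizing -alpha * beta.
    n = len(grad_target)
    best, best_score = None, -np.inf
    for i in range(n):
        for j in range(i + 1, n):
            alpha = grad_target[i] + grad_target[j]
            beta = grad_others[i] + grad_others[j]
            if alpha > 0 and beta < 0 and -alpha * beta > best_score:
                best, best_score = (i, j), -alpha * beta
    return best

pair = jsma_pick_pair(np.array([0.5, 0.4, -0.2]),
                      np.array([-0.3, -0.1, 0.4]))
```

In a full attack, the chosen pair of features would be perturbed and the search repeated until misclassification or a change budget is exceeded.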

We finally review the Carlini and Wagner attacks [7]. They treat the task of producing an adversarial example as an iterative optimization problem. The authors introduce three attacks, minimizing the L0, L2, and L∞ norms respectively. The L2 attack is formalized as the optimization problem

min over w of ‖½(tanh(w) + 1) − x‖₂² + c · f(½(tanh(w) + 1)),

where the usage of tanh ensures that the box constraint on the features is fulfilled, and c trades off the two terms. We define f, using Z(x')ᵢ for the output of class i, as

f(x') = max( max{Z(x')ᵢ : i ≠ t} − Z(x')ₜ, −κ ),

where t refers to the target class we optimize for.

In a nutshell, we maximize the output of the target class. Further, κ trades off how confident the DNN is in the resulting adversarial example x'. According to the authors, when the adversarial example is required to fool a second DNN, κ should be set to a larger value. κ is our main motivation to use this attack in our study: it allows us to specify the confidence of the target model.

The authors introduce another attack minimizing the L0 norm. Since this norm is non-differentiable, an iterative attack is proposed in which the L2 attack is used to determine which features are changed. Analogously, the L∞ norm is poorly differentiable and hard to optimize; the authors propose here an iterative attack with a penalty taking the L∞ norm into account.

Besides these attacks, which are used in our evaluation, there exist further variants of adversarial examples targeting other types of classifiers or employing different manners of optimization [4, 21].

III Experimental Setup

Before we begin our study, we briefly describe the setting of our evaluation. We first present the data used, then comment on the classifiers and on the adversarial examples crafted. To conclude, we briefly review our implementation (please contact the authors to get access).

Data. We focus on security settings, which are generally binary classification tasks. Two tasks are of direct security relevance, and concern the classification of spam emails and of malicious programs. We thus use a Malware data set (Hidost) [36] and a spam data set [19]. Both contain a mixture of binary and real-valued features, with the Malware data being mostly binary. Further, both data sets are imbalanced towards their dominant class. To validate our results on fully real-valued (and balanced) data, we study two sub-tasks of the MNIST data set [16]: 9 vs. 1 and 3 vs. 8. MNIST contains black and white images of digits which have to be classified. All data sets were selected to facilitate a study on a wide range of different types of data. We summarize these data sets in Table I.

Name     | kind of features | attack norms
Hidost   | mostly binary    | L0
Spam     | mixed            | L0, L2, L∞
MNIST91  | real             | L0, L2, L∞
MNIST38  | real             | L0, L2, L∞
TABLE I: Overview of data sets used.

Classifiers. We train Gaussian process classification (GPC) and Bayesian neural networks (BNN), of which we study the uncertainty measures, but also a range of substitutes. These substitutes include a linear support vector machine (SVM) and a deep neural network (DNN), and two neural networks that are trained to mimic GPC (called GPDNN) and the linear SVM (linDNN, see the Appendix for details). We finally also train an RBF SVM to evaluate accuracy of some attacks. The accuracy on benign test data of all classifiers or substitutes is given in Table II.

MNIST91 MNIST38 Malw Spam
linear DNN
linear SVM
random guess
TABLE II: Accuracy of classifiers on benign data.

Attacks. We propose two attacks on GPC which are based on the Jacobian of the classifier, and are the equivalents of FGSM and JSMA (see Section II-E) for GPC (detailed derivation can be found in the Appendix). In our second study, we introduce an additional attack on GPC, which we skip here.

We further use all attacks described in detail in Section II-E on the substitutes described above: we target the linear SVM using FGSM, and use FGSM, JSMA and Carlini and Wagner's attack on the DNN as well as on the DNNs that mimic GPC and the linear SVM, hence GPDNN and linDNN.


We implement our experiments in Python. For DNN and BNN, we use TensorFlow [1], SciPy [15] for SVM and optimization problems, and GPy [13] for GPC. We rely on the implementation of the JSMA and FGSM attacks from the Cleverhans library, version 1.0.0 [12]. Further, we obtain the code provided by Carlini and Wagner for their attacks (retrieved July 2017). We implement all other attacks ourselves.

IV Motivating Study

We study how adversarial examples affect the uncertainty and confidence that GPC provides. Previous work identifies uncertainty measures as a potential mitigation of adversarial examples: we therefore expect any adversarial example to show deviations in its uncertainty or confidence scores.

(a) Confidence or absolute latent mean: higher is better.
(b) Uncertainty or latent variance: lower is better.
Fig. 3: Violin plots of confidence and uncertainty of wrongly (red) and correctly (gray) classified (manipulated) data. We summarize similar attacks independent of the model they were crafted on to ease understanding. Benign data (darker) is the test data, and is depicted on the very left. X-axes are shared; the data sets are the same in both settings.

IV-A Results

We monitor GP’s uncertainty and confidence measures for benign and malicious data. The latter, manipulated data, is depicted by attack type: All attacks based on the Jacobian with iterative, local changes are summarized in JSMA. The data for JSMA hence includes examples crafted on DNN, GPC, linDNN, and GPDNN. The FGSM data includes examples crafted on DNN, linDNN, GPDNN, GPC and linear SVM.

We further report FGSM dependent on the perturbation strength ε, to observe the relationship between strength of perturbation and confidence or uncertainty. The Carlini and Wagner attacks were crafted on DNN, linDNN and GPDNN.

We distinguish correctly classified data and wrongly classified data: Adversarial data counts as correctly classified when the original class is recovered: the classifier outputs the class of the benign counterpart of the adversarial example. To visualize how uncertainty and confidence values are distributed, we use two violin plots per attack: one for misclassified data (red) and one for correctly classified (gray).

Confidence on Manipulated Data. We plot the distribution of the confidence, or latent mean, in Figure 3(a). In general, GPC is more confident on correctly classified data than on misclassified data. This holds in all cases on MNIST91. On Malware and MNIST38, we observe similar results, except for FGSM with certain ε values. On MNIST38, these differences are overall less pronounced. We observe almost no difference in confidence on the spam data set, except for adversarial examples produced by some of the Carlini and Wagner attacks.

In general, GPC is as confident on misclassified malicious data as on misclassified benign data. An exception on the MNIST91 and the Malware data is FGSM with certain ε values. On MNIST38, we observe similar confidence to benign, misclassified data only for JSMA and the Carlini and Wagner attacks. Finally, on the spam data set, GPC is in general more confident on adversarial than on benign misclassified data, with the exceptions of FGSM with certain ε values and some of the Carlini and Wagner attacks.

Uncertainty on Manipulated Data. We plot the uncertainty measures, or latent variance, in Figure 3(b). In contrast to the confidence values, the uncertainty values do not differ strongly between correctly and misclassified data. The only notable exception is FGSM with certain ε values on MNIST91.

Instead, uncertainty is generally higher when the perturbation introduced during crafting is larger: the higher the ε used in FGSM, the higher the observed uncertainty estimates. This result is, on all data sets, most pronounced for the largest ε; for smaller ε, the uncertainty values are similar to the values of benign data. On MNIST91, MNIST38 and Malware, the uncertainty values are also similar to benign data for JSMA and the Carlini and Wagner attacks. On the spam data set, the results are slightly different. Here, two of the Carlini and Wagner attacks and FGSM with small ε are similar to the uncertainty of benign data. JSMA on the spam data set leads to a notable increase in uncertainty. Finally, the uncertainty of two of the attacks on MNIST91 is distributed uniformly, as opposed to the distribution of benign, misclassified data, which is skewed towards zero.

Ignoring outliers, the uncertainty of manipulated data is in general higher than the uncertainty of benign, correctly classified data. There are few exceptions, for example FGSM with small ε on all data sets. Further, some of the Carlini and Wagner attacks on spam, as well as JSMA on Malware, do not affect uncertainty strongly.

The L2 attack leads in all cases to decreased uncertainty. We suspect that this is a consequence of it being optimized for the L2 distance, which is also used in the covariance function of GPC. Another explanation might be that the L2 metric is differentiable and can thus be optimized directly. This might lead to more stealthy examples.

(a) Accuracy of undefended GPC.
(b) Accuracy of defended GPC.
Fig. 4: Accuracies on benign (dotted lines) and manipulated data (markers). Colors indicate the classifier the manipulated data was crafted on (legend in right figure). The left-hand side shows out-of-the-box GPC. On the right-hand side, GPC is defended using confidence and uncertainty thresholds computed on hold-out data. Concerning accuracy, a manipulated (adversarial) data point counts as correctly classified if it is either rejected or classified as its original, benign class. Rejected benign data, on the other hand, counts as wrongly classified. Figure is best seen in color.

A Mitigation.

Confidence estimates in GPC seem to distinguish benign and perturbed data to a large extent, when correctly classified, benign data is used as a baseline. In the case of uncertainty, we do not observe a difference between misclassified and correctly classified data, but rather a relationship between the size of the perturbation and the increase in uncertainty. We seek to confirm whether uncertainty and confidence can be used to implement a mitigation. Hence, we compute a confidence interval on benign, correctly classified validation data and investigate its vulnerability.

More concretely, we compute thresholds for rejection by computing both confidence and uncertainty on a hold-out data set. We then order all confidence (uncertainty) values by their size, exclude the lowest and highest values, and take the fourth-highest and fourth-lowest value as thresholds. Any GPC output above the upper or below the lower threshold is considered a rejection.
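This thresholding scheme can be sketched as follows; the function names are ours, and `n_exclude=3` reflects the description of taking the fourth-lowest and fourth-highest value as thresholds:

```python
import numpy as np

def thresholds(scores, n_exclude=3):
    # Order hold-out scores and drop the n_exclude most extreme values
    # on each side; the next value on either end becomes a threshold.
    s = np.sort(np.asarray(scores))
    return s[n_exclude], s[-(n_exclude + 1)]

def reject(confidence, uncertainty, conf_bounds, unc_bounds):
    # A test point is rejected when either measure falls outside the
    # interval observed on benign, correctly classified hold-out data.
    lo_c, hi_c = conf_bounds
    lo_u, hi_u = unc_bounds
    return not (lo_c <= confidence <= hi_c and lo_u <= uncertainty <= hi_u)

# Toy hold-out scores 0.0 .. 9.0: the thresholds become 3.0 and 6.0.
conf_lo, conf_hi = thresholds(np.arange(10.0))
```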

The accuracy of this mitigation is depicted in Figure 4. The mitigation is indeed more robust than the undefended GPC. More specifically, the mitigation's accuracy is in general high. Exceptions are the Carlini and Wagner attacks and FGSM on GPDNN with large ε on MNIST91. Further, on spam, FGSM on DNN and on GPC with certain ε values are exceptions with lower accuracies. On MNIST38, all accuracies are high, with a consistent decrease for FGSM with large ε on all crafted models. Finally, on the Malware data, accuracies are in general high, with the exception of JSMA on GPC.

Conclusion. We observed empirically that most techniques tested lead to a noticeable change in either uncertainty or confidence in GPC. A mitigation based on both measures achieved generally high accuracy. To check whether we have obtained a false sense of security, we develop an attack that maximizes confidence and minimizes uncertainty, thereby circumventing our mitigation.

V Evading Uncertainty and Confidence

In the preceding study, we observed that conventional attacks often lead to noticeable deviations in confidence and uncertainty. Hence, we adapt the optimization of adversarial examples to account for confidence and uncertainty, thereby introducing high-confidence-low-uncertainty (HCLU) adversarial examples. We formalize the computation of HCLU examples as

min over δ of ‖δ‖₂, subject to conf(x + δ) ≥ τ and unc(x + δ) ≤ unc(x),

where we minimize the perturbation δ using the L2 norm; conf denotes GPC's confidence in the adversarial (wrong) class, unc its latent variance, and τ the required confidence. An extension to other norms, as in [7], is however straightforward. We study the L2 norm as it is differentiable and thus allows us to formulate a worst-case attacker. Concerning confidence, we explicitly demand that the resulting adversarial example is confidently classified (first constraint). We also require that the uncertainty GPC outputs for the example is at least as low as for its benign counterpart (second constraint).
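The constrained optimization can be sketched with `scipy.optimize.minimize`. The confidence and uncertainty functions below are toy stand-ins (the actual attack differentiates GPC's latent mean and variance), and `tau` is an illustrative confidence threshold:

```python
import numpy as np
from scipy.optimize import minimize

# Toy stand-ins for the GP outputs; forms and names are illustrative.
def confidence(x):   # higher = more confidently in the target class
    return x[0] - x[1]

def uncertainty(x):  # grows with distance from the "training data"
    return float(np.sum(x ** 2))

x0 = np.array([0.0, 0.5])  # benign sample
tau = 0.3                  # required confidence threshold

res = minimize(
    lambda x: np.sum((x - x0) ** 2),   # minimize squared L2 perturbation
    x0,
    method="SLSQP",
    constraints=[
        # conf(x) >= tau  (confidently classified as the wrong class)
        {"type": "ineq", "fun": lambda x: confidence(x) - tau},
        # unc(x) <= unc(x0)  (no more uncertain than the benign sample)
        {"type": "ineq", "fun": lambda x: uncertainty(x0) - uncertainty(x)},
    ],
)
x_adv = res.x  # candidate HCLU example under the toy model
```

SLSQP handles the smooth inequality constraints directly; the real attack replaces the toy functions with the GP's differentiable latent mean and variance.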

In the following, we first describe the resulting HCLU examples. Afterwards, we test their transferability. Finally, we evaluate high-confidence examples transferred from DNN on GPC and conclude the section by summarizing our results.

GPC as White-Box. We show some of our HCLU adversarial examples in the second row of Figure 5. These examples are still adversarial: we see in the figure that they are visually similar to their benign origin. The average perturbation δ introduced (measured using the L2 norm) differs per data set (spam, MNIST91, Malware, MNIST38). The variance of δ is low on all data sets. The success rate is high, except for MNIST38, where we succeed in fewer of the cases.

We might be tempted to craft examples by only maximizing confidence, thereby removing the second constraint. Surprisingly, the perturbation of such examples differs only in strength for MNIST38 ( compared to examples that take uncertainty into account). For all other data sets, the number of changed features remains roughly the same, and the perturbations do not differ by more than (measured again using ). The variance of this perturbation is, as in the previous case considering uncertainty, well below . Examples are depicted in the first row of Figure 5. Yet, these examples often lead to an increase in uncertainty, and are thus not discussed further here.

Fig. 5: First row: adversarial examples maximizing confidence. Second row: HCLU examples. The classes of the initial images are, for both rows, , , , and from left to right. In the second row, each example has the smallest of all examples crafted from that class.

Transferability. We test the HCLU adversarial examples and find that they transfer well. We depict the accuracy of a Bayesian neural network, a conventional DNN, a linear SVM and an RBF-SVM in Figure 6.

The fraction of correctly classified adversarial examples on MNIST91 and Malware is about %, with the BNN achieving %. On MNIST38, the accuracies are around % for most algorithms, with only the linear SVM performing notably worse. The results for spam vary, where the accuracy of BNN and DNN is around % and that of the linear SVM around %. The RBF-SVM achieves random-guess accuracy on both the spam and Malware data.

Fig. 6: Transferability of optimized GPC adversarial examples. Dark color denotes accuracy on benign samples, lighter color on HCLU examples computed on GPC.
Fig. 7: Transferability of optimized GPC adversarial examples (bottom) to Bayesian Neural Networks. We consider Carlini & Wagner’s (, , , for Malware data) attack as a comparison (middle). Benign data is also depicted as a baseline. Correctly classified data is plotted in gray shades, misclassified data in red shades. Figure is best seen in color.

Transferability and Confidence. We test the effect of HCLU adversarial examples on Bayesian neural networks (BNN). These networks provide uncertainty measures via Monte Carlo sampling of the posterior. In this experiment, we are interested in whether the computed uncertainty differs between benign and adversarial data. We chose the attack as a baseline: , as in our case, allows the best optimization. Only for Malware do we craft the attack: we do not want the network to gain an advantage from observing real-valued features compared to the mostly binary features in the training data. We further choose to obtain good transferability.
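Monte Carlo sampling of a BNN's posterior predictive can be sketched with MC dropout, one common approximation [9]. The fixed random weights below are an assumption of this sketch, not a trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 16))   # toy fixed weights, not a trained network
W2 = rng.normal(size=(16, 1))

def mc_predict(x, n_samples=500, p_drop=0.5):
    """Monte Carlo sampling of a stochastic network's output: each forward
    pass drops hidden units at random (MC dropout), and the sampled sigmoid
    outputs are summarized by their mean (confidence) and variance
    (uncertainty)."""
    outs = []
    for _ in range(n_samples):
        h = np.maximum(x @ W1, 0.0)                 # ReLU hidden layer
        mask = rng.random(h.shape) > p_drop         # random dropout mask
        z = float((h * mask / (1 - p_drop)) @ W2)   # rescaled stochastic pass
        outs.append(1.0 / (1.0 + np.exp(-z)))
    outs = np.array(outs)
    return outs.mean(), outs.var()

mean, var = mc_predict(rng.normal(size=4))
```

The spread of the sampled outputs is the uncertainty measure compared between benign and adversarial data in this experiment.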

This attack is supposed to maximize the softmax output, and therefore the confidence, of a DNN. However, as noted by [9], this interpretation of the softmax output suffers from certain limitations compared to the uncertainty measures provided by GP and BNN.

We depict the results in Figure 7, where we distinguish correctly classified (gray shades) and wrongly classified (red shades) benign and adversarial data. We measure the mean and variance of the sampled posteriors and bin them using bins between and . To outline overall trends, we plot the normalized bins stacked on top of each other.
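The binning step can be sketched as a normalized histogram; the bin count and range below are illustrative assumptions:

```python
import numpy as np

def normalized_bins(values, n_bins=10, lo=0.0, hi=1.0):
    """Histogram a set of per-input statistics (e.g. the means of the
    sampled posteriors) and normalize the counts to sum to one, so that
    groups of different sizes are comparable when stacked in a plot."""
    counts, edges = np.histogram(values, bins=n_bins, range=(lo, hi))
    total = counts.sum()
    return (counts / total if total else counts.astype(float)), edges

bins, edges = normalized_bins(np.random.default_rng(1).random(1000))
```

Each group (correctly classified benign, misclassified adversarial, etc.) is binned separately before stacking.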

The BNN, similar to the GP, is more confident on data that is correctly classified. This observation holds across all data sets and independently of adversarial manipulations of the data. We do observe, however, that the BNN is confident on many misclassified HCLU examples. Intriguingly, the BNN outputs low confidence on some HCLU examples that are not misclassified. This is not the case for the attack, where the network behaves analogously to benign data: low confidence on misclassified data, and high confidence on recovered or correctly classified data.

Fig. 8: Accuracy (left graphs) and latent mean (right graphs) when increasing transferability (parametrized by ) using the Carlini & Wagner attacks. The first value denotes the attacks used in the previous section with . Figure is best seen in color.

Transferring High Confidence Examples from DNN. To conclude this section, we investigate whether high-confidence adversarial examples from DNN transfer to GPC. We obtain these examples using the , , and attacks. Here, parameterizes the trade-off between the amount of perturbation introduced and the confidence in classification. We evaluate a range of values, namely and .

The results are depicted in Figure 8. We observe that with increasing , neither GPC’s accuracy nor its confidence increases or decreases consistently. The latent mean increases slightly in some cases, for example on MNIST38, MNIST91, and spam for . Further, all observed average confidence scores are around the latent mean for misclassified benign data, depicted as a black line.

Summary. We observed that HCLU examples are still malicious, and very similar to their benign counterparts. Nonetheless, they are misclassified by other models. Even worse, models providing model uncertainty misclassify them with high confidence. High-confidence adversarial examples crafted using a confidence that is not Bayesian, however, are not misclassified with high confidence by GPC or BNN. We conclude that in an adversarial setting, we have to distinguish Bayesian from non-Bayesian confidence, although both can be fooled.

VI Related Work

We are not the first to study model uncertainty in the presence of adversarial examples. Bekasov and Murray [3], for example, show the importance of priors for robustness, a direction somewhat orthogonal to our work.

Bradshaw et al. [5] investigate Gaussian hybrid networks, DNNs whose last layer is replaced by a Gaussian process. Further, Melis et al. [23] add a one-class SVM as the last layer of a DNN to build a defense based on uncertainty. They show that this defense can be circumvented, analogous to our work. Yet, we go a step further and test whether principled model uncertainty as a defense can be circumvented in a black-box setting.

Another line of work focuses on models allowing intrinsic, principled uncertainty measures. For example, Gal and Smith [10] propose an attack that samples garbage examples in the pockets of uncertainty of Bayesian neural networks. Bayesian neural networks are further investigated by Rawat et al. [29], who test FGSM adversarial examples on Bayesian networks and find notable differences in model uncertainty for such examples. Li and Gal [18] observe differences for high-confidence adversarial examples. Further, Smith and Gal [33] conclude that the mutual information of an ensemble of Bayesian networks detects adversarial data most reliably. In this work, we propose targeted adversarial examples and show their transferability. This contradicts claims that principled model uncertainty is more robust or more difficult to attack.

In our work, we show that transferability also holds for model uncertainty. General transferability was first shown by Papernot et al. [26]. Rozsa et al. [31] study transferability across different deep neural network architectures. To the best of our knowledge, no work so far studies the transferability of HCLU adversarial examples, or the transferability of attacks across different models providing uncertainty estimates.

Another field of research is the general relationship between deep learning and Gaussian processes, as investigated by Neal [25]. To gain more understanding, recent approaches by Matthews et al. [8] and Lee et al. [17] represent deep neural networks with infinitely wide layers as kernels for Gaussian processes. Lee et al. [17] further show a relation between uncertainty in Gaussian processes and predictive error in DNNs, a result that links our work with other approaches targeting DNNs.

VII Conclusion

In this paper, we studied GPs to gain understanding of the vulnerability of machine learning models that provide Bayesian model uncertainty. We used a range of existing attacks, spanning optimized, local, and global perturbations. Additionally, we evaluated a range of classifiers: GPC, DNN, SVM, and BNN. We further studied several tasks, namely malware classification, spam detection, and handwritten digit classification.

We found that GP’s uncertainty and confidence deviate for misclassified benign and manipulated data, when correctly classified benign data is used as a baseline. This baseline even allows us to build a mitigation on hold-out data. We then investigated how an adversary can exploit uncertainty measures and introduced a technique to craft HCLU adversarial examples, which achieve both high confidence and low uncertainty on the GP. These examples reliably cause misclassification, and visual inspection of the examples crafted for the MNIST tasks shows that they resemble the original rather than the target class. Our findings imply that even an ensemble with infinitely many decision functions, like a GPC, can successfully be targeted by adversarial examples.

We further conducted a study of the transferability of HCLU adversarial examples. We found that they transfer both to conventional and to other Bayesian machine learning models. In the case of Bayesian neural networks, we found that HCLU adversarial examples are misclassified with high confidence. Further, we found that the opposite is not necessarily true: conventional attacks maximizing the softmax output of a DNN, such as Carlini and Wagner’s high-confidence adversarial examples, do not produce high-confidence mispredictions in either GP or BNN.

Mitigations solely relying on model uncertainty and confidence measures in the Bayesian sense are effective against conventional high-confidence examples or when the attacker is not aware of the defense applied, hence in a security-by-obscurity setting. We conclude that Bayesian model uncertainty, albeit promising, is circumventable in both a black-box and a white-box setting.


This work was supported by the German Federal Ministry of Education and Research (BMBF) through funding for the Center for IT-Security, Privacy and Accountability (CISPA) (FKZ: 16KIS0753). This work has further been supported by the Engineering and Physical Research Council (EPSRC) Research Project EP/N014162/1.


  • [1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. Tensorflow: a system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.
  • [2] A. Athalye, N. Carlini, and D. A. Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pages 274–283, 2018.
  • [3] A. Bekasov and I. Murray. Bayesian adversarial spheres: Bayesian inference and adversarial examples in a noiseless setting. Bayesian Deep Learning Workshop at NeurIPS 2018, 2018.
  • [4] B. Biggio, I. Corona, D. Maiorca, B. Nelson, N. Šrndić, P. Laskov, G. Giacinto, and F. Roli. Evasion attacks against machine learning at test time. In Joint European conference on machine learning and knowledge discovery in databases, pages 387–402. Springer, 2013.
  • [5] J. Bradshaw, A. G. d. G. Matthews, and Z. Ghahramani. Adversarial Examples, Uncertainty, and Transfer Testing Robustness in Gaussian Process Hybrid Deep Networks. ArXiv e-prints, July 2017.
  • [6] N. Carlini and D. Wagner. Adversarial examples are not easily detected: Bypassing ten detection methods. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pages 3–14. ACM, 2017.
  • [7] N. Carlini and D. Wagner. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pages 39–57. IEEE, 2017.
  • [8] A. G. de G. Matthews, J. Hron, M. Rowland, R. E. Turner, and Z. Ghahramani. Gaussian process behaviour in wide deep neural networks. International Conference on Learning Representations, 2018.
  • [9] Y. Gal and Z. Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of the 33nd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pages 1050–1059, 2016.
  • [10] Y. Gal and L. Smith. Idealised bayesian neural networks cannot have adversarial examples: Theoretical and empirical study. CoRR, abs/1806.00667, 2018.
  • [11] I. J. Goodfellow et al. Explaining and harnessing adversarial examples. In Proceedings of the 2015 International Conference on Learning Representations, 2015.
  • [12] I. J. Goodfellow, N. Papernot, and P. D. McDaniel. cleverhans v0.1: an adversarial machine learning library. CoRR, abs/1610.00768, 2016.
  • [13] GPy. GPy: A Gaussian process framework in Python, since 2012.
  • [14] K. Grosse, N. Papernot, P. Manoharan, M. Backes, and P. D. McDaniel. Adversarial examples for malware detection. In Computer Security - ESORICS 2017 - 22nd European Symposium on Research in Computer Security, Oslo, Norway, September 11-15, 2017, Proceedings, Part II, pages 62–79, 2017.
  • [15] E. Jones, T. Oliphant, P. Peterson, et al. SciPy: Open source scientific tools for Python, 2001–.
  • [16] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998.
  • [17] J. Lee, Y. Bahri, R. Novak, S. Schoenholz, J. Pennington, and J. Sohl-Dickstein. Deep neural networks as gaussian processes. International Conference on Learning Representations, 2018.
  • [18] Y. Li and Y. Gal. Dropout inference in bayesian neural networks with alpha-divergences. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pages 2052–2061, 2017.
  • [19] M. Lichman. UCI machine learning repository, 2013.
  • [20] D. Lowd and C. Meek. Good word attacks on statistical spam filters. In CEAS 2005 - Second Conference on Email and Anti-Spam, July 21-22, 2005, Stanford University, California, USA, 2005.
  • [21] D. Maiorca, I. Corona, and G. Giacinto. Looking at the bag is not enough to find the bomb: an evasion of structural methods for malicious PDF files detection. In 8th ACM Symposium on Information, Computer and Communications Security, ASIA CCS ’13, Hangzhou, China - May 08 - 10, 2013, pages 119–130, 2013.
  • [22] S. Mei and X. Zhu. Using machine teaching to identify optimal training-set attacks on machine learners. In AAAI, pages 2871–2877, 2015.
  • [23] M. Melis, A. Demontis, B. Biggio, G. Brown, G. Fumera, and F. Roli. Is deep learning safe for robot vision? adversarial examples against the icub humanoid. In 2017 IEEE International Conference on Computer Vision Workshops, ICCV Workshops 2017, Venice, Italy, October 22-29, 2017, pages 751–759, 2017.
  • [24] S.-M. Moosavi-Dezfooli, A. Fawzi, and P. Frossard. Deepfool: A simple and accurate method to fool deep neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  • [25] R. M. Neal. Bayesian learning for neural networks, volume 118. Springer, 1996.
  • [26] N. Papernot, P. McDaniel, and I. J. Goodfellow. Transferability in machine learning: from phenomena to black-box attacks using adversarial samples. CoRR, abs/1605.07277, 2016.
  • [27] N. Papernot, P. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik, and A. Swami. The Limitations of Deep Learning in Adversarial Settings. In Proceedings of the 1st IEEE European Symposium in Security and Privacy (EuroS&P), 2016.
  • [28] C. E. Rasmussen and C. K. I. Williams. Gaussian processes for machine learning. Adaptive computation and machine learning. MIT Press, 2006.
  • [29] A. Rawat, M. Wistuba, and M.-I. Nicolae. Adversarial Phenomenon in the Eyes of Bayesian Deep Learning. ArXiv e-prints, Nov. 2017.
  • [30] B. D. Rouhani, M. Samragh, T. Javidi, F. Koushanfar, et al. Safe machine learning and defeating adversarial attacks. IEEE Security and Privacy (S&P) Magazine, 2018.
  • [31] A. Rozsa, M. Günther, and T. E. Boult. Are accuracy and robustness correlated? In Machine Learning and Applications (ICMLA), 2016 15th IEEE International Conference on, pages 227–232. IEEE, 2016.
  • [32] W. J. Scheirer, L. P. Jain, and T. E. Boult. Probability models for open set recognition. IEEE Trans. Pattern Anal. Mach. Intell., 36(11):2317–2324, 2014.
  • [33] L. Smith and Y. Gal. Understanding measures of uncertainty for adversarial example detection. arXiv preprint arXiv:1803.08533, 2018.
  • [34] L. Smith and Y. Gal. Understanding measures of uncertainty for adversarial example detection. CoRR, abs/1803.08533, 2018.
  • [35] N. Srndic and P. Laskov. Practical evasion of a learning-based classifier: A case study. In 2014 IEEE Symposium on Security and Privacy, SP 2014, Berkeley, CA, USA, May 18-21, 2014, pages 197–211, 2014.
  • [36] N. Šrndić and P. Laskov. Hidost: a static machine-learning-based detector of malicious files. EURASIP Journal on Information Security, 2016(1):22, Sep 2016.
  • [37] F. Tramèr, F. Zhang, A. Juels, M. K. Reiter, and T. Ristenpart. Stealing machine learning models via prediction apis. In 25th USENIX Security Symposium, USENIX Security 16, Austin, TX, USA, August 10-12, 2016., pages 601–618, 2016.
  • [38] W. Xu, D. Evans, and Y. Qi. Feature squeezing: Detecting adversarial examples in deep neural networks. In 25th Annual Network and Distributed System Security Symposium, NDSS 2018, San Diego, California, USA, February 18-21, 2018, 2018.
  • [39] W. Xu, Y. Qi, and D. Evans. Automatically evading classifiers. In Proceedings of the 2016 Network and Distributed Systems Symposium, 2016.

Appendix A Detailed Attack Derivation

In this appendix, we first present the detailed derivation of the JSMA and FGSM attacks for GPC, and then describe the training of the surrogates.

A-a Derivation of FGSM and JSMA for GP

In this part of the appendix, we present the detailed derivation of the computation of adversarial examples on GPC, including the reasoning why it is sufficient to use the latent mean.

We compute the gradient of the output with respect to the input dimensions. We consider the chain of gradients for the normalized output , given a test input :


where and denote the latent mean and the covariance function, respectively.

Note that for this attack, we are only interested in the relative order of the gradients, not in their actual values. Unfortunately, does not vary monotonically with , as the variance also affects the prediction. However, we are in a binary classification setting, so we are only interested in moving the prediction, , across the boundary. No change in variance alone can cause this; instead, a change in the mean of is required (effectively, the mean is monotonic with respect to in the region of 0.5). We get from one probability threshold to its opposite fastest when there is no variance (any variance moves the prediction towards 0.5). Finding the gradient of is therefore sufficient. This is analogous to using the logits (instead of the softmax) in evasion attacks on deep neural networks.

However, we found that we can still use the gradient of (instead of a numerical approximation to ):


Let us first rewrite the expected value of given a single test point :


where we write to denote GPC’s output for the training points . From here, we move on to the first part of the gradient,


Note that the remaining terms are both constant with respect to the test input . The gradient of the covariance with respect to the inputs depends on the particular kernel applied. In our case of the RBF kernel, between a training point and a test point , the gradient can be expressed as


where and each denote a feature or dimension of the corresponding vector or data point, and denotes the length-scale parameter of the kernel. Using Equation 8, the gradient of the output with respect to the inputs is approximately proportional to the product of Equation 10 and Equation 11 in the region of .
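The RBF kernel gradient above can be sketched and sanity-checked numerically. The unit variance and length-scale below are illustrative defaults:

```python
import numpy as np

def rbf(x, xs, variance=1.0, lengthscale=1.0):
    """RBF kernel k(x, x*) = variance * exp(-||x - x*||^2 / (2 l^2))."""
    return variance * np.exp(-np.sum((x - xs) ** 2) / (2 * lengthscale ** 2))

def rbf_grad(x, xs, variance=1.0, lengthscale=1.0):
    """Gradient of k(x, x*) with respect to the test input x*:
    d k / d x*_j = k(x, x*) * (x_j - x*_j) / lengthscale**2."""
    return rbf(x, xs, variance, lengthscale) * (x - xs) / lengthscale ** 2

# Finite-difference check of the first coordinate.
x, xs = np.array([0.3, -1.2]), np.array([1.0, 0.5])
eps = 1e-6
num = (rbf(x, xs + np.array([eps, 0.0]))
       - rbf(x, xs - np.array([eps, 0.0]))) / (2 * eps)
```

The analytic and numeric gradients agree to floating-point precision, confirming the closed form used in the attack.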

Based on these gradients, we perturb the initial sample. In GPFGS (analogous to FGSM), we introduce a global change using the sign of the gradient and a specified .
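The global GPFGS step reduces to a single sign update; the gradient array below is a placeholder for the latent-mean gradient of a trained GPC:

```python
import numpy as np

def gpfgs_step(x, grad_latent_mean, eps=0.1):
    """FGSM-style global perturbation against GPC: move every feature by
    eps in the direction that increases the latent mean, i.e., toward the
    target class. grad_latent_mean would come from a trained GPC; here it
    is simply an array."""
    return x + eps * np.sign(grad_latent_mean)

x_adv = gpfgs_step(np.zeros(3), np.array([0.7, -1.3, 0.0]), eps=0.1)
```

Features with a zero gradient are left untouched, since the sign of zero is zero.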

Alternatively, in GPJM, we compute local, greedy changes (see Algorithm 2, analogous to JSMA). Instead of a saliency map, however, we iteratively determine the (still unperturbed) feature with the strongest gradient and perturb it. We alternate between perturbing for misclassification and (optionally) decreasing uncertainty. We stop altering the example when it is either misclassified at a predefined threshold, or when we have changed more than a previously specified number of features, which counts as a failure.

1: Input: sample , latent function , parameter 
Algorithm 1 GPFGS
1: Input: sample , latent function , classifier , threshold , variance threshold varT, desired confidence , changed = , 
2: repeat
3:     if len(changed) > then return fail
4:     end if
5:     grads ← gradient of mean # classification
6:     , changed ← perturb(grads, , changed)
7:     repeat
8:         if len(changed) > then return fail
9:         end if
10:         grads ← gradient of var # uncertainty
11:         , changed ← perturb(grads, , changed)
12:     until var ≤ varT
13: until classified
Algorithm 2 GPJM
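The greedy procedure of Algorithm 2 (GPJM) can be sketched as follows. All callables stand in for a trained GPC and its gradients; they, and the default parameters, are assumptions of this sketch:

```python
import numpy as np

def gpjm(x, mean_grad, var_grad, predict, var,
         eps=1.0, conf=0.9, var_thresh=0.05, max_changed=20):
    """Greedy sketch of GPJM: repeatedly perturb the still-unchanged
    feature with the strongest latent-mean gradient until the sample is
    classified with the desired confidence, optionally lowering the
    predictive variance in an inner loop."""
    x = x.copy()
    changed = set()
    while predict(x) < conf:                    # until (mis)classified
        if len(changed) > max_changed:
            return None                         # fail: too many features
        g = mean_grad(x).copy()
        g[list(changed)] = 0.0                  # only unchanged features
        i = int(np.argmax(np.abs(g)))
        if g[i] == 0.0:
            return None                         # no usable feature left
        x[i] += eps * np.sign(g[i])
        changed.add(i)
        while var(x) > var_thresh:              # optional uncertainty loop
            if len(changed) > max_changed:
                return None
            gv = var_grad(x).copy()
            gv[list(changed)] = 0.0
            j = int(np.argmax(np.abs(gv)))
            if gv[j] == 0.0:
                break
            x[j] -= eps * np.sign(gv[j])        # step against the variance
            changed.add(j)
    return x
```

With a toy linear stand-in for the latent function, a few greedy steps suffice to push the prediction across the confidence threshold.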

A-B Surrogate Models for GP

We train several surrogate models to approximate GP’s decision surface and to be able to apply DNN-specific attacks to GPC as well. To this end, we first briefly introduce the Gaussian process latent variable model (GPLVM) these surrogates are trained with, before we introduce the attacks themselves.

Fig. 9: The intuition behind approximating the latent space using DNNs. The first network is trained on a latent space, the second to classify inputs from this latent space. After training, the two networks are combined and yield one DNN classifier.

A-B1 GP Latent Variable Model

GPC learns from labeled data. This introduces an implicit bias: we assume, for example, that the number of labels is finite and fixed, and that the labels are related to the structure of the data. Furthermore, classification is affected by the curse of dimensionality, which affects distance measures on data points: in high dimensions, the ratio between the nearest and the furthest point approaches one. All data points are thus almost uniformly distant, impeding classification, since it becomes harder to compute a separating boundary. Consequently, we are interested in finding a lower-dimensional representation of our data.
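The distance-concentration effect is easy to demonstrate empirically; the sample sizes and dimensions below are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def nearest_furthest_ratio(dim, n=500):
    """Sample n points uniformly in [0, 1]^dim and compute, for one random
    query point, the ratio of the nearest to the furthest distance."""
    pts = rng.random((n, dim))
    d = np.linalg.norm(pts - rng.random(dim), axis=1)
    return d.min() / d.max()

nearest_furthest_ratio(2)      # low dimension: distances well spread
nearest_furthest_ratio(1000)   # high dimension: ratio close to 1
```

As the dimension grows, the ratio approaches one, which is the concentration phenomenon motivating the lower-dimensional latent representation.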

The Gaussian process latent variable model (GPLVM) addresses both of these issues. Analogously to GPC, GPLVM models uncertainty estimates. Further, its latent space allows nonlinear relations in the feature space to be represented. Yet, this latent space, or lower-dimensional representation, ignores labels. Consequently, we need to apply a classifier on top of GPLVM to enable classification.

A-B2 Training Surrogates.

We propose an approach complementary to the attacks on GPC: attacking GPLVM+SVM using (the already established methodology of) DNN surrogates, specifically tailored to the GP. To train such a surrogate model, we train a DNN to fit the latent space representation of GPLVM in one of its hidden layers. We achieve this by taking a common DNN and splitting it into two parts, where a hidden layer becomes the output layer of the first part and the input layer of the second (see the lower half of Figure 9).

The training data is fed as input to the first part, which we train by minimizing the loss between the output of the network and the latent space we want to approximate (for example, the output of GPLVM). The second part receives this latent space as input and is trained by minimizing the loss with respect to the original labels. When stacking these two networks (i.e., when feeding the output of the first part directly into the second), we obtain a combined DNN that mimics both the latent space it was trained on and the classifier on this latent space.
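The two-part training scheme can be sketched with minimal linear "networks". A random projection stands in for the GPLVM latent space, and each one-layer part stands in for one half of the DNN; all of these are assumptions of the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 samples, 10 features; a 2-D latent space standing in
# for the GPLVM embedding (here a fixed random projection).
X = rng.normal(size=(200, 10))
L = X @ rng.normal(size=(10, 2))            # assumed latent targets
y = (L[:, 0] + L[:, 1] > 0).astype(float)   # labels tied to the latent space

# Part 1: fit input -> latent by least squares (one linear layer).
W1, *_ = np.linalg.lstsq(X, L, rcond=None)

# Part 2: fit latent -> label by logistic regression on the latent targets.
w2 = np.zeros(2)
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(L @ w2)))
    w2 -= 0.1 * L.T @ (p - y) / len(y)      # gradient step on logistic loss

# Stacked surrogate: feed part 1's output straight into part 2.
def surrogate(x):
    return 1.0 / (1.0 + np.exp(-((x @ W1) @ w2)))

acc = np.mean((surrogate(X) > 0.5) == (y == 1))
```

After stacking, the surrogate reproduces both the latent embedding and the classifier on top of it, which is exactly the property the DNN surrogate needs before DNN-specific attacks are applied to it.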