I Introduction
Machine learning classifiers are used for various purposes in a variety of research areas ranging from health to security. Given a data set, for example a collection of labeled programs (as malicious and benign), a classifier is trained to predict a label for an unseen program. More concretely, for new programs, this classifier is able to predict if they are likely to be malicious or not.
However, classifiers have been shown to be vulnerable to a number of different attacks [20, 22, 11, 7]. One attack, called an evasion attack or adversarial example, presents a direct threat to classification. An attacker slightly modifies a program (which keeps its malicious functionality), yet the classifier outputs that it is likely to be benign [35, 39, 14]. Adversarial examples can also be generated in the area of computer vision, often leading to visually indistinguishable images which are misclassified by state-of-the-art models [24].
While a range of defenses against these attacks has been developed, they mostly provide an empirical mitigation against adversarial examples and are often broken later on [6, 2]. The latter work exploits that robustness guarantees are often limited to small changes in the input data, which leaves the defended models vulnerable to larger changes. Despite the greater degree of change, these examples remain imperceptible to humans. Further, such adversarial examples often exhibit a high confidence score. This score refers to a high softmax output, and such examples are therefore known as high-confidence adversarial examples. As pointed out by [9], however, a model can still be uncertain (in a Bayesian sense) in its predictions, even if, for example, the output of the softmax remains high. Hence, this interpretation of the model's output does not necessarily reflect the actual model confidence.
In contrast, models based on Bayesian inference offer a mathematically grounded framework to reason about predictive confidence and uncertainty. Recent works have studied the uncertainty and robustness of Bayesian neural networks [18, 33, 3]. Yet, the uncertainty measures of Bayesian networks (like the decision functions of their deep counterparts) are not very smooth [33]. We visualize the intrinsic uncertainty measures of Gaussian processes (GP) in Figure 1, and find that they increase smoothly when moving away from the training data.
Additionally, GP treat classification as sampling from a distribution of infinitely many potential decision functions. Hence, they are an implicit ensemble of classifiers. Recent mitigations leverage such ensembles or a combination of several decision boundaries [38, 30].
We conclude that GP represent an effective mechanism for two recent trends in the research of adversarial examples: Bayesian uncertainty measures and ensembles.
Our contributions are the following:

We investigate GP uncertainty measures in the presence of adversarial data. To this end, we apply several classifiers (BNN, DNN, SVM) on a range of tasks (Malware, spam, handwritten digit recognition) using different attacks (including the attacks from Carlini & Wagner and attacks specifically targeting GP). We observe that either uncertainty or confidence scores of adversarial examples deviate from the values of correctly classified benign data.

We are the first to introduce high-confidence-low-uncertainty (HCLU) adversarial examples, which simultaneously maximize confidence and minimize uncertainty. HCLU adversarial examples are still malicious: when visualized, these examples still resemble their original class.

We investigate the transferability of HCLU examples, and observe that both uncertainty and confidence transfer from GP to BNN. We conclude that Bayesian uncertainty measures, albeit promising, can be circumvented in both white-box and black-box settings.
II Background
In this section, we first describe the attacker we consider. We then give a formal definition of the classification task that our attacker is rooted in, before we formalize the algorithms studied in this work, Gaussian processes and Bayesian neural networks. Finally, we review evasion attacks or adversarial example generating algorithms on classification.
II-A Threat Model
A classifier is trained on labeled data and, after training, is able to predict the class of a given, unknown data point. An attacker in this setting who can slightly change the data point can always obtain a classification of her choosing [35, 39, 14].
This pressing threat has been tackled by many researchers, yet these efforts have led to an arms race [6]. We want to investigate a promising solution to this problem which has been brought up by [34]: Bayesian uncertainty. In contrast to common machine learning classifiers, some classifiers are able to express their confidence in the classification output they generate. Such a confidence differs from, for example, that of deep neural networks, which obtain a probability that is a mere normalization of their output. Instead, algorithms such as Gaussian processes or Bayesian neural networks are able to express in probabilistic terms how confident they are and how unusual (compared to the training data) the input they were presented with is.
The question we want to answer in this work is whether these algorithms are able to alleviate the threat of manipulated data. To conduct the corresponding study, however, we first need to define the attacker we consider.
In the white-box setting, the attacker has unrestricted access to the model, its parameters and its training data. A black-box adversary has access to neither the model nor its parameters. A gray-box adversary has no detailed knowledge about the model, but might know some parameters or which classifier was applied. Analogously, the attacker might have full knowledge about the data that was used to train the model, partial knowledge (some data points, or the meaning of features) or no knowledge at all. Further, recent work has shown that black-box or gray-box access allows an attacker to replicate the model, yielding a white-box attacker [37].
Adversarial Capabilities
In our motivating study, a range of attackers is studied: black-box attackers, as well as gray-box attackers up to white-box attackers. In a second study, we consider an attacker that has white-box knowledge about both model and data. We propose this powerful attacker in an effort to study a worst-case scenario. Finally, to conclude our study on uncertainty, we consider a gray-box attacker that is only aware that uncertainty is applied, but not which model provides the uncertainty.
II-B Mathematical Notation for Classification
We consider the general setting of classification, where we are given a labeled set of training instances in the form of feature vectors, each composed of a fixed number of features. Each feature vector is associated with a label from a fixed number of classes. The goal of classification is to adapt the parameters or weights of a classifier such that, given further test samples, the predicted label corresponds to the correct label.
To conclude, we briefly review two classifiers which we use but do not focus on in this paper. Deep neural networks (DNN) are layered classifiers where the input is propagated through a parametrized layer. The output of this layer is then used as input for the following layer. The output of the last layer is commonly referred to as logits, and is fed into a softmax function to obtain a normalized (non-Bayesian) probability.
The second classifier is the support vector machine (SVM). The geometric interpretation of the SVM is a classifier which, given a similarity metric or kernel, computes a decision boundary that maximizes the distance to both classes. The training points used to determine this maximal distance are called support vectors. In our study, we use an SVM with a linear and an RBF kernel.
II-C Gaussian Processes
We focus on Gaussian process classification (GPC), a classifier which provides intrinsic uncertainty estimates. In a nutshell, GPC is based on a parametrized similarity function which is configured during training to separate the training data. Classification at test time is then carried out by weighting the test point's distance to all training points and their corresponding labels. To give an example, a test point can be classified due to one very close point of one class, or due to several less close points of an agreeing class.
In more detail, the unadapted, initial parameters of the similarity function are called the prior. In a probabilistic view, this initial GP entails all possible decision functions, as depicted in Figure 1(a). More concretely, a GP is the distribution over these decision functions, where we assume the distribution over decision functions to be a Gaussian. Based on the distance from the training data to itself, measured by the parametrized similarity function, we formalize the Gaussian distribution of our GP as
(1) 
where we observe that, as in Figure 1(a), the mean is zero and the variance of the GP is defined using the similarity function. This distribution entails all decision functions, and thus provides the classification output. We skip the details of the optimization here, and assume that we have already found fitting parameters for our training data. The resulting GP is depicted in Figure 1(c), and the distribution over decision functions is now constrained to functions aligning with the training data. As we can see in Figure 1(c), the joint mean (black line) of this distribution can be used for classification.
To derive the prediction of a GP, we extend Equation 1 to the test data. We write the similarity function in matrix notation and rewrite Equation 1 as
(2) 
where the joint distribution covers both the predicted labels of the training data and the unknown test labels the GP has to predict. We now formalize the prediction of the GP, which is obtained by reordering Equation 2 to solve for the test labels as
(3) 
where we indeed obtain the weighted output we initially referred to. We wrote, in the second part, the explicit sums to illustrate this weighting during classification.
Since the distribution we consider is a Gaussian, we can further derive the predictive variance (depicted as blue areas in Figure 1(c)). Analogous to the mean, the prediction for the variance is obtained by reordering Equation 2 to solve for the variance of the test labels, obtaining
(4) 
where we skip rewriting the equation. We observe that the (predicted) labels do not matter for the variance, as the variance is a class-independent measure. This measure describes how close the test data is in general to the training data. We use variance and uncertainty interchangeably throughout this paper. Likewise, we use mean and confidence interchangeably.
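The predictive mean and variance described above can be illustrated with a minimal NumPy sketch. It assumes the standard zero-mean GP regression equations with an RBF similarity function; the function names, the jitter term and the toy data are illustrative assumptions and not taken from the implementation used in this work, and the link function used for classification is left out.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """RBF similarity between all rows of A and all rows of B."""
    sq_dist = (
        np.sum(A ** 2, axis=1)[:, None]
        + np.sum(B ** 2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    return variance * np.exp(-0.5 * np.maximum(sq_dist, 0.0) / lengthscale ** 2)

def gp_predict(X_train, y_train, X_test, lengthscale=1.0, variance=1.0, jitter=1e-8):
    """Latent predictive mean and variance of a zero-mean GP (assumed standard form)."""
    K = rbf_kernel(X_train, X_train, lengthscale, variance)
    K += jitter * np.eye(len(X_train))  # small diagonal term for numerical stability
    K_s = rbf_kernel(X_test, X_train, lengthscale, variance)
    K_ss = rbf_kernel(X_test, X_test, lengthscale, variance)
    # Mean: similarities to the training points, weighted by the training labels.
    mean = K_s @ np.linalg.solve(K, y_train)
    # Variance: prior variance minus what the training data explains (labels unused).
    cov = K_ss - K_s @ np.linalg.solve(K, K_s.T)
    return mean, np.diag(cov)

# Toy one-dimensional data with labels -1 and +1 (hypothetical).
X_train = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y_train = np.array([-1.0, -1.0, 1.0, 1.0])
X_test = np.array([[-1.0], [0.0], [10.0]])
mean, var = gp_predict(X_train, y_train, X_test)
```

In this toy setting the variance is close to zero at a training point, larger between the two classes, and close to the prior variance far away from all training data, while the labels enter only the mean.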
So far, we ignored that the outputs of the GP might be arbitrarily large. Also, strictly speaking, the labels of a classification task are not drawn from a Gaussian distribution. We thus add a link function (somewhat analogous to the softmax used in deep neural networks) to normalize the output of the GP (as depicted in Figure 1(b)). Skipping the details, the resulting posterior is no longer Gaussian and is handled using the Laplace approximation, which fits a Gaussian to it and thereby enables classification while taking into account the previously named issues. We can, however, still access the unnormalized mean and variance, which are then called latent mean and latent variance. For further details, we refer the interested reader to Rasmussen and Williams [28].
Before we turn to Bayesian networks, we want to address the choice of the similarity metric. The most common similarity metric, which is used throughout this work, is the RBF kernel. The similarity between two data points is defined as
(5) 
where the distance between two points is rescaled by a lengthscale and a variance. These two parameters jointly form the parameters which are adapted during training. Because of the exponential function, the similarity approaches zero as the distance (rescaled by the lengthscale) gets larger. This property is called abating, and is useful for outlier detection or open set tasks [32]. In our setting, it ensures for example that the uncertainty smoothly increases when moving away from the training data.
II-D Bayesian Neural Networks
Bayesian neural networks (BNN) are, analogous to deep neural networks (DNN), functions that consist of parametrized layers. In contrast to (non-Bayesian) DNN, however, the parameters of a BNN are not treated as fixed values to be optimized, but are seen as random variables. Each variable then has an initial, or prior, distribution. In contrast to the GP described in the previous subsection, however, it is not possible to fully integrate out uncertainty. The uncertainty measures are thus approximated, for example using variational inference. For more details on BNN, we refer the interested reader to Smith and Gal [33].
II-E Evading Classification
To evade a trained classifier at test time, we compute a small change for a sample such that
(6) 
where the adversarial example is classified as a different class than the original input, and the change is as small as possible. The minimum constraint can be loosened to obtain a maximum-confidence example, for which the change is larger but the classifier is also more confident in the classification. An attacker can also specify which classification output the adversarial example results in, yielding a targeted attack. Since we only consider binary settings in this paper, the latter distinction is superfluous.
Before we detail the attacks used in this work, we want to address how the size of the change is measured. There are three common distance metrics in the literature:

The $L_0$ metric counts the number of changed features. An application of this measure is binary data, where every valid change flips a feature.

The $L_2$ metric is equivalent to the Euclidean distance. Optimizing this norm favors individual, small changes, as might be desirable in image data: few overall changes are hard to notice. This distance is also used in the covariance function of the GP (compare Equation 5).

The $L_\infty$ metric measures the largest change introduced. It is particularly well suited for image data, as small changes are not perceived, and can thus be applied to many features (or pixels).
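The three distance metrics listed above can be computed for a given pair of original and perturbed sample as in the following short sketch. The function name and the example vectors are illustrative assumptions.

```python
import numpy as np

def perturbation_norms(x, x_adv):
    """Size of the change x_adv - x under the three common metrics."""
    delta = np.asarray(x_adv, dtype=float) - np.asarray(x, dtype=float)
    return {
        "l0": int(np.count_nonzero(delta)),       # number of changed features
        "l2": float(np.sqrt(np.sum(delta ** 2))),  # Euclidean distance
        "linf": float(np.max(np.abs(delta))),      # largest single change
    }

# Hypothetical sample with two changed features.
x = np.array([0.0, 1.0, 0.0, 1.0])
x_adv = np.array([1.0, 1.0, 0.0, 0.5])
norms = perturbation_norms(x, x_adv)
```

Here two features change, the largest change has magnitude 1, and the Euclidean distance is the square root of 1.25.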
Many algorithms exist for creating adversarial examples. We briefly recap the algorithms that we rely on in our evaluation. All presented algorithms, if not stated otherwise, target deep neural network models. The first method we review is the fast gradient sign method (FGSM) [11]. This method is formalized as
$$x^{*} = x + \epsilon \cdot \operatorname{sign}\left(\nabla_{x} J(x, y)\right)$$
where $x^{*}$ is the adversarial example crafted from the input $x$ with label $y$, $\epsilon$ parametrizes the strength of the perturbation, and $\nabla_{x} J(x, y)$ denotes the gradient of the model's loss with respect to the input. FGSM implicitly minimizes the $L_\infty$ norm, as a change of the same magnitude is applied to all features. Further, FGSM has been extended to SVM in [26], and will be extended to GPC in this work.
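As a minimal illustration of the gradient sign step, the following sketch applies FGSM to a logistic regression model, for which the gradient of the cross-entropy loss with respect to the input has a closed form. The model, its weights and the sample are hypothetical and only serve to show the mechanics; this is not the attack on GPC derived in this work.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm_logistic(x, y, w, b, eps):
    """One FGSM step against the model p(y=1|x) = sigmoid(w.x + b)."""
    # Gradient of the cross-entropy loss with respect to the input x.
    grad_x = (sigmoid(w @ x + b) - y) * w
    # Every feature is moved by the same magnitude eps in the loss-increasing direction.
    return x + eps * np.sign(grad_x)

# Hypothetical model and sample of class 1.
w = np.array([2.0, -1.0, 0.5])
b = 0.0
x = np.array([0.5, 0.2, 0.1])
y = 1.0
x_adv = fgsm_logistic(x, y, w, b, eps=0.5)
```

With these values the original sample is assigned to class 1, while the perturbed sample falls on the other side of the decision boundary, and each feature has changed by exactly the chosen strength.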
Further, we apply the Jacobian-based saliency map approach (JSMA) [27]. JSMA is based on the derivative of the model's output with respect to its inputs. In the case of deep neural networks, this gradient can be computed using either the normalized (sigmoid) output or the unnormalized logits; in this paper, the second variant of the attack is applied. We review the definition given in [7], and define the attack for two pixels $p$ and $q$. Here, we use $\partial Z(x)_t / \partial x_i$ to denote the gradient of class $t$ with respect to feature $i$, where $t$ denotes the specified target class:
$$\alpha_{pq} = \sum_{i \in \{p,q\}} \frac{\partial Z(x)_t}{\partial x_i}, \qquad \beta_{pq} = \Big( \sum_{i \in \{p,q\}} \sum_{j} \frac{\partial Z(x)_j}{\partial x_i} \Big) - \alpha_{pq}$$
In a nutshell, $\alpha_{pq}$ denotes how much changing $p$ and $q$ affects the output of the target class, whereas $\beta_{pq}$ measures the effect on all other outputs. JSMA then picks
$$(p^{*}, q^{*}) = \arg\max_{(p,q)} \; (-\alpha_{pq} \cdot \beta_{pq}) \cdot (\alpha_{pq} > 0) \cdot (\beta_{pq} < 0)$$
where the pair with the strongest effect is chosen (first brackets) which maximizes the output for the target class (second brackets) and minimizes the output for all other classes (third brackets). The search is executed iteratively until misclassification is achieved or a set threshold is exceeded. Depending on the implementation, JSMA optimizes one of two distance metrics; in this paper, the metric optimized for DNN differs from the one used in our JSMA variant for GP.
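The pair selection step can be sketched as follows, assuming the formulation of the saliency map reviewed in [7]: the effect on the target class must be positive, the summed effect on all other classes must be negative, and among the admissible pairs the one with the largest product is chosen. The function name and the toy Jacobian are assumptions for illustration; the iterative update of the selected features is omitted.

```python
import itertools
import numpy as np

def jsma_pick_pair(jacobian, target):
    """Pick the feature pair (p, q) with the largest saliency, or None.

    jacobian[j, i] holds the derivative of output j with respect to feature i.
    """
    grad_target = jacobian[target]                     # effect on the target class
    grad_others = jacobian.sum(axis=0) - grad_target   # effect on all other classes
    best_pair, best_score = None, 0.0
    for p, q in itertools.combinations(range(jacobian.shape[1]), 2):
        alpha = grad_target[p] + grad_target[q]
        beta = grad_others[p] + grad_others[q]
        # Only pairs that raise the target output and lower the others qualify.
        if alpha > 0 and beta < 0 and -alpha * beta > best_score:
            best_pair, best_score = (p, q), -alpha * beta
    return best_pair

# Hypothetical Jacobian of a two-class model with four features.
jacobian = np.array([
    [-1.0, 0.5, -0.2, 0.0],
    [1.0, -0.5, 0.3, 0.0],
])
pair = jsma_pick_pair(jacobian, target=1)
```

The exhaustive loop over all pairs is quadratic in the number of features, which is acceptable for a sketch but would need vectorization for image-sized inputs.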
We finally review the Carlini and Wagner attacks [7]. They treat the task of producing an adversarial example as an iterative optimization problem. The authors introduce three attacks, minimizing the $L_0$, $L_2$ and $L_\infty$ norm respectively. The $L_2$ attack is formalized as the following optimization problem
$$\min_{w} \; \Big\lVert \tfrac{1}{2}(\tanh(w) + 1) - x \Big\rVert_2^2 + c \cdot f\big(\tfrac{1}{2}(\tanh(w) + 1)\big)$$
where the usage of tanh ensures that the box constraint is fulfilled. Further, $c$ trades off the two terms. We define $f$ using $Z(\cdot)_i$ for the output of class $i$; $t$ refers to the target class we optimize for:
$$f(x^{*}) = \max\big( \max_{i \neq t} Z(x^{*})_i - Z(x^{*})_t, \; -\kappa \big)$$
In a nutshell, we maximize the output of the target class. Further, $\kappa$ controls how confident the DNN is on the resulting adversarial example $x^{*}$. According to the authors, $\kappa$ should be raised when the adversarial example is required to fool a second DNN. The parameter $\kappa$ is our main motivation to use this attack in our study: it allows us to specify the confidence of the target model.
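To make the roles of the tanh reparametrization, the trade-off constant and the confidence parameter concrete, the following sketch evaluates the objective of the attack from [7] for a toy linear model. It only evaluates the objective and does not perform the optimization; the function name, the linear logits and all numeric values are illustrative assumptions.

```python
import numpy as np

def cw_objective(w, x, logits_fn, target, c=1.0, kappa=0.0):
    """Objective value and candidate example for the L2 attack (evaluation only)."""
    x_adv = 0.5 * (np.tanh(w) + 1.0)          # box constraint: x_adv stays in [0, 1]
    z = logits_fn(x_adv)
    other = np.max(np.delete(z, target))      # strongest non-target output
    f = max(other - z[target], -kappa)        # clipped at -kappa (confidence margin)
    return np.sum((x_adv - x) ** 2) + c * f, x_adv

# Hypothetical two-class linear model on two features in [0, 1].
W = np.array([[1.0, -1.0], [-1.0, 1.0]])
logits_fn = lambda v: W @ v
x = np.array([0.8, 0.2])

# Starting point: w chosen such that the candidate equals the original sample.
value_start, x_start = cw_objective(np.arctanh(2.0 * x - 1.0), x, logits_fn, target=1)

# A strongly perturbed candidate at the opposite corner of the box.
value_far, x_far = cw_objective(np.array([-100.0, 100.0]), x, logits_fn,
                                target=1, c=1.0, kappa=0.5)
```

At the starting point the distance term vanishes and only the misclassification term contributes. For the far candidate the margin exceeds the confidence parameter, so the second term is clipped and further gains in confidence are no longer rewarded.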
The authors introduce another attack minimizing the $L_0$ norm. Since this norm is non-differentiable, an iterative attack is proposed where the $L_2$ attacker is used to determine which features are changed. Analogously, the $L_\infty$ norm is poorly differentiable and hard to optimize. Here, the authors propose to use an iterative attacker with a penalty taking into account the $L_\infty$ norm.
III Experimental Setup
Before we begin our study, we briefly describe the setting of our evaluation. We first present the data used, then comment on the classifiers and on the adversarial examples crafted. To conclude, we briefly review our public implementation (please contact the authors to get access).
Data. We focus on security settings, which are generally binary classification tasks. Two tasks are of direct security relevance, and concern the classification of spam emails and of malicious programs. We thus use a Malware data set (Hidost) [36] and a spam data set [19]. Both contain a mixture of binary and real-valued features, with the Malware data being mostly binary. Further, both data sets are imbalanced, with the dominant class consisting in % of the spam data, and for Hidost. To validate our results on fully real-valued (and balanced) data, we study two subtasks of the MNIST data set [16]: 9 vs. 1 and 3 vs. 8. MNIST contains black and white images of digits which have to be classified. All data sets were selected to facilitate a study on a wide range of different types of data. We summarize these data sets in Table I.
Name  number of features  number of instances  kind of features  attack norms
Hidost  mostly binary  
Spam  mixed  ,,  
MNIST91  real  ,,  
MNIST38  real  ,, 
Classifiers. We train Gaussian process classification (GPC) and Bayesian neural networks (BNN), of which we study the uncertainty measures, but also a range of substitutes. These substitutes include a linear support vector machine (SVM) and a deep neural network (DNN), and two neural networks that are trained to mimic GPC (called GPDNN) and the linear SVM (linDNN, see the Appendix for details). We finally also train an RBF SVM to evaluate accuracy of some attacks. The accuracy on benign test data of all classifiers or substitutes is given in Table II.
MNIST91  MNIST38  Malw  Spam  

GPC  
GPDNN  
linear DNN  
DNN  
BNN  
linear SVM  
RBF SVM  
random guess 
Attacks. We propose two attacks on GPC which are based on the Jacobian of the classifier, and are the equivalents of FGSM and JSMA (see Section II-E) for GPC (a detailed derivation can be found in the Appendix). In our second study, we introduce an additional attack on GPC, which we skip here.
We further apply all attacks described in detail in Section II-E to the substitutes described above. We target the linear SVM using FGSM. Further, we use FGSM, JSMA and Carlini and Wagner's attacks on the DNN and on the DNN that mimic GPC and the linear SVM, namely GPDNN and linDNN.
Implementation. We implement our experiments in Python. For DNN and BNN, we use Tensorflow [1]; we use Scipy [15] for SVM and optimization problems, and GPy [13] for GPC. We rely on the implementation of the JSMA and FGSM attacks from the Cleverhans library version 1.0.0 [12]. Further, we obtain the code provided by Carlini and Wagner for their attacks (retrieved from https://github.com/carlini/nn_robust_attacks, July 2017). We implement all other attacks ourselves.
IV Motivating Study
We study how adversarial examples affect the uncertainty and confidence that GPC provides. Previous work identifies uncertainty measures as a potential mitigation to adversarial examples: we therefore expect that any adversarial example shows deviations in its uncertainty or confidence scores.
IV-A Results
We monitor GP’s uncertainty and confidence measures for benign and malicious data. The latter, manipulated data, is depicted by attack type: All attacks based on the Jacobian with iterative, local changes are summarized in JSMA. The data for JSMA hence includes examples crafted on DNN, GPC, linDNN, and GPDNN. The FGSM data includes examples crafted on DNN, linDNN, GPDNN, GPC and linear SVM.
We further report FGSM results depending on the perturbation strength to observe the relationship between strength of perturbation and confidence or uncertainty. The Carlini and Wagner attacks were crafted on DNN, linDNN and GPDNN.
We distinguish correctly classified data and wrongly classified data: Adversarial data counts as correctly classified when the original class is recovered: the classifier outputs the class of the benign counterpart of the adversarial example. To visualize how uncertainty and confidence values are distributed, we use two violin plots per attack: one for misclassified data (red) and one for correctly classified (gray).
Confidence on Manipulated Data. We plot the distribution of the confidence, or latent mean, in Figure 2(a). In general, GPC is more confident on correctly classified data than on misclassified data. This holds in all cases on MNIST91. On Malware and MNIST38, we observe similar results except for FGSM with . On MNIST38, these differences are overall less pronounced. We observe almost no difference in confidence on the spam data set, except in adversarial examples produced by and attacks.
In general, GPC is as confident on misclassified malicious data as on misclassified benign data. An exception on the MNIST91 and the Malware data is FGSM with . On MNIST38, we observe similar confidence to benign, misclassified data only for JSMA, ,, and attacks. Finally on the spam data set, GPC is in general more confident on adversarial than on benign misclassified data, with the exceptions of FGSM with , , , and attacks.
Uncertainty on Manipulated Data. We plot the uncertainty measures, or latent variance, in Figure 2(b). In contrast to the confidence values, the uncertainty values do not differ strongly between correct and misclassified data. The only notable exception to this is FGSM with on MNIST91.
Instead, uncertainty is generally higher when the perturbation introduced during crafting is larger: the higher the used in FGSM, the higher the observed uncertainty estimates. This result is, on all data sets, most pronounced for ; for smaller the uncertainty values are similar to the values of benign data. On MNIST91, MNIST38 and Malware, the uncertainty values are similar to benign data for JSMA, , and . On the spam data set, the results are slightly different. Here, and and FGSM with are similar to the uncertainty of benign data. JSMA on the spam data set leads to an increase in uncertainty by a factor of roughly . Finally, the and attack on MNIST91 is distributed uniformly, as opposed to the distribution of benign, misclassified data which is skewed towards zero.
The uncertainty of manipulated data is in general higher than the uncertainty of benign, correctly classified data when ignoring outliers. There are few exceptions, for example FGSM with on all data sets. Further and on spam, or JSMA on Malware do not affect uncertainty strongly.
The $L_2$ attack leads in all cases to decreased uncertainty. We suspect that this is a consequence of it being optimized for the $L_2$ distance, which is also used in GPC. Another explanation might be that the $L_2$ metric is differentiable and can thus directly be optimized. This might lead to more stealthy examples.
A Mitigation.
Confidence estimates in GPC seem to distinguish benign and perturbed data to a large extent, when correctly classified, benign data is used as a baseline. In the case of uncertainty, we do not observe a difference between misclassified and correctly classified data, but rather a relationship between the size of the perturbation and the increase in uncertainty. We seek to confirm whether uncertainty and confidence can be used to implement a mitigation. Hence, we compute a confidence interval on benign, correctly classified validation data and investigate its vulnerability.
More concretely, we compute a threshold for rejection by computing both confidence and uncertainty on a holdout data set. We then order all confidence (uncertainty) values by their size. We exclude the most extreme values at both ends, and take the fourth highest and fourth lowest value as thresholds: any output of the GPC above the former or below the latter is considered a rejection.
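The thresholding step can be sketched as follows. The sketch assumes that the same procedure is applied separately to the confidence values and to the uncertainty values of the holdout data; the function names and the toy values are illustrative assumptions.

```python
import numpy as np

def fit_thresholds(values, rank=4):
    """Return the rank-th lowest and rank-th highest holdout value."""
    ordered = np.sort(np.asarray(values, dtype=float))
    return ordered[rank - 1], ordered[-rank]

def reject(value, lower, upper):
    """Reject any output outside the interval spanned by the two thresholds."""
    return bool(value < lower or value > upper)

# Hypothetical holdout scores (for example latent means of benign data).
holdout = np.arange(1, 21)
lower, upper = fit_thresholds(holdout)
decisions = [reject(v, lower, upper) for v in (3.5, 10.0, 18.0)]
```

A sample would then be rejected if either its confidence or its uncertainty falls outside the respective interval.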
The accuracy of this mitigation is depicted in Figure 4. The mitigation is indeed more robust than the undefended GPC. More specifically, the mitigation’s accuracy is in general above %. Exceptions are the attack and FGSM on GPDNN with on MNIST91. Further on spam, FGSM on DNN with and GPC with are exceptions with accuracies around %. On MNIST38, all accuracies are above %, with a consistent decrease for FGSM with on all crafted models. Finally, on the Malware data, accuracies are in general %, with the exception of JSMA on GPC.
Conclusion. We observed empirically that most techniques tested lead to an increase in either uncertainty or confidence in GPC. A mitigation based on both measures achieved good accuracy (generally %). To check if we obtained a false sense of security, we develop an attack that maximizes confidence and minimizes uncertainty, thereby circumventing our mitigation.
V Evading Uncertainty and Confidence
In the preceding study, we observed that conventional attacks often lead to noticeable deviations in confidence and uncertainty. Hence, we adapt the optimization of adversarial examples to account for confidence and uncertainty, thereby introducing high-confidence-low-uncertainty (HCLU) adversarial examples. We formalize the computation of HCLU examples as
s.t.  
where we minimize the perturbation using the $L_2$ norm. An extension to other norms, as in [7], is however straightforward. We study the $L_2$ norm as it is differentiable and thus allows us to formulate a worst-case attacker. Concerning confidence, we demand explicitly that the resulting adversarial example is confidently classified (first constraint). We also require that the uncertainty GPC outputs for the example is at least as low as for the benign counterpart (second constraint).
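Based on the description of the objective and the two constraints, the optimization problem can be written out as the following sketch. All symbols are assumptions introduced for illustration: $\delta$ is the perturbation, $\mu(\cdot)$ and $\sigma^{2}(\cdot)$ are the latent mean and latent variance of the GPC, $t \in \{-1, +1\}$ is the class the attacker aims for, and $c > 0$ is a confidence threshold whose value is not specified here.

```latex
\begin{aligned}
\min_{\delta} \quad & \lVert \delta \rVert_{2} \\
\text{s.t.} \quad & t \cdot \mu(x + \delta) \ge c, \\
& \sigma^{2}(x + \delta) \le \sigma^{2}(x)
\end{aligned}
```

The first constraint pushes the latent mean confidently to the side of the desired class, and the second bounds the latent variance of the adversarial example by that of its benign counterpart.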
In the following, we first describe the resulting HCLU examples. Afterwards, we test their transferability. Finally, we evaluate highconfidence examples transferred from DNN on GPC and conclude the section by summarizing our results.
GPC as WhiteBox. We show some of our HCLU adversarial examples in the second row of Figure 5. These examples are still adversarial: we see in the figure that they are visually similar to their benign origin. The average perturbation introduced (measured using norm) varies between on average for spam and on average for MNIST91. On Malware, is on average and on MNIST38 . The variance of on all data sets. The success rate is %, except for MNIST38 where we only succeed in % of the cases.
We might be tempted to craft examples by only maximizing confidence, thereby removing the second constraint. Surprisingly, the perturbation of such examples differs only in strength for MNIST38 ( as compared to examples that take into account uncertainty). For all other data sets, the number of changed features remains roughly the same, and and do not differ by more than (measured again using ). The variance of this is, as in the previous case considering uncertainty, well below . Examples are depicted in the first row of Figure 5. Yet, these examples often lead to an increase in uncertainty, and are thus not discussed here.
Transferability. We test the HCLU adversarial examples and find that they transfer well. We depict the accuracy of a Bayesian neural network, a conventional DNN, a linear SVM and an RBFSVM in Figure 6.
The amount of correctly classified adversarial examples on MNIST91 and Malware is about %, with the BNN achieving %. On MNIST38, the accuracies are around % for most algorithms, with only the linear SVM performing notably worse. The results for spam vary, where the accuracy of BNN and DNN is around % and the linear SVM around %. The RBFSVM achieves on both spam and Malware data random guess accuracy.
Transferability and Confidence. We test the effect of HCLU adversarial examples on Bayesian neural networks (BNN). These networks provide uncertainty measures via Monte Carlo sampling of the posterior. In this experiment, we are interested in whether the computed uncertainty differs between benign and adversarial data. We choose the $L_2$ attack of Carlini and Wagner as a baseline: $L_2$, as in our case, allows the best optimization. Only for Malware do we craft the attack under a different norm: we do not want the network to have an advantage as it observes real-valued features compared to mostly binary features in the training data. We further choose the confidence parameter of the attack so as to obtain good transferability.
This attack is supposed to maximize the softmax output, and therefore the confidence on a DNN. However, as noted by [9], this interpretation of the softmax output suffers from certain limitations when compared to uncertainty measures as provided by GP and BNN.
We depict the results in Figure 7, where we distinguish correctly classified (gray shades) and wrongly classified (red shades) benign and adversarial data. We measure the mean and variance of the sampled posteriors and bin them using bins between and . To outline overall trends, we plot the normalized bins stacked on top of each other.
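The statistics described above can be sketched as follows: given the outputs of several stochastic forward passes, the mean and variance are taken over the passes, and the resulting values are binned and normalized. The number of bins and their range are hypothetical placeholders, as are the function names and the toy outputs.

```python
import numpy as np

def mc_statistics(samples):
    """Mean and variance over Monte Carlo passes; samples has shape (passes, points)."""
    samples = np.asarray(samples, dtype=float)
    return samples.mean(axis=0), samples.var(axis=0)

def normalized_bins(values, n_bins, low, high):
    """Histogram of values, normalized so that the bins sum to one."""
    counts, edges = np.histogram(values, bins=n_bins, range=(low, high))
    total = counts.sum()
    fractions = counts / total if total > 0 else counts.astype(float)
    return fractions, edges

# Hypothetical outputs of three stochastic passes for two data points.
samples = [[0.9, 0.2], [0.7, 0.4], [0.8, 0.3]]
mc_mean, mc_var = mc_statistics(samples)
fractions, edges = normalized_bins(mc_mean, n_bins=2, low=0.0, high=1.0)
```

Computing such normalized bins separately for each group (benign or adversarial, correctly or wrongly classified) gives the quantities that can be stacked in a plot.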
The BNN, similar to the GP, is more confident on data that is correctly classified. This observation holds true across all data sets and independently from adversarial manipulations in the data. We do observe, however, that BNN are confident on many misclassified HCLU examples. Intriguingly, BNN output low confidence on some HCLU examples which are not misclassified. This is not the case for the attack, where the network behaves analogous to benign data: low confidence for misclassified data, and confident on recovered or correctly classified data.
Transferring High Confidence Examples from DNN. To conclude this section, we investigate whether high-confidence adversarial examples from DNN transfer to GPC. We obtain these examples using the three attacks of Carlini and Wagner. Here, the confidence parameter of the attacks controls the tradeoff between the amount of perturbation introduced and the confidence in classification. We evaluate a range of values for this parameter.
The results are depicted in Figure 8. We observe that with an increasing confidence parameter, neither GPC's accuracy nor its confidence increases or decreases consistently. The latent mean increases slightly in some cases, for example MNIST38, MNIST91, and spam for particular settings of the parameter. Further, all observed average confidence scores are around the latent mean for misclassified benign data, depicted as a black line.
Summary. We observed that HCLU examples are still malicious, and very similar to their benign counterparts. Nonetheless, they are misclassified by other models. Even worse, models providing model uncertainty misclassified them with high confidence. High confidence adversarial examples crafted using a confidence that is not Bayesian, however, are not classified with high uncertainty by GPC or BNN. We conclude that in an adversarial setting, we have to distinguish Bayesian and nonBayesian confidence, although both can be fooled.
VI Related Work
We are not the first to study model uncertainty in the presence of adversarial examples. Bekasov and Murray [3] for example show the importance of priors in robustness, a somewhat orthogonal approach to our work.
Bradshaw et al. [5] investigate Gaussian Hybrid networks, a DNN where the last layer is replaced by a Gaussian Process. Further, Melis et al. [23] add a 1-class SVM as a last layer of a DNN to build a defense based on uncertainty. They show that this defense can be circumvented, analogous to our work. Yet, we go a step further and test whether principled model uncertainty as a defense can be circumvented in a black-box setting.
Another line of work focuses on models allowing intrinsic principled uncertainty measures. For example, Gal and Smith [10] propose an attack to sample garbage examples in the pockets of the uncertainty of Bayesian neural networks. Bayesian neural networks are further investigated by Rawat et al. [29]. They test FGSM adversarial examples on Bayesian networks and find notable differences in model uncertainty for such examples. Li and Gal [18] observe differences for high-confidence adversarial examples. Further, Smith and Gal [33] conclude that the Mutual Information of an ensemble of Bayesian networks detects adversarial data most securely. In this work, we propose targeted adversarial examples and show their transferability. This contradicts claims that principled model uncertainty is more robust or more difficult to attack.
In our work, we show that transferability also holds for model uncertainty. General transferability was first shown by Papernot et al. [26]. Rozsa et al. [31] study transferability for different deep neural network architectures. To the best of our knowledge, there exist no works so far studying the transferability of HCLU adversarial examples or the transferability of attacks across different models enabling uncertainties.
Another field of research is the general relationship between deep learning and Gaussian Processes, as investigated by Neal [25]. To gain more understanding, recent approaches by Matthews et al. [8] and Lee et al. [17] represent deep neural networks with infinitely wide layers as kernels for Gaussian Processes. Lee et al. [17] further show a relation between uncertainty in Gaussian Processes and predictive error in DNN, a result that links our work with other approaches targeting DNN.
VII Conclusion
In this paper, we studied GP to gain understanding about the vulnerability of machine learning models providing Bayesian model uncertainty. We used a range of existing attacks, including optimized, local and global perturbations. Additionally, we evaluated a range of classifiers: GPC, DNN, SVM and BNN. We further studied several tasks, namely malware classification, spam detection and handwritten digit classification.
We found that GP’s uncertainty and confidence deviated for misclassified benign and manipulated data, when correctly classified benign data is used as a baseline. This baseline even allows us to build a mitigation on holdout data. We then investigated how an adversary can utilize uncertainty measures and introduced a technique to craft HCLU adversarial examples, which achieve both high confidence and low uncertainty on GP. These examples reliably cause misclassification, and visual inspection of the examples crafted for the MNIST tasks shows that they resemble the original rather than the target class. Our findings imply that even an ensemble with infinitely many decision functions, like a GPC, can successfully be targeted by adversarial examples.
We further conducted a study about the transferability of HCLU adversarial examples. We found that they transfer both to conventional and to other Bayesian machine learning models. In the case of Bayesian neural networks, we found that HCLU adversarial examples are misclassified with high confidence. Further, we found that the opposite is not necessarily true: conventional attacks maximizing the softmax output on DNN, such as Carlini and Wagner’s high-confidence adversarial examples, do not produce high-confidence mispredictions in either GP or BNN.
Mitigations solely relying on model uncertainty and confidence measures in the Bayesian sense are effective against conventional high-confidence examples or when the attacker is not aware of the defense applied, hence in a security-by-obscurity setting. We conclude that Bayesian model uncertainty, albeit promising, is circumventable in both a black-box and a white-box setting.
Acknowledgment
This work was supported by the German Federal Ministry of Education and Research (BMBF) through funding for the Center for IT-Security, Privacy and Accountability (CISPA) (FKZ: 16KIS0753). This work has further been supported by the Engineering and Physical Sciences Research Council (EPSRC) Research Project EP/N014162/1.
References
 [1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. Tensorflow: a system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.
 [2] A. Athalye, N. Carlini, and D. A. Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pages 274–283, 2018.
 [3] A. Bekasov and I. Murray. Bayesian adversarial spheres: Bayesian inference and adversarial examples in a noiseless setting. Bayesian Deep Learning at NeurIPS 2018, since 2018.
 [4] B. Biggio, I. Corona, D. Maiorca, B. Nelson, N. Šrndić, P. Laskov, G. Giacinto, and F. Roli. Evasion attacks against machine learning at test time. In Joint European conference on machine learning and knowledge discovery in databases, pages 387–402. Springer, 2013.
 [5] J. Bradshaw, A. G. d. G. Matthews, and Z. Ghahramani. Adversarial Examples, Uncertainty, and Transfer Testing Robustness in Gaussian Process Hybrid Deep Networks. ArXiv e-prints, July 2017.
 [6] N. Carlini and D. Wagner. Adversarial examples are not easily detected: Bypassing ten detection methods. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pages 3–14. ACM, 2017.
 [7] N. Carlini and D. Wagner. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pages 39–57. IEEE, 2017.
 [8] A. G. de G. Matthews, J. Hron, M. Rowland, R. E. Turner, and Z. Ghahramani. Gaussian process behaviour in wide deep neural networks. International Conference on Learning Representations, 2018.
 [9] Y. Gal and Z. Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pages 1050–1059, 2016.
 [10] Y. Gal and L. Smith. Idealised bayesian neural networks cannot have adversarial examples: Theoretical and empirical study. CoRR, abs/1806.00667, 2018.
 [11] I. J. Goodfellow et al. Explaining and harnessing adversarial examples. In Proceedings of the 2015 International Conference on Learning Representations, 2015.
 [12] I. J. Goodfellow, N. Papernot, and P. D. McDaniel. cleverhans v0.1: an adversarial machine learning library. CoRR, abs/1610.00768, 2016.
 [13] GPy. GPy: A gaussian process framework in python. http://github.com/SheffieldML/GPy, since 2012.
 [14] K. Grosse, N. Papernot, P. Manoharan, M. Backes, and P. D. McDaniel. Adversarial examples for malware detection. In Computer Security - ESORICS 2017 - 22nd European Symposium on Research in Computer Security, Oslo, Norway, September 11-15, 2017, Proceedings, Part II, pages 62–79, 2017.
 [15] E. Jones, T. Oliphant, P. Peterson, et al. SciPy: Open source scientific tools for Python, 2001–.
 [16] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998.
 [17] J. Lee, Y. Bahri, R. Novak, S. Schoenholz, J. Pennington, and J. Sohl-Dickstein. Deep neural networks as gaussian processes. International Conference on Learning Representations, 2018.
 [18] Y. Li and Y. Gal. Dropout inference in bayesian neural networks with alpha-divergences. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pages 2052–2061, 2017.
 [19] M. Lichman. UCI machine learning repository, 2013.
 [20] D. Lowd and C. Meek. Good word attacks on statistical spam filters. In CEAS 2005 - Second Conference on Email and Anti-Spam, July 21-22, 2005, Stanford University, California, USA, 2005.
 [21] D. Maiorca, I. Corona, and G. Giacinto. Looking at the bag is not enough to find the bomb: an evasion of structural methods for malicious PDF files detection. In 8th ACM Symposium on Information, Computer and Communications Security, ASIA CCS ’13, Hangzhou, China - May 08-10, 2013, pages 119–130, 2013.
 [22] S. Mei and X. Zhu. Using machine teaching to identify optimal training-set attacks on machine learners. In AAAI, pages 2871–2877, 2015.
 [23] M. Melis, A. Demontis, B. Biggio, G. Brown, G. Fumera, and F. Roli. Is deep learning safe for robot vision? adversarial examples against the icub humanoid. In 2017 IEEE International Conference on Computer Vision Workshops, ICCV Workshops 2017, Venice, Italy, October 22-29, 2017, pages 751–759, 2017.
 [24] S.-M. Moosavi-Dezfooli, A. Fawzi, and P. Frossard. Deepfool: A simple and accurate method to fool deep neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
 [25] R. M. Neal. Bayesian learning for neural networks, volume 118. Springer, 1996.
 [26] N. Papernot, P. McDaniel, and I. J. Goodfellow. Transferability in machine learning: from phenomena to black-box attacks using adversarial samples. CoRR, abs/1605.07277, 2016.
 [27] N. Papernot, P. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik, and A. Swami. The Limitations of Deep Learning in Adversarial Settings. In Proceedings of the 1st IEEE European Symposium in Security and Privacy (EuroS&P), 2016.
 [28] C. E. Rasmussen and C. K. I. Williams. Gaussian processes for machine learning. Adaptive computation and machine learning. MIT Press, 2006.
 [29] A. Rawat, M. Wistuba, and M.-I. Nicolae. Adversarial Phenomenon in the Eyes of Bayesian Deep Learning. ArXiv e-prints, Nov. 2017.
 [30] B. D. Rouhani, M. Samragh, T. Javidi, F. Koushanfar, et al. Safe machine learning and defeating adversarial attacks. IEEE Security and Privacy (S&P) Magazine, 2018.
 [31] A. Rozsa, M. Günther, and T. E. Boult. Are accuracy and robustness correlated. In Machine Learning and Applications (ICMLA), 2016 15th IEEE International Conference on, pages 227–232. IEEE, 2016.
 [32] W. J. Scheirer, L. P. Jain, and T. E. Boult. Probability models for open set recognition. IEEE Trans. Pattern Anal. Mach. Intell., 36(11):2317–2324, 2014.
 [33] L. Smith and Y. Gal. Understanding measures of uncertainty for adversarial example detection. arXiv preprint arXiv:1803.08533, 2018.
 [34] L. Smith and Y. Gal. Understanding measures of uncertainty for adversarial example detection. CoRR, abs/1803.08533, 2018.
 [35] N. Šrndić and P. Laskov. Practical evasion of a learning-based classifier: A case study. In 2014 IEEE Symposium on Security and Privacy, SP 2014, Berkeley, CA, USA, May 18-21, 2014, pages 197–211, 2014.
 [36] N. Šrndić and P. Laskov. Hidost: a static machine-learning-based detector of malicious files. EURASIP Journal on Information Security, 2016(1):22, Sep 2016.
 [37] F. Tramèr, F. Zhang, A. Juels, M. K. Reiter, and T. Ristenpart. Stealing machine learning models via prediction apis. In 25th USENIX Security Symposium, USENIX Security 16, Austin, TX, USA, August 10-12, 2016, pages 601–618, 2016.
 [38] W. Xu, D. Evans, and Y. Qi. Feature squeezing: Detecting adversarial examples in deep neural networks. In 25th Annual Network and Distributed System Security Symposium, NDSS 2018, San Diego, California, USA, February 18-21, 2018, 2018.
 [39] W. Xu, Y. Qi, and D. Evans. Automatically evading classifiers. In Proceedings of the 2016 Network and Distributed Systems Symposium, 2016.
Appendix A Detailed Attack Derivation
In this appendix, we first give the detailed derivation of the JSMA and FGSM attacks for GPC, and then describe the training of the surrogates.
A-A Derivation of FGSM and JSMA for GP
In this part of the Appendix, we present the detailed derivation to compute adversarial examples on GPC, including the reasoning why it is sufficient to use the latent mean.
We compute the gradient of the output with respect to the input dimensions. We consider the chain of gradients for the normalized output , a test input :
(7) 
where is the latent mean and the covariance function, respectively.
Note that for this attack, we are only interested in the relative order of the gradients, not their actual values. Unfortunately, does not vary monotonically with as the variance also affects the prediction. However, we are in a setting of binary classification, so we are only interested in moving the prediction, , across the boundary. No change in variance can cause this; instead, a change in the mean of is required (effectively the mean is monotonic with respect to in the region of 0.5). The fastest we can get from one probability threshold to its opposite is when there is no variance (any variance will move the mean towards 0.5). So finding the gradient of is sufficient. This is analogous to the usage of the logits (instead of the softmax) in evasion attacks on deep neural networks.
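The role of the variance in this argument can be checked numerically. The sketch below assumes a probit link, for which the averaged prediction has the closed form Φ(μ / √(1 + σ²)); the choice of link and the function name `probit_prediction` are assumptions made for illustration and are not taken from the derivation above. Under this assumption, the variance only pulls the prediction towards 0.5, and the side of the boundary is decided by the sign of the latent mean alone.

```python
import math


def probit_prediction(mean, var):
    """Averaged class probability for a Gaussian latent with the given mean
    and variance under a probit link (assumed here for illustration):
    Phi(mean / sqrt(1 + var))."""
    z = mean / math.sqrt(1.0 + var)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Same latent mean, growing variance: the prediction shrinks towards 0.5
# but never crosses it.
for var in (0.0, 1.0, 4.0, 25.0):
    print(var, probit_prediction(1.0, var))

# Only a change of sign in the latent mean moves the prediction across 0.5.
print(probit_prediction(-1.0, 4.0))
```

With zero variance the prediction is furthest from 0.5, which matches the observation that the fastest way across the boundary is via the latent mean.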
However, we found that we can still use the gradient of (instead of a numerical approximation to ):
(8) 
Let us first rewrite the expected value of given a single test point :
(9) 
where we write to denote GPC’s output for the training points . From here, we move on to the first part of the gradient,
(10) 
Note that the remaining terms are both constant with respect to the test input . The gradient of the covariance with respect to the inputs depends on the particular kernel that is applied. In our case, for the RBF kernel, between training point and test point , the gradient can be expressed as
(11) 
where and each denote feature or dimension of the corresponding vector or data point and denotes the lengthscale parameter of the kernel. Using Equation 8 the gradient of the output with respect to the inputs is approximately proportional to the product of Equation 10 and Equation 11, in the region of .
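For reference, the chain described in this part can be written out in one possible notation. The symbols below are assumptions chosen for illustration and follow common GP classification conventions: $\bar{\pi}_*$ for the normalized output, $\bar{f}_*$ for the latent mean, $\mathbf{k}_*$ for the vector of covariances between the test point $x_*$ and the training points $x_1, \dots, x_n$, $K$ for the training covariance matrix, $\hat{\mathbf{f}}$ for the latent values at the training points, and $\ell$ for the lengthscale. The kernel variance is set to one for brevity.

```latex
% Sketch in assumed notation; see the lead-in for the meaning of each symbol.
\begin{align*}
\frac{\partial \bar{\pi}_*}{\partial x_*}
  &= \frac{\partial \bar{\pi}_*}{\partial \bar{f}_*}\,
     \frac{\partial \bar{f}_*}{\partial \mathbf{k}_*}\,
     \frac{\partial \mathbf{k}_*}{\partial x_*}, \\
\bar{f}_* &= \mathbf{k}_*^{\top} K^{-1} \hat{\mathbf{f}}
  \quad\Longrightarrow\quad
  \frac{\partial \bar{f}_*}{\partial \mathbf{k}_*} = K^{-1}\hat{\mathbf{f}}, \\
k(x_*, x_j) &= \exp\!\Big(-\frac{\lVert x_* - x_j\rVert^2}{2\ell^2}\Big)
  \quad\Longrightarrow\quad
  \frac{\partial k(x_*, x_j)}{\partial x_{*,i}}
  = \frac{x_{j,i} - x_{*,i}}{\ell^2}\, k(x_*, x_j), \\
\frac{\partial \bar{f}_*}{\partial x_{*,i}}
  &= \sum_{j=1}^{n} \big(K^{-1}\hat{\mathbf{f}}\big)_j\,
     \frac{x_{j,i} - x_{*,i}}{\ell^2}\, k(x_*, x_j).
\end{align*}
```

The last line is the product of the two factors that depend on the test input; the factor $K^{-1}\hat{\mathbf{f}}$ is fixed once the model is trained.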
Based on the computation of these gradients, we perturb the initial sample. In GPFGS (similar to FGSM), we introduce a global change using the sign of the gradient and a specified .
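A minimal sketch of such a global sign-gradient step on the latent mean is given below. It is an illustration under stated assumptions, not the implementation used in the experiments: the weight vector `alpha` is obtained from plain GP regression on ±1 labels as a simple stand-in for the corresponding quantity of the trained GPC, and all function names (`rbf_kernel`, `latent_mean`, `latent_mean_grad`, `gp_fgs`) as well as the toy data are hypothetical.

```python
import numpy as np


def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """RBF covariance between the rows of A (n, d) and the rows of B (m, d)."""
    sq_dist = (np.sum(A**2, axis=1)[:, None]
               + np.sum(B**2, axis=1)[None, :]
               - 2.0 * A @ B.T)
    return variance * np.exp(-0.5 * np.maximum(sq_dist, 0.0) / lengthscale**2)


def latent_mean(x_star, X_train, alpha, lengthscale=1.0, variance=1.0):
    """Latent mean k(x*)^T alpha; alpha is constant with respect to x*."""
    k_star = rbf_kernel(x_star[None, :], X_train, lengthscale, variance)[0]
    return float(k_star @ alpha)


def latent_mean_grad(x_star, X_train, alpha, lengthscale=1.0, variance=1.0):
    """Gradient of the latent mean with respect to the test input x*."""
    k_star = rbf_kernel(x_star[None, :], X_train, lengthscale, variance)[0]
    # d k(x*, x_j) / d x*_i = (x_ji - x*_i) / lengthscale^2 * k(x*, x_j)
    dk_dx = (X_train - x_star) / lengthscale**2 * k_star[:, None]  # (n, d)
    return dk_dx.T @ alpha                                         # (d,)


def gp_fgs(x, X_train, alpha, eps, lengthscale=1.0, variance=1.0):
    """One global step of size eps that pushes the latent mean towards the
    opposite side of the decision boundary at zero."""
    mean = latent_mean(x, X_train, alpha, lengthscale, variance)
    grad = latent_mean_grad(x, X_train, alpha, lengthscale, variance)
    direction = -1.0 if mean > 0 else 1.0
    return x + eps * direction * np.sign(grad)


# Toy data: two clusters labeled -1 and +1 (illustrative only).
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(-1.0, 0.3, (10, 2)),
                     rng.normal(1.0, 0.3, (10, 2))])
y_train = np.concatenate([-np.ones(10), np.ones(10)])

# Assumption: GP regression weights as a stand-in for the GPC quantity.
K = rbf_kernel(X_train, X_train) + 0.1 * np.eye(len(X_train))
alpha = np.linalg.solve(K, y_train)

x = np.array([0.3, 0.2])
x_adv = gp_fgs(x, X_train, alpha, eps=0.05)
print(latent_mean(x, X_train, alpha), latent_mean(x_adv, X_train, alpha))
```

Because only the sign of the gradient is used, every feature is changed by at most the chosen step size, which is what makes the perturbation global rather than local.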
Alternatively, in GPJM, we compute local, greedy changes (see Algorithm 2, analogous to JSMA). Instead of a saliency map, however, we iteratively compute the (still unperturbed) feature with the strongest gradient and perturb it. We alternate between perturbation for misclassification and (optionally) decreasing uncertainty. We finish altering the example when it is either misclassified at a predefined threshold, or when we have changed more than a previously specified number of features, which corresponds to a failure.
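The greedy loop can be sketched as follows. This is a simplified illustration rather than Algorithm 2 itself: the optional uncertainty-decreasing step is omitted, the latent mean and its gradient are passed in as callables, and the names `gp_jm`, `mean_fn`, `grad_fn`, `step`, `max_features` and `threshold` are hypothetical. A toy linear function stands in for the latent mean of a trained GPC.

```python
import numpy as np


def gp_jm(x, mean_fn, grad_fn, step=1.0, max_features=10, threshold=0.0):
    """Greedily perturb one untouched feature per iteration until the latent
    mean crosses to the other side of zero by more than `threshold`, or
    until `max_features` features have been changed (failure)."""
    x_adv = np.array(x, dtype=float)
    start_sign = 1.0 if mean_fn(x_adv) > 0 else -1.0
    untouched = np.ones(x_adv.shape[0], dtype=bool)

    for _ in range(max_features):
        if start_sign * mean_fn(x_adv) < -threshold:
            return x_adv, True
        grad = grad_fn(x_adv)
        # Only features that have not been perturbed yet are candidates.
        scores = np.where(untouched, np.abs(grad), -np.inf)
        i = int(np.argmax(scores))
        if not np.isfinite(scores[i]):
            break  # every feature has already been changed
        # Move feature i so that the latent mean moves away from start_sign.
        x_adv[i] -= start_sign * step * np.sign(grad[i])
        untouched[i] = False

    return x_adv, bool(start_sign * mean_fn(x_adv) < -threshold)


# Toy stand-in for the latent mean: a linear function of the input.
w = np.array([2.0, -1.0, 0.5, 0.1])
mean_fn = lambda v: float(w @ v)
grad_fn = lambda v: w

x0 = np.array([1.0, 0.0, 1.0, 0.0])
x_adv, success = gp_jm(x0, mean_fn, grad_fn, step=1.0, max_features=3)
print(x_adv, success, int(np.sum(x_adv != x0)))
```

In this toy run the two features with the largest gradient magnitude are changed before the sign of the latent mean flips; with a budget of a single feature the same call reports a failure.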
A-B Surrogate Models for GP
We train several surrogate models to approximate GP’s decision surface and to be able to apply DNN-specific attacks to GPC as well. To this end, we first briefly introduce the Gaussian process latent variable model (GPLVM) that these surrogates are trained with, before we introduce the attacks themselves.
A-B1 GP Latent Variable Model
GPC learns based on labeled data. This introduces an implicit bias: we assume, for example, that the number of labels is finite and fixed, and that the labels are related to the structure of the data. Furthermore, the curse of dimensionality affects classification, because it affects distance measures between data points. In higher dimensions, the ratio between the distances to the nearest and the furthest point approaches one. All data points are thus almost uniformly distant, impeding classification: it becomes harder to compute a separating boundary. Consequently, we are interested in finding a lower-dimensional representation of our data.
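The distance concentration effect is easy to reproduce on synthetic data. The following sketch draws uniformly distributed points (an assumption made purely for illustration, unrelated to the data sets used in this paper) and reports the ratio between the nearest and the furthest distance from a random query point; the function name `nearest_to_farthest_ratio` is hypothetical.

```python
import numpy as np


def nearest_to_farthest_ratio(dim, n_points=200, seed=0):
    """Ratio of the smallest to the largest Euclidean distance between a
    random query point and n_points uniformly drawn points in [0, 1]^dim."""
    rng = np.random.default_rng(seed)
    points = rng.uniform(size=(n_points, dim))
    query = rng.uniform(size=dim)
    dist = np.linalg.norm(points - query, axis=1)
    return dist.min() / dist.max()


ratios = {dim: nearest_to_farthest_ratio(dim) for dim in (2, 10, 100, 1000)}
for dim, ratio in ratios.items():
    print(dim, round(float(ratio), 3))
```

In two dimensions the nearest point is far closer than the furthest one, whereas in a thousand dimensions the ratio is close to one, so distances carry little information for separating the data.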
These two issues are taken into account when using the Gaussian process latent variable model (GPLVM). Analogously to GPC, GPLVM models uncertainty estimates. Further, its latent space allows for nonlinear connections in the feature space to be represented. Yet, this latent space or lower-dimensional representation ignores labels. Consequently, we need to apply a classifier on top of GPLVM to enable classification.
A-B2 Training Surrogates
We propose a complementary approach to the attacks on GPC by attacking GPLVM+SVM using (the already established methodology of) DNN surrogates specifically tailored for the GP. To train such a surrogate model, we train a DNN to fit the latent space representation of GPLVM in one of the hidden layers. We achieve this by taking a common DNN and splitting it into two parts, where a hidden layer becomes the output layer for the first part and the input layer for the second part (see lower half of Figure 9).
The training data is fed as input to the first part. We train it by minimizing the loss between the output of the network and the latent space we want to approximate (for example the output of GPLVM). The second part receives this latent space as input, and is trained by minimizing the loss with respect to the normal labels. When stacking these two networks (i.e., when feeding the output of the first part immediately into the second), we obtain a combined DNN that mimics both the latent space it was trained on and the classifier on this latent space.
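The two-part training can be sketched with a very small network written in plain NumPy. This is an illustration under several assumptions and not the architecture used in the experiments: a standardized two-dimensional PCA projection serves as a hypothetical stand-in for the GPLVM latent space, the first part has a single tanh hidden layer, the second part is a logistic output layer, and all names and hyperparameters are chosen for the example only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: two Gaussian blobs in 5 dimensions with labels 0/1, standardized.
n, d, latent_dim, hidden = 200, 5, 2, 16
y = np.repeat([0.0, 1.0], n // 2)
X = rng.normal(size=(n, d)) + 2.0 * y[:, None]
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Assumption: a standardized 2-D PCA projection replaces the GPLVM latent space.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
Z = X @ Vt[:latent_dim].T
Z = Z / Z.std(axis=0)

# Part 1: input -> tanh hidden layer -> latent space, trained on squared error.
W1 = rng.normal(scale=0.1, size=(d, hidden)); b1 = np.zeros(hidden)
W2 = rng.normal(scale=0.1, size=(hidden, latent_dim)); b2 = np.zeros(latent_dim)
lr = 0.05
for _ in range(2000):
    H = np.tanh(X @ W1 + b1)
    err = (H @ W2 + b2 - Z) / n                 # gradient w.r.t. the output
    dH = (err @ W2.T) * (1.0 - H**2)            # backpropagate through tanh
    W2 -= lr * H.T @ err; b2 -= lr * err.sum(axis=0)
    W1 -= lr * X.T @ dH;  b1 -= lr * dH.sum(axis=0)


def part1(inputs):
    """First part of the surrogate: maps inputs to the latent space."""
    return np.tanh(inputs @ W1 + b1) @ W2 + b2


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


# Part 2: latent space -> label, trained on the cross-entropy loss.
w = np.zeros(latent_dim); c = 0.0
for _ in range(2000):
    g = (sigmoid(Z @ w + c) - y) / n
    w -= lr * Z.T @ g; c -= lr * g.sum()


def part2(latent):
    """Second part of the surrogate: classifies points in the latent space."""
    return sigmoid(latent @ w + c)


def surrogate(inputs):
    """Stacked network: output of part 1 is fed directly into part 2."""
    return part2(part1(inputs))


latent_mse = float(np.mean((part1(X) - Z) ** 2))
accuracy = float(np.mean((surrogate(X) > 0.5) == (y > 0.5)))
print(latent_mse, accuracy)
```

Training the two parts separately keeps the hidden layer tied to the target latent space, so gradients of the stacked network pass through a representation that approximates the one the attacked model relies on.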