How Wrong Am I? - Studying Adversarial Examples and their Impact on Uncertainty in Gaussian Process Machine Learning Models

11/17/2017 ∙ by Kathrin Grosse, et al. ∙ The University of Sheffield, Universität Saarland, CISPA

Machine learning models are vulnerable to adversarial examples: minor, in many cases imperceptible, perturbations to classification inputs. Among other suspected causes, adversarial examples exploit ML models that offer no well-defined indication as to how well a particular prediction is supported by training data, yet are forced to confidently extrapolate predictions in areas of high entropy. In contrast, Bayesian ML models, such as Gaussian Processes (GP), inherently model the uncertainty accompanying a prediction in the well-studied framework of Bayesian Inference. This paper is first to explore adversarial examples and their impact on uncertainty estimates for Gaussian Processes. To this end, we first present three novel attacks on Gaussian Processes: GPJM and GPFGS exploit forward derivatives in GP latent functions, and Latent Space Approximation Networks mimic the latent space representation in unsupervised GP models to facilitate attacks. Further, we show that these new attacks compute adversarial examples that transfer to non-GP classification models, and vice versa. Finally, we show that GP uncertainty estimates not only differ between adversarial examples and benign data, but also between adversarial examples computed by different algorithms.

I Introduction

Machine Learning classifiers are used for various purposes in a variety of research areas ranging from robotics to health. However, they have been shown to be vulnerable to a number of different attacks [24, 4, 39, 9]. Adversarial examples present the most direct threat to Machine Learning classification at test time: by introducing an almost imperceptible perturbation to a correctly classified sample, an attacker is able to change its predicted class. Adversarial examples have been used to craft visually indistinguishable images that are misclassified by state-of-the-art computer vision models [28], and they enable malware to bypass classifier-based detection mechanisms without loss of functionality [37, 43, 12].

While a range of defenses against these attacks has been developed, most of them provide only an empirical mitigation against adversarial examples [44]. This is not surprising: the development of new methods in deep learning is primarily motivated by the need for tractable models, favoring flexibility and efficiency over a rigorous mathematical framework.

The study of interpretability, expressivity and learning dynamics of DNN is an active area of research [32, 36].

Nevertheless, the lack of a rigorous theoretical underpinning in DNN has been detrimental to many defensive mechanisms: their robustness guarantees were primarily supported by empirical observations, often omitting implicit assumptions that were subsequently successfully subverted by increasingly elaborate attacks [7]. Recent work has started addressing this developing arms race of attacks and defenses by compensating for the lack of provable guarantees in the formal framework of DNN with auxiliary methods, e.g. in the form of verification techniques [17, 15]. Other approaches use the theoretical framework of more formally rigorous Machine Learning models, e.g. kernel methods [14] or k-Nearest Neighbors [42], to provide meaningful security and robustness guarantees.

Using statistical methods, [3] and [11] show that the distributions of benign data and adversarial data differ. Harnessing the comprehensive framework of Bayesian probability, relating high uncertainty for predictions to the sample being differently distributed than benign data, presents an immediate next step. Efforts to leverage Bayesian uncertainty estimates in conjunction with DNN to discern adversarial perturbations have been made [5, 21]. More generally, when projecting DNN into the framework of Bayesian methods, the seminal work of [29] notes a direct correspondence between infinite DNN and Gaussian Processes. [20] extends this work by describing a direct correspondence between deep and wide neural networks and Gaussian Processes, and by showing that Gaussian Process uncertainty is strongly correlated with DNN predictive error.

Contributions

In this paper, we investigate adversarial examples in a Bayesian framework using Gaussian Processes. In particular, we focus on uncertainty estimates in Gaussian Process Classification (GPC) and the Gaussian Process Latent Variable Model (GPLVM). Motivated by the fact that some attacks exploit the unstable attack surface of DNN specifically [5], we also formally derive white-box attacks on GPC and GPLVM.

Our evaluation across four tasks shows that uncertainty estimates usually reflect adversarial perturbations caused by state-of-the-art techniques. However, the connection between the change in uncertainty and the amount of perturbation introduced is not straightforward and warrants further investigation beyond the scope of this paper. A first mitigation based only on thresholding uncertainty estimates and rejecting predictions below this threshold already shows promising initial results. Intriguingly, we observed a possible link between the norm used in the kernel and the vulnerability towards an attack based on the dual of this norm: L2-based attacks appeared more successful in thwarting detection on our models with RBF kernels. Attacks crafted with the same algorithm but using other metrics were less successful and significantly affected uncertainty estimates; they were therefore rejected across all tested variants.

II Background

In this section, we briefly review Machine Learning classification and adversarial examples before providing an introduction to Gaussian Process Classification (GPC) and Gaussian Process Latent Variable Model (GPLVM) based classification.

II-A Classification

In classification, we consider a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$, where the $x_i \in \mathbb{R}^d$ are the data points and the $y_i$ are the labels. The goal is to train a classifier $f$ by adapting its parameters $\theta$ based on the training data such that $f(x) = y$, i.e. $f$ correctly predicts the labels of previously unseen test data $x$. For example, an SVM computes the optimal hyperplane given some data, where a nonlinear decision boundary is achieved by using a kernel. In contrast, a DNN learns several mappings, one in each layer, and is thus optimized to separate the data.

II-B Adversarial Examples

Given a trained classifier $f$, test-time attacks compute a small perturbation $\delta$ for a test sample $x$ such that

$$f(x + \delta) \neq f(x), \qquad (1)$$

i.e., the perturbed sample is classified as a different class than the original input. The sample $x + \delta$ is then called an adversarial example. A more advanced attacker can also mount targeted attacks, i.e. select the specific target class the sample should be misclassified as. Since we only consider binary settings in this paper, this distinction is superfluous.

Many algorithms exist for creating adversarial examples. We focus on the Fast Gradient Sign Method (FGSM) by [9] and the Jacobian-based Saliency Map Approach (JSMA) by [31], both of which are based on the derivative of the DNN's output with respect to its inputs. We also consider the attacks introduced by [6], which treat the task of producing an adversarial example as an iterative optimization problem.
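As a minimal illustration of the FGSM update (a sketch in NumPy, not the implementation used in this paper; computing the gradient itself is model-specific and assumed given):

```python
import numpy as np

def fgsm_perturb(x, grad, eps):
    """One FGSM-style step: shift every feature by eps in the direction of
    the gradient's sign. `grad` is assumed to be the gradient of the model's
    loss (or of the score of the wrong class) with respect to the input x."""
    return x + eps * np.sign(grad)
```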

Besides these attacks, there exist further variants of adversarial examples targeting other types of classifiers [30, 26, 13] or employing a different manner of computation [25, 2, 41].

II-C Gaussian Processes

This paper focuses on Gaussian Processes (GP), as they provide principled uncertainty estimates. We first introduce the Gaussian Process Latent Variable Model (GPLVM), a probabilistic model yielding a latent space representation for data irrespective of the labels. Afterwards, we consider a GP variant that incorporates labels during training and introduce GP Classification (GPC) using the Laplace approximation.

II-D Gaussian Process Latent Variable Model

A Gaussian Process Latent Variable Model (GPLVM) [18] yields a nonlinear latent space representation, $Z$, for some input data $X$. In particular, GPLVM learns this mapping by maximizing the likelihood of the latent positions.

To understand GPLVM, it is useful to first consider Principal Component Analysis (PCA). In PCA, we aim to reduce dimensionality by assuming that the data lies on a manifold described by the eigenvectors associated with the greatest variance. The dimensions of this lower-dimensional mapping are expressed by latent variables.

By giving the values of the latent variables, $Z$, a Gaussian prior and integrating over them, we obtain probabilistic PCA,

$$p(X \mid W, \beta) = \prod_{n=1}^{N} \mathcal{N}\!\left(x_n \mid 0,\; W W^{\top} + \beta^{-1} I\right), \qquad (2)$$

where $W \in \mathbb{R}^{D \times q}$ and $\beta$ are the parameters and $q$ denotes the latent dimension. $W$ and $\beta$ can be obtained by using maximum likelihood estimates. Alternatively, putting a prior on $W$ and integrating over it yields dual probabilistic PCA:

$$p(X \mid Z, \beta) = \prod_{d=1}^{D} \mathcal{N}\!\left(x_{:,d} \mid 0,\; Z Z^{\top} + \beta^{-1} I\right). \qquad (3)$$

The inner product $Z Z^{\top}$ can be kernelized. For example, using a non-linear kernel (such as the RBF) yields GPLVM. Using a non-linear kernel, however, also means that no closed-form solution exists. Note that GPLVM is not itself a classifier. In order to use it for classification tasks, we therefore apply an SVM to the latent variables.
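As a rough sketch of the resulting GPLVM+SVM pipeline (assuming GPy's GPLVM interface and scikit-learn's SVC; the data, latent dimensionality, and kernel here are placeholders rather than the paper's exact setup):

```python
import numpy as np
import GPy
from sklearn.svm import SVC

# Placeholder data: n samples with d features and binary labels.
X_train = np.random.rand(100, 20)
y_train = np.random.randint(0, 2, 100)

latent_dim = 2  # assumed; more latent dimensions may be needed (e.g. for Spam)
gplvm = GPy.models.GPLVM(X_train, latent_dim,
                         kernel=GPy.kern.RBF(latent_dim))
gplvm.optimize(messages=False)

# Train an SVM on the learned latent positions; classifying unseen points
# additionally requires inferring their latent coordinates, omitted here.
Z_train = np.asarray(gplvm.X)
svm_on_latent = SVC(kernel='rbf').fit(Z_train, y_train)
```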

II-E Gaussian Process Classification

We introduce GPC [33] for two classes using the Laplace approximation. The goal is to predict the labels $y_*$ for the test data points $X_*$ accurately. We first consider regression, and assume that the data is produced by a GP and can be represented using a covariance function $k$:

$$\begin{bmatrix} \mathbf{y} \\ \mathbf{f}_* \end{bmatrix} \sim \mathcal{N}\!\left(\mathbf{0},\; \begin{bmatrix} K(X, X) + \sigma_n^2 I & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) \end{bmatrix}\right), \qquad (4)$$

where $K(X, X)$ is the covariance of the training data, $K(X_*, X_*)$ of the test data, and $K(X_*, X)$ between test and training data. Having represented the data, we now review how to use this representation for predictions. The optimum estimate for the posterior mean at the given test points, assuming a Gaussian likelihood function, is

$$\bar{\mathbf{f}}_* = K(X_*, X)\left[K(X, X) + \sigma_n^2 I\right]^{-1}\mathbf{y}, \qquad (5)$$

which is also the mean of our latent function $f$. We will not detail the procedure for optimizing the parameters of the covariance function $k$. The above derivation is for a regression model; we can alter it to perform classification. Since our labels are not real valued, but class labels, we ‘squash’ this output using a link function such that the output varies only between the two classes; hence the optimization can be simplified using the previously stated Laplace approximation. At this point, we want to refer the interested reader to [33].

In addition to the mean prediction, GPs also provide the variance. This allows us to obtain the uncertainty for GPC, and will be used later in this work.
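A minimal sketch of obtaining both the squashed prediction and the latent mean and variance from a GP classifier, assuming GPy's interface (the paper uses the Laplace approximation; configuring the exact inference scheme is omitted here, and the internal accessor for the latent moments may differ between GPy versions):

```python
import numpy as np
import GPy

# Placeholder data: X is (n, d), y is a column vector of 0/1 labels.
X = np.random.rand(100, 20)
y = np.random.randint(0, 2, (100, 1))

gpc = GPy.models.GPClassification(X, y, kernel=GPy.kern.RBF(X.shape[1]))
gpc.optimize(messages=False)

X_test = np.random.rand(10, 20)
prob, _ = gpc.predict(X_test)             # class probabilities (squashed output)
f_mean, f_var = gpc._raw_predict(X_test)  # latent mean and variance per test point
```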

III Methodology

To investigate the effect of adversarial examples on uncertainty, we extend both JSMA and FGSM to a broader setting. This includes a direct computation of such examples on GPC. Further, we adapt common DNN to approximate latent space representations.

III-A Attacks on GPC

To produce an adversarial example for GPC, we compute the gradient of the output with respect to the input dimensions. We consider the chain of gradients for the output $\bar{\pi}_*$ and input $x_*$, where $f$ and $k$ are the associated latent function and covariance function, respectively. To start, we rewrite the expected value of $f$ in Equation 5 given a single test point $x_*$:

$$\bar{f}_* = k(x_*, X)\left[K(X, X) + \sigma_n^2 I\right]^{-1}\mathbf{y}. \qquad (6)$$

From here, we move on to the first part of the gradient,

$$\frac{\partial \bar{f}_*}{\partial k(x_*, X)} = \left[K(X, X) + \sigma_n^2 I\right]^{-1}\mathbf{y}, \qquad (7)$$

as the remaining terms are constant with respect to the test input $x_*$. The gradient of the covariance with respect to the inputs depends on the kernel; in our case, for the RBF kernel $k(x_i, x_*)$ between training point $x_i$ and test point $x_*$, the gradient can be expressed as

$$\frac{\partial k(x_i, x_*)}{\partial x_{*,j}} = \frac{x_{i,j} - x_{*,j}}{\ell^2}\, \exp\!\left(-\frac{\lVert x_i - x_* \rVert^2}{2\ell^2}\right), \qquad (8)$$

where $x_{i,j}$ and $x_{*,j}$ each denote feature or dimension $j$ of the corresponding vector or data point and $\ell$ denotes the length-scale parameter of the kernel. The gradient of the output with respect to the inputs is approximately proportional to the product of Equation 7 and Equation 8. A more nuanced reasoning and the restrictions of this approach can be found in the Appendix.

Based on the computation of these gradients, we can perturb the initial sample. In GPFGS (Algorithm 1), we introduce a global change using the sign of the gradient and a specified ε. Alternatively, we compute local changes (see Algorithm 2). In this algorithm we iteratively compute the (still unperturbed) feature with the strongest gradient and perturb it. We finish altering the example when it is either misclassified or we have changed more than a previously specified number of features, the latter corresponding to a fail.

Finally, the computation of the inverse matrix in Equation 7 might be impossible due to sparseness of the features. In cases of such sparse data, we approximate the inverse by using a Pseudo-inverse.

1: Input: sample x, latent function f̄, parameter ε
2: x ← x + ε · sign(∇_x f̄(x))
3: return x
Algorithm 1 GPFGS
1: Input: sample x, latent function f̄, classifier F, threshold t, changed = ∅
2: repeat
3:     if len(changed) > t then return fail
4:     end if
5:     grads ← ∇_x f̄(x)
6:     grads[changed] ← 0
7:     index ← argmax(abs(grads))
8:     x[index] ← x[index] + sign(grads[index])
9:     changed.append(index)
10: until F(x) is misclassified
11: return x
Algorithm 2 GPJM
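The following is a minimal NumPy sketch of both attacks under the equations above; `alpha` is the precomputed vector $[K(X,X)+\sigma_n^2 I]^{-1}\mathbf{y}$ (a pseudo-inverse can be substituted for sparse data), and `predict`, `step`, and the placeholders are assumptions rather than the original implementation:

```python
import numpy as np

def rbf(A, B, ell):
    """RBF kernel matrix between the rows of A and the rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * ell ** 2))

def latent_gradient(x_star, X_train, alpha, ell):
    """Gradient of the latent mean at x_star: product of Eq. 7 and Eq. 8."""
    k_star = rbf(X_train, x_star[None, :], ell)[:, 0]     # k(x_i, x*)
    dk = (X_train - x_star) / ell ** 2 * k_star[:, None]  # Eq. 8, per feature
    return alpha @ dk                                     # sum over training points

def gpfgs(x, X_train, alpha, ell, eps):
    """Algorithm 1: one global step along the sign of the latent gradient."""
    return x + eps * np.sign(latent_gradient(x, X_train, alpha, ell))

def gpjm(x, X_train, alpha, ell, predict, y_true, max_changes, step=1.0):
    """Algorithm 2: greedily perturb the yet-unchanged feature with the largest
    absolute gradient until the prediction flips or the budget is exhausted
    (returning None corresponds to `fail`)."""
    x = x.copy()
    changed = []
    while predict(x) == y_true:
        if len(changed) > max_changes:
            return None
        grads = latent_gradient(x, X_train, alpha, ell)
        grads[np.array(changed, dtype=int)] = 0.0  # never pick a feature twice
        idx = int(np.argmax(np.abs(grads)))
        x[idx] += step * np.sign(grads[idx])
        changed.append(idx)
    return x
```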

III-B Attacks on GPLVM

Fig. 1: The intuition of LSAN. The first network is trained on a latent space, the second to classify input from this latent space. After training, the two networks are combined and yield one DNN classifier.

We propose a complementary approach to the attacks on GPC: we attack GPLVM+SVM using the already established methodology of DNN surrogates combined with JSMA and FGSM, extending it to DNN surrogates that mimic the GPLVM model. To train the surrogate model, we train a DNN to fit the latent space representation in one of its hidden layers. We achieve this by taking a common DNN and splitting it into two parts, where a hidden layer becomes the output layer of the first part and the input layer of the second part (see the lower half of Figure 1).

The first part is trained using the normal training data as input. We train it minimizing the loss between the output of the network and the latent space we want to approximate (for example the output of GPLVM). The second part receives this latent space as input, and is trained minimizing the loss of the normal labels. When stacking these two networks (i.e., when feeding the output of the first part immediately into the second), we obtain a combined DNN that mimics both the latent space it was trained on and the classifier on this latent space.
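A condensed sketch of this two-part construction in tf.keras (layer sizes, optimizers, and the commented training calls are placeholders; the original models were built with TensorFlow but not necessarily with this exact API):

```python
import tensorflow as tf

d, q = 784, 2  # assumed: d input features, q latent dimensions

# Part 1: maps inputs to the latent space, trained with a squared loss
# against the latent representation to be mimicked (e.g. GPLVM's).
encoder = tf.keras.Sequential([
    tf.keras.layers.Dense(d // 2, activation='relu', input_shape=(d,)),
    tf.keras.layers.Dense(q),
])
encoder.compile(optimizer='adam', loss='mse')
# encoder.fit(X_train, Z_latent, ...)

# Part 2: classifies from the latent space, trained with cross-entropy.
classifier = tf.keras.Sequential([
    tf.keras.layers.Dense(2, activation='softmax', input_shape=(q,)),
])
classifier.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
# classifier.fit(Z_latent, y_train, ...)

# Stacking both parts yields a single differentiable surrogate (LSAN) on
# which gradient-based attacks such as FGSM or JSMA can be run.
lsan = tf.keras.Sequential([encoder, classifier])
```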

IV Experimental Setup

In this section, we briefly describe the datasets and models we use (the source code is available on request); more details are provided in the Appendix. To conclude this section, we give a brief outline of the experiments we conducted.

IV-A Data

Adversarial examples are most important in security and safety contexts. Previous work [12] indicates that settings such as malware detection do not necessarily respond to adversarial attacks in the same way as computer vision problems. Note that we focus on binary classification problems, as many security-relevant learning tasks heavily emphasize binary decisions, most notably between benign and malicious samples. We therefore select two learning tasks in which adversarial examples could be used to great effect without an elaborate setup on the attackers' side: malware detection [38] and spam detection [22], both of which feature the classical security dichotomy of benign and malicious samples.

The malware dataset (MAL) contains 439,563 samples (92.5% benign) represented by 1223 binary features. The spam dataset (SPAM) contains 4,601 samples, of which 60% are benign. Each sample consists of 54 real-valued and three binary features. In addition to these security-focused datasets, we also pick two binary subtasks from the MNIST dataset [19], namely 3 vs. 8 (MNIST38) and 1 vs. 9 (MNIST91). We select these settings in an effort to evaluate our results on a broad range of real-, mixed- and binary-valued features, as well as balanced and imbalanced datasets.

IV-B Models

We evaluate a range of Machine Learning models in this paper. We trained DNNs and SVMs with both linear and RBF kernels. Further, we trained GPC and GPLVM classifiers (using an SVM classifier on the latent space), both featuring an RBF kernel. Finally, we trained DNNs to mimic both the latent space of a linear SVM (dubbed linDNN) and of a GPLVM classifier (GPDNN). The following models were found not to reach our performance threshold and were excluded from the study: the RBF SVM on the SPAM data, and linDNN on MNIST38, MAL and SPAM. However, we used linDNN to craft adversarial examples on SPAM for experimental purposes. The classifier accuracies on test data are depicted in Table II. For ease of reference, we give a list of all abbreviations used for models and attacks in Table I.

Name Description
GPC Gaussian Process Classification
GPLVM Gaussian Process Latent Variable Model
SVM Support Vector Machine
DNN Deep Neural Network
linDNN DNN trained to mimic a linear kernel in a hidden layer
GPDNN DNN trained to mimic GPLVM in a hidden layer
JBM Jacobian-based attacks: JSMA, GPJM
FGSM FGSM, GPFGS, or linear SVM attack with perturbation ε
CW Carlini and Wagner's attacks with the respective norm
TABLE I: Summary of models and attacks.

IV-C Outline of Experiments

Uncertainty

Our main interest here is the effect adversarial examples have on uncertainty. We test all crafted examples from the previously named models on the GP models, without investigating whether they actually cause misclassification. We expect adversarial examples to have a different distribution (as shown empirically in [11]) than benign data and hence to lie further from the training data than benign test points. When using the stationary RBF kernel (as in the GPs here), the variance of a prediction is lower in areas where training data was observed. Thus we put forward the hypothesis that malicious data points induce a higher latent variance in GP than benign samples. For GPC, we also investigate the average of the absolute mean of the latent function. This analysis is not applicable to GPLVM, since the GPLVM latent mean is interpreted as a position in latent space (as explained in Section II-D). Hence, for GPLVM, we only measure and evaluate the latent variance.
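As a small illustration of the quantities compared here (a sketch assuming a GPy classification model as in Section II-E; variable names are placeholders):

```python
import numpy as np

def latent_stats(gpc, X):
    """Average absolute latent mean and average latent variance over X."""
    f_mean, f_var = gpc._raw_predict(X)  # latent moments per point (GPy internal)
    return np.mean(np.abs(f_mean)), np.mean(f_var)

# mean_benign, var_benign = latent_stats(gpc, X_test)
# mean_adv, var_adv = latent_stats(gpc, X_adv)
# The hypothesis above predicts var_adv > var_benign.
```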

We further make use of an uncertainty threshold to reject adversarial examples on GPC and present these results. Note that we only investigate this as a first step, and do not optimize the approach beyond a straightforward confidence interval. More research will be needed to address the difficulties posed by adaptive attackers in real-world scenarios. We expect this defense to detect some adversarial examples, and we are in particular interested in those cases that successfully thwart detection.

Transferability

Observing changes in uncertainty without additionally surveying the perturbation introduced by attacks, or without considering the number of crafted examples that fail to cause misclassification, might be misleading. We thus focus on the question of whether a stronger perturbation leads to stronger changes in uncertainty. Further, we investigate to which degree GP-based methods are susceptible to adversarial examples. A low change in uncertainty might be a consequence of the adversarial examples being correctly classified despite the perturbation introduced by the attack.

MNIST38 MNIST91 MAL SPAM
DNN
linDNN X
GPDNN
GPLVM
GPC
lin SVM
RBF SVM
TABLE II: Accuracy of classifiers, X denotes non-convergence.

V Evaluation

The principal question we are interested in is how adversarial examples affect the uncertainty measure in GP methods. We investigate these changes for all computed examples, and ignore for now whether they actually cause misclassification. The question whether adversarial examples are actually effective (i.e. are misclassified) will be addressed afterwards.

Uncertainty in GPC

Fig. 2: Effect of (adversarial) examples on GPC uncertainty estimates. The horizontal black line is the uncertainty estimate for benign data. Colors indicate the crafting algorithm.

Figure 2 shows the effect of attempted adversarial examples, compared to benign data, on GPC uncertainty estimates. The different types of attacks have different degrees of impact on the absolute latent mean. Sometimes a larger degree of perturbation, indicated by a larger ε in attacks such as FGSM, GPFGS or the linear SVM attack, induces a higher change in the mean. The uncertainty also changes for many Jacobian-based methods. We observe further that GPDNN on MAL and DNN on SPAM lead to almost no changes. We include additional results in the Appendix investigating changes in the average variance of the latent function.

Fig. 3: Mitigation rejecting (adversarial) examples outside a confidence interval of the latent mean or variance. The dotted line is the percentage of incorrectly rejected benign data. Colors indicate which algorithm was used for crafting.

Basic Mitigation for GPC

As a next step from these results, we investigate a straightforward mitigation: we consider the distribution of estimated variances for all the benign test data provided to the GPC. We compute a confidence interval over this distribution and then reject test points that fall outside this interval, as we hypothesise that the variance of adversarial examples will differ from benign data. We also apply the same procedure to the latent mean, for similar reasons. We present our results in Figure 3. We observe this simple step to be quite successful on the Spam data (with the exception of some attacks). On MNIST91 we observe mixed results. On MNIST38 and the Malware data the approach does not work well.
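A minimal sketch of this rejection rule (the interval width is an assumption, since the paper treats it as a tunable parameter; the statistic can be either the latent mean or the latent variance per test point):

```python
import numpy as np

def make_rejector(benign_scores, coverage=0.95):
    """Build a reject rule from the empirical distribution of a benign
    statistic (latent mean or variance); points outside the central
    `coverage` interval are rejected."""
    lo = np.percentile(benign_scores, (1 - coverage) / 2 * 100)
    hi = np.percentile(benign_scores, (1 + coverage) / 2 * 100)
    return lambda scores: (scores < lo) | (scores > hi)  # True = reject

# reject = make_rejector(var_benign_per_point)
# fraction_rejected = reject(var_adv_per_point).mean()
```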

Uncertainty in GPLVM

Fig. 4: Effect of (adversarial) examples on GPLVM uncertainty estimates. Solid horizontal line is the value for benign data. Colors indicate which algorithm was used for crafting.

Similar to the previous experiment, we measure the variances of GPLVM for all kinds of attempted adversarial examples in Figure 4. For both MNIST tasks, we observe only very small changes in the variance, if there are changes at all. For the Malware and Spam data, we do observe some changes: on the Malware data, the mean variance shifts noticeably; on the Spam data, the mean variance is an order of magnitude smaller.

V-A White-Box Setting

In the previous section, we observed that Carlini and Wagner's attacks, as well as Jacobian-based methods (on MNIST38 and MAL), only lead to a small response in the uncertainty estimates. We report the introduced perturbations in Table III and find that, indeed, these settings yield low perturbations and change only around one feature. However, the adversarial examples crafted with JSMA on linDNN for MNIST91 have the highest perturbation in terms of the number of features changed on average. Strangely though, the change in uncertainty estimates is less than for the other Jacobian-based attacks, which needed fewer perturbations. We thus conclude that the relationship between the size of the perturbation and its effect on uncertainty is non-trivial.

M38 M91 MAL Spam
JBM JBM JBM JBM
GPC - - -
GPDNN
linDNN X X X X
DNN
TABLE III: Average number of features changed by JBM (JSMA, GPJM) and Carlini and Wagner attacks for adversarial (misclassified) examples on the model used for crafting. X denotes models excluded from evaluation.

V-B Transferability

We observed that the uncertainty estimates did not change noticeably for MNIST38, MAL, Carlini and Wagner's attacks, and small values of ε. A natural reason for the uncertainty to remain low is that classification is still correct, i.e. the examples are actually not adversarial. We report the percentage of correctly classified examples for GPC in Figure 5(a) and for GPLVM with an SVM on top in Figure 5(b). To enable a comparison, we further plot the same percentages for a normal DNN in Figure 6(a) and for the individual SVM used on top of GPLVM, without the latent space, in Figure 6(b). Full results can be found in the Appendix.

The first observation is that for GPC, GPLVM+SVM and DNN, the accuracy on all (adversarial) examples on the MAL dataset is still very high. For MNIST38, however, where we did not observe changes in uncertainty estimates, many examples are misclassified, i.e. adversarial. Therefore, there exist adversarial examples which remain undetected. An interesting finding is that, for all the GP-based classifiers used in this study, the most effective attack was Carlini and Wagner's (with the L2 norm). In particular, the most effective attacks were produced when the examples were crafted on GPDNN. For low values of ε, we observe that many examples are not adversarial. Finally, we found that classification using GPLVM + SVM is more robust than SVM classification on its own.

(a) Evaluated on GPC
(b) Evaluated on GPLVM
Fig. 5: Percentage of correctly classified (not adversarial) examples crafted on specified algorithm and dataset. Dotted line indicates accuracy on benign samples.
(a) Evaluated on DNN
(b) Evaluated on SVM
Fig. 6: Percentage of correctly classified (not adversarial) examples crafted on specified algorithm and dataset. Dotted line indicates accuracy on benign samples.

V-C Conclusion of experiments

We observed many adversarial examples to have an influence on uncertainty in GP-based methods. Detecting changes in the estimated uncertainty, combined with low transferability to GP-based methods, yields mostly robust models in three of the four cases studied. Future work will investigate more parameters, and whether alternative covariance functions or length-scales can be used to increase robustness. One observation in particular needs to be investigated: we observe all Carlini and Wagner attacks based on the L2 norm to remain effective and hard to detect even in the presence of uncertainty.

Since the RBF kernel of the Gaussian Process is based on the L2 norm, future work needs to determine whether selecting a kernel with a different norm will also alter the classifier's vulnerability to this attack. A similar connection between the classifier's metric in regularization and its vulnerability to an attack with a dual metric has already been established for linear models [35]. We therefore consider this a promising direction for future research.

VI Related Work

To the best of our knowledge, only [5] and [27] investigate uncertainty in the presence of adversarial examples. The latter approach adds a 1-class SVM as a last layer of a DNN to build a defense based on uncertainty. They show that this defense can be circumvented, however. The first paper is more closely related to our work: the authors investigate so-called Gaussian Hybrid networks, a DNN where the last layer is replaced by a Gaussian Process. They evaluate the robustness of their approach only on FGSM and the attack by Carlini and Wagner. In contrast, our work targets GPLVM and GPC directly and investigates the sensitivity of Bayesian uncertainty estimates regarding the perturbation caused by adversarial examples in general.

Another field of research is the general relationship between Deep Learning and Gaussian Processes [29]. To gain more understanding, recent approaches represent DNNs with infinitely wide layers as kernels for Gaussian Processes [8, 20]. Lee et al. further show a relation between uncertainty in Gaussian Processes and predictive error in DNN, a result that links our work with other approaches targeting DNN.

At the same time, other Machine Learning models also admit Bayesian Inference to model predictive uncertainty. [21] show that uncertainty estimates in Bayesian Neural Networks, i.e. Neural Networks with a prior probability placed over their weights, can be used to tell apart adversarial and benign images.

Transferability in the context of adversarial examples was first brought up by [30]. [34] study transferability for different deep neural network architectures, whereas [23] specifically investigate targeted transferability. Finally, [40] explore transferability in general by examining the decision boundaries of different classifiers. In contrast to these works, we specifically investigate transferability in the context of Gaussian Process models, namely GPC and GPLVM. Further, we focus on the effects of adversarial examples on uncertainty measures that are inherent to these models.

VII Conclusion

We have investigated adversarial examples and their impact on uncertainty estimates in a Bayesian framework using Gaussian Processes. Our study was based on two types of attacks: first, state-of-the-art attacks that were computed on the same dataset but using non-Gaussian Process surrogate models, relying on the transferability property of adversarial examples; second, a set of white-box attacks we formally derived to specifically target Gaussian Process based classifiers.

In general, we found that the perturbation introduced as part of the crafting process is reflected in Gaussian Process uncertainty estimates. Interestingly, we also found that some models remain vulnerable when targeted by attacks using the dual of the target’s kernel norm as an optimization metric. This observation is in line with similar observations already made for regularization in linear methods.

Acknowledgment

This work was supported by the German Federal Ministry of Education and Research (BMBF) through funding for the Center for IT-Security, Privacy and Accountability (CISPA) (FKZ: 16KIS0753). This work has further been supported by the Engineering and Physical Sciences Research Council (EPSRC) Research Project EP/N014162/1.

References

Appendix A Details on Attack Derivation for GPC

In this part of the Appendix, we present the detailed derivation to compute adversarial examples on GPC, including the reasoning why it is sufficient to use the latent mean.

We compute the gradient of the output with respect to the input dimensions. We consider the chain of gradients for the output $\bar{\pi}_*$ and input $x_*$:

$$\frac{\partial \bar{\pi}_*}{\partial x_*} = \frac{\partial \bar{\pi}_*}{\partial \bar{f}_*} \cdot \frac{\partial \bar{f}_*}{\partial k(x_*, X)} \cdot \frac{\partial k(x_*, X)}{\partial x_*}, \qquad (9)$$

where $f$ and $k$ are the associated latent function and covariance function, respectively.

Note that for this attack, we are only interested in the relative order of the gradients, not their actual values. Unfortunately, $\bar{\pi}_*$ does not vary monotonically with $\bar{f}_*$, as the variance also affects the prediction. However, we are in a setting of binary classification, so we are only interested in moving the prediction, $\bar{\pi}_*$, across the decision boundary. No change in variance can cause this; instead, a change in the mean of $f_*$ is required (effectively, the mean is monotonic with respect to $\bar{\pi}_*$ in the region of 0.5). The fastest we can get from one probability threshold to its opposite is when there is no variance (any variance will move the mean towards 0.5). So finding the gradient of $\bar{f}_*$ is sufficient.
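To make this argument concrete, consider as an illustration the probit link (the specific link function used in the paper is not restated here). Under a Gaussian approximation $\mathcal{N}(\bar{f}_*, \sigma_*^2)$ to the latent posterior, the averaged prediction has the closed form

$$\bar{\pi}_* = \Phi\!\left(\frac{\bar{f}_*}{\sqrt{1 + \sigma_*^2}}\right),$$

which is monotonically increasing in $\bar{f}_*$ for any fixed variance, equals 0.5 exactly when $\bar{f}_* = 0$, and is pulled towards 0.5 as $\sigma_*^2$ grows; in this sense only the latent mean can move the prediction across the decision boundary.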

However, we found that we can still use the gradient of $\bar{f}_*$ (instead of a numerical approximation to $\partial \bar{\pi}_* / \partial \bar{f}_*$), since near the boundary

$$\frac{\partial \bar{\pi}_*}{\partial x_*} \;\propto\; \frac{\partial \bar{f}_*}{\partial x_*}. \qquad (10)$$

Let us first rewrite the expected value of $f$ given a single test point $x_*$:

$$\bar{f}_* = k(x_*, X)\left[K(X, X) + \sigma_n^2 I\right]^{-1}\mathbf{y}. \qquad (11)$$

From here, we move on to the first part of the gradient,

$$\frac{\partial \bar{f}_*}{\partial k(x_*, X)} = \left[K(X, X) + \sigma_n^2 I\right]^{-1}\mathbf{y}; \qquad (12)$$

note that the remaining terms are constant with respect to the test input $x_*$. The gradient of the covariance with respect to the inputs depends on the particular kernel that is applied. In our case, for the RBF kernel $k(x_i, x_*)$ between training point $x_i$ and test point $x_*$, the gradient can be expressed as

$$\frac{\partial k(x_i, x_*)}{\partial x_{*,j}} = \frac{x_{i,j} - x_{*,j}}{\ell^2}\, \exp\!\left(-\frac{\lVert x_i - x_* \rVert^2}{2\ell^2}\right), \qquad (13)$$

where $x_{i,j}$ and $x_{*,j}$ each denote feature or dimension $j$ of the corresponding vector or data point and $\ell$ denotes the length-scale parameter of the kernel. Using Equation 10, the gradient of the output with respect to the inputs is approximately proportional to the product of Equation 12 and Equation 13, in the region of 0.5.

Appendix B Details of Experimental Setup

In this Appendix we provide more detailed information about the used datasets and the parameters of the models.

B-A Datasets

In the following, we describe the datasets in detail that were used for the evaluation.

MAL

Our Malware dataset consists of the PDF Malware data of the Hidost Toolset project [38]. The dataset is composed of 439,563 PDF samples, of which 92.5% are labeled as benign and the rest as malicious. Data points consist of 1223 binary features, and individual feature vectors are likely to be sparse. We split the data into training and test sets; this still leaves us with a large number of test data points from which to craft adversarial examples, many of which are very time-consuming to compute.

SPAM

The second security-relevant dataset is an email Spam dataset [22]. It contains 4,601 samples. Each sample captures 57 features, of which 54 are continuous and represent word or character frequencies. The three remaining integer features contain capital run length information. This dataset is slightly imbalanced: roughly 40% of the samples are labeled as Spam, the remainder as benign emails. We split this dataset randomly into training and test data.

MNIST

Finally, we use the MNIST benchmark dataset [19] to select two additional, binary-task sub-datasets. It consists of roughly 70,000 black-and-white images of handwritten single digits, each of 28 x 28 pixels. There are 60,000 training and 10,000 test samples, with roughly the same number for each of the ten classes. We select two binary tasks: 1 versus 9 and 3 versus 8 (denoted as MNIST91 and MNIST38, respectively). We do this in an effort to study two different tasks on the same underlying data representation, i.e. the same number and range of features, yet with different distributions to learn.
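One way to construct such binary subtasks, sketched with the Keras MNIST loader (an assumption; the original preprocessing pipeline is not restated here):

```python
import numpy as np
from tensorflow.keras.datasets import mnist

def binary_subtask(digit_a, digit_b):
    """Restrict MNIST to two digits, e.g. (3, 8) or (1, 9)."""
    (Xtr, ytr), (Xte, yte) = mnist.load_data()
    def pick(X, y):
        mask = (y == digit_a) | (y == digit_b)
        X = X[mask].reshape(mask.sum(), -1) / 255.0  # flatten to 784 features
        y = (y[mask] == digit_b).astype(int)         # binary 0/1 labels
        return X, y
    return pick(Xtr, ytr), pick(Xte, yte)

# (X38_tr, y38_tr), (X38_te, y38_te) = binary_subtask(3, 8)
```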

B-B Models

We investigate transferability across multiple ML models derived by different algorithms. In some cases, dataset-specific requirements have to be met for classification to succeed.

GPLVM

We train GPLVM generally using a small number of latent dimensions, with the exception of the Spam dataset, where more dimensions are needed for good performance. We further use an SVM on top of GPLVM to produce the classification results: a linear SVM for the MNIST91 task and an RBF-kernel SVM for all other tasks.

DNN on latent space

We distinguish between DNNs approximating GPLVM (GPDNN), a linear SVM (linDNN) and an RBF SVM (rbfDNN). All of them contain two hidden layers with half as many neurons as the datasets' respective features. The layer trained on the latent space has as many neurons as the approximated latent representation has dimensions, both for the SVM networks and for GPDNN; for the Spam dataset, we model a larger number of latent variables. We train the latent space part of the network with a squared loss; the classifying part is trained like the other networks using a cross-entropy loss. From this latent space, we train a single layer for classification, with the exception of GPDNN in the cases where an RBF SVM is trained on top: here we add an additional hidden layer.

DNN

Our simple DNN has two hidden layers, each containing half as many neurons as the dataset has features, and ReLU activation functions.

SVM

We study a linear SVM and an SVM with an RBF kernel. They are optimized using a squared hinge loss. We further fix the penalty term C; for the RBF kernel, the kernel coefficient is set to 1 divided by the number of features.
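An approximate scikit-learn configuration matching this description (a sketch only: the text cites Scipy for the SVMs, the penalty value is not restated here, and scikit-learn's RBF SVC uses the standard hinge loss rather than the squared hinge):

```python
from sklearn.svm import LinearSVC, SVC

n_features = 57  # e.g. the Spam dataset

lin_svm = LinearSVC(loss='squared_hinge')            # linear SVM, squared hinge loss
rbf_svm = SVC(kernel='rbf', gamma=1.0 / n_features)  # kernel coefficient = 1 / #features
```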

B-C Implementation and third party libraries

We implement our experiments in Python using the following specialized libraries: Tensorflow [1] for DNNs, Scipy [16] for SVMs, and GPy [18] for GPLVM and GPC. We rely on the implementation of the JSMA and FGSM attacks from the library Cleverhans, version 1.0.0 [10]. We use the code provided by Carlini and Wagner for their attacks (retrieved from https://github.com/carlini/nn_robust_attacks, July 2017). We implement the linear SVM attack (introduced in [30]) and the GP attacks (based on GPy) ourselves.

Appendix C Full results of experiments

In the main paper, we present selected results to back up our reasoning. We present the full results in this Appendix, so that individual results can be confirmed. Further, this enables looking up results that are not presented in the main paper.

The full results for uncertainty for GPC are in Table IV (latent mean), Table V (latent variance), and Table VI (mitigation). We present the full results on GPLVM uncertainty in Table VII.

Concerning the white-box experiments, Table VIII shows the perturbations for methods based on the Jacobian and the Carlini and Wagner attacks. Further, Table IX shows the accuracy for FGSM, the linear SVM attack, and GPFGS for different models.

Finally, we present the full results of our transferability experiments, ordered by dataset: Table X for MNIST38, Table XI for MNIST91, Table XII for MAL, and Table XIII for SPAM.

FGSM / linSVM / GPFGS CW
ORIGIN JBM
MNIST38
GPC
GPDNN
lin SVM
DNN
MALW
GPC
GPDNN
lin SVM
DNN
MNIST91
GPC
GPDNN
lin SVM
linDNN
DNN
SPAM
GPC
GPDNN
lin SVM
linDNN
DNN
TABLE IV: Average absolute latent mean in GPC for benign data (bold) and adversarial examples crafted by the given algorithm on model ORIGIN.
FGSM / linSVM / GPFGS CW
origin JBM
MNIST38
GPC
GPDNN
lin SVM
DNN
MALW
GPC
GPDNN
lin SVM
DNN
MNIST91
GPC
GPDNN
lin SVM
linDNN
DNN
SPAM
GPC
GPDNN
lin SVM
linDNN
DNN
TABLE V: Standard deviation of the absolute latent function for benign data (bold) and adversarial examples in GPC. Adversarial examples crafted by the given algorithm on model ORIGIN.
FGSM / linSVM / GPFGS CW
ORIGIN JBM
MNIST38
GPC
GPDNN
lin SVM
DNN
MALW
GPC
GPDNN
lin SVM
DNN
MNIST91
GPC
GPDNN
lin SVM
linDNN
DNN
SPAM
GPC
GPDNN
lin SVM
linDNN
DNN
TABLE VI: Rejected data outside the confidence interval, in percent: benign data (bold) and (adversarial) examples crafted on model ORIGIN by the given algorithm.
FGSM / linSVM / GPFGS CW
ORIGIN JBM
SPAM
GPC
GPDNN
lin SVM
linDNN
DNN
MNIST91
GPC
GPDNN
lin SVM
linDNN
DNN
MALW
GPC
GPDNN
lin SVM
DNN
MNIST38
GPC
GPDNN
lin SVM
DNN
TABLE VII: Average variance of GPLVM predictions for benign data (bold) and adversarial examples. Adversarial examples crafted by algorithm and on model ORIGIN
MNIST38 MNIST91 MAL SPAM
GPC
GPDNN
linDNN X X
DNN
TABLE VIII: Percentage of samples for which we cannot craft an adversarial example using JBM (JSMA, GPJM) and Carlini and Wagner attacks. X denotes models excluded from evaluation.
Dataset
M.38 GPC
GPDNN
lin SVM
DNN
M.91 GPC
GPDNN
lin SVM
linDNN
DNN
MAL GPC
GPDNN
lin SVM
DNN
SPAM GPC
GPDNN
lin SVM
linDNN
DNN
TABLE IX: Percentage of correctly classified (non-adversarial) examples crafted with FGSM on the linear SVM/DNN, or with GPFGS, when tested on the same model used for crafting.
MNIST38 FGSM / linSVM / GPFGS CW
origin target JBM
GPC
GPDNN
GPLVM
GPC lin SVM
RBF SVM
DNN
GPC
GPDNN
GPLVM
GPDNN lin SVM
RBF SVM
DNN
GPC
GPDNN
GPLVM
linSVM lin SVM
RBF SVM
DNN
GPC
GPDNN
GPLVM
DNN lin SVM
RBF SVM
DNN
TABLE X: Transferability on MNIST38. The percentage indicates the examples that are not adversarial, i.e. correctly classified by the target model, when crafted on the origin model using the corresponding attack. JBM denotes Jacobian-based methods, such as JSMA on DNN or GPJM on GPC.
MNIST91 FGSM / linSVM / GPFGS CW
origin target JBM
GPC
GPDNN
GPLVM
GPC lin SVM
RBF SVM
linDNN
DNN
GPC
GPDNN
GPLVM
GPDNN lin SVM
RBF SVM
linDNN
DNN
GPC
GPDNN
GPLVM
lin SVM lin SVM
RBF SVM
linDNN
DNN
GPC
GPDNN
GPLVM
linDNN lin SVM
RBF SVM
linDNN
DNN
GPC
GPDNN
GPLVM