Generalizable Adversarial Examples Detection Based on Bi-model Decision Mismatch

02/21/2018 ∙ by João Monteiro, et al. ∙ Université INRS

Deep neural networks (DNNs) have shown phenomenal success in a wide range of applications. However, recent studies have discovered that they are vulnerable to adversarial examples, i.e., original samples with subtle added perturbations. Such perturbations are often too small to be perceived by humans, yet they can easily fool neural networks. A few defense techniques against adversarial examples have been proposed, but they require modifying the target model or prior knowledge of the adversarial example generation methods. Moreover, their performance drops markedly upon encountering adversarial example types not used during the training stage. In this paper, we propose a new framework that can be used to enhance DNNs' robustness by detecting adversarial examples. In particular, we employ the decision layer of independently trained models as features for posterior detection. The proposed framework does not require any prior knowledge of adversarial example generation techniques, and can be directly combined with unmodified off-the-shelf models. Experiments on the standard MNIST and CIFAR10 datasets show that it generalizes well across not only different adversarial example generation methods but also various additive perturbations. Specifically, distinct binary classifiers trained on top of our proposed features can achieve a high detection rate (>90%) and maintain this performance when tested against unseen attacks.




I Introduction

With the advent of large scale digital datasets and novel computing architectures, deep neural networks (DNNs) have recently attained impressive performances on diverse and challenging problems such as object detection [1], speech recognition [2], face recognition in-the-wild [3], malware detection [4], self-driving vehicles [5], mutations analysis in DNA [6], and prediction of structure-activity of potential drug molecules [7], just to name a few.

In spite of the major breakthroughs of DNNs in solving complex tasks, researchers have lately discovered that they are highly vulnerable to deliberate perturbations which, when added to the input sample, can mislead the system into yielding incorrect output with very high confidence [8, 9]. The resulting samples are called adversarial examples (AEs), i.e., carefully crafted versions of clean samples that are intentionally perturbed (e.g., by adding noise) to confuse or fool the DNNs. AEs often require only minimal perturbations, so that humans can neither perceive the perturbations nor distinguish the resulting samples from the original ones, yet neural networks make errors with high probability [10]. For instance, an adversarial example as depicted in Fig. 1 can be generated by adding indiscernible perturbations to a given image. The resulting adversarial image is misclassified by the well-known convolutional classifier VGG16 [11], while a human being can still classify it correctly without spotting the deliberately added perturbations.

The vulnerability to AEs is a severe problem, as deep learning methods are now used in daily applications and are planned for wide deployment in the real world. Moreover, AEs have an unexpected transferability property [12], i.e., AEs generated for one model can be used to compromise others designed for the same task, even when the models' architectures or training techniques differ significantly. The issue is particularly serious in safety- and security-critical systems, where it can have devastating consequences. For example, an attacker can create physical AEs to confuse autonomous vehicles by making a “stop” sign be read as a “go fast” sign or “pedestrians on crosswalk” as “empty road”. Likewise, a suspect can construct AEs to fool a face identification system, either to go unidentified or to gain unauthorized access.

Fig. 1: An adversarial sample generated with the Fast Gradient Sign Method (FGSM) [8]. VGG16 [11] recognizes the clean original image correctly with high confidence. When a small perturbation is added to the image, the model predicts an incorrect label with similarly high confidence.

Since the introduction of AEs in the seminal works by Szegedy et al. [13] and Goodfellow et al. [8], several studies in different domains have investigated the vulnerability of deep learning models to AEs, for example in image classification [14], face recognition [15], and malware detection [16]. Similarly, many works have focused on AE generation methods, i.e., devising algorithms to produce AEs. Some generation techniques are based on the gradient of the network, such as FGSM (Fast Gradient Sign Method) [8], FGV (Fast Gradient Value) [17], and JSMA (Jacobian Saliency Map Attack) [18], while others rely on solving optimization problems with different procedures, such as limited-memory BFGS [13], DeepFool [10], and the C&W attack by Carlini & Wagner [19].

Consequently, to safeguard neural networks against AEs, several countermeasure techniques have been proposed, which roughly fall into two categories: adversarial defense and adversarial detection. Methods of the first kind aim at improving the DNNs' robustness so that AEs are classified correctly. Representative examples are adversarial training (i.e., training the network with both clean samples and AEs) [20], defensive distillation (i.e., training a new robust network by using the distillation method) [21], and bounded ReLU activations (i.e., enhancing the network's stability against AEs by modifying the standard unbounded ReLU) [22]. Methods of the second kind attempt to detect an adversarial example before the network's decision is taken into account. Namely, adversarial detection techniques devise a separate AE detector to distinguish clean samples from AEs, under the assumption that original data is intrinsically different from AEs [9, 23].

Despite the current progress on increasing the robustness of DNNs against AEs, both types of countermeasures (i.e., adversarial defense and adversarial detection) still do not scale well and fail to adapt to changes in the AE generation approach. In particular, adversarial detection methods have low generalization capability under varying or previously unseen attacks, i.e., their performance degrades when they encounter attack types not used during training [24]. In addition, comparatively little research has been conducted on adversarial detection methods, even though their development requires less effort and fewer computational resources than adversarial defense systems. Formulating a generalizable adversarial detection strategy may help in understanding intrinsic vulnerabilities and in gaining insight into the inner levels of DNNs, which in turn would help improve their robustness and performance [24].

Towards this aim, we propose a novel method capable of effectively detecting AEs without any prior knowledge about potential AEs. In particular, we devise an adversarial detection method based on the divergence/matching of the predictions of two independently trained models that differ substantially in architecture. Moreover, we propose a training scheme for pairs of models that improves the detection accuracy for some of the tested attacks. We refer to this training approach as Online Transfer Learning (OLTL).

The main difference between our method and previous AEs detection and defense approaches is that prior works usually employ both clean and perturbed datasets for training the methods in the main classification task. It is also important to highlight that, despite the fact we employ two models, no ensemble strategy is employed. Experimental results on CIFAR10 and MNIST data sets show that the proposed approach can attain accuracies higher than 90.0% even when AEs are crafted using attack strategies never seen by the detector. Moreover, our method can detect AEs generated with Gaussian Blur, Salt and Pepper, and Weighted Gaussian blur noises.

The remainder of this paper is organized as follows. Section II presents prior works on adversarial defense and detection. Section III describes different methods to generate AEs used in this study. The proposed approach is detailed in Section IV. Experimental protocol, dataset, models, figures of merit, and experimental results and discussions are given in Section V. Lastly, conclusions are drawn in Section VI.

II Related Work

Adversarial attacks (samples with adversarially-crafted small perturbations) have recently emerged as a significant threat to deep learning techniques, which may hinder the large-scale adoption of DNN-based systems in practical applications. There exist a number of methods to generate adversarial samples or adversarial examples (AEs), mainly based on the gradient of networks [8] or on solving optimization problems [13]. Very few works have attempted to give scientific reasoning for the vulnerability of DNNs to AEs. For instance, the authors of [13] gave the preliminary explanation that, since the input space is densely populated by low-probability adversarial pockets, each point in the space is close to many AEs. These adversarial points can be easily exploited to attain the desired model outcome. In turn, Goodfellow et al. [8] argued that the linear nature of DNN-based classifiers is the main source of vulnerability, while the work in [25] discussed the “boundary tilting” view and argued that AEs usually lie in regions where the decision boundary is close to the manifold of training data.

The proposed defenses for mitigating AEs can be grouped into two categories. Techniques in the first category try either to improve the robustness of DNNs or to suppress the success rate of attacks. One example is adversarial training [8], which trains the system with AEs to augment the regularization and loss functions and make the system more resilient. Another technique is defensive distillation [21], in which additional DNNs with softmax are trained to prevent the deep learning system from fitting too tightly to the data. However, it has been demonstrated that defensive distillation can be easily circumvented with a minimally modified attack [19]. Other approaches are pre-processing or denoising [26], i.e., removing the adversarial noise from the input samples before feeding them to the neural network, and architecture alteration [27], i.e., modifying traditional neural network architectures, e.g., by adding extra robust layers and functions.

Methods in the second category treat the detection of AEs as a binary classification or anomaly detection problem. An AE detector decides whether the input sample is an adversarial attack or not. Xu et al. [28] proposed a Feature Squeezing based AE detection algorithm. The features are squeezed either by decreasing each pixel's color bit depth or by smoothing the sample using a spatial filter. The method employs a binary classifier that uses as features the predictions of a target model before and after squeezing the input sample. As the method relies on the effect of squeezing samples of a particular attack, it works well only when the attack technique is known and is less effective under unknown ones. The method proposed in [23] uses statistical tests to detect AEs. This technique requires a large number of benign and adversarial examples to estimate data distributions, which makes it less practical. Metzen et al. [29] proposed augmenting the target model with a small “detector” sub-network trained on both AEs and original clean samples. However, this method also requires a sufficiently large number of samples to work well. In [30], the authors perform unsupervised anomaly detection, which is an intrinsically harder task than binary classification and thus has limited performance. Feinman et al. [9] exploited the idea that AEs diverge from the benign data manifold and used the difference between the densities of sets of true and attack samples as an indicator of AEs. Nonetheless, this method is computationally expensive and gets confused when benign samples are very close to AEs. As pointed out in the introduction, most prior AE detection methods lack generalization capability, namely their accuracy deteriorates when faced with unknown attacks (not used in the training process). Therefore, there is ample room for devising novel AE detection techniques.

III Adversarial Examples Generation Methods

In this section, we briefly describe the standard adversarial attack generation methods utilized in this study. Adversarial attacks on deep learning based systems can be either black-box or white-box: in the former the adversary has no knowledge of the model architecture, parameters, and training data, whereas in the latter the adversary has such knowledge. In addition, attacks can be targeted or non-targeted, aiming to misguide the DNN to a specific class or to an arbitrary class other than the correct one, respectively. In this work, we have used white-box non-targeted AEs.

Let $f$ and $x$ be a trained neural network and an original sample, respectively. Generating an adversarial example $x'$ can be seen as a box-constrained optimization problem:

$$\min_{x'} \; \|x' - x\| \quad \text{s.t.} \quad f(x') = l', \;\; f(x) = l, \;\; l \neq l', \;\; x' \in [0, 1],$$

where $l$ and $l'$ are the output labels of $x$ and $x'$, respectively. Many variants of the above optimization problem have been presented in the literature to generate different kinds of AEs, which are discussed below:

III-A Fast Gradient Sign Method (FGSM)

The FGSM technique was introduced in [8] to generate AEs by computing the derivative of the model's loss function with respect to the input feature vector. The method performs a one-step update along the direction of the gradient sign at each pixel. The adversarial sample $x'$ is computed from the original sample $x$ with label $l$ as:

$$x' = x + \epsilon \cdot \text{sign}\big(\nabla_x J(x, l)\big),$$

where $J$ is the model's loss function, $\nabla_x J$ is the loss function's gradient with respect to the input space, and $\epsilon$ is a hyper-parameter that governs the maximum distance allowed between adversarial and original samples.
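As a minimal sketch of this one-step update, assuming a toy logistic-regression model (not the networks used in this study) so that the input gradient has a closed form, the FGSM step can be written in plain Python. The names `fgsm`, `loss_grad_wrt_input`, `w`, and `b`, as well as the numeric values, are illustrative assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_grad_wrt_input(x, label, w, b):
    """Gradient of the binary cross-entropy loss w.r.t. the input x
    for a toy logistic-regression model p = sigmoid(w.x + b)."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return [(p - label) * wi for wi in w]

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(x, label, w, b, eps):
    """One-step FGSM: move each feature by eps along the gradient sign,
    then clip back to the valid input range [0, 1]."""
    grad = loss_grad_wrt_input(x, label, w, b)
    return [min(1.0, max(0.0, xi + eps * sign(g))) for xi, g in zip(x, grad)]

# Toy usage: hypothetical weights and a sample whose true label is 1.
w, b = [2.0, -1.0, 0.5], 0.0
x = [0.6, 0.2, 0.5]
x_adv = fgsm(x, 1, w, b, eps=0.1)
```

Every feature moves by at most `eps`, and the model's confidence in the true label decreases after the step.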

III-B Iterative Gradient Sign Method (IGSM)

An iterative version of the FGSM technique called IGSM was introduced by Kurakin et al. [14]. The intuition behind IGSM is to take multiple small steps iteratively, adjusting the direction after each step so as to increase the loss of the model (i.e., each step is a one-step gradient ascent). Namely, instead of applying a certain amount of adversarial noise once with a given $\epsilon$, it is applied several times iteratively with a smaller step size $\alpha$, which can be expressed by the recursive formula:

$$x'_0 = x, \qquad x'_{n+1} = \text{clip}_{x,\epsilon}\Big(x'_n + \alpha \cdot \text{sign}\big(\nabla_x J(x'_n, l)\big)\Big),$$

where $x'_n$ and $\text{clip}_{x,\epsilon}$ denote the perturbed sample at the $n$th iteration and a clipping of the adversarial sample's values such that they are within an $\epsilon$-neighbourhood of the original sample $x$, respectively; $J$ is the model's loss function and $l$ is the label of $x$.
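The recursion can be sketched as follows, again assuming a toy logistic-regression model with a closed-form input gradient. The step size `alpha`, the number of steps, and all numeric values are hypothetical choices made only for illustration.

```python
import math

def input_gradient(x, label, w, b):
    """Closed-form input gradient of the cross-entropy loss for a toy
    logistic-regression model (an assumption made for illustration)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = 1.0 / (1.0 + math.exp(-z))
    return [(p - label) * wi for wi in w]

def clip_to_ball(x_adv, x, eps):
    """Keep x_adv inside the eps-neighbourhood of x and inside [0, 1]."""
    return [min(1.0, max(0.0, min(xi + eps, max(xi - eps, ai))))
            for ai, xi in zip(x_adv, x)]

def igsm(x, label, w, b, eps, alpha, steps):
    """Iterative gradient sign method: `steps` small signed-gradient moves
    of size alpha, re-clipped to the eps-ball after each move."""
    x_adv = list(x)
    for _ in range(steps):
        grad = input_gradient(x_adv, label, w, b)
        moved = [ai + alpha * ((g > 0) - (g < 0)) for ai, g in zip(x_adv, grad)]
        x_adv = clip_to_ball(moved, x, eps)
    return x_adv

x = [0.6, 0.2, 0.5]
x_adv = igsm(x, 1, w=[2.0, -1.0, 0.5], b=0.0, eps=0.1, alpha=0.03, steps=10)
```

Although ten steps of size 0.03 could move a feature by 0.3 in total, the clipping keeps the result within 0.1 of the original sample.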

III-C Jacobian Saliency Map Attack (JSMA)

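Papernot et al. [18] proposed a greedy iterative technique for targeted attacks. Using the derivatives of the outputs of a target model leads to an adversarial perturbation that forces the model to misclassify the manipulated sample into a specific target class. Specifically, the saliency (sensitivity) map over the input features is computed as:

$$S(x, t)[i] = \begin{cases} 0, & \text{if } \dfrac{\partial F_t(x)}{\partial x_i} < 0 \text{ or } \sum_{j \neq t} \dfrac{\partial F_j(x)}{\partial x_i} > 0, \\[2mm] \dfrac{\partial F_t(x)}{\partial x_i} \left| \sum_{j \neq t} \dfrac{\partial F_j(x)}{\partial x_i} \right|, & \text{otherwise,} \end{cases}$$

where $t$ indicates the desired target class, subscript $i$ is the $i$th element (pixel) in the respective sample, and $F_j$ is the probability of the $j$th class in the model's output layer.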
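A short sketch of a targeted saliency map in the style of Papernot et al., assuming the Jacobian of the class probabilities with respect to the input is already available. The function name `saliency_map` and the Jacobian values are hypothetical.

```python
def saliency_map(jacobian, target):
    """Targeted saliency map computed from a Jacobian.

    jacobian[j][i] holds dF_j/dx_i, the derivative of the class-j output
    with respect to input feature i. A feature scores zero unless raising
    it increases the target class and decreases the other classes overall.
    """
    n_classes, n_features = len(jacobian), len(jacobian[0])
    scores = []
    for i in range(n_features):
        d_target = jacobian[target][i]
        d_others = sum(jacobian[j][i] for j in range(n_classes) if j != target)
        if d_target < 0 or d_others > 0:
            scores.append(0.0)
        else:
            scores.append(d_target * abs(d_others))
    return scores

# Hypothetical Jacobian for a 3-class model with 4 input features.
J = [
    [0.5, -0.2, 0.1, 0.3],
    [-0.3, 0.1, 0.2, -0.1],
    [-0.2, 0.1, -0.3, -0.2],
]
scores = saliency_map(J, target=0)
best_feature = max(range(len(scores)), key=scores.__getitem__)
```

A greedy attack would perturb `best_feature` first, recompute the Jacobian, and repeat until the target class is predicted.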

III-D DeepFool

Moosavi-Dezfooli et al. [10] devised the DeepFool method to produce AEs by determining the nearest distance from the original input to the decision boundary. An iterative attack based on linear approximation is executed in order to cope with the non-linearity in high dimensions. DeepFool uses concepts from geometry to direct the search for a minimal perturbation that yields a successful attack. At each iteration, the method perturbs the image by a small vector computed to take the resulting image to the boundary of the polyhedron obtained by linearizing the boundaries of the region in which the image resides. The final perturbation that is able to fool the model is the accumulation of the perturbations added in each iteration. The minimal perturbation is computed as:

$$\delta^{*} = \arg\min_{\delta} \|\delta\|_2 \quad \text{s.t.} \quad f(x + \delta) \neq f(x),$$

where $\delta^{*}$ is the minimum perturbation added on the sample $x$ and $f(\cdot)$ denotes the label predicted by the network.
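For intuition, the minimal perturbation has a closed form when the classifier is affine and binary: it is the orthogonal projection of the sample onto the decision hyperplane. The sketch below covers only this special case, under that assumption; the `overshoot` factor and the numeric values are illustrative.

```python
def deepfool_linear(x, w, b, overshoot=0.02):
    """Minimal L2 perturbation that moves x across the decision boundary
    of the affine binary classifier sign(w.x + b).

    For an affine classifier the closest boundary point is the orthogonal
    projection of x onto the hyperplane; DeepFool repeats this step on a
    local linearization for non-linear models. The small overshoot pushes
    the sample just past the boundary so that the label actually flips.
    """
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    norm_sq = sum(wi * wi for wi in w)
    return [-(1.0 + overshoot) * score / norm_sq * wi for wi in w]

w, b = [3.0, 4.0], -1.0
x = [1.0, 1.0]
delta = deepfool_linear(x, w, b)
x_adv = [xi + di for xi, di in zip(x, delta)]
new_score = sum(wi * xi for wi, xi in zip(w, x_adv)) + b
```

Here the clean score is positive and `new_score` is slightly negative, so the predicted label flips with a perturbation only 2% longer than the distance to the boundary.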

III-E Additive Perturbation

In addition to the above-mentioned AE generation methods, we also produced adversarial attacks using additive black-box perturbations, i.e., the additive Gaussian noise attack, the Salt and Pepper noise attack, and the Gaussian blur attack. In order to generate AEs with these image quality degradations, a line-search was performed internally to estimate the minimal adversarial perturbation.
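The line-search idea can be sketched as a scan over increasing noise scales that stops at the first scale for which the prediction changes. This is an illustration under simplifying assumptions and not the Foolbox implementation: the classifier `predict` is a toy stand-in, and the noise vector is a fixed hypothetical draw.

```python
def minimal_noise_attack(x, predict, noise, n_steps=100):
    """Scan noise scales from small to large and return the first perturbed
    sample whose predicted label differs from the clean prediction."""
    clean_label = predict(x)
    for step in range(1, n_steps + 1):
        scale = step / n_steps
        x_adv = [min(1.0, max(0.0, xi + scale * ni)) for xi, ni in zip(x, noise)]
        if predict(x_adv) != clean_label:
            return x_adv, scale
    return None, None  # no scale in the scanned range fooled the model

# Toy stand-in classifier (hypothetical): thresholds the mean intensity.
def predict(x):
    return int(sum(x) / len(x) > 0.5)

x = [0.55, 0.6, 0.58, 0.57]
noise = [-0.3, 0.1, -0.4, -0.2]  # fixed draw standing in for random noise
x_adv, scale = minimal_noise_attack(x, predict, noise)
```

Because the scan proceeds from small to large scales, the returned sample carries the smallest perturbation on the scanned grid that changes the prediction.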

IV Proposed Method for Adversarial Examples Detection

IV-A Decision Mismatch for Attacks Detection

In the work of Feinman et al. [9], it is shown that different models make different mistakes when presented with the same AEs. To exploit this, the authors use Dropout in “train mode” and sample distinct models to predict the same input. A divergence threshold is used to decide whether the input is an attack or not. However, distinct models sampled by dropping out parameters of a given neural network share most of their structure. Hence, gradient-based attacks will have similar impacts on their outputs. Moreover, in [28] the outputs of the same model obtained with an input sample before and after decreasing each pixel's color bit depth are used as features for binary classification of attacks vs. clean data. Once more, the same model is used for predictions on gradient-based attacks, which exploit the model's structure for building attacks.

We argue that the mismatch between predictions will be maximized when the models' structures are substantially distinct. With this in mind, we propose to perform classification with two independently trained models and use their softmax outputs as features for a separate binary classifier. Consistency between the outputs provided by the two systems is an indicator of clean samples, while uncertainty indicates a potential attack. Employing models with different architectures makes gradient-based attacks have different effects on each model, which is exactly what we intend to exploit in order to spot attacks.

A diagram representing the proposed approach is shown in Fig. 2. Two independently trained models receive the same input and their concatenated outputs are used as features by a binary classifier that performs classification of authentic vs. attack inputs. The requirements for models 1 and 2 are matching performance on the task at hand and significant structural differences. The detector can be any binary classifier of choice.

Fig. 2: Detection method using decision mismatch - concatenated outputs of pretrained Models 1 and 2 are used as features for the detector.

The proposed approach also presents benefits in terms of cost, since only two predictions are required at test time, as opposed to [9], which needs a relatively large number of predictions for divergence evaluation. Besides that, no extra processing of the tested samples is required, as is the case in [28]. However, at training time our method requires training two models to perform the same task.
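The feature construction can be sketched in a few lines: the softmax outputs of the two models are concatenated and passed to any binary classifier. The example below uses a plain nearest-neighbour vote as a stand-in detector; the softmax values are hypothetical and chosen only to illustrate agreement (clean) versus disagreement (attack).

```python
def mismatch_features(probs_1, probs_2):
    """Concatenate the softmax outputs of model 1 and model 2."""
    return list(probs_1) + list(probs_2)

def knn_predict(train_feats, train_labels, query, k=1):
    """Plain k-nearest-neighbour vote with squared Euclidean distance."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(f, query)), y)
        for f, y in zip(train_feats, train_labels)
    )
    votes = [y for _, y in dists[:k]]
    return int(sum(votes) * 2 > len(votes))

# Hypothetical softmax outputs for a 3-class task.
# Label 0 = clean (models agree), label 1 = attack (models disagree).
train = [
    (([0.90, 0.05, 0.05], [0.85, 0.10, 0.05]), 0),
    (([0.05, 0.90, 0.05], [0.10, 0.85, 0.05]), 0),
    (([0.05, 0.05, 0.90], [0.05, 0.10, 0.85]), 0),
    (([0.10, 0.80, 0.10], [0.70, 0.20, 0.10]), 1),
    (([0.45, 0.50, 0.05], [0.10, 0.15, 0.75]), 1),
    (([0.20, 0.10, 0.70], [0.60, 0.30, 0.10]), 1),
]
feats = [mismatch_features(p1, p2) for (p1, p2), _ in train]
labels = [y for _, y in train]

clean_query = mismatch_features([0.88, 0.07, 0.05], [0.80, 0.15, 0.05])
attack_query = mismatch_features([0.15, 0.75, 0.10], [0.65, 0.25, 0.10])
```

With these toy values, `knn_predict(feats, labels, clean_query)` returns 0 and `knn_predict(feats, labels, attack_query)` returns 1. The detector never sees the input image itself, only the two decision vectors.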

IV-B Online Transfer Learning

Besides the detection approach for AEs, we propose a specific training scheme for classifiers, referred to as Online Transfer Learning (OLTL), which aims to improve performance on the detection task performed on top of the models' outputs.

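Consider two distinct classifiers trained jointly on the same task. One can formalize this particular learning problem as solving the following:

$$\theta_1^{*}, \theta_2^{*} = \arg\max_{\theta_1, \theta_2} \; \log P_1(y \mid x; \theta_1) + \log P_2(y \mid x; \theta_2),$$

where $P_1$ and $P_2$ are the conditional likelihoods of a given class label $y$ and the data $x$, parametrized by $\theta_1$ and $\theta_2$.

In the OLTL case, we augment the objective by including the negative Kullback-Leibler divergences between $P_1$ and $P_2$. We thus get:

$$\theta_1^{*}, \theta_2^{*} = \arg\max_{\theta_1, \theta_2} \; \log P_1(y \mid x; \theta_1) + \log P_2(y \mid x; \theta_2) - D_{KL}(P_1 \,\|\, P_2) - D_{KL}(P_2 \,\|\, P_1).$$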
By doing so, we constrain the models to have matching outputs. This is particularly convenient for detection of AEs. Since we train models with OLTL exclusively presenting clean samples, at test time models will have very similar outputs in the case of unmodified original data and mismatching will be an even stronger indication of potential attacks.
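A per-sample version of such an objective can be sketched as follows. It assumes that the likelihood terms enter as log-likelihoods of the true label and that the KL divergence is applied in both directions; both are assumptions of this sketch rather than details confirmed by the text. The output distributions are hypothetical.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with strictly positive entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def oltl_objective(probs_1, probs_2, label):
    """Per-sample objective to be maximized: the two log-likelihoods of the
    true label minus the KL divergences between the two output distributions
    (both directions are included here, which is an assumption)."""
    log_lik = math.log(probs_1[label]) + math.log(probs_2[label])
    penalty = kl_divergence(probs_1, probs_2) + kl_divergence(probs_2, probs_1)
    return log_lik - penalty

# Both pairs give the true class (0) the same probability; only the
# distribution over the remaining classes differs in the second pair.
agree = oltl_objective([0.8, 0.1, 0.1], [0.8, 0.1, 0.1], label=0)
disagree = oltl_objective([0.8, 0.1, 0.1], [0.8, 0.19, 0.01], label=0)
```

The second pair scores lower even though both models are equally confident in the correct class, which is the pressure towards matching outputs on clean data.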

V Experiments

V-A Datasets

We perform experiments on the well-known MNIST and CIFAR10 datasets. Both datasets consist of 50,000/10,000 training/testing images. MNIST contains 28x28 ten-class grayscale images representing the digits from 0 to 9. CIFAR10 is a ten-class dataset with 32x32 color images of animals and vehicles.

Subsets of the train partition of both MNIST and CIFAR10 were sampled for evaluation of the proposed scheme for detection of AEs. As mentioned previously, seven different attack strategies were tested in this study, namely FGSM, IGSM, JSMA, DeepFool, Gaussian Blur, Gaussian Noise, and Salt and Pepper. For each of them, 10,000 samples were independently selected at random without repetition. Each selected image was randomly assigned either to be turned into an attack or to remain unmodified, and the model used to generate each attack was also selected at random. All attacks were implemented using Foolbox [31].

In summary, by the end of the described process we had 7 independent datasets for each of MNIST and CIFAR10, each containing 10,000 samples, approximately 50% of which are attacks generated using either model 1 or model 2. Samples of the above-mentioned attacks are shown in Fig. 3.

Fig. 3: Samples after different attacks.

V-B Models Architectures and Training Details

V-B1 MNIST

For MNIST, models 1 and 2 were selected as a convolutional neural network (CNN) and a 3-layer fully connected neural network (MLP), respectively. Details of their architectures are shown below:

  • CNN: conv(5x5, 10) maxpool(2x2) conv(5x5, 20) maxpool(2x2) linear(350, 50) dropout(0.5) linear(50, 10) softmax.

  • MLP: linear(784, 320) dropout(0.5) linear(320, 50) dropout(0.5) linear(50, 10) softmax

All activation functions were set to ReLU. Training was performed with Stochastic Gradient Descent (SGD) using a fixed learning rate and fixed-size mini-batches. Training was executed for 10 and 20 epochs for the independent and OLTL schemes, respectively.
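As a quick arithmetic check of the CNN layer sizes, the sketch below traces the spatial dimensions through the listed layers. It assumes unpadded, stride-1 convolutions and stride-2 pooling, which the text does not state.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1

size = 28                                   # MNIST images are 28x28
size = conv_out(size, kernel=5)             # conv(5x5, 10)  -> 24
size = conv_out(size, kernel=2, stride=2)   # maxpool(2x2)   -> 12
size = conv_out(size, kernel=5)             # conv(5x5, 20)  -> 8
size = conv_out(size, kernel=2, stride=2)   # maxpool(2x2)   -> 4
flattened = 20 * size * size                # 20 channels * 4 * 4
```

Under these assumptions the flattened feature size is 320, whereas the first linear layer is listed with 350 inputs. Since 350 is not a multiple of the 20 output channels, the listed value may be a typo for 320, though this cannot be confirmed from the text alone.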

V-B2 CIFAR10

In the case of CIFAR10, we selected the widely used convolutional neural networks VGG16 and ResNet50 [32, 33] as models 1 and 2, respectively. The models were trained from scratch using SGD. The learning rate was scheduled to decay by a constant factor after a set number of epochs with no improvement in validation set accuracy. Momentum and L2 regularization were also employed.

The performance of the models trained independently and with OLTL on both datasets is presented in Table I. One can notice that in some cases OLTL improves performance. We intend to further investigate this aspect in future work.

| Dataset | Model | Test Accuracy (%) |
|---|---|---|
| MNIST | CNN – OLTL | 97.91 |
| MNIST | MLP | 96.62 |
| MNIST | MLP – OLTL | 93.63 |
| CIFAR10 | VGG | 92.50 |
| CIFAR10 | VGG – OLTL | 92.70 |
| CIFAR10 | ResNet | 92.04 |
| CIFAR10 | ResNet – OLTL | 94.10 |

TABLE I: Performance on test data of the selected models when trained independently and with the use of Online Transfer Learning.

V-C Results and Discussion

We performed two main sets of experiments aiming to evaluate: (a) detection performance of the proposed approach when training and testing are performed using the same attack strategy. Results here also include a comparison of detection performance gain when model 1 and 2 are trained with OLTL as opposed to independent training; (b) generalization capacity of the proposed approach by using data generated with a particular attack strategy for training and a different one for testing the detector.

Before proceeding to the main experimental results, we provide an illustrative analysis aiming to demonstrate the performance of the selected attack strategies and the effectiveness of the proposed approach. In Table II, the mean squared error (MSE) averaged over attacks generated with 500 random samples from both datasets is shown together with the success rate of each attack. By success rate we mean the fraction of the random samples that successfully yielded attacks, i.e., made the target model predict the wrong class.
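The two quantities can be computed as in the sketch below. It assumes that the MSE is averaged over successful attacks only, which the text does not specify, and uses tiny hypothetical two-pixel samples.

```python
def mean_squared_error(x, x_adv):
    return sum((a - b) ** 2 for a, b in zip(x, x_adv)) / len(x)

def attack_summary(originals, adversarials):
    """Average MSE over successful attacks and success rate in percent.

    `adversarials[i]` is None when the attack failed on sample i.
    """
    mses = [mean_squared_error(x, a)
            for x, a in zip(originals, adversarials) if a is not None]
    success_rate = 100.0 * len(mses) / len(originals)
    avg_mse = sum(mses) / len(mses) if mses else float("nan")
    return avg_mse, success_rate

originals = [[0.0, 0.5], [1.0, 1.0], [0.2, 0.8], [0.4, 0.4]]
adversarials = [[0.1, 0.5], [0.9, 0.9], None, [0.4, 0.6]]
avg_mse, success_rate = attack_summary(originals, adversarials)
```

A lower average MSE at a comparable success rate indicates a subtler, and in that sense stronger, attack.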

For our models, DeepFool is the strongest attack strategy evaluated on MNIST, since its MSE is the lowest among the evaluated methods. In CIFAR10, IGSM is the best performer. FGSM is the weakest of the gradient-based attacks. The additive perturbation strategies, i.e., Gaussian Blur, Gaussian Noise, and Salt and Pepper, which do not explicitly exploit the models' structure to generate attacks, are, as expected, less effective than the methods that directly exploit model information to yield refined attacks.

Fig. 7: Two dimensional t-SNE embeddings of raw images (subplot a) and features for both independently trained models (subplot b) and training with online transfer learning (subplot c). Blue dots are authentic samples and red dots are attacks generated with DeepFool on 10,000 samples from CIFAR10. Attacks and clean data are indistinguishable for the case of raw images. Representations in the feature space make attacks and clean samples discriminable, whereas online transfer learning further enhances discrimination from clean samples since attacks concentrate in the center of the plot. Each blue cluster in the feature space corresponds to a particular class.

| Attack | MSE (MNIST) | MSE (CIFAR10) | Success Rate (%) (MNIST) | Success Rate (%) (CIFAR10) |
|---|---|---|---|---|
| FGSM | 2.13E-02 | 1.91E-03 | 100.0 | 100.0 |
| IGSM | 1.07E-02 | 3.34E-05 | 100.0 | 100.0 |
| JSMA | 1.24E-02 | 2.07E-04 | 95.6 | 100.0 |
| DeepFool | 6.17E-03 | 1.00E-04 | 100.0 | 99.8 |
| Gaussian Blur | 4.35E-02 | 3.04E-03 | 90.2 | 93.8 |
| Gaussian Noise | 5.67E-02 | 4.90E-03 | 67.4 | 100.0 |
| Salt and Pepper | 1.23E-01 | 2.13E-03 | 100.0 | 100.0 |

TABLE II: Performance of different attacks evaluated in terms of the average MSE for 500 samples as well as their success rate.

| Attack | Train Scheme | RF Accuracy | RF F1 | RF AUC | KNN Accuracy | KNN F1 | KNN AUC | SVM Accuracy | SVM F1 | SVM AUC |
|---|---|---|---|---|---|---|---|---|---|---|
| FGSM | Independent | 0.94 | 0.94 | 0.94 | 0.93 | 0.93 | 0.93 | 0.93 | 0.92 | 0.93 |
| FGSM | OLTL | 0.93 | 0.93 | 0.93 | 0.92 | 0.92 | 0.92 | 0.91 | 0.91 | 0.91 |
| IGSM | Independent | 0.94 | 0.94 | 0.94 | 0.92 | 0.92 | 0.92 | 0.91 | 0.91 | 0.91 |
| IGSM | OLTL | 0.93 | 0.92 | 0.93 | 0.92 | 0.92 | 0.92 | 0.91 | 0.90 | 0.91 |
| JSMA | Independent | 0.95 | 0.95 | 0.95 | 0.93 | 0.93 | 0.93 | 0.92 | 0.92 | 0.92 |
| JSMA | OLTL | 0.92 | 0.92 | 0.92 | 0.91 | 0.91 | 0.91 | 0.86 | 0.85 | 0.86 |
| DeepFool | Independent | 0.94 | 0.93 | 0.94 | 0.93 | 0.93 | 0.93 | 0.92 | 0.91 | 0.92 |
| DeepFool | OLTL | 0.92 | 0.91 | 0.92 | 0.91 | 0.91 | 0.91 | 0.88 | 0.87 | 0.88 |
| Gaussian Blur | Independent | 0.98 | 0.98 | 0.98 | 0.97 | 0.98 | 0.97 | 0.98 | 0.98 | 0.98 |
| Gaussian Blur | OLTL | 0.96 | 0.96 | 0.96 | 0.96 | 0.96 | 0.96 | 0.96 | 0.96 | 0.96 |
| Gaussian Noise | Independent | 0.96 | 0.97 | 0.96 | 0.96 | 0.97 | 0.96 | 0.96 | 0.97 | 0.96 |
| Gaussian Noise | OLTL | 0.94 | 0.95 | 0.94 | 0.94 | 0.94 | 0.94 | 0.94 | 0.95 | 0.94 |
| Salt and Pepper | Independent | 0.93 | 0.93 | 0.93 | 0.93 | 0.93 | 0.93 | 0.93 | 0.93 | 0.93 |
| Salt and Pepper | OLTL | 0.92 | 0.92 | 0.92 | 0.92 | 0.92 | 0.92 | 0.91 | 0.91 | 0.91 |

TABLE III: Performance of three distinct classifiers (Random Forest (RF), KNN, and SVM) for MNIST attacks detection. Scores are evaluated under 10-fold cross-validation. Each experiment was performed with an independent random subset of the train data containing 10,000 samples.

In order to visualize how discriminable attacks are when compared to clean samples, we show the t-SNE embeddings of the raw images and their representations with our features in Fig. 7. Plots are generated under DeepFool attacks on CIFAR10. One can clearly notice that the representation in the proposed space of features makes attacks detectable. Moreover, OLTL applied on the training phase of target models 1 and 2 makes the attacks even more separable from clean samples when compared to independent training.

| Attack | Train Scheme | RF Accuracy | RF F1 | RF AUC | KNN Accuracy | KNN F1 | KNN AUC | SVM Accuracy | SVM F1 | SVM AUC |
|---|---|---|---|---|---|---|---|---|---|---|
| FGSM | Independent | 0.94 | 0.94 | 0.94 | 0.92 | 0.92 | 0.92 | 0.93 | 0.93 | 0.93 |
| FGSM | OLTL | 0.93 | 0.93 | 0.93 | 0.94 | 0.94 | 0.94 | 0.94 | 0.94 | 0.94 |
| IGSM | Independent | 0.92 | 0.92 | 0.92 | 0.93 | 0.93 | 0.93 | 0.93 | 0.93 | 0.93 |
| IGSM | OLTL | 0.93 | 0.93 | 0.93 | 0.95 | 0.95 | 0.95 | 0.94 | 0.93 | 0.94 |
| JSMA | Independent | 0.96 | 0.96 | 0.96 | 0.91 | 0.91 | 0.91 | 0.93 | 0.92 | 0.93 |
| JSMA | OLTL | 0.95 | 0.95 | 0.95 | 0.94 | 0.92 | 0.94 | 0.94 | 0.94 | 0.94 |
| DeepFool | Independent | 0.91 | 0.91 | 0.91 | 0.90 | 0.90 | 0.90 | 0.91 | 0.90 | 0.91 |
| DeepFool | OLTL | 0.91 | 0.91 | 0.91 | 0.93 | 0.93 | 0.93 | 0.93 | 0.92 | 0.93 |
| Gaussian Blur | Independent | 0.96 | 0.96 | 0.96 | 0.96 | 0.96 | 0.96 | 0.97 | 0.97 | 0.97 |
| Gaussian Blur | OLTL | 0.96 | 0.97 | 0.96 | 0.96 | 0.96 | 0.96 | 0.97 | 0.97 | 0.97 |
| Gaussian Noise | Independent | 0.95 | 0.94 | 0.94 | 0.94 | 0.94 | 0.94 | 0.95 | 0.95 | 0.95 |
| Gaussian Noise | OLTL | 0.95 | 0.95 | 0.95 | 0.96 | 0.96 | 0.96 | 0.96 | 0.96 | 0.96 |
| Salt and Pepper | Independent | 0.94 | 0.94 | 0.94 | 0.94 | 0.94 | 0.94 | 0.94 | 0.94 | 0.94 |
| Salt and Pepper | OLTL | 0.94 | 0.94 | 0.94 | 0.94 | 0.94 | 0.94 | 0.94 | 0.94 | 0.94 |

TABLE IV: Performance of three distinct classifiers (Random Forest (RF), KNN, and SVM) for CIFAR10 attacks detection. Scores are evaluated under 10-fold cross-validation. Each experiment was performed with an independent random subset of the train data containing 10,000 samples.

V-C1 Binary Classifiers Performance

For evaluation of the proposed detection approach, we trained three distinct detectors on top of the representations obtained with the pretrained models. Different detectors were used to verify the robustness of the representation under study. The binary classifiers used for detection were:

  1. Random Forest, with a fixed number of estimators;

  2. K-Nearest Neighbors (KNN), with a fixed K;

  3. Support Vector Machine (SVM) with RBF kernel.

And the following performance metrics were analyzed:

  1. Accuracy;

  2. F1 score, accounting for precision and sensitivity simultaneously;

  3. Area under ROC curve (AUC).
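For reference, the three metrics can be computed from detector outputs as in the following sketch, where label 1 denotes an attack. The scores and the 0.5 decision threshold are hypothetical.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_score(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def roc_auc(y_true, scores):
    """Probability that a random attack gets a higher score than a random
    clean sample (ties count one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]
y_pred = [int(s >= 0.5) for s in scores]
```

Accuracy and F1 depend on the chosen threshold, whereas AUC summarizes the ranking quality of the scores over all thresholds.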

Fig. 11: Generalization capacity for MNIST of each detector when trained on the attacks in the rows and evaluated on the attacks in the columns. Values in the table indicate test accuracy.
Fig. 15: Generalization capacity for CIFAR10 of each detector when trained on the attacks in the rows and evaluated on the attacks in the columns. Values in the table indicate test accuracy.

Evaluation of metrics was performed under 10-fold cross-validation. Results are shown in Tables III and IV for MNIST and CIFAR10, respectively. Scores are at or above 0.90 in almost all cases, in both datasets and across the different detectors evaluated; the exceptions are the SVM detector with OLTL on JSMA and DeepFool for MNIST.

OLTL had different impacts in the two cases studied. For CIFAR10, accuracy increased for almost all attacks with the application of OLTL for KNN, which is consistent with the behavior depicted in Fig. 7: since attacks concentrate closer together with the use of OLTL, neighboring data points are more likely to belong to the same class. However, this behavior was not observed for MNIST, where results are in general better with independently trained models.

In Tables V and VI we compare our approach with the results reported in [28] and [34]. Even though other methods present a higher sensitivity for attacks such as FGSM and IGSM on MNIST, our proposed approach performs consistently well across attacks and datasets, including harder cases such as DeepFool or other attacks on CIFAR10.

V-C2 Generalization to Unseen Attacks

A second set of experiments was performed with the aim of verifying whether a detector trained on particular attack strategy is able to detect other kinds of attacks.

In order to do so, we used data from a particular attacker for training the binary classifiers and evaluated their accuracy on different attacks data. All possible pairs of attacks were evaluated under this scheme and the accuracies obtained are presented in Figures 11 and 15 for MNIST and CIFAR10.
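The train-on-one, test-on-another protocol can be sketched as a double loop over attack datasets. The detector below is a deliberately simple one-feature threshold rule standing in for the actual binary classifiers, and the attack names, scores, and labels are hypothetical.

```python
def fit_threshold(scores, labels):
    """Toy one-feature detector: pick the threshold with the best training accuracy."""
    return max(sorted(set(scores)),
               key=lambda t: sum((s >= t) == bool(y) for s, y in zip(scores, labels)))

def evaluate(threshold, scores, labels):
    return sum((s >= threshold) == bool(y) for s, y in zip(scores, labels)) / len(labels)

def cross_attack_matrix(datasets):
    """Train on each attack (rows) and test on every attack (columns)."""
    return {
        train_name: {
            test_name: evaluate(fit_threshold(*train_data), *test_data)
            for test_name, test_data in datasets.items()
        }
        for train_name, train_data in datasets.items()
    }

# Hypothetical per-attack data: (mismatch scores, labels), 1 = attack.
datasets = {
    "attack_A": ([0.1, 0.2, 0.7, 0.9], [0, 0, 1, 1]),
    "attack_B": ([0.15, 0.3, 0.6, 0.8], [0, 0, 1, 1]),
}
matrix = cross_attack_matrix(datasets)
```

Off-diagonal entries such as `matrix["attack_A"]["attack_B"]` measure generalization to an attack that the detector never saw during training.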

Results indicate that our approach presents high detection accuracy even when the detector is tested against unseen attacks. Moreover, detectors trained on JSMA on MNIST are the ones that perform best on different attacks. The same behavior appears on CIFAR10 for DeepFool.

Detectors trained on additive noise attacks have much lower generalization capacity. However, detectors trained on simple attacks such as salt and pepper noise can still achieve surprisingly high accuracies on refined attacks such as DeepFool and JSMA. Another point worth mentioning is that the behavior described repeats across a set of different binary classifiers which by nature employ very different strategies to perform classification. This consistency shows that the proposed representation, i.e., the decision space of two distinct models, embeds enough information to enable robust detection of current attacks as well as of new strategies applied to fool the models, since generalization to unseen attacks is observed.

Feature Squeezing [28] 1.000 0.979 1.000
Ensemble [34] 0.998 0.997 0.450
Proposed 0.930 0.920 0.930 0.910
TABLE V: Detection rate of attackers based on MNIST.
Feature Squeezing [28] 0.208 0.550 0.885 0.774
Ensemble [34] 0.998 0.488 0.426
Proposed 0.930 0.940 0.970 0.910
TABLE VI: Detection rate of attackers based on CIFAR10.

VI Conclusion

In this work we proposed a new approach to detect adversarial attacks. The softmax layer outputs of two previously trained models are used as features for binary classification of attacks vs. clean samples. Moreover, a particular training scheme for pairs of classifiers, referred to as Online Transfer Learning, is proposed with the aim of making the models yield closely matching outputs when presented with unmodified original samples.

Experiments performed on MNIST and CIFAR10 show that different detectors can perform consistently well on a set of untargeted white-box attacks. Moreover, generalization to unseen attack methods is observed, leading to detectors that can flag new attack strategies. Classifiers trained on simple attacks such as salt and pepper additive noise are able to detect subtle state-of-the-art AEs with high accuracy. The detection approach comes at the cost of training a second model for the same task and performing two independent predictions at test time.

For future work we intend to scale this approach to classification tasks involving a higher number of classes. Moreover, even though the unconstrained testbed employed here, i.e., untargeted attacks along with white-box model-dependent distortions, yields the most subtle and hence most difficult to detect AEs, evaluating the proposed method on black-box and targeted attacks is a complementary analysis to be done.


The authors would like to thank Isabela Albuquerque for helpful discussions and insights and the Natural Sciences and Engineering Research Council of Canada (NSERC) for funding.