ML-LOO: Detecting Adversarial Examples with Feature Attribution

06/08/2019, by Puyudi Yang et al.

Deep neural networks obtain state-of-the-art performance on a series of tasks. However, they are easily fooled by adding a small adversarial perturbation to the input; on image data, the perturbation is often imperceptible to humans. We observe a significant difference between the feature attributions of adversarially crafted examples and those of original ones. Based on this observation, we introduce a new framework to detect adversarial examples by thresholding a scale estimate of feature attribution scores. Furthermore, we extend our method to include multi-layer feature attributions in order to tackle attacks with mixed confidence levels. Through extensive experiments, our method achieves superior performance in distinguishing adversarial examples generated by popular attack methods on a variety of real data sets, compared with state-of-the-art detection methods. In particular, our method is able to detect adversarial examples of mixed confidence levels and transfers between different attack methods.

1 Introduction

Deep neural networks have achieved state-of-the-art performance on a variety of tasks, including image classification, object detection, speech recognition and machine translation. However, they have been shown to be vulnerable to adversarial examples. This poses a security risk when DNNs are applied to sensitive areas such as finance, medicine, criminal justice and transportation. Adversarial examples are inputs to machine learning models that an attacker constructs intentionally to fool the model [1]. Szegedy et al. [2] observed that a visually indistinguishable perturbation in pixel space to the original image can alter the prediction of a neural network. Later, a series of papers [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16] designed more sophisticated methods to find the worst-case perturbation within a restricted set, often a small ball with respect to an ℓp distance.

While a line of work tries to explain why adversarial examples exist [3, 17, 18, 19], a comprehensive analysis of the underlying reasons remains an open problem, mainly because deep neural networks have complex functional forms for which a complete mathematical analysis is difficult. On the other hand, there has been growing interest in developing tools for tackling the black-box nature of neural networks, among which feature attribution is a widely studied approach [20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32]. Given a predictive model, such a method outputs, for each instance to which the model is applied, a vector of importance scores associated with the underlying features. Feature attribution has been used to improve the transparency and fairness of machine learning models [23, 30].

In this paper, we investigate the application of feature attribution to detecting adversarial examples. In particular, we observe that the feature attribution map of an adversarial example near the decision boundary always differs from that of the corresponding original example. A motivating example is shown in Figure 1, which displays CIFAR-10 images fed into a residual network together with the corresponding feature attributions from Leave-One-Out (LOO) [27], before and after the worst-case perturbation by the C&W attack. LOO interprets the decision of a neural model by observing the effect of erasing each pixel of the input. While the perturbation of the original image is visually imperceptible, the feature attribution is altered drastically. We further observe that the difference can be summarized by simple statistics that characterize feature disagreement, which are capable of distinguishing adversarial examples from natural images. We conjecture that this is because adversarial attacks tend to perturb samples into an unstable region of the decision surface.

The above observation leads to an effective method for detecting adversarial examples near the decision boundary. On the other hand, there also exist adversarial examples in which the model has high confidence [7]. Previous work has observed that several state-of-the-art detection methods are vulnerable to such attacks [33, 34]. However, we observe an interesting phenomenon: middle layers of neural networks still contain information about uncertainty even for high-confidence adversarial examples. Based on this observation, we generalize our method to incorporate multi-layer feature attribution, where attribution scores for intermediate layers are computed without incurring extra model queries.

In numerical experiments, our method achieves superior performance in detecting adversarial examples generated by popular attack methods on MNIST, CIFAR-10 and CIFAR-100, compared with state-of-the-art detection methods. We also show that the proposed method is capable of detecting mixed-confidence adversarial examples, transferring between adversarial examples of different confidence levels, and detecting adversarial examples generated by various types of attacks.

2 Related Work

In this section, we review related work in feature attribution, adversarial attack, adversarial defense and detection.

Feature attribution

A variety of methods have been proposed to assign feature attribution scores. For each specific instance to which the model is applied, an attribution method assigns an importance score to each feature, by approximating the target model via a linear model locally around the instance. One popular class of methods assumes differentiability of the model and propagates the prediction to features through gradients. Examples include the direct use of gradients (Saliency Map) [22], Layer-wise Relevance Propagation (LRP) [21] and its improved version DeepLIFT [20], and Integrated Gradients [31].

Another class of methods is perturbation-based and thus model-agnostic. Given an instance, multiple perturbed samples are generated by masking different groups of features with a pre-specified reference value, and the feature attribution of the instance is computed from the prediction scores of the model on these samples. Popular perturbation-based methods include Leave-One-Out [35, 27], LIME [23] and KernelSHAP [28].

It has been observed by Ghorbani et al. [36] that gradient-based feature attribution maps are sensitive to small perturbations, and adversarial attacks on feature attribution have been designed to characterize this fragility. In contrast, attribution methods have been observed to be robust on robust models. In fact, Yeh et al. [37] observed that gradient-based explanations of an adversarially trained network are less sensitive, and Chalasani et al. [38] established theoretical results for the robustness of attribution maps on an adversarially trained logistic regression. These observations indicate that the sensitivity of a feature attribution might be rooted in the sensitivity of the model rather than the attribution method. This motivates the detection of adversarial examples via attribution methods.

Adversarial attack

Adversarial attacks try to alter, with minimal perturbation, the prediction of a given model on an original instance, which leads to adversarial examples. Adversarial examples can be categorized as targeted or untargeted, depending on whether the goal is to classify the perturbed instance into a given target class or into an arbitrary class different from the correct one. Attacks also differ in the type of distance they use to characterize a minimal perturbation.

The ℓ0, ℓ2, and ℓ∞ distances are the most commonly used. The Fast Gradient Sign Method (FGSM) by Goodfellow et al. [3] is an efficient method to minimize the ℓ∞ distance. Kurakin et al. [4] and Madry et al. [8] proposed ℓ∞-PGD (BIM), an iterative version of FGSM, which achieves a higher success rate with a smaller perturbation size. DeepFool, presented by Moosavi-Dezfooli et al. [5], minimizes the ℓ2 distance through an iterative linearization procedure. Carlini and Wagner [7] proposed effective algorithms to generate adversarial examples for each of the three distances; in particular, they proposed a loss function that is capable of controlling the confidence level of adversarial examples. The Jacobian-based Saliency Map Attack (JSMA) [6] is a greedy method for perturbation under the ℓ0 metric. Recently, several black-box adversarial attacks that rely solely on probability scores or decisions have been introduced.

Chen et al. [9] and Ilyas et al. [10, 11] introduced score-based methods using zeroth-order gradient estimation to craft adversarial examples. Brendel et al. [15] introduced the Boundary Attack, a black-box method to minimize the ℓ2 distance that does not need access to gradient information and relies solely on the model decision. We demonstrate in our experiments that our method is capable of detecting adversarial examples generated by these attacks, regardless of the distance, the confidence level, or whether gradient information is used.

Adversarial defense and detection

To improve the robustness of neural networks, various approaches have been proposed to defend against adversarial attacks, including adversarial training [3, 4, 8, 39, 40], distributional smoothing [41], defensive distillation [42], generative models [43], feature squeezing [44], randomized models [45, 46, 47], and verifiable defenses [48, 49]. These defenses often involve modifications to the training process of a model, which often require higher computational or sample complexity [50] and lead to a loss of accuracy [51].

Complementary to the defense techniques above, an alternative line of work focuses on screening out adversarial examples at test time without touching the training of the original model. Data transformations such as PCA have been used to extract features from the input and from layers of neural networks for adversarial detection [52, 53, 54]. Auxiliary neural networks have been trained to classify adversarial and original images [55, 56, 57]. Feinman et al. [58] proposed to use a kernel density estimate (KD) and Bayesian uncertainty (BU) in hidden layers of the neural network for detection.

Ma et al. [59] observed that the Local Intrinsic Dimensionality (LID) of hidden-layer outputs differs between original and adversarial examples. Lee et al. [60] fit class-conditional Gaussian distributions to lower-level and upper-level features of the deep neural network under Gaussian discriminant analysis, which yields a confidence score based on the Mahalanobis distance (MAHA); a logistic regression model on these confidence scores is then used to detect adversarial examples. Through extensive experiments, we show that our method achieves comparable or superior performance to these detection methods across various attacks. Furthermore, we show that our method achieves competitive performance for attacks with varied confidence levels, a setting where the other detection methods fail [33, 34].

Most related to our work, Tao et al. [61] proposed to identify neurons critical for individual attributes in order to detect adversarial examples, but their method is restricted to face recognition models. In contrast, our method is applicable across different types of image data.

Zhang et al. [62] proposed to identify adversarial perturbations by training a neural network on the saliency map of inputs. However, their method depends on additional neural networks, which are vulnerable to white-box attacks in which an attacker perturbs the image to fool the original model and the new neural network simultaneously.

Figure 1: The first row shows the original CIFAR-10 examples and their corresponding feature attributions. The second row shows the adversarial examples and their corresponding feature attributions. The third row plots the histograms of the original and adversarial feature attributions.
Figure 2: Histogram of dispersion measures

3 Adversarial detection with feature attribution

3.1 Feature attribution before and after perturbation

Assume that the model is a function $f: \mathbb{R}^d \rightarrow [0,1]^K$ which maps an image of dimension $d$ to a probability vector of dimension $K$, where $K$ is the number of classes. A feature attribution method maps an input image $x$ to an attribution vector of the same shape as the image, $\phi(x) \in \mathbb{R}^d$, such that the $i$th dimension $\phi_i(x)$ is the contribution of feature $i$ to the prediction of the model on the specific image $x$. We suppress the dependence of $\phi$ on the model for notational convenience. We focus on the Leave-One-Out (LOO) method [35, 27] throughout the paper, which assigns to each feature the reduction in the probability of the selected class $\hat{y}$ when the feature in consideration is masked by some reference value, e.g. 0. Denoting the example with the $i$th feature masked as $x_{(i)}$, LOO defines $\phi_i(x)$ as

$$\phi_i(x) = f_{\hat{y}}(x) - f_{\hat{y}}(x_{(i)}). \tag{1}$$
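To make the definition concrete, below is a minimal sketch of per-pixel LOO attribution in Python, assuming a generic classifier that exposes a Keras-style predict returning class probabilities; the model handle, image shape, and the zero reference value are illustrative assumptions rather than the exact pipeline used in the paper.

```python
import numpy as np

def loo_attribution(model, x, reference=0.0):
    """Leave-One-Out attribution, Equation (1): drop in the predicted-class
    probability when each pixel is replaced by a reference value.

    model : any object with predict(batch) -> (n, K) class probabilities
            (e.g. a Keras classifier with a softmax output); assumed for illustration.
    x     : a single image of shape (H, W, C), values in [0, 1].
    """
    probs = model.predict(x[None])[0]
    y_hat = int(np.argmax(probs))               # predicted class \hat{y}
    base = probs[y_hat]                         # f_{\hat{y}}(x)

    masked = []
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            x_ij = x.copy()
            x_ij[i, j, :] = reference           # mask one pixel with the reference value
            masked.append(x_ij)
    masked_probs = model.predict(np.stack(masked))[:, y_hat]
    return (base - masked_probs).reshape(x.shape[:2])   # phi_i(x) per pixel location
```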

Adversarial attacks aim to change the prediction of a model with minimal perturbation of a sample, so that a human is not able to detect the difference between an original image $x$ and its perturbed version $x'$. Yet we observe that $\phi$ is sensitive to the small difference between $x$ and $x'$. Figure 1 shows the attribution maps of an original image and its adversarially perturbed counterpart produced by the C&W attack. Even by eye, we can observe a clear difference between the attribution maps of the original and adversarial images. In particular, adversarial images have a larger dispersion in their importance scores, as demonstrated in Figure 1. We note that our proposed framework of adversarial detection via feature attribution is generic and applies to other popular feature attribution methods. As an example, we show the performance of Integrated Gradients [31] for adversarial detection in Appendix 6.1. LOO achieves the best performance among the attribution methods across different data sets.

3.2 Quantifying the dispersion in feature attribution maps

Motivated by the apparent differences in the distributions of importance scores between original and adversarial images, as demonstrated in Figure 1, we propose to use measures of statistical dispersion of the feature attribution $\phi(x)$ to detect adversarial examples. In particular, we consider the standard deviation (STD), the median absolute deviation (MAD), which is the median of the absolute differences between entries and their median, and the interquartile range (IQR), which is the difference between the 75th percentile and the 25th percentile of the entries of $\phi(x)$:

$$\mathrm{IQR}(\phi(x)) = Q_{75}(\phi(x)) - Q_{25}(\phi(x)). \tag{2}$$

We observe a larger dispersion in the feature contributions, which we call feature disagreement, for adversarially perturbed images, and the difference is universal across different images. Figure 2 compares the histograms of these three dispersion measures of feature attributions for ResNet on natural test images from CIFAR-10 with those on adversarially perturbed images, where the adversarial perturbation is carried out by the C&W attack. There is a significant difference in the distributions of STD, MAD and IQR between natural and adversarial images. A majority of adversarially perturbed images have a larger dispersion in feature attribution not only than their corresponding original images but also than an arbitrary natural image. We propose to distinguish adversarial images from natural images by thresholding the IQR of feature attribution maps. In Appendix 6.2, we show the ROC curves of adversarial detection using the three dispersion measures on the CIFAR-10 data set with ResNet across three different attacks. All three measures yield competitive performance. We stick to IQR for the rest of the paper, as it is robust and has slightly superior performance among the three.
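As a sketch of the resulting detection rule (assuming attribution maps are already available as arrays), the three dispersion statistics and the IQR-thresholding test can be written in a few lines of NumPy:

```python
import numpy as np

def dispersion_stats(phi):
    """STD, MAD and IQR of a feature attribution map phi (any shape)."""
    v = np.ravel(phi)
    std = v.std()
    mad = np.median(np.abs(v - np.median(v)))
    q75, q25 = np.percentile(v, [75, 25])
    return std, mad, q75 - q25

def detect_by_iqr(phi, threshold):
    """Flag an input as adversarial if the IQR of its attribution map
    exceeds a threshold chosen on held-out natural/adversarial examples."""
    return dispersion_stats(phi)[2] > threshold
```

In practice the threshold would be chosen on held-out data, for example as a high percentile of the IQRs observed on natural images, so that the false positive rate stays below a target level.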

Figure 3: ROC curves of detection methods on CIFAR-10 dataset with ResNet

3.3 Extension to multi-layer LOO: detection of attacks with mixed confidence levels

Carlini and Wagner [7] proposed the following objective to generate adversarial images with a small perturbation:

$$\min_{x'} \; \|x' - x\|_2^2 + c \cdot \max\Bigl( Z(x')_{y_0} - \max_{i \neq y_0} Z(x')_i,\; -\kappa \Bigr), \tag{3}$$

where $Z$ maps an image to its logits, $y_0$ is the original label, $c$ balances the two terms, and $\kappa$ is a hyperparameter for tuning confidence. Adversarial images with high confidence can be obtained by assigning a large value to $\kappa$. The loss can be modified to generate ℓ∞-constrained perturbations at different confidence levels as well [8]. Recently, Lu et al. [33] and Athalye et al. [34] observed that LID performs poorly when faced with adversarial examples of varying confidence. In our experiments, a similar phenomenon is observed for several other state-of-the-art detection methods, including KD+BU and MAHA, as shown in Figure 4. This suggests that the characterizations of adversarial examples in related work may only hold for adversarial examples near the decision boundary. The IQR of the feature attribution map, unfortunately, suffers from the same problem.
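For concreteness, here is a minimal NumPy sketch of the confidence-controlled margin term in Equation (3); the logit vector is taken as given, and the sketch omits the ℓ2 penalty and the optimization over the perturbation, so it illustrates how κ enforces a logit margin rather than serving as a full attack implementation.

```python
import numpy as np

def cw_confidence_loss(z, y0, kappa):
    """Margin term of Equation (3): max(Z(x')_{y0} - max_{i != y0} Z(x')_i, -kappa).

    z     : logits Z(x') for a single image, shape (K,).
    y0    : original label.
    kappa : confidence hyperparameter; a larger kappa forces a larger logit
            margin, i.e. higher-confidence adversarial examples.
    """
    other = np.max(np.delete(z, y0))            # largest logit among the other classes
    return max(z[y0] - other, -kappa)
```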

To detect adversarial images with mixed confidence levels, we generalize our method to capture the dispersion of feature attributions beyond the output layer of the model. For an adversarial example that lies within a small neighborhood of its original example in pixel space but achieves high confidence in a different class at the output layer, the feature representation deviates from that of the original example gradually across the layers. Thus, we expect neurons in middle layers to contain uncertainty that can be captured by a feature attribution map. We denote the map from the input to an arbitrary neuron $h$ of an intermediate layer of the model by $f^h$. The feature attribution with respect to neuron $h$ is defined as $\phi^h(x)$, such that the $i$th entry $\phi^h_i(x)$ quantifies the contribution of feature $i$ to neuron $h$. For Leave-One-Out (LOO), we have $\phi^h_i(x) = f^h(x) - f^h(x_{(i)})$.

To account for the scale differences between neurons, we fit a logistic regression on the dispersion of feature attributions from different neurons, using a hold-out training set, to distinguish adversarial images from original images. We call the multi-layer extension of our method 'ML-LOO'.
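A sketch of the ML-LOO pipeline is given below, under the assumption that the attributions of several monitored layer outputs can be obtained in one batched pass (for instance via a multi-output Keras model); the helper names, the per-layer (rather than per-neuron) IQR features, and the layer choice are simplifications for illustration, not the authors' exact implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def iqr(v):
    q75, q25 = np.percentile(np.ravel(v), [75, 25])
    return q75 - q25

def multi_layer_loo(multi_output_model, x, reference=0.0):
    """LOO attribution maps for every monitored layer output.

    multi_output_model : object whose predict(batch) returns a list of arrays,
                         one per monitored layer (assumed, e.g. a multi-output
                         Keras model); no extra queries beyond one masked batch.
    """
    masked = [x]                                 # row 0: the clean image
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            x_ij = x.copy()
            x_ij[i, j, :] = reference            # mask one pixel with the reference value
            masked.append(x_ij)
    outs = multi_output_model.predict(np.stack(masked))
    # outs[k][0] is the clean output of layer k; rows 1.. are the masked ones
    return [o[0] - o[1:] for o in outs]

def ml_loo_features(multi_output_model, images):
    """One dispersion feature (IQR of the attribution map) per monitored layer, per image."""
    feats = []
    for x in images:
        feats.append([iqr(a) for a in multi_layer_loo(multi_output_model, x)])
    return np.asarray(feats)

# Fitting the detector on a hold-out set (natural images labeled 0, adversarial 1):
# X = ml_loo_features(model, np.concatenate([nat_imgs, adv_imgs]))
# y = np.concatenate([np.zeros(len(nat_imgs)), np.ones(len(adv_imgs))])
# detector = LogisticRegression(max_iter=1000).fit(X, y)
```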

Data Model Metric Attacks
C&W ℓ∞-PGD FGSM
KD+BU LID MAHA ML-LOO KD+BU LID MAHA ML-LOO KD+BU LID MAHA ML-LOO
MNIST CNN AUC 0.893 1.000 0.957 1.000 0.766 0.902 0.736 1.000 0.744 0.780 0.967 1.000
TPR (FPR@0.01) 0.23 0.99 0.94 0.98 0.09 0.32 0.01 0.99 0.01 0.09 0.54 0.99
TPR (FPR@0.05) 0.46 0.99 0.94 0.98 0.28 0.58 0.12 0.99 0.15 0.23 0.92 0.99
TPR (FPR@0.10) 0.55 0.99 0.94 0.98 0.34 0.72 0.29 0.99 0.24 0.40 0.94 0.99
CIFAR10 ResNet AUC 0.623 0.990 0.962 0.995 0.834 0.970 0.958 0.999 0.673 0.972 0.770 0.997
TPR (FPR@0.01) 0.01 0.55 0.57 0.86 0.54 0.52 0.41 0.96 0.04 0.29 0.04 0.82
TPR (FPR@0.05) 0.09 0.98 0.95 0.98 0.61 0.85 0.86 0.98 0.20 0.82 0.16 0.99
TPR (FPR@0.10) 0.22 0.99 0.95 0.99 0.62 0.91 0.91 0.98 0.29 0.93 0.38 0.99
DenseNet AUC 0.679 0.958 0.966 0.977 0.955 0.952 0.768 0.997 0.790 0.706 0.829 1.000
TPR (FPR@0.01) 0.06 0.30 0.48 0.33 0.69 0.51 0.03 0.99 0.17 0.04 0.00 0.99
TPR (FPR@0.05) 0.13 0.79 0.91 0.84 0.74 0.84 0.23 0.99 0.28 0.12 0.29 0.99
TPR (FPR@0.10) 0.22 0.91 0.94 0.98 0.80 0.88 0.31 0.99 0.41 0.23 0.51 0.99
CIFAR100 ResNet AUC 0.637 0.717 0.945 0.967 0.855 0.984 0.966 0.999 0.773 0.985 0.875 1.000
TPR (FPR@0.01) 0.07 0.00 0.00 0.33 0.59 0.69 0.48 0.94 0.39 0.48 0.12 0.99
TPR (FPR@0.05) 0.16 0.01 0.52 0.70 0.61 0.94 0.82 0.99 0.49 0.89 0.43 0.99
TPR (FPR@0.10) 0.29 0.01 0.80 0.92 0.64 0.96 0.92 0.99 0.56 0.99 0.57 0.99
DenseNet AUC 0.567 0.727 0.916 0.958 0.549 0.732 0.947 0.971 0.577 0.751 0.951 0.974
TPR (FPR@0.01) 0.02 0.07 0.00 0.07 0.01 0.00 0.00 0.21 0.01 0.01 0.00 0.31
TPR (FPR@0.05) 0.17 0.15 0.61 0.66 0.14 0.01 0.70 0.75 0.17 0.06 0.77 0.81
TPR (FPR@0.10) 0.22 0.26 0.84 0.88 0.20 0.04 0.91 0.96 0.23 0.18 0.93 0.94
Data Model Metric Attacks
JSMA DeepFool Boundary
KD+BU LID MAHA ML-LOO KD+BU LID MAHA ML-LOO KD+BU LID MAHA ML-LOO
MNIST CNN AUC 0.886 1.000 0.976 1.000 0.901 1.000 0.869 1.000 0.905 1.000 0.991 1.000
TPR (FPR@0.01) 0.30 1.00 0.87 0.99 0.32 1.00 0.04 1.00 0.32 1.00 0.79 1.00
TPR (FPR@0.05) 0.46 1.00 0.94 1.00 0.43 1.00 0.36 1.00 0.45 1.00 0.98 1.00
TPR (FPR@0.10) 0.51 1.00 0.95 1.00 0.57 1.00 0.59 1.00 0.55 1.00 0.98 1.00
CIFAR10 ResNet AUC 0.614 0.986 0.941 0.981 0.618 0.990 0.981 0.994 0.676 0.990 0.967 0.997
TPR (FPR@0.01) 0.01 0.49 0.45 0.46 0.01 0.57 0.60 0.89 0.03 0.64 0.60 0.92
TPR (FPR@0.05) 0.10 0.98 0.87 0.82 0.10 0.99 0.96 0.96 0.20 0.99 0.94 0.99
TPR (FPR@0.10) 0.21 0.99 0.90 0.99 0.24 0.99 0.96 0.99 0.38 0.99 0.94 0.99
DenseNet AUC 0.645 0.937 0.947 0.964 0.646 0.976 0.977 0.976 0.700 0.983 0.981 0.980
TPR (FPR@0.01) 0.04 0.14 0.41 0.12 0.03 0.34 0.51 0.24 0.05 0.58 0.62 0.31
TPR (FPR@0.05) 0.10 0.67 0.68 0.72 0.09 0.90 0.95 0.82 0.12 0.93 0.91 0.89
TPR (FPR@0.10) 0.18 0.86 0.88 0.96 0.17 0.98 0.97 0.98 0.23 0.98 0.96 0.98

CIFAR100 ResNet AUC 0.600 0.740 0.907 0.964 0.610 0.714 0.953 0.970 0.635 0.732 0.956 0.972
TPR (FPR@0.01) 0.00 0.01 0.00 0.42 0.06 0.00 0.00 0.41 0.07 0.01 0.00 0.49
TPR (FPR@0.05) 0.12 0.14 0.49 0.70 0.14 0.01 0.56 0.74 0.16 0.07 0.61 0.78
TPR (FPR@0.10) 0.27 0.24 0.77 0.91 0.29 0.01 0.87 0.94 0.30 0.15 0.94 0.93
DenseNet AUC 0.567 0.727 0.916 0.958 0.549 0.732 0.947 0.971 0.577 0.751 0.951 0.974
TPR (FPR@0.01) 0.02 0.07 0.00 0.07 0.01 0.00 0.00 0.21 0.01 0.01 0.00 0.31
TPR (FPR@0.05) 0.17 0.15 0.61 0.66 0.14 0.01 0.70 0.75 0.17 0.06 0.77 0.81
TPR (FPR@0.10) 0.22 0.26 0.84 0.88 0.20 0.04 0.91 0.96 0.23 0.18 0.93 0.94
Table 1: Performance of detection methods on different data sets, models and attack methods.

4 Experiments

We present an experimental evaluation of ML-LOO, and compare our method with several state-of-the-art detection methods. Then we consider the setting where attacks have different confidence levels. We further evaluate the transferability of various detection methods on an unknown attack.

4.1 Known Attacks

We compare our method with state-of-the-art detection algorithms, including LID [59], Mahalanobis (MAHA) [60], and KD+BU [58], on three data sets: MNIST, CIFAR-10 and CIFAR-100, with the standard train/test split [63]. For MNIST, we used a convolutional network composed of 32-filter convolutional layers, each followed by a max-pooling layer, and a hidden dense layer with 1024 units. For CIFAR-10 and CIFAR-100, we trained a 20-layer ResNet [64] and a 121-layer DenseNet [65]. For each data set, we generated 2,000 adversarial examples from correctly classified test images with each attack method. Among them, 1,000 adversarial images with the corresponding 1,000 natural images were used to train LID, Mahalanobis and our method. Results are reported on the other 1,000 adversarial images with the corresponding natural images. We consider the following attack methods, grouped by the norms they are optimized for:

  • ℓ∞: FGSM [3], ℓ∞-PGD [4, 8].

  • ℓ2: C&W [7], DeepFool [5], Boundary Attack [15].

  • ℓ0: JSMA [6].

Let the true positive rate (TPR) be the proportion of adversarial images classified as adversarial, and the false positive rate (FPR) be the proportion of natural images classified as adversarial. We report the area under the ROC curve (AUC) as well as the true positive rates obtained by thresholding the FPR at 0.01, 0.05 and 0.1, since in practice it is desirable to keep the proportion of misclassified natural images low.
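This evaluation protocol can be sketched as follows, assuming detector scores are available for natural and adversarial test images (a higher score meaning more likely adversarial); the helper below uses scikit-learn's roc_curve and auc.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

def tpr_at_fpr(scores_nat, scores_adv, fpr_targets=(0.01, 0.05, 0.10)):
    """AUC and TPR at fixed FPR thresholds for a detector's scores."""
    y_true = np.concatenate([np.zeros(len(scores_nat)), np.ones(len(scores_adv))])
    y_score = np.concatenate([scores_nat, scores_adv])
    fpr, tpr, _ = roc_curve(y_true, y_score)
    results = {"AUC": auc(fpr, tpr)}
    for t in fpr_targets:
        # largest achievable TPR while keeping FPR at or below the target
        feasible = fpr <= t
        results[f"TPR@FPR={t}"] = tpr[feasible].max() if np.any(feasible) else 0.0
    return results
```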

The results are reported in Table 1, and the ROC curves on CIFAR-10 with ResNet are shown in Figure 3. The remaining plots can be found in Appendix 6.3 and Appendix 6.4. ML-LOO shows superior performance over the other three detection methods across data sets and models for all attacks optimized for the ℓ∞, ℓ2 and ℓ0 distances. With the FPR controlled at 0.1, our method is able to find over 95% of the adversarial examples generated by most existing attacks.

Data Model Metric Attacks
C&W-MIX C&W-LC C&W-HC
KD+BU LID MAHA ML-LOO KD+BU LID MAHA ML-LOO KD+BU LID MAHA ML-LOO
CIFAR10 ResNet AUC 0.620 0.649 0.640 0.840 0.623 0.445 0.641 0.711 0.829 0.816 0.966 0.988
TPR (FPR@0.01) 0.04 0.01 0.03 0.25 0.01 0.00 0.01 0.12 0.52 0.23 0.51 0.87
TPR (FPR@0.05) 0.17 0.06 0.14 0.42 0.09 0.06 0.10 0.21 0.59 0.43 0.90 0.94
TPR (FPR@0.10) 0.28 0.19 0.21 0.59 0.22 0.11 0.16 0.34 0.60 0.62 0.93 0.97
Data Model Metric Attacks
ℓ∞-PGD-MIX ℓ∞-PGD-LC ℓ∞-PGD-HC
KD+BU LID MAHA ML-LOO KD+BU LID MAHA ML-LOO KD+BU LID MAHA ML-LOO
CIFAR10 ResNet AUC 0.753 0.812 0.813 0.953 0.606 0.578 0.578 0.767 0.834 0.935 0.962 0.996
TPR (FPR@0.01) 0.20 0.10 0.11 0.60 0.01 0.01 0.01 0.09 0.54 0.26 0.46 0.89
TPR (FPR@0.05) 0.37 0.36 0.45 0.77 0.12 0.07 0.04 0.23 0.61 0.67 0.89 0.98
TPR (FPR@0.10) 0.46 0.41 0.56 0.84 0.25 0.17 0.12 0.33 0.62 0.85 0.91 0.99
Table 2: Top: Performance of detection methods trained with C&W-MIX and tested on C&W-LC, C&W-HC and C&W-MIX. Bottom: Performance of detection methods trained with ℓ∞-PGD-MIX and tested on ℓ∞-PGD-LC, ℓ∞-PGD-HC and ℓ∞-PGD-MIX.
Figure 4: The left two figures plot the histogram of confidence levels of C&W-LC, C&W-HC, and C&W-MIX, and the ROC curves of detection methods under the C&W-MIX attack. The right two figures plot the histogram of confidence levels of ℓ∞-PGD-LC, ℓ∞-PGD-HC, and ℓ∞-PGD-MIX, and the ROC curves of detection methods under the ℓ∞-PGD-MIX attack.

4.2 Attacks with varied confidence levels

Lu et al. [33] and Athalye et al. [34] observed that LID fails when the confidence level of adversarial examples generated by the C&W attack varies. We consider adversarial images with varied confidence levels for both ℓ2 and ℓ∞ attacks. We use the C&W attack for the ℓ2 distance and adjust the confidence hyperparameter κ in Equation (3) to achieve mixed confidence levels. For the ℓ∞ distance, we use ℓ∞-PGD and vary the ε constraint to obtain different confidence levels.

C&W attack for the ℓ2 distance

We consider three settings for the C&W attack: low-confidence (C&W-LC), mixed-confidence (C&W-MIX) and high-confidence (C&W-HC), obtained by setting the confidence parameter κ to a small value for C&W-LC and a large value for C&W-HC. For the mixed-confidence C&W attack, we generate adversarial images with the confidence parameter κ in Equation (3) randomly selected from a range of values for each image, so that the distribution of confidence levels of adversarial images is comparable with that of original images. The confidence levels of images under the three settings, along with the confidence levels of original images, are shown in Figure 4. The confidence level in Figure 4 is computed from the probability score of the predicted class.

We carried out the experiments on ResNet trained on CIFAR-10, using adversarial images generated by the mixed-confidence C&W attack, together with the corresponding original images, as the training data for LID, Mahalanobis, KD+BU, and our method. We test the detection methods on a different set of original and adversarial images generated by three versions of the attack: the low-confidence C&W attack, the high-confidence C&W attack, and the mixed-confidence C&W attack. Table 2 (Top) and Figure 4 (Left) show TPRs at different FPR thresholds, AUC, and the ROC curves. Mahalanobis, LID and KD+BU fail to detect mixed-confidence adversarial examples effectively, while our method performs consistently better across the three settings.

Data Model Metric Attacks
ℓ∞-PGD DeepFool FGSM JSMA Boundary
KD+BU LID MAHA ML-LOO KD+BU LID MAHA ML-LOO KD+BU LID MAHA ML-LOO KD+BU LID MAHA ML-LOO KD+BU LID MAHA ML-LOO
CIFAR10 ResNet AUC 0.753 0.763 0.818 0.879 0.618 0.990 0.962 0.992 0.673 0.610 0.730 0.796 0.614 0.984 0.957 0.984 0.676 0.991 0.964 0.994
TPR (FPR@0.01) 0.20 0.08 0.14 0.21 0.01 0.56 0.61 0.72 0.04 0.07 0.06 0.04 0.01 0.43 0.44 0.45 0.03 0.56 0.60 0.82
TPR (FPR@0.05) 0.37 0.35 0.45 0.48 0.10 0.96 0.94 0.96 0.20 0.17 0.22 0.14 0.10 0.93 0.91 0.91 0.20 0.99 0.95 0.97
TPR (FPR@0.10) 0.46 0.45 0.60 0.65 0.24 0.98 0.94 0.99 0.29 0.23 0.34 0.37 0.21 0.98 0.94 0.99 0.38 0.99 0.95 0.99
Table 3: Performance of detection methods trained with C&W and transferred to ℓ∞-PGD, FGSM, JSMA, Boundary and DeepFool.
Figure 5: Transferability of detection methods trained with the C&W attack and tested on ℓ∞-PGD, FGSM, JSMA, Boundary and DeepFool.

ℓ∞-PGD for the ℓ∞ distance

ℓ∞-PGD [8], also known as BIM [4], searches for adversarial examples by iteratively updating the original image as follows:

$$x^{t+1} = \mathrm{Clip}_{x,\varepsilon}\Bigl( x^{t} + \alpha \cdot \mathrm{sign}\bigl( \nabla_{x} L(x^{t}, y_0) \bigr) \Bigr), \tag{4}$$

where $y_0$ is the original class, $L$ is the cross-entropy loss, $\alpha$ is the step size, and the Clip operator clips the image elementwise to an ε-neighborhood of $x$. For the mixed-confidence ℓ∞-PGD attack, we generated adversarial images with different confidence levels by randomly selecting the constraint ε in Equation (4) from a range of values. The confidence levels of images from the mixed-confidence ℓ∞-PGD attack are shown in Figure 4.
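A minimal sketch of the ℓ∞-PGD update in Equation (4) is shown below; the grad_loss callable, step size, iteration count, and [0, 1] pixel range are assumptions for illustration (in practice the gradient would come from the framework's autodiff).

```python
import numpy as np

def linf_pgd(x, y0, grad_loss, eps, alpha=None, steps=10):
    """l_inf-PGD / BIM: iteratively ascend the cross-entropy loss of the
    original class y0, clipping back into an eps-ball around x each step.

    grad_loss(x_adv, y) must return dL/dx for the model (assumed helper).
    """
    alpha = alpha if alpha is not None else eps / steps
    x_adv = x.copy()
    for _ in range(steps):
        g = grad_loss(x_adv, y0)
        x_adv = x_adv + alpha * np.sign(g)          # gradient-sign step
        x_adv = np.clip(x_adv, x - eps, x + eps)    # project onto the eps-ball around x
        x_adv = np.clip(x_adv, 0.0, 1.0)            # keep a valid pixel range
    return x_adv
```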

We used adversarial images generated by the mixed-confidence ℓ∞-PGD attack, together with their corresponding original images, as the training data for all detection methods. We report results on adversarial images generated by three versions of the attack: high-confidence ℓ∞-PGD, low-confidence ℓ∞-PGD, and the mixed-confidence ℓ∞-PGD used to generate the training data. The corresponding original images are different from the training images. Table 2 (Bottom) and Figure 4 (Right) show TPRs at different FPR thresholds, AUC, and the ROC curves. Mahalanobis, LID and KD+BU fail to detect mixed-confidence adversarial examples effectively, while our method performs significantly better across the three settings.

4.3 Transferability

In this experiment, we evaluate the transferability of the detection methods by training them on adversarial examples generated by one attack method and evaluating them on adversarial examples generated by other attack methods. Specifically, we trained all methods on adversarial examples generated by the C&W attack and evaluated them on adversarial examples generated by the remaining attack methods.

Experiments are carried out on the MNIST, CIFAR-10, and CIFAR-100 data sets. AUC and TPRs at different FPR thresholds are reported in Table 3. All methods trained on the C&W attack are capable of detecting adversarial examples generated by an unknown attack, even when the attack is optimized for a different distance or is not gradient-based. The same phenomenon was observed by Lee et al. [60]. This indicates that different attacks might share some common features. Our method consistently yields a slightly higher AUC, and has a significantly higher TPR when the FPR is controlled to be small.

5 Discussion

In this paper, we have introduced a new framework to detect adversarial examples with multi-layer feature attribution, which captures the difference in the scale of feature attribution scores between original and adversarial examples. We show that our detection method outperforms other state-of-the-art detection methods in detecting various kinds of attacks. In particular, we show that our method is able to detect adversarial examples of various confidence levels and transfers between different attacks.

References

  • Goodfellow et al. [2017] Ian Goodfellow, Nicolas Papernot, Sandy Huang, Yan Duan, and Peter Abbeel. Attacking machine learning with adversarial examples. https://blog.openai.com/adversarial-example-research/, 2017.
  • Szegedy et al. [2014] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations, 2014.
  • Goodfellow et al. [2015] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. International Conference on Learning Representations, 2015.
  • Kurakin et al. [2017] Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial machine learning at scale. In International Conference on Learning Representations, 2017.
  • Moosavi-Dezfooli et al. [2016] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. Deepfool: a simple and accurate method to fool deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2574–2582, 2016.
  • Papernot et al. [2016a] Nicolas Papernot, Patrick McDaniel, Somesh Jha, Matt Fredrikson, Z Berkay Celik, and Ananthram Swami. The limitations of deep learning in adversarial settings. In 2016 IEEE European Symposium on Security and Privacy (EuroS&P), pages 372–387. IEEE, 2016a.
  • Carlini and Wagner [2017] Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pages 39–57. IEEE, 2017.
  • Madry et al. [2018] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018.
  • Chen et al. [2017] Pin-Yu Chen, Huan Zhang, Yash Sharma, Jinfeng Yi, and Cho-Jui Hsieh. Zoo: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pages 15–26. ACM, 2017.
  • Ilyas et al. [2018] Andrew Ilyas, Logan Engstrom, Anish Athalye, and Jessy Lin. Black-box adversarial attacks with limited queries and information. In International Conference on Machine Learning, pages 2142–2151, 2018.
  • Ilyas et al. [2019] Andrew Ilyas, Logan Engstrom, and Aleksander Madry. Prior convictions: Black-box adversarial attacks with bandits and priors. In International Conference on Learning Representations, 2019.
  • Liu et al. [2017] Yanpei Liu, Xinyun Chen, Chang Liu, and Dawn Song. Delving into transferable adversarial examples and black-box attacks. In Proceedings of the International Conference on Learning Representations (ICLR), 2017.
  • Papernot et al. [2016b] Nicolas Papernot, Patrick McDaniel, and Ian Goodfellow. Transferability in machine learning: from phenomena to black-box attacks using adversarial samples. arXiv preprint arXiv:1605.07277, 2016b.
  • Papernot et al. [2017] Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z Berkay Celik, and Ananthram Swami. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, pages 506–519. ACM, 2017.
  • Brendel et al. [2018] Wieland Brendel, Jonas Rauber, and Matthias Bethge. Decision-based adversarial attacks: Reliable attacks against black-box machine learning models. In International Conference on Learning Representations, 2018.
  • Brunner et al. [2018] Thomas Brunner, Frederik Diehl, Michael Truong Le, and Alois Knoll. Guessing smart: Biased sampling for efficient black-box adversarial attacks. arXiv preprint arXiv:1812.09803, 2018.
  • Tanay and Griffin [2016] Thomas Tanay and Lewis Griffin. A boundary tilting persepective on the phenomenon of adversarial examples. arXiv preprint arXiv:1608.07690, 2016.
  • Fawzi et al. [2018] Alhussein Fawzi, Omar Fawzi, and Pascal Frossard. Analysis of classifiers’ robustness to adversarial perturbations. Machine Learning, 107(3):481–508, 2018.
  • Fawzi et al. [2016] Alhussein Fawzi, Seyed-Mohsen Moosavi-Dezfooli, and Pascal Frossard. Robustness of classifiers: from adversarial to random noise. In Advances in Neural Information Processing Systems, pages 1632–1640, 2016.
  • Shrikumar et al. [2017] Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through propagating activation differences. In International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3145–3153. PMLR, 06–11 Aug 2017.
  • Bach et al. [2015] Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS One, 10(7):e0130140, 2015.
  • Simonyan et al. [2013] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
  • Ribeiro et al. [2016] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Why should I trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144. ACM, 2016.
  • Li et al. [2016a] Jiwei Li, Xinlei Chen, Eduard Hovy, and Dan Jurafsky. Visualizing and understanding neural models in nlp. In Proceedings of NAACL-HLT, pages 681–691, 2016a.
  • Baehrens et al. [2010] David Baehrens, Timon Schroeter, Stefan Harmeling, Motoaki Kawanabe, Katja Hansen, and Klaus-Robert Müller. How to explain individual classification decisions. Journal of Machine Learning Research, 11:1803–1831, 2010.
  • Lipton [2016] Zachary C Lipton. The mythos of model interpretability. arXiv preprint arXiv:1606.03490, 2016.
  • Li et al. [2016b] Jiwei Li, Will Monroe, and Dan Jurafsky. Understanding neural networks through representation erasure. arXiv preprint arXiv:1612.08220, 2016b.
  • Lundberg and Lee [2017] Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, pages 4765–4774, 2017.
  • Štrumbelj and Kononenko [2010] Erik Štrumbelj and Igor Kononenko. An efficient explanation of individual classifications using game theory. Journal of Machine Learning Research, 11:1–18, 2010.
  • Datta et al. [2016] Anupam Datta, Shayak Sen, and Yair Zick. Algorithmic transparency via quantitative input influence: Theory and experiments with learning systems. In Security and Privacy (SP), 2016 IEEE Symposium on, pages 598–617. IEEE, 2016.
  • Sundararajan et al. [2017] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 3319–3328. JMLR. org, 2017.
  • Chen et al. [2019] Jianbo Chen, Le Song, Martin J. Wainwright, and Michael I. Jordan. L-shapley and C-shapley: Efficient model interpretation for structured data. In International Conference on Learning Representations, 2019.
  • Lu et al. [2018] Pei-Hsuan Lu, Pin-Yu Chen, and Chia-Mu Yu. On the limitation of local intrinsic dimensionality for characterizing the subspaces of adversarial examples. arXiv preprint arXiv:1803.09638, 2018.
  • Athalye et al. [2018] Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In International Conference on Machine Learning, pages 274–283, 2018.
  • Zeiler and Fergus [2014] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European conference on computer vision, pages 818–833. Springer, 2014.
  • Ghorbani et al. [2017] Amirata Ghorbani, Abubakar Abid, and James Zou. Interpretation of neural networks is fragile. arXiv preprint arXiv:1710.10547, 2017.
  • Yeh et al. [2019] Chih-Kuan Yeh, Cheng-Yu Hsieh, Arun Sai Suggala, David Inouye, and Pradeep Ravikumar. How sensitive are sensitivity-based explanations? arXiv preprint arXiv:1901.09392, 2019.
  • Chalasani et al. [2018] Prasad Chalasani, Somesh Jha, Aravind Sadagopan, and Xi Wu. Adversarial learning and explainability in structured datasets. arXiv preprint arXiv:1810.06583, 2018.
  • Tramèr et al. [2018] Florian Tramèr, Alexey Kurakin, Nicolas Papernot, Ian Goodfellow, Dan Boneh, and Patrick McDaniel. Ensemble adversarial training: Attacks and defenses. In International Conference on Learning Representations, 2018.
  • Liu and Hsieh [2019] Xuanqing Liu and Cho-Jui Hsieh. Rob-gan: Generator, discriminator, and adversarial attacker. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2019.
  • Miyato et al. [2016] Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, Ken Nakae, and Shin Ishii. Distributional smoothing with virtual adversarial training. In International Conference on Learning Representations, 2016.
  • Papernot et al. [2016c] Nicolas Papernot, Patrick McDaniel, Xi Wu, Somesh Jha, and Ananthram Swami. Distillation as a defense to adversarial perturbations against deep neural networks. In 2016 IEEE Symposium on Security and Privacy (SP), pages 582–597. IEEE, 2016c.
  • Song et al. [2018] Yang Song, Taesup Kim, Sebastian Nowozin, Stefano Ermon, and Nate Kushman. Pixeldefend: Leveraging generative models to understand and defend against adversarial examples. In International Conference on Learning Representations, 2018.
  • Xu et al. [2018] Weilin Xu, David Evans, and Yanjun Qi. Feature squeezing: Detecting adversarial examples in deep neural networks. In 25th Annual Network and Distributed System Security Symposium, NDSS 2018, San Diego, California, USA, February 18-21, 2018, 2018.
  • Liu et al. [2018] Xuanqing Liu, Minhao Cheng, Huan Zhang, and Cho-Jui Hsieh. Towards robust neural networks via random self-ensemble. In Proceedings of the European Conference on Computer Vision (ECCV), pages 369–385, 2018.
  • Lecuyer et al. [2019] Mathias Lecuyer, Vaggelis Atlidakis, Roxana Geambasu, Daniel Hsu, and Suman Jana. Certified robustness to adversarial examples with differential privacy. In IEEE Symposium on Security and Privacy, 2019.
  • Liu et al. [2019] Xuanqing Liu, Yao Li, Chongruo Wu, and Cho-Jui Hsieh. Adv-bnn: Improved adversarial defense through robust bayesian neural network. In International Conference on Learning Representations, 2019.
  • Wong and Kolter [2018] Eric Wong and J Zico Kolter. Provable defenses against adversarial examples via the convex outer adversarial polytope. In International Conference on Machine Learning, 2018.
  • Dvijotham et al. [2018] Krishnamurthy Dvijotham, Sven Gowal, Robert Stanforth, Relja Arandjelovic, Brendan O’Donoghue, Jonathan Uesato, and Pushmeet Kohli. Training verified learners with learned verifiers. arXiv preprint arXiv:1805.10265, 2018.
  • Schmidt et al. [2018] Ludwig Schmidt, Shibani Santurkar, Dimitris Tsipras, Kunal Talwar, and Aleksander Madry. Adversarially robust generalization requires more data. In Advances in Neural Information Processing Systems, pages 5019–5031, 2018.
  • Tsipras et al. [2018] Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. There is no free lunch in adversarial robustness (but there are unexpected benefits). arXiv preprint arXiv:1805.12152, 2018.
  • Li and Li [2017] Xin Li and Fuxin Li. Adversarial examples detection in deep networks with convolutional filter statistics. In Proceedings of the IEEE International Conference on Computer Vision, pages 5764–5772, 2017.
  • Bhagoji et al. [2018] Arjun Nitin Bhagoji, Daniel Cullina, Chawin Sitawarin, and Prateek Mittal. Enhancing robustness of machine learning systems via data transformations. In 2018 52nd Annual Conference on Information Sciences and Systems (CISS), pages 1–5. IEEE, 2018.
  • Hendrycks and Gimpel [2017] Dan Hendrycks and Kevin Gimpel. Early methods for detecting adversarial images. In International Conference on Learning Representations, 2017.
  • Grosse et al. [2017] Kathrin Grosse, Praveen Manoharan, Nicolas Papernot, Michael Backes, and Patrick McDaniel. On the (statistical) detection of adversarial examples. arXiv preprint arXiv:1702.06280, 2017.
  • Gong et al. [2017] Zhitao Gong, Wenlu Wang, and Wei-Shinn Ku. Adversarial and clean data are not twins. arXiv preprint arXiv:1704.04960, 2017.
  • Metzen et al. [2017] Jan Hendrik Metzen, Tim Genewein, Volker Fischer, and Bastian Bischoff. On detecting adversarial perturbations. In International Conference on Learning Representations, 2017.
  • Feinman et al. [2017] Reuben Feinman, Ryan R Curtin, Saurabh Shintre, and Andrew B Gardner. Detecting adversarial samples from artifacts. arXiv preprint arXiv:1703.00410, 2017.
  • Ma et al. [2018] Xingjun Ma, Bo Li, Yisen Wang, Sarah M. Erfani, Sudanthi Wijewickrema, Grant Schoenebeck, Michael E. Houle, Dawn Song, and James Bailey. Characterizing adversarial subspaces using local intrinsic dimensionality. In International Conference on Learning Representations, 2018.
  • Lee et al. [2018] Kimin Lee, Kibok Lee, Honglak Lee, and Jinwoo Shin. A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In Advances in Neural Information Processing Systems, pages 7167–7177, 2018.
  • Tao et al. [2018] Guanhong Tao, Shiqing Ma, Yingqi Liu, and Xiangyu Zhang. Attacks meet interpretability: Attribute-steered detection of adversarial samples. In Advances in Neural Information Processing Systems, pages 7728–7739, 2018.
  • Zhang et al. [2018] Chiliang Zhang, Zuochang Ye, Yan Wang, and Zhimou Yang. Detecting adversarial perturbations with saliency. In 2018 IEEE 3rd International Conference on Signal and Image Processing (ICSIP), pages 271–275. IEEE, 2018.
  • Chollet et al. [2015] François Chollet et al. Keras. https://github.com/fchollet/keras, 2015.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European conference on computer vision, pages 630–645. Springer, 2016.
  • Huang et al. [2017] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708, 2017.

6 Appendix

6.1 Performance of Integrated Gradients

In this section, we evaluate the detection of adversarial examples by thresholding the IQR of another popular feature attribution method, Integrated Gradients (IG), and compare it with KD+BU, LID, MAHA, and ML-LOO. We consider three attacks, FGSM, C&W and JSMA, which are optimized for the ℓ∞, ℓ2 and ℓ0 distances respectively, on the CIFAR-10 data set with ResNet. We can see that the IQR of IG achieves competitive performance in detecting adversarial examples, but it is not as powerful as detection methods that incorporate multi-layer information, such as LID, MAHA and our proposed ML-LOO. The IG feature is also not as effective as the LOO feature (whose performance is shown in Figure 8).
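For reference, a short sketch of Integrated Gradients with a Riemann-sum approximation of the path integral; the grad_fn callable (gradient of the predicted-class score with respect to the input), the zero baseline, and the 50-step discretization are illustrative assumptions.

```python
import numpy as np

def integrated_gradients(x, grad_fn, baseline=None, steps=50):
    """IG attribution: (x - x') times the average gradient along the straight
    path from the baseline x' to the input x (Riemann-sum approximation).

    grad_fn(z) must return the gradient of the class score with respect to z
    (assumed helper, e.g. obtained from the framework's autodiff).
    """
    baseline = np.zeros_like(x) if baseline is None else baseline
    alphas = np.linspace(0.0, 1.0, steps + 1)[1:]      # exclude alpha = 0
    grads = np.stack([grad_fn(baseline + a * (x - baseline)) for a in alphas])
    return (x - baseline) * grads.mean(axis=0)
```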

Figure 6: ROC curves of detection methods on CIFAR-10 dataset with ResNet. We restrict FPR between 0 and 0.2, which is meaningful in practice. See Appendix 6.4 for full plots.

6.2 Comparison Based on Dispersion Measures

In this section, we compare the performance of detection using three different dispersion measures of feature attributions: IQR, STD and MAD.

Figure 7 compares the histograms of these three dispersion measures of feature attributions for ResNet on natural test images from CIFAR-10 with those on adversarially perturbed images, where the adversarial perturbation is carried out by the C&W attack. There is a significant difference in the distributions of the dispersion measures between natural and adversarial images.

Figure 8 shows the ROC curves of the three dispersion statistics on CIFAR-10 with ResNet. All three dispersion measures achieve competitive performance in detecting adversarial examples generated by the three attacks C&W, JSMA and ℓ∞-PGD, but IQR achieves the largest AUC values across all attack methods.

Figure 7: Histogram of Statistics
Figure 8: ROC curves of different statistics on CIFAR-10 dataset with ResNet

6.3 ROC curves of detection methods on CIFAR-10, MNIST and CIFAR-100 data sets with FPR from 0.0 to 0.2

Figure 9, Figure 10, Figure 11, Figure 12, and Figure 13 show the ROC curves of the four detection methods (LID, MAHA, KD+BU, ML-LOO) on three data sets (CIFAR-10, MNIST, CIFAR-100) with three models (CNN, ResNet, DenseNet) under six attacks (FGSM, JSMA, C&W, DeepFool, Boundary, ℓ∞-PGD), with the FPR restricted to the range 0.0 to 0.2, which is the setting of practical interest. The ROC curves with FPR from 0.0 to 1.0 are shown in Appendix 6.4.

Figure 9: ROC curves of detection methods on MNIST dataset with CNN
Figure 10: ROC curves of detection methods on CIFAR-10 dataset with ResNet
Figure 11: ROC curves of detection methods on CIFAR-10 dataset with DenseNet
Figure 12: ROC curves of detection methods on CIFAR-100 dataset with ResNet
Figure 13: ROC curves of detection methods on CIFAR-100 dataset with DenseNet

6.4 ROC curves of detection methods on CIFAR-10, MNIST and CIFAR-100 data sets with FPR from 0.0 to 1.0

In this section, we show the ROC curves of the four detection methods (LID, MAHA, KD+BU, ML-LOO) on three data sets (CIFAR-10, MNIST, CIFAR-100) with three models (CNN, ResNet, DenseNet) under six attacks (FGSM, JSMA, C&W, DeepFool, Boundary, ℓ∞-PGD), with the FPR ranging from 0.0 to 1.0.

Figure 14: ROC curves of detection methods on CIFAR-10 dataset with ResNet
Figure 15: ROC curves of detection methods on CIFAR-10 dataset with DenseNet
Figure 16: ROC curves of detection methods on MNIST dataset with CNN
Figure 17: ROC curves of detection methods on CIFAR-100 dataset with ResNet
Figure 18: ROC curves of detection methods on CIFAR-100 dataset with DenseNet