Deep neural networks (DNNs) are known to be highly vulnerable to adversarial examples (AEs) [szegedy2013intriguing]. AEs are maliciously generated by adding human-imperceptible perturbations to clean examples, compromising a network into producing the attacker-desired incorrect predictions [dong2018boosting]. This vulnerability challenges the deployment of DNNs in security-critical applications such as face recognition [parkhi2015deep] and autonomous driving [autodrive]. In particular, adversarial attacks in medical image analysis can be disastrous, as they can manipulate a patient's disease diagnosis and cause serious downstream harm. More disturbingly, recent studies have shown that DNNs for medical image analysis [zhou2015medical, zhou2017deep, zhou2019handbook], including disease diagnosis [paschali2018generalizability, finlayson2018adversarial, ma2020understanding], organ segmentation [ozbulak2019impact], and landmark detection [yao2020miss], are more vulnerable to AEs than their counterparts for natural images.
On the other hand, recent works [ma2020understanding] have shown that, unlike adversarial examples of natural images, medical adversarial examples can be easily detected in the hierarchical feature space. To illustrate the distribution differences between clean and adversarial examples in the feature space, we plot the 2D t-SNE [t-SNE] of their features from the penultimate layer of a well-trained pneumonia classifier in Fig. 1. It reveals that adversarial attacks move the deep representations from the original distribution to extreme outlier positions in order to compromise the classifier. As a result, a defender can easily exploit this intrinsic characteristic of adversarial examples, either by learning a decision boundary between clean and adversarial examples or by distinguishing them directly with anomaly-detection-based methods. Given this phenomenon, two key questions are investigated in this paper. The first is: What causes medical adversarial examples to be more easily detected than natural adversarial examples? To better understand the problem, we conduct both empirical and theoretical analyses of medical adversarial examples and compare them with natural images. First, we demonstrate in a stress test, which aims to distort features via adversarial attack, that medical features are more vulnerable than natural ones. Then, we theoretically prove that the attack optimizes the representations in a nearly consistent direction. The consequence of such consistent guidance is that the vulnerable representations are pushed to outlier regions where clean example features rarely reside. The second question is: If possible, how can a medical adversarial example be hidden from being spotted in the feature space?
Intuitively, if the attacker could imitate the feature distribution of clean examples while manipulating the final logits, the attack would not only deceive anomaly-detection-based detectors but also bypass any decision boundary trained to spot extreme outlier AE features. A straightforward idea is to select a guide example and force the representations of the adversarial example to be close to those of the guide image throughout the hierarchical feature space [feature_iclr]. However, different medical images have different backgrounds and lesions, so it is difficult to force the adversarial representation to match the guide one in all layers within the limits of a small perturbation. To find where to hide the adversarial representation within the normal feature distribution, we propose a novel hierarchical feature constraint (HFC), an add-on term that can be plugged into all existing attacks. HFC first models the normal feature distribution of each activation layer with a Gaussian mixture model and then pushes adversarial examples toward regions where the corresponding log-likelihood is maximized. We perform extensive experiments on two public medical diagnosis datasets to validate the effectiveness of HFC. HFC helps an attacker bypass several state-of-the-art adversarial detectors while keeping the perturbation under a strict constraint, and it greatly outperforms other methods at manipulating adversarial representations. Furthermore, HFC bypasses the detectors in the gray-box setting, where only the backbone network is known. Finally, HFC extends to attacks on natural images. Our experiments support that the great vulnerability of medical features gives an attacker more room to manipulate adversarial representations. Overall, we highlight the following contributions:
We investigate the intrinsic characteristics of medical images and shed light on why medical adversarial examples can be more easily detected, when compared with adversarial examples of natural images.
We propose a hierarchical feature constraint (HFC), a novel plug-in that can be applied to all existing attacks to lower their chance of being detected.
Extensive experiments validate that our HFC bypasses several state-of-the-art adversarial detectors with small perturbations in both white- and gray-box settings.
2 Related Work
Given a clean image $x$ with its ground-truth label $y$ and a DNN classifier $f$ with pretrained parameters $\theta$, the classifier predicts the class of the input example via:

$$\hat{y} = \arg\max_k \, p_k, \quad p_k = \mathrm{softmax}(Z(x))_k,$$

where the logits output $Z_k(x)$ (with respect to class $k$) is given as $Z_k(x) = w_k^\top z + b_k$, in which $z$ is the activation of the penultimate layer with $N$ dimensions; $w_k$ and $b_k$ are the weights and the bias from the final dense layer, respectively; and $p_k$ is the probability of $x$ belonging to class $k$. A common way of crafting an adversarial attack is to manipulate the classifier's prediction by minimizing (in this work, we focus on targeted adversarial attacks) the classification error between the prediction and the target class $t$, while keeping the adversarial example $x_{adv}$ within a small $\epsilon$-ball of the $\ell_p$-norm [PGD] centered at the original sample $x$, i.e., $\|x_{adv} - x\|_p \le \epsilon$, where $\epsilon$ is the perturbation budget.
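The $\epsilon$-ball constraint above amounts to a simple projection step after each perturbation update. A minimal numpy sketch (the $[0, 1]$ pixel range is an assumption for illustration):

```python
import numpy as np

def project_linf(x_adv, x, eps):
    """Project an adversarial candidate back into the L-infinity
    eps-ball around the clean image x, then into the valid range."""
    x_adv = np.clip(x_adv, x - eps, x + eps)
    return np.clip(x_adv, 0.0, 1.0)
```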
2.1 Adversarial attacks
A wide range of gradient-based and optimization-based attacks have been proposed to generate AEs under different norms. The Jacobian-based saliency map attack (JSMA) [JSMA] crafts AEs under the $\ell_0$-norm, modifying a few pixels to change the loss as much as possible. DeepFool [deepfool] is an $\ell_2$-norm attack that applies smaller perturbations by moving the input sample to its closest decision boundary. Another effective $\ell_2$-norm attack proposed by Carlini and Wagner (CW attack) [cwattack] takes a Lagrangian form and adopts Adam [adam] for optimization. The elastic-net attack (EAD) [EAD] extends the CW attack to the $\ell_1$-norm by including an $\ell_1$ regularization term. In this paper, we focus on state-of-the-art $\ell_\infty$-norm adversarial attacks, which are most commonly used due to their consistency with human perception [PGD]. Existing approaches fall into three categories. The first is one-step gradient-based approaches, such as the fast gradient sign method (FGSM) [goodfellow2014explaining], which generates an adversarial example by taking a single step that minimizes the loss $J(x_{adv}, y_t)$, where $J$ is often chosen as the cross-entropy loss and $\epsilon$ is the norm bound:

$$x_{adv} = x - \epsilon \cdot \mathrm{sign}\left(\nabla_x J(x, y_t)\right).$$
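The one-step targeted update can be sketched as follows (a minimal numpy sketch; the gradient of the loss w.r.t. the input is assumed to be precomputed by the framework and passed in):

```python
import numpy as np

def fgsm_targeted(x, grad, eps, lo=0.0, hi=1.0):
    """One-step targeted FGSM: step *against* the sign of the gradient
    of the target-class loss (i.e., minimize it), with budget eps.
    `grad` is the precomputed gradient of the loss w.r.t. x."""
    return np.clip(x - eps * np.sign(grad), lo, hi)
```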
The second category is iterative methods. The basic iterative method (BIM) [bim] is an iterative version of FGSM, which updates the perturbation with a smaller step size $\alpha$ and keeps it within the $\epsilon$-ball via the projection function $\Pi_\epsilon$:

$$x_{adv}^{(i+1)} = \Pi_\epsilon\!\left(x_{adv}^{(i)} - \alpha \cdot \mathrm{sign}\!\left(\nabla_x J(x_{adv}^{(i)}, y_t)\right)\right). \tag{3}$$
Different from BIM, another iterative method, projected gradient descent (PGD) [PGD], uses a random start $x_{adv}^{(0)} = x + \eta$, where $\eta$ is uniform noise between $-\epsilon$ and $\epsilon$, and then perturbs the input by Eq. (3) iteratively. Furthermore, the momentum iterative method (MIM) [dong2018boosting] improves transferability by integrating a momentum term into the iterative process. The last category is optimization-based methods, a representative of which is the Carlini and Wagner (CW) attack [cwattack]. According to [PGD], the $\ell_\infty$ version of the CW attack can be solved by the PGD algorithm using the following objective function:
$$\mathcal{L}_{CW} = \max\left(\max_{i \ne t} Z_i - Z_t + \kappa,\; 0\right), \tag{4}$$

where $Z_t$ is the logit with respect to the target class, $\max_{i \ne t} Z_i$ is the maximum logit of the remaining classes, and $\kappa$ is a parameter managing the confidence. (We set $\kappa$ to the average difference between the largest and penultimate logits for each dataset.)
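This margin-style objective can be sketched in a few lines of plain numpy, operating on a single logit vector (names are illustrative):

```python
import numpy as np

def cw_margin_loss(logits, target, kappa=0.0):
    """CW-style objective: positive until the target logit exceeds
    the best non-target logit by the confidence margin kappa."""
    z_t = logits[target]
    z_best_other = np.max(np.delete(logits, target))
    return float(max(z_best_other - z_t + kappa, 0.0))
```

Note that the loss (and hence its gradient) is exactly zero once the margin is satisfied, which matters for the outlier analysis in Sec. 3.3.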
2.2 Adversarial defenses
Plenty of proactive defense approaches have been proposed to defend against adversarial attacks, such as feature squeezing [xu2017feature], distillation networks [papernot2016distillation], input transformations (e.g., JPEG compression [jpeg], autoencoder-based denoising [liao2018defense], and regularization [ross2017improving]), Parseval networks [cisse2017parseval], gradient masking [masking], randomization [liu2018towards, dhillon2018stochastic], radial basis mapping kernels [taghanaki2019kernelized], and non-local context encoders [he2019non]. Per [dong2019benchmarking], PGD-based adversarial (re)training [goodfellow2014explaining, tramer2017ensemble, PGD] is the most robust defense strategy; it augments the training set with adversarial examples but consumes considerable training time. However, these defenses can be bypassed either completely or partially by adaptive attacks [CW_bpda, CW_ten, tramer2020adaptive]. Different from the challenging proactive defenses, recent work has focused on reactive defense, which aims at separating AEs from clean examples with high accuracy [meng2017magnet, miller2020adversarial, zheng2018robust]. In particular, several emerging works shed light on the intrinsic characteristics of the high-dimensional feature subspace [zheng2018robust, li2017adversarial]. Some use learning-based methods (e.g., RBF-SVM [SaftyNet], DNN [metzen2017detecting]) to train a decision boundary between the clean and adversarial distributions in the feature space. Another line of research is k-nearest-neighbor (kNN) based methods [dubey2019defense, papernot2018deep, cohen2020detecting], which make predictions according to the logits (or classes) of the kNNs in the feature space. Furthermore, anomaly-detection-based methods are also suitable for detecting AEs: Feinman et al. [kde] estimated the kernel density of features; Ma et al. [ma2018characterizing] characterized the dimensional properties of the adversarial subspaces by local intrinsic dimensionality (LID); Lee et al. [MAHA] measured the degree of outlierness by a Mahalanobis-distance (MAHA) based confidence score. Notably, Ma et al. [ma2020understanding] showed that medical AEs are much easier to detect than natural ones (with 100% accuracy). A similar conclusion is drawn in [li2020robust], which motivates us to explore the reason behind this phenomenon and evaluate the robustness of those detectors.
3 Why are Medical AEs Easy to Detect?
3.1 Vulnerability of representations
To fully understand the intrinsic characteristics of medical adversarial examples, we first perform a stress test to evaluate the robustness of the deep representations. Specifically, we aim to change the activation values as much as possible via adversarial attack. In implementation, we try to decrease and increase the activation values by replacing the loss function in BIM with $\mathcal{L} = \sum_i z_i$ and $\mathcal{L} = -\sum_i z_i$, respectively, where $z_i$ denotes the $i$-th feature of the chosen activation layer. We execute the stress attack on a medical dataset (Fundoscopy [aptos]) and a natural dataset (CIFAR-10 [cifar]). The comparison results in Table 1 demonstrate that the changes caused by attacks on medical images are larger than those on natural ones, indicating that the representations of medical images are easier to attack; in other words, medical image representations are much more vulnerable.
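The stress-test objective amounts to maximizing or minimizing the sum of one layer's activations inside a BIM-style loop. A minimal sketch (the exact loss form used in the paper may differ; `direction` is an illustrative parameter):

```python
import numpy as np

def stress_loss(features, direction):
    """Stress-test objective: direction=+1 drives all activations of a
    chosen layer down when minimized, direction=-1 drives them up.
    Plugged into a BIM-style loop in place of the classification loss."""
    return direction * float(np.sum(features))
```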
3.2 Consistency of gradient direction
We then investigate the loss function and the corresponding gradient on the final logits output $Z$. In each iteration of the approaches introduced above, the attacks increase the logit of the target class and decrease the other logits at the same time. Therefore, gradients pointing in similar directions occur across different iterations of various attacks, and are back-propagated according to the chain rule.

Theorem 1. Consider a binary disease diagnosis network and its representations from the penultimate layer (we provide the theoretical analysis and empirical results for multi-class classification in the supplementary material); the directions of the corresponding gradients are fixed during each iteration under adversarial attack. (We provide the proof in the supplementary material. As a representative, we use the adversarial attack to convert the prediction of the diagnosis network from 0 to 1.)

Implication. The partial derivative of the cross-entropy loss $\mathcal{L}$ with respect to the activation value $z_i$ of the $i$-th node in the penultimate layer is computed as:

$$\frac{\partial \mathcal{L}}{\partial z_i} = (1 - p_1)(w_{i,0} - w_{i,1}),$$

where $p_1$ denotes the prediction confidence of class 1 and $w_{i,j}$ denotes the weight between the $i$-th node in the penultimate layer and the $j$-th node in the last layer. Accordingly, the component with a bigger difference between $w_{i,1}$ and $w_{i,0}$ will increase more (guided by the gradient) under adversarial attack. We plot the similarity between the value changes and $w_{i,1} - w_{i,0}$ in Fig. 2(a). Similar conclusions can be derived when the attacker chooses different approaches to increase the targeted logit and suppress the remaining ones, e.g., the CW attack. Hence, we calculate the similarity of the value changes among different adversarial approaches and different iterations; the results are shown in Fig. 2(b). These similar changes in the feature space explain why an adversarial detector such as RBF-SVM [SaftyNet] (trained on a single attack) transfers well to different attacks [ma2020understanding].
3.3 Extremely OOD activation values
Since the activation values are vulnerable and iteratively updated in a consistent direction, a few of them likely increase to extremely large values that clean activations are unlikely to reach. Suppose we have $n$ clean samples with their AEs, and the penultimate layer of the network has $N$ activation values. To quantify the abnormality of the outliers, we first collect the activation values of the penultimate layer generated by the clean examples and the AEs, and store them as matrices $F$ and $F_{adv}$, respectively. Next, we perform a column-wise normalization: for each column of $F$ or $F_{adv}$, we divide each of its entries by the corresponding column-wise maximum calculated from $F$ only. This results in normalized matrices $\hat{F}$ and $\hat{F}_{adv}$. (Code can be found in the supplementary material.)
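The column-wise normalization described above can be sketched as follows (numpy; `f_clean` and `f_adv` are the $n \times N$ activation matrices):

```python
import numpy as np

def columnwise_normalize(f_clean, f_adv):
    """Divide each column of both activation matrices by the
    per-column maximum computed from the CLEAN matrix only, so
    adversarial outliers appear as values well above 1."""
    col_max = f_clean.max(axis=0)
    col_max = np.where(col_max == 0, 1.0, col_max)  # guard dead units
    return f_clean / col_max, f_adv / col_max
```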
We illustrate the distributions of maximum values and standard deviations in Fig. 3. The maximum values and standard deviations of adversarial examples are much greater than those of clean images, which means that several activation values become extremely larger than the normal ones. As the perturbation budget rises, the degree of outlierness increases accordingly. It is worth noting that when the perturbation budget is large, the adversarial activation values of BIM keep increasing while those of CW stop, because the gradient of the CW loss is zero once $Z_t - \max_{i \ne t} Z_i > \kappa$. This intrinsic characteristic of the outliers in the feature space explains why many out-of-distribution (OOD) detection methods can detect AEs with high accuracy, especially for medical images [ma2020understanding]. However, we show in Sec. 4 that the attacker is able to take advantage of the fragility of medical AE features and hide them from being spotted.
4 Adversarial attack with a hierarchical feature constraint
Here, we demonstrate how to hide the adversarial representation within the normal feature distribution. Our intuition is to derive a term that measures the distance from the adversarial representation to the normal feature distribution, so that the adversarial representation can be pushed toward the normal distribution along the shortest path by directly minimizing this term during each gradient-descent iteration of the adversarial attack.

Modeling the normal feature distribution: We model the normal feature distribution using a Gaussian mixture model (GMM) as follows:
where $p(f_l(x) \mid t)$ is the probability density of sample $x$ in the target class $t$; $f_l(\cdot)$ denotes the mapping function, i.e., the deep representation of the $l$-th activation layer with parameters $\theta_l$; $\pi_k$ is the mixture coefficient, subject to $\sum_k \pi_k = 1$; and $\mu_k$ and $\Sigma_k$ are the mean and covariance matrix of the $k$-th Gaussian component in the mixture model. These parameters are trained by the expectation-maximization (EM) algorithm [EM] on the data belonging to the target class $t$. For a given input $x$, we separately compute the log-likelihood of the adversarial feature relative to each component and find the most probable Gaussian component:

$$k^{*} = \arg\max_{k} \left[ \log \pi_k + \log \mathcal{N}\!\left(f_l(x);\, \mu_k, \Sigma_k\right) \right]. \tag{7}$$
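Selecting the most probable component from a fitted GMM can be sketched as follows (plain numpy; in practice the EM fit itself would come from a library such as scikit-learn, and the parameter names here are illustrative):

```python
import numpy as np

def gaussian_logpdf(f, mu, cov):
    """Log-density of feature vector f under N(mu, cov)."""
    d = len(f)
    diff = f - mu
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)

def most_probable_component(f, pis, mus, covs):
    """Index of the component maximizing log pi_k + log N(f; mu_k, cov_k)."""
    scores = [np.log(p) + gaussian_logpdf(f, m, c)
              for p, m, c in zip(pis, mus, covs)]
    return int(np.argmax(scores))
```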
Then we focus on maximizing the log-likelihood of this chosen component to hide the adversarial representation.

Hierarchical feature constraint: To avoid detection by outlier detectors, we apply the constraint of Eq. (7), ignoring the constant terms, to all DNN layers. The hierarchical feature constraint induces a loss $\mathcal{L}_{hfc}$ formulated as:

$$\mathcal{L}_{hfc} = -\sum_{l} \lambda_l \log \mathcal{N}\!\left(f_l(x);\, \mu_{k^{*}}, \Sigma_{k^{*}}\right),$$

where $\lambda_l$ is a weighting factor that controls the contribution of the constraint in layer $l$. Algorithm 1 shows the pseudo-code for the adversarial attack with the hierarchical feature constraint. Given an input image $x$, the goal is to find an adversarial example that is misclassified to the target class $t$ while keeping the deep representation close to the normal feature distribution. Here, we focus on AEs under the $\ell_\infty$ constraint. We first model the normal hierarchical features of the training data with the GMM. Then, we extend the attacking process of BIM by replacing the original loss function in Eq. (3) with:

$$\mathcal{L} = \mathcal{L}_{cls} + \mathcal{L}_{hfc},$$

where $\mathcal{L}_{cls}$ is the classification loss, the same as Eq. (4) in the CW attack, and $\mathcal{L}_{hfc}$ is the HFC loss term.
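Combining the two terms in each BIM step reduces to a weighted sum; a trivial sketch (names are illustrative; the per-layer negative log-likelihoods are assumed precomputed):

```python
def hfc_attack_loss(cls_loss, layer_nlls, lambdas):
    """Total objective for one BIM step: the classification loss plus
    the weighted negative log-likelihoods (HFC terms), one per layer."""
    return cls_loss + sum(lam * nll for lam, nll in zip(lambdas, layer_nlls))
```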
Table 2 (Fundoscopy). Each cell reports AUC / TPR@90 in %; for each attack (Adv. Acc in parentheses), the left column is the vanilla attack and the right column the attack with HFC.

| Detector | MIM (99.5) | MIM+HFC | BIM (99.5) | BIM+HFC | PGD (99.5) | PGD+HFC | CW (99.5) | CW+HFC |
|---|---|---|---|---|---|---|---|---|
| KD | 98.8 / 72.0 | 96.3 / 10.0 | 99.0 / 74.2 | 96.8 / 20.5 | 99.4 / 73.4 | 98.6 / 13.2 | 99.5 / 74.7 | 99.1 / 19.6 |
| MAHA | 100 / 7.8 | 100 / 0.0 | 99.6 / 6.4 | 99.5 / 0.0 | 100 / 4.2 | 100 / 0.0 | 99.8 / 33.0 | 99.5 / 0.0 |
| LID | 98.8 / 67.1 | 99.1 / 31.5 | 99.8 / 78.3 | 100 / 40.6 | 99.6 / 73.2 | 98.6 / 35.5 | 98.8 / 73.4 | 97.7 / 33.3 |
| SVM | 96.9 / 27.3 | 99.5 / 27.3 | 99.5 / 28.6 | 99.1 / 0.0 | 99.8 / 23.1 | 99.5 / 0.0 | 99.8 / 27.0 | 99.5 / 0.0 |
| DNN | 100 / 31.5 | 100 / 0.5 | 100 / 60.0 | 100 / 12.8 | 100 / 58.6 | 100 / 8.2 | 100 / 62.6 | 100 / 15.1 |
| BU | 89.9 / 33.5 | 60.7 / 0.0 | 58.9 / 37.4 | 9.1 / 0.0 | 61.9 / 35.9 | 9.1 / 0.0 | 93.0 / 32.8 | 73.1 / 5.0 |
Table 2, continued (Chest X-Ray). Same layout as above.

| Detector | MIM (98.1) | MIM+HFC | BIM (90.9) | BIM+HFC | PGD (90.9) | PGD+HFC | CW (98.9) | CW+HFC |
|---|---|---|---|---|---|---|---|---|
| KD | 100 / 67.9 | 100 / 7.9 | 100 / 73.1 | 100 / 6.8 | 100 / 82.3 | 100 / 50.5 | 99.2 / 71.5 | 98.4 / 15.7 |
| MAHA | 100 / 0.0 | 100 / 0.0 | 100 / 0.0 | 100 / 0.0 | 100 / 0.0 | 100 / 0.0 | 100 / 22.4 | 100 / 0.0 |
| LID | 100 / 47.5 | 100 / 2.3 | 100 / 48.6 | 100 / 1.8 | 100 / 49.1 | 100 / 1.5 | 99.2 / 64.5 | 98.4 / 14.4 |
| SVM | 100 / 8.9 | 100 / 46.7 | 100 / 16.7 | 100 / 6.9 | 100 / 5.8 | 100 / 0.0 | 100 / 21.2 | 100 / 0.0 |
| DNN | 100 / 35.5 | 100 / 1.0 | 100 / 31.8 | 100 / 0.7 | 100 / 33.7 | 100 / 0.0 | 100 / 61.6 | 100 / 5.2 |
| BU | 100 / 15.2 | 100 / 0.0 | 49.9 / 26.1 | 19.2 / 0.0 | 49.2 / 26.2 | 22.7 / 0.0 | 98.3 / 26.2 | 94.8 / 0.0 |
5 Experiments

5.1 Experimental setup

Datasets. We use two public datasets on typical medical classification tasks. The first is the Kaggle Fundoscopy dataset [aptos] for the diabetic retinopathy (DR) classification task, which consists of 3,663 high-resolution fundus images, each labeled with one of five levels from 'No DR' to 'mild/moderate/severe/proliferative DR'. Following [ma2020understanding, finlayson2018adversarial], we conduct a binary classification experiment that treats all fundoscopies with DR as the same class. The second is the Kaggle Chest X-Ray dataset [CXR] for the pneumonia classification task, which consists of 5,863 X-ray images labeled 'Pneumonia' or 'Normal'. Following the literature [ma2020understanding, ma2018characterizing], we split both datasets into three subsets: Train, AdvTrain, and AdvTest. For each dataset, we randomly select 80% of the samples as the Train set to train the DNN classifier and treat the remaining samples as the Test set; test samples misclassified by the diagnosis network are discarded. We then use 70% of the Test samples (AdvTrain) to train the adversarial detectors and evaluate their effectiveness on the rest (AdvTest). DNN models. We choose the ResNet-50 [resnet] and VGG-16 [VGG]
models pretrained on ImageNet. All images are resized to 299×299×3 and normalized to [-1, 1]. The models are trained with data augmented by random crops and horizontal flips. Both models achieve high area-under-curve (AUC) scores on the Fundoscopy and Chest X-Ray datasets: ResNet-50 reaches 99.5% and 97.0%, while VGG-16 reaches 99.3% and 96.5%, respectively. Adversarial attacks and detectors. Following [ma2020understanding], we choose MIM, BIM, PGD, and CW to attack our models. For the adversarial detectors, we use kernel density (KD) [kde], Bayesian uncertainty (BU) [kde], local intrinsic dimensionality (LID) [ma2018characterizing], Mahalanobis distance (MAHA) [MAHA], RBF-SVM [SaftyNet], and a deep neural network (DNN) [metzen2017detecting].
The parameters for KD, LID, BU, and MAHA are set per the original papers. We compute the LID and MAHA scores for all activation layers and train a logistic regression classifier [MAHA, ma2018characterizing]. For KD, BU, and RBF-SVM, we extract features from the penultimate layer. For DNN, we train a classifier for each activation layer and ensemble these networks by summing their logits. Metrics. We use three metrics to evaluate the effectiveness of the adversarial detectors and the proposed method: 1) true positive rate at 90% true negative rate (TPR@90), for which the detector drops 10% of the normal samples in order to reject more adversarial attacks; 2) area-under-curve (AUC) score; 3) adversarial accuracy (Adv. Acc), the success rate of the targeted adversarial attack. Hyperparameters. We set the perturbation budget separately for the Fundoscopy and Chest X-Ray datasets. For each activation layer, we take the mean value of each channel and set $\lambda_l$ according to the number of channels. As a tiny perturbation in medical images causes a drastic increase in loss [ma2020understanding], we use a small step size and set the number of iterations accordingly.
5.2 Bypassing adversarial detectors
We first train adversarial detectors for different DNN classifiers, datasets, and perturbation constraints, and then evaluate their performance under the proposed attack. As shown in Fig. 5, most of the detectors achieve high AUC scores in the deep layers (solid lines). When we use HFC to strengthen the attack, the t-SNE visualization in Fig. 6 shows that HFC moves the adversarial representations (orange) from the outlier region to a location (cyan) surrounded by normal features (purple), thereby bypassing the detectors in all layers. Furthermore, as reported in Table 2, the proposed HFC term strengthens all the adversarial attacks so that the corresponding detectors are bypassed. As discussed in Sec. 3.3, when the perturbation constraint weakens, the BIM features move further away from the normal feature distribution. Consequently, Fig. 4 shows that the detectors perform better at detecting BIM examples (the solid lines increase). However, a bigger perturbation budget also gives our method more room to manipulate the representations: the attacker can move the features closer to the normal feature distribution, which compromises the detectors more drastically (the dotted lines decrease).
5.3 Comparison with other sneak attacks
We compare different sneak attack methods for manipulating the deep representations and bypassing the detectors: 1) generate AEs whose internal representations are similar to those of a randomly chosen guide image [feature_iclr]; 2) instead of random sampling, choose the guide image whose representation is closest to the input [feature_iclr]; 3) minimize the loss terms of KDE and cross-entropy at the same time [CW_ten]; 4) minimize the loss terms of LID (the KDE and LID terms can be found in the supplementary material) and cross-entropy at the same time [CW_bpda]. As shown in Table 3, all attacks that mimic the normal feature distribution can bypass KD, MAHA, and SVM. Under a strict perturbation constraint, our method breaks all five detectors at the same time and greatly outperforms the other methods.
| Chest X-Ray | KD | MAHA | LID | SVM | DNN | Adv. Acc |
|---|---|---|---|---|---|---|
5.4 Hyperparameter analysis
We model the normal feature distributions of ResNet-50 on Fundoscopy with GMMs with different numbers of components and evaluate the corresponding attack performance. As shown in Fig. 4 (ResNet-50, Fundoscopy), only KD and LID keep AUC scores around 80% (more experiments can be found in the supplementary material), so we report their performance in Table 4. All of the attacks stably compromise the detectors, while certain numbers of components slightly improve the performance.
5.5 Gray-box attack
We also consider a more difficult gray-box scenario: the attacker, knowing only the backbone, tries to fool the victim model and bypass its adversarial detectors at the same time. As illustrated in [inkawhich2019feature], different models trained on the same dataset have similar decision boundaries and class orientations in the feature space. We therefore explore the potential of adversarial examples generated from a substitute model to bypass the victim model's detectors. As shown in Table 5, our adversarial examples bypass most of the victim model's detectors with high adversarial accuracy. It is worth noting that, for VGG-16, the OOD-based detectors have limited ability to detect BIM examples transferred from a substitute model with the same architecture.
| Chest X-Ray | KD | MAHA | LID | SVM | DNN | Adv. Acc |
|---|---|---|---|---|---|---|
5.6 Comparison with natural image attack
We also extend HFC to natural images (CIFAR-10) and evaluate the performance of the adversarial detectors and HFC under different perturbation constraints (the Adv. Acc falls below 95% for medical and natural images when the perturbation is too small). As shown in Table 6, the detection rate for BIM examples increases with a larger perturbation budget, i.e., the adversarial features are moved further away from the normal ones. Meanwhile, HFC gains a better ability to manipulate the deep representations, which weakens the detection capability. On the other hand, as shown in Sec. 3.1, medical features are much more vulnerable than natural ones, which makes medical attacks easier to detect; at the same time, HFC enjoys more success in moving the adversarial features into the normal feature distribution, even with a small perturbation.
6 Conclusion

In this paper, we attempted to understand the intrinsic characteristics of medical adversarial examples. A key difference between medical and natural images lies in the vulnerability of their deep representations. Existing adversarial attacks distort the prediction by optimizing the feature representation in a consistent direction, which pushes the vulnerable medical image features to an out-of-distribution level. Although existing adversarial attacks on medical images are easy to detect, this detectability is not reliable if the attacker alters the attacking strategy and hides the representations; on the contrary, the higher vulnerability gives the attacker more power to manipulate the representations. We proposed a novel hierarchical feature constraint to find the closest place to hide the adversarial feature within the normal feature distribution, represented by a Gaussian mixture model. Extensive experiments validated the effectiveness of the proposed attack in bypassing adversarial detectors under both white-box and gray-box settings.
7 The Consistency of Gradient Direction
7.1 Binary classification
Theorem 1. Consider a binary disease diagnosis network and its representations from the penultimate layer; the directions of the corresponding gradients are fixed during each iteration under adversarial attack. (As a representative, we use the adversarial attack to convert the prediction of the diagnosis network from 0 to 1.)

Proof. As shown in Fig. 7, let $z_i$ denote the activation output of the $i$-th node (neuron) in the penultimate layer of the neural network; $w_{i,j}$ denotes the weight from the $i$-th node to the $j$-th node of the next layer; $Z_j$ is the output of the neural network in the $j$-th channel; and $p_j$ is the probability of the input being the $j$-th class. For simplicity, we ignore the bias parameters and the activation functions in the middle layers, which does not affect the conclusion. Formally, the cross-entropy loss can be defined as:

$$\mathcal{L} = -\sum_j y_j \log p_j,$$
and the softmax function is:

$$p_j = \frac{e^{Z_j}}{\sum_k e^{Z_k}}.$$
According to the chain rule, we can compute the partial derivative of $\mathcal{L}$ with respect to $z_i$:

$$\frac{\partial \mathcal{L}}{\partial z_i} = \sum_j \frac{\partial \mathcal{L}}{\partial Z_j} \cdot \frac{\partial Z_j}{\partial z_i} = \sum_j (p_j - y_j)\, w_{i,j}.$$
As in Sec. 3.3, the goal of the adversarial attack is to invert the prediction from 0 to 1, i.e., $y_1 = 1$ and $y_0 = 0$, so we can derive the partial derivative on $z_i$:

$$\frac{\partial \mathcal{L}}{\partial z_i} = (p_1 - 1)\, w_{i,1} + p_0\, w_{i,0} = (1 - p_1)(w_{i,0} - w_{i,1}),$$
where $1 - p_1 > 0$ and $w_{i,0} - w_{i,1}$ is constant, so the partial derivative on $z_i$ keeps the same direction, as claimed in Sec. 3.3.
7.2 Multi-class classification.
Similar to binary classification, some of the activation values are driven to outliers, guided by similar gradients in each iteration. Differently, the derivation shows that only the channels $i$ whose weight $w_{i,t}$ toward the target class is the largest (greater than the other weights $w_{i,j}$, $j \ne t$) have negative gradients all the time. Our experiments confirm that these values are increased to out-of-distribution positions in a similar direction iteratively.

Proof. The goal is to make the prediction classified to a particular erroneous class $t$. The partial derivative of $\mathcal{L}$ with respect to the $i$-th activation output in the penultimate layer is computed as:

$$\frac{\partial \mathcal{L}}{\partial z_i} = \sum_j (p_j - y_j)\, w_{i,j},$$

where $p_j = e^{Z_j} / \sum_k e^{Z_k}$. For the targeted attack, we know that $y_t = 1$ and $y_j = 0$ for $j \ne t$, and thus $1 - p_t = \sum_{j \ne t} p_j$. Hence, the derivative can be rewritten as:

$$\frac{\partial \mathcal{L}}{\partial z_i} = (p_t - 1)\, w_{i,t} + \sum_{j \ne t} p_j\, w_{i,j} = \sum_{j \ne t} p_j \left(w_{i,j} - w_{i,t}\right),$$

where every $p_j > 0$. For a channel whose $w_{i,t}$ is greater than all the other weights $w_{i,j}$, every term $w_{i,j} - w_{i,t}$ is negative, so its gradient keeps negative all the time.
Implication. Guided by the negative gradient, these activation values increase to out-of-distribution positions iteratively. To verify this conclusion and explore the degree of outlierness, we conduct a 10-class classification experiment with a ResNet-50 [resnet] network on the CIFAR-10 dataset (we select 1,000 correctly predicted images from the test set and set the target class to 0). As in Sec. 3.3, we plot the cosine similarities and the distributions of normalized standard deviations and maximum values for all the channels ('ALL', 2,048 channels in total) and for the channels with the largest weight with respect to class 0 (159 channels in total, marked as 'Biggest_0'). As shown in Fig. 8 and Fig. 9, the activation values are updated in similar directions in each iteration. In particular, the cosine similarities of the channels in 'Biggest_0' are as high as those in the binary classification task, which moves the features to out-of-distribution positions.
8 Visualization Code for Figure 10
The Python code for plotting the distributions of standard deviations and maximum values.
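The original listing is not reproduced here; the following is a minimal numpy sketch of the statistics behind the plot (the histograms in Fig. 10 are then simply histograms of these two arrays, e.g. via matplotlib):

```python
import numpy as np

def outlier_stats(f_norm):
    """Per-example maximum and standard deviation of the normalized
    activation matrix (rows = examples, columns = channels).
    Adversarial rows should show larger values of both statistics."""
    return f_norm.max(axis=1), f_norm.std(axis=1)
```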
9 Adversarial Detection Methods
Here we give a brief introduction to the state-of-the-art adversarial attack detection methods, including KD [kde], BU [kde], LID [ma2018characterizing], MAHA [MAHA], SVM [SaftyNet], and DNN [metzen2017detecting]. We try our best to use the official code where released (KD and BU: https://github.com/rfeinman/detecting-adversarial-samples; LID and MAHA: https://github.com/pokaxpoka/deep_Mahalanobis_detector). In the following, we detail each method.

Kernel density (KD). KD is calculated with the training set in the feature space of the last hidden layer and is meant to detect points that lie far from the data manifold. Specifically, given a sample $x$ of class $t$ and a set of training samples $X_t$ from the same class
, the KD of $x$ can be estimated by:

$$KD(x) = \frac{1}{|X_t|} \sum_{x_i \in X_t} k\big(z(x_i), z(x)\big),$$
where $z(\cdot)$ is the last-hidden-layer activation vector and $k(\cdot, \cdot)$ is the kernel function, often chosen as a Gaussian kernel.

Bayesian uncertainty estimates (BU). BU estimates are available in dropout neural networks (we add a dropout layer before the last layer of ResNet-50 [resnet]). They are meant to detect points lying in low-confidence regions of the input space, and can detect adversarial samples in situations where density estimates cannot. Specifically, the authors sample $T$ times from the distribution of network configurations; for a sample $x$ with stochastic predictions $y_1, \dots, y_T$, the BU can be computed as:

$$BU(x) = \frac{1}{T} \sum_{i=1}^{T} y_i^{\top} y_i - \left(\frac{1}{T} \sum_{i=1}^{T} y_i\right)^{\!\top} \left(\frac{1}{T} \sum_{i=1}^{T} y_i\right).$$
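Both scores can be sketched in a few lines of numpy (simplified: KD uses a plain Gaussian kernel with a single bandwidth `sigma`, and BU follows the variance formula above; names are illustrative):

```python
import numpy as np

def kernel_density(z, train_feats, sigma):
    """Gaussian-kernel density of a penultimate-layer feature z against
    the training features of the predicted class."""
    sq_dists = np.sum((train_feats - z) ** 2, axis=1)
    return float(np.mean(np.exp(-sq_dists / sigma ** 2)))

def bayesian_uncertainty(stochastic_preds):
    """Variance of T MC-dropout predictions, summed over outputs:
    (1/T) sum y_i^T y_i  -  mean(y)^T mean(y)."""
    y = np.asarray(stochastic_preds)          # shape (T, num_outputs)
    mean = y.mean(axis=0)
    return float(np.mean(np.sum(y * y, axis=1)) - np.sum(mean * mean))
```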
Local intrinsic dimensionality (LID). LID describes the rate of expansion in the number of data objects as the distance from a reference sample increases. Specifically, given a sample $x$, LID makes use of its distances to its $k$ nearest neighbors:

$$LID(x) = -\left(\frac{1}{k} \sum_{i=1}^{k} \log \frac{r_i(g(x))}{r_k(g(x))}\right)^{\!-1},$$
where $g(x)$ is the activation vector from an intermediate layer, and $r_i(g(x))$ is the Euclidean distance between $g(x)$ and its $i$-th nearest neighbor. LID is computed on each layer of the DNN.

Mahalanobis-distance-based confidence score (MAHA). MAHA utilizes a Mahalanobis-distance-based metric instead of the Euclidean distance, and also processes the DNN features for detecting adversarial samples. Specifically, the authors first compute the empirical mean and covariance of the activations of the training samples for each layer. Then, they compute the Mahalanobis distance score as:

$$M(x) = -\big(g(x) - \hat{\mu}\big)^{\top} \hat{\Sigma}^{-1} \big(g(x) - \hat{\mu}\big),$$
where $\hat{\mu}$ and $\hat{\Sigma}$ are the mean and covariance of the training samples and $g(x)$ is the activation vector in the intermediate layer of the DNN. MAHA is computed on each layer of the DNN, as is LID. SVM and DNN simply train a classifier (i.e., an SVM or a DNN) on the adversarial examples. We follow the literature [SaftyNet, metzen2017detecting] to implement them.
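Minimal numpy sketches of the two scores (simplified: class-conditional details and per-layer aggregation are omitted, and the neighbor distances for LID are assumed precomputed):

```python
import numpy as np

def lid_mle(knn_dists):
    """MLE estimate of local intrinsic dimensionality from the
    distances to the k nearest neighbors:
    LID = -(1/k * sum log(r_i / r_k))^-1."""
    r = np.sort(np.asarray(knn_dists, dtype=float))
    return float(-1.0 / np.mean(np.log(r / r[-1])))

def mahalanobis_score(f, mu, cov):
    """Negative Mahalanobis distance from feature f to a Gaussian
    fitted on clean training features; more negative = more outlying."""
    diff = f - mu
    return float(-diff @ np.linalg.solve(cov, diff))
```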
10 More Hyperparameter Analysis
We report the performance of HFC under different numbers of GMM components on both Fundoscopy and Chest X-Ray. The results in Table 7 and Table 8 show that HFC stably bypasses all the detectors at the same time across different numbers of components.