MixDefense: A Defense-in-Depth Framework for Adversarial Example Detection Based on Statistical and Semantic Analysis

04/20/2021 ∙ by Yang Yijun, et al. ∙ The Chinese University of Hong Kong

Machine learning with deep neural networks (DNNs) has become one of the foundation techniques in many safety-critical systems, such as autonomous vehicles and medical diagnosis systems. DNN-based systems, however, are known to be vulnerable to adversarial examples (AEs), which are maliciously perturbed variants of legitimate inputs. While there is a vast body of research on defending against AE attacks, the performance of existing defense techniques is still far from satisfactory, especially for adaptive attacks, wherein attackers are knowledgeable about the defense mechanisms and craft AEs accordingly. In this work, we propose a multilayer defense-in-depth framework for AE detection, namely MixDefense. For the first layer, we focus on those AEs with large perturbations. We propose to leverage the 'noise' features extracted from the inputs to discover the statistical difference between natural images and tampered ones for AE detection. For AEs with small perturbations, the inference result of such inputs would largely deviate from their semantic information. Consequently, we propose a novel learning-based solution to model such contradictions for AE detection. Both layers are resilient to adaptive attacks because there are no gradient propagation paths for AE generation. Experimental results with various AE attack methods on image classification datasets show that the proposed MixDefense solution outperforms the existing AE detection techniques by a considerable margin.




1 Introduction

Deep neural networks (DNNs) have achieved unprecedented success in numerous long-standing machine learning tasks, and they are widely deployed in various safety-critical systems such as autonomous vehicles, healthcare systems, and cyber-security infrastructures. In these applications, incorrect decisions or predictions could result in life risk or significant financial loss. Consequently, the safety and security of DNNs are of grave concern.

Figure 1: Overview of the proposed Defense-in-Depth framework MixDefense. The first layer of defense is designed for AEs with large perturbation, and the second layer is for AEs with small perturbation. More layers can be added in between, if needed.

One of the primary threats to DNN-based systems is adversarial examples (AEs), which are maliciously perturbed inputs crafted to fool the DNN model [szegedy2013intriguing]. Defending against AEs has drawn considerable research interest from both academia and industry [silva2020opportunities]. Existing AE defense techniques can be broadly categorized into three types: (i) AE-aware training, including adversarial training techniques that incorporate perturbed inputs into the training set (e.g., [madry2017towards, pmlr-v97-pang19a]) and gradient masking/obfuscation techniques that try to construct models with gradients that are difficult for attackers to use (e.g., [Shan2020GottaCA, Lcuyer2019CertifiedRT]). While effectively mitigating AE threats, such solutions are not only computationally expensive but also cause an undesirable decrease in the prediction accuracy of legitimate inputs [pmlr-v119-raghunathan20a, tramer2020fundamental]. (ii) Input transformations, which "purify" the inputs before feeding them into the DNN (e.g., [meng2017magnet, samangouei2018defense, Dziugaite2016ASO]). By doing so, carefully crafted adversarial perturbations are altered, thereby weakening their ability to attack. Due to the nature of this technique, there is an inherent tradeoff between tolerable perturbations and prediction accuracy for clean inputs. Consequently, such solutions are usually only effective for AEs with small perturbations, and most of them are not resistant to adaptive attacks. (iii) AE characterization, which detects AEs by leveraging the instability of AEs (e.g., [Grosse2017OnT, Song2018PixelDefendLG, Aigrain2019DetectingAE, Carrara2018AdversarialED]) or by finding AEs as statistical outliers of all input samples (e.g., [ma2018characterizing, kantaros2020visionguard, xu2018feature]).

An ideal defense strategy for adversarial inputs should have the following features:

  • A broad scope of defense, i.e., it should be able to defend against all sorts of AE attacks under a wide spectrum of perturbation strengths;

  • Negligible impact on model accuracy with acceptable computational effort, i.e., it should have a minimum impact on the prediction accuracy of legitimate inputs without incurring much training cost and inference overhead;

  • Robustness to adaptive attacks, i.e., it should remain effective even if attackers are knowledgeable about the defense mechanism and craft AEs accordingly.

Existing solutions are still far from meeting the above criteria, although several ensemble solutions combine multiple defense techniques (e.g., [meng2017magnet, tramer2018ensemble, Cheval2018DEEPSECDE]).

In this work, we propose a novel defense-in-depth framework for adversarial examples, namely MixDefense, which belongs to the "AE characterization" category. To achieve a broad scope of defense, we propose to use multiple attack-agnostic layers to detect various kinds of AEs with different perturbation strengths, as shown in Fig. 1. To be specific, we first use an LP-defense layer for AEs with large perturbations, which leverages the 'noise' features extracted from the inputs to discover the statistical difference between natural images and tampered ones. For AEs with small perturbations, the inference result of such inputs would largely deviate from their semantic information. We use a learning-based SP-defense layer to model the semantic distance between the input and its reconstruction conditioned on the predicted label for AE detection. As can be seen from Fig. 1, the LP-defense layer in MixDefense rejects those inputs that are deemed to be AEs before feeding them into the DNN model, and we strive to minimize overkills (i.e., legitimate inputs regarded as AEs) in this layer. The SP-defense layer is responsible for identifying the remaining AEs in a post-mortem manner. Such a defense strategy has little impact on model accuracy and incurs a small computational overhead. Moreover, as the final defense in the proposed MixDefense framework, the SP-defense layer is resistant to adaptive attacks.

In summary, the contributions of this work are as follows:

  • We propose MixDefense, a novel attack-agnostic defense-in-depth framework that can detect various kinds of AEs (including adaptive ones) while having little impact on the prediction accuracy of legitimate inputs with small inference overhead.

  • For AEs with large perturbations, we propose a simple adversarial example catcher (SAEC) technique as LP-defense layer (Section 3), which captures the intrinsic statistical difference between AEs and clean samples by measuring the ‘noise’ level of the inputs.

  • For AEs with small perturbations, we propose a learning-based solution, namely ContraNet, as the SP-Defense layer (Section 4). Based on the contradiction between the semantic meaning of AEs and their predicted labels, this layer consists of a conditional generative model trained on clean image and label pairs, and a deep metric module that learns a powerful similarity metric for measuring the distance between the input and the reconstructed image.

We verify the effectiveness of MixDefense on three popular image classification datasets: MNIST, CIFAR-10, and GTSRB. Experimental results (Section 5) show that our method can defend against various AE attacks under a wide spectrum of perturbation strengths with little impact on the prediction accuracy for legitimate inputs, and that it outperforms the existing AE detection techniques by a considerable margin. Moreover, we perform case studies on adaptive attacks to MixDefense and show that our method is robust against them (Section 6).

2 Background

A large amount of research has been dedicated to adversarial example attacks and defenses in the literature [silva2020opportunities]. In this section, we briefly introduce AE attacks and give more detailed discussions on existing AE defense solutions.

2.1 AE Attacks

Adversarial example attacks can be broadly categorized into the following types:

  • Gradient-based methods generate adversarial perturbations according to the gradient direction of the loss function (e.g., [fgsm, kurakin2016bim, madry2017towards, papernot2016limitations, Pang2020ATO, MoosaviDezfooli2016DeepFoolAS]). Fast Gradient Sign Method (FGSM) [fgsm] is the first AE attack that perturbs the inputs in a single step to achieve misclassification. Later attacks such as Basic Iterative Method (BIM) [kurakin2016bim], Projected Gradient Descent (PGD) [madry2017towards], and Jacobian-based Saliency Map Attack (JSMA) [papernot2016limitations] are more effective in terms of attack success rate under a given perturbation strength constraint.

  • Optimization-based methods formulate AE designs as a constrained optimization problem and solve it accordingly. For example, Carlini-Wagner (CW) [cw2017] constructs the AEs by solving an optimization problem consisting of an L2-norm perturbation magnitude quantification term and a specific loss term. While more expensive than gradient-based AE attacks, to the best of our knowledge, it is one of the strongest AE attacks by far.

  • Transformation-based methods conduct spatial transformations with slight geometric perturbations (e.g., Spatially Transformed Attack [xiao2018spatially]) or train a generative model for AE transformation (e.g., Adversarial Transformation Networks (ATN)).
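To make the single-step gradient-based attack concrete, the following minimal sketch applies an FGSM-style update, x_adv = clip(x + eps * sign(grad_x loss)), to a toy logistic-regression classifier whose input gradient has a closed form. The toy model, its random weights, and the helper names `logistic_loss_grad` and `fgsm_step` are illustrative assumptions and are not taken from the cited papers.

```python
import numpy as np

def logistic_loss_grad(x, y, w, b):
    """Gradient of the binary cross-entropy loss w.r.t. the input x
    for a toy logistic-regression model p = sigmoid(w.x + b)."""
    p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
    return (p - y) * w

def fgsm_step(x, y, w, b, eps):
    """Single-step FGSM-style perturbation, clipped to the valid range [0, 1]."""
    grad = logistic_loss_grad(x, y, w, b)
    return np.clip(x + eps * np.sign(grad), 0.0, 1.0)

# Toy setup (assumed): 16 input features, true label 1.
rng = np.random.default_rng(0)
w, b = rng.normal(size=16), 0.0
x = rng.uniform(0.2, 0.8, size=16)
y = 1.0
x_adv = fgsm_step(x, y, w, b, eps=0.1)
```

Each input coordinate moves by at most `eps`, and the logit of the true class decreases, which is the behavior the single-step attack relies on. Iterative attacks such as BIM and PGD repeat a smaller step of this kind and re-project onto the allowed perturbation set after each step.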


2.2 AE Defenses

Numerous defense techniques are proposed in the literature to make DNN models more robust against adversarial perturbations [bulusu2020anomalous] and we categorize them as follows.

AE-aware training: Adversarial training [fgsm, kannan2018adversarial, ziang2018deepdefense, pmlrv97zhang19p, pmlr-v97-pang19a, tramer2018ensemble, madry2017towards] boosts model robustness by adding AEs into the training dataset so that their predictions can be corrected. Another type of AE-aware training strategy is to construct DNN models with gradients that are difficult for attackers to use, e.g., defensive distillation [papernot2016distillation], randomized training [Liu2018TowardsRN], adding a gradient obfuscation layer [Liu2018TowardsRN, Lcuyer2019CertifiedRT], and embedding trapdoors for AEs [Shan2020GottaCA].

AE-aware training is relatively easy to perform when the target model is available and it is quite effective in terms of defense capability, especially for known AE attacks. However, due to the change of the target model, these techniques usually cause prediction accuracy loss [pmlr-v119-raghunathan20a, tramer2020fundamental]. Additionally, most of them are vulnerable to adaptive attacks.

Input transformations: Adversarial perturbations can be alleviated by input pre-processing, such as JPEG compression [Dziugaite2016ASO, Das2018SHIELDFP], bit-depth reduction [Guo2018CounteringAI], and randomized discretization [Zhang2019DefendingAW, Xie2018MitigatingAE]. MagNet [meng2017magnet] resorts to auto-encoders to learn the distribution of normal inputs, and then uses the trained auto-encoder to purify the input samples.

Defense-GAN [defense-gan-2018] is a defense mechanism that purifies the input into its closest GAN reconstruction. A similar idea is adopted by PixelDefend [Song2018PixelDefendLG], wherein a learned PixelCNN model is used to purify the inputs.

The common limitation of input transformation-based defense techniques is that their defense capabilities are limited, especially when the adversarial perturbation is relatively large. Moreover, the prediction accuracy for legitimate inputs is affected because they are changed.

AE characterization: Without changing the target DNN model and the inputs, AE characterization techniques find AEs directly.

One line of work regards AEs as statistical outliers. Some of these methods train a binary classifier to learn the statistical differences between the distribution of clean data and that of the AEs [Grosse2017OnT, Song2018PixelDefendLG, Metzen2017OnDA, Aigrain2019DetectingAE, Feinman2017DetectingAS, Carrara2018AdversarialED, ma2018characterizing, SAMP]. In particular, Aigrain et al. [Aigrain2019DetectingAE], Feinman et al. [Feinman2017DetectingAS], and Metzen et al. [Metzen2017OnDA] use the intermediate representations from a pre-trained classifier to train the detector. Local Intrinsic Dimensionality (LID) [ma2018characterizing] estimates a so-called LID value, i.e., the distance of the input and its neighbors, and detects AEs with large LID values. The above works train the detector with both legitimate and adversarial data in a supervised manner, and hence they are usually less effective for adaptive attacks.

In contrast, some recent work proposes to use clean samples only to compute the probability that an input instance falls in the legitimate input distribution [Zheng2018RobustDO, Ma2019NICDA, Grosse2017OnT, Song2018PixelDefendLG, Miller2019WhenNT]. Among them, PixelDefend [Song2018PixelDefendLG] computes the probability with a generative model. Zheng et al. model the output distribution of the hidden neurons in a DNN classifier with a Gaussian Mixture Model [Zheng2018RobustDO]. Ma et al. propose to use the one-class SVM to extract the model invariants (e.g., the distribution of the neuron activation values) for anomaly detection [Ma2019NICDA]. Generally speaking, the above techniques are more effective for AEs with large perturbations, as these are more easily identified as outliers.


Figure 2: Given (a) input images with various perturbations, our proposed LP-Defense layer, SAEC, first obtains the (b) pseudo-saturation maps, and then performs linear filtering to yield the (c) noise contrast features. We then calculate the noise level score based on the (d) histogram of the noise contrast features. We detect AEs by thresholding the noise level score.

Another type of AE characterization work leverages the instability of AEs for detection [xu2018feature, kantaros2020visionguard, wang2019adversarial, liang2021]. Mutation Testing [wang2019adversarial] performs model mutation (e.g., changing neuron activation values) and evaluates the prediction inconsistency for AE detection. While effective, it requires a large number of executions for each input sample, causing significant inference overhead. Liang et al. [liang2021] regard the adversarial perturbations as a kind of noise and implement two classical denoising techniques to mitigate their impact. Both the input image and its denoised version are fed into the target classifier, and the input sample is regarded as an AE if their classification results are inconsistent. Similarly, Feature Squeezing (FS) [xu2018feature] and VisionGuard [kantaros2020visionguard] propose to feed the input and its squeezed/compressed version to the DNN model and check the difference in the results for AE detection. While simple to deploy, these defense techniques provide only moderate robustness enhancement for the target model, and they are also vulnerable to adaptive attacks.

Our proposed MixDefense framework belongs to the AE characterization category. In the following two sections, we detail the proposed detection technique against AEs with large perturbations and AEs with small perturbations, respectively.

3 LP-Defense Layer

We model large adversarial perturbations as noise [liang2021] and propose a simple adversarial example catcher (SAEC) to serve as the LP-defense layer, which detects AEs by measuring the noise level of the input sample. However, it is challenging to determine whether the local pixel variations are due to the image texture, lighting variation, or the noise.

To solve this problem, we first extract the noise contrast feature from the image, then calculate the noise level from the histogram of the feature map, and detect AEs by thresholding a noise level score.

Noise contrast feature extraction. To make the adversarial perturbations more detectable, we propose to extract the noise contrast feature from the image. The feature extraction procedure consists of two steps: i) obtain a pseudo-saturation map and ii) perform linear filtering on the pseudo-saturation map.

Inspired by the definition of saturation in the HSV color space, we define the pseudo-saturation map as the $L_p$-norm of the image subtracted by pixel-wise mean intensity values:

$$S = \left\| I - \bar{I} \right\|_p,$$

where $I$ is the input RGB image, $\bar{I}$ is the mean intensity of each pixel, and $S$ is the resulting pseudo-saturation map. As $p$ increases, the image content is suppressed, while the noise is exposed. Therefore, it becomes much easier to use a statistical strategy to differentiate AEs. In our experiments, $p$ is set empirically. Figure 2 (a) shows a batch of clean images together with their AEs in the RGB space. Figure 2 (b) displays their corresponding pseudo-saturation projections. As can be observed, the image content, such as shape, color, and brightness, is dramatically suppressed, while the noise component (salt-and-pepper-like points), which is exactly the adversarial perturbation, is largely preserved.

Figure 3: The workflow of the proposed SP-Defense layer, ContraNet. During inference, in the 1st stage, the target classifier assigns the input image $x$ a predicted label $\hat{y}$. In the 2nd stage, we pass both the input image and its predicted label to the conditional generative model, which reconstructs $x$ based on $\hat{y}$. After that, the distance between $x$ and its reconstruction $x'$ is measured by the trained deep metric model. If the distance is larger than the threshold $\tau$, the input instance is detected as an AE.

To further utilize the pseudo-saturation map, we leverage a linear filter to amplify the variance between the smoothed image content and the exposed perturbed regions. Such linear filters have been widely used in steganalysis of digital images [steg1, steg0, steg3, steg4, steg5], and they have also been used for AE detection in [liu2019detection]. Unlike [liu2019detection], we apply the filters on the pseudo-saturation maps instead of on the natural images. As shown in Figure 2 (c), the distinction between the noise and the image content is further amplified. Therefore, such linear filters further help us differentiate AEs from clean samples.
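The two feature-extraction steps can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the norm order `p=2` and the 3x3 zero-sum high-pass `KERNEL` are hypothetical choices, and the function names are introduced here only for illustration.

```python
import numpy as np

def pseudo_saturation(img, p=2):
    """Per-pixel p-norm of the RGB values after subtracting the pixel's mean intensity.
    img: float array of shape (H, W, 3). The value of p is an assumption."""
    mean = img.mean(axis=2, keepdims=True)
    return np.sum(np.abs(img - mean) ** p, axis=2) ** (1.0 / p)

# Hypothetical 3x3 zero-sum high-pass kernel; the filter used in the paper may differ.
KERNEL = np.array([[-1.0, 2.0, -1.0],
                   [2.0, -4.0, 2.0],
                   [-1.0, 2.0, -1.0]]) / 4.0

def linear_filter(sat_map, kernel=KERNEL):
    """Same-size 2D correlation with edge padding, written in plain NumPy."""
    kh, kw = kernel.shape
    padded = np.pad(sat_map, ((kh // 2, kh // 2), (kw // 2, kw // 2)), mode="edge")
    h, w = sat_map.shape
    out = np.zeros((h, w), dtype=float)
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * padded[i:i + h, j:j + w]
    return out

def noise_contrast_feature(img, p=2):
    """Step i) pseudo-saturation map, step ii) linear filtering."""
    return linear_filter(pseudo_saturation(img, p))

# Toy inputs: a flat gray image and a noisy copy of it.
rng = np.random.default_rng(0)
clean = np.full((32, 32, 3), 0.5)
noisy = np.clip(clean + rng.normal(scale=0.1, size=clean.shape), 0.0, 1.0)
feat_clean = noise_contrast_feature(clean)
feat_noisy = noise_contrast_feature(noisy)
```

Because the assumed kernel sums to zero, smooth regions map to values near zero while pixel-level noise survives, which matches the qualitative behavior described for Figure 2 (c).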

AE Detection. After extracting the noise contrast feature, we first obtain the histogram of the feature, and then calculate the corresponding noise level scores. Exemplar histograms are shown in Figure 2 (d). Here, the x-axis represents the value of the noise contrast feature, and the y-axis represents the number of pixels falling in that bin. The red and green bars represent the histograms of AEs and clean images, respectively. As can be observed, the values of the noise contrast feature for clean images tend to concentrate around zero, while the values for AEs are distributed over a wider range.

Furthermore, to quantify these observations, we define the noise level score (NL_score) as the variance of the histogram $H$:

$$\mathrm{NL\_score} = \frac{1}{N} \sum_{i=1}^{N} \left( h_i - \bar{h} \right)^2,$$

where $N$ is the number of bins in $H$, $h_i$ denotes the value of the $i$-th bin, and $\bar{h}$ is the mean value.

Finally, we can detect AEs by directly thresholding the NL_scores. The threshold value is determined in such a manner that clean images would not be mistakenly regarded as AEs.
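A minimal sketch of the scoring and thresholding step is given below. It rests on several assumptions that the text does not fix: the bin "value" is read as the normalized bin count, the histogram uses 64 bins over a fixed range, and the threshold is taken as the smallest score seen on clean calibration images. Under this reading, a sharply peaked histogram (clean image) gives a larger score than a spread-out one, so the sketch flags inputs whose score falls below the threshold; if the score is instead defined so that it grows with noise, the comparison flips.

```python
import numpy as np

def nl_score(feature, bins=64, value_range=(-1.0, 1.0)):
    """Variance of the histogram bin values (assumed here: normalized bin counts)."""
    clipped = np.clip(feature, value_range[0], value_range[1])
    counts, _ = np.histogram(clipped, bins=bins, range=value_range)
    h = counts / counts.sum()
    return float(np.mean((h - h.mean()) ** 2))

def calibrate_threshold(clean_features):
    """Pick the threshold from clean data only, so that no calibration image is flagged."""
    return min(nl_score(f) for f in clean_features)

def is_adversarial(feature, threshold):
    # Under the bin-count reading, noisier inputs give flatter histograms and smaller scores.
    return nl_score(feature) < threshold

# Toy noise contrast features: clean ones concentrate near zero, the AE spreads out.
rng = np.random.default_rng(0)
clean_feats = [rng.normal(scale=0.01, size=(32, 32)) for _ in range(20)]
adv_feat = rng.normal(scale=0.3, size=(32, 32))
threshold = calibrate_threshold(clean_feats)
```

Calibrating on clean images alone keeps the detector attack-agnostic, and taking the extreme clean score as the threshold mirrors the stated goal of not rejecting legitimate inputs.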

SAEC has several advantages: (i) it is attack-agnostic and hence can be applied to all kinds of AE attacks; (ii) it has a negligible computational cost.

As small perturbations do not yield statistically significant variations on the pixel value distribution of the natural images, many AEs (mostly with small perturbations) can bypass our LP-defense layer and they are detected by the SP-Defense layer, as detailed in the following section.

4 SP-Defense Layer

For successful AEs with small perturbations, their inference results (with wrong labels) naturally contradict their semantic information as observed by a human. Moreover, the smaller the perturbation, the more evident this contradiction is. Based on this observation, in this section, we present ContraNet, a learning-based AE detection technique to serve as the SP-Defense layer in our MixDefense framework.

ContraNet consists of a class-conditional generative model trained on clean image and label pairs, and a deep metric module that learns a strong similarity metric for measuring the distance between the input and the reconstructed image. Figure 3 presents the workflow of the proposed solution. Given a DNN-based target classifier, we utilize a class-conditional generative model (Section 4.1) to learn the joint distribution between input images and their labels. To be specific, we train this model with clean training samples $x$ to reconstruct the inputs conditioned on the corresponding class labels $y$. During inference, for a given input $x$, we reconstruct it (in Stage 2) with the generative model conditioned on its predicted label $\hat{y}$. Generally speaking, for a legitimate input with a correct prediction, its reconstruction $x'$ would be similar to $x$ itself; otherwise, if the input is an adversarial example that is obviously not associated with the predicted label, the reconstructed sample would not be faithful, because the class-conditional reconstruction model has not seen such an input and label pair during training. Next, in Stage 3, by measuring the distance between the input image $x$ and its reconstructed counterpart $x'$, we can distinguish adversarial examples from benign ones. It is worth noting that we need to measure the distance from a semantic perspective. Therefore, instead of using rule-based distances (e.g., $L_p$ or cosine distance), which are notoriously poor at measuring distances in high-dimensional space, we propose to use a learned deep metric model (Section 4.2) for distance measurement.
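The three inference stages can be sketched as a small function over pluggable components. Everything in the toy setup below is a hypothetical stand-in chosen so that the example runs: the two class prototypes, the deliberately brittle classifier, the identity encoder and embedding, and the threshold of 0.5. Only the control flow (classify, reconstruct conditioned on the predicted label, compare in an embedding space) follows the description above.

```python
import numpy as np

def contranet_detect(x, classifier, encoder, generator, embed, threshold):
    """Classify, reconstruct conditioned on the predicted label, then compare
    input and reconstruction in a (learned) embedding space."""
    y_pred = int(np.argmax(classifier(x)))                 # Stage 1: predicted label
    x_rec = generator(encoder(x), y_pred)                  # Stage 2: conditional reconstruction
    dist = float(np.linalg.norm(embed(x) - embed(x_rec)))  # Stage 3: semantic distance
    return dist > threshold, y_pred, dist

# Hypothetical toy components, used only to exercise the control flow.
PROTOS = np.array([[1.0, 1.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0, 1.0]])

def classifier(x):
    # Deliberately brittle: a small change in x[3] flips the decision.
    return np.array([x[0] - 10 * x[3], 10 * x[3]])

def encoder(x):
    return x

def generator(z, y):
    # Pulls the latent code toward the prototype of the conditioning class.
    return 0.5 * (z + PROTOS[y])

def embed(x):
    return x

clean = PROTOS[0].copy()
adv = clean + np.array([0.0, 0.0, 0.0, 0.1])  # small perturbation that fools the classifier

flag_clean, label_clean, dist_clean = contranet_detect(
    clean, classifier, encoder, generator, embed, threshold=0.5)
flag_adv, label_adv, dist_adv = contranet_detect(
    adv, classifier, encoder, generator, embed, threshold=0.5)
```

In the toy run, the clean input is reconstructed faithfully under its correct label, whereas the perturbed input is reconstructed under the wrong label and drifts toward the other prototype, which produces the large distance that triggers detection.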

4.1 Class-Conditional Generative Model

Framework overview.

One straightforward way to realize class-conditional generation is to use an existing conditional generative adversarial network (cGAN), which can generate images that are highly associated with the given condition. However, as existing cGAN techniques generate images from a latent vector $z$ randomly sampled from a Gaussian distribution, the generated image is not directly related to the input $x$. Consequently, it can be very different from $x$ even under the correct condition label (e.g., the input can be "red bird flying in the blue sky" while the generated image can be "yellow bird standing on the green tree").

To obtain a reconstructed image instead of a 'randomly' generated one, we revise the cGAN architecture by introducing an extra encoder to yield the latent vector $z$. By doing so, given the correct condition, the generated image would be a faithful reconstruction of the input image instead of a random image of this category.

Figure 4: The revised cGAN in ContraNet. It mainly contains three parts, the encoder, the generator, and the discriminator. Especially, the discriminator 1) helps the generator to improve its reconstruction quality; 2) makes the generator not ignore the class condition. Please note that the discriminator is only used during training. See Section 4.1 for details.

Algorithm 1 Training of the revised cGAN.
  for number of training iterations do
     for each mini-batch do
        Sample a mini-batch of $m$ examples from the training dataset.
        Update the generator by gradient descent on its loss.
        Update the encoder by gradient descent on its loss.
        Update the discriminator by gradient descent on its loss.
     end for
  end for
The Dist function can be any distance measurement function. In our experiments, two such distance functions are used.

Figure 4 shows our cGAN architecture. The model contains an encoder $E$, a generator $G$, and a discriminator $D$. All these modules are parameterized by feed-forward neural networks. Given an input $x$, the encoder encodes the input image into a latent vector $z$. Then, the latent vector $z$ and the condition $y$ are fed to the generator to generate an image $x'$ of class $y$. Finally, the conditional discriminator is responsible for discriminating between the real image and the generated fake image to improve generation quality.

To make sure that the condition is not ignored by the generator, we insert the condition information into both the generator and the discriminator, following the practice in [miyato2018cgans]. First, we insert one Class Batch Normalization (CBN) layer into each layer of the generator. Second, we use a projection layer to project the class embedding into the discriminator output. The right part of Figure 4 shows the discriminator details. First, the label $y$ is processed by an embedding layer to obtain its class embedding $e_y$. Then, we apply an inner product between $e_y$ and the extracted feature $\phi(x)$. Finally, the two branches are added together to produce the output result.
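The two-branch discriminator output described above can be illustrated with a few lines of NumPy. This is a sketch under assumptions: the unconditional branch is taken to be a single linear layer on the extracted feature, and the names `class_embeddings`, `w_out`, and `b_out` are hypothetical.

```python
import numpy as np

def projection_discriminator_output(feature, label, class_embeddings, w_out, b_out=0.0):
    """Sum of an unconditional branch (assumed: a linear layer on the feature) and the
    projection branch (inner product between the class embedding and the feature)."""
    unconditional = feature @ w_out + b_out
    projection = class_embeddings[label] @ feature
    return float(unconditional + projection)

# Toy values: a 2-d feature and one embedding row per class.
feature = np.array([1.0, 2.0])
class_embeddings = np.array([[1.0, 0.0],
                             [0.0, 1.0]])
w_out = np.array([0.5, 0.5])
out_class0 = projection_discriminator_output(feature, 0, class_embeddings, w_out)
out_class1 = projection_discriminator_output(feature, 1, class_embeddings, w_out)
```

Since the label enters only through the inner product, the same image feature yields different scores under different conditioning labels, which is what discourages the generator from ignoring the class condition.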

Training process. The training of the three components in the revised cGAN is detailed as follows. Given the dataset with input images $x$ and the associated ground-truth labels $y$, the encoder is optimized such that the reconstruction $x'$ is similar to the input $x$ conditioned on the label $y$. In addition, we keep the distribution of the latent vector $z$ close to a Gaussian distribution. Therefore, the objective of the encoder model is formally defined as:


where the first term is the KL divergence loss (between the distribution of $z$ and a Gaussian distribution), and the second term denotes the reconstruction loss.

For the generator, in addition to the reconstruction loss, we also expect the generated image to be able to fool the discriminator. Thus, if we use $D$ to denote the discriminator, the loss function of the generator is defined by:


Finally, as the discriminator is trained to distinguish the real input image from the generated one, the adversarial loss for the discriminator is given by:


Algorithm 1 shows the mini-batch realization of the above training objectives ($\mathcal{L}_E$, $\mathcal{L}_G$, and $\mathcal{L}_D$). First, in each training iteration, we split the training set into mini-batches, each containing $m$ samples. For each mini-batch, we calculate the gradients of the loss functions $\mathcal{L}_E$, $\mathcal{L}_G$, and $\mathcal{L}_D$ w.r.t. the respective model parameters. Then, we update the models in the gradient descent direction. This procedure is repeated for a pre-defined number of training iterations.
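As a concrete reading of the three objectives described above, one plausible instantiation is sketched below. It is written under explicit assumptions: a hinge-style adversarial loss (a common choice with projection discriminators), unit weights on all terms, and an encoder posterior denoted $q_E(z \mid x)$. The exact loss forms and weights used by the authors may differ, and expectations over the mini-batch are omitted for brevity.

```latex
\begin{aligned}
z &= E(x), \qquad x' = G(z, y), \\
\mathcal{L}_E &= D_{\mathrm{KL}}\big( q_E(z \mid x) \,\|\, \mathcal{N}(0, I) \big) + \mathrm{Dist}(x, x'), \\
\mathcal{L}_G &= \mathrm{Dist}(x, x') - D(x', y), \\
\mathcal{L}_D &= \max\big(0,\; 1 - D(x, y)\big) + \max\big(0,\; 1 + D(x', y)\big).
\end{aligned}
```

In this form, the encoder and the generator share the reconstruction term, the generator additionally tries to raise the discriminator score of the reconstruction, and the discriminator separates real pairs $(x, y)$ from reconstructed pairs $(x', y)$.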

After the encoder, the generator, and the discriminator are trained, only the encoder and the generator are used for inference. The discriminator is only an auxiliary model to help the revised cGAN consider the label information.

4.2 Deep Metric Learning

Directly applying traditional similarity metrics (e.g., the $L_p$ norm or cosine similarity) to measure the distance between the input and the reconstructed image often results in suboptimal detection results. To tackle this problem, we propose to use deep metric learning to learn a more powerful similarity metric for AE detection.

Component            Params (M)   FLOPs (G)
Encoder              3.81         0.11
Generator            4.3          1.69
Deep Metric Model    2.24         0.03
Total                10.35        1.83

Table 1: Computation and storage cost of ContraNet

To be specific, we employ a Triplet network [triplet] as our deep metric module for learning the similarity metric. The deep metric module consists of three instances of the same feed-forward network with shared parameters. This module has three types of inputs, namely, the anchor $a$, the positive sample $p$, and the negative sample $n$. Here, we use clean images as anchors and obtain the positive and negative samples by feeding the clean images into the revised cGAN. When accompanied by correct labels, the output reconstructions are marked as positive samples; otherwise, they are marked as negative samples. These three kinds of samples are fed into the deep metric module to get their respective embeddings.

For training, we utilize the Triplet Margin Loss [tripletloss] as the loss function, as given in Equation (6):

$$\mathcal{L}_{\mathrm{triplet}} = \max\big( d(a, p) - d(a, n) + m,\; 0 \big), \tag{6}$$

where $m$ is a positive margin that further enlarges the distance among dissimilar samples, and $d(\cdot, \cdot)$ denotes a regular distance metric (e.g., Euclidean distance or cosine distance) that measures the distance between the embedding vectors obtained from the deep metric network.

To improve the performance, all possible triplet combinations are used in the loss calculation according to the labels. We also resort to a hard sample miner [miner, metric0] during training.
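For reference, the triplet margin loss over a batch of embeddings can be written in a few lines of NumPy. This is a minimal sketch assuming Euclidean distance and a margin of 1.0 (both illustrative choices); it does not include the triplet enumeration or hard sample mining mentioned above.

```python
import numpy as np

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Mean of max(d(a, p) - d(a, n) + margin, 0) with Euclidean d.
    Each argument has shape (batch, embedding_dim)."""
    d_ap = np.linalg.norm(anchor - positive, axis=1)
    d_an = np.linalg.norm(anchor - negative, axis=1)
    return float(np.mean(np.maximum(d_ap - d_an + margin, 0.0)))

# Toy embeddings: the positive is close to the anchor, the negative is far away.
a = np.array([[0.0, 0.0]])
p = np.array([[0.0, 1.0]])
n = np.array([[3.0, 4.0]])
loss_good = triplet_margin_loss(a, p, n)  # well-separated triplet
loss_bad = triplet_margin_loss(a, n, p)   # roles swapped: the triplet violates the margin
```

A triplet contributes zero loss once the negative is farther from the anchor than the positive by at least the margin, so training effort concentrates on triplets that still violate it.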

Dataset                         No. of Classes   Classifier Type   Accuracy
MNIST [mnist]                   10               LeNet             98.6%
CIFAR10 [krizhevsky2014cifar]   10               DenseNet169       94.3%
GTSRB [GTSRB]                   43               ResNet34          97.8%

Accuracy refers to the classifier's accuracy on clean images.

Table 2: Datasets and Target Classifier

One distinguishing advantage of ContraNet is its robustness to adaptive attacks. First, generative models are themselves more robust than discriminative models under adversarial attack [kos2018adversarial]. Second, the target classifier and our detection mechanism have separate gradient propagation paths. The only connection between them is the predicted label produced by a non-differentiable argmax operation. Therefore, the attacker cannot mount an end-to-end gradient-based adaptive attack. Moreover, similar to the SAEC technique used in the LP-defense layer, ContraNet is also an attack-agnostic plug-and-play defense solution that requires only the training dataset. In Table 1, we show the number of parameters and the FLOPs of ContraNet for the CIFAR10 dataset. As can be observed, the size and inference cost of ContraNet are moderate, considering that classifiers used in safety-critical applications are usually quite large to achieve extremely high accuracy.

5 Experimental Results

In this section, we first introduce the experimental setting in Section 5.1. Then, we report the performance of MixDefense from three aspects: the overall performance of MixDefense compared with three state-of-the-art detection-based defense methods (Section 5.2), the performance of the proposed LP-Defense layer SAEC (Section 5.3), and the performance of the proposed SP-Defense layer ContraNet (Section 5.4).

5.1 Experimental Settings

We evaluate our MixDefense on three popular datasets: MNIST [mnist], CIFAR10 [krizhevsky2014cifar], and GTSRB [GTSRB]. The datasets along with the target classifiers are summarized in Table 2.

Evaluation metrics. To evaluate the performance of AE defenses [carlini2019evaluating], we use the two metrics described in [deepsec] and [meng2017magnet].

The first metric is the detection accuracy of the detector given a half-and-half mixture of clean images and AEs as inputs. We treat clean images as positive samples and AEs as negative samples. The accuracy of the detector is calculated as:

$$\mathrm{Acc}_{det} = \frac{TP + TN}{TP + TN + FP + FN},$$

where $TP$, $TN$, $FP$, and $FN$ denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.

The second metric is the overall accuracy of a robust classifier (RC), i.e., a classifier equipped with an AE detector, which is calculated as the percentage of AEs that are either detected by the detector or correctly classified by the classifier:

$$\mathrm{Acc}_{RC} = \frac{N_{det} + N_{cor}}{N}.$$

Here, $N$ is the total number of AEs (including the ones that do not attack successfully), $N_{det}$ is the number of AEs detected by the detector, and $N_{cor}$ is the number of AEs that are correctly classified by the target classifier.
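Both metrics are simple to compute from per-sample outcomes, as the following sketch shows. The function names and the boolean-array interface are assumptions made for illustration. For the robust-classifier accuracy, each AE is counted once if it is detected or correctly classified, which corresponds to counting correctly classified AEs only among those that pass the detector.

```python
import numpy as np

def detector_accuracy(is_clean_true, is_clean_pred):
    """(TP + TN) / (TP + TN + FP + FN), with clean images as the positive class."""
    t = np.asarray(is_clean_true, dtype=bool)
    p = np.asarray(is_clean_pred, dtype=bool)
    tp = np.sum(t & p)
    tn = np.sum(~t & ~p)
    fp = np.sum(~t & p)
    fn = np.sum(t & ~p)
    return float((tp + tn) / (tp + tn + fp + fn))

def robust_classifier_accuracy(detected, correctly_classified):
    """Fraction of AEs that are detected or still classified correctly (each AE counted once)."""
    d = np.asarray(detected, dtype=bool)
    c = np.asarray(correctly_classified, dtype=bool)
    return float(np.mean(d | c))

# Toy outcomes: two clean images and two AEs for the detector metric,
# and four AEs for the robust-classifier metric.
acc_det = detector_accuracy([1, 1, 0, 0], [1, 0, 0, 1])
acc_rc = robust_classifier_accuracy([1, 0, 0, 1], [0, 1, 0, 1])
```

In the toy data, the detector gets one clean image and one AE right out of four inputs, and three of the four AEs are either caught or harmless.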

Perturbation budget. The perturbation budget $\epsilon$ bounds the allowed difference between the clean sample $x$ and the produced AE $x_{adv}$:

$$\left\| x_{adv} - x \right\|_p \le \epsilon,$$

where $\| \cdot \|_p$ denotes the $L_p$-norm. The larger the perturbation budget is, the stronger the produced AEs are.
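The budget constraint can be checked, and for the $L_\infty$ case enforced, with a few lines of NumPy. The helper names `within_budget` and `project_linf` are illustrative, and the numbers are toy values.

```python
import numpy as np

def within_budget(x, x_adv, eps, p=np.inf):
    """Check whether the L_p norm of (x_adv - x), taken over the flattened input, is at most eps."""
    delta = (np.asarray(x_adv, dtype=float) - np.asarray(x, dtype=float)).ravel()
    return bool(np.linalg.norm(delta, ord=p) <= eps)

def project_linf(x, x_adv, eps):
    """Clip x_adv back into the L_inf ball of radius eps around x."""
    return np.clip(x_adv, x - eps, x + eps)

x = np.zeros(4)
x_adv = np.array([0.1, -0.2, 0.0, 0.05])
ok_linf = within_budget(x, x_adv, eps=0.2)             # largest coordinate change is 0.2
too_tight = within_budget(x, x_adv, eps=0.1)           # violates the tighter budget
ok_l2 = within_budget(x, x_adv, eps=0.25, p=2)         # L2 norm is about 0.229
x_proj = project_linf(x, x_adv, eps=0.1)
```

The same perturbation can satisfy a budget under one norm and violate it under another, which is why the norm is reported alongside every budget.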

Compared defense methods. We compare MixDefense with two competing detection-based defenses, FS [xu2018feature] and MagNet [meng2017magnet], on the accuracy of the detector; and with FS [xu2018feature], MagNet [meng2017magnet], and one of the strongest training-based defenses known today, Madry [madry2017towards], on the accuracy of the robust classifier. The selected methods are representative AE defense methods from the three categories surveyed in Section 2.2, where Madry boosts model robustness through adversarial training, MagNet alleviates the impact of AEs through input transformation, and FS detects AEs based on the prediction inconsistency. All three methods are implemented based on their official open-source code. Note that, since adversarial training is time-consuming, we only test Madry on MNIST and CIFAR10, and use the pre-trained models provided by the authors.

We also use a large perturbation budget for the comparison experiments to show the worst-case performance of the defense methods. For defense models realized using PyTorch, we use the implementations of FGSM, FGM, BIM, C&W, and DeepFool from the foolbox library. For those realized with TensorFlow, e.g., MagNet, we resort to cleverhans [cleverhans] to implement the attacks. Table 3 summarizes the attacks we employed.

5.2 Results of MixDefense

This subsection is concerned with MixDefense's overall performance against various attacks under a wide range of adversarial perturbations. Here, we report only the results of attacks under one norm as the distance measure, and defer the results under the other norm to the Appendix due to limited space.

Attack Method | Knowledge | Goals | Distance
FGSM [fgsm] | white-box | untargeted |
FGM [fgsm] | white-box | untargeted |
BIM [kurakin2016bim] | white-box | untargeted |
C&W [cw2017] | white-box | untargeted |
DeepFool [moosavi2016deepfool] | white-box | untargeted |
Table 3: Attack Methods
Figure 5: Detector accuracy vs. perturbation budget curves across three datasets, under the norm reported in Section 5.2. Each column corresponds to an attack method and each row to a dataset. The red solid line denotes our MixDefense, while the green and blue dashed lines show the performance of FS and MagNet, respectively.

Accuracy of Detector. Figure 5 summarizes the detector accuracy of MagNet, FS, and our MixDefense under different attacks with various perturbation budgets. Each accuracy curve is fitted over 30 ascending adversarial perturbation budgets. For the FGSM, FGM, and BIM attacks, each point is obtained from around 1000 randomly sampled test-set images, half clean and half AEs. Note that, when the perturbation budget is too small, some attack methods can hardly fool the classifier. In that case, we collect all available successful AEs to perform the experiments. If fewer than 100 AEs can be collected, we simply drop the corresponding perturbation budgets, which is why the perturbation scales of the panels are not strictly equal. DeepFool and C&W have no explicit attack parameter that controls the adversarial perturbation budget. We therefore run each attack once, sort the generated adversarial samples by their perturbation sizes, and plot the accuracy curve accordingly. Additionally, the threshold is fixed for each dataset; that is, we use the same threshold against all kinds of attacks. Unless specified otherwise, the above setting is used in the following experiments on accuracy curves.
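For attacks without an explicit budget parameter, the sort-then-plot procedure can be sketched as follows. This is only an illustration under stated assumptions: AEs are grouped into equal-count bins after sorting, each bin reports the detection rate on AEs alone (the clean half of the mixture is omitted for brevity), and the function name and binning scheme are hypothetical.

```python
def detection_curve(perturbation_sizes, detected, num_bins):
    """Sort AEs by perturbation size and return (mean size, detection rate) per bin."""
    pairs = sorted(zip(perturbation_sizes, detected))
    n = len(pairs)
    curve = []
    for b in range(num_bins):
        # Equal-count bins over the sorted samples.
        chunk = pairs[b * n // num_bins:(b + 1) * n // num_bins]
        if not chunk:
            continue
        mean_size = sum(size for size, _ in chunk) / len(chunk)
        rate = sum(1 for _, hit in chunk if hit) / len(chunk)
        curve.append((mean_size, rate))
    return curve


# Toy output of a single attack run: perturbation size and detector verdict per AE.
sizes = [0.3, 0.1, 0.2, 0.4]
verdicts = [True, False, False, True]
for mean_size, rate in detection_curve(sizes, verdicts, num_bins=2):
    print(round(mean_size, 3), rate)
```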

As can be observed, MixDefense can defend against AEs with both large and small perturbations. This is because MixDefense consists of multiple defense layers for handling AEs with various perturbation strengths. The LP-Defense layer SAEC and the SP-Defense layer ContraNet in MixDefense are complementary to each other and filter out AEs with large and small perturbations in tandem.

In general, MixDefense yields superior results across a wide range of perturbation budgets, though competitors may be better at detecting AEs of a certain perturbation strength. For example, MagNet is on par with MixDefense on MNIST once the adversarial perturbation budget becomes sufficiently large. However, this competitiveness vanishes on the CIFAR10 and GTSRB datasets, or when the perturbation budget is small. Besides, the detector accuracy of MixDefense is more stable across perturbation budgets than that of the other two AE detectors. For instance, the detector accuracy of FS shows a significant positive correlation with the perturbation budget under the BIM attack.

Accuracy of Robust Classifier. We regard the classifier supported by the detector as a robust classifier, and report its accuracy in Figure 6. We also plot the accuracy of the classifier without any defense as a gray dotted line. As shown in Figure 6, the accuracy of the vanilla classifier decreases significantly as the perturbation grows. Madry also shows a declining trend on CIFAR10 and MNIST, but the performance drop is less severe than that of the vanilla classifier. In some cases the other two competitors achieve better performance. However, the robust-classifier accuracy of MixDefense is more stable than that of FS and MagNet, and generally remains high across attack types and datasets. For example, MagNet outperforms MixDefense against the DeepFool attack on CIFAR10 when the perturbation is larger than 0.1, whereas it achieves only 75% accuracy under the same perturbation budget against the BIM attack. Similarly, FS performs excellently on MNIST against the FGM attack, while a notable decline can be observed when switching to CIFAR10 and GTSRB.

As pointed out in Section 1, an ideal defense strategy should not cause accuracy loss for normal inputs. In view of this, we further test MixDefense on clean image data, and achieve 98.2%, 90.3%, and 96.8% classification accuracy on MNIST, CIFAR10, and GTSRB, respectively, which are on par with the original accuracy of the vanilla classifier as listed in Table 2. However, other defenses can reduce accuracy on clean images. For example, Madry leads to 7.2% and 0.63% accuracy decreases on CIFAR10 and MNIST, respectively.

Figure 6: Robust-classifier accuracy vs. perturbation budget curves, under the norm reported in Section 5.2. Each column corresponds to an attack method and each row to a dataset. The gray dashed line shows the classifier's prediction accuracy without defense, while the four colored lines show the robust accuracy (classifier + defense): the red solid line denotes MixDefense, the green dashed line FS, the blue dashed line MagNet, and the yellow dashed line Madry.

5.3 Results of LP-Defense

Figure 7: Accuracy of the large-perturbation detector (SAEC) vs. perturbation budget.
Figure 8: Correlation between the generated image and the conditional label. In each row, the first element is the input image, the second is its reconstruction generated with the correct label, and the remaining elements are reconstructions under six other conditional labels.

The performance of SAEC relies on selecting a reasonable threshold for the NL_scores to correctly distinguish AEs from normal samples. In practice, we use training data perturbed by random uniform noise instead of real AEs, and select a threshold such that clean images are not mistakenly regarded as perturbed ones. In this way, we avoid the need for extra information about AEs and for complicated optimization, which makes SAEC easier to generalize to a wide range of AEs.
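The threshold-selection idea can be illustrated with a small self-contained sketch. Everything in it is an assumption made for illustration: `toy_noise_score` is a crude stand-in for the NL_score (mean absolute difference between neighboring values), the "images" are 1-D synthetic signals, and the calibration rule simply picks the largest clean score so that no clean sample is flagged.

```python
import random


def toy_noise_score(signal):
    """Hypothetical stand-in for NL_score: mean absolute neighbor difference."""
    diffs = [abs(b - a) for a, b in zip(signal, signal[1:])]
    return sum(diffs) / len(diffs)


def add_uniform_noise(signal, strength, rng):
    """Perturb every value with uniform noise, used here as a proxy for real AEs."""
    return [v + rng.uniform(-strength, strength) for v in signal]


def calibrate_threshold(clean_scores, max_false_alarm_rate=0.0):
    """Pick a threshold that at most the allowed fraction of clean scores exceed."""
    ordered = sorted(clean_scores)
    allowed = min(int(max_false_alarm_rate * len(ordered)), len(ordered) - 1)
    return ordered[len(ordered) - 1 - allowed]


rng = random.Random(0)
# Smooth synthetic "clean" signals and their uniformly perturbed counterparts.
clean_signals = [[0.5 + 0.03 * ((i + k) % 10) for i in range(64)] for k in range(50)]
noisy_signals = [add_uniform_noise(s, 0.2, rng) for s in clean_signals]

clean_scores = [toy_noise_score(s) for s in clean_signals]
threshold = calibrate_threshold(clean_scores)
false_alarms = sum(score > threshold for score in clean_scores)
detected = sum(toy_noise_score(s) > threshold for s in noisy_signals)
print(false_alarms, detected / len(noisy_signals))
```

The calibration uses clean scores only, so no knowledge of a specific attack is needed; the uniformly perturbed signals serve merely as a sanity check that the chosen threshold still separates perturbed inputs.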

Figure 7 shows the effectiveness of SAEC by depicting the detector's accuracy vs. perturbation budget against four attacks across three datasets. The first two panels show the detection accuracy against FGSM and BIM under one norm, and the last two panels show it under the other norm. The last two panels clearly show that the accuracy grows dramatically as the perturbation becomes larger. More specifically, once the adversarial perturbation is larger than 0.07, the accuracy reaches a plateau above 97% and remains steady for larger perturbations. This trend agrees with our analysis in the previous section. So, if we regard perturbation budgets larger than 0.07 as large perturbations, SAEC achieves at least 97% accuracy against these attacks in the large-perturbation range. The first two panels of Figure 7 reveal a similar trend for the other norm, i.e., the accuracy rises sharply to at least 97% once the perturbation exceeds 7, and then remains steady. Taken together, these results indicate that SAEC is an effective statistical method for detecting AEs with large perturbations.

5.4 Results of SP-Defense

Qualitative results. Figure 8 shows the reconstructed images of the revised cGAN in the proposed SP-Defense layer, ContraNet. The first column shows four randomly sampled original images from the test set, and the second column shows their reconstructions given the correct labels as the conditional information. In contrast to the reconstructions with correct labels, the following columns show the reconstructions under six other fixed labels (which may or may not equal the correct label). These figures illustrate the core idea behind our small-perturbation detection: the contradiction between the AE's predicted label and its semantic information. More reconstruction examples can be found in the Appendix.

Figure 9: Accuracy of the small-perturbation detector vs. perturbation budget. We test ContraNet's performance against FGSM, FGM, and BIM under two different norms, across the datasets indicated by color.

Quantitative results. Figure 9 shows the accuracy of our small-perturbation detector as a function of the perturbation, under the same setting as for the large-perturbation detector. As shown in the first two panels of Figure 9, the detector's accuracy remains steady at a reasonably high level, i.e., at least 85%, in contrast to SAEC's poor behavior when the adversarial perturbation lies in a relatively small range, i.e., 0.01 to 0.07, or even a wider range such as 0.01 to 0.1. Consistent with our previous analysis, the small-perturbation detector plays a complementary role to SAEC, and together the two defense layers achieve high accuracy against a wide range of adversarial perturbations. This finding also accords with the results shown in the last two panels of Figure 9, which use the other norm as the distance measure. In contrast to SAEC's deteriorating accuracy against small perturbations, i.e., 0.5 to 7, the small-perturbation detector achieves a stable accuracy, averaging above 90%, against different attack methods across all three datasets. The result for MNIST is counterintuitive: MNIST is simpler than the other two datasets, yet the performance seems poor, especially when the perturbation is extremely small. A possible explanation is that the perturbation is too small to yield enough successful AE samples, which leads to unstable statistics.

Failure case analysis. Two batches of failure cases are shown in Figure 10 and Figure 11, corresponding to FN and FP, respectively. It is apparent from both figures that most input images are corrupted to such a degree that even humans would find it hard to recognize them within a limited time. From Figure 11 we can see that, if the adversarial perturbation disturbs the key features of an image, it changes the semantic meaning of the image and misleads ContraNet. An example is the 6th image pair, whose AE is very similar to its target class, which explains why the detection method believes it is a clean image.

6 Case Study: Adaptive Attack

In the previous section, we showed experimental results under white-box attacks (Section 2.1), wherein attackers have full knowledge of the target classifier but do not consider the proposed MixDefense when generating AEs. To further evaluate the robustness of MixDefense against adaptive attacks, in this section we present a case study on the GTSRB dataset, which is widely used in the training of autonomous vehicle systems.

Figure 10: Failure cases of MixDefense. The first row lists 10 clean images that are wrongly judged as AEs by MixDefense. The second row shows their corresponding reconstructions.
Figure 11: Failure cases of MixDefense. The first row lists 10 AEs generated by the C&W attack that are wrongly judged as clean images by MixDefense. The second row shows their corresponding reconstructions.

Possible attack strategies. Perhaps the most straightforward attack strategy is to treat MixDefense and the target classifier as a whole and generate AEs according to the gradient direction of its loss function. However, such a strategy is not practical, because: i) the LP-Defense layer SAEC is not trainable, and ii) the SP-Defense layer ContraNet is not trained in an end-to-end manner. In particular, the target classifier and ContraNet (including the revised cGAN and the deep metric module) have separate gradient propagation processes, and the only connection between them (i.e., the predicted label) is produced by a non-differentiable argmax operator.
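Why the argmax operator blocks gradient-based attacks can be seen from a tiny numerical experiment. The sketch below uses a hypothetical logit vector and finite differences: a small change in any logit leaves the predicted label unchanged, so the estimated gradient of the label with respect to the logits is zero and carries no direction for an attacker to follow.

```python
def argmax(values):
    """Index of the largest entry, i.e., the predicted label."""
    return max(range(len(values)), key=lambda i: values[i])


def finite_difference(fn, x, index, step=1e-4):
    """Forward-difference estimate of d fn / d x[index]."""
    bumped = list(x)
    bumped[index] += step
    return (fn(bumped) - fn(x)) / step


# Hypothetical classifier logits for a three-class problem.
logits = [2.0, 0.5, -1.0]
gradient = [finite_difference(argmax, logits, i) for i in range(len(logits))]
print(argmax(logits), gradient)
```

The predicted label is piecewise constant in the logits, so its gradient is zero almost everywhere and undefined at the decision boundaries.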

Another possible strategy is to attack MixDefense and the target classifier separately and generate a composite AE that combines the two independently generated perturbations. Such a composition is difficult to perform, because the two perturbations would weaken each other's attack strength. Moreover, we show in the following that it is extremely difficult, if not impossible, to attack MixDefense alone.

Robustness of MixDefense. Here, we focus on the robustness of ContraNet because it is the last defense layer of MixDefense: as long as ContraNet is robust against adaptive attacks, we can safely conclude that MixDefense is also robust.

The revised cGAN in ContraNet can generate reconstruction results that are strongly coupled with the conditions, i.e., the predicted labels provided by the target classifier. We empirically verify this claim by performing a targeted FGSM attack on the inputs of the cGAN following [advonvae], over a range of perturbation strengths. For each targeted attack, we generate AEs under a given perturbation budget such that the reconstructed image is optimized to be as similar as possible to the corresponding AE. As can be observed in Figure 12, however, due to the strong correlation between the reconstruction and the predicted label, the semantics of the reconstructed images still follow the predicted label under such adaptive attacks, resulting in contradictions that can be exploited for AE detection.

Figure 12: Adaptive attack on the input images of the improved cGAN in the SP-Defense layer of MixDefense with gradually increased perturbation strength. As can be observed, the reconstructed images produced by the improved cGAN are still strongly correlated with the target labels.
Figure 13: Directly performing an adaptive attack on the latent vectors of the improved cGAN. Although we give more advantage to the attackers, i.e., empowering them with the ability to modify the latent vector without constraint, the generated images are still strongly coupled with the condition “Speed limit (30km/h)”. See Section 6 for details.

Next, let us assume that attackers can bypass the encoder part of ContraNet and attack the revised cGAN by modifying the latent variable directly (in practice, this is not possible because AE attacks can only modify the inputs). Under such relaxed attack constraints, the resulting latent variable might not be normally distributed. We fix a class condition whose semantics differ from those of the AE, and seek a latent variable that yields a reconstructed image similar to the AE. As shown in Figure 13, although the strength of the FGSM attack exerted on the latent variable is gradually increased, the obtained latent variable can only be reconstructed to “Speed limit (30km/h)” given the fixed condition. Note that directly attacking the latent variable can be viewed as the strongest possible adaptive attack on ContraNet, because it offers the maximum freedom to affect the revised cGAN by placing no restrictions on the latent variable.
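The mechanics of a one-step targeted FGSM update on a latent vector can be sketched with a toy model. The generator below is purely hypothetical (a fixed class template plus a small linear contribution from the latent vector) and is not the revised cGAN; the sketch only shows how the signed-gradient step is formed and says nothing about the robustness of the real model.

```python
def sign(value):
    """Return -1, 0, or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


def toy_generator(z, template, gain=0.1):
    """Hypothetical conditional generator: class template plus a small latent term."""
    return [t + gain * zi for t, zi in zip(template, z)]


def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def targeted_fgsm_on_latent(z, template, target_image, epsilon, gain=0.1):
    """One signed-gradient step that pulls the reconstruction toward target_image."""
    recon = toy_generator(z, template, gain)
    # Analytic gradient of the squared distance with respect to each latent entry.
    grad = [2.0 * gain * (r - t) for r, t in zip(recon, target_image)]
    return [zi - epsilon * sign(g) for zi, g in zip(z, grad)]


template = [0.2, 0.8, 0.2]      # stands in for the fixed class condition
target_image = [0.9, 0.1, 0.9]  # stands in for the AE the attacker wants to match
z0 = [0.0, 0.0, 0.0]
z1 = targeted_fgsm_on_latent(z0, template, target_image, epsilon=1.0)

before = squared_distance(toy_generator(z0, template), target_image)
after = squared_distance(toy_generator(z1, template), target_image)
print(z1, before, after)
```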

In summary, it is rather difficult, if not impossible, to successfully perform an adaptive attack on MixDefense wherein the attacker has full knowledge of the target model and the defense framework.

7 Discussion

Good side effects. Deep classifiers inevitably misclassify some inputs when tested on naturally corrupted data [ford2019adversarial]. Although no intentional perturbations are exerted on these samples, MixDefense would still identify the misclassified samples as AEs. This is because there exist contradictions between the semantic meaning of the samples and the predicted labels, and the SP-Defense layer ContraNet is designed to detect AEs with small perturbations based on such contradictions. See Section 4 for more details. Although these misclassified samples are wrongly recognized as AEs, rejecting them would not decrease the accuracy of the target classifier and may even improve the classification accuracy [smith2011improving].

Intermediate layers. As mentioned above, the proposed MixDefense framework consists of multiple defense layers targeting AEs with various perturbation strengths. Besides the LP-Defense layer and the SP-Defense layer introduced in this paper, one might also deploy one or several intermediate defense layers for AEs with medium-sized perturbations. This can not only reduce the detection burden offloaded to the last SP-Defense layer but also simplify its design. Existing AE detection techniques reviewed in Section 2.2 can be adjusted and incorporated as intermediate defense layers in MixDefense to further boost the defense performance.


We found that ContraNet achieves good performance on datasets with obvious category information, such as GTSRB. When the class information of the dataset is relatively vague, the performance of ContraNet tends to decrease. An example is ImageNet, on which the state-of-the-art classifier, NFNet-F6 w/SAM, achieves 86.5% accuracy without extra training data [sota], and in which some images can be assigned multiple labels or even carry wrong labels [northcutt2021pervasive]. To improve the performance of ContraNet on complex datasets such as ImageNet, a more advanced cGAN or more training samples and training time are necessary.

8 Conclusion

In this work, we propose MixDefense, a multi-layer, attack-agnostic, defense-in-depth framework for AE detection. For AEs with large perturbations, the proposed SAEC technique in the LP-Defense layer is used to discover the statistical ‘noise’ difference between natural images and tampered ones. For AEs with small perturbations, the proposed ContraNet in the SP-Defense layer effectively captures the contradiction between the inference results of AEs and their semantic information. Experimental results with various AE attack methods on image classification datasets show that MixDefense outperforms existing AE detection techniques by a considerable margin and is also resistant to adaptive attacks.