Several notable properties of adversarial examples make the problem worth studying. Most surprisingly, adversarial examples have been shown to transfer from one network to another without knowledge of the target model. Real-world adversarial attacks have also been demonstrated: adversarial images retain their adversarial properties after being printed physically or recaptured with a camera.
Current adversarial defense research follows two approaches: detection and classification. Most methods that aim to classify adversarial examples correctly employ deep learning techniques that can be trained end-to-end [7, 8]. This approach has two key challenges. First, the adversary may treat the defense as part of the model and attack it in the same way as the original model. Second, defenses that modify the main classification network compromise its accuracy on clean samples; this trade-off is well documented and hinders the adoption of a majority of proposed defenses against adversarial attacks. Detection-based methods have shown promising results; however, it is not easy to detect adversarial samples using a single neural network. Thus, in our work, we aim to accurately detect attacks by employing multiple observer networks that classify the outputs of the hidden layers of the main network as clean or adversarial. Using more than one observer network makes it non-trivial to attack the classification model, while the original network parameters remain unchanged. This ensures accurate adversarial detection without compromising accuracy on clean samples. Detecting adversarial samples can be critical in systems such as self-driving vehicles, where human intervention can be sought.
Adversarial perturbations are imperceptible at the input level. However, Xie et al. show that these perturbations grow when propagated through a deep network and appear as significant noise in the hidden layers’ feature maps. Motivated by this fact, we augment the main network with multiple binary classifiers to detect this amplified noise in the feature maps. Each detector takes input from a different hidden layer of the original classification network and classifies the input as clean or adversarial. The parameters of the original model remain unaffected, thus retaining accuracy on clean samples. The ensemble of all detectors is used to determine whether the input is clean or not. Most of the binary classifiers rely on convolutional layers, which are well suited to the large number and dimensionality of the feature maps. We note that the observer networks are trained independently and do not modify the parameters of the target classifier. This enables detectors to be trained and plugged into off-the-shelf neural networks; thus, neural networks that are not mutable, or whose parameters are not publicly available, can be protected.
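To make the non-intrusive tapping of hidden layers concrete, the following minimal PyTorch sketch attaches forward hooks to a frozen classifier to capture feature maps for downstream observers. The toy network, layer choices, and sizes are illustrative stand-ins, not the paper's ResNet-18 setup.

```python
import torch
import torch.nn as nn

# Toy stand-in for the frozen main classifier (the paper uses a
# pre-trained ResNet-18; this small CNN is a hypothetical example).
classifier = nn.Sequential(
    nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
    nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(),
    nn.Flatten(), nn.Linear(16 * 28 * 28, 10),
)
for p in classifier.parameters():
    p.requires_grad_(False)  # main network parameters stay unchanged

captured = {}

def make_hook(name):
    def hook(module, inputs, output):
        captured[name] = output.detach()  # feature map fed to one observer
    return hook

# Tap two hidden layers; each captured feature map would feed one observer.
classifier[1].register_forward_hook(make_hook("relu1"))
classifier[3].register_forward_hook(make_hook("relu2"))

x = torch.randn(4, 1, 28, 28)  # batch of 4 MNIST-sized inputs
logits = classifier(x)          # forward pass also fills `captured`
```

Because the hooks only read activations, the classifier's weights and outputs are untouched, which is what allows the detectors to be bolted onto an existing model.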
Feinman et al. model the confidence of classifying adversarial samples by introducing Bayesian uncertainty estimates, which are available in neural networks that use dropout. These estimates can be captured through the variance of the output vectors obtained from different dropout paths of the neural network. They observe that this variance is much higher for adversarial inputs, which serves as a method of detection. Liang et al. treat adversarial perturbations as a form of noise: they use scalar quantization and a smoothing spatial filter to denoise the inputs, and compare classification results of the original and denoised versions of the input to detect potential adversaries. In another study, Wang et al. observe that classification results for adversarial inputs are much more sensitive to changes in the parameters of the deep neural network; they introduce a metric to measure this sensitivity and use a threshold value to filter input samples. We note that the methods discussed so far rely on capturing a pre-determined property, rather than on a machine learning model that detects anomalies. This often leads to sub-par detection performance due to generalization errors resulting from strong assumptions on data generation. Other signal processing techniques, such as random sampling and uncertainty estimation, have also led to promising results in non-intrusive adversarial input detection. Sheikholeslami et al. introduce a promising randomized approach for selective sampling from the hidden layers that aims at minimizing uncertainty in output classification. Combining the key idea behind the proposed architectures and ensemble strategy with this sampling strategy is an interesting direction for future work.
The work of Metzen et al. is most similar to our proposed approach. They train one auxiliary detector network that takes input from an internal layer of the deep neural network and classifies the input as clean or adversarial. The authors make several attempts to optimize the size and placement of the single detector in order to achieve optimal detection accuracy. Although circumventing such a detection algorithm is non-trivial, Carlini and Wagner show that it is possible in a defense-blind case. In contrast to this work, we train multiple detectors that take input from, and classify, the feature maps of all intermediate layers. We show that each hidden layer output is meaningful in discriminating the input samples, owing to the diverse nature of the adversarial noise present in the hidden layer outputs. Using multiple networks thus allows each detector to classify based on a different type of perturbation found in the feature maps, and their results are combined into an ensemble to achieve state-of-the-art detection accuracy.
We achieve an average detection accuracy of 92.74% and 90.53% for two popular datasets across four popular attacks in the adversarial machine learning field, outperforming previously proposed detection algorithms on the same datasets.
II Problem Setup
II-A Datasets and Classifiers
Our evaluations are conducted using two popular datasets: the MNIST dataset of handwritten digits and the CIFAR-10 dataset. MNIST consists of 60000 training images and 10000 test images of dimension 28×28×1, belonging to 10 classes (corresponding to the digits 0–9). The CIFAR-10 dataset consists of 50000 training images and 10000 test images of dimension 32×32×3, belonging to 10 classes (dog, cat, frog, horse, deer, airplane, truck, ship, automobile, bird).
II-B Threat Model
There are three popular threat models in adversarial machine learning as described by Carlini and Wagner :
A zero knowledge or black box attacker who is not aware of the model architecture, model parameters, or the defense in place.
A perfect knowledge or white box attacker who is aware of the model architecture and parameters and also aware of the parameters and type of defense in place.
A limited knowledge or semi-white box attacker who is aware of either the model or defense.
We consider a realistic semi-white box threat model where the attacker has perfect knowledge of the weights of the model and the architecture, but not of the defense.
We evaluate our defense on four popular attacks: Fast Gradient Sign Method (FGSM), Projected Gradient Descent (PGD), DeepFool, and the Carlini & Wagner attack (CW2). We use $x$ to denote the clean input image, $y$ for the ground-truth label, $\theta$ for the network parameters, and $x'$ for the constructed adversarial sample.
FGSM is a fast, single-step attack that perturbs the input in the direction of the sign of the loss gradient with respect to the input image. The adversarial sample is constructed as:

$x' = x + \epsilon \cdot \mathrm{sign}(\nabla_x J(\theta, x, y)),$

where $\epsilon$ is the maximum allowed perturbation under an $\ell_\infty$ norm and $J(\theta, x, y)$ is the loss function of the classifier. We conduct experiments with a fixed $\epsilon$.
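A minimal PyTorch sketch of the FGSM step, assuming inputs normalized to [0, 1]; the toy linear model and the $\epsilon$ value used below are illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps):
    """Single-step FGSM under an L-infinity bound eps (sketch)."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    x_adv = x + eps * x.grad.sign()        # step in the gradient's sign
    return x_adv.clamp(0.0, 1.0).detach()  # keep pixels in a valid range

# Illustrative usage on a tiny linear "classifier"
torch.manual_seed(0)
model = torch.nn.Linear(4, 3)
x = torch.rand(2, 4)
y = torch.tensor([0, 1])
x_adv = fgsm(model, x, y, eps=0.2)
```

The perturbation is bounded by construction: every pixel moves by at most eps before the final clamp.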
PGD can be a much stronger attack, as it tolerates a higher computational cost: the FGSM step is performed iteratively, and the optimization step for iteration $t+1$ is given by:

$x'_{t+1} = \mathrm{clip}_{x,\epsilon}\!\left(x'_t + \alpha \cdot \mathrm{sign}(\nabla_x J(\theta, x'_t, y))\right),$

where $x'_t$ is the output of the previous iteration and $\alpha$ is the step size. We conduct experiments using 100 iterations. The ‘clip’ function is used to clip the pixel value if the change exceeds the maximum allowed perturbation $\epsilon$.
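The iterative update with clipping can be sketched in PyTorch as follows; the toy model, $\epsilon$, step size, and iteration count are illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def pgd(model, x, y, eps, alpha, iters):
    """Iterative FGSM with projection onto the eps L-infinity ball (sketch)."""
    x_adv = x.clone().detach()
    for _ in range(iters):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()
            # 'clip': project back into the eps-ball around x, then to [0, 1]
            x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)
            x_adv = x_adv.clamp(0.0, 1.0)
    return x_adv.detach()

# Illustrative usage on a tiny linear "classifier"
torch.manual_seed(0)
model = torch.nn.Linear(4, 3)
x = torch.rand(2, 4)
y = torch.tensor([0, 1])
x_adv = pgd(model, x, y, eps=0.1, alpha=0.03, iters=5)
```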
DeepFool is an iterative attack that finds the closest hyperplane approximation of a decision boundary to $x$, and then pushes $x$ perpendicularly across that hyperplane, causing misclassification with minimal perturbation.
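For a linear binary classifier $w \cdot x + b$, the perpendicular step that DeepFool applies at each linearization has a closed form, sketched below; the 1.001 overshoot factor is an illustrative assumption to guarantee the boundary is crossed.

```python
import numpy as np

def deepfool_linear(w, b, x):
    """Minimal L2 perturbation crossing the hyperplane w.x + b = 0
    (the closed form DeepFool applies to its local linearization)."""
    f = w @ x + b
    r = -f * w / (w @ w)   # perpendicular step onto the hyperplane
    return x + 1.001 * r   # small overshoot to cross the boundary

# A point on the positive side of the boundary x1 = 1
w = np.array([1.0, 0.0])
b = -1.0
x = np.array([3.0, 0.0])
x_adv = deepfool_linear(w, b, x)
```

For deep networks, DeepFool repeats this step on the locally linearized decision boundaries until the predicted class changes.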
CW2 solves the following optimization problem to find $x'$:

$\min_{\delta}\; \|\delta\|_2 + c \cdot f(x + \delta), \quad x' = x + \delta,$

where $f$ is given by:

$f(x') = \max\!\left(Z(x')_y - \max_{i \neq y} Z(x')_i,\; -\kappa\right),$

and $y$ and $Z(\cdot)$ denote the true label and the logit vector (pre-softmax output), respectively.
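The margin term $f$ can be sketched in a few lines of NumPy; this illustrates only the inner objective (with confidence parameter kappa), not the full optimization over $\delta$.

```python
import numpy as np

def cw_margin(logits, y, kappa=0.0):
    """CW2 objective term f(x'): positive while the true class y still
    has the largest logit, clipped below at -kappa (sketch)."""
    z_true = logits[y]
    z_other = np.max(np.delete(logits, y))  # best competing logit
    return max(z_true - z_other, -kappa)

# True class still wins by a margin of 2 -> f = 2 (not yet adversarial)
m1 = cw_margin(np.array([3.0, 1.0, 0.0]), y=0)
# Another class wins -> f is clipped to -kappa = 0 (already adversarial)
m2 = cw_margin(np.array([0.0, 3.0, 1.0]), y=0)
```

Driving $f$ down to $-\kappa$ forces a misclassification with margin $\kappa$, while the $\|\delta\|_2$ term keeps the perturbation small.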
III Proposed Method
We augment the main network with multiple observer (detector) networks. These networks take inputs from hidden layers of the main network (feature maps) and perform binary classification on these feature maps to determine whether the original input is adversarial or clean. In the considered setup, the main network is a ResNet-18 architecture that has been pre-trained for image classification (training details: https://github.com/kuangliu/pytorch-cifar). The parameters of the main network are frozen before training the detectors. We place four detector networks that are trained independently. Like the main network, the detectors are deep neural networks; each ends with fully connected layers down to two output classes (adversarial or clean), and the architecture of each detector is derived from the base classification network. During training, adversarial images for the entire training set are generated for each attack being tested. During inference, the four detectors are treated as an ensemble, and the input is deemed adversarial if two or more detectors classify it as such. The ensemble is meaningful because each detector learns from a different layer’s feature map; indeed, the ensemble prediction of all four detectors yields a higher detection accuracy than any strict subset.
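The two-of-four decision rule described above can be sketched as a simple vote over the detectors' adversarial probabilities; the 0.5 threshold is an illustrative assumption.

```python
def ensemble_verdict(detector_probs, threshold=0.5, min_votes=2):
    """Flag the input as adversarial when at least `min_votes` of the
    detectors classify it as adversarial (sketch of the voting rule)."""
    votes = sum(p >= threshold for p in detector_probs)
    return votes >= min_votes

# Two detectors fire -> flagged as adversarial
verdict_adv = ensemble_verdict([0.9, 0.8, 0.1, 0.2])
# Only one detector fires -> treated as clean
verdict_clean = ensemble_verdict([0.6, 0.1, 0.2, 0.3])
```

Requiring agreement between detectors that watch different layers is what makes a single-layer evasion insufficient to fool the defense.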
The ResNet-18 main network has the following architecture: Input → Conv1 → Res1 → Res2 → Res3 → Res4 → FC. This is illustrated in Figure 1. Here, Res denotes a residual block and FC is the fully connected layer that produces the 10-dimensional output vector. Conv1 is a stride-2 convolution with 64 kernels. Each residual block is made up of two convolutional layers with a skip connection.
The final layer of each detector is a fully connected layer that goes from 10 down to 2 neurons. The detectors are trained using the Adam optimizer with a learning rate of 0.01, a batch size of 256, and weight decay. To the best of our knowledge, this architecture is unique and novel. The detector networks are derived by splitting the main classification network at various layers; reusing the architecture in this manner leads to promising results, as shown in the next section. The bounds of the adversarial perturbations used throughout the experiments are chosen based on studies in similar works and the results we compare to. These bounds are also low, so as to keep the generated adversarial examples imperceptible. Further, the bounds are kept constant during training and testing. As the adversarial examples are generated during training, the detectors are trained on the specific attack that the classifier is trying to defend against. Transferability of the proposed detection mechanism to attacks unseen during training is left for future work.
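A single training step for one observer, under the stated hyperparameters (Adam, learning rate 0.01), might look as follows; the small detector architecture, feature-map shape, and the 5e-4 weight-decay value are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical observer head: a small CNN ending in a 2-way classifier
# (clean vs. adversarial); the paper derives its detectors from
# ResNet-18 splits instead.
detector = nn.Sequential(
    nn.Conv2d(8, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(16, 2),
)
opt = torch.optim.Adam(detector.parameters(), lr=0.01,
                       weight_decay=5e-4)  # decay value is assumed
loss_fn = nn.CrossEntropyLoss()

# One step on a batch of captured feature maps with binary labels
feats = torch.randn(16, 8, 28, 28)    # hidden-layer outputs
labels = torch.randint(0, 2, (16,))   # 0 = clean, 1 = adversarial
opt.zero_grad()
loss = loss_fn(detector(feats), labels)
loss.backward()
opt.step()
```

Only the detector's parameters receive gradients; the frozen main network is never updated.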
IV Experimental Results
We demonstrate the effectiveness of our defense by measuring the accuracy of detection among adversarial samples (i.e., the percentage of adversarial samples that were successfully detected). We also measure the number of false positives (i.e., the number of clean images that were classified as adversarial). Across the four attacks and two datasets used, the classification accuracy post-attack (undefended network) ranged from 14% to 22%. Tables I and II summarize the results of adversarial detection across all testing scenarios. The columns D1, D2, D3, and D4 give the accuracies of the individual binary detectors; the Ensemble column gives the total detection accuracy of the defense (code accepted for publication).
IV-A Results on MNIST
We achieve a detection accuracy of 99.5% for FGSM and 89.1% for the CW2 attack; the complete results for all attacks are shown in Table I. False positive rates were 0.12%, 0.08%, 0.09%, and 0.07% for FGSM, PGD, CW2, and DeepFool, respectively. Among prior detectors, reported accuracies for detecting the FGSM attack are 92.2% and 93.86%, and the strongest baseline reports 97.67% for FGSM and 94.00% for CW2. The accuracies obtained in our experiments are superior in all but one case. Investigating the lower detection accuracy on CW2 relative to that baseline is left for future work.
IV-B Results on CIFAR-10
As shown in Table II, we achieve a detection accuracy of 97.5% for FGSM and 85.0% for CW2. False positive rates were 0.06%, 0.02%, 0.03%, and 0.02% for FGSM, PGD, CW2, and DeepFool, respectively. Among prior detectors, reported accuracies for detecting the FGSM attack are 74.7% and 84.00%, and the strongest baseline reports 91.00% for FGSM and 83.00% for CW2. We obtain better accuracies than these existing methods of detection.
IV-C False Positives
Many existing adversarial defenses are rendered unusable by their drop in accuracy on clean samples. We address this trade-off by keeping the original classification network unchanged. For all testing scenarios, we count the clean test samples in MNIST and CIFAR-10 that were classified as adversarial. In the worst-case scenario, a total of 12 clean images were classified as adversarial (a false positive rate of 0.12%). These results validate our hypothesis of minimal loss in accuracy on clean samples.
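The worst-case rate follows directly from the size of the 10000-image test sets:

```python
clean_total = 10000      # clean test images in each dataset's test set
false_positives = 12     # worst case reported above
fpr = 100.0 * false_positives / clean_total  # as a percentage
```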
IV-D Ablation Study
To assess the impact of each detector on the final detection accuracy, several experiments were performed. Tables I and II give the accuracy of detection for each individual detector, as well as for the ensemble of all four detectors, on MNIST and CIFAR-10. The ensemble accuracy is consistently higher than that of any individual detector, demonstrating the effectiveness of the ensemble. Experiments were also conducted to verify the need for multiple detectors. Ensemble accuracies were calculated using two subsets of the four detectors: D1 + D4 (peripheral: closer to the input and output) and D2 + D3 (middle: farther from the input and output). In these two-detector cases, the input is considered adversarial only when both detectors classify it as such. Table III shows the accuracy of detection under these scenarios. The inferior accuracy compared to the ensemble of all four detectors demonstrates the need for each detector, and strongly suggests that each detector learns new discriminative features about the adversarial image from the output of its hidden layer. Results using other combinations of D1 through D4 are provided in the Appendix.
[Table III columns: Attack, Dataset, D1 + D4, D2 + D3, Ensemble.]
This table shows the accuracy of detection for two different combinations of ensembling the detectors. Combination D1 + D4 uses the peripheral detectors and combination D2 + D3 uses the middle detectors. The number of false positives (FP) is computed over the 10000 test images, using the ensemble of all four detectors.
V Discussion and Future Work
Our defense does not incorporate any baseline defense such as adversarial training. Adversarial training is a widely used defense that trains the classifier on adversarial samples with their corrected labels; detection methods commonly employ this technique and build a defense on top of it. However, it is not useful for our detection-based method, because we seek to retain the discriminative properties of adversarial and clean samples. Adversarial training modifies the parameters of the network to improve classification of adversarial samples, which reduces the differences between the hidden-layer outputs for adversarial and clean inputs, making them harder to detect.
The observer networks employed in our defense have two main properties that make the defense effective: non-intrusiveness and diversity. The detection is non-intrusive because the observer networks are trained independently from the original classifier; the weights and gradients of the observer networks therefore differ from those of the main network, and the mechanism can be deployed simply on existing models. The detection is diverse because we use multiple observer networks that take inputs from different layers of the neural network. These feature maps differ from each other and provide different bases for classifying the input as adversarial or clean.
We achieve state-of-the-art detection results on the MNIST and CIFAR-10 datasets against powerful iterative attacks such as PGD and CW2. However, application of this defense on datasets of higher resolution using larger network architectures is open for further study. In particular, we anticipate exploring the scalability of our defense to deeper models.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2016. [Online]. Available: http://dx.doi.org/10.1109/CVPR.2016.90
-  J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2016. [Online]. Available: http://dx.doi.org/10.1109/CVPR.2016.91
-  J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” 2018.
-  K. Arulkumaran, A. Cully, and J. Togelius, “Alphastar,” Proceedings of the Genetic and Evolutionary Computation Conference Companion (GECCO ’19), 2019. [Online]. Available: http://dx.doi.org/10.1145/3319619.3321894
-  C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. J. Goodfellow, and R. Fergus, “Intriguing properties of neural networks,” CoRR, vol. abs/1312.6199, 2013.
-  A. Kurakin, I. Goodfellow, and S. Bengio, “Adversarial examples in the physical world,” 2016.
-  S. Wang, X. Wang, S. Ye, P. Zhao, and X. Lin, “Defending dnn adversarial attacks with pruning and logits augmentation,” in 2018 IEEE Global Conference on Signal and Information Processing (GlobalSIP), Nov 2018, pp. 1144–1148.
-  G. Jin, S. Shen, D. Zhang, F. Dai, and Y. Zhang, “Ape-gan: Adversarial perturbation elimination with gan,” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2019, pp. 3842–3846.
-  A. Kurakin, I. J. Goodfellow, and S. Bengio, “Adversarial machine learning at scale,” ArXiv, vol. abs/1611.01236, 2016.
-  N. Carlini and D. Wagner, “Adversarial examples are not easily detected,” Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security (AISec ’17), 2017. [Online]. Available: http://dx.doi.org/10.1145/3128572.3140444
-  C. Xie, Y. Wu, L. v. d. Maaten, A. L. Yuille, and K. He, “Feature denoising for improving adversarial robustness,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019. [Online]. Available: http://openaccess.thecvf.com/content_CVPR_2019/papers/Xie_Feature_Denoising_for_Improving_Adversarial_Robustness_CVPR_2019_paper.pdf
-  R. Feinman, R. R. Curtin, S. Shintre, and A. B. Gardner, “Detecting adversarial samples from artifacts,” 2017.
-  B. Liang, H. Li, M. Su, X. Li, W. Shi, and X. Wang, “Detecting adversarial image examples in deep neural networks with adaptive noise reduction,” IEEE Transactions on Dependable and Secure Computing, p. 1–1, 2018. [Online]. Available: http://dx.doi.org/10.1109/TDSC.2018.2874243
-  J. Wang, G. Dong, J. Sun, X. Wang, and P. Zhang, “Adversarial sample detection for deep neural network through model mutation testing,” 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE), May 2019. [Online]. Available: http://dx.doi.org/10.1109/ICSE.2019.00126
-  F. Sheikholeslami, S. Jain, and G. B. Giannakis, “Efficient randomized defense against adversarial attacks in deep convolutional neural networks,” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2019, pp. 3277–3281.
-  F. Sheikholeslami, S. Jain, and G. B. Giannakis, “Minimum uncertainty based detection of adversaries in deep neural networks,” 2019.
-  J. H. Metzen, T. Genewein, V. Fischer, and B. Bischoff, “On detecting adversarial perturbations,” 2017.
-  Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” in Proceedings of the IEEE, vol. 86, no. 11, 1998, pp. 2278–2324. [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.42.7665
-  A. Krizhevsky, V. Nair, and G. Hinton, “Cifar-10 (canadian institute for advanced research).” [Online]. Available: http://www.cs.toronto.edu/~kriz/cifar.html
-  I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” arXiv preprint arXiv:1412.6572, 2014.
-  A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, “Towards deep learning models resistant to adversarial attacks,” ArXiv, vol. abs/1706.06083, 2017.
-  S.-M. Moosavi-Dezfooli, A. Fawzi, and P. Frossard, “Deepfool: A simple and accurate method to fool deep neural networks,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2016. [Online]. Available: http://dx.doi.org/10.1109/CVPR.2016.282
-  N. Carlini and D. Wagner, “Towards evaluating the robustness of neural networks,” 2017 IEEE Symposium on Security and Privacy (SP), May 2017. [Online]. Available: http://dx.doi.org/10.1109/SP.2017.49
-  D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” International Conference on Learning Representations, 12 2014.
Other Detector Combinations
The table below shows the accuracies of using an ensemble of two further combinations of detectors D1 through D4, beyond the ones discussed in the main text. In the case of D2 to D4, the three detectors D2 + D3 + D4 form the ensemble and the majority classification result (clean or adversarial) is selected.
[Table columns: Attack, Dataset, D3 + D4, D2 to D4, Ensemble.]
This table shows the accuracy of detection for 2 different combinations of ensembling the detectors.
Adversarial Noise Propagation
Figure 1 demonstrates the propagation and amplification of adversarial perturbations through the considered ResNet-18 network. The top images, from left to right, show a clean and an adversarial ‘3’, randomly selected from the MNIST dataset; the adversarial image is generated using PGD with $\epsilon$ = 0.2. The bottom images show their respective feature maps obtained from the output of the Res1 layer. The feature map of the adversarial ‘3’ is clearly and significantly noisier than that of the clean ‘3’. This motivates our approach of detecting input images with noisy feature maps in the main classification network.