Convolutional neural networks (CNNs) have achieved remarkable success on a variety of tasks [6, 17, 18, 11, 27]. Interestingly, Szegedy et al. [31, 14, 3, 20, 9] found that an original image can be misclassified with high confidence after a very small perturbation is added to it; the perturbed image is called an adversarial example. The existence of adversarial examples indicates that CNNs are more fragile than we had imagined.
Defense methods against adversarial examples can be divided into two main classes. The first is complete defense, which enhances the robustness of a network by modifying its structure. For instance, gradient-masking methods [15, 7, 28, 12] push the network's gradients into unreachable states to prevent adversarial examples from being generated by optimization. However, these methods do not analyze the difference between the adversarial example and the original image; as a result, only the generalization ability of the network is improved, and the adversarial examples remain unrecognizable. Under a stronger attack, defenses based on obfuscated gradients can be broken. The second class is detection-based defense [33, 22, 32], which only identifies whether an input is an adversarial example. Because such methods use a formalized representation of the difference between the adversarial example and the original image, the category information is lost. They are therefore limited to detecting adversarial examples and incapable of classifying them correctly.
In summary, complete defense methods and detection-based methods lack a way to quantitatively analyze how the features of the input data change as they propagate through the network. This absence makes it hard to detect and classify adversarial examples simultaneously. Our experimental findings give insights into this problem. We found that, as the layers deepen, the feature-map representations of an adversarial example and its original image become gradually separable on networks such as VGG16 and Inception. We call this property Adversarial Feature Separability (AFS). AFS states that adversarial examples become distinguishable in feature space as the number of layers increases. Moreover, the group visualization method proposed by Olah et al. provides a feasible way to interpret the internal representations of CNNs; these group-visualized features contain representations corresponding to the category of the input image. We therefore combine the two ideas and propose an Adversarial Feature Genome (AFG) based defense framework against adversarial examples.
This framework not only distinguishes between original images and adversarial examples but also retains each image's category representations. The framework first decomposes the matrix consisting of the multi-channel feature maps in each layer, extracting a low-dimensional representation of the high-dimensional feature data. Then, the visualized features of the decomposed groups are stitched together to represent the features the network has learned in the corresponding layer. Finally, we stack the stitched images of the different layers to obtain the AFG. We demonstrate that AFGs can effectively distinguish adversarial examples from original images and also classify them correctly. Therefore, AFGs can serve as a data-driven adversarial example recognition method that both detects and classifies adversarial examples, as shown in Fig. 1. In detail, we transform all images into AFGs to construct a large database and give each AFG a label indicating its original category together with the label assigned by the classification CNN. These two labels agree for original images but differ for adversarial examples. By training a multi-label classification CNN on the AFG database, we can successfully recognize adversarial examples. Our contributions are as follows:
We find that adversarial examples and original images have different representations in feature space, and that the divergence grows as layers go deeper. Inspired by this observation, we propose the adversarial feature genome, which can be applied to analyze the features of data inside CNNs and can potentially be used for the interpretability of CNN models.
We verify that the difference between the features of original images and adversarial examples is relatively small in shallow layers but grows larger in deeper layers.
We further train a multi-label CNN on the adversarial feature genome dataset. Experiments show that the proposed framework not only effectively identifies adversarial examples during defense, but also classifies them correctly with a mean accuracy of up to 63%.
The structure of this paper is as follows. Section 2 discusses our key idea of adversarial feature separability. Section 3 presents how to build the Adversarial Feature Genome. Section 4 models adversarial example recognition as a multi-label classification problem. Section 5 shows how our AFG database improves adversarial example defense. Conclusions and future work are given in Section 6.
2 Adversarial Feature Separability
An adversarial example is a version of an original image with a small added perturbation that confuses the network into misclassifying it. Generating an adversarial example amounts to computing this additive perturbation. We define $x$ as the original image, $x'$ as the adversarial example, $\eta$ as the small adversarial perturbation, and $y_t$ as the target class of the adversarial example. The adversarial example is obtained by Eq. 1:

$$x' = x + \eta \tag{1}$$

where $f(\cdot)$ represents the classifier function. Although $x$ and $x'$ are visually similar because $\eta$ is negligible, the CNN classifies them into completely different categories, which can be formulated as Eq. 2:

$$f(x) \neq f(x') = y_t \tag{2}$$

A CNN is a composite function of linear transforms and activation functions $\sigma(\cdot)$, and its decision is based on the features extracted by the learned network weights. After the inputs $x$ and $x'$ are fed into the network, they activate different network weights during forward propagation, so the features extracted by these weights at each layer differ in feature space.
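As a concrete (and heavily simplified) illustration of Eq. 1, the sketch below perturbs a toy input so that a linear scorer's output changes, using a one-step sign perturbation in the spirit of FGSM-style attacks; the weights `w`, the input `x`, and the step size `eps` are all assumed toy values, and a real attack would back-propagate through the CNN instead.

```python
# Toy sketch of Eq. 1: x' = x + eta. A linear scorer f(x) = w . x stands in
# for the CNN; eta = eps * sign(w) is the one-step sign perturbation that
# increases the score (the paper's experiments use the iterative variant).

def score(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def sign(v):
    return 1.0 if v > 0 else (-1.0 if v < 0 else 0.0)

def adversarial(w, x, eps):
    # For a linear model, the gradient of the score w.r.t. x is w itself,
    # so the sign step is eps * sign(w).
    eta = [eps * sign(wi) for wi in w]
    return [xi + ei for xi, ei in zip(x, eta)]

w = [0.5, -1.0, 0.25]           # toy classifier weights (assumed)
x = [0.1, 0.2, -0.1]            # toy "image" (assumed)
x_adv = adversarial(w, x, eps=0.3)
print(score(w, x), score(w, x_adv))
```

The perturbed input receives a strictly larger score than the original while each pixel moved by at most `eps`, mirroring how a visually negligible $\eta$ can flip a decision.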
In CNNs, feature maps strongly correlate with the sample: they contain features corresponding to the category and reflect how the network decides to classify the input image [34, 25]. This means feature maps can be used both to represent features and to measure the difference between adversarial examples and original images. Although the feature maps of adversarial examples and original images differ, whether they are separable remains unknown, so we conduct several experiments to explore this. Once separability is verified, we can recognize and classify adversarial examples by separating their feature maps from those of original images.
We define $F_l(x)$ as the feature maps of the original image $x$ in the $l$-th layer, $F_l(x')$ as the feature maps of the adversarial example $x'$, and $D(\cdot, \cdot)$ as a distance function between $F_l(x)$ and $F_l(x')$.
First, we compute the $L_1$ distance between the feature maps of adversarial examples and their original images in each layer, as shown in Eq. 3:

$$D_{L1}^{(l)} = \lVert F_l(x) - F_l(x') \rVert_1 \tag{3}$$

Then, we adopt the Kullback-Leibler divergence to measure the difference between the distributions $P_l$ and $Q_l$ of $F_l(x)$ and $F_l(x')$ in Eq. 4:

$$D_{KL}^{(l)}(P_l \,\|\, Q_l) = \sum_i P_l(i) \log \frac{P_l(i)}{Q_l(i)} \tag{4}$$
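The two per-layer measures above can be sketched in a few lines; the feature maps here are toy flattened lists, and the softmax normalization used before the KL term is one reasonable choice (the paper does not specify how activations are turned into distributions).

```python
# Minimal sketch of Eq. 3 and Eq. 4: per-layer L1 distance and KL divergence
# between flattened feature maps of an original image and its adversarial
# example. Toy lists stand in for real feature maps.
import math

def l1_distance(f_orig, f_adv):
    return sum(abs(a - b) for a, b in zip(f_orig, f_adv))

def normalize(f):
    # Softmax-normalize raw activations into a probability distribution
    # (an assumed choice for forming P_l and Q_l).
    exps = [math.exp(v) for v in f]
    s = sum(exps)
    return [e / s for e in exps]

def kl_divergence(f_orig, f_adv):
    p, q = normalize(f_orig), normalize(f_adv)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

shallow = ([0.2, 0.5, 0.1], [0.21, 0.5, 0.12])   # near-identical shallow features
deep    = ([0.2, 0.5, 0.1], [1.5, -0.3, 0.9])    # diverged deep features (assumed)
print(l1_distance(*shallow), l1_distance(*deep))
print(kl_divergence(*shallow), kl_divergence(*deep))
```

With these toy values, both measures are larger for the "deep" pair than the "shallow" pair, which is the per-layer trend the experiment below quantifies.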
In this experiment, we first generate adversarial examples with the Basic Iterative attack on an Inception network, with the number of iterations set to 20. We randomly choose images from 6 categories of the ImageNet dataset, 200 images per category, generating 1,200 adversarial examples in total.

Next, we measure the difference by $L_1$ distance and KL divergence for each pair of adversarial example and original image, and compute the mean, maximum, minimum, and variance for each layer. The results are shown in Figure 2.
As shown in Figure 2, whether we use the $L_1$ distance or the Kullback-Leibler divergence, the distance between the features of adversarial examples and original images gradually increases as the CNN layers deepen, which indicates that adversarial examples and their original images are distinguishable in the feature maps. Meanwhile, the variances in both figures are stable and close to zero, which means the adversarial feature separability is stable.
3 Adversarial Feature Genome
The above experiment shows that feature maps can represent the difference between adversarial examples and original images; however, they are too large and redundant to be used directly for detection. To reduce the size and redundancy of the feature maps, we apply the visualization techniques of Olah et al. to each layer to obtain groups that present the main features of that layer, and use an image called the maximum activation image to represent the features learned by the CNN.
First, we apply non-negative matrix factorization (NMF) to the feature maps at each layer to reduce their dimension and obtain a chosen number of feature groups. After NMF, the features retained in the groups become more independent and meaningful than without it. Let $v$ be an $m$-dimensional non-negative random vector of which we conduct $N$ observations; we collect these observations as the matrix $V = [v_1, v_2, \ldots, v_N]$. NMF computes a non-negative base matrix $W$ and coefficient matrix $H$ such that $V \approx WH$.
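The factorization $V \approx WH$ can be sketched with the classic Lee-Seung multiplicative updates; the matrix shapes, rank, and iteration count below are assumed toy values, and a real pipeline would factor the flattened feature maps of a layer (e.g. with scikit-learn's NMF).

```python
import numpy as np

def nmf(V, r, iters=200, eps=1e-9, seed=0):
    """Factor a non-negative V (m x N) into W (m x r) @ H (r x N) via
    Lee-Seung multiplicative updates; r is the number of feature groups."""
    rng = np.random.default_rng(seed)
    m, N = V.shape
    W = rng.random((m, r)) + eps
    H = rng.random((r, N)) + eps
    for _ in range(iters):
        # Multiplicative updates keep W and H non-negative by construction.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy "feature map" matrix: 8 flattened channels observed 20 times (assumed).
V = np.abs(np.random.default_rng(1).random((8, 20)))
W, H = nmf(V, r=3)
err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(f"relative reconstruction error: {err:.3f}")
```

Each column of $W$ corresponds to one feature group; picking a small $r$ is what lets a handful of groups summarize a layer's many channels.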
Then we apply the activation maximization (AM) algorithm to each group to obtain its maximum activation image:

$$I_{l,g} = \arg\max_{I} A_{l,g}(I)$$

where $A_{l,g}(I)$ is the activation value of the $g$-th group in the $l$-th layer for input $I$. The obtained image $I_{l,g}$ is the maximum activation image of the $g$-th group at the $l$-th layer, which can be considered a main feature learned by the convolution kernels in that layer. For every image, the total number of maximum activation images across all layers is large; to use them effectively, we assemble all of them into an adversarial feature genome. We create the AFG for each sample by stitching together the maximum activation images obtained in the same layer and then stacking the stitched images of different layers. We visualize 3 pairs of AFGs of adversarial examples and original images in Figure 3.
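The AM step above can be sketched as plain gradient ascent; since running it against a real CNN is out of scope here, the group "activation" is a stand-in quadratic whose maximizer is a known pattern, so the sketch only demonstrates the optimization loop, not the visual results in Figure 3.

```python
# Hedged sketch of activation maximization: gradient-ascend an input until it
# maximally activates a chosen group. The real method back-propagates through
# the CNN; here A(I) = -sum((I - c)^2) is an assumed stand-in activation whose
# maximizer is the pattern c the group responds to.

def activation(I, c):
    return -sum((i - ci) ** 2 for i, ci in zip(I, c))

def activation_maximization(c, steps=500, lr=0.1):
    I = [0.0] * len(c)                                  # start from a blank image
    for _ in range(steps):
        grad = [-2 * (i - ci) for i, ci in zip(I, c)]   # analytic dA/dI
        I = [i + lr * g for i, g in zip(I, grad)]
    return I

pattern = [0.8, 0.1, 0.5, 0.9]     # assumed preferred pattern of group (l, g)
img = activation_maximization(pattern)
print([round(v, 3) for v in img])
```

The ascent converges to the group's preferred pattern; with a real network the same loop would produce the abstract textures that make up each AFG channel.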
From Figure 3 we can observe that in the shallow layers the AFGs of the adversarial example and the original image are highly similar. In the last few layers the AFGs become abstract and irregular, but they are clearly inconsistent with each other, which agrees with the AFS assumption: the separation between the adversarial example and the original image gradually increases layer by layer.
The AFS phenomenon makes the AFG a usable tool for detecting adversarial examples. Nonetheless, recognizing adversarial examples also requires features corresponding to the category, so that adversarial examples can be classified correctly. Fortunately, feature maps are highly related to the category of the input image because they are abstract features extracted by the convolution kernels in each layer, which is useful for classification. AFGs, being based on feature maps, should inherit this property, so that they are separable by category and can be used to classify images. We verify the category separability of AFGs with the following experiments.
We randomly select AFGs corresponding to original images of the jellyfish and leafhopper classes and use t-Distributed Stochastic Neighbor Embedding (t-SNE) to reduce their dimension so that they can be visualized in a 2D plane. The result is shown in Figure 4(a). The features belonging to different categories are separated in the 2D plane, which indicates that AFGs retain sufficient feature information to differentiate categories. We also select AFGs of the keeshond and komondor classes, which share common features; the result in Figure 4(b) shows that two categories with similar features cannot be accurately distinguished.
Comparing Figure 4(a) with Figure 4(b), we find it difficult to distinguish similar categories with AFGs, but easy to discriminate categories with large differences. This validates that the AFG indeed preserves category information, but the information is macroscopic, which makes the AFG less capable of separating closely related categories.
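What Figure 4 shows visually with t-SNE can be checked with a simple numeric proxy: classes are separable when the mean inter-class distance of their AFG vectors exceeds the mean intra-class distance. The 2-D "AFG embeddings" below are assumed toy values standing in for real high-dimensional AFGs.

```python
# Numeric proxy for the t-SNE separability experiment: compare mean
# intra-class vs inter-class pairwise distance of toy AFG vectors.

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def mean_pairwise(xs, ys):
    pairs = [(a, b) for a in xs for b in ys if a is not b]
    return sum(dist(a, b) for a, b in pairs) / len(pairs)

jellyfish  = [[0.1, 0.2], [0.2, 0.1], [0.15, 0.25]]   # assumed AFG embeddings
leafhopper = [[0.9, 0.8], [0.8, 0.9], [0.85, 0.95]]

intra = (mean_pairwise(jellyfish, jellyfish)
         + mean_pairwise(leafhopper, leafhopper)) / 2
inter = mean_pairwise(jellyfish, leafhopper)
print(f"intra={intra:.3f} inter={inter:.3f} separable={inter > intra}")
```

For similar classes like keeshond and komondor, the two means would be close, matching the overlap seen in Figure 4(b).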
In conclusion, the AFGs generated from feature maps can represent the difference between original images and adversarial examples without losing category features. A method based on AFGs thus combines the detection-driven and complete-defense-driven approaches. To fully utilize AFGs and implement adversarial example recognition, we create a new dataset, the AFG dataset, by applying our method to both original images and adversarial examples, giving each obtained AFG its real label and the label it is classified into. Classification models can then be trained on the AFG dataset to recognize and analyze adversarial examples.
4 Adversarial Examples Recognition
AFGs, which are essentially based on the representations of the feature maps, can classify images and are separable between adversarial examples and original images, so we can apply them to recognize adversarial examples. Moreover, the layer at which an adversarial example and its original image begin to differ in the feature maps varies across samples, and the magnitude of the difference varies as well, which indicates that AFGs are free from human interference. We therefore construct a multi-label dataset of AFGs and train a new CNN, which is good at complex feature extraction, to learn the difference between adversarial examples and original images and to induce the underlying pattern. In this section, we introduce our AFG-based adversarial example recognition framework, which is mainly inspired by AFS and is described in Figure 5.
For each image in the dataset, we reduce the dimension of its feature maps in each layer by NMF and obtain activation blocks. We then apply AM to generate a feature image for each activation block and stitch the feature images within the same layer, so that the feature images of each layer are transformed into one larger stitched image. We stack the stitched images of all layers to construct the AFG, which can be used to recognize adversarial examples.
For the construction of the multi-label AFG dataset, we convert all samples of the original dataset into AFGs and give each sample its real label and its classification label.
The specific organization of the label is as follows:
We count the number of categories of the original images and of the adversarial examples to determine the lengths of the original-image label and the adversarial-example label.
We use one-hot encoding for the original categories and for the classification results produced by the classification CNN.
For adversarial examples, we concatenate the original label with the adversarial label as the final label encoding of the AFG; for original images, we concatenate the original label with a zero vector. The length of the zero vector is chosen so that the label encodings of adversarial and original examples are aligned.
Encoding examples are shown in Table 1.
Finally, we can judge whether an input is an adversarial example by checking whether the second part of the label encoding produced for its AFG is all zeros; moreover, if an image is an adversarial example, we can also recover its original category from the first part of the label.
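The label scheme above can be sketched directly; the number of categories `n` is an assumed toy value, and the encoding follows the description: one-hot original label concatenated with either the one-hot adversarial label or an all-zero block.

```python
# Sketch of the AFG label encoding: [one-hot original label | second block],
# where the second block is the one-hot adversarial (misclassified) label for
# adversarial examples, or all zeros for original images.

def one_hot(index, length):
    v = [0] * length
    v[index] = 1
    return v

def encode(orig_class, n_orig, adv_class=None, n_adv=None):
    n_adv = n_adv if n_adv is not None else n_orig
    second = one_hot(adv_class, n_adv) if adv_class is not None else [0] * n_adv
    return one_hot(orig_class, n_orig) + second

def is_adversarial(label, n_orig):
    # An input is flagged adversarial iff the second block is not all zeros.
    return any(label[n_orig:])

n = 4                                                  # toy number of categories
clean = encode(orig_class=2, n_orig=n)                 # original image
adv   = encode(orig_class=2, n_orig=n, adv_class=0)    # adversarial example
print(is_adversarial(clean, n), is_adversarial(adv, n))
```

The first block always carries the true category, so detection (second block) and classification (first block) are read from the same prediction.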
5 Experiments and Results
AFGs can be applied to detect adversarial examples because the AFGs of adversarial examples and original images are separable, and they can be used to classify adversarial examples correctly because they preserve features that imply the original category. In this section, we use the AFG dataset created in Section 3 to train a CNN on the multi-label task of recognizing adversarial examples and classifying them correctly.
Because the AFG database is very large, the cost of training on all of the data would be very high, so we randomly select the AFGs of original images from 33 categories together with their corresponding adversarial examples. We then use stratified sampling on the AFG dataset to construct the training and test sets. Because the AFGs of adversarial examples are class-imbalanced in the training set, we use the Synthetic Minority Oversampling Technique (SMOTE) to balance them. Finally, we train VGG and InceptionV1 networks on the training set.
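The SMOTE balancing step can be sketched as nearest-neighbor interpolation; the 2-D minority samples below are assumed toy values, and a real pipeline would run this on flattened AFGs (e.g. via the imbalanced-learn library).

```python
import random

# Minimal sketch of the SMOTE idea: synthesize a minority-class sample by
# interpolating between a real sample and its nearest minority-class neighbor.

def nearest_neighbor(x, others):
    return min(others, key=lambda o: sum((a - b) ** 2 for a, b in zip(x, o)))

def smote_sample(x, minority, rng):
    nn = nearest_neighbor(x, [m for m in minority if m is not x])
    t = rng.random()                       # random point on the segment x -> nn
    return [a + t * (b - a) for a, b in zip(x, nn)]

rng = random.Random(0)
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]   # toy minority-class vectors
synth = smote_sample(minority[0], minority, rng)
print(synth)
```

The synthetic sample lies on the segment between a real minority sample and its neighbor, so the oversampled class stays inside its own region of feature space.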
5.1 Multi-label Classification
To show that AFGs can effectively recognize adversarial examples, we train the multi-label recognition CNN on the training set. In the experiment, the multi-label CNN uses the VGG16, VGG19, and InceptionV1 architectures, and accuracy is measured by IoU (intersection over union between the predicted and ground-truth label sets). The results are shown in Fig. 6.
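The IoU accuracy can be made concrete as follows; representing each multi-label prediction as a set of active label indices is an assumed simplification of the encoding described in Section 4.

```python
# Sketch of IoU accuracy for the multi-label task: predicted and ground-truth
# label sets are compared by intersection over union, then averaged.

def iou(pred, true):
    pred, true = set(pred), set(true)
    union = pred | true
    return len(pred & true) / len(union) if union else 1.0

def mean_iou(preds, trues):
    return sum(iou(p, t) for p, t in zip(preds, trues)) / len(preds)

preds = [{2, 7}, {1, 5}, {3}]   # toy predicted label sets (assumed)
trues = [{2, 7}, {1, 4}, {3}]   # toy ground-truth label sets
print(mean_iou(preds, trues))
```

A sample where only one of the two active labels (say, the original category but not the adversarial label) is predicted correctly scores 1/3 rather than 0, so the metric partially credits half-right predictions.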
As shown in Fig. 6, although the accuracy fluctuates greatly during training, once the models converge, the test accuracy on InceptionV1, VGG16, and VGG19 reaches 0.52, 0.655, and 0.7 respectively. Even though the lowest test accuracy among the three models is 0.52, their mean test accuracy is nearly 0.63, which further indicates that the proposed AFG-based framework can effectively recognize adversarial examples and classify them correctly.
5.2 Sampling some channels to classify
The AFG database is large not only in number of samples but also in number of channels: every AFG in the database has thirty-six channels. The previous conclusions tell us that the AFGs of adversarial examples and their original images are very similar in the shallow layers and very different in the deep layers. The similar parts retain the features of the original category, while the different parts make the two distinguishable. We therefore expect that AFGs sampled only from the shallow and deep layers can also be used to recognize adversarial examples and classify them correctly. We sample the AFG of each sample in the shallow and deep layers and use these sampled AFGs to train a multi-label CNN as in Section 5.1, except that only VGG16 is used. In this experiment, we sample the first 9 channels and the last 12 channels. The result is shown in Fig. 9.
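The sampling scheme is a simple slice over the channel axis; the AFG is modeled here as a plain list of 36 channel placeholders, a stand-in for the real stacked image tensor.

```python
# Sketch of the channel-sampling scheme: keep the first 9 (shallow) and last
# 12 (deep) of the 36 AFG channels and drop the middle ones before training.

def sample_channels(afg, first=9, last=12):
    return afg[:first] + afg[-last:]

afg = [f"channel_{i}" for i in range(36)]   # toy stand-in for a 36-channel AFG
sampled = sample_channels(afg)
print(len(sampled))
```

The sampled AFG keeps 21 of 36 channels, cutting input size by roughly 40% while preserving the shallow and deep layers that matter most for separability.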
Although the features of the middle channels are lost after sampling, the final test accuracy still reaches 0.62 in Figure 9, close to the 0.65 accuracy in Figure 6. Comparing Figure 9 with Figure 6, we find that the difference in test accuracy between them is very small and the trends of training and test accuracy are very similar, which may mean that the features of the first few layers and the last few layers retain the main information needed to recognize adversarial examples and classify them correctly.
6 Conclusion and future works
In this paper, we propose a new, data-driven paradigm for adversarial example defense. This paradigm is built on an observed fact: adversarial examples and original images are inseparable by a classifier at the FC layer, but divisible in feature space after the weights of a CNN model are transformed into a representation space by NMF. We build an Adversarial Feature Genome dataset via group visualization of the feature maps at each layer for images from the ImageNet database, and then cast adversarial example defense as a multi-label classification problem solved by a CNN model called the Recognition CNN. Experiments show that the proposed framework not only effectively identifies adversarial examples during defense, but also classifies them correctly with a mean accuracy of up to 63%. Our framework potentially gives a new perspective on adversarial example defense. We believe that adversarial example defense research may benefit from a large-scale AFG database similar to ImageNet.
In future work, we will incorporate more variables into the AFG database, for instance the structural information of CNNs, CNN hyper-parameters, and different base datasets. A richer AFG database could support training a universal Recognition CNN model that is robust to adversarial examples from diverse generation methods or diverse source datasets. Also, adversarial examples seem to be cursed by high dimensionality, which may indicate that feature space is crucial both for analyzing why adversarial examples exist and for recognizing them [13, 8]. A fundamental theory of adversarial examples is still missing.
-  N. Akhtar and A. Mian. Threat of adversarial attacks on deep learning in computer vision: A survey. arXiv preprint arXiv:1801.00553, 2018.
-  A. Athalye, N. Carlini, and D. Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420, 2018.
-  N. Carlini and D. Wagner. Audio adversarial examples: Targeted attacks on speech-to-text. arXiv preprint arXiv:1801.00634, 2018.
-  N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, pages 321–357, 2002.
-  C. Cortes and V. Vapnik. Support-vector networks. Machine learning, 20(3):273–297, 1995.
-  J. Deng, W. Dong, R. Socher, and L. J. Li. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255, 2009.
-  G. S. Dhillon, A. Kamyar, Z. C. Lipton, J. Bernstein, K. Jean, K. Aran, and A. Anima. Stochastic activation pruning for robust adversarial defense. In International Conference on Learning Representations, 2018, ICLR 2018., 2018.
-  S. Dube. High dimensional spaces, deep learning and adversarial examples. arXiv preprint arXiv:1801.00634, 2018.
-  G. F. Elsayed, D. Krishnan, H. Mobahi, K. Regan, and S. Bengio. Large margin deep networks for classification. arXiv preprint arXiv:1803.05598, 2018.
-  D. Erhan, Y. Bengio, A. Courville, and P. Vincent. Visualizing higher-layer features of a deep network. 2009.
-  L. Fei-Fei, R. Fergus, and P. Perona. One-shot learning of object categories. IEEE transactions on pattern analysis and machine intelligence, 28(4):594–611, 2006.
-  T. Florian, K. Alexey, P. Nicolas, I. Goodfellow, B. Dan, and M. Patrick. Ensemble adversarial training: Attacks and defenses. arXiv preprint arXiv:1705.07204, 2017.
-  T. Florian, P. Nicolas, G. Ian, B. Dan, and P. McDaniel. The space of transferable adversarial examples. arXiv preprint arXiv:1704.03453, 2017.
-  I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
-  B. Jacob, R. Aurko, R. Colin, and I. Goodfellow. Thermometer encoding: One hot way to resist adversarial examples. In International Conference on Learning Representations, 2018, ICLR 2018., 2018.
-  J. M. Joyce. Kullback-Leibler divergence. Springer, Berlin, Heidelberg, 2011.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
-  Y. Lecun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436, 2015.
-  D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. Advances in neural information processing systems, 2001.
-  S. Li, A. Neupane, S. Paul, C. Song, S. V. Krishnamurthy, A. K. R. Chowdhury, and A. Swami. Adversarial perturbations against real-time video classification systems. arXiv preprint arXiv:1807.00458, 2018.
-  L. v. d. Maaten and G. Hinton. Visualizing data using t-sne. Journal of machine learning research, pages 2579–2605, 2008.
-  D. Meng and H. Chen. Magnet: a two-pronged defense against adversarial examples. In Conference on Computer and Communications Security, 2017, ACM SIGSAC 2017., 2017.
-  S.-M. Moosavi-Dezfooli, A. Fawzi, O. Fawzi, and P. Frossard. Universal adversarial perturbations. Computer Vision and Pattern Recognition, 2017. CVPR 2017. IEEE Conference on, pages 1765–1772, 2017.
-  C. Olah, S. Arvind, J. Ian, C. Shan, S. Ludwig, Y. Katherine, and M. Alexander. The building blocks of interpretability. Distill, 2018. https://distill.pub/2018/building-blocks/.
-  C. Olah, A. Mordvintsev, and L. Schubert. Feature visualization. Distill, 2017. https://distill.pub/2017/feature-visualization.
-  M. Raghu, B. Poole, J. Kleinberg, S. Ganguli, and J. Sohl-Dickstein. On the expressive power of deep neural networks. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 2847–2854, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
-  S.-A. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert. iCaRL: Incremental classifier and representation learning. In Proc. CVPR, 2017.
-  P. Samangouei, M. Kabkab, and R. Chellappa. Defense-GAN: Protecting classifiers against adversarial attacks using generative models. In International Conference on Learning Representations, ICLR 2018, 2018.
-  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
-  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Computer Vision and Pattern Recognition, CVPR 2015, pages 1–9, 2015.
-  C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. International Conference on Learning Representations, 2014.
-  Y. Wang et al. Interpret neural networks by identifying critical data routing paths. In Conference on Computer Vision and Pattern Recognition, CVPR 2018, 2018.
-  W. Xu, D. Evans, and Y. Qi. Feature squeezing: Detecting adversarial examples in deep neural networks. arXiv preprint arXiv:1704.01155, 2017.
-  M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In European conference on computer vision, 2014., 2014.
7 Supplementary Materials
In this document, we first verify that Adversarial Feature Genomes (AFGs) contain representations corresponding to the class information of the input image, so that AFGs can be used to defend against and detect adversarial examples simultaneously. Figures 10–13 show the AFGs of all images. Furthermore, we model defending against and detecting adversarial examples as a classification task on AFGs using diverse classifiers. Due to the complexity of the genome data itself, we find that classification results are poor with shallow models, so we use CNNs in the paper to classify AFGs. Finally, we sample different channels from AFGs to demonstrate the effect of channel changes on adversarial example detection.
7.1 The Class information of AFGs
We introduced the concept of AFGs in Sec. 3 (Adversarial Feature Genome) and showed that adversarial examples can be distinguished in feature space. Visualization by t-SNE indicates that AFGs also encode class information in their representations. We analyzed only 2 classes in the main text because the data size of AFGs is too large for the t-SNE method. To further support our conclusion, we train a classification CNN on AFGs of 3 classes obtained from original images. The training process is shown in Figure 8.
The final multi-label classification accuracy is 71.12%, and the CNN effectively classifies AFGs from different classes. This shows that AFGs not only capture the changes of image features within CNNs but also retain class representations.
7.2 The Classifiers on AFGs
In Sec. 5.1 (Multi-label Classification), we use a Recognition CNN for the multi-label classification task on AFGs. Because of the complexity and high dimensionality of AFGs, traditional models such as SVMs do not distinguish the classes well, and the multi-label setting makes this even more difficult. We applied an SVM to the multi-label classification task on AFGs and compared the results with those of the other classifiers, as shown in Table 2.
The SVM performs poorly on both the training set and the test set, and InceptionV1 also performs poorly on the test set. Therefore, we use the VGG family as the basic architecture of the Recognition CNN. This experiment reveals that the complexity of AFGs may be a challenging problem for future research; we will further study efficient data structures for AFGs.
7.3 Different Channels of AFGs
In Sec. 5.2 (Sampling some channels to classify) we combined different channels of AFGs. In this section, we try more combinations, including only the first few layers or only the last few layers, to examine the impact of the genome data from different channels on the classes. We separately sample the AFGs of the first 3 layers (VGG16-F3L) and the last 3 layers (VGG16-L3L) to train VGG16 networks. The training processes of the two sampling methods are shown in Figure 9.
(Table 3 columns: Model, Original Only, Adversarial Only.)
The multi-label classification results on AFGs from the deep layers are better than those from the shallow layers. The AFS proposed in the paper indicates that the separation between adversarial examples and original images increases as the layers deepen. We therefore used all trained classifiers to evaluate accuracy on adversarial examples and original images separately; the results are shown in Table 3. Comparing VGG16-F3L and VGG16-L3L first, the model trained on shallow layers has a higher test accuracy on adversarial examples than the model trained on deep layers, but for original images the classification results from the deeper layers are better, with a gap of about 10%. This shows that the representations of the shallow layers are important for distinguishing adversarial examples from original images, while the representations of the deep layers are closer to the classifier and have a greater impact on the final classification result, so the AFGs there cannot fully show the classes of the original images. This change reflects the difference in the representations of the convolution kernels in different layers. All classifiers except InceptionV1 have a higher accuracy on adversarial examples than on original images, which suggests that the differences expressed by adversarial examples can be better identified and provides an insight for improving AFGs.