Adversarial Feature Genome: a Data Driven Adversarial Examples Recognition Method

12/25/2018 ∙ by Li Chen, et al. ∙ Central South University 20

Convolutional neural networks (CNNs) are easily spoofed by adversarial examples, which lead to incorrect classification results. Most one-way defense methods focus only on improving the robustness of a CNN or on identifying adversarial examples. They are incapable of identifying and correctly classifying adversarial examples simultaneously, due to the lack of an effective way to quantitatively represent how the characteristics of a sample change within the network. We find that adversarial examples and original ones have diverse representations in the feature space, and this difference grows as layers go deeper, which we call Adversarial Feature Separability (AFS). Inspired by AFS, we propose an Adversarial Feature Genome (AFG) based adversarial examples defense framework which can detect adversarial examples and classify them into their original categories simultaneously. First, we extract the representations of adversarial examples and original ones, with labels, by the group visualization method. Then, we encode the representations into the feature database AFG. Finally, we model adversarial examples recognition as a multi-label classification or prediction problem by training a CNN on the AFG to recognize adversarial and original examples. Experiments show that the proposed framework can not only effectively identify adversarial examples in the defense process, but also correctly classify them with mean accuracy up to 63%. Our framework potentially gives a new, data-driven perspective on adversarial examples defense. We believe that adversarial examples defense research may benefit from a large scale AFG database similar to ImageNet. The database and source code can be visited at




1 Introduction

Convolutional neural networks (CNNs) achieve remarkable successes on a variety of tasks [6, 17, 18, 11, 27]. Interestingly, Szegedy et al. [31, 14, 3, 20, 9] found that an original image can be misclassified with high confidence by adding a very small perturbation to it; such a perturbed image is called an adversarial example. The existence of adversarial examples indicates that CNNs are more fragile than we have imagined.

Defense methods against adversarial examples can be mainly divided into two classes [1]. One is complete defense, which enhances the robustness of a network by modifying its structure. For instance, gradient masking methods [15, 7, 28, 12] put the network's gradients into states that are unreachable to an attacker, preventing adversarial examples from being generated by optimization. However, such methods do not analyze the difference between adversarial examples and original images; as a result, only the generalization ability of the network is improved while adversarial examples remain unrecognizable. Under stronger attacks [2], defenses based on obfuscated gradients can be broken. The other class is detection based defense methods [33, 22, 32], which only identify whether the input is an adversarial example. Moreover, they use a simple formalized representation of the difference between adversarial examples and original images [32], so the category information is lost. Methods of this kind are therefore limited to detecting adversarial examples and are incapable of classifying them correctly.

In summary, both complete defense methods and detection based methods lack a proper way to quantitatively analyze how the features of the input change as it propagates through the network. This absence makes it hard to detect and classify adversarial examples simultaneously. Our experimental finding gives insight into this problem: we found that as the layers deepen, the representations of the feature maps of adversarial examples and original images become gradually separable on networks such as VGG16 [29] and Inception [30]. We define this as Adversarial Feature Separability (AFS). AFS states that adversarial examples can be distinguished in feature space as the number of layers increases. Moreover, the group visualization method proposed by Olah et al. [24] provides a feasible way to interpret the internal representations of CNNs; the resulting group-visualized features contain the representations corresponding to the category of the input image. We therefore combine these two ideas and propose an Adversarial Feature Genome (AFG) based adversarial examples defense framework.

This framework can not only distinguish between original images and adversarial examples, but also retain each image's category representations. The proposed framework first decomposes the matrix consisting of the multi-channel feature maps in each layer to extract a low-dimensional representation of the high-dimensional feature data. Then, the visualized features in the decomposed groups are stitched together to represent the features the network has learned in the corresponding layer. Finally, we stack the stitched images of different layers to obtain the AFG database. We demonstrate that AFGs can effectively distinguish adversarial examples from original images and also classify them correctly. Therefore, AFGs can serve as a data-driven adversarial examples recognition method that both detects and classifies adversarial examples, as shown in Fig. 1. In detail, we transform all images into AFGs to construct a large database, and give each AFG a label indicating its original category and a label given by the classification of a CNN. These two labels are consistent for original images but differ for adversarial examples. By training a multi-label classification CNN on the AFG database, we can successfully recognize adversarial examples. Our contributions are as follows:

  1. We find that adversarial examples and original ones have different representations in the feature space, and the divergence grows as layers go deeper. Inspired by this observation, we propose the adversarial feature genome, which can be applied to analyze the features of data inside CNNs. It can also potentially be used to study the interpretability of CNN models.

  2. We verify that the difference between the features of original images and adversarial examples is relatively small in shallow layers but larger in deeper layers.

  3. We further train a multi-label CNN on adversarial feature genome dataset. Experiments show that the proposed framework can not only effectively identify the adversarial examples in the defense process, but also correctly classify adversarial examples with mean accuracy up to 63%.

The structure of this paper is as follows. In Section 2, we discuss our key idea of adversarial feature separability. In Section 3, we present how to build the Adversarial Feature Genome. In Section 4, we model adversarial examples recognition as a multi-label classification or prediction problem. In Section 5, we show how our AFG database improves adversarial example defense. Conclusions and future work are put forward in the last section.

2 Adversarial Feature Separability

The adversarial example is a version of an original image with a small added perturbation that makes the network misclassify it [31]. Generating an adversarial example amounts to computing this additive perturbation. We define $x$ as the original image, $x'$ as the adversarial example, $\eta$ as the small adversarial perturbation, and $y'$ as the target class of the adversarial example. The adversarial example can be obtained by Eq. 1:

$$x' = x + \eta, \quad \text{s.t.}\; f(x') = y' \qquad (1)$$

where $f$ represents the classifier function. Although $x$ and $x'$ are visually similar because $\eta$ is negligible, the classifier $f$ of the CNN assigns them to completely different categories, which can be formulated as follows:

$$f(x) = y \neq y' = f(x') \qquad (2)$$

with $f = \phi_L \circ \cdots \circ \phi_1$ and $\phi_l$ representing the activation function of the $l$-th layer. The CNN is thus a composite function whose decision is based on the features extracted by the learned network weights. After the input images $x$ and $x'$ are fed into the network, they activate different network weights during forward propagation, so the features extracted by these weights at each layer differ in the feature space.

In CNNs, feature maps strongly correlate with the sample: they contain features corresponding to the category and can reflect how the network decides to classify the input image [34, 25]. This means feature maps can be used both to represent features and to measure the difference between adversarial examples and original images. Though the feature maps of adversarial examples and original images are different, whether they are separable remains unknown, so we conduct several experiments to explore this further. Once separability is verified, we can recognize and classify adversarial examples by separating their feature maps from those of original images.

We define $F_l(x)$ as the feature maps of the original image $x$ in the $l$-th layer, $F_l(x')$ as the feature maps of the adversarial example, and $d(\cdot,\cdot)$ as a distance function of $F_l(x)$ and $F_l(x')$.

First, we compute the L1 distance between the feature maps of adversarial examples and their original images in each layer, as shown in Eq. 3:

$$D_l = d\big(F_l(x), F_l(x')\big) = \big\| F_l(x) - F_l(x') \big\|_1 \qquad (3)$$

Then, we adopt the Kullback-Leibler divergence [16] to measure the difference between the distributions $P_l$ and $Q_l$ of $F_l(x)$ and $F_l(x')$, in Eq. 4:

$$D_{KL}(P_l \,\|\, Q_l) = \sum_i P_l(i) \log \frac{P_l(i)}{Q_l(i)} \qquad (4)$$
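As a concrete illustration, the two per-layer measures of Eq. 3 and Eq. 4 can be sketched as follows; the feature tensors here are random stand-ins, not real network activations:

```python
import numpy as np

def l1_distance(f_orig, f_adv):
    """L1 distance between two same-shaped feature-map tensors (Eq. 3)."""
    return float(np.abs(f_orig - f_adv).sum())

def kl_divergence(f_orig, f_adv, eps=1e-12):
    """KL divergence between feature maps normalized into distributions (Eq. 4)."""
    p = f_orig.ravel() / (f_orig.sum() + eps)
    q = f_adv.ravel() / (f_adv.sum() + eps)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

# Toy non-negative activations standing in for F_l(x) and F_l(x').
rng = np.random.default_rng(0)
f_clean = np.abs(rng.normal(size=(8, 8, 4)))
f_adv = f_clean + 0.1 * np.abs(rng.normal(size=(8, 8, 4)))
print(l1_distance(f_clean, f_adv) > 0)              # True: the perturbation is measurable
print(abs(kl_divergence(f_clean, f_clean)) < 1e-9)  # True: identical maps diverge by 0
```

In practice these would be evaluated per layer on the activations of the classification network.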
In this experiment, we first generate adversarial examples with the Basic Iterative attack [14] on the Inception network, with the number of iterations set to 20. We randomly choose six categories of images from the ImageNet dataset, with 200 images per category, generating 1,200 adversarial examples in total.
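For reference, the Basic Iterative Method can be sketched as below; this is a generic PyTorch implementation of Kurakin et al.'s attack rather than the authors' exact code, and the epsilon and step-size values are illustrative:

```python
import torch

def basic_iterative_attack(model, x, y, eps=0.03, alpha=0.005, steps=20):
    """Sketch of the Basic Iterative Method: repeated FGSM steps,
    projected back into an eps-ball around the original image x."""
    x_adv = x.clone()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = torch.nn.functional.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()   # untargeted ascent step
        x_adv = x + (x_adv - x).clamp(-eps, eps)       # project to the eps-ball
        x_adv = x_adv.clamp(0.0, 1.0)                  # stay a valid image
    return x_adv.detach()
```

With a pretrained classifier, `x` would be a batch of images in [0, 1] and `y` the true labels.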

Next, we measure the difference by L1 distance and KL divergence on each pair of an adversarial example and its original image. Then we compute the mean, maximum, minimum and variance for each layer. The result of the experiment is shown in Figure 2.


Figure 2: (a) The L1 distance and (b) the KL divergence measuring the differences between adversarial examples and original images in each layer, together with the per-layer maximum, minimum, mean and variance. (a) and (b) show that as the layers deepen, the divergence becomes larger, which means adversarial examples and original images can be separated. In particular, in (a) the distance grows fast around the last layers, indicating that the feature maps there are most separable.

As shown in Figure 2, whether we use the L1 distance or the Kullback-Leibler divergence to evaluate the distance between the features of adversarial examples and original images, the feature distance gradually increases as the CNN layers deepen, which indicates that adversarial examples and their corresponding original images are distinguishable in the feature maps. Meanwhile, the variance in both panels is stable and close to zero, which means the adversarial feature separability is stable.

3 Adversarial Feature Genome

The above experiment shows that feature maps can represent the difference between adversarial examples and original images; however, they are not suitable for detecting adversarial examples directly because they are enormous and redundant [25]. To reduce the data size and redundancy of feature maps, we apply the visualization techniques of Olah et al. [24] to each layer to obtain groups that present the layer's main features, and use an image called the maximum activation image to represent each feature learned by the CNN.

First, we apply non-negative matrix factorization (NMF) [19] to the feature maps at each layer to reduce their dimension and obtain a user-chosen number of feature groups. After applying NMF, the features retained in the groups become more independent and meaningful than without it [24]. Let $v$ denote an $m$-dimensional non-negative random vector, of which we conduct $N$ observations. Collecting these observations as columns, we obtain the data matrix $V = [v_1, \dots, v_N] \in \mathbb{R}^{m \times N}$. We can then compute a non-negative basis matrix $W$ and coefficient matrix $H$ which make $V \approx WH$, expressed as a matrix factorization:

$$V \approx W H, \quad W \in \mathbb{R}^{m \times k},\; H \in \mathbb{R}^{k \times N}$$
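A minimal sketch of the NMF step with scikit-learn, assuming an Olah-style factorization in which spatial positions are rows and channels are columns; the activations are synthetic and the group count is an arbitrary choice:

```python
import numpy as np
from sklearn.decomposition import NMF

# Synthetic stand-in for one layer's post-ReLU feature maps: 14x14 grid, 64 channels.
H, W, C = 14, 14, 64
rng = np.random.default_rng(0)
acts = np.abs(rng.normal(size=(H, W, C)))

# Rows are spatial positions, columns are channels, so V is (H*W) x C and V ~ W_b @ H_c.
V = acts.reshape(H * W, C)
n_groups = 6  # number of feature groups; a free choice
model = NMF(n_components=n_groups, init="nndsvd", max_iter=500, random_state=0)
W_b = model.fit_transform(V)   # (H*W, n_groups): spatial footprint of each group
H_c = model.components_        # (n_groups, C): channel weights of each group
print(W_b.shape, H_c.shape)    # (196, 6) (6, 64)
```

Each row of `H_c` defines one feature group over channels, which is what the activation maximization step visualizes.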
Then we apply the activation maximization (AM) algorithm [10] to each group to obtain its maximum activation image:

$$x^*_{i,l} = \arg\max_{x}\; a_{i,l}(x)$$

where $a_{i,l}(x)$ is defined as the activation value of the $i$-th group in the $l$-th layer. The obtained image $x^*_{i,l}$ denotes the maximum activation image of the $i$-th group at the $l$-th layer, which can be considered a main feature learned by the convolution kernels in the $l$-th layer. Since, for every image, the number of maximum activation images across all layers of the entire network is large, we use all of them to construct an adversarial feature genome so that they can be exploited effectively. We create the AFG for every sample by stitching the maximum activation images obtained in the same layer and then stacking the stitched images of different layers. We visualize three pairs of AFGs of adversarial examples and original images in Figure 3.
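A bare-bones sketch of activation maximization by gradient ascent; the layer function, group weights and hyper-parameters are placeholders, and real feature-visualization pipelines add regularization (jitter, blurring, frequency penalties) that is omitted here:

```python
import torch

def activation_maximization(layer_fn, group_weights, steps=50, lr=0.1, size=64):
    """Gradient-ascent sketch of AM: optimize an input image so that one
    feature group's weighted activation is maximal. layer_fn maps an image
    batch to activations of shape (1, C, H, W); group_weights is (C,)."""
    img = torch.randn(1, 3, size, size, requires_grad=True)
    opt = torch.optim.Adam([img], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        acts = layer_fn(img)  # (1, C, H, W)
        score = (acts * group_weights[None, :, None, None]).mean()
        (-score).backward()   # minimize the negative = maximize the activation
        opt.step()
    return img.detach()
```

Here `group_weights` would come from one row of the NMF coefficient matrix for the chosen layer.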

Figure 3: Three pairs of adversarial example and original image are transformed into AFGs. In the shallow layers, AFGs of adversarial examples and original images are similar but they differ in the deeper layers. It indicates the existence of AFS in AFGs, so the pairs can be separated.

From Figure 3 we can observe that in the shallow layers, the AFGs of the adversarial example and the original image are highly similar. In the last few layers, the AFGs become abstract and irregular, and they are obviously inconsistent, which is in line with the AFS assumption: the separation between the adversarial example and the original image gradually increases layer by layer.

The AFS phenomenon makes the AFG an available tool for detecting adversarial examples. Nonetheless, recognizing adversarial examples still requires features corresponding to the category so that adversarial examples can be classified correctly. Fortunately, feature maps are highly related to the category of the input image because they are abstract features extracted by the convolution kernels in each layer, which is useful for classification. AFGs, being based on feature maps, should inherit this property, making them separable by category and usable for classifying images. Hence, we verify the category separability of AFGs with the following experiments.

We randomly select AFGs corresponding to original images of the jellyfish and leafhopper classes and then use t-Distributed Stochastic Neighbor Embedding (t-SNE) [21] to reduce their dimension so that they can be visualized in a 2D plane. The result is shown in Figure 4(a): features belonging to different categories are separated in the 2D plane, which indicates that AFGs retain sufficient feature information to differentiate categories. Besides, we select AFGs of the keeshond and komondor classes, which share common features. The result, shown in Figure 4(b), demonstrates that two categories with similar features cannot be accurately distinguished.

Comparing Figure 4(a) with Figure 4(b), we find it difficult to distinguish similar categories with AFGs, but easy to discriminate categories with large differences. This validates that the AFG indeed preserves category information; however, the category information is macroscopic, which makes it less competent at dealing with akin categories.

(a) Jellyfish & Leafhopper
(b) Keeshond & Komondor
Figure 4: Clustering AFGs of different classes using t-SNE. In (a), the jellyfish and leafhopper classes, which have no common features, can be accurately separated. However, in (b), the keeshond and komondor classes, which have many features in common, are tangled together. This shows that AFGs can efficiently classify images that share few features but are weak at classifying resembling classes.
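The t-SNE check can be reproduced in spirit with scikit-learn; the vectors below are synthetic stand-ins for flattened AFGs of two well-separated classes:

```python
import numpy as np
from sklearn.manifold import TSNE

# Two synthetic "classes" of flattened AFG vectors, well separated on purpose.
rng = np.random.default_rng(0)
class_a = rng.normal(0.0, 1.0, size=(50, 128))
class_b = rng.normal(5.0, 1.0, size=(50, 128))
afgs = np.vstack([class_a, class_b])

# Embed into 2D for plotting; perplexity must be smaller than the sample count.
emb = TSNE(n_components=2, perplexity=20, random_state=0).fit_transform(afgs)
print(emb.shape)  # (100, 2)
```

Scattering `emb` colored by class would reproduce the separated-vs-tangled behavior of Figure 4, depending on how much the two classes overlap.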

In conclusion, the AFGs generated from feature maps can represent the difference between original images and adversarial examples without losing their category features. A method based on AFGs is thus a combination of the detection driven and the complete-defense driven approaches. To fully utilize AFGs and implement adversarial examples recognition, we create a new dataset named the AFG dataset by applying our method to both original images and adversarial examples, giving each obtained AFG a real label and the label it is classified into. Classification models can then be trained on the AFG dataset to recognize and analyze adversarial examples.

4 Adversarial Examples Recognition

AFGs, which are essentially based on the representations of feature maps, can classify images and are separable between adversarial examples and original images, so we can apply them to recognize adversarial examples. Besides, the layer at which an adversarial example and its original image begin to differ in their feature maps varies, and the magnitude of the differences is also inconsistent across samples, which indicates that AFGs are free from human interference. We therefore construct a multi-label AFG dataset and train a new CNN, which is good at complex feature extraction problems, to learn the difference between adversarial examples and original images and to induce the pattern. In this section, we introduce our adversarial examples recognition framework based on AFG, which is mainly inspired by AFS; the framework is described in Figure 5.

Figure 5: The adversarial examples recognition framework, divided into three steps. (1) We use the classification CNN to convert the input image into groups of visualized features via the group visualization method; the sets of features obtained in each layer represent the representations of the entire layer. (2) We stack all stitched group features; the number of channels of the resulting AFG equals the number of convolutional layers. In the figure, although the adversarial example is similar to the original image, their AFGs are very different. (3) Detecting and defending against adversarial examples is transformed into a multi-label classification problem: we feed the AFGs into the recognition net to decide whether the input image is an adversarial example and, if so, to recover its original category.

For each image in the dataset, we reduce the dimension of its feature maps in each layer by NMF and obtain their activation blocks. We then apply AM to generate a feature image for each activation block and stitch the feature images within the same layer, so that the feature images in each layer are transformed into one bigger stitched image. We stack the stitched images of all layers to construct the AFG, which can be used to recognize adversarial examples.

For the construction of the multi-label dataset of the AFG, we convert all samples from the original dataset into AFGs and give the real label and the classification label for each sample.

The specific organization of the label is as follows:

  1. We count the number of categories of the original images and of the adversarial examples to determine the lengths of the original-image label and the adversarial-example label.

  2. We use one-hot encoding for the original categories and the classification results of the samples by classification CNN.

  3. We stack the original label of each adversarial example with its adversarial label as the final label coding of its AFG, and stack the original label of each original image with a zero vector as the final label coding of its AFG. The length of the zero vector is chosen to align the label codings of adversarial and original examples.

The encoding examples are shown in Table 1.

Original  One-hot   Classified  One-hot   Final Label
cat       10        cat         000       10000
cat       10        jellyfish   100       10100
cat       10        conch       010       10010
dog       01        dog         000       01000
dog       01        jellyfish   100       01100
dog       01        sturgeon    001       01001
Table 1: Cat and dog are the real classes of the original images; jellyfish, conch and sturgeon are the classes the adversarial examples are classified into. For original images, label 10 means cat and 01 means dog. For adversarial examples, label 100 means jellyfish, 010 means conch and 001 means sturgeon.
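The label scheme of Table 1 can be sketched as follows; the class lists mirror the table and the helper names are ours:

```python
import numpy as np

# Hypothetical class lists mirroring Table 1.
orig_classes = ["cat", "dog"]
adv_classes = ["jellyfish", "conch", "sturgeon"]

def encode_label(original, predicted):
    """Stack the one-hot original label with the one-hot adversarial label;
    the adversarial part stays a zero vector for clean images."""
    o = np.zeros(len(orig_classes), dtype=int)
    o[orig_classes.index(original)] = 1
    a = np.zeros(len(adv_classes), dtype=int)
    if predicted != original:               # misclassified => adversarial
        a[adv_classes.index(predicted)] = 1
    return np.concatenate([o, a])

def is_adversarial(label):
    """An input is adversarial iff the second label part is not all zero."""
    return bool(label[len(orig_classes):].any())

print(encode_label("cat", "cat"))        # [1 0 0 0 0]
print(encode_label("cat", "jellyfish"))  # [1 0 1 0 0]
print(encode_label("dog", "sturgeon"))   # [0 1 0 0 1]
```

The `is_adversarial` check implements the all-zero test on the second part of the label encoding.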

Finally, we can judge whether an input is an adversarial example by checking whether the second part of the label encoding obtained from its AFG is all zero; moreover, if an image is an adversarial example, we also know its original category from the first part of the label.

  Data: image dataset X, classification CNN f
  Result: recognition CNN trained on the AFG dataset
  for each image x in X do
     for each layer l of f do
        decompose the feature maps F_l(x) into feature groups by NMF
        for each group do
           generate its maximum activation image by AM
        end for
        stitch the maximum activation images of layer l
     end for
     stack the stitched images of all layers into AFG(x)
     label AFG(x) with the original class and the class predicted by f
  end for
  while the recognition CNN has not converged do
     train it on the labeled AFGs with the multi-label loss
  end while
Algorithm 1: Adversarial Example Recognition Framework

5 Experiments and Results

AFGs can be applied to detect adversarial examples because the AFGs of adversarial examples and original images can be separated. AFGs can also be used to classify adversarial examples correctly because they preserve features that imply the original category. Thus, in this section we use the AFG dataset created in Section 3 to train a multi-label CNN to recognize adversarial examples and classify them correctly.

Besides, because the AFG database is very large, the cost of training on all of the data is high, so we randomly select 33 categories of AFGs of original images and their corresponding adversarial examples. We then use stratified sampling on the AFG dataset to construct the training and test sets. Because the AFGs of adversarial examples are class-imbalanced in the training set, we use the Synthetic Minority Oversampling Technique (SMOTE) [4] to balance them. Finally, we train VGG and InceptionV1 networks on the training set.
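A toy, from-scratch sketch of SMOTE-style oversampling; real experiments would use a library implementation such as imbalanced-learn, and the minority samples here are synthetic:

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=3, seed=0):
    """Toy SMOTE-style oversampling: each synthetic point is a random
    interpolation between a minority sample and one of its k nearest
    minority neighbors."""
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        dists = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbors = np.argsort(dists)[1:k + 1]   # skip the point itself
        j = rng.choice(neighbors)
        lam = rng.random()
        out.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.stack(out)

minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_oversample(minority, n_new=5)
print(synthetic.shape)  # (5, 2)
```

Interpolating between minority neighbors, rather than duplicating samples, is what keeps the oversampled class from collapsing onto a few repeated points.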

5.1 Multi-label Classification

To illustrate that AFGs can effectively recognize adversarial examples, we use the training set to train the multi-label recognition CNN. In the experiment, our multi-label CNN uses the VGG16, VGG19 and InceptionV1 network structures. Accuracy is measured by IoU. The results of the experiment are shown in Fig. 6.
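The IoU accuracy over binary multi-label vectors can be sketched as:

```python
import numpy as np

def multilabel_iou(pred, true):
    """IoU between two binary label vectors: |intersection| / |union|."""
    inter = np.logical_and(pred, true).sum()
    union = np.logical_or(pred, true).sum()
    return float(inter / union) if union else 1.0

perfect = multilabel_iou(np.array([1, 0, 1, 0, 0]), np.array([1, 0, 1, 0, 0]))
partial = multilabel_iou(np.array([1, 0, 0, 0, 0]), np.array([1, 0, 1, 0, 0]))
print(perfect, partial)  # 1.0 0.5
```

With the label coding of Table 1, a prediction that gets the original class right but misses the adversarial class scores 0.5 rather than 0, which is what makes IoU a softer measure than exact-match accuracy.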

As shown in Fig. 6, though the accuracy fluctuates greatly during training, when the network models converge the test accuracy on InceptionV1, VGG16 and VGG19 reaches 0.52, 0.655 and 0.7, respectively. Though the lowest test accuracy among the three models is 0.52, their mean test accuracy is nearly 0.63, which further indicates that the proposed AFG based framework can effectively recognize adversarial examples and classify them correctly.

Figure 6: The training and test accuracy of the VGG16, VGG19 and InceptionV1 networks trained on the AFG database, shown in (a), (b) and (c). In (a) and (b), the VGG-style networks have better performance but steeper curves, and the deeper the network, the better the performance. In the Inception network in (c), the training curve is stable but the performance is worse.

5.2 Sampling some channels to classify

The AFG database is not only large in number of samples but also large in number of channels: every datum in it has thirty-six channels. Moreover, the previous conclusions show that the AFGs of adversarial examples and their original images are quite similar in the shallow layers and very different in the deeper layers. The similar parts retain the features of the original category, while the differing parts make them distinguishable. We therefore expect that AFGs sampled only from shallow and deep layers can also be used to recognize adversarial examples and classify them correctly. So we sample the AFG of each sample at the shallow and deep layers and use the samples to train the multi-label CNN as in Section 5.1, but with the VGG16 network only. In this experiment, we sample the first 9 channels and the last 12 channels. The result is shown in Fig. 7.

Figure 7: The training and test accuracy of a VGG16 network trained on sampled AFGs covering layers 1-3 and the last 4 layers. The sampled data has performance similar to that of all layers' data, which suggests that a few shallow and deep layers can replace all layers in training.

Though the features of the middle channels are lost after sampling, the final test accuracy still reaches 0.62 in Figure 7, which is close to the accuracy of 0.65 in Figure 6. Comparing Figure 7 with Figure 6, the difference in test accuracy between them is very small and the trends of the training and test accuracy are very similar, which may mean that the features of the first few layers and the last few layers retain the main information needed to recognize adversarial examples and classify them correctly.

6 Conclusion and future works

In this paper, we propose a new, data driven paradigm for adversarial example defense methods. This paradigm is built on a fact we observed: adversarial examples and original ones are inseparable by a classifier at the FC layer, but divisible in the feature space after transforming the weights of a CNN model into a representation space by NMF. Thus, we build an Adversarial Feature Genome dataset via group visualization of the feature maps at each layer on the ImageNet database, and then cast the adversarial example defense problem as a multi-label classification problem solved by a CNN model called the Recognition CNN. Experiments show that the proposed framework can not only effectively identify adversarial examples in the defense process, but also correctly classify them with mean accuracy up to 63%. Our framework potentially gives a new perspective on adversarial example defense. We believe that adversarial examples defense research may benefit from a large scale AFG database similar to ImageNet.

In future work, we will take more variables into the AFG database, for instance the structure of the CNN, its hyper-parameters and different base databases. Richer dimensions for the AFG database could potentially support training a universal Recognition CNN model that is robust to adversarial examples produced by diverse generative methods or drawn from diverse base databases [23]. Also, adversarial examples seem to be cursed by high dimensionality, which may indicate that feature space is crucial both for analyzing why adversarial examples exist and for recognizing them [13, 8]. A fundamental theory of adversarial examples is still missing.


7 Supplementary Materials

In this document, we first verify that Adversarial Feature Genomes (AFGs) contain the representations corresponding to the class information of the input image; therefore, AFGs can be used for defending against and detecting adversarial examples simultaneously. Figures 10-13 show the AFGs of example images. Furthermore, we model the defense and detection of adversarial examples as a classification task on AFGs using diverse classifiers. Due to the complexity of the genome data itself, we find that the classification results are poor for shallow models, so we use CNNs in the paper to classify AFGs. Finally, we sample different channels from AFGs to demonstrate the effect of channel changes on adversarial example detection.

7.1 The Class information of AFGs

We introduce the concept of AFGs in Sec. 3 (Adversarial Feature Genome) and show that adversarial examples can be distinguished in feature space. Visualization by t-SNE indicates that AFGs also encode class information in their representations. We only analyze two classes in the main body of the text because the data size of AFGs is too large for the t-SNE method. To further support our conclusion, we train a classification CNN on AFGs of three classes obtained from original images. The training process is shown in Figure 8.

Figure 8: The training process of multi-label classification on AFGs

The final accuracy of the multi-label classification is 71.12%, and the CNN effectively classifies AFGs of different classes. This shows that AFGs not only capture the changes of image features inside CNNs, but also retain the representations of the classes.

7.2 The Classifiers on AFGs

We use a recognition CNN for the multi-label classification task on AFGs in Sec. 5.1 (Multi-label Classification). Because of the complexity and high dimensionality of AFGs, traditional models such as the SVM [5] do not distinguish classes well, and multi-label tasks are even harder for them. We applied an SVM to the multi-label classification task on AFGs and compared its results with those of other classifiers, as shown in Table 2.

Model Train Test
VGG16 0.99 0.655
VGG19 0.99 0.7
InceptionV1 0.88 0.52
SVM 0.40 0.365
Table 2: Performance of different classifiers on training sets and test sets

The results of the SVM are poor on both the training set and the test set, and InceptionV1 also performs poorly on the test set. Therefore, we used the VGG series models as the basic architecture of the recognition CNN. This experiment reveals that the complexity of AFGs may be a challenging problem in future research; we will further study efficient AFG data structures.

7.3 Different Channels of AFGs

(a) The training process of the first 3 layers of AFGs
(b) The training process of the last 3 layers of AFGs
Figure 9: Comparison of the training process of AFGs on different channels

We combine different channels of AFGs in Sec. 5.2 (Sampling some channels to classify). In this section, we try more combinations, including only the first few layers or only the last few layers, to verify the impact of the genome data of different channels on classification. We separately sampled the first 3 layers (VGG16-F3L) and the last 3 layers (VGG16-L3L) of the AFGs to train VGG16 networks. The training processes of the two sampling methods are shown in Figure 9.

Model Original Only Adversarial Only
VGG16-F3L 0.516 0.824
VGG16-L3L 0.615 0.810
VGG16 0.571 0.846
VGG19 0.603 0.850
InceptionV1 0.482 0.436
Table 3: Accuracy of classifiers on the original images and the adversarial examples

The multi-label classification results of AFGs from deep layers are better than those from shallow ones. The AFS proposed in the paper indicates that the separation between adversarial examples and original images increases as the layers deepen. Therefore, we used all trained classifiers to evaluate the accuracy on the adversarial examples and on the original images; the results are shown in Table 3. First, we compare VGG16-F3L and VGG16-L3L. For adversarial examples, the test accuracy trained on shallow layers is higher than that trained on deep layers; for original images, the classification results in the deeper layers are better, with a difference of about 10%. This shows that the representations of shallow layers are important for distinguishing adversarial examples from original images, while the representations of deep layers are close to the classifier and have a greater impact on the final classification result, so that shallow-layer AFGs alone cannot show the classes of the original images. This change reflects the difference in the representations of the convolution kernels in different layers [26]. All classifiers except InceptionV1 have a higher accuracy on the adversarial examples than on the original images, which suggests that the expressed differences of adversarial examples are better identified and provides an insight for improving AFGs.

Figure 10: Each layer of AFGs of the original image of beaver
Figure 11: Each layer of AFGs of the adversarial example of beaver which is misclassified as a cowboy hat
Figure 12: Each layer of AFGs of the original image of otter
Figure 13: Each layer of AFGs of the adversarial example of otter which is misclassified as a keeshond