Identifying Classes Susceptible to Adversarial Attacks

05/30/2019 ∙ by Rangeet Pan, et al. ∙ Iowa State University of Science and Technology

Despite numerous attempts to defend deep learning based image classifiers, they remain susceptible to adversarial attacks. This paper proposes a technique to identify susceptible classes, i.e., those classes that are more easily subverted. To identify the susceptible classes, we apply distance-based measures to a trained model. Based on the distances among the original classes, we create a mapping between original classes and adversarial classes that significantly reduces the randomness of a model in an adversarial setting. We analyze the high dimensional geometry among the feature classes and identify the k most susceptible target classes in an adversarial attack. We conduct experiments using the MNIST, Fashion MNIST, and CIFAR-10 datasets, the last with ImageNet and ResNet-32 models. Finally, we evaluate our techniques to determine which distance-based measure works best and how the randomness of a model changes with perturbation.




1 Introduction

Protecting against adversarial attacks has become an important concern for machine learning (ML) models since an adversary can cause a model to misclassify an input with high confidence by adding a small perturbation (goodfellow2014explaining, ). A number of prior works (30papernot2016limitations, ; franceschi2018robustness, ; 2yan2018deep, ; 6pang2018towards, ; 9zheng2018robust, ; 8tao2018attacks, ; 20tian2018detecting, ; 21goswami2018unravelling, ; 3fawzi2018adversarial, ; 12peck2017lower, ; 28papernot2016distillation, ) have tried to understand the characteristics of adversarial attacks. This work focuses on adversarial attacks on deep neural network (DNN) based image classifiers.

Our Contributions. Our work is driven by two fundamental questions. Can an adversary fool all classes equally well? If not, which classes are more susceptible to adversarial attacks than others? Identifying such classes can be important for developing better defense mechanisms. We introduce a technique for identifying the top $k$ susceptible classes. Our technique analyzes the DNN model to understand the high dimensional geometry in the feature space. We use four different distance-based measures (t-SNE, N-D Euclidean, N-D Euclidean Cosine, and Nearest Neighbor Hopping distance) for understanding the feature space. To determine the top $k$ susceptible classes, we create an adversarial map, which takes the distances among classes in feature space as input and outputs a mapping of probable adversarial classes for each actual class. To create the adversarial map, we introduce the concept of the forbidden distance, i.e., a distance measured in high dimension that describes the capability of a model to defend against an adversarial attack.

We conduct experiments with the FGSM attack on the MNIST (lecun1998gradient, ), Fashion MNIST (xiao2017fashion, ), and CIFAR-10 (krizhevsky2009learning, ) datasets, using two models for CIFAR-10 (ImageNet (krizhevsky2012imagenet, ) and ResNet-32 (he2016deep, )), to evaluate our technique. Finally, we compare our results with cross-entropy (CE) (goodfellow2016deep, ) and reverse cross-entropy (RCE) (3fawzi2018adversarial, ) based training techniques that defend against adversarial attacks. Our evaluation suggests that, in comparison to the previous state-of-the-art training based techniques, our proposed approach performs better on the CIFAR-10 models and does not require additional computational resources.

Next, §2 describes related work, §3 describes our methodology, §4 describes detailed results with the experimental setup, and §5 concludes.

2 Related Work

The work on adversarial frameworks (15goodfellow2014generative, ) can be categorized into attack and defense related studies.

Attack-related studies. Several studies have crafted attacks on ML models, e.g., FGSM (goodfellow2014explaining, ), CW (carlini2017towards, ), JSMA (30papernot2016limitations, ), graph-based attacks (17zugner2018adversarial, ), attacks on stochastic bandit algorithms (jun2018adversarial, ), black-box attacks (liu2016delving, ; papernot2017practical, ), etc. (5elsayed2018adversarial, ) studies the transferability of attacks, and (23athalye2017synthesizing, ) surveys kinds of attacks that synthesize robust adversarial examples for any classifier.

Defense techniques (our work fits here). These works are primarily focused on improving robustness (2yan2018deep, ; 6pang2018towards, ; 9zheng2018robust, ), detecting adversarial samples (8tao2018attacks, ; 9zheng2018robust, ), image manipulation (20tian2018detecting, ; 21goswami2018unravelling, ), attack bounds (3fawzi2018adversarial, ; 12peck2017lower, ), distillation (28papernot2016distillation, ), geometric understanding (30papernot2016limitations, ), etc. There are other studies on the geometrical understanding of adversarial attacks (30papernot2016limitations, ; franceschi2018robustness, ; schmidt2018adversarially, ). Papernot et al.’s (30papernot2016limitations, ) work is closely related to ours; the authors built a capability-based adversarial saliency map between the benign class and the adversarial class to craft perturbations in the input. In contrast, we utilize distance-based measures to understand a DNN model and detect $k$-susceptible target classes. (franceschi2018robustness, ) utilizes the decision boundary to understand the model, and the authors observed a relationship between the decision boundary and the Gaussian noise added to the input. (18Gilmer, ) conducted a similar study to understand how the decision boundary learned by a model helps to understand high dimensional data and proposed a bound on the error of a model. Our approach finds the relation of high dimensional geometry with adversarial attacks and identifies $k$ susceptible classes.

3 Our Approach: Identifying Susceptible Classes

We use distances (§3.3) to create an adversarial map (§3.4) and use it to pick susceptible classes (§3.5).

3.1 Terminology

This study concentrates on feed-forward DNN classifiers. A DNN can be represented as a function $f(x; \theta): \mathbb{R}^N \to \mathbb{R}^n$, where $\theta$ is the set of tunable parameters, $x$ is the input, $n$ is the number of labeled classes, and $N$ is the number of features. In this study, the feature space for a model is represented using $\mathbb{R}^N$. The focus of the paper is to understand the high dimensional geometry to identify $k$ susceptible target classes for a model. We calculate the distance $d(c_i, c_j)$ between two classes $c_i$ and $c_j$, where $i \neq j$. We utilize four different distance-based measures and compare them. In an adversarial setting, $x' = x + \delta$, where $\delta$ is the perturbation added to the input, which causes $f(x'; \theta) \neq f(x; \theta)$. Our assumption in this study is that the adversarial class depends on the distances $d(c_i, c_j)$ of a model. To compare different distances, we calculate the randomness in a model using the entropy. We define the entropy of a model as

$$H = -\sum_{i=1}^{n} \sum_{j \neq i} p(c_j \mid c_i) \log p(c_j \mid c_i),$$

where $p(c_j \mid c_i)$ denotes the probability of an input being misclassified to class $c_j$ given the actual label $c_i$. In this study, we use the terms actual class and adversarial class, which represent the label of a data point predicted by a model and the label after the model has gone through an attack, respectively. We introduce the term forbidden distance, $D$, a measured distance which provides the upper bound on the displacement of data points in $\mathbb{R}^N$. In this context, displacement represents the distance between the adversarial class and the actual class. In this study, we conduct our experiments using the Fast Gradient Sign Method (FGSM) (goodfellow2014explaining, ). We have chosen a single attack based on the adversarial transferability property (30papernot2016limitations, ; papernot2016towards, ), which states that adversarial examples created for one model are highly likely to be misclassified by a different model.
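As a rough illustration of the FGSM step, the sketch below perturbs an input to a toy logistic-regression classifier by moving each feature by $\epsilon$ in the direction of the sign of the loss gradient. The model, the weights `w` and `b`, and the helper names are assumptions made for illustration only; the experiments in this paper use DNNs and the Cleverhans library.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sign(v):
    # Returns -1, 0, or 1.
    return (v > 0) - (v < 0)

def predict(x, w, b):
    # Binary prediction of the toy logistic model.
    return int(sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= 0.5)

def fgsm(x, y, w, b, eps):
    """Return x + eps * sign(grad_x loss) for a logistic model (toy assumption)."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    # For binary cross-entropy, d(loss)/dx_i = (p - y) * w_i.
    grad = [(p - y) * wi for wi in w]
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]

w, b = [2.0, -1.0], 0.0      # hypothetical trained weights
x, y = [0.3, 0.2], 1         # input that the toy model classifies as class 1
x_adv = fgsm(x, y, w, b, eps=0.3)
```

In this toy setting the perturbed input `x_adv` stays within `eps` of `x` in every coordinate, yet `predict` flips from class 1 to class 0.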

3.2 Hypothesis

According to the linearity hypothesis proposed in (goodfellow2014explaining, ), there is still a significant amount of linearity present in a DNN model even though it utilizes non-linear transformations. The primary reason behind this is the usage of LSTM (hochreiter1997long, ), ReLU (jarrett2009best, ; glorot2011deep, ), etc., which possess a significant amount of linear components so that they are easier to optimize. Here, we assume that the input examples can be misclassified to neighboring classes in the feature space during an adversarial attack.

3.3 Distance Calculation

Calculation of t-SNE distance

To understand the feature-space representation, we utilize the t-SNE (maaten2008visualizing, ) dimension reduction technique. t-SNE uses the Euclidean distance of data points in $N$ dimensions as input and converts it into the probability of similarity, where $p_{j \mid i}$ represents the probability of similarity between two input data points $x_i$ and $x_j$ in the $N$-dimensional feature space. We calculate the distance based on the resulting embedding, which converts the $N$-dimensional problem to a 2-dimensional problem. In this process, we do not consider the error due to the curse of dimensionality (indyk1998approximate, ). In the 2-D feature space, we have the coordinate $(x, y)$ for a data point, and we calculate the center of mass $CM_i$, where $c_i$ is the class. Here, the mass of each point is assumed to be unit, so the center of mass is $CM_i = \frac{1}{m_i} \sum_{r=1}^{m_i} x_r^{(i)}$, where $x_r^{(i)}$ represents the $r$-th data point of class $c_i$ and $m_i$ is the number of data points in $c_i$.

Calculation of N-D Euclidean Distance

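Furthermore, we calculate the $N$-dimensional Euclidean distance between two data points. Each data point can be represented by a feature vector $x_r^{(i)} = (x_{r,1}^{(i)}, \ldots, x_{r,N}^{(i)})$, where $x_{r,q}^{(i)}$ is the $q$-th vector component of the $r$-th data point of class $c_i$. Here, $x_r^{(i)}$ is represented as a coordinate in the $N$-dimensional feature space. We calculate the center of mass $CM_i$ similarly to the t-SNE based approach. The main difference is that the calculated center of mass is a vector of $N$ coordinates. Thus, for classes $c_i$ and $c_j$, the distance can be calculated as

$$d(c_i, c_j) = \lVert CM_i - CM_j \rVert_2 = \sqrt{\sum_{q=1}^{N} \left( CM_{i,q} - CM_{j,q} \right)^2}.$$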
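The center-of-mass distance described above can be sketched as follows. This is a minimal illustration that assumes unit mass per point and takes the class distance to be the Euclidean distance between class centers; the function names and the toy data are hypothetical.

```python
import math
from collections import defaultdict

def centers_of_mass(points, labels):
    """Average the feature vectors of each class (unit mass per point)."""
    sums, counts = {}, defaultdict(int)
    for p, c in zip(points, labels):
        if c not in sums:
            sums[c] = [0.0] * len(p)
        sums[c] = [s + v for s, v in zip(sums[c], p)]
        counts[c] += 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}

def euclidean(u, v):
    """N-dimensional Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Toy 3-D feature vectors for two classes (illustrative data only).
points = [[0, 0, 0], [2, 0, 0], [4, 3, 0], [4, 5, 0]]
labels = [0, 0, 1, 1]
cm = centers_of_mass(points, labels)
d01 = euclidean(cm[0], cm[1])
```

The same two helpers apply unchanged to a 2-D t-SNE embedding, since only the dimensionality of the input vectors differs.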


Calculation of N-D Euclidean Cosine Distance

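We use the $N$-dimensional angular distance as our next measure. In this process, we calculate the $N$-dimensional Euclidean distance similar to the prior technique. In $N$ dimensions, the angular similarity between the centers of mass of classes $c_i$ and $c_j$ can be calculated as $\cos \theta_{ij} = \frac{CM_i \cdot CM_j}{\lVert CM_i \rVert \, \lVert CM_j \rVert}$, where $\lVert \cdot \rVert$ is the magnitude of the vector. We leverage the angular similarity and calculate the $N$-dimensional Euclidean angular distance between the centers of mass of two classes using the following equation,

$$d(c_i, c_j) = \frac{1}{\pi} \arccos\left( \frac{CM_i \cdot CM_j}{\lVert CM_i \rVert \, \lVert CM_j \rVert} \right).$$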
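A small sketch of an angular distance between two center-of-mass vectors is given below. It assumes the common normalization $\arccos(\text{similarity}) / \pi$, which maps the distance into $[0, 1]$; that normalization and the function name are assumptions, not something confirmed by the text.

```python
import math

def angular_distance(u, v):
    """Angular distance in [0, 1] between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos_sim = dot / (norm_u * norm_v)
    # Clamp to the valid domain of acos to guard against rounding error.
    cos_sim = max(-1.0, min(1.0, cos_sim))
    return math.acos(cos_sim) / math.pi

same_direction = angular_distance([1.0, 0.0], [2.0, 0.0])
orthogonal = angular_distance([1.0, 0.0], [0.0, 1.0])
opposite = angular_distance([1.0, 0.0], [-1.0, 0.0])
```

Unlike the Euclidean measure, this distance ignores vector magnitude, so two class centers along the same direction are at distance zero regardless of how far apart they are.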

1:procedure hopDistance(NN, x, c_t)
2:     Q ← Neighbor of x
3:     distance ← 0;
4:     Mark x visited;
5:     if (label(x) ≠ c_t) then
6:         while no point in Q has label c_t do
7:              distance=distance+1
9:              for each q ∈ Q do
10:                  if (q not visited) then
11:                       Q ← Q ∪ Neighbor of q
13:                       Mark q visited
14:                  end if
15:              end for
18:         end while
19:     end if
20:     return distance
21:end procedure
Algorithm 1 Hopping Distance

Calculation of Nearest Neighbor Hopping Distance

Here, we use the nearest neighbor algorithm to understand the behavior of an adversarial attack. For distance calculation, we develop an algorithm (Algorithm 1) which computes the hopping distance between two classes. Initially, we calculate the nearest neighbors for each data point. Due to the reflexive property, a data point is inevitably a neighbor of itself, which leaves one fewer distinct neighbor for each data point. The boundary learned by the nearest neighbor algorithm distinguishes classes by dividing the feature space into clusters. Thus, the nearest neighbors belong to the same class for most of the data points, except for the data points located near the boundary. We leverage that information to compute the classes nearest to a particular class.

Example 3.1.

Let us assume the points $x_1, x_2, x_3, x_4$ in class $c_1$, with neighboring classes $c_2$ and $c_3$. For each of these points, we find its closest point outside $c_1$. As depicted in Figure 1(b), we observe that most of the data points in $c_1$ have their closest neighbors in $c_2$. So, we can say that $c_1$ shares more boundary with $c_2$, and $c_2$ is the closest neighboring class to $c_1$.

Hopping distance computes how many hops a data point needs to travel to reach the closest data point in the target class. From the nearest neighbor algorithm, we get the unique neighbors of each data point. Algorithm 1 takes the output of the nearest neighbor algorithm, the actual predicted data point, and the misclassified label as input. Lines 2 - 4 convert the problem into one of tree generation. In lines 5 - 21, we expand the tree whenever a new neighbor is found and traverse it using BFS. Finally, we calculate the depth of the expanded tree to obtain the minimum distance that a data point has to travel to reach the misclassified class in the feature space. This algorithm has the same time and space complexity as BFS, namely $O(|V| + |E|)$ time and $O(|V|)$ space for a neighbor graph with $|V|$ data points and $|E|$ neighbor links, as in the worst case we need to traverse all the neighbors for an actual class. We also calculate the forbidden distance $D$ based on the average hopping distance ($\bar{h}$) for a model. In Eq. 4, $n$ and $m$ denote the total number of classes and data points, respectively. We use Eq. 4 for calculating both the forbidden distance ($D$) and the average displacement of data points in the feature space under an attack. For calculating the latter, $c_i$ is the actual class and $c_j$ is the adversarial class. In order to create the adversarial map, we compute a matrix storing the distances among all classes ($d(c_i, c_j)$) using Eq. 5, where $m_i$ is the total number of data points in class $c_i$.
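The hop count described above can be sketched as a breadth-first search over a precomputed neighbor graph. This is an illustrative sketch rather than the exact procedure of Algorithm 1: the `neighbors` dictionary (point index to neighbor indices), the `labels` list, and the convention of returning `None` when the target class is unreachable are all assumptions.

```python
from collections import deque

def hop_distance(neighbors, labels, start, target_class):
    """Minimum number of neighbor hops from `start` to any point of `target_class`."""
    if labels[start] == target_class:
        return 0
    visited = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in neighbors[node]:
            if nxt in visited:
                continue
            if labels[nxt] == target_class:
                return dist + 1          # first hit in BFS order is the closest
            visited.add(nxt)
            queue.append((nxt, dist + 1))
    return None                          # target class not reachable

# Toy chain of four points: three in class "a", the last in class "b".
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
labels = ["a", "a", "a", "b"]
hops = hop_distance(neighbors, labels, start=0, target_class="b")
```

Because BFS explores points in order of increasing hop count, the first point of the target class that it encounters gives the minimum distance.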

Lemma 3.1.

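If the average hopping distance $\bar{h}(c_1, c_2) < \bar{h}(c_1, c_3)$, then $c_1$ is closer to $c_2$ than to $c_3$, i.e., the distance between centers of mass satisfies $d(CM_1, CM_2) < d(CM_1, CM_3)$.

Proof. Without loss of generality, we can say that $\bar{h}(c_1, c_2) < \bar{h}(c_1, c_3)$. As the center of mass will always be within the polygon surrounding a class, without loss of generality we assume all the points of $c_2$ are at the same location $CM_2$ and all the points of $c_3$ are at the same location $CM_3$.

Assuming a balanced dataset, the smaller average hopping distance implies $d(CM_1, CM_2) < d(CM_1, CM_3)$. So, the center of mass of $c_2$ is closer to $c_1$ than the center of mass of $c_3$ is to $c_1$. ∎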

Lemma 3.2.

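In the feature space, if a class $c_i$ has been misclassified to a closer class $c_j$, the entropy will decrease.

Proof. The entropy is $H = -\sum p \log p$, where $p$ is the probability of an actual class being misclassified to a given adversarial class. With the increase of $p$, $\log p$ also increases, so the term $-\log p$ decreases. From the hypothesis (§3.2), we assume that if a class $c_j$ is close to class $c_i$, we allocate a higher probability to $p(c_j \mid c_i)$. So, if classes are mostly misclassified to the closer one, the entropy of the entire model will decrease. ∎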

3.4 Adversarial Map

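Figure 1: (a) Creation of Adversarial Map. (b) Class $c_1$ shares more boundary with $c_2$ than with $c_3$, as most of the data points in $c_1$ have their closest neighbors in $c_2$.
1:procedure createMap(d, D)
2:     G ← empty graph
3:     for each pair of classes (c_i, c_j), i ≠ j do
4:         if (d(c_i, c_j) ≤ D) then
5:             add edge (c_i, c_j) with weight d(c_i, c_j) to G
6:         end if
7:     end for return G
8:end procedure
Algorithm 2 Create Adversarial Map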
Definition 3.1.

Forbidden distance ($D$): When a model encounters an adversarial attack, each input class is required to travel a certain distance in the feature space to accomplish the attack. The maximum distance changes based on the attack type. We call this the forbidden distance ($D$), as beyond this distance an adversarial attack will not be successful. For example, to accomplish the adversarial attack given a forbidden distance $D$ of a model and to misclassify $c_i$ as $c_j$, the distance constraint between $c_i$ and $c_j$ is $d(c_i, c_j) \leq D$.

In this section, we describe how we create the adversarial map annotated with distances to neighbors. Here, we utilize the forbidden distance while creating the adversarial map. Hypothetically, any class $c_i$ as shown in Figure 1(a) can be misclassified to any other class by traveling the same distance. But our hypothesis is that every attack has a limitation. A data point in $c_i$ might need to travel different distances to be misclassified to different classes $c_j$, where $j \neq i$. If we represent the distance between $c_i$ and $c_j$ as $d(c_i, c_j)$, then the attack can be accomplished most easily where $d(c_i, c_j)$ is minimum, in which case the attack can misclassify $c_i$ as $c_j$, which we write as $c_i \rightarrow c_j$. We create this adversarial map by using the distance between classes $c_i$ and $c_j$ as described in §3.3. Then we introduce the notion of the forbidden distance $D$. We claim that the attack on a certain class $c_i$ can misclassify it as class $c_j$ if and only if $d(c_i, c_j) \leq D$, as mentioned in the following equation:

$$c_i \rightarrow c_j \iff d(c_i, c_j) \leq D, \qquad i \neq j.$$

Now, we create the adversarial map from the distances between different classes as depicted in Algorithm 2, which takes the distances between different classes ($d(c_i, c_j)$) as input. Then, edges are added to the graph in lines 3 - 7. Finally, the adversarial map is returned in line 7. This algorithm runs in $O(n^2)$ time and space complexity for $n$ classes.
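The map construction can be sketched as a filter over a class-to-class distance matrix: an edge is kept only when the distance does not exceed the forbidden distance. The nested-dictionary representation, the class names, and the numeric values below are illustrative assumptions.

```python
def create_map(dist, forbidden):
    """Build an adjacency map keeping only class pairs within the forbidden distance."""
    graph = {c: {} for c in dist}
    for ci, row in dist.items():
        for cj, d in row.items():
            # A class is never its own adversarial class.
            if ci != cj and d <= forbidden:
                graph[ci][cj] = d
    return graph

# Hypothetical symmetric class-distance matrix.
dist = {
    "cat":  {"cat": 0.0, "dog": 1.0, "ship": 4.0},
    "dog":  {"cat": 1.0, "dog": 0.0, "ship": 3.5},
    "ship": {"cat": 4.0, "dog": 3.5, "ship": 0.0},
}
adv_map = create_map(dist, forbidden=3.5)
```

With these toy values, "cat" can only reach "dog", while "dog" can reach both "cat" and "ship"; the pair ("cat", "ship") is excluded because its distance exceeds the forbidden distance.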

Lemma 3.3.

The attack can misclassify a class only to one of its neighbors in the adversarial map.

Proof. Let us assume an attack has prior knowledge of a model and a training example of class $c_i$, and can misclassify $c_i$ to $c_j$, which is not a neighbor. We know an attack can misclassify $c_i$ as $c_j$ if and only if $d(c_i, c_j) \leq D$, where $D$ is the forbidden distance. According to Algorithm 2, if $d(c_i, c_j) \leq D$, then $c_j$ is a neighbor of $c_i$. This leads to a contradiction. So the attack can only misclassify $c_i$ as one of its neighboring classes. ∎

3.5 Susceptible Class Identification

Here, we use the best among the four distance-based measures and identify $k$ susceptible target classes for a model. In an adversarial setting, we find the target classes to which the actual class is most likely to be misclassified. Our primary hypothesis (§3.2) claims that any class will be misclassified to the nearest class under an attack. In order to identify susceptible classes, we use our mapping between the actual class and adversarial class mentioned in §3.4. For a particular class $c_i$, we assign a weighted probability to all misclassified classes $c_j$, where $j \neq i$, based on the distance computed using the best distance-based measure. The higher the distance between two classes, the lower the probability of one class being misclassified as the other. We perform a cumulative operation on the individual probabilities of being misclassified given the actual input label $c_i$ for each $c_j$. The top $k$ classes with the highest probability are identified as the susceptible classes under an adversarial attack.

Lemma 3.4.

The cumulative of the individual probabilities of adversarial classes given the actual classes determines the $k$ most susceptible classes of a model.

Proof. For a DNN model $f$, the data sets are categorized into $n$ classes. For each class $c_i$, there is a list of at most $n - 1$ classes which can be close to $c_i$. For each class, we determine them based on the hypothesis (§3.2). The probability of a class $c_i$ being misclassified as $c_j$ can be determined based on the adversarial map. The lesser the distance between $c_i$ and $c_j$, the higher the probability of $c_i$ being misclassified as $c_j$. So, $p(c_j \mid c_i) \propto 1 / d(c_i, c_j)$. Here, $p(c_j \mid c_i)$ denotes the probability of $c_i$ being misclassified as $c_j$. As the $p(c_j \mid c_i)$ are all independent events, the total probability of an adversarial class $c_j$ is $P(c_j) = \sum_{i=1}^{n} p(c_j \mid c_i)$, and if class $c_j$ has been misclassified as $c_j$, we do not consider that an adversarial effect. Hence, $p(c_j \mid c_j) = 0$. Without loss of generality, $P(c_j) = \sum_{i \neq j} p(c_j \mid c_i)$. So, the probability of an adversarial class is the cumulative of the individual probabilities of that class given all the actual classes. ∎
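The cumulative ranking can be sketched as follows. The sketch assumes inverse-distance weights normalized per actual class, which is only one way to make closer classes more probable; the weighting scheme, the function name, and the toy map are assumptions for illustration.

```python
def susceptible_classes(adv_map, k):
    """Rank adversarial classes by cumulative inverse-distance probability."""
    totals = {c: 0.0 for c in adv_map}
    for ci, nbrs in adv_map.items():
        if not nbrs:
            continue
        # Closer classes get larger weights (assumes all distances are > 0).
        inv = {cj: 1.0 / d for cj, d in nbrs.items()}
        z = sum(inv.values())
        for cj, weight in inv.items():
            totals[cj] = totals.get(cj, 0.0) + weight / z
    # Sort by descending cumulative probability, breaking ties by name.
    ranked = sorted(totals, key=lambda c: (-totals[c], c))
    return ranked[:k], totals

# Hypothetical adversarial map: actual class -> {adversarial class: distance}.
adv_map = {
    "cat":  {"dog": 1.0},
    "dog":  {"cat": 1.0, "ship": 3.0},
    "ship": {"dog": 3.0},
}
top2, totals = susceptible_classes(adv_map, k=2)
```

In this toy map, "dog" accumulates probability from both "cat" and "ship" and therefore ranks first, followed by "cat".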

4 Evaluation

4.1 Experimental Setup

In this study, we have used the MNIST (lecun1998gradient, ), Fashion MNIST (F-MNIST) (xiao2017fashion, ), and CIFAR-10 (krizhevsky2009learning, ) datasets. The number of labeled classes is 10 for each dataset. MNIST and F-MNIST each contain 60,000 training images and 10,000 test images. Both the training and test datasets are equally partitioned into 10 classes, so each class has 6,000 training and 1,000 test images. CIFAR-10 contains 50,000 training images and 10,000 test images. We have worked on one model each for MNIST and F-MNIST, with accuracy 98% and 89% respectively, whereas for CIFAR-10 we have performed our experiments on two models, Simplified ImageNet (krizhevsky2012imagenet, ) and ResNet-32 (he2016deep, ), with accuracy 72% and 82% respectively. For crafting the FGSM attack, we have utilized the Cleverhans (papernot2016cleverhans, ) library. We have experimented using four distance-based measures on each dataset with the label of each data point predicted by the model. We have run our susceptible class detection on the entire dataset with the FGSM attack for each model and determined the $k$-susceptible target classes. For all experiments with variable perturbation, we change the perturbation $\epsilon$ for each simulation and run the simulation over a range of $\epsilon$ values. Hence, 20 simulations have been executed for each experiment.

Figure 2: The forbidden distance $D$ for a model and the average hopping distance $\bar{h}$ varying with perturbation $\epsilon$. (a) MNIST, (b) F-MNIST, (c) CIFAR10-ResNet32, and (d) CIFAR10-ImageNet.
Figure 3: Entropy varying with perturbation $\epsilon$ for a model prior to applying our techniques and after applying each technique. (a) MNIST, (b) F-MNIST, (c) CIFAR10-ResNet32, and (d) CIFAR10-ImageNet.
Figure 4: Accuracy varying with perturbation $\epsilon$ using (a) MNIST, (b) F-MNIST, (c) CIFAR10-ResNet32, and (d) CIFAR10-ImageNet.

4.2 Usability of adversarial map for susceptible class detection

We have claimed that using the adversarial map we can identify the susceptible classes. We will discuss the accuracy of the best distance-based measure in §4.4. We have evaluated our approach on four separate models, each briefly described in §4.1. For MNIST and F-MNIST, we have utilized a simple model with one input, one dense, and one output layer. We have used state-of-the-art models for CIFAR-10 to evaluate our techniques. To calculate the distance, we have implemented four different measures and compared them by computing the entropy of the model. Initially, we have calculated the entropy of a model by applying an adversarial attack with a fixed perturbation. In the entropy equation, we assume that without any prior information, the probability of an input being misclassified as an adversarial class given the actual class is uniform across the adversarial classes. For a fixed model, the value of this probability is constant for all data points under this assumption. However, we have leveraged the adversarial map to provide a weighted probability to each adversarial class based on the calculated distance between the classes. For example, suppose actual class $c_i$ has neighbors $c_j$ and $c_l$, and $c_j$ is closer to $c_i$. In this scenario, $p(c_j \mid c_i) > p(c_l \mid c_i)$. Our goal is to reduce the entropy in a model with the distance-based measures. In Figure 3, we evaluate the change of randomness by computing the entropy for all four models with the four distance-based measures and compare them. In all the cases, the Nearest Neighbor Hopping distance based measure performs best in decreasing the entropy of a model under an adversarial setting. We have found that with increasing perturbation, the randomness typically increases. However, the entropy in all the cases becomes more or less constant after a certain amount of perturbation. This indicates that almost all images are misclassified after a certain perturbation, and thus the entropy no longer changes with the perturbation.

Surprisingly, we have found that for the CIFAR-10 ImageNet and ResNet-32 models, the entropy decreases with increasing perturbation. In Figure 3(c) and (d), the entropy initially increases with increasing perturbation but decreases after a certain simulation. We found that data points that were misclassified with lower perturbation were classified correctly with higher perturbation. In Figure 5, the image is initially classified correctly as frog by the ImageNet model. With a lower perturbation, the image is misclassified as deer, whereas with a higher perturbation, the image is again classified correctly.
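To make the entropy comparison concrete, the sketch below contrasts a uniform assignment over adversarial classes with a distance-weighted one for a single actual class. The natural logarithm, the uniform value $1/(n-1)$ for $n$ classes, and the inverse-distance weighting are assumptions chosen for illustration.

```python
import math

def entropy(prob_rows):
    """Sum of -p * log(p) over all (actual class, adversarial class) probabilities."""
    return -sum(p * math.log(p) for row in prob_rows for p in row if p > 0)

def uniform_row(n_classes):
    # No prior information: each of the other n - 1 classes is equally likely.
    return [1.0 / (n_classes - 1)] * (n_classes - 1)

def weighted_row(distances):
    # Closer adversarial classes receive proportionally higher probability.
    inv = [1.0 / d for d in distances]
    z = sum(inv)
    return [w / z for w in inv]

n = 4                                                  # hypothetical number of classes
h_uniform = entropy([uniform_row(n)])
h_weighted = entropy([weighted_row([1.0, 2.0, 4.0])])  # hypothetical class distances
```

Since the uniform distribution maximizes entropy for a fixed number of outcomes, any non-uniform distance-based weighting yields a lower value, which is the effect the adversarial map is meant to exploit.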

4.3 Effect of forbidden distance on adversarial attack

Figure 5: (a) Actual image data from CIFAR-10, (b) image with a lower perturbation, (c) image with a higher perturbation.

In this section, we report the forbidden distance $D$ for different attacks and models. Moreover, we have evaluated the effectiveness of $D$ in misclassification. We have claimed that the attack cannot travel more than $D$ under a particular adversarial setting. We support our claim empirically by computing $D$ on the actual training data and demonstrating that the average hopping distance ($\bar{h}$) traveled under an attack remains less than $D$. After calculating $D$ using Eq. 4, we have simulated with increasing perturbation for each model. We have assumed that with increasing perturbation, the force of the attack increases, and so does the average distance (displacement) traveled by data points in the feature space. This is analogous to Hooke's law in physics, which states that the displacement is proportional to the force. We have also evaluated the forbidden distance for each model and checked whether the assumption regarding $D$ and $\bar{h}$ applies. In Figure 2, we have simulated four cases and found that $\bar{h}$ increases with perturbation and remains under the bound given by $D$. Hence, $D$ provides an upper-bound distance for a model. In our approach, we interpret it as the capacity of a particular model to withstand an attack.

Method   MNIST   F-MNIST   CIFAR-10 (ResNet-32)   CIFAR-10 (ImageNet)
CE       79.7    -         71.5                   -
RCE      98.8    -         92.6                   -
NN       55.3    64.8      99.9                   94.4
Table 1: Comparison of accuracy (%) of the learning based detection algorithms RCE (3fawzi2018adversarial, ) and CE (goodfellow2016deep, ) with NN (our approach).

Using our approach, if the hopping distance between the classes $c_i$ and $c_j$ is more than the forbidden distance $D$, then $c_i$ cannot be misclassified to $c_j$. We have compared the model's $D$ with the average hopping distance $\bar{h}$ after the model has undergone an attack. We have found that prior knowledge of a model provides a good estimation for describing the behavior of the adversarial examples.

4.4 Effect of adversarial map and susceptible class identifier

In this section, we take advantage of our adversarial map and susceptible class identifier to analyze the threats to a DNN model. We have utilized the adversarial map and computed the top $k$ susceptible target classes as described in §3.5. In Figure 4, we have simulated our approach varying the perturbation $\epsilon$ and $k$. It is apparent that with a larger value of $k$, the accuracy of predicting susceptible target classes will increase. However, we want to maximize the accuracy of our approach with the least value of $k$. This is a trade-off between $k$ and accuracy. From Figure 4, we have identified the value of $k$ with which our approach performs best for all models used in the evaluation. We have compared our work in Table 1 with reverse cross-entropy (RCE) (3fawzi2018adversarial, ) and common cross-entropy (CE) (goodfellow2016deep, ). Our approach can identify the susceptible target classes with higher accuracy using CIFAR-10 with ResNet-32 and ImageNet. In contrast, the accuracy for the DNN model using MNIST is lower than in the previous work. To understand the reason, we have examined the adversarial classes for models using MNIST. We have found that the model using MNIST has a different adversarial map for each actual class and almost all adversarial classes are susceptible to attack. To check further, we have visualized the MNIST based model using t-SNE in 2-D space and have observed that the visualization shows a distinct separation among classes. In contrast, the 2-D visualization of the CIFAR-10 based DNN model depicts some overlaps among the features, and our distance-based approach has discovered a certain pattern in the adversarial map. Thus, we can conclude that our approach works better for models with high complexity, e.g., CIFAR-10 based DNN models.
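One plausible way to score the trade-off between $k$ and accuracy is to count how often the observed adversarial label falls inside the predicted set of susceptible classes. The sketch below assumes that reading of accuracy; the function name and the toy labels are hypothetical.

```python
def top_k_accuracy(adversarial_labels, susceptible):
    """Fraction of adversarial examples whose label lies in the susceptible set."""
    hits = sum(1 for y in adversarial_labels if y in susceptible)
    return hits / len(adversarial_labels)

# Hypothetical labels assigned by a model to five adversarial examples.
adv_labels = ["dog", "dog", "cat", "ship", "dog"]
acc_k1 = top_k_accuracy(adv_labels, {"dog"})          # k = 1
acc_k2 = top_k_accuracy(adv_labels, {"dog", "cat"})   # k = 2
```

Enlarging the susceptible set can only keep or raise this score, which is why a small $k$ with high accuracy is the more informative result.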

5 Conclusion

In this paper, we have presented a technique to detect susceptible classes using the prior information of a model. First, we analyze a DNN model to compute the distances among classes in feature space. Then, we utilize that information to identify $k$ classes that are susceptible to attack. We found the value of $k$ at which our approach performs best. To compare the four distance-based measures, we have presented a technique to create an adversarial map to identify susceptible classes. We have evaluated the utility of four different measures in creating the adversarial map. We have also introduced the idea of the forbidden distance $D$ in the construction of the adversarial map. We have experimentally shown that the adversary cannot misclassify to a target beyond distance $D$. We have found that Nearest Neighbor hopping is able to describe the adversarial behavior by decreasing the entropy of a model and computing the upper bound distance ($D$) accurately. Our approach is also able to detect $k$ susceptible target classes that can detect adversarial examples with high accuracy for the CIFAR-10 dataset (ImageNet and ResNet-32). In addition, for MNIST and F-MNIST, our approach achieves an accuracy of 55.3% and 64.8% respectively. Currently, our susceptible class detection identifies the source class of an adversarial example only with a certain probability. In the future, we want to find and study more properties of adversarial attacks to detect adversarial examples with a lower bound guarantee. Developing model analysis techniques and algorithms to achieve this goal remains future work.