Detecting Adversarial Examples in Convolutional Neural Networks

12/08/2018 · by Stefanos Pertigkiozoglou, et al. · National Technical University of Athens

The great success of convolutional neural networks has caused a massive spread of the use of such models in a large variety of Computer Vision applications. However, these models are vulnerable to certain inputs, the adversarial examples, which, although not easily perceived by humans, can lead a neural network to produce faulty results. This paper focuses on the detection of adversarial examples that are created for convolutional neural networks performing image classification. We propose three methods for detecting possible adversarial examples and, after analyzing and comparing their performance, we combine their best aspects to develop an even more robust approach. The first proposed method is based on the regularization of the feature vector that the neural network produces as output. The second method detects adversarial examples by using histograms created from the outputs of the hidden layers of the neural network. These histograms form a feature vector which is used as the input of an SVM classifier that classifies the original input either as an adversarial or as a real input. Finally, for the third method we introduce the concept of the residual image, which contains information about the parts of the input pattern that are ignored by the neural network. This method aims at the detection of possible adversarial examples by using the residual image and reinforcing the parts of the input pattern that are ignored by the neural network. Each of these methods has some novelties, and by combining them we can further improve the detection results. For the proposed methods and their combination, we present the results of detecting adversarial examples on the MNIST dataset. The combination of the proposed methods offers some improvements over similar state-of-the-art approaches.

1 Introduction

The use of Deep Learning and deep neural networks has spread in a large variety of Computer Vision applications due to their increasing effectiveness in solving many difficult visual tasks. Specifically, a Convolutional Neural Network, presented in [9], is a deep learning model which is used extensively in image recognition. A Convolutional Neural Network (CNN) consists of successive layers, where the network processes the input patterns at different scales. These multiple levels of representation remove the need for complex feature extraction, which transforms the raw data into a feature vector, because a CNN can accept the raw data as input and learn how to extract the important features internally in its first few layers. In addition, these types of neural networks take advantage of the locality of the patterns in an image by using convolutional layers.

However, neural networks, and subsequently CNNs, are vulnerable to certain inputs, the adversarial examples, as shown in [21]. These inputs, although not easily perceived by humans, can lead a CNN to produce faulty results. In the case of CNNs used in image classification, where the input is an image, we also refer to adversarial examples as adversarial images. In this case, an original image, which is classified correctly by a CNN, is slightly perturbed to produce the adversarial image. Despite the fact that the adversarial image seems similar to the original image according to the perception of a human, it is classified into a different category. This means that a user can alter the output of the network by perturbing the input in a way that is not detected by humans.

The existence of adversarial examples indicates that existing CNNs, although they can achieve near-human accuracy in some specific applications, do not perceive the input in the same way humans do. As a result, by studying adversarial examples we can improve the models used, in order to create models that are closer to human perception, which we consider the ideal solution. In fact, as shown in [7], by improving the behavior of a neural network against adversarial examples we can also improve the accuracy of that network on real inputs. Adversarial examples are also a vulnerability that can be abused by a malicious user to influence the behavior of a system that uses a vulnerable neural network. For example, a physical-world adversarial example [3] can alter the perception of a self-driving car that uses cameras to navigate through the urban environment.

There are two main approaches for combating adversarial examples. The first approach aims at making the neural network more robust against adversarial examples [16], [22] by changing the network’s architecture and the learning procedure. The second approach assumes that the neural network is already trained and tries to detect whether a new input is an adversarial example or a real input [4], [20], [12], [11], [6].

In this paper, we focus on the second approach and propose methods that aim to detect adversarial inputs. Specifically, we propose three different methods for detecting adversarial examples generated for a CNN that performs image classification. After we analyze and compare their performance, we propose different ways of combining their best aspects to develop a more robust approach. The first method is based on the regularization of the feature vectors produced by the network. Using the regularized feature vectors we retrain the last layer of the CNN, similarly to the adversarial training proposed in [7]. We can then detect whether a new input is an adversarial example by comparing the output of the original network with the output of the retrained network. The second method creates histograms using the absolute values of the outputs of the network’s hidden layers. By combining these histograms, this method creates a vector which is used by an SVM classifier to classify the input either as real or as adversarial. For the third method we assume that in a neighborhood of the input space the CNN acts as an affine classifier. Using that assumption we introduce the concept of the residual image, which contains information about the parts of the input pattern that are ignored by the network. This information is then used to perturb the input image in order to detect whether this image is a real or an adversarial input.

We use these methods and their combination to detect adversarial examples generated for a LeNet [10] network trained on the MNIST [10] dataset. The combination of the three proposed methods offers some improvements over similar state-of-the-art approaches for the detection of adversarial examples on the MNIST dataset.

2 Notation and Related Work

2.1 Notation

The adversarial examples examined in this paper are generated for CNNs, which accept as input either a 2D signal of a grayscale image or a 3D signal of an RGB image. To simplify the notation, we use an image vector to denote an input image, whose dimension is the total number of points of the discrete 2D or 3D signal. Each component of the image vector corresponds to a certain point of the discrete 2D or 3D signal, and we refer to the components of the image vector accordingly. This notation is also extended to the feature maps, which are the outputs of the hidden layers of the CNN.

When the image vector corresponds to a grayscale image, the operation corresponding to a discrete 2D convolution between the image and a kernel can be expressed as the multiplication of the image vector by a convolution matrix.
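As a small illustration of this equivalence, the following sketch builds such a convolution matrix for a zero-padded, same-size 2D filtering of a grayscale image; the construction, the padding choice, and the function name are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def convolution_matrix_2d(kernel, img_shape):
    """Build a matrix C such that C @ x equals the zero-padded, same-size 2D
    filtering of the flattened grayscale image x with the given kernel."""
    H, W = img_shape
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    C = np.zeros((H * W, H * W))
    for i in range(H):
        for j in range(W):
            for u in range(kh):
                for v in range(kw):
                    ii, jj = i + u - ph, j + v - pw
                    if 0 <= ii < H and 0 <= jj < W:
                        # cross-correlation form (as used in CNNs); flip the
                        # kernel beforehand for a true convolution
                        C[i * W + j, ii * W + jj] += kernel[u, v]
    return C

# Example: filtering a 28x28 MNIST-sized image with a 5x5 kernel
# C = convolution_matrix_2d(np.random.randn(5, 5), (28, 28))
# y = C @ image.reshape(-1)
```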

2.2 Generating Adversarial Examples

In order to generate adversarial examples we assume that we have an already trained CNN that accepts as input an image and classifies it into one of N different categories. In particular, the output of the CNN is a vector with N components, where the i-th component is the confidence of the network that the input belongs to the i-th category. In addition, we have a set of test images (not used in training) which are used as starting points for the adversarial generation. We also refer to these test images as real images.

For the trained network, let be the input of the network, be the output of the network, be the category the input is classified into and be the classification error of the network with input and target category label , where is the total number of categories into which the network can classify the input. According to [21] if we have a real image , which is classified into category with label , we can produce an adversarial image , by solving the optimization problem

(1)

where is a parameter that controls the distance between the real image and the adversarial image .

Instead of solving the optimization problem of Equation (1), there are many proposed methods [17], [13], [8], [1], [15] that can produce robust adversarial examples much faster. For the experiments in this paper we use the Basic Iterative Method [8] and the DeepFool method [13].

2.2.1 Basic Iterative Method (BIM)

One method proposed in [7], as a faster alternative to solving the optimization problem of Equation (1), is the Fast Gradient Sign Method. In this method the adversarial image is produced by adding to the original image, which is classified into the correct category, a vector proportional to the sign of the gradient of the classification error:

(2)

As an extension of this method, [8] proposed the Basic Iterative Method (BIM), where the adversarial image is created by applying the fast gradient sign method several times with a smaller step, and also by clipping the result in each iteration in order to stay in a bounded neighbourhood of the original image. This means that at each iteration the method generates an image where:

(3)

and the pixel of is computed as follows:

(4)

where the corresponding pixels of the original and the perturbed images are used, and the clipping is bounded by the minimum and maximum values allowed for the input.

The BIM method terminates at the iteration when it finds an adversarial image that is classified into a category that is different from the original category.
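As a concrete illustration, the following is a minimal PyTorch sketch of the BIM procedure described above; the model handle, the loss, the step size, the neighbourhood size, and the pixel bounds are illustrative assumptions rather than the exact settings used in the experiments of this paper.

```python
import torch
import torch.nn.functional as F

def basic_iterative_method(model, x, label, eps=0.1, alpha=0.01, max_iter=50,
                           v_min=0.0, v_max=1.0):
    """Perturb image x until the model's prediction changes, staying in an
    eps-neighbourhood of x and inside the allowed pixel range."""
    x_adv = x.clone()
    for _ in range(max_iter):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), label)       # classification error
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()            # fast gradient sign step
            x_adv = torch.max(torch.min(x_adv, x + eps), x - eps)  # stay near x
            x_adv = x_adv.clamp(v_min, v_max)              # valid pixel values
        if model(x_adv).argmax(dim=1) != label:            # misclassified: stop
            break
    return x_adv.detach()
```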

2.2.2 DeepFool

This method proposed in [13] is an iterative method that is based on the linearization of the classifier at each step.

Let us consider the output of the trained network for each category when the input is a given image vector. If we have a real image which is correctly classified into its category, then using the DeepFool method we can, after a number of iterations, compute the adversarial image as follows

(5)
(6)
(7)

This method terminates when it finds an image that is classified by the neural network into a different category.
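For reference, a minimal PyTorch sketch of the standard multi-class DeepFool iteration from [13] is shown below; the overshoot factor, the iteration limit, and the restriction to a single input image are illustrative assumptions, and the exact update used in the experiments may differ in detail.

```python
import torch

def deepfool(model, x, num_classes=10, max_iter=50, overshoot=0.02):
    """Iteratively linearize the classifier around the current image and take
    the smallest step that crosses the nearest decision boundary."""
    x_adv = x.clone()
    orig_label = model(x).argmax(dim=1).item()
    for _ in range(max_iter):
        x_adv.requires_grad_(True)
        logits = model(x_adv)[0]                       # class scores
        if logits.argmax().item() != orig_label:       # boundary crossed: stop
            break
        grad_orig, = torch.autograd.grad(logits[orig_label], x_adv,
                                         retain_graph=True)
        best_ratio, best_step = None, None
        for k in range(num_classes):
            if k == orig_label:
                continue
            grad_k, = torch.autograd.grad(logits[k], x_adv, retain_graph=True)
            w_k = grad_k - grad_orig                   # direction towards class k
            f_k = (logits[k] - logits[orig_label]).item()
            ratio = abs(f_k) / (w_k.norm() + 1e-8)     # distance to boundary k
            if best_ratio is None or ratio < best_ratio:
                best_ratio = ratio
                best_step = (abs(f_k) / (w_k.norm() ** 2 + 1e-8)) * w_k
        with torch.no_grad():
            x_adv = x_adv + (1 + overshoot) * best_step
    return x_adv.detach()
```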

Figure 1: Columns: (a) Real input images, (b) Adversarial images produced with the BIM method, which are misclassified by the LeNet network, (c) Adversarial images produced with the DeepFool method, which are misclassified by the LeNet network

2.3 Detecting Adversarial Examples

There is a large variety of different approaches that aim to detect possible adversarial inputs in a deep neural network. Many of these approaches are based on the extraction of a certain metric, using the outputs of the neural network, which is then used to distinguish between a real input and an adversarial input.

An example of two metrics that can be used for that distinction is proposed in [4]. The first metric is based on the assumption that many adversarial generating methods produce adversarial examples that are near the low-dimensional submanifold where the real inputs lie (but not on it). As a result, [4] proposes a method to model the submanifolds of the real data using kernel density estimation in the feature space produced by the last hidden layer. Therefore an adversarial input will produce a feature vector that is located in a region where the density estimate is lower than the density estimate for real inputs. The second metric is another approach to identify low-confidence regions of the input space. In particular, given an input, this metric computes an estimation of the uncertainty of a deep Gaussian process [5], using the outputs of multiple DNNs with the same architecture that are trained on the same training set using the dropout method. Using this metric we expect a higher uncertainty estimation when the input is an adversarial input.

A training procedure that can enhance the detection results when kernel density estimation is used as a metric is proposed in [14]. This training procedure adds a regularization term called reverse cross-entropy. This term encourages the network to produce outputs which have high confidence for the correct category and confidence that is as uniform as possible over the other categories.

Another method that aims at the detection of possible adversarial examples is proposed in [20]. This method uses the PixelCNN model [19], which is a trainable generative model. With this model, the likelihood that a new image is produced by the model can be easily computed. Based on the observation that adversarial inputs tend to be less likely to be produced by the model than real inputs, this likelihood can be used to detect whether a new image is an adversarial example.

Apart from the methods that use a combination of metrics in order to detect possible adversarial examples, there are also detection methods [12], [11], [6] which train secondary neural networks that act as classifiers, classifying an input image either as an adversarial example or as a real input.

3 Proposed Methods for Detecting Adversarial Examples

3.1 Regularization for detecting Adversarial Examples

This method is based on the regularization of the feature vector that the neural network produces. For the regularization, we use the method of nonlocal discrete regularization on weighted graphs proposed in [2]. Due to the fact that the outputs of the layers closest to the final output of a CNN act as feature vectors of the input, we can use the outputs of one of these layers as the feature vector of the input image.

3.1.1 Feature Vector Regularization

Let us consider the output of the second-to-last layer of the CNN when the input is a given image. For this method we use this output as the feature vector of the image, extracted by the neural network. To perform the regularization of the feature vectors we create a weighted graph from a set of input images, where each vertex represents one of the input images. In addition, we define a function on the vertices of the graph, whose value at a vertex is the feature vector of the image that the vertex represents. Each edge that connects two vertices carries a weight. The norm of the gradient of the function at a vertex is defined by:

(8)

with

(9)

where is the component of , which have components. Using the above definitions we can regularize the function using the following algorithm:

(10)
(11)

with

(12)

where the parameter controls the degree of regularity which has to be preserved and the parameter controls the fidelity to the original function .
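To make the procedure concrete, the following NumPy sketch shows a Gauss-Jacobi style update for this kind of discrete regularization on a weighted graph, assuming the iteration follows the general form proposed in [2]; the fidelity parameter, the value of p, the iteration count, and the function name are illustrative choices rather than the authors' exact settings.

```python
import numpy as np

def regularize_graph(F0, W, lam=0.5, p=1.0, n_iter=50, eps=1e-8):
    """Discrete regularization of vertex features F0 (n_vertices x dim) on a
    weighted graph W (n_vertices x n_vertices). lam controls fidelity to F0,
    p controls the degree of regularity (p = 1 in the paper)."""
    F = F0.copy()
    for _ in range(n_iter):
        # norm of the graph gradient at every vertex:
        # ||grad f(u)||^2 = sum_v w(u,v) * ||f(v) - f(u)||^2
        diff = F[None, :, :] - F[:, None, :]             # (u, v, dim)
        sq = (diff ** 2).sum(axis=2)                      # ||f(v) - f(u)||^2
        grad_norm = np.sqrt((W * sq).sum(axis=1)) + eps   # per vertex
        # edge weights modulated by the gradient norms (p < 2 sharpens edges)
        gamma = W * (grad_norm[:, None] ** (p - 2) + grad_norm[None, :] ** (p - 2))
        denom = lam + gamma.sum(axis=1, keepdims=True)
        F = (lam * F0 + gamma @ F) / denom               # Gauss-Jacobi update
    return F
```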

3.1.2 Detecting Adversarial Examples using the Regularization of the feature vector

For this method we use a set of input images that consists of real images and adversarial images, for which we know the correct target categories. Also, we have a set of new images for which we do not know whether they are adversarial or not. For the images of both sets we get the feature vectors that are extracted from the trained CNN and we create a weighted graph. Using this graph we perform the regularization that was described in section 3.1.1 using p = 1. Then we retrain the last layer of the network, which takes as input the feature vectors, using as inputs the regularized feature vectors only from the images of the first set and as desired outputs their correct categories. A new image is detected as an adversarial image when the category into which it is classified by the retrained last layer, using its regularized feature vector, is different from the category into which it is classified by the original CNN. This procedure is shown in Algorithm 1.

An important detail of this method is the weights of the edges of the graph that are created from the input images. The weight of an edge which connects two images can depend on either the distance between them, or the distance between their feature vectors.

Let be the feature vector of image and be the feature vector of image . The weight of the edge between the vertices using the distance of the feature vectors can be computed by:

(13)

When we use the distance between the two images, the weight of the edge between the vertices can be computed by:

(14)

The use of the Euclidean distance between the input images can be explained by the fact that we want the weights to illustrate the similarity between the images. However, the Euclidean distance expresses this similarity only when the input patterns are aligned and they are on the same scale. In contrast, the Euclidean distance between the feature vectors can express this similarity even when the input patterns are not aligned and have different scales, but with the drawback that this distance can be more easily manipulated by adversarial inputs.

In addition to the Euclidean distance we can try to use different distances to compute the weights of the graph. So we may also use one of the following distances :

  • the cosine distance: .
    (where is the inner product of , )

  • the distance of the two vectors: .

In order to use the distance function , Equation (13) can be generalized as follows:

(15)

Also Equation (14) can be generalized as follows:

(16)

Using Equations (15), (16) with the different distance functions we get different results of detecting adversarial examples.
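As an illustration, the sketch below computes pairwise edge weights for the graph with the different distance functions; the Gaussian kernel form and the scale parameter are assumptions standing in for Equations (13)-(16), and the function name is hypothetical.

```python
import numpy as np

def edge_weights(vectors, distance="euclidean", sigma=1.0):
    """Pairwise edge weights for the graph, assuming a Gaussian kernel
    w_ij = exp(-d(v_i, v_j)^2 / sigma^2); the exact kernel of Equations
    (13)-(16) may differ. `vectors` holds either flattened images or
    feature vectors, one per row (n x dim)."""
    if distance == "euclidean":
        d = np.linalg.norm(vectors[:, None, :] - vectors[None, :, :], axis=2)
    elif distance == "cosine":
        normed = vectors / (np.linalg.norm(vectors, axis=1, keepdims=True) + 1e-8)
        d = 1.0 - normed @ normed.T                 # cosine distance
    elif distance == "l1":
        d = np.abs(vectors[:, None, :] - vectors[None, :, :]).sum(axis=2)
    else:
        raise ValueError(distance)
    return np.exp(-(d ** 2) / sigma ** 2)
```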

Input: Set of images with known correct categories, set V of images from which we want to detect the adversarial examples
1  Compute the feature vectors for the images in both sets.
2  Using the feature vectors and one of the Equations (15), (16), create the graph G.
3  Regularize the graph G using p = 1, and get the regularized feature vectors.
4  Retrain the last layer of the CNN using as inputs the regularized feature vectors of the images with known categories and as target outputs their correct categories.
5  foreach image in V do
6      Using the original last layer and the original feature vector, get the first classification category.
7      Using the retrained last layer and the regularized feature vector, get the second classification category.
8      if the two categories differ then
9          Detect the image as an adversarial image.
10     else
11         Detect the image as a real image.
12     end if
13 end foreach
Algorithm 1 Generalized Algorithm for detecting adversarial examples using the regularization of the feature vectors

3.1.3 Experiments using the Regularization Method

We generate adversarial examples using the BIM method and the DeepFool method on a LeNet [10] neural network that was trained on the MNIST dataset. From the test set of the MNIST dataset we use 2000 images to create 2000 adversarial examples using the BIM method and 2000 adversarial examples using the DeepFool method.

Adversarial Detection
                    Using Equation (15)                      Using Equation (16)
                    BIM                  DeepFool            BIM                  DeepFool
Distance            Precision  Recall    Precision  Recall   Precision  Recall    Precision  Recall
Euclidean           89.5%      66.2%     92.4%      83.7%    90.2%      67.1%     93.2%      84.4%
Cosine              90.2%      66.8%     91.9%      84.2%    90.7%      68.9%     92.7%      84.9%
(third distance)    90%        65.7%     91.2%      84.1%    90.8%      66.1%     92.5%      84.6%

Adversarial Detection without Regularization
                    BIM: Precision = 86%, Recall = 51%       DeepFool: Precision = 88.3%, Recall = 69.2%

Table 1: Results of adversarial detection on the LeNet network using the Regularization Method, when different distances are used to compute the weights of the edges of the graph

We then use the Regularization Method that was presented in section 3.1.2 in order to detect the adversarial images in a set of 2000 real and 2000 adversarial images, using the different distances and the different adversarial generation methods. Also, we try to detect the adversarial examples without using the regularization, which means that the method retrains the last layer using the original feature vectors. When we remove the regularization of the feature vectors, retraining the last layer is similar to adversarial training [7]. The results are shown in Table 1. Both the highest Precision and the highest Recall in the detection are achieved when the weights are computed using Equation (16).

3.2 Histogram Method for Adversarial Detection

By comparing the outputs of the hidden layers of a CNN when the inputs are real images and when the inputs are adversarial images, we can observe that the outputs in these two cases have different distributions of values. In particular, we compare the outputs for an original image and for an adversarial image generated from it. An example of two such outputs is presented in Figure 2. We observe that in the case of the adversarial image there is an increase in the values of some peaks of the original output, while there is a decrease in the values at the rest of the points of the output.

Figure 2: First column: Input images where (a) is a real image and (c) is an adversarial image. Second column: Output of the first convolutional layer of the neural network for the inputs of the first column. (Best viewed in color)

The difference in the distribution of the outputs can be detected using histograms of the output values. In Figure 3, we can see histograms of the absolute values of the outputs of the first convolutional layer of the LeNet network, when the input is a real image and when it is an adversarial image. In these histograms we can observe that in the case of the adversarial image, the points of the output which have high absolute values are fewer than the respective points in the case of the real image.

Figure 3: Histograms of the absolute values of the output when the input is: (a) the real image of Figure 2(a), (b) the adversarial image of Figure 2(c).

We can use this difference in the histograms for adversarial detection. To do so, we train an SVM classifier, which takes the histogram of the absolute values of the output of the first convolutional layer as input and predicts whether the input image is an adversarial image or a real image.

This method is problematic when we add Gaussian noise to the input images. The additive noise changes the distribution of the values of the outputs, and as a result the output distribution when the inputs are adversarial images becomes more similar to the output distribution when the inputs are real images. To improve this method and make it more robust against the addition of Gaussian noise we propose the reinforcement step. Consider the function that is implemented by the CNN to produce the network’s final output and an input image that is classified into a certain category. Using the reinforcement step we get a new image which is defined by:

(17)

where is the classification error of the network for input when the target category is .

When the original input is an adversarial image the reinforcement step will increase the confidence for the adversarial category and as a result will have a histogram which is more distinct from a histogram of a real image. Similarly, when the original input is a real image, will increase the confidence for the real category and it will have a histogram which is more distinct from a histogram of an adversarial image. Hence, when we use both and to produce the histograms that are used for detection, it is easier for the SVM classifier to distinguish between real and adversarial images.
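A minimal PyTorch sketch of such a reinforcement step is given below; the gradient-descent form of the update and the step size are assumptions standing in for Equation (17), and the function name is hypothetical.

```python
import torch
import torch.nn.functional as F

def reinforce(model, x, step=0.05):
    """Reinforcement step: push the input further towards the category the
    network already assigns to it, so that its histogram becomes more distinct."""
    pred = model(x).argmax(dim=1)              # category the input is classified into
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), pred)     # classification error for that category
    grad, = torch.autograd.grad(loss, x)
    return (x - step * grad).detach()          # decrease the error -> increase confidence
```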

Another detail that improves the detection is, when we create the histograms from the output of a layer with multiple channels, to create one histogram for each channel instead of a single histogram for all the channels. As a result, for a layer with several channels we get several histograms, and by concatenating them we create the final vector that will be used as the input of the SVM.

Input: Set of images which are a mix of real and adversarial images, together with a label for each image indicating whether it is adversarial or real
1  foreach image in the set do
2      Generate the reinforced image using Equation (17).
3      Compute the feature maps of the hidden layers for the original image.
4      foreach feature map of the first convolutional layer do
5          Compute the histogram of the absolute values of the feature map.
6      end foreach
7      Concatenate these histograms to create the histogram vector of the original image.
8      Compute the feature maps of the hidden layers for the reinforced image.
9      foreach feature map of the first convolutional layer do
10         Compute the histogram of the absolute values of the feature map.
11     end foreach
12     Concatenate these histograms to create the histogram vector of the reinforced image.
13     Concatenate the two histogram vectors to create the final feature vector of the image.
14 end foreach
15 Train the SVM classifier using the final feature vectors as inputs and the adversarial/real labels as target outputs.
Algorithm 2 Training of the SVM that is used in the Histogram method that utilizes the reinforcement step
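To make Algorithm 2 concrete, the sketch below computes the per-channel histogram features and trains the SVM with scikit-learn; the layer handle model.conv1, the bin settings, and the helper names are illustrative assumptions.

```python
import numpy as np
import torch
from sklearn.svm import SVC

def histogram_features(model, x, x_reinforced, n_bins=30, value_range=(0.0, 8.0)):
    """Per-channel histograms of the absolute activations of the first
    convolutional layer, for the original and the reinforced image."""
    feats = []
    for inp in (x, x_reinforced):
        with torch.no_grad():
            maps = model.conv1(inp)[0]                     # (channels, H, W)
        for channel in maps:
            hist, _ = np.histogram(channel.abs().cpu().numpy(),
                                   bins=n_bins, range=value_range)
            feats.append(hist)
    return np.concatenate(feats).astype(np.float64)

# Training sketch: `images`, `reinforced` and `is_adversarial` come from Algorithm 2.
# X = np.stack([histogram_features(model, x, xr) for x, xr in zip(images, reinforced)])
# svm = SVC(kernel="rbf").fit(X, is_adversarial)
```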

3.2.1 Experiments using the Histogram Method

Similarly to section 3.1.3, using a LeNet network trained on the MNIST dataset and both the BIM method and the DeepFool method, we create two sets of 2000 adversarial and 2000 real images. We try to detect the adversarial images using both the original histogram method and the histogram method that utilizes the reinforcement step. The SVM that is used for the detection is trained on a different set of 1000 real and 1000 adversarial images generated using the BIM method. Also, we test the two methods when we add Gaussian noise with different values of standard deviation.

Without Noise
                      BIM                     DeepFool
                      Precision   Recall      Precision   Recall
Original              97.9%       96.5%       98%         96.6%
Reinforcement Step    95.5%       94.6%       95.4%       93.6%
With Gaussian Noise (standard deviation 15)
Original              79.7%       74.9%       80%         81.7%
Reinforcement Step    85.5%       87.2%       86.8%       90.7%
Table 2: Results of adversarial detection on the LeNet network using the original histogram method and the method that incorporates the reinforcement step, with and without additive Gaussian noise on the input.

The results of the experiments when there is no additive noise and when there is additive Gaussian noise with standard deviation 15 are presented in Table 2. Figures 4(a), 4(b) show the Precision and the Recall of the two histogram methods for different values of the standard deviation of the additive Gaussian noise. When there is no additive noise the original method achieves better results. Nevertheless, the difference between the methods when there is no noise is small, and because the method utilizing the reinforcement step is much more robust to the addition of noise, we can conclude that it is the preferable method.

Figure 4: Adversarial detection results using the histogram methods for different values of standard deviation of the additive Gaussian noise: (a) Precision of the detection, (b) Recall of the detection.

3.3 Adversarial Detection using the Residual Image

For the third proposed method we first introduce the concept of the residual image. Then, using the information of the residual image, we propose a way of detecting whether the input image is an adversarial input by perturbing the input image.

First, we propose a simple way of utilizing the residual image for adversarial detection (Method A) and then we present two alternative methods (Methods B,C), that achieve better results in the detection of adversarial images on the MNIST dataset.

3.3.1 Residual Image

Figure 5: First row: Input images where image (a) is a real input and image (b) is an adversarial input. Second row: (c) Vector from Equation (18) when the input image is the real image, (d) Vector from Equation (18) when the input image is the adversarial image.

In a CNN the lower layers, closest to the input, act as feature extractors for the input image. The feature vector that is produced is then used as an input for the layers closest to the output, which are usually fully connected layers, in order to classify the input into one of the possible categories. Given an input image, we want to find an image-related vector that, if added to the input, will increase the norm of its feature vector without changing its direction. Consider the function that is implemented by the lower layers and associates the original input with its feature vector. We want to find a vector for which

(18)

We can easily find this vector by using the backpropagation algorithm. As an example, by using a LeNet network trained on the MNIST dataset and the input images shown in Figures 5(a), 5(b), we can compute the respective vectors which are shown in Figures 5(c), 5(d). We can see that for the input image of Figure 5(a), which is a correctly classified real image, the vector resembles a "9" digit, which is the correct category of the input image. In contrast, for the adversarial input image of Figure 5(b), which is falsely classified as a "7" digit, the vector more closely resembles a pattern of the adversarial category than a pattern of the correct category. Therefore, in a way the vector shows us the pattern that is perceived by the network.

This observation can be interpreted as follows: Due to the fact that the nonlinearities used in a CNN (e.g. Max Pooling, ReLU) are piecewise linear, in a neighborhood close to an input image the network acts as an affine classifier. This means that if we get the output of the network, which is the feature vector of the image, we can find a matrix and a vector so that

(19)

Given the input image, the final layers of the neural network perceive that input as its feature vector. Hence the neural network perceives the similarity between the image and a new image, which has its own feature vector, as the inner product of the two feature vectors. When these two inputs activate the nonlinearities of the neural network in the same way, we can use Equation (19) to compute the inner product as follows:

(20)

According to Equation (20) the similarity between the two images depends on the inner product of the new image with the image produced by the backpropagation. Hence this image can be interpreted as the pattern that is perceived by the network, and it is used to find the similarity between the original image and the new image.

Consider now the final output of the neural network, which classifies the input into different categories, where each component is the confidence of the network that the correct category of the input is the corresponding category. Similarly to Equation (19), in a neighborhood of the input space where the nonlinearities are activated in the same way, we can find a matrix and a vector so that for the input image the output can be computed as follows:

(21)

We can find which parts of the input pattern the neural network ignores by using the residual image , which is the projection of the input image onto the null space of matrix . Also, we can get the perceived image which shows us the parts of the pattern perceived by the network.
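The following PyTorch sketch illustrates this idea by estimating the matrix of the locally affine map in Equation (21) as the Jacobian of the network output with respect to the input, and splitting the input into a perceived part and a residual part; this is a sketch of the concept, and the authors' exact computation may differ.

```python
import torch
from torch.autograd.functional import jacobian

def perceived_and_residual(model, x):
    """Treat the network as locally affine, f(x) ~ A x + b, estimate A as the
    Jacobian at x, and split x into the part that is seen by the classifier
    (projection onto the row space of A) and the residual part (projection
    onto the null space of A)."""
    x_flat = x.reshape(-1)
    f = lambda v: model(v.reshape(x.shape)).reshape(-1)   # network as a map R^n -> R^N
    A = jacobian(f, x_flat)                               # (N, n) local affine matrix
    perceived = torch.linalg.pinv(A) @ A @ x_flat         # part of x seen by the classifier
    residual = x_flat - perceived                          # part of x that is ignored
    return perceived.reshape(x.shape), residual.reshape(x.shape)
```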

Figure 6: Columns: (a) Adversarial input images, (b) Perceived images, (c) Residual images, (d) Gradient of the classification error of the network when the input is the adversarial image and the target category is the category the input is originally classified into.

3.3.2 Detecting Adversarial Examples using the Residual Image

Ideally a classifier, in order to classify the pattern of an image into one category, must perceive the entire pattern. This means that for an ideal classifier we would expect the norm of the residual image to be close to zero. In reality this does not happen and the norm of the residual image stays high even for images of the training set. Nevertheless, what we can observe is that between a real image and an adversarial image generated from it, the norm of the residual image increases as we go from the real image to the adversarial image.

In the case of adversarial images, as we have shown in section 3.3.1, the perceived image more closely resembles the adversarial category. This means that the residual image contains valuable information for the distinction between the adversarial category and the real category, which is lost due to the fact that the classifier perceives only the perceived image. The method of detecting adversarial examples is based on this observation and tries to add the lost information of the residual image, so that it is not ignored by the classifier. In particular, adding the information of the residual image into the original image, regardless of whether the original image is real or adversarial, will increase the confidence of the classifier for the correct category. As a result, in the case of an adversarial image, which is at first classified into the adversarial category, we will observe a decrease of the confidence for the adversarial category and an increase of the confidence for the correct category.

Consider the original image and its residual image. Because in a neighborhood close to the image the output of the network can be computed using Equation (21), and because the residual image belongs to the null space of the matrix in that equation, the output of the network does not change when we add the residual image to the original image.

If the input is classified by the network into a certain category, then by adding or subtracting the gradient of the classification error for that category we can achieve the greatest change in the error of the classifier. Because the greatest change can be achieved either by adding or by subtracting this gradient, we can use the information of the residual image to determine the direction that increases the confidence for the correct category. In the next sections we propose different methods, which can be used to combine the information of the residual image and the information of the gradient, in order to find the image that we will add to the input image.

3.3.3 Method A

In the first method, at each pixel, we take the value of the residual image and the sign of the gradient. The image we want to add is defined by:

(22)
(23)

where the first equation applies a regularization function to the residual image and the second uses the entrywise product of two vectors.

Although the regularization in Equation (22) is optional, the results from the experiments showed that by regularizing the residual image we can achieve an improvement in the adversarial detection. In the experiments presented in section 3.3.6 we use Total Variation Regularization [18] in order to regularize the residual image .

Method A starts with an image, which is classified into a certain category, and perturbs it by iteratively adding the image computed using Equations (22), (23). After a certain number of iterations, the image that is produced is used as an input to compute the output of the CNN after it goes through a softmax layer. Then the method detects the original image as an adversarial image if the softmax output for the original category is below a certain threshold. This procedure is shown in Algorithm 3. A weakness of this method is that, with the way it computes the residual and the added image, there is not a clear distinction between the parts of the pattern that are perceived and the parts that are ignored.

Input: Image vector that we want to check for being an adversarial input
1  Compute the category into which the image is classified by the CNN.
2  for a fixed number of iterations do
3      Compute the residual image for the current input.
4      Using Equations (22), (23), compute the image to add and update the current input.
5  end for
6  Compute the softmax output of the CNN for the final perturbed image.
7  If the softmax output for the original category is below the threshold, detect the input as an adversarial example.
Algorithm 3 Method A
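A compact PyTorch sketch of this detection loop is given below; it reuses the perceived_and_residual sketch from section 3.3.1, and the iteration count, threshold, sign convention, and the omission of the Total Variation regularization of the residual are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def method_a(model, x, n_iter=10, threshold=0.5):
    """Sketch of Method A: at each pixel combine the value of the residual
    image with the sign of the error gradient, add the result to the image for
    a few iterations, and flag the input as adversarial if the softmax
    confidence for the originally predicted category drops below `threshold`."""
    orig_label = model(x).argmax(dim=1)
    x_cur = x.clone()
    for _ in range(n_iter):
        _, residual = perceived_and_residual(model, x_cur)   # sketch from Sec. 3.3.1
        x_cur = x_cur.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_cur), orig_label)      # error w.r.t. predicted label
        grad, = torch.autograd.grad(loss, x_cur)
        with torch.no_grad():
            # value of the residual, sign of the gradient (cf. Eqs. 22-23)
            x_cur = x_cur + residual * grad.sign()
    conf = F.softmax(model(x_cur), dim=1)[0, orig_label]
    return bool(conf < threshold)                              # low confidence -> adversarial
```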

3.3.4 Method B

In order to make a clearer distinction between the parts of the pattern ignored and the parts perceived, this alternative method alters the way we compute the residual image. We denote this alternative residual image as .

Using Equation (21) we can find the projection of the input onto the null space of matrix as follows:

(24)

where the pseudoinverse of the matrix from Equation (21) is used.

With the alternative way of computing the residual image, we want the perceived image to be a clear depiction of a pattern that belongs to the category into which the input is classified. To achieve that, we want to change the output of the network and subsequently, using Equation (24), to change the residual image. We find that in the case of images that belong to the training set there is a clearer distinction between perceived and ignored images. Therefore we want to make the output of the network resemble the outputs of the images of the training set.

To achieve that, we take the outputs of the training set and, using the k-means algorithm, we find the centers of the clusters that these outputs create. Then, given the original output, we find the cluster center that is closest to it. The alternative residual image is defined by:

(25)

Hence, by replacing the residual image with this alternative residual image in the steps described in Method A, we get the alternative Method B, which is presented in Algorithm 4. This method improves the results of the detection, but produces an additive image that is still noisy, which means that after a certain number of iterations the results of the detection using this method start to get worse.

Input: Image vector, set with the centers of the clusters of the outputs produced by the images of the training set
1  Compute the category into which the image is classified by the CNN.
2  for a fixed number of iterations do
3      Compute the output of the network for the current input.
4      Find the cluster center that is closest to this output.
5      Compute the alternative residual image using Equation (25), compute the image to add and update the current input as in Method A.
6  end for
7  Compute the softmax output of the CNN for the final perturbed image.
8  If the softmax output for the original category is below the threshold, detect the input as an adversarial example.
Algorithm 4 Method B
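The sketch below illustrates the two extra ingredients of Method B: clustering the training outputs with k-means, and computing the alternative residual through the pseudoinverse of the local affine map. The number of clusters, the handling of the bias term, and the helper names are illustrative assumptions.

```python
import torch
from torch.autograd.functional import jacobian
from sklearn.cluster import KMeans

def training_output_centers(model, train_loader, n_clusters=20, device="cpu"):
    """Cluster the network outputs of the training images with k-means."""
    outputs = []
    with torch.no_grad():
        for images, _ in train_loader:
            outputs.append(model(images.to(device)).cpu())
    outputs = torch.cat(outputs).numpy()
    centers = KMeans(n_clusters=n_clusters).fit(outputs).cluster_centers_
    return torch.tensor(centers, dtype=torch.float32)

def residual_b(model, x, centers):
    """Alternative residual of Method B: replace the network output by the
    nearest training-output cluster center before projecting back through the
    pseudoinverse of the local affine map (cf. Equations 24-25)."""
    x_flat = x.reshape(-1)
    f = lambda v: model(v.reshape(x.shape)).reshape(-1)
    A = jacobian(f, x_flat)                                     # local affine matrix
    y = model(x).reshape(-1)
    center = centers[torch.cdist(y[None, :], centers).argmin()]  # nearest center
    return (x_flat - torch.linalg.pinv(A) @ center).reshape(x.shape)
```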

3.3.5 Method C

The iterative Methods A and B, presented in the previous sections, do not converge to a final image and after a certain number of iterations they start to diverge. This is illustrated in Figure 7, where, although at the first few iterations the image produced by Method B is classified into the correct category for both the real and the adversarial image, after a certain number of iterations the confidence of the network about the correct category starts to decrease.

To solve this problem, we change the way we compute both the residual image and the image that we add during the method. Consider an original input image, an image that is produced by the method from the original image, and the perceived image when the input is that produced image. The alternative images are defined by:

(26)
(27)

where is used to refer to the entrywise product of two vectors.

These equations emphasize the differences between and , at the points where the absolute values of the original image are high.

Another detail that improves the results of the detection is to confine the values of the input image to a certain range, by setting a minimum value and a maximum value. Each pixel of the image is then constrained as follows:

(28)

By replacing the residual image and the added image with their alternative versions in the steps described in Method A, and by constraining the pixel values according to Equation (28), we get the alternative Method C, which is presented in Algorithm 5. We can see how this method improves the convergence in Figure 8, where, in contrast to Method B, increasing the number of iterations does not lead to a decrease in the confidence for the correct category.

Input: Image vector, minimum value and maximum value allowed for the pixels
1  Compute the category into which the image is classified by the CNN.
2  for a fixed number of iterations do
3      Compute the residual image for the current input.
4      Using Equations (26), (27), compute the image to add.
5      Update the current input and constrain its pixel values using Equation (28).
6  end for
7  Compute the softmax output of the CNN for the final perturbed image.
8  If the softmax output for the original category is below the threshold, detect the input as an adversarial example.
Algorithm 5 Method C
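For completeness, a heavily hedged sketch of Method C is shown below: the detection loop is the same as in Method A, but the added image emphasizes the difference between the current image and the perceived image at pixels where the original image has large absolute values (an assumed reading of Equations 26-27), and the pixels are clipped after every update (Equation 28). Step sizes, the threshold and the exact form of the update are assumptions.

```python
import torch
import torch.nn.functional as F

def method_c(model, x, n_iter=20, threshold=0.5, v_min=0.0, v_max=1.0):
    """Sketch of Method C with per-iteration pixel clipping."""
    orig_label = model(x).argmax(dim=1)
    x_cur = x.clone()
    for _ in range(n_iter):
        perceived, _ = perceived_and_residual(model, x_cur)    # sketch from Sec. 3.3.1
        residual_c = x.abs() * (x_cur - perceived)             # assumed form of Eq. (26)
        x_cur = x_cur.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_cur), orig_label)
        grad, = torch.autograd.grad(loss, x_cur)
        with torch.no_grad():
            # assumed form of Eq. (27), followed by the clipping of Eq. (28)
            x_cur = (x_cur + residual_c * grad.sign()).clamp(v_min, v_max)
    conf = F.softmax(model(x_cur), dim=1)[0, orig_label]
    return bool(conf < threshold)
```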
Figure 7: Confidence of the classifier about the correct category when Method B is used: (a) for a real image, (b) for an adversarial image.
Figure 8: Confidence of the classifier about the correct category when Method C is used: (a) for a real image, (b) for an adversarial image.
Figure 9: Comparison between the visual results of Methods A,B,C: (a) Original input images, (b) Images produced after the termination of Method A, (c) Images produced after the termination of Method B, (d) Images produced after the termination of Method C

3.3.6 Experiments with the proposed methods that use the residual image

We use the three different methods that were presented in sections 3.3.3, 3.3.4, 3.3.5 in order to detect the adversarial images in two sets of images, where each set contains 2000 real images and 2000 adversarial images. The experiments use the LeNet network that is trained on the MNIST dataset and, similarly to sections 3.1.3, 3.2.1, the adversarial images for each set are generated using either the BIM method or the DeepFool method.

The results of the three methods using the residual image are presented in Table 3. They clearly show how the two alternative Methods B and C improve the overall results of the detection compared with the original Method A. Also, it is worth noting that all the methods achieve much higher Recall when they try to detect adversarial examples generated using the DeepFool method, compared to the Recall they achieve when they detect adversarial examples generated using the BIM method. This difference indicates that the adversarial examples generated with the DeepFool method are more sensitive to the perturbations applied by Methods A, B, C, and as a result it is easier to enhance the correct category when the input image is an adversarial example generated with this method.

One parameter that can greatly affect the results of these three methods is the threshold value presented in Algorithms 3, 4, 5. The detection results presented in Table 3 use the following threshold values:

  • Method A:

  • Method B:

  • Method C:

Adversarial Detection
BIM DeepFool
Precision Recall Precision Recall
Method A 68% 70.1% 73.5% 91.5%
Method B 84.6% 81.8% 86.7% 95.4%
Method C 87.6% 87.9% 88.4% 94.7%
Table 3: Adversarial detection results on the LeNet network using methods A,B,C.
Figure 10: Precision-Recall curves of adversarial detection using Methods A, B, C when the adversarial examples are generated using: (a) the BIM method, (b) the DeepFool method. (Best viewed in color)

By changing the threshold value, we can change the Precision and the Recall of the methods according to the needs of each application. Generally, when we increase the threshold value, the detection becomes more sensitive and as a result the Precision increases, but at the same time the Recall decreases. In contrast, when we decrease the threshold value we observe a decrease in Precision and an increase in Recall. Therefore the best threshold value for the adversarial detection depends on how much each application values precision over recall. Figures 10(a), 10(b) present the precision-recall curves for Methods A, B, C, where we can observe the relationship between Precision and Recall for different threshold values.

In section 4, where we combine all the proposed methods, we use the same threshold values that were used to generate the results shown in Table 3.

4 Combining the results from the proposed methods

Figure 11: Common errors between the proposed methods: histogram method (hist), regularization method (reg), residual image method (ign). (Best viewed in color)

In order to combine the results of the methods described in sections 3.1, 3.2, 3.3, it is useful to examine the common mistakes made by these methods. Hence we use the same set of 2000 real and 2000 adversarial images and we try to detect the adversarial examples using each one of these methods. For the regularization method the weights are computed using the cosine distance between the input images, according to Equation (16), for the histogram method the reinforcement step is used, and for the method utilizing the residual image Method C is used. The results are illustrated in Figure 11.

These results show that each one of the three methods has a percentage of mistakes which are unique to the method. Another useful observation is that although the histogram method has the lowest number of false positive mistakes, where real images are detected as adversarial images, the majority of these falsely detected images are not detected as adversarial images by the other two methods. In contrast the other two methods have a large number of common false positive mistakes.

Each of the three methods produces a boolean detection result, which takes the value 1 if the input image is detected as an adversarial image by that method; the final detection result is obtained by combining these three boolean results with logical AND and OR operations. Firstly, we identify an image as adversarial when it is detected by all three methods. In this case, we expect to achieve the highest Precision but also the lowest Recall. In contrast, if we identify an image as adversarial when it is detected by at least one method, then we will have the highest Recall with the lowest Precision. In addition, if we identify an image as adversarial when it is detected by at least two methods, then we have intermediate results for both the Precision and the Recall.

In addition, in order to achieve high Precision without substantially decreasing the Recall, we can use the observation that the majority of the False Positive errors of the histogram method are unique to that method. So we identify an image as adversarial when it is detected as adversarial by the histogram method and by at least one of the other two methods. The results of these combinations are shown in Table 4.
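A small sketch of these combination rules is given below; the boolean argument names and the rule labels are hypothetical and simply mirror the four combinations discussed above.

```python
def combine_detections(d_reg, d_hist, d_res, rule="hist_and_other"):
    """Combine the boolean outputs of the three detectors into one decision."""
    if rule == "all":                    # highest Precision, lowest Recall
        return d_reg and d_hist and d_res
    if rule == "any":                    # highest Recall, lowest Precision
        return d_reg or d_hist or d_res
    if rule == "majority":               # detected by at least two methods
        return (d_reg + d_hist + d_res) >= 2
    if rule == "hist_and_other":         # histogram method plus at least one other
        return d_hist and (d_reg or d_res)
    raise ValueError(rule)
```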

Result from the combination of the methods
                                            BIM                     DeepFool
                                            Precision   Recall      Precision   Recall
All three methods                           99.5%       60.1%       99.6%       77.2%
At least one method                         83.6%       99.4%       83.7%       99.7%
At least two methods                        95.5%       91.8%       96.2%       96.25%
Histogram and at least one other method     98.8%       89.6%       98.9%       92.7%
Table 4: Results of adversarial image detection on the LeNet network using different combinations of the proposed methods.

Finally, we can compute the detection results of the combination of the proposed methods when we allow the threshold value of the method that utilizes the residual image to change. If we use different threshold values and compute the detection results using the different combinations, we can create a ROC curve, which illustrates how we can change the sensitivity of the detection by changing both the threshold value and the way we combine the three proposed methods. The ROC curves, which show the detection results when we try to detect adversarial images produced by the BIM and the DeepFool method, are presented in Figure 12(a). In addition, the respective Precision-Recall curves are presented in Figure 12(b).

For the ROC curves we compute the Area Under Curve (AUC) both when we detect adversarial images that are generated using the BIM method and when we detect adversarial images that are generated using the DeepFool method. These results of adversarial image detection on the MNIST dataset compare favorably to the results of similar state-of-the-art detection methods [4], [20], which were briefly presented in section 2.3 and use a combination of metrics in order to detect adversarial images.

Figure 12: (a) ROC curves of adversarial image detection, (b) Precision-Recall curves, which are generated by using different combinations of the proposed methods and different threshold values for the method that utilizes the residual image. (Best viewed in color)

5 Conclusion

In this paper, we introduced three methods for detecting adversarial inputs in a CNN. Each one of these methods has some novelties and their combination yields an even more robust approach. The first method is based on adversarial retraining of the last layer of the network, and uses regularization of the input of the last layer to increase the effectiveness of the retraining. The second method uses the histograms of the values of the outputs of the hidden layers of the network in order to detect the adversarial inputs. Finally, in the third method we introduced the residual image, which gives us information about the parts of the input pattern that are ignored by the classifier. Using this information we perturb the input image in order to reinforce the correct category, something that allows us to detect the adversarial images which are not originally classified into the correct category.

After comparing the results of each individual method we showed how the combination of these methods improves the overall detection. This combination produces promising results and offers some improvements over similar approaches, when it is used for the detection of adversarial images on the MNIST dataset.

References