Analyzing Adversarial Robustness of Deep Neural Networks in Pixel Space: a Semantic Perspective

by   Lina Wang, et al.
Sichuan University
NetEase, Inc

The vulnerability of deep neural networks to adversarial examples, which are maliciously crafted by adding imperceptible perturbations to inputs so that the network produces incorrect outputs, reveals a lack of robustness and raises security concerns. Previous works study the adversarial robustness of image classifiers at the image level and use all the pixel information in an image indiscriminately, without exploring regions with different semantic meanings in the pixel space of an image. In this work, we fill this gap and explore the pixel space of the adversarial image by proposing an algorithm that looks for possible perturbations pixel by pixel in different regions of the segmented image. Extensive experimental results on CIFAR-10 and ImageNet verify that searching for the modified pixel among only some of the pixels of an image is enough to launch a successful one-pixel adversarial attack, without requiring all the pixels of the entire image, and that multiple vulnerable points are scattered across different regions of an image. We also demonstrate that the adversarial robustness of different regions of the image varies with the amount of semantic information they contain.


I Introduction

Deep Neural Networks (DNNs) have made great advances in recent years and have achieved impressive performance on many machine learning tasks, including computer vision [krizhevsky2009learning], speech recognition [vaswani2017attention], and natural language processing [kumar2016ask], to name a few. This has made DNNs a very popular technique that is extensively applied in many fields. However, with the increasing popularity of this technology, the security of DNNs has become a critical topic in both academia and industry. A series of recent studies have shown that DNNs are vulnerable to a number of attacks, the most severe and notable of which is the adversarial example attack [szegedy2013intriguing]. An adversarial example is generated by adding a small perturbation, maliciously crafted by the attacker, to a clean input, and the perturbation is usually imperceptible to humans. The purpose of an adversarial example is to make the DNN produce a wrong output. This threat hinders the application of deep neural networks in security-critical settings; for example, an adversary may use adversarial examples to circumvent malware detection [grosse2016adversarial], fool face recognition systems [sharif2016accessorize], or mislead autonomous vehicles [papernot2017practical].

Apart from the security risks, the existence of adversarial examples also presents challenges to our understanding of deep learning. It reveals the inconsistency between deep neural networks and human perception, as the adversarial examples that significantly alter the behavior of the networks are barely distinguishable from their corresponding clean inputs for human observers. Moreover, it questions the generalization ability of deep neural networks, as the capability of humans in recognizing unfamiliar objects is almost unaffected by tiny perturbations in inputs. Therefore, since the discovery of adversarial examples in DNNs, a body of work has been devoted to analyzing the phenomenon of adversarial examples and the adversarial robustness of deep neural networks, expecting to acquire a deeper understanding of the existing deep learning paradigm and further optimize the design of DNNs.

Previous work studies the adversarial robustness of deep neural networks at the sample level. For CNN-based image classifiers, which are currently the main focus and key use case for studying the robustness of DNNs owing to their popularity and security implications, the granularity of observation is an image. Some studies [gu2014towards], [goodfellow2014explaining] suggest that adversarial examples occupy large and contiguous regions in the sample space, but as far as we know, no research has studied adversarial robustness in the pixel space, that is, observed adversarial robustness at the pixel level rather than at the sample level. The advantage of humans over DNNs in robustness against adversarial examples is believed to be due to a gap in semantic generalization ability [hosseini2017limitation]. At the same time, different parts of an image carry different amounts of semantic information. Accordingly, it is of great significance to analyze adversarial robustness in the pixel space of an image at the granularity of a pixel.

In this paper, we fully explore the pixel space of adversarial examples by modifying only one pixel at a time on the segmented image to generate adversarial examples (see Fig. 1 and Fig. 2). To strictly confine the adversarial perturbation to a certain region of the pixel space, our research follows the framework of few-pixel attacks [papernot2016limitations], [narodytska2017simple], [su2019one], and only one pixel is modified at a time. The main difference between our work and previous work on generating few-pixel adversarial perturbations is that our algorithm searches for the perturbation only in a region composed of some of the pixels of an image. This corresponds to an attack scenario that is harder but more realistic for the adversary, who may have access to only a part of an input image rather than the entire one. In addition, we compare the adversarial robustness of deep neural networks across regions of the image that carry different amounts of semantic information. To the best of our knowledge, this paper is the first to study the adversarial robustness of DNNs in pixel space with semantic information.

Fig. 1: CIFAR-10 original images (left column) and the adversarial examples generated for their foreground (middle column) and background (right column) against deep neural networks of three different architectures. The original class labels are in black, while the classes predicted after attacks on the foreground and the background are in blue if they are the same and in red if they are different. The corresponding probability is given in parentheses.

The contributions of this paper can be summarized as follows:

  • We show that, to launch a successful black-box one-pixel adversarial attack, it is sufficient to search for the modified pixel among some of the pixels of an image rather than among all the pixels of the entire image. Modifying only one pixel in either the foreground or the background region of an image can change the output label of the deep neural network. Moreover, whether in the foreground or the background of an image, the vulnerable points are not unique, and they may cause the network to misclassify the perturbed input into different target classes.

  • We propose an algorithm that attacks deep neural networks while limiting the region of the image where the adversarial perturbation appears. The algorithm can achieve targeted or untargeted adversarial attacks in the black-box scenario. It optimizes the individual modified pixel over the discontinuous intervals that correspond to the foreground or background after the image is segmented.

  • We perform extensive experimental evaluations on a small dataset, CIFAR-10 [krizhevsky2009learning], and a large-scale dataset, ImageNet [deng2009imagenet], to validate our algorithm and compare the adversarial robustness of different regions in the pixel space. We find that the adversarial robustness of a region of the image is related to the amount of semantic information the region contains. The foreground region, which carries the important semantic information of the image, is far more vulnerable to adversarial perturbations than the background region, which carries almost no semantic meaning.

The rest of this paper is organized as follows. In Section II we review the main recent works on generating and analyzing adversarial examples. In Section III we present the proposed method. Experimental details, results, and their analysis are given in Section IV. Section V briefly concludes the paper.

Fig. 2: Illustration of the adversarial images generated by one-pixel attacks on the foreground and background of images from the ImageNet dataset, where the modified pixels are highlighted with red circles. The attacked class labels and their corresponding confidence are in blue, while the original class labels and confidence are in black below. Each row shows eight images, labeled (ah-1/2/3), that were generated by attacks on one of the three networks: AlexNet, ResNet-50 and VGG-16 (I).

II Related Work

Since this paper proposes a method to generate adversarial examples on a part of the image rather than the entire image, and analyzes the difference in adversarial robustness between the foreground and the background of images, which contain different degrees of semantic information, we summarize the work related to generating adversarial examples and to analyzing adversarial vulnerability as follows.

Generating adversarial examples. Since Szegedy et al. [szegedy2013intriguing] first revealed the vulnerability of deep neural networks, a lot of research on generating adversarial examples for the task of image classification has emerged. According to the amount of the knowledge the adversary has about the attacked model, adversarial attacks can be divided into two types: white-box attack and black-box attack.

In the white-box scenario, the adversary knows some information about the model, such as the model type, model architecture, parameter values, or trainable weights, or has access to summary, partial, or full training data. As the earliest method of generating adversarial examples, [szegedy2013intriguing] used box-constrained L-BFGS optimization to find the smallest possible adversarial perturbation. Goodfellow et al. [goodfellow2014explaining] proposed a more efficient method named the Fast Gradient Sign Method (FGSM), which approximates the loss function of the model linearly. Kurakin et al. [kurakin2016adversarial] extended FGSM by proposing a ‘one-step target class’ variation, which uses the least likely class predicted by the network instead of the true label, and a ‘Fast Gradient $L_2$’ variation, which uses the $L_2$ norm instead of the $L_\infty$ norm for normalization. Unlike these one-step methods, which take a single step in the direction that increases the loss of the network, the Basic Iterative Method (BIM) [kurakin2016physical] computes the perturbation iteratively by running FGSM for several steps with a smaller step size, trading off success rate against computational cost. Papernot et al. [papernot2016limitations] computed a saliency map to choose which pixels to modify; they used the $L_0$ norm to restrict the perturbations and altered only a few pixels of the image instead of the whole image. Carlini and Wagner (CW) [carlini2017towards] introduced adversarial attacks restricted under the $L_0$, $L_2$ and $L_\infty$ norms and claimed that their attacks make defensive distillation fail. Moosavi-Dezfooli et al. proposed the DeepFool algorithm, which approximates the region of the space where the network outputs the true label by a polyhedron and iteratively computes the perturbation vector that reaches the boundary of the polyhedron. They also provided experimental evidence that their method computes adversarial examples more reliably and efficiently than FGSM. Moosavi-Dezfooli et al. [moosavi2017universal] also created a method based on DeepFool to compute image-agnostic ‘universal adversarial perturbations’. Wang et al. [wang2021improving] proposed the TUP method to compute targeted universal adversarial perturbations that can extract semantic information missed by the classifiers, and they found that adversarial training based on TUP greatly improves the robustness of deep neural networks. Baluja and Fischer [baluja2018learning] used feed-forward neural networks named Adversarial Transformation Networks (ATNs) to generate adversarial examples.

In the black-box scenario, the adversary is assumed to have no knowledge of the model parameters and may or may not be able to query the model. Many black-box attacks utilize the ‘transferability’ property of adversarial examples, which means that adversarial examples generated for a specific model can generalize to other models with different architectures. Papernot et al. [papernot2017practical] proposed a practical black-box attack that uses a local substitute network trained on inputs synthesized by the adversary and labels generated by the target model. Building on this method, Papernot et al. [papernot2016transferability] also trained substitute models with both deep learning and other machine learning techniques, and explored transferability across the machine learning space. Liu et al. [DBLP:conf/iclr/LiuCLS17] studied adversarial transferability over large models and a large-scale dataset; their method uses an ensemble of several models to generate adversarial examples in the complete black-box scenario, where the adversary cannot query the model. When the adversary can observe the outputs of the attacked network, it becomes possible to compute adversarial perturbations using optimization algorithms that do not need to calculate gradients. Brendel et al. [brendel2018decisionbased] introduced a decision-based black-box attack named Boundary Attack, which does not rely on substitute models but uses only the final decision of the model. The Boundary Attack starts from a known adversarial point and performs a random walk along the boundary between the adversarial and non-adversarial regions, staying in the adversarial region while reducing the distance to the target point. Chen et al. proposed Zeroth Order Optimization (ZOO) [chen2017zoo] to estimate the gradients and then calculate the adversarial perturbations. Su et al. [su2019one] used differential evolution to perform one-pixel adversarial attacks. The method proposed in this paper is also a black-box attack, and we also launch the attacks in the extremely limited scenario where only one pixel in the image may be modified. However, our approach generates adversarial examples based on image segmentation and searches for adversarial perturbations only in a part of the image, rather than over the entire image as in all the above studies (whether white-box or black-box).

Analyses of adversarial vulnerability. Since the phenomenon of adversarial examples unveils an inherent weakness of the modern deep learning paradigm, the analysis of adversarial examples has become a highly active research direction. However, these analyses have so far not reached a broad consensus on many issues, especially on why adversarial examples exist. Some viewpoints strongly deviate from each other, and some do not align with each other perfectly. Initially, the existence of adversarial examples was attributed to the extreme nonlinearity of the decision boundaries induced by deep neural networks [szegedy2013intriguing]. Fawzi et al. [fawzi2015fundamental] introduced a distinguishability measure to describe the difficulty of the classification task and attributed the phenomenon of adversarial examples to the low flexibility of classifiers in comparison with that difficulty; their results are not consistent with the initial belief about the high nonlinearity of deep neural networks. A popular linearity hypothesis [goodfellow2014explaining], which contradicts the earlier idea, then interpreted the existence of adversarial examples as a result of deep neural models being too linear, a consequence of designing them to be easy to optimize. Krotov and Hopfield [krotov2018dense] studied adversarial examples within the framework of Dense Associative Memory (DAM) models, and their results are in line with the linearity hypothesis. Tanay and Griffin [tanay2016boundary] argued, from the perspective of boundary tilting, that adversarial examples exist when a submanifold of the sampled data lies close to the classification boundary, and they considered the linear explanation of adversarial examples unconvincing. Fawzi et al. [fawzi2016robustness] made a quantitative study of the curvature of the decision boundary and related the robustness of classifiers to that curvature. Rozsa et al. [rozsa2018towards] presented the evolutionary stalling theory, arguing that correctly classified samples can end up stuck close to the decision boundaries because their contributions to the loss decrease; hence, these samples are susceptible to adversarial perturbations. While proposing a method for generating universal adversarial perturbations, Moosavi-Dezfooli et al. [moosavi2017universal] also offered a conjecture about why universal adversarial perturbations exist, suggesting that they partly exploit geometric correlations between the decision boundaries induced by deep neural networks. They further refined this theory in [moosavi2018robustness] and showed that deep networks are vulnerable to universal adversarial attacks when there exists a shared subspace along which the decision boundary is positively curved. Rozsa et al. [rozsa2016accuracy] evaluated the robustness of different deep neural networks against various adversarial example generation approaches and found that networks with higher classification accuracy are more difficult to attack. Cubuk et al. [cubuk2017intriguing] also argued that adversarial accuracy is strongly correlated with clean accuracy. As far as we know, most research on analyzing adversarial robustness focuses on the sample as the unit of analysis, except for [tabacof2016exploring], which claims to probe the pixel space of adversarial images. However, that approach adds varying levels of noise to the image and observes the changes in the assigned labels; although it partially and roughly probes the pixel space, it is still an image-level analysis rather than a precise pixel-level analysis. In contrast, our work directly explores the pixel space by comparing the adversarial robustness against adversarial perturbations generated at different pixels of the image.

III Methodology

III-A Problem description

The problem of generating adversarial examples can be formalized as a constrained optimization problem of finding a minimum adversarial perturbation vector. Let $f$ be a classifier with $n$-dimensional input $\mathbf{x}$ and output $f(\mathbf{x})$. Each component of $f(\mathbf{x})$ represents the probability that the input belongs to one of the $K$ classes, so $f_k(\mathbf{x})$ denotes the probability that $\mathbf{x}$ belongs to class $k$. The classifier assigns the label $C(\mathbf{x}) = \arg\max_k f_k(\mathbf{x})$ to the input $\mathbf{x}$, which means that the classifier believes that $\mathbf{x}$ is most likely to belong to class $C(\mathbf{x})$. The adversary can launch two types of attack, untargeted and targeted. In an untargeted attack, the adversarial goal is to cause the classifier to assign any incorrect label to the adversarial example $\mathbf{x} + \boldsymbol{\delta}$ with a minimal perturbation $\boldsymbol{\delta}$. This is equivalent to solving the following problem:

$$\min_{\boldsymbol{\delta}} \; D(\mathbf{x}, \mathbf{x} + \boldsymbol{\delta}) \quad \text{s.t.} \quad C(\mathbf{x} + \boldsymbol{\delta}) \neq C(\mathbf{x}) \tag{1}$$

where $D$ is some distance metric that measures the similarity between $\mathbf{x}$ and $\mathbf{x} + \boldsymbol{\delta}$. For a targeted attack, whose purpose is to make the classifier incorrectly classify the adversarial input as some specific target class $t$, the corresponding optimization problem can be formalized as follows:

$$\min_{\boldsymbol{\delta}} \; D(\mathbf{x}, \mathbf{x} + \boldsymbol{\delta}) \quad \text{s.t.} \quad C(\mathbf{x} + \boldsymbol{\delta}) = t \tag{2}$$

The most widely used distance metrics for quantifying similarity when generating adversarial examples are the $L_0$, $L_2$ and $L_\infty$ norms. In this paper, we use the $L_0$ norm, as does all the literature on few-pixel perturbations. More specifically, we modify only one pixel and perform three-pixel and five-pixel attacks as a comparison. The reason is that, to compare the effect of adversarial attacks in different areas of the image accurately, we have to strictly limit the location of the perturbation. Besides, there are many possible ways to express equations (1) and (2) in a form that is more suitable for optimization. We therefore apply an alternative formulation that is commonly used and more intuitive for the untargeted attack:

$$\min_{\boldsymbol{\delta}} \; f_c(\mathbf{x} + \boldsymbol{\delta}) \quad \text{s.t.} \quad \|\boldsymbol{\delta}\|_0 \le d \tag{3}$$

where $c$ represents the true class predicted by the classifier before the attack, and $d$ constrains the strength of the modification; for the $L_0$ norm it denotes the number of pixels that are allowed to be modified. In the case of a targeted attack, with $t$ denoting the target class that the adversary wants the classifier to output, the problem becomes:

$$\max_{\boldsymbol{\delta}} \; f_t(\mathbf{x} + \boldsymbol{\delta}) \quad \text{s.t.} \quad \|\boldsymbol{\delta}\|_0 \le d \tag{4}$$

Previous works commonly search for the direction and size of the modification over the entire input space, whereas our approach partitions the input space according to semantic information before generating adversarial perturbations. This means that the perturbation vector $\boldsymbol{\delta}$ modifies only $d$ ($d < n$) dimensions of the input, with the other dimensions of $\boldsymbol{\delta}$ left at zero.
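As a rough illustration of the one-pixel case of the formulation above, the following Python sketch evaluates the untargeted and targeted objectives for a single modified pixel. The names `apply_one_pixel`, `l0_distance`, `untargeted_objective`, `targeted_objective` and the two-class `toy_classifier` are hypothetical and introduced only for this example; the toy classifier is a stand-in and has nothing to do with the networks studied in the paper.

```python
from typing import Callable, List, Sequence, Tuple

Pixel = Tuple[float, float, float]
Image = List[List[Pixel]]                      # image[y][x] = (r, g, b)
Classifier = Callable[[Image], Sequence[float]]  # returns class probabilities


def apply_one_pixel(image: Image, x: int, y: int, rgb: Pixel) -> Image:
    """Return a copy of `image` with the pixel at (x, y) replaced by `rgb`."""
    perturbed = [row[:] for row in image]
    perturbed[y][x] = rgb
    return perturbed


def l0_distance(a: Image, b: Image) -> int:
    """Count the pixels at which two equally sized images differ."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def untargeted_objective(f: Classifier, image: Image, true_class: int,
                         x: int, y: int, rgb: Pixel) -> float:
    """Probability of the true class after the perturbation (to be minimized)."""
    return f(apply_one_pixel(image, x, y, rgb))[true_class]


def targeted_objective(f: Classifier, image: Image, target_class: int,
                       x: int, y: int, rgb: Pixel) -> float:
    """Negated probability of the target class, so that it can also be minimized."""
    return -f(apply_one_pixel(image, x, y, rgb))[target_class]


def toy_classifier(image: Image) -> List[float]:
    """Hypothetical two-class stand-in: P(class 1) is the mean red value."""
    reds = [p[0] for row in image for p in row]
    p1 = sum(reds) / len(reds)
    return [1.0 - p1, p1]


clean = [[(0.0, 0.0, 0.0) for _ in range(2)] for _ in range(2)]
adv = apply_one_pixel(clean, 1, 0, (1.0, 0.0, 0.0))
```

Negating the targeted objective lets a single minimizer handle both attack types, which is a design convenience of this sketch rather than something the text prescribes.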

III-B Image segmentation

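GrabCut [rother2004grabcut] is a classic interactive image segmentation method based on Graph Cut [boykov2001interactive]. GrabCut combines texture (colour) information with edge (contrast) information, which reduces the user interaction required to complete the segmentation while still extracting a foreground of good quality. The method first obtains a hard segmentation using iterative graph-cut optimisation and then uses border matting to deal with transparency around the object boundaries. Specifically, GrabCut models the interactive segmentation task as a graph-cut optimization problem, as other approaches based on graphical models do. Let $\boldsymbol{\alpha} = (\alpha_1, \dots, \alpha_N)$ be an array of opacity values, one per pixel, that expresses the segmentation of the image (ordinarily $0 \le \alpha_n \le 1$, or $\alpha_n \in \{0, 1\}$ for hard segmentation), and let $\boldsymbol{\theta}$ be the parameters that describe the image foreground and background distributions. The segmentation problem is then equivalent to inferring the unknown opacity variables $\boldsymbol{\alpha}$ from the given image data $\mathbf{z}$ and $\boldsymbol{\theta}$. The opacity values can be estimated as a global minimum of the following energy function, which is defined so that its minimum corresponds to a good segmentation:

$$E(\boldsymbol{\alpha}, \boldsymbol{\theta}, \mathbf{z}) = U(\boldsymbol{\alpha}, \boldsymbol{\theta}, \mathbf{z}) + V(\boldsymbol{\alpha}, \mathbf{z}) \tag{5}$$

To handle colour images, GrabCut uses colour Gaussian Mixture Models (GMMs) in the data term $U$, as

$$U(\boldsymbol{\alpha}, \mathbf{k}, \boldsymbol{\theta}, \mathbf{z}) = \sum_n D(\alpha_n, k_n, \boldsymbol{\theta}, z_n), \qquad D(\alpha_n, k_n, \boldsymbol{\theta}, z_n) = -\log p(z_n \mid \alpha_n, k_n, \boldsymbol{\theta}) - \log \pi(\alpha_n, k_n) \tag{6}$$

where $\mathbf{k}$ is a vector of GMM component variables that assigns a unique GMM component to each pixel, from either the foreground or the background model. Let $p(\cdot)$ be a Gaussian probability distribution and $\pi(\cdot)$ be the mixture weighting coefficients, so that equation (6) can be rewritten (up to a constant) as

$$D(\alpha_n, k_n, \boldsymbol{\theta}, z_n) = -\log \pi(\alpha_n, k_n) + \tfrac{1}{2} \log \det \Sigma(\alpha_n, k_n) + \tfrac{1}{2} \left[z_n - \mu(\alpha_n, k_n)\right]^{\top} \Sigma(\alpha_n, k_n)^{-1} \left[z_n - \mu(\alpha_n, k_n)\right] \tag{7}$$

where $\mu$ and $\Sigma$ are the means and covariances of the GMM components. $V$ is the smoothness term, and its contrast term is calculated from the Euclidean distance in colour space, as

$$V(\boldsymbol{\alpha}, \mathbf{z}) = \gamma \sum_{(m,n) \in \mathcal{N}} [\alpha_n \neq \alpha_m] \exp\left(-\beta \|z_m - z_n\|^2\right) \tag{8}$$

where $\mathcal{N}$ is the set of pairs of neighbouring pixels and $\gamma$ and $\beta$ are constants.

The energy function is then minimized iteratively to obtain a good hard segmentation. After that, a closed contour $\mathcal{C}$ is obtained by fitting a polyline to the hard segmentation boundary. The opacity $\boldsymbol{\alpha}$ is then computed on a new trimap $\{T_B, T_U, T_F\}$, where $T_U$ is the set of pixels located in a ribbon of width $\pm w$ pixels on either side of $\mathcal{C}$ ($w$ is an empirically chosen parameter). This process is called border matting, and GrabCut performs it with regularisation and a dynamic programming (DP) algorithm. Foreground pixel stealing is also included to better estimate the foreground pixel colours.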
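To make the role of the smoothness term concrete, the sketch below evaluates a contrast-sensitive penalty of the form $\gamma \sum [\alpha_n \neq \alpha_m] \exp(-\beta \|z_m - z_n\|^2)$ on a tiny image. It is a simplified illustration rather than the GrabCut implementation: it assumes a 4-connected neighbourhood, sets $\beta$ to the inverse of twice the mean squared colour difference, and uses $\gamma = 50$ as a default. The function names are hypothetical.

```python
import math
from typing import Iterator, List, Tuple

Colour = Tuple[float, float, float]
Coord = Tuple[int, int]  # (row, column)


def sq_dist(a: Colour, b: Colour) -> float:
    """Squared Euclidean distance between two colours."""
    return sum((ca - cb) ** 2 for ca, cb in zip(a, b))


def neighbour_pairs(height: int, width: int) -> Iterator[Tuple[Coord, Coord]]:
    """Yield horizontally and vertically adjacent pixel pairs (4-neighbourhood)."""
    for r in range(height):
        for c in range(width):
            if c + 1 < width:
                yield (r, c), (r, c + 1)
            if r + 1 < height:
                yield (r, c), (r + 1, c)


def smoothness(alpha: List[List[int]], z: List[List[Colour]],
               gamma: float = 50.0) -> float:
    """Penalty summed over neighbouring pixels that carry different labels."""
    pairs = list(neighbour_pairs(len(z), len(z[0])))
    dists = [sq_dist(z[m[0]][m[1]], z[n[0]][n[1]]) for m, n in pairs]
    mean_sq = sum(dists) / len(dists)
    # Assumption: beta is the inverse of twice the mean squared colour difference.
    beta = 1.0 / (2.0 * mean_sq) if mean_sq > 0 else 0.0
    total = 0.0
    for (m, n), d in zip(pairs, dists):
        if alpha[m[0]][m[1]] != alpha[n[0]][n[1]]:
            total += math.exp(-beta * d)
    return gamma * total


black, white = (0.0, 0.0, 0.0), (1.0, 1.0, 1.0)
z = [[black, white], [black, white]]   # vertical colour edge
along_edge = [[0, 1], [0, 1]]          # label boundary follows the colour edge
across_edge = [[0, 0], [1, 1]]         # label boundary cuts through uniform colour
```

In this toy image, a labelling whose boundary follows the colour edge is penalized less than one that cuts through a uniform region, which is the behaviour the contrast term is meant to encourage.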

U²-Net [qin2020u2] is a deep learning based method for segmenting the most visually attractive objects in an image. It belongs to Salient Object Detection (SOD), which is widely used in image segmentation. U²-Net uses a two-level nested U-structure architecture designed for SOD, which allows high-resolution feature maps to be maintained at a low memory and computing cost. Instead of stacking U-Net-like modules in cascade, it nests them, and it introduces a novel Residual U-block (RSU) for capturing intra-stage multi-scale features. The method is therefore flexible and easy to adapt to different tasks with little performance loss. The three main components of U²-Net are a six-stage encoder, a five-stage decoder, and a saliency map fusion module attached to the decoder stages and the last encoder stage. In addition, U²-Net uses deep supervision in the training process, with the following training loss:

$$\mathcal{L} = \sum_{m=1}^{M} w_{side}^{(m)} \, \ell_{side}^{(m)} + w_{fuse} \, \ell_{fuse} \tag{9}$$

Here $\ell_{side}^{(m)}$ and $\ell_{fuse}$ are the two kinds of loss term: $\ell_{side}^{(m)}$ is the loss of the $m$-th side output saliency map and $\ell_{fuse}$ is the loss of the final fusion output saliency map. $w_{side}^{(m)}$ and $w_{fuse}$ are the weights of $\ell_{side}^{(m)}$ and $\ell_{fuse}$ respectively. Each loss term is calculated using the standard binary cross-entropy.
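The deep-supervision loss described above can be sketched in a few lines of Python. This is an illustrative reading of the weighted sum, not the authors' training code: saliency maps are represented as flat lists of probabilities, the binary cross-entropy is summed over pixels, and all weights default to 1, which is an assumption of this example. The names `bce` and `deep_supervision_loss` are hypothetical.

```python
import math
from typing import List, Optional, Sequence


def bce(pred: Sequence[float], target: Sequence[float], eps: float = 1e-12) -> float:
    """Binary cross-entropy summed over the pixels of a flattened saliency map."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total


def deep_supervision_loss(side_maps: Sequence[Sequence[float]],
                          fused_map: Sequence[float],
                          target: Sequence[float],
                          side_weights: Optional[List[float]] = None,
                          fuse_weight: float = 1.0) -> float:
    """Weighted sum of the side-output losses plus the fused-output loss."""
    if side_weights is None:
        side_weights = [1.0] * len(side_maps)  # assumption: equal weights
    side = sum(w * bce(s, target) for w, s in zip(side_weights, side_maps))
    return side + fuse_weight * bce(fused_map, target)


target = [1.0, 0.0]
uncertain = [0.5, 0.5]
loss = deep_supervision_loss([uncertain, uncertain], uncertain, target)
```

With two side outputs and one fused output that all predict 0.5 for a two-pixel target, each of the three terms contributes $2 \ln 2$ to the total.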

In this paper, GrabCut is used to segment the foreground and the background of CIFAR-10 images, and U²-Net is used for ImageNet. U²-Net works in a non-interactive way and can segment the ImageNet images very efficiently. The CIFAR-10 images, on the other hand, are low-resolution, and their foreground and background colour distributions are not well separated. The distribution of these images is very different from that of the images used to train deep learning based segmentation models, which makes deep learning based methods (such as U²-Net) perform very poorly on CIFAR-10. Among the existing classical methods, GrabCut obtains good-quality segmentations on CIFAR-10 with relatively little user effort.

III-C Optimization

When the problem of generating adversarial examples is viewed as an optimization problem, there are several algorithms to choose from, and different methods yield different solutions. Since gradient information becomes more difficult to use when the adversarial perturbation is calculated on a part of the image (foreground or background) rather than the whole, a method that does not use gradient information is a reasonable option. Differential Evolution (DE) is one such approach: a powerful derivative-free algorithm for black-box optimization, that is, for finding the minimum of a function without knowing its analytical form. DE does not require the objective function to be differentiable, and its population selection mechanism preserves diversity, so it may find better solutions more efficiently than gradient-based methods or even other kinds of evolutionary algorithms. These properties allow DE to solve complex multi-modal optimization problems and make it well suited to generating adversarial images, where the networks are sometimes highly complex and non-differentiable, or where calculating gradients is hardly realistic. In addition, DE was also used to generate one-pixel adversarial perturbations for the whole image in [su2019one], which gives a fairer comparison with the results of our attacks that modify one pixel in a part of the image.

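When minimizing a function $\phi$, DE first represents the solutions as a population of $NP$ individuals and initializes them, where each individual is a parameter vector of $\phi$. For each generation $G$, the population can be represented as

$$\mathbf{x}_{i,G}, \quad i = 1, 2, \dots, NP \tag{10}$$

Then an operation called mutation is used to generate new parameter vectors. A mutant vector $\mathbf{v}_{i,G+1}$ is generated by adding the weighted difference between two different population vectors $\mathbf{x}_{r_2,G}$ and $\mathbf{x}_{r_3,G}$ to a third vector $\mathbf{x}_{r_1,G}$, as

$$\mathbf{v}_{i,G+1} = \mathbf{x}_{r_1,G} + F \left(\mathbf{x}_{r_2,G} - \mathbf{x}_{r_3,G}\right) \tag{11}$$

where $F$ is called the mutation factor, and the mutually distinct indices $r_1$, $r_2$ and $r_3$ are randomly chosen. The next step is called recombination, which aims to increase the diversity of the population. In this step, DE creates a trial vector $\mathbf{u}_{i,G+1}$ by mixing the information of the mutant $\mathbf{v}_{i,G+1}$ with that of the current population vector $\mathbf{x}_{i,G}$, as

$$\mathbf{u}_{i,G+1} = \left(u_{1i,G+1}, u_{2i,G+1}, \dots, u_{Di,G+1}\right) \tag{12}$$

where $D$ represents the dimension of the population vector, and each component of $\mathbf{u}_{i,G+1}$ is decided by a method called crossover as follows:

$$u_{ji,G+1} = \begin{cases} v_{ji,G+1} & \text{if } randb(j) \le CR \text{ or } j = rnbr(i) \\ x_{ji,G} & \text{otherwise} \end{cases} \tag{13}$$

In (13), $CR$ is called the crossover constant, $j = 1, 2, \dots, D$, $randb(j)$ is the $j$th evaluation of a random number generator (which follows different distributions for different crossover variations), and $rnbr(i)$ is a randomly chosen index that ensures the trial vector takes at least one component from the mutant. Finally, DE compares the trial vector $\mathbf{u}_{i,G+1}$ with the current population vector $\mathbf{x}_{i,G}$. If $\mathbf{u}_{i,G+1}$ is better as measured by the fitness function $\phi$, $\mathbf{x}_{i,G}$ is replaced by $\mathbf{u}_{i,G+1}$; otherwise, $\mathbf{x}_{i,G}$ is preserved and $\mathbf{u}_{i,G+1}$ is discarded. The whole population eventually converges towards the solution after many iterations.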
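The mutation, crossover and selection steps described above can be put together in a short Python sketch. This is a minimal, generic version of DE with uniform crossover, not the configuration used in the experiments: the population size, mutation factor, crossover constant, number of generations, the clipping of trial vectors to the search bounds, and the sphere test function are all assumptions chosen for illustration.

```python
import random
from typing import Callable, List, Sequence, Tuple


def differential_evolution(
    fitness: Callable[[List[float]], float],
    bounds: Sequence[Tuple[float, float]],
    pop_size: int = 20,
    F: float = 0.5,
    CR: float = 0.9,
    generations: int = 200,
    seed: int = 0,
) -> Tuple[List[float], float]:
    """Minimize `fitness` over the box `bounds`; return the best vector and value."""
    rng = random.Random(seed)
    dim = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    scores = [fitness(ind) for ind in pop]
    for _ in range(generations):
        next_pop, next_scores = list(pop), list(scores)
        for i in range(pop_size):
            # Mutation: weighted difference of two vectors added to a third.
            r1, r2, r3 = rng.sample([j for j in range(pop_size) if j != i], 3)
            mutant = [pop[r1][j] + F * (pop[r2][j] - pop[r3][j]) for j in range(dim)]
            # Crossover: one forced index guarantees a component from the mutant.
            forced = rng.randrange(dim)
            trial = [
                mutant[j] if (rng.random() <= CR or j == forced) else pop[i][j]
                for j in range(dim)
            ]
            # Assumption of this sketch: clip the trial vector to the bounds.
            trial = [min(max(v, lo), hi) for v, (lo, hi) in zip(trial, bounds)]
            # Selection: keep the trial only if it is at least as good.
            trial_score = fitness(trial)
            if trial_score <= scores[i]:
                next_pop[i], next_scores[i] = trial, trial_score
        pop, scores = next_pop, next_scores
    best = min(range(pop_size), key=lambda i: scores[i])
    return pop[best], scores[best]


def sphere(v: List[float]) -> float:
    """Simple convex test function with its minimum at the origin."""
    return sum(c * c for c in v)


best_vector, best_value = differential_evolution(sphere, [(-5.0, 5.0)] * 3)
```

Because selection only ever replaces an individual with a trial that is at least as good, the best fitness in the population never gets worse from one generation to the next.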

III-D Method

Let $P$ be the set of pixels of an entire image $I$. We first divide $P$ into two sets $P_f$ and $P_b$ that satisfy

$$P_f \cup P_b = P, \qquad P_f \cap P_b = \emptyset \tag{14}$$

where $P_f$ consists of the pixels that belong to the foreground of the image and $P_b$ is the set of pixels in the background. When the adversarial perturbation is optimized by DE, each candidate solution is encoded as a tuple of five elements, $(x, y, r, g, b)$, where $(x, y)$ denotes the coordinates of the modified pixel and $(r, g, b)$ is the RGB value of the perturbation. In the standard DE algorithm, the individuals in the population are coded as continuous real numbers, since the algorithm was originally designed to solve optimization problems for continuous functions. Each of these five elements can be regarded as taking values in a continuous interval when searching for the perturbation over the entire image. However, when the perturbation is optimized in the space corresponding to the foreground or the background pixels, the interval of the coordinates is no longer continuous. Some methods encode individuals as discrete sequences of length


and generate mutant individuals by randomly inserting and moving the sequences. The method proposed in this paper differs from these in that:


This means that the coordinates in the foreground are neither discrete nor take values on a single continuous interval, but take values on multiple intervals (the same is true for the background). For simplicity, we solve this problem by making the optimization algorithm automatically constrain the position of the coordinates, instead of modifying the DE algorithm itself. Specifically, when we attack an image $I$ on its foreground, we first execute the DE algorithm for the first generation on the entire image to get the trial vector $\mathbf{u}_{i,G+1}$ and the current population vector $\mathbf{x}_{i,G}$ of Section III-C. We then use $\mathbf{u}_{i,G+1}$ and $\mathbf{x}_{i,G}$ respectively to modify the image $I$ and obtain the modified images $I_u$ and $I_x$, just as in the attack on the entire image. The difference is that, when we compare the vectors $\mathbf{u}_{i,G+1}$ and $\mathbf{x}_{i,G}$, the fitness function is computed on a new image $\tilde{I}$, whose set of pixels $\tilde{P}$ is composed as follows:

$$\tilde{P} = P_f(I') \cup P_b(I) \tag{17}$$

where $I'$ represents the modified image $I_u$ or $I_x$, and $P_f(\cdot)$ and $P_b(\cdot)$ denote the foreground and background pixels of the given image. Similarly, when attacking the background,

$$\tilde{P} = P_b(I') \cup P_f(I) \tag{18}$$

Thus, at the replacement stage of each generation, we evaluate the vectors with this fitness function. This ensures that the adversarial perturbation lies in the segmented foreground or background and forces the DE algorithm to find the optimal solution in the corresponding area.
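One way to read the region-restricted fitness evaluation described above is sketched below: the candidate modifies a copy of the image, the fitness is then computed on a composite that takes pixels inside the attacked region from the modified copy and all other pixels from the original, so a perturbation that falls outside the region leaves the fitness unchanged. This is an interpretation under stated assumptions, not the authors' code; `composite`, `region_fitness` and the two-class `toy_classifier` are hypothetical names, and the region is given as a set of `(x, y)` coordinates.

```python
from typing import Callable, List, Sequence, Set, Tuple

Pixel = Tuple[float, float, float]
Image = List[List[Pixel]]   # image[y][x] = (r, g, b)
Coord = Tuple[int, int]     # (x, y)
Candidate = Tuple[float, float, float, float, float]  # (x, y, r, g, b)


def composite(original: Image, modified: Image, region: Set[Coord]) -> Image:
    """Take pixels inside `region` from `modified` and all others from `original`."""
    return [
        [modified[y][x] if (x, y) in region else original[y][x]
         for x in range(len(row))]
        for y, row in enumerate(original)
    ]


def region_fitness(f: Callable[[Image], Sequence[float]], original: Image,
                   candidate: Candidate, region: Set[Coord],
                   true_class: int) -> float:
    """True-class probability on the composite image (lower is better)."""
    x, y, r, g, b = candidate
    xi, yi = int(round(x)), int(round(y))  # DE works on real-valued coordinates
    modified = [row[:] for row in original]
    modified[yi][xi] = (r, g, b)
    return f(composite(original, modified, region))[true_class]


def toy_classifier(image: Image) -> List[float]:
    """Hypothetical two-class stand-in: P(class 1) is the mean red value."""
    reds = [p[0] for row in image for p in row]
    p1 = sum(reds) / len(reds)
    return [1.0 - p1, p1]


image = [[(0.0, 0.0, 0.0) for _ in range(2)] for _ in range(2)]
foreground = {(0, 0), (1, 0)}  # assumed segmentation: top row is foreground

inside = region_fitness(toy_classifier, image, (1.0, 0.0, 1.0, 0.0, 0.0), foreground, 0)
outside = region_fitness(toy_classifier, image, (0.0, 1.0, 1.0, 0.0, 0.0), foreground, 0)
```

In this toy setting the candidate inside the foreground lowers the true-class probability, while the candidate in the background leaves it at the clean value, so selection in DE would never prefer it over an equally good foreground candidate that reduces the fitness.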

It is worth noting that when we look for perturbations in the foreground (or background), pixels of the background (or foreground) are also used at the beginning, but from the second generation onward the DE algorithm effectively looks for the solution only in the foreground (or background) rather than in the entire image. This can be confirmed by observing how the value of the fitness function changes over the generations during evolution: when attacking the foreground (or background), if the perturbed pixel is located in the background (or foreground), the value of the fitness function does not change at all in our method, yet the experimental results show that the value of the fitness function drops rapidly from the beginning. Here we only show the feasibility of using only part of the pixels (foreground or background pixels) of an image for a successful one-pixel attack, and we leave the specific implementation for future work.

IV Evaluation

IV-A Experimental setup

In this section, we first introduce the datasets and the deep learning models used for the experimental investigation. The details of the experiment are described afterwards, and the experimental results are analyzed in depth at the end of this section.

Dataset description. The experiments were performed on two widely used datasets of different scale, CIFAR-10 and ImageNet. The two datasets contain different numbers of color images, and each pixel takes a real value within a fixed range in each of the three color channels. Specifically, the CIFAR-10 dataset is a collection of tiny images in 10 classes which contains 50,000 training images and 10,000 test images, each with a resolution of 32 × 32. The ImageNet project is a large visual database, and ILSVRC 2012 [ILSVRC15], which uses a subset of ImageNet with 1.2 million training images, 50,000 validation images and 100,000 test images in 1,000 categories, is one of its most commonly used versions. For each of the attacks on the three types of deep neural networks trained on CIFAR-10, we randomly selected images from the CIFAR-10 test set. Analogously, for ImageNet, images from the ILSVRC 2012 test set were randomly selected for each of the attacks. In addition, we resized the ImageNet images to a uniform resolution, since the original ILSVRC 2012 images are not uniform in size.

Architecture characteristics. In order to conduct the adversarial attacks, we trained three common networks on CIFAR-10: the all convolutional network (AllConv) [DB15a], network in network (NiN) [lin2014network] and the VGG-16 network [DBLP:journals/corr/SimonyanZ14a]. We also trained three networks on ILSVRC 2012: BVLC AlexNet [krizhevsky2012imagenet], the Residual Network with 50 layers (ResNet-50) [he2016deep] and VGG-16. We trained VGG-16 on both datasets in order to compare the performance of the same architecture on different datasets. To distinguish between the two, we use VGG-16 (C) to denote the VGG-16 network trained on CIFAR-10 and VGG-16 (I) to denote the network trained on ImageNet. We omit the letters in parentheses when there is no ambiguity. The architectures of the networks follow those described in the original works. The baseline accuracy of all networks on the clean test sets, shown in Table I, is comparable to state-of-the-art performance.

Architecture (CIFAR-10)    Acc. (%)
AllConv                    85.71
NiN                        86.55
VGG-16 (C)                 85.65

Architecture (ImageNet)    Top1 (%)    Top5 (%)
AlexNet                    56.52       79.07
ResNet-50                  75.90       92.90
VGG-16 (I)                 71.50       90.10

TABLE I: Baseline accuracy (Acc.) of the three CIFAR-10 networks and the three ImageNet networks.

IV-B Implementation details

CIFAR-10. We launched untargeted and targeted adversarial attacks on randomly selected CIFAR-10 test images for the three CIFAR-10 neural networks. For untargeted attacks, the attack on an image succeeds if the image can be perturbed so that it is misclassified as any other class. For targeted attacks, each image is perturbed toward each of the other nine CIFAR-10 classes separately, and we count a targeted attack as successful only if the image is classified as the specified target class. That is, if an image is to be perturbed toward a given target class but ends up misclassified as a different class, this is not counted as a successful targeted attack.

                                     Original                      Foreground                    Background
Networks    Metric                   1-pixel   3-pixels  5-pixels  1-pixel   3-pixels  5-pixels  1-pixel   3-pixels  5-pixels
AllConv     Untargeted1              12.80%    47.00%    51.20%    11.80%    40.20%    41.20%    6.20%     19.40%    20.60%
            Confidence (untargeted)  74.01%    70.12%    75.35%    71.38%    72.75%    77.13%    74.71%    72.32%    80.09%
            Untargeted2              12.60%    48.20%    55.00%    11.80%    43.60%    45.80%    6.60%     22.20%    24.00%
            Targeted                 2.93%     13.11%    16.84%    2.91%     11.42%    13.27%    1.78%     5.51%     6.84%
            Confidence (targeted)    67.51%    61.75%    65.99%    66.72%    64.05%    68.58%    68.09%    66.51%    72.07%
NiN         Untargeted1              21.00%    67.80%    75.40%    19.20%    59.80%    60.00%    9.80%     35.60%    35.20%
            Confidence (untargeted)  66.40%    63.77%    66.23%    63.67%    64.25%    69.29%    64.14%    64.65%    70.14%
            Untargeted2              21.00%    69.60%    76.80%    19.40%    64.00%    69.20%    11.40%    40.80%    45.00%
            Targeted                 5.91%     30.62%    38.56%    5.56%     25.07%    28.49%    2.71%     13.84%    16.07%
            Confidence (targeted)    59.50%    53.41%    54.40%    59.57%    54.71%    57.72%    59.79%    56.41%    60.70%
VGG-16 (C)  Untargeted1              43.40%    88.40%    92.00%    41.60%    79.60%    79.00%    18.20%    51.40%    48.20%
            Confidence (untargeted)  70.06%    69.98%    73.80%    70.00%    71.62%    74.82%    67.52%    68.36%    72.40%
            Untargeted2              44.60%    90.80%    93.20%    42.80%    87.00%    90.60%    20.20%    61.60%    66.80%
            Targeted                 22.78%    68.13%    70.80%    22.33%    61.24%    60.64%    8.82%     35.04%    36.80%
            Confidence (targeted)    59.38%    51.55%    53.66%    59.94%    53.39%    57.85%    58.40%    54.54%    59.26%

TABLE II: Results of one-, three- and five-pixel attacks on the CIFAR-10 original test images, their foreground and their background for three types of networks: AllConv, NiN and VGG-16. Untargeted1 and Targeted indicate the success rates of untargeted and targeted attacks. Untargeted2 is the untargeted success rate calculated from the targeted attack results. Confidence (untargeted) and Confidence (targeted) are the average probabilities of the predicted class over successful untargeted and targeted attacks, respectively.

We then segmented the foreground and the background of these test images using GrabCut. The interaction with GrabCut (the amount of which is usually small) continues until a result that is satisfactory to the human eye is obtained. Untargeted and targeted adversarial attacks using the same technique were then conducted in the foreground and in the background. To compare the effects of the adversarial attack in the foreground and the background, it is necessary to strictly control the position where the adversarial perturbation is added. We thus performed the attacks in an extremely limited scenario in which the perturbation can be added to only one pixel. In addition, we conducted experiments with three- and five-pixel modifications on the original natural images as well as on their foreground and background for all three CIFAR-10 networks, in order to explore the effect of different perturbation scales.

The perturbation used to modify the pixel (or pixels) is computed by the population-based optimization algorithm DE. The initial population size is fixed, and a stratified sampling method called Latin hypercube sampling (LHS) is used to initialize the population so as to maximize coverage of the available parameter space. The maximum number of iterations is also fixed, and an early-stop criterion is triggered when the predicted label is no longer the true class in the case of untargeted attacks, and when the predicted label is the target class in the case of targeted attacks. We also employ dithering when setting the mutation factor, which can speed up convergence significantly: the factor is redrawn uniformly at random from a fixed interval in each iteration. Binomial crossover is used. The fitness function is the confidence value of the true class for untargeted attacks and the confidence value of the target class for targeted attacks.
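The two DE ingredients named above, Latin hypercube initialization and dithering of the mutation factor, can be sketched in a few lines of standard-library Python. This is a minimal illustration rather than the authors' code: the function names are hypothetical, the candidate encoding (x, y, r, g, b) on a 32 × 32 image with 8-bit channels is an assumption, and the dithering interval (0.5, 1.0) is only a placeholder because the interval is not recoverable from the text.

```python
import random
from typing import List, Sequence, Tuple


def latin_hypercube(n_samples: int, bounds: Sequence[Tuple[float, float]],
                    rng: random.Random) -> List[List[float]]:
    """Draw n_samples points so that each dimension has exactly one point per stratum."""
    columns = []
    for low, high in bounds:
        width = (high - low) / n_samples
        strata = list(range(n_samples))
        rng.shuffle(strata)                  # pair strata randomly across dimensions
        columns.append([low + (s + rng.random()) * width for s in strata])
    return [[column[i] for column in columns] for i in range(n_samples)]


def dithered_factor(rng: random.Random, low: float, high: float) -> float:
    """Mutation factor redrawn uniformly at random, once per generation (dithering)."""
    return rng.uniform(low, high)


rng = random.Random(0)
# Assumed search space: pixel coordinates on a 32 x 32 image plus three 8-bit channels.
bounds = [(0.0, 32.0), (0.0, 32.0), (0.0, 256.0), (0.0, 256.0), (0.0, 256.0)]
population = latin_hypercube(8, bounds, rng)
factor = dithered_factor(rng, 0.5, 1.0)      # placeholder interval, not taken from the text
print(len(population), factor)
```

Compared with independent uniform sampling, the stratified draw guarantees that every eighth of each coordinate and color range is represented in an eight-member population, which is the coverage property that motivates LHS.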

ImageNet. Considering the time constraints, we only performed one-pixel untargeted attacks on the three ImageNet neural networks, with the same DE parameter settings. Although the search space of ImageNet is much larger than that of CIFAR-10, a clear trend can be seen in the experimental results even without proportionally increasing the number of evaluations. This low-cost setting corresponds to a computationally tractable attack on ImageNet, which is more realistic. Of course, we also believe that more evaluations may result in a higher attack success rate.

In addition, the main difference between the experiments on ImageNet and CIFAR-10 is that we used a different method, U-Net, to segment the foreground and background of the ImageNet natural images. The U-Net settings were kept as close as possible to the original, with a few modifications to adapt to the change in input size. To ensure that the foreground and background of each image used for the attack are properly segmented, we finally performed a manual inspection as an additional guarantee.

IV-C Experimental results

Some examples of perturbed images are visualized in Fig. 1 and Fig. 2. It can be clearly seen that the vulnerable points in the pixel space of an image are diverse and not unique. For the same architecture trained on the same dataset, the perturbed images obtained by modifying one pixel in the foreground and by modifying one pixel in the background may be misclassified by the model into the same class (see the images with labels in blue in Fig. 1, or (a-1) and (b-1), (c-1) and (d-1), (a-2) and (b-2), (c-3) and (d-3), etc. in Fig. 2) or into different classes (images with labels in red in Fig. 1, (e-1) and (f-1) in Fig. 2). For models with different architectures trained on the same dataset, there is more than one candidate pixel in both the foreground and the background when performing the one-pixel attack, which makes different models incorrectly classify the perturbed images into the same class (see (a-1) and (a-2), (b-1) and (b-2), (c-1) and (c-3), (d-1) and (d-3) in Fig. 2) or into different classes ((g-1) and (g-3) in Fig. 2).

Success rate. To evaluate the attack performance on the foreground and background of the image, we introduce several metrics that measure the effectiveness of the attacks, with only a few differences between the CIFAR-10 and ImageNet datasets. Specifically, we report the success rates of untargeted and targeted attacks as well as their corresponding confidence for CIFAR-10 in Table II. For untargeted attacks, the success rate is defined as the percentage of adversarial examples that are misclassified as any incorrect class label. For targeted attacks, it is defined as the probability of an adversarial example being misclassified into the specified target class. In addition to the success rates calculated from the actual untargeted and targeted attack results, we also report an untargeted success rate derived from the targeted attack results: if the targeted attack on an image succeeds for at least one target class, the untargeted attack on this image is considered successful. To calculate the confidence value, we accumulate the probability of the predicted class after each successful attack and then divide by the number of successful attacks. For ImageNet, we report the success rate and the corresponding confidence of the top1 predicted class after the attack in the untargeted case. Moreover, since we focus on decreasing the probability of the original true class, Table III also reports, in order to further investigate the effect of the attacks, the average decrease in confidence of the original top1, top3 and top5 classes before and after the attack, as well as the probability that the original top1, top3 and top5 classes are still in the top3 or top5 predictions after the attack.
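The definitions above can be made concrete with a small sketch. The data layout is an assumption chosen for illustration: each attacked image is represented by a dictionary that maps every attempted target class to the confidence of that class after a successful targeted attack, or to `None` when the attack on that target failed. All function names are hypothetical.

```python
from typing import Dict, List, Optional

# target class -> confidence after a successful targeted attack, or None on failure
TargetedResult = Dict[int, Optional[float]]


def targeted_success_rate(results: List[TargetedResult]) -> float:
    """Fraction of (image, target class) pairs that were attacked successfully."""
    outcomes = [conf for image in results for conf in image.values()]
    return sum(conf is not None for conf in outcomes) / len(outcomes)


def untargeted_from_targeted(results: List[TargetedResult]) -> float:
    """Fraction of images with at least one successful targeted attack."""
    hits = sum(any(conf is not None for conf in image.values()) for image in results)
    return hits / len(results)


def mean_confidence(results: List[TargetedResult]) -> float:
    """Average confidence of the predicted class over successful attacks only."""
    wins = [conf for image in results for conf in image.values() if conf is not None]
    return sum(wins) / len(wins)


# Toy data: two images, three target classes attempted for each.
toy = [{1: 0.5, 2: None, 3: 0.75}, {0: None, 2: None, 3: None}]
print(targeted_success_rate(toy), untargeted_from_targeted(toy), mean_confidence(toy))
```

On the toy data, two of the six image-target pairs succeed, one of the two images is reachable for at least one target, and the mean confidence over the two successes is 0.625. This also shows why the untargeted rate derived from targeted results can never be lower than the targeted rate.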

Networks    Metric                        Original    Foreground  Background
AlexNet     Untargeted                    11.19%      10.71%      6.90%
            Confidence                    30.68%      31.14%      29.06%
            Conf.-Decrease-top1(succ)     11.41%      11.94%      7.09%
            Conf.-Decrease-top3(unsucc)   0.76%       0.71%       0.52%
            Conf.-Decrease-top5(unsucc)   0.32%       0.29%       0.21%
            Ori-top1-in-top3(succ)        100.00%     97.78%      100.00%
            Ori-top1-in-top5(succ)        100.00%     100.00%     100.00%
            Ori-top3-in-top3(unsucc)      92.58%      92.53%      95.74%
            Ori-top5-in-top5(unsucc)      92.60%      92.85%      95.19%
VGG-16 (I)  Untargeted                    3.81%       3.81%       1.43%
            Confidence                    38.12%      38.40%      42.32%
            Conf.-Decrease-top1(succ)     13.83%      14.45%      8.65%
            Conf.-Decrease-top3(unsucc)   0.39%       0.37%       0.23%
            Conf.-Decrease-top5(unsucc)   0.16%       0.15%       0.10%
            Ori-top1-in-top3(succ)        100.00%     100.00%     100.00%
            Ori-top1-in-top5(succ)        100.00%     100.00%     100.00%
            Ori-top3-in-top3(unsucc)      95.06%      95.09%      96.96%
            Ori-top5-in-top5(unsucc)      95.51%      95.48%      96.92%
ResNet-50   Untargeted                    3.57%       3.10%       0.95%
            Confidence                    44.92%      43.78%      45.44%
            Conf.-Decrease-top1(succ)     21.29%      19.23%      11.02%
            Conf.-Decrease-top3(unsucc)   0.38%       0.37%       0.21%
            Conf.-Decrease-top5(unsucc)   0.13%       0.13%       0.08%
            Ori-top1-in-top3(succ)        100.00%     100.00%     100.00%
            Ori-top1-in-top5(succ)        100.00%     100.00%     100.00%
            Ori-top3-in-top3(unsucc)      93.40%      93.32%      95.17%
            Ori-top5-in-top5(unsucc)      93.81%      94.01%      96.71%

TABLE III: Results of one-pixel untargeted attacks on ImageNet original test images for three types of networks: AlexNet, VGG-16 and ResNet-50. Untargeted and Confidence represent the top-1 attack success rate and the average confidence of the successfully attacked images, respectively. Conf.-Decrease-top1/3/5(succ/unsucc) indicates the average decrease in confidence of the original top1/3/5 classes predicted by the unattacked classifier, before and after successful/unsuccessful attacks. Ori-top1/3/5-in-top3/5(succ/unsucc) indicates the probability that the original top1/3/5 predicted classes are still in the top3/5 after a successful/unsuccessful attack.

On CIFAR-10, the results of both untargeted and targeted one-pixel attacks show that the attack performance on the foreground of the image is comparable to that on the original entire image, while the attack success rate on the background is significantly lower. For example, for model VGG-16 (C), a one-pixel untargeted attack succeeds on 43.40% of the original CIFAR-10 test images and on 41.60% of the images when it is restricted to the foreground, while the success rate drops to 18.20% when it is restricted to the background (Untargeted1 in Table II). For all three CIFAR-10 models, the gap between the success rate on the original image and that on its background is much larger than the gap between the original image and its foreground. Averaged over the three models, the one-pixel success rate on the foreground is lower than that on the original image by 1.53 percentage points for untargeted attacks and by 0.27 percentage points for targeted attacks. In contrast, the success rate on the background is lower by 14.33 percentage points for untargeted attacks and by 6.10 percentage points for targeted attacks. Moreover, for the stronger attacks obtained by increasing the number of modifiable pixels to three and five, the success rate increases significantly, while the relation between the success rates on the original image, its foreground and its background remains the same. In addition, for a successful untargeted or targeted attack, attacking on the original image, its foreground or its background does not significantly change the confidence of the predicted class after the attack, nor does this confidence change noticeably with the number of pixels that can be modified.
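The average drops discussed above can be recomputed from the one-pixel columns of Table II (Untargeted1 and Targeted rows). The short script below is only a worked check of that arithmetic; it assumes that the averages are plain differences in percentage points between the original-image rate and the region rate, averaged over the three networks.

```python
# One-pixel success rates (%) from Table II, ordered AllConv, NiN, VGG-16 (C).
untargeted = {
    "original":   [12.80, 21.00, 43.40],
    "foreground": [11.80, 19.20, 41.60],
    "background": [6.20, 9.80, 18.20],
}
targeted = {
    "original":   [2.93, 5.91, 22.78],
    "foreground": [2.91, 5.56, 22.33],
    "background": [1.78, 2.71, 8.82],
}


def mean_drop(rates, region):
    """Average (original - region) success rate over the three networks."""
    drops = [o - r for o, r in zip(rates["original"], rates[region])]
    return sum(drops) / len(drops)


for name, rates in (("untargeted", untargeted), ("targeted", targeted)):
    for region in ("foreground", "background"):
        print(f"{name:10s} {region:10s} {mean_drop(rates, region):.2f}")
```

Under this assumption the script gives 1.53 and 14.33 points for untargeted attacks (foreground and background) and 0.27 and 6.10 points for targeted attacks, so the background gap is roughly an order of magnitude larger than the foreground gap.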

On ImageNet, we observed that, averaged over the three models, the success rate on the foreground of the image is only 0.32 percentage points lower than that on the original image, while the success rate on the background is 3.10 percentage points lower, a reduction that is almost 10-fold greater. Regardless of whether the modified pixel is located in the entire image, its foreground or its background, the confidence of the predicted class after a successful attack does not change much. These results are in line with our findings on CIFAR-10 and show that the observation generalizes well to large images. For successful attacks, the average reduction in confidence of the original true top1 class over the three ImageNet models is 15.51% for attacks on the original image, 15.21% for attacks on the foreground and 8.92% for attacks on the background. Even for unsuccessful attacks, the drop in confidence of the true top3 and top5 classes after an attack on the background is smaller than after an attack on the entire image or its foreground, for all three models. All these results indicate that attacking on the background of the image is much less effective than attacking on the entire image or its foreground. It should also be noted that, although the average probability that the original top3 or top5 classes remain in the top3 or top5 predictions is higher after an unsuccessful attack on the background than on the foreground or the entire image, the original top3 or top5 classes remain in the top3 or top5 predictions in more than 90% of the attacks for all three models. If the attack succeeds, the probability that the original top1 class is still in the top3 or top5 predictions is almost 100%. This small change in the rank order of the predicted classes is caused by the fact that we only used a simple implementation to modify the pixel of the images for such a large-scale dataset. More computation, or other parameter settings and fitness functions, should give different results.
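The ImageNet comparison can likewise be checked directly against Table III. The snippet below is a worked calculation only; it assumes that the quoted averages are unweighted means over the three networks and that gaps are measured in percentage points.

```python
# Rows of Table III ordered AlexNet, VGG-16 (I), ResNet-50; each tuple holds the
# (original, foreground, background) values in percent.
success_rate = [(11.19, 10.71, 6.90), (3.81, 3.81, 1.43), (3.57, 3.10, 0.95)]
top1_drop = [(11.41, 11.94, 7.09), (13.83, 14.45, 8.65), (21.29, 19.23, 11.02)]


def column_mean(rows, column):
    """Unweighted mean of one column over the three networks."""
    return sum(row[column] for row in rows) / len(rows)


fg_gap = column_mean(success_rate, 0) - column_mean(success_rate, 1)
bg_gap = column_mean(success_rate, 0) - column_mean(success_rate, 2)
ratio = bg_gap / fg_gap

print(f"foreground gap: {fg_gap:.2f}, background gap: {bg_gap:.2f}, ratio: {ratio:.1f}")
print("mean top1 confidence decrease:",
      [round(column_mean(top1_drop, c), 2) for c in range(3)])
```

With these assumptions the background gap (about 3.10 points) is close to ten times the foreground gap (about 0.32 points), and the mean top1 confidence decrease is about 15.51%, 15.21% and 8.92% for the original image, foreground and background.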

Fig. 3: Percentage of CIFAR-10 test images that were successfully perturbed to a certain number of target classes by one-, three- and five-pixel attacks on the original entire images, their foreground and their background.


Number of target classes (successful targeted attack results). Fig. 3 shows the percentage of images that can be perturbed to a given number of target classes when one, three or five pixels are modified on the entire image, its foreground and its background, respectively. This metric is reported only for CIFAR-10, since ImageNet classification involves too many classes (1,000). It can be seen that with only a one-pixel modification on the entire image, a certain number of images can be misclassified into up to three target classes by the AllConv and NiN networks, and a small number of images can even be perturbed to six target classes for the VGG-16 (C) network. The results obtained when attacking on the foreground of the image are almost as good as those obtained when attacking on the entire image, whereas the results of attacks that modify a pixel on the background are much worse.

Perturbing the images to more target classes can be achieved when increasing the number of pixels that can be modified. At the same time, there is still a large gap between the attack effect on the background of the image and that in the entire image or its foreground. These results suggest that the parts of an image that contain different amounts of semantic information are vulnerable to adversarial attacks to different degrees.
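The metric plotted in Fig. 3 can be computed from targeted attack results as sketched below. The input format (one set of reached target classes per image) and the function name are assumptions for illustration, and the sketch counts images reaching exactly k target classes, which is one plausible reading of the figure.

```python
from collections import Counter
from typing import Dict, List, Set


def target_class_histogram(reached_targets: List[Set[int]]) -> Dict[int, float]:
    """Map k to the percentage of images perturbed to exactly k target classes."""
    counts = Counter(len(targets) for targets in reached_targets)
    total = len(reached_targets)
    return {k: 100.0 * n / total for k, n in sorted(counts.items())}


# Toy data: four images and the target classes each one could be perturbed to.
reached = [set(), {3}, {3, 5}, set()]
print(target_class_histogram(reached))  # {0: 50.0, 1: 25.0, 2: 25.0}
```

Running the same function separately on the results for the entire image, the foreground and the background would give the three distributions that the figure compares.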

Original-target class pairs. We display the number of successful untargeted and targeted one-pixel attacks corresponding to different original-target class pairs in Fig. 4 and Fig. 5. The heat maps in Fig. 4 and Fig. 5 are based on the results of the three CIFAR-10 networks when attacking on the original entire image as well as on its foreground and background. It can be observed that the results of untargeted and targeted attacks show no significant difference in pattern. When attacking on the entire image, the degree of vulnerability varies across original-target class pairs, and some specific pairs are much more vulnerable than others. For example, for targeted attacks on the VGG-16 (C) network, images of a given class can be perturbed to some target classes much more easily than to others (see Fig. 5). Most of these specific class pairs remain more vulnerable than others under attacks on the foreground of the image, but are no longer significantly different from other class pairs when attacking on the background. This suggests that, for a single image, the vulnerable directions are more likely to be found in the space corresponding to the part of the image that contains more semantic information, such as the foreground.

In addition, images of some classes are more difficult to attack successfully than images of other classes, and some classes are harder to reach as targets; the specific classes differ between AllConv, NiN and VGG-16 (C) (see Fig. 4 and Fig. 5). For the classes that are hard to attack and hard to reach, the difficulty increases further when attacking on the background of the image. This indicates that the vulnerable directions shared by different data points belonging to the same class are also closely related to the amount of semantic information. It is more difficult to find vulnerable directions on the background, which contains less semantic information than the foreground of the image. Therefore, when a data point moves through the input space along a direction parallel to any of the directions corresponding to background pixels, the original class is more likely to remain robust along that direction. Meanwhile, for the classes that are already hard to reach, finding a reachable direction in the space corresponding to the background is even more difficult. This phenomenon might be due to the shape of the decision boundary and the relative angle between each direction at the data point and the boundary. That is, for a wide class region in which the data points may lie far from the boundary, the distance to the boundary may be even larger along the directions corresponding to the background. On the other hand, if the region is thin, it is more difficult to reach along the directions corresponding to pixels with less semantic information.
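The heat maps behind this analysis are essentially counts of successful attacks per original-target class pair. The following sketch shows one way to build such a matrix and to read off per-class totals; the function names and the toy list of successes are hypothetical.

```python
from typing import List, Tuple


def pair_counts(successes: List[Tuple[int, int]], n_classes: int = 10) -> List[List[int]]:
    """counts[original][target] = number of successful attacks for that class pair."""
    counts = [[0] * n_classes for _ in range(n_classes)]
    for original, target in successes:
        counts[original][target] += 1
    return counts


def attacked_per_original(counts: List[List[int]]) -> List[int]:
    """Row sums: how often images of each original class were attacked successfully."""
    return [sum(row) for row in counts]


def reached_per_target(counts: List[List[int]]) -> List[int]:
    """Column sums: how often each class was reached as a target."""
    return [sum(column) for column in zip(*counts)]


# Toy data: (original class, class after the attack) for each successful attack.
successes = [(3, 5), (3, 5), (0, 8)]
counts = pair_counts(successes)
print(counts[3][5], attacked_per_original(counts)[3], reached_per_target(counts)[8])
```

Small row sums identify original classes that are hard to attack, and small column sums identify classes that are hard to reach, which are the two kinds of robust classes distinguished in the text.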

Fig. 4: Heat maps of the number of times the untargeted one-pixel attack was successful for each source–target class pair, for the original, foreground and background images of CIFAR-10 on the three types of networks: AllConv, NiN and VGG-16 (C). The vertical indices indicate the original classes and the horizontal indices indicate the target classes. The numbers on the vertical and horizontal axes index the ten classes of CIFAR-10 in the following order: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck.


Fig. 5: Heat maps of the number of times the targeted one-pixel attack was successful for each source–target class pair, for the original, foreground and background images of CIFAR-10 on the three types of networks: AllConv, NiN and VGG-16 (C). The vertical indices indicate the original classes and the horizontal indices indicate the target classes. The numbers on the vertical and horizontal axes index the ten classes of CIFAR-10 in the following order: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck.


Change in fitness values and number of generations. To observe more clearly what happens during the attack, Fig. 6 illustrates how the fitness values change during evolution for different networks and datasets in the case of untargeted attacks. More specifically, we randomly selected images that can be successfully attacked not only on the original entire image but also on its foreground and background for all three CIFAR-10 networks (AllConv, NiN and VGG-16 (C)), and likewise selected ImageNet images for the AlexNet model. Recall that the fitness value is the confidence of the true class of the image, and that the goal of the untargeted attack is to minimize this fitness value. We also report, in Table IV, the average number of generations required over all images successfully attacked on the original image, its foreground and its background, for all the CIFAR-10 and ImageNet networks.

As shown in Fig. 6, the fitness value can drop sharply between two particular generations and decrease steadily over the others. For the samples that can be successfully attacked on the original entire image as well as on its foreground and background, the number of generations required to lower the fitness value enough for the image to be misclassified is much smaller than the maximum number of iterations. The average fitness decreases monotonically with the number of generations; moreover, the average fitness value drops almost as much when attacking on the foreground as when attacking on the entire image. However, when attacking on the background of the image, the drop in the average fitness value is much smaller. The results in Table IV also show that, for all the networks on CIFAR-10 and ImageNet, the average number of generations required for a successful untargeted attack that modifies one pixel on the background of the image is higher than on the entire image or its foreground. In addition, it can be seen that the fitness value decreases less for the AlexNet network than for the networks on CIFAR-10. This means that, under the current settings, the AlexNet network is more difficult to fool.
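The generation counts reported for successful attacks can be obtained from per-generation logs as in the sketch below. The log format (the label predicted for the best candidate of each generation) and the function names are assumptions, and generations are counted from 1, so an attack that succeeds immediately needs one generation.

```python
from typing import List, Optional, Tuple


def generations_to_success(predicted_labels: List[int], true_label: int) -> Optional[int]:
    """1-based index of the first generation whose best candidate is misclassified."""
    for generation, label in enumerate(predicted_labels, start=1):
        if label != true_label:
            return generation        # the early-stop criterion would trigger here
    return None                      # the attack never succeeded


def mean_generations(runs: List[Tuple[List[int], int]]) -> float:
    """Average number of generations over the successful attacks only."""
    needed = [generations_to_success(labels, true) for labels, true in runs]
    successful = [g for g in needed if g is not None]
    return sum(successful) / len(successful)


# Toy logs: (predicted label per generation, true label) for three attacked images.
runs = [([3, 3, 5], 3), ([1], 0), ([2, 2], 2)]
print(mean_generations(runs))  # 2.0
```

Because failed attacks are excluded from the average, this quantity complements the success rate: a region can have both a lower success rate and a higher generation count among the attacks that do succeed.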

Networks      Original    Foreground  Background
AllConv       1.63        1.77        3.03
NiN           1.36        1.48        2.45
VGG-16 (C)    1.45        1.69        4.70
AlexNet       3.12        4.68        7.52
ResNet-50     1.00        1.00        1.20
VGG-16 (I)    3.00        3.50        9.50

TABLE IV: The average number of generations required for a successful untargeted attack when modifying one pixel on the original entire image, its foreground and its background, for the three networks on CIFAR-10 (AllConv, NiN, VGG-16 (C)) and the three networks on ImageNet (AlexNet, ResNet-50, VGG-16 (I)).

Fig. 6: Changes of fitness values over the generations of evolution for untargeted attacks that modify one pixel on the entire image, its foreground and its background, for the three CIFAR-10 networks (AllConv, NiN, VGG-16 (C)) and the AlexNet network on ImageNet. The fitness curves of randomly selected images are drawn with solid gray lines and the average values are highlighted by red dotted lines. To make the comparison clearer, we also plot the averages of the three cases (attacking on the original entire image, its foreground and its background) together.


V Conclusion

In the present study, we have investigated the differences in adversarial robustness between regions of different semantic importance in an image. We shed light on this issue by perturbing one pixel at a time in different regions of the segmented image to fool deep neural networks. We used the GrabCut method for CIFAR-10 and U-Net for ImageNet to segment each image into a foreground region, which is semantically meaningful, and a background region, which does not carry much semantic information. We proposed an algorithm based on DE that finds solutions in multiple discontinuous intervals when one-pixel attacks are carried out on part of the image (specifically, the foreground or the background region in this paper). Moreover, the experimental results on both CIFAR-10 and ImageNet for networks with different architectures demonstrate that there are multiple pixels in an image that can be perturbed to achieve a one-pixel adversarial attack, and that attacks on the foreground region succeed more easily, which means that the adversarial robustness of the foreground is worse than that of the background.

In summary, this work explores the vulnerabilities in pixel space and reveals an inherent connection between adversarial robustness and semantic information. The algorithm proposed in this paper implies a new kind of adversarial attack, launched in an even more limited scenario that has not been considered by previous attacks, in which the adversary does not know the values of all the pixels in an image when perturbing it. Furthermore, the diversity of the points vulnerable to adversarial attacks at the pixel level suggested in this paper provides insights into the decision boundaries of deep neural networks. The observed connection between adversarial robustness and the semantic information of the image may inspire new ideas for improving adversarial robustness. A theoretical study of the above issues is reserved for future work.