Dropping Pixels for Adversarial Robustness

05/01/2019
by Hossein Hosseini, et al.
University of Washington

Deep neural networks are vulnerable to adversarial examples. In this paper, we propose to train and test the networks with randomly subsampled images with high drop rates. We show that this approach significantly improves robustness against adversarial examples in all cases of bounded L0, L2, and L∞ perturbations, while only slightly reducing standard accuracy. We argue that subsampling pixels can be thought of as providing a set of robust features for the input image and thus improves robustness without requiring adversarial training.


1 Introduction

Deep neural networks are known to be vulnerable to adversarial examples: inputs that are intentionally designed to cause the model to make a mistake [1]. One particular type of adversarial example for image classifiers is a slightly perturbed image that is misclassified by the model but remains recognizable to humans [2, 3]. Such adversarial images are typically generated by adding a small perturbation with bounded L0, L2, or L∞ norm to legitimate inputs [4].

Several methods have been proposed for defending against adversarial examples, but they were later broken using adaptive iterative attacks [5, 6]. The state-of-the-art defense against adversarial examples (with norm-bounded perturbation) is adversarial training, which iteratively generates adversarial examples and trains the model to classify them correctly [7, 8]. This approach, however, significantly slows down the training process and does not scale well to large datasets [9].

Figure 1: Examples of original and subsampled images at a high drop rate. (First and third rows) Images from the GTSRB and CIFAR10 datasets, respectively; (second and fourth rows) the corresponding subsampled images. The accuracy on subsampled images drops only slightly compared to original images for both the GTSRB and CIFAR10 datasets.

Adversarial training is shown to improve robustness at the cost of reduced accuracy [8]. In [10], Tsipras et al. argued that the trade-off between adversarial robustness and standard generalization is a fundamental property of machine learning classifiers. They analyzed a binary classification problem and showed that the reduction in standard accuracy is due to the tendency of adversarially trained models to assign non-zero weights only to a small number of strongly-correlated or “robust” features. That is, such networks discard the weakly-correlated (non-robust) features that could potentially lead to better standard generalization.

In this paper, we investigate how this insight could be used to train robust classifiers without performing adversarial training. In natural image classifiers, it is not possible to identify a fixed set of robust features in the pixel domain, due to the position invariance of objects. As a result, the set of robust features is different for each image. To adapt the idea of selecting robust features to natural images, we use a slightly different notion of robust features: features that are strongly correlated with the output given all other robust features. In other words, instead of selecting features that are each individually highly correlated with the output, we select the set of features that is jointly most correlated with it.

Image data contain high redundancy due to the strong correlation between neighboring pixels, i.e., it is possible to restore an image even when a large fraction of its pixels is removed [11, 12]. Therefore, conditioned on a pixel being selected, its surrounding pixels are weakly correlated with the output: they overlap significantly in content with the selected pixel, and removing them will not cause much reduction in accuracy. Hence, one straightforward way to construct robust features is to downsample the image pixels. Since pixels that are farther apart are less correlated, each of them contributes non-trivially to the model’s prediction and can thus be regarded as a robust feature.

We propose to perform random (nonuniform) sampling in order to improve both accuracy and robustness. Random subsampling of pixels improves standard generalization because the model is trained with different subsets of pixels of each image. Also, at inference time, the accuracy can be improved by averaging the prediction over multiple sampling patterns. Moreover, since the randomness is not known to the adversary, it further mitigates the attack success rate. Randomly dropping pixels is naturally suited to defending against adversarial examples with bounded L0 perturbation, since the model learns to recognize objects from images with missing pixels. Nevertheless, we show that it provides robustness against adversarial examples with bounded L2 and L∞ perturbation as well.

In this paper, we present our preliminary work and results on using random subsampling for adversarial robustness. Our contributions are summarized as follows.

  • We show that image classifiers can be trained on inputs with reduced redundancy, obtained through random subsampling of pixels, without a significant reduction in accuracy. The best results are obtained when the model is trained with subsampled images whose drop rates are chosen randomly from a range of values.

  • We apply interpretability methods to models trained with subsampled images and argue that such approaches cannot explain how the model recognizes images from only a few pixels. We also visualize the convolutional filters of the first layer of the network and show that, in this respect, the model behaves similarly to a network trained with adversarial training.

  • We evaluate the adversarial robustness of models trained with randomly subsampled images. Experiments are performed on the GTSRB and CIFAR10 datasets with the projected gradient descent (PGD) attack [13]. We show that training with subsampled images, with drop rates chosen randomly from a range, improves robustness against adversarial examples in all cases of bounded L0, L2, and L∞ perturbation.

Figure 2: Accuracy of models trained with subsampled images with different drop rates. We used ResNet-20 and ResNet-110 for GTSRB and CIFAR10, respectively. Dropping pixels at a higher rate results in lower accuracy. However, even at very high drop rates, the accuracy remains high.

2 Training with Subsampled Pixels

Natural images are high-dimensional data with high redundancy due to the strong correlation between neighboring pixels. Hence, when training image classifiers, we can potentially reduce this redundancy without significantly reducing standard accuracy. One approach for reducing redundancy is randomly dropping pixels at a high rate. In the following, we provide results of training and testing models on images with missing pixels.

Let X be a color image and M a binary mask of the same size, where each element of M is a Bernoulli random variable that equals 1 with probability 1 − p and 0 otherwise, with p denoting the drop rate. We generate the subsampled image as X_s = M ⊙ X, where ⊙ denotes the Hadamard (element-wise) product. Figure 1 shows samples of original images from the German Traffic Sign Recognition Benchmark (GTSRB) [14] and the CIFAR10 dataset [15] and their corresponding subsampled images at a high drop rate.
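As a concrete illustration, the subsampling operation described above can be written in a few lines of PyTorch. This is a minimal sketch under the assumption that the mask is shared across the color channels of each pixel; the rendering above does not specify whether channels are dropped jointly or independently.

```python
import torch

def subsample(x: torch.Tensor, drop_rate: float) -> torch.Tensor:
    """Randomly drop pixels of an image batch.

    x: images of shape (N, C, H, W); drop_rate: probability p that a pixel
    (all of its color channels) is zeroed out.
    """
    n, _, h, w = x.shape
    # Each spatial location is kept with probability 1 - p (Bernoulli mask M).
    keep = (torch.rand(n, 1, h, w, device=x.device) >= drop_rate).to(x.dtype)
    # Hadamard (element-wise) product X_s = M * X, broadcast over channels.
    return x * keep
```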

2.1 Experimental Results

We use the ResNet-20 and ResNet-110 architectures [16] for the GTSRB and CIFAR10 datasets, respectively. The models are trained and tested with subsampled images. During training, the mask is chosen randomly and independently for each image and at each epoch. Figure 2 shows the accuracy of models trained with images at different drop rates. As expected, dropping pixels at a higher rate results in lower accuracy. However, even at very high drop rates, the accuracy remains high: compared to standard training, the accuracy at a high drop rate is reduced by only a small amount for both the GTSRB and CIFAR10 datasets.
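The training procedure can be sketched as below. The drop-rate range (min_drop, max_drop) is a placeholder, since the exact interval used in the paper is not recoverable from this rendering.

```python
import torch
import torch.nn.functional as F

def train_epoch(model, loader, optimizer, min_drop, max_drop, device="cuda"):
    """One epoch of training with a fresh random mask and drop rate per image."""
    model.train()
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        # Draw an independent drop rate for every image in the batch.
        rates = torch.empty(images.size(0), 1, 1, 1, device=device).uniform_(min_drop, max_drop)
        keep = (torch.rand_like(images[:, :1]) >= rates).to(images.dtype)
        loss = F.cross_entropy(model(images * keep), labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```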

We also observed that deeper networks perform better. Table 1 shows the accuracy of ResNet models of different depths on subsampled CIFAR10 images. As can be seen, ResNet-110 provides higher accuracy than ResNet-56 and ResNet-20. Moreover, the model achieves its best results when the drop rate of each image is chosen uniformly at random from a range at each epoch.

Model Accuracy
Experiment 1 (ResNet-110)
Experiment 2 (ResNet-20)
Experiment 2 (ResNet-56)
Experiment 2 (ResNet-110)
Experiment 3 (ResNet-110)
Table 1: Results on the CIFAR10 dataset. In Experiment 1, the model is trained and tested on original images. In Experiment 2, models are trained and tested on subsampled images with a fixed high drop rate. In Experiment 3, the model is trained on subsampled images with drop rates chosen uniformly at random from a range and tested on subsampled images with a high drop rate.

2.2 Interpretability Analysis

In recent years, several “post-hoc” methods have been proposed for interpreting the predictions of deep convolutional neural networks [17, 18, 19, 20]. Such methods typically identify the input dimensions to which the output is most sensitive. Let X be the input image, f the classifier, and E the explanation function that maps inputs to objects of the same shape. Most explanation methods are based on some form of the gradient of the classifier function with respect to its input [21]. In our analysis, we use the magnitude of the gradient as the explanation map, i.e., E(X) = |∇_X f(X)| [17].
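A gradient-magnitude explanation map of this kind can be computed as in the following PyTorch sketch. Taking the gradient of the predicted-class score is an assumption, since the rendering does not specify which output the gradient is taken with respect to.

```python
import torch

def explanation_map(model, x, target_class=None):
    """E(X) = |grad_X f(X)|, where f(X) is the score of the (predicted) class."""
    x = x.clone().requires_grad_(True)
    logits = model(x)
    if target_class is None:
        target_class = logits.argmax(dim=1)
    # Sum of the selected class scores over the batch; gradients stay per-image.
    score = logits.gather(1, target_class.view(-1, 1)).sum()
    score.backward()
    return x.grad.abs()
```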

(a) Model is trained with images with drop rates chosen randomly from a range.
(b) Model is trained to classify subsampled images into their true label, while mapping original images to the uniform distribution.
(c) Model is trained to classify subsampled images into their true label, while mapping subsampled noisy images to the uniform distribution. The subsampled noisy images are obtained as M ⊙ (X + δ), where M is the sampling mask and δ is a random perturbation that takes one of two opposite values with equal probability.
Figure 3: Visualizing explanation maps. Notation: X, X_s, and M are the original image, the subsampled image, and the sampling mask, respectively. E(X) is the explanation map on X, computed as |∇_X f(X)|, and E(X_s) is the explanation on X_s. The gradient quantifies the sensitivity of the model output with respect to its input; it does not, however, quantify how much each input dimension contributes to the model prediction.

We examine interpretability for a ResNet-110 network trained with subsampled CIFAR10 images with drop rates chosen randomly from a range. Figure 3(a) shows the explanation maps E(X) and E(X_s) for original and subsampled images, respectively. For original images, the explanation map resembles the pattern of edges in the image, a phenomenon that [21] also observed and posed as a shortcoming of interpretability methods. For subsampled images, however, the explanation is not informative. We visualize (1 − M) ⊙ E(X_s) and M ⊙ E(X_s), which respectively show the gradient magnitude at pixels that have been dropped and at those that have not. As can be seen, most of the larger gradient values are at the positions of dropped pixels, i.e., pixels that do not contribute to the model prediction.

The results raise questions about the usability of such techniques in explaining model predictions. The gradient captures the sensitivity of the model output with respect to its input, i.e., it quantifies how much a change in a small neighborhood around the input would change the prediction. It does not, however, quantify how much each input dimension contributes to the model prediction. Specifically, in our case, such interpretability methods do not explain how the model recognizes the image from a few pixels.

For classifying subsampled images, the network might implicitly rely on features of the original images, i.e., it might have learned to produce similar representations for original and subsampled images. To prevent the model from doing so, we train a model to classify subsampled images into their true label while mapping original images to the uniform distribution. This training approach results in a network whose accuracy on subsampled images is only slightly lower than that of a model trained only with subsampled images. The results imply that the network is capable of classifying subsampled images without actually learning features of natural images. Figure 3(b) shows the explanation maps for a few images. Similar to Figure 3(a), the explanations do not provide insight into how the model works.
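A possible loss for this experiment is sketched below. The use of a KL term toward the uniform distribution and the weight lam are assumptions, since the rendering only states that original images are mapped to the uniform distribution.

```python
import torch
import torch.nn.functional as F

def subsampled_vs_original_loss(model, x_orig, x_sub, labels, num_classes, lam=1.0):
    """Classify subsampled images correctly while forcing the prediction on
    original (non-subsampled) images toward the uniform distribution."""
    ce = F.cross_entropy(model(x_sub), labels)
    log_probs = F.log_softmax(model(x_orig), dim=1)
    uniform = torch.full_like(log_probs, 1.0 / num_classes)
    # KL divergence between the uniform target and the model's prediction.
    kl = F.kl_div(log_probs, uniform, reduction="batchmean")
    return ce + lam * kl
```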

Finally, we train a model to treat subsampled images and subsampled noisy images differently, in order to investigate to what extent the network relies on the exact values of the subsampled pixels. Specifically, we train the model to classify subsampled images into their true label, while mapping subsampled noisy images to the uniform distribution. The subsampled noisy images are obtained as M ⊙ (X + δ), where M is the sampling mask and δ is a random perturbation that takes one of two opposite values with equal probability.

Interestingly, the trained model achieves almost the same accuracy on subsampled images as a model trained only with subsampled images. Figure 3(c) shows the explanation maps for a few images. For this model, the explanations on original images are not correlated with the edge pattern. Also, the explanations on subsampled images are sparser than those in Figures 3(a) and 3(b). Moreover, most of the larger gradient values are at positions where pixels have not been dropped. Further exploring the interpretability of networks trained with subsampled images is left for future work.

2.3 Visualizing Convolutional Filters

Convolutional networks are known to learn basic image patterns such as edges and blobs in early layers and then combine them in later layers to distinguish complex objects [22]. Dropping pixels at a high rate disrupts such basic patterns; as a result, the network cannot readily extract spatial features from the image data. To gain insight into how the model classifies inputs, we examine the convolutional filters of the first layer. Figure 4 visualizes the filters of ResNet-110 networks trained on the CIFAR10 dataset. We consider three cases: a normally trained model, a model trained with subsampled images at a fixed high drop rate, and a model trained with subsampled images with drop rates chosen randomly from a range.

As can be seen, the model trained with subsampled images at a fixed drop rate has filters with large values only at the center position. This means that the network recognizes that there is no spatial correlation between adjacent pixels and, hence, simply passes several scaled versions of the image to the next layer. The model trained with subsampled images at varying drop rates contains a mix of such concentrated filters and filters similar to those of the normally trained model.
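First-layer filters can be inspected with a few lines of code. The attribute name model.conv1 is an assumption that matches torchvision-style ResNets and may differ for the CIFAR-style ResNet-110 used here.

```python
import matplotlib.pyplot as plt

def plot_first_layer_filters(model, path="filters.png", cols=8):
    """Save a grid visualization of the first convolutional layer's filters."""
    w = model.conv1.weight.detach().cpu()            # shape (out_ch, in_ch, k, k)
    w = (w - w.min()) / (w.max() - w.min() + 1e-8)   # normalize to [0, 1] for display
    rows = (w.size(0) + cols - 1) // cols
    fig, axes = plt.subplots(rows, cols, figsize=(cols, rows))
    for i, ax in enumerate(axes.flat):
        ax.axis("off")
        if i < w.size(0):
            # Display each filter as a k x k RGB patch (assumes in_ch == 3).
            ax.imshow(w[i].permute(1, 2, 0).numpy())
    fig.savefig(path, bbox_inches="tight")
    plt.close(fig)
```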

Interestingly, concentrated filters in the first layer have also been observed in adversarially trained networks on the MNIST dataset, where models were trained with adversarial examples with bounded L∞ perturbation [8]. The similar behavior of models trained with the two approaches suggests that randomly dropping pixels is indeed related to the notion of robust features observed in adversarial training. Further exploring this relationship is left for future work.

(a) Model is trained normally.
(b) Model is trained with images with a fixed high drop rate.
(c) Model is trained with images with drop rates chosen randomly from a range.
Figure 4: Visualizing the convolutional filters of the first layer of ResNet-110 networks trained on the CIFAR10 dataset. Models trained with subsampled images have more concentrated weights.
Figure 5: Accuracy on adversarial examples of (left) GTSRB and (right) CIFAR10 datasets with bounded L0 perturbation, where the perturbation budget is given as a fraction of the total number of pixels. The entropy threshold is chosen such that the accuracy on clean validation images is slightly lower than in the case where each example is tested only once.

3 Robustness to Adversarial Examples

In this section, we first provide a background on adversarial examples and then evaluate the robustness of models trained with subsampled images.

3.1 Background on Adversarial Examples

We consider a class of adversarial examples for image classifiers where a small (imperceptible) perturbation is added to an image so that the model misclassifies it (misclassification attack) or classifies it into the attacker’s desired label (targeted attack). The perturbation is typically quantified according to an Lp norm. The attacker’s problem is formally stated as follows:

find x'   s.t.   ||x' − x||_p ≤ ε   and   f(x') ≠ y (misclassification attack)   or   f(x') = y_t (targeted attack),   (1)

where x and x' are the clean and adversarial examples, respectively, y is the true label, and y_t is the attacker’s desired target label.

We generate adversarial examples using the Projected Gradient Descent (PGD) method [13, 8], such that the added perturbation is bounded within ε for the chosen Lp norm, i.e., ||x' − x||_p ≤ ε. PGD is an iterative attack with the following update step:

x^{t+1} = Π_S ( x^t + α · g^t ),   (2)

where x^t is the image at step t, g^t is the attack vector at step t, α is the perturbation added per step, and Π_S is the projection operator onto the set S of allowed perturbations. Depending on the attack goal, the attack vector is specified as follows (a minimal sketch of the attack is given after the list):

  • g^t = ∇_x L(f(x^t), y) for the misclassification attack,

  • g^t = −∇_x L(f(x^t), y_t) for the targeted attack.
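The following is a minimal, untargeted L∞ PGD sketch with the cross-entropy loss. The L0 and L2 variants used in the paper change only the step direction and the projection, and the step size and number of iterations here are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def pgd_linf(model, x, y, eps=8 / 255, alpha=2 / 255, steps=20):
    """Untargeted PGD under an L_inf budget eps with step size alpha."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()       # ascend the loss
            x_adv = x + (x_adv - x).clamp(-eps, eps)  # project onto the L_inf ball
            x_adv = x_adv.clamp(0.0, 1.0)             # keep pixels in a valid range
    return x_adv.detach()
```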

Attack Setup. To attack a model with random subsampling, we generate several randomly subsampled images, compute for each of them the gradient at the pixels that have not been dropped, and then average the gradients. Let M_i denote the i-th sampling mask. The average gradient over n subsampled images is obtained as

ḡ = (1/n) Σ_{i=1}^{n} M_i ⊙ ∇_x L(f(M_i ⊙ x), y).

We consider the cases where the L0, L2, or L∞ norm of the perturbation is bounded. For both GTSRB and CIFAR10, the perturbation budget and the attack step size are set separately for each norm; the L0 budget is specified as a fraction of the total number of pixels. We perform the PGD attack with both cross-entropy and CW [4] loss functions, for both misclassification and targeted attacks, and report the best attack results.
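The averaged gradient used to attack the randomized model could be estimated as sketched below. The number of masks n_samples is an assumption, since the exact value is not recoverable from this rendering.

```python
import torch
import torch.nn.functional as F

def averaged_masked_gradient(model, x, y, drop_rate, n_samples=10):
    """Average the loss gradient over several random sampling masks, keeping
    only the gradient at pixels that survive each mask."""
    grad_sum = torch.zeros_like(x)
    for _ in range(n_samples):
        keep = (torch.rand_like(x[:, :1]) >= drop_rate).to(x.dtype)
        x_in = (x * keep).requires_grad_(True)
        loss = F.cross_entropy(model(x_in), y)
        g, = torch.autograd.grad(loss, x_in)
        grad_sum += keep * g                       # zero out dropped positions
    return grad_sum / n_samples
```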

3.2 Case of Bounded L0 Perturbation

Training with random subsampling is well suited to defending against L0-bounded adversarial examples, since the model is trained to be robust to missing pixels. We also observed that subsampled adversarial examples produce a more spread-out (higher-entropy) output probability vector than subsampled clean images. Therefore, we enhance the defense mechanism by rejecting examples for which the entropy of the probability vector is larger than a threshold. The threshold is chosen to yield a fixed false-positive rate on validation data, i.e., the accuracy on clean validation images is reduced by a corresponding fixed amount.

With random subsampling, accuracy can be improved by averaging the output over multiple different subsampled versions of the input. To improve adversarial robustness, we compute the average output probability vector over several subsampled inputs and reject the example if the entropy of the averaged probability vector is larger than a threshold. The threshold is chosen such that the accuracy on clean validation images is only slightly lower than in the case where each example is run only once.
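Inference with averaging and entropy-based rejection could look like the sketch below. The values of n_runs and the entropy threshold are placeholders, since the paper tunes the threshold on clean validation data rather than fixing it a priori.

```python
import torch
import torch.nn.functional as F

def predict_with_rejection(model, x, drop_rate, n_runs=10, entropy_threshold=1.0):
    """Average softmax outputs over several random subsamplings of the input and
    reject (label -1) inputs whose averaged distribution has high entropy."""
    model.eval()
    with torch.no_grad():
        outputs = []
        for _ in range(n_runs):
            keep = (torch.rand_like(x[:, :1]) >= drop_rate).to(x.dtype)
            outputs.append(F.softmax(model(x * keep), dim=1))
        probs = torch.stack(outputs).mean(dim=0)
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1)
        preds = probs.argmax(dim=1)
        preds[entropy > entropy_threshold] = -1    # reject high-entropy inputs
    return preds
```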

Figure 5 shows the results on the GTSRB and CIFAR10 datasets. As expected, a larger drop rate improves adversarial robustness at the cost of reduced standard accuracy. The experiments are performed on a single model trained with images whose drop rates are drawn from a range. Such a model has the advantage that, at test time, the drop rate can be tuned to achieve different trade-offs between accuracy and robustness.

Figure 5 also shows that rejecting inputs based on the entropy value improves the accuracy-robustness trade-off, since the adversarial accuracy increases by more than the reduction in standard accuracy. Moreover, averaging over multiple runs and rejecting high-entropy inputs brings the adversarial accuracy on par with the standard accuracy at high drop rates.

3.3 Case of Bounded L2 and L∞ Perturbation

Figure 6 shows the attack results for the L2 and L∞ cases on the GTSRB and CIFAR10 datasets. As can be seen, random subsampling improves robustness in both cases and, similar to the L0 case, a larger drop rate results in higher adversarial robustness. Intuitively, to attack the model, the adversary needs to distribute their budget across all features in a way that maximizes the expected attack success rate. This reduces the attack’s effectiveness compared to the case where the attacker knows the exact sampling pattern.

(a) GTSRB, L2 attack.
(b) GTSRB, L∞ attack.
(c) CIFAR10, L2 attack.
(d) CIFAR10, L∞ attack.
Figure 6: Accuracy on adversarial examples of the GTSRB and CIFAR10 datasets with bounded L2 and L∞ perturbation.

4 Related Work

Defenses against adversarial examples with norm-bounded perturbation have been widely studied [7, 4, 5, 8, 6, 23]. Adversarial training is the state-of-the-art approach for the L2 and L∞ cases, but it has been shown to significantly slow down the training procedure [8, 10]. While most papers study defenses in the L∞ setting, some real-world attacks based on adversarial examples fit the L0 setting [24, 25]. For example, [24] attacked traffic sign detection algorithms by adding sticker-like perturbations to images, and [25] showed that face recognition algorithms can be fooled by adding physically realizable perturbations, such as eyeglasses, to images.

In [26], the authors proposed a method for improving robustness in the L0 attack setting by exploiting the sparsity of natural images in the Fourier domain. They reported attack results on the MNIST and Fashion-MNIST datasets and noted that the sparsity property might not hold for large images. Similar to their method, we use a property of natural images, namely high spatial correlation, to mitigate the effect of adversarial perturbation. Our approach, however, generalizes to natural images of any size. In fact, with larger images, it is possible to drop pixels at an even higher rate and still restore the image [11, 12]. Hence, the classifier might be able to recognize subsampled images with higher drop rates and, as a result, achieve better robustness. Moreover, our method improves robustness against L2 and L∞ adversarial examples in addition to the L0 case. As future work, we will evaluate our method on the ImageNet dataset.

Several papers have proposed using post-processing algorithms to increase adversarial robustness [27, 28]. In [27], the authors proposed applying random resizing and padding at inference time. [28] presented an algorithm for pruning a random subset of activations of a pretrained network and scaling up the rest. Unlike our method, such algorithms do not train the model to learn the randomness.

Introducing randomness into the inputs or the network itself at both training and test time has recently been explored and shown to improve performance on adversarial examples [29, 30]. In [29], the authors proposed adding random noise layers to the network and ensembling the prediction over the random noise. [30] adopted a similar idea and used differential privacy to provide certified robustness against adversarial perturbations. In this paper, we proposed to train the model with subsampled images, with drop rates chosen randomly from a range, and to test it with subsampled images at high drop rates. We showed that our method improves adversarial robustness in all cases of L0, L2, and L∞ perturbations.

5 Conclusion

In this paper, we showed that image classifiers can be trained to recognize images subsampled at high drop rates. We then proposed to train models with subsampled images whose drop rates are chosen randomly from a range. Our experimental results on the GTSRB and CIFAR10 datasets showed that such models improve robustness against adversarial examples in all cases of bounded L0, L2, and L∞ perturbation, while only slightly reducing standard accuracy.

Acknowledgments

This work was supported by ONR grant N00014-17-S-B001.

References