, so they have been widely used in various real-world applications, including face recognition Sun et al. (2015), self-driving cars Bojarski et al. (2016), and biomedical image processing Bakas et al. (2018), among many others Najafabadi et al. (2015). Despite these successes, DCNN classifiers can be easily attacked by adversarial examples with perturbations imperceptible to human vision Szegedy et al. (2013); Goodfellow et al. (2014); Su et al. (2019). This vulnerability has motivated active research on adversarial attacks and defenses for DCNNs; see Wiyatno et al. (2019); Ren et al. (2020) for reviews.
Existing adversarial attacks can be categorized into white-box, gray-box, and black-box attacks. Adversaries in white-box attacks have full information about the targeted DCNN model, whereas their knowledge is limited to the model structure in gray-box attacks and to only the model’s input and output in black-box attacks. Popular algorithms for white-box attacks include the fast gradient sign method Goodfellow et al. (2014); Kurakin et al. (2016), the projected gradient descent method Madry et al. (2017), and the Carlini and Wagner attack Carlini and Wagner (2017), among many others Szegedy et al. (2013); Papernot et al. (2016); Moosavi-Dezfooli et al. (2016). Defensive techniques against these attacks include heuristic and certified defenses. Adversarial training is currently the most successful heuristic defense for improving the robustness of DCNNs; it simply incorporates adversarial samples into training, yet has better numerical performance than certified defenses Ren et al. (2020).
In this paper, we propose a simple yet efficient framework for white-box adversarial image generation and training for DCNN classifiers. For generating an adversarial example of a given image, our framework provides user-customized options for the number of perturbed pixels, the misclassification probability, and the targeted incorrect class. To the best of our knowledge, this is the first approach that offers all three options. The freedom to specify the number of perturbed pixels allows users to conduct attacks at various pixel levels, such as one-pixel Su et al. (2019) and all-pixel Moosavi-Dezfooli et al. (2017) attacks. In particular, we adopt a recent perturbation-manifold based first-order influence (FI) measure Shu and Zhu (2019) to efficiently locate the most vulnerable pixels and thereby increase the attack success rate. In contrast with traditional Euclidean-space based measures such as the Jacobian norm Novak et al. (2018) and Cook’s local influence measure Cook (1986), the FI measure captures the intrinsic change of the perturbed objective function Zhu et al. (2007, 2011) and shows better performance in detecting vulnerable images and pixels. In addition, our framework allows users to specify the misclassification probability and/or the targeted incorrect class. A prespecified misclassification probability is rarely available in existing approaches, which produce an adversarial example either near the model’s decision boundary Moosavi-Dezfooli et al. (2016); Nazemi and Fieguth (2019) or with high but unguaranteed confidence Nguyen et al. (2015). We tailor loss functions to the three options and their combinations, and apply particle swarm optimization (PSO) Kennedy and Eberhart (1995), a fast gradient-free method, to obtain the optimal perturbation. Moreover, we observe that our perturbations with high misclassification probability can exhibit a degree of adversarial universality Moosavi-Dezfooli et al. (2017) across images from different classes. For adversarial training, we further use the FI measure on the training data to identify vulnerable images and their pixels that are prone to optional targeted classes. Applying our customized generation approach to these images yields an adversarial dataset for training. Experiments show that our adversarial training significantly improves the robustness of pretrained DCNN classifiers. Figure 1 illustrates the flowchart of our framework.
We note that two recent papers, Zhang et al. (2019) and Mosli et al. (2019), also applied PSO to craft adversarial images. However, our work differs from theirs in several essential ways. First, the two papers focus on black-box attacks, whereas ours is a white-box approach. Second, Zhang et al. (2019) studied only all-pixel attacks; Mosli et al. (2019) considered few-pixel attacks but searched random chunks to locate the vulnerable pixels, whereas we use the FI measure to discover those pixels directly. Moreover, targeted attacks are not considered in Mosli et al. (2019), and neither paper can prespecify a misclassification probability for the generated adversarial example. Our framework is able to design arbitrary-pixel-level, confidence-specified, and/or targeted/nontargeted attacks.
Our contributions are summarized as follows:
We propose a novel white-box framework for adversarial image generation and training for DCNN classifiers. It provides users with multiple options in pixel levels, confidence levels, and targeted classes for adversarial attacks and defenses.
We adopt a manifold-based FI measure to efficiently identify vulnerable images and pixels for adversarial perturbations.
We design different loss functions adaptive to user-customized specifications and apply the PSO, a fast gradient-free optimization, to obtain optimal perturbations.
We demonstrate the effectiveness of our framework via experiments on benchmark datasets and notice that our high-confidence perturbations may have certain adversarial universality.
2.1 Perturbation-Manifold Based Influence Measure
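Given an input image $\boldsymbol{x}$ and a DCNN classifier with parameters $\boldsymbol{\theta}$, the prediction probability for class $y \in \{1,\dots,K\}$ is denoted by $P(y|\boldsymbol{x},\boldsymbol{\theta})$. Let $\boldsymbol{\omega} = (\omega_1,\dots,\omega_p)^\top$ be a perturbation vector in an open set $\Omega \subseteq \mathbb{R}^p$, which can be imposed on any subvector of $(\boldsymbol{x}^\top,\boldsymbol{\theta}^\top)^\top$. Let the prediction probability under perturbation $\boldsymbol{\omega}$ be $P(y|\boldsymbol{x},\boldsymbol{\theta},\boldsymbol{\omega})$ with $P(y|\boldsymbol{x},\boldsymbol{\theta},\boldsymbol{\omega}_0) = P(y|\boldsymbol{x},\boldsymbol{\theta})$ for some $\boldsymbol{\omega}_0 \in \Omega$.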
For sensitivity analysis of DCNNs, Shu and Zhu (2019) recently proposed an FI measure to delineate the ‘intrinsic’ perturbed change of the objective function on the Riemannian manifold of Zhu et al. (2007, 2011). In contrast with traditional Euclidean-space based measures such as the Jacobian norm Novak et al. (2018) and Cook’s local influence measure Cook (1986), this perturbation-manifold based measure is invariant under diffeomorphic (e.g., scaling) reparameterizations of perturbations and has better numerical performance in detecting vulnerable images and pixels.
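Let $f(\boldsymbol{\omega})$ be an objective function of interest, for example, the cross-entropy $f(\boldsymbol{\omega}) = -\log P(y = y_{\text{true}}|\boldsymbol{x},\boldsymbol{\theta},\boldsymbol{\omega})$. The FI measure at $\boldsymbol{\omega} = \boldsymbol{\omega}_0$ is defined by
$$\mathrm{FI}_{\boldsymbol{\omega}}(\boldsymbol{\omega}_0) = [\partial_{\boldsymbol{\omega}} f(\boldsymbol{\omega}_0)]\, \mathbf{G}_{\boldsymbol{\omega}}^{\dagger}(\boldsymbol{\omega}_0)\, [\partial_{\boldsymbol{\omega}} f(\boldsymbol{\omega}_0)]^\top, \tag{1}$$
where $\partial_{\boldsymbol{\omega}} = (\partial/\partial\omega_1,\dots,\partial/\partial\omega_p)$, $\mathbf{G}_{\boldsymbol{\omega}}(\boldsymbol{\omega}) = \sum_{y=1}^{K} \partial_{\boldsymbol{\omega}}^\top \ell(\boldsymbol{\omega}|y,\boldsymbol{x},\boldsymbol{\theta})\, \partial_{\boldsymbol{\omega}} \ell(\boldsymbol{\omega}|y,\boldsymbol{x},\boldsymbol{\theta})\, P(y|\boldsymbol{x},\boldsymbol{\theta},\boldsymbol{\omega})$ with $\ell(\boldsymbol{\omega}|y,\boldsymbol{x},\boldsymbol{\theta}) = \log P(y|\boldsymbol{x},\boldsymbol{\theta},\boldsymbol{\omega})$, and $\mathbf{G}_{\boldsymbol{\omega}}^{\dagger}(\boldsymbol{\omega}_0)$ is the pseudoinverse of $\mathbf{G}_{\boldsymbol{\omega}}(\boldsymbol{\omega}_0)$. A larger value of $\mathrm{FI}_{\boldsymbol{\omega}}(\boldsymbol{\omega}_0)$ indicates that the DCNN model is more sensitive in $f$ to local perturbation $\boldsymbol{\omega}$ around $\boldsymbol{\omega}_0$. We shall use the FI measure to discover vulnerable images or pixels for adversarial attacks.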
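As a minimal numerical sketch of the FI measure, assume a toy softmax classifier with linear logits W(x + ω), a perturbation ω added to every input coordinate, and evaluation at ω0 = 0. The function names `fi_measure` and `softmax_linear_example`, the toy model, and the `rcond` cutoff are illustrative assumptions rather than the authors' implementation. The code forms G as the probability-weighted sum of outer products of the log-probability gradients, applies its pseudoinverse to the gradient of the cross-entropy, and checks that the result is unchanged when the perturbation is rescaled.

```python
import numpy as np


def fi_measure(jac_logp, probs, grad_f, rcond=1e-10):
    """First-order influence value grad_f * pinv(G) * grad_f^T.

    jac_logp : (K, p) array, row y is d log P(y | x, theta, omega) / d omega at omega_0
    probs    : (K,) array of class probabilities at omega_0
    grad_f   : (p,) array, gradient of the objective f at omega_0
    rcond    : cutoff for small eigenvalues in the pseudoinverse (assumed value)
    """
    jac_logp = np.asarray(jac_logp, dtype=float)
    probs = np.asarray(probs, dtype=float)
    grad_f = np.asarray(grad_f, dtype=float)
    # G = sum_y P(y) * (row_y)^T (row_y), a symmetric positive semidefinite p x p matrix
    G = jac_logp.T @ (probs[:, None] * jac_logp)
    G_pinv = np.linalg.pinv(G, rcond=rcond, hermitian=True)
    return float(grad_f @ G_pinv @ grad_f)


def softmax_linear_example(W, x, true_class):
    """Toy model (assumption): logits = W (x + omega), evaluated at omega_0 = 0."""
    logits = W @ x
    logits = logits - logits.max()           # stabilise the softmax
    probs = np.exp(logits) / np.exp(logits).sum()
    # d log softmax_y / d omega = W_y - sum_k P_k W_k
    jac_logp = W - probs @ W
    # cross-entropy f = -log P(true_class), so its gradient is minus that row
    grad_f = -jac_logp[true_class]
    return jac_logp, probs, grad_f


rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))
x = rng.normal(size=5)
jac, probs, grad = softmax_linear_example(W, x, true_class=0)
fi = fi_measure(jac, probs, grad)
# Reparameterising omega = 10 * omega' multiplies every gradient by 10
fi_rescaled = fi_measure(10.0 * jac, probs, 10.0 * grad)
print(fi, fi_rescaled)
```

The two printed values agree up to floating-point error, which illustrates the scaling invariance discussed above; a plain Jacobian norm would instead grow by the scaling factor.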
2.2 Particle Swarm Optimization
Since introduced by Kennedy and Eberhart Kennedy and Eberhart (1995) in 1995, the PSO algorithm has been successfully used in solving complex optimization problems in various fields of engineering and science Poli (2008); Eberhart and Shi (2001); Zhang et al. (2015). Let be an objective function, which will be specified in Section 2.3 for adversarial scenarios. The PSO algorithm performs searching via a population (called swarm) of candidate solutions (called particles) by iterations to optimize the objective function . Specifically, let
where is the position of particle in an -dimensional space at iteration , is the total number of particles, and is the current iteration. The position, , of particle at iteration is updated with a velocity by
where is the inertia weight, and are acceleration coefficients, and and. Following Xu et al. (2019), we fix and . We can see that the movement of each particle is guided by its individual best known position and the entire swarm’s best known position. We shall use the PSO algorithm to obtain desirable adversarial perturbations under various user’s requirements.
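The update described above can be sketched as a standard global-best PSO with an inertia weight. This is a generic illustration rather than the authors' code: the function name `pso_minimize`, the box constraint, the zero initial velocities, and the default values of `w`, `c1`, and `c2` (commonly used settings, not necessarily those fixed in this paper) are all assumptions.

```python
import numpy as np


def pso_minimize(objective, dim, lower, upper, n_particles=30, n_iters=200,
                 w=0.7298, c1=1.49618, c2=1.49618, seed=0):
    """Minimise `objective` over the box [lower, upper]^dim with global-best PSO."""
    rng = np.random.default_rng(seed)
    pos = rng.uniform(lower, upper, size=(n_particles, dim))   # particle positions
    vel = np.zeros((n_particles, dim))                         # particle velocities
    pbest = pos.copy()                                         # individual best positions
    pbest_val = np.array([objective(p) for p in pos])
    g = int(np.argmin(pbest_val))
    gbest, gbest_val = pbest[g].copy(), pbest_val[g]           # swarm best position

    for _ in range(n_iters):
        r1 = rng.random((n_particles, dim))                    # uniform on [0, 1)
        r2 = rng.random((n_particles, dim))
        # inertia + pull toward individual best + pull toward swarm best
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, lower, upper)
        vals = np.array([objective(p) for p in pos])
        improved = vals < pbest_val
        pbest[improved] = pos[improved]
        pbest_val[improved] = vals[improved]
        g = int(np.argmin(pbest_val))
        if pbest_val[g] < gbest_val:
            gbest, gbest_val = pbest[g].copy(), pbest_val[g]
    return gbest, float(gbest_val)


# Usage: minimise a simple quadratic as a stand-in for an adversarial loss
best, best_val = pso_minimize(lambda z: float(np.sum(z ** 2)), dim=3,
                              lower=-1.0, upper=1.0)
print(best, best_val)
```

Because the method only evaluates the objective and never differentiates it, the same loop applies to losses that are not differentiable in the perturbation.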
2.3 Adversarial Image Generation
Given an image , we combine FI and PSO to generate its adversarial image with user-customized options for the number of pixels for perturbation, the misclassification probability, and the targeted class to which the image is misclassified, denoted by , , and , respectively.
Denote image . For an RGB image of pixels, we view the three channel components of a pixel as three separate pixels, so here. We let the default value of .
We first locate vulnerable pixels in for perturbation, if is specified but the targeted pixels are not given by the user. We compute the FI measure in (1) for each pixel based on the objective function
where . Denote to be the pixel with the -th largest FI value. We use as the pixels for adversarial attack and let perturbation .
where we assume , constrains the range of perturbation to guarantee the visual quality of the generated adversarial image compared to the original, is a misclassification loss function, represents the magnitude of perturbation, and and are prespecified weights. To ensure the misleading nature of the generated adversarial sample, is set to prioritize over .
where is the label with the -th largest prediction probability from the trained DCNN for the input image added with perturbation . Since results in the minimum of , this loss function encourages PSO to yield a valid perturbation. If the -perturbed is prespecified with a misclassification probability , we use the misclassification loss function
Later in our experiments, we show that high is helpful to generate universal adversarial perturbations applicable to images from the other classes. If a targeted class is given, we choose the misclassification loss function
Furthermore, if both and are provided, we use
or equivalently .
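The pixel-selection step of this section can be sketched as follows, assuming that a pixel-level FI map has already been computed and that pixel intensities lie in [0, 1]. The helper names `select_top_fi_pixels` and `apply_perturbation` and the clipping range are hypothetical. Each channel entry of an RGB image is treated as a separate pixel, as described above.

```python
import numpy as np


def select_top_fi_pixels(fi_map, m):
    """Return flat (row-major) indices of the m entries with the largest FI values."""
    flat = np.asarray(fi_map, dtype=float).ravel()
    return np.argsort(-flat, kind="stable")[:m]


def apply_perturbation(image, pixel_idx, omega, low=0.0, high=1.0):
    """Add perturbation omega to the selected pixels and clip to the valid range."""
    adv = np.array(image, dtype=float, order="C")   # copy, so the input is untouched
    flat = adv.reshape(-1)                          # view on the copy
    flat[pixel_idx] += np.asarray(omega, dtype=float)
    np.clip(adv, low, high, out=adv)
    return adv


# Usage on a 2 x 2 toy image: perturb the two most influential pixels
fi_map = np.array([[0.1, 0.9], [0.5, 0.3]])
idx = select_top_fi_pixels(fi_map, m=2)
adv = apply_perturbation(np.full((2, 2), 0.5), idx, omega=[0.7, -0.2])
print(idx, adv)
```

In a full pipeline, the vector `omega` would be the search variable handed to the optimizer, so its dimension equals the number of selected pixels rather than the image size.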
2.4 Adversarial Training
We aim to create a set of adversarial images for a given trained DCNN model, and then fine-tune the model on the training data augmented with this adversarial dataset. To include as many adversarial images as possible, we do not specify a value to in Algorithm 1. Note that Algorithm 1 may not have a feasible solution when given with restrictive parameters such as small or small . To efficiently generate a batch of adversarial images, we first select a set of potentially vulnerable images by some modifications to Algorithm 1.
Specifically, given an image dataset , thresholds and targeted incorrect labels (if not given, the label with the second largest prediction probability), we first find , the set of all correctly classified images that have image-level FI (with ) and prediction probability . For each image in set , we generate its adversarial image by Algorithm 1 in which is the number of pixels with FI and is specified to . These generated adversarial images form an adversarial dataset. The whole procedure of our adversarial training is illustrated in Figure 1 and detailed in Algorithm 2.
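The image-selection step can be sketched as below. The sketch assumes that vulnerable images are those with image-level FI at or above a threshold and true-class prediction probability at or below a threshold; the direction of both inequalities, the function name `select_vulnerable_images`, and the argument names `fi_min` and `prob_max` are assumptions made for illustration. The default target is the label with the second largest prediction probability, as stated above.

```python
import numpy as np


def select_vulnerable_images(probs, labels, image_fi, fi_min, prob_max):
    """Pick correctly classified images that look vulnerable.

    probs    : (n, K) predicted class probabilities
    labels   : (n,) true labels
    image_fi : (n,) image-level FI values
    Returns the selected image indices and a default target label for each.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    image_fi = np.asarray(image_fi, dtype=float)

    pred = probs.argmax(axis=1)
    true_prob = probs[np.arange(len(labels)), labels]
    # assumed rule: correct prediction, large FI, not too confident
    mask = (pred == labels) & (image_fi >= fi_min) & (true_prob <= prob_max)

    # label with the second largest probability; for a correctly classified
    # image this is always an incorrect label
    second = np.argsort(-probs, axis=1, kind="stable")[:, 1]
    idx = np.flatnonzero(mask)
    return idx, second[idx]


# Usage on four toy predictions over three classes
probs = [[0.60, 0.30, 0.10],
         [0.95, 0.03, 0.02],
         [0.20, 0.50, 0.30],
         [0.10, 0.35, 0.55]]
idx, targets = select_vulnerable_images(probs, labels=[0, 0, 0, 2],
                                        image_fi=[2.0, 3.0, 5.0, 0.1],
                                        fi_min=1.0, prob_max=0.9)
print(idx, targets)
```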
We conduct experiments on the two benchmark datasets MNIST and CIFAR10 using the ResNet32 model He et al. (2016). Data augmentation is used, including random horizontal and vertical shifts up to 12.5% of image height and width for both datasets, and additionally random horizontal flip for CIFAR10 data. Table 1 shows the prediction accuracy of our trained ResNet32 for the two datasets.
| Model | MNIST training (n=60k) | MNIST testing (n=10k) | CIFAR10 training (n=50k) | CIFAR10 testing (n=10k) |
3.1 Customized Adversarial Image Generation
We consider two images with easy visual detection and large image-level FI in MNIST and CIFAR10, shown in Figures 2 and 3 with prediction-probability graphs and pixel-level FI maps. The probability bar graphs imply candidate misclassification classes that can be used as . The FI maps indicate the vulnerability of each pixel to local perturbation and are useful to locate pixels for attack.
We first evaluate the performance of Algorithm 1 (cf. Figure 1 (b)-(e)) in generating adversarial examples of the two images according to different requirements on , and . Figures 4 and 5 show the generated adversarial images with corresponding perturbation maps. Perturbations 1-3 consider the settings with , , and , respectively, and with no specifications to and . For Perturbations 4-6, we only specify , , and , respectively, assign no value to , and tune being the number of pixels with FI and to obtain feasible solutions from PSO. Perturbations 7-9 are prespecified with , , and for MNIST, and , , and for CIFAR10, respectively, being the number of pixels with FI , and no value for . The detailed parameter settings for Algorithm 1 are provided in Supplementary Material. We can see that the generated adversarial images have visually negligible differences from the originals and satisfy the prespecified requirements.
We also investigate the adversarial universality of Perturbation 6 shown in Figures 4 and 5, which have a 99% prediction probability for Class 4. Table 2 shows the proportions of originally correctly classified images that are misclassified after the perturbation is added. For MNIST, the error rate is at least 14.3% for every class and reaches 100% for some, with a total rate above 87.5% in both the training and testing sets. In particular, a remarkably large proportion of each class is misclassified to Class 4, with total rates of 62.2% and 64.5% for the training and testing sets, respectively. Perturbation 6 for CIFAR10 also exhibits some adversarial universality, with non-targeted total error rates of 3.92% and 6.19% and Class-4-targeted total rates of 0.92% and 1.32% for the training and testing sets, respectively. Figure 6 displays images from the other nine classes that are originally correctly classified with high probability but are misclassified (most with high probability) to Class 4 after Perturbation 6 is added. These results indicate that our method may generate a universal adversarial perturbation, which in particular has the potential to misclassify images from different classes to the same specific class. The existence of universal adversarial perturbations may be attributed to the geometric correlations of decision boundaries between classes Moosavi-Dezfooli et al. (2017). An adversarial perturbation with very high confidence may carry salient features of its resulting class and may thus have strong power to drag other images towards the decision boundary.
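| Dataset | Set | Rate (%) | Nine classes other than Class 4 | | | | | | | | | Total |
| MNIST | Training | Misclass. to 4 | 81.1 | 36.9 | 55.8 | 84.3 | 75.5 | 49.6 | 95.3 | 14.2 | 68.8 | 62.2 |
| MNIST | Testing | Misclass. to 4 | 79.7 | 37.3 | 58.8 | 88.3 | 78.5 | 57.4 | 94.8 | 13.7 | 75.0 | 64.5 |
| CIFAR10 | Training | Misclass. to 4 | 1.27 | 0.04 | 0.89 | 1.65 | 1.45 | 0.12 | 2.56 | 0.30 | 0.06 | 0.92 |
| CIFAR10 | Testing | Misclass. to 4 | 2.17 | 0.10 | 1.48 | 2.52 | 2.34 | 0.31 | 2.81 | 0.32 | 0.21 | 1.32 |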
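Rates of the kind reported in Table 2 can be computed from model predictions before and after adding a fixed perturbation, as in the sketch below. The function name `universal_error_rates` and the choice to compute the total targeted rate only over images whose true class differs from the target are assumptions; the numbers in the usage example are made up for illustration and are not results from the paper.

```python
import numpy as np


def universal_error_rates(labels, clean_pred, adv_pred, target, n_classes):
    """Error rates (%) among images that were correctly classified before perturbation."""
    labels = np.asarray(labels)
    clean_pred = np.asarray(clean_pred)
    adv_pred = np.asarray(adv_pred)
    correct = clean_pred == labels           # keep only originally correct images

    per_class_error, per_class_to_target = {}, {}
    for c in range(n_classes):
        sel = correct & (labels == c)
        if not sel.any():
            continue
        # share of class-c images that are misclassified after perturbation
        per_class_error[c] = 100.0 * float(np.mean(adv_pred[sel] != c))
        if c != target:
            # share of class-c images that move to the target class
            per_class_to_target[c] = 100.0 * float(np.mean(adv_pred[sel] == target))

    others = correct & (labels != target)
    return {
        "per_class_error": per_class_error,
        "per_class_to_target": per_class_to_target,
        "total_error": 100.0 * float(np.mean(adv_pred[correct] != labels[correct])),
        "total_to_target": 100.0 * float(np.mean(adv_pred[others] == target)),
    }


# Usage with six toy images, three classes, and target class 2
rates = universal_error_rates(labels=[0, 0, 1, 1, 2, 2],
                              clean_pred=[0, 0, 1, 0, 2, 2],
                              adv_pred=[2, 0, 2, 2, 2, 2],
                              target=2, n_classes=3)
print(rates)
```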
3.2 Adversarial Training
We consider using Algorithm 2 to generate adversarial datasets for adversarial training. Figure 7 shows the Manhattan plots of image-level FIs for correctly classified images and Figure 8 presents the heatmaps of confusion matrices. We can see that the distributions of image-level FIs and the patterns of misclassifications are very close between training and test datasets in both MNIST and CIFAR10. Hence, our adversarial training is expected to be useful for unseen adversarial examples generated from similar mechanisms in testing.
Based on the two figures, for selecting vulnerable images (cf. Figure 1(a)), we let , be the most frequent misclassified class of ’s true class, and in Algorithm 2 . The resulting image set is likely to be near the decision boundaries of the trained classifier. We then set in the algorithm. We generate adversarial datasets Adv1 () and Adv2 (), respectively, from training and testing sets of MNIST, and Adv3 () and Adv4 () from those of CIFAR10. Adv1 and Adv3 are used for adversarial training (cf. Figure 1(f)), whereas Adv2 and Adv4 test the adversarial trained models. The detailed parameter settings for Algorithm 2 to generate those datasets are given in Supplementary Material.
The adversarially trained ResNet32 models are obtained by fine-tuning the originally trained models on the training data augmented with Adv1 and Adv3, respectively, for an additional 30 epochs for MNIST and 50 epochs for CIFAR10. The results of adversarial training are reported in Tables 1 and 3. Since the adversarial datasets () are much smaller than the original testing datasets () and the originally trained models already have high accuracy, the results in Table 1 are only slightly improved on the test datasets. However, Table 3 shows that adversarial training on Adv1 and Adv3 indeed strengthens the defense of the fine-tuned ResNet32 models against adversarial attacks. The accuracy improves dramatically from 0.00% to 83.82% and 88.93% on Adv1 and Adv3, respectively, and to as much as 76.92% and 63.01% on the test-data derived Adv2 and Adv4, respectively. We also observe an increase of and , respectively, in accuracy on the combined data of the original test set and its adversarial samples for MNIST and CIFAR10. These results indicate that our approach can significantly improve the adversarial defense of DCNN classifiers.
This paper introduced an FI-and-PSO based framework for adversarial image generation and training for DCNN classifiers that accounts for the user-specified number of perturbed pixels, misclassification probability, and/or targeted incorrect class. We used the perturbation-manifold based FI measure to efficiently detect vulnerable images and pixels and thus increase the attack success rate. We designed different misclassification loss functions to meet various user specifications and obtained the optimal perturbation with the fast PSO algorithm. Experiments showed good performance of our approach in generating customized adversarial samples and in the associated adversarial training for DCNNs.
DCNN models for image classification are widely used in real-world applications such as self-driving cars and face recognition for identification, but they can be vulnerable to adversarial attacks that apply small perturbations to the original images, which raises safety and security concerns in these applications. Our proposed white-box framework for adversarial image generation and training for DCNN classifiers may help developers test and fortify their DCNN-based products to improve reliability in real-world applications.
-  (2018) Identifying the best machine learning algorithms for brain tumor segmentation, progression assessment, and overall survival prediction in the BRATS challenge. arXiv preprint arXiv:1811.02629. Cited by: §1.
-  (2016) End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316. Cited by: §1.
-  (2017) Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 39–57. Cited by: §1.
-  (1986) Assessment of local influence. Journal of the Royal Statistical Society: Series B (Methodological) 48 (2), pp. 133–155. Cited by: §1, §2.1.
-  (2001) Particle swarm optimization: developments, applications and resources. In Proceedings of the 2001 congress on evolutionary computation (IEEE Cat. No. 01TH8546), Vol. 1, pp. 81–86. Cited by: §2.2.
-  (2014) Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572. Cited by: §1, §1.
-  (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §1, §3.
-  (2017) Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708. Cited by: §1.
-  (1995) Particle swarm optimization. In Proceedings of ICNN’95-International Conference on Neural Networks, Vol. 4, pp. 1942–1948. Cited by: §1, §2.2.
-  (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §1.
-  (2016) Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533. Cited by: §1.
-  (2017) Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083. Cited by: §1.
-  (2017) Magnet: a two-pronged defense against adversarial examples. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pp. 135–147. Cited by: §2.3.
-  (2018) Generating deep learning adversarial examples in black-box scenario. Electronic Design Engineering 26 (24), pp. 164–173. Cited by: §2.3.
-  (2017) Universal adversarial perturbations. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1765–1773. Cited by: §1, §3.1.
-  (2016) Deepfool: a simple and accurate method to fool deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2574–2582. Cited by: §1, §1.
-  (2019) They might not be giants: crafting black-box adversarial examples with fewer queries using particle swarm optimization. arXiv preprint arXiv:1909.07490. Cited by: §1.
-  (2015) Deep learning applications and challenges in big data analytics. Journal of Big Data 2 (1), pp. 1. Cited by: §1.
-  (2019) Potential adversarial samples for white-box attacks. arXiv preprint arXiv:1912.06409. Cited by: §1.
-  (2015) Deep neural networks are easily fooled: high confidence predictions for unrecognizable images. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 427–436. Cited by: §1.
-  (2018) Sensitivity and generalization in neural networks: an empirical study. In International Conference on Learning Representations, Note: arXiv preprint arXiv:1802.08760 Cited by: §1, §2.1.
-  (2016) The limitations of deep learning in adversarial settings. In 2016 IEEE European symposium on security and privacy (EuroS&P), pp. 372–387. Cited by: §1.
-  (2008) Analysis of the publications on the applications of particle swarm optimisation. Journal of Artificial Evolution and Applications 2008. Cited by: §2.2.
-  (2020) Adversarial attacks and defenses in deep learning. Engineering. Cited by: §1, §1.
-  (2019) Sensitivity analysis of deep neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 4943–4950. Cited by: §1, §2.1.
-  (2019) One pixel attack for fooling deep neural networks. IEEE Transactions on Evolutionary Computation 23 (5), pp. 828–841. Cited by: §1, §1.
-  (2015) Deepid3: face recognition with very deep neural networks. arXiv preprint arXiv:1502.00873. Cited by: §1.
-  (2013) Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199. Cited by: §1, §1.
-  (2019) Adversarial examples in modern machine learning: a review. arXiv preprint arXiv:1911.05268. Cited by: §1.
-  (2019) Particle swarm optimization based on dimensional learning strategy. Swarm and Evolutionary Computation 45, pp. 33–51. Cited by: §2.2.
-  (2019) Attacking black-box image classifiers with particle swarm optimization. IEEE Access 7, pp. 158051–158063. Cited by: §1.
-  (2015) A comprehensive survey on particle swarm optimization algorithm and its applications. Mathematical Problems in Engineering 2015. Cited by: §2.2.
-  (2007) Perturbation selection and influence measures in local influence analysis. The Annals of Statistics 35 (6), pp. 2565–2588. Cited by: §1, §2.1.
-  (2011) Bayesian influence analysis: a geometric approach. Biometrika 98 (2), pp. 307–323. Cited by: §1, §2.1.