Simultaneous Adversarial Training - Learn from Others' Mistakes

07/21/2018 ∙ by Zukang Liao, et al.

Adversarial examples are maliciously tweaked images that can easily fool machine learning models such as neural networks, yet they are normally not visually distinguishable to human beings. One of the main approaches to this problem is to retrain the networks using those adversarial examples, namely adversarial training. However, standard adversarial training might not actually change the decision boundaries but instead cause gradient masking, resulting in a weaker ability to generate adversarial examples. Therefore, it cannot alleviate the problem of black-box attacks, where adversarial examples generated from other networks transfer to the targeted one. In order to reduce the problem of black-box attacks, we propose a novel method that allows two networks to learn from each other's adversarial examples and become resilient to black-box attacks. We also combine this method with a simple domain adaptation to further improve the performance.


1 Introduction

Deep neural networks have been widely used for facial recognition, surveillance, autonomous driving and other tasks that require a high standard of safety. However, it has been found that by adding some unnoticeable perturbations, i.e. adversarial noise, to input images, deep neural networks can be easily fooled [2]. Furthermore, these adversarial examples often transfer, which means that adversarial examples that fool one network can also easily fool another one; this is known as a black-box attack [3]. In order to increase robustness to the transferability of adversarial examples for faces, we propose a novel method that allows two networks to learn from each other's adversarial examples.

Standard adversarial training has been proven effective against the same type of white-box attacks that are used to generate the adversarial examples [4] [5], but ineffective against black-box attacks due to gradient masking [6]. After a network has been trained, its adversarial examples can be generated by various methods, such as the Fast Gradient Sign Method or the Least Likely Class method. These adversarial examples can easily fool the network because they are found using the network's own weights and gradients; this is known as a white-box attack [6]. Standard adversarial training retrains the network using the original data and its own adversarial examples to make it more robust to the same type of white-box attacks [2]. However, it has been found that adversarial examples generated from networks defended in this way lose the ability to easily fool other undefended networks due to gradient masking [6]. Furthermore, it has also been found that the decision boundaries of such defended networks remain unchanged, and thus the defended networks remain vulnerable to black-box attacks [1].

We propose a method that trains two networks simultaneously to make both of them more resilient to black-box attacks from a third holdout network; we call it the Simultaneous Adversarial Training Method (SATM). SATM is implemented and tested on the Adience dataset, in which 26,580 faces are labelled with gender and age group [7]. We show some visually noticeable adversarial examples on the Adience dataset and find that, unlike for object recognition databases, these adversarial examples can be visually misleading to human beings when the adversarial noise is made large enough. Additionally, we find that, for networks that do not have batch normalisation layers [8], such as VGGNets [9], the distribution of features of adversarial examples differs from that of the original data. Therefore, we add a domain adaptation [10] to further improve generalisability.

In this paper, we review methods for generating adversarial examples and methods for preventing white-box and black-box attacks in Section 2. The dataset and some visually noticeable adversarial examples are shown in Section 3. In Section 4 we introduce our methodology in detail, and in Section 5 we present and analyse the experimental results. Finally, in Section 6 we conclude with what we have found and achieved and what can be done in future work.

2 Related Work

Various countermeasures against adversarial examples have been proposed for different types of attacks. The two main defense approaches are: 1) detecting and rejecting adversarial examples at the testing stage to prevent adversarial attacks, and 2) making the networks themselves more robust to adversarial examples, e.g. via adversarial training. In this section, algorithms for generating adversarial examples are listed first, and then some state-of-the-art countermeasures are introduced.

2.1 Algorithms for Generating Adversarial Examples

2.1.1 Fast Gradient Sign Method (FGSM)

FGSM computes a one-step gradient and adds the sign of the gradient to the raw image [2]. It is defined as:

$x^{adv} = x + \epsilon \cdot \mathrm{sign}(\nabla_{x} J(x, y))$  (1)

where $x$ is the raw image, $y$ is its label, $J$ is the loss function and $\epsilon$ controls the magnitude of the adversarial noise. Some variants of FGSM are used to enhance the generated adversarial examples. For example, the Fast Gradient Value method removes the sign function [11], and [12] combined momentum and an ensembling method with FGSM to win the NIPS 2017 Targeted and Non-Targeted Adversarial Attacks Competition.
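As a concrete illustration, a minimal PyTorch-style sketch of Equation 1 might look as follows; the function name fgsm and the clamp to the [0, 1] pixel range are our own choices rather than details from the paper.

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, epsilon):
    """One-step FGSM (Eq. 1): add the sign of the loss gradient to the input."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    x_adv = x + epsilon * x.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()   # keep pixels in a valid range (our assumption)
```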

2.1.2 Single-Step Least Likely Class method (Step-LL)

Step-LL replaces the true label $y$ with the least likely class $y_{LL}$, and [13] showed it was the most effective attack for adversarial training on ImageNet. It is defined as:

$x^{adv} = x - \epsilon \cdot \mathrm{sign}(\nabla_{x} J(x, y_{LL}))$  (2)

where $y_{LL} = \arg\min_{y} p(y \mid x)$ and $p(y \mid x)$ is the probability that the network predicts class $y$ given the input $x$.
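A corresponding sketch of Equation 2, again in PyTorch style; taking the arg-min over logits (equivalent to the arg-min over softmax probabilities) and the [0, 1] clamp are our assumptions.

```python
import torch
import torch.nn.functional as F

def step_ll(model, x, epsilon):
    """Single-step least-likely class attack (Eq. 2)."""
    x = x.clone().detach().requires_grad_(True)
    logits = model(x)
    y_ll = logits.argmin(dim=1)              # least likely class under the current model
    loss = F.cross_entropy(logits, y_ll)
    loss.backward()
    x_adv = x - epsilon * x.grad.sign()      # step towards the least likely class
    return x_adv.clamp(0.0, 1.0).detach()
```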

2.1.3 Randomised single-step attack (R+Step-LL)

[6] showed that the loss function is not smooth in the vicinity of data points, so simply using FGSM or ILLC might not suffice to find the actual adversarial direction. Therefore, they proposed a randomised single-step attack which first takes a small random step to escape the non-smooth vicinity before computing gradients. R+Step-LL is defined as:

$x' = x + \alpha \cdot \mathrm{sign}(\mathcal{N}(\mathbf{0}, \mathbf{I})), \quad x^{adv} = x' - (\epsilon - \alpha) \cdot \mathrm{sign}(\nabla_{x'} J(x', y_{LL}))$  (3)

Other algorithms such as DeepFool [14], CPPN EA Fool [15], the Hot/Cold method [11] and Natural GANs [16] can also be used to generate adversarial examples. In particular, stronger adversarial attacks that require much smaller perturbations, such as the attacks introduced by [17], could be used as the main adversarial attacks in future work: they often succeed with less than 4/256 distortion, which is normally not visually noticeable. Additionally, Step-LL, FGSM and R+Step-LL can be iterated many times, but in this paper we focus on single-step attacks and direct readers to [18] for further insights into iterative attacks. However, even though the approach in [18] is effective against weak iterative adversarial attacks, it has been broken by [19]. We direct readers to the holistic survey [20] for further information about generating adversarial examples.

In this paper, we use single-step R+Step-LL to generate adversarial examples for both training and testing; the choice of the magnitude $\epsilon$ is discussed in Section 3.
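For reference, a hedged PyTorch-style sketch of the R+Step-LL attack of Equation 3; the helper name r_step_ll is ours, and we assume $\alpha < \epsilon$ as in the formulation of [6].

```python
import torch
import torch.nn.functional as F

def r_step_ll(model, x, epsilon, alpha):
    """Randomised single-step least-likely class attack (Eq. 3)."""
    # Small random step to escape the non-smooth vicinity of x.
    x_prime = (x + alpha * torch.randn_like(x).sign()).detach().requires_grad_(True)
    logits = model(x_prime)
    y_ll = logits.argmin(dim=1)
    loss = F.cross_entropy(logits, y_ll)
    loss.backward()
    x_adv = x_prime - (epsilon - alpha) * x_prime.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()
```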

2.2 Countermeasures Without Adversarial Training

Both of the main defense approaches mentioned in Section 2 can fight adversarial examples without adversarial training. In this section, we introduce cutting-edge methods of the first type, and then methods of the second type that do not involve adversarial training. These methods have proven effective against weak adversarial attacks such as FGSM or Step-LL, but some of them have been broken by stronger attacks [19], such as the attacks introduced by [17].

At the testing stage, adversarial attacks can be prevented by either: 1) training a separate classifier to distinguish adversarial examples from clean data [21] [22] [23] [24], or 2) finding the differences between them by analysing their features. A wide variety of tricks can be used in the first case. For example, [25] used soft labels and added a null class to counteract adversarial examples. In the second case, the features chosen to distinguish adversarial examples from clean data vary. For example, [26] found from a Bayesian perspective that the certainty of adversarial examples is higher than that of clean data, and [27] found that the coefficients of low-ranked components differ between adversarial examples and clean data. However, both have been broken by stronger attacks with only a slight increase in distortion [19] [28]. Given that [29] found that adversarial examples have a different distribution from clean data, we combine a simple domain adaptation with our method to further improve the performance.

It is also possible to make networks more resilient to adversarial examples without adversarial training. For example, [30] used a high-temperature softmax (distillation) to make models less sensitive to unnoticeable perturbations, and [31] used double back-propagation to penalise large gradients and found that this regularisation scheme is equivalent to first-order adversarial training. However, it has been shown that distillation does not make networks more robust to stronger attacks [17] [32].

2.3 Adversarial Training Methods

Adversarial training, as introduced by [2], can prevent white-box attacks. However, instead of generally reducing adversarial vulnerability, the method can cause gradient masking [6]. Due to this problem, after single-step adversarial training the adversarially trained networks can only generate adversarial examples that are easier for undefended networks to classify, while their decision boundaries remain unchanged [1]. Thus, the adversarially trained networks remain vulnerable to black-box attacks.

In order to reduce the risk of black-box attacks, [6] proposed a method of ensemble adversarial training which used one pre-trained network only for generating adversarial examples and then used those adversarial examples to train another network. This way, the adversarially trained network became more resilient to black-box attacks from a third holdout network.

Similarly, [33] proposed cascade adversarial machine learning regularised with a unified embedding, which uses an already-defended network to generate adversarial examples for re-training another one. They found that iterative attacks transfer more easily between networks that are trained using the same strategy, i.e. standard training or Kurakin's adversarial training [13]. They also introduced a regularisation with a unified embedding which aligns the features of adversarial examples with those of their corresponding clean data. This way, visually similar images have similar features, which improves the robustness of the networks.

3 Adversarial Examples of Faces

In this section, we first introduce the dataset we use and then show comparisons between original data and visually noticeable adversarial examples. We find that, unlike in object recognition, some adversarial examples of faces can be misleading to human beings. Finally, we show some results on white-box attacks with different parameter values. Adversarial examples of faces pose serious safety threats to many face recognition systems, driver monitoring systems and security surveillance systems; a more effective method to fight against this type of adversarial example is therefore necessary.

The Adience dataset contains 26,580 unconstrained images of faces from 2,284 subjects, each labelled with gender and one of eight age groups (0-2, 4-6, 8-13, 15-20, 25-32, 38-43, 48-53, 60-). The images were collected from Flickr albums and released by their authors under the Creative Commons (CC) license. All images were taken completely "in the wild", i.e. under different variations in appearance, noise, pose, blurring and lighting conditions [7]. Following its protocol, five-fold cross-validation is used to make the results more statistically significant [34] [35].

The comparison between some clean testing images and their adversarial examples on the Adience dataset is shown in Figure 1. These clean testing images are all classified correctly by a ResNet50-Face [36] that is pre-trained on the VGGFace database [37] and fine-tuned on the Adience dataset, while their adversarial examples (generated using R+Step-LL) are all misclassified. Similar results are found when using a VGG16-Face that is pre-trained on the VGGFace database. As shown in Figure 1, when $\epsilon$ in Equation 3 (R+Step-LL) is set large enough, the difference between the original data and the adversarial examples becomes clearly visible. We found that these adversarial examples can be visually misleading to human beings. For example, the adversarial example of the top left pair actually looks older than the clean image, and the adversarial example of the top right pair looks younger than the clean one.

Figure 1: For each pair, the clean image is on the right and its adversarial example is on the left. In each case the clean image is correctly labelled with its age group, while its adversarial example (generated with a large $\epsilon$) is misclassified into a different age group.
$\epsilon$       32/256   16/256   8/256    4/256    Clean data
VGG16-Face       3.95%    4.30%    6.97%    12.13%   56.73%
ResNet50-Face    4.29%    8.79%    10.65%   21.8%    59.08%
Table 1: Classification rates under white-box attacks (R+Step-LL).

When $\epsilon$ is set to smaller values, adversarial examples become less visually noticeable but retain the ability to easily fool networks on the Adience dataset, as shown in Table 1. We thus choose R+Step-LL with $\epsilon = 16/256$ to generate adversarial examples for the following experiments.
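For concreteness, white-box classification rates such as those in Table 1 can be estimated with a loop along the following lines. This is only a sketch: it assumes the r_step_ll helper sketched in Section 2.1.3 and a standard PyTorch data loader, and is not the authors' evaluation code.

```python
import torch

def white_box_accuracy(model, loader, epsilon, alpha):
    """Classification rate under white-box R+Step-LL attacks (cf. Table 1)."""
    model.eval()
    correct, total = 0, 0
    for x, y in loader:
        x_adv = r_step_ll(model, x, epsilon, alpha)   # attack the model being evaluated
        with torch.no_grad():
            correct += (model(x_adv).argmax(dim=1) == y).sum().item()
        total += y.size(0)
    return correct / total
```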

4 Simultaneous Adversarial Training Method (SATM)

SATM is proposed to alleviate black-box attacks. The procedure is shown in Figure 2 and the algorithm is listed in Algorithm 1. The method re-trains two networks, $N_1$ and $N_2$, simultaneously: each is re-trained using clean data together with adversarial examples generated from the other network. This way, both networks become more resilient to black-box attacks from a third holdout network. We also combine domain adaptation with SATM, which improves generalisability, especially for networks that do not include batch normalisation layers, such as VGGNets.

1:  Fine-tune N1 and N2 using clean data
2:  procedure SATM(N1, N2, X, Y, ε, α)
3:      Initialise(N1, N2, X, Y, ε, α)
4:      while not early stopping do                    ▷ Based on validation performance
5:          X1_adv ← R+Step-LL(N2, X, ε, α)            ▷ Adversarial examples for re-training N1
6:          X2_adv ← R+Step-LL(N1, X, ε, α)            ▷ Adversarial examples for re-training N2
7:          L1_clean ← Loss(N1(X), Y)
8:          L1_adv ← Loss(N1(X1_adv), Y)
9:          L1_da ← DomainLoss(N1's features of X, N1's features of X1_adv)
10:         Update N1 using L1_clean + L1_adv and the reversed gradient of L1_da
11:         L2_clean ← Loss(N2(X), Y)
12:         L2_adv ← Loss(N2(X2_adv), Y)
13:         L2_da ← DomainLoss(N2's features of X, N2's features of X2_adv)
14:         Update N2 using L2_clean + L2_adv and the reversed gradient of L2_da
15:     end while
16: end procedure
17: function R+Step-LL(N, X, ε, α)
18:     X′ ← X + α · sign(N(0, I))
19:     X_adv ← X′ − (ε − α) · sign(∇_{X′} J_N(X′, y_LL))
20:     return X_adv                                   ▷ Return adversarial examples
21: end function
Algorithm 1 Simultaneous Adversarial Training Method (SATM)
Figure 2: SATM retrains two networks together, each using clean data and adversarial examples generated from the other network. Additionally, SATM uses a domain adaptation block to align features of clean data and adversarial examples, which improves generalisability.

4.1 Simultaneous Adversarial Training

SATM re-trains two networks simultaneously; in this section we first describe the method from the perspective of one network, and then we discuss the scalability of SATM and generalise it to training multiple (more than two) networks simultaneously.

As explained in Algorithm 1, for $N_1$, SATM uses clean data and adversarial examples that are generated from $N_2$ to re-train it. In order to avoid the problem of gradient masking to the largest extent, we do not use adversarial examples generated from $N_1$ itself (namely white-box adversarial examples) to re-train it. Therefore, after $N_1$ has been re-trained using SATM, its adversarial examples still remain "strong" enough to easily fool undefended networks. Similarly, after using SATM, $N_2$'s adversarial examples also remain "strong" enough. This means that during the whole re-training process, SATM uses clean data and "strong" adversarial examples generated from $N_2$ to re-train $N_1$. If $N_2$ were completely frozen, SATM would reduce to Ensemble Adversarial Training [6] without white-box adversarial examples. However, under SATM, $N_2$'s adversarial examples do not simply follow one fixed distribution, because SATM also re-trains $N_2$. This way, both networks become more resilient to black-box attacks.
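To make the procedure concrete, a minimal PyTorch-style sketch of one SATM iteration (without the domain adaptation block of Section 4.2) is given below. It assumes the r_step_ll helper sketched earlier; the equal weighting of the clean and adversarial losses is our assumption, not a detail taken from the paper.

```python
import torch
import torch.nn.functional as F

def satm_step(net1, net2, opt1, opt2, x, y, epsilon, alpha):
    """One SATM iteration: each network is re-trained on clean data plus
    adversarial examples generated from the *other* network (sketch only)."""
    # Cross-generated adversarial examples (no white-box examples are used).
    x_adv_for_1 = r_step_ll(net2, x, epsilon, alpha)   # generated from net2, used to re-train net1
    x_adv_for_2 = r_step_ll(net1, x, epsilon, alpha)   # generated from net1, used to re-train net2

    loss1 = F.cross_entropy(net1(x), y) + F.cross_entropy(net1(x_adv_for_1), y)
    loss2 = F.cross_entropy(net2(x), y) + F.cross_entropy(net2(x_adv_for_2), y)

    # zero_grad also clears gradients accumulated while generating the attacks
    opt1.zero_grad(); loss1.backward(); opt1.step()
    opt2.zero_grad(); loss2.backward(); opt2.step()
    return loss1.item(), loss2.item()
```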

4.2 Domain Adaptation

We combine domain adaptation with SATM as shown in Figure 2. For networks without batch normalisation layers, such as VGGNets, the distribution of features of clean data and that of adversarial examples can be different. We use a simple domain adaptation block (a binary classifier with two fully-connected layers) to reduce the difference and improve generalisability. For $N_1$, the domain adaptation block distinguishes $N_1$'s features of clean data from its features of adversarial examples. As shown in Figure 2, the gradient of the domain adaptation block goes through a gradient reversal layer and then flows back to $N_1$. This way, the features generated by $N_1$ become more indistinguishable across the two domains, and generalisability is thus improved [10]. More advanced domain adaptation methods such as Adversarial Discriminative Domain Adaptation [38] could replace this simplest domain adaptation block, but here we focus on SATM and show that by combining a domain adaptation method with SATM, networks can become more resilient to black-box attacks.
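A sketch of what such a domain adaptation block could look like in PyTorch is given below: a gradient reversal function in the style of [10] followed by a two-layer binary classifier. The hidden width and the reversal coefficient are illustrative choices, not values from the paper.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass, negated (reversed) gradient in the backward pass [10]."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

class DomainAdaptationBlock(nn.Module):
    """Binary classifier with two fully-connected layers that tries to tell
    clean-data features from adversarial-example features; the reversed
    gradient pushes the feature extractor to make them indistinguishable.
    (Hidden size 256 and lam=1.0 are our choices, not from the paper.)"""
    def __init__(self, feat_dim, hidden=256, lam=1.0):
        super().__init__()
        self.lam = lam
        self.classifier = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, 2))

    def forward(self, features):
        return self.classifier(GradReverse.apply(features, self.lam))
```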

4.3 SATM with Multiple Networks

Multiple (more than two) networks can be re-trained using SATM. If SATM is used to re-train networks $N_1, \dots, N_k$, then for each $N_i$, clean data and adversarial examples generated from the other networks would be used to re-train it. Adversarial examples can be generated by the other networks interchangeably to ensure that we still use clean data 50 percent of the time, as sketched below. More advanced domain adaptation methods would be needed to deal with the resulting multi-domain adaptation problem. However, when more than two networks are included, the number of adversarial examples from each source network within a mini-batch can drop below two, which might affect the performance of batch normalisation layers and the domain adaptation method. Therefore we leave SATM with multiple networks as future work. We believe that SATM with multiple networks would cover more types of adversarial perturbations, so networks might ultimately become even more robust against black-box attacks.
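As a sketch of the interchangeable generation mentioned above, the following generator cycles through the other networks for each network's adversarial half-batch; it illustrates the scheduling idea only and is not the authors' implementation.

```python
import itertools

def adversary_schedule(networks):
    """Round-robin choice of which *other* network supplies the adversarial half
    of each mini-batch, so every source is used interchangeably while the other
    half of the batch stays clean (a sketch)."""
    cycles = [itertools.cycle([m for m in networks if m is not n]) for n in networks]
    while True:
        yield [next(c) for c in cycles]   # one adversarial source per network, this iteration

# Usage sketch: for sources in adversary_schedule([net1, net2, net3]):
#     re-train networks[i] on clean data plus R+Step-LL examples from sources[i]
```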

5 Experiments

In this section we show experimental results on the Adience dataset using SATM. We use SATM to re-train the fine-tuned VGG16-Face and ResNet50-Face simultaneously. We first show results for white-box attacks, then results for black-box attacks before and after using SATM, and we show that SATM converges. Finally, we show results for black-box attacks from a third holdout network (ResNet101 or InceptionResNetV2, pre-trained on ImageNet and fine-tuned on the Adience dataset) before and after SATM. Results are evaluated using the classification rate and the one-off classification rate. The one-off classification rate is defined as:

$\text{one-off} = \dfrac{\sum_{i=1}^{K} \sum_{j:\,|i-j| \le 1} n_{ij}}{\sum_{i=1}^{K} \sum_{j=1}^{K} n_{ij}}$  (4)

where $K$ is the number of classes and $n_{ij}$ is the number of examples of class $i$ (mis-)classified as class $j$ [34]. The five-fold cross-validation defined in [34] is used as the protocol to evaluate performance, so the results are more statistically significant.
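A minimal sketch of Equation 4, assuming a K×K confusion matrix whose rows and columns are ordered so that adjacent age groups have adjacent indices:

```python
import numpy as np

def one_off_rate(confusion):
    """One-off classification rate (Eq. 4) from a K x K confusion matrix,
    where confusion[i, j] counts examples of class i predicted as class j."""
    confusion = np.asarray(confusion)
    k = confusion.shape[0]
    within_one = sum(confusion[i, j]
                     for i in range(k) for j in range(k) if abs(i - j) <= 1)
    return within_one / confusion.sum()
```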

5.1 White-Box Attacks

As shown in Table 2, both networks become more robust to white-box attacks, even though SATM is not designed to prevent white-box attacks; SATM does not use any white-box adversarial examples to re-train the networks, yet the classification rate and one-off rate under white-box attacks still increase. As shown in Table 3, adversarial examples generated from these adversarially re-trained networks remain "strong" enough to easily fool undefended networks such as InceptionV3 or InceptionResNetV2, which indicates that the improvement for white-box attacks does not come from gradient masking.

Model           Cls rate (before SATM)  Cls rate (after SATM)  One-off (before SATM)  One-off (after SATM)
VGG16-Face      4.30%                   7.22%                  13.90%                 21.43%
ResNet50-Face   8.79%                   12.56%                 26.30%                 29.89%
Table 2: Results of white-box attacks. The classification rate and one-off rate increase after using SATM.

Model        Adv model      Cls rate (before SATM)  Cls rate (after SATM)  One-off (before SATM)  One-off (after SATM)
InceptionV3  VGG16-Face     14.49%                  14.32%                 45.01%                 47.60%
InceptionV3  ResNet50-Face  23.03%                  19.67%                 58.81%                 53.99%
Table 3: The first column is the model being attacked and the second column is the model used to generate the adversarial examples (attacks). After using SATM, the classification rate and one-off rate remain almost unchanged.

5.2 Black-Box Attacks

As shown in Table 4, after VGG16-Face and ResNet50-Face have been adversarially trained using SATM, they both become resilient to each other's adversarial examples. The classification rates under black-box attacks increase to 52.40% and 43.25% respectively, moving much closer to the classification rates on clean testing data (56.73% and 59.08% in Table 1).

Model          Adv model      Cls rate (before SATM)  Cls rate (after SATM)  One-off (before SATM)  One-off (after SATM)
VGG16-Face     ResNet50-Face  35.39%                  52.40%                 66.99%                 88.55%
ResNet50-Face  VGG16-Face     16.25%                  43.25%                 36.16%                 76.11%
Table 4: Results of black-box attacks before and after using SATM.

5.3 Convergence

As shown in Figure 3 and Figure 4, during training, the classification rates on clean data and on adversarial examples stay relatively close to each other on both the training set and the validation set. After using SATM, the classification rates on clean data and on adversarial examples from the training set converge to similar values, which indicates that SATM converges. In other words, after a certain number of iterations (in this case after 10,000 iterations with mini-batches of size 8), the distributions of adversarial examples change more slowly than the networks adapt to the change. We run the experiments on a Tesla K80 GPU, on which SATM converges in about 20 hours.

Figure 3: VGG16-Face's classification rates on clean data and adversarial examples from the training set. The x-axis is the number of epochs.

Figure 4: VGG16-Face's classification rates on clean data and adversarial examples from the validation set. The x-axis is the number of epochs.

5.4 Black-Box Attacks from a Third Holdout Network

As shown in Table 5, there is a significant improvement in performance under black-box attacks from a third holdout network (ResNet101 or InceptionResNetV2). We also show that SATM outperforms the state-of-the-art method (Ensemble Adversarial Training) on the Adience database. In order to compare directly with Ensemble Adversarial Training, we do not use any white-box examples to re-train the networks, and we find that the best performances all come from SATM. For the Inception networks, when $\epsilon$ is kept at the value used above, both undefended and adversarially trained VGG16 and ResNet50 are already resilient to their adversarial examples; we therefore use a different value of $\epsilon$ when testing on InceptionResNetV2.

Model          Adv model          Undefended     Ensemble       SATM
VGG16-Face     ResNet101-Img      31.29/62.61%   35.87/68.04%   37.40/72.16%
ResNet50-Face  ResNet101-Img      22.59/47.57%   25.84/54.62%   29.19/58.27%
VGG16-Face     InceptionResNetV2  37.74/62.48%   39.61/66.31%   42.59/67.23%
ResNet50-Face  InceptionResNetV2  40.23/71.09%   43.12/78.62%   48.26/80.63%
Table 5: Classification rates/one-off rates under black-box attacks from a third holdout network, i.e. InceptionResNetV2 or ResNet101-Img, which are pre-trained on ImageNet and fine-tuned on the Adience dataset.

5.5 Experiments on Networks with the Same Structure

We set both $N_1$ and $N_2$ to be ResNet50-Face (the same structure and the same initialisation), and found that this led to divergence of the algorithm. A potential reason is that, in this setting, the two networks are effectively learning from their own "white-box attacks", while they are not able to mask each other's gradients. Additionally, we also conduct experiments on ResNet50-Face and ResNet50-Img (the same structure but different initialisation). As shown in Table 6, this combination leads to an accuracy improvement on adversarial examples generated by ResNet101-Img, but it also leads to an accuracy decrease on adversarial examples generated by InceptionResNetV2. This may be because adversarial examples generated by ResNet101-Img resemble those generated by ResNet50-Img, while they are less correlated with adversarial examples generated by InceptionResNetV2.

Model          Adv model          Trained with VGG16-Face  Trained with ResNet50-Img
ResNet50-Face  ResNet101-Img      29.19/58.27%             31.61/61.94%
ResNet50-Face  InceptionResNetV2  48.26/80.63%             45.62/77.54%
Table 6: Comparison (classification rates/one-off rates) between training ResNet50-Face together with VGG16-Face and training ResNet50-Face together with ResNet50-Img.

5.6 Experiments on Other Databases

A series of experiments on the MNIST and ImageNet databases are also conducted; however, compared with the ensemble adversarial training method [6], no significant improvement is found on ImageNet and no improvement is found on MNIST (although SATM is not worse either). For ImageNet, we use the same testing method as [6], where 10,000 testing images are randomly chosen. InceptionResNetV2 and VGG16-Face are trained using SATM, and ResNet101 is chosen as the third holdout network to generate black-box attacks. As shown in Table 7, compared with ensemble adversarial training, SATM decreases the top-1 error rate by 0.6% and the top-5 error rate by 0.2% on ImageNet. For MNIST, we re-train the network structures from [6] and report the averaged black-box attack error rate. As shown in Table 7, no significant improvement can be found on MNIST using SATM compared with ensemble adversarial training.

Method    ImageNet Top-1  ImageNet Top-5  MNIST
Ensemble  27.0%           7.9%            5.2%
SATM      26.4%           7.7%            5.6%
Table 7: Classification error rates on ImageNet and MNIST using ensemble adversarial training and SATM.

Different adversarial training methods can be more effective in specific domains (hand-written digits, faces or object classification). A potential reason is that the distribution of adversarial examples of faces changes more quickly during the re-training process, so by using SATM the networks have the chance to learn from adversarial examples with more varied distributions. However, this needs to be supported by a more complete hypothesis and further experiments, and we leave this topic for future work.

6 Conclusion and Future Work

We propose a novel method (SATM) which trains multiple networks simultaneously to improve their robustness to black-box attacks without encountering the problem of gradient masking. In order to achieve this, SATM uses adversarial examples that are generated from other networks to re-train the targeted network. This way, these networks learn from each other's adversarial examples dynamically and thus all become more resilient to single-step black-box attacks. Furthermore, we also include a simple domain adaptation method that aligns features of clean data and features of adversarial examples to improve performance. We conduct a series of experiments and show that, by using SATM, networks become slightly more resilient to single-step white-box attacks and significantly more resilient to single-step black-box attacks, while their adversarial examples remain "strong" enough to easily fool undefended networks. We also show that SATM outperforms the state-of-the-art method on single-step black-box attacks from holdout networks. In order to further improve the performance, future work could use white-box examples with SATM in various ways, employ stronger iterative adversarial attacks, and combine a more deliberate domain adaptation method with SATM.

References