Deep neural networks are widely used for facial recognition, surveillance, autonomous driving and other tasks that require a high standard of safety. However, it has been found that by adding unnoticeable perturbations, i.e. adversarial noise, to input images, deep neural networks can be easily fooled. Furthermore, these adversarial examples often transfer: adversarial examples that fool one network can also easily fool another. Attacks that exploit this transferability without access to the target network are known as black-box attacks. In order to increase robustness to the transferability of adversarial examples for faces, we propose a novel method that allows two networks to learn from each other's adversarial examples.
Standard adversarial training is proven to be effective against the same type of white-box attacks that are used to generate the adversarial examples, but ineffective against black-box attacks due to gradient masking. After a network has been trained, its adversarial examples can be generated by various methods, such as the Fast Gradient Sign Method or the Least Likely Class method. These adversarial examples can easily fool the network because they are found using the network's own weights and gradients; such attacks are known as white-box attacks. Standard adversarial training re-trains the network on original data together with its adversarial examples to make it more robust to the same type of white-box attacks. However, it has been found that adversarial examples generated from defended networks (using standard adversarial training) lose the ability to easily fool undefended networks due to the problem of gradient masking. Furthermore, it has also been found that the decision boundaries of the defended networks remain unchanged, and thus the defended networks remain vulnerable to black-box attacks.
We propose a method that trains two networks simultaneously to make both of them more resilient to black-box attacks from a third holdout network, and we call it the Simultaneous Adversarial Training Method (SATM). SATM is implemented and tested on the Adience dataset, where 26,580 faces are labelled with gender and age group. We show some visually noticeable adversarial examples on the Adience dataset and find that, unlike datasets for object recognition, these adversarial examples can be visually misleading to human beings when the adversarial noise is set large enough. Additionally, we find that, for networks that do not have batch normalisation layers, such as VGGNets, the distribution of features of adversarial examples differs from that of the original data. Therefore, we add a domain adaptation block to further improve generalisability.
In this paper, we review methods to generate adversarial examples and methods to prevent white-box and black-box attacks in Section 2. The dataset and some visually noticeable adversarial examples are shown in Section 3. In Section 4 we introduce our methodology in detail, and in Section 5 we show and analyse experimental results. Finally, in Section 6 we conclude what we have found and achieved and what can be done in future work.
2 Related Work
Various countermeasures for adversarial examples have been proposed for different types of attacks. Two main defense approaches are: 1) detecting and rejecting adversarial examples in testing stage to prevent adversarial attacks, and 2) making networks themselves more robust to adversarial examples, e.g. adversarial training. In this section, algorithms for generating adversarial examples are listed first, and then some state-of-the-art countermeasures are introduced.
2.1 Algorithms for Generating Adversarial Examples
2.1.1 Fast Gradient Sign Method (FGSM)
FGSM computes a one-step gradient and adds the sign of the gradient to the raw images. It is defined as:

X^{adv} = X + \epsilon \cdot \mathrm{sign}(\nabla_X J(X, y_{true}))

where X is the raw images, y_{true} is the labels, J is the loss function and \epsilon controls the magnitude of the adversarial noise. Some variants of FGSM are used to enhance the generated adversarial examples. For example, the Fast Gradient Value method removed the sign function, and combining momentum and an ensembling method with FGSM won the NIPS 2017 Targeted and Non-Targeted Adversarial Attacks Competition.
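As a concrete illustration, the FGSM step can be sketched in a few lines. The linear softmax classifier below is a hypothetical stand-in model (not one of the networks used in this paper), chosen so the input gradient has a closed form:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def fgsm(x, y, W, b, eps):
    """One FGSM step: X_adv = X + eps * sign(grad_X J(X, y))."""
    p = softmax(W @ x + b)                      # class probabilities
    grad_x = W.T @ (p - np.eye(len(p))[y])      # grad of cross-entropy w.r.t. x
    return np.clip(x + eps * np.sign(grad_x), 0.0, 1.0)  # stay in pixel range
```

For a deep network the closed-form gradient is replaced by back-propagation to the input; the single sign-of-gradient step is the same.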
2.1.2 Single-Step Least Likely Class method (Step-LL)

Step-LL replaces the true label in FGSM with the least likely class y_{LL} = \arg\min_{y} p(y \mid X) predicted by the network, and perturbs the input to decrease the loss of that class:

X^{adv} = X - \epsilon \cdot \mathrm{sign}(\nabla_X J(X, y_{LL}))
2.1.3 Randomised single-step attack (R+Step-LL)
 showed that the vicinity of data points in the loss function is not smooth. Simply using FGSM or ILLC might not suffice to find the actual adversarial direction. Therefore, they proposed a new randomised single-step attack which adds a small random step to escape from the non-smooth vicinity before computing gradients. R+Step-LL is defined as:

X' = X + \alpha \cdot \mathrm{sign}(\mathcal{N}(0, I)), \qquad X^{adv} = X' - (\epsilon - \alpha) \cdot \mathrm{sign}(\nabla_{X'} J(X', y_{LL}))
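The two steps above can be sketched as follows, again using a hypothetical linear softmax classifier so the gradient is available in closed form; \epsilon and \alpha are the attack parameters:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def r_step_ll(x, W, b, eps, alpha, rng):
    """R+Step-LL: a random step of size alpha, then a (eps - alpha)-sized step
    that decreases the loss of the least likely class y_LL."""
    x_r = x + alpha * np.sign(rng.normal(size=x.shape))  # escape non-smooth vicinity
    p = softmax(W @ x_r + b)
    y_ll = int(np.argmin(p))                             # least likely class
    grad_x = W.T @ (p - np.eye(len(p))[y_ll])            # grad of CE w.r.t. x at y_LL
    return np.clip(x_r - (eps - alpha) * np.sign(grad_x), 0.0, 1.0)
```

The total perturbation stays within an L-infinity ball of radius \epsilon (before clipping to the valid pixel range).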
Other algorithms such as DeepFool, CPPN EA Fool, the Hot/Cold method and Natural GANs can also be used to generate adversarial examples. In particular, stronger adversarial attacks that require much smaller perturbations, such as the attacks introduced by , can be used as the main adversarial attacks in future work. They often succeed with less than 4/256 distortion and are normally not visually noticeable. Additionally, Step-LL, FGSM and R+Step-LL can be iterated many times, but in this paper we focus on single-step attacks and direct readers to  for further insights about iterative attacks. However, even though such defenses are effective against weak iterative adversarial attacks, they have been broken by stronger ones. We direct readers to a holistic survey  for further information about generating adversarial examples.
In this paper, we use single-step R+Step-LL to generate adversarial examples for both training and testing, with \epsilon and \alpha chosen as described in Section 3.
2.2 Countermeasures Without Adversarial Training
Both of the main defense approaches mentioned in Section 2 can fight against adversarial examples without adversarial training. In this section, we introduce cutting-edge methods of the first type, and then methods of the second type that do not involve adversarial training. They were proven to be effective against weak adversarial attacks such as FGSM or Step-LL, but some of them have been broken by stronger attacks.
In the testing stage, adversarial examples can be prevented by either: 1) training a separate classifier to distinguish adversarial examples from clean data, or 2) finding the differences between them by analysing their features. A wide variety of tricks can be used in the first case. For example,  used soft labels and added a null class to counteract adversarial examples. In the second case, the features chosen to distinguish adversarial examples from clean data vary. For example,  found that the certainty of adversarial examples is higher than that of clean data from a Bayesian perspective, and  found that coefficients in low-ranked components differ between adversarial examples and clean data. However, both have been broken by stronger attacks with a slight increase in distortion. Given that  found that adversarial examples have a different distribution from clean data, we combine a simple domain adaptation with our method to further improve the performance.
It is also possible to make networks more resilient to adversarial examples without adversarial training.  used high-temperature softmax to make models less sensitive to unnoticeable perturbations.  used double back-propagation to penalise large gradients, and found that this regularisation scheme is equivalent to first-order adversarial training. However, it has been shown that distillation does not make networks more robust to stronger attacks.
2.3 Adversarial Training Methods
Adversarial training introduced by  can prevent white-box attacks. However, instead of generally reducing adversarial vulnerability, the method can cause the problem of gradient masking. Due to this problem, after single-step adversarial training, the adversarially trained networks can only generate adversarial examples that are easier for undefended networks to classify, while the decision boundaries of the adversarially trained networks remain unchanged. Thus, the adversarially trained networks remain vulnerable to black-box attacks.
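The objective of standard adversarial training can be sketched as follows, assuming an equal weighting between the clean and adversarial loss terms (the weighting in the cited work may differ) and, as before, a hypothetical linear softmax model so all gradients have closed forms:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def adv_train_step(x, y, W, b, eps, lr):
    """One step of standard adversarial training on a linear softmax model:
    minimise the average of the clean loss and the FGSM-adversarial loss."""
    k = W.shape[0]
    onehot = np.eye(k)[y]
    # craft the white-box adversarial example with one FGSM step
    p = softmax(W @ x + b)
    x_adv = np.clip(x + eps * np.sign(W.T @ (p - onehot)), 0.0, 1.0)
    # gradient of the mixed (clean + adversarial) loss w.r.t. the weights
    gW, gb = np.zeros_like(W), np.zeros_like(b)
    for xi in (x, x_adv):
        d = softmax(W @ xi + b) - onehot
        gW += 0.5 * np.outer(d, xi)
        gb += 0.5 * d
    return W - lr * gW, b - lr * gb
```

Because the adversarial example is crafted from the model's own current gradients, repeating this step is exactly the white-box retraining that leads to gradient masking.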
In order to reduce the risk of black-box attacks,  proposed a method of ensemble adversarial training which used one pre-trained network only for generating adversarial examples and then used those adversarial examples to train another network. This way, the adversarially trained network became more resilient to black-box attacks from a third holdout network.
 proposed cascade adversarial machine learning regularised with a unified embedding, which uses one already-defended network to generate adversarial examples to re-train another one. They found that iterative attacks transfer more easily between networks trained using the same strategy, i.e. standard training or Kurakin's adversarial training. They also introduced a regularisation with a unified embedding which aligns features of adversarial examples with those of their corresponding clean data. This way, visually similar images have similar features, which improves the robustness of the networks.
3 Adversarial Examples of Faces
In this section, we first introduce the dataset we use, and then show comparisons between original data and visually noticeable adversarial examples. We find that, unlike in object recognition, some adversarial examples of faces can be misleading to human beings. Finally, we show some results on white-box attacks with different parameter values. Adversarial examples of faces pose serious safety threats to many face recognition systems, driver monitoring systems and security surveillance systems. Therefore, a more effective method to fight against this type of adversarial example is necessary.
The Adience dataset contains 26,580 unconstrained images of faces from 2,284 subjects, each labelled with gender and one of eight age groups (0-2, 4-6, 8-13, 15-20, 25-32, 38-43, 48-53, 60-). These images were collected from Flickr albums and were released by their authors under the Creative Commons (CC) license. All images were taken completely "in the wild", which means they were taken under different variations in appearance, noise, pose, blurring and lighting conditions. According to its protocol, five-fold cross-validation is used to make results more statistically significant.
The comparison between some clean testing images and their adversarial examples on the Adience dataset is shown in Figure 1. These clean testing images are all classified correctly by a ResNet50-Face  that is pre-trained on the VGGFace database  and fine-tuned on the Adience dataset. However, their adversarial examples (generated using R+Step-LL) are all misclassified. Similar results are found if we use a VGG16-Face that is pre-trained on the VGGFace database. As shown in Figure 1, when \epsilon in Equation 3 (R+Step-LL) is set large enough, the difference between the original data and the adversarial examples is clearly visible. We found that these adversarial examples can be visually misleading to human beings. For example, the adversarial example of the top left pair actually looks older than the clean one, and the adversarial example of the top right pair looks younger than the clean one.
When \epsilon is set to smaller values, adversarial examples become less visually noticeable but retain the ability to easily fool networks on the Adience dataset, as shown in Table 1. We thus choose R+Step-LL with these \epsilon and \alpha values to generate adversarial examples for the following experiments.
4 Simultaneous Adversarial Training Method (SATM)
SATM is proposed to alleviate black-box attacks. The procedure is shown in Figure 2 and the algorithm is listed in Algorithm 1. The method re-trains two networks simultaneously using clean data and adversarial examples generated from the other network. This way, both networks become more resilient to black-box attacks from a third holdout network. We also combine domain adaptation with SATM, which improves generalisability, especially for networks that do not include batch normalisation layers, such as VGGNets.
4.1 Simultaneous Adversarial Training
SATM re-trains two networks simultaneously, but in this section we describe the method from the perspective of one network first, and then we introduce the scalability of SATM and generalise it to training multiple (more than two) networks simultaneously.
As explained in Algorithm 1, for each network, SATM uses clean data and adversarial examples generated from the other network to re-train it. In order to avoid the problem of gradient masking to the largest extent, we do not use adversarial examples generated from a network itself (namely white-box adversarial examples) to re-train it. Therefore, after a network has been re-trained using SATM, its adversarial examples still remain "strong" enough to easily fool undefended networks, and the same holds for the other network. This means that during the whole re-training process, SATM uses clean data and "strong" adversarial examples generated from the other network to re-train each network. If one network is completely frozen, then SATM becomes Ensemble Adversarial Training  without using white-box adversarial examples. However, by using SATM, the adversarial examples of each network do not simply follow one distribution, because SATM also re-trains the other network. This way, both networks can become more resilient to black-box attacks.
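The re-training loop described above can be sketched as follows. Here `attack` and `train_step` are assumed interfaces standing in for R+Step-LL generation and a gradient update; the essential point is that each network only ever sees the other network's adversarial examples:

```python
def satm_epoch(net_a, net_b, data, attack, train_step):
    """One SATM epoch: each network is updated on clean data plus adversarial
    examples crafted from the *other* network, never its own white-box examples."""
    for x, y in data:
        adv_from_b = attack(net_b, x, y)   # black-box examples for net_a
        adv_from_a = attack(net_a, x, y)   # black-box examples for net_b
        train_step(net_a, [x, adv_from_b], [y, y])
        train_step(net_b, [x, adv_from_a], [y, y])
    return net_a, net_b
```

Because both networks keep changing, the adversarial examples each network receives do not settle into a single distribution, unlike the frozen-source setting of Ensemble Adversarial Training.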
4.2 Domain Adaptation
We combine domain adaptation with SATM as shown in Figure 2. For networks without batch normalisation layers, such as VGGNets, the distribution of clean data and the distribution of adversarial examples can be different. We use a simple domain adaptation block (a binary classifier with two fully-connected layers) to reduce the difference and improve generalisability. For each network, the domain adaptation block distinguishes features of clean data from features of the other network's adversarial examples. As shown in Figure 2, the gradient of the domain adaptation block goes through a gradient reversal layer and then flows back into the network. This way, the features generated by the network become more indistinguishable between the two domains, and generalisability is thus improved. More advanced domain adaptation methods such as Adversarial Discriminative Domain Adaptation  could replace this simplest domain adaptation block, but here we focus on SATM and show that by combining a domain adaptation method with SATM, networks can be made more resilient to black-box attacks.
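The gradient reversal layer itself is tiny: identity on the forward pass, sign-flipped gradient on the backward pass. A minimal sketch (the scaling factor `lam` is an assumption; the block in this paper may use a fixed value):

```python
import numpy as np

def grl_forward(features):
    """Forward pass of a gradient reversal layer: the identity."""
    return features

def grl_backward(grad_from_domain_classifier, lam=1.0):
    """Backward pass: negate (and optionally scale) the domain classifier's
    gradient, so the feature extractor is pushed to make clean and adversarial
    features indistinguishable."""
    return -lam * grad_from_domain_classifier
```

In an autodiff framework this pair would be wrapped as a custom operation so the reversal happens automatically during back-propagation.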
4.3 SATM with Multiple Networks
Multiple (more than two) networks can be re-trained using SATM. If SATM is used to re-train multiple networks, each network is re-trained using clean data and adversarial examples generated from the other networks. Adversarial examples can be generated by those networks interchangeably to ensure that clean data is still used 50 percent of the time. More advanced domain adaptation methods would be needed to deal with the resulting multi-domain adaptation problem. However, when more than two networks are included, the per-source batch size can become very small, which might affect the performance of batch normalisation layers and the domain adaptation method. Therefore, we leave SATM with multiple networks as future work. We believe that SATM with multiple networks would cover more types of adversarial perturbations, so networks might ultimately become more robust against black-box attacks.
5 Experiments

In this section we show experimental results on the Adience dataset using SATM. We use SATM to re-train fine-tuned VGG16-Face and ResNet50-Face simultaneously. We show results of white-box attacks first; then we show results of black-box attacks both before and after using SATM and show that SATM converges. Finally, we show results of black-box attacks from a third holdout network (chosen to be ResNet101 or InceptionResNetV2, pre-trained on ImageNet and fine-tuned on the Adience dataset) before and after SATM. Results are evaluated using the classification rate and the one-off classification rate. The one-off classification rate is defined as:
\text{one-off} = \frac{\sum_{i=1}^{C} \sum_{j:\,|i-j| \le 1} n_{ij}}{\sum_{i=1}^{C} \sum_{j=1}^{C} n_{ij}}

where C is the number of classes and n_{ij} is the number of examples of class i (mis-)classified as class j. Five-fold cross-validation as defined in  is used as the protocol to evaluate performance, so the results are statistically significant.
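The one-off rate can be computed directly from a confusion matrix; a small sketch, where `conf[i, j]` counts examples of class i classified as class j:

```python
import numpy as np

def one_off_rate(conf):
    """One-off classification rate: a prediction counts as correct if it falls
    in the true age group or an adjacent one (|i - j| <= 1)."""
    conf = np.asarray(conf, dtype=float)
    i, j = np.indices(conf.shape)
    return conf[np.abs(i - j) <= 1].sum() / conf.sum()
```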
5.1 White-Box Attacks
As shown in Table 3, both networks become more robust to white-box attacks, even though SATM is not designed to prevent white-box attacks; SATM does not use any white-box adversarial examples to re-train the networks, but the classification rate and one-off rate under white-box attacks still increase. As shown in Table 3, adversarial examples generated from these adversarially re-trained networks remain "strong" enough to easily fool undefended networks such as InceptionV3 or InceptionResNetV2, which indicates that the improvement on white-box attacks does not come from gradient masking.
| Model | Classification rate | One-off |
| VGG16-Face | 4.30% → 7.22% | 13.90% → 21.43% |
| ResNet50-Face | 8.79% → 12.56% | 26.30% → 29.89% |

| Model | Adv model | Classification rate | One-off |
| InceptionV3 | VGG16-Face | 14.49% → 14.32% | 45.01% → 47.60% |
| InceptionV3 | ResNet50-Face | 23.03% → 19.67% | 58.81% → 53.99% |
5.2 Black-Box Attacks
As shown in Table 4, after VGG16-Face and ResNet50-Face have been adversarially trained using SATM, they both become resilient to each other's adversarial examples. The classification rates under black-box attacks increase to 52.40% and 43.25% respectively, which are close to the classification rates on clean testing data.
| Model | Adv model | Classification rate | One-off |
| VGG16-Face | ResNet50-Face | 35.39% → 52.40% | 66.99% → 88.55% |
| ResNet50-Face | VGG16-Face | 16.25% → 43.25% | 36.16% → 76.11% |
5.3 Convergence

As shown in Figures 3 and 4, during training, the classification rates of clean data and adversarial examples are relatively similar to each other on both the training set and the validation set. After using SATM, the classification rates of clean data and adversarial examples on the training set are both high and stable, which indicates that SATM converges. In other words, after a certain number of iterations (in this case after 10,000 iterations with mini-batches of size 8), the distributions of adversarial examples change more slowly than the networks adapt to the change. We run experiments on a Tesla K80 GPU and SATM converges in 20 hours.
VGG16-Face's classification rates of clean data and adversarial examples on the training set. The x axis is the number of epochs.
5.4 Black-Box Attacks from a Third Holdout Network
As shown in Table 5, we can see a significant improvement in performance under black-box attacks from a third holdout network (ResNet101 and InceptionResNetV2). We also show that SATM outperforms the state-of-the-art method (Ensemble Adversarial Training) on the Adience database. In order to compare directly with Ensemble Adversarial Training, we do not use any white-box examples to re-train the networks, and we find that the best performances all come from SATM. For InceptionNets, when \epsilon is set to the smaller value, both undefended and adversarially trained VGG16 and ResNet50 are resilient to their adversarial examples. Therefore, we set \epsilon to the larger value when testing on InceptionResNetV2.
5.5 Experiments on Networks with the Same Structure
We set both networks to be ResNet50-Face (the same structure and the same initialisation), and found that this led to divergence of the algorithm. A potential reason is that the two networks are then effectively learning from their own "white-box attacks", while being unable to mask each other's gradients. Additionally, we also conduct experiments on ResNet50-Face and ResNet50-ImageNet (the same structure but different initialisation). As shown in Table 6, this combination leads to an accuracy improvement on adversarial examples generated by ResNet101-Img, but it also leads to an accuracy decrease on adversarial examples generated by InceptionResNetV2. This may be because adversarial examples generated by ResNet101-Img resemble adversarial examples generated by ResNet50-Img, while being less correlated with adversarial examples generated by InceptionResNetV2.
5.6 Experiments on Other Databases
A series of experiments on the MNIST and ImageNet databases were also conducted; however, compared with the ensemble adversarial training method, no significant improvement was found on ImageNet and no improvement (but no degradation either) was found on MNIST. For ImageNet, we use the same testing method as , where 10,000 testing images are randomly chosen. InceptionResNetV2 and VGG16-Face are trained using SATM, and ResNet101 is chosen as the third holdout network to generate black-box attacks. As shown in Table 7, compared with ensemble adversarial training, SATM slightly decreases the top-1 and top-5 error rates on ImageNet. For MNIST, we re-trained two network structures and report the averaged black-box attack error rate. As shown in Table 7, no significant improvement can be found on MNIST using SATM compared with ensemble adversarial training.
Different adversarial training methods can be more effective in specific domains (hand-written digits, faces or object classification). A potential reason is that the distribution of adversarial examples of faces changes more quickly during the re-training process; therefore, by using SATM, networks have the chance to learn from more adversarial examples with different distributions. However, this needs to be supported by a more complete hypothesis and further experiments, and we leave this topic for future work.
6 Conclusion and Future Work
We propose a novel method (SATM) which trains multiple networks simultaneously to improve their robustness to black-box attacks without encountering the problem of gradient masking. In order to achieve this, SATM uses adversarial examples generated from other networks to re-train the targeted network. This way, the networks learn from each other's adversarial examples dynamically and thus all become more resilient to single-step black-box attacks. Furthermore, we also include a simple domain adaptation method that aligns features of clean data with features of adversarial examples to improve the performance. We conduct a series of experiments and show that, by using SATM, networks become slightly more resilient to single-step white-box attacks and significantly more resilient to single-step black-box attacks, while their adversarial examples remain "strong" enough to easily fool undefended networks. We also show that SATM outperforms the state-of-the-art method on single-step black-box attacks from holdout networks. To further improve the performance in future work, white-box examples could be used with SATM in various ways, stronger iterative adversarial attacks could be used, and a more deliberate domain adaptation method could be combined with SATM.
-  Tramèr, F., Papernot, N., Goodfellow, I., Boneh, D., McDaniel, P.: The space of transferable adversarial examples. arXiv preprint arXiv:1704.03453 (2017)
-  Goodfellow, I.J., Shlens, J., Szegedy, C.: Explaining and harnessing adversarial examples. ICLR, 2015 (2014)
-  Papernot, N., McDaniel, P., Goodfellow, I., Jha, S., Celik, Z.B., Swami, A.: Practical black-box attacks against deep learning systems using adversarial examples. arXiv preprint (2016)
-  Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks. ICLR, 2018 (2018)
-  Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., Fergus, R.: Intriguing properties of neural networks. ICLR, 2014 (2014)
-  Tramèr, F., Kurakin, A., Papernot, N., Goodfellow, I., Boneh, D., McDaniel, P.: Ensemble adversarial training: Attacks and defenses. ICLR, 2018 (2017)
-  Eidinger, E., Enbar, R., Hassner, T.: Age and gender estimation of unfiltered faces. IEEE Transactions on Information Forensics and Security 9 (2014) 2170–2179
-  Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning. (2015) 448–456
-  Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. ICLR, 2015 (2015)
-  Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., Marchand, M., Lempitsky, V.: Domain-adversarial training of neural networks. The Journal of Machine Learning Research 17 (2016) 2096–2030
-  Rozsa, A., Rudd, E.M., Boult, T.E.: Adversarial diversity and hard positive generation.
-  Dong, Y., Liao, F., Pang, T., Su, H., Hu, X., Li, J., Zhu, J.: Boosting adversarial attacks with momentum. arXiv preprint arXiv:1710.06081 (2017)
-  Kurakin, A., Goodfellow, I., Bengio, S.: Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533 (2016)
-  Moosavi-Dezfooli, S.M., Fawzi, A., Frossard, P.: Deepfool: a simple and accurate method to fool deep neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2016) 2574–2582
-  Nguyen, A., Yosinski, J., Clune, J.: Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2015) 427–436
-  Zhao, Z., Dua, D., Singh, S.: Generating natural adversarial examples. arXiv preprint arXiv:1710.11342 (2017)
-  Carlini, N., Wagner, D.: Towards evaluating the robustness of neural networks. arXiv preprint arXiv:1608.04644 (2016)
-  Gong, Z., Wang, W., Ku, W.S.: Adversarial and clean data are not twins. arXiv preprint arXiv:1704.04960 (2017)
-  Carlini, N., Wagner, D.: Adversarial examples are not easily detected: Bypassing ten detection methods. In: Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, ACM (2017) 3–14
-  Yuan, X., He, P., Zhu, Q., Bhat, R.R., Li, X.: Adversarial examples: Attacks and defenses for deep learning. arXiv preprint arXiv:1712.07107 (2017)
-  Lu, J., Issaranon, T., Forsyth, D.: Safetynet: Detecting and rejecting adversarial examples robustly. CoRR, abs/1704.00103 (2017)
-  Metzen, J.H., Genewein, T., Fischer, V., Bischoff, B.: On detecting adversarial perturbations. Proceedings of 5th International Conference on Learning Representations (ICLR), 2017 (2017)
-  Grosse, K., Manoharan, P., Papernot, N., Backes, M., McDaniel, P.: On the (statistical) detection of adversarial examples. arXiv preprint arXiv:1702.06280 (2017)
-  Hendrycks, D., Gimpel, K.: Early methods for detecting adversarial images. ICLR Workshop, 2017 (2017)
-  Hosseini, H., Chen, Y., Kannan, S., Zhang, B., Poovendran, R.: Blocking transferability of adversarial examples in black-box learning systems. arXiv preprint arXiv:1703.04318 (2017)
-  Feinman, R., Curtin, R.R., Shintre, S., Gardner, A.B.: Detecting adversarial samples from artifacts. arXiv preprint arXiv:1703.00410 (2017)
-  Meng, D., Chen, H.: Magnet: a two-pronged defense against adversarial examples. In: Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, ACM (2017) 135–147
-  Carlini, N., Wagner, D.: Magnet and "Efficient defenses against adversarial attacks" are not robust to adversarial examples. arXiv preprint arXiv:1711.08478 (2017)
-  Salimans, T., Karpathy, A., Chen, X., Kingma, D.P.: Pixelcnn++: Improving the pixelcnn with discretized logistic mixture likelihood and other modifications. ICLR Poster, 2017 (2017)
-  Papernot, N., McDaniel, P., Wu, X., Jha, S., Swami, A.: Distillation as a defense to adversarial perturbations against deep neural networks. In: Security and Privacy (SP), 2016 IEEE Symposium on, IEEE (2016) 582–597
-  Simon-Gabriel, C.J., Ollivier, Y., Schölkopf, B., Bottou, L., Lopez-Paz, D.: Adversarial vulnerability of neural networks increases with input dimension. arXiv preprint arXiv:1802.01421 (2018)
-  Carlini, N., Wagner, D.: Defensive distillation is not robust to adversarial examples. arXiv preprint arXiv:1607.04311 (2016)
-  Na, T., Ko, J.H., Mukhopadhyay, S.: Cascade adversarial machine learning regularized with a unified embedding. ICLR 2018 (2017)
-  Levi, G., Hassner, T.: Age and gender classification using convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. (2015) 34–42
-  Liao, Z., Petridis, S., Pantic, M.: Local deep neural networks for age and gender classification. arXiv preprint arXiv:1703.08497 (2017)
-  He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. (2016) 770–778
-  Parkhi, O.M., Vedaldi, A., Zisserman, A.: Deep face recognition. In: British Machine Vision Conference. (2015)
-  Tzeng, E., Hoffman, J., Saenko, K., Darrell, T.: Adversarial discriminative domain adaptation. In: Computer Vision and Pattern Recognition (CVPR 2017). Volume 1. (2017) 4