Improving the Robustness of Deep Neural Networks via Adversarial Training with Triplet Loss

05/28/2019 ∙ by Pengcheng Li, et al. ∙ Nanjing University ∙ JD.com, Inc.

Recent studies have highlighted that deep neural networks (DNNs) are vulnerable to adversarial examples. In this paper, we improve the robustness of DNNs by utilizing techniques of Distance Metric Learning. Specifically, we incorporate Triplet Loss, one of the most popular Distance Metric Learning methods, into the framework of adversarial training. Our proposed algorithm, Adversarial Training with Triplet Loss (AT^2L), substitutes the adversarial example against the current model for the anchor of triplet loss to effectively smooth the classification boundary. Furthermore, we propose an ensemble version of AT^2L, which aggregates different attack methods and model structures for better defense effects. Our empirical studies verify that the proposed approach can significantly improve the robustness of DNNs without sacrificing accuracy. Finally, we demonstrate that our specially designed triplet loss can also be used as a regularization term to enhance other defense methods.


1 Introduction

Deep neural networks (DNNs) have been widely used for security-critical tasks, including but not limited to autonomous driving [Evtimov et al.2017], surveillance [Ouyang and Wang2013], biometric recognition [Xu et al.2017], and malware detection [Yuan et al.2014]. However, recent studies have shown that DNNs are vulnerable to adversarial examples [Goodfellow et al.2014, Papernot et al.2016, Chen et al.2017, Li et al.2018], which are carefully crafted instances that can mislead well-trained DNNs. This raises serious concerns about the security of machine learning models in many real-world applications.

Recently, many efforts have been made to improve the robustness of DNNs, such as (i) using the properties of obfuscated gradients [Athalye et al.2018] to prevent attackers from obtaining the true gradient of the model, e.g., mitigating through randomization [Xie et al.2018], Thermometer encoding [Buckman et al.2018], and Defense-GAN [Samangouei et al.2018]; (ii) adding adversarial examples to the training set, e.g., Adversarial Training [Szegedy et al.2013, Goodfellow et al.2014], scalable Adversarial Training [Kurakin et al.2016b], and Ensemble Adversarial Training [Tramèr et al.2018]. However, the first type of defense has been shown to be broken by various targeted countermeasures [Carlini and Wagner2017a, He et al.2017, Athalye et al.2018]. The second type of method also suffers from distortion of the classification boundary, because it only imports adversarial examples against some specific types of attacks.

In this paper, we follow the framework of Adversarial Training and introduce Triplet Loss [Schroff et al.2015], one of the most popular Distance Metric Learning methods, to improve robustness by smoothing the classification boundary. Triplet loss is designed to optimize the embedding space such that data points with the same label are closer to each other than those with different labels. The primary challenge of triplet loss is how to select representative triplets, which are made up of three examples from two different classes and jointly constitute a positive pair and a negative pair. Since adversarial examples contain more information about the decision boundary than normal examples, we replace the anchor of the triplet loss with adversarial examples to enlarge the distance between adversarial examples and examples with different labels in the embedding space. Then, we add this fine-grained triplet loss to the original adversarial training process and name the new algorithm Adversarial Training with Triplet Loss (AT^2L). We also propose an ensemble algorithm which aggregates different types of attacks and model structures to improve the performance. Furthermore, the proposed triplet loss can be applied to other methods as a regularization term for better robustness.

We summarize our main contributions as follows:

  • We introduce triplet loss into the adversarial training framework and modify the anchor of triplet loss with adversarial examples. We also design an ensemble version of our method.

  • We propose to take our triplet loss as a regularization term and apply it to existing defense methods for further improvement of robustness.

  • We conduct extensive experiments to evaluate our algorithms. The empirical results show that our proposed approach is more robust while preserving the accuracy of the model, and that the triplet loss can also improve the performance of other defense methods.

2 Related work

In this section, we briefly review existing adversarial attack and defense methods.

2.1 Attack methods

Attack methods can be divided into two main categories: gradient-based attack and optimization-based attack.

The gradient-based attack requires knowledge of the structure of the attacked model and requires the attacked model to be differentiable. It then generates adversarial examples by adding a perturbation along the direction of the gradients. FGSM [Goodfellow et al.2014], Single-Step Least-Likely (LL) [Kurakin et al.2016a, Kurakin et al.2016b] and their iterative versions, i.e., I-FGSM and I-LL, are popular methods in this type of attack.
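As a minimal sketch of the gradient-sign idea described above, the following applies FGSM and a simple iterative variant to a toy logistic-regression classifier. The model, the helper names (`predict`, `fgsm`, `i_fgsm`) and the step-size schedule are illustrative assumptions, not the setup used in the experiments.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    """Probability of class 1 under a toy logistic-regression model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def loss_grad_wrt_input(w, b, x, y):
    """Gradient of the cross-entropy loss with respect to the input x."""
    p = predict(w, b, x)
    return [(p - y) * wi for wi in w]

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(w, b, x, y, eps):
    """Single step: move each coordinate by eps along the gradient sign."""
    g = loss_grad_wrt_input(w, b, x, y)
    return [xi + eps * sign(gi) for xi, gi in zip(x, g)]

def i_fgsm(w, b, x, y, eps, steps):
    """Iterative variant: small FGSM steps, clipped to the eps-ball around x.

    The per-step size eps / steps is an assumption made for this sketch.
    """
    step = eps / steps
    x_adv = list(x)
    for _ in range(steps):
        x_adv = fgsm(w, b, x_adv, y, step)
        x_adv = [min(max(xa, xi - eps), xi + eps) for xa, xi in zip(x_adv, x)]
    return x_adv

if __name__ == "__main__":
    w, b = [2.0, -1.0], 0.0
    x, y = [1.0, 1.0], 1
    x_adv = fgsm(w, b, x, y, eps=0.5)
    print(predict(w, b, x), predict(w, b, x_adv))
```

The Least-Likely variants follow the same pattern but step toward the least-likely class instead of away from the true one.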

The optimization-based attack formulates the task of attack as an optimization problem which aims to minimize the norm of the perturbation while making the DNN model misclassify the adversarial examples. The C&W attack [Carlini and Wagner2017b] is by far one of the strongest optimization-based attacks. It can reduce the classifiers' accuracy to almost 0 and has bypassed ten different methods designed for detecting adversarial examples [Carlini and Wagner2017a]. However, it is more time-consuming than gradient-based algorithms.
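For orientation, the optimization problem behind the commonly used $\ell_2$ variant of the C&W attack can be written as follows. The symbols ($\delta$ for the perturbation, $Z$ for the logits, $t$ for the target class, $\kappa$ for the confidence, $c$ for the trade-off constant) follow the usual presentation of that attack and are not taken from this paper.

$$\min_{\delta} \; \|\delta\|_2^2 + c \cdot g(x + \delta) \quad \text{s.t.} \quad x + \delta \in [0, 1]^n$$

$$g(x') = \max\Big(\max_{i \neq t} Z(x')_i - Z(x')_t, \; -\kappa\Big)$$

The constant $c$ is typically found by a search over several values, which is one reason the attack is slower than single-step gradient methods.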

2.2 Defense methods

Many recent defense approaches are based on a technique called obfuscated gradients [Athalye et al.2018]. It is similar to gradient masking [Papernot et al.2017], a failed defense approach that tries to deny the attacker access to a useful gradient and leads to a false sense of security in defenses against adversarial examples. Typical defense methods using obfuscated gradients are Thermometer encoding [Buckman et al.2018], Stochastic activation pruning [Dhillon et al.2018], Mitigating through randomization [Xie et al.2018] and Defense-GAN [Samangouei et al.2018].

Another common method is adversarial training, which adds adversarial examples to the training set and then retrains the model for better robustness. Szegedy et al. [2013] first proposed this simple process, in which the model is trained on adversarial examples until it learns to classify them correctly. However, this type of method suffers from distortion of the classification boundary. So in this paper, we introduce Distance Metric Learning to alleviate this distortion.

3 Methodology

In this section, we first introduce the triplet loss. Then we present Adversarial Training with Triplet Loss (AT^2L) and an ensemble version of AT^2L. Finally, we propose to treat our special triplet loss as a regularization term and combine it with existing defense methods.

3.1 Triplet loss

A triplet [Schroff et al.2015] consists of three examples from two different classes, which jointly constitute a positive pair and a negative pair. We denote $(x_a, x_p, x_n)$ as a triplet, where $x_p$ has the same label as $x_a$ and $x_n$ has a different label. The term $x_a$ is referred to as the anchor of a triplet. The distance between the positive pair is encouraged to be smaller than that of the negative pair, and a soft nearest neighbor classification margin is maximized by optimizing a hinge loss. Specifically, triplet loss forces the network to generate an embedding where the distance between $x_a$ and $x_n$ is larger than the distance between $x_a$ and $x_p$ plus the margin parameter $\alpha$.

Formally, we define the triplet loss function as follows:

$$L_{trip} = \frac{1}{N} \sum_{i=1}^{N} \left[ d\big(f(x_a^{(i)}), f(x_p^{(i)})\big) - d\big(f(x_a^{(i)}), f(x_n^{(i)})\big) + \alpha \right]_+$$

where $N$ is the cardinality of the set of triplets used in the training process, $f(\cdot)$ is the output of the last fully connected layer of our neural network, $d(\cdot, \cdot)$ represents a metric of distance between two embeddings, and $[\cdot]_+ = \max(\cdot, 0)$ is the hinge function. Here we use a norm-based distance in our experiments.
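The hinge form of the triplet loss can be sketched in a few lines. In this sketch each triplet is given directly as three embedding vectors (anchor, positive, negative), and the Euclidean distance is an assumption chosen for illustration, since the exact norm used in the experiments is not recoverable here.

```python
import math

def l2_distance(u, v):
    """Euclidean distance between two embedding vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(triplets, margin, dist=l2_distance):
    """Mean hinge loss over (anchor, positive, negative) embedding triplets.

    A triplet contributes zero once the negative is farther from the anchor
    than the positive by at least `margin`.
    """
    total = 0.0
    for anchor, positive, negative in triplets:
        total += max(0.0, dist(anchor, positive) - dist(anchor, negative) + margin)
    return total / len(triplets)

if __name__ == "__main__":
    easy = ([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])   # negative already far away
    hard = ([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])   # negative closer than positive
    print(triplet_loss([easy, hard], margin=0.5))
```

Only the "hard" triplet contributes to the loss, which is why the choice of representative triplets matters for convergence.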

Generating all possible triplets would result in redundant triplets and lead to slow convergence. So in the next sections, we use a sampling strategy to generate triplets in our algorithms.

3.2 Adversarial training with triplet loss (AT^2L)

The original version of adversarial training crafts adversarial examples for the entire training set and adds them to the training process. Specifically, it generates $X_{adv}$, which contains adversarial examples of the instances in the training set $X$. Then it concatenates $X$ and $X_{adv}$ as $X'$ and retrains the model with $X'$. During each iteration of the original algorithm, it generates the adversarial examples against the current model. The loss function of the original adversarial training is formulated as:

$$L_{adv} = \frac{1}{m} \sum_{i=1}^{m} \left[ L(x_i, y_i) + \lambda_1 L(x_i^{adv}, y_i) \right] \qquad (1)$$

where $L(\cdot, \cdot)$ is the classification loss, $\lambda_1$ is the hyper-parameter, $m$ is the size of the mini-batch sampled from $X'$, $y_i$ is the label of $x_i$, and $x_i^{adv}$ is the adversarial example of $x_i$.

To encourage a larger margin between the positive class and the negative class, we incorporate triplet loss into the loss function. Specifically, for each example $x_i$, we generate the adversarial example $x_i^{adv}$ and sample an example $x_i^{n}$ with a different label from the mini-batch to construct a new triplet $(x_i^{adv}, x_i, x_i^{n})$. The main difference between this triplet and the original triplet is that instead of taking the original example as the anchor, we choose the adversarial example $x_i^{adv}$, which contains more information about the decision boundary. When dealing with a binary classification problem, we sample $x_i^{n}$ with the opposite label to $x_i$. For multi-class problems, we sample $x_i^{n}$ from the class that the adversarial example $x_i^{adv}$ is assigned to, which is an incorrect class from the view of human beings. We apply this new triplet to the triplet loss and combine it with the loss of adversarial training, so the loss function of our algorithm is formulated as:

$$L_{AT^2L} = \frac{1}{m} \sum_{i=1}^{m} \left[ L(x_i, y_i) + \lambda_1 L(x_i^{adv}, y_i) + \lambda_2 \left[ d\big(f(x_i^{adv}), f(x_i)\big) - d\big(f(x_i^{adv}), f(x_i^{n})\big) + \alpha \right]_+ \right] \qquad (2)$$

where $m$ is the size of a mini-batch, $f(\cdot)$ and $d(\cdot, \cdot)$ are the embedding and the distance used in the triplet loss, and $\lambda_1$, $\lambda_2$ and $\alpha$ are the hyper-parameters. We utilize this new loss function to retrain the model and summarize the proposed algorithm in Algorithm 1.
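A minimal sketch of how such a combined objective could be evaluated for one mini-batch is given below. It assumes the loss is a sum of a clean classification term, a weighted adversarial classification term, and a weighted triplet hinge whose anchor is the adversarial example; the dictionary keys, the weight names `lam1` and `lam2`, and the Euclidean distance are illustrative choices rather than the authors' implementation.

```python
import math

def cross_entropy(probs, label):
    """Negative log-likelihood of the true class."""
    return -math.log(probs[label])

def l2_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def at2l_loss(batch, lam1, lam2, margin):
    """Mean of clean loss + lam1 * adversarial loss + lam2 * triplet hinge.

    Each batch entry is a dict with:
      p_clean, p_adv : predicted class probabilities for x and its adversarial example
      label          : true class index
      f_adv, f_clean : embeddings of the adversarial anchor and the clean positive
      f_neg          : embedding of a sampled example with a different label
    """
    total = 0.0
    for ex in batch:
        clean_term = cross_entropy(ex["p_clean"], ex["label"])
        adv_term = cross_entropy(ex["p_adv"], ex["label"])
        hinge = max(
            0.0,
            l2_distance(ex["f_adv"], ex["f_clean"])
            - l2_distance(ex["f_adv"], ex["f_neg"])
            + margin,
        )
        total += clean_term + lam1 * adv_term + lam2 * hinge
    return total / len(batch)

if __name__ == "__main__":
    example = {
        "p_clean": [0.5, 0.5], "p_adv": [0.25, 0.75], "label": 0,
        "f_adv": [0.0, 0.0], "f_clean": [1.0, 0.0], "f_neg": [0.0, 2.0],
    }
    print(at2l_loss([example], lam1=1.0, lam2=2.0, margin=1.5))
```

Setting `lam2` to zero reduces the sketch to plain adversarial training, which makes the triplet term easy to ablate.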

Algorithm 1 Adversarial training with triplet loss (AT^2L)
1:  Train $F$ with training data $X$;
2:  repeat
3:     Construct $X_{adv}$ against $F$ for each instance in $X$;
4:     $X'$ = [$X$, $X_{adv}$];
5:     Retrain $F$ with $X'$ using Eq. (2);
6:  until Training converged

Algorithm 2 Ensemble version of AT^2L
1:  Train $F$ with training data $X$;
2:  repeat
3:     for $F_j$ in $\mathcal{M}$ do
4:        for $A_k$ in $\mathcal{A}$ do
5:           Construct $X_{adv}^{j,k}$, which is the set of adversarial examples of $X$ against model $F_j$ under the attack method $A_k$.
6:        end for
7:     end for
8:     $X_{adv}$ = {$X_{adv}^{j,k}$}, $F_j \in \mathcal{M}$, $A_k \in \mathcal{A}$;
9:     repeat
10:        Sample $m$ clean examples $\{x_i\}$ from training set $X$;
11:        Sample $m$ adversarial examples $\{x_i^{adv}\}$ from $X_{adv}$. Each $x_i^{adv}$ is the adversarial example of $x_i$;
12:        Construct a new training batch $B = \{x_i\} \cup \{x_i^{adv}\}$;
13:        For each instance $x_i$ of $B$, take $x_i^{adv}$ as $x_a$ in triplet loss and sample an example from $B$ with a different label from $x_i$ as $x_n$ in triplet loss;
14:        Perform one training step of network $F$ using the mini-batch $B$ according to Eq. (2);
15:     until Training converged
16:  until Training converged
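The batch-construction steps of the inner loop (sampling clean examples, pairing each with its adversarial example, and picking a negative with a different label) can be sketched as follows. The function name `build_triplets`, the aligned-list data layout, and the rule of skipping an example when the batch holds no negative are assumptions made for this illustration.

```python
import random

def build_triplets(clean, adv, labels, batch_size, rng):
    """Return (anchor, positive, negative) triplets for one mini-batch.

    clean[i], adv[i] and labels[i] are aligned: adv[i] is the adversarial
    example crafted from clean[i]. The anchor is the adversarial example,
    the positive is its clean source, and the negative is a clean example
    from the same batch with a different label.
    """
    idx = rng.sample(range(len(clean)), batch_size)
    triplets = []
    for i in idx:
        candidates = [j for j in idx if labels[j] != labels[i]]
        if not candidates:
            continue  # assumption: skip when the batch has no valid negative
        j = rng.choice(candidates)
        triplets.append((adv[i], clean[i], clean[j]))
    return triplets

if __name__ == "__main__":
    clean = [[0.0], [1.0], [2.0], [3.0]]
    adv = [[0.1], [1.1], [1.9], [2.9]]
    labels = [0, 0, 1, 1]
    for triplet in build_triplets(clean, adv, labels, 4, random.Random(0)):
        print(triplet)
```

Sampling one negative per anchor keeps the number of triplets linear in the batch size, in line with the sampling strategy mentioned for avoiding redundant triplets.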

3.3 Ensemble AT^2L

We proceed to improve the robustness of the model against unknown types of attacks. The originally proposed algorithm can only defend against known types of attacks, where defenders have detailed information about the attacking methods, and it lacks robustness against attacks transferred from unknown models. Our first attempt is to combine different attack methods to increase the robustness. As shown in Algorithm 2, where $\mathcal{A}$ denotes an aggregation of attack methods, we conduct adversarial training on a collection of adversarial examples that are generated by all the attack methods. In this paper, we consider three types of attacks as follows:

  • Gradient-based: $\mathcal{A}$ = {FGSM, I-FGSM, LL, I-LL}.

  • Optimization-based: $\mathcal{A}$ = {C&W}.

  • Mixed: $\mathcal{A}$ = {FGSM, I-FGSM, LL, I-LL, C&W}.

On the other hand, we adopt the idea of Ensemble Adversarial Training [Tramèr et al.2018], which states that augmenting the training data with perturbations transferred from other models can improve the robustness not only under a known type of attack, but also under an unknown type of attack. As shown in Algorithm 2, where $\mathcal{M}$ is a set of model structures, we extend our training set with adversarial examples against different models in $\mathcal{M}$.

In general, the ensemble version of our algorithm not only considers various types of attacks, but also involves adversarial examples generated against different model structures. Therefore, our algorithm captures more information about the decision boundary, and with our designed triplet loss, it can smooth the classification boundary and learn a better embedding space to alleviate the distortion.
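The aggregation over model structures and attack methods amounts to a nested loop that collects one set of adversarial examples per (model, attack) pair. The sketch below uses toy stand-ins: the "models" are plain numbers and the "attacks" are simple functions, all hypothetical and chosen only so that the example runs on its own.

```python
def build_adversarial_pool(models, attacks, dataset):
    """Collect adversarial examples for every (model, attack) combination.

    models  : dict mapping a model name to a model object
    attacks : dict mapping an attack name to a function attack(model, x, y)
    dataset : list of (x, y) pairs
    Returns a dict keyed by (model name, attack name).
    """
    pool = {}
    for model_name, model in models.items():
        for attack_name, attack in attacks.items():
            pool[(model_name, attack_name)] = [
                (attack(model, x, y), y) for x, y in dataset
            ]
    return pool

if __name__ == "__main__":
    # Toy stand-ins: a "model" is a slope, an "attack" shifts x along or against it.
    models = {"m1": 1.0, "m2": -1.0}
    attacks = {
        "plus": lambda model, x, y: x + 0.5 * model,
        "minus": lambda model, x, y: x - 0.5 * model,
    }
    dataset = [(1.0, 0), (2.0, 1)]
    pool = build_adversarial_pool(models, attacks, dataset)
    print(sorted(pool))
```

Holding one model out of `models` at training time and attacking with it afterwards gives the unknown-model evaluation setting.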

Figure 1 (panels: (a) Cats vs. Dogs, (b) MNIST, (c) CIFAR10): Results on three datasets where attackers perform white-box attacks. ‘Adv.T’ means the traditional adversarial training. ‘-Gra’ means the training process uses the gradient-based attack methods to generate adversarial examples. ‘-Opt’ means using the optimization-based attack method, i.e., C&W, and ‘-Mix’ means using the mixed version of attack methods. The figures in the upper row are attacked by FGSM and the figures in the bottom row are attacked by C&W.

Figure 2 (panels: (a) Cats vs. Dogs, (b) MNIST, (c) CIFAR10): Results on three datasets where defenders are under the attack of adversarial examples transferred from unknown models. Notations are the same as in Figure 1, and these are attacked by FGSM.

3.4 Triplet regularization

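Our triplet loss can also be regarded as a regularization term:

$$R_{trip} = \frac{1}{m} \sum_{i=1}^{m} \left[ d\big(f(x_i^{adv}), f(x_i)\big) - d\big(f(x_i^{adv}), f(x_i^{n})\big) + \alpha \right]_+$$

where $x_i^{adv}$ is the adversarial anchor, $x_i$ is the clean example it was crafted from, $x_i^{n}$ is a sampled example with a different label, and $\alpha$ is the margin.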

Thus, it can be incorporated into most of the existing defense methods for better robustness. The defense methods based on obfuscated gradients mostly mask the real gradient by adding non-differentiable preprocessing or random processes, and there is no restriction on the loss function used in the training process. Therefore, we can modify their loss function by adding our triplet regularization term to further increase the robustness.
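Because the term only depends on embeddings, attaching it to an existing defense can be as simple as adding a weighted hinge to whatever loss value that defense already computes. The following sketch assumes a Euclidean distance and uses hypothetical names (`triplet_regularizer`, `regularized_loss`, `weight`).

```python
import math

def triplet_regularizer(triplets, margin):
    """Mean hinge over (adversarial anchor, clean positive, negative) embeddings."""
    def dist(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    hinges = [max(0.0, dist(a, p) - dist(a, n) + margin) for a, p, n in triplets]
    return sum(hinges) / len(hinges)

def regularized_loss(defense_loss_value, triplets, weight, margin):
    """Add the weighted triplet term to a loss value computed by any defense."""
    return defense_loss_value + weight * triplet_regularizer(triplets, margin)

if __name__ == "__main__":
    triplets = [([0.0, 0.0], [1.0, 0.0], [0.0, 2.0])]
    print(regularized_loss(0.7, triplets, weight=2.0, margin=1.5))
```

The defense's own preprocessing or randomization is untouched; only the training objective changes.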

For example, Buckman et al. [2018] propose to encode the input with Thermometer Encoding and retrain the model with the traditional adversarial training. Triplet regularization can be easily applied to this method by changing the loss function of the adversarial training process. Mitigating through randomization [Xie et al.2018] and Defense-GAN [Samangouei et al.2018] both perform transformations over the original inputs without changing the loss. So we can directly incorporate our triplet regularization into their losses to improve the defense effect.

Figure 3 (panels: (a) Cats vs. Dogs, (b) MNIST, (c) CIFAR10): Comparison of traditional adversarial training and triplet regularization. The attack method used is FGSM.

4 Experiments

In this section, we present experimental results.

4.1 Settings

We conduct experiments over three datasets, i.e., Cats vs. Dogs [Elson et al.2007], MNIST [LeCun1998] and CIFAR10 [Krizhevsky and Hinton2009]. Cats vs. Dogs is a large-scale image dataset used for binary classification problems. MNIST and CIFAR10 are commonly used datasets for multi-class classification problems.

The attack methods used in our experiments include FGSM, I-FGSM, LL, I-LL, C&W, LS-PGA and Deepfool [Moosavi-Dezfooli et al.2016], and the model structures used in the experiments differ across the three datasets. The parameters of these methods, detailed model structures and full results of the experiments are described in the supplemental material (https://github.com/njulpc/IJCAI19/blob/master/Appendix.pdf).

4.2 Adversarial Training with Triplet Loss (AT^2L)

To illustrate the advantage of the proposed method, we compare it with adversarial training without triplet loss, whose loss function is Eq. (1). As for the hyper-parameters in Eq. (1) and Eq. (2), we traverse them over appropriate intervals and find that the performance is stable within a proper range of values. Each experiment is tested with two types of attack methods, i.e., FGSM and C&W. Due to space limitations, we only show results where attackers perform white-box attacks in Fig. 1, and partial results where attackers craft adversarial examples against a network that is not included in the training set of our algorithm in Fig. 2.

We have the following observations from the results in Fig. 1: (i) when attacked by gradient-based attacks or optimization-based attacks, AT^2L trained with adversarial examples generated by the corresponding attacks has the best robustness, e.g., when attacked by gradient-based attacks, the model trained with adversarial examples generated by gradient-based attacks exhibits the best robustness; (ii) when trained with gradient-based attacks, AT^2L shows almost no robustness against optimization-based attacks, which is shown by the black curves in Fig. 1. However, our algorithm trained with optimization-based attacks demonstrates a decent defense effect against gradient-based attacks. This supports the view that optimization-based attacks are stronger and contain more information about the decision boundary than gradient-based attacks; (iii) AT^2L trained with our mixed version of algorithms shows comparable robustness to the model trained with the corresponding attacks. Although the mixed version of AT^2L does not perform the best, it provides more reliable robustness when attacked by an unknown type of attack.

Compared with Fig. 1 (results under adversarial examples transferred from known models), Fig. 2 (results under adversarial examples transferred from unknown models) shows that the results under attacks transferred from unknown models are slightly worse than those under known types of attacks, because the defenders lack precise information about the attack, e.g., the type of the attack method and the model structure used for the attack. However, the model still shows decent robustness against unknown types of attacks, which is an advantage of our ensemble AT^2L, which aggregates multiple model structures.

We also find that, compared with the model trained over clean data, all the models trained with our algorithm, i.e., AT^2L, have no loss of accuracy. More details can be found in the supplemental material.

| Attack | Ori | Th. En.(1) | Th. En.(7) | Th. En. + TR(1) | Th. En. + TR(7) |
|---|---|---|---|---|---|
| Clean | 5.8 | 7.6 | 10.1 | 6.1 | 7.2 |
| FGSM | 51.5 | 37.1 | 20.0 | 27.6 | 15.1 |
| PGD/LS-PGA | 49.5 | 39.3 | 20.9 | 29.7 | 13.4 |

Table 1: Error rate (%) against known type of attacks on CIFAR10 over Thermometer Models. ‘Th. En.’ means Thermometer Encoding. ‘TR’ means applying our triplet regularization.

| Model | Attack | Ori | Rand | Rand + TR |
|---|---|---|---|---|
| Inception-v3 | FGSM | 66.8 | 36.2 | 30.5 |
| Inception-v3 | Deepfool | 100.0 | 1.7 | 1.1 |
| Inception-v3 | C&W | 100.0 | 3.1 | 2.6 |
| ResNet-v2-101 | FGSM | 73.7 | 28.2 | 21.7 |
| ResNet-v2-101 | Deepfool | 100.0 | 2.3 | 1.5 |
| ResNet-v2-101 | C&W | 100.0 | 2.9 | 1.2 |
| Inception-ResNet-v2 | FGSM | 34.7 | 19.0 | 8.3 |
| Inception-ResNet-v2 | Deepfool | 100.0 | 1.8 | 0.8 |
| Inception-ResNet-v2 | C&W | 99.7 | 2.3 | 1.3 |
| Ens-adv-Inception-ResNet-v2 | FGSM | 15.6 | 4.3 | 4.6 |
| Ens-adv-Inception-ResNet-v2 | Deepfool | 99.8 | 0.9 | 0.7 |
| Ens-adv-Inception-ResNet-v2 | C&W | 99.1 | 1.2 | 0.9 |

Table 2: Error rate (%) of different models under the vanilla attack scenario on the ImageNet dataset. ‘Ori’ means the original model. ‘Rand’ means adding some randomization layers. ‘TR’ means applying our triplet regularization.

| Model | Attack | Ori | DG | DG + TR |
|---|---|---|---|---|
| A | FGSM | 88.3 | 1.2 | 1.1 |
| A | C&W | 85.9 | 1.1 | 1.4 |
| B | FGSM | 97.8 | 4.4 | 0.7 |
| B | C&W | 96.8 | 8.4 | 4.7 |
| C | FGSM | 67.9 | 1.1 | 0.8 |
| C | C&W | 87.4 | 1.1 | 1.1 |
| D | FGSM | 96.2 | 2.0 | 1.6 |
| D | C&W | 96.8 | 1.7 | 1.4 |

Table 3: Error rate (%) of different models on the MNIST dataset. ‘DG’ means Defense-GAN. ‘TR’ means applying our triplet regularization. A, B, C, D are different model structures, whose details are described in the supplemental material.

4.3 Triplet regularization

To reveal the effect of triplet regularization, we compare it with the original adversarial training. We use FGSM to generate adversarial examples for the training process and test the robustness by the attack of FGSM.

From Fig. 3, we can see that the error rate of the model trained with our triplet regularization (the red curve) keeps decreasing as the number of iterations increases. So the triplet loss itself can increase the robustness of the model, and its effect is no worse than that of the original adversarial training (the blue curve). This result supports our hypothesis that enlarging the margin between the adversarial examples and the negative examples and decreasing the margin between examples of the same class can smooth the decision boundary. It also suggests that the designed triplet regularization may help increase robustness in other machine learning problems. We can also see from Fig. 3 that AT^2L, which integrates both adversarial training and triplet regularization, shows the best performance.

4.4 Current defense methods with triplet regularization

We further explore the effect of the combination of triplet regularization with existing defense methods. We experiment over some representative defense methods and demonstrate that our triplet regularization can be applied to improve their robustness further. Due to the limitation of space, we show partial results in this part, and full results are listed in the supplemental material.

4.4.1 Thermometer encoding

We follow the setting of the original paper [Buckman et al.2018] and run experiments over both known and unknown types of attacks. Partial results are shown in Table 1, which indicates that employing our triplet regularization improves the robustness on top of the original defense. For example, when attacked by FGSM, the error rate of the model trained using thermometer encoding after 7 iterations is 20.0%. Combined with our triplet regularization, the model achieves a 15.1% error rate.
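For readers unfamiliar with the encoding, the sketch below shows one common formulation of thermometer encoding for a single pixel value in [0, 1]: the value is quantized into `levels` buckets and every position from the bucket index onward is set to 1. The function name and the exact bucket convention are assumptions for illustration and may differ from the original implementation.

```python
def thermometer_encode(value, levels):
    """Encode a scalar in [0, 1] as a cumulative binary vector.

    The value is quantized to a bucket index in [0, levels - 1]; entries at
    positions >= index are 1, entries before it are 0.
    """
    index = min(int(value * levels), levels - 1)
    return [1 if position >= index else 0 for position in range(levels)]

def encode_image(pixels, levels):
    """Apply the encoding to every pixel of a flat list of values."""
    return [thermometer_encode(p, levels) for p in pixels]

if __name__ == "__main__":
    print(thermometer_encode(0.13, 10))
    print(encode_image([0.0, 0.5, 1.0], 4))
```

The discretization is non-differentiable, which is why this defense is grouped with the obfuscated-gradient methods, while the training loss remains free to carry an extra regularization term.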

4.4.2 Mitigating through randomization

We also apply our triplet loss to the work of Xie et al. [2018], who proposed to randomly resize and pad the images to a designed size. This defense can be added in front of the normal classification process with no additional training or fine-tuning, and can be combined with our triplet regularization directly. We experiment over two settings from the original paper (vanilla attack scenario and ensemble-pattern attack scenario), and examine the performance of our triplet regularization. The result of the vanilla attack scenario is shown in Table 2. For the model of Inception-ResNet-v2, the randomized procedure only achieves 19.0% error rate under FGSM, but with our triplet regularization, the error rate can drop to 9.3%. These results show that our triplet regularization can further improve the robustness of the model based on the original defense method.
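The randomization step can be illustrated with a small pure-Python sketch on a square image stored as nested lists: pick a random target size, resize with nearest-neighbor sampling, then place the result at a random offset inside a zero-padded canvas. The function names, the nearest-neighbor choice, and the size range are assumptions for this example and not the configuration of the original defense.

```python
import random

def resize_nearest(img, new_size):
    """Nearest-neighbor resize of a square image (list of rows)."""
    old = len(img)
    return [
        [img[r * old // new_size][c * old // new_size] for c in range(new_size)]
        for r in range(new_size)
    ]

def random_resize_pad(img, out_size, rng):
    """Randomly resize to a size in [len(img), out_size], then zero-pad
    at a random offset so the output is always out_size x out_size."""
    new_size = rng.randint(len(img), out_size)
    resized = resize_nearest(img, new_size)
    top = rng.randint(0, out_size - new_size)
    left = rng.randint(0, out_size - new_size)
    out = [[0] * out_size for _ in range(out_size)]
    for r in range(new_size):
        for c in range(new_size):
            out[top + r][left + c] = resized[r][c]
    return out

if __name__ == "__main__":
    image = [[1, 2], [3, 4]]
    for row in random_resize_pad(image, 5, random.Random(0)):
        print(row)
```

Since the transformation happens before the classifier, the classifier's training loss is left untouched and can carry the triplet term.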

4.4.3 Defense-GAN

Defense-GAN is designed to project samples onto the manifold of the generator before classifying them. Our triplet regularization can be easily applied after the projection of Defense-GAN by simply changing the loss function during the training process of the classifier. We follow the setting of the models' structures and parameters in the paper of Samangouei et al. [2018]. As shown in Table 3, when attacked by C&W, model B attains an 8.4% error rate using Defense-GAN, while combined with triplet regularization, it achieves a 4.7% error rate. Again, this result shows that equipping the original defense method with triplet regularization can make the trained model more robust.
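The projection step can be pictured as a small optimization over the latent code: find the $z$ whose generated sample is closest to the input, then classify the generated sample instead of the input. The sketch below uses a toy linear "generator" and finite-difference gradient descent; the generator, step size, and function names are hypothetical, and the real method uses a trained GAN and typically several random restarts.

```python
def squared_error(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def project(generator, x, z0, lr=0.05, steps=100, h=1e-5):
    """Gradient descent on z to minimize ||generator(z) - x||^2.

    Gradients are estimated with central finite differences so that the
    sketch works with any callable generator.
    """
    z = list(z0)
    for _ in range(steps):
        grad = []
        for k in range(len(z)):
            z_plus = z[:]
            z_plus[k] += h
            z_minus = z[:]
            z_minus[k] -= h
            diff = squared_error(generator(z_plus), x) - squared_error(generator(z_minus), x)
            grad.append(diff / (2 * h))
        z = [zk - lr * gk for zk, gk in zip(z, grad)]
    return generator(z), z

if __name__ == "__main__":
    # Toy generator whose outputs lie on the line (t, 2t).
    toy_generator = lambda z: [z[0], 2.0 * z[0]]
    projected, z = project(toy_generator, x=[1.0, 2.2], z0=[0.0])
    print(projected, z)
```

The classifier that receives the projected sample is trained as usual, which is where a modified loss can be substituted.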

5 Conclusion

In this paper, we propose Adversarial Training with Triplet Loss (AT^2L), which incorporates a modified triplet loss in the adversarial training process to alleviate the distortion of the models' classification boundary. We further design an ensemble version of AT^2L and propose to use the triplet loss as a regularization term. The results of our experiments validate the effectiveness of our algorithms and demonstrate that our triplet regularization can be applied to existing defense methods for further improvement of robustness.

Acknowledgments

This work was partially supported by the National Key R&D Program of China (2018YFB1004300), NSFC-NRF Joint Research Project (61861146001), YESS (2017QNRC001), and the Collaborative Innovation Center of Novel Software Technology and Industrialization.

References

Appendix

We show more details about our experiments in this appendix. Our experiments mainly contain two parts. The first examines whether the new triplet loss function can improve the performance of adversarial training, and how it performs when the triplet loss is used as a regularization term. The second applies our new triplet loss function to current defense methods. For the first part, we run experiments over three datasets: Cats vs. Dogs, MNIST and CIFAR10. For the second part, we follow the settings of the original papers of these defense methods. Our new triplet loss is defined as follows:

$$R_{trip} = \frac{1}{m} \sum_{i=1}^{m} \left[ d\big(f(x_i^{adv}), f(x_i)\big) - d\big(f(x_i^{adv}), f(x_i^{n})\big) + \alpha \right]_+$$

where $x_i^{adv}$ is the adversarial anchor, $x_i$ is the clean example it was crafted from, $x_i^{n}$ is a sampled example with a different label, and $\alpha$ is the margin.

Appendix A Adversarial Training with Triplet Loss (AT^2L)

In this section, we show results over three datasets, Cats vs. Dogs, MNIST and CIFAR10. The attack methods used in this part are FGSM, I-FGSM, LL, I-LL and C&W. We set as , which is the scale of FGSM, I-FGSM, LL and I-LL. For iteration algorithms like I-FGSM and I-LL, we set the number of iteration as 10 and for each iteration. The max-iteration of C&W is set to be 1000. The initial constant is set to be e and the largest value of to go up to before giving up is . The rate at which we increase constant is and the learning rate of C&W attack is e. The performance of our algorithm is stable in a large range of the hyper-parameters. and are designed to adjust the weight of the original loss, adversarial loss and triplet loss. is designed to tune the margin between different classes. In our experiments, with fixed , the robustness of the model is stable when varies from 0.2 to 1.5. With fixed , the robustness is stable when varies from 0.1 to 5. As for in the triplet loss, the robustness is stable when varying from 0.1 to 2.0 with step-size 0.1.

A.1 Cats vs. Dogs

We first do experiments on Cats vs. Dogs dataset. The model structures we chose are VGG16, VGG19, inceptionv3, resnet50. Here we use FGSM, I-FGSM, LL, and I-LL to generate adversarial examples and notate these methods as gradient training methods and we select C&W as the optimization attack method for the purpose of adversarial training. The margin and is set to be 0.5 and 0.3. We train an ensemble model which covers VGG16, VGG19, inceptionv3, resnet50 because each picture of this dataset is 224x224x3 and simple networks may not work well. The accuracy of the model over clean data is 91.2%.

In the known type of attack, we train an ensemble model over VGG16, VGG19, inceptionv3 and resnet50, where the original model is VGG16. We then attack the model with adversarial examples generated against VGG16. In the unknown type of attack, we train the ensemble model over VGG19, inceptionv3 and resnet50, where the original model is VGG19. We then attack the model with adversarial examples generated against VGG16, so that the attackers are not aware of the model's structure during the training process.

The accuracy of adversarial training with triplet loss against gradient-based adversarial examples is 4.7%, which is better than normal adversarial training without triplet loss. Even after 10 iterations of adversarial training, the algorithm of adversarial training with triplet loss is still better. When trained with optimization-based algorithm like C&W, the robustness against gradient-based adversarial examples is worse than when trained with gradient-based algorithm. However, the mixed version of both gradient-based and optimization-based algorithm works well regardless of the type of the attack.

The parameters in the triplet loss function are set to be , and the results are shown in Figure 4, Figure 5, Table 4, and Table 5.

| Attack method | Training method | Adv. Train(1) | Adv. Train(10) | AT^2L(1) | AT^2L(10) |
|---|---|---|---|---|---|
| FGSM | Gradient-based | 18.3 | 11.6 | 13.6 | 8.3 |
| FGSM | Optimization-based | 32.6 | 26.5 | 27.4 | 16.3 |
| FGSM | Mixed | 20.8 | 19.6 | 13.7 | 11.3 |
| C&W | Gradient-based | 93.5 | 96.2 | 95.2 | 93.1 |
| C&W | Optimization-based | 27.4 | 24.1 | 16.2 | 13.7 |
| C&W | Mixed | 25.3 | 22.2 | 17.3 | 12.0 |

Table 4: Error rate (%) of known type of attack on Cats vs. Dogs. We train over models VGG16, VGG19, inceptionv3 and resnet50, and we test the error rate over model VGG16. The baseline is traditional adversarial training. We use different algorithms to generate adversarial examples for training.

Figure 4 (panels: (a) Attack by FGSM, (b) Attack by C&W): Results on the Cats vs. Dogs dataset. ‘Adv.T’ means traditional adversarial training. ‘-Gra’ means the training process uses the gradient-based algorithms to generate adversarial examples. ‘-Opt’ means the optimization-based algorithm C&W, and ‘-Mix’ means the mixed version of algorithms.
| Attack method | Training method | Adv. Train(1) | Adv. Train(10) | AT^2L(1) | AT^2L(10) |
|---|---|---|---|---|---|
| FGSM | Gradient-based | 42.6 | 25.7 | 21.5 | 15.2 |
| FGSM | Optimization-based | 43.7 | 35.9 | 32.9 | 19.2 |
| FGSM | Mixed | 41.3 | 22.7 | 23.8 | 12.5 |
| C&W | Gradient-based | 99.3 | 95.4 | 92.1 | 94.6 |
| C&W | Optimization-based | 32.8 | 30.3 | 18.3 | 15.2 |
| C&W | Mixed | 27.6 | 24.6 | 19.5 | 13.5 |

Table 5: Error rate (%) of unknown type of attack on Cats vs. Dogs. We train over models VGG16, VGG19, inceptionv3 and resnet50, and we test the error rate over model VGG16. The baseline is traditional adversarial training. We use different algorithms to generate adversarial examples for training.

Figure 5 (panels: (a) Attack by FGSM, (b) Attack by C&W): Results on the Cats vs. Dogs dataset. ‘Adv.T’ means traditional adversarial training. ‘-Gra’ means the training process uses the gradient-based algorithms to generate adversarial examples. ‘-Opt’ means the optimization-based algorithm C&W, and ‘-Mix’ means the mixed version of algorithms.

A.2 MNIST

We then test our algorithm on MNIST. MNIST instances are smaller than those of Cats vs. Dogs, but the number of classes is larger. Each picture of MNIST is 28x28x1, so we construct 4 different models for the ensemble training process (shown in Table 6). Models A, B and C are CNNs with different structures, and model D only contains multiple dense layers and dropout layers. We perform a known type of attack and an unknown type of attack on this dataset.

In the known type of attack, we train an ensemble model over models A, B, C and D ($\mathcal{M}$ in Algorithm 2 is set to be [A, B, C, D]) and then attack the model with adversarial examples generated against model A. Here we use FGSM, I-FGSM, LL, I-LL and a combination of these 4 algorithms to generate adversarial examples for training. The model we pre-trained (Step 1 of Algorithm 2) reaches up to 99.1% accuracy. The parameter is 32, is 0.3, and is set to be 1.0. The model after normal adversarial training or adversarial training with triplet loss does not lose accuracy: its accuracy is still over 99% on average. We also run more experiments on the performance of different gradient-based attack methods and their combination. The result shows that AT^2L can increase the robustness of the model. When training with the mixed version, the robustness against corresponding attack methods shows the best performance.

In the unknown type of attack, we train an ensemble model over models A, B and D ($\mathcal{M}$ in Algorithm 2 is set to be [A, B, D]) and then attack the model with adversarial examples generated against model C. Other settings are the same as in the experiments on the known type of attack. The result is quite similar to that of the known type of attack, and the mixed version of AT^2L, which is trained with adversarial examples from both gradient-based and optimization-based algorithms, shows the lowest error rate.

The parameters in the triplet loss function are set to be , and the results are shown in Figure 6, Figure 7, Table 7, and Table 8. The model structures we used in the experiments are shown in Table 6.

| A | B | C | D |
|---|---|---|---|
| Input | Input | Input | Input |
| Conv(64,5,5) | Conv(64,8,8) | Conv(128,3,3) | Dense(300) |
| ReLU | ReLU | ReLU | ReLU |
| Conv(64,5,5) | Conv(128,6,6) | Conv(64,3,3) | Dropout |
| ReLU | ReLU | ReLU | Dense(300) |
| Dropout | Conv(128,5,5) | Dropout | ReLU |
| Dense(128) | ReLU | Flatten | Dropout |
| ReLU | Dropout | Dense(128) | Dense(300) |
| Dropout | Flatten | ReLU | ReLU |
| Dense(10) | Dense(10) | Dropout | Dropout |
|  |  | Dense(10) | Dense(10) |

Table 6: Structures of models on MNIST.

| Attack method | Training method | Adv. Train(1) | Adv. Train(10) | AT^2L(1) | AT^2L(10) |
| --- | --- | --- | --- | --- | --- |
| FGSM | FGSM | 56.9 | 42.0 | 54.0 | 34.8 |
| FGSM | iter_FGSM | 83.3 | 88.3 | 66.6 | 38.3 |
| FGSM | LL | 56.7 | 48.1 | 46.2 | 42.9 |
| FGSM | iter_LL | 87.9 | 83.8 | 53.0 | 44.1 |
| FGSM | Gradient-based | 58.3 | 46.3 | 51.8 | 40.3 |
| FGSM | Optimization-based | 85.2 | 84.2 | 68.9 | 59.6 |
| FGSM | Mixed | 86.7 | 54.9 | 55.9 | 37.2 |
| C&W | FGSM | 91.6 | 86.8 | 91.1 | 88.6 |
| C&W | iter_FGSM | 83.0 | 88.9 | 77.9 | 85.2 |
| C&W | LL | 82.6 | 91.3 | 86.8 | 88.5 |
| C&W | iter_LL | 79.8 | 84.2 | 89.9 | 85.2 |
| C&W | Gradient-based | 84.1 | 88.7 | 83.5 | 89.3 |
| C&W | Optimization-based | 56.4 | 56.1 | 43.3 | 45.2 |
| C&W | Mixed | 62.2 | 55.7 | 39.5 | 23.5 |

Table 7: Error rate (%) of the known type of attack on MNIST. We train over models A, B, C, and D and test the error rate on model A. The baseline is traditional adversarial training. We use different algorithms to generate adversarial examples for training and then use FGSM or C&W to test robustness.

Figure 6: Results on the MNIST dataset under the known type of attack: (a) attack by FGSM; (b) attack by C&W. ‘Adv.T’ means traditional adversarial training, ‘-Gra’ means that training uses gradient-based algorithms to generate adversarial examples, ‘-Opt’ means the optimization-based algorithm C&W, and ‘-Mix’ means the mixed version of the algorithm.

| Attack method | Training method | Adv. Train(1) | Adv. Train(10) | AT^2L(1) | AT^2L(10) |
| --- | --- | --- | --- | --- | --- |
| FGSM | FGSM | 59.3 | 52.6 | 52.1 | 45.2 |
| FGSM | iter_FGSM | 83.5 | 85.8 | 63.6 | 49.1 |
| FGSM | LL | 62.1 | 54.7 | 45.3 | 38.5 |
| FGSM | iter_LL | 82.5 | 86.1 | 51.9 | 45.1 |
| FGSM | Gradient-based | 67.4 | 53.5 | 54.1 | 42.2 |
| FGSM | Optimization-based | 86.1 | 87.8 | 59.2 | 55.3 |
| FGSM | Mixed | 61.4 | 58.2 | 51.6 | 39.1 |
| C&W | FGSM | 95.1 | 97.5 | 95.2 | 92.9 |
| C&W | iter_FGSM | 95.8 | 94.2 | 87.2 | 89.1 |
| C&W | LL | 94.1 | 93.2 | 93.9 | 86.6 |
| C&W | iter_LL | 97.9 | 94.7 | 98.5 | 83.1 |
| C&W | Gradient-based | 91.3 | 95.1 | 95.9 | 97.2 |
| C&W | Optimization-based | 58.3 | 57.8 | 46.1 | 43.3 |
| C&W | Mixed | 64.9 | 53.1 | 47.3 | 32.7 |

Table 8: Error rate (%) of the unknown type of attack on MNIST. We train over models A, B, and D and test the error rate on model C. The baseline is traditional adversarial training. We use different algorithms to generate adversarial examples for training and then use FGSM or C&W to test robustness.

Figure 7: Results on the MNIST dataset under the unknown type of attack: (a) attack by FGSM; (b) attack by C&W. ‘Adv.T’ means traditional adversarial training, ‘-Gra’ means that training uses gradient-based algorithms to generate adversarial examples, ‘-Opt’ means the optimization-based algorithm C&W, and ‘-Mix’ means the mixed version of the algorithm.
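As a reading aid for the Mixed rows of Table 8, the gap between the Adv. Train(10) column and the corresponding triplet-loss column can be computed directly. The numbers below are copied from the table; the dictionary layout and the function name are illustrative only.

```python
# Error rates (%) from Table 8 (unknown type of attack on MNIST),
# "Mixed" training row, columns Adv. Train(10) and the triplet-loss (10) column.
table8_mixed = {
    "FGSM": {"adv_train": 58.2, "triplet": 39.1},
    "C&W": {"adv_train": 53.1, "triplet": 32.7},
}

def absolute_reduction(row):
    """Percentage points of error removed relative to plain adversarial training."""
    return round(row["adv_train"] - row["triplet"], 1)

reductions = {attack: absolute_reduction(row) for attack, row in table8_mixed.items()}
```

For these two rows the reduction is about 19 percentage points under FGSM and about 20 under C&W.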

a.3 CIFAR10

MNIST is a standard dataset in adversarial-example research, but it is small, so we run further experiments on CIFAR10. The models we choose for CIFAR10 are the same as for the Cats vs. Dogs dataset: VGG16, VGG19, inceptionv3 and resnet50.

We also perform a known type of attack and an unknown type of attack on CIFAR10. In the known type of attack, we train an ensemble model over VGG16, VGG19, inceptionv3 and resnet50 (the model list in Algorithm 2 is set to [VGG16, VGG19, inceptionv3, resnet50]) and then attack the model with adversarial examples generated against VGG16. In the unknown type of attack, we train over the same models without VGG16 and attack the model with adversarial examples generated against VGG16. The parameter settings for both attacks are the same: . The accuracy of the pre-trained model (Step 1 of Algorithm 2) reaches 92.6%, and it does not decrease much after adversarial training (about 92%). The results for the unknown type of attack are slightly worse than for the known type of attack, but the robustness of the model still increases. The error rate of the original model is 90.1% against FGSM and 99.2% against C&W. The results show that AT^2L trained over ensemble models and mixed attack methods performs best when gradient-based and optimization-based attacks are considered together: training on a single attack family gives a lower error against that family but a much higher error against the other. The triplet loss increases the robustness of the model more than normal adversarial training without triplet loss.

The model structures used to generate adversarial examples are the same as in the Cats vs. Dogs setting. We experiment with both the known and unknown settings, and the results show that our triplet loss improves the robustness of the model more than the traditional adversarial training process. The results are shown in Figure 8, Figure 9, Table 9, and Table 10.

| Attack method | Training method | Adv. Train(1) | Adv. Train(10) | AT^2L(1) | AT^2L(10) |
| --- | --- | --- | --- | --- | --- |
| FGSM | Gradient-based | 64.2 | 50.3 | 49.6 | 41.8 |
| FGSM | Optimization-based | 74.3 | 81.9 | 63.1 | 62.8 |
| FGSM | Mixed | 66.3 | 63.7 | 57.2 | 44.3 |
| C&W | Gradient-based | 97.1 | 98.8 | 96.3 | 97.9 |
| C&W | Optimization-based | 53.9 | 43.0 | 46.1 | 35.7 |
| C&W | Mixed | 62.5 | 61.1 | 53.9 | 44.4 |

Table 9: Error rate (%) of the known type of attack on CIFAR10. We train over VGG16, VGG19, inceptionv3, and resnet50 and test the error rate on VGG16. The baseline is traditional adversarial training. We use different algorithms to generate adversarial examples for training and then use FGSM or C&W to test robustness.

Figure 8: Results on the CIFAR10 dataset under the known type of attack: (a) attack by FGSM; (b) attack by C&W. ‘Adv.T’ means traditional adversarial training, ‘-Gra’ means that training uses gradient-based algorithms to generate adversarial examples, ‘-Opt’ means the optimization-based algorithm C&W, and ‘-Mix’ means the mixed version of the algorithm.

| Attack method | Training method | Adv. Train(1) | Adv. Train(10) | AT^2L(1) | AT^2L(10) |
| --- | --- | --- | --- | --- | --- |
| FGSM | Gradient-based | 67.6 | 53.1 | 45.1 | 41.6 |
| FGSM | Optimization-based | 84.9 | 88.3 | 77.3 | 71.4 |
| FGSM | Mixed | 73.8 | 69.2 | 61.7 | 50.9 |
| C&W | Gradient-based | 98.5 | 97.3 | 98.5 | 98.8 |
| C&W | Optimization-based | 57.6 | 50.7 | 51.1 | 45.7 |
| C&W | Mixed | 69.4 | 63.6 | 56.8 | 47.3 |

Table 10: Error rate (%) of the unknown type of attack on CIFAR10. We train over VGG19, inceptionv3, and resnet50 and test the error rate on VGG16. The baseline is traditional adversarial training. We use different algorithms to generate adversarial examples for training and then use FGSM or C&W to test robustness.

Figure 9: Results on the CIFAR10 dataset under the unknown type of attack: (a) attack by FGSM; (b) attack by C&W. ‘Adv.T’ means traditional adversarial training, ‘-Gra’ means that training uses gradient-based algorithms to generate adversarial examples, ‘-Opt’ means the optimization-based algorithm C&W, and ‘-Mix’ means the mixed version of the algorithm.

Appendix B Apply to current defense

The second part of our experiment applies our new loss to existing defense methods. We choose three typical defense methods described in the paper: thermometer encoding, mitigating through randomization, and Defense-GAN. We aim to improve the robustness of the model on top of these defenses.

b.1 Thermometer encoding

Thermometer encoding relies on a phenomenon called gradient shattering. The defense encodes the original input into a non-differentiable representation, and the original defense is combined with a traditional adversarial training process, so our triplet regularization can be easily inserted into it. We only need to replace the loss function used during adversarial training with our new loss function; we do not need to generate different types of adversarial examples or attack different model structures. In this experiment, we treat the image after thermometer encoding as the adversarial example. In the known setting, we use the same model structure for training and testing; in the unknown setting, we use different structures. We report results after the first round of adversarial training and after 7 rounds of training.
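To make the encoding concrete, here is a minimal sketch of a thermometer code for a single pixel. The threshold convention (flag i is set when the pixel exceeds i / levels) and the number of levels are assumptions chosen for illustration; the original defense may order or quantize the levels differently.

```python
def thermometer_encode(pixel, levels):
    """Encode a pixel in [0, 1] as `levels` binary flags.

    Flag i is 1 when the pixel exceeds the threshold i / levels, so the ones
    form a contiguous run whose length grows with intensity.
    """
    if not 0.0 <= pixel <= 1.0:
        raise ValueError("pixel must lie in [0, 1]")
    return [1 if pixel > i / levels else 0 for i in range(levels)]

codes = [thermometer_encode(p, levels=4) for p in (0.0, 0.3, 0.6, 1.0)]
```

Because the code is piecewise constant in the pixel value (0.3 and 0.4 map to the same flags), its gradient with respect to the input is zero almost everywhere, which is why gradient-based attacks struggle against it.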

The parameters in the triplet loss function are set to be , and the results are shown in Tables 11-14. In each table, we test our model with three types of examples, i.e., clean data, examples generated by FGSM, and examples generated by PGD/LS-PGA.

| Test examples | Ori | Th. En.(1) | Th. En.(7) | Th. En.+ Tri. Reg(1) | Th. En.+ Tri. Reg(7) |
| --- | --- | --- | --- | --- | --- |
| Clean | 0.8 | 0.8 | 0.9 | 0.8 | 0.8 |
| FGSM | 100.0 | 31.6 | 4.2 | 28.7 | 3.5 |
| PGD/LS-PGA | 100.0 | 25.0 | 5.9 | 23.0 | 2.7 |

Table 11: Error rate (%) of known type of attacks on MNIST over Thermometer Models. ‘Th. En.’ means Thermometer Encoding. ‘Tri. Reg’ means applying our triplet regularization.

| Test examples | Ori | Th. En.(1) | Th. En.(7) | Th. En.+ Tri. Reg(1) | Th. En.+ Tri. Reg(7) |
| --- | --- | --- | --- | --- | --- |
| Clean | 5.8 | 7.6 | 10.1 | 6.1 | 7.2 |
| FGSM | 51.5 | 37.1 | 20.0 | 27.6 | 15.1 |
| PGD/LS-PGA | 49.5 | 39.3 | 20.9 | 29.7 | 13.4 |

Table 12: Error rate (%) of known type of attacks on CIFAR10 over Thermometer Models. ‘Th. En.’ means Thermometer Encoding. ‘Tri. Reg’ means applying our triplet regularization.

| Test examples | Ori | Th. En.(1) | Th. En.(7) | Th. En.+ Tri. Reg(1) | Th. En.+ Tri. Reg(7) |
| --- | --- | --- | --- | --- | --- |
| Clean | 3.5 | 5.1 | 5.5 | 5.0 | 9.2 |
| FGSM | 89.1 | 73.7 | 67.0 | 53.1 | 47.9 |
| PGD/LS-PGA | 88.2 | 77.1 | 71.9 | 62.8 | 55.3 |

Table 13: Error rate (%) of unknown type of attacks on MNIST over Thermometer Models. ‘Th. En.’ means Thermometer Encoding. ‘Tri. Reg’ means applying our triplet regularization.

| Test examples | Ori | Th. En.(1) | Th. En.(7) | Th. En.+ Tri. Reg(1) | Th. En.+ Tri. Reg(7) |
| --- | --- | --- | --- | --- | --- |
| Clean | 11.5 | 13.6 | 20.1 | 16.2 | 17.7 |
| FGSM | 46.5 | 43.8 | 39.1 | 38.6 | 33.0 |
| PGD/LS-PGA | 55.0 | 52.9 | 49.3 | 51.9 | 43.0 |

Table 14: Error rate (%) of unknown type of attacks on CIFAR10 over Thermometer Models. ‘Th. En.’ means Thermometer Encoding. ‘Tri. Reg’ means applying our triplet regularization.

b.2 Mitigating through randomization

Mitigating through randomization relies on stochastic gradients, which are caused by randomized defenses. The input is randomly transformed before being fed to the classifier, which makes the gradients random. The target models and the defense models are identical except for the parameter settings of the randomization layers: the randomization parameters of the target models are predefined, while those of the defense models are randomly generated at test time.
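A randomization layer of this kind can be sketched as a random resize followed by random zero-padding. The sketch assumes these are the two transformations used by the defense; the toy image sizes, the nearest-neighbour resize, and the function names are illustrative choices rather than the original implementation.

```python
import random

def resize_nearest(img, new_size):
    """Nearest-neighbour resize of a square image given as a list of rows."""
    old = len(img)
    return [[img[r * old // new_size][c * old // new_size]
             for c in range(new_size)] for r in range(new_size)]

def random_resize_and_pad(img, min_size, out_size, rng):
    """Resize to a random size in [min_size, out_size), then zero-pad at a random offset."""
    new_size = rng.randrange(min_size, out_size)
    resized = resize_nearest(img, new_size)
    top = rng.randint(0, out_size - new_size)
    left = rng.randint(0, out_size - new_size)
    padded = [[0] * out_size for _ in range(out_size)]
    for r in range(new_size):
        for c in range(new_size):
            padded[top + r][left + c] = resized[r][c]
    return padded

rng = random.Random(0)
image = [[1] * 4 for _ in range(4)]  # toy 4x4 image of ones
out = random_resize_and_pad(image, min_size=4, out_size=7, rng=rng)
```

Each call draws a fresh size and offset, so an attacker who computes gradients for one draw faces a different transformation at test time.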

The original paper also mentions the traditional adversarial training process, so our triplet regularization can be easily applied to this defense. As before, we do not use different types of adversarial examples or different model structures to improve the effect of adversarial training. We select two attack scenarios in our experiment: the vanilla attack scenario and the ensemble-pattern attack scenario.

The parameters in the triplet loss function are set to be , and the results are shown in Tables 15-16. We use three different attack methods, i.e., FGSM, DeepFool, and C&W, to test the effect of the defense.

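| Attack | Inception-v3: Ori | Inception-v3: Rand | Inception-v3: Rand+Tri. Reg | ResNet-v2-101: Ori | ResNet-v2-101: Rand | ResNet-v2-101: Rand+Tri. Reg | Inception-ResNet-v2: Ori | Inception-ResNet-v2: Rand | Inception-ResNet-v2: Rand+Tri. Reg | Ens-adv-Inception-ResNet-v2: Ori | Ens-adv-Inception-ResNet-v2: Rand | Ens-adv-Inception-ResNet-v2: Rand+Tri. Reg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FGSM | 66.8 | 36.2 | 30.5 | 73.7 | 28.2 | 21.7 | 34.7 | 19.0 | 8.3 | 15.6 | 4.3 | 4.6 |
| Deepfool | 100.0 | 1.7 | 1.1 | 100.0 | 2.3 | 1.5 | 100.0 | 1.8 | 0.8 | 99.8 | 0.9 | 0.7 |
| C&W | 100.0 | 3.1 | 2.6 | 100.0 | 2.9 | 1.2 | 99.7 | 2.3 | 1.3 | 99.1 | 1.2 | 0.9 |

Table 15: Top-1 classification error rate under the vanilla attack scenario. ‘Ori’ means the original model. ‘Rand’ means adding some randomization layers. ‘Tri. Reg’ means applying our triplet regularization.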

| Attack | Inception-v3: Ori | Inception-v3: Rand | Inception-v3: Rand+Tri. Reg | ResNet-v2-101: Ori | ResNet-v2-101: Rand | ResNet-v2-101: Rand+Tri. Reg | Inception-ResNet-v2: Ori | Inception-ResNet-v2: Rand | Inception-ResNet-v2: Rand+Tri. Reg | Ens-adv-Inception-ResNet-v2: Ori | Ens-adv-Inception-ResNet-v2: Rand | Ens-adv-Inception-ResNet-v2: Rand+Tri. Reg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FGSM | 62.7 | 58.8 | 37.3 | 60.8 | 55.1 | 45.1 | 28.5 | 25.7 | 23.5 | 13.8 | 11.1 | 8.9 |
| Deepfool | 99.4 | 18.7 | 17.7 | 99.1 | 19.5 | 22.2 | 99.1 | 30.6 | 41.6 | 98.4 | 6.5 | 9.0 |
| C&W | 99.4 | 37.1 | 23.1 | 99.0 | 25.7 | 20.0 | 98.4 | 31.7 | 21.7 | 94.2 | 13.9 | 7.4 |

Table 16: Top-1 classification error rate under the ensemble-pattern attack scenario. Similar to vanilla attack and single-pattern attack scenarios, we see that randomization layers increase the accuracy under all attacks and networks. This clearly demonstrates the effectiveness of the proposed randomization method on defending against adversarial examples, even under this very strong attack scenario. ‘Ori’ means the original model. ‘Rand’ means adding some randomization layers. ‘Tri. Reg’ means applying our triplet regularization.

b.3 Defense-GAN

Defense-GAN relies on vanishing and exploding gradients. This defense does not affect the training or testing of the classifier. Adversarial training was also used in the original paper and performed well, so we apply our triplet regularization under the same parameter settings. The results show that the new loss improves robustness on top of the original defense.
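The reconstruction step of Defense-GAN can be pictured as projecting the input onto the range of a generator by minimizing the reconstruction error over the latent code, using gradient descent from several random starts. The sketch below replaces the GAN generator with a toy linear map so that it runs without any deep-learning library; the generator, the step count `steps`, the restart count `restarts`, and the learning rate are all hypothetical values chosen for illustration.

```python
import random

def generator(z):
    """Toy stand-in for a GAN generator: maps a scalar latent z to a 2-D sample."""
    return [z, 2.0 * z]

def recon_error(z, x):
    g = generator(z)
    return sum((gi - xi) ** 2 for gi, xi in zip(g, x))

def project(x, steps, restarts, lr, rng):
    """Minimise ||G(z) - x||^2 over z by gradient descent from several random starts."""
    best_z, best_err = None, float("inf")
    for _ in range(restarts):
        z = rng.uniform(-1.0, 1.0)
        for _ in range(steps):
            g = generator(z)
            # Derivative of (z - x0)^2 + (2z - x1)^2 with respect to z.
            grad = 2.0 * (g[0] - x[0]) + 4.0 * (g[1] - x[1])
            z -= lr * grad
        err = recon_error(z, x)
        if err < best_err:
            best_z, best_err = z, err
    return generator(best_z), best_err

rng = random.Random(0)
x_in = [1.0, 2.3]  # input lying slightly off the generator's range
x_rec, err = project(x_in, steps=200, restarts=3, lr=0.05, rng=rng)
```

The classifier then receives the reconstruction `x_rec` instead of the raw input, which is why the defense leaves the classifier's own training and testing unchanged.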

The parameters in the triplet loss function are set to be , and Defense-GAN has and . The results are shown in Table 17 and the model structures are listed in Table 18.

| Attack | Model | No Attack | No Defense | Defense-GAN-Rec | Adv. Train | AT^2L | Defense-GAN-Rec + Tri. Reg |
| --- | --- | --- | --- | --- | --- | --- | --- |
| FGSM | A | 0.3 | 88.3 | 1.2 | 34.9 | 23.6 | 1.1 |
| FGSM | B | 3.8 | 97.8 | 4.4 | 94.0 | 96.8 | 0.7 |
| FGSM | C | 0.4 | 67.9 | 1.1 | 21.4 | 11.9 | 0.8 |
| FGSM | D | 0.8 | 96.2 | 2.0 | 26.8 | 15.2 | 1.6 |
| C&W | A | 0.3 | 85.9 | 1.1 | 92.3 | 96.5 | 1.4 |
| C&W | B | 3.8 | 96.8 | 8.4 | 72.0 | 70.7 | 4.7 |
| C&W | C | 0.4 | 87.4 | 1.1 | 96.9 | 93.7 | 1.1 |
| C&W | D | 0.8 | 96.8 | 1.7 | 99.0 | 87.7 | 1.4 |

Table 17: Classification error rates (%) of different classifier models using various defense strategies on the MNIST dataset, under FGSM and C&W known type of attacks. ‘Adv. Train’ means a traditional adversarial training process. ‘Tri. Reg’ means applying our triplet regularization.

| A | B | C | D | Generator | Discriminator |
| --- | --- | --- | --- | --- | --- |
| Conv(64,5x5,1) | Dropout(0.2) | Conv(128,3x3,1) | FC(200) | FC(4096) | Conv(64,5x5,2) |
| ReLU | Conv(64,8x8,2) | ReLU | ReLU | ReLU | LeakyReLU(0.2) |
| Conv(64,5x5,2) | ReLU | Conv(64,3x3,2) | Dropout(0.5) | ConvT(256,5x5,1) | Conv(128,5x5,2) |
| ReLU | Conv(128,6x6,2) | ReLU | FC(200) | ReLU | LeakyReLU(0.2) |
| Dropout(0.25) | ReLU | Dropout(0.25) | ReLU | ConvT(128,5x5,1) | Conv(256,5x5,2) |
| FC(128) | Conv(128,5x5,1) | FC(128) | Dropout(0.5) | ReLU | LeakyReLU(0.2) |
| ReLU | ReLU | ReLU | FC(10)+Softmax | ConvT(1,5x5,1) | FC(1) |
| Dropout(0.5) | Dropout(0.5) | Dropout(0.5) | | Sigmoid | Sigmoid |
| FC(10)+Softmax | FC(10)+Softmax | FC(10)+Softmax | | | |

Table 18: Neural network architectures used for classifiers, substitute models and GANs.