Fast Gradient Projection Method for Text Adversary Generation and Adversarial Training

08/09/2020 · by Xiaosen Wang, et al. · Huazhong University of Science & Technology

Adversarial training has shown effectiveness and efficiency in improving the robustness of deep neural networks for image classification. For text classification, however, the discrete property of the text input space makes it hard to adapt gradient-based adversarial methods from the image domain. Existing text attack methods, moreover, are effective but not efficient enough to be incorporated into practical text adversarial training. In this work, we propose a Fast Gradient Projection Method (FGPM) to generate text adversarial examples based on synonym substitution, where each substitution is scored by the product of the gradient magnitude and the projected distance between the original word and the candidate word in the gradient direction. Empirical evaluations demonstrate that FGPM achieves attack performance and transferability similar to competitive attack baselines while being about 20 times faster than the current fastest text attack method. Such performance enables us to incorporate FGPM into adversarial training as an effective defense method that scales to large neural networks and datasets. Experiments show that adversarial training with FGPM (ATF) significantly improves the model robustness and blocks the transferability of adversarial examples without any decay in model generalization.

1 Introduction

Deep Neural Networks (DNNs) have achieved great success in recent years (Krizhevsky et al., 2012; Kim, 2014; Devlin et al., 2019). However, researchers have found that DNNs are often vulnerable to adversarial examples for image data (Szegedy et al., 2014) as well as text data (Papernot et al., 2016). In the image domain, numerous methods have been proposed for adversarial attack (Goodfellow et al., 2015; Athalye et al., 2018) and defense (Goodfellow et al., 2015; Samangouei et al., 2018; Guo et al., 2018). Among these, adversarial training, which adopts perturbed examples in the training stage to promote model robustness, is particularly popular and effective (Athalye et al., 2018).

In the text domain, the lexical, grammatical and semantic constraints and the discrete input space make it much harder to craft adversarial examples. Current attack methods include character-level attacks (Liang et al., 2018; Li et al., 2019; Ebrahimi et al., 2018), word-level attacks (Papernot et al., 2016; Samanta and Mehta, 2017; Gong et al., 2018; Cheng et al., 2018; Kuleshov et al., 2018; Neekhara et al., 2019; Ren et al., 2019; Wang et al., 2019), and sentence-level attacks (Iyyer et al., 2018; Ribeiro et al., 2018). For character-level attacks, recent work (Pruthi et al., 2019) has shown that a spell checker can easily fix the perturbations. Although Ebrahimi et al. (2018) adapt HotFlip for word-level attack, they cannot generate abundant adversarial examples due to the syntactic and semantic constraints. Word-level attacks, then, either face the challenge of preserving semantics under gradient-based perturbations, or are time-consuming due to query-based synonym substitution. Similarly, as sentence-level attacks are usually based on paraphrasing, adversary generation demands much longer time. Miyato et al. (2016) adopt FGSM (Goodfellow et al., 2015) to perturb the word embeddings, but such perturbations are used to enhance model performance on the original data rather than to generate adversarial examples.

Adversarial defense for text data is much less studied in the literature. Some works (Jia et al., 2019; Huang et al., 2019) are based on interval bound propagation (IBP), first proposed in the image domain (Gowal et al., 2018; Dvijotham et al., 2018), to provide certifiable defense. Zhou et al. (2019) propose to learn to discriminate perturbations (DISP) and restore the embedding of the original word for defense without altering the training process or the model structure. Wang et al. (2019) propose a Synonym Encoding Method (SEM), which inserts an encoder before the input layer to defend against synonym substitution based attacks. To our knowledge, adversarial training, one of the most efficacious defense approaches for images (Athalye et al., 2018), has not been effectively implemented as a defense method for text classification.

Among different types of adversarial attacks, existing synonym substitution based attack methods guarantee grammatical correctness and avoid semantic changes, but are usually too inefficient to be incorporated into adversarial training. Gradient-based attacks, on the other hand, often have much higher efficiency. Due to the discreteness of the text input space, however, it is challenging to adopt such methods directly in the text embedding space (Miyato et al., 2016) while generating meaningful adversarial examples that preserve the original semantics.

In this work, we propose a gradient-based adversarial attack method, called Fast Gradient Projection Method (FGPM), for efficient text adversary generation. Specifically, we approximate the classification confidence change caused by synonym substitution by the product of the gradient magnitude and the projected distance between the original word and the candidate word in the gradient direction. At each iteration, we substitute a word with the synonym that leads to the highest product value. Compared with existing query-based attack methods, FGPM needs only a single back-propagation pass to obtain the gradient. As a result, FGPM is about 20 times faster than the current fastest adversarial attack, making it possible to incorporate FGPM into adversarial training for efficient and effective defense in text classification.

Extensive experiments on three standard datasets demonstrate that FGPM can achieve similar attack performance and transferability as state-of-the-art synonym substitution based adversarial attacks with much higher efficiency. More importantly, empirical results show that our adversarial training with FGPM (ATF) can promote the model robustness against white-box as well as black-box adversarial attacks, and effectively block the transferability of adversarial examples without any decay on the model generalization.

2 Related Work

In this section, we provide a brief overview of word-level text adversarial attacks and defenses.

2.1 Adversarial Attack

Adversarial attacks fall into two settings: (a) white-box attacks allow full access to the target model, including model outputs, (hyper-)parameters, gradients and architecture, and (b) black-box attacks only allow access to the model outputs.

Methods based on word embeddings usually fall into the white-box setting. Papernot et al. (2016) find a word in the dictionary such that the sign of the difference between the found word and the original word is closest to the sign of the gradient. However, such a word does not necessarily preserve semantic and/or syntactic correctness and consistency. Gong et al. (2018) further employ the Word Mover's Distance (WMD) in an attempt to preserve semantics. Cheng et al. (2018) also propose an attack based on the embedding space with additional constraints, targeting seq2seq models. Our work similarly produces efficient white-box attacks, while guaranteeing the quality of adversarial examples by restricting the perturbations to synonym substitutions, which usually appear in black-box attacks.

In the black-box setting, Kuleshov et al. (2018) propose a Greedy Search Attack (GSA) that perturbs the input by synonym substitution. Specifically, GSA greedily finds a synonym for replacement that minimizes the model's classification confidence. Ren et al. (2019) propose Probability Weighted Word Saliency (PWWS), which greedily substitutes each target word with a synonym determined by a combination of the classification confidence change and word saliency. Alzantot et al. (2018) also use synonym substitution and propose a population-based Genetic Algorithm (GA). Wang et al. (2019) further propose an Improved Genetic Algorithm (IGA) that allows words at the same position to be substituted more than once and outperforms GA.

2.2 Adversarial Defense

Currently, there is a series of works (Miyato et al., 2016; Sato et al., 2018; Barham and Feizi, 2019) that perturb the word embeddings and utilize adversarial training as a regularization strategy. These works aim to improve model performance on the original dataset, but do not intend to defend against adversarial attacks. Therefore, we do not consider such works in this paper.

A stream of recent popular defense methods (Jia et al., 2019; Huang et al., 2019) focuses on verifiable robustness. They use IBP in the training procedure to produce models that are provably robust to all possible perturbations within the constraints. Such an endeavor, however, is currently time-consuming in the training stage, as the authors noted (Jia et al., 2019), and hard to scale to complex models or large datasets. Zhou et al. (2019) train a perturbation discriminator that estimates how likely a token in the text is to have been perturbed, together with an embedding estimator that restores the embedding of the original word, to block adversarial attacks. In contrast, our work focuses on fast adversarial example generation and an easy-to-apply defense method for large neural networks and datasets.

Alzantot et al. (2018) and Ren et al. (2019) adopt the adversarial examples generated by their attack methods for adversarial training and achieve some robustness improvement. Unfortunately, due to the relatively low efficiency of generating adversaries, they are unable to craft enough perturbations during training to ensure a significant robustness improvement. To our knowledge, adversarial training has not been practically applied to text classification as an efficient and effective defense method.

Wang et al. (2019) propose a Synonym Encoding Method (SEM) that uses a synonym encoder to map all synonyms to the same code in the embedding space and force the classification function to be smoother. Trained with the encoder, their model also obtains a significant improvement in robustness with a slight decay in model generalization.

3 Fast Gradient Projection Method

In this section, we formalize the definition of adversarial examples for text classification and describe in detail the proposed adversarial attack method, Fast Gradient Projection Method (FGPM).

3.1 Text Adversarial Examples

Let $\mathcal{X}$ denote the input space containing all possible input texts and $\mathcal{Y}$ the output space. Let $x = \langle w_1, \dots, w_n \rangle \in \mathcal{X}$ denote an input sample consisting of $n$ words, and let $\mathcal{W}$ be the dictionary containing all possible words in the input texts. A classifier $f: \mathcal{X} \rightarrow \mathcal{Y}$ is expected to learn a mapping so that for any sample $x$, the predicted label $f(x)$ equals its true label $y$ with high probability. Let $f_y(x)$ denote the logit output of classifier $f$ on category $y$. The adversary adds an imperceptible perturbation on $x$ to craft an adversarial example $x_{adv}$ that misleads the classifier $f$:

$$f(x_{adv}) \neq f(x) = y \quad \text{s.t.} \quad \|x_{adv} - x\|_p \leq \epsilon,$$

where $\epsilon$ is a hyper-parameter denoting the perturbation upper bound, and $\|\cdot\|_p$ is the $\ell_p$-norm distance metric, which here denotes the word substitution ratio as the measure for the perturbation caused by synonym substitution:

$$\|x_{adv} - x\|_p = \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}\{w_i \neq w_i^{adv}\}.$$

Here $\mathbb{1}\{\cdot\}$ is an indicator function, and $x_{adv} = \langle w_1^{adv}, \dots, w_n^{adv} \rangle$.
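To make the distance metric concrete, here is a minimal Python sketch (the function and variable names are ours, not from the paper's released code) that computes the word substitution ratio between a benign text and a candidate adversarial version.

```python
from typing import List

def substitution_ratio(original: List[str], perturbed: List[str]) -> float:
    """Fraction of positions whose word differs between the two texts.

    Assumes both texts have the same length, i.e. the attack only
    substitutes words and never inserts or deletes them.
    """
    assert len(original) == len(perturbed)
    changed = sum(1 for w, w_adv in zip(original, perturbed) if w != w_adv)
    return changed / len(original)

# Example: one word out of five replaced -> ratio 0.2
print(substitution_ratio(
    ["the", "movie", "was", "really", "good"],
    ["the", "movie", "was", "truly",  "good"],
))
```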

3.2 Generating Adversarial Examples

Mrkšić et al. (2016) have shown that counter-fitting can help remove antonyms, which are also considered "similar words" in the original GloVe vector space, and thereby improve the vectors' capability of indicating semantic similarity. Thus, we post-process the GloVe vectors by counter-fitting and define a synonym set $S(w)$ for each word $w \in \mathcal{W}$ in the embedding space as follows:

$$S(w) = \{\hat{w} \in \mathcal{W} \mid \|\hat{w} - w\|_2 \leq \delta\}, \tag{1}$$

where $\delta$ is a hyper-parameter that constrains the maximum Euclidean distance for synonyms in the embedding space. We use the same setting of $\delta$ as in Wang et al. (2019) in our experiments.
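As an illustration, the sketch below builds such synonym sets from a matrix of counter-fitted word vectors using a Euclidean distance threshold; the variable names and the brute-force nearest-neighbor search are our own simplifications, not the authors' implementation.

```python
import numpy as np

def build_synonym_sets(vocab, embeddings, delta):
    """Return {word: [synonyms]} following Eq. (1): all other words within
    Euclidean distance delta in the (counter-fitted) embedding space.

    vocab:      list of words, length V
    embeddings: np.ndarray of shape (V, d), counter-fitted vectors
    delta:      distance threshold (hyper-parameter)
    """
    synonym_sets = {}
    for i, word in enumerate(vocab):
        # Distances from word i to every word in the vocabulary.
        dists = np.linalg.norm(embeddings - embeddings[i], axis=1)
        neighbors = [vocab[j] for j in np.where(dists <= delta)[0] if j != i]
        synonym_sets[word] = neighbors
    return synonym_sets

# Toy usage with random vectors; real usage would load counter-fitted GloVe.
rng = np.random.default_rng(0)
vocab = ["good", "great", "fine", "bad"]
vecs = rng.normal(size=(len(vocab), 8))
print(build_synonym_sets(vocab, vecs, delta=3.0))
```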

Once we have the synonym set $S(w)$ for each word $w$, the two key issues are how to select a synonym for each word and how to determine the substitution order.

Word Substitution. As shown in Figure 1 (a), for each word $w_i$, we wish to pick a word $\hat{w}_i^* \in S(w_i)$ that earns the most benefit to the overall substitution process, which we call the optimal synonym. Due to the high complexity of finding the optimal synonym, previous works (Ren et al., 2019; Wang et al., 2019) greedily pick the synonym that minimizes the classification confidence:

$$\hat{w}_i^* = \operatorname*{arg\,min}_{\hat{w}_i \in S(w_i)} f_y(\hat{x}_i),$$

where $\hat{x}_i = \langle w_1, \dots, w_{i-1}, \hat{w}_i, w_{i+1}, \dots, w_n \rangle$. However, this selection process is time consuming, as picking such a $\hat{w}_i^*$ needs $|S(w_i)|$ queries on the model. To reduce the computational complexity, based on the local linearity of deep models, we use the product of the magnitude of the gradient and the projected distance between two synonyms in the gradient direction in the word embedding space to estimate the change of the classification confidence. Specifically, as illustrated in Figure 1 (b), we first calculate the gradient $\nabla_{w_i} J(\theta, x, y)$ for each word $w_i$, where $J(\theta, x, y)$ is the loss function used for training. Then, we estimate the change by calculating $(\hat{w}_i - w_i)^\top \nabla_{w_i} J(\theta, x, y)$. To determine the optimal synonym $\hat{w}_i^*$, we choose the synonym with the maximum product value:

$$\hat{w}_i^* = \operatorname*{arg\,max}_{\hat{w}_i \in S(w_i)} \, (\hat{w}_i - w_i)^\top \nabla_{w_i} J(\theta, x, y). \tag{2}$$

Figure 1: Strategies to pick the most suitable synonym to substitute word $w_i$. (a) Pick the synonym that minimizes the classification confidence among all the synonyms in $S(w_i)$. (b) Pick the synonym that maximizes the product of the magnitude of the gradient and the projected distance between $w_i$ and $\hat{w}_i$ in the gradient direction.
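A minimal PyTorch-style sketch of this scoring step is given below; it assumes the classifier takes embedded input and exposes its embedding table, which is our reading of Eq. (2) rather than the authors' released code, and all names here are hypothetical.

```python
import torch
import torch.nn.functional as F

def pick_optimal_synonyms(model, embedding, token_ids, label, synonym_ids):
    """For each position i, score every candidate synonym by
    (e(syn) - e(w_i))^T grad_i, as in Eq. (2), and keep the best one.

    model:       maps an embedded text (1, n, d) to logits (1, num_classes)
    embedding:   nn.Embedding holding the word vectors
    token_ids:   LongTensor of shape (n,) with the current words
    label:       int, true label y
    synonym_ids: list of LongTensors, synonym_ids[i] holds ids of S(w_i)
    Returns: list of (best_syn_id, best_score) per position; -inf if no synonym.
    """
    embedded = embedding(token_ids.unsqueeze(0)).detach().requires_grad_(True)
    logits = model(embedded)
    loss = F.cross_entropy(logits, torch.tensor([label]))
    loss.backward()                      # one back-propagation per iteration
    grads = embedded.grad[0]             # (n, d), gradient w.r.t. each word

    choices = []
    for i, cand_ids in enumerate(synonym_ids):
        if len(cand_ids) == 0:
            choices.append((None, float("-inf")))
            continue
        # Displacement from the current word to each candidate synonym.
        diff = embedding(cand_ids) - embedding(token_ids[i])   # (k, d)
        scores = diff @ grads[i]                               # (k,)
        best = torch.argmax(scores)
        choices.append((int(cand_ids[best]), float(scores[best])))
    return choices
```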

Substitution Order. For each word $w_i$ in text $x$, we use the above word substitution strategy to choose its optimal substitution synonym, and obtain a candidate set $C = \{\hat{w}_1^*, \dots, \hat{w}_n^*\}$. Then, we need to determine which word in $x$ should be substituted. Similar to the word substitution strategy, we pick the position $i^*$ whose candidate yields the biggest perturbation value projected in the gradient direction, and substitute $w_{i^*}$ with $\hat{w}_{i^*}^*$:

$$i^* = \operatorname*{arg\,max}_{1 \leq i \leq n} \, (\hat{w}_i^* - w_i)^\top \nabla_{w_i} J(\theta, x, y). \tag{3}$$

In summary, to generate an adversarial example, we apply the above word substitution and substitution order strategies for synonym substitution, and substitute words iteratively until the classifier makes a wrong prediction. The overall FGPM procedure is shown in Algorithm 1.

Input: Benign sample $x$; true label $y$ of $x$; target classifier $f$; upper bound distance for synonyms $\delta$; maximum number of iterations $T$; upper bound $\epsilon$ for the word substitution ratio
Output: Adversarial example $x_{adv}$
1: Initialize $x_{adv} \leftarrow x$
2: Calculate $S(w_i)$ by Eq. (1) for each word $w_i$ in $x$
3: for $t = 1$ to $T$ do
4:     Construct the candidate set $C$ by Eq. (2)
5:     Calculate the optimal position $i^*$ by Eq. (3)
6:     Substitute $w_{i^*}$ with $\hat{w}_{i^*}^*$ to get $x_{adv}$
7:     if $f(x_{adv}) \neq y$ and $\|x_{adv} - x\|_p \leq \epsilon$ then
8:         return $x_{adv}$  ▷ Succeed
9:     end if
10: end for
11: return None  ▷ Failed
Algorithm 1 The FGPM Algorithm

In order to avoid the semantic drift caused by multiple substitutions at the same position of the text, we construct the candidate synonym set $S(w_i)$ for the original sentence and constrain all substitutions of word $w_i$ to this set, as shown at line 2 of Algorithm 1. We also set an upper bound $\epsilon$ on the word substitution ratio in our experiments. Note that at each iteration, FGPM calculates the gradient by back-propagation only once, while previous query-based adversarial attacks need $\sum_{i=1}^{n} |S(w_i)|$ model queries (Kuleshov et al., 2018; Ren et al., 2019; Wang et al., 2019). We will show in the experiments that FGPM is much more efficient.
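Putting Eq. (2) and Eq. (3) together, the following sketch outlines the iterative loop of Algorithm 1 in Python on top of the hypothetical `pick_optimal_synonyms` helper sketched above; the stopping conditions mirror the algorithm, while the function and argument names are our own.

```python
import torch

def fgpm_attack(model, embedding, token_ids, label, synonym_ids,
                max_iters, max_ratio):
    """Iteratively substitute words until the prediction flips (Algorithm 1).

    Returns the adversarial token ids on success, or None on failure.
    """
    x_adv = token_ids.clone()
    n = len(token_ids)
    for _ in range(max_iters):
        # Eq. (2): best synonym and its score for every position.
        choices = pick_optimal_synonyms(model, embedding, x_adv, label,
                                        synonym_ids)
        # Eq. (3): position with the largest projected perturbation.
        i_star = max(range(n), key=lambda i: choices[i][1])
        best_syn, best_score = choices[i_star]
        if best_syn is None:
            return None                      # no candidate left anywhere
        x_adv[i_star] = best_syn

        changed = (x_adv != token_ids).sum().item() / n
        with torch.no_grad():
            pred = model(embedding(x_adv.unsqueeze(0))).argmax(dim=1).item()
        if pred != label and changed <= max_ratio:
            return x_adv                     # attack succeeded
    return None                              # attack failed
```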

AG’s News DBPedia Yahoo! Answers
CNN LSTM Bi-LSTM CNN LSTM Bi-LSTM CNN LSTM Bi-LSTM
No Attack 92.3 92.6 92.5 98.7 98.8 99.0 72.3 75.1 74.9
No Attack* 87.5 90.5 88.5 99.5 99.0 99.0 71.5 72.5 73.5
Papernot et al. 67.0 61.0 63.0 83.0 78.5 86.5 34.0 37.5 34.0
GSA 38.0 33.0 37.5 66.5 58.5 64.0 19.5 19.0 18.0
PWWS 29.0 29.5 29.0 51.5 49.0 49.5   3.5 14.0 10.5
IGA 26.0 23.5 26.5 34.5 32.5 33.5   4.5   6.5   7.0
FGPM 28.5 30.5 30.5 33.0 43.0 47.0   4.5 15.5 10.0
Table 1: The classification accuracy (%) of different classification models under various competitive adversarial attacks. The first two rows, No Attack and No Attack*, show the model accuracy on the entire original test set and on the sampled examples, respectively. The lowest classification accuracy among the attacks is highlighted in bold to indicate the best attack effectiveness, and the second lowest is underlined.

4 Adversarial Training with FGPM

Previous works (Alzantot et al., 2018; Ren et al., 2019) have shown that incorporating their attack methods into adversarial training (Goodfellow et al., 2015) can improve the model robustness. Nevertheless, the improvement is limited. We argue that adversarial training requires plenty of adversarial examples generated based on the current model parameters for better robustness enhancement. Due to the inefficiency of text adversary generation, existing text attack methods based on synonym substitution cannot provide sufficient adversarial examples for adversarial training. With the high efficiency of FGPM, we propose adversarial training with FGPM (ATF) to effectively improve the model robustness for text classification.

Specifically, following Goodfellow et al. (2015), we modify the objective function as follows:

$$\tilde{J}(\theta, x, y) = \alpha J(\theta, x, y) + (1 - \alpha) J(\theta, x_{adv}, y),$$

where $x_{adv}$ is the adversarial example of each $x$ generated by FGPM based on the current model parameters $\theta$, and $\alpha$ balances the two terms. In all our experiments we use a single fixed value of $\alpha$, but other values may also work. As in Goodfellow et al. (2015), we treat FGPM as an effective regularizer to improve the model robustness, rather than just adding a portion of adversarial examples of the already trained model into the training set and retraining the model.
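The sketch below illustrates one ATF training step under this mixed objective; `fgpm_attack` is the hypothetical attack routine sketched earlier, and the mixing weight `alpha` is left as a parameter since the paper's exact value is not reproduced here.

```python
import torch
import torch.nn.functional as F

def atf_training_step(model, embedding, optimizer, batch, labels,
                      synonym_ids_per_sample, alpha, max_iters, max_ratio):
    """One adversarial-training step: mix the clean loss and the loss on
    FGPM adversarial examples generated with the *current* parameters.
    """
    # Craft adversarial token ids for each sample in the batch.
    adv_batch = []
    for token_ids, y, syn_ids in zip(batch, labels, synonym_ids_per_sample):
        x_adv = fgpm_attack(model, embedding, token_ids, int(y), syn_ids,
                            max_iters, max_ratio)
        adv_batch.append(x_adv if x_adv is not None else token_ids)

    clean = torch.stack(list(batch))
    adv = torch.stack(adv_batch)
    targets = torch.as_tensor(labels)

    # Mixed objective: alpha * J(x, y) + (1 - alpha) * J(x_adv, y).
    loss_clean = F.cross_entropy(model(embedding(clean)), targets)
    loss_adv = F.cross_entropy(model(embedding(adv)), targets)
    loss = alpha * loss_clean + (1 - alpha) * loss_adv

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```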

5 Experimental Results

We empirically evaluate the proposed FGPM against four state-of-the-art adversarial attack methods on three popular benchmark datasets using three different neural networks. Experimental results demonstrate that FGPM can achieve similar attack performance and transferability as the best performing baselines, with much higher efficiency. Such performance enables us to implement Adversarial Training with FGPM (ATF) as a defense approach that can be easily scaled to large neural networks and datasets. We compare ATF with two recent competitive defense methods, and show that ATF can not only improve the model robustness against white-box and black-box adversarial attacks but also block the transferability of adversarial examples more successfully, without any decay on the model generalization.

5.1 Experimental Setup

We first provide an overview of the experimental setup, including the baselines, datasets and classification models used in the experiments.

Baselines. To evaluate the effectiveness of the proposed attack, we compare FGPM with four recent adversarial attacks: Papernot et al. (Papernot et al., 2016), GSA (Kuleshov et al., 2018), PWWS (Ren et al., 2019), and IGA (Wang et al., 2019). Furthermore, to evaluate the defense performance of ATF, we compare with two competitive text defense methods, SEM (Wang et al., 2019) and IBP (Jia et al., 2019), against several recent word-level attacks. Due to the low efficiency of the attack baselines, we randomly sample 200 examples from each dataset and generate adversarial examples with these attack methods for different models with or without defense.

Generated on CNN          Generated on LSTM         Generated on Bi-LSTM
CNN LSTM Bi-LSTM CNN LSTM Bi-LSTM CNN LSTM Bi-LSTM
Papernot et al. 83.0* 89.0 89.5 93.5 78.5* 90.5 93.5 87.5 86.5*
GSA 66.5* 90.0 92.0 94.5 58.5* 91.5 94.0 86.5 64.0*
PWWS 51.5* 86.0 89.0 87.0 49.0* 86.5 86.0 75.0 49.5*
IGA 34.5* 83.0 88.5 90.5 32.5* 88.5 91.5 84.0 33.5*
FGPM 33.0* 82.5 91.0 90.5 43.0* 87.0 91.0 80.5 47.0*
Table 2: The classification accuracy (%) of different classification models on adversarial examples generated on other models on DBPedia, for the transferability evaluation. * indicates the model on which the adversarial examples were generated. The lowest classification accuracy among the attacks is highlighted in bold to indicate the best transferability, and the second lowest is underlined.
AG’s News DBPedia Yahoo! Answers
CNN LSTM Bi-LSTM CNN LSTM Bi-LSTM CNN LSTM Bi-LSTM
Papernot et al. 74 1,676 4,401 145 2,119 6,011 120 9,719 19,211
GSA 276 643 713 616 1,006 1,173 1,257 2,234 2,440
PWWS 122 28,203 28,298 204 34,753 35,388 643 98,141 100,314
IGA 965 47,142 91,331 1,369 69,770 74,376 893 132,044 123,976
FGPM 8 29 29 8 34 33 26 193 199
Table 3: Comparison of the total running time (in seconds) for generating adversarial examples, evaluated on various models and datasets.

Datasets. We compare the proposed methods with the baselines on three widely used benchmark datasets: AG's News, DBPedia ontology and Yahoo! Answers (Zhang et al., 2015). The AG's News dataset consists of news articles pertaining to four classes: World, Sports, Business and Sci/Tech. Each class includes 30,000 training examples and 1,900 testing examples. The DBPedia ontology dataset is constructed by picking 14 non-overlapping classes from DBpedia 2014, a crowd-sourced community effort to extract structured information from Wikipedia. For each of the 14 ontology classes, there are 40,000 training samples and 5,000 testing samples. The Yahoo! Answers dataset is a topic classification dataset with 10 classes, and each class contains 140,000 training samples and 5,000 testing samples.

Models. We adopt several deep learning models that achieve state-of-the-art performance on text classification tasks, including Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). The embedding dimension of all models is 300 (Mikolov et al., 2013). Specifically, we replicate a CNN model from Kim (2014), which consists of three convolutional layers with filter sizes of 5, 4, and 3 respectively, a dropout layer, and a final fully connected layer. We also use an LSTM model that replaces the three convolutional layers of the CNN with three LSTM layers, each with 128 cells (Liu et al., 2016). Lastly, we implement a Bi-LSTM model that replaces the three LSTM layers of the LSTM with a bi-directional LSTM layer having 128 forward-direction cells and 128 backward-direction cells.
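As a rough PyTorch sketch of the CNN just described (details beyond those stated, such as the number of filters, the activation, and global max-pooling, are our assumptions):

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    """CNN in the spirit of Kim (2014): embeddings, three convolutional
    layers with filter sizes 5, 4 and 3, dropout, and a final linear layer.
    The number of filters (100) and global max-pooling are assumptions."""

    def __init__(self, vocab_size, num_classes, embed_dim=300, num_filters=100):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList([
            nn.Conv1d(embed_dim, num_filters, kernel_size=k) for k in (5, 4, 3)
        ])
        self.dropout = nn.Dropout(0.5)
        self.fc = nn.Linear(num_filters * 3, num_classes)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) -> (batch, embed_dim, seq_len)
        x = self.embedding(token_ids).transpose(1, 2)
        # Each conv is followed by ReLU and global max-pooling over time.
        pooled = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
        features = self.dropout(torch.cat(pooled, dim=1))
        return self.fc(features)

model = TextCNN(vocab_size=50000, num_classes=4)
logits = model(torch.randint(0, 50000, (2, 60)))   # shape (2, 4)
print(logits.shape)
```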

5.2 Evaluation on Attack Effectiveness

To evaluate the attack effectiveness, we compare FGPM with the baselines in two aspects: classification accuracy under attack and transferability.

Classification Accuracy under Attacks. In Table 1, we report the classification accuracy under FGPM and the competitive baseline attacks on the three standard datasets. The more effective the attack method, the more the classification accuracy of the target model drops. We observe that FGPM reduces the classification accuracy more than the other methods on CNN with the DBPedia dataset, and yields results close to the best query-based methods in the other cases. On average, FGPM reduces the classification accuracy substantially on all three datasets. Note that, compared with Papernot et al., the only gradient-based adversarial attack among the baselines, FGPM reduces the classification accuracy by a considerably larger margin on all three datasets, indicating that the proposed gradient projection technique significantly improves the effectiveness of white-box word-level attacks.

Dataset Attack CNN LSTM Bi-LSTM
NT SEM IBP ATF NT SEM IBP ATF NT SEM IBP ATF
AG’s News No Attack 92.3 89.7 90.2 92.6 92.6 90.9 89.0 92.5 92.5 91.4 90.0 92.8
No Attack* 87.5 87.5 84.5 89.0 90.5 90.5 90.5 89.5 88.5 91.0 87.5 89.5
Papernot et al. 67.0 75.0 87.5 84.5 61.0 79.5 61.0 88.0 63.0 79.5 85.0 88.0
GSA 38.0 75.0 85.0 79.0 33.0 83.0 2.5 84.5 37.5 81.0 82.0 86.0
PWWS 29.0 76.0 84.0 79.5 33.5 84.0 14.0 84.5 29.0 83.0 75.5 85.5
IGA 26.0 77.5 84.0 77.5 23.5 85.0 1.5 83.0 26.5 82.5 75.0 86.5
FGPM 28.5 78.0 84.5 80.0 30.5 84.5 63.0 83.5 30.5 81.5 77.0 83.5
DBPedia No Attack 98.7 98.1 98.2 98.8 98.8 98.5 89.6 99.0 99.0 98.8 93.7 99.0
No Attack* 99.5 97.5 98.0 100.0 99.0 99.5 90.0 100.0 99.0 98.0 93.0 100.0
Papernot et al. 83.0 86.5 97.0 99.5 78.5 95.0 53.5 99.5 86.5 91.5 93.0 100.0
GSA 66.5 90.0 96.5 98.0 58.5 95.5 0.0 99.0 64.0 91.5 91.5 100.0
PWWS 51.5 88.5 95.0 97.0 49.0 96.0 0.5 99.0 49.5 91.5 91.0 99.5
IGA 34.5 90.0 95.0 97.5 32.5 97.5 0.0 99.0 33.5 94.5 91.0 99.5
FGPM 33.0 88.5 96.0 98.0 43.0 96.5 49.0 99.0 47.0 93.0 91.0 99.5
Yahoo! Answers No Attack 72.3 70.0 66.6 72.7 75.1 72.8 66.2 74.7 74.9 72.9 66.9 74.7
No Attack* 71.5 67.0 65.0 69.0 72.5 69.5 66.5 74.5 73.5 69.5 70.5 73.5
Papernot et al. 34.0 52.5 64.0 65.5 37.5 56.5 47.5 67.5 34.0 62.5 1.5 67.5
GSA 19.5 52.5 61.5 53.0 19.0 51.5 39.5 60.5 18.0 51.5 28.0 58.0
PWWS 3.5 51.5 60.5 53.5 14.0 54.5 15.5 62.0 10.5 54.5 3.5 58.5
IGA 4.5 56.5 60.5 53.0 6.5 56.5 12.5 58.0 7.0 57.0 5.0 57.5
FGPM 4.5 56.5 60.0 57.0 15.5 57.5 37.5 63.5 10.0 59.5 1.5 61.0
Table 4: The classification accuracy (%) of three competitive defense methods under various adversarial attacks on the same set of 200 randomly selected samples from the three standard datasets. No Attack and No Attack* show the accuracy on the entire original test set and on the sampled examples, respectively, as in Table 1. The classification accuracy is in red and italic when it is lower than NT. Notation: NT — Normal Training, SEM — Synonym Encoding Method, IBP — Training with Interval Bound Propagation, ATF — Adversarial Training with FGPM.

Transferability. The transferability of an adversarial attack refers to its ability to reduce the classification accuracy of models other than the one on which the adversarial examples were generated (Goodfellow et al., 2015), which is another serious threat in real-world applications. To illustrate the transferability of FGPM, we generate adversarial examples on each model with the different attack methods and evaluate the classification accuracy of the other models on these adversarial examples. Here, we evaluate the transferability of the different attacks on DBPedia. As depicted in Table 2, the adversarial examples crafted by FGPM generally yield the second best transferability.

5.3 Evaluation on Attack Efficiency

Attack efficiency is also important for evaluating attack methods, especially if we want to incorporate the attacks into adversarial training as a defense method. Adversarial training needs highly efficient adversary generation in order to effectively promote the model robustness. Therefore, we evaluate the total time (in seconds) for generating adversarial examples for the sampled instances on the three datasets under the various attacks. As shown in Table 3, FGPM is about 20 times faster on average than GSA, the second fastest attack based on synonym substitution, which moreover has worse attack performance and lower transferability than FGPM. Moreover, FGPM is orders of magnitude faster than IGA, which achieves the best degradation of the classification accuracy among the baselines. Though Papernot et al. (2016) craft adversarial examples based on the gradient and could make each iteration fast, it needs many more iterations to obtain adversarial examples due to its relatively low attack effectiveness. On average, FGPM is about 78 times faster than Papernot et al.

Attack CNN LSTM Bi-LSTM
NT SEM IBP ATF NT SEM IBP ATF NT SEM IBP ATF
Papernot et al. 83.0* 97.5 97.0 100.0 89.0 98.0 86.0 100.0 89.5 96.0 93.5 100.0
GSA 66.5* 97.0 97.0 100.0 90.0 99.5 85.5 100.0 92.0 98.0 93.0 100.0
PWWS 51.5* 97.0 97.0 100.0 86.0 99.5 87.0 100.0 89.0 98.0 92.5 100.0
IGA 34.5* 97.0 97.0 100.0 83.0 99.5 86.5 100.0 88.5 98.0 92.5 100.0
FGPM 33.0* 97.0 97.0 100.0 82.5 99.0 84.5 100.0 91.0 97.5 93.5 100.0
Papernot et al. 93.5 96.5 96.5 100.0 78.5* 98.5 83.5 100.0 90.5 97.0 94.0 100.0
GSA 94.5 97.5 96.5 100.0 58.5* 99.5 86.0 99.5 91.5 98.5 93.5 100.0
PWWS 87.0 97.5 97.5 100.0 49.0* 99.5 88.0 99.5 86.5 98.0 92.5 100.0
IGA 90.5 97.0 97.0 100.0 32.5* 99.5 86.5 99.5 88.5 98.0 93.0 100.0
FGPM 90.5 96.5 97.5 100.0 43.0* 99.5 86.0 100.0 87.0 97.5 92.5 100.0
Papernot et al. 93.5 96.5 97.0 100.0 87.5 98.5 85.5 100.0 86.5* 96.5 93.5 100.0
GSA 94.0 97.5 97.5 100.0 86.5 99.5 87.0 100.0 64.0* 98.0 93.0 100.0
PWWS 86.0 98.0 97.5 100.0 75.0 99.5 88.0 100.0 49.5* 98.0 93.0 100.0
IGA 91.5 97.0 97.5 100.0 84.0 99.5 87.0 100.0 33.5* 98.0 93.0 100.0
FGPM 91.0 97.5 97.5 100.0 80.5 99.5 86.5 100.0 47.0* 98.0 93.5 100.0
Table 5: The classification accuracy (%) of different classification models under competitive defenses for adversarial examples generated on other models on DBPedia, for evaluating the defense performance against transferability.

5.4 Evaluation on Adversarial Training

From the above analysis, we see that, compared with the competitive attack baselines, FGPM achieves good attack performance and transferability with much higher efficiency. Such performance enables us to implement effective adversarial training and scale it to large neural networks and datasets. In this subsection, we evaluate the performance of ATF and compare it with SEM and IBP against adversarial examples generated by the above competitive attacks. Here we focus on two aspects: defense against adversarial attacks and defense against transferability.

Defense against Adversarial Attacks. We run the above attacks on models trained with the various defense methods to evaluate the defense performance. The results are shown in Table 4. For normal training (NT), the classification accuracy of the models on all datasets drops dramatically under the different adversarial attacks. In contrast, both SEM and ATF promote the model robustness stably and effectively across all models and datasets. Note that IBP has distinctly better defense performance on CNNs because of its theoretical certification. However, with many restrictions imposed on the architecture, the model hardly converges when trained on the three-layer LSTM architecture or on a large dataset like Yahoo! Answers, resulting in weakened generalization and adversarial robustness. In some cases, IBP even achieves poorer robustness than NT. Compared with SEM, moreover, ATF achieves higher classification accuracy under almost all adversarial attacks on the three datasets. Especially on DBPedia, ATF blocks almost all the adversarial examples and achieves 100% or nearly 100% classification accuracy. Besides, both SEM and IBP bring some decay in accuracy on benign data. As a form of data augmentation, however, ATF improves the model performance on benign examples in most cases.

Defense against Transferability. As transferability poses a serious concern in real-world applications, a good defense method should not only defend against adversarial attacks but also resist adversarial transferability. To evaluate the ability to block the transferability of adversarial examples, we evaluate each model's classification accuracy on adversarial examples generated by the different attack methods against normally trained models on DBPedia. As shown in Table 5, ATF blocks the transferability of adversarial examples much more successfully than normal training and the defense baselines.

In summary, ATF promotes the model robustness against both white-box and black-box adversarial attacks. When applied to complex models and large datasets, ATF maintains stable and effective performance. ATF further blocks the transferability of adversarial examples more successfully, with no decay in the classification accuracy on benign data.

6 Conclusion

In this work, we propose an efficient gradient-based synonym substitution adversarial attack method, called Fast Gradient Projection Method (FGPM), which approximates the classification confidence change caused by synonym substitution by the product of the gradient magnitude and the projected distance between the original word and the candidate synonym in the gradient direction. Empirical evaluations on three widely used benchmark datasets demonstrate that FGPM achieves attack performance and transferability close to those of state-of-the-art synonym substitution based adversarial attack methods.

More importantly, FGPM is about 20 times faster than the current fastest synonym substitution based adversarial attack method. With such high efficiency, we introduce an effective defense approach called Adversarial Training with FGPM (ATF) for text classification. Extensive experiments show that ATF promotes the model robustness against white-box as well as black-box adversarial attacks, and effectively blocks the transferability of adversarial examples without any accuracy decay on benign data.

In future work, we plan to extend FGPM and ATF to other NLP tasks (e.g., machine translation, question answering) to evaluate the robustness of sequence-to-sequence models. Moreover, considering the success of adversarial training in the image domain, we hope our work can inspire more research on adversarial training in the text domain.

References