Deep Neural Networks (DNNs) have achieved great success in recent years (Krizhevsky et al., 2012; Kim, 2014; Devlin et al., 2019). However, researchers have found that DNNs are often vulnerable to adversarial examples for image data (Szegedy et al., 2014) as well as text data (Papernot et al., 2016). In the image domain, numerous methods have been proposed for adversarial attack (Goodfellow et al., 2015; Athalye et al., 2018) and defense (Goodfellow et al., 2015; Samangouei et al., 2018; Guo et al., 2018). Among these, adversarial training, which incorporates perturbed examples in the training stage to promote model robustness, is particularly popular and effective (Athalye et al., 2018).
In the text domain, the lexical, grammatical, and semantic constraints, together with the discrete input space, make it much harder to craft adversarial examples. Current attack methods include character-level attacks (Liang et al., 2018; Li et al., 2019; Ebrahimi et al., 2018), word-level attacks (Papernot et al., 2016; Samanta and Mehta, 2017; Gong et al., 2018; Cheng et al., 2018; Kuleshov et al., 2018; Neekhara et al., 2019; Ren et al., 2019; Wang et al., 2019), and sentence-level attacks (Iyyer et al., 2018; Ribeiro et al., 2018). For character-level attacks, recent work (Pruthi et al., 2019) has shown that a spell checker can easily fix the perturbations. It is worth noting that although Ebrahimi et al. (2018) adapt HotFlip for word-level attack, they cannot generate abundant adversarial examples due to the syntactic and semantic constraints. Word-level attacks thus either struggle to preserve semantics when using gradient-based perturbations, or are time-consuming when relying on query-based synonym substitution. Similarly, as sentence-level attacks are usually based on paraphrasing, adversary generation takes much longer. There exists a work (Miyato et al., 2016) that adopts FGSM (Goodfellow et al., 2015) to perturb the word embeddings, but such perturbations are used directly to enhance the model performance on the original data.
Adversarial defense for text data is much less studied in the literature. Some works (Jia et al., 2019; Huang et al., 2019) are based on interval bound propagation (IBP), first proposed in the image domain (Gowal et al., 2018; Dvijotham et al., 2018), to ensure certifiable defense. Zhou et al. (2019) propose to learn to discriminate perturbations (DISP) and restore the embedding of the original word for defense, without altering the training process or the model structure. Wang et al. (2019) propose a Synonym Encoding Method (SEM), which inserts an encoder before the input layer to defend against synonym substitution based attacks. To our knowledge, adversarial training, one of the most efficacious defense approaches for images (Athalye et al., 2018), has not been effectively implemented for text classification as a defense method.
Among different types of adversarial attacks, existing synonym substitution based attack methods guarantee grammatical correctness and avoid semantic changes, but are usually too inefficient to incorporate into adversarial training. Gradient-based attacks, on the other hand, often have much higher efficiency. Due to the discreteness of the text input space, however, it is challenging to adopt such methods directly in the text embedding space (Miyato et al., 2016) while generating meaningful adversarial examples and preserving the original semantics.
In this work, we propose a gradient-based adversarial attack method, called the Fast Gradient Projection Method (FGPM), for efficient text adversary generation. Specifically, we approximate the classification confidence change caused by synonym substitution by the product of the gradient magnitude and the distance between the original word and the candidate word projected in the gradient direction. At each iteration, we substitute a word with the synonym that yields the highest product value. Compared with existing query-based attack methods, FGPM needs only a single back-propagation to obtain the gradient. As a result, FGPM is about 20 times faster than the current fastest adversarial attack, making it possible to incorporate FGPM into adversarial training for efficient and effective defense in text classification.
Extensive experiments on three standard datasets demonstrate that FGPM achieves attack performance and transferability similar to state-of-the-art synonym substitution based adversarial attacks, with much higher efficiency. More importantly, empirical results show that our adversarial training with FGPM (ATF) promotes the model robustness against white-box as well as black-box adversarial attacks, and effectively blocks the transferability of adversarial examples without any degradation of model generalization.
2 Related Work
For related work, we provide a brief overview of text adversarial attacks and defenses at the word level.
2.1 Adversarial Attack
Adversarial attacks fall into two settings: (a) white-box attack allows full access to the target model, including model outputs, (hyper-)parameters, gradients and architectures, while (b) black-box attack only allows access to the model outputs.
Methods based on word embeddings usually fall in the white-box setting. Papernot et al. (2016) find a word in the dictionary such that the sign of the difference between its embedding and that of the original word is closest to the sign of the gradient. However, such a word does not necessarily preserve semantic and/or syntactic correctness and consistency. Gong et al. (2018) further employ the Word Mover’s Distance (WMD) in an attempt to preserve semantics. Cheng et al. (2018) also propose an attack based on the embedding space, with additional constraints, targeting seq2seq models. Our work similarly produces efficient white-box attacks, while guaranteeing the quality of adversarial examples by restricting the perturbation to synonym substitution, which usually appears in black-box attacks.
In the black-box setting, Kuleshov et al. (2018) propose a Greedy Search Attack (GSA) that perturbs the input by synonym substitution. Specifically, GSA greedily finds a synonym for replacement that minimizes the model’s classification confidence. Ren et al. (2019) propose Probability Weighted Word Saliency (PWWS), which greedily substitutes each target word with a synonym determined by a combination of the classification confidence change and word saliency. Alzantot et al. (2018) also use synonym substitution and propose a population-based algorithm called the Genetic Algorithm (GA). Wang et al. (2019) further propose an Improved Genetic Algorithm (IGA) that allows words in the same position to be substituted more than once and outperforms GA.
2.2 Adversarial Defense
Currently, a series of works (Miyato et al., 2016; Sato et al., 2018; Barham and Feizi, 2019) perturb the word embeddings and utilize adversarial training as a regularization strategy. These works aim to improve the model performance on the original dataset, but do not intend to defend against adversarial attacks. Therefore, we do not consider such works in this paper.
A stream of recent popular defense methods (Jia et al., 2019; Huang et al., 2019) focuses on verifiable robustness. They use IBP in the training procedure to produce models that are provably robust to all possible perturbations within the constraints. Such an endeavor, however, is currently time consuming in the training stage, as the authors noted (Jia et al., 2019), and hard to scale to complex models or large datasets. Zhou et al. (2019) train a perturbation discriminator that estimates how likely a token in the text is to have been perturbed, together with an embedding estimator that restores the embedding of the original word, to block adversarial attacks. In contrast, our work focuses on fast adversarial example generation and an easy-to-apply defense method for large neural networks and datasets.
Alzantot et al. (2018) and Ren et al. (2019) adopt the adversarial examples generated by their attack methods for adversarial training and achieve some robustness improvement. Unfortunately, due to the relatively low efficiency of adversary generation, they are unable to craft enough perturbed examples during training to ensure a significant robustness improvement. To our knowledge, adversarial training has not been practically applied to text classification as an efficient and effective defense method.
Wang et al. (2019) propose a Synonym Encoding Method (SEM) that uses a synonym encoder to map all synonyms to the same code in the embedding space, forcing the classifier to be smoother. Trained with the encoder, their model also obtains a significant improvement in robustness, at a small cost in model generalization.
3 Fast Gradient Projection Method
In this section, we formalize the definition of adversarial examples for text classification and describe in detail the proposed adversarial attack method, Fast Gradient Projection Method (FGPM).
3.1 Text Adversarial Examples
Let $\mathcal{X}$ denote the input space containing all the possible input texts and $\mathcal{Y}$ the output space. Let $x = \langle w_1, \dots, w_n \rangle \in \mathcal{X}$ denote an input sample consisting of $n$ words, and $\mathbb{W}$ the dictionary containing all the possible words in the input texts. A classifier $f: \mathcal{X} \rightarrow \mathcal{Y}$ is expected to learn a mapping so that for any sample $x$, the predicted label $f(x)$ equals its true label $y$ with high probability. Let $f_k(x)$ denote the logit output of classifier $f$ on category $k$. The adversary adds an imperceptible perturbation to $x$ to craft an adversarial example $x_{adv}$ that misleads the classifier $f$:

$$f(x_{adv}) \neq f(x), \quad \text{s.t.} \quad \|x_{adv} - x\|_p \leq \epsilon,$$

where $\epsilon$ is a hyper-parameter denoting the perturbation upper bound, and $\|\cdot\|_p$ is the $\ell_p$-norm distance metric, which here denotes the word substitution ratio as the measure of the perturbation caused by synonym substitution:

$$\|x_{adv} - x\|_p = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\{w_i \neq w_i^{adv}\}.$$

Here $\mathbb{1}\{\cdot\}$ is an indicator function, and $x_{adv} = \langle w_1^{adv}, \dots, w_n^{adv} \rangle$.
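The word substitution ratio above can be sketched in a few lines (a minimal pure-Python illustration; the function name is ours):

```python
def substitution_ratio(x, x_adv):
    """Fraction of word positions where x_adv differs from x --
    the perturbation measure used for synonym-substitution attacks."""
    assert len(x) == len(x_adv)
    return sum(w != w_adv for w, w_adv in zip(x, x_adv)) / len(x)

x = ["the", "movie", "was", "great"]
x_adv = ["the", "film", "was", "great"]  # one word of four substituted
print(substitution_ratio(x, x_adv))  # 0.25
```

An attack under a budget of, say, 25% may change at most one word of this four-word input.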
3.2 Generating Adversarial Examples
Mrkšić et al. (2016) have shown that counter-fitting can remove antonyms, which are also treated as “similar words” in the original GloVe vector space, and thereby improve the vectors’ ability to indicate semantic similarity. Thus, we post-process the GloVe vectors by counter-fitting and define a synonym set for each word $w$ in the embedding space as follows:

$$\mathbb{S}(w) = \{\, w' \in \mathbb{W} \mid \|e_{w'} - e_w\|_2 \leq \delta \,\},$$

where $e_w$ denotes the embedding of word $w$ and $\delta$ is a hyper-parameter that constrains the maximum Euclidean distance between synonyms in the embedding space. Following Wang et al. (2019), we use the same setting of $\delta$ in our experiments.
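Under this definition, building a synonym set reduces to a range query in the embedding space. A minimal sketch with toy 2-d vectors (the paper uses 300-d counter-fitted GloVe embeddings; all names and values here are illustrative assumptions):

```python
import math

def synonym_set(word, embeddings, delta):
    """All dictionary words whose embedding lies within Euclidean
    distance delta of the given word's embedding."""
    e_w = embeddings[word]
    return {w for w, e in embeddings.items()
            if w != word and math.dist(e_w, e) <= delta}

# Toy 2-d "embeddings"; real counter-fitted GloVe vectors are 300-d.
emb = {"good": (1.0, 1.0), "great": (1.2, 1.1), "bad": (-1.0, -1.0)}
print(synonym_set("good", emb, delta=0.5))  # {'great'}
```

In practice these sets can be precomputed once for the whole dictionary, so attack time is unaffected by the lookup.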
Once we have the synonym set for each word, the two key issues are synonym selection and the determination of the substitution order.
Word Substitution. As shown in Figure 1 (a), for each word $w_i$, we wish to pick the word $\hat{w}_i \in \mathbb{S}(w_i)$ that brings the most benefit to the overall substitution process, which we call the optimal synonym. Due to the high complexity of finding the optimal synonym, previous works (Ren et al., 2019; Wang et al., 2019) greedily pick a synonym that minimizes the classification confidence:

$$\hat{w}_i = \operatorname*{arg\,min}_{w' \in \mathbb{S}(w_i)} f_y(x_i'),$$

where $x_i' = \langle w_1, \dots, w_{i-1}, w', w_{i+1}, \dots, w_n \rangle$ and $y$ is the true label. However, this selection process is time consuming, as picking such a $\hat{w}_i$ needs $|\mathbb{S}(w_i)|$ queries on the model. To reduce the computational complexity, based on the local linearity of deep models, we use the product of the gradient magnitude and the distance between two synonyms projected in the gradient direction in the word embedding space to estimate the change of the classification confidence. Specifically, as illustrated in Figure 1 (b), we first calculate the gradient $\nabla_{e_{w_i}} J(x, y)$ for each word $w_i$, where $J$ is the loss function used for training. Then, we estimate the change by calculating $(e_{w'} - e_{w_i})^\top \nabla_{e_{w_i}} J(x, y)$. To determine the optimal synonym $\hat{w}_i$, we choose the synonym with the maximum product value:

$$\hat{w}_i = \operatorname*{arg\,max}_{w' \in \mathbb{S}(w_i)} \; (e_{w'} - e_{w_i})^\top \nabla_{e_{w_i}} J(x, y).$$
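This selection rule amounts to one dot product per candidate instead of one forward pass per candidate. A minimal sketch (function names and toy values are ours, not the authors' implementation):

```python
def best_synonym(word, synonyms, embeddings, grad):
    """First-order synonym selection: score each candidate by the dot
    product of the substitution vector (e_syn - e_word) with the loss
    gradient at the word's embedding, and keep the maximizer."""
    e_w = embeddings[word]

    def score(s):
        return sum((es - ew) * g
                   for es, ew, g in zip(embeddings[s], e_w, grad))

    return max(synonyms, key=score)

emb = {"good": (1.0, 0.0), "great": (1.5, 0.0), "fine": (0.5, 0.5)}
grad = (1.0, 0.0)  # toy loss gradient w.r.t. the embedding of "good"
print(best_synonym("good", ["great", "fine"], emb, grad))  # great
```

Here "great" wins because its substitution vector points along the gradient (score 0.5) while "fine" points against it (score -0.5).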
Substitution Order. For each word $w_i$ in text $x$, we use the above word substitution strategy to choose its optimal substitution synonym $\hat{w}_i$ and obtain a candidate set $C = \{\hat{w}_1, \dots, \hat{w}_n\}$. Then, we need to determine which word in $x$ should be substituted. Similar to the word substitution strategy, we pick the position whose candidate yields the biggest perturbation value projected in the gradient direction:

$$i^* = \operatorname*{arg\,max}_{i \in \{1, \dots, n\}} \; (e_{\hat{w}_i} - e_{w_i})^\top \nabla_{e_{w_i}} J(x, y).$$
In summary, to generate an adversarial example, we adopt the above word substitution and substitution order strategies for synonym substitution, and substitute words iteratively until the classifier makes a wrong prediction. The overall FGPM algorithm is shown in Algorithm 1.
To avoid the semantic drift caused by multiple substitutions at the same position of the text, we construct the candidate synonym set from the original sentence and constrain all substitutions for each word to this set, as shown at line 2 of Algorithm 1. We also set an upper bound on the word substitution ratio in our experiments. Note that at each iteration, FGPM calculates the gradient by back-propagation only once, while previous query-based adversarial attacks need a number of model queries proportional to the total number of synonym candidates (Kuleshov et al., 2018; Ren et al., 2019; Wang et al., 2019). We will show in the experiments that FGPM is much more efficient.
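Putting the word substitution and substitution order strategies together, the iterative loop of Algorithm 1 can be sketched as follows. This is our schematic reading of the procedure, with a toy linear "classifier" and constant gradients standing in for the real network and back-propagation; all names and the toy model are assumptions, not the authors' implementation:

```python
def fgpm(x, syn_sets, emb, predict, grad_fn, y_true, max_ratio=0.25):
    """Schematic FGPM loop: one gradient computation per iteration,
    greedy choice of the (position, synonym) pair with the largest
    projection score, and at most one substitution per position."""
    x_adv, changed = list(x), set()
    for _ in range(max(1, int(max_ratio * len(x)))):
        if predict(x_adv) != y_true:          # attack already succeeded
            break
        grads = grad_fn(x_adv, y_true)        # one "back-propagation"
        best = None                           # (score, position, synonym)
        for i, w in enumerate(x_adv):
            if i in changed:                  # keep one change per slot
                continue
            for s in syn_sets.get(w, ()):
                score = sum((a - b) * g
                            for a, b, g in zip(emb[s], emb[w], grads[i]))
                if best is None or score > best[0]:
                    best = (score, i, s)
        if best is None:                      # no candidate left
            break
        _, i, s = best
        x_adv[i], changed = s, changed | {i}
    return x_adv

# Toy setting: 1-d embeddings, a linear "classifier", constant gradients.
emb = {"good": [1.0], "fine": [0.5], "poor": [-0.6]}
syn = {"good": ["fine", "poor"]}
predict = lambda x: 1 if sum(emb[w][0] for w in x) > 0 else 0
grad_fn = lambda x, y: [[-1.0]] * len(x)   # loss grows toward negative
adv = fgpm(["good", "good"], syn, emb, predict, grad_fn, y_true=1,
           max_ratio=1.0)
print(adv, predict(adv))  # ['poor', 'poor'] 0
```

Each pass costs one gradient evaluation rather than one model query per synonym candidate, which is where the claimed speedup over query-based attacks comes from.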
[Table 1: classification accuracy (%) of the models under the various attacks on AG’s News, DBPedia, and Yahoo! Answers]
4 Adversarial Training with FGPM
Previous works (Alzantot et al., 2018; Ren et al., 2019) have shown that incorporating their attack methods into adversarial training (Goodfellow et al., 2015) can improve the model robustness. Nevertheless, the improvement is limited. We argue that adversarial training requires plenty of adversarial examples generated from the current model parameters for better robustness enhancement. Due to the inefficiency of text adversary generation, existing text attack methods based on synonym substitution cannot provide sufficient adversarial examples for adversarial training. With the high efficiency of FGPM, we propose to adopt adversarial training with FGPM (ATF) to effectively improve the model robustness for text classification.
Specifically, we modify the objective function as follows:

$$\tilde{J}(\theta, x, y) = \alpha J(\theta, x, y) + (1 - \alpha) J(\theta, x_{adv}, y),$$

where $x_{adv}$ is the adversarial example of each $x$ generated by FGPM based on the current model parameters $\theta$, and $\alpha$ balances the clean and adversarial loss terms. We use a fixed $\alpha$ in all our experiments, but other values may also work. As in Goodfellow et al. (2015), we treat FGPM as an effective regularizer that improves the model robustness, rather than simply adding a portion of adversarial examples of an already-trained model to the training set and retraining.
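The mixed objective can be sketched as a single function (a hedged illustration in the style of Goodfellow et al. (2015); `atf_loss` and the toy stand-ins are our own names, and `alpha=0.5` is an illustrative choice, not necessarily the paper's setting):

```python
def atf_loss(loss_fn, model, x, y, attack, alpha=0.5):
    """Adversarial-training objective: a convex combination of the loss
    on the clean input and on its adversary generated with the *current*
    model, so the perturbations track the evolving parameters."""
    x_adv = attack(model, x, y)
    return (alpha * loss_fn(model, x, y)
            + (1 - alpha) * loss_fn(model, x_adv, y))

# Toy stand-ins: the "loss" is the input length, the "attack" appends a word.
loss_fn = lambda model, x, y: float(len(x))
attack = lambda model, x, y: x + ["pad"]
print(atf_loss(loss_fn, None, ["a", "b"], 0, attack))  # 2.5
```

Because the adversary is regenerated at every step from the current parameters, this acts as a regularizer during training rather than a one-off dataset augmentation.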
5 Experimental Results
We empirically evaluate the proposed FGPM against four state-of-the-art adversarial attack methods on three popular benchmark datasets involving three different neural networks. Experimental results demonstrate that FGPM achieves attack performance and transferability similar to the best performing baselines, with much higher efficiency. Such performance enables us to implement Adversarial Training with FGPM (ATF) as a defense approach that can be easily scaled to large neural networks and datasets. We compare ATF with two recent competitive defense methods, and show that ATF can not only improve the model robustness against white-box and black-box adversarial attacks but also block the transferability of adversarial examples more successfully, without any loss of model generalization.
5.1 Experimental Setup
We first provide an overview of the experimental setup, including the baselines, datasets, and classification models used in the experiments.
Baselines. To evaluate the effectiveness of the proposed attack, we compare FGPM with four recent adversarial attacks: Papernot et al. (Papernot et al., 2016), GSA (Kuleshov et al., 2018), PWWS (Ren et al., 2019), and IGA (Wang et al., 2019). Furthermore, to evaluate the defense performance of ATF, we pit two competitive text defense methods, SEM (Wang et al., 2019) and IBP (Jia et al., 2019), against several recent word-level attacks. Due to the low efficiency of the attack baselines, we randomly sample 200 examples from each dataset and generate adversarial examples with these attack methods for the different models with or without defense.
[Table 3: total time (in seconds) of adversary generation by each attack on AG’s News, DBPedia, and Yahoo! Answers]
Datasets. We compare the proposed methods with the baselines on three widely used benchmark datasets: AG’s News, DBPedia ontology, and Yahoo! Answers (Zhang et al., 2015). The AG’s News dataset consists of news articles pertaining to four classes: World, Sports, Business, and Sci/Tech; each class includes 30,000 training examples and 1,900 testing examples. The DBPedia ontology dataset is constructed by picking 14 non-overlapping classes from DBpedia 2014, a crowd-sourced community effort to extract structured information from Wikipedia; each of the 14 ontology classes has 40,000 training samples and 5,000 testing samples. The Yahoo! Answers dataset is a topic classification dataset with 10 classes, each containing 140,000 training samples and 5,000 testing samples.
Models. We adopt several deep learning models that achieve state-of-the-art performance on text classification tasks, including Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). The embedding dimension of all models is 300 (Mikolov et al., 2013). Specifically, we replicate the CNN model of Kim (2014), which consists of three convolutional layers with filter sizes of 5, 4, and 3 respectively, a dropout layer, and a final fully connected layer. We also use an LSTM model that replaces the three convolutional layers of the CNN with three LSTM layers, each with 128 cells (Liu et al., 2016). Lastly, we implement a Bi-LSTM model that replaces the three LSTM layers of the LSTM with a bi-directional LSTM layer having 128 forward and 128 backward cells.
5.2 Evaluation on Attack Effectiveness
To evaluate the attack effectiveness, we compare FGPM with the baselines in two aspects, namely classification accuracy under attacks and transferability.
Classification Accuracy under Attacks. In Table 1, we report the classification accuracy under FGPM and the competitive baseline attacks on three standard datasets. The more effective the attack method is, the more the classification accuracy of the target model drops. We observe that FGPM reduces the classification accuracy more than the other methods on CNN with the DBPedia dataset, and achieves results close to the best query-based methods in the other cases. On average, FGPM reduces the classification accuracy substantially on all three datasets. Note that, compared with Papernot et al., the only gradient-based adversarial attack among the baselines, FGPM reduces the classification accuracy by a considerably larger margin on all three datasets, indicating that the proposed gradient projection technique significantly improves the effectiveness of white-box word-level attacks.
[Table 4: classification accuracy (%) of models trained with the different defense methods under no attack and under the various attacks on AG’s News, DBPedia, and Yahoo! Answers]
Transferability. The transferability of an adversarial attack refers to its ability to reduce the classification accuracy of models other than the one on which the adversarial examples were generated (Goodfellow et al., 2015), which poses another serious threat in real-world applications. To illustrate the transferability of FGPM, we generate adversarial examples on each model with the different attack methods and evaluate the classification accuracy of the other models on these adversarial examples. Here, we evaluate the transferability of the different attacks on DBPedia. As depicted in Table 2, the adversarial examples crafted by FGPM generally yield the second best transferability.
5.3 Evaluation on Attack Efficiency
Attack efficiency is also important for evaluating attack methods, especially if we would like to incorporate the attacks into adversarial training as a defense method, since adversarial training needs highly efficient adversary generation to effectively promote model robustness. Therefore, we evaluate the total time (in seconds) of generating adversarial examples on the three datasets by the various attacks. As shown in Table 3, FGPM is nearly 20 times faster than GSA, the second fastest synonym substitution based attack, which also exhibits worse attack performance and lower transferability than FGPM. Moreover, FGPM is on average far faster than IGA, which achieves the largest degradation of classification accuracy among the baselines. Though Papernot et al. (2016) craft adversarial examples based on the gradient and complete each iteration quickly, their method needs many more iterations to obtain adversarial examples due to its relatively low attack effectiveness. On average, FGPM is about 78 times faster than Papernot et al.
[Table 5: classification accuracy (%) of defended models on adversarial examples transferred across models on DBPedia]
5.4 Evaluation on Adversarial Training
From the above analysis, we see that compared with the competitive attack baselines, FGPM achieves good attack performance and transferability with much higher efficiency. Such performance enables us to implement effective adversarial training that scales to large neural networks and datasets. In this subsection, we evaluate the performance of ATF and compare it with SEM and IBP against adversarial examples generated by the above competitive attacks. We focus on two aspects: defense against adversarial attacks and defense against transferability.
Defense against Adversarial Attacks. We apply the above attacks to models trained with the various defense methods to evaluate the defense performance. The results are shown in Table 4. Under normal training (NT), the classification accuracy of the models on all datasets drops dramatically under the different adversarial attacks. In contrast, both SEM and ATF promote the model robustness stably and effectively across all models and datasets. Note that IBP has distinctly better defense performance on CNNs because of its theoretical certification. However, with the many restrictions it adds on the architecture, the model hardly converges when trained on the three-layer LSTM architecture or a large dataset like Yahoo! Answers, resulting in weakened generalization and adversarial robustness instead; in some cases, IBP even achieves poorer robustness than NT. Moreover, compared with SEM, ATF achieves higher classification accuracy under almost all adversarial attacks on the three datasets. Especially on DBPedia, ATF blocks almost all the adversarial examples, achieving close to 100% classification accuracy. Besides, both SEM and IBP cause some accuracy decay on benign data. As a form of data augmentation, however, ATF improves the model performance on benign examples in most cases.
Defense against Transferability. As transferability poses a serious concern in real-world applications, a good defense method should not only defend against adversarial attacks but also resist adversarial transferability. To evaluate the ability to block the transferability of adversarial examples, we evaluate each defended model’s classification accuracy on adversarial examples generated by the different attack methods against normally trained models on DBPedia. As shown in Table 5, ATF blocks the transferability of adversarial examples much more successfully than normal training and the defense baselines.
In summary, ATF promotes the model robustness against both white-box and black-box adversarial attacks. When applied to complex models and large datasets, ATF maintains stable and effective performance. ATF further blocks the transferability of adversarial examples more successfully, with no decay in classification accuracy on benign data.
6 Conclusion

In this work, we propose an efficient gradient-based synonym substitution adversarial attack method, called the Fast Gradient Projection Method (FGPM), which approximates the classification confidence change caused by synonym substitution by the product of the gradient magnitude and the distance between the original word and the candidate synonym projected in the gradient direction. Empirical evaluations on three widely used benchmark datasets demonstrate that FGPM achieves attack performance and transferability close to those of state-of-the-art synonym substitution based adversarial attack methods.
More importantly, FGPM is about 20 times faster than the current fastest synonym substitution based adversarial attack method. With such high efficiency, we introduce an effective defense approach called Adversarial Training with FGPM (ATF) for text classification. Extensive experiments show that ATF can promote the model robustness against white-box as well as black-box adversarial attacks, and block the transferability of adversarial examples effectively without any accuracy decay on benign data.
In future work, we plan to extend FGPM and ATF to other NLP tasks (e.g., machine translation and question answering) to evaluate the robustness of sequence-to-sequence models. Moreover, considering the success of adversarial training in the image domain, we hope our work can inspire more research on adversarial training in the text domain.
- Alzantot et al. (2018) Moustafa Alzantot, Yash Sharma, Ahmed Elgohary, Bo-Jhang Ho, Mani B. Srivastava, and Kai-Wei Chang. 2018. Generating natural language adversarial examples. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2890–2896.
- Athalye et al. (2018) Anish Athalye, Nicholas Carlini, and David Wagner. 2018. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In Proceedings of the 35th International Conference on Machine Learning (ICML), pages 274–283.
- Barham and Feizi (2019) Samuel Barham and Soheil Feizi. 2019. Interpretable adversarial training for text. arXiv preprint arXiv:1905.12864.
- Cheng et al. (2018) Minhao Cheng, Jinfeng Yi, Pin-Yu Chen, Huan Zhang, and Cho-Jui Hsieh. 2018. Seq2sick: Evaluating the robustness of sequence-to-sequence models with adversarial examples. arXiv preprint arXiv:1803.01128.
- Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Volume 1 (Long and Short Papers), pages 4171–4186.
- Dvijotham et al. (2018) Krishnamurthy Dvijotham, Sven Gowal, Robert Stanforth, Relja Arandjelovic, Brendan O’Donoghue, Jonathan Uesato, and Pushmeet Kohli. 2018. Training verified learners with learned verifiers. arXiv preprint arXiv:1805.10265.
- Ebrahimi et al. (2018) Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. 2018. Hotflip: White-box adversarial examples for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), pages 31–36.
- Gong et al. (2018) Zhitao Gong, Wenlu Wang, Bo Li, Dawn Song, and Wei-Shinn Ku. 2018. Adversarial texts with gradient methods. arXiv preprint arXiv:1801.07175.
- Goodfellow et al. (2015) Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. 2015. Explaining and harnessing adversarial examples. In Proceedings of the 3rd International Conference on Learning Representations (ICLR).
- Gowal et al. (2018) Sven Gowal, Krishnamurthy Dvijotham, Robert Stanforth, Rudy Bunel, Chongli Qin, Jonathan Uesato, Relja Arandjelovic, Timothy Mann, and Pushmeet Kohli. 2018. On the effectiveness of interval bound propagation for training verifiably robust models. arXiv preprint arXiv:1810.12715.
- Guo et al. (2018) Chuan Guo, Mayank Rana, Moustapha Cisse, and Laurens van der Maaten. 2018. Countering adversarial images using input transformations. In Proceedings of the 6th International Conference on Learning Representations (ICLR).
- Huang et al. (2019) Po-Sen Huang, Robert Stanforth, Johannes Welbl, Chris Dyer, Dani Yogatama, Sven Gowal, Krishnamurthy Dvijotham, and Pushmeet Kohli. 2019. Achieving verified robustness to symbol substitutions via interval bound propagation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4083–4093.
- Iyyer et al. (2018) Mohit Iyyer, John Wieting, Kevin Gimpel, and Luke Zettlemoyer. 2018. Adversarial example generation with syntactically controlled paraphrase networks. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Volume 1 (Long Papers), pages 1875–1885.
- Jia et al. (2019) Robin Jia, Aditi Raghunathan, Kerem Göksel, and Percy Liang. 2019. Certified robustness to adversarial word substitutions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4120–4133.
- Kim (2014) Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751.
- Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Proceedings of the 26th Annual Conference on Neural Information Processing System (NIPS), pages 1097–1105.
- Kuleshov et al. (2018) Volodymyr Kuleshov, Shantanu Thakoor, Tingfung Lau, and Stefano Ermon. 2018. Adversarial examples for natural language classification problems. OpenReview submission OpenReview:r1QZ3zbAZ.
- Li et al. (2019) Jinfeng Li, Shouling Ji, Tianyu Du, Bo Li, and Ting Wang. 2019. Textbugger: Generating adversarial text against real-world applications. In Proceedings of the 26th Annual Network and Distributed System Security Symposium (NDSS).
- Liang et al. (2018) Bin Liang, Hongcheng Li, Miaoqiang Su, Pan Bian, Xirong Li, and Wenchang Shi. 2018. Deep text classification can be fooled. In Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI), pages 4208–4215.
- Liu et al. (2016) Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. 2016. Recurrent neural network for text classification with multi-task learning. In Proceedings of the 25th International Joint Conference on Artificial Intelligence (IJCAI), pages 2873–2879.
- Mikolov et al. (2013) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In Proceedings of the 1st International Conference on Learning Representations (ICLR).
- Miyato et al. (2016) Takeru Miyato, Andrew M. Dai, and Ian J. Goodfellow. 2016. Adversarial training methods for semi-supervised text classification. In Proceedings of the 5th International Conference on Learning Representations (ICLR).
- Mrkšić et al. (2016) Nikola Mrkšić, Diarmuid Ó Séaghdha, Blaise Thomson, Milica Gašić, Lina Rojas-Barahona, Pei-Hao Su, David Vandyke, Tsung-Hsien Wen, and Steve Young. 2016. Counter-fitting word vectors to linguistic constraints. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), pages 142–148.
- Neekhara et al. (2019) Paarth Neekhara, Shehzeen Hussain, Shlomo Dubnov, and Farinaz Koushanfar. 2019. Adversarial reprogramming of text classification neural networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5216–5225.
- Papernot et al. (2016) Nicolas Papernot, Patrick D. McDaniel, Ananthram Swami, and Richard E. Harang. 2016. Crafting adversarial input sequences for recurrent neural networks. In Proceedings of IEEE Military Communications Conference (MILCOM), pages 49–54.
- Pruthi et al. (2019) Danish Pruthi, Bhuwan Dhingra, and Zachary C Lipton. 2019. Combating adversarial misspellings with robust word recognition. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), pages 5582–5591.
- Ren et al. (2019) Shuhuai Ren, Yihe Deng, Kun He, and Wanxiang Che. 2019. Generating natural language adversarial examples through probability weighted word saliency. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), pages 1085–1097.
- Ribeiro et al. (2018) Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2018. Semantically equivalent adversarial rules for debugging NLP models. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), pages 856–865.
- Samangouei et al. (2018) Pouya Samangouei, Maya Kabkab, and Rama Chellappa. 2018. Defense-gan: Protecting classifiers against adversarial attacks using generative models. In Proceedings of the 6th International Conference on Learning Representations (ICLR).
- Samanta and Mehta (2017) Suranjana Samanta and Sameep Mehta. 2017. Towards crafting text adversarial samples. arXiv preprint arXiv:1707.02812.
- Sato et al. (2018) Motoki Sato, Jun Suzuki, Hiroyuki Shindo, and Yuji Matsumoto. 2018. Interpretable adversarial perturbation in input embedding space for text. In Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI), pages 4323–4330.
- Szegedy et al. (2014) Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian J. Goodfellow, and Rob Fergus. 2014. Intriguing properties of neural networks. In Proceedings of the 2nd International Conference on Learning Representations (ICLR).
- Wang et al. (2019) Xiaosen Wang, Hao Jin, and Kun He. 2019. Natural language adversarial attacks and defenses in word level. arXiv preprint arXiv:1909.06723.
- Zhang et al. (2015) Xiang Zhang, Junbo Jake Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS), pages 649–657.
- Zhou et al. (2019) Yichao Zhou, Jyun-Yu Jiang, Kai-Wei Chang, and Wei Wang. 2019. Learning to discriminate perturbations for blocking adversarial attacks in text classification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4906–4915.