One way to evaluate the robustness of a machine learning model is to search for inputs that produce incorrect outputs. Inputs intentionally designed to fool deep learning models are referred to as adversarial examples (Goodfellow et al., 2017). Adversarial examples have been found to trick deep neural networks for image classification: two images that look exactly the same to a human receive completely different predictions from the classifier (Goodfellow et al., 2014).
While successful in the image case, the idea of an indistinguishable change lacks a clear analog in text. Unlike images, two different sequences of text are never entirely indistinguishable. This raises the question: if indistinguishable perturbations are not possible, what are adversarial examples in text?
The literature contains many, often overlapping, definitions of adversarial examples in natural language (Zhang et al., 2019). We consider each definition valid; the appropriate one depends on the threat model.
We propose a unified definition for successful adversarial examples in natural language: inputs that both fool the model and fulfill a set of linguistic constraints defined by the attacker. In Section 2, we present four constraints NLP adversarial examples may follow: semantics, grammaticality, edit distance, and non-suspicion to human readers. We categorize a selection of past attacks based on these constraints.
After defining these constraints, we provide guidelines for enforcement and discuss inherent difficulties. We examine the effectiveness of previously suggested evaluation methods through the lens of synonym substitution attacks by Alzantot et al. (2018) and Jin et al. (2019). By applying our evaluation methods to examples generated by each attack, we see that the examples often fail to fulfill the desired constraints.
To improve constraint evaluation, we introduce TextAttack, an open-source library that separates attacks from constraint evaluation methods. This allows automatic evaluation of constraints to be adjusted while holding the attack method constant.
After tuning constraint evaluation to align with human judgment, generated adversarial perturbations preserve semantics, grammaticality, and non-suspicion. Running Jin et al. (2019)’s attack with stricter thresholds decreases attack success rate from over 80% to under 20%. An automatic grammar checker detects no additional grammatical errors in the adversarial examples. Human evaluation shows that while the attack is now less successful, it generates perturbations that better preserve semantics and are substantially less noticeable to human judges.
The four main contributions of this paper are:
- Formally define constraints on adversarial perturbations in natural language and suggest evaluation methods for each constraint.
- Conduct a constraint evaluation case study, revealing that state-of-the-art synonym-based substitution attacks often do not preserve semantics, grammaticality, or non-suspicion.
- Show that by aligning automatic evaluation methods with human judgment, it is possible for attacks to produce successful, valid adversarial examples. However, the attack success rate drops substantially.
- Introduce TextAttack, an open-source library for adversarial attacks in NLP. TextAttack decouples attacks from constraint evaluation methods, allowing for more rigorous and consistent constraint evaluation, as well as dedicated ablation studies.
Input x: ”Shall I compare thee to a summer’s day?” – William Shakespeare, Sonnet XVIII

| Constraint | Perturbation x_adv | Violation |
|---|---|---|
| Semantics | Shall I compare thee to a winter’s day? | x_adv has a different meaning than x. |
| Grammaticality | Shall I compares thee to a summer’s day? | x_adv is less grammatically correct than x. |
| Edit Distance | Sha1l i conpp$haaare thee to a 5umm3r’s day? | x and x_adv have a large edit distance. |
| Non-suspicion | Am I gonna compare thee to a summer’s day? | A human reader may suspect this sentence to have been modified. [1] |

[1] Shakespeare never used the word “gonna”. Its first recorded usage wasn’t until 1806, and it didn’t become popular until the 20th century.
2 Constraints on Adversarial Examples in Natural Language
We define F: X → Y as a predictive model, for example, a deep neural network classifier, where X is the input space and Y is the output space. We focus on adversarial perturbations which perturb a correctly predicted input x into an input x_adv which fools the model, F(x_adv) ≠ F(x). We define C_1, …, C_n as a set of boolean functions, where each C_i(x, x_adv) indicates whether the perturbation satisfies a certain constraint.

The task of an adversary is then: find x_adv such that F(x_adv) ≠ F(x) and C_i(x, x_adv) holds for all i ∈ {1, …, n}.

Adversarial attacks search for examples which both fool the model, as represented by F(x_adv) ≠ F(x), and restrict x_adv such that it follows the constraints C_1, …, C_n.
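The definition above can be sketched in a few lines of code. This is an illustrative sketch only: `model`, the toy constraint, and the example inputs below are hypothetical stand-ins, not any paper's implementation.

```python
def is_successful_attack(model, x, x_adv, constraints):
    """Check that x_adv fools the model while satisfying every constraint.

    `model` maps an input to a label; each element of `constraints` is a
    boolean function C(x, x_adv). Both are illustrative stand-ins.
    """
    fools_model = model(x_adv) != model(x)
    satisfies_all = all(C(x, x_adv) for C in constraints)
    return fools_model and satisfies_all

# Toy usage: a "model" keyed on a single word, and a max-one-word-changed constraint.
toy_model = lambda text: "positive" if "riveting" in text else "negative"
max_one_word_changed = lambda x, x_adv: sum(
    a != b for a, b in zip(x.split(), x_adv.split())
) <= 1

print(is_successful_attack(toy_model, "a riveting ride", "a baffling ride",
                           [max_one_word_changed]))  # True
```

The helper makes the two halves of the definition explicit: the goal function (the prediction changes) and the constraint set (every C_i holds).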
The definition of depends on the goal of the attack. Attacks on classification frequently aim to either induce any incorrect classification (untargeted) or induce a particular classification (targeted). Attacks on other types of models may have more sophisticated goals. For example, attacks on translation may attempt to change every word of a translation, or introduce targeted keywords into the translation (Cheng et al., 2018).
A perturbation that achieves the goal of the attack must also preserve the original correct output: the correct output for x must equal the correct output for x_adv. (“Equal” is used loosely here. In the case of classification, the true labels must be equal; for tasks like translation and summarization, only the semantics of the output must be preserved.)
In addition to defining the goal of the attack, the attacker must decide on the constraints perturbations must meet. Different use cases require different constraints. We build on the categorization of attack spaces introduced by Gilmer et al. (2018) to introduce a set of constraints for adversarial examples in natural language.
In the following subsections, we define four constraints on adversarial perturbations in natural language: semantics, grammaticality, edit distance, and non-suspicion. We provide an example of an adversarial perturbation that violates each constraint in Table 1.
2.1 Semantics

This constraint requires that semantics be preserved between x and x_adv. A concrete threat model for this class of adversarial examples is tricking plagiarism detection software: an attacker must preserve the semantics of the original document while avoiding detection.
Many attacks include the semantics constraint as a way to ensure the ground truth output is preserved (Zhang et al., 2019). As long as the semantics of an input do not change, the ground truth output will stay the same. There are exceptions: one could imagine tasks for which preserving semantics does not necessarily preserve the ground truth output. For example, consider the task of classifying passages as written in either modern or Old English. Perturbing “why” to “wherefore” may retain the semantic meaning of the passage, but change its ground truth label.
2.2 Grammaticality

Under this constraint, the attacker is restricted to perturbations which do not introduce grammatical errors. (The grammaticality constraint refers to descriptive rather than prescriptive grammar.) Grammatical errors don’t necessarily change semantics, as illustrated in Table 1. In the plagiarism threat model outlined above, the grammaticality constraint applies.
2.3 Edit Distance
This constraint specifies a maximum edit distance between x and x_adv, either at the character or word level. The edit distance constraint is useful when the attacker is willing to introduce misspellings. It is also sometimes used when improving the robustness of models: for example, Huang et al. (2019) improve robustness when the attacker is given a perturbation budget representing the maximum allowed character-level changes.
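Character-level edit distance is typically measured as Levenshtein distance, which a short dynamic program computes. The sketch below is a generic textbook implementation, not tied to any particular attack; `within_edit_budget` shows how it would serve as a constraint.

```python
def levenshtein(a, b):
    """Character-level edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def within_edit_budget(x, x_adv, budget):
    """Boolean edit-distance constraint with a perturbation budget."""
    return levenshtein(x, x_adv) <= budget

# Two character substitutions separate these strings.
print(levenshtein("summer", "5umm3r"))  # 2
```

A word-level variant follows the same recurrence with token lists in place of strings.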
2.4 Non-suspicion

The non-suspicion constraint specifies that x_adv must appear to be unmodified. Consider the example in Table 1: while the perturbation preserves semantics and grammar, it switches between modern and Old English and thus may seem suspicious to readers. Note that the definition of the non-suspicion constraint is context-dependent. A sentence that is non-suspicious in the context of a kindergartner’s homework assignment might be suspicious in the context of an academic paper. A threat model where the non-suspicion constraint does not apply is illegal PDF distribution, similar to a case discussed by Gilmer et al. (2018). Consumers of an illegal PDF may tacitly collude with the person uploading it: they know the document has been altered, but do not care as long as semantics are preserved.
| Selected Attacks Generating Adversarial Examples in Natural Language | Semantics | Grammaticality | Edit Distance | Non-Suspicion |
|---|---|---|---|---|
| Synonym Substitution (Alzantot et al., 2018; Kuleshov et al., 2018; Jin et al., 2019; Ren et al., 2019) | ✓ | ✓ | ✓ | ✓ |
| Character Substitution (Ebrahimi et al., 2017; Gao et al., 2018; Li et al., 2018) | ✓ | ✗ | ✓ | ✓ |
| Word Insertion or Removal (Liang et al., 2017; Samanta and Mehta, 2017) | ✓ | ✓ | ✓ | ✓ |
| General Paraphrase (Zhao et al., 2017; Ribeiro et al., 2018; Iyyer et al., 2018) | ✓ | ✓ | ✗ | ✓ |
3 Categorization of Current Attacks
After choosing a set of constraints, the attacker must devise a method to fool the model. Here, we categorize a sample of the most significant attacks, summarized in Table 2.
Attacks by Paraphrase:
Some studies have generated adversarial examples through paraphrase. Iyyer et al. (2018) used neural machine translation systems to generate paraphrases. Ribeiro et al. (2018) proposed semantically-equivalent adversarial rules. By definition, paraphrases preserve semantics. Since these systems aim to generate perfect paraphrases, they implicitly follow the constraints of grammaticality and non-suspicion.
Attacks by Synonym Substitution:
Some works focus on an easier way to generate a subset of paraphrases: replacing words from the input with synonyms (Alzantot et al., 2018; Jin et al., 2019; Kuleshov et al., 2018; Papernot et al., 2016; Ren et al., 2019). Each attack applies a search algorithm to determine which words to replace with which synonyms. Like the general paraphrase case, they aim to create examples that preserve semantics, grammaticality, and non-suspicion. While not all have an explicit edit distance constraint, some limit the number of words perturbed.
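The shared skeleton of these search algorithms can be sketched in a few lines. This is a minimal greedy search in the spirit of synonym-substitution attacks, not any specific paper's method: `model`, `score`, `synonyms`, and the constraint below are illustrative stand-ins.

```python
def greedy_synonym_attack(model, score, x_words, synonyms, constraint):
    """Walk the input left to right; at each position, accept the first
    synonym swap that lowers the model's confidence score while satisfying
    the constraint, and stop once the predicted label flips.
    Returns the perturbed word list, or None if no swap flips the label."""
    words = list(x_words)
    original_label = model(words)
    for i, word in enumerate(x_words):
        for candidate in synonyms.get(word, []):
            trial = words[:i] + [candidate] + words[i + 1:]
            if constraint(x_words, trial) and score(trial) < score(words):
                words = trial
                if model(words) != original_label:
                    return words
                break  # keep this swap; move to the next position
    return None

# Toy setup: "confidence" is the fraction of words in a positive lexicon.
positive_words = {"riveting", "romantic"}
score = lambda ws: sum(w in positive_words for w in ws) / len(ws)
model = lambda ws: "positive" if score(ws) >= 0.5 else "negative"
synonyms = {"riveting": ["gripping"], "romantic": ["sappy"]}
max_two_swaps = lambda x, t: sum(a != b for a, b in zip(x, t)) <= 2

x = ["riveting", "and", "romantic", "ride"]
print(greedy_synonym_attack(model, score, x, synonyms, max_two_swaps))
# ['gripping', 'and', 'romantic', 'ride']
```

Real attacks differ mainly in the search strategy (greedy with word importance ranking, genetic search) and in how synonym candidates and constraints are computed.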
Attacks by Character Substitution:
Some studies have proposed to attack natural language classification models by deliberately misspelling words (Ebrahimi et al., 2017; Gao et al., 2018; Li et al., 2018). These attacks use character replacements to change a word into one that the model doesn’t recognize. The replacements are designed to create character sequences that a human reader would easily correct into the original words. If there aren’t many misspellings, non-suspicion may be preserved. Semantics are preserved as long as human readers can correct the misspellings.
Attacks by Word Insertion or Removal:
Liang et al. (2017) and Samanta and Mehta (2017) devised ways to determine the most important words in the input and then used heuristics to generate perturbed inputs by adding or removing important words. In some cases, these strategies are combined with synonym substitution. These attacks aim to follow all four constraints.
4 Constraint Evaluation Methods and Case Study
For each constraint introduced in Section 2, we discuss best practices for both human and automatic evaluation. We leave out edit distance due to ease of automatic evaluation.
Our case study examines the synonym substitution attacks of Alzantot et al. (2018) and Jin et al. (2019). Both claim to create perturbations that preserve semantics, maintain grammaticality, and adhere to the non-suspicion constraint. However, our inspection of the adversarial perturbations revealed that many introduced grammatical errors and did not preserve semantics.
They report high attack success rates. (We use “attack success rate” to mean the percentage of the time that an attack can find a successful adversarial example by perturbing a given input. “After-attack accuracy” or “accuracy after attack” is the accuracy the model achieves after all successful perturbations have been applied.)
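The two metrics just defined are related by simple arithmetic. The sketch below is purely illustrative; the counts are made up for the example.

```python
def attack_metrics(n_inputs, n_correct, n_attack_successes):
    """Compute the two metrics defined in the text.

    Attack success rate: fraction of attacked (correctly predicted) inputs
    for which the attack finds a successful perturbation.
    After-attack accuracy: accuracy over all inputs once every successful
    perturbation has been applied."""
    success_rate = n_attack_successes / n_correct
    after_attack_accuracy = (n_correct - n_attack_successes) / n_inputs
    return success_rate, after_attack_accuracy

# Hypothetical numbers: 1000 inputs, 900 classified correctly, 720 attacked successfully.
rate, acc = attack_metrics(n_inputs=1000, n_correct=900, n_attack_successes=720)
print(rate, acc)  # 0.8 0.18
```

Note that a high attack success rate and a low after-attack accuracy convey the same information only when the model's clean accuracy is known.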
These methods attack two of the most effective models for text classification: LSTM and BERT.
Alzantot et al. (2018) used a genetic algorithm to attack an LSTM trained on the IMDB (https://datasets.imdbws.com/) document-level sentiment classification dataset. Jin et al. (2019) used a greedy approach to attack an LSTM, CNN, and BERT trained on five classification datasets. To generate examples for evaluation, we attacked BERT using Jin et al. (2019)’s method and attacked an LSTM using Alzantot et al. (2018)’s method. We evaluate both methods on the IMDB dataset. In addition, we evaluate Jin et al. (2019)’s method on the Yelp polarity document-level sentiment classification dataset and the Movie Review (MR) sentence-level sentiment classification dataset (Pang and Lee, 2005; Zhang et al., 2015). We use examples from each dataset. Table 3 shows example violations of each constraint.
| Constraint violated | Original input x | Perturbation x_adv |
|---|---|---|
| Semantics | Jagger, Stoppard and director Michael Apted deliver a riveting and surprisingly romantic ride. | Jagger, Stoppard and director Michael Apted deliver a baffling and surprisingly sappy motorbike. |
| Grammaticality | A grating, emaciated flick. | A grates, lanky flick. |
| Non-suspicion | Great character interaction. | Gargantuan character interaction. |
4.1 Evaluation of Semantics
4.1.1 Human Evaluation
A few past studies of attacks have included human evaluation of semantic preservation (Ribeiro et al., 2018; Iyyer et al., 2018; Alzantot et al., 2018; Jin et al., 2019). However, studies often simply ask users to rate the similarity of x and x_adv. We believe this phrasing does not generate an accurate measure of semantic preservation, as users may consider two sentences with different semantics “similar” if they only differ by a few words. Instead, users should be explicitly asked whether the changes between x and x_adv preserve the meaning of the original passage.
We propose to ask human judges to rate whether meaning is preserved on a Likert scale of 1–5, where 1 is “Strongly Disagree” and 5 is “Strongly Agree” (Likert, 1932). A perturbation is semantics-preserving if the average score is at least 4. We propose this as a general rule: on average, humans should either “Agree” or “Strongly Agree” that x and x_adv have the same meaning.
4.1.2 Automatic Evaluation
Automatic evaluation of semantic similarity is a well-studied NLP task. The STS Benchmark is used as a common measurement (Cer et al., 2017).
Michel et al. (2019) explored using the common machine translation evaluation metrics BLEU, METEOR, and chrF as proxies for semantic similarity in the attack setting (Papineni et al., 2002; Denkowski and Lavie, 2014; Popović, 2015). While these n-gram based approaches are computationally cheap and often work well in the machine translation setting, they do not correlate with human judgment as well as sentence encoders do (Wieting and Gimpel, 2018).
A sentence encoder encodes two sentences into a pair of fixed-length vectors; the cosine similarity between the vectors is then used as the similarity score. Jin et al. (2019) use the Universal Sentence Encoder (USE) to evaluate semantic similarity, which achieved a Pearson correlation of 0.782 on the STS benchmark (Cer et al., 2018). Another option for evaluation is BERT, which achieved a score of 0.876 (Devlin et al., 2018).
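The encoder-plus-cosine pipeline looks like the sketch below. The bag-of-words "encoder" is a deliberately crude stand-in so the example runs self-contained; in practice the vectors would come from a trained sentence encoder such as USE or BERT.

```python
import math
from collections import Counter

def bag_of_words_encode(sentence):
    """Stand-in encoder: a sparse bag-of-words vector. A real attack would
    use a trained sentence encoder; this stub just makes cosine runnable."""
    return Counter(sentence.lower().split())

def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

x = "a riveting and surprisingly romantic ride"
x_adv = "a riveting and surprisingly romantic journey"
sim = cosine_similarity(bag_of_words_encode(x), bag_of_words_encode(x_adv))
print(round(sim, 3))  # 0.833
```

An attack would accept the perturbation only if this similarity clears a chosen threshold.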
4.1.3 Case Study
We asked users whether they agreed that the changes between the two passages preserved meaning on a scale of 1 (Strongly Disagree) to 5 (Strongly Agree). We averaged scores for each attack method to determine if the method generally preserves semantics.
Examples generated by Jin et al. (2019) were rated an average of 3.28, while examples generated by Alzantot et al. (2018) were rated an average of 2.70. (We hypothesize that Jin et al. (2019) achieved higher scores due to their use of USE.) The average rating for both methods falls significantly below our proposed threshold of 4. Using a clear survey question illustrates that many perturbations are not semantics-preserving.
4.2 Evaluation of Grammaticality
4.2.1 Human Evaluation
Both Jin et al. (2019) and Iyyer et al. (2018) reported a human evaluation of grammaticality, but neither study explicitly asked whether the perturbation introduced any grammatical errors. For human evaluation of the grammaticality constraint, we propose presenting x and x_adv together and asking judges whether the changes introduced grammatical errors. However, due to the rule-based nature of grammar, automatic evaluation is preferred.
4.2.2 Automatic Evaluation
The simplest way to automatically evaluate grammatical correctness is with a rule-based grammar checker. Free grammar checkers are available online in many languages. One popular checker is LanguageTool, an open-source proofreading tool (Naber, 2003). LanguageTool ships with thousands of human-curated rules for the English language and provides a downloadable server interface for analyzing sentences. While other rule-based and some model-based grammar checkers exist, comparison between them is outside the scope of this work.
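Using such a checker as an attack constraint reduces to comparing error counts before and after perturbation. In the sketch below, `checker` is any callable returning a list of flagged matches; the regex stub stands in for a real checker like the LanguageTool server, which would be queried over its API instead.

```python
import re

def count_errors(checker, text):
    """Number of grammar-rule matches flagged in `text`."""
    return len(checker(text))

def grammaticality_constraint(checker, x, x_adv):
    """Reject perturbations that introduce new grammatical errors."""
    return count_errors(checker, x_adv) <= count_errors(checker, x)

# Stub checker flagging one toy rule: singular "this" before an s-final word.
stub_checker = lambda text: re.findall(r"\bthis \w+s\b", text)

print(grammaticality_constraint(stub_checker,
                                "this film is fine",
                                "this films is fine"))  # False
```

Comparing counts, rather than requiring zero errors in x_adv, keeps the constraint usable on inputs that already contain errors.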
4.2.3 Case Study
We ran each of the generated (x, x_adv) pairs through LanguageTool to count grammatical errors. LanguageTool detected more grammatical errors in x_adv than in x for 51% of perturbations generated by Jin et al. (2019), and 29% of perturbations generated by Alzantot et al. (2018).
Additionally, perturbations often contain errors that humans rarely make. LanguageTool detected 6 rule categories for which errors appear at least 10 times more frequently in the perturbed samples than in the original content. Details regarding selected error categories and examples of violations are shown in Table 4.
| Grammar Rule ID | Errors in x | Errors in x_adv | Explanation | Context |
|---|---|---|---|---|
| | | | You should probably use: ’are’. Replace is with one of [are] | this films is too busiest beat all of its allotted ma… |
| DID_BASEFORM | 21 | 326 | The verb ’can’t’ requires base form of this verb: ’compare’. Replace compares with one of [compare] | …first two cinema in the series, i can’t compares friday after next to them, but nothing … |
| NON3PRS_VERB | 13 | 199 | The pronoun ’i’ must be used with a non-third-person form of a verb: ’surprise’. Replace surprises with one of [surprise] | …ved reached out hating the second one i surprises why they saw iike the same film to me |
| A_PLURAL | 24 | 330 | Don’t use indefinite articles with plural words. Did you mean ’a grate’, ’a gratis’ or simply ’grates’? Replace a grates with one of [a grate, a gratis, grates] | a grates, lanky flick |
| TO_NON_BASE | 4 | 48 | The verb after ”to” should be in the base form: ’excuse’. Replace excuses with one of [excuse] | doesn’t inbound close to excuses the hype that surrounded its debut at t… |
| EN_A_VS_AN | 110 | 555 | Use ’an’ instead of ’a’ if the following word starts with a vowel sound, e.g. ’an article’, ’an hour’. Replace a with one of [an] | like a eastwards of the constraint melrose pla… |
4.3 Evaluation of Non-suspicion
4.3.1 Human Evaluation
We propose to evaluate the non-suspicion constraint with a method in which judges view a shuffled mix of real and adversarial inputs and must guess whether each is real or computer-altered. This is similar to the human evaluation done by Ren et al. (2019), but binary rather than on a 1–5 scale. (We believe either method is valid.) A perturbed example meets the non-suspicion constraint if the portion of judges who identify x_adv as computer-altered does not exceed a threshold chosen by the attacker.
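Scoring this study is a one-line computation. The sketch below is illustrative; the threshold of 0.5 and the example judgment lists are made up for demonstration.

```python
def meets_non_suspicion(judgments, threshold=0.5):
    """`judgments` is a list of booleans: True if a judge flagged the input
    as computer-altered. The example passes the non-suspicion constraint if
    the flagged fraction stays at or below the chosen threshold."""
    flagged = sum(judgments) / len(judgments)
    return flagged <= threshold

print(meets_non_suspicion([True, False, False, False]))  # True
print(meets_non_suspicion([True, True, True, False]))    # False
```

Note that with a perfectly non-suspicious example, judges guessing at random would flag it about half the time, which motivates a threshold around 0.5.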
4.3.2 Automatic Evaluation
Automatic evaluation may be used to guess whether or not an adversarial example is suspicious. Models can be trained to classify passages as real or perturbed, just as human judges do. For example, Warstadt et al. (2018) trained sentence encoders on a real/fake task as a proxy for evaluation of linguistic acceptability. Recently, Zellers et al. (2019) demonstrated that GROVER, a transformer-based text generation model, could classify its own generated news articles as human- or machine-written with high accuracy.
4.3.3 Case Study
We presented a shuffled mix of real and perturbed examples to human judges and asked if they were real or computer-altered. As this is a time-consuming task for long documents, we only evaluated adversarial examples generated by Jin et al. (2019)’s method on the sentence-level MR dataset.
If all generated examples were non-suspicious, judges would average 50% accuracy. In this case, judges achieved 69.2% accuracy.
5 Producing Higher Quality Adversarial Examples
In the previous section, we evaluated how well generated examples met constraints. Now, we adjust the constraints applied during the course of the attack to produce higher quality adversarial examples.
The case study in Section 4 revealed that although attacks in NLP aspire to meet linguistic constraints, in practice, they frequently violate them. Inconsistent application of constraints leads to two problems:
For a single attack, constraints that are claimed to be met may not be. Lenient constraint enforcement correlates directly with attack success.
Across multiple attacks, comparing effectiveness is difficult. Comparing the success rates of two attacks is only meaningful if the attacks follow the same constraints, evaluated in the same manner.
To alleviate these issues, we wrote TextAttack, an open-source NLP attack library designed to decouple attack methods from constraint application. TextAttack makes it easy for researchers to enforce constraints properly and to compare attacks while holding constraint enforcement techniques constant. To demonstrate TextAttack, we continue to study the attacks introduced by Jin et al. (2019) and Alzantot et al. (2018). TextAttack can be used to reproduce the original attack results.
We set out to find whether a different set of thresholds on evaluation metrics could produce adversarial examples that are semantics-preserving, grammatical, and non-suspicious. We modified Jin et al. (2019)’s attack with different constraints. To enforce grammaticality, we added LanguageTool. To enforce semantic preservation, we tuned two thresholds that any substitution must satisfy: (a) minimum cosine similarity between counter-fitted word embeddings and (b) minimum cosine similarity between sentence embeddings. Through human studies, we found threshold values of 0.9 for (a) and 0.98 for (b) (details in Section A.1.1).
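The two tuned thresholds combine into a single gate on each candidate substitution. This sketch only shows the gating logic; the similarity values themselves would come from real word and sentence encoders, and the default thresholds are the values found above.

```python
def passes_semantic_constraints(word_sim, sentence_sim,
                                word_threshold=0.9, sentence_threshold=0.98):
    """Gate a candidate word substitution on the two tuned thresholds:
    (a) cosine similarity between the swapped words' embeddings, and
    (b) cosine similarity between sentence embeddings before/after the swap."""
    return word_sim >= word_threshold and sentence_sim >= sentence_threshold

print(passes_semantic_constraints(0.93, 0.99))  # True: both thresholds met
print(passes_semantic_constraints(0.93, 0.95))  # False: sentence similarity too low
```

Raising either threshold shrinks the space of allowed substitutions, which is exactly the source of the drop in attack success rate reported below.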
5.1 Results with Adjusted Constraint Application
Semantics. With the original attack, human judges on average were “Not sure” that semantics were preserved. After adjusting constraint evaluation, human judges on average “Agree”.
Grammaticality. Automatic evaluation during the attack ensured that x_adv did not have more grammatical errors than x. Thus, our generated examples were observed to meet the grammaticality constraint. (Since the MR dataset is already lowercased and tokenized, it is difficult for a rule-based grammar checker like LanguageTool to parse some inputs. A more powerful language checker would filter out an even greater number of grammatical errors.)
Non-suspicion. We repeated the study from Section 4.3 with our new examples. Participants guessed whether inputs were computer-altered with substantially lower accuracy than on the examples generated by the original attack.
Attack success. For each of the three datasets, the attack success rate decreased substantially.
Table 5 reports semantic preservation, grammatical error rate, non-suspicion rate, and attack success rate before and after the adjusted constraint application, along with the before/after difference in attack success.

Table 6 compares grammatical error rate, attack success rate, and percentage of words perturbed for Jin et al. (2019) and Alzantot et al. (2018).
5.2 Comparing the Two Attacks
We compared the relative success rates of Jin et al. (2019) and Alzantot et al. (2018) with constraint evaluation held constant. We applied the constraint evaluation methods from above and tested each attack against BERT fine-tuned on the MR dataset. Contrary to previous findings, Table 6 shows the two attacks had very similar success rates. The attacks achieved similar scores on human evaluation of semantics and non-suspicion. The genetic algorithm (Alzantot et al., 2018) was slightly more successful than the greedy search (Jin et al., 2019), but was far more computationally expensive, making many times as many model queries on average.
6 Ablation Study
We generated better-quality adversarial examples by constraining the search to exclude examples that fail to meet thresholds measured in three ways: word embedding distance, sentence encoder similarity, and grammaticality. Since we applied these constraint evaluation methods all at the same time, we performed an ablation study to understand which constraints had the largest impact on attack success rate.
We reran three attacks (one for each constraint removed) on each of our BERT classification datasets. Table 7 shows attack success rate after individually removing each constraint. The word embedding distance constraint was the greatest inhibitor of attack success rate; without enforcing this constraint, attacks were over twice as successful.
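The leave-one-out structure of this ablation is easy to express in code. The sketch below is a generic harness, not the actual experiment: `run_attack` stands in for a full attack run, and the toy version simply returns a success rate that falls as more constraints are enforced.

```python
def ablation_study(run_attack, constraints):
    """Re-run an attack once per constraint, with that constraint removed,
    and report the resulting success rate for each ablation.
    `run_attack` takes a constraint list and returns a success rate."""
    results = {}
    for name in constraints:
        remaining = {k: v for k, v in constraints.items() if k != name}
        results[f"without {name}"] = run_attack(list(remaining.values()))
    return results

# Toy stand-in: success rate halves for each constraint enforced.
toy_run = lambda cs: round(0.8 * (0.5 ** len(cs)), 3)
constraints = {"word embedding": None, "sentence encoder": None, "grammar": None}
print(ablation_study(toy_run, constraints))
```

In the real study each `run_attack` call is a full attack over the evaluation set, so the harness mainly serves to keep every run identical except for the single removed constraint.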
7 Discussion

Decoupling attacks and constraints. It is critical that researchers separate new attack methods from new constraint evaluation methods. Demonstrating the performance of a new attack while simultaneously introducing new constraints makes it unclear whether empirical gains demonstrate a more effective attack or a more relaxed set of constraints. This mirrors a broader trend in machine learning where researchers report differences that come from changing multiple independent variables, making the sources of empirical gains unclear (Lipton and Steinhardt, 2018). This is especially relevant in adversarial NLP, where each experiment depends on many parameters. (While working to reproduce past work in TextAttack, we noticed how differences that may seem negligible often have an outsized impact on attack success rate. These include the list of stopwords used, the maximum length of model inputs, and tokenization strategies.)
Ablation studies for NLP adversarial attacks. Adversarial attacks in NLP need proper ablation studies. TextAttack allowed us to compare attack strategies in a standardized environment. Moving forward, TextAttack will be used for ablation studies that provide the community with an idea of the relative performance of different attack strategies and constraint evaluation methods. Additionally, TextAttack may be used to help researchers gauge model robustness against a variety of attacks.
Tradeoff between attack success and example quality. We made semantic constraints more selective, which helped attacks generate examples that scored above 4 on the Likert scale for preservation of semantics. This indicates that, when only allowing adversarial examples that preserve semantics and grammaticality, NLP models are relatively robust to current synonym substitution attacks. However, our set of constraints isn’t necessarily optimal for every attack scenario. For example, researchers using an attack for producing additional training data may wish to allow perturbations that are more flexible when it comes to grammaticality and semantics.
8 Related Work
The goal of creating adversarial examples that preserve semantics and grammaticality is common in the NLP attack literature (Zhang et al., 2019). However, previous works use different definitions of adversarial examples, making it hard to compare methods. We provide a unified definition of an adversarial example based on the constraints it must fulfill.
There are some existing open-source libraries related to adversarial examples in NLP. Trickster proposes a method for attacking NLP models based on graph search but lacks the ability to ensure that generated examples satisfy constraints (Kulynych et al., 2018). TEAPOT is a library for evaluating adversarial perturbations in text, but only supports n-gram based comparisons for evaluating attacks on machine translation models (Michel et al., 2019). AllenNLP Interpret includes functionality for running adversarial attacks on NLP models, but only supports attacks via input-reduction or gradient-based word swap (Wallace et al., 2019b). TextAttack has a broader scope than each of these libraries: it is designed to be extendable to any attack on any NLP model with any set of constraint evaluation methods.
9 Conclusion

We have shown that state-of-the-art synonym substitution attacks frequently do not preserve semantics or grammaticality, and often appear suspicious to humans. When we adjusted constraint evaluation to align with human judgment, we produced higher quality perturbations at a much lower success rate. We encourage researchers to use TextAttack to enforce rigorous constraints and to decouple attacks from constraint evaluation methods.
- Alzantot et al. (2018) Moustafa Alzantot, Yash Sharma, Ahmed Elgohary, Bo-Jhang Ho, Mani Srivastava, and Kai-Wei Chang. 2018. Generating natural language adversarial examples. arXiv preprint arXiv:1804.07998.
- Cer et al. (2017) Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 1–14, Vancouver, Canada. Association for Computational Linguistics.
- Cer et al. (2018) Daniel Cer, Yinfei Yang, Sheng yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St. John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2018. Universal sentence encoder. ArXiv, abs/1803.11175.
- Cheng et al. (2018) Minhao Cheng, Jinfeng Yi, Huan Zhang, Pin-Yu Chen, and Cho-Jui Hsieh. 2018. Seq2sick: Evaluating the robustness of sequence-to-sequence models with adversarial examples. arXiv preprint arXiv:1803.01128.
- Denkowski and Lavie (2014) Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 376–380, Baltimore, Maryland, USA. Association for Computational Linguistics.
- Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Ebrahimi et al. (2017) Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. 2017. Hotflip: White-box adversarial examples for text classification. arXiv preprint arXiv:1712.06751.
- Gao et al. (2018) Ji Gao, Jack Lanchantin, Mary Lou Soffa, and Yanjun Qi. 2018. Black-box generation of adversarial text sequences to evade deep learning classifiers. In IEEE Security and Privacy Workshops (SPW).
- Gilmer et al. (2018) Justin Gilmer, Ryan P. Adams, Ian J. Goodfellow, David Andersen, and George E. Dahl. 2018. Motivating the rules of the game for adversarial example research. CoRR, abs/1807.06732.
- Goodfellow et al. (2017) Ian Goodfellow, Nicolas Papernot, Sandy Huang, Rocky Duan, Pieter Abbeel, and Jack Clark. Attacking machine learning with adversarial examples [online]. 2017.
- Goodfellow et al. (2014) Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. 2014. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.
- Huang et al. (2019) Po-Sen Huang, Robert Stanforth, Johannes Welbl, Chris Dyer, Dani Yogatama, Sven Gowal, Krishnamurthy Dvijotham, and Pushmeet Kohli. 2019. Achieving verified robustness to symbol substitutions via interval bound propagation. ArXiv, abs/1909.01492.
- Iyyer et al. (2018) Mohit Iyyer, John Wieting, Kevin Gimpel, and Luke Zettlemoyer. 2018. Adversarial example generation with syntactically controlled paraphrase networks. CoRR, abs/1804.06059.
- Jia and Liang (2017) Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. arXiv preprint arXiv:1707.07328.
- Jin et al. (2019) Di Jin, Zhijing Jin, Joey Tianyi Zhou, and Peter Szolovits. 2019. Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment. arXiv e-prints, page arXiv:1907.11932.
- Kuleshov et al. (2018) Volodymyr Kuleshov, Shantanu Thakoor, Tingfung Lau, and Stefano Ermon. 2018. Adversarial examples for natural language classification problems.
- Kulynych et al. (2018) Bogdan Kulynych, Jamie Hayes, Nikita Samarin, and Carmela Troncoso. 2018. Evading classifiers in discrete domains with provable optimality guarantees. CoRR, abs/1810.10939.
- Li et al. (2018) Jinfeng Li, Shouling Ji, Tianyu Du, Bo Li, and Ting Wang. 2018. Textbugger: Generating adversarial text against real-world applications. arXiv preprint arXiv:1812.05271.
- Liang et al. (2017) Bin Liang, Hongcheng Li, Miaoqiang Su, Pan Bian, Xirong Li, and Wenchang Shi. 2017. Deep text classification can be fooled. arXiv preprint arXiv:1704.08006.
- Likert (1932) R. Likert. 1932. A technique for the measurement of attitudes. Archives of Psychology.
- Lipton and Steinhardt (2018) Zachary Chase Lipton and Jacob Steinhardt. 2018. Troubling trends in machine learning scholarship. ArXiv, abs/1807.03341.
- Michel et al. (2019) Paul Michel, Xian Li, Graham Neubig, and Juan Miguel Pino. 2019. On evaluation of adversarial perturbations for sequence-to-sequence models. CoRR, abs/1903.06620.
- Mrksic et al. (2016) Nikola Mrkšić, Diarmuid Ó Séaghdha, Blaise Thomson, Milica Gasic, Lina Maria Rojas-Barahona, Pei-Hao Su, David Vandyke, Tsung-Hsien Wen, and Steve J. Young. 2016. Counter-fitting word vectors to linguistic constraints. In HLT-NAACL.
- Naber (2003) Daniel Naber. 2003. A rule-based style and grammar checker.
- Pang and Lee (2005) Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05), pages 115–124, Ann Arbor, Michigan. Association for Computational Linguistics.
- Papernot et al. (2016) Nicolas Papernot, Patrick McDaniel, Ananthram Swami, and Richard Harang. 2016. Crafting adversarial input sequences for recurrent neural networks. In Military Communications Conference, MILCOM 2016-2016 IEEE, pages 49–54. IEEE.
- Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.
- Popović (2015) Maja Popović. 2015. chrF: character n-gram f-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 392–395, Lisbon, Portugal. Association for Computational Linguistics.
- Ren et al. (2019) Shuhuai Ren, Yihe Deng, Kun He, and Wanxiang Che. 2019. Generating natural language adversarial examples through probability weighted word saliency. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1085–1097, Florence, Italy. Association for Computational Linguistics.
- Ribeiro et al. (2018) Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2018. Semantically equivalent adversarial rules for debugging nlp models. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 856–865.
- Samanta and Mehta (2017) Suranjana Samanta and Sameep Mehta. 2017. Towards crafting text adversarial samples. arXiv preprint arXiv:1707.02812.
- Wallace et al. (2019a) Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. 2019a. Universal adversarial triggers for attacking and analyzing nlp.
- Wallace et al. (2019b) Eric Wallace, Jens Tuyls, Junlin Wang, Sanjay Subramanian, Matthew Gardner, and Sameer Singh. 2019b. Allennlp interpret: A framework for explaining predictions of nlp models. ArXiv, abs/1909.09251.
- Warstadt et al. (2018) Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. 2018. Neural network acceptability judgments. CoRR, abs/1805.12471.
- Zellers et al. (2019) Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi. 2019. Defending against neural fake news. CoRR, abs/1905.12616.
- Zhang et al. (2019) Wei Emma Zhang, Quan Z. Sheng, and Ahoud Abdulrahmn F. Alhazmi. 2019. Generating textual adversarial examples for deep learning models: A survey. CoRR, abs/1901.06796.
- Zhang et al. (2015) Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in neural information processing systems, pages 649–657.
- Zhao et al. (2017) Zhengli Zhao, Dheeru Dua, and Sameer Singh. 2017. Generating natural adversarial examples. arXiv preprint arXiv:1710.11342.
Appendix A
A.1 Details about Human Studies
Our experiments relied on labor crowd-sourced from Amazon Mechanical Turk. We used five datasets: the IMDB and Yelp datasets from Alzantot et al. (2018) and the IMDB, Yelp, and Movie Review datasets from Jin et al. (2019). We limited our worker pool to workers in the United States, Canada, and Australia who had completed over 5,000 HITs with over a 99% success rate. We used an additional Qualification to prevent workers who had submitted too many labels in previous tasks from fulfilling too many of our HITs. In the future, we will also use a small qualifier task to select workers who are good at the task.
For the automatic portions of the case study in Section 4, we use all successfully perturbed examples. For the human portions, we randomly select successful examples for each combination of attack method and dataset, then use Amazon’s Mechanical Turk to gather answers for each example.
Rating Semantic Similarity.
For each semantic similarity prompt, we gathered annotations from 10 different judges. Recall that each selection was one of 5 different options ranging from “Strongly Agree” to “Strongly Disagree.” For each pair of original and perturbed sequences, we calculated the number of judges who chose the most frequent option. For example, if 7 chose “Strongly Agree” and 3 chose “Agree,” the number of judges who chose the most frequent option is 7. We found that for the examples studied in Section 4 the average of this metric was . For the examples in Section 5, at the threshold we chose, the average was .
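This agreement metric, the count of annotators who picked the modal label, can be sketched in a few lines (a minimal illustration; the function name is ours, not from the paper's code):

```python
from collections import Counter

def modal_agreement(labels):
    """Number of judges who chose the most frequent option."""
    return Counter(labels).most_common(1)[0][1]

# The worked example from the text: 7 judges chose
# "Strongly Agree" and 3 chose "Agree".
labels = ["Strongly Agree"] * 7 + ["Agree"] * 3
print(modal_agreement(labels))  # 7
```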
Guessing Real vs. Computer-altered.
We present results from our Mechanical Turk survey in which we asked users “Is this text real or computer-altered?”. We restricted this task to a single dataset, Movie Review. We chose Movie Review because its average sample length of 20 words is much shorter than Yelp or IMDB; we made this restriction because classifying long samples as real or fake is time-consuming. We paid per label for this task.
Rating word similarity.
We performed a third study in which we showed users a pair of words and asked them to rate the statement “In general, replacing the first word with the second preserves the meaning of a sentence.” We paid per label for this task.
Mechanical Turk comes with a set of pre-designed questionnaire interfaces. These include one titled “Semantic Similarity” which asks users to rate a pair of sentences on a scale from “Not Similar At All” to “Highly Similar.” Examples generated by synonym attacks benefit from this question formulation because humans tend to rate two sentences that share many words as “Similar” due to their small morphological distance, even if they have different meanings.
Notes for future surveys. In the future, we would also try to filter out bad labels by mixing some number of ground-truth “easy” data points into our dataset and rejecting the work of labelers who performed poorly on this set.
A.1.1 Finding the Right Thresholds
Comparing two words. We showed study participants a pair of words and asked them whether swapping out one word for the other would change the meaning of a sentence. The results are shown in Figure 2. Using this information, we chose 0.9 as the word-level cosine similarity threshold.
Comparing two passages. With the word-level threshold set at 0.9, we generated examples at a range of sentence-encoder thresholds. We chose to encode sentences with BERT fine-tuned for semantic similarity: first on the AllNLI dataset, then on the STS benchmark training set. We repeated the study from Section 4.1.1 on 100 examples from each threshold, obtaining 10 human labels per example. The results are in Figure 2. On average, judges agreed that the examples produced at the 0.98 threshold preserved semantics.
A.2 Further Analysis of the Non-Suspicion Constraint Case Study
presents the confusion matrix of results from the survey. Interestingly, workers guessed that the examples were real of the time, but when they guessed that examples were computer-altered, they were right of the time. Thus, while some perturbed examples are non-suspicious, there are others which workers identify with high precision.
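The two quantities discussed above follow directly from the 2x2 confusion matrix. A minimal sketch (the counts used here are hypothetical, not the survey's actual numbers):

```python
def survey_stats(real_guessed_real, real_guessed_altered,
                 altered_guessed_real, altered_guessed_altered):
    """Rates from a 2x2 confusion matrix of real vs. computer-altered guesses."""
    total = (real_guessed_real + real_guessed_altered
             + altered_guessed_real + altered_guessed_altered)
    # How often workers guessed "real", regardless of the truth:
    guessed_real_rate = (real_guessed_real + altered_guessed_real) / total
    # Precision of the "computer-altered" guesses:
    precision = altered_guessed_altered / (
        real_guessed_altered + altered_guessed_altered)
    return guessed_real_rate, precision

# Hypothetical counts for illustration only.
print(survey_stats(80, 20, 60, 40))  # (0.7, 0.666...)
```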
A.3 The Need for Standardized Metrics Supported by Human Evaluation
Jin et al. (2019) used an additional distance metric, defined as the cosine similarity between embeddings encoded by the Universal Sentence Encoder (USE), to determine whether a synonym swap preserves semantic similarity (Cer et al., 2018). Figure 4 shows the accuracy of BERT under attack by Jin et al. (2019)’s method as the minimum allowed cosine similarity between two sentences’ USE embeddings increases. (The MR dataset is excluded because no USE similarity restriction is enforced on inputs of fewer than 15 words.) As the threshold becomes more strict, the attack becomes less successful. Figure 4 also plots the accuracy under attack as the number of synonyms considered for each substitution increases. An attack with a more lenient standard for what constitutes a synonym is more successful. Previous methods vary in how many synonyms they consider, with Alzantot et al. (2018) considering 8, Kuleshov et al. (2018) considering 15, and Jin et al. (2019) considering 50.
A.4 Expanding the Categorization of Adversarial Examples in NLP
Recent work such as Jia and Liang (2017) and Wallace et al. (2019a) explored the creation of adversarial examples through concatenation of phrases to the input. While these examples contain the semantics of the original sentence, they also add new meaning. Future work may expand our framework to evaluate examples generated by this type of semantic composition.
Another group of attacks generates adversarial examples from scratch. Our constraints refer to the case where the attacker starts from a benign input and applies a perturbation to fool the model. This is useful from the perspective of the defender, since the defender can produce adversarial examples from inputs in the training set. However, in the real world, attackers often do not create adversarial examples from a starting point. An adversary who generates a fake news article will not try to perturb an article they find in the newspaper; they will try to generate the bogus article from scratch. Without an original input for comparison, it is not immediately clear how to evaluate semantic preservation in this case. We leave it to future work to define and evaluate constraints for inputs generated from scratch.
Table 9: Attack success rate (AS %) and average percentage of perturbed words (PW %) for Jin et al. (2019) and Alzantot et al. (2018), as originally reported and as reproduced with TextAttack.
The results in Table 9 compare the accuracy and average percentage of perturbed words obtained with TextAttack to the results previously reported. Our reproduction is more successful at attacking BERT than Jin et al. (2019), while perturbing more words. Our reproduction perturbs far fewer words than Alzantot et al. (2018), but achieves a similar success rate. (We believe Alzantot et al. (2018) may have misreported their words-perturbed percentage, as they set the maximum words perturbed to 20% while reporting an average words-perturbed percentage of 14.7% for their genetic algorithm and 19% for their baseline.)
One major implementation difference between TextAttack and the attacks studied is that TextAttack does not tokenize inputs before running an attack. This adds some overhead, since every perturbation must be retokenized, but it has two major advantages. First, it prevents errors that arise when words are broken into multiple tokens and individual tokens are swapped for other words. Second, it prevents loss of information during retokenization.
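This design choice can be illustrated with a minimal sketch (the helper and toy tokenizer below are hypothetical stand-ins, not TextAttack's API): the perturbation is applied to the raw string, and the whole perturbed string is retokenized afterwards, so the token sequence stays consistent with the new word.

```python
def swap_word_and_retokenize(text, index, replacement, tokenize):
    """Swap the index-th whitespace-separated word, then retokenize the
    entire perturbed string rather than editing tokens in place."""
    words = text.split()
    words[index] = replacement
    perturbed = " ".join(words)
    return perturbed, tokenize(perturbed)

def toy_tokenize(text):
    """Toy subword tokenizer for illustration: splits long words in two."""
    tokens = []
    for w in text.split():
        tokens.extend([w[:4], "##" + w[4:]] if len(w) > 6 else [w])
    return tokens

text = "a fascinating story"
print(swap_word_and_retokenize(text, 1, "scintillating", toy_tokenize))
# ('a scintillating story', ['a', 'scin', '##tillating', 'story'])
```

Because the replacement word may map to a different number of subword tokens than the original, editing the token sequence directly would be error-prone; retokenizing the raw string sidesteps the problem.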
A.6 Word Embeddings
It is common to perform synonym substitution by replacing a word with a neighbor in the counter-fitted embedding space. The distance between word embeddings is frequently measured using Euclidean distance, but it is also common to compare word embeddings based on their cosine similarity (the cosine of the angle between them). (Some work also measures distance based on the mean-squared error between embeddings, which is proportional to the square of the Euclidean distance.)
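A minimal sketch of this kind of synonym selection, using cosine similarity over a toy embedding table (the words, vectors, and threshold below are illustrative; real attacks use 300-dimensional counter-fitted embeddings):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_neighbors(word, embeddings, k=8, min_cos=0.9):
    """Up to k candidate substitutes within the cosine-similarity threshold."""
    query = embeddings[word]
    scored = sorted(
        ((cosine(query, vec), other)
         for other, vec in embeddings.items() if other != word),
        reverse=True)
    return [other for sim, other in scored[:k] if sim >= min_cos]

# Toy 2-d embedding table for illustration.
toy = {"good": [1.0, 0.1], "great": [0.95, 0.15], "bad": [-1.0, 0.0]}
print(nearest_neighbors("good", toy))  # ['great']
```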
For this reason, past work has sometimes constrained nearest neighbors based on the Euclidean distance between two word vectors, and other times based on their cosine similarity. Alzantot et al. (2018) considered both distance metrics, and reported that they “did not see a noticeable improvement using cosine.”
We would like to point out that, when using normalized word vectors (as is typical for counter-fitted embeddings), filtering nearest neighbors based on their minimum cosine similarity is equivalent to filtering by maximum Euclidean distance (or MSE, for that matter).
Let $u$, $v$ be normalized word embedding vectors; that is, $\|u\| = \|v\| = 1$. Then $\|u - v\|^2 = \|u\|^2 - 2\,u \cdot v + \|v\|^2 = 2 - 2\cos(u, v)$.
Therefore, the Euclidean distance between $u$ and $v$ is a monotonically decreasing function of the cosine between them. For any minimum cosine similarity $c$, we can instead use the maximum Euclidean distance $\sqrt{2 - 2c}$ and achieve the same result.
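This equivalence is easy to check numerically; the following is a quick self-contained verification (not from the paper's code) on random unit vectors:

```python
import math
import random

def normalize(v):
    """Scale a vector to unit Euclidean norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

random.seed(0)
u = normalize([random.gauss(0, 1) for _ in range(300)])
v = normalize([random.gauss(0, 1) for _ in range(300)])

cos_uv = sum(a * b for a, b in zip(u, v))  # dot product of unit vectors
euclidean = math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# For unit vectors, ||u - v|| = sqrt(2 - 2 * cos(u, v)).
assert abs(euclidean - math.sqrt(2 - 2 * cos_uv)) < 1e-9
```

So a minimum-cosine filter and the corresponding maximum-Euclidean-distance filter admit exactly the same neighbor sets.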
A.7 Examples in the Wild
We randomly select 10 attempted attacks from the MR dataset and show the original inputs, perturbations before constraint change, and perturbations after constraint change. See Table 10.
|by presenting an impossible romance in an impossible world , pumpkin dares us to say why either is impossible – which forces us to confront what’s possible and what we might do to make it so. Pos: 99.5%||by presenting an unsuitable romantic in an impossible world , pumpkin dares we to say why either is conceivable – which vigour we to confronted what’s possible and what we might do to make it so. Neg: 54.8%||N/A|
|…a ho-hum affair , always watchable yet hardly memorable. Neg: 83.9%||…a ho-hum affair , always watchable yet just memorable. Pos: 99.8%||N/A|
|schnitzler’s film has a great hook , some clever bits and well-drawn, if standard issue, characters, but is still only partly satisfying. Neg: 60.8%||schnitzler’s film has a great hook, some clever smithereens and well-drawn, if standard issue, characters, but is still only partly satisfying. Pos: 50.4%||schnitzler’s film has a great hook, some clever traits and well-drawn, if standard issue, characters, but is still only partly satisfying. Pos: 56.9%|
|its direction, its script, and weaver’s performance as a vaguely discontented woman of substance make for a mildly entertaining 77 minutes, if that’s what you’re in the mood for. Pos: 99.5%||its direction, its script, and weaver’s performance as a vaguely discontented woman of substance pose for a marginally comical 77 minutes, if that’s what you’re in the mood for. Neg: 65.5%||N/A|
|missteps take what was otherwise a fascinating, riveting story and send it down the path of the mundane. Pos: 99.1%||missteps take what was otherwise a fascinating, scintillating story and dispatched it down the path of the mundane. Neg: 51.2%||N/A|
|hawke draws out the best from his large cast in beautifully articulated portrayals that are subtle and so expressive they can sustain the poetic flights in burdette’s dialogue. Pos: 99.9%||hawke draws out the better from his wholesale cast in terribly jointed portrayals that are inconspicuous and so expressive they can sustain the rhymed flight in burdette’s dialogue. Neg: 60.3%||N/A|
|if religious films aren’t your bailiwick, stay away. otherwise, this could be a passable date film. Neg: 99.1%||if religious films aren’t your bailiwick, stay away. otherwise, this could be a presentable date film. Pos: 86.6%||N/A|
|[broomfield] uncovers a story powerful enough to leave the screen sizzling with intrigue. Pos: 99.1%||[broomfield] uncovers a story pompous enough to leave the screen sizzling with plots. Neg: 59.2%||N/A|
|like its two predecessors, 1983’s koyaanisqatsi and 1988’s powaqqatsi, the cinematic collage naqoyqatsi could be the most navel-gazing film ever. Pos: 99.4%||N/A||N/A|
|maud and roland’s search for an unknowable past makes for a haunting literary detective story, but labute pulls off a neater trick in possession : he makes language sexy. Pos: 99.4%||maud and roland’s search for an unknowable past makes for a haunting literary detective story, but labute pulls off a neater trick in property : he assumes language sultry. Neg: 62.1%||N/A|