Reevaluating Adversarial Examples in Natural Language

04/25/2020 ∙ by John X. Morris, et al. ∙ University of Virginia

State-of-the-art attacks on NLP models have different definitions of what constitutes a successful attack. These differences make the attacks difficult to compare. We propose to standardize definitions of natural language adversarial examples based on a set of linguistic constraints: semantics, grammaticality, edit distance, and non-suspicion. We categorize previous attacks based on these constraints. For each constraint, we suggest options for human and automatic evaluation methods. We use these methods to evaluate two state-of-the-art synonym substitution attacks. We find that perturbations often do not preserve semantics, and 45% introduce grammatical errors. Next, we conduct human studies to find a threshold for each evaluation method that aligns with human judgment. Human surveys reveal that to truly preserve semantics, we need to significantly increase the minimum cosine similarity between the embeddings of swapped words and sentence encodings of original and perturbed inputs. After tightening these constraints to agree with the judgment of our human annotators, the attacks produce valid, successful adversarial examples. But quality comes at a cost: attack success rate drops by over 70 percentage points. Finally, we introduce TextAttack, a library for adversarial attacks in NLP.




1 Introduction

One way to evaluate the robustness of a machine learning model is to search for inputs that produce incorrect outputs. Inputs intentionally designed to fool deep learning models are referred to as adversarial examples (Goodfellow et al., 2017). Adversarial examples have been found to trick deep neural networks for image classification: two images that look exactly the same to a human receive completely different predictions from the classifier (Goodfellow et al., 2014).

While successful in the image case, the idea of an indistinguishable change lacks a clear analog in text. Unlike images, two different sequences of text are never entirely indistinguishable. This raises the question: if indistinguishable perturbations are not possible, what are adversarial examples in text?

The literature contains many often overlapping definitions of adversarial examples in natural language (Zhang et al., 2019). We consider each definition to be valid. The correct definition varies based on the threat model.

We propose a unified definition for successful adversarial examples in natural language: inputs that both fool the model and fulfill a set of linguistic constraints defined by the attacker. In Section 2, we present four constraints NLP adversarial examples may follow: semantics, grammaticality, edit distance, and non-suspicion to human readers. We categorize a selection of past attacks based on these constraints.

After defining these constraints, we provide guidelines for enforcement and discuss inherent difficulties. We examine the effectiveness of previously suggested evaluation methods through the lens of synonym substitution attacks by Alzantot et al. (2018) and Jin et al. (2019). By applying our evaluation methods to examples generated by each attack, we see that the examples often fail to fulfill the desired constraints.

To improve constraint evaluation, we introduce TextAttack, an open-source library that separates attacks from constraint evaluation methods. This allows automatic evaluation of constraints to be adjusted while holding the attack method constant.

After tuning constraint evaluation to align with human judgment, generated adversarial perturbations preserve semantics, grammaticality, and non-suspicion. Running Jin et al. (2019)’s attack with stricter thresholds decreases attack success rate from over 80% to under 20%. An automatic grammar checker detects no additional grammatical errors in the adversarial examples. Human evaluation shows that while the attack is now less successful, it generates perturbations that better preserve semantics and are substantially less noticeable to human judges.

The four main contributions of this paper are:

  • Formally define constraints on adversarial perturbations in natural language and suggest evaluation methods for each constraint.

  • Conduct a constraint evaluation case study, revealing that state-of-the-art synonym-based substitution attacks often do not preserve semantics, grammaticality, or non-suspicion.

  • Show that by aligning automatic evaluation methods with human judgment, it is possible for attacks to produce successful, valid adversarial examples. However, success rate drops by over 70 percentage points.

  • Introduce TextAttack, an open-source library for adversarial attacks in NLP. TextAttack decouples attacks from constraint evaluation methods, allowing for more rigorous and consistent constraint evaluation, as well as dedicated ablation studies.

Input, $x$: “Shall I compare thee to a summer’s day?” – William Shakespeare, Sonnet XVIII

| Constraint | Perturbation, $x_{adv}$ | Explanation |
|---|---|---|
| Semantics | Shall I compare thee to a winter’s day? | $x_{adv}$ has a different meaning than $x$. |
| Grammaticality | Shall I compares thee to a summer’s day? | $x_{adv}$ is less grammatically correct than $x$. |
| Edit Distance | Sha1l i conpp$haaare thee to a 5umm3r’s day? | $x$ and $x_{adv}$ have a large edit distance. |
| Non-suspicion | Am I gonna compare thee to a summer’s day? | A human reader may suspect this sentence to have been modified. [1] |

[1] Shakespeare never used the word “gonna”. Its first recorded usage wasn’t until 1806, and it didn’t become popular until the 20th century.

Table 1: Adversarial Constraints and Violations. For each of the four proposed constraints, we show an example for which $x_{adv}$ violates the specified constraint.

2 Constraints on Adversarial Examples in Natural Language

We define $F: \mathcal{X} \rightarrow \mathcal{Y}$ as a predictive model, for example, a deep neural network classifier. $\mathcal{X}$ is the input space and $\mathcal{Y}$ is the output space. We focus on adversarial perturbations which perturb a correctly predicted input, $x \in \mathcal{X}$, into an input $x_{adv}$ which fools the model, as indicated by a boolean goal function $G(F, x_{adv})$. We define $C_1, \dots, C_n$ as a set of boolean functions indicating whether the perturbation satisfies a certain constraint.

The task of an adversary is then to find $x_{adv}$ such that:

$$G(F, x_{adv}) \land C_1(x, x_{adv}) \land \dots \land C_n(x, x_{adv})$$

Adversarial attacks search for examples $x_{adv}$ which both fool the model, as represented by $G(F, x_{adv})$, and restrict $x_{adv}$ such that it follows constraints $C_1, \dots, C_n$.

The definition of $G$ depends on the goal of the attack. Attacks on classification frequently aim to either induce any incorrect classification (untargeted) or induce a particular classification (targeted). Attacks on other types of models may have more sophisticated goals. For example, attacks on translation may attempt to change every word of a translation, or introduce targeted keywords into the translation (Cheng et al., 2018).

A perturbation that achieves the goal of the attack must also preserve the original correct output. That is, the correct output for $x$ must equal the correct output for $x_{adv}$. [Footnote 1: “Equal” is used loosely here. In the case of classification, the true labels must be equal. But for tasks like translation and summarization, only the semantics of the output must be preserved.]
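The adversary's task can be sketched as a single predicate: a perturbed input counts as a successful adversarial example only if it both meets the attack goal and passes every constraint. The following is a minimal illustration under our own assumptions, not TextAttack's API; `untargeted_goal`, `is_successful_adversarial_example`, `toy_model`, and `max_words_changed` are hypothetical names introduced here.

```python
from typing import Callable, Sequence

Model = Callable[[str], int]             # maps text to a predicted label
Goal = Callable[[Model, str, str], bool]
Constraint = Callable[[str, str], bool]  # takes (x, x_adv)


def untargeted_goal(model: Model, x: str, x_adv: str) -> bool:
    """Untargeted classification goal: the predicted label changes."""
    return model(x_adv) != model(x)


def is_successful_adversarial_example(
    model: Model,
    x: str,
    x_adv: str,
    goal: Goal,
    constraints: Sequence[Constraint],
) -> bool:
    """True only if the goal is met AND every constraint holds."""
    return goal(model, x, x_adv) and all(c(x, x_adv) for c in constraints)


def toy_model(text: str) -> int:
    """Hypothetical stand-in classifier: label 1 if 'summer' appears."""
    return 1 if "summer" in text else 0


def max_words_changed(limit: int) -> Constraint:
    """Hypothetical constraint: at most `limit` word positions differ."""
    def constraint(x: str, x_adv: str) -> bool:
        a, b = x.split(), x_adv.split()
        if len(a) != len(b):
            return False
        return sum(p != q for p, q in zip(a, b)) <= limit
    return constraint
```

Keeping the goal and the constraints as separate arguments mirrors the separation argued for in this paper: the same attack output can be re-judged under a stricter or looser constraint set without touching the goal.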

In addition to defining the goal of the attack, the attacker must decide on the constraints perturbations must meet. Different use cases require different constraints. We build on the categorization of attack spaces introduced by Gilmer et al. (2018) to introduce a set of constraints for adversarial examples in natural language.

In the following subsections, we define four constraints on adversarial perturbations in natural language: semantics, grammaticality, edit distance, and non-suspicion. We provide examples of adversarial perturbations that violate each constraint in Table 1.

2.1 Semantics

This constraint requires that semantics be preserved between $x$ and $x_{adv}$. A concrete threat model for this class of adversarial examples is tricking plagiarism detection software. An attacker must preserve the semantics of the original document while avoiding detection.

Many attacks include the semantics constraint as a way to ensure the ground truth output is preserved (Zhang et al., 2019). As long as the semantics of an input do not change, the ground truth output will stay the same. There are exceptions: one could imagine tasks for which preserving semantics does not necessarily preserve the ground truth output. For example, consider the task of classifying passages as written in either modern or Old English. Perturbing “why” to “wherefore” may retain the semantic meaning of the passage, but change its ground truth label.

2.2 Grammaticality

Under this constraint, the attacker is constrained to perturbations which don’t introduce grammatical errors. [Footnote 2: The grammaticality constraint refers to descriptive rather than prescriptive grammar.] Grammatical errors don’t necessarily change semantics, as illustrated in Table 1. In the plagiarism threat model outlined above, the grammaticality constraint applies.

2.3 Edit Distance

This constraint specifies a maximum edit distance between $x$ and $x_{adv}$, at either the character or word level. The edit distance constraint is useful when the attacker is willing to introduce misspellings. Additionally, the edit distance constraint is sometimes used when improving the robustness of models. For example, Huang et al. (2019) improve robustness when the attacker is given a perturbation budget representing the maximum allowed character-level changes.
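As an illustration, an edit distance constraint can be checked with the standard Levenshtein dynamic program, applied either to characters or to whitespace-separated words. This is a sketch under our own assumptions: the function names and the whitespace tokenization are illustrative choices, not something the attacks discussed here prescribe.

```python
from typing import Sequence


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, item_a in enumerate(a, start=1):
        curr = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            curr.append(min(
                prev[j] + 1,         # delete item_a
                curr[j - 1] + 1,     # insert item_b
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


def within_edit_budget(x: str, x_adv: str, max_edits: int, level: str = "char") -> bool:
    """Check the constraint at the character level or the word level."""
    if level == "word":
        distance = edit_distance(x.split(), x_adv.split())
    elif level == "char":
        distance = edit_distance(x, x_adv)
    else:
        raise ValueError("level must be 'char' or 'word'")
    return distance <= max_edits
```

A single word substitution costs one edit at the word level but may cost several at the character level, so the two budgets are not interchangeable.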

2.4 Non-suspicion

The non-suspicion constraint specifies that $x_{adv}$ must appear to be unmodified. Consider the example in Table 1. While the perturbation preserves semantics and grammar, it switches between modern and Old English and thus may seem suspicious to readers. Note that the definition of the non-suspicion constraint is context-dependent. A sentence that is non-suspicious in the context of a kindergartner’s homework assignment might be suspicious in the context of an academic paper. A threat model where the non-suspicion constraint does not apply is illegal PDF distribution, similar to a case discussed by Gilmer et al. (2018). Consumers of an illegal PDF may tacitly collude with the person uploading it. They know the document has been altered, but do not care as long as semantics are preserved.

Selected Attacks Generating Adversarial Examples in Natural Language Semantics Grammaticality Edit Distance Non-Suspicion
Synonym Substitution. (Alzantot et al., 2018; Kuleshov et al., 2018; Jin et al., 2019; Ren et al., 2019)
Character Substitution. (Ebrahimi et al., 2017; Gao et al., 2018; Li et al., 2018)
Word Insertion or Removal. (Liang et al., 2017; Samanta and Mehta, 2017)
General Paraphrase. (Zhao et al., 2017; Ribeiro et al., 2018; Iyyer et al., 2018)
Table 2: Summary of Constraints and Attacks. This table shows a selection of prior work (rows) categorized by constraints (columns). A “✓” indicates that the respective attack is supposed to meet the constraint, and a “✗” means the attack is not supposed to meet the constraint.

3 Categorization of Current Attacks

After choosing a set of constraints, the attacker must devise a method to fool the model. Here, we categorize a sample of the most significant attacks, summarized in Table 2.

Attacks by Paraphrase:

Some studies have generated adversarial examples through paraphrase. Iyyer et al. (2018) used neural machine translation systems to generate paraphrases. Ribeiro et al. (2018) proposed semantically-equivalent adversarial rules. By definition, paraphrases preserve semantics. Since the systems aim to generate perfect paraphrases, they implicitly follow constraints of grammaticality and non-suspicion.

Attacks by Synonym Substitution:

Some works focus on an easier way to generate a subset of paraphrases: replacing words from the input with synonyms (Alzantot et al., 2018; Jin et al., 2019; Kuleshov et al., 2018; Papernot et al., 2016; Ren et al., 2019). Each attack applies a search algorithm to determine which words to replace with which synonyms. Like the general paraphrase case, they aim to create examples that preserve semantics, grammaticality, and non-suspicion. While not all have an explicit edit distance constraint, some limit the number of words perturbed.
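To make the idea of "a search algorithm over synonym substitutions" concrete, the sketch below implements a simple greedy search. It is a generic illustration, not the algorithm of any cited paper: the function name, the `true_class_score` and `is_fooled` callables, and the one-swap-per-position rule are all assumptions made for this example.

```python
from typing import Callable, Dict, List, Optional, Sequence


def greedy_synonym_attack(
    words: Sequence[str],
    synonyms: Dict[str, List[str]],
    true_class_score: Callable[[List[str]], float],
    is_fooled: Callable[[List[str]], bool],
    max_swaps: int,
) -> Optional[List[str]]:
    """Repeatedly apply the single swap that most lowers the true-class score.

    Returns the perturbed word list if the model is fooled within
    `max_swaps` substitutions, otherwise None.
    """
    current = list(words)
    swapped = set()  # each position is substituted at most once
    for _ in range(max_swaps):
        if is_fooled(current):
            return current
        best_score = true_class_score(current)
        best = None
        for i, word in enumerate(current):
            if i in swapped:
                continue
            for candidate in synonyms.get(word, []):
                trial = current[:i] + [candidate] + current[i + 1:]
                score = true_class_score(trial)
                if score < best_score:
                    best_score, best = score, (i, candidate)
        if best is None:  # no swap lowers the score any further
            return None
        position, replacement = best
        current[position] = replacement
        swapped.add(position)
    return current if is_fooled(current) else None
```

In a constrained attack, candidate substitutions would additionally be filtered by the chosen constraints before being scored; the `max_swaps` argument plays the role of a word-level perturbation limit.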

Attacks by Character Substitution:

Some studies have proposed to attack natural language classification models by deliberately misspelling words (Ebrahimi et al., 2017; Gao et al., 2018; Li et al., 2018). These attacks use character replacements to change a word into one that the model doesn’t recognize. The replacements are designed to create character sequences that a human reader would easily correct into the original words. If there aren’t many misspellings, non-suspicion may be preserved. Semantics are preserved as long as human readers can correct the misspellings.

Attacks by Word Insertion or Removal:

Liang et al. (2017) and Samanta and Mehta (2017) devised a way to determine the most important words in the input and then used heuristics to generate perturbed inputs by adding or removing important words. In some cases, these strategies are combined with synonym substitution. These attacks aim to follow all constraints.

4 Constraint Evaluation Methods and Case Study

For each constraint introduced in Section 2, we discuss best practices for both human and automatic evaluation. We leave out edit distance due to ease of automatic evaluation.

Additionally, we perform a case study, evaluating synonym substitution attack techniques by Alzantot et al. (2018) and Jin et al. (2019) on classification tasks. We chose these works because:

  • They claim to create perturbations that preserve semantics, maintain grammaticality, and adhere to the non-suspicion constraint. However, our inspection of the adversarial perturbations revealed that many introduced grammatical errors and did not preserve semantics.

  • They report high attack success rates. [Footnote 3: We use “attack success rate” to mean the percentage of the time that an attack can find a successful adversarial example by perturbing a given input. “After-attack accuracy” or “accuracy after attack” is the accuracy the model achieves after all successful perturbations have been applied.]

  • These methods attack two of the most effective models for text classification: LSTM and BERT.
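The two metrics named in the list above, attack success rate and after-attack accuracy, can be computed from per-example attack records as sketched below. We assume, following Section 2, that an attack is only attempted on inputs the model originally classifies correctly; the `AttackRecord` type and the function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class AttackRecord:
    originally_correct: bool  # model was right on the unperturbed input
    attack_succeeded: bool    # a successful adversarial example was found


def attack_success_rate(records: Sequence[AttackRecord]) -> float:
    """Percentage of attempted (originally correct) inputs that were fooled."""
    attempted = [r for r in records if r.originally_correct]
    if not attempted:
        return 0.0
    return 100.0 * sum(r.attack_succeeded for r in attempted) / len(attempted)


def after_attack_accuracy(records: Sequence[AttackRecord]) -> float:
    """Percentage of all inputs still classified correctly after the attack."""
    if not records:
        return 0.0
    still_correct = sum(
        r.originally_correct and not r.attack_succeeded for r in records
    )
    return 100.0 * still_correct / len(records)
```

The two numbers have different denominators: success rate is measured over attacked inputs only, while after-attack accuracy is measured over the whole evaluation set.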

In their work, Alzantot et al. (2018) used a genetic algorithm to attack an LSTM trained on the IMDB document-level sentiment classification dataset. Jin et al. (2019) used a greedy approach to attack an LSTM, CNN, and BERT trained on five classification datasets. To generate examples for evaluation, we attacked BERT using Jin et al. (2019)’s method and attacked an LSTM using Alzantot et al. (2018)’s method. We evaluate both methods on the IMDB dataset. In addition, we evaluate Jin et al. (2019)’s method on the Yelp polarity document-level sentiment classification dataset and the Movie Review (MR) sentence-level sentiment classification dataset (Pang and Lee, 2005; Zhang et al., 2015). We use examples from each dataset. Table 3 shows example violations of each constraint.

| Constraint Violated | Input, $x$ | Perturbation, $x_{adv}$ |
|---|---|---|
| Semantics | Jagger, Stoppard and director Michael Apted deliver a riveting and surprisingly romantic ride. | Jagger, Stoppard and director Michael Apted deliver a baffling and surprisingly sappy motorbike. |
| Grammaticality | A grating, emaciated flick. | A grates, lanky flick. |
| Non-suspicion | Great character interaction. | Gargantuan character interaction. |

Table 3: Real World Constraint Violation Examples. Perturbations by Jin et al. (2019)’s attack against the BERT classification model on sentences from the MR dataset. In each example, our case study finds a violation of the respective constraint. Each $x$ is classified as positive, and each $x_{adv}$ is classified as negative.

4.1 Evaluation of Semantics

4.1.1 Human Evaluation

A few past studies of attacks have included human evaluation of semantic preservation (Ribeiro et al., 2018; Iyyer et al., 2018; Alzantot et al., 2018; Jin et al., 2019). However, studies often simply ask users to rate the similarity of $x$ and $x_{adv}$. We believe this phrasing does not generate an accurate measure of semantic preservation, as users may consider two sentences with different semantics “similar” if they only differ by a few words. Instead, users should be explicitly asked whether changes between $x$ and $x_{adv}$ preserve the meaning of the original passage.

We propose to ask human judges to rate if meaning is preserved on a Likert scale of 1-5, where 1 is “Strongly Disagree” and 5 is “Strongly Agree” (Likert, 1932). A perturbation is semantics-preserving if the average score is at least $\epsilon_{sem}$. We propose $\epsilon_{sem} = 4$ as a general rule: on average, humans should either “Agree” or “Strongly Agree” that $x$ and $x_{adv}$ have the same meaning.
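The decision rule above amounts to averaging the judges' ratings and comparing against a threshold. A minimal sketch follows, assuming a threshold of 4 (the score corresponding to “Agree”); the function name and the input validation are our own additions.

```python
from statistics import mean
from typing import Sequence


def preserves_semantics(ratings: Sequence[int], threshold: float = 4.0) -> bool:
    """True if the mean 1-5 Likert rating meets the threshold.

    The default of 4.0 is an assumption corresponding to "Agree".
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("Likert ratings must be between 1 and 5")
    return mean(ratings) >= threshold
```

The same function can be applied per perturbation or, as in the case study, to the ratings pooled over all examples produced by one attack.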

4.1.2 Automatic Evaluation

Automatic evaluation of semantic similarity is a well-studied NLP task. The STS Benchmark is used as a common measurement (Cer et al., 2017).

Michel et al. (2019) explored the use of the common machine translation evaluation metrics BLEU, METEOR, and chrF as a proxy for semantic similarity in the attack setting (Papineni et al., 2002; Denkowski and Lavie, 2014; Popović, 2015). While these n-gram based approaches are computationally cheap and often work well in the machine translation setting, they do not correlate with human judgment as well as sentence encoders.

A sentence encoder encodes two sentences into a pair of fixed-length vectors; the cosine similarity between the vectors is then used as the similarity score.

Jin et al. (2019) use the Universal Sentence Encoder (USE), which achieved a Pearson correlation score of 0.782 on the STS benchmark, to evaluate semantic similarity (Cer et al., 2018). Another option for evaluation is BERT, which achieved a score of 0.876 (Devlin et al., 2018).

Additionally, synonym substitution methods, including Jin et al. (2019) and Alzantot et al. (2018), often require that words be substituted only with neighbors in the counter-fitted embedding space, which is designed to push synonyms together and antonyms apart (Mrksic et al., 2016).
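Both automatic checks in this subsection, on sentence encodings and on word embeddings, reduce to a cosine similarity between two vectors compared against a minimum value. The sketch below shows that computation in plain Python; the vectors would come from an encoder or embedding table, which is outside the scope of this example, and the function names are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def passes_similarity_threshold(
    u: Sequence[float], v: Sequence[float], min_cos_sim: float
) -> bool:
    """Constraint check: similarity must be at least `min_cos_sim`."""
    return cosine_similarity(u, v) >= min_cos_sim
```

How strict the check is depends entirely on `min_cos_sim`, which is why the choice of threshold matters as much as the choice of encoder.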

4.1.3 Case Study

We asked users whether they agreed that the changes between the two passages preserved meaning on a scale of 1 (Strongly Disagree) to 5 (Strongly Agree). We averaged scores for each attack method to determine if the method generally preserves semantics.

Examples generated by Jin et al. (2019) were rated an average of 3.28, while examples generated by Alzantot et al. (2018) were rated on average 2.70. [Footnote 5: We hypothesize that Jin et al. (2019) achieved higher scores due to its use of USE.] The average rating given for both methods was significantly less than our proposed threshold of 4. Using a clear survey question illustrates that many perturbations are not semantics-preserving.

4.2 Evaluation of Grammaticality

4.2.1 Human Evaluation

Both Jin et al. (2019) and Iyyer et al. (2018) reported a human evaluation of grammaticality, but neither study clearly asked if any errors were introduced by a perturbation. For human evaluation of the grammaticality constraint, we propose presenting $x$ and $x_{adv}$ together and asking judges if grammatical errors were introduced by the changes made. However, due to the rule-based nature of grammar, automatic evaluation is preferred.

4.2.2 Automatic Evaluation

The simplest way to automatically evaluate grammatical correctness is with a rule-based grammar checker. Free grammar checkers are available online in many languages. One popular checker is LanguageTool, an open-source proofreading tool (Naber, 2003). LanguageTool ships with thousands of human-curated rules for the English language and provides a downloadable server interface for analyzing sentences. While other rule-based and some model-based grammar checkers exist, comparison between them is outside the scope of this work.

4.2.3 Case Study

We ran each of the generated $(x, x_{adv})$ pairs through LanguageTool to count grammatical errors. LanguageTool detected more grammatical errors in $x_{adv}$ than $x$ for 51% of perturbations generated by Jin et al. (2019), and 29% of perturbations generated by Alzantot et al. (2018).
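The counting procedure can be sketched independently of any particular grammar checker: given a function that returns the number of errors in a passage, we measure how often the perturbed text has more errors than the original. In the sketch below, `count_errors` is a placeholder for a real checker such as LanguageTool, and `toy_a_vs_an_checker` is a deliberately crude, hypothetical stand-in that checks a single rule.

```python
from typing import Callable, Iterable, Tuple


def fraction_introducing_errors(
    pairs: Iterable[Tuple[str, str]],
    count_errors: Callable[[str], int],
) -> float:
    """Fraction of (original, perturbed) pairs where the perturbed text
    has strictly more detected grammatical errors than the original."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    worse = sum(
        count_errors(perturbed) > count_errors(original)
        for original, perturbed in pairs
    )
    return worse / len(pairs)


def toy_a_vs_an_checker(text: str) -> int:
    """Toy rule only: counts 'a' directly followed by a vowel-initial word."""
    words = text.lower().split()
    return sum(
        1 for word, nxt in zip(words, words[1:])
        if word == "a" and nxt[0] in "aeiou"
    )
```

Comparing error counts between the original and the perturbation, rather than requiring zero errors, avoids penalizing an attack for errors already present in the source text.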

Additionally, perturbations often contain errors that humans rarely make. LanguageTool detected 6 categories for which errors in the perturbed samples appear at least 10 times more frequently than in the original content. Details regarding select error categories and examples of violations are shown in Table 4.

| Grammar Rule ID | Errors in $x$ | Errors in $x_{adv}$ | Explanation | Context |
|---|---|---|---|---|
| | | | You should probably use: ’are’. —— Replace is with one of [are] | this films is too busiest beat all of its allotted ma… |
| DID_BASEFORM | 21 | 326 | The verb ’can’t’ requires base form of this verb: ’compare’ —— Replace compares with one of [compare] | …first two cinema in the series, i can’t compares friday after next to them, but nothing … |
| NON3PRS_VERB | 13 | 199 | The pronoun ’i’ must be used with a non-third-person form of a verb: ’surprise’ —— Replace surprises with one of [surprise] | …ved reached out hating the second one i surprises why they saw iike the same film to me |
| A_PLURAL | 24 | 330 | Don’t use indefinite articles with plural words. Did you mean ’a grate’, ’a gratis’ or simply ’grates’? —— Replace a grates with one of [a grate,a gratis,grates] | a grates, lanky flick |
| TO_NON_BASE | 4 | 48 | The verb after “to” should be in the base form: ’excuse’. —— Replace excuses with one of [excuse] | doesn’t inbound close to excuses the hype that surrounded its debut at t… |
| EN_A_VS_AN | 110 | 555 | Use ’an’ instead of ’a’ if the following word starts with a vowel sound, e.g. ’an article’, ’an hour’ —— Replace a with one of [an] | like a eastwards of the constraint melrose pla… |

Table 4: Many Adversarial Examples Contain Grammatical Errors. This table shows grammatical errors detected by LanguageTool that appeared far more often in the perturbed samples. The two count columns give the numbers of errors detected in $x$ and $x_{adv}$ across 3,115 examples generated by Jin et al. (2019) and Alzantot et al. (2018).

4.3 Evaluation of Non-suspicion

4.3.1 Human Evaluation

We propose to evaluate the non-suspicion constraint with a method in which judges view a shuffled mix of real and adversarial inputs and must guess whether each is real or computer-altered. This is similar to the human evaluation done by Ren et al. (2019), but binary rather than on a 1-5 scale. [Footnote 6: We believe that either method is valid.] A perturbed example $x_{adv}$ meets the non-suspicion constraint if the portion of judges who identify $x_{adv}$ as computer-altered is at most $\epsilon_{ns}$, where $0 \leq \epsilon_{ns} \leq 1$.

4.3.2 Automatic Evaluation

Automatic evaluation may be used to guess whether or not an adversarial example is suspicious. Models can be trained to classify passages as real or perturbed, just as human judges do. For example, Warstadt et al. (2018) trained sentence encoders on a real/fake task as a proxy for evaluation of linguistic acceptability. Recently, Zellers et al. (2019) demonstrated that GROVER, a transformer-based text generation model, could classify its own generated news articles as human or machine-written with high accuracy.

4.3.3 Case Study

We presented a shuffled mix of real and perturbed examples to human judges and asked if they were real or computer-altered. As this is a time-consuming task for long documents, we only evaluated adversarial examples generated by Jin et al. (2019)’s method on the sentence-level MR dataset.

If all generated examples were non-suspicious, judges would average 50% accuracy. In this case, judges achieved 69.2% accuracy.
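The accuracy figure in this case study is simply the share of real-versus-altered guesses that were correct. A small sketch of that computation follows; the function name is hypothetical, and the 50% chance baseline holds under the assumption that the mix shown to judges is balanced between real and perturbed examples.

```python
from typing import Sequence


def judge_accuracy(guessed_altered: Sequence[bool], truly_altered: Sequence[bool]) -> float:
    """Percentage of guesses that match whether each input was perturbed."""
    if len(guessed_altered) != len(truly_altered):
        raise ValueError("one guess is required per example")
    if not truly_altered:
        raise ValueError("at least one example is required")
    correct = sum(g == t for g, t in zip(guessed_altered, truly_altered))
    return 100.0 * correct / len(truly_altered)
```

The further this accuracy rises above the chance level, the more readily judges can tell perturbed inputs apart from real ones.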

5 Producing Higher Quality Adversarial Examples

In the previous section, we evaluated how well generated examples met constraints. Now, we adjust the constraints applied during the course of the attack to produce higher quality adversarial examples.

The case study in Section 4 revealed that although attacks in NLP aspire to meet linguistic constraints, in practice, they frequently violate them. Inconsistent application of constraints leads to two problems:

  • For a single attack, constraints that are claimed to be met may not be. Lenient constraint enforcement correlates directly with attack success.

  • Across multiple attacks, comparing effectiveness is difficult. Comparing the success rates of two attacks is only meaningful if the attacks follow the same constraints, evaluated in the same manner.

To alleviate these issues, we wrote TextAttack, an open-source NLP attack library designed to decouple attack methods from constraint application. TextAttack makes it easy for researchers to enforce constraints properly and to compare attacks while holding constraint enforcement techniques constant. To demonstrate TextAttack, we continue to study the attacks introduced by Jin et al. (2019) and Alzantot et al. (2018). TextAttack can be used to reproduce the original attack results.

We set out to find if a different set of thresholds on evaluation metrics could produce adversarial examples that are semantics-preserving, grammatical and non-suspicious. We modified Jin et al. (2019)’s attack with different constraints. To enforce grammaticality, we added LanguageTool. To enforce semantic preservation, we tuned two thresholds which define the requirement for being able to make a substitution: (a) minimum cosine similarity between counter-fitted word embeddings and (b) minimum cosine similarity between sentence embeddings. Through human studies, we found threshold values of 0.9 for (a) and 0.98 for (b). [Footnote 7: Details in Section A.1.1.]
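The adjusted constraint set can be sketched as a filter over candidate perturbations, using the two thresholds reported above (0.9 and 0.98) and requiring that a candidate contain no more grammatical errors than the original. This is an illustration only, not TextAttack's implementation: the `Candidate` and `ConstraintThresholds` types are hypothetical, and the similarity scores and error counts are assumed to have been computed beforehand.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class ConstraintThresholds:
    min_word_cos_sim: float = 0.9       # (a) counter-fitted word embeddings
    min_sentence_cos_sim: float = 0.98  # (b) sentence embeddings


@dataclass(frozen=True)
class Candidate:
    x_adv: str
    word_cos_sim: float      # similarity of the swapped word to the original
    sentence_cos_sim: float  # similarity of the perturbed sentence to the original
    grammar_errors: int      # errors a grammar checker finds in x_adv


def filter_candidates(
    candidates: Sequence[Candidate],
    original_grammar_errors: int,
    thresholds: ConstraintThresholds = ConstraintThresholds(),
) -> List[Candidate]:
    """Keep only candidates that satisfy all three constraints."""
    return [
        c for c in candidates
        if c.word_cos_sim >= thresholds.min_word_cos_sim
        and c.sentence_cos_sim >= thresholds.min_sentence_cos_sim
        and c.grammar_errors <= original_grammar_errors
    ]
```

Because the thresholds live in one object separate from the search procedure, they can be tightened or relaxed while the attack itself is held constant.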

5.1 Results with Adjusted Constraint Application

We ran Jin et al. (2019)’s attack with the adjusted thresholds and with LanguageTool used to enforce grammaticality. Results are shown in Table 5.

Semantics. With the original attack, human judges on average were “Not sure” that semantics were preserved. After adjusting constraint evaluation, human judges on average “Agree”.

Grammaticality. Automatic evaluation during the attack ensured that $x_{adv}$ did not have more grammatical errors than $x$. Thus, our generated examples were observed to meet the grammaticality constraint. [Footnote 8: Since the MR dataset is already lowercased and tokenized, it is difficult for a rule-based grammar checker like LanguageTool to parse some inputs. A more powerful language checker would filter out an even greater number of grammatical errors.]

Non-suspicion. We repeated the study from Section 4.3 with our new examples. Participants were able to guess with accuracy whether inputs were computer-altered. The accuracy is over lower than the accuracy on the examples generated by the original attack.

Attack success. For each of the three datasets, the attack success rate decreased by over 70 percentage points.

Semantic Preservation (before)
Semantic Preservation (after)
Grammatical Error % (before)
Grammatical Error % (after)
Non-suspicion % (before)
Non-suspicion % (after)
Attack Success % (before)
Attack Success % (after)
Difference (before - after)
Table 5: Results from running the attack from Jin et al. (2019) with and without constraint thresholds chosen with human judgement. Attacks are on BERT classification models fine-tuned for the respective datasets.
Jin et al. (2019) Alzantot et al. (2018)
Semantic Preservation
Grammatical Error %
Non-suspicion Score
Attack Success % 10.9
Perturbed Word % 9.5
Num Queries 28
Table 6: Comparison of Alzantot et al. (2018) and Jin et al. (2019), with the same constraints, attacking BERT fine-tuned on the MR dataset. Results across examples.

5.2 Comparing the Two Attacks

We compared the relative success rates of Jin et al. (2019) and Alzantot et al. (2018) with constraint evaluation held constant. We applied the constraint evaluation methods from above and tested each attack against BERT fine-tuned on the MR dataset. Contrary to previous findings, Table 6 shows the two attacks had very similar success rates. The attacks achieved similar scores on human evaluation of semantics and non-suspicion. The genetic algorithm (Alzantot et al., 2018) was slightly more successful than the greedy search (Jin et al., 2019), but was far more computationally expensive, making over x as many model queries on average.

6 Ablation Study

We generated better-quality adversarial examples by constraining the search to exclude examples that fail to meet thresholds measured in three ways: word embedding distance, sentence encoder similarity, and grammaticality. Since we applied these constraint evaluation methods all at the same time, we performed an ablation study to understand which constraints had the largest impact on attack success rate.

Constraint Removed MR Yelp IMDB
Sentence Encoding
Word Embedding distance
Grammar Checking
Table 7: Ablation study: effect of removal of a single constraint on total attack success rate.

We reran three attacks (one for each constraint removed) on each of our BERT classification datasets. Table 7 shows attack success rate after individually removing each constraint. The word embedding distance constraint was the greatest inhibitor of attack success rate; without enforcing this constraint, attacks were over twice as successful.
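The ablation procedure, removing one constraint at a time and re-running the evaluation with the rest, can be sketched as a leave-one-out loop. The helper below is a generic illustration with hypothetical names; `evaluate` stands in for a full attack run and may return any metric, such as attack success rate.

```python
from typing import Callable, Dict, TypeVar

C = TypeVar("C")  # constraint type
R = TypeVar("R")  # result type, e.g. an attack success rate


def leave_one_out_ablation(
    constraints: Dict[str, C],
    evaluate: Callable[[Dict[str, C]], R],
) -> Dict[str, R]:
    """Map each constraint name to the result obtained WITHOUT that constraint."""
    results = {}
    for removed in constraints:
        remaining = {
            name: constraint
            for name, constraint in constraints.items()
            if name != removed
        }
        results[removed] = evaluate(remaining)
    return results
```

Comparing each entry against the result with all constraints enabled shows which single constraint restricts the attack the most.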

7 Discussion

Decoupling attacks and constraints. It is critical that researchers separate new attack methods from new constraint evaluation methods. Demonstrating the performance of a new attack while simultaneously introducing new constraints makes it unclear whether empirical gains demonstrate a more effective attack or a more relaxed set of constraints. This mirrors a broader trend in machine learning where researchers report differences that come from changing multiple independent variables, making the sources of empirical gains unclear (Lipton and Steinhardt, 2018). This is especially relevant in adversarial NLP, where each experiment depends on many parameters. [Footnote 9: While working to reproduce past work in TextAttack, we noticed how differences that may seem negligible often have an outsized impact on attack success rate. These include the list of stopwords used, the maximum length of model inputs, and tokenization strategies.]

Ablation studies for NLP adversarial attacks. Adversarial attacks in NLP need proper ablation studies. TextAttack allowed us to compare attack strategies in a standardized environment. Moving forward, TextAttack will be used for ablation studies that provide the community with an idea of the relative performance of different attack strategies and constraint evaluation methods. Additionally, TextAttack may be used to help researchers gauge model robustness against a variety of attacks.

Tradeoff between attack success and example quality. We made semantic constraints more selective, which helped attacks generate examples that scored above 4 on the Likert scale for preservation of semantics. This indicates that, when only allowing adversarial examples that preserve semantics and grammaticality, NLP models are relatively robust to current synonym substitution attacks. However, our set of constraints isn’t necessarily optimal for every attack scenario. For example, researchers using an attack for producing additional training data may wish to allow perturbations that are more flexible when it comes to grammaticality and semantics.

8 Related Work

The goal of creating adversarial examples that preserve semantics and grammaticality is common in the NLP attack literature (Zhang et al., 2019). However, previous works use different definitions of adversarial examples, making it hard to compare methods. We provide a unified definition of an adversarial example based on the constraints it must fulfill.

There are some existing open-source libraries related to adversarial examples in NLP. Trickster proposes a method for attacking NLP models based on graph search but lacks the ability to ensure that generated examples satisfy constraints (Kulynych et al., 2018). TEAPOT is a library for evaluating adversarial perturbations in text, but only supports n-gram based comparisons for evaluating attacks on machine translation models (Michel et al., 2019). AllenNLP Interpret includes functionality for running adversarial attacks on NLP models, but only supports attacks via input-reduction or gradient-based word swap (Wallace et al., 2019b). TextAttack has a broader scope than each of these libraries: it is designed to be extendable to any attack on any NLP model with any set of constraint evaluation methods.

9 Conclusion

We have shown that state-of-the-art synonym substitution attacks frequently do not preserve semantics or grammaticality, and often appear suspicious to humans. When we adjusted constraint evaluation to align with human judgement, we produced higher quality perturbations at a much lower success rate. We encourage researchers to use TextAttack to enforce rigorous constraints and decouple attacks from constraint evaluation methods.


Appendix A Appendix

a.1 Details about Human Studies.

Our experiments relied on labor crowd-sourced from Amazon Mechanical Turk. We used five datasets: MIT and Yelp datasets from Alzantot et al. (2018) and MIT, Yelp, and Movie Review datasets from Jin et al. (2019). We limited our worker pool to workers in the United States, Canada, and Australia who had completed over 5,000 HITs with over a 99% success rate. We had an additional Qualification that prevented workers who had submitted too many labels in previous tasks from fulfilling too many of our HITs. In the future, we will also use a small qualifier task to select workers who are good at the task.

For the automatic portions of the case study in Section 4, we use all successfully perturbed examples. For the human portions, we randomly select successful examples for each combination of attack method and dataset, then use Amazon’s Mechanical Turk to gather answers for each example.

Rating Semantic Similarity.

In one task, we present results from two Mechanical Turk questionnaires to judge semantic similarity or dissimilarity. For each task, we show the original and perturbed sequences side by side, in a random order. We added custom JavaScript to highlight character differences between the two sequences. We provided the following description: “Compare two short pieces of English text and determine if they mean different things or the same.” We then prompted labelers: “The changes between these two passages preserve the original meaning.” We paid per label for this task.

Inter-Annotator Agreement.

For each semantic similarity prompt, we gathered annotations from 10 different judges. Recall that each selection was one of 5 different options ranging from “Strongly Agree” to “Strongly Disagree.” For each pair of original and perturbed sequences, we calculated the number of judges who chose the most frequent option. For example, if 7 chose “Strongly Agree” and 3 chose “Agree,” the number of judges who chose the most frequent option is 7. We found that for the examples studied in Section 4 the average of this metric was . For the examples in Section 5 at the threshold of which we chose, the average was .
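The agreement metric described above can be sketched in a few lines of Python. This is only an illustration of the counting rule, not the code used in the study; the function names `modal_agreement` and `mean_modal_agreement` are hypothetical.

```python
from collections import Counter


def modal_agreement(labels):
    """Number of judges who chose the most frequent option for one example."""
    return Counter(labels).most_common(1)[0][1]


def mean_modal_agreement(all_labels):
    """Average of the per-example modal agreement over a set of examples."""
    counts = [modal_agreement(labels) for labels in all_labels]
    return sum(counts) / len(counts)


# The worked example from the text: 7 judges chose "Strongly Agree", 3 chose "Agree".
example = ["Strongly Agree"] * 7 + ["Agree"] * 3
print(modal_agreement(example))  # 7
```

With 10 judges and 5 options, this count ranges from 2 (judges spread evenly) to 10 (unanimous).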

Guessing Real vs. Computer-altered.

We present results from our Mechanical Turk survey where we asked users “Is this text real or computer-altered?”. We restricted this task to a single dataset, Movie Review. We chose Movie Review because it had an average sample length of 20 words, much shorter than Yelp or IMDB. We made this restriction because of the time-consuming nature of classifying long samples as Real or Fake. We paid per label for this task.

Rating word similarity.

We performed a third study where we showed users a pair of words and asked: “In general, replacing the first word with the second preserves the meaning of a sentence.” We paid per label for this task.

Phrasing matters.

Mechanical Turk comes with a set of pre-designed questionnaire interfaces. These include one titled “Semantic Similarity” which asks users to rate a pair of sentences on a scale from “Not Similar At All” to “Highly Similar.” Examples generated by synonym attacks benefit from this question formulation because humans tend to rate two sentences that share many words as “Similar” due to their small morphological distance, even if they have different meanings.

Notes for future surveys.

In the future, we would also try to filter out bad labels by mixing some number of ground-truth “easy” data points into our dataset and rejecting the work of labelers who performed poorly on this set.

Figure 1: Average response to “In general, replacing the first word with the second preserves the meaning of a sentence” vs. cosine similarity between word1 and word2 (words are grouped by cosine similarity into bins of size ).
Figure 2: Average response to “The changes between these two passages preserve the original meaning” at each threshold. Threshold is minimum cosine similarity between BERT sentence embeddings.

a.1.1 Finding The Right Thresholds

Comparing two words. We showed study participants a pair of words and asked them whether swapping out one word for the other would change the meaning of a sentence. The results are shown in Figure 1. Using this information, we chose 0.9 as the word-level cosine similarity threshold.

Comparing two passages. With the word-level threshold set at 0.9, we generated examples at sentence encoder thresholds . We chose to encode sentences with BERT fine-tuned for semantic similarity: first on the AllNLI dataset, then on the STS benchmark training set. We repeated the study from Section 4.1.1 on 100 examples from each threshold, obtaining 10 human labels per example. The results are in Figure 2. On average, judges agreed that the examples produced at the 0.98 threshold preserved semantics.
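As a rough sketch of how the two thresholds combine, the check below accepts a perturbation only if every swapped word pair and the sentence pair clear their respective cosine similarity thresholds. It assumes the caller supplies the embedding vectors; the names `passes_constraints`, `WORD_THRESHOLD`, and `SENTENCE_THRESHOLD` are hypothetical and are not TextAttack's API.

```python
import numpy as np

WORD_THRESHOLD = 0.9       # word-level cosine similarity threshold chosen above
SENTENCE_THRESHOLD = 0.98  # sentence-level threshold at which judges agreed


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def passes_constraints(word_pairs, sent_orig, sent_pert):
    """Check both thresholds for one candidate perturbation.

    word_pairs: list of (original_word_vec, replacement_word_vec) tuples,
                one per swapped word (embeddings supplied by the caller).
    sent_orig, sent_pert: sentence encodings of the original and perturbed text.
    """
    words_ok = all(
        cosine_similarity(u, v) >= WORD_THRESHOLD for u, v in word_pairs
    )
    sentence_ok = cosine_similarity(sent_orig, sent_pert) >= SENTENCE_THRESHOLD
    return words_ok and sentence_ok
```

Raising either threshold shrinks the set of accepted perturbations, which is the source of the drop in attack success discussed in the main text.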

a.2 Further Analysis of Non-Suspicious Constraint Case Study

Table 8 presents the confusion matrix of results from the survey. Interestingly, workers guessed that the examples were real 62.2% of the time, but when they guessed that examples were computer-altered they were right 75.4% of the time. Thus, while some perturbed examples are non-suspicious, there are some which workers identify with high precision.

True label | Guessed Real | Guessed Computer-altered
Original | 814 | 186
Perturbed | 430 | 570
Table 8: Confusion matrix for humans guessing if perturbed examples are computer-altered
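The two summary rates for this survey follow directly from the counts in Table 8. The short computation below is an illustrative sketch (variable names are ours): it derives the share of all examples guessed to be real and the precision of the “computer-altered” guesses.

```python
# Counts from Table 8: outer keys are true labels, inner keys are guessed labels.
confusion = {
    "original": {"real": 814, "computer-altered": 186},
    "perturbed": {"real": 430, "computer-altered": 570},
}

total = sum(sum(row.values()) for row in confusion.values())
guessed_real = sum(row["real"] for row in confusion.values())
guessed_altered = sum(row["computer-altered"] for row in confusion.values())

# Share of all examples that workers labeled as real.
frac_guessed_real = guessed_real / total
# Of the examples labeled computer-altered, the share that truly were perturbed.
precision_altered = confusion["perturbed"]["computer-altered"] / guessed_altered

print(f"{frac_guessed_real:.1%}")   # 62.2%
print(f"{precision_altered:.1%}")   # 75.4%
```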

a.3 The Need for Standardized Metrics Supported By Human Evaluation

Figure 3: 1 - USE similarity threshold vs. accuracy under attack.
Figure 4: Number of synonyms vs. accuracy under attack.

Jin et al. (2019) used an additional distance metric, defined as the cosine similarity between embeddings encoded by the Universal Sentence Encoder (USE) (Cer et al., 2018), to determine whether a synonym swap preserves semantic similarity. Figure 3 shows the accuracy of BERT under attack by the method of Jin et al. (2019) as the threshold on the cosine similarity between the two sentences’ USE embeddings varies. (The MR dataset is excluded because no USE similarity restriction is enforced on inputs of fewer than 15 words.) As the threshold becomes more strict, the attack becomes less successful. Figure 4 plots the accuracy under attack as the number of synonyms considered for each substitution increases. An attack that is more lenient with its standard for what constitutes a synonym is more successful. Previous methods vary in how many synonyms they considered, with Alzantot et al. (2018) considering 8, Kuleshov et al. (2018) considering 15, and Jin et al. (2019) considering 50.

a.4 Expanding the Categorization of Adversarial Examples in NLP

Recent work such as Jia and Liang (2017) and Wallace et al. (2019a) explored the creation of adversarial examples through concatenation of phrases to the input. While these examples contain the semantics of the original sentence, they also add new meaning. Future work may expand our framework to evaluate examples generated by this type of semantic composition.

Another group of attacks generate adversarial examples from scratch. Our constraints refer to the case where the attacker starts from a benign input and applies a perturbation to fool the model. This is useful from the perspective of the defender, since the defender can produce adversarial examples from inputs in the training set. However, in the real world, attackers often do not create adversarial examples from a starting point. An adversary who generates a fake news article will not try to perturb an article they find in the newspaper. The adversary will try to generate the bogus article from scratch. Without an original input for comparison, it is not immediately clear how to evaluate semantic preservation in this case. We leave it to future work to define and evaluate constraints for inputs generated from scratch.

(Jin et al., 2019) (Alzantot et al., 2018)
AS % PW % AS % PW % AS % PW % AS % PW %
Table 9: Comparison of results generated with TextAttack against results reported in original literature. Evaluated on the IMDB movie review dataset. AS is Attack Success, the percentage of successful attacks on 1000 examples. PW is the average percentage of words perturbed across attacks.

a.5 Reproduction

The results in Table 9 compare the attack success rate and average percentage of perturbed words obtained with TextAttack to the results previously reported. Our reproduction is more successful at attacking BERT than Jin et al. (2019) while perturbing more words. Our reproduction perturbs far fewer words than Alzantot et al. (2018), but achieves a similar success rate. (We believe Alzantot et al. (2018) may have misreported their words perturbed percentage, as they set the maximum words perturbed to 20% while reporting an average words perturbed percentage of 14.7% for their genetic algorithm and 19%(!) for their baseline.)

One major implementation difference between TextAttack and the attacks studied was that TextAttack does not tokenize inputs before running an attack. This has some overhead, since every perturbation has to be retokenized, but it has two major advantages. First, it prevents errors that arise when words are broken into multiple tokens and individual tokens are swapped for other words. Second, it prevents loss of information during retokenization.

a.6 Word Embeddings

It is common to perform synonym substitution by replacing a word by a neighbor in the counter-fitted embedding space. The distance between word embeddings is frequently measured using Euclidean distance, but it is also common to compare word embeddings based on their cosine similarity (the cosine of the angle between them). (Some work also measures distance based on the mean-squared error between embeddings, which is proportional to the square of the Euclidean distance.)

For this reason, past work has sometimes constrained nearest neighbors based on the Euclidean distance between two word vectors, and other times based on their cosine similarity. Alzantot et al. (2018) considered both distance metrics, and reported that they “did not see a noticeable improvement using cosine.”

We would like to point out that, when using normalized word vectors (as is typical for counter-fitted embeddings), filtering nearest neighbors based on their minimum cosine similarity is equivalent to filtering by maximum Euclidean distance (or MSE, for that matter).


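Let $u$, $v$ be normalized word embedding vectors. That is, $\|u\| = \|v\| = 1$. Then $\|u - v\|^2 = \|u\|^2 - 2\, u \cdot v + \|v\|^2 = 2 - 2\cos(u, v)$, where $\cos(u, v) = u \cdot v$ is the cosine of the angle between them.

Therefore, the Euclidean distance between $u$ and $v$ is a monotonically decreasing function of the cosine between them. For any minimum cosine similarity $\epsilon$, we can use maximum Euclidean distance $\sqrt{2 - 2\epsilon}$ and achieve the same result.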
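The equivalence can be checked numerically. The sketch below uses random unit vectors as stand-ins for counter-fitted embeddings (an assumption made purely for illustration, as is the threshold value) and confirms that a minimum-cosine filter and the corresponding maximum-Euclidean-distance filter select the same neighbors, using the identity that the squared distance between unit vectors equals 2 minus twice their cosine.

```python
import numpy as np

# Random unit vectors stand in for normalized word embeddings (illustrative only).
rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 300))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
query, neighbors = vectors[0], vectors[1:]

min_cos = 0.05                        # hypothetical minimum cosine similarity
max_dist = np.sqrt(2 - 2 * min_cos)   # equivalent maximum Euclidean distance

cos = neighbors @ query                           # cosine, since all norms are 1
dist = np.linalg.norm(neighbors - query, axis=1)  # Euclidean distance

# The squared distance between unit vectors is 2 - 2 * cosine.
assert np.allclose(dist ** 2, 2 - 2 * cos)

keep_by_cos = cos >= min_cos
keep_by_dist = dist <= max_dist
assert np.array_equal(keep_by_cos, keep_by_dist)  # both filters agree
```

The agreement holds only for normalized vectors; with unnormalized embeddings the two filters can rank neighbors differently.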

a.7 Examples In The Wild

We randomly select 10 attempted attacks from the MR dataset and show the original inputs, perturbations before constraint change, and perturbations after constraint change. See Table 10.

Original | Perturbed (constraints from Jin et al. (2019)) | Perturbed (adjusted constraints)
by presenting an impossible romance in an impossible world , pumpkin dares us to say why either is impossible – which forces us to confront what’s possible and what we might do to make it so. Pos: 99.5% | by presenting an unsuitable romantic in an impossible world , pumpkin dares we to say why either is conceivable – which vigour we to confronted what’s possible and what we might do to make it so. Neg: 54.8% | N/A
…a ho-hum affair , always watchable yet hardly memorable. Neg: 83.9% | …a ho-hum affair , always watchable yet just memorable. Pos: 99.8% | N/A
schnitzler’s film has a great hook , some clever bits and well-drawn, if standard issue, characters, but is still only partly satisfying. Neg: 60.8% | schnitzler’s film has a great hook, some clever smithereens and well-drawn, if standard issue, characters, but is still only partly satisfying. Pos: 50.4% | schnitzler’s film has a great hook, some clever traits and well-drawn, if standard issue, characters, but is still only partly satisfying. Pos: 56.9%
its direction, its script, and weaver’s performance as a vaguely discontented woman of substance make for a mildly entertaining 77 minutes, if that’s what you’re in the mood for. Pos: 99.5% | its direction, its script, and weaver’s performance as a vaguely discontented woman of substance pose for a marginally comical 77 minutes, if that’s what you’re in the mood for. Neg: 65.5% | N/A
missteps take what was otherwise a fascinating, riveting story and send it down the path of the mundane. Pos: 99.1% | missteps take what was otherwise a fascinating, scintillating story and dispatched it down the path of the mundane. Neg: 51.2% | N/A
hawke draws out the best from his large cast in beautifully articulated portrayals that are subtle and so expressive they can sustain the poetic flights in burdette’s dialogue. Pos: 99.9% | hawke draws out the better from his wholesale cast in terribly jointed portrayals that are inconspicuous and so expressive they can sustain the rhymed flight in burdette’s dialogue. Neg: 60.3% | N/A
if religious films aren’t your bailiwick, stay away. otherwise, this could be a passable date film. Neg: 99.1% | if religious films aren’t your bailiwick, stay away. otherwise, this could be a presentable date film. Pos: 86.6% | N/A
[broomfield] uncovers a story powerful enough to leave the screen sizzling with intrigue. Pos: 99.1% | [broomfield] uncovers a story pompous enough to leave the screen sizzling with plots. Neg: 59.2% | N/A
like its two predecessors, 1983’s koyaanisqatsi and 1988’s powaqqatsi, the cinematic collage naqoyqatsi could be the most navel-gazing film ever. Pos: 99.4% | N/A | N/A
maud and roland’s search for an unknowable past makes for a haunting literary detective story, but labute pulls off a neater trick in possession : he makes language sexy. Pos: 99.4% | maud and roland’s search for an unknowable past makes for a haunting literary detective story, but labute pulls off a neater trick in property : he assumes language sultry. Neg: 62.1% | N/A
Table 10: Ten random attempted attacks, attacking BERT fine-tuned for sentiment classification on the MR dataset. The left column shows the original samples. The middle column shows perturbations with the constraint settings from Jin et al. (2019). The right column shows perturbations generated with constraints adjusted to match human judgement. N/A denotes that the attack failed to find a successful perturbation. For 8 out of the 10 examples, the constraint adjustments caused the attack to fail.