Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models

Large-scale pre-trained language models have achieved tremendous success across a wide range of natural language understanding (NLU) tasks, even surpassing human performance. However, recent studies reveal that the robustness of these models can be challenged by carefully crafted textual adversarial examples. While several individual datasets have been proposed to evaluate model robustness, a principled and comprehensive benchmark is still missing. In this paper, we present Adversarial GLUE (AdvGLUE), a new multi-task benchmark to quantitatively and thoroughly explore and evaluate the vulnerabilities of modern large-scale language models under various types of adversarial attacks. In particular, we systematically apply 14 textual adversarial attack methods to GLUE tasks to construct AdvGLUE, which is further validated by humans for reliable annotations. Our findings are summarized as follows. (i) Most existing adversarial attack algorithms are prone to generating invalid or ambiguous adversarial examples, with around 90% of them either changing the original semantic meanings or misleading human annotators as well. Therefore, we perform a careful filtering process to curate a high-quality benchmark. (ii) All the language models and robust training methods we tested perform poorly on AdvGLUE, with scores lagging far behind the benign accuracy. We hope our work will motivate the development of new adversarial attacks that are more stealthy and semantic-preserving, as well as new robust language models against sophisticated adversarial attacks. AdvGLUE is available at https://adversarialglue.github.io.

1 Introduction

Pre-trained language models (Devlin et al., 2019; Liu et al., 2019; Lan et al., 2019; Yang et al., 2019; He et al., 2020; Zhang et al., 2019b; Jiang et al., 2020; Clark et al., 2020) have achieved state-of-the-art performance over a wide range of Natural Language Understanding (NLU) tasks (Wang et al., 2019b, a; Jia and Liang, 2017; Thorne and Vlachos, 2019; Nie et al., 2020). However, recent studies (Jin et al., 2020; Zang et al., 2020; Wang et al., 2020; Li et al., 2020b; Garg and Ramakrishnan, 2020) reveal that even these large-scale language models are vulnerable to carefully crafted adversarial examples, which can fool the models into outputting arbitrarily wrong answers by perturbing input sentences in a human-imperceptible way. Real-world systems built upon these vulnerable models can thus be misled, raising serious security concerns (Li et al., 2019, 2020a).

To address this challenge, various methods (Jiang et al., 2020; Zhu et al., 2020; Wang et al., 2021; Liu et al., 2020) have been proposed to improve the adversarial robustness of language models. However, the adversary setups considered in these methods lack a unified standard. For example, Jiang et al. (2020) and Liu et al. (2020) mainly evaluate robustness against human-crafted adversarial datasets (Nie et al., 2020; Jia and Liang, 2017), while Wang et al. (2021) evaluate model robustness against automatic adversarial attack algorithms (Jin et al., 2020). The absence of a principled adversarial benchmark makes it difficult to compare robustness across different models and to identify the adversarial attacks to which most models are vulnerable. This motivates us to build a unified and principled robustness evaluation benchmark for natural language models, which we hope will help answer the following questions: What types of language models are more robust when evaluated on the unified adversarial benchmark? Which adversarial attack algorithms against language models are more effective, transferable, or stealthy to humans? How likely are humans to be fooled by different adversarial attacks?

We list the fundamental principles for creating a high-quality robustness evaluation benchmark as follows. First, as also pointed out by Bowman and Dahl (2021), a reliable benchmark should be accurately and unambiguously annotated by humans. This is especially crucial for robustness evaluation, as some adversarial examples generated by automatic attack algorithms can fool humans as well. According to our analysis in §3.4, only around 10% of the generated adversarial examples receive at least a 4-vote consensus among 5 annotators and align with the original label. Thus, additional rounds of human filtering are critical to validate the quality of the generated adversarial attack data. Second, a comprehensive robustness evaluation benchmark should cover enough linguistic phenomena and generate a systematic diagnostic report to understand and analyze the vulnerabilities of language models. Finally, a robustness evaluation benchmark needs to be challenging and unveil the biases shared across different models.

In this paper, we introduce Adversarial GLUE (AdvGLUE), a multi-task benchmark for robustness evaluation of language models. Compared to existing adversarial datasets, there are several contributions that render AdvGLUE a unique and valuable asset to the community.


  • Comprehensive Coverage. We consider textual adversarial attacks from different perspectives and hierarchies, including word-level transformations, sentence-level manipulations, and human-written adversarial examples, so that AdvGLUE is able to cover as many adversarial linguistic phenomena as possible.

  • Systematic Annotations. To the best of our knowledge, this is the first work that performs systematic evaluation and annotation over the generated textual adversarial examples. Concretely, AdvGLUE adopts crowd-sourcing to identify high-quality adversarial data for reliable evaluation.

  • General Compatibility. To obtain a comprehensive understanding of the robustness of language models across different NLU tasks, AdvGLUE covers the widely-used GLUE tasks and creates an adversarial version of the GLUE benchmark to evaluate the robustness of language models.

  • High Transferability and Effectiveness. AdvGLUE has high adversarial transferability and can effectively attack a wide range of state-of-the-art models. We observe a significant performance drop for models evaluated on AdvGLUE compared with their standard accuracy on the GLUE leaderboard. For instance, the average score of ELECTRA (Large) (Clark et al., 2020) drops from 93.16 on GLUE to 41.69 on AdvGLUE.

Our contributions are summarized as follows. (i) We propose AdvGLUE, a principled and comprehensive benchmark that focuses on robustness evaluation of language models. (ii) During the data construction, we provide a thorough analysis and a fair comparison of existing strong adversarial attack algorithms. (iii) We present a thorough robustness evaluation of existing state-of-the-art language models and defense methods. We hope that AdvGLUE will inspire active research and discussion in the community. More details are available at https://adversarialglue.github.io.

2 Related Work

Existing robustness evaluation work can be roughly divided into two categories: evaluation toolkits and benchmark datasets. (i) Evaluation toolkits, including OpenAttack (Zeng et al., 2020), TextAttack (Morris et al., 2020), TextFlint (Gui et al., 2021), and Robustness Gym (Goel et al., 2021), integrate various ad hoc input transformations for different tasks and provide programmable APIs to dynamically test model performance. However, it is challenging to guarantee the quality of these input transformations. For example, as reported by Zang et al. (2020), the validity of adversarial transformations can be low enough that more than one third of the adversarial sentences carry wrong labels. Such a high percentage of annotation errors can lead to an underestimate of model robustness, making such transformations less qualified to serve as an accurate and reliable benchmark (Bowman and Dahl, 2021). (ii) Benchmark datasets for robustness evaluation create challenging test cases by using human-crafted templates or rules (Thorne and Vlachos, 2019; Ribeiro et al., 2020; Naik et al., 2018), or by adopting a human-and-model-in-the-loop manner to write adversarial examples (Nie et al., 2020; Kiela et al., 2021; Bartolo et al., 2020). While the quality and validity of these adversarial datasets can be well controlled, their scalability and comprehensiveness are limited by the human annotators. For example, template-based methods require linguistic experts to carefully construct reasonable rules for specific tasks, and such templates are barely transferable to other tasks. Moreover, human annotators tend to complete the writing tasks with minimal effort and shortcuts (Burghardt et al., 2020; Wall et al., 2021), which can limit the coverage of various linguistic phenomena.

Corpus Task |Train| |Test| Word-Level Sent.-Level Human-Crafted
(GLUE) (AdvGLUE) C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11
SST-2 sentiment 67,349 1,420 204 197 91 175 64 211 320 158 0 0 0
QQP paraphrase 363,846 422 42 151 17 35 75 37 0 65 0 0 0
QNLI NLI/QA 104,743 968 73 139 71 98 72 159 219 80 0 0 57
RTE NLI 2,490 304 43 44 31 27 23 48 88 0 0 0 0
MNLI NLI 392,702 1,864 69 402 114 161 128 217 386 0 194 193 0
Sum of AdvGLUE test set 4,978 431 933 324 496 362 672 1013 303 194 193 57
Table 1: Statistics of the AdvGLUE benchmark. We apply all word-level perturbations (C1=Embedding-similarity, C2=Typos, C3=Context-aware, C4=Knowledge-guided, and C5=Compositions) to the five GLUE tasks. For sentence-level perturbations, we apply syntactic-based perturbations (C6) to the five GLUE tasks, while distraction-based perturbations (C7) are applied to the four GLUE tasks other than QQP, as they may affect semantic similarity. For human-crafted examples, we apply CheckList (C8) to SST-2, QQP, and QNLI; StressTest (C9) and ANLI (C10) to MNLI; and AdvSQuAD (C11) to QNLI.

3 Dataset Construction

In this section, we provide an overview of our evaluation tasks, as well as the pipeline of how we construct the benchmark data. During this data construction process, we also compare the effectiveness of different adversarial attack methods, and present several interesting findings.

3.1 Overview

Tasks. We consider the following five most representative and challenging tasks used in GLUE (Wang et al., 2019b): Sentiment Analysis (SST-2), Duplicate Question Detection (QQP), and Natural Language Inference (NLI, including MNLI, RTE, and QNLI). The detailed explanation for each task can be found in Appendix A.3. Some tasks in GLUE are not included in AdvGLUE, since there are either no well-defined automatic adversarial attacks (e.g., CoLA) or insufficient data (e.g., WNLI) for the attacks.

Dataset Statistics and Evaluation Metrics.

AdvGLUE follows the same training data and evaluation metrics as GLUE. In this way, models trained on the GLUE training data can be easily evaluated under IID sampled test sets (GLUE benchmark) or carefully crafted adversarial test sets (AdvGLUE benchmark). With a single training run, practitioners can understand model generalization via the GLUE diagnostic test suite and examine model robustness against different levels of adversarial attacks from the AdvGLUE diagnostic report. Given the same evaluation metrics, model developers can clearly understand the performance gap between models tested in ideal benign environments and in approximately worst-case adversarial scenarios. We present the detailed dataset statistics under various attacks in Table 1. The detailed label distribution and evaluation metrics are given in Appendix Table 8.

3.2 Adversarial Perturbations

In this section, we detail how we optimize different levels of adversarial perturbations on the benign source samples and collect the raw adversarial data with noisy labels, which are then carefully filtered by human annotators as described in the next section. Specifically, we consider the dev sets of the GLUE benchmark as our source samples, upon which we perform different adversarial attacks. For relatively large-scale tasks (QQP, QNLI, MNLI-m/mm), we sample 1,000 cases from the dev sets for efficiency purposes. For the remaining tasks, we consider the whole dev sets as source samples.

3.2.1 Word-level Perturbation

Existing word-level adversarial attacks perturb words through different strategies, such as replacing words with their synonyms (Jin et al., 2020) or with carefully crafted typos (Li et al., 2019) (e.g., "foolish" to "fo01ish"), such that the perturbation does not change the semantic meaning of the sentence but dramatically changes the model's output. To examine model robustness against different perturbation strategies, we select one representative adversarial attack method for each strategy as follows.

Typo-based Perturbation. We select TextBugger (Li et al., 2019) as the representative algorithm for generating typo-based adversarial examples. When performing the attack, TextBugger first identifies the important words and then replaces them with typos.
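To make the typo-based strategy concrete, the following minimal Python sketch applies the kinds of character-level bugs described above (adjacent-character swaps, deletions, and visually similar substitutions such as "o" to "0", as in "foolish" to "fo01ish"). It is an illustrative toy under assumed edit rules and confusable-character choices, not the official TextBugger implementation, which additionally ranks word importance and queries the victim model.

import random

# Visually confusable substitutions, in the spirit of "foolish" -> "fo01ish".
CONFUSABLE = {"o": "0", "l": "1", "i": "1", "a": "@", "s": "$", "e": "3"}

def typo_perturb(word, rng=None):
    """Apply one character-level bug to a word: swap, delete, or substitute."""
    rng = rng or random.Random(0)
    if len(word) < 3:
        return word
    i = rng.randrange(1, len(word) - 1)          # avoid the first/last characters
    op = rng.choice(["swap", "delete", "sub"])
    if op == "swap":                             # swap two adjacent characters
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    if op == "delete":                           # drop one inner character
        return word[:i] + word[i + 1:]
    return word[:i] + CONFUSABLE.get(word[i], word[i]) + word[i + 1:]

print(typo_perturb("foolish"))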

Embedding-similarity-based Perturbation. We choose TextFooler (Jin et al., 2020) as the representative adversarial attack that uses embedding similarity as a constraint to generate semantically consistent adversarial examples. Essentially, TextFooler first performs word importance ranking, and then substitutes the important words with synonyms extracted according to the cosine similarity of word embeddings.
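The core operation here, ranking candidate replacements by the cosine similarity of word embeddings, can be sketched as follows. The tiny hand-written vectors are purely illustrative; TextFooler itself relies on counter-fitted embeddings and extra constraints (part-of-speech agreement, sentence-encoder similarity) that are omitted in this sketch.

import numpy as np

# Toy embeddings; synonyms are deliberately given nearby vectors.
EMB = {
    "happy":  np.array([0.90, 0.10, 0.05]),
    "glad":   np.array([0.88, 0.12, 0.07]),
    "joyful": np.array([0.85, 0.15, 0.02]),
    "sad":    np.array([-0.90, 0.05, 0.10]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def nearest_synonyms(word, k=2, min_sim=0.8):
    # Rank candidate replacements by embedding cosine similarity and keep those
    # above a similarity threshold (a stand-in for TextFooler's constraint).
    scores = sorted(((w, cosine(EMB[word], v)) for w, v in EMB.items() if w != word),
                    key=lambda t: -t[1])
    return [w for w, s in scores[:k] if s >= min_sim]

print(nearest_synonyms("happy"))  # ['glad', 'joyful'] with these toy vectors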

Figure 1: Overview of the AdvGLUE dataset construction pipeline.

Context-aware Perturbation. We use BERT-ATTACK (Li et al., 2020b) to generate context-aware perturbations. The fundamental difference between BERT-ATTACK and TextFooler lies in the word replacement procedure. Specifically, BERT-ATTACK uses a pre-trained BERT to perform masked language prediction and generate contextualized potential word replacements for the crucial words.
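The candidate-generation step can be reproduced with the HuggingFace fill-mask pipeline: mask a crucial word and take the top masked-language-model predictions as contextualized replacements. This only illustrates candidate generation; BERT-ATTACK additionally ranks word importance and checks whether each candidate flips the victim model's prediction. The example sentence is adapted from Table 6.

from transformers import pipeline

# A pre-trained BERT proposes context-aware replacements for the masked word.
unmasker = pipeline("fill-mask", model="bert-base-uncased")
sentence = "The coat that she wore was [MASK] enough to cover her knees."
for cand in unmasker(sentence, top_k=5):
    print(f"{cand['token_str']:>10s}  score={cand['score']:.3f}")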

Knowledge-guided Perturbation. We consider SememePSO (Zang et al., 2020) as an example of generating adversarial examples guided by the HowNet (Qi et al., 2019) knowledge base. SememePSO first finds substitution candidates for each word in HowNet based on sememes, and then searches for the optimal combination via particle swarm optimization.
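Because HowNet sememe lookup requires the OpenHowNet resources, the sketch below uses WordNet synsets as a rough stand-in to show what knowledge-guided candidate generation looks like: substitution candidates come from a lexical knowledge base rather than from embedding neighborhoods. This is not SememePSO (the particle swarm search is omitted), and the WordNet choice is an assumption made only for illustration.

import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)

def knowledge_candidates(word, pos=wn.VERB):
    """Collect substitution candidates for `word` from a lexical knowledge base."""
    candidates = set()
    for synset in wn.synsets(word, pos=pos):
        for lemma in synset.lemmas():
            name = lemma.name().replace("_", " ")
            if name.lower() != word.lower():
                candidates.add(name)
    return sorted(candidates)

# Candidates for the verb sense of "influence" (cf. the SememePSO example in Table 6).
print(knowledge_candidates("influence"))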

Compositions of Different Perturbations. We also implement a whitebox adversarial attack algorithm, CompAttack, that integrates the aforementioned perturbations into one algorithm to evaluate model robustness to various adversarial transformations. Moreover, we efficiently search for perturbations via optimization so that CompAttack can achieve the attack goal while perturbing a minimal number of words. The implementation details can be found in Appendix A.4.

We note that the above adversarial attacks require a surrogate model to search for the optimal perturbations. In our experiments, we follow the setup of ANLI (Nie et al., 2020) and generate adversarial examples against three different types of models (BERT, RoBERTa, and RoBERTa ensemble) trained on the GLUE benchmark. We then perform one round of filtering to retain those examples with high adversarial transferability between these surrogate models. We discuss more implementation details and hyper-parameters of each attack method in Appendix A.4.
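The transferability filter mentioned above amounts to keeping only the adversarial examples that fool every surrogate model. The sketch below shows the rule with a mocked-up classifier interface; the predict method and example fields are assumptions for the sketch, not the actual AdvGLUE code.

class MockClassifier:
    """Stand-in for a fine-tuned surrogate model (e.g., BERT or RoBERTa)."""
    def __init__(self, wrong_on):
        self.wrong_on = wrong_on                 # ids this model misclassifies
    def predict(self, example):
        flip = example["id"] in self.wrong_on
        return 1 - example["label"] if flip else example["label"]

def transfer_filter(adv_examples, surrogates):
    """Keep adversarial examples that change the prediction of every surrogate."""
    return [ex for ex in adv_examples
            if all(model.predict(ex) != ex["label"] for model in surrogates)]

adv = [{"id": 0, "label": 1, "sentence": "..."}, {"id": 1, "label": 0, "sentence": "..."}]
surrogates = [MockClassifier({0}), MockClassifier({0}), MockClassifier({0, 1})]
print([ex["id"] for ex in transfer_filter(adv, surrogates)])  # [0]: only example 0 fools all three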

3.2.2 Sentence-level Perturbation

Different from word-level attacks that perturb specific words, sentence-level attacks mainly focus on the syntactic and logical structures of sentences. Most of them achieve the attack goal by paraphrasing the sentence, manipulating its syntactic structure, or inserting unrelated sentences to distract the model's attention. AdvGLUE considers the following representative perturbations.

Syntactic-based Perturbation. We incorporate three adversarial attack strategies that manipulate a sentence based on its syntactic structure. (i) Syntax Tree Transformations. SCPN (Iyyer et al., 2018) is trained to produce a paraphrase of a given sentence with a specified syntactic structure. Following the default setting, we select the most frequent templates from the ParaNMT-50M corpus (Wieting and Gimpel, 2017) to guide the generation process. An LSTM-based encoder-decoder model (SCPN) is used to generate parses of target sentences according to the templates, and these parses are further fed into another SCPN to generate full sentences. We use the pre-trained SCPNs released by the official codebase. (ii) Context Vector Transformations. T3 (Wang et al., 2020) is a whitebox attack algorithm that can add perturbations at different levels of the syntax tree to generate the adversarial sentence. In our setting, we add perturbations to the context vector of the root node of the syntax tree, which is iteratively optimized to construct the adversarial sentence. (iii) Entailment Preserving Transformations. We follow the entailment preserving rules proposed by AdvFever (Thorne and Vlachos, 2019) and transform all sentences satisfying the templates into semantically equivalent ones. More details can be found in Appendix A.4.

Distraction-based Perturbation. We integrate two attack strategies: (i) StressTest (Naik et al., 2018) appends three true statements ("and true is true", "and false is not true", and "and true is true" repeated five times) to the end of the hypothesis sentence for NLI tasks; (ii) CheckList (Ribeiro et al., 2020) adds randomly generated URLs and handles to distract model attention. Since these distraction-based perturbations may impact linguistic acceptability and the understanding of semantic equivalence, we mainly apply them to a subset of the GLUE tasks, including SST-2 and the NLI tasks (MNLI, RTE, QNLI), to evaluate whether models can be easily misled by strong negation words or such lexical similarity.
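Both distraction strategies are simple string manipulations, sketched below. The exact URL format and handle length are illustrative assumptions; the point is only that a semantically vacuous suffix is attached to the input.

import random
import string

STRESS_SUFFIXES = [
    "and true is true",
    "and false is not true",
    " ".join(["and true is true"] * 5),
]

def stresstest_perturb(hypothesis, rng=None):
    """Append a tautological statement that should not change the NLI label."""
    rng = rng or random.Random(0)
    return hypothesis.rstrip(".") + " " + rng.choice(STRESS_SUFFIXES) + "."

def checklist_url_perturb(sentence, rng=None):
    """Append a randomly generated URL to distract model attention (CheckList-style)."""
    rng = rng or random.Random(0)
    handle = "".join(rng.choices(string.ascii_letters + string.digits, k=6))
    return sentence + " https://t.co/" + handle

print(stresstest_perturb("Bacteria is winning the war against antibiotics"))
print(checklist_url_perturb("What was the population of the Dutch Republic before this emigration?"))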

Linguistic Phenomenon | Sample (Strikethrough = Original Text, red = Adversarial Perturbation) | Label | Prediction
Typo (Word-level) | Question: What was the population of the Dutch Republic before this emigration? Sentence: This was a huge hu ge influx as the entire population of the Dutch Republic amounted to ca. | False | True
Distraction (Sent.-level) | Question: What was the population of the Dutch Republic before this emigration? https://t.co/DlI9kw Sentence: This was a huge influx as the entire population of the Dutch Republic amounted to ca. | False | True
CheckList (Human-crafted) | Question: What is Tony’s profession? Sentence: Both Tony and Marilyn were executives, but there was a change in Marilyn, who is now an assistant. | True | False
Table 2: Examples of the AdvGLUE benchmark. We show examples from the QNLI task. These examples are generated with the three levels of perturbations, and all of them successfully change the predictions of all surrogate models (BERT, RoBERTa, and RoBERTa ensemble).

3.2.3 Human-crafted Examples

To ensure our benchmark covers more linguistic phenomena in addition to those provided by automatic attack algorithms, we integrate the following high-quality human-crafted adversarial data from crowd-sourcing or expert-annotated templates and transform them to the formats of GLUE tasks.

CheckList (Ribeiro et al., 2020) is a testing method designed for analyzing different capabilities of NLP models using different test types. (Note that both CheckList and StressTest propose both rule-based distraction sentences and manually crafted templates to generate test samples; we treat the former as sentence-level distraction-based perturbations and the latter as human-crafted examples.) For each task, CheckList first identifies the necessary natural language capabilities a model should have, then designs several test templates to generate test cases at scale. We follow the instructions and collect test cases for three tasks: SST-2, QQP, and QNLI. For each task, we adopt two capability tests, Temporal and Negation, which test whether the model understands the order of events and whether it is sensitive to negation.

StressTest (Naik et al., 2018) proposes carefully crafted rules to construct “stress tests” and evaluate the robustness of NLI models to specific linguistic phenomena. We adopt the test cases focusing on Numerical Reasoning into our adversarial MNLI dataset. These premise-hypothesis pairs test whether the model can perform reasoning involving numbers and quantifiers and predict the correct relation between premise and hypothesis.

ANLI (Nie et al., 2020) is a large-scale NLI dataset collected iteratively in a human-in-the-loop manner. In each iteration, human annotators are asked to design sentences that fool the current model. The model is then further fine-tuned on a larger dataset incorporating these sentences, which leads to a stronger model, and annotators are asked to write harder examples to detect the weaknesses of this stronger model. In the end, the sentence pairs generated in each round form a comprehensive dataset that aims to examine the vulnerability of NLI models. We adopt ANLI into our adversarial MNLI dataset. We obtained permission from the ANLI authors to include the ANLI dataset as part of our leaderboard.

AdvSQuAD (Jia and Liang, 2017) is an adversarial dataset targeting reading comprehension systems. Adversarial examples are generated by appending a distracting sentence to the end of the input paragraph. The distracting sentences are carefully designed to share words with the questions and to look like a correct answer to the question. We mainly consider the examples generated by the AddSent and AddOneSent strategies, and adapt the distracting sentences and questions to the QNLI format with the label “not answered”. The use of AdvSQuAD in AdvGLUE is authorized by the authors.

We present sampled AdvGLUE examples with word-level perturbations, sentence-level perturbations, and human-crafted samples in Table 2. More examples are provided in Appendix A.5.

3.3 Data Curation

After collecting the raw adversarial dataset, additional rounds of filtering are required to guarantee its quality and validity. We consider two types of filtering: automatic filtering and human evaluation.

Automatic Filtering mainly evaluates the generated adversarial examples along two fronts: transferability and fidelity.


  1. Transferability evaluates whether the adversarial examples generated against one surrogate model (e.g., BERT) can successfully transfer to and attack the other two surrogates (e.g., RoBERTa and the RoBERTa ensemble). Only adversarial examples that successfully transfer to the other two models are kept for the next round of fidelity filtering, so that the selected examples exploit the biases shared across different models and unveil their fundamental weaknesses.

  2. Fidelity evaluates how well the generated adversarial examples maintain the original semantics. For word-level adversarial examples, we use the word modification rate to measure the percentage of words that are perturbed; word-level adversarial examples whose modification rate exceeds a threshold are filtered out. For sentence-level adversarial examples, we use BERTScore (Zhang et al., 2019a) to evaluate the semantic similarity between the adversarial sentences and their corresponding original ones; for each sentence-level attack, the adversarial examples with the highest similarity scores are kept to guarantee their semantic closeness to the benign samples (see the sketch below).
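The sketch below implements the two fidelity checks, assuming the bert-score package for BERTScore; the modification-rate threshold is a free parameter here because the exact value is a configuration detail not restated in the text.

from bert_score import score  # pip install bert-score

def word_modification_rate(original, perturbed):
    """Fraction of word positions that differ between the benign and adversarial sentence."""
    orig, pert = original.split(), perturbed.split()
    changed = sum(o != p for o, p in zip(orig, pert)) + abs(len(orig) - len(pert))
    return changed / max(len(orig), 1)

def keep_word_level(original, perturbed, max_rate=0.15):
    # max_rate is an assumed example threshold, not the value used by AdvGLUE.
    return word_modification_rate(original, perturbed) <= max_rate

def rank_sentence_level(originals, perturbed):
    """Rank sentence-level adversarial examples by BERTScore F1 against their sources;
    the highest-scoring ones would be kept."""
    _, _, f1 = score(perturbed, originals, lang="en", verbose=False)
    order = sorted(range(len(perturbed)), key=lambda i: -f1[i].item())
    return [(perturbed[i], float(f1[i])) for i in order]

print(word_modification_rate("the film is foolish", "the film is fo01ish"))  # 0.25: one of four words changed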

Tasks Metrics Word-level Attacks Sentence-level Attacks Avg
SPSO TF TB CA BA T3 SCPN AdvFever
SST-2 ASR 89.08 95.38 88.08 31.91 39.77 97.69 65.37 0.57 63.48
Curated ASR 8.29 8.97 8.85 4.02 4.04 10.45 6.88 0.23 6.47
Filter Rate 90.71 90.62 90.04 86.63 89.81 89.27 89.47 60.00 85.82
Fleiss Kappa 0.22 0.20 0.50 0.21 0.24 0.23 0.29 0.12 0.26
Curated Fleiss Kappa 0.51 0.49 0.67 0.46 0.45 0.44 0.47 0.20 0.52
Human Accuracy 0.85 0.86 0.91 0.88 0.85 0.78 0.85 0.50 0.87
MNLI ASR 78.45 61.50 69.35 68.58 65.02 91.23 87.73 2.25 65.51
Curated ASR 3.48 1.55 8.94 3.11 2.58 3.41 6.75 0.30 3.77
Filter Rate 95.59 97.55 87.12 95.45 96.10 96.27 92.31 86.63 93.38
Fleiss Kappa 0.28 0.24 0.53 0.39 0.32 0.28 0.24 0.35 0.33
Curated Fleiss Kappa 0.65 0.59 0.74 0.65 0.60 0.56 0.60 0.51 0.67
Human Accuracy 0.85 0.83 0.91 0.89 0.83 0.84 0.91 0.83 0.89
RTE ASR 76.67 75.67 85.89 73.36 72.05 92.39 88.45 6.62 71.39
Curated ASR 6.20 8.14 10.03 6.97 5.58 7.05 8.30 2.53 6.85
Filter Rate 91.93 89.21 88.29 90.72 92.16 92.31 90.61 61.34 87.07
Fleiss Kappa 0.30 0.32 0.58 0.35 0.25 0.33 0.43 0.58 0.38
Curated Fleiss Kappa 0.49 0.67 0.80 0.63 0.42 0.60 0.64 0.65 0.66
Human Accuracy 0.77 0.95 0.94 0.87 0.79 0.89 0.91 0.86 0.92
QNLI ASR 71.88 67.03 82.54 67.24 60.53 96.41 67.37 0.97 64.25
Curated ASR 3.92 2.87 5.87 4.09 2.69 7.59 3.90 0.00 3.87
Filter Rate 94.63 95.89 92.89 93.92 95.78 92.16 94.21 100.00 94.93
Fleiss Kappa 0.07 0.05 0.16 0.10 0.14 0.07 0.12 -0.16 0.11
Curated Fleiss Kappa 0.37 0.43 0.49 0.34 0.53 0.37 0.43 - 0.44
Human Accuracy 0.80 0.86 0.85 0.82 0.92 0.89 0.92 - 0.85
QQP ASR 45.86 48.59 57.92 49.33 43.66 48.20 44.37 0.30 42.28
Curated ASR 1.52 1.74 5.87 3.05 0.76 1.47 1.50 0.00 1.99
Filter Rate 96.73 96.50 89.90 93.83 98.28 97.04 96.62 100.00 96.11
Fleiss Kappa 0.26 0.27 0.38 0.27 0.24 0.25 0.29 - 0.30
Curated Fleiss Kappa 0.32 0.46 0.62 0.48 0.40 0.10 0.47 - 0.51
Human Accuracy 0.84 0.98 0.97 0.89 0.78 0.89 1.00 - 0.89
Table 3: Statistics of data curation. We report the Attack Success Rate (ASR) and the ASR after data curation (Curated ASR) to evaluate the effectiveness of different adversarial attacks. We present the Filter Rate of data curation and the inter-annotator agreement rate (Fleiss Kappa) before and after curation to evaluate the validity of adversarial examples. Human Accuracy on our curated dataset is evaluated by taking one random annotator’s annotation as the prediction and the majority-voted label as the ground truth. SPSO: SememePSO, TF: TextFooler, TB: TextBugger, CA: CompAttack, BA: BERT-ATTACK. Higher ASR, Curated ASR, Fleiss Kappa, and Human Accuracy, and lower Filter Rate, indicate a more effective and valid attack.

Human Evaluation validates whether the adversarial examples preserve the original labels and whether the labels are highly agreed among annotators. Concretely, we recruit annotators from Amazon Mechanical Turk. To make sure the annotators fully understand the GLUE tasks, each worker is required to pass a training step to be qualified to work on the main filtering tasks for the generated adversarial examples. We tune the pay rate for different tasks, as shown in Appendix Table 11. The pay rate of the main filtering phase is twice as much as that of the training phase.


  1. Human Training Phase is designed to ensure that the annotators understand the tasks. The annotation instructions for each task follow (Nangia and Bowman, 2019), and we provide at least two examples for each class to help annotators understand the tasks (instructions can be found at https://adversarialglue.github.io/instructions). Each annotator is required to work on a batch of 20 examples randomly sampled from the GLUE dev set. After annotators answer each example, the ground-truth answer is shown to help them understand whether their answer is correct. Workers who get at least a required fraction of the examples correct during training are qualified to work on the main filtering task. A total of 100 crowd workers participated in each task, and the number of qualified workers is shown in Appendix Table 11. We also test the human accuracy of qualified annotators for each task on 100 randomly sampled examples from the dev set, excluding the training samples. The details and results can be found in Appendix Table 11.

  2. Human Filtering Phase verifies the quality of the generated adversarial examples and only keeps high-quality ones to construct the benchmark dataset. Specifically, annotators work on batches of 10 adversarial examples generated by the same attack method, and every adversarial example is validated by 5 different annotators. Examples are selected following two criteria: (i) high consensus: each example must have at least a 4-vote consensus; (ii) utility preserving: the majority-voted label must be the same as the original one, to make sure the attacks are valid (i.e., cannot fool humans) and preserve the semantic content (see the sketch below).
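The two selection criteria reduce to a simple rule over the five annotations collected per example, as in the sketch below (the label strings and function name are placeholders).

from collections import Counter

def keep_adversarial_example(annotations, original_label, min_votes=4):
    """Keep an example only if (i) at least `min_votes` annotators agree on one
    label and (ii) that majority label equals the original benign label."""
    majority_label, votes = Counter(annotations).most_common(1)[0]
    return votes >= min_votes and majority_label == original_label

print(keep_adversarial_example(["entailment"] * 4 + ["neutral"], "entailment"))   # True: kept
print(keep_adversarial_example(["neutral"] * 5, "entailment"))                    # False: label flipped
print(keep_adversarial_example(["entailment"] * 3 + ["neutral", "contradiction"], "entailment"))  # False: low consensus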

The data curation results, including the inter-annotator agreement rate (Fleiss Kappa) and human accuracy on the curated dataset, are shown in Table 3. We provide more analysis in the next section. Note that even after the data curation step, some grammatical errors and typos can still remain, as some adversarial attacks intentionally inject typos (e.g., TextBugger) or manipulate syntactic trees (e.g., SCPN) in ways that are very stealthy. We retain these samples because their labels receive high consensus from annotators, which means the typos do not substantially impact human understanding.

Model SST-2 MNLI RTE QNLI QQP Avg Avg Gap
(AdvGLUE) (AdvGLUE) (AdvGLUE) (AdvGLUE) (AdvGLUE) (AdvGLUE) (GLUE) (GLUE-AdvGLUE)
State-of-the-art Pre-trained Language Models
BERT (Large) 33.03 28.72/27.05 40.46 39.77 37.91/16.56 33.68 85.76 52.08
ELECTRA (Large) 58.59 14.62/20.22 23.03 57.54 61.37/42.40 41.69 93.16 51.47
RoBERTa (Large) 58.52 50.78/39.62 45.39 52.48 57.11/41.80 50.21 91.44 41.23
T5 (Large) 60.56 48.43/38.98 62.83 57.64 63.03/55.68 56.82 90.39 33.57
ALBERT (XXLarge) 66.83 51.83/44.17 73.03 63.84 56.40/32.35 59.22 91.87 32.65
DeBERTa (Large) 57.89 58.36/52.46 78.95 57.85 60.43/47.98 60.86 92.67 31.81
Robust Training Methods for Pre-trained Language Models
SMART (BERT) 25.21 26.89/23.32 38.16 34.61 36.49/20.24 30.29 85.70 55.41
SMART (RoBERTa) 50.92 45.56/36.07 70.39 52.17 64.22/44.28 53.71 92.62 38.91
FreeLB (RoBERTa) 61.69 31.59/27.60 62.17 62.29 42.18/31.07 50.47 92.28 41.81
InfoBERT (RoBERTa) 47.61 50.39/41.26 39.47 54.86 49.29/35.54 46.04 89.06 43.02
Table 4: Model performance on the AdvGLUE test set. BERT (Large) and RoBERTa (Large) are fine-tuned using different random seeds and are thus different from the surrogate models used for adversarial text generation. For MNLI, we report the test accuracy on the matched and mismatched test sets; for QQP, we report accuracy and F1; for the other tasks, we report accuracy. All values are percentages (%). We also report the macro-average (Avg) of per-task scores on AdvGLUE and on GLUE for each model, along with the gap between the two averages (average GLUE score minus average AdvGLUE score). (Complete results are listed in our leaderboard.)

3.4 Benchmark of Adversarial Attack Algorithms

Our data curation phase also serves as a comprehensive benchmark over existing adversarial attack methods, as it provides a fair standard for all adversarial attacks and systematic human annotations to evaluate the quality of the generated samples.

Evaluation Metrics. Specifically, we evaluate these attacks along two fronts: effectiveness and validity. For effectiveness, we consider two evaluation metrics: Attack Success Rate (ASR) and Curated Attack Success Rate (Curated ASR). Formally, given a benign dataset $\mathcal{D} = \{(x_i, y_i)\}$ consisting of pairs of a sample $x_i$ and its ground truth $y_i$, and an adversarial attack method $\mathcal{A}$ that generates an adversarial example $\mathcal{A}(x)$ from an input $x$ to attack a surrogate model $f$, ASR is calculated as

$\mathrm{ASR} = \frac{\sum_{(x, y) \in \mathcal{D}} \mathbb{1}\left[ f(\mathcal{A}(x)) \neq y \right]}{\sum_{(x, y) \in \mathcal{D}} \mathbb{1}\left[ f(x) = y \right]}$,   (1)

where $\mathbb{1}[\cdot]$ is the indicator function. After the data curation phase, we collect a curated adversarial dataset $\hat{\mathcal{D}}$. Thus, Curated ASR is calculated as

$\text{Curated ASR} = \frac{\sum_{(x_{\mathrm{adv}}, y) \in \hat{\mathcal{D}}} \mathbb{1}\left[ f(x_{\mathrm{adv}}) \neq y \right]}{\sum_{(x, y) \in \mathcal{D}} \mathbb{1}\left[ f(x) = y \right]}$.   (2)

For validity, we consider three evaluation metrics: Filter Rate, Fleiss Kappa, and Human Accuracy. Specifically, Filter Rate is calculated as the fraction of generated adversarial examples rejected during data curation, i.e., $1 - |\hat{\mathcal{D}}| / |\mathcal{D}_{\mathrm{adv}}|$ where $\mathcal{D}_{\mathrm{adv}}$ denotes the set of generated adversarial examples; it reflects the noisiness of the generated adversarial examples. We report the average ASR, Curated ASR, and Filter Rate over the three surrogate models we consider in Table 3. Fleiss Kappa is a widely used metric in existing datasets (e.g., SNLI, ANLI, and FEVER (Bowman et al., 2015; Nie et al., 2020; Thorne et al., 2018)) to measure the inter-annotator agreement rate on the collected dataset. A Fleiss Kappa between 0.4 and 0.6 is considered moderate agreement, and between 0.6 and 0.8 substantial agreement; the inter-annotator agreement rates of most high-quality datasets fall into these two intervals. In this paper, we follow the standard protocol and report Fleiss Kappa and Curated Fleiss Kappa to analyze the inter-annotator agreement rate on the collected adversarial dataset before and after curation, reflecting the ambiguity of generated examples. We also estimate the human performance on our curated datasets: given a sample with 5 annotations, we take one random annotator’s annotation as the prediction and the majority-voted label as the ground truth, and calculate the human accuracy as shown in Table 3.
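The effectiveness and validity metrics can be computed as in the sketch below. The ASR helper follows the definitions above, Fleiss Kappa uses statsmodels with one row per example and one column per annotator, and the input layout and helper signatures are illustrative choices rather than the released evaluation code.

import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

def attack_success_rate(benign_preds, adv_preds, labels):
    """Label flips caused by the attack over the samples the surrogate model
    classifies correctly (cf. Eq. 1)."""
    benign_preds, adv_preds, labels = map(np.asarray, (benign_preds, adv_preds, labels))
    n_correct = int(np.sum(benign_preds == labels))
    n_fooled = int(np.sum(adv_preds != labels))
    return n_fooled / max(n_correct, 1)

def curated_asr(n_fooled_and_curated, n_correct):
    """Only adversarial examples surviving curation count as successes (cf. Eq. 2)."""
    return n_fooled_and_curated / max(n_correct, 1)

def filter_rate(n_generated, n_curated):
    """Fraction of generated adversarial examples rejected during curation."""
    return 1.0 - n_curated / n_generated

# Fleiss Kappa: rows are examples, columns are the 5 annotators' label codes.
annotations = np.array([
    [0, 0, 0, 0, 1],
    [1, 1, 1, 1, 1],
    [0, 1, 0, 2, 0],
])
table, _ = aggregate_raters(annotations)
print(round(float(fleiss_kappa(table)), 3))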

Analysis. As shown in Table 3, in terms of attack effectiveness, while most attacks show high ASR, the Curated ASR is always low (around 10% at most), which indicates that most existing adversarial attack algorithms are not effective enough to generate high-quality adversarial examples. In terms of validity, the filter rates for most adversarial attack methods exceed 85%, which suggests that existing strong adversarial attacks are prone to generating invalid adversarial examples that either change the original semantic meanings or introduce ambiguous perturbations that hinder the annotators’ unanimity. We provide detailed filter rates for automatic filtering and human evaluation in Appendix Table 12; in summary, a large fraction of examples are filtered due to low transferability and high word modification rates, and among the remaining samples, some are filtered due to low human agreement rates (Human Consensus Filtering) and others due to semantic changes that lead to label changes (Utility Preserving Filtering). We also note that the data curation procedures are indispensable for adversarial evaluation, as the Fleiss Kappa before curation is very low, suggesting that many adversarial sentences have unreliable labels and thus tend to underestimate the model robustness against textual adversarial attacks. After the data curation, AdvGLUE shows a Curated Fleiss Kappa of near 0.6, comparable with existing high-quality datasets such as SNLI and ANLI. Among all the existing attack methods, we observe that TextBugger is the most effective and valid attack method, as it demonstrates the highest Curated ASR and Curated Fleiss Kappa across different tasks.

3.5 Finalizing the Dataset

The full pipeline of constructing AdvGLUE is summarized in Figure 1.

Merging. We note that distraction-based adversarial examples and human-crafted adversarial examples are guaranteed to be valid by definition or by crowd-sourcing annotations, so data curation is not needed for these attacks. When merging them with our curated set, we calculate the average number of samples per attack in the curated set, and sample the same amount of adversarial examples from these attacks following the same label distribution. This way, each attack contributes a similar amount of adversarial data, so that AdvGLUE can evaluate models against different types of attacks with similar weights and provide a comprehensive and unbiased diagnostic report.
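A sketch of this sampling step: for each attack that skips curation, draw roughly the average per-attack count observed in the curated set, stratified by label so that the label distribution is preserved. The field names and dictionary layout are assumptions for the example.

import random
from collections import defaultdict

def sample_with_label_distribution(pool, target_size, label_dist, seed=0):
    """Sample about `target_size` examples from `pool` so that label proportions
    follow `label_dist` (a mapping from label to fraction)."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for ex in pool:
        by_label[ex["label"]].append(ex)
    sampled = []
    for label, frac in label_dist.items():
        k = min(round(target_size * frac), len(by_label[label]))
        sampled.extend(rng.sample(by_label[label], k))
    return sampled

# e.g., draw 50 distraction-based examples matching a 60/40 label split.
pool = [{"label": "positive", "sentence": "..."}] * 80 + [{"label": "negative", "sentence": "..."}] * 80
subset = sample_with_label_distribution(pool, target_size=50, label_dist={"positive": 0.6, "negative": 0.4})
print(len(subset))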

Dev-Test Split. After collecting the adversarial examples from the considered attacks, we split the final dataset into a dev set and a test set. In particular, we first randomly split the benign data into two parts; the adversarial examples generated from one part serve as the hidden test set, while the others are published as the dev set. For human-crafted adversarial examples, since they are not generated from the benign GLUE data, we randomly select a portion of the data as the test set and keep the remainder as the dev set. The dev set is publicly released to help participants understand the tasks and the data format. To protect the integrity of our test data, the test set will not be released to the public. Instead, participants are required to upload their models to CodaLab, which automates the evaluation process on the hidden test set and provides a diagnostic report.
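One practical detail here is that every adversarial example derived from the same benign source sample must fall on the same side of the split, so that no dev example leaks information about the hidden test set. A sketch of such a provenance-based split follows; the split fraction is a free parameter in the sketch, not a value taken from the paper.

import random

def provenance_split(adv_examples, benign_ids, test_frac, seed=0):
    """Split adversarial examples so that all perturbations of a given benign
    sample land entirely in either the dev set or the hidden test set."""
    rng = random.Random(seed)
    ids = sorted(benign_ids)
    rng.shuffle(ids)
    test_ids = set(ids[: int(len(ids) * test_frac)])
    test = [ex for ex in adv_examples if ex["source_id"] in test_ids]
    dev = [ex for ex in adv_examples if ex["source_id"] not in test_ids]
    return dev, test

adv = [{"source_id": i % 3, "sentence": "..."} for i in range(9)]
dev_set, test_set = provenance_split(adv, benign_ids={0, 1, 2}, test_frac=0.5)
print(len(dev_set), len(test_set))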

Models Word-Level Perturbations Sent.-Level Human-Crafted Examples
C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11
BERT (Large) 42.02 31.96 45.18 45.86 33.85 44.86 24.16 16.33 23.20 13.47 10.53
ELECTRA (Large) 43.07 45.12 47.95 46.33 47.33 43.47 33.30 32.20 26.29 26.94 52.63
RoBERTa (Large) 56.54 57.19 60.47 49.81 55.92 50.49 41.89 37.78 28.35 16.58 35.09
T5 (Large) 60.04 67.94 64.60 59.84 58.50 50.54 42.20 69.02 23.20 17.10 52.63
ALBERT (XXLarge) 66.71 67.61 73.49 70.36 59.52 63.76 49.14 45.55 39.69 26.94 43.86
DeBERTa (Large) 65.07 74.87 68.02 65.30 62.54 57.41 47.22 45.08 52.06 22.80 54.39
SMART (BERT) 45.17 31.04 42.89 45.23 30.76 40.74 16.62 8.20 18.56 10.36 1.75
SMART (RoBERTa) 62.93 58.03 65.09 62.65 61.37 55.31 40.13 39.27 28.35 15.54 31.58
FreeLB (RoBERTa) 51.95 53.23 52.92 51.15 52.18 50.75 37.72 66.87 23.71 29.02 64.91
InfoBERT (RoBERTa) 55.47 55.78 59.02 51.33 55.48 44.56 31.49 34.31 42.27 14.51 43.86
Table 5: Diagnostic report of state-of-the-art language models and robust training methods. For each attack method, we evaluate models against generated adversarial data for different tasks to obtain per-task accuracy scores, and report the macro-average of those scores. (C1=Embedding-similarity, C2=Typos, C3=Context-aware, C4=Knowledge-guided, C5=Compositions, C6=Syntactic-based Perturbations, C7=Distraction-based Perturbations, C8=CheckList, C9=StressTest, C10=ANLI and C11=AdvSQuAD).

4 Diagnostic Report for Language Models

Benchmark Results. We follow the official implementations and training scripts of pre-trained language models to reproduce results on GLUE and test these models on AdvGLUE. The training details can be found in Appendix A.6. Results are summarized in Table 4. We observe that although state-of-the-art language models have achieved high performance on GLUE, they are vulnerable to various adversarial attacks. For instance, the performance gap can be as large as 55.41 points for the SMART (BERT) model in terms of the average score. DeBERTa (Large) and ALBERT (XXLarge) achieve the highest average AdvGLUE scores among all the tested language models. This result is also aligned with the ANLI leaderboard (https://github.com/facebookresearch/anli), which shows that ALBERT (XXLarge) is the most robust to the human-crafted adversarial NLI dataset (Nie et al., 2020).

We note that although our adversarial examples are generated from surrogate models based on BERT and RoBERTa, these examples have high transferability between models after our data curation. Specifically, the average score of ELECTRA (Large) on AdvGLUE is even lower than that of RoBERTa (Large), which demonstrates that AdvGLUE examples transfer effectively across models of different architectures and unveil vulnerabilities shared across multiple models. Moreover, we find that some models even perform worse than random guessing: for example, BERT's performance on AdvGLUE is lower than the random-guess accuracy on all tasks.

We also benchmark advanced robust training methods to evaluate whether, and to what extent, these methods can indeed improve robustness on AdvGLUE. We observe that SMART and FreeLB are particularly helpful for improving the robustness of RoBERTa. Specifically, SMART (RoBERTa) improves over RoBERTa (Large) by more than 3 points on average, and it even improves the benign accuracy as well. Since InfoBERT is not evaluated on GLUE, we run InfoBERT with different hyper-parameters and report the best accuracy on the benign GLUE dev set and the AdvGLUE test set. However, we find that the benign accuracy of InfoBERT (RoBERTa) is still lower than that of RoBERTa (Large), and similarly for the robust accuracy. These results suggest that existing robust training methods provide only incremental robustness improvements, and there is still a long way to go to develop robust models that achieve satisfactory performance on AdvGLUE.

Diagnostic Report of Model Vulnerabilities. To have a systematic understanding of which adversarial attacks language models are vulnerable to, we provide a detailed diagnostic report in Table 5. We observe that models are most vulnerable to human-crafted examples, where complex linguistic phenomena (e.g., numerical reasoning, negation and coreference resolution) can be found. For sentence-level perturbations, models are more vulnerable to distraction-based perturbations than directly manipulating syntactic structures. In terms of word-level perturbations, models are similarly vulnerable to different word replacement strategies, among which typo-based perturbations and knowledge-guided perturbations are the most effective attacks.

We hope the above findings can help researchers systematically examine their models against different adversarial attacks, thus also devising new methods to defend against them. Comprehensive analysis of the model robustness report is provided in our website and Appendix A.9.

5 Conclusion

We introduce AdvGLUE, a multi-task benchmark to evaluate and analyze the robustness of state-of-the-art language models and robust training methods. We systematically conduct 14 adversarial attacks on GLUE tasks and adopt crowd-sourcing to guarantee the quality and validity of generated adversarial examples. Modern language models perform poorly on AdvGLUE, suggesting that model vulnerabilities to adversarial attacks still remain unsolved. We hope AdvGLUE can serve as a comprehensive and reliable diagnostic benchmark for researchers to further develop robust models.

We thank the anonymous reviewers for their constructive feedback. We also thank Prof. Sam Bowman, Dr. Adina Williams, Nikita Nangia, Jingfeng Li, and many others for the helpful discussions. We thank Prof. Robin Jia and Yixin Nie for allowing us to incorporate their datasets as part of the evaluation. We thank the SQuAD team for allowing us to use their website template and submission tutorials. This work is partially supported by NSF grant No. 1910100, NSF CNS 20-46726 CAR, and the Amazon Research Award.

References

  • M. Bartolo, A. Roberts, J. Welbl, S. Riedel, and P. Stenetorp (2020) Beat the ai: investigating adversarial human annotation for reading comprehension. Transactions of the Association for Computational Linguistics 8, pp. 662–678. Cited by: §A.2, §2.
  • S. R. Bowman, G. Angeli, C. Potts, and C. D. Manning (2015) A large annotated corpus for learning natural language inference. In EMNLP, L. Màrquez, C. Callison-Burch, J. Su, D. Pighin, and Y. Marton (Eds.), Cited by: §3.4.
  • S. R. Bowman and G. E. Dahl (2021) What will it take to fix benchmarking in natural language understanding?. In NAACL, Cited by: §A.8, §1, §2.
  • K. Burghardt, T. Hogg, R. D’Souza, K. Lerman, and M. Posfai (2020) Origins of algorithmic instabilities in crowdsourced ranking. Proceedings of the ACM on Human-Computer Interaction 4 (CSCW2), pp. 1–20. Cited by: §2.
  • N. Carlini and D. A. Wagner (2018) Audio adversarial examples: targeted attacks on speech-to-text. 2018 IEEE Security and Privacy Workshops (SPW), pp. 1–7. Cited by: §A.2, §A.4.
  • K. Clark, M. Luong, Q. V. Le, and C. D. Manning (2020) Electra: pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555. Cited by: 4th item, §1.
  • J. M. Cohen, E. Rosenfeld, and J. Z. Kolter (2019) Certified adversarial robustness via randomized smoothing. In ICML, Cited by: §A.2.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, J. Burstein, C. Doran, and T. Solorio (Eds.), Cited by: §1.
  • K. Dvijotham, S. Gowal, R. Stanforth, R. Arandjelovic, B. O’Donoghue, J. Uesato, and P. Kohli (2018) Training verified learners with learned verifiers. CoRR abs/1805.10265. Cited by: §A.2.
  • J. Ebrahimi, A. Rao, D. Lowd, and D. Dou (2018) HotFlip: white-box adversarial examples for text classification. In ACL, Cited by: §A.2.
  • K. Eykholt, I. Evtimov, E. Fernandes, B. Li, A. Rahmati, C. Xiao, A. Prakash, T. Kohno, and D. X. Song (2017) Robust physical-world attacks on deep learning models. Cited by: §A.2.
  • Z. Gan, Y. Chen, L. Li, C. Zhu, Y. Cheng, and J. Liu (2020) Large-scale adversarial training for vision-and-language representation learning. arXiv preprint arXiv:2006.06195. Cited by: §A.2.
  • S. Garg and G. Ramakrishnan (2020) BAE: BERT-based adversarial examples for text classification. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 6174–6181. Cited by: §A.2, §1.
  • T. Gebru, J. Morgenstern, B. Vecchione, J. W. Vaughan, H. Wallach, H. Daumé III, and K. Crawford (2018) Datasheets for datasets. arXiv preprint arXiv:1803.09010. Cited by: Appendix B.
  • K. Goel, N. Rajani, J. Vig, S. Tan, J. Wu, S. Zheng, C. Xiong, M. Bansal, and C. Ré (2021) Robustness gym: unifying the nlp evaluation landscape. arXiv preprint arXiv:2101.04840. Cited by: §2.
  • I. J. Goodfellow, J. Shlens, and C. Szegedy (2015) Explaining and harnessing adversarial examples. CoRR abs/1412.6572. Cited by: §A.2.
  • T. Gui, X. Wang, Q. Zhang, Q. Liu, Y. Zou, X. Zhou, R. Zheng, C. Zhang, Q. Wu, J. Ye, et al. (2021) Textflint: unified multilingual robustness evaluation toolkit for natural language processing. arXiv preprint arXiv:2103.11441. Cited by: §2.
  • P. He, X. Liu, J. Gao, and W. Chen (2020) Deberta: decoding-enhanced bert with disentangled attention. arXiv preprint arXiv:2006.03654. Cited by: §1.
  • P. Huang, R. Stanforth, J. Welbl, C. Dyer, D. Yogatama, S. Gowal, K. Dvijotham, and P. Kohli (2019) Achieving verified robustness to symbol substitutions via interval bound propagation. In EMNLP-IJCNLP, Cited by: §A.2.
  • M. Iyyer, J. Wieting, K. Gimpel, and L. Zettlemoyer (2018) Adversarial example generation with syntactically controlled paraphrase networks. In NAACL-HLT, Cited by: §A.2, §3.2.2.
  • R. Jia and P. Liang (2017) Adversarial examples for evaluating reading comprehension systems. In EMNLP, M. Palmer, R. Hwa, and S. Riedel (Eds.), Cited by: §A.2, §1, §1, §3.2.3.
  • R. Jia, A. Raghunathan, K. Göksel, and P. Liang (2019) Certified robustness to adversarial word substitutions. In EMNLP-IJCNLP, Cited by: §A.2.
  • H. Jiang, P. He, W. Chen, X. Liu, J. Gao, and T. Zhao (2020) SMART: robust and efficient fine-tuning for pre-trained natural language models through principled regularized optimization. In ACL, D. Jurafsky, J. Chai, N. Schluter, and J. R. Tetreault (Eds.), Cited by: §A.2, §1, §1.
  • D. Jin, Z. Jin, J. T. Zhou, and P. Szolovits (2020) Is BERT really robust? A strong baseline for natural language attack on text classification and entailment. In AAAI, Cited by: §A.2, §1, §1, §3.2.1, §3.2.1.
  • D. Kiela, M. Bartolo, Y. Nie, D. Kaushik, A. Geiger, Z. Wu, B. Vidgen, G. Prasad, A. Singh, P. Ringshia, Z. Ma, T. Thrush, S. Riedel, Z. Waseem, P. Stenetorp, R. Jia, M. Bansal, C. Potts, and A. Williams (2021) Dynabench: rethinking benchmarking in nlp. In NAACL, Cited by: §2.
  • Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut (2019) ALBERT: a lite BERT for self-supervised learning of language representations. ArXiv abs/1909.11942. Cited by: §1.
  • J. Li, T. Du, S. Ji, R. Zhang, Q. Lu, M. Yang, and T. Wang (2020a) TextShield: robust text classification based on multimodal embedding and neural machine translation. In 29th USENIX Security Symposium (USENIX Security 20). Cited by: §1.
  • J. Li, S. Ji, T. Du, B. Li, and T. Wang (2019) TextBugger: generating adversarial text against real-world applications. In NDSS, Cited by: §A.2, §1, §3.2.1, §3.2.1.
  • L. Li, R. Ma, Q. Guo, X. Xue, and X. Qiu (2020b) BERT-attack: adversarial attack against bert using bert. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 6193–6202. Cited by: §A.2, §1, §3.2.1.
  • X. Liu, H. Cheng, P. He, W. Chen, Y. Wang, H. Poon, and J. Gao (2020) Adversarial training for large neural language models. CoRR abs/2004.08994. Cited by: §A.2, §1.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: A robustly optimized BERT pretraining approach. CoRR abs/1907.11692. Cited by: §1.
  • T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013) Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, C. J. C. Burges, L. Bottou, Z. Ghahramani, and K. Q. Weinberger (Eds.), pp. 3111–3119. Cited by: §A.2.
  • S. Moosavi-Dezfooli, A. Fawzi, and P. Frossard (2016) DeepFool: a simple and accurate method to fool deep neural networks. CVPR, pp. 2574–2582. Cited by: §A.2.
  • J. X. Morris, E. Lifland, J. Y. Yoo, J. Grigsby, D. Jin, and Y. Qi (2020) Textattack: a framework for adversarial attacks, data augmentation, and adversarial training in nlp. arXiv preprint arXiv:2005.05909. Cited by: §2.
  • A. Naik, A. Ravichander, N. Sadeh, C. Rose, and G. Neubig (2018) Stress test evaluation for natural language inference. arXiv preprint arXiv:1806.00692. Cited by: §A.2, §2, §3.2.2, §3.2.3.
  • N. Nangia and S. Bowman (2019) Human vs. muppet: a conservative estimate of human performance on the glue benchmark. In ACL, Cited by: §A.7, item 1.
  • Y. Nie, A. Williams, E. Dinan, M. Bansal, J. Weston, and D. Kiela (2020) Adversarial NLI: A new benchmark for natural language understanding. In ACL, D. Jurafsky, J. Chai, N. Schluter, and J. R. Tetreault (Eds.), Cited by: §A.2, §A.8, §1, §1, §2, §3.2.1, §3.2.3, §3.4, §4.
  • N. Papernot, P. D. McDaniel, X. Wu, S. Jha, and A. Swami (2016) Distillation as a defense to adversarial perturbations against deep neural networks. 2016 IEEE Symposium on Security and Privacy (SP), pp. 582–597. Cited by: §A.2.
  • J. Pennington, R. Socher, and C. D. Manning (2014) Glove: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, A. Moschitti, B. Pang, and W. Daelemans (Eds.), pp. 1532–1543. Cited by: §A.2.
  • F. Qi, C. Yang, Z. Liu, Q. Dong, M. Sun, and Z. Dong (2019) OpenHowNet: an open sememe-based lexical knowledge base. ArXiv abs/1901.09957. Cited by: §A.2, §3.2.1.
  • P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250. Cited by: §A.3.
  • M. T. Ribeiro, T. Wu, C. Guestrin, and S. Singh (2020) Beyond accuracy: behavioral testing of NLP models with CheckList. In ACL, pp. 4902–4912. Cited by: §A.2, §2, §3.2.2, §3.2.3.
  • R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Y. Ng, and C. Potts (2013) Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pp. 1631–1642. Cited by: §A.3.
  • J. Thorne, A. Vlachos, C. Christodoulopoulos, and A. Mittal (2018) FEVER: a large-scale dataset for fact extraction and verification. In NAACL-HLT, Cited by: §3.4.
  • J. Thorne and A. Vlachos (2019) Adversarial attacks against fact extraction and verification. CoRR abs/1903.05543. External Links: 1903.05543 Cited by: §1, §2, §3.2.2.
  • E. Wall, A. Narechania, A. Coscia, J. Paden, and A. Endert (2021) Left, right, and gender: exploring interaction traces to mitigate human biases. arXiv preprint arXiv:2108.03536. Cited by: §2.
  • A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman (2019a) Superglue: a stickier benchmark for general-purpose language understanding systems. In NeurIPS, Cited by: §1.
  • A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman (2019b) GLUE: a multi-task benchmark and analysis platform for natural language understanding. In ICLR, Cited by: §1, §3.1.
  • B. Wang, H. Pei, B. Pan, Q. Chen, S. Wang, and B. Li (2020) T3: tree-autoencoder constrained adversarial text generation for targeted attack. In EMNLP. Cited by: §A.2, §A.4, §1, §3.2.2.
  • B. Wang, S. Wang, Y. Cheng, Z. Gan, R. Jia, B. Li, and J. Liu (2021) InfoBERT: improving robustness of language models from an information theoretic perspective. In ICLR, Cited by: §1.
  • J. Wieting and K. Gimpel (2017) ParaNMT-50m: pushing the limits of paraphrastic sentence embeddings with millions of machine translations. arXiv preprint arXiv:1711.05732. Cited by: §A.4, §3.2.2.
  • A. Williams, N. Nangia, and S. R. Bowman (2017) A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426. Cited by: §A.3.
  • Z. Yang, Z. Dai, Y. Yang, J. G. Carbonell, R. Salakhutdinov, and Q. V. Le (2019) XLNet: generalized autoregressive pretraining for language understanding. In NeurIPS, Cited by: §1.
  • Z. Yang, B. Li, P. Chen, and D. X. Song (2018) Characterizing audio adversarial examples using temporal dependency. ArXiv abs/1809.10875. Cited by: §A.2.
  • M. Ye, C. Gong, and Q. Liu (2020) SAFER: A structure-free approach for certified robustness to adversarial word substitutions. In ACL, Cited by: §A.2.
  • Y. Zang, F. Qi, C. Yang, Z. Liu, M. Zhang, Q. Liu, and M. Sun (2020) Word-level textual adversarial attacking as combinatorial optimization. In ACL. Cited by: §A.2, §1, §2, §3.2.1.
  • G. Zeng, F. Qi, Q. Zhou, T. Zhang, B. Hou, Y. Zang, Z. Liu, and M. Sun (2020) OpenAttack: an open-source textual adversarial attack toolkit. arXiv preprint arXiv:2009.09191. Cited by: §2.
  • T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi (2019a) BERTScore: evaluating text generation with bert. In ICLR, Cited by: item 2.
  • Z. Zhang, X. Han, Z. Liu, X. Jiang, M. Sun, and Q. Liu (2019b) ERNIE: enhanced language representation with informative entities. In ACL, Cited by: §1.
  • C. Zhu, Y. Cheng, Z. Gan, S. Sun, T. Goldstein, and J. Liu (2020) FreeLB: enhanced adversarial training for natural language understanding. In ICLR, Cited by: §A.2, §1.

Appendix A

A.1 Glossary of Adversarial Attacks

We present a glossary of the adversarial attacks considered in AdvGLUE in Tables 6 and 7.

Perturbations Explanation Examples (Strikethrough = Original Text, red = Adversarial Perturbation)
TextBugger (Word-level / Typo-based) TextBugger first identifies the important words in each sentence and then replaces them with carefully crafted typos. Task: QNLI
Question: What was the population of the Dutch Republic before this emigration?
Sentence: This was a huge hu ge influx as the entire population of the Dutch Republic amounted to ca.
Prediction: False True
TextFooler (Word-level / Embedding-similarity-based) Embedding-similarity-based adversarial attacks such as TextFooler select synonyms according to the cosine similarity of word embeddings. Words that have high similarity scores will be used as candidates to replace original words in the sentences. Task: QQP
Question 1: I am getting fat on my lower body and on the chest torso, is there any way I can get fit without looking skinny fat?
Question 2: Why I am getting skinny instead of losing body fat?
Prediction: Not Equivalent Equivalent
BERT-ATTACK (Word-level / Context-aware) BERT-ATTACK uses pre-trained BERT to perform masked language prediction to generate contextualized potential word replacements for those crucial words. Task: MNLI
Premise: Do you know what this is? With a dramatic gesture she flung back the left side of her coat sleeve and exposed a small enamelled badge.
Hypothesis: The coat that she wore was long enough to cover her knees .
Prediction: Neutral Contradiction
SememePSO (Word-level / Knowledge-guided) Knowledge-guided adversarial attacks such as SememePSO use external knowledge base such as HowNet or WordNet to search for substitutions. Task: QQP
Question 1: What people who you’ve never met have influenced infected your life the most?
Question 2: Who are people you have never met who have had the greatest influence on your life?
Prediction: Equivalent Not Equivalent
CompAttack (Word-level / Compositions) CompAttack is a whitebox-based adversarial attack that integrates all other word-level perturbation methods in one algorithm to evaluate model robustness to various adversarial transformations. Task: SST-2
Sentence: The primitive force of this film seems to bubble bybble up from the vast collective memory of the combatants.
Prediction: Positive Negative
SCPN (Sent.-level / Syntactic-based) SCPN is an attack method based on syntax tree transformations. It is trained to produce a paraphrase of a given sentence with specified syntactic structures. Task: RTE
Sentence 1: He became a boxing referee in 1964 and became most well-known for his decision against Mike Tyson, during the Holyfield fight, when Tyson bit Holyfield’s ear.
Sentence 2: Mike Tyson bit Holyfield’s ear in 1964.
Prediction: Not Entailment Entailment
T3 (Sent.-level / Syntactic-based) T3 is a whitebox attack algorithm that can add perturbations on different levels of the syntax tree and generate the adversarial sentence. Task: MNLI
Premise: What’s truly striking, though, is that Jobs has had never really let this idea go.
Hypothesis: Jobs never held onto an idea for long.
Prediction: Contradiction Entailment
AdvFever (Sent.-level / Syntactic-based) Entailment preserving rules proposed by AdvFever transform all the sentences satisfying the templates into semantically equivalent ones. Task: SST-2
Sentence: I’ll bet the video game is There exists a lot more fun than the film that goes by the name of i ’ll bet the video game.
Prediction: Negative Positive
StressTest (Sent.-level / Distraction-based) StressTest appends three true statements (“and true is true”, “and false is not true”, “and true is true” for five times) to the end of the hypothesis sentence for NLI tasks. Task: RTE
Sentence 1: Yet, we now are discovering that antibiotics are losing their effectiveness against illness. Disease-causing bacteria are mutating faster than we can come up with new antibiotics to fight the new variations.
Sentence 2: Bacteria is winning the war against antibiotics and true is true.
Prediction: Entailment Not Entailment
CheckList (Sent.-level / Distraction-based) CheckList adds randomly generated URLs and handles to distract model attention. Task: QNLI
Question: What was the population of the Dutch Republic before this emigration? https://t.co/DlI9kw
Sentence: This was a huge influx as the entire population of the Dutch Republic amounted to ca.
Prediction: False True
Table 6: Glossary of adversarial attacks (word-level and sentence-level) in AdvGLUE. For each adversarial attack, we provide a brief explanation and a corresponding example in AdvGLUE.
Perturbations Explanation Examples (Strikethrough = Original Text, red = Adversarial Perturbation)
CheckList (Human-crafted) CheckList analyses different capabilities of NLP models using different test types. We adopt two capability tests: Temporal and Negation, which test if the model understands the order of events and if the model is sensitive to negations. Task: SST-2
Sentence: I think this movie is perfect, but I used to think it was annoying.
Prediction: Positive Negative
StressTest (Human-crafted) StressTest proposes carefully crafted rules to construct “stress tests” and evaluate robustness of NLI models to specific linguistic phenomena. Here we adopt the test cases focusing on Numerical Reasoning. Task: MNLI
Premise: If Anne’ s speed were doubled, they could clean their house in 3 hours working at their respective rates.
Hypothesis: If Anne’ s speed were doubled, they could clean their house in less than 6 hours working at their respective rates.
Prediction: Entailment Contradiction
ANLI (Human-crafted) ANLI is a large-scale NLI dataset collected iteratively in a human-in-the-loop manner. The sentence pairs generated in each round form a comprehensive dataset that aims at examining the vulnerability of NLI models. Task: MNLI
Premise: Kamila Filipcikova (born 1991) is a female Slovakian fashion model. She has modeled in fashion shows for designers such as Marc Jacobs, Chanel, Givenchy, Dolce & Gabbana, and Sonia Rykiel. And appeared on the cover of Vogue Italia two times in a row.
Hypothesis: Filipcikova lives in Italy.
Prediction: Neutral Contradiction
AdvSQuAD (Human-crafted) AdvSQuAD is an adversarial dataset targeting reading comprehension systems. Examples are generated by appending a distracting sentence to the end of the input paragraph. We adopt the distracting sentences and questions in the QNLI format with labels “not answered”. Task: QNLI
Question: What day was the Super Bowl played on?
Sentence: The Champ Bowl was played on August 18th,1991.
Prediction: False True
Table 7: Glossary of adversarial attacks (human-crafted) in AdvGLUE. For each adversarial attack, we provide a brief explanation and a corresponding example in AdvGLUE.

a.2 Additional Related Work

We discuss more related work about textual adversarial attacks and defenses in this subsection.

Textual Adversarial Attacks

Recent research has shown that deep neural networks (DNNs) are vulnerable to adversarial examples that are carefully crafted to fool machine learning models without disturbing human perception [Goodfellow et al., 2015, Papernot et al., 2016, Moosavi-Dezfooli et al., 2016]. However, compared with the large body of adversarial attacks in the continuous data domain [Yang et al., 2018, Carlini and Wagner, 2018, Eykholt et al., 2017], only a few studies focus on the discrete text domain. Most existing gradient-based attacks on image or audio models are no longer applicable to NLP models, as words are intrinsically discrete tokens. Another challenge for generating adversarial text is to ensure semantic and syntactic coherence and consistency.

Existing textual adversarial attacks can be roughly divided into three categories: word-level transformations, sentence-level attacks, and human-crafted samples. (i) Word-level transformations adopt different word replacement strategies during the attack. For example, existing work [Li et al., 2019, Ebrahimi et al., 2018] applies character-level perturbations to craft typo words (e.g., from “foolish” to “fo0lish”), thus making the model ignore or misunderstand the original statistical cues. Others adopt knowledge-based perturbations and utilize a knowledge base to constrain the search space. For example, Zang et al. [2020] uses the sememe-based knowledge base HowNet [Qi et al., 2019] to construct a search space for word substitution. Some attacks [Jin et al., 2020, Li et al., 2019] use non-contextualized word embeddings from GloVe [Pennington et al., 2014] or Word2Vec [Mikolov et al., 2013] to build synonym candidates, by measuring the cosine similarity or Euclidean distance between the original and candidate words and selecting the closest ones as replacements. Recent work [Garg and Ramakrishnan, 2020, Li et al., 2020b] also leverages BERT to generate contextualized perturbations via masked language modeling. (ii) Different from the dominant word-level adversarial attacks, sentence-level adversarial attacks perform sentence-level transformation or paraphrasing by perturbing the syntactic structures based on human-crafted rules [Naik et al., 2018, Ribeiro et al., 2020] or carefully designed auto-encoders [Iyyer et al., 2018, Wang et al., 2020]. Sentence-level manipulations are generally more challenging than word-level attacks, because the perturbation space for syntactic structures is limited compared to word-level perturbation spaces, which grow exponentially with sentence length. However, sentence-level attacks tend to have higher linguistic quality than word-level ones, as both semantic and syntactic coherence are taken into consideration when generating adversarial sentences. (iii) Human-crafted adversarial examples are generally constructed in a human-in-the-loop manner [Jia and Liang, 2017, Nie et al., 2020, Bartolo et al., 2020] or via manually crafted templates that generate test cases [Naik et al., 2018, Ribeiro et al., 2020]. Our AdvGLUE incorporates all of the above textual adversarial attacks to provide a comprehensive and systematic diagnostic report on existing state-of-the-art large-scale language models.
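As a concrete illustration of the embedding-similarity strategy, the sketch below ranks vocabulary words by cosine similarity to the target word and keeps the closest candidates. The names embeddings, word2id, the threshold, and the candidate count are illustrative assumptions, not the settings of any particular attack implementation.

```python
import numpy as np

# Minimal sketch of embedding-similarity-based synonym candidate selection.
# `embeddings` is a (V, d) matrix and `word2id` maps words to row indices;
# both are assumed to be loaded elsewhere (e.g., counter-fitted word vectors).

def synonym_candidates(word, word2id, embeddings, k=50, threshold=0.7):
    """Return up to k vocabulary words whose cosine similarity to `word`
    exceeds `threshold`, ranked from most to least similar."""
    if word not in word2id:
        return []
    vec = embeddings[word2id[word]]                       # (d,)
    norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(vec)
    sims = embeddings @ vec / np.maximum(norms, 1e-8)     # cosine similarity per row
    id2word = {i: w for w, i in word2id.items()}
    out = []
    for idx in np.argsort(-sims):                         # descending similarity
        cand = id2word[int(idx)]
        if cand == word:
            continue
        if sims[idx] < threshold or len(out) >= k:
            break
        out.append(cand)
    return out
```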

Defenses against Textual Adversarial Attacks

To defend against textual adversarial attacks, existing work can be classified into three categories. (i) Adversarial Training is a practical method to defend against adversarial examples. Existing work either uses PGD-based attacks to generate adversarial examples in the embedding space of NLP models as data augmentation [Zhu et al., 2020], or regularizes the standard training objective with virtual adversarial training [Jiang et al., 2020, Liu et al., 2020, Gan et al., 2020]. However, one drawback is that the threat model is often unknown, which renders adversarial training less effective against unseen attacks. (ii) Interval Bound Propagation (IBP) [Dvijotham et al., 2018] is proposed as a technique to bound the worst-case perturbation theoretically. Recent work [Huang et al., 2019, Jia et al., 2019] has applied IBP in the NLP domain to certify the robustness of models. However, IBP-based methods rely on strong assumptions about the model architecture and are difficult to adapt to recent transformer-based language models. (iii) Randomized Smoothing [Cohen et al., 2019] provides a tight robustness guarantee in the ℓ2 norm by smoothing the classifier with Gaussian noise. Ye et al. [2020] adapts this idea to the NLP domain, replacing the Gaussian noise with synonym substitutions and certifying robustness as long as adversarial word substitutions fall into predefined synonym sets. However, guaranteeing the completeness of the synonym set is challenging.

a.3 Task Descriptions, Statistics and Evaluation Metrics

We present the detailed label distribution statistics and evaluation metrics of the GLUE and AdvGLUE benchmarks in Table 8.

Sst-2

The Stanford Sentiment Treebank Socher et al. [2013] consists of sentences from movie reviews together with human annotations of their sentiment. Given a review sentence, the task is to predict its sentiment, which is either positive or negative.

Qqp

The Quora Question Pairs (QQP) dataset is a collection of question pairs from the community question-answering website Quora. The task is to determine whether a pair of questions are semantically equivalent.

Mnli

The Multi-Genre Natural Language Inference Corpus Williams et al. [2017] consists of sentence pairs with textual entailment annotations. Given a premise sentence and a hypothesis sentence, the task is to predict whether the premise entails the hypothesis (entailment), contradicts the hypothesis (contradiction), or neither (neutral).

Qnli

The Question-answering NLI (QNLI) dataset consists of question-sentence pairs modified from the Stanford Question Answering Dataset Rajpurkar et al. [2016]. The task is to determine whether the context sentence contains the answer to the question.

Rte

The Recognizing Textual Entailment (RTE) dataset combines data from a series of annual textual entailment challenges. Examples are constructed from news and Wikipedia text. The task is to predict the relationship between a pair of sentences. For consistency, the relationship is classified into two classes, entailment and not entailment, where neutral and contradiction are merged into not entailment.

Corpus Task |Dev| (GLUE) |Test| (GLUE) |Dev| (AdvGLUE) |Test| (AdvGLUE) Evaluation Metrics
SST-2 sentiment 428:444 1821 72:76 590:830 acc.
QQP paraphrase 25,545:14,885 390,965 46:32 297:125 acc./F1
QNLI NLI/QA 2,702:2,761 5,463 74:74 394:574 acc.
RTE NLI 146:131 3,000 35:46 123:181 acc.
MNLI NLI 6,942:6,252:6,453 19,643 92:84:107 706:565:593 matched acc./mismatched acc.
Table 8: The label distribution of AdvGLUE dataset. For SST-2, we report the label distribution as “negative”:“positive”. For QQP, we report the label distribution as “not equivalent”:“equivalent”. For QNLI, we report the label distribution as “true”:“false”. For RTE, we report the label distribution as “entailment”:“not entailment”. For MNLI, we report the label distribution as “entailment”:“neutral”:“contradiction”.

We also show the detailed per-task model performance on AdvGLUE and GLUE in Table 9.

Models Avg SST-2 MNLI RTE QNLI QQP
GLUE AdvGLUE GLUE AdvGLUE GLUE AdvGLUE GLUE AdvGLUE GLUE AdvGLUE GLUE AdvGLUE
BERT(Large) 85.76 33.68 93.23 33.03 85.78/85.57 28.72/27.05 68.95 40.46 91.91 39.77 90.72/87.38 37.91/16.56
RoBERTa(Large) 91.44 50.21 95.99 58.52 89.74/89.86 50.78/39.62 86.60 45.39 94.14 52.48 91.99/89.37 57.11/41.80
T5(Large) 90.39 56.82 95.53 60.56 88.98/89.20 48.43/38.98 84.12 62.83 93.78 57.64 90.82/88.07 63.03/55.68
ALBERT(XXLarge) 91.87 59.22 95.18 66.83 89.29/89.88 51.83/44.17 88.45 73.03 95.26 63.84 92.26/89.49 56.40/32.35
ELECTRA(Large) 93.16 41.69 97.13 58.59 90.71 14.62/20.22 90.25 23.03 95.17 57.54 92.56 61.37/42.40
DeBERTa(Large) 92.67 60.86 96.33 57.89 90.95/90.85 58.36/52.46 90.25 78.94 94.86 57.85 92.29/89.69 60.43/47.98
SMART(BERT) 85.70 30.29 93.35 25.21 84.72/85.34 26.89/23.32 69.68 38.16 91.71 34.61 90.25/87.22 36.49/20.24
SMART(RoBERTa) 92.62 53.71 96.56 50.92 90.75/90.66 45.56/36.07 90.98 70.39 95.04 52.17 91.20/88.44 64.22/44.28
FreeLB(RoBERTa) 92.28 50.47 96.44 61.69 90.64 31.59/27.60 86.69 62.17 95.04 62.29 92.58 42.18/31.07
InfoBERT(RoBERTa) 89.06 46.04 96.22 47.61 89.67/89.27 50.39/41.26 74.01 39.47 94.62 54.86 92.25/89.70 49.29/35.54
Table 9: Model performance on AdvGLUE test set and GLUE dev set.
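As a quick illustration of the robustness gap discussed in the paper, the snippet below computes the difference between the average GLUE and average AdvGLUE scores for two of the models in Table 9; the numbers come directly from the "Avg" column.

```python
# Robustness gap = average GLUE score minus average AdvGLUE score (Table 9).
scores = {"BERT(Large)": (85.76, 33.68), "DeBERTa(Large)": (92.67, 60.86)}
for name, (glue, advglue) in scores.items():
    print(f"{name}: gap = {glue - advglue:.2f}")   # 52.08 and 31.81
```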

a.4 Implementation Details of Adversarial Attacks

TextBugger

To ensure the small magnitude of the perturbation, we consider the following five strategies: (i) randomly inserting a space into a word; (ii) randomly deleting a character of a word; (iii) randomly replacing a character of a word with its adjacent character on the keyboard; (iv) randomly replacing a character of a word with a visually similar counterpart (e.g., “0” vs. “o”, “1” vs. “l”); and (v) randomly swapping two characters in a word. The first four strategies guarantee that the word edit distance between the typo word and its original word is 1, while that of the last strategy is limited to 2. Following the default setting, in Strategy (i) we only insert a space into a word when the word contains less than characters, and in Strategy (v) we swap characters in a word only when the word has more than characters.
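The five strategies can be sketched as follows. The keyboard-neighbor and visual look-alike tables below are tiny illustrative subsets, not the tables used by the official TextBugger implementation, and the simple length guards stand in for the word-length constraints mentioned above.

```python
import random

# Illustrative lookup tables (not the official ones).
KEYBOARD_NEIGHBORS = {"a": "qwsz", "s": "awedxz", "e": "wsdr", "o": "iklp"}
VISUAL_SWAPS = {"o": "0", "l": "1", "i": "1", "a": "@"}

def insert_space(w):            # (i) insert a space inside the word
    if len(w) < 2:
        return w
    i = random.randint(1, len(w) - 1)
    return w[:i] + " " + w[i:]

def delete_char(w):             # (ii) delete one character
    if not w:
        return w
    i = random.randrange(len(w))
    return w[:i] + w[i + 1:]

def keyboard_swap(w):           # (iii) replace with an adjacent keyboard key
    idxs = [i for i, c in enumerate(w) if c in KEYBOARD_NEIGHBORS]
    if not idxs:
        return w
    i = random.choice(idxs)
    return w[:i] + random.choice(KEYBOARD_NEIGHBORS[w[i]]) + w[i + 1:]

def visual_swap(w):             # (iv) replace with a visually similar character
    idxs = [i for i, c in enumerate(w) if c in VISUAL_SWAPS]
    if not idxs:
        return w
    i = random.choice(idxs)
    return w[:i] + VISUAL_SWAPS[w[i]] + w[i + 1:]

def swap_adjacent(w):           # (v) swap two adjacent characters
    if len(w) < 2:
        return w
    i = random.randint(0, len(w) - 2)
    return w[:i] + w[i + 1] + w[i] + w[i + 2:]
```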

TextFooler

Concretely, for the sentiment analysis tasks we set the cosine similarity threshold to a value that encourages the synonyms to be semantically close to the original words, which enhances the quality of the adversarial data. For the rest of the tasks, we keep the default cosine similarity threshold. Besides, the number of synonyms considered for each word follows the default setting.

Bert-Attack

We follow the hyper-parameters from the official codebase: we set the number of candidate words to 48 and apply a cosine similarity threshold together with synonym dictionaries to filter out antonyms, as the BERT masked language model does not distinguish synonyms from antonyms.
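As a rough illustration of this candidate-generation step, the sketch below uses the HuggingFace transformers fill-mask pipeline to propose contextualized replacements for a chosen position. The official BERT-ATTACK implementation scores sub-word logits of the target model directly, so this is only an approximation; the model name and top_k value are assumptions.

```python
from transformers import pipeline

# Masked-language-model candidate generation (approximation of BERT-ATTACK's
# candidate step). Returns up to top_k contextual replacements for one token.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

def contextual_candidates(tokens, position, top_k=48):
    masked = list(tokens)
    masked[position] = fill_mask.tokenizer.mask_token   # "[MASK]" for BERT
    predictions = fill_mask(" ".join(masked), top_k=top_k)
    return [p["token_str"].strip() for p in predictions]

# Example: candidates for the word "perfect" in an SST-2-style sentence.
print(contextual_candidates("I think this movie is perfect .".split(), 5))
```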

SememePSO

We adopt the official hyper-parameters: the maximum and minimum inertia weights, the maximum and minimum movement probabilities of the particles, and the population size used for every task all follow the default setting.

CompAttack

We follow T3 [Wang et al., 2020] and the C&W attack [Carlini and Wagner, 2018] and design the same optimization objective for adversarial perturbation generation in the embedding space:

\min_{z'} \; \|z' - z\|_2^2 + c \cdot \ell_{\mathrm{adv}}(z') \quad (3)

where z is the benign embedding, z' is the adversarial embedding, the first term controls the magnitude of the perturbation, ℓ_adv is the attack objective function depending on the attack scenario, and c weighs the attack goal against the attack cost. CompAttack constrains the perturbation to be close to a pre-defined perturbation space, including the typo space (e.g., TextBugger), the knowledge space (e.g., WordNet), and the contextualized embedding space (e.g., BERT embedding clusters), to make sure the perturbation is valid. We can also see from Table 3 that CompAttack overall has a lower filter rate than the other state-of-the-art attack methods.
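A minimal PyTorch sketch of optimizing such an objective in the embedding space is given below. The forward function model_from_embeddings, the untargeted margin loss, and the hyper-parameter values are illustrative assumptions rather than the exact CompAttack implementation.

```python
import torch

# C&W-style embedding-space perturbation: keep z' close to z while pushing the
# model away from the true label y_true.
def embedding_space_attack(model_from_embeddings, z, y_true, c=1.0,
                           steps=100, lr=0.1):
    z_adv = z.clone().detach().requires_grad_(True)
    optimizer = torch.optim.Adam([z_adv], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        logits = model_from_embeddings(z_adv)             # shape (1, num_classes)
        true_logit = logits[0, y_true]
        other_logit = logits[0].clone()
        other_logit[y_true] = float("-inf")               # mask out the true class
        attack_loss = torch.clamp(true_logit - other_logit.max(), min=0.0)
        loss = ((z_adv - z) ** 2).sum() + c * attack_loss  # Eq. (3)-style objective
        loss.backward()
        optimizer.step()
    return z_adv.detach()
```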

Scpn

We use the pre-trained SCPN models released by the official codebase. Following the default setting, we select the most frequent templates from the ParaNMT-50M corpus [Wieting and Gimpel, 2017] to guide the generation process. We first parse sentences from the GLUE dev set using Stanford CoreNLP (version 3.7.0, with the Shift-Reduce Parser models).

T3

We follow the official hyper-parameter setting for the scaling constant and the optimization confidence. In each iteration, we optimize the perturbation vector for a bounded number of steps with the learning rate from the official setting.

AdvFever

We follow the entailment-preserving rules proposed by the official implementation and adopt all templates to transform original sentences into semantically equivalent ones. These templates cover many common everyday sentence patterns.

a.5 Examples of AdvGLUE benchmark

We show more comprehensive examples in Table 10. Examples are generated with different levels of perturbation, and all of them successfully change the predictions of all surrogate models (BERT, RoBERTa, and the RoBERTa ensemble).
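A minimal sketch of this surrogate-transferability check is shown below; predict and surrogate_models are placeholders for the actual model interfaces used during data curation, and the gold label is the label of the original (benign) example.

```python
# Keep a candidate adversarial example only if every surrogate model's
# prediction differs from the gold label of the original example.
def transfers_to_all(example, gold_label, surrogate_models, predict):
    return all(predict(model, example) != gold_label for model in surrogate_models)

def filter_candidates(candidates, surrogate_models, predict):
    return [
        (example, gold) for example, gold in candidates
        if transfers_to_all(example, gold, surrogate_models, predict)
    ]
```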

Task Linguistic Phenomenon Samples (Strikethrough = Original Text, red = Adversarial Perturbation) Label Prediction
SST-2 Typo (Word-level) Sentence: The primitive force of this film seems to bubble bybble up from the vast collective memory of the combatants. Positive Negative
SST-2 Context-aware (Word-level) Sentence: In execution , this clever idea is far less smaller funny than the original , killers from space. Negative Positive
SST-2 CheckList (Human-crafted) Sentence: I think this movie is perfect, but I used to think it was annoying. Positive Negative
QQP Embedding (Word-level) Question 1: I am getting fat on my lower body and on the chest torso, is there any way I can get fit without looking skinny fat? Not Equivalent Equivalent
Question 2: Why I am getting skinny instead of losing body fat?
QQP Syntactic (Sent.-level) Question 1: Can I learn MMA at the age of 26? You can learn MMA at 24? Not Equivalent Equivalent
Question 2: Can I learn MMA at the age of 24?
QQP CheckList (Human-crafted) Question 1: Is Alfred Kennedy an analyst? Not Equivalent Equivalent
Question 2: Is Alfred Kennedy becoming an analyst?
MNLI Typo (Word-level) Premise: uh-huh how about any matching mathcing programs Entailment Contradiction
Hypothesis: What about matching programs?
MNLI Distraction (Sent.-level) Premise: You and your friends are not welcome here, said Severn. Entailment Contradiction
Hypothesis: Severn said the people were not welcome there and true is true.
MNLI ANLI (Human-crafted) Premise: Kamila Filipcikova (born 1991) is a female Slovakian fashion model. She has modeled in fashion shows for designers such as Marc Jacobs, Chanel, Givenchy, Dolce & Gabbana, and Sonia Rykiel. And appeared on the cover of Vogue Italia two times in a row. Neutral Contradiction
Hypothesis: Filipcikova lives in Italy.
QNLI Distraction (Sent.-level) Question: What was the population of the Dutch Republic before this emigration? https://t.co/DlI9kw False True
Sentence: This was a huge influx as the entire population of the Dutch Republic amounted to ca.
QNLI AdvSQuAD (Human-crafted) Question: What day was the Super Bowl played on? False True
Sentence: The Champ Bowl was played on August 18th,1991.
RTE Knowledge (Word-level) Sentence 1: In Nigeria, by far the most populous country in sub-Saharan Africa, over 2.7 million people are exist infected with HIV. Not Entailment Entailment
Sentence 2: 2.7 percent of the people infected with HIV live in Africa.
RTE Syntactic (Sent.-level) Sentence 1: He became a boxing referee in 1964 and became most well-known for his decision against Mike Tyson, during the Holyfield fight, when Tyson bit Holyfield’s ear. Not Entailment Entailment
Sentence 2: Mike Tyson bit Holyfield’s ear in 1964.
Table 10: Examples of AdvGLUE benchmark.

a.6 Fine-tuning Details of Large-Scale Language Models

For all experiments, we use a GPU cluster with 8 V100 GPUs and 256 GB of memory.

BERT (Large)

For RTE, we train our model for epochs and for other tasks we train our model for epochs. Batch size for QNLI is set to , and for other tasks it is set to . Learning rates are all set to .
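For reference, a minimal fine-tuning sketch with the HuggingFace Trainer on RTE is shown below. The model name, epoch count, batch size, and learning rate in the sketch are generic placeholders and not the exact values used in our experiments.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Fine-tune a BERT-style classifier on a GLUE task (RTE shown here).
dataset = load_dataset("glue", "rte")
tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-large-uncased", num_labels=2)

def tokenize(batch):
    return tokenizer(batch["sentence1"], batch["sentence2"],
                     truncation=True, padding="max_length", max_length=128)

encoded = dataset.map(tokenize, batched=True)

args = TrainingArguments(output_dir="bert-large-rte",   # placeholder values
                         num_train_epochs=3,
                         per_device_train_batch_size=16,
                         learning_rate=2e-5)

trainer = Trainer(model=model, args=args,
                  train_dataset=encoded["train"],
                  eval_dataset=encoded["validation"])
trainer.train()
```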

ELECTRA (Large)

We follow the official hyper-parameter setting to set the learning rate to and set batch size to . We train ELECTRA on RTE for epochs and train for epochs on other tasks. We set the weight decay rate to for every task.

RoBERTa (Large)

We train our RoBERTa for epochs with learning rate on each task. The batch size for QNLI is and for other tasks.

T5 (Large)

We train our T5 for epochs with learning rate on each task. The batch size for QNLI is and for other tasks. We follow the templates in original paper to convert GLUE tasks into generation tasks.

ALBERT (XXLarge)

We use the default hyper-parameters to train our ALBERT. For example, max training steps for SST-2, MNLI, QNLI, QQP, RTE, is , , , , respectively. For MNLI and QQP, batch size is set to and for other tasks batch size is set to .

DeBERTa (Large)

We use the official hyper-parameters to train our DeBERTa. For example, learning rate is set to across all tasks. For MNLI and QQP, batch size is set to and for other tasks batch size is set to 32.

Smart

For SMART(BERT) and SMART(RoBERTa), we use grid search to find the best hyper-parameters and report the best performance among all trained models.

FreeLB (RoBERTa)

For FreeLB, we test every parameter combination provided by the official codebase and select the best parameters for our training.

InfoBERT (RoBERTa)

We set the batch size to and learning rate to for all tasks.

a.7 Human Evaluation Details

Corpus Pay Rate (per batch) # Qualified Workers Human Acc. (Avg.) Human Acc. (Vote) Fleiss Kappa
SST-2 $0.4 70 89.2 95.0 0.738
MNLI $1.0 33 80.4 85.0 0.615
RTE $1.0 66 85.8 92.0 0.602
QNLI $1.0 41 85.6 91.0 0.684
QQP $0.5 58 86.4 90.0 0.691
Table 11: The statistics of AdvGLUE in the human training phase.
Human Training

We present the pay rate and the number of qualified workers in Table 11. We also test our qualified workers on another non-overlapping set of 100 samples from the GLUE dev set for each task. We can see that the human accuracy is comparable to that reported by Nangia and Bowman [2019], which indicates that most of our selected annotators understand the GLUE tasks well.

Human Filtering

The detailed filtering statistics for each stage are shown in Table 12. We can see that the majority of examples are filtered due to low transferability or a high word modification rate. Among the remaining samples, some are further filtered due to low human agreement rates (Human Consensus Filtering), and others due to semantic changes that lead to label changes (Utility Preserving Filtering).
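The per-stage rates in Table 12 add up to the overall filter rate. The snippet below verifies this for the SST-2 / SememePSO column as a quick sanity check, using the numbers reported in the table.

```python
# Stage-wise filter rates for SST-2 / SememePSO from Table 12.
stages = {"transferability": 58.85, "fidelity": 14.65,
          "human_consensus": 10.53, "utility_preserving": 6.68}
print(round(sum(stages.values()), 2))   # 90.71, matching the reported filter rate
```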

Tasks Metrics Word-level Attacks Average
SememePSO TextFooler TextBugger CombAttack BERT-ATTACK
SST-2 Transferability 58.85 63.56 64.87 53.58 66.87 61.54
Fidelity 14.65 11.06 22.40 19.93 12.03 16.01
Human Consensus 10.53 10.56 2.27 9.92 7.09 8.07
Utility Preserving 6.68 5.43 0.51 3.20 3.82 3.93
Filter Rate 90.71 90.62 90.04 86.63 89.81 89.56
MNLI Transferability 44.16 43.15 42.58 35.08 41.80 41.36
Fidelity 36.57 45.94 37.71 38.14 38.60 39.39
Human Consensus 10.37 6.38 5.51 11.15 9.78 8.64
Utility Preserving 4.49 2.08 1.32 11.07 5.91 4.97
Filter Rate 95.59 97.55 87.12 95.45 96.10 94.36
RTE Transferability 55.32 67.38 41.96 54.20 60.94 55.96
Fidelity 19.83 7.79 42.18 23.17 14.25 21.44
Human Consensus 8.08 7.91 3.55 7.64 8.44 7.12
Utility Preserving 8.69 6.13 0.60 5.70 8.54 5.93
Filter Rate 91.93 89.21 88.29 90.72 92.16 90.46
QNLI Transferability 63.36 70.67 59.24 55.47 69.15 63.58
Fidelity 17.73 13.01 25.31 23.53 13.17 18.55
Human Consensus 10.06 9.80 6.84 9.98 9.36 9.21
Utility Preserving 3.48 2.41 1.50 4.94 4.10 3.29
Filter Rate 94.63 95.89 92.89 93.92 95.78 94.62
QQP Transferability 42.96 58.60 55.09 44.83 51.97 50.69
Fidelity 45.61 29.35 26.46 30.99 37.77 34.04
Human Consensus 4.38 4.69 5.19 10.08 3.94 5.66
Utility Preserving 3.79 3.86 3.16 7.93 4.60 4.67
Filter Rate 96.73 96.50 89.90 93.83 98.28 95.05
Table 12: Filter rates during data curation.
Human Annotation Instructions

We show examples of annotation instructions in the training phase and the filtering phase for MNLI in Figures 2 and 3. More instructions can be found at https://adversarialglue.github.io/instructions. We also provide a FAQ document on each task description page: https://docs.google.com/document/d/1MikHUdyvcsrPqE8x-N-gHaLUNAbA6-Uvy-iA5gkStoc/edit?usp=sharing.

Figure 2: Human annotation instructions (training phase) for MNLI.
Figure 3: Human annotation instructions (filtering phase) for MNLI.

a.8 Discussion of Limitations

Due to constraints on computational resources, we are unable to conduct a comprehensive evaluation of all existing language models. However, with the release of our leaderboard website, we expect researchers to actively submit their models and evaluate them against our AdvGLUE benchmark to gain a systematic understanding of model robustness. We are also interested in the adversarial robustness of large-scale auto-regressive language models under few-shot settings, and leave this as compelling future work.

In this paper, we follow ANLI [Nie et al., 2020] and generate adversarial examples against surrogate models based on BERT and RoBERTa. However, there are concerns [Bowman and Dahl, 2021] that such adversarial filtering may not fairly benchmark model robustness, as participants may top the leaderboard simply by producing errors different from those of our surrogate models. We note that such concerns can be addressed with systematic data curation. As shown in our main benchmark results, the adversarial examples we select have high adversarial transferability and unveil vulnerabilities shared across models of different architectures. In particular, we observe a large performance gap for ELECTRA (Large), which is pre-trained with different data and turns out to be less robust than one of our surrogate models, RoBERTa (Large).

Finally, we emphasize that our AdvGLUE benchmark mainly focuses on robustness evaluation; thus, AdvGLUE can be considered a supplementary diagnostic test set alongside the standard GLUE benchmark. We suggest that participants evaluate their models on both the GLUE benchmark and AdvGLUE to understand both model generalization and robustness. We hope our work helps researchers develop models with high generalization and adversarial robustness.

a.9 Website

We present the diagnostic report on our website in Figure 4.

Figure 4: An example of model diagnostic report for BERT (Large).

Appendix B Data Sheet

We follow the documentation frameworks provided by Gebru et al. [2018].

b.1 Motivation

For what purpose was the dataset created?

While a number of recent methods (SMART, FreeLB, InfoBERT, ALUM) claim to improve model robustness against adversarial attacks, the adversary setup in these methods (i) lacks a unified standard and usually differs across methods, and (ii) fails to cover comprehensive linguistic transformations (typos, synonym substitution, paraphrasing, etc.), so it is hard to recognize to which levels of adversarial attacks models remain vulnerable. This motivates us to build a unified and principled robustness benchmark dataset and to evaluate the extent to which state-of-the-art models have progressed in terms of adversarial robustness.

Who created the dataset (e.g., which team, research group) and on behalf of which entity (e.g., company, institution, organization)?

University of Illinois at Urbana-Champaign (UIUC) and Microsoft Corporation.

b.2 Composition/collection process/preprocessing/cleaning/labeling and uses:

The answers are described in our paper as well as on our website https://adversarialglue.github.io.

b.3 Distribution

Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which the dataset was created?

The dev set is released to the public. The test set is hidden and can only be evaluated by an automatic submission API hosted on CodaLab.

How will the dataset be distributed (e.g., tarball on website, API, GitHub)?

The dev set is released on our website https://adversarialglue.github.io. The test set is hidden and hosted on CodaLab.

When will the dataset be distributed?

It has been released now.

Will the dataset be distributed under a copyright or other intellectual property (IP) license, and/or under applicable terms of use (ToU)?

Our dataset will be distributed under the CC BY-SA 4.0 license.

b.4 Maintenance

How can the owner/curator/manager of the dataset be contacted (e.g., email address)?

Boxin Wang (boxinw2@illinois.edu) and Chejian Xu (xuchejian@zju.edu.cn) will be responsible for maintenance.

Will the dataset be updated (e.g., to correct labeling errors, add new instances, delete instances)?

Yes. If we include more tasks or find any errors, we will correct the dataset and update the leaderboard accordingly. Updates will be posted on our website.

If others want to extend/augment/build on/contribute to the dataset, is there a mechanism for them to do so?

They can contact us via email to contribute.