1 Introduction
Pre-trained language models (Devlin et al., 2019; Liu et al., 2019; Lan et al., 2019; Yang et al., 2019; He et al., 2020; Zhang et al., 2019b; Jiang et al., 2020; Clark et al., 2020) have achieved state-of-the-art performance over a wide range of Natural Language Understanding (NLU) tasks (Wang et al., 2019b, a; Jia and Liang, 2017; Thorne and Vlachos, 2019; Nie et al., 2020). However, recent studies (Jin et al., 2020; Zang et al., 2020; Wang et al., 2020; Li et al., 2020b; Garg and Ramakrishnan, 2020) reveal that even these large-scale language models are vulnerable to carefully crafted adversarial examples, which can fool the models into outputting arbitrarily wrong answers by perturbing input sentences in a human-imperceptible way. Real-world systems built upon these vulnerable models can be misled in ways that raise profound security concerns (Li et al., 2019, 2020a).
To address this challenge, various methods (Jiang et al., 2020; Zhu et al., 2020; Wang et al., 2021; Liu et al., 2020) have been proposed to improve the adversarial robustness of language models. However, the adversary setup considered in these methods lacks a unified standard. For example, Jiang et al. (2020) and Liu et al. (2020) mainly evaluate their robustness against human-crafted adversarial datasets (Nie et al., 2020; Jia and Liang, 2017), while Wang et al. (2021) evaluate model robustness against automatic adversarial attack algorithms (Jin et al., 2020). The absence of a principled adversarial benchmark makes it difficult to compare robustness across different models and to identify the adversarial attacks to which most models are vulnerable. This motivates us to build a unified and principled robustness evaluation benchmark for natural language models, which we hope will help answer the following questions: What types of language models are more robust when evaluated on the unified adversarial benchmark? Which adversarial attack algorithms against language models are more effective, transferable, or stealthy to humans? How likely are humans to be fooled by different adversarial attacks?
We lay out the fundamental principles for creating a high-quality robustness evaluation benchmark as follows. First, as also pointed out by Bowman and Dahl (2021), a reliable benchmark should be accurately and unambiguously annotated by humans. This is especially crucial for robustness evaluation, as some adversarial examples generated by automatic attack algorithms can fool humans as well. In our analysis in §3.4, only a small fraction of the generated adversarial examples receive at least a 4-vote consensus among 5 annotators and align with the original label. Thus, additional rounds of human filtering are critical to validate the quality of the generated adversarial attack data. Second, a comprehensive robustness evaluation benchmark should cover enough language phenomena and generate a systematic diagnostic report to understand and analyze the vulnerabilities of language models. Finally, a robustness evaluation benchmark needs to be challenging and unveil the biases shared across different models.
In this paper, we introduce Adversarial GLUE (AdvGLUE), a multi-task benchmark for robustness evaluation of language models. Compared to existing adversarial datasets, there are several contributions that render AdvGLUE a unique and valuable asset to the community.
- Comprehensive Coverage. We consider textual adversarial attacks from different perspectives and hierarchies, including word-level transformations, sentence-level manipulations, and human-written adversarial examples, so that AdvGLUE is able to cover as many adversarial linguistic phenomena as possible.
- Systematic Annotations. To the best of our knowledge, this is the first work that performs systematic evaluation and annotation of the generated textual adversarial examples. Concretely, AdvGLUE adopts crowd-sourcing to identify high-quality adversarial data for reliable evaluation.
- General Compatibility. To obtain a comprehensive understanding of the robustness of language models across different NLU tasks, AdvGLUE covers the widely used GLUE tasks and creates an adversarial version of the GLUE benchmark to evaluate the robustness of language models.
- High Transferability and Effectiveness. AdvGLUE has high adversarial transferability and can effectively attack a wide range of state-of-the-art models. We observe a significant performance drop for models evaluated on AdvGLUE compared with their standard accuracy on the GLUE leaderboard. For instance, the average GLUE score of ELECTRA (Large) (Clark et al., 2020) drops from 93.16 to 41.69.
Our contributions are summarized as follows. (i) We propose AdvGLUE, a principled and comprehensive benchmark that focuses on robustness evaluation of language models. (ii) During the data construction, we provide a thorough analysis and a fair comparison of existing strong adversarial attack algorithms. (iii) We present a thorough robustness evaluation of existing state-of-the-art language models and defense methods. We hope that AdvGLUE will inspire active research and discussion in the community. More details are available at https://adversarialglue.github.io.
2 Related Work
Existing robustness evaluation work can be roughly divided into two categories: evaluation toolkits and benchmark datasets. (i) Evaluation toolkits, including OpenAttack (Zeng et al., 2020), TextAttack (Morris et al., 2020), TextFlint (Gui et al., 2021) and Robustness Gym (Goel et al., 2021), integrate various ad hoc input transformations for different tasks and provide programmable APIs to dynamically test model performance. However, it is challenging to guarantee the quality of these input transformations. For example, as reported in Zang et al. (2020), the validity of adversarial transformations can be as low as around 65%, which means that more than one third of the adversarial sentences have wrong labels. Such a high percentage of annotation errors could lead to an underestimate of model robustness, making these toolkits less qualified to serve as an accurate and reliable benchmark (Bowman and Dahl, 2021). (ii) Benchmark datasets for robustness evaluation create challenging test cases by using human-crafted templates or rules (Thorne and Vlachos, 2019; Ribeiro et al., 2020; Naik et al., 2018), or by adopting a human-and-model-in-the-loop manner to write adversarial examples (Nie et al., 2020; Kiela et al., 2021; Bartolo et al., 2020). While the quality and validity of these adversarial datasets can be well controlled, their scalability and comprehensiveness are limited by the human annotators. For example, template-based methods require linguistic experts to carefully construct reasonable rules for specific tasks, and such templates can barely transfer to other tasks. Moreover, human annotators tend to complete the writing tasks with minimal effort and via shortcuts (Burghardt et al., 2020; Wall et al., 2021), which can limit the coverage of various linguistic phenomena.
Corpus | Task | |Train| (GLUE) | |Test| (AdvGLUE) | Word-Level (C1–C5) | Sent.-Level (C6–C7) | Human-Crafted (C8–C11)
 | | | | C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9 | C10 | C11
SST-2 | sentiment | 67,349 | 1,420 | 204 | 197 | 91 | 175 | 64 | 211 | 320 | 158 | 0 | 0 | 0 |
QQP | paraphrase | 363,846 | 422 | 42 | 151 | 17 | 35 | 75 | 37 | 0 | 65 | 0 | 0 | 0 |
QNLI | NLI/QA | 104,743 | 968 | 73 | 139 | 71 | 98 | 72 | 159 | 219 | 80 | 0 | 0 | 57 |
RTE | NLI | 2,490 | 304 | 43 | 44 | 31 | 27 | 23 | 48 | 88 | 0 | 0 | 0 | 0 |
MNLI | NLI | 392,702 | 1,864 | 69 | 402 | 114 | 161 | 128 | 217 | 386 | 0 | 194 | 193 | 0 |
Sum of AdvGLUE test set | 4,978 | 431 | 933 | 324 | 496 | 362 | 672 | 1013 | 303 | 194 | 193 | 57 |
3 Dataset Construction
In this section, we provide an overview of our evaluation tasks, as well as the pipeline of how we construct the benchmark data. During this data construction process, we also compare the effectiveness of different adversarial attack methods, and present several interesting findings.
3.1 Overview
Tasks. We consider the following five most representative and challenging tasks used in GLUE (Wang et al., 2019b): Sentiment Analysis (SST-2), Duplicate Question Detection (QQP), and Natural Language Inference (NLI, including MNLI, RTE, and QNLI). The detailed explanation of each task can be found in Appendix A.3. Some tasks in GLUE are not included in AdvGLUE, since there are either no well-defined automatic adversarial attacks (e.g., CoLA) or insufficient data (e.g., WNLI) for the attacks.
Dataset Statistics and Evaluation Metrics.
AdvGLUE follows the same training data and evaluation metrics as GLUE. In this way, models trained on the GLUE training data can be easily evaluated under IID sampled test sets (GLUE benchmark) or carefully crafted adversarial test sets (AdvGLUE benchmark). Practitioners can understand the model generalization via the GLUE diagnostic test suite and examine the model robustness against different levels of adversarial attacks from the AdvGLUE diagnostic report with only one-time training. Given the same evaluation metrics, model developers can clearly understand the performance gap between models tested in the ideally benign environments and approximately worst-case adversarial scenarios. We present the detailed dataset statistics under various attacks in Table
1. The detailed label distribution and evaluation metrics are in Appendix Table 8.
3.2 Adversarial Perturbations
In this section, we detail how we optimize different levels of adversarial perturbations on the benign source samples and collect the raw adversarial data with noisy labels, which are then carefully filtered by human annotators as described in the next section. Specifically, we consider the dev sets of the GLUE benchmark as our source samples, upon which we perform different adversarial attacks. For the relatively large-scale tasks (QQP, QNLI, MNLI-m/mm), we sample 1,000 cases from the dev sets for efficiency purposes. For the remaining tasks, we consider the whole dev sets as source samples.
3.2.1 Word-level Perturbation
Existing word-level adversarial attacks perturb words through different strategies, such as replacing words with their synonyms (Jin et al., 2020) or with carefully crafted typos (Li et al., 2019) (e.g., “foolish” to “fo01ish”), such that the perturbation does not change the semantic meaning of the sentence but dramatically changes the model’s output. To examine model robustness against different perturbation strategies, we select one representative adversarial attack method for each strategy as follows.
Typo-based Perturbation. We select TextBugger (Li et al., 2019) as the representative algorithm for generating typo-based adversarial examples. When performing the attack, TextBugger first identifies the important words and then replaces them with typos.
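To make the typo strategy concrete, below is a minimal, self-contained sketch of character-level edits in the spirit of TextBugger; the helper name, the homoglyph map, and the choice of edit operations are ours for illustration and do not reproduce the official implementation. In the full attack, such an edit would only be applied to the words ranked most important for the surrogate model's prediction.

```python
import random

# Illustrative homoglyph map (our assumption, not TextBugger's exact table).
HOMOGLYPHS = {"o": "0", "l": "1", "i": "1", "a": "@", "e": "3", "s": "5"}

def typo_perturb(word: str, rng: random.Random) -> str:
    """Apply one random character-level edit: swap, delete, insert a space, or homoglyph."""
    if len(word) < 3:
        return word
    i = rng.randrange(1, len(word) - 1)          # keep first and last characters intact
    op = rng.choice(["swap", "delete", "space", "homoglyph"])
    if op == "swap":
        chars = list(word)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
        return "".join(chars)
    if op == "delete":
        return word[:i] + word[i + 1:]
    if op == "space":
        return word[:i] + " " + word[i:]          # e.g., "huge" -> "hu ge", as in Table 2
    return word[:i] + HOMOGLYPHS.get(word[i], word[i]) + word[i + 1:]

rng = random.Random(0)
print(typo_perturb("foolish", rng))               # one random character-level edit of "foolish"
```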
Embedding-similarity-based Perturbation. We choose TextFooler (Jin et al., 2020)
as the representative adversarial attack that considers embedding similarity as a constraint to generate semantically consistent adversarial examples. Essentially, TextFooler first performs word importance ranking, and then substitutes those important ones to their synonyms extracted according to the cosine similarity of word embeddings.
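A rough sketch of these two steps (word importance ranking and embedding-based synonym candidates) is shown below. Here `predict_proba` and `embeddings` are placeholders for a task classifier and a pre-computed word-embedding table, and the thresholds are illustrative rather than TextFooler's exact settings. In the full attack, the most important words are tried first, and a candidate is accepted only if it changes the surrogate model's prediction while passing additional semantic-similarity checks.

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

def word_importance(words, gold_label, predict_proba):
    """Importance of word i = drop in the gold-label probability when word i is deleted."""
    base = predict_proba(" ".join(words))[gold_label]
    return [base - predict_proba(" ".join(words[:i] + words[i + 1:]))[gold_label]
            for i in range(len(words))]

def synonym_candidates(word, embeddings, top_k=5, min_sim=0.7):
    """Nearest neighbors of `word` in embedding space, kept only above a similarity threshold."""
    if word not in embeddings:
        return []
    sims = sorted(((w, cosine(embeddings[word], v)) for w, v in embeddings.items() if w != word),
                  key=lambda x: -x[1])
    return [w for w, s in sims[:top_k] if s >= min_sim]
```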

Context-aware Perturbation. We use BERT-ATTACK (Li et al., 2020b) to generate context-aware perturbations. The fundamental difference between BERT-ATTACK and TextFooler lies in the word replacement procedure. Specifically, BERT-ATTACK uses pre-trained BERT to perform masked language prediction and generate contextualized potential replacements for the crucial words.
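The masked-language-model step can be illustrated with the HuggingFace `fill-mask` pipeline as below; this is a simplified sketch of contextualized candidate generation assuming a recent `transformers` version, not the official BERT-ATTACK code, and the filtering heuristics are our own.

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

def contextual_candidates(words, index, top_k=8):
    """Mask the target word and let a pre-trained masked LM propose in-context replacements."""
    masked = list(words)
    masked[index] = fill_mask.tokenizer.mask_token        # "[MASK]" for BERT
    predictions = fill_mask(" ".join(masked), top_k=top_k)
    return [p["token_str"] for p in predictions
            if p["token_str"].isalpha()                    # drop punctuation / sub-word pieces
            and p["token_str"].lower() != words[index].lower()]

print(contextual_candidates("the primitive force of this film".split(), index=1))
```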
Knowledge-guided Perturbation. We consider SememePSO (Zang et al., 2020) as an example to generate adversarial examples guided by the HowNet (Qi et al., 2019)
knowledge base. SememePSO first finds out substitutions for each word in HowNet based on sememes, and then searches for the optimal combination based on particle swarm optimization.
Compositions of different Perturbations. We also implement a whitebox-based adversarial attack algorithm called CompAttack that integrates the aforementioned perturbations in one algorithm to evaluate model robustness to various adversarial transformations. Moreover, we efficiently search for perturbations via optimization so that CompAttack can achieve the attack goal while perturbing the minimal number of words. The implementation details can be found in Appendix A.4.
We note that the above adversarial attacks require a surrogate model to search for the optimal perturbations. In our experiments, we follow the setup of ANLI (Nie et al., 2020) and generate adversarial examples against three different types of models (BERT, RoBERTa, and RoBERTa ensemble) trained on the GLUE benchmark. We then perform one round of filtering to retain those examples with high adversarial transferability between these surrogate models. We discuss more implementation details and hyper-parameters of each attack method in Appendix A.4.
3.2.2 Sentence-level Perturbation
Different from word-level attacks that perturb specific words, sentence-level attacks mainly focus on the syntactic and logical structures of sentences. Most of them achieve the attack goal by either paraphrasing the sentence, manipulating the syntactic structures, or inserting some unrelated sentences to distract the model attention. AdvGLUE considers the following representative perturbations.
Syntactic-based Perturbation. We incorporate three adversarial attack strategies that manipulate the sentence based on its syntactic structure. (i) Syntax Tree Transformations. SCPN (Iyyer et al., 2018) is trained to produce a paraphrase of a given sentence with specified syntactic structures. Following the default setting, we select the most frequent templates from the ParaNMT-50M corpus (Wieting and Gimpel, 2017) to guide the generation process. An LSTM-based encoder-decoder model (SCPN) is used to generate parses of target sentences according to the templates, and these parses are further fed into another SCPN to generate full sentences. We use the pre-trained SCPNs released by the official codebase. (ii) Context Vector Transformations. T3 (Wang et al., 2020) is a whitebox attack algorithm that adds perturbations at different levels of the syntax tree to generate the adversarial sentence. In our setting, we perturb the context vector of the root node of the syntax tree, which is iteratively optimized to construct the adversarial sentence. (iii) Entailment Preserving Transformations. We follow the entailment-preserving rules proposed by AdvFever (Thorne and Vlachos, 2019) and transform all sentences satisfying the templates into semantically equivalent ones. More details can be found in Appendix A.4.
Distraction-based Perturbation. We integrate two attack strategies: (i) StressTest (Naik et al., 2018) appends true statements (“and true is true”, “and false is not true”, “and true is true” five times) to the end of the hypothesis sentence for NLI tasks. (ii) CheckList (Ribeiro et al., 2020) adds randomly generated URLs and handles to distract model attention. Since these distraction-based perturbations may impact linguistic acceptability and the understanding of semantic equivalence, we apply them only to a subset of the GLUE tasks, namely SST-2 and the NLI tasks (MNLI, RTE, QNLI), to evaluate whether models can be easily misled by strong negation words or such lexical similarity.
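The distraction rules are simple enough to sketch directly; the following is an illustrative implementation (the random-handle format mirrors the CheckList-style URLs, but the exact generation code is ours).

```python
import random
import string

def stresstest_distraction(hypothesis: str) -> str:
    """Append a tautological statement that should not change the NLI label."""
    return hypothesis.rstrip(". ") + " and true is true."

def checklist_url_distraction(sentence: str, rng: random.Random) -> str:
    """Append a randomly generated short URL to distract the model."""
    handle = "".join(rng.choices(string.ascii_letters + string.digits, k=6))
    return f"{sentence} https://t.co/{handle}"

rng = random.Random(0)
print(stresstest_distraction("Bacteria is winning the war against antibiotics."))
print(checklist_url_distraction("This was a huge influx.", rng))
```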
Linguistic Phenomenon | Samples (Strikethrough = Original Text, red = Adversarial Perturbation) | Label Prediction |
Typo (Word-level) | Question: What was the population of the Dutch Republic before this emigration? | False True |
Sentence: This was a huge hu ge influx as the entire population of the Dutch Republic amounted to ca. | ||
Distraction (Sent.-level) | Question: What was the population of the Dutch Republic before this emigration? https://t.co/DlI9kw | False True |
Sentence: This was a huge influx as the entire population of the Dutch Republic amounted to ca. | ||
CheckList (Human-crafted) | Question: What is Tony’s profession? | True False |
Sentence: Both Tony and Marilyn were executives, but there was a change in Marilyn, who is now an assistant. |
3.2.3 Human-crafted Examples
To ensure our benchmark covers more linguistic phenomena in addition to those provided by automatic attack algorithms, we integrate the following high-quality human-crafted adversarial data from crowd-sourcing or expert-annotated templates and transform them to the formats of GLUE tasks.
CheckList111We note that both CheckList and StressTest propose both rule-based distraction sentences and manually crafted templates to generate test samples. The former is considered as sentence-level distraction-based perturbations, while the latter is considered as human-crafted examples. (Ribeiro et al., 2020) is a testing method designed for analysing different capabilities of NLP models using different test types. For each task, CheckList first identifies necessary natural language capabilities a model should have, then designs several test templates to generate test cases at scale. We follow the instructions and collect testing cases for three tasks: SST-2, QQP and QNLI. For each task, we adopt two capability tests: Temporal and Negation, which test if the model understands the order of events and if the model is sensitive to negations.
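As a toy illustration of how such capability templates expand into test cases at scale (the template and word lists below are ours, not from CheckList's released suites):

```python
import itertools

# A negation-capability template for SST-2-style sentiment tests.
TEMPLATE = "I {neg} think this movie is {adj}."
negations = ["do not", "don't"]
adjectives = ["good", "great", "terrible", "boring"]

test_cases = [TEMPLATE.format(neg=n, adj=a)
              for n, a in itertools.product(negations, adjectives)]
print(len(test_cases), test_cases[0])   # 8 test cases; first: "I do not think this movie is good."
```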
StressTest (Naik et al., 2018) proposes carefully crafted rules to construct “stress tests” and evaluate the robustness of NLI models to specific linguistic phenomena. We incorporate the test cases focusing on Numerical Reasoning into our adversarial MNLI dataset. These premise-hypothesis pairs test whether the model can perform reasoning involving numbers and quantifiers and predict the correct relation between premise and hypothesis.
ANLI (Nie et al., 2020) is a large-scale NLI dataset collected iteratively in a human-in-the-loop manner. In each iteration, human annotators are asked to design sentences that fool the current model. The model is then further fine-tuned on a larger dataset incorporating these sentences, which leads to a stronger model, and annotators are asked to write harder examples to detect the weaknesses of this stronger model. The sentence pairs generated in each round form a comprehensive dataset that aims at examining the vulnerability of NLI models. We incorporate ANLI into our adversarial MNLI dataset, with permission from the ANLI authors to include the ANLI dataset as part of our leaderboard.
AdvSQuAD (Jia and Liang, 2017) is an adversarial dataset targeting reading comprehension systems. Adversarial examples are generated by appending a distracting sentence to the end of the input paragraph. The distracting sentences are carefully designed to share common words with the questions and to look like a correct answer to the question. We mainly consider the examples generated by the AddSent and AddOneSent strategies, and convert the distracting sentences and questions into the QNLI format with the label “not answered”. The use of AdvSQuAD in AdvGLUE is authorized by the authors.
3.3 Data Curation
After collecting the raw adversarial dataset, additional rounds of filtering are required to guarantee its quality and validity. We consider two types of filtering: automatic filtering and human evaluation.
Automatic Filtering mainly evaluates the generated adversarial examples along two fronts: transferability and fidelity.
- Transferability evaluates whether the adversarial examples generated against one of the three surrogate models (e.g., BERT) can successfully transfer to and attack the other two (e.g., RoBERTa and the RoBERTa ensemble). Only adversarial examples that successfully transfer to the other two models are kept for the next round of fidelity filtering, so that the selected examples exploit the biases shared across different models and unveil their fundamental weaknesses.
- Fidelity evaluates whether the generated adversarial examples maintain the original semantics. For word-level adversarial examples, we use the word modification rate to measure what percentage of words are perturbed; examples whose word modification rate exceeds a preset threshold are filtered out. For sentence-level adversarial examples, we use BERTScore (Zhang et al., 2019a) to evaluate the semantic similarity between the adversarial sentences and their corresponding original ones. For each sentence-level attack, only the adversarial examples with the highest similarity scores are kept to guarantee their semantic closeness to the benign samples (see the sketch after this list).
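A hedged sketch of the two fidelity checks is given below; the position-wise word-alignment heuristic, the 0.15 modification-rate threshold, and the use of BERTScore F1 are illustrative stand-ins for the exact configuration reported in the appendix.

```python
from bert_score import score as bert_score   # pip install bert-score

def word_modification_rate(original: str, adversarial: str) -> float:
    """Fraction of perturbed word positions (simple position-wise heuristic)."""
    orig, adv = original.split(), adversarial.split()
    changed = sum(o != a for o, a in zip(orig, adv)) + abs(len(orig) - len(adv))
    return changed / max(len(orig), 1)

def keep_word_level(original: str, adversarial: str, max_rate: float = 0.15) -> bool:
    return word_modification_rate(original, adversarial) <= max_rate

def rank_sentence_level(originals, adversarials):
    """Indices of sentence-level adversarial examples, most semantically similar first."""
    _, _, f1 = bert_score(adversarials, originals, lang="en")
    return sorted(range(len(adversarials)), key=lambda i: -float(f1[i]))
```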
Tasks | Metrics | Word-level Attacks | Sentence-level Attacks | Avg | ||||||
SPSO | TF | TB | CA | BA | T3 | SCPN | AdvFever | |||
SST-2 | ASR | 89.08 | 95.38 | 88.08 | 31.91 | 39.77 | 97.69 | 65.37 | 0.57 | 63.48 |
Curated ASR | 8.29 | 8.97 | 8.85 | 4.02 | 4.04 | 10.45 | 6.88 | 0.23 | 6.47 | |
Filter Rate | 90.71 | 90.62 | 90.04 | 86.63 | 89.81 | 89.27 | 89.47 | 60.00 | 85.82 | |
Fleiss Kappa | 0.22 | 0.20 | 0.50 | 0.21 | 0.24 | 0.23 | 0.29 | 0.12 | 0.26 | |
Curated Fleiss Kappa | 0.51 | 0.49 | 0.67 | 0.46 | 0.45 | 0.44 | 0.47 | 0.20 | 0.52 | |
Human Accuracy | 0.85 | 0.86 | 0.91 | 0.88 | 0.85 | 0.78 | 0.85 | 0.50 | 0.87 | |
MNLI | ASR | 78.45 | 61.50 | 69.35 | 68.58 | 65.02 | 91.23 | 87.73 | 2.25 | 65.51 |
Curated ASR | 3.48 | 1.55 | 8.94 | 3.11 | 2.58 | 3.41 | 6.75 | 0.30 | 3.77 | |
Filter Rate | 95.59 | 97.55 | 87.12 | 95.45 | 96.10 | 96.27 | 92.31 | 86.63 | 93.38 | |
Fleiss Kappa | 0.28 | 0.24 | 0.53 | 0.39 | 0.32 | 0.28 | 0.24 | 0.35 | 0.33 | |
Curated Fleiss Kappa | 0.65 | 0.59 | 0.74 | 0.65 | 0.60 | 0.56 | 0.60 | 0.51 | 0.67 | |
Human Accuracy | 0.85 | 0.83 | 0.91 | 0.89 | 0.83 | 0.84 | 0.91 | 0.83 | 0.89 | |
RTE | ASR | 76.67 | 75.67 | 85.89 | 73.36 | 72.05 | 92.39 | 88.45 | 6.62 | 71.39 |
Curated ASR | 6.20 | 8.14 | 10.03 | 6.97 | 5.58 | 7.05 | 8.30 | 2.53 | 6.85 | |
Filter Rate | 91.93 | 89.21 | 88.29 | 90.72 | 92.16 | 92.31 | 90.61 | 61.34 | 87.07 | |
Fleiss Kappa | 0.30 | 0.32 | 0.58 | 0.35 | 0.25 | 0.33 | 0.43 | 0.58 | 0.38 | |
Curated Fleiss Kappa | 0.49 | 0.67 | 0.80 | 0.63 | 0.42 | 0.60 | 0.64 | 0.65 | 0.66 | |
Human Accuracy | 0.77 | 0.95 | 0.94 | 0.87 | 0.79 | 0.89 | 0.91 | 0.86 | 0.92 | |
QNLI | ASR | 71.88 | 67.03 | 82.54 | 67.24 | 60.53 | 96.41 | 67.37 | 0.97 | 64.25 |
Curated ASR | 3.92 | 2.87 | 5.87 | 4.09 | 2.69 | 7.59 | 3.90 | 0.00 | 3.87 | |
Filter Rate | 94.63 | 95.89 | 92.89 | 93.92 | 95.78 | 92.16 | 94.21 | 100.00 | 94.93 | |
Fleiss Kappa | 0.07 | 0.05 | 0.16 | 0.10 | 0.14 | 0.07 | 0.12 | -0.16 | 0.11 | |
Curated Fleiss Kappa | 0.37 | 0.43 | 0.49 | 0.34 | 0.53 | 0.37 | 0.43 | - | 0.44 | |
Human Accuracy | 0.80 | 0.86 | 0.85 | 0.82 | 0.92 | 0.89 | 0.92 | - | 0.85 | |
QQP | ASR | 45.86 | 48.59 | 57.92 | 49.33 | 43.66 | 48.20 | 44.37 | 0.30 | 42.28 |
Curated ASR | 1.52 | 1.74 | 5.87 | 3.05 | 0.76 | 1.47 | 1.50 | 0.00 | 1.99 | |
Filter Rate | 96.73 | 96.50 | 89.90 | 93.83 | 98.28 | 97.04 | 96.62 | 100.00 | 96.11 | |
Fleiss Kappa | 0.26 | 0.27 | 0.38 | 0.27 | 0.24 | 0.25 | 0.29 | - | 0.30 | |
Curated Fleiss Kappa | 0.32 | 0.46 | 0.62 | 0.48 | 0.40 | 0.10 | 0.47 | - | 0.51 | |
Human Accuracy | 0.84 | 0.98 | 0.97 | 0.89 | 0.78 | 0.89 | 1.00 | - | 0.89 |
Human Evaluation validates whether the adversarial examples preserve the original labels and whether annotators agree strongly on those labels. Concretely, we recruit annotators from Amazon Mechanical Turk. To make sure the annotators fully understand the GLUE tasks, each worker is required to pass a training step before qualifying to work on the main filtering tasks for the generated adversarial examples. We tune the pay rate for different tasks, as shown in Appendix Table 11. The pay rate of the main filtering phase is twice that of the training phase.
- Human Training Phase is designed to ensure that the annotators understand the tasks. The annotation instructions for each task follow Nangia and Bowman (2019), and we provide at least two examples for each class to help annotators understand the tasks (instructions are available at https://adversarialglue.github.io/instructions). Each annotator is required to work on a batch of 20 examples randomly sampled from the GLUE dev set. After annotators answer each example, the ground-truth answer is provided to help them understand whether their answer was correct. Workers who answer enough of the training examples correctly (above a fixed accuracy threshold) are qualified to work on the main filtering task. A total of 100 crowd workers participated in each task, and the number of qualified workers is shown in Appendix Table 11. We also test the human accuracy of qualified annotators for each task on 100 randomly sampled examples from the dev set, excluding the training samples. The details and results can be found in Appendix Table 11.
- Human Filtering Phase verifies the quality of the generated adversarial examples and retains only high-quality ones to construct the benchmark dataset. Specifically, annotators work on batches of 10 adversarial examples generated by the same attack method, and every adversarial example is validated by 5 different annotators. Examples are selected following two criteria: (i) high consensus: each example must have at least a 4-vote consensus; (ii) utility preserving: the majority-voted label must be the same as the original one, to make sure the attacks are valid (i.e., cannot fool humans) and preserve the semantic content (see the sketch below).
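A minimal sketch of these two checks over the five collected annotations (label encodings and function names are ours):

```python
from collections import Counter

def passes_human_filter(annotations, original_label, min_votes=4):
    """Keep an example only if >=4 of 5 annotators agree AND the majority label is unchanged."""
    majority_label, votes = Counter(annotations).most_common(1)[0]
    return votes >= min_votes and majority_label == original_label

print(passes_human_filter(["positive"] * 4 + ["negative"], "positive"))   # True  (4-1 consensus, label kept)
print(passes_human_filter(["positive"] * 5, "negative"))                  # False (majority label flipped)
```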
The data curation results, including the inter-annotator agreement rate (Fleiss Kappa) and human accuracy on the curated dataset, are shown in Table 3. We provide more analysis in the next section. Note that even after the data curation step, some grammatical errors and typos can still remain, since some adversarial attacks intentionally inject typos (e.g., TextBugger) or manipulate syntactic trees (e.g., SCPN) in very stealthy ways. We retain these samples because their labels receive high consensus from annotators, which means the typos do not substantially impact humans’ understanding.
Model | SST-2 AdvGLUE | MNLI AdvGLUE | RTE AdvGLUE | QNLI AdvGLUE | QQP AdvGLUE | Avg AdvGLUE | Avg GLUE | Avg Gap (GLUE − AdvGLUE)
State-of-the-art Pre-trained Language Models | ||||||||
BERT (Large) | 33.03 | 28.72/27.05 | 40.46 | 39.77 | 37.91/16.56 | 33.68 | 85.76 | 52.08 |
ELECTRA (Large) | 58.59 | 14.62/20.22 | 23.03 | 57.54 | 61.37/42.40 | 41.69 | 93.16 | 51.47 |
RoBERTa (Large) | 58.52 | 50.78/39.62 | 45.39 | 52.48 | 57.11/41.80 | 50.21 | 91.44 | 41.23 |
T5 (Large) | 60.56 | 48.43/38.98 | 62.83 | 57.64 | 63.03/55.68 | 56.82 | 90.39 | 33.57 |
ALBERT (XXLarge) | 66.83 | 51.83/44.17 | 73.03 | 63.84 | 56.40/32.35 | 59.22 | 91.87 | 32.65 |
DeBERTa (Large) | 57.89 | 58.36/52.46 | 78.95 | 57.85 | 60.43/47.98 | 60.86 | 92.67 | 31.81 |
Robust Training Methods for Pre-trained Language Models | ||||||||
SMART (BERT) | 25.21 | 26.89/23.32 | 38.16 | 34.61 | 36.49/20.24 | 30.29 | 85.70 | 55.41 |
SMART (RoBERTa) | 50.92 | 45.56/36.07 | 70.39 | 52.17 | 64.22/44.28 | 53.71 | 92.62 | 38.91 |
FreeLB (RoBERTa) | 61.69 | 31.59/27.60 | 62.17 | 62.29 | 42.18/31.07 | 50.47 | 92.28 | 41.81 |
InfoBERT (RoBERTa) | 47.61 | 50.39/41.26 | 39.47 | 54.86 | 49.29/35.54 | 46.04 | 89.06 | 43.02 |
3.4 Benchmark of Adversarial Attack Algorithms
Our data curation phase also serves as a comprehensive benchmark over existing adversarial attack methods, as it provides a fair standard for all adversarial attacks and systematic human annotations to evaluate the quality of the generated samples.
Evaluation Metrics. Specifically, we evaluate these attacks along two fronts: effectiveness and validity. For effectiveness, we consider two evaluation metrics: Attack Success Rate (ASR) and Curated Attack Success Rate (Curated ASR). Formally, given a benign dataset $\mathcal{D} = \{(x_i, y_i)\}$ consisting of pairs of a sample $x_i$ and its ground truth $y_i$, and an adversarial attack method $\mathcal{A}$ that generates an adversarial example $\mathcal{A}(x)$ from an input $x$ to attack a surrogate model $f$, ASR is calculated as

$$\mathrm{ASR} = \frac{1}{|\mathcal{D}|} \sum_{(x, y) \in \mathcal{D}} \mathbb{1}\big[ f(\mathcal{A}(x)) \neq y \big], \tag{1}$$

where $\mathbb{1}[\cdot]$ is the indicator function. After the data curation phase, we collect a curated adversarial dataset $\hat{\mathcal{D}}_{\mathrm{adv}} \subseteq \mathcal{D}_{\mathrm{adv}}$, where $\mathcal{D}_{\mathrm{adv}} = \{(\mathcal{A}(x), y) : (x, y) \in \mathcal{D},\ f(\mathcal{A}(x)) \neq y\}$ denotes the set of successful adversarial examples before curation. Thus, Curated ASR is calculated as

$$\mathrm{Curated\ ASR} = \frac{|\hat{\mathcal{D}}_{\mathrm{adv}}|}{|\mathcal{D}|}. \tag{2}$$

For validity, we consider three evaluation metrics: Filter Rate, Fleiss Kappa, and Human Accuracy. Specifically, Filter Rate is calculated as $1 - |\hat{\mathcal{D}}_{\mathrm{adv}}| / |\mathcal{D}_{\mathrm{adv}}|$ to measure how many examples are rejected in the data curation procedures, which reflects the noisiness of the generated adversarial examples.
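For concreteness, these three quantities can be computed as in the following sketch (function and variable names are ours; `predict` and `attack` play the roles of the surrogate model $f$ and the attack $\mathcal{A}$):

```python
def attack_metrics(benign_pairs, predict, attack, curated):
    """benign_pairs: list of (x, y); predict(x) -> label; attack(x) -> adversarial text;
    curated: the subset of successful adversarial examples kept after data curation."""
    n = len(benign_pairs)
    adversarial = [(attack(x), y) for x, y in benign_pairs]
    successful = [(xa, y) for xa, y in adversarial if predict(xa) != y]   # numerator of Eq. (1)
    asr = len(successful) / n
    curated_asr = len(curated) / n                # Eq. (2): curated examples all fool the model
    filter_rate = 1.0 - len(curated) / max(len(successful), 1)
    return asr, curated_asr, filter_rate
```

With these definitions, Curated ASR ≈ ASR × (1 − Filter Rate), which matches the relation between the corresponding rows of Table 3.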
We report the average ASR, Curated ASR, and Filter Rate over the three surrogate models we consider in Table 3.
Fleiss Kappa is a widely used metric in existing datasets (e.g., SNLI, ANLI, and FEVER) to measure the inter-annotator agreement rate on the collected dataset. A Fleiss Kappa between 0.4 and 0.6 is considered moderate agreement and between 0.6 and 0.8 substantial agreement; the inter-annotator agreement rates of most high-quality datasets fall into these two intervals. In this paper, we follow the standard protocol and report Fleiss Kappa and Curated Fleiss Kappa to analyze the inter-annotator agreement rate on the collected adversarial dataset before and after curation, reflecting the ambiguity of the generated examples. We also estimate human performance on our curated datasets. Specifically, given a sample with 5 annotations, we take one random annotator’s annotation as the prediction and the majority-voted label as the ground truth, and calculate the human accuracy as shown in Table 3.
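Both quantities are straightforward to compute from the raw annotations; below is a sketch using `statsmodels` (the array layout and function names are our own assumptions):

```python
import random
from collections import Counter
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

def agreement_and_human_accuracy(annotations, seed=0):
    """annotations: (n_examples x 5) list/array of integer label codes, one column per annotator."""
    table, _ = aggregate_raters(np.asarray(annotations))   # per-example category counts
    kappa = fleiss_kappa(table)
    rng = random.Random(seed)
    correct = 0
    for row in annotations:
        majority = Counter(row).most_common(1)[0][0]        # majority vote as "ground truth"
        correct += int(rng.choice(list(row)) == majority)   # one random annotator as "prediction"
    return kappa, correct / len(annotations)
```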
Analysis. As shown in Table 3, in terms of attack effectiveness, while most attacks show a high ASR, the Curated ASR is always below 15%, which indicates that most existing adversarial attack algorithms are not effective enough at generating high-quality adversarial examples. In terms of validity, the filter rates for most adversarial attack methods exceed 85%, which suggests that existing strong adversarial attacks are prone to generating invalid adversarial examples that either change the original semantic meaning or introduce ambiguous perturbations that hinder annotator unanimity. We provide detailed filter rates for automatic filtering and human evaluation in Appendix Table 12; in short, a large share of examples are filtered due to low transferability and high word modification rate, and among the remaining samples, some are filtered due to low human agreement rates (Human Consensus Filtering) and others due to semantic changes that alter the labels (Utility Preserving Filtering). We also note that the data curation procedures are indispensable for adversarial evaluation, as the Fleiss Kappa before curation is very low, suggesting that many adversarial sentences have unreliable labels and thus tend to underestimate model robustness against textual adversarial attacks. After the data curation, AdvGLUE shows a Curated Fleiss Kappa of nearly 0.6, comparable with existing high-quality datasets such as SNLI and ANLI. Among all the existing attack methods, we observe that TextBugger is the most effective and valid, as it demonstrates the highest Curated ASR and Curated Fleiss Kappa across different tasks.
3.5 Finalizing the Dataset
The full pipeline of constructing AdvGLUE is summarized in Figure 1.
Merging. We note that distraction-based adversarial examples and human-crafted adversarial examples are guaranteed to be valid by construction or by crowd-sourcing annotations, so data curation is not needed for these attacks. When merging them with our curated set, we calculate the average number of samples per attack in the curated set and sample the same number of adversarial examples from these attacks, following the same label distribution. This way, each attack contributes a similar amount of adversarial data, so that AdvGLUE can evaluate models against different types of attacks with similar weights and provide a comprehensive and unbiased diagnostic report.
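A sketch of this label-distribution-matched sampling step, assuming each candidate example is a dict with a "label" field (the representation and function name are ours):

```python
import random
from collections import Counter

def sample_matching_distribution(pool, curated_labels, n_target, seed=0):
    """Draw ~n_target examples from `pool`, matching the label distribution of the curated set."""
    rng = random.Random(seed)
    label_counts = Counter(curated_labels)
    total = sum(label_counts.values())
    sampled = []
    for label, count in label_counts.items():
        candidates = [ex for ex in pool if ex["label"] == label]
        k = min(round(n_target * count / total), len(candidates))
        sampled.extend(rng.sample(candidates, k))
    return sampled
```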
Dev-Test Split. After collecting the adversarial examples from the considered attacks, we split the final dataset into a dev set and a test set. In particular, we first randomly split the benign source data into two disjoint parts; the adversarial examples generated from one part serve as the hidden test set, while those generated from the other are published as the dev set. For human-crafted adversarial examples, since they are not generated from the benign GLUE data, we randomly assign part of the data to the test set and the remainder to the dev set (per-task sizes are given in Appendix Table 8). The dev set is publicly released to help participants understand the tasks and the data format. To protect the integrity of our test data, the test set is not released to the public. Instead, participants are required to upload their models to CodaLab, which automates the evaluation process on the hidden test set and provides a diagnostic report.
Models | Word-Level Perturbations (C1–C5) | Sent.-Level (C6–C7) | Human-Crafted Examples (C8–C11)
C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9 | C10 | C11 | |
BERT (Large) | 42.02 | 31.96 | 45.18 | 45.86 | 33.85 | 44.86 | 24.16 | 16.33 | 23.20 | 13.47 | 10.53 |
ELECTRA (Large) | 43.07 | 45.12 | 47.95 | 46.33 | 47.33 | 43.47 | 33.30 | 32.20 | 26.29 | 26.94 | 52.63 |
RoBERTa (Large) | 56.54 | 57.19 | 60.47 | 49.81 | 55.92 | 50.49 | 41.89 | 37.78 | 28.35 | 16.58 | 35.09 |
T5 (Large) | 60.04 | 67.94 | 64.60 | 59.84 | 58.50 | 50.54 | 42.20 | 69.02 | 23.20 | 17.10 | 52.63 |
ALBERT (XXLarge) | 66.71 | 67.61 | 73.49 | 70.36 | 59.52 | 63.76 | 49.14 | 45.55 | 39.69 | 26.94 | 43.86 |
DeBERTa (Large) | 65.07 | 74.87 | 68.02 | 65.30 | 62.54 | 57.41 | 47.22 | 45.08 | 52.06 | 22.80 | 54.39 |
SMART (BERT) | 45.17 | 31.04 | 42.89 | 45.23 | 30.76 | 40.74 | 16.62 | 8.20 | 18.56 | 10.36 | 1.75 |
SMART (RoBERTa) | 62.93 | 58.03 | 65.09 | 62.65 | 61.37 | 55.31 | 40.13 | 39.27 | 28.35 | 15.54 | 31.58 |
FreeLB (RoBERTa) | 51.95 | 53.23 | 52.92 | 51.15 | 52.18 | 50.75 | 37.72 | 66.87 | 23.71 | 29.02 | 64.91 |
InfoBERT (RoBERTa) | 55.47 | 55.78 | 59.02 | 51.33 | 55.48 | 44.56 | 31.49 | 34.31 | 42.27 | 14.51 | 43.86 |
4 Diagnostic Report for Language Models
Benchmark Results. We follow the official implementations and training scripts of the pre-trained language models to reproduce results on GLUE and test these models on AdvGLUE. The training details can be found in Appendix A.6. Results are summarized in Table 4. We observe that although state-of-the-art language models have achieved high performance on GLUE, they are vulnerable to various adversarial attacks. For instance, the gap between the GLUE and AdvGLUE average scores can be as large as 55.41 points for the SMART (BERT) model. DeBERTa (Large) and ALBERT (XXLarge) achieve the highest average AdvGLUE scores among all the tested language models. This result is also aligned with the ANLI leaderboard (https://github.com/facebookresearch/anli), which shows that ALBERT (XXLarge) is the most robust to the human-crafted adversarial NLI dataset (Nie et al., 2020).
We note that although our adversarial examples are generated from surrogate models based on BERT and RoBERTa, these examples transfer well between models after our data curation. Specifically, the average score of ELECTRA (Large) on AdvGLUE is even lower than that of RoBERTa (Large), which demonstrates that AdvGLUE can effectively transfer across models of different architectures and unveil vulnerabilities shared across multiple models. Moreover, some models even perform worse than random guessing: the performance of BERT on AdvGLUE is lower than random-guess accuracy on all tasks.
We also benchmark advanced robust training methods to evaluate whether and to what extent they provide robustness improvements on AdvGLUE. We observe that SMART and FreeLB are particularly helpful for improving the robustness of RoBERTa. Specifically, SMART (RoBERTa) improves RoBERTa (Large) by over 3 points on average, and it even improves the benign accuracy as well. Since InfoBERT is not evaluated on GLUE, we run InfoBERT with different hyper-parameters and report the best accuracy on the benign GLUE dev set and the AdvGLUE test set. However, we find that the benign accuracy of InfoBERT (RoBERTa) is still lower than that of RoBERTa (Large), and similarly for the robust accuracy. These results suggest that existing robust training methods provide only incremental robustness improvements, and there is still a long way to go to develop robust models that achieve satisfactory performance on AdvGLUE.
Diagnostic Report of Model Vulnerabilities. To have a systematic understanding of which adversarial attacks language models are vulnerable to, we provide a detailed diagnostic report in Table 5. We observe that models are most vulnerable to human-crafted examples, where complex linguistic phenomena (e.g., numerical reasoning, negation and coreference resolution) can be found. For sentence-level perturbations, models are more vulnerable to distraction-based perturbations than directly manipulating syntactic structures. In terms of word-level perturbations, models are similarly vulnerable to different word replacement strategies, among which typo-based perturbations and knowledge-guided perturbations are the most effective attacks.
We hope the above findings can help researchers systematically examine their models against different adversarial attacks, thus also devising new methods to defend against them. Comprehensive analysis of the model robustness report is provided in our website and Appendix A.9.
5 Conclusion
We introduce AdvGLUE, a multi-task benchmark to evaluate and analyze the robustness of state-of-the-art language models and robust training methods. We systematically conduct 14 adversarial attacks on GLUE tasks and adopt crowd-sourcing to guarantee the quality and validity of generated adversarial examples. Modern language models perform poorly on AdvGLUE, suggesting that model vulnerabilities to adversarial attacks still remain unsolved. We hope AdvGLUE can serve as a comprehensive and reliable diagnostic benchmark for researchers to further develop robust models.
Acknowledgments
We thank the anonymous reviewers for their constructive feedback. We also thank Prof. Sam Bowman, Dr. Adina Williams, Nikita Nangia, Jingfeng Li, and many others for the helpful discussion. We thank Prof. Robin Jia and Yixin Nie for allowing us to incorporate their datasets as part of the evaluation. We thank the SQuAD team for allowing us to use their website template and submission tutorials. This work is partially supported by NSF grant No. 1910100, NSF CNS 20-46726 CAR, and the Amazon Research Award.
References
- Beat the ai: investigating adversarial human annotation for reading comprehension. Transactions of the Association for Computational Linguistics 8, pp. 662–678. Cited by: §A.2, §2.
- A large annotated corpus for learning natural language inference. In EMNLP, L. Màrquez, C. Callison-Burch, J. Su, D. Pighin, and Y. Marton (Eds.), Cited by: §3.4.
- What will it take to fix benchmarking in natural language understanding?. In NAACL, Cited by: §A.8, §1, §2.
- Origins of algorithmic instabilities in crowdsourced ranking. Proceedings of the ACM on Human-Computer Interaction 4 (CSCW2), pp. 1–20. Cited by: §2.
- Audio adversarial examples: targeted attacks on speech-to-text. 2018 IEEE Security and Privacy Workshops (SPW), pp. 1–7. Cited by: §A.2, §A.4.
- Electra: pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555. Cited by: 4th item, §1.
- Certified adversarial robustness via randomized smoothing. In ICML, Cited by: §A.2.
- BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, J. Burstein, C. Doran, and T. Solorio (Eds.), Cited by: §1.
- Training verified learners with learned verifiers. CoRR abs/1805.10265. Cited by: §A.2.
- HotFlip: white-box adversarial examples for text classification. In ACL, Cited by: §A.2.
- Robust physical-world attacks on deep learning models. Cited by: §A.2.
- Large-scale adversarial training for vision-and-language representation learning. arXiv preprint arXiv:2006.06195. Cited by: §A.2.
- BAE: bert-based adversarial examples for text classification. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 6174–6181. Cited by: §A.2, §1.
- Datasheets for datasets. arXiv preprint arXiv:1803.09010. Cited by: Appendix B.
- Robustness gym: unifying the nlp evaluation landscape. arXiv preprint arXiv:2101.04840. Cited by: §2.
- Explaining and harnessing adversarial examples. CoRR abs/1412.6572. Cited by: §A.2.
- Textflint: unified multilingual robustness evaluation toolkit for natural language processing. arXiv preprint arXiv:2103.11441. Cited by: §2.
- Deberta: decoding-enhanced bert with disentangled attention. arXiv preprint arXiv:2006.03654. Cited by: §1.
- Achieving verified robustness to symbol substitutions via interval bound propagation. In EMNLP-IJCNLP, Cited by: §A.2.
- Adversarial example generation with syntactically controlled paraphrase networks. In NAACL-HLT, Cited by: §A.2, §3.2.2.
- Adversarial examples for evaluating reading comprehension systems. In EMNLP, M. Palmer, R. Hwa, and S. Riedel (Eds.), Cited by: §A.2, §1, §1, §3.2.3.
- Certified robustness to adversarial word substitutions. In EMNLP-IJCNLP, Cited by: §A.2.
- SMART: robust and efficient fine-tuning for pre-trained natural language models through principled regularized optimization. In ACL, D. Jurafsky, J. Chai, N. Schluter, and J. R. Tetreault (Eds.), Cited by: §A.2, §1, §1.
- Is BERT really robust? A strong baseline for natural language attack on text classification and entailment. In AAAI, Cited by: §A.2, §1, §1, §3.2.1, §3.2.1.
- Dynabench: rethinking benchmarking in nlp. In NAACL, Cited by: §2.
- ALBERT: a lite bert for self-supervised learning of language representations. ArXiv abs/1909.11942. Cited by: §1.
- TextShield: robust text classification based on multimodal embedding and neural machine translation. In 29th USENIX Security Symposium (USENIX Security 20), Cited by: §1.
- TextBugger: generating adversarial text against real-world applications. In NDSS, Cited by: §A.2, §1, §3.2.1, §3.2.1.
- BERT-attack: adversarial attack against bert using bert. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 6193–6202. Cited by: §A.2, §1, §3.2.1.
- Adversarial training for large neural language models. CoRR abs/2004.08994. Cited by: §A.2, §1.
- RoBERTa: A robustly optimized BERT pretraining approach. CoRR abs/1907.11692. Cited by: §1.
- Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, C. J. C. Burges, L. Bottou, Z. Ghahramani, and K. Q. Weinberger (Eds.), pp. 3111–3119. Cited by: §A.2.
- DeepFool: a simple and accurate method to fool deep neural networks. CVPR, pp. 2574–2582. Cited by: §A.2.
- Textattack: a framework for adversarial attacks, data augmentation, and adversarial training in nlp. arXiv preprint arXiv:2005.05909. Cited by: §2.
- Stress test evaluation for natural language inference. arXiv preprint arXiv:1806.00692. Cited by: §A.2, §2, §3.2.2, §3.2.3.
- Human vs. muppet: a conservative estimate of human performance on the glue benchmark. In ACL, Cited by: §A.7, item 1.
- Adversarial NLI: A new benchmark for natural language understanding. In ACL, D. Jurafsky, J. Chai, N. Schluter, and J. R. Tetreault (Eds.), Cited by: §A.2, §A.8, §1, §1, §2, §3.2.1, §3.2.3, §3.4, §4.
- Distillation as a defense to adversarial perturbations against deep neural networks. 2016 IEEE Symposium on Security and Privacy (SP), pp. 582–597. Cited by: §A.2.
- Glove: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, A. Moschitti, B. Pang, and W. Daelemans (Eds.), pp. 1532–1543. Cited by: §A.2.
- OpenHowNet: an open sememe-based lexical knowledge base. ArXiv abs/1901.09957. Cited by: §A.2, §3.2.1.
- Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250. Cited by: §A.3.
- Beyond accuracy: behavioral testing of NLP models with CheckList. In ACL, pp. 4902–4912. Cited by: §A.2, §2, §3.2.2, §3.2.3.
- Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pp. 1631–1642. Cited by: §A.3.
- FEVER: a large-scale dataset for fact extraction and verification. In NAACL-HLT, Cited by: §3.4.
- Adversarial attacks against fact extraction and verification. CoRR abs/1903.05543. External Links: 1903.05543 Cited by: §1, §2, §3.2.2.
- Left, right, and gender: exploring interaction traces to mitigate human biases. arXiv preprint arXiv:2108.03536. Cited by: §2.
- Superglue: a stickier benchmark for general-purpose language understanding systems. In NeurIPS, Cited by: §1.
- GLUE: a multi-task benchmark and analysis platform for natural language understanding. In ICLR, Cited by: §1, §3.1.
- T3: tree-autoencoder constrained adversarial text generation for targeted attack. In EMNLP, Cited by: §A.2, §A.4, §1, §3.2.2.
- InfoBERT: improving robustness of language models from an information theoretic perspective. In ICLR, Cited by: §1.
- ParaNMT-50m: pushing the limits of paraphrastic sentence embeddings with millions of machine translations. arXiv preprint arXiv:1711.05732. Cited by: §A.4, §3.2.2.
- A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426. Cited by: §A.3.
- XLNet: generalized autoregressive pretraining for language understanding. In NeurIPS, Cited by: §1.
- Characterizing audio adversarial examples using temporal dependency. ArXiv abs/1809.10875. Cited by: §A.2.
- SAFER: A structure-free approach for certified robustness to adversarial word substitutions. In ACL, Cited by: §A.2.
- Word-level textual adversarial attacking as combinatorial optimization. In ACL, Cited by: §A.2, §1, §2, §3.2.1.
- Openattack: an open-source textual adversarial attack toolkit. arXiv preprint arXiv:2009.09191. Cited by: §2.
- BERTScore: evaluating text generation with bert. In ICLR, Cited by: item 2.
- ERNIE: enhanced language representation with informative entities. In ACL, Cited by: §1.
- FreeLB: enhanced adversarial training for natural language understanding. In ICLR, Cited by: §A.2, §1.
Appendix A
A.1 Glossary of Adversarial Attacks
Perturbations | Explanation | Examples (Strikethrough = Original Text, red = Adversarial Perturbation) |
TextBugger (Word-level / Typo-based) | TextBugger first identifies the important words in each sentence and then replaces them with carefully crafted typos. | Task: QNLI |
Question: What was the population of the Dutch Republic before this emigration? | ||
Sentence: This was a huge hu ge influx as the entire population of the Dutch Republic amounted to ca. | ||
Prediction: False True | ||
TextFooler (Word-level / Embedding-similarity-based) | Embedding-similarity-based adversarial attacks such as TextFooler select synonyms according to the cosine similarity of word embeddings. Words that have high similarity scores will be used as candidates to replace original words in the sentences. | Task: QQP |
Question 1: I am getting fat on my lower body and on the chest torso, is there any way I can get fit without looking skinny fat? | ||
Question 2: Why I am getting skinny instead of losing body fat? | ||
Prediction: Not Equivalent Equivalent | ||
BERT-ATTACK (Word-level / Context-aware) | BERT-ATTACK uses pre-trained BERT to perform masked language prediction to generate contextualized potential word replacements for those crucial words. | Task: MNLI |
Premise: Do you know what this is? With a dramatic gesture she flung back the left side of her coat sleeve and exposed a small enamelled badge. | ||
Hypothesis: The coat that she wore was long enough to cover her knees . | ||
Prediction: Neutral Contradiction | ||
SememePSO (Word-level / Knowledge-guided) | Knowledge-guided adversarial attacks such as SememePSO use external knowledge base such as HowNet or WordNet to search for substitutions. | Task: QQP |
Question 1: What people who you’ve never met have influenced infected your life the most? | ||
Question 2: Who are people you have never met who have had the greatest influence on your life? | ||
Prediction: Equivalent Not Equivalent | ||
CompAttack (Word-level / Compositions) | CompAttack is a whitebox-based adversarial attack that integrates all other word-level perturbation methods in one algorithm to evaluate model robustness to various adversarial transformations. | Task: SST-2 |
Sentence: The primitive force of this film seems to bubble bybble up from the vast collective memory of the combatants. | ||
Prediction: Positive Negative | ||
SCPN (Sent.-level / Syntactic-based) | SCPN is an attack method based on syntax tree transformations. It is trained to produce a paraphrase of a given sentence with specified syntactic structures. | Task: RTE |
Sentence 1: He became a boxing referee in 1964 and became most well-known for his decision against Mike Tyson, during the Holyfield fight, when Tyson bit Holyfield’s ear. | ||
Sentence 2: Mike Tyson bit Holyfield’s ear in 1964. | ||
Prediction: Not Entailment Entailment | ||
T3 (Sent.-level / Syntactic-based) | T3 is a whitebox attack algorithm that can add perturbations on different levels of the syntax tree and generate the adversarial sentence. | Task: MNLI |
Premise: What’s truly striking, though, is that Jobs has had never really let this idea go. | ||
Hypothesis: Jobs never held onto an idea for long. | ||
Prediction: Contradiction Entailment | ||
AdvFever (Sent.-level / Syntactic-based) | Entailment preserving rules proposed by AdvFever transform all the sentences satisfying the templates into semantically equivalent ones. | Task: SST-2 |
Sentence: I’ll bet the video game is There exists a lot more fun than the film that goes by the name of i ’ll bet the video game. | ||
Prediction: Negative Positive | ||
StressTest (Sent.-level / Distraction-based) | StressTest appends three true statements (“and true is true”, “and false is not true”, “and true is true” for five times) to the end of the hypothesis sentence for NLI tasks. | Task: RTE |
Sentence 1: Yet, we now are discovering that antibiotics are losing their effectiveness against illness. Disease-causing bacteria are mutating faster than we can come up with new antibiotics to fight the new variations. | ||
Sentence 2: Bacteria is winning the war against antibiotics and true is true. | ||
Prediction: Entailment Not Entailment | ||
CheckList (Sent.-level / Distraction-based) | CheckList adds randomly generated URLs and handles to distract model attention. | Task: QNLI |
Question: What was the population of the Dutch Republic before this emigration? https://t.co/DlI9kw | ||
Sentence: This was a huge influx as the entire population of the Dutch Republic amounted to ca. | ||
Prediction: False True |
Perturbations | Explanation | Examples (Strikethrough = Original Text, red = Adversarial Perturbation) |
CheckList (Human-crafted) | CheckList analyses different capabilities of NLP models using different test types. We adopt two capability tests: Temporal and Negation, which test if the model understands the order of events and if the model is sensitive to negations. | Task: SST-2 |
Sentence: I think this movie is perfect, but I used to think it was annoying. | ||
Prediction: Positive Negative | ||
StressTest (Human-crafted) | StressTest proposes carefully crafted rules to construct “stress tests” and evaluate robustness of NLI models to specific linguistic phenomena. Here we adopt the test cases focusing on Numerical Reasoning. | Task: MNLI |
Premise: If Anne’ s speed were doubled, they could clean their house in 3 hours working at their respective rates. | ||
Hypothesis: If Anne’ s speed were doubled, they could clean their house in less than 6 hours working at their respective rates. | ||
Prediction: Entailment Contradiction | ||
ANLI (Human-crafted) | ANLI is a large-scale NLI dataset collected iteratively in a human-in-the-loop manner. The sentence pairs generated in each round form a comprehensive dataset that aims at examining the vulnerability of NLI models. | Task: MNLI |
Premise: Kamila Filipcikova (born 1991) is a female Slovakian fashion model. She has modeled in fashion shows for designers such as Marc Jacobs, Chanel, Givenchy, Dolce & Gabbana, and Sonia Rykiel. And appeared on the cover of Vogue Italia two times in a row. | ||
Hypothesis: Filipcikova lives in Italy. | ||
Prediction: Neutral Contradiction | ||
AdvSQuAD (Human-crafted) | AdvSQuAD is an adversarial dataset targeting at reading comprehension systems. Examples are generated by appending a distracting sentence to the end of the input paragraph. We adopt the distracting sentences and questions in the QNLI format with labels “not answered”. | Task: QNLI |
Question: What day was the Super Bowl played on? | ||
Sentence: The Champ Bowl was played on August 18th,1991. | ||
Prediction: False True |
A.2 Additional Related Work
We discuss more related work about textual adversarial attacks and defenses in this subsection.
Textual Adversarial Attacks
Recent research has shown that deep neural networks (DNNs) are vulnerable to adversarial examples that are carefully crafted to fool machine learning models without disturbing human perception [Goodfellow et al., 2015, Papernot et al., 2016, Moosavi-Dezfooli et al., 2016]. However, compared with the large number of adversarial attacks in the continuous data domain [Yang et al., 2018, Carlini and Wagner, 2018, Eykholt et al., 2017], only a few studies focus on the discrete text domain. Most existing gradient-based attacks on image or audio models are no longer applicable to NLP models, as words are intrinsically discrete tokens. Another challenge for generating adversarial text is to ensure semantic and syntactic coherence and consistency.

Existing textual adversarial attacks can be roughly divided into three categories: word-level transformations, sentence-level attacks, and human-crafted samples. (i) Word-level transformations adopt different word replacement strategies during the attack. For example, existing work [Li et al., 2019, Ebrahimi et al., 2018] applies character-level perturbations to produce carefully crafted typo words (e.g., from “foolish” to “fo0lish”), thus making the model ignore or misunderstand the original statistical cues. Others adopt knowledge-based perturbations and utilize knowledge bases to constrain the search space; for example, Zang et al. [2020] use the sememe-based knowledge base HowNet [Qi et al., 2019] to construct a search space for word substitution. Some works [Jin et al., 2020, Li et al., 2019] use non-contextualized word embeddings from GloVe [Pennington et al., 2014] or Word2Vec [Mikolov et al., 2013] to build synonym candidates, querying the cosine similarity or Euclidean distance between the original and candidate words and selecting the closest ones as replacements. Recent work [Garg and Ramakrishnan, 2020, Li et al., 2020b] also leverages BERT to generate contextualized perturbations via masked language modeling. (ii) Different from the dominant word-level adversarial attacks, sentence-level adversarial attacks perform sentence-level transformation or paraphrasing by perturbing syntactic structures based on human-crafted rules [Naik et al., 2018, Ribeiro et al., 2020] or carefully designed auto-encoders [Iyyer et al., 2018, Wang et al., 2020]. Sentence-level manipulations are generally more challenging than word-level attacks, because the perturbation space for syntactic structures is limited compared to word-level perturbation spaces that grow exponentially with the sentence length. However, sentence-level attacks tend to have higher linguistic quality than word-level ones, as both semantic and syntactic coherence are taken into consideration when generating adversarial sentences. (iii) Human-crafted adversarial examples are generally crafted in a human-in-the-loop manner [Jia and Liang, 2017, Nie et al., 2020, Bartolo et al., 2020] or use manually crafted templates to generate test cases [Naik et al., 2018, Ribeiro et al., 2020]. Our AdvGLUE incorporates all of the above textual adversarial attacks to provide a comprehensive and systematic diagnostic report on existing state-of-the-art large-scale language models.
Defenses against Textual Adversarial Attacks
To defend against textual adversarial attacks, existing work can be classified into three categories. (1) Adversarial Training is a practical method to defend against adversarial examples. Existing work either uses PGD-based attacks to generate adversarial examples in the embedding space of NLP models as data augmentation [Zhu et al., 2020], or regularizes the standard objective using virtual adversarial training [Jiang et al., 2020, Liu et al., 2020, Gan et al., 2020]. However, one drawback is that the threat model is often unknown, which renders adversarial training less effective when facing unseen attacks. (2) Interval Bound Propagation (IBP) [Dvijotham et al., 2018] is proposed as a new technique to consider the worst-case perturbation theoretically. Recent work [Huang et al., 2019, Jia et al., 2019] has applied IBP in the NLP domain to certify the robustness of models. However, IBP-based methods rely on strong assumptions about the model architecture and are difficult to adapt to recent transformer-based language models. (3) Randomized Smoothing [Cohen et al., 2019] provides a tight robustness guarantee in the ℓ2 norm by smoothing the classifier with Gaussian noise. Ye et al. [2020] adapts the idea to the NLP domain and replaces the Gaussian noise with synonym words to certify robustness as long as the adversarial word substitution falls into the predefined synonym sets. However, guaranteeing the completeness of the synonym set is challenging.
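As an illustration of the first category above, here is a minimal PyTorch sketch of embedding-space adversarial training in the spirit of PGD/FreeLB-style methods. It is not any of the published implementations; `model` is assumed to be a HuggingFace-style sequence classifier that exposes `get_input_embeddings()` and accepts `inputs_embeds`, and the step sizes are placeholders.

```python
import torch
import torch.nn.functional as F

def adv_training_step(model, input_ids, attention_mask, labels,
                      eps=1e-3, alpha=3e-4, steps=3):
    """One training step that adds a worst-case perturbation delta to the
    input embeddings (PGD in embedding space) before computing the loss."""
    embeds = model.get_input_embeddings()(input_ids).detach()
    delta = torch.zeros_like(embeds, requires_grad=True)

    # Inner maximization: ascend the loss w.r.t. delta, projecting back into an L_inf ball.
    for _ in range(steps):
        logits = model(inputs_embeds=embeds + delta,
                       attention_mask=attention_mask).logits
        loss = F.cross_entropy(logits, labels)
        grad, = torch.autograd.grad(loss, delta)
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach().requires_grad_(True)

    # Outer minimization: train on clean plus adversarially perturbed embeddings.
    clean_loss = F.cross_entropy(
        model(input_ids=input_ids, attention_mask=attention_mask).logits, labels)
    adv_loss = F.cross_entropy(
        model(inputs_embeds=embeds + delta.detach(), attention_mask=attention_mask).logits, labels)
    return clean_loss + adv_loss
```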
a.3 Task Descriptions, Statistics and Evaluation Metrics
We present the detailed label distribution statistics and evaluation metrics of the GLUE and AdvGLUE benchmarks in Table 8.
Sst-2
The Stanford Sentiment Treebank Socher et al. [2013] consists of sentences from movie reviews and human annotations of their sentiment. Given a review sentence, the task is to predict its sentiment: positive or negative.
Qqp
The Quora Question Pairs (QQP) dataset is a collection of question pairs from the community question-answering website Quora. The task is to determine whether a pair of questions are semantically equivalent.
Mnli
The Multi-Genre Natural Language Inference Corpus Williams et al. [2017] consists of sentence pairs with textual entailment annotations. Given a premise sentence and a hypothesis sentence, the task is to predict whether the premise entails the hypothesis (entailment), contradicts the hypothesis (contradiction), or neither (neutral).
Qnli
Question-answering NLI (QNLI) dataset consists of question-sentence pairs modified from The Stanford Question Answering Dataset Rajpurkar et al. [2016]. The task is to determine whether the context sentence contains the answer to the question.
Rte
The Recognizing Textual Entailment (RTE) dataset is a combination of a series of data from annual textual entailment challenges. Examples are constructed based on news and Wikipedia text. The task is to predict the relationship between a pair of sentences. For consistency, the relationship can be classified into two classes: entailment and not entailment, where neutral and contradiction are seen as not entailment.
Corpus | Task | |Dev| (GLUE) | |Test| (GLUE) | |Dev| (AdvGLUE) | |Test| (AdvGLUE) | Evaluation Metrics |
SST-2 | sentiment | 428:444 | 1821 | 72:76 | 590:830 | acc. |
QQP | paraphrase | 25,545:14,885 | 390,965 | 46:32 | 297:125 | acc./F1 |
QNLI | NLI/QA | 2,702:2,761 | 5,463 | 74:74 | 394:574 | acc. |
RTE | NLI | 146:131 | 3,000 | 35:46 | 123:181 | acc. |
MNLI | NLI | 6,942:6,252:6,453 | 19,643 | 92:84:107 | 706:565:593 | matched acc./mismatched acc. |
We also show the detailed per-task model performance on AdvGLUE and GLUE in Table 9.
Models | Avg | SST-2 | MNLI | RTE | QNLI | QQP | ||||||
GLUE | AdvGLUE | GLUE | AdvGLUE | GLUE | AdvGLUE | GLUE | AdvGLUE | GLUE | AdvGLUE | GLUE | AdvGLUE | |
BERT(Large) | 85.76 | 33.68 | 93.23 | 33.03 | 85.78/85.57 | 28.72/27.05 | 68.95 | 40.46 | 91.91 | 39.77 | 90.72/87.38 | 37.91/16.56 |
RoBERTa(Large) | 91.44 | 50.21 | 95.99 | 58.52 | 89.74/89.86 | 50.78/39.62 | 86.60 | 45.39 | 94.14 | 52.48 | 91.99/89.37 | 57.11/41.80 |
T5(Large) | 90.39 | 56.82 | 95.53 | 60.56 | 88.98/89.20 | 48.43/38.98 | 84.12 | 62.83 | 93.78 | 57.64 | 90.82/88.07 | 63.03/55.68 |
ALBERT(XXLarge) | 91.87 | 59.22 | 95.18 | 66.83 | 89.29/89.88 | 51.83/44.17 | 88.45 | 73.03 | 95.26 | 63.84 | 92.26/89.49 | 56.40/32.35 |
ELECTRA(Large) | 93.16 | 41.69 | 97.13 | 58.59 | 90.71 | 14.62/20.22 | 90.25 | 23.03 | 95.17 | 57.54 | 92.56 | 61.37/42.40 |
DeBERTa(Large) | 92.67 | 60.86 | 96.33 | 57.89 | 90.95/90.85 | 58.36/52.46 | 90.25 | 78.94 | 94.86 | 57.85 | 92.29/89.69 | 60.43/47.98 |
SMART(BERT) | 85.70 | 30.29 | 93.35 | 25.21 | 84.72/85.34 | 26.89/23.32 | 69.68 | 38.16 | 91.71 | 34.61 | 90.25/87.22 | 36.49/20.24 |
SMART(RoBERTa) | 92.62 | 53.71 | 96.56 | 50.92 | 90.75/90.66 | 45.56/36.07 | 90.98 | 70.39 | 95.04 | 52.17 | 91.20/88.44 | 64.22/44.28 |
FreeLB(RoBERTa) | 92.28 | 50.47 | 96.44 | 61.69 | 90.64 | 31.59/27.60 | 86.69 | 62.17 | 95.04 | 62.29 | 92.58 | 42.18/31.07 |
InfoBERT(RoBERTa) | 89.06 | 46.04 | 96.22 | 47.61 | 89.67/89.27 | 50.39/41.26 | 74.01 | 39.47 | 94.62 | 54.86 | 92.25/89.70 | 49.29/35.54 |
a.4 Implementation Details of Adversarial Attacks
TextBugger
To ensure the small magnitude of the perturbation, we consider the following five strategies: (1) randomly inserting a space into a word; (2) randomly deleting a character of a word; (3) randomly replacing a character of a word with its adjacent character on the keyboard; (4) randomly replacing a character of a word with a visually similar counterpart (e.g., “0” vs. “o”, “1” vs. “l”); and (5) randomly swapping two characters in a word. The first four strategies guarantee that the word edit distance between the typo word and its original word is 1, and that of the last strategy is limited to 2. Following the default setting, in Strategy (1) we only insert a space into a word when the word contains less than characters, and in Strategy (5) we swap characters in a word only when the word has more than characters.
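A minimal sketch of the five character-level strategies is given below. The keyboard and visual-similarity maps are small illustrative excerpts, and the word-length guard in the swap strategy uses a placeholder threshold, since the exact values follow the official TextBugger defaults.

```python
import random

KEYBOARD_NEIGHBORS = {"a": "qwsz", "s": "awedxz", "o": "i0plk", "l": "kop"}  # partial map
VISUAL_SWAPS = {"o": "0", "l": "1", "i": "1", "a": "@", "s": "$"}

def insert_space(word):                       # (1) insert a space inside the word
    i = random.randrange(1, len(word))
    return word[:i] + " " + word[i:]

def delete_char(word):                        # (2) delete one character
    i = random.randrange(len(word))
    return word[:i] + word[i + 1:]

def keyboard_typo(word):                      # (3) replace with a keyboard-adjacent character
    idxs = [i for i, c in enumerate(word) if c in KEYBOARD_NEIGHBORS]
    if not idxs:
        return word
    i = random.choice(idxs)
    return word[:i] + random.choice(KEYBOARD_NEIGHBORS[word[i]]) + word[i + 1:]

def visual_swap(word):                        # (4) replace with a visually similar character
    idxs = [i for i, c in enumerate(word) if c in VISUAL_SWAPS]
    if not idxs:
        return word
    i = random.choice(idxs)
    return word[:i] + VISUAL_SWAPS[word[i]] + word[i + 1:]

def swap_chars(word):                         # (5) swap two adjacent characters (edit distance 2)
    if len(word) < 4:                         # length guard (threshold is illustrative)
        return word
    i = random.randrange(1, len(word) - 2)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]

print([f("foolish") for f in (insert_space, delete_char, keyboard_typo, visual_swap, swap_chars)])
```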
TextFooler
Concretely, for the sentiment analysis tasks, we set the cosine similarity threshold to be , which encourages the synonyms to be semantically close to original ones and enhances the quality of adversarial data. For the rest of the tasks, we follow the default hyper-parameter to set the cosine similarity threshold to be . Besides, the number of synonyms for each word is set to following the default setting.
Bert-Attack
We follow the hyper-parameters from the official codebase, setting the number of candidate words to 48 and the cosine similarity threshold to in order to filter out antonyms using synonym dictionaries, as the BERT masked language model does not distinguish synonyms from antonyms.
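For illustration, the masked-language-model candidate generation that BERT-Attack relies on can be sketched with the HuggingFace fill-mask pipeline as below. This is not the official BERT-Attack code; the model name and the example `top_k` value are assumptions.

```python
from transformers import pipeline

# Masked-LM candidate generation in the spirit of BERT-Attack (illustrative only).
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

def contextual_candidates(sentence, target_word, top_k=48):
    """Mask `target_word` in `sentence` and return the masked LM's top replacements."""
    masked = sentence.replace(target_word, fill_mask.tokenizer.mask_token, 1)
    preds = fill_mask(masked, top_k=top_k)
    # Note: the masked LM does not distinguish synonyms from antonyms, which is why
    # an additional synonym-dictionary filter (as described above) is needed.
    return [p["token_str"].strip() for p in preds if p["token_str"].strip() != target_word]

print(contextual_candidates("the movie was foolish and dull", "foolish", top_k=10))
```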
SememePSO
We adopt the official hyper-parameters, in which the maximum and minimum inertia weights are set to and , respectively. We also set the maximum and minimum movement probabilities of the particles to and , respectively, following the default setting. The population size is set to in every task.
CompAttack
We follow the T3 [Wang et al., 2020] and C&W attack [Carlini and Wagner, 2018] and design the same optimization objective for adversarial perturbation generation in the embedding space as:
$\min_{z'} \ \|z' - z\|_2^2 + c \cdot \ell_{\mathrm{adv}}(z')$   (3)
where the first term controls the magnitude of the perturbation, $\ell_{\mathrm{adv}}$ is the attack objective function depending on the attack scenario, and $c$ weighs the attack goal against the attack cost. CompAttack constrains the perturbation to be close to a pre-defined perturbation space, including the typo space (e.g., TextBugger), knowledge space (e.g., WordNet), and contextualized embedding space (e.g., BERT embedding clusters), to make sure the perturbation is valid. We can also see from Table 3 that CompAttack has an overall lower filter rate than other state-of-the-art attack methods.
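Below is a minimal PyTorch sketch of optimizing an embedding-space perturbation under an objective of this form. It illustrates the general C&W/T3-style formulation rather than the released CompAttack code; the untargeted margin loss, the constant `c`, the use of Adam, and the HuggingFace-style `model` interface are all assumptions.

```python
import torch
import torch.nn.functional as F

def embedding_space_attack(model, input_ids, attention_mask, label,
                           c=1.0, steps=100, lr=0.1):
    """Optimize a perturbed embedding z' = z + delta that stays close to the
    original embedding z while maximizing an attack objective, as in Eq. (3)."""
    z = model.get_input_embeddings()(input_ids).detach()
    delta = torch.zeros_like(z, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)

    for _ in range(steps):
        logits = model(inputs_embeds=z + delta, attention_mask=attention_mask).logits
        # Untargeted margin loss: push the true-class logit below the best other logit.
        true_logit = logits.gather(1, label.unsqueeze(1)).squeeze(1)
        other_logit = logits.scatter(1, label.unsqueeze(1), float("-inf")).max(dim=1).values
        attack_loss = F.relu(true_logit - other_logit)   # plays the role of l_adv(z')
        magnitude = (delta ** 2).sum(dim=(1, 2))          # ||z' - z||_2^2
        loss = (magnitude + c * attack_loss).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return (z + delta).detach()
```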
Scpn
We use the pre-trained SCPN models released by the official codebase. Following the default setting, we select the most frequent templates from the ParaNMT-50M corpus Wieting and Gimpel [2017] to guide the generation process. We first parse sentences from the GLUE dev set using Stanford CoreNLP, version 3.7.0, along with the Shift-Reduce Parser models.
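For reference, constituency parses of this kind can be obtained through the stanza CoreNLP client as sketched below. This is a hedged illustration rather than the official SCPN preprocessing script; it assumes a local CoreNLP installation (pointed to by `CORENLP_HOME`) with the Shift-Reduce Parser models available.

```python
from stanza.server import CoreNLPClient

# Illustrative parse step (assumes CoreNLP is installed locally and CORENLP_HOME is set).
text = "The primitive force of this film seems to bubble up."

with CoreNLPClient(annotators=["tokenize", "ssplit", "pos", "parse"],
                   timeout=30000, memory="4G", be_quiet=True) as client:
    ann = client.annotate(text)
    for sentence in ann.sentence:
        # parseTree holds the constituency parse used to select a paraphrase template.
        print(sentence.parseTree)
```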
T3
We follow the hyper-parameters in the official setting, where the scaling constant is set to and the optimizing confidence is set to . In each iteration, we optimize the perturbation vector for at most steps with learning rate .
AdvFever
We follow the entailment-preserving rules proposed by the official implementation and adopt all templates to transform original sentences into semantically equivalent ones. These templates cover many common sentence patterns.
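As a toy illustration of this kind of template-based rewriting, the sketch below applies two made-up entailment-preserving templates; they are not the official AdvFever rule set.

```python
# Illustrative entailment-preserving templates (not the official AdvFever rules).
TEMPLATES = [
    "It is true that {s}",
    "There is no doubt that {s}",
]

def paraphrase(sentence):
    """Return semantically equivalent rewrites of a declarative sentence."""
    core = sentence.rstrip(".")
    return [t.format(s=core) + "." for t in TEMPLATES]

for rewrite in paraphrase("Roman Atwood is a content creator."):
    print(rewrite)
```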
a.5 Examples of AdvGLUE benchmark
We show more comprehensive examples in Table 10. Examples are generated with different levels of perturbation, and all of them successfully change the predictions of all surrogate models (BERT, RoBERTa, and the RoBERTa ensemble).
Task | Linguistic Phenomenon | Samples (Strikethrough = Original Text, red = Adversarial Perturbation) | Label Prediction |
SST-2 | Typo (Word-level) | Sentence: The primitive force of this film seems to bubble bybble up from the vast collective memory of the combatants. | Positive Negative |
SST-2 | Context-aware (Word-level) | Sentence: In execution , this clever idea is far less smaller funny than the original , killers from space. | Negative Positive |
SST-2 | CheckList (Human-crafted) | Sentence: I think this movie is perfect, but I used to think it was annoying. | Positive Negative |
QQP | Embedding (Word-level) | Question 1: I am getting fat on my lower body and on the chest torso, is there any way I can get fit without looking skinny fat? | Not Equivalent Equivalent |
Question 2: Why I am getting skinny instead of losing body fat? | |||
QQP | Syntactic (Sent.-level) | Question 1: Can I learn MMA at the age of 26? You can learn MMA at 24? | Not Equivalent Equivalent |
Question 2: Can I learn MMA at the age of 24? | |||
QQP | CheckList (Human-crafted) | Question 1: Is Alfred Kennedy an analyst? | Not Equivalent Equivalent |
Question 2: Is Alfred Kennedy becoming an analyst? | |||
MNLI | Typo (Word-level) | Premise: uh-huh how about any matching mathcing programs | Entailment Contradiction |
Hypothesis: What about matching programs? | |||
MNLI | Distraction (Sent.-level) | Premise: You and your friends are not welcome here, said Severn. | Entailment Contradiction |
Hypothesis: Severn said the people were not welcome there and true is true. | |||
MNLI | ANLI (Human-crafted) | Premise: Kamila Filipcikova (born 1991) is a female Slovakian fashion model. She has modeled in fashion shows for designers such as Marc Jacobs, Chanel, Givenchy, Dolce & Gabbana, and Sonia Rykiel. And appeared on the cover of Vogue Italia two times in a row. | Neutral Contradiction |
Hypothesis: Filipcikova lives in Italy. | |||
QNLI | Distraction (Sent.-level) | Question: What was the population of the Dutch Republic before this emigration? https://t.co/DlI9kw | False True |
Sentence: This was a huge influx as the entire population of the Dutch Republic amounted to ca. | |||
QNLI | AdvSQuAD (Human-crafted) | Question: What day was the Super Bowl played on? | False True |
Sentence: The Champ Bowl was played on August 18th, 1991. | | |
RTE | Knowledge (Word-level) | Sentence 1: In Nigeria, by far the most populous country in sub-Saharan Africa, over 2.7 million people are exist infected with HIV. | Not Entailment Entailment |
Sentence 2: 2.7 percent of the people infected with HIV live in Africa. | |||
RTE | Syntactic (Sent.-level) | Sentence 1: He became a boxing referee in 1964 and became most well-known for his decision against Mike Tyson, during the Holyfield fight, when Tyson bit Holyfield’s ear. | Not Entailment Entailment |
Sentence 2: Mike Tyson bit Holyfield’s ear in 1964. |
a.6 Fine-tuning Details of Large-Scale Language Models
For all experiments, we use a GPU cluster with 8 V100 GPUs and 256GB of memory.
BERT (Large)
For RTE, we train our model for epochs and for other tasks we train our model for epochs. Batch size for QNLI is set to , and for other tasks it is set to . Learning rates are all set to .
ELECTRA (Large)
We follow the official hyper-parameter setting to set the learning rate to and set batch size to . We train ELECTRA on RTE for epochs and train for epochs on other tasks. We set the weight decay rate to for every task.
RoBERTa (Large)
We train our RoBERTa for epochs with learning rate on each task. The batch size for QNLI is and for other tasks.
T5 (Large)
We train our T5 for epochs with learning rate on each task. The batch size for QNLI is and for other tasks. We follow the templates in the original paper to convert GLUE tasks into generation tasks.
ALBERT (XXLarge)
We use the default hyper-parameters to train our ALBERT. For example, the maximum training steps for SST-2, MNLI, QNLI, QQP, and RTE are , , , , respectively. For MNLI and QQP, the batch size is set to , and for other tasks the batch size is set to .
DeBERTa (Large)
We use the official hyper-parameters to train our DeBERTa. For example, learning rate is set to across all tasks. For MNLI and QQP, batch size is set to and for other tasks batch size is set to 32.
Smart
For SMART(BERT) and SMART(RoBERTa), we use grid search to search for the best parameters and report the best performance among all trained models.
FreeLB (RoBERTa)
For FreeLB, we test every parameter combination provided by the official codebase and select the best parameters for our training.
InfoBERT (RoBERTa)
We set the batch size to and learning rate to for all tasks.
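Since the per-model hyper-parameter values above follow the official recipes, the following is only a generic, hedged sketch of fine-tuning one of these models on a GLUE task with the HuggingFace `Trainer`. The model name, epochs, batch size, and learning rate are placeholders, not the values used in the paper.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Placeholder model and hyper-parameters; the paper follows each model's official recipe.
model_name = "roberta-large"
task = "rte"

dataset = load_dataset("glue", task)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

def preprocess(batch):
    # RTE provides sentence pairs; other GLUE tasks use different column names.
    return tokenizer(batch["sentence1"], batch["sentence2"],
                     truncation=True, max_length=128)

encoded = dataset.map(preprocess, batched=True)

args = TrainingArguments(
    output_dir="./rte-finetune",
    num_train_epochs=10,              # placeholder
    per_device_train_batch_size=16,   # placeholder
    learning_rate=1e-5,               # placeholder
)

trainer = Trainer(model=model, args=args,
                  train_dataset=encoded["train"],
                  eval_dataset=encoded["validation"],
                  tokenizer=tokenizer)
trainer.train()
```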
a.7 Human Evaluation Details
Corpus | Pay Rate (per batch) | # Qualified Workers | Human Acc. (Avg.) | Human Acc. (Vote) | Fleiss Kappa
SST-2 | $0.4 | 70 | 89.2 | 95.0 | 0.738 |
MNLI | $1.0 | 33 | 80.4 | 85.0 | 0.615 |
RTE | $1.0 | 66 | 85.8 | 92.0 | 0.602 |
QNLI | $1.0 | 41 | 85.6 | 91.0 | 0.684 |
QQP | $0.5 | 58 | 86.4 | 90.0 | 0.691 |
Human Training
We present the pay rate and the number of qualified workers in Table 11. We also test our qualified workers on another non-overlapping set of 100 samples from the GLUE dev set of each task. We can see that the human accuracy is comparable to that reported by Nangia and Bowman [2019], which means that most of our selected annotators understand the GLUE tasks well.
Human Filtering
The detailed filtering statistics of each stage are shown in Table 12. We can see that around of the examples are filtered out due to low transferability and a high word modification rate. Among the remaining samples, around of the examples are filtered out due to low human agreement rates (Human Consensus Filtering), and around are filtered out due to semantic changes that lead to label changes (Utility Preserving Filtering).
Tasks | Metrics | Word-level Attacks | Average | ||||
SememePSO | TextFooler | TextBugger | CompAttack | BERT-ATTACK | |
SST-2 | Transferability | 58.85 | 63.56 | 64.87 | 53.58 | 66.87 | 61.54 |
Fidelity | 14.65 | 11.06 | 22.40 | 19.93 | 12.03 | 16.01 | |
Human Consensus | 10.53 | 10.56 | 2.27 | 9.92 | 7.09 | 8.07 | |
Utility Preserving | 6.68 | 5.43 | 0.51 | 3.20 | 3.82 | 3.93 | |
Filter Rate | 90.71 | 90.62 | 90.04 | 86.63 | 89.81 | 89.56 | |
MNLI | Transferability | 44.16 | 43.15 | 42.58 | 35.08 | 41.80 | 41.36 |
Fidelity | 36.57 | 45.94 | 37.71 | 38.14 | 38.60 | 39.39 | |
Human Consensus | 10.37 | 6.38 | 5.51 | 11.15 | 9.78 | 8.64 | |
Utility Preserving | 4.49 | 2.08 | 1.32 | 11.07 | 5.91 | 4.97 | |
Filter Rate | 95.59 | 97.55 | 87.12 | 95.45 | 96.10 | 94.36 | |
RTE | Transferability | 55.32 | 67.38 | 41.96 | 54.20 | 60.94 | 55.96 |
Fidelity | 19.83 | 7.79 | 42.18 | 23.17 | 14.25 | 21.44 | |
Human Consensus | 8.08 | 7.91 | 3.55 | 7.64 | 8.44 | 7.12 | |
Utility Preserving | 8.69 | 6.13 | 0.60 | 5.70 | 8.54 | 5.93 | |
Filter Rate | 91.93 | 89.21 | 88.29 | 90.72 | 92.16 | 90.46 | |
QNLI | Transferability | 63.36 | 70.67 | 59.24 | 55.47 | 69.15 | 63.58 |
Fidelity | 17.73 | 13.01 | 25.31 | 23.53 | 13.17 | 18.55 | |
Human Consensus | 10.06 | 9.80 | 6.84 | 9.98 | 9.36 | 9.21 | |
Utility Preserving | 3.48 | 2.41 | 1.50 | 4.94 | 4.10 | 3.29 | |
Filter Rate | 94.63 | 95.89 | 92.89 | 93.92 | 95.78 | 94.62 | |
QQP | Transferability | 42.96 | 58.60 | 55.09 | 44.83 | 51.97 | 50.69 |
Fidelity | 45.61 | 29.35 | 26.46 | 30.99 | 37.77 | 34.04 | |
Human Consensus | 4.38 | 4.69 | 5.19 | 10.08 | 3.94 | 5.66 | |
Utility Preserving | 3.79 | 3.86 | 3.16 | 7.93 | 4.60 | 4.67 | |
Filter Rate | 96.73 | 96.50 | 89.90 | 93.83 | 98.28 | 95.05 |
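To make the Human Consensus Filtering and Utility Preserving Filtering stages concrete, here is a minimal sketch of the per-example voting logic. The data structures and the vote threshold are illustrative placeholders, not the exact curation pipeline.

```python
from collections import Counter

def consensus_filter(annotations, original_label, min_votes=4):
    """Keep an adversarial example only if (a) annotators largely agree on a label
    (Human Consensus Filtering) and (b) that label matches the original one
    (Utility Preserving Filtering). The vote threshold is a placeholder."""
    votes = Counter(annotations)
    top_label, top_count = votes.most_common(1)[0]
    if top_count < min_votes:
        return False, "low human agreement"
    if top_label != original_label:
        return False, "label changed by the perturbation"
    return True, "kept"

# Example: five annotators label one adversarial SST-2 example.
print(consensus_filter(["positive"] * 4 + ["negative"], "positive"))       # (True, 'kept')
print(consensus_filter(["positive"] * 3 + ["negative"] * 2, "positive"))   # (False, 'low human agreement')
```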
Human Annotation Instructions
We show examples of annotation instructions in the training phase and filtering phase on MNLI in Figure 2 and 3. More instructions can be found in https://adversarialglue.github.io/instructions. We also provide a FAQ document in each task description page https://docs.google.com/document/d/1MikHUdyvcsrPqE8x-N-gHaLUNAbA6-Uvy-iA5gkStoc/edit?usp=sharing.
[Figure 2: Annotation instructions for the training phase on MNLI.]
[Figure 3: Annotation instructions for the filtering phase on MNLI.]
a.8 Discussion of Limitations
Due to the constraints of computational resources, we are unable to conduct a comprehensive evaluation of all existing language models. However, with the release of our leaderboard website, we are expecting researchers to actively submit their models and evaluate against our AdvGLUE benchmark to have a systematic understanding of model robustness. We are also interested in the adversarial robustness of large-scale auto-regressive language models under the few-shot settings, and leave it as a compelling future work.
In this paper, we follow ANLI [Nie et al., 2020] and generate adversarial examples against surrogate models based on BERT and RoBERTa. However, there are concerns [Bowman and Dahl, 2021] that such adversarial filtering may not be able to fairly benchmark model robustness, as participants may top the leaderboard simply by producing errors different from those of our surrogate models. We note that such concerns can be mitigated by systematic data curation. As shown in our main benchmark results, we successfully select adversarial examples with high adversarial transferability that unveil vulnerabilities shared across models of different architectures. Specifically, we observe a large performance gap for ELECTRA (Large), which is pre-trained on different data and turns out to be less robust than one of the surrogate models, RoBERTa (Large).
Finally, we emphasize that our AdvGLUE benchmark mainly focuses on robustness evaluation. Thus, AdvGLUE can also be considered a supplementary diagnostic test set alongside the standard GLUE benchmark. We suggest that participants evaluate their models on both the GLUE benchmark and our AdvGLUE to understand both model generalization and robustness. We hope our work can help researchers develop models with high generalization and adversarial robustness.
a.9 Website
We present the diagnostic report on our website in Figure 4.
[Figure 4: The diagnostic report on the AdvGLUE website.]
Appendix B Data Sheet
We follow the documentation frameworks provided by Gebru et al. [2018].
b.1 Motivation
For what purpose was the dataset created?
While many recent methods (SMART, FreeLB, InfoBERT, ALUM) claim to improve model robustness against adversarial attacks, the adversary setup in these methods (1) lacks a unified standard and usually differs across methods; and (2) fails to cover comprehensive linguistic transformations (typos, synonymous substitution, paraphrasing, etc.), so it is unclear to which levels of adversarial attack the models remain vulnerable. This motivates us to build a unified and principled robustness benchmark dataset and to evaluate to what extent state-of-the-art models have progressed in terms of adversarial robustness.
Who created the dataset (e.g., which team, research group) and on behalf of which entity (e.g., company, institution, organization)?
University of Illinois at Urbana-Champaign (UIUC) and Microsoft Corporation.
b.2 Composition/collection process/preprocessing/cleaning/labeling and uses:
The answers are described in our paper as well as website https://adversarialglue.github.io.
b.3 Distribution
Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which the dataset was created?
The dev set is released to the public. The test set is hidden and can only be evaluated by an automatic submission API hosted on CodaLab.
How will the dataset be distributed (e.g., tarball on website, API, GitHub)?
The dev set is released on our website https://adversarialglue.github.io. The test set is hidden and hosted on CodaLab.
When will the dataset be distributed?
It has been released now.
Will the dataset be distributed under a copyright or other intellectual property (IP) license, and/or under applicable terms of use (ToU)?
Our dataset will be distributed under the CC BY-SA 4.0 license.
b.4 Maintenance
How can the owner/curator/manager of the dataset be contacted (e.g., email address)?
Boxin Wang (boxinw2@illinois.edu) and Chejian Xu (xuchejian@zju.edu.cn) will be responsible for maintenance.
Will the dataset be updated (e.g., to correct labeling errors, add new instances, delete instances)?
Yes. If we include more tasks or find any errors, we will correct the dataset and update the leaderboard accordingly. It will be updated on our website.
If others want to extend/augment/build on/contribute to the dataset, is there a mechanism for them to do so?
They can contact us via email to contribute.