Adversarial NLI: A New Benchmark for Natural Language Understanding

by Yixin Nie, et al.

We introduce a new large-scale NLI benchmark dataset, collected via an iterative, adversarial human-and-model-in-the-loop procedure. We show that training models on this new dataset leads to state-of-the-art performance on a variety of popular NLI benchmarks, while posing a more difficult challenge with its new test set. Our analysis sheds light on the shortcomings of current state-of-the-art models, and shows that non-expert annotators are successful at finding their weaknesses. The data collection method can be applied in a never-ending learning scenario, becoming a moving target for NLU, rather than a static benchmark that will quickly saturate.





1 Introduction

Progress in AI has been driven by, among other things, the development of challenging large-scale benchmarks like ImageNet Russakovsky et al. (2015) in computer vision, and SNLI Bowman et al. (2015), SQuAD Rajpurkar et al. (2016), and others in natural language processing (NLP). Recently, for natural language understanding (NLU) in particular, the focus has shifted to combined benchmarks like SentEval Conneau and Kiela (2018) and GLUE Wang et al. (2018), which track model performance on multiple tasks and provide a unified platform for analysis.

With the rapid pace of advancement in AI, however, NLU benchmarks struggle to keep up with model improvement. Whereas it took around 15 years to achieve “near-human performance” on MNIST LeCun et al. (1998); Cireşan et al. (2012); Wan et al. (2013) and approximately 7 years to surpass humans on ImageNet Deng et al. (2009); Russakovsky et al. (2015); He et al. (2016), the GLUE benchmark did not last as long as we would have hoped after the advent of BERT Devlin et al. (2018), and rapidly had to be extended into SuperGLUE Wang et al. (2019). This raises an important question: Can we collect a large benchmark dataset that can last longer?

The speed with which benchmarks become obsolete raises another important question: are current NLU models genuinely as good as their high performance on benchmarks suggests? A growing body of evidence shows that state-of-the-art models learn to exploit spurious statistical patterns in datasets Gururangan et al. (2018); Poliak et al. (2018); Tsuchiya (2018); Glockner et al. (2018); Geva et al. (2019); McCoy et al. (2019), instead of learning meaning in the flexible and generalizable way that humans do. Given this, human annotators—be they seasoned NLP researchers or non-experts—might easily be able to construct examples that expose model brittleness.

Figure 1: Adversarial NLI data collection procedure, via human-and-model-in-the-loop entailment training (HAMLET). The four steps make up one round of data collection.

We propose an iterative, adversarial human-and-model-in-the-loop solution for NLU dataset collection that addresses both benchmark longevity and robustness. In the first stage, human annotators devise examples that our current best models cannot determine the correct label for. These resulting hard examples—which should expose additional model weaknesses—can be added to the training set and used to train a stronger model. We then subject the strengthened model to the same procedure and collect more weaknesses over several rounds. After each round, we both train a new model and set aside a new test set. The process can be iteratively repeated in a never-ending learning Mitchell et al. (2018) setting, with the model getting stronger and the test set getting harder in each new round. This process yields a “moving post” dynamic target for NLU systems, rather than a static benchmark that will eventually saturate.
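The loop described above can be sketched in a few lines. Here `predict` and `write_hypothesis` are hypothetical stand-ins for the model and the human annotator; this is a minimal illustration, not the authors' implementation.

```python
def collect_round(predict, write_hypothesis, contexts,
                  labels=("entailment", "neutral", "contradiction")):
    """One round of human-and-model-in-the-loop collection (sketch).

    For each context and desired target label, a human writes a
    hypothesis; examples the model gets wrong are the candidate
    'hard' examples that feed the next round's training set.
    """
    examples = []
    for context in contexts:
        for target in labels:
            hypothesis = write_hypothesis(context, target)
            model_label = predict(context, hypothesis)
            examples.append({
                "context": context,
                "hypothesis": hypothesis,
                "target": target,
                "model_fooled": model_label != target,
            })
    return examples
```

The `model_fooled` examples would then go to human verification before entering the next round's training data.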

Our approach draws inspiration from recent efforts that gamify collaborative training of machine learning agents over multiple rounds Yang et al. (2017) and that pit “builders” against “breakers” to learn better models Ettinger et al. (2017). Recently, Dinan et al. (2019) showed that a similar approach can be used to make dialogue safety classifiers more robust. Here, we focus on natural language inference (NLI), arguably the most canonical task in NLU. We collected three rounds of data, and call our new dataset Adversarial NLI (ANLI).

Our contributions are as follows: 1) We introduce a novel human-and-model-in-the-loop dataset, currently consisting of three rounds that progressively increase in difficulty and complexity, that includes annotator-provided explanations. 2) We show that training models on this new dataset leads to state-of-the-art performance on a variety of popular NLI benchmarks. 3) We provide a detailed analysis of the collected data that sheds light on the shortcomings of current models, categorizes the data by inference type to examine weaknesses, and demonstrates good performance on NLI stress tests. The ANLI dataset and a demo of the annotation procedure are publicly available.

Table 1: Examples from the development set. ‘A’ refers to the round number; ‘orig.’ is the original writer’s gold label, ‘pred.’ is the model prediction, ‘valid.’ are the validator labels; the reason was provided by the original writer; the annotations are tags assigned by an expert linguist annotator.

Example 1 (Round A1, Wiki):
Premise: Roberto Javier Mora García (c. 1962 – 16 March 2004) was a Mexican journalist and editorial director of “El Mañana”, a newspaper based in Nuevo Laredo, Tamaulipas, Mexico. He worked for a number of media outlets in Mexico, including the “El Norte” and “El Diario de Monterrey”, prior to his assassination.
Hypothesis: Another individual laid waste to Roberto Javier Mora Garcia.
Reason: The context states that Roberto Javier Mora Garcia was assassinated, so another person had to have “laid waste to him.” The system most likely had a hard time figuring this out due to it not recognizing the phrase “laid waste.”
Labels: orig. E, pred. N, valid. E E
Annotations: Lexical (similar: assassination, laid waste); Tricky (presupposition); Basic (idiom)

Example 2 (Round A2, Wiki):
Premise: A melee weapon is any weapon used in direct hand-to-hand combat; by contrast with ranged weapons which act at a distance. The term “melee” originates in the 1640s from the French word “mêlée”, which refers to hand-to-hand combat, a close quarters battle, a brawl, a confused fight, etc. Melee weapons can be broadly divided into three categories
Hypothesis: Melee weapons are good for ranged and hand-to-hand combat.
Reason: Melee weapons are good for hand to hand combat, but NOT ranged.
Labels: orig. C, pred. E, valid. C N C
Annotations: Basic (conjunction); Tricky (exhaustification); Reasoning (facts)

Example 3 (Round A3, News):
Premise: If you can dream it, you can achieve it—unless you’re a goose trying to play a very human game of rugby. In the video above, one bold bird took a chance when it ran onto a rugby field mid-play. Things got dicey when it got into a tussle with another player, but it shook it off and kept right on running. After the play ended, the players escorted the feisty goose off the pitch. It was a risky move, but the crowd chanting its name was well worth it.
Hypothesis: The crowd believed they knew the name of the goose running on the field.
Reason: Because the crowd was chanting its name, the crowd must have believed they knew the goose’s name. The word “believe” may have made the system think this was an ambiguous statement.
Labels: orig. E, pred. N, valid. E E
Annotations: Reasoning (facts); Reference (coreference)

2 Dataset collection

The primary aim of this work is to create a new large-scale NLI benchmark on which current state-of-the-art models fail. This constitutes a new target for the field to work towards, and can elucidate model capabilities and limitations. As noted, however, static benchmarks do not last very long these days. If continuously deployed, the data collection procedure we introduce here can pose a dynamic challenge that allows for never-ending learning.

2.1 Hamlet

To paraphrase the great bard Shakespeare (1603), there is something rotten in the state of the art. We propose Human-And-Model-in-the-Loop Entailment Training (HAMLET), a training procedure to automatically mitigate problems with current dataset collection procedures (see Figure 1).

In our setup, our starting point is a base model, trained on NLI data. Rather than employing automated adversarial methods, here the model’s “adversary” is a human annotator. Given a context (also often called a “premise” in NLI), and a desired target label, we ask the human writer to provide a hypothesis that fools the model into misclassifying the label. One can think of the writer as a “white hat” hacker, trying to identify vulnerabilities in the system. For each human-generated example that is misclassified, we also ask the writer to provide a reason why they believe it was misclassified.

For examples that the model misclassified, it is necessary to verify that they are actually correct, i.e., that the given context-hypothesis pairs genuinely have their specified target label. The best way to do this is to have them checked by other humans. Hence, we provide each example to human verifiers. If two human verifiers agree with the writer, the example is considered a good example. If they disagree, we ask a third human verifier to break the tie. If there is still disagreement between the writer and the verifiers, the example is discarded. Occasionally, the verifiers will overrule the original label of the writer.
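The verification logic can be sketched as follows. The exact tie-break policy (treating two of three verifiers agreeing on another label as an overrule) is our assumption beyond what the text specifies.

```python
from collections import Counter

def resolve_label(writer_label, verifier_labels):
    """Return the final gold label for an example, or None to discard.

    Follows the verification scheme described above: two verifiers
    agreeing with the writer confirm the example; otherwise a third
    verifier breaks the tie, possibly overruling the writer; with no
    majority at all, the example is discarded.
    """
    # First two verifiers agreeing with the writer: verified as-is.
    if verifier_labels[0] == verifier_labels[1] == writer_label:
        return writer_label
    # Otherwise a third verifier is consulted and majority decides.
    votes = Counter(verifier_labels[:3])
    label, count = votes.most_common(1)[0]
    if count >= 2:
        return label   # may overrule the writer's original label
    return None        # no agreement between writer and verifiers: discard
```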

Once data collection for the current round is finished, we construct a new training set from the collected data, with accompanying development and test sets. While the training set also includes examples that the model classified correctly, the development and test sets are built solely from verified model-fooling examples. The test set was further restricted so as to: 1) include pairs from “exclusive” annotators that are never included in the training data; and 2) be balanced by label classes (and genres, where applicable). We subsequently train a new model on this and other existing data, and repeat the procedure three times.
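The split construction above can be sketched as follows; the field names (`worker`, `label`) and the per-label quota are illustrative, not the dataset's actual schema.

```python
from collections import defaultdict

def build_splits(verified_examples, exclusive_workers, per_label_test):
    """Sketch of the split construction described above: verified
    examples from 'exclusive' annotators feed a label-balanced test
    set and never enter training.
    """
    test_pool = defaultdict(list)
    train_pool = []
    for ex in verified_examples:
        if ex["worker"] in exclusive_workers:
            test_pool[ex["label"]].append(ex)
        else:
            train_pool.append(ex)
    # Balance the test set by taking the same number per label class.
    test = []
    for label in sorted(test_pool):
        test.extend(test_pool[label][:per_label_test])
    return train_pool, test
```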

2.2 Annotation details

We employed crowdsourced workers from Mechanical Turk with qualifications, and collected hypotheses via the ParlAI framework. Annotators are presented with a context and a target label (‘entailment’, ‘contradiction’, or ‘neutral’) and asked to write a hypothesis that corresponds to the label. We phrase the label classes as “definitely correct”, “definitely incorrect”, or “neither definitely correct nor definitely incorrect” given the context, to make the task easier to grasp. Submitted hypotheses are given to the model, which predicts a label for the context-hypothesis pair; the probability of each label is returned to the worker as feedback. If the model predicts the label incorrectly, the job is complete. If not, the worker continues to write hypotheses for the given (context, target-label) pair until the model predicts the label incorrectly or the number of tries exceeds a threshold (5 tries in the first round, 10 tries thereafter).
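The writer-model feedback loop can be sketched as below. `write_attempt` and `predict_probs` are hypothetical callables standing in for the worker interface and the model server.

```python
def annotation_loop(context, target, write_attempt, predict_probs, max_tries=5):
    """Sketch of the feedback loop described above: the worker keeps
    writing hypotheses, seeing the model's label probabilities after
    each attempt, until the model mislabels one or the try budget
    runs out (5 tries in round 1, 10 thereafter).
    """
    for attempt in range(1, max_tries + 1):
        hypothesis = write_attempt(context, target, attempt)
        probs = predict_probs(context, hypothesis)   # feedback shown to worker
        predicted = max(probs, key=probs.get)
        if predicted != target:
            return hypothesis, attempt               # model fooled: job complete
    return None, max_tries                           # budget exhausted
```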

To encourage workers, payments increased as rounds became harder. For hypotheses that the model predicted the incorrect label for, but were verified by other humans, we paid an additional bonus on top of the standard rate.

2.3 Round 1

For the first round, we used a BERT-Large model Devlin et al. (2018) trained on a concatenation of SNLI Bowman et al. (2015) and MNLI Williams et al. (2017), and selected the best-performing model we could train as the starting point for our dataset collection procedure. For Round 1 contexts, we randomly sampled short multi-sentence passages from Wikipedia (of 250-600 characters) from the manually curated HotpotQA training set Yang et al. (2018). Contexts are either ground-truth contexts from that dataset, or they are Wikipedia passages retrieved using TF-IDF Chen et al. (2017) based on a HotpotQA question.

| Dataset | Genre | Contexts | Train / Dev / Test | Error rate (unverified) | Error rate (verified) | Tries (mean / median) | Time in sec. (mean / median) |
| A1 | Wiki | 2,100 | 16,946 / 1,000 / 1,000 | 29.45% | 18.18% | 3.4 / 2.0 | 199.2 / 125.2 |
| A2 | Wiki | 2,700 | 45,360 / 1,000 / 1,000 | 16.52% | 8.04% | 6.4 / 4.0 | 355.3 / 189.1 |
| A3 | Various | 6,000 | 100,459 / 1,200 / 1,200 | 17.44% | 8.59% | 6.4 / 4.0 | 284.0 / 157.0 |
| A3 (Wiki subset) | Wiki | 1,000 | 19,920 / 200 / 200 | 14.79% | 6.92% | 7.4 / 5.0 | 337.3 / 189.6 |
| ANLI (total) | Various | 10,800 | 162,765 / 2,200 / 2,200 | 18.54% | 9.52% | 5.7 / 3.0 | 282.9 / 156.3 |

Table 2: Dataset statistics. ‘Model error rate’ is the percentage of examples the model got wrong: ‘unverified’ is the raw percentage, while ‘verified’ counts only errors additionally confirmed by two human annotators. Tries and time are reported per verified example.

2.4 Round 2

For the second round, we used a more powerful RoBERTa model Liu et al. (2019b) trained on SNLI, MNLI, an NLI version of FEVER Thorne et al. (2018) (in which claims are paired, as (context, hypothesis) inputs, with evidence retrieved by Nie et al. (2019)), and the training data from the previous round (A1). After a hyperparameter search, we selected the model with the best performance on the A1 development set. Then, using the hyperparameters selected from this search, we created a final set of models by training several models with different random seeds. During annotation, we constructed an ensemble by randomly picking a model from this set as the adversary each turn, which helps prevent annotators from exploiting vulnerabilities in any single model. A new, non-overlapping set of contexts was again constructed from Wikipedia via HotpotQA using the same method as in Round 1.

2.5 Round 3

For the third round, we selected a more diverse set of contexts in order to explore robustness under domain transfer. In addition to Wikipedia contexts, Round 3 includes contexts from the following domains: News (extracted from Common Crawl); fiction (extracted from Story Cloze, Mostafazadeh et al. 2016, and CBT, Hill et al. 2015); formal spoken text (excerpted from court and presidential debate transcripts in the Manually Annotated Sub-Corpus (MASC) of the Open American National Corpus); and causal or procedural text, which describes sequences of events or actions, extracted from WikiHow. Finally, we also collected annotations using the longer contexts present in the GLUE RTE training data, which came from the RTE5 dataset Bentivogli et al. (2009). We trained an even stronger RoBERTa model by adding the training set from the second round (A2) to the training data.

2.6 Comparing with other datasets

The ANLI dataset improves upon previous work in several ways. First, and most obviously, the dataset is collected to be more difficult than previous datasets, by design. Second, it remedies a problem with SNLI, namely that its contexts (or premises) are very short, because they were selected from the image captioning domain. We believe longer contexts should naturally lead to harder examples, and so we constructed ANLI contexts from longer, multi-sentence source material.

Following previous observations that models might exploit spurious biases in NLI hypotheses (Gururangan et al., 2018; Poliak et al., 2018), we conduct a study of hypothesis-only model performance on our dataset, and show that such models perform poorly on our test sets.

With respect to data generation with naïve annotators, Geva et al. (2019) noted that models might pick up on annotator bias, modelling the annotators themselves rather than capturing the intended reasoning phenomenon. To counter this, we selected a subset of annotators (the “exclusive” workers) whose data appear only in the test set. This enables us to avoid overfitting to the writing-style biases of particular annotators, and also to determine how much individual annotator bias is present in the main portion of the data. Examples from each round of dataset collection are provided in Table 1.

Furthermore, our dataset poses new challenges to the community that were less relevant for previous work, such as: can we improve performance online without having to train a new model from scratch every round, how can we overcome catastrophic forgetting, how do we deal with mixed model biases, etc. Because the training set includes examples that the model got right but were not verified, it might be noisy, posing filtering as an additional interesting problem.

| Model | Data | A1 | A2 | A3 | ANLI | ANLI-E | SNLI | MNLI-m/-mm |
| BERT | S,M | 0.0 | 28.9 | 28.8 | 19.8 | 19.9 | 91.3 | 86.7 / 86.4 |
| BERT | +A1 | 44.2 | 32.6 | 29.3 | 35.0 | 34.2 | 91.3 | 86.3 / 86.5 |
| BERT | +A1+A2 | 57.3 | 45.2 | 33.4 | 44.6 | 43.2 | 90.9 | 86.3 / 86.3 |
| BERT | +A1+A2+A3 | 57.2 | 49.0 | 46.1 | 50.5 | 46.3 | 90.9 | 85.6 / 85.4 |
| BERT | S,M,F,ANLI | 57.4 | 48.3 | 43.5 | 49.3 | 44.2 | 90.4 | 86.0 / 85.8 |
| XLNet | S,M,F,ANLI | 67.6 | 50.7 | 48.3 | 55.1 | 52.0 | 91.8 | 89.6 / 89.4 |
| RoBERTa | S,M | 47.6 | 25.4 | 22.1 | 31.1 | 31.4 | 92.6 | 90.8 / 90.6 |
| RoBERTa | +F | 54.0 | 24.2 | 22.4 | 32.8 | 33.7 | 92.7 | 90.6 / 90.5 |
| RoBERTa | +F+A1 | 68.7 | 19.3 | 22.0 | 35.8 | 36.8 | 92.8 | 90.9 / 90.7 |
| RoBERTa | +F+A1+A2 | 71.2 | 44.3 | 20.4 | 43.7 | 41.4 | 92.9 | 91.0 / 90.7 |
| RoBERTa | S,M,F,ANLI | 73.8 | 48.9 | 44.4 | 53.7 | 49.7 | 92.6 | 91.0 / 90.6 |

Table 3: Model performance. ‘Data’ refers to the training data (‘S’ = SNLI, ‘M’ = MNLI, ‘F’ = FEVER); ‘A1’–‘A3’ refer to the rounds, and ‘+’ rows add that round’s data to the row above. ‘-m’/‘-mm’ are the MNLI matched/mismatched dev sets. ‘-E’ refers to test set examples written by annotators exclusive to the test set. Each round’s base model was trained on the data described for it in Sections 2.3–2.5.

3 Dataset statistics

The dataset statistics can be found in Table 2. The number of examples collected increases per round: approximately 19k for Round 1, around 47k for Round 2, and over 103k for Round 3. We collected more data in later rounds not only because that data is likely to be more interesting, but also simply because the base models were better, so collecting good, verified examples of model errors took longer.

For each round, we report the model error rate, both on verified and unverified examples. The unverified model error rate captures the percentage of examples where the model disagreed with the writer’s target label, but where we are not (yet) sure if the example is correct. The verified model error rate is the percentage of model errors from example pairs that other annotators were able to confirm the correct label for. Note that this error rate represents a straightforward way to evaluate model quality: the lower the model error rate—assuming constant annotator quality and context-difficulty—the better the model.
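The two error rates can be computed directly from the collected examples; this is a minimal sketch with illustrative field names.

```python
def error_rates(examples):
    """Compute the two error rates reported in Table 2: 'unverified'
    is the fraction of examples where the model disagreed with the
    writer's target label; 'verified' keeps only those errors whose
    label other annotators confirmed.
    """
    total = len(examples)
    errors = [ex for ex in examples if ex["model_fooled"]]
    confirmed = [ex for ex in errors if ex["verified"]]
    return len(errors) / total, len(confirmed) / total
```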

We observe that model error rates decrease as we progress through rounds. In Round 3, where we included a more diverse range of contexts from various domains, the overall error rate went slightly up compared to the preceding round, but for Wikipedia contexts the error rate decreased substantially. While for the first round roughly 1 in every 5 examples were verified model errors, this quickly dropped over consecutive rounds, and the overall model error rate is less than 1 in 10. On the one hand, this is impressive, and shows how far we have come with just three rounds. On the other hand, it shows that we still have a long way to go if even untrained annotators can fool ensembles of state-of-the-art models with relative ease.

Table 2 also reports the average number of “tries”, i.e., attempts made for each context until a model error was found (or the number of possible tries is exceeded), and the average time this took (in seconds). Again, these metrics represent a useful way to evaluate model quality. We observe that the average tries and average time per verified error both go up as we progress through the rounds. The numbers clearly demonstrate that the rounds are getting increasingly more difficult.

4 Results

Table 3 reports the main results. In addition to BERT Devlin et al. (2018) and RoBERTa Liu et al. (2019b), we also include XLNet Yang et al. (2019) as an example of a strong but different model architecture. We show test set performance per ANLI round, on the full ANLI test set, and on the exclusive test subset (examples from test-set-exclusive workers). We also show accuracy on the SNLI test set and the MNLI development sets (used here to compare model configurations across table rows). In what follows, we briefly discuss our observations.

Base model performance is low.

Notice that the base model for each round performs very poorly on that round’s test set. This is the expected outcome: For round 1, the base model gets the entire test set wrong, by design. For rounds 2 and 3, we used an ensemble, so performance is not necessarily zero. However, as it turns out, performance still falls well below chance, indicating that workers did not find vulnerabilities specific to a single model, but generally applicable ones for that model class.

Rounds become increasingly more difficult.

As already foreshadowed by the dataset statistics, round 3 is more difficult (yields lower performance) than round 2, and round 2 is more difficult than round 1. This is true for all model architectures.

Training on more rounds improves robustness.

Generally, our results indicate that training on more rounds improves model performance. This is true for all model architectures. Simply training on more “normal” NLI data would not make a model robust to adversarial attacks, whereas our data actively helps mitigate them.

RoBERTa achieves state-of-the-art performance…

We obtain state-of-the-art performance on both SNLI and MNLI with the RoBERTa model finetuned on our new data: our best model reaches 91.0/90.7 on the MNLI matched/mismatched dev sets, exceeding the score reported in the RoBERTa paper Liu et al. (2019b), and 92.9 on SNLI, exceeding the score reported for MT-DNN Liu et al. (2019a), which previously held the state of the art on SNLI.

…but is outperformed when it is the base model.

However, the base (RoBERTa) models for rounds 2 and 3 are outperformed by both BERT and XLNet. This shows that annotators managed to write examples that RoBERTa generally struggles with, and that more training data alone cannot easily mitigate these shortcomings. It also implies that BERT, XLNet, and RoBERTa have different weaknesses, possibly as a function of their very different pre-training data, which may or may not have contained information relevant to those weaknesses. An additional round with a wider variety of models would thus be an interesting next step.

Continuously augmenting training data does not downgrade performance.

Even though the ANLI training data differs from SNLI and MNLI, adding it to the training set does not harm performance on those tasks. Furthermore, as Table 4 shows, training only on ANLI transfers to SNLI and MNLI, but not vice versa. This suggests that our approach could successfully be applied for many more consecutive rounds.

Exclusive test subset difference is small.

In order to avoid the possibility that models pick up on annotator-specific artifacts, a concern raised by Geva et al. (2019), we included an exclusive test subset with examples from annotators never seen in the training data. We find that the differences between this exclusive subset and the rest of the test set are small, indicating that our models do not over-rely on individual annotators’ writing styles.

| Model | Data | A1 | A2 | A3 | S | M-m/-mm |
| RoBERTa | ALL | 72.1 | 48.4 | 42.7 | 92.6 | 90.4 / 90.4 |
| RoBERTa | ANLI-only | 71.3 | 43.3 | 43.0 | 83.5 | 86.3 / 86.5 |
| RoBERTa (hypothesis-only) | ALL | 49.7 | 46.3 | 42.8 | 71.4 | 60.2 / 59.8 |
| RoBERTa (hypothesis-only) | S+M | 33.1 | 29.4 | 32.2 | 71.8 | 62.0 / 62.0 |
| RoBERTa (hypothesis-only) | ANLI-only | 51.0 | 42.6 | 41.5 | 47.0 | 51.9 / 54.5 |
| XLNet | ALL | 67.6 | 50.7 | 48.3 | 91.7 | 88.8 / 89.1 |
| XLNet (hypothesis-only) | ANLI-only | 47.8 | 48.5 | 43.8 | 71.0 | 58.9 / 58.4 |

Table 4: Analysis of hypothesis-only performance for the different rounds. Hypothesis-only models see only the hypothesis; all other rows use (context, hypothesis) pairs. S = SNLI, M = MNLI; ALL = S,M,F,ANLI.

4.1 Hypothesis-only results

For SNLI and MNLI, concerns have been raised about the propensity of models to pick up on spurious artifacts present only in the hypotheses (Gururangan et al., 2018; Poliak et al., 2018). To study this in the context of our results and task difficulty, we compare models trained on (context, hypothesis) pairs to models trained only on the hypothesis. Table 4 reports results on the three rounds of ANLI, as well as SNLI and MNLI. The table shows some interesting take-aways:

| Model | SNLI-Hard | AT (m/mm) | NR | LN (m/mm) | NG (m/mm) | WO (m/mm) | SE (m/mm) |
| Previous models | 72.7 | 14.4 / 10.2 | 28.8 | 58.7 / 59.4 | 48.8 / 46.6 | 50.0 / 50.2 | 58.3 / 59.4 |
| BERT (All) | 80.2 | 74.1 / 71.9 | 61.1 | 83.0 / 84.1 | 62.5 / 63.0 | 62.3 / 60.8 | 78.5 / 78.4 |
| XLNet (All) | 83.0 | 85.0 / 84.1 | 80.9 | 86.5 / 86.8 | 60.6 / 60.7 | 67.2 / 65.9 | 82.6 / 82.9 |
| RoBERTa (S+M+F) | 84.8 | 81.6 / 77.0 | 69.2 | 88.0 / 88.5 | 59.9 / 60.3 | 65.2 / 64.3 | 86.4 / 86.7 |
| RoBERTa (All) | 84.6 | 87.0 / 84.4 | 82.4 | 88.0 / 88.4 | 64.8 / 64.7 | 71.2 / 70.4 | 84.9 / 85.5 |

Table 5: Model performance on SNLI-Hard and the NLI stress tests (tuned on their respective dev sets). All = S+M+F+ANLI. AT = Antonym; NR = Numerical Reasoning; LN = Length; NG = Negation; WO = Word Overlap; SE = Spelling Error. ‘Previous models’ refers to the Naik et al. (2018) implementation of InferSent (Conneau and Kiela, 2018) for the stress tests, and to the Gururangan et al. (2018) implementation of DIIN (Gong et al., 2018) for SNLI-Hard.

Hypothesis-only models perform poorly on ANLI.

We corroborate that hypothesis-only models obtain good performance on SNLI and MNLI. Performance of such models on ANLI is substantially lower, and decreases with more rounds.

RoBERTa does not outperform hypothesis-only on rounds 2 and 3.

On the two rounds where RoBERTa was used as the base model, its performance is not much better than that of the hypothesis-only model. This could mean two things: either the test data is very difficult, or the training data is not good. To rule out the latter, we trained RoBERTa only on ANLI (163k training examples): doing so matches the MNLI performance of BERT trained on the much larger, fully in-domain SNLI+MNLI combination (943k training examples), with both around 86, which is impressive. Hence, our new challenge test sets are so difficult that the current state-of-the-art model cannot do much better than a hypothesis-only prior.
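Concretely, the hypothesis-only ablation amounts to dropping the context from the model input. A minimal sketch (the `[SEP]`-joined string format is an assumption for illustration, not the authors' exact encoding):

```python
def make_inputs(batch, hypothesis_only=False):
    """Build model input strings for an NLI batch. With
    hypothesis_only=True the context is dropped entirely, so any
    remaining accuracy must come from artifacts in the hypotheses.
    """
    if hypothesis_only:
        return [ex["hypothesis"] for ex in batch]
    return [ex["context"] + " [SEP] " + ex["hypothesis"] for ex in batch]
```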

5 Analysis

We perform two types of model error analysis. First, we evaluate our models on two popular existing test sets that were created to expose model weaknesses, and show that our dataset discourages models from learning spurious statistical patterns, relative to other large popular datasets (e.g., SNLI and MNLI). Second, we explore, by round, the types of inference our writers successfully employed to stump models, by hand-annotating 500 examples from each round’s development set.

5.1 Performance on challenge datasets

Recently, several hard test sets have been made available for revealing the biases NLI models learn from their training datasets (Nie and Bansal, 2017; McCoy et al., 2019; Gururangan et al., 2018; Naik et al., 2018). We examined model performance on two of these: the SNLI-Hard test set Gururangan et al. (2018), which consists of examples that hypothesis-only models label incorrectly, and the NLI stress tests (Naik et al., 2018), in which sentences containing antonym pairs, negations, high word overlap, i.a., are heuristically constructed. We test our models on these stress tests, after tuning on each test’s respective development set to account for potential domain mismatches. For comparison, we also report accuracies from the original papers: for SNLI-Hard, the results of Gururangan et al. (2018)’s implementation of the hierarchical tensor-based Densely Interactive Inference Network (Gong et al., 2018, DIIN) trained on MNLI, and for the NLI stress tests, the performance of Naik et al. (2018)’s implementation of InferSent (Conneau and Kiela, 2018) trained on SNLI. Our results are in Table 5.

We observe that all of our models far outperform the models presented in the original papers for these common stress tests, with our two RoBERTa models performing best. Both perform well on SNLI-Hard and reach accuracy levels in the mid-to-high 80s on the ‘antonym’ (AT), ‘numerical reasoning’ (NR), ‘length’ (LN), and ‘spelling error’ (SE) sub-tests, and show marked improvement on both ‘negation’ (NG) and ‘word overlap’ (WO). Training RoBERTa on ANLI as well appears to be particularly useful for the NR, WO, NG and AT stress tests.

| Round | Numerical & Quantitative | Reference & Names | Basic | Lexical | Tricky | Reasoning & Facts | Quality |
| R1 | 38% | 13% | 18% | 13% | 22% | 53% | 4% |
| R2 | 32% | 20% | 21% | 21% | 20% | 59% | 3% |
| R3 | 17% | 12% | 30% | 33% | 26% | 58% | 4% |
| Average | 29% | 15% | 23% | 22.3% | 23% | 56.6% | 3.6% |

Table 6: Analysis of 500 development set examples per round. ‘Average’ lists the average percentage of each top-level category in ANLI.

5.2 Reasoning types

A dynamically evolving dataset offers the unique opportunity to track how model error rates change over time. Since each round’s development set contains only verified examples, we can investigate two interesting questions: which types of inference do writers employ to fool the models, and are base models differentially sensitive to different types of reasoning? The results are summarized in Table 6.

We employed an expert linguist annotator to devise an ontology of inference types specific to NLI. While designing an appropriate ontology of inference types is far from straightforward, we found that a single unified ontology could characterize examples from all three rounds, which suggests it has at least some generalizable applicability. The ontology was used to label 500 examples from each ANLI development set.

The inference ontology contains six types of inference: Numerical & Quantitative (i.e., reasoning about cardinal and ordinal numbers, inferring dates and ages from numbers, etc.), Reference & Names (coreferences between pronouns and forms of proper names, knowing facts about name gender, etc.), Basic Inferences (conjunctions, negations, cause-and-effect, comparatives and superlatives etc.), Lexical Inference (inferences made possible by lexical information about synonyms, antonyms, etc.), Tricky Inferences (wordplay, linguistic strategies such as syntactic transformations/reorderings, or inferring writer intentions from contexts), and reasoning from outside knowledge or additional facts (e.g., “You can’t reach the sea directly from Djibouti”). The quality of annotations was also tracked; if a pair was ambiguous or had a label that seemed incorrect (from the expert annotator’s perspective), it was flagged. Round 1–3 development sets contained few ‘Quality’ tags; the incidence of quality issues was stable at between 3% and 4% per round. Any one example can have multiple types, and every example contained at least one tag.

As rounds 1 and 2 were both built with contexts from the same genre (Wikipedia), we might expect writers to arrive at similar strategies. However, since the model architectures used in the first two rounds differ, writers might be sufficiently creative in finding different exploits in each. For round 3, we expect some difference in reasoning types to be present, because we used source material from several domains as our contexts. In sum, any change between rounds could be due to any of the following factors: inherent differences between data collection, model architectures and model training data, random selection of contexts, or slight differences in writer pool or writer preferences.

We observe that both round 1 and round 2 writers rely heavily on numerical and quantitative reasoning, which appears in over 30% of the development set—the percentage in A2 (32%) dropped roughly 6 points from A1 (38%)—while round 3 writers use numerical or quantitative reasoning in only 17% of examples. The majority of numerical reasoning involved cardinal numbers referring to dates and ages. Inferences predicated on references and names were present in about 10% of the round 1 and round 3 development sets, and reached a high of 20% in round 2, with coreference featuring prominently. Basic inference types increased in prevalence over the rounds, ranging from 18% to 30%, as did Lexical inferences (increasing from 13% to 33%). The percentage of sentences relying on reasoning and outside facts remains roughly the same, in the mid-50% range, perhaps increasing slightly after round 1. For round 3, we observe that the model used to collect it appears more susceptible to Basic, Lexical, and Tricky inference types. This finding is compatible with the idea that models trained on adversarial data become stronger, encouraging writers to devise more creative examples containing harder types of inference in order to stump them.

6 Related work

Bias in datasets

Machine learning methods are well known to pick up on spurious statistical patterns. For instance, in image captioning, a simple baseline that reuses the captions of nearest neighbors in the training set was shown to yield impressive BLEU scores Devlin et al. (2015). In the first visual question answering dataset Antol et al. (2015), biases like “2” being the correct answer to 39% of the questions starting with “how many” allowed learning algorithms to perform well while ignoring the visual modality altogether Jabri et al. (2016); Goyal et al. (2017). The field has a tendency to overfit on static benchmark targets, even when this does not happen deliberately Recht et al. (2018).

In NLI, Gururangan et al. (2018), Poliak et al. (2018), and Tsuchiya (2018) showed that hypothesis-only baselines often perform far better than chance. It has been shown that NLI systems can often be broken merely by performing simple lexical substitutions Glockner et al. (2018), and that they struggle with quantifiers Geiger et al. (2018) and certain superficial syntactic properties McCoy et al. (2019). In reading comprehension and question answering, Kaushik and Lipton (2018) showed that question- and passage-only models can perform surprisingly well, while Jia and Liang (2017) added adversarially constructed sentences to passages, leading to a drastic drop in performance. Many text classification datasets do not require sophisticated linguistic reasoning, as shown by the surprisingly good performance of random encoders Wieting and Kiela (2019). Similar observations were made in machine translation Belinkov and Bisk (2017) and dialogue Sankar et al. (2019). In short, the field is rife with dataset bias, and many papers try to address this important problem. This work can be viewed as a natural extension: if such biases exist, they allow humans to fool the models, adding useful examples to the training data until the bias is dynamically mitigated.

Dynamic datasets

Concurrently with this work, Anonymous (2020) proposed AFLite, an iterative approach for filtering out data points that exhibit spurious biases. Kaushik et al. (2019) offer a causal account of spurious patterns, and counterfactually augment NLI datasets by editing examples to break the model. The former is an example of a model-in-the-loop setting, where the model is iteratively probed and improved. The latter is human-in-the-loop training, where humans are used to find problems with a single model. In this work, we employ both strategies iteratively, in a form of human-and-model-in-the-loop training, to collect completely new examples, in a potentially never-ending loop Mitchell et al. (2018). Relatedly, Lan et al. (2017) propose a method for continuously growing a dataset of paraphrases.

Human-and-model-in-the-loop training is not a new idea. Mechanical Turker Descent proposes a gamified environment for the collaborative training of grounded language learning agents over multiple rounds Yang et al. (2017). The “Build it, Break it, Fix it” strategy from the security domain Ruef et al. (2016) has been adapted to NLP Ettinger et al. (2017) as well as dialogue Dinan et al. (2019). The QApedia framework Kratzwald and Feuerriegel (2019) continuously refines and updates its content repository using humans in the loop, while human feedback loops have been used to improve image captioning systems Ling and Fidler (2017). Wallace et al. (2019) leverage trivia experts in a model-driven adversarial question writing procedure to generate a small set of challenge questions that QA models fail on.

There has been a flurry of work in constructing datasets with an adversarial component, such as Swag Zellers et al. (2018) and HellaSwag Zellers et al. (2019), CODAH Chen et al. (2019), Adversarial SQuAD Jia and Liang (2017), Lambada Paperno et al. (2016) and others. Our dataset is not to be confused with abductive NLI Bhagavatula et al. (2019), which calls itself αNLI, or ART.

7 Discussion & Conclusion

In this work, we used a human-and-model-in-the-loop training method to collect a new benchmark for natural language understanding. The benchmark is designed to be challenging to current state-of-the-art models. Annotators were employed to act as adversaries, and encouraged to find vulnerabilities that fool the model into predicting the wrong label, but that another person would correctly classify. We found that non-expert annotators, in this gamified setting and with appropriate incentives to fool the model, are remarkably creative at finding and exploiting weaknesses in models. We collected three rounds; as the rounds progressed, the models became more robust and the test sets for each round became more difficult. Training on this new data yielded state-of-the-art performance on existing NLI benchmarks.

The ANLI benchmark presents a new challenge to the community. It was carefully constructed to mitigate issues with previous datasets, and was designed from first principles to last longer—if the test set saturates, the field can simply train up a new model, collect more data and find itself confronted yet again with a difficult challenge.

The dataset also presents many opportunities for further study. For instance, we collected annotator-provided explanations for each example that the model got wrong. We also provided inference-type labels for the development sets, opening up possibilities for more fine-grained studies of NLI model performance. While we verified the development and test examples, we did not verify the correctness of each training example, which means there is likely some room for improvement there.

The benchmark is meant to be a challenge for measuring NLU progress, even for as yet undiscovered models and architectures. We plan for the benchmark itself to adapt to these new models by continuing to build new challenge rounds. As a first next step, it would be interesting to examine results when annotators are confronted with a wide variety of model architectures. We hope that the dataset will prove to be an interesting new challenge for the community. Luckily, if it does turn out to saturate quickly, we will always be able to collect a new round.


Acknowledgments

YN and MB were sponsored by DARPA MCS Grant #N66001-19-2-4031, ONR Grant #N00014-18-1-2871, and DARPA YFA17-D17AP00022.


  • S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh (2015) VQA: visual question answering. In Proceedings of the IEEE international conference on computer vision, pp. 2425–2433. Cited by: §6.
  • Y. Belinkov and Y. Bisk (2017) Synthetic and natural noise both break neural machine translation. arXiv preprint arXiv:1711.02173. Cited by: §6.
  • L. Bentivogli, I. Dagan, H. T. Dang, D. Giampiccolo, and B. Magnini (2009) The Fifth PASCAL Recognizing Textual Entailment Challenge. Cited by: §2.5.
  • C. Bhagavatula, R. L. Bras, C. Malaviya, K. Sakaguchi, A. Holtzman, H. Rashkin, D. Downey, S. W. Yih, and Y. Choi (2019) Abductive commonsense reasoning. arXiv preprint arXiv:1908.05739. Cited by: §6.
  • S. R. Bowman, G. Angeli, C. Potts, and C. D. Manning (2015) A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326. Cited by: §1, §2.3.
  • D. Chen, A. Fisch, J. Weston, and A. Bordes (2017) Reading Wikipedia to answer open-domain questions. In Association for Computational Linguistics (ACL), Cited by: §2.3.
  • M. Chen, M. D’Arcy, A. Liu, J. Fernandez, and D. Downey (2019) CODAH: an adversarially authored question-answer dataset for common sense. CoRR abs/1904.04365. Cited by: §6.
  • D. Cireşan, U. Meier, and J. Schmidhuber (2012) Multi-column deep neural networks for image classification. arXiv preprint arXiv:1202.2745. Cited by: §1.
  • A. Conneau and D. Kiela (2018) SentEval: an evaluation toolkit for universal sentence representations. arXiv preprint arXiv:1803.05449. Cited by: §1, §5.1.
  • J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §1, §2.3, §4.
  • J. Devlin, S. Gupta, R. Girshick, M. Mitchell, and C. L. Zitnick (2015) Exploring nearest neighbor approaches for image captioning. arXiv preprint arXiv:1505.04467. Cited by: §6.
  • E. Dinan, S. Humeau, B. Chintagunta, and J. Weston (2019) Build it Break it Fix it for Dialogue Safety: Robustness from Adversarial Human Attack. In Proceedings of EMNLP, Cited by: §6.
  • A. Ettinger, S. Rao, H. Daumé III, and E. M. Bender (2017) Towards linguistically generalizable nlp systems: a workshop and shared task. arXiv preprint arXiv:1711.01505. Cited by: §1, §6.
  • A. Geiger, I. Cases, L. Karttunen, and C. Potts (2018) Stress-testing neural models of natural language inference with multiply-quantified sentences. arXiv preprint arXiv:1810.13033. Cited by: §6.
  • M. Geva, Y. Goldberg, and J. Berant (2019) Are we modeling the task or the annotator? an investigation of annotator bias in natural language understanding datasets. arXiv preprint arXiv:1908.07898. Cited by: §1.
  • M. Glockner, V. Shwartz, and Y. Goldberg (2018) Breaking nli systems with sentences that require simple lexical inferences. In Proceedings of ACL, Cited by: §1, §6.
  • Y. Gong, H. Luo, and J. Zhang (2018) Natural language inference over interaction space. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, Cited by: §5.1.
  • Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh (2017) Making the v in vqa matter: elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6904–6913. Cited by: §6.
  • S. Gururangan, S. Swayamdipta, O. Levy, R. Schwartz, S. R. Bowman, and N. A. Smith (2018) Annotation artifacts in natural language inference data. In Proceedings of NAACL, Cited by: §1, §2.6, §4.1, §5.1.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §1.
  • F. Hill, A. Bordes, S. Chopra, and J. Weston (2015) The goldilocks principle: reading children’s books with explicit memory representations. arXiv preprint arXiv:1511.02301. Cited by: §2.5.
  • A. Jabri, A. Joulin, and L. Van Der Maaten (2016) Revisiting visual question answering baselines. In European conference on computer vision, pp. 727–739. Cited by: §6.
  • R. Jia and P. Liang (2017) Adversarial examples for evaluating reading comprehension systems. In Proceedings of EMNLP, Cited by: §6.
  • B. Kratzwald and S. Feuerriegel (2019) Learning from on-line user feedback in neural question answering on the web. In The World Wide Web Conference, pp. 906–916. Cited by: §6.
  • Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al. (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §1.
  • H. Ling and S. Fidler (2017) Teaching machines to describe images via natural language feedback. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 5075–5085. Cited by: §6.
  • X. Liu, P. He, W. Chen, and J. Gao (2019a) Multi-task deep neural networks for natural language understanding. arXiv preprint arXiv:1901.11504. Cited by: §4.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019b) RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692. Cited by: §2.4, §4, §4.
  • T. McCoy, E. Pavlick, and T. Linzen (2019) Right for the wrong reasons: diagnosing syntactic heuristics in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 3428–3448. Cited by: §1, §5.1, §6.
  • T. Mitchell, W. Cohen, E. Hruschka, P. Talukdar, B. Yang, J. Betteridge, A. Carlson, B. Dalvi, M. Gardner, B. Kisiel, et al. (2018) Never-ending learning. Communications of the ACM 61 (5), pp. 103–115. Cited by: §1, §6.
  • N. Mostafazadeh, N. Chambers, X. He, D. Parikh, D. Batra, L. Vanderwende, P. Kohli, and J. Allen (2016) A corpus and evaluation framework for deeper understanding of commonsense stories. arXiv preprint arXiv:1604.01696. Cited by: §2.5.
  • A. Naik, A. Ravichander, N. Sadeh, C. Rose, and G. Neubig (2018) Stress test evaluation for natural language inference. In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, New Mexico, USA, pp. 2340–2353. Cited by: §5.1.
  • Y. Nie and M. Bansal (2017) Shortcut-stacked sentence encoders for multi-domain inference. arXiv preprint arXiv:1708.02312. Cited by: §5.1.
  • D. Paperno, G. Kruszewski, A. Lazaridou, Q. N. Pham, R. Bernardi, S. Pezzelle, M. Baroni, G. Boleda, and R. Fernández (2016) The LAMBADA dataset: word prediction requiring a broad discourse context. arXiv preprint arXiv:1606.06031. Cited by: §6.
  • A. Poliak, J. Naradowsky, A. Haldar, R. Rudinger, and B. Van Durme (2018) Hypothesis only baselines in natural language inference. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, Cited by: §1, §2.6, §4.1.
  • P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250. Cited by: §1.
  • B. Recht, R. Roelofs, L. Schmidt, and V. Shankar (2018) Do cifar-10 classifiers generalize to cifar-10?. arXiv preprint arXiv:1806.00451. Cited by: §6.
  • A. Ruef, M. Hicks, J. Parker, D. Levin, M. L. Mazurek, and P. Mardziel (2016) Build it, break it, fix it: contesting secure development. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pp. 690–703. Cited by: §6.
  • O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. (2015) Imagenet large scale visual recognition challenge. International journal of computer vision 115 (3), pp. 211–252. Cited by: §1, §1.
  • C. Sankar, S. Subramanian, C. Pal, S. Chandar, and Y. Bengio (2019) Do neural dialog systems use the conversation history effectively? an empirical study. arXiv preprint arXiv:1906.01603. Cited by: §6.
  • W. Shakespeare (1603) The tragedy of hamlet, prince of denmark. Cited by: §2.1.
  • J. Thorne, A. Vlachos, C. Christodoulopoulos, and A. Mittal (2018) FEVER: a large-scale dataset for fact extraction and verification. arXiv preprint arXiv:1803.05355. Cited by: §2.4.
  • M. Tsuchiya (2018) Performance impact caused by hidden bias of training data for recognizing textual entailment. Cited by: §1.
  • L. Wan, M. Zeiler, S. Zhang, Y. Le Cun, and R. Fergus (2013) Regularization of neural networks using dropconnect. In International conference on machine learning, pp. 1058–1066. Cited by: §1.
  • A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman (2019) Superglue: a stickier benchmark for general-purpose language understanding systems. arXiv preprint arXiv:1905.00537. Cited by: §1.
  • A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman (2018) Glue: a multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461. Cited by: §1.
  • J. Wieting and D. Kiela (2019) No training required: exploring random encoders for sentence classification. arXiv preprint arXiv:1901.10444. Cited by: §6.
  • A. Williams, N. Nangia, and S. R. Bowman (2017) A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426. Cited by: §2.3.
  • Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le (2019) XLNet: generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237. Cited by: §4.
  • Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. W. Cohen, R. Salakhutdinov, and C. D. Manning (2018) Hotpotqa: a dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600. Cited by: §2.3.
  • Z. Yang, S. Zhang, J. Urbanek, W. Feng, A. H. Miller, A. Szlam, D. Kiela, and J. Weston (2017) Mastering the dungeon: grounded language learning by mechanical turker descent. arXiv preprint arXiv:1711.07950. Cited by: §1, §6.
  • R. Zellers, Y. Bisk, R. Schwartz, and Y. Choi (2018) Swag: a large-scale adversarial dataset for grounded commonsense inference. In Proceedings of EMNLP, Cited by: §6.
  • R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi (2019) HellaSwag: can a machine really finish your sentence?. In Proceedings of ACL, Cited by: §6.