Collecting Entailment Data for Pretraining: New Protocols and Negative Results

04/24/2020 ∙ by Samuel R. Bowman, et al. ∙ Google, NYU

Textual entailment (or NLI) data has proven useful as pretraining data for tasks requiring language understanding, even when building on an already-pretrained model like RoBERTa. The standard protocol for collecting NLI was not designed for the creation of pretraining data, and it is likely far from ideal for this purpose. With this application in mind, we propose four alternative protocols, each aimed at improving either the ease with which annotators can produce sound training examples or the quality and diversity of those examples. Using these alternatives and a simple MNLI-based baseline, we collect and compare five new 8.5k-example training sets. Our primary results are solidly negative, with our baseline MNLI-style dataset yielding good transfer performance, but none of our four new methods (nor the recent ANLI) showing any improvements on that baseline. However, we do observe that all four of these interventions, especially the use of seed sentences for inspiration, reduce previously observed issues with annotation artifacts.




1 Introduction

The task of natural language inference (NLI; also known as textual entailment) has been widely used as an evaluation task when developing new methods for language understanding tasks, but it has recently become clear that high-quality NLI data can be useful in transfer learning as well. Several recent papers have shown that training large neural network models on natural language inference data, then fine-tuning them for other language understanding tasks, often yields substantially better results on those target tasks (Conneau et al., 2017; Subramanian et al., 2018). This result holds even when starting from large models like BERT (Devlin et al., 2019) that have already been pretrained extensively on unlabeled data (Phang et al., 2018; Clark et al., 2019; Liu et al., 2019).

The largest general-purpose corpus for NLI, and the one that has proven most successful in this setting, is the Multi-Genre NLI Corpus (MNLI; Williams et al., 2018). MNLI was designed for use in a benchmark task, rather than as a resource for use in transfer learning, and it was not developed on the basis of any kind of deliberate experimentation. Further, data collected under MNLI’s data collection protocol has known issues with annotation artifacts which make it possible to perform much better than chance using only one of the sentences in each pair (Tsuchiya, 2018; Gururangan et al., 2018; Poliak et al., 2018).

This work begins to ask what would be involved in collecting a similar dataset that is explicitly designed with transfer learning in mind. In particular, we consider four potential changes to the original MNLI data collection protocol that are designed to improve either the ease with which annotators can produce sound examples, or the quality and diversity of those examples, and evaluate their effects on transfer. We collect a baseline dataset of 8500 examples that follows the MNLI protocol with our annotator pool, followed by four additional datasets of the same size which isolate each of our candidate changes. We then compare all five in a set of transfer learning experiments that look at our ability to use each of these datasets to improve performance on the eight downstream language understanding tasks in the SuperGLUE (Wang et al., 2019a) benchmark.

All five of our datasets are consistent with the task definition that was used in MNLI, which is in turn based on the definition introduced by Dagan et al. (2006). In this task, each example consists of a pair of short texts, called the premise and the hypothesis. The model is asked to read both texts and make a three-way classification decision: Given the premise, would a reasonable person infer that the hypothesis must be true (entailment), infer that it must be false (contradiction), or decide that there is not enough information to make either inference (neutral)? While it is certainly not clear that this framing is optimal for pretraining, we leave a more broad-based exploration of task definitions for future work.

Figure 1: The annotation interface for Base.

Our Base data collection protocol (Figure 1) follows MNLI closely in asking annotators to read a premise sentence and then write three corresponding hypothesis sentences in empty text boxes corresponding to the three different labels (entailment, contradiction, and neutral). When an annotator follows this protocol, they produce three sentence pairs at once, all sharing a single premise.

Our Paragraph protocol tests the effect of supplying annotators with complete paragraphs, rather than sentences, as premises. Longer texts offer the potential for discourse-level inferences, the addition of which should yield a dataset which is more difficult, more diverse, and less likely to contain trivial artifacts. However, reading full paragraphs could increase the time required to annotate a single example; time which could potentially be better spent constructing more sentence-level examples.

Our EditPremise and EditOther protocols test the effect of pre-filling a single seed text in each of the three text boxes that annotators are asked to fill out. By reducing the raw amount of typing required, this could allow annotators to produce good examples more quickly. By encouraging them to keep the three sentences similar, it could also indirectly facilitate the construction of minimal-pair-like examples that minimize artifacts, in the style of Kaushik et al. (2020). We test two variants of this idea: One uses a copy of the premise sentence as a seed text and the second retrieves a new sentence from an existing corpus that is similar to the premise sentence, and uses that.

Our Contrast protocol tests the effect of adding artificial constraints on the kinds of hypothesis sentences annotators can write. Giving annotators difficult and varying constraints could encourage creativity and prevent annotators from falling into patterns in their writing that lead to easier or more repetitive data. However, as with the use of longer contexts in Paragraph, this protocol risks substantially slowing the annotation process. We experiment with a procedure inspired by that used to create the language-and-vision dataset NLVR2 (Suhr et al., 2019), in which annotators must write sentences that show some specified relationship (entailment or contradiction) to a given premise, but do not show that relationship to a second, similar, distractor premise.

In evaluations on transfer learning with the SuperGLUE benchmark, our Base dataset and the datasets collected under our four new protocols offer substantial improvements in transfer ability over a plain RoBERTa or XLNet model, comparable to the gains seen with an equally-sized sample of MNLI. However, Base reliably shows the strongest transfer results. This finding, combined with a low variance across runs, strongly suggests that none of these four interventions improves the suitability of NLI data for transfer learning. While this is a negative result for our primary focus on transfer, we also observe that all four of these methods are able to produce data of comparable subjective quality while significantly reducing the incidence of previously reported annotation artifacts, and that Paragraph, EditPremise, and EditOther all accomplish this without significantly increasing the time cost of annotation.

2 Related Work

MNLI (Training Set)
P: Conceptually cream skimming has two basic dimensions - product and geography.
H: Product and geography are what make cream skimming work.
P: The board had also expressed concerns about the amounts of cash kept by SNC’s Libyan office, at that time approximately $10 million, according to the company’s chief financial officer.
H: According to the board, the Libyan office should be holding more cash on hand.
P: The paper, along with the ”Washington Blade”, was acquired by Window Media, LLC in 2001, and both were then sold to HX Media in 2007. Kat Long succeeded Trenton Straube as editor-in-chief in February 2009. The paper ceased publication in July 2009.
H: Kal Long succeeded Trenton Straube as editor-in-chief in March 2019.
P: This standpoint is believed to promote Deaf people’s right to collective space within society to pass on their language and culture to future generations.
H: This standpoint is believed to demote Deaf people’s right to collective space within society.
P: Shobhona Sharma (born 5 February 1953) is a professor specializing in immunology, molecular biology, and biochemistry at the Tata Institute of Fundamental Research, Mumbai.
H: Shobhona Sharma is also professor of mathematics at the Tata Institute of Fundamental Research, Mumbai.
P: Bengt Erik Johan Renvall (September 22, 1959 – August 24, 2015) was a Swedish dancer and choreographer active in the United States from 1978.
H: He was a dancer in America in the 1970s.
Table 1: Randomly selected examples from the datasets under study. Neither the MNLI training set nor any of our collected data are filtered for quality in any way, and errors or debatable judgments are common in both.

The observation that NLI data can be effective in pretraining was first reported for SNLI (Bowman et al., 2015) and MNLI by Conneau et al. (2017) on models pretrained from scratch on NLI data. This finding was replicated in the setting of multi-task pretraining by Subramanian et al. (2018). This was later extended to the context of intermediate training (where a model is pretrained on unlabeled data, then on relatively abundant labeled data, in this case MNLI, and finally on scarce task-specific labeled data) by Phang et al. (2018), Clark et al. (2019), Liu et al. (2019), Yang et al. (2019), and Liu et al. (2019) across a range of large pretrained models and target language-understanding tasks. Similar results have been observed with transfer from the SocialIQA corpus (Sap et al., 2019) to target tasks centered on common sense. A small body of work including Mou et al. (2016), Bingel and Søgaard (2017) and Wang et al. (2019) has explored the empirical landscape of which supervised NLP tasks can offer effective pretraining for other supervised NLP tasks.

Existing NLI datasets have been built using a wide range of strategies: FraCaS (Cooper et al., 1996) and several targeted evaluation sets were constructed manually by experts from scratch. The RTE challenge corpora (Dagan et al., 2006, et seq.) primarily used expert annotations on top of existing premise sentences. SICK (Marelli et al., 2014) was created using a structured pipeline centered on asking crowdworkers to edit sentences in prescribed ways. MPE (Lai et al., 2017) uses a similar strategy, but constructs unordered sets of sentences for use as premises. SNLI (Bowman et al., 2015) introduced the method, used in MNLI, of asking crowdworkers to compose labeled hypotheses for a given premise. SciTail (Khot et al., 2018) and SWAG (Zellers et al., 2018) used domain-specific resources to pair existing sentences as potential entailment pairs, with SWAG additionally using trained models to identify examples worth annotating. There has been little work directly evaluating and comparing these many methods. In that absence, we focus on the SNLI/MNLI approach, because it has been shown to be effective for the collection of pretraining data and because its reliance on only crowdworkers and unstructured source text makes it simple to scale.

Two recent papers have investigated other methods that could augment the base MNLI protocol we study here. ANLI (Nie et al., 2019) collects new examples following this protocol, but adds an incentive for crowdworkers to produce sentence pairs on which a baseline system will perform poorly. Kaushik et al. (2020) introduce a method for expanding an already-collected dataset by making small edits to existing examples that change their labels, with the intent to produce minimally-different minimal pairs with differing labels. Both of these papers offer methodological changes that are potentially complementary to the changes we investigate here, and neither evaluates the impact of their methods on transfer learning. Since ANLI is large and roughly comparable with MNLI, we include it in our transfer evaluations here.

In addition to NLVR2, WinoGrande (Sakaguchi et al., 2019) also showed promising results from the use of artificial constraints during the annotation process for a different style of language-understanding dataset.

3 Data Collection

The basic interface for our tasks is similar to that used for SNLI and MNLI: We provide a premise from a preexisting text source and ask human annotators to provide three hypothesis sentences: one that says something true about the fact or situation in the prompt (entailment), one that says something that may or may not be true about the fact or situation in the prompt (neutral), and one that definitely does not say something true about the fact or situation in the prompt (contradiction).


Base

In this baseline, modeled closely on the protocol used for MNLI, we show annotators a premise sentence and ask them to compose one new sentence for each label.


Paragraph

Here, we use the same instructions as Base, but with full paragraphs, rather than single sentences, as the supplied premises.


EditPremise

Here, we pre-fill the three text boxes with editable copies of the premise sentence, and ask annotators to edit each text field to compose sentences that match the three different labels. Annotators are permitted to delete the pre-filled text.


EditOther

Here, we follow the same procedure as EditPremise, but rather than pre-filling the premise as a seed sentence, we instead use a similarity search method to retrieve a different sentence from the same source corpus that is similar to the premise.


Contrast

Here, we again retrieve a second sentence that is similar to the premise, but we display it as a second premise rather than using it to seed an editable text box. We then ask annotators to compose two new sentences: One sentence must be true only about the fact or situation in the first premise (that is, contradictory or neutral with respect to the second premise). The other sentence must be false only about the fact or situation in the first premise (and true or neutral with respect to the second premise). This yields an entailment pair and a contradiction pair, both of which use only the first premise, with the second premise serving only as a constraint on the annotation process. We could not find a sufficiently intuitive way to collect neutral sentence pairs under this protocol, and opted to use only two classes rather than increase the difficulty of an already unintuitive task.
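As a rough illustration of how annotator responses under these protocols turn into labeled sentence pairs, the following minimal sketch converts one completed prompt into examples. The class and function names are hypothetical and are not taken from the paper's released scripts; only the mapping from text boxes to labels follows the descriptions above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class NLIExample:
    premise: str
    hypothesis: str
    label: str  # "entailment", "neutral", or "contradiction"


def examples_from_three_box_prompt(
    premise: str, entailed: str, neutral: str, contradicted: str
) -> List[NLIExample]:
    """Base, Paragraph, EditPremise, and EditOther: three pairs sharing one premise."""
    return [
        NLIExample(premise, entailed, "entailment"),
        NLIExample(premise, neutral, "neutral"),
        NLIExample(premise, contradicted, "contradiction"),
    ]


def examples_from_contrast_prompt(
    first_premise: str,
    distractor_premise: str,
    true_only_of_first: str,
    false_only_of_first: str,
) -> List[NLIExample]:
    """Contrast: two pairs, both built on the first premise only."""
    # The distractor premise constrains what annotators may write,
    # but it is not stored in the resulting examples.
    del distractor_premise
    return [
        NLIExample(first_premise, true_only_of_first, "entailment"),
        NLIExample(first_premise, false_only_of_first, "contradiction"),
    ]
```

Under this sketch, a Contrast prompt yields two examples where every other protocol yields three, which is why Contrast produces a two-class dataset.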

3.1 Text Source

MNLI uses the small but stylistically diverse OpenANC corpus (Ide and Suderman, 2006) as its source for premise sentences, but uses nearly every available sentence from its non-technical sections. To avoid re-using premises, we instead draw on English Wikipedia.

Footnote 1: We use the 2019-06-20 downloadable version, extract the plain text with Apertium’s WikiExtractor feature (Forcada et al., 2011), sentence-tokenize it with SpaCy (Honnibal and Montani, 2017), and randomly sample sentences (or paragraphs) for annotation.

Similarity Search

The EditOther and Contrast protocols require pairs of similar sentences as their inputs. To construct these, we assemble a heuristic sentence-matching system intended to generate pairs of highly similar sentences that can be minimally edited to construct entailments or contradictions: Given a premise, we retrieve its 10k nearest neighbors according to dot-product similarity over Universal Sentence Encoder (Cer et al., 2018) embeddings. Using a parser and an NER system, we then select those neighbors which share a subject noun phrase with the premise (dropping premises for which no such neighbors exist). From those filtered neighbors, we retrieve the single non-identical neighbor that has the highest overlap with the premise in both raw tokens and entity mentions, preferring sentences with similar length to the hypothesis.
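The retrieval heuristic can be sketched roughly as follows. This is a simplified illustration under several assumptions, not the authors' implementation: sentence embeddings, subject noun phrases, and entity sets are taken as precomputed inputs (in the paper they come from Universal Sentence Encoder, a parser, and an NER system), tokens are split on whitespace, the two overlap counts are simply summed, and the length preference is applied as a tie-breaker relative to the premise length. All function and field names are hypothetical.

```python
import numpy as np


def top_k_by_dot_product(query_vec, corpus_vecs, k):
    """Indices of the k corpus rows with the largest dot product, best first."""
    scores = corpus_vecs @ query_vec
    k = min(k, len(scores))
    # argpartition selects the k largest scores without a full sort.
    top = np.argpartition(-scores, k - 1)[:k]
    return top[np.argsort(-scores[top])]


def pick_similar_sentence(premise, query_vec, corpus, corpus_vecs, k=10_000):
    """Return the best-matching neighbor, or None if the premise should be dropped.

    `premise` and each corpus item are dicts with keys
    'text' (str), 'subject' (str), and 'entities' (set of str).
    """
    premise_tokens = premise["text"].lower().split()
    premise_vocab = set(premise_tokens)
    best, best_key = None, None
    for i in top_k_by_dot_product(query_vec, corpus_vecs, k):
        cand = corpus[i]
        if cand["text"] == premise["text"]:
            continue  # only non-identical neighbors qualify
        if cand["subject"] != premise["subject"]:
            continue  # must share a subject noun phrase
        cand_tokens = cand["text"].lower().split()
        token_overlap = len(premise_vocab & set(cand_tokens))
        entity_overlap = len(premise["entities"] & cand["entities"])
        length_gap = abs(len(cand_tokens) - len(premise_tokens))
        # Assumption: rank by summed overlap, then by closeness in length.
        key = (token_overlap + entity_overlap, -length_gap)
        if best_key is None or key > best_key:
            best, best_key = cand, key
    return best
```

Returning `None` corresponds to dropping a premise for which no neighbor shares a subject noun phrase.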

Sentence    Label          Length       Unique
MNLI Gov. 8.5k
premise     all labels     25.1 (13.4)
hypothesis  entailment     12.6 (5.1)   4.4 (5.1)
hypothesis  neutral        13.0 (5.3)   7.1 (5.3)
hypothesis  contradiction  12.0 (4.5)   5.7 (4.5)
Base
premise     all labels     23.3 (11.4)
hypothesis  entailment     10.6 (5.6)   2.3 (5.5)
hypothesis  neutral        10.5 (5.5)   4.5 (5.5)
hypothesis  contradiction  10.2 (5.1)   4.0 (5.1)
Paragraph
premise     all labels     66.7 (60.0)
hypothesis  entailment     13.0 (8.1)   2.3 (8.1)
hypothesis  neutral        12.9 (8.1)   4.1 (8.1)
hypothesis  contradiction  12.5 (7.9)   3.3 (7.9)
EditPremise
hypothesis  entailment     15.0 (8.9)   2.5 (8.9)
hypothesis  neutral        17.0 (9.8)   4.3 (9.8)
hypothesis  contradiction  15.3 (9.2)   3.3 (9.2)
EditOther
hypothesis  entailment     12.6 (6.3)   3.2 (6.3)
hypothesis  neutral        13.0 (6.8)   6.2 (6.8)
hypothesis  contradiction  12.7 (6.4)   4.7 (6.4)
Contrast
hypothesis  entailment     7.9 (5.1)    2.5 (5.1)
hypothesis  contradiction  7.7 (4.9)    3.5 (4.9)
Table 2: Key text statistics. Premises are drawn from essentially the same distribution in all our tasks except Paragraph, so are shown only once. The Unique column shows the number of tokens that appear in a hypothesis but not in the corresponding premise.
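As a minimal sketch of how a statistic like the Unique column could be computed, the snippet below counts hypothesis tokens that do not occur in the premise. The tokenization used for Table 2 is not specified in this section, so lowercased whitespace splitting is an assumption made here for illustration, and the function names are hypothetical.

```python
from statistics import mean


def unique_hypothesis_tokens(premise: str, hypothesis: str) -> list:
    """Hypothesis tokens that do not appear anywhere in the premise."""
    # Assumption: lowercased whitespace tokens stand in for the real tokenizer.
    premise_vocab = set(premise.lower().split())
    return [tok for tok in hypothesis.lower().split() if tok not in premise_vocab]


def mean_unique_count(pairs) -> float:
    """Average number of unique hypothesis tokens over (premise, hypothesis) pairs."""
    return mean(len(unique_hypothesis_tokens(p, h)) for p, h in pairs)
```

For instance, with the premise "A man is playing a guitar on stage" and the hypothesis "A man is playing an instrument", the unique tokens under this tokenization are "an" and "instrument".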

3.2 The Annotation Process

We start data collection for each protocol with a pilot of 100 items, which are not included in the final datasets. We use these to refine task instructions and to provide feedback to our annotator pool on the intended task definition. We continue to provide regular feedback throughout the annotation process to clarify ambiguities in the protocols and to discourage the use of systematic patterns—such as consistently composing shorter hypotheses for entailments than for contradictions—that could make the resulting data artificially easy.

Annotators are allowed to skip prompts which they deem unusable for any reason. These generally involve either non-sentence strings that were mishandled by our sentence tokenizer or premises with inaccessible technical language. Skip rates ranged from about 2.5% for EditOther to about 10% for Contrast (which can only be completed when the two premises are both comprehensible and sufficiently different from one another).

A pool of 19 professional annotators located in the United States worked on our tasks, with about ten working on each. As a consequence of this relatively small annotation team, many annotators worked under more than one protocol, which we ran consecutively. This introduces a modest potential bias against Base, in that annotators start the later tasks having seen somewhat more feedback.

We do not use any kind of second-pass annotation process for quality control, and we neither designate a test set nor recommend these datasets for system evaluation. Our aim is to use our limited annotation time budget to collect the largest and best possible sample of (pre)training data, and we are motivated by work like Khetan et al. (2018) which calls into question the value of second-pass quality-control annotations for training data.

3.3 The Resulting Data

Using each protocol, we collect a training set of exactly 8,500 examples and a small validation set of at least 300 examples. Table 1 shows randomly chosen examples.

Footnote 2: These datasets will be made public shortly, and a link will be made available in an updated version of this manuscript.

Hypotheses are mostly fluent, full sentences that adhere to prescriptive writing conventions for US English. In constructing hypotheses, annotators often reuse words or phrases from the premise, but rearrange them, alter their inflectional forms, or substitute synonyms or antonyms. Hypotheses tend to differ from premises both grammatically and stylistically.

Table 2 shows some simple statistics on the collected text. Our clearest observation here is that the two methods that use seed sentences tend to yield longer hypotheses and tend not to show a clear relationship between hypothesis–premise token overlap and label. Contrast tends to produce shorter hypotheses.

Annotator Time

Annotators completed each of the five protocols at a similar rate, taking 3–4 minutes per prompt. This goes against our expectations that the longer premises in Paragraph should substantially slow the annotation process, and that the pre-filled text in EditPremise and EditOther should speed annotation. Since the relatively complex Contrast produces only two sentence pairs per prompt rather than three, it yields fewer examples per minute.

Label–Word Association

Word     Label          PMI    Counts
MNLI Gov. 8.5k
never    contradiction  0.837  152/178
no       contradiction  0.828  342/426
any      contradiction  0.721  128/169
Base
never    contradiction  0.935  231/255
also     neutral        0.587  64/93
any      contradiction  0.585  46/64
Paragraph
never    contradiction  0.608  49/67
than     neutral        0.526  95/156
went     neutral        0.489  46/73
EditPremise
years    neutral        0.461  135/239
ago      neutral        0.443  17/21
eight    contradiction  0.437  13/15
EditOther
refused  contradiction  0.565  24/28
hardly   contradiction  0.507  16/17
later    neutral        0.482  48/77
MNLI Gov. 8.5k (two-class)
no       contradiction  0.754  437/461
any      contradiction  0.689  193/208
only     contradiction  0.633  215/249
Contrast (two-class)
only     contradiction  0.677  176/228
never    contradiction  0.635  73/90
no       contradiction  0.616  102/135
Table 3: The top three words most associated with specific labels in each dataset, sorted by the PMI between the word and the label. The counts column shows how many of the instances of each word occur in hypotheses matching the specified label. We compare the two-class Contrast with a two-class version of MNLI Gov.

Table 3 shows the three words in each dataset that are most strongly associated with specific labels, using the smoothed PMI method of Gururangan et al. (2018). We also include results for a baseline: an 8.5k-example sample from the government documents single-genre section of MNLI, which is meant to be maximally comparable to the single-genre datasets we collect.
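To make the word-label association measure concrete, here is a minimal sketch of pointwise mutual information with add-k smoothing over word and label co-occurrence counts. The smoothing constant, the log base, and the choice to count each word once per hypothesis are assumptions of this sketch, so it should not be expected to reproduce the exact values in Table 3. The function name is hypothetical.

```python
import math
from collections import Counter


def smoothed_pmi(examples, smoothing=100.0):
    """PMI(word, label) with add-k smoothing.

    `examples` is a list of (hypothesis_tokens, label) pairs.
    Returns a dict mapping (word, label) to its smoothed PMI.
    """
    labels = sorted({label for _, label in examples})
    joint = Counter()
    for tokens, label in examples:
        for word in set(tokens):  # assumption: count a word once per hypothesis
            joint[(word, label)] += 1
    vocab = sorted({word for word, _ in joint})

    def count(word, label):
        # Add-k smoothing: every (word, label) cell receives k pseudo-counts.
        return joint[(word, label)] + smoothing

    total = sum(count(w, l) for w in vocab for l in labels)
    word_total = {w: sum(count(w, l) for l in labels) for w in vocab}
    label_total = {l: sum(count(w, l) for w in vocab) for l in labels}
    return {
        (w, l): math.log(count(w, l) * total / (word_total[w] * label_total[l]))
        for w in vocab
        for l in labels
    }
```

A positive score for a pair such as ("never", "contradiction") means the word occurs with that label more often than its overall frequency and the label's overall frequency would predict; larger smoothing values pull scores for rare words toward zero.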

Base shows similar associations to the original MNLI, but all four of our interventions reduce these associations at least slightly. The use of seed sentences, especially in EditPremise, largely eliminates the strong association between negation and contradiction seen in MNLI, and no new strong associations appear to take its place.

4 Modeling Experiments

Our experiments generally compare models trained on ten different datasets: each of the five 8.5k-example training sets introduced in this paper; the full 393k-example MNLI training set; the full 1.1m-example ANLI training set (which combines the SNLI training set, the MNLI training set, and the supplemental ANLI training examples); 8.5k-example samples from the MNLI training set and from the combined ANLI training set, meant to control for the size differences between these existing datasets and our baselines; and finally an 8.5k-example sample from the government section of the MNLI training set, meant to control (as much as possible) for the difference between our single-genre Wikipedia datasets and MNLI’s relatively diverse text.

Footnote 3: In these runs, we use only the original ANLI validation set for evaluation and early stopping.

Our models are trained starting from pretrained RoBERTa (large variant; Liu et al., 2019) or XLNet (large, cased; Yang et al., 2019). RoBERTa represented the state of the art on most of our target tasks as of the launch of our experiments. XLNet is competitive with RoBERTa on most tasks, and offers a natural replication. It can be used to better compare models trained on our data with models trained on ANLI: ANLI was collected with a model-in-the-loop procedure using RoBERTa that makes it difficult to interpret RoBERTa results.

We run our experiments using the jiant toolkit (Wang et al., 2019c), which implements the SuperGLUE tasks, MNLI, and ANLI, and in turn uses transformers (Wolf et al., 2019), AllenNLP (Gardner et al., 2017), and PyTorch (Paszke et al., 2017). To make it possible to train these large models on single consumer GPUs, we use small-batch training and a maximum total sequence length of 128 word pieces. Note that this potentially limits the gains observable for Paragraph, which has a longer mean premise length of 66.7 words.

Footnote 4: We cut this to a slightly lower number on a few individual runs as needed to avoid memory constraints.

We train for up to 2 epochs for the very large ReCoRD, 10 epochs for the very small CB, COPA, and WSC, and 4 epochs for the remaining tasks. Except where noted, all results reflect the median final performance from three random restarts of training.

Footnote 5: Scripts implementing these experiments are available at

Direct NLI Evaluations

As a preliminary sanity check, Table 4 shows the results of evaluating models trained in each of the settings described above on their own validation sets, on the MNLI validation set, and on the expert-constructed GLUE diagnostic set (Wang et al., 2019b). As NLI classifiers trained on Contrast cannot produce the neutral labels used in MNLI, we evaluate them separately, and compare them with two-class variants of the MNLI models.

Our Base data yields a model that performs somewhat worse than a comparable MNLI Gov. 8.5k model, both on their respective validation sets and on the full MNLI validation set. This suggests, at least tentatively, that the new annotations are less reliable than those in MNLI. This is disconcerting, but does not interfere with our key comparisons.

The main conclusion we draw from these results is that none of the first three interventions improve performance on the out-of-domain GLUE diagnostic set, suggesting that they do not help in the collection of data that is both difficult and consistent with the MNLI label definitions. We also observe that the newer ANLI data yields worse performance than MNLI on the out-of-domain evaluation data when we control for dataset size.

Training Data          Self  MNLI  GLUE Diag.
Base                   84.8  81.5  40.5
Paragraph              78.3  78.2  31.7
EditPremise            82.9  79.8  35.5
EditOther              82.5  82.6  33.9
MNLI 8.5k              87.5  87.5  44.6
MNLI Gov. 8.5k         87.7  85.4  40.7
ANLI 8.5k              35.7  85.6  39.8
MNLI                   90.4  90.4  49.2
ANLI                   61.5  90.1  49.7
MNLI (two-class)       94.0  94.0
MNLI 8.5k (two-class)  92.4  92.4
Contrast               91.6  80.6
Table 4: NLI modeling experiments with RoBERTa, reporting results on the validation sets for MNLI and for the task used for training each model (Self), and the GLUE diagnostic set. We compare the two-class Contrast with a two-class version of MNLI.

Training Data          Self  MNLI
Base                   57.9  52.2
Paragraph              48.3  47.0
EditPremise            40.4  39.4
EditOther              45.1  50.7
MNLI 8.5k              56.8  56.8
MNLI Gov. 8.5k         63.7  53.9
ANLI 8.5k              34.3  54.4
MNLI                   62.0  62.0
ANLI                   53.2  61.6
MNLI (two-class)       72.6  72.6
MNLI 8.5k (two-class)  62.4  62.4
Contrast               56.9  55.9
Table 5: Results from RoBERTa hypothesis-only NLI classifiers on the validation sets for MNLI and for the datasets used in training.

Intermediate    Avg.        BoolQ  CB         COPA  MultiRC    ReCoRD     RTE   WiC   WSC
Training Data   mean (std)  Acc.   F1/Acc.    Acc.  F1/EM      F1/EM      Acc.  Acc.  Acc.
RoBERTa (large)
None            67.3 (1.2)  84.3   83.1/89.3  90.0  70.0/27.3  86.5/85.9  85.2  71.9  64.4
Base            72.2 (0.1)  84.4   97.4/96.4  94.0  71.9/33.3  86.1/85.5  88.4  70.8  76.9
Paragraph       70.3 (0.1)  84.7   97.4/96.4  90.0  70.4/29.9  86.7/86.0  86.3  70.2  67.3
EditPremise     69.6 (0.6)  83.0   92.3/92.9  89.0  71.2/31.2  86.4/85.7  85.6  71.0  65.4
EditOther       70.3 (0.1)  84.2   91.8/94.6  91.0  70.7/31.3  86.2/85.6  87.4  71.5  68.3
Contrast        69.2 (0.0)  84.1   93.1/94.6  87.0  71.4/29.5  84.8/84.1  84.5  71.5  67.3
MNLI 8.5k       71.0 (0.6)  84.7   96.1/94.6  92.0  71.7/32.3  86.4/85.7  87.4  74.0  68.3
MNLI Gov. 8.5k  70.9 (0.5)  84.8   97.4/96.4  92.0  71.4/32.0  86.2/85.6  86.3  71.6  70.2
ANLI 8.5k       70.5 (0.3)  84.7   96.1/94.6  89.0  71.6/31.8  85.7/85.0  85.9  71.9  70.2
MNLI            70.0 (0.0)  85.3   89.0/92.9  88.0  72.2/35.4  84.7/84.1  89.2  71.8  66.3
ANLI            70.4 (0.9)  85.4   92.4/92.9  90.0  72.0/33.5  85.5/84.8  91.0  71.8  66.3
XLNet (large cased)
None            62.7 (1.3)  82.0   83.1/89.3  76.0  69.9/26.8  80.9/80.1  69.0  65.2  63.5
Base            67.7 (0.0)  83.1   90.5/92.9  89.0  70.5/28.2  78.2/77.4  85.9  68.7  64.4
Paragraph       67.3 (0.0)  82.5   90.8/94.6  85.0  69.8/28.1  79.4/78.6  83.8  69.7  64.4
EditPremise     67.0 (0.4)  82.8   82.8/91.1  83.0  69.8/28.6  79.3/78.5  85.2  70.2  65.4
EditOther       67.2 (0.1)  82.9   84.4/91.1  87.0  70.2/29.1  79.4/78.6  85.6  69.7  63.5
Contrast        66.3 (0.6)  83.0   82.5/89.3  83.0  69.8/28.3  80.2/79.5  85.9  68.2  58.7
MNLI 8.5k       67.6 (0.1)  83.5   89.5/92.9  88.0  69.4/28.3  79.5/78.6  86.3  69.3  62.5
MNLI Gov. 8.5k  67.5 (0.3)  82.5   89.5/94.6  85.0  70.0/28.1  79.8/79.0  87.4  68.7  62.5
ANLI 8.5k       67.2 (0.3)  83.4   86.3/91.1  83.0  69.3/28.9  81.2/80.4  85.9  70.1  63.5
MNLI            67.7 (0.1)  84.0   85.5/91.1  89.0  71.5/31.0  79.1/78.3  87.7  68.5  63.5
ANLI            68.1 (0.4)  83.7   82.8/91.1  86.0  71.3/30.0  80.1/79.3  89.5  69.6  66.3
Table 6: Model performance on the SuperGLUE validation and diagnostic sets. The Avg. column shows the overall SuperGLUE score (an average across the eight tasks, weighting each task equally) as a mean and standard deviation across three restarts.

Hypothesis-Only Models

To further investigate the degree to which our hypotheses contain artifacts that reveal their labels, Table 5 shows results with single-input versions of our models trained on hypothesis-only versions of the datasets under study and evaluated on the datasets’ validation sets.
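Constructing the hypothesis-only version of a dataset amounts to discarding the premise from every example before training and evaluation. A minimal sketch, assuming examples are stored as dictionaries with hypothetical `premise`, `hypothesis`, and `label` keys:

```python
def to_hypothesis_only(examples):
    """Drop the premise so a single-input classifier sees only the hypothesis."""
    return [{"text": ex["hypothesis"], "label": ex["label"]} for ex in examples]


def majority_class_accuracy(train, val):
    """Accuracy of always predicting the most frequent training label."""
    counts = {}
    for ex in train:
        counts[ex["label"]] = counts.get(ex["label"], 0) + 1
    majority = max(counts, key=counts.get)
    return sum(ex["label"] == majority for ex in val) / len(val)
```

The majority-class helper is included only as a reference point: on a balanced three-class set it gives roughly one third, so hypothesis-only accuracy well above that level indicates that the hypotheses alone carry label information.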

Our first three interventions, especially EditPremise, show much lower hypothesis-only performance than Base. This drop is much larger than the drop seen in our standard NLI experiments in the Self column of Table 4. This indicates that these results cannot be explained away as a consequence of the lower quality of the evaluation sets for these three new datasets. This adds further evidence, alongside our PMI results, that these interventions reduce the presence of such artifacts. While we do not have a direct baseline for the two-class Contrast in this experiment, comparisons with MNLI 8.5k are consistent with the encouraging results seen above.

Transfer Evaluations

For our primary evaluation, we use the training sets from our datasets in STILTs-style intermediate training (Phang et al., 2018): We fine-tune a large pretrained model on our collected data using standard fine-tuning procedures, then fine-tune a copy of the resulting model again on each of the target evaluation datasets we use. We then measure the aggregate performance of the resulting models across those evaluation datasets.
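The two-stage procedure can be outlined schematically as below. This is a structural sketch only: `fine_tune` and `evaluate` are hypothetical callables standing in for real training and evaluation code (the experiments themselves use the jiant toolkit), and the simple unweighted mean at the end is an illustrative aggregate rather than the exact SuperGLUE scoring.

```python
import copy


def intermediate_then_target(pretrained_model, intermediate_data, target_tasks,
                             fine_tune, evaluate):
    """Fine-tune once on intermediate data, then separately on each target task.

    `target_tasks` maps a task name to a dict with 'train' and 'val' data.
    `fine_tune(model, data)` returns a trained model.
    `evaluate(model, data)` returns a numeric score.
    """
    # Stage 1: intermediate training on the NLI-style dataset.
    intermediate_model = fine_tune(copy.deepcopy(pretrained_model), intermediate_data)

    # Stage 2: each target task starts from its own copy of the stage-1 model,
    # so tuning on one task never affects another.
    scores = {}
    for name, task in target_tasks.items():
        task_model = fine_tune(copy.deepcopy(intermediate_model), task["train"])
        scores[name] = evaluate(task_model, task["val"])

    average = sum(scores.values()) / len(scores)
    return scores, average
```

Passing a no-op in place of the first `fine_tune` call would correspond to the "None" rows of Table 6, where target-task tuning starts directly from the pretrained model.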

We evaluate on the target tasks in the SuperGLUE benchmark (Wang et al., 2019a): BoolQ (Clark et al., 2019), MultiRC (Khashabi et al., 2018), ReCoRD (Zhang et al., 2018), CommitmentBank (CB; De Marneffe et al., 2019), Choice of Plausible Alternatives (COPA; Roemmele et al., 2011), Recognizing Textual Entailment (RTE; Dagan et al., 2006; Bar Haim et al., 2006; Giampiccolo et al., 2007; Bentivogli et al., 2009), the Winograd Schema Challenge (WSC; Levesque et al., 2012), and WiC (Pilehvar and Camacho-Collados, 2019). These tasks were selected to be difficult for BERT but relatively easy for non-expert humans, and are meant to replace the largely-saturated GLUE benchmark (Wang et al., 2019b).

SuperGLUE does not include labeled test data, and does not allow for substantial ablation analyses on its test sets. Since we have no single final model whose performance we aim to show off, we do not evaluate on the test sets. We also neither use any auxiliary WSC-format data when training our WSC model (as in Kocijan et al., 2019) nor artificially modify the task format. As has been observed elsewhere, we do not generally reach above-chance performance on that task without these extra techniques.

Results are shown in Table 6. Our first observation is that our overall data collection pipeline worked well for our purposes: Our Base data yields models that transfer substantially better than the plain RoBERTa or XLNet baseline, and at least slightly better than 8.5k-example samples of MNLI, MNLI Government or ANLI.

However, all four of our interventions yield worse transfer performance than Base. The variances across runs are small, and this pattern is consistent across both RoBERTa and XLNet, and across most individual target tasks.

We believe that this is a genuine negative result: At least under the broad experimental setting outlined here, we find that none of these four interventions is helpful for transfer learning. We chose to collect 8,500-example samples because of the prior observation that this approximate amount was sufficient to show clear results on transfer learning, and we reproduce that finding here: Both MNLI 8.5k and the Base dataset yield large improvements over plain RoBERTa or XLNet through transfer learning. If any of our interventions were to be helpful in general, we would expect them to be harmless or helpful in our regime relative to Base. This is not what we observe.

We believe that this is the first study to evaluate ANLI as a pretraining task in transfer learning, and we observe that the large combined ANLI training set yields consistently better transfer than the original MNLI dataset. However, we observe (to our surprise) that this result reverses when we control for training set size, with an 8.5k-example sample of MNLI yielding consistently better performance than an equivalently small sample of ANLI.
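
Comparing equally sized samples of two training sets requires a reproducible way to draw those samples. The snippet below is a minimal sketch of one way to do so, not a description of the sampling actually used here: the helper size_matched_sample, the seed, and the toy datasets are hypothetical.

```python
import random


def size_matched_sample(examples, n=8500, seed=0):
    """Return a fixed-seed random subset of n examples (hypothetical helper)."""
    if len(examples) < n:
        raise ValueError("dataset is smaller than the requested sample size")
    # A dedicated Random instance keeps the draw reproducible and leaves
    # the global random state untouched.
    return random.Random(seed).sample(examples, n)


# Toy stand-ins for two training sets of different sizes.
large_set = [f"large-{i}" for i in range(20000)]
small_set = [f"small-{i}" for i in range(8500)]
matched_large = size_matched_sample(large_set)
matched_small = size_matched_sample(small_set)
```

Fixing the seed makes the subsample reproducible, so differences between conditions are not confounded with differences in which examples happened to be drawn.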

We also note that our best overall result uses only 8.5k NLI training examples, suggesting either that this size is enough to maximize the gains available through NLI pretraining, or that issues of forgetting during intermediate training make larger intermediate training sets more sensitive to optimization choices.

Finally, we replicate the finding of Phang et al. (2018) that intermediate-task training with NLI data substantially reduces the variance across restarts seen in target task tuning.

5 Conclusion

Our chief results on transfer learning are conclusively negative: In this setting, all four interventions yield substantially worse transfer performance than our base MNLI data collection protocol. However, we also observe promising signs that all four of our interventions help to reduce the prevalence of artifacts in the generated hypotheses that reveal the label. This suggests that these methods may be valuable in the collection of high-quality evaluation data, if combined with additional validation methods to ensure high human agreement with the collected labels.

The need and opportunity that motivated this work remain compelling: Human-annotated data like MNLI has already proven itself as a valuable tool in teaching machines general-purpose skills for language understanding, and discovering ways to more effectively build and use such data could further accelerate the field’s already fast progress toward robust, general-purpose language understanding technologies.

Further work along this line of research could productively follow a number of directions: General work on incentive structures and task design for crowdsourcing could help to address questions about how to collect data that is both creative and consistently labeled. Work on machine learning methods for transfer learning could help us to better understand and exploit the effects that drive the successes we have seen with NLI data so far. Finally, there remains room for further empirical work investigating the kinds of task definitions and data collection protocols most likely to yield positive transfer.


Acknowledgments

We thank the annotators who spent time and effort on this project and the many members of the natural language processing community at Google who provided feedback.


References

  • R. Bar Haim, I. Dagan, B. Dolan, L. Ferro, D. Giampiccolo, B. Magnini, and I. Szpektor (2006) The second PASCAL recognising textual entailment challenge. In Proceedings of the Second PASCAL Challenges Workshop on Recognising Textual Entailment.
  • L. Bentivogli, I. Dagan, H. T. Dang, D. Giampiccolo, and B. Magnini (2009) The fifth PASCAL recognizing textual entailment challenge. In Textual Analysis Conference (TAC).
  • J. Bingel and A. Søgaard (2017) Identifying beneficial task relations for multi-task learning in deep neural networks. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, Valencia, Spain, pp. 164–169.
  • S. R. Bowman, G. Angeli, C. Potts, and C. D. Manning (2015) A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 632–642.
  • D. Cer, Y. Yang, S. Kong, N. Hua, N. Limtiaco, R. St. John, N. Constant, M. Guajardo-Cespedes, S. Yuan, C. Tar, B. Strope, and R. Kurzweil (2018) Universal sentence encoder for English. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Brussels, Belgium, pp. 169–174.
  • C. Clark, K. Lee, M. Chang, T. Kwiatkowski, M. Collins, and K. Toutanova (2019) BoolQ: exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 2924–2936.
  • A. Conneau, D. Kiela, H. Schwenk, L. Barrault, and A. Bordes (2017) Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP).
  • R. Cooper, D. Crouch, J. Van Eijck, C. Fox, J. Van Genabith, J. Jaspars, H. Kamp, D. Milward, M. Pinkal, M. Poesio, S. Pulman, T. Briscoe, and K. Konrad (1996) Using the framework. Technical Report LRE 62-051 D-16, The FraCaS Consortium.
  • I. Dagan, O. Glickman, and B. Magnini (2006) The PASCAL recognising textual entailment challenge. In Machine Learning Challenges. Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Textual Entailment.
  • M. De Marneffe, M. Simons, and J. Tonhauser (2019) The CommitmentBank: investigating projection in naturally occurring discourse. To appear in Proceedings of Sinn und Bedeutung 23.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT).
  • M. L. Forcada, M. Ginestí-Rosell, J. Nordfalk, J. O’Regan, S. Ortiz-Rojas, J. A. Pérez-Ortiz, F. Sánchez-Martínez, G. Ramírez-Sánchez, and F. M. Tyers (2011) Apertium: a free/open-source platform for rule-based machine translation. Machine Translation 25 (2), pp. 127–144.
  • M. Gardner, J. Grus, M. Neumann, O. Tafjord, P. Dasigi, N. F. Liu, M. Peters, M. Schmitz, and L. S. Zettlemoyer (2017) AllenNLP: a deep semantic natural language processing platform. In Proceedings of Workshop for NLP Open Source Software.
  • D. Giampiccolo, B. Magnini, I. Dagan, and B. Dolan (2007) The third PASCAL recognizing textual entailment challenge. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, Prague, pp. 1–9.
  • S. Gururangan, S. Swayamdipta, O. Levy, R. Schwartz, S. Bowman, and N. A. Smith (2018) Annotation artifacts in natural language inference data. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 107–112.
  • M. Honnibal and I. Montani (2017) spaCy 2: natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing.
  • N. Ide and K. Suderman (2006) Integrating linguistic resources: the national corpus model. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC).
  • D. Kaushik, E. Hovy, and Z. C. Lipton (2020) Learning the difference that makes a difference with counterfactually-augmented data. In International Conference on Learning Representations (ICLR).
  • D. Khashabi, S. Chaturvedi, M. Roth, S. Upadhyay, and D. Roth (2018) Looking beyond the surface: a challenge set for reading comprehension over multiple sentences. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT).
  • A. Khetan, Z. C. Lipton, and A. Anandkumar (2018) Learning from noisy singly-labeled data. In International Conference on Learning Representations (ICLR).
  • T. Khot, A. Sabharwal, and P. Clark (2018) SciTail: a textual entailment dataset from science question answering. In AAAI.
  • V. Kocijan, A. Cretu, O. Camburu, Y. Yordanov, and T. Lukasiewicz (2019) A surprisingly robust trick for the Winograd schema challenge. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 4837–4842.
  • A. Lai, Y. Bisk, and J. Hockenmaier (2017) Natural language inference from multiple premises. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Taipei, Taiwan, pp. 100–109.
  • H. Levesque, E. Davis, and L. Morgenstern (2012) The Winograd schema challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning.
  • X. Liu, P. He, W. Chen, and J. Gao (2019) Multi-task deep neural networks for natural language understanding. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 4487–4496.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint 1907.11692.
  • M. Marelli, S. Menini, M. Baroni, L. Bentivogli, R. Bernardi, and R. Zamparelli (2014) A SICK cure for the evaluation of compositional distributional semantic models. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), Reykjavik, Iceland, pp. 216–223.
  • L. Mou, Z. Meng, R. Yan, G. Li, Y. Xu, L. Zhang, and Z. Jin (2016) How transferable are neural networks in NLP applications?. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, pp. 479–489.
  • Y. Nie, A. Williams, E. Dinan, M. Bansal, J. Weston, and D. Kiela (2019) Adversarial NLI: a new benchmark for natural language understanding. arXiv preprint 1910.14599.
  • A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in PyTorch. In Advances in Neural Information Processing Systems (NeurIPS).
  • J. Phang, T. Févry, and S. R. Bowman (2018) Sentence encoders on STILTs: supplementary training on intermediate labeled-data tasks. arXiv preprint 1811.01088.
  • M. T. Pilehvar and J. Camacho-Collados (2019) WiC: the word-in-context dataset for evaluating context-sensitive meaning representations. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT).
  • A. Poliak, A. Haldar, R. Rudinger, J. E. Hu, E. Pavlick, A. S. White, and B. Van Durme (2018) Collecting diverse natural language inference problems for sentence representation evaluation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.
  • M. Roemmele, C. A. Bejan, and A. S. Gordon (2011) Choice of plausible alternatives: an evaluation of commonsense causal reasoning. In 2011 AAAI Spring Symposium Series.
  • K. Sakaguchi, R. Le Bras, C. Bhagavatula, and Y. Choi (2019) WinoGrande: an adversarial winograd schema challenge at scale. arXiv preprint 1907.10641.
  • M. Sap, H. Rashkin, D. Chen, R. Le Bras, and Y. Choi (2019) Social IQa: commonsense reasoning about social interactions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 4462–4472.
  • S. Subramanian, A. Trischler, Y. Bengio, and C. J. Pal (2018) Learning general purpose distributed sentence representations via large scale multi-task learning. In International Conference on Learning Representations.
  • A. Suhr, S. Zhou, A. Zhang, I. Zhang, H. Bai, and Y. Artzi (2019) A corpus for reasoning about natural language grounded in photographs. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 6418–6428.
  • M. Tsuchiya (2018) Performance impact caused by hidden bias of training data for recognizing textual entailment. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan.
  • A. Wang, J. Hula, P. Xia, R. Pappagari, R. T. McCoy, R. Patel, N. Kim, I. Tenney, Y. Huang, K. Yu, S. Jin, B. Chen, B. Van Durme, E. Grave, E. Pavlick, and S. R. Bowman (2019) Can you tell me how to get past sesame street? sentence-level pretraining beyond language modeling. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 4465–4476.
  • A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman (2019a) SuperGLUE: a stickier benchmark for general-purpose language understanding systems. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 3261–3275.
  • A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman (2019b) GLUE: a multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations.
  • A. Wang, I. F. Tenney, Y. Pruksachatkun, K. Yu, J. Hula, P. Xia, R. Pappagari, S. Jin, R. T. McCoy, R. Patel, Y. Huang, J. Phang, E. Grave, H. Liu, N. Kim, P. M. Htut, T. Févry, B. Chen, N. Nangia, A. Mohananey, K. Kann, S. Bordia, N. Patry, D. Benton, E. Pavlick, and S. R. Bowman (2019c) jiant 1.2: a software toolkit for research on general-purpose text understanding models.
  • A. Williams, N. Nangia, and S. Bowman (2018) A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT).
  • T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, and J. Brew (2019) HuggingFace’s Transformers: state-of-the-art natural language processing. arXiv preprint 1910.03771.
  • Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le (2019) XLNet: generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 5753–5763.
  • R. Zellers, Y. Bisk, R. Schwartz, and Y. Choi (2018) SWAG: a large-scale adversarial dataset for grounded commonsense inference. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.
  • S. Zhang, X. Liu, J. Liu, J. Gao, K. Duh, and B. Van Durme (2018) ReCoRD: bridging the gap between human and machine commonsense reading comprehension. arXiv preprint 1810.12885.