
Collecting Entailment Data for Pretraining: New Protocols and Negative Results

04/24/2020
by   Samuel R. Bowman, et al.
Google
New York University

Textual entailment (or NLI) data has proven useful as pretraining data for tasks requiring language understanding, even when building on an already-pretrained model like RoBERTa. The standard protocol for collecting NLI data was not designed for the creation of pretraining data, and it is likely far from ideal for this purpose. With this application in mind, we propose four alternative protocols, each aimed at improving either the ease with which annotators can produce sound training examples or the quality and diversity of those examples. Using these alternatives and a simple MNLI-based baseline, we collect and compare five new 8.5k-example training sets. Our primary results are solidly negative: our baseline MNLI-style dataset yields good transfer performance, but none of our four new methods (nor the recent ANLI) shows any improvement over that baseline. However, we do observe that all four of these interventions, especially the use of seed sentences for inspiration, reduce previously observed issues with annotation artifacts.
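The transfer setup described in the abstract, continuing to train an already-pretrained model such as RoBERTa on NLI data before fine-tuning on a target task, can be sketched with standard tooling. Below is a minimal, hypothetical illustration that trains roberta-base on an 8.5k-example MNLI subsample using the HuggingFace transformers and datasets libraries; the model choice, subsample size, and hyperparameters are assumptions for illustration, not the paper's exact configuration.

# Minimal sketch: intermediate training of RoBERTa on an 8.5k-example MNLI
# subsample, as a stand-in for the NLI-as-pretraining setup the paper studies.
# Hyperparameters and the subsample size are illustrative, not the paper's.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=3  # entailment / neutral / contradiction
)

# MNLI premise-hypothesis pairs; take a small subsample comparable in size
# to the paper's 8.5k-example training sets.
mnli = load_dataset("multi_nli", split="train").shuffle(seed=0).select(range(8500))

def encode(batch):
    return tokenizer(
        batch["premise"], batch["hypothesis"], truncation=True, max_length=128
    )

mnli = mnli.map(encode, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="roberta-mnli-intermediate",
        per_device_train_batch_size=16,
        num_train_epochs=3,
        learning_rate=1e-5,
    ),
    train_dataset=mnli,
    tokenizer=tokenizer,
)
trainer.train()

# The resulting checkpoint would then be fine-tuned separately on each target
# task to measure transfer performance.
trainer.save_model("roberta-mnli-intermediate")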

Related Research

10/13/2020
Asking Crowdworkers to Write Entailment Examples: The Best of Bad Options
Large-scale natural language inference (NLI) datasets such as SNLI or MN...

10/09/2021
The Inductive Bias of In-Context Learning: Rethinking Pretraining Example Design
Pretraining Neural Language Models (NLMs) over a large corpus involves c...

11/02/2018
Sentence Encoders on STILTs: Supplementary Training on Intermediate Labeled-data Tasks
Pretraining with language modeling and related unsupervised tasks has re...

05/12/2018
AdvEntuRe: Adversarial Training for Textual Entailment with Knowledge-Guided Examples
We consider the problem of learning textual entailment models with limit...

06/01/2021
What Ingredients Make for an Effective Crowdsourcing Protocol for Difficult NLU Data Collection Tasks?
Crowdsourcing is widely used to create data for common natural language ...

06/15/2020
To Pretrain or Not to Pretrain: Examining the Benefits of Pretraining on Resource Rich Tasks
Pretraining NLP models with variants of Masked Language Model (MLM) obje...

09/08/2021
Continuous Entailment Patterns for Lexical Inference in Context
Combining a pretrained language model (PLM) with textual patterns has be...