Collecting Entailment Data for Pretraining: New Protocols and Negative Results

04/24/2020
by Samuel R. Bowman, et al.

Textual entailment (or NLI) data has proven useful as pretraining data for tasks requiring language understanding, even when building on an already-pretrained model like RoBERTa. The standard protocol for collecting NLI data was not designed for the creation of pretraining data, and it is likely far from ideal for this purpose. With this application in mind, we propose four alternative protocols, each aimed at improving either the ease with which annotators can produce sound training examples or the quality and diversity of those examples. Using these alternatives and a simple MNLI-based baseline, we collect and compare five new 8.5k-example training sets. Our primary results are solidly negative: our baseline MNLI-style dataset yields good transfer performance, but none of our four new methods (nor the recent ANLI) shows any improvement over that baseline. However, we do observe that all four of these interventions, especially the use of seed sentences for inspiration, reduce previously observed issues with annotation artifacts.
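To make the setup concrete, the sketch below shows the kind of intermediate-training step the paper evaluates: fine-tuning an already-pretrained RoBERTa model on a small NLI training set before transferring to target tasks. This is not the authors' released code; it uses the Hugging Face transformers and datasets libraries, stands in an 8.5k-example subsample of MNLI for one of the collected datasets, and the hyperparameters are illustrative assumptions.

```python
# Minimal sketch: intermediate fine-tuning of RoBERTa on an 8.5k-example NLI
# set, as a stand-in for the paper's comparison of collection protocols.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=3)  # entailment / neutral / contradiction

# Stand-in for one of the 8.5k-example training sets: a subsample of MNLI.
nli = load_dataset("multi_nli", split="train").shuffle(seed=0).select(range(8500))

def encode(batch):
    # Encode premise/hypothesis pairs as a single sequence-pair input.
    return tokenizer(batch["premise"], batch["hypothesis"],
                     truncation=True, max_length=128)

nli = nli.map(encode, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="roberta-nli-intermediate",
                           per_device_train_batch_size=32,
                           num_train_epochs=3),
    train_dataset=nli,
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()
# The resulting checkpoint would then be fine-tuned separately on each target
# task to measure transfer performance, which is how the protocols are compared.
```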
