Sentence Encoders on STILTs: Supplementary Training on Intermediate Labeled-data Tasks

by Jason Phang et al.
New York University

Pretraining with language modeling and related unsupervised tasks has recently been shown to be a very effective enabling technology for the development of neural network models for language understanding tasks. In this work, we show that although language model-style pretraining is extremely effective at teaching models about language, it does not yield an ideal starting point for efficient transfer learning. By supplementing language model-style pretraining with further training on data-rich supervised tasks, we are able to achieve substantial additional performance improvements across the nine target tasks in the GLUE benchmark. We obtain an overall score of 76.9 on GLUE--a 2.3 point improvement over our baseline system adapted from Radford et al. (2018) and a 4.1 point improvement over Radford et al.'s reported score. We further use training data downsampling to show that the benefits of this supplementary training are even more pronounced in data-constrained regimes.
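
The approach described above is a three-stage pipeline: start from an LM-pretrained sentence encoder, run supplementary training on a data-rich labeled intermediate task (the STILTs stage, e.g. natural language inference with MNLI, one of the intermediate tasks explored in the paper), then fine-tune the resulting encoder separately on each GLUE target task. The sketch below illustrates that recipe with the Hugging Face transformers and datasets libraries; the encoder (bert-base-uncased), hyperparameters, and target task (RTE) are illustrative stand-ins, not the paper's actual setup, which builds on the OpenAI GPT model of Radford et al. (2018).

```python
# Minimal STILTs-style sketch: LM-pretrained encoder -> intermediate-task
# training (MNLI) -> target-task fine-tuning (RTE). Names and hyperparameters
# are illustrative assumptions, not the paper's exact configuration.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

BASE = "bert-base-uncased"  # stand-in for the LM-pretrained encoder
tok = AutoTokenizer.from_pretrained(BASE)

def encode(batch, keys):
    # Tokenize a sentence pair (premise/hypothesis or sentence1/sentence2).
    return tok(*(batch[k] for k in keys), truncation=True,
               padding="max_length", max_length=128)

# Stage 1: supplementary training on a data-rich intermediate task (MNLI, 3 labels).
mnli = load_dataset("glue", "mnli").map(
    lambda b: encode(b, ("premise", "hypothesis")), batched=True)
model = AutoModelForSequenceClassification.from_pretrained(BASE, num_labels=3)
Trainer(model=model,
        args=TrainingArguments("stilts_mnli", num_train_epochs=3,
                               per_device_train_batch_size=32),
        train_dataset=mnli["train"]).train()
model.save_pretrained("stilts_mnli")  # keep the intermediate-task encoder

# Stage 2: fine-tune that encoder on a target GLUE task (RTE, 2 labels).
# The MNLI classifier head is discarded and a fresh 2-way head is initialized
# because num_labels differs (ignore_mismatched_sizes handles the change).
rte = load_dataset("glue", "rte").map(
    lambda b: encode(b, ("sentence1", "sentence2")), batched=True)
target = AutoModelForSequenceClassification.from_pretrained(
    "stilts_mnli", num_labels=2, ignore_mismatched_sizes=True)
Trainer(model=target,
        args=TrainingArguments("stilts_rte", num_train_epochs=3,
                               per_device_train_batch_size=16),
        train_dataset=rte["train"],
        eval_dataset=rte["validation"]).train()
```

One way to mimic the paper's data-constrained experiments is to rerun stage 2 on a downsampled training split, e.g. rte["train"].shuffle(seed=0).select(range(1000)), and compare against the same model fine-tuned without the intermediate MNLI stage.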


Looking for ELMo's friends: Sentence-Level Pretraining Beyond Language Modeling

Work on the problem of contextualized word representation -- the develop...

Language Modeling Teaches You More Syntax than Translation Does: Lessons Learned Through Auxiliary Task Analysis

Recent work using auxiliary prediction task classifiers to investigate t...

Linear pretraining in recurrent mixture density networks

We present a method for pretraining a recurrent mixture density network ...

Human vs. Muppet: A Conservative Estimate of Human Performance on the GLUE Benchmark

The GLUE benchmark (Wang et al., 2019b) is a suite of language understan...

To Pretrain or Not to Pretrain: Examining the Benefits of Pretraining on Resource Rich Tasks

Pretraining NLP models with variants of Masked Language Model (MLM) obje...

Emergent inabilities? Inverse scaling over the course of pretraining

Does inverse scaling only occur as a function of model parameter size, o...

Collecting Entailment Data for Pretraining: New Protocols and Negative Results

Textual entailment (or NLI) data has proven useful as pretraining data f...