On Losses for Modern Language Models

10/04/2020
by Stephane Aroca-Ouellette, et al.

BERT set many state-of-the-art results across varied NLU benchmarks by pre-training on two tasks: masked language modelling (MLM) and next sentence prediction (NSP), the latter of which has been highly criticized. In this paper, we 1) clarify NSP's effect on BERT pre-training, 2) explore fourteen possible auxiliary pre-training tasks, of which seven are novel to modern language models, and 3) investigate different ways to include multiple tasks in pre-training. We show that NSP is detrimental to training due to its context splitting and shallow semantic signal. We also identify six auxiliary pre-training tasks (sentence ordering, adjacent sentence prediction, TF prediction, TF-IDF prediction, a FastSent variant, and a Quick Thoughts variant) that outperform a pure MLM baseline. Finally, we demonstrate that combining multiple tasks in a multi-task pre-training framework provides better results than using any single auxiliary task. Using these methods, we outperform BERT Base on the GLUE benchmark using fewer than a quarter of the training tokens.
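To make the multi-task idea concrete, the sketch below shows one common way such a setup is wired: a shared encoder trained by summing a token-level MLM loss with a sequence-level auxiliary loss (sentence-order prediction is used here as the example). This is only an illustrative sketch, not the authors' implementation; the module names, model sizes, masking rate, loss weighting, and the random toy batch are all assumptions.

```python
# Illustrative multi-task pre-training sketch (assumed setup, not the paper's code):
# one shared Transformer encoder, an MLM head, and an auxiliary sentence-order head;
# the training loss is a simple unweighted sum of the two task losses.
import torch
import torch.nn as nn

VOCAB, HIDDEN, MAX_LEN = 1000, 128, 32  # toy sizes, chosen for illustration

class TinyEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HIDDEN)
        self.pos = nn.Embedding(MAX_LEN, HIDDEN)
        layer = nn.TransformerEncoderLayer(HIDDEN, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.mlm_head = nn.Linear(HIDDEN, VOCAB)   # token-level predictions
        self.order_head = nn.Linear(HIDDEN, 2)     # sequence-level auxiliary task

    def forward(self, ids):
        positions = torch.arange(ids.size(1), device=ids.device)
        h = self.encoder(self.embed(ids) + self.pos(positions))
        # Pool the first token as a CLS-style sequence representation.
        return self.mlm_head(h), self.order_head(h[:, 0])

model = TinyEncoder()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
ce = nn.CrossEntropyLoss(ignore_index=-100)  # -100 marks positions with no MLM label

# Toy batch: random token ids, MLM labels only at ~15% "masked" positions,
# plus a binary label for whether two segments appear in their original order.
ids = torch.randint(0, VOCAB, (8, MAX_LEN))
mlm_labels = torch.full_like(ids, -100)
mask = torch.rand(ids.shape) < 0.15
mlm_labels[mask] = ids[mask]
order_labels = torch.randint(0, 2, (8,))

mlm_logits, order_logits = model(ids)
loss = ce(mlm_logits.view(-1, VOCAB), mlm_labels.view(-1)) \
     + ce(order_logits, order_labels)  # unweighted sum of task losses
loss.backward()
opt.step()
```

In practice the auxiliary head and its loss weighting would differ per task (e.g., regression for TF/TF-IDF prediction), but the pattern of a shared encoder with per-task heads and a combined loss is the same.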


