On the Role of Bidirectionality in Language Model Pre-Training

05/24/2022
by Mikel Artetxe, et al.

Prior work on language model pre-training has explored different architectures and learning objectives, but differences in data, hyperparameters, and evaluation make a principled comparison difficult. In this work, we focus on bidirectionality as a key factor that differentiates existing approaches, and present a comprehensive study of its role in next token prediction, text infilling, zero-shot priming and fine-tuning. We propose a new framework that generalizes prior approaches, including fully unidirectional models like GPT, fully bidirectional models like BERT, and hybrid models like CM3 and prefix LM. Our framework distinguishes between two notions of bidirectionality (bidirectional context and bidirectional attention) and allows us to control each of them separately. We find that the optimal configuration is largely application-dependent (e.g., bidirectional attention is beneficial for fine-tuning and infilling, but harmful for next token prediction and zero-shot priming). We train models with up to 6.7B parameters, and find that these differences remain consistent at scale. While prior work on scaling has focused on left-to-right autoregressive models, our results suggest that this approach comes with some trade-offs, and it might be worthwhile to develop very large bidirectional models.
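The distinction the abstract draws between unidirectional, bidirectional, and hybrid attention can be made concrete with attention masks. The sketch below is only an illustration of those three masking patterns, not the paper's actual framework or implementation; the helper names are hypothetical.

```python
import numpy as np


def causal_mask(n: int) -> np.ndarray:
    """Unidirectional (GPT-style) attention: position i attends to positions <= i."""
    return np.tril(np.ones((n, n), dtype=bool))


def bidirectional_mask(n: int) -> np.ndarray:
    """Fully bidirectional (BERT-style) attention: every position attends to every position."""
    return np.ones((n, n), dtype=bool)


def prefix_lm_mask(n: int, prefix_len: int) -> np.ndarray:
    """Hybrid (prefix-LM-style) attention: the first `prefix_len` positions attend
    bidirectionally among themselves, and the remaining positions attend causally
    while still seeing the full prefix."""
    mask = causal_mask(n)
    mask[:, :prefix_len] = True  # all positions may attend into the prefix
    return mask


if __name__ == "__main__":
    # Rows are query positions, columns are key positions (1 = attention allowed).
    print(prefix_lm_mask(5, prefix_len=2).astype(int))
```

Bidirectional context, the other notion the framework separates out, concerns whether tokens to the right of a prediction target appear in the input at all (as in masked-token or infilling objectives), which is independent of which of the masks above governs attention.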


