
Cloze-driven Pretraining of Self-attention Networks

by Alexei Baevski, et al.

We present a new approach for pretraining a bi-directional transformer model that provides significant performance gains across a variety of language understanding problems. Our model solves a cloze-style word reconstruction task, where each word is ablated and must be predicted given the rest of the text. Experiments demonstrate large performance gains on GLUE and new state of the art results on NER as well as constituency parsing benchmarks, consistent with the concurrently introduced BERT model. We also present a detailed analysis of a number of factors that contribute to effective pretraining, including data domain and size, model capacity, and variations on the cloze objective.
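The cloze objective described above can be sketched in a few lines: every token is ablated in turn, and the model must recover it from the remaining bi-directional context. The snippet below is a minimal illustration of how such training examples could be generated; the `MASK` token and function names are hypothetical, not from the paper.

```python
# Sketch of the cloze-style word reconstruction task: each word is
# masked (ablated) and must be predicted given the rest of the text.
# MASK and cloze_examples are illustrative names, not the paper's API.
MASK = "<mask>"

def cloze_examples(tokens):
    """Yield (masked_sequence, position, target) triples, one per token."""
    for i, target in enumerate(tokens):
        masked = tokens[:i] + [MASK] + tokens[i + 1:]
        yield masked, i, target

sentence = "the cat sat on the mat".split()
examples = list(cloze_examples(sentence))
# One training example per word in the input, e.g. the first asks the
# model to recover 'the' from ['<mask>', 'cat', 'sat', 'on', 'the', 'mat'].
```

In the paper's bi-directional transformer, the prediction at each position is conditioned on both the left and right context, which is what distinguishes the cloze objective from conventional left-to-right language modeling.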



