ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators

03/23/2020
by   Kevin Clark, et al.

Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens. While they produce good results when transferred to downstream NLP tasks, they generally require large amounts of compute to be effective. As an alternative, we propose a more sample-efficient pre-training task called replaced token detection. Instead of masking the input, our approach corrupts it by replacing some tokens with plausible alternatives sampled from a small generator network. Then, instead of training a model that predicts the original identities of the corrupted tokens, we train a discriminative model that predicts whether each token in the corrupted input was replaced by a generator sample or not. Thorough experiments demonstrate this new pre-training task is more efficient than MLM because the task is defined over all input tokens rather than just the small subset that was masked out. As a result, the contextual representations learned by our approach substantially outperform the ones learned by BERT given the same model size, data, and compute. The gains are particularly strong for small models; for example, we train a model on one GPU for 4 days that outperforms GPT (trained using 30x more compute) on the GLUE natural language understanding benchmark. Our approach also works well at scale, where it performs comparably to RoBERTa and XLNet while using less than 1/4 of their compute and outperforms them when using the same amount of compute.
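
To make the replaced token detection setup concrete, here is a minimal sketch of how a corrupted input and its per-token labels could be built. It is an illustration under stated assumptions, not the authors' implementation: `corrupt_for_rtd`, `toy_generator`, and `TOY_VOCAB` are hypothetical names, the "generator" simply samples uniformly from a toy vocabulary instead of being a trained network, and the 15% corruption rate is an assumed default that the abstract does not state.

```python
import random

# Hypothetical toy vocabulary; a real generator would be a small trained network.
TOY_VOCAB = ["the", "a", "chef", "cook", "cooked", "ate", "meal", "dinner"]


def toy_generator(tokens, position, rng):
    """Stand-in for the small generator: ignores context and samples uniformly."""
    return rng.choice(TOY_VOCAB)


def corrupt_for_rtd(tokens, sample_replacement, corrupt_prob=0.15, rng=None):
    """Build a replaced-token-detection example.

    Returns (corrupted_tokens, labels, positions), where labels[i] is 1 if the
    token at position i differs from the original and 0 otherwise. Labels cover
    every position, not only the corrupted ones.
    """
    if not tokens:
        return [], [], []
    rng = rng or random.Random()
    # Assumed corruption rate; at least one position is always selected.
    n_corrupt = min(len(tokens), max(1, round(len(tokens) * corrupt_prob)))
    positions = sorted(rng.sample(range(len(tokens)), n_corrupt))

    corrupted = list(tokens)
    for i in positions:
        corrupted[i] = sample_replacement(tokens, i, rng)

    # If the sampled token happens to equal the original, the position is
    # labeled as original (0) in this sketch.
    labels = [int(c != t) for c, t in zip(corrupted, tokens)]
    return corrupted, labels, positions


tokens = "the chef cooked the meal".split()
corrupted, labels, positions = corrupt_for_rtd(
    tokens, toy_generator, rng=random.Random(0)
)
print(corrupted)
print(labels)
```

The discriminator would be trained to predict `labels` from `corrupted`, so every input position contributes a training signal, whereas an MLM objective only scores the positions that were masked out.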

research
06/02/2020

Position Masking for Language Models

Masked language modeling (MLM) pre-training models such as BERT corrupt ...
research
06/10/2020

MC-BERT: Efficient Language Pre-Training via a Meta Controller

Pre-trained contextual representations (e.g., BERT) have become the foun...
research
12/31/2020

AraELECTRA: Pre-Training Text Discriminators for Arabic Language Understanding

Advances in English language representation enabled a more sample-effici...
research
11/09/2022

Mask More and Mask Later: Efficient Pre-training of Masked Language Models by Disentangling the [MASK] Token

The pre-training of masked language models (MLMs) consumes massive compu...
research
10/08/2022

Short Text Pre-training with Extended Token Classification for E-commerce Query Understanding

E-commerce query understanding is the process of inferring the shopping ...
research
04/20/2021

Efficient pre-training objectives for Transformers

The Transformer architecture deeply changed the natural language process...
research
12/15/2020

Pre-Training Transformers as Energy-Based Cloze Models

We introduce Electric, an energy-based cloze model for representation le...
