Mask More and Mask Later: Efficient Pre-training of Masked Language Models by Disentangling the [MASK] Token

11/09/2022
by Baohao Liao, et al.

The pre-training of masked language models (MLMs) consumes massive computation to achieve good results on downstream NLP tasks, resulting in a large carbon footprint. In the vanilla MLM, the virtual tokens, [MASK]s, act as placeholders and gather contextualized information from unmasked tokens to restore the corrupted information. This raises the question of whether we can append [MASK]s at a later layer, reducing the sequence length for the earlier layers and making pre-training more efficient. We show: (1) [MASK]s can indeed be appended at a later layer, being disentangled from the word embedding; (2) the gathering of contextualized information from unmasked tokens can be conducted with only a few layers. By further increasing the masking rate from 15% to 50%, we can pre-train RoBERTa from scratch with only 78% of the original computational budget without any degradation on the GLUE benchmark. When pre-training with the original budget, our method outperforms RoBERTa for 6 out of 8 GLUE tasks, on average by 0.4%.
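
The abstract describes running the early encoder layers on only the unmasked (shorter) sequence and appending dedicated [MASK] representations before the last few layers. Below is a minimal PyTorch sketch of that idea under simplifying assumptions (an equal number of masked tokens per example; positional embeddings and the MLM head are omitted); names such as MaskLaterEncoder and n_early_layers are illustrative and not taken from the paper's released code.

import torch
import torch.nn as nn

class MaskLaterEncoder(nn.Module):
    """Sketch: early layers see only unmasked tokens; [MASK]s are appended later."""
    def __init__(self, vocab_size=50265, d_model=768, n_heads=12,
                 n_early_layers=10, n_late_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        def make_layer():
            return nn.TransformerEncoderLayer(
                d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        # Early layers process the shortened sequence of unmasked tokens.
        self.early = nn.ModuleList(make_layer() for _ in range(n_early_layers))
        # Late layers process the full-length sequence after [MASK]s are inserted.
        self.late = nn.ModuleList(make_layer() for _ in range(n_late_layers))
        # A learned [MASK] state, disentangled from the word-embedding table.
        self.mask_state = nn.Parameter(torch.zeros(d_model))

    def forward(self, input_ids, masked_positions):
        # input_ids: [batch, seq_len]; masked_positions: bool [batch, seq_len]
        batch, seq_len = input_ids.shape
        h = self.embed(input_ids)
        keep = ~masked_positions
        # 1) Drop masked positions (assumes an equal count per example) and
        #    run the early layers on the shorter sequence.
        h_short = h[keep].view(batch, -1, h.size(-1))
        for layer in self.early:
            h_short = layer(h_short)
        # 2) Re-insert the processed states, fill masked slots with the shared
        #    [MASK] state, and run the remaining layers on the full length.
        h_full = self.mask_state.expand(batch, seq_len, -1).clone()
        h_full[keep] = h_short.reshape(-1, h_short.size(-1))
        for layer in self.late:
            h_full = layer(h_full)
        return h_full  # would be fed to an MLM head to predict the masked tokens

if __name__ == "__main__":
    model = MaskLaterEncoder(n_early_layers=2, n_late_layers=1)
    ids = torch.randint(0, 50000, (2, 16))
    masked = torch.zeros(2, 16, dtype=torch.bool)
    masked[:, ::2] = True  # 50% masking rate, as in the "mask more" setting
    print(model(ids, masked).shape)  # torch.Size([2, 16, 768])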


research
03/23/2020

ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators

Masked language modeling (MLM) pre-training methods such as BERT corrupt...
research
08/29/2023

Characterizing Learning Curves During Language Model Pre-Training: Learning, Forgetting, and Stability

How do language models learn to make predictions during pre-training? To...
research
02/16/2022

Should You Mask 15% in Masked Language Modeling?

Masked language models conventionally use a masking rate of 15% due to the belief th...
research
06/02/2020

Position Masking for Language Models

Masked language modeling (MLM) pre-training models such as BERT corrupt ...
research
11/21/2022

SMAUG: Sparse Masked Autoencoder for Efficient Video-Language Pre-training

Video-language pre-training is crucial for learning powerful multi-modal...
research
07/14/2022

BERTIN: Efficient Pre-Training of a Spanish Language Model using Perplexity Sampling

The pre-training of large language models usually requires massive amoun...
research
08/23/2023

D4: Improving LLM Pretraining via Document De-Duplication and Diversification

Over recent years, an increasing amount of compute and data has been pou...
