Should You Mask 15% in Masked Language Modeling?

02/16/2022
by Alexander Wettig, et al.

Masked language models conventionally use a masking rate of 15% due to the belief that more masking would provide insufficient context to learn good representations, and less masking would make training too expensive. Surprisingly, we find that masking up to 40% of input tokens can outperform the 15% baseline, as measured by fine-tuning on downstream tasks. Increasing the masking rates has two distinct effects, which we investigate through careful ablations: (1) A larger proportion of input tokens are corrupted, reducing the context size and creating a harder task, and (2) models perform more predictions, which benefits training. We observe that larger models in particular favor higher masking rates, as they have more capacity to perform the harder task. We also connect our findings to sophisticated masking schemes such as span masking and PMI masking, as well as BERT's curious 80-10-10 corruption strategy, and find that simple uniform masking with [MASK] replacements can be competitive at higher masking rates. Our results contribute to a better understanding of masked language modeling and point to new avenues for efficient pre-training.
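To make the two corruption strategies mentioned in the abstract concrete, the following is a minimal sketch, not the authors' code: it contrasts simple uniform masking with [MASK] replacements at a configurable rate against BERT's 80-10-10 strategy. The helper name corrupt_for_mlm, the toy token IDs, and the BERT-base-style constants (mask token id 103, vocabulary size 30522) are illustrative assumptions.

```python
# Sketch of MLM input corruption: uniform [MASK] replacement vs. BERT's 80-10-10.
# All names and constants here are placeholders, not taken from the paper's code.
import random

def corrupt_for_mlm(token_ids, mask_rate=0.15, mask_token_id=103,
                    vocab_size=30522, use_80_10_10=False):
    """Return (corrupted_ids, target_positions) for one sequence.

    mask_rate=0.15 is the conventional choice; the paper reports that higher
    rates (e.g. 0.40) can match or outperform it on downstream fine-tuning.
    """
    corrupted = list(token_ids)
    # Choose which positions the model must predict, uniformly at random.
    num_to_mask = max(1, round(mask_rate * len(token_ids)))
    positions = random.sample(range(len(token_ids)), num_to_mask)

    for pos in positions:
        if not use_80_10_10:
            # Simple uniform masking: always substitute [MASK].
            corrupted[pos] = mask_token_id
        else:
            # BERT's 80-10-10 corruption: 80% [MASK], 10% a random token,
            # 10% keep the original token unchanged (but still predict it).
            r = random.random()
            if r < 0.8:
                corrupted[pos] = mask_token_id
            elif r < 0.9:
                corrupted[pos] = random.randrange(vocab_size)
            # else: leave the original token in place

    return corrupted, sorted(positions)

# Example: compare how many predictions a 15% vs. 40% rate yields on a toy sequence.
toy_ids = list(range(100, 120))
for rate in (0.15, 0.40):
    _, targets = corrupt_for_mlm(toy_ids, mask_rate=rate)
    print(f"mask_rate={rate}: predicting {len(targets)} of {len(toy_ids)} tokens")
```

The sketch illustrates the trade-off the paper studies: a higher masking rate corrupts more of the context (a harder task) but also yields more prediction targets per sequence, which benefits training.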


Related research

07/14/2023 · Improving BERT with Hybrid Pooling Network and Drop Mask
Transformer-based pre-trained language models, such as BERT, achieve gre...

11/09/2022 · Mask More and Mask Later: Efficient Pre-training of Masked Language Models by Disentangling the [MASK] Token
The pre-training of masked language models (MLMs) consumes massive compu...

04/21/2020 · Train No Evil: Selective Masking for Task-guided Pre-training
Recently, pre-trained language models mostly follow the pre-training-the...

12/10/2022 · Uniform Masking Prevails in Vision-Language Pretraining
Masked Language Modeling (MLM) has proven to be an essential component o...

08/20/2023 · How Good Are Large Language Models at Out-of-Distribution Detection?
Out-of-distribution (OOD) detection plays a vital role in enhancing the ...

08/30/2023 · LM-Infinite: Simple On-the-Fly Length Generalization for Large Language Models
In recent years, there have been remarkable advancements in the performa...
