Dynamic Masking Rate Schedules for MLM Pretraining

by Zachary Ankner, et al.

Most works on transformers trained with the Masked Language Modeling (MLM) objective use the original BERT model's fixed masking rate of 15%. Our work instead dynamically schedules the masking ratio throughout training. We found that linearly decreasing the masking rate from 30% over the course of pretraining improves average GLUE accuracy by 0.46% compared to a standard fixed 15% rate, and we posit that the benefits of scheduling come from being exposed to both high and low masking rate regimes. Our results demonstrate that masking rate scheduling is a simple way to improve the quality of masked language models and achieve up to a 1.89x speedup in pretraining.
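The scheduled-masking idea can be sketched in a few lines. Below is a minimal, hypothetical illustration (not the authors' released code): a linear decay of the masking rate from 30% to 15% over training, applied by masking each token independently with the current rate. The `mask_id=103` default and the `-100` ignore-label follow common BERT/Hugging Face conventions and are assumptions here.

```python
import random

def masking_rate(step: int, total_steps: int,
                 start_rate: float = 0.30, end_rate: float = 0.15) -> float:
    """Linearly decay the masking rate from start_rate to end_rate over training."""
    frac = min(step / total_steps, 1.0)
    return start_rate + frac * (end_rate - start_rate)

def mask_tokens(token_ids, rate, mask_id=103, rng=None):
    """Mask each token independently with probability `rate`.

    Returns (masked_ids, labels); labels are -100 (ignored by the loss)
    for unmasked positions and the original token id for masked ones.
    """
    rng = rng or random.Random(0)  # fixed seed for reproducibility in this sketch
    masked, labels = [], []
    for t in token_ids:
        if rng.random() < rate:
            masked.append(mask_id)
            labels.append(t)       # model must predict the original token
        else:
            masked.append(t)
            labels.append(-100)    # position excluded from the MLM loss
    return masked, labels
```

At step 0 the schedule returns 0.30, at the final step 0.15, so early training sees the harder high-masking regime and later training the standard low-masking one.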




