Dynamic Masking Rate Schedules for MLM Pretraining

05/24/2023
by Zachary Ankner, et al.

Most works on transformers trained with the Masked Language Modeling (MLM) objective use the original BERT model's fixed masking rate of 15%. Our work instead dynamically schedules the masking rate throughout training. We found that linearly decreasing the masking rate from 30% to 15% over the course of pretraining improves average GLUE accuracy by 0.46% relative to a standard fixed 15% masking rate, and we show that the benefits of scheduling come from being exposed to both high and low masking rate regimes. Our results demonstrate that masking rate scheduling is a simple way to improve the quality of masked language models and achieve up to a 1.89x speedup in pretraining.
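A minimal sketch of the scheduling idea, assuming a BERT-style MLM setup in PyTorch (the function names, the 30%-to-15% linear endpoints, and the omission of BERT's 80/10/10 replacement heuristic are illustrative choices, not the authors' released code):

    import torch

    def masking_rate(step: int, total_steps: int,
                     start_rate: float = 0.30, end_rate: float = 0.15) -> float:
        """Linearly anneal the masking rate from start_rate to end_rate over training."""
        frac = min(step / max(total_steps, 1), 1.0)
        return start_rate + frac * (end_rate - start_rate)

    def mask_tokens(input_ids: torch.Tensor, mask_token_id: int, rate: float,
                    special_tokens_mask: torch.Tensor):
        """Mask a `rate` fraction of non-special tokens; return corrupted inputs and MLM labels."""
        probs = torch.full(input_ids.shape, rate)
        probs.masked_fill_(special_tokens_mask.bool(), 0.0)  # never mask [CLS]/[SEP]/padding
        masked = torch.bernoulli(probs).bool()
        labels = input_ids.clone()
        labels[~masked] = -100                               # ignore unmasked positions in the loss
        corrupted = input_ids.clone()
        corrupted[masked] = mask_token_id
        return corrupted, labels

Under these assumptions, each training step would call masking_rate(step, total_steps) and pass the result to mask_tokens before computing the MLM loss; a fixed-rate baseline corresponds to setting start_rate == end_rate == 0.15.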

Related research

04/15/2021 · How to Train BERT with an Academic Budget
While large language models à la BERT are used ubiquitously in NLP, pret...

09/04/2021 · Frustratingly Simple Pretraining Alternatives to Masked Language Modeling
Masked language modeling (MLM), a self-supervised pretraining objective,...

12/20/2022 · Pretraining Without Attention
Transformers have been essential to pretraining success in NLP. Other ar...

05/23/2023 · Towards A Unified View of Sparse Feed-Forward Network in Pretraining Large Language Model
Large and sparse feed-forward networks (S-FFN) such as Mixture-of-Expert...

12/28/2018 · Looking for ELMo's friends: Sentence-Level Pretraining Beyond Language Modeling
Work on the problem of contextualized word representation -- the develop...

12/10/2022 · Uniform Masking Prevails in Vision-Language Pretraining
Masked Language Modeling (MLM) has proven to be an essential component o...

03/24/2023 · Accelerating Vision-Language Pretraining with Free Language Modeling
The state of the arts in vision-language pretraining (VLP) achieves exem...
