Improving Self-supervised Pre-training via a Fully-Explored Masked Language Model

10/12/2020
by Mingzhi Zheng, et al.

The Masked Language Model (MLM) framework has been widely adopted for self-supervised language pre-training. In this paper, we argue that randomly sampled masks in MLM can lead to undesirably large gradient variance. We therefore quantify the gradient variance theoretically by relating the gradient covariance to the Hamming distance between two different masks (given a fixed text sequence). To reduce the variance caused by mask sampling, we propose a fully-explored masking strategy, in which a text sequence is divided into a certain number of non-overlapping segments and the tokens within each segment are masked for training. We prove, from a theoretical perspective, that the gradients derived from this new masking scheme have smaller variance and enable more efficient self-supervised training. We conduct extensive experiments on both continual pre-training and general pre-training from scratch. Empirical results confirm that the new masking strategy consistently outperforms standard random masking. Detailed efficiency analysis and ablation studies further validate the advantages of our fully-explored masking strategy under the MLM framework.
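To make the segmentation idea concrete, below is a minimal Python sketch of splitting a sequence into non-overlapping segments and masking one segment at a time, so that every position is masked in exactly one view of the sequence. The function name fully_explored_masks, the MASK_TOKEN placeholder, and the uniform random partition are illustrative assumptions, not the paper's exact training procedure.

```python
import random

MASK_TOKEN = "[MASK]"  # placeholder for the MLM mask symbol (assumed)


def fully_explored_masks(tokens, num_segments, seed=None):
    """Partition token positions into `num_segments` non-overlapping segments
    and return one masked copy of the sequence per segment.

    Hypothetical sketch of the fully-explored masking idea: the segments
    jointly cover the whole input, and no position is masked twice.
    """
    rng = random.Random(seed)
    positions = list(range(len(tokens)))
    rng.shuffle(positions)  # random, non-overlapping partition of positions

    # Split shuffled positions into num_segments (nearly) equal parts.
    segments = [positions[i::num_segments] for i in range(num_segments)]

    masked_views = []
    for segment in segments:
        segment_set = set(segment)
        masked = [MASK_TOKEN if i in segment_set else tok
                  for i, tok in enumerate(tokens)]
        masked_views.append((masked, sorted(segment)))
    return masked_views


if __name__ == "__main__":
    sample = "the quick brown fox jumps over the lazy dog".split()
    for view, targets in fully_explored_masks(sample, num_segments=3, seed=0):
        print(targets, " ".join(view))
```

In this sketch the gradient contributions from the different segment-wise views are computed on disjoint mask sets, which is the property the paper's variance analysis relies on; the standard baseline would instead sample each mask independently at random.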


Related research

10/30/2022 · token2vec: A Joint Self-Supervised Pre-training Framework Using Unpaired Speech and Text
Self-supervised pre-training has been successful in both text and speech...

01/02/2021 · Improving Sequence-to-Sequence Pre-training via Sequence Span Rewriting
In this paper, we generalize text infilling (e.g., masked language model...

05/28/2022 · Object-wise Masked Autoencoders for Fast Pre-training
Self-supervised pre-training for images without labels has recently achi...

04/17/2022 · On Effectively Learning of Knowledge in Continual Pre-training
Pre-trained language models (PLMs) like BERT have made significant progr...

06/01/2023 · On Masked Pre-training and the Marginal Likelihood
Masked pre-training removes random input dimensions and learns a model t...

05/08/2023 · Self-supervised Pre-training with Masked Shape Prediction for 3D Scene Understanding
Masked signal modeling has greatly advanced self-supervised pre-training...

06/16/2023 · Robot Learning with Sensorimotor Pre-training
We present a self-supervised sensorimotor pre-training approach for robo...
