Random-LTD: Random and Layerwise Token Dropping Brings Efficient Training for Large-scale Transformers

11/17/2022
by   Zhewei Yao, et al.

Large-scale transformer models have become the de facto architectures for a wide range of machine learning applications, e.g., CV and NLP. However, these large models also introduce prohibitive training costs. To mitigate this issue, we propose a novel random and layerwise token dropping method (random-LTD), which skips the computation of a subset of the input tokens at all middle layers. In particular, random-LTD achieves considerable speedups with accuracy comparable to the standard training baseline. Compared to other token dropping methods, random-LTD does not require (1) any importance-score-based metrics, (2) any special token treatment (e.g., [CLS]), or (3) full-sequence-length training for most layers; only the first and the last layers process the full sequence. In addition, a new LayerToken learning rate schedule is proposed for pretraining, which resolves the heavy tuning requirement of our proposed training mechanism. Finally, we demonstrate that random-LTD can be applied to broader applications, including GPT and BERT pretraining as well as ViT and GPT finetuning tasks. Our results show that random-LTD can save about 33.3% of wall-clock training time while achieving zero-shot evaluation results on GPT-3 1.3B similar to the baseline.
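To make the mechanism concrete, below is a minimal sketch of the gather/compute/scatter idea behind layerwise random token dropping, written against generic PyTorch transformer layers. The class name RandomLTDStack, the keep_ratio parameter, and the fixed per-layer sampling are illustrative assumptions introduced here, not the paper's or DeepSpeed's actual API; the full method also adds scheduling of the kept-token budget over training, which this sketch omits.

```python
# A minimal sketch of layerwise random token dropping, assuming a stack of
# generic PyTorch transformer layers. RandomLTDStack and keep_ratio are
# illustrative names introduced here, not the paper's or DeepSpeed's API.
import torch
import torch.nn as nn


class RandomLTDStack(nn.Module):
    """Runs the first and last layers on the full sequence and every middle
    layer on an independently sampled random subset of tokens."""

    def __init__(self, layers: nn.ModuleList, keep_ratio: float = 0.5):
        super().__init__()
        self.layers = layers
        self.keep_ratio = keep_ratio  # fraction of tokens each middle layer processes

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, hidden)
        num_layers = len(self.layers)
        for i, layer in enumerate(self.layers):
            if i == 0 or i == num_layers - 1:
                # First and last layers keep the full sequence length.
                x = layer(x)
            else:
                seq_len = x.size(1)
                num_keep = max(1, int(seq_len * self.keep_ratio))
                # Sample which tokens this layer computes on (fresh draw per layer).
                keep_idx = torch.randperm(seq_len, device=x.device)[:num_keep]
                kept = layer(x[:, keep_idx, :])  # layer sees a shorter sequence
                # Scatter the updated tokens back; dropped tokens bypass this layer.
                x = x.index_copy(1, keep_idx, kept)
        return x


# Example usage with standard encoder layers (batch_first so shapes line up).
layers = nn.ModuleList(
    [nn.TransformerEncoderLayer(d_model=128, nhead=4, batch_first=True)
     for _ in range(6)]
)
model = RandomLTDStack(layers, keep_ratio=0.5)
out = model(torch.randn(2, 64, 128))  # -> (2, 64, 128)
```

Because each middle layer only attends over the kept subset, its attention and feed-forward cost shrinks with keep_ratio, which is where the claimed wall-clock savings come from; the first and last layers still see every token so the model's inputs and outputs remain full length.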

