Learning to Grow Pretrained Models for Efficient Transformer Training

03/02/2023
by   Peihao Wang, et al.

Scaling transformers has led to significant breakthroughs in many domains, giving rise to a paradigm in which larger versions of existing models are trained and released on a periodic basis. New instances of such models are typically trained completely from scratch, despite the fact that they are often just scaled-up versions of their smaller counterparts. How can we use the implicit knowledge in the parameters of smaller, extant models to enable faster training of newer, larger models? This paper describes an approach for accelerating transformer training by learning to grow pretrained transformers, where we learn to linearly map the parameters of the smaller model to initialize the larger model. For tractable learning, we factorize the linear transformation as a composition of (linear) width- and depth-growth operators, and further employ a Kronecker factorization of these growth operators to encode architectural knowledge. Extensive experiments across both language and vision transformers demonstrate that our learned Linear Growth Operator (LiGO) can save up to 50% of the computational cost of training from scratch, while consistently outperforming strong baselines that also reuse smaller pretrained models to initialize larger models.
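As a rough illustration of the width-growth idea described in the abstract, the sketch below expands a pretrained weight matrix with two learnable factors A and B, so that the full linear map vec(W_large) = (B ⊗ A) vec(W_small) is applied as A @ W_small @ B.T without ever materializing the Kronecker product. All class names, shapes, and the initialization scheme here are illustrative assumptions for this page, not the authors' released implementation.

```python
# Hypothetical sketch of a LiGO-style width-growth operator (names, shapes,
# and initialization are assumptions, not the paper's exact code).
import torch
import torch.nn as nn


class WidthGrowth(nn.Module):
    """Maps a pretrained weight W_small (d_out x d_in) to W_large (D_out x D_in).

    The Kronecker-factorized linear map vec(W_large) = (B ⊗ A) vec(W_small)
    is applied cheaply as A @ W_small @ B.T.
    """

    def __init__(self, d_out, d_in, D_out, D_in):
        super().__init__()
        # Initialize so that the top-left block of W_large copies W_small.
        A = torch.zeros(D_out, d_out)
        A[:d_out, :d_out] = torch.eye(d_out)
        B = torch.zeros(D_in, d_in)
        B[:d_in, :d_in] = torch.eye(d_in)
        self.A = nn.Parameter(A)  # learnable row-growth factor
        self.B = nn.Parameter(B)  # learnable column-growth factor

    def forward(self, w_small):
        # (D_out x d_out) @ (d_out x d_in) @ (d_in x D_in) -> (D_out x D_in)
        return self.A @ w_small @ self.B.t()


# Example: grow a 256x256 projection from a small pretrained model to 512x512.
w_small = torch.randn(256, 256)      # stands in for a pretrained weight
grow = WidthGrowth(256, 256, 512, 512)
w_large = grow(w_small)              # used to initialize the larger model
print(w_large.shape)                 # torch.Size([512, 512])
```

In the paper, operators like this (together with depth-growth operators that combine the small model's layers) are trained briefly to produce a good initialization for the larger model, which is then trained as usual.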


Related research

07/26/2021 · Don't Sweep your Learning Rate under the Rug: A Closer Look at Cross-modal Transfer of Pretrained Transformers
Self-supervised pre-training of large-scale transformer models on text c...

07/12/2023 · What Happens During Finetuning of Vision Transformers: An Invariance Based Investigation
The pretrain-finetune paradigm usually improves downstream performance o...

04/18/2021 · Knowledge Neurons in Pretrained Transformers
Large-scale pretrained language models are surprisingly good at recallin...

03/24/2021 · Finetuning Pretrained Transformers into RNNs
Transformers have outperformed recurrent neural networks (RNNs) in natur...

03/11/2022 · Staged Training for Transformer Language Models
The current standard approach to scaling transformer language models tra...

12/06/2022 · Enabling and Accelerating Dynamic Vision Transformer Inference for Real-Time Applications
Many state-of-the-art deep learning models for computer vision tasks are...

08/11/2023 · Composable Function-preserving Expansions for Transformer Architectures
Training state-of-the-art neural networks requires a high cost in terms ...
