Progressively Stacking 2.0: A Multi-stage Layerwise Training Method for BERT Training Speedup

11/27/2020
by Cheng Yang, et al.

Pre-trained language models, such as BERT, have achieved significant accuracy gains in many natural language processing tasks. Despite their effectiveness, the huge number of parameters makes training a BERT model computationally very challenging. In this paper, we propose an efficient multi-stage layerwise training (MSLT) approach to reduce the training time of BERT. We decompose the whole training process into several stages: training starts from a small model with only a few encoder layers, and we gradually increase the depth of the model by adding new encoder layers. At each stage, we train only the top few encoder layers (those near the output layer) that are newly added; the parameters of the layers trained in previous stages are not updated in the current stage. In BERT training, the backward computation is much more time-consuming than the forward computation, especially in the distributed training setting, where the backward computation time also includes the communication time for gradient synchronization. With the proposed training strategy, only the top few layers participate in backward computation, while most layers participate only in forward computation, so both computation and communication efficiency are greatly improved. Experimental results show that the proposed method achieves more than 110% training speedup without significant performance degradation.
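
To make the staged procedure concrete, the following is a minimal PyTorch-style sketch of the idea as stated above, not the authors' implementation: the model dimensions, stage schedule, toy batch, and placeholder loss are illustrative assumptions. Each stage appends new encoder layers on top, trains only those layers (plus the output head, and the embedding in the first stage), and freezes everything trained earlier, so the frozen layers run forward only.

    import torch
    import torch.nn as nn

    # Dimensions and schedule are illustrative, not taken from the paper.
    d_model, n_heads, vocab = 256, 4, 30522
    layers_per_stage, num_stages = 3, 4

    embedding = nn.Embedding(vocab, d_model)   # toy input embedding
    encoder_layers = nn.ModuleList()           # grows by layers_per_stage each stage
    head = nn.Linear(d_model, vocab)           # stand-in output head

    def new_encoder_layer():
        return nn.TransformerEncoderLayer(d_model, n_heads,
                                          dim_feedforward=4 * d_model,
                                          batch_first=True)

    for stage in range(num_stages):
        if stage > 0:
            # Freeze everything trained in earlier stages (embedding included):
            # these modules only run forward passes, add no backward computation,
            # and produce no gradients to synchronize in distributed training.
            for p in list(embedding.parameters()) + list(encoder_layers.parameters()):
                p.requires_grad = False

        # Add new encoder layers on top; only they (and the head) are trained now.
        new_layers = [new_encoder_layer() for _ in range(layers_per_stage)]
        encoder_layers.extend(new_layers)

        trainable = [p for layer in new_layers for p in layer.parameters()]
        trainable += list(head.parameters())
        if stage == 0:
            trainable += list(embedding.parameters())  # embedding is trained only in stage 0
        optimizer = torch.optim.AdamW(trainable, lr=1e-4)

        for step in range(10):                          # stand-in for the real pre-training loop
            tokens = torch.randint(0, vocab, (8, 128))  # toy batch of token ids
            x = embedding(tokens)
            for layer in encoder_layers:                # frozen layers: forward only
                x = layer(x)
            loss = head(x).mean()                       # placeholder loss, not real MLM/NSP
            optimizer.zero_grad()
            # backward reaches only the new top layers, the head,
            # and (in stage 0) the embedding
            loss.backward()
            optimizer.step()

In an actual distributed run, keeping the frozen parameters at requires_grad=False would typically also exclude them from the gradient all-reduce, which is where the communication savings described in the abstract come from.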


Related research

- CoRe: An Efficient Coarse-refined Training Framework for BERT (11/27/2020)
- schuBERT: Optimizing Elements of BERT (05/09/2020)
- DACT-BERT: Differentiable Adaptive Computation Time for an Efficient BERT Inference (09/24/2021)
- Efficient DNN Training with Knowledge-Guided Layer Freezing (01/17/2022)
- Optimizer Fusion: Efficient Training with Better Locality and Parallelism (04/01/2021)
- FreezeOut: Accelerate Training by Progressively Freezing Layers (06/15/2017)
