Reducing BERT Pre-Training Time from 3 Days to 76 Minutes

04/01/2019
by Yang You, et al.

Large-batch training is key to speeding up deep neural network training on large distributed systems. However, large-batch training is difficult because it produces a generalization gap: straightforward optimization often leads to accuracy loss on the test set. BERT (Devlin et al., 2018) is a state-of-the-art deep learning model built on deep bidirectional transformers for language understanding. Previous large-batch training techniques do not perform well for BERT when we scale the batch size (e.g., beyond 8192). BERT pre-training also takes a long time to finish (around three days on 16 TPUv3 chips). To solve this problem, we propose the LAMB optimizer, which allows us to scale the batch size to 65536 without losing accuracy. LAMB is a general optimizer that works for both small and large batch sizes and requires no hyper-parameter tuning other than the learning rate. The baseline BERT-Large model needs 1 million iterations to finish pre-training, while LAMB with a batch size of 65536/32768 needs only 8599 iterations. We push the batch size to the memory limit of a TPUv3 pod and finish BERT training in 76 minutes.
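For concreteness, below is a minimal NumPy sketch of a LAMB-style update for a single weight tensor: an Adam-like direction with decoupled weight decay, rescaled per layer by the trust ratio ||w|| / ||update||. The scaling function on the weight norm is taken as the identity, and the default hyper-parameters are illustrative assumptions rather than the settings used in the paper's experiments.

```python
import numpy as np

def lamb_update(param, grad, m, v, step, lr=1e-3, beta1=0.9, beta2=0.999,
                eps=1e-6, weight_decay=0.01):
    """One LAMB-style step for a single weight tensor (layer-wise adaptive update).

    Hyper-parameter defaults are illustrative assumptions, not the paper's settings.
    """
    # Adam-style first and second moment estimates with bias correction.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** step)
    v_hat = v / (1 - beta2 ** step)

    # Adam direction plus decoupled weight decay.
    update = m_hat / (np.sqrt(v_hat) + eps) + weight_decay * param

    # Layer-wise trust ratio: scale the step by ||w|| / ||update||.
    w_norm = np.linalg.norm(param)
    u_norm = np.linalg.norm(update)
    trust_ratio = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0

    param = param - lr * trust_ratio * update
    return param, m, v
```

In practice the same update is applied independently to every weight matrix and bias of the network; this per-layer normalization of the step size is what lets the effective learning rate adapt to each layer when the batch size grows very large.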

