Tempo: Accelerating Transformer-Based Model Training through Memory Footprint Reduction

10/19/2022
by Muralidhar Andoorveedu, et al.

Training deep learning models can be computationally expensive. Prior work has shown that increasing the batch size can potentially lead to better overall throughput. However, the batch size is frequently limited by the accelerator's memory capacity, because the activations/feature maps stored for the training backward pass grow with the batch size. Transformer-based models, which have recently seen a surge in popularity due to their strong performance and applicability to a variety of tasks, face the same problem. To remedy this issue, we propose Tempo, a new approach to efficiently use accelerator (e.g., GPU) memory resources for training Transformer-based models. Our approach provides drop-in replacements for the GELU, LayerNorm, and Attention layers, reducing memory usage and ultimately leading to more efficient training. We implement Tempo and evaluate the throughput, memory usage, and accuracy/loss on the BERT Large pre-training task. We demonstrate that Tempo enables up to 2x higher batch sizes and 16% higher training throughput over the state-of-the-art baseline. We also evaluate Tempo on GPT2 and RoBERTa models, showing 19% higher training throughput over the baseline.
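Tempo's actual drop-in layers are not reproduced here, but the underlying principle of trading a small amount of recomputation for activation memory can be sketched with PyTorch's built-in activation checkpointing. The FeedForward module below is a hypothetical stand-in for a Transformer MLP block; this is a minimal illustration of the general idea, not the paper's implementation.

```python
# Minimal sketch (not Tempo's implementation): reduce activation memory for a
# Transformer feed-forward block by recomputing its intermediates during the
# backward pass, using PyTorch's built-in activation checkpointing.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class FeedForward(nn.Module):
    """Hypothetical Transformer MLP block: Linear -> GELU -> Linear."""

    def __init__(self, d_model=1024, d_ff=4096):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.fc2(self.act(self.fc1(x)))


class CheckpointedFeedForward(FeedForward):
    """Drop-in variant: the fc1 and GELU outputs are not stored for backward;
    they are recomputed when gradients are needed, freeing memory that can
    instead be spent on a larger batch size."""

    def forward(self, x):
        return checkpoint(super().forward, x, use_reentrant=False)


if __name__ == "__main__":
    block = CheckpointedFeedForward()
    x = torch.randn(8, 128, 1024, requires_grad=True)  # (batch, seq, hidden)
    block(x).sum().backward()
```

Freeing the intermediate activations in this way is what allows the batch size, and with it the overall training throughput, to grow on a fixed memory budget.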
