Reducing Activation Recomputation in Large Transformer Models

05/10/2022
by Vijay Korthikanti, et al.

Training large transformer models is one of the most important computational challenges of modern AI. In this paper, we show how to significantly accelerate the training of large transformer models by reducing activation recomputation. Activation recomputation is commonly used to work around memory capacity constraints: rather than storing activations for backpropagation, they are traditionally recomputed, which saves memory but adds redundant compute. In this work, we show that most of this redundant compute is unnecessary, because we can reduce memory consumption sufficiently without it. We present two novel yet very simple techniques: sequence parallelism and selective activation recomputation. In conjunction with tensor parallelism, these techniques almost eliminate the need to recompute activations. We evaluate our approach on language models up to one trillion parameters in scale and show that our method reduces activation memory by 5x, while reducing execution time overhead from activation recomputation by over 90%. For example, when training a 530B parameter GPT-3 style model on 2240 NVIDIA A100 GPUs, we achieve a Model Flops Utilization of 54.2%, which is 29% faster than the 42.1% we achieve using recomputation. Our implementation will be available in both Megatron-LM and NeMo-Megatron.
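
The two techniques are complementary: sequence parallelism shards the activations of the layer-norm and dropout regions along the sequence dimension across tensor-parallel ranks, while selective activation recomputation checkpoints only memory-heavy but compute-light parts of the transformer layer rather than whole layers. The sketch below illustrates the selective-recomputation idea in plain PyTorch, assuming torch.utils.checkpoint as the recomputation mechanism; the class and method names are illustrative, and the actual implementation in Megatron-LM applies checkpointing at a finer granularity inside the attention core.

```python
# A minimal sketch of selective activation recomputation, not the paper's
# implementation. Assumes a recent PyTorch with use_reentrant support in
# torch.utils.checkpoint; SelectiveCheckpointBlock is a hypothetical name.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class SelectiveCheckpointBlock(nn.Module):
    """Transformer block that recomputes only its attention region.

    Attention activations dominate memory at long sequence lengths, so
    recomputing just that region recovers most of the savings of full
    recomputation at a fraction of the extra compute.
    """

    def __init__(self, hidden: int, heads: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(hidden)
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(hidden)
        self.mlp = nn.Sequential(
            nn.Linear(hidden, 4 * hidden), nn.GELU(),
            nn.Linear(4 * hidden, hidden),
        )

    def _attn_part(self, x):
        h = self.ln1(x)
        out, _ = self.attn(h, h, h, need_weights=False)
        return out

    def forward(self, x):
        # Selective recomputation: checkpoint only the attention region,
        # so its intermediate activations are dropped in the forward pass
        # and rebuilt during backprop. MLP activations are stored as usual.
        x = x + checkpoint(self._attn_part, x, use_reentrant=False)
        x = x + self.mlp(self.ln2(x))
        return x


if __name__ == "__main__":
    block = SelectiveCheckpointBlock(hidden=256, heads=8)
    tokens = torch.randn(2, 128, 256, requires_grad=True)
    block(tokens).sum().backward()  # attention recomputed here, MLP not
```

Checkpointing at the region level rather than the layer level is the key design choice: the recomputed work is only the attention region's forward pass, while the memory saved is the part of the activation footprint that grows fastest with sequence length.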

