Adam Accumulation to Reduce Memory Footprints of both Activations and Gradients for Large-scale DNN Training

05/31/2023
by Yijia Zhang, et al.

Running out of GPU memory has become a main bottleneck for large-scale DNN training, so reducing the memory footprint during training has received intensive research attention. We find that previous gradient accumulation reduces activation memory, but it is incompatible with gradient memory reduction because of a contradiction between preserving gradients for accumulation and releasing gradients to save memory. To address this issue, we propose a novel optimizer accumulation method for Adam, named Adam Accumulation (AdamA), which enables reducing both activation and gradient memory. Specifically, AdamA directly integrates gradients into optimizer states and accumulates optimizer states over micro-batches, so that gradients can be released immediately after use. We mathematically and experimentally demonstrate that AdamA yields the same convergence properties as Adam. Evaluated on transformer-based models, AdamA achieves up to 23% memory reduction compared to gradient accumulation with less than 2% degradation in training throughput. Notably, AdamA can work together with memory reduction methods for optimizer states to fit 1.26x-3.14x larger models over the PyTorch and DeepSpeed baselines on GPUs with different memory capacities.
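The core idea described in the abstract, folding each micro-batch's gradient into Adam's moment buffers and freeing the gradient right away instead of keeping an accumulated gradient buffer, can be illustrated with a short sketch. The snippet below is a minimal illustration under my own assumptions, not the paper's exact algorithm: the class name AdamAccumulationSketch, the accumulate/step split, and approximating the second moment by summing per-micro-batch squared gradients are all choices made here for exposition.

```python
# Minimal sketch of accumulating into optimizer states instead of gradients.
# Assumptions (not taken from the paper): class/method names, and the second
# moment approximated by a sum of per-micro-batch squared gradients.
import torch


class AdamAccumulationSketch:
    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
        self.params = [p for p in params if p.requires_grad]
        self.lr, (self.beta1, self.beta2), self.eps = lr, betas, eps
        self.t = 0  # macro-batch step counter
        self.m = [torch.zeros_like(p) for p in self.params]  # first moment
        self.v = [torch.zeros_like(p) for p in self.params]  # second moment

    @torch.no_grad()
    def accumulate(self, num_micro_batches):
        """Fold the current micro-batch gradient into m and v, then free it."""
        for p, m, v in zip(self.params, self.m, self.v):
            if p.grad is None:
                continue
            g = p.grad / num_micro_batches        # same scaling as gradient accumulation
            m.add_(g, alpha=1 - self.beta1)       # partial first-moment contribution
            v.add_(g * g, alpha=1 - self.beta2)   # approximate second-moment contribution
            p.grad = None                         # gradient memory released immediately

    @torch.no_grad()
    def step(self):
        """Apply one Adam-style update after all micro-batches are accumulated."""
        self.t += 1
        bc1 = 1 - self.beta1 ** self.t            # bias corrections, as in Adam
        bc2 = 1 - self.beta2 ** self.t
        for p, m, v in zip(self.params, self.m, self.v):
            p.addcdiv_(m / bc1, (v / bc2).sqrt().add_(self.eps), value=-self.lr)
            m.mul_(self.beta1)                    # decay moments once per macro-batch,
            v.mul_(self.beta2)                    # so the next accumulation adds on top


# Illustrative driving loop: call accumulate() after each micro-batch backward
# pass and step() once per macro-batch, as with ordinary gradient accumulation.
# for micro_batch in split(batch, N):
#     loss = model(micro_batch).mean()
#     loss.backward()
#     opt.accumulate(num_micro_batches=N)
# opt.step()
```

The memory saving in this sketch comes from setting p.grad to None after each micro-batch, so only the optimizer states persist across micro-batches rather than a full set of accumulated gradients.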


Related research

07/05/2023 · CAME: Confidence-guided Adaptive Memory Efficient Optimization
Adaptive gradient methods, such as Adam and LAMB, have demonstrated exce...

05/10/2022 · Reducing Activation Recomputation in Large Transformer Models
Training large transformer models is one of the most important computati...

05/29/2023 · SlimFit: Memory-Efficient Fine-Tuning of Transformer-based Models Using Training Dynamics
Transformer-based models, such as BERT and ViT, have achieved state-of-t...

01/30/2019 · Memory-Efficient Adaptive Optimization for Large-Scale Learning
Adaptive gradient-based optimizers such as AdaGrad and Adam are among th...

01/07/2020 · Sparse Weight Activation Training
Training convolutional neural networks (CNNs) is time-consuming. Prior w...

02/12/2019 · Extreme Tensoring for Low-Memory Preconditioning
State-of-the-art models are now trained with billions of parameters, rea...

10/19/2022 · Tempo: Accelerating Transformer-Based Model Training through Memory Footprint Reduction
Training deep learning models can be computationally expensive. Prior wo...
