Mesa: A Memory-saving Training Framework for Transformers

11/22/2021
by Zizheng Pan, et al.

There has been an explosion of interest in designing high-performance Transformers. While Transformers have delivered significant performance improvements, training such networks is extremely memory intensive, owing to the need to store all intermediate activations for gradient computation during backpropagation, especially for long sequences. To this end, we present Mesa, a memory-saving, resource-efficient training framework for Transformers. Specifically, Mesa uses exact activations during the forward pass while storing only a low-precision version of the activations to reduce memory consumption during training. The low-precision activations are then dequantized during backpropagation to compute gradients. In addition, to address the heterogeneous activation distributions in multi-head self-attention layers, we propose a head-wise activation quantization strategy, which quantizes activations based on the statistics of each head to minimize the approximation error. To further boost training efficiency, we learn the quantization parameters using running estimates. More importantly, by re-investing the saved memory in a larger batch size or a larger model, we can further improve performance under constrained computational resources. Extensive experiments on ImageNet, CIFAR-100 and ADE20K demonstrate that Mesa can cut the training memory footprint roughly in half while achieving comparable or even better performance. Code is available at https://github.com/zhuang-group/Mesa
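
As a concrete illustration of the idea, the following minimal PyTorch sketch stores an 8-bit, per-head quantized copy of a linear layer's input for the backward pass, while the forward pass itself runs on exact activations. This is not the official Mesa implementation (see the repository linked above): the names quantize_per_head and QuantizedLinearFn, the 8-bit setting, and the use of per-batch min/max statistics (where the paper learns the quantization parameters with running estimates) are all illustrative assumptions.

import torch
import torch.nn.functional as F

def quantize_per_head(x, num_heads, num_bits=8):
    # Asymmetric uniform quantization with one (scale, zero-point) pair per
    # head. x has shape (batch, tokens, num_heads * head_dim); computing the
    # statistics per head keeps heterogeneous head distributions from sharing
    # a single quantization range.
    b, n, c = x.shape
    xh = x.view(b, n, num_heads, c // num_heads)
    qmax = 2 ** num_bits - 1
    x_min = xh.amin(dim=(0, 1, 3), keepdim=True)   # the paper uses running estimates instead
    x_max = xh.amax(dim=(0, 1, 3), keepdim=True)
    scale = (x_max - x_min).clamp(min=1e-8) / qmax
    q = ((xh - x_min) / scale).round().clamp(0, qmax).to(torch.uint8)
    return q, scale, x_min

def dequantize_per_head(q, scale, x_min, shape):
    return (q.float() * scale + x_min).view(shape)

class QuantizedLinearFn(torch.autograd.Function):
    # Linear layer that keeps only a low-precision copy of its input alive
    # between the forward and backward passes.

    @staticmethod
    def forward(ctx, x, weight, num_heads):
        out = F.linear(x, weight)                  # exact forward computation
        q, scale, x_min = quantize_per_head(x, num_heads)
        ctx.save_for_backward(q, scale, x_min, weight)
        ctx.x_shape = x.shape
        return out

    @staticmethod
    def backward(ctx, grad_out):
        q, scale, x_min, weight = ctx.saved_tensors
        x_hat = dequantize_per_head(q, scale, x_min, ctx.x_shape)   # approximate input
        grad_x = grad_out @ weight
        grad_w = grad_out.flatten(0, 1).t() @ x_hat.flatten(0, 1)   # uses dequantized activations
        return grad_x, grad_w, None

# Toy usage: a (batch, tokens, channels) input with 8 heads.
x = torch.randn(2, 16, 64, requires_grad=True)
w = torch.randn(64, 64, requires_grad=True)
QuantizedLinearFn.apply(x, w, 8).sum().backward()

Because only the uint8 copy plus a per-head scale and zero point is kept between the forward and backward passes, the activation memory for such a layer drops from 32 bits to roughly 8 bits per element; that headroom is what can be re-invested in a larger batch size or model.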

Related research

01/23/2019 · Backprop with Approximate Activations for Memory-efficient Network Training
  Larger and deeper neural network architectures deliver improved accuracy...

12/08/2022 · TinyKG: Memory-Efficient Training Framework for Knowledge Graph Neural Recommender Systems
  There has been an explosion of interest in designing various Knowledge G...

07/01/2018 · SYQ: Learning Symmetric Quantization For Efficient Deep Neural Networks
  Inference for state-of-the-art deep neural networks is computationally e...

05/04/2023 · On the Expressivity Role of LayerNorm in Transformers' Attention
  Layer Normalization (LayerNorm) is an inherent component in all Transfor...

05/24/2023 · BinaryViT: Towards Efficient and Accurate Binary Vision Transformers
  Vision Transformers (ViTs) have emerged as the fundamental architecture ...

11/27/2019 · Optimal checkpointing for heterogeneous chains: how to train deep neural networks with limited memory
  This paper introduces a new activation checkpointing method which allows...

02/24/2022 · Auto-scaling Vision Transformers without Training
  This work targets automated designing and scaling of Vision Transformers...
