Few-Bit Backward: Quantized Gradients of Activation Functions for Memory Footprint Reduction

02/01/2022
by Georgii Novikov, et al.

Memory footprint is one of the main limiting factors in large neural network training. In backpropagation, one needs to store the input to each operation in the computational graph. Every modern neural network model has quite a few pointwise nonlinearities in its architecture, and each such operation induces additional memory costs which, as we show, can be significantly reduced by quantization of the gradients. We propose a systematic approach to compute an optimal quantization of the retained gradients of pointwise nonlinear functions with only a few bits per element. We show that such an approximation can be achieved by computing an optimal piecewise-constant approximation of the derivative of the activation function, which can be done by dynamic programming. Drop-in replacements are implemented for all popular nonlinearities and can be used in any existing pipeline. We confirm the memory reduction and unchanged convergence on several open benchmarks.
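To make the idea concrete, below is a minimal PyTorch sketch (not the authors' released code) of such a drop-in replacement for GELU. The forward pass saves only a small integer code per element instead of the full-precision input, and the backward pass multiplies the incoming gradient by a piecewise-constant surrogate of the derivative GELU'(x). The class name FewBitGELU and the boundary/value tensors are illustrative placeholders introduced here; in the paper, the bin boundaries and per-bin derivative values come from the optimal piecewise-constant fit computed by dynamic programming.

```python
import torch


class FewBitGELU(torch.autograd.Function):
    """Sketch of a few-bit backward for GELU: store a quantization code
    per element in the forward pass, use a piecewise-constant derivative
    approximation in the backward pass."""

    @staticmethod
    def forward(ctx, x, boundaries, values):
        # Store a small integer code per element instead of the full-precision input.
        codes = torch.bucketize(x, boundaries).to(torch.uint8)
        ctx.save_for_backward(codes, values)
        return torch.nn.functional.gelu(x)

    @staticmethod
    def backward(ctx, grad_output):
        codes, values = ctx.saved_tensors
        # Look up the per-bin constant approximating GELU'(x).
        approx_derivative = values[codes.long()]
        # No gradients for the (fixed) boundaries and values.
        return grad_output * approx_derivative, None, None


# Illustrative 2-bit (4-level) approximation of GELU'; a real implementation
# would use the boundaries/values produced by the dynamic-programming fit.
boundaries = torch.tensor([-1.0, 0.0, 1.0])
values = torch.tensor([-0.05, 0.15, 0.85, 1.05])

x = torch.randn(8, requires_grad=True)
y = FewBitGELU.apply(x, boundaries, values)
y.sum().backward()
print(x.grad)
```

Because each element is retained as a small code rather than a 32-bit float, the activation memory of the nonlinearity drops roughly fourfold even in this uint8 sketch; a full implementation would additionally bit-pack the 2-4 bit codes to reach the few-bit footprint described in the abstract.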


