Neural gradients are lognormally distributed: understanding sparse and quantized training

by Brian Chmiel, et al.

Neural gradient compression remains a main bottleneck in improving training efficiency, as most existing neural network compression methods (e.g., pruning or quantization) focus on weights, activations, and weight gradients. These methods are not well suited to compressing neural gradients, which have a very different distribution: specifically, we find that neural gradients follow a lognormal distribution. Taking this into account, we suggest two methods to reduce the computational and memory burdens of neural gradients. The first is stochastic gradient pruning, which can accurately set the sparsity level (up to 85% gradient sparsity on ImageNet without hurting accuracy). The second determines the floating-point format for low-numerical-precision gradients (e.g., FP8). Our results shed light on previous findings related to local scaling, the optimal bit allocation for the mantissa and exponent, and challenging workloads for which low-precision floating-point arithmetic has been reported to fail. A reference implementation accompanies the paper.
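To illustrate the first method, here is a minimal NumPy sketch of unbiased stochastic pruning: entries below a threshold are kept with probability proportional to their magnitude and rescaled, so the pruned tensor equals the original in expectation. The function names and the bisection-based threshold search are our own illustration, not the paper's reference implementation (which derives the threshold analytically from the fitted lognormal parameters).

```python
import numpy as np

def stochastic_prune(grad, alpha, rng=None):
    # Entries with |g| >= alpha pass through unchanged; smaller entries are
    # kept with probability |g|/alpha and rescaled to sign(g)*alpha, else
    # zeroed -- so the output is an unbiased estimate of grad.
    rng = rng if rng is not None else np.random.default_rng()
    mag = np.abs(grad)
    keep = rng.random(grad.shape) < mag / alpha
    pruned = np.where(keep, np.sign(grad) * alpha, 0.0)
    return np.where(mag >= alpha, grad, pruned)

def threshold_for_sparsity(grad, target, iters=60):
    # Bisect for the alpha whose *expected* zero fraction equals `target`;
    # E[sparsity] = mean(1 - min(|g|/alpha, 1)) is monotone increasing in alpha.
    mag = np.abs(grad)
    lo, hi = 0.0, float(mag.mean() / (1.0 - target) + mag.max())
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if np.mean(1.0 - np.minimum(mag / mid, 1.0)) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

On synthetic lognormally distributed gradients, `threshold_for_sparsity(g, 0.85)` picks a threshold that lands close to 85% measured sparsity, which is what makes accurate sparsity control possible once the distribution is known.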
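The second method (choosing the exponent/mantissa split of a low-precision float format) can be explored empirically with a Monte-Carlo sketch: quantize lognormally distributed samples with each candidate FP8-style split and compare the resulting errors. The quantizer below is a simplified model (round-to-nearest mantissa, subnormals flushed to zero, no inf/NaN encodings) and is our illustration, not the paper's analytic derivation.

```python
import numpy as np

def fp_quantize(x, exp_bits, man_bits):
    # Simplified float quantizer: sign bit, exp_bits-bit biased exponent,
    # man_bits-bit mantissa; round-to-nearest, subnormals flushed to zero,
    # overflow clamped to the largest representable magnitude.
    bias = 2 ** (exp_bits - 1) - 1
    emin, emax = 1 - bias, bias
    sign, mag = np.sign(x), np.abs(x)
    e = np.clip(np.floor(np.log2(np.maximum(mag, 1e-300))), emin, emax)
    scale = 2.0 ** e
    q = np.round(mag / scale * 2 ** man_bits) / 2 ** man_bits * scale
    maxval = (2.0 - 2.0 ** -man_bits) * 2.0 ** emax
    q = np.where(mag < 2.0 ** emin, 0.0, np.minimum(q, maxval))
    return sign * q

def best_fp8_split(samples, total_bits=8):
    # Try every exponent/mantissa split of the non-sign bits and return the
    # split with the lowest mean squared quantization error on `samples`.
    mse = {}
    for exp_bits in range(2, total_bits - 1):
        man_bits = total_bits - 1 - exp_bits
        q = fp_quantize(samples, exp_bits, man_bits)
        mse[(exp_bits, man_bits)] = float(np.mean((q - samples) ** 2))
    return min(mse, key=mse.get), mse
```

On heavy-tailed lognormal samples the search tends to favor exponent-heavy splits, matching the intuition that lognormally distributed gradients need dynamic range more than mantissa precision.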






