8-bit Optimizers via Block-wise Quantization

10/06/2021
by Tim Dettmers, et al.

Stateful optimizers maintain gradient statistics over time, e.g., the exponentially smoothed sum (SGD with momentum) or squared sum (Adam) of past gradient values. This state can be used to accelerate optimization compared to plain stochastic gradient descent but uses memory that might otherwise be allocated to model parameters, thereby limiting the maximum size of models trained in practice. In this paper, we develop the first optimizers that use 8-bit statistics while maintaining the performance levels of using 32-bit optimizer states. To overcome the resulting computational, quantization, and stability challenges, we develop block-wise dynamic quantization. Block-wise quantization divides input tensors into smaller blocks that are independently quantized. Each block is processed in parallel across cores, yielding faster optimization and high-precision quantization. To maintain stability and performance, we combine block-wise quantization with two additional changes: (1) dynamic quantization, a form of non-linear quantization that is precise for both large and small magnitude values, and (2) a stable embedding layer to reduce gradient variance that comes from the highly non-uniform distribution of input tokens in language models. As a result, our 8-bit optimizers maintain 32-bit performance with a small fraction of the memory footprint on a range of tasks, including 1.5B parameter language modeling, GLUE finetuning, ImageNet classification, WMT'14 machine translation, MoCo v2 contrastive ImageNet pretraining+finetuning, and RoBERTa pretraining, without changes to the original optimizer hyperparameters. We open-source our 8-bit optimizers as a drop-in replacement that only requires a two-line code change.
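
To make the block-wise scheme concrete, below is a minimal PyTorch sketch, not the paper's fused CUDA implementation: each block is normalized by its own absolute maximum and mapped to the nearest entry of an 8-bit code. The function names, the block size default, and the cubic stand-in code are illustrative assumptions; the paper's dynamic (tree) quantization uses a different, more precise code.

    import torch

    # Illustrative non-uniform 8-bit code; a cubic map with extra resolution near
    # zero, standing in for the paper's dynamic quantization code.
    code = torch.linspace(-1.0, 1.0, 256)
    code = torch.sign(code) * code.abs() ** 3

    def blockwise_quantize(x, code, block_size=2048):
        """Map a tensor to 8-bit indices into `code`, with one absmax scale per block."""
        flat = x.flatten()
        pad = (-flat.numel()) % block_size
        flat = torch.cat([flat, flat.new_zeros(pad)])           # pad to a whole number of blocks
        blocks = flat.view(-1, block_size)
        scales = blocks.abs().max(dim=1, keepdim=True).values   # per-block normalization constant
        normed = blocks / scales.clamp(min=1e-8)                 # every block now lies in [-1, 1]
        # nearest-neighbour lookup into the 8-bit code (the real kernel fuses this step)
        idx = (normed.unsqueeze(-1) - code.view(1, 1, -1)).abs().argmin(dim=-1)
        return idx.to(torch.uint8), scales

    def blockwise_dequantize(idx, scales, code, numel, shape):
        out = code[idx.long()] * scales                          # look up values, rescale per block
        return out.flatten()[:numel].view(shape)

    state = torch.randn(10_000)                                  # stand-in for an optimizer state tensor
    q, s = blockwise_quantize(state, code)
    state_hat = blockwise_dequantize(q, s, code, state.numel(), state.shape)
    print((state - state_hat).abs().max())                       # quantization error is small relative to absmax

The advertised two-line change refers to swapping the optimizer construction for its 8-bit counterpart in the authors' released library. The snippet below assumes the open-source bitsandbytes package is installed; module paths may differ across versions, and the Linear model is only a placeholder.

    import torch
    import bitsandbytes as bnb                 # the authors' open-source release (assumed installed)

    model = torch.nn.Linear(512, 512)          # any torch.nn.Module
    # the two-line change: swap torch.optim.Adam(...) for its 8-bit counterpart
    optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-3)
    # the stable embedding layer is likewise intended as a drop-in for torch.nn.Embedding, e.g.:
    # embedding = bnb.nn.StableEmbedding(num_embeddings, embedding_dim)
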

