FP8 Formats for Deep Learning

09/12/2022
by Paulius Micikevicius, et al.

FP8 is a natural progression for accelerating deep learning training and inference beyond the 16-bit formats common in modern processors. In this paper we propose an 8-bit floating point (FP8) binary interchange format consisting of two encodings - E4M3 (4-bit exponent and 3-bit mantissa) and E5M2 (5-bit exponent and 2-bit mantissa). While E5M2 follows IEEE 754 conventions for the representation of special values, E4M3's dynamic range is extended by not representing infinities and by reserving only one mantissa bit-pattern for NaNs. We demonstrate the efficacy of the FP8 format on a variety of image and language tasks, effectively matching the result quality achieved by 16-bit training sessions. Our study covers the main modern neural network architectures - CNNs, RNNs, and Transformer-based models - leaving all hyperparameters unchanged from the 16-bit baseline training sessions. Our training experiments include large language models of up to 175B parameters. We also examine FP8 post-training quantization of language models trained using 16-bit formats that resisted fixed-point int8 quantization.
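For illustration, the sketch below decodes an 8-bit pattern under the two encodings as the abstract describes them: E5M2 keeps IEEE 754-style infinities and NaNs, while E4M3 drops infinities and uses a single mantissa bit-pattern for NaN. The helper name `decode_fp8` and the exponent biases (7 for E4M3, 15 for E5M2) are assumptions consistent with standard floating-point conventions, not details quoted from the paper's text.

```python
def decode_fp8(byte: int, fmt: str = "E4M3") -> float:
    """Decode an 8-bit FP8 pattern to a Python float.

    Illustrative sketch only. Field widths follow the abstract
    (E4M3: 4 exponent / 3 mantissa bits, E5M2: 5 exponent / 2 mantissa
    bits); the biases are assumed conventional values.
    """
    if fmt == "E4M3":
        exp_bits, man_bits, bias = 4, 3, 7
    elif fmt == "E5M2":
        exp_bits, man_bits, bias = 5, 2, 15
    else:
        raise ValueError(f"unknown format: {fmt}")

    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    exp_all_ones = (1 << exp_bits) - 1

    if fmt == "E5M2" and exp == exp_all_ones:
        # IEEE 754 convention: all-ones exponent encodes inf (mantissa 0)
        # or NaN (mantissa nonzero).
        return sign * float("inf") if man == 0 else float("nan")
    if fmt == "E4M3" and exp == exp_all_ones and man == (1 << man_bits) - 1:
        # E4M3 has no infinities; only this one bit-pattern is NaN,
        # which frees the rest of the top binade for normal values.
        return float("nan")

    if exp == 0:
        # Subnormal: no implicit leading 1, exponent fixed at 1 - bias.
        return sign * (man / (1 << man_bits)) * 2.0 ** (1 - bias)
    # Normal: implicit leading 1.
    return sign * (1.0 + man / (1 << man_bits)) * 2.0 ** (exp - bias)


# Under these assumptions, the largest E4M3 normal decodes to 448:
print(decode_fp8(0b0_1111_110, "E4M3"))  # 448.0
print(decode_fp8(0b0_1111_111, "E4M3"))  # nan (the single NaN pattern)
print(decode_fp8(0b0_11111_00, "E5M2"))  # inf
```

The extra dynamic range of E4M3 follows directly from this layout: because only one mantissa pattern in the top binade is spent on NaN and none on infinity, values up to 448 remain representable, whereas an IEEE-style treatment of the all-ones exponent would sacrifice the whole binade.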


