Representation range needs for 16-bit neural network training

by Valentina Popescu, et al.

Deep learning has grown rapidly thanks to its state-of-the-art performance across a wide range of real-world applications. While neural networks have traditionally been trained with IEEE-754 binary32 arithmetic, the rapid growth of computational demands in deep learning has boosted interest in faster, low-precision training. Mixed-precision training that combines IEEE-754 binary16 with IEEE-754 binary32 has been tried, and other 16-bit formats, for example Google's bfloat16, have become popular. In floating-point arithmetic there is a tradeoff between precision and representation range as the number of exponent bits changes; denormal numbers extend the representation range. This raises the questions of how much exponent range is needed, whether there is a format between binary16 (5 exponent bits) and bfloat16 (8 exponent bits) that works better than either of them, and whether denormals are necessary. In the present paper we study the need for denormal numbers in mixed-precision training, and we propose a 1/6/9 format, i.e., 6-bit exponent and 9-bit explicit mantissa, that offers a better range-precision tradeoff. We show that 1/6/9 mixed-precision training speeds up training on hardware that incurs a performance slowdown on denormal operations, or eliminates the need for denormal numbers altogether. Moreover, for a number of fully connected and convolutional neural networks in computer vision and natural language processing, 1/6/9 achieves numerical parity with standard mixed-precision training.
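The range-precision tradeoff the abstract describes can be made concrete by computing the extreme representable values of each 16-bit layout from its field widths. The following sketch assumes IEEE-754-style encoding conventions (biased exponent, reserved all-ones exponent, implicit leading bit); the 1/6/9 layout is the paper's proposed format with 1 sign bit, 6 exponent bits, and 9 explicit mantissa bits.

```python
def float_format_stats(exp_bits, mant_bits):
    """Return (max normal, min normal, min denormal) for an IEEE-754-style
    binary format with the given exponent and explicit-mantissa widths."""
    bias = 2 ** (exp_bits - 1) - 1
    max_exp = (2 ** exp_bits - 2) - bias      # all-ones exponent is reserved for Inf/NaN
    max_normal = (2 - 2.0 ** -mant_bits) * 2.0 ** max_exp
    min_normal = 2.0 ** (1 - bias)
    min_denormal = 2.0 ** (1 - bias - mant_bits)  # smallest positive subnormal
    return max_normal, min_normal, min_denormal

# binary16 = 1/5/10, bfloat16 = 1/8/7, and the proposed 1/6/9 format
for name, e, m in [("binary16", 5, 10), ("1/6/9", 6, 9), ("bfloat16", 8, 7)]:
    mx, mn, dn = float_format_stats(e, m)
    print(f"{name:9s} max={mx:.3e}  min normal={mn:.3e}  min denormal={dn:.3e}")
```

With one more exponent bit than binary16, the 1/6/9 format extends the normal range from roughly [6e-5, 65504] to roughly [9e-10, 4e9] at the cost of one mantissa bit, which is the intermediate tradeoff the paper argues for.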


