Towards Fully 8-bit Integer Inference for the Transformer Model

09/17/2020
by   Ye Lin, et al.

8-bit integer inference, a promising direction for reducing both the latency and storage cost of deep neural networks, has made great progress recently. However, previous systems still rely on 32-bit floating point for certain functions in complex models (e.g., Softmax in the Transformer) and make heavy use of quantization and de-quantization. In this work, we show that after a principled modification of the Transformer architecture, dubbed the Integer Transformer, an (almost) fully 8-bit integer inference algorithm, Scale Propagation, can be derived. De-quantization is applied only when necessary, which makes the network more efficient. Our experiments on the WMT16 En<->Ro, WMT14 En<->De, and En->Fr translation tasks, as well as the WikiText-103 language modelling task, show that the fully 8-bit Transformer system achieves performance comparable to the floating-point baseline while requiring a nearly 4x smaller memory footprint.
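The quantization and de-quantization overhead the abstract refers to can be illustrated with a small sketch. Below is a minimal, hypothetical example of symmetric per-tensor 8-bit quantization and an integer matrix multiply whose output scale is simply the product of the input scales, which is the basic idea behind propagating scales through the network instead of de-quantizing after every operation. The function names and details here are illustrative assumptions, not the paper's actual Scale Propagation implementation.

```python
import numpy as np

def quantize(x, num_bits=8):
    """Symmetric per-tensor quantization: map a float tensor to int8 plus a scale."""
    qmax = 2 ** (num_bits - 1) - 1                 # 127 for 8-bit
    absmax = np.abs(x).max()
    scale = absmax / qmax if absmax > 0 else 1.0
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def int8_matmul(qa, sa, qb, sb):
    """Integer matmul with 32-bit accumulation. The output scale is just the
    product of the input scales, so no de-quantization is needed here."""
    acc = qa.astype(np.int32) @ qb.astype(np.int32)
    return acc, sa * sb

# Sketch: quantize two float matrices, multiply entirely in integer
# arithmetic, and recover an approximate float result only at the output.
np.random.seed(0)
a = np.random.randn(4, 8).astype(np.float32)
b = np.random.randn(8, 4).astype(np.float32)
qa, sa = quantize(a)
qb, sb = quantize(b)
acc, scale = int8_matmul(qa, sa, qb, sb)
approx = acc * scale           # de-quantize once, at the very end
print(np.max(np.abs(approx - a @ b)))   # small quantization error
```

In this sketch, only the final multiply by `scale` touches floating point; the point of a fully integer pipeline, as in the paper, is to keep even such steps (and functions like Softmax) in integer arithmetic where possible.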


