Efficient 8-Bit Quantization of Transformer Neural Machine Language Translation Model

06/03/2019
by Aishwarya Bhandare, et al.

In this work, we quantize a trained Transformer machine language translation model, leveraging the INT8/VNNI instructions in the latest Intel® Xeon® Cascade Lake processors to improve inference performance while maintaining less than a 0.5% drop in accuracy. To the best of our knowledge, this is the first attempt in the industry to quantize the Transformer model. This has high impact, as it clearly demonstrates the various complexities of quantizing a language translation model. We present novel quantization techniques implemented directly in TensorFlow that opportunistically replace 32-bit floating-point (FP32) computations with 8-bit integer (INT8) computations and transform the FP32 computational graph. We also present a bin-packing batching technique to maximize CPU utilization. Overall, our optimizations with INT8/VNNI deliver a 1.5X improvement over the best FP32 performance. Furthermore, this work reveals the opportunities and challenges in boosting the performance of quantized deep learning inference, and establishes best practices for running inference with high efficiency on Intel CPUs.
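The core idea behind this kind of INT8 replacement can be illustrated with a minimal NumPy sketch. This is not the paper's TensorFlow graph transformation, and the function names below are illustrative: symmetric linear quantization maps each FP32 tensor onto [-127, 127] with a per-tensor scale, the matrix multiply accumulates in INT32 (as VNNI hardware does), and the result is rescaled back to FP32.

```python
import numpy as np

def quantize_symmetric_int8(x_fp32):
    # Symmetric linear quantization: map [-max|x|, +max|x|] onto [-127, 127].
    scale = 127.0 / max(np.abs(x_fp32).max(), 1e-8)
    x_int8 = np.clip(np.round(x_fp32 * scale), -127, 127).astype(np.int8)
    return x_int8, scale

def int8_matmul_dequantize(a_fp32, b_fp32):
    # Quantize both operands, accumulate the product in INT32 (as the
    # VNNI instructions do), then rescale the result back to FP32.
    a_q, a_scale = quantize_symmetric_int8(a_fp32)
    b_q, b_scale = quantize_symmetric_int8(b_fp32)
    acc_int32 = a_q.astype(np.int32) @ b_q.astype(np.int32)
    return acc_int32.astype(np.float32) / (a_scale * b_scale)

# Example: the INT8 product closely tracks the FP32 reference.
a = np.random.randn(4, 8).astype(np.float32)
b = np.random.randn(8, 3).astype(np.float32)
print(np.max(np.abs(a @ b - int8_matmul_dequantize(a, b))))
```

In practice, the scales for activation tensors are calibrated offline on sample data rather than computed per call, which is one of the complexities the paper addresses when transforming the FP32 graph.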


Related research

09/17/2020: Towards Fully 8-bit Integer Inference for the Transformer Model
  8-bit integer inference, as a promising direction in reducing both the l...

06/18/2020: Efficient Execution of Quantized Deep Learning Models: A Compiler Approach
  A growing number of applications implement predictive functions using de...

04/13/2018: Pieces of Eight: 8-bit Neural Machine Translation
  Neural machine translation has achieved levels of fluency and adequacy t...

03/09/2023: Performance Characterization of using Quantization for DNN Inference on Edge Devices: Extended Version
  Quantization is a popular technique used in Deep Neural Networks (DNN) i...

09/16/2020: Extremely Low Bit Transformer Quantization for On-Device Neural Machine Translation
  Transformer is being widely used in Neural Machine Translation (NMT). De...

09/12/2022: FP8 Formats for Deep Learning
  FP8 is a natural progression for accelerating deep learning training inf...

02/01/2021: Understanding Cache Boundness of ML Operators on ARM Processors
  Machine Learning compilers like TVM allow a fast and flexible deployment...
