I-BERT: Integer-only BERT Quantization

01/05/2021
by Sehoon Kim, et al.

Transformer-based models, such as BERT and RoBERTa, have achieved state-of-the-art results on many Natural Language Processing tasks. However, their memory footprint, inference latency, and power consumption are prohibitive for many edge processors, making it a challenge to deploy these models on resource-constrained edge devices and applications. While quantization is a viable solution to this problem, previous work on quantizing Transformer-based models uses floating-point arithmetic during inference, which limits deployment on many edge processors. In this work, we propose a novel integer-only quantization scheme for Transformer-based models that quantizes the entire inference process. In particular, we demonstrate how to approximate nonlinear operations in the Transformer architecture, e.g., GELU, Softmax, and Layer Normalization, with lightweight integer computations. We use these approximations in our method, I-BERT, to perform end-to-end integer-only inference without any floating-point calculation. We evaluate our approach on GLUE downstream tasks using RoBERTa-Base and RoBERTa-Large. In both cases, with an 8-bit integer-only quantization scheme, I-BERT achieves accuracy comparable to the full-precision baseline.

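To make the integer-only inference concrete, below is a minimal NumPy sketch of the paper's polynomial approach for GELU: erf is replaced by a second-order polynomial fitted offline (the paper reports a ≈ -0.2888, b ≈ -1.769), so that at inference time the quantized tensor is processed with integer additions and multiplications only, while the floating-point scale is tracked separately and folded into the surrounding layers. The helper names (int_poly, int_erf, int_gelu) and the 8-bit scale are illustrative assumptions; this is a sketch of the idea, not the paper's released implementation.

```python
import math
import numpy as np

# Rough sketch of a polynomial-based integer-only GELU (i-GELU idea).
# The constants a = -0.2888 and b = -1.769 are the second-order erf fit
# reported in the paper; all scale arithmetic below would be precomputed
# offline in a real deployment, so the quantized tensor q stays integer.

def int_poly(q, scale, a, b, c):
    """Evaluate a*(x + b)^2 + c for x = scale * q using integer ops on q."""
    q_b = math.floor(b / scale)                 # integer offset for b
    out_scale = a * scale * scale               # output scale (offline)
    q_c = math.floor(c / out_scale)             # integer offset for c
    q_out = (q + q_b) ** 2 + q_c                # integer-only computation
    return q_out, out_scale

def int_erf(q, scale, a=-0.2888, b=-1.769):
    """Integer-only erf approximation: sgn(x) * [a*(clip(|x|, -b) + b)^2 + 1]."""
    q_sgn = np.sign(q)
    q_abs = np.minimum(np.abs(q), math.floor(-b / scale))
    q_l, out_scale = int_poly(q_abs, scale, a, b, 1.0)
    return q_sgn * q_l, out_scale

def int_gelu(q, scale):
    """GELU(x) ~= 0.5 * x * (1 + erf(x / sqrt(2))) with an integer-only erf."""
    q_erf, s_erf = int_erf(q, scale / math.sqrt(2))
    q_one = math.floor(1.0 / s_erf)             # integer encoding of the constant 1
    q_out = q * (q_erf + q_one)
    return q_out, 0.5 * scale * s_erf

# Quick sanity check against the exact floating-point GELU.
scale = 8.0 / 255                               # hypothetical 8-bit quantization scale
x = np.linspace(-4, 4, 1000)
q = np.round(x / scale).astype(np.int64)
q_out, out_scale = int_gelu(q, scale)
exact = np.array([0.5 * v * (1.0 + math.erf(v / math.sqrt(2))) for v in x])
print(np.abs(q_out * out_scale - exact).max())  # small error, on the order of 1e-2
```

In the paper, the same polynomial building block is also applied to the exponential inside Softmax, and Layer Normalization is handled with an iterative integer square root, so the entire forward pass can stay in integer arithmetic.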

Related research

07/04/2022
I-ViT: Integer-only Quantization for Efficient Vision Transformer Inference
Vision Transformers (ViTs) have achieved state-of-the-art performance on...

09/20/2022
Integer Fine-tuning of Transformer-based Models
Transformer based models are used to achieve state-of-the-art performanc...

09/12/2019
Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT
Transformer based architectures have become de-facto models used for a r...

04/13/2021
NPE: An FPGA-based Overlay Processor for Natural Language Processing
In recent years, transformer-based models have shown state-of-the-art re...

04/20/2020
Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation
Quantization techniques can reduce the size of Deep Neural Networks and ...

11/28/2020
EdgeBERT: Sentence-Level Energy Optimizations for Latency-Aware Multi-Task NLP Inference
Transformer-based language models such as BERT provide significant accur...

12/03/2021
NN-LUT: Neural Approximation of Non-Linear Operations for Efficient Transformer Inference
Non-linear operations such as GELU, Layer normalization, and Softmax are...
