Block-wise Bit-Compression of Transformer-based Models

03/16/2023
by Gaochen Dong, et al.

With the recent popularity of Transformer-based models such as BERT, GPT-3, and ChatGPT, state-of-the-art performance has been achieved on a range of natural language processing tasks. However, the massive computation, large memory footprint, and resulting high latency of Transformer-based models are an inevitable challenge for cloud services with strict real-time requirements. To tackle this issue, we propose BBCT, a method of block-wise bit-compression for Transformers that requires no retraining. Our method achieves fine-grained compression of the whole Transformer, including the embedding, matrix multiplication, GELU, softmax, and layer normalization operations, as well as all the intermediate results. As a case study, we compress an efficient BERT with BBCT. Our benchmark results on the General Language Understanding Evaluation (GLUE) suite show that BBCT can achieve less than 1% accuracy drop on most tasks.
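The core idea behind block-wise compression, giving each block of a tensor its own quantization scale so that a single outlier does not degrade the precision of the whole tensor, can be illustrated with a short sketch. The snippet below is a minimal illustration and not the paper's implementation: the function names (block_quantize, block_dequantize), the block size, and the symmetric linear quantization scheme are assumptions made for the example.

```python
import numpy as np

def block_quantize(x, block_size=64, n_bits=8):
    """Quantize a 1-D float array block by block.

    Each block gets its own scale, so outliers in one block do not
    hurt the precision of the other blocks.
    """
    qmax = 2 ** (n_bits - 1) - 1            # e.g. 127 for signed 8-bit
    x = np.asarray(x, dtype=np.float32)
    pad = (-len(x)) % block_size            # pad so length divides evenly into blocks
    blocks = np.pad(x, (0, pad)).reshape(-1, block_size)

    scales = np.abs(blocks).max(axis=1, keepdims=True) / qmax
    scales[scales == 0] = 1.0               # avoid division by zero for all-zero blocks
    q = np.clip(np.round(blocks / scales), -qmax - 1, qmax).astype(np.int8)
    return q, scales, pad

def block_dequantize(q, scales, pad):
    """Reconstruct the float array from per-block integers and scales."""
    x = (q.astype(np.float32) * scales).reshape(-1)
    return x if pad == 0 else x[:-pad]

# Usage: round-trip a weight vector and inspect the quantization error.
w = np.random.randn(1000).astype(np.float32)
q, s, pad = block_quantize(w, block_size=64, n_bits=8)
w_hat = block_dequantize(q, s, pad)
print("max abs error:", np.abs(w - w_hat).max())
```

Because the scale is chosen per block rather than per tensor, the reconstruction error stays bounded by each block's own dynamic range, which is what makes post-training (retraining-free) compression of activations such as softmax and GELU outputs practical.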


Related research

04/04/2023 · Blockwise Compression of Transformer-based Models without Retraining
Transformer-based models, represented by GPT-3, ChatGPT, and GPT-4, have...

03/04/2021 · Hardware Acceleration of Fully Quantized BERT for Efficient Natural Language Processing
BERT is the most recent Transformer-based model that achieves state-of-t...

02/27/2020 · Compressing Large-Scale Transformer-Based Models: A Case Study on BERT
Transformer-based models pre-trained on large-scale corpora achieve stat...

06/22/2022 · Answer Fast: Accelerating BERT on the Tensor Streaming Processor
Transformers have become a predominant machine learning workload, they a...

02/17/2022 · Revisiting Over-smoothing in BERT from the Perspective of Graph
Recently over-smoothing phenomenon of Transformer-based models is observ...

09/12/2019 · Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT
Transformer based architectures have become de-facto models used for a r...

10/26/2020 · FastFormers: Highly Efficient Transformer Models for Natural Language Understanding
Transformer-based models are the state-of-the-art for Natural Language U...
