Quantization-Aware and Tensor-Compressed Training of Transformers for Natural Language Understanding

06/01/2023
by   Zi Yang, et al.

Fine-tuned transformer models have shown superior performance on many natural language tasks. However, their large size prohibits deploying high-performance transformer models on resource-constrained devices. This paper proposes a quantization-aware tensor-compressed training approach to reduce the model size, arithmetic operations, and ultimately the runtime latency of transformer-based models. We compress the embedding and linear layers of transformers into small low-rank tensor cores, which significantly reduces the number of model parameters. Quantization-aware training with learnable scale factors is then used to obtain low-precision representations of the tensor-compressed models. The developed approach can be used for both end-to-end training and distillation-based training. To improve convergence, a layer-by-layer distillation is applied to distill a quantized and tensor-compressed student model from a pre-trained transformer. The performance is demonstrated on two natural language understanding tasks, showing up to a 63× compression ratio with little accuracy loss and notable inference and training speedups.
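
As a rough illustration of the two main ingredients described above, the sketch below (PyTorch assumed; all class and function names are hypothetical, not the authors' code) builds a linear layer whose weight is stored as low-rank tensor-train cores and applies uniform fake quantization with a learnable scale factor to each core during training. The factorization shape, TT rank, bit width, and the straight-through rounding estimator are illustrative assumptions; the paper's exact parameterization may differ.

```python
# A minimal sketch (assuming PyTorch; names are illustrative, not the authors' code)
# of a tensor-train-compressed linear layer trained with quantization-aware
# fake quantization using learnable scale factors.
import math
import torch
import torch.nn as nn


def fake_quant(x, scale, bits=8):
    """Uniform fake quantization with a learnable scale (straight-through rounding)."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    x_scaled = x / scale
    # Round in the forward pass; pass gradients straight through in the backward pass.
    x_int = x_scaled + (torch.round(x_scaled) - x_scaled).detach()
    return torch.clamp(x_int, qmin, qmax) * scale


class TTQuantLinear(nn.Module):
    """Linear layer with a TT-factorized weight and per-core learnable quantization scales."""

    def __init__(self, in_factors, out_factors, rank, bits=8):
        super().__init__()
        self.in_factors, self.out_factors, self.bits = in_factors, out_factors, bits
        d = len(in_factors)
        ranks = [1] + [rank] * (d - 1) + [1]
        # Core k has shape (r_{k-1}, m_k, n_k, r_k); the dense weight is their contraction.
        self.cores = nn.ParameterList(
            [nn.Parameter(0.1 * torch.randn(ranks[k], in_factors[k], out_factors[k], ranks[k + 1]))
             for k in range(d)]
        )
        self.scales = nn.ParameterList([nn.Parameter(torch.tensor(0.05)) for _ in range(d)])

    def materialize_weight(self):
        """Contract the fake-quantized TT cores into a dense (in_dim, out_dim) matrix."""
        w = fake_quant(self.cores[0], self.scales[0], self.bits).squeeze(0)  # (m1, n1, r1)
        for core, scale in zip(self.cores[1:], self.scales[1:]):
            w = torch.tensordot(w, fake_quant(core, scale, self.bits), dims=([-1], [0]))
        d = len(self.cores)
        # w has shape (m1, n1, ..., m_d, n_d, 1): group the m's and n's, then flatten.
        perm = [2 * k for k in range(d)] + [2 * k + 1 for k in range(d)]
        return w.squeeze(-1).permute(perm).reshape(
            math.prod(self.in_factors), math.prod(self.out_factors))

    def forward(self, x):
        return x @ self.materialize_weight()


# Example: a 512 x 1024 layer factorized as (8*8*8) x (8*8*16) with TT-rank 16;
# the parameter count drops from 512*1024 to the sum of the small core sizes.
layer = TTQuantLinear(in_factors=(8, 8, 8), out_factors=(8, 8, 16), rank=16, bits=8)
y = layer(torch.randn(4, 512))  # -> shape (4, 1024)
```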


Related research

11/10/2021
Prune Once for All: Sparse Pre-Trained Language Models
Transformer-based language models are applied to a wide range of applica...

02/23/2023
Teacher Intervention: Improving Convergence of Quantization Aware Training for Ultra-Low Precision Transformers
Pre-trained Transformer models such as BERT have shown great success in ...

11/30/2022
HEAT: Hardware-Efficient Automatic Tensor Decomposition for Transformer Compression
Transformers have attained superior performance in natural language proc...

07/19/2018
Statistical Model Compression for Small-Footprint Natural Language Understanding
In this paper we investigate statistical model compression applied to na...

11/30/2020
Extreme Model Compression for On-device Natural Language Understanding
In this paper, we propose and experiment with techniques for extreme com...

05/25/2022
BiT: Robustly Binarized Multi-distilled Transformer
Modern pre-trained transformers have rapidly advanced the state-of-the-a...

06/05/2023
Efficient GPT Model Pre-training using Tensor Train Matrix Representation
Large-scale transformer models have shown remarkable performance in lang...
