Understanding and Improving Knowledge Distillation for Quantization-Aware Training of Large Transformer Encoders

11/20/2022
by Minsoo Kim, et al.

Knowledge distillation (KD) is a ubiquitous method for model compression that strengthens a lightweight student model with knowledge transferred from a teacher. In particular, KD has been employed in quantization-aware training (QAT) of Transformer encoders such as BERT to improve the accuracy of student models with reduced-precision weight parameters. However, little is understood about which of the various KD approaches best fits QAT of Transformers. In this work, we provide an in-depth analysis of the mechanism of KD on attention recovery of quantized large Transformers. In particular, we reveal that the previously adopted MSE loss on the attention score is insufficient for recovering the self-attention information. Therefore, we propose two KD methods: attention-map loss and attention-output loss. Furthermore, we explore the unification of both losses to address the task-dependent preference between attention-map and output losses. The experimental results on various Transformer encoder models demonstrate that the proposed KD methods achieve state-of-the-art accuracy for QAT with sub-2-bit weight quantization.
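For readers who want a concrete picture of how such attention-level distillation can be wired up, the following PyTorch sketch illustrates an attention-map loss (KL divergence on softmax-normalized attention probabilities) and an attention-output loss (MSE on the self-attention sublayer output), combined layer-wise. The normalization choices, the `gamma` weighting factor, and the toy tensor shapes are illustrative assumptions, not the paper's reference implementation.

```python
# Minimal sketch of attention-level distillation losses in the spirit of the paper.
# Exact formulations (normalization, layer selection, loss weighting) are assumptions.
import torch
import torch.nn.functional as F


def attention_map_loss(student_scores: torch.Tensor, teacher_scores: torch.Tensor) -> torch.Tensor:
    """KL divergence between softmax-normalized attention maps.

    Both inputs are raw attention scores of shape (batch, heads, seq, seq).
    """
    log_s = F.log_softmax(student_scores, dim=-1)
    t = F.softmax(teacher_scores, dim=-1)
    return F.kl_div(log_s, t, reduction="batchmean")


def attention_output_loss(student_out: torch.Tensor, teacher_out: torch.Tensor) -> torch.Tensor:
    """MSE between self-attention sublayer outputs of shape (batch, seq, hidden)."""
    return F.mse_loss(student_out, teacher_out)


def unified_kd_loss(student_layers, teacher_layers, gamma: float = 1.0) -> torch.Tensor:
    """Layer-wise sum of both losses; `gamma` is a hypothetical weighting knob."""
    total = torch.zeros(())
    for (s_scores, s_out), (t_scores, t_out) in zip(student_layers, teacher_layers):
        total = total + attention_map_loss(s_scores, t_scores)
        total = total + gamma * attention_output_loss(s_out, t_out)
    return total


if __name__ == "__main__":
    # Toy shapes: 2 layers, batch 4, 8 heads, sequence length 16, hidden size 64.
    def fake_layer():
        return torch.randn(4, 8, 16, 16), torch.randn(4, 16, 64)

    student = [fake_layer() for _ in range(2)]
    teacher = [fake_layer() for _ in range(2)]
    print(unified_kd_loss(student, teacher, gamma=0.5))
```

In a QAT setup, the student scores and outputs would come from the quantized model's forward pass and the teacher's from a frozen full-precision model, with this loss added to (or replacing) the task loss during fine-tuning.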


Related research

- Teacher Intervention: Improving Convergence of Quantization Aware Training for Ultra-Low Precision Transformers (02/23/2023)
- KDLSQ-BERT: A Quantized Bert Combining Knowledge Distillation with Learned Step Size Quantization (01/15/2021)
- Variation-aware Vision Transformer Quantization (07/01/2023)
- Boost Vision Transformer with GPU-Friendly Sparsity and Quantization (05/18/2023)
- MiniLMv2: Multi-Head Self-Attention Relation Distillation for Compressing Pretrained Transformers (12/31/2020)
- DearKD: Data-Efficient Early Knowledge Distillation for Vision Transformers (04/27/2022)
- Bi-ViT: Pushing the Limit of Vision Transformer Quantization (05/21/2023)
