KroneckerBERT: Learning Kronecker Decomposition for Pre-trained Language Models via Knowledge Distillation

09/13/2021
by   Marzieh S. Tahaei, et al.

The development of over-parameterized pre-trained language models has made a significant contribution toward the success of natural language processing. While over-parameterization of these models is the key to their generalization power, it makes them unsuitable for deployment on low-capacity devices. We push the limits of state-of-the-art Transformer-based pre-trained language model compression using Kronecker decomposition. We use this decomposition to compress the embedding layer, all linear mappings in the multi-head attention, and the feed-forward network modules in the Transformer layer. We perform intermediate-layer knowledge distillation using the uncompressed model as the teacher to improve the performance of the compressed model. We present KroneckerBERT, a compressed version of the BERT_BASE model obtained using this framework. We evaluate the performance of KroneckerBERT on well-known NLP benchmarks and show that for a high compression factor of 19 (5% of the size of the BERT_BASE model), KroneckerBERT outperforms state-of-the-art compression methods on the GLUE benchmark. Our experiments indicate that the proposed model has promising out-of-distribution robustness and is superior to state-of-the-art compression methods on SQuAD.
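To make the core idea concrete, the sketch below shows how a Kronecker-factorized linear layer can be implemented. It is a minimal illustration, not the authors' released code: the class name KroneckerLinear, the factor shapes, and the initialization are assumptions. The only technical content it relies on is the standard identity that, with row-major flattening, (A ⊗ B) applied to a flattened X equals A X Bᵀ flattened, so the full weight A ⊗ B never has to be materialized.

```python
# Minimal sketch of a Kronecker-factorized linear layer (assumed names/shapes,
# not the authors' implementation). A dense weight W of shape (out, in) is
# approximated by A ⊗ B, storing only m1*n1 + m2*n2 parameters.
import torch
import torch.nn as nn


class KroneckerLinear(nn.Module):
    def __init__(self, in1, in2, out1, out2, bias=True):
        super().__init__()
        self.in1, self.in2 = in1, in2
        # A: (out1, in1), B: (out2, in2); full weight would be (out1*out2, in1*in2).
        self.A = nn.Parameter(torch.randn(out1, in1) * 0.02)
        self.B = nn.Parameter(torch.randn(out2, in2) * 0.02)
        self.bias = nn.Parameter(torch.zeros(out1 * out2)) if bias else None

    def forward(self, x):
        # (A ⊗ B) x  ==  flatten(A @ X @ B^T), where X is x reshaped to (in1, in2).
        batch_shape = x.shape[:-1]
        X = x.reshape(-1, self.in1, self.in2)      # (N, in1, in2)
        Y = self.A @ X @ self.B.transpose(0, 1)    # (N, out1, out2)
        y = Y.reshape(*batch_shape, -1)            # (..., out1*out2)
        return y + self.bias if self.bias is not None else y


# Example: a 768 -> 3072 feed-forward projection with 32*24 = 768 and 64*48 = 3072.
layer = KroneckerLinear(in1=32, in2=24, out1=64, out2=48)
out = layer(torch.randn(8, 128, 768))  # -> shape (8, 128, 3072)
```

Applied to the embedding matrix and to every linear mapping in the attention and feed-forward blocks, this kind of factorization is what drives the parameter reduction reported above, and intermediate-layer distillation from the uncompressed teacher is then used to recover accuracy.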
