Kronecker Decomposition for GPT Compression

10/15/2021
by Ali Edalati, et al.

GPT is an auto-regressive Transformer-based pre-trained language model that has attracted a lot of attention in the natural language processing (NLP) domain due to its state-of-the-art performance on several downstream tasks. The success of GPT is mostly attributed to its pre-training on huge amounts of data and its large number of parameters (from 100M to billions of parameters). Despite the superior performance of GPT (especially in few-shot or zero-shot setups), its over-parameterized nature can be very prohibitive for deploying the model on devices with limited computational power or memory. This problem can be mitigated using model compression techniques; however, compressing GPT models has not been investigated much in the literature. In this work, we use Kronecker decomposition to compress the linear mappings of the GPT-2 model. Our Kronecker GPT-2 model (KnGPT2) is initialized from the Kronecker-decomposed version of the GPT-2 model and then undergoes very light pre-training on only a small portion of the training data with intermediate layer knowledge distillation (ILKD). Finally, KnGPT2 is fine-tuned on downstream tasks using ILKD as well. We evaluate our model on both language modeling and the General Language Understanding Evaluation (GLUE) benchmark tasks and show that, with more efficient pre-training and a similar number of parameters, our KnGPT2 significantly outperforms the existing DistilGPT2 model.
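
As a rough illustration of the core operation described in the abstract, the sketch below replaces a single GPT-2-sized weight matrix with the nearest Kronecker product A ⊗ B, obtained from a rank-1 SVD of a rearranged weight matrix (the Van Loan–Pitsianis construction). This is a minimal sketch, not the authors' released code: the factor shapes, function name, and use of NumPy are illustrative assumptions, and the paper's actual initialization and ILKD training are not reproduced here.

```python
# Minimal sketch: approximate a linear layer's weight W with A ⊗ B,
# where A is (m1 x n1) and B is (m2 x n2), m1*m2 = rows(W), n1*n2 = cols(W).
# Shapes below are illustrative, not taken from the paper.
import numpy as np

def nearest_kronecker_factors(W, m1, n1, m2, n2):
    """Return A (m1 x n1), B (m2 x n2) minimizing ||W - A ⊗ B||_F."""
    assert W.shape == (m1 * m2, n1 * n2)
    # Rearrange W so that each (m2 x n2) block becomes one row.
    R = (W.reshape(m1, m2, n1, n2)
          .transpose(0, 2, 1, 3)          # index order (m1, n1, m2, n2)
          .reshape(m1 * n1, m2 * n2))
    # The best rank-1 approximation of R yields the optimal factors.
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    A = np.sqrt(s[0]) * U[:, 0].reshape(m1, n1)
    B = np.sqrt(s[0]) * Vt[0, :].reshape(m2, n2)
    return A, B

# Example: compress a 768 x 768 projection (GPT-2 hidden size) into two factors.
W = np.random.randn(768, 768).astype(np.float32)
A, B = nearest_kronecker_factors(W, 32, 32, 24, 24)
W_approx = np.kron(A, B)          # same shape as W
print(W.size, A.size + B.size)    # 589824 parameters -> 1600 parameters
print(np.linalg.norm(W - W_approx))
```

In practice such factors would initialize the compressed linear layers, after which the compressed model is recovered with light pre-training and knowledge distillation, as the abstract describes.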
