Are Intermediate Layers and Labels Really Necessary? A General Language Model Distillation Method

06/11/2023
by   Shicheng Tan, et al.

The large scale of pre-trained language models poses a challenge for their deployment on various devices, and methods for compressing these models, particularly knowledge distillation, are receiving growing attention. However, current knowledge distillation methods rely on the model's intermediate-layer features and on golden labels (also called hard labels), which respectively require an aligned model architecture and sufficient labeled data. Moreover, the vocabulary parameters are usually neglected by existing methods. To address these problems, we propose a general language model distillation (GLMD) method that performs two-stage word prediction distillation and vocabulary compression; it is simple yet shows surprisingly strong performance. Specifically, by dispensing with intermediate layers and golden labels, GLMD removes the dimension and structure constraints between teacher and student models as well as the need for labeled datasets, supporting more general application scenarios. Meanwhile, exploiting the long-tailed distribution of word frequencies in the data, GLMD compresses the vocabulary by decreasing its size rather than its dimensionality. Experimental results show that our method outperforms 25 state-of-the-art methods on the SuperGLUE benchmark, achieving an average score that surpasses the best method by 3
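The abstract describes two components: distilling only the teacher's word-prediction (vocabulary) distribution, without intermediate-layer features or golden labels, and compressing the vocabulary by keeping fewer words rather than shrinking embedding dimensions. Below is a minimal PyTorch sketch of these two ideas, not the authors' implementation: the function names, the single KL-divergence loss (the paper's two-stage schedule is not shown), and the top-k frequency cutoff are all assumptions for illustration.

```python
# Minimal sketch (not the authors' code) of the two ideas in the abstract,
# assuming a PyTorch setup where teacher and student both emit logits over
# the same vocabulary.

import torch
import torch.nn.functional as F
from collections import Counter


def word_prediction_distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """KL divergence between teacher and student word-prediction distributions.

    Only output-layer (vocabulary) distributions are matched: no intermediate-layer
    features and no golden labels, so teacher and student may differ in depth,
    width, and architecture, and the data need not be labeled.
    """
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    # batchmean KL, scaled by T^2 as is common in distillation objectives
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t ** 2)


def compress_vocabulary(token_id_corpus, keep_size):
    """Keep the `keep_size` most frequent token ids observed in a corpus.

    Because word frequencies are long-tailed, a small head of the vocabulary
    covers most of the data, so the vocabulary is shrunk in *size* rather
    than in embedding *dimensionality*.
    """
    counts = Counter(tid for seq in token_id_corpus for tid in seq)
    kept_ids = [tid for tid, _ in counts.most_common(keep_size)]
    # old id -> new compact id; rare/unseen tokens can be mapped to an UNK id
    return {old: new for new, old in enumerate(kept_ids)}


if __name__ == "__main__":
    # Toy usage: random "logits" over a 100-word vocabulary.
    student_logits = torch.randn(4, 100)
    teacher_logits = torch.randn(4, 100)
    print(word_prediction_distillation_loss(student_logits, teacher_logits, temperature=2.0))

    corpus = [[1, 2, 2, 3], [2, 3, 3, 3, 7]]
    print(compress_vocabulary(corpus, keep_size=3))  # e.g. {3: 0, 2: 1, 1: 2}
```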


