Dynamic Knowledge Distillation for Pre-trained Language Models

09/23/2021
by Lei Li, et al.

Knowledge distillation (KD) has been proven effective for compressing large-scale pre-trained language models. However, existing methods conduct KD statically, e.g., the student model aligns its output distribution to that of a selected teacher model on a pre-defined training dataset. In this paper, we explore whether dynamic knowledge distillation, which empowers the student to adjust the learning procedure according to its competency in terms of performance and learning efficiency, is possible. We explore dynamic adjustments along three aspects: teacher model adoption, data selection, and KD objective adaptation. Experimental results show that (1) proper selection of the teacher model can boost the performance of the student model; (2) conducting KD with only 10% of the most informative training instances achieves comparable performance while greatly accelerating training; (3) the student performance can be further boosted by adjusting the supervision contribution of the different alignment objectives. We find dynamic knowledge distillation promising and provide discussions on potential future directions towards more efficient KD methods. Our code is available at https://github.com/lancopku/DynamicKD.
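For readers unfamiliar with the static baseline that the dynamic adjustments build on, below is a minimal sketch (assuming PyTorch) of the standard output-distribution alignment objective: the student's temperature-softened prediction is matched to the teacher's via a KL term, mixed with ordinary cross-entropy on the gold labels. The temperature, mixing weight, and toy tensors are illustrative assumptions, not the paper's settings; the authors' actual implementation is in the repository linked above.

# Minimal sketch of the static KD objective that dynamic KD methods adapt.
# Hyperparameters (temperature, alpha) and the toy tensors below are
# illustrative assumptions, not values taken from the paper.
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    """Mix hard-label cross-entropy with a soft-label KL term.

    student_logits, teacher_logits: (batch, num_classes)
    labels: (batch,) gold class indices
    """
    # Soft targets: align the student's softened distribution to the teacher's.
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kl = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature ** 2

    # Hard targets: ordinary supervised cross-entropy on the gold labels.
    ce = F.cross_entropy(student_logits, labels)

    # A dynamic KD method would adjust alpha, pick the teacher, or select
    # informative instances on the fly according to the student's competency;
    # in this static sketch everything is fixed.
    return alpha * kl + (1.0 - alpha) * ce

if __name__ == "__main__":
    torch.manual_seed(0)
    student_logits = torch.randn(8, 3, requires_grad=True)  # e.g. a 3-way classification task
    teacher_logits = torch.randn(8, 3)
    labels = torch.randint(0, 3, (8,))
    loss = kd_loss(student_logits, teacher_logits, labels)
    loss.backward()
    print(f"KD loss: {loss.item():.4f}")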


