Continuation KD: Improved Knowledge Distillation through the Lens of Continuation Optimization

12/12/2022
by Aref Jafari, et al.

Knowledge Distillation (KD) has been extensively used for natural language understanding (NLU) tasks to improve the generalization of a small model (the student) by transferring knowledge from a larger model (the teacher). Although KD methods achieve state-of-the-art performance in numerous settings, several problems limit their effectiveness. The literature shows that the capacity gap between the teacher and student networks can make KD ineffective. Additionally, existing KD techniques do not mitigate the noise in the teacher's output: modeling the teacher's noisy behaviour can distract the student from learning more useful features. We propose a new KD method that addresses these problems and facilitates training compared to previous techniques. Inspired by continuation optimization, we design a training procedure that optimizes the highly non-convex KD objective by starting with a smoothed version of this objective and making it progressively more complex as training proceeds. Our method (Continuation-KD) achieves state-of-the-art performance across various compact architectures on NLU (the GLUE benchmark) and computer vision tasks (CIFAR-10 and CIFAR-100).
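To make the continuation idea concrete, the sketch below shows one plausible way such a schedule could look in PyTorch. It is not the paper's exact formulation: the linear continuation parameter, the temperature range, and the loss weights are all illustrative assumptions; the abstract only specifies that training starts from a smoothed objective and gradually moves to the harder one.

```python
# Minimal sketch (not the paper's exact formulation): a continuation-style KD
# loss that starts from a heavily smoothed objective and gradually restores the
# full, harder objective as training proceeds. The beta schedule, temperature
# range, and loss weights are illustrative assumptions.

import torch
import torch.nn.functional as F

def continuation_kd_loss(student_logits, teacher_logits, labels,
                         step, total_steps, t_start=8.0, t_end=1.0):
    """KD loss with a simple continuation schedule.

    Early in training (beta ~ 0) the objective is dominated by a
    high-temperature (smoothed) match to the teacher; later (beta -> 1)
    the sharper teacher distribution and the hard-label cross-entropy
    take over, making the objective progressively harder.
    """
    beta = min(step / total_steps, 1.0)               # continuation parameter in [0, 1]
    temperature = t_start + beta * (t_end - t_start)  # anneal from smooth to sharp

    # Soft-label KD term at the current temperature (standard T^2 scaling).
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    # Hard-label term, phased in as the objective becomes more complex.
    ce = F.cross_entropy(student_logits, labels)
    return (1.0 - beta) * kd + beta * (0.5 * kd + 0.5 * ce)
```

In this reading, the smoothing comes from the high initial temperature and the absence of the hard-label term, and "making the objective more complex" corresponds to annealing the temperature down while mixing in the cross-entropy loss.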


