Annealing Knowledge Distillation

by Aref Jafari, et al.

The significant memory and computational requirements of large deep neural networks restrict their application on edge devices. Knowledge distillation (KD) is a prominent model compression technique for deep neural networks in which the knowledge of a trained large teacher model is transferred to a smaller student model. The success of knowledge distillation is mainly attributed to its training objective function, which exploits soft-target information (also known as "dark knowledge") in addition to the regular hard labels in a training set. However, the literature shows that the larger the gap between the teacher and the student networks, the harder it is to train the student with knowledge distillation. To address this shortcoming, we propose an improved knowledge distillation method (called Annealing-KD) that feeds the rich information provided by the teacher's soft targets to the student incrementally and more efficiently. Our Annealing-KD technique is based on a gradual transition over annealed soft targets generated by the teacher at different temperatures in an iterative process, so that the student is trained to follow the annealed teacher output step by step. This paper includes theoretical and empirical evidence as well as practical experiments to support the effectiveness of our Annealing-KD method. We ran a comprehensive set of experiments on different tasks, including image classification (CIFAR-10 and CIFAR-100) and natural language inference with BERT-based models on the GLUE benchmark, and consistently obtained superior results.
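
To make the idea of annealed soft targets concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the abstract does not state the exact loss or schedule, so the temperature-scaled softmax, the linear integer schedule from `max_temperature` down to 1, and the names `softmax`, `annealed_targets`, and `teacher_logits` are all assumptions made for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives a flatter distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def annealed_targets(teacher_logits, max_temperature):
    """Yield (temperature, soft_target) pairs from max_temperature down to 1.

    Assumption: a linear integer schedule; the abstract does not specify one.
    """
    for temperature in range(max_temperature, 0, -1):
        yield temperature, softmax(teacher_logits, temperature)


# Hypothetical teacher logits for one three-class example.
teacher_logits = [4.0, 1.0, 0.5]
schedule = list(annealed_targets(teacher_logits, max_temperature=4))

for temperature, target in schedule:
    print(temperature, [round(p, 3) for p in target])
```

Under these assumptions, the student would be fitted to each target in turn, starting from the flattest distribution (highest temperature) and ending at the teacher's unsoftened output at temperature 1, which is one way to read the step-by-step transition the abstract describes.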





Improved Knowledge Distillation via Teacher Assistant: Bridging the Gap Between Student and Teacher

Despite the fact that deep neural networks are powerful models and achie...

Learning to Teach with Student Feedback

Knowledge distillation (KD) has gained much attention due to its effecti...

Pro-KD: Progressive Distillation by Following the Footsteps of the Teacher

With ever growing scale of neural models, knowledge distillation (KD) at...

Knowledge Distillation in Wide Neural Networks: Risk Bound, Data Efficiency and Imperfect Teacher

Knowledge distillation is a strategy of training a student network with ...

Follow Your Path: a Progressive Method for Knowledge Distillation

Deep neural networks often have a huge number of parameters, which posts...

Dynamic Rectification Knowledge Distillation

Knowledge Distillation is a technique which aims to utilize dark knowled...

Stochastic Precision Ensemble: Self-Knowledge Distillation for Quantized Deep Neural Networks

The quantization of deep neural networks (QDNNs) has been actively studi...