Knowledge Distillation Beyond Model Compression

07/03/2020
by Fahad Sarfraz, et al.

Knowledge distillation (KD) is commonly regarded as an effective model compression technique in which a compact model (the student) is trained under the supervision of a larger pretrained model or an ensemble of models (the teacher). Various techniques have been proposed since the original formulation, mimicking different aspects of the teacher such as the representation space, the decision boundary, or the intra-data relationships. Some methods replace one-way knowledge transfer from a static teacher with collaborative learning among a cohort of students. Despite these advances, a clear understanding of where knowledge resides in a deep neural network, and of the optimal way to capture knowledge from the teacher and transfer it to the student, remains an open question. In this study, we provide an extensive evaluation of nine KD methods that cover a broad spectrum of approaches to capturing and transferring knowledge. We demonstrate the versatility of the KD framework on different datasets and network architectures under varying capacity gaps between the teacher and the student. The study provides intuition for the effects of mimicking different aspects of the teacher and derives insights from the performance of the different distillation approaches to guide the design of more effective KD methods. Furthermore, our study shows the effectiveness of the KD framework for learning efficiently under varying severity levels of label noise and class imbalance, consistently providing generalization gains over standard training. We emphasize that the efficacy of KD goes well beyond model compression: it should be considered a general-purpose training paradigm that offers more robustness to common challenges in real-world datasets than the standard training procedure.
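For context, the baseline that most of the surveyed methods build on is Hinton-style soft-target distillation, in which the student matches the teacher's temperature-softened output distribution in addition to the ground-truth labels. The sketch below is illustrative only and is not taken from the paper; the function name, the temperature T, and the weight alpha are assumed hyperparameters.

# Minimal sketch of soft-target knowledge distillation (not the paper's code).
# T (temperature) and alpha (distillation weight) are illustrative hyperparameters.
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    # Soften teacher and student outputs with temperature T and match them via KL divergence.
    soft_targets = F.softmax(teacher_logits / T, dim=1)
    log_student = F.log_softmax(student_logits / T, dim=1)
    distill = F.kl_div(log_student, soft_targets, reduction="batchmean") * (T * T)
    # Standard cross-entropy against the hard labels keeps the student grounded in the ground truth.
    ce = F.cross_entropy(student_logits, labels)
    # Weighted combination: alpha controls how strongly the student mimics the teacher.
    return alpha * distill + (1.0 - alpha) * ce

In a typical training loop the teacher runs in evaluation mode with gradients disabled, and only the student's parameters are updated with this combined loss; the nine methods studied in the paper replace or augment the distillation term in different ways.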

