In Defense of Feature Mimicking for Knowledge Distillation

11/03/2020
by Guo-Hua Wang, et al.

Knowledge distillation (KD) is a popular method to train efficient networks ("student") with the help of high-capacity networks ("teacher"). Traditional methods use the teacher's soft logits as extra supervision to train the student network. In this paper, we argue that it is more advantageous to make the student mimic the teacher's features in the penultimate layer. Not only can the student directly learn more effective information from the teacher's features, but feature mimicking can also be applied to teachers trained without a softmax layer. Experiments show that it achieves higher accuracy than traditional KD. To further facilitate feature mimicking, we decompose a feature vector into its magnitude and direction. We argue that the teacher should give the student more freedom in the feature's magnitude and let the student focus on mimicking the feature's direction. To meet this requirement, we propose a loss term based on locality-sensitive hashing (LSH). With the help of this new loss, our method mimics feature directions more accurately, relaxes constraints on feature magnitudes, and achieves state-of-the-art distillation accuracy.
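To make the direction-versus-magnitude idea concrete, here is a minimal PyTorch sketch of a random-hyperplane LSH loss for penultimate-layer feature mimicking. The class name `LSHFeatureLoss`, the number of hash bits, and the binary cross-entropy formulation are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LSHFeatureLoss(nn.Module):
    """Sketch of an LSH-style feature-mimicking loss (assumed formulation).

    A fixed random projection W maps a penultimate-layer feature f to K
    hash bits sign(W f). The teacher's bits act as binary targets and the
    student is trained with BCE on its own projections, so only the side
    of each random hyperplane (i.e., the feature direction) is supervised,
    leaving the feature magnitude largely unconstrained.
    """

    def __init__(self, feat_dim: int, num_hashes: int = 2048):
        super().__init__()
        # Fixed, non-trainable random hyperplanes.
        self.register_buffer("W", torch.randn(num_hashes, feat_dim))

    def forward(self, f_student: torch.Tensor, f_teacher: torch.Tensor) -> torch.Tensor:
        # Teacher hash codes in {0, 1}, treated as fixed targets.
        with torch.no_grad():
            target = (f_teacher @ self.W.t() > 0).float()
        # Student logits for the same hyperplanes.
        logits = f_student @ self.W.t()
        return F.binary_cross_entropy_with_logits(logits, target)

# Hypothetical usage: f_s and f_t are penultimate-layer features of equal
# dimension (an adapter layer may be needed if student/teacher dims differ).
# total_loss = task_loss + lambda_lsh * lsh_loss(f_s, f_t.detach())
```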
