Improving Knowledge Distillation via Regularizing Feature Norm and Direction

05/26/2023
by   Yuzhu Wang, et al.

Knowledge distillation (KD) exploits a large, well-trained model (i.e., the teacher) to train a small student model on the same dataset for the same task. Treating teacher features as knowledge, prevailing KD methods train the student by aligning its features with the teacher's, e.g., by minimizing the KL-divergence between their logits or the L2 distance between their intermediate features. While it is natural to believe that better alignment of student features to the teacher better distills teacher knowledge, simply forcing this alignment does not directly contribute to the student's performance, e.g., classification accuracy. In this work, we propose to align student features with the class-means of teacher features, where the class-means naturally serve as a strong classifier. To this end, we explore baseline techniques such as adopting a cosine-distance-based loss to encourage similarity between student features and their corresponding teacher class-means. Moreover, we train the student to produce large-norm features, inspired by other lines of work (e.g., model pruning and domain adaptation) that find large-norm features to be more significant. Finally, we propose a rather simple loss term (dubbed ND loss) that simultaneously (1) encourages the student to produce large-norm features, and (2) aligns the direction of student features with the teacher class-means. Experiments on standard benchmarks demonstrate that our explored techniques help existing KD methods achieve better performance, i.e., higher classification accuracy on the ImageNet and CIFAR-100 datasets, and higher detection precision on the COCO dataset. Importantly, our proposed ND loss helps the most, leading to state-of-the-art performance on these benchmarks. The source code is available at <https://github.com/WangYZ1608/Knowledge-Distillation-via-ND>.
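The abstract describes two ingredients: a direction term that pulls each student feature toward the teacher class-mean of its ground-truth class, and a norm term that encourages large-norm student features. Below is a minimal PyTorch sketch of such a norm-and-direction regularizer, written only from the abstract's description; the function name `nd_style_loss`, the `norm_weight` hyperparameter, and the specific form of the norm term are illustrative assumptions, not the paper's exact formulation (see the linked repository for the authors' implementation).

```python
import torch
import torch.nn.functional as F

def nd_style_loss(student_feats, labels, teacher_class_means, norm_weight=1.0):
    """Illustrative sketch of a norm-and-direction regularizer.

    student_feats:       (B, D) penultimate-layer features from the student.
    labels:              (B,)   ground-truth class indices.
    teacher_class_means: (C, D) per-class means of teacher features,
                         precomputed over the training set.
    """
    # Direction term: cosine distance between each student feature and
    # the teacher class-mean of its ground-truth class.
    target_means = teacher_class_means[labels]                    # (B, D)
    cos_sim = F.cosine_similarity(student_feats, target_means, dim=1)
    direction_loss = (1.0 - cos_sim).mean()

    # Norm term: encourage large-norm student features (here a simple
    # negative mean L2 norm; the paper's exact term may differ).
    norm_loss = -student_feats.norm(p=2, dim=1).mean()

    return direction_loss + norm_weight * norm_loss
```

In practice, the teacher class-means would be estimated once by averaging the frozen teacher's features per class over the training set, and this regularizer would be added to the usual task and distillation losses.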


