Knowledge Distillation with the Reused Teacher Classifier

03/26/2022
by Defang Chen, et al.

Knowledge distillation aims to compress a powerful yet cumbersome teacher model into a lightweight student model with little sacrifice in performance. For this purpose, various approaches have been proposed over the past few years, generally with elaborately designed knowledge representations, which in turn increase the difficulty of model development and interpretation. In contrast, we empirically show that a simple knowledge distillation technique is enough to significantly narrow the teacher-student performance gap. We directly reuse the discriminative classifier from the pre-trained teacher model for student inference and train a student encoder through feature alignment with a single ℓ_2 loss. In this way, the student model can achieve exactly the same performance as the teacher model, provided that their extracted features are perfectly aligned. An additional projector is developed to help the student encoder match the teacher classifier, which makes our technique applicable to various teacher and student architectures. Extensive experiments demonstrate that our technique achieves state-of-the-art results at the modest cost of a slightly lower compression ratio due to the added projector.
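
The recipe above is simple enough to sketch directly. Below is a minimal, hypothetical PyTorch illustration (not the authors' released code): a frozen teacher provides target features, a small projector maps student features into the teacher's feature space, training minimizes a single ℓ_2 (MSE) alignment loss, and inference reuses the frozen teacher classifier on the projected student features. The module names Projector, student_encoder, teacher_encoder, and teacher_classifier are illustrative assumptions.

```python
import torch
import torch.nn as nn


class Projector(nn.Module):
    """Maps pooled student features to the teacher's feature dimension (illustrative)."""
    def __init__(self, s_dim: int, t_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(s_dim, t_dim),
            nn.BatchNorm1d(t_dim),
            nn.ReLU(inplace=True),
            nn.Linear(t_dim, t_dim),
        )

    def forward(self, f_s: torch.Tensor) -> torch.Tensor:
        return self.proj(f_s)


def distillation_step(x, student_encoder, projector, teacher_encoder, optimizer):
    """One training step: align projected student features with frozen teacher features."""
    with torch.no_grad():                    # the teacher is kept frozen
        f_t = teacher_encoder(x)             # teacher features, shape (B, t_dim)
    f_s = projector(student_encoder(x))      # projected student features, shape (B, t_dim)
    loss = nn.functional.mse_loss(f_s, f_t)  # single L2-style feature-alignment loss
    optimizer.zero_grad()
    loss.backward()                          # gradients flow only to student encoder + projector
    optimizer.step()
    return loss.item()


@torch.no_grad()
def student_predict(x, student_encoder, projector, teacher_classifier):
    """Inference: feed projected student features to the reused (frozen) teacher classifier."""
    f_s = projector(student_encoder(x))
    return teacher_classifier(f_s)
```

In this sketch the optimizer would be built over the parameters of student_encoder and projector only; if the projected student features matched the teacher features exactly, student_predict would reproduce the teacher's predictions, which is the intuition behind reusing the teacher classifier.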

Related research

02/28/2021 · Distilling Knowledge via Intermediate Classifier Heads
The crux of knowledge distillation – as a transfer-learning approach – i...

02/21/2023 · MaskedKD: Efficient Distillation of Vision Transformers with Masked Images
Knowledge distillation is a popular and effective regularization techniq...

04/30/2021 · Distilling EEG Representations via Capsules for Affective Computing
Affective computing with Electroencephalogram (EEG) is a challenging tas...

11/03/2020 · In Defense of Feature Mimicking for Knowledge Distillation
Knowledge distillation (KD) is a popular method to train efficient netwo...

03/28/2023 · DisWOT: Student Architecture Search for Distillation WithOut Training
Knowledge distillation (KD) is an effective training strategy to improve...

05/16/2021 · Undistillable: Making A Nasty Teacher That CANNOT teach students
Knowledge Distillation (KD) is a widely used technique to transfer knowl...

02/27/2023 · Leveraging Angular Distributions for Improved Knowledge Distillation
Knowledge distillation as a broad class of methods has led to the develo...
