NormKD: Normalized Logits for Knowledge Distillation

08/01/2023
by   Zhihao Chi, et al.

Logit-based knowledge distillation has received less attention in recent years, since feature-based methods perform better in most cases. Nevertheless, we find that it still has untapped potential when we re-investigate the temperature, a crucial hyper-parameter used to soften the logit outputs. In most previous work, the temperature is fixed to a single value for the entire distillation procedure. However, because the logits of different samples are distributed quite differently, a single temperature cannot soften all of them to an equal degree, so previous methods may transfer the knowledge of each sample inadequately. In this paper, we restudy the temperature hyper-parameter and show that a single value cannot distill the knowledge of every sample sufficiently. To address this issue, we propose Normalized Knowledge Distillation (NormKD), which customizes the temperature for each sample according to the characteristics of that sample's logit distribution. Compared to vanilla KD, NormKD adds almost no extra computation or storage cost but performs significantly better on CIFAR-100 and ImageNet for image classification. Furthermore, NormKD can easily be applied to other logit-based methods and achieves performance that is close to, or even better than, feature-based methods.
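The abstract does not spell out the exact normalization rule, but the core idea of deriving one temperature per sample from its own logit distribution can be sketched in a few lines. The snippet below is a minimal, hypothetical PyTorch illustration: it assumes the per-sample temperature is the standard deviation of the teacher logits scaled by a hyper-parameter `alpha`, and the function name `normkd_loss` is introduced here for illustration only, not taken from the paper's released code.

```python
import torch
import torch.nn.functional as F

def normkd_loss(student_logits, teacher_logits, alpha=1.0, eps=1e-8):
    """KD loss with a per-sample temperature (illustrative sketch only).

    Assumption: the temperature of each sample is alpha times the standard
    deviation of that sample's teacher logits; the abstract does not give
    the exact formula, so this is one plausible instantiation.
    """
    # One temperature per sample, derived from the spread of its teacher logits.
    tau = alpha * teacher_logits.std(dim=1, keepdim=True) + eps  # shape (B, 1)

    # Soften teacher and student outputs with the sample-specific temperature.
    p_teacher = F.softmax(teacher_logits / tau, dim=1)
    log_p_student = F.log_softmax(student_logits / tau, dim=1)

    # Vanilla-KD-style objective: per-sample KL divergence, rescaled by tau^2.
    kl = F.kl_div(log_p_student, p_teacher, reduction="none").sum(dim=1)
    return (kl * tau.squeeze(1) ** 2).mean()

# Example usage with random logits: batch of 4 samples, 100 classes.
student = torch.randn(4, 100, requires_grad=True)
teacher = torch.randn(4, 100)
loss = normkd_loss(student, teacher)
loss.backward()
```

Because the temperature is read off each sample's own logit statistics, no extra network passes or stored features are needed, which is consistent with the abstract's claim of negligible additional computation or storage cost.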
