Comparing Kullback-Leibler Divergence and Mean Squared Error Loss in Knowledge Distillation

05/19/2021
by Taehyeon Kim, et al.

Knowledge distillation (KD), which transfers knowledge from a cumbersome teacher model to a lightweight student model, has been investigated as a way to design efficient neural architectures. The objective function of KD is typically the Kullback-Leibler (KL) divergence loss between the softened probability distributions of the teacher and student models, where the softening is controlled by a temperature-scaling hyperparameter tau. Despite its widespread use, few studies have discussed how such softening influences generalization. Here, we theoretically show that the KL divergence loss focuses on logit matching as tau increases and on label matching as tau goes to 0, and we empirically show that logit matching is generally positively correlated with performance improvement. From this observation, we consider an intuitive KD loss function, the mean squared error (MSE) between the logit vectors, so that the student model can directly learn the teacher model's logits. The MSE loss outperforms the KL divergence loss, which we explain by the difference in the penultimate-layer representations induced by the two losses. Furthermore, we show that sequential distillation can improve performance and that KD, particularly the KL divergence loss with a small tau, mitigates label noise. The code to reproduce the experiments is publicly available at https://github.com/jhoon-oh/kd_data/.
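To make the two objectives concrete, here is a minimal PyTorch sketch contrasting the softened KL divergence loss with temperature tau (written with the conventional tau^2 scaling factor) against direct logit matching via MSE. This is not the authors' released code (see the repository linked above); the function names, the example temperature, and the toy logits are illustrative assumptions.

```python
# Sketch of the two KD objectives discussed in the abstract:
# (1) KL divergence between temperature-softened distributions,
# (2) MSE between the raw logit vectors (direct logit matching).
import torch
import torch.nn.functional as F

def kd_kl_loss(student_logits, teacher_logits, tau=4.0):
    """KL divergence between softened distributions, scaled by tau^2
    (the common convention) so gradient magnitudes stay comparable
    across temperatures."""
    log_p_student = F.log_softmax(student_logits / tau, dim=1)
    p_teacher = F.softmax(teacher_logits / tau, dim=1)
    return (tau ** 2) * F.kl_div(log_p_student, p_teacher, reduction="batchmean")

def kd_mse_loss(student_logits, teacher_logits):
    """Direct logit matching: mean squared error between logit vectors."""
    return F.mse_loss(student_logits, teacher_logits)

# Toy usage with random logits (batch of 8, 10 classes).
student_logits = torch.randn(8, 10, requires_grad=True)
teacher_logits = torch.randn(8, 10)
print(kd_kl_loss(student_logits, teacher_logits, tau=4.0).item())
print(kd_mse_loss(student_logits, teacher_logits).item())
```

As tau grows, the softened distributions flatten and the KL objective increasingly behaves like logit matching, which is the regime the paper reports as most beneficial; the MSE loss targets the logits directly without a temperature.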

