Extending Label Smoothing Regularization with Self-Knowledge Distillation

09/11/2020
by Ji-Yue Wang, et al.

Inspired by the strong correlation between Label Smoothing Regularization (LSR) and Knowledge Distillation (KD), we propose LsrKD, an algorithm that boosts training by extending LSR to the KD regime and applying a softer temperature. We then improve LsrKD with a Teacher Correction (TC) method, which manually assigns a constant, larger probability to the correct class in the uniform-distribution teacher. To further improve LsrKD, we develop a self-distillation method named Memory-replay Knowledge Distillation (MrKD), which replaces the uniform-distribution teacher in LsrKD with a more knowledgeable one. MrKD penalizes the KD loss between the current model's output distribution and those of its earlier copies along the training trajectory. By preventing the model from drifting too far from its historical output distributions, MrKD stabilizes learning and finds a more robust minimum. Our experiments show that LsrKD improves on LSR consistently at no extra cost, especially on several deep neural networks where LSR is ineffective. MrKD also significantly improves single-model training. The results further confirm that TC helps LsrKD and MrKD boost training, especially on the networks where they otherwise fail. Overall, LsrKD, MrKD, and their TC variants match or outperform LSR, suggesting the broad applicability of these KD-based methods.
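As a rough sketch of how these losses fit together (not the authors' reference implementation; the function names, default temperature, and loss weighting below are illustrative assumptions), the following PyTorch snippet combines a uniform-distribution teacher in the spirit of LsrKD, the Teacher Correction that raises the correct class's share, and a memory-replay teacher built from an earlier snapshot of the model:

```python
# Illustrative sketch of LsrKD / Teacher Correction / MrKD-style losses.
# Names and default values are assumptions, not the paper's released code.
import torch
import torch.nn.functional as F

def uniform_teacher(logits, targets, correct_prob=None):
    """Build a soft teacher distribution.

    Without Teacher Correction this is the uniform distribution used by LSR.
    With `correct_prob` set (e.g. 0.5), the correct class receives a fixed
    larger share and the remainder is spread over the other classes.
    """
    num_classes = logits.size(1)
    if correct_prob is None:
        return torch.full_like(logits, 1.0 / num_classes)
    teacher = torch.full_like(logits, (1.0 - correct_prob) / (num_classes - 1))
    teacher.scatter_(1, targets.unsqueeze(1), correct_prob)
    return teacher

def kd_loss(student_logits, teacher_probs, temperature=4.0):
    """KL divergence between the temperature-softened student and a teacher distribution."""
    log_p = F.log_softmax(student_logits / temperature, dim=1)
    return F.kl_div(log_p, teacher_probs, reduction="batchmean") * temperature ** 2

def lsrkd_loss(student_logits, targets, alpha=0.1, temperature=4.0, correct_prob=None):
    """Cross-entropy plus KD toward a (possibly Teacher-Corrected) uniform teacher."""
    ce = F.cross_entropy(student_logits, targets)
    teacher = uniform_teacher(student_logits, targets, correct_prob)
    return (1.0 - alpha) * ce + alpha * kd_loss(student_logits, teacher, temperature)

def mrkd_loss(student_logits, snapshot_logits, targets, alpha=0.1, temperature=4.0):
    """Cross-entropy plus KD toward the output of an earlier training snapshot."""
    ce = F.cross_entropy(student_logits, targets)
    with torch.no_grad():
        teacher = F.softmax(snapshot_logits / temperature, dim=1)
    return (1.0 - alpha) * ce + alpha * kd_loss(student_logits, teacher, temperature)
```

In this sketch, `snapshot_logits` would come from a frozen copy of the network saved earlier on the training trajectory; the paper's actual weighting, temperature, and snapshot schedule may differ.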


Related research

09/25/2019  Revisit Knowledge Distillation: a Teacher-free Framework
Knowledge Distillation (KD) aims to distill the knowledge of a cumbersom...

03/12/2021  Self-Feature Regularization: Self-Feature Distillation Without Teacher Models
Knowledge distillation is the process of transferring the knowledge from...

06/15/2023  Self-Knowledge Distillation for Surgical Phase Recognition
Purpose: Advances in surgical phase recognition are generally led by tra...

03/31/2020  Regularizing Class-wise Predictions via Self-knowledge Distillation
Deep neural networks with millions of parameters may suffer from poor ge...

06/09/2020  Self-Distillation as Instance-Specific Label Smoothing
It has been recently demonstrated that multi-generational self-distillat...

05/03/2021  Initialization and Regularization of Factorized Neural Layers
Factorized layers, operations parameterized by products of two or more ma...

12/01/2021  Extrapolating from a Single Image to a Thousand Classes using Distillation
What can neural networks learn about the visual world from a single imag...
