Knowledge Distillation in Generations: More Tolerant Teachers Educate Better Students

05/15/2018
by Chenglin Yang, et al.

This paper studies teacher-student optimization on neural networks, i.e., adopting the supervision of a trained (teacher) network to optimize another (student) network. Conventional approaches enforce the student to learn from a strict teacher that fits a hard (one-hot) distribution and achieves high recognition accuracy, but we argue that a more tolerant teacher often educates better students. We start by adding an extra loss term to a patriarch network so that it preserves confidence scores on a primary class (the ground truth) and several visually similar secondary classes. The patriarch is also known as the first teacher. In each of the following generations, a student learns from the teacher and becomes the new teacher in the next generation. Although the patriarch is less powerful due to this ambiguity, the students enjoy persistent ability growth as we gradually fine-tune them to fit one-hot distributions. We investigate standard image classification tasks (CIFAR100 and ILSVRC2012). Experiments with different network architectures verify the superiority of our approach, whether using a single model or an ensemble of models.
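The extra loss term on the patriarch can be read as rewarding the network for keeping some probability mass on its top-scoring secondary classes instead of collapsing onto the one-hot label. Below is a minimal PyTorch-style sketch of such a "tolerant" objective; the function name tolerant_teacher_loss, the top-k secondary term, and the weights k and lam are illustrative assumptions, not the authors' exact formulation.

    import torch
    import torch.nn.functional as F

    def tolerant_teacher_loss(logits, targets, k=5, lam=0.5):
        # Illustrative objective for the patriarch: ordinary cross-entropy on
        # the primary (ground-truth) class, plus a term rewarding probability
        # mass on the k-1 highest-scoring secondary classes. The secondary
        # term and the weight lam are assumptions, not the paper's exact loss.
        probs = F.softmax(logits, dim=1)

        # Primary term: fit the ground-truth label as usual.
        ce = F.cross_entropy(logits, targets)

        # Secondary term: probability mass on the top (k-1) classes other
        # than the ground truth (intended to capture visually similar classes).
        masked = probs.scatter(1, targets.unsqueeze(1), 0.0)
        topk_mass = masked.topk(k - 1, dim=1).values.sum(dim=1)
        secondary = -torch.log(topk_mass + 1e-12).mean()

        return ce + lam * secondary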

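For the generational part, one plausible reading (an assumption here, not necessarily the paper's exact procedure) is a standard Hinton-style distillation step per generation, with the one-hot term weighted more heavily over time so that students are "gradually fine-tuned to fit one-hot distributions". A runnable sketch, with T and alpha as assumed hyper-parameters:

    import torch
    import torch.nn.functional as F

    def distill_one_generation(teacher, student, loader, optimizer, T=2.0, alpha=0.7):
        # One generation of teacher-student optimization: the student matches
        # the teacher's softened outputs (KL term) while also fitting the
        # one-hot labels. Lowering alpha in later generations is one way to
        # shift students toward the one-hot distribution over generations.
        teacher.eval()
        student.train()
        for images, targets in loader:
            with torch.no_grad():
                t_logits = teacher(images)
            s_logits = student(images)
            kd = F.kl_div(
                F.log_softmax(s_logits / T, dim=1),
                F.softmax(t_logits / T, dim=1),
                reduction="batchmean",
            ) * (T * T)
            ce = F.cross_entropy(s_logits, targets)
            loss = alpha * kd + (1 - alpha) * ce
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        return student  # this student serves as the teacher of the next generation

In this reading, the student returned by one call is frozen and passed in as the teacher of the next call, with the patriarch (trained under the tolerant objective) used in the first generation.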