Search for Better Students to Learn Distilled Knowledge

01/30/2020
by Jindong Gu, et al.

Knowledge Distillation, as a model compression technique, has received great attention. The knowledge of a well-performing teacher is distilled to a student with a small architecture. The student's architecture is often chosen to be similar to the teacher's, with fewer layers, fewer channels, or both. However, even with the same number of FLOPs or parameters, students with different architectures can achieve different generalization abilities. Configuring a student architecture therefore requires intensive network architecture engineering. In this work, instead of designing a good student architecture manually, we propose to search for the optimal student automatically. Based on L1-norm optimization, a subgraph of the teacher's network topology graph is selected as the student, with the goal of minimizing the KL-divergence between the student's and the teacher's outputs. We verify the proposal on the CIFAR10 and CIFAR100 datasets. The empirical experiments show that the learned student architectures achieve better performance than manually specified ones. We also visualize and analyze the architecture of the found student.

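The abstract describes selecting a subgraph of the teacher via L1-norm optimization while matching the teacher's outputs with a KL-divergence loss. The following is a minimal PyTorch sketch of that idea, assuming per-channel scaling gates on the teacher's convolutions; the gate parameterization, temperature T, and sparsity weight lam are illustrative assumptions, not the authors' exact formulation.

# Minimal sketch (illustrative, not the authors' exact method): learnable
# per-channel gates with an L1 penalty select a subgraph of the teacher,
# while a KL-divergence loss matches the gated "student" to the teacher.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedConv(nn.Module):
    """Wraps a teacher conv layer; its output channels are scaled by learnable gates."""
    def __init__(self, conv: nn.Conv2d):
        super().__init__()
        self.conv = conv
        self.gate = nn.Parameter(torch.ones(conv.out_channels))  # assumed initialization

    def forward(self, x):
        return self.conv(x) * self.gate.view(1, -1, 1, 1)

def distillation_loss(student_logits, teacher_logits, gates, T=4.0, lam=1e-4):
    """KL divergence between softened student and teacher outputs,
    plus an L1 sparsity term on the channel gates."""
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    l1 = sum(g.abs().sum() for g in gates)
    return kl + lam * l1

Channels whose gates are driven toward zero by the L1 term can then be pruned, and the remaining subgraph of the teacher serves as the searched student architecture.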