Accelerating Large Scale Knowledge Distillation via Dynamic Importance Sampling

12/03/2018
by Minghan Li, et al.

Knowledge distillation is an effective technique for transferring knowledge from a large teacher model to a shallow student. However, just as in large-scale classification, large-scale knowledge distillation imposes heavy computational costs when training deep neural networks, because the softmax activations at the last layer involve computing probabilities over numerous classes. In this work, we apply the idea of importance sampling, often used in neural machine translation, to large-scale knowledge distillation. We present a method called dynamic importance sampling, in which ranked classes are sampled from a dynamic distribution derived from the interaction between the teacher and the student during full distillation. We highlight the utility of this proposal prior, which helps the student capture the main information in the loss function. Our approach reduces the computational cost at training time while maintaining competitive performance on the CIFAR-100 and Market-1501 person re-identification datasets.
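
To make the idea concrete, below is a minimal PyTorch sketch of importance-sampled distillation. It is not the authors' implementation: the proposal distribution is illustrated here as a simple teacher-student disagreement score rather than the ranked, dynamically updated proposal described in the abstract, and the names (`sampled_kd_loss`, `num_samples`, `temperature`) are hypothetical. The sketch only shows how restricting the distillation loss to a sampled subset of classes avoids computing the full softmax.

```python
# Hypothetical sketch of importance-sampled knowledge distillation.
# Not the paper's exact algorithm; the proposal below is an illustrative
# teacher-student disagreement score, and all names are assumptions.
import torch
import torch.nn.functional as F


def sampled_kd_loss(student_logits, teacher_logits, proposal,
                    num_samples=64, temperature=4.0):
    """Approximate the KL distillation loss over a sampled subset of classes.

    student_logits, teacher_logits: (batch, num_classes) raw scores.
    proposal: (num_classes,) sampling distribution over classes.
    """
    # Draw a shared set of classes for the whole batch, without replacement.
    idx = torch.multinomial(proposal, num_samples, replacement=False)

    # Restrict both models to the sampled classes and subtract log q(c),
    # the usual sampled-softmax correction for the sampling bias.
    log_q = torch.log(proposal[idx] + 1e-12)
    s = student_logits[:, idx] / temperature - log_q
    t = teacher_logits[:, idx] / temperature - log_q

    # KL(teacher || student) computed over the sampled classes only.
    return F.kl_div(F.log_softmax(s, dim=1), F.softmax(t, dim=1),
                    reduction="batchmean") * temperature ** 2


if __name__ == "__main__":
    batch, num_classes = 8, 10_000
    student_logits = torch.randn(batch, num_classes, requires_grad=True)
    teacher_logits = torch.randn(batch, num_classes)

    # Illustrative "dynamic" proposal: favour classes where the teacher and
    # student currently disagree the most (refreshed periodically in practice).
    with torch.no_grad():
        disagreement = (F.softmax(teacher_logits, dim=1)
                        - F.softmax(student_logits, dim=1)).abs().mean(0)
        proposal = disagreement / disagreement.sum()

    loss = sampled_kd_loss(student_logits, teacher_logits, proposal)
    loss.backward()
    print(loss.item())
```

The log q(c) correction mirrors standard sampled-softmax practice; how often the proposal is refreshed trades off the accuracy of the approximation against the overhead of recomputing it.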


