A closer look at the training dynamics of knowledge distillation

03/20/2023
by Roy Miles et al.

In this paper we revisit the efficacy of knowledge distillation as a function matching and metric learning problem. In doing so we verify three important design decisions as key ingredients: normalisation, a soft maximum function, and the projection layer. We theoretically show that the projector implicitly encodes information on past examples, enabling relational gradients for the student. We then show that the normalisation of representations is tightly coupled with the training dynamics of this projector, which can have a large impact on the student's performance. Finally, we show that a simple soft maximum function can be used to address any significant capacity-gap problems. Experimental results on various benchmark datasets demonstrate that using these insights leads to superior or comparable performance to state-of-the-art knowledge distillation techniques, while being much more computationally efficient. In particular, we obtain these results across image classification (CIFAR100 and ImageNet), object detection (COCO2017), and on more difficult distillation objectives, such as training data-efficient transformers, whereby we attain a 77.2% top-1 accuracy.
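
The abstract describes a lightweight distillation recipe built from three ingredients: a projection layer on the student features, a normalisation of the projected representations, and a soft maximum over the matching errors. The sketch below is a minimal, illustrative rendering of that recipe in PyTorch; the specific layer choices (a linear projector, batch normalisation) and the log-sum-exp loss form are assumptions made for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class DistillHead(nn.Module):
    """Illustrative distillation head: linear projector followed by batch norm.

    Sketch only: the projector/normalisation pairing mirrors the recipe
    outlined in the abstract, but the exact layer choices are assumptions.
    """

    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        self.proj = nn.Linear(student_dim, teacher_dim)        # projection layer
        self.norm = nn.BatchNorm1d(teacher_dim, affine=False)  # normalisation

    def forward(self, student_feats: torch.Tensor) -> torch.Tensor:
        # Map student features into the teacher's feature space, then normalise.
        return self.norm(self.proj(student_feats))


def soft_max_distill_loss(z_student: torch.Tensor,
                          z_teacher: torch.Tensor,
                          tau: float = 1.0) -> torch.Tensor:
    """Soft maximum (log-sum-exp) over per-dimension matching errors.

    A hard maximum would only penalise the single worst-matched dimension;
    the log-sum-exp relaxation spreads gradient over all dimensions while
    still emphasising the largest errors. `tau` is a hypothetical temperature
    controlling how close this is to a hard maximum.
    """
    err = (z_student - z_teacher).abs()            # per-dimension error
    return tau * torch.logsumexp(err / tau, dim=-1).mean()


# Usage sketch: distil 512-d student features towards 2048-d teacher features.
head = DistillHead(student_dim=512, teacher_dim=2048)
s_feats = torch.randn(32, 512)                     # student backbone output
t_feats = torch.randn(32, 2048)                    # teacher output (frozen)
loss = soft_max_distill_loss(head(s_feats), t_feats.detach())
loss.backward()
```

One reason for pairing the projector with batch normalisation in this sketch is that batch statistics couple each example's gradient to the rest of the batch, loosely echoing the abstract's point that the projector and normalisation jointly carry information beyond a single sample; whether this matches the paper's analysis should be checked against the full text.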


