On student-teacher deviations in distillation: does it pay to disobey?

01/30/2023
by Vaishnavh Nagarajan, et al.

Knowledge distillation has been widely used to improve the performance of a "student" network by training it to mimic the soft probabilities of a "teacher" network. Yet, for self-distillation to work, the student must somehow deviate from the teacher (Stanton et al., 2021). But what is the nature of these deviations, and how do they relate to gains in generalization? We investigate these questions through a series of experiments across image and language classification datasets. First, we observe that the student consistently deviates from the teacher in a characteristic way: on points where the teacher has low confidence, the student achieves even lower confidence than the teacher. Second, we find that deviations in the initial dynamics of training are not crucial: simply switching to the distillation loss in the middle of training can recover much of its gains. We then provide two parallel theoretical perspectives on the role of student-teacher deviations in our experiments, one casting distillation as a regularizer in eigenspace, and another as a gradient denoiser. Our analysis bridges several gaps between existing theory and practice by (a) focusing on gradient-descent training, (b) avoiding label noise assumptions, and (c) unifying several disjoint empirical and theoretical findings.
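
For context, the standard distillation objective (Hinton et al., 2015) mixes a hard-label cross-entropy term with a KL term that matches the student's temperature-softened probabilities to the teacher's. Below is a minimal PyTorch-style sketch of that objective, together with the "switch to distillation mid-training" setting mentioned in the abstract; the temperature T, mixing weight alpha, and switch_epoch are illustrative placeholders, not values taken from the paper.

```python
# Minimal sketch of a standard knowledge-distillation loss (Hinton et al., 2015).
# T, alpha, and switch_epoch are illustrative placeholders, not values from this paper.
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Mix hard-label cross-entropy with a KL term that matches the
    teacher's temperature-softened probabilities."""
    ce = F.cross_entropy(student_logits, labels)
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),   # student log-probs at temperature T
        F.softmax(teacher_logits / T, dim=-1),       # teacher probs at temperature T
        reduction="batchmean",
    ) * (T * T)  # T^2 keeps gradient magnitudes comparable across temperatures
    return alpha * kl + (1.0 - alpha) * ce


def training_loss(student_logits, teacher_logits, labels, epoch, switch_epoch=50):
    """Illustrates the mid-training switch studied in the paper: train on the
    plain label loss first, then switch to the distillation loss."""
    if epoch < switch_epoch:
        return F.cross_entropy(student_logits, labels)
    return distillation_loss(student_logits, teacher_logits, labels)
```

In this setup, self-distillation corresponds to the teacher and student sharing the same architecture, with the teacher's logits produced by an earlier, fully trained copy of the model.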

Related research

Revisiting Self-Distillation (06/17/2022)
Knowledge distillation is the procedure of transferring "knowledge" from...

Knowledge Distillation from Internal Representations (10/08/2019)
Knowledge distillation is typically conducted by training a small model ...

Revisiting Label Smoothing and Knowledge Distillation Compatibility: What was Missing? (06/29/2022)
This work investigates the compatibility between label smoothing (LS) an...

Random Teachers are Good Teachers (02/23/2023)
In this work, we investigate the implicit regularization induced by teac...

SoTeacher: A Student-oriented Teacher Network Training Framework for Knowledge Distillation (06/14/2022)
How to train an ideal teacher for knowledge distillation is still an ope...

Marginal Utility Diminishes: Exploring the Minimum Knowledge for BERT Knowledge Distillation (06/10/2021)
Recently, knowledge distillation (KD) has shown great success in BERT co...

DOT: A Distillation-Oriented Trainer (07/17/2023)
Knowledge distillation transfers knowledge from a large model to a small...
