Supervision Complexity and its Role in Knowledge Distillation

01/28/2023
by Hrayr Harutyunyan, et al.

Despite the popularity and efficacy of knowledge distillation, there is limited understanding of why it helps. To study the generalization behavior of a distilled student, we propose a new theoretical framework that leverages supervision complexity: a measure of alignment between the teacher-provided supervision and the student's neural tangent kernel. The framework highlights a delicate interplay among the teacher's accuracy, the student's margin with respect to the teacher's predictions, and the complexity of those predictions. In particular, it provides a rigorous justification for techniques commonly used in distillation, such as early stopping and temperature scaling. Our analysis further suggests online distillation, in which the student receives increasingly complex supervision from teacher checkpoints at different stages of training. We demonstrate the efficacy of online distillation and validate the theoretical findings on a range of image classification benchmarks and model architectures.
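
The central quantity in the abstract can be made concrete with a short sketch. The snippet below is a minimal illustration, not the paper's implementation: it assumes that supervision complexity is measured as a quadratic form of the (temperature-scaled) teacher targets against the student's empirical NTK Gram matrix, and all names (temperature_scaled_targets, supervision_complexity, ntk_gram, tau, ridge) are illustrative rather than taken from the paper.

```python
# Minimal sketch (not the authors' code) of temperature-scaled teacher targets
# and an NTK-alignment-style "supervision complexity" measure. The exact
# definition of the complexity term here is an assumption for illustration.
import numpy as np


def temperature_scaled_targets(teacher_logits: np.ndarray, tau: float) -> np.ndarray:
    """Soften teacher logits with temperature tau (tau > 1 flattens the distribution)."""
    z = teacher_logits / tau
    z = z - z.max(axis=1, keepdims=True)   # subtract row max for numerical stability
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)


def supervision_complexity(targets: np.ndarray, ntk_gram: np.ndarray, ridge: float = 1e-3) -> float:
    """Quadratic-form alignment of targets with the student's NTK Gram matrix K (n x n).

    Smaller values indicate supervision that is better aligned with the kernel's
    dominant eigendirections and hence easier for the student to fit.
    """
    n = ntk_gram.shape[0]
    K_reg = ntk_gram + ridge * np.eye(n)    # small ridge keeps the solve well-posed
    sol = np.linalg.solve(K_reg, targets)   # solve K_reg x = y for each class column
    return float(np.sqrt(np.sum(targets * sol)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, c = 64, 10
    teacher_logits = rng.normal(size=(n, c))
    # random PSD matrix standing in for the student's empirical NTK on this batch
    A = rng.normal(size=(n, n))
    K = A @ A.T / n

    for tau in (1.0, 2.0, 4.0):
        y = temperature_scaled_targets(teacher_logits, tau)
        print(f"tau={tau}: supervision complexity = {supervision_complexity(y, K):.3f}")
```

In an online-distillation setup, the same measure would be recomputed against targets produced by successive teacher checkpoints, so the supervision the student fits starts simple and grows more complex as the teacher trains.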


Related research:

- On the Efficacy of Knowledge Distillation (10/03/2019)
- Teacher's pet: understanding and mitigating biases in distillation (06/19/2021)
- Distillation ≈ Early Stopping? Harvesting Dark Knowledge Utilizing Anisotropic Information Retrieval For Overparameterized Neural Network (10/02/2019)
- Coaching a Teachable Student (06/16/2023)
- Snapshot Distillation: Teacher-Student Optimization in One Generation (12/01/2018)
- Similarity of Neural Architectures Based on Input Gradient Transferability (10/20/2022)
- Preparing Lessons: Improve Knowledge Distillation with Better Supervision (11/18/2019)
