On the Unreasonable Effectiveness of Knowledge Distillation: Analysis in the Kernel Regime

03/30/2020
by Arman Rahbar et al.

Knowledge distillation (KD), i.e., one classifier being trained on the outputs of another classifier, is an empirically very successful technique for knowledge transfer between classifiers. It has even been observed that classifiers learn much faster and more reliably when trained on the outputs of another classifier as soft labels, instead of on ground-truth data. However, there has been little to no theoretical analysis of this phenomenon. We provide the first theoretical analysis of KD in the setting of extremely wide two-layer non-linear networks, in the model and regime of (Arora et al., 2019; Du & Hu, 2019; Cao & Gu, 2019). We prove results on what the student network learns and on the rate of convergence for the student network. Intriguingly, we also confirm the lottery ticket hypothesis (Frankle & Carbin, 2019) in this model. To prove our results, we extend the repertoire of techniques from linear systems dynamics. We give corresponding experimental analysis that validates the theoretical results and yields additional insights.
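To make the soft-label setup concrete, here is a minimal, hypothetical sketch of distillation in plain NumPy. It is not the paper's code or its two-layer kernel-regime model: the teacher is a stand-in linear model, and the student is fit to the teacher's temperature-softened output distribution rather than to ground-truth labels.

```python
# Toy sketch of knowledge distillation with soft labels (illustrative only):
# a linear "student" is trained on temperature-softened teacher outputs.
import numpy as np

rng = np.random.default_rng(0)
n, d, k, T = 256, 10, 3, 4.0          # samples, input dim, classes, temperature

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

X = rng.normal(size=(n, d))
W_teacher = rng.normal(size=(d, k))   # hypothetical trained teacher
soft_labels = softmax(X @ W_teacher / T)

W_student = np.zeros((d, k))
lr = 0.5
for _ in range(500):
    probs = softmax(X @ W_student / T)
    # gradient of the cross-entropy between teacher soft labels and the
    # student's softened predictions (temperature constants folded into lr)
    grad = X.T @ (probs - soft_labels) / n
    W_student -= lr * grad

agreement = (softmax(X @ W_student).argmax(1) == soft_labels.argmax(1)).mean()
print(f"student/teacher top-1 agreement: {agreement:.2f}")
```

The only change relative to ordinary supervised training is the target: the teacher's full output distribution replaces one-hot labels, which is the mechanism whose fast and reliable convergence the paper analyzes in the wide-network kernel regime.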


Related research

05/27/2021 · Towards Understanding Knowledge Distillation
Knowledge distillation, i.e., one classifier being trained on the output...

03/28/2022 · Knowledge Distillation: Bad Models Can Be Good Role Models
Large neural networks trained in the overparameterized regime are able t...

02/25/2021 · Even your Teacher Needs Guidance: Ground-Truth Targets Dampen Regularization Imposed by Self-Distillation
Knowledge distillation is classically a procedure where a neural network...

04/12/2019 · Unifying Heterogeneous Classifiers with Distillation
In this paper, we study the problem of unifying knowledge from a set of ...

11/11/2015 · Unifying distillation and privileged information
Distillation (Hinton et al., 2015) and privileged information (Vapnik & ...

12/01/2020 · Solvable Model for Inheriting the Regularization through Knowledge Distillation
In recent years the empirical success of transfer learning with neural n...

10/02/2019 · Distillation ≈ Early Stopping? Harvesting Dark Knowledge Utilizing Anisotropic Information Retrieval For Overparameterized Neural Network
Distillation is a method to transfer knowledge from one model to another...
