Knowledge Distillation Performs Partial Variance Reduction

05/27/2023
by Mher Safaryan, et al.

Knowledge distillation is a popular approach for enhancing the performance of “student” models, with lower representational capacity, by taking advantage of more powerful “teacher” models. Despite its apparent simplicity and widespread use, the underlying mechanics behind knowledge distillation (KD) are still not fully understood. In this work, we shed new light on the inner workings of this method, by examining it from an optimization perspective. We show that, in the context of linear and deep linear models, KD can be interpreted as a novel type of stochastic variance reduction mechanism. We provide a detailed convergence analysis of the resulting dynamics, which hold under standard assumptions for both strongly-convex and non-convex losses, showing that KD acts as a form of partial variance reduction, which can reduce the stochastic gradient noise, but may not eliminate it completely, depending on the properties of the “teacher” model. Our analysis puts further emphasis on the need for careful parametrization of KD, in particular w.r.t. the weighting of the distillation loss, and is validated empirically on both linear models and deep neural networks.
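To build intuition for the claim above, here is a minimal numerical sketch, not taken from the paper: a linear regression student trained on a convex combination of the hard-label squared loss and a squared distillation loss toward a fixed linear teacher, weighted by lam. All names (X, y, w_teacher, lam, stochastic_grad) and the squared-loss setup are illustrative assumptions rather than the paper's actual formulation; the sketch only shows how mixing in the teacher's noise-free targets can shrink, without eliminating, the per-sample gradient noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup (not the paper's exact construction): a linear regression
# student trained with a convex combination of the hard-label squared loss and
# a squared distillation loss toward a fixed linear "teacher", with weight lam.
n, d = 1000, 20
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.5 * rng.normal(size=n)        # noisy hard labels
w_teacher = w_true + 0.1 * rng.normal(size=d)    # imperfect teacher
y_teacher = X @ w_teacher                        # teacher's noise-free predictions


def stochastic_grad(w, i, lam):
    """Per-sample gradient of
    (1 - lam) * 0.5 * (x_i @ w - y_i)**2 + lam * 0.5 * (x_i @ w - y_teacher_i)**2."""
    x = X[i]
    resid = (1 - lam) * (x @ w - y[i]) + lam * (x @ w - y_teacher[i])
    return resid * x


# Trace of the per-sample gradient covariance at the hard-label least-squares
# solution: distillation (lam > 0) suppresses the label-noise component of the
# variance, while the teacher's own error keeps it from vanishing entirely.
w_opt = np.linalg.lstsq(X, y, rcond=None)[0]
for lam in (0.0, 0.5, 0.9):
    grads = np.array([stochastic_grad(w_opt, i, lam) for i in range(n)])
    print(f"lam = {lam:.1f}: gradient-noise trace ~ {np.trace(np.cov(grads.T)):.3f}")
```

On this toy data the printed value drops relative to plain SGD (lam = 0) but stays bounded away from zero because the teacher itself is imperfect, and pushing lam too close to 1 with a poor teacher can raise it again, which is consistent with the abstract's emphasis on carefully weighting the distillation loss.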

Related research

02/26/2021 - PURSUhInT: In Search of Informative Hint Points Based on Layer Clustering for Knowledge Distillation
We propose a novel knowledge distillation methodology for compressing de...

11/30/2020 - A Selective Survey on Versatile Knowledge Distillation Paradigm for Neural Network Models
This paper aims to provide a selective survey about knowledge distillati...

07/17/2023 - DOT: A Distillation-Oriented Trainer
Knowledge distillation transfers knowledge from a large model to a small...

09/26/2021 - Partial to Whole Knowledge Distillation: Progressive Distilling Decomposed Knowledge Boosts Student Better
Knowledge distillation field delicately designs various types of knowled...

04/04/2022 - Using Explainable Boosting Machine to Compare Idiographic and Nomothetic Approaches for Ecological Momentary Assessment Data
Previous research on EMA data of mental disorders was mainly focused on ...

02/06/2019 - Distilling Policy Distillation
The transfer of knowledge from one policy to another is an important too...

10/29/2018 - A Closer Look at Deep Learning Heuristics: Learning rate restarts, Warmup and Distillation
The convergence rate and final performance of common deep learning model...
