FitNets: Hints for Thin Deep Nets

12/19/2014
by Adriana Romero, et al.

While depth tends to improve network performance, it also makes gradient-based training more difficult, since deeper networks tend to be more non-linear. The recently proposed knowledge distillation approach is aimed at obtaining small and fast-to-execute models, and it has shown that a student network can imitate the soft output of a larger teacher network or ensemble of networks. In this paper, we extend this idea to allow the training of a student that is deeper and thinner than the teacher, using not only the outputs but also the intermediate representations learned by the teacher as hints to improve the training process and final performance of the student. Because the student's intermediate hidden layer will generally be smaller than the teacher's intermediate hidden layer, additional parameters are introduced to map the student hidden layer to the prediction of the teacher hidden layer. This allows one to train deeper students that can generalize better or run faster, a trade-off that is controlled by the chosen student capacity. For example, on CIFAR-10, a deep student network with almost 10.4 times fewer parameters outperforms a larger, state-of-the-art teacher network.
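The hint-based idea in the abstract can be sketched in a few lines of code. The following is a minimal illustration, not the authors' implementation: the class and function names, the 1x1 convolutional regressor, and the loss weights (`alpha`, temperature `T`) are assumptions chosen for clarity. It shows the two ingredients the abstract describes: a learned mapping from the thinner student layer to the teacher's hint layer trained with an L2 loss, and the usual soft-output distillation loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of FitNets-style hint training (assumed names and shapes).
# The student's intermediate "guided" layer is regressed onto the teacher's
# "hint" layer, then the student is trained to match the teacher's soft outputs.

class HintRegressor(nn.Module):
    """Maps the (thinner) student feature map to the teacher's channel width."""
    def __init__(self, student_channels, teacher_channels):
        super().__init__()
        # A 1x1 convolution is one simple choice of mapping; the paper's exact
        # regressor architecture is not specified in the abstract.
        self.conv = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, student_features):
        return self.conv(student_features)

def hint_loss(student_features, teacher_features, regressor):
    """L2 distance between the regressed student features and the teacher hint."""
    return F.mse_loss(regressor(student_features), teacher_features)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Soft-target distillation loss plus cross-entropy (weights are assumptions)."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Example with dummy feature maps: a student layer with 32 channels being
# matched to a teacher hint layer with 64 channels.
regressor = HintRegressor(32, 64)
s_feat = torch.randn(8, 32, 16, 16)
t_feat = torch.randn(8, 64, 16, 16)
loss = hint_loss(s_feat, t_feat, regressor)
```

In the paper's setup the hint objective is used to pre-train the student up to its guided layer, after which the whole student is trained with the distillation objective on the soft outputs.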


Related research:

- RAIL-KD: RAndom Intermediate Layer Mapping for Knowledge Distillation (09/21/2021). Intermediate layer knowledge distillation (KD) can improve the standard ...
- LIT: Block-wise Intermediate Representation Training for Model Compression (10/02/2018). Knowledge distillation (KD) is a popular method for reducing the computa...
- Less is More: Task-aware Layer-wise Distillation for Language Model Compression (10/04/2022). Layer-wise distillation is a powerful tool to compress large models (i.e...
- Knowledge Projection for Deep Neural Networks (10/26/2017). While deeper and wider neural networks are actively pushing the performa...
- Soft Mode in the Dynamics of Over-realizable On-line Learning for Soft Committee Machines (04/29/2021). Over-parametrized deep neural networks trained by stochastic gradient de...
- Do Deep Convolutional Nets Really Need to be Deep and Convolutional? (03/17/2016). Yes, they do. This paper provides the first empirical demonstration that...
- From complex to simple: hierarchical free-energy landscape renormalized in deep neural networks (10/22/2019). We develop a statistical mechanical approach based on the replica method...
