FitNets: Hints for Thin Deep Nets

by   Adriana Romero, et al.

While depth tends to improve network performances, it also makes gradient-based training more difficult since deeper networks tend to be more non-linear. The recently proposed knowledge distillation approach is aimed at obtaining small and fast-to-execute models, and it has shown that a student network could imitate the soft output of a larger teacher network or ensemble of networks. In this paper, we extend this idea to allow the training of a student that is deeper and thinner than the teacher, using not only the outputs but also the intermediate representations learned by the teacher as hints to improve the training process and final performance of the student. Because the student intermediate hidden layer will generally be smaller than the teacher's intermediate hidden layer, additional parameters are introduced to map the student hidden layer to the prediction of the teacher hidden layer. This allows one to train deeper students that can generalize better or run faster, a trade-off that is controlled by the chosen student capacity. For example, on CIFAR-10, a deep student network with almost 10.4 times less parameters outperforms a larger, state-of-the-art teacher network.


page 1

page 2

page 3

page 4


RAIL-KD: RAndom Intermediate Layer Mapping for Knowledge Distillation

Intermediate layer knowledge distillation (KD) can improve the standard ...

LIT: Block-wise Intermediate Representation Training for Model Compression

Knowledge distillation (KD) is a popular method for reducing the computa...

Less is More: Task-aware Layer-wise Distillation for Language Model Compression

Layer-wise distillation is a powerful tool to compress large models (i.e...

Knowledge Projection for Deep Neural Networks

While deeper and wider neural networks are actively pushing the performa...

Soft Mode in the Dynamics of Over-realizable On-line Learning for Soft Committee Machines

Over-parametrized deep neural networks trained by stochastic gradient de...

Do Deep Convolutional Nets Really Need to be Deep and Convolutional?

Yes, they do. This paper provides the first empirical demonstration that...

From complex to simple : hierarchical free-energy landscape renormalized in deep neural networks

We develop a statistical mechanical approach based on the replica method...

Code Repositories


FitNets: Hints for Thin Deep Nets

view repo

Please sign up or login with your details

Forgot password? Click here to reset