LIT: Block-wise Intermediate Representation Training for Model Compression

10/02/2018
by Animesh Koratana, et al.

Knowledge distillation (KD) is a popular method for reducing the computational overhead of deep network inference, in which the output of a teacher model is used to train a smaller, faster student model. Hint training (i.e., FitNets) extends KD by regressing a student model's intermediate representation to a teacher model's intermediate representation. In this work, we introduce bLock-wise Intermediate representation Training (LIT), a novel model compression technique that extends the use of intermediate representations in deep network compression, outperforming KD and hint training. LIT has two key ideas: 1) LIT trains a student of the same width (but shallower depth) as the teacher by directly comparing the intermediate representations, and 2) LIT uses the intermediate representation from the previous block in the teacher model as an input to the current student block during training, avoiding unstable intermediate representations in the student network. We show that LIT provides substantial reductions in network depth without loss in accuracy -- for example, LIT can compress a ResNeXt-110 to a ResNeXt-20 (5.5x) on CIFAR10 and a VDCNN-29 to a VDCNN-9 (3.2x) on Amazon Reviews without loss in accuracy, outperforming KD and hint training in network size for a given accuracy. We also show that applying LIT to identical student/teacher architectures increases the accuracy of the student model above the teacher model, outperforming the recently-proposed Born Again Networks procedure on ResNet, ResNeXt, and VDCNN. Finally, we show that LIT can effectively compress GAN generators, which are not supported in the KD framework because GANs output pixels as opposed to probabilities.
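
For concreteness, the block-wise procedure can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `teacher` and `student` objects and their `.blocks` lists are hypothetical stand-ins, and it assumes the student's block boundaries are aligned with the teacher's so that the intermediate representations at each boundary have matching shapes.

```python
# Minimal sketch of LIT-style block-wise intermediate representation (IR)
# training, assuming hypothetical `teacher` and `student` modules that expose
# aligned lists of blocks with the same output width at each block boundary.
import torch
import torch.nn.functional as F

def lit_block_loss(teacher, student, x):
    """Sum of per-block IR regression losses for one input batch `x`."""
    with torch.no_grad():
        # Record the frozen teacher's IR at every block boundary.
        teacher_irs = [x]
        for t_block in teacher.blocks:
            teacher_irs.append(t_block(teacher_irs[-1]))

    losses = []
    for i, s_block in enumerate(student.blocks):
        # Key idea 2: feed the current student block the *teacher's*
        # previous-block IR, not the student's own (possibly unstable) IR.
        s_out = s_block(teacher_irs[i])
        # Key idea 1: regress the student block's output onto the teacher's
        # IR at the same boundary (same width, shallower block).
        losses.append(F.mse_loss(s_out, teacher_irs[i + 1]))
    return sum(losses)
```

The loop reflects the abstract's second key idea: each student block consumes the teacher's previous-block IR rather than the student's own, so errors from still-untrained student blocks do not compound during training.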

Related research

12/19/2014 · FitNets: Hints for Thin Deep Nets
While depth tends to improve network performances, it also makes gradien...

09/14/2018 · Network Recasting: A Universal Method for Network Architecture Transformation
This paper proposes network recasting as a general method for network ar...

09/29/2020 · Contrastive Distillation on Intermediate Representations for Language Model Compression
Existing language model compression methods mostly use a simple L2 loss ...

03/09/2020 · Pacemaker: Intermediate Teacher Knowledge Distillation For On-The-Fly Convolutional Neural Network
There is a need for an on-the-fly computational process with very low pe...

10/04/2022 · Less is More: Task-aware Layer-wise Distillation for Language Model Compression
Layer-wise distillation is a powerful tool to compress large models (i.e...

04/10/2019 · Knowledge Squeezed Adversarial Network Compression
Deep network compression has been achieved notable progress via knowledg...

03/14/2023 · A Contrastive Knowledge Transfer Framework for Model Compression and Transfer Learning
Knowledge Transfer (KT) achieves competitive performance and is widely u...
