Well-Read Students Learn Better: The Impact of Student Initialization on Knowledge Distillation

by Iulia Turc, et al.

Recent developments in NLP have been accompanied by large, expensive models. Knowledge distillation is the standard method to realize these gains in applications with limited resources: a compact student is trained to recover the outputs of a powerful teacher. While most prior work investigates student architectures and transfer techniques, we focus on an often-neglected aspect---student initialization. We argue that a random starting point hinders students from fully leveraging the teacher's expertise, even in the presence of a large transfer set. We observe that applying language model pre-training to students unlocks their generalization potential, surprisingly even for very compact networks. We conduct experiments on 4 NLP tasks and 24 sizes of Transformer-based students; for sentiment classification on the Amazon Book Reviews dataset, pre-training boosts size reduction and TPU speed-up from 3.1x/1.25x to 31x/16x. Extensive ablation studies dissect the interaction between pre-training and distillation, revealing a compound effect even when they are applied on the same unlabeled dataset.
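The distillation setup the abstract refers to can be sketched in a few lines: the student is trained to match the teacher's softened output distribution. This is a minimal illustration assuming the standard soft-target formulation of knowledge distillation; the function names and temperature value are illustrative, not taken from the paper.

```python
# Minimal sketch of a knowledge-distillation objective: the student is
# penalized for diverging from the teacher's softened output distribution.
# Names and the temperature are illustrative choices, not the paper's code.
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the teacher's soft targets and the student's
    softened predictions (lower when the student mimics the teacher)."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))

# A student whose logits agree with the teacher incurs a lower loss than
# one whose logits disagree.
teacher = [4.0, 1.0, -2.0]
aligned = distillation_loss([4.0, 1.0, -2.0], teacher)
misaligned = distillation_loss([-2.0, 1.0, 4.0], teacher)
assert aligned < misaligned
```

In practice this loss is computed over a large unlabeled transfer set; the paper's contribution is that initializing the student with language-model pre-training, rather than randomly, lets this transfer work far better.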






