Well-Read Students Learn Better: The Impact of Student Initialization on Knowledge Distillation

08/23/2019
by Iulia Turc, et al.

Recent developments in NLP have been accompanied by large, expensive models. Knowledge distillation is the standard method to realize these gains in applications with limited resources: a compact student is trained to recover the outputs of a powerful teacher. While most prior work investigates student architectures and transfer techniques, we focus on an often-neglected aspect---student initialization. We argue that a random starting point hinders students from fully leveraging the teacher's expertise, even in the presence of a large transfer set. We observe that applying language model pre-training to students unlocks their generalization potential, surprisingly even for very compact networks. We conduct experiments on 4 NLP tasks and 24 sizes of Transformer-based students; for sentiment classification on the Amazon Book Reviews dataset, pre-training boosts size reduction and TPU speed-up from 3.1x/1.25x to 31x/16x. Extensive ablation studies dissect the interaction between pre-training and distillation, revealing a compound effect even when they are applied on the same unlabeled dataset.
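To make the distillation step concrete, below is a minimal PyTorch sketch of training a compact student to recover a teacher's output distribution, assuming Hugging Face-style models whose forward pass returns an object with a .logits field. The names student, teacher, batch, and temperature are illustrative assumptions, not the paper's code; per the abstract, the student would first be pre-trained with a language-modeling objective before this loop runs.

import torch
import torch.nn.functional as F

def distillation_step(student, teacher, batch, optimizer, temperature=1.0):
    """One optimization step of soft-label distillation on an unlabeled batch."""
    teacher.eval()
    with torch.no_grad():
        teacher_logits = teacher(**batch).logits   # frozen teacher predictions
    student_logits = student(**batch).logits       # compact student predictions

    # KL divergence between the softened teacher and student distributions;
    # scaling by T^2 keeps gradient magnitudes comparable across temperatures.
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

Running this step over a large unlabeled transfer set corresponds to the distillation phase; the paper's finding is that initializing the student from language-model pre-training, rather than randomly, is what lets this phase pay off.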


