
Well-Read Students Learn Better: The Impact of Student Initialization on Knowledge Distillation

by   Iulia Turc, et al.

Recent developments in NLP have been accompanied by large, expensive models. Knowledge distillation is the standard method for realizing these gains in applications with limited resources: a compact student is trained to recover the outputs of a powerful teacher. While most prior work investigates student architectures and transfer techniques, we focus on an often-neglected aspect: student initialization. We argue that a random starting point hinders students from fully leveraging the teacher's expertise, even in the presence of a large transfer set. We observe that applying language model pre-training to students unlocks their generalization potential, surprisingly even for very compact networks. We conduct experiments on 4 NLP tasks and 24 sizes of Transformer-based students; for sentiment classification on the Amazon Book Reviews dataset, pre-training boosts size reduction and TPU speed-up from 3.1x/1.25x to 31x/16x. Extensive ablation studies dissect the interaction between pre-training and distillation, revealing a compound effect even when they are applied on the same unlabeled dataset.
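The distillation objective the abstract refers to can be sketched as follows: the student is trained to match the teacher's (temperature-softened) output distribution via a cross-entropy loss. This is a minimal NumPy illustration of that standard soft-target loss, not the paper's actual training code; the temperature value and the toy logits are illustrative assumptions.

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled softmax; larger T produces softer distributions.
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # Cross-entropy of the student's softened predictions against the
    # teacher's softened outputs (the "soft targets" of distillation).
    p_teacher = softmax(teacher_logits, T)
    log_p_student = np.log(softmax(student_logits, T))
    return -(p_teacher * log_p_student).sum(axis=-1).mean()

# Toy logits for a 3-class task (illustrative values only).
teacher = np.array([[2.0, 0.5, -1.0]])
student = np.array([[1.0, 0.2, -0.5]])
loss = distillation_loss(student, teacher)
```

A student that exactly reproduces the teacher's logits attains the minimum of this loss (the teacher's own entropy); the paper's point is that a pre-trained initialization lets compact students get much closer to that minimum than a random one.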

