Well-Read Students Learn Better: The Impact of Student Initialization on Knowledge Distillation

by Iulia Turc, et al.

Recent developments in NLP have been accompanied by large, expensive models. Knowledge distillation is the standard method to realize these gains in applications with limited resources: a compact student is trained to recover the outputs of a powerful teacher. While most prior work investigates student architectures and transfer techniques, we focus on an often-neglected aspect---student initialization. We argue that a random starting point hinders students from fully leveraging the teacher's expertise, even in the presence of a large transfer set. We observe that applying language model pre-training to students unlocks their generalization potential, surprisingly even for very compact networks. We conduct experiments on 4 NLP tasks and 24 sizes of Transformer-based students; for sentiment classification on the Amazon Book Reviews dataset, pre-training boosts size reduction and TPU speed-up from 3.1x/1.25x to 31x/16x. Extensive ablation studies dissect the interaction between pre-training and distillation, revealing a compound effect even when they are applied on the same unlabeled dataset.
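The distillation setup the abstract refers to can be sketched in a few lines: the student is trained to match the teacher's softened output distribution. This is a minimal illustration assuming the standard soft-target formulation of knowledge distillation; the function names and temperature value are illustrative, not taken from the paper.

```python
# Minimal sketch of a knowledge-distillation objective: the student is
# penalized for diverging from the teacher's softened output distribution.
# Names and the temperature are illustrative choices, not the paper's code.
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the teacher's soft targets and the student's
    softened predictions (lower when the student mimics the teacher)."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))

# A student whose logits agree with the teacher incurs a lower loss than
# one whose logits disagree.
teacher = [4.0, 1.0, -2.0]
aligned = distillation_loss([4.0, 1.0, -2.0], teacher)
misaligned = distillation_loss([-2.0, 1.0, 4.0], teacher)
assert aligned < misaligned
```

In practice this loss is computed over a large unlabeled transfer set; the paper's contribution is that initializing the student with language-model pre-training, rather than randomly, lets this transfer work far better.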






