A Study on Knowledge Distillation from Weak Teacher for Scaling Up Pre-trained Language Models

05/26/2023
by Hayeon Lee, et al.

Distillation from Weak Teacher (DWT) is a method of transferring knowledge from a smaller, weaker teacher model to a larger student model in order to improve the student's performance. Previous studies have shown that DWT can be effective in the vision domain and in the pre-training stage of natural language processing (NLP). In particular, DWT shows promise in practical scenarios such as improving a new-generation or larger model with an already pre-trained but older or smaller model when the resource budget is limited. However, the optimal conditions for applying DWT have yet to be fully investigated for NLP pre-training. This study therefore examines three key factors for optimizing DWT, which differ from those in the vision domain or in traditional knowledge distillation: (i) the impact of teacher model quality on DWT effectiveness, (ii) guidelines for adjusting the weighting value of the DWT loss, and (iii) the impact of parameter remapping as a student-model initialization technique for DWT.
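To make factors (ii) and (iii) concrete, the sketch below shows, in PyTorch-style Python, one common way to weight a distillation term against the standard masked-language-modeling loss and to initialize a larger student from a smaller teacher by parameter remapping. This is a minimal illustration under our own assumptions; names such as `alpha`, `temperature`, and `remap_teacher_to_student` are hypothetical and not taken from the paper's implementation.

```python
import torch
import torch.nn.functional as F


def dwt_loss(student_logits, teacher_logits, labels, alpha=0.5, temperature=2.0):
    """Combine the hard-label MLM loss with a soft-label loss from the weak teacher.

    alpha plays the role of the DWT loss weighting studied in factor (ii);
    temperature softens both distributions before the KL term (assumed values).
    """
    # Standard masked-language-modeling loss; positions labeled -100 are ignored.
    mlm_loss = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )
    # Distillation loss against the (smaller, weaker) teacher's predictions.
    kd_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    return (1.0 - alpha) * mlm_loss + alpha * kd_loss


def remap_teacher_to_student(teacher_state, student_model):
    """Illustrative parameter remapping (factor iii): copy teacher parameters whose
    names and shapes match into the larger student; the rest keep their random init."""
    student_state = student_model.state_dict()
    for name, tensor in teacher_state.items():
        if name in student_state and student_state[name].shape == tensor.shape:
            student_state[name] = tensor.clone()
    student_model.load_state_dict(student_state)
```

In this sketch, sweeping `alpha` corresponds to the weighting-value guideline the paper investigates, and calling `remap_teacher_to_student` before pre-training corresponds to using the weak teacher for student initialization rather than (or in addition to) loss supervision.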


