One Teacher is Enough? Pre-trained Language Model Distillation from Multiple Teachers

06/02/2021
by Chuhan Wu, et al.

Pre-trained language models (PLMs) have achieved great success in NLP. However, their huge model sizes hinder their application in many practical systems. Knowledge distillation is a popular technique for compressing PLMs, in which a small student model learns from a large teacher PLM. However, the knowledge learned from a single teacher may be limited and even biased, resulting in a low-quality student model. In this paper, we propose a multi-teacher knowledge distillation framework named MT-BERT for pre-trained language model compression, which can train a high-quality student model from multiple teacher PLMs. In MT-BERT we design a multi-teacher co-finetuning method to jointly finetune multiple teacher PLMs on downstream tasks with shared pooling and prediction layers, aligning their output spaces for better collaborative teaching. In addition, we propose a multi-teacher hidden loss and a multi-teacher distillation loss to transfer the useful knowledge in both hidden states and soft labels from multiple teacher PLMs to the student model. Experiments on three benchmark datasets validate the effectiveness of MT-BERT in compressing PLMs.
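The abstract describes two distillation objectives: a hidden loss matching the student's hidden states to each teacher's, and a distillation loss matching the student's predictions to each teacher's soft labels. The sketch below illustrates how such a combined multi-teacher objective could look in PyTorch; it is not the authors' code, and the averaging over teachers, the temperature, and the weight alpha are illustrative assumptions rather than values from the paper.

```python
# Minimal sketch (not the authors' implementation) of a multi-teacher
# distillation objective, assuming the student and each teacher expose
# aligned hidden states and classification logits.
import torch
import torch.nn.functional as F


def multi_teacher_distill_loss(student_logits, teacher_logits_list,
                               student_hidden, teacher_hidden_list,
                               temperature=2.0, alpha=0.5):
    """Combine soft-label and hidden-state losses over several teachers.

    temperature and alpha are hypothetical hyperparameters for this sketch.
    """
    T = temperature
    soft_losses, hidden_losses = [], []
    for t_logits, t_hidden in zip(teacher_logits_list, teacher_hidden_list):
        # Multi-teacher distillation loss: KL divergence between the
        # temperature-softened student and teacher output distributions.
        soft_losses.append(
            F.kl_div(
                F.log_softmax(student_logits / T, dim=-1),
                F.softmax(t_logits / T, dim=-1),
                reduction="batchmean",
            ) * (T * T)
        )
        # Multi-teacher hidden loss: MSE between student and teacher
        # hidden states (assumes they are already projected to the same size).
        hidden_losses.append(F.mse_loss(student_hidden, t_hidden))
    # Simple average over teachers; the paper may weight teachers differently.
    soft_loss = torch.stack(soft_losses).mean()
    hidden_loss = torch.stack(hidden_losses).mean()
    return alpha * soft_loss + (1.0 - alpha) * hidden_loss
```

In practice this loss would be added to the task loss on the labeled data, with the shared pooling and prediction layers mentioned in the abstract ensuring that the teachers' logits live in a comparable output space before averaging.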


Related research

08/25/2019 · Patient Knowledge Distillation for BERT Model Compression
Pre-trained language models such as BERT have proven to be highly effect...

05/08/2020 · Distilling Knowledge from Pre-trained Language Models via Text Smoothing
This paper studies compressing pre-trained language models, like BERT (D...

09/25/2019 · Extreme Language Model Compression with Optimal Subwords and Shared Projections
Pre-trained deep neural network language models such as ELMo, GPT, BERT ...

11/21/2022 · Multi-Level Knowledge Distillation for Out-of-Distribution Detection in Text
Self-supervised representation learning has proved to be a valuable comp...

06/10/2021 · Marginal Utility Diminishes: Exploring the Minimum Knowledge for BERT Knowledge Distillation
Recently, knowledge distillation (KD) has shown great success in BERT co...

08/07/2023 · Efficient Temporal Sentence Grounding in Videos with Multi-Teacher Knowledge Distillation
Temporal Sentence Grounding in Videos (TSGV) aims to detect the event ti...

05/25/2022 · Towards Understanding Label Regularization for Fine-tuning Pre-trained Language Models
Knowledge Distillation (KD) is a prominent neural model compression tech...
