How to Distill your BERT: An Empirical Study on the Impact of Weight Initialisation and Distillation Objectives

05/24/2023
by   Xinpeng Wang, et al.

Recently, various intermediate layer distillation (ILD) objectives have been shown to improve the compression of BERT models via knowledge distillation (KD). However, a comprehensive evaluation of these objectives in both task-specific and task-agnostic settings has been lacking. To the best of our knowledge, this is the first work to comprehensively evaluate distillation objectives in both settings. We show that attention transfer gives the best performance overall. We also study the impact of layer choice when initialising the student from the teacher's layers, and find that it has a significant impact on performance in task-specific distillation. For vanilla KD and hidden-states transfer, initialisation with the lower layers of the teacher gives a considerable improvement over the higher layers, especially on QNLI (up to 17.8 absolute percentage points in accuracy). Attention transfer behaves consistently under different initialisation settings. We release our code as an efficient transformer-based model distillation framework for further studies.
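
The objectives compared here are standard in the KD literature. The sketch below is a minimal illustration, assuming a PyTorch and HuggingFace Transformers setup, of how a 6-layer student can be initialised from chosen teacher layers and how vanilla KD, hidden-states transfer, and attention transfer losses are typically computed. The function names, the layer mapping, and the choice of copied layers are illustrative assumptions, not the authors' released framework.

```python
# Minimal sketch (not the paper's released code): student initialisation from
# teacher layers plus the three distillation objectives discussed in the abstract.
import torch
import torch.nn.functional as F
from transformers import BertConfig, BertModel

teacher = BertModel.from_pretrained("bert-base-uncased")  # 12-layer teacher
student_cfg = BertConfig.from_pretrained("bert-base-uncased", num_hidden_layers=6)
student = BertModel(student_cfg)  # 6-layer student, randomly initialised

# Initialise the student from the teacher's *lower* layers [0..5];
# using [6..11] instead corresponds to the "higher layers" setting.
layers_to_copy = [0, 1, 2, 3, 4, 5]
student.embeddings.load_state_dict(teacher.embeddings.state_dict())
for s_idx, t_idx in enumerate(layers_to_copy):
    student.encoder.layer[s_idx].load_state_dict(
        teacher.encoder.layer[t_idx].state_dict()
    )

def vanilla_kd_loss(student_logits, teacher_logits, T=2.0):
    """Vanilla KD: KL divergence between temperature-softened output distributions."""
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * T * T

def hidden_states_loss(student_hidden, teacher_hidden, layer_map):
    """Hidden-states transfer: MSE between mapped student/teacher hidden states."""
    return sum(F.mse_loss(student_hidden[s], teacher_hidden[t])
               for s, t in layer_map) / len(layer_map)

def attention_loss(student_attn, teacher_attn, layer_map):
    """Attention transfer: MSE between mapped student/teacher attention maps."""
    return sum(F.mse_loss(student_attn[s], teacher_attn[t])
               for s, t in layer_map) / len(layer_map)

# Forward pass collecting the intermediate outputs needed by the ILD losses.
inputs = {"input_ids": torch.randint(0, student_cfg.vocab_size, (2, 16))}
with torch.no_grad():
    t_out = teacher(**inputs, output_hidden_states=True, output_attentions=True)
s_out = student(**inputs, output_hidden_states=True, output_attentions=True)

# One common uniform mapping (student layer i -> teacher layer 2i+1); the
# paper's exact mapping may differ.
layer_map = [(i, 2 * i + 1) for i in range(6)]
ild_loss = attention_loss(s_out.attentions, t_out.attentions, layer_map)
```

Swapping `layers_to_copy` for the upper half of the teacher would reproduce the higher-layer initialisation setting that the abstract contrasts with lower-layer initialisation for vanilla KD and hidden-states transfer.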
