ERNIE 3.0 Tiny: Frustratingly Simple Method to Improve Task-Agnostic Distillation Generalization

01/09/2023
by Weixin Liu, et al.

Task-agnostic knowledge distillation attempts to address the problem of deploying large pretrained language models in resource-constrained scenarios by compressing a large pretrained model, called the teacher, into a smaller one, called the student, such that the student can be directly finetuned on downstream tasks while retaining comparable performance. However, we empirically find that in existing methods there is a generalization gap between the student and the teacher. In this work, we show that multi-task learning can be leveraged in task-agnostic distillation to improve the generalization of the resulting student. In particular, we propose Multi-task Infused Task-agnostic Knowledge Distillation (MITKD): we first enhance the teacher by multi-task training it on multiple downstream tasks and then perform distillation to produce the student. Experimental results demonstrate that our method yields a student with much better generalization that significantly outperforms existing baselines and establishes a new state-of-the-art result on in-domain, out-of-domain, and low-resource datasets in the setting of task-agnostic distillation. Moreover, our method even exceeds an 8x larger BERT_Base on SQuAD and four GLUE tasks. In addition, combined with ERNIE 3.0, our method achieves state-of-the-art results on 10 Chinese datasets.
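The two-stage recipe described above (multi-task fine-tune the teacher on several downstream tasks, then distill it task-agnostically into the student) can be sketched as follows. This is a minimal, illustrative PyTorch sketch under stated assumptions: the toy encoders, the three-task suite, and the hidden-state matching loss with a learned projection are stand-ins for the sake of a runnable example, not the paper's actual models or distillation objective.

```python
# Minimal sketch of the two-stage MITKD recipe from the abstract, using toy
# PyTorch modules in place of real ERNIE/BERT encoders. Sizes, tasks, and the
# distillation loss here are assumptions, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

HIDDEN_T, HIDDEN_S, VOCAB = 64, 32, 100  # toy dimensions, not the real model sizes

class ToyEncoder(nn.Module):
    """Stand-in for a pretrained Transformer encoder (teacher or student)."""
    def __init__(self, hidden):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, hidden)
        self.layer = nn.TransformerEncoderLayer(hidden, nhead=4, batch_first=True)
    def forward(self, ids):
        return self.layer(self.embed(ids))  # [batch, seq, hidden]

teacher = ToyEncoder(HIDDEN_T)
student = ToyEncoder(HIDDEN_S)

# Stage 1: multi-task training of the teacher on multiple downstream tasks.
num_classes_per_task = [2, 3, 5]  # assumed toy task suite
heads = nn.ModuleList([nn.Linear(HIDDEN_T, c) for c in num_classes_per_task])
opt_t = torch.optim.Adam(list(teacher.parameters()) + list(heads.parameters()), lr=1e-4)

for step in range(10):  # toy loop; real training runs far longer
    task = step % len(heads)
    ids = torch.randint(0, VOCAB, (8, 16))          # dummy batch for this task
    labels = torch.randint(0, num_classes_per_task[task], (8,))
    logits = heads[task](teacher(ids).mean(dim=1))  # pooled sentence representation
    loss = F.cross_entropy(logits, labels)
    opt_t.zero_grad(); loss.backward(); opt_t.step()

# Stage 2: task-agnostic distillation from the multi-task-enhanced teacher.
# The student matches the teacher's hidden states on unlabeled text through a
# learned projection; the paper's actual distillation objective may differ.
proj = nn.Linear(HIDDEN_S, HIDDEN_T)
opt_s = torch.optim.Adam(list(student.parameters()) + list(proj.parameters()), lr=1e-4)

teacher.eval()
for step in range(10):
    ids = torch.randint(0, VOCAB, (8, 16))  # unlabeled, pretraining-style batch
    with torch.no_grad():
        t_hidden = teacher(ids)
    s_hidden = proj(student(ids))
    loss = F.mse_loss(s_hidden, t_hidden)
    opt_s.zero_grad(); loss.backward(); opt_s.step()
```

The point the sketch captures is the ordering: the teacher is first specialized via multi-task learning on downstream tasks, and only afterwards used as the distillation target on task-agnostic data, so the resulting student inherits the teacher's improved generalization without being tied to any single task.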

