A Systematic Study of Knowledge Distillation for Natural Language Generation with Pseudo-Target Training

05/03/2023
by Nitay Calderon, et al.

Modern Natural Language Generation (NLG) models come with massive computational and storage requirements. In this work, we study the potential of compressing them, which is crucial for real-world applications serving millions of users. We focus on Knowledge Distillation (KD) techniques, in which a small student model learns to imitate a large teacher model, allowing knowledge to be transferred from the teacher to the student. In contrast to much of the previous work, our goal is to optimize the model for a specific NLG task and a specific dataset. Typically, in real-world applications, in addition to labeled data there is abundant unlabeled task-specific data, which is crucial for attaining high compression rates via KD. We conduct a systematic study of task-specific KD techniques for various NLG tasks under realistic assumptions. We discuss the special characteristics of NLG distillation, particularly the exposure bias problem, and then derive a family of Pseudo-Target (PT) augmentation methods that substantially extend prior work on sequence-level KD. We propose the Joint-Teaching method for NLG distillation, which applies word-level KD to multiple PTs generated by both the teacher and the student. Our study provides practical model design observations and demonstrates the effectiveness of PT training for task-specific KD in NLG.
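To make the core idea concrete, below is a minimal, illustrative sketch of word-level KD applied to pseudo-targets generated by both the teacher and the student, in the spirit of the Joint-Teaching method described in the abstract. It assumes Hugging Face seq2seq checkpoints (t5-large as teacher, t5-small as student); the checkpoint names, sampling settings, temperature, equal loss weighting, and the omission of padding masks and an optimizer step are simplifying assumptions for illustration, not the paper's exact configuration.

```python
# Sketch: word-level KD on pseudo-targets (PTs) from both teacher and student.
# Assumptions: Hugging Face transformers seq2seq models; hyperparameters are illustrative.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
teacher = AutoModelForSeq2SeqLM.from_pretrained("t5-large").eval()
student = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

def word_level_kd_loss(src_ids, pt_ids, temperature=1.0):
    """KL divergence between teacher and student token distributions on a PT."""
    with torch.no_grad():
        t_logits = teacher(input_ids=src_ids, labels=pt_ids).logits
    s_logits = student(input_ids=src_ids, labels=pt_ids).logits
    t_probs = F.softmax(t_logits / temperature, dim=-1)
    s_log_probs = F.log_softmax(s_logits / temperature, dim=-1)
    # Padding positions are not masked here, for brevity.
    return F.kl_div(s_log_probs, t_probs, reduction="batchmean") * temperature**2

def joint_teaching_step(src_text):
    """One step: distill on PTs generated by both the teacher and the student."""
    src_ids = tokenizer(src_text, return_tensors="pt").input_ids
    with torch.no_grad():
        teacher_pt = teacher.generate(src_ids, max_new_tokens=64, do_sample=True)
        student_pt = student.generate(src_ids, max_new_tokens=64, do_sample=True)
    # Drop the decoder start token so the generations can serve as labels.
    teacher_pt, student_pt = teacher_pt[:, 1:], student_pt[:, 1:]
    # Training on the student's own generations targets the exposure bias problem.
    loss = word_level_kd_loss(src_ids, teacher_pt) + word_level_kd_loss(src_ids, student_pt)
    loss.backward()
    return loss.item()
```

In a full training loop, this step would typically be combined with an optimizer update and, when labeled data is available, the standard cross-entropy loss on gold targets.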

