Selective Knowledge Distillation for Neural Machine Translation

05/27/2021
by Fusheng Wang, et al.

Neural Machine Translation (NMT) models achieve state-of-the-art performance on many translation benchmarks. As an active research direction in NMT, knowledge distillation is widely applied to enhance a model's performance by transferring the teacher model's knowledge on each training sample. However, previous work rarely discusses the different impacts of and connections among these samples, which serve as the medium for transferring teacher knowledge. In this paper, we design a novel protocol that effectively analyzes the different impacts of samples by comparing various partitions of the samples. Based on this protocol, we conduct extensive experiments and find that more teacher knowledge is not always better: knowledge from certain samples may even hurt the overall performance of knowledge distillation. To address this issue, we propose two simple yet effective strategies, namely batch-level and global-level selection, to pick suitable samples for distillation. We evaluate our approaches on two large-scale machine translation tasks, WMT'14 English->German and WMT'19 Chinese->English. Experimental results show that our approaches yield improvements of up to +1.28 and +0.89 BLEU points over the Transformer baseline, respectively.
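
The abstract does not spell out how "suitable" samples are chosen, so the sketch below is only an illustration of the batch-level idea: score the samples in each batch by some difficulty measure and apply the word-level distillation loss only to the top fraction. The function name `selective_kd_loss`, the `keep_ratio` parameter, and the use of the student's own cross-entropy as the scoring criterion are assumptions made for illustration, not the paper's exact formulation.

```python
# Illustrative PyTorch sketch of batch-level selective distillation;
# not the authors' released code.
import torch
import torch.nn.functional as F


def selective_kd_loss(student_logits, teacher_logits, targets,
                      keep_ratio=0.5, temperature=1.0):
    """Batch-level selection: distill only on the hardest samples of a batch.

    student_logits, teacher_logits: (batch, seq_len, vocab_size)
    targets: (batch, seq_len) gold token ids
    """
    batch_size = student_logits.size(0)

    # Per-sample cross-entropy of the student against the references,
    # used here (as an assumed criterion) to score sample difficulty.
    token_ce = F.cross_entropy(
        student_logits.transpose(1, 2), targets, reduction="none"
    )                                              # (batch, seq_len)
    sample_ce = token_ce.mean(dim=1)               # (batch,)

    # Keep the `keep_ratio` hardest samples of this batch for distillation.
    k = max(1, int(keep_ratio * batch_size))
    selected = sample_ce.topk(k).indices

    # Standard soft-target KD loss, computed only on the selected samples.
    t = temperature
    kd_loss = F.kl_div(
        F.log_softmax(student_logits[selected] / t, dim=-1),
        F.softmax(teacher_logits[selected] / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)

    # The usual cross-entropy on the references is kept for all samples.
    return sample_ce.mean() + kd_loss
```

A global-level variant could be sketched analogously by replacing the per-batch top-k with a threshold estimated from difficulty scores accumulated over many recent batches, decoupling the selection from the composition of any single batch; this, too, is an interpretation of the abstract rather than a description of the released implementation.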


