Sequence-Level Knowledge Distillation

06/25/2016
by Yoon Kim, et al.

Neural machine translation (NMT) offers a novel alternative formulation of translation that is potentially simpler than statistical approaches. However, to reach competitive performance, NMT models need to be exceedingly large. In this paper we consider applying knowledge distillation approaches (Bucila et al., 2006; Hinton et al., 2015), which have proven successful for reducing the size of neural models in other domains, to the problem of NMT. We demonstrate that standard knowledge distillation applied to word-level prediction can be effective for NMT, and we also introduce two novel sequence-level versions of knowledge distillation that further improve performance and, somewhat surprisingly, seem to eliminate the need for beam search (even when applied on the original teacher model). Our best student model runs 10 times faster than its state-of-the-art teacher with little loss in performance. It is also significantly better than a baseline model trained without knowledge distillation: by 4.2/1.7 BLEU with greedy decoding/beam search. Applying weight pruning on top of knowledge distillation results in a student model that has 13 times fewer parameters than the original teacher model, with a decrease of 0.4 BLEU.
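
As a rough illustration of the word-level variant, the sketch below (hypothetical PyTorch code, not taken from the paper) interpolates the usual negative log-likelihood on the reference translation with a cross-entropy term against the teacher's per-word distribution. The function name `word_level_kd_loss` and the `alpha`, `temperature`, and `pad_id` parameters are assumptions made for this example.

```python
import torch
import torch.nn.functional as F


def word_level_kd_loss(student_logits, teacher_logits, gold_ids,
                       alpha=0.5, temperature=1.0, pad_id=0):
    """Interpolated word-level distillation loss (illustrative sketch).

    student_logits, teacher_logits: (batch, tgt_len, vocab)
    gold_ids:                       (batch, tgt_len) reference target ids
    """
    # Soft targets from the (frozen) teacher model.
    with torch.no_grad():
        teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)

    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)

    # Cross-entropy of the student against the teacher distribution, per word.
    kd = -(teacher_probs * student_log_probs).sum(dim=-1)         # (batch, tgt_len)

    # Ordinary NLL against the reference translation.
    nll = F.cross_entropy(student_logits.transpose(1, 2), gold_ids,
                          ignore_index=pad_id, reduction="none")  # (batch, tgt_len)

    # Mask out padding positions and average over real tokens.
    mask = (gold_ids != pad_id).float()
    kd = (kd * mask).sum() / mask.sum()
    nll = (nll * mask).sum() / mask.sum()

    # Mix the two objectives; alpha = 1.0 recovers pure word-level distillation.
    return alpha * kd + (1.0 - alpha) * nll
```

The sequence-level versions need no new loss term: the teacher decodes the training source sentences with beam search, and the student is then trained with the ordinary NLL objective on those teacher-generated targets, which is what allows the student to perform well with plain greedy decoding.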


Related research

02/06/2017 - Ensemble Distillation for Neural Machine Translation
  Knowledge distillation describes a method for training a student network...

05/20/2023 - Accurate Knowledge Distillation with n-best Reranking
  We propose extending the Sequence-level Knowledge Distillation (Kim and ...

12/06/2019 - Explaining Sequence-Level Knowledge Distillation as Data-Augmentation for Neural Machine Translation
  Sequence-level knowledge distillation (SLKD) is a model compression tech...

03/11/2023 - Knowledge Distillation for Efficient Sequences of Training Runs
  In many practical scenarios – like hyperparameter search or continual re...

05/18/2022 - [Re] Distilling Knowledge via Knowledge Review
  This effort aims to reproduce the results of experiments and analyze the...

06/29/2022 - Revisiting Label Smoothing and Knowledge Distillation Compatibility: What was Missing?
  This work investigates the compatibility between label smoothing (LS) an...

05/31/2019 - The Pupil Has Become the Master: Teacher-Student Model-Based Word Embedding Distillation with Ensemble Learning
  Recent advances in deep learning have facilitated the demand of neural m...
