f-Divergence Minimization for Sequence-Level Knowledge Distillation

07/27/2023
by Yuqiao Wen et al.

Knowledge distillation (KD) is the process of transferring knowledge from a large model to a small one. It has gained increasing attention in the natural language processing community, driven by the demand to compress ever-growing language models. In this work, we propose the f-DISTILL framework, which formulates sequence-level knowledge distillation as minimizing a generalized f-divergence function. We propose four distilling variants under our framework and show that the existing SeqKD and ENGINE approaches are approximations of our f-DISTILL methods. We further derive a step-wise decomposition for f-DISTILL, reducing the intractable sequence-level divergence to word-level losses that can be computed tractably. Experiments across four datasets show that our methods outperform existing KD approaches, and that our symmetric distilling losses better force the student to learn from the teacher distribution.
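To make the step-wise decomposition concrete, below is a minimal PyTorch sketch (not from the paper) of word-level divergence losses between per-step teacher and student distributions over the vocabulary. The function name `word_level_f_distill` and the choice of variants (forward KL, reverse KL, Jensen-Shannon, total variation) are illustrative assumptions; the paper's exact formulation may differ.

```python
# Hypothetical sketch: word-level f-divergence losses for sequence-level KD,
# assuming per-step teacher and student distributions of shape
# [seq_len, vocab_size]. Not the authors' implementation.
import torch

def kl(p, q, eps=1e-8):
    """KL(p || q), summed over the vocabulary at each step."""
    return (p * ((p + eps) / (q + eps)).log()).sum(dim=-1)

def word_level_f_distill(teacher_probs, student_probs, variant="js"):
    """Average a per-step divergence over the sequence."""
    p, q = teacher_probs, student_probs
    if variant == "kl":          # forward KL: student covers the teacher
        step_loss = kl(p, q)
    elif variant == "rkl":       # reverse KL: mode-seeking student
        step_loss = kl(q, p)
    elif variant == "js":        # symmetric Jensen-Shannon divergence
        m = 0.5 * (p + q)
        step_loss = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    elif variant == "tvd":       # symmetric total variation distance
        step_loss = 0.5 * (p - q).abs().sum(dim=-1)
    else:
        raise ValueError(f"unknown variant: {variant}")
    return step_loss.mean()      # mean word-level loss over the sequence

# Toy usage: random distributions over a 100-word vocabulary, 5 decoding steps.
T = torch.softmax(torch.randn(5, 100), dim=-1)   # teacher
S = torch.softmax(torch.randn(5, 100), dim=-1)   # student
print(word_level_f_distill(T, S, variant="js").item())
```

The symmetric variants (Jensen-Shannon, total variation) correspond to the "symmetric distilling losses" the abstract refers to, in contrast to the asymmetric forward and reverse KL.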


