GKD: Generalized Knowledge Distillation for Auto-regressive Sequence Models

06/23/2023
by Rishabh Agarwal, et al.

Knowledge distillation is commonly used for compressing neural networks to reduce their inference cost and memory footprint. However, current distillation methods for auto-regressive models, such as generative language models (LMs), suffer from two key issues: (1) distribution mismatch between output sequences during training and the sequences generated by the student during its deployment, and (2) model under-specification, where the student model may not be expressive enough to fit the teacher's distribution. To address these issues, we propose Generalized Knowledge Distillation (GKD). GKD mitigates distribution mismatch by sampling output sequences from the student during training. Furthermore, GKD handles model under-specification by optimizing alternative divergences, such as reverse KL, that focus on generating samples from the student that are likely under the teacher's distribution. We demonstrate that GKD outperforms commonly-used approaches for distilling LLMs on summarization, machine translation, and arithmetic reasoning tasks.
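To make the on-policy ingredient concrete, below is a minimal sketch (not the authors' code) of the idea the abstract describes: sequences are sampled from the student, both models score those sequences, and training minimizes a per-token reverse KL, KL(student || teacher), so the student concentrates its mass on outputs the teacher considers likely. The function name, tensor shapes, masking convention, and temperature argument are illustrative assumptions.

```python
# Sketch of on-policy reverse-KL distillation on student-sampled sequences.
# Logits are assumed to come from running the teacher and student on the
# same sequences that were sampled from the student.
import torch
import torch.nn.functional as F

def reverse_kl_distill_loss(student_logits: torch.Tensor,
                            teacher_logits: torch.Tensor,
                            mask: torch.Tensor,
                            temperature: float = 1.0) -> torch.Tensor:
    """Mean per-token KL(student || teacher) over non-padding positions.

    student_logits, teacher_logits: [batch, seq_len, vocab]
    mask: [batch, seq_len], 1 for generated output tokens, 0 for padding.
    """
    s_logp = F.log_softmax(student_logits / temperature, dim=-1)
    t_logp = F.log_softmax(teacher_logits / temperature, dim=-1)
    # KL(p_s || p_t) = sum_v p_s(v) * (log p_s(v) - log p_t(v))
    per_token_kl = (s_logp.exp() * (s_logp - t_logp)).sum(dim=-1)
    return (per_token_kl * mask).sum() / mask.sum()

if __name__ == "__main__":
    # Toy usage with random logits standing in for model outputs.
    batch, seq_len, vocab = 2, 5, 11
    student_logits = torch.randn(batch, seq_len, vocab, requires_grad=True)
    teacher_logits = torch.randn(batch, seq_len, vocab)  # no gradient
    mask = torch.ones(batch, seq_len)
    loss = reverse_kl_distill_loss(student_logits, teacher_logits, mask)
    loss.backward()  # gradients flow only into the student
    print(float(loss))
```

As in on-policy distillation generally, no gradient is taken through the sampling step itself; only the student's probabilities on its own samples are differentiated. The full GKD objective in the paper further generalizes this sketch, allowing other divergences (e.g., a generalized Jensen-Shannon divergence) and mixing student-generated sequences with a fixed ground-truth dataset.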
