GKD: A General Knowledge Distillation Framework for Large-scale Pre-trained Language Model

06/11/2023
by Shicheng Tan, et al.

Reducing the parameter scale of large-scale pre-trained language models (PLMs) through knowledge distillation has greatly facilitated their widespread deployment on various devices. However, deploying knowledge distillation systems in real-world, industrial-strength applications remains challenging: such settings require applying complex distillation methods to even larger-scale PLMs (over 10B parameters), while being constrained by GPU memory and the difficulty of switching between methods. To overcome these challenges, we propose GKD, a general knowledge distillation framework that supports distillation of larger-scale PLMs with a variety of distillation methods. With GKD, developers can build larger distillation models on memory-limited GPUs and easily switch and combine different distillation methods within a single framework. Experimental results show that GKD can support the distillation of at least 100B-scale PLMs and 25 mainstream methods on 8 NVIDIA A100 (40GB) GPUs.
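For readers unfamiliar with the basic mechanism the abstract refers to, the sketch below shows the vanilla logit-based distillation objective that distillation frameworks typically build on: a temperature-softened KL term between teacher and student predictions combined with the usual cross-entropy term. The function name, temperature, and loss weighting here are illustrative assumptions, not GKD's actual API.

```python
# Minimal sketch of vanilla logit-based knowledge distillation in PyTorch.
# Names and hyperparameters are illustrative, not part of GKD.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Combine a soft KL term on temperature-scaled logits with the hard CE term."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradients keep a comparable magnitude across temperatures
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Example usage with random tensors standing in for a batch of predictions.
student_logits = torch.randn(8, 30522, requires_grad=True)  # student outputs over a BERT-sized vocabulary
teacher_logits = torch.randn(8, 30522)                       # frozen teacher predictions
labels = torch.randint(0, 30522, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```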

