AutoDisc: Automatic Distillation Schedule for Large Language Model Compression

05/29/2022
by   Chen Zhang, et al.

Driven by the teacher-student paradigm, knowledge distillation is one of the de facto approaches to language model compression. Recent studies have found that conventional distillation is less effective when there is a large capacity gap between the teacher and the student, and have introduced teacher assistant-based distillation to bridge the gap. As the connection between the two, the scale and the performance of the teacher assistant are crucial for transferring knowledge from the teacher to the student. However, existing teacher assistant-based methods select the scale of the teacher assistant manually, which fails to identify the teacher assistant with the optimal scale-performance tradeoff. To this end, we propose an Automatic Distillation Schedule (AutoDisc) for large language model compression. In particular, AutoDisc first specifies a set of teacher assistant candidates at different scales via gridding and pruning, and then optimizes all candidates in a once-for-all optimization with two approximations. The best teacher assistant scale is automatically selected according to the scale-performance tradeoff. AutoDisc is evaluated with an extensive set of experiments on the language understanding benchmark GLUE. Experimental results demonstrate the improved performance and applicability of AutoDisc. We further apply AutoDisc to a language model with over one billion parameters and show its scalability.
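
The selection step described in the abstract can be read as a search over candidate teacher-assistant scales. The following is a minimal, hypothetical Python sketch of that loop; the candidate grid, the pruning threshold, the toy performance estimator, and the tradeoff weight are illustrative assumptions rather than the paper's implementation, and the once-for-all optimization is stubbed out by a placeholder scoring function.

```python
# Hypothetical sketch of an AutoDisc-style teacher-assistant (TA) selection loop.
# Names and values are illustrative, not the authors' implementation.
from dataclasses import dataclass


@dataclass
class Candidate:
    scale: float        # fraction of the teacher's parameters kept by the TA
    performance: float  # estimated task score after (stubbed) optimization


def grid_candidates(step: float = 0.1) -> list[float]:
    """Grid the TA scale space between the student and teacher sizes."""
    return [round(i * step, 2) for i in range(1, int(1 / step))]


def estimate_performance(scale: float) -> float:
    """Placeholder for evaluating a TA candidate at a given scale.

    In the paper all candidates are optimized jointly (once-for-all) with two
    approximations; here a toy saturating curve stands in so the sketch runs.
    """
    return 1.0 - (1.0 - scale) ** 2


def select_teacher_assistant(tradeoff: float = 0.5) -> Candidate:
    """Pick the TA whose estimated performance, penalized by scale, is highest."""
    candidates = [
        Candidate(scale=s, performance=estimate_performance(s))
        for s in grid_candidates()
        if s >= 0.2  # prune TAs too close to the student's capacity
    ]
    return max(candidates, key=lambda c: c.performance - tradeoff * c.scale)


if __name__ == "__main__":
    best = select_teacher_assistant()
    print(f"Selected TA scale {best.scale:.2f}, estimated score {best.performance:.3f}")
```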

