Large Language Model Distillation Doesn't Need a Teacher

05/24/2023
by   Ananya Harsh Jha, et al.

Knowledge distillation trains a smaller student model to match the output distribution of a larger teacher in order to maximize end-task performance under computational constraints. However, the existing literature on language model distillation primarily focuses on compressing encoder-only models that are then specialized by task-specific supervised finetuning. We need to rethink this setup for more recent large language models with tens to hundreds of billions of parameters. Task-specific finetuning is impractical at this scale, and model performance is often measured using zero/few-shot prompting. Thus, in this work, we advocate for task-agnostic, zero-shot-evaluated distillation for large language models without access to end-task finetuning data. We propose a teacher-free, task-agnostic distillation method, which uses a truncated version of the larger model for initialization and continues pretraining this model with a language modeling objective. Our teacher-free method shines in the distillation regime where it is infeasible to fit both the student and the teacher into GPU memory. Despite its simplicity, our method can effectively reduce the model size by 50%, matching or outperforming the vanilla distillation method on perplexity and accuracy on 13 zero-shot end-tasks while being 1.5x more computationally efficient.
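As a rough illustration of the recipe described in the abstract (not the authors' released code), the sketch below truncates a pretrained decoder-only language model to roughly half its depth and then continues pretraining it with the standard next-token language modeling loss, with no teacher held in memory. The choice of facebook/opt-1.3b, the Hugging Face Transformers API, and the keep-every-other-layer heuristic are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch, assuming a Hugging Face OPT checkpoint; the paper does not
# prescribe this model family or this particular layer-selection heuristic.
import torch
from torch.nn import ModuleList
from transformers import AutoModelForCausalLM, AutoTokenizer

def truncate_decoder_layers(model, keep_every=2):
    """Keep every `keep_every`-th decoder layer (~50% depth reduction for keep_every=2)."""
    layers = model.model.decoder.layers  # OPT layout; other architectures name this differently
    kept = ModuleList(layers[i] for i in range(0, len(layers), keep_every))
    model.model.decoder.layers = kept
    model.config.num_hidden_layers = len(kept)
    return model

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
student = truncate_decoder_layers(AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b"))

# Continue pretraining with the plain language modeling objective: no teacher is ever loaded.
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)
batch = tokenizer(["Example pretraining text from the corpus ..."], return_tensors="pt")
outputs = student(**batch, labels=batch["input_ids"])  # causal LM cross-entropy computed internally
outputs.loss.backward()
optimizer.step()
```

Because only the truncated student is materialized, peak GPU memory is roughly that of continued pretraining at half scale, which is precisely the regime the abstract highlights as infeasible for vanilla teacher-student distillation.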


Related research

12/31/2020 · Towards Zero-Shot Knowledge Distillation for Natural Language Processing
Knowledge Distillation (KD) is a common knowledge transfer algorithm use...

07/06/2023 · Distilling Large Vision-Language Model with Out-of-Distribution Generalizability
Large vision-language models have achieved outstanding performance, but ...

12/05/2021 · Causal Distillation for Language Models
Distillation efforts have led to language models that are more compact a...

09/08/2019 · Transformer to CNN: Label-scarce distillation for efficient text classification
Significant advances have been made in Natural Language Processing (NLP)...

09/10/2023 · Exploiting CLIP for Zero-shot HOI Detection Requires Knowledge Distillation at Multiple Levels
In this paper, we investigate the task of zero-shot human-object interac...

05/03/2023 · Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes
Deploying large language models (LLMs) is challenging because they are m...

05/21/2023 · Task-agnostic Distillation of Encoder-Decoder Language Models
Finetuning pretrained language models (LMs) have enabled appealing perfo...
