Task-agnostic Distillation of Encoder-Decoder Language Models

05/21/2023
by Chen Zhang, et al.

Finetuning pretrained language models (LMs) has enabled appealing performance on a diverse array of tasks. This intriguing task-agnostic property has driven a shift in focus from task-specific to task-agnostic distillation of LMs. While task-agnostic distillation can yield compute-efficient LMs that largely preserve performance, previous studies mainly concentrate on distilling either encoder-only LMs (e.g., BERT) or decoder-only ones (e.g., GPT), and largely neglect that the distillation of encoder-decoder LMs (e.g., T5) can exhibit very different behaviors. Frustratingly, we discover that existing task-agnostic distillation methods can fail to handle the distillation of encoder-decoder LMs. To address this, we explore several paths and uncover one, named MiniEnD, that successfully tackles the distillation of encoder-decoder LMs in a task-agnostic fashion. We examine MiniEnD on language understanding and abstractive summarization. The results show that MiniEnD is generally effective and competitive with other alternatives. We further scale MiniEnD up to the distillation of 3B encoder-decoder language models with interpolated distillation. The results imply both the opportunities and the challenges in distilling large language models (e.g., LLaMA).
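For readers unfamiliar with the setup, the sketch below illustrates generic task-agnostic (logit-matching) distillation applied to an encoder-decoder LM, using PyTorch and Hugging Face Transformers. It is not the paper's MiniEnD recipe; the model checkpoints, temperature, and data handling are illustrative assumptions only.

```python
# Minimal sketch of task-agnostic logit-matching distillation for an
# encoder-decoder LM (T5 teacher -> smaller T5 student).
# NOT the MiniEnD method; checkpoints and hyperparameters are illustrative.
import torch
import torch.nn.functional as F
from transformers import T5ForConditionalGeneration, T5TokenizerFast

teacher = T5ForConditionalGeneration.from_pretrained("t5-base").eval()
student = T5ForConditionalGeneration.from_pretrained("t5-small")
tokenizer = T5TokenizerFast.from_pretrained("t5-base")

def distill_step(batch_texts, temperature=2.0):
    # Real task-agnostic distillation would use the pretraining corpus
    # (e.g., span-corrupted text); here we reuse the raw text as targets
    # for brevity. Padding positions in the labels should be set to -100
    # in practice so they are ignored by the LM loss.
    enc = tokenizer(batch_texts, return_tensors="pt",
                    padding=True, truncation=True)
    labels = enc.input_ids.clone()

    with torch.no_grad():
        t_logits = teacher(input_ids=enc.input_ids,
                           attention_mask=enc.attention_mask,
                           labels=labels).logits
    s_out = student(input_ids=enc.input_ids,
                    attention_mask=enc.attention_mask,
                    labels=labels)

    # KL divergence between softened teacher and student distributions
    # over the decoder output vocabulary (standard logit matching).
    kl = F.kl_div(
        F.log_softmax(s_out.logits / temperature, dim=-1),
        F.softmax(t_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    # Total objective: distillation term plus the student's own LM loss.
    return kl + s_out.loss
```

The abstract's point is that this kind of objective, which works well for encoder-only or decoder-only models, can fail for encoder-decoder models, motivating the alternative paths explored as MiniEnD.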


Related research

05/06/2023 · NorBench – A Benchmark for Norwegian Language Models
  We present NorBench: a streamlined suite of NLP tasks and probes for eva...

02/10/2023 · Distillation of encoder-decoder transformers for sequence labelling
  Driven by encouraging results on a wide range of tasks, the field of NLP...

04/17/2023 · Learning to Compress Prompts with Gist Tokens
  Prompting is now the primary way to utilize the multitask capabilities o...

07/21/2022 · FADE: Fusing the Assets of Decoder and Encoder for Task-Agnostic Upsampling
  We consider the problem of task-agnostic feature upsampling in dense pre...

05/24/2023 · Large Language Model Distillation Doesn't Need a Teacher
  Knowledge distillation trains a smaller student model to match the outpu...

02/16/2023 · LEALLA: Learning Lightweight Language-agnostic Sentence Embeddings with Knowledge Distillation
  Large-scale language-agnostic sentence embedding models such as LaBSE (F...

09/19/2023 · A Family of Pretrained Transformer Language Models for Russian
  Nowadays, Transformer language models (LMs) represent a fundamental comp...
