XtremeDistilTransformers: Task Transfer for Task-agnostic Distillation

06/08/2021
by Subhabrata Mukherjee et al.

While deep, large pre-trained models are the state of the art for various natural language processing tasks, their huge size poses significant challenges for practical use in resource-constrained settings. Recent work on knowledge distillation proposes both task-agnostic and task-specific methods to compress these models, with task-specific ones often yielding higher compression rates. In this work, we develop a new task-agnostic distillation framework, XtremeDistilTransformers, that leverages the advantages of task-specific methods to learn a small universal model that can be applied to arbitrary tasks and languages. To this end, we study the transferability of several source tasks, augmentation resources, and model architectures for distillation. We evaluate model performance on multiple tasks, including the General Language Understanding Evaluation (GLUE) benchmark, the SQuAD question answering dataset, and a massive multi-lingual NER dataset covering 41 languages. We release three distilled task-agnostic checkpoints with 13M, 22M, and 33M parameters that obtain state-of-the-art performance on several tasks.
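The common core of distillation frameworks like this is a loss that transfers the teacher's soft predictions to a much smaller student alongside the usual supervised signal. Below is a minimal PyTorch sketch of that soft-label objective; the function name, the temperature and alpha values, and the logits-only, single-loss formulation are illustrative assumptions for exposition, not the paper's exact training recipe (which additionally studies task transfer and augmentation resources).

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels,
                          temperature=2.0, alpha=0.5):
        """Blend soft-label distillation with hard-label cross-entropy.

        NOTE: hyperparameters and the logits-only formulation are
        illustrative, not the exact XtremeDistilTransformers objective.
        """
        # Soften teacher and student distributions with the same temperature.
        soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
        log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)

        # KL term, scaled by T^2 to keep its gradient magnitude comparable
        # to the hard-label term (standard Hinton-style distillation).
        kd_loss = F.kl_div(log_soft_student, soft_teacher,
                           reduction="batchmean") * temperature ** 2

        # Ordinary supervised loss on the gold labels.
        ce_loss = F.cross_entropy(student_logits, labels)

        return alpha * kd_loss + (1.0 - alpha) * ce_loss

    # Toy usage with random tensors standing in for model outputs.
    student_logits = torch.randn(8, 3)
    teacher_logits = torch.randn(8, 3)
    labels = torch.randint(0, 3, (8,))
    print(distillation_loss(student_logits, teacher_logits, labels))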

Related research

04/12/2020 · TinyMBERT: Multi-Stage Distillation Framework for Massive Multi-lingual NER
Deep and large pre-trained language models are the state-of-the-art for ...

05/26/2023 · TADA: Task-Agnostic Dialect Adapters for English
Large Language Models, the dominant starting point for Natural Language ...

06/08/2017 · Dynamic Integration of Background Knowledge in Neural NLU Systems
Common-sense or background knowledge is required to understand natural l...

01/25/2020 · Generation-Distillation for Efficient Natural Language Understanding in Low-Data Settings
Over the past year, the emergence of transfer learning with large-scale ...

02/28/2023 · Generic-to-Specific Distillation of Masked Autoencoders
Large vision Transformers (ViTs) driven by self-supervised pre-training ...

04/15/2022 · MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation
Pre-trained language models have demonstrated superior performance in va...

09/08/2019 · Transformer to CNN: Label-scarce distillation for efficient text classification
Significant advances have been made in Natural Language Processing (NLP)...
