MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers

02/25/2020
by Wenhui Wang, et al.

Pre-trained language models (e.g., BERT (Devlin et al., 2018) and its variants) have achieved remarkable success in a variety of NLP tasks. However, these models usually consist of hundreds of millions of parameters, which makes fine-tuning and online serving in real-life applications challenging due to latency and capacity constraints. In this work, we present a simple and effective approach to compress large Transformer (Vaswani et al., 2017) based pre-trained models, termed deep self-attention distillation. The small model (student) is trained by deeply mimicking the self-attention module, which plays a vital role in Transformer networks, of the large model (teacher). Specifically, we propose distilling the self-attention module of the last Transformer layer of the teacher, which is effective and flexible for the student. Furthermore, we introduce the scaled dot-product between values in the self-attention module as new deep self-attention knowledge, in addition to the attention distributions (i.e., the scaled dot-product of queries and keys) that have been used in existing works. Moreover, we show that introducing a teacher assistant (Mirzadeh et al., 2019) also helps the distillation of large pre-trained Transformer models. Experimental results demonstrate that our model outperforms state-of-the-art baselines across student models of different parameter sizes. In particular, it retains more than 99% accuracy on SQuAD 2.0 and several GLUE benchmark tasks using 50% of the Transformer parameters and computations of the teacher model. The code and models are publicly available at https://github.com/microsoft/unilm/tree/master/minilm
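To make the two transferred signals concrete, below is a minimal PyTorch sketch of the deep self-attention distillation objective described in the abstract: a KL divergence between the teacher's and student's last-layer attention distributions (queries–keys) plus a KL divergence between their value relations (values–values). The function name, tensor layout, and numerical details are illustrative assumptions, not taken from the released code.

```python
import torch
import torch.nn.functional as F


def minilm_distillation_loss(q_t, k_t, v_t, q_s, k_s, v_s):
    """Deep self-attention distillation loss (illustrative sketch).

    Inputs are the last-layer self-attention queries, keys and values of the
    teacher (``*_t``) and student (``*_s``), each shaped
    (batch, num_heads, seq_len, head_dim). Teacher and student share the
    number of heads; head_dim may differ, since the transferred relations
    are seq_len x seq_len matrices.
    """

    def relation(a, b):
        # Scaled dot-product followed by softmax: a distribution over
        # positions for every (head, query-position) pair.
        scores = torch.matmul(a, b.transpose(-1, -2)) / (a.size(-1) ** 0.5)
        return F.softmax(scores, dim=-1)

    def kl(p_teacher, p_student):
        # KL(teacher || student) per distribution, summed over the support
        # and averaged over batch, heads and positions.
        loss = F.kl_div(p_student.clamp_min(1e-10).log(), p_teacher, reduction="none")
        return loss.sum(-1).mean()

    # Attention-distribution transfer: scaled dot-product of queries and keys.
    l_at = kl(relation(q_t, k_t), relation(q_s, k_s))
    # Value-relation transfer: scaled dot-product between values.
    l_vr = kl(relation(v_t, v_t), relation(v_s, v_s))
    return l_at + l_vr
```

In this sketch the two terms are weighted equally; when the size gap between teacher and student is large, the same loss can first be used to distill into an intermediate teacher assistant, which then serves as the teacher for the final student.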

