Attentive Student Meets Multi-Task Teacher: Improved Knowledge Distillation for Pretrained Models

11/09/2019
by Linqing Liu, et al.

In this paper, we explore knowledge distillation in the multi-task learning setting. We distill a BERT model, refined by multi-task learning on seven datasets of the GLUE benchmark, into a bidirectional LSTM with an attention mechanism. Unlike other BERT distillation methods, which are specifically designed for Transformer-based architectures, we provide a general learning framework: our approach is model-agnostic and can easily be applied to future teacher models. Compared to a strong, similarly BiLSTM-based approach, we achieve better quality under the same computational constraints. Compared to the present state of the art, we reach comparable results with much faster inference.
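As a rough illustration of the setup described above (not the authors' exact formulation), the sketch below pairs a BiLSTM-with-attention student with soft targets from a multi-task fine-tuned BERT teacher. The module layout, the MSE-on-logits objective, and the loss weight `alpha` are assumptions made for the example.

```python
# Minimal sketch: distilling a multi-task BERT teacher into a BiLSTM
# student with attention pooling. Names, the MSE-on-logits term, and
# `alpha` are illustrative assumptions, not the paper's exact setup.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentiveBiLSTMStudent(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=256, num_labels=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                              bidirectional=True)
        self.attn = nn.Linear(2 * hidden_dim, 1)      # additive attention scorer
        self.classifier = nn.Linear(2 * hidden_dim, num_labels)

    def forward(self, token_ids):
        h, _ = self.bilstm(self.embed(token_ids))     # (B, T, 2H)
        weights = F.softmax(self.attn(h), dim=1)      # attention over time steps
        pooled = (weights * h).sum(dim=1)             # (B, 2H) attentive pooling
        return self.classifier(pooled)                # task logits


def distillation_loss(student_logits, teacher_logits, labels, alpha=0.5):
    """Blend hard-label cross-entropy with MSE against the teacher's logits."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.mse_loss(student_logits, teacher_logits)
    return alpha * hard + (1.0 - alpha) * soft
```

In training, the teacher logits would come from running the multi-task BERT in eval mode on the same batch, so only the lightweight student is needed at inference time.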


