MATE-KD: Masked Adversarial TExt, a Companion to Knowledge Distillation

05/12/2021
by Ahmad Rashid, et al.
The advent of large pre-trained language models has given rise to rapid progress in the field of Natural Language Processing (NLP). While the performance of these models on standard benchmarks has scaled with size, compression techniques such as knowledge distillation have been key in making them practical. We present MATE-KD, a novel text-based adversarial training algorithm which improves the performance of knowledge distillation. MATE-KD first trains a masked language model-based generator to perturb text by maximizing the divergence between teacher and student logits. Then, using knowledge distillation, a student is trained on both the original and the perturbed training samples. We evaluate our algorithm, using BERT-based models, on the GLUE benchmark and demonstrate that MATE-KD outperforms competitive adversarial learning and data augmentation baselines. On the GLUE test set, our 6-layer RoBERTa-based model outperforms BERT-Large.
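To make the two-phase objective concrete, below is a minimal PyTorch sketch of the training loop the abstract describes: a generator (maximization) step that perturbs masked tokens so as to maximize the teacher/student divergence, followed by a distillation (minimization) step on both the original and the perturbed batch. The ToyClassifier and ToyGenerator modules, the Gumbel-softmax relaxation, the KL direction, the loss weighting, and all hyperparameters are illustrative assumptions rather than the paper's exact implementation; in practice the teacher, student, and generator would be pre-trained BERT/RoBERTa models fine-tuned on GLUE data.

import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, MASK_ID, SEQ_LEN, N_CLASSES = 1000, 0, 16, 2

class ToyClassifier(nn.Module):
    """Stand-in for a teacher or student encoder plus classification head."""
    def __init__(self, hidden=64):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, hidden)
        self.head = nn.Linear(hidden, N_CLASSES)
    def forward(self, token_ids=None, token_probs=None):
        # Accept either hard token ids or a soft distribution over the vocab;
        # the soft path keeps the generator's perturbation differentiable.
        if token_probs is not None:
            h = token_probs @ self.emb.weight          # (B, L, hidden)
        else:
            h = self.emb(token_ids)
        return self.head(h.mean(dim=1))                # (B, n_classes)

class ToyGenerator(nn.Module):
    """Stand-in for the masked-LM generator that proposes replacement tokens."""
    def __init__(self, hidden=64):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, hidden)
        self.lm_head = nn.Linear(hidden, VOCAB)
    def forward(self, token_ids):
        return self.lm_head(self.emb(token_ids))       # (B, L, vocab)

def perturb(generator, token_ids, mask_prob=0.3, tau=1.0):
    """Mask a random subset of tokens and fill them with a Gumbel-softmax
    sample from the generator, returning a soft token distribution."""
    mask = torch.rand_like(token_ids, dtype=torch.float) < mask_prob
    logits = generator(token_ids.masked_fill(mask, MASK_ID))
    soft_tokens = F.gumbel_softmax(logits, tau=tau, hard=False)   # differentiable
    onehot = F.one_hot(token_ids, VOCAB).float()
    return torch.where(mask.unsqueeze(-1), soft_tokens, onehot)

def kl(p_logits, q_logits):
    """KL(p || q) between two sets of logits."""
    return F.kl_div(F.log_softmax(q_logits, -1), F.softmax(p_logits, -1),
                    reduction="batchmean")

teacher, student, generator = ToyClassifier(), ToyClassifier(), ToyGenerator()
teacher.eval()
g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
s_opt = torch.optim.Adam(student.parameters(), lr=1e-3)

for step in range(10):                     # toy loop over random stand-in batches
    x = torch.randint(1, VOCAB, (8, SEQ_LEN))
    y = torch.randint(0, N_CLASSES, (8,))

    # Maximization: train the generator to maximize teacher/student divergence
    # on the perturbed text (minimize the negated divergence).
    x_adv = perturb(generator, x)
    g_loss = -kl(teacher(token_probs=x_adv), student(token_probs=x_adv))
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()

    # Minimization: distill the student on both original and perturbed samples.
    with torch.no_grad():
        x_adv = perturb(generator, x)
        t_orig, t_adv = teacher(x), teacher(token_probs=x_adv)
    s_orig, s_adv = student(x), student(token_probs=x_adv)
    kd_loss = kl(t_orig, s_orig) + kl(t_adv, s_adv)
    s_loss = kd_loss + F.cross_entropy(s_orig, y)
    s_opt.zero_grad()
    s_loss.backward()
    s_opt.step()

The Gumbel-softmax step is one common way to keep token replacement differentiable so gradients can reach the generator; the real system may also apply a distillation temperature and different loss weights.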

Related Research

08/25/2019 · Patient Knowledge Distillation for BERT Model Compression
Pre-trained language models such as BERT have proven to be highly effect...

08/26/2023 · Improving Knowledge Distillation for BERT Models: Loss Functions, Mapping Methods, and Weight Tuning
The use of large transformer-based models such as BERT, GPT, and T5 has ...

04/15/2022 · CILDA: Contrastive Data Augmentation using Intermediate Layer Knowledge Distillation
Knowledge distillation (KD) is an efficient framework for compressing la...

02/08/2023 · An Empirical Study of Uniform-Architecture Knowledge Distillation in Document Ranking
Although BERT-based ranking models have been commonly used in commercial...

11/20/2022 · AI-KD: Adversarial learning and Implicit regularization for self-Knowledge Distillation
We present a novel adversarial penalized self-knowledge distillation met...

01/03/2022 · Which Student is Best? A Comprehensive Knowledge Distillation Exam for Task-Specific BERT Models
We perform knowledge distillation (KD) benchmark from task-specific BERT...

09/01/2020 · Automatic Assignment of Radiology Examination Protocols Using Pre-trained Language Models with Knowledge Distillation
Selecting radiology examination protocol is a repetitive, error-prone, a...
