How to Select One Among All? An Extensive Empirical Study Towards the Robustness of Knowledge Distillation in Natural Language Understanding

09/13/2021
by   Tianda Li, et al.

Knowledge Distillation (KD) is a model compression algorithm that helps transfer the knowledge of a large neural network into a smaller one. Even though KD has shown promise on a wide range of Natural Language Processing (NLP) applications, little is understood about how one KD algorithm compares to another and whether these approaches can be complementary to each other. In this work, we evaluate various KD algorithms under in-domain, out-of-domain, and adversarial testing. We propose a framework to assess the adversarial robustness of multiple KD algorithms. Moreover, we introduce a new KD algorithm, Combined-KD, which takes advantage of two promising approaches (a better training scheme and more efficient data augmentation). Our extensive experimental results show that Combined-KD achieves state-of-the-art results on the GLUE benchmark, out-of-domain generalization, and adversarial robustness compared to competitive methods.
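For readers unfamiliar with the technique the abstract builds on, the sketch below shows the standard soft-label KD objective (temperature-softened KL divergence from the teacher plus cross-entropy on the gold labels) in PyTorch. It is a minimal reference point, not the paper's Combined-KD; the temperature T, weight alpha, and tensor shapes are illustrative assumptions.

# Minimal sketch of the standard soft-label KD objective (Hinton-style).
# NOT the paper's Combined-KD; T and alpha are illustrative assumptions.
import torch
import torch.nn.functional as F

def kd_loss(student_logits: torch.Tensor,
            teacher_logits: torch.Tensor,
            labels: torch.Tensor,
            T: float = 2.0,
            alpha: float = 0.5) -> torch.Tensor:
    # Soft targets: KL divergence between temperature-softened distributions,
    # scaled by T^2 to keep gradient magnitudes comparable across temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy against the gold labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Example: a batch of 8 examples over 3 classes (e.g. an NLI-style task).
student_logits = torch.randn(8, 3)
teacher_logits = torch.randn(8, 3)
labels = torch.randint(0, 3, (8,))
print(kd_loss(student_logits, teacher_logits, labels))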

research · 12/31/2020
Towards Zero-Shot Knowledge Distillation for Natural Language Processing
Knowledge Distillation (KD) is a common knowledge transfer algorithm use...

research · 11/08/2022
Understanding the Role of Mixup in Knowledge Distillation: An Empirical Study
Mixup is a popular data augmentation technique based on creating new sam...

research · 12/14/2020
LRC-BERT: Latent-representation Contrastive Knowledge Distillation for Natural Language Understanding
The pre-training models such as BERT have achieved great results in vari...

research · 03/15/2022
Generalized but not Robust? Comparing the Effects of Data Modification Methods on Out-of-Domain Generalization and Adversarial Robustness
Data modification, either via additional training datasets, data augment...

research · 01/10/2021
Adversarially robust and explainable model compression with on-device personalization for NLP applications
On-device Deep Neural Networks (DNNs) have recently gained more attentio...

research · 04/08/2020
LadaBERT: Lightweight Adaptation of BERT through Hybrid Model Compression
BERT is a cutting-edge language representation model pre-trained by a la...

research · 06/02/2021
Not All Knowledge Is Created Equal
Mutual knowledge distillation (MKD) improves a model by distilling knowl...
