Learning to Augment for Data-Scarce Domain BERT Knowledge Distillation

01/20/2021
by   Lingyun Feng, et al.

Although pre-trained language models such as BERT have achieved impressive performance on a wide range of natural language processing tasks, they are computationally expensive to deploy in real-time applications. A typical remedy is knowledge distillation, which compresses these large pre-trained models (teacher models) into small student models. However, for a target domain with scarce training data, the teacher can hardly pass useful knowledge to the student, which degrades the student's performance. To tackle this problem, we propose a method that learns to augment data for data-scarce domain BERT knowledge distillation: it learns a cross-domain manipulation scheme that automatically augments the target domain with the help of resource-rich source domains. Specifically, the proposed method generates samples drawn from a stationary distribution near the target data and adopts a reinforced selector to automatically refine the augmentation strategy according to the student's performance. Extensive experiments demonstrate that the proposed method significantly outperforms state-of-the-art baselines on four different tasks, and that for data-scarce domains the compressed student models even outperform the original large teacher model with far fewer parameters (only ∼13.3%), when only a few labeled examples are available.
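
The abstract describes a loop in which augmented samples generated near the target distribution are filtered by a reinforced selector whose reward is the student's performance under distillation. The sketch below is a minimal, self-contained PyTorch illustration of that loop; the module names, the mixing-based `generate_candidates` heuristic, the label reuse, and all hyperparameters are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Bernoulli

DIM, CLASSES = 32, 2

teacher = nn.Linear(DIM, CLASSES)   # stands in for the large fine-tuned teacher (e.g., BERT)
student = nn.Linear(DIM, CLASSES)   # compact student to be distilled
selector = nn.Linear(DIM, 1)        # reinforced selector: scores each augmented candidate

stu_opt = torch.optim.Adam(student.parameters(), lr=1e-3)
sel_opt = torch.optim.Adam(selector.parameters(), lr=1e-2)

def generate_candidates(source_x, target_x, alpha=0.3):
    """Mix resource-rich source samples toward the target data
    (a stand-in for the paper's cross-domain manipulation scheme)."""
    idx = torch.randint(0, target_x.size(0), (source_x.size(0),))
    return (1 - alpha) * source_x + alpha * target_x[idx]

def kd_loss(student_logits, teacher_logits, labels, T=2.0, lam=0.5):
    """Soft-label distillation (KL at temperature T) plus hard-label cross-entropy."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * T * T
    hard = F.cross_entropy(student_logits, labels)
    return lam * soft + (1 - lam) * hard

def dev_accuracy(model, x, y):
    with torch.no_grad():
        return (model(x).argmax(-1) == y).float().mean().item()

# Dummy data standing in for source-domain, scarce target-domain, and target dev sets.
src_x, src_y = torch.randn(64, DIM), torch.randint(0, CLASSES, (64,))
tgt_x, tgt_y = torch.randn(8, DIM), torch.randint(0, CLASSES, (8,))
dev_x, dev_y = torch.randn(16, DIM), torch.randint(0, CLASSES, (16,))

baseline = dev_accuracy(student, dev_x, dev_y)
for step in range(100):
    # 1) Generate augmented candidates near the target data.
    aug_x = generate_candidates(src_x, tgt_x)
    aug_y = src_y  # reuse source labels for the mixed samples (illustrative choice)

    # 2) The reinforced selector samples a keep/drop decision for each candidate.
    keep_prob = torch.sigmoid(selector(aug_x)).squeeze(-1)
    dist = Bernoulli(probs=keep_prob)
    keep = dist.sample()

    # 3) Distill the (frozen) teacher into the student on target + kept augmented samples.
    x = torch.cat([tgt_x, aug_x[keep.bool()]])
    y = torch.cat([tgt_y, aug_y[keep.bool()]])
    with torch.no_grad():
        t_logits = teacher(x)
    loss = kd_loss(student(x), t_logits, y)
    stu_opt.zero_grad(); loss.backward(); stu_opt.step()

    # 4) Reward = improvement in student dev accuracy; update the selector via REINFORCE.
    acc = dev_accuracy(student, dev_x, dev_y)
    reward = acc - baseline
    baseline = 0.9 * baseline + 0.1 * acc  # moving-average baseline to reduce variance
    sel_loss = -(dist.log_prob(keep).sum() * reward)
    sel_opt.zero_grad(); sel_loss.backward(); sel_opt.step()
```

In this sketch the selector is trained purely from the reward signal, so any augmentation heuristic can be plugged into `generate_candidates` without changing the rest of the loop; in the paper the reward is likewise derived from the student's performance rather than from a hand-crafted filtering rule.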

Related research

12/02/2020  Meta-KD: A Meta Knowledge Distillation Framework for Language Model Compression across Domains
Pre-trained language models have been applied to various NLP tasks with ...

05/22/2023  Improving Robustness in Knowledge Distillation Using Domain-Targeted Data Augmentation
Applying knowledge distillation encourages a student model to behave mor...

02/19/2023  HomoDistil: Homotopic Task-Agnostic Distillation of Pre-trained Transformers
Knowledge distillation has been shown to be a powerful model compression...

05/08/2020  Distilling Knowledge from Pre-trained Language Models via Text Smoothing
This paper studies compressing pre-trained language models, like BERT (D...

10/27/2021  Mosaicking to Distill: Knowledge Distillation from Out-of-Domain Data
Knowledge distillation (KD) aims to craft a compact student model that i...

01/27/2023  Improved knowledge distillation by utilizing backward pass knowledge in neural networks
Knowledge distillation (KD) is one of the prominent techniques for model...

07/07/2023  Distilling Universal and Joint Knowledge for Cross-Domain Model Compression on Time Series Data
For many real-world time series tasks, the computational complexity of p...
