Which Student is Best? A Comprehensive Knowledge Distillation Exam for Task-Specific BERT Models

01/03/2022
by Made Nindyatama Nityasya, et al.

We perform a knowledge distillation (KD) benchmark from task-specific BERT-base teacher models to various student models: BiLSTM, CNN, BERT-Tiny, BERT-Mini, and BERT-Small. Our experiments involve 12 datasets grouped into two tasks: text classification and sequence labeling in the Indonesian language. We also compare various aspects of distillation, including the use of word embeddings and unlabeled data augmentation. Our experiments show that, despite the rising popularity of Transformer-based models, BiLSTM and CNN student models provide the best trade-off between performance and computational resources (CPU, RAM, and storage) compared to pruned BERT models. We further propose quick wins for performing KD to produce small NLP models via efficient KD training mechanisms involving simple choices of loss functions, word embeddings, and unlabeled data preparation.
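As a concrete illustration of the distillation setup described above, the sketch below shows a task-specific KD objective that mixes the teacher's softened logits with the gold labels. This is a minimal example assuming the common soft-target/hard-label formulation; the function name distillation_loss, the temperature, and the mixing weight alpha are illustrative assumptions, not the paper's reported settings.

```python
# Minimal sketch of a task-specific KD loss: weighted sum of a soft-target
# (teacher logit) term and a hard-label cross-entropy term. Values are illustrative.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      temperature: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    """Combine soft-target KL divergence with hard-label cross-entropy."""
    # Soft targets: soften both distributions with the temperature, then compute KL.
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    soft_loss = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature ** 2
    # Hard targets: ordinary cross-entropy against the gold labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss

# Example: a batch of 4 examples over 3 classes (e.g., a text-classification task).
student_logits = torch.randn(4, 3, requires_grad=True)
teacher_logits = torch.randn(4, 3)  # produced by the frozen BERT-base teacher
labels = torch.tensor([0, 2, 1, 0])
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```

The same loss applies unchanged to the sequence-labeling tasks if the logits are flattened over token positions before the cross-entropy and KL terms are computed.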

