Improving Knowledge Distillation for BERT Models: Loss Functions, Mapping Methods, and Weight Tuning

08/26/2023
by Apoorv Dankar, et al.

Large transformer-based models such as BERT, GPT, and T5 have led to significant advances in natural language processing. However, these models are computationally expensive, motivating model compression techniques that reduce their size and complexity while maintaining accuracy. This project investigates and applies knowledge distillation for BERT model compression, focusing on the TinyBERT student model. We explore several techniques to improve knowledge distillation, including alternative loss functions, transformer layer mapping methods, and tuning of the weights on the attention and representation losses, and we evaluate the proposed techniques on a selection of downstream tasks from the GLUE benchmark. The goal of this work is to make knowledge distillation more effective and efficient, enabling smaller yet accurate models for a range of natural language processing tasks.
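
To make the layer-mapping and loss-weighting ideas concrete, the sketch below shows a TinyBERT-style distillation objective in PyTorch: a uniform mapping from student to teacher transformer layers, an MSE term on attention matrices, an MSE term on (linearly projected) hidden representations, and tunable weights on the two terms. This is a minimal sketch under assumed shapes and hyperparameters; the names uniform_layer_map, distillation_loss, hidden_proj, alpha, and beta are illustrative and not taken from the paper's implementation.

```python
# Minimal TinyBERT-style distillation loss (sketch; names and shapes are
# illustrative assumptions, not the exact implementation from this work).
import torch
import torch.nn.functional as F


def uniform_layer_map(num_student_layers, num_teacher_layers):
    """Uniform mapping: student layer i distills from teacher layer (i + 1) * k - 1."""
    k = num_teacher_layers // num_student_layers
    return [(i + 1) * k - 1 for i in range(num_student_layers)]


def distillation_loss(student_attentions, teacher_attentions,
                      student_hiddens, teacher_hiddens,
                      hidden_proj, layer_map, alpha=1.0, beta=1.0):
    """Weighted sum of attention-matrix MSE and hidden-representation MSE.

    student_attentions / teacher_attentions: lists of [batch, heads, seq, seq]
    student_hiddens / teacher_hiddens: lists of [batch, seq, hidden]
    hidden_proj: nn.Linear projecting the student hidden size to the teacher's
    """
    attn_loss, hidden_loss = 0.0, 0.0
    for s_idx, t_idx in enumerate(layer_map):
        attn_loss = attn_loss + F.mse_loss(student_attentions[s_idx],
                                           teacher_attentions[t_idx])
        hidden_loss = hidden_loss + F.mse_loss(hidden_proj(student_hiddens[s_idx]),
                                               teacher_hiddens[t_idx])
    return alpha * attn_loss + beta * hidden_loss


if __name__ == "__main__":
    # Toy example: 4-layer student (hidden size 312), 12-layer teacher (hidden size 768).
    torch.manual_seed(0)
    batch, heads, seq = 2, 12, 8
    layer_map = uniform_layer_map(4, 12)            # -> [2, 5, 8, 11]
    hidden_proj = torch.nn.Linear(312, 768)
    s_attn = [torch.rand(batch, heads, seq, seq) for _ in range(4)]
    t_attn = [torch.rand(batch, heads, seq, seq) for _ in range(12)]
    s_hid = [torch.rand(batch, seq, 312) for _ in range(4)]
    t_hid = [torch.rand(batch, seq, 768) for _ in range(12)]
    loss = distillation_loss(s_attn, t_attn, s_hid, t_hid,
                             hidden_proj, layer_map, alpha=1.0, beta=0.5)
    print(loss.item())
```

In this sketch, alpha and beta play the role of the attention- and representation-loss weights tuned in this work, and swapping out uniform_layer_map is where alternative layer mapping methods would plug in.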

Related research

- MATE-KD: Masked Adversarial TExt, a Companion to Knowledge Distillation (05/12/2021). The advent of large pre-trained language models has given rise to rapid ...
- CAMeMBERT: Cascading Assistant-Mediated Multilingual BERT (12/22/2022). Large language models having hundreds of millions, and even billions, of...
- Smaller3d: Smaller Models for 3D Semantic Segmentation Using Minkowski Engine and Knowledge Distillation Methods (05/04/2023). There are various optimization techniques in the realm of 3D, including ...
- SDBERT: SparseDistilBERT, a faster and smaller BERT model (07/28/2022). In this work we introduce a new transformer architecture called SparseDi...
- Which Student is Best? A Comprehensive Knowledge Distillation Exam for Task-Specific BERT Models (01/03/2022). We perform knowledge distillation (KD) benchmark from task-specific BERT...
- LadaBERT: Lightweight Adaptation of BERT through Hybrid Model Compression (04/08/2020). BERT is a cutting-edge language representation model pre-trained by a la...
- BERT-of-Theseus: Compressing BERT by Progressive Module Replacing (02/07/2020). In this paper, we propose a novel model compression approach to effectiv...
