RAIL-KD: RAndom Intermediate Layer Mapping for Knowledge Distillation

09/21/2021
by   Md. Akmal Haidar, et al.

Intermediate layer knowledge distillation (KD) can improve the standard KD technique (which only targets the outputs of the teacher and student models), especially for large pre-trained language models. However, intermediate layer distillation suffers from the excessive computational burden and engineering effort required to set up a proper layer mapping. To address these problems, we propose a RAndom Intermediate Layer Knowledge Distillation (RAIL-KD) approach in which intermediate layers from the teacher model are selected randomly to be distilled into the intermediate layers of the student model. This randomized selection ensures that all teacher layers are taken into account during training, while reducing the computational cost of intermediate layer distillation. We also show that it acts as a regularizer, improving the generalizability of the student model. We perform extensive experiments on GLUE tasks as well as on out-of-domain test sets. We show that our proposed RAIL-KD approach outperforms other state-of-the-art intermediate layer KD methods considerably in both performance and training time.
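The abstract only sketches the mechanism, so here is a minimal illustration of what a random intermediate-layer mapping could look like in PyTorch. It is a sketch under assumptions, not the authors' implementation: the mean pooling, the linear projections (proj_t, proj_s), and the normalized MSE loss are illustrative choices, and the function name rail_kd_loss is hypothetical.

import random
import torch.nn.functional as F

def rail_kd_loss(teacher_hidden, student_hidden, proj_t, proj_s):
    """Distill a random subset of teacher layers into the student's intermediate layers.

    teacher_hidden: list of teacher layer outputs, each (batch, seq_len, dim_t)
    student_hidden: list of student layer outputs, each (batch, seq_len, dim_s)
    proj_t, proj_s: linear layers mapping pooled states into a shared space
    (all names and shapes here are assumptions for illustration)
    """
    k = len(student_hidden)
    # Randomly pick k teacher layers without replacement and keep them in order,
    # so that over many training steps every teacher layer gets a chance to be used.
    idx = sorted(random.sample(range(len(teacher_hidden)), k))
    loss = 0.0
    for s_h, t_i in zip(student_hidden, idx):
        # Pool over the sequence dimension, project, and normalize before matching.
        t = F.normalize(proj_t(teacher_hidden[t_i].mean(dim=1)), dim=-1)
        s = F.normalize(proj_s(s_h.mean(dim=1)), dim=-1)
        loss = loss + F.mse_loss(s, t)
    return loss / k

In training, a loss of this form would typically be added to the standard KD objective (soft-label cross-entropy plus the task loss), with the random layer subset re-drawn at each step.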

