Revisiting Intermediate Layer Distillation for Compressing Language Models: An Overfitting Perspective

02/03/2023
by Jongwoo Ko, et al.

Knowledge distillation (KD) is a highly promising method for mitigating the computational problems of pre-trained language models (PLMs). Among various KD approaches, Intermediate Layer Distillation (ILD) has become a de facto standard KD method owing to its strong performance in the NLP field. In this paper, we find that existing ILD methods are prone to overfitting the training dataset, even though they transfer more information than the original KD. We then present two simple observations that mitigate this overfitting: distilling only the last Transformer layer and conducting ILD on supplementary tasks. Based on these two findings, we propose a simple yet effective consistency-regularized ILD (CR-ILD), which prevents the student model from overfitting the training dataset. Extensive experiments on distilling BERT on the GLUE benchmark and several synthetic datasets demonstrate that our proposed ILD method outperforms other KD techniques. Our code is available at https://github.com/jongwooko/CR-ILD.
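To make the two ingredients concrete, below is a minimal, illustrative PyTorch-style sketch of (i) an ILD loss computed only on the last Transformer layer and (ii) a consistency-regularization term between the student's predictions on an input and an augmented view of it. This is not the authors' implementation (see the linked repository for that); the function names, the learned `projection` module, the symmetric-KL form of the consistency term, and the loss weighting are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def last_layer_ild_loss(student_hidden_states, teacher_hidden_states, projection):
    """ILD restricted to the LAST Transformer layer: MSE between the
    (projected) student hidden states and the teacher hidden states.
    The hidden-state arguments are sequences of [batch, seq_len, hidden]
    tensors (e.g., as returned with output_hidden_states=True);
    `projection` is a learned nn.Linear mapping student width to teacher width."""
    return F.mse_loss(projection(student_hidden_states[-1]),
                      teacher_hidden_states[-1].detach())

def consistency_loss(logits, logits_aug):
    """Consistency regularization (assumed symmetric-KL form): penalize
    divergence between the student's predictive distributions on the
    original input and on an augmented view of the same input."""
    log_p = F.log_softmax(logits, dim=-1)
    log_q = F.log_softmax(logits_aug, dim=-1)
    kl_pq = F.kl_div(log_p, log_q, reduction="batchmean", log_target=True)
    kl_qp = F.kl_div(log_q, log_p, reduction="batchmean", log_target=True)
    return 0.5 * (kl_pq + kl_qp)

# Hypothetical shapes: student width 384, teacher width 768, batch of 8.
if __name__ == "__main__":
    projection = nn.Linear(384, 768)
    s_hidden = [torch.randn(8, 128, 384)]   # student's last-layer hidden states
    t_hidden = [torch.randn(8, 128, 768)]   # teacher's last-layer hidden states
    ild = last_layer_ild_loss(s_hidden, t_hidden, projection)
    cr = consistency_loss(torch.randn(8, 3), torch.randn(8, 3))
    total = ild + 1.0 * cr                  # the weighting is a free hyperparameter
    print(total.item())
```

In practice these terms would be combined with the task loss and the standard prediction-level KD loss; the sketch only shows how restricting ILD to the final layer and adding a consistency penalty can be expressed.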

