Sparse Progressive Distillation: Resolving Overfitting under Pretrain-and-Finetune Paradigm

10/15/2021
by Shaoyi Huang, et al.

Various pruning approaches have been proposed to reduce the footprint of Transformer-based language models. Conventional wisdom holds that pruning reduces a model's expressiveness and thus makes it more likely to underfit than overfit compared with the original model. However, under the trending pretrain-and-finetune paradigm, we argue that pruning increases the risk of overfitting when it is performed at the fine-tuning phase, since it increases the amount of information the model must learn from the downstream task, resulting in relative data deficiency. In this paper, we aim to address this overfitting issue under the pretrain-and-finetune paradigm and thereby improve pruning performance via progressive knowledge distillation (KD) and sparse pruning. Furthermore, to mitigate the interference among the learning-rate, pruning, and distillation schedules, we propose a three-stage learning framework. We show for the first time that reducing the risk of overfitting improves the effectiveness of pruning under the pretrain-and-finetune paradigm. Experiments on multiple datasets of the GLUE benchmark show that our method achieves highly competitive pruning performance compared with state-of-the-art approaches across different pruning-ratio constraints.
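The abstract does not describe the implementation, so the following is only a minimal illustrative sketch of how a knowledge-distillation loss can be combined with a progressively increasing magnitude-pruning schedule. The function names (distillation_loss, sparsity_at, magnitude_prune_) and the cubic ramp are assumptions for illustration, not the paper's actual method or API.

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
        # Soft-target KD term: match the teacher's softened output distribution.
        kd = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)
        # Hard-label cross-entropy on the downstream task.
        ce = F.cross_entropy(student_logits, labels)
        return alpha * kd + (1.0 - alpha) * ce

    def sparsity_at(step, total_steps, target_sparsity):
        # Ramp the pruning ratio from 0 toward the target as training progresses
        # (a cubic schedule, chosen here purely for illustration).
        progress = min(step / total_steps, 1.0)
        return target_sparsity * (1.0 - (1.0 - progress) ** 3)

    def magnitude_prune_(model, sparsity):
        # Zero out the smallest-magnitude weights of every Linear layer, in place.
        for module in model.modules():
            if isinstance(module, torch.nn.Linear):
                w = module.weight.data
                k = int(sparsity * w.numel())
                if k > 0:
                    threshold = w.abs().flatten().kthvalue(k).values
                    w.mul_((w.abs() > threshold).to(w.dtype))

In a training loop, one would prune the student to sparsity_at(step, total_steps, target_sparsity) at each step and back-propagate distillation_loss against a frozen fine-tuned teacher; the paper's three-stage schedule and specific sparsity pattern are not reproduced here.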

09/30/2021

Deep Neural Compression Via Concurrent Pruning and Self-Distillation

Pruning aims to reduce the number of parameters while maintaining perfor...
03/02/2023

Average of Pruning: Improving Performance and Stability of Out-of-Distribution Detection

Detecting Out-of-distribution (OOD) inputs has been a critical issue fo...
06/20/2020

Paying more attention to snapshots of Iterative Pruning: Improving Model Compression via Ensemble Distillation

Network pruning is one of the most dominant methods for reducing the hea...
05/31/2021

Greedy Layer Pruning: Decreasing Inference Time of Transformer Models

Fine-tuning transformer models after unsupervised pre-training reaches a...
04/04/2022

APP: Anytime Progressive Pruning

With the latest advances in deep learning, there has been a lot of focus...
04/01/2022

Structured Pruning Learns Compact and Accurate Models

The growing size of neural language models has led to increased attentio...
01/17/2018

Faster gaze prediction with dense networks and Fisher pruning

Predicting human fixations from images has recently seen large improveme...
