
Sparse Progressive Distillation: Resolving Overfitting under Pretrain-and-Finetune Paradigm

by   Shaoyi Huang, et al.
Northeastern University
Penn State University
University of Connecticut
The University of Texas at San Antonio
Stevens Institute of Technology

Various pruning approaches have been proposed to reduce the memory footprint of Transformer-based language models. Conventional wisdom holds that pruning reduces model expressiveness and is therefore more likely to underfit than overfit relative to the original model. However, under the trending pretrain-and-finetune paradigm, we argue that pruning increases the risk of overfitting when it is performed during the fine-tuning phase, since it increases the amount of information the model must learn from the downstream task, resulting in relative data deficiency. In this paper, we address this overfitting issue under the pretrain-and-finetune paradigm to improve pruning performance via progressive knowledge distillation (KD) and sparse pruning. Furthermore, to mitigate interference among the learning-rate, pruning, and distillation schedules, we propose a three-stage learning framework. We show for the first time that reducing the risk of overfitting can improve the effectiveness of pruning under the pretrain-and-finetune paradigm. Experiments on multiple datasets of the GLUE benchmark show that our method achieves highly competitive pruning performance compared with state-of-the-art methods across different pruning-ratio constraints.
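The two ingredients named in the abstract, unstructured (sparse) magnitude pruning and knowledge distillation, can be sketched in a few lines of framework-free Python. This is a minimal illustration of the general techniques, not the authors' implementation: `magnitude_prune` zeros the smallest-magnitude fraction of weights up to a target sparsity, and `distillation_loss` blends the hard-label cross-entropy with a Hinton-style soft-target term; the function names, the mixing weight `alpha`, and the temperature `T` are illustrative choices.

```python
import math

def magnitude_prune(weights, sparsity):
    """Unstructured magnitude pruning: zero out the smallest-magnitude
    `sparsity` fraction of weights, keeping the rest unchanged."""
    k = int(len(weights) * sparsity)  # number of weights to remove
    if k == 0:
        return list(weights)
    threshold = sorted(abs(w) for w in weights)[k - 1]
    return [w if abs(w) > threshold else 0.0 for w in weights]

def distillation_loss(student_logits, teacher_logits, label, alpha=0.5, T=2.0):
    """Blend hard-label cross-entropy with a soft-target KD term.
    `alpha` weights the distillation term; `T` is the softmax temperature."""
    def softmax(xs, t=1.0):
        m = max(xs)  # subtract max for numerical stability
        exps = [math.exp((x - m) / t) for x in xs]
        s = sum(exps)
        return [e / s for e in exps]
    ce = -math.log(softmax(student_logits)[label])  # hard-label cross-entropy
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    # Soft cross-entropy against the teacher's tempered distribution,
    # scaled by T^2 so gradients keep comparable magnitude across temperatures.
    kd = -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))
    return (1 - alpha) * ce + alpha * (T * T) * kd
```

In a progressive schedule such as the one the paper alludes to, `sparsity` would be increased gradually over training steps rather than applied in one shot, with the distillation loss guiding the remaining weights throughout.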

