
Sparse Progressive Distillation: Resolving Overfitting under Pretrain-and-Finetune Paradigm

10/15/2021
by   Shaoyi Huang, et al.
Northeastern University
Penn State University
University of Connecticut
The University of Texas at San Antonio
Stevens Institute of Technology

Various pruning approaches have been proposed to reduce the footprint requirements of Transformer-based language models. Conventional wisdom holds that pruning reduces model expressiveness, so a pruned model is more likely to underfit than to overfit compared to the original model. However, under the trending pretrain-and-finetune paradigm, we argue that pruning increases the risk of overfitting when it is performed during the fine-tuning phase, as it increases the amount of information a model needs to learn from the downstream task, resulting in relative data deficiency. In this paper, we aim to address this overfitting issue under the pretrain-and-finetune paradigm and improve pruning performance via progressive knowledge distillation (KD) and sparse pruning. Furthermore, to mitigate interference among the learning-rate, pruning, and distillation schedules, we propose a three-stage learning framework. We show for the first time that reducing the risk of overfitting can improve the effectiveness of pruning under the pretrain-and-finetune paradigm. Experiments on multiple datasets of the GLUE benchmark show that our method achieves highly competitive pruning performance compared with state-of-the-art methods across different pruning-ratio constraints.
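To make the combination of knowledge distillation and sparse pruning concrete, the following is a minimal illustrative sketch, not the authors' implementation: it blends a hard-label task loss with a soft-label KD loss from a frozen teacher while progressively zeroing out the smallest-magnitude student weights. The module sizes, the cubic sparsity schedule, and all hyperparameters here are assumptions for demonstration only, and the toy linear layers merely stand in for Transformer components.

```python
# Illustrative sketch (assumed setup, not the paper's exact method): distillation
# plus progressive magnitude pruning of a student model.
import torch
import torch.nn as nn
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend hard-label cross-entropy with temperature-scaled KL to the teacher."""
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce


def current_sparsity(step, total_steps, final_sparsity=0.9):
    """Cubic schedule: sparsity grows from 0 toward final_sparsity during training."""
    progress = min(step / total_steps, 1.0)
    return final_sparsity * (1.0 - (1.0 - progress) ** 3)


def magnitude_prune_(module, sparsity):
    """Zero out the smallest-magnitude weights in place (unstructured pruning)."""
    with torch.no_grad():
        flat = module.weight.abs().flatten()
        k = int(sparsity * flat.numel())
        if k > 0:
            threshold = torch.kthvalue(flat, k).values
            mask = (module.weight.abs() > threshold).float()
            module.weight.mul_(mask)


# Toy teacher/student classifiers standing in for fine-tuned Transformer heads.
teacher = nn.Linear(128, 2)
student = nn.Linear(128, 2)
optimizer = torch.optim.AdamW(student.parameters(), lr=2e-5)

total_steps = 100
for step in range(total_steps):
    x = torch.randn(16, 128)
    labels = torch.randint(0, 2, (16,))
    with torch.no_grad():
        teacher_logits = teacher(x)
    student_logits = student(x)
    loss = distillation_loss(student_logits, teacher_logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Gradually raise sparsity so the student adapts as weights are removed.
    magnitude_prune_(student, current_sparsity(step, total_steps))
```

In this sketch the distillation signal and the gradually increasing sparsity act together: the teacher's soft labels supply extra supervision that counteracts the relative data deficiency the abstract describes, while the cubic schedule avoids removing too many weights at once.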
