Compressing BERT: Studying the Effects of Weight Pruning on Transfer Learning

02/19/2020
by Mitchell A. Gordon, et al.

Universal feature extractors, such as BERT for natural language processing and VGG for computer vision, have become effective methods for improving deep learning models without requiring more labeled data. A common paradigm is to pre-train a feature extractor on large amounts of data, then fine-tune it as part of a deep learning model on some downstream task (i.e., transfer learning). While effective, feature extractors like BERT may be prohibitively large for some deployment scenarios. We explore weight pruning for BERT and ask: how does compression during pre-training affect transfer learning? We find that pruning affects transfer learning in three broad regimes. Low levels of pruning (30-40%) do not affect pre-training loss or transfer to downstream tasks at all. Medium levels of pruning increase the pre-training loss and prevent useful pre-training information from being transferred to downstream tasks. High levels of pruning additionally prevent models from fitting downstream datasets, leading to further degradation. Finally, we observe that fine-tuning BERT on a specific task does not improve its prunability. We conclude that BERT can be pruned once during pre-training rather than separately for each task without affecting performance.
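
The pruning the abstract refers to is magnitude-based weight pruning applied to the pre-trained encoder before fine-tuning. Below is a minimal sketch of that idea using PyTorch's torch.nn.utils.prune and a Hugging Face BERT checkpoint; the 30% sparsity level, the restriction to the encoder's Linear weights, and the use of global L1 (magnitude) pruning are illustrative assumptions, not the authors' exact recipe.

import torch
from torch.nn.utils import prune
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")

# Collect the weight tensor of every Linear layer in the encoder.
to_prune = [
    (module, "weight")
    for module in model.encoder.modules()
    if isinstance(module, torch.nn.Linear)
]

# Zero out the 30% of weights with the smallest magnitude, measured globally
# across all collected tensors (the "low pruning" regime from the abstract).
prune.global_unstructured(
    to_prune, pruning_method=prune.L1Unstructured, amount=0.3
)

# Make the masks permanent so the pruned model can be fine-tuned or saved as usual.
for module, name in to_prune:
    prune.remove(module, name)

# Report the achieved sparsity over the pruned parameters.
zeros = sum(int((m.weight == 0).sum()) for m, _ in to_prune)
total = sum(m.weight.numel() for m, _ in to_prune)
print(f"Encoder Linear sparsity: {zeros / total:.1%}")

After this step the pruned model would be fine-tuned on a downstream task as usual; the paper's finding is that, up to roughly 40% sparsity, this pruning leaves both pre-training loss and downstream performance essentially unchanged.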

Related research

Learning to Win Lottery Tickets in BERT Transfer via Task-agnostic Mask Training (04/24/2022)
Recent studies on the lottery ticket hypothesis (LTH) show that pre-trai...

Reweighted Proximal Pruning for Large-Scale Language Representation (09/27/2019)
Recently, pre-trained language representation flourishes as the mainstay...

Pre-Train Your Loss: Easy Bayesian Transfer Learning with Informative Priors (05/20/2022)
Deep learning is increasingly moving towards a transfer learning paradig...

On the Prunability of Attention Heads in Multilingual BERT (09/26/2021)
Large multilingual models, such as mBERT, have shown promise in crosslin...

Reconstruction Task Finds Universal Winning Tickets (02/23/2022)
Pruning well-trained neural networks is effective to achieve a promising...

Rethinking Network Pruning – under the Pre-train and Fine-tune Paradigm (04/18/2021)
Transformer-based pre-trained language models have significantly improve...

AUBER: Automated BERT Regularization (09/30/2020)
How can we effectively regularize BERT? Although BERT proves its effecti...
