The Emergence of Essential Sparsity in Large Pre-trained Models: The Weights that Matter

06/06/2023
by Ajay Jaiswal, et al.

Large pre-trained transformers are the show-stealers of modern-day deep learning, and it becomes crucial to comprehend the parsimonious patterns that exist within them as they grow in scale. With exploding parameter counts, the Lottery Ticket Hypothesis (LTH) and its variants have lost their pragmatism for sparsifying these models, due to the high computation and memory cost of the repetitive train-prune-retrain routine of iterative magnitude pruning (IMP), which worsens with increasing model size. In this paper, we comprehensively study induced sparse patterns across multiple large pre-trained vision and language transformers. We propose the existence of "essential sparsity", defined by a sharp dropping point in the sparsity-performance curve beyond which performance declines much faster as the sparsity level rises, when we directly remove the weights with the smallest magnitudes in one shot. We also present an intriguing emergent phenomenon of abrupt sparsification during the pre-training of BERT, i.e., BERT suddenly becomes heavily sparse after a certain number of pre-training iterations. Moreover, our observations indicate the counter-intuitive finding that BERT trained on a larger amount of pre-training data tends to have a better ability to condense knowledge into comparatively fewer parameters. Lastly, we investigate the effect of the pre-training loss on essential sparsity and discover that self-supervised learning (SSL) objectives trigger stronger emergent sparsification properties than supervised learning (SL). Our codes are available at <https://github.com/VITA-Group/essential_sparsity>.

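To make the one-shot magnitude-pruning probe behind "essential sparsity" concrete, here is a minimal PyTorch sketch (not the authors' released code, which lives at the repository above): the smallest-magnitude weights are zeroed globally in a single pass at a chosen sparsity level, and the pruned model is then evaluated without any retraining. The `load_pretrained` and `evaluate` helpers in the usage comment are hypothetical placeholders.

```python
import torch


def one_shot_magnitude_prune(model: torch.nn.Module, sparsity: float) -> None:
    """Zero out the `sparsity` fraction of smallest-magnitude weights, globally."""
    # Collect prunable weights (here: every weight tensor with 2+ dimensions).
    weights = [p for n, p in model.named_parameters() if p.dim() > 1 and "weight" in n]
    all_scores = torch.cat([w.detach().abs().flatten() for w in weights])
    # Global threshold: magnitudes at or below it are removed.
    k = int(sparsity * all_scores.numel())
    if k == 0:
        return
    threshold = torch.kthvalue(all_scores, k).values
    with torch.no_grad():
        for w in weights:
            w.mul_((w.abs() > threshold).to(w.dtype))


# Sweeping sparsity levels traces the sparsity-performance curve; the
# "essential sparsity" is the sharp dropping point of that curve.
# for s in (0.1, 0.3, 0.5, 0.7):
#     model = load_pretrained()            # hypothetical loader
#     one_shot_magnitude_prune(model, s)
#     print(s, evaluate(model))            # hypothetical evaluation
```

Note that, unlike IMP, this probe needs no retraining between pruning steps, which is what makes it tractable for billion-parameter pre-trained transformers.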
