Learning to Win Lottery Tickets in BERT Transfer via Task-agnostic Mask Training

04/24/2022
by Yuanxin Liu, et al.

Recent studies on the lottery ticket hypothesis (LTH) show that pre-trained language models (PLMs) like BERT contain matching subnetworks whose transfer learning performance is similar to that of the original PLM. These subnetworks are found using magnitude-based pruning. In this paper, we find that BERT subnetworks have even more potential than these studies have shown. First, we discover that the success of magnitude pruning can be attributed to the preserved pre-training performance, which correlates with downstream transferability. Inspired by this, we propose to directly optimize the subnetwork structure towards the pre-training objectives, which better preserves pre-training performance. Specifically, we train binary masks over model weights on the pre-training tasks, with the aim of preserving the universal transferability of the subnetwork, which is agnostic to any specific downstream task. We then fine-tune the subnetworks on the GLUE benchmark and the SQuAD dataset. The results show that, compared with magnitude pruning, mask training can effectively find BERT subnetworks with improved overall performance on downstream tasks. Moreover, our method is more efficient in searching for subnetworks and more advantageous when fine-tuning in low-resource settings (within a certain range of downstream data sizes). Our code is available at https://github.com/llyx97/TAMT.
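The core idea of the abstract is to learn a binary mask over frozen pre-trained weights by optimizing the pre-training objective. A common way to do this is with per-weight real-valued scores that are binarized in the forward pass and updated through a straight-through estimator. Below is a minimal sketch of that technique, assuming PyTorch; the module name `MaskedLinear`, the score initialization, and the thresholding scheme are illustrative assumptions, not the authors' exact implementation (see the linked TAMT repository for that).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedLinear(nn.Module):
    """Linear layer whose frozen pre-trained weights are gated by a
    trainable binary mask (a sketch of task-agnostic mask training).

    Real-valued scores are thresholded to {0, 1} in the forward pass;
    gradients flow to the scores via the straight-through estimator.
    Only the scores are trained, on the pre-training objective (e.g.
    masked language modeling); the weights themselves stay fixed.
    """

    def __init__(self, linear: nn.Linear, threshold: float = 0.0):
        super().__init__()
        # Freeze the pre-trained weights and bias.
        self.weight = nn.Parameter(linear.weight.detach().clone(),
                                   requires_grad=False)
        self.bias = (nn.Parameter(linear.bias.detach().clone(),
                                  requires_grad=False)
                     if linear.bias is not None else None)
        # One trainable score per weight; initializing above the
        # threshold keeps all weights at the start (an assumption).
        self.scores = nn.Parameter(torch.full_like(self.weight, 0.01))
        self.threshold = threshold

    def forward(self, x):
        hard_mask = (self.scores > self.threshold).float()
        # Straight-through estimator: hard_mask in the forward pass,
        # identity gradient to the real-valued scores in the backward.
        mask = hard_mask + self.scores - self.scores.detach()
        return F.linear(x, self.weight * mask, self.bias)
```

In use, one would wrap each `nn.Linear` inside the PLM with such a module, train only the `scores` parameters against the pre-training loss, then keep the binarized mask fixed and fine-tune the resulting subnetwork on downstream tasks such as GLUE or SQuAD.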

Related research

02/19/2020 · Compressing BERT: Studying the Effects of Weight Pruning on Transfer Learning
Universal feature extractors, such as BERT for natural language processi...

07/27/2023 · Regularized Mask Tuning: Uncovering Hidden Knowledge in Pre-trained Vision-Language Models
Prompt tuning and adapter tuning have shown great potential in transferr...

05/14/2021 · BERT Busters: Outlier LayerNorm Dimensions that Disrupt BERT
Multiple studies have shown that BERT is remarkably robust to pruning, y...

06/18/2023 · Instant Soup: Cheap Pruning Ensembles in A Single Pass Can Draw Lottery Tickets from Large Models
Large pre-trained transformers have been receiving explosive attention i...

03/27/2023 · Generalizable Local Feature Pre-training for Deformable Shape Analysis
Transfer learning is fundamental for addressing problems in settings wit...

03/12/2022 · ELLE: Efficient Lifelong Pre-training for Emerging Data
Current pre-trained language models (PLM) are typically trained with sta...

06/06/2023 · The Emergence of Essential Sparsity in Large Pre-trained Models: The Weights that Matter
Large pre-trained transformers are show-stealer in modern-day deep learn...
