Speeding up Deep Model Training by Sharing Weights and Then Unsharing

10/08/2021
by Shuo Yang, et al.

We propose a simple and efficient approach for training the BERT model. Our approach exploits the special structure of BERT, which contains a stack of repeated modules (i.e., transformer encoders). We first train BERT with the weights shared across all the repeated modules up to a certain point; this phase learns the component of the weights that is common to all repeated layers. We then stop sharing weights and continue training until convergence. We present theoretical insights into this share-then-unshare training scheme, with analysis of simplified models. Empirical experiments on the BERT model show that our method yields better-performing trained models and significantly reduces the number of training iterations.
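The two-phase scheme lends itself to a short sketch. Below is a minimal PyTorch illustration, not the authors' implementation: the class name `SharedEncoderStack`, the use of `nn.TransformerEncoderLayer`, the layer sizes, and the point at which `unshare()` is called are all illustrative assumptions.

```python
import copy
import torch.nn as nn


class SharedEncoderStack(nn.Module):
    """Encoder stack that starts with one set of tied weights.

    A minimal sketch of share-then-unshare training, assuming standard
    nn.TransformerEncoderLayer blocks; hyperparameters are illustrative.
    """

    def __init__(self, num_layers=12, d_model=768, nhead=12):
        super().__init__()
        # Phase 1: every depth reuses the same physical layer, so all
        # "layers" share one set of weights (parameters() deduplicates).
        shared = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.layers = nn.ModuleList([shared] * num_layers)
        self.tied = True

    def unshare(self):
        # Phase 2: replace the tied references with independent copies,
        # each initialized from the learned shared weights.
        if self.tied:
            first = self.layers[0]
            self.layers = nn.ModuleList(
                [first] + [copy.deepcopy(first) for _ in range(len(self.layers) - 1)]
            )
            self.tied = False

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x


# Sketch of the two-phase schedule (training loop details omitted):
model = SharedEncoderStack()
# ... phase 1: train with tied weights up to some point ...
model.unshare()
# Rebuild the optimizer here so it tracks the now-independent parameters,
# then continue training until convergence.
```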


Related research

04/09/2022
FoundationLayerNorm: Scaling BERT and GPT to 1,000 Layers
The mainstream BERT/GPT model contains only 10 to 20 layers, and there i...

04/08/2021
An Exploratory Study on the Repeatedly Shared External Links on Stack Overflow
On Stack Overflow, users reuse 11,926,354 external links to share the re...

11/10/2021
BagBERT: BERT-based bagging-stacking for multi-topic classification
This paper describes our submission on the COVID-19 literature annotatio...

04/21/2020
Attention Module is Not Only a Weight: Analyzing Transformers with Vector Norms
Because attention modules are core components of Transformer-based model...

06/08/2023
Mixture-of-Supernets: Improving Weight-Sharing Supernet Training with Architecture-Routed Mixture-of-Experts
Weight-sharing supernet has become a vital component for performance est...

04/13/2022
CAMERO: Consistency Regularized Ensemble of Perturbed Language Models with Weight Sharing
Model ensemble is a popular approach to produce a low-variance and well-...

09/29/2022
Stop Wasting My Time! Saving Days of ImageNet and BERT Training with Latest Weight Averaging
Training vision or language models on large datasets can take days, if n...
