FoundationLayerNorm: Scaling BERT and GPT to 1,000 Layers

04/09/2022
by Dezhou Shen, et al.

Mainstream BERT/GPT models contain only 10 to 20 layers, and the literature offers little discussion of training deeper BERT/GPT models. This paper proposes a simple yet effective method to stabilize BERT and GPT training. We successfully scale BERT and GPT to 1,000 layers, an order of magnitude deeper than previous BERT and GPT models. The proposed method, FoundationLayerNormalization, enables efficient training of deep neural networks and is validated at the 1,000-layer scale.
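The abstract does not spell out how FoundationLayerNorm works internally. As a rough illustration of the kind of depth-aware normalization scheme it alludes to, here is a minimal PyTorch sketch of a post-LayerNorm residual block whose skip connection is scaled by a depth-dependent factor; the class name, the alpha formula, and all shapes are assumptions for illustration, not the paper's actual method.

```python
# Hedged sketch only: the internals of FoundationLayerNorm are not described
# in this abstract. The depth-dependent alpha below follows the general recipe
# of depth-aware normalization schemes and is an assumption, not the paper's formula.
import torch
import torch.nn as nn


class ScaledResidualLayerNorm(nn.Module):
    """Post-LayerNorm residual connection with a depth-dependent scale (illustrative)."""

    def __init__(self, d_model: int, num_layers: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        # Assumed scaling: grows slowly with total depth so the residual path
        # stays dominant, one known way to keep very deep stacks trainable.
        self.alpha = (2 * num_layers) ** 0.25

    def forward(self, x: torch.Tensor, sublayer_out: torch.Tensor) -> torch.Tensor:
        # LN(alpha * x + F(x)) instead of the standard LN(x + F(x)).
        return self.norm(self.alpha * x + sublayer_out)


if __name__ == "__main__":
    block = ScaledResidualLayerNorm(d_model=64, num_layers=1000)
    x = torch.randn(2, 16, 64)           # (batch, seq_len, d_model)
    sublayer_out = torch.randn(2, 16, 64)
    print(block(x, sublayer_out).shape)   # torch.Size([2, 16, 64])
```

The intuition behind such scaling is that, as the stack approaches 1,000 layers, bounding the magnitude of each residual update at the LayerNorm keeps early-training gradients from exploding; the specific mechanism used by FoundationLayerNorm is described in the full paper.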


