Foundation Transformers

10/12/2022
by Hongyu Wang, et al.

A big convergence of model architectures across language, vision, speech, and multimodal modeling is emerging. However, despite sharing the name "Transformer", these areas use different implementations for better performance, e.g., Post-LayerNorm for BERT versus Pre-LayerNorm for GPT and vision Transformers. We call for the development of a Foundation Transformer for true general-purpose modeling: a go-to architecture for various tasks and modalities with guaranteed training stability. In this work, we introduce a Transformer variant, named Magneto, to fulfill that goal. Specifically, we propose Sub-LayerNorm for good expressivity, along with an initialization strategy theoretically derived from DeepNet for stable scaling up. Extensive experiments demonstrate its superior performance and better stability compared with the de facto Transformer variants designed for various applications, including language modeling (i.e., BERT and GPT), machine translation, vision pretraining (i.e., BEiT), speech recognition, and multimodal pretraining (i.e., BEiT-3).
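For readers who want the Sub-LayerNorm idea in concrete form, below is a minimal PyTorch sketch of one Magneto-style block. It only illustrates the layout implied by the abstract: a second LayerNorm inside each sublayer, placed before the output projection, on top of the usual Pre-LN residual structure, plus a DeepNet-style scaled initialization. The class name, the attention details, and the treatment of gamma as a plain hyperparameter are assumptions of this sketch, not the paper's reference implementation; the paper's actual derived values of gamma (which depend on depth and architecture) are not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubLNBlock(nn.Module):
    """Illustrative Magneto-style block with Sub-LayerNorm (a sketch,
    not the paper's reference code). Each sublayer keeps the usual
    pre-sublayer LayerNorm and adds a second LayerNorm right before
    its output projection."""

    def __init__(self, d_model: int, n_heads: int, d_ffn: int, gamma: float = 1.0):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads

        # Attention sublayer: LN on the input, LN before the output projection.
        self.ln_attn_in = nn.LayerNorm(d_model)
        self.ln_attn_out = nn.LayerNorm(d_model)
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.proj = nn.Linear(d_model, d_model)

        # FFN sublayer, same pattern: LN before W1 and LN before W2.
        self.ln_ffn_in = nn.LayerNorm(d_model)
        self.ln_ffn_out = nn.LayerNorm(d_model)
        self.w1 = nn.Linear(d_model, d_ffn)
        self.w2 = nn.Linear(d_ffn, d_model)

        # DeepNet-style scaled initialization. The paper derives gamma per
        # architecture (as a function of depth); here it is a plain
        # hyperparameter so the sketch stays self-contained.
        for lin in (self.proj, self.w1, self.w2):
            nn.init.xavier_normal_(lin.weight, gain=gamma)
            nn.init.zeros_(lin.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape  # (batch, sequence, d_model)

        # Self-attention with the extra LayerNorm before the output projection.
        h = self.ln_attn_in(x)
        q, k, v = self.qkv(h).chunk(3, dim=-1)
        q, k, v = (z.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
                   for z in (q, k, v))
        h = F.scaled_dot_product_attention(q, k, v)  # PyTorch >= 2.0
        h = h.transpose(1, 2).reshape(b, t, d)
        x = x + self.proj(self.ln_attn_out(h))

        # Feed-forward with the extra LayerNorm before the second projection.
        h = F.gelu(self.w1(self.ln_ffn_in(x)))
        x = x + self.w2(self.ln_ffn_out(h))
        return x
```

Stacking such blocks yields a drop-in replacement for a Pre-LN Transformer encoder; the only structural change relative to Pre-LN is the additional LayerNorm in front of each sublayer's output projection.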


Related research

08/22/2022 · Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks
A big convergence of language, vision, and multimodal pretraining is eme...

06/02/2022 · VL-BEiT: Generative Vision-Language Pretraining
We introduce a vision-language foundation model called VL-BEiT, which is...

06/13/2022 · Multimodal Learning with Transformers: A Survey
Transformer is a promising neural network learner, and has achieved grea...

03/11/2023 · Stabilizing Transformer Training by Preventing Attention Entropy Collapse
Training stability is of great importance to Transformers. In this work,...

05/23/2023 · All Roads Lead to Rome? Exploring the Invariance of Transformers' Representations
Transformer models bring propelling advances in various NLP tasks, thus ...

10/21/2022 · Do Vision-and-Language Transformers Learn Grounded Predicate-Noun Dependencies?
Recent advances in vision-and-language modeling have seen the developmen...

05/04/2023 · BranchNorm: Robustly Scaling Extremely Deep Transformers
Recently, DeepNorm scales Transformers into extremely deep (i.e., 1000 l...
