RealFormer: Transformer Likes Residual Attention

12/21/2020
by Ruining He, et al.

Transformer is the backbone of modern NLP models. In this paper, we propose RealFormer, a simple Residual Attention Layer Transformer architecture that significantly outperforms canonical Transformers on a spectrum of tasks including Masked Language Modeling, GLUE, and SQuAD. Qualitatively, RealFormer is easy to implement and requires minimal hyper-parameter tuning. It also stabilizes training and leads to models with sparser attentions. Code will be open-sourced upon paper acceptance.
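The abstract describes the residual attention layer only at a high level; the core idea in the paper is to add each layer's pre-softmax attention scores to the scores of the next layer, forming a residual "skip edge" over attention. Below is a minimal PyTorch sketch of that idea; the class name, tensor shapes, and default hyper-parameters are illustrative assumptions, not the authors' released code.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class ResidualAttention(nn.Module):
    # Multi-head self-attention where the previous layer's pre-softmax
    # attention scores are added to the current layer's scores
    # ("residual attention"). Names and defaults are illustrative only.
    def __init__(self, d_model: int = 512, num_heads: int = 8):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x, prev_scores=None):
        # x: (batch, seq, d_model); prev_scores: (batch, heads, seq, seq) or None
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(b, t, self.num_heads, self.d_head).transpose(1, 2)
        k = k.view(b, t, self.num_heads, self.d_head).transpose(1, 2)
        v = v.view(b, t, self.num_heads, self.d_head).transpose(1, 2)

        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        if prev_scores is not None:
            # Residual connection over attention: reuse last layer's raw scores.
            scores = scores + prev_scores

        attn = F.softmax(scores, dim=-1)
        ctx = (attn @ v).transpose(1, 2).reshape(b, t, -1)
        # Return the raw scores so the next layer can add them to its own.
        return self.out(ctx), scores


# Chaining two layers: the second layer receives the first layer's scores.
x = torch.randn(2, 16, 512)
layer1, layer2 = ResidualAttention(), ResidualAttention()
h, s = layer1(x)
h, s = layer2(h, prev_scores=s)
```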


Related research

09/17/2021  Primer: Searching for Efficient Transformers for Language Modeling
    Large Transformer models have been central to recent advances in natural...

03/14/2022  Efficient Language Modeling with Sparse all-MLP
    All-MLP architectures have attracted increasing interest as an alternati...

07/13/2022  N-Grammer: Augmenting Transformers with latent n-grams
    Transformer models have recently emerged as one of the foundational mode...

02/12/2023  Transformer models: an introduction and catalog
    In the past few years we have seen the meteoric appearance of dozens of ...

06/01/2023  Learning Transformer Programs
    Recent research in mechanistic interpretability has attempted to reverse...

02/12/2020  On Layer Normalization in the Transformer Architecture
    The Transformer is widely used in natural language processing tasks. To ...

05/28/2023  Key-Value Transformer
    Transformers have emerged as the prevailing standard solution for variou...
