ReZero is All You Need: Fast Convergence at Large Depth

03/10/2020
by Thomas Bachlechner, et al.

Deep networks have enabled significant performance gains across domains, but they often suffer from vanishing/exploding gradients. This is especially true for Transformer architectures, where depth beyond 12 layers is difficult to train without large datasets and computational budgets. In general, we find that inefficient signal propagation impedes learning in deep networks. In Transformers, multi-head self-attention is the main cause of this poor signal propagation. To facilitate deep signal propagation, we propose ReZero, a simple change to the architecture that initializes an arbitrary layer as the identity map, using a single additional learned parameter per layer. We apply this technique to language modeling and find that we can easily train ReZero-Transformer networks over a hundred layers. When applied to 12 layer Transformers, ReZero converges 56% faster on enwiki8. ReZero applies beyond Transformers to other residual networks, enabling 1,500% faster convergence for deep fully connected networks and 32% faster convergence for a ResNet-56 trained on CIFAR 10.
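In equation form, the idea is to rescale each residual branch F by a learned scalar alpha that starts at zero, x_{i+1} = x_i + alpha_i * F(x_i), so every layer is initialized as the identity map. Below is a minimal PyTorch-style sketch of such a block, not the authors' released implementation; the sublayer and the sizes used in the usage example are illustrative assumptions.

    import torch
    import torch.nn as nn

    class ReZeroBlock(nn.Module):
        """Residual block x + alpha * F(x), with alpha initialized to zero,
        so the block starts out as the identity map (sketch of the ReZero idea)."""
        def __init__(self, sublayer: nn.Module):
            super().__init__()
            self.sublayer = sublayer                    # e.g. self-attention or a feed-forward net
            self.alpha = nn.Parameter(torch.zeros(1))   # single learned scalar per layer

        def forward(self, x):
            return x + self.alpha * self.sublayer(x)

    # Usage sketch: stack many blocks; at initialization the whole stack is the identity.
    layers = nn.Sequential(*[ReZeroBlock(nn.Sequential(nn.Linear(64, 64), nn.ReLU()))
                             for _ in range(100)])
    x = torch.randn(8, 64)
    y = layers(x)   # equals x at initialization, since every alpha == 0

Because the stack is the identity at initialization, gradients propagate cleanly through arbitrary depth, and each alpha grows during training only as its layer becomes useful.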

Related research

07/13/2020  Transformer with Depth-Wise LSTM
Increasing the depth of models allows neural models to model complicated...

08/29/2019  Improving Deep Transformer with Depth-Scaled Initialization and Merged Attention
The general trend in NLP is towards increasing model capacity and perfor...

02/20/2023  Deep Transformers without Shortcuts: Modifying Self-attention for Faithful Signal Propagation
Skip connections and normalisation layers form two standard architectura...

03/01/2023  Are More Layers Beneficial to Graph Transformers?
Despite that going deep has proven successful in many neural architectur...

08/31/2021  Effectiveness of Deep Networks in NLP using BiDAF as an example architecture
Question Answering with NLP has progressed through the evolution of adva...

04/17/2020  Understanding the Difficulty of Training Transformers
Transformers have been proved effective for many deep learning tasks. Tr...

10/18/2021  NormFormer: Improved Transformer Pretraining with Extra Normalization
During pretraining, the Pre-LayerNorm transformer suffers from a gradien...
