Deep Transformers without Shortcuts: Modifying Self-attention for Faithful Signal Propagation

02/20/2023
by Bobby He, et al.

Skip connections and normalisation layers form two standard architectural components that are ubiquitous for the training of Deep Neural Networks (DNNs), but whose precise roles are poorly understood. Recent approaches such as Deep Kernel Shaping have made progress towards reducing our reliance on them, using insights from wide NN kernel theory to improve signal propagation in vanilla DNNs (which we define as networks without skips or normalisation). However, these approaches are incompatible with the self-attention layers present in transformers, whose kernels are intrinsically more complicated to analyse and control. And so the question remains: is it possible to train deep vanilla transformers? We answer this question in the affirmative by designing several approaches that use combinations of parameter initialisations, bias matrices and location-dependent rescaling to achieve faithful signal propagation in vanilla transformers. Our methods address various intricacies specific to signal propagation in transformers, including the interaction with positional encoding and causal masking. In experiments on WikiText-103 and C4, our approaches enable deep transformers without normalisation to train at speeds matching their standard counterparts, and deep vanilla transformers to reach the same performance as standard ones after about 5 times more iterations.
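The abstract describes the mechanisms only at a high level. As a rough illustration of the general recipe, the sketch below shows a causal self-attention layer whose output is biased towards the identity map at initialisation via a learnable identity/attention mix, with a location-dependent additive bias on the logits. The function name `shaped_causal_attention` and the parameters `alpha`, `beta`, `decay` are hypothetical choices for this example; it is not the paper's exact E-SPA/U-SPA construction, just a minimal NumPy sketch of the idea of controlling attention's behaviour at initialisation without skip connections.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def shaped_causal_attention(q, k, v, alpha=1.0, beta=0.0, decay=0.1):
    """Causal self-attention pushed towards the identity at initialisation
    (alpha=1, beta=0), with a location-dependent additive bias on the logits.
    Hypothetical illustration only; q, k, v have shape (T, d)."""
    T, d = q.shape
    logits = q @ k.T / np.sqrt(d)

    # Location-dependent bias: penalise attention to distant past tokens.
    idx = np.arange(T)
    dist = idx[:, None] - idx[None, :]          # i - j >= 0 on the causal part
    logits = logits - decay * np.maximum(dist, 0)

    # Causal mask: no attention to future positions.
    logits = np.where(dist < 0, -np.inf, logits)

    attn = softmax(logits, axis=-1)             # row-stochastic (T, T) matrix

    # Mix with the identity so the layer starts out close to a "skip"-like
    # map, even though the network has no explicit skip connections.
    mixed = alpha * np.eye(T) + beta * attn
    return mixed @ v

# Tiny usage example.
rng = np.random.default_rng(0)
T, d = 6, 8
q, k, v = (rng.standard_normal((T, d)) for _ in range(3))
out = shaped_causal_attention(q, k, v, alpha=1.0, beta=0.0)
print(np.allclose(out, v))  # True: at initialisation the layer acts as the identity
```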


