Signal Propagation in Transformers: Theoretical Perspectives and the Role of Rank Collapse

06/07/2022
by Lorenzo Noci, et al.

Transformers have achieved remarkable success in several domains, ranging from natural language processing to computer vision. Nevertheless, it has recently been shown that stacking self-attention layers - the distinctive architectural component of Transformers - can result in rank collapse of the tokens' representations at initialization. The question of whether and how rank collapse affects training is still largely unanswered, and its investigation is necessary for a more comprehensive understanding of this architecture. In this work, we shed new light on the causes and the effects of this phenomenon. First, we show that rank collapse of the tokens' representations hinders training by causing the gradients of the queries and keys to vanish at initialization. Furthermore, we provide a thorough description of the origin of rank collapse and discuss how to prevent it via an appropriate depth-dependent scaling of the residual branches. Finally, our analysis unveils that specific architectural hyperparameters affect the gradients of queries and values differently, leading to disproportionate gradient norms. This suggests an explanation for the widespread use of adaptive methods for Transformers' optimization.
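The two phenomena the abstract describes can be illustrated numerically: stacking self-attention layers without residual connections drives the token representations toward a rank-1 configuration in which all tokens become equal, whereas a depth-dependent scaling of the residual branch keeps them apart. The following is a minimal NumPy sketch under simplifying assumptions, not the paper's exact setup: single-head attention with Gaussian initialization, and a 1/sqrt(depth) residual scaling chosen here as one concrete instance of depth-dependent scaling. The helper names attention_layer and rank_one_residual are illustrative, not taken from the paper.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_layer(X, d, rng):
    # Single-head self-attention with fresh Gaussian weights at initialization.
    Wq, Wk, Wv, Wo = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4))
    A = softmax((X @ Wq) @ (X @ Wk).T / np.sqrt(d))
    return A @ X @ Wv @ Wo

def rank_one_residual(X):
    # Relative distance of X from the rank-1 matrix that repeats the mean token.
    mean_token = X.mean(axis=0, keepdims=True)
    return np.linalg.norm(X - mean_token) / np.linalg.norm(X)

rng = np.random.default_rng(0)
n_tokens, d, depth = 32, 64, 24
X_plain = X_scaled = rng.standard_normal((n_tokens, d))

for _ in range(depth):
    # Attention-only stack: no skip connection.
    X_plain = attention_layer(X_plain, d, rng)
    # Residual branch scaled by 1/sqrt(depth) (one form of depth-dependent scaling).
    X_scaled = X_scaled + attention_layer(X_scaled, d, rng) / np.sqrt(depth)

print(f"no residual:     distance from rank-1 = {rank_one_residual(X_plain):.3e}")
print(f"scaled residual: distance from rank-1 = {rank_one_residual(X_scaled):.3e}")
```

Under these assumptions, the plain stack's distance from its rank-1 approximation should shrink rapidly with depth, while the scaled-residual stack's should remain of order one, matching the qualitative picture in the abstract.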
