On the Expressivity Role of LayerNorm in Transformers' Attention

05/04/2023
by Shaked Brody, et al.

Layer Normalization (LayerNorm) is an inherent component in all Transformer-based models. In this paper, we show that LayerNorm is crucial to the expressivity of the multi-head attention layer that follows it. This is in contrast to the common belief that LayerNorm's only role is to normalize the activations during the forward pass, and their gradients during the backward pass. We consider a geometric interpretation of LayerNorm and show that it consists of two components: (a) projection of the input vectors onto a (d-1)-dimensional space that is orthogonal to the [1,1,...,1] vector, and (b) scaling of all vectors to the same norm of √d. We show that each of these components is important for the attention layer that follows it in Transformers: (a) projection allows the attention mechanism to create an attention query that attends to all keys equally, sparing the attention layer from having to learn this operation itself; and (b) scaling allows each key to potentially receive the highest attention, and prevents keys from becoming "un-select-able". We show empirically that Transformers do indeed benefit from these properties of LayerNorm in general language modeling and even in computing simple functions such as "majority". Our code is available at https://github.com/tech-srl/layer_norm_expressivity_role .
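To make the geometric reading above concrete, below is a minimal sketch (an illustration of the decomposition described in the abstract, not code from the paper's repository; the function name is hypothetical) that re-implements LayerNorm without its learned affine parameters as exactly these two steps, projection followed by scaling, and checks the result against PyTorch's built-in layer norm.

```python
import torch
import torch.nn.functional as F

def layernorm_as_projection_and_scaling(x: torch.Tensor) -> torch.Tensor:
    """LayerNorm (without the learned affine parameters) written as the
    two geometric steps described in the abstract."""
    d = x.shape[-1]
    # (a) Projection: subtracting the per-vector mean projects x onto the
    #     (d-1)-dimensional hyperplane orthogonal to the [1, 1, ..., 1] vector.
    projected = x - x.mean(dim=-1, keepdim=True)
    # (b) Scaling: rescale every projected vector to norm sqrt(d).
    return projected * d ** 0.5 / projected.norm(dim=-1, keepdim=True)

# Sanity check against PyTorch's LayerNorm with eps=0 and no affine transform.
x = torch.randn(4, 16)
reference = F.layer_norm(x, normalized_shape=(16,), eps=0.0)
print(torch.allclose(layernorm_as_projection_and_scaling(x), reference, atol=1e-5))
```

Because the mean-subtracted vector always sums to zero and is then rescaled to norm √d, the check should print True up to floating-point tolerance.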

