Position Embedding Needs an Independent Layer Normalization

12/10/2022
by Runyi Yu, et al.

The Position Embedding (PE) is critical for Vision Transformers (VTs) due to the permutation invariance of the self-attention operation. By analyzing the input and output of each encoder layer in VTs through reparameterization and visualization, we find that the default PE joining method (simply adding the PE and the patch embedding together) applies the same affine transformation to the token embeddings and the PE, which limits the expressiveness of the PE and hence constrains the performance of VTs. To overcome this limitation, we propose a simple, effective, and robust method: we provide two independent layer normalizations for the token embeddings and the PE in each layer, and add the normalized results together as the input to that layer's Multi-Head Self-Attention module. Since the method allows the model to adaptively adjust the PE information for different layers, we name it Layer-adaptive Position Embedding, abbreviated as LaPE. Extensive experiments demonstrate that LaPE improves various VTs with different types of PE and makes VTs robust to the PE type. For example, LaPE improves accuracy by 0.94% to 1.72% on image classification, with negligible extra parameters, memory, and computational cost. The code is publicly available at https://github.com/Ingrid725/LaPE.
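
The joining rule described in the abstract is straightforward to prototype. Below is a minimal PyTorch sketch of a single encoder block that uses LaPE-style joining: the token embeddings and the PE each get their own LayerNorm, and the two normalized tensors are summed and fed to Multi-Head Self-Attention. This is a sketch under the assumption of a standard pre-norm ViT block; the names (LaPEBlock, norm_tokens, norm_pe, mlp_ratio) are illustrative and not taken from the official repository, which should be consulted for the authors' actual implementation.

import torch
import torch.nn as nn


class LaPEBlock(nn.Module):
    """One Transformer encoder block with LaPE-style PE joining.

    Instead of normalizing (tokens + PE) with a single LayerNorm, the token
    embeddings and the position embedding each pass through their own
    LayerNorm, and the two normalized tensors are summed as the MHSA input.
    """

    def __init__(self, dim, num_heads, mlp_ratio=4.0):
        super().__init__()
        self.norm_tokens = nn.LayerNorm(dim)  # LN for token embeddings
        self.norm_pe = nn.LayerNorm(dim)      # independent LN for the PE
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, x, pe):
        # x:  (B, N, dim) token embeddings
        # pe: (1, N, dim) position embedding, broadcast over the batch
        h = self.norm_tokens(x) + self.norm_pe(pe)          # layer-adaptive joining
        x = x + self.attn(h, h, h, need_weights=False)[0]   # residual MHSA
        x = x + self.mlp(self.norm2(x))                     # residual MLP
        return x


# Usage: the same PE tensor is passed to every block, but each block
# renormalizes it with its own LayerNorm parameters.
blocks = nn.ModuleList([LaPEBlock(dim=192, num_heads=3) for _ in range(12)])
x = torch.randn(8, 197, 192)   # batch of 8; 196 patch tokens + 1 class token
pe = torch.randn(1, 197, 192)  # would be a learnable parameter in a real model
for blk in blocks:
    x = blk(x, pe)

Because every block owns its norm_pe parameters, the scale and shift applied to the PE can differ from layer to layer, which is what "layer-adaptive" refers to; the extra cost is only one additional LayerNorm per block.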

