Pre-RMSNorm and Pre-CRMSNorm Transformers: Equivalent and Efficient Pre-LN Transformers

05/24/2023
by Zixuan Jiang, et al.

Transformers have achieved great success in machine learning applications. Normalization techniques, such as Layer Normalization (LayerNorm, LN) and Root Mean Square Normalization (RMSNorm), play a critical role in accelerating and stabilizing the training of Transformers. While LayerNorm recenters and rescales input vectors, RMSNorm only rescales the vectors by their RMS value. Despite being more computationally efficient, RMSNorm may compromise the representation ability of Transformers. There is currently no consensus on the preferred normalization technique: some models employ LayerNorm while others use RMSNorm, especially recent large language models, and it is challenging to convert a Transformer built with one normalization type to the other. Amid this ongoing disagreement, we propose a solution to unify two mainstream Transformer architectures, Pre-LN and Pre-RMSNorm Transformers. By removing the inherent redundant mean information in the main branch of Pre-LN Transformers, we can reduce LayerNorm to RMSNorm, achieving higher efficiency. We further propose the Compressed RMSNorm (CRMSNorm) and the Pre-CRMSNorm Transformer based on a lossless compression of zero-mean vectors. We formally establish the equivalence of the Pre-LN, Pre-RMSNorm, and Pre-CRMSNorm Transformer variants in both training and inference. This implies that Pre-LN Transformers can be substituted with Pre-(C)RMSNorm counterparts at almost no cost, offering the same arithmetic functionality along with a free efficiency improvement. Experiments demonstrate that we can reduce the training and inference time of Pre-LN Transformers by up to 10%.
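To make the core identities concrete, below is a minimal NumPy sketch (not the authors' code) of the two facts the abstract relies on: applying LayerNorm to a vector gives the same result as applying RMSNorm to its mean-subtracted version, and a zero-mean d-dimensional vector can be losslessly compressed to d-1 dimensions, which is the idea behind CRMSNorm. Function names such as layernorm, rmsnorm, crms_compress, and crms_decompress are illustrative, and the reference implementations omit learnable affine parameters.

```python
# Minimal sketch of the identities behind the Pre-LN / Pre-(C)RMSNorm equivalence.
# Reference implementations without learnable affine parameters; names are illustrative.
import numpy as np

def layernorm(x, eps=1e-6):
    # Recenter to zero mean, then rescale by the root mean square of the centered vector.
    x = x - x.mean(axis=-1, keepdims=True)
    return x / np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)

def rmsnorm(x, eps=1e-6):
    # Rescale only, by the root mean square of the vector (no recentering).
    return x / np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)

def crms_compress(x_zero_mean):
    # Drop the last coordinate of a zero-mean vector; it is recoverable
    # as the negative sum of the remaining coordinates, so nothing is lost.
    return x_zero_mean[..., :-1]

def crms_decompress(z):
    last = -z.sum(axis=-1, keepdims=True)
    return np.concatenate([z, last], axis=-1)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 8))
x_zero_mean = x - x.mean(axis=-1, keepdims=True)

# 1) On zero-mean inputs, LayerNorm reduces to the cheaper RMSNorm.
assert np.allclose(layernorm(x), rmsnorm(x_zero_mean))

# 2) Zero-mean d-dimensional vectors compress losslessly to d-1 dimensions.
assert np.allclose(crms_decompress(crms_compress(x_zero_mean)), x_zero_mean)
```

In a Pre-LN Transformer, the paper's observation is that the mean component of the residual stream is redundant; once it is removed from the main branch, every LayerNorm can be replaced by RMSNorm (or by CRMSNorm operating on the compressed representation) without changing the network's function.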


