Transformers without Tears: Improving the Normalization of Self-Attention

10/14/2019
by Toan Q. Nguyen, et al.

We evaluate three simple, normalization-centric changes to improve Transformer training. First, we show that pre-norm residual connections (PreNorm) and smaller initializations enable warmup-free, validation-based training with large learning rates. Second, we propose ℓ_2 normalization with a single scale parameter (ScaleNorm) for faster training and better performance. Finally, we reaffirm the effectiveness of normalizing word embeddings to a fixed length (FixNorm). On five low-resource translation pairs from TED Talks-based corpora, these changes always converge, giving an average +1.1 BLEU over state-of-the-art bilingual baselines and a new 32.8 BLEU on IWSLT'15 English-Vietnamese. We observe sharper performance curves, more consistent gradient norms, and a linear relationship between activation scaling and decoder depth. Surprisingly, in the high-resource setting (WMT'14 English-German), ScaleNorm and FixNorm remain competitive but PreNorm degrades performance.
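The three changes named in the abstract map onto small, self-contained modules. Below is a minimal PyTorch sketch, assuming the formulations described in the paper: ScaleNorm as ℓ_2 normalization with a single learned scale g (g · x / ‖x‖_2), PreNorm residual connections of the form x + sublayer(norm(x)), and FixNorm as an ℓ_2 projection of word embeddings onto a sphere of fixed radius. The class names, the epsilon value, and the √d initialization of g are illustrative choices for this sketch, not the authors' released code.

```python
# Illustrative sketch of ScaleNorm, PreNorm residuals, and FixNorm.
# Names, epsilon, and the sqrt(d) initialization are assumptions of this
# sketch, not the paper's reference implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ScaleNorm(nn.Module):
    """l2 normalization with a single learned scale g: g * x / ||x||_2."""

    def __init__(self, scale: float, eps: float = 1e-5):
        super().__init__()
        self.g = nn.Parameter(torch.tensor(scale))
        self.eps = eps

    def forward(self, x):
        # Normalize over the feature dimension, guarding against zero norms.
        return self.g * x / x.norm(dim=-1, keepdim=True).clamp(min=self.eps)


class PreNormResidual(nn.Module):
    """Pre-norm residual connection: x + sublayer(norm(x))."""

    def __init__(self, d_model: int, sublayer: nn.Module):
        super().__init__()
        # Pairing PreNorm with ScaleNorm; nn.LayerNorm(d_model) here would
        # give the conventional pre-norm Transformer instead.
        self.norm = ScaleNorm(d_model ** 0.5)
        self.sublayer = sublayer

    def forward(self, x, *args, **kwargs):
        return x + self.sublayer(self.norm(x), *args, **kwargs)


def fixnorm_embedding(embedding: nn.Embedding, tokens, scale: float = 1.0):
    """FixNorm: project word embeddings onto a sphere of fixed radius."""
    return scale * F.normalize(embedding(tokens), dim=-1)
```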


