Transformers without Tears: Improving the Normalization of Self-Attention

by Toan Q. Nguyen, et al.

We evaluate three simple, normalization-centric changes to improve Transformer training. First, we show that pre-norm residual connections (PreNorm) and smaller initializations enable warmup-free, validation-based training with large learning rates. Second, we propose ℓ_2 normalization with a single scale parameter (ScaleNorm) for faster training and better performance. Finally, we reaffirm the effectiveness of normalizing word embeddings to a fixed length (FixNorm). On five low-resource translation pairs from TED Talks-based corpora, these changes always converge, giving an average +1.1 BLEU over state-of-the-art bilingual baselines and a new 32.8 BLEU on IWSLT'15 English-Vietnamese. We observe sharper performance curves, more consistent gradient norms, and a linear relationship between activation scaling and decoder depth. Surprisingly, in the high-resource setting (WMT'14 English-German), ScaleNorm and FixNorm remain competitive but PreNorm degrades performance.
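The two central changes, ScaleNorm and PreNorm, can be sketched directly from their definitions in the abstract: ScaleNorm replaces LayerNorm's per-dimension gain and bias with ℓ_2 normalization scaled by a single learned parameter g, and PreNorm moves normalization before each sublayer rather than after the residual addition. The following is a minimal PyTorch sketch of both ideas; the initial value of g and the `prenorm_residual` helper name are illustrative choices, not prescribed by the abstract.

```python
import torch
import torch.nn as nn


class ScaleNorm(nn.Module):
    """ScaleNorm: l2-normalize along the feature dimension, then scale by
    a single learned scalar g (vs. LayerNorm's per-dimension gain/bias)."""

    def __init__(self, scale: float, eps: float = 1e-5):
        super().__init__()
        # g is a single scalar parameter; scale is its initial value
        # (an illustrative choice here, not fixed by the abstract).
        self.g = nn.Parameter(torch.tensor(scale))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Clamp the norm to avoid division by zero on all-zero inputs.
        norm = x.norm(dim=-1, keepdim=True).clamp(min=self.eps)
        return self.g * x / norm


def prenorm_residual(x, sublayer, norm):
    """Pre-norm residual connection: normalize *before* the sublayer and
    add the residual, i.e. x + sublayer(norm(x)), instead of the
    post-norm form norm(x + sublayer(x))."""
    return x + sublayer(norm(x))
```

After ScaleNorm, every position's activation vector has ℓ_2 length exactly g, which is also the intuition behind FixNorm: applying the same normalization to word embeddings with a fixed (non-learned) length.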
