Improving Deep Transformer with Depth-Scaled Initialization and Merged Attention

08/29/2019
by Biao Zhang, et al.

The general trend in NLP is towards increasing model capacity and performance via deeper neural networks. However, simply stacking more layers of the popular Transformer architecture for machine translation results in poor convergence and high computational overhead. Our empirical analysis suggests that convergence is poor due to vanishing gradients caused by the interaction between residual connections and layer normalization. We propose depth-scaled initialization (DS-Init), which decreases parameter variance at the initialization stage and reduces the output variance of residual connections, so as to ease gradient back-propagation through normalization layers. To address the computational cost, we propose a merged attention sublayer (MAtt) that combines a simplified average-based self-attention sublayer with the encoder-decoder attention sublayer on the decoder side. Results on WMT and IWSLT translation tasks across five translation directions show that deep Transformers with DS-Init and MAtt substantially outperform their base counterpart in terms of BLEU (+1.1 BLEU on average for 12-layer models), while matching the decoding speed of the baseline model thanks to the efficiency improvements of MAtt.
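To make the initialization idea above concrete, here is a minimal PyTorch sketch of depth-scaled initialization. It assumes DS-Init scales a standard Xavier-uniform range for the weight matrices of the l-th layer by a factor proportional to 1/sqrt(l); the helper name ds_init_, the alpha gain, and the use of nn.TransformerEncoderLayer are illustrative choices, not the paper's reference implementation.

```python
# A minimal sketch of depth-scaled initialization (DS-Init).
# Assumption: DS-Init scales a Xavier-uniform initialization range for the
# weight matrices of the l-th layer by roughly 1/sqrt(l); `ds_init_` and
# `alpha` are hypothetical names, not taken from the paper's code.
import math
import torch.nn as nn


def ds_init_(module: nn.Module, layer_depth: int, alpha: float = 1.0) -> None:
    """Re-initialize all weight matrices of `module` with a depth-scaled gain."""
    gain = alpha / math.sqrt(layer_depth)        # deeper layer -> smaller initial weights
    for param in module.parameters():
        if param.dim() > 1:                      # skip biases and LayerNorm vectors
            nn.init.xavier_uniform_(param, gain=gain)


# Usage: apply the scaling per layer when building a deep (e.g. 12-layer) stack.
layers = [nn.TransformerEncoderLayer(d_model=512, nhead=8) for _ in range(12)]
for depth, layer in enumerate(layers, start=1):
    ds_init_(layer, layer_depth=depth)
```

Because deeper layers start with proportionally smaller weights, the outputs of their residual branches have smaller variance, which is the mechanism the abstract credits with easing gradient back-propagation through the normalization layers.

Similarly, the following is a rough sketch of what a merged attention (MAtt) decoder sublayer could look like, assuming it fuses a cumulative-average "self-attention" over previous target states with standard multi-head encoder-decoder attention under a single residual connection and LayerNorm; the class name MergedAttention and all parameter choices are assumptions for illustration.

```python
# A rough sketch of a merged attention (MAtt) decoder sublayer.
# Assumption: the sublayer adds a cumulative-average summary of previous target
# states to multi-head encoder-decoder attention and applies one shared
# residual connection and LayerNorm; `MergedAttention` and `avg_proj` are
# illustrative names, not the paper's reference implementation.
import torch
import torch.nn as nn


class MergedAttention(nn.Module):
    def __init__(self, d_model: int = 512, nhead: int = 8, dropout: float = 0.1):
        super().__init__()
        self.avg_proj = nn.Linear(d_model, d_model)   # projection of the averaged states
        self.cross_attn = nn.MultiheadAttention(d_model, nhead,
                                                dropout=dropout, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, tgt: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        # tgt: (batch, tgt_len, d_model) decoder states; memory: (batch, src_len, d_model).
        # Average-based "self-attention": the running mean over positions <= t,
        # causal by construction and requiring no softmax over target positions.
        steps = torch.arange(1, tgt.size(1) + 1, device=tgt.device).view(1, -1, 1)
        avg = torch.cumsum(tgt, dim=1) / steps
        ctx, _ = self.cross_attn(query=tgt, key=memory, value=memory)
        merged = self.avg_proj(avg) + ctx             # merge the two attention branches
        return self.norm(tgt + self.dropout(merged))  # single residual + LayerNorm


# Shape-only usage with dummy tensors: batch of 2, target length 7, source length 9.
layer = MergedAttention()
out = layer(torch.randn(2, 7, 512), torch.randn(2, 9, 512))
```

The cumulative average avoids a softmax over target positions at every decoding step, which is where the decoding-speed savings attributed to MAtt in the abstract would come from.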


Related research

11/08/2019: Why Deep Transformers are Difficult to Converge? From Computation Order to Lipschitz Restricted Parameter Initialization
07/13/2020: Transformer with Depth-Wise LSTM
03/10/2020: ReZero is All You Need: Fast Convergence at Large Depth
10/14/2019: Transformers without Tears: Improving the Normalization of Self-Attention
05/15/2021: Rethinking Skip Connection with Layer Normalization in Transformers and ResNets
06/28/2019: Widening the Representation Bottleneck in Neural Machine Translation with Lexical Shortcuts
03/01/2022: DeepNet: Scaling Transformers to 1,000 Layers
