1 Introduction
Neural machine translation has achieved great success in the last few years (Bahdanau et al., 2014; Gehring et al., 2017; Vaswani et al., 2017). The Transformer (Vaswani et al., 2017), which has outperformed previous RNN/CNN-based translation models (Bahdanau et al., 2014; Gehring et al., 2017), is based on multi-layer self-attention networks and can be trained very efficiently. The multi-layer structure allows the Transformer to model complicated functions. Increasing the depth of models can increase their capacity, but may also cause optimization difficulties (Mhaskar et al., 2017; Telgarsky, 2016; Eldan and Shamir, 2016; He et al., 2016; Bapna et al., 2018). In order to ease optimization, the Transformer employs residual connections and layer normalization, techniques which have proven useful in reducing the optimization difficulties of deep neural networks for various tasks (He et al., 2016; Ba et al., 2016).
However, even with residual connections and layer normalization, deep Transformers are still hard to train: the original Transformer (Vaswani et al., 2017) only contains 6 encoder/decoder layers. Bapna et al. (2018) show that Transformer models with more than 12 encoder layers fail to converge, and propose the Transparent Attention (TA) mechanism, which combines weighted outputs of all encoder layers as the encoded representation. However, the TA mechanism has to value the outputs of shallow encoder layers in order to feed back sufficient gradients during backpropagation and ensure their convergence; this implies that the weights of deep layers are likely to be hampered, working against the motivation for going very deep, and as a result Bapna et al. (2018) cannot obtain further improvements with more than 16 layers. Wang et al. (2019) reveal that deep Transformers with proper use of layer normalization are able to converge, and propose to aggregate previous layers’ outputs for each layer instead of only at the end of encoding. Wu et al. (2019) investigate incrementally increasing the depth of the Transformer Big by freezing pre-trained shallow layers. In concurrent work, Zhang et al. (2019) also point out the same issue as in this work, but there are differences between the two approaches.
In contrast to all previous works, we empirically show that with proper parameter initialization, deep Transformers with the original computation order can converge. The contributions of our work are as follows:
We empirically demonstrate that a simple modification made in the Transformer’s official implementation (Vaswani et al., 2018), which changes the computation order of residual connection and layer normalization, can effectively ease its optimization;
We analyze in depth how the subtle difference in computation order affects the convergence of deep Transformer models, and propose to initialize deep Transformer models under a Lipschitz restriction;
Our simple approach effectively ensures the convergence of deep Transformers with up to 24 layers, and brings BLEU improvements on the WMT 14 English to German task and the WMT 15 Czech to English task.
2 Convergence of Different Computation Orders
| Models | Encoder | Decoder | en-de v1 | en-de v2 | cs-en v1 | cs-en v2 |
|---|---|---|---|---|---|---|
| Bapna et al. (2018) | 16 | 6 | 28.39 | None | 29.36 | None |
| Transformer | 6 | 6 | 27.77 | 27.31 | 28.62 | 28.40 |
| | 12 | 6 | — | 28.12 | — | 29.38 |
| | 18 | 6 | — | 28.60 | — | 29.61 |
| | 24 | 6 | — | 29.02 | — | 29.73 |
In our research, we focus on the training problems that prevent deep Transformers from converging (as opposed to other important issues such as overfitting on the training set). To alleviate the training problem, the standard Transformer model adopts Layer Normalization (Ba et al., 2016) and Residual Connections (He et al., 2016).
The official implementation (Vaswani et al., 2018) of the Transformer uses a different computation sequence (Figure 1 b) compared to the published version (Vaswani et al., 2017) (Figure 1 a), since it appears better for harder-to-learn models (https://github.com/tensorflow/tensor2tensor/blob/v1.6.5/tensor2tensor/layers/common_hparams.py#L110-L112). Though several papers (Chen et al., 2018; Domhan, 2018) mentioned this change, to the best of our knowledge how this modification impacts the performance of the Transformer, especially of deep Transformers, has never been studied in depth with empirical results, except that Wang et al. (2019) analyzed the difference between the two computation orders during backpropagation, and Zhang et al. (2019) point out the same effects of normalization in concurrent work.
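The difference between the two computation orders can be sketched in a few lines. This is an illustrative NumPy sketch rather than the actual Transformer code, and the function names (`sublayer_v1`, `sublayer_v2`) are our own:

```python
import numpy as np

def layer_norm(x, w, b, eps=1e-6):
    """Normalize the last dimension, then scale by weight w and shift by bias b."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return w * (x - mean) / (std + eps) + b

def sublayer_v1(x, f, w, b):
    """v1 (published order): layer normalization applied over the residual sum."""
    return layer_norm(x + f(x), w, b)

def sublayer_v2(x, f, w, b):
    """v2 (official implementation): layer normalization applied to the input
    only; the residual connection is added outside the normalization."""
    return x + f(layer_norm(x, w, b))
```

With the standard initialization (weight of ones, bias of zeros), v2 passes the residual `x` through unscaled, while v1 rescales everything, residual included, by the normalization.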
In order to compare with Bapna et al. (2018), we used the datasets of the WMT 14 English to German task and the WMT 15 Czech to English task for our experiments. We applied joint Byte-Pair Encoding (BPE) (Sennrich et al., 2016) with 32k merge operations. We used the same settings as the Transformer base (Vaswani et al., 2017), except for the number of warmup steps. We conducted our experiments with the Neutron implementation (Xu and Liu, 2019) of the Transformer.
Parameters were initialized with Glorot initialization (Glorot and Bengio, 2010), in which matrices are uniformly initialized between bounds determined by their two dimensions, as in many other Transformer implementations (Klein et al., 2017; Hieber et al., 2017; Vaswani et al., 2018). Our experiments ran on 2 GTX 1080 Ti GPUs, and a sufficiently large batch size in target tokens was achieved through gradient accumulation of small batches.
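For reference, Glorot (Xavier) uniform initialization draws each weight from a uniform distribution with bound sqrt(6 / (fan_in + fan_out)); a minimal NumPy sketch (the function name is ours):

```python
import numpy as np

def glorot_uniform(n_in, n_out, rng=None):
    """Glorot/Xavier uniform initialization: sample from
    U(-sqrt(6/(n_in+n_out)), +sqrt(6/(n_in+n_out))), chosen so that
    activation variance is roughly preserved across layers."""
    rng = np.random.default_rng() if rng is None else rng
    bound = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-bound, bound, size=(n_in, n_out))
```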
We used a beam size of 4 for decoding, and evaluated tokenized case-sensitive BLEU with the averaged model of the last 5 checkpoints, saved at intervals of 1,500 training steps (Vaswani et al., 2017).
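Checkpoint averaging as used here simply takes the element-wise mean of corresponding parameters across the saved checkpoints; a minimal sketch, assuming each checkpoint is a name-to-array dictionary:

```python
import numpy as np

def average_checkpoints(checkpoints):
    """Average corresponding parameters across checkpoints, where each
    checkpoint is a dict mapping parameter name -> NumPy array."""
    names = checkpoints[0].keys()
    return {name: np.mean([ckpt[name] for ckpt in checkpoints], axis=0)
            for name in names}
```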
[Table 2: the layer normalization and residual connection computations of v1 and v2]
Results for the two different computation orders are shown in Table 1. v1 and v2 stand for the computation order of the published Transformer (Vaswani et al., 2017) and that of the official implementation (Vaswani et al., 2018) respectively. “—” means failure to converge, “None” means not reported in the original work, and “*” indicates our implementation of their approach. Significance markers indicate the results of significance tests comparing v1 and v2 at the same number of layers.
3 Analysis and Lipschitz Restricted Parameter Initialization
Since the subtle change of computation order results in huge differences in convergence, we analyze the differences between the computation orders to figure out how they affect convergence.
3.1 Comparison between Computation Orders
As a conjecture, we think the convergence issue of deep Transformers may be due to the fact that, in Figure 1 (a), layer normalization is computed over the residual connection: residual connections are likely to be hampered by layer normalization, which tends to shrink consecutive residual connections to avoid a potential explosion of combined layer outputs (Chen et al., 2018). We studied how layer normalization and the residual connection are computed in the two computation orders, as shown in Table 2.
“mean” and “std” denote the computation of the mean value and the standard deviation. The two inputs stand for the output of the current layer and the accumulated outputs of previous layers respectively. The weight and bias of layer normalization are initialized as a vector of ones and a vector of zeros, and the results are the output of the layer normalization and the residual connections of v1 and v2. Table 2 shows that, compared to v2, the computation of the residual connection in v1 is additionally weighted by the layer normalization, and the residual connection of previous layers will be shrunk when that weight is smaller than 1.
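The shrinking effect can be illustrated numerically: in v1 the accumulated residual is effectively rescaled by the layer normalization weight divided by the standard deviation of the combined output, and this factor falls below 1 once that standard deviation exceeds 1. A toy NumPy illustration (the shapes and distributions here are our own assumptions):

```python
import numpy as np

def residual_scale_v1(x, out, w):
    """Effective weight on the residual path in v1: since v1 computes
    LN(x + out), the accumulated residual x is rescaled by w / std(x + out)."""
    return w / (x + out).std(axis=-1, keepdims=True)

# When the layer output has a magnitude comparable to the residual,
# std(x + out) grows above 1, so with w initialized to ones the
# residual path is shrunk.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(1, 512))    # accumulated residual
out = rng.normal(0.0, 1.0, size=(1, 512))  # current layer output
w = np.ones(512)
scale = residual_scale_v1(x, out, w)
# std(x + out) is close to sqrt(2), so the scale is close to 1/sqrt(2) < 1
```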
We suggest that Bapna et al. (2018) introduced the TA mechanism to compensate for normalized residual connections by combining the outputs of shallow layers into the final encoder output of the published Transformer, and obtained significant improvements with deep Transformer models. Wang et al. (2019) additionally aggregate the outputs of previous layers for each encoder layer instead of only at the end of encoding.
3.2 Lipschitz Restricted Parameter Initialization
Since the convergence issue of deep v1 Transformers is likely caused by the shrunken residual connections, is it possible to prevent this shrinking? Given that the weight of layer normalization is initialized as a vector of ones, we suggest restricting the standard deviation of the layer normalization input:
(1) 
in which case the scaling factor on the residual connection will be greater than or at least equal to 1, and the residual connection of v1 will no longer be shrunk. To achieve this goal, we can restrict input values to a bounded range while ensuring that the variance of their distribution is sufficiently small.
(2) 
then the standard deviation is:
(3) 
given that:
(4) 
(5) 
Simplifying Equation 5, we obtain:
(6) 
(7) 
Thus, as long as:
(8) 
the requirement on the standard deviation described in Equation 1 can be satisfied.
This goal can simply be achieved by initializing the sub-model before layer normalization to be a k-Lipschitz function, with a suitably bounded k.
The k-Lipschitz restriction can be enforced effectively through weight clipping (note that the weight of the layer normalization must not be clipped, otherwise residual connections will be shrunk even more heavily), and we empirically find that applying the restriction only at parameter initialization is sufficient, which is more efficient and avoids the potential risk of weight clipping harming performance.
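Both options can be sketched in NumPy. The clipping helper follows the standard weight-clipping recipe; the initialization bound `k / n_in` is an illustrative choice that makes the linear map k-Lipschitz under the max norm, not the paper's exact formula:

```python
import numpy as np

def clip_weights(w, k):
    """Weight clipping: restrict each entry of w to [-k, k]."""
    return np.clip(w, -k, k)

def lipschitz_uniform_init(n_in, n_out, k=1.0, rng=None):
    """Initialization-only alternative to clipping: draw weights from a
    uniform range narrow enough that x -> x @ W is k-Lipschitz under the
    max norm (each output coordinate changes by at most k times the largest
    input change). The bound k / n_in is an illustrative assumption."""
    rng = np.random.default_rng() if rng is None else rng
    bound = k / n_in
    return rng.uniform(-bound, bound, size=(n_in, n_out))
```

Applying the restriction only at initialization, as in the paper, leaves training itself unconstrained.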
In practice, we initialize embedding matrices and the weights of linear transformations with uniform distributions whose bounds depend on the embedding size, the vocabulary size, and the input dimension of the linear transformation respectively, chosen to preserve the magnitude of the variance of the weights in the forward pass.

| Layers | en-de v1’ | en-de v2’ | cs-en v1’ | cs-en v2’ |
|---|---|---|---|---|
| 6 | 27.96 | 27.38 | 28.78 | 28.39 |
| 12 | 28.67 | 28.13 | 29.17 | 29.45 |
| 18 | 29.05 | 28.67 | 29.55 | 29.63 |
| 24 | 29.46 | 29.20 | 29.70 | 29.88 |
Results for the two computation orders with the new parameter initialization method are shown in Table 3. v1’ indicates v1 with Lipschitz restricted parameter initialization, and likewise for v2’. Table 3 shows that with our new parameter initialization approach, deep v1 models no longer suffer from the convergence problem.
4 Effects of Deeper Encoder and Deeper Decoder
Previous approaches (Bapna et al., 2018; Wang et al., 2019) only increase the depth of the encoder, while we suggest that deep decoders should also be helpful. We analyzed the influence of deep encoders and decoders separately; results are shown in Table 4.
| Encoder | Decoder | en-de | cs-en |
|---|---|---|---|
| 6 | 6 | 27.96 | 28.78 |
| 24 | 6 | 28.76 | 29.20 |
| 6 | 24 | 28.63 | 29.36 |
| 24 | 24 | 29.46 | 29.70 |
Table 4 shows that a deep decoder benefits performance in addition to a deep encoder, especially on the Czech to English task.
5 Conclusion
In contrast to all previous works (Bapna et al., 2018; Wang et al., 2019; Wu et al., 2019), which show that deep Transformers with the computation order of Vaswani et al. (2017) have difficulty converging, we empirically show that deep Transformers with the original computation order can converge given proper parameter initialization.
In this paper, we first investigate convergence differences between the published Transformer (Vaswani et al., 2017) and the official implementation of the Transformer (Vaswani et al., 2018), and compare the differences in computation order between them. We then conjecture that the training problem of deep Transformers arises because layer normalization sometimes shrinks residual connections, and propose that this can be tackled simply with Lipschitz restricted parameter initialization.
Our experiments demonstrate the effectiveness of our simple approach for the convergence of deep Transformers, which brings significant improvements on the WMT 14 English to German and the WMT 15 Czech to English news translation tasks. We also study the effect of deep decoders in addition to the deep encoders considered in previous works.
Acknowledgments
Hongfei Xu is supported by a doctoral grant from China Scholarship Council ([2018]3101, 201807040056). This work is also supported by the German Federal Ministry of Education and Research (BMBF) under the funding code 01IW17001 (Deeplee).
References
 Ba et al. (2016) Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450.
 Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

 Bapna et al. (2018) Ankur Bapna, Mia Chen, Orhan Firat, Yuan Cao, and Yonghui Wu. 2018. Training deeper neural machine translation models with transparent attention. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3028–3033. Association for Computational Linguistics.
 Chen et al. (2018) Mia Xu Chen, Orhan Firat, Ankur Bapna, Melvin Johnson, Wolfgang Macherey, George Foster, Llion Jones, Mike Schuster, Noam Shazeer, Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Zhifeng Chen, Yonghui Wu, and Macduff Hughes. 2018. The best of both worlds: Combining recent advances in neural machine translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 76–86. Association for Computational Linguistics.
 Domhan (2018) Tobias Domhan. 2018. How much attention do you need? a granular analysis of neural machine translation architectures. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1799–1808. Association for Computational Linguistics.

 Eldan and Shamir (2016) Ronen Eldan and Ohad Shamir. 2016. The power of depth for feedforward neural networks. In 29th Annual Conference on Learning Theory, volume 49 of Proceedings of Machine Learning Research, pages 907–940, Columbia University, New York, New York, USA. PMLR.
 Gehring et al. (2017) Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. Convolutional sequence to sequence learning. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1243–1252, International Convention Centre, Sydney, Australia. PMLR.

 Glorot and Bengio (2010) Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pages 249–256, Chia Laguna Resort, Sardinia, Italy. PMLR.
 He et al. (2016) K. He, X. Zhang, S. Ren, and J. Sun. 2016. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778.
 Hieber et al. (2017) Felix Hieber, Tobias Domhan, Michael Denkowski, David Vilar, Artem Sokolov, Ann Clifton, and Matt Post. 2017. Sockeye: A toolkit for neural machine translation. arXiv preprint arXiv:1712.05690.
 Klein et al. (2017) G. Klein, Y. Kim, Y. Deng, J. Senellart, and A. M. Rush. 2017. OpenNMT: Open-Source Toolkit for Neural Machine Translation. ArXiv e-prints.
 Mhaskar et al. (2017) Hrushikesh Mhaskar, Qianli Liao, and Tomaso Poggio. 2017. When and why are deep networks better than shallow ones? In Proceedings of the ThirtyFirst AAAI Conference on Artificial Intelligence, pages 2343–2348.
 Sennrich et al. (2016) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725. Association for Computational Linguistics.
 Telgarsky (2016) Matus Telgarsky. 2016. Benefits of depth in neural networks. In 29th Annual Conference on Learning Theory, volume 49 of Proceedings of Machine Learning Research, pages 1517–1539, Columbia University, New York, New York, USA. PMLR.
 Vaswani et al. (2018) Ashish Vaswani, Samy Bengio, Eugene Brevdo, Francois Chollet, Aidan N. Gomez, Stephan Gouws, Llion Jones, Łukasz Kaiser, Nal Kalchbrenner, Niki Parmar, Ryan Sepassi, Noam Shazeer, and Jakob Uszkoreit. 2018. Tensor2tensor for neural machine translation. CoRR, abs/1803.07416.
 Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Ł ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.
 Wang et al. (2019) Qiang Wang, Bei Li, Tong Xiao, Jingbo Zhu, Changliang Li, Derek F. Wong, and Lidia S. Chao. 2019. Learning deep transformer models for machine translation. In Proceedings of the 57th Conference of the Association for Computational Linguistics, pages 1810–1822, Florence, Italy. Association for Computational Linguistics.
 Wu et al. (2019) Lijun Wu, Yiren Wang, Yingce Xia, Fei Tian, Fei Gao, Tao Qin, Jianhuang Lai, and TieYan Liu. 2019. Depth growing for neural machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5558–5563, Florence, Italy. Association for Computational Linguistics.
 Xu and Liu (2019) Hongfei Xu and Qiuhui Liu. 2019. Neutron: An Implementation of the Transformer Translation Model and its Variants. arXiv preprint arXiv:1903.07402.
 Zhang et al. (2019) Biao Zhang, Ivan Titov, and Rico Sennrich. 2019. Improving deep transformer with depthscaled initialization and merged attention. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLPIJCNLP), pages 897–908, Hong Kong, China. Association for Computational Linguistics.