Why Deep Transformers are Difficult to Converge? From Computation Order to Lipschitz Restricted Parameter Initialization

by   Hongfei Xu, et al.

The Transformer translation model employs residual connections and layer normalization to ease the optimization difficulties caused by its multi-layer encoder/decoder structure. However, several previous works show that even with residual connections and layer normalization, deep Transformers are still difficult to train, and in particular that a Transformer model with more than 12 encoder/decoder layers fails to converge. In this paper, we first empirically demonstrate that a simple modification made in the official implementation, which changes the computation order of residual connection and layer normalization, can effectively ease the optimization of deep Transformers. In addition, we compare the subtle difference in computation order in depth, and propose a parameter initialization method which simply puts a Lipschitz restriction on the initialization of Transformers but effectively ensures their convergence. We empirically show that with proper parameter initialization, deep Transformers with the original computation order can converge, in contrast to all previous works, and obtain significant improvements with up to 24 layers. Our proposed approach additionally makes it possible to benefit from deep decoders, whereas previous works focus on deep encoders.



1 Introduction

Neural machine translation has achieved great success in the last few years Bahdanau et al. (2014); Gehring et al. (2017); Vaswani et al. (2017). The Transformer Vaswani et al. (2017), which has outperformed previous RNN/CNN based translation models Bahdanau et al. (2014); Gehring et al. (2017), is based on multi-layer self-attention networks and can be trained very efficiently. The multi-layer structure allows the Transformer to model complicated functions. Increasing the depth of models can increase their capacity but may also cause optimization difficulties Mhaskar et al. (2017); Telgarsky (2016); Eldan and Shamir (2016); He et al. (2016); Bapna et al. (2018). In order to ease optimization, the Transformer employs residual connection and layer normalization techniques, which have been proven useful in reducing optimization difficulties of deep neural networks for various tasks He et al. (2016); Ba et al. (2016).

However, even with residual connections and layer normalization, deep Transformers are still hard to train: the original Transformer Vaswani et al. (2017) only contains 6 encoder/decoder layers. Bapna et al. (2018) show that Transformer models with more than 12 encoder layers fail to converge, and propose the Transparent Attention (TA) mechanism, which combines the outputs of all encoder layers into the encoded representation through a weighted sum. However, the TA mechanism has to place weight on the outputs of shallow encoder layers so that sufficient gradients are fed back to them during back-propagation to ensure their convergence. This implies that the weights of deep layers are likely to be hampered, which goes against the motivation for very deep models, and as a result Bapna et al. (2018) cannot obtain further improvements with more than 16 layers. Wang et al. (2019) reveal that deep Transformers with proper use of layer normalization are able to converge, and propose to aggregate previous layers’ outputs for each layer instead of only at the end of encoding. Wu et al. (2019) study incrementally increasing the depth of the Transformer Big by freezing pre-trained shallow layers. In concurrent work, Zhang et al. (2019) also point out the same issue as this work, although the two works differ.

In contrast to all previous works, we empirically show that with proper parameter initialization, deep Transformers with the original computation order can converge. The contributions of our work are as follows:

We empirically demonstrate that a simple modification made in the Transformer’s official implementation Vaswani et al. (2018) which changes the computation order of residual connection and layer normalization can effectively ease its optimization;

We analyze in depth how the subtle difference in computation order affects the convergence of deep Transformer models, and propose to initialize deep Transformer models under a Lipschitz restriction;

Our simple approach effectively ensures the convergence of deep Transformers with up to 24 layers, and brings BLEU improvements on the WMT 14 English to German task and the WMT 15 Czech to English task;

We study the influence of the deep decoder in addition to the deep encoder studied by previous works Bapna et al. (2018); Wang et al. (2019), and show that deep decoders can also benefit the performance of the Transformer.

2 Convergence of Different Computation Order

Models                Layers               en-de             cs-en
                      Encoder  Decoder     v1      v2        v1      v2
Bapna et al. (2018)   16       6           28.39   None      29.36   None
Transformer           6        6           27.77   27.31     28.62   28.40
                      12       12          fail    28.12     fail    29.38
                      18       18          fail    28.60     fail    29.61
                      24       24          fail    29.02     fail    29.73
Table 1: Results of Different Computation Order

In our research we focus on the training problems of deep Transformers that prevent them from converging (as opposed to other important issues such as over-fitting on the training set). To alleviate the training problem for the standard Transformer model, Layer Normalization (Ba et al., 2016) and Residual Connection (He et al., 2016) are adopted.

Figure 1: Two Computation Sequences of Transformer Translation Models: (a) the presentation in the paper, (b) the official implementation

The official implementation Vaswani et al. (2018) of the Transformer uses a different computation sequence (Figure 1 b) from the published version Vaswani et al. (2017) (Figure 1 a), since it seems better for harder-to-learn models (see https://github.com/tensorflow/tensor2tensor/blob/v1.6.5/tensor2tensor/layers/common_hparams.py#L110-L112). Though several papers Chen et al. (2018); Domhan (2018) mention this change, to the best of our knowledge, how this modification affects the performance of the Transformer, especially of deep Transformers, has not been studied in depth with empirical results before. The exceptions are Wang et al. (2019), who analyze the difference between the two computation orders during back-propagation, and Zhang et al. (2019), who point out the same effects of normalization in concurrent work.
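The two computation sequences can be sketched on a single vector as follows. This is a minimal illustration rather than the official implementation: `toy_sublayer` is a hypothetical stand-in for a self-attention or feed-forward sub-layer, and the layer normalization uses a scalar weight and bias for brevity.

```python
import math

def layer_norm(x, w=1.0, b=0.0, eps=1e-6):
    """Normalize a vector to zero mean and unit std, then scale by w and shift by b."""
    mu = sum(x) / len(x)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in x) / len(x))
    return [w * (v - mu) / (sigma + eps) + b for v in x]

def toy_sublayer(x):
    """Hypothetical stand-in for a self-attention or feed-forward sub-layer."""
    return [0.5 * v + 0.1 for v in x]

def step_v1(in_res, sublayer=toy_sublayer):
    """Published order: apply the sub-layer, add the residual, then normalize the sum."""
    in_model = sublayer(in_res)
    return layer_norm([m + r for m, r in zip(in_model, in_res)])

def step_v2(in_res, sublayer=toy_sublayer):
    """Official-implementation order: normalize the input, apply the sub-layer, then add the residual."""
    in_model = sublayer(layer_norm(in_res))
    return [m + r for m, r in zip(in_model, in_res)]

x = [1.0, 2.0, 4.0, 8.0]
out_v1 = step_v1(x)   # the residual path passes through layer normalization
out_v2 = step_v2(x)   # the residual path is added unchanged
```

In `step_v2` the input `x` reaches the output untouched, whereas in `step_v1` it is rescaled together with the sub-layer output.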

In order to compare with Bapna et al. (2018), we used the datasets from the WMT 14 English to German task and the WMT 15 Czech to English task for experiments. We applied joint Byte-Pair Encoding (BPE) Sennrich et al. (2016) with 32k merge operations. We used the same settings as the Transformer base Vaswani et al. (2017) except for the number of warm-up steps. We conducted our experiments based on the Neutron implementation (Xu and Liu, 2019) of the Transformer.

Parameters were initialized with Glorot Initialization Glorot and Bengio (2010), which uniformly initializes a matrix within $[-\sqrt{\frac{6}{isize + osize}}, +\sqrt{\frac{6}{isize + osize}}]$, where $isize$ and $osize$ are the two dimensions of the matrix, as in many other Transformer implementations Klein et al. (2017); Hieber et al. (2017); Vaswani et al. (2018). Our experiments ran on 2 GTX 1080 Ti GPUs, and the target batch size in tokens was reached through gradient accumulation of small batches.
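For reference, the standard Glorot uniform scheme draws each entry from a uniform distribution whose limit is the square root of 6 divided by the sum of the two matrix dimensions. The sketch below uses only the Python standard library. The function name `glorot_uniform` and the matrix sizes are illustrative choices, not taken from the Neutron code base.

```python
import math
import random

def glorot_uniform(isize, osize, rng):
    """Return an isize x osize matrix drawn from U(-limit, +limit), limit = sqrt(6 / (isize + osize))."""
    limit = math.sqrt(6.0 / (isize + osize))
    return [[rng.uniform(-limit, limit) for _ in range(osize)] for _ in range(isize)]

rng = random.Random(0)                     # fixed seed so the sketch is reproducible
isize, osize = 64, 256                     # illustrative sizes only
weight = glorot_uniform(isize, osize, rng)
limit = math.sqrt(6.0 / (isize + osize))
```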

We used a beam size of 4 for decoding, and evaluated tokenized case-sensitive BLEU with the averaged model of the last 5 checkpoints saved with an interval of 1,500 training steps Vaswani et al. (2017).
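Checkpoint averaging takes the element-wise mean of each parameter over the saved checkpoints. The sketch below assumes, purely for illustration, that each checkpoint is a dictionary mapping a parameter name to a flat list of floats. Real checkpoints store tensors, and the helper name `average_checkpoints` is hypothetical.

```python
def average_checkpoints(checkpoints):
    """Average parameters element-wise across checkpoints (dicts of name -> list of floats)."""
    n = len(checkpoints)
    averaged = {}
    for name in checkpoints[0]:
        # Group the i-th element of this parameter across all checkpoints.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(vals) / n for vals in columns]
    return averaged

# Five toy checkpoints standing in for the last five saved models.
last_five = [{"w": [float(i), 2.0 * i]} for i in range(1, 6)]
avg = average_checkpoints(last_five)
```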

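v1:
$\mu = mean(in_{model} + in_{res})$
$\sigma = std(in_{model} + in_{res})$
$out_{LN} = \frac{w}{\sigma}(in_{model} + in_{res} - \mu) + b$
$out^{v1}_{res} = out_{LN} = \frac{w}{\sigma} in_{model} + \frac{w}{\sigma} in_{res} - \frac{w}{\sigma}\mu + b$

v2:
$\mu = mean(in_{res})$
$\sigma = std(in_{res})$
$out_{LN} = \frac{w}{\sigma}(in_{res} - \mu) + b$
$out^{v2}_{res} = in_{res} + in_{model}$
Table 2: Computation with Layer Normalization and Residual Connection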

Results of the two computation orders are shown in Table 1. v1 and v2 stand for the computation order of the published Transformer Vaswani et al. (2017) and that of the official implementation Vaswani et al. (2018) respectively. Models that fail to converge have no BLEU score reported, “None” means not reported in the original work, and “*” indicates our implementation of their approach. Significance tests compare v1 and v2 with the same number of layers.

3 Analysis and Lipschitz Restricted Parameter Initialization

Since the subtle change of computation order results in huge differences in convergence, we analyze the differences between the computation orders to figure out how they affect convergence.

3.1 Comparison between Computation Orders

As a conjecture, we think that the convergence issue of deep Transformers is perhaps due to the fact that, with layer normalization applied over residual connections as in Figure 1 (a), the residual connections are likely to be hampered by layer normalization, which tends to shrink consecutive residual connections to avoid a potential explosion of the combined layer outputs Chen et al. (2018). We studied how the layer normalization and the residual connection are computed in the two computation orders, as shown in Table 2.

“mean” and “std” denote the computation of the mean value and the standard deviation. $in_{model}$ and $in_{res}$ stand for the output of the current layer and the accumulated outputs from previous layers respectively. $w$ and $b$ are the weight and bias of layer normalization, which are initialized with a vector full of $1$ and another vector full of $0$. $out_{LN}$ is the computation result of the layer normalization. $out^{v1}_{res}$ and $out^{v2}_{res}$ are the results of the residual connections of v1 and v2.

Table 2 shows that the computation of the residual connection in v1 is weighted by $\frac{w}{\sigma}$ compared to v2, where $\sigma$ is the standard deviation computed by layer normalization, and the residual connection of previous layers will be shrunk in case $\frac{w}{\sigma} < 1$.
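The weighting can be checked numerically. The sketch below uses hypothetical values for the sub-layer output and the accumulated residual, with scalar layer normalization parameters at their initial values. It confirms that normalizing the sum is the same as scaling both branches by the weight divided by the standard deviation.

```python
import math

def mean(x):
    return sum(x) / len(x)

def std(x):
    mu = mean(x)
    return math.sqrt(sum((v - mu) ** 2 for v in x) / len(x))

in_model = [0.5, -1.0, 2.0, 3.5]   # hypothetical sub-layer output
in_res = [4.0, -2.0, 1.0, 6.0]     # hypothetical accumulated residual
w, b = 1.0, 0.0                    # layer normalization weight and bias at initialization

summed = [m + r for m, r in zip(in_model, in_res)]
mu, sigma = mean(summed), std(summed)

# v1: the residual branch passes through layer normalization.
out_v1 = [w * (s - mu) / sigma + b for s in summed]

# Expanded form: both branches are scaled by w / sigma.
expanded = [(w / sigma) * m + (w / sigma) * r - (w / sigma) * mu + b
            for m, r in zip(in_model, in_res)]

scale = w / sigma   # below 1 for these values, so the residual is shrunk
```

With these values the standard deviation of the sum is larger than 1, so `scale` is below 1 and the residual contribution shrinks.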

We suggest that Bapna et al. (2018) introduced the TA mechanism to compensate for normalized residual connections in the published Transformer by combining the outputs of shallow layers into the final encoder output, and thereby obtained significant improvements with deep Transformer models. Wang et al. (2019) additionally aggregate the outputs of previous layers for each encoder layer instead of only at the end of encoding.

3.2 Lipschitz Restricted Parameter Initialization

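Since the convergence issue of deep v1 Transformers is likely caused by the shrunken residual connections, is it possible to restrict $\frac{w}{\sigma} \ge 1$? Given that $w$ is initialized with $1$, we suggest to restrict the standard deviation $\sigma$ of $in_{model} + in_{res}$:

$$0 < \sigma = std(in_{model} + in_{res}) \le 1 \tag{1}$$

in which case $\frac{w}{\sigma}$ will be greater than or at least equal to $1$, and the residual connection of v1 will not be shrunk anymore. To achieve this goal, we can restrict the values of $in_{model} + in_{res}$ to lie in $[a, b]$, ensuring that the standard deviation of their distribution is at most $1$.

Define $P(x)$ as any probability distribution of $x$ between $[a, b]$:

$$\int_a^b P(x)\,dx = 1 \tag{2}$$

then the standard deviation of $x$ is:

$$\sigma(x) = \sqrt{\int_a^b P(x)\,(x - \bar{x})^2\,dx}, \quad \bar{x} = \int_a^b P(x)\,x\,dx \tag{3}$$

given that:

$$(x - \bar{x})^2 \le (b - a)^2 \tag{4}$$

for $x$ as constrained by Equation 2 (both $x$ and $\bar{x}$ lie in $[a, b]$), we can make Equation 3 into:

$$\sigma(x) \le \sqrt{\int_a^b P(x)\,(b - a)^2\,dx} \tag{5}$$

cleaning up Equation 5, we get:

$$\sigma(x) \le (b - a)\sqrt{\int_a^b P(x)\,dx} \tag{6}$$

after applying Equation 2 to Equation 6, we find that:

$$\sigma(x) \le b - a \tag{7}$$

Thus, as long as:

$$b - a \le 1 \tag{8}$$

the requirement on the corresponding $\sigma$ described in Equation 1 can be satisfied.

This goal can be simply achieved by initializing the sub-model before layer normalization to be a $k$-Lipschitz function, where $k \le 1$.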
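As a numerical sanity check of this argument, the sketch below samples values from three different distributions supported on an interval of width at most 1. It then verifies that the standard deviation never exceeds the interval width, so a layer normalization weight of 1 divided by that standard deviation is at least 1. The bounds `a` and `b` and the choice of distributions are hypothetical.

```python
import math
import random

def std(x):
    mu = sum(x) / len(x)
    return math.sqrt(sum((v - mu) ** 2 for v in x) / len(x))

rng = random.Random(0)
a, b = -0.4, 0.5   # hypothetical bounds with b - a <= 1

samples = {
    "uniform": [rng.uniform(a, b) for _ in range(10000)],
    "two_point": [rng.choice([a, b]) for _ in range(10000)],   # mass only at the endpoints
    "triangular": [rng.triangular(a, b) for _ in range(10000)],
}

sigmas = {name: std(xs) for name, xs in samples.items()}
scales = {name: 1.0 / s for name, s in sigmas.items()}   # w / sigma with w = 1
```

The two-point distribution is the extreme case, since it puts all mass at the endpoints, and its standard deviation still stays well below the interval width.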

The $k$-Lipschitz restriction can be satisfied effectively through weight clipping (note that the weight of the layer normalization cannot be clipped, otherwise residual connections will be shrunk more heavily). We empirically find that applying the restriction only to parameter initialization is sufficient, which is more efficient and avoids the potential risk of weight clipping hurting performance.
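To illustrate how clipping can bound the Lipschitz constant of a linear map, the sketch below measures the constant under the infinity norm, where it equals the largest absolute row sum of the weight matrix. The choice of norm, the clipping threshold `k / isize`, and the helper names are assumptions made for this example and are not specified in the text.

```python
import random

def clip_weights(weight, c):
    """Clamp every entry of a weight matrix (list of rows) into [-c, +c]."""
    return [[max(-c, min(c, v)) for v in row] for row in weight]

def linf_lipschitz_constant(weight):
    """Lipschitz constant of x -> W x under the infinity norm: the largest absolute row sum."""
    return max(sum(abs(v) for v in row) for row in weight)

rng = random.Random(0)
isize, osize, k = 8, 4, 1.0
# Rows index outputs and columns index inputs, so each row has isize entries.
weight = [[rng.uniform(-1.0, 1.0) for _ in range(isize)] for _ in range(osize)]

clipped = clip_weights(weight, c=k / isize)   # hypothetical clipping threshold
before = linf_lipschitz_constant(weight)
after = linf_lipschitz_constant(clipped)      # at most isize * (k / isize) = k
```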

In practice, we initialize embedding matrices and weights of linear transformations with uniform distributions over $[-e, +e]$ and $[-l, +l]$ respectively, where $e$ and $l$ are set according to $esize$, $vsize$ and $isize$, which stand for the size of the embedding, the vocabulary size and the input dimension of the linear transformation respectively (chosen to preserve the magnitude of the variance of the weights in the forward pass).
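A minimal sketch of such an initialization is given below. The exact bounds are not recoverable from the text above, so the values used here are assumptions for illustration only: the square root of 2 over (esize + vsize) for embeddings and the square root of 1 over isize for linear weights. The helper names are hypothetical as well.

```python
import math
import random

def uniform_matrix(rows, cols, bound, rng):
    """Return a rows x cols matrix with entries drawn from U(-bound, +bound)."""
    return [[rng.uniform(-bound, bound) for _ in range(cols)] for _ in range(rows)]

def init_embedding(vsize, esize, rng):
    # ASSUMPTION: embedding bound sqrt(2 / (esize + vsize)); not given in the text.
    bound = math.sqrt(2.0 / (esize + vsize))
    return uniform_matrix(vsize, esize, bound, rng)

def init_linear(isize, osize, rng):
    # ASSUMPTION: linear bound sqrt(1 / isize), a fan-in scaling; not given in the text.
    bound = math.sqrt(1.0 / isize)
    return uniform_matrix(isize, osize, bound, rng)

rng = random.Random(0)
emb = init_embedding(vsize=1000, esize=64, rng=rng)   # illustrative sizes only
lin = init_linear(isize=64, osize=256, rng=rng)
```

Both bounds shrink as the matrices grow, which keeps the initial weights small for large vocabularies and wide layers.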

Layers   en-de             cs-en
         v1’     v2’       v1’     v2’
6        27.96   27.38     28.78   28.39
12       28.67   28.13     29.17   29.45
18       29.05   28.67     29.55   29.63
24       29.46   29.20     29.70   29.88
Table 3: Results with Lipschitz Restricted Parameter Initialization

Results for the two computation orders with the new parameter initialization method are shown in Table 3. v1’ indicates v1 with Lipschitz restricted parameter initialization, and likewise for v2’. Table 3 shows that deep v1 models no longer suffer from convergence problems with our new parameter initialization approach.

4 Effects of Deeper Encoder and Deeper Decoder

Previous approaches Bapna et al. (2018); Wang et al. (2019) only increase the depth of the encoder, while we suggest that deep decoders should also be helpful. We analyzed the influence of deep encoders and decoders separately, and the results are shown in Table 4.

Encoder  Decoder  en-de   cs-en
6        6        27.96   28.78
24       6        28.76   29.20
6        24       28.63   29.36
24       24       29.46   29.70
Table 4: Effects of Encoder and Decoder Depth

Table 4 shows that the deep decoder can benefit the performance in addition to the deep encoder, especially on the Czech to English task.

5 Conclusion

Previous works (Bapna et al., 2018; Wang et al., 2019; Wu et al., 2019) show that deep Transformers with the computation order of Vaswani et al. (2017) have difficulty converging. In contrast, we empirically show that deep Transformers with the original computation order can converge as long as proper parameter initialization is used.

In this paper, we first investigate convergence differences between the published Transformer (Vaswani et al., 2017) and the official implementation of the Transformer (Vaswani et al., 2018), and compare the differences of computation orders between them. Then we conjecture that the training problem of deep Transformers arises because layer normalization sometimes shrinks residual connections, and propose that this can be tackled simply with Lipschitz restricted parameter initialization.

Our experiments demonstrate the effectiveness of our simple approach on the convergence of deep Transformers, which brings significant improvements on the WMT 14 English to German and the WMT 15 Czech to English news translation tasks. We also study the effects of deep decoders in addition to the deep encoders considered in previous works.


Hongfei Xu is supported by a doctoral grant from China Scholarship Council ([2018]3101, 201807040056). This work is also supported by the German Federal Ministry of Education and Research (BMBF) under the funding code 01IW17001 (Deeplee).