1 Introduction
Neural machine translation has achieved great success in the last few years (Bahdanau et al., 2014; Gehring et al., 2017; Vaswani et al., 2017). The Transformer (Vaswani et al., 2017), which has outperformed previous RNN/CNN-based translation models (Bahdanau et al., 2014; Gehring et al., 2017), is based on multi-layer self-attention networks and can be trained very efficiently. The multi-layer structure allows the Transformer to model complicated functions. Increasing the depth of models can increase their capacity, but may also cause optimization difficulties (Mhaskar et al., 2017; Telgarsky, 2016; Eldan and Shamir, 2016; He et al., 2016; Bapna et al., 2018). In order to ease optimization, the Transformer employs residual connections and layer normalization, techniques which have proven useful in reducing the optimization difficulties of deep neural networks for various tasks (He et al., 2016; Ba et al., 2016).
However, even with residual connections and layer normalization, deep Transformers are still hard to train: the original Transformer (Vaswani et al., 2017) only contains 6 encoder/decoder layers. Bapna et al. (2018) show that Transformer models with more than 12 encoder layers fail to converge, and propose the Transparent Attention (TA) mechanism, which combines weighted outputs of all encoder layers as the encoded representation. However, the TA mechanism has to value the outputs of shallow encoder layers in order to feed back sufficient gradients during backpropagation and ensure their convergence; this implies that the weights of deep layers are likely to be hampered, working against the motivation for going very deep, and as a result Bapna et al. (2018) cannot obtain further improvements with more than 16 layers. Wang et al. (2019) reveal that deep Transformers with proper use of layer normalization are able to converge, and propose to aggregate previous layers’ outputs for each layer instead of only at the end of encoding. Wu et al. (2019) investigate incrementally increasing the depth of the Transformer Big by freezing pre-trained shallow layers. In concurrent work, Zhang et al. (2019) also point out the same issue as in this work, but there are differences between the two approaches.
In contrast to all previous works, we empirically show that with proper parameter initialization, deep Transformers with the original computation order can converge. The contributions of our work are as follows:
We empirically demonstrate that a simple modification made in the Transformer’s official implementation (Vaswani et al., 2018), which changes the computation order of residual connection and layer normalization, can effectively ease its optimization;
We analyze in depth how the subtle difference in computation order affects the convergence of deep Transformer models, and propose to initialize deep Transformer models under a Lipschitz restriction;
Our simple approach effectively ensures the convergence of deep Transformers with up to 24 layers, and brings BLEU improvements on the WMT 14 English to German task and the WMT 15 Czech to English task.
2 Convergence of Different Computation Orders
| Models | Encoder | Decoder | en-de v1 | en-de v2 | cs-en v1 | cs-en v2 |
|---|---|---|---|---|---|---|
| Bapna et al. (2018) | 16 | 6 | 28.39 | None | 29.36 | None |
| Transformer | 6 | 6 | 27.77 | 27.31 | 28.62 | 28.40 |
| | 12 | 6 | — | 28.12 | — | 29.38 |
| | 18 | 6 | — | 28.60 | — | 29.61 |
| | 24 | 6 | — | 29.02 | — | 29.73 |
In our research, we focus on the training problems that prevent deep Transformers from converging (as opposed to other important issues such as overfitting on the training set). To alleviate the training problem, the standard Transformer model adopts Layer Normalization (Ba et al., 2016) and Residual Connections (He et al., 2016).
The official implementation (Vaswani et al., 2018) of the Transformer uses a different computation sequence (Figure 1 b) compared to the published version (Vaswani et al., 2017) (Figure 1 a), since it appears better for harder-to-learn models (https://github.com/tensorflow/tensor2tensor/blob/v1.6.5/tensor2tensor/layers/common_hparams.py#L110-L112). Though several papers (Chen et al., 2018; Domhan, 2018) mentioned this change, to the best of our knowledge how this modification impacts the performance of the Transformer, especially of deep Transformers, has never been studied in depth with empirical results, except that Wang et al. (2019) analyzed the difference between the two computation orders during backpropagation, and Zhang et al. (2019) point out the same effects of normalization in concurrent work.
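The difference between the two computation orders can be sketched in a few lines. This is an illustrative NumPy sketch rather than the actual Transformer code, and the function names (`sublayer_v1`, `sublayer_v2`) are our own:

```python
import numpy as np

def layer_norm(x, w, b, eps=1e-6):
    """Normalize the last dimension, then scale by weight w and shift by bias b."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return w * (x - mean) / (std + eps) + b

def sublayer_v1(x, f, w, b):
    """v1 (published order): layer normalization applied over the residual sum."""
    return layer_norm(x + f(x), w, b)

def sublayer_v2(x, f, w, b):
    """v2 (official implementation): layer normalization applied to the input
    only; the residual connection is added outside the normalization."""
    return x + f(layer_norm(x, w, b))
```

With the standard initialization (weight of ones, bias of zeros), v2 passes the residual `x` through unscaled, while v1 rescales everything, residual included, by the normalization.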
In order to compare with Bapna et al. (2018), we used the datasets of the WMT 14 English to German task and the WMT 15 Czech to English task for our experiments. We applied joint Byte-Pair Encoding (BPE) (Sennrich et al., 2016) with 32k merge operations. We used the same settings as the Transformer base (Vaswani et al., 2017), except for the number of warmup steps. We conducted our experiments with the Neutron implementation (Xu and Liu, 2019) of the Transformer.
Parameters were initialized with Glorot initialization (Glorot and Bengio, 2010), in which matrices are uniformly initialized between bounds determined by their two dimensions, as in many other Transformer implementations (Klein et al., 2017; Hieber et al., 2017; Vaswani et al., 2018). Our experiments ran on 2 GTX 1080 Ti GPUs, and a sufficiently large batch size in target tokens was achieved through gradient accumulation of small batches.
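For reference, Glorot (Xavier) uniform initialization draws each weight from a uniform distribution with bound sqrt(6 / (fan_in + fan_out)); a minimal NumPy sketch (the function name is ours):

```python
import numpy as np

def glorot_uniform(n_in, n_out, rng=None):
    """Glorot/Xavier uniform initialization: sample from
    U(-sqrt(6/(n_in+n_out)), +sqrt(6/(n_in+n_out))), chosen so that
    activation variance is roughly preserved across layers."""
    rng = np.random.default_rng() if rng is None else rng
    bound = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-bound, bound, size=(n_in, n_out))
```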
We used a beam size of 4 for decoding, and evaluated tokenized case-sensitive BLEU with the averaged model of the last 5 checkpoints, saved at intervals of 1,500 training steps (Vaswani et al., 2017).
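Checkpoint averaging as used here simply takes the element-wise mean of corresponding parameters across the saved checkpoints; a minimal sketch, assuming each checkpoint is a name-to-array dictionary:

```python
import numpy as np

def average_checkpoints(checkpoints):
    """Average corresponding parameters across checkpoints, where each
    checkpoint is a dict mapping parameter name -> NumPy array."""
    names = checkpoints[0].keys()
    return {name: np.mean([ckpt[name] for ckpt in checkpoints], axis=0)
            for name in names}
```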
[Table 2: the layer normalization and residual connection computations of v1 and v2]
Results for the two different computation orders are shown in Table 1. v1 and v2 stand for the computation order of the published Transformer (Vaswani et al., 2017) and that of the official implementation (Vaswani et al., 2018) respectively. “—” means failure to converge, “None” means not reported in the original work, and “*” indicates our implementation of their approach. Significance markers indicate the results of significance tests comparing v1 and v2 at the same number of layers.
3 Analysis and Lipschitz Restricted Parameter Initialization
Since the subtle change of computation order results in huge differences in convergence, we analyze the differences between the computation orders to figure out how they affect convergence.
3.1 Comparison between Computation Orders
As a conjecture, we think the convergence issue of deep Transformers may be due to the fact that, in Figure 1 (a), layer normalization is computed over the residual connection: residual connections are likely to be hampered by layer normalization, which tends to shrink consecutive residual connections to avoid a potential explosion of combined layer outputs (Chen et al., 2018). We studied how layer normalization and the residual connection are computed in the two computation orders, as shown in Table 2.
“mean” and “std” denote the computation of the mean value and the standard deviation. The two inputs stand for the output of the current layer and the accumulated outputs of previous layers respectively. The weight and bias of layer normalization are initialized as a vector of ones and a vector of zeros, and the results are the output of the layer normalization and the residual connections of v1 and v2. Table 2 shows that, compared to v2, the computation of the residual connection in v1 is additionally weighted by the layer normalization, and the residual connection of previous layers will be shrunk when that weight is smaller than 1.
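The shrinking effect can be illustrated numerically: in v1 the accumulated residual is effectively rescaled by the layer normalization weight divided by the standard deviation of the combined output, and this factor falls below 1 once that standard deviation exceeds 1. A toy NumPy illustration (the shapes and distributions here are our own assumptions):

```python
import numpy as np

def residual_scale_v1(x, out, w):
    """Effective weight on the residual path in v1: since v1 computes
    LN(x + out), the accumulated residual x is rescaled by w / std(x + out)."""
    return w / (x + out).std(axis=-1, keepdims=True)

# When the layer output has a magnitude comparable to the residual,
# std(x + out) grows above 1, so with w initialized to ones the
# residual path is shrunk.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(1, 512))    # accumulated residual
out = rng.normal(0.0, 1.0, size=(1, 512))  # current layer output
w = np.ones(512)
scale = residual_scale_v1(x, out, w)
# std(x + out) is close to sqrt(2), so the scale is close to 1/sqrt(2) < 1
```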
We suggest that Bapna et al. (2018) introduced the TA mechanism to compensate for normalized residual connections by combining the outputs of shallow layers into the final encoder output of the published Transformer, and obtained significant improvements with deep Transformer models. Wang et al. (2019) additionally aggregate the outputs of previous layers for each encoder layer instead of only at the end of encoding.
3.2 Lipschitz Restricted Parameter Initialization
Since the convergence issue of deep v1 Transformers is likely caused by the shrunken residual connections, is it possible to prevent this shrinking? Given that the weight of layer normalization is initialized as a vector of ones, we suggest restricting the standard deviation of the layer normalization input:
(1) 
in which case the scaling factor on the residual connection will be greater than or at least equal to 1, and the residual connection of v1 will no longer be shrunk. To achieve this goal, we can restrict input values to a bounded range while ensuring that the variance of their distribution is sufficiently small.
(2) 
then the standard deviation is:
(3) 
given that:
(4) 
(5) 
Simplifying Equation 5, we obtain:
(6) 
(7) 
Thus, as long as:
(8) 
the requirement on the standard deviation described in Equation 1 can be satisfied.
This goal can simply be achieved by initializing the sub-model before layer normalization to be a k-Lipschitz function, with a suitably bounded k.
The k-Lipschitz restriction can be enforced effectively through weight clipping (note that the weight of the layer normalization must not be clipped, otherwise residual connections will be shrunk even more heavily), and we empirically find that applying the restriction only at parameter initialization is sufficient, which is more efficient and avoids the potential risk of weight clipping harming performance.
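Both options can be sketched in NumPy. The clipping helper follows the standard weight-clipping recipe; the initialization bound `k / n_in` is an illustrative choice that makes the linear map k-Lipschitz under the max norm, not the paper's exact formula:

```python
import numpy as np

def clip_weights(w, k):
    """Weight clipping: restrict each entry of w to [-k, k]."""
    return np.clip(w, -k, k)

def lipschitz_uniform_init(n_in, n_out, k=1.0, rng=None):
    """Initialization-only alternative to clipping: draw weights from a
    uniform range narrow enough that x -> x @ W is k-Lipschitz under the
    max norm (each output coordinate changes by at most k times the largest
    input change). The bound k / n_in is an illustrative assumption."""
    rng = np.random.default_rng() if rng is None else rng
    bound = k / n_in
    return rng.uniform(-bound, bound, size=(n_in, n_out))
```

Applying the restriction only at initialization, as in the paper, leaves training itself unconstrained.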
In practice, we initialize embedding matrices and the weights of linear transformations with uniform distributions whose bounds depend on the embedding size, the vocabulary size, and the input dimension of the linear transformation respectively, chosen to preserve the magnitude of the variance of the weights in the forward pass.

| Layers | en-de v1’ | en-de v2’ | cs-en v1’ | cs-en v2’ |
|---|---|---|---|---|
| 6 | 27.96 | 27.38 | 28.78 | 28.39 |
| 12 | 28.67 | 28.13 | 29.17 | 29.45 |
| 18 | 29.05 | 28.67 | 29.55 | 29.63 |
| 24 | 29.46 | 29.20 | 29.70 | 29.88 |
Results for the two computation orders with the new parameter initialization method are shown in Table 3. v1’ indicates v1 with Lipschitz restricted parameter initialization, and likewise for v2’. Table 3 shows that with our new parameter initialization approach, deep v1 models no longer suffer from the convergence problem.
4 Effects of Deeper Encoder and Deeper Decoder
Previous approaches (Bapna et al., 2018; Wang et al., 2019) only increase the depth of the encoder, while we suggest that deep decoders should also be helpful. We analyzed the influence of deep encoders and decoders separately; results are shown in Table 4.
| Encoder | Decoder | en-de | cs-en |
|---|---|---|---|
| 6 | 6 | 27.96 | 28.78 |
| 24 | 6 | 28.76 | 29.20 |
| 6 | 24 | 28.63 | 29.36 |
| 24 | 24 | 29.46 | 29.70 |
Table 4 shows that a deep decoder benefits performance in addition to a deep encoder, especially on the Czech to English task.
5 Conclusion
In contrast to all previous works (Bapna et al., 2018; Wang et al., 2019; Wu et al., 2019), which show that deep Transformers with the computation order of Vaswani et al. (2017) have difficulty converging, we empirically show that deep Transformers with the original computation order can converge given proper parameter initialization.
In this paper, we first investigate convergence differences between the published Transformer (Vaswani et al., 2017) and the official implementation of the Transformer (Vaswani et al., 2018), and compare the differences in computation order between them. We then conjecture that the training problem of deep Transformers arises because layer normalization sometimes shrinks residual connections, and propose that this can be tackled simply with Lipschitz restricted parameter initialization.
Our experiments demonstrate the effectiveness of our simple approach for the convergence of deep Transformers, which brings significant improvements on the WMT 14 English to German and the WMT 15 Czech to English news translation tasks. We also study the effect of deep decoders in addition to the deep encoders considered in previous works.
Acknowledgments
Hongfei Xu is supported by a doctoral grant from China Scholarship Council ([2018]3101, 201807040056). This work is also supported by the German Federal Ministry of Education and Research (BMBF) under the funding code 01IW17001 (Deeplee).
References
 Ba et al. (2016) Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450.
 Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

 Bapna et al. (2018) Ankur Bapna, Mia Chen, Orhan Firat, Yuan Cao, and Yonghui Wu. 2018. Training deeper neural machine translation models with transparent attention. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3028–3033. Association for Computational Linguistics.
 Chen et al. (2018) Mia Xu Chen, Orhan Firat, Ankur Bapna, Melvin Johnson, Wolfgang Macherey, George Foster, Llion Jones, Mike Schuster, Noam Shazeer, Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Zhifeng Chen, Yonghui Wu, and Macduff Hughes. 2018. The best of both worlds: Combining recent advances in neural machine translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 76–86. Association for Computational Linguistics.
 Domhan (2018) Tobias Domhan. 2018. How much attention do you need? a granular analysis of neural machine translation architectures. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1799–1808. Association for Computational Linguistics.

 Eldan and Shamir (2016) Ronen Eldan and Ohad Shamir. 2016. The power of depth for feedforward neural networks. In 29th Annual Conference on Learning Theory, volume 49 of Proceedings of Machine Learning Research, pages 907–940, Columbia University, New York, New York, USA. PMLR.
 Gehring et al. (2017) Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. Convolutional sequence to sequence learning. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1243–1252, International Convention Centre, Sydney, Australia. PMLR.

 Glorot and Bengio (2010) Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pages 249–256, Chia Laguna Resort, Sardinia, Italy. PMLR.
 He et al. (2016) K. He, X. Zhang, S. Ren, and J. Sun. 2016. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778.
 Hieber et al. (2017) Felix Hieber, Tobias Domhan, Michael Denkowski, David Vilar, Artem Sokolov, Ann Clifton, and Matt Post. 2017. Sockeye: A toolkit for neural machine translation. arXiv preprint arXiv:1712.05690.
 Klein et al. (2017) G. Klein, Y. Kim, Y. Deng, J. Senellart, and A. M. Rush. 2017. OpenNMT: Open-Source Toolkit for Neural Machine Translation. ArXiv e-prints.
 Mhaskar et al. (2017) Hrushikesh Mhaskar, Qianli Liao, and Tomaso Poggio. 2017. When and why are deep networks better than shallow ones? In Proceedings of the ThirtyFirst AAAI Conference on Artificial Intelligence, pages 2343–2348.
 Sennrich et al. (2016) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725. Association for Computational Linguistics.
 Telgarsky (2016) Matus Telgarsky. 2016. Benefits of depth in neural networks. In 29th Annual Conference on Learning Theory, volume 49 of Proceedings of Machine Learning Research, pages 1517–1539, Columbia University, New York, New York, USA. PMLR.
 Vaswani et al. (2018) Ashish Vaswani, Samy Bengio, Eugene Brevdo, Francois Chollet, Aidan N. Gomez, Stephan Gouws, Llion Jones, Łukasz Kaiser, Nal Kalchbrenner, Niki Parmar, Ryan Sepassi, Noam Shazeer, and Jakob Uszkoreit. 2018. Tensor2tensor for neural machine translation. CoRR, abs/1803.07416.
 Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Ł ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.
 Wang et al. (2019) Qiang Wang, Bei Li, Tong Xiao, Jingbo Zhu, Changliang Li, Derek F. Wong, and Lidia S. Chao. 2019. Learning deep transformer models for machine translation. In Proceedings of the 57th Conference of the Association for Computational Linguistics, pages 1810–1822, Florence, Italy. Association for Computational Linguistics.
 Wu et al. (2019) Lijun Wu, Yiren Wang, Yingce Xia, Fei Tian, Fei Gao, Tao Qin, Jianhuang Lai, and TieYan Liu. 2019. Depth growing for neural machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5558–5563, Florence, Italy. Association for Computational Linguistics.
 Xu and Liu (2019) Hongfei Xu and Qiuhui Liu. 2019. Neutron: An Implementation of the Transformer Translation Model and its Variants. arXiv preprint arXiv:1903.07402.
 Zhang et al. (2019) Biao Zhang, Ivan Titov, and Rico Sennrich. 2019. Improving deep transformer with depthscaled initialization and merged attention. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLPIJCNLP), pages 897–908, Hong Kong, China. Association for Computational Linguistics.