1 Introduction
Transformers (Vaswani et al., 2017) have led to a series of breakthroughs in various deep learning tasks (Devlin et al., 2019; Velickovic et al., 2018). They contain no recurrent connections and can parallelize all computations within a layer, thus improving effectiveness, efficiency, and scalability. Training Transformers, however, requires extra effort. For example, although standard stochastic gradient descent (SGD) is a canonical algorithm for conventional RNNs and CNNs, it converges to bad/suspicious local optima for Transformers (Zhang et al., 2019b). Moreover, compared to other neural architectures, removing the warmup stage from Transformer training results in more severe consequences such as model divergence (Popel and Bojar, 2018; Liu et al., 2020). In this paper, we conduct a comprehensive analysis, in both theoretical and empirical manners, to answer the question: what complicates Transformer training?

Our analysis starts from the observation that the original Transformer (referred to as PostLN) is less robust than its PreLN variant (Baevski and Auli, 2019; Nguyen and Salazar, 2019). We recognize that gradient vanishing is not the direct reason for this difference, since fixing this issue alone cannot stabilize PostLN training. This implies that there exist factors other than unbalanced gradients that greatly influence model training.
With further analysis, we recognize that, for each layer, the dependency on its residual branch^1 plays an important role in training stability. First, we find that a PostLN layer has a heavier dependency on its residual branch than a PreLN layer. For example, at initialization, a PreLN layer has roughly the same dependency on its residual branch as on any previous layer, whereas a PostLN layer has a stronger dependency on its own residual branch (more discussion is presented in Section 4.1). We find that the strong dependencies of PostLN amplify fluctuations brought by parameter changes and destabilize model training (as in Theorem 4.2 and Figure 2). On the other hand, the loose reliance on residual branches in PreLN generally limits the potential of the training algorithm and often produces an inferior model.
^1 For a residual layer, its shortcut output refers to the layer input carried by the skip connection, its residual branch refers to the sublayer applied to that input, and the dependency on the residual branch refers to the proportion of the layer output contributed by the residual branch.
In light of our analysis, we propose Admin, an adaptive initialization method for training PostLN Transformer models, which retains the merits of PreLN stability without hurting performance. It restricts each layer's dependency on its residual branch in the early stage of training and unleashes the model's potential in the late stage. Admin is more stable, converges faster, and performs better in extensive experiments on IWSLT'14 De-En and WMT'14 En-De.^2
^2 Code is released at: https://github.com/LiyuanLucasLiu/TransforemrClinic.
2 Preliminaries
Transformer Architectures and Notations. The Transformer architecture contains two types of sublayers, i.e., Attention sublayers and Feedforward (FFN) sublayers. They are composed of mainly three basic modules (Vaswani et al., 2017): Layer Norm (f_LN), Multihead Attention (f_ATT), and Feedforward Network (f_FFN). As illustrated in Figure 4, the PreLN Transformer and the PostLN Transformer organize these modules differently. For example, a PreLN encoder organizes the SelfAttention sublayer as x_i = x_{i-1} + f_ATT(f_LN(x_{i-1})), and a PostLN encoder as x_i = f_LN(x_{i-1} + f_ATT(x_{i-1})), where x_{i-1} is the sublayer input and x_i is the SelfAttention sublayer output. Here, we refer to modules such as f_ATT(·) and f_FFN(·) as residual branches and their outputs as residual outputs, in contrast to layer/sublayer outputs, which integrate residual outputs and shortcut outputs. Notation elaborations are shown in Figure 4. In particular, we use superscripts to indicate network architectures (e.g., the PreLN encoder), use subscripts to indicate layer indexes (top layers have larger indexes), and formulate all inputs and outputs as matrices whose rows are sequence positions and whose columns are hidden dimensions.
Layer Norm. Layer norm (Ba et al., 2016) plays a key role in the Transformer architecture. It is defined as f_LN(x) = γ (x − μ)/σ + ν, where μ and σ are the mean and standard deviation of x, and γ and ν are learnable parameters.

Feedforward Network. Transformers use two-layer perceptrons as feedforward networks, i.e., f_FFN(x) = φ(x W^{(1)}) W^{(2)}, where φ(·) is the nonlinear function^3 and W^{(1)} and W^{(2)} are parameters.
^3 In our analysis, we use ReLU as the activation function, while Admin can be applied to other nonlinear functions as well.

Multihead Attention. Multihead Attention allows the network to have multiple focuses and has been shown to be effective for many tasks (Chen et al., 2018). With H heads, it is defined as f_ATT(q, k, v) = Σ_h softmax(q W_h^{(Q)} (k W_h^{(K)})^T) v W_h^{(V)} W_h^{(O)}, where softmax is the row-wise softmax function and W_h^{(Q)}, W_h^{(K)}, W_h^{(V)}, W_h^{(O)} are parameters. W_h^{(Q)} and W_h^{(K)} are d × d_k matrices, and W_h^{(V)} and W_h^{(O)} are d × d_h and d_h × d matrices, where d is the hidden state dimension. Parameters without a head subscript refer to the concatenation of all head parameters, e.g., W^{(Q)} = [W_1^{(Q)}, …, W_H^{(Q)}]. This module is used in two different manners: EncoderAttention (i.e., f_ATT(x, x_e, x_e), where x_e denotes encoder outputs) and SelfAttention (i.e., f_ATT(x, x, x)).
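As a concrete reference, the three modules and the two sublayer orderings above can be sketched in numpy as follows. The dimensions, initialization scales, and scalar γ/ν treatment are illustrative simplifications, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, heads = 8, 5, 2   # hidden size, sequence length, attention heads (toy values)
dk = d // heads

def layer_norm(x, gamma=1.0, nu=0.0, eps=1e-6):
    # f_LN(x) = gamma * (x - mu) / sigma + nu, computed per position
    mu = x.mean(-1, keepdims=True)
    sigma = x.std(-1, keepdims=True)
    return gamma * (x - mu) / (sigma + eps) + nu

def ffn(x, W1, W2):
    # two-layer perceptron with ReLU: phi(x W1) W2
    return np.maximum(x @ W1, 0.0) @ W2

def softmax(z):
    z = z - z.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

def self_attention(x, Wq, Wk, Wv, Wo):
    # multi-head attention in the self-attention manner f_ATT(x, x, x);
    # each head projects, attends, and the concatenation is mixed by Wo
    outs = []
    for h in range(heads):
        q, k, v = x @ Wq[h], x @ Wk[h], x @ Wv[h]
        outs.append(softmax(q @ k.T / np.sqrt(dk)) @ v)
    return np.concatenate(outs, -1) @ Wo

def sublayer(x, f, style):
    # PreLN: x + f(LN(x));  PostLN: LN(x + f(x))
    if style == "pre":
        return x + f(layer_norm(x))
    return layer_norm(x + f(x))
```

Note how the only difference between the two architectures is whether layer norm is applied to the branch input (PreLN) or to the integrated output (PostLN); a PostLN sublayer output therefore always has unit per-position variance, while a PreLN output does not.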
3 Unbalanced Gradients
Here, we strive to answer the question: what complicates Transformer training. Our analysis starts from the observation: PreLN training is more robust than that of PostLN while PostLN is more likely to reach a better performance than PreLN. For example, in a parameter grid search (as in Figure 10), PreLN converges in all 15 settings and PostLN diverges in 7 out of 15 settings; when PostLN converges, it outperforms PreLN in 7 out of 8 settings. We seek to reveal the underlying factor that destabilizes PostLN and restricts the performance of PreLN.
In this section, we focus on the unbalanced gradients (e.g., gradient vanishing). We find that, although PostLN suffers from gradient vanishing and PreLN does not, gradient vanishing is not the direct reason causing the instability of PostLN. Specifically, we first theoretically and empirically establish that only PostLN decoders suffer from gradient vanishing and PostLN encoders do not. We then observe that fixing the gradient vanishing issue alone cannot stabilize training.
3.1 Gradients at Initialization
As gradient vanishing can hamper convergence from the beginning, it has been regarded as the major issue leading to unstable training. Recent studies also show that this issue exists in the PostLN Transformer, even with residual connections (Xiong et al., 2019). Below, we establish that only PostLN decoders suffer from gradient vanishing; PostLN encoders, PreLN encoders, and PreLN decoders do not.

We use Δx to denote the gradient of the training objective with respect to x. Following previous studies (Bengio et al., 1994; Glorot and Bengio, 2010; He et al., 2015; Saxe et al., 2013a), we analyze the gradient distribution at the very beginning of training. As established in Theorem A.2 and Remark A.1 (detailed derivations are included in Appendix A), neither PostLN encoders, PreLN encoders, nor PreLN decoders suffer from gradient vanishing; in other words, only the EncoderAttention sublayer in PostLN decoders does. Empirical studies further verify this. At initialization, we calculate gradient magnitudes for 18-layer Transformers and visualize them in Figure 5. The results verify that only PostLN decoders suffer from gradient vanishing, and show that the vanishing happens in the backpropagation from EncoderAttention sublayer outputs to their inputs (i.e., SelfAttention sublayer outputs).

3.2 Impact of the Gradient Vanishing
Now, we analyze whether gradient vanishing is the direct cause of training instability.
Encoder  Decoder  Gradient Vanishing  Training
PostLN  PostLN  Yes  Diverged
PostLN  PreLN  No  Diverged
PreLN  PreLN  No  Converged
As shown in Section 3.1, only PostLN decoders suffer from gradient vanishing, not PostLN encoders. Thus, we combine a PostLN encoder and a PreLN decoder to construct a hybrid Transformer, which does not suffer from gradient vanishing. As shown in Table 1, fixing gradient vanishing alone (i.e., changing the PostLN decoder to a PreLN decoder) fails to stabilize model training. This implies that gradient vanishing is not the direct cause of unstable PostLN training.
Moreover, we observe that the gradients of all attention modules are unbalanced and hard to neutralize. We also find that this issue is largely addressed by adaptive optimizers. For example, as in Figure 3, adaptive optimizers successfully assign different learning rates to different parameters and produce consistent update magnitudes even with unbalanced gradients. This explains why standard SGD fails to train Transformers (i.e., it lacks the ability to handle unbalanced gradients) and why adaptive optimizers are necessary. More discussions are included in Appendix A.4.
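The point about adaptive optimizers can be illustrated with a toy experiment: give two parameters gradients that differ by three orders of magnitude and compare plain SGD updates with those of a from-scratch Adam step. This is a sketch of the mechanism, not the paper's Figure 3 experiment.

```python
import numpy as np

def adam_update(g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # one Adam step (Kingma & Ba, 2015); returns (update, new m, new v)
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)          # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)          # bias-corrected second moment
    return lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# two parameters whose gradient magnitudes differ by three orders of
# magnitude, mimicking unbalanced gradients across attention modules
grads = np.array([1e-3, 1.0])
m = np.zeros(2)
v = np.zeros(2)
upd = np.zeros(2)
for t in range(1, 101):
    g = grads * (1.0 + 0.1 * np.sin(t))   # mildly fluctuating gradients
    upd, m, v = adam_update(g, m, v, t)

sgd_upd = 1e-3 * grads   # SGD updates inherit the full 1000x imbalance
```

Adam's per-parameter normalization makes both update magnitudes close to the learning rate, while SGD's update magnitudes differ by the same factor as the gradients, which matches the intuition above for why adaptive optimizers are necessary.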
4 Instability from Amplification Effect
We find that unbalanced gradients are not the root cause of the instability of PostLN, which implies the existence of other factors influencing model training. We now go beyond gradient vanishing and introduce the amplification effect. Specifically, we first examine the differences between PreLN and PostLN, including their behaviors in the early and late stages of training. Then, we show that the training instability of PostLN is attributable to the amplification effect of layer dependency, which intensifies gradient updates and destabilizes training.
4.1 Impact of Layer Norm Positions
As described in Section 2, both PreLN and PostLN employ layer norm to regularize inputs and outputs. In residual networks, different residual outputs are aggregated and normalized before serving as inputs of other layers (i.e., residual outputs are scaled to ensure that the integrated input has a consistent variance). To some extent, layer norm treats the variances of residual outputs as weights to average them. For example, for a PostLN SelfAttention sublayer at initialization, the larger the variance of a residual output, the larger its proportion in the layer output and the smaller the proportion of other residual outputs integrated in the shortcut, similar to weights in a weighted average.

The position of layer norms is the major difference between PreLN and PostLN and makes them aggregate residual outputs differently (i.e., with different weights). As in Figure 6, all residual outputs in PreLN are normalized only once before being fed into other layers (thus only residual output variances serve as weights); in PostLN, most residual outputs are normalized more than once, and different residual outputs are normalized different numbers of times. For example, if all layers are initialized in the same way, the output variances of different PreLN residual branches would be similar, and the aggregation would resemble a simple average. For PostLN, in contrast, recent residual outputs are normalized fewer times than earlier ones and thus carry relatively larger weights. We proceed to calculate and analyze these weights to understand the impact of layer norm positions.
First, we use x̂_j to denote the normalized output of the j-th residual branch and x̄_i to denote the normalized output of the i-th layer (equivalently, the normalized input of the (i+1)-th residual branch). Their relationship can then be written as x̄_i = Σ_{j ≤ i} β_{i,j} x̂_j, where each weight β_{i,j} integrates the scaling operations of all layer norms along the path (including the final one). For PreLN, for example, the weights are determined by the residual output variances alone. Intuitively, β_{i,j} describes the proportion of the j-th residual branch's output in the i-th layer's output and thus reflects the dependency among layers.
We calculate each layer's dependency on its own residual branch and visualize it in Figure 7. Each PostLN layer's outputs rely heavily on its own residual branch, from initialization to the end of training. At initialization, PreLN layer outputs have roughly the same reliance on all previous residual branches; as training advances, each layer starts to rely more on its own residual outputs. Compared to PostLN, however, PreLN layer outputs in the final model are still less focused on their own residual branches.

Intuitively, it is harder for PreLN layers to depend too heavily on their own residual branches. In PreLN, layer outputs are not normalized, and their variances are likely to be larger for higher layers (if the shortcut output and the residual output are independent, the variance of their sum is the sum of their variances; in our experiments, layer output variance indeed increases with depth). The weight on a layer's own residual branch is therefore likely to be smaller for higher layers, which restricts higher layers from depending too much on their own residual branches and inhibits the network from reaching its full potential. In other words, PreLN restricts the network from being truly deep (if a layer's residual output is hard to distinguish from its shortcut, appending one layer is similar to doubling the width of the last layer), while PostLN allows the network the choice of being wider or deeper.
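The weighted-average view above can be made concrete with a small variance-accounting sketch. Assuming every residual branch output has unit variance and all branches are independent (both hypothetical simplifications), the share of final output variance contributed by each branch can be computed as:

```python
import numpy as np

def branch_weights(n_layers, style):
    # fraction of the final output variance contributed by the embedding and by
    # each residual branch, under unit-variance, independent-branch assumptions
    contrib = [1.0]                      # variance share of the input embedding
    for _ in range(n_layers):
        contrib.append(1.0)              # new residual branch, unit variance
        if style == "post":              # PostLN renormalizes after every sublayer
            total = sum(contrib)
            contrib = [c / total for c in contrib]
    if style == "pre":                   # PreLN normalizes only once, at the end
        total = sum(contrib)
        contrib = [c / total for c in contrib]
    return contrib

pre = branch_weights(6, "pre")
post = branch_weights(6, "post")
```

With 6 layers, PreLN gives the embedding and every branch an equal share of 1/7, while PostLN gives the most recent branch a share of 1/2 and halves every earlier share at each normalization, consistent with the heavier self-dependency of PostLN described above.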
4.2 Amplification Effect at Initialization
Although depending more on residual branches allows the model a larger potential, it amplifies the fluctuations brought by parameter changes. For a network F(x, W), where x is the model input and W are the parameters, the output change caused by a parameter perturbation is ||F(x, W*) − F(x, W)||, where W* denotes the perturbed parameters. Its relationship with the layer dependencies is described in Theorem 4.2, and the derivation is elaborated in Appendix B: consider an N-layer Transformer F(x, W); if the layer dependency stays the same after a parameter change (i.e., each layer's weight on its own residual branch has the same value after changing W to W*, where W* is randomly initialized and independent of W), then the squared output change ||F(x, W*) − F(x, W)||² can be estimated as a constant C times the sum of the squared per-layer dependency weights.

If the residual output variance is the same for all layers, PreLN sets the i-th layer's dependency weight on the order of 1/i, while PostLN sets it as a constant. Thus we obtain Corollaries 4.2 and 4.3: for an N-layer PreLN, the squared output change is O(C log N); for an N-layer PostLN, it is O(C N). They show that, since PostLN relies more on residual branches than PreLN (i.e., has larger dependency weights), the same perturbation is amplified to a larger magnitude. To empirically verify these relationships, we calculate the output change for PreLN and PostLN and visualize the results in Figure 2. As predicted, the squared output change is linearly associated with N for PostLN and with log N for PreLN; these relationships match the observations in our experiments (as in Figure 2), and measuring the correlation magnitudes confirms a strong linear fit in both cases. Moreover, we replace the random perturbation with optimization updates (i.e., setting W* to the parameters after one Adam update) and visualize the resulting output shifts in Figure 2. The output shift of PostLN is larger than that of PreLN by multiple orders of magnitude.
Intuitively, large output shifts destabilize training (Li et al., 2018). Also, as elaborated in Appendix B, the constant C in Theorem 4.2 is related to network derivatives and thus becomes smaller as training advances, which explains why warmup is also helpful for standard SGD. Therefore, we conjecture that the large output shift of PostLN results in unstable training. We now proceed to stabilize PostLN by controlling the dependency on residual branches in the early stage of training.
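The amplification argument can be probed numerically with a toy stack of linear residual branches interleaved with layer norm. The construction below is an illustrative sketch (random Gaussian weights and Gaussian perturbations), not the paper's exact setup.

```python
import numpy as np

d = 16  # hidden dimension of the toy model

def layer_norm(x, eps=1e-6):
    return (x - x.mean()) / (x.std() + eps)

def forward(x, Ws, style):
    # PreLN layer: x + LN(x) W, with one final LN; PostLN layer: LN(x + x W)
    for W in Ws:
        if style == "pre":
            x = x + layer_norm(x) @ W
        else:
            x = layer_norm(x + x @ W)
    return layer_norm(x) if style == "pre" else x

def output_shift(n_layers, style, noise=1e-2, seed=0):
    # ||F(x, W + delta) - F(x, W)|| for one random draw of weights/perturbation
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(d)
    Ws = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(n_layers)]
    Ws_shift = [W + noise * rng.standard_normal((d, d)) for W in Ws]
    return np.linalg.norm(forward(x, Ws, style) - forward(x, Ws_shift, style))
```

Averaged over random seeds, the PostLN stack shows a clearly larger output shift than the PreLN stack at depth 18, and its shift grows with depth, in line with the corollaries above.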
4.3 Admin – Adaptive Model Initialization
In light of our analysis, we add additional parameters ω to control the residual dependencies of PostLN, and stabilize training by adaptively initializing ω to keep the early-stage output change small.
Due to different training configurations and model specificities (e.g., different models may use different activation functions and dropout ratios), it is hard to derive a universal initialization method. Instead, we decompose model initialization into two phases: Profiling and Initialization. Specifically, Admin adds new parameters ω and constructs its i-th sublayer as x_i = f_LN(x_{i-1} ⊙ ω_i + f_i(x_{i-1})), where ω_i is a d-dimension vector and ⊙ is the element-wise product. The Profiling phase and Initialization phase are:

Profiling. After initializing the network with a standard method (initializing ω_i as 1), conduct forward propagation without parameter updates and record the output variance of each residual branch (i.e., calculate Var[f_i(x_{i-1})]).

Initialization. Set ω_i = sqrt(Σ_{j<i} Var[f_j(x_{j-1})]) and initialize all other parameters with the same method used in the Profiling phase.
In the early stage, this setting of ω keeps each layer's dependency on its own residual branch small, which bounds the output change and stabilizes training. Training becomes more stable in the late stage (the constant in Theorem 4.2 shrinks with parameter gradients), and each layer then has the flexibility to adjust ω_i and depend more on its own residual branch when calculating layer outputs. After training finishes, Admin can be reparameterized as the conventional PostLN structure (i.e., the ω parameters can be absorbed and removed). More implementation details are elaborated in Appendix C.
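The two phases can be sketched as follows. The toy ReLU residual branches, dimensions, and the exact ω formula used here (unit-variance input plus accumulated branch variances) are illustrative assumptions, not the released implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_layers = 16, 6

def layer_norm(x, eps=1e-6):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def branch(x, W):
    # ReLU residual branch as a stand-in for an attention/FFN sublayer
    return np.maximum(x @ W, 0.0)

Ws = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(n_layers)]

def forward(x, omegas):
    # Admin sublayer: x_i = LN(x_{i-1} * omega_i + f_i(x_{i-1}))
    for W, w in zip(Ws, omegas):
        x = layer_norm(x * w + branch(x, W))
    return x

# Phase 1 -- Profiling: one forward pass with omega = 1, recording the
# output variance of every residual branch
x = rng.standard_normal((4, d))
variances, h = [], x
for W in Ws:
    out = branch(h, W)
    variances.append(out.var())
    h = layer_norm(h * 1.0 + out)

# Phase 2 -- Initialization: omega_i grows with the accumulated variance of
# everything feeding into layer i (input assumed unit variance; illustrative)
omegas = [np.sqrt(1.0 + sum(variances[:i])) for i in range(n_layers)]
y = forward(x, omegas)
```

Because ω_i matches the scale of the shortcut relative to the incoming branch, each early-stage layer output is dominated by the shortcut rather than by its own residual branch, which is exactly the dependency control described above.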
To verify our intuition, we calculate the layer dependencies of 18-layer models and visualize the results in Figure 8. Figures 7 and 8 show that Admin avoids over-large dependencies at initialization and unleashes the potential for layer outputs to depend more on their residual outputs in the final model. Moreover, we visualize the output change of Admin in Figure 2. Benefiting from the adaptive initialization, the output change of Admin grows at roughly the same speed as that of PreLN, even though Admin is constructed in the PostLN manner. Also, although Admin is formulated in a PostLN manner and suffers from gradient vanishing, 18-layer Admin successfully converges and outperforms 18-layer PreLN (as in Table 2). This evidence supports our intuition that a large dependency on residual branches amplifies output fluctuations and destabilizes training.
5 Experiments
We conduct experiments on two machine translation datasets, i.e., IWSLT'14 De-En and WMT'14 En-De. Detailed experimental configurations are elaborated in Appendix D.
Method  6-Layer  12-Layer  18-Layer  (BLEU)
PostLN  27.80  Diverged  Diverged 
PreLN  27.27  28.26  28.26 
Admin  27.90  28.58  28.80 
5.1 Performance Comparison
We use BLEU as the evaluation metric and summarize model performance in Tables 2 and 3. For the WMT'14 dataset, experiments are conducted with the Transformer-base model with 6, 12, or 18 layers. Admin achieves better performance than PostLN and PreLN in all three settings. Specifically, 12-layer and 18-layer PostLN diverge without the adaptive initialization. Admin obtains comparable performance to PostLN in the 6-layer setting and converges well in both the 12-layer and 18-layer settings. PreLN also converges in all settings, but it results in suboptimal performance, which verifies our intuition that the PreLN structure limits the model's potential. As depicted in Figures 1 and 9, although 6-layer PreLN converges faster than PostLN, its final performance is worse. In contrast, Admin not only matches the convergence speed of PreLN in the early stage but also reaches a good performance in the late stage.
For the IWSLT'14 dataset, we use the Transformer-small model for training. We observe that all methods perform similarly, with Admin outperforming the other two by a small margin. Compared to the WMT'14 results, this verifies that training stability is related to the number of layers: for shallow networks, the stability difference between PostLN and PreLN is not significant, and all architectures lead to similar performance. Besides, we find that the attention dropout and the activation dropout have a large impact on model performance. Specifically, by setting the attention dropout ratio and ReLU dropout ratio to 0.1, we are able to improve the PostLN performance from 34.60 (reported) to 35.64 (average of five runs).
5.2 Comparison to Other Initializations
We further compare our method with two initialization methods, i.e., FixUp (Zhang et al., 2019a) and LookLinear (Balduzzi et al., 2017a). Specifically, we conduct experiments with 18-layer Transformers on the WMT'14 De-En dataset. In our experiments, we observe that both FixUp (without layer normalization) and LookLinear (with PostLN) lead to divergent training. With further analysis, we find that half-precision training and dropout could be the factors destabilizing FixUp, due to its lack of layer normalization.
5.3 Connection to Warmup
Previous work (Liu et al., 2020) establishes that the need for warmup comes from unstable adaptive learning rates in the early stage. Still, it is observed that removing the warmup phase results in more severe consequences for Transformers than for other architectures. Also, warmup is found to be useful even for vanilla SGD (Xiong et al., 2019).
In Theorem 4.2, we establish that the output change is bounded by a term involving a constant C that depends on the network's parameter gradients. In the early stage of training, the network has larger parameter gradients and thus a larger C; the same parameter shift therefore results in a larger output change in the early stage than in the late stage. Warmup relieves these output changes and helps stabilize training.
To further verify our intuitions, we remove the warmup phase and conduct a grid search over RAdam hyperparameters (Liu et al., 2020). Results are visualized in Figure 10. They show that PostLN is less robust to the choice of learning rate. Specifically, PostLN diverges with larger learning rates or smaller β₂ (a smaller β₂ uses fewer samples to estimate adaptive learning rates), while Admin and PreLN are more robust. At the same time, we extend the warmup phase from 8 thousand updates to 16, 24, and 32 thousand updates and find that 18-layer PostLN training still converges to bad/suspicious local optima. This shows that the large output shift of PostLN is not always neutralized by learning rate warmup. Intuitively, the large output shift not only requires a small learning rate but also unsmooths the loss surface (Li et al., 2018) and complicates training. Since warmup stabilizes training without smoothing the loss surface, it fails to train deeper Transformer networks. Admin, on the other hand, not only stabilizes training but also simplifies it by starting from a region with a smooth loss surface, thus leading to better training.
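For context, the warmup schedule referenced throughout this section is typically the linear-warmup/inverse-square-root schedule of Vaswani et al. (2017); the sketch below assumes that schedule with its common default hyperparameters, whereas the exact configuration used in our experiments is given in Appendix D.

```python
def transformer_lr(step, d_model=512, warmup=4000):
    # lr = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5):
    # linear warmup for `warmup` steps, then inverse-square-root decay
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```

The learning rate rises linearly until the warmup step and decays afterwards; "extending the warmup phase" corresponds to increasing the `warmup` argument, which both lowers the peak learning rate and delays it.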
6 Related Work
Transformer. Transformer (Vaswani et al., 2017) has led to a series of breakthroughs in various domains (Devlin et al., 2019; Velickovic et al., 2018; Huang et al., 2019; Parmar et al., 2018; Ramachandran et al., 2019). Liu et al. (2020) show that, compared to other architectures, removing the warmup phase is more damaging for Transformers, especially PostLN. Similarly, it has been found that the original Transformer (referred to as PostLN) is less robust than its PreLN variant (Baevski and Auli, 2019; Nguyen and Salazar, 2019; Wang et al., 2019). Our studies go beyond the existing literature on gradient vanishing (Xiong et al., 2019) and identify an important factor that greatly influences Transformer training. Our analysis guides us to propose a novel adaptive initialization method and allows us to better understand other empirical observations, e.g., that initializing parameters to smaller values helps stabilize training (Nguyen and Salazar, 2019).
Deep Network Initialization. To handle gradient vanishing in deep feedforward networks, specific initializations have been derived and found useful (Glorot and Bengio, 2010); the derivation was further improved for ReLU networks (He et al., 2015). He et al. (2016) find that deep network training remains hard even after addressing gradient vanishing and propose ResNet. Balduzzi et al. (2017b) identify the shattered gradient issue and propose LookLinear initialization. Recently, the study of dynamical isometry (Xiao et al., 2018; Yang and Schoenholz, 2017; Pennington et al., 2017; Saxe et al., 2013b) provides a new perspective for analyzing network behavior at initialization, focusing on simple networks such as deep linear networks and their gradient updates. It has also been observed that scaling residual outputs to smaller values helps stabilize training (Hanin and Rolnick, 2018; Mishkin and Matas, 2015; Zhang et al., 2019a; Bachlechner et al., 2020; Goyal et al., 2017). Here, we focus on the Transformer architecture, identify that unbalanced gradients are not the direct cause of PostLN instability, recognize the amplification effect of residual dependencies, and propose a novel adaptive initialization method.
7 Conclusion
In this paper, we study the difficulties of training Transformers in theoretical and empirical manners. Our study in Section 3 suggests that the gradient vanishing problem is not the root cause of unstable Transformer training. Moreover, the unbalanced gradient distribution is largely addressed by adaptive optimizers. In Section 4, we reveal that the root cause of the instability is the strong dependency on residual branches, which amplifies the fluctuations caused by parameter changes and destabilizes model training. In light of our analysis, we propose Admin, an adaptive initialization method to stabilize Transformer training. It controls the dependency at the beginning of training and maintains the flexibility to capture those dependencies once training stabilizes. Extensive experiments on real-world datasets verify our intuitions and show that Admin achieves more stable training, faster convergence, and better performance.
Our work opens up new possibilities to not only further push the state of the art but also better understand deep network training. It leads to many interesting future directions, including generalizing Theorem 4.2 to other models, designing new algorithms to automatically adapt deep networks to different training configurations, upgrading the Transformer architecture, and applying Admin to conduct training at a larger scale.
Acknowledgments
We thank Chengyu Dong, Haoming Jiang, Jingbo Shang, Xiaotao Gu, and Zihan Wang for valuable discussions and comments; Jingbo Shang for sharing GPU machines; and Microsoft for setting up GPU machines.
References
 Ba et al. (2016) Jimmy Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. ArXiv, abs/1607.06450.
 Bachlechner et al. (2020) Thomas C. Bachlechner, Bodhisattwa Prasad Majumder, Huanru Henry Mao, Garrison W. Cottrell, and Julian J. McAuley. 2020. Rezero is all you need: Fast convergence at large depth. ArXiv, abs/2003.04887.
 Baevski and Auli (2019) Alexei Baevski and Michael Auli. 2019. Adaptive input representations for neural language modeling. In ICLR.
 Balduzzi et al. (2017a) David Balduzzi, Marcus Frean, Lennox Leary, J. P. Lewis, Kurt WanDuo Ma, and Brian McWilliams. 2017a. The shattered gradients problem: If resnets are the answer, then what is the question? In ICML.
 Balduzzi et al. (2017b) David Balduzzi, Marcus Frean, Lennox Leary, J P Lewis, Kurt WanDuo Ma, and Brian McWilliams. 2017b. The shattered gradients problem: If resnets are the answer, then what is the question? In ICML.

 Bengio et al. (1994) Yoshua Bengio, Patrice Y. Simard, and Paolo Frasconi. 1994. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks.
 Chen et al. (2018) Mia Xu Chen, Orhan Firat, Ankur Bapna, Melvin Johnson, Wolfgang Macherey, George Foster, Llion Jones, Niki Parmar, Michael Schuster, Zhifeng Chen, Yonghui Wu, and Macduff Hughes. 2018. The best of both worlds: Combining recent advances in neural machine translation. In ACL.
 Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT.
 Glorot and Bengio (2010) Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In AISTATS.
 Goyal et al. (2017) Priya Goyal, Piotr Dollár, Ross B. Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. 2017. Accurate, large minibatch sgd: Training imagenet in 1 hour. ArXiv, abs/1706.02677.
 Hanin and Rolnick (2018) Boris Hanin and David Rolnick. 2018. How to start training: The effect of initialization and architecture. In NeurIPS.

 He et al. (2015) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In ICCV.
 He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In CVPR.
 Huang et al. (2019) ChengZhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Ian Simon, Curtis Hawthorne, Noam Shazeer, Andrew M. Dai, Matthew D. Hoffman, Monica Dinculescu, and Douglas Eck. 2019. Music transformer: Generating music with longterm structure. In ICLR.
 Li et al. (2018) Hao Li, Zheng Xu, Gavin Taylor, and Tom Goldstein. 2018. Visualizing the loss landscape of neural nets. In NeurIPS.
 Liu et al. (2020) Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Jiawei Han. 2020. On the variance of the adaptive learning rate and beyond. In ICLR.
 Lu et al. (2020) Yiping Lu, Zhuohan Li, Di He, Zhiqing Sun, Bin Dong, Tao Qin, Liwei Wang, and TieYan Liu. 2020. Understanding and improving transformer from a multiparticle dynamic system point of view. In ICLR Workshop DeepDiffEq.
 Mishkin and Matas (2015) Dmytro Mishkin and Juan E. Sala Matas. 2015. All you need is a good init. In ICLR.
 Nguyen and Salazar (2019) Toan Q. Nguyen and Julian Salazar. 2019. Transformers without tears: Improving the normalization of selfattention. In IWSLT.
 Ott et al. (2019) Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACLHLT 2019: Demonstrations.
 Parmar et al. (2018) Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. 2018. Image transformer. In ICML.
 Pennington et al. (2017) Jeffrey Pennington, Samuel S. Schoenholz, and Surya Ganguli. 2017. Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice. In NIPS.
 Popel and Bojar (2018) Martin Popel and Ondrej Bojar. 2018. Training tips for the transformer model. The Prague Bulletin of Mathematical Linguistics, 110:43 – 70.
 Ramachandran et al. (2019) Prajit Ramachandran, Niki Parmar, Ashish Vaswani, Irwan Bello, Anselm Levskaya, and Jonathon Shlens. 2019. Standalone selfattention in vision models. In NeurIPS.
 Saxe et al. (2013a) Andrew M Saxe, James L McClelland, and Surya Ganguli. 2013a. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120.
 Saxe et al. (2013b) Andrew M. Saxe, James L. McClelland, and Surya Ganguli. 2013b. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. CoRR, abs/1312.6120.

 Szegedy et al. (2016) Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the inception architecture for computer vision. In CVPR.
 Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS.
 Velickovic et al. (2018) Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2018. Graph attention networks. In ICLR.
 Wang et al. (2019) Qiang Wang, Bei Li, Tong Xiao, Jingbo Zhu, Changliang Li, Derek F. Wong, and Lidia S. Chao. 2019. Learning deep transformer models for machine translation. In ACL.
 Wu et al. (2019) Felix Wu, Angela Fan, Alexei Baevski, Yann Dauphin, and Michael Auli. 2019. Pay less attention with lightweight and dynamic convolutions. In ICLR.

 Xiao et al. (2018) Lechao Xiao, Yasaman Bahri, Jascha Sohl-Dickstein, Samuel S. Schoenholz, and Jeffrey Pennington. 2018. Dynamical isometry and a mean field theory of CNNs: How to train 10,000-layer vanilla convolutional neural networks. In ICML.
 Xiong et al. (2019) Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tie-Yan Liu. 2019. On layer normalization in the transformer architecture. ArXiv, abs/2002.04745.
 Yang and Schoenholz (2017) Greg Yang and Samuel S. Schoenholz. 2017. Mean field residual networks: On the edge of chaos. In NIPS.
 Zhang et al. (2019a) Hongyi Zhang, Yann N. Dauphin, and Tengyu Ma. 2019a. Fixup initialization: Residual learning without normalization. In ICLR.
 Zhang et al. (2019b) Jingzhao Zhang, Sai Praneeth Karimireddy, Andreas Veit, Seungyeon Kim, Sashank J. Reddi, Surinder Kumar, and Suvrit Sra. 2019b. Why adam beats sgd for attention models. ArXiv, abs/1912.03194.
Appendix A Gradients at Initialization
Here, we first reveal that PreLN does not suffer from gradient vanishing. Then we establish that only the PostLN decoder suffers from gradient vanishing, but not the PostLN encoder. For simplicity, we use $\nabla x$ to denote gradients, i.e., $\nabla x \triangleq \frac{\partial \mathcal{L}}{\partial x}$, where $\mathcal{L}$ is the training objective. Following previous studies Bengio et al. (1994); Glorot and Bengio (2010); He et al. (2015); Saxe et al. (2013a), we analyze the gradient distribution at the very beginning of training, assuming that the randomly initialized parameters and the partial derivatives with regard to module inputs are independent.
A.1 PreLN Analysis
For PreLN encoders, we have $x_i = x_{i-1} + f_i(\mathrm{LN}(x_{i-1}))$ and $\nabla x_{i-1} = \nabla x_i \left(1 + \frac{\partial f_i(\mathrm{LN}(x_{i-1}))}{\partial x_{i-1}}\right)$. At initialization, the two terms on the right are approximately independent and $E\!\left[\nabla x_i \frac{\partial f_i(\mathrm{LN}(x_{i-1}))}{\partial x_{i-1}}\right] = 0$. Therefore we have $\mathrm{Var}[\nabla x_{i-1}] = \mathrm{Var}[\nabla x_i] + \mathrm{Var}\!\left[\nabla x_i \frac{\partial f_i(\mathrm{LN}(x_{i-1}))}{\partial x_{i-1}}\right]$. Similarly, we can get $\mathrm{Var}[\nabla x_{i-1}] \ge \mathrm{Var}[\nabla x_i]$, thus $\mathrm{Var}[\nabla x_1] \ge \mathrm{Var}[\nabla x_2] \ge \cdots \ge \mathrm{Var}[\nabla x_N]$. Applying the same analysis to PreLN decoders, we can get the same inequality. Thus, lower layers have larger gradients than higher layers, and gradients do not vanish in the backpropagation. In summary, for PreLN, if $\nabla x_i$ and the derivatives of modules in the $i$-th sublayer are independent, then $\mathrm{Var}[\nabla x_{i-1}] \ge \mathrm{Var}[\nabla x_i]$.
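This recursion is easy to check numerically. The sketch below (an illustration, not part of the original derivation; the dimension, depth, and Jacobian scale are arbitrary assumptions) backpropagates a random gradient through PreLN-style residual sub-layers, standing in for $\nabla x_{i-1} = \nabla x_i (1 + \partial f_i / \partial x_{i-1})$ with a random zero-mean branch Jacobian:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_layers = 64, 24  # illustrative width and depth

# Backpropagate through the PreLN recursion grad_{i-1} = grad_i @ (I + J_i),
# where J_i stands in for the zero-mean, randomly initialized Jacobian of
# the residual branch f_i(LN(x_{i-1})).
grad = rng.standard_normal(d)
variances = [grad.var()]
for _ in range(n_layers):
    J = rng.standard_normal((d, d)) * (0.5 / np.sqrt(d))
    grad = grad @ (np.eye(d) + J)
    variances.append(grad.var())

# variances[k] is the gradient variance after backpropagating through k
# sub-layers: it grows toward lower layers, so gradients do not vanish.
print(variances[0], variances[-1])
```

Because the identity path always contributes its full share, the variance sequence is non-decreasing toward the input, matching the inequality above.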
a.2 PostLN Encoder Analysis
Different from PreLN, $x_{i-1}$ and $x_i$ are associated with not only the residual connection but also the layer normalization, which makes it harder to establish a connection between their gradients. After making assumptions about the model initialization, we find that lower layers in a PostLN encoder also have larger gradients than higher layers, and gradients do not vanish in the backpropagation through the encoder.
For PostLN encoders, if $\gamma$ and $\beta$ in the Layer Norm are initialized as $1$ and $0$ respectively; all other parameters are initialized by symmetric distributions with zero mean; $x_i$ and $\nabla x_i$ are subject to symmetric distributions with zero mean; the variance of $x_i$ is $1$ (i.e., normalized by Layer Norm); and the derivatives of modules in the $i$-th sublayer are independent, then we have $\mathrm{Var}[\nabla x_{i-1}] \ge \mathrm{Var}[\nabla x_i]$.
Proof.
We first prove that $\mathrm{Var}[\nabla x_{i-1}] \ge \mathrm{Var}[\nabla x_i]$ holds for FFN sublayers, i.e., the backpropagation through FFN sublayers does not suffer from gradient vanishing. In PostLN encoders, the output of FFN sublayers is calculated as $x_i = \mathrm{LN}(\hat{x}_i)$, where $\hat{x}_i = x_{i-1} + \hat{f}_i$ and $\hat{f}_i = \max(0, x_{i-1}W^{(1)})W^{(2)}$. Since at initialization $W^{(1)}$ and $W^{(2)}$ are independently randomized by symmetric distributions, we have $E[\hat{f}_i] = 0$ and
$$\mathrm{Var}[\hat{x}_i] = \mathrm{Var}[x_{i-1}] + \mathrm{Var}[\hat{f}_i].$$
Referring to the dimension of $x_{i-1}$ as $d$, He et al. (2015) establishes that
$$\mathrm{Var}[\hat{f}_i] = \frac{d^2}{2}\,\mathrm{Var}[w^{(1)}]\,\mathrm{Var}[w^{(2)}]\,\mathrm{Var}[x_{i-1}],$$
where $w^{(1)}$ and $w^{(2)}$ are elements of $W^{(1)}$ and $W^{(2)}$. Since in PostLN, $x_{i-1}$ is the output of a layer norm, we have $\mathrm{Var}[x_{i-1}] = 1$. Thus,
$$\mathrm{Var}[\hat{x}_i] = 1 + \frac{d^2}{2}\,\mathrm{Var}[w^{(1)}]\,\mathrm{Var}[w^{(2)}]. \quad (1)$$
Assuming different terms are also independent in the backpropagation, we have
$$\mathrm{Var}[\nabla x_{i-1}] = \mathrm{Var}[\nabla \hat{x}_i] + \mathrm{Var}\!\left[\nabla \hat{x}_i \frac{\partial \hat{f}_i}{\partial x_{i-1}}\right].$$
At initialization, He et al. (2015) establishes that
$$\mathrm{Var}\!\left[\nabla \hat{x}_i \frac{\partial \hat{f}_i}{\partial x_{i-1}}\right] = \frac{d^2}{2}\,\mathrm{Var}[w^{(1)}]\,\mathrm{Var}[w^{(2)}]\,\mathrm{Var}[\nabla \hat{x}_i].$$
Therefore, we have
$$\mathrm{Var}[\nabla x_{i-1}] = \left(1 + \frac{d^2}{2}\,\mathrm{Var}[w^{(1)}]\,\mathrm{Var}[w^{(2)}]\right)\mathrm{Var}[\nabla \hat{x}_i]. \quad (2)$$
Combining Equation 1 with Equation 2, and noting that the layer norm scales the gradient variance by $1/\mathrm{Var}[\hat{x}_i]$ (i.e., $\mathrm{Var}[\nabla \hat{x}_i] = \mathrm{Var}[\nabla x_i]/\mathrm{Var}[\hat{x}_i]$), we have
$$\mathrm{Var}[\nabla x_{i-1}] = \frac{1 + \frac{d^2}{2}\,\mathrm{Var}[w^{(1)}]\,\mathrm{Var}[w^{(2)}]}{\mathrm{Var}[\hat{x}_i]}\,\mathrm{Var}[\nabla x_i] = \mathrm{Var}[\nabla x_i], \quad (3)$$
which shows the backpropagation through FFN sublayers does not suffer from gradient vanishing.
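The forward-variance relation in Equation 1 can be verified with a quick simulation (a sketch for illustration only; the width, sample count, and weight scale are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 256, 4000   # illustrative width and number of samples
sigma = 0.05       # assumed std of each weight entry

# Unit-variance input, as if produced by the preceding Layer Norm.
x = rng.standard_normal((n, d))
W1 = rng.normal(0.0, sigma, (d, d))
W2 = rng.normal(0.0, sigma, (d, d))

f_hat = np.maximum(0.0, x @ W1) @ W2  # residual branch of the FFN sublayer
x_hat = x + f_hat                     # pre-LayerNorm output

# Equation 1 predicts Var[x_hat] = 1 + (d^2 / 2) Var[w1] Var[w2].
predicted = 1 + 0.5 * d**2 * sigma**2 * sigma**2
print(x_hat.var(), predicted)
```

The empirical variance of the pre-normalization output closely tracks the predicted value, confirming the factor of $\tfrac{1}{2}$ contributed by the ReLU.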
Now we proceed to prove that the backpropagation through SelfAttention sublayers does not suffer from gradient vanishing either. In PostLN encoders, the output of SelfAttention sublayers is calculated as $x_i = \mathrm{LN}(\hat{x}_i)$, where $\hat{x}_i = x_{i-1} + \hat{f}_i$ and $\hat{f}_i = A\,x_{i-1}W^{V}W^{O}$, referring $\mathrm{softmax}\!\left(x_{i-1}W^{Q}(x_{i-1}W^{K})^{\top}/\sqrt{d}\right)$ as $A$. At initialization, since $W^{Q}$, $W^{K}$, $W^{V}$, and $W^{O}$ are independently randomized by symmetric distributions, we have $E[\hat{f}_i] = 0$, thus $\mathrm{Var}[\hat{x}_i] = \mathrm{Var}[x_{i-1}] + \mathrm{Var}[\hat{f}_i]$.
Similar to He et al. (2015), we have
$$\mathrm{Var}[\hat{f}_i] = d^2\,\mathrm{Var}[w^{V}]\,\mathrm{Var}[w^{O}]\,\mathrm{Var}[x_{i-1}],$$
where $w^{V}$ and $w^{O}$ are elements of $W^{V}$ and $W^{O}$. Since $x_{i-1}$ is the output of a layer norm, we have $\mathrm{Var}[x_{i-1}] = 1$. Thus,
$$\mathrm{Var}[\hat{x}_i] = 1 + d^2\,\mathrm{Var}[w^{V}]\,\mathrm{Var}[w^{O}]. \quad (4)$$
In the backpropagation, we have
$$\mathrm{Var}[\nabla x_{i-1}] = \mathrm{Var}[\nabla \hat{x}_i] + \mathrm{Var}\!\left[\nabla \hat{x}_i \frac{\partial \hat{f}_i}{\partial x_{i-1}}\right].$$
At initialization, we assume $\nabla \hat{x}_i$ and model parameters are independent He et al. (2015), thus
$$\mathrm{Var}\!\left[\nabla \hat{x}_i \frac{\partial \hat{f}_i}{\partial x_{i-1}}\right] = d^2\,\mathrm{Var}[w^{V}]\,\mathrm{Var}[w^{O}]\,\mathrm{Var}[\nabla \hat{x}_i].$$
Therefore, we have
$$\mathrm{Var}[\nabla x_{i-1}] = \left(1 + d^2\,\mathrm{Var}[w^{V}]\,\mathrm{Var}[w^{O}]\right)\mathrm{Var}[\nabla \hat{x}_i]. \quad (5)$$
Integrating Equation 4 with Equation 5, and again using $\mathrm{Var}[\nabla \hat{x}_i] = \mathrm{Var}[\nabla x_i]/\mathrm{Var}[\hat{x}_i]$, we have
$$\mathrm{Var}[\nabla x_{i-1}] = \frac{1 + d^2\,\mathrm{Var}[w^{V}]\,\mathrm{Var}[w^{O}]}{\mathrm{Var}[\hat{x}_i]}\,\mathrm{Var}[\nabla x_i] = \mathrm{Var}[\nabla x_i], \quad (6)$$
which shows the backpropagation through SelfAttention sublayers does not suffer from gradient vanishing.
A.3 PostLN Decoder Analysis
In PostLN, the EncoderAttention sublayer suffers from gradient vanishing. The EncoderAttention sublayer calculates outputs as $x_i = \mathrm{LN}(\hat{x}_i)$, where $\hat{x}_i = x_{i-1} + \hat{f}_i$ and $\hat{f}_i = A\,h\,W^{V}W^{O}$, referring $\mathrm{softmax}\!\left(x_{i-1}W^{Q}(h W^{K})^{\top}/\sqrt{d}\right)$ as $A$. Here $h$ is the encoder output and $\mathrm{softmax}$ is the row-wise softmax function. In the backpropagation, all backpropagation paths from $x_{i-1}$ to $\hat{f}_i$ go through the softmax function, whose damping effect makes $\mathrm{Var}\!\left[\nabla \hat{x}_i \frac{\partial \hat{f}_i}{\partial x_{i-1}}\right] \ll \mathrm{Var}[\nabla \hat{x}_i]$. Thus, those backpropagations suffer from gradient vanishing.
A.4 Impact of Unbalanced Gradients
As in Figure 3 and Figure 11, even for PreLN, the gradient distributions of Attention modules are unbalanced. Specifically, parameters within the softmax function (i.e., $W^{Q}$ and $W^{K}$) suffer from gradient vanishing and have smaller gradients than other parameters.
With further analysis, we find it is hard to neutralize the gradient vanishing of the softmax. Different from conventional nonlinear functions like ReLU or sigmoid, the softmax has a dynamic input length (i.e., for sentences with different lengths, inputs of the softmax have different dimensions). Although this setting allows Attention modules to handle sequential inputs, it restricts them from having a stable and consistent backpropagation. Specifically, let us compare the softmax with the sigmoid. For the sigmoid function, although its derivative is smaller than 1, this damping effect is consistent for all inputs. Thus, the sigmoid can be neutralized by a larger initialization Glorot and Bengio (2010). For the softmax, the damping effect differs across inputs and thus cannot be neutralized by a static initialization.

Also, we observe that this issue is largely addressed by adaptive optimizers. Specifically, we calculate the norm of parameter change in consecutive epochs (i.e., $\|\theta^{(t+1)} - \theta^{(t)}\|_2$, where $\theta^{(t)}$ is the checkpoint saved after $t$ epochs) and visualize the relative norm (scaled by the largest value in the same network) in Figure 11. Comparing the relative norm of parameter gradients and parameter updates, we notice that although the gradient distribution is unbalanced, adaptive optimizers successfully assign different learning rates to different parameters and lead to consistent update magnitudes. This result explains why the vanilla SGD fails for training Transformers (i.e., it lacks the ability to handle unbalanced gradient distributions). It also implies that the unbalanced gradient distribution (e.g., gradient vanishing) has been largely addressed by adaptive optimizers and may not have a big impact on the training instability.
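The length-dependent damping can be made concrete with a small sketch (an illustration, not from the paper's experiments): the Jacobian of the softmax at output $p$ is $\mathrm{diag}(p) - pp^{\top}$, and its spectral norm shrinks as the input grows longer, so no single static initialization can compensate for all lengths.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def softmax_jacobian(p):
    # d softmax / d logits, evaluated at the softmax output p.
    return np.diag(p) - np.outer(p, p)

rng = np.random.default_rng(2)
norms = []
for n in (8, 64, 512):  # stand-ins for different "sentence lengths"
    p = softmax(rng.standard_normal(n))
    norms.append(np.linalg.norm(softmax_jacobian(p), 2))
print(norms)  # spectral norms: all below 1, shrinking for longer inputs
```

By contrast, the sigmoid's derivative is at most $0.25$ for every input, a constant damping that a larger initialization can offset.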
Appendix B Proof of Theorem 4.2
Here, we elaborate the derivation of Theorem 4.2, which establishes the relationship between the layer number and the output fluctuation brought by parameter changes.
Consider an $N$-layer Transformer $F(x_0, W)$, where $x_0$ is the input and $W$ is the parameter. If the layer dependency stays the same after a parameter change (i.e., $\beta_i$ has the same value after changing $W$ to $W^*$, where $W$ is randomly initialized and $\delta = W^* - W$ is independent of $W$), the output change (i.e., $\mathrm{Var}[F(x_0, W) - F(x_0, W^*)]$) can be estimated as $\sum_{i=1}^{N} \beta_i^2 C$, where $C$ is a constant.
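Before the formal proof, the theorem's qualitative claim can be illustrated with a toy simulation (linear residual branches, and a width, depth, and perturbation magnitude chosen arbitrarily; this is a sketch under those assumptions, not the paper's experimental setup): perturbing all parameters by the same small amount shifts a PostLN stack's output substantially more than a PreLN stack's, reflecting PostLN's heavier per-layer dependency on its residual branches.

```python
import numpy as np

d, n_layers, eps = 64, 32, 1e-3  # illustrative width, depth, perturbation

def layer_norm(x):
    return (x - x.mean()) / x.std()

def forward(x, Ws, pre_ln):
    # Toy residual stack with a linear module f_i(x) = x @ W_i.
    for W in Ws:
        if pre_ln:
            x = x + layer_norm(x) @ W   # PreLN: x_i = x_{i-1} + f(LN(x_{i-1}))
        else:
            x = layer_norm(x + x @ W)   # PostLN: x_i = LN(x_{i-1} + f(x_{i-1}))
    return x

def relative_shift(pre_ln, seed):
    rng = np.random.default_rng(seed)
    x0 = rng.standard_normal(d)
    Ws = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(n_layers)]
    Ws_star = [W + rng.standard_normal((d, d)) * eps for W in Ws]
    y, y_star = forward(x0, Ws, pre_ln), forward(x0, Ws_star, pre_ln)
    return np.linalg.norm(y_star - y) / np.linalg.norm(y)

post = np.mean([relative_shift(False, s) for s in range(5)])
pre = np.mean([relative_shift(True, s) for s in range(5)])
print(post, pre)  # the PostLN stack amplifies the same parameter change more
```

In this toy setting, every PostLN sublayer keeps roughly equal weight on its residual branch, so perturbations accumulate layer by layer, whereas in PreLN each branch's relative contribution shrinks with depth.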
Proof.
We refer the module in sublayer $i$ as $f_i(\cdot)$, i.e., $x_i = \mathrm{LN}(\hat{x}_i)$ with $\hat{x}_i = x_{i-1} + f_i(x_{i-1})$, where $\check{x}_{i-1} = x_{i-1}/\sqrt{\mathrm{Var}[x_{i-1}]}$ is the normalized residual output and $\check{f}_i = f_i(x_{i-1})/\sqrt{\mathrm{Var}[f_i(x_{i-1})]}$ is the normalized module output. The final output is marked as $F(x_0, W) = x_N$. To simplify the notation, we use the superscript $*$ to indicate variables related to $W^*$, e.g., $\hat{x}_i^* = x_{i-1}^* + f_i(x_{i-1}^*)$ and $F(x_0, W^*) = x_N^*$.
At initialization, all parameters are initialized independently. Thus $x_{i-1}$, $\check{f}_i$ and $\check{f}_{i+1}$ are independent and $\mathrm{Var}[\hat{x}_i] = \mathrm{Var}[x_{i-1}] + \mathrm{Var}[f_i(x_{i-1})]$. Also, since layer $i$ and layer $i-1$ share the residual connection to previous layers, we have $\beta_i^2 = \mathrm{Var}[f_i(x_{i-1})]/\mathrm{Var}[\hat{x}_i]$ and $1 - \beta_i^2 = \mathrm{Var}[x_{i-1}]/\mathrm{Var}[\hat{x}_i]$. Thus $x_i = \hat{x}_i/\sqrt{\mathrm{Var}[\hat{x}_i]}$ and
$$x_i = \sqrt{1-\beta_i^2}\,\check{x}_{i-1} + \beta_i\,\check{f}_i. \quad (7)$$
Now, we proceed to analyze $x_i^* - x_i$. Specifically, we have
$$x_i^* - x_i = \sqrt{1-\beta_i^2}\,(\check{x}_{i-1}^* - \check{x}_{i-1}) + \beta_i\,(\check{f}_i^* - \check{f}_i). \quad (8)$$
Since $\delta$ is randomly initialized,