. They do not contain recurrent connections and can parallelize all computations in the same layer, thus improving effectiveness, efficiency, and scalability. Training Transformers, however, requires extra effort. For example, although standard stochastic gradient descent (SGD) is a canonical algorithm for conventional RNNs and CNNs, it converges to bad/suspicious local optima for Transformers Zhang et al. (2019b). Moreover, compared with other neural architectures, removing the warmup stage in Transformer training has more serious consequences, such as model divergence Popel and Bojar (2018); Liu et al. (2020). In this paper, we conduct a comprehensive theoretical and empirical analysis to answer the question: what complicates Transformer training?
Our analysis starts from the observation: the original Transformer (referred to as Post-LN) is less robust than its Pre-LN variant Baevski and Auli (2019); Nguyen and Salazar (2019). We recognize that gradient vanishing is not the direct reason causing such difference, since fixing this issue alone cannot stabilize Post-LN training. It implies that there exist factors other than unbalanced gradients that influence model training greatly.
With further analysis, we recognize that for each layer, the dependency on its residual branch plays an important role in training stability (for a residual layer $x + f(x)$, its shortcut output refers to $x$, its residual branch refers to $f(x)$, and the dependency on its residual branch refers to $\frac{\mathrm{Var}[f(x)]}{\mathrm{Var}[x + f(x)]}$). First, we find that a Post-LN layer has a heavier dependency on its residual branch than a Pre-LN layer. For example, at initialization, a Pre-LN layer has roughly the same dependency on its residual branch and any previous layer, whereas a Post-LN layer has a stronger dependency on its residual branch (more discussions are presented in Section 4.1). We find that the strong dependencies of Post-LN amplify fluctuations brought by parameter changes and destabilize model training (as in Theorem 4.2 and Figure 2). On the other hand, the loose reliance on residual branches in Pre-LN generally limits the potential of the training algorithm and often produces an inferior model.
In light of our analysis, we propose Admin, an adaptive initialization method for training Post-LN Transformer models, which retains the merits of Pre-LN stability without hurting the performance. It restricts the layer dependency on its residual branches in the early stage and unleashes the model potential in the late stage. Admin is more stable, converges faster, and performs better in extensive experiments on IWSLT’14 De-En and WMT’14 En-De (code is released at: https://github.com/LiyuanLucasLiu/Transforemr-Clinic).
Transformer Architectures and Notations. The Transformer architecture contains two types of sub-layers, i.e., Attention sub-layers and Feedforward (FFN) sub-layers. They are composed of mainly three basic modules Vaswani et al. (2017), i.e., Layer Norm ($f_{\text{LN}}$), Multi-head Attention ($f_{\text{ATT}}$) and Feedforward Network ($f_{\text{FFN}}$). As illustrated in Figure 4, the Pre-LN Transformer and the Post-LN Transformer organize these modules differently. For example, a Pre-LN encoder organizes the Self-Attention sub-layer as $x^{(pe)}_{2i-1} = x^{(pe)}_{2i-2} + f_{\text{S-ATT}}(f_{\text{LN}}(x^{(pe)}_{2i-2}))$ and a Post-LN encoder as $x^{(oe)}_{2i-1} = f_{\text{LN}}(x^{(oe)}_{2i-2} + f_{\text{S-ATT}}(x^{(oe)}_{2i-2}))$, where $x_{2i-2}$ is the input of the $i$-th Transformer layer and $x_{2i-1}$ is the output of the $i$-th Self-Attention sub-layer. Here, we refer to $f_{\text{S-ATT}}(f_{\text{LN}}(x^{(pe)}_{2i-2}))$ and $f_{\text{S-ATT}}(x^{(oe)}_{2i-2})$ as the residual branches and to their outputs as the residual outputs, in contrast to layer/sub-layer outputs, which integrate residual outputs and shortcut outputs. Notation elaborations are shown in Figure 4. In particular, we use superscripts to indicate network architectures (e.g., $(pe)$ for the Pre-LN Encoder) and subscripts to indicate layer indexes (top layers have larger indexes); all inputs and outputs are formulated as $\text{Sequence-Len} \times \text{Hidden-Dim}$.
Layer Norm. Layer norm Ba et al. (2016) plays a key role in the Transformer architecture. It is defined as $f_{\text{LN}}(x) = \gamma \frac{x - \mu}{\sigma} + \nu$, where $\mu$ and $\sigma$ are the mean and standard deviation of $x$.
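Feedforward Network. Transformers use two-layer perceptrons as feedforward networks, i.e., $f_{\text{FFN}}(x) = \phi(x W^{(1)}) W^{(2)}$, where $\phi(\cdot)$ is the non-linear function and $W^{(1)}$ and $W^{(2)}$ are parameters.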
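Multi-head Attention. Multi-head Attention allows the network to have multiple focuses and has been shown to be effective for many tasks Chen et al. (2018). It is defined as (with $H$ heads): $f_{\text{ATT}}(q, k, v) = \sum_{h=1}^{H} f_s(q W^{(Q)}_h W^{(K)}_h k^T)\, v W^{(V_1)}_h W^{(V_2)}_h$, where $f_s$ is the row-wise softmax function and $W^{(\cdot)}_h$ are parameters. $W^{(Q)}_h$ and $W^{(V_1)}_h$ are $D \times \frac{D}{H}$ matrices, and $W^{(K)}_h$ and $W^{(V_2)}_h$ are $\frac{D}{H} \times D$ matrices, where $D$ is the hidden state dimension. Parameters without subscript refer to the concatenation of all $H$-head parameters, e.g., $W^{(Q)} = [W^{(Q)}_1, \cdots, W^{(Q)}_H]$. Here, this module is used in two different manners: Encoder-Attention (i.e., $f_{\text{E-ATT}}(x) = f_{\text{ATT}}(x, x^{(\cdot e)}, x^{(\cdot e)})$, where $x^{(\cdot e)}$ is encoder outputs) and Self-Attention (i.e., $f_{\text{S-ATT}}(x) = f_{\text{ATT}}(x, x, x)$).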
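The difference between the two sub-layer arrangements can be made concrete with a minimal sketch. It assumes a layer norm with $\gamma = 1$ and $\nu = 0$ by default, operates on a single vector rather than a sequence, and uses a hypothetical `toy_branch` as a stand-in for an attention or feedforward module; none of these names come from the released code.

```python
import math

def layer_norm(x, gamma=1.0, nu=0.0):
    """Layer norm over one vector: gamma * (x - mean) / std + nu."""
    mu = sum(x) / len(x)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in x) / len(x))
    return [gamma * (v - mu) / sigma + nu for v in x]

def pre_ln_sublayer(x, branch):
    """Pre-LN: normalize the branch input, leave the shortcut untouched."""
    a = branch(layer_norm(x))
    return [xi + ai for xi, ai in zip(x, a)]

def post_ln_sublayer(x, branch):
    """Post-LN: add shortcut and branch output first, then normalize the sum."""
    a = branch(x)
    return layer_norm([xi + ai for xi, ai in zip(x, a)])

def toy_branch(x):
    """Hypothetical residual branch (illustration only): reverse and halve."""
    return [0.5 * v for v in reversed(x)]

x = [1.0, 2.0, 4.0, 8.0]
pre = pre_ln_sublayer(x, toy_branch)    # output is not normalized
post = post_ln_sublayer(x, toy_branch)  # output has zero mean, unit variance
```

In this sketch the Post-LN output is always renormalized, while the Pre-LN output keeps the scale of the shortcut, which is the structural difference the later analysis builds on.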
3 Unbalanced Gradients
Here, we strive to answer the question: what complicates Transformer training. Our analysis starts from the observation: Pre-LN training is more robust than that of Post-LN while Post-LN is more likely to reach a better performance than Pre-LN. For example, in a parameter grid search (as in Figure 10), Pre-LN converges in all 15 settings and Post-LN diverges in 7 out of 15 settings; when Post-LN converges, it outperforms Pre-LN in 7 out of 8 settings. We seek to reveal the underlying factor that destabilizes Post-LN and restricts the performance of Pre-LN.
In this section, we focus on the unbalanced gradients (e.g., gradient vanishing). We find that, although Post-LN suffers from gradient vanishing and Pre-LN does not, gradient vanishing is not the direct reason causing the instability of Post-LN. Specifically, we first theoretically and empirically establish that only Post-LN decoders suffer from gradient vanishing and Post-LN encoders do not. We then observe that fixing the gradient vanishing issue alone cannot stabilize training.
3.1 Gradients at Initialization
As gradient vanishing can hamper convergence from the beginning, it has been regarded as the major issue leading to unstable training. Also, recent studies show that this issue exists in the Post-LN Transformer, even after using residual connections Xiong et al. (2019). Below, we establish that only Post-LN decoders suffer from gradient vanishing, whereas Post-LN encoders, Pre-LN encoders, and Pre-LN decoders do not.
We use $\Delta x$ to denote gradients, i.e., $\Delta x = \frac{\partial \mathcal{L}}{\partial x}$ where $\mathcal{L}$ is the training objective. Following previous studies Bengio et al. (1994); Glorot and Bengio (2010); He et al. (2015); Saxe et al. (2013a), we analyze the gradient distribution at the very beginning of training. As established in Theorem A.2 and Remark A.1 (detailed derivations are included in Appendix A), Post-LN Encoders, Pre-LN Encoders, and Pre-LN Decoders do not suffer from gradient vanishing. In other words, only the Encoder-Attention layer in Post-LN suffers from gradient vanishing. Empirical studies are further conducted for verification. At initialization, we calculate the gradient norm $|\Delta x|_2$ for 18-layer Transformers (note that if $\mathbb{E}[\Delta x] = 0$, then $\mathrm{Var}[\Delta x] \approx |\Delta x|_2^2$) and visualize it in Figure 5. It verifies that only Post-LN decoders suffer from gradient vanishing. It also shows that the gradient vanishing happens in the backpropagation from Encoder-Attention sub-layer outputs to their inputs (i.e., Self-Attention sub-layer outputs).
3.2 Impact of the Gradient Vanishing
Now, we analyze whether gradient vanishing is the direct cause of training instability.
As in Section 3.1, only Post-LN decoders suffer from gradient vanishing, but not Post-LN encoders. Thus, we combine a Post-LN encoder and a Pre-LN decoder to construct a hybrid Transformer, which does not suffer from gradient vanishing. As shown in Table 1, fixing gradient vanishing alone (i.e., changing Post-LN decoders to Pre-LN decoders) fails to stabilize model training. It implies that the gradient vanishing issue is not the direct cause of the unstable Post-LN training.
Moreover, we observe that the gradients of all attention modules are unbalanced and hard to neutralize. We also find that this issue is largely addressed by adaptive optimizers. For example, as in Figure 3, adaptive optimizers successfully assign different learning rates to different parameters and lead to consistent update magnitudes even with unbalanced gradients. This explains why standard SGD fails in training Transformers (i.e., it lacks the ability to handle unbalanced gradients) and why adaptive optimizers are necessary. More discussions are included in Appendix A.4.
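To illustrate why adaptive optimizers yield consistent update magnitudes, the sketch below compares a plain SGD step with the first Adam step on three gradients that differ by four orders of magnitude. The gradient values, learning rate, and function names are illustrative assumptions, and only the first update (where Adam's bias-corrected step reduces to roughly $\text{lr} \cdot \text{sign}(g)$) is shown.

```python
import math

def sgd_update(grads, lr):
    """Plain SGD: the update is proportional to the gradient."""
    return [-lr * g for g in grads]

def adam_first_update(grads, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """First Adam step with bias correction, starting from zero moments."""
    updates = []
    for g in grads:
        m = (1 - beta1) * g          # first-moment estimate
        v = (1 - beta2) * g * g      # second-moment estimate
        m_hat = m / (1 - beta1)      # bias correction at step t = 1
        v_hat = v / (1 - beta2)
        updates.append(-lr * m_hat / (math.sqrt(v_hat) + eps))
    return updates

# Hypothetical unbalanced gradients for three parameter groups.
grads = [1e-4, 1e-2, 1.0]
sgd = sgd_update(grads, lr=0.1)
adam = adam_first_update(grads, lr=0.1)

sgd_spread = max(abs(u) for u in sgd) / min(abs(u) for u in sgd)
adam_spread = max(abs(u) for u in adam) / min(abs(u) for u in adam)
```

Here `sgd_spread` inherits the full imbalance of the gradients, whereas `adam_spread` stays close to one, which matches the qualitative behavior described above.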
4 Instability from Amplification Effect
We find that unbalanced gradients are not the root cause of the instability of Post-LN, which implies the existence of other factors influencing model training. Now, we go beyond gradient vanishing and introduce the amplification effect. Specifically, we first examine the difference between Pre-LN and Post-LN, including their behaviors in the early-stage and late-stage of training. Then, we show that the training instability of Post-LN is attributed to the amplification effect of layer dependency, which intensifies gradient updates and destabilizes training.
4.1 Impact of Layer Norms Positions
As described in Section 4, both Pre-LN and Post-LN employ layer norm to regularize inputs and outputs. In residual networks, different residual outputs are aggregated and normalized before serving as inputs of other layers (i.e., residual outputs will be scaled to ensure that the integrated input has a consistent variance). To some extent, layer norm treats the variance of residual outputs as weights to average them. For example, for Post-LN Self-Attention, we have $x_{2i-1} = \frac{x_{2i-2} + a_{2i-1}}{\sqrt{\mathrm{Var}[x_{2i-2}] + \mathrm{Var}[a_{2i-1}]}}$ at initialization, where $a_{2i-1}$ denotes the residual output. A larger $\mathrm{Var}[a_{2i-1}]$ not only increases the proportion of $a_{2i-1}$ in the output but also decreases the proportion of other residual outputs (integrated in $x_{2i-2}$), which is similar to weights in a weighted average.
The position of layer norms is the major difference between Pre-LN and Post-LN and makes them aggregate residual outputs differently (i.e., using different weights). As in Figure 6, all residual outputs in Pre-LN are normalized only once before feeding into other layers (thus only treating residual output variances as weights); in Post-LN, most residual outputs are normalized more than once, and different residual outputs are normalized a different number of times. For example, if all layers are initialized in the same way, output variances of different Pre-LN residual branches would be similar, and the aggregation would be similar to a simple average. Similarly, for Post-LN, nearby residual outputs are normalized fewer times than others, thus having relatively larger weights. We proceed to calculate and analyze these weights to understand the impact of layer norm positions.
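First, we use $\hat{a}_i$ to refer to $\frac{a_i}{\sqrt{\mathrm{Var}[a_i]}}$ (i.e., normalized outputs of the $i$-th residual branch) and $\hat{x}_i$ to refer to $\frac{x_i}{\sqrt{\mathrm{Var}[x_i]}}$ (i.e., normalized outputs of the $i$-th layer or normalized inputs of the ($i$+1)-th residual branch). Then, we describe their relationships as $\hat{x}_i = \sum_{j \le i} \beta_{i,j} \hat{a}_j$, where $\beta_{i,j}$ integrates scaling operations of all layer norms (including $\sqrt{\mathrm{Var}[a_j]}$). For example, Pre-LN sets $\beta_{i,j} = \frac{\sqrt{\mathrm{Var}[a_j]}}{\sqrt{\mathrm{Var}[\sum_{k \le i} a_k]}}$. Intuitively, $\beta_{i,j}$ describes the proportion of the $j$-th residual branch outputs in the $i$-th layer outputs, and thus reflects the dependency among layers.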
We calculate the layer dependency $\beta_{i,j}$ and visualize it in Figure 7. For each Post-LN layer, its outputs rely more on its own residual branch from initialization to the end of training. At initialization, Pre-LN layer outputs have roughly the same reliance on all previous residual branches. As training advances, each layer starts to rely more on its own residual outputs. However, compared with Post-LN, Pre-LN layer outputs in the final model are still less focused on their residual branches.
Intuitively, it is harder for Pre-LN layers to depend too much on their own residual branches. In Pre-LN, layer outputs (i.e., $x^{(p\cdot)}_i$) are not normalized, and their variances are likely to be larger for higher layers (i.e., if $x_{i-1}$ and $a_i$ are independent, $\mathrm{Var}[x_i] = \mathrm{Var}[x_{i-1}] + \mathrm{Var}[a_i] \ge \mathrm{Var}[x_{i-1}]$; also, in our experiments $\mathrm{Var}[x_i]$ increases as $i$ becomes larger). Since the self-dependency is $\beta_{i,i} = \frac{\sqrt{\mathrm{Var}[a_i]}}{\sqrt{\mathrm{Var}[x_{i-1} + a_i]}}$, $\beta_{i,i}$ is likely to be smaller for higher layers, which restricts the $i$-th layer outputs from depending too much on their own residual branch and inhibits the network from reaching its full potential. In other words, Pre-LN restricts the network from being too deep (i.e., if it is hard to distinguish $x_i$ and $x_{i+1}$, appending one layer would be similar to doubling the width of the last layer), while Post-LN allows the network to have the choice of being wider or deeper.
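The shrinking self-dependency of Pre-LN can be checked numerically. The sketch below assumes that residual outputs are mutually independent, so that the variance of their sum is the sum of their variances, and it takes the Pre-LN weight of branch $j$ in layer $i$ to be $\sqrt{\mathrm{Var}[a_j]} / \sqrt{\sum_{k \le i} \mathrm{Var}[a_k]}$. The function name and the equal-variance input are illustrative assumptions.

```python
import math

def pre_ln_beta(branch_vars):
    """Return beta[i][j] for j <= i under the independence assumption.

    Rows are 0-based, so row i corresponds to layer i + 1 in the text.
    """
    beta, running = [], 0.0
    for i in range(len(branch_vars)):
        running += branch_vars[i]  # Var of the sum of the first i + 1 branches
        beta.append([math.sqrt(v / running) for v in branch_vars[: i + 1]])
    return beta

# Hypothetical case: all residual branches have the same output variance.
equal_vars = [1.0] * 6
beta = pre_ln_beta(equal_vars)

# Squared dependency of each layer on its own residual branch.
self_dep_sq = [row[-1] ** 2 for row in beta]
```

With equal variances, `self_dep_sq` equals $1, 1/2, 1/3, \ldots$, so higher Pre-LN layers depend less and less on their own residual branch, and every row of squared weights sums to one, like a weighted average.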
4.2 Amplification Effect at Initialization
Although depending more on residual branches allows the model to have a larger potential, it amplifies the fluctuation brought by parameter changes. For a network $\hat{x} = F(x_0, W)$ where $x_0$ is the model input and $W$ is the parameter, the output change caused by parameter perturbations is $\mathrm{Var}[F(x_0, W) - F(x_0, W^*)]$, where $W^* = W + \delta$. Its relationship with the number of layers $N$ is described in Theorem 4.2 and the derivation is elaborated in Appendix B.

Theorem. Consider an $N$-layer Transformer $\hat{x} = F(x_0, W)$, where $x_0$ is the input and $W$ is the parameter. If the layer dependency stays the same after a parameter change (i.e., $\beta_{i,j}$ has the same value after changing $W$ to $W^*$, where $W$ is randomly initialized and $\delta = W^* - W$ is independent of $W$), the output change (i.e., $\mathrm{Var}[F(x_0, W) - F(x_0, W^*)]$) can be estimated as $\sum_{i=1}^{N} \beta_{i,i}^2 C$, where $C$ is a constant.

If $\mathrm{Var}[a_i]$ is the same for all layers, Pre-LN sets $\beta_{i,i}^2$ as $1/i$ and Post-LN sets $\beta_{i,i}^2$ as a constant. Thus, we have Corollary 4.2 and 4.2 as below.

Corollary. For an $N$-layer Pre-LN $F$, we have $\mathrm{Var}[F(x_0, W) - F(x_0, W^*)] = O(\log N)$.

Corollary. For an $N$-layer Post-LN $F$, we have $\mathrm{Var}[F(x_0, W) - F(x_0, W^*)] = O(N)$.

They show that, since Post-LN relies more on residual branches than Pre-LN (i.e., has a larger $\beta_{i,i}^2$), the perturbation is amplified to a larger magnitude. To empirically verify these relationships, we calculate the output change for Pre-LN and Post-LN and visualize the results in Figure 2. In Corollary 4.2, $N$ is linearly associated with the output change for Post-LN; and in Corollary 4.2, $\log N$ is linearly associated with the output change for Pre-LN. These relationships match the observation in our experiments (as in Figure 2). For further verification, we measure their correlation magnitudes by and find in both cases.
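The gap between logarithmic and linear growth can be illustrated with a small calculation. The sketch assumes the estimate takes the form of a sum of squared self-dependencies times a constant, with the squared self-dependency equal to $1/i$ for Pre-LN and to a fixed value for Post-LN. The constant `c` and the Post-LN value `beta_sq` are hypothetical placeholders, not measured quantities.

```python
import math

def pre_ln_output_change(num_layers, c=1.0):
    """Sum of c / i over layers: a harmonic sum, which grows like log N."""
    return c * sum(1.0 / i for i in range(1, num_layers + 1))

def post_ln_output_change(num_layers, c=1.0, beta_sq=0.5):
    """Sum of a constant squared dependency over layers: grows like N."""
    return c * beta_sq * num_layers

depths = [6, 12, 18, 36]
pre = [pre_ln_output_change(n) for n in depths]
post = [post_ln_output_change(n) for n in depths]

# Extra output change incurred by doubling the depth from 18 to 36 layers.
pre_increase = pre[3] - pre[2]
post_increase = post[3] - post[2]
```

Doubling the depth adds roughly $\log 2$ to the Pre-LN estimate but doubles the Post-LN estimate, which is the amplification that the corollaries describe.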
Moreover, we replace the random noise $\delta$ with optimization updates (i.e., setting $W^* = W + \text{Adam}(\Delta W)$, where $\text{Adam}(\cdot)$ is the Adam optimizer) and visualize the output shifts in Figure 2. The output shift for Post-LN is larger than that for Pre-LN by multiple orders of magnitude.
Intuitively, large output shifts would destabilize training Li et al. (2018). Also, as elaborated in Appendix B, the constant in Theorem 4.2 is related to network derivatives and thus becomes smaller as training advances, which explains why warmup is also helpful for standard SGD. Therefore, we conjecture that it is the large output shift of Post-LN that results in unstable training. We now proceed to stabilize Post-LN by controlling the dependency on residual branches in the early stage of training.
4.3 Admin – Adaptive Model Initialization
In light of our analysis, we add additional parameters (i.e., $\omega$) to control the residual dependencies of Post-LN and stabilize training by adaptively initializing $\omega$ to ensure an $O(\log N)$ output change.
Due to different training configurations and model specificities (e.g., different models may use different activation functions and dropout ratios), it is hard to derive a universal initialization method. Instead, we decompose model initialization into two phases: Profiling and Initialization. Specifically, Admin adds new parameters $\omega$ and constructs its $i$-th sub-layer as $x_i = f_{\text{LN}}(b_i)$, where $b_i = x_{i-1} \cdot \omega_i + f_i(x_{i-1})$, $\omega_i$ is a $D$-dimension vector and $\cdot$ is the element-wise product. Then the Profiling phase and Initialization phase are:
Profiling. After initializing the network with a standard method (initializing $\omega_i$ as $\mathbf{1}$), conduct forward propagation without parameter updating and record the output variance of residual branches (i.e., calculate $\mathrm{Var}[f_i(x_{i-1})]$).
Initialization. Set $\omega_i = \sqrt{\sum_{j<i} \mathrm{Var}[f_j(x_{j-1})]}$ and initialize all other parameters with the same method used in the Profiling phase.
In the early stage, Admin sets the squared self-dependency $\beta_{i,i}^2$ to approximately $\frac{1}{i}$ and ensures an $O(\log N)$ output change, thus stabilizing training. Model training becomes more stable in the late stage (the constant in Theorem 4.2 is related to parameter gradients), and each layer has the flexibility to adjust $\omega$ and depend more on its own residual branch to calculate the layer outputs. After training finishes, Admin can be reparameterized as the conventional Post-LN structure (i.e., removing $\omega$). More implementation details are elaborated in Appendix C.
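The two phases can be sketched on a toy model. This is a minimal illustration rather than the released implementation: it works on a single vector, uses a hypothetical rotate-and-rescale function as each residual branch, treats $\omega_i$ as a scalar, assumes the rule $\omega_i = \sqrt{\sum_{j<i} \mathrm{Var}[f_j(x_{j-1})]}$, and estimates the self-dependency under the assumption that the shortcut and the residual output are independent.

```python
import math
import random

def variance(x):
    mu = sum(x) / len(x)
    return sum((v - mu) ** 2 for v in x) / len(x)

def layer_norm(x):
    """Normalize a vector to zero mean and unit (population) variance."""
    mu, sigma = sum(x) / len(x), math.sqrt(variance(x))
    return [(v - mu) / sigma for v in x]

def make_branch(scale, shift):
    """Hypothetical residual branch f_i: rotate the vector and rescale it."""
    def branch(x):
        return [scale * x[(k + shift) % len(x)] for k in range(len(x))]
    return branch

def profile(branches, x0):
    """Profiling: forward pass with omega_i = 1, recording Var[f_i(x_{i-1})]."""
    x, recorded = layer_norm(x0), []
    for f in branches:
        a = f(x)
        recorded.append(variance(a))
        x = layer_norm([xi + ai for xi, ai in zip(x, a)])  # Post-LN sub-layer
    return recorded

def admin_omegas(branch_vars):
    """Initialization: omega_i = sqrt(sum of profiled variances of earlier branches)."""
    omegas, running = [], 0.0
    for var in branch_vars:
        omegas.append(math.sqrt(running))
        running += var
    return omegas

def self_dependency_sq(branch_vars, omegas):
    """Var[a_i] / (omega_i^2 + Var[a_i]), assuming unit-variance, independent inputs."""
    return [v / (w * w + v) for v, w in zip(branch_vars, omegas)]

random.seed(0)
x0 = [random.gauss(0.0, 1.0) for _ in range(64)]
branches = [make_branch(scale=1.0, shift=i + 1) for i in range(6)]
branch_vars = profile(branches, x0)
omegas = admin_omegas(branch_vars)
betas_sq = self_dependency_sq(branch_vars, omegas)
```

In this toy setting every profiled variance is one, so `betas_sq` comes out as $1, 1/2, 1/3, \ldots$, mirroring the Pre-LN behavior at initialization while keeping the Post-LN structure.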
To verify our intuition, we calculate the layer dependency of 18-layer models and visualize the result in Figure 8. Figures 7 and 8 show that Admin avoids overly large dependencies at initialization and unleashes the potential to make the layer outputs depend more on their residual outputs in the final model. Moreover, we visualize the output change of Admin in Figure 2. Benefiting from the adaptive initialization, the output change of Admin grows at roughly the same rate as that of Pre-LN, even though Admin is constructed in the Post-LN manner. Also, although Admin is formulated in a Post-LN manner and suffers from gradient vanishing, 18-layer Admin successfully converges and outperforms 18-layer Pre-LN (as in Table 2). This evidence supports our intuition that the large dependency on residual branches amplifies the output fluctuation and destabilizes training.
5 Experiments
We conduct experiments on two machine translation datasets, i.e., IWSLT’14 De-En and WMT’14 En-De. The detailed experimental configurations are elaborated in Appendix D.
5.1 Performance Comparison
We use BLEU as the evaluation metric and summarize the model performance in Table 2 and Table 3. For the WMT’14 dataset, experiments are conducted using the Transformer-base model with 6, 12 or 18 layers. Admin achieves better performance than Post-LN and Pre-LN in all three settings. Specifically, 12-layer and 18-layer Post-LN diverge without the adaptive initialization. Admin obtains performance comparable to Post-LN in the 6-layer setting and converges well in both the 12-layer and the 18-layer settings. Pre-LN also converges in all settings, but it results in sub-optimal performance, which verifies our intuition that the Pre-LN structure limits the model potential. As depicted in Figure 1 and Figure 9, although the 6-layer Pre-LN converges faster than Post-LN, its final performance is worse than that of Post-LN. In contrast, Admin not only achieves the same convergence speed as Pre-LN in the early stage, but also reaches good performance in the late stage.
For the IWSLT’14 dataset, we use the Transformer-small model for training. We observe that all methods perform similarly and Admin outperforms the other two by a small margin. Compared with the WMT’14 results, this verifies that training stability is related to the number of layers. For shallow networks, the stability difference between Post-LN and Pre-LN is not significant, and all architectures lead to similar performance. Besides, we find that the attention dropout and the activation dropout have a large impact on model performance. Specifically, by setting the attention dropout ratio and the ReLU dropout ratio to 0.1, we are able to improve the Post-LN performance from 34.60 (reported) to 35.64 (average of five runs).
5.2 Comparing to other Initializations
We further compare our method with two initialization methods, i.e., FixUp Zhang et al. (2019a) and LookLinear Balduzzi et al. (2017a). Specifically, we conduct experiments with 18-layer Transformers on the WMT’14 De-En dataset. In our experiments, we observe that both FixUp (without using layer normalization) and LookLinear (with Post-LN) lead to divergent training. With further analysis, we find that half-precision training and dropout could be what destabilizes FixUp, due to the lack of layer normalization.
5.3 Connection to Warmup
Previous work Liu et al. (2020) establishes that the need for warmup comes from the unstable adaptive learning rates in the early stage. Still, it is observed that removing the warmup phase results in more serious consequences for Transformers than for other architectures. Also, warmup is found to be useful for vanilla SGD Xiong et al. (2019).
In Theorem A.2, we establish that $\mathrm{Var}[F(x_0, W) - F(x_0, W^*)] \approx \sum_{i=1}^{N} \beta_{i,i}^2 C$, where $C$ is a constant related to network derivatives (see Appendix B). In the early stage of training, the network has larger parameter gradients and thus a larger $C$. Therefore, the same parameter shifts would result in larger output changes in the early stage than in the late stage. Thus, warmup relieves the output change and helps to stabilize training.
To further verify our intuitions, we remove the warmup phase and conduct a grid search on RAdam hyper-parameters Liu et al. (2020). Results are visualized in Figure 10. It shows that Post-LN is less robust to the choice of learning rates. Specifically, Post-LN diverges with larger learning rates or smaller $\beta_2$ (a smaller $\beta_2$ uses fewer samples to estimate adaptive learning rates), while Admin and Pre-LN are more robust. At the same time, we extend the warmup phase from 8 thousand updates to 16, 24, and 32 thousand updates and find that the training of 18-layer Post-LN still converges to bad/suspicious local optima. It shows that the large output shift of Post-LN is not always neutralized by learning rate warmup. Intuitively, the large output shift not only requires a small learning rate but also makes the loss surface less smooth Li et al. (2018) and complicates training. Since warmup stabilizes training without smoothing the loss surface, it fails to train deeper Transformer networks. On the other hand, Admin not only stabilizes training but also simplifies it by initializing from an area with a smooth loss surface, thus leading to better training.
6 Related Work
Transformer. Transformer Vaswani et al. (2017) has led to a series of breakthroughs in various domains Devlin et al. (2019); Velickovic et al. (2018); Huang et al. (2019); Parmar et al. (2018); Ramachandran et al. (2019). Liu et al. (2020) show that, compared with other architectures, removing the warmup phase is more damaging for Transformers, especially Post-LN. Similarly, it has been found that the original Transformer (referred to as Post-LN) is less robust than its Pre-LN variant Baevski and Auli (2019); Nguyen and Salazar (2019); Wang et al. (2019). Our studies go beyond the existing literature on gradient vanishing Xiong et al. (2019) and identify an important factor that greatly influences Transformer training. Our analysis guides us to propose a novel adaptive initialization method and allows us to better understand other empirical observations, e.g., that initializing parameters to smaller values helps to stabilize training Nguyen and Salazar (2019).
Deep Network Initialization. To handle gradient vanishing in deep feedforward networks, a specific initialization is derived and found to be useful Glorot and Bengio (2010). The derivation is further improved for ReLU networks He et al. (2015). He et al. (2016) find that deep network training is still hard after addressing the gradient vanishing issue and propose ResNet. Balduzzi et al. (2017b) identify the shattered gradient issue and propose LookLinear initialization. Recently, the study of dynamical isometry Xiao et al. (2018); Yang and Schoenholz (2017); Pennington et al. (2017); Saxe et al. (2013b) provides a new perspective for analyzing network behavior at initialization and focuses on simple networks like Deep Linear Networks and on gradient updates. On the other hand, it has been observed that scaling the residual outputs to smaller values helps to stabilize training Hanin and Rolnick (2018); Mishkin and Matas (2015); Zhang et al. (2019a); Bachlechner et al. (2020); Goyal et al. (2017). Here, we focus our study on the Transformer architecture, identify that unbalanced gradients are not the direct cause of the Post-LN instability, recognize the amplification effect of residual dependencies, and propose a novel adaptive initialization method.
In this paper, we study the difficulties of training Transformers in theoretical and empirical manners. Our study in Section 3 suggests that the gradient vanishing problem is not the root cause of unstable Transformer training. Also, the unbalanced gradient distribution is largely addressed by adaptive optimizers. In Section 4, we reveal the root cause of the instability to be the strong dependency on residual branches, which amplifies the fluctuation caused by parameter changes and destabilizes model training. In light of our analysis, we propose Admin, an adaptive initialization method to stabilize Transformer training. It controls the dependency at the beginning of training and maintains the flexibility to capture those dependencies once training stabilizes. Extensive experiments on real-world datasets verify our intuitions and show that Admin achieves more stable training, faster convergence, and better performance.
Our work opens up new possibilities to not only further push the state-of-the-art but also better understand deep network training. It leads to many interesting directions for future work, including generalizing Theorem 4.2 to other models, designing new algorithms to automatically adapt deep networks to different training configurations, upgrading the Transformer architecture, and applying our proposed Admin to conduct training at a larger scale.
We thank Chengyu Dong, Haoming Jiang, Jingbo Shang, Xiaotao Gu, and Zihan Wang for valuable discussions and comments; Jingbo Shang for sharing GPU machines; and Microsoft for setting up GPU machines.
- Ba et al. (2016) Jimmy Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. ArXiv, abs/1607.06450.
- Bachlechner et al. (2020) Thomas C. Bachlechner, Bodhisattwa Prasad Majumder, Huanru Henry Mao, Garrison W. Cottrell, and Julian J. McAuley. 2020. Rezero is all you need: Fast convergence at large depth. ArXiv, abs/2003.04887.
- Baevski and Auli (2019) Alexei Baevski and Michael Auli. 2019. Adaptive input representations for neural language modeling. In ICLR.
- Balduzzi et al. (2017a) David Balduzzi, Marcus Frean, Lennox Leary, J. P. Lewis, Kurt Wan-Duo Ma, and Brian McWilliams. 2017a. The shattered gradients problem: If resnets are the answer, then what is the question? In ICML.
- Balduzzi et al. (2017b) David Balduzzi, Marcus Frean, Lennox Leary, J P Lewis, Kurt Wan-Duo Ma, and Brian McWilliams. 2017b. The shattered gradients problem: If resnets are the answer, then what is the question? In ICML.
- Bengio et al. (1994) Yoshua Bengio, Patrice Y. Simard, and Paolo Frasconi. 1994. Learning long-term dependencies with gradient descent is difficult. IEEE transactions on neural networks.
- Chen et al. (2018) Mia Xu Chen, Orhan Firat, Ankur Bapna, Melvin Johnson, Wolfgang Macherey, George Foster, Llion Jones, Niki Parmar, Michael Schuster, Zhi-Feng Chen, Yonghui Wu, and Macduff Hughes. 2018. The best of both worlds: Combining recent advances in neural machine translation. In ACL.
- Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT.
- Glorot and Bengio (2010) Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In AISTATS.
- Goyal et al. (2017) Priya Goyal, Piotr Dollár, Ross B. Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. 2017. Accurate, large minibatch sgd: Training imagenet in 1 hour. ArXiv, abs/1706.02677.
- Hanin and Rolnick (2018) Boris Hanin and David Rolnick. 2018. How to start training: The effect of initialization and architecture. In NeurIPS.
- He et al. (2015) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In ICCV.
- He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In CVPR.
- Huang et al. (2019) Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Ian Simon, Curtis Hawthorne, Noam Shazeer, Andrew M. Dai, Matthew D. Hoffman, Monica Dinculescu, and Douglas Eck. 2019. Music transformer: Generating music with long-term structure. In ICLR.
- Li et al. (2018) Hao Li, Zheng Xu, Gavin Taylor, and Tom Goldstein. 2018. Visualizing the loss landscape of neural nets. In NeurIPS.
- Liu et al. (2020) Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Jiawei Han. 2020. On the variance of the adaptive learning rate and beyond. In ICLR.
- Lu et al. (2020) Yiping Lu, Zhuohan Li, Di He, Zhiqing Sun, Bin Dong, Tao Qin, Liwei Wang, and Tie-Yan Liu. 2020. Understanding and improving transformer from a multi-particle dynamic system point of view. In ICLR Workshop DeepDiffEq.
- Mishkin and Matas (2015) Dmytro Mishkin and Juan E. Sala Matas. 2015. All you need is a good init. In ICLR.
- Nguyen and Salazar (2019) Toan Q. Nguyen and Julian Salazar. 2019. Transformers without tears: Improving the normalization of self-attention. In IWSLT.
- Ott et al. (2019) Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations.
- Parmar et al. (2018) Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. 2018. Image transformer. In ICML.
- Pennington et al. (2017) Jeffrey Pennington, Samuel S. Schoenholz, and Surya Ganguli. 2017. Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice. In NIPS.
- Popel and Bojar (2018) Martin Popel and Ondrej Bojar. 2018. Training tips for the transformer model. The Prague Bulletin of Mathematical Linguistics, 110:43 – 70.
- Ramachandran et al. (2019) Prajit Ramachandran, Niki Parmar, Ashish Vaswani, Irwan Bello, Anselm Levskaya, and Jonathon Shlens. 2019. Stand-alone self-attention in vision models. In NeurIPS.
- Saxe et al. (2013a) Andrew M Saxe, James L McClelland, and Surya Ganguli. 2013a. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120.
- Saxe et al. (2013b) Andrew M. Saxe, James L. McClelland, and Surya Ganguli. 2013b. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. CoRR, abs/1312.6120.
- Szegedy et al. (2016) Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the inception architecture for computer vision. In CVPR.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS.
- Velickovic et al. (2018) Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2018. Graph attention networks. In ICLR.
- Wang et al. (2019) Qiang Wang, Bei Li, Tong Xiao, Jingbo Zhu, Changliang Li, Derek F. Wong, and Lidia S. Chao. 2019. Learning deep transformer models for machine translation. In ACL.
- Wu et al. (2019) Felix Wu, Angela Fan, Alexei Baevski, Yann Dauphin, and Michael Auli. 2019. Pay less attention with lightweight and dynamic convolutions. In ICLR.
- Xiao et al. (2018) Lechao Xiao, Yasaman Bahri, Jascha Sohl-Dickstein, Samuel S. Schoenholz, and Jeffrey Pennington. 2018. Dynamical isometry and a mean field theory of cnns: How to train 10,000-layer vanilla convolutional neural networks. In ICML.
- Xiong et al. (2019) Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shu xin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Li-Wei Wang, and Tie-Yan Liu. 2019. On layer normalization in the transformer architecture. ArXiv, abs/2002.04745.
- Yang and Schoenholz (2017) Greg Yang and Samuel S. Schoenholz. 2017. Mean field residual networks: On the edge of chaos. In NIPS.
- Zhang et al. (2019a) Hongyi Zhang, Yann N. Dauphin, and Tengyu Ma. 2019a. Fixup initialization: Residual learning without normalization. In ICLR.
- Zhang et al. (2019b) Jingzhao Zhang, Sai Praneeth Karimireddy, Andreas Veit, Seungyeon Kim, Sashank J. Reddi, Surinder Kumar, and Suvrit Sra. 2019b. Why adam beats sgd for attention models. ArXiv, abs/1912.03194.
Appendix A Gradients at Initialization
Here, we first reveal that Pre-LN does not suffer from gradient vanishing. Then we establish that only the Post-LN decoder suffers from gradient vanishing, but not the Post-LN encoder. For simplicity, we use $\Delta x$ to denote gradients, i.e., $\Delta x = \frac{\partial \mathcal{L}}{\partial x}$ where $\mathcal{L}$ is the training objective. Following previous studies Bengio et al. (1994); Glorot and Bengio (2010); He et al. (2015); Saxe et al. (2013a), we analyze the gradient distribution at the very beginning of training and assume that the randomly initialized parameters and the partial derivatives with regard to module inputs are independent.
A.1 Pre-LN Analysis
For Pre-LN encoders, we have and . At initialization, the two terms on the right-hand side are approximately independent and . Therefore we have . Similarly, we can get thus . Applying the same analysis to Pre-LN decoders, we can get . Thus, lower layers have larger gradients than higher layers, and gradients do not vanish in backpropagation. For Pre-LN, if and the derivatives of modules in the -th sub-layer are independent, then .
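As a reference for the two arrangements compared in this appendix, the following is a minimal sketch of a single Pre-LN and a single Post-LN sub-layer in plain Python. It assumes the usual definitions (Pre-LN adds the module output to an unnormalized shortcut, Post-LN normalizes the sum) and a layer normalization with gain 1 and bias 0. The names `layer_norm`, `pre_ln_sublayer`, `post_ln_sublayer`, and `toy_module` are illustrative and do not come from the paper.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (gain 1, bias 0)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def pre_ln_sublayer(x, f):
    """Pre-LN: only the module input is normalized; the shortcut x is added as-is."""
    fx = f(layer_norm(x))
    return [a + b for a, b in zip(x, fx)]

def post_ln_sublayer(x, f):
    """Post-LN: the sum of shortcut and module output is normalized."""
    fx = f(x)
    return layer_norm([a + b for a, b in zip(x, fx)])

def toy_module(x):
    # Hypothetical stand-in for an attention or FFN module.
    return [0.5 * v for v in reversed(x)]

x = [1.0, -2.0, 0.5, 3.0]
pre_out = pre_ln_sublayer(x, toy_module)
post_out = post_ln_sublayer(x, toy_module)
```

In the Pre-LN form the shortcut is an identity path from input to output, which is consistent with the observation above that gradients reach lower layers without vanishing. In the Post-LN form every path, including the shortcut, passes through the normalization.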
A.2 Post-LN Encoder Analysis
Unlike in Pre-LN, and are related not only through the residual connection but also through the layer normalization, which makes it harder to relate their gradients. After making assumptions on the model initialization, we find that lower layers in the Post-LN encoder also have larger gradients than higher layers, and gradients do not vanish in backpropagation through the encoder.
For Post-LN Encoders, if and in the Layer Norm are initialized as and respectively; all other parameters are initialized by symmetric distributions with zero mean; and are subject to symmetric distributions with zero mean; the variance of is (i.e., normalized by Layer Norm); and the derivatives of modules in -th sub-layer are independent, we have .
We first prove , i.e., the backpropagation through FFN sublayers does not suffer from gradient vanishing. In Post-LN encoders, the outputs of FFN sublayers are calculated as where . Since, at initialization, and are independently drawn from symmetric distributions, we have and
where . Denoting the dimension of as , He et al. (2015) establish that
Since in Post-LN, is the output of layer norm, we have . Thus,
Assuming different terms are also independent in the backpropagation, we have
At initialization, He et al. (2015) establish that
Therefore, we have
which shows the backpropagation through FFN sublayers does not suffer from gradient vanishing.
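One ingredient of He et al. (2015) style variance analyses is that, for a zero-mean input with a symmetric distribution, a ReLU keeps half of the second moment, i.e. E[ReLU(z)^2] = Var[z] / 2. The sketch below checks this identity by Monte Carlo sampling. It assumes a ReLU activation in the FFN and Gaussian pre-activations, and the function name, sample size, and standard deviation are illustrative choices rather than settings from the paper.

```python
import random

def relu(v):
    return v if v > 0.0 else 0.0

def relu_second_moment_ratio(num_samples=200_000, std=1.5, seed=0):
    """Estimate E[ReLU(z)^2] / E[z^2] for zero-mean Gaussian z.

    For zero-mean z, E[z^2] equals Var[z], so the ratio should be close to 0.5.
    """
    rng = random.Random(seed)
    total_sq = 0.0
    total_relu_sq = 0.0
    for _ in range(num_samples):
        z = rng.gauss(0.0, std)
        total_sq += z * z
        total_relu_sq += relu(z) ** 2
    return total_relu_sq / total_sq

ratio = relu_second_moment_ratio()
print(f"E[ReLU(z)^2] / Var[z] is approximately {ratio:.3f}")
```

The ratio does not depend on the chosen standard deviation, which is why a static rescaling of the initialization can compensate for the damping of a ReLU.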
Now we proceed to prove that , i.e., the backpropagation through Self-Attention sublayers does not suffer from gradient vanishing. In Post-LN encoders, the outputs of Self-Attention sublayers are calculated as where and . At initialization, since , , and are independently drawn from symmetric distributions, we have , thus , where .
Denoting as , we have
Similar to He et al. (2015), we have
Since is the output of layer norm, we have . Thus,
In the backpropagation, we have
At initialization, we assume and model parameters are independent He et al. (2015), thus
Therefore, we have
A.3 Post-LN Decoder Analysis
In Post-LN, the Encoder-Attention sub-layer suffers from gradient vanishing. The Encoder-Attention sub-layer calculates outputs as where and . Here is the encoder output and is the row-wise softmax function. In backpropagation, all gradient paths from to pass through the softmax function, so we have . Thus, these paths suffer from gradient vanishing.
A.4 Gradients of Unbalanced Gradients
As in Figure 3 and Figure 11, even for Pre-LN, the gradient distributions of Attention modules are unbalanced. Specifically, parameters within the softmax function (i.e., and ) suffer from gradient vanishing (i.e., ) and have smaller gradients than other parameters.
With further analysis, we find it is hard to neutralize the gradient vanishing of softmax. Unlike conventional non-linear functions such as ReLU or sigmoid, softmax has a dynamic input length (i.e., for sentences with different lengths, the inputs of softmax have different dimensions). Although this setting allows Attention modules to handle sequential inputs, it restricts them from having stable and consistent backpropagation. Specifically, let us compare softmax and sigmoid. For the sigmoid function, although its derivative is smaller than 1, this damping effect is consistent for all inputs. Thus, sigmoid can be neutralized by a larger initialization Glorot and Bengio (2010). For softmax, the damping effect differs across inputs and thus cannot be neutralized by a static initialization.
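The contrast between sigmoid and softmax can be illustrated with a small numeric sketch. It assumes the simplest possible input, where all logits are zero, so that only the sequence length varies. The helper names are illustrative and not from the paper.

```python
import math

def sigmoid_derivative(z):
    """Derivative of the logistic sigmoid at pre-activation z."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def softmax(logits):
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_self_derivative(logits, i=0):
    """Diagonal Jacobian entry: d softmax_i / d logit_i = p_i * (1 - p_i)."""
    p = softmax(logits)
    return p[i] * (1.0 - p[i])

# Sigmoid: the derivative at a given pre-activation is the same for any sequence length.
sig = sigmoid_derivative(0.0)

# Softmax: the same per-position logit yields a different derivative for each length.
soft = {n: softmax_self_derivative([0.0] * n) for n in (2, 4, 16, 64)}
for n, d in soft.items():
    print(f"length={n:3d}  softmax derivative={d:.4f}  sigmoid derivative={sig:.4f}")
```

With uniform logits the diagonal entry equals (1/n)(1 - 1/n), so the damping grows with the sequence length n. A single initialization scale cannot offset it for all lengths at once.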
Also, we observe that this issue is largely addressed by adaptive optimizers. Specifically, we calculate the norm of the parameter change between consecutive epochs (e.g., where is the checkpoint saved after epochs) and also visualize the relative norm (scaled by the largest value in the same network) in Figure 11. Comparing the relative norms of parameter gradients and parameter updates, we notice that, although the gradient distribution is unbalanced, adaptive optimizers successfully assign different learning rates to different parameters and lead to consistent update magnitudes. This result explains why vanilla SGD fails to train Transformers (i.e., it lacks the ability to handle unbalanced gradient distributions). It also implies that the unbalanced gradient distribution (e.g., gradient vanishing) has been largely addressed by adaptive optimizers and may not have a big impact on training instability.
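The relative update norm described above can be sketched as follows. This is a minimal illustration that assumes each checkpoint is a mapping from parameter name to a flattened list of values. The parameter names and numbers are hypothetical and are not results from the paper.

```python
import math

def update_norms(ckpt_prev, ckpt_next):
    """L2 norm of the change of each named parameter between two checkpoints."""
    return {
        name: math.sqrt(
            sum((b - a) ** 2 for a, b in zip(ckpt_prev[name], ckpt_next[name]))
        )
        for name in ckpt_prev
    }

def relative_norms(norms):
    """Scale every norm by the largest one in the same network."""
    largest = max(norms.values())
    return {name: value / largest for name, value in norms.items()}

# Hypothetical flattened parameters saved after two consecutive epochs.
ckpt_t = {"attn.w_q": [0.10, -0.20], "attn.w_v": [0.30, 0.40], "ffn.w_1": [0.00, 0.50]}
ckpt_t1 = {"attn.w_q": [0.13, -0.24], "attn.w_v": [0.30, 0.50], "ffn.w_1": [0.06, 0.58]}

rel = relative_norms(update_norms(ckpt_t, ckpt_t1))
for name, value in rel.items():
    print(f"{name}: {value:.3f}")
```

Applying the same scaling to per-parameter gradient norms gives the second quantity used in the comparison, so the two distributions can be plotted on a common 0 to 1 scale.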
Appendix B Proof of Theorem 4.2
Here, we elaborate on the derivation of Theorem 4.2, which establishes the relationship between the number of layers and the output fluctuation brought by parameter changes.
Consider a -layer Transformer , where is the input and is the parameter. If the layer dependency stays the same after a parameter change (i.e., has the same value after changing to , where is randomly initialized and is independent of ), the output change (i.e., ) can be estimated as where is a constant.
We refer to the module in sub-layer as , where is the normalized residual output and is the normalized module output. The final output is marked as . To simplify the notation, we use the superscript to indicate variables related to , e.g., and .
At initialization, all parameters are initialized independently. Thus , and are independent and . Also, since -layer and -layer share the residual connection to previous layers, we have . Thus and
Now, we proceed to analyze . Specifically, we have
Since is randomly initialized,