Understanding the Difficulty of Training Transformers

04/17/2020 ∙ by Liyuan Liu, et al. ∙ Microsoft ∙ University of Illinois at Urbana-Champaign

Transformers have proved effective for many deep learning tasks. Training Transformers, however, requires non-trivial effort, such as carefully designed learning rate schedulers and cutting-edge optimizers (the standard SGD fails to train Transformers effectively). In this paper, we study Transformer training from both theoretical and empirical perspectives. Our analysis reveals that unbalanced gradients are not the root cause of training instability. Instead, we identify an amplification effect that substantially influences training. Specifically, we observe that, for each layer in a multi-layer Transformer model, a heavy dependency on its residual branch makes training unstable, since it amplifies small parameter perturbations (e.g., parameter updates) into significant disturbances in the model output, while a light dependency limits the potential of model training and can lead to an inferior trained model. Inspired by our analysis, we propose Admin (Adaptive model initialization) to stabilize training in the early stage and unleash the model's full potential in the late stage. Extensive experiments show that Admin is more stable, converges faster, and leads to better performance.


1 Introduction

Transformers Vaswani et al. (2017) have led to a series of breakthroughs in various deep learning tasks Devlin et al. (2019); Velickovic et al. (2018). They do not contain recurrent connections and can parallelize all computations in the same layer, thus improving effectiveness, efficiency, and scalability. Training Transformers, however, requires extra effort. For example, although the standard stochastic gradient descent (SGD) is a canonical algorithm for conventional RNNs and CNNs, it converges to bad/suspicious local optima for Transformers Zhang et al. (2019b). Moreover, compared to other neural architectures, removing the warmup stage in Transformer training results in more serious consequences such as model divergence Popel and Bojar (2018); Liu et al. (2020). In this paper, we conduct a comprehensive theoretical and empirical analysis to answer the question: what complicates Transformer training?

Figure 1: Lacking enough robustness and stability, the 18-layer Post-LN Transformer (i.e., the original architecture) diverges during training and is omitted from the left graph. Admin not only stabilizes model training but also unleashes the model's potential for better performance.
Figure 2: Output changes of Pre-LN, Post-LN, and Admin encoders (1 to 100 layers, each with 2 sub-layers), i.e., ||F(x, W*) − F(x, W)||, where the parameter change W* − W is obtained from random perturbations (left) or gradient updates (right).
Figure 3: Histogram of the relative norm of gradients and of parameter updates |W_t − W_{t−1}|, where W_t is the checkpoint saved after training for t epochs.

Our analysis starts from the observation that the original Transformer (referred to as Post-LN) is less robust than its Pre-LN variant Baevski and Auli (2019); Nguyen and Salazar (2019). We recognize that gradient vanishing is not the direct cause of this difference, since fixing this issue alone cannot stabilize Post-LN training. This implies that there exist factors other than unbalanced gradients that greatly influence model training.

With further analysis, we recognize that, for each layer, the dependency on its residual branch (for a residual sub-layer x_i = x_{i-1} + f(x_{i-1}), its shortcut output refers to x_{i-1}, its residual branch refers to f(x_{i-1}), and the dependency on its residual branch refers to the proportion of f(x_{i-1}) in x_i) plays an important role in training stability. First, we find that a Post-LN layer has a heavier dependency on its residual branch than a Pre-LN layer. For example, at initialization, a Pre-LN layer has roughly the same dependency on its residual branch and on any previous layer, whereas a Post-LN layer has a much stronger dependency on its residual branch (more discussion is presented in Section 4.1). We find that the strong dependencies of Post-LN amplify fluctuations brought by parameter changes and destabilize model training (as in Theorem 4.2 and Figure 2). On the other hand, the loose reliance on residual branches in Pre-LN generally limits the potential of the training algorithm and often produces an inferior model.

In light of our analysis, we propose Admin, an adaptive initialization method for training Post-LN Transformer models, which retains the merits of Pre-LN stability without hurting the performance. It restricts each layer's dependency on its residual branch in the early stage and unleashes the model's full potential in the late stage. Admin is more stable, converges faster, and performs better in extensive experiments on IWSLT'14 De-En and WMT'14 En-De (code is released at https://github.com/LiyuanLucasLiu/Transformer-Clinic).

2 Preliminaries

Figure 4: The architectures and notations of Pre-LN Transformers (left) and Post-LN Transformers (right).

Transformer Architectures and Notations. The Transformer architecture contains two types of sub-layers, i.e., Attention sub-layers and Feedforward (FFN) sub-layers. They are mainly composed of three basic modules Vaswani et al. (2017), i.e., Layer Norm (f_LN), Multi-head Attention (f_ATT), and Feedforward Network (f_FFN). As illustrated in Figure 4, the Pre-LN Transformer and the Post-LN Transformer organize these modules differently. For example, a Pre-LN encoder organizes the Self-Attention sub-layer as x'_i = x_i + f_S-ATT(f_LN(x_i)), and a Post-LN encoder organizes it as x'_i = f_LN(x_i + f_S-ATT(x_i)), where x_i is the input of the i-th Transformer layer and x'_i is the output of the i-th Self-Attention sub-layer. Here, we refer to f_ATT(·) and f_FFN(·) as the residual branches and to their outputs as the residual outputs, in contrast to layer/sub-layer outputs, which integrate residual outputs and shortcut outputs. Notation elaborations are shown in Figure 4. In particular, we use superscripts to indicate network architectures (e.g., the Pre-LN encoder), use subscripts to indicate layer indexes (top layers have larger indexes), and format all inputs and outputs as matrices of shape (sequence length) × (hidden dimension).
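To make the two organizations concrete, below is a minimal PyTorch sketch (ours, not the authors' released implementation; the `branch` argument stands for any residual branch f_ATT or f_FFN) of how a single sub-layer wires layer norm in the two variants.

```python
import torch.nn as nn

class PreLNSublayer(nn.Module):
    """Pre-LN: x' = x + f(LN(x)) -- normalize before the residual branch."""
    def __init__(self, d_model, branch):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.branch = branch                      # f_ATT or f_FFN

    def forward(self, x):
        return x + self.branch(self.norm(x))

class PostLNSublayer(nn.Module):
    """Post-LN: x' = LN(x + f(x)) -- normalize after the residual addition."""
    def __init__(self, d_model, branch):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.branch = branch

    def forward(self, x):
        return self.norm(x + self.branch(x))
```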

Layer Norm. Layer norm Ba et al. (2016) plays a key role in the Transformer architecture. It is defined as f_LN(x) = γ (x − μ)/σ + ν, where μ and σ are the mean and standard deviation of x, and γ and ν are learnable parameters.

Feedforward Network. Transformers use two-layer perceptrons as feedforward networks, i.e., f_FFN(x) = φ(x W^(1)) W^(2), where φ(·) is the non-linear activation function (in our analysis, we use ReLU as the activation function, while Admin can be applied to other non-linear functions) and W^(1), W^(2) are parameters.

Multi-head Attention. Multi-head Attention allows the network to have multiple focuses and has been shown to be effective for many tasks Chen et al. (2018). With H heads, it is defined as f_ATT(q, k, v) = Σ_h softmax(q W_h^Q (k W_h^K)^T) v W_h^V W_h^O, where softmax is the row-wise softmax function and W_h^Q, W_h^K, W_h^V, W_h^O are parameters. W_h^Q, W_h^K, and W_h^V are d × (d/H) matrices and W_h^O are (d/H) × d matrices, where d is the hidden state dimension. Parameters without the subscript h refer to the concatenation of all H per-head parameters. This module is used in two different manners: Encoder-Attention (i.e., f_ATT(x, x_enc, x_enc), where x_enc is the encoder output) and Self-Attention (i.e., f_ATT(x, x, x)).
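For concreteness, here is a compact sketch of the multi-head attention described above (a per-head loop over sliced weight matrices; the scaled dot product, the dimension names, and the omission of masking and dropout are simplifications of this sketch, not claims about the paper's implementation).

```python
import math
import torch
import torch.nn.functional as F

def multi_head_attention(q, k, v, w_q, w_k, w_v, w_o, n_heads):
    """f_ATT(q, k, v): sum over heads of softmax(q W_h^Q (k W_h^K)^T / sqrt(d_h)) v W_h^V W_h^O.

    q, k, v: (seq_len, d); w_q, w_k, w_v, w_o: (d, d). Column/row slices of the full
    matrices play the role of the per-head parameters W_h^Q, W_h^K, W_h^V, W_h^O.
    """
    d = q.size(-1)
    d_head = d // n_heads
    out = torch.zeros_like(q)
    for h in range(n_heads):
        cols = slice(h * d_head, (h + 1) * d_head)
        qh, kh, vh = q @ w_q[:, cols], k @ w_k[:, cols], v @ w_v[:, cols]
        scores = F.softmax(qh @ kh.T / math.sqrt(d_head), dim=-1)   # row-wise softmax
        out = out + scores @ vh @ w_o[cols, :]                      # W_h^O = row block of w_o
    return out

# Self-Attention uses multi_head_attention(x, x, x, ...);
# Encoder-Attention uses multi_head_attention(x, x_enc, x_enc, ...).
```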

3 Unbalanced Gradients

Here, we strive to answer the question: what complicates Transformer training? Our analysis starts from the observation that Pre-LN training is more robust than Post-LN training, while Post-LN, when it converges, is more likely to reach a better performance than Pre-LN. For example, in a parameter grid search (as in Figure 10), Pre-LN converges in all 15 settings, whereas Post-LN diverges in 7 out of 15 settings; when Post-LN converges, it outperforms Pre-LN in 7 out of 8 settings. We seek to reveal the underlying factor that destabilizes Post-LN and restricts the performance of Pre-LN.

In this section, we focus on the unbalanced gradients (e.g., gradient vanishing). We find that, although Post-LN suffers from gradient vanishing and Pre-LN does not, gradient vanishing is not the direct reason causing the instability of Post-LN. Specifically, we first theoretically and empirically establish that only Post-LN decoders suffer from gradient vanishing and Post-LN encoders do not. We then observe that fixing the gradient vanishing issue alone cannot stabilize training.

3.1 Gradients at Initialization

Figure 5: Relative gradient norm histogram (on a log scale) of 18-layer Transformers on the WMT’14 En-De dataset, i.e., the gradient norm of sub-layer outputs, scaled by the largest gradient norm in the same network.

As gradient vanishing can hamper convergence from the beginning, it has been regarded as the major issue leading to unstable training. Also, recent studies show that this issue exists in the Post-LN Transformer, even with residual connections Xiong et al. (2019). Below, we establish that only Post-LN decoders suffer from gradient vanishing; Post-LN encoders, Pre-LN encoders, and Pre-LN decoders do not.

We use δ_{x_i} to denote the gradient of the training objective L with respect to x_i, i.e., δ_{x_i} = ∂L/∂x_i. Following previous studies Bengio et al. (1994); Glorot and Bengio (2010); He et al. (2015); Saxe et al. (2013a), we analyze the gradient distribution at the very beginning of training. As established in Theorem A.2 and Remark A.1 (detailed derivations are included in Appendix A), neither Post-LN encoders, Pre-LN encoders, nor Pre-LN decoders suffer from gradient vanishing; only the Encoder-Attention sub-layers of Post-LN decoders do. Empirical studies are further conducted for verification. At initialization, we calculate ||δ_{x_i}|| for 18-layer Transformers and visualize the relative gradient norms in Figure 5. The results verify that only Post-LN decoders suffer from gradient vanishing, and show that the vanishing happens in the backpropagation from Encoder-Attention sub-layer outputs to their inputs (i.e., Self-Attention sub-layer outputs).
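The measurement behind Figure 5 can be sketched in a few lines of PyTorch; the instrumentation below is our own (the `model`, `sublayers`, `batch`, and `loss_fn` arguments are assumed placeholders): register forward hooks on sub-layer outputs, run one backward pass at initialization, and scale every gradient norm by the largest one in the network.

```python
import torch

def record_sublayer_outputs(sublayers):
    """Attach forward hooks that keep each sub-layer output and its gradient."""
    outputs = []
    def hook(_module, _inputs, output):
        output.retain_grad()               # keep dL/dx_i for this non-leaf tensor
        outputs.append(output)
    handles = [m.register_forward_hook(hook) for m in sublayers]
    return outputs, handles

def relative_gradient_norms(model, sublayers, batch, loss_fn):
    """||dL/dx_i|| for every recorded sub-layer output, scaled by the largest norm."""
    outputs, handles = record_sublayer_outputs(sublayers)
    loss_fn(model(batch)).backward()
    for h in handles:
        h.remove()
    norms = torch.tensor([o.grad.norm() for o in outputs])
    return norms / norms.max()             # relative scale, as in Figure 5
```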

3.2 Impact of the Gradient Vanishing

Now, we analyze whether gradient vanishing is the direct cause of training instability.

Encoder  | Decoder  | Gradient      | Training
Post-LN  | Post-LN  | Vanishing     | Diverged
Post-LN  | Pre-LN   | No vanishing  | Diverged
Pre-LN   | Pre-LN   | No vanishing  | Converged
Table 1: Changing decoders from Post-LN to Pre-LN fixes gradient vanishing but does not stabilize model training. Encoders/decoders have 18 layers.

As in Section 3.1, only Post-LN decoders suffer from gradient vanishing, but not Post-LN encoders. Thus, we combine a Post-LN encoder and a Pre-LN decoder to construct a hybrid Transformer, which does not suffer from gradient vanishing. As shown in Table 1, fixing gradient vanishing alone (i.e., changing Post-LN decoders to Pre-LN decoders) fails to stabilize model training. It implies that the gradient vanishing issue is not the direct cause of the unstable Post-LN training.

Moreover, we observe that the gradients of all attention modules are unbalanced, and this imbalance is hard to neutralize. We also find that this issue is largely addressed by adaptive optimizers. For example, as in Figure 3, adaptive optimizers successfully assign different learning rates to different parameters and lead to consistent update magnitudes even with unbalanced gradients. This explains why the standard SGD fails in training Transformers (i.e., it lacks the ability to handle unbalanced gradients) and necessitates the use of adaptive optimizers. More discussion is included in Appendix A.4.
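A toy, self-contained illustration of this point (ours, not an experiment from the paper): with two parameters whose gradients differ by three orders of magnitude, one Adam step produces updates of comparable size, while one SGD step inherits the imbalance.

```python
import torch

# Two toy parameter tensors whose gradients differ by ~1000x (unbalanced gradients).
grads = [torch.full((4,), 1e-3), torch.full((4,), 1.0)]

for name, opt_cls in [("SGD", torch.optim.SGD), ("Adam", torch.optim.Adam)]:
    params = [torch.zeros(4, requires_grad=True) for _ in grads]
    opt = opt_cls(params, lr=1e-3)
    for p, g in zip(params, grads):
        p.grad = g.clone()
    opt.step()
    print(name, "update norms:", [p.detach().norm().item() for p in params])
# SGD's two updates differ by ~1000x; Adam's are of comparable magnitude,
# i.e., the adaptive per-parameter learning rates neutralize the gradient imbalance.
```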

4 Instability from Amplification Effect

We find that unbalanced gradients are not the root cause of the instability of Post-LN, which implies the existence of other factors influencing model training. We now go beyond gradient vanishing and introduce the amplification effect. Specifically, we first examine the difference between Pre-LN and Post-LN, including their behaviors in the early and late stages of training. Then, we show that the training instability of Post-LN is attributed to the amplification effect of layer dependency, which intensifies gradient updates and destabilizes training.

4.1 Impact of Layer Norm Positions

As described in Section 2, both Pre-LN and Post-LN employ layer norm to regularize inputs and outputs. In residual networks, different residual outputs are aggregated and normalized before serving as inputs of other layers (i.e., residual outputs are scaled to ensure that the integrated input has a consistent variance). To some extent, layer norm treats the variances of residual outputs as weights to average them. For example, for Post-LN Self-Attention, at initialization we have x'_i = f_LN(x_i + f_ATT(x_i)) ≈ (x_i + f_ATT(x_i)) / sqrt(Var[x_i] + Var[f_ATT(x_i)]). A larger Var[f_ATT(x_i)] not only increases the proportion of f_ATT(x_i) in the output but also decreases the proportion of the other residual outputs (integrated in x_i), similar to the weights in a weighted average.

Figure 6: The major difference between Pre-LN and Post-LN is the position of layer norms.

The position of layer norms is the major difference between Pre-LN and Post-LN, and it makes them aggregate residual outputs differently (i.e., with different weights). As in Figure 6, all residual outputs in Pre-LN are normalized only once before being fed into other layers (thus only residual output variances act as weights); in Post-LN, most residual outputs are normalized more than once, and different residual outputs are normalized a different number of times. For example, if all layers are initialized in the same way, the output variances of different Pre-LN residual branches are similar, and the aggregation is similar to a simple average. For Post-LN, residual outputs from nearby layers are normalized fewer times than those from earlier layers and thus have relatively larger weights. We proceed to calculate and analyze these weights to understand the impact of layer norm positions.

First, we use â_i to refer to f_LN(a_i), where a_i is the output of the i-th residual branch (i.e., â_i is the normalized i-th residual output), and x̂_i to refer to f_LN(x_i) (i.e., the normalized output of the i-th layer, or the normalized input of the (i+1)-th residual branch). Then, we describe their relationship as x̂_i = Σ_{j≤i} β_{i,j} â_j, where β_{i,j} integrates the scaling operations of all layer norms on the path from a_j to x̂_i. For example, Pre-LN sets β_{i,j} = sqrt(Var[a_j] / Var[x_i]). Intuitively, β_{i,i} describes the proportion of the i-th residual branch output in the i-th layer output and thus reflects the dependency among layers.

We calculate β_{i,j} and visualize it in Figure 7. Each Post-LN layer relies heavily on its own residual branch, from initialization to the end of training. At initialization, Pre-LN layer outputs have roughly the same reliance on all previous residual branches. As training advances, each layer starts to rely more on its own residual output. But compared to Post-LN, Pre-LN layer outputs in the final model are still less focused on their own residual branches.

Figure 7: β_{i,j} in 6-layer Post-LN and Pre-LN on the WMT'14 En-De dataset (each model contains 12 sub-layers).

Intuitively, it is harder for Pre-LN layers to depend too much on their own residual branches. In Pre-LN, layer outputs (i.e., x_i) are not normalized, and their variances are likely to be larger for higher layers (i.e., if x_{i-1} and a_i are independent, Var[x_i] = Var[x_{i-1}] + Var[a_i]; also, in our experiments Var[x_i] increases as i becomes larger). Since β_{i,i} = sqrt(Var[a_i] / Var[x_i]), β_{i,i} is likely to be smaller for higher layers, which restricts the i-th layer output from depending too much on its own residual branch and inhibits the network from reaching its full potential. In other words, Pre-LN restricts the network from being too deep (i.e., if x̂_i is hard to distinguish from x̂_{i-1}, appending one layer is similar to doubling the width of the last layer), while Post-LN allows the network the choice of being wider or deeper.
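As a rough empirical proxy for this dependency (our reading of the discussion above, not the paper's exact estimator), one can log, for each sub-layer, the share of the pre-normalization output variance that comes from its own residual branch:

```python
import torch

@torch.no_grad()
def residual_dependency(shortcut, residual):
    """Share of the sub-layer output owed to its own residual branch,
    i.e., roughly beta_{i,i}^2 ~ Var[f_i(x_{i-1})] / Var[x_{i-1} + f_i(x_{i-1})]."""
    return (residual.var() / (shortcut + residual).var()).item()

# Usage inside an instrumented sub-layer forward pass (names are hypothetical):
#   branch_out = self.branch(x)                        # f_i(x_{i-1})
#   self.dependency_log.append(residual_dependency(x, branch_out))
```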

4.2 Amplification Effect at Initialization

Although depending more on residual branches allows the model to have a larger potential, it amplifies the fluctuation brought by parameter changes. For a network F(x, W), where x is the model input and W is the parameter, the output change caused by a parameter perturbation is ||F(x, W*) − F(x, W)||, where W* denotes the perturbed parameter. Its relationship with the layer dependency β_{i,i} is described in Theorem 4.2, and the derivation is elaborated in Appendix B.

Theorem 4.2. Consider an N-layer Transformer F(x, W), where x is the input and W is the parameter. If the layer dependency stays the same after a parameter change (i.e., β_{i,j} has the same value after changing W to W*, where W is randomly initialized and δ = W* − W is independent of W), the output change (i.e., ||F(x, W*) − F(x, W)||^2) can be estimated as

||F(x, W*) − F(x, W)||^2 ≈ C Σ_i β_{i,i}^2,

where C is a constant. If Var[a_i] is the same for all layers, Pre-LN sets β_{i,i}^2 to O(1/i) and Post-LN sets β_{i,i}^2 to a constant. Thus, we have the two corollaries below. For an N-layer Pre-LN, ||F(x, W*) − F(x, W)||^2 = O(log N) C. For an N-layer Post-LN, ||F(x, W*) − F(x, W)||^2 = O(N) C. They show that, since Post-LN relies more on its residual branches than Pre-LN (i.e., has a larger β_{i,i}), the perturbation is amplified to a larger magnitude. To empirically verify these relationships, we calculate ||F(x, W*) − F(x, W)||^2 for Pre-LN and Post-LN and visualize the results in Figure 2. By the Post-LN corollary, the output change is linearly associated with N, and by the Pre-LN corollary it is linearly associated with log N. These relationships match the observations in our experiments (as in Figure 2). For further verification, we measure the correlation magnitudes and find near-perfect correlations in both cases.

Moreover, we replace the random noise with optimization updates (i.e., setting W* − W to the update produced by the Adam optimizer) and visualize the output shifts in Figure 2. The output shift of Post-LN is larger than that of Pre-LN by multiple orders of magnitude.
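The output-shift measurement can be sketched as follows (our re-implementation under simple assumptions, not the released evaluation script): copy the model, change every parameter either by small random noise or by one optimizer update, and compare outputs on the same batch.

```python
import copy
import torch

@torch.no_grad()
def output_shift_random(model, batch, sigma=1e-3):
    """||F(x, W*) - F(x, W)||^2 with W* = W + delta, delta drawn as small Gaussian noise."""
    perturbed = copy.deepcopy(model)
    for p in perturbed.parameters():
        p.add_(torch.randn_like(p) * sigma)
    return (perturbed(batch) - model(batch)).pow(2).sum().item()

def output_shift_adam(model, batch, loss_fn, lr=3e-4):
    """Same measurement, but W* - W is one Adam update instead of random noise."""
    perturbed = copy.deepcopy(model)
    opt = torch.optim.Adam(perturbed.parameters(), lr=lr)
    loss_fn(perturbed(batch)).backward()
    opt.step()
    with torch.no_grad():
        return (perturbed(batch) - model(batch)).pow(2).sum().item()
```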

Intuitively, large output shifts destabilize training Li et al. (2018). Also, as elaborated in Appendix B, the constant C in Theorem 4.2 is related to network derivatives and thus becomes smaller as training advances, which explains why warmup is also helpful for the standard SGD. Therefore, we conjecture that it is the large output shift of Post-LN that results in unstable training. We now proceed to stabilize Post-LN by controlling the dependency on residual branches in the early stage of training.

Figure 8: β_{i,j} of 18-layer Admin (Post-LN) and Pre-LN on the WMT'14 En-De dataset.

4.3 Admin – Adaptive Model Initialization

In light of our analysis, we add additional parameters (i.e., ω) to control the residual dependencies of Post-LN and stabilize training by adaptively initializing ω to ensure an O(log N) output change.

Due to different training configurations and model specificities (e.g., different models may use different activation functions and dropout ratios), it is hard to derive a universal initialization method. Instead, we decompose model initialization into two phases: Profiling and Initialization. Specifically, Admin adds new parameters ω and constructs its i-th sub-layer output as x_i = f_LN(x_{i-1} ⊙ ω_i + f_i(x_{i-1})), where f_i(·) is the i-th residual branch, ω_i is a d-dimensional vector, and ⊙ is the element-wise product. The Profiling phase and the Initialization phase are:

Profiling. After initializing the network with a standard method (initializing ω_i as 1), conduct forward propagation without parameter updates and record the output variance of each residual branch (i.e., calculate Var[f_i(x_{i-1})]).

Initialization. Set ω_i = sqrt(Σ_{j<i} Var[f_j(x_{j-1})]) and initialize all other parameters with the same method used in the Profiling phase.
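The two phases can be sketched in a few lines of PyTorch; this is a simplified illustration under our assumptions (class and attribute names are ours, and the running variance sum starts from the normalized input variance of 1), not the released Admin implementation from the Transformer-Clinic repository.

```python
import torch
import torch.nn as nn

class AdminSublayer(nn.Module):
    """x_i = LN(x_{i-1} * omega_i + f_i(x_{i-1})), with omega_i set by profiling."""
    def __init__(self, d_model, branch):
        super().__init__()
        self.branch = branch                        # f_ATT or f_FFN
        self.norm = nn.LayerNorm(d_model)
        self.omega = nn.Parameter(torch.ones(d_model))
        self.profiling = False
        self.residual_var = 0.0                     # Var[f_i(x_{i-1})], recorded in phase 1

    def forward(self, x):
        out = self.branch(x)
        if self.profiling:
            self.residual_var = out.var().item()
        return self.norm(x * self.omega + out)

def admin_initialize(sublayers, forward_pass):
    """Phase 1 (profiling): one forward pass with omega_i = 1, recording variances.
    Phase 2 (initialization): set omega_i from the accumulated residual variances."""
    for s in sublayers:
        s.profiling = True
    with torch.no_grad():
        forward_pass()                              # any forward pass on training data
    accum = 1.0                                     # start from the (normalized) input
                                                    # variance; a simplification of this sketch
    for s in sublayers:
        s.profiling = False
        s.omega.data.fill_(accum ** 0.5)            # omega_i grows with sum_{j<i} Var[f_j]
        accum += s.residual_var
```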

In the early stage, Admin sets β_{i,i}^2 to approximately 1/i and ensures an O(log N) output change, thus stabilizing training. Training becomes more stable in the late stage (the constant C in Theorem 4.2 is related to parameter gradients), and each layer then has the flexibility to adjust its residual dependency and rely more on its own residual branch when calculating its output. After training finishes, Admin can be reparameterized as the conventional Post-LN structure (i.e., removing ω). More implementation details are elaborated in Appendix C.

To verify our intuition, we calculate the layer dependencies of 18-layer models and visualize the result in Figure 8. Figures 7 and 8 show that Admin avoids over-large dependencies at initialization and unleashes the potential to make layer outputs depend more on their own residual outputs in the final model. Moreover, we visualize the output change of Admin in Figure 2. Benefiting from the adaptive initialization, the output change of Admin grows at roughly the same rate as that of Pre-LN, even though Admin is constructed in the Post-LN manner. Also, although Admin is formulated in a Post-LN manner and suffers from gradient vanishing, 18-layer Admin successfully converges and outperforms 18-layer Pre-LN (as in Table 2). These results support our intuition that a large dependency on residual branches amplifies the output fluctuation and destabilizes training.

5 Experiments

We conduct experiments on two machine translation datasets, i.e., IWSLT’14 De-En and WMT’14 En-De. The detailed experimental configurations are elaborated in Appendix D.

Method   | 6-Layer | 12-Layer | 18-Layer
Post-LN  | 27.80   | Diverged | Diverged
Pre-LN   | 27.27   | 28.26    | 28.26
Admin    | 27.90   | 28.58    | 28.80
Table 2: Evaluation results (BLEU) on WMT'14 En-De.

5.1 Performance Comparison

We use BLEU as the evaluation metric and summarize the model performance in Table 2 and Table 3. For the WMT'14 dataset, experiments are conducted with the Transformer-base model with 6, 12, or 18 layers. Admin achieves better performance than Post-LN and Pre-LN in all three settings. Specifically, 12-layer and 18-layer Post-LN diverge without the adaptive initialization. Admin obtains comparable performance with Post-LN in the 6-layer setting and converges well in both the 12-layer and the 18-layer settings. Pre-LN also converges in all settings, but it results in sub-optimal performance, which verifies our intuition that the Pre-LN structure limits the model potential. As depicted in Figure 1 and Figure 9, although 6-layer Pre-LN converges faster than Post-LN, its final performance is worse than Post-LN. In contrast, Admin not only achieves the same convergence speed as Pre-LN in the early stage but also reaches a good performance in the late stage.

For the IWSLT'14 dataset, we use the Transformer-small model for training. We observe that all methods perform similarly, and Admin outperforms the other two by a small margin. Compared to the WMT'14 results, this verifies that training stability is related to the number of layers. For shallow networks, the stability difference between Post-LN and Pre-LN is not significant, and all architectures lead to similar performance. Besides, we find that the attention dropout and the activation dropout have a large impact on model performance. Specifically, by setting the attention dropout and ReLU dropout ratios to 0.1, we improve the Post-LN performance from 34.60 (as reported) to 35.64 (average of five runs).

Method                         | BLEU
Post-LN Vaswani et al. (2017)  | 34.60
DynamicConv Wu et al. (2019)   | 35.2
Post-LN                        | 35.64 ± 0.23
Pre-LN                         | 35.50 ± 0.04
Admin                          | 35.67 ± 0.15
Table 3: Performance on IWSLT'14 De-En (Transformer models are 6-layer Transformer-small models).
Figure 9: Development PPL on the WMT'14 En-De dataset and the IWSLT'14 De-En dataset.

5.2 Comparison to Other Initializations

We further compare our method with two initialization methods, i.e., FixUp Zhang et al. (2019a) and LookLinear Balduzzi et al. (2017a). Specifically, we conduct experiments with 18-layer Transformers on the WMT'14 En-De dataset. In our experiments, we observe that both FixUp (without layer normalization) and LookLinear (with Post-LN) lead to divergent training. With further analysis, we find that half-precision training and dropout could be the reasons destabilizing FixUp, due to its lack of layer normalization.

5.3 Connection to Warmup

Previous work Liu et al. (2020) establishes that the need for warmup comes from the unstable adaptive learning rates in the early stage. Still, it is observed that removing the warmup phase results in more serious consequences for Transformers than for other architectures. Also, warmup has been found to be useful for the vanilla SGD Xiong et al. (2019).

As established in Theorem 4.2 (see Appendix B), ||F(x, W*) − F(x, W)||^2 ≈ C Σ_i β_{i,i}^2, where the constant C is related to the network derivatives. In the early stage of training, the network has larger parameter gradients and thus a larger C. Therefore, the same parameter shift results in a larger output change in the early stage than in the late stage, and warmup relieves this output change and helps to stabilize training.
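For reference, the commonly used warmup schedule takes the following form (the standard inverse-square-root schedule of Vaswani et al. (2017), shown here only as background; the 8k-update warmup mentioned below corresponds to warmup_steps=8000):

```python
def transformer_lr(step, d_model=512, warmup_steps=8000):
    """Inverse-sqrt learning rate with linear warmup (Vaswani et al., 2017)."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The rate rises linearly for the first `warmup_steps` updates, peaks, then decays;
# during warmup the small rate limits the output shift of each parameter update.
print(transformer_lr(100), transformer_lr(8000), transformer_lr(80000))
```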

To further verify our intuitions, we remove the warmup phase and conduct a grid search over RAdam hyper-parameters Liu et al. (2020). Results are visualized in Figure 10. They show that Post-LN is less robust to the choice of learning rate. Specifically, Post-LN diverges with larger learning rates or smaller β_2 (a smaller β_2 uses fewer samples to estimate the adaptive learning rates), while Admin and Pre-LN are more robust. At the same time, we extend the warmup phase from 8 thousand updates to 16, 24, and 32 thousand updates and find that the training of 18-layer Post-LN still converges to bad/suspicious local optima. This shows that the large output shift of Post-LN is not always neutralized by learning rate warmup. Intuitively, the large output shift not only requires a small learning rate but also unsmooths the loss surface Li et al. (2018) and complicates training. Since warmup stabilizes training without smoothing the loss surface, it fails to train deeper Transformer networks. On the other hand, Admin not only stabilizes training but also simplifies it by initializing in an area with a smoother loss surface, thus leading to better training.

6 Related Work

Transformer. The Transformer Vaswani et al. (2017) has led to a series of breakthroughs in various domains Devlin et al. (2019); Velickovic et al. (2018); Huang et al. (2019); Parmar et al. (2018); Ramachandran et al. (2019). Liu et al. (2020) show that, compared to other architectures, removing the warmup phase is more damaging for Transformers, especially Post-LN. Similarly, it has been found that the original Transformer (referred to as Post-LN) is less robust than its Pre-LN variant Baevski and Auli (2019); Nguyen and Salazar (2019); Wang et al. (2019). Our study goes beyond the existing literature on gradient vanishing Xiong et al. (2019) and identifies an important factor that greatly influences Transformer training. Our analysis guides us to propose a novel adaptive initialization method and allows us to better understand other empirical observations, e.g., that initializing parameters to smaller values helps to stabilize training Nguyen and Salazar (2019).

Figure 10: BLEU score of Post-LN, Pre-LN, and Admin on the IWSLT'14 De-En dataset (the x-axis is the β_2 of the adaptive optimizer and the y-axis is the learning rate). Pre-LN converges in all settings, while Post-LN diverges in 7 out of 15 settings. When Post-LN converges, it outperforms Pre-LN in 7 out of 8 settings. Admin stabilizes Post-LN training and outperforms Pre-LN (its best performance is comparable with Post-LN).

Deep Network Initialization. To handle gradient vanishing in deep feedforward networks, specific initializations have been derived and found to be useful Glorot and Bengio (2010); the derivation was further improved for ReLU networks He et al. (2015). He et al. (2016) find that deep network training is still hard after addressing the gradient vanishing issue and propose ResNet. Balduzzi et al. (2017b) identify the shattered gradient issue and propose LookLinear initialization. Recently, the study of dynamical isometry Xiao et al. (2018); Yang and Schoenholz (2017); Pennington et al. (2017); Saxe et al. (2013b) has provided a new perspective for analyzing network behavior at initialization, focusing on simple networks such as deep linear networks and on gradient updates. On the other hand, it has been observed that scaling residual outputs to smaller values helps to stabilize training Hanin and Rolnick (2018); Mishkin and Matas (2015); Zhang et al. (2019a); Bachlechner et al. (2020); Goyal et al. (2017). Here, we focus our study on the Transformer architecture, identify that unbalanced gradients are not the direct cause of the Post-LN instability, recognize the amplification effect of residual dependencies, and propose a novel adaptive initialization method.

7 Conclusion

In this paper, we study the difficulty of training Transformers from theoretical and empirical perspectives. Our study in Section 3 suggests that the gradient vanishing problem is not the root cause of unstable Transformer training, and that the unbalanced gradient distribution is largely addressed by adaptive optimizers. In Section 4, we reveal that the root cause of the instability is the strong dependency on residual branches, which amplifies the fluctuation caused by parameter changes and destabilizes model training. In light of our analysis, we propose Admin, an adaptive initialization method, to stabilize Transformer training. It controls the residual dependency at the beginning of training and maintains the flexibility to capture those dependencies once training stabilizes. Extensive experiments on real-world datasets verify our intuitions and show that Admin achieves more stable training, faster convergence, and better performance.

Our work opens up new possibilities not only to further push the state of the art but also to better understand deep network training. It leads to many interesting future directions, including generalizing Theorem 4.2 to other models, designing new algorithms to automatically adapt deep networks to different training configurations, upgrading the Transformer architecture, and applying the proposed Admin to training at a larger scale.

Acknowledgments

We thank Chengyu Dong, Haoming Jiang, Jingbo Shang, Xiaotao Gu, and Zihan Wang for valuable discussions and comments; Jingbo Shang for sharing GPU machines; and Microsoft for setting up GPU machines.

References

Appendix A Gradients at Initialization

Here, we first reveal that Pre-LN does not suffer from gradient vanishing. Then we establish that only the Post-LN decoder suffers from gradient vanishing, but not the Post-LN encoder. For simplicity, we use δ_x to denote gradients, i.e., δ_x = ∂L/∂x, where L is the training objective. Following previous studies Bengio et al. (1994); Glorot and Bengio (2010); He et al. (2015); Saxe et al. (2013a), we analyze the gradient distribution at the very beginning of training and assume that the randomly initialized parameters and the partial derivatives with regard to module inputs are independent.

A.1 Pre-LN Analysis

For Pre-LN encoders, we have x_{i+1} = x_i + f_i(f_LN(x_i)) and δ_{x_i} = δ_{x_{i+1}} (I + ∂f_i(f_LN(x_i))/∂x_i). At initialization, the two terms on the right are approximately independent and the second term has zero mean. Therefore we have Var[δ_{x_i}] ≥ Var[δ_{x_{i+1}}]. Similarly, the same relation holds for every sub-layer, thus Var[δ_{x_i}] ≥ Var[δ_{x_j}] for i < j. Applying the same analysis to Pre-LN decoders, we get the same result. Thus, lower layers have larger gradients than higher layers, and gradients do not vanish in the back-propagation. In summary (Remark A.1): for Pre-LN, if δ_{x_{i+1}} and the derivatives of the modules in the i-th sub-layer are independent, then Var[δ_{x_i}] ≥ Var[δ_{x_{i+1}}].

A.2 Post-LN Encoder Analysis

Different from Pre-LN, δ_{x_i} and δ_{x_{i+1}} in Post-LN are associated with not only the residual connection but also the layer normalization, which makes it harder to establish a connection between their gradients. After making assumptions on the model initialization, we find that lower layers in the Post-LN encoder also have larger gradients than higher layers, and gradients do not vanish in the back-propagation through the encoder.

Theorem A.2. For Post-LN encoders, if γ and ν in the Layer Norm are initialized as 1 and 0, respectively; all other parameters are initialized by symmetric distributions with zero mean; x_i and δ_{x_i} are subject to symmetric distributions with zero mean; the variance of x_i is 1 (i.e., normalized by Layer Norm); and the derivatives of the modules in the i-th sub-layer are independent, then Var[δ_{x_i}] ≥ Var[δ_{x_{i+1}}].

Proof.

We first prove that the backpropagation through FFN sub-layers does not suffer from gradient vanishing. In Post-LN encoders, the output of an FFN sub-layer is calculated as x_i = f_LN(y_i), where y_i = x_{i-1} + f_FFN(x_{i-1}). Since, at initialization, W^(1) and W^(2) are independently randomized by symmetric distributions, we have E[y_i] = 0 and Var[y_i] = Var[x_{i-1}] + Var[f_FFN(x_{i-1})].

Referring to the hidden dimension as d, He et al. (2015) establishes that

Since, in Post-LN, x_{i-1} is the output of a layer norm, we have Var[x_{i-1}] = 1. Thus,

(1)

Assuming different terms are also independent in the backpropagation, we have

At initialization, He et al. (2015) establishes that

Therefore, we have

(2)

Combining Equation 1 with Equation 2, we have

(3)

which shows the backpropagation through FFN sublayers does not suffer from gradient vanishing.

Now we proceed to prove that the backpropagation through Self-Attention sub-layers does not suffer from gradient vanishing. In Post-LN encoders, the output of a Self-Attention sub-layer is calculated as x_i = f_LN(y_i), where y_i = x_{i-1} + f_ATT(x_{i-1}, x_{i-1}, x_{i-1}). At initialization, since W_h^Q, W_h^K, W_h^V, and W_h^O are independently randomized by symmetric distributions, we have E[y_i] = 0, and thus

Similar to He et al. (2015), we have

Since x_{i-1} is the output of a layer norm, we have Var[x_{i-1}] = 1. Thus,

(4)

In the backpropagation, we have

At initialization, we assume that the backpropagated gradients and the model parameters are independent He et al. (2015); thus

Therefore, we have

(5)

Integrating Equation 4 with Equation 5, we have

(6)

Combining Equation 3 and Equation 6 yields the result. ∎

A.3 Post-LN Decoder Analysis

In Post-LN, the Encoder-Attention sub-layer suffers from gradient vanishing. The Encoder-Attention sub-layer calculates its output as x_i = f_LN(y_i), where y_i = x_{i-1} + f_ATT(x_{i-1}, x_enc, x_enc). Here x_enc is the encoder output and the attention uses the row-wise softmax function. In the backpropagation, all paths from the residual branch f_ATT(x_{i-1}, x_enc, x_enc) back to its query input x_{i-1} go through the softmax function, whose derivative damps the gradient. Thus, those backpropagations suffer from gradient vanishing.

A.4 Unbalanced Gradients

Figure 11: Relative norm of gradients (evaluated at checkpoint W_t, where W_t is the checkpoint of the t-th epoch) and of updates (|W_t − W_{t−1}|) for Self-Attention parameters in 12-layer Pre-LN.

As in Figure 3 and Figure 11, even for Pre-LN, the gradient distributions of Attention modules are unbalanced. Specifically, parameters within the softmax function (i.e., W^Q and W^K) suffer from gradient vanishing and have smaller gradients than other parameters.

With further analysis, we find that it is hard to neutralize the gradient vanishing of the softmax. Different from conventional non-linear functions like ReLU or sigmoid, the softmax has a dynamic input length (i.e., for sentences of different lengths, the inputs of the softmax have different dimensions). Although this setting allows Attention modules to handle sequential inputs, it restricts them from having stable and consistent backpropagation. Specifically, consider the comparison between softmax and sigmoid. For the sigmoid function, although its derivative is smaller than 1, this damping effect is consistent for all inputs; thus, it can be neutralized by a larger initialization Glorot and Bengio (2010). For the softmax, the damping effect differs across inputs and therefore cannot be neutralized by a static initialization.
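A small numerical check of this argument (our illustration, not an experiment from the paper): backpropagating a random upstream gradient through a sigmoid keeps roughly the same fraction of its norm regardless of the input length, whereas the fraction surviving a softmax shrinks as the input grows.

```python
import torch

def grad_pass_ratio(fn, n, trials=100):
    """||J^T g|| / ||g||: how much of an upstream gradient survives backprop through fn."""
    ratios = []
    for _ in range(trials):
        x = torch.randn(n, requires_grad=True)
        g = torch.randn(n)
        fn(x).backward(g)
        ratios.append((x.grad.norm() / g.norm()).item())
    return sum(ratios) / len(ratios)

for n in (16, 64, 256):
    sig = grad_pass_ratio(torch.sigmoid, n)
    soft = grad_pass_ratio(lambda t: torch.softmax(t, dim=-1), n)
    print(f"len={n:4d}  sigmoid ratio={sig:.3f}  softmax ratio={soft:.5f}")
# The sigmoid ratio stays roughly constant with length, so a larger weight
# initialization can compensate for it; the softmax ratio shrinks as the
# input gets longer, so no single static initialization can neutralize it.
```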

Also, we observe that this issue is largely addressed by adaptive optimizers. Specifically, we calculate the norm of parameter changes in consecutive epochs (e.g., |W_t − W_{t−1}|, where W_t is the checkpoint saved after t epochs) and also visualize the relative norm (scaled by the largest value in the same network) in Figure 11. Comparing the relative norms of parameter gradients and parameter updates, we notice that, although the gradient distribution is unbalanced, adaptive optimizers successfully assign different learning rates to different parameters and lead to consistent update magnitudes. This result explains why the vanilla SGD fails to train Transformers (i.e., it lacks the ability to handle unbalanced gradient distributions). It also implies that the unbalanced gradient distribution (e.g., gradient vanishing) has been largely addressed by adaptive optimizers and may not have a big impact on training instability.

Appendix B Proof of Theorem 4.2

Here, we elaborate the derivation of Theorem 4.2, which establishes the relationship between the number of layers and the output fluctuation brought by parameter changes.

Consider an N-layer Transformer F(x, W), where x is the input and W is the parameter. If the layer dependency stays the same after a parameter change (i.e., β_{i,j} has the same value after changing W to W*, where W is randomly initialized and δ = W* − W is independent of W), the output change (i.e., ||F(x, W*) − F(x, W)||^2) can be estimated as ||F(x, W*) − F(x, W)||^2 ≈ C Σ_i β_{i,i}^2, where C is a constant.

Proof.

We refer to the module in the i-th sub-layer as f_i, to â_i as its normalized residual output, and to x̂_i as its normalized sub-layer output. The final output is marked as F(x, W). To simplify the notation, we use the superscript * to indicate variables related to W*, e.g., â*_i and x̂*_i.

At initialization, all parameters are initialized independently. Thus , and are independent and . Also, since -layer and -layer share the residual connection to previous layers, we have . Thus and

(7)

Now, we proceed to analyze . Specifically, we have