DeepNet: Scaling Transformers to 1,000 Layers

03/01/2022
by Hongyu Wang, et al.
Microsoft

In this paper, we propose a simple yet effective method to stabilize extremely deep Transformers. Specifically, we introduce a new normalization function (DeepNorm) to modify the residual connection in Transformer, accompanied by a theoretically derived initialization. In-depth theoretical analysis shows that model updates can be bounded in a stable way. The proposed method combines the best of two worlds, i.e., good performance of Post-LN and stable training of Pre-LN, making DeepNorm a preferred alternative. We successfully scale Transformers up to 1,000 layers (i.e., 2,500 attention and feed-forward network sublayers) without difficulty, which is one order of magnitude deeper than previous deep Transformers. Remarkably, on a multilingual benchmark with 7,482 translation directions, our 200-layer model with 3.2B parameters significantly outperforms the 48-layer state-of-the-art model with 12B parameters by 5 BLEU points, which indicates a promising scaling direction.


1 Introduction

Recent years have witnessed a trend towards large-scale Transformer (transformer) models. The capacity has substantially increased from millions of parameters (JacobDevlin2018BERTPO; xlmr) to billions (gpt-2; gpt3; gpipe; t5; gshard; gopher; xglm; mt-nlg), and even trillions (glam). Large-scale models yield state-of-the-art performance on a wide range of tasks, and show impressive abilities in few-shot and zero-shot learning. Despite an enormous number of parameters, their depths (as shown in Figure 1) are limited by the training instability of Transformers.

ToanQNguyen2019TransformersWT find that pre-norm residual connections (Pre-LN) improve the stability of Transformers compared with post-norm connections (Post-LN). However, the gradients of Pre-LN at bottom layers tend to be larger than at top layers (Normformer2021), leading to a degradation in performance compared with Post-LN. To alleviate this issue, there have been efforts to improve the optimization of deep Transformers by means of better initialization (BiaoZhang2019ImprovingDT; HongyiZhang2019FixupIR; XiaoShiHuang2020ImprovingTO) or better architectures (Wang2019DLCL; LiyuanLiu2020UnderstandingTD; rezero2020; Normformer2021). These approaches can stabilize a Transformer model with up to hundreds of layers. Yet, none of the previous methods has been successfully scaled to 1,000 layers.

Our aim is to improve the training stability of Transformers and scale the model depth by orders of magnitude. To this end, we study the cause of unstable optimization, finding that the exploding model update is responsible for the instability. Motivated by this observation, we introduce a new normalization function (DeepNorm) at residual connections (resnet), which has a theoretical justification for bounding the model update by a constant. The proposed method is simple yet effective, requiring only a few lines of code change. The approach improves the stability of Transformers so that we are able to scale model depth to more than 1,000 layers. Moreover, experimental results show that DeepNorm combines the best of two worlds, i.e., the good performance of Post-LN and the stable training of Pre-LN. The proposed method can be a preferred alternative for Transformers, not only for extremely deep (such as >1,000 layers) models, but also for existing large models. Notably, our 200-layer model with 3.2B parameters achieves a 5 BLEU improvement on a massively multilingual machine translation benchmark over the 48-layer state-of-the-art model (m2m100) with 12B parameters.

2 TL;DR for Practitioners

  def deepnorm(x):
      return LayerNorm(x * alpha + f(x))

  def deepnorm_init(w):
      if w in ['ffn', 'v_proj', 'out_proj']:
          nn.init.xavier_normal_(w, gain=beta)
      elif w in ['q_proj', 'k_proj']:
          nn.init.xavier_normal_(w, gain=1)

  Architectures | Encoder α | Encoder β | Decoder α | Decoder β
  Encoder-only (e.g., BERT) | (2N)^(1/4) | (8N)^(-1/4) | - | -
  Decoder-only (e.g., GPT) | - | - | (2M)^(1/4) | (8M)^(-1/4)
  Encoder-decoder (e.g., NMT, T5) | 0.81(N^4·M)^(1/16) | 0.87(N^4·M)^(-1/16) | (3M)^(1/4) | (12M)^(-1/4)

Figure 2: (a) Pseudocode for DeepNorm. We take Xavier initialization (xavier) as an example, and it can be replaced with other standard initialization. Notice that α is a constant. (b) Parameters α and β of DeepNorm for different architectures (N-layer encoder, M-layer decoder).

As shown in Figure 2, it is simple to implement our method based on Transformers with Post-LN. Compared to Post-LN, DeepNorm up-scales the residual connection before performing layer normalization. Besides, we down-scale the parameters during initialization. Notably, we only scale the weights of feed-forward networks, as well as the value projection and the output projection of attention layers. Moreover, the scales of residual connection and initialization are dependent on the architecture (Figure 2). We provide more details in Section 4.3.
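For concreteness, the following is a minimal PyTorch sketch of how the pseudocode in Figure 2 could be wired into a model; the class and function names (DeepNormSublayer, deepnorm_init_) are our own illustration rather than an official implementation.

import torch
import torch.nn as nn

class DeepNormSublayer(nn.Module):
    """Wrap one Transformer sub-layer f with DeepNorm: LayerNorm(alpha * x + f(x))."""
    def __init__(self, sublayer: nn.Module, d_model: int, alpha: float):
        super().__init__()
        self.sublayer = sublayer          # attention or feed-forward block
        self.alpha = alpha                # constant residual scale (Figure 2b)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Up-scale the residual branch by alpha, then apply Post-LN-style normalization.
        return self.norm(self.alpha * x + self.sublayer(x))

def deepnorm_init_(weight: torch.Tensor, beta: float, down_scale: bool) -> None:
    """Xavier init; gain=beta only for ffn / v_proj / out_proj weights, gain=1 otherwise."""
    nn.init.xavier_normal_(weight, gain=beta if down_scale else 1.0)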

3 Instability of Deep Transformer

We study the causes of the instability of deep Transformers. Our analysis begins with the observation that better initialization methods stabilize the training of Transformers, which has also been verified by previous work (BiaoZhang2019ImprovingDT; XiaoShiHuang2020ImprovingTO; PengXu2021OptimizingDT). Therefore, we study the training process of Post-LN with and without proper initialization. With better initialization, we down-scale the weights of the l-th layer by a depth-dependent factor after performing Xavier initialization; for example, the output projection of the FFN in the l-th layer is initialized with a variance reduced by this factor relative to standard Xavier initialization (which normalizes by the average of the input and output dimensions). We name this model Post-LN-init. Notice that, different from the prior work (BiaoZhang2019ImprovingDT), we narrow the scale of the lower layers instead of the higher layers. We believe that it helps to separate the effect of the gradient scale from the model update. Besides, Post-LN-init has the same architecture as Post-LN, which eliminates the impact of the architecture.
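As a rough sketch of the Post-LN-init recipe described above (the exact per-layer factor is not reproduced in this text, so the schedule below is only an illustrative assumption; the helper name is ours as well):

import math
import torch.nn as nn

def post_ln_init_(layers, num_layers: int) -> None:
    """Xavier-initialize each layer, then shrink lower layers more (illustrative schedule)."""
    for l, layer in enumerate(layers, start=1):
        # Assumed schedule: layer 1 (bottom) gets the smallest scale, layer N is left largest.
        scale = 1.0 / math.sqrt(num_layers - l + 1)
        for param in layer.parameters():
            if param.dim() > 1:                 # weight matrices only, not biases or LN gains
                nn.init.xavier_normal_(param)
                param.data.mul_(scale)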

Figure 3: (a) Gradient norm in the top layers of 18L-18L models. (b) Gradient norm in the last layer of the models with depths varying from 6L-6L to 24L-24L. (c) Validation loss curves of 18L-18L models.
Figure 4: Visualization of the model update, the average input of LNs, and the gradients for the 18L-18L models at the early stage of training. (a) Accumulated model update. (b) Input from FFN to LN. (c) Input from attention to LN. (d) Gradient norm in all decoder layers.

We train 18L-18L Post-LN and 18L-18L Post-LN-init on the IWSLT-14 De-En machine translation dataset. Figure 3 visualizes their gradients and validation loss curves. As shown in Figure 3(c), Post-LN-init converged while Post-LN did not. Post-LN-init has an even larger gradient norm in the last several layers, although its weights have been scaled down. Furthermore, we visualize the gradient norm of the last decoder layer with the model depth varying from 6L-6L to 24L-24L. Figure 3(b) shows that the gradient norm of Post-LN-init in the last layer is still much larger than that of Post-LN, regardless of model depth. This indicates that exploding gradients in deep layers are not the root cause of the instability of Post-LN, whereas the scale of the model update tends to account for it.

Then we demonstrate that the instability of Post-LN comes from a chain of several issues, including gradient vanishing as well as too large model updates. As shown in Figure 4(a), we first visualize the norm of the accumulated model update ||F(x; θ_i) − F(x; θ_0)|| at the early stage of training,

where x and θ_i denote the input and the model parameters after the i-th update, respectively. Post-LN has an exploding update at the very beginning of training, and then almost no update shortly afterwards. It indicates that the model has been stuck in a spurious local optimum. Both warm-up and better initialization help alleviate this issue, enabling the model to update smoothly. When the update explodes, the inputs to LN become large (see Figure 4(b) and Figure 4(c)). According to the theoretical analysis from RuibinXiong2020OnLN, the magnitude of the gradient through LN is inversely proportional to the magnitude of its input, i.e., ||∂LN(x)/∂x|| = O(√d / ||x||).

Figure 4(b) and Figure 4(c) show that ||x|| is significantly larger than √d without warm-up or proper initialization. This explains the gradient vanishing problem that occurred in the training of Post-LN (see Figure 4(d)).

Above all, the instability starts from the large model update at the beginning of training. It renders the model trapped in a bad local optimum, which in turn increases the magnitude of the inputs to each LN. As training continues, the gradient through LN becomes increasingly small, resulting in severe gradient vanishing. The vanishing gradients make it difficult to escape from the local optimum and further destabilize the optimization. On the contrary, Post-LN-init has relatively small updates, and the inputs to LN are stable. This alleviates the gradient vanishing problem, making optimization more stable.
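To make the diagnostic above concrete, here is one way the accumulated model update could be tracked on a fixed probe batch (a sketch under the assumption that the update is measured against the initial parameters; the helper names are ours):

import copy
import torch

@torch.no_grad()
def accumulated_model_update(model, init_model, probe_batch):
    """||F(x; theta_i) - F(x; theta_0)|| on a fixed batch x."""
    return (model(probe_batch) - init_model(probe_batch)).norm().item()

# Usage sketch: snapshot the model right after initialization, then probe during training.
# init_model = copy.deepcopy(model).eval()
# ...
# update_norm = accumulated_model_update(model.eval(), init_model, probe_batch)
# model.train()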

4 DeepNet: Extremely Deep Transformers

In this section, we introduce our extremely deep Transformer named DeepNet. It can stabilize the optimization by mitigating the exploding model update problem. We first provide the estimation of the expected magnitude of DeepNet's model update. Then we provide the theoretical analysis to show that its updates can be bounded by a constant with our proposed DeepNorm.

4.1 Architecture

DeepNet is based on the Transformer architecture. Compared to the vanilla Transformer, it uses our new DeepNorm, instead of Post-LN, for each sub-layer. The formulation of DeepNorm can be written as:

x_{l+1} = LN(α · x_l + G_l(x_l; θ_l))

where α is a constant, and G_l(·; θ_l) is the function of the l-th Transformer sub-layer (i.e., attention or feed-forward network) with parameters θ_l. Besides, DeepNet scales the weights inside residual branches by β. Notably, both α and β are constants that only depend on the architecture, and we provide the derivation in Section 4.3.

4.2 Expected Magnitude of Model Update

Attention is an important part of Transformer. Without loss of generality, we study the 1-head case. Let Q, K, V denote the query, key, and value, respectively. W^Q, W^K, W^V are the input projection matrices, and W^O is the output projection matrix. Then, the attention module can be formulated as:

Attn(Q, K, V) = softmax(QW^Q (KW^K)^T / √d) V W^V W^O

We study the magnitude of the attention module. Lemma 4.1 proves that W^Q and W^K do not change the bound of the attention output's magnitude.

Lemma 4.1.

Given , where , and for all , it satisfies that

where stands for equal bound of magnitude.

In other words, the magnitude of the attention output only depends on the value and output projections W^V and W^O. In this work, we only consider the magnitude of the model update, so it is sufficiently instructive to study the case where the hidden dimension equals 1. For simplicity, we reduce the projection matrices to the scalars v and w, so that the attention output is bounded in magnitude by vw times its input. Similarly, the FFN output obeys the same bound, where v, w denote the parameters of the feed-forward network.
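As a quick numerical illustration of Lemma 4.1 (ours, not from the paper), the NumPy snippet below shows that scaling the query/key projections by a large factor leaves the attention output's magnitude bounded, because the softmax weights always form a convex combination, while scaling the value projection scales the output proportionally:

import numpy as np

rng = np.random.default_rng(0)
n, d = 16, 64
X = rng.standard_normal((n, d))
Wq, Wk, Wv, Wo = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4))

def attn(X, Wq, Wk, Wv, Wo):
    scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(d)
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)        # each row sums to 1
    return probs @ (X @ Wv) @ Wo

base = np.linalg.norm(attn(X, Wq, Wk, Wv, Wo))
qk_big = np.linalg.norm(attn(X, 100 * Wq, 100 * Wk, Wv, Wo))  # much larger logits, output does not blow up
v_big = np.linalg.norm(attn(X, Wq, Wk, 100 * Wv, Wo))         # output norm grows ~100x
print(base, qk_big, v_big)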

We define the model update as ΔF = F(x; θ*) − F(x; θ), where θ* denotes the parameters after the update. Based on the analysis above, we have the following theorem to characterize the magnitude of ΔF for an N-layer DeepNet with N attention sub-layers and N FFN sub-layers.

Theorem 4.2.

Given an N-layer DeepNet F(x; θ), where θ_{2l-1} and θ_{2l} denote the parameters of the self-attention and FFN in the l-th layer, and each sub-layer is normalized with DeepNorm, x_{l+1} = LN(α x_l + G_l(x_l; θ_l)), the model update ΔF satisfies:

Vanilla Post-LN can be regarded as a special case of DeepNet, where α = 1 and the scales v, w equal 1 at Xavier initialization (xavier). Based on Theorem 4.2, the magnitude of ΔF for vanilla Post-LN grows with the number of sub-layers: the model tends to accumulate the update of each sub-layer, which leads to an exploding magnitude of the model update and destabilizes the optimization at the early stage. This explains our findings in Section 3.

Besides, Theorem 4.2 also explains why warm-ups and smaller initialization can stabilize the training of Post-LN. Warm-ups reduce the magnitude of the model update by decreasing the size of the parameter updates, while smaller initialization lowers the scales v and w.

Furthermore, we study the magnitude of DeepNet with an N-layer encoder and an M-layer decoder. Let F_ed(x, y; θ_e, θ_d) denote the model, where x and y are the inputs of the encoder and the decoder. θ_e and θ_d follow the same definition as in Theorem 4.2 and stand for the parameters of the self-attentions, cross-attentions, and FFNs. We use the subscripts e and d to distinguish the notations of the encoder and the decoder. The following theorem shows the expected magnitude of the encoder-decoder's model update ΔF_ed.

Theorem 4.3.

Given an encoder-decoder DeepNet F_ed with N encoder layers and M decoder layers, where each encoder sub-layer is normalized as LN(α_e x + G_e(x; θ_e)), and each decoder sub-layer is normalized as LN(α_d x + G_d(x; θ_d)), the model update ΔF_ed satisfies:

(1)

For the vanilla encoder-decoder model, all of the residual and initialization scales equal 1, which indicates a similar accumulative effect and leads to fast growth of the magnitude with respect to the model depth (see Figure 5). Furthermore, the cross-attention propagates the magnitude from the encoder to the decoder, which explains why the decoder is more unstable than the encoder (LiyuanLiu2020UnderstandingTD).

Figure 5: Model updates of vanilla Post-LN and DeepNet at the early stage of training. The visualization is conducted on 64-128-2 tiny Transformers with depth varying from 6L-6L to 100L-100L. It shows that DeepNet has much smaller and more stable updates than Post-LN.

4.3 Derivation for DeepNorm and the Initialization

We show that the expected model updates for DeepNet can be bounded by a constant with proper parameters α and β. Our analysis is based on the SGD update, and we empirically verify that it works well for the Adam optimizer (adam). We provide the analysis on the encoder-decoder architecture, which can be naturally extended to encoder-only and decoder-only models in the same way. Analogous to HongyiZhang2019FixupIR, we set our goal for the model update as follows:

GOAL: F(x; θ) is updated by Θ(η) per SGD step after initialization as η → 0. That is, ||ΔF|| = Θ(η), where ΔF = F(x; θ − η(∂L/∂θ)) − F(x; θ) and η is the learning rate.

For the SGD optimizer, the update of each decoder layer equals the learning rate times its gradient. RuibinXiong2020OnLN proved that Post-LN decreases the magnitude of the backpropagating error signal, so the gradient of each sub-layer is bounded in magnitude. Together with an assumption on the initialization scales, the second term of Theorem 4.3 can be bounded as:

(2)

There are multiple schemes to bound Equation 2 by Θ(η). In order to balance the effect of the residual connections and the initialization, and using symmetry, we set α_d = (3M)^(1/4) and β_d = (12M)^(-1/4). Similarly, we use α_e = 0.81(N^4·M)^(1/16) and β_e = 0.87(N^4·M)^(-1/16) to bound the first term in Theorem 4.3. Detailed derivation is shown in Appendix B.

In comparison with Post-LN, we visualize the model updates for DeepNet on the IWSLT-14 De-En translation dataset at the early stage of training. Figure 5 shows that the model update of DeepNet is nearly constant, while the model update of Post-LN explodes.

In summary, we apply our approach as follows:

Encoder-decoder architecture

  1. Apply standard initialization (e.g., Xavier initialization) for each encoder and decoder layer.

  2. For encoder layers, scale the weights of feed-forward networks as well as the value projection and the output projection of attention layers by β_e = 0.87(N^4·M)^(-1/16), and set the weight of residual connections as α_e = 0.81(N^4·M)^(1/16).

  3. For decoder layers, scale the weights of feed-forward networks as well as the value projection and the output projection of attention layers by β_d = (12M)^(-1/4), and set the weight of residual connections as α_d = (3M)^(1/4) (a helper computing these constants is sketched below).
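A small helper (our own naming) that evaluates the encoder-decoder constants listed above for a given configuration:

def deepnorm_encoder_decoder_constants(N: int, M: int):
    """Return (alpha_e, beta_e, alpha_d, beta_d) for an N-layer encoder and M-layer decoder."""
    alpha_e = 0.81 * (N ** 4 * M) ** (1 / 16)
    beta_e = 0.87 * (N ** 4 * M) ** (-1 / 16)
    alpha_d = (3 * M) ** 0.25
    beta_d = (12 * M) ** -0.25
    return alpha_e, beta_e, alpha_d, beta_d

# Example: the 18L-18L base models used in Section 5.
print(deepnorm_encoder_decoder_constants(N=18, M=18))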

The derivation of encoder-only (such as BERT) and decoder-only (such as GPT) architectures can be conducted in the same way (see Appendix C). We summarize the steps as follows:

Encoder-only (or decoder-only) architecture

  1. Apply standard initialization (e.g., Xavier initialization) for each layer.

  2. For each layer, scale the weights of feed-forward networks as well as the value projection and the output projection of attention layers by β = (8N)^(-1/4) (or (8M)^(-1/4)), and set the weight of residual connections as α = (2N)^(1/4) (or (2M)^(1/4)); see the usage sketch below.
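For the single-stack case, the recipe reduces to two constants; the sketch below (our own helper name) applies them with standard PyTorch initialization, mirroring the Figure 2 pseudocode:

import torch.nn as nn

def deepnorm_single_stack_constants(num_layers: int):
    """Encoder-only or decoder-only: alpha = (2L)^(1/4), beta = (8L)^(-1/4)."""
    return (2 * num_layers) ** 0.25, (8 * num_layers) ** -0.25

alpha, beta = deepnorm_single_stack_constants(num_layers=24)
ffn_out = nn.Linear(1024, 1024)   # down-scaled: ffn, v_proj, out_proj
q_proj = nn.Linear(1024, 1024)    # unchanged: q_proj, k_proj
nn.init.xavier_normal_(ffn_out.weight, gain=beta)
nn.init.xavier_normal_(q_proj.weight, gain=1.0)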

5 Neural Machine Translation

Models | LN | 6L-6L | 18L-18L | 50L-50L | 100L-100L
Vanilla Post-LN (transformer) | Post | 28.1 | diverged | diverged | diverged
DS-Init (BiaoZhang2019ImprovingDT) | Post | 27.9 | diverged | diverged | diverged
Admin (LiyuanLiu2020UnderstandingTD) | Post | 27.9 | 28.8 | diverged | diverged
ReZero (rezero2020) | No | 26.9 | diverged | diverged | diverged
R-Fixup (HongyiZhang2019FixupIR) | No | 27.5 | 28.4 | 27.7 | diverged
T-Fixup (XiaoShiHuang2020ImprovingTO) | No | 27.5 | 28.4 | 27.9 | diverged
Vanilla Pre-LN (transformer) | Pre | 27.0 | 28.1 | 28.0 | 27.4
DLCL (Wang2019DLCL) | Pre | 27.4 | 28.2 | diverged | 27.5
NormFormer (Normformer2021) | Pre | 27.0 | 28.3 | 27.8 | diverged
DeepNet (ours) | Deep | 27.8 | 28.8 | 29.0 | 28.9
Table 1: BLEU scores on the WMT-17 En-De test set for different models with varying depth. NL-NL refers to an N-layer encoder and an N-layer decoder.

We verify the effectiveness of DeepNet on popular machine translation benchmarks, including the IWSLT-14 German-English (De-En) dataset and the WMT-17 English-German (En-De) dataset. We compare our method with multiple state-of-the-art deep Transformer models, including DLCL (Wang2019DLCL), NormFormer (Normformer2021), ReZero (rezero2020), R-Fixup (HongyiZhang2019FixupIR), T-Fixup (XiaoShiHuang2020ImprovingTO), DS-Init (BiaoZhang2019ImprovingDT), and Admin (LiyuanLiu2020UnderstandingTD). We reproduce the baselines with their open-source code and set the hyper-parameters the same for a fair comparison. We use BLEU as the evaluation metric for all experiments.

Table 1 reports the results of the baselines and DeepNet on the WMT-17 En-De translation dataset. According to the position of layer normalization (LN), the baselines are grouped into three categories: Pre-LN, Post-LN, and No-LN. All the compared models are base-size with different depths.

Compared with the models with Post-LN, DeepNet is more stable, and can successfully scale to 100L-100L, reaching 28.9 BLEU on the test set. In contrast, the baselines with Post-LN lead to unstable optimization when the depth goes to 50L-50L. Besides, DeepNet achieves comparable performance with these baselines when the models are shallow.

In addition, we compare DeepNet with the methods without LN. Both R-Fixup and T-Fixup introduce better initialization methods, which stabilize the training of No-LN Transformers with up to 50L-50L layers. Yet, their performance is not as good as those with Post-LN. Besides, half precision could destabilize the training of ReZero, leading to its divergence with 18L-18L layers. This observation is also reported by LiyuanLiu2020UnderstandingTD. Moreover, their deeper models (50L-50L) do not outperform the shallower models (18L-18L). In comparison, DeepNet achieves better translation accuracy than these methods, and scaling to deeper models does not hurt its performance.

Compared with the Post-LN baselines, the models with Pre-LN are more stable. Both vanilla Pre-LN and DLCL can be scaled to 100L-100L, and the 50L-50L NormFormer is also trained successfully. Nevertheless, Pre-LN leads to a 0.5-1.0 BLEU drop compared with the converged Post-LN models. We presume this is caused by the problem that the gradients of Pre-LN at earlier layers tend to be larger than the gradients at later layers (Normformer2021). We leave it as future work. In contrast, DeepNet alleviates the problem by building on Post-LN, and outperforms all the Pre-LN baselines.

Convergence with varying depth.

We vary the depths of the models from 10L-10L to 100L-100L with an interval of 10 layers. All experiments are conducted with mixed-precision training, except ReZero (according to our experiments, ReZero is unstable with half precision, even when the model is shallow). Figure 6 shows the results on the IWSLT-14 dataset. We train the models for 8,000 steps because we find that most divergence occurs at the beginning of optimization. Overall, DeepNet is stable from shallow to deep. It converges fast, achieving over 30 BLEU in only 8,000 steps, while most of the baselines do not. Moreover, the performance keeps improving as the model goes deeper.

Large learning rate, batch size, and hidden dimension.

We further scale DeepNet to a larger learning rate, batch size, and hidden dimension, respectively. For each experiment, we only change one hyperparameter with the others fixed.

Figure 7 reports the loss curves on the WMT-17 validation set. It shows that DeepNet can be trained without difficulty in all the largest settings. The loss of DeepNet with 1024 hidden size increases after 10K steps because of overfitting. Besides, it indicates that DeepNet can benefit from the larger settings, resulting in faster convergence and lower validation loss.

Figure 6: BLEU scores on the IWSLT-14 De-En test set for different deep models with varying depth from 10L-10L to 100L-100L.
Figure 7: WMT-17 En-De validation loss curves for 18L-18L DeepNet with varying learning rate, batch size, and hidden dimension.

6 Massively Multilingual Neural Machine Translation

Figure 8: Average BLEU scores for DeepNet with varying depth on the OPUS-100 En-X and X-En test sets.
Models | # Layers | # Params | X→En | En→X | Avg
Baseline (opus100) | 12 | 133M | 27.5 | 21.4 | 24.5
Baseline (opus100) | 24 | 173M | 29.5 | 22.9 | 26.2
Baseline (opus100) | 48 | 254M | 31.4 | 24.0 | 27.7
DeepNet (ours) | 200 | 863M | 33.2 | 29.0 | 31.1
DeepNet (ours) | 1000 | 3.8B | 33.9 | 30.2 | 32.1
Table 2: Average BLEU for DeepNet and the baseline on the OPUS-100 test sets.
Models | # Layers | # Params | WMT | OPUS | TED | Flores
M2M-100 (m2m100) | 48 | 12B | 31.9 | 18.4 | 18.7 | 13.6
DeepNet (ours) | 200 | 3.2B | 33.9 | 23.0 | 20.1 | 18.6
Table 3: BLEU scores for DeepNet and M2M-100 on various evaluation sets.

We conduct experiments on large-scale multilingual machine translation, which is a good testbed for large models. We first use the OPUS-100 corpus (opus100) to evaluate our model. OPUS-100 is an English-centric multilingual corpus covering 100 languages, which is randomly sampled from the OPUS collection. We scale DeepNet up to 1,000 layers. The model has a 500-layer encoder, a 500-layer decoder, a hidden size of 512, 8 attention heads, and 2,048-dimensional feed-forward layers. More details can be found in the Appendix.
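As a back-of-the-envelope check (our own calculation, not reported in the text), plugging this 500-layer encoder / 500-layer decoder configuration into the encoder-decoder formulas from Figure 2(b) gives the residual and initialization scales:

# DeepNorm constants for the 500L-500L configuration (encoder-decoder formulas, Figure 2b).
N = M = 500
alpha_e = 0.81 * (N ** 4 * M) ** (1 / 16)    # roughly 5.6
beta_e = 0.87 * (N ** 4 * M) ** (-1 / 16)    # roughly 0.12
alpha_d = (3 * M) ** 0.25                    # roughly 6.2
beta_d = (12 * M) ** -0.25                   # roughly 0.11
print(alpha_e, beta_e, alpha_d, beta_d)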

Table 2 summarizes the results of DeepNet and the baselines. It shows that increasing the depth can significantly improve the translation quality of NMT: the baseline of 48 layers achieves a gain of 3.2 points on average over the 12-layer model. DeepNet can successfully scale the depth up to 1,000 layers, outperforming the baseline by 4.4 BLEU. It is noted that DeepNet is only trained for 4 epochs, and the performance can be further improved given a larger computation budget.

Scaling law in terms of depth.

We train DeepNet with {12, 20, 100, 200, 1000} layers on the OPUS-100 dataset. Figure 8 illustrates the scaling curve. Compared with bilingual NMT, multilingual NMT benefits more from scaling the depth of the model because of its hunger for model capacity. We observe logarithmic growth of the BLEU score for multilingual NMT, and the scaling law can be written as:

L(d) = A log(d) + B

where d is the depth, and A, B are constants determined by the other hyper-parameters.
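To illustrate fitting this form, the snippet below regresses BLEU on log depth, using the average scores from Table 2 as example points (the first three rows there are the baseline models, so this only illustrates the procedure, not the curve in Figure 8):

import numpy as np

depths = np.array([12, 24, 48, 200, 1000], dtype=float)
bleu = np.array([24.5, 26.2, 27.7, 31.1, 32.1])      # "Avg" column of Table 2
A, B = np.polyfit(np.log(depths), bleu, deg=1)        # least-squares fit of A*log(d) + B
print(f"BLEU(d) ≈ {A:.2f} * log(d) + {B:.2f}")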

More data and language directions.

To explore the limits of DeepNet on multilingual NMT, we then scale up the training data by using CCMatrix (ccmatrix). We also expand the data from CCAligned (ccaligned), OPUS (opus100), and Tatoeba (https://tatoeba.org/en/) to cover all languages of the Flores101 evaluation sets. The final data consists of 102 languages, 1,932 directions, and 12B sentence pairs. With this data, we train DeepNet with a 100-layer encoder, a 100-layer decoder, a hidden dimension of 1,024, 16 heads, and a 4,096 intermediate dimension of the feed-forward layers. More details can be found in the Appendix.

We compare DeepNet with the state-of-the-art multilingual NMT model M2M-100 (m2m100). M2M-100 has a 24-layer encoder, a 24-layer decoder, and a hidden size of 4,096, resulting in up to 12B parameters. Compared with M2M-100, DeepNet is deep and narrow, with only 3.2B parameters. For a fair comparison, we generate translations with a beam size of 5 and a length penalty of 1.

Following M2M-100 (m2m100), we evaluate the models on several multilingual translation evaluation datasets, including WMT (wmt14; wmt17; wmt18; wmt19), OPUS (opus100), TED (ted-data), and Flores (flores101). The language pairs from the WMT dataset are English-centric. There are 10 languages including English, and most of them are high-resource. For the OPUS dataset, we select the non-English directions from the test set, which has 30 evaluation pairs. The TED evaluation set has 28 languages and 756 directions, and the data is from the spoken language domain. The Flores dataset has all translation pairs between 102 languages. We use a subset covering the languages supported by both M2M-100 and DeepNet, resulting in 87 languages and 7,482 translation directions.

We report the results in Table 3. For a fair comparison, we use the same evaluation methods as the baseline. The details can be found in the Appendix. It shows that DeepNet has significantly better performance than M2M-100 on all evaluation datasets, indicating that deepening the model is a very promising direction to improve the quality of NMT models.

7 Conclusion and Future Work

We improve the stability of Transformers and successfully scale them to 1,000 layers. This is achieved by our DeepNet with a novel normalization function called DeepNorm. It has a theoretical justification for stabilizing the optimization with a constant upper bound on model updates. Experimental results verify the effectiveness of our method across various benchmarks. We focus on machine translation as a test bed in the current experiments. In the future, we will extend DeepNet to support more diverse tasks, e.g., language model pre-training (unilm; unilmv2; chi-etal-2021-infoxlm; deltalm; chi2021xlme), protein structure prediction (AlphaFold2021), and BEiT vision pre-training (beit; vlmo).

Acknowledgement

We would like to acknowledge Saksham Singhal for the CCMatrix corpus.

References

Appendix A Main Theorem Proof

A.1 Proof of Lemma 4.1

Lemma A.1.

Given , where , and for all , it satisfies that

where stands for equal bound of magnitude.

Proof.

The weight of to output is , .

(3)

With , , for all , we have . Therefore, , which is equivalent to softmax. ∎

A.2 Proof of Theorem 4.2

Theorem A.2.

Given an N-layer DeepNet F(x; θ), where θ_{2l-1} and θ_{2l} denote the parameters of the self-attention and FFN in the l-th layer, and each sub-layer is normalized with DeepNorm, x_{l+1} = LN(α x_l + G_l(x_l; θ_l)), the model update ΔF satisfies:

Proof.

Our aim is to study the magnitude of model updates. Following HongyiZhang2019FixupIR, we make the following assumptions to simplify the derivations:

  1. The hidden dimension equals 1.

  2. All relevant weights are positive with magnitude less than 1, while α for DeepNorm is positive with magnitude greater than 1.

Given Assumption 1, if G_l is a feed-forward network with parameters θ_l = {v_l, w_l}, then its output magnitude is bounded by v_l w_l times that of its input. According to Lemma 4.1, the query and key projections do not change the bound of the attention output's magnitude. Therefore, if G_l is self-attention with parameters θ_l = {v_l, w_l}, the same bound holds. Especially, if Xavier initialization is used for the projections, then the output can preserve the input variance, which is equivalent to v_l w_l = 1 at initialization. With Assumption 2, we have:

(4)

With Equation 4, the magnitude of and is bounded by:

(5)

Besides, the model update satisfies:

(6)

Using Taylor expansion for Equation 6, we get:

(7)
(8)

Then, we have:

(9)

For vanilla Post-LN with standard initialization, α = 1 and v_i w_i = 1, so the bound accumulates over all 2N sub-layers.

A.3 Proof of Theorem 4.3

Theorem A.3.

Given an encoder-decoder DeepNet F_ed with N encoder layers and M decoder layers, where each encoder sub-layer is normalized as LN(α_e x + G_e(x; θ_e)), and each decoder sub-layer is normalized as LN(α_d x + G_d(x; θ_d)), the model update ΔF_ed satisfies:

(10)
Proof.

The derivation of self-attention and FFN layers is given in Section A.2. For the cross-attention layers, we have:

(11)

With Equation 11, we have the bound of the derivative of :

By means of Taylor expansion, we estimate the update of the l-th cross-attention layer as:

(12)

According to Theorem 4.2, we have . Therefore, the magnitude of satisfies:

(13)

As a special case, the corresponding parameters in Equation 13 for vanilla Post-LN with standard initialization are , so its model update .

Appendix B Derivation for Encoder-Decoder Architecture

Here, we give the derivation of DeepNet for the encoder-decoder architecture with an N-layer encoder and an M-layer decoder. As in Section 4.3, we have α_d = (3M)^(1/4) and β_d = (12M)^(-1/4) to bound the second term of Equation 13 to Θ(η). For the first term, we choose the encoder scales so that it goes to:

(14)
(15)

In this work, we use α_e = 0.81(N^4·M)^(1/16) and β_e = 0.87(N^4·M)^(-1/16) to satisfy the condition.

Appendix C Derivation for Encoder-only (Decoder-only) Architecture

For an N-layer encoder-only (or M-layer decoder-only) DeepNet, starting from Theorem 4.2 we have,

(16)

By assumption , and , we achieve:

(17)

Due to symmetry, we set the two scales equally, so the bound simplifies accordingly. In this work, we use α = (2N)^(1/4) and β = (8N)^(-1/4) to satisfy the condition.

Appendix D Experimental Details

D.1 Hyperparameters for IWSLT-14 De-En

Hyperparameters Value
Learning rate 5e-4
Learning rate scheduler inverse sqrt
Warm-up updates 4000
Warm-up init learning rate 1e-7
Max tokens 4000
Adam ε 1e-8
Adam β (0.9, 0.98)
Label smoothing 0.1
Training updates 8K
Gradient clipping 0.0
Dropout 0.4
Weight decay 0.0001
Hidden size 512
FFN inner hidden size 2048
Attention heads 8
Table 4: Hyperparameters for the machine translation experiments on the IWSLT-14 De-En dataset.

D.2 Hyperparameters for WMT-17 En-De

Hyperparameters No-LN Pre-LN Post-LN DeepNorm
Learning rate 5e-4 1.5e-3 1.5e-3 1.5e-3
Learning rate scheduler inverse sqrt
Warm-up updates 4000
Warm-up init learning rate 1e-7
Max tokens 128 × 4096
Adam ε 1e-8
Adam β (0.9, 0.98)
Label smoothing 0.1
Training updates 100K
Gradient clipping 0.0
Dropout 0.4
Weight decay 0.0001
Hidden size 512
FFN inner hidden size 2048
Attention heads 8
Table 5: Hyperparameters for the base-setting experiments on the WMT-17 En-De dataset.
Hyperparameters Base size Medium size Large size
Hidden size 512 768 1,024
FFN inner hidden size 2048 3072 4096
Attention heads 8 12 16
Layers 18-18
Learning rate 5e-4
Learning rate scheduler inverse sqrt
Warm-up updates 4000
Warm-up init learning rate 1e-7
Max tokens 128 × 4096
Adam ε 1e-6
Adam β (0.9, 0.98)
Label smoothing 0.1
Training updates 30K
Gradient clipping 1.0
Dropout 0.4
Weight decay 0.0
Table 6: Hyperparameters for the large-setting experiments on the WMT-17 En-De dataset.

D.3 Hyperparameters for OPUS-100

Hyperparameters Value
Learning rate 5e-4
Learning rate scheduler inverse sqrt
Warm-up updates 4000
Warm-up init learning rate 1e-7
Max tokens 128 × 4096
Adam ε 1e-8
Adam β (0.9, 0.98)
Label smoothing 0.1
Training epochs 4
Gradient clipping 0.0
Dropout 0.1
Weight decay 0.0
Hidden size 512
FFN inner hidden size 2048
Attention heads 8
Table 7: Hyperparameters for the machine translation experiments on the OPUS-100 dataset.

D.4 Hyperparameters for 102-Language Machine Translation

Hyperparameters Value
Learning rate 5e-4
Learning rate scheduler inverse sqrt