In this paper, we propose a simple yet effective method to stabilize extremely deep Transformers. Specifically, we introduce a new normalization function (DeepNorm) to modify the residual connection in the Transformer, accompanied by a theoretically derived initialization. In-depth theoretical analysis shows that model updates can be bounded in a stable way. The proposed method combines the best of two worlds, i.e., the good performance of Post-LN and the stable training of Pre-LN, making DeepNorm a preferred alternative. We successfully scale Transformers up to 1,000 layers (i.e., 2,500 attention and feed-forward network sub-layers) without difficulty, which is one order of magnitude deeper than previous deep Transformers. Remarkably, on a multilingual benchmark with 7,482 translation directions, our 200-layer model with 3.2B parameters significantly outperforms the 48-layer state-of-the-art model with 12B parameters by 5 BLEU points, which indicates a promising scaling direction.
Recent years have witnessed a trend towards large-scale Transformer (transformer) models. The capacity has substantially increased from millions of parameters (JacobDevlin2018BERTPO; xlmr) to billions (gpt-2; gpt3; gpipe; t5; gshard; gopher; xglm; mt-nlg), and even trillions (glam). Large-scale models yield state-of-the-art performance on a wide range of tasks, and show impressive abilities in few-shot and zero-shot learning. Despite an enormous number of parameters, their depths (as shown in Figure 1) are limited by the training instability of Transformers.
ToanQNguyen2019TransformersWT find that pre-norm residual connections (Pre-LN) improve the stability of Transformers compared with post-norm connections (Post-LN). However, the gradients of Pre-LN at the bottom layers tend to be larger than at the top layers (Normformer2021), leading to a degradation in performance compared with Post-LN. To alleviate this issue, there have been efforts to improve the optimization of deep Transformers by means of better initialization (BiaoZhang2019ImprovingDT; HongyiZhang2019FixupIR; XiaoShiHuang2020ImprovingTO) or better architectures (Wang2019DLCL; LiyuanLiu2020UnderstandingTD; rezero2020; Normformer2021). These approaches can stabilize a Transformer model with up to hundreds of layers. Yet, none of the previous methods has been successfully scaled to 1,000 layers.
Our aim is to improve the training stability of Transformers and scale the model depth by orders of magnitude. To this end, we study the cause of unstable optimization and find that the exploding model update is responsible for the instability. Motivated by this observation, we introduce a new normalization function (DeepNorm) at the residual connections (resnet), which has a theoretical justification of bounding the model update by a constant. The proposed method is simple yet effective, requiring only a few lines of code change. It improves the stability of Transformers so that we are able to scale model depth to more than 1,000 layers. Moreover, experimental results show that DeepNorm combines the best of two worlds, i.e., the good performance of Post-LN and the stable training of Pre-LN. The proposed method can be a preferred alternative for Transformers, not only for extremely deep (e.g., >1,000-layer) models, but also for existing large models. Notably, our 200-layer model with 3.2B parameters achieves a 5 BLEU improvement on a massively multilingual machine translation benchmark over the state-of-the-art 48-layer model (m2m100) with 12B parameters.
As shown in Figure 2, it is simple to implement our method on top of Transformers with Post-LN. Compared to Post-LN, DeepNorm up-scales the residual connection before performing layer normalization. Besides, we down-scale the parameters during initialization. Notably, we only scale the weights of the feed-forward networks, as well as the value projection and the output projection of the attention layers. Moreover, the scales of the residual connection and the initialization depend on the architecture (Figure 2). We provide more details in Section 4.3.
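To make Figure 2 concrete, here is a minimal PyTorch-style sketch of a DeepNorm residual connection, assuming the constants `alpha` and `beta` are computed as described in Section 4.3; the class and function names are ours, not the paper's released implementation.

```python
import torch.nn as nn

class DeepNormSublayer(nn.Module):
    """Wrap a sub-layer (attention or FFN) with the DeepNorm residual:
    x_{l+1} = LayerNorm(alpha * x_l + G_l(x_l))."""

    def __init__(self, sublayer: nn.Module, d_model: int, alpha: float):
        super().__init__()
        self.sublayer = sublayer
        self.alpha = alpha
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        # Up-scale the residual connection before layer normalization.
        return self.norm(self.alpha * x + self.sublayer(x))


def scale_init_(proj: nn.Linear, beta: float) -> None:
    """Down-scale selected weights at initialization. Apply this only to the
    FFN weights and to the value/output projections of attention; the
    query/key projections keep their standard initialization."""
    nn.init.xavier_normal_(proj.weight, gain=beta)
    if proj.bias is not None:
        nn.init.zeros_(proj.bias)
```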
We study the causes of the instability of deep Transformers. Our analysis begins with the observation that better initialization methods stabilize the training of Transformers, which has also been verified by previous work (BiaoZhang2019ImprovingDT; XiaoShiHuang2020ImprovingTO; PengXu2021OptimizingDT). Therefore, we study the training process of Post-LN with and without proper initialization. With better initialization, we down-scale the weights of the $l$-th layer by a depth-dependent factor after performing Xavier initialization; for example, the output projection of the FFN in the $l$-th layer is initialized with a correspondingly reduced variance, where the variance also accounts for the average of the input and output dimensions. We name this model Post-LN-init. Notice that, different from the prior work (BiaoZhang2019ImprovingDT), we narrow the scale of the lower layers instead of the higher layers. We believe this helps to separate the effect of the gradient scale from the model update. Besides, Post-LN-init has the same architecture as Post-LN, which eliminates the impact of the architecture.
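As an illustration only, here is a sketch of a Post-LN-init-style initialization. The exact layer-dependent factor is not recoverable from this text, so the factor `k` below is a hypothetical choice that merely narrows the lower layers more strongly than the higher ones.

```python
import math
import torch
import torch.nn as nn

def post_ln_init_(linear: nn.Linear, layer_idx: int, num_layers: int) -> None:
    """Xavier initialization followed by a depth-dependent down-scaling.
    NOTE: the scaling factor is an assumed placeholder, chosen only so that
    lower layers (small layer_idx) are shrunk more than higher layers."""
    nn.init.xavier_normal_(linear.weight)
    k = num_layers - layer_idx + 1  # largest for the bottom layer (assumption)
    with torch.no_grad():
        linear.weight.mul_(1.0 / math.sqrt(k))
```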
We train 18L-18L Post-LN and 18L-18L Post-LN-init on the IWSLT-14 De-En machine translation dataset. Figure 3 visualizes their gradients and validation loss curves. As shown in Figure 3(c), Post-LN-init converged while Post-LN did not. Post-LN-init has an even larger gradient norm in the last several layers, although its weights have been scaled down. Furthermore, we visualize the gradient norm of the last decoder layer with varying model depth from 6L-6L to 24L-24L. Figure 3 shows that the gradient norm of Post-LN-init in the last layer is still much larger than that of Post-LN, regardless of model depth. We conclude that the exploding gradients in deep layers are not the root cause of the instability of Post-LN; rather, the scale of the model update tends to account for it.
Then we demonstrate that the instability of Post-LN comes from a chain of several issues, including vanishing gradients as well as overly large model updates. As shown in Figure 4(a), we first visualize the norm of the model update at the early stage of training:

$$\|\Delta F\| = \|F(x, \theta_i) - F(x, \theta_0)\|,$$

where $x$ and $\theta_i$ denote the input and the model parameters after the $i$-th update. Post-LN has an exploding update at the very beginning of training, and then nearly no update shortly afterwards. It indicates that the model has been stuck in a spurious local optimum. Both warm-up and better initialization help alleviate this issue, enabling the model to update smoothly. When the update explodes, the inputs to LN become large (see Figure 4(b) and Figure 4(c)). According to the theoretical analysis from RuibinXiong2020OnLN, the magnitude of the gradient through LN is inversely proportional to the magnitude of its input:

$$\left\|\frac{\partial\, \mathrm{LN}(x)}{\partial x}\right\| = O\!\left(\frac{\sqrt{d}}{\|x\|}\right),$$

where $d$ is the hidden dimension. Figure 4(b) and Figure 4(c) show that $\|x\|$ is significantly larger than $\sqrt{d}$ without warm-up or proper initialization, which explains the gradient vanishing problem that occurs in the training of Post-LN (see Figure 4(d)).
Above all, the instability starts from the large model update at the beginning of training. It renders the model trapped in a bad local optimum, which in turn increases the magnitude of the inputs to each LN. As training continues, the gradient through LN becomes increasingly small, resulting in severe gradient vanishing. The vanishing gradients make it difficult to escape from the local optimum and further destabilize the optimization. On the contrary, Post-LN-init has relatively small updates, and the inputs to LN are stable, which alleviates the gradient vanishing problem and makes optimization more stable.
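A rough diagnostic sketch in the spirit of this analysis: it tracks the model-update norm $\|F(x, \theta_i) - F(x, \theta_0)\|$ and the magnitude of the inputs to each LayerNorm during the first updates. The `model(batch)` / `loss_fn(output)` interface is an assumption for illustration, not the authors' training code.

```python
import torch

def track_update_and_ln_inputs(model, batch, loss_fn, optimizer, steps=100):
    """Log ||F(x, theta_i) - F(x, theta_0)|| and the largest LayerNorm input
    norm at every step of early training."""
    ln_input_norms = []

    def hook(module, inputs, output):
        ln_input_norms.append(inputs[0].norm().item())

    handles = [m.register_forward_hook(hook)
               for m in model.modules() if isinstance(m, torch.nn.LayerNorm)]
    with torch.no_grad():
        f0 = model(batch)                      # F(x, theta_0)

    for step in range(steps):
        ln_input_norms.clear()
        loss = loss_fn(model(batch))           # assumed loss interface
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        with torch.no_grad():
            update = (model(batch) - f0).norm().item()
        print(step, update, max(ln_input_norms, default=0.0))

    for h in handles:
        h.remove()
```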
In this section, we introduce our extremely deep Transformer named DeepNet. It stabilizes the optimization by mitigating the exploding model update problem. We first estimate the expected magnitude of DeepNet's model update, and then provide a theoretical analysis showing that its updates can be bounded by a constant with our proposed DeepNorm.

DeepNet is based on the Transformer architecture. Compared to the vanilla Transformer, it uses our new DeepNorm, instead of Post-LN, for each sub-layer. The formulation of DeepNorm can be written as:

$$x_{l+1} = \mathrm{LN}(\alpha\, x_l + G_l(x_l, \theta_l)),$$

where $\alpha$ is a constant, and $G_l(x_l, \theta_l)$ is the function of the $l$-th Transformer sub-layer (i.e., attention or feed-forward network) with parameters $\theta_l$. Besides, DeepNet scales the weights inside the residual branches by $\beta$. Notably, both $\alpha$ and $\beta$ are constants that depend only on the architecture, and we provide the derivation in Section 4.3.
Attention is an important part of the Transformer. Without loss of generality, we study the 1-head case. Let $Q$, $K$, $V$ denote the query, key, and value, respectively, $W^Q$, $W^K$, $W^V$ the input projection matrices, and $W^O$ the output projection matrix. The attention module can then be formulated as:

$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QW^Q (KW^K)^{\top}}{\sqrt{d_k}}\right) V W^V W^O.$$

We study the magnitude of the attention module. Lemma 4.1 proves that $W^Q$ and $W^K$ do not change the bound of the attention output's magnitude.
Given , where , and for all , it satisfies that
where stands for equal bound of magnitude.
In other words, the magnitude of the attention output depends only on the value and output projections. In this work, we only consider the magnitude of the model update, so it is sufficiently instructive to study the case where the hidden dimension equals 1. For simplicity, we reduce the projection matrices to scalars, so that the attention sub-layer is characterized by its value and output projection scalars; similarly, the feed-forward sub-layer is characterized by its scalar parameters.
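A toy numerical check of this point (our own illustration, using the standard single-head attention formulation above): because the softmax weights sum to one, rescaling the query/key projections keeps the output magnitude in the same bound, while rescaling the value projection scales the output directly.

```python
import torch

def attn(x, wq, wk, wv, wo):
    # softmax(x Wq (x Wk)^T / sqrt(d)) x Wv Wo  (single head, self-attention)
    d = x.shape[-1]
    scores = (x @ wq) @ (x @ wk).transpose(-1, -2) / d ** 0.5
    return torch.softmax(scores, dim=-1) @ (x @ wv) @ wo

torch.manual_seed(0)
n, d = 8, 64
x = torch.randn(n, d)
wq, wk, wv, wo = (torch.randn(d, d) / d ** 0.5 for _ in range(4))

base = attn(x, wq, wk, wv, wo).norm()
qk_scaled = attn(x, 10 * wq, 10 * wk, wv, wo).norm()  # same order of magnitude
v_scaled = attn(x, wq, wk, 10 * wv, wo).norm()        # exactly 10x larger
print(base.item(), qk_scaled.item(), v_scaled.item())
```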
We define the model update as $\Delta F = F(x, \theta^*) - F(x, \theta)$, where $\theta^*$ denotes the parameters after an update. Based on the analysis above, we have the following theorem to characterize the magnitude of $\Delta F$ for an $N$-layer DeepNet with $N$ attention sub-layers and $N$ FFN sub-layers.
Given an $N$-layer DeepNet $F(x, \theta)$ with $\theta = \{\theta_1, \theta_2, \dots, \theta_{2N}\}$, where $\theta_{2l-1}$ and $\theta_{2l}$ denote the parameters of the self-attention and FFN in the $l$-th layer, and each sub-layer is normalized with DeepNorm, $x_{l+1} = \mathrm{LN}(\alpha x_l + G_l(x_l, \theta_l))$, the model update $\|\Delta F\|$ satisfies a bound that accumulates the parameter updates of all $2N$ sub-layers, modulated by the constants $\alpha$ and $\beta$ (the proof is given in the appendix).
Vanilla Post-LN can be regarded as a special case of DeepNet, where $\alpha = 1$ and $\beta = 1$ at Xavier initialization (xavier). Based on Theorem 4.2, the bound for vanilla Post-LN grows with the number of sub-layers: the model tends to accumulate the update of each sub-layer, which leads to an exploding magnitude of the model update and destabilizes the optimization at the early stage. This explains our findings in Section 3.
Besides, Theorem 4.2 also explains why warm-up and smaller initialization can stabilize the training of Post-LN. Warm-up reduces the magnitude of the model update by decreasing the per-step parameter change $\|\theta^* - \theta\|$, while smaller initialization lowers the magnitude of the weights themselves.
Furthermore, we study the magnitude of DeepNet with an $N$-layer encoder and an $M$-layer decoder. Let $F_{ed}(x, y, \theta_e, \theta_d)$ denote the model, where $x$ and $y$ are the inputs of the encoder and the decoder. $\theta_e$ follows the same definition as in Theorem 4.2, while $\theta_d$ stands for the parameters of the self-attentions, cross-attentions, and FFNs of the decoder. We use $\{\alpha_e, \beta_e\}$ and $\{\alpha_d, \beta_d\}$ to distinguish the notation of the encoder and the decoder. The following theorem gives the expected magnitude of the encoder-decoder model update $\|\Delta F_{ed}\|$.
Given an encoder-decoder DeepNet with $N$ encoder layers and $M$ decoder layers, where each encoder sub-layer is normalized as $x_{l+1} = \mathrm{LN}(\alpha_e x_l + G_{el}(x_l, \theta_{el}))$ and each decoder sub-layer is normalized as $x_{l+1} = \mathrm{LN}(\alpha_d x_l + G_{dl}(x_l, \theta_{dl}))$, $\|\Delta F_{ed}\|$ satisfies:

(1)

The bound consists of two terms: the first accumulates the encoder sub-layer updates, which are propagated into the decoder through the cross-attention, and the second accumulates the decoder sub-layer updates.
The vanilla encoder-decoder model satisfies $\alpha_e = \beta_e = \alpha_d = \beta_d = 1$, so its model update accumulates over all encoder and decoder sub-layers. It indicates a similar accumulation effect, which leads to fast growth of the magnitude with respect to the model depth (see Figure 5). Furthermore, the cross-attention propagates the magnitude from the encoder to the decoder, which explains why the decoder is more unstable than the encoder (LiyuanLiu2020UnderstandingTD).
We show that the expected model update of DeepNet can be bounded by a constant with proper parameters $\alpha$ and $\beta$. Our analysis is based on the SGD update, and we empirically verify that it works well for the Adam optimizer (adam). We provide the analysis for the encoder-decoder architecture, which naturally extends to encoder-only and decoder-only models in the same way. Analogous to HongyiZhang2019FixupIR, we set our goal for the model update as follows:
GOAL: $F(x, \theta)$ is updated by $\Theta(\eta)$ per SGD step after initialization, as the learning rate $\eta \to 0$. That is, $\|\Delta F\| = \Theta(\eta)$, where $\Delta F = F\!\left(x, \theta - \eta\frac{\partial \mathcal{L}}{\partial \theta}\right) - F(x, \theta)$.
For the SGD optimizer, the update of each decoder layer equals $\eta\frac{\partial \mathcal{L}}{\partial \theta_d}$. RuibinXiong2020OnLN proved that Post-LN decreases the magnitude of the backpropagating error signal, so the gradients of the decoder parameters do not grow with the depth. With this property and the assumption that the parameters stay close to their initialization scale, the second term of Theorem 4.3, which accumulates the decoder sub-layer updates, can be bounded as:

(2)
There are multiple schemes to bound Equation 2 by $\Theta(\eta)$. In order to balance the effect of the residual connections and the initialization, we set $\alpha_d = (3M)^{1/4}$ and $\beta_d = (12M)^{-1/4}$. Similarly, we use $\alpha_e = 0.81(N^4 M)^{1/16}$ and $\beta_e = 0.87(N^4 M)^{-1/16}$ to bound the first term in Theorem 4.3. The detailed derivation is shown in Appendix B.
In comparison with Post-LN, we visualize the model updates of DeepNet on the IWSLT-14 De-En translation dataset at the early training stage. Figure 5 shows that the model update of DeepNet is nearly constant, while that of Post-LN explodes.
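The sketch below is a rough sanity check of this picture, not the paper's experiment: it builds a deep stack of toy residual sub-layers, applies a small random perturbation to all weights in place of an actual SGD step, and compares the resulting output change for Post-LN ($\alpha = \beta = 1$) against the encoder-only DeepNorm constants from Section 4.3. The toy sub-layers and the perturbation scheme are our own simplifications.

```python
import torch
import torch.nn as nn

def build_stack(num_sublayers, d, beta):
    """Toy stack of sub-layers, each a beta-initialized linear map wrapped by
    the residual x <- LN(alpha * x + G(x)) applied in forward()."""
    layers = nn.ModuleList()
    for _ in range(num_sublayers):
        lin = nn.Linear(d, d, bias=False)
        nn.init.xavier_normal_(lin.weight, gain=beta)
        layers.append(nn.ModuleDict({"g": lin, "ln": nn.LayerNorm(d)}))
    return layers

def forward(layers, x, alpha):
    for layer in layers:
        x = layer["ln"](alpha * x + layer["g"](x))
    return x

def model_update(layers, x, alpha, eta=1e-2):
    """||F(x, theta + delta) - F(x, theta)|| for a small random perturbation."""
    with torch.no_grad():
        y0 = forward(layers, x, alpha)
        for layer in layers:
            layer["g"].weight.add_(eta * torch.randn_like(layer["g"].weight))
        y1 = forward(layers, x, alpha)
    return (y1 - y0).norm().item()

torch.manual_seed(0)
d, N = 64, 100                                 # depth N -> 2N sub-layers
x = torch.randn(4, d)
alpha, beta = (2 * N) ** 0.25, (8 * N) ** -0.25
print("Post-LN :", model_update(build_stack(2 * N, d, beta=1.0), x, alpha=1.0))
print("DeepNorm:", model_update(build_stack(2 * N, d, beta=beta), x, alpha=alpha))
```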
In summary, we apply our approach as follows:
Encoder-decoder architecture:
1. Apply standard initialization (e.g., Xavier initialization) for each encoder and decoder layer.
2. For encoder layers, scale the weights of the feed-forward networks, as well as the value projection and the output projection of the attention layers, by $\beta_e = 0.87(N^4 M)^{-1/16}$, and set the weight of the residual connections to $\alpha_e = 0.81(N^4 M)^{1/16}$.
3. For decoder layers, scale the weights of the feed-forward networks, as well as the value projection and the output projection of the attention layers, by $\beta_d = (12M)^{-1/4}$, and set the weight of the residual connections to $\alpha_d = (3M)^{1/4}$ (see the sketch after this list).
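A small helper consistent with the recipe above (the function name and interface are ours): it computes the DeepNorm constants for an $N$-layer encoder and an $M$-layer decoder, which then feed the residual scaling and the $\beta$-scaled Xavier initialization shown earlier.

```python
def deepnorm_encdec_constants(num_enc_layers: int, num_dec_layers: int):
    """Return (alpha_enc, beta_enc, alpha_dec, beta_dec) following the recipe
    above; alpha scales the residual connection, beta scales the Xavier gain
    of the FFN and value/output projection weights."""
    N, M = num_enc_layers, num_dec_layers
    alpha_enc = 0.81 * (N ** 4 * M) ** (1 / 16)
    beta_enc = 0.87 * (N ** 4 * M) ** (-1 / 16)
    alpha_dec = (3 * M) ** (1 / 4)
    beta_dec = (12 * M) ** (-1 / 4)
    return alpha_enc, beta_enc, alpha_dec, beta_dec

# Example: a 100-layer encoder and a 100-layer decoder.
a_e, b_e, a_d, b_d = deepnorm_encdec_constants(100, 100)
```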
The derivation of encoder-only (such as BERT) and decoder-only (such as GPT) architectures can be conducted in the same way (see Appendix C). We summarize the steps as follows:
Encoder-only (or decoder-only) architecture:
1. Apply standard initialization (e.g., Xavier initialization) for each layer.
2. For each layer, scale the weights of the feed-forward networks, as well as the value projection and the output projection of the attention layers, by $\beta = (8N)^{-1/4}$ (or $(8M)^{-1/4}$ for decoder-only models), and set the weight of the residual connections to $\alpha = (2N)^{1/4}$ (or $(2M)^{1/4}$), as in the sketch below.
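And the corresponding helper for the single-stack case (again, the name is ours):

```python
def deepnorm_constants(num_layers: int):
    """Encoder-only (BERT-style) or decoder-only (GPT-style) constants:
    residual scale alpha and initialization scale beta from the recipe above."""
    alpha = (2 * num_layers) ** (1 / 4)
    beta = (8 * num_layers) ** (-1 / 4)
    return alpha, beta
```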
Models | LN | 6L-6L | 18L-18L | 50L-50L | 100L-100L
---|---|---|---|---|---
Vanilla Post-LN (transformer) | Post | 28.1 | diverged | |
DS-Init (BiaoZhang2019ImprovingDT) | Post | 27.9 | diverged | |
Admin (LiyuanLiu2020UnderstandingTD) | Post | 27.9 | 28.8 | diverged |
ReZero (rezero2020) | No | 26.9 | diverged | |
R-Fixup (HongyiZhang2019FixupIR) | No | 27.5 | 28.4 | 27.7 | diverged
T-Fixup (XiaoShiHuang2020ImprovingTO) | No | 27.5 | 28.4 | 27.9 | diverged
Vanilla Pre-LN (transformer) | Pre | 27.0 | 28.1 | 28.0 | 27.4
DLCL (Wang2019DLCL) | Pre | 27.4 | 28.2 | diverged | 27.5
NormFormer (Normformer2021) | Pre | 27.0 | 28.3 | 27.8 | diverged
DeepNet (ours) | Deep | 27.8 | 28.8 | 29.0 | 28.9
We verify the effectiveness of DeepNet on popular machine translation benchmarks, including the IWSLT-14 German-English (De-En) dataset and the WMT-17 English-German (En-De) dataset. We compare our method with multiple state-of-the-art deep Transformer models, including DLCL (Wang2019DLCL), NormFormer (Normformer2021), ReZero (rezero2020), R-Fixup (HongyiZhang2019FixupIR), T-Fixup (XiaoShiHuang2020ImprovingTO), DS-Init (BiaoZhang2019ImprovingDT), and Admin (LiyuanLiu2020UnderstandingTD). We reproduce the baselines with their open-source code and use the same hyper-parameters for a fair comparison. We use BLEU as the evaluation metric for all experiments.
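For reference, corpus-level BLEU can be computed with the sacrebleu package; the paper does not specify its exact scoring setup here, so treat this as one common choice rather than the authors' script.

```python
import sacrebleu

hypotheses = ["the cat sat on the mat .", "a deep model translates well ."]
references = [["the cat sat on the mat .", "a deep model translates well ."]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")
```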
Table 1 reports the results of the baselines and DeepNet on the WMT-17 En-De translation dataset. According to their LNs, the baselines are grouped into three categories: Pre-LN, Post-LN, and No-LN. All the compared models are base-size with different depths.

Compared with the models with Post-LN, DeepNet is more stable and can successfully scale to 100L-100L, reaching 28.9 BLEU on the test set. In contrast, the baselines with Post-LN lead to unstable optimization when the depth goes to 50L-50L. Besides, DeepNet achieves comparable performance with these baselines when the models are shallow.
In addition, we compare DeepNet with the methods without LN. Both R-Fixup and T-Fixup introduce better initialization methods, which stabilize the training of No-LN Transformers with up to 50L-50L layers. Yet, their performance is not as good as that of the Post-LN models. Besides, half precision can destabilize the training of ReZero, leading to divergence at 18L-18L. This observation has also been reported by LiyuanLiu2020UnderstandingTD. Moreover, their deeper models (50L-50L) do not outperform the shallow models (18L-18L). In comparison, DeepNet achieves better translation accuracy than these methods, and scaling to deeper models does not hurt its performance.
Compared with the Post-LN baselines, the models with Pre-LN are more stable. Both vanilla Pre-LN and DLCL can be scaled to 100L-100L, and the 50L-50L NormFormer is also trained successfully. Nevertheless, Pre-LN leads to a 0.5-1.0 BLEU drop compared with the converged Post-LN models. We presume this is caused by the problem that the gradients of Pre-LN at earlier layers tend to be larger than the gradients at later layers (Normformer2021), and we leave it for future work. In contrast, DeepNet alleviates the problem by using Post-LN, and it outperforms all the Pre-LN baselines.
We vary the depths of the models from 10L-10L to 100L-100L with an interval of 10 layers. All experiments are conducted with mixed precision training, except ReZero (according to our experiments, ReZero is unstable with half precision, even when the model is shallow). Figure 6 shows the results on the IWSLT-14 dataset. We train the models for 8,000 steps because we find that most divergence occurs at the beginning of optimization. Overall, DeepNet is stable from shallow to deep. It converges fast, achieving over 30 BLEU in only 8,000 steps, while most of the baselines do not. Moreover, the performance keeps improving as the model goes deeper.
We further scale DeepNet to a larger learning rate, batch size, and hidden dimension, respectively. For each experiment, we change only one hyper-parameter with the others fixed. Figure 7 reports the loss curves on the WMT-17 validation set. It shows that DeepNet can be trained without difficulty in all of the largest settings. The loss of DeepNet with a 1,024 hidden size increases after 10K steps because of overfitting. Besides, it indicates that DeepNet can benefit from the larger settings, resulting in faster convergence and lower validation loss.
Models | # Layers | # Params | X→En | En→X | Avg
---|---|---|---|---|---
Baseline (opus100) | 12 | 133M | 27.5 | 21.4 | 24.5
Baseline (opus100) | 24 | 173M | 29.5 | 22.9 | 26.2
Baseline (opus100) | 48 | 254M | 31.4 | 24.0 | 27.7
DeepNet (ours) | 200 | 863M | 33.2 | 29.0 | 31.1
DeepNet (ours) | 1000 | 3.8B | 33.9 | 30.2 | 32.1
Models | # Layers | # Params | WMT | OPUS | TED | Flores |
---|---|---|---|---|---|---|
M2M-100 (m2m100) | 48 | 12B | 31.9 | 18.4 | 18.7 | 13.6 |
DeepNet (ours) | 200 | 3.2B | 33.9 | 23.0 | 20.1 | 18.6 |
We conduct experiments on large-scale multilingual machine translation, which is a good testbed for large models. We first use the OPUS-100 corpus (opus100) to evaluate our model. OPUS-100 is an English-centric multilingual corpus covering 100 languages, which is randomly sampled from the OPUS collection. We scale DeepNet up to 1,000 layers. The model has a 500-layer encoder, a 500-layer decoder, a hidden size of 512, 8 attention heads, and a feed-forward dimension of 2,048. More details can be found in the Appendix.
Table 2 summarizes the results of DeepNet and the baselines. It shows that increasing the depth can significantly improve the translation quality of NMT: the 48-layer baseline achieves a gain of 3.2 BLEU points on average over the 12-layer model. DeepNet can successfully scale the depth up to 1,000 layers, outperforming the baseline by 4.4 BLEU. Note that DeepNet is only trained for 4 epochs, and the performance can be further improved given a larger computation budget.
We train DeepNet with {12, 20, 100, 200, 1000} layers on the OPUS-100 dataset. Figure 8 illustrates the scaling curve. Compared with bilingual NMT, multilingual NMT benefits more from scaling the depth of the model because of its hunger for model capacity. We observe logarithmic growth of the BLEU score for multilingual NMT, and the scaling law can be written as

$$L(d) = A \log(d) + B,$$

where $d$ is the depth, and $A$, $B$ are constants determined by the other hyper-parameters.
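To make the functional form concrete, the sketch below fits $L(d) = A \log(d) + B$ with least squares. The BLEU values are synthetic placeholders generated from assumed constants, purely to demonstrate recovering $A$ and $B$ from (depth, BLEU) pairs; they are not the paper's measurements.

```python
import numpy as np

depths = np.array([12, 20, 100, 200, 1000], dtype=float)

# Synthetic scores from assumed constants (NOT measured results).
A_true, B_true = 1.8, 20.0
rng = np.random.default_rng(0)
bleu = A_true * np.log(depths) + B_true + rng.normal(0.0, 0.1, depths.size)

# Least-squares fit of L(d) = A * log(d) + B.
A_fit, B_fit = np.polyfit(np.log(depths), bleu, deg=1)
print(f"L(d) ~= {A_fit:.2f} * log(d) + {B_fit:.2f}")
```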
To explore the limits of DeepNet on multilingual NMT, we then scale up the training data by using CCMatrix (ccmatrix). We also expand the data with CCAligned (ccaligned), OPUS (opus100), and Tatoeba (https://tatoeba.org/en/) to cover all languages of the Flores101 evaluation sets. The final data consists of 102 languages, 1,932 directions, and 12B sentence pairs. With this data, we train DeepNet with a 100-layer encoder, a 100-layer decoder, a hidden dimension of 1,024, 16 attention heads, and a feed-forward intermediate dimension of 4,096. More details can be found in the Appendix.
We compare DeepNet with the state-of-the-art multilingual NMT model M2M-100 (m2m100). M2M-100 has a 24-layer encoder, a 24-layer decoder, and a hidden size of 4,096, resulting in up to 12B parameters. Compared with M2M-100, DeepNet is deep and narrow, with only 3.2B parameters. For a fair comparison, we decode with beam size 5 and length penalty 1.
Following M2M-100 (m2m100), we evaluate the models on several multilingual translation evaluation datasets, including WMT (wmt14; wmt17; wmt18; wmt19), OPUS (opus100), TED (ted-data), and Flores (flores101). The language pairs from the WMT dataset are English-centric. There are 10 languages including English, and most of them are high-resource. For the OPUS dataset, we select the non-English directions from the test set, which has 30 evaluation pairs. The TED evaluation set has 28 languages and 756 directions, and the data is from the spoken language domain. The Flores dataset has all translation pairs between 102 languages. We use a subset covering the languages supported by both M2M-100 and DeepNet, resulting in 87 languages and 7,482 translation directions.
We report the results in Table 3. For a fair comparison, we use the same evaluation methods as the baseline. The details can be found in the Appendix. It shows that DeepNet has significantly better performance than M2M-100 on all evaluation datasets, indicating that deepening the model is a very promising direction to improve the quality of NMT models.
We improve the stability of Transformer and successfully scale it to 1,000 layers. This is achieved by our DeepNet with a novel normalization function called DeepNorm. It has theoretical justification to stabilize the optimization with a constant upper bound for model updates. Experimental results verify the effectiveness of our methods across various benchmarks. We focus on machine translation as a test bed in the current experiments. In the future, we will extend DeepNet to support more diverse tasks, e.g., language model pre-training (unilm; unilmv2; chi-etal-2021-infoxlm; deltalm; chi2021xlme), protein structure prediction (AlphaFold2021), and BEiT vision pre-training (beit; vlmo).
We would like to acknowledge Saksham Singhal for the CCMatrix corpus.
Given , where , and for all , it satisfies that
where stands for equal bound of magnitude.
The weight of to output is , .
(3)
With , , for all , we have . Therefore, , which is equivalent to softmax. ∎
Given an $N$-layer DeepNet $F(x, \theta)$ with $\theta = \{\theta_1, \dots, \theta_{2N}\}$, where $\theta_{2l-1}$ and $\theta_{2l}$ denote the parameters of the self-attention and FFN in the $l$-th layer, and each sub-layer is normalized with DeepNorm, $x_{l+1} = \mathrm{LN}(\alpha x_l + G_l(x_l, \theta_l))$, $\|\Delta F\|$ satisfies the bound stated in Theorem 4.2.
Our aim is to study the magnitude of model updates. Following HongyiZhang2019FixupIR, we make the following assumptions to simplify the derivations:
1. The hidden dimension equals 1.
2. All relevant weights are positive with magnitude less than 1, and the residual scaling constants of DeepNorm are positive with magnitude greater than 1.
Given Assumption 1, if $G_l$ is a feed-forward network with parameters $\theta_l$, its output magnitude is determined by the magnitude of $\theta_l$ and of its input. According to Lemma 4.1, the query and key projections do not change the bound of the attention output's magnitude; therefore, if $G_l$ is self-attention, its output magnitude is determined by the value and output projections. In particular, if Xavier initialization is used for the projections, the output preserves the input variance. With Assumption 2, we have:

(4)

With Equation 4, the corresponding magnitudes are bounded by:

(5)

Besides, the model update satisfies:

(6)

Using Taylor expansion for Equation 6, we get:

(7)

(8)

Then, we have:

(9)
∎
For vanilla Post-LN with standard initialization, $\alpha = 1$ and $\beta = 1$, so the bound reduces to the accumulated updates of all sub-layers.
Proof of Theorem 4.3
Given an encoder-decoder DeepNet with $N$ encoder layers and $M$ decoder layers, where each encoder sub-layer is normalized as $x_{l+1} = \mathrm{LN}(\alpha_e x_l + G_{el}(x_l, \theta_{el}))$ and each decoder sub-layer is normalized as $x_{l+1} = \mathrm{LN}(\alpha_d x_l + G_{dl}(x_l, \theta_{dl}))$, $\|\Delta F_{ed}\|$ satisfies:

(10)

The derivation for the self-attention and FFN layers is given in Section A.2. For the cross-attention layers, we have:

(11)
With Equation 11, we have the bound of the corresponding derivative. By means of Taylor expansion, we estimate the update of the $l$-th cross-attention layer as:

(12)
According to Theorem 4.2, the update coming from the encoder is bounded. Therefore, the magnitude of $\|\Delta F_{ed}\|$ satisfies:

(13)
∎
As a special case, the corresponding constants in Equation 13 for vanilla Post-LN with standard initialization all equal 1, so its model update accumulates over all encoder and decoder sub-layers.
Here, we give the derivation of DeepNet for the encoder-decoder architecture with an $N$-layer encoder and an $M$-layer decoder. As in Section 4.3, we use $\alpha_d = (3M)^{1/4}$ and $\beta_d = (12M)^{-1/4}$ to bound the second term of Equation 13 to $\Theta(\eta)$. For the first term, we constrain the encoder constants analogously, so that it goes to:

(14)

(15)

In this work, we use $\alpha_e = 0.81(N^4 M)^{1/16}$ and $\beta_e = 0.87(N^4 M)^{-1/16}$ to satisfy the condition.
For an $N$-layer DeepNet, starting from Theorem 4.2 we have:

(16)

By the assumptions above, we achieve:

(17)

Due to symmetry, we set the attention and FFN scales equal, so the condition depends only on $\alpha$, $\beta$, and $N$. In this work, we use $\alpha = (2N)^{1/4}$ and $\beta = (8N)^{-1/4}$ to satisfy the condition.
Hyperparameters | Value |
---|---|
Learning rate | 5e-4 |
Learning rate scheduler | inverse sqrt |
Warm-up updates | 4000 |
Warm-up init learning rate | 1e-7 |
Max tokens | 4000 |
Adam $\epsilon$ | 1e-8
Adam $(\beta_1, \beta_2)$ | (0.9, 0.98)
Label smoothing | 0.1 |
Training updates | 8K |
Gradient clipping | 0.0 |
Dropout | 0.4 |
Weight decay | 0.0001 |
Hidden size | 512 |
FFN inner hidden size | 2048 |
Attention heads | 8 |
Hyperparameters | No-LN | Pre-LN | Post-LN | DeepNorm |
---|---|---|---|---|
Learning rate | 5e-4 | 1.5e-3 | 1.5e-3 | 1.5e-3 |
Learning rate scheduler | inverse sqrt | |||
Warm-up updates | 4000 | |||
Warm-up init learning rate | 1e-7 | |||
Max tokens | 128 × 4096 | | |
Adam $\epsilon$ | 1e-8 | | |
Adam $(\beta_1, \beta_2)$ | (0.9, 0.98) | | |
Label smoothing | 0.1 | |||
Training updates | 100K | |||
Gradient clipping | 0.0 | |||
Dropout | 0.4 | |||
Weight decay | 0.0001 | |||
Hidden size | 512 | |||
FFN inner hidden size | 2048 | |||
Attention heads | 8 |
Hyperparameters | Base size | Medium size | Large size |
---|---|---|---|
Hidden size | 512 | 768 | 1,024 |
FFN inner hidden size | 2048 | 3072 | 4096 |
Attention heads | 8 | 12 | 16 |
Layers | 18-18 | ||
Learning rate | 5e-4 | ||
Learning rate scheduler | inverse sqrt | ||
Warm-up updates | 4000 | ||
Warm-up init learning rate | 1e-7 | ||
Max tokens | 128 × 4096 | |
Adam $\epsilon$ | 1e-6 | |
Adam $(\beta_1, \beta_2)$ | (0.9, 0.98) | |
Label smoothing | 0.1 | ||
Training updates | 30K | ||
Gradient clipping | 1.0 | ||
Dropout | 0.4 | ||
Weight decay | 0.0 |
Hyperparameters | Value |
---|---|
Learning rate | 5e-4 |
Learning rate scheduler | inverse sqrt |
Warm-up updates | 4000 |
Warm-up init learning rate | 1e-7 |
Max tokens | 128 × 4096
Adam $\epsilon$ | 1e-8
Adam $(\beta_1, \beta_2)$ | (0.9, 0.98)
Label smoothing | 0.1 |
Training epochs | 4 |
Gradient clipping | 0.0 |
Dropout | 0.1 |
Weight decay | 0.0 |
Hidden size | 512 |
FFN inner hidden size | 2048 |
Attention heads | 8 |
Hyperparameters | Value |
---|---|
Learning rate | 5e-4 |
Learning rate scheduler | inverse sqrt |