
Improving Deep Transformer with Depth-Scaled Initialization and Merged Attention

The general trend in NLP is towards increasing model capacity and performance via deeper neural networks. However, simply stacking more layers of the popular Transformer architecture for machine translation results in poor convergence and high computational overhead. Our empirical analysis suggests that convergence is poor due to gradient vanishing caused by the interaction between residual connections and layer normalization. We propose depth-scaled initialization (DS-Init), which decreases parameter variance at the initialization stage, and reduces output variance of residual connections so as to ease gradient back-propagation through normalization layers. To address computational cost, we propose a merged attention sublayer (MAtt) which combines a simplified average-based self-attention sublayer and the encoder-decoder attention sublayer on the decoder side. Results on WMT and IWSLT translation tasks with five translation directions show that deep Transformers with DS-Init and MAtt can substantially outperform their base counterpart in terms of BLEU (+1.1 BLEU on average for 12-layer models), while matching the decoding speed of the baseline model thanks to the efficiency improvements of MAtt.



1 Introduction

Figure 1: Gradient norm (y-axis) of each encoder layer (top) and decoder layer (bottom) in Transformer with respect to layer depth (x-axis). Gradients are estimated with 3k target tokens at the beginning of training. “DS-Init”: the proposed depth-scaled initialization. “6L”: 6 layers. Solid lines indicate the vanilla Transformer, and dashed lines denote our proposed method. During back-propagation, gradients in Transformer gradually vanish from high layers to low layers.

The ability of deep neural models to handle complex dependencies has benefited various artificial intelligence tasks, such as image recognition, where test error was reduced by scaling VGG nets Simonyan and Zisserman (2015) up to hundreds of convolutional layers He et al. (2015). In NLP, deep self-attention networks have enabled large-scale pretrained language models such as BERT Devlin et al. (2019) and GPT Radford et al. (2018) to boost state-of-the-art (SOTA) performance on downstream applications. By contrast, though neural machine translation (NMT) gained encouraging improvement when shifting from a shallow architecture Bahdanau et al. (2015) to deeper ones Zhou et al. (2016); Wu et al. (2016); Zhang et al. (2018a); Chen et al. (2018), the Transformer Vaswani et al. (2017), a currently SOTA architecture, achieves best results with merely 6 encoder and decoder layers, and no gains were reported by Vaswani et al. (2017) from further increasing its depth on standard datasets.

We start by analysing why the Transformer does not scale well to larger model depth. We find that the architecture suffers from gradient vanishing as shown in Figure 1, leading to poor convergence. An in-depth analysis reveals that the Transformer is not norm-preserving due to the involvement of and the interaction between residual connection (RC) He et al. (2015) and layer normalization (LN) Ba et al. (2016).

To address this issue, we propose depth-scaled initialization (DS-Init) to improve norm preservation. We ascribe the gradient vanishing to the large output variance of RC and resort to strategies that could reduce it without model structure adjustment. Concretely, DS-Init scales down the initialization range of parameters in the $l$-th layer with a discount factor of $1/\sqrt{l}$ (and hence their variance by $1/l$) at the initialization stage alone, where $l$ denotes the layer depth starting from 1. The intuition is that parameters with small variance in upper layers would narrow the output variance of corresponding RCs, improving norm preservation as shown by the dashed lines in Figure 1. In this way, DS-Init enables the convergence of deep Transformer models to satisfactory local optima.

Another bottleneck for deep Transformers is the increase in computational cost for both training and decoding. To combat this, we propose a merged attention network (MAtt). MAtt simplifies the decoder by replacing the separate self-attention and encoder-decoder attention sublayers with a new sublayer that combines an efficient variant of average-based self-attention (AAN) Zhang et al. (2018) and the encoder-decoder attention. We simplify the AAN by reducing the number of linear transformations, reducing both the number of model parameters and computational cost. The merged sublayer benefits from parallel calculation of (average-based) self-attention and encoder-decoder attention, and reduces the depth of each decoder block.

We conduct extensive experiments on WMT and IWSLT translation tasks, covering five translation tasks with varying data conditions and translation directions. Our results show that deep Transformers with DS-Init and MAtt can substantially outperform their base counterpart in terms of BLEU (+1.1 BLEU on average for 12-layer models), while matching the decoding speed of the baseline model thanks to the efficiency improvements of MAtt.

Our contributions are summarized as follows:

  • We analyze the vanishing gradient issue in the Transformer, and identify the interaction of residual connections and layer normalization as its source.

  • To address this problem, we introduce depth-scaled initialization (DS-Init).

  • To reduce the computational cost of training deep Transformers, we introduce a merged attention model (MAtt). MAtt combines a simplified average-attention model and the encoder-decoder attention into a single sublayer, allowing for parallel computation.

  • We conduct extensive experiments and verify that deep Transformers with DS-Init and MAtt improve translation quality while preserving decoding efficiency.

2 Related Work

Our work aims at improving translation quality by increasing model depth. Compared with the single-layer NMT system Bahdanau et al. (2015), deep NMT models are typically more capable of handling complex language variations and translation relationships via stacking multiple encoder and decoder layers Zhou et al. (2016); Wu et al. (2016); Britz et al. (2017); Chen et al. (2018), and/or multiple attention layers Zhang et al. (2018a). One common problem in training deep neural models is vanishing or exploding gradients. Existing methods mainly focus on developing novel network architectures so as to stabilize gradient back-propagation, such as the fast-forward connection Zhou et al. (2016), the linear associative unit Wang et al. (2017), or gated recurrent network variants Hochreiter and Schmidhuber (1997); Gers and Schmidhuber (2001); Cho et al. (2014); Di Gangi and Federico (2018). In contrast to the above recurrent network based NMT models, recent work focuses on feed-forward alternatives with smoother gradient flow, such as convolutional networks Gehring et al. (2017) and self-attention networks Vaswani et al. (2017).

The Transformer represents the current SOTA in NMT. It heavily relies on the combination of residual connections He et al. (2015) and layer normalization Ba et al. (2016) for convergence. Nevertheless, simply extending this model with more layers results in gradient vanishing due to the interaction of RC and LN (see Section 4). Recent work has proposed methods to train deeper Transformer models, including a rescheduling of RC and LN Vaswani et al. (2018), the transparent attention model Bapna et al. (2018) and the stochastic residual connection Pham et al. (2019). In contrast to these works, we identify the large output variance of RC as the source of gradient vanishing, and employ scaled initialization to mitigate it without any structure adjustment. The effect of careful initialization on boosting convergence was also investigated and verified in previous work Zhang et al. (2019); Child et al. (2019); Devlin et al. (2019); Radford et al. (2018).

The merged attention network falls into the category of simplifying the Transformer so as to shorten training and/or decoding time. Methods to improve the Transformer’s running efficiency range from algorithmic improvements Junczys-Dowmunt et al. (2018), non-autoregressive translation Gu et al. (2018); Ghazvininejad et al. (2019) to decoding dependency reduction such as average attention network Zhang et al. (2018) and blockwise parallel decoding Stern et al. (2018). Our MAtt builds upon the AAN model, further simplifying the model by reducing the number of linear transformations, and combining it with the encoder-decoder attention. In work concurrent to ours, So et al. (2019) propose the evolved Transformer which, based on automatic architecture search, also discovered a parallel structure of self-attention and encoder-decoder attention.

3 Background: Transformer

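Given a source sequence, the Transformer predicts a target sequence under the encoder-decoder framework. Both the encoder and the decoder in the Transformer are composed of attention networks, functioning as follows:

$$\text{Att}(Z_x, Z_y) = \left[\text{softmax}\left(\frac{Q K^T}{\sqrt{d}}\right) V\right] W_o, \qquad Q, K, V = Z_x W_q,\ Z_y W_k,\ Z_y W_v \quad (1)$$

where $Z_x \in \mathbb{R}^{I \times d}$ and $Z_y \in \mathbb{R}^{J \times d}$ are input sequence representations of length $I$ and $J$ respectively, and $W_* \in \mathbb{R}^{d \times d}$ denote weight parameters. The attention network can be further enhanced with multi-head attention Vaswani et al. (2017).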

Formally, the encoder stacks $L$ identical layers, each including a self-attention sublayer (Eq. 2) and a point-wise feed-forward sublayer (Eq. 3):

$$\bar{H}^l = \text{Att}(H^{l-1}, H^{l-1}) \quad (2)$$
$$H^l = \text{Ffn}(\bar{H}^l) \quad (3)$$

$H^l$ denotes the sequence representation of the $l$-th encoder layer. Input to the first layer, $H^0$, is the element-wise addition of the source word embedding and the corresponding positional encoding. $\text{Ffn}(\cdot)$ is a two-layer feed-forward network with a large intermediate representation and ReLU activation function. Each encoder sublayer is wrapped with a residual connection (Eq. 4), followed by layer normalization (Eq. 5):


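$$\text{RC}(z, z') = z + z' \quad (4)$$
$$\text{LN}(z) = \frac{z - \mu}{\sigma} \odot g + b \quad (5)$$

where $z$ and $z'$ are input vectors, and $\odot$ indicates element-wise multiplication. $\mu$ and $\sigma$ denote the mean and standard deviation statistics of vector $z$. The normalized $z$ is then re-scaled and re-centered by trainable parameters $g$ and $b$ individually.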
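The residual block described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names, the `eps` stabiliser and the toy vectors are assumptions, and the statistics are computed over a single vector as in the text.

```python
import math

def residual_connection(z, z_prime):
    """RC: element-wise sum of the sublayer input and the sublayer output."""
    return [a + b for a, b in zip(z, z_prime)]

def layer_norm(z, gain, bias, eps=1e-6):
    """LN: standardise z with its own mean/std, then re-scale and re-centre."""
    d = len(z)
    mean = sum(z) / d
    var = sum((v - mean) ** 2 for v in z) / d
    std = math.sqrt(var + eps)  # eps is an assumed numerical stabiliser
    return [g * (v - mean) / std + b for v, g, b in zip(z, gain, bias)]

# Toy example: x is the sublayer input, fx a hypothetical sublayer output.
x = [1.0, 2.0, 3.0, 4.0]
fx = [0.5, -0.5, 0.5, -0.5]
out = layer_norm(residual_connection(x, fx), gain=[1.0] * 4, bias=[0.0] * 4)
print(out)
```

With unit gain and zero bias the output has (approximately) zero mean and unit variance regardless of how large the residual sum was, which is why the scale of the RC output only shows up in the backward pass.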

The decoder also consists of $L$ identical layers, each of which extends the encoder sublayers with an encoder-decoder attention sublayer (Eq. 7) to capture translation alignment from target words to relevant source words:

$$\tilde{S}^l = \text{Att}(S^{l-1}, S^{l-1}) \quad (6)$$
$$\bar{S}^l = \text{Att}(\tilde{S}^l, H^L) \quad (7)$$
$$S^l = \text{Ffn}(\bar{S}^l) \quad (8)$$

$S^l$ is the sequence representation of the $l$-th decoder layer. The decoder input $S^0$ is defined similarly to the encoder input. To ensure auto-regressive decoding, the attention weights in Eq. 6 are masked to prevent attention to future target tokens.

The Transformer’s parameters are typically initialized by sampling from a uniform distribution:

$$W \in \mathbb{R}^{d_i \times d_o} \sim \mathcal{U}(-\gamma, \gamma), \qquad \gamma = \sqrt{\frac{6}{d_i + d_o}} \quad (9)$$

where $d_i$ and $d_o$ indicate the input and output dimension respectively. This initialization has the advantage of maintaining the variance of activations and of back-propagated gradients, and can help train deep neural networks Glorot and Bengio (2010).
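As a rough illustration of this initialization, the sketch below samples a weight matrix from a uniform distribution with the Glorot and Bengio (2010) bound $\sqrt{6/(d_i + d_o)}$ and compares the empirical variance with the closed form $a^2/3$ for a uniform distribution on $[-a, a]$. The function names and the 256×256 shape are illustrative assumptions.

```python
import math
import random

def glorot_bound(d_in, d_out):
    """Uniform bound gamma = sqrt(6 / (d_in + d_out))."""
    return math.sqrt(6.0 / (d_in + d_out))

def init_uniform(d_in, d_out, rng):
    """Sample a d_in x d_out matrix from U(-gamma, gamma)."""
    gamma = glorot_bound(d_in, d_out)
    return [[rng.uniform(-gamma, gamma) for _ in range(d_out)] for _ in range(d_in)]

rng = random.Random(0)
W = init_uniform(256, 256, rng)
values = [w for row in W for w in row]
mean = sum(values) / len(values)
empirical_var = sum((w - mean) ** 2 for w in values) / len(values)
expected_var = glorot_bound(256, 256) ** 2 / 3.0  # variance of U(-a, a) is a^2 / 3
print(empirical_var, expected_var)
```

Note that the bound, and hence the parameter variance, depends only on the matrix shape, so every layer of a deep stack starts with the same variance.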

4 Vanishing Gradient Analysis

One natural way to deepen the Transformer is simply to increase the number of layers $L$. Unfortunately, Figure 1 shows that this would give rise to gradient vanishing on both the encoder and the decoder at the lower layers, and that the case on the decoder side is worse. We identified a structural problem in the Transformer architecture that gives rise to this issue, namely the interaction of RC and LN, which we discuss here in more detail.

Given an input vector $x \in \mathbb{R}^d$, let us consider the general structure of RC followed by LN:

$$y = \text{RC}(x, f(x)) \quad (10)$$
$$z = \text{LN}(y) \quad (11)$$

where $y, z \in \mathbb{R}^d$ are intermediate outputs. $f(\cdot)$ represents any neural network, such as a recurrent, convolutional or attention network. Suppose during back-propagation, the error signal at the output of LN is $\delta_z$. Contributions of RC and LN to the error signal are as follows:


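$$\delta_y = \frac{\partial z}{\partial y} \delta_z = \text{diag}\left(\frac{g}{\sigma_y}\right)\left(I - \frac{\mathbf{1}\mathbf{1}^T + \bar{y}\bar{y}^T}{d}\right)\delta_z \quad (12)$$
$$\delta_x = \frac{\partial y}{\partial x} \delta_y = \left(I + \frac{\partial f}{\partial x}\right)\delta_y \quad (13)$$

where $\bar{y}$ denotes the normalized input, $\sigma_y$ the standard deviation of $y$, $\mathbf{1}$ the all-ones vector, $I$ is the identity matrix and $\text{diag}(\cdot)$ establishes a diagonal matrix from its input. The resulting $\delta_y$ and $\delta_x$ are the error signals arriving at outputs $y$ and $x$ respectively.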
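The layer normalization term can be checked numerically. The sketch below assumes LN of the form $g \odot (y - \mu_y)/\sigma_y + b$ with population statistics, for which the Jacobian works out to $\text{diag}(g/\sigma_y)\,(I - (\mathbf{1}\mathbf{1}^T + \bar{y}\bar{y}^T)/d)$, and compares this closed form against central finite differences. The helper names and test vectors are illustrative assumptions; the bias is omitted because it does not affect the Jacobian.

```python
import math

def layer_norm(y, g):
    d = len(y)
    mu = sum(y) / d
    sigma = math.sqrt(sum((v - mu) ** 2 for v in y) / d)
    return [gi * (v - mu) / sigma for v, gi in zip(y, g)]

def ln_jacobian(y, g):
    """Analytic dz/dy = diag(g / sigma) (I - (1 1^T + ybar ybar^T) / d)."""
    d = len(y)
    mu = sum(y) / d
    sigma = math.sqrt(sum((v - mu) ** 2 for v in y) / d)
    ybar = [(v - mu) / sigma for v in y]
    return [[g[i] / sigma * ((1.0 if i == j else 0.0) - (1.0 + ybar[i] * ybar[j]) / d)
             for j in range(d)] for i in range(d)]

def numeric_jacobian(y, g, h=1e-6):
    """Central finite differences, one input coordinate at a time."""
    d = len(y)
    J = [[0.0] * d for _ in range(d)]
    for j in range(d):
        plus, minus = list(y), list(y)
        plus[j] += h
        minus[j] -= h
        zp, zm = layer_norm(plus, g), layer_norm(minus, g)
        for i in range(d):
            J[i][j] = (zp[i] - zm[i]) / (2 * h)
    return J

y = [0.3, -1.2, 2.5, 0.7]
g = [1.0, 0.5, 2.0, 1.5]
analytic = ln_jacobian(y, g)
numeric = numeric_jacobian(y, g)
max_err = max(abs(analytic[i][j] - numeric[i][j]) for i in range(4) for j in range(4))

# Doubling the scale of y doubles sigma and halves every Jacobian entry.
scaled = ln_jacobian([2.0 * v for v in y], g)
print(max_err, scaled[0][0] / analytic[0][0])
```

The last two lines illustrate the point made in the analysis: the larger the standard deviation of the RC output, the more LN shrinks the error signal that passes back through it.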

We define the change of error signal as follows:

$$\beta = \beta_{LN} \cdot \beta_{RC} = \frac{\|\delta_y\|_2}{\|\delta_z\|_2} \cdot \frac{\|\delta_x\|_2}{\|\delta_y\|_2} \quad (14)$$

where $\beta$ (or model ratio), $\beta_{LN}$ (or LN ratio) and $\beta_{RC}$ (or RC ratio) measure the gradient norm ratio of the whole residual block, the layer normalization and the residual connection respectively. (Footnote 2: Model gradients depend on both error signal and layer activation. Reduced/enhanced error signal does not necessarily result in gradient vanishing/explosion, but strongly contributes to it.) Informally, a neural model should preserve the gradient norm between layers ($\beta \approx 1$) so as to allow training of very deep models (see Zaeemzadeh et al., 2018).

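| Method | Module | Measure | Self | Cross | FFN |
|---|---|---|---|---|---|
| Base | Enc | LN ratio | 0.86 | - | 0.84 |
| | | RC ratio | 1.22 | - | 1.10 |
| | | Model ratio | 1.05 | - | 0.93 |
| | | RC output variance | 1.38 | - | 1.40 |
| | Dec | LN ratio | 0.82 | 0.74 | 0.84 |
| | | RC ratio | 1.21 | 1.00 | 1.11 |
| | | Model ratio | 0.98 | 0.74 | 0.93 |
| | | RC output variance | 1.48 | 1.84 | 1.39 |
| Ours | Enc | LN ratio | 0.96 | - | 0.95 |
| | | RC ratio | 1.04 | - | 1.02 |
| | | Model ratio | 1.02 | - | 0.98 |
| | | RC output variance | 1.10 | - | 1.10 |
| | Dec | LN ratio | 0.95 | 0.94 | 0.94 |
| | | RC ratio | 1.05 | 1.00 | 1.02 |
| | | Model ratio | 1.10 | 0.95 | 0.98 |
| | | RC output variance | 1.13 | 1.15 | 1.11 |

Table 1: Empirical measure of the output variance of RC and of the error signal change ratios (LN ratio, RC ratio and model ratio, Eq. 14), averaged over 12 layers. These values are estimated with 3k target tokens at the beginning of training using a 12-layer Transformer. “Base”: the baseline Transformer. “Ours”: the Transformer with DS-Init. Enc and Dec stand for encoder and decoder respectively. Self, Cross and FFN indicate the self-attention, encoder-decoder attention and the feed-forward sublayer respectively.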

We resort to empirical evidence to analyze these ratios. Results in Table 1 show that LN weakens the error signal (LN ratio $\beta_{LN} < 1$) but RC strengthens it (RC ratio $\beta_{RC} \geq 1$). One explanation for LN’s decay effect is the large output variance of RC (above 1 for every sublayer in Table 1), which negatively affects the error signal at the RC output, as shown in Eq. 12. By contrast, the short-cut in RC ensures that the error signal at a higher layer can always be safely carried on to the lower layer no matter how complex the wrapped network is, as in Eq. 13, increasing the ratio.
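The reported ratios can be sanity-checked against each other. The sketch below assumes that the four rows listed per module in Table 1 are, in order, the LN ratio, the RC ratio, the model ratio and the RC output variance (an interpretation, not something the table states explicitly), and verifies for the baseline that the model ratio is close to the product of the LN and RC ratios. The match is only approximate because each value is an average over 12 layers.

```python
# (LN ratio, RC ratio, model ratio) for the baseline, copied from Table 1.
# The row-to-measure mapping is an assumption made for this check.
baseline = {
    ("Enc", "Self"): (0.86, 1.22, 1.05),
    ("Enc", "FFN"): (0.84, 1.10, 0.93),
    ("Dec", "Self"): (0.82, 1.21, 0.98),
    ("Dec", "Cross"): (0.74, 1.00, 0.74),
    ("Dec", "FFN"): (0.84, 1.11, 0.93),
}

gaps = {}
for key, (ln_ratio, rc_ratio, model_ratio) in baseline.items():
    # Difference between the reported model ratio and LN ratio * RC ratio.
    gaps[key] = abs(ln_ratio * rc_ratio - model_ratio)
    print(key, round(ln_ratio * rc_ratio, 3), model_ratio)

largest_gap = max(gaps.values())
```

Under this reading, the encoder-decoder attention sublayer stands out: its RC ratio of 1.00 gives no boost, so the LN decay of 0.74 passes through unchanged.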

Figure 2: An overview of (a) self-attention, (b) AAN and (c) the proposed merged attention with simplified AAN.

5 Depth-Scaled Initialization

Results on the model ratio show that the self-attention sublayer has a (near) increasing effect that intensifies the error signal (baseline model ratio of 1.05 in the encoder and 0.98 in the decoder in Table 1), while the feed-forward sublayer manifests a decreasing effect (0.93). In particular, though the encoder-decoder attention sublayer and the self-attention sublayer share the same attention formulation, the model ratio of the former is smaller (0.74). As shown in Eq. 7 and 1, part of the reason is that encoder-decoder attention can only back-propagate gradients to lower layers through the query representation, bypassing gradients at the key and the value to the encoder side. This negative effect explains why the decoder suffers from more severe gradient vanishing than the encoder in Figure 1.

The gradient norm is preserved better through the self-attention layer than the encoder-decoder attention, which offers insights on the successful training of the deep Transformer in BERT Devlin et al. (2019) and GPT Radford et al. (2018), where encoder-decoder attention is not involved. However, results in Table 1 also suggest that the self-attention sublayer in the encoder is not strong enough to counteract the gradient loss in the feed-forward sublayer. That is why BERT and GPT adopt a much smaller standard deviation (0.02) for initialization, in a similar spirit to our solution.

We attribute the gradient vanishing issue to the large output variance of RC (Eq. 12). Considering that activation variance is positively correlated with parameter variance Glorot and Bengio (2010), we propose DS-Init and change the original initialization method in Eq. 9 as follows:

$$W \in \mathbb{R}^{d_i \times d_o} \sim \mathcal{U}\left(-\gamma \frac{\alpha}{\sqrt{l}},\ \gamma \frac{\alpha}{\sqrt{l}}\right) \quad (15)$$

where $\gamma$ is the initialization bound of Eq. 9, $\alpha$ is a hyperparameter in the range of $[0, 1]$ and $l$ denotes layer depth. Hyperparameter $\alpha$ improves the flexibility of our method. Compared with existing approaches Vaswani et al. (2018); Bapna et al. (2018), our solution does not require modifications in the model architecture and hence is easy to implement.

According to the property of the uniform distribution (a uniform distribution on $[-a, a]$ has variance $a^2/3$), the variance of model parameters decreases from $\gamma^2/3$ to $\gamma^2\alpha^2/(3l)$ after applying DS-Init. By doing so, a higher layer would have smaller output variance of RC so that more gradients can flow back. Results in Table 1 suggest that DS-Init brings both the variance and the different ratios closer to 1, ensuring the stability of gradient back-propagation. Evidence in Figure 1 also shows that DS-Init helps keep the gradient norm and slightly increases it on the encoder side. This is because DS-Init endows lower layers with parameters of larger variance and activations of larger norm. When error signals at different layers are of similar scale, the gradient norm at lower layers would be larger. Nevertheless, this increase does not hurt model training based on our empirical observation.
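A minimal sketch of depth-scaled initialization follows, assuming the uniform bound takes the form $\gamma\alpha/\sqrt{l}$ with the Glorot bound $\gamma = \sqrt{6/(d_i + d_o)}$ and layer depth $l$ starting from 1. Function names, shapes and the seed are illustrative and not taken from the authors' toolkit.

```python
import math
import random

def ds_init_bound(d_in, d_out, layer, alpha=1.0):
    """Uniform bound gamma * alpha / sqrt(layer); layer depth starts from 1."""
    if layer < 1:
        raise ValueError("layer depth starts from 1")
    gamma = math.sqrt(6.0 / (d_in + d_out))  # bound of the unscaled initialization
    return gamma * alpha / math.sqrt(layer)

def ds_init(d_in, d_out, layer, alpha=1.0, seed=0):
    """Sample a d_in x d_out matrix for the given layer depth."""
    rng = random.Random(seed)
    bound = ds_init_bound(d_in, d_out, layer, alpha)
    return [[rng.uniform(-bound, bound) for _ in range(d_out)] for _ in range(d_in)]

def ds_init_variance(d_in, d_out, layer, alpha=1.0):
    """Closed-form parameter variance: bound^2 / 3."""
    return ds_init_bound(d_in, d_out, layer, alpha) ** 2 / 3.0

for layer in (1, 4, 12):
    print(layer, ds_init_variance(512, 512, layer))

W4 = ds_init(8, 8, layer=4)
```

Only the sampling range changes with depth, so the scheme can be applied by passing the layer index to an existing initializer; nothing in the forward computation is modified.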

DS-Init is partially inspired by the Fixup initialization Zhang et al. (2019). Both of them try to reduce the output variance of RC. The difference is that Fixup focuses on overcoming gradient explosion caused by consecutive RCs and seeks to enable training without LN, but at the cost of carefully handling parameter initialization of each matrix transformation, including manipulating initialization of different bias and scale terms. Instead, DS-Init aims at solving gradient vanishing in deep Transformer caused by the structure of RC followed by LN. We still employ LN to standardize layer activation and improve model convergence. The inclusion of LN ensures the stability and simplicity of DS-Init.

6 Merged Attention Model

With large model depth, deep Transformer unavoidably introduces high computational overhead. This brings about significantly longer training and decoding time. To remedy this issue, we propose a merged attention model for decoder that integrates a simplified average-based self-attention sublayer into the encoder-decoder attention sublayer. Figure 2 highlights the difference.

The AAN model (Figure 2(b)), as an alternative to the self-attention model (Figure 2(a)), accelerates Transformer decoding by allowing decoding in linear time, avoiding the quadratic complexity of the self-attention mechanism Zhang et al. (2018). Unfortunately, the gating sublayer and the feed-forward sublayer inside AAN reduce the empirical performance improvement. We propose a simplified AAN by removing all matrix computation except for two linear projections:

$$\text{SAan}(S^{l-1}) = \left[M_a (S^{l-1} W_v)\right] W_o \quad (16)$$

where $M_a$ denotes the average mask matrix for parallel computation Zhang et al. (2018). This new model is then combined with the encoder-decoder attention as shown in Figure 2(c):

$$\text{MAtt}(S^{l-1}) = \text{SAan}(S^{l-1}) + \text{Att}(S^{l-1}, H^L)$$
$$\bar{S}^l = \text{LN}(\text{RC}(S^{l-1}, \text{MAtt}(S^{l-1}))) \quad (17)$$

The mapping $W_o$ is shared for SAan and Att. After combination, MAtt allows for the parallelization of AAN and encoder-decoder attention.
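The merged sublayer can be sketched as follows. This is a single-head toy version under stated assumptions: the average mask is taken to be a lower-triangular matrix whose $j$-th row averages positions up to $j$, the two branches are summed before one shared output projection (equivalent to projecting each branch with the same matrix, by linearity), and residual connection and layer normalization are left out. All names and the toy inputs are hypothetical.

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def average_mask(n):
    """Lower-triangular mask: row j averages positions 0..j (cumulative average)."""
    return [[1.0 / (j + 1) if k <= j else 0.0 for k in range(n)] for j in range(n)]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def merged_attention(S, H, Wq, Wk, Wv_enc, Wv_self, Wo):
    d = len(Wq[0])
    # Simplified AAN branch: cumulative average of linearly projected decoder states.
    avg = matmul(average_mask(len(S)), matmul(S, Wv_self))
    # Encoder-decoder attention branch: decoder queries over encoder keys/values.
    Q, K, V = matmul(S, Wq), matmul(H, Wk), matmul(H, Wv_enc)
    scores = matmul(Q, [list(col) for col in zip(*K)])  # Q K^T
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    ctx = matmul(weights, V)
    # The two branches are independent, so they can be computed in parallel,
    # then summed and passed through the shared output projection.
    merged = [[a + c for a, c in zip(ra, rc)] for ra, rc in zip(avg, ctx)]
    return matmul(merged, Wo)

I2 = [[1.0, 0.0], [0.0, 1.0]]                 # identity weights for readability
S = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]      # 3 decoder positions
H = [[0.5, 0.5], [1.0, -1.0]]                 # 2 encoder positions
out = merged_attention(S, H, I2, I2, I2, I2, I2)
print(out)
```

Because the averaging branch needs no query-key products over target positions, each decoding step can update a running average instead of attending over all previous target tokens.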

7 Experiments

| Dataset | #Src | #Tgt | #Sent | #BPE |
|---|---|---|---|---|
| WMT14 En-De | 116M | 110M | 4.5M | 32K |
| WMT14 En-Fr | 1045M | 1189M | 36M | 32K |
| WMT18 En-Fi | 73M | 54M | 3.3M | 32K |
| WMT18 Zh-En | 510M | 576M | 25M | 32K |
| IWSLT14 De-En | 3.0M | 3.2M | 159K | 30K |

Table 2: Statistics for different training datasets. #Src and #Tgt denote the number of source and target tokens respectively. #Sent: the number of bilingual sentences. #BPE: the number of merge operations in BPE. M: million, K: thousand.

ID Model #Param Test14 Dec Train
1 Base 6 layers 72.3M 27.59 (26.9) 62.26/1.00 0.105/1.00
2 1 + T2T 72.3M 27.20 (26.5) 68.04/0.92 0.105/1.00
3 1 + DS-Init 72.3M 27.50 (26.8) /1.00 /1.00
4 1 + MAtt 66.0M 27.49 (26.8) 40.51/1.54 0.094/1.12
5 1 + MAtt + DS-Init 66.0M 27.35 (26.8) 40.84/1.52 0.094/1.12
6 1 + MAtt with self-attention 72.3M 27.41 (26.7) 60.25/1.03 0.105/1.00
7 1 + MAtt with original AAN 72.2M 27.36 (26.7) 46.13/1.35 0.098/1.07
8 1 + 72.3M 27.84 (27.2) /1.00 /1.00
9 1 + layers + - - - -
10 4 + layers + - - -
11 3 + 12 layers + 116.4M 28.27 (27.6) 102.9/1.00 0.188/1.00
12 11 + T2T 116.5M 28.03 (27.4) 107.7/0.96 0.191/0.98
13 11 + MAtt 103.8M 28.55 (27.9) 67.12/1.53 0.164 /1.15
14 3 + 20 layers + 175.3M 28.42 (27.7) 157.8/1.00 0.283/1.00
15 14 + T2T 175.3M 28.27 (27.6) 161.2/0.98 0.289/0.98
16 14 + MAtt 154.3M 28.67 (28.0) 108.6/1.45 0.251/1.13
Table 3: Tokenized case-sensitive BLEU (in parentheses: sacreBLEU) on WMT14 En-De translation task. #Param: number of model parameters. Dec: decoding time (seconds)/speedup on newstest2014 dataset with a batch size of 32. Train: training time (seconds)/speedup per training step evaluated on 0.5K steps with a batch size of 1K target tokens. Time is averaged over 3 runs using TensorFlow on a single TITAN X (Pascal). “-”: optimization failed and no result. A time entry with no value before the slash is the same as for model 1. Speedups for models 12, 13 and 15, 16 are measured against models 11 and 14 respectively rather than model 1. Base: the baseline Transformer with base setting. Bold indicates best BLEU score. Model descriptions additionally specify the dropout rate on attention weights and residual connection, and the batch size in tokens.

7.1 Datasets and Evaluation

We take WMT14 English-German translation (En-De) Bojar et al. (2014) as our benchmark for model analysis, and examine the generalization of our approach on four other tasks: WMT14 English-French (En-Fr), IWSLT14 German-English (De-En) Cettolo et al. (2014), WMT18 English-Finnish (En-Fi) and WMT18 Chinese-English (Zh-En) Bojar et al. (2018). Byte pair encoding algorithm (BPE) Sennrich et al. (2016) is used in preprocessing to handle low frequency words. Statistics of different datasets are listed in Table 2.

Except for the IWSLT14 De-En task, we collect subword units independently on the source and target side of training data. We directly use the preprocessed training data from the WMT18 website for En-Fi and Zh-En tasks, and use newstest2017 as our development set, newstest2018 as our test set. Our training data for WMT14 En-De and WMT14 En-Fr is identical to previous setups Vaswani et al. (2017); Wu et al. (2019). We use newstest2013 as development set for WMT14 En-De and newstest2012+2013 for WMT14 En-Fr. Apart from the newstest2014 test set (we use the filtered test set consisting of 2737 sentence pairs; the difference in translation quality between the filtered and full test sets is marginal), we also evaluate our model on all WMT14-18 test sets for WMT14 En-De translation. The settings for IWSLT14 De-En are as in Ranzato et al. (2016), with 7584 sentence pairs for development, and the concatenated dev sets for IWSLT 2014 as test set (tst2010, tst2011, tst2012, dev2010, dev2012).

We report tokenized case-sensitive BLEU Papineni et al. (2002) for WMT14 En-De and WMT14 En-Fr, and provide detokenized case-sensitive BLEU for WMT14 En-De, WMT18 En-Fi and Zh-En with sacreBLEU Post (2018) (signature: BLEU+c.mixed+#.1+s.exp+tok.13a+v.1.2.20). We also report chrF score for En-Fi translation, which was found to correlate better with human evaluation Bojar et al. (2018). Following previous work Wu et al. (2019), we evaluate IWSLT14 De-En with tokenized case-insensitive BLEU.

7.2 Model Settings

We experiment with both base (layer size 512/2048, 8 heads) and big (layer size 1024/4096, 16 heads) settings as in Vaswani et al. (2017). Besides the vanilla Transformer, we also compare with the structure that is currently default in tensor2tensor (T2T), which puts layer normalization before residual blocks Vaswani et al. (2018). We use an in-house toolkit for all experiments.

Dropout is applied to the residual connection and attention weights. We share the target embedding matrix with the softmax projection matrix but not with the source embedding matrix. We train all models using the Adam optimizer ($\beta_1$/$\beta_2$ of 0.9/0.98 for base, 0.9/0.998 for big) with an adaptive learning rate schedule (warm-up step 4K for base, 16K for big) as in Vaswani et al. (2017) and label smoothing of 0.1. We set $\alpha$ in DS-Init to 1.0. Sentence pairs containing around 25K to 50K target tokens are grouped into one batch. We use a relatively larger batch size and dropout rate for deeper and bigger models for better convergence. We perform evaluation by averaging the last 5 checkpoints. Besides, we apply mixed-precision training to all big models. Unless otherwise stated, we train base and big models with 300K maximum steps, and decode sentences using beam search with a beam size of 4 and length penalty of 0.6. Decoding is implemented with a cache to save redundant computations. Other settings for specific translation tasks are explained in the individual subsections.
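For reference, the learning rate schedule of Vaswani et al. (2017) increases the rate linearly during warm-up and then decays it with the inverse square root of the step number. The sketch below uses that published formula with the base setting (model size 512, 4K warm-up steps); whether the in-house toolkit applies additional scaling is not stated, so treat the exact values as an assumption.

```python
def transformer_lr(step, d_model=512, warmup=4000):
    """lr = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)."""
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

for step in (1000, 4000, 16000, 300000):
    print(step, transformer_lr(step))

peak = transformer_lr(4000)
```

The two arguments of `min` cross exactly at `step == warmup`, so the peak learning rate is reached at the end of warm-up.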

7.3 WMT14 En-De Translation Task

Table 3 summarizes translation results under different settings. Applying DS-Init and/or MAtt to Transformer with 6 layers slightly decreases translation quality by 0.2 BLEU (from 27.59 to 27.35). However, they allow scaling up to deeper architectures, achieving a BLEU score of 28.55 (12 layers) and 28.67 (20 layers), outperforming all baselines. These improvements cannot be obtained via enlarging the training batch size (model 8), confirming the strength of deep models.

We also compare our simplified AAN in MAtt (model 4) with two variants: a self-attention network (model 6), and the original AAN (model 7). Results show minor differences in translation quality, but improvements in training and decoding speed, and a reduction in the number of model parameters. Compared to the baseline, MAtt improves decoding speed by 50%, and training speed by 10%, while having 9% fewer parameters.
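The rounded percentages above follow from the raw numbers for models 1 and 4 in Table 3. The short calculation below reproduces them; the dictionary layout is just a convenience for this illustration.

```python
# Values copied from Table 3: model 1 (Base, 6 layers) and model 4 (1 + MAtt).
base = {"params_m": 72.3, "dec_s": 62.26, "train_s": 0.105}
matt = {"params_m": 66.0, "dec_s": 40.51, "train_s": 0.094}

dec_speedup = base["dec_s"] / matt["dec_s"]            # decoding speedup
train_speedup = base["train_s"] / matt["train_s"]      # per-step training speedup
param_reduction = 1.0 - matt["params_m"] / base["params_m"]

print(f"decoding speedup:  {dec_speedup:.2f}x")
print(f"training speedup:  {train_speedup:.2f}x")
print(f"fewer parameters:  {param_reduction:.1%}")
```

The computed speedups match the 1.54 and 1.12 entries in the table, which the text rounds to roughly 50% and 10%.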

The result for model 9 indicates that the gradient vanishing issue prevents training of deep vanilla Transformers, which cannot be solved by only simplifying the decoder via MAtt (model 10). By contrast, both T2T and DS-Init can help. Our DS-Init improves norm preservation through specific parameter initialization, while T2T reschedules the LN position. Results in Table 3 show that T2T underperforms DS-Init by 0.2 BLEU on average, and slightly increases training and decoding time (by 2%) compared to the original Transformer due to additional LN layers. This suggests that our solution is more effective and efficient.

| ID | BLEU (Train) | BLEU (Dev) | PPL (Train) | PPL (Dev) |
|---|---|---|---|---|
| 1 | 28.64 | 26.16 | 5.23 | 4.76 |
| 11 | 29.63 | 26.44 | 4.48 | 4.38 |
| 12 | 29.75 | 26.16 | 4.60 | 4.49 |
| 13 | 29.43 | 26.51 | 5.09 | 4.71 |
| 14 | 30.71 | 26.52 | 3.96 | 4.32 |
| 15 | 30.89 | 26.53 | 4.09 | 4.41 |
| 16 | 30.25 | 26.56 | 4.62 | 4.58 |

Table 4: Tokenized case-sensitive BLEU (BLEU) and perplexity (PPL) on training (Train) and development (newstest2013, Dev) set. IDs refer to the models in Table 3. We randomly select 3K sentence pairs as our training data for evaluation. Lower PPL is better.

Surprisingly, training deep Transformers with both DS-Init and MAtt improves not only running efficiency but also translation quality (by 0.2 BLEU), compared with DS-Init alone. To get an improved understanding, we analyze model performance on both training and development set. Results in Table 4 show that models with DS-Init yield the best perplexity on both training and development set, and those with T2T achieve the best BLEU on the training set. However, DS-Init+MAtt performs best in terms of BLEU on the development set. This indicates that the success of DS-Init+MAtt comes from its better generalization rather than better fitting training data.

| Task | Model | #Param | BLEU | Dec | Train |
|---|---|---|---|---|---|
| WMT14 En-Fr | Base + 6 layers | 76M | 39.09 | 167.56/1.00 | 0.171/1.00 |
| | Ours + Base + 12 layers | 108M | 40.58 | 173.62/0.97 | 0.265/0.65 |
| IWSLT14 De-En | Base + 6 layers | 61M | 34.41 | 315.59/1.00 | 0.153/1.00 |
| | Ours + Base + 12 layers | 92M | 35.63 | 329.95/0.96 | 0.247/0.62 |
| WMT18 En-Fi | Base + 6 layers | 65M | 15.5 (50.82) | 156.32/1.00 | 0.165/1.00 |
| | Ours + Base + 12 layers | 96M | 15.8 (51.47) | 161.74/0.97 | 0.259/0.64 |
| WMT18 Zh-En | Base + 6 layers | 77M | 21.1 | 217.40/1.00 | 0.173/1.00 |
| | Ours + Base + 12 layers | 108M | 22.3 | 228.57/0.95 | 0.267/0.65 |

Table 5: Translation results on different tasks. Settings for BLEU score are given in Section 7.1. Numbers in brackets denote chrF score. Our model outperforms the vanilla base Transformer on all tasks. “Ours”: DS-Init+MAtt.

Figure 3: Test BLEU score on newstest2014 with respect to model depth for Transformer+DS-Init+MAtt.

| Model | #Param | Test14 | Test14-18 |
|---|---|---|---|
| Vaswani et al. (2017) | 213M | 28.4 | - |
| Chen et al. (2018) | 379M | 28.9 | - |
| Ott et al. (2018) | 210M | 29.3 | - |
| Bapna et al. (2018) | 137M | 28.04 | - |
| Wu et al. (2019) | 213M | 29.76 (29.0) | 33.13 (32.86) |
| Big + 6 layers | 233M | 29.07 (28.3) | 33.16 (32.88) |
| Ours + Big + 12 layers | 359M | 29.47 (28.7) | 33.21 (32.90) |
| Ours + Big + 20 layers | 560M | 29.62 (29.0) | 33.26 (32.96) |

Table 6: Tokenized case-sensitive BLEU (sacreBLEU) on WMT14 En-De translation task. “Test14-18”: BLEU score averaged over newstest2014 to newstest2018. Results for Wu et al. (2019) include numbers obtained by running their released code and model. “Ours”: DS-Init+MAtt.

We also attempt to apply DS-Init on the encoder alone or the decoder alone for 12-layer models. Unfortunately, both variants lead to unstable optimization where gradients tend to explode at the middle of training. We attempt to solve this issue with gradient clipping of rate 1.0. Results show that this fails for decoder and achieves only 27.89 BLEU for encoder, losing 0.66 BLEU compared with the full variant (28.55). We leave further analysis to future work and recommend using DS-Init on both the encoder and the decoder.

Effect of Model Depth. We empirically compare a wider range of model depths for Transformer+DS-Init+MAtt with up to 30 layers. Hyperparameters are the same as for model 14 except that we use 42K and 48K batch size for 18 and 30 layers respectively. Figure 3 shows that deeper Transformers yield better performance. However, improvements are steepest going from 6 to 12 layers, and further improvements are small.

7.3.1 Comparison with Existing Work

Table 6 lists the results in big setting and compares with current SOTA. Big models are trained with and . The 6-layer baseline and the deeper ones are trained with batch size of 48K and 54K respectively. Deep Transformer with our method outperforms its 6-layer counterpart by over 0.4 points on newstest2014 and around 0.1 point on newstest2014 to newstest2018. Our model outperforms the transparent model Bapna et al. (2018) (+1.58 BLEU), an approach for the deep encoder. Our model performs on par with current SOTA, the dynamic convolution model (DCNN) Wu et al. (2019). In particular, though DCNN achieves encouraging performance on newstest2014, it falls behind the baseline on other test sets. By contrast, our model obtains more consistent performance improvements.

In work concurrent to ours, Wang et al. (2019) discuss how the placement of layer normalization affects deep Transformers, and compare the original post-norm (which we consider our baseline) and a pre-norm layout (which we call T2T). Their results also show that pre-norm allows training of deeper Transformers. Our results show that deep post-norm Transformers are also trainable with appropriate initialization, and tend to give slightly better results.

7.4 Results on Other Translation Tasks

We use 12 layers for our model in these tasks. We enlarge the dropout rate to for IWSLT14 De-En task and train models on WMT14 En-Fr and WMT18 Zh-En with 500K steps. Other models are trained with the same settings as in WMT14 En-De.

We report translation results on other tasks in Table 5. Results show that our model beats the baseline on all tasks with gains of over 1 BLEU, except the WMT18 En-Fi where our model yields marginal BLEU improvements (+0.3 BLEU). We argue that this is due to the rich morphology of Finnish, and BLEU’s inability to measure improvements below the word level. We also provide the chrF score in which our model gains 0.6 points. In addition, speed measures show that though our model consumes 50+% more training time, there is only a small difference with respect to decoding time thanks to MAtt.
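The gains and speed ratios quoted above can be recomputed directly from Table 5. The sketch below does so for each task; the tuple layout is an illustrative choice, and the numbers are copied from the table.

```python
# (baseline BLEU, ours BLEU, baseline/ours decoding seconds, baseline/ours train seconds)
table5 = {
    "WMT14 En-Fr": (39.09, 40.58, 167.56, 173.62, 0.171, 0.265),
    "IWSLT14 De-En": (34.41, 35.63, 315.59, 329.95, 0.153, 0.247),
    "WMT18 En-Fi": (15.5, 15.8, 156.32, 161.74, 0.165, 0.259),
    "WMT18 Zh-En": (21.1, 22.3, 217.40, 228.57, 0.173, 0.267),
}

bleu_gain, dec_ratio, train_ratio = {}, {}, {}
for task, (b0, b1, d0, d1, t0, t1) in table5.items():
    bleu_gain[task] = round(b1 - b0, 2)   # BLEU improvement of the 12-layer model
    dec_ratio[task] = d1 / d0             # relative decoding time
    train_ratio[task] = t1 / t0           # relative training time per step
    print(f"{task}: +{bleu_gain[task]} BLEU, "
          f"decoding x{dec_ratio[task]:.2f}, training x{train_ratio[task]:.2f}")
```

Training time per step grows by more than half in every task, while decoding time grows by only about 3 to 5 percent even though the model is twice as deep.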

7.5 Analysis of Training Dynamics

Figure 4: Gradient norm (y-axis) of the first and the last encoder layers (top) and decoder layers (bottom) in 18-layer deep Transformer over the first 5k training steps. We use around 25k source/target tokens in each training batch. Each point in this plot is averaged over 50 training steps. “L1/L18” denotes the first/last layer. DS-Init helps stabilize the gradient norm during training.

Our analysis in Figure 1 and Table 1 is based on gradients estimated exactly after parameter initialization without considering training dynamics. Optimizers with adaptive step rules, such as Adam, could have an adverse effect that enables gradient scale correction through the accumulated first and second moments. However, results in Figure 4 show that without DS-Init, the encoder gradients are less stable and the decoder gradients still suffer from the vanishing issue, particularly at the first layer. DS-Init makes the training more stable and robust. (We observe this both in the raw gradients and after taking the Adam step rules into account.)
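The kind of curve shown in Figure 4 can be approximated with a small amount of bookkeeping. The following is a minimal sketch, not the authors' plotting code: it assumes per-layer gradients are available as flat lists of floats, and that "averaged over 50 training steps" means non-overlapping windows. The function names `grad_norm` and `window_average` are hypothetical.

```python
import math


def grad_norm(grads):
    """L2 norm of one layer's gradient, given as a flat list of floats."""
    return math.sqrt(sum(g * g for g in grads))


def window_average(values, window=50):
    """Average consecutive, non-overlapping windows of `window` values.

    Assumption: the caption's "averaged over 50 training steps" is read as
    non-overlapping windows; a trailing partial window is averaged as-is.
    """
    averaged = []
    for start in range(0, len(values), window):
        chunk = values[start:start + window]
        averaged.append(sum(chunk) / len(chunk))
    return averaged


# Usage: record grad_norm(...) for a layer at every step, then smooth.
per_step_norms = [grad_norm([0.3, 0.4])] * 100  # placeholder values
smoothed = window_average(per_step_norms, window=50)
```

Tracking the first and last layers separately, as in the figure, makes it easy to see whether the norm at the bottom of the stack shrinks relative to the top.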

8 Conclusion and Future Work

This paper discusses training of very deep Transformers. We show that the training of deep Transformers suffers from gradient vanishing, which we mitigate with depth-scaled initialization. To improve training and decoding efficiency, we propose a merged attention sublayer that integrates a simplified average-based self-attention sublayer into the encoder-decoder attention sublayer. Experimental results show that deep models trained with these techniques clearly outperform a vanilla Transformer with 6 layers in terms of BLEU, and outperform other solutions to train deep Transformers Bapna et al. (2018); Vaswani et al. (2018). Thanks to the more efficient merged attention sublayer, we achieve these quality improvements while matching the decoding speed of the baseline model.
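As an illustration of the depth-scaled initialization idea, the sketch below shrinks a Xavier-style uniform bound by a factor that depends on layer depth. This section does not restate the exact formula, so the form used here is an assumption: weights are drawn from U(-b, b) with b = sqrt(6 / (d_in + d_out)) * alpha / sqrt(l), where l is the 1-based layer index and alpha is a hyperparameter in (0, 1]. The names `ds_bound` and `ds_init` are hypothetical.

```python
import math
import random


def ds_bound(d_in, d_out, layer, alpha=1.0):
    """Uniform bound for a weight matrix in layer `layer` (1-based).

    Assumption: Xavier uniform bound sqrt(6 / (d_in + d_out)),
    scaled by alpha / sqrt(layer).
    """
    gamma = math.sqrt(6.0 / (d_in + d_out))
    return gamma * alpha / math.sqrt(layer)


def ds_init(d_in, d_out, layer, alpha=1.0, rng=None):
    """Sample a d_in x d_out matrix (list of rows) from U(-bound, bound)."""
    if rng is None:
        rng = random.Random(0)  # fixed seed so the sketch is reproducible
    bound = ds_bound(d_in, d_out, layer, alpha)
    return [[rng.uniform(-bound, bound) for _ in range(d_out)]
            for _ in range(d_in)]


# Deeper layers get a smaller bound, hence lower parameter variance.
shallow = ds_bound(512, 512, layer=1)
deep = ds_bound(512, 512, layer=4)  # half of `shallow` under this assumption
```

Under this assumed form, the variance of each weight decreases in proportion to 1/l, which matches the stated aim of lowering parameter variance at initialization for deeper layers.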

In the future, we would like to extend our model to other sequence-to-sequence tasks, such as summarization and dialogue generation, as well as adapt the idea to other generative architectures Zhang et al. (2016, 2018b). We have trained models with up to 30 layers each for the encoder and decoder, and while training was successful and improved over shallower counterparts, gains are relatively small beyond 12 layers. An open question is whether there are other structural issues that limit the benefits of increasing the depth of the Transformer architecture, or whether the benefit of very deep models is greater for other tasks and datasets.


We thank the reviewers for their insightful comments. This project has received funding from the grant H2020-ICT-2018-2-825460 (ELITR) by the European Union. Biao Zhang also acknowledges the support of the Baidu Scholarship. This work has been performed using resources provided by the Cambridge Tier-2 system operated by the University of Cambridge Research Computing Service ( funded by EPSRC Tier-2 capital grant EP/P020259/1.


  • J. L. Ba, J. R. Kiros, and G. E. Hinton (2016) Layer normalization. arXiv preprint arXiv:1607.06450. Cited by: §1, §2.
  • D. Bahdanau, K. Cho, and Y. Bengio (2015) Neural Machine Translation by Jointly Learning to Align and Translate. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §1, §2.
  • A. Bapna, M. Chen, O. Firat, Y. Cao, and Y. Wu (2018) Training deeper neural machine translation models with transparent attention. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 3028–3033. External Links: Link Cited by: §2, §5, §7.3.1, Table 6, §8.
  • O. Bojar, C. Buck, C. Federmann, B. Haddow, P. Koehn, J. Leveling, C. Monz, P. Pecina, M. Post, H. Saint-Amand, R. Soricut, L. Specia, and A. Tamchyna (2014) Findings of the 2014 workshop on statistical machine translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation, Baltimore, Maryland, USA, pp. 12–58. External Links: Link Cited by: §7.1.
  • O. Bojar, C. Federmann, M. Fishel, Y. Graham, B. Haddow, P. Koehn, and C. Monz (2018) Findings of the 2018 conference on machine translation (WMT18). In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, Belgium, Brussels, pp. 272–303. External Links: Link Cited by: §7.1, §7.1.
  • D. Britz, A. Goldie, M. Luong, and Q. Le (2017) Massive exploration of neural machine translation architectures. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, pp. 1442–1451. External Links: Link, Document Cited by: §2.
  • M. Cettolo, J. Niehues, S. Stüker, L. Bentivogli, and M. Federico (2014) Report on the 11th IWSLT Evaluation Campaign, IWSLT 2014. In Proceedings of the 11th Workshop on Spoken Language Translation, Lake Tahoe, CA, USA, pp. 2–16. Cited by: §7.1.
  • M. X. Chen, O. Firat, A. Bapna, M. Johnson, W. Macherey, G. Foster, L. Jones, M. Schuster, N. Shazeer, N. Parmar, A. Vaswani, J. Uszkoreit, L. Kaiser, Z. Chen, Y. Wu, and M. Hughes (2018) The best of both worlds: combining recent advances in neural machine translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 76–86. External Links: Link Cited by: §1, §2, Table 6.
  • R. Child, S. Gray, A. Radford, and I. Sutskever (2019) Generating long sequences with sparse transformers. CoRR abs/1904.10509. Cited by: §2.
  • K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, pp. 1724–1734. Cited by: §2.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186. External Links: Link, Document Cited by: §1, §2, §5.
  • M. A. Di Gangi and M. Federico (2018) Deep neural machine translation with weakly-recurrent units. In Proceedings of EAMT, Alicante, Spain. Cited by: §2.
  • J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin (2017) Convolutional sequence to sequence learning. In Proceedings of the 34th International Conference on Machine Learning, D. Precup and Y. W. Teh (Eds.), Proceedings of Machine Learning Research, Vol. 70, International Convention Centre, Sydney, Australia, pp. 1243–1252. External Links: Link Cited by: §2.
  • F. A. Gers and J. Schmidhuber (2001) Long Short-Term Memory Learns Context Free and Context Sensitive Languages. In Proceedings of the ICANNGA 2001 Conference, Vol. 1, pp. 134–137. Cited by: §2.
  • M. Ghazvininejad, O. Levy, Y. Liu, and L. S. Zettlemoyer (2019) Constant-time machine translation with conditional masked language models. CoRR abs/1904.09324. Cited by: §2.
  • X. Glorot and Y. Bengio (2010) Understanding the difficulty of training deep feedforward neural networks. Journal of Machine Learning Research - Proceedings Track 9, pp. 249–256. Cited by: §3, §5.
  • J. Gu, J. Bradbury, C. Xiong, V. O.K. Li, and R. Socher (2018) Non-autoregressive neural machine translation. In International Conference on Learning Representations, External Links: Link Cited by: §2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2015) Deep residual learning for image recognition. CoRR abs/1512.03385. External Links: Link, 1512.03385 Cited by: §1, §1, §2.
  • S. Hochreiter and J. Schmidhuber (1997) Long Short-Term Memory. Neural Comput. 9 (8), pp. 1735–1780. External Links: Document, ISSN 0899-7667, Link Cited by: §2.
  • M. Junczys-Dowmunt, K. Heafield, H. Hoang, R. Grundkiewicz, and A. Aue (2018) Marian: cost-effective high-quality neural machine translation in C++. In Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, Melbourne, Australia. External Links: Link Cited by: §2.
  • M. Ott, S. Edunov, D. Grangier, and M. Auli (2018) Scaling neural machine translation. arXiv preprint arXiv:1806.00187. Cited by: Table 6.
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL ’02, Stroudsburg, PA, USA, pp. 311–318. External Links: Link, Document Cited by: §7.1.
  • N. Pham, T. Nguyen, J. Niehues, M. Müller, and A. H. Waibel (2019) Very deep self-attention networks for end-to-end speech recognition. CoRR abs/1904.13377. Cited by: §2.
  • M. Post (2018) A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pp. 186–191. External Links: Link Cited by: §7.1.
  • A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever (2018) Improving language understanding by generative pre-training. Cited by: §1, §2, §5.
  • M. Ranzato, S. Chopra, M. Auli, and W. Zaremba (2016) Sequence Level Training with Recurrent Neural Networks. In The International Conference on Learning Representations, Cited by: §7.1.
  • R. Sennrich, B. Haddow, and A. Birch (2016) Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, pp. 1715–1725. External Links: Link, Document Cited by: §7.1.
  • K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, Cited by: §1.
  • D. R. So, C. Liang, and Q. V. Le (2019) The evolved transformer. arXiv preprint arXiv:1901.11117. Cited by: §2.
  • M. Stern, N. Shazeer, and J. Uszkoreit (2018) Blockwise parallel decoding for deep autoregressive models. In Advances in Neural Information Processing Systems, pp. 10086–10095. Cited by: §2.
  • A. Vaswani, S. Bengio, E. Brevdo, F. Chollet, A. Gomez, S. Gouws, L. Jones, Ł. Kaiser, N. Kalchbrenner, N. Parmar, R. Sepassi, N. Shazeer, and J. Uszkoreit (2018) Tensor2Tensor for neural machine translation. In Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Papers), Boston, MA, pp. 193–199. External Links: Link Cited by: §2, §5, §7.2, §8.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 5998–6008. External Links: Link Cited by: §1, §2, §3, §7.1, §7.2, §7.2, Table 6.
  • M. Wang, Z. Lu, J. Zhou, and Q. Liu (2017) Deep neural machine translation with linear associative unit. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 136–145. External Links: Link, Document Cited by: §2.
  • F. Wu, A. Fan, A. Baevski, Y. Dauphin, and M. Auli (2019) Pay less attention with lightweight and dynamic convolutions. In International Conference on Learning Representations, External Links: Link Cited by: §7.1, §7.1, §7.3.1, Table 6.
  • Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, J. Klingner, A. Shah, M. Johnson, X. Liu, Ł. Kaiser, S. Gouws, Y. Kato, T. Kudo, H. Kazawa, K. Stevens, G. Kurian, N. Patil, W. Wang, C. Young, J. Smith, J. Riesa, A. Rudnick, O. Vinyals, G. Corrado, M. Hughes, and J. Dean (2016) Google’s neural machine translation system: bridging the gap between human and machine translation. CoRR abs/1609.08144. External Links: Link Cited by: §1, §2.
  • A. Zaeemzadeh, N. Rahnavard, and M. Shah (2018) Norm-preservation: why residual networks can become extremely deep?. CoRR abs/1805.07477. External Links: Link, 1805.07477 Cited by: §4.
  • B. Zhang, D. Xiong, J. Su, H. Duan, and M. Zhang (2016) Variational neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, pp. 521–530. External Links: Link, Document Cited by: §8.
  • B. Zhang, D. Xiong, and J. Su (2018) Accelerating neural transformer via an average attention network. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 1789–1798. External Links: Link Cited by: §1, §2, §6.
  • B. Zhang, D. Xiong, and J. Su (2018a) Neural machine translation with deep attention. IEEE Transactions on Pattern Analysis and Machine Intelligence (), pp. 1–1. External Links: Document, ISSN 0162-8828 Cited by: §1, §2.
  • H. Zhang, Y. N. Dauphin, and T. Ma (2019) Fixup initialization: residual learning without normalization via better initialization. In International Conference on Learning Representations, External Links: Link Cited by: §2, §5.
  • X. Zhang, J. Su, Y. Qin, Y. Liu, R. Ji, and H. Wang (2018b) Asynchronous bidirectional decoding for neural machine translation. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §8.
  • J. Zhou, Y. Cao, X. Wang, P. Li, and W. Xu (2016) Deep recurrent models with fast-forward connections for neural machine translation. Transactions of the Association for Computational Linguistics 4, pp. 371–383. External Links: Link, Document Cited by: §1, §2.