1 Introduction
The multi-layer structure allows neural models to model complicated functions. Increasing the depth of a model can increase its capacity but may also cause optimization difficulties Mhaskar et al. (2017); Telgarsky (2016); Eldan and Shamir (2016); He et al. (2016); Bapna et al. (2018).
For the Transformer translation model specifically, Vaswani et al. (2017) ease its optimization by employing residual connections He et al. (2016) and layer normalization Ba et al. (2016), techniques which have proven useful in reducing the optimization difficulties of deep neural networks on various tasks.
When it comes to deep Transformers, previous works Bapna et al. (2018); Wang et al. (2019); Zhang et al. (2019); Xu et al. (2020) are motivated by ensuring that the outputs of early layers are conveyed with enough weight to the final prediction stage, so that those layers receive sufficient gradients of good quality (mostly aiming to train their outputs for the prediction of the ground truth). That is, they attempt either to prevent residual connections from shrinking Zhang et al. (2019); Xu et al. (2020) or to compensate for possibly faded residual connections Bapna et al. (2018); Wang et al. (2019); Wei et al. (2020).
In this paper, we first shed light on the problems of the residual connection, which otherwise simply and effectively ensures the convergence of deep neural networks. We then propose to train Transformers with the depth-wise LSTM, which regards the outputs of layers as steps in a time series, instead of residual connections. The motivation is that deep models have difficulty converging because shallow layers cannot receive clear gradients from the loss function, which is far away from them (in the forward propagation, their outputs cannot be clearly conveyed to the classifier), while the LSTM Hochreiter and Schmidhuber (1997) has proven capable of capturing long-distance relationships, even though it performs better with short sentences Linzen et al. (2016), and it may alleviate some drawbacks of residual connections (discussed in Section 2) while ensuring convergence.
Generalizing the advantages of the LSTM to deep computation has already been proposed by Kalchbrenner et al. (2016), who suggest that the vanishing gradient problem suffered by deep networks is the same as that of recurrent networks applied to long sequences. In our work, however, we explicitly propose to replace the residual connections of the strong and popular Transformer with the depth-wise LSTM, which is non-trivial. In addition, our approach of integrating the computation of multi-head attention networks and feed-forward networks with the depth-wise LSTM for the Transformer is more complex than their work, which solely connects LSTM cells across stacked LSTM layers; we show how to utilize the depth-wise LSTM in place of the residual connection.
Our contributions in this paper are as follows:

We suggest that the popular residual connection has its drawbacks, and propose to use the depth-wise LSTM instead of residual connections for the training of Transformers, which is non-trivial.

We integrate the depth-wise LSTM with the other parts of Transformer layers (multi-head attention networks and feed-forward networks), which demonstrates how to use the depth-wise LSTM to replace residual connections.

In our experiments, we show that the 6-layer Transformer using the depth-wise LSTM brings significant improvements over its counterpart with residual connections. In deep Transformer experiments, we show that the depth-wise LSTM is also able to ensure the convergence of deep Transformers at every depth we test (up to the 24-layer models of Table 5), and that the 12-layer Transformer using the depth-wise LSTM already performs comparably to the 24-layer Transformer with residual connections, which suggests more efficient use of per-layer parameters with the depth-wise LSTM than with residual connections.

To measure the effects of the nonlinearity of a layer on performance, we propose to distill the layer under analysis in the trained model into a linear transformation, which cannot sustain any nonlinearity, and to observe the performance degradation brought by the replacement.
2 Preliminaries: Residual Connection and its Issue
He et al. (2016) present the residual learning framework to ease the training of deep neural networks, by explicitly reformulating the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions.
Specifically, they suggest that if the layers added on top of a shallow model are identity mappings, the deeper model should produce no higher training error than its shallower counterpart. They attribute the convergence issue of deep networks that stack nonlinear layers to the difficulty nonlinear layers have in learning the identity function, which means that their training encounters more difficulties than that of a layer which can easily model the identity function. Thus, they propose to explicitly let these layers fit a residual mapping:
(1) 
where x is the input to the nonlinear layer, and H(x) and F(x) are the function of that nonlinear layer and that of the corresponding residual layer, respectively.
As it is easier for almost all nonlinear layers to learn a zero function, which consistently outputs zeros, than to learn the identity function, He et al. (2016) suggest that residual connections can reduce the training difficulty of deep neural networks. With the help of the residual connection, they successfully train very deep convolutional networks with high performance on various tasks.
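The argument above can be illustrated with a minimal NumPy sketch. The toy layer `nonlinear_layer`, its weights and the sizes are hypothetical and chosen only for illustration; the sketch assumes the usual form of the residual connection described in the text, in which the input of the layer is added to its output.

```python
import numpy as np

def nonlinear_layer(x, w_in, w_out):
    """A toy 2-layer nonlinear transformation (hypothetical stand-in for a layer)."""
    return np.maximum(x @ w_in, 0.0) @ w_out  # ReLU between two projections

def residual_layer(x, w_in, w_out):
    """The same layer wrapped with a residual connection: input + layer output."""
    return x + nonlinear_layer(x, w_in, w_out)

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
w_in = rng.normal(size=(4, 8))
w_out_zero = np.zeros((8, 4))  # makes the wrapped layer a zero function

# With a zero function inside, the residual layer is exactly the identity.
identity_recovered = np.allclose(residual_layer(x, w_in, w_out_zero), x)
# Without the residual connection, the same weights discard the input entirely.
plain_is_zero = np.allclose(nonlinear_layer(x, w_in, w_out_zero), 0.0)
print(identity_recovered, plain_is_zero)  # True True
```

Driving a single projection to zero is enough for the residual layer to behave as the identity, whereas the plain layer would have to learn the identity through its nonlinearity.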
The Transformer Vaswani et al. (2017) also employs the residual connection to ensure the convergence of the 6-layer model, and further empirical results show that, as long as the residual connection is not normalized by the layer normalization, deeper Transformers can also converge and bring further improvements Wang et al. (2019); Xu et al. (2020).
However, we suggest that the motivation behind the residual connection, which tries to ensure that each layer can learn the identity function, seems to be at odds with the motivation for using deep models, which is to model complicated functions through nonlinearity. As a result, the residual connection, which adds the input of the layer to its output so that the model can skip one or more layers, may waste the nonlinearity provided by the skipped layers, i.e. the complexity of the model function. Correspondingly, it is a common observation in practice that the performance improvements become smaller and smaller as depth increases, and deep models seem to have difficulty using their parameters as efficiently as their shallow counterparts.
In this paper, we suggest that the residual connection, which simply accumulates the representations of various layers with element-wise addition, may have the following drawbacks:

It accumulates the outputs of layers equally and lacks an evaluation mechanism to combine representations based on their importance and reliability. This may lead to two problems: 1) residual models may require many layers to overcome poor-quality outputs from a few layers; 2) deep models, which aggregate the outputs of many layers into a vector of fixed dimension, are likely to incur information loss, which means that some layers may work on generating representations which are never used.

After the addition of two representations, there is no way for subsequent layers to distinguish the representations involved, which may pose a challenge when a layer needs to use information of different levels (e.g. linguistic properties of different levels in NLP) differently.
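The second drawback can be made concrete with a small NumPy example. The vectors below are hypothetical layer outputs chosen for illustration: two different pairs of representations yield the same element-wise sum, while their concatenations remain distinguishable.

```python
import numpy as np

# Two different pairs of (hypothetical) layer outputs.
a1, b1 = np.array([1.0, 2.0]), np.array([3.0, 0.0])
a2, b2 = np.array([3.0, 1.0]), np.array([1.0, 1.0])

sum1, sum2 = a1 + b1, a2 + b2
cat1, cat2 = np.concatenate([a1, b1]), np.concatenate([a2, b2])

# After addition, a subsequent layer sees identical inputs for both pairs.
print(np.array_equal(sum1, sum2))  # True
# After concatenation, the two sources can still be told apart.
print(np.array_equal(cat1, cat2))  # False
```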
3 Transformer with Depth-Wise LSTM
3.1 Depth-Wise LSTM and its Advantages
Intuitively, deep models have difficulty converging because shallow layers cannot receive clear gradients from the loss, which is far away from them (in the forward propagation, their outputs cannot be clearly conveyed to the classifier). The LSTM, which is able to capture long-distance relationships (using the representation of a distant token when computing the current token), should likewise be able to use the output of the first layer when computing the last layer if it is applied in a depth-wise way (regarding the sequence of layers as a sequence of tokens). Thus, we propose to replace the residual connection with the depth-wise LSTM, which takes one step per layer over the layer outputs, in a layer-by-layer rather than token-by-token manner, as illustrated in Figure 1.
We employ the LSTM equipped with layer normalization in this work following Chen et al. (2018), which provides better performance as the NMT decoder than the vanilla LSTM. The computation graph of the LSTM is shown in Figure 2.
Specifically, it first concatenates the input to the LSTM with the output of the LSTM in the last step :
(2) 
where “” indicates concatenation and “” is the concatenated vector.
Then, the LSTM computes three gates (specifically, input gate , forget gate and output gate ) together with the hidden representation with the concatenated representation:
(3) 
(4) 
(5) 
(6) 
where and are weight and bias parameters, is the sigmoid activation function, “LN” is the layer normalization, “” stands for the activation function for the computation of the hidden state.
The layer normalization Ba et al. (2016) is computed as follows:
(7) 
where and are the input and corresponding computation result of the layer normalization, and stand for the mean and standard deviation of , and are two vector parameters initialized by ones and zeros respectively.
In this work, we use the GeLU activation function Hendrycks and Gimpel (2016), which is employed by BERT Devlin et al. (2019), rather than the tanh function used in Hochreiter and Schmidhuber (1997); Chen et al. (2018).
We suppose that the role of the hidden state computation in Equation 6 is similar to that of the position-wise feed-forward sublayer in each encoder and decoder layer, so we remove the feed-forward sublayer from encoder and decoder layers. We additionally study the effect of computing the hidden state with the 2-layer feed-forward network adopted in Transformer layers, in which case the feed-forward sublayer is integrated into the computation of the depth-wise LSTM, as shown in Equation 8.
(8) 
After the computation of the hidden state, the cell and the output of the LSTM unit are computed as:
(9) 
(10) 
where indicates the elementwise multiplication.
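The steps described in this subsection can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the parameter layout, the names (`init_params`, `depthwise_lstm_step`), the initialization scale and the tanh approximation of GeLU are all hypothetical, the cell is updated in the standard LSTM way (forget gate times previous cell plus input gate times hidden state), and the output is taken as the output gate times the cell without a further squashing function. The hidden state uses the single-layer form; the 2-layer feed-forward variant is not shown.

```python
import numpy as np

def layer_norm(x, gain, bias, eps=1e-6):
    """Normalize the last dimension by its mean and standard deviation, then scale and shift."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return gain * (x - mean) / (std + eps) + bias

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gelu(x):
    """tanh approximation of the GeLU activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def init_params(d_model, rng):
    """One weight, bias and layer-norm parameter set per gate and for the hidden state."""
    params = {}
    for name in ("input", "forget", "output", "hidden"):
        params[name] = {
            "W": rng.normal(0.0, 0.1, size=(2 * d_model, d_model)),
            "b": np.zeros(d_model),
            "gain": np.ones(d_model),   # layer-norm scale, initialized by ones
            "bias": np.zeros(d_model),  # layer-norm shift, initialized by zeros
        }
    return params

def transform(v, p):
    """Affine transformation followed by layer normalization."""
    return layer_norm(v @ p["W"] + p["b"], p["gain"], p["bias"])

def depthwise_lstm_step(layer_input, prev_output, prev_cell, params):
    # Concatenate the input of this step with the LSTM output of the previous step.
    v = np.concatenate([layer_input, prev_output], axis=-1)
    i_gate = sigmoid(transform(v, params["input"]))
    f_gate = sigmoid(transform(v, params["forget"]))
    o_gate = sigmoid(transform(v, params["output"]))
    hidden = gelu(transform(v, params["hidden"]))
    cell = f_gate * prev_cell + i_gate * hidden  # assumed standard cell update
    output = o_gate * cell                       # assumed: gated cell, no extra squashing
    return output, cell

rng = np.random.default_rng(0)
d = 8
params = init_params(d, rng)
x = rng.normal(size=(5, d))  # 5 token positions
out, cell = depthwise_lstm_step(x, np.zeros((5, d)), np.zeros((5, d)), params)
print(out.shape, cell.shape)  # (5, 8) (5, 8)
```

Each token position is processed independently here; the recurrence runs over layers, so one call corresponds to one layer of depth.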
Compared to the residual connection, we suggest that: 1) the gate mechanism of the depth-wise LSTM (Equations 3, 4 and 5) can serve as an evaluation mechanism that treats representations from different sources differently; 2) the computation of its hidden state (Equation 6) is performed on the concatenated representation instead of the element-wise sum, which allows information of different levels to be used differently.
We use the depth-wise LSTM rather than a depth-wise multi-head attention network, with which the NMT model could be built solely on the attention mechanism, for two reasons:

Even with the multi-head attention network, the computation has to proceed layer by layer, as in decoding, so it would not benefit from GPU parallelization or bring significant acceleration.

The attention mechanism linearly combines representations with attention weights. Thus, compared to the LSTM, it lacks the ability to provide nonlinearity, which we suggest is important.
3.2 Encoder Layer with Depth-Wise LSTM
Directly replacing residual connections with LSTM units would introduce a huge amount of additional parameters and computation. Given that the task of computing the LSTM hidden state is similar to that of the feed-forward sublayer in the original Transformer layers, we propose to replace the feed-forward sublayer with the newly introduced LSTM unit, which introduces only one LSTM unit per layer.
The original Transformer encoder layer contains only two sublayers: the self-attention sublayer, based on the multi-head attention network, which collects information from contexts, and the 2-layer feed-forward network sublayer, which evolves representations with its nonlinearity.
In the new encoder layer, which uses the depth-wise LSTM unit (forward propagating across the depth dimension rather than the token dimension) instead of the residual connection, the layer first performs the self-attention computation. The depth-wise LSTM unit then takes the self-attention results together with the output and the cell of the previous layer to compute the output and the cell of the current layer. The architecture of the encoder layer with the depth-wise LSTM unit is shown in Figure 3.
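The wiring of the encoder stack described above can be sketched as follows. Everything here is a simplified, hypothetical stand-in: `self_attention` is single-head attention without learned projections, `toy_lstm_step` omits layer normalization and uses tanh for the hidden state, and starting the recurrence from the embeddings as the initial output with a zero initial cell is an assumption rather than something stated in the text.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def self_attention(x):
    """Toy single-head scaled dot-product self-attention (no learned projections)."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

def toy_lstm_step(x, prev_out, prev_cell, w):
    """Simplified LSTM step over depth (layer normalization omitted for brevity)."""
    v = np.concatenate([x, prev_out], axis=-1)
    i_gate, f_gate, o_gate = (sigmoid(v @ w[k]) for k in ("i", "f", "o"))
    hidden = np.tanh(v @ w["h"])
    cell = f_gate * prev_cell + i_gate * hidden
    return o_gate * cell, cell

def encode(embeddings, layer_weights):
    out = embeddings                  # assumed initial LSTM output
    cell = np.zeros_like(embeddings)  # assumed initial LSTM cell
    for w in layer_weights:           # one LSTM step per encoder layer
        attn = self_attention(out)    # self-attention sublayer, no residual connection
        out, cell = toy_lstm_step(attn, out, cell, w)
    return out

rng = np.random.default_rng(0)
d, num_layers = 4, 3
weights = [{k: rng.normal(0.0, 0.5, size=(2 * d, d)) for k in ("i", "f", "o", "h")}
           for _ in range(num_layers)]
encoded = encode(rng.normal(size=(5, d)), weights)
print(encoded.shape)  # (5, 4)
```

The output and the cell are the only state carried from one layer to the next, which is what lets the LSTM unit take the place of the residual connection in this sketch.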
3.3 Decoder Layer with Depth-Wise LSTM
Unlike the encoder layer, the decoder layer involves two multi-head attention sublayers: the self-attention sublayer, which attends to the decoding history, and the cross-attention sublayer, which brings in information from the source side. Given that the depth-wise LSTM unit takes only one input, we introduce a merge layer to collect the outputs of these two sublayers and merge them into one as the input to the LSTM unit.
Specifically, the decoder layer with depth-wise LSTM first computes the self-attention sublayer and the cross-attention sublayer as in the original decoder layer. It then merges the outputs of these two sublayers and feeds the merged representation into the depth-wise LSTM unit, which also takes the cell and the output of the previous layer, to compute the output of the current decoder layer and the cell of the LSTM. We examine both element-wise addition and concatenation as the merging operation. The architecture is shown in Figure 4.
For the input of the cross-attention sublayer, we still use the sum of the self-attention outputs and the input to this decoder layer, as in the standard decoder layer, to utilize both the self-attention results and the outputs of the previous layer. Since the computation of the LSTM hidden state (Equation 6 or 8) does not add its input to its output, this sum is not carried across layers, and we suggest that it is not a residual connection.
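The two merging operations can be sketched as below. The function names and the model size are hypothetical; the sketch only shows that concatenation widens the vector fed to the LSTM unit, and hence its weight matrices, while element-wise addition keeps the width unchanged.

```python
import numpy as np

def merge(self_attn_out, cross_attn_out, mode="add"):
    """Merge the outputs of the two attention sublayers into one LSTM input."""
    if mode == "add":
        return self_attn_out + cross_attn_out
    if mode == "concat":
        return np.concatenate([self_attn_out, cross_attn_out], axis=-1)
    raise ValueError(f"unknown merge mode: {mode}")

def lstm_input_width(d_model, mode):
    """Width of the vector the LSTM unit transforms: merged input plus previous output."""
    merged = d_model if mode == "add" else 2 * d_model
    return merged + d_model

d_model = 8  # hypothetical size for illustration
self_out = np.ones((7, d_model))
cross_out = np.ones((7, d_model))
for mode in ("add", "concat"):
    merged = merge(self_out, cross_out, mode)
    print(mode, merged.shape, lstm_input_width(d_model, mode))
# add (7, 8) 16
# concat (7, 16) 24
```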
4 Analysis of Layer’s Non-Linearity on Performance
As suggested above, the residual connection eases the optimization of deep models by explicitly enabling them to model the identity function, which may hamper the nonlinearity of layers and lead to less complex model functions, in contrast to the motivation of modeling a complicated function by stacking layers. How does the nonlinearity provided by a layer affect the performance?
We propose to measure the contribution of a layer’s nonlinearity to performance by replacing the layer under analysis in the fully trained model with a linear transformation, which cannot sustain any nonlinearity, and observing the performance degradation brought by the replacement.
Specifically, in the standard forward propagation of the converged model, the function of the layer computes the output of that layer given its input :
(11) 
To analyze the impacts of the nonlinearity of that layer on the performance, we change its computation to:
(12) 
where and are the weight matrix and the bias, trained on the same training set with the other parts of the well-trained model frozen.
The training of aims to distill the linear transformation in the function to while removing all nonlinear transformation in , since Equation 12 does not have any capability in providing nonlinearity.
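The idea of distilling a layer into a linear transformation can be sketched as follows. Note the swapped-in technique: the text trains the weight matrix and the bias on the training set with the rest of the model frozen, whereas this sketch, as a simplifying assumption, fits them by ordinary least squares on recorded input and output pairs of the layer. The function names and the toy data are hypothetical.

```python
import numpy as np

def fit_linear_replacement(layer_inputs, layer_outputs):
    """Least-squares fit of a weight matrix and bias mapping layer inputs to outputs."""
    n = layer_inputs.shape[0]
    design = np.hstack([layer_inputs, np.ones((n, 1))])  # extra column for the bias
    solution, *_ = np.linalg.lstsq(design, layer_outputs, rcond=None)
    return solution[:-1], solution[-1]  # weight matrix, bias vector

def apply_linear(x, weight, bias):
    return x @ weight + bias

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 6))
a = rng.normal(size=(6, 6))

# A purely linear "layer" is recovered (up to numerical precision) ...
w, b = fit_linear_replacement(x, x @ a + 1.0)
linear_error = np.abs(apply_linear(x, w, b) - (x @ a + 1.0)).max()

# ... while a nonlinear one leaves a residual that the linear map cannot express.
y = np.tanh(x @ a)
w, b = fit_linear_replacement(x, y)
nonlinear_error = np.abs(apply_linear(x, w, b) - y).max()
print(linear_error < 1e-8, nonlinear_error > 1e-2)
```

The remaining error of the nonlinear case is the part of the layer function that only its nonlinearity provides, which is what the performance degradation in the analysis is meant to expose.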
5 Experiment
We implemented our approach based on the Neutron implementation of the Transformer Xu and Liu (2019). To show the effects of our approach on the 6-layer Transformer, we first conducted experiments on the WMT 14 English to German and English to French news translation tasks to compare with Vaswani et al. (2017). Additionally, we examined the impact of our approach on deep Transformers, with experiments on the WMT 14 English to German task and the WMT 15 Czech to English task following Bapna et al. (2018); Xu et al. (2020).
For the WMT 14 English to German and English to French news translation tasks, the concatenation of newstest 2012 and newstest 2013 was used for validation and newstest 2014 as the test set. For the WMT 15 Czech to English task, newstest 2013 was the validation set and newstest 2015 was the test set.
5.1 Settings
We applied joint Byte-Pair Encoding (BPE) Sennrich et al. (2016) with merging operations on both data sets to address the unknown word issue. We only kept sentences with a maximum of subword tokens for training. Training sets were randomly shuffled in every training epoch.
Though Zhang et al. (2019); Xu et al. (2020) suggest that using a large batch size may lead to improved performance, we used a batch size of target tokens, achieved through gradient accumulation over small batches, to compare fairly with previous work Vaswani et al. (2017); Xu et al. (2020). The training steps for Transformer Base and Transformer Big were and respectively, following Vaswani et al. (2017).
The number of warm-up steps was set to (footnote: https://github.com/tensorflow/tensor2tensor/blob/v1.15.4/tensor2tensor/models/transformer.py#L1818), and each training batch contained at least target tokens. We used a dropout of for all experiments except for the Transformer Big on the En-De task, for which it was . For the Transformer Base setting, the embedding dimension and the hidden dimension of the position-wise feed-forward neural network were and respectively; the corresponding values for the Transformer Big setting were and respectively. We employed a label smoothing Szegedy et al. (2016) value of . We used the Adam optimizer Kingma and Ba (2015) with , and as , and . We followed Vaswani et al. (2017) for the other settings.
For deep Transformers, we used the computation order of layer normalization, processing, dropout, residual connection, which is able to converge without introducing additional approaches that may affect the performance, according to Wang et al. (2019); Xu et al. (2020).
We used a beam size of for decoding, and evaluated tokenized case-sensitive BLEU (footnote: https://github.com/mosessmt/mosesdecoder/blob/master/scripts/generic/multibleu.perl) with the averaged model of the last checkpoints for the Transformer Base setting and checkpoints for the Transformer Big setting, saved with an interval of training steps. We also conducted significance tests Koehn (2004).
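Averaging checkpoints, as used for the evaluated models, amounts to taking the element-wise mean of each parameter over the saved checkpoints. The sketch below assumes each checkpoint is a plain dictionary of NumPy arrays; the function name and the toy values are hypothetical, and real checkpoints would first be loaded from disk with the training framework in use.

```python
import numpy as np

def average_checkpoints(checkpoints):
    """Element-wise mean of every parameter over a list of checkpoints."""
    names = checkpoints[0].keys()
    return {name: np.mean([ckpt[name] for ckpt in checkpoints], axis=0)
            for name in names}

# Three toy checkpoints with two parameters each.
ckpts = [{"w": np.full((2, 2), float(i)), "b": np.array([i, 2.0 * i])}
         for i in (1, 2, 3)]
avg = average_checkpoints(ckpts)
print(avg["w"][0, 0], avg["b"])  # 2.0 [2. 4.]
```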
5.2 Main Results
We first examine the effects of our approach on the 6-layer Transformer on the WMT 14 English-German and English-French tasks to compare with Vaswani et al. (2017). Results are shown in Table 1.
| Models | En-De | En-Fr |
| --- | --- | --- |
| Transformer Base | 27.55 | 39.54 |
| with depth-wise LSTM | 28.41 | 40.02 |
| Transformer Big | 28.63 | 41.52 |
| with depth-wise LSTM | 29.42 | 43.04 |
In our approach (“with depth-wise LSTM”), we used the 2-layer neural network for the computation of the LSTM hidden state (as in Equation 8) and shared the parameters for computing the LSTM gates (Equations 3, 4 and 5) across stacked encoder / decoder layers. Further details can be found in our ablation study.
Table 1 shows that using the depth-wise LSTM to ensure the convergence of the Transformer brings significant improvements over the Transformer with residual connections on both tasks, with both the Transformer Base setting and the Transformer Big setting.
With the base setting, our approach brings a larger improvement on the English-German task (+0.86 BLEU) than on the English-French task (+0.48 BLEU). We conjecture that this is because performance on the English-French task, which uses a large dataset, may rely more on the capacity of the model (i.e. the number of parameters) than on the complexity of the modeling function (i.e. depth of the model, per-layer nonlinearity strength, etc.). With the Transformer Big model, which contains more parameters than the Transformer Base, the improvement on En-Fr (+1.52 BLEU, from 41.52 to 43.04) is larger than that on En-De (+0.79 BLEU, from 28.63 to 29.42).
5.3 Ablation Study
We first study the effects on performance of the two types of computation for the LSTM hidden state (Equations 6 and 8) on the WMT 14 En-De task. Results are shown in Table 2.
| Hidden state computation | BLEU |
| --- | --- |
| LSTM (Equation 6) | 27.84 |
| 2-layer FFN (Equation 8) | 28.41 |
Table 2 shows that the 2-layer feed-forward neural network used in Transformer layers outperforms the original computation of the LSTM hidden state, which uses only one layer (28.41 against 27.84 BLEU). This is consistent with intuition.
We also study two merging operations, concatenation and element-wise addition, for combining the self-attention sublayer output and the cross-attention sublayer output as the input of the depth-wise LSTM unit in decoder layers. Results are shown in Table 3.
| Merging | BLEU |
| --- | --- |
| Concat | 28.28 |
| Add | 28.41 |
Table 3 shows that, counter-intuitively, the element-wise addition merging operation empirically results in slightly higher BLEU than the concatenation operation, while introducing fewer parameters. Thus, we use the element-wise addition operation in our experiments by default.
Since the number of layers is pre-specified, the depth-wise LSTM units of all layers can either be shared or be independent (i.e. the parameters of the depth-wise LSTM may or may not be bound across stacked layers). Table 2 supports the importance of the capacity of the module that computes the hidden state, and sharing that module is likely to hurt its capacity. We therefore additionally study sharing only the parameters for gate computation (Equations 3, 4 and 5), as well as sharing all parameters (i.e. the parameters for both the computation of gates and that of the hidden state). Results are shown in Table 4.
| Sharing | BLEU |
| --- | --- |
| All | 26.89 |
| Gate | 28.41 |
| None | 28.21 |
Table 4 shows that: 1) sharing the parameters for the computation of the LSTM hidden state significantly hampers performance, which is consistent with our conjecture; 2) sharing the parameters for the computation of gates (Equations 3, 4 and 5) leads to slightly higher BLEU than not sharing them (“None” in Table 4), while introducing fewer parameters. Thus, in the other experiments, we bind the parameters for the computation of LSTM gates across stacked layers by default.
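The three sharing schemes can be sketched by letting layers reference the same parameter arrays. This is an illustrative sketch only: the builder functions, sizes and initialization are hypothetical, just one weight matrix is kept per gate and for the hidden state, and biases and layer normalization parameters are left out of the count.

```python
import numpy as np

def make_gate_params(d_model, rng):
    """One weight matrix per gate (input, forget, output)."""
    return {g: rng.normal(0.0, 0.1, size=(2 * d_model, d_model))
            for g in ("input", "forget", "output")}

def make_hidden_params(d_model, rng):
    """One weight matrix for the hidden state computation."""
    return {"hidden": rng.normal(0.0, 0.1, size=(2 * d_model, d_model))}

def build_layers(num_layers, d_model, sharing, rng):
    """sharing is one of "none", "gate" or "all"."""
    shared_gates = make_gate_params(d_model, rng)
    shared_hidden = make_hidden_params(d_model, rng)
    layers = []
    for _ in range(num_layers):
        gates = shared_gates if sharing in ("gate", "all") else make_gate_params(d_model, rng)
        hidden = shared_hidden if sharing == "all" else make_hidden_params(d_model, rng)
        layers.append({"gates": gates, "hidden": hidden})
    return layers

def count_unique_parameters(layers):
    """Count each distinct array once, however many layers reference it."""
    seen = {}
    for layer in layers:
        for group in layer.values():
            for array in group.values():
                seen[id(array)] = array.size
    return sum(seen.values())

rng = np.random.default_rng(0)
for sharing in ("none", "gate", "all"):
    layers = build_layers(6, 4, sharing, rng)
    print(sharing, count_unique_parameters(layers))
# none 768
# gate 288
# all 128
```

Binding only the gate parameters removes most of the duplicated weights in this toy count while leaving each layer its own hidden state computation, which matches the default chosen above.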
5.4 Deep Transformers
To examine whether the depth-wise LSTM is able to ensure the convergence of deep Transformers and how it performs with them, we conduct experiments on the WMT 14 English to German task and the WMT 15 Czech to English task following Bapna et al. (2018); Xu et al. (2020), and compare our approach with the Transformer in which residual connections are not normalized by layer normalization. Results are shown in Table 5.
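| Layers | En-De (Std) | En-De (Ours) | Cs-En (Std) | Cs-En (Ours) |
| --- | --- | --- | --- | --- |
| 6 | 27.55 | 28.41 | 28.40 | 29.05 |
| 12 | 28.12 | 29.20 | 29.38 | 29.60 |
| 18 | 28.60 | 29.23 | 29.61 | 30.08 |
| 24 | 29.02 | 29.09 | 29.73 | 29.95 |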
Table 5 shows that, though the BLEU improvements seem to saturate as depth increases, the depth-wise LSTM is able to ensure the convergence of Transformers at every depth tested, up to the 24-layer model.
On the En-De task, the 12-layer Transformer with depth-wise LSTM (29.20) already outperforms the 24-layer Transformer with residual connections (29.02), suggesting efficient use of layer parameters.
On the Cs-En task, the 12-layer model with our approach (29.60) performs comparably to the 24-layer model with residual connections (29.73). Unlike on the En-De task, increasing depth beyond 12 layers still brings some BLEU improvements, and the 18-layer model gives the best performance (30.08). We conjecture that this is probably because the data set of the Cs-En task is larger than that of the En-De task, so that increasing the depth of the model for the Cs-En task also helps by increasing its number of parameters and its capacity, while for the En-De task the 12-layer Transformer with depth-wise LSTM may already provide both sufficient complexity and capacity for the data set.
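5.5 Layer Non-Linearity Analysis
Table 6: 6-layer standard Transformer with residual connections.

| Layer | Encoder BLEU | Encoder reduction (%) | Decoder BLEU | Decoder reduction (%) |
| --- | --- | --- | --- | --- |
| None | 27.55 | 0.00 | 27.55 | 0.00 |
| 1 | 27.17 | 1.38 | 27.62 | -0.25 |
| 2 | 27.11 | 1.60 | 27.64 | -0.33 |
| 3 | 27.09 | 1.67 | 27.47 | 0.29 |
| 4 | 27.07 | 1.74 | 27.53 | 0.07 |
| 5 | 27.15 | 1.45 | 26.96 | 2.14 |
| 6 | 27.24 | 1.13 | 26.42 | 4.10 |

Table 7: 6-layer Transformer with depth-wise LSTM.

| Layer | Encoder BLEU | Encoder reduction (%) | Decoder BLEU | Decoder reduction (%) |
| --- | --- | --- | --- | --- |
| None | 28.41 | 0.00 | 28.41 | 0.00 |
| 1 | 27.50 | 3.20 | 28.30 | 0.39 |
| 2 | 26.60 | 6.37 | 28.17 | 0.84 |
| 3 | 27.09 | 4.65 | 28.22 | 0.67 |
| 4 | 27.41 | 3.52 | 28.05 | 1.27 |
| 5 | 27.87 | 1.90 | 26.82 | 5.60 |
| 6 | 27.84 | 2.01 | 19.87 | 30.06 |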
To study the contribution of each layer to the overall performance (i.e. how the output of each layer is utilized by the other layers and the classifier in translation), we perform the layer efficiency analysis on the WMT 14 En-De task, using the reduction in BLEU to show the contribution of individual layers. Results of removing the nonlinearity of each layer of the 6-layer standard Transformer with residual connections and of the corresponding model with the depth-wise LSTM are shown in Tables 6 and 7 respectively. For the encoder and the decoder, the tables report the BLEU score after distilling the layer into the linear transformation and its relative reduction compared to the full model performance (in percentages).
Compared to Table 6, Table 7 shows that the normalized performance loss of each layer with the depth-wise LSTM is larger than that with residual connections, which suggests that individual layers trained with the depth-wise LSTM play a more important role in the overall performance than those trained with residual connections.
Another interesting observation is that the performance degradation from removing the nonlinearity of decoder layers 1 to 3 of the Transformer with depth-wise LSTM is relatively small, suggesting possible redundancy in the 6-layer decoder. More surprisingly, removing the nonlinearity of the first or the second decoder layer of the 6-layer standard Transformer with residual connections even leads to slight BLEU improvements (27.62 and 27.64 against 27.55), which supports the first drawback of residual connections that we suggest (described in Section 2) and the first advantage of using the depth-wise LSTM (described in Section 3.1). We conjecture that residual connections may try to train the first and second decoder layers into linear transformations, but that this goal is not fully achieved by the end of training, while the evaluation mechanism (gates) of the depth-wise LSTM helps ensure that the nonlinearity of a layer at least does not degrade the performance.
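The relative reduction reported alongside each BLEU score can be reproduced from the scores themselves. The helper below is a hypothetical name for this arithmetic, applied to three of the reported scores; a negative value means that BLEU increased after the layer was replaced.

```python
def relative_reduction(full_bleu, distilled_bleu):
    """Relative BLEU reduction (in percent) compared to the full model."""
    return (full_bleu - distilled_bleu) / full_bleu * 100.0

# Standard Transformer, encoder layer 1 distilled.
print(round(relative_reduction(27.55, 27.17), 2))  # 1.38
# Depth-wise LSTM Transformer, decoder layer 6 distilled.
print(round(relative_reduction(28.41, 19.87), 2))  # 30.06
# Standard Transformer, decoder layer 1 distilled: BLEU goes up.
print(round(relative_reduction(27.55, 27.62), 2))  # -0.25
```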
6 Related Work
He et al. (2016) suggest that the nonlinear activation function makes it difficult for a layer without the residual connection to learn the identity function, so that a model without residual connections suffers from more severe convergence problems than a model with them. They present the residual learning framework to ease the training of deep neural networks, by explicitly reformulating the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. Srivastava et al. (2015) propose the highway network, which contains a transform gate and a carry gate to control how much of the output is produced by transforming the input and by carrying it, respectively. More recently, Chai et al. (2020) propose the highway Transformer, which integrates a self-gating mechanism into the Transformer. However, our work is quite different from theirs; for example, residual connections are still kept in their model.
Deep NMT.
Zhou et al. (2016) introduce fast-forward connections and an interleaved bi-directional architecture for stacking LSTM layers, which play an essential role in propagating gradients and building a deep topology. Wang et al. (2017) propose a novel Linear Associative Unit (LAU), which uses linear associative connections between the input and the output of the recurrent unit to shorten the gradient propagation path inside it.
Deep Transformers.
Bapna et al. (2018) propose the Transparent Attention (TA) mechanism, which improves gradient flow during back propagation by allowing each decoder layer to attend to weighted combinations of all encoder layer outputs, instead of just the top encoder layer. Wang et al. (2019) propose the Dynamic Linear Combination of Layers (DLCL) approach, which additionally aggregates the outputs of previous layers for each encoder layer. Wu et al. (2019) propose an effective two-stage approach which incrementally increases the depth of the encoder and the decoder of the Transformer Big model by freezing both the parameters and the encoder-decoder attention computation of pre-trained shallow layers. More recently, Wei et al. (2020) let each decoder layer attend to the encoder layer of the same depth and introduce a depth-wise GRU to additionally aggregate the outputs of all encoder layers for the top decoder layer, but residual connections are still kept in their approach. Zhang et al. (2019) propose the layer-wise Depth-Scaled Initialization (DS-Init) approach, which decreases parameter variance at the initialization stage and reduces the output variance of residual connections so as to ease gradient back-propagation through normalization layers. Xu et al. (2020) propose the Lipschitz constrained parameter initialization approach to reduce the standard deviation of layer normalization inputs and to ensure the convergence of deep Transformers.
7 Conclusion
In this paper, we suggest that the popular residual connection has its drawbacks. Inspired by the suggestion that the vanishing gradient problem suffered by deep networks is the same as that of recurrent networks applied to long sequences Kalchbrenner et al. (2016), we replace the residual connections of the Transformer with the depth-wise LSTM, which propagates through the depth dimension rather than the sequence dimension, given that the LSTM Hochreiter and Schmidhuber (1997) has proven capable of capturing long-distance relationships and that its design may alleviate some drawbacks of residual connections while ensuring convergence. Specifically, we show how to integrate the computation of multi-head attention networks and feed-forward networks with the depth-wise LSTM for the Transformer, and how to utilize the depth-wise LSTM in place of the residual connection.
Our experiment with the 6-layer Transformer shows that our approach using the depth-wise LSTM brings significant BLEU improvements over the standard Transformer with residual connections on both the WMT 14 English-German and English-French tasks. Our deep Transformer experiment demonstrates that: 1) our depth-wise LSTM approach is also able to ensure the convergence of deep Transformers at every depth tested (up to the 24-layer models of Table 5); 2) the 12-layer Transformer using the depth-wise LSTM already performs comparably to the 24-layer Transformer with residual connections, suggesting more efficient use of per-layer parameters with our depth-wise LSTM approach than with residual connections.
We propose to measure how the nonlinearity of a layer affects performance by replacing the layer under analysis in the trained model with a linear transformation, which cannot sustain any nonlinearity, and observing the performance degradation brought by the replacement. Our analysis results support the more efficient use of per-layer nonlinearity by the Transformer with depth-wise LSTM than by that with residual connections.
Acknowledgments
Hongfei Xu acknowledges the support of China Scholarship Council ([2018]3101, 201807040056). Deyi Xiong is supported by the National Natural Science Foundation of China (Grant No. 61861130364), the Natural Science Foundation of Tianjin (Grant No. 19JCZDJC31400) and the Royal Society (London) (NAFR1180122). Hongfei Xu and Josef van Genabith are supported by the German Federal Ministry of Education and Research (BMBF) under the funding code 01IW17001 (Deeplee).
References
Ba et al. (2016). Layer normalization. arXiv preprint arXiv:1607.06450.
Bapna et al. (2018). Training deeper neural machine translation models with transparent attention. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 3028–3033.
Devlin et al. (2019). BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186.
Eldan and Shamir (2016). The power of depth for feedforward neural networks. In 29th Annual Conference on Learning Theory, V. Feldman, A. Rakhlin, and O. Shamir (Eds.), Proceedings of Machine Learning Research, Vol. 49, Columbia University, New York, New York, USA, pp. 907–940.
He et al. (2016). Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. ISSN 10636919.
Hendrycks and Gimpel (2016). Gaussian error linear units (gelus). CoRR abs/1606.08415.
Hochreiter and Schmidhuber (1997). Long short-term memory. Neural Comput. 9 (8), pp. 1735–1780. ISSN 08997667.
Kalchbrenner et al. (2016). Grid long short-term memory. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.).
Kingma and Ba (2015). Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
Koehn (2004). Statistical significance tests for machine translation evaluation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing.
Linzen et al. (2016). Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics 4, pp. 521–535.
Mhaskar et al. (2017). When and why are deep networks better than shallow ones?. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, pp. 2343–2348.
Sennrich et al. (2016). Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1715–1725.
Szegedy et al. (2016). Rethinking the inception architecture for computer vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2818–2826. ISSN 10636919.
Telgarsky (2016). Benefits of depth in neural networks. In 29th Annual Conference on Learning Theory, V. Feldman, A. Rakhlin, and O. Shamir (Eds.), Proceedings of Machine Learning Research, Vol. 49, Columbia University, New York, New York, USA, pp. 1517–1539.
Vaswani et al. (2017). Attention is all you need. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 5998–6008.
Wang et al. (2019). Learning deep transformer models for machine translation. In Proceedings of the 57th Conference of the Association for Computational Linguistics, Florence, Italy, pp. 1810–1822.
Wei et al. (2020). Multiscale collaborative deep models for neural machine translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 414–426.
Xu et al. (2020). Lipschitz constrained parameter initialization for deep transformers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 397–402.
Xu and Liu (2019). Neutron: An Implementation of the Transformer Translation Model and its Variants. arXiv preprint arXiv:1903.07402.
Zhang et al. (2019). Improving deep transformer with depth-scaled initialization and merged attention. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 898–909.