1 Introduction
Deep neural networks have advanced the state of the art in various communities, from computer vision to natural language processing. Researchers have directed their efforts into designing patterns of modules that can be assembled systematically, which makes neural networks deeper and wider. However, one key challenge of training such huge networks lies in how to transform and combine information across layers. To encourage gradient flow and feature propagation, researchers in the field of computer vision have proposed various approaches, such as residual connections [He et al.2016], densely connected networks [Huang et al.2017], and deep layer aggregation [Yu et al.2018].
State-of-the-art neural machine translation (NMT) models generally implement the encoder and decoder as multiple layers [Wu et al.2016, Gehring et al.2017, Vaswani et al.2017, Chen et al.2018], in which only the top layer is exploited in the subsequent processes. Fusing information across layers for deep NMT models, however, has received substantially less attention. A few recent studies reveal that simultaneously exposing all layer representations outperforms methods that utilize just the top layer for natural language processing tasks [Peters et al.2018, Shen et al.2018, Wang et al.2018, Dou et al.2018]. However, their methods mainly focus on static aggregation, in that the aggregation mechanisms are the same across different positions in the sequence. Consequently, the useful context of sequences embedded in the layer representations, which could be used to further improve layer aggregation, is ignored.
In this work, we propose dynamic layer aggregation approaches, which allow the model to aggregate hidden states across layers for each position dynamically. We assign a distinct aggregation strategy for each symbol in the sequence, based on the corresponding hidden states that represent both syntactic and semantic information of this symbol. To this end, we propose several strategies to model the dynamic principles. First, we propose a simple dynamic combination mechanism, which assigns a distinct set of aggregation weights, learned by a feed-forward network, to each position. Second, inspired by the recent success of iterative routing on assigning parts to wholes for computer vision tasks [Sabour, Frosst, and Hinton2017, Hinton, Sabour, and Frosst2018], we apply the idea of routing-by-agreement to layer aggregation. Benefiting from high-dimensional coincidence filtering, i.e. the agreement between every two internal neurons, the routing algorithm has the ability to extract the most active features shared by multiple layer representations.
We evaluated our approaches on the standard Transformer model [Vaswani et al.2017] on two widely used translation tasks, WMT14 English-German and WMT17 Chinese-English. We show that although the static layer aggregation strategy indeed improves translation performance, which indicates the necessity and effectiveness of fusing information across layers for deep NMT models, our proposed dynamic approaches outperform their static counterpart. Also, our models consistently improve translation performance over the vanilla Transformer model across language pairs. It is worth mentioning that Transformer-Base with dynamic layer aggregation outperforms the vanilla Transformer-Big model with fewer than half of the parameters.
Contributions.
Our key contributions are:

Our study demonstrates the necessity and effectiveness of dynamic layer aggregation for NMT models, which benefits from exploiting useful context embedded in the layer representations.

Our work is among the few studies (cf. [Gong et al.2018, Zhao et al.2018]) which show that the idea of capsule networks can have promising applications in natural language processing tasks.
2 Background
2.1 Deep Neural Machine Translation
Deep representations have a noticeable effect on neural machine translation [Meng et al.2016, Zhou et al.2016, Wu et al.2016]. Generally, a multi-layer encoder and decoder are employed to perform the translation task through a series of nonlinear transformations from the representation of input sequences to final output sequences.
Specifically, the encoder is composed of a stack of $L$ identical layers, with the bottom layer being the word embedding layer. Each encoder layer is calculated as
$$\mathbf{H}^{l} = \mathrm{LAYER}_{e}(\mathbf{H}^{l-1}) + \mathbf{H}^{l-1}, \tag{1}$$
where a residual connection [He et al.2016] is employed around each of the two layers. $\mathrm{LAYER}(\cdot)$ is the layer function, which can be implemented as an RNN [Cho et al.2014], a CNN [Gehring et al.2017], or a self-attention network (SAN) [Vaswani et al.2017]. In this work, we evaluate the proposed approach on the standard Transformer model, although it is generally applicable to any other type of NMT architecture.
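The decoder is also composed of a stack of $L$ layers:
$$\mathbf{S}^{l} = \mathrm{LAYER}_{d}(\mathbf{S}^{l-1}, \mathbf{H}^{L}) + \mathbf{S}^{l-1}, \tag{2}$$
which is calculated based on both the lower decoder layer $\mathbf{S}^{l-1}$ and the top encoder layer $\mathbf{H}^{L}$. The top layer of the decoder $\mathbf{S}^{L}$ is used to generate the final output sequence.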
As seen, both the encoder and decoder stack layers in sequence and only utilize the information in the top layer. While studies have shown deeper layers extract more semantic and more global features [Zeiler and Fergus2014, Peters et al.2018], these do not prove that the last layer is the ultimate representation for any task. Although residual connections have been incorporated to combine layers, these connections have been “shallow” themselves, and only fuse by simple, one-step operations [Yu et al.2018].
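The stacking described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `run_layer_stack` and `toy_layer` are hypothetical names, and the toy layer simply doubles its input instead of running a real self-attention, recurrent, or convolutional layer. The point is that every intermediate layer output is available, even though a standard model reads only the last one.

```python
def run_layer_stack(embeddings, layer_fns):
    """Apply each layer with a residual connection and keep every layer's output.

    embeddings: list of position vectors (the bottom, word-embedding layer).
    layer_fns:  hypothetical layer functions standing in for SAN/RNN/CNN layers.
    """
    outputs = [embeddings]
    for layer_fn in layer_fns:
        previous = outputs[-1]
        transformed = layer_fn(previous)
        # Residual connection: add the layer input back onto the layer output.
        current = [
            [t + p for t, p in zip(t_row, p_row)]
            for t_row, p_row in zip(transformed, previous)
        ]
        outputs.append(current)
    return outputs


def toy_layer(states):
    # Toy stand-in for a real layer: doubles every hidden unit.
    return [[2.0 * x for x in row] for row in states]


all_layers = run_layer_stack([[1.0, 2.0]], [toy_layer, toy_layer])
top_layer = all_layers[-1]  # a standard deep NMT model would use only this
```

Layer aggregation, as discussed in the following sections, operates on the full `all_layers` list rather than on `top_layer` alone.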
2.2 Exploiting Deep Representations
Recently, aggregating layers to better fuse semantic and spatial information has proven to be of profound value in computer vision tasks [Huang et al.2017, Yu et al.2018]. For machine translation, Shen et al. (2018) and Dou et al. (2018) have proven that simultaneously exposing all layer representations outperforms methods that utilize just the top layer on several generation tasks. Specifically, one of the methods proposed by Dou et al. (2018) is to linearly combine the outputs of all layers:
$$\widehat{\mathbf{H}} = \sum_{l=1}^{L} \mathbf{W}_{l} \mathbf{H}^{l}, \tag{3}$$
where $\{\mathbf{W}_{1}, \dots, \mathbf{W}_{L}\}$ are trainable parameter matrices, and $d$ is the dimensionality of hidden layers. The linear combination strategy is applied to both the encoder and decoder. The combined layer $\widehat{\mathbf{H}}$, which embeds all layer representations instead of only the top layer $\mathbf{H}^{L}$, is used in the subsequent processes.
As seen, the linear combination is encoded in a static set of weights $\{\mathbf{W}_{1}, \dots, \mathbf{W}_{L}\}$, which ignores the useful context of sentences that could further improve layer aggregation. In this work, we introduce the dynamic principles into layer aggregation mechanisms.
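A minimal sketch of the static linear combination follows, in plain Python. It assumes each layer is a matrix with one row per position and that each weight matrix is square in the hidden size, which is an assumption made here for illustration. Because positions are stored as rows, the product is written as layer times weight. The helper names are hypothetical.

```python
def matmul(a, b):
    # Plain-Python matrix product of a (rows x k) and b (k x cols).
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    # Element-wise sum of two matrices of the same shape.
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def linear_combination(layers, weights):
    """Static aggregation: the same weights[l] is used for every position.

    layers:  list of L matrices, each J x d (one row per position).
    weights: list of L matrices, each d x d (assumed shape, for illustration).
    """
    combined = matmul(layers[0], weights[0])
    for layer, weight in zip(layers[1:], weights[1:]):
        combined = add(combined, matmul(layer, weight))
    return combined


layer_1 = [[1.0, 0.0], [0.0, 1.0]]
layer_2 = [[1.0, 1.0], [2.0, 2.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
double = [[2.0, 0.0], [0.0, 2.0]]
combined = linear_combination([layer_1, layer_2], [identity, double])
```

Note that the weights do not depend on the content of the layers, which is exactly the limitation the dynamic approaches address.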
3 Approach
3.1 Dynamic Combination
An intuitive extension of static linear combination is to generate different weights for each layer combination rather than apply the same weights all the time. To this end, we calculate the weights of the linear combination as
$$\mathbf{W}_{l} = F_{l}(\mathbf{H}^{1}, \dots, \mathbf{H}^{L}) \in \mathbb{R}^{J \times d}, \tag{4}$$
where $J$ is the length of the hidden layer $\mathbf{H}^{l}$, and $F_{l}(\cdot)$ is a distinct feed-forward network associated with the $l$-th layer $\mathbf{H}^{l}$. Specifically, we use all the layer representations as the context, based on which we output a weight matrix that shares the same dimensionality with $\mathbf{H}^{l}$. Accordingly, the weights are adapted during inference depending on the input layer combination.
Our approach has two strengths. First, it is a more flexible strategy to dynamically combine layers by capturing contextual information among them, which is ignored by the conventional version. Second, the transformation matrix $\mathbf{W}_{l}$ offers the ability to assign a distinct weight to each state in the layers, while its static counterpart fails to exploit such strength, since the length of input layers varies across sentences and thus cannot be predefined.
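The dynamic combination can be sketched as below. Two assumptions are made for illustration and are not taken from the text: the feed-forward network of each layer is reduced to a single bias-free linear map over the concatenated layer states at a position, and the generated weight matrix is applied to its layer element-wise, which is one natural reading of a weight matrix that shares the layer's shape. All names are hypothetical.

```python
def dynamic_combination(layers, ffn_weights):
    """Position-dependent aggregation of layer states.

    layers:      L matrices of shape J x d (one row per position).
    ffn_weights: L matrices of shape (L*d) x d; ffn_weights[l] is a one-layer,
                 bias-free stand-in for the feed-forward network of layer l.
    """
    num_positions, d = len(layers[0]), len(layers[0][0])
    combined = [[0.0] * d for _ in range(num_positions)]
    for j in range(num_positions):
        # Context for position j: the states of all layers, concatenated.
        context = [x for layer in layers for x in layer[j]]
        for layer, ffn in zip(layers, ffn_weights):
            # Position-specific weights for this layer (one row of its weight matrix).
            weights = [sum(c * w for c, w in zip(context, column)) for column in zip(*ffn)]
            for k in range(d):
                # Assumed element-wise application of the generated weights.
                combined[j][k] += weights[k] * layer[j][k]
    return combined


layers = [[[1.0], [2.0]], [[3.0], [4.0]]]        # L = 2 layers, J = 2 positions, d = 1
ffn_weights = [[[1.0], [0.0]], [[0.0], [1.0]]]   # each layer's weight copies its own state
combined = dynamic_combination(layers, ffn_weights)
```

In this toy setting the two positions receive different weights because their layer states differ, whereas a static combination would weight both positions identically.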
3.2 Layer Aggregation as Capsule Routing
The goal of layer aggregation is to find a whole representation of the input from partial representations captured by different layers. This matches the aim of capsule networks, which have become an appealing alternative for solving the problem of assigning parts to wholes [Hinton, Krizhevsky, and Wang2011]. A capsule network employs a fast iterative process called routing-by-agreement. Concretely, the basic idea is to iteratively update the proportion of how much a part should be assigned to a whole, based on the agreement between parts and wholes. An important difference between iterative routing and layer aggregation is that the former provides a new way to aggregate information according to the representation of the final output.
A capsule is a group of neurons whose outputs represent different properties of the same entity from the input [Hinton, Sabour, and Frosst2018]. Similarly, a layer consists of a group of hidden states that represent different linguistic properties of the same input [Peters et al.2018, Anastasopoulos and Chiang2018], thus each hidden layer can be viewed as a capsule. Given the layers as input capsules, we introduce an additional layer of output capsules and then perform iterative routing between these two layers of capsules. Specifically, in this work we explore two representative routing mechanisms, namely dynamic routing and EM routing, which differ in how the iterative routing procedure is implemented. We expect layer aggregation can benefit greatly from advanced routing algorithms, which allow the model to directly learn the part-whole relationships.
3.2.1 Dynamic Routing
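Dynamic routing is a straightforward implementation of routing-by-agreement. To illustrate, the information of $L$ input capsules is dynamically routed to $N$ output capsules, which are concatenated to form the final output $\widehat{\mathbf{H}} = [\Omega_{1}, \dots, \Omega_{N}]$, as shown in Figure 1. Each vector output of capsule $n$ is calculated with a non-linear “squashing” function [Sabour, Frosst, and Hinton2017]:
$$\Omega_{n} = \frac{\|\mathbf{S}_{n}\|^{2}}{1 + \|\mathbf{S}_{n}\|^{2}} \frac{\mathbf{S}_{n}}{\|\mathbf{S}_{n}\|}, \tag{5}$$
$$\mathbf{S}_{n} = \sum_{l=1}^{L} C_{l \to n} \mathbf{V}_{l \to n}, \tag{6}$$
where $\mathbf{S}_{n}$ is the total input of capsule $\Omega_{n}$, which is a weighted sum over all “vote vectors” $\mathbf{V}_{l \to n}$ transformed from the input capsules $\widehat{\mathbf{H}}^{l}$:
$$\mathbf{V}_{l \to n} = \mathbf{W}_{l \to n} \widehat{\mathbf{H}}^{l}, \tag{7}$$
where $\mathbf{W}_{l \to n}$ is a trainable transformation matrix, and $\widehat{\mathbf{H}}^{l}$ is an input capsule associated with input layer $\mathbf{H}^{l}$:
$$\widehat{\mathbf{H}}^{l} = F_{l}(\mathbf{H}^{1}, \dots, \mathbf{H}^{L}), \tag{8}$$
where $F_{l}(\cdot)$ is a distinct transformation function. (Footnote 1: Note that we calculate each input capsule with $F_{l}(\mathbf{H}^{1}, \dots, \mathbf{H}^{L})$ instead of $F_{l}(\mathbf{H}^{l})$, since the former achieves better performance on the translation task by exploiting more context, as shown in our experiment section.) $C_{l \to n}$ is the assignment probability (i.e. agreement) that is determined by the iterative dynamic routing.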
Algorithm 1 lists the algorithm of iterative dynamic routing. The assignment probabilities associated with each input capsule $\widehat{\mathbf{H}}^{l}$ sum to 1: $\sum_{n} C_{l \to n} = 1$, and are determined by a “routing softmax” (Line 4):
$$C_{l \to n} = \frac{\exp(B_{l \to n})}{\sum_{n'=1}^{N} \exp(B_{l \to n'})}, \tag{9}$$
where $B_{l \to n}$ measures the degree that $\widehat{\mathbf{H}}^{l}$ should be coupled to capsule $n$ (similar to the energy function in the attention model [Bahdanau, Cho, and Bengio2015]), which is initialized as all 0 (Line 2). The initial assignment probabilities are then iteratively refined by measuring the agreement between the vote vector $\mathbf{V}_{l \to n}$ and capsule $\Omega_{n}$ (Lines 4-6), which is implemented as a simple scalar product in this work (Line 5).
With the iterative routing-by-agreement mechanism, an input capsule prefers to send its representation to output capsules whose activity vectors have a big scalar product with the vote coming from the input capsule. Benefiting from high-dimensional coincidence filtering, capsule neurons are able to ignore all but the most active feature from the input capsules. Ideally, each capsule output $\Omega_{n}$ represents a distinct property of the input. To make the dimensionality of the final output consistent with that of the hidden layer (i.e. $d$), the dimensionality of each capsule output is set to $d/N$.
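The routing loop described above can be sketched in plain Python as follows. This is a hedged illustration rather than the paper's Algorithm 1: the vote vectors are taken as given (the trainable transformations that produce them are omitted), the function names are hypothetical, and the default of three iterations is only an assumed setting.

```python
import math


def squash(s):
    # Non-linear squashing: keeps the direction, maps the length into [0, 1).
    norm_sq = sum(x * x for x in s)
    if norm_sq == 0.0:
        return [0.0] * len(s)
    scale = norm_sq / (1.0 + norm_sq) / math.sqrt(norm_sq)
    return [scale * x for x in s]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_routing(votes, iterations=3):
    """votes[l][n] is the vote vector from input capsule l to output capsule n."""
    num_in, num_out, dim = len(votes), len(votes[0]), len(votes[0][0])
    logits = [[0.0] * num_out for _ in range(num_in)]  # coupling logits start at 0
    for _ in range(iterations):
        # Routing softmax: each input capsule distributes itself over the outputs.
        coupling = [softmax(row) for row in logits]
        outputs = []
        for n in range(num_out):
            total_input = [
                sum(coupling[l][n] * votes[l][n][k] for l in range(num_in))
                for k in range(dim)
            ]
            outputs.append(squash(total_input))
        # Agreement update: scalar product between each output and each vote.
        for l in range(num_in):
            for n in range(num_out):
                logits[l][n] += sum(o * v for o, v in zip(outputs[n], votes[l][n]))
    return outputs, coupling


# Two input capsules agree on output capsule 0 and disagree on output capsule 1.
votes = [
    [[2.0, 0.0], [1.0, 0.0]],
    [[2.0, 0.0], [-1.0, 0.0]],
]
outputs, coupling = dynamic_routing(votes)
```

In this toy example the votes for the first output capsule coincide, so both input capsules shift most of their assignment probability to it, while the conflicting votes for the second output capsule cancel out.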
3.2.2 EM Routing
Dynamic routing uses the cosine of the angle between two vectors to measure their agreement: $\Omega_{n} \cdot \mathbf{V}_{l \to n}$. The cosine saturates at 1, which makes it insensitive to the difference between a quite good agreement and a very good agreement. In response to this problem, Hinton, Sabour, and Frosst (2018) propose a novel Expectation-Maximization routing algorithm.
Specifically, the routing process fits a mixture of Gaussians using the Expectation-Maximization (EM) algorithm, where the output capsules play the role of Gaussians and the means of the activated input capsules play the role of the datapoints. It iteratively adjusts the means, variances, and activation probabilities of the output capsules, as well as the assignment probabilities $C$ of the input capsules, as listed in Algorithm 2. Compared with the dynamic routing described above, EM routing assigns means, variances, and activation probabilities to each capsule, which are used to better estimate the agreement for routing.
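The activation probability $A^{l}$ of the input capsule $\widehat{\mathbf{H}}^{l}$ is calculated by
$$A^{l} = \mathbf{W}_{l} \widehat{\mathbf{H}}^{l}, \tag{10}$$
where $\mathbf{W}_{l}$ is a trainable transformation matrix, and $\widehat{\mathbf{H}}^{l}$ is calculated by Equation 8. The activation probabilities $A^{l}$ and votes $\mathbf{V}_{l \to n}$ of the input capsules are fixed during the EM routing process.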
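M-Step for each Gaussian associated with $\Omega_{n}$ consists of finding the mean $\mu_{n}$ of the votes from input capsules and the variance $\sigma_{n}$ about that mean for each dimension $h$:
$$\mu_{n}^{h} = \frac{\sum_{l} C_{l \to n} V_{l \to n}^{h}}{\sum_{l} C_{l \to n}}, \tag{11}$$
$$(\sigma_{n}^{h})^{2} = \frac{\sum_{l} C_{l \to n} (V_{l \to n}^{h} - \mu_{n}^{h})^{2}}{\sum_{l} C_{l \to n}}. \tag{12}$$
The incremental cost of using an active capsule $\Omega_{n}$ is
$$\chi_{n}^{h} = \Big(\log \sigma_{n}^{h} + \frac{1 + \log 2\pi}{2}\Big) \sum_{l} C_{l \to n}. \tag{13}$$
The activation probability of capsule $\Omega_{n}$ is calculated by
$$A_{n} = \mathrm{logistic}\Big(\lambda \Big(\beta_{A} - \beta_{\mu} \sum_{l} C_{l \to n} - \sum_{h} \chi_{n}^{h}\Big)\Big), \tag{14}$$
where $\beta_{A}$ is a fixed cost for coding the mean and variance of $\Omega_{n}$ when activating it, $\beta_{\mu}$ is another fixed cost per input capsule when not activating it, and $\lambda$ is an inverse temperature parameter set with a fixed schedule. We refer the readers to [Hinton, Sabour, and Frosst2018] for more details.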
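E-Step adjusts the assignment probabilities $C_{l \to *}$ for each input $\widehat{\mathbf{H}}^{l}$. First, we compute the probability density of the vote $\mathbf{V}_{l \to n}$ from $\widehat{\mathbf{H}}^{l}$ under the Gaussian distribution fitted by the output capsule $\Omega_{n}$ it gets assigned to:
$$p_{n} = \frac{1}{\sqrt{2\pi (\sigma_{n})^{2}}} \exp\Big(-\frac{(\mathbf{V}_{l \to n} - \mu_{n})^{2}}{2 (\sigma_{n})^{2}}\Big). \tag{15}$$
Accordingly, the assignment probability is re-normalized by
$$C_{l \to n} = \frac{A_{n} p_{n}}{\sum_{n'} A_{n'} p_{n'}}. \tag{16}$$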
As has been stated above, EM routing is a more powerful routing algorithm, which can better estimate the agreement by allowing active capsules to receive a cluster of similar votes. In addition, it assigns an additional activation probability to represent the probability of whether each capsule is present, rather than the length of vector.
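The alternating M-step and E-step can be sketched in plain Python as below. This is a simplified illustration, not the paper's Algorithm 2. The assumptions are: votes and input activations are given as fixed inputs; the assignment probabilities are scaled by the input activations in the M-step, following the general EM routing recipe of Hinton, Sabour, and Frosst; the two cost constants and the inverse temperature are plain arguments with arbitrary default values; and a small `eps` is added to the variance for numerical safety. All names are hypothetical.

```python
import math


def logistic(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def em_routing(votes, input_act, beta_a=0.0, beta_mu=0.0, lam=1.0,
               iterations=3, eps=1e-6):
    """votes[l][n]: vote vector from input capsule l to output capsule n.
    input_act[l]: fixed activation probability of input capsule l."""
    num_in, num_out, dim = len(votes), len(votes[0]), len(votes[0][0])
    assign = [[1.0 / num_out] * num_out for _ in range(num_in)]  # uniform start
    for _ in range(iterations):
        means, variances, out_act = [], [], []
        # M-step: refit each output capsule's Gaussian and its activation.
        for n in range(num_out):
            r = [assign[l][n] * input_act[l] for l in range(num_in)]
            total = max(sum(r), eps)
            mu = [sum(r[l] * votes[l][n][h] for l in range(num_in)) / total
                  for h in range(dim)]
            var = [sum(r[l] * (votes[l][n][h] - mu[h]) ** 2
                       for l in range(num_in)) / total + eps
                   for h in range(dim)]
            # Per-dimension cost: (log sigma + (1 + log 2*pi) / 2) * total.
            cost = sum((0.5 * math.log(v) + 0.5 * (1.0 + math.log(2.0 * math.pi))) * total
                       for v in var)
            means.append(mu)
            variances.append(var)
            out_act.append(logistic(lam * (beta_a - beta_mu * total - cost)))
        # E-step: re-assign each input capsule using the fitted Gaussians.
        for l in range(num_in):
            scores = []
            for n in range(num_out):
                log_p = sum(-0.5 * math.log(2.0 * math.pi * variances[n][h])
                            - (votes[l][n][h] - means[n][h]) ** 2 / (2.0 * variances[n][h])
                            for h in range(dim))
                scores.append(out_act[n] * math.exp(log_p))
            norm = sum(scores)
            if norm > 0.0:
                assign[l] = [s / norm for s in scores]
    return assign, out_act, means


# Votes for output capsule 0 form a tight cluster; votes for capsule 1 are spread out.
votes = [
    [[1.0], [-5.0]],
    [[1.1], [0.0]],
    [[0.9], [5.0]],
]
assign, out_act, means = em_routing(votes, input_act=[1.0, 1.0, 1.0])
```

In this toy example the tight cluster of votes yields a low-variance Gaussian with a high activation, so the input capsules route most of their probability mass to the first output capsule.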
4 Experiment
Table 1: Model variations on the WMT14 En-De translation task.
# | Model | # Para. | Train | Decode | BLEU | Δ
1 | Transformer-Base | 88.0M | 1.79 | 1.43 | 27.31 | –
2 | + Linear Combination [Dou et al.2018] | +14.7M | 1.57 | 1.36 | 27.73 | +0.42
3 | + Dynamic Combination | +25.2M | 1.50 | 1.30 | 28.33 | +1.02
4 | + Dynamic Routing | +37.8M | 1.37 | 1.24 | 28.22 | +0.91
5 | + EM Routing | +56.8M | 1.10 | 1.15 | 28.81 | +1.50
Table 2: Results on the WMT14 En-De and WMT17 Zh-En translation tasks.
System | Architecture | En-De # Para. | En-De BLEU | Zh-En # Para. | Zh-En BLEU
Existing NMT systems
[Wu et al.2016] | RNN with 8 layers | N/A | 26.30 | N/A | N/A
[Gehring et al.2017] | CNN with 15 layers | N/A | 26.36 | N/A | N/A
[Vaswani et al.2017] | Transformer-Base | 65M | 27.3 | N/A | N/A
 | Transformer-Big | 213M | 28.4 | N/A | N/A
[Hassan et al.2018] | Transformer-Big | N/A | N/A | N/A | 24.2
Our NMT systems
this work | Transformer-Base | 88M | 27.31 | 108M | 24.13
 | + EM Routing | 123M | 28.81 | 143M | 24.81
 | Transformer-Big | 264M | 28.58 | 304M | 24.56
 | + EM Routing | 490M | 28.97 | 530M | 25.00
4.1 Setting
We conducted experiments on two widely used translation tasks, WMT14 English-German (En-De) and WMT17 Chinese-English (Zh-En), and compared our model with results reported by previous work [Gehring et al.2017, Vaswani et al.2017, Hassan et al.2018]. For the En-De task, the training corpus consists of about million sentence pairs. We used newstest2013 as the development set and newstest2014 as the test set. For the Zh-En task, we used all of the available parallel data, consisting of about million sentence pairs. We used newsdev2017 as the development set and newstest2017 as the test set. All the data had been tokenized and segmented into subword symbols using byte-pair encoding with 32K merge operations [Sennrich, Haddow, and Birch2016]. We used the 4-gram NIST BLEU score [Papineni et al.2002] as the evaluation metric, and the sign-test [Collins, Koehn, and Kucerova2005] for statistical significance testing.
We evaluated the proposed approaches on the Transformer model [Vaswani et al.2017]. We followed the configurations in [Vaswani et al.2017], and reproduced their reported results on the En-De task. The parameters of the proposed models were initialized by the pretrained model. All the models were trained on eight NVIDIA P40 GPUs, each of which was allocated a batch size of 4096 tokens. In consideration of computation cost, we studied model variations with the Transformer-Base model on the En-De task, and evaluated overall performance with the Transformer-Base and Transformer-Big models on both the Zh-En and En-De tasks.
4.2 Results
4.2.1 Model Variations
Table 1 shows the results on the WMT14 En-De translation task. As one would expect, the linear combination (Row 2) improves translation performance by +0.42 BLEU points, indicating the necessity of aggregating layers for deep NMT models.
All dynamic aggregation models (Rows 3-5) consistently outperform their static counterpart (Row 2), demonstrating the superiority of the dynamic mechanisms. Among the model variations, the simplest strategy, dynamic combination (Row 3), surprisingly improves performance over the baseline model by +1.02 BLEU points. Benefiting from the advanced routing-by-agreement algorithm, the dynamic routing strategy achieves a similar improvement (+0.91 BLEU points, Row 4). EM routing further improves performance (+1.50 BLEU points, Row 5) by better estimating the agreement during the routing. These findings suggest potential applicability of capsule networks to natural language processing tasks, which has not been fully investigated yet.
All the dynamic aggregation strategies introduce new parameters, ranging from 25.2M to 56.8M. Accordingly, training speed decreases, since more effort is needed to train the new parameters. Dynamic aggregation mechanisms only marginally decrease decoding speed, with EM routing being the slowest, which decreases decoding speed by 19.6% (from 1.43 to 1.15).
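The decoding-speed figure can be checked directly from the "Decode" column of Table 1. The short sketch below only redoes that arithmetic; the dictionary and function names are hypothetical.

```python
# Decoding speeds copied from the "Decode" column of Table 1.
decode_speed = {
    "Transformer-Base": 1.43,
    "+ Linear Combination": 1.36,
    "+ Dynamic Combination": 1.30,
    "+ Dynamic Routing": 1.24,
    "+ EM Routing": 1.15,
}


def relative_slowdown(baseline, value):
    """Percentage decrease of `value` relative to `baseline`."""
    return (baseline - value) / baseline * 100.0


baseline = decode_speed["Transformer-Base"]
slowdowns = {
    name: round(relative_slowdown(baseline, speed), 1)
    for name, speed in decode_speed.items()
}
# slowdowns["+ EM Routing"] is 19.6, matching the figure quoted in the text.
```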
4.2.2 Main Results
Table 2 lists the results on both the WMT17 Zh-En and WMT14 En-De translation tasks. As seen, dynamically aggregating layers consistently improves translation performance across NMT models and language pairs, which demonstrates the effectiveness and universality of the proposed approach. It is worth mentioning that Transformer-Base with EM routing outperforms the vanilla Transformer-Big model with fewer than half of the parameters, demonstrating that our model utilizes the parameters more efficiently and effectively.
4.3 Analysis of Iterative Routing
We conducted extensive analysis from different perspectives to better understand the iterative routing process. All results are reported on the development set of the En-De task with the “Transformer-Base + EM routing” model.
4.3.1 Impact of the Number of Output Capsules
The number of output capsules $N$ is a key parameter for our model, as shown in Figure 1. We plot in Figure 2 the BLEU score with different numbers of output capsules. Generally, the BLEU score goes up as the number of capsules increases. As aforementioned, to make the dimensionality of the final output consistent with the hidden layer (i.e. $d$), the dimensionality of each capsule output is $d/N$. When $N$ increases, the dimensionality of each capsule output decreases (the minimum value is 1), which may lead to more subtle representations of different properties of the input.
4.3.2 Impact of Routing Iterations
Another key parameter is the number of iterations $T$ of the iterative routing, which affects the estimation of the agreement. As shown in Figure 3, the BLEU score typically goes up as the number of iterations $T$ increases, while the trend does not hold beyond a certain number of iterations. This indicates that more iterations may overestimate the agreement between two capsules and thus harm the performance. The optimal number of iterations is also consistent with the findings in previous work [Sabour, Frosst, and Hinton2017, Hinton, Sabour, and Frosst2018].
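Table 3: Impact of the function used to construct input capsules.
Model | Construct $\widehat{\mathbf{H}}^{l}$ with | BLEU
Base | N/A | 25.84
Ours | $F_{l}(\mathbf{H}^{l})$ | 26.18
 | $F_{l}(\mathbf{H}^{1}, \dots, \mathbf{H}^{L})$ | 26.62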
4.3.3 Impact of Functions to Construct Input Capsules
For the iterative routing models, we use $F_{l}(\mathbf{H}^{1}, \dots, \mathbf{H}^{L})$ instead of $F_{l}(\mathbf{H}^{l})$ to construct each input capsule $\widehat{\mathbf{H}}^{l}$. Table 3 lists the comparison results, which show that the former indeed outperforms the latter. We attribute this to the fact that $F_{l}(\mathbf{H}^{1}, \dots, \mathbf{H}^{L})$ is more representative, since it extracts features from the concatenation of the original layer representations.
4.3.4 Visualization of Agreement Distribution
The assignment probability $C_{l \to n}$ before the M-step, with $\sum_{n} C_{l \to n} = 1$, denotes the agreement between the input capsule $\widehat{\mathbf{H}}^{l}$ and the output capsule $\Omega_{n}$, which is determined by the iterative routing. A higher agreement denotes that the input capsule $\widehat{\mathbf{H}}^{l}$ prefers to send its representation to the output capsule $\Omega_{n}$. We plot in Figure 4 the agreement distribution in different routing iterations. In the first iteration (top panel), the initialized uniform distribution is employed as the agreement distribution, and each output capsule equally attends to all the input capsules. As the iterative routing proceeds, the input capsules learn to send their representations to proper output capsules, and accordingly output capsules are more likely to capture distinct features. We empirically validate our claim from the following two perspectives.
We use the entropy to measure the skewness of the agreement distributions:
(17) 
A lower entropy denotes a more skewed distribution, which indicates that the input capsules are more certain about which output capsules should be routed more information. The entropies of the three iterations are respectively 6.24, 5.93, and 5.86, which indeed decrease as expected.
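As an illustration of how such a skewness measure behaves, the sketch below computes the Shannon entropy (natural logarithm) of each input capsule's assignment distribution and averages it over input capsules. The averaging, the function name, and the toy sizes (6 input capsules, 512 output capsules) are assumptions made here; the exact normalization of Equation 17 is not reproduced. One observation is that a uniform distribution over 512 entries has entropy ln 512 ≈ 6.24, numerically the same as the value reported for the first iteration, though this does not by itself confirm the setting used in the experiments.

```python
import math


def mean_entropy(assignments):
    """Average Shannon entropy (natural log) over the rows of `assignments`.

    assignments[l][n] is the probability that input capsule l is routed to
    output capsule n; each row is assumed to sum to 1.
    """
    total = 0.0
    for row in assignments:
        total -= sum(p * math.log(p) for p in row if p > 0.0)
    return total / len(assignments)


# Toy sizes, chosen only for illustration.
uniform = [[1.0 / 512] * 512 for _ in range(6)]
skewed = [[0.5] + [0.5 / 511] * 511 for _ in range(6)]

uniform_entropy = mean_entropy(uniform)  # equals ln(512)
skewed_entropy = mean_entropy(skewed)    # lower: half the mass sits on one capsule
```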
To validate the claim that different output capsules focus on different subsets of input capsules, we measure the diversity between each two output capsules. Let be the agreement probabilities assigned to the output capsule , we calculate the diversity among all the output capsules as
(18) 
A higher diversity score denotes that output capsules attend to different subsets of input capsules. The diversity scores of the three iterations are respectively 0.0, 0.09, and 0.18, which reconfirm our observations.
4.4 Effect on Encoder and Decoder
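Table 4: Effect of applying the proposed approach to the encoder and decoder.
Model | Applied to Encoder | Applied to Decoder | BLEU
Base | N/A | N/A | 25.84
Ours | ✓ | × | 26.33
 | × | ✓ | 26.34
 | ✓ | ✓ | 26.62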
Both encoder and decoder are composed of a stack of layers, which may benefit from the proposed approach. In this experiment, we investigate how our model affects the two components, as shown in Table 4. Aggregating layers of encoder or decoder individually consistently outperforms the vanilla baseline model, and exploiting both components further improves performance. These results provide support for the claim that aggregating layers is useful for both understanding input sequence and generating output sequence.
4.5 Length Analysis
Following Bahdanau, Cho, and Bengio (2015) and Tu et al. (2016), we grouped sentences of similar lengths together and computed the BLEU score for each group, as shown in Figure 5. Generally, the performance of Transformer-Base goes up with the increase of input sentence lengths. We attribute this to the strength of the self-attention mechanism in modeling global dependencies without regard to their distance. Clearly, the proposed approaches outperform the baseline in all length segments.
5 Related Work
Our work is inspired by research in the field of exploiting deep representation and capsule networks.
Exploiting Deep Representation
Exploiting deep representations has been studied by various communities, from computer vision to natural language processing. He et al. (2016) propose a residual learning framework, combining layers and encouraging gradient flow by simple shortcut connections. Huang et al. (2017) extend the idea by introducing densely connected layers, which could better strengthen feature propagation and encourage feature reuse. Deep layer aggregation [Yu et al.2018] designs architectures to fuse information iteratively and hierarchically.
Concerning natural language processing, Peters et al. (2018) have found that combining different layers is helpful, and their model significantly improves state-of-the-art models on various tasks. Researchers have also explored fusing information for NMT models and demonstrated that aggregating layers is also useful for NMT [Shen et al.2018, Wang et al.2018, Dou et al.2018]. However, all of these works mainly focus on static aggregation, in that their aggregation strategy is independent of specific hidden states. In response to this problem, we introduce dynamic principles into layer aggregation. In addition, their approaches are a fixed policy without considering the representation of the final output, while the routing-by-agreement mechanisms are able to aggregate information according to the final representation.
Capsule Networks
The idea of dynamic routing was first proposed by Sabour, Frosst, and Hinton (2017), which aims at addressing the representational limitations of convolutional and recurrent neural networks for image classification. The iterative routing procedure is further improved by using the Expectation-Maximization algorithm to better estimate the agreement between capsules [Hinton, Sabour, and Frosst2018]. In the computer vision community, Xi, Bing, and Jin (2017) explore its application on CIFAR data with higher dimensionality. LaLonde and Bagci (2018) apply capsule networks to the object segmentation task.
The applications of capsule networks in natural language processing tasks, however, have not been widely investigated to date. Zhao et al. (2018) test capsule networks on text classification tasks, and Gong et al. (2018) propose to aggregate a sequence of vectors via dynamic routing for sequence encoding. To the best of our knowledge, this work is the first to apply the idea of dynamic routing to NMT.
6 Conclusion
In this work, we propose several methods to dynamically aggregate layers for deep NMT models. Our best model, which utilizes EM-based iterative routing to estimate the agreement between inputs and outputs, has achieved significant improvements over the baseline model across language pairs. By visualizing the routing process, we find that capsule networks are able to extract the most active features shared by different inputs. Our study suggests potential applicability of capsule networks across computer vision and natural language processing tasks for aggregating information of multiple inputs.
Future directions include validating our approach on other NMT architectures such as RNN [Chen et al.2018] and CNN [Gehring et al.2017], as well as on other NLP tasks such as dialogue and reading comprehension. It is also interesting to combine with other techniques [Shaw, Uszkoreit, and Vaswani2018, Li et al.2018, Dou et al.2018, Yang et al.2018, Yang et al.2019, Kong et al.2019] to further boost the performance of Transformer.
References
 [Anastasopoulos and Chiang2018] Anastasopoulos, A., and Chiang, D. 2018. Tied multitask learning for neural speech translation. In NAACL.
 [Bahdanau, Cho, and Bengio2015] Bahdanau, D.; Cho, K.; and Bengio, Y. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.
 [Chen et al.2018] Chen, M. X.; Firat, O.; Bapna, A.; Johnson, M.; Macherey, W.; Foster, G.; Jones, L.; Niki, P.; Schuster, M.; Chen, Z.; Wu, Y.; and Hughes, M. 2018. The Best of Both Worlds: Combining Recent Advances in Neural Machine Translation. In ACL.
 [Cho et al.2014] Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP.
 [Collins, Koehn, and Kucerova2005] Collins, M.; Koehn, P.; and Kucerova, I. 2005. Clause restructuring for statistical machine translation. In ACL.
 [Dou et al.2018] Dou, Z.-Y.; Tu, Z.; Wang, X.; Shi, S.; and Zhang, T. 2018. Exploiting deep representations for neural machine translation. In EMNLP.
 [Gehring et al.2017] Gehring, J.; Auli, M.; Grangier, D.; Yarats, D.; and Dauphin, Y. N. 2017. Convolutional sequence to sequence learning. In ICML.
 [Gong et al.2018] Gong, J.; Qiu, X.; Wang, S.; and Huang, X. 2018. Information aggregation via dynamic routing for sequence encoding. In COLING.
 [Hassan et al.2018] Hassan, H.; Aue, A.; Chen, C.; Chowdhary, V.; Clark, J.; Federmann, C.; Huang, X.; Junczys-Dowmunt, M.; Lewis, W.; Li, M.; et al. 2018. Achieving human parity on automatic Chinese to English news translation. arXiv.
 [He et al.2016] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR.
 [Hinton, Krizhevsky, and Wang2011] Hinton, G. E.; Krizhevsky, A.; and Wang, S. D. 2011. Transforming autoencoders. In ICANN.
 [Hinton, Sabour, and Frosst2018] Hinton, G. E.; Sabour, S.; and Frosst, N. 2018. Matrix capsules with EM routing. In ICLR.
 [Huang et al.2017] Huang, G.; Liu, Z.; van der Maaten, L.; and Weinberger, K. Q. 2017. Densely connected convolutional networks. In CVPR.
 [Kong et al.2019] Kong, X.; Tu, Z.; Shi, S.; Hovy, E.; and Zhang, T. 2019. Neural machine translation with adequacy-oriented learning. In AAAI.
 [LaLonde and Bagci2018] LaLonde, R., and Bagci, U. 2018. Capsules for object segmentation. arXiv.
 [Li et al.2018] Li, J.; Tu, Z.; Yang, B.; Lyu, M. R.; and Zhang, T. 2018. Multi-head attention with disagreement regularization. In EMNLP.
 [Meng et al.2016] Meng, F.; Lu, Z.; Tu, Z.; Li, H.; and Liu, Q. 2016. A deep memory-based architecture for sequence-to-sequence learning. In ICLR Workshop.
 [Papineni et al.2002] Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. Bleu: a method for automatic evaluation of machine translation. In ACL.
 [Peters et al.2018] Peters, M. E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L. 2018. Deep contextualized word representations. In NAACL.
 [Sabour, Frosst, and Hinton2017] Sabour, S.; Frosst, N.; and Hinton, G. E. 2017. Dynamic routing between capsules. In NIPS.
 [Sennrich, Haddow, and Birch2016] Sennrich, R.; Haddow, B.; and Birch, A. 2016. Neural machine translation of rare words with subword units. In ACL.
 [Shaw, Uszkoreit, and Vaswani2018] Shaw, P.; Uszkoreit, J.; and Vaswani, A. 2018. Self-Attention with Relative Position Representations. In NAACL.
 [Shen et al.2018] Shen, Y.; Tan, X.; He, D.; Qin, T.; and Liu, T.-Y. 2018. Dense information flow for neural machine translation. In NAACL.
 [Tu et al.2016] Tu, Z.; Lu, Z.; Liu, Y.; Liu, X.; and Li, H. 2016. Modeling coverage for neural machine translation. In ACL.
 [Vaswani et al.2017] Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In NIPS.
 [Wang et al.2018] Wang, Q.; Li, F.; Xiao, T.; Li, Y.; Li, Y.; and Zhu, J. 2018. Multi-layer representation fusion for neural machine translation. In COLING.
 [Wu et al.2016] Wu, Y.; Schuster, M.; Chen, Z.; Le, Q. V.; Norouzi, M.; Macherey, W.; Krikun, M.; Cao, Y.; Gao, Q.; Macherey, K.; et al. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv.
 [Xi, Bing, and Jin2017] Xi, E.; Bing, S.; and Jin, Y. 2017. Capsule network performance on complex data. arXiv.
 [Yang et al.2018] Yang, B.; Tu, Z.; Wong, D. F.; Meng, F.; Chao, L. S.; and Zhang, T. 2018. Modeling localness for self-attention networks. In EMNLP.
 [Yang et al.2019] Yang, B.; Li, J.; Wong, D. F.; Chao, L. S.; Wang, X.; and Tu, Z. 2019. Context-aware self-attention networks. In AAAI.
 [Yu et al.2018] Yu, F.; Wang, D.; Shelhamer, E.; and Darrell, T. 2018. Deep layer aggregation. In CVPR.
 [Zeiler and Fergus2014] Zeiler, M. D., and Fergus, R. 2014. Visualizing and understanding convolutional networks. In ECCV.
 [Zhao et al.2018] Zhao, W.; Ye, J.; Yang, M.; Lei, Z.; Zhang, S.; and Zhao, Z. 2018. Investigating capsule networks with dynamic routing for text classification. In EMNLP.
 [Zhou et al.2016] Zhou, J.; Cao, Y.; Wang, X.; Li, P.; and Xu, W. 2016. Deep recurrent models with fast-forward connections for neural machine translation. TACL.