Dynamic Layer Aggregation for Neural Machine Translation with Routing-by-Agreement

by Zi-Yi Dou, et al.
Carnegie Mellon University

With the promising progress of deep neural networks, layer aggregation has been used to fuse information across layers in various fields, such as computer vision and machine translation. However, most of the previous methods combine layers in a static fashion in that their aggregation strategy is independent of specific hidden states. Inspired by recent progress on capsule networks, in this paper we propose to use routing-by-agreement strategies to aggregate layers dynamically. Specifically, the algorithm learns the probability of a part (individual layer representations) assigned to a whole (aggregated representations) in an iterative way and combines parts accordingly. We implement our algorithm on top of the state-of-the-art neural machine translation model TRANSFORMER and conduct experiments on the widely-used WMT14 English-German and WMT17 Chinese-English translation datasets. Experimental results across language pairs show that the proposed approach consistently outperforms the strong baseline model and a representative static aggregation model.





1 Introduction

Deep neural networks have advanced the state of the art in various communities, from computer vision to natural language processing. Researchers have directed their efforts into designing patterns of modules that can be assembled systematically, which makes neural networks deeper and wider. However, one key challenge of training such huge networks lies in how to transform and combine information across layers. To encourage gradient flow and feature propagation, researchers in the field of computer vision have proposed various approaches, such as residual connections [He et al.2016], densely connected networks [Huang et al.2017], and deep layer aggregation [Yu et al.2018].

State-of-the-art neural machine translation (NMT) models generally implement encoder and decoder as multiple layers [Wu et al.2016, Gehring et al.2017, Vaswani et al.2017, Chen et al.2018], in which only the top layer is exploited in the subsequent processes. Fusing information across layers for deep NMT models, however, has received substantially less attention. A few recent studies reveal that simultaneously exposing all layer representations outperforms methods that utilize just the top layer for natural language processing tasks [Peters et al.2018, Shen et al.2018, Wang et al.2018, Dou et al.2018]. However, their methods mainly focus on static aggregation in that the aggregation mechanisms are the same across different positions in the sequence. Consequently, the useful context of sequences embedded in the layer representations, which could be used to further improve layer aggregation, is ignored.

In this work, we propose dynamic layer aggregation approaches, which allow the model to aggregate hidden states across layers for each position dynamically. We assign a distinct aggregation strategy for each symbol in the sequence, based on the corresponding hidden states that represent both syntactic and semantic information of this symbol. To this end, we propose several strategies to model the dynamic principles. First, we propose a simple dynamic combination mechanism, which assigns a distinct set of aggregation weights, learned by a feed-forward network, to each position. Second, inspired by the recent success of iterative routing on assigning parts to wholes for computer vision tasks [Sabour, Frosst, and Hinton2017, Hinton, Sabour, and Frosst2018], we apply the idea of routing-by-agreement to layer aggregation. Benefiting from the high-dimensional coincidence filtering, i.e. the agreement between every two internal neurons, the routing algorithm has the ability to extract the most active features shared by multiple layer representations.

We evaluated our approaches upon the standard Transformer model [Vaswani et al.2017] on two widely-used translation tasks, WMT14 English-German and WMT17 Chinese-English. We show that although the static layer aggregation strategy indeed improves translation performance, which indicates the necessity and effectiveness of fusing information across layers for deep NMT models, our proposed dynamic approaches outperform their static counterpart. Also, our models consistently improve translation performance over the vanilla Transformer model across language pairs. It is worth mentioning that Transformer-Base with dynamic layer aggregation outperforms the vanilla Transformer-Big model with fewer than half of the parameters.


Our key contributions are:

  • Our study demonstrates the necessity and effectiveness of dynamic layer aggregation for NMT models, which benefits from exploiting useful context embedded in the layer representations.

  • Our work is among the few studies (cf. [Gong et al.2018, Zhao et al.2018]) which prove that the idea of capsule networks can have promising applications on natural language processing tasks.

2 Background

2.1 Deep Neural Machine Translation

Deep representations have a noticeable effect on neural machine translation [Meng et al.2016, Zhou et al.2016, Wu et al.2016]. Generally, multiple-layer encoder and decoder are employed to perform the translation task through a series of nonlinear transformations from the representation of input sequences to final output sequences.

Specifically, the encoder is composed of a stack of $L$ identical layers with the bottom layer being the word embedding layer. Each encoder layer is calculated as

$$H^l_e = \mathrm{Layer}_e(H^{l-1}_e) + H^{l-1}_e \tag{1}$$

where a residual connection [He et al.2016] is employed around each of the two layers. $\mathrm{Layer}(\cdot)$ is the layer function, which can be implemented as RNN [Cho et al.2014], CNN [Gehring et al.2017], or self-attention network (SAN) [Vaswani et al.2017]. In this work, we evaluate the proposed approach on the standard Transformer model, although it is generally applicable to other types of NMT architectures.

The decoder is also composed of a stack of $L$ layers:

$$H^l_d = \mathrm{Layer}_d(H^{l-1}_d, H^L_e) + H^{l-1}_d \tag{2}$$

which is calculated based on both the lower decoder layer $H^{l-1}_d$ and the top encoder layer $H^L_e$. The top layer of the decoder $H^L_d$ is used to generate the final output sequence.

As seen, both the encoder and decoder stack layers in sequence and only utilize the information in the top layer. While studies have shown deeper layers extract more semantic and more global features [Zeiler and Fergus2014, Peters et al.2018], these do not prove that the last layer is the ultimate representation for any task. Although residual connections have been incorporated to combine layers, these connections have been “shallow” themselves, and only fuse by simple, one-step operations [Yu et al.2018].

2.2 Exploiting Deep Representations

Recently, aggregating layers to better fuse semantic and spatial information has proven to be of profound value in computer vision tasks [Huang et al.2017, Yu et al.2018]. For machine translation, Shen et al. [2018] and Dou et al. [2018] have shown that simultaneously exposing all layer representations outperforms methods that utilize just the top layer on several generation tasks. Specifically, one of the methods proposed by Dou et al. [2018] is to linearly combine the outputs of all layers:

$$\hat{H} = \sum_{l=1}^{L} W_l H^l \tag{3}$$

where $\{W_1, \dots, W_L\} \in \mathbb{R}^{d}$ are trainable parameter matrices, where $d$ is the dimensionality of hidden layers. The linear combination strategy is applied to both the encoder and decoder. The combined layer $\hat{H}$, which embeds all layer representations instead of only the top layer $H^L$, is used in the subsequent processes.

As seen, the linear combination is encoded in a static set of weights $\{W_1, \dots, W_L\}$, which ignores the useful context of sentences that could further improve layer aggregation. In this work, we introduce the dynamic principles into layer aggregation mechanisms.
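The static linear combination can be illustrated with a minimal NumPy sketch. The array shapes, the per-dimension weight vectors, and the function name `linear_combination` are assumptions made for illustration and are not taken from the original implementation. The point it shows is that the same weights are applied at every position of every sentence.

```python
import numpy as np

def linear_combination(layers, weights):
    """Statically combine layer outputs as a weighted sum over layers.

    layers:  (L, J, d) array, L layer outputs for a sentence of length J
    weights: (L, d) array, one weight vector per layer (assumed shape)
    """
    # The weights do not depend on the position j or on the hidden states,
    # which is what makes this aggregation "static".
    return np.einsum("ld,ljd->jd", weights, layers)

rng = np.random.default_rng(0)
L, J, d = 6, 5, 8                       # illustrative sizes only
layers = rng.normal(size=(L, J, d))
weights = np.full((L, d), 1.0 / L)      # hypothetical values: plain averaging
combined = linear_combination(layers, weights)
```

With uniform weights the combination reduces to averaging the layers, which makes the sketch easy to check by hand.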

3 Approach

3.1 Dynamic Combination

An intuitive extension of static linear combination is to generate different weights for each layer combination rather than apply the same weights all the time. To this end, we calculate the weights of the linear combination as

$$W_l = F_l(H^1, \dots, H^L) \in \mathbb{R}^{J \times d} \tag{4}$$

where $J$ is the length of the hidden layer $H^l$, and $F_l(\cdot)$ is a distinct feed-forward network associated with the $l$-th layer $H^l$. Specifically, we use all the layer representations as the context, based on which we output a weight matrix that shares the same dimensionality with $H^l$. Accordingly, the weights are adapted during inference depending on the input layer combination.

Our approach has two strengths. First, it is a more flexible strategy to dynamically combine layers by capturing contextual information among them, which is ignored by the conventional version. Second, the transformation matrix $W_l \in \mathbb{R}^{J \times d}$ offers the ability to assign a distinct weight to each state in the layers, while its static counterpart cannot exploit such strength, since the length of input layers varies across sentences and thus cannot be pre-defined.
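A minimal sketch of the dynamic combination follows, under stated assumptions: each feed-forward network is reduced to a single linear projection of the concatenated layer states, and the resulting weight matrix is applied element-wise. Both choices, as well as the names `dynamic_combination`, `ffn_weights`, and `ffn_biases`, are hypothetical and only meant to show how the weights become position-specific.

```python
import numpy as np

def dynamic_combination(layers, ffn_weights, ffn_biases):
    """Combine layers with weights computed from the layers themselves.

    layers:      (L, J, d) layer outputs
    ffn_weights: (L, L*d, d) one projection per layer (assumed one-layer network)
    ffn_biases:  (L, d)
    """
    num_layers, length, dim = layers.shape
    # Context for each position: concatenation of all layer states, (J, L*d).
    context = layers.transpose(1, 0, 2).reshape(length, num_layers * dim)
    combined = np.zeros((length, dim))
    for l in range(num_layers):
        # The weight matrix has the same shape as the layer, (J, d),
        # so every state in the layer receives its own weight.
        w_l = context @ ffn_weights[l] + ffn_biases[l]
        combined += w_l * layers[l]
    return combined

rng = np.random.default_rng(0)
L, J, d = 6, 5, 8                       # illustrative sizes only
layers = rng.normal(size=(L, J, d))
ffn_weights = rng.normal(scale=0.01, size=(L, L * d, d))
ffn_biases = np.full((L, d), 1.0 / L)
combined = dynamic_combination(layers, ffn_weights, ffn_biases)
```

Because the weights are computed per position, the sketch handles any sentence length without a pre-defined weight shape. Setting the projections to zero recovers a static combination.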

3.2 Layer Aggregation as Capsule Routing

The goal of layer aggregation is to find a whole representation of the input from partial representations captured by different layers. This matches the aim of capsule networks, which have become an appealing alternative for solving the problem of assigning parts to wholes [Hinton, Krizhevsky, and Wang2011]. A capsule network employs a fast iterative process called routing-by-agreement. Concretely, the basic idea is to iteratively update the proportion of how much a part should be assigned to a whole, based on the agreement between parts and wholes. An important difference between iterative routing and layer aggregation is that the former provides a new way to aggregate information according to the representation of the final output.

A capsule is a group of neurons whose outputs represent different properties of the same entity from the input [Hinton, Sabour, and Frosst2018]. Similarly, a layer consists of a group of hidden states that represent different linguistic properties of the same input [Peters et al.2018, Anastasopoulos and Chiang2018], thus each hidden layer can be viewed as a capsule. Given the layers as input capsules, we introduce an additional layer of output capsules and then perform iterative routing between these two layers of capsules. Specifically, in this work we explore two representative routing mechanisms, namely dynamic routing and EM routing, which differ in how the iterative routing procedure is implemented. We expect layer aggregation can benefit greatly from advanced routing algorithms, which allow the model to directly learn the part-whole relationships.

3.2.1 Dynamic Routing

Figure 1: Illustration of the dynamic routing algorithm.

Dynamic routing is a straightforward implementation of routing-by-agreement. To illustrate, the information of $L$ input capsules is dynamically routed to $N$ output capsules, which are concatenated to form the final output $\Omega = [\Omega_1, \dots, \Omega_N]$, as shown in Figure 1. Each vector output of capsule $n$ is calculated with a non-linear “squashing” function [Sabour, Frosst, and Hinton2017]:

$$\Omega_n = \frac{\|S_n\|^2}{1 + \|S_n\|^2} \frac{S_n}{\|S_n\|} \tag{5}$$

where $S_n$ is the total input of capsule $\Omega_n$, which is a weighted sum over all “vote vectors” $V_{l \to n}$ transformed from the input capsules $\hat{H}^l$:

$$S_n = \sum_{l=1}^{L} C_{l \to n} V_{l \to n} \tag{6}$$

$$V_{l \to n} = W_{l \to n} \hat{H}^l \tag{7}$$

where $W_{l \to n}$ is a trainable transformation matrix, and $\hat{H}^l$ is an input capsule associated with input layer $H^l$:

$$\hat{H}^l = F_l(H^1, \dots, H^L) \tag{8}$$

where $F_l(\cdot)$ is a distinct transformation function. $C_{l \to n}$ is the assignment probability (i.e. agreement) that is determined by the iterative dynamic routing.

Footnote 1: Note that we calculate each input capsule with $F_l(H^1, \dots, H^L)$ instead of $F_l(H^l)$, since the former achieves better performance on the translation task by exploiting more context, as shown in our experiment section.


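Algorithm 1 Iterative Dynamic Routing. Input: input capsules $\hat{H} = \{\hat{H}^1, \dots, \hat{H}^L\}$, iterations $T$; Output: capsules $\Omega = \{\Omega_1, \dots, \Omega_N\}$.
1: procedure Routing($\hat{H}$, $T$):
2:     $\forall \hat{H}^l \to \Omega_n$: $B_{l \to n} = 0$
3:     for $T$ iterations do
4:         $\forall \hat{H}^l \to \Omega_n$: compute $C_{l \to n}$ by Eq. 9
5:         $\forall \Omega_n$: compute $\Omega_n$ by Eq. 5
6:         $\forall \hat{H}^l \to \Omega_n$: $B_{l \to n} \leftarrow B_{l \to n} + \Omega_n \cdot V_{l \to n}$
7:     return $\Omega$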

Algorithm 1 lists the iterative dynamic routing procedure. The assignment probabilities associated with each input capsule $\hat{H}^l$ sum to 1: $\sum_{n} C_{l \to n} = 1$, and are determined by a “routing softmax” (Line 4):

$$C_{l \to n} = \frac{\exp(B_{l \to n})}{\sum_{n'=1}^{N} \exp(B_{l \to n'})} \tag{9}$$

where $B_{l \to n}$ measures the degree that $\hat{H}^l$ should be coupled to capsule $n$ (similar to the energy function in the attention model [Bahdanau, Cho, and Bengio2015]), which is initialized as all 0 (Line 2). The initial assignment probabilities are then iteratively refined by measuring the agreement between the vote vector $V_{l \to n}$ and capsule $n$ (Lines 4-6), which is implemented as a simple scalar product $\Omega_n \cdot V_{l \to n}$ in this work (Line 6).

With the iterative routing-by-agreement mechanism, an input capsule prefers to send its representation to output capsules whose activity vectors have a big scalar product with the vote coming from the input capsule. Benefiting from the high-dimensional coincidence filtering, capsule neurons are able to ignore all but the most active feature from the input capsules. Ideally, each capsule output represents a distinct property of the input. To make the dimensionality of the final output consistent with that of the hidden layer (i.e. $d$), the dimensionality of each capsule output is set to $d/N$.
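The routing loop described above can be sketched in NumPy as follows. This is an illustrative sketch rather than the authors' implementation: it routes a single position, takes the vote vectors as given, and uses hypothetical names (`squash`, `dynamic_routing`, `votes`) and toy sizes.

```python
import numpy as np

def squash(s, eps=1e-9):
    """Squashing non-linearity: keeps the direction, maps the norm into [0, 1)."""
    sq_norm = np.sum(s * s, axis=-1, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def dynamic_routing(votes, iterations=3):
    """votes: (L, N, k) vote vectors from L input capsules to N output capsules."""
    num_in, num_out, _ = votes.shape
    logits = np.zeros((num_in, num_out))            # coupling logits start at 0
    for _ in range(iterations):
        # Routing softmax over output capsules: each input capsule's
        # assignment probabilities sum to 1.
        e = np.exp(logits - logits.max(axis=1, keepdims=True))
        coupling = e / e.sum(axis=1, keepdims=True)
        # Total input of each output capsule: weighted sum of votes.
        total_input = np.einsum("ln,lnk->nk", coupling, votes)
        outputs = squash(total_input)
        # Agreement as a scalar product between output and vote.
        logits = logits + np.einsum("nk,lnk->ln", outputs, votes)
    return outputs, coupling

rng = np.random.default_rng(0)
votes = rng.normal(size=(6, 4, 2))      # 6 layers, 4 output capsules of size 2
outputs, coupling = dynamic_routing(votes, iterations=3)
final = outputs.reshape(-1)             # concatenate capsule outputs
```

Here the four capsule outputs of size 2 concatenate to a vector of size 8, mirroring the choice of capsule size as the hidden size divided by the number of output capsules.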

3.2.2 EM Routing


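Algorithm 2 Iterative EM Routing returns activation $A_n$ of the output capsules, given the activation $A_l$ and vote $V$ of the input capsule.
1: procedure EM Routing($A_l$, $V$):
2:     $\forall \hat{H}^l \to \Omega_n$: $C_{l \to n} = 1/N$
3:     for $T$ iterations do
4:         $\forall \Omega_n$: M-Step($C$, $A_l$, $V$)
5:         $\forall \hat{H}^l$: E-Step($\mu$, $\sigma$, $A_n$, $V$)
6:     $\forall \Omega_n$: return $A_n$
1: procedure M-Step($C$, $A_l$, $V$)
2:     hold $C$ constant, adjust ($\mu_n$, $\sigma_n$, $A_n$) for $\Omega_n$
3:     $\forall \hat{H}^l$: $C_{l \to n} = C_{l \to n} \cdot A_l$
4:     Compute $\mu_n$, $\sigma_n$ by Eq. 11 and 12
5:     Compute $A_n$ by Eq. 14
1: procedure E-Step($\mu$, $\sigma$, $A_n$, $V$)
2:     hold ($\mu$, $\sigma$, $A_n$) constant, adjust $C_{l \to n}$ for $\hat{H}^l$
3:     $\forall \Omega_n$: compute $C_{l \to n}$ by Eq. 16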

Dynamic routing uses the cosine of the angle between two vectors to measure their agreement: $\Omega_n \cdot V_{l \to n}$. The cosine saturates at 1, which makes it insensitive to the difference between a quite good agreement and a very good agreement. In response to this problem, Hinton, Sabour, and Frosst [2018] propose a novel Expectation-Maximization routing algorithm.

Specifically, the routing process fits a mixture of Gaussians using the Expectation-Maximization (EM) algorithm, where the output capsules play the role of Gaussians and the means of the activated input capsules play the role of the datapoints. It iteratively adjusts the means, variances, and activation probabilities of the output capsules, as well as the assignment probabilities $C_{l \to n}$ of the input capsules, as listed in Algorithm 2. Compared with the dynamic routing described above, the EM routing assigns means, variances, and activation probabilities for each capsule, which are used to better estimate the agreement for routing.

The activation probability $A_l$ of the input capsule $\hat{H}^l$ is calculated by

$$A_l = W_l \hat{H}^l \tag{10}$$

where $W_l$ is a trainable transformation matrix, and $\hat{H}^l$ is calculated by Equation 8. The activation probabilities $A_l$ and votes $V_{l \to n}$ of the input capsules are fixed during the EM routing process.


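M-Step for each Gaussian associated with $\Omega_n$ consists of finding the mean $\mu_n$ of the votes from input capsules and the variance $\sigma_n$ about that mean for each dimension $h$:

$$\mu^h_n = \frac{\sum_{l} C_{l \to n} V^h_{l \to n}}{\sum_{l} C_{l \to n}} \tag{11}$$

$$(\sigma^h_n)^2 = \frac{\sum_{l} C_{l \to n} (V^h_{l \to n} - \mu^h_n)^2}{\sum_{l} C_{l \to n}} \tag{12}$$

The incremental cost of using an active capsule $\Omega_n$ is

$$\chi^h_n = \Big(\log \sigma^h_n + \frac{1 + \log(2\pi)}{2}\Big) \sum_{l} C_{l \to n} \tag{13}$$

The activation probability of capsule $\Omega_n$ is calculated by

$$A_n = \mathrm{logistic}\Big(\lambda \big(\beta_A - \beta_\mu \sum_{l} C_{l \to n} - \sum_{h} \chi^h_n\big)\Big) \tag{14}$$

where $\beta_A$ is a fixed cost for coding the mean and variance of $\Omega_n$ when activating it, $\beta_\mu$ is another fixed cost per input capsule when not activating it, and $\lambda$ is an inverse temperature parameter set with a fixed schedule. We refer the readers to [Hinton, Sabour, and Frosst2018] for more details.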


E-Step adjusts the assignment probabilities $C_{l \to n}$ for each input $\hat{H}^l$. First, we compute the probability density of the vote $V_{l \to n}$ from $\hat{H}^l$ under the Gaussian distribution fitted by the output capsule $\Omega_n$ it gets assigned to:

$$p_{l \to n} = \frac{1}{\sqrt{\prod_{h} 2\pi (\sigma^h_n)^2}} \exp\Big(-\sum_{h} \frac{(V^h_{l \to n} - \mu^h_n)^2}{2 (\sigma^h_n)^2}\Big) \tag{15}$$

Accordingly, the assignment probability is re-normalized by

$$C_{l \to n} = \frac{A_n \, p_{l \to n}}{\sum_{n'} A_{n'} \, p_{l \to n'}} \tag{16}$$

As stated above, EM routing is a more powerful routing algorithm, which can better estimate the agreement by allowing active capsules to receive a cluster of similar votes. In addition, it assigns an additional activation probability to represent whether each capsule is present, rather than relying on the length of the vector.
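To make the alternation between the two steps concrete, the following NumPy sketch follows the general EM routing recipe of [Hinton, Sabour, and Frosst2018] for a single position. It is a simplified illustration, not the authors' code: the cost constants `beta_a` and `beta_mu` and the inverse temperature `lam` are hypothetical fixed values (the paper uses a schedule for the temperature), and the input activations are taken as given.

```python
import numpy as np

def em_routing(votes, in_act, iterations=3, beta_a=1.0, beta_mu=1.0,
               lam=1.0, eps=1e-9):
    """votes: (L, N, k) votes; in_act: (L,) input-capsule activation probabilities."""
    num_in, num_out, _ = votes.shape
    assign = np.full((num_in, num_out), 1.0 / num_out)   # uniform start
    for _ in range(iterations):
        # M-step: weight assignments by input activations, then fit one
        # diagonal Gaussian per output capsule.
        weighted = assign * in_act[:, None]
        total = weighted.sum(axis=0) + eps
        mean = np.einsum("ln,lnk->nk", weighted, votes) / total[:, None]
        diff_sq = (votes - mean[None]) ** 2
        var = np.einsum("ln,lnk->nk", weighted, diff_sq) / total[:, None] + eps
        # Per-dimension cost; log(sigma) equals 0.5 * log(variance).
        cost = (0.5 * np.log(var) + 0.5 * (1.0 + np.log(2.0 * np.pi))) * total[:, None]
        logit = lam * (beta_a - beta_mu * total - cost.sum(axis=1))
        out_act = 1.0 / (1.0 + np.exp(-np.clip(logit, -50.0, 50.0)))
        # E-step: Gaussian log-density of each vote, then re-normalise the
        # assignments over output capsules (done in log space for stability).
        log_p = -0.5 * np.sum(np.log(2.0 * np.pi * var)[None] + diff_sq / var[None], axis=2)
        log_assign = np.log(out_act + eps)[None] + log_p
        log_assign -= log_assign.max(axis=1, keepdims=True)
        assign = np.exp(log_assign)
        assign /= assign.sum(axis=1, keepdims=True)
    return mean, out_act, assign

rng = np.random.default_rng(0)
votes = rng.normal(size=(6, 4, 8))               # toy sizes only
in_act = rng.uniform(0.1, 1.0, size=6)
mean, out_act, assign = em_routing(votes, in_act)
```

Working in log space in the E-step is a numerical convenience of this sketch; it yields the same normalised assignments as dividing the activation-weighted densities by their sum.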

4 Experiment

| # | Model | # Para. | Train | Decode | BLEU | Δ |
|---|-------|---------|-------|--------|------|---|
| 1 | Transformer-Base | 88.0M | 1.79 | 1.43 | 27.31 | - |
| 2 | + Linear Combination [Dou et al.2018] | +14.7M | 1.57 | 1.36 | 27.73 | +0.42 |
| 3 | + Dynamic Combination | +25.2M | 1.50 | 1.30 | 28.33 | +1.02 |
| 4 | + Dynamic Routing | +37.8M | 1.37 | 1.24 | 28.22 | +0.91 |
| 5 | + EM Routing | +56.8M | 1.10 | 1.15 | 28.81 | +1.50 |

Table 1: Translation performance on WMT14 English-German translation task. “# Para.” denotes the number of parameters, and “Train” and “Decode” respectively denote the training (steps/second) and decoding (sentences/second) speeds.
| System | Architecture | EnDe # Para. | EnDe BLEU | ZhEn # Para. | ZhEn BLEU |
|--------|--------------|--------------|-----------|--------------|-----------|
| Existing NMT systems | | | | | |
| [Wu et al.2016] | RNN with 8 layers | N/A | 26.30 | N/A | N/A |
| [Gehring et al.2017] | CNN with 15 layers | N/A | 26.36 | N/A | N/A |
| [Vaswani et al.2017] | Transformer-Base | 65M | 27.3 | N/A | N/A |
| | Transformer-Big | 213M | 28.4 | N/A | N/A |
| [Hassan et al.2018] | Transformer-Big | N/A | N/A | N/A | 24.2 |
| Our NMT systems | | | | | |
| this work | Transformer-Base | 88M | 27.31 | 108M | 24.13 |
| | + EM Routing | 123M | 28.81 | 143M | 24.81 |
| | Transformer-Big | 264M | 28.58 | 304M | 24.56 |
| | + EM Routing | 490M | 28.97 | 530M | 25.00 |

Table 2: Comparing with existing NMT systems on WMT14 English-German (“EnDe”) and WMT17 Chinese-English (“ZhEn”) tasks. “” indicates statistically significant difference () from the Transformer baseline.

4.1 Setting

We conducted experiments on two widely-used translation tasks, WMT14 English-German (EnDe) and WMT17 Chinese-English (ZhEn), and compared our model with results reported by previous work [Gehring et al.2017, Vaswani et al.2017, Hassan et al.2018]. For the EnDe task, the training corpus consists of about million sentence pairs. We used newstest2013 as the development set and newstest2014 as the test set. For the ZhEn task, we used all of the available parallel data, consisting of about million sentence pairs. We used newsdev2017 as the development set and newstest2017 as the test set. All the data had been tokenized and segmented into subword symbols using byte-pair encoding with 32K merge operations [Sennrich, Haddow, and Birch2016]. We used 4-gram NIST BLEU score [Papineni et al.2002] as the evaluation metric, and sign-test [Collins, Koehn, and Kucerova2005] for statistical significance test.

We evaluated the proposed approaches on the Transformer model [Vaswani et al.2017]. We followed the configurations in [Vaswani et al.2017], and reproduced their reported results on the EnDe task. The parameters of the proposed models were initialized by the pre-trained model. All the models were trained on eight NVIDIA P40 GPUs where each was allocated with a batch size of 4096 tokens. In consideration of computation cost, we studied model variations with Transformer-Base model on EnDe task, and evaluated overall performance with Transformer-Base and Transformer-Big model on both ZhEn and EnDe tasks.

4.2 Results

4.2.1 Model Variations

Table 1 shows the results on WMT14 EnDe translation task. As one would expect, the linear combination (Row 2) improves translation performance by +0.42 BLEU points, indicating the necessity of aggregating layers for deep NMT models.

All dynamic aggregation models (Rows 3-5) consistently outperform their static counterpart (Row 2), demonstrating the superiority of the dynamic mechanisms. Among the model variations, the simplest strategy, dynamic combination (Row 3), surprisingly improves performance over the baseline model by +1.02 BLEU points. Benefiting from the advanced routing-by-agreement algorithm, the dynamic routing strategy achieves a similar improvement. The EM routing further improves performance by better estimating the agreement during the routing. These findings suggest potential applicability of capsule networks to natural language processing tasks, which has not been fully investigated yet.

All the dynamic aggregation strategies introduce new parameters, ranging from 25.2M to 56.8M. Accordingly, training speed decreases because of the additional parameters to be trained. Dynamic aggregation mechanisms only marginally decrease decoding speed, with EM routing being the slowest one, which decreases decoding speed by 19.6%.
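The reported decoding slowdown can be checked directly from the Table 1 figures. The short Python calculation below uses only numbers from Table 1; the helper name `relative_drop` is introduced here for illustration, and the training-speed figure is derived the same way rather than quoted from the text.

```python
# Speeds and BLEU taken from Table 1 (Rows 1 and 5).
baseline = {"train": 1.79, "decode": 1.43, "bleu": 27.31}
em_routing = {"train": 1.10, "decode": 1.15, "bleu": 28.81}

def relative_drop(before, after):
    """Percentage decrease from `before` to `after`."""
    return 100.0 * (before - after) / before

decode_drop = relative_drop(baseline["decode"], em_routing["decode"])
train_drop = relative_drop(baseline["train"], em_routing["train"])
bleu_gain = em_routing["bleu"] - baseline["bleu"]

print(f"{decode_drop:.1f}% {train_drop:.1f}% {bleu_gain:+.2f}")
# prints "19.6% 38.5% +1.50"
```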

4.2.2 Main Results

Table 2 lists the results on both WMT17 ZhEn and WMT14 EnDe translation tasks. As seen, dynamically aggregating layers consistently improves translation performance across NMT models and language pairs, which demonstrates the effectiveness and universality of the proposed approach. It is worth mentioning that Transformer-Base with EM routing outperforms the vanilla Transformer-Big model with fewer than half of the parameters, demonstrating that our model can utilize the parameters more efficiently and effectively.

4.3 Analysis of Iterative Routing

We conducted extensive analysis from different perspectives to better understand the iterative routing process. All results are reported on the development set of EnDe task with “Transformer-Base + EM routing” model.

Figure 2: Impact of number of output capsules.

4.3.1 Impact of the Number of Output Capsules

The number of output capsules $N$ is a key parameter for our model, as shown in Figure 1. We plot in Figure 2 the BLEU score with different numbers of output capsules. Generally, the BLEU score goes up with the increase of the capsule numbers. As aforementioned, to make the dimensionality of the final output consistent with the hidden layer (i.e. $d$), the dimensionality of each capsule output is $d/N$. When $N$ increases, the dimensionality of capsule output decreases (the minimum value is 1), which may lead to more subtle representations of different properties of the input.
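The trade-off between the number of output capsules and the size of each capsule is simple arithmetic. The sketch below assumes a hidden size of 512, the Transformer-Base setting of [Vaswani et al.2017]; the helper `capsule_dim` and the chosen capsule counts are illustrative only.

```python
def capsule_dim(hidden_size, num_output_capsules):
    """Size of each output capsule so that their concatenation has hidden_size."""
    if hidden_size % num_output_capsules != 0:
        raise ValueError("hidden size must be divisible by the number of capsules")
    return hidden_size // num_output_capsules

hidden_size = 512   # assumed: Transformer-Base hidden size
dims = {n: capsule_dim(hidden_size, n) for n in (1, 8, 64, 512)}
# dims == {1: 512, 8: 64, 64: 8, 512: 1}
```

With 512 output capsules each capsule output is a single scalar, which is the minimum dimensionality mentioned above.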

Figure 3: Impact of routing iterations.

4.3.2 Impact of Routing Iterations

Another key parameter is the number of routing iterations $T$, which affects the estimation of the agreement. As shown in Figure 3, the BLEU score typically goes up with the increase of the iterations $T$, while the trend does not hold when $T > 3$. This indicates that more iterations may over-estimate the agreement between two capsules, which harms the performance. The optimal iteration is also consistent with the findings in previous work [Sabour, Frosst, and Hinton2017, Hinton, Sabour, and Frosst2018].

Model Construct with BLEU
Base N/A 25.84
Ours 26.18
Table 3: Impact of functions to construct input capsules.

4.3.3 Impact of Functions to Construct Input Capsules

For the iterative routing models, we use $F_l(H^1, \dots, H^L)$ instead of $F_l(H^l)$ to construct each input capsule $\hat{H}^l$. Table 3 lists the comparison results, which show that the former indeed outperforms the latter. We attribute this to the fact that $F_l(H^1, \dots, H^L)$ is more representative, since it extracts features from the concatenation of the original layer representations.

Figure 4: Agreement distribution with 6 input capsules (y-axis) and 512 output capsules (x-axis). Darker color denotes higher agreement. The three heatmaps from top to bottom are respectively the first to third iterations.

4.3.4 Visualization of Agreement Distribution

The assignment probability $C_{l \to n}$ before the M-step denotes the agreement between the input capsule $\hat{H}^l$ and the output capsule $\Omega_n$, which is determined by the iterative routing. A higher agreement denotes that the input capsule $\hat{H}^l$ prefers to send its representation to the output capsule $\Omega_n$. We plot in Figure 4 the agreement distribution in different routing iterations. In the first iteration (top panel), the initialized uniform distribution is employed as the agreement distribution, and each output capsule equally attends to all the input capsules. As the iterative routing proceeds, the input capsules learn to send their representations to proper output capsules, and accordingly output capsules are more likely to capture distinct features. We empirically validate our claim from the following two perspectives.

We use the entropy to measure the skewness of the agreement distributions:

$$\mathrm{entropy} = -\frac{1}{L} \sum_{l=1}^{L} \sum_{n=1}^{N} C_{l \to n} \log C_{l \to n} \tag{17}$$

A lower entropy denotes a more skewed distribution, which indicates that the input capsules are more certain about which output capsules should be routed more information. The entropies of the three iterations are respectively 6.24, 5.93, and 5.86, which indeed decrease as expected.
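As a sanity check on the first reported value, the sketch below computes the entropy of a uniform assignment over 512 output capsules for 6 input capsules, the setting of Figure 4. Averaging over input capsules and using the natural logarithm are assumptions of this sketch; they are chosen because they reproduce the reported first-iteration entropy of 6.24, which equals ln 512.

```python
import numpy as np

def mean_routing_entropy(assign):
    """assign: (L, N) array, each row a distribution over N output capsules.

    Returns the entropy in nats, averaged over the L input capsules.
    """
    safe = np.clip(assign, 1e-12, 1.0)          # avoid log(0)
    return float(-(assign * np.log(safe)).sum(axis=1).mean())

uniform = np.full((6, 512), 1.0 / 512)          # first iteration: uniform routing
peaked = np.zeros((6, 512))
peaked[:, 0] = 1.0                              # every input routed to one capsule

uniform_entropy = mean_routing_entropy(uniform)
peaked_entropy = mean_routing_entropy(peaked)
print(round(uniform_entropy, 2))                # prints 6.24
```

A fully peaked assignment has entropy 0, so the reported decrease from 6.24 to 5.86 indicates only a mild sharpening of the routing distribution.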

To validate the claim that different output capsules focus on different subsets of input capsules, we measure the diversity between each pair of output capsules. Let $C_n$ be the agreement probabilities assigned to the output capsule $\Omega_n$; we calculate the diversity among all the output capsules as


A higher diversity score denotes that output capsules attend to different subsets of input capsules. The diversity scores of the three iterations are respectively 0.0, 0.09, and 0.18, which reconfirm our observations.

4.4 Effect on Encoder and Decoder

Model Applied to BLEU
Encoder Decoder
Base N/A N/A 25.84
Ours × 26.33
× 26.34
Table 4: Effect of EM routing on encoder and decoder.

Both encoder and decoder are composed of a stack of layers, which may benefit from the proposed approach. In this experiment, we investigate how our model affects the two components, as shown in Table 4. Aggregating layers of encoder or decoder individually consistently outperforms the vanilla baseline model, and exploiting both components further improves performance. These results provide support for the claim that aggregating layers is useful for both understanding input sequence and generating output sequence.

4.5 Length Analysis

Figure 5: BLEU scores on the EnDe test set with respect to various input sentence lengths.

Following Bahdanau, Cho, and Bengio [2015] and Tu et al. [2016], we grouped sentences of similar lengths together and computed the BLEU score for each group, as shown in Figure 5. Generally, the performance of Transformer-Base goes up with the increase of input sentence lengths. We attribute this to the strength of the self-attention mechanism in modeling global dependencies without regard to their distance. Clearly, the proposed approaches outperform the baseline in all length segments.

5 Related Work

Our work is inspired by research in the field of exploiting deep representation and capsule networks.

Exploiting Deep Representation

Exploiting deep representations has been studied by various communities, from computer vision to natural language processing. He et al. [2016] propose a residual learning framework, combining layers and encouraging gradient flow by simple short-cut connections. Huang et al. [2017] extend the idea by introducing densely connected layers which could better strengthen feature propagation and encourage feature reuse. Deep layer aggregation [Yu et al.2018] designs architectures to fuse information iteratively and hierarchically.

Concerning natural language processing, Peters et al. [2018] have found that combining different layers is helpful and their model significantly improves state-of-the-art models on various tasks. Researchers have also explored fusing information for NMT models and demonstrated that aggregating layers is also useful for NMT [Shen et al.2018, Wang et al.2018, Dou et al.2018]. However, all of these works mainly focus on static aggregation in that their aggregation strategy is independent of specific hidden states. In response to this problem, we introduce dynamic principles into layer aggregation. In addition, their approaches follow a fixed policy without considering the representation of the final output, while the routing-by-agreement mechanisms are able to aggregate information according to the final representation.

Capsule Networks

The idea of dynamic routing was first proposed by Sabour, Frosst, and Hinton [2017], which aims at addressing the representational limitations of convolutional and recurrent neural networks for image classification. The iterative routing procedure is further improved by using the Expectation-Maximization algorithm to better estimate the agreement between capsules [Hinton, Sabour, and Frosst2018]. In the computer vision community, Xi, Bing, and Jin [2017] explore its application on CIFAR data with higher dimensionality. LaLonde and Bagci [2018] apply capsule networks to the object segmentation task.

The applications of capsule networks in natural language processing tasks, however, have not been widely investigated to date. Zhao et al. [2018] test capsule networks on text classification tasks and Gong et al. [2018] propose to aggregate a sequence of vectors via dynamic routing for sequence encoding. To the best of our knowledge, this work is the first to apply the idea of dynamic routing to NMT.

6 Conclusion

In this work, we propose several methods to dynamically aggregate layers for deep NMT models. Our best model, which utilizes EM-based iterative routing to estimate the agreement between inputs and outputs, has achieved significant improvements over the baseline model across language pairs. By visualizing the routing process, we find that capsule networks are able to extract most active features shared by different inputs. Our study suggests potential applicability of capsule networks across computer vision and natural language processing tasks for aggregating information of multiple inputs.

Future directions include validating our approach on other NMT architectures such as RNN [Chen et al.2018] and CNN [Gehring et al.2017], as well as on other NLP tasks such as dialogue and reading comprehension. It is also interesting to combine with other techniques [Shaw, Uszkoreit, and Vaswani2018, Li et al.2018, Dou et al.2018, Yang et al.2018, Yang et al.2019, Kong et al.2019] to further boost the performance of Transformer.


  • [Anastasopoulos and Chiang2018] Anastasopoulos, A., and Chiang, D. 2018. Tied multitask learning for neural speech translation. In NAACL.
  • [Bahdanau, Cho, and Bengio2015] Bahdanau, D.; Cho, K.; and Bengio, Y. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.
  • [Chen et al.2018] Chen, M. X.; Firat, O.; Bapna, A.; Johnson, M.; Macherey, W.; Foster, G.; Jones, L.; Niki, P.; Schuster, M.; Chen, Z.; Wu, Y.; and Hughes, M. 2018. The Best of Both Worlds: Combining Recent Advances in Neural Machine Translation. In ACL.
  • [Cho et al.2014] Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP.
  • [Collins, Koehn, and Kucerova2005] Collins, M.; Koehn, P.; and Kucerova, I. 2005. Clause restructuring for statistical machine translation. In ACL.
  • [Dou et al.2018] Dou, Z.-Y.; Tu, Z.; Wang, X.; Shi, S.; and Zhang, T. 2018. Exploiting deep representations for neural machine translation. In EMNLP.
  • [Gehring et al.2017] Gehring, J.; Auli, M.; Grangier, D.; Yarats, D.; and Dauphin, Y. N. 2017. Convolutional sequence to sequence learning. In ICML.
  • [Gong et al.2018] Gong, J.; Qiu, X.; Wang, S.; and Huang, X. 2018. Information aggregation via dynamic routing for sequence encoding. In COLING.
  • [Hassan et al.2018] Hassan, H.; Aue, A.; Chen, C.; Chowdhary, V.; Clark, J.; Federmann, C.; Huang, X.; Junczys-Dowmunt, M.; Lewis, W.; Li, M.; et al. 2018. Achieving human parity on automatic chinese to english news translation. arXiv.
  • [He et al.2016] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR.
  • [Hinton, Krizhevsky, and Wang2011] Hinton, G. E.; Krizhevsky, A.; and Wang, S. D. 2011. Transforming auto-encoders. In ICANN.
  • [Hinton, Sabour, and Frosst2018] Hinton, G. E.; Sabour, S.; and Frosst, N. 2018. Matrix capsules with em routing. In ICLR.
  • [Huang et al.2017] Huang, G.; Liu, Z.; van der Maaten, L.; and Weinberger, K. Q. 2017. Densely connected convolutional networks. In CVPR.
  • [Kong et al.2019] Kong, X.; Tu, Z.; Shi, S.; Hovy, E.; and Zhang, T. 2019. Neural machine translation with adequacy-oriented learning. In AAAI.
  • [LaLonde and Bagci2018] LaLonde, R., and Bagci, U. 2018. Capsules for object segmentation. arXiv.
  • [Li et al.2018] Li, J.; Tu, Z.; Yang, B.; Lyu, M. R.; and Zhang, T. 2018. Multi-head attention with disagreement regularization. In EMNLP.
  • [Meng et al.2016] Meng, F.; Lu, Z.; Tu, Z.; Li, H.; and Liu, Q. 2016. A deep memory-based architecture for sequence-to-sequence learning. In ICLR Workshop.
  • [Papineni et al.2002] Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. Bleu: a method for automatic evaluation of machine translation. In ACL.
  • [Peters et al.2018] Peters, M. E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L. 2018. Deep contextualized word representations. In NAACL.
  • [Sabour, Frosst, and Hinton2017] Sabour, S.; Frosst, N.; and Hinton, G. E. 2017. Dynamic routing between capsules. In NIPS.
  • [Sennrich, Haddow, and Birch2016] Sennrich, R.; Haddow, B.; and Birch, A. 2016. Neural machine translation of rare words with subword units. In ACL.
  • [Shaw, Uszkoreit, and Vaswani2018] Shaw, P.; Uszkoreit, J.; and Vaswani, A. 2018. Self-Attention with Relative Position Representations. In NAACL.
  • [Shen et al.2018] Shen, Y.; Tan, X.; He, D.; Qin, T.; and Liu, T.-Y. 2018. Dense information flow for neural machine translation. In NAACL.
  • [Tu et al.2016] Tu, Z.; Lu, Z.; Liu, Y.; Liu, X.; and Li, H. 2016. Modeling coverage for neural machine translation. In ACL.
  • [Vaswani et al.2017] Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In NIPS.
  • [Wang et al.2018] Wang, Q.; Li, F.; Xiao, T.; Li, Y.; Li, Y.; and Zhu, J. 2018. Multi-layer representation fusion for neural machine translation. In COLING.
  • [Wu et al.2016] Wu, Y.; Schuster, M.; Chen, Z.; Le, Q. V.; Norouzi, M.; Macherey, W.; Krikun, M.; Cao, Y.; Gao, Q.; Macherey, K.; et al. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv.
  • [Xi, Bing, and Jin2017] Xi, E.; Bing, S.; and Jin, Y. 2017. Capsule network performance on complex data. arXiv.
  • [Yang et al.2018] Yang, B.; Tu, Z.; Wong, D. F.; Meng, F.; Chao, L. S.; and Zhang, T. 2018. Modeling localness for self-attention networks. In EMNLP.
  • [Yang et al.2019] Yang, B.; Li, J.; Wong, D. F.; Chao, L. S.; Wang, X.; and Tu, Z. 2019. Context-aware self-attention networks. In AAAI.
  • [Yu et al.2018] Yu, F.; Wang, D.; Shelhamer, E.; and Darrell, T. 2018. Deep layer aggregation. In CVPR.
  • [Zeiler and Fergus2014] Zeiler, M. D., and Fergus, R. 2014. Visualizing and understanding convolutional networks. In ECCV.
  • [Zhao et al.2018] Zhao, W.; Ye, J.; Yang, M.; Lei, Z.; Zhang, S.; and Zhao, Z. 2018. Investigating capsule networks with dynamic routing for text classification. In EMNLP.
  • [Zhou et al.2016] Zhou, J.; Cao, Y.; Wang, X.; Li, P.; and Xu, W. 2016. Deep recurrent models with fast-forward connections for neural machine translation. TACL.