
Dynamic Past and Future for Neural Machine Translation

Previous studies have shown that neural machine translation (NMT) models can benefit from modeling translated (Past) and un-translated (Future) source contents as recurrent states (Zheng et al., 2018). However, the recurrent process is less interpretable. In this paper, we propose to model Past and Future with a Capsule Network (Hinton et al., 2011), which provides an explicit separation of source words into groups of Past and Future through a process of parts-to-wholes assignment. The assignment is learned with a novel variant of the routing-by-agreement mechanism (Sabour et al., 2017), namely Guided Dynamic Routing, in which what to translate at the current decoding step guides the routing process to assign each source word to its associated group, represented by a capsule, and to refine the representation of the capsule dynamically and iteratively. Experiments on translation tasks of three language pairs show that our model achieves substantial improvements over both RNMT and Transformer. Extensive analysis further verifies that our method does recognize translated and untranslated content as expected, and produces better and more adequate translations.





1 Introduction

Neural machine translation (NMT) generally adopts an attentive encoder-decoder framework Sutskever et al. (2014); Gehring et al. (2017); Vaswani et al. (2017), where the encoder maps the source sentence into a sequence of contextual representations (source contents), and the decoder generates the target sentence word by word based on the part of the source content assigned by an attention model Bahdanau et al. (2015). Similar to human translators, NMT systems should be able to identify the relevant source-side context for the current word (Present), as well as to recognize which parts of the source contents have been translated (Past) and which have not (Future), at each decoding step. Accordingly, the Past, Present and Future are three dynamically changing states during the whole translation process.

Figure 1: An example of Past and Future in machine translation. When generating the current translation (i.e., “his”), the source tokens “BOS”, “布什(Bush)” and the phrase “为…辩护(defend)” are the translated contents (Past), while the remaining tokens are untranslated contents (Future). Intuitively, the prediction of the current translation (Present) is expected to benefit from an explicit separation of Past and Future source contents.

Previous studies have shown that NMT models are likely to suffer from the well-known problem of inadequate translation Kong et al. (2019), which usually manifests as over- and under-translation Tu et al. (2016, 2017). This issue may be attributed to the limited ability of NMT models to recognize the translated and untranslated contents during translation. To remedy this, Zheng et al. (2018) first demonstrate that explicitly tracking Past and Future contents helps NMT models alleviate this issue and generate better translations. In their work, Past and Future contents are modeled as recurrent states during the decoding procedure. While recurrently tracking the Past and Future states has proven useful, it remains non-trivial to figure out which source words are the Past and which are the Future, and to what extent the recurrent states represent them at each decoding step, which makes the model less interpretable.

We argue that an explicit separation of the source words into two groups, representing Past and Future respectively (Figure 1), could be more beneficial not only for easy and direct recognition of the translated and untranslated source contents but also for better interpretation of model’s behavior of the recognition. We formulate the explicit separation as a procedure of parts-to-wholes assignment: the sequence of contextual representations of the source words (parts) should be assigned to a distinct group of either Past or Future (wholes).

In this paper, we propose to use a Capsule Network with the routing-by-agreement mechanism Hinton et al. (2011), which has demonstrated its appealing strength in solving the problem of parts-to-wholes assignment in computer vision Hinton et al. (2011); Sabour et al. (2017); Hinton et al. (2018) and natural language processing Dou et al. (2019); Zhao et al. (2018); Gong et al. (2018), to model the separation of the Past and Future: 1) We first cast the Past and Future source contents as two groups of capsules. 2) We then design a novel variant of the routing-by-agreement mechanism, called Guided Dynamic Routing (Gdr), which is guided by what to translate at each decoding step to assign each source word to its associated capsules by assignment probabilities for several routing iterations. 3) Finally, the Past and Future capsules accumulate their expected contents from the source representations, and are fed into the decoder to provide a time-dependent holistic view of context for the prediction. In addition, two auxiliary learning signals help Gdr acquire the expected functionality, beyond the implicit learning within the conventional training process of NMT models.

We conducted extensive experiments and analysis to verify the effectiveness of our proposed model. Translation experiments on Chinese-to-English, English-to-German, and English-to-Romanian show consistent and substantial improvements over the Transformer Vaswani et al. (2017) and RNMT Bahdanau et al. (2015). Visualized evidence shows that the proposed approach does acquire the expected ability to separate the source words into Past and Future, achieving better performance and interpretability. We also observe that our model does alleviate the inadequate translation problem: human subjective evaluation reveals that our model produces higher-quality and more adequate translations than the Transformer. Another investigation concerning the length of source sentences shows that our model generates not only longer but also better translations.

2 Neural Machine Translation

Neural models for sequence-to-sequence tasks such as machine translation often adopt an encoder-decoder framework. Given a source sentence $\mathbf{x} = \langle x_1, \dots, x_I \rangle$, the encoder-decoder model learns to predict a target sentence $\mathbf{y} = \langle y_1, \dots, y_T \rangle$ by maximizing the conditional probabilities $p(\mathbf{y}|\mathbf{x}) = \prod_{t=1}^{T} p(y_t | y_{<t}, \mathbf{x})$. Specifically, an encoder first maps the source sentence into a sequence of encoded representations:

$$\mathbf{h} = \langle h_1, \dots, h_I \rangle = f_e(\mathbf{x}), \tag{1}$$

where $f_e$ is the transformation function of the encoder. Given the encoded representations of the source sentence, a decoder generates the target sequence of symbols autoregressively:

$$z_t = f_d(y_{<t}, a_t), \qquad p(y_t | y_{<t}, \mathbf{x}) = \mathrm{softmax}(E(y_t)^\top z_t). \tag{2}$$

The current word $y_t$ is predicted based on the decoder state $z_t$, where $E(y_t)$ denotes the embedding of $y_t$. $f_d$ is the transformation function of the decoder, which determines $z_t$ based on the target translation trajectory $y_{<t}$ and the lexical-level source content $a_t$ that is most relevant to the Present translation, assigned by the attention model Bahdanau et al. (2015). Ideally, with all the source contextual representations in the encoder, NMT models should be able to update the translated and untranslated source contents and keep track of them. However, most existing NMT models lack an explicit mechanism to maintain the translated and untranslated contents, failing to separate source words and assign them to either Past or Future Zheng et al. (2018), and are therefore likely to suffer from a severe inadequate translation problem Kong et al. (2019); Tu et al. (2016).

3 Approach

3.1 Motivation

Our intuition is straightforward: if we could tell the translated and untranslated source contents apart by directly separating the source words into Past and Future categories at each decoding step, the Present translation could benefit from the dynamically holistic context (i.e., Past + Present + Future). To this end, we should design a mechanism by which each word can be recognized and assigned to a distinct category, i.e., Past or Future contents, subject to the translation information at present. This procedure can be seen as a parts-to-wholes assignment, in which the encoder hidden states of the source words (parts) are supposed to be assigned to either Past or Future (wholes).

Capsule network Hinton et al. (2011) has shown its capability of solving the problem of assigning parts to wholes Sabour et al. (2017). A capsule is a vector of neurons which represents different properties of the same entity from the input Sabour et al. (2017). The functionality relies on a fast iterative process called routing-by-agreement, whose basic idea is to iteratively refine the proportion of how much a part should be assigned to a whole, based on the agreement between parts and wholes Dou et al. (2019). Therefore, it is appealing to investigate whether this mechanism can be employed to realize our intuition.

3.2 Guided Dynamic Routing (Gdr)

Dynamic routing Sabour et al. (2017) is an implementation of routing-by-agreement, where it runs intrinsically without any external guidance. However, what we expect should be a mechanism driven by the decoding information at present. Here we propose a variant of dynamic routing mechanism called Guided Dynamic Routing (Gdr), where the routing process is guided by the translating information at each decoding step.

Formally, we let the encoder hidden states $\mathbf{h} = \langle h_1, \dots, h_I \rangle$ of the source words be the input capsules, and we denote by $\Omega$ the output capsules, which consist of $J$ entries. Initially, we assume that $J/2$ of them ($\Omega^P$) represent the Past contents, and the remaining $J/2$ capsules ($\Omega^F$) represent the Future contents:

$$\Omega^P = \langle \Omega_1, \dots, \Omega_{J/2} \rangle, \qquad \Omega^F = \langle \Omega_{J/2+1}, \dots, \Omega_J \rangle. \tag{3}$$

Each capsule is represented by a $d_c$-dimensional vector. We combine the Past capsules and Future capsules together, which are expected to compete for the corresponding source information. So now we have $\Omega = \Omega^P \cup \Omega^F$. We will describe how to teach these capsules to retrieve their relevant parts from the source contents in Section 3.4.

Note that we employ Gdr at every decoding step $t$ to obtain time-dependent translated and untranslated contents, and omit the subscript $t$ for simplicity.

In the dynamic routing process, the output vector of each capsule $j$ is calculated with a non-linear squashing function Sabour et al. (2017):

$$\Omega_j = \frac{\|s_j\|^2}{1 + \|s_j\|^2} \frac{s_j}{\|s_j\|}, \qquad s_j = \sum_{i=1}^{I} c_{ij} v_{ij}, \tag{4}$$

where $s_j$ is the accumulated input of capsule $\Omega_j$, which is a weighted sum over all vote vectors $v_{ij}$. $v_{ij}$ is transformed from the input capsule $h_i$:

$$v_{ij} = W_j h_i, \tag{5}$$

where $W_j$ is a trainable transformation matrix for the $j$-th output capsule, and $c_{ij}$ is the assignment probability (i.e., the agreement) that is determined by the iterative dynamic routing. (Footnote 1: Unlike Sabour et al. (2017), where each pair of input capsule and output capsule has a distinct transformation matrix because their numbers are predefined ($I \times J$ transformation matrices in total), here we share the transformation matrix of each output capsule among all the input capsules, since the number of source words varies. So there are $J$ transformation matrices in our model.)

The assignment probabilities associated with each input capsule $h_i$ sum to 1, i.e., $\sum_j c_{ij} = 1$, and are determined by:

$$c_{ij} = \frac{\exp(b_{ij})}{\sum_{j'} \exp(b_{ij'})}, \tag{6}$$

where the routing logit $b_{ij}$, initialized to 0, measures the degree to which $h_i$ should be sent to $\Omega_j$. The initial assignment probabilities are then iteratively updated by measuring the agreement between the vote vector $v_{ij}$ and the capsule $\Omega_j$ with an MLP, conditioned on the current decoding state $z_t$:

$$b_{ij} \leftarrow b_{ij} + w^\top \tanh(W_b [z_t; v_{ij}; \Omega_j]), \tag{7}$$

where $W_b$ and $w$ are learnable parameters. Instead of the simple scalar product, i.e., $b_{ij} \leftarrow b_{ij} + v_{ij}^\top \Omega_j$ Sabour et al. (2017), which cannot take the current decoding state into account as a conditioning signal, we resort to the MLP to take $z_t$ into account, inspired by MLP-based attention mechanisms Bahdanau et al. (2015); Luong et al. (2015). That is why we call it “guided” dynamic routing.

Now, with the current decoding information, the hidden state (input capsule) of a source word prefers to send its representation to the output capsules that have large routing agreements with it. After a few rounds of iterations, the output capsules are able to ignore all but the most relevant information from the source hidden states, each representing a distinct aspect of either Past or Future.

Redundant Capsules

The aim of the proposed Gdr is to decouple the representations of source words into Past and Future properties. However, in some cases, some parts of the source sentence may belong to neither the past contents nor the future contents. For example, function words in English (e.g., “the”) may have no counterpart translation in Chinese. Therefore, we add additional Redundant Capsules to mark Non-Past/Future contents, which are expected to receive higher routing assignment probabilities when a source word, represented by its encoder hidden state, should not belong to either Past or Future.

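procedure Gdr($\mathbf{h}$, $z_t$, $r$)
    for all $i, j$: $b_{ij} \leftarrow 0$ (initialize routing logits)
    for $r$ iterations do
        for all $i, j$: compute assignment probabilities $c_{ij}$ by Equation 6
        for all $j$: compute capsules $\Omega_j$ by Equation 4
        for all $i, j$: compute $b_{ij}$ by Equation 7
    end for
    return $\Omega^P$, $\Omega^F$, $\Omega^R$ (past, future, and redundant capsules)
Algorithm 1 Guided Dynamic Routing (Gdr)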

We show the algorithm of our guided dynamic routing in Algorithm 1.
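The routing loop of Algorithm 1 can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`softmax`, `squash`, `guided_dynamic_routing`), the parameter shapes, the hidden size of the scoring MLP, the default number of iterations and the random inputs are all hypothetical, and splitting the returned capsules into Past, Future and redundant groups is left to the caller.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def squash(s, eps=1e-9):
    """Squashing non-linearity: keeps the direction, maps the norm into [0, 1)."""
    sq_norm = np.sum(s * s, axis=-1, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)


def guided_dynamic_routing(h, z_t, W, W_b, w, n_iters=3):
    """Route I source states into J capsules, guided by the decoder state.

    Assumed (hypothetical) shapes:
      h   : (I, d)             encoder hidden states (input capsules)
      z_t : (d_z,)             current decoder state
      W   : (J, d, d_c)        one transformation matrix per output capsule
      W_b : (d_z + 2*d_c, d_a) first layer of the scoring MLP
      w   : (d_a,)             output layer of the scoring MLP
    Returns the capsules (J, d_c) and the assignment probabilities (I, J).
    """
    num_src, num_caps = h.shape[0], W.shape[0]
    votes = np.einsum("id,jdc->ijc", h, W)      # vote vectors, (I, J, d_c)
    b = np.zeros((num_src, num_caps))           # routing logits start at 0
    for _ in range(n_iters):
        c = softmax(b, axis=1)                  # each source word sums to 1
        caps = squash(np.einsum("ij,ijc->jc", c, votes))
        feats = np.concatenate(
            [
                np.broadcast_to(z_t, (num_src, num_caps, z_t.shape[0])),
                votes,
                np.broadcast_to(caps, (num_src, num_caps, caps.shape[1])),
            ],
            axis=-1,
        )
        b = b + np.tanh(feats @ W_b) @ w        # guided agreement update
    return caps, c


rng = np.random.default_rng(0)
I, J, d, d_z, d_c, d_a = 5, 4, 8, 8, 6, 10
caps, c = guided_dynamic_routing(
    h=rng.normal(size=(I, d)),
    z_t=rng.normal(size=d_z),
    W=rng.normal(size=(J, d, d_c)),
    W_b=rng.normal(size=(d_z + 2 * d_c, d_a)),
    w=rng.normal(size=d_a),
)
print(caps.shape, c.sum(axis=1))
```

Because the softmax runs over the capsule axis, every source word distributes a total probability of 1 across the capsules, so the capsules compete for each word rather than the words competing for a capsule.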

3.3 Integrating into NMT

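The proposed approach can be applied on top of any sequence-to-sequence architecture (Figure 2). Given a sentence $\mathbf{x} = \langle x_1, \dots, x_I \rangle$, the encoder leverages $N$ stacked identical layers to map the sentence into contextual representations:

$$\mathbf{h}^{(l)} = \mathrm{EncoderLayer}(\mathbf{h}^{(l-1)}),$$

where the superscript $l$ indicates layer depth. Based on the encoded source representations $\mathbf{h}^{(N)}$, a decoder generates the translation word by word. The decoder also has $N$ stacked identical layers:

$$\mathbf{z}^{(l)} = \mathrm{DecoderLayer}(\mathbf{z}^{(l-1)}, \mathbf{a}^{(l)}),$$

where $\mathbf{a}^{(l)}$ is the lexical-level source context assigned by an attention mechanism between the current decoder layer and the last encoder layer. Given the hidden states of the last decoder layer $\mathbf{z}^{(N)}$, we perform our proposed guided dynamic routing (Gdr) mechanism to compute the Past and Future contents from the source side and obtain the holistic context of each decoding step:

$$\Omega^P, \Omega^F, \Omega^R = \mathrm{Gdr}(\mathbf{h}^{(N)}, \mathbf{z}^{(N)}), \qquad \mathbf{o} = \mathrm{FeedForward}(\mathbf{z}^{(N)}, \Omega^P, \Omega^F),$$

where $\mathbf{o} = \langle o_1, \dots, o_T \rangle$ is the sequence of holistic context of each decoding step. Based on the holistic context, the output probabilities are computed as:

$$p(y_t | y_{<t}, \mathbf{x}) = \mathrm{softmax}(E(y_t)^\top o_t).$$

The NMT model is now able to employ the dynamic holistic context for better generation.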
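As a rough illustration of this integration step, the sketch below fuses one decoder state with the Past and Future capsules and turns the result into a distribution over the vocabulary. The fusion design is an assumption: the text does not say how the capsules are combined, so the concatenation, the two-layer ReLU feed-forward network, the function names `holistic_context` and `output_probs`, and all shapes are hypothetical.

```python
import numpy as np


def holistic_context(z_t, past_caps, future_caps, W1, b1, W2, b2):
    """Fuse the decoder state with flattened Past/Future capsules (assumed design)."""
    x = np.concatenate([z_t, past_caps.ravel(), future_caps.ravel()])
    hidden = np.maximum(0.0, W1 @ x + b1)   # ReLU feed-forward layer
    return W2 @ hidden + b2                 # holistic context o_t


def output_probs(o_t, E):
    """Softmax over the vocabulary from dot products with output embeddings E."""
    logits = E @ o_t
    logits = logits - logits.max()          # numerical stability
    p = np.exp(logits)
    return p / p.sum()


rng = np.random.default_rng(0)
d_z, d_c, n_caps, d_ff, vocab = 8, 6, 2, 16, 20
d_in = d_z + 2 * n_caps * d_c               # decoder state + Past + Future
o_t = holistic_context(
    rng.normal(size=d_z),
    rng.normal(size=(n_caps, d_c)),
    rng.normal(size=(n_caps, d_c)),
    rng.normal(size=(d_ff, d_in)), np.zeros(d_ff),
    rng.normal(size=(d_z, d_ff)), np.zeros(d_z),
)
probs = output_probs(o_t, rng.normal(size=(vocab, d_z)))
print(probs.shape, probs.sum())
```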

Figure 2: Illustration of our architecture.

3.4 Learning

Auxiliary Losses

To ensure that the dynamic routing process behaves as expected, we introduce the following auxiliary losses to assist the learning process.

Bag-of-Word Constraint

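Weng et al. (2017) propose a multi-tasking scheme to boost NMT by predicting the bag-of-words of the target sentence using the Word Predictions approach. Inspired by this work, we introduce a BoW constraint to encourage the Past and Future capsules to be predictive of the preceding and subsequent bag-of-words at the current decoding step, respectively:

$$\mathcal{L}_{\mathrm{BoW}} = \frac{1}{T} \sum_{t=1}^{T} \left( -\log p_{\mathrm{pre}}(y_{\le t} | \Omega^P_t) - \log p_{\mathrm{sub}}(y_{\ge t} | \Omega^F_t) \right),$$

where $p_{\mathrm{pre}}(y_{\le t} | \Omega^P_t)$ and $p_{\mathrm{sub}}(y_{\ge t} | \Omega^F_t)$ are the predicted probabilities of the preceding bag-of-words and the subsequent bag-of-words at decoding step $t$, respectively. For instance, the probabilities of the preceding bag-of-words are computed by:

$$p_{\mathrm{pre}}(y_{\le t} | \Omega^P_t) = \prod_{\tau=1}^{t} p_{\mathrm{pre}}(y_\tau | \Omega^P_t) \propto \prod_{\tau=1}^{t} \exp\left(E(y_\tau)^\top W^P_{\mathrm{BoW}} \Omega^P_t\right).$$

The computation for the subsequent bag-of-words is similar. By applying the BoW constraint, the Past capsules and Future capsules learn to reflect the target-side past and future bag-of-words information.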
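The BoW constraint can be sketched as a negative log-likelihood over bags of target words. The code below is only an illustration under assumptions: each group of capsules is taken to be flattened into one summary vector per decoding step, the projections `W_past` and `W_future` and the embedding matrix `E` are hypothetical parameters, and both bags are assumed to include the current word (with 0-based steps, `y[:t+1]` and `y[t:]`); other index conventions are possible.

```python
import numpy as np


def log_softmax(x):
    """Numerically stable log-softmax of a 1-D array."""
    x = x - x.max()
    return x - np.log(np.exp(x).sum())


def bag_nll(capsule_vec, word_ids, E, W):
    """Negative log-likelihood of a bag of words given one capsule summary."""
    log_probs = log_softmax(E @ (W @ capsule_vec))          # (V,)
    return -log_probs[np.asarray(word_ids, dtype=int)].sum()


def bow_loss(past_vecs, future_vecs, y, E, W_past, W_future):
    """Average BoW loss over decoding steps.

    Assumption: at step t the preceding bag is y[:t+1] and the subsequent
    bag is y[t:], i.e. both include the current word.
    """
    T = len(y)
    total = 0.0
    for t in range(T):
        total += bag_nll(past_vecs[t], y[: t + 1], E, W_past)
        total += bag_nll(future_vecs[t], y[t:], E, W_future)
    return total / T


rng = np.random.default_rng(0)
T, V, d_e, d_cap = 4, 12, 8, 6
y = rng.integers(0, V, size=T)                  # toy target word ids
loss = bow_loss(
    rng.normal(size=(T, d_cap)),
    rng.normal(size=(T, d_cap)),
    y,
    rng.normal(size=(V, d_e)),
    rng.normal(size=(d_e, d_cap)),
    rng.normal(size=(d_e, d_cap)),
)
print(loss)
```

Computing a full-vocabulary softmax for every step and both groups of capsules is the costly part of this loss, and it is only needed during training.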

Bilingual Content Alignment

Intuitively, the translated source contents should be semantically consistent with the translated target contents, and the same holds for the untranslated contents. Based on this intuition, a natural idea is to encourage the source Past contents, modeled by the Past capsules, to be close to the target past representation, and to apply the same encouragement to the Future. Hence, we propose a Bilingual Content Alignment (BCA) constraint to let the bilingual associated contents be predictive of each other through a mean squared error (MSE) loss:

$$\mathcal{L}_{\mathrm{BCA}} = \left\| \Omega^P_t - W^P_{\mathrm{BCA}} \Big( \frac{1}{t} \sum_{\tau=1}^{t} z_\tau \Big) \right\|^2 + \left\| \Omega^F_t - W^F_{\mathrm{BCA}} \Big( \frac{1}{T-t+1} \sum_{\tau=t}^{T} z_\tau \Big) \right\|^2,$$

where the target-side past information is represented by the average of the decoder hidden states of all preceding words, while the average of the subsequent decoder hidden states represents the target-side future information.
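The alignment constraint can be illustrated with a small NumPy sketch. It is written under assumptions rather than taken from the authors' code: the Past and Future capsules are taken to be flattened into one vector per step, `W_past` and `W_future` are hypothetical projections from the decoder space to the capsule space, the running averages include the current step, and the loss is averaged over steps.

```python
import numpy as np


def bca_loss(past_vecs, future_vecs, z, W_past, W_future):
    """Squared-error alignment between capsule summaries and averaged decoder states.

    past_vecs, future_vecs : (T, d_cap) per-step capsule summaries (assumed flattened)
    z                      : (T, d_z)   decoder hidden states
    W_past, W_future       : (d_cap, d_z) hypothetical projections
    """
    T = z.shape[0]
    total = 0.0
    for t in range(T):
        past_target = W_past @ z[: t + 1].mean(axis=0)   # average of states up to t
        future_target = W_future @ z[t:].mean(axis=0)    # average of states from t on
        total += np.sum((past_vecs[t] - past_target) ** 2)
        total += np.sum((future_vecs[t] - future_target) ** 2)
    return total / T


rng = np.random.default_rng(0)
T, d_z, d_cap = 5, 8, 6
z = rng.normal(size=(T, d_z))
value = bca_loss(
    rng.normal(size=(T, d_cap)),
    rng.normal(size=(T, d_cap)),
    z,
    rng.normal(size=(d_cap, d_z)),
    rng.normal(size=(d_cap, d_z)),
)
print(value)
```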


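Given the dataset of parallel training examples $\{\langle \mathbf{x}^{(m)}, \mathbf{y}^{(m)} \rangle\}_{m=1}^{M}$, the model parameters are trained by minimizing the loss $\mathcal{L}(\theta)$, where $\theta$ is the set of all the parameters of the proposed model:

$$\mathcal{L}(\theta) = \frac{1}{M} \sum_{m=1}^{M} \left( -\log p(\mathbf{y}^{(m)} | \mathbf{x}^{(m)}) + \lambda_1 \mathcal{L}_{\mathrm{BoW}} + \lambda_2 \mathcal{L}_{\mathrm{BCA}} \right),$$

where $\lambda_1$ and $\lambda_2$ are hyper-parameters.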

| Model | #Params | Train speed | Test speed | Dev | MT04 | MT05 | MT06 | Tests All |
|---|---|---|---|---|---|---|---|---|
| Transformer | 66.1m | 1.00 | 1.00 | 45.83 | 46.66 | 43.36 | 42.17 | 44.26 (N/A) |
| Gdr | 68.9m | 0.77 | 0.94 | 46.50 | 47.03 | 45.50 | 42.21 | 45.05 (+0.79) |
| + BoW | 69.2m | 0.70 | 0.94 | 47.12 | 48.09 | 45.98 | 42.68 | 45.42 (+1.16) |
| + BCA | 69.4m | 0.75 | 0.94 | 46.86 | 48.00 | 45.67 | 42.62 | 45.22 (+0.96) |
| + BoW + BCA [Ours] | 69.7m | 0.67 | 0.94 | 47.52 | 48.13 | 45.98 | 42.85 | 45.83 (+1.57) |
| - redundant capsules | 68.7m | 0.69 | 0.94 | 47.20 | 47.98 | 45.79 | 42.50 | 45.40 (+1.14) |
| RNMT | 50.2m | 1.00 | 1.00 | 35.98 | 37.85 | 36.12 | 35.86 | 36.57 (N/A) |
| PFRnn Zheng et al. (2018) | N/A | 0.54 | 0.74 | 37.90 | 40.37 | 36.75 | 35.44 | N/A |
| AOL Kong et al. (2019) | N/A | 0.57 | N/A | 37.61 | 40.05 | 37.58 | 36.87 | N/A |
| Ours | 53.9m | 0.62 | 0.90 | 38.10 | 40.87 | 37.50 | 37.00 | 38.71 (+2.14) |

Table 1: Experiment results on the NIST Zh-En task, including the number of parameters (#Params, excluding word embeddings), training/testing speeds relative to the corresponding baseline (Train speed/Test speed), and translation results in case-insensitive BLEU. Note that “Tests All” means that the scores are obtained on the concatenation of all test sets.

4 Experiment

We mainly evaluated our approaches on the widely used NIST Chinese-to-English (Zh-En) translation task. We also conducted experiments on WMT14 English-to-German (En-De) and WMT16 English-to-Romanian (En-Ro) translation tasks.

1. NIST Chinese-to-English (NIST Zh-En). The training data consists of 1.09 million sentence pairs extracted from LDC (the corpora include LDC2002E18, LDC2003E07, LDC2003E14, the Hansards portion of LDC2004T07, LDC2004T08 and LDC2005T06). We used NIST MT03 as the development set (Dev); MT04, MT05, MT06 as the test sets.

2. WMT14 English-to-German (WMT En-De). The training data consists of 4.5 million sentence pairs from WMT14 news translation task. We used newstest2013 as the development set and newstest2014 as the test set.

3. WMT16 English-to-Romanian (WMT En-Ro). The training data consists of 0.6 million sentence pairs from WMT16 news translation task. We used newstest2015 as the development set and newstest2016 as the test set.

We used the transformer_base configuration Vaswani et al. (2017) for all the models. We ran the dynamic routing for $r$ iterations. The dimension of a single capsule is 256. Either Past or Future content was represented by $J/2$ capsules. Our proposed models were trained on top of pre-trained baseline models. $\lambda_1$ and $\lambda_2$ in the training objective were set to 1. In the Appendix, we provide details of the training settings.

4.1 NIST Zh-En Translation

We list the results of our experiments on the NIST Zh-En task in Table 1 for two different architectures, i.e., Transformer and RNMT. As we can see, all of our models substantially outperform the baselines in terms of BLEU on the concatenation of all test sets. Among them, our best model achieves 45.83 BLEU based on the Transformer architecture. We also find that redundant capsules are helpful: discarding them leads to a 0.43 BLEU degradation (45.83 vs. 45.40).

Our approach shows consistent effects on both Transformer and RNMT architectures. In comparison to the Transformer baseline, our model achieves up to a +1.57 BLEU improvement (45.83 vs. 44.26), and a +2.14 BLEU improvement over the RNMT baseline (38.71 vs. 36.57). These results indicate the compatibility of our approach with different architectures.

Auxiliary losses

We also evaluate the effects of the introduced auxiliary losses, i.e., the bag-of-words constraint (BoW) and the bilingual content alignment (BCA). Both auxiliary constraints help our model learn better. The BoW constraint leads to a +0.56 BLEU improvement compared to the vanilla Gdr, while the BLEU gain from BCA is +0.64. Finally, combining both yields the largest improvement on Transformer (45.83 vs. 44.05), indicating that they complement each other.

To examine the efficiency of the proposed approach, we also list the relative speed of both training and testing. Our approach trains at 0.67 times the speed of the Transformer baseline; however, it does not hurt the testing speed much (0.94). This is because most of the extra computation in the training phase comes from the softmax operations of the BoW loss, which are not needed at test time, so the degradation in testing efficiency is moderate.

Comparison to other work

For the experiments on the RNMT architecture, we list two related works. Zheng et al. (2018) use an extra Past RNN and Future RNN to capture translated and untranslated contents recurrently (PFRnn), while Kong et al. (2019) directly leverage translation adequacy as a learning reward through their proposed Adequacy-oriented Learning (AOL). Compared to them, our model also enjoys consistent improvements due to the explicit separation of source contents.

| Model | En-De | En-Ro |
|---|---|---|
| GNMT+RL Wu et al. (2016) | 24.6 | N/A |
| ConvS2S Gehring et al. (2017) | 25.2 | 29.88 |
| Transformer Vaswani et al. (2017) | 27.3 | N/A |
| Transformer + AOL Kong et al. (2019) | 28.01 | N/A |
| Transformer Gu et al. (2017) | N/A | 31.91 |
| Transformer (our reproduction) | 27.14 | 32.10 |
| Ours | 28.10 | 32.96 |

Table 2: Case-sensitive BLEU on WMT14 En-De and WMT16 En-Ro tasks.

4.2 WMT En-De and En-Ro Translation

We evaluated our approach on the WMT14 En-De and WMT16 En-Ro tasks. As shown in Table 2, our reproduced Transformer baseline systems are close to the state-of-the-art results in previous work, which supports the comparability of our experiments. The results show the same trend of improvements as on the NIST Zh-En task, on both the WMT14 En-De (+0.96 BLEU) and WMT16 En-Ro (+0.86 BLEU) benchmarks. We also list the results of other published research for comparison, where our model outperforms the previous results in both language pairs. Note that our approach also surpasses Kong et al. (2019) on the WMT14 En-De task. These experiments demonstrate the effectiveness of our approach across different language pairs.

4.3 Analysis and Discussion

Our model learns Past and Future.
Figure 3: Visualization of the assignment probabilities of iterative routing. For the sub-heatmap of each translation word, the left column indicates the probabilities of each source words routing to the Past capsules, and the right column indicates the probabilities to the Future capsules. Examples in red frames indicate the changes before and after the generation of the central word. We omit the assignment probabilities associated to the redundant capsules for simplicity.

We visualize the assignment probabilities in the last routing iteration (Figure 3). Interestingly, there is a clear trend that the assignment probabilities to the Past capsules gradually rise, while those to the Future capsules drop to around zero. This phenomenon is consistent with the intuition that the translated contents should accumulate and the untranslated contents should decline Zheng et al. (2018). Specifically, after the target word “defended” was generated, the assignment probabilities of its corresponding source word “辩护” shifted from the Future to the Past. Similar results are observed for the words “Bush”, “his”, “revive” and “economy”, with one adverse case (“plan”). These pieces of evidence strongly suggest that our Gdr mechanism has indeed learned to distinguish the Past contents from the Future contents on the source side.

Moreover, we measure how well our capsules accumulate the expected contents by comparing the BoW predictions with the ground-truth target words. Accordingly, we define an overlap rate between the predicted and ground-truth words, computed separately for the preceding and the subsequent words.

The Past capsules achieve an overlap rate of 0.72, and the Future capsules achieve 0.70. The results indicate that our introduced capsules can predict the corresponding words to a certain extent, which implies that the capsules contain the expected information, i.e., the Past and Future contents.

| Model | Cdr | Human: Quality | Human: Over (%) | Human: Under (%) |
|---|---|---|---|---|
| Transformer | 0.73 | 4.39 ± 0.11 | 0.03 ± 0.01 | 3.83 ± 0.97 |
| Ours | 0.79 | 4.66 ± 0.10 | 0.01 ± 0.01 | 2.41 ± 0.80 |

Table 3: Evaluation of translation quality and adequacy. For the Human evaluation, we asked three evaluators to score translations of 100 source sentences, randomly sampled from the test sets and presented with the systems anonymized, on Quality from 1 to 5 (higher is better), and to report the proportions of source words affected by Over- and Under-translation, respectively.
Translations become better and more adequate.

To validate the translation adequacy of our model, we use the Coverage Difference Ratio (Cdr) proposed by Kong et al. (2019), i.e., $\mathrm{Cdr} = 1 - |C_{\mathrm{ref}} \setminus C_{\mathrm{gen}}| / |C_{\mathrm{ref}}|$, where $C_{\mathrm{ref}}$ and $C_{\mathrm{gen}}$ are the sets of source words covered by the reference and the translation, respectively. The Cdr reflects the translation adequacy by comparing the source coverage of the reference with that of the translation. As shown in the second column of Table 3, our approach achieves a better Cdr than the Transformer baseline, indicating better translation adequacy.
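As a small illustration of this metric, the sketch below computes a coverage difference ratio from two sets of covered source positions. It assumes the form Cdr = 1 - |C_ref \ C_gen| / |C_ref| and takes the covered sets as given; how coverage is extracted from word alignments is not shown, and the function name and the empty-reference convention are hypothetical.

```python
def coverage_difference_ratio(ref_covered, gen_covered):
    """Share of reference-covered source words that the translation also covers."""
    ref_covered, gen_covered = set(ref_covered), set(gen_covered)
    if not ref_covered:
        return 1.0  # assumed convention: nothing to cover means nothing missed
    missed = ref_covered - gen_covered
    return 1.0 - len(missed) / len(ref_covered)


# Source positions 0-3 are covered by the reference; the translation
# covers 0 and 1 (plus position 5, which the reference does not cover).
print(coverage_difference_ratio({0, 1, 2, 3}, {0, 1, 5}))  # 0.5
```

Source words covered only by the translation do not change the score under this form, so the ratio measures missed content rather than extra content.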

Following Tu et al. (2016), Zheng et al. (2018) and Kong et al. (2019), we also conduct subjective evaluations to validate the benefit of modeling Past and Future (the last three columns of Table 3). Surprisingly, we find that modern NMT models rarely produce over-translation but still suffer from under-translation. Our model obtains the highest human rating on translation quality while substantially alleviating the under-translation problem compared to the Transformer.

Longer sentences benefit more.
Figure 4: Comparison regarding source length. (a) Translation length vs. source length. (b) BLEU vs. source length.

We report the comparison across source sentence lengths (Figure 4). In all the length intervals, our model generates better (Figure 4b) and longer (Figure 4a) translations. Interestingly, our approach yields a larger improvement as the input sentences become longer, which are commonly considered hard to translate. We attribute this to the reduced under-translation in our model, meaning that our model achieves better translation quality and adequacy, especially for long sentences.

5 Related Work

The inadequate translation problem is a widely known weakness of NMT models Kong et al. (2019); Tu et al. (2016). To alleviate this problem, one direction is to recognize the translated and untranslated contents, and pay more attention to the untranslated parts. Tu et al. (2016), Mi et al. (2016) and Li et al. (2018b) employ a coverage vector or coverage ratio to indicate the lexical-level coverage of source words. Meng et al. (2018) influence the attentive vectors with translated/untranslated information. Our work mainly follows the path of Zheng et al. (2018), who introduce two extra recurrent layers in the decoder to maintain the representations of the past and future translation contents. However, it may not be easy to show the direct correspondence between the source contents and the learned representations in the past/future RNN layers, and the approach is not compatible with the state-of-the-art Transformer, because the additional recurrences prevent the Transformer decoder from being parallelized.

Another direction is to introduce global representations. Lin et al. (2018) model a global source representation with deconvolution networks. Xia et al. (2017), Zhang et al. (2018) and Geng et al. (2018) propose to provide a holistic view of the target sentence by multi-pass decoding. Different from these works, which aim at providing static global information in the whole translation process, our approach models a dynamic holistic context by using a capsule network to separate source contents and obtain time-dependent global information at every decoding step.

Other efforts explore exploiting future hints. Serdyuk et al. (2018) design a Twin Regularization to encourage the hidden states in the forward decoder RNN to estimate the representations of a backward RNN. Weng et al. (2017) require the decoder states not only to generate the current word but also to predict the remaining untranslated words. Actor-critic algorithms are employed to predict future properties Li et al. (2017a); Bahdanau et al. (2017); He et al. (2017) by estimating the future rewards for decision making. Kong et al. (2019) propose a policy gradient based adequacy-oriented approach to improve translation adequacy. These methods use future information only at the training stage, while our model can also exploit past and future information at inference, which provides accessible clues of translated and untranslated contents.

Sabour et al. (2017) first introduce the concept of dynamic routing, which is further improved by EM-routing Hinton et al. (2018), aiming at addressing the limited expressive ability of the parts-to-wholes assignment in computer vision. In the natural language processing community, however, the capsule network has not yet been widely investigated. Zhao et al. (2018) test the capsule network on text classification, and Gong et al. (2018) propose to aggregate a sequence of vectors via dynamic routing for sequence encoding. Dou et al. (2019) first propose to employ a capsule network in NMT, using the routing-by-agreement mechanism for layer aggregation to exploit advanced representations. These studies mainly use the capsule network for information aggregation, where the capsules could have a less interpretable meaning. In contrast, our model learns what we expect with the aid of auxiliary learning signals, which endows our model with better interpretability.

6 Conclusion

In this paper, we propose to recognize the translated Past and untranslated Future contents as a process of parts-to-wholes assignment in neural machine translation. We use a Capsule Network and propose a novel mechanism called Guided Dynamic Routing to explicitly separate source words into Past and Future groups, guided by the Present target decoding information at each decoding step. We demonstrate that such explicit separation of source contents benefits neural machine translation. Translation experiments on three language pairs and two sequence-to-sequence architectures show considerable and consistent improvements. Extensive analysis shows that our approach indeed learns to model the Past and Future, and alleviates inadequate translation. It would be interesting to apply our approach to other sequence-to-sequence tasks that require a dynamic holistic view of source contents, e.g., text summarization.


References

  • Ayana et al. (2016) Ayana, Shiqi Shen, Zhiyuan Liu, and Maosong Sun. 2016. Neural headline generation with minimum risk training. CoRR, abs/1604.01904.
  • Bahdanau et al. (2017) Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron Courville, and Yoshua Bengio. 2017. An actor-critic algorithm for sequence prediction. In ICLR 2017.
  • Bahdanau et al. (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR 2015.
  • Chang et al. (2018) Chieh-Teng Chang, Chi-Chia Huang, and Jane Yung jen Hsu. 2018. A hybrid word-character model for abstractive summarization. CoRR, abs/1802.09968.
  • Chen et al. (2016) Qian Chen, Xiao-Dan Zhu, Zhen-Hua Ling, Si Wei, and Hui Jiang. 2016. Distraction-based neural networks for modeling document. In IJCAI.
  • Dou et al. (2019) Zi-Yi Dou, Zhaopeng Tu, Xing Wang, Longyue Wang, Shuming Shi, and Tong Zhang. 2019. Dynamic layer aggregation for neural machine translation with routing-by-agreement. arXiv preprint arXiv:1902.05770.
  • Gehring et al. (2017) Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann Dauphin. 2017. Convolutional sequence to sequence learning. In ICML 2017.
  • Geng et al. (2018) Xinwei Geng, Xiaocheng Feng, Bing Qin, and Ting Liu. 2018. Adaptive multi-pass decoder for neural machine translation. In EMNLP, pages 523–532.
  • Gong et al. (2018) Jingjing Gong, Xipeng Qiu, Shaojing Wang, and Xuanjing Huang. 2018. Information aggregation via dynamic routing for sequence encoding. In COLING, pages 2742–2752.
  • Gu et al. (2017) Jiatao Gu, James Bradbury, Caiming Xiong, Victor O. K. Li, and Richard Socher. 2017. Non-Autoregressive Neural Machine Translation.
  • Gu et al. (2016) Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O.K. Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. In ACL 2016.
  • He et al. (2017) Di He, Hanqing Lu, Yingce Xia, Tao Qin, Liwei Wang, and Tie-Yan Liu. 2017. Decoding with value networks for neural machine translation. In NIPS, pages 178–187.
  • Hinton et al. (2011) Geoffrey E Hinton, Alex Krizhevsky, and Sida D Wang. 2011. Transforming auto-encoders. In ICANN, pages 44–51. Springer.
  • Hinton et al. (2018) Geoffrey E Hinton, Sara Sabour, and Nicholas Frosst. 2018. Matrix capsules with EM routing. In ICLR.
  • Hu et al. (2015) Baotian Hu, Qingcai Chen, and Fangze Zhu. 2015. LCSTS: A large scale Chinese short text summarization dataset. In EMNLP.
  • Kingma and Ba (2014) Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. ICLR 2014.
  • Kong et al. (2019) Xiang Kong, Zhaopeng Tu, Shuming Shi, Eduard Hovy, and Tong Zhang. 2019. Neural machine translation with adequacy-oriented learning. In AAAI.
  • Li et al. (2017a) Jiwei Li, Will Monroe, and Dan Jurafsky. 2017a. Learning to decode for future success. arXiv preprint arXiv:1701.06549.
  • Li et al. (2018a) Piji Li, Lidong Bing, and Wai Lam. 2018a. Actor-critic based training framework for abstractive summarization. CoRR, abs/1803.11070.
  • Li et al. (2017b) Piji Li, Wai Lam, Lidong Bing, and Zihao Wang. 2017b. Deep recurrent generative decoder for abstractive text summarization. In EMNLP.
  • Li et al. (2018b) Yanyang Li, Tong Xiao, Yinqiao Li, Qiang Wang, Changming Xu, and Jingbo Zhu. 2018b. A simple and effective approach to coverage-aware neural machine translation. In ACL, volume 2, pages 292–297.
  • Lin (2004) Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In ACL 2004.
  • Lin et al. (2018) Junyang Lin, Xu Sun, Xuancheng Ren, Shuming Ma, Jinsong Su, and Qi Su. 2018. Deconvolution-based global decoding for neural machine translation. In COLING, pages 3260–3271.
  • Luong et al. (2015) Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In EMNLP 2015.
  • Ma et al. (2018) Shuming Ma, Xu Sun, Wei Li, Sujian Li, Wenjie Li, and Xuancheng Ren. 2018. Query and output: Generating words by querying distributed word representations for paraphrase generation. In NAACL-HLT.
  • Meng et al. (2018) Fandong Meng, Zhaopeng Tu, Yong Cheng, Haiyang Wu, Junjie Zhai, Yuekui Yang, and Di Wang. 2018. Neural machine translation with key-value memory-augmented attention. In IJCAI, pages 2574–2580. AAAI Press.
  • Mi et al. (2016) Haitao Mi, Baskaran Sankaran, Zhiguo Wang, and Abe Ittycheriah. 2016. Coverage Embedding Models for Neural Machine Translation. EMNLP 2016.
  • Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In ACL 2002.
  • Sabour et al. (2017) Sara Sabour, Nicholas Frosst, and Geoffrey E Hinton. 2017. Dynamic routing between capsules. In NIPS, pages 3856–3866.
  • Sennrich et al. (2016) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. Computer Science.
  • Serdyuk et al. (2018) Dmitriy Serdyuk, Nan Rosemary Ke, Alessandro Sordoni, Adam Trischler, Chris Pal, and Yoshua Bengio. 2018. Twin networks: Matching the future for sequence generation.
  • Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In NIPS 2014.
  • Szegedy et al. (2016) Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the inception architecture for computer vision. In CVPR, pages 2818–2826.
  • Tu et al. (2017) Zhaopeng Tu, Yang Liu, Lifeng Shang, Xiaohua Liu, and Hang Li. 2017. Neural machine translation with reconstruction. In AAAI 2017.
  • Tu et al. (2016) Zhaopeng Tu, Zhengdong Lu, Yang Liu, Xiaohua Liu, and Hang Li. 2016. Modeling coverage for neural machine translation. In ACL 2016.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In NIPS.
  • Weng et al. (2017) Rongxiang Weng, Shujian Huang, Zaixiang Zheng, Xin-Yu Dai, and Jiajun Chen. 2017. Neural machine translation with word predictions. In EMNLP 2017.
  • Wu et al. (2016) Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
  • Xia et al. (2017) Yingce Xia, Fei Tian, Lijun Wu, Jianxin Lin, Tao Qin, Nenghai Yu, and Tie-Yan Liu. 2017. Deliberation networks: Sequence generation beyond one-pass decoding. In NIPS, pages 1784–1794.
  • Zhang et al. (2018) Xiangwen Zhang, Jinsong Su, Yue Qin, Yang Liu, Rongrong Ji, and Hongji Wang. 2018. Asynchronous bidirectional decoding for neural machine translation. arXiv preprint arXiv:1801.05122.
  • Zhao et al. (2018) Wei Zhao, Jianbo Ye, Min Yang, Zeyang Lei, Suofei Zhang, and Zhou Zhao. 2018. Investigating capsule networks with dynamic routing for text classification. arXiv preprint arXiv:1804.00538.
  • Zheng et al. (2018) Zaixiang Zheng, Hao Zhou, Shujian Huang, Lili Mou, Xinyu Dai, Jiajun Chen, and Zhaopeng Tu. 2018. Modeling past and future for neural machine translation. TACL, 6:145–157.

Appendix A Machine Translation

Neural models for sequence-to-sequence tasks such as machine translation often adopt an encoder-decoder framework. Given a source sentence $\mathbf{x} = \{x_1, \dots, x_I\}$ and a target sentence $\mathbf{y} = \{y_1, \dots, y_T\}$, a sequence-to-sequence model models the conditional probability from the source sentence to the target sentence in a word-by-word manner: $P(\mathbf{y} \mid \mathbf{x}) = \prod_{t=1}^{T} P(y_t \mid \mathbf{y}_{<t}, \mathbf{h})$, where $\mathbf{y}_{<t} = \{y_1, \dots, y_{t-1}\}$, and $\mathbf{h}$ is the encoded sentence representations from the encoder. Conditioned on $\mathbf{h}$, a decoder generates the target sequence of symbols autoregressively.
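The word-by-word factorization described above can be illustrated with a minimal sketch. It assumes the decoder has already produced one probability distribution per target position; the function name `sequence_log_prob` and the toy distributions are hypothetical and only serve to show that the sentence-level log-probability is the sum of the per-step log-probabilities.

```python
import math

def sequence_log_prob(step_distributions, target):
    """Sum log P(y_t | y_<t, source) over all target positions."""
    total = 0.0
    for dist, word in zip(step_distributions, target):
        total += math.log(dist[word])
    return total

# Hypothetical per-step distributions for a three-token target sentence.
steps = [
    {"the": 0.7, "a": 0.3},
    {"cat": 0.5, "dog": 0.5},
    {"</s>": 0.9, "sat": 0.1},
]
target = ["the", "cat", "</s>"]

log_p = sequence_log_prob(steps, target)
# Exponentiating recovers the product of the per-step probabilities.
sentence_prob = math.exp(log_p)
```

Working in log space avoids numerical underflow for long sentences, which is why toolkits typically accumulate log-probabilities rather than multiplying probabilities.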

A.1 Detailed Experimental Settings

For Zh-En, we used a vocabulary size of 30K for both the source and target sides. The Chinese part of the data was word-segmented using ICTCLAS. We limited the maximum sentence length to 50 tokens. For En-De and Ro-En, we applied byte pair encoding (BPE; Sennrich et al., 2016) to segment all sentences and limited the vocabulary size to 32K. We did not filter sentences by length for En-De and Ro-En. All out-of-vocabulary words were mapped to a distinct token <UNK>.
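The vocabulary truncation, length filtering, and <UNK> mapping described above can be sketched as follows. This is a simplified illustration, not the authors' preprocessing pipeline: the helper names `build_vocab` and `preprocess` are hypothetical, and whether special tokens count toward the vocabulary limit is left open.

```python
from collections import Counter

UNK = "<UNK>"

def build_vocab(sentences, max_size):
    """Keep the max_size most frequent tokens of a tokenized corpus."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return {tok for tok, _ in counts.most_common(max_size)}

def preprocess(sentences, vocab, max_len=None):
    """Drop over-long sentences (if max_len is set) and map OOV tokens to UNK."""
    processed = []
    for sent in sentences:
        if max_len is not None and len(sent) > max_len:
            continue
        processed.append([tok if tok in vocab else UNK for tok in sent])
    return processed

# Toy corpus: "a" occurs 3 times, "b" twice, "c" and "d" once each.
corpus = [["a", "b", "a"], ["a", "c", "b", "d"]]
vocab = build_vocab(corpus, max_size=2)

filtered = preprocess(corpus, vocab, max_len=3)  # length limit applied
unfiltered = preprocess(corpus, vocab)           # no length limit
```

In the setting described above, `max_len=50` would correspond to Zh-En, while `max_len=None` would correspond to En-De and Ro-En.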

We used the Adam optimizer (Kingma and Ba, 2014) with , , and . We used the same learning rate schedule strategy as Vaswani et al. (2017) with 4,000 warmup steps. Each training batch consisted of approximately 25,000 source tokens and 25,000 target tokens. Label smoothing with a value of 0.1 (Szegedy et al., 2016) was used for training. We trained our models for 100k steps on a single GTX 1080Ti GPU.
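For reference, the learning rate schedule of Vaswani et al. (2017) increases the rate linearly during warmup and then decays it with the inverse square root of the step number. The sketch below assumes a model dimension of 512, which is not stated in this appendix; only the 4,000 warmup steps come from the text.

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Learning rate of Vaswani et al. (2017) at a given training step.

    d_model=512 is an assumed value for illustration.
    """
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The rate grows linearly up to the warmup boundary, then decays.
peak = transformer_lr(4000)
halfway = transformer_lr(2000)
late = transformer_lr(100000)
```

The two arguments of `min` coincide exactly at `step == warmup_steps`, so the peak learning rate is reached at step 4,000 under these settings.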

For evaluation, we used beam search with a width of 4 and a length penalty of 0.6 (Wu et al., 2016). We did not apply checkpoint averaging (Vaswani et al., 2017) on the parameters for evaluation. The translation evaluation metric is case-insensitive BLEU (Papineni et al., 2002) for Zh-En, and case-sensitive BLEU for En-De and Ro-En, which is consistent with previous work.
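To make the role of the length penalty concrete, the sketch below applies the length normalization of Wu et al. (2016) to two candidate hypotheses. It assumes that the value 0.6 corresponds to the exponent alpha in that formula; the coverage penalty from the same paper is omitted, and the log-probabilities are made-up numbers.

```python
def length_penalty(length, alpha=0.6):
    """Length normalization term of Wu et al. (2016): ((5 + |Y|) / 6) ** alpha."""
    return ((5.0 + length) / 6.0) ** alpha

def rescore(log_prob, length, alpha=0.6):
    """Length-normalized hypothesis score used to rank beam candidates."""
    return log_prob / length_penalty(length, alpha)

# Two hypothetical candidates: a short one with a higher raw log-probability
# and a longer one with a lower raw log-probability.
short_score = rescore(-4.0, length=5)
long_score = rescore(-5.0, length=12)
```

With alpha set to 0 the penalty equals 1 and candidates are ranked by raw log-probability, which favors shorter outputs. With alpha at 0.6, the longer hypothesis in this toy case ranks higher.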

Figure 5: BLEU scores in terms of different hyperparameters. (a) BLEU scores with respect to the dimensionality of each capsule. (b) BLEU scores with respect to the number of capsules of each category. (c) BLEU scores with respect to the number of routing iterations of GDR.

A.2 Effect of Hyperparameters

To examine the effects of different hyperparameters of the proposed model, we report results for different settings of the capsule dimension, the number of capsules, and the number of routing iterations in Figure 5.

We observe that increasing the dimension or the number of capsules does not bring better performance. We attribute this to the sufficient expressive capacity of medium-scale capsules. Likewise, the BLEU score rises as the number of routing iterations increases, peaks at the best setting of 3 iterations, and decreases afterwards. The number of routing iterations affects the estimation of the agreement between two capsules. Hence, redundant iterations may lead to over-estimation of the agreement, which has also been observed by Dou et al. (2019).
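To illustrate how the number of iterations shapes the agreement estimate, here is a minimal sketch of the plain routing-by-agreement procedure of Sabour et al. (2017). It is not the Guided Dynamic Routing proposed in this paper: the guidance from the decoding state is omitted, and the vote vectors are toy values chosen for illustration.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def squash(s):
    """Shrink a vector so its norm lies in [0, 1) while keeping its direction."""
    sq = sum(x * x for x in s)
    if sq == 0.0:
        return [0.0] * len(s)
    scale = sq / (1.0 + sq) / math.sqrt(sq)
    return [scale * x for x in s]

def dynamic_routing(votes, num_iters=3):
    """votes[i][j] is the vote vector of input capsule i for output capsule j."""
    n_in, n_out, dim = len(votes), len(votes[0]), len(votes[0][0])
    logits = [[0.0] * n_out for _ in range(n_in)]
    for _ in range(num_iters):
        # Each input capsule distributes itself over the output capsules.
        probs = [softmax(row) for row in logits]
        # Output capsule j is the squashed, probability-weighted sum of its votes.
        outputs = [
            squash([sum(probs[i][j] * votes[i][j][d] for i in range(n_in))
                    for d in range(dim)])
            for j in range(n_out)
        ]
        # Agreement (dot product of vote and output) accumulates in the logits.
        for i in range(n_in):
            for j in range(n_out):
                logits[i][j] += sum(a * b for a, b in zip(votes[i][j], outputs[j]))
    return outputs, probs

# Toy votes: inputs 0 and 1 agree with output 0, input 2 agrees with output 1.
votes = [
    [[1.0, 0.0], [0.0, 0.1]],
    [[1.0, 0.1], [0.0, 0.1]],
    [[0.1, 0.0], [0.0, 1.0]],
]
_, probs_one_iter = dynamic_routing(votes, num_iters=1)
outputs, probs_three_iters = dynamic_routing(votes, num_iters=3)
```

With a single iteration the assignment stays uniform. Each further iteration adds the current agreement to the routing logits, so the assignments sharpen as iterations accumulate, which is consistent with the over-estimation effect discussed above.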

(a) An example of guided dynamic routing. The orderings of the Chinese source sentence and the English translation are non-monotonic. For example, after the target word “supply” has been generated in the middle of the translation, the assignment probabilities of its corresponding source word “供给”, near the end of the source sentence, change from the Past to the Future.
(b) Another example. An interesting case is related to “unaware”, which has a source counterpart of three Chinese words (“毫无”, “所”, and “知”). When it has been generated, the assignment probabilities of the words in its counterpart phrase change from Past to Future simultaneously.
Figure 6: Visualization of the assignment probabilities of the iterative guided dynamic routing. For each translation word, the three columns represent the probability of assignment to the Past, Future, or redundant capsule, respectively. Green frames indicate that the assignment probabilities of source words change from Past at the beginning to Future in the end. Red frames highlight the changes before and after a specific word’s generation.

A.3 Visualization

We show two more examples of visualization of the assignment probabilities of the guided dynamic routing mechanism in Figure 6.

Appendix B Abstractive Summarization

Abstractive summarization is another prevalent sequence-to-sequence task, which aims to produce a short, fluent, and effective summary of a long text article (Chang et al., 2018). Intuitively, discarding redundant content in the source article is important for abstractive summarization, which could be achieved by our proposed redundant capsules. Moreover, which parts of the source article have been fully summarized and which have not should also be explicitly modeled.

B.1 Experimental Settings


We conduct experiments on the LCSTS dataset (Hu et al., 2015) to evaluate our proposed model for abstractive summarization. This dataset contains a large number of short Chinese news articles, with their headlines as short summaries, collected from Sina Weibo, a Twitter-like micro-blogging website in China. As shown in Table 4, this dataset is composed of three parts. Part I contains 2,400,591 (article, summary) pairs. Parts II and III contain not only text data but also manually rated scores from 1 to 5 for the quality of the summaries in terms of their relevance to the source articles. Following Hu et al. (2015), we use Part I as the training set, and the subsets of Part II and Part III scored from 3 to 5 as the development set and test set, respectively.

                     Part I      Part II   Part III
pairs                2,400,591   10,666    1,106
pairs (score ≥ 3)    -           8,685     725
Table 4: Statistics of the LCSTS dataset.
Training and evaluation

Following Hu et al. (2015), we conducted experiments at the character level. We set the vocabulary size to 4,000 for both the source and target sides. We used Transformer as our baseline system. Model configurations and other training hyperparameters are the same as for the machine translation tasks. For evaluation, we used beam search with a width of 10 without length penalty. We report three variants of recall-based ROUGE (Lin, 2004), namely ROUGE-1 (unigrams), ROUGE-2 (bigrams), and ROUGE-L (longest common subsequence).
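As a rough illustration of the three recall-based variants, the sketch below computes ROUGE-1, ROUGE-2, and ROUGE-L recall on character sequences. It is a simplified stand-in for the official ROUGE package (Lin, 2004), which offers further options not modeled here; the function names and the toy strings are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n):
    """Fraction of reference n-grams that also occur in the candidate (clipped)."""
    ref = ngrams(reference, n)
    cand = ngrams(candidate, n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total

def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def rouge_l_recall(candidate, reference):
    """LCS length divided by the reference length."""
    if not reference:
        return 0.0
    return lcs_length(candidate, reference) / len(reference)

# Character-level toy example: the candidate differs in one character.
reference = list("abcde")
candidate = list("abxde")

r1 = rouge_n_recall(candidate, reference, 1)  # 4 of 5 unigrams matched
r2 = rouge_n_recall(candidate, reference, 2)  # 2 of 4 bigrams matched
rl = rouge_l_recall(candidate, reference)     # LCS "abde" has length 4
```

Because ROUGE-L is based on a subsequence rather than a contiguous substring, the single substituted character in the candidate only removes one element from the match.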

B.2 Results

Model                             R-1     R-2     R-L
RNN+context Hu et al. (2015)      29.9    17.4    27.2
CopyNet Gu et al. (2016)          34.4    21.6    31.3
Distraction Chen et al. (2016)    35.2    22.6    32.5
DGRD Li et al. (2017b)            36.99   24.15   34.21
MRT Ayana et al. (2016)           37.87   25.43   35.33
WEAN Ma et al. (2018)             37.80   25.60   35.20
AC-ABS Li et al. (2018a)          37.51   24.68   35.02
Transformer Chang et al. (2018)   40.49   26.83   37.32
    +HWC Chang et al. (2018)      44.38   32.26   41.35
Transformer                       40.18   25.76   35.69
Ours                              43.85   29.57   39.10
    - redundant capsules          42.43   28.17   37.66
Table 5: ROUGE scores on the LCSTS abstractive summarization task.

Table 5 shows the results of existing systems and our proposed model. Our proposed model outperforms the Transformer baseline by a large margin (+3.67 ROUGE-1, +3.81 ROUGE-2, and +3.41 ROUGE-L). As expected, the redundant capsules are important for abstractive summarization, in which the summary is required to be short and concise by discarding a large amount of less important content from the original article: removing them lowers ROUGE-1 from 43.85 to 42.43. Our model also beats most of the previous approaches except the model of Chang et al. (2018), which benefits from hybrid word-character vocabularies of max size (more than 900m entries) and data cleaning. We expect our model to benefit from these improvements as well. In addition, our model does not rely on any extraction-based methods, which are designed for summarization to extract relevant parts from the source article directly (e.g., CopyNet; Gu et al., 2016). Our method may achieve further gains by incorporating these task-specific mechanisms.