Neural Machine Translation (NMT) has achieved unprecedented successes and drawn much attention from both academia and industry. Following the sequence-to-sequence learning paradigm, NMT approaches Sutskever et al. (2014); Bahdanau et al. (2014); Vaswani et al. (2017); Ott et al. (2018) usually consist of two parts, the encoder and the decoder: the encoder maps the source-side sentence into a sequence of hidden representations, and the decoder generates the target-side tokens step by step based on the encoder outputs.
Despite its success, the commonly used encoder-decoder framework in NMT often suffers from over- and under-translation problems Tu et al. (2016, 2017). The decoder may repeatedly focus on the same parts of the source sentence while ignoring the other parts. Many efforts Tu et al. (2016); Meng et al. (2018); Zheng et al. (2018, 2019) have been made to mitigate this issue by either explicitly or implicitly modeling the step-by-step translated and un-translated information during the decoding process. One promising direction is to track the translated (Past) and un-translated (Future) components of the source sentence Zheng et al. (2018, 2019) at each decoding step. The components are modeled by an RNN or a Capsule Network with heuristic objectives (e.g., a Bag-of-Words loss).
In this paper, we argue that the heuristic objectives in previous approaches may be indirect and insufficient in certain circumstances, which limits their effectiveness. The Past and Future modules have two major functionalities: identifying the past and future contents, and extracting useful features for further predictions. However, prior studies mix these two functionalities and try to model them jointly by only fitting the outputs of the Past / Future modules. Here, we propose a novel dual learning method that enhances both functionalities with two Transformer models (source-to-target and target-to-source) trained simultaneously (see Figure 1). On the one hand, we use the backward NMT encoder with partial inputs to provide contextually rich supervision for the past / future identification, instead of a coarse-grained bag-of-words loss. On the other hand, we apply a Guided Capsule Network Zheng et al. (2019) on the two encoders to align their feature-extraction capability through manual masking, instead of mixing up both functionalities. As training proceeds, the bidirectional models act as teachers for each other and iteratively strengthen each other's performance.
We evaluate our approach on two commonly used translation datasets, i.e., the NIST Chinese-to-English task and the WMT 2014 English-to-German task. The experimental results demonstrate that our method significantly outperforms strong previous baselines in translation quality. In the subjective evaluation, our method also surpasses previous adequacy-oriented methods in mitigating both the over- and under-translation problems.
Neural Machine Translation (NMT) often adopts an encoder-decoder framework. Given a source sentence $x = \{x_1, \dots, x_I\}$ with $I$ words and a target sentence $y = \{y_1, \dots, y_J\}$ with $J$ words, the encoder first maps the source sentence into a sequence of word embeddings $e(x)$. Then, it encodes the word embeddings into the corresponding hidden representations $h$ using its transformation layers. Similarly, the decoder follows the same procedure to encode the decoder inputs (the shifted target sequence with a special token at its head) into hidden representations $z$, additionally taking $h$ into account.
Then, the NMT model predicts the target sequence by maximizing the conditional probability based on $h$ and $z$:
$$P(y \mid x; \theta) = \prod_{t=1}^{J} P(y_t \mid y_{<t}, x; \theta),$$
where $\theta$ is the set of learnable parameters and $y_{<t}$ is a partial input. The commonly used training loss is the cross-entropy loss.
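The step-by-step conditional probability and the cross-entropy loss described above can be illustrated with a minimal sketch. It assumes the decoder has already produced one vector of logits per target position; the names `softmax`, `sequence_cross_entropy`, `toy_logits` and `toy_targets` are hypothetical and chosen only for this illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sequence_cross_entropy(step_logits, target_ids):
    """Sum over target positions of -log P(target token | prefix, source).

    step_logits[t] stands in for the decoder logits over the vocabulary at
    step t (assumed to be conditioned on the source and the target prefix).
    """
    loss = 0.0
    for logits, target_id in zip(step_logits, target_ids):
        probs = softmax(logits)
        loss -= math.log(probs[target_id])
    return loss

# Toy example: vocabulary of 3 tokens, target sequence of length 2.
toy_logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 3.0]]
toy_targets = [0, 2]
loss = sequence_cross_entropy(toy_logits, toy_targets)
# The sequence probability is the product of the per-step probabilities.
sequence_prob = math.exp(-loss)
```

Minimizing this loss is equivalent to maximizing the log of the sequence probability, since the loss is exactly its negative logarithm.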
3 Proposed Method
In this section, we introduce our dual learning framework for Past and Future, which not only explicitly models the dynamics of the translated (past) and un-translated (future) parts of the source sentence, but also leverages the duality of the source-to-target and target-to-source models to provide more accurate and contextually rich supervision.
More specifically, after computing the encoder and decoder hidden states, we feed the partially decoded sub-sequence of one model back into the reverse model, and exploit a Guided Capsule Network to align the encoder outputs of both directions. Then, in a dual learning manner, the two models are trained simultaneously and improve each other's translation performance.
In the following sections, we refer to one model as the forward model and the other as the backward model to illustrate our method in detail.
3.1 Guided Capsule Networks
First, we present the details of the Guided Capsule Network Zheng et al. (2019).
The Capsule Network Sabour et al. (2017) has shown its superiority in solving the problem of assigning parts to wholes. In our setting, the capsules' routing-by-agreement mechanism is well suited to finding the Past and Future within the whole sentence Zheng et al. (2019).
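Formally, we regard the outputs $h_i$ from the forward encoding module as the low-level capsules after a linear projection,
$$\hat{u}_{ij} = W_j h_i,$$
where $W_j$ is a trainable transformation matrix for the $j$-th high-level capsule. Then, in the dynamic routing process, the vector representation $v_j$ of each high-level capsule is calculated by a squash function,
$$v_j = \mathrm{squash}(s_j) = \frac{\|s_j\|^2}{1 + \|s_j\|^2} \frac{s_j}{\|s_j\|}, \qquad s_j = \sum_i c_{ij} \hat{u}_{ij},$$
where $s_j$ is the weighted sum over all low-level capsules and $c_{ij}$ are the assignment probabilities (i.e., the agreement between low-level and high-level capsules). Note that the high-level capsules are evenly split into two groups, representing the Past $\Omega^P$ and the Future $\Omega^F$.
Then, during each iteration of the Capsule Network, $c_{ij}$ is updated by a guided agreement between the capsules of different levels and the decoder output $z_t$,
$$b_{ij} \leftarrow b_{ij} + w^\top \tanh\left(W_b [z_t; \hat{u}_{ij}; v_j]\right), \qquad c_{ij} = \mathrm{softmax}_j(b_{ij}),$$
where $b_{ij}$ are the routing logits and $w$ and $W_b$ are trainable parameters.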
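The routing procedure described above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the authors' implementation: the low-level votes `u_hat` are taken as already projected, and the guided agreement is reduced to a single hypothetical scoring vector `w` applied to the `tanh` of the concatenated decoder state, vote and high-level capsule (any learned projection inside the `tanh` is omitted). All function and variable names are illustrative.

```python
import math

def squash(s):
    """Shrink vector s so its norm lies in [0, 1) while keeping its direction."""
    sq_norm = sum(v * v for v in s)
    if sq_norm == 0.0:
        return [0.0 for _ in s]
    scale = sq_norm / (1.0 + sq_norm) / math.sqrt(sq_norm)
    return [scale * v for v in s]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def guided_routing(u_hat, z_t, w, num_iters=3):
    """Route low-level capsules to high-level capsules with a decoder guide.

    u_hat[i][j]: vote (list of floats) of low-level capsule i for capsule j.
    z_t: decoder state used as the guide.
    w: scoring vector of length len(z_t) + 2 * capsule dimension (assumption).
    Returns (high-level capsule vectors, assignment probabilities).
    """
    if num_iters < 1:
        raise ValueError("num_iters must be at least 1")
    num_low, num_high = len(u_hat), len(u_hat[0])
    dim = len(u_hat[0][0])
    b = [[0.0] * num_high for _ in range(num_low)]  # routing logits
    for _ in range(num_iters):
        # Each low-level capsule distributes itself over the high-level ones.
        c = [softmax(row) for row in b]
        v = []
        for j in range(num_high):
            s_j = [sum(c[i][j] * u_hat[i][j][k] for i in range(num_low))
                   for k in range(dim)]
            v.append(squash(s_j))
        # Guided agreement: score the concatenation [z_t; u_hat_ij; v_j].
        for i in range(num_low):
            for j in range(num_high):
                feats = list(z_t) + u_hat[i][j] + v[j]
                b[i][j] += sum(wk * math.tanh(f) for wk, f in zip(w, feats))
    return v, c

# Toy example: 2 source positions, 2 high-level capsules (Past, Future).
u_hat = [[[1.0, 0.0], [0.5, 0.0]],
         [[0.0, 1.0], [0.0, 0.5]]]
z_t = [0.1, -0.2]
w = [0.1] * 6
v, c = guided_routing(u_hat, z_t, w)
```

The returned `c` holds, for every source position, a distribution over the high-level capsules, which is the quantity that a Past or Future mask would later restrict.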
3.2 Dual Past and Future
After capturing the Past and Future, next we introduce how to use duality for supervising the Past and Future. In the following section, we refer to the dynamic guided capsule as DGC for short.
Suppose that at decoding step $t$ the forward and backward models have each produced their Past capsule outputs.
At this point, we have a partially decoded target sequence $y_{<t}$ from the source-to-target model and a partially decoded source sequence $x_{<t}$ from the target-to-source model. We feed these partial outputs back into the encoders of their reverse-direction models, respectively.
The resulting hidden states of the partially decoded sub-sequences from both directions are contextually rich representations of the translated words on either side.
Then, we use another DGC to extract the feature outputs from these hidden states,
applying the corresponding past mask for the $t$-th step. After masking out the irrelevant assignment probabilities, the low-level capsules from translated words are only routed to the Past capsules.
Then, we minimize the semantic distance between the Past capsule outputs of both directions.
Similarly, we perform the same computation for the Future capsules, except that the un-translated sub-sequences $y_{\geq t}$ and $x_{\geq t}$ are fed instead. In this way, the bidirectional models are improved in an iterative manner. Note that, for computational efficiency, we actually feed the whole sequence back into the encoder and add a lower-triangular (past) attention bias whose blocked positions hold very large negative numbers (i.e., -1e9), the same as the bias used in the decoder self-attention.
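The efficiency note above (feeding the whole sequence and restricting attention with a lower-triangular bias) can be illustrated as follows. This is a minimal sketch rather than the authors' code: `past_attention_bias` and `masked_attention_weights` are hypothetical helper names, and the attention scores are toy values.

```python
import math

NEG_INF = -1e9  # the "very large negative number" used to block positions

def past_attention_bias(length):
    """bias[t][k] is 0.0 if position k <= t (visible) and -1e9 otherwise."""
    return [[0.0 if k <= t else NEG_INF for k in range(length)]
            for t in range(length)]

def masked_attention_weights(scores, bias_row):
    """Softmax over scores + bias; blocked positions get (numerically) zero weight."""
    logits = [s + b for s, b in zip(scores, bias_row)]
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

bias = past_attention_bias(3)
# Position 1 may attend to positions 0 and 1, but not to position 2.
weights = masked_attention_weights([1.0, 1.0, 1.0], bias[1])
```

Because row `t` of the bias only leaves positions up to `t` visible, a single encoder pass over the full sequence yields, at every position, a hidden state that depends only on the prefix, so the partial inputs for all decoding steps are obtained at once.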
3.3 Incorporating with NMT
The above approach can be applied on top of the general sequence-to-sequence model. In our experiments, we use Transformer Vaswani et al. (2017) as our base model since it achieves many state-of-the-art results in the NMT task.
Given the last-layer encoder and decoder outputs, $h$ and $z$, we use the DGC to extract the Past and Future memory features from the source encoding side and obtain the holistic context for each decoding step. Following the setting in Zheng et al. (2019), an extra redundant capsule is introduced, and the capsule outputs are concatenated into the holistic context,
where $\Omega^P$, $\Omega^F$ and $\Omega^R$ are the Past, Future and Redundant capsule outputs, and $[;]$ represents the concatenation operation. Finally, the output probability for each decoding step is computed via a softmax layer.
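Table 1: Case-insensitive BLEU scores on the NIST ZH-EN task. Columns: system, relative decoding speed, five test sets, their average, and the gain over the Transformer baseline.
|Existing NMT systems|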
|Wang et al. (2018)||-||45.47||46.31||45.30||46.45||45.62||45.83||-|
|Cheng et al. (2018)||-||45.78||45.96||45.51||46.49||45.73||45.89||-|
|Our NMT systems|
|Vaswani et al. (2017)||1.00||44.70||45.26||43.75||45.68||44.14||44.71||reference|
|Zheng et al. (2019)||0.87||45.70||46.13||44.90||46.84||45.20||45.75||+1.04|
|Ours + Inde.Train.||0.87||46.15||46.54||45.15||46.97||45.41||46.04||+1.33|
The main experiments are conducted on the widely used NIST Chinese-to-English (ZH-EN) dataset, which contains 1.25M parallel sentences. We also report results on the WMT14 English-to-German (EN-DE) dataset, which contains 4.50M parallel sentences, to compare our model with other state-of-the-art models.
In all of our experiments, we follow the Transformer-base configuration from Vaswani et al. (2017). The residual dropout rates are 0.4 for NIST ZH-EN and 0.1 for WMT14 EN-DE. The dimension of the Past and Future capsules is set to 256, and each component consists of 2 capsules. For the NIST ZH-EN and WMT14 EN-DE tasks, we report case-insensitive and case-sensitive 4-gram BLEU scores, respectively.
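To make the difference between case-sensitive and case-insensitive 4-gram BLEU concrete, the following sketch computes a sentence-level BLEU score with clipped n-gram precisions and a brevity penalty. It is a simplified illustration only: standard evaluation scripts compute BLEU at the corpus level with their own tokenization, and the function names `ngrams` and `bleu4` as well as the example sentences are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu4(hypothesis, references, lowercase=False):
    """Sentence-level 4-gram BLEU without smoothing.

    Returns 0.0 if any n-gram precision is zero. Setting lowercase=True
    gives the case-insensitive variant.
    """
    if lowercase:
        hypothesis = hypothesis.lower()
        references = [r.lower() for r in references]
    hyp = hypothesis.split()
    refs = [r.split() for r in references]
    log_precisions = []
    for n in range(1, 5):
        hyp_counts = ngrams(hyp, n)
        # Clip each hypothesis n-gram by its maximum count in any reference.
        max_ref = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram])
                      for gram, count in hyp_counts.items())
        total = sum(hyp_counts.values())
        if total == 0 or clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty with the reference length closest to the hypothesis.
    ref_len = min((abs(len(r) - len(hyp)), len(r)) for r in refs)[1]
    bp = 1.0 if len(hyp) > ref_len else math.exp(1.0 - ref_len / len(hyp))
    return bp * math.exp(sum(log_precisions) / 4)

reference = ["the cat sat on the mat"]
case_sensitive = bleu4("The Cat sat on the mat", reference)
case_insensitive = bleu4("The Cat sat on the mat", reference, lowercase=True)
```

In this toy case the hypothesis differs from the reference only in capitalization, so the case-insensitive score is perfect while the case-sensitive score is penalized for the mismatched tokens.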
4.3 Main Results
We mainly conduct our experiments on the NIST Chinese-to-English (ZH-EN) translation task. Besides our implemented NMT systems, we also list the performance of several existing systems Wang et al. (2018); Cheng et al. (2018) to support the effectiveness of our model. The experimental results can be found in Table 1. Compared with the previous adequacy-oriented method Zheng et al. (2019) and existing systems, our model shows its superiority with a +1.33 BLEU improvement over the baseline Transformer. Although our model roughly doubles the number of parameters of Transformer-base, it does not hurt the decoding speed much (0.87 relative to the baseline), because the dual models are used for training but only the model of one direction is used at test time. We also report the performance of EN-ZH translation produced by our model. Compared with the Transformer baseline, our method improves the translation by +0.66 BLEU.
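The averages and gains reported in Table 1 can be checked with a short calculation. The sketch below copies the five per-test-set scores of our implemented systems from the table; the dictionary and function names are illustrative, and the gain is taken here as the difference of the rounded averages, which is an assumption that happens to reproduce the reported values.

```python
table1 = {
    "Vaswani et al. (2017)": [44.70, 45.26, 43.75, 45.68, 44.14],
    "Zheng et al. (2019)": [45.70, 46.13, 44.90, 46.84, 45.20],
    "Ours + Inde.Train.": [46.15, 46.54, 45.15, 46.97, 45.41],
}

def average(scores):
    """Mean BLEU over the test sets, rounded to two decimals as in the table."""
    return round(sum(scores) / len(scores), 2)

averages = {name: average(scores) for name, scores in table1.items()}
baseline = averages["Vaswani et al. (2017)"]
# Gain over the Transformer baseline, computed from the rounded averages.
gains = {name: round(avg - baseline, 2) for name, avg in averages.items()}
```

With these inputs the averages come out as 44.71, 45.75 and 46.04, and the gains over the baseline as +1.04 and +1.33, matching the last two columns of the table.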
4.4 WMT14 EN-DE
To compare our method with previous work, we also conduct experiments on the well-studied WMT 2014 English-to-German (EN-DE) translation task. The experimental results are shown in Table 2, where we also list the results of several previous state-of-the-art systems for comparison. We find that our model outperforms these previous results.
|System||Under (%)||Over (%)|
|GDR Zheng et al. (2019)||71%||92%|
4.5 Subjective Evaluation
Following prior work (Tu et al., 2016; Zheng et al., 2018, 2019), we also conduct a human evaluation to assess the adequacy of our proposed model. We randomly select 100 sentences from the ZH-EN task and ask the annotator to judge whether the translations produced by GDR and by our model suffer from under- or over-translation. The results demonstrate that our model outperforms GDR in resolving both under-translation (+11%) and over-translation (+3%) cases, while there is still much room for improvement.
Sequence-to-sequence neural machine translation models often suffer from the under- and over-translation problems. In this paper, we present a novel dual learning framework that aims at modeling translation adequacy. By leveraging the power of both the source-to-target and the target-to-source models, our proposed method provides a more direct and contextually rich supervision signal for the translated and un-translated words. The experimental results demonstrate that our method outperforms previous adequacy-oriented methods and achieves significant improvements in mitigating the over- and under-translation problems.
- Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
- Cheng et al. (2018) Yong Cheng, Zhaopeng Tu, Fandong Meng, Junjie Zhai, and Yang Liu. 2018. Towards robust neural machine translation. arXiv preprint arXiv:1805.06130.
- Gehring et al. (2017) Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin. 2017. Convolutional sequence to sequence learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1243–1252. JMLR.org.
- Meng et al. (2018) Fandong Meng, Zhaopeng Tu, Yong Cheng, Haiyang Wu, Junjie Zhai, Yuekui Yang, and Di Wang. 2018. Neural machine translation with key-value memory-augmented attention. arXiv preprint arXiv:1806.11249.
- Ott et al. (2018) Myle Ott, Sergey Edunov, David Grangier, and Michael Auli. 2018. Scaling neural machine translation. arXiv preprint arXiv:1806.00187.
- Sabour et al. (2017) Sara Sabour, Nicholas Frosst, and Geoffrey E Hinton. 2017. Dynamic routing between capsules. In Advances in neural information processing systems, pages 3856–3866.
- Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems.
- Tu et al. (2017) Zhaopeng Tu, Yang Liu, Lifeng Shang, Xiaohua Liu, and Hang Li. 2017. Neural machine translation with reconstruction. In Thirty-First AAAI Conference on Artificial Intelligence.
- Tu et al. (2016) Zhaopeng Tu, Zhengdong Lu, Yang Liu, Xiaohua Liu, and Hang Li. 2016. Modeling coverage for neural machine translation. arXiv preprint arXiv:1601.04811.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008.
- Wang et al. (2018) Xinyi Wang, Hieu Pham, Zihang Dai, and Graham Neubig. 2018. Switchout: an efficient data augmentation algorithm for neural machine translation. arXiv preprint arXiv:1808.07512.
- Zheng et al. (2019) Zaixiang Zheng, Shujian Huang, Zhaopeng Tu, Xin-Yu Dai, and Jiajun Chen. 2019. Dynamic past and future for neural machine translation. arXiv preprint arXiv:1904.09646.
- Zheng et al. (2018) Zaixiang Zheng, Hao Zhou, Shujian Huang, Lili Mou, Xinyu Dai, Jiajun Chen, and Zhaopeng Tu. 2018. Modeling past and future for neural machine translation. Transactions of the Association for Computational Linguistics, 6:145–157.