Dual Past and Future for Neural Machine Translation

by Jianhao Yan, et al.

Though remarkable successes have been achieved by Neural Machine Translation (NMT) in recent years, it still suffers from the inadequate-translation problem. Previous studies show that explicitly modeling the Past and Future contents of the source sentence is beneficial for translation performance. However, it is not clear whether the commonly used heuristic objective is good enough to guide the Past and Future. In this paper, we present a novel dual framework that leverages both source-to-target and target-to-source NMT models to provide a more direct and accurate supervision signal for the Past and Future modules. Experimental results demonstrate that our proposed method significantly improves the adequacy of NMT predictions and surpasses previous methods in two well-studied translation tasks.





1 Introduction

Neural Machine Translation (NMT) has achieved unprecedented successes and drawn much attention from both academia and industry. Following the sequence-to-sequence learning paradigm, NMT approaches (Sutskever et al., 2014; Bahdanau et al., 2014; Vaswani et al., 2017; Ott et al., 2018) usually consist of two parts, the encoder and the decoder: the encoder maps the source-side sentence into a sequence of hidden representations, and the decoder generates the target-side tokens step by step based on the encoder outputs.

Despite its success, the commonly used encoder-decoder framework in NMT often suffers from over- and under-translation problems (Tu et al., 2016, 2017): the decoder tends to focus repeatedly on the same parts of the source sentence while ignoring others. Many efforts (Tu et al., 2016; Meng et al., 2018; Zheng et al., 2018, 2019) have been made to mitigate this issue by explicitly or implicitly modeling the translated and un-translated information at each step of the decoding process. One promising direction is to track the translated (Past) and un-translated (Future) components of the source sentence at each decoding step (Zheng et al., 2018, 2019). These components are modeled by an RNN or a Capsule Network with heuristic objectives (e.g., a Bag-of-Words loss).


Figure 1: The model architecture of the Dual Past-Future Transformer. For simplicity, only the supervision of the Past module is depicted. At decoding step $t$, the partially decoded sub-sequences of both directions are fed into the reverse-direction models to obtain their encoder outputs. The Past outputs of the reverse direction are then used to regularize the forward Past module.

In this paper, we argue that the heuristic objectives in previous approaches may be indirect and insufficient in certain circumstances, which limits their effectiveness. The Past and Future modules have two major functionalities: identifying the past and future contents, and extracting useful features for further predictions. However, prior studies mix these two functionalities and model them jointly by only fitting the outputs of the Past / Future modules. Here, we propose a novel dual learning method that enhances both functionalities with two Transformer models (source-to-target and target-to-source) trained simultaneously (see Figure 1). On the one hand, we use the backward NMT encoder with partial inputs to provide contextually rich supervision for Past / Future identification, instead of a coarse-grained bag-of-words loss. On the other hand, we apply a Guided Capsule Network (Zheng et al., 2019) to the two encoders and use manual masking to align their feature-extraction capability, instead of mixing the two functionalities. As training proceeds, the bidirectional models act as teachers for each other and strengthen their performance iteratively.

We evaluate our approach on two commonly used translation datasets: the NIST Chinese-to-English task and the WMT 2014 English-to-German task. The experimental results demonstrate that our method significantly outperforms previous strong baselines in translation quality. In the subjective evaluation, our method also surpasses previous adequacy-oriented methods in mitigating both over- and under-translation problems.

2 Background

Neural Machine Translation (NMT) often adopts an encoder-decoder framework. Given a source sentence $\mathbf{x} = \{x_1, \dots, x_I\}$ with $I$ words and a target sentence $\mathbf{y} = \{y_1, \dots, y_T\}$ with $T$ words, the encoder first maps the source sentence into a sequence of word embeddings. Then, it encodes the word embeddings into the corresponding hidden representations $\mathbf{h} = \{\mathbf{h}_1, \dots, \mathbf{h}_I\}$ using its transformation layers. Similarly, the decoder follows the same procedure to encode the decoder inputs (the shifted target sequence with a special token at its head) into hidden representations $\mathbf{z} = \{\mathbf{z}_1, \dots, \mathbf{z}_T\}$, in addition to taking $\mathbf{h}$ into account.

Then, the NMT model predicts the target sequence by maximizing the conditional probability based on $\mathbf{h}$ and $\mathbf{z}$:

$$P(\mathbf{y} \mid \mathbf{x}) = \prod_{t=1}^{T} P(y_t \mid \mathbf{y}_{<t}, \mathbf{x}; \theta),$$

where $\theta$ is the set of learnable parameters and $\mathbf{y}_{<t}$ is a partial input. The commonly used training loss is the cross-entropy loss.
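As a rough illustration of this training objective, the following minimal NumPy sketch computes the cross-entropy (negative log-likelihood) of a target sequence from per-step vocabulary scores. The function names, array shapes and toy numbers are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def sequence_cross_entropy(logits, target_ids):
    """Negative log-likelihood of one target sequence.

    logits:     (T, V) array; row t scores every vocabulary item given the
                source sentence and the target prefix before position t.
    target_ids: (T,) integer array of reference token ids.
    """
    log_probs = log_softmax(logits)                               # (T, V)
    token_ll = log_probs[np.arange(len(target_ids)), target_ids]  # (T,)
    return -token_ll.sum()                                        # -log P(y | x)

# Toy example (made-up numbers): 3 target positions, vocabulary of 4 tokens.
toy_logits = np.array([[2.0, 0.0, 0.0, 0.0],
                       [0.0, 2.0, 0.0, 0.0],
                       [0.0, 0.0, 0.0, 2.0]])
toy_targets = np.array([0, 1, 3])
loss = sequence_cross_entropy(toy_logits, toy_targets)
```

Summing the per-token log-probabilities corresponds to taking the logarithm of the product of per-step probabilities, which is why the loss decomposes over decoding steps.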


3 Proposed Method

In this section, we introduce our dual learning framework for Past and Future, which not only explicitly models the dynamics of the translated (past) and un-translated (future) parts of the source sentence, but also leverages the duality of source-to-target and target-to-source models to provide more accurate and contextually rich supervision.

More specifically, after computing the encoder and decoder hidden states, we feed the partially decoded sub-sequence of one model back into the reverse model, and exploit a Guided Capsule Network to align the encoder outputs of both directions. Then, in a dual learning manner, the two models are trained simultaneously to improve translation performance.

In the following, we refer to one model as the forward model and the other as the backward model to illustrate our method in detail.

3.1 Guided Capsule Networks

First, we present the details of the Guided Capsule Network (Zheng et al., 2019).

The Capsule Network (Sabour et al., 2017) has shown its strength in solving the problem of assigning parts to wholes. In our setting, the capsule's routing-by-agreement mechanism is suitable for finding the Past and Future within the whole sentence (Zheng et al., 2019).

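Formally, we regard the outputs of the forward encoding module, $\mathbf{h}_i$, as the low-level capsules after a linear projection,

$$\mathbf{u}_{i \to j} = \mathbf{W}_j \mathbf{h}_i,$$

where $\mathbf{W}_j$ is a trainable transformation matrix for capsule $j$. Then, in the dynamic routing process, the vector representation $\mathbf{v}_j$ of each high-level capsule $j$ is calculated by a squash function,

$$\mathbf{v}_j = \frac{\lVert \mathbf{s}_j \rVert^2}{1 + \lVert \mathbf{s}_j \rVert^2} \, \frac{\mathbf{s}_j}{\lVert \mathbf{s}_j \rVert}, \qquad \mathbf{s}_j = \sum_{i} c_{ij} \, \mathbf{u}_{i \to j},$$

where $\mathbf{s}_j$ is the weighted sum over all low-level capsules and $c_{ij}$ are the assignment probabilities (i.e., the agreement between low-level and high-level capsules). Note that the high-level capsules are evenly split into two groups, representing the Past and the Future.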

Then, during each iteration of the Capsule Network, the assignment probabilities are updated by a guided agreement between capsules of different levels, conditioned on the decoder output at the current step.
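The routing procedure can be sketched in NumPy as follows. This is only a sketch under stated assumptions: the loop uses the plain routing-by-agreement update of Sabour et al. (2017), not the guided update of this section, which additionally conditions on the decoder output. The function names, shapes and random inputs are hypothetical.

```python
import numpy as np

def squash(s, eps=1e-9):
    """Squash non-linearity: keeps the direction of s, maps its norm into [0, 1)."""
    sq_norm = np.sum(s * s, axis=-1, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def route(h, W, n_iter=3):
    """Plain routing-by-agreement (assumption: NOT the guided variant).

    h: (I, d_in) encoder outputs, one row per source word.
    W: (J, d_in, d_out) one projection matrix per high-level capsule.
    Returns the (J, d_out) high-level capsules and the (I, J) assignment
    probabilities used to build them.
    """
    u = np.einsum("id,jdk->ijk", h, W)          # low-level capsules, (I, J, d_out)
    b = np.zeros(u.shape[:2])                   # routing logits, (I, J)
    for _ in range(n_iter):
        c = softmax(b, axis=1)                  # each word distributes over capsules
        s = np.einsum("ij,ijk->jk", c, u)       # weighted sum over low-level capsules
        v = squash(s)                           # high-level capsules, (J, d_out)
        b = b + np.einsum("ijk,jk->ij", u, v)   # agreement between u and v
    return v, c

rng = np.random.default_rng(0)
h = rng.normal(size=(5, 8))                     # 5 source words, hidden size 8
W = rng.normal(size=(4, 8, 6)) * 0.1            # 4 capsules: 2 Past + 2 Future
v, c = route(h, W)
past_capsules, future_capsules = v[:2], v[2:]   # even split into two groups
```

Splitting the high-level capsules by index is one simple way to realize the even Past / Future grouping; the capsule count of 2 per group follows the experimental settings, while the other sizes are arbitrary.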


3.2 Dual Past and Future

After capturing the Past and Future, we next introduce how to use duality to supervise them. In the following, we refer to the dynamic guided capsule as DGC for short.

Suppose that, at decoding time step $t$, the forward and backward models each produce Past capsule outputs with their own DGC.

At this point, we have a partially decoded target sequence from the source-to-target model and a partially decoded source sequence from the target-to-source model. We feed these partial outputs back into the encoders of their reverse-direction models, respectively, to obtain the hidden states of the partially decoded sub-sequences from both directions. Notably, these hidden states are contextually rich representations of the translated words on either side.

Then, we use another DGC to extract feature outputs from these hidden states, applying the corresponding past mask for the $t$-th step. After the irrelevant assignment probabilities are masked out, the low-level capsules from translated words are routed only to the Past capsules.

Finally, we minimize the semantic distance between the Past capsule outputs of both directions.
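The masking step and the distance term can be illustrated with a small NumPy sketch. Two points are assumptions rather than details from the text: the mask is applied by adding a large negative constant to the routing logits of the non-Past capsules, and the "semantic distance" is taken to be a mean squared error. All names are hypothetical.

```python
import numpy as np

NEG_INF = -1e9  # large negative constant used to switch off masked entries

def masked_assignment(logits, past_capsule_mask):
    """Softmax over capsules with the non-Past capsules masked out.

    logits:            (I, J) routing logits for I low-level capsules.
    past_capsule_mask: (J,) boolean array, True for Past capsules.
    """
    masked = np.where(past_capsule_mask[None, :], logits, NEG_INF)
    e = np.exp(masked - masked.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def past_distance(past_a, past_b):
    """Assumed distance: mean squared error between two sets of Past capsules."""
    return float(np.mean((past_a - past_b) ** 2))

# Three translated words, four capsules (first two Past, last two Future).
mask = np.array([True, True, False, False])
assign = masked_assignment(np.zeros((3, 4)), mask)

# Toy Past capsule outputs from the two directions (made-up values).
dist_same = past_distance(np.ones((2, 6)), np.ones((2, 6)))
dist_diff = past_distance(np.zeros((2, 6)), np.ones((2, 6)))
```

With the mask in place, every translated word spreads its assignment over the Past capsules only, so the resulting capsule outputs summarize exactly the translated content that the forward Past module is asked to match.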


Similarly, we perform the same computation for the Future capsules, except that the remaining, not yet translated sub-sequences are fed in instead. In this way, the bidirectional models are improved in an iterative manner. Note that, for computational efficiency, we actually feed the whole sequence back into the encoder and set the attention bias to a lower-triangular bias (for the Past) with very large negative numbers (i.e., -1e9), the same as the bias used in the decoder self-attention.
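The attention-bias trick in the note above can be sketched as follows. This is a minimal NumPy illustration, assuming an additive bias that is 0 for allowed positions and -1e9 for masked ones; the function names are hypothetical.

```python
import numpy as np

def past_attention_bias(length, neg=-1e9):
    """(length, length) additive bias: position t may attend to positions <= t only."""
    allowed = np.tril(np.ones((length, length), dtype=bool))
    return np.where(allowed, 0.0, neg)

def attention_weights(scores, bias):
    """Row-wise softmax of the biased attention scores."""
    z = scores + bias
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

bias = past_attention_bias(4)
weights = attention_weights(np.zeros((4, 4)), bias)  # uniform scores for clarity
```

Because the masked scores receive essentially zero weight after the softmax, a single encoder pass over the whole sequence yields, at every position, a representation that depends only on the preceding (Past) words. A bias for the Future could presumably be built the same way with the mirrored triangle, although the text does not spell this out.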

3.3 Incorporating with NMT

The above approach can be applied on top of the general sequence-to-sequence model. In our experiments, we use Transformer Vaswani et al. (2017) as our base model since it achieves many state-of-the-art results in the NMT task.

Given the last-layer encoder and decoder outputs, we use the DGC to extract the Past and Future memory features from the source encoding side and obtain the holistic context for each decoding step. Following the setting in Zheng et al. (2019), an extra Redundant capsule is introduced alongside the Past and Future capsules, and the capsule outputs are combined with the concatenation operation [;]. Finally, the output probability for each decoding step is computed via a softmax layer.

| System | Speed | MT06 | MT02 | MT03 | MT04 | MT05 | Average | Δ |
|---|---|---|---|---|---|---|---|---|
| Existing NMT systems | | | | | | | | |
| Wang et al. (2018) | - | 45.47 | 46.31 | 45.30 | 46.45 | 45.62 | 45.83 | - |
| Cheng et al. (2018) | - | 45.78 | 45.96 | 45.51 | 46.49 | 45.73 | 45.89 | - |
| Our NMT systems | | | | | | | | |
| Vaswani et al. (2017) | 1.00 | 44.70 | 45.26 | 43.75 | 45.68 | 44.14 | 44.71 | reference |
| Zheng et al. (2019) | 0.87 | 45.70 | 46.13 | 44.90 | 46.84 | 45.20 | 45.75 | +1.04 |
| Ours | 0.87 | 45.96 | 46.29 | 44.83 | 46.92 | 45.26 | 45.85 | +1.14 |
| Ours + Inde.Train. | 0.87 | 46.15 | 46.54 | 45.15 | 46.97 | 45.41 | 46.04 | +1.33 |

Table 1: Case-insensitive BLEU scores (%) on the NIST Chinese-to-English (ZH-EN) task. The Δ column gives the improvement in average BLEU over the Transformer baseline (Vaswani et al., 2017).

4 Experiments

4.1 Dataset

The main experiments are conducted on the widely used NIST Chinese-to-English (ZH-EN) dataset, containing 1.25M parallel sentences. We also report results on the WMT14 English-to-German (EN-DE) dataset, containing 4.50M parallel sentences, to compare our model with other state-of-the-art models.

4.2 Settings

In all of our experiments, we follow the Transformer-base configuration of Vaswani et al. (2017). The residual dropout rates are 0.4 for NIST ZH-EN and 0.1 for WMT14 EN-DE. The dimension of the Past and Future capsules is set to 256, and each component consists of 2 capsules. For the NIST ZH-EN and WMT14 EN-DE tasks, we use case-insensitive and case-sensitive 4-gram BLEU scores, respectively.

| System | BLEU (EN-DE) |
|---|---|
| ConvS2S (Gehring et al., 2017) | 25.20 |
| Transformer (Vaswani et al., 2017) | 27.30 |
| Transformer (Our Impl.) | 27.54 |
| Ours | 27.86 |

Table 2: Experimental results on the WMT 2014 English-to-German (EN-DE) task.

4.3 Main Results

We mainly conduct our experiments on the NIST Chinese-to-English (ZH-EN) translation task. Besides our implemented NMT systems, we also list the performance of several existing systems (Wang et al., 2018; Cheng et al., 2018) for comparison. The experimental results can be found in Table 1. Compared with the previous adequacy-oriented method (Zheng et al., 2019) and the existing systems, our best configuration (Ours + Inde.Train.) achieves the highest average score, an improvement of +1.33 BLEU over the baseline Transformer. Although our model introduces approximately as many additional parameters as Transformer-base itself, it only moderately reduces the decoding speed (a relative speed of 0.87), because the dual models are used for training but only one direction is used at test time. We also evaluate EN-ZH translation produced by our model: compared with the Transformer baseline, our method improves translation by +0.66 BLEU.
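The averages and improvements discussed above can be recomputed from the per-test-set scores in Table 1. The short Python check below is only a sanity check of that arithmetic; it assumes that the average covers all five listed test sets (MT06 and MT02 to MT05) and that improvements are taken between the rounded averages, which is what reproduces the reported numbers.

```python
# Per-test-set BLEU scores copied from Table 1 (MT06, MT02, MT03, MT04, MT05).
BLEU = {
    "Vaswani et al. (2017)": [44.70, 45.26, 43.75, 45.68, 44.14],
    "Zheng et al. (2019)":   [45.70, 46.13, 44.90, 46.84, 45.20],
    "Ours":                  [45.96, 46.29, 44.83, 46.92, 45.26],
    "Ours + Inde.Train.":    [46.15, 46.54, 45.15, 46.97, 45.41],
}

def average(scores):
    """Mean BLEU over the listed test sets, rounded to two decimals."""
    return round(sum(scores) / len(scores), 2)

baseline = average(BLEU["Vaswani et al. (2017)"])

# Map each system to (average BLEU, improvement over the Transformer baseline).
summary = {
    name: (average(scores), round(average(scores) - baseline, 2))
    for name, scores in BLEU.items()
}

for name, (avg, delta) in summary.items():
    print(f"{name:24s} avg={avg:.2f} delta={delta:+.2f}")
```

Under these assumptions the script reproduces the Average column and the +1.04, +1.14 and +1.33 improvements of Table 1.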

4.4 WMT14 EN-DE

To compare our method with previous work, we also conduct experiments on the well-studied WMT 2014 English-to-German (EN-DE) translation task. The experimental results are shown in Table 2, together with the results of several previous state-of-the-art systems. Our model reaches 27.86 BLEU, outperforming our Transformer implementation (27.54) and the previously reported results.

| System | Under (%) | Over (%) |
|---|---|---|
| GDR (Zheng et al., 2019) | 71 | 92 |
| Ours | 82 | 95 |

Table 3: Manually annotated results on the under- and over-translation problems for generated predictions.

4.5 Subjective Evaluation

Following prior work (Tu et al., 2016; Zheng et al., 2018, 2019), we also use human evaluation to assess the adequacy of our proposed model. We randomly select 100 sentences from the ZH-EN task and ask the annotator to judge whether the translations produced by GDR and our model suffer from under- or over-translation. The results (Table 3) demonstrate that our model outperforms GDR on both under-translation (+11 percentage points) and over-translation (+3 percentage points), although there is still much room for improvement.

5 Conclusion

Sequence-to-sequence neural machine translation models often suffer from under- and over-translation problems. In this paper, we present a novel dual learning framework aimed at modeling translation adequacy. By leveraging both source-to-target and target-to-source models, our proposed method provides a more direct and contextually rich supervision signal for the translated and un-translated words. The experimental results demonstrate that our method outperforms previous adequacy-oriented methods and achieves significant improvement in mitigating over- and under-translation problems.