Pronouns are important in natural languages as they imply rich discourse information. However, in pro-drop languages such as Chinese and Japanese, pronouns are frequently omitted when their referents can be pragmatically inferred from the context. When translating sentences from a pro-drop language into a non-pro-drop language (e.g. Chinese-to-English), translation models generally fail to translate invisible dropped pronouns (DPs). This phenomenon leads to various translation problems in terms of completeness, syntax and even semantics of translations. A number of approaches have been investigated for DP translation (Le Nagard and Koehn, 2010; Xiang et al., 2013; Wang et al., 2016, 2018).
Wang et al. (2018) present pioneering work on modeling DP translation for neural machine translation (NMT) models. They employ two separate reconstructors (Tu et al., 2017) to reconstruct the encoder and decoder representations, respectively, back to the DP-annotated source sentence. The DP annotation is provided by an external prediction model, which is trained on the parallel corpus using automatically learned alignment information (Wang et al., 2016). Although this model achieved significant improvements, it has two drawbacks: 1) there is no interaction between the two separate reconstructors, which misses the opportunity to exploit useful relations between encoder and decoder representations; and 2) the external DP prediction model achieves an F1-score of only 66%, which propagates numerous errors to the translation model.
In this work, we propose to improve the original model from two perspectives. First, we use a shared reconstructor to read hidden states from both the encoder and the decoder. Second, we integrate a DP predictor into NMT to jointly learn to translate and predict DPs. Incorporating these as two auxiliary loss terms can guide both the encoder and decoder states to learn critical information relevant to DPs. Experimental results on a large-scale Chinese–English subtitle corpus show that the two modifications cumulatively improve translation performance, and the best result is +1.5 BLEU points better than that reported by Wang et al. (2018). In addition, the jointly learned DP prediction model significantly outperforms its external counterpart by 9% in F1-score.
As shown in Figure 1, Wang et al. (2018) introduced two independent reconstructors with their own parameters, which reconstruct the DP-annotated source sentence from the encoder and decoder hidden states, respectively. The central idea underpinning their approach is to guide the corresponding hidden states to embed the recalled source-side DP information and subsequently to help the NMT model generate the missing pronouns with these enhanced hidden representations.
|Prediction|F1-score|Example|
|---|---|---|
|DP Position|88%|你 烤 的 #DP# 吗 ?|
|DP Words|66%|你 烤 的 它 吗 ?|
The DPs can be automatically annotated for training and test data using two different strategies (Wang et al., 2016). In the training phase, where the target sentence is available, we annotate DPs for the source sentence using alignment information. These annotated source sentences can be used to build a neural-based DP predictor, which is then used to annotate test sentences, since the target sentence is not available during the testing phase. As shown in Table 1, Wang et al. (2016, 2018) attempted to predict the exact DP words (unless otherwise indicated, the terms “DP” and “DP word” are used interchangeably in this paper), the accuracy of which is only 66% in F1-score. By analyzing the translation outputs, we found that 16.2% of errors are newly introduced and caused by errors from the DP predictor. Fortunately, the accuracy of predicting DP positions (DPPs) is much higher, which provides the chance to alleviate the error propagation problem. Intuitively, we can learn to generate DPs at the predicted positions using a jointly trained DP predictor, which is fed with informative representations in the reconstructor.
3.1 Shared Reconstructor
Recent work shows that NMT models can benefit from sharing a component across different tasks and languages. Taking multi-language translation as an example, Firat et al. (2016) share an attention model across languages while Dong et al. (2015) share an encoder. Our work is most similar to the work of Zoph and Knight (2016) and Anastasopoulos and Chiang (2018), which share a decoder and two separate attention models to read from two different sources. In contrast, we share information at the level of reconstructed frames.
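The architecture of our proposed shared reconstruction model is shown in Figure 2(a). Formally, the reconstructor reads from both the encoder and decoder hidden states, as well as the DP-annotated source sentence, and outputs a reconstruction score. It uses two separate attention models to reconstruct the annotated source sentence $\hat{\mathbf{x}} = \{\hat{x}_1, \dots, \hat{x}_T\}$ word by word, and the reconstruction score is computed by

$$R(\hat{\mathbf{x}} \mid \mathbf{h}^{enc}, \mathbf{h}^{dec}) = \prod_{t=1}^{T} g_r(\hat{x}_{t-1}, \mathbf{h}^{rec}_{t}, \hat{\mathbf{c}}^{enc}_{t}, \hat{\mathbf{c}}^{dec}_{t})$$

where $\mathbf{h}^{rec}_{t}$ is the hidden state in the reconstructor, computed by Equation (1):

$$\mathbf{h}^{rec}_{t} = f_r(\hat{x}_{t-1}, \mathbf{h}^{rec}_{t-1}, \hat{\mathbf{c}}^{enc}_{t}, \hat{\mathbf{c}}^{dec}_{t}) \tag{1}$$

Here $g_r(\cdot)$ and $f_r(\cdot)$ are respectively the softmax and activation functions of the reconstructor, and $\hat{\mathbf{c}}^{enc}_{t}$ and $\hat{\mathbf{c}}^{dec}_{t}$ are the weighted sums of $\mathbf{h}^{enc}$ and $\mathbf{h}^{dec}$, respectively, as in Equations (2) and (3):

$$\hat{\mathbf{c}}^{enc}_{t} = \sum_{j=1}^{J} \hat{\alpha}^{enc}_{t,j} \cdot \mathbf{h}^{enc}_{j} \tag{2}$$

$$\hat{\mathbf{c}}^{dec}_{t} = \sum_{i=1}^{I} \hat{\alpha}^{dec}_{t,i} \cdot \mathbf{h}^{dec}_{i} \tag{3}$$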
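Note that the weights $\hat{\alpha}^{enc}$ and $\hat{\alpha}^{dec}$ are calculated by two separate attention models, denoted $\mathrm{ATT}_{enc}$ and $\mathrm{ATT}_{dec}$. We propose two attention strategies which differ as to whether the two attention models interact.

Independent attention calculates the two sets of weights independently of each other:

$$\hat{\alpha}^{enc} = \mathrm{ATT}_{enc}(\hat{x}_{t-1}, \mathbf{h}^{rec}_{t-1}, \mathbf{h}^{enc}) \tag{4}$$

$$\hat{\alpha}^{dec} = \mathrm{ATT}_{dec}(\hat{x}_{t-1}, \mathbf{h}^{rec}_{t-1}, \mathbf{h}^{dec}) \tag{5}$$

Interactive attention feeds the context vector produced by one attention model to the other attention model. The intuition behind this is that the interaction between the two attention models can lead to a better exploitation of the encoder and decoder representations. As the interactive attention is directional, we have two options (Equations (6) and (7)), which modify either $\mathrm{ATT}_{dec}$ (encoder to decoder) or $\mathrm{ATT}_{enc}$ (decoder to encoder) while leaving the other one unchanged:

$$\hat{\alpha}^{dec} = \mathrm{ATT}_{dec}(\hat{x}_{t-1}, \mathbf{h}^{rec}_{t-1}, \mathbf{h}^{dec}, \hat{\mathbf{c}}^{enc}_{t}) \tag{6}$$

$$\hat{\alpha}^{enc} = \mathrm{ATT}_{enc}(\hat{x}_{t-1}, \mathbf{h}^{rec}_{t-1}, \mathbf{h}^{enc}, \hat{\mathbf{c}}^{dec}_{t}) \tag{7}$$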
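The difference between the independent and interactive strategies can be illustrated with a minimal Python sketch. It assumes dot-product attention and lets one attention model condition on the other's context vector by adding that vector to the query. Both choices are illustrative assumptions rather than the scoring functions used in the paper, and the names `attend` and `shared_contexts` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, states, extra=None):
    """Dot-product attention over `states` (an assumed scoring function).

    If `extra` is given (the context vector of the other attention model),
    it is added to the query; this is one simple, assumed way to make the
    two attention models interact.
    Returns (attention weights, context vector).
    """
    q = query if extra is None else [a + b for a, b in zip(query, extra)]
    scores = [sum(a * b for a, b in zip(q, h)) for h in states]
    weights = softmax(scores)
    dim = len(states[0])
    context = [sum(w * h[k] for w, h in zip(weights, states)) for k in range(dim)]
    return weights, context


def shared_contexts(query, enc_states, dec_states, mode="independent"):
    """Return the encoder-side and decoder-side context vectors for one step."""
    if mode == "independent":
        _, c_enc = attend(query, enc_states)
        _, c_dec = attend(query, dec_states)
    elif mode == "enc->dec":  # decoder-side attention also sees c_enc
        _, c_enc = attend(query, enc_states)
        _, c_dec = attend(query, dec_states, extra=c_enc)
    elif mode == "dec->enc":  # encoder-side attention also sees c_dec
        _, c_dec = attend(query, dec_states)
        _, c_enc = attend(query, enc_states, extra=c_dec)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return c_enc, c_dec
```

In each interactive mode only one of the two context vectors changes relative to the independent mode, which mirrors the directional nature of the interaction described above.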
3.2 Joint Prediction of Dropped Pronouns
Inspired by recent successes of multi-task learning (Dong et al., 2015; Luong et al., 2016), we propose to jointly learn to translate and predict DPs (as shown in Figure 2(b)). To ease the learning difficulty, we leverage the DPPs predicted by an external model, which achieves an accuracy of 88% in F1-score. Accordingly, we transform the original DP prediction problem into DP word generation given the pre-predicted DP positions. Since the DPP-annotated source sentence serves as the reconstructed input, we introduce an additional DP-generation loss, which measures how well each DP is generated from the corresponding hidden state in the reconstructor.
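Let $\mathbf{dp} = \{dp_1, dp_2, \dots, dp_D\}$ be the list of DPs in the annotated source sentence, and $\mathbf{h}^{rec} = \{\mathbf{h}^{rec}_1, \mathbf{h}^{rec}_2, \dots, \mathbf{h}^{rec}_D\}$ be the corresponding hidden states in the reconstructor. The generation probability is computed by

$$P(\mathbf{dp} \mid \mathbf{h}^{rec}) = \prod_{d=1}^{D} P(dp_d \mid \mathbf{h}^{rec}_d) = \prod_{d=1}^{D} g_p(dp_d \mid \mathbf{h}^{rec}_d) \tag{8}$$

where $g_p(\cdot)$ is the softmax for the DP predictor.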
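As a rough illustration of the DP-generation probability, the following Python sketch sums the log-probabilities of the reference DP words, each predicted by a softmax over a linear projection of the reconstructor hidden state at the corresponding DP position. The linear projection `out_weights`, the integer DP vocabulary indices, and the function name `dp_log_prob` are assumptions made for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dp_log_prob(dp_ids, rec_states, out_weights):
    """Log-probability of generating every DP from its reconstructor state.

    dp_ids      -- index of the reference DP word at each DP position
    rec_states  -- reconstructor hidden state at each DP position
    out_weights -- assumed projection matrix, one row per DP vocabulary entry
    """
    total = 0.0
    for dp_id, state in zip(dp_ids, rec_states):
        logits = [sum(w * x for w, x in zip(row, state)) for row in out_weights]
        probs = softmax(logits)
        total += math.log(probs[dp_id])  # product of probabilities in log space
    return total
```

The negated value can serve as the auxiliary prediction loss for one sentence.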
3.3 Training and Testing
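We train both the encoder-decoder and the shared reconstructor together in a single end-to-end process, and the training objective is

$$J(\theta, \gamma, \psi) = \operatorname*{arg\,max}_{\theta, \gamma, \psi} \Big\{ \log L(\mathbf{y} \mid \mathbf{x}; \theta) + \log R(\hat{\mathbf{x}} \mid \mathbf{h}^{enc}, \mathbf{h}^{dec}; \theta, \gamma) + \log P(\mathbf{dp} \mid \mathbf{h}^{rec}; \theta, \gamma, \psi) \Big\} \tag{9}$$

where $\{\theta, \gamma, \psi\}$ are respectively the parameters associated with the encoder-decoder, the shared reconstructor and the DP prediction model. The three terms are the translation likelihood, the reconstruction score and the DP prediction probability. The auxiliary reconstruction objective $R(\cdot)$ guides the related part of the parameter matrix $\theta$ to learn better latent representations, which are used to reconstruct the DPP-annotated source sentence. The auxiliary prediction loss $P(\cdot)$ guides the related parts of both the encoder-decoder and the reconstructor to learn better latent representations, which are used to predict the DPs in the source sentence.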
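The way the translation objective and the two auxiliary objectives combine can be sketched in a few lines of Python. The sketch assumes that the three log-scores have already been computed for each training example and that they are added with equal weight. The dictionary keys and function names are hypothetical.

```python
def joint_log_objective(examples):
    """Sum the three log-scores over a batch of training examples."""
    total = 0.0
    for ex in examples:
        total += ex["log_likelihood"]      # translation term
        total += ex["log_reconstruction"]  # auxiliary reconstruction term
        total += ex["log_dp_prediction"]   # auxiliary DP prediction term
    return total


def joint_loss(examples):
    """Negated objective, suitable for a minimizer."""
    return -joint_log_objective(examples)
```

Because the terms are simply added, gradients from both auxiliary terms flow back into the shared encoder-decoder parameters, which is the mechanism by which they shape the hidden states.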
|#|Model|#Params|Training speed|Decoding speed|BLEU|
|---|---|---|---|---|---|
|6|Shared-Rec(+DPPs) + joint prediction|+87.9M|0.51K|11.88|35.88|
|7|Shared-Rec(+DPPs) + joint prediction|+91.9M|0.48K|11.84|36.53|
|8|Shared-Rec(+DPPs) + joint prediction|+89.9M|0.49K|11.85|35.99|
Following Tu et al. (2017) and Wang et al. (2018), we use the reconstruction score as a reranking technique to select the best translation candidate from the generated $n$-best list at testing time. Unlike Wang et al. (2018), we reconstruct the DPP-annotated source sentence, in which the DP positions are predicted by an external model.
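Reranking with the reconstruction score can be sketched as follows. The sketch assumes each candidate in the n-best list carries a translation log-likelihood and a reconstruction log-score, and that the two are combined linearly. The interpolation weight `rec_weight` and the function name `rerank` are assumptions made for illustration.

```python
def rerank(candidates, rec_weight=1.0):
    """Pick the best candidate from an n-best list.

    candidates -- list of (translation, log_likelihood, log_reconstruction)
    rec_weight -- assumed interpolation weight for the reconstruction score
    """
    if not candidates:
        raise ValueError("empty n-best list")
    return max(candidates, key=lambda c: c[1] + rec_weight * c[2])


# Hypothetical 2-best list: the second candidate is less likely under the
# translation model but reconstructs the annotated source much better.
n_best = [
    ("Did you bake?", -1.0, -5.0),
    ("Did you bake it?", -1.5, -2.0),
]
best = rerank(n_best)
```

With the weight set to zero the selection reduces to ordinary likelihood-based ranking, so the weight controls how strongly the reconstruction score influences the final choice.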
To compare our work with the results reported by previous work (Wang et al., 2018), we conducted experiments on their released Chinese–English TV Subtitle corpus (https://github.com/longyuewangdcu/tvsub). The training, validation, and test sets contain 2.15M, 1.09K, and 1.15K sentence pairs, respectively. We used the case-insensitive 4-gram NIST BLEU metric (Papineni et al., 2002) for evaluation, and the sign-test (Collins et al., 2005) to test for statistical significance.
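For readers unfamiliar with the sign test, a generic two-sided paired version is sketched below in Python. It assumes that a per-sentence quality score is available for both systems and discards ties. The exact variant used for BLEU following Collins et al. (2005) may differ, and the function name `sign_test` is hypothetical.

```python
from math import comb


def sign_test(scores_a, scores_b):
    """Two-sided paired sign test; returns the p-value.

    scores_a, scores_b -- per-sentence scores of two systems on the same
    test sentences (an assumption of this sketch). Ties are discarded.
    """
    wins = sum(a > b for a, b in zip(scores_a, scores_b))
    losses = sum(a < b for a, b in zip(scores_a, scores_b))
    n = wins + losses
    if n == 0:
        return 1.0  # no informative pairs
    k = min(wins, losses)
    # Probability of at most k successes under Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

A small p-value indicates that one system wins on more sentences than would be expected by chance if the two systems were equally good.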
We implemented our models on the code repository released by Wang et al. (2018) (https://github.com/tuzhaopeng/nmt). We used the same configurations (e.g. vocabulary size = 30K, hidden size = 1000) and reproduced their reported results. It should be emphasized that we did not use the pre-training strategy of Wang et al. (2018), since we found that training from scratch achieved better performance in the shared reconstructor setting.
Table 2 shows the translation results. It is clear that the proposed models significantly outperform the baselines in all cases, although there are considerable differences among different variations.
(Rows 1-4): The three baselines (Rows 1, 2, and 4) differ regarding the training data used. “Separate-Recs(+DPs)” (Row 3) is the best model reported in Wang et al. (2018), which we employed as another strong baseline. The baseline trained on the DPP-annotated data (“Baseline (+DPPs)”, Row 4) outperforms the other two counterparts, indicating that the error propagation problem does affect the performance of translating DPs. It suggests the necessity of jointly learning to translate and predict DPs.
(Rows 5-8): Our shared reconstructor (Row 5) not only outperforms the corresponding baseline (Row 4), but also surpasses its separate reconstructor counterpart (Row 3). Introducing a joint prediction objective (Row 6) achieves a further improvement of +0.61 BLEU points. These results verify that the shared reconstructor and joint DP prediction cumulatively improve translation performance.
Among the variations of shared reconstructors (Rows 6-8), we found that interactive attention from encoder to decoder (Row 7) achieves the best performance, which is +3.45 BLEU points better than our baseline (Row 4) and +1.45 BLEU points better than the best result reported by Wang et al. (2018) (Row 3). We attribute the superior performance of this variant to the fact that the attention context over encoder representations embeds useful DP information, which can help to better attend to the representations of the corresponding pronouns on the decoder side. Similar to Wang et al. (2018), the proposed approach improves BLEU scores at the cost of decreased training and decoding speed, which is due to the large number of newly introduced parameters resulting from the incorporation of reconstructors into the NMT model.
DP Prediction Accuracy
As shown in Table 3, the jointly learned model significantly outperforms the external one by 9% in F1-score. We attribute this to the useful contextual information embedded in the reconstructor representations, which are used to generate the exact DP words.
Table 4 lists translation results when the reconstruction model is used in training only. We can see that the proposed model outperforms both the strong baseline and the best model reported in Wang et al. (2018). This is encouraging since no extra resources or computation are introduced to online decoding, which makes the approach highly practical, for example for translation in industry applications.
Effect of DPP Labelling Accuracy
For each sentence in testing, the DPs and DPPs are labelled automatically by two separate external prediction models, whose accuracies are 66% and 88% in F1-score, respectively. We investigate the best performance the models can achieve with manual labelling, which can be regarded as an “Oracle”, as shown in Table 5. As seen, there still exists a significant gap in performance, which could be narrowed by improving the accuracy of our DPP generator. In addition, our models show a relatively smaller gap to the oracle performance (“Man”), indicating that the error propagation problem is alleviated to some extent.
In this paper, we proposed effective approaches to translating DPs with NMT models: a shared reconstructor and jointly learning to translate and predict DPs. Through experiments we verified that 1) shared reconstruction helps to share knowledge between the encoder and decoder; and 2) joint learning of the DP prediction model indeed alleviates the error propagation problem by improving prediction accuracy. The two approaches cumulatively improve translation performance. The method is not restricted to the DP translation task and could potentially be applied to other sequence generation problems where additional source-side information could be incorporated.
In future work we plan to: 1) build a fully end-to-end NMT model for DP translation, which does not depend on any external component (i.e. DPP predictor); 2) exploit cross-sentence context (Wang et al., 2017) to further improve DP translation; 3) investigate a new research strand that adapts our model to the inverse translation direction by learning to drop pronouns instead of recovering DPs.
The ADAPT Centre for Digital Content Technology is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund. We thank the anonymous reviewers for their insightful comments.
- Anastasopoulos and Chiang (2018) Antonios Anastasopoulos and David Chiang. 2018. Tied multitask learning for neural speech translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 82–91, New Orleans, Louisiana, USA.
- Collins et al. (2005) Michael Collins, Philipp Koehn, and Ivona Kucerova. 2005. Clause restructuring for statistical machine translation. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 531–540, Ann Arbor, Michigan, USA.
- Dong et al. (2015) Daxiang Dong, Hua Wu, Wei He, Dianhai Yu, and Haifeng Wang. 2015. Multi-task learning for multiple language translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics, pages 1723–1732, Beijing, China.
- Firat et al. (2016) Orhan Firat, Kyunghyun Cho, and Yoshua Bengio. 2016. Multi-way, multilingual neural machine translation with a shared attention mechanism. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 866–875, San Diego, California, USA.
- Le Nagard and Koehn (2010) Ronan Le Nagard and Philipp Koehn. 2010. Aiding pronoun translation with co-reference resolution. In Proceedings of the Joint 5th Workshop on Statistical Machine Translation and MetricsMATR, pages 252–261, Uppsala, Sweden.
- Luong et al. (2016) Minh-Thang Luong, Quoc V Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. 2016. Multi-task sequence to sequence learning. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2965–2977, Santa Fe, New Mexico, USA.
- Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA.
- Tu et al. (2017) Zhaopeng Tu, Yang Liu, Lifeng Shang, Xiaohua Liu, and Hang Li. 2017. Neural machine translation with reconstruction. In Proceedings of the 31st AAAI Conference on Artificial Intelligence, pages 3097–3103, San Francisco, California, USA.
- Wang et al. (2018) Longyue Wang, Zhaopeng Tu, Shuming Shi, Tong Zhang, Yvette Graham, and Qun Liu. 2018. Translating pro-drop languages with reconstruction models. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, pages 4937–4945, New Orleans, Louisiana, USA.
- Wang et al. (2017) Longyue Wang, Zhaopeng Tu, Andy Way, and Qun Liu. 2017. Exploiting cross-sentence context for neural machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2816–2821, Copenhagen, Denmark.
- Wang et al. (2016) Longyue Wang, Zhaopeng Tu, Xiaojun Zhang, Hang Li, Andy Way, and Qun Liu. 2016. A novel approach for dropped pronoun translation. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 983–993, San Diego, California, USA.
- Xiang et al. (2013) Bing Xiang, Xiaoqiang Luo, and Bowen Zhou. 2013. Enlisting the ghost: Modeling empty categories for machine translation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 822–831, Sofia, Bulgaria.
- Zoph and Knight (2016) Barret Zoph and Kevin Knight. 2016. Multi-source neural translation. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 30–34, San Diego, California, USA.