Parallel corpora, which are large collections of parallel texts, serve as an important resource for inducing translation correspondences, either at the level of words [Brown et al.1993, Smadja and McKeown1994, Wu and Xia1994] or phrases [Kupiec1993, Melamed1997, Marcu and Wong2002, Koehn et al.2003]. However, the availability of large-scale, wide-coverage corpora remains a challenge even in the era of big data: parallel corpora usually exist only for resource-rich languages and are restricted to limited domains such as government documents and news articles.
Therefore, intensive attention has been drawn to exploiting non-parallel corpora for acquiring translation correspondences. Most previous efforts have concentrated on learning parallel lexicons from non-parallel corpora, including parallel sentence and lexicon extraction via bootstrapping [Fung and Cheung2004], inducing parallel lexicons via canonical correlation analysis [Haghighi et al.2008], training IBM models on monolingual corpora as decipherment [Ravi and Knight2011, Nuhn et al.2012, Dou et al.2014], and deriving parallel lexicons from bilingual word embeddings [Vulić and Moens2013, Mikolov et al.2013, Vulić and Moens2015].
Recently, a number of authors have turned to a more challenging task: learning parallel phrases from non-parallel corpora [Zhang and Zong2013, Dong et al.2015]. Zhang and Zong [2013] present a method for retrieving parallel phrases from non-parallel corpora using a seed parallel lexicon. Dong et al. [2015] continue this line of research by introducing an iterative approach to jointly learning parallel lexicons and phrases. They introduce a corpus-level latent-variable translation model in a non-parallel scenario and develop a training algorithm that alternates between (1) using a parallel lexicon to extract parallel phrases from non-parallel corpora and (2) using the extracted parallel phrases to enlarge the parallel lexicon. They show that starting from a small seed lexicon, their approach is capable of learning both new words and phrases gradually over time.
However, due to the structural divergence between natural languages as well as the presence of noisy data, using only asymmetric translation models might be insufficient to accurately identify parallel lexicons and phrases from non-parallel corpora. Dong et al. [2015] report that the accuracy on the Chinese-English dataset is only around 40% after running for 70 iterations. In addition, their approach seems sensitive to noisy data in non-parallel corpora, as the accuracy drops significantly as the amount of noise increases.
Since asymmetric word alignment and phrase alignment models are usually complementary, it is natural to combine them to make more accurate predictions. In this work, we propose to introduce agreement-based learning [Liang et al.2006, Liang et al.2008] into extracting parallel lexicons and phrases from non-parallel corpora. Based on the latent-variable model proposed by Dong et al. [2015], we propose two kinds of loss functions to take into account the agreement between both phrase alignment and word alignment in two directions. As the inference is intractable, we resort to a Viterbi EM algorithm to train the two models efficiently. Experiments on the Chinese-English dataset show that agreement-based learning is more robust to noisy data and leads to substantial improvements in phrase alignment and machine translation evaluations.
Given a monolingual corpus of source-language phrases and a monolingual corpus of target-language phrases, we assume there exists a parallel corpus consisting of pairs of source and target phrases that are translations of each other.
As a long sentence in one corpus is usually unlikely to have a translation in the other and vice versa, most previous efforts build on the assumption that phrases are more likely to have translational equivalents on the other side [Munteanu and Marcu2006, Cettolo et al.2010, Zhang and Zong2013, Dong et al.2015]. Such a set of phrases can be constructed by collecting either constituents of parsed sentences or strings with hyperlinks on webpages (e.g., Wikipedia). Therefore, we assume the two monolingual corpora are readily available and focus on how to extract the parallel corpus from them.
To address this problem, Dong et al. [2015] introduce a corpus-level latent-variable translation model in a non-parallel scenario:
where is phrase alignment and is a set of model parameters. Each target phrase is restricted to connect to exactly one source phrase: , where . For example, denotes that is aligned to . Note that represents an empty source phrase.
They follow IBM Model 1 [Brown et al.1993] to further decompose the model as
where is a phrase translation model that can be further defined as
Dong et al. [2015] distinguish between empty and non-empty phrase translations. If a target phrase is aligned to the empty source phrase, they set the phrase translation probability to a fixed number. Otherwise, conventional word alignment models such as IBM Model 1 can be used for non-empty phrase translation:
where is a length model and is a translation model. We use to denote the length of .
Therefore, the latent-variable model involves two kinds of latent structures: (1) phrase alignment between source and target phrases, (2) word alignment between source and target words within phrases.
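As an illustration of the non-empty phrase translation probability, the following minimal sketch assumes the standard IBM Model 1 form: a length probability divided by (I + 1)^J, multiplied by a product over target words of summed word translation probabilities. The names `model1_phrase_prob`, `length_prob`, `trans_prob`, `NULL`, and `floor` are hypothetical and are not taken from Dong et al. [2015].

```python
from typing import Dict, List, Tuple

NULL = "<null>"  # hypothetical token standing for the empty source word


def model1_phrase_prob(
    src: List[str],
    tgt: List[str],
    length_prob: Dict[Tuple[int, int], float],
    trans_prob: Dict[Tuple[str, str], float],
    floor: float = 1e-9,
) -> float:
    """Assumed form: p(J | I) / (I + 1)^J * prod_j sum_i t(tgt_j | src_i).

    length_prob is keyed by (J, I); trans_prob is keyed by (target word, source word).
    Unseen entries fall back to a small floor value (an assumption of this sketch).
    """
    I, J = len(src), len(tgt)
    p = length_prob.get((J, I), floor) / (I + 1) ** J
    for f in tgt:
        # Sum over all source words, including the empty word at position 0.
        p *= sum(trans_prob.get((f, e), floor) for e in [NULL] + src)
    return p
```

The word alignment within the phrase pair is marginalized out by the inner sum, which is why only the length and translation tables are needed.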
Given the two monolingual corpora and , the training objective is to maximize the likelihood of the training data:
Note that a small seed parallel lexicon is used for initializing training, and the objective includes a term that checks whether an entry exists in this lexicon. (Due to the difficulty of learning translation correspondences from non-parallel corpora, many authors have assumed that a small seed lexicon is readily available [Gaussier et al.2004, Zhang and Zong2013, Vulić and Moens2013, Mikolov et al.2013, Dong et al.2015].)
Given the monolingual corpora and the optimized model parameters, the Viterbi phrase alignment is calculated as
Finally, parallel lexicons can be derived from the translation probability table of IBM Model 1, and parallel phrases can be collected from the Viterbi phrase alignment. This process iterates and enlarges parallel lexicons and phrases gradually over time.
As it is very challenging to extract parallel phrases from non-parallel corpora, unidirectional models might capture only partial aspects of translation modeling on non-parallel corpora. Indeed, Dong et al. [2015] find that the accuracy of phrase alignment is only around 50% on the Chinese-English dataset. More importantly, their approach seems to be vulnerable to noise, as the accuracy drops significantly as the amount of noise increases. As source-to-target and target-to-source translation models are usually complementary [Och and Ney2003, Koehn et al.2003, Liang et al.2006], it is appealing to combine them to improve alignment accuracy.
3.1 Agreement-based Learning
The basic idea of our work is to encourage the source-to-target and target-to-source translation models to agree on both phrase and word alignments.
For example, Figure 1 shows two example Chinese-to-English and English-to-Chinese phrase alignments on the same non-parallel data. As each model only captures partial aspects of translation modeling, our intuition is that the links on which two models agree (highlighted in red) are more likely to be correct.
More formally, let be a source-to-target translation model and be a target-to-source model, where and are corresponding model parameters. We use to denote source-to-target phrase alignment. Likewise, the target-to-source phrase alignment is denoted by .
To ease the comparison between and , we represent them as sets of non-empty links equivalently:
For example, suppose the source-to-target and target-to-source phrase alignments are and . The equivalent link sets are and . Therefore, is said to be equal to (i.e., ).
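The conversion from directional alignments to link sets can be sketched as follows. This is an illustrative sketch only: it assumes 1-based phrase indices with 0 denoting the empty phrase, and the function names `s2t_links` and `t2s_links` are hypothetical.

```python
from typing import List, Set, Tuple

Link = Tuple[int, int]  # (source index, target index), both 1-based


def s2t_links(a: List[int]) -> Set[Link]:
    """a[t-1] = s means target phrase t is aligned to source phrase s (0 = empty)."""
    return {(s, t) for t, s in enumerate(a, start=1) if s != 0}


def t2s_links(b: List[int]) -> Set[Link]:
    """b[s-1] = t means source phrase s is aligned to target phrase t (0 = empty)."""
    return {(s, t) for s, t in enumerate(b, start=1) if t != 0}


# Toy data: three target phrases and two source phrases.
a = [2, 0, 1]  # target 1 -> source 2, target 2 -> empty, target 3 -> source 1
b = [3, 1]     # source 1 -> target 3, source 2 -> target 1
same = s2t_links(a) == t2s_links(b)
```

Because links to the empty phrase are dropped, two alignments defined over different index ranges become directly comparable as sets of (source, target) pairs.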
Following Liang et al. [2006], we introduce a new training objective that favors the agreement between two unidirectional models:
where the posterior probabilities in two directions are defined as
The loss function measures the disagreement between the two models.
3.2 Outer Agreement
A straightforward loss function is to force the two models to generate identical phrase alignments:
We refer to Eq. (14) as outer agreement since it only considers phrase alignment and ignores the word alignment within aligned phrases.
3.2.2 Training Objective
Since the outer agreement forces two models to generate identical phrase alignments, the training objective can be written as
where is a phrase alignment on which two models agree.
The partial derivatives of the training objective with respect to source-to-target model parameters are given by
The partial derivatives with respect to are defined likewise.
3.2.3 Training Algorithm
As the expectation in Eq. (16) is usually intractable to calculate due to the exponential search space of phrase alignment, we follow Dong et al. [2015] and use a Viterbi EM algorithm instead.
As shown in Figure 2, the algorithm takes a set of source phrases, a set of target phrases, and a seed parallel lexicon as input (line 1). After initializing model parameters (line 2), the algorithm calls the procedure Align to compute the Viterbi phrase alignment between the source and target phrases on which the two models agree. Then, the algorithm updates the two models by normalizing counts collected from the Viterbi phrase alignment. The process repeats for a fixed number of iterations and returns the final Viterbi phrase alignment and model parameters.
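The control flow of this training procedure can be sketched as a generic loop. This is a skeleton under stated assumptions, not the algorithm of Figure 2 itself: `align` and `update` are hypothetical callables standing in for the Align procedure and the count-normalization step, and the parameter objects are left opaque.

```python
from typing import Any, Callable, Set, Tuple

Link = Tuple[int, int]


def viterbi_em(
    params_s2t: Any,
    params_t2s: Any,
    align: Callable[[Any, Any], Set[Link]],
    update: Callable[[Set[Link]], Tuple[Any, Any]],
    iterations: int,
) -> Tuple[Set[Link], Any, Any]:
    """Alternate between a hard E-step (agreed Viterbi alignment) and an M-step."""
    links: Set[Link] = set()
    for _ in range(iterations):
        # Hard E-step: phrase alignment on which both models agree.
        links = align(params_s2t, params_t2s)
        # M-step: re-estimate both models from counts on the agreed links.
        params_s2t, params_t2s = update(links)
    return links, params_s2t, params_t2s
```

Passing the two steps in as callables keeps the loop identical for the outer and inner agreement variants, which differ only in how alignments are scored and how counts are collected.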
3.2.4 Computing Viterbi Phrase Alignments
The procedure Align computes the Viterbi phrase alignment between and on which two models agree as follows:
Unfortunately, due to the exponential search space of phrase alignment, computing this alignment exactly is also intractable. As a result, we approximate it as the intersection of two unidirectional Viterbi phrase alignments:
where the unidirectional Viterbi phrase alignments are calculated as
The source-to-target Viterbi phrase alignment is calculated as
Dong et al. [2015] observe that the Viterbi alignments of individual target phrases can be computed independently, so it suffices to find the most probable source phrase for each target phrase:
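The approximation described above, namely per-phrase argmax in each direction followed by intersection, can be sketched as follows. The scoring functions, the `empty_score` constant for aligning to the empty phrase, and all function names are assumptions made for illustration; an exhaustive scan over candidates is used here rather than any pruning strategy.

```python
from typing import Callable, List, Set, Tuple

Phrase = Tuple[str, ...]
Score = Callable[[Phrase, Phrase], float]  # score(conditioning phrase, generated phrase)


def viterbi_one_direction(
    conds: List[Phrase], outcomes: List[Phrase], score: Score, empty_score: float
) -> List[int]:
    """For each generated phrase, pick the best conditioning phrase (0 = empty)."""
    alignment = []
    for o in outcomes:
        best_idx, best = 0, empty_score
        for i, c in enumerate(conds, start=1):
            s = score(c, o)
            if s > best:
                best_idx, best = i, s
        alignment.append(best_idx)
    return alignment


def agreed_links(
    src: List[Phrase],
    tgt: List[Phrase],
    s2t_score: Score,
    t2s_score: Score,
    empty_score: float = 1e-6,
) -> Set[Tuple[int, int]]:
    """Intersect the two unidirectional Viterbi alignments as (source, target) links."""
    a = viterbi_one_direction(src, tgt, s2t_score, empty_score)  # a[t-1] = s
    b = viterbi_one_direction(tgt, src, t2s_score, empty_score)  # b[s-1] = t
    s2t = {(s, t) for t, s in enumerate(a, start=1) if s}
    t2s = {(s, t) for s, t in enumerate(b, start=1) if t}
    return s2t & t2s
```

Each direction costs one pass over all candidate pairs, so the intersection avoids searching the joint space of phrase alignments.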
3.2.5 Updating Model Parameters
Following Liang et al. [2006], we collect counts of model parameters only from the agreement term. (We experimented with collecting counts from both the unidirectional and agreement terms but obtained much worse results than counting only from the agreement term.)
Given the agreed Viterbi phrase alignment , the count of the source-to-target length model is given by
The new length probabilities can be obtained by
The count of the source-to-target translation model is given by
The new translation probabilities can be obtained by
Counts of target-to-source length and translation models can be calculated in a similar way.
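The count-and-normalize updates can be sketched as below. This is a sketch under assumptions rather than the paper's exact count definitions: it uses relative frequencies of length pairs on agreed links for the length model and standard IBM Model 1 fractional counts for the translation model, and it ignores the empty source word for brevity. All names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Phrase = Tuple[str, ...]


def update_length_model(
    links: Set[Tuple[int, int]], src: List[Phrase], tgt: List[Phrase]
) -> Dict[Tuple[int, int], float]:
    """Estimate p(J | I) from length pairs observed on agreed links; keys are (J, I)."""
    counts, totals = defaultdict(float), defaultdict(float)
    for s, t in links:
        I, J = len(src[s - 1]), len(tgt[t - 1])
        counts[(J, I)] += 1.0
        totals[I] += 1.0
    return {(J, I): c / totals[I] for (J, I), c in counts.items()}


def update_translation_model(
    links: Set[Tuple[int, int]],
    src: List[Phrase],
    tgt: List[Phrase],
    trans_prob: Dict[Tuple[str, str], float],
    floor: float = 1e-9,
) -> Dict[Tuple[str, str], float]:
    """Estimate t(f | e) from fractional counts on agreed links; keys are (f, e)."""
    counts, totals = defaultdict(float), defaultdict(float)
    for s, t in links:
        e_words = list(src[s - 1])
        for f in tgt[t - 1]:
            # Normalizer over source words in the aligned phrase (assumed non-empty).
            z = sum(trans_prob.get((f, e), floor) for e in e_words)
            for e in e_words:
                c = trans_prob.get((f, e), floor) / z
                counts[(f, e)] += c
                totals[e] += c
    return {(f, e): c / totals[e] for (f, e), c in counts.items()}
```

The target-to-source tables can be obtained by calling the same functions with the roles of the two phrase lists swapped and the links transposed.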
3.3 Inner Agreement
Whereas the outer agreement considers only phrase alignment, the inner agreement takes both phrase alignment and word alignment into consideration:
For example, Figure 3 shows two examples of Chinese-to-English and English-to-Chinese word alignments. The shared links are highlighted in red. Our intuition is that a source phrase and a target phrase are more likely to be translations of each other if the two translation models also agree on word alignment within aligned phrases.
3.3.2 Training Objective and Algorithm
The training objective for inner agreement is given by
We still use the Viterbi EM algorithm as shown in Figure 2 for training the two models.
3.3.3 Computing Viterbi Phrase Alignments
The agreed Viterbi phrase alignment is defined as
As computing this alignment exactly is intractable, we still approximate it using the intersection of two unidirectional Viterbi phrase alignments (see Eq. (18)). The source-to-target Viterbi phrase alignment is calculated as
Here, the score combines the source-to-target link posterior probability, i.e., the probability of a link being present (or absent) in the word alignment according to the source-to-target model, with the corresponding target-to-source link posterior probability. We follow Liang et al. [2006] and use the product of link posteriors to encourage agreement at the level of word alignment.
We use a coarse-to-fine approach [Dong et al.2015] to compute the Viterbi alignment: first retrieving a coarse set of candidate source phrases using translation probabilities and then selecting the candidate with the highest score according to Eq. (31). The target-to-source Viterbi phrase alignment can be calculated similarly.
3.3.4 Updating Model Parameters
Given the agreed Viterbi phrase alignment , the count of the source-to-target length model is still given by Eq. (24). The count of the translation model is calculated as
Counts of target-to-source length and translation models can be calculated in a similar way.
In this section, we evaluate our approach in two tasks: phrase alignment (Section 4.1) and machine translation (Section 4.2).
4.1 Alignment Evaluation
4.1.1 Evaluation Metrics
Given the two monolingual corpora, we suppose there exists a ground truth parallel corpus and compare it against the extracted parallel corpus. The quality of an extracted parallel corpus can be measured by its F1 score.
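Since the experiments report F1 scores, the following sketch shows how such a score could be computed over sets of phrase pairs. It assumes the standard set-based definitions of precision, recall, and F1; the function name and the toy phrase pairs are hypothetical.

```python
from typing import Set, Tuple

Pair = Tuple[str, str]  # (source phrase, target phrase)


def precision_recall_f1(extracted: Set[Pair], gold: Set[Pair]) -> Tuple[float, float, float]:
    """Set-based precision, recall, and F1 of extracted pairs against the ground truth."""
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```

With this definition, F1 equals twice the number of correct pairs divided by the total size of the two sets, so it penalizes both missing and spurious phrase pairs.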
4.1.2 Data Preparation
Although it is appealing to apply our approach to real-world non-parallel corpora, it is time-consuming and labor-intensive to manually construct a ground truth parallel corpus. Therefore, we follow Dong et al. [2015] and build synthetic monolingual corpora and a synthetic ground truth parallel corpus to facilitate the evaluation.
We first extract a set of parallel phrases from a sentence-level parallel corpus using the state-of-the-art phrase-based translation system Moses [Koehn et al.2007] and discard low-probability parallel phrases. Then, the two monolingual corpora can be constructed by corrupting the parallel phrase set, that is, by randomly adding irrelevant source and target phrases. Note that the parallel phrase set can serve as the ground truth parallel corpus. We refer to the non-parallel phrases in the two monolingual corpora as noise.
From LDC Chinese-English parallel corpora, we constructed a development set and a test set. The development set contains 20K parallel phrases, 20K noisy Chinese phrases, and 20K noisy English phrases. The test set contains 20K parallel phrases, 180K noisy Chinese phrases, and 180K noisy English phrases. The seed parallel lexicon contains 1K entries.
[Table 1: F1 scores by seed lexicon size; columns: seed, C→E, E→C, Outer, Inner]
4.1.3 Comparison of Agreement Ratios
We introduce agreement ratio to measure to what extent two unidirectional models agree on phrase alignment:
Figure 4 shows the agreement ratios of independent training (“no agreement”), joint training with the outer agreement (“outer”), and joint training with the inner agreement (“inner”). We find that independently trained unidirectional models hardly agree on phrase alignment, suggesting that each model can only capture partial aspects of translation modeling on non-parallel corpora. In contrast, imposing the agreement term significantly increases the agreement ratios: after 10 iterations, about 40% of phrase alignment links are shared by two models.
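A measure of this kind can be sketched as follows. The definition used here, twice the number of shared links divided by the total number of links in the two alignments, is an assumption made for illustration and may differ from the exact ratio used in the experiments; the function name is hypothetical.

```python
from typing import Set, Tuple

Link = Tuple[int, int]  # (source index, target index)


def agreement_ratio(s2t: Set[Link], t2s: Set[Link]) -> float:
    """Assumed definition: 2 * |shared links| / (|s2t links| + |t2s links|)."""
    total = len(s2t) + len(t2s)
    return 2 * len(s2t & t2s) / total if total else 0.0
```

Under this definition the ratio is 1 when the two link sets are identical and 0 when they share no links.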
4.1.4 Effect of Seed Lexicon Size
Table 1 shows the F1 scores of the Chinese-to-English model (“C→E”), the English-to-Chinese model (“E→C”), joint learning based on the outer agreement (“outer”), and joint learning based on the inner agreement (“inner”) over various sizes of seed lexicons on the development set.
We find that agreement-based learning obtains substantial improvements over independent learning across all sizes. More importantly, even with a seed lexicon containing only 50 entries, agreement-based learning is able to achieve F1 scores above 60%. The inner agreement performs better than the outer agreement by taking the consensus at the word level into account.
[Table 2: F1 scores by amount of noise; columns: noise, C→E, E→C, Outer, Inner]
4.1.5 Effect of Noise
Table 2 shows the effect of noise on the development set. In row 1, “0+0” denotes there is no noise, which can be seen as an upper bound. Adding noise, either on the Chinese side or on the English side, lowers the F1 scores for all methods. Adding noise on the English side makes predicting phrase alignment in the C→E direction more challenging due to the enlarged search space. The situation is similar in the reverse direction. It is clear that agreement-based learning is more robust to noise: while independent training suffers from a reduction of 40% in terms of F1 for the “20K + 20K” setting, agreement-based learning still achieves F1 scores over 70%.
Figure 5 gives the final results on the test set. We find that agreement-based training achieves significant improvements over independent training. By considering the consensus on both phrase and word alignments, the inner agreement significantly outperforms the outer agreement. Notice that Dong et al. [2015] only add noise on one side while we add noisy phrases on both sides, which makes phrase alignment more challenging.
|Chinese||jiaoyisuo shichang jiage zhishu|
|English||exchange market price index|
|Chinese||qianding bianjing maoyi xieding|
|English||signed border trade agreements|
Table 3 shows example learned parallel words and phrases. The lexicon is built from the translation table by retaining high-probability word pairs. Therefore, our approach is capable of learning both new words and new phrases unseen in the seed lexicon.
4.2 Translation Evaluation
[Table 4: results on domain adaptation for machine translation; two column groups, each with columns E→C, C→E, Outer, Inner]
Following Zhang and Zong [2013] and Dong et al. [2015], we evaluate our approach on domain adaptation for machine translation.
The data set consists of two in-domain non-parallel corpora and an out-domain parallel corpus. The in-domain non-parallel corpora consist of 2.65M Chinese phrases and 3.67M English phrases extracted from LDC news articles. We use a small out-domain parallel corpus extracted from financial news from FTChina, which contains 10K phrase pairs. The task is to extract a parallel corpus from in-domain non-parallel corpora starting from a small out-domain parallel corpus.
We use the state-of-the-art translation system Moses [Koehn et al.2007] and evaluate the performance on Chinese-English NIST datasets. The development set is NIST 2006 and the test set is NIST 2005. The evaluation metric is case-insensitive BLEU4 [Papineni et al.2002]. We use the SRILM toolkit [Stolcke2002] to train a 4-gram English language model on a monolingual corpus with 399M English words.
Table 4 shows the results. At iteration 0, only the out-domain corpus is used and the BLEU score is 5.61. All methods iteratively extract parallel phrases from non-parallel corpora and enlarge the extracted parallel corpus. We find that agreement-based learning achieves much higher BLEU scores while obtaining a smaller parallel corpus than independent learning. One possible reason is that agreement-based learning rules out most unlikely phrase pairs by encouraging consensus between two models.
We have presented agreement-based training for learning parallel lexicons and phrases from non-parallel corpora. By modeling the agreement on both phrase alignment and word alignment, our approach achieves significant improvements in both alignment and translation evaluations.
We sincerely thank the reviewers for their valuable suggestions. We also thank Meng Zhang, Yankai Lin, Shiqi Shen and Meiping Dong for their insightful discussions. Yang Liu is supported by the National Natural Science Foundation of China (No. 61522204), the 863 Program (2015AA011808), and Samsung R&D Institute of China. Huanbo Luan is supported by the National Natural Science Foundation of China (No. 61303075). Maosong Sun is supported by the Major Project of the National Social Science Foundation of China (13&ZD190).
- [Brown et al.1993] Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics.
- [Cettolo et al.2010] Mauro Cettolo, Marcello Federico, and Nicola Bertoldi. 2010. Mining parallel fragments from comparable texts. In Proceedings of IWSLT.
- [Dong et al.2014] Meiping Dong, Yong Cheng, Yang Liu, Jia Xu, Maosong Sun, Tatsuya Izuha, and Jie Hao. 2014. Query lattice for translation retrieval. In Proceedings of COLING.
- [Dong et al.2015] Meiping Dong, Yang Liu, Huanbo Luan, Maosong Sun, Tatsuya Izuha, and Dakun Zhang. 2015. Iterative learning of parallel lexicons and phrases from non-parallel corpora. In Proceedings of IJCAI.
- [Dou et al.2014] Qing Dou, Ashish Vaswani, and Kevin Knight. 2014. Beyond parallel data: Joint word alignment and decipherment improves machine translation. In Proceedings of EMNLP.
- [Fung and Cheung2004] Pascale Fung and Percy Cheung. 2004. Mining very-non-parallel corpora: Parallel sentence and lexicon extraction via bootstrapping and EM. In Proceedings of EMNLP.
- [Gaussier et al.2004] Eric Gaussier, J.M. Renders, I. Matveeva, C. Goutte, and H. Dejean. 2004. A geometric view on bilingual lexicon extraction from comparable corpora. In Proceedings of ACL.
- [Haghighi et al.2008] Aria Haghighi, Percy Liang, Taylor Berg-Kirkpatrick, and Dan Klein. 2008. Learning bilingual lexicons from monolingual corpora. In Proceedings of ACL.
- [Koehn et al.2003] Philipp Koehn, Franz J. Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of NAACL.
- [Koehn et al.2007] Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions.
- [Kupiec1993] Julian Kupiec. 1993. An algorithm for finding noun phrase correspondences in bilingual corpora. In Proceedings of ACL.
- [Liang et al.2006] Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by agreement. In Proceedings of NAACL.
- [Liang et al.2008] Percy Liang, Dan Klein, and Michael I. Jordan. 2008. Agreement-based learning. In Proceedings of NIPS.
- [Marcu and Wong2002] Daniel Marcu and Daniel Wong. 2002. A phrase-based joint probability model for statistical machine translation. In Proceedings of EMNLP.
- [Melamed1997] I. Dan Melamed. 1997. Automatic discovery of non-compositional compounds in parallel data. In Proceedings of EMNLP.
- [Mikolov et al.2013] Tomas Mikolov, Quoc V Le, and Ilya Sutskever. 2013. Exploiting similarities among languages for machine translation. arXiv:1309.4168.
- [Munteanu and Marcu2006] Dragos Stefan Munteanu and Daniel Marcu. 2006. Extracting parallel sub-sentential fragments from non-parallel corpora. In Proceedings of ACL.
- [Nuhn et al.2012] Malte Nuhn, Arne Mauser, and Hermann Ney. 2012. Deciphering foreign language by combining language models and context vectors. In Proceedings of ACL.
- [Och and Ney2003] Franz J. Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics.
- [Papineni et al.2002] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of ACL.
- [Ravi and Knight2011] Sujith Ravi and Kevin Knight. 2011. Deciphering foreign language. In Proceedings of ACL.
- [Smadja and McKeown1994] Frank Smadja and Kathleen McKeown. 1994. Translating collocations for use in bilingual lexicons. In Proceedings of the ARPA Human Language Technology Workshop.
- [Stolcke2002] Andreas Stolcke. 2002. SRILM - an extensible language modeling toolkit. In Proceedings of ICSLP.
- [Vogel and Ney1996] Stephan Vogel and Hermann Ney. 1996. HMM-based word alignment in statistical translation. In Proceedings of COLING.
- [Vulić and Moens2013] Ivan Vulić and Marie-Francine Moens. 2013. A study on bootstrapping bilingual vector spaces from non-parallel data (and nothing else). In Proceedings of EMNLP.
- [Vulić and Moens2015] Ivan Vulić and Marie-Francine Moens. 2015. Bilingual word embeddings from non-parallel document-aligned data applied to bilingual lexicon induction. In Proceedings of ACL.
- [Wu and Xia1994] Dekai Wu and Xuanyin Xia. 1994. Learning an English-Chinese lexicon from a parallel corpus. In Proceedings of the ARPA Human Language Technology Workshop.
- [Zhang and Zong2013] Jiajun Zhang and Chengqing Zong. 2013. Learning a phrase-based translation model from monolingual data with application to domain adaptation. In Proceedings of ACL.