Agreement-based Learning of Parallel Lexicons and Phrases from Non-Parallel Corpora

Chunyang Liu et al. ∙ Tsinghua University / Samsung ∙ 06/15/2016

We introduce an agreement-based approach to learning parallel lexicons and phrases from non-parallel corpora. The basic idea is to encourage two asymmetric latent-variable translation models (i.e., source-to-target and target-to-source) to agree on identifying latent phrase and word alignments. The agreement is defined at both word and phrase levels. We develop a Viterbi EM algorithm for jointly training the two unidirectional models efficiently. Experiments on the Chinese-English dataset show that agreement-based learning significantly improves both alignment and translation performance.


1 Introduction

Parallel corpora, which are large collections of parallel texts, serve as an important resource for inducing translation correspondences, either at the level of words [Brown et al.1993, Smadja and McKeown1994, Wu and Xia1994] or phrases [Kupiec1993, Melamed1997, Marcu and Wong2002, Koehn et al.2003]. However, the availability of large-scale, wide-coverage parallel corpora remains a challenge even in the era of big data: parallel corpora are usually available only for resource-rich languages and are restricted to limited domains such as government documents and news articles.

Therefore, intensive attention has been drawn to exploiting non-parallel corpora for acquiring translation correspondences. Most previous efforts have concentrated on learning parallel lexicons from non-parallel corpora, including parallel sentence and lexicon extraction via bootstrapping [Fung and Cheung2004], inducing parallel lexicons via canonical correlation analysis [Haghighi et al.2008], training IBM models on monolingual corpora as decipherment [Ravi and Knight2011, Nuhn et al.2012, Dou et al.2014], and deriving parallel lexicons from bilingual word embeddings [Vulić and Moens2013, Mikolov et al.2013, Vulić and Moens2015].

Recently, a number of authors have turned to a more challenging task: learning parallel phrases from non-parallel corpora [Zhang and Zong2013, Dong et al.2015]. Zhang and Zong (2013) present a method for retrieving parallel phrases from non-parallel corpora using a seed parallel lexicon. Dong et al. (2015) continue this line of research and introduce an iterative approach to the joint learning of parallel lexicons and phrases. They introduce a corpus-level latent-variable translation model for the non-parallel scenario and develop a training algorithm that alternates between (1) using a parallel lexicon to extract parallel phrases from non-parallel corpora and (2) using the extracted parallel phrases to enlarge the parallel lexicon. They show that, starting from a small seed lexicon, their approach is capable of learning both new words and phrases gradually over time.

However, due to the structural divergence between natural languages as well as the presence of noisy data, using asymmetric translation models alone might be insufficient to accurately identify parallel lexicons and phrases from non-parallel corpora. Dong et al. (2015) report that the accuracy on the Chinese-English dataset is only around 40% after running for 70 iterations. In addition, their approach seems prone to being affected by noisy data, as the accuracy drops significantly with the increase of noise.

Since asymmetric word alignment and phrase alignment models are usually complementary, it is natural to combine them to make more accurate predictions. In this work, we propose to introduce agreement-based learning [Liang et al.2006, Liang et al.2008] into the extraction of parallel lexicons and phrases from non-parallel corpora. Based on the latent-variable model proposed by Dong et al. (2015), we propose two kinds of loss functions that take into account the agreement on both phrase alignment and word alignment in the two directions. As exact inference is intractable, we resort to a Viterbi EM algorithm to train the two models efficiently. Experiments on the Chinese-English dataset show that agreement-based learning is more robust to noisy data and leads to substantial improvements in phrase alignment and machine translation evaluations.

2 Background

Given a monolingual corpus of source language phrases $\mathcal{S} = \{s^{(1)}, \dots, s^{(M)}\}$ and a monolingual corpus of target language phrases $\mathcal{T} = \{t^{(1)}, \dots, t^{(N)}\}$, we assume there exists a parallel corpus $\mathcal{P} = \{\langle s, t \rangle\}$, where $\langle s, t \rangle$ denotes that $s \in \mathcal{S}$ and $t \in \mathcal{T}$ are translations of each other.

As a long sentence in $\mathcal{S}$ is usually unlikely to have a translation in $\mathcal{T}$ and vice versa, most previous efforts build on the assumption that phrases are more likely to have translational equivalents on the other side [Munteanu and Marcu2006, Cettolo et al.2010, Zhang and Zong2013, Dong et al.2015]. Such a set of phrases can be constructed by collecting either constituents of parsed sentences or strings with hyperlinks on webpages (e.g., Wikipedia). Therefore, we assume the two monolingual corpora are readily available and focus on how to extract $\mathcal{P}$ from $\mathcal{S}$ and $\mathcal{T}$.

To address this problem, Dong et al. (2015) introduce a corpus-level latent-variable translation model in a non-parallel scenario:

$$P(\mathcal{T} \mid \mathcal{S}; \boldsymbol{\theta}) = \sum_{\mathbf{a}} P(\mathcal{T}, \mathbf{a} \mid \mathcal{S}; \boldsymbol{\theta}) \qquad (1)$$

where $\mathbf{a} = \{a_1, \dots, a_N\}$ is a phrase alignment and $\boldsymbol{\theta}$ is a set of model parameters. Each target phrase is restricted to connect to exactly one source phrase: $a_j = m$, where $0 \le m \le M$. For example, $a_3 = 5$ denotes that $t^{(3)}$ is aligned to $s^{(5)}$. Note that $s^{(0)}$ represents an empty source phrase.
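To make this representation concrete, here is a minimal Python sketch (toy data, not taken from the paper) of a phrase alignment stored as one source index per target phrase, with index 0 reserved for the empty source phrase:

```python
# Toy illustration of the phrase alignment representation (hypothetical data).
# Source phrases are indexed from 1; index 0 denotes the empty source phrase.
source_phrases = [None, "jingji", "zhengzhi huanjing"]
target_phrases = ["economy", "political environment", "stock market"]

# alignment[j] = m means the (j+1)-th target phrase is aligned to the m-th source phrase.
alignment = [1, 2, 0]  # "stock market" has no counterpart and maps to the empty phrase

for j, m in enumerate(alignment):
    src = source_phrases[m] if m != 0 else "<empty>"
    print(f"{target_phrases[j]!r} -> {src!r}")
```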

They follow IBM Model 1 [Brown et al.1993] to further decompose the model as

$$P(\mathcal{T}, \mathbf{a} \mid \mathcal{S}; \boldsymbol{\theta}) = \prod_{j=1}^{N} P(t^{(j)} \mid s^{(a_j)}; \boldsymbol{\theta}) \qquad (2)$$

where $P(t^{(j)} \mid s^{(a_j)}; \boldsymbol{\theta})$ is a phrase translation model that can be further defined as

$$P(t^{(j)} \mid s^{(a_j)}; \boldsymbol{\theta}) = \begin{cases} p_{\epsilon} & \text{if } a_j = 0 \\ P_{\mathrm{M1}}(t^{(j)} \mid s^{(a_j)}; \boldsymbol{\theta}) & \text{otherwise} \end{cases} \qquad (3)$$
Figure 1: Agreement between (a) Chinese-to-English and (b) English-to-Chinese phrase alignments. The arrows indicate translation directions. The links on which two models agree are highlighted in bold red. The outer agreement loss function (see Eq. (14)) aims to encourage the agreement at the phrase level.

Dong et al. (2015) distinguish between empty and non-empty phrase translations. If a target phrase is aligned to the empty source phrase (i.e., $a_j = 0$), they set the phrase translation probability to a fixed number $p_{\epsilon}$. Otherwise, conventional word alignment models such as IBM Model 1 can be used for non-empty phrase translation:

$$P(t \mid s; \boldsymbol{\theta}) = P(J \mid I) \prod_{k=1}^{J} \sum_{i=0}^{I} \frac{P(t_k \mid s_i)}{I + 1} \qquad (4)$$

where $P(J \mid I)$ is a length model and $P(t_k \mid s_i)$ is a translation model. We use $J = |t|$ and $I = |s|$ to denote the lengths of $t$ and $s$, and $s_0$ to denote the empty word.
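For reference, the following sketch computes the non-empty phrase translation probability in the spirit of Eq. (4) under IBM Model 1; the dictionary layout and the helper name model1_phrase_prob are our own assumptions, not the authors' implementation:

```python
def model1_phrase_prob(src_words, tgt_words, t_table, length_prob):
    """Non-empty phrase translation probability under IBM Model 1 (sketch of Eq. 4).

    t_table[(f, e)]    : word translation probability P(f | e); e = None is the empty word
    length_prob[(J, I)]: length model P(J | I)
    """
    I, J = len(src_words), len(tgt_words)
    prob = length_prob.get((J, I), 1e-6)          # back-off value is an assumption
    for f in tgt_words:
        # Marginalize over the empty word and all source words, uniformly weighted.
        total = sum(t_table.get((f, e), 1e-9) for e in [None] + src_words)
        prob *= total / (I + 1)
    return prob

# Hypothetical usage with a tiny translation table.
t_table = {("political", "zhengzhi"): 0.9, ("environment", "huanjing"): 0.8}
length_prob = {(2, 2): 0.5}
print(model1_phrase_prob(["zhengzhi", "huanjing"], ["political", "environment"],
                         t_table, length_prob))
```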

Therefore, the latent-variable model involves two kinds of latent structures: (1) phrase alignment between source and target phrases, (2) word alignment between source and target words within phrases.

Given the two monolingual corpora $\mathcal{S}$ and $\mathcal{T}$, the training objective is to maximize the likelihood of the training data:

$$\hat{\boldsymbol{\theta}} = \operatorname*{argmax}_{\boldsymbol{\theta}} \Big\{ \mathcal{L}(\boldsymbol{\theta}) \Big\} \qquad (5)$$

where

$$\mathcal{L}(\boldsymbol{\theta}) = \log P(\mathcal{T} \mid \mathcal{S}; \boldsymbol{\theta}) \qquad (6)$$

Note that $\mathcal{D}$ is a small seed parallel lexicon for initializing training,¹ and $\delta(x, y)$ checks whether an entry $\langle x, y \rangle$ exists in $\mathcal{D}$.

¹ Due to the difficulty of learning translation correspondences from non-parallel corpora, many authors have assumed that a small seed lexicon is readily available [Gaussier et al.2004, Zhang and Zong2013, Vulić and Moens2013, Mikolov et al.2013, Dong et al.2015].

Given the monolingual corpora and the optimized model parameters $\hat{\boldsymbol{\theta}}$, the Viterbi phrase alignment $\hat{\mathbf{a}}$ is calculated as

$$\hat{\mathbf{a}} = \operatorname*{argmax}_{\mathbf{a}} \Big\{ P(\mathcal{T}, \mathbf{a} \mid \mathcal{S}; \hat{\boldsymbol{\theta}}) \Big\} \qquad (7)$$
$$\phantom{\hat{\mathbf{a}}} = \operatorname*{argmax}_{\mathbf{a}} \Big\{ \prod_{j=1}^{N} P(t^{(j)} \mid s^{(a_j)}; \hat{\boldsymbol{\theta}}) \Big\} \qquad (8)$$

Finally, parallel lexicons can be derived from the translation probability table of IBM Model 1, and parallel phrases can be collected from the Viterbi phrase alignment $\hat{\mathbf{a}}$. This process iterates and enlarges the parallel lexicons and phrases gradually over time.

As it is very challenging to extract parallel phrases from non-parallel corpora, unidirectional models might only capture partial aspects of translation modeling on such data. Indeed, Dong et al. (2015) find that the accuracy of phrase alignment is only around 50% on the Chinese-English dataset. More importantly, their approach seems vulnerable to noise, as the accuracy drops significantly with the increase of noise. As source-to-target and target-to-source translation models are usually complementary [Och and Ney2003, Koehn et al.2003, Liang et al.2006], it is appealing to combine them to improve alignment accuracy.

3 Approach

3.1 Agreement-based Learning

The basic idea of our work is to encourage the source-to-target and target-to-source translation models to agree on both phrase and word alignments.

For example, Figure 1 shows Chinese-to-English and English-to-Chinese phrase alignments on the same non-parallel data. As each model only captures partial aspects of translation modeling, our intuition is that the links on which the two models agree (highlighted in red) are more likely to be correct.

More formally, let $P(\mathcal{T} \mid \mathcal{S}; \overrightarrow{\boldsymbol{\theta}})$ be a source-to-target translation model and $P(\mathcal{S} \mid \mathcal{T}; \overleftarrow{\boldsymbol{\theta}})$ be a target-to-source model, where $\overrightarrow{\boldsymbol{\theta}}$ and $\overleftarrow{\boldsymbol{\theta}}$ are the corresponding model parameters. We use $\overrightarrow{\mathbf{a}} = \{\overrightarrow{a}_1, \dots, \overrightarrow{a}_N\}$ to denote the source-to-target phrase alignment, where $\overrightarrow{a}_j = m$ indicates that $t^{(j)}$ is aligned to $s^{(m)}$. Likewise, the target-to-source phrase alignment is denoted by $\overleftarrow{\mathbf{b}} = \{\overleftarrow{b}_1, \dots, \overleftarrow{b}_M\}$, where $\overleftarrow{b}_m = j$ indicates that $s^{(m)}$ is aligned to $t^{(j)}$.

To ease the comparison between $\overrightarrow{\mathbf{a}}$ and $\overleftarrow{\mathbf{b}}$, we equivalently represent them as sets of non-empty links:

$$\overrightarrow{\mathcal{A}} = \big\{ \langle \overrightarrow{a}_j, j \rangle \;\big|\; \overrightarrow{a}_j \neq 0,\; 1 \le j \le N \big\} \qquad (9)$$
$$\overleftarrow{\mathcal{A}} = \big\{ \langle m, \overleftarrow{b}_m \rangle \;\big|\; \overleftarrow{b}_m \neq 0,\; 1 \le m \le M \big\} \qquad (10)$$

For example, suppose the source-to-target and target-to-source phrase alignments are $\overrightarrow{\mathbf{a}} = \{\overrightarrow{a}_1 = 2, \overrightarrow{a}_2 = 0, \overrightarrow{a}_3 = 1\}$ and $\overleftarrow{\mathbf{b}} = \{\overleftarrow{b}_1 = 3, \overleftarrow{b}_2 = 1, \overleftarrow{b}_3 = 0\}$. The equivalent link sets are $\overrightarrow{\mathcal{A}} = \{\langle 2, 1 \rangle, \langle 1, 3 \rangle\}$ and $\overleftarrow{\mathcal{A}} = \{\langle 1, 3 \rangle, \langle 2, 1 \rangle\}$. Therefore, $\overrightarrow{\mathcal{A}}$ is said to be equal to $\overleftarrow{\mathcal{A}}$ (i.e., $\overrightarrow{\mathcal{A}} = \overleftarrow{\mathcal{A}}$).
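The link-set representation and the agreement check translate directly into code; the sketch below mirrors Eqs. (9) and (10) on the hypothetical alignments from the example above:

```python
def s2t_links(a):
    """Link set of a source-to-target alignment: a[j] = m (1-based indices, 0 = empty)."""
    return {(m, j + 1) for j, m in enumerate(a) if m != 0}

def t2s_links(b):
    """Link set of a target-to-source alignment: b[m] = j (1-based indices, 0 = empty)."""
    return {(m + 1, j) for m, j in enumerate(b) if j != 0}

# Hypothetical alignments over three source and three target phrases.
a = [2, 0, 1]   # t1 <- s2, t2 <- empty, t3 <- s1
b = [3, 1, 0]   # s1 -> t3, s2 -> t1, s3 -> empty
A_fwd, A_bwd = s2t_links(a), t2s_links(b)
print(A_fwd == A_bwd)   # True: the two models agree
print(A_fwd & A_bwd)    # shared links, used as the agreed alignment
```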

Following Liang et al. (2006), we introduce a new training objective that favors the agreement between two unidirectional models:

$$\big\{\hat{\overrightarrow{\boldsymbol{\theta}}}, \hat{\overleftarrow{\boldsymbol{\theta}}}\big\} = \operatorname*{argmax}_{\overrightarrow{\boldsymbol{\theta}}, \overleftarrow{\boldsymbol{\theta}}} \Big\{ \log P(\mathcal{T} \mid \mathcal{S}; \overrightarrow{\boldsymbol{\theta}}) + \log P(\mathcal{S} \mid \mathcal{T}; \overleftarrow{\boldsymbol{\theta}}) - \Delta(\mathcal{S}, \mathcal{T}, \overrightarrow{\boldsymbol{\theta}}, \overleftarrow{\boldsymbol{\theta}}) \Big\} \qquad (11)$$

where the posterior probabilities in two directions are defined as

$$P(\overrightarrow{\mathbf{a}} \mid \mathcal{S}, \mathcal{T}; \overrightarrow{\boldsymbol{\theta}}) = \frac{P(\mathcal{T}, \overrightarrow{\mathbf{a}} \mid \mathcal{S}; \overrightarrow{\boldsymbol{\theta}})}{P(\mathcal{T} \mid \mathcal{S}; \overrightarrow{\boldsymbol{\theta}})} \qquad (12)$$
$$P(\overleftarrow{\mathbf{b}} \mid \mathcal{S}, \mathcal{T}; \overleftarrow{\boldsymbol{\theta}}) = \frac{P(\mathcal{S}, \overleftarrow{\mathbf{b}} \mid \mathcal{T}; \overleftarrow{\boldsymbol{\theta}})}{P(\mathcal{S} \mid \mathcal{T}; \overleftarrow{\boldsymbol{\theta}})} \qquad (13)$$

The loss function $\Delta(\mathcal{S}, \mathcal{T}, \overrightarrow{\boldsymbol{\theta}}, \overleftarrow{\boldsymbol{\theta}})$ measures the disagreement between the two models.

3.2 Outer Agreement

1: procedure ViterbiEM($\mathcal{S}$, $\mathcal{T}$, $\mathcal{D}$)
2:     Initialize $\overrightarrow{\boldsymbol{\theta}}^{(0)}$ and $\overleftarrow{\boldsymbol{\theta}}^{(0)}$
3:     for all $k = 1, \dots, K$ do
4:         $\hat{\mathbf{a}}^{(k)} \leftarrow$ Align($\mathcal{S}$, $\mathcal{T}$, $\overrightarrow{\boldsymbol{\theta}}^{(k-1)}$, $\overleftarrow{\boldsymbol{\theta}}^{(k-1)}$)
5:         $\overrightarrow{\boldsymbol{\theta}}^{(k)}$, $\overleftarrow{\boldsymbol{\theta}}^{(k)}$ $\leftarrow$ Update($\mathcal{S}$, $\mathcal{T}$, $\hat{\mathbf{a}}^{(k)}$)
6:     end for
7:     return $\hat{\mathbf{a}}^{(K)}$, $\overrightarrow{\boldsymbol{\theta}}^{(K)}$, $\overleftarrow{\boldsymbol{\theta}}^{(K)}$
8: end procedure
Figure 2: A Viterbi EM algorithm for agreement-based learning of parallel lexicons and phrases from non-parallel corpora. $\mathcal{S}$ and $\mathcal{T}$ are non-parallel corpora, $\mathcal{D}$ is a seed parallel lexicon, $\overrightarrow{\boldsymbol{\theta}}^{(k)}$ and $\overleftarrow{\boldsymbol{\theta}}^{(k)}$ are the sets of model parameters at the $k$-th iteration, and $\hat{\mathbf{a}}^{(k)}$ is the Viterbi phrase alignment on which the two models agree at the $k$-th iteration.

3.2.1 Definition

A straightforward loss function is to force the two models to generate identical phrase alignments:

$$\Delta_{\mathrm{outer}}(\mathcal{S}, \mathcal{T}, \overrightarrow{\boldsymbol{\theta}}, \overleftarrow{\boldsymbol{\theta}}) = -\log \sum_{\overrightarrow{\mathbf{a}}} \sum_{\overleftarrow{\mathbf{b}}} P(\overrightarrow{\mathbf{a}} \mid \mathcal{S}, \mathcal{T}; \overrightarrow{\boldsymbol{\theta}})\, P(\overleftarrow{\mathbf{b}} \mid \mathcal{S}, \mathcal{T}; \overleftarrow{\boldsymbol{\theta}})\, \llbracket \overrightarrow{\mathcal{A}} = \overleftarrow{\mathcal{A}} \rrbracket \qquad (14)$$

where $\llbracket \cdot \rrbracket$ evaluates to 1 if the condition inside holds and to 0 otherwise.

We refer to Eq. (14) as outer agreement since it only considers phrase alignment and ignores the word alignment within aligned phrases.

3.2.2 Training Objective

Since the outer agreement forces two models to generate identical phrase alignments, the training objective can be written as

$$J(\overrightarrow{\boldsymbol{\theta}}, \overleftarrow{\boldsymbol{\theta}}) = \log \sum_{\mathbf{a}} P(\mathcal{T}, \mathbf{a} \mid \mathcal{S}; \overrightarrow{\boldsymbol{\theta}})\, P(\mathcal{S}, \mathbf{a} \mid \mathcal{T}; \overleftarrow{\boldsymbol{\theta}}) \qquad (15)$$

where $\mathbf{a}$ is a phrase alignment on which the two models agree.

The partial derivatives of the training objective with respect to source-to-target model parameters are given by

$$\frac{\partial J(\overrightarrow{\boldsymbol{\theta}}, \overleftarrow{\boldsymbol{\theta}})}{\partial \overrightarrow{\boldsymbol{\theta}}} = \mathbb{E}_{\mathbf{a} \mid \mathcal{S}, \mathcal{T}; \overrightarrow{\boldsymbol{\theta}}, \overleftarrow{\boldsymbol{\theta}}} \left[ \frac{\partial \log P(\mathcal{T}, \mathbf{a} \mid \mathcal{S}; \overrightarrow{\boldsymbol{\theta}})}{\partial \overrightarrow{\boldsymbol{\theta}}} \right] \qquad (16)$$

The partial derivatives with respect to $\overleftarrow{\boldsymbol{\theta}}$ are defined likewise.

3.2.3 Training Algorithm

As the expectation in Eq. (16) is usually intractable to calculate due to the exponential search space of phrase alignment, we follow Dong et al. (2015) and use a Viterbi EM algorithm instead.

As shown in Figure 2, the algorithm takes a set of source phrases $\mathcal{S}$, a set of target phrases $\mathcal{T}$, and a seed parallel lexicon $\mathcal{D}$ as input (line 1). After initializing the model parameters (line 2), the algorithm calls the procedure Align to compute the Viterbi phrase alignment between $\mathcal{S}$ and $\mathcal{T}$ on which the two models agree. Then, the algorithm updates the two models by normalizing counts collected from the Viterbi phrase alignment. The process iterates for $K$ iterations and returns the final Viterbi phrase alignment and model parameters.
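A minimal Python rendering of this loop is sketched below; the helpers align, update, and initialize stand in for the procedures of Sections 3.2.4 and 3.2.5 and for seed-lexicon initialization, and their interfaces are our assumptions rather than the authors' code:

```python
def initialize(seed_lexicon):
    """Uniform word translation probabilities over the seed lexicon (assumed scheme)."""
    p = 1.0 / max(len(seed_lexicon), 1)
    return {"t_table": {pair: p for pair in seed_lexicon}, "length_prob": {}}

def viterbi_em(src_phrases, tgt_phrases, seed_lexicon, align, update, num_iters=10):
    """Viterbi EM loop for agreement-based learning (sketch of Figure 2).

    align(src, tgt, fwd, bwd) -> agreed Viterbi phrase alignment (set of links)
    update(src, tgt, agreed)  -> new (fwd, bwd) model parameters
    """
    fwd = initialize(seed_lexicon)                           # source-to-target model
    bwd = initialize([(t, s) for (s, t) in seed_lexicon])    # target-to-source model
    agreed = set()
    for _ in range(num_iters):
        agreed = align(src_phrases, tgt_phrases, fwd, bwd)   # Viterbi (E-like) step
        fwd, bwd = update(src_phrases, tgt_phrases, agreed)  # count (M-like) step
    return agreed, fwd, bwd
```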

3.2.4 Computing Viterbi Phrase Alignments

The procedure Align computes the Viterbi phrase alignment between $\mathcal{S}$ and $\mathcal{T}$ on which the two models agree as follows:

$$\hat{\mathbf{a}} = \operatorname*{argmax}_{\mathbf{a}} \Big\{ P(\mathcal{T}, \mathbf{a} \mid \mathcal{S}; \overrightarrow{\boldsymbol{\theta}})\, P(\mathcal{S}, \mathbf{a} \mid \mathcal{T}; \overleftarrow{\boldsymbol{\theta}}) \Big\} \qquad (17)$$

Unfortunately, due to the exponential search space of phrase alignment, computing $\hat{\mathbf{a}}$ is also intractable. As a result, we approximate it as the intersection of the two unidirectional Viterbi phrase alignments:

$$\hat{\mathbf{a}} \approx \hat{\overrightarrow{\mathcal{A}}} \cap \hat{\overleftarrow{\mathcal{A}}} \qquad (18)$$

where the unidirectional Viterbi phrase alignments are calculated as

$$\hat{\overrightarrow{\mathbf{a}}} = \operatorname*{argmax}_{\overrightarrow{\mathbf{a}}} \Big\{ P(\mathcal{T}, \overrightarrow{\mathbf{a}} \mid \mathcal{S}; \overrightarrow{\boldsymbol{\theta}}) \Big\} \qquad (19)$$
$$\hat{\overleftarrow{\mathbf{b}}} = \operatorname*{argmax}_{\overleftarrow{\mathbf{b}}} \Big\{ P(\mathcal{S}, \overleftarrow{\mathbf{b}} \mid \mathcal{T}; \overleftarrow{\boldsymbol{\theta}}) \Big\} \qquad (20)$$

The source-to-target Viterbi phrase alignment is calculated as

$$\hat{\overrightarrow{\mathbf{a}}} = \operatorname*{argmax}_{\overrightarrow{\mathbf{a}}} \Big\{ P(\mathcal{T}, \overrightarrow{\mathbf{a}} \mid \mathcal{S}; \overrightarrow{\boldsymbol{\theta}}) \Big\} \qquad (21)$$
$$\phantom{\hat{\overrightarrow{\mathbf{a}}}} = \operatorname*{argmax}_{\overrightarrow{\mathbf{a}}} \Big\{ \prod_{j=1}^{N} P(t^{(j)} \mid s^{(\overrightarrow{a}_j)}; \overrightarrow{\boldsymbol{\theta}}) \Big\} \qquad (22)$$

Dong et al. (2015) indicate that the Viterbi alignment can be computed independently for each target phrase, so we only need to find the most probable source phrase for each target phrase:

$$\hat{\overrightarrow{a}}_j = \operatorname*{argmax}_{0 \le m \le M} \Big\{ P(t^{(j)} \mid s^{(m)}; \overrightarrow{\boldsymbol{\theta}}) \Big\} \qquad (23)$$

This can be cast as a translation retrieval problem [Zhang and Zong2013, Dong et al.2014]. Please refer to [Dong et al.2015] for more details. The target-to-source Viterbi phrase alignment can be calculated similarly.
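Conceptually, Eq. (23) is a per-target-phrase search for the best source phrase; the paper casts this as a translation retrieval problem for efficiency, and the brute-force sketch below only illustrates the selection rule, with score_fn and p_empty as assumed placeholders for the phrase translation model and the empty-phrase probability:

```python
def viterbi_s2t_alignment(src_phrases, tgt_phrases, score_fn, p_empty=1e-4):
    """For each target phrase, pick the most probable source phrase (sketch of Eq. 23).

    score_fn(src_words, tgt_words) -> phrase translation probability
    Index 0 denotes the empty source phrase, scored by the constant p_empty.
    """
    alignment = []
    for tgt in tgt_phrases:
        best_m, best_score = 0, p_empty
        for m, src in enumerate(src_phrases, start=1):
            score = score_fn(src.split(), tgt.split())
            if score > best_score:
                best_m, best_score = m, score
        alignment.append(best_m)
    return alignment
```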

3.2.5 Updating Model Parameters

Following Liang et al. (2006), we collect counts of model parameters only from the agreement term.²

² We experimented with collecting counts from both the unidirectional and agreement terms but obtained much worse results than counting only from the agreement term.

Given the agreed Viterbi phrase alignment $\hat{\mathbf{a}}$, the count of the source-to-target length model is given by

$$c(J \mid I; \hat{\mathbf{a}}) = \sum_{\langle m, j \rangle \in \hat{\mathbf{a}}} \llbracket |s^{(m)}| = I \rrbracket\, \llbracket |t^{(j)}| = J \rrbracket \qquad (24)$$

The new length probabilities can be obtained by

$$P(J \mid I) = \frac{c(J \mid I; \hat{\mathbf{a}})}{\sum_{J'} c(J' \mid I; \hat{\mathbf{a}})} \qquad (25)$$

The count of the source-to-target translation model is given by

$$c(t \mid s; \hat{\mathbf{a}}) = \sum_{\langle m, j \rangle \in \hat{\mathbf{a}}} \sum_{k=1}^{|t^{(j)}|} \sum_{i=0}^{|s^{(m)}|} \frac{P(t_k \mid s_i)}{\sum_{i'=0}^{|s^{(m)}|} P(t_k \mid s_{i'})}\, \llbracket t_k = t \rrbracket\, \llbracket s_i = s \rrbracket \qquad (26)$$

The new translation probabilities can be obtained by

$$P(t \mid s) = \frac{c(t \mid s; \hat{\mathbf{a}})}{\sum_{t'} c(t' \mid s; \hat{\mathbf{a}})} \qquad (27)$$

Counts of target-to-source length and translation models can be calculated in a similar way.
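The count-and-normalize step can be sketched as standard IBM Model 1 re-estimation restricted to the agreed phrase pairs; this is our reading of Eqs. (26) and (27), not the authors' exact code:

```python
from collections import defaultdict

def update_translation_probs(agreed_pairs, t_table):
    """Re-estimate P(f | e) from agreed phrase pairs (IBM Model 1 style sketch).

    agreed_pairs : list of (src_words, tgt_words) from the agreed Viterbi alignment
    t_table      : current word translation probabilities {(f, e): P(f | e)}
    """
    counts = defaultdict(float)
    for src_words, tgt_words in agreed_pairs:
        src_with_null = [None] + src_words
        for f in tgt_words:
            denom = sum(t_table.get((f, e), 1e-9) for e in src_with_null)
            for e in src_with_null:
                # Fractional count of the word pair (f, e) within this phrase pair.
                counts[(f, e)] += t_table.get((f, e), 1e-9) / denom
    # Normalize per source word to obtain new probabilities (cf. Eq. 27).
    totals = defaultdict(float)
    for (f, e), c in counts.items():
        totals[e] += c
    return {(f, e): c / totals[e] for (f, e), c in counts.items()}
```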

3.3 Inner Agreement

Figure 3: Agreement between (a) Chinese-to-English and (b) English-to-Chinese word alignments. The links on which two models agree are highlighted in red. The inner agreement loss function (see Eq. (28)) aims to encourage the agreement at both the phrase and word levels.

3.3.1 Definition

As the outer agreement only considers the phrase alignment, the inner agreement takes both phrase alignment and word alignment into consideration:

$$\Delta_{\mathrm{inner}}(\mathcal{S}, \mathcal{T}, \overrightarrow{\boldsymbol{\theta}}, \overleftarrow{\boldsymbol{\theta}}) = -\log \sum_{\overrightarrow{\mathbf{a}}} \sum_{\overleftarrow{\mathbf{b}}} P(\overrightarrow{\mathbf{a}} \mid \mathcal{S}, \mathcal{T}; \overrightarrow{\boldsymbol{\theta}})\, P(\overleftarrow{\mathbf{b}} \mid \mathcal{S}, \mathcal{T}; \overleftarrow{\boldsymbol{\theta}})\, \llbracket \overrightarrow{\mathcal{A}} = \overleftarrow{\mathcal{A}} \rrbracket \prod_{\langle m, j \rangle \in \overrightarrow{\mathcal{A}}} A\big(s^{(m)}, t^{(j)}\big) \qquad (28)$$

where $A(s^{(m)}, t^{(j)})$ measures the agreement between the word alignments the two models produce within an aligned phrase pair.

For example, Figure 3 shows two examples of Chinese-to-English and English-to-Chinese word alignments. The shared links are highlighted in red. Our intuition is that a source phrase and a target phrase are more likely to be translations of each other if the two translation models also agree on word alignment within aligned phrases.

3.3.2 Training Objective and Algorithm

The training objective for inner agreement is given by

$$\big\{\hat{\overrightarrow{\boldsymbol{\theta}}}, \hat{\overleftarrow{\boldsymbol{\theta}}}\big\} = \operatorname*{argmax}_{\overrightarrow{\boldsymbol{\theta}}, \overleftarrow{\boldsymbol{\theta}}} \Big\{ \log P(\mathcal{T} \mid \mathcal{S}; \overrightarrow{\boldsymbol{\theta}}) + \log P(\mathcal{S} \mid \mathcal{T}; \overleftarrow{\boldsymbol{\theta}}) - \Delta_{\mathrm{inner}}(\mathcal{S}, \mathcal{T}, \overrightarrow{\boldsymbol{\theta}}, \overleftarrow{\boldsymbol{\theta}}) \Big\} \qquad (29)$$

We still use the Viterbi EM algorithm as shown in Figure 2 for training the two models.

3.3.3 Computing Viterbi Phrase Alignments

The agreed Viterbi phrase alignment is defined as

$$\hat{\mathbf{a}} = \operatorname*{argmax}_{\mathbf{a}} \Big\{ P(\mathcal{T}, \mathbf{a} \mid \mathcal{S}; \overrightarrow{\boldsymbol{\theta}})\, P(\mathcal{S}, \mathbf{a} \mid \mathcal{T}; \overleftarrow{\boldsymbol{\theta}}) \prod_{\langle m, j \rangle \in \mathcal{A}(\mathbf{a})} A\big(s^{(m)}, t^{(j)}\big) \Big\} \qquad (30)$$

where $\mathcal{A}(\mathbf{a})$ denotes the set of non-empty links in $\mathbf{a}$.

As computing $\hat{\mathbf{a}}$ is intractable, we still approximate it using the intersection of the two unidirectional Viterbi phrase alignments (see Eq. (18)). The source-to-target Viterbi phrase alignment is calculated as

$$\hat{\overrightarrow{a}}_j = \operatorname*{argmax}_{0 \le m \le M} \Big\{ P(t^{(j)} \mid s^{(m)}; \overrightarrow{\boldsymbol{\theta}}) \prod_{i=1}^{|s^{(m)}|} \prod_{k=1}^{|t^{(j)}|} \Big( \overrightarrow{\gamma}_{i,k}\, \overleftarrow{\gamma}_{i,k} + (1 - \overrightarrow{\gamma}_{i,k})(1 - \overleftarrow{\gamma}_{i,k}) \Big) \Big\} \qquad (31)$$

where $\overrightarrow{\gamma}_{i,k}$ is the source-to-target link posterior probability of the link between $s_i$ and $t_k$ being present (or absent) in the word alignment according to the source-to-target model, and $\overleftarrow{\gamma}_{i,k}$ is the corresponding target-to-source link posterior probability. We follow Liang et al. (2006) and use the product of link posteriors to encourage agreement at the level of word alignment.

We use a coarse-to-fine approach [Dong et al.2015] to compute the Viterbi alignment: first retrieving a coarse set of candidate source phrases using translation probabilities and then selecting the candidate with the highest score according to Eq. (31). The target-to-source Viterbi phrase alignment can be calculated similarly.
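Under IBM Model 1 the link posteriors have a simple closed form; the sketch below computes them in both directions and combines them by their product, a simplified reading of the inner-agreement scoring rather than the exact formula:

```python
def model1_link_posteriors(src_words, tgt_words, t_table):
    """Posterior of a link between source position i and target position k (IBM Model 1)."""
    post = {}
    for k, f in enumerate(tgt_words):
        denom = sum(t_table.get((f, e), 1e-9) for e in [None] + src_words)
        for i, e in enumerate(src_words):
            post[(i, k)] = t_table.get((f, e), 1e-9) / denom
    return post

def posterior_product_score(src_words, tgt_words, fwd_table, bwd_table):
    """Word-level agreement score as a product of forward and backward link posteriors.

    fwd_table maps (target word, source word) to P(t | s);
    bwd_table maps (source word, target word) to P(s | t).
    """
    fwd = model1_link_posteriors(src_words, tgt_words, fwd_table)
    # The backward model translates target into source, so swap the roles and re-key.
    bwd_raw = model1_link_posteriors(tgt_words, src_words, bwd_table)
    bwd = {(i, k): p for (k, i), p in bwd_raw.items()}
    score = 1.0
    for link, p in fwd.items():
        score *= p * bwd.get(link, 1e-9)
    return score
```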

3.3.4 Updating Model Parameters

Given the agreed Viterbi phrase alignment $\hat{\mathbf{a}}$, the count of the source-to-target length model is still given by Eq. (24). The count of the translation model is calculated as

$$c(t \mid s; \hat{\mathbf{a}}) = \sum_{\langle m, j \rangle \in \hat{\mathbf{a}}} \sum_{k=1}^{|t^{(j)}|} \sum_{i=1}^{|s^{(m)}|} \overrightarrow{\gamma}_{i,k}\, \overleftarrow{\gamma}_{i,k}\, \llbracket t_k = t \rrbracket\, \llbracket s_i = s \rrbracket \qquad (32)$$

Counts of target-to-source length and translation models can be calculated in a similar way.

4 Experiments

In this section, we evaluate our approach in two tasks: phrase alignment (Section 4.1) and machine translation (Section 4.2).

4.1 Alignment Evaluation

4.1.1 Evaluation Metrics

Given two monolingual corpora $\mathcal{S}$ and $\mathcal{T}$, we suppose there exists a ground-truth parallel corpus $\mathcal{P}$ and denote an extracted parallel corpus as $\hat{\mathcal{P}}$. The quality of an extracted parallel corpus can be measured by the F1 score of $\hat{\mathcal{P}}$ against $\mathcal{P}$.
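Concretely, the metric can be computed as the F1 score between the extracted and ground-truth sets of phrase pairs, as in the following sketch (variable names are ours):

```python
def phrase_pair_f1(extracted, gold):
    """F1 score between an extracted and a ground-truth set of phrase pairs."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    if correct == 0:
        return 0.0
    precision = correct / len(extracted)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)
```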

4.1.2 Data Preparation

Although it is appealing to apply our approach to real-world non-parallel corpora, it is time-consuming and labor-intensive to manually construct a ground-truth parallel corpus. Therefore, we follow Dong et al. (2015) and build synthetic $\mathcal{S}$, $\mathcal{T}$, and $\mathcal{P}$ to facilitate the evaluation.

We first extract a set of parallel phrases from a sentence-level parallel corpus using the state-of-the-art phrase-based translation system Moses [Koehn et al.2007] and discard low-probability parallel phrases. Then, $\mathcal{S}$ and $\mathcal{T}$ can be constructed by corrupting the parallel phrase set, adding irrelevant source and target phrases randomly. Note that the parallel phrase set can serve as the ground-truth parallel corpus $\mathcal{P}$. We refer to the non-parallel phrases in $\mathcal{S}$ and $\mathcal{T}$ as noise.
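A sketch of this construction procedure (illustrative only; inputs and sizes are placeholders):

```python
import random

def build_noisy_corpora(parallel_pairs, noise_src, noise_tgt, seed=0):
    """Mix a parallel phrase set with irrelevant phrases to form non-parallel corpora.

    parallel_pairs : list of (src_phrase, tgt_phrase) serving as the ground truth
    noise_src/tgt  : irrelevant monolingual phrases added as noise
    """
    rng = random.Random(seed)
    src = [s for s, _ in parallel_pairs] + list(noise_src)
    tgt = [t for _, t in parallel_pairs] + list(noise_tgt)
    rng.shuffle(src)
    rng.shuffle(tgt)
    return src, tgt, set(parallel_pairs)   # the pair set doubles as the gold corpus
```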

From LDC Chinese-English parallel corpora, we constructed a development set and a test set. The development set contains 20K parallel phrases, 20K noisy Chinese phrases, and 20K noisy English phrases. The test set contains 20K parallel phrases, 180K noisy Chinese phrases, and 180K noisy English phrases. The seed parallel lexicon contains 1K entries.

Figure 4: Comparison of agreement ratios on the development set.
seed     C→E     E→C     Outer   Inner
50       4.1     4.8     60.8    66.2
100      5.1     5.5     65.6    69.8
500      7.5     8.4     70.4    72.5
1,000    22.4    23.1    73.6    74.3
Table 1: Effect of seed lexicon size in terms of F1 on the development set.

4.1.3 Comparison of Agreement Ratios

We introduce an agreement ratio to measure to what extent the two unidirectional models agree on phrase alignment:

$$r = \frac{2\,\big|\hat{\overrightarrow{\mathcal{A}}} \cap \hat{\overleftarrow{\mathcal{A}}}\big|}{\big|\hat{\overrightarrow{\mathcal{A}}}\big| + \big|\hat{\overleftarrow{\mathcal{A}}}\big|} \qquad (33)$$

Figure 4 shows the agreement ratios of independent training (“no agreement”), joint training with the outer agreement (“outer”), and joint training with the inner agreement (“inner”). We find that independently trained unidirectional models hardly agree on phrase alignment, suggesting that each model can only capture partial aspects of translation modeling on non-parallel corpora. In contrast, imposing the agreement term significantly increases the agreement ratios: after 10 iterations, about 40% of phrase alignment links are shared by two models.

4.1.4 Effect of Seed Lexicon Size

Table 1 shows the F1 scores of the Chinese-to-English model ("C→E"), the English-to-Chinese model ("E→C"), joint learning based on the outer agreement ("Outer"), and joint learning based on the inner agreement ("Inner") over various sizes of seed lexicons on the development set.

We find that agreement-based learning obtains substantial improvements over independent learning across all sizes. More importantly, even with a seed lexicon containing only 50 entries, agreement-based learning is able to achieve F1 scores above 60%. The inner agreement performs better than the outer agreement by taking the consensus at the word level into account.

Noise (C)   Noise (E)   C→E     E→C     Outer   Inner
0           0           58.5    61.2    86.5    86.1
0           10K         41.0    54.4    83.6    83.8
0           20K         28.3    48.3    80.1    81.2
10K         0           54.7    43.1    84.9    84.3
20K         0           50.4    31.4    83.8    83.6
10K         10K         34.9    34.4    80.0    79.7
20K         20K         22.4    23.1    73.6    74.3
Table 2: Effect of noise in terms of F1 on the development set.

4.1.5 Effect of Noise

Table 2 demonstrates the effect of noise on the development set. In row 1, "0 + 0" denotes that there is no noise, which can be seen as an upper bound. Adding noise, either on the Chinese side or on the English side, deteriorates the F1 scores for all methods. Adding noise on the English side makes predicting phrase alignment in the C→E direction more challenging due to the enlarged search space. The situation is similar in the reverse direction. It is clear that agreement-based learning is more robust to noise: while independent training suffers from a reduction of 40% in terms of F1 for the "20K + 20K" setting, agreement-based learning still achieves F1 scores over 70%.

4.1.6 Results

Figure 5: Comparison of F1 scores on the test set.

Figure 5 gives the final results on the test set. We find that agreement-based training achieves significant improvements over independent training. By considering the consensus on both phrase and word alignments, the inner agreement significantly outperforms the outer agreement. Notice that Dong et al. (2015) only add noise on one side, while we add noisy phrases on both sides, which makes phrase alignment more challenging.

Chinese                              English
jingji                               economy
jialebi                              caribbean
zhengzhi huanjing                    political environment
jiaoyisuo shichang jiage zhishu      exchange market price index
qianding bianjing maoyi xieding      signed border trade agreements
Table 3: Example learned parallel lexicons and phrases. New words that are not included in the seed lexicon are highlighted in italic.

Table 3 shows example learned parallel words and phrases. The lexicon is built from the translation table by retaining high-probability word pairs. Therefore, our approach is capable of learning both new words and new phrases unseen in the seed lexicon.
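For instance, a simple threshold-based extraction over the learned translation table might look as follows; the threshold value is an illustrative assumption:

```python
def extract_lexicon(t_table, threshold=0.5):
    """Keep high-probability word pairs from the translation table as lexicon entries."""
    return {(f, e): p for (f, e), p in t_table.items()
            if e is not None and p >= threshold}
```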

4.2 Translation Evaluation

Iteration   Corpus size (E→C / C→E / Outer / Inner)   BLEU (E→C / C→E / Outer / Inner)
0           10k (out-domain corpus only)              5.61
1           145k / 162k / 59k / 73k                   8.65 / 8.90 / 13.53 / 13.74
2           195k / 215k / 69k / 101k                  8.82 / 9.47 / 15.26 / 15.61
3           209k / 231k / 88k / 132k                  8.42 / 9.29 / 16.88 / 16.94
4           214k / 238k / 106k / 159k                 8.46 / 9.27 / 17.15 / 17.83
5           217k / 241k / 123k / 181k                 8.87 / 9.40 / 17.94 / 18.89
6           219k / 243k / 137k / 197k                 8.52 / 9.30 / 18.56 / 19.47
7           222k / 247k / 140k / 207k                 8.81 / 9.22 / 18.72 / 19.46
8           224k / 249k / 153k / 220k                 8.71 / 9.26 / 18.84 / 19.50
9           227k / 251k / 159k / 233k                 8.92 / 9.35 / 19.05 / 19.63
10          229k / 254k / 163k / 239k                 8.33 / 9.06 / 19.39 / 19.78
Table 4: Results on domain adaptation for machine translation.

Following Zhang and Zong (2013) and Dong et al. (2015), we evaluate our approach on domain adaptation for machine translation.

The dataset consists of two in-domain non-parallel corpora and an out-domain parallel corpus. The in-domain non-parallel corpora consist of 2.65M Chinese phrases and 3.67M English phrases extracted from LDC news articles. We use a small out-domain parallel corpus extracted from financial news of FTChina, which contains 10K phrase pairs. The task is to extract a parallel corpus from the in-domain non-parallel corpora, starting from the small out-domain parallel corpus.

We use the state-of-the-art translation system Moses [Koehn et al.2007] and evaluate the performance on the Chinese-English NIST datasets. The development set is NIST 2006 and the test set is NIST 2005. The evaluation metric is case-insensitive BLEU4 [Papineni et al.2002]. We use the SRILM toolkit [Stolcke2002] to train a 4-gram English language model on a monolingual corpus with 399M English words.

Table 4 shows the results. At iteration 0, only the out-domain corpus is used and the BLEU score is 5.61. All methods iteratively extract parallel phrases from the non-parallel corpora and enlarge the extracted parallel corpus. We find that agreement-based learning achieves much higher BLEU scores while producing a smaller extracted parallel corpus than independent learning. One possible reason is that agreement-based learning rules out most unlikely phrase pairs by encouraging consensus between the two models.

5 Conclusion

We have presented agreement-based training for learning parallel lexicons and phrases from non-parallel corpora. By modeling the agreement on both phrase alignment and word alignment, our approach achieves significant improvements in both alignment and translation evaluations.

In the future, we plan to apply our approach to real-world non-parallel corpora to further verify its effectiveness. It is also interesting to extend the phrase translation model to more sophisticated models such as IBM models 2-5 [Brown et al.1993] and HMM [Vogel and Ney1996].

Acknowledgments

We sincerely thank the reviewers for their valuable suggestions. We also thank Meng Zhang, Yankai Lin, Shiqi Shen and Meiping Dong for their insightful discussions. Yang Liu is supported by the National Natural Science Foundation of China (No. 61522204), the 863 Program (2015AA011808), and Samsung R&D Institute of China. Huanbo Luan is supported by the National Natural Science Foundation of China (No. 61303075). Maosong Sun is supported by the Major Project of the National Social Science Foundation of China (13&ZD190).

References

  • [Brown et al.1993] Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics.
  • [Cettolo et al.2010] Mauro Cettolo, Marcello Federico, and Nicola Bertoldi. 2010. Mining parallel fragments from comparable texts. In Proceedings of IWSLT.
  • [Dong et al.2014] Meiping Dong, Yong Cheng, Yang Liu, Jia Xu, Maosong Sun, Tatsuya Izuha, and Jie Hao. 2014. Query lattice for translation retrieval. In Proceedings of COLING.
  • [Dong et al.2015] Meiping Dong, Yang Liu, Huanbo Luan, Maosong Sun, Tatsuya Izuha, and Dakun Zhang. 2015. Iterative learning of parallel lexicons and phrases from non-parallel corpora. In Proceedings of IJCAI.
  • [Dou et al.2014] Qing Dou, Ashish Vaswani, and Kevin Knight. 2014. Beyond parallel data: Joint word alignment and decipherment improves machine translation. In Proceedings of EMNLP.
  • [Fung and Cheung2004] Pascale Fung and Percy Cheung. 2004. Mining very-non-parallel corpora: Parallel sentence and lexicon extraction via bootstrapping and em. In Proceedings of EMNLP.
  • [Gaussier et al.2004] Eric Gaussier, J.M. Renders, I. Matveeva, C. Goutte, and H. Dejean. 2004. A geometric view on bilingual lexicon extraction from comparable corpora. In Proceedings of ACL.
  • [Haghighi et al.2008] Aria Haghighi, Percy Liang, Taylor Berg-Kirkpatrick, and Dan Klein. 2008. Learning bilingual lexicons from monolingual corpora. In Proceedings of ACL.
  • [Koehn et al.2003] Philipp Koehn, Franz J. Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of NAACL.
  • [Koehn et al.2007] Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions.
  • [Kupiec1993] Julian Kupiec. 1993. An algorithm for finding noun phrase correspondences in bilingual corpora. In Proceedings of ACL.
  • [Liang et al.2006] Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by agreement. In Proceedings of NAACL.
  • [Liang et al.2008] Percy Liang, Dan Klein, and Michael I. Jordan. 2008. Agreement-based learning. In Proceedings of NIPS.
  • [Marcu and Wong2002] Daniel Marcu and Daniel Wong. 2002. A phrase-based joint probability model for statistical machine translation. In Proceedings of EMNLP.
  • [Melamed1997] I. Dan Melamed. 1997. Automatic discovery of non-compositional compounds in parallel data. In Proceedings of EMNLP.
  • [Mikolov et al.2013] Tomas Mikolov, Quoc V Le, and Ilya Sutskever. 2013. Exploiting similarities among languages for machine translation. arXiv:1309.4168.
  • [Munteanu and Marcu2006] Dragos Stefan Munteanu and Daniel Marcu. 2006. Extracting parallel sub-sentential fragments from non-parallel corpora. In Proceedings of ACL.
  • [Nuhn et al.2012] Malte Nuhn, Arne Mauser, and Hermann Ney. 2012. Deciphering foreign language by combining language models and context vectors. In Proceedings of ACL.
  • [Och and Ney2003] Franz J. Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics.
  • [Papineni et al.2002] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of ACL.
  • [Ravi and Knight2011] Sujith Ravi and Kevin Knight. 2011. Deciphering foreign language. In Proceedings of ACL.
  • [Smadja and McKeown1994] Frank Smadja and Kathleen McKeown. 1994. Translating collocations for use in bilingual lexicons. In Proceedings of the ARPA Human Language Technology Workshop.
  • [Stolcke2002] Andreas Stolcke. 2002. Srilm - an extensible language modeling toolkit. In Proceedings of ICSLP.
  • [Vogel and Ney1996] Stephan Vogel and Hermann Ney. 1996. HMM-based word alignment in statistical translation. In Proceedings of COLING.
  • [Vulić and Moens2013] Ivan Vulić and Marie-Francine Moens. 2013. A study on bootstrapping bilingual vector spaces from non-parallel data (and nothing else). In Proceedings of EMNLP.
  • [Vulić and Moens2015] Ivan Vulić and Marie-Francine Moens. 2015. Bilingual word embeddings from non-parallel document-aligned data applied to bilingual lexicon induction. In Proceedings of ACL.
  • [Wu and Xia1994] Dekai Wu and Xuanyin Xia. 1994. Learning an english-chinese lexicon from a parallel corpus. In Proceedings of the ARPA Human Language Technology Workshop.
  • [Zhang and Zong2013] Jiajun Zhang and Chengqing Zong. 2013. Learning a phrase-based translation model from monolingual data with application to domain adaptation. In Proceedings of ACL.