Bilingual word embeddings are the essential components of multilingual NLP systems. These embeddings capture cross-lingual semantic transfers of words and phrases from bilingual corpora, and are widely deployed in many NLP tasks, such as machine translation Conneau et al. (2018), cross-lingual Wikification Tsai and Roth (2016), knowledge alignment Chen et al. (2018) and semantic search Vulić and Moens (2015).
A variety of approaches have been proposed to learn bilingual word embeddings Duong et al. (2017); Luong et al. (2015); Coulmance et al. (2015). Many such approaches rely on the use of aligned corpora. Such corpora could be seed lexicons that provide word-level mappings between two languages Mikolov et al. (2013a); Xing et al. (2015), or parallel corpora that align sentences and documents Klementiev et al. (2012); Gouws et al. (2015). However, these methods suffer from several critical deficiencies. First, seed-lexicon-based approaches are often hindered by the limited availability of seeds Vulić and Korhonen (2016), which is a hard barrier to overcome since high-quality seed lexicons require extensive human effort to obtain Zhang et al. (2017). Second, parallel corpora provide only coarse alignment, which often fails to capture the fine-grained semantic transfers of lexicons accurately Ruder et al. (2017).
Unlike the existing methods, we propose to use publicly available dictionaries for bilingual word embedding learning. (We use "dictionary" in its regular meaning, i.e., a collection of word definitions. This differs from some papers that refer to seed lexicons as dictionaries.) Dictionaries, such as Wiktionary and Merriam-Webster, contain large collections of lexical definitions, which are clean linguistic knowledge that naturally connects word semantics within and across human languages. Hence, dictionaries provide valuable information to bridge the lexicons in different languages. However, cross-lingual learning from lexical definitions is a non-trivial task. A straightforward approach based on aligning the target word embedding to the aggregated embedding of words in the definition might work, but not all words in a definition are semantically related to the defined target word (Fig. 1(a)). Therefore, a successful model has to effectively identify the most related lexicons from the multi-granular and asymmetric alignment of lexical definitions. Besides, how to leverage both bilingual and monolingual dictionaries for cross-lingual learning is another challenge.
In this paper, we propose BilLex (Bilingual Word Embeddings Based on Lexical Definitions) to learn bilingual word embeddings. BilLex comprises a carefully designed two-stage mechanism to automatically cultivate, propagate and leverage lexicon pairs of high semantic similarity from lexical definitions in dictionaries. It first extracts bilingual strong word pairs from bilingual lexical definitions, i.e., pairs of words that contribute to the cross-lingual definitions of each other. On top of that, our model automatically derives induced word pairs, which combine monolingual dictionaries with the aforementioned strong pairs to uncover further semantically related word pairs. This automated word pair induction process enables BilLex to capture abundant high-quality lexical alignment information, based on which the cross-lingual semantic transfer of words is easily captured in a shared embedding space. Experimental results on word-level and sentence-level translation tasks show that BilLex drastically outperforms various baselines that are trained on parallel or seed-lexicon corpora, as well as state-of-the-art unsupervised methods.
2 Related Work
Prior approaches to learning bilingual word embeddings often rely on word or sentence alignment Ruder et al. (2017). In particular, seed lexicon methods Mikolov et al. (2013a); Faruqui and Dyer (2014); Guo et al. (2015) learn transformations across different language-specific embedding spaces based on predefined word alignment. The performance of these approaches is limited by the sufficiency of seed lexicons. Besides, parallel corpora methods Gouws et al. (2015); Coulmance et al. (2015) leverage the aligned sentences in different languages and force the representations of corresponding sentence components to be similar. However, aligned sentences merely provide weak alignment of lexicons, which does not accurately capture the one-to-one mapping of words, while such a mapping is highly desirable for translation tasks Upadhyay et al. (2016). In addition, a few unsupervised approaches reduce the reliance on bilingual resources Chen and Cardie (2018); Conneau et al. (2018). These models require considerable effort to train and rely heavily on massive monolingual corpora.
Monolingual lexical definitions have been used for weak supervision of monolingual word similarities Tissier et al. (2017). Our work demonstrates that dictionary information can be extended to a cross-lingual scenario, for which we develop a simple yet effective induction method to populate fine-grained word alignment.
We first provide the formal definition of bilingual dictionaries. Let $\mathcal{L}$ be the set of languages and $\mathcal{L}^2$ be the set of ordered language pairs. For a language $l \in \mathcal{L}$, we use $V_l$ to denote its vocabulary, where for each word $w \in V_l$, $\mathbf{w} \in \mathbb{R}^k$ denotes its embedding vector. A dictionary denoted as $D(l_i, l_j)$ contains words in language $l_i$ and their definitions in $l_j$. In particular, $D(l_i, l_j)$ is a monolingual dictionary if $l_i = l_j$ and is a bilingual dictionary if $l_i \neq l_j$. A dictionary $D(l_i, l_j)$ contains dictionary entries $(w^i, Q^j(w^i))$, where $w^i \in V_{l_i}$ is the word being defined and $Q^j(w^i)$ is a sequence of words in $l_j$ describing the meaning of the word $w^i$. Fig. 1(a) shows an entry from an English-French dictionary, and one from a French-English dictionary.
BilLex allows us to exploit semantically related word pairs in two stages. We first use bilingual dictionaries to construct bilingual strong pairs, which are similar to those monolingual word pairs in Tissier et al. (2017). Then based on the given strong word pairs and monolingual dictionaries, we provide two types of induced word pairs to further enhance the cross-lingual learning.
3.1 Bilingual Strong Pairs
A bilingual strong pair contains two words with high semantic relevance. Such a pair of words that mutually contribute to the cross-lingual definitions of each other is defined as below.
Definition (Bilingual Strong Pairs). $P^S_{ij}$ is the set of bilingual strong pairs in $(l_i, l_j) \in \mathcal{L}^2$ ($l_i \neq l_j$), where each word pair $(w^i, w^j)$ is defined as:
$$(w^i, w^j) \in P^S_{ij} \Leftrightarrow w^i \in Q^i(w^j) \wedge w^j \in Q^j(w^i)$$
Intuitively, if $w^i$ appears in the cross-lingual definition $Q^i(w^j)$ of $w^j$ and $w^j$ appears in the cross-lingual definition $Q^j(w^i)$ of $w^i$, then $w^i$ and $w^j$ should be semantically close to each other. In particular, $P^S_{ij}$ denotes monolingual strong pairs if $l_i = l_j$. For instance, (car, véhicule) depicted in Fig. 1(a) form a bilingual strong pair. Note that Tissier et al. also introduce monolingual weak pairs by pairing the target word with the other words from its definition that do not form strong pairs with it. However, we do not extend such weak pairs to the bilingual setting, as we find that they do not accurately represent cross-lingual word correspondences.
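The mutual-definition test above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `strong_pairs`, the toy entries, and the choice to represent a dictionary as a mapping from a headword to the set of words in its definition are all assumptions made for the example.

```python
def strong_pairs(dict_ij, dict_ji):
    """Return pairs (w_i, w_j) in which each word occurs in the other's definition.

    dict_ij maps a word of language i to the set of language-j words in its
    definition; dict_ji is the dictionary for the reverse direction.
    """
    pairs = set()
    for w_i, definition_in_j in dict_ij.items():
        for w_j in definition_in_j:
            # w_j helps define w_i; keep the pair only if w_i also helps define w_j.
            if w_i in dict_ji.get(w_j, set()):
                pairs.add((w_i, w_j))
    return pairs


# Hypothetical toy entries, with stop words already removed.
en_fr = {"car": {"véhicule", "roues", "moteur"}}
fr_en = {
    "véhicule": {"car", "machine", "transport"},
    "moteur": {"engine", "machine"},
}

print(sorted(strong_pairs(en_fr, fr_en)))  # [('car', 'véhicule')]
```

Passing the same monolingual dictionary for both arguments yields monolingual strong pairs under the same test.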
3.2 Induced Word Pairs
Since bilingual lexical definitions cover only limited numbers of words in two languages, we incorporate both monolingual and bilingual strong pairs, from which we induce two types of word pairs with different confidence: directly induced pairs and indirectly induced pairs.
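Definition (Bilingual Directly Induced Pairs). $P^D_{ij}$ is the set of bilingual directly induced pairs in $(l_i, l_j)$, where each word pair $(w^i, w^j)$ is defined as:
$$(w^i, w^j) \in P^D_{ij} \Leftrightarrow \big(\exists\, w^i_p:\ (w^i, w^i_p) \in P^S_{ii} \wedge (w^i_p, w^j) \in P^S_{ij}\big) \vee \big(\exists\, w^j_p:\ (w^i, w^j_p) \in P^S_{ij} \wedge (w^j_p, w^j) \in P^S_{jj}\big)$$
Intuitively, a bilingual directly induced pair $(w^i, w^j)$ indicates that we can find a pivot word that forms a monolingual strong pair (in $P^S_{ii}$ or $P^S_{jj}$) with one word from $(w^i, w^j)$ and a bilingual strong pair (in $P^S_{ij}$) with the other.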
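Definition (Bilingual Indirectly Induced Pairs). $P^I_{ij}$ is the set of bilingual indirectly induced pairs in $(l_i, l_j)$, where each word pair $(w^i, w^j)$ is defined as:
$$(w^i, w^j) \in P^I_{ij} \Leftrightarrow \exists\, (w^i_p, w^j_p) \in P^S_{ij}:\ (w^i, w^i_p) \in P^S_{ii} \wedge (w^j, w^j_p) \in P^S_{jj}$$
A bilingual indirectly induced pair $(w^i, w^j)$ indicates that there exists a pivot bilingual strong pair $(w^i_p, w^j_p)$, such that $w^i_p$ forms a monolingual strong pair with $w^i$ and $w^j_p$ forms a monolingual strong pair with $w^j$. Fig. 1(b-c) shows examples of the two types of induced word pairs.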
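The two induction rules can be illustrated with a short Python sketch. It is not the authors' code: the function names, the toy word pairs, and the symmetric neighbour index are assumptions chosen for the example, and the pivot for a directly induced pair is allowed to sit in either language, following the intuition stated above.

```python
from collections import defaultdict


def neighbours(mono_pairs):
    """Index monolingual strong pairs symmetrically: word -> set of partners."""
    index = defaultdict(set)
    for a, b in mono_pairs:
        index[a].add(b)
        index[b].add(a)
    return index


def induce_pairs(strong_ij, mono_i, mono_j):
    """Return (directly induced, indirectly induced) bilingual word pairs."""
    near_i, near_j = neighbours(mono_i), neighbours(mono_j)
    direct, indirect = set(), set()
    for p_i, p_j in strong_ij:  # every bilingual strong pair acts as a pivot
        for w_i in near_i.get(p_i, ()):
            direct.add((w_i, p_j))  # pivot word p_i in language i
        for w_j in near_j.get(p_j, ()):
            direct.add((p_i, w_j))  # pivot word p_j in language j
        for w_i in near_i.get(p_i, ()):
            for w_j in near_j.get(p_j, ()):
                indirect.add((w_i, w_j))  # the whole strong pair is the pivot
    return direct, indirect


# Hypothetical toy inputs.
strong = {("car", "véhicule")}
mono_en = {("car", "automobile")}
mono_fr = {("véhicule", "voiture")}

direct, indirect = induce_pairs(strong, mono_en, mono_fr)
print(sorted(direct))
print(sorted(indirect))
```

A directly induced pair rests on two strong pairs and an indirectly induced pair on three, which is one way to read the different confidence levels attached to the two types.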
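Our model jointly learns three word-pair-based cross-lingual objectives to align the embedding spaces of two languages, and two monolingual Skip-Gram losses Mikolov et al. (2013b) to preserve monolingual word similarities. Given a language pair $(l_i, l_j) \in \mathcal{L}^2$, the learning objective of BilLex is to minimize the following joint loss function:
$$J = J^M_i + J^M_j + \sum_{P \in \{P^S, P^D, P^I\}} \lambda_P J^A_P$$
Here $J^M_i$ and $J^M_j$ are the monolingual Skip-Gram losses of the two languages, and $J^A_P$ is the alignment objective for the word pairs of type $P$.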
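$\lambda_P$ ($P \in \{P^S, P^D, P^I\}$) thereof is the hyperparameter that controls how much the corresponding type of word pairs contributes to cross-lingual learning. For alignment objectives, we use word pairs in both directions of an ordered language pair to capture the cross-lingual semantic similarity of words, such that $P^S = P^S_{ij} \cup P^S_{ji}$, $P^D = P^D_{ij} \cup P^D_{ji}$ and $P^I = P^I_{ij} \cup P^I_{ji}$. Then for each $P \in \{P^S, P^D, P^I\}$, the alignment objective $J^A_P$ is defined as below, where $\sigma$ is the sigmoid function.
$$J^A_P = -\frac{1}{|P|} \sum_{(w^i, w^j) \in P} \Big( \log \sigma(\mathbf{w}^{i\top} \mathbf{w}^j) + \sum_{w_n \in N_j(w^i)} \log \sigma(-\mathbf{w}^{i\top} \mathbf{w}_n) + \sum_{w_n \in N_i(w^j)} \log \sigma(-\mathbf{w}_n^{\top} \mathbf{w}^j) \Big)$$
For each word pair $(w^i, w^j) \in P$, we use the unigram distribution raised to the power of 0.75 to select a number of words in $l_j$ (or $l_i$) for $w^i$ (or $w^j$) to form a negative sample set $N_j(w^i)$ (or $N_i(w^j)$). Without loss of generality, we define the negative sample set as $N_j(w^i) = \{w_n \mid w_n \sim U_j(w)\}$, where $U_j(w)$ is the distribution of words in $l_j$.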
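To make the negative-sampling alignment objective more concrete, the following pure-Python sketch computes the loss contribution of a single word pair in one direction. It assumes the standard Skip-Gram-style negative-sampling form (a log-sigmoid of the dot product for the positive pair, and a log-sigmoid of the negated dot product for each negative sample). The helper names, the toy embeddings and counts, and the choice to exclude the positive word from the negatives are assumptions for illustration, not details confirmed by the paper.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sample_negatives(counts, k, exclude, rng):
    """Draw k words from the unigram distribution raised to the power 0.75."""
    words = [w for w in counts if w not in exclude]
    weights = [counts[w] ** 0.75 for w in words]
    return rng.choices(words, weights=weights, k=k)


def pair_loss(w_i, w_j, emb_i, emb_j, negatives_j):
    """Loss of one pair (w_i, w_j) against negatives drawn from language j."""
    loss = -math.log(sigmoid(dot(emb_i[w_i], emb_j[w_j])))
    for n in negatives_j:
        loss -= math.log(sigmoid(-dot(emb_i[w_i], emb_j[n])))
    return loss


# Hypothetical two-dimensional embeddings and word counts.
emb_en = {"car": [1.0, 0.0]}
emb_fr = {"véhicule": [1.0, 0.0], "pomme": [-1.0, 0.0], "chat": [0.0, 1.0]}
counts_fr = {"véhicule": 50, "pomme": 30, "chat": 20}

rng = random.Random(0)
negatives = sample_negatives(counts_fr, k=2, exclude={"véhicule"}, rng=rng)
loss = pair_loss("car", "véhicule", emb_en, emb_fr, negatives)
print(negatives, round(loss, 4))
```

The loss for the other direction follows by swapping the roles of the two languages, and a full objective would average such terms over all pairs of one type before applying its weight.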
We evaluate BilLex on two bilingual tasks: word translation and sentence translation retrieval. Following the convention Gouws et al. (2015); Mogadala and Rettinger (2016), we evaluate BilLex between English-French and English-Spanish. Accordingly, we extract word pairs from both directions of bilingual dictionaries in Wiktionary for these language pairs. To support the induced word pairs, we also extract monolingual lexical definitions in the three languages involved, which include 238k entries in English, 107k entries in French and 49k entries in Spanish. The word pair extraction process of BilLex excludes stop words and punctuation in the lexical definitions. The statistics of three types of extracted word pairs are reported in Table 1.
4.1 Word translation
This task aims to retrieve the translation of a source word in the target language. We use the test set provided by Conneau et al. (2018), which selects the most frequent 200k words of each language as candidates for 1.5k query words. We translate a query word by retrieving its $k$ nearest neighbours in the target language, and report $P@k$ to represent the fraction of correct translations that are ranked not larger than $k$.
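The retrieval metric can be sketched as follows. The example assumes cosine similarity as the nearest-neighbour criterion and a gold standard that maps each query word to a set of acceptable translations. Both are assumptions for illustration, as are the function names and the toy vectors.

```python
import math


def cosine(u, v):
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm


def precision_at_k(queries, gold, src_emb, tgt_emb, k):
    """Fraction of queries with a gold translation among the k nearest target words."""
    hits = 0
    for q in queries:
        ranked = sorted(
            tgt_emb,
            key=lambda t: cosine(src_emb[q], tgt_emb[t]),
            reverse=True,
        )
        if gold[q] & set(ranked[:k]):
            hits += 1
    return hits / len(queries)


# Hypothetical toy embeddings in a shared two-dimensional space.
src_emb = {"car": [1.0, 0.1], "cat": [1.0, 0.3]}
tgt_emb = {"voiture": [0.9, 0.2], "chat": [0.7, 0.7], "pomme": [-1.0, 0.0]}
gold = {"car": {"voiture"}, "cat": {"chat"}}

p_at_1 = precision_at_k(["car", "cat"], gold, src_emb, tgt_emb, k=1)
p_at_2 = precision_at_k(["car", "cat"], gold, src_emb, tgt_emb, k=2)
print(p_at_1, p_at_2)  # 0.5 1.0
```

In the toy data the query "cat" ranks its translation second, so it counts as a miss at k = 1 and as a hit at k = 2.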
Evaluation protocol. The hyperparameters of BilLex are tuned based on a small validation set of 1k word pairs provided by Conneau et al. (2018). We initialize 128-dimensional word embeddings with pre-trained BilBOWA Gouws et al. (2015), and use the standard configuration of Skip-Gram Mikolov et al. (2013b) on monolingual Wikipedia dumps. We set the negative sampling size of bilingual word pairs to 4, which is selected from 0 to 10 with the step of 1. is set to , which is tuned from 0 to 1 with the step of 0.1. As we assume that the strong pair relations between words are independent, we empirically set = and = . We minimize the loss function using AMSGrad Reddi et al. (2018) with a learning rate of 0.001. The training is terminated based on early stopping. We limit the vocabularies to the 200k most frequent words in each language, and exclude the bilingual strong pairs that have appeared in the test set. The baselines we compare against include BiCVM Hermann and Blunsom (2014), BilBOWA Gouws et al. (2015), BiSkip Luong et al. (2015), and supervised and unsupervised MUSE Conneau et al. (2018).
Results. Results are summarized in Table 2, where the performance of BilLex is reported for three variants: (i) training with bilingual strong pairs only, BilLex($P^S$); (ii) with directly induced pairs added, BilLex($P^S$+$P^D$); and (iii) with all three types of word pairs, BilLex($P^S$+$P^D$+$P^I$). Of these, BilLex($P^S$+$P^D$+$P^I$) offers consistently better performance in all settings, which implies that the induced word pairs are effective in improving the cross-lingual learning of lexical semantics. Among the baseline models, the unsupervised MUSE outperforms the other four supervised ones. We also discover that for the word translation task, the supervised models with coarse alignment such as BiCVM and BilBOWA do not perform as well as the models with word-level supervision, such as BiSkip and supervised MUSE. Our best BilLex outperforms unsupervised MUSE by 4.4% to 5.7% of $P@1$ between En and Fr, and by 0.3% to 1.8% between En and Es. The settings between En and Fr achieve better performance because far fewer bilingual definitions are available between En and Es.
4.2 Sentence translation retrieval
This task focuses on retrieving the sentence in the target language space with the tf-idf weighted sentence representation approach. We follow the experiment setup in Conneau et al. (2018) with 2k source sentence queries and 200k target sentences from the Europarl corpus for English and French. We carry forward model configurations from the previous experiment, and report $P@1$ and $P@5$.
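A tf-idf weighted sentence representation can be sketched as a weighted average of word embeddings. The exact weighting scheme is not spelled out in the text, so the version below is only one plausible reading: it assumes idf = log(N / df) computed over a list of tokenized sentences and raw term counts as tf. The function names and toy data are hypothetical.

```python
import math
from collections import Counter


def idf_table(corpus):
    """Inverse document frequency of each word over a list of tokenized sentences."""
    n = len(corpus)
    df = Counter(w for sentence in corpus for w in set(sentence))
    return {w: math.log(n / df[w]) for w in df}


def sentence_vector(tokens, emb, idf):
    """tf-idf weighted average of the embeddings of the in-vocabulary tokens."""
    tf = Counter(t for t in tokens if t in emb and t in idf)
    dim = len(next(iter(emb.values())))
    vec, total = [0.0] * dim, 0.0
    for w, count in tf.items():
        weight = count * idf[w]
        total += weight
        vec = [v + weight * x for v, x in zip(vec, emb[w])]
    # Fall back to the zero vector when no token carries any weight.
    return [v / total for v in vec] if total > 0 else vec


# Hypothetical toy corpus and embeddings.
corpus = [["the", "car", "stops"], ["the", "cat", "sleeps"]]
emb = {"the": [9.0, 9.0], "car": [1.0, 0.0], "stops": [0.0, 1.0]}

idf = idf_table(corpus)
vec = sentence_vector(["the", "car", "stops"], emb, idf)
print(vec)
```

Because "the" occurs in every sentence of the toy corpus, its idf is zero and it does not affect the sentence vector. Retrieval then ranks target sentences by their similarity to the query vector, as in the word translation task.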
Results. The results are reported in Table 3. Overall, our best model variant BilLex($P^S$+$P^D$+$P^I$) performs better than the best baseline, with a noticeable increment of $P@1$ by 1.4% to 1.7% and $P@5$ by 1.3% to 2.1%. This demonstrates that BilLex is suitable for transferring sentential semantics.
In this paper, we propose BilLex, a novel bilingual word embedding model based on lexical definitions. BilLex is motivated by the fact that openly available dictionaries offer high-quality linguistic knowledge to connect lexicons across languages. We design the word pair induction method to capture semantically related lexicons in dictionaries, which serve as alignment information in joint training. BilLex outperforms state-of-the-art methods on word and sentence translation tasks.
References
- Chen et al. (2018) Muhao Chen, Yingtao Tian, Kai-Wei Chang, Steven Skiena, et al. 2018. Co-training embeddings of knowledge graphs and entity descriptions for cross-lingual entity alignment. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pages 3998–4004.
- Chen and Cardie (2018) Xilun Chen and Claire Cardie. 2018. Unsupervised multilingual word embeddings. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 261–270.
- Conneau et al. (2018) Alexis Conneau, Guillaume Lample, Marc’Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2018. Word translation without parallel data. ICLR.
- Coulmance et al. (2015) Jocelyn Coulmance, Jean-Marc Marty, Guillaume Wenzek, and Amine Benhalloum. 2015. Trans-gram, fast cross-lingual word-embeddings.
- Duong et al. (2017) Long Duong, Hiroshi Kanayama, Tengfei Ma, Steven Bird, and Trevor Cohn. 2017. Multilingual training of crosslingual word embeddings. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, volume 1, pages 894–904.
- Faruqui and Dyer (2014) Manaal Faruqui and Chris Dyer. 2014. Improving vector space word representations using multilingual correlation. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 462–471.
- Gouws et al. (2015) Stephan Gouws, Yoshua Bengio, and Greg Corrado. 2015. Bilbowa: Fast bilingual distributed representations without word alignments. In International Conference on Machine Learning.
- Guo et al. (2015) Jiang Guo, Wanxiang Che, David Yarowsky, Haifeng Wang, and Ting Liu. 2015. Cross-lingual dependency parsing based on distributed representations. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), volume 1, pages 1234–1244.
- Hermann and Blunsom (2014) Karl Moritz Hermann and Phil Blunsom. 2014. Multilingual models for compositional distributed semantics. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 58–68.
- Klementiev et al. (2012) Alexandre Klementiev, Ivan Titov, and Binod Bhattarai. 2012. Inducing crosslingual distributed representations of words. In Proceedings of COLING 2012.
- Luong et al. (2015) Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Bilingual word representations with monolingual quality in mind. In Proceedings of NAACL-HLT, pages 151–159.
- Mikolov et al. (2013a) Tomas Mikolov, Quoc V Le, and Ilya Sutskever. 2013a. Exploiting similarities among languages for machine translation. CoRR, abs/1309.4168.
- Mikolov et al. (2013b) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119.
- Mogadala and Rettinger (2016) Aditya Mogadala and Achim Rettinger. 2016. Bilingual word embeddings from parallel and non-parallel corpora for cross-language text classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 692–702.
- Reddi et al. (2018) Sashank J Reddi, Satyen Kale, and Sanjiv Kumar. 2018. On the convergence of adam and beyond. In ICLR.
- Ruder et al. (2017) Sebastian Ruder, Ivan Vulić, and Anders Søgaard. 2017. A survey of cross-lingual word embedding models. Journal of Artificial Intelligence Research.
- Tissier et al. (2017) Julien Tissier, Christophe Gravier, and Amaury Habrard. 2017. Dict2vec: Learning word embeddings using lexical dictionaries. In Conference on Empirical Methods in Natural Language Processing (EMNLP 2017), pages 254–263.
- Tsai and Roth (2016) Chen-Tse Tsai and Dan Roth. 2016. Cross-lingual wikification using multilingual embeddings. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 589–598.
- Upadhyay et al. (2016) Shyam Upadhyay, Manaal Faruqui, Chris Dyer, and Dan Roth. 2016. Cross-lingual models of word embeddings: An empirical comparison. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1661–1670.
- Vulić and Korhonen (2016) Ivan Vulić and Anna Korhonen. 2016. On the role of seed lexicons in learning bilingual word embeddings. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 247–257.
- Vulić and Moens (2015) Ivan Vulić and Marie-Francine Moens. 2015. Monolingual and cross-lingual information retrieval models based on (bilingual) word embeddings. In Proceedings of the 38th international ACM SIGIR conference on research and development in information retrieval, pages 363–372. ACM.
- Xing et al. (2015) Chao Xing, Dong Wang, Chao Liu, and Yiye Lin. 2015. Normalized word embedding and orthogonal transform for bilingual word translation. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1006–1011.
- Zhang et al. (2017) Meng Zhang, Haoruo Peng, Yang Liu, Huan-Bo Luan, and Maosong Sun. 2017. Bilingual lexicon induction from non-parallel data with minimal supervision. In AAAI, pages 3379–3385.