Code for "Unsupervised Cross-lingual Transfer of Word Embedding Spaces" in EMNLP 2018
Cross-lingual transfer of word embeddings aims to establish the semantic mappings among words in different languages by learning the transformation functions over the corresponding word embedding spaces. Successfully solving this problem would benefit many downstream tasks, such as transferring text classification models from resource-rich languages (e.g. English) to low-resource languages. Supervised methods for this problem rely on the availability of cross-lingual supervision, using either parallel corpora or bilingual lexicons as the labeled data for training, which may not be available for many low-resource languages. This paper proposes an unsupervised learning approach that does not require any cross-lingual labeled data. Given two monolingual word embedding spaces for any language pair, our algorithm optimizes the transformation functions in both directions simultaneously, based on distributional matching as well as minimizing the back-translation losses. We use a neural network implementation to calculate the Sinkhorn distance, a well-defined distributional similarity measure, and optimize our objective through back-propagation. Our evaluation on benchmark datasets for bilingual lexicon induction and cross-lingual word similarity prediction shows stronger or competitive performance of the proposed method compared to other state-of-the-art supervised and unsupervised baseline methods over many language pairs.
Training word vectors using monolingual corpora is a common practice in various NLP tasks. However, how to establish cross-lingual semantic mappings among monolingual embeddings remains an open challenge, as the availability of resources and benchmarks is highly imbalanced across languages.
Recently, a growing body of research has addressed this challenge. Successful cross-lingual word mapping will benefit many cross-lingual learning tasks, such as transferring text classification models trained in resource-rich languages to low-resource languages. Downstream applications include word alignment, text classification, named entity recognition, dependency parsing, POS-tagging, and more Søgaard et al. (2015).
Most methods for cross-lingual transfer of word embeddings are based on supervised or semi-supervised learning, i.e., they require cross-lingual supervision such as human-annotated bilingual lexicons and parallel corpora Lu et al. (2015); Smith et al. (2017); Artetxe et al. (2016). Such a requirement may not be met for many language pairs in the real world.
This paper proposes an unsupervised approach to the cross-lingual transfer of monolingual word embeddings, which requires zero cross-lingual supervision. The key idea is to optimize the mapping in both directions for each language pair (say A and B), in such a way that the word embeddings translated from language A to language B match the distribution of word embeddings in language B, and that, when translated back from B to A, the word embedding after two steps of transfer is maximally close to the original word embedding. A similar property holds for the other direction of the loop (from B to A and then from A back to B). Specifically, we use the Sinkhorn distance Cuturi (2013) to capture the distributional similarity between two sets of embeddings after transformation, which we found empirically superior to the KL-divergence Zhang et al. (2017a) and the distance to the nearest neighbor Artetxe et al. (2017); Conneau et al. (2017) with regard to the quality of the learned transformation as well as the robustness under different training conditions.
Our novel contributions in the proposed work include:
We propose an unsupervised learning framework which incorporates the Sinkhorn distance as a distributional similarity measure in the back-translation loss function.
We use a neural network to optimize our model, especially to implement the Sinkhorn distance whose calculation itself is an optimization problem.
Unlike previous models which only consider cross-lingual transformation in a single direction, our model jointly learns the word embedding transfer in both directions for each language pair.
We present an extensive comparative evaluation in which our model achieves state-of-the-art performance for many language pairs in cross-lingual tasks.
We divide the related work into supervised and unsupervised categories. Representative methods in both categories are included in our comparative evaluation (Section 3.4). We also discuss some related work in unsupervised domain transfer.
Supervised Methods:
There is a rich body of supervised methods for learning cross-lingual transfer of word embeddings based on bilingual dictionaries Mikolov et al. (2013); Faruqui and Dyer (2014); Artetxe et al. (2016); Xing et al. (2015); Duong et al. (2016); Gouws and Søgaard (2015), sentence-aligned corpora Kočiskỳ et al. (2014); Hermann and Blunsom (2014); Gouws et al. (2015) and document-aligned corpora Vulić and Moens (2016); Søgaard et al. (2015). The most relevant line of work is that by Mikolov et al. (2013), who showed that monolingual word embeddings are likely to share similar geometric properties across languages although they are trained separately, and hence that the cross-lingual mapping can be captured by a linear transformation across embedding spaces. Several follow-up studies tried to improve the cross-lingual transformation in various ways Faruqui and Dyer (2014); Artetxe et al. (2016); Xing et al. (2015); Duong et al. (2016); Ammar et al. (2016); Zhang et al. (2016); Shigeto et al. (2015). Nevertheless, all these methods require bilingual lexicons for supervised learning. Vulić and Korhonen (2016) showed that a bilingual lexicon of 5,000 high-quality translation pairs is sufficient for learning a reasonable cross-lingual mapping.
Unsupervised Methods:
Unsupervised methods have been studied to establish cross-lingual mappings without any human-annotated supervision. Earlier work relied only on word occurrence information Rapp (1995); Fung (1996), while later efforts have also considered more sophisticated statistics Haghighi et al. (2008). The main difficulty in unsupervised learning of cross-lingual mappings is the formulation of the objective function, i.e., how to measure the goodness of an induced mapping without any supervision is a non-trivial question. Cao et al. (2016) tried to match the mean and standard deviation of the embedded word vectors in two different languages after mapping the words in the source language to the target language. However, such an approach has been shown to be sub-optimal because the objective function only carries the first and second order statistics of the mapping. Artetxe et al. (2017) tried to impose an orthogonal constraint on their linear transformation model and minimize the distance between the transferred source-word embedding and its nearest neighbor in the target embedding space. Their method, however, requires a seed bilingual dictionary as the labeled training data and hence is not fully unsupervised. Zhang et al. (2017a); Barone (2016) adapted a generative adversarial network (GAN) Goodfellow et al. (2014) to make the transferred embedding of each source-language word indistinguishable from its true translation in the target embedding space. The adversarial model can be optimized in a purely unsupervised manner but often suffers from unstable training, i.e., the adversarial learning does not always improve the performance over simpler baselines. Zhang et al. (2017b), Conneau et al. (2017) and Artetxe et al. (2017) also tried adversarial approaches for the induction of seed bilingual dictionaries, as a sub-problem in the cross-lingual transfer of word embeddings.
Unsupervised Domain Transfer:
Generally speaking, learning the cross-lingual transfer of word embeddings can be viewed as a domain transfer problem, where the domains are word sets in different languages. Thus various work in the field of unsupervised domain adaptation or unsupervised transfer learning can shed light on our problem. For example, He et al. (2016) proposed a semi-supervised method for machine translation to utilize large monolingual corpora. Shen et al. (2017) used unsupervised learning to transfer sentences of different sentiments. Recent work in computer vision addresses the problem of image style transfer without any annotated training data Zhu et al. (2017); Taigman et al. (2016); Yi et al. (2017). Among those, our work is mostly inspired by the work on CycleGAN Zhu et al. (2017), and we adapt their cycle-consistency loss over images into our back-translation loss. One key difference between our method and CycleGAN is that they used the training loss of an adversarial classifier as an indicator of the distributional distance, whereas we introduce the Sinkhorn distance in our objective function and demonstrate its superiority over the representative method using adversarial loss Zhang et al. (2017a).
Our system takes two sets of monolingual word embeddings of dimension $d$ as input, which are trained separately on two languages. We denote them as $X = \{x_i\}_{i=1}^{n}$, $Y = \{y_j\}_{j=1}^{m}$, $x_i, y_j \in \mathbb{R}^d$. During the training of the monolingual word embeddings for $X$ and $Y$, we also have access to the word frequencies, represented by vectors $r \in \mathbb{R}^n$ and $c \in \mathbb{R}^m$ for $X$ and $Y$, respectively. Specifically, $r_i$ is the frequency of word (embedding) $x_i$, and similarly $c_j$ for $y_j$. As illustrated in Figure 1, our model has two mappings: $G: X \to Y$ and $F: Y \to X$. We further denote the transferred embeddings from $X$ as $G(X) := \{G(x_i)\}_{i=1}^{n}$, and correspondingly $F(Y)$ for $Y$.
In the unsupervised setting, the goal is to learn the mappings $G$ and $F$ without any paired word translation. To achieve this, our loss function consists of two parts: the Sinkhorn distance Cuturi (2013) for matching the distribution of the transferred embeddings to the target embedding distribution; and a back-translation loss for preventing degenerate transformations.
Sinkhorn distance is a recently proposed distance between probability distributions. We use the Sinkhorn distance to measure the closeness between $G(X)$ and $Y$, and also between $F(Y)$ and $X$. During training, our model optimizes $G$ and $F$ for lower Sinkhorn distance so that the transferred embeddings match the distribution of the target embeddings. Here we only illustrate the Sinkhorn distance between $G(X)$ and $Y$; the derivation for $F(Y)$ and $X$ is very similar. Although the vocabulary sizes of the two languages could be different, we are able to sample mini-batches of equal size from $G(X)$ and $Y$; therefore we assume $n = m$ in the following derivation.
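To compute the Sinkhorn distance, we first compute a distance matrix $M^{(G)} \in \mathbb{R}^{n \times m}$ between the transferred embeddings $G(X)$ and the target embeddings $Y$, where $M^{(G)}_{ij}$ is the distance measure between $G(x_i)$ and $y_j$. The superscript on $M^{(G)}$ indicates that the distance depends on the parameterized transformation $G$. For instance, if we choose Euclidean distance as the measure (see Section 3.1.3 for more discussion), we will have
$$M^{(G)}_{ij} = \lVert G(x_i) - y_j \rVert_2 .$$
Given the distance matrix, the Sinkhorn distance between $G(X)$ and $Y$ is defined as:
$$d_{sh}(G) := \min_{P \in U_{\alpha}(r, c)} \langle P, M^{(G)} \rangle, \tag{1}$$
where $\langle \cdot, \cdot \rangle$ is the Frobenius dot-product and $U_{\alpha}(r, c)$ is an entropy-constrained transport polytope over the word-frequency vectors $r$ and $c$, defined as
$$U_{\alpha}(r, c) = \{ P \in \mathbb{R}_{+}^{n \times m} \mid P \mathbf{1}_m = r,\; P^{\top} \mathbf{1}_n = c,\; h(P) \ge h(r) + h(c) - \alpha \}. \tag{2}$$
Note that $P$ is non-negative and the first two constraints make its element-wise sum equal to $1$. Therefore, $U_{\alpha}(r, c)$ can be seen as a set of probability distributions. The same applies to $r$ and $c$ since they are frequencies. $h$ is the entropy function defined on any probability distribution and $\alpha$ is a hyperparameter to choose. Any probabilistic matrix $P$ can be viewed as a joint probability over $(G(X), Y)$. The first two constraints ensure that $P$ has marginal distribution $r$ on $G(X)$ and $c$ on $Y$. We can also view $P_{ij}$ as the evidence for establishing a translation between word vector $x_i$ and word vector $y_j$.
An intuitive interpretation of Equation (1) is that we are trying to find the optimal transport probability $P$ under the entropy constraint such that the total distance to transport from $G(X)$ to $Y$ is minimized.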
Cuturi (2013) showed that the optimal solution of formula (1) has the form $P^{*} = \mathrm{diag}(u)\, K\, \mathrm{diag}(v)$, where $u$ and $v$ are some non-negative vectors and $K := e^{-\lambda M^{(G)}}$ is the element-wise exponential of the scaled distance matrix; $\lambda$ is the Lagrange multiplier for the entropic constraint in Section 3.1.1, and each $\alpha$ in Equation (1) has one corresponding $\lambda$. The Sinkhorn distance can be efficiently computed by a matrix scaling algorithm. We present the pseudo code in Algorithm 1. Note that the computation of the Sinkhorn distance $d_{sh}(G)$ only requires matrix-vector multiplication. Therefore, we can compute and back-propagate the gradient of $d_{sh}(G)$ with regard to the parameters in $G$ using standard deep learning libraries. We show our implementation details in Section 3.4 and the supplementary material.
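The matrix scaling iteration described above can be sketched in a few lines of NumPy. This is a minimal illustration of the standard Sinkhorn-Knopp scaling from Cuturi (2013), not a reproduction of the paper's Algorithm 1: the function name `sinkhorn_distance`, the default `lam` and `n_iter` values, and the uniform frequency vectors in the usage lines are all assumptions made for the example, and plain NumPy does not provide the gradients that a deep learning library would.

```python
import numpy as np

def sinkhorn_distance(M, r, c, lam=10.0, n_iter=50):
    """Approximate the entropy-regularized transport cost <P, M>.

    M: (n, m) distance matrix; r: (n,) and c: (m,) marginals that sum to 1.
    lam and n_iter are illustrative defaults, not the paper's settings.
    """
    K = np.exp(-lam * M)               # element-wise kernel exp(-lam * M)
    u = np.ones_like(r) / r.size       # initial row scaling vector
    for _ in range(n_iter):
        v = c / (K.T @ u)              # rescale to match column marginals
        u = r / (K @ v)                # rescale to match row marginals
    P = u[:, None] * K * v[None, :]    # transport plan diag(u) K diag(v)
    return float(np.sum(P * M)), P

# Toy usage: two small sets of unit-norm vectors with uniform frequencies.
rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))
B = rng.normal(size=(6, 4))
A /= np.linalg.norm(A, axis=1, keepdims=True)
B /= np.linalg.norm(B, axis=1, keepdims=True)
M = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
r = np.full(6, 1 / 6)
c = np.full(6, 1 / 6)
dist, plan = sinkhorn_distance(M, r, c, lam=1.0, n_iter=200)
```

Each iteration uses only matrix-vector products, which is why the same loop can be unrolled inside an automatic differentiation framework when gradients with respect to the mapping are needed.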
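In Section 3.1.1, we used the Euclidean distance of vector pairs to define the distance matrix $M^{(G)}$ and the Sinkhorn distance $d_{sh}(G)$. However, in our preliminary experiments, we found that the Euclidean distance of unnormalized vectors gave poor performance. Therefore, following common practice, we normalize all word embedding vectors to have unit L2 norm in the construction of $M^{(G)}$.
As pointed out in Theorem 1 of Cuturi (2013), $M$ must be a valid metric in order to make $d_{sh}$ a valid metric. For example, the commonly used cosine distance, which is defined as $\mathrm{CosDist}(a, b) = 1 - \cos(a, b)$, is not a valid metric because it does not satisfy the triangle inequality. (Footnote 1: If we select $a = (1, 0)$, $b = (\tfrac{\sqrt{2}}{2}, \tfrac{\sqrt{2}}{2})$, $z = (0, 1)$, we have $\mathrm{CosDist}(a, b) + \mathrm{CosDist}(b, z) = 2 - \sqrt{2} < 1 = \mathrm{CosDist}(a, z)$, which violates the triangle inequality.) Thus, for constructing $M^{(G)}$, we propose the square root cosine distance ($\mathrm{SqrtCosDist}$) below:
$$M^{(G)}_{ij} = \mathrm{SqrtCosDist}(G(x_i), y_j) = \sqrt{2 - 2\cos(G(x_i), y_j)}. \tag{3}$$
Theorem 1. $\mathrm{SqrtCosDist}$ is a valid metric.
Proof. $\forall a, b$, let $\hat{a} = a / \lVert a \rVert$, $\hat{b} = b / \lVert b \rVert$. We have $\lVert \hat{a} \rVert = \lVert \hat{b} \rVert = 1$ and $\cos(a, b) = \hat{a}^{\top} \hat{b}$. Then
$$\mathrm{SqrtCosDist}(a, b) = \sqrt{2 - 2\,\hat{a}^{\top} \hat{b}} = \sqrt{\lVert \hat{a} \rVert^2 + \lVert \hat{b} \rVert^2 - 2\,\hat{a}^{\top} \hat{b}} = \lVert \hat{a} - \hat{b} \rVert_2 . \tag{4}$$
Obviously, the last term is the Euclidean distance between the normalized input vectors $\hat{a}$ and $\hat{b}$. Since Euclidean distance is a valid metric, it follows that $\mathrm{SqrtCosDist}$ satisfies all the axioms for a valid metric. ∎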
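The argument above can be checked numerically on a small case. The sketch below assumes the cosine distance is 1 minus the cosine similarity and the square root cosine distance is the square root of 2 minus twice the cosine similarity; the three test vectors and the helper names `cos_dist` and `sqrt_cos_dist` are chosen for illustration only.

```python
import numpy as np

def cos_dist(a, b):
    """Cosine distance: 1 - cos(a, b). Not a metric."""
    return 1.0 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def sqrt_cos_dist(a, b):
    """Square root cosine distance: sqrt(2 - 2 cos(a, b))."""
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.sqrt(max(2.0 - 2.0 * cos, 0.0))  # clamp guards against rounding

# Three vectors at 0, 45 and 90 degrees.
a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])
z = np.array([0.0, 1.0])

# Cosine distance breaks the triangle inequality on this triple ...
violates = cos_dist(a, b) + cos_dist(b, z) < cos_dist(a, z)
# ... while the square root variant satisfies it.
holds = sqrt_cos_dist(a, b) + sqrt_cos_dist(b, z) >= sqrt_cos_dist(a, z)

# The square root cosine distance equals the Euclidean distance
# between the L2-normalized vectors.
euclid = np.linalg.norm(a / np.linalg.norm(a) - b / np.linalg.norm(b))
```

A single triple does not prove the metric property, of course; it only illustrates the counterexample for the cosine distance and the identity used in the proof.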
Given enough capacity, the mapping $G$ is capable of transferring $X$ to $Y$ for arbitrary word-to-word mappings. To ensure that we learn a meaningful translation, and also to regularize the search space of possible transformations, we enforce that the word embedding after the forward and the backward transformation should not diverge much from its original direction. We simply choose the back-translation loss based on the cosine similarity:
$$d_{bt}(G, F) = \sum_{i} \bigl(1 - \cos(x_i, F(G(x_i)))\bigr) + \sum_{j} \bigl(1 - \cos(y_j, G(F(y_j)))\bigr), \tag{5}$$
where $\cos(\cdot, \cdot)$ is the cosine similarity.
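As an illustration of the back-translation idea, the sketch below assumes both mappings are linear (weight matrices `W_g` and `W_f`, hypothetical names) and takes the loss to be the sum over words of one minus the cosine similarity between each embedding and its round-trip image. This is one plausible reading of a cosine-based back-translation loss, not necessarily the exact form used by the authors.

```python
import numpy as np

def cosine_rows(A, B):
    """Row-wise cosine similarity between two (n, d) arrays."""
    A_n = A / np.linalg.norm(A, axis=1, keepdims=True)
    B_n = B / np.linalg.norm(B, axis=1, keepdims=True)
    return np.sum(A_n * B_n, axis=1)

def back_translation_loss(X, Y, W_g, W_f):
    """Sum of (1 - cos) between each embedding and its round trip.

    X: (n, d) source embeddings, Y: (m, d) target embeddings.
    W_g maps source to target, W_f maps target to source (both d x d).
    """
    X_cycle = (X @ W_g.T) @ W_f.T      # F(G(x)) for every row x of X
    Y_cycle = (Y @ W_f.T) @ W_g.T      # G(F(y)) for every row y of Y
    loss_x = np.sum(1.0 - cosine_rows(X, X_cycle))
    loss_y = np.sum(1.0 - cosine_rows(Y, Y_cycle))
    return float(loss_x + loss_y)

# Toy usage: an orthogonal map and its transpose form a perfect round trip.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
Y = rng.normal(size=(7, 3))
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
perfect = back_translation_loss(X, Y, Q, Q.T)
W = rng.normal(size=(3, 3))
imperfect = back_translation_loss(X, Y, W, W)
```

Because the loss depends only on direction, rescaling either mapping by a positive constant leaves it unchanged, which matches the stated goal of keeping the round-trip embedding close to its original direction.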
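Putting everything together, we minimize the following objective function:
$$L_{G, F} = d_{sh}(G) + d_{sh}(F) + \beta\, d_{bt}(G, F), \tag{6}$$
where $d_{sh}(G)$ and $d_{sh}(F)$ are the Sinkhorn distances in the two directions, $d_{bt}(G, F)$ is the back-translation loss, and the hyper-parameter $\beta$ controls the relative weight of the last term against the first two terms in the objective function. By definition, the computation of $d_{sh}(G)$ or $d_{sh}(F)$ involves another minimization problem as shown in Equation (1). We solve it using the matrix scaling algorithm in Section 3.1.2, and treat $d_{sh}(G)$ as a deterministic and differentiable function of the parameters in $G$. The same holds for $d_{sh}(F)$ and $F$.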
In preliminary experiments, we found that our objective (6) is sensitive to the initialization of the weights in the mappings $G$ and $F$ in the purely unsupervised setting. It requires a good initial setting of the parameters to avoid getting stuck in poor local minima. To address this sensitivity issue, we employed an approach similar to that of Zhang et al. (2017b); Aldarmaki et al. (2018): we first use adversarial training to learn $G$ and $F$, and then use them as the initial point for training our full objective (6). More specifically, we choose to minimize the optimal transport distance below:
$$d_{ot}(G) := \min_{P \in U(r, c)} \langle P, M^{(G)} \rangle,$$
where $U(r, c)$ is the transport polytope without the entropy constraint, defined as follows:
$$U(r, c) = \{ P \in \mathbb{R}_{+}^{n \times m} \mid P \mathbf{1}_m = r,\; P^{\top} \mathbf{1}_n = c \}.$$
We optimize the distance above through its dual form and adversarial training, which is also known as Wasserstein GAN (WGAN) Arjovsky et al. (2017). We applied the optimization trick proposed by Gulrajani et al. (2017).
Although the first phase of adversarial training can be unstable, and its performance is lower than that obtained with the Sinkhorn distance, the adversarial training narrows down the search space of model parameters and boosts the training of our proposed model.
We implemented the transformations $G$ and $F$ as linear transformations. The dimensions of the input and output are the same as the word embedding dimension $d$. (Footnote 2: We tried more complex non-linear transformations for $G$ and $F$. The performance is slightly worse than in the linear case.) For all the experiments in the subsequent section, the weight $\beta$ of the back-translation term in (6) was set to be . For hyper-parameters from the computation of Sinkhorn distance, we choose and run the matrix scaling algorithm for iterations. Due to the space constraint, a detailed implementation description is presented in the supplementary material. The code of our implementation is publicly available at https://github.com/xrc10/unsup-cross-lingual-embedding-transfer.
We conducted an evaluation of our approach in comparison with state-of-the-art supervised/unsupervised methods on several evaluation benchmarks for bilingual lexicon induction (Task 1) and word similarity prediction (Task 2). We include our main results in this section and report the ablation study in the supplementary material.
All the methods being evaluated in both tasks take monolingual word embeddings in each language as the input data. We use publicly available pre-trained word embeddings trained on Wikipedia articles: (1) a smaller set of word embeddings of dimension trained on comparable Wikipedia dump in five languages Zhang et al. (2017a) (available at http://nlp.csai.tsinghua.edu.cn/~zm/UBiLexAT) and (2) a larger set of word embeddings of dimension trained on Wikipedia dump in 294 languages Bojanowski et al. (2016) (available at https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md). For convenience, we name the two sets WE-Z and WE-C, respectively.
We need true translation pairs of words for evaluating methods in bilingual lexicon induction (Task 1). We followed previous studies and prepared two datasets below.
LEX-Z: Zhang et al. (2017a) constructed the bilingual lexicons from various resources. Since their ground truth word pairs are not released, we followed their procedure, crawled bilingual dictionaries and randomly separated them into training and testing sets of equal size. (The bilingual dictionaries we crawled are submitted as supplementary material.) Note that our proposed method did not utilize the training set. It was only used by the supervised baseline methods described in Section 4.2. There are eight language pairs (order counted); the corresponding dataset statistics are summarized in Table 1. We use WE-Z embeddings in this dataset.
LEX-C: This lexicon was constructed by Conneau et al. (2017) and contains more translation pairs than LEX-Z. They divided the pairs into training and testing sets. We ran our model and the baseline methods on 16 language pairs. For each language pair, the training set contains unique query words and the testing set has query words. We followed Conneau et al. (2017) and set the search space of candidate translations to be the most frequent words in each target language. We use WE-C embeddings in this dataset.
Table 1: Statistics of the LEX-Z dataset (columns: # tokens, vocab. size, bi. lex. size).
For bilingual word similarity prediction (Task 2) we need the true labels for evaluation. Following Conneau et al. (2017), we used the SemEval 2017 competition dataset, where human annotators measured the cross-lingual similarity of nominal word pairs according to the five-point Likert scale. This dataset contains word pairs across five languages: English (en), German (de), Spanish (es), Italian (it), and Farsi (fa). Each language pair has about 1,000 word pairs annotated with a real similarity score ranging from to .
Table 2: Accuracy@1 of all methods on LEX-Z.
| Supervised | Mikolov et al. (2013) | 19.41 | 10.81 | 68.73 | 41.19 | 45.88 | 45.37 | 59.83 | 41.26 |
| | Zhang et al. (2016) | 23.39 | 11.07 | 72.36 | 41.19 | 48.01 | 42.66 | 63.19 | 40.37 |
| | Xing et al. (2015) | 24.00 | 10.78 | 71.92 | 41.02 | 48.10 | 42.90 | 62.81 | 40.43 |
| | Shigeto et al. (2015) | 26.56 | 8.52 | 72.23 | 37.80 | 49.95 | 38.15 | 63.14 | 35.63 |
| | Artetxe et al. (2016) | 23.49 | 10.74 | 71.98 | 41.12 | 48.01 | 42.66 | 63.14 | 40.28 |
| | Artetxe et al. (2017) | 22.88 | 10.78 | 72.61 | 41.62 | 47.54 | 42.82 | 61.32 | 39.63 |
| Unsupervised | Conneau et al. (2017) | 4.09 | 1.41 | 60.16 | 33.58 | 41.98 | 34.70 | 26.98 | 15.47 |
| | Zhang et al. (2017a) | 15.83 | 7.41 | 63.41 | 37.73 | 42.08 | 41.26 | 54.75 | 37.17 |
We evaluated the same set of supervised and unsupervised baselines for comparative evaluation in both Task 1 and Task 2. The supervised baselines include the methods of Shigeto et al. (2015); Zhang et al. (2016); Artetxe et al. (2016); Xing et al. (2015); Mikolov et al. (2013); Artetxe et al. (2017). (The implementations are available from https://github.com/artetxem/vecmap.) We fed all the supervised methods with the bilingual dictionaries in the training portions of the LEX-Z and LEX-C datasets, respectively.
For unsupervised baselines we include the methods of Zhang et al. (2017a) and Conneau et al. (2017), whose source code is publicly available as provided by the authors. (We used the implementation by Zhang et al. (2017a) from http://nlp.csai.tsinghua.edu.cn/~zm/UBiLexAT and that of Conneau et al. (2017) from https://github.com/facebookresearch/MUSE.)
Table 3: Results of all methods on LEX-C (harder language pairs).
| Supervised | Mikolov et al. (2013) | 44.80 | 48.47 | 57.73 | 66.20 | 43.73 | 63.73 | 26.53 | 28.93 |
| | Zhang et al. (2016) | 50.60 | 39.73 | 63.40 | 58.73 | 50.87 | 53.93 | 34.53 | 22.87 |
| | Xing et al. (2015) | 50.33 | 40.00 | 63.40 | 58.53 | 51.13 | 53.73 | 34.27 | 21.60 |
| | Shigeto et al. (2015) | 61.00 | 33.80 | 69.33 | 53.60 | 61.27 | 41.67 | 42.20 | 13.87 |
| | Artetxe et al. (2016) | 53.27 | 43.40 | 65.27 | 60.87 | 54.07 | 55.93 | 35.80 | 26.47 |
| | Artetxe et al. (2017) | 47.27 | 34.40 | 61.27 | 56.73 | 38.07 | 44.20 | 24.07 | 12.20 |
| Unsupervised | Conneau et al. (2017) | 26.47 | 13.87 | 41.00 | 33.07 | 24.27 | 24.47 | - | - |
| | Zhang et al. (2017a) | - | - | - | - | - | - | - | - |
Table 4: Results of all methods on LEX-C (easier language pairs).
| Supervised | Mikolov et al. (2013) | 61.93 | 73.07 | 74.00 | 80.73 | 71.33 | 82.20 | 68.93 | 77.60 |
| | Zhang et al. (2016) | 67.67 | 69.87 | 77.27 | 78.53 | 76.07 | 78.20 | 72.40 | 73.40 |
| | Xing et al. (2015) | 67.73 | 69.53 | 77.20 | 78.60 | 76.33 | 78.67 | 72.00 | 73.33 |
| | Shigeto et al. (2015) | 71.07 | 63.73 | 81.07 | 74.53 | 79.93 | 73.13 | 76.47 | 68.13 |
| | Artetxe et al. (2016) | 69.13 | 72.13 | 78.27 | 80.07 | 77.73 | 79.20 | 73.60 | 74.47 |
| | Artetxe et al. (2017) | 68.07 | 69.20 | 75.60 | 78.20 | 74.47 | 77.67 | 70.53 | 71.67 |
| Unsupervised | Conneau et al. (2017) | 69.87 | 71.53 | 78.53 | 79.40 | 77.67 | 78.33 | 74.60 | 75.80 |
| | Zhang et al. (2017a) | - | - | - | - | - | - | - | - |
Bilingual lexicon induction is the task of inducing a translation in the target language for each query word in the source language. After the query word and the target-language words are represented in the same embedding space (or after our system maps the query word from the source embedding space to the target embedding space), the $k$ nearest target words are retrieved based on their cosine similarity scores with respect to the query vector. If the retrieved target words contain any valid translation according to the gold bilingual lexicon, the translation (retrieval) is considered successful. The fraction of correctly translated source words in the test set is defined as accuracy@$k$, which is the conventional metric in benchmark evaluations.
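The retrieval-based evaluation described above can be sketched as follows. The function name `accuracy_at_k`, the representation of the gold lexicon as a list of index sets, and the toy vectors are assumptions made for this example; the sketch does brute-force cosine retrieval and is not the evaluation script used in the paper.

```python
import numpy as np

def accuracy_at_k(mapped_queries, target_emb, gold, k=1):
    """Fraction of queries whose k nearest targets contain a gold translation.

    mapped_queries: (q, d) query vectors already mapped to the target space.
    target_emb:     (t, d) embeddings of the candidate target words.
    gold:           list of sets; gold[i] holds valid target indices for query i.
    """
    Q = mapped_queries / np.linalg.norm(mapped_queries, axis=1, keepdims=True)
    T = target_emb / np.linalg.norm(target_emb, axis=1, keepdims=True)
    sims = Q @ T.T                               # cosine similarity matrix
    topk = np.argsort(-sims, axis=1)[:, :k]      # indices of the k best targets
    hits = [bool(set(row.tolist()) & gold[i]) for i, row in enumerate(topk)]
    return float(np.mean(hits))

# Toy usage: three target words, two queries.
targets = np.eye(3)
queries = np.array([[1.0, 0.1, 0.0],    # closest to target 0
                    [0.0, 1.0, 0.2]])   # closest to target 1, then target 2
gold = [{0}, {2}]
acc1 = accuracy_at_k(queries, targets, gold, k=1)
acc2 = accuracy_at_k(queries, targets, gold, k=2)
```

In this toy case the second query misses at k = 1 but succeeds at k = 2, which shows how a query with several acceptable translations becomes easier as k grows.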
Table 2 shows the accuracy@1 for all the methods on LEX-Z in our evaluation. We can see that our method outperformed the other unsupervised baselines by a large margin on all eight language pairs. Compared with the supervised methods, our method is still competitive (the best or the second-best scores on four out of eight language pairs), even though ours does not require cross-lingual supervision. We also notice performance variance across language pairs. Our method outperforms all the methods (supervised and unsupervised combined) on the English-Spanish (en-es) pair, perhaps because these two languages are most similar to each other and the monolingual word embeddings for this pair in the comparable corpus are better aligned than those of the other language pairs. On the other hand, all the methods including ours have the worst performance on the English-Turkish (en-tr) pair. Another observation is the performance difference between the two directions of a language pair. For example, the performance of it-en is better than en-it for all methods in Table 2. Part of the reason is that there are more unique English words than non-English words in the evaluation set. This makes the direction "xx-en" easier than "en-xx" because there are often multiple valid ground-truth English translations for each query in "xx", while the same may not hold for the opposite direction "en-xx". Nevertheless, the relative performance of our method compared to others is quite robust over different language pairs and different directions of translation.
Table 3 and Table 4 summarize the results of all the methods on the LEX-C dataset. Several points are worth noting. Firstly, the performance scores on LEX-C are not necessarily consistent with those on LEX-Z (Table 2) even if the methods and the language pairs are the same; this is not surprising as the two datasets differ in query words, word embedding quality, and training-set sizes. Secondly, the performance gap between the best supervised methods and the best unsupervised methods in both Table 3 and Table 4 is larger than that in Table 2. This is attributed to the large amount of good-quality supervision in LEX-C (5,000 human-annotated word pairs) and the larger candidate size in WE-C ( candidates). Thirdly, the average performance in Table 3 is lower than that in Table 4, indicating that the language pairs in the former are more difficult than those in the latter. Nevertheless, we can see that our method has much stronger performance than other unsupervised methods in Table 3, i.e., on the harder language pairs, and that it performed comparably with the model by Conneau et al. (2017) in Table 4 on the easier language pairs. Combining all these observations, we see that our method is highly robust for various language pairs and under different training conditions.
Table 5: Performance of all methods on cross-lingual word similarity prediction.
| Supervised | Mikolov et al. (2013) | 0.71 | 0.72 | 0.68 | 0.71 |
| | Zhang et al. (2016) | 0.71 | 0.71 | 0.69 | 0.71 |
| | Xing et al. (2015) | 0.72 | 0.71 | 0.69 | 0.72 |
| | Shigeto et al. (2015) | 0.72 | 0.72 | 0.69 | 0.71 |
| | Artetxe et al. (2016) | 0.73 | 0.72 | 0.70 | 0.73 |
| | Artetxe et al. (2017) | 0.70 | 0.70 | 0.67 | 0.71 |
| Unsupervised | Conneau et al. (2017) | 0.71 | 0.71 | 0.68 | 0.71 |
| | Zhang et al. (2017a) | - | - | - | - |
We evaluate models on cross-lingual word similarity prediction (Task 2) to measure how much the predicted cross-language word similarities match the ground truth annotated by humans. Following the convention in benchmark evaluations for this task, we compute the Pearson correlation between the model-induced similarity scores and the human-annotated similarity scores over testing word pairs for each language pair. A higher correlation score with the ground truth represents the better quality of induced embeddings. All systems use the cosine similarity between the transformed embedding of each query and the word embedding of its paired translation as the predicted similarity score.
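The scoring procedure in this paragraph amounts to a cosine similarity per word pair followed by a Pearson correlation with the human ratings. A minimal sketch is given below; the function name `similarity_correlation` and the toy vectors and ratings are invented for illustration and do not come from the SemEval data.

```python
import numpy as np

def similarity_correlation(src_mapped, tgt, human_scores):
    """Pearson correlation between predicted and human similarity scores.

    src_mapped:   (p, d) source-word vectors mapped into the target space.
    tgt:          (p, d) vectors of the paired target-language words.
    human_scores: (p,) human-annotated similarity for each pair.
    """
    a = src_mapped / np.linalg.norm(src_mapped, axis=1, keepdims=True)
    b = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    predicted = np.sum(a * b, axis=1)            # cosine similarity per pair
    return float(np.corrcoef(predicted, human_scores)[0, 1])

# Toy usage: predicted cosines are 1, 0 and -1, in line with the ratings.
src = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
tgt = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
ratings = np.array([4.0, 2.0, 0.0])
corr = similarity_correlation(src, tgt, ratings)
```

Pearson correlation is invariant to affine rescaling of either score list, so the predicted cosines do not need to be mapped onto the annotators' rating scale before comparison.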
Table 5 summarizes the performance of all the methods in cross-lingual word similarity prediction. We can see that the unsupervised methods, including ours, perform as well as the supervised methods, which is highly encouraging.
In this paper, we presented a novel method for cross-lingual transformation of monolingual embeddings in an unsupervised manner. By simultaneously optimizing the bi-directional mappings w.r.t. Sinkhorn distances and back-translation losses on both ends, our model achieves both strong predictive power and robustness, as shown by its performance on multiple evaluation benchmarks. For future work, we would like to extend this work to the semi-supervised setting where only limited bilingual dictionaries are available.
We thank the reviewers for their helpful comments. This work is supported in part by the Defense Advanced Research Projects Agency Information Innovation Office (I2O), the Low Resource Languages for Emergent Incidents (LORELEI) Program, issued by DARPA/I2O under Contract No. HR0011-15-C-0114, and in part by the National Science Foundation (NSF) under grant IIS-1546329.
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2289–2294.
Bilbowa: Fast bilingual distributed representations without word alignments. In International Conference on Machine Learning, pages 748–756.
Journal of Artificial Intelligence Research, 55:953–994.
Dualgan: Unsupervised dual learning for image-to-image translation. arXiv preprint.