
Learning Unsupervised Word Mapping by Maximizing Mean Discrepancy

Cross-lingual word embeddings aim to capture common linguistic regularities of different languages, which benefit various downstream tasks ranging from machine translation to transfer learning. Recently, it has been shown that these embeddings can be effectively learned by aligning two disjoint monolingual vector spaces through a linear transformation (word mapping). In this work, we focus on learning such a word mapping without any supervision signal. Most previous work on this task adopts parametric metrics to measure distribution differences, which typically requires a sophisticated alternate optimization process, either in the form of a minimax game or intermediate density estimation. This alternate optimization process is relatively hard and unstable. In order to avoid such sophisticated alternate optimization, we propose to learn unsupervised word mapping by directly maximizing the mean discrepancy between the distributions of the transferred embeddings and the target embeddings. Extensive experimental results show that our proposed model outperforms competitive baselines by a large margin.


1 Introduction

It has been shown that word embeddings are capable of capturing meaningful representations of words Mikolov et al. (2013a); Pennington et al. (2014); Bojanowski et al. (2017). Recently, more and more effort has turned to cross-lingual word embeddings, which benefit various downstream tasks ranging from machine translation Lample et al. (2017) to transfer learning Zhou et al. (2016).

Based on the observation that monolingual word embeddings share similar geometric properties across languages Mikolov et al. (2013b), the underlying idea is to align two disjoint monolingual vector spaces through a linear transformation. Xing et al. (2015) further empirically demonstrate that the results can be improved by constraining the desired linear transformation to be an orthogonal matrix, a result also proved theoretically by Smith et al. (2017).

Recently, increasing research effort has been devoted to learning word mapping without any supervision signal. One line of research focuses on designing heuristics Kondrak et al. (2017) or taking advantage of the structural similarity of monolingual embeddings Aldarmaki et al. (2018); Artetxe et al. (2018b); Hoshen and Wolf (2018). However, these methods often require a large number of random restarts or additional techniques such as re-weighting Artetxe et al. (2018a) to achieve satisfactory results.

Another line strives to learn word mapping by matching the distributions of the transferred embeddings and the target embeddings. For instance, Zhang et al. (2017a) and Conneau et al. (2017) implement the word mapping as the generator in a GAN Goodfellow et al. (2014), which is essentially a minimax game. Zhang et al. (2017b) and Xu et al. (2018) adopt the Earth Mover's distance Rubner et al. (1998) and the Sinkhorn distance Cuturi (2013) as the optimized metrics, respectively, both of which require intermediate density estimation. Although this line exhibits relatively strong performance, both the minimax game and the intermediate density estimation require alternate optimization. Such a sophisticated alternate optimization process tends to cause a hard and unstable optimization problem Grave et al. (2018).

In this paper, we follow the core idea of the second line, based on distribution-matching, and combine it with the first line. Different from previous work requiring sophisticated alternate optimization, we propose to learn unsupervised word mapping by directly maximizing the mean discrepancy between the distributions of the transferred embeddings and the target embeddings. The Maximum Mean Discrepancy (MMD) is a non-parametric metric which measures the difference between two distributions. Compared with other parametric metrics, it does not require any intermediate density estimation, leading to a more stable optimization problem. Besides, in order to alleviate the initialization-sensitivity issue of distribution-matching, we take advantage of the structural similarity of monolingual embeddings Artetxe et al. (2018b) to learn the initial word mapping and provide a warm start.

The main contributions of this paper are summarized as follows:

  • We systematically analyze the drawbacks of the current models for the task of learning unsupervised word mapping.

  • We propose to learn unsupervised word mapping by means of non-parametric maximum mean discrepancy, which avoids a relatively sophisticated alternate optimization process.

  • Extensive experimental results show that our approach outperforms competitive baselines by a large margin on two benchmark tasks.

2 Proposed Method

2.1 Overview

Here we define some notation and describe the task of learning a word mapping between different languages without any supervision signal. Let $X = \{x_1, \dots, x_n\}$ and $Y = \{y_1, \dots, y_m\}$ be two sets of $n$ and $m$ pre-trained monolingual word embeddings with dimensionality $d$, which come from the source and target language, respectively. Our goal is to learn a word mapping $W \in \mathcal{O}_d$ so that for any source word embedding $x$, $Wx$ lies close to the embedding $y$ of its target-language translation. Here $\mathcal{O}_d$ is the space of all $d \times d$ orthogonal matrices. From the perspective of distribution-matching, the task of learning the word mapping can be modeled as finding an optimal orthogonal matrix $W$ that makes the distributions of $WX$ and $Y$ as close as possible.

2.2 MMD-Matching

Concisely, Maximum Mean Discrepancy (MMD) measures the difference between two distributions based on a Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}$. Let $P_{WX}$ and $P_Y$ represent the distributions of $WX$ and $Y$, respectively, i.e., $Wx \sim P_{WX}$ and $y \sim P_Y$. Then, the difference between the distributions $P_{WX}$ and $P_Y$ can be characterized by:

$$\mathrm{MMD}(P_{WX}, P_Y) = \big\| \mathbb{E}_{\tilde{x} \sim P_{WX}}[\phi(\tilde{x})] - \mathbb{E}_{y \sim P_Y}[\phi(y)] \big\|_{\mathcal{H}} \tag{1}$$

where $\phi$ is the feature mapping into $\mathcal{H}$. $\mathrm{MMD}(P_{WX}, P_Y)$ reaches its minimum only when the distributions $P_{WX}$ and $P_Y$ match exactly. Therefore, in order to match the distributions of the transferred embeddings and the target embeddings as closely as possible, the underlying linear mapping $W$ can be learned by solving the following optimization problem:

$$W^{*} = \mathop{\arg\min}_{W \in \mathcal{O}_d} \ \mathrm{MMD}^2(P_{WX}, P_Y) \tag{2}$$

By means of the kernel trick Gretton et al. (2012), the MMD of the distributions $P_{WX}$ and $P_Y$ can be calculated as:

$$\mathrm{MMD}^2(P_{WX}, P_Y) = \mathbb{E}_{x, x'}\big[k(Wx, Wx')\big] - 2\,\mathbb{E}_{x, y}\big[k(Wx, y)\big] + \mathbb{E}_{y, y'}\big[k(y, y')\big] \tag{3}$$

where $k(\cdot, \cdot)$ is the kernel function, such as a polynomial kernel or a Gaussian kernel. At the training stage, Eq.(3) can be estimated by sampling, which is formulated as:

$$\widehat{\mathrm{MMD}}^2(P_{WX}, P_Y) = \frac{1}{b^2}\sum_{i=1}^{b}\sum_{j=1}^{b} k(Wx_i, Wx_j) - \frac{2}{b^2}\sum_{i=1}^{b}\sum_{j=1}^{b} k(Wx_i, y_j) + \frac{1}{b^2}\sum_{i=1}^{b}\sum_{j=1}^{b} k(y_i, y_j) \tag{4}$$

where $b$ is the size of the mini-batch.
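
As a concrete illustration, the sketch below (our own, not the authors' released code) computes the mini-batch estimate of Eq.(4) in PyTorch, using a mixture of RBF kernels as later described in Section 3.2. The tensor shapes and bandwidth values are assumptions.

```python
import torch

def rbf_mixture_kernel(a, b, bandwidths=(1.0, 2.0, 4.0, 8.0, 16.0)):
    """Sum of Gaussian (RBF) kernels with several bandwidths; a: (n, d), b: (m, d)."""
    sq_dist = torch.cdist(a, b) ** 2                       # (n, m) pairwise squared distances
    return sum(torch.exp(-sq_dist / (2.0 * s ** 2)) for s in bandwidths)

def mmd2(wx, y, kernel=rbf_mixture_kernel):
    """Mini-batch estimate of MMD^2 between transferred embeddings wx = Wx and target embeddings y."""
    return kernel(wx, wx).mean() - 2.0 * kernel(wx, y).mean() + kernel(y, y).mean()

# Usage: W is the (d, d) mapping, x_batch and y_batch are (b, d) mini-batches.
# loss = mmd2(x_batch @ W.t(), y_batch)
```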

In order to maintain the orthogonality of $W$ during training Smith et al. (2017), we adopt the same update strategy proposed in Cisse et al. (2017). In detail, we replace the original update of the matrix $W$ with the following update rule:

$$W \leftarrow (1 + \beta)\, W - \beta\, W W^{\top} W \tag{5}$$

where $\beta$ is a hyper-parameter. After the optimization process of matching the distributions $P_{WX}$ and $P_Y$ converges, we use the iterative refinement Conneau et al. (2017); Artetxe et al. (2018b) to further improve the results.
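
A minimal sketch of the update in Eq.(5), following the Parseval-style rule of Cisse et al. (2017), could look as follows; `beta` corresponds to the hyper-parameter $\beta$ (set to 0.01 in Section 3.2) and the update is applied after each gradient step.

```python
import torch

@torch.no_grad()
def orthogonalize(W, beta=0.01):
    """In-place update W <- (1 + beta) * W - beta * W W^T W, pulling W back toward the orthogonal manifold."""
    W.copy_((1.0 + beta) * W - beta * W @ W.t() @ W)
```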

Methods FR-EN EN-FR DE-EN EN-DE ES-EN EN-ES IT-EN EN-IT
Supervised:
Mikolov et al. (2013a) 71.33 82.20 61.93 73.07 74.00 80.73 68.93 77.60
Xing et al. (2015) 76.33 78.67 67.73 69.53 77.20 78.60 72.00 73.33
Shigeto et al. (2015) 79.93 73.13 71.07 63.73 81.07 74.53 76.47 68.13
Zhang et al. (2016) 76.07 78.20 67.67 69.87 77.27 78.53 72.40 73.40
Artetxe et al. (2016) 77.73 79.20 69.13 72.13 78.27 80.07 73.60 74.47
Artetxe et al. (2017) 74.47 77.67 68.07 69.20 75.60 78.20 70.53 71.67
Unsupervised:
Zhang et al. (2017a) * 57.60 40.13 41.27 58.80 60.93 43.60 44.53
Conneau et al. (2017) 77.87 78.13 69.73 71.33 79.07 78.80 74.47 75.33
Xu et al. (2018) 75.47 77.93 67.00 69.33 77.80 79.53 72.60 73.47
Ours 78.87 78.40 70.33 71.53 79.33 79.93 74.73 75.53
Table 1: The accuracy of different methods for various language pairs on the bilingual lexicon induction task. The best score for each language pair is shown in bold for the supervised and unsupervised categories, respectively. For the baseline Zhang et al. (2017a), we adopt the most commonly used unidirectional transformation model. "*" means that the model fails to converge and hence the result is omitted.

2.3 Compression and Initialization

At the training stage, Eq.(3) is estimated by sampling, and the bias of this estimate directly determines the accuracy of the computed MMD. A reliable estimate of Eq.(3) generally requires the mini-batch size to be proportional to the dimensionality. Therefore, we adopt a compressing network, implemented as a multilayer perceptron, to map all embeddings into a lower-dimensional feature space. Experimental results show that the compressing network not only improves the performance of the model but also provides significant computational savings.
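
As a rough illustration, the compressing network could be implemented as a small multilayer perceptron like the sketch below; the paper only specifies that it is a multilayer perceptron mapping 300-dimensional embeddings to 50 dimensions (Section 3.2), so the depth, hidden width, and activation here are assumptions.

```python
import torch.nn as nn

# Hypothetical compressing network: 300-d fastText embeddings -> 50-d features.
compressor = nn.Sequential(
    nn.Linear(300, 128),
    nn.ReLU(),
    nn.Linear(128, 50),
)
```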

Besides, we find that the training of the model is sensitive to the initialization of the word mapping. An inappropriate initialization tends to cause the model to get stuck in a poor local optimum. The same sensitivity issue is also observed by Zhang et al. (2017b); Aldarmaki et al. (2018); Xu et al. (2018). Therefore, we take advantage of the structural similarity of embeddings to provide the initial setting for our MMD-matching process. Readers can refer to Artetxe et al. (2018b) for the detailed approach.

3 Experiments

3.1 Evaluation Tasks

We evaluate our proposed model on two representative benchmark tasks: bilingual lexicon induction and cross-lingual word similarity prediction.

Bilingual lexicon induction

The goal of this task is to retrieve the translation of a given source word. We use the lexicon constructed by Conneau et al. (2017). We report accuracy with nearest-neighbor retrieval based on cosine similarity.
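
A minimal sketch of this retrieval step (with illustrative variable names; not the evaluation code used in the paper) is given below.

```python
import torch
import torch.nn.functional as F

def translate(src_emb, tgt_emb, W, k=1):
    """Return indices of the k nearest target words (by cosine similarity) for each mapped source word."""
    mapped = F.normalize(src_emb @ W.t(), dim=1)   # map source embeddings and L2-normalize
    targets = F.normalize(tgt_emb, dim=1)
    cosine = mapped @ targets.t()                  # (n_src, n_tgt) cosine similarities
    return cosine.topk(k, dim=1).indices
```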

Word similarity prediction

This task aims to measure how well the predicted cross-language word cosine similarities correlate with the human-labeled scores. Following Conneau et al. (2017), we use the SemEval 2017 competition data. We report the Pearson correlation between the predicted similarity scores and the human-labeled scores over the test word pairs for each language pair.
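
For concreteness, the evaluation can be sketched as below (variable names are illustrative): cosine similarities of mapped word pairs are compared with the human-labeled scores via Pearson correlation.

```python
import torch
import torch.nn.functional as F

def pearson(a, b):
    """Pearson correlation between two 1-D tensors."""
    a, b = a - a.mean(), b - b.mean()
    return (a * b).sum() / (a.norm() * b.norm())

def similarity_correlation(src_pairs, tgt_pairs, human_scores, W):
    """src_pairs, tgt_pairs: (n, d) embeddings of the test word pairs; human_scores: (n,)."""
    mapped = F.normalize(src_pairs @ W.t(), dim=1)
    targets = F.normalize(tgt_pairs, dim=1)
    predicted = (mapped * targets).sum(dim=1)      # cosine similarity for each pair
    return pearson(predicted, human_scores)
```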

3.2 Experiment Settings

For both evaluation tasks, we use publicly available 300-dimensional fastText word embeddings trained on Wikipedia. The compressing network maps the original 300-dimensional embeddings to 50 dimensions. The batch size is set to 1280 and $\beta$ in Eq.(5) is set to 0.01. We use a mixture of 10 isotropic Gaussian (RBF) kernels with different bandwidths as in Li et al. (2015). The Adam optimizer is used to minimize the final objective function, and the learning rate is halved after every training epoch. We adopt the unsupervised criterion proposed in Conneau et al. (2017) both as a stopping criterion and to select the best hyper-parameters.
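
Putting these settings together, a schematic training loop might look like the sketch below. It reuses the `mmd2` and `orthogonalize` helpers from the earlier sketches; the initial learning rate, the number of epochs, and the placeholder embeddings are assumptions, and in practice $W$ would be warm-started from the initialization of Artetxe et al. (2018b).

```python
import torch

d, batch_size, num_epochs = 50, 1280, 5            # d is the dimensionality after compression
src = torch.randn(20000, d)                        # placeholder compressed source embeddings
tgt = torch.randn(20000, d)                        # placeholder compressed target embeddings

W = torch.nn.Parameter(torch.eye(d))               # identity here; warm-started in practice
optimizer = torch.optim.Adam([W], lr=1e-3)         # initial learning rate is an assumption
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.5)

for epoch in range(num_epochs):
    for i in range(0, src.size(0), batch_size):
        x_batch, y_batch = src[i:i + batch_size], tgt[i:i + batch_size]
        loss = mmd2(x_batch @ W.t(), y_batch)      # mini-batch MMD^2 of Eq.(4)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        orthogonalize(W.data)                      # orthogonality update of Eq.(5)
    scheduler.step()                               # halve the learning rate after each epoch
```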

3.3 Results

The experimental results of our approach and the baselines on the bilingual lexicon induction task are shown in Table 1. Our proposed model outperforms all unsupervised baselines by a large margin, which indicates that MMD is of great help in improving the quality of the word mapping. Our approach also achieves performance comparable to the supervised methods.

Table 2 summarizes the performance of all methods on the cross-lingual word similarity prediction task. Similar to the results in Table 1, our proposed model achieves the best performance among the unsupervised baselines and remains highly comparable to the competitive supervised methods.

Methods EN-ES EN-FA EN-DE EN-IT
Supervised:
Mikolov et al. (2013a) 0.71 0.68 0.71 0.71
Xing et al. (2015) 0.71 0.69 0.72 0.72
Shigeto et al. (2015) 0.72 0.69 0.72 0.71
Zhang et al. (2016) 0.71 0.69 0.71 0.71
Artetxe et al. (2016) 0.72 0.70 0.73 0.73
Artetxe et al. (2017) 0.70 0.67 0.70 0.71
Unsupervised:
Zhang et al. (2017a) 0.68 0.65 0.69 0.69
Conneau et al. (2017) 0.71 0.68 0.72 0.71
Xu et al. (2018) 0.71 0.67 0.71 0.71
Ours 0.72 0.67 0.72 0.72
Table 2: Comparison between our approach and all baselines on the word similarity prediction task. Pearson correlation between the predicted similarity scores and the human-labeled scores is reported.

4 Analysis and Discussions

Here we perform further analysis of the model and the experimental results.

4.1 Ablation Study

Here we perform an ablation study to understand the importance of the different components of our approach. Table 3 presents the best performance obtained by versions of our model with individual components removed: the MMD-matching, the refinement, and the initialization.

The most critical component is the initialization, without which our model fails to converge. This initialization sensitivity issue is ingrained and difficult to eliminate in the optimization of some metrics Aldarmaki et al. (2018); Xu et al. (2018). Besides, as shown in Table 3, the final refinement brings a significant improvement in performance. We emphasize that although removing the MMD-matching causes the smallest decline in performance, it remains a key component in guiding the model toward a better final word mapping.

Models EN-ES EN-FR EN-DE EN-IT
Full model 79.93 78.40 71.53 75.53
w/o MMD-matching 71.60 72.53 68.20 71.40
w/o Refinement 55.80 65.27 61.00 58.67
w/o Initialization * * * *
Table 3: Ablation study on the bilingual lexicon induction task. “*” means that the model fails to converge and hence the result is omitted.

4.2 Error Analysis

In the experiments, we find that all methods exhibit relatively poor performance when translating rare words on the bilingual lexicon induction task. Figure 1 shows the performance of our approach on common word pairs and rare word pairs, from which we can see that performance is far worse when the model translates rare words.

Since the pre-trained monolingual embeddings provide the cornerstone for learning unsupervised word mapping, the quality of the monolingual embeddings directly determines the quality of the word mapping. Due to the low frequency of rare words, the quality of their embeddings is lower than that of common words. This makes the isometric assumption Artetxe et al. (2018b) more difficult to satisfy for rare words, leading to the poor performance of all methods on rare word pairs. We leave improving the quality of cross-lingual embeddings for rare words to future work.

Figure 1: The performance of our approach on common words and rare words on the bilingual lexicon induction task. Common words are the 20,000 most frequent words, and the remaining words are regarded as rare.

5 Conclusion

In this paper, we propose to learn unsupervised word mapping between different languages by directly maximizing the mean discrepancy between the distributions of the transferred embeddings and the target embeddings. The proposed model adopts a non-parametric metric that does not require any intermediate density estimation, which avoids a relatively sophisticated and unstable alternate optimization process. Extensive experimental results show that the proposed model outperforms the baselines by a substantial margin.

References

  • Aldarmaki et al. (2018) Hanan Aldarmaki, Mahesh Mohan, and Mona T. Diab. 2018. Unsupervised word mapping using structural similarities in monolingual embeddings. TACL, 6:185–196.
  • Artetxe et al. (2016) Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2016. Learning principled bilingual mappings of word embeddings while preserving monolingual invariance. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pages 2289–2294.
  • Artetxe et al. (2017) Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2017. Learning bilingual word embeddings with (almost) no bilingual data. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pages 451–462.
  • Artetxe et al. (2018a) Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2018a. Generalizing and improving bilingual word embedding mappings with a multi-step framework of linear transformations. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18).
  • Artetxe et al. (2018b) Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2018b. A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, pages 789–798.
  • Bojanowski et al. (2017) Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. TACL, 5:135–146.
  • Cisse et al. (2017) Moustapha Cisse, Piotr Bojanowski, Edouard Grave, Yann Dauphin, and Nicolas Usunier. 2017. Parseval networks: Improving robustness to adversarial examples. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pages 854–863.
  • Conneau et al. (2017) Alexis Conneau, Guillaume Lample, Marc’Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2017. Word translation without parallel data. arXiv preprint arXiv:1710.04087.
  • Cuturi (2013) Marco Cuturi. 2013. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States., pages 2292–2300.
  • Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680.
  • Grave et al. (2018) Edouard Grave, Armand Joulin, and Quentin Berthet. 2018. Unsupervised alignment of embeddings with wasserstein procrustes. arXiv preprint arXiv:1805.11222.
  • Gretton et al. (2012) Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. 2012. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773.
  • Hoshen and Wolf (2018) Yedid Hoshen and Lior Wolf. 2018. An iterative closest point method for unsupervised word translation. CoRR, abs/1801.06126.
  • Kondrak et al. (2017) Grzegorz Kondrak, Bradley Hauer, and Garrett Nicolai. 2017. Bootstrapping unsupervised bilingual lexicon induction. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017, Valencia, Spain, April 3-7, 2017, Volume 2: Short Papers, pages 619–624.
  • Lample et al. (2017) Guillaume Lample, Ludovic Denoyer, and Marc’Aurelio Ranzato. 2017. Unsupervised machine translation using monolingual corpora only. arXiv preprint arXiv:1711.00043.
  • Li et al. (2015) Yujia Li, Kevin Swersky, and Richard S. Zemel. 2015. Generative moment matching networks. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pages 1718–1727.
  • Mikolov et al. (2013a) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
  • Mikolov et al. (2013b) Tomas Mikolov, Quoc V Le, and Ilya Sutskever. 2013b. Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168.
  • Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543.
  • Rubner et al. (1998) Yossi Rubner, Carlo Tomasi, and Leonidas J. Guibas. 1998. A metric for distributions with applications to image databases. In ICCV, pages 59–66.
  • Shigeto et al. (2015) Yutaro Shigeto, Ikumi Suzuki, Kazuo Hara, Masashi Shimbo, and Yuji Matsumoto. 2015. Ridge regression, hubness, and zero-shot learning. In Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2015, Porto, Portugal, September 7-11, 2015, Proceedings, Part I, pages 135–151.
  • Smith et al. (2017) Samuel L Smith, David HP Turban, Steven Hamblin, and Nils Y Hammerla. 2017. Offline bilingual word vectors, orthogonal transformations and the inverted softmax. arXiv preprint arXiv:1702.03859.
  • Xing et al. (2015) Chao Xing, Dong Wang, Chao Liu, and Yiye Lin. 2015. Normalized word embedding and orthogonal transform for bilingual word translation. In NAACL HLT 2015, The 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Denver, Colorado, USA, May 31 - June 5, 2015, pages 1006–1011.
  • Xu et al. (2018) Ruochen Xu, Yiming Yang, Naoki Otani, and Yuexin Wu. 2018. Unsupervised cross-lingual transfer of word embedding spaces. arXiv preprint arXiv:1809.03633.
  • Zhang et al. (2017a) Meng Zhang, Yang Liu, Huanbo Luan, and Maosong Sun. 2017a. Adversarial training for unsupervised bilingual lexicon induction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pages 1959–1970.
  • Zhang et al. (2017b) Meng Zhang, Yang Liu, Huanbo Luan, and Maosong Sun. 2017b. Earth mover’s distance minimization for unsupervised bilingual lexicon induction. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017, pages 1934–1945.
  • Zhang et al. (2016) Yuan Zhang, David Gaddy, Regina Barzilay, and Tommi S. Jaakkola. 2016. Ten pairs to tag - multilingual POS tagging via coarse mapping between embeddings. In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego California, USA, June 12-17, 2016, pages 1307–1317.
  • Zhou et al. (2016) Xinjie Zhou, Xiaojun Wan, and Jianguo Xiao. 2016. Cross-lingual sentiment classification with bilingual document representation learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers.