Previous work has proposed to learn a linear mapping between continuous representations of words by employing a small bilingual lexicon as supervision. The transformation generalizes well to words that are not observed during training, making it possible to extend the lexicon. Another application is to transfer predictive models between languages Klementiev et al. (2012).
The first simple method proposed by Mikolov et al. (2013b) has been subsequently improved by changing the problem parametrization. One successful suggestion is to $\ell_2$-normalize the word vectors and to constrain the linear mapping to be orthogonal Xing et al. (2015). An alignment is then efficiently found using orthogonal Procrustes Artetxe et al. (2016); Smith et al. (2017), improving the accuracy on standard benchmarks.
Yet, the resulting models suffer from the so-called “hubness problem”: some word vectors tend to be the nearest neighbors of an abnormally high number of other words. This limitation is now addressed by applying a corrective metric at inference time, such as the inverted softmax (ISF) Smith et al. (2017) or the cross-domain similarity local scaling (CSLS) Conneau et al. (2017). This is not fully satisfactory because the loss used for inference is not consistent with that employed for training. This observation suggests that the square loss is suboptimal and could advantageously be replaced by a loss adapted to retrieval.
In this paper, we propose a training objective inspired by the CSLS retrieval criterion. We introduce convex relaxations of the corresponding objective function, which are efficiently optimized with projected subgradient descent. This loss can advantageously include unsupervised information and therefore leverage the representations of words not occurring in the training lexicon.
Our contributions are as follows. First, we introduce our approach and empirically evaluate it on standard benchmarks for word translation. We obtain state-of-the-art bilingual mappings for more than language pairs. Second, we specifically show the benefit of our alternative loss function and of leveraging unsupervised information. Finally, we show that with our end-to-end formulation, a non-orthogonal mapping achieves better results. The code for our approach is a part of the fastText library (https://github.com/facebookresearch/fastText/tree/master/alignment/) and the aligned vectors are available on https://fasttext.cc/.
2 Preliminaries on bilingual mappings
This section introduces prerequisites and prior work on learning a mapping between two languages, using a small bilingual lexicon as supervision.
We start from two sets of continuous representations in two languages, each learned on monolingual data. Let us introduce some notation. Each word $i \in \{1, \dots, N\}$ in the source language (respectively target language) is associated with a vector $\mathbf{x}_i \in \mathbb{R}^d$ (respectively $\mathbf{y}_i \in \mathbb{R}^d$). For simplicity, we assume that our initial lexicon, or seeds, corresponds to the first $n$ pairs $(\mathbf{x}_i, \mathbf{y}_i)_{i \in \{1, \dots, n\}}$. The goal is to extend the lexicon to all source words that are not seeds. Mikolov et al. (2013b) learn a linear mapping $\mathbf{W} \in \mathbb{R}^{d \times d}$ between the word vectors of the seed lexicon that minimizes a measure of discrepancy between mapped word vectors of the source language and word vectors of the target language:
$$\min_{\mathbf{W} \in \mathbb{R}^{d \times d}} \; \frac{1}{n} \sum_{i=1}^{n} \ell(\mathbf{W}\mathbf{x}_i, \mathbf{y}_i), \tag{1}$$
where $\ell$ is a loss function, typically the square loss $\ell_2(\mathbf{x}, \mathbf{y}) = \|\mathbf{x} - \mathbf{y}\|_2^2$. This leads to a least squares problem, which is solved in closed form.
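As an illustration of the closed-form least squares solution, here is a minimal NumPy sketch. It assumes the seed vectors are stacked as rows of two matrices `X` and `Y` (a layout chosen for this example, not stated in the text), so that the objective becomes $\|\mathbf{X}\mathbf{W}^\top - \mathbf{Y}\|_F^2$. The function name and the toy data are hypothetical.

```python
import numpy as np

def least_squares_mapping(X, Y):
    """Return W minimizing sum_i ||W x_i - y_i||^2 for row-stacked X, Y of shape (n, d)."""
    # lstsq solves X @ B ~= Y in the least squares sense; the mapping is W = B.T.
    B, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return B.T

# Toy check: targets generated by a known linear map are recovered.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
W_true = rng.standard_normal((5, 5))
Y = X @ W_true.T
W = least_squares_mapping(X, Y)
```

Nothing in this solution constrains `W`, so distances between word vectors are generally not preserved.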
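The linear mapping $\mathbf{W}$ is constrained to be orthogonal, i.e. such that $\mathbf{W}^\top \mathbf{W} = \mathbf{I}_d$, where $\mathbf{I}_d$ is the $d$-dimensional identity matrix. This choice preserves distances between word vectors, and likewise word similarities. Previous works Xing et al. (2015); Artetxe et al. (2016); Smith et al. (2017) experimentally observed that constraining the mapping in such a way improves the quality of the inferred lexicon. With the square loss and by enforcing an orthogonal mapping $\mathbf{W}$, Eq. (1) admits a closed form solution Gower and Dijksterhuis (2004): $\mathbf{W}^* = \mathbf{U}\mathbf{V}^\top$, where $\mathbf{U}\mathbf{S}\mathbf{V}^\top$ is the singular value decomposition of the matrix $\mathbf{Y}^\top \mathbf{X}$, with $\mathbf{X}$ and $\mathbf{Y}$ the $n \times d$ matrices whose rows are the seed vectors.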
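The orthogonal Procrustes solution can be sketched in a few lines of NumPy. As an assumption of this example, `X` and `Y` hold the seed vectors as rows, so the matrix being decomposed is `Y.T @ X`. The function name and the synthetic rotation are illustrative only.

```python
import numpy as np

def procrustes(X, Y):
    """Orthogonal W minimizing sum_i ||W x_i - y_i||^2 for row-stacked X, Y of shape (n, d)."""
    # SVD of the d x d cross-covariance matrix Y^T X = U S V^T.
    U, _, Vt = np.linalg.svd(Y.T @ X)
    # Dropping the singular values gives the closest orthogonal map, W = U V^T.
    return U @ Vt

# Toy check: a random orthogonal map is recovered from noiseless pairs.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 4))
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))  # random orthogonal matrix
Y = X @ Q.T
W = procrustes(X, Y)
```

With noisy or imperfect lexicons the recovered `W` is still orthogonal, but only approximately maps each seed to its translation.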
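Once a mapping $\mathbf{W}$ is learned, one can infer word correspondences for words that are not in the initial lexicon. The translation $t(i)$ of a source word $i$ is obtained as
$$t(i) \in \arg\min_{j \in \{1, \dots, N\}} \ell(\mathbf{W}\mathbf{x}_i, \mathbf{y}_j). \tag{2}$$
When the squared loss is used, this amounts to computing $\mathbf{W}\mathbf{x}_i$ and to performing a nearest neighbor search with respect to the Euclidean distance:
$$t(i) \in \arg\min_{j \in \{1, \dots, N\}} \|\mathbf{W}\mathbf{x}_i - \mathbf{y}_j\|_2^2. \tag{3}$$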
A common observation is that nearest neighbor search for bilingual lexicon inference suffers from the “hubness problem” Doddington et al. (1998); Dinu et al. (2014). Hubs are words that appear too frequently in the neighborhoods of other words. To mitigate this effect, a simple solution is to replace, at inference time, the square $\ell_2$-norm in Eq. (3) by another criterion, such as ISF Smith et al. (2017) or CSLS Conneau et al. (2017).
This solution, both with ISF and CSLS criteria, is applied with a transformation learned using the square loss. However, replacing the loss in Eq. (3) creates a discrepancy between the learning of the translation model and the inference.
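To make the nearest neighbor inference step and the notion of hubs concrete, here is a small sketch under stated assumptions: vectors are stored as rows, the mapping is applied as `X_src @ W.T`, and hubness is summarized by counting how often each target word is retrieved. The function name and the toy data are hypothetical.

```python
import numpy as np

def nn_translate(W, X_src, Y_tgt):
    """Index of the nearest target vector (squared Euclidean distance) for each mapped source."""
    mapped = X_src @ W.T
    # ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2, computed for all pairs at once.
    sq_dist = ((mapped ** 2).sum(axis=1)[:, None]
               - 2.0 * mapped @ Y_tgt.T
               + (Y_tgt ** 2).sum(axis=1)[None, :])
    return sq_dist.argmin(axis=1)

rng = np.random.default_rng(0)
Y_tgt = rng.standard_normal((6, 4))
X_src = Y_tgt[[2, 0, 1]]                 # sources that coincide with targets 2, 0, 1
pred = nn_translate(np.eye(4), X_src, Y_tgt)
# A hub would show up as a target index with an unusually large count.
hub_counts = np.bincount(pred, minlength=len(Y_tgt))
```

On real embeddings, a heavily skewed `hub_counts` histogram is the symptom that criteria such as ISF or CSLS are meant to correct.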
3 Word translation as a retrieval task
In this section, we propose to directly include the CSLS criterion in the model in order to make learning and inference consistent. We also show how to incorporate unsupervised information.
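The CSLS criterion is a similarity measure between the vectors $\mathbf{x}$ and $\mathbf{y}$ defined as:
$$\mathrm{CSLS}(\mathbf{x}, \mathbf{y}) = -2\cos(\mathbf{x}, \mathbf{y}) + \frac{1}{k} \sum_{\mathbf{y}' \in \mathcal{N}_Y(\mathbf{x})} \cos(\mathbf{x}, \mathbf{y}') + \frac{1}{k} \sum_{\mathbf{x}' \in \mathcal{N}_X(\mathbf{y})} \cos(\mathbf{x}', \mathbf{y}),$$
where $\mathcal{N}_Y(\mathbf{x})$ is the set of $k$ nearest neighbors of the point $\mathbf{x}$ in the set of target word vectors $Y = \{\mathbf{y}_1, \dots, \mathbf{y}_N\}$, and $\cos$ is the cosine similarity. Note that the second term in the expression of the CSLS loss does not change the neighbors of $\mathbf{x}$. However, it gives a loss function that is symmetrical with respect to its two arguments, which is a desirable property for training.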
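The CSLS score can be computed for all source/target pairs at once. The sketch below is an illustration, not the reference implementation. It assumes row-stacked vectors and the sign convention in which lower scores indicate better matches, so retrieval uses `argmin`. The function name, the toy data and `k=3` are arbitrary choices for the example.

```python
import numpy as np

def csls(A, B, k):
    """Pairwise CSLS scores between rows of A and rows of B (lower means a better match)."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    cos = A @ B.T                                      # (len(A), len(B)) cosine similarities
    # Mean cosine between each row of A and its k nearest rows of B.
    r_a = np.sort(cos, axis=1)[:, -k:].mean(axis=1)
    # Mean cosine between each row of B and its k nearest rows of A.
    r_b = np.sort(cos, axis=0)[-k:, :].mean(axis=0)
    return -2.0 * cos + r_a[:, None] + r_b[None, :]

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 4))    # stand-in for mapped source vectors
B = rng.standard_normal((12, 4))   # stand-in for target vectors
scores = csls(A, B, k=3)
translations = scores.argmin(axis=1)
```

Swapping the two arguments transposes the score matrix, which is the symmetry property mentioned above.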
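Let us now write the optimization problem for learning the bilingual mapping with CSLS. At this stage, we follow previous work and constrain the linear mapping $\mathbf{W}$ to belong to the set of orthogonal matrices $\mathcal{O}_d$. Here, we also assume that word vectors are $\ell_2$-normalized. Under these assumptions, we have $\cos(\mathbf{W}\mathbf{x}_i, \mathbf{y}_i) = \mathbf{x}_i^\top \mathbf{W}^\top \mathbf{y}_i$. Similarly, we have $\|\mathbf{y}_j - \mathbf{W}\mathbf{x}_i\|_2^2 = 2 - 2\,\mathbf{x}_i^\top \mathbf{W}^\top \mathbf{y}_j$. Therefore, finding the $k$ nearest neighbors of $\mathbf{W}\mathbf{x}_i$ among the elements of $Y$ is equivalent to finding the $k$ elements of $Y$ which have the largest dot product with $\mathbf{W}\mathbf{x}_i$. We adopt this equivalent formulation because it leads to a convex formulation when relaxing the orthogonality constraint on $\mathbf{W}$. In summary, our optimization problem with the Relaxed CSLS loss (RCSLS) is written as:
$$\min_{\mathbf{W} \in \mathcal{O}_d} \; \frac{1}{n} \sum_{i=1}^{n} \Big[ -2\,\mathbf{x}_i^\top \mathbf{W}^\top \mathbf{y}_i + \frac{1}{k} \sum_{\mathbf{y}_j \in \mathcal{N}_Y(\mathbf{W}\mathbf{x}_i)} \mathbf{x}_i^\top \mathbf{W}^\top \mathbf{y}_j + \frac{1}{k} \sum_{\mathbf{W}\mathbf{x}_j \in \mathcal{N}_X(\mathbf{y}_i)} \mathbf{x}_j^\top \mathbf{W}^\top \mathbf{y}_i \Big]. \tag{4}$$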
Eq. (4) involves the minimization of a non-smooth cost function over the manifold of orthogonal matrices $\mathcal{O}_d$. As such, it can be solved using manifold optimization tools (Boumal et al., 2014). In this work, we consider as an alternative to the set $\mathcal{O}_d$, its convex hull $\mathcal{C}_d$, i.e., the unit ball of the spectral norm. We refer to this relaxation as the “Spectral” model. We also consider the case where these constraints on the alignment matrix are simply removed.
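Having a convex domain allows us to reason about the convexity of the cost function. We observe that the second and third terms in the CSLS loss can be rewritten as follows:
$$\sum_{\mathbf{y}_j \in \mathcal{N}_Y(\mathbf{W}\mathbf{x}_i)} \mathbf{x}_i^\top \mathbf{W}^\top \mathbf{y}_j = \max_{S \in \mathcal{S}_k(n)} \sum_{j \in S} \mathbf{x}_i^\top \mathbf{W}^\top \mathbf{y}_j,$$
where $\mathcal{S}_k(n)$ denotes the set of all subsets of $\{1, \dots, n\}$ of size $k$. This term, seen as a function of $\mathbf{W}$, is a maximum of linear functions of $\mathbf{W}$, which is convex (Boyd and Vandenberghe, 2004). This shows that our objective function is convex with respect to the mapping $\mathbf{W}$ and piecewise linear (hence non-smooth). Note that our approach could be generalized to other loss functions by replacing the term $\mathbf{x}_i^\top \mathbf{W}^\top \mathbf{y}_j$ by any function convex in $\mathbf{W}$. We minimize this objective function over the convex set $\mathcal{C}_d$ by using the projected subgradient descent algorithm.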
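The following sketch shows one way to evaluate the RCSLS objective and a subgradient with respect to the mapping. It is an illustration under assumptions, not the released implementation: vectors are row-stacked and $\ell_2$-normalized, neighbors are selected by largest dot product, and the neighbor pools `Zx` and `Zy` are passed explicitly. All function and variable names, the step size and `k=3` are hypothetical.

```python
import numpy as np

def topk_indices(scores, k):
    """Column indices of the k largest entries in each row of `scores`."""
    return np.argpartition(scores, -k, axis=1)[:, -k:]

def rcsls_loss_and_subgradient(W, X, Y, Zx, Zy, k):
    """RCSLS loss and one subgradient w.r.t. W.

    X, Y   : (n, d) seed pairs; row i of X translates to row i of Y.
    Zx, Zy : (Nx, d) and (Ny, d) pools in which nearest neighbors are searched.
    """
    n = X.shape[0]
    # First term: -2 * sum_i x_i^T W^T y_i, linear in W.
    loss = -2.0 * np.sum((X @ W.T) * Y)
    grad = -2.0 * Y.T @ X

    # Second term: mean dot product between W x_i and its k nearest targets.
    S = (X @ W.T) @ Zy.T                          # (n, Ny)
    idx = topk_indices(S, k)
    loss += np.take_along_axis(S, idx, axis=1).sum() / k
    grad += Zy[idx].sum(axis=1).T @ X / k

    # Third term: mean dot product between y_i and its k nearest mapped sources.
    T = Y @ (Zx @ W.T).T                          # (n, Nx)
    idx = topk_indices(T, k)
    loss += np.take_along_axis(T, idx, axis=1).sum() / k
    grad += Y.T @ Zx[idx].sum(axis=1) / k

    return loss / n, grad / n

def unit_rows(A):
    return A / np.linalg.norm(A, axis=1, keepdims=True)

rng = np.random.default_rng(0)
Zx = unit_rows(rng.standard_normal((30, 5)))
Zy = unit_rows(rng.standard_normal((30, 5)))
X, Y = Zx[:20], Zy[:20]                           # the first n = 20 words act as seeds
W = np.eye(5)
loss, grad = rcsls_loss_and_subgradient(W, X, Y, Zx, Zy, k=3)
W_next = W - 0.1 * grad                           # one unconstrained step, arbitrary step size
```

Because the neighbor sets are held fixed when differentiating, `grad` is the gradient of the currently active linear piece, which is a valid subgradient of the convex objective.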
The projection onto the set $\mathcal{C}_d$ is computed by taking the singular value decomposition (SVD) of the matrix and clipping the singular values that exceed one to one.
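A minimal sketch of this projection step, assuming a square NumPy matrix; the function name is hypothetical.

```python
import numpy as np

def project_spectral_ball(W):
    """Project W onto the set of matrices whose spectral norm is at most one."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    # Singular values above one are clipped to one; the others are left unchanged.
    return (U * np.minimum(s, 1.0)) @ Vt

W = np.diag([3.0, 0.5])
P = project_spectral_ball(W)
```

In a projected subgradient loop, this function would be applied to the mapping after each gradient step.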
Usually, the number of word pairs in the seed lexicon $n$ is small with respect to the size of the dictionaries $N$. To benefit from unlabeled data, it is common to add an iterative “refinement procedure” (Artetxe et al., 2017) when learning the translation model $\mathbf{W}$. Given a model $\mathbf{W}$, this procedure iterates over two steps. First, it augments the training lexicon by keeping the best-inferred translation in Eq. (3). Second, it learns a new mapping by solving the problem in Eq. (1). This strategy is similar to standard semi-supervised approaches where the training set is augmented over time. In this work, we propose to use the unpaired words in the dictionaries as “negatives” in the RCSLS loss: instead of computing the $k$-nearest neighbors amongst the annotated words $\{\mathbf{y}_1, \dots, \mathbf{y}_n\}$, we do it over the whole dictionary $\{\mathbf{y}_1, \dots, \mathbf{y}_N\}$.
|Adversarial + refine||81.7||83.3||82.3||82.1||74.0||72.2||44.0||59.1||32.5||31.4||64.3|
|ICP + refine||82.2||83.8||82.5||82.5||74.8||73.1||46.3||61.6||-||-||-|
|Wass. Proc. + refine||82.8||84.1||82.6||82.9||75.4||73.3||43.7||59.1||-||-||-|
|Least Square Error||78.9||80.7||79.3||80.7||71.5||70.1||47.2||60.2||42.3||4.0||61.5|
|Procrustes + refine||82.4||83.9||82.3||83.2||75.3||73.2||50.1||63.5||40.3||35.5||66.9|
|RCSLS + spectral||83.5||85.7||82.3||84.1||78.2||75.8||56.1||66.5||44.9||45.7||70.2|
4 Experiments
This section reports the main results obtained with our method. We provide complementary results and an ablation study in the appendix. We refer to our method without constraints as RCSLS and as RCSLS+spectral if the spectral constraints are used.
4.1 Implementation details
We select the learning rate and the number of epochs on the validation set. For the unconstrained RCSLS, a small regularization can be added to prevent the norm of $\mathbf{W}$ from diverging. In practice, we do not use any regularization. For the English-Chinese pairs (en-zh), we center the word vectors. The number of nearest neighbors in the CSLS loss is . We use the $\ell_2$-normalized fastText word vectors by Bojanowski et al. (2017) trained on Wikipedia.
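The preprocessing mentioned here can be sketched as follows. The order of operations (normalize, optionally center, then normalize again) is an assumption made for this example, as are the function name and the toy data.

```python
import numpy as np

def preprocess(vectors, center=False):
    """L2-normalize the rows of `vectors`, optionally centering them first."""
    V = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    if center:
        # Assumed order: subtract the mean vector, then re-normalize each row.
        V = V - V.mean(axis=0, keepdims=True)
        V = V / np.linalg.norm(V, axis=1, keepdims=True)
    return V

rng = np.random.default_rng(0)
raw = rng.standard_normal((10, 4)) + 2.0
plain = preprocess(raw)
centered = preprocess(raw, center=True)
```

Both variants return unit-norm rows, so dot products between the resulting vectors are cosine similarities.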
4.2 The MUSE benchmark
Table 9 reports the comparison of RCSLS with standard supervised and unsupervised approaches on 5 language pairs (in both directions) of the MUSE benchmark Conneau et al. (2017). Every approach uses the Wikipedia fastText vectors and supervision comes in the form of a lexicon composed of k words and their translations. Regardless of the relaxation, RCSLS outperforms the state of the art by, on average, to in accuracy. This shows the importance of using the same criterion during training and inference. Note that the refinement step (“refine”) also uses CSLS to finetune the alignments but leads to a marginal gain for supervised methods.
Interestingly, RCSLS achieves a better performance without constraints () for all pairs. Contrary to observations made in previous works, this result suggests that preserving the distance between word vectors is not essential for word translation. Previous works used a square loss, for which orthogonal constraints do lead to an improvement of (Procrustes versus Least Square Error). This suggests that a linear mapping with no constraints works well only if it is learned with a proper criterion.
Impact of extended normalization.
Table 2 reports the gain brought by including words not in the lexicon (unannotated words) to the performance of RCSLS. Extending the dictionary significantly improves the performance on all language pairs.
|Adversarial + refine + CSLS||45.1||38.3|
|Mikolov et al. (2013b)||33.8||24.9|
|Dinu et al. (2014)||38.5||24.6|
|Artetxe et al. (2016)||39.7||33.8|
|Smith et al. (2017)||43.1||38.0|
|Procrustes + CSLS||44.9||38.5|
4.3 The WaCky dataset
Dinu et al. (2014) introduce a setting where word vectors are learned on the WaCky datasets (Baroni et al., 2009) and aligned with a noisy bilingual lexicon. We select the number of epochs within on a validation set. Table 3 shows that RCSLS is on par with the state of the art. RCSLS is thus robust to relatively poor word vectors and noisy lexicons.
BP uses a different training set of comparable size.
4.4 Comparison with existing aligned vectors
Recently, word vectors based on fastText have been aligned and released by Smith et al. (Smith et al., 2017, BabylonPartners, BP) and Conneau et al. (Conneau et al., 2017, MUSE). Both use a variation of Procrustes to align word vectors in the same space.
We compare these methods to RCSLS and report results in Table 8. RCSLS improves the performance by over MUSE vectors when trained with the same lexicon (Original). Training RCSLS on the full training lexicon (Full) brings an additional improvement of on average with a CSLS criterion, and with a NN criterion. For reference, the performance of Procrustes only improves by with CSLS and even degrades with a NN criterion. RCSLS benefits more from additional supervision than Procrustes. Finally, the gap between RCSLS and the other methods is higher with a NN criterion, suggesting that RCSLS imports some of the properties of CSLS to the dot product between aligned vectors.
4.5 Impact on word vectors
Non-orthogonal mapping of word vectors changes their dot products. We evaluate the impact of this mapping on word analogy tasks Mikolov et al. (2013a). In Table 4, we report the accuracy on analogies for raw word vectors and our vectors mapped to English with an alignment trained on the full MUSE training set. Regardless of the source language, the mapping does not negatively impact the word vectors. Similarly, our alignment also has little impact on word similarity, as shown in Table 6.
We confirm this observation by running the reverse mapping, i.e., by mapping the English word vectors of Mikolov et al. (2018) to Spanish. It leads to an improvement of both for vectors trained on Common Crawl ( to ) and Wikipedia + News ( to ).
This paper shows that minimizing a convex relaxation of the CSLS loss significantly improves the quality of bilingual word vector alignment. We use a reformulation of CSLS that generalizes to convex functions beyond dot-products and leads to a single end-to-end training that is consistent with the inference stage. Finally, we show that removing the orthogonality constraint does not degrade the quality of the aligned vectors.
We thank Guillaume Lample and Alexis Conneau for their feedback and help with MUSE.
- Artetxe et al. (2016) Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2016. Learning principled bilingual mappings of word embeddings while preserving monolingual invariance. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2289–2294.
- Artetxe et al. (2017) Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2017. Learning bilingual word embeddings with (almost) no bilingual data. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 451–462.
- Baroni et al. (2009) Marco Baroni, Silvia Bernardini, Adriano Ferraresi, and Eros Zanchetta. 2009. The wacky wide web: a collection of very large linguistically processed web-crawled corpora. Language resources and evaluation, 43(3):209–226.
- Bojanowski et al. (2017) Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146. https://fasttext.cc/docs/en/pretrained-vectors.html.
- Boumal et al. (2014) Nicolas Boumal, Bamdev Mishra, P.-A. Absil, and Rodolphe Sepulchre. 2014. Manopt, a Matlab toolbox for optimization on manifolds. Journal of Machine Learning Research, 15:1455–1459.
- Boyd and Vandenberghe (2004) Stephen Boyd and Lieven Vandenberghe. 2004. Convex optimization. Cambridge university press.
- Camacho-Collados et al. (2016) José Camacho-Collados, Mohammad Taher Pilehvar, and Roberto Navigli. 2016. Nasari: Integrating explicit knowledge and corpus statistics for a multilingual representation of concepts and entities. Artificial Intelligence, 240:36–64.
- Conneau et al. (2017) Alexis Conneau, Guillaume Lample, Marc’Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2017. Word translation without parallel data. arXiv preprint arXiv:1710.04087. http://github.com/facebookresearch/MUSE.
- Dinu et al. (2014) Georgiana Dinu, Angeliki Lazaridou, and Marco Baroni. 2014. Improving zero-shot learning by mitigating the hubness problem. arXiv preprint arXiv:1412.6568.
- Doddington et al. (1998) George Doddington, Walter Liggett, Alvin Martin, Mark Przybocki, and Douglas Reynolds. 1998. Sheep, goats, lambs and wolves: A statistical analysis of speaker performance in the NIST 1998 speaker recognition evaluation. Technical report.
- Gower and Dijksterhuis (2004) John C Gower and Garmt B Dijksterhuis. 2004. Procrustes problems, volume 30. Oxford University Press on Demand.
- Grave et al. (2018) Edouard Grave, Armand Joulin, and Quentin Berthet. 2018. Unsupervised alignment of embeddings with wasserstein procrustes. arXiv preprint arXiv:1805.11222.
- Hoshen and Wolf (2018) Yedid Hoshen and Lior Wolf. 2018. An iterative closest point method for unsupervised word translation. arXiv preprint arXiv:1801.06126.
- Klementiev et al. (2012) Alexandre Klementiev, Ivan Titov, and Binod Bhattarai. 2012. Inducing crosslingual distributed representations of words. In Proceedings of COLING 2012, pages 1459–1474.
- Mikolov et al. (2013a) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
- Mikolov et al. (2018) Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, and Armand Joulin. 2018. Advances in pre-training distributed word representations. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Paris, France. European Language Resources Association (ELRA).
- Mikolov et al. (2013b) Tomas Mikolov, Quoc V Le, and Ilya Sutskever. 2013b. Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168.
- Smith et al. (2017) Samuel L Smith, David HP Turban, Steven Hamblin, and Nils Y Hammerla. 2017. Offline bilingual word vectors, orthogonal transformations and the inverted softmax. arXiv preprint arXiv:1702.03859. https://github.com/Babylonpartners/fastText_multilingual.
- Xing et al. (2015) Chao Xing, Dong Wang, Chao Liu, and Yiye Lin. 2015. Normalized word embedding and orthogonal transform for bilingual word translation. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1006–1011.
Appendix A Ablation study
This appendix presents an ablation study to validate the design choices that we made.
Size of training lexicon.
Figure 1 compares the accuracy of RCSLS and Procrustes as a function of the training set size. On small training sets, the difference between RCSLS and Procrustes is marginal but increases with the training set size.
Impact of the number of nearest neighbors.
The CSLS criterion and the RCSLS loss are sensitive to the number of nearest neighbors. Figure 2 shows the impact of this parameter on both Procrustes and RCSLS. Procrustes is impacted through the retrieval criterion while RCSLS is impacted by the loss and the criterion. Taking nearest neighbors is optimal and the performance decreases significantly with a large number of neighbors.
Comparison between different functions in CSLS on four language pairs. Linear is the standard criterion, while logSumExp is equivalent to a logistic regression with hard mining.
Comparison of alternative criteria.
As discussed in the main paper, the dot product in the CSLS terms can be replaced by any convex function of $\mathbf{W}$ and still yield a convex objective. Using a logSumExp function, i.e., $\log \sum_{j} \exp(z_j)$ over the neighborhood scores $z_j$, is equivalent to a “local” logistic regression classifier, or equivalently, to a logistic regression with hard mining. In this experiment, we train our model using the alternative loss and report the accuracy of the resulting lexicon in Table 7. We observe that this choice does not significantly change the performance. This suggests that the local property of the criterion is more important than the form of the loss.
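As a rough illustration of the logSumExp variant, the sketch below replaces the sum over the $k$ largest dot products with a log-sum-exp over the same entries. How the term is scaled or normalized in the actual experiments is not specified here, so this form, the function name and the toy scores are all assumptions.

```python
import numpy as np

def logsumexp_neighborhood(scores, k):
    """For each row of `scores`, log-sum-exp over its k largest entries."""
    top = np.sort(scores, axis=1)[:, -k:]
    m = top.max(axis=1, keepdims=True)            # shift for numerical stability
    return (m + np.log(np.exp(top - m).sum(axis=1, keepdims=True))).ravel()

scores = np.array([[0.1, 0.9, 0.5, -0.3],
                   [0.2, 0.2, 0.2, 0.2]])
lse = logsumexp_neighborhood(scores, k=2)
```

Restricting the sum to the top-`k` entries is what makes the classifier “local”: only the hardest negatives contribute to the loss.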
|with exact string matches|
|without exact string matches|
|Adv. + ref. + NN||79.1||78.1||78.1||78.2||71.3||69.6||37.3||54.3||30.9||21.9||59.9|
|Adv. + ref. + CSLS||81.7||83.3||82.3||82.1||74.0||72.2||44.0||59.1||32.5||31.4||64.3|
|Procrustes + NN||77.4||77.3||74.9||76.1||68.4||67.7||47.0||58.2||40.6||30.2||61.8|
|Procrustes + CSLS||81.4||82.9||81.1||82.4||73.5||72.4||51.7||63.7||42.7||36.7||66.8|
|RCSLS + NN||81.1||84.9||80.5||80.5||75.0||72.3||55.3||67.1||43.6||40.1||68.0|
|RCSLS + CSLS||84.1||86.3||83.3||84.1||79.1||76.3||57.9||67.2||45.9||46.4||71.0|
Exact string matches.
The MUSE datasets contain exact string matches based on vocabularies built on Wikipedia. The matches may reflect correct translations but can come from other sources, such as English words that frequently appear in movie or song titles. Table 8 compares alignments on the MUSE test set with and without exact string matches, averaged over languages. Note that we do not remove exact matches in the training sets for fair comparison with MUSE vectors. We note that the gap between our vectors and others is larger with an NN criterion. We also observe that the performance of all the methods drops when the exact string matches are removed.
Impact of the retrieval criterion.
Table 9 shows performance on MUSE with a nearest neighbor (NN) criterion. Replacing CSLS by NN leads to a smaller drop for RCSLS () than for competitors (around ), suggesting that RCSLS transfers some local information encoded in the CSLS criterion to the dot product.
Appendix B Alignment and word vectors
In this appendix, we look at the relation between the quality of the word vectors and the quality of an alignment. We measure both the impact of the vectors on the alignment and the impact of a non-orthogonal mapping on word vectors.
|without subword||with subword|
Quality of the embedding model.
In this experiment, we study the impact of the quality of the word vectors on the performance of word translation. For this purpose, we trained word vectors on the same Wikipedia data, using skipgram with and without subword information. In Table 10, we report the accuracy for different language pairs when using these two sets of word vectors. Overall, we observe that using subword information improves the accuracy by a few points on all pairs.
Impact on English word vectors.
We evaluate the impact of a non-orthogonal mapping on the English word analogy task Mikolov et al. (2013a). Table 11 compares on analogies the raw English word vectors to their alignments to languages. Regardless of the target language, the mapping does not have a negative impact on the word vectors.
Finally, we evaluate our aligned vectors on the task of cross-lingual word similarity in Table 12. They obtain similar results to vectors aligned with an orthogonal matrix. These experiments concur with the previous observation that a linear non-orthogonal mapping does not hurt the geometry of the word vector space.