Improving Supervised Bilingual Mapping of Word Embeddings

04/20/2018 ∙ by Armand Joulin, et al. ∙ Facebook

Continuous word representations, learned on different languages, can be aligned with remarkable precision. Using a small bilingual lexicon as training data, learning the linear transformation is often formulated as a regression problem using the square loss. The obtained mapping is known to suffer from the hubness problem, when used for retrieval tasks (e.g. for word translation). To address this issue, we propose to use a retrieval criterion instead of the square loss for learning the mapping. We evaluate our method on word translation, showing that our loss function leads to state-of-the-art results, with the biggest improvements observed for distant language pairs such as English-Chinese.


1 Introduction

Previous work has proposed to learn a linear mapping between continuous representations of words by employing a small bilingual lexicon as supervision. The transformation generalizes well to words that are not observed during training, making it possible to extend the lexicon. Another application is to transfer predictive models between languages Klementiev et al. (2012).

The first simple method, proposed by Mikolov et al. (2013b), has been subsequently improved by changing the problem parametrization. One successful suggestion is to ℓ2-normalize the word vectors and to constrain the linear mapping to be orthogonal Xing et al. (2015). An alignment is then efficiently found using orthogonal Procrustes Artetxe et al. (2016); Smith et al. (2017), improving the accuracy on standard benchmarks.

Yet, the resulting models suffer from the so-called “hubness problem”: some word vectors tend to be the nearest neighbors of an abnormally high number of other words. This limitation is now addressed by applying a corrective metric at inference time, such as the inverted softmax (ISF) Smith et al. (2017) or the cross-domain similarity local scaling (CSLS) Conneau et al. (2017). This is not fully satisfactory because the loss used for inference is not consistent with that employed for training. This observation suggests that the square loss is suboptimal and could advantageously be replaced by a loss adapted to retrieval.

In this paper, we propose a training objective inspired by the CSLS retrieval criterion. We introduce convex relaxations of the corresponding objective function, which are efficiently optimized with projected subgradient descent. This loss can advantageously include unsupervised information and therefore leverage the representations of words not occurring in the training lexicon.

Our contributions are as follows. First, we introduce our approach and empirically evaluate it on standard benchmarks for word translation, obtaining state-of-the-art bilingual mappings for a wide range of language pairs. Second, we specifically show the benefit of our alternative loss function and of leveraging unsupervised information. Finally, we show that with our end-to-end formulation, a non-orthogonal mapping achieves better results. The code for our approach is part of the fastText library (https://github.com/facebookresearch/fastText/tree/master/alignment/) and the aligned vectors are available at https://fasttext.cc/.

2 Preliminaries on bilingual mappings

This section introduces prerequisites and prior work on learning a mapping between two languages, using a small bilingual lexicon as supervision.

We start from two sets of continuous representations in two languages, each learned on monolingual data. Let us introduce some notation. Each word $i$ in the source language (respectively target language) is associated with a vector $x_i \in \mathbb{R}^d$ (respectively $y_i \in \mathbb{R}^d$). For simplicity, we assume that our initial lexicon, or seeds, corresponds to the first $n$ pairs $(x_i, y_i)_{i \in \{1,\dots,n\}}$. The goal is to extend the lexicon to all source words that are not seeds. Mikolov et al. (2013b) learn a linear mapping $W \in \mathbb{R}^{d \times d}$ between the word vectors of the seed lexicon that minimizes a measure of discrepancy between mapped word vectors of the source language and word vectors of the target language:

$$\min_{W \in \mathbb{R}^{d \times d}} \; \frac{1}{n} \sum_{i=1}^{n} \ell(W x_i, y_i), \qquad (1)$$

where $\ell$ is a loss function, typically the square loss $\ell(x, y) = \|x - y\|_2^2$. This leads to a least squares problem, which is solved in closed form.

Orthogonality.

The linear mapping $W$ is constrained to be orthogonal, i.e., such that $W^\top W = I_d$, where $I_d$ is the $d$-dimensional identity matrix. This choice preserves distances between word vectors, and likewise word similarities. Previous works Xing et al. (2015); Artetxe et al. (2016); Smith et al. (2017) experimentally observed that constraining the mapping in such a way improves the quality of the inferred lexicon. With the square loss and by enforcing an orthogonal mapping $W$, Eq. (1) admits the closed form solution of the orthogonal Procrustes problem Gower and Dijksterhuis (2004): $W^* = U V^\top$, where $U \Sigma V^\top$ is the singular value decomposition of the matrix $Y X^\top$, with $X$ and $Y$ the $d \times n$ matrices whose columns are the seed word vectors $x_i$ and $y_i$.
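As an illustration, the closed-form Procrustes solution can be computed in a few lines of NumPy. The sketch below is our own (it is not the released fastText alignment code) and assumes word vectors stored as columns:

```python
import numpy as np

def procrustes(X, Y):
    """Orthogonal Procrustes solution of Eq. (1) with the square loss.

    X, Y: d x n matrices whose columns are the seed word vectors x_i and y_i.
    Returns W* = U V^T, where U S V^T is the SVD of Y X^T.
    """
    U, _, Vt = np.linalg.svd(Y @ X.T)
    return U @ Vt
```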

Inference.

Once a mapping $W$ is learned, one can infer word correspondences for words that are not in the initial lexicon. The translation $t(i)$ of a source word $i$ is obtained as

$$t(i) \in \operatorname*{argmin}_{j} \; \ell(W x_i, y_j). \qquad (2)$$

When the square loss is used, this amounts to computing $W x_i$ and performing a nearest neighbor search with respect to the Euclidean distance:

$$t(i) \in \operatorname*{argmin}_{j} \; \| W x_i - y_j \|_2^2. \qquad (3)$$
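For concreteness, the nearest neighbor inference of Eq. (3) corresponds to the following minimal sketch (our own illustration, with vectors stored as rows and hypothetical names):

```python
import numpy as np

def translate_nn(W, X_src, Y_tgt):
    """Eq. (3): translate each source word by a nearest neighbor search
    in Euclidean distance among the target vectors."""
    mapped = X_src @ W.T                                           # rows are W x_i
    d2 = ((mapped[:, None, :] - Y_tgt[None, :, :]) ** 2).sum(-1)   # squared distances
    return d2.argmin(axis=1)                                       # index of the predicted translation
```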

Hubness.

A common observation is that nearest neighbor search for bilingual lexicon inference suffers from the “hubness problem” Doddington et al. (1998); Dinu et al. (2014). Hubs are words that appear too frequently in the neighborhoods of other words. To mitigate this effect, a simple solution is to replace, at inference time, the square $\ell_2$-norm in Eq. (3) by another criterion, such as ISF Smith et al. (2017) or CSLS Conneau et al. (2017).

This solution, both with ISF and CSLS criteria, is applied with a transformation learned using the square loss. However, replacing the loss in Eq. (3) creates a discrepancy between the learning of the translation model and the inference.

3 Word translation as a retrieval task

In this section, we propose to directly include the CSLS criterion in the model in order to make learning and inference consistent. We also show how to incorporate unsupervised information.

The CSLS criterion measures the similarity between the vectors $x$ and $y$. Written as a loss to be minimized, it is defined as:

$$\mathrm{CSLS}(x, y) = -2 \cos(x, y) + \frac{1}{k} \sum_{y' \in \mathcal{N}_Y(x)} \cos(x, y') + \frac{1}{k} \sum_{x' \in \mathcal{N}_X(y)} \cos(x', y),$$

where $\mathcal{N}_Y(x)$ is the set of $k$ nearest neighbors of the point $x$ in the set of target word vectors $\mathcal{Y} = \{y_1, \dots, y_N\}$, and $\cos$ is the cosine similarity. Note that the second term in the expression of the CSLS loss does not change the nearest neighbors of $x$. However, it gives a loss function that is symmetric with respect to its two arguments, which is a desirable property for training.
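For concreteness, the CSLS loss can be computed for all source/target pairs at once. The NumPy sketch below is our own illustration (not the released implementation) and assumes ℓ2-normalized vectors, so that dot products equal cosine similarities:

```python
import numpy as np

def csls_loss(WX, Y, k=10):
    """CSLS loss (lower is better) between every mapped source vector and every target.

    WX: m x d l2-normalized mapped source vectors W x_i (as rows).
    Y:  N x d l2-normalized target vectors (as rows).
    """
    cos = WX @ Y.T                                                     # cosine similarities
    r_src = np.sort(cos, axis=1)[:, -k:].mean(axis=1, keepdims=True)   # avg sim. of x to its k nearest targets
    r_tgt = np.sort(cos, axis=0)[-k:, :].mean(axis=0, keepdims=True)   # avg sim. of y to its k nearest mapped sources
    return -2.0 * cos + r_src + r_tgt

# The predicted translation of source word i is csls_loss(WX, Y).argmin(axis=1)[i].
```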

Objective function.

Let us now write the optimization problem for learning the bilingual mapping with CSLS. At this stage, we follow previous work and constrain the linear mapping $W$ to belong to the set of orthogonal matrices $\mathcal{O}_d$. Here, we also assume that the word vectors are $\ell_2$-normalized. Under these assumptions, we have $\cos(W x_i, y_j) = x_i^\top W^\top y_j$. Similarly, we have $\|W x_i - y_j\|_2^2 = 2 - 2\, x_i^\top W^\top y_j$. Therefore, finding the $k$ nearest neighbors of $W x_i$ among the elements of $\mathcal{Y}$ is equivalent to finding the $k$ elements of $\mathcal{Y}$ which have the largest dot product with $W x_i$. We adopt this equivalent formulation because it leads to a convex formulation when relaxing the orthogonality constraint on $W$. In summary, our optimization problem with the Relaxed CSLS loss (RCSLS) is written as:

$$\min_{W \in \mathcal{O}_d} \; \frac{1}{n} \sum_{i=1}^{n} \Big[ -2\, x_i^\top W^\top y_i + \frac{1}{k} \sum_{y_j \in \mathcal{N}_Y(W x_i)} x_i^\top W^\top y_j + \frac{1}{k} \sum_{W x_j \in \mathcal{N}_X(y_i)} x_j^\top W^\top y_i \Big]. \qquad (4)$$

Convex relaxation.

Eq. (4) involves the minimization of a non-smooth cost function over the manifold of orthogonal matrices $\mathcal{O}_d$. As such, it can be solved using manifold optimization tools (Boumal et al., 2014). In this work, we consider, as an alternative to the set $\mathcal{O}_d$, its convex hull $\mathcal{C}_d$, i.e., the unit ball of the spectral norm. We refer to this relaxation as the “Spectral” model. We also consider the case where the constraints on the alignment matrix are simply removed.

Having a convex domain allows us to reason about the convexity of the cost function. We observe that the second and third terms in the CSLS loss can be rewritten as maxima over subsets of size $k$; for instance,

$$\frac{1}{k} \sum_{y_j \in \mathcal{N}_Y(W x_i)} x_i^\top W^\top y_j \;=\; \frac{1}{k} \max_{\mathcal{S} \in \mathcal{S}_k(\mathcal{Y})} \sum_{y_j \in \mathcal{S}} x_i^\top W^\top y_j,$$

where $\mathcal{S}_k(\mathcal{Y})$ denotes the set of all subsets of $\mathcal{Y}$ of size $k$. This term, seen as a function of $W$, is a maximum of linear functions of $W$, which is convex (Boyd and Vandenberghe, 2004). This shows that our objective function is convex with respect to the mapping $W$ and piecewise linear (hence non-smooth). Note that our approach could be generalized to other loss functions by replacing the term $x_i^\top W^\top y_j$ by any function convex in $W$. We minimize this objective function over the convex set $\mathcal{C}_d$ by using the projected subgradient descent algorithm.

The projection onto the set $\mathcal{C}_d$ is obtained by taking the singular value decomposition (SVD) of the matrix and thresholding its singular values at one.
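To make the optimization concrete, here is a rough NumPy sketch of one projected subgradient step on Eq. (4) together with the spectral projection. The function and variable names are ours, the neighbor sets are recomputed at the current $W$ as in a standard subgradient scheme, and X_dict/Y_dict stand for the dictionaries used for the nearest neighbors (the seed vectors, or full vocabularies with the extended normalization described below):

```python
import numpy as np

def rcsls_subgradient(W, X, Y, X_dict, Y_dict, k=10):
    """Subgradient of the RCSLS loss of Eq. (4) at W (sketch).

    X, Y:           n x d l2-normalized seed pairs (rows x_i and y_i).
    X_dict, Y_dict: dictionaries used for the k nearest neighbors.
    """
    n = X.shape[0]
    G = -2.0 * (Y.T @ X) / n                                    # from the -2 x_i^T W^T y_i terms
    WX = X @ W.T                                                # mapped seed source vectors
    nn_y = np.argsort(WX @ Y_dict.T, axis=1)[:, -k:]            # k nearest targets of W x_i
    nn_x = np.argsort(Y @ (X_dict @ W.T).T, axis=1)[:, -k:]     # k nearest mapped sources of y_i
    for i in range(n):
        # d/dW of x^T W^T y is the outer product y x^T
        G += np.outer(Y_dict[nn_y[i]].sum(0), X[i]) / (k * n)
        G += np.outer(Y[i], X_dict[nn_x[i]].sum(0)) / (k * n)
    return G

def project_spectral(W):
    """Projection onto the unit ball of the spectral norm (the convex hull of O_d):
    clip the singular values of W at one."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ (np.minimum(s, 1.0)[:, None] * Vt)

# One projected subgradient step with step size lr:
# W = project_spectral(W - lr * rcsls_subgradient(W, X, Y, X_dict, Y_dict))
```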

Extended Normalization.

Usually, the number $n$ of word pairs in the seed lexicon is small with respect to the size of the dictionaries. To benefit from unlabeled data, it is common to add an iterative “refinement procedure” (Artetxe et al., 2017) when learning the translation model. Given a model $W_t$, this procedure iterates over two steps. First, it augments the training lexicon by keeping the best inferred translations in Eq. (3). Second, it learns a new mapping $W_{t+1}$ by solving the problem in Eq. (1). This strategy is similar to standard semi-supervised approaches in which the training set is augmented over time. In this work, we propose instead to use the unpaired words in the dictionaries as “negatives” in the RCSLS loss: instead of computing the $k$ nearest neighbors amongst the $n$ annotated words, we do it over the whole dictionary.

Method en-es es-en en-fr fr-en en-de de-en en-ru ru-en en-zh zh-en avg.
Adversarial + refine 81.7 83.3 82.3 82.1 74.0 72.2 44.0 59.1 32.5 31.4 64.3
ICP + refine 82.2 83.8 82.5 82.5 74.8 73.1 46.3 61.6 - - -
Wass. Proc. + refine 82.8 84.1 82.6 82.9 75.4 73.3 43.7 59.1 - - -
Least Square Error 78.9 80.7 79.3 80.7 71.5 70.1 47.2 60.2 42.3 4.0 61.5
Procrustes 81.4 82.9 81.1 82.4 73.5 72.4 51.7 63.7 42.7 36.7 66.8
Procrustes + refine 82.4 83.9 82.3 83.2 75.3 73.2 50.1 63.5 40.3 35.5 66.9
RCSLS + spectral 83.5 85.7 82.3 84.1 78.2 75.8 56.1 66.5 44.9 45.7 70.2
RCSLS 84.1 86.3 83.3 84.1 79.1 76.3 57.9 67.2 45.9 46.4 71.0
Table 1: Comparison between RCSLS, Least Square Error, Procrustes and unsupervised approaches in the setting of Conneau et al. (2017). All the methods use the CSLS criterion for retrieval. “Refine” is the refinement step of Conneau et al. (2017). Adversarial, ICP and Wasserstein Proc. are unsupervised (Conneau et al., 2017; Hoshen and Wolf, 2018; Grave et al., 2018).

4 Experiments

This section reports the main results obtained with our method. We provide complementary results and an ablation study in the appendix. We refer to our method without constraints as RCSLS and as RCSLS+spectral if the spectral constraints are used.

4.1 Implementation details

We choose the learning rate and the number of epochs on the validation set. For the unconstrained RCSLS, a small regularization term can be added to prevent the norm of $W$ from diverging; in practice, we do not use any regularization. For the English-Chinese pairs (en-zh), we center the word vectors. The number of nearest neighbors in the CSLS loss is $k = 10$. We use the $\ell_2$-normalized fastText word vectors of Bojanowski et al. (2017) trained on Wikipedia.

4.2 The MUSE benchmark

Table 1 reports the comparison of RCSLS with standard supervised and unsupervised approaches on 5 language pairs (in both directions) of the MUSE benchmark Conneau et al. (2017). Every approach uses the Wikipedia fastText vectors, and supervision comes in the form of a lexicon composed of 5k words and their translations. Regardless of the relaxation, RCSLS outperforms the state of the art by 3 to 4% in accuracy on average. This shows the importance of using the same criterion during training and inference. Note that the refinement step (“refine”) also uses CSLS to finetune the alignments but leads to a marginal gain for supervised methods.

Interestingly, RCSLS achieves better performance without constraints (+0.8% on average) for all pairs. Contrary to observations made in previous works, this result suggests that preserving distances between word vectors is not essential for word translation. Indeed, previous works used a square loss, for which orthogonality constraints do lead to an improvement (+5.3% for Procrustes versus Least Square Error). This suggests that a linear mapping with no constraints works well only if it is learned with a proper criterion.

en-es en-fr en-de en-ru avg.
Train 80.7 82.3 74.8 51.9 72.4
Ext. 84.1 83.3 79.1 57.9 76.1
Table 2: Accuracy with and without extended normalization for RCSLS. “Ext.” uses the full vocabulary and “Train” only uses the pairs from the training lexicon.

Impact of extended normalization.

Table 2 reports the gain brought by including words not in the lexicon (unannotated words) to the performance of RCSLS. Extending the dictionary significantly improves the performance on all language pairs.

en-it it-en
Adversarial + refine + CSLS 45.1 38.3
Mikolov et al. (2013b) 33.8 24.9
Dinu et al. (2014) 38.5 24.6
Artetxe et al. (2016) 39.7 33.8
Smith et al. (2017) 43.1 38.0
Procrustes + CSLS 44.9 38.5
RCSLS 45.5 38.0
Table 3: Accuracy on English and Italian with the setting of Dinu et al. (2014). “Adversarial” is an unsupervised technique. The adversarial and Procrustes results are from Conneau et al. (2017). We use a CSLS criterion for retrieval.

4.3 The WaCky dataset

Dinu et al. (2014) introduce a setting where word vectors are learned on the WaCky datasets (Baroni et al., 2009) and aligned with a noisy bilingual lexicon. We select the number of epochs on a validation set. Table 3 shows that RCSLS is on par with the state of the art. RCSLS is thus robust to relatively poor word vectors and noisy lexicons.

Original Aligned
Sem. Synt. Tot. Sem. Synt. Tot.
Cs 26.4 76.7 63.7 27.3 77.7 64.6
De 62.2 56.9 59.5 61.4 57.1 59.3
Es 54.5 59.4 56.8 55.1 61.1 57.9
Fr 76.0 54.7 68.5 75.2 55.1 68.1
It 51.8 62.0 56.9 52.7 63.8 58.2
Pl 49.7 62.4 53.4 50.9 63.2 54.5
Zh 42.6 - 42.6 47.2 - 47.2
Avg. 51.9 62.0 57.3 52.8 63.0 58.5
Table 4: Performance on word analogies for different languages. We compare the original embeddings to their mapping to English. The mappings are learned using the full MUSE bilingual lexicons. We use the fastText vectors of Bojanowski et al. (2017).
BP MUSE (Orig.) Proc. (Full) RCSLS (Orig.) RCSLS (Full)
Bg 55.7 57.5 58.1 63.9 65.2
Ca 66.5 70.9 70.5 73.8 75.0
Cs 63.9 64.5 66.3 68.2 71.1
Da 66.8 67.4 68.3 71.1 72.9
De 68.9 72.7 73.5 76.9 77.6
El 54.9 58.5 60.1 62.7 64.5
Es 82.1 83.5 84.5 86.4 87.1
Et 41.5 45.7 47.3 49.5 53.7
Fi 56.7 59.5 61.9 65.8 69.9
Fr 81.7 82.4 82.5 84.7 84.7
He 51.5 54.1 55.4 57.8 60.0
Hr 48.9 52.2 53.4 55.6 60.2
Hu 61.9 64.9 66.1 69.3 73.1
Id 62.8 67.9 67.9 69.7 72.9
It 75.3 77.9 78.5 81.5 82.8
Mk 53.9 54.6 55.4 59.9 60.4
Nl 72.0 75.3 76.1 79.7 80.5
No 65.3 67.4 68.3 71.2 73.3
Pl 63.3 66.9 68.1 70.5 73.5
Pt 77.7 80.3 80.4 82.9 84.6
Ro 66.3 68.1 67.6 74.0 73.9
Ru 61.3 63.7 64.3 67.1 70.3
Sk 55.1 55.3 57.9 59.0 61.7
Sl 51.1 50.4 52.5 54.2 58.2
Sv 55.9 60.0 64.0 63.7 69.5
Tr 57.4 59.2 61.4 61.9 65.8
Uk 48.7 49.3 51.3 51.5 55.5
Vi 35.0 55.8 63.0 55.8 66.9
CSLS 60.8 63.8 65.2 67.4 70.2
NN 54.6 57.4 57.5 62.4 68.5
Table 5: Comparison with publicly available aligned vectors over 28 languages. All methods use supervision. Alignments are learned on either the “Original” or the “Full” MUSE training lexicon. We report the detailed performance with a CSLS criterion and the average for both NN and CSLS criteria. BP uses a different training set of comparable size.

4.4 Comparison with existing aligned vectors

Recently, word vectors based on fastText have been aligned and released by Smith et al. (2017) (BabylonPartners, BP) and by Conneau et al. (2017) (MUSE). Both use a variation of Procrustes to align word vectors in the same space.

We compare these methods to RCSLS and report results in Table 5. RCSLS improves the performance by 3.6% on average over MUSE vectors when trained with the same lexicon (Original). Training RCSLS on the full training lexicon (Full) brings a further improvement of 2.8% on average with a CSLS criterion, and of 6.1% with a NN criterion. For reference, the performance of Procrustes only improves by 1.4% with CSLS and even degrades with a NN criterion. RCSLS thus benefits more from additional supervision than Procrustes. Finally, the gap between RCSLS and the other methods is larger with a NN criterion, suggesting that RCSLS imports some of the properties of CSLS to the dot product between aligned vectors.

Original Aligned
De Gur350 72 74
Ws350 68 70
Zg222 46 44
Es Ws353 59 61
It Ws350 64 64
Pt Ws353 60 58
Ro Ws353 58 55
Hj 67 65
Ws350 59 58
Zh Sim 35 44
Avg.    58.8    59.2
Table 6: Performance on word similarities for different languages. We compare the original embeddings to their mapping to English. The mappings are learned with the full MUSE bilingual lexicons over the fastText vectors of Bojanowski et al. (2017).

4.5 Impact on word vectors

A non-orthogonal mapping of word vectors changes their dot products. We evaluate the impact of this mapping on word analogy tasks Mikolov et al. (2013a). In Table 4, we report the accuracy on analogies for raw word vectors and for our vectors mapped to English with an alignment trained on the full MUSE training set. Regardless of the source language, the mapping does not negatively impact the word vectors. Similarly, our alignment also has little impact on word similarity, as shown in Table 6.

We confirm this observation by running the reverse mapping, i.e., by mapping the English word vectors of Mikolov et al. (2018) to Spanish. This leads to an improvement in analogy accuracy for vectors trained both on Common Crawl and on Wikipedia + News.

5 Conclusion

This paper shows that minimizing a convex relaxation of the CSLS loss significantly improves the quality of bilingual word vector alignment. We use a reformulation of CSLS that generalizes to convex functions beyond the dot product and provides a single end-to-end training objective that is consistent with the inference stage. Finally, we show that removing the orthogonality constraint does not degrade the quality of the aligned vectors.

Acknowledgement.

We thank Guillaume Lample and Alexis Conneau for their feedback and help with MUSE.

References

Appendix A Ablation study

This appendix presents an ablation study, to validate the design choices that we made.

Figure 1: Accuracy as a function of the training set size (log scale) on the en-de pair.

Size of training lexicon.

Figure 1 compares the accuracy of RCSLS and Procrustes as a function of the training set size. On small training sets, the difference between RCSLS and Procrustes is marginal but increases with the training set size.

Figure 2: Accuracy as a function of the number of nearest neighbors, averaged over different pairs.

Impact of the number of nearest neighbors.

The CSLS criterion and the RCSLS loss are sensitive to the number of nearest neighbors $k$. Figure 2 shows the impact of this parameter on both Procrustes and RCSLS. Procrustes is affected only through the retrieval criterion, while RCSLS is affected through both the loss and the criterion. Taking $k = 10$ nearest neighbors is optimal, and the performance decreases significantly with a larger number of neighbors.

en-es es-en en-ru ru-en
Linear 84.1 86.3 58.0 67.2
logSumExp 84.1 86.3 58.3 67.0
Table 7: Comparison between different functions in the CSLS terms on four language pairs. Linear is the standard criterion, while logSumExp is equivalent to a logistic regression with hard mining.

Comparison of alternative criteria.

As discussed in the main paper, the dot product in the CSLS terms can be replaced by any function convex in $W$ and still yield a convex objective. Using a logSumExp function in place of the linear term is equivalent to a “local” logistic regression classifier, or equivalently, to a logistic regression with hard mining. In this experiment, we train our model with this alternative loss and report the accuracy of the resulting lexicon in Table 7. We observe that this choice does not significantly change the performance, suggesting that the local property of the criterion matters more than the exact form of the loss.
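To make this concrete, one plausible instantiation of the logSumExp variant is sketched below; the function name, the temperature beta, and the restriction to the k nearest targets are our own illustrative assumptions rather than the paper's exact formulation:

```python
import numpy as np
from scipy.special import logsumexp

def soft_neighbor_term(WX, Y, k=10, beta=30.0):
    """A soft-maximum replacement for the CSLS neighborhood term:
    (1/beta) * logsumexp(beta * s) over the k largest dot products s,
    which remains convex in W since each s is linear in W."""
    sims = WX @ Y.T                               # dot products with all targets
    topk = np.sort(sims, axis=1)[:, -k:]          # k largest per source word
    return logsumexp(beta * topk, axis=1) / beta
```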

BP MUSE (Orig.) RCSLS (Orig.) Proc. (Full) RCSLS (Full)
with exact string matches
NN 54.6 57.4 62.4 57.5 68.5
CSLS 60.8 63.8 67.4 65.2 70.2
without exact string matches
NN 56.6 55.5 61.4 53.7 64.3
CSLS 61.5 60.4 65.4 60.2 65.7
Table 8: Comparison with publicly available aligned vectors, averaged over language pairs. All methods use supervision. Alignments are learned on either the “Original” or the “Full” MUSE training lexicon. We report performance with the NN and CSLS criteria, on either the full MUSE test set or with the exact string matches removed. BP uses a different training set of comparable size.
Method en-es es-en en-fr fr-en en-de de-en en-ru ru-en en-zh zh-en avg.
Adv. + ref. + NN 79.1 78.1 78.1 78.2 71.3 69.6 37.3 54.3 30.9 21.9 59.9
Adv. + ref. + CSLS 81.7 83.3 82.3 82.1 74.0 72.2 44.0 59.1 32.5 31.4 64.3
Procrustes + NN 77.4 77.3 74.9 76.1 68.4 67.7 47.0 58.2 40.6 30.2 61.8
Procrustes + CSLS 81.4 82.9 81.1 82.4 73.5 72.4 51.7 63.7 42.7 36.7 66.8
RCSLS + NN 81.1 84.9 80.5 80.5 75.0 72.3 55.3 67.1 43.6 40.1 68.0
RCSLS + CSLS 84.1 86.3 83.3 84.1 79.1 76.3 57.9 67.2 45.9 46.4 71.0
Table 9: Comparison between a nearest neighbor (NN) criterion and CSLS.

Exact string matches.

The MUSE datasets contain exact string matches arising from vocabularies built on Wikipedia. These matches may reflect correct translations, but they can also come from other sources, such as English words that frequently appear in movie or song titles. Table 8 compares alignments on the MUSE test set with and without exact string matches, averaged over languages. Note that, for a fair comparison with the MUSE vectors, we do not remove exact matches from the training sets. The gap between our vectors and the others is larger with an NN criterion. We also observe that the performance of all methods drops when the exact string matches are removed.

Impact of the retrieval criterion.

Table 9 shows performance on MUSE with a nearest neighbor (NN) criterion. Replacing CSLS by NN leads to a smaller drop for RCSLS (3.0%) than for its competitors (around 4 to 5%), suggesting that RCSLS transfers some of the local information encoded in the CSLS criterion to the dot product.

Appendix B Alignment and word vectors

In this appendix, we look at the relation between the quality of the word vectors and the quality of an alignment. We measure both the impact of the vectors on the alignment and the impact of a non-orthogonal mapping on word vectors.

without subword with subword
en-es 82.8 84.1
es-en 84.1 86.3
en-fr 82.3 83.3
fr-en 82.5 84.1
en-de 78.5 79.1
de-en 74.1 76.3
Table 10: Impact of the quality of the word vectors, trained on the same corpora, on the alignment accuracy.

Quality of the embedding model.

In this experiment, we study the impact of the quality of the word vectors on the performance of word translation. For this purpose, we trained word vectors on the same Wikipedia data, using skipgram with and without subword information. In Table 10, we report the accuracy for different language pairs when using these two sets of word vectors. Overall, we observe that using subword information improves the accuracy by a few points on all pairs.

Sem. Synt. Tot.
Orig. 79.4 73.4 76.1
en-es 80.5 75.8 78.0
en-fr 79.8 75.9 77.6
en-de 80.0 75.9 77.6
en-ru 79.5 74.6 76.8
Table 11: Semantic and syntactic accuracies of English vectors and their mappings to different languages.

Impact on English word vectors.

We evaluate the impact of a non-orthogonal mapping on the English word analogy task Mikolov et al. (2013a). Table 11 compares, on analogies, the raw English word vectors to their alignments to four different languages. Regardless of the target language, the mapping does not have a negative impact on the word vectors.

en-es en-de en-it
NASARI baseline 0.64 0.60 0.65
BP 0.72 0.69 0.71
MUSE 0.71 0.71 0.71
RCSLS 0.71 0.71 0.71
Table 12: Cross-lingual word similarity on the NASARI datasets of Camacho-Collados et al. (2016). We report the Pearson correlation. BP, MUSE and RCSLS use the Wikipedia fastText vectors.

Cross-lingual similarity.

Finally, we evaluate our aligned vectors on the task of cross-lingual word similarity in Table 12. They obtain results similar to vectors aligned with an orthogonal matrix. These experiments concur with the previous observation that a linear, non-orthogonal mapping does not hurt the geometry of the word vector space.