## 1 Introduction

Pre-trained continuous representations of words are standard building blocks of many natural language processing and machine learning systems (mikolov2013distributed). Word vectors are designed to summarize and quantify semantic nuances through a few hundred coordinates. Such representations are typically used in downstream tasks to improve generalization when the amount of data is scarce (collobert2011natural). The distributional information used to learn these word vectors derives from statistical properties of word co-occurrence found in large corpora (deerwester1990indexing). Such corpora are, by design, monolingual (mikolov2013distributed; FastText), resulting in the independent learning of word embeddings for each language.

A limitation of these monolingual embeddings is that it is impossible to compare words across languages. It is thus natural to try to combine all these word representations into a common multilingual space, where every language could be mapped. Mikolov13 observed that word vectors learned on different languages share a similar structure. More precisely, two sets of pre-trained vectors in different languages can be aligned to some extent: a linear mapping between the two sets of embeddings is enough to produce decent word translations. Recently, there has been an increasing interest in mapping these pre-trained vectors into a common space (Xing15; artetxe2017learning), resulting in many publicly available embeddings in many languages mapped into a single common vector space (smith2017offline; conneau2017word; joulin2018loss). The quality of these multilingual embeddings can be tested by composing mappings between languages and looking at the resulting translations. As an example, learning a direct mapping between Italian and Portuguese yields a noticeably higher word translation accuracy under a nearest neighbor (NN) criterion than composing the mapping from Italian to English with the mapping from English to Portuguese.
Practically speaking, it is not surprising to see such a degradation since these bilingual alignments are trained separately, without enforcing transitivity.

In this paper, we propose a novel approach to align multiple languages simultaneously in a common space in a way that enforces transitive translations. Our method relies on constraining word translations to be coherent between languages when mapped to the common space. nakashole2017knowledge have recently shown that similar constraints over a well-chosen triplet of languages improve supervised bilingual alignment. Our work extends their conclusions to the unsupervised case. We show that our approach achieves competitive performance while enforcing composition.

## 2 Preliminaries on bilingual alignment

In this section, we provide a brief overview of bilingual alignment methods to learn a mapping between two sets of embeddings, and discuss their limits when used in multilingual settings.

### 2.1 Supervised bilingual alignment

Mikolov13 formulate the problem of word vector alignment as a quadratic problem. Given two sets of word vectors stacked in two matrices $\mathbf{X} \in \mathbb{R}^{n \times d}$ and $\mathbf{Y} \in \mathbb{R}^{n \times d}$, and an assignment matrix $\mathbf{P}$ built on a bilingual lexicon, the mapping matrix $\mathbf{W}$ is the solution of a least-squares problem:

$$\min_{\mathbf{W} \in \mathbb{R}^{d \times d}} \ \|\mathbf{X}\mathbf{W} - \mathbf{P}\mathbf{Y}\|_2^2,$$

which admits a closed-form solution. Restricting $\mathbf{W}$ to the set of orthogonal matrices $\mathcal{O}_d$ improves the alignments (Xing15). The resulting problem, known as orthogonal Procrustes, still admits a closed-form solution through a singular value decomposition (schönemann66).

#### Alternative loss function.
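As an illustration, the orthogonal Procrustes solution via SVD can be written in a few lines of numpy. This is a generic sketch (the function name and shapes are our own choices, not taken from any released implementation):

```python
import numpy as np

def procrustes(X, Y):
    """Orthogonal Procrustes: argmin over orthogonal W of ||X W - Y||_F.

    X, Y: (n, d) matrices of paired word vectors.
    The minimizer is W = U V^T, where U S V^T is the SVD of X^T Y.
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt
```

If `Y` is an exact orthogonal transform of `X`, the procedure recovers that transform up to numerical precision.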

The $\ell_2$ loss is intrinsically associated with the nearest neighbor (NN) criterion. This criterion suffers from the existence of “hubs”, i.e., data points that are nearest neighbors to all other data points (dinu2014improving). Alternative criteria have been suggested, such as the inverted softmax (smith2017offline) and CSLS (conneau2017word). Recently, joulin2018loss have shown that directly minimizing a loss inspired by the CSLS criterion significantly improves the quality of the retrieved word translations. Their loss function, called RCSLS, is defined for normalized vectors as:

$$\mathrm{RCSLS}(\mathbf{X}\mathbf{W}, \mathbf{Y}) = \frac{1}{n} \sum_{i=1}^{n} \Big( -2\, \mathbf{x}_i^\top \mathbf{W}^\top \mathbf{y}_i + \frac{1}{k} \sum_{\mathbf{y} \in \mathcal{N}_Y(\mathbf{W}\mathbf{x}_i)} \mathbf{x}_i^\top \mathbf{W}^\top \mathbf{y} + \frac{1}{k} \sum_{\mathbf{x} \in \mathcal{N}_X(\mathbf{y}_i)} \mathbf{x}^\top \mathbf{W}^\top \mathbf{y}_i \Big),$$

where $\mathcal{N}_Y(\mathbf{W}\mathbf{x}_i)$ denotes the $k$ nearest neighbors of $\mathbf{W}\mathbf{x}_i$ among the vectors of $\mathbf{Y}$ (and similarly for $\mathcal{N}_X$). This loss is a tight convex relaxation of the CSLS criterion for normalized word vectors, and can be efficiently minimized with a subgradient method.
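To make the structure of this loss concrete, here is a numpy sketch for row-normalized vectors aligned by index; `rcsls_loss` and its exact form are our reconstruction for illustration, not the reference implementation:

```python
import numpy as np

def rcsls_loss(X, Y, W, k=10):
    """RCSLS-style loss for row-normalized X, Y aligned by index (sketch).

    For each pair (x_i, y_i): -2 * cos(W x_i, y_i), plus the mean
    similarity of W x_i to its k nearest targets and of y_i to its k
    nearest mapped sources (the hubness-correction terms of CSLS).
    """
    XW = X @ W                      # mapped source vectors
    sim = XW @ Y.T                  # cosine similarities (unit-norm rows)
    # mean of the k largest similarities per row and per column
    knn_rows = np.sort(sim, axis=1)[:, -k:].mean(axis=1)
    knn_cols = np.sort(sim, axis=0)[-k:, :].mean(axis=0)
    return (-2 * np.diag(sim) + knn_rows + knn_cols).mean()
```

With `k=1` and a perfect alignment (`X == Y`, `W` the identity), the self-similarity term and the two correction terms cancel and the loss is zero.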

### 2.2 Unsupervised bilingual Alignment: Wasserstein-Procrustes

In the setting of unsupervised bilingual alignment, the assignment matrix $\mathbf{P}$ is unknown and must be learned jointly with the mapping $\mathbf{W}$. An assignment matrix represents a one-to-one correspondence between the two sets of words, i.e., $\mathbf{P}$ is a bi-stochastic matrix with binary entries. The set of assignment matrices $\mathcal{P}_n$ is thus defined as:

$$\mathcal{P}_n = \left\{ \mathbf{P} \in \{0, 1\}^{n \times n}, \ \mathbf{P} \mathbf{1}_n = \mathbf{1}_n, \ \mathbf{P}^\top \mathbf{1}_n = \mathbf{1}_n \right\}.$$

The resulting approach, called Wasserstein-Procrustes (Zhang17b; grave18), jointly learns both matrices by solving the following problem:

$$\min_{\mathbf{W} \in \mathcal{O}_d, \ \mathbf{P} \in \mathcal{P}_n} \ \|\mathbf{X}\mathbf{W} - \mathbf{P}\mathbf{Y}\|_2^2. \qquad (1)$$

This problem is not convex since neither of the sets $\mathcal{O}_d$ and $\mathcal{P}_n$ is convex. Minimizing over each variable separately leads, however, to well understood optimization problems: when $\mathbf{P}$ is fixed, solving for $\mathbf{W}$ amounts to the orthogonal Procrustes problem; when $\mathbf{W}$ is fixed, an optimal permutation matrix $\mathbf{P}$ can be obtained with the Hungarian algorithm. A simple heuristic to address Eq. (1) is thus to use an alternate optimization. Both algorithms have a cubic complexity, but in different quantities: Procrustes is cubic in the dimension $d$ of the vectors, whereas the Hungarian algorithm is cubic in the number $n$ of vectors, with $n$ typically orders of magnitude larger than $d$. Directly applying the Hungarian algorithm is computationally prohibitive, but efficient alternatives exist. cuturi13 shows that regularizing this problem with a negative entropy leads to a Sinkhorn algorithm with a complexity quadratic in $n$, up to logarithmic factors (altschuler2017near).

As for many non-convex problems, a good initial guess helps converge to better local minima. grave18 compute an initial $\mathbf{P}$ with a convex relaxation of the quadratic assignment problem. We found, however, that the entropic regularization of the Gromov-Wasserstein (GW) problem (memoli2011gromov) worked well in practice and was significantly faster (solomon2016entropic; peyre2016gromov):

$$\min_{\mathbf{P} \in \mathcal{P}_n} \ \sum_{i,j,k,l} \left( \mathbf{x}_i^\top \mathbf{x}_j - \mathbf{y}_k^\top \mathbf{y}_l \right)^2 \mathbf{P}_{ik} \mathbf{P}_{jl} - \varepsilon H(\mathbf{P}),$$

where $H(\mathbf{P})$ denotes the entropy of $\mathbf{P}$.

The case $\varepsilon = 0$ corresponds to memoli2011gromov’s initial proposal. Optimizing the regularized version ($\varepsilon > 0$) leads to a local minimum that can be used as an initialization to solve Eq. (1). Note that a similar formulation was recently used in the same context by alvarez2018gromov.
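The entropy-regularized assignment step can be illustrated with a minimal Sinkhorn iteration over a dense cost matrix with uniform marginals; this is a textbook sketch, not the exact solver used by the authors:

```python
import numpy as np

def sinkhorn(C, reg=0.1, n_iter=500):
    """Entropy-regularized OT (Sinkhorn) between two uniform point sets.

    C: (n, n) cost matrix, e.g. C[i, j] = ||W x_i - y_j||^2.
    Returns a transport plan P with (approximately) uniform marginals.
    """
    n = C.shape[0]
    K = np.exp(-C / reg)            # Gibbs kernel
    a = b = np.ones(n) / n          # uniform marginals
    u = np.ones(n) / n
    for _ in range(n_iter):
        v = b / (K.T @ u)           # scale columns
        u = a / (K @ v)             # scale rows
    return u[:, None] * K * v[None, :]
```

Each iteration only involves matrix-vector products with `K`, hence the quadratic per-iteration cost mentioned above.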

## 3 Composable Multilingual Alignments

In this section, we propose an unsupervised approach to jointly align sets of vectors to a unique common space while preserving the quality of word translations between all pairs of languages.

### 3.1 Multilingual alignment to a common space

Given $N$ sets of word vectors, we are interested in aligning these to a common target space. For simplicity, we assume that this target space coincides with one of the word vector sets. The language associated with this vector set is called the “pivot” and is indexed by $0$. A typical choice for the pivot, used in publicly available aligned vectors, is English (smith2017offline; conneau2017word; joulin2018loss). Aligning multiple languages to a common space consists in learning, for each language $i$, a mapping $\mathbf{W}_i$ such that its vectors are aligned with the vectors of the pivot language up to a permutation matrix $\mathbf{P}_i$:

$$\min_{\mathbf{W}_i \in \mathcal{O}_d, \ \mathbf{P}_i \in \mathcal{P}_n} \ \sum_{i=1}^{N-1} \ell\left( \mathbf{X}_i \mathbf{W}_i, \ \mathbf{P}_i \mathbf{X}_0 \right). \qquad (2)$$

This objective function decomposes over each language and does not guarantee good indirect word translation between pairs of languages that do not include the pivot. A solution would be to directly enforce compositionality by adding constraints on the mappings. However, this would require introducing mappings between all pairs of languages, leading to the estimation of $O(N^2)$ mappings simultaneously. Instead, we leverage the fact that all the vector sets are mapped to a common space to enforce good alignments within this space. With the convention that $\mathbf{W}_0$ is the identity, this leads to the following problem:

$$\min_{\mathbf{W}_i \in \mathcal{O}_d, \ \mathbf{P}_{ij} \in \mathcal{P}_n} \ \sum_{i \neq j} \ell\left( \mathbf{X}_i \mathbf{W}_i, \ \mathbf{P}_{ij} \mathbf{X}_j \mathbf{W}_j \right). \qquad (3)$$
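The structure of this objective — every mapped language compared with every other within the common space — can be sketched with a generic pairwise loss passed as a callable; the names below are illustrative, not from the paper's code:

```python
import numpy as np

def multilingual_objective(embs, maps, loss):
    """Sum a pairwise alignment loss over all mapped language pairs.

    embs: list of (n, d) arrays; embs[0] is the pivot language.
    maps: list of (d, d) mappings; maps[0] is the identity by convention.
    loss: callable taking two (n, d) arrays of vectors in the common space.
    """
    mapped = [X @ W for X, W in zip(embs, maps)]
    return sum(loss(mapped[i], mapped[j])
               for i in range(len(mapped))
               for j in range(len(mapped)) if i != j)
```

In the actual method, `loss` would itself minimize over an assignment matrix; here a fixed index alignment suffices to show the pairwise structure.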

This formulation does not introduce any unnecessary mapping or hyperparameter. It constrains all pairs of vector sets to be well aligned, instead of directly constraining the mappings. Constraining the mappings would encourage coherence over the entire space, while we are only interested in well-aligned data, that is, coherent mappings within the span of the word vectors. Our approach takes its inspiration from the hyperalignment of multiple graphs (goodall91). We refer to our approach as Unsupervised Multilingual Hyperalignment (UMH).

#### Choice of loss function.

We adapt the RCSLS loss to the unsupervised setting by applying the assignment matrix only to the linear part of the loss function, leading to the following loss between two sets of vectors:

$$\ell_{\mathrm{URCSLS}}(\mathbf{X}\mathbf{W}, \mathbf{P}\mathbf{Y}) = \frac{1}{n} \sum_{i=1}^{n} \Big( -2\, \mathbf{x}_i^\top \mathbf{W}^\top (\mathbf{P}\mathbf{Y})_i + \frac{1}{k} \sum_{\mathbf{y} \in \mathcal{N}_Y(\mathbf{W}\mathbf{x}_i)} \mathbf{x}_i^\top \mathbf{W}^\top \mathbf{y} + \frac{1}{k} \sum_{\mathbf{x} \in \mathcal{N}_X(\mathbf{y}_i)} \mathbf{x}^\top \mathbf{W}^\top \mathbf{y}_i \Big),$$

i.e., the RCSLS loss in which the assignment $\mathbf{P}$ only enters the first, linear term. conneau2017word have shown that the CSLS criterion significantly improves the quality of word translation, both when used to refine trained mappings and as a retrieval criterion at inference. We follow their work and use the URCSLS loss to learn our multilingual mappings.

#### Efficient optimization.

Directly optimizing Eq. (3) is computationally prohibitive since $O(N^2)$ terms are involved. We use a stochastic procedure where pairs of languages are sampled and updated at each iteration. The URCSLS loss is much slower to optimize than an $\ell_2$ loss because it requires finding the $k$ nearest elements of each entry. We thus propose the following approximate algorithm: we first optimize the URCSLS loss for a couple of epochs with the same scheme as in grave18, i.e., alternate minimization with a Sinkhorn algorithm for the assignment matrix on small batches. Then we switch to a cruder optimization scheme: we use a greedy assignment algorithm obtained by taking the max per row, and we subsample the set of elements among which the nearest neighbors are computed by restricting each set of vectors to its first k elements. UMH runs on CPU in minutes for a pair of languages and in a couple of hours for the full set of languages.

## 4 Related work

#### Bilingual word embedding alignment.

Since the work of Mikolov13, many have proposed different approaches to align word vectors with varying degrees of supervision, from fully supervised (dinu2014improving; xing2015normalized; artetxe2016learning; joulin2018loss) to little supervision (smith2017offline; artetxe2017learning) and even fully unsupervised (Zhang17b; conneau2017word; hoshen2018iterative). Among unsupervised approaches, some have explicitly formulated this problem as distribution matching: cao2016distribution align the first two moments of the word vector distributions, assuming Gaussian distributions. Others (zhang2017adversarial; conneau2017word) have used a Generative Adversarial Network framework (goodfellow2014generative). Zhang17b show that an earth mover distance can be used to refine the alignment obtained from a generative adversarial network, drawing a connection between word embedding alignment and Optimal Transport (OT). Closer to our work, grave18 and alvarez2018gromov have proposed unsupervised bilingual alignment methods solely based on OT. We use an approach inspired by their work to initialize our multilingual approach.

nakashole2017knowledge show that constraining coherent word alignments between triplets of nearby languages improves the quality of induced bilingual lexicons. As opposed to our work, their approach is restricted to triplets of languages and uses supervision for both the lexicon and the choice of the pivot language. Finally, independently of this work, chen2018unsupervised have recently extended the bilingual method of conneau2017word to the multilingual setting.

#### Optimal Transport.

Optimal transport (villani2003topics; santambrogio2015optimal) provides a natural topology on shapes and discrete probability measures (peyre2017computational), which can be leveraged thanks to fast solvers of the OT problem (cuturi13; altschuler2017near). Of particular interest is the Gromov-Wasserstein distance (gromov2007metric; memoli2011gromov). It has been used for shape matching, both in its primitive form (bronstein2006generalized; memoli2007use) and in its entropy-regularized form (solomon2016entropic).

#### Hyperalignment.

Hyperalignment, as introduced by goodall91, is the problem of aligning several shapes onto each other with supervision. Recently, lorbert2012kernel extended this supervised approach to non-Euclidean distances. We refer the reader to gower2004procrustes for a thorough survey of the different extensions of Procrustes, and to edelman1998geometry for algorithms involving orthogonal constraints. For unsupervised alignment of multiple shapes, huang2007unsupervised use a pointwise entropy-based method and apply it to face alignment.

## 5 Experimental Results

#### Implementation Details.

We use normalized fastText word embeddings trained on the Wikipedia corpus (FastText). We optimize our loss with stochastic gradient descent (SGD). We run a first epoch with a small batch size and then fix the batch size to k elements. Based on the loss at convergence, we selected separate learning rates for URCSLS and for UMH. For the first two iterations, we learn the assignment with a regularized Sinkhorn. Then, for faster convergence, we use a greedy assignment, picking the max per row. We initialize with the Gromov-Wasserstein approach applied to the first k vectors, with entropic regularization (peyre2016gromov). We use the Python optimal transport package.^1

^1 POT, https://pot.readthedocs.io/en/stable/.
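The greedy assignment used in the later iterations (picking the max per row of the similarity matrix) can be sketched as follows; this is our illustration, not the released code:

```python
import numpy as np

def greedy_assignment(XW, Y):
    """Crude assignment: best target per mapped source vector.

    Unlike the Hungarian algorithm, this is O(n^2) and the result need
    not be a permutation (several rows may pick the same column).
    """
    sim = XW @ Y.T                      # similarity of mapped sources to targets
    idx = sim.argmax(axis=1)            # max per row
    P = np.zeros_like(sim)
    P[np.arange(len(idx)), idx] = 1.0
    return P
```

On a perfectly aligned pair of unit-norm vector sets, the greedy assignment recovers the identity matching.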

#### Extended MUSE Benchmark.

We evaluate on the MUSE test datasets (conneau2017word) on the following languages: Czech, Danish, Dutch, English, French, German, Italian, Polish, Portuguese, Russian and Spanish. MUSE bilingual lexicons are mostly translations to or from English. For missing pairs of languages (e.g., Danish-German), we use the intersection of their translations to English to build a test set.
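The construction of a test set for a missing pair can be sketched as follows, assuming MUSE-style lexicons represented as word-to-translations dictionaries (this representation is our assumption):

```python
def indirect_lexicon(src_to_en, tgt_to_en):
    """Build a src-tgt test lexicon from two lexicons into English.

    A (src, tgt) pair is kept when the two words share an English
    translation. Each dictionary maps a word to a set of translations.
    """
    # invert the target lexicon: English word -> target words
    en_to_tgt = {}
    for t, ens in tgt_to_en.items():
        for en in ens:
            en_to_tgt.setdefault(en, set()).add(t)
    # intersect through English
    pairs = set()
    for s, ens in src_to_en.items():
        for en in ens:
            for t in en_to_tgt.get(en, ()):
                pairs.add((s, t))
    return pairs
```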

#### Baselines.

We consider as baselines several bilingual alignment methods that are either supervised, i.e., Orthogonal Procrustes and RCSLS (joulin2018loss), or unsupervised, i.e., Adversarial (conneau2017word), ICP (hoshen2018iterative) and Wasserstein Procrustes (grave18).

| | en-es | es-en | en-fr | fr-en | en-it | it-en | en-de | de-en | en-ru | ru-en | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| **supervised, bilingual** | | | | | | | | | | | |
| Proc. | 80.9 | 82.9 | 81.0 | 82.3 | 75.3 | 77.7 | 74.3 | 72.4 | 51.2 | 64.5 | 74.3 |
| RCSLS | 84.1 | 86.3 | 83.3 | 84.1 | 79.3 | 81.5 | 79.1 | 76.3 | 57.9 | 67.2 | 77.9 |
| **unsupervised, bilingual** | | | | | | | | | | | |
| GW | 81.7 | 80.4 | 81.3 | 78.9 | 78.9 | 75.2 | 71.9 | 72.8 | 45.1 | 43.7 | 71.0 |
| Adv. + ref. | 81.7 | 83.3 | 82.3 | 82.1 | 77.4 | 76.1 | 74.0 | 72.2 | 44.0 | 59.1 | 73.2 |
| ICP + ref. | 82.1 | 84.1 | 82.3 | 82.9 | 77.9 | 77.5 | 74.7 | 73.0 | 47.5 | 61.8 | 74.4 |
| W-Proc. + ref. | 82.8 | 84.1 | 82.6 | 82.9 | - | - | 75.4 | 73.3 | 43.7 | 59.1 | - |
| UMH bil. | 82.5 | 84.9 | 82.9 | 83.3 | 79.4 | 79.4 | 74.8 | 73.7 | 45.3 | 62.8 | 74.9 |
| **unsupervised, multilingual** | | | | | | | | | | | |
| UMH multi. | 81.5 | 82.6 | 81.5 | 81.7 | 76.1 | 76.4 | 73.2 | 71.9 | 46.1 | 61.8 | 73.3 |

Table 1: Word translation accuracy on the MUSE benchmark, for pairs involving English (both directions).

### 5.1 Impact of the loss function

Table 1 compares our bilingual UMH with state-of-the-art unsupervised approaches on the MUSE benchmark. All the approaches use the CSLS criterion. Most unsupervised approaches learn a mapping with a loss different from the criterion used for inference and apply a refinement step (“ref.”) to fine-tune their mapping with the proper criterion (artetxe2017learning; conneau2017word). UMH directly learns a bilingual mapping with an approximation of the retrieval criterion, and we do not apply a refinement step. Bilingual UMH compares favorably with previous unsupervised approaches. Of particular interest is the comparison with W-Proc.+ref., since it is the closest approach to ours. This experiment validates our choice of the URCSLS loss for our approach.

### 5.2 Triplet alignment

| | de-en | fr-en | de-fr | de-fr (ind.) | pt-es | fr-es | pt-fr | pt-fr (ind.) | it-en | pt-en | it-pt | it-pt (ind.) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Pairs | 72.3 | 80.2 | 64.5 | 61.7 | 86.5 | 81.2 | 77.2 | 72.3 | 77.7 | 77.8 | 78.1 | 70.7 |
| Triplet | 71.9 | 80.2 | - | 68.3 | 86.8 | 81.2 | - | 77.9 | 76.7 | 78.2 | - | 78.1 |

Table 2: Direct and indirect (ind.) word translation accuracy with a NN criterion, for models trained on pairs and on triplets of languages.

In this set of experiments, we evaluate the quality of our formulation in the simple case of triplets of languages. One of the languages acts as the pivot between the two others. We evaluate both the direct translation to the pivot and the indirect translation between the two other languages (i.e., source to pivot to target). This experiment is inspired by the setting of nakashole2017knowledge. We pick triplets for which MUSE bilingual lexicons exist.

Table 2 compares our approach trained on pairs and on triplets of languages. For the pairs, we evaluate both direct and indirect translation. We use a NN criterion instead of CSLS because it gives a better insight into the dot product between vectors in the common space. We test different settings by changing the pivot or changing the pair of languages, and consider cases where the pivot is a natural in-between language and cases where it is not. Overall, we observe that these changes have little impact on the performance. Our approach obtains comparable performance on direct translation to and from the pivot. More importantly, indirect translation with our approach reaches a performance that compares favorably even with direct translation. In comparison, the performance of indirect translation obtained with bilingual models drops significantly. Note that this drop is reduced to a couple of percent if a CSLS criterion is used instead of a NN criterion.

### 5.3 Multilingual alignment

In this set of experiments, we evaluate the quality of joint multilingual alignment on a larger set of languages, i.e., the 11 languages listed above. We look at the impact on direct and indirect alignments.

| | Latin | Germanic | Slavic | Latin-Germanic | Latin-Slavic | Germanic-Slavic | All |
|---|---|---|---|---|---|---|---|
| Bil. | 76.6 | 54.6 | 45.5 | 53.7 | 46.3 | 40.8 | 51.7 |
| Multi. | 80.0 | 61.8 | 51.8 | 59.2 | 50.4 | 48.3 | 57.1 |

Table 3: Indirect word translation accuracy with English as the pivot, averaged within and across language families.

#### Indirect word translation.

Table 3 shows the performance on indirect word translation with English as the pivot language. We consider accuracies averaged within and across language families, i.e., Latin, Germanic and Slavic. As expected, constraining the alignments significantly improves over the bilingual baseline. The biggest improvement comes from Slavic languages. This is not surprising, since it is the language family that is the most distant from the pivot, i.e., English. Similarly, it is not surprising that the smallest improvement is between Latin and Germanic languages, since English is a natural pivot between them.

#### Direct word translation.

Table 1 shows a comparison of UMH with other unsupervised bilingual approaches on the MUSE benchmark. This benchmark consists of translations to and from English; results on the remaining languages are in the supplementary material. We observe a slight degradation of performance compared to bilingual UMH, which is consistent across the remaining languages. This drop is explained by our uniform sampling of the pairs of languages: if we change this sampling to favor direct pairs, the performance on direct translation matches the bilingual case, but at a cost on indirect word translation. In the absence of supervision, we prefer to keep a uniform sampling scheme.

## 6 Conclusion

This paper introduces an unsupervised multilingual alignment method that maps every language into a common space while minimizing the impact on indirect word translation. We show that a simple extension of a bilingual formulation significantly reduces the drop of performance on indirect word translation. Our multilingual approach also matches the performance of previously published bilingual approaches on direct translation. However, our current approach is relatively hard to scale, and how to jointly learn alignments over hundreds of languages remains an open question.
