Estimating distances between text passages is at the core of information retrieval applications such as document retrieval, summarization and question answering. Recently, Kusner et al. [1] proposed the Word Mover’s Distance (WMD), a novel distance metric for text data. WMD is directly derived from optimal transport (OT) theory [2, 3] and is, in fact, an implementation of the Wasserstein distance (also known as Earth Mover’s distance) for textual data. For WMD, a source and a target text span are expressed as high-dimensional probability densities through the bag-of-words representation. Given the two densities and a ground metric, OT aims to find the map (or transport plan) that minimizes the total cost of transferring the first density to the second. The ground metric for text data can be estimated using word embeddings.
One interesting feature of the Wasserstein distance is that it defines a proper metric on a space of probability measures. This distance presents several advantages when compared to other statistical distance measures, such as, for instance, the - and the Jensen-Shannon divergences: (1) it is parametrized by the ground metric, which offers the flexibility to adapt it to various data types; (2) it is known to be a very efficient metric due to its ability to take into account the geometry of the data through the pairwise distances between the distributions’ points. For all these reasons, the Wasserstein distance is of increasing interest to the machine learning community for various applications such as computer vision [4], domain adaptation [5], and clustering [6].
In this paper, our goal is to show how information retrieval (IR) applications can benefit from the Wasserstein distance. We demonstrate that, for text applications, the Wasserstein distance can naturally incorporate different weighting schemes that are particularly efficient in IR applications, such as the inverse document frequency. This presents an important advantage compared to the uniform weighting considered in previous work on the subject. Further, we propose to use the regularized version of OT [7], which relies on entropic regularization to obtain smoother, and therefore more stable, results, and which solves the OT problem using the efficient Sinkhorn-Knopp matrix scaling algorithm. From the application’s perspective, we evaluate the use of Wasserstein distances in the task of Cross-Lingual Document Retrieval (CLDR) where, given a query document (e.g., the English Wikipedia entry for “Dog”), one needs to retrieve its corresponding document in another language (e.g., the French entry for “Chien”). In this specific context we propose a novel strategy to handle out-of-vocabulary words based on morphological similarity.
The rest of this paper is organized as follows. In Section 2, we briefly present the OT problem and its entropic regularized version. Section 3 presents the proposed approach and investigates different scenarios with respect to (w.r.t.) the weighting schemes, the regularization and the word embeddings. Empirical evaluations, conducted in eight cross-lingual settings, are presented in Section 4 and demonstrate that our approach is substantially more efficient than other strong baselines in terms of Mean Reciprocal Rank. The last section concludes the paper with a discussion of future research perspectives.
2 Preliminary knowledge
In this section we introduce the OT problem as well as its entropic regularized version, which will later be used to calculate the regularized Wasserstein distance.
2.1 Optimal transport
OT theory, originally introduced in [2] to study the problem of resource allocation, provides a powerful geometrical tool for comparing probability distributions.
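In a more formal way, given access to two sets of points $X_S = \{x_i^S \in \mathbb{R}^d\}_{i=1}^{N_S}$ and $X_T = \{x_j^T \in \mathbb{R}^d\}_{j=1}^{N_T}$, we construct two discrete empirical probability distributions as follows
$$\hat{\mu}_S = \sum_{i=1}^{N_S} p_i^S \delta_{x_i^S} \quad \text{and} \quad \hat{\mu}_T = \sum_{j=1}^{N_T} p_j^T \delta_{x_j^T},$$
where $p_i^S$ and $p_j^T$ are probabilities associated to $x_i^S$ and $x_j^T$, respectively, and $\delta_x$ is a Dirac measure that can be interpreted as an indicator function taking value 1 at the position of $x$ and 0 elsewhere. For these two distributions, the Monge-Kantorovich problem consists in finding a probabilistic coupling $\gamma$ defined as a joint probability measure over $X_S \times X_T$ with marginals $\hat{\mu}_S$ and $\hat{\mu}_T$ that minimizes the cost of transport with respect to some metric $l: X_S \times X_T \to \mathbb{R}^+$:
$$\min_{\gamma \in \Pi(\hat{\mu}_S, \hat{\mu}_T)} \langle A, \gamma \rangle_F,$$
where $\langle \cdot, \cdot \rangle_F$ is the Frobenius dot product, $\Pi(\hat{\mu}_S, \hat{\mu}_T) = \{\gamma \in \mathbb{R}_+^{N_S \times N_T} \mid \gamma \mathbf{1} = p^S, \gamma^\top \mathbf{1} = p^T\}$ is a set of doubly stochastic matrices and $A$ is a dissimilarity matrix, i.e., $A_{ij} = l(x_i^S, x_j^T)$, defining the energy needed to move a probability mass from $x_i^S$ to $x_j^T$. This problem admits a unique solution $\gamma^*$ and defines a metric on the space of probability measures (called the Wasserstein distance) as follows:
$$W(\hat{\mu}_S, \hat{\mu}_T) = \min_{\gamma \in \Pi(\hat{\mu}_S, \hat{\mu}_T)} \langle A, \gamma \rangle_F.$$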
The success of algorithms based on this distance is also due to Cuturi [7], who introduced an entropic regularized version of optimal transport that can be optimized efficiently using a matrix scaling algorithm. We present this regularization below.
2.2 Entropic regularization
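The idea of using entropic regularization has recently found its application to the optimal transportation problem [7] through the following objective function:
$$\min_{\gamma \in \Pi(\hat{\mu}_S, \hat{\mu}_T)} \langle A, \gamma \rangle_F - \lambda E(\gamma),$$
where $\gamma$ is the coupling, $A$ is the cost matrix, $E(\gamma) = -\sum_{i,j} \gamma_{ij} \log \gamma_{ij}$ is the entropy of the coupling and $\lambda > 0$ is the regularization parameter.
The second term in this equation yields smoother and more numerically stable solutions than the original problem, and the regularized solution converges to the original one at an exponential rate [8]. The intuition behind it is that entropic regularization transports the mass from one distribution to another more or less uniformly, depending on the regularization parameter $\lambda$. Furthermore, it allows the optimal transportation problem to be solved efficiently using the Sinkhorn-Knopp matrix scaling algorithm [9].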
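The matrix scaling idea can be illustrated with a minimal, dependency-free sketch. It assumes that the entropy term is weighted by a strength `reg` (a hypothetical parameter name playing the role of the regularization parameter), so that larger values give smoother couplings; the function names and the toy 2x2 problem are illustrative and not taken from the paper's code.

```python
import math


def sinkhorn(a, b, cost, reg, n_iter=200):
    """Entropic-regularized OT via Sinkhorn-Knopp matrix scaling.

    a, b : source and target histograms (lists that sum to 1)
    cost : cost matrix as a list of rows, cost[i][j] >= 0
    reg  : regularization strength (assumed: larger means smoother coupling)
    Returns the coupling matrix as a list of rows.
    """
    n, m = len(a), len(b)
    # Gibbs kernel K = exp(-cost / reg)
    K = [[math.exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iter):
        # Alternately rescale rows and columns to match the two marginals
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


def transport_cost(gamma, cost):
    """Frobenius dot product between the cost matrix and the coupling."""
    return sum(g * c
               for g_row, c_row in zip(gamma, cost)
               for g, c in zip(g_row, c_row))


# Toy problem: two points on each side, zero cost on the diagonal
a = [0.5, 0.5]
b = [0.5, 0.5]
cost = [[0.0, 1.0], [1.0, 0.0]]
gamma_sharp = sinkhorn(a, b, cost, reg=0.05)   # weak regularization
gamma_smooth = sinkhorn(a, b, cost, reg=10.0)  # strong regularization
```

With weak regularization the coupling concentrates on the zero-cost diagonal, while strong regularization spreads the mass almost uniformly, at the price of a higher transport cost. Very small values of `reg` make the kernel underflow, which is why stabilized variants exist.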
3 Word Mover’s Distance for CLDR
In this section, we explain the main underlying idea of our approach and show how regularized optimal transport can be used in cross-lingual information retrieval. We start with the formalization of our problem.
3.1 Problem setup
For our task, we assume access to two document collections $C_1$ and $C_2$, where $d^1_{(i)} \in C_1$ (resp. $d^2_{(j)} \in C_2$) is the $i$-th (resp. $j$-th) document written in language $\ell_1$ (resp. $\ell_2$). Let the vocabulary size of the two languages be denoted as $n_1$ and $n_2$. For the rest of the development, we assume to have access to dictionaries of embeddings $E^1$ and $E^2$, where words from $\ell_1$ and $\ell_2$ are projected into a shared vector space of dimension $m$; hence $E^1_k \in \mathbb{R}^m$ and $E^2_l \in \mathbb{R}^m$ denote the embeddings of words $w^1_k$ and $w^2_l$. As learning the bilingual embeddings is not the focus of this paper, any of the previously proposed methods can be used, e.g., [10, 11]. A document consists of words and is represented using the Vector Space Model with frequencies. Hence, $d^1 \in \mathbb{R}^{n_1}$ and $d^2 \in \mathbb{R}^{n_2}$; the value $d^1_i$ (resp. $d^2_j$) then represents the frequency of word $w^1_i$ (resp. $w^2_j$) in $d^1$ (resp. $d^2$). Calculating the distance between words in the embedding space is naturally achieved using the Euclidean distance, with lower values meaning that the words are more similar. For the rest, we denote by $A(i,j) = \|E^1_i - E^2_j\|_2$ the Euclidean distance between the words $w^1_i$ and $w^2_j$ in the embedding space. Our goal is to estimate the distance between $d^1$ and $d^2$, which are written in two languages, while taking advantage of the expressiveness of word embeddings and the Wasserstein distance.
3.2 Proposed method
In order to use the Wasserstein distance on documents, we consider that the documents $d^1$ and $d^2$ from different languages are both modeled as empirical probability distributions, i.e.
$$\hat{\mu}_{d^1} = \sum_{i=1}^{n_1} p^1_i \delta_{w^1_i} \quad \text{and} \quad \hat{\mu}_{d^2} = \sum_{j=1}^{n_2} p^2_j \delta_{w^2_j},$$
where $p^1_i$ and $p^2_j$ are probabilities associated with words $w^1_i$ and $w^2_j$ in $d^1$ and $d^2$, respectively. In order to increase the efficiency of optimal transport between these documents, it would be desirable to incorporate a proper weighting scheme that reflects the relative frequencies of different words appearing in a given text corpus. To this end, we use the following weighting schemes:
- term frequency (tf), which represents a document using the frequency of its word occurrences. This scheme was initially proposed in [1] and corresponds to the case where $p^1_i = d^1_i / \sum_k d^1_k$ and $p^2_j = d^2_j / \sum_k d^2_k$.
- term frequency-inverse document frequency (idf), where the term frequencies are multiplied by the words’ inverse document frequencies. In a collection of $N$ documents, the document frequency $\mathrm{df}(w)$ is the number of documents in the collection containing the word $w$. A word’s inverse document frequency penalizes words that occur in many documents. As commonly done, we use a smoothed version of idf. Hence, we consider $p^1_i \propto d^1_i \cdot \mathrm{idf}(w^1_i)$ and $p^2_j \propto d^2_j \cdot \mathrm{idf}(w^2_j)$, normalized to sum to one.
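The two weighting schemes can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the smoothed idf is taken to be log((1 + N) / (1 + df)) + 1, one common smoothing choice that the text does not confirm, and the function names and the toy collection are hypothetical.

```python
import math
from collections import Counter


def tf_weights(tokens):
    """Normalized term frequencies: the weights sum to one."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def smoothed_idf(collection):
    """Smoothed idf per word; `collection` is a list of token lists.

    Assumption: idf(w) = log((1 + N) / (1 + df(w))) + 1.
    """
    n_docs = len(collection)
    df = Counter(w for doc in collection for w in set(doc))
    return {w: math.log((1 + n_docs) / (1 + df_w)) + 1.0
            for w, df_w in df.items()}


def tfidf_weights(tokens, idf):
    """Term frequencies multiplied by idf, renormalized to sum to one."""
    counts = Counter(tokens)
    raw = {w: c * idf[w] for w, c in counts.items()}
    total = sum(raw.values())
    return {w: v / total for w, v in raw.items()}


# Toy collection of three pre-processed documents
collection = [["cat", "sits", "mat"],
              ["cat", "eats", "fish"],
              ["dog", "sits", "mat"]]
idf = smoothed_idf(collection)
p_tf = tf_weights(collection[1])
p_idf = tfidf_weights(collection[1], idf)
```

In the second document, "cat" also occurs in another document while "fish" does not, so the idf scheme moves probability mass from "cat" to "fish"; under tf all three words weigh the same.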
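Furthermore, we use the Euclidean distance between the word embeddings of the two documents as a ground metric in order to construct the matrix $A$. Now, we seek to solve the following optimization problem:
$$\gamma^* = \arg\min_{\gamma \in \Pi(\hat{\mu}_{d^1}, \hat{\mu}_{d^2})} \langle A, \gamma \rangle_F. \tag{1}$$
Given the solution $\gamma^*$ of this problem, we can calculate the Wasserstein distance between documents as
$$W(\hat{\mu}_{d^1}, \hat{\mu}_{d^2}) = \langle A, \gamma^* \rangle_F = \sum_{i,j} A(i,j)\, \gamma^*_{ij}.$$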
As transforming the words of $d^1$ to $d^2$ comes with the cost $A(i,j)$, the optimization problem of Eq. (1) translates to the minimization of the associated cumulative cost of transforming all the words. The value of the minimal cost is the distance between the documents. Intuitively, the more similar the words between the documents are, the lower the costs associated with the solution of the optimization problem will be, which, in turn, means smaller document distances. For example, given “the cat sits on the mat” and its French translation “le chat est assis sur le tapis”, the weights after stopword filtering of “cat”, “sits”, “mat”, and “chat”, “assis”, “tapis” will be $1/3$. Given high-quality embeddings, solving Eq. (1) will converge to the one-to-one transformations “cat-chat”, “sits-assis” and “mat-tapis”, with very low cumulative cost as the paired words are similar.
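The example above can be reproduced with a small sketch. It relies on a simplifying assumption: when both documents have the same number of words and uniform weights, an optimal transport plan can be found among the permutations, so a brute-force search over one-to-one matchings suffices. The 2-D embeddings below are made up for illustration and are not real cross-lingual vectors.

```python
import itertools
import math

# Hypothetical 2-D embeddings; translations are placed close to each other
emb_en = {"cat": (0.9, 0.1), "sits": (0.1, 0.9), "mat": (0.6, 0.6)}
emb_fr = {"chat": (0.88, 0.12), "assis": (0.12, 0.88), "tapis": (0.58, 0.63)}


def euclidean(u, v):
    """Euclidean ground metric between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def uniform_wasserstein(src, tgt):
    """Exact OT cost for equal-size documents with uniform weights 1/n.

    Enumerates all one-to-one matchings, which is only practical for
    very small documents.
    """
    src_words, tgt_words = list(src), list(tgt)
    n = len(src_words)
    if n != len(tgt_words):
        raise ValueError("this sketch needs documents of equal size")
    best_cost, best_pairs = float("inf"), None
    for perm in itertools.permutations(range(n)):
        cost = sum(euclidean(src[src_words[i]], tgt[tgt_words[j]])
                   for i, j in enumerate(perm)) / n
        if cost < best_cost:
            best_cost = cost
            best_pairs = [(src_words[i], tgt_words[j])
                          for i, j in enumerate(perm)]
    return best_cost, best_pairs


distance, pairs = uniform_wasserstein(emb_en, emb_fr)
```

For documents of different sizes or with non-uniform weights, the problem is a general linear program and requires a dedicated solver.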
This one-to-one matching, however, can be less efficient when documents with larger vocabularies are used. In this case, every word can potentially be associated with not a single word but several words representing its synonyms or terms often used in the same context. Furthermore, the problem of Eq. (1) is a special case of the Earth Mover’s distance [12] and is a standard Linear Programming problem that has a computational complexity of $O(n^3 \log n)$. When $n$ is large, this can present a huge computational burden. Hence, it may be more beneficial to use the entropic regularization of optimal transport, which allows more associations between words by increasing the entropy of the coupling matrix and can be solved faster, in linear time. Our second proposed model thus reads
$$\gamma^*_\lambda = \arg\min_{\gamma \in \Pi(\hat{\mu}_{d^1}, \hat{\mu}_{d^2})} \langle A, \gamma \rangle_F - \lambda E(\gamma), \tag{2}$$
where $E(\gamma) = -\sum_{i,j} \gamma_{ij} \log \gamma_{ij}$ is the entropy of the coupling.
As in the previous problem, once $\gamma^*_\lambda$ is obtained, we estimate the entropic regularized Wasserstein distance (also known as Sinkhorn distance) as
$$W_\lambda(\hat{\mu}_{d^1}, \hat{\mu}_{d^2}) = \langle A, \gamma^*_\lambda \rangle_F.$$
Algorithm 1 summarizes the CLDR process with the Wasserstein distance. We also illustrate the effect of regularization, controlled by $\lambda$ in the OT problem of Eq. (2). Figure 1 presents the obtained coupling matrices and the underlying word matchings between the words of the example we considered above when varying $\lambda$. We project the words in 2-D space using t-SNE as our dimensionality reduction technique (we use the Numberbatch embeddings presented in our experiments, Sec. 4). Notice that high values of $\lambda$ lead to uniform association weights between all the words, while the lowest value leads to a complete algorithm failure. For an intermediate value, the corresponding pairs are associated with the bold lines, showing that this link is more likely than the other fading lines. Finally, a smaller value that still converges gives the optimal, one-to-one matching. This figure shows that entropic regularization encourages “soft” associations of words with different degrees of strength. It also highlights how OT accounts for the data geometry, as the strongest links occur between the words that are closest in the space.
3.3 Out-of-vocabulary words
An important limitation when using dictionaries of embeddings is out-of-vocabulary (OOV) words. High rates of OOV result in loss of information. To partially overcome this limitation one needs a protocol to handle them. We propose a simple strategy for OOV words that is based on the strong assumption that morphologically similar words have similar meanings. To measure similarity between words we use the Levenshtein distance, which is the minimum number of character edits needed to transform one word into the other. Hence, the protocol we use is as follows: in case a word is OOV, we measure its distance from every word in the embeddings dictionary of its language, and select the embedding of a word whose distance is within a threshold. If there are several such words, we randomly select one. Depending on the language and on the dictionary size, this may have significant effects: for languages like English and Greek, for instance, one may handle OOV plural nouns, as they are often formed by adding a suffix to the noun (e.g., “s” in English: tree/trees). The same protocol can help with languages like French and German that have plenty of letters with accents such as acutes (é) and umlauts (ü).
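The protocol can be sketched in a few lines. This is an illustrative version under stated assumptions: the threshold is treated as inclusive (a distance of at most `threshold`), the embeddings dictionary is a plain Python `dict`, and the function names are hypothetical.

```python
import random


def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn s into t (dynamic programming)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (cs != ct)))  # substitution
        prev = curr
    return prev[-1]


def lookup_embedding(word, embeddings, threshold=1, rng=random):
    """Return the embedding of `word`, falling back to a morphologically
    close in-vocabulary word when `word` is OOV.

    Assumption: candidates are words at distance <= threshold; one of them
    is picked at random. Returns None when no candidate exists.
    """
    if word in embeddings:
        return embeddings[word]
    candidates = [w for w in embeddings
                  if levenshtein(word, w) <= threshold]
    if not candidates:
        return None
    return embeddings[rng.choice(candidates)]


# Toy dictionary: "trees" is OOV but one edit away from "tree"
toy_embeddings = {"tree": (1.0, 0.0), "house": (0.0, 1.0)}
vector = lookup_embedding("trees", toy_embeddings)
```

The linear scan over the dictionary is the simplest choice; for large vocabularies one would typically pre-filter candidates by word length, since two words whose lengths differ by more than the threshold cannot match.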
The above strategy handles the OOV words of a language using the embeddings dictionary of that language. To fine-tune the available embeddings for the task of cross-lingual retrieval, we extend the argument of morphological similarity to the embeddings across languages. To achieve that, we collapse the cross-lingual embeddings of alike words. Hence, if the Levenshtein distance between two words from the two languages is zero, we use the embedding of the language with the biggest dictionary size for both. As a result, the English word “transition” and the French word “transition” will use the English embedding. Of course, while several words may be that similar between English and French, there will be fewer, for instance, for English and Finnish, or none for English and Greek as they use different alphabets.
We note that the assumption of morphologically similar words having similar meanings, and thus similar embeddings, is strong; we do not claim it to be anything more than a heuristic. In fact, one can come up with counter-examples where it fails. We believe, however, that for languages with fewer resources than English, it is a heuristic that can help overcome the high rates of OOV. Its positive or negative impact for CLDR remains to be empirically validated.
4 Experimental framework
In our experiments we are interested in CLDR whose aim is to identify corresponding documents written in different languages. Assuming, for instance, that one has access to English and French Wikipedia documents, the goal is to identify the cross-language links between the articles. Traditional retrieval approaches employing bag-of-words representations perform poorly in the task as the vocabularies vary across languages, and words from different languages rarely co-occur.
To evaluate the suitability of OT distances for cross-lingual document retrieval we extract four bilingual ($\ell_1$-$\ell_2$) Wikipedia datasets: (i) English-French, (ii) English-German, (iii) English-Greek and (iv) English-Finnish. Each dataset defines two retrieval problems: for the first ($\ell_1 \to \ell_2$) the documents of $\ell_2$ are retrieved given queries in $\ell_1$; for the second ($\ell_2 \to \ell_1$) the documents of $\ell_2$ are the queries. To construct the Wikipedia excerpts, we use the comparable Wikipedia corpora of linguatools (http://linguatools.org/tools/corpora/wikipedia-comparable-corpora/). Following [13], the inter-language links are the gold standard and will be used to calculate the evaluation measures. Compared to “ad hoc” retrieval problems [14], where there are several relevant documents for each query, this is referred to as a “known-search” problem, as there is exactly one “right” result for each query [15]. Our datasets comprise 10K pairs of comparable documents; the first 500 are used for evaluating the retrieval approaches. The use of the remaining 9.5K, dubbed “BiLDA Training”, is described in the next paragraph. Table 1(a) summarizes these datasets. In the pre-processing steps we lowercase the documents and remove the stopwords, the punctuation symbols and the numbers. We keep documents with more than five words and, for efficiency, we keep for each document the first 500 words.
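The pre-processing steps listed above can be sketched as follows. The stopword set, the tokenization by a regular expression, and the choice to apply the five-word filter after stopword removal are assumptions of this sketch, as the text does not specify them.

```python
import re


def preprocess(text, stopwords, min_words=5, max_words=500):
    """Lowercase, drop punctuation, numbers and stopwords, then apply the
    length filters. Returns None for documents that are too short."""
    # Runs of letters only: digits, punctuation and underscores are dropped
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    tokens = [t for t in tokens if t not in stopwords]
    if len(tokens) <= min_words:      # keep documents with more than five words
        return None
    return tokens[:max_words]         # keep the first 500 words


# Hypothetical, tiny stopword list for illustration
toy_stopwords = {"the"}
doc = preprocess("The dog, aged 3, chased the red ball across the big garden!",
                 toy_stopwords)
```

The pattern `[^\W\d_]+` also matches accented and non-Latin letters, which matters for the French, German, Greek and Finnish sides of the datasets.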
One may distinguish between two families of methods for cross-lingual IR: translation-based and semantic-based. The methods of the first one use a translation mechanism to translate the query from the source language to the target language. Then, any of the known retrieval techniques can be employed. The methods of the second family project both the query and the documents of the target language in a shared semantic space. The calculation of the query-document distances is performed in this shared space.
To demonstrate the advantages of our methods, we present results using systems that rely either on translations or on cross-lingual semantic spaces. Concerning the translation mechanism, we rely on a dictionary-based approach. To generate the dictionaries for each language pair we use Wiktionary and, in particular, the open implementation of [16, 17]. For a given word, one may have several translations: we pick the candidate translation according to a unigram language model learned on the “BiLDA training” data. In the rest, we compare the following systems (we release the code at: https://github.com/balikasg/WassersteinRetrieval):
- tf: Euclidean distance between the term-frequency representations of documents. To be applied to CLDR, the query needs to be translated into the language of the target documents.
- idf: Euclidean distance between the idf representations of documents. As with tf, the query needs to be translated.
- nBOW: nBOW (neural bag-of-words) represents documents by a weighted average of their words’ embeddings [18, 19]. With the tf subscript, the output is the result of averaging the embeddings of the occurring words. With the idf subscript, the embedding of each word is multiplied by the word’s inverse document frequency. Given the nBOW representations, the distances are calculated using the Euclidean distance. nBOW methods can be used either with cross-lingual embeddings, or with mono-lingual embeddings if the query is translated.
- BiLDA: Previous work found the bilingual Latent Dirichlet Allocation (BiLDA) to yield state-of-the-art results; we cite for instance [20, 13, 21]. BiLDA is trained on comparable corpora and learns aligned per-word topic distributions between two or more languages. During inference it projects unseen documents into the shared topic space, where cross-lingual distances can be calculated efficiently. We train BiLDA separately for each language pair with 300 topics. We use collapsed Gibbs sampling for inference, which we implemented with Numpy [22]. Following previous work, we set the Dirichlet hyper-parameters ( being the number of topics) and . We allow 200 Gibbs sampling iterations for burn-in and then sample the document distributions every 25 iterations until the -th Gibbs iteration. For learning the topics, we used the “BiLDA training” data. Given the per-document representations in the shared space of the topics, we use entropy as distance, following previous work.
- Wass: the proposed metric given by Eq. (1). With the tf subscript, it is equivalent to that of [1], as the terms’ frequencies are used for generating the high-dimensional source and target histograms. With the idf subscript, the idf weighting scheme is applied.
- Entro_Wass: the proposed metric given by Eq. (2). The subscript reads the same as for the previous approach. We implemented Wass and Entro_Wass with Scikit-learn [23] using the solvers of POT [24] (for Entro_Wass we used the sinkhorn2 function with the reg=0.1, numItermax=50, method='sinkhorn_stabilized' arguments to prevent numerical errors). For the importance of the regularization term in Eq. (2), we performed grid-search and found $\lambda = 0.1$ to consistently perform the best.
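As a point of comparison with the transport-based metrics, the nBOW baseline from the list above can be sketched as a weighted average of embeddings followed by a Euclidean distance. The embeddings and weights below are made up, and ignoring OOV words while renormalizing the remaining weights is an assumption of this sketch.

```python
import math


def nbow(weights, embeddings):
    """Weighted average of the embeddings of a document's words.

    weights    : dict word -> weight (tf or tf-idf)
    embeddings : dict word -> tuple of floats
    Words without an embedding are ignored and the remaining weights are
    renormalized (an assumption of this sketch).
    """
    dim = len(next(iter(embeddings.values())))
    total = sum(w for word, w in weights.items() if word in embeddings)
    vec = [0.0] * dim
    if total == 0:
        return vec
    for word, w in weights.items():
        if word in embeddings:
            for k, x in enumerate(embeddings[word]):
                vec[k] += w * x / total
    return vec


def euclidean(u, v):
    """Euclidean distance between two document vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


# Hypothetical shared-space embeddings for an English and a French document
toy_emb = {"cat": (0.9, 0.1), "mat": (0.6, 0.6),
           "chat": (0.88, 0.12), "tapis": (0.58, 0.63)}
doc_en = nbow({"cat": 0.5, "mat": 0.5}, toy_emb)
doc_fr = nbow({"chat": 0.5, "tapis": 0.5}, toy_emb)
nbow_distance = euclidean(doc_en, doc_fr)
```

Averaging collapses each document to a single point, so two documents with different word distributions but the same centroid become indistinguishable; the transport-based metrics keep the individual words and avoid that loss.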
For the systems that require embeddings (nBOW, Wass, Entro_Wass), we use the Numberbatch pre-trained embeddings of [11] (the v17.06 vectors: https://github.com/commonsense/conceptnet-numberbatch). The Numberbatch embeddings are 300-dimensional embeddings for 78 languages that project words and short expressions of these languages into the same shared space, and were shown to achieve state-of-the-art results in cross-lingual tasks [11, 25].
Complementary to the tf and idf document representations, we also evaluate the heuristic we proposed for OOV words in Section 3.3. We select the threshold for the Levenshtein distances to be 1, and we denote by tf+ and idf+ the settings where the proposed OOV strategy is employed.
[Table 2: MRR scores for the eight CLDR problems (En→Fr, Fr→En, En→Ge, Ge→En, En→Gr, Gr→En, En→Fi, Fi→En), with row groups: systems that rely on topic models; systems that rely on translations; systems that rely on cross-lingual embeddings; handling OOV words & cross-lingual embeddings.]
As the evaluation measure, we report the Mean Reciprocal Rank (MRR) [26], which accounts for the rank of the correct answer in the returned documents. Higher values signify that the gold documents are ranked higher. Table 2 presents the achieved scores for the CLDR problems.
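For the known-item setting used here, where each query has exactly one correct document, MRR can be computed as in the following sketch. The distance matrix and the gold indices are toy values, and the function name is hypothetical.

```python
def mean_reciprocal_rank(distances, gold):
    """MRR for retrieval with exactly one relevant document per query.

    distances[q][d] : distance between query q and candidate document d
    gold[q]         : index of the correct document for query q
    """
    total = 0.0
    for q, row in enumerate(distances):
        # Rank candidates by increasing distance (closest first)
        ranking = sorted(range(len(row)), key=lambda d: row[d])
        rank = ranking.index(gold[q]) + 1
        total += 1.0 / rank
    return total / len(distances)


# Three queries, three candidates each
toy_distances = [[0.1, 0.5, 0.9],   # correct document ranked 1st
                 [0.4, 0.3, 0.2],   # correct document ranked 2nd
                 [0.7, 0.2, 0.6]]   # correct document ranked 2nd
toy_gold = [0, 1, 2]
mrr = mean_reciprocal_rank(toy_distances, toy_gold)
```

Here the reciprocal ranks are 1, 1/2 and 1/2, so the MRR is 2/3.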
There are several observations from the results of this table. First, notice that the results clearly establish the superiority of the Wasserstein distance for CLDR. Independently of the representation used (tf, idf, tf+, idf+), the performance when the Wasserstein distance is used is substantially better than that of the other baselines. This is due to the fact that the proposed distances account for the geometry of the data. In this sense, they essentially implement optimal word-alignment algorithms, as the calculated transportation cost uses the word representations in order to minimize the cost of transforming the source document into the target document. Although nBOW uses exactly the same embeddings, it performs a weighted averaging operation that results in information loss.
Comparing the two proposed methods, we notice that the approach with the entropic regularization (Entro_Wass) outperforms its original version Wass in most cases. This suggests that using regularization in the OT problem improves the performance for our application. As a result, using Entro_Wass is not only faster and GPU-friendly, but also more accurate. Also, both approaches consistently benefit from the idf weighting scheme. The rest of the baselines, although competitive, perform worse than the proposed approaches.
Another interesting insight stems from the comparison of the translation-based and the semantic-based approaches. The results suggest that the semantic-based approaches that use the Numberbatch embeddings perform better, meaning that the machine translation method we employed introduces more error than the imperfect induction of the embedding spaces. This is also evident from the performance decrease of tf and idf when moving from language pairs with more resources like “En-Fr” to more resource-deprived pairs like “En-Fi” or “En-Gr”. While one may argue that better results can be achieved with a better-performing translation mechanism, the important outcome of our comparison is that both families of approaches improve when the Wasserstein distance is used. Notice, for instance, the characteristic example of the translation-based systems for “Fi→En”: tf, idf and their nBOW variants perform poorly (MRR 0.09), suggesting low-quality translations; still Wass and Entro_Wass achieve remarkable improvements (MRR), using the same resources.
Our last comments concern the effect of the OOV protocol. Overall, having such a protocol in place benefits Wass and Entro_Wass as the comparison of the tf and idf with tf+ and idf+ variants suggests. The impact of the heuristic is more evident for the “En-Gr” and “En-Fi” problems that gain several (.20) points in terms of MRR. This is also due to the fact that the proposed OOV mechanism reduces the OOV rates as Greek and Finnish have the smallest embeddings dictionary as shown in Table 1b.
In this paper, we demonstrated that the Wasserstein distance and its regularized version naturally incorporate term-weighting schemes. We also proposed a novel protocol to handle OOV words based on morphological similarity. Our experiments, carried out on eight CLDR datasets, established the superiority of the Wasserstein distance compared to other approaches, as well as the interest of integrating entropic regularization into the optimization and idf coefficients into the word weights. Finally, we showed the benefits of our OOV strategy, especially when the size of the embeddings dictionary for a language is small.
Our study opens several avenues for future research. First, we plan to evaluate the generalization of the Wasserstein distances for ad hoc retrieval, using for instance the benchmarks of the CLEF ad hoc news test suites. Further, while we showed that entropic regularization greatly improves the achieved results, it remains to be studied how one can apply other types of regularization to the OT problem. For instance, one could expect that group sparsity inducing regularization applied in the CLDR context can be a promising direction as semantically close words intrinsically form clusters and thus it appears meaningful to encourage the transport within them. Lastly, CLDR with Wasserstein distances is an interesting setting for comparing methods for deriving cross-lingual embeddings as their quality directly impacts the performance on the task.
- [1] Matt J Kusner, Yu Sun, Nicholas I Kolkin, Kilian Q Weinberger, et al. From word embeddings to document distances. In ICML, 2015.
- [2] Gaspard Monge. Mémoire sur la théorie des déblais et des remblais. Histoire de l’Académie Royale des Sciences, pages 666–704, 1781.
- [3] Leonid Kantorovich. On the translocation of masses. In C.R. (Doklady) Acad. Sci. URSS (N.S.), volume 37(10), pages 199–201, 1942.
- [4] Yossi Rubner, Carlo Tomasi, and Leonidas J Guibas. The earth mover’s distance as a metric for image retrieval. International Journal of Computer Vision, 2000.
- [5] Nicolas Courty, Rémi Flamary, and Devis Tuia. Domain adaptation with regularized optimal transport. In Proceedings ECML/PKDD 2014, pages 1–16, 2014.
- [6] Charlotte Laclau, Ievgen Redko, Basarab Matei, Younès Bennani, and Vincent Brault. Co-clustering through optimal transport. In ICML, 2017.
- [7] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In NIPS, pages 2292–2300, 2013.
- [8] Jean-David Benamou, Guillaume Carlier, Marco Cuturi, Luca Nenna, and Gabriel Peyré. Iterative Bregman projections for regularized transportation problems. SIAM Journal on Scientific Computing, 2(37):A1111–A1138, 2015.
- [9] Richard Sinkhorn and Paul Knopp. Concerning nonnegative matrices and doubly stochastic matrices. Pacific Journal of Mathematics, 21:343–348, 1967.
- [10] Ivan Vulić and Marie-Francine Moens. Bilingual distributed word representations from document-aligned comparable data. Journal of Artificial Intelligence Research, 55:953–994, 2016.
- [11] Robert Speer, Joshua Chin, and Catherine Havasi. ConceptNet 5.5: An open multilingual graph of general knowledge. In AAAI, 2017.
- [12] Yossi Rubner, Carlo Tomasi, and Leonidas J Guibas. A metric for distributions with applications to image databases. In ICCV, 1998.
- [13] Ivan Vulić, Wim De Smet, Jie Tang, and Marie-Francine Moens. Probabilistic topic modeling in multilingual settings: An overview of its methodology and applications. Information Processing & Management, 2015.
- [14] Ellen M Voorhees. Overview of TREC 2003. In TREC, 2003.
- [15] Andrei Broder. A taxonomy of web search. In SIGIR. ACM, 2002.
- [16] Judit Ács, Katalin Pajkossy, and András Kornai. Building basic vocabulary across 40 languages. In Sixth Workshop on Building and Using Comparable Corpora@ACL, 2013.
- [17] Judit Ács. Pivot-based multilingual dictionary building using Wiktionary. In LREC, 2014.
- [18] Jeff Mitchell and Mirella Lapata. Composition in distributional models of semantics. Cognitive Science, 2010.
- [19] William Blacoe and Mirella Lapata. A comparison of vector-based representations for semantic composition. In EMNLP-CoNLL, 2012.
- [20] Kosuke Fukumasu, Koji Eguchi, and Eric P Xing. Symmetric correspondence topic models for multilingual text analysis. In NIPS, 2012.
- [21] Yu-Chun Wang, Chun-Kai Wu, and Richard Tzong-Han Tsai. Cross-language article linking with different knowledge bases using bilingual topic model and translation features. Knowledge-Based Systems, 111:228–236, 2016.
- [22] Stéfan van der Walt, S Chris Colbert, and Gaël Varoquaux. The NumPy array: a structure for efficient numerical computation. Computing in Science & Engineering, 2011.
- [23] F. Pedregosa, G. Varoquaux, A. Gramfort, et al. Scikit-learn: Machine learning in Python. JMLR, 2011.
- [24] Rémi Flamary and Nicolas Courty. POT: Python optimal transport library. 2017.
- [25] Robert Speer and Joanna Lowry-Duda. ConceptNet at SemEval-2017 task 2: Extending word embeddings with multilingual relational knowledge. arXiv:1704.03560, 2017.
- [26] Ellen M Voorhees et al. The TREC-8 question answering track report. In TREC, 1999.