1 Introduction
Cross-lingual representations, such as embeddings of words and phrases into a single comparable feature space, have become a key technique in multilingual natural language processing. They hold strong promise for the goal of a joint understanding of concepts across languages, as well as for enabling the transfer of knowledge and machine learning models between languages. Cross-lingual embeddings can therefore serve a variety of downstream tasks such as bilingual lexicon induction, cross-lingual information retrieval, machine translation, and many applications of zero-shot transfer learning, the latter being particularly impactful for transfer from resource-rich to low-resource languages.
Existing methods can be broadly classified into two groups (Ruder et al., 2017). Mapping methods leverage existing monolingual embeddings, which are treated as independent, and apply a post-processing step to map the embeddings of each language into a shared space through a linear transformation (Mikolov et al., 2013b; Conneau et al., 2017; Joulin et al., 2018). Joint methods, on the other hand, learn representations concurrently for multiple languages by combining monolingual and cross-lingual training tasks (Luong et al., 2015; Coulmance et al., 2015; Gouws et al., 2015; Vulic and Moens, 2015; Chandar et al., 2014; Hermann and Blunsom, 2013). While recent work on word embeddings has focused almost exclusively on mapping methods, which require little to no cross-lingual supervision, Søgaard et al. (2018) establish that their performance is hindered by linguistic and domain divergences in general, and for distant language pairs in particular. Principally, their analysis shows that cross-lingual hubness, where a few words (hubs) in the source language are nearest cross-lingual neighbours of many words in the target language, and structural non-isometry between embeddings impose a fundamental barrier to the performance of linear mapping methods.
Ormazabal et al. (2019) propose joint learning as a means of mitigating these issues. Given parallel data such as sentences, a joint model learns to predict either the word or the context in both source and target languages. As we will demonstrate with results from our algorithm, joint methods yield compatible embeddings which are closer to isomorphic, less sensitive to hubness, and perform better on cross-lingual benchmarks.
Contributions. We propose the Bi-Sent2vec algorithm, which extends the Sent2vec algorithm (Pagliardini et al., 2018; Gupta et al., 2019) to the cross-lingual setting. We also revisit TransGram (Coulmance et al., 2015), another joint learning method, to assess the effectiveness of joint learning over mapping-based methods. Our contributions are:
- On cross-lingual sentence retrieval and monolingual word representation quality evaluations, Bi-Sent2vec significantly outperforms competing methods, both jointly trained and mapping-based, while preserving state-of-the-art performance on cross-lingual word retrieval tasks. For dissimilar language pairs, Bi-Sent2vec outperforms its competitors by an even larger margin on all tasks, hinting at the robustness of our method.
- Bi-Sent2vec performs on par with Laser (Artetxe and Schwenk, 2018), a multilingual RNN-based sentence encoder, on MLDoc (Schwenk and Li, 2018), a zero-shot cross-lingual transfer task on documents in multiple languages. Compared to Laser, our method improves computational efficiency by an order of magnitude for both training and inference, making it suitable for resource- or latency-constrained on-device cross-lingual NLP applications.
- We verify that joint learning methods consistently dominate state-of-the-art mapping methods on standard benchmarks, i.e., cross-lingual word and sentence retrieval.
- Training on parallel data additionally enriches monolingual representation quality, as evidenced by the superior performance of Bi-Sent2vec over FastText embeddings trained on a larger corpus.
We make our models and code publicly available.
2 Related Work
The literature on cross-lingual representation learning is extensive. Most recent advances in the field pursue unsupervised (Artetxe et al., 2017; Conneau et al., 2017; Chen and Cardie, 2018; Hoshen and Wolf, 2018; Grave et al., 2018b) or supervised (Joulin et al., 2018; Conneau et al., 2017) mapping or alignment-based algorithms. All of these methods use existing monolingual word embeddings, followed by a cross-lingual alignment procedure as a post-processing step, that is, they learn a simple (typically linear) mapping from the source language embedding space to the target language embedding space.
Supervised learning of a linear map from a source embedding space to a target embedding space (Mikolov et al., 2013b) based on a bilingual dictionary was one of the first approaches towards cross-lingual word embeddings. Additionally enforcing orthogonality constraints on the linear map results in rotations, and can be formulated as an orthogonal Procrustes problem (Smith et al., 2017). However, the authors found the translated embeddings to suffer from hubness, which they mitigate by introducing the inverted softmax as a corrective search metric at inference time. Artetxe et al. (2017) align embedding spaces starting from a parallel seed lexicon, such as digits, and iteratively build a larger bilingual dictionary during training.
In their seminal work, Conneau et al. (2017) propose an adversarial training method to learn a linear orthogonal map, avoiding bilingual supervision altogether. They further refine the learnt mapping by applying the Procrustes procedure iteratively with a synthetic dictionary generated through adversarial training. They also introduce the ‘Cross-Domain Similarity Local Scaling’ (CSLS) retrieval criterion for translating between spaces, which further improves word translation accuracy over the nearest-neighbour and inverted softmax metrics. They refer to their work as Multilingual Unsupervised and Supervised Embeddings (MUSE). In this paper, we will use MUSE to denote the unsupervised embeddings introduced by them, and “Procrustes + refine” to denote their supervised embeddings. Chen and Cardie (2018) similarly use “multilingual adversarial training” followed by “pseudo-supervised refinement” to obtain unsupervised multilingual word embeddings (UMWE), as opposed to the bilingual word embeddings of Conneau et al. (2017). Hoshen and Wolf (2018) describe an unsupervised approach in which they align the second moments of the two word embedding distributions, followed by further refinement. Building on the success of CSLS in reducing retrieval sensitivity to hubness, Joulin et al. (2018) directly optimize a convex relaxation of the CSLS function (RCSLS) to align existing monolingual embeddings using a bilingual dictionary. While none of the methods described above require parallel corpora, all assume structural isomorphism between the existing embeddings of each language (Mikolov et al., 2013b), i.e., that there exists a simple (typically linear) mapping function which aligns all existing embeddings. However, this is not always a realistic assumption (Søgaard et al., 2018): even in small toy examples it is clear that many geometric configurations of points cannot be linearly mapped to their targets.
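For reference, CSLS rescales the cosine similarity between a source vector $x$ and a target vector $y$ by the average similarity of each to its $k$ nearest neighbours on the other side ($k = 10$ by default in the MUSE implementation), penalizing hub vectors that are close to everything:

$$\mathrm{CSLS}(x, y) = 2\cos(x, y) \;-\; \frac{1}{k} \sum_{y' \in \mathcal{N}_T(x)} \cos(x, y') \;-\; \frac{1}{k} \sum_{x' \in \mathcal{N}_S(y)} \cos(x', y)$$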
Joint learning algorithms such as TransGram (Coulmance et al., 2015) and Cr5 (Josifoski et al., 2019) circumvent this restriction by simultaneously learning embeddings as well as their alignment. TransGram, for example, extends the Skipgram (Mikolov et al., 2013a) method to jointly train bilingual embeddings in the same space on a corpus composed of parallel sentences. In addition to the monolingual Skipgram loss for both languages, they introduce a similar cross-lingual loss where a word from a sentence in one language is trained to predict the word contents of the paired sentence in the other language. Cr5, on the other hand, uses document-aligned corpora to achieve state-of-the-art results for cross-lingual document retrieval while staying competitive at cross-lingual sentence and word retrieval. TransGram embeddings have been absent from the discussion in most recent work. However, the growing abundance of sentence-aligned parallel data (Tiedemann, 2012) merits a reappraisal of their performance.
Ormazabal et al. (2019) use BiVec (Luong et al., 2015), another bilingual extension of Skipgram which uses a bilingual dictionary in addition to parallel sentences to obtain word alignments, and compare it with the unsupervised version of VecMap (Artetxe et al., 2018a), a mapping-based method. Our experiments show that this extra level of supervision in the case of BiVec is redundant for obtaining state-of-the-art performance.
3 Model
Proposed Model.
Our Bi-Sent2vec model is a cross-lingual extension of Sent2vec, proposed by Pagliardini et al. (2018), which in turn is an extension of the C-BOW embedding method (Mikolov et al., 2013a). Sent2vec is trained on sentence contexts, with the word and higher-order word n-gram embeddings specifically optimized towards obtaining robust sentence embeddings using additive composition. Formally,
Sent2vec obtains the representation $v_S$ of a sentence $S$ by averaging its word-n-gram embeddings (including unigrams) as $v_S := \frac{1}{|R(S)|} \sum_{g \in R(S)} v_g$, where $R(S)$ is the set of word n-grams present in the sentence $S$. The Sent2vec training objective aims to predict a masked word token $w_t$ in the sentence using the representation of the rest of the sentence, $v_{S \setminus \{w_t\}}$. To formulate the training objective, we use the logistic loss $\ell(x) := \log(1 + e^{-x})$ in conjunction with negative sampling. More precisely, for a raw text corpus $\mathcal{C}$, the monolingual training objective for Sent2vec is given by

$$\min_{U, V} \; \sum_{S \in \mathcal{C}} \sum_{w_t \in S} \bigg( \ell\big(u_{w_t}^\top v_{S \setminus \{w_t\}}\big) + \sum_{w' \in N_{w_t}} \ell\big(-u_{w'}^\top v_{S \setminus \{w_t\}}\big) \bigg) \qquad (1)$$

where $w_t$ is the target word, and $V$ and $U$ are the source n-gram and target word embedding matrices respectively (with rows $v_g$ and $u_w$). Here, the set of negative words $N_{w_t}$ is sampled from a multinomial distribution where the probability of picking a word is directly proportional to the square root of its frequency in the corpus. Each target word $w_t$ is subsampled with a probability that decreases with its frequency $f_{w_t}$ in the corpus.

We adapt the Sent2vec model to bilingual corpora by introducing a cross-lingual loss in addition to the monolingual loss in equation (1). Given a sentence pair $(S_{l_1}, S_{l_2})$, where $S_{l_1}$ and $S_{l_2}$ are translations of each other in languages $l_1$ and $l_2$, the cross-lingual loss for a target word $w_t$ in $S_{l_1}$ is given by

$$\ell\big(u_{w_t}^\top v_{S_{l_2}}\big) + \sum_{w' \in N_{w_t}} \ell\big(-u_{w'}^\top v_{S_{l_2}}\big) \qquad (2)$$

Thus, we use the sentence $S_{l_2}$ to predict the constituent words of $S_{l_1}$ and vice-versa, in a similar fashion to the monolingual Sent2vec, as illustrated in Figure 1. This ensures that the word and n-gram embeddings of both languages lie in the same space.
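To make the combined objective concrete, here is a minimal NumPy sketch of evaluating the monolingual term of eq. (1) and the cross-lingual term of eq. (2) for a single (sentence pair, target word) instance. All names and toy values are illustrative only; the actual implementation is the C++ FastText-based one described under Implementation Details, with gradients handled by asynchronous SGD.

```python
import numpy as np

rng = np.random.default_rng(0)

def logistic_loss(x):
    # l(x) = log(1 + exp(-x)), computed stably
    return np.logaddexp(0.0, -x)

def sentence_embedding(V, ngram_ids):
    # v_S: average of the n-gram embeddings of the sentence
    # (for brevity this sketch uses unigram ids only; in practice
    # higher-order n-gram ids would be included in the context)
    return V[ngram_ids].mean(axis=0)

def loss_term(U, V, context_ids, target_id, negative_ids):
    # One summand of eq. (1) (context = sentence minus target) or
    # of eq. (2) (context = the translated sentence).
    v = sentence_embedding(V, context_ids)
    loss = logistic_loss(U[target_id] @ v)
    loss += sum(logistic_loss(-U[n] @ v) for n in negative_ids)
    return loss

# Toy setup: shared vocabulary of 1000 ids, 300-d embeddings.
n_vocab, dim = 1000, 300
U = rng.normal(scale=0.1, size=(n_vocab, dim))  # target-word matrix U
V = rng.normal(scale=0.1, size=(n_vocab, dim))  # source n-gram matrix V

s1 = [3, 17, 42, 97]        # id sequence of a sentence in language l1
s2 = [5, 23, 42, 61, 88]    # ids of its translation in language l2
target = 42                 # masked target word from s1
negatives = rng.integers(0, n_vocab, size=10)  # in practice: sqrt-frequency sampling

mono = loss_term(U, V, [w for w in s1 if w != target], target, negatives)
cross = loss_term(U, V, s2, target, negatives)  # predict the target from the translation
total = mono + cross  # joint Bi-Sent2vec objective for this instance
```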
[Figure 1: Illustration of the Bi-Sent2vec training objective: words in a sentence are predicted from the representation of its translation, and vice-versa.]
Implementation Details.
We build our C++ implementation on top of the FastText library (Bojanowski et al., 2016; Joulin et al., 2016). Model parameters are updated by asynchronous SGD with a linearly decaying learning rate.
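For illustration, the linear decay can be written as follows (a sketch; the names are placeholders, and the actual schedule lives inside the FastText training loop):

```python
def linearly_decayed_lr(lr0, step, total_steps):
    # Learning rate decays linearly from lr0 at step 0 to 0 at the end of training.
    return lr0 * max(0.0, 1.0 - step / total_steps)
```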
Our model is trained on the ParaCrawl (Esplà-Gomis, 2019) v4.0 datasets for the English-Italian, English-German, English-French, English-Spanish, English-Hungarian and English-Finnish language pairs. For the English-Russian language pair, we concatenate the OpenSubtitles corpus (http://www.opensubtitles.org/) (Lison and Tiedemann, 2016) and the Tanzil project (http://tanzil.net/, Quran translations) corpus. The corpora for all language pairs except English-Finnish and English-Hungarian contain between 17 and 32 million parallel sentence pairs; the dissimilar pairs (English-Hungarian and English-Finnish) have approximately 2 million each. Evaluation results for these two language pairs can be found in Subsection 4.4. Exact statistics for the different corpora can be found in Table 7 in the Appendix. All sentences were tokenized using SpaCy tokenizers (https://spacy.io/) for their respective languages.
For each dataset, we trained two different models: one with unigram embeddings only, and the other additionally augmented with bigrams. The earlier TransGram models (Coulmance et al., 2015) were trained on a small amount of data (the Europarl corpus (Koehn, 2005)). To facilitate a fair comparison, we train new TransGram embeddings on the same data used for Bi-Sent2vec. Given that TransGram and Bi-Sent2vec are cross-lingual extensions of Skipgram and Sent2vec respectively, we use the same parameters as Bojanowski et al. (2016) and Gupta et al. (2019), except that we increase the number of epochs for TransGram to 8 and decrease them for Bi-Sent2vec to 5. A preliminary hyperparameter search on Bi-Sent2vec and TransGram (beyond changing the number of epochs) did not improve the results. All parameters for training the TransGram and Bi-Sent2vec models can be found in Table 6 in the Appendix.

4 Evaluation
To assess the quality of the obtained word and sentence embeddings, as well as their cross-lingual alignment, we compare our results on the following four benchmarks:

- Cross-lingual word retrieval
- Monolingual word representation quality
- Cross-lingual sentence retrieval
- Zero-shot cross-lingual transfer of document classifiers

The benchmarks are presented in order of increasing linguistic granularity, i.e., word, sentence, and document level. We also analyze the effect of training data by studying the relationship between representation quality and corpus size.
We use the code available in the MUSE library (Conneau et al., 2017, https://github.com/facebookresearch/MUSE) for all evaluations except the zero-shot classifier transfer, which is tested on the MLDoc task (Schwenk and Li, 2018, https://github.com/facebookresearch/MLDoc).
4.1 Word Translation
The task involves retrieving the correct translation(s) of a word in a source language from a target language. To evaluate translation accuracy, we use the bilingual dictionaries constructed by Conneau et al. (2017). We consider 1500 source test queries and 200k target words for each language pair, and report P@1 scores for the supervised and unsupervised baselines as well as our models in Table 1.
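For intuition, the following is a minimal NumPy sketch of how P@1 can be computed under both the nearest-neighbour (NN) and CSLS criteria. It mirrors the retrieval logic of the MUSE evaluation but is not the MUSE code, and it assumes a single gold translation per query for simplicity.

```python
import numpy as np

def precision_at_1(src, tgt, gold, k=10, use_csls=True):
    """P@1 for word translation. src: (n, d) query embeddings,
    tgt: (m, d) candidate embeddings, both L2-normalized row-wise;
    gold: length-n array of correct target row indices."""
    sim = src @ tgt.T                  # cosine similarity matrix
    if use_csls:
        # Average similarity to the k nearest neighbours on each side.
        # (MUSE computes the target-side term over the full vocabulary;
        # here it is restricted to the query set for brevity.)
        r_src = np.sort(sim, axis=1)[:, -k:].mean(axis=1, keepdims=True)
        r_tgt = np.sort(sim, axis=0)[-k:, :].mean(axis=0, keepdims=True)
        sim = 2 * sim - r_src - r_tgt  # CSLS(x, y)
    pred = sim.argmax(axis=1)          # best candidate per query
    return float((pred == gold).mean())
```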
Table 1: P@1 scores for cross-lingual word retrieval (word translation).

Method | en-es | es-en | en-fr | fr-en | en-de | de-en | en-ru | ru-en | en-it | it-en | avg.
---|---|---|---|---|---|---|---|---|---|---|---
MUSE (Conneau et al., 2017) | 81.7 | 83.3 | 82.3 | 82.1 | 74.0 | 72.2 | 44.0 | 59.1 | 78.6 | 77.9 | 73.5 |
UMWE (Chen and Cardie, 2018) | 82.5 | 83.1 | 82.5 | 82.1 | 74.6 | 72.5 | 49.5 | 61.7 | 78.3 | 77.0 | 74.4 |
Procrustes + refine (Conneau et al., 2017) | 82.4 | 83.9 | 82.3 | 83.2 | 75.3 | 73.2 | 50.1 | 63.5 | 77.5 | 77.6 | 74.9 |
RCSLS (Joulin et al., 2018) | 83.7 | 87.1 | 84.1 | 84.7 | 79.2 | 77.5 | 60.9 | 70.2 | 81.1 | 82.7 | 79.1 |
TransGram (Coulmance et al., 2015) | 91.6 | 88.6 | 89.1 | 90.1 | 87.5 | 87.2 | 65.6 | 73.7 | 88.6 | 89.5 | 85.2 |
VecMap (unsupervised) (Artetxe et al., 2018a) | 87.4 | 87.8 | 88.3 | 88.5 | 84.3 | 87.2 | 48.6 | 50.5 | 87.4 | 86.5 | 79.6 |
VecMap (supervised) (Artetxe et al., 2018b) | 87.2 | 90.2 | 87.6 | 90.4 | 87.3 | 86.8 | 49.7 | 65.6 | 87.2 | 89.2 | 82.1 |
BiVec NN (Luong et al., 2015) | 87.4 | 88.6 | 86.8 | 89.1 | 87.5 | 87.2 | 64.0 | 59.1 | 86.8 | 84.0 | 81.7 |
BiVec CSLS (Luong et al., 2015) | 87.6 | 89.1 | 88.8 | 90.3 | 86.4 | 87.2 | 66.1 | 70.6 | 87.6 | 87.8 | 84.3 |
Bi-Sent2vec uni. NN | 86.9 | 91.6 | 86.9 | 91.0 | 86.0 | 88.7 | 58.0 | 72.8 | 88.3 | 92.4 | 84.3 |
Bi-Sent2vec uni. + bi. NN | 89.4 | 92.9 | 89.3 | 92.8 | 86.7 | 89.3 | 59.0 | 70.2 | 89.5 | 91.8 | 85.1 |
Bi-Sent2vec uni. CSLS | 86.0 | 91.7 | 86.4 | 91.4 | 84.6 | 88.8 | 60.5 | 73.0 | 88.2 | 91.8 | 84.2 |
Bi-Sent2vec uni. + bi. CSLS | 89.0 | 92.1 | 88.9 | 92.4 | 86.5 | 89.0 | 61.0 | 73.5 | 89.6 | 91.4 | 85.3 |
4.2 Monolingual Word Representation Quality
We assess the improvement in monolingual representation quality brought by our proposed cross-lingual training by evaluating performance on monolingual word similarity tasks. To disentangle the specific contribution of the cross-lingual loss, we train Sent2vec, the monolingual counterpart of Bi-Sent2vec, on the same corpora as our method.
Table 2: Monolingual word similarity (Pearson correlation) for models trained on the En-It pair.

Method \ Dataset | SimLex-999 (en) | SimLex-999 (it) | WS-353 (en) | WS-353 (it)
---|---|---|---|---
MUSE | 0.38 | 0.30 | 0.74 | 0.64 |
RCSLS | 0.38 | 0.30 | 0.74 | 0.64 |
FastText- Common Crawl | 0.49 | 0.32 | 0.75 | 0.57 |
BiVec | 0.40 | 0.36 | 0.70 | 0.60 |
TransGram | 0.43 | 0.37 | 0.73 | 0.63 |
Sent2vec uni. | 0.49 | 0.38 | 0.73 | 0.60 |
Bi-Sent2vec uni. | 0.57 | 0.47 | 0.79 | 0.65 |
Bi-Sent2vec uni. + bi. | 0.58 | 0.50 | 0.80 | 0.69 |
Performance on monolingual word-similarity tasks is evaluated using the English SimLex-999 (Hill et al., 2014) and its Italian and German translations, and the English WS-353 (Finkelstein et al., 2001) and its German, Italian and Spanish translations. For French, we use a translation of the RG-65 dataset (Joubarne and Inkpen, 2011). Pearson correlation scores are used to measure the agreement between human-annotated word similarities and predicted cosine similarities. We also include FastText monolingual vectors trained on CommonCrawl data (Grave et al., 2018a), which comprises 600 billion, 68 billion, 66 billion, 72 billion and 36 billion words of English, French, German, Spanish and Italian respectively, and is at least 100 times larger than the corpora on which we trained Bi-Sent2vec. We report Pearson correlation scores on the different word-similarity datasets for the En-It pair in Table 2. Evaluation results on other language pairs are similar and can be found in Tables 8, 9, and 10 in the appendix.
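The evaluation itself is straightforward; below is a hypothetical sketch, where the `emb` dictionary and the triple format stand in for the actual dataset loaders used by the MUSE tooling.

```python
import numpy as np
from scipy.stats import pearsonr

def word_similarity(emb, triples):
    """Pearson correlation between human similarity judgments and
    cosine similarities. emb: dict word -> vector; triples: list of
    (word1, word2, human_score), e.g. parsed from SimLex-999."""
    human, model = [], []
    for w1, w2, score in triples:
        if w1 in emb and w2 in emb:    # skip out-of-vocabulary pairs
            v1, v2 = emb[w1], emb[w2]
            cos = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))
            human.append(score)
            model.append(cos)
    return pearsonr(human, model)[0]
```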
4.3 Cross-lingual Sentence Retrieval
The primary contribution of our work is to deliver improved cross-lingual sentence representations. We test the sentence embeddings obtained by bag-of-words composition for each method on sentence retrieval across different languages on the Europarl corpus. In particular, the tf-idf weighted average of word embeddings is used to construct sentence embeddings. We consider 2000 sentences in the source language dataset and retrieve their translations among 200K sentences in the target language dataset. The remaining 300K sentences in the Europarl corpus are used to calculate tf-idf weights. P@1 results for the unsupervised and supervised benchmarks and for our models are given in Table 3.
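A sketch of the tf-idf weighted composition follows; the function and variable names are illustrative, with the `idf` weights corresponding to those estimated on the held-out 300K Europarl sentences. Retrieval over the resulting sentence vectors then proceeds exactly as in the word case, with NN or CSLS scoring.

```python
import numpy as np
from collections import Counter

def tfidf_sentence_embedding(tokens, emb, idf):
    """Bag-of-words sentence embedding: tf-idf weighted average of word
    vectors. tokens: tokenized sentence; emb: word -> vector;
    idf: word -> inverse document frequency (precomputed on held-out text)."""
    vecs, weights = [], []
    for word, count in Counter(tokens).items():
        if word in emb and word in idf:
            vecs.append(emb[word])
            weights.append(count * idf[word])  # tf * idf weight per word type
    if not vecs:
        return np.zeros(next(iter(emb.values())).shape)  # no in-vocabulary tokens
    weights = np.asarray(weights)[:, None]
    return (np.asarray(vecs) * weights).sum(axis=0) / weights.sum()
```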
Table 3: P@1 scores for cross-lingual sentence retrieval on Europarl.

Method | en-es | es-en | en-fr | fr-en | en-de | de-en | en-it | it-en | avg.
---|---|---|---|---|---|---|---|---|---
MUSE | 72.7 | 71.5 | 69.2 | 68.8 | 53.3 | 53.4 | 66.1 | 64.3 | 64.9 |
RCSLS | 26.9 | 26.7 | 19.3 | 21.2 | 8.8 | 11.3 | 15.1 | 17.6 | 18.4 |
TransGram | 83.5 | 81.4 | 80.4 | 81.6 | 64.8 | 69.9 | 77.2 | 77.9 | 77.1 |
VecMap (unsupervised) | 81.7 | 82.1 | 79.8 | 80.4 | 62.8 | 64.6 | 69.0 | 71.1 | 74.0 |
VecMap (supervised) | 81.3 | 81 | 80.4 | 80.7 | 62.6 | 64.3 | 67.8 | 71 | 73.6 |
BiVec NN | 69.8 | 77.1 | 54.7 | 75.5 | 56.1 | 44.1 | 58.2 | 45.1 | 60.1 |
BiVec CSLS | 81.6 | 83.4 | 78.1 | 81.6 | 71.6 | 68.1 | 74.2 | 72.4 | 76.4 |
Bi-Sent2vec uni. NN | 87.8 | 86.4 | 85.2 | 83.4 | 82.3 | 80.2 | 85.9 | 85.8 | 84.6 |
Bi-Sent2vec uni. + bi. NN | 87.9 | 87.8 | 86.1 | 83.9 | 79.5 | 79.7 | 85.1 | 85.3 | 84.4 |
Bi-Sent2vec uni. CSLS | 89.5 | 88.5 | 87.1 | 86.4 | 84.4 | 83.0 | 88.2 | 87.5 | 86.8 |
Bi-Sent2vec uni. + bi. CSLS | 89.7 | 89.6 | 87.8 | 87.4 | 84.2 | 84.0 | 87.9 | 87.6 | 87.3 |
Reduction in error | 37.5 | 37.3 | 37.8 | 31.5 | 44.4 | 46.8 | 46.9 | 43.9 | – |
4.4 Performance on Dissimilar Language Pairs
We report a substantial improvement over previous models on cross-lingual word and sentence retrieval tasks for the dissimilar language pairs (English-Finnish and English-Hungarian). We use the same evaluation scheme as in Subsections 4.1 and 4.3. Results for these pairs are included in Table 4.
Table 4: P@1 scores for cross-lingual word and sentence retrieval on dissimilar language pairs.

Method | word en-fi | word fi-en | word en-hu | word hu-en | sentence en-fi | sentence fi-en | sentence en-hu | sentence hu-en
---|---|---|---|---|---|---|---|---
MUSE | 48.1 | 59.5 | 53.9 | 64.9 | 21.7 | 29.5 | 39.1 | 46.7 |
RCSLS | 61.8 | 69.9 | 67.0 | 73.0 | 3.2 | 4.8 | 3.6 | 5.1 |
VecMap (unsupervised) | 62.5 | 66.8 | 61.6 | 68.7 | 13.2 | 14.7 | 20.5 | 19.3 |
VecMap (supervised) | 62.6 | 78.3 | 63.7 | 76.6 | 15.0 | 16.9 | 20.9 | 21.7 |
BiVec NN | 62.1 | 55.3 | 62.1 | 53.7 | 14.2 | 9.7 | 26.2 | 13.7 |
BiVec CSLS | 69.6 | 78.0 | 72.4 | 78.4 | 33.3 | 32.0 | 46.7 | 41.3 |
TransGram | 69.7 | 81.1 | 73.1 | 80.8 | 35.4 | 40.5 | 52.1 | 55 |
Bi-Sent2vec uni. NN | 71.2 | 85.4 | 75.6 | 83.9 | 63.5 | 64.2 | 75.2 | 76.2 |
Bi-Sent2vec uni. + bi. NN | 68.5 | 81.7 | 71.4 | 79.4 | 57.5 | 55.9 | 65.8 | 65.2 |
Bi-Sent2vec uni. CSLS | 72.0 | 86.5 | 76.3 | 85.1 | 70.2 | 69.0 | 81.4 | 80.8 |
Bi-Sent2vec uni. + bi. CSLS | 70.1 | 84.4 | 73.7 | 81.7 | 66 | 64.1 | 73.8 | 74.5 |
4.5 Zero-shot Cross-lingual Transfer of Document Classifiers
The MLDoc multilingual document classification task (Schwenk and Li, 2018) consists of news documents in 8 languages which need to be classified into 4 categories. To demonstrate the ability to transfer trained classifiers robustly between languages, we use a zero-shot setting: we train a classifier on embeddings in the source language and report the accuracy of the same classifier applied to the target language. As the classifier, we use a simple feed-forward neural network with two hidden layers of sizes 10 and 8 respectively, optimized using the Adam optimizer. Each document is represented as the sum of its sentence embeddings.
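As a concrete sketch, the setup can be reproduced along these lines with scikit-learn; the original experiments' training framework is unspecified, and the dummy arrays below are placeholders for real MLDoc features (document = sum of its Bi-Sent2vec sentence embeddings).

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Stand-ins for real MLDoc data: 300-d document embeddings with one of
# 4 news-category labels, in a source and a target language.
X_src, y_src = rng.normal(size=(1000, 300)), rng.integers(0, 4, 1000)
X_tgt, y_tgt = rng.normal(size=(1000, 300)), rng.integers(0, 4, 1000)

# Feed-forward classifier with hidden layers of sizes 10 and 8,
# optimized with Adam, as described above.
clf = MLPClassifier(hidden_layer_sizes=(10, 8), solver="adam", max_iter=500)
clf.fit(X_src, y_src)                 # train on source-language documents only
accuracy = clf.score(X_tgt, y_tgt)    # zero-shot evaluation on the target language
```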
Table 5: Zero-shot cross-lingual document classification accuracy on MLDoc (train language → test language).

Method | en→es | es→en | en→fr | fr→en | en→de | de→en | en→it | it→en | avg.
---|---|---|---|---|---|---|---|---|---
Laser | 79.3 | 69.6 | 78.0 | 80.1 | 86.3 | 80.8 | 70.2 | 74.2 | 77.3 |
Bi-Sent2vec | 74.0 | 71.5 | 81.6 | 82.2 | 86.5 | 79.2 | 75.0 | 72.6 | 77.8 |
We compare the performance of Bi-Sent2vec with the Laser sentence embeddings (Artetxe and Schwenk, 2018) in Table 5. Laser is a multilingual sentence embedding model composed of a biLSTM encoder and an LSTM decoder, using a shared byte-pair encoding vocabulary of 50k tokens. The Laser model we compare against was trained on 223M sentences covering 93 languages and requires 5 days of training on 16 V100 GPUs, whereas our model takes 1-2.5 hours per language pair on 30 CPU threads.
4.6 Effect of Corpus Size on Representation Quality
We conduct an ablation study on how Bi-Sent2vec embeddings’ performance depends on the size of the training corpus. We uniformly sample smaller subsets of the En-Fr ParaCrawl dataset and train a Bi-Sent2vec model on them. We test word/sentence translation performance with the CSLS retrieval criterion, and monolingual embedding quality for En-Fr with increasing ParaCrawl corpus size. The results are illustrated in Figures 2 and 3.
[Figure 2: Cross-lingual word and sentence retrieval performance (CSLS) for En-Fr as a function of ParaCrawl corpus size.]

[Figure 3: Monolingual embedding quality for En-Fr as a function of ParaCrawl corpus size.]
5 Discussion
In this section, we discuss the results on the monolingual and cross-lingual benchmarks presented in Tables 1-5, together with a data ablation study of how the model behaves with increasing parallel corpus size (Figures 2 and 3). The most impressive outcome of our experiments is the improved cross-lingual sentence retrieval performance, which we elaborate on, along with word translation, in the next subsection.
Cross-lingual evaluations For cross-lingual tasks, we observe in Table 1 that jointly trained embeddings produce much better results on cross-lingual word and sentence retrieval tasks. Bi-Sent2vec's performance on word retrieval tasks is uniformly superior to that of mapping methods, achieving up to 11.5% higher P@1 than RCSLS for the English to German language pair, consistent with the results of Ormazabal et al. (2019). It is also on par with, or better than, competing joint methods, except on translation from Russian to English, where TransGram achieves a significantly better score. For word retrieval tasks, there is no discernible difference between the CSLS and NN criteria for Bi-Sent2vec, suggesting the relative absence of the hubness phenomenon which significantly hinders the performance of cross-lingual word embedding methods.
Our principal contribution is in improving cross-lingual sentence retrieval. Table 3 shows Bi-Sent2vec decisively outperforms all other methods by a wide margin, reducing the relative P@1 error by anywhere from 31.5% to 46.9%. Our model displays considerably less variance than others in quality across language pairs, with at most a 5% deficit between the best and worst results, and nearly symmetric accuracy within a language pair. TransGram also outperforms the mapping-based methods, but still falls significantly short of Bi-Sent2vec. These results can be attributed to the fact that Bi-Sent2vec directly optimizes for obtaining robust sentence embeddings by additive composition of its word embeddings. Since Bi-Sent2vec's learning objective is the closest to a sentence retrieval task amongst current state-of-the-art methods, it can surpass them without sacrificing performance on other tasks.
Cross-lingual evaluations on dissimilar language pairs Unlike the other language pairs in the evaluation, English-Finnish and English-Hungarian combine languages from two different language families (English is an Indo-European language, while Finnish and Hungarian are Finno-Ugric). In Table 4, we see that the performance boost Bi-Sent2vec achieves over competing methods is even more pronounced for these dissimilar language pairs than for pairs of closely related languages. This observation affirms the suitability of Bi-Sent2vec for learning joint representations of languages from different families.
Monolingual word quality For the monolingual word similarity tasks, we observe large gains over existing methods. Sent2vec is trained on the same corpora as our method, and the FastText vectors are trained on the CommonCrawl corpora, which are more than 100 times larger than ParaCrawl v4.0. In Table 2, we see that Bi-Sent2vec outperforms both by a significant margin on SimLex-999 and WS-353, two important monolingual word quality benchmarks. This accords with the finding of Faruqui and Dyer (2014) that bilingual contexts can be surprisingly effective for learning monolingual word representations. Amongst the joint-training methods, Bi-Sent2vec also outperforms TransGram and BiVec trained on the same corpora by a significant margin, again hinting at the superiority of the sentence-level loss function over a fixed context window loss.
Effect of n-grams Gupta et al. (2019) report improved results on monolingual word representation evaluation tasks for Sent2vec and FastText word vectors when trained alongside word n-grams. Our method builds on their observation that unigram vectors trained alongside bigrams significantly outperform unigrams trained alone on the majority of evaluation tasks. We can see from Tables 1-3 that this holds in the bilingual case as well. However, for dissimilar language pairs (Table 4), we observe that using n-grams degrades the cross-lingual performance of the embeddings. This suggests that higher-order n-grams may not be helpful for language pairs whose grammatical structures are contrasting.
Effect of corpus size Considering the cross-lingual performance curve exhibited by Bi-Sent2vec in Figure 2, increasing the corpus size for the English-French dataset up to 1-3.1M sentence pairs appears to saturate the performance of the model on cross-lingual word/sentence retrieval, after which performance either plateaus or degrades slightly. This is an encouraging result, indicating that joint methods can achieve promising performance with significantly less data. It implies that joint methods may not necessarily be constrained to high-resource language pairs as previously assumed, though further experimentation is needed to verify this claim.
It should be noted from Figure 3 that monolingual quality does keep improving as the corpus grows. A potential way to overcome the plateauing of cross-lingual performance is to give different weights to the monolingual and cross-lingual components of the loss, with the weights possibly depending on factors such as training progress.
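One simple instantiation of this idea, stated here as a suggestion rather than an implemented variant, is a convex combination of the two loss components with a weight $\alpha_t$ that may vary with training progress $t$:

$$\mathcal{L}_t \;=\; \alpha_t \, \mathcal{L}_{\text{mono}} \;+\; (1 - \alpha_t) \, \mathcal{L}_{\text{cross}}$$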
Comparison with a cross-lingual sentence embedding model on a document-level task On the MLDoc classifier transfer task (Schwenk and Li, 2018), where we evaluate a classifier learned on documents in one language on documents in another, Table 5 shows that we achieve parity with the Laser model on language pairs involving English: Bi-Sent2vec's average accuracy of 77.8% is slightly higher than Laser's 77.3%. While the comparison is not entirely fair, since Laser is multilingual in nature and trained on a different dataset, we emphasize that Bi-Sent2vec is a bag-of-words method, whereas Laser uses a multi-layered biLSTM sentence encoder. Our method only requires averaging a set of vectors to encode sentences, reducing its computational footprint significantly. This makes Bi-Sent2vec an ideal candidate for computationally efficient on-device cross-lingual NLP, unlike Laser, which has a large computational overhead and specialized hardware requirements for encoding sentences.
6 Conclusion and Future Work
We introduce a cross-lingual extension of an existing monolingual word and sentence embedding method. The proposed model is tested at three levels of linguistic granularity: words, sentences and documents. The model outperforms all other methods by a wide margin on the cross-lingual sentence retrieval task while maintaining parity with the best-performing methods on word translation tasks. Our method achieves parity with Laser on zero-shot document classification, despite being a much simpler model. We also demonstrate that training on parallel data yields a significant improvement in the monolingual word representation quality.
The success of our model at the bilingual level calls for its extension to the multilingual level, especially for pairs with little or no parallel corpora. While the amount of bilingual/multilingual parallel data has grown in abundance, the amount of monolingual data available is practically limitless. Consequently, we would like to explore training cross-lingual embeddings with a large amount of raw text combined with a smaller amount of parallel data.
Acknowledgments
We acknowledge funding from the Innosuisse ADA grant.
References
- Artetxe et al. (2017). Learning bilingual word embeddings with (almost) no bilingual data. In ACL.
- Artetxe et al. (2018a). A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 789-798.
- Artetxe et al. (2018b). Generalizing and improving bilingual word embedding mappings with a multi-step framework of linear transformations. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, pp. 5012-5019.
- Artetxe and Schwenk (2018). Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Transactions of the Association for Computational Linguistics 7, pp. 597-610.
- Bojanowski et al. (2016). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5, pp. 135-146.
- Chandar et al. (2014). An autoencoder approach to learning bilingual word representations. In NIPS.
- Chen and Cardie (2018). Unsupervised multilingual word embeddings. In EMNLP.
- Conneau et al. (2017). Word translation without parallel data. arXiv abs/1710.04087.
- Coulmance et al. (2015). Trans-gram, fast cross-lingual word-embeddings. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1109-1113.
- Esplà-Gomis (2019). ParaCrawl: web-scale parallel corpora for the languages of the EU.
- Faruqui and Dyer (2014). Improving vector space word representations using multilingual correlation. In EACL.
- Finkelstein et al. (2001). Placing search in context: the concept revisited. In WWW.
- Gouws et al. (2015). BilBOWA: fast bilingual distributed representations without word alignments.
- Grave et al. (2018a). Learning word vectors for 157 languages. arXiv abs/1802.06893.
- Grave et al. (2018b). Unsupervised alignment of embeddings with Wasserstein Procrustes. In AISTATS.
- Gupta et al. (2019). Better word embeddings by disentangling contextual n-gram information. In NAACL-HLT.
- Hermann and Blunsom (2013). Multilingual distributed representations without word alignment. In ICLR 2014.
- Hill et al. (2014). SimLex-999: evaluating semantic models with (genuine) similarity estimation. Computational Linguistics 41, pp. 665-695.
- Hoshen and Wolf (2018). Non-adversarial unsupervised word translation. In EMNLP.
- Josifoski et al. (2019). Crosslingual document embedding as reduced-rank ridge regression. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pp. 744-752.
- Joubarne and Inkpen (2011). Comparison of semantic similarity for different languages using the Google n-gram corpus and second-order co-occurrence measures. In Canadian Conference on AI.
- Joulin et al. (2018). Loss in translation: learning bilingual word mapping with a retrieval criterion. In EMNLP.
- Joulin et al. (2016). Bag of tricks for efficient text classification. In EACL.
- Koehn (2005). Europarl: a parallel corpus for statistical machine translation.
- Lison and Tiedemann (2016). OpenSubtitles2016: extracting large parallel corpora from movie and TV subtitles. In LREC.
- Luong et al. (2015). Bilingual word representations with monolingual quality in mind. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, pp. 151-159.
- Mikolov et al. (2013a). Efficient estimation of word representations in vector space. CoRR abs/1301.3781.
- Mikolov et al. (2013b). Exploiting similarities among languages for machine translation. arXiv abs/1309.4168.
- Ormazabal et al. (2019). Analyzing the limitations of cross-lingual word embedding mappings. In ACL.
- Pagliardini et al. (2018). Unsupervised learning of sentence embeddings using compositional n-gram features. In NAACL-HLT.
- Ruder et al. (2017). A survey of cross-lingual word embedding models. arXiv preprint arXiv:1706.04902.
- Schwenk and Li (2018). A corpus for multilingual document classification in eight languages. arXiv preprint arXiv:1805.09821.
- Smith et al. (2017). Offline bilingual word vectors, orthogonal transformations and the inverted softmax. arXiv abs/1702.03859.
- Søgaard et al. (2018). On the limitations of unsupervised bilingual dictionary induction. In ACL.
- Tiedemann (2012). Parallel data, tools and interfaces in OPUS. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC-2012), pp. 2214-2218.
- Vulic and Moens (2015). Bilingual word embeddings from non-parallel document-aligned data applied to bilingual lexicon induction. In ACL.
Appendix A
A.1 Dataset Statistics
Table 7: Statistics of the training corpora.

Dataset | Number of sentences | Number of words
---|---|---
En-De ParaCrawl v4.0 | 17 Million | 308 Million
En-Es ParaCrawl v4.0 | 22 Million | 477 Million
En-Fi ParaCrawl v4.0 | 2.16 Million | 42 Million
En-Fr ParaCrawl v4.0 | 32 Million | 665 Million
En-Hu ParaCrawl v4.0 | 1.91 Million | 31 Million
En-It ParaCrawl v4.0 | 13 Million | 261 Million
En-Ru OpenSubtitles + Tanzil | 27 Million | 363 Million
Wikipedia - En | 70 Million | 1792 Million
Wikipedia - De | – | 1384 Million
Wikipedia - Fr | – | 1108 Million
Wikipedia - Es | – | 797 Million
Wikipedia - It | – | 702 Million
Wikipedia - Ru | – | 824 Million
Common Crawl - En | – | 600 Billion
Common Crawl - De | – | 66 Billion
Common Crawl - Fr | – | 68 Billion
Common Crawl - It | – | 36 Billion
Common Crawl - Es | – | 72 Billion
We used the ParaCrawl v4.0 corpora for training the Bi-Sent2vec, Sent2vec, BiVec, VecMap and TransGram embeddings, except for the En-Ru pair, for which we used the OpenSubtitles and Tanzil corpora combined. MUSE and RCSLS vectors were trained from FastText vectors obtained from Wikipedia dumps (Grave et al., 2018a).
A.2 Training Parameters for Trained Models
Table 6: Training parameters for the trained models (subsampling values were not preserved in this version of the table).

Parameter | Bi-Sent2vec uni. | Bi-Sent2vec uni. + bi. | Sent2vec | TransGram
---|---|---|---|---
Embedding dimension | 300 | 300 | 300 | 300
Max vocabulary size | 750k | 750k | 750k | 750k
Minimum word count | 5 | 8 | 5 | 5
Initial learning rate | 0.2 | 0.2 | 0.2 | 0.025
Epochs | 5 | 5 | 5 | 8
Subsampling hyper-parameter | | | |
Word n-grams bucket size | – | 2M | – | –
Word n-grams dropped per context | – | 4 | – | –
Window size | – | – | – | 5
Number of negatives sampled | 10 | 10 | 10 | 5
A.3 Additional Monolingual Quality Tables
Table 8: Monolingual word similarity (Pearson correlation) for models trained on the En-Es pair.

Method \ Dataset | SimLex-999 (en) | WS-353 (en) | WS-353 (es)
---|---|---|---
MUSE | 0.38 | 0.74 | 0.61 |
RCSLS | 0.38 | 0.74 | 0.62 |
FastText- Common Crawl | 0.49 | 0.75 | 0.54 |
BiVec | 0.40 | 0.72 | 0.57 |
TransGram | 0.42 | 0.74 | 0.59 |
Sent2vec uni. | 0.49 | 0.58 | 0.51 |
Bi-Sent2vec uni. | 0.57 | 0.78 | 0.60 |
Bi-Sent2vec uni. + bi. | 0.60 | 0.82 | 0.66 |
Table 9: Monolingual word similarity (Pearson correlation) for models trained on the En-Fr pair.

Method \ Dataset | SimLex-999 (en) | WS-353 (en) | RG-65 (fr)
---|---|---|---
MUSE | 0.38 | 0.74 | 0.72 |
RCSLS | 0.38 | 0.74 | 0.70 |
FastText- Common Crawl | 0.49 | 0.75 | 0.76 |
BiVec | 0.40 | 0.70 | 0.74 |
TransGram | 0.39 | 0.72 | 0.74 |
Sent2vec uni. | 0.46 | 0.75 | 0.71 |
Bi-Sent2vec uni. | 0.55 | 0.78 | 0.74 |
Bi-Sent2vec uni. + bi. | 0.59 | 0.79 | 0.78 |
Table 10: Monolingual word similarity (Pearson correlation) for models trained on the En-De pair.

Method \ Dataset | SimLex-999 (en) | SimLex-999 (de) | WS-353 (en) | WS-353 (de)
---|---|---|---|---
MUSE | 0.38 | 0.41 | 0.74 | 0.68 |
RCSLS | 0.38 | 0.43 | 0.74 | 0.70 |
FastText- Common Crawl | 0.49 | 0.39 | 0.75 | 0.64 |
BiVec | 0.40 | 0.41 | 0.71 | 0.62 |
TransGram | 0.42 | 0.42 | 0.74 | 0.66 |
Sent2vec uni. | 0.48 | 0.38 | 0.70 | 0.63 |
Bi-Sent2vec uni. | 0.56 | 0.47 | 0.76 | 0.68 |
Bi-Sent2vec uni. + bi. | 0.59 | 0.53 | 0.75 | 0.70 |