Introduction
Word embeddings, continuous vector representations of words, have become a fundamental initial step in many natural language processing (NLP) tasks for many languages. In recent years, their cross-lingual counterpart, cross-lingual word embeddings (CLWE), which map matching words across languages, have been shown to be useful in many important cross-lingual transfer and modeling tasks, such as machine translation Zou et al. (2013); Conneau et al. (2018), cross-lingual document classification Klementiev, Titov, and Bhattarai (2012) and zero-shot dependency parsing Guo et al. (2015). In these representations, matching words across different languages are represented by similar vectors. Following the observation of Mikolov et al. (2013) that the geometric positions of similar words in two embedding spaces of different languages appear to be related by a linear relation, the most common method aims to map between two pretrained monolingual embedding spaces by learning a single linear transformation matrix. Due to its simple design and competitive performance, this approach has become the mainstream way of learning CLWE Glavaš et al. (2019); Vulić et al. (2019); Ruder, Vulić, and Søgaard (2019).
Initially, the linear mapping was learned by minimizing the distances between the source and target words in a seed dictionary. Early work from Mikolov et al. (2013) uses a seed dictionary of five thousand word pairs. Since then, the size of the seed dictionary has been gradually reduced, from several thousand to fifty word pairs Smith et al. (2017), reaching a minimal version that shares only numerals Artetxe, Labaka, and Agirre (2017).
More recent work on unsupervised learning has shown that mappings across embedding spaces can also be learned without any bilingual evidence Barone (2016); Zhang et al. (2017); Conneau et al. (2018); Hoshen and Wolf (2018); Alvarez-Melis and Jaakkola (2018); Artetxe, Labaka, and Agirre (2018). More concretely, these fully unsupervised methods usually consist of two main steps Hartmann, Kementchedjhieva, and Søgaard (2019): an unsupervised step which aims to induce the seed dictionary by matching the source and target distributions, followed by a pseudo-supervised refinement step based on this seed dictionary.
The system proposed by Conneau et al. (2018) can be considered the first successful unsupervised system for learning CLWE. They first use generative adversarial networks (GANs) to learn a single linear mapping to induce the seed dictionary, followed by Procrustes Analysis Schönemann (1966) to refine the linear mapping based on the induced seed dictionary. While this GAN-based model has competitive or even better performance compared to supervised methods on typologically similar language pairs, it often exhibits poor performance on typologically distant language pairs, that is, pairs of languages that differ drastically in word forms, morphology, word order and other properties that determine how similar the lexicons of two languages are. More specifically, their initial linear mapping often fails to induce the seed dictionary for distant language pairs Vulić et al. (2019). Later work from Artetxe, Labaka, and Agirre (2018) proposed an unsupervised self-learning framework to make unsupervised CLWE learning more robust. Their system uses similarity distribution matching to induce the seed dictionary and stochastic dictionary induction to refine the mapping iteratively. The final CLWE learned by their system performs better than the GAN-based system. However, their advantage appears to come from the iterative refinement with stochastic dictionary induction, according to Hartmann, Kementchedjhieva, and Søgaard (2019).
If we only consider the performance of a model induced with distribution matching alone, GAN-based models perform much better. This brings us to our first conclusion: a GAN-based model is preferable for seed dictionary induction.
Fully unsupervised mapping-based methods to learn CLWE rely on the strong assumption that monolingual word embedding spaces are isomorphic or near-isomorphic, but this assumption does not hold in practice, especially for distant language pairs Søgaard, Ruder, and Vulić (2018). Supervised methods are also affected by this lack of isomorphism, as their performance on distant language pairs is worse than on similar language pairs. Moreover, experiments by Vulić, Ruder, and Søgaard (2020) also demonstrate that the lack of isomorphism does not arise only from the typological distance between languages, but also depends on the quality of the monolingual embedding spaces. Indeed, if we replace the seed dictionary learned by an unsupervised distribution-matching method with a pretrained dictionary, keeping the refinement technique constant, the final system becomes more robust Vulić et al. (2019).
All these previous results indicate that learning a better seed dictionary is a crucial step towards improving unsupervised cross-lingual word embedding induction and reducing the gap between unsupervised and supervised methods, and that GAN-based methods hold the most promise to achieve this goal. They also indicate that a solution that can handle the full complexity of cross-lingual word embedding induction will show improvements on both close and distant language pairs.
In this paper, we focus on improving the initial step of distribution matching using GANs Hartmann, Kementchedjhieva, and Søgaard (2019). Because the isomorphism assumption does not hold in reality, we argue that a successful GAN-based model must not learn only one single linear mapping for the entire distribution, but must be able to identify mapping subspaces and learn multiple mappings. We propose a multi-adversarial learning method which learns different linear maps for different subspaces of word embeddings.
Limitation of the single-linear assumption for embedding mapping
If the assumption that similar words across source and target languages are related by a single linear relation Mikolov et al. (2013) holds exactly or even approximately, the distance between source and target embedding spaces should be evenly, or at least nearly evenly, minimized during the training of the initial mapping. More specifically, each source subspace should be mapped equally well, or nearly equally well, to its corresponding target subspace, so that the translation ability of the single linear mapping should be similar across different source subspaces.
To verify this expectation, we use the GAN-based system MUSE (https://github.com/facebookresearch/MUSE) of Conneau et al. (2018) to train two linear mappings (without refinement). One mapping relates two typologically distant languages, English and Chinese; the other maps the English space to the space of French, a typologically similar language. We use pretrained fastText embeddings (https://fasttext.cc/docs/en/pretrained-vectors.html).
We split the English space into ten subspaces by running K-means clustering. We evaluate the trained linear mappings by calculating the translation accuracy with precision at one (P@1), i.e. how often the highest-ranked translation is the correct one, for each subspace, using translations from Google Translate as the gold dataset. To reduce the influence of infrequent words, we only consider the fifty thousand most frequent source words.
As we can see in Figure 1, the distribution of accuracies across subspaces is far from uniform. This is true for both language pairs, but particularly for the distant pair, where the general mapping does not work at all in some subspaces. This lack of uniformity corroborates the appropriateness of designing a model that learns different linear mappings for different subspaces instead of a single linear mapping for the entire source space.
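To make this diagnostic concrete, the clustering-plus-evaluation step can be sketched as follows. This is a minimal numpy sketch with a plain K-means implementation; all function names and the `init` option are ours, not part of MUSE.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0, init=None):
    """Plain Lloyd's K-means; returns one cluster label per row of X.
    `init` optionally gives the row indices used as initial centers."""
    rng = np.random.default_rng(seed)
    idx = init if init is not None else rng.choice(len(X), size=k, replace=False)
    centers = X[np.asarray(idx)].astype(float).copy()
    for _ in range(iters):
        labels = ((X[:, None, :] - centers[None]) ** 2).sum(-1).argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def per_subspace_p_at_1(labels, predicted, gold, k):
    """P@1 restricted to each subspace: the fraction of source words in
    cluster j whose top-ranked translation matches the gold translation."""
    acc = {}
    for j in range(k):
        idx = np.where(labels == j)[0]
        acc[j] = float((predicted[idx] == gold[idx]).mean()) if len(idx) else None
    return acc
```

With per-subspace accuracies in hand, a simple bar plot over the cluster indices reproduces the kind of non-uniformity shown in Figure 1.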
Distribution matching with a single-linear mapping
To learn different mappings for different source subspaces, we propose one multi-discriminator GAN for each source subspace and encourage words from a specific source subspace to be trained against words from the corresponding subspace on the target side. While the basic intuition is simple, the GAN architecture that achieves good results comprises several components. In this section, we first introduce the basic GAN architecture with a single mapping that we use as a comparative baseline. In the following section, we introduce our model.
Learning unsupervised CLWE with GANs
Let two monolingual word embedding spaces, a source space $X$ and a target space $Y$, be given. Mapping $X$ to $Y$ means seeking a linear transformation matrix $W$, so that the projected vector $Wx_s$ of a source word $x_s$ is close to the vector $y_t$ of its translation in the target language. The basic idea underlying supervised methods is to use a seed dictionary of word pairs to learn the matrix $W$ by minimizing the distance in (1), where the columns of $X$ and $Y$ represent the embeddings of the dictionary's source and target words. The trained matrix $W$ can then be used to map the source word embeddings to the target space.

$W^{\star} = \operatorname{argmin}_{W} \lVert W X - Y \rVert_{F}$   (1)
In an unsupervised setting, the seed dictionary is not provided. Conneau et al. (2018) propose a two-step system where the seed dictionary is learned in an unsupervised fashion. In a first step, they use GANs to learn an initial linear transformation matrix $W$ and use it to induce a seed dictionary by finding the translations of the ten thousand most frequent source words. In a second step, the seed dictionary just learned is used to refine the initial matrix $W$. In the following subsections, we summarize the GAN-based method of Conneau et al. (2018) for building CLWE. We use this model as our comparative baseline.
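When $W$ is unconstrained, the objective in (1) is an ordinary least-squares problem with a closed-form solution. A minimal sketch under our own naming, with dictionary embeddings stored as columns of X and Y:

```python
import numpy as np

def fit_linear_map(X, Y):
    """Solve min_W ||W X - Y||_F for an unconstrained W, given the d x n
    matrices X (source) and Y (target) of seed-dictionary embeddings.
    W X ~ Y  is equivalent to  X^T W^T ~ Y^T, a standard least-squares problem."""
    Wt, *_ = np.linalg.lstsq(X.T, Y.T, rcond=None)
    return Wt.T
```

When the seed dictionary is exact and noise-free, this recovers the generating transformation; in practice the dictionary is noisy and the orthogonality constraint discussed below improves results.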
Distribution matching with GANs
A standard GAN model plays a min-max game between a generator and a discriminator (Goodfellow et al., 2014). The generator learns from the distribution of the source data and tries to fool the discriminator by generating new samples that are similar to the target data.
When we adapt the basic GAN model to learning CLWE, the goal of the generator is to learn the linear mapping matrix $W$. The discriminator $D$ detects whether the input is from the distribution of target embeddings $Y$. Conneau et al. (2018) use the loss functions in (2) and (3) to update the discriminator and the generator, respectively, where $D(v)$ denotes the probability that the input vector $v$ came from the target distribution $Y$ rather than from the generator applied to samples from the source distribution $X$.

$L_{D} = -\mathbb{E}_{y \sim Y}[\log D(y)] - \mathbb{E}_{x \sim X}[\log(1 - D(Wx))]$   (2)

$L_{W} = -\mathbb{E}_{x \sim X}[\log D(Wx)]$   (3)
The parameters of the generator and the discriminator are updated alternately using stochastic gradient descent. Simply using these loss functions to train the initial matrix $W$ is, however, not reliable, so further techniques need to be applied in order to improve the robustness of the training and the quality of the seed dictionary.
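As an illustration, the two losses in (2) and (3) can be written down directly for a given discriminator D, where D(v) estimates the probability that v comes from the target distribution. This is a numpy sketch of the loss values only, without the training loop; the function names are ours.

```python
import numpy as np

def discriminator_loss(D, W, xs, ys):
    """Eq. (2): D should output 1 on real target vectors ys and
    0 on mapped source vectors xs @ W.T (rows are word vectors)."""
    return float(-np.mean(np.log(D(ys))) - np.mean(np.log(1.0 - D(xs @ W.T))))

def generator_loss(D, W, xs):
    """Eq. (3): the generator (the mapping W) tries to make D
    output 1 on mapped source vectors."""
    return float(-np.mean(np.log(D(xs @ W.T))))
```

At the discriminator's point of maximal confusion, D outputs 0.5 everywhere, giving a discriminator loss of 2 log 2 and a generator loss of log 2.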
Orthogonalization
Previous work shows that enforcing the mapping matrix $W$ to be orthogonal during training can improve performance Smith et al. (2017). In the system of Conneau et al. (2018), they follow the work of Cisse et al. (2017) and approximately constrain $W$ to an orthogonal matrix with the update rule in (4). The orthogonalization usually performs well when setting $\beta$ to 0.001 Conneau et al. (2018); Wang, Henderson, and Merlo (2019).

$W \leftarrow (1 + \beta) W - \beta (W W^{\top}) W$   (4)
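The update rule in (4) is easy to state in code; the sketch below applies it repeatedly and, as expected, pulls the singular values of W towards 1, i.e. pushes W towards the orthogonal group (function name ours):

```python
import numpy as np

def orthogonalize_step(W, beta=0.001):
    """Eq. (4): W <- (1 + beta) W - beta (W W^T) W.
    Writing W = U S V^T, the update maps each singular value s to
    (1 + beta) s - beta s^3, whose stable fixed point is s = 1."""
    return (1.0 + beta) * W - beta * (W @ W.T) @ W
```

Interleaving one such step with each generator update keeps W approximately orthogonal throughout adversarial training without an explicit projection.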
Cross-Domain Similarity Local Scaling
The trained mapping matrix $W$ can be used to retrieve the translation of a given source word $x_s$ by searching for a target word $y_t$ whose embedding vector is close to $W x_s$. Conneau et al. (2018) showed that using cross-domain similarity local scaling (CSLS) to retrieve translations is more accurate than standard nearest-neighbour retrieval and can reduce the impact of the hubness problem Radovanović, Nanopoulos, and Ivanović (2010); Dinu, Lazaridou, and Baroni (2015). Instead of just considering the distance between $W x_s$ and $y_t$, CSLS also takes into account the neighbours of $y_t$ in the mapped source space, as in (5).

$\mathrm{CSLS}(W x_s, y_t) = 2 \cos(W x_s, y_t) - r_T(W x_s) - r_S(y_t)$   (5)

In this equation, $r_T(W x_s)$ denotes the mean similarity between a mapped source vector $W x_s$ and its neighbours in the target language, while $r_S(y_t)$ represents the mean similarity between $y_t$ and its neighbours in the mapped source space. Due to its good performance in retrieving translations, CSLS has become an essential component for inducing good seed dictionaries.
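CSLS retrieval can be sketched in a few lines of numpy. The neighbourhood size k is a free parameter (Conneau et al. (2018) use k = 10); the function name is ours.

```python
import numpy as np

def csls_scores(S, k=10):
    """Given the cosine-similarity matrix S between mapped source vectors
    (rows) and target vectors (columns), eq. (5) rescales each similarity
    by the mean similarity of each word to its k nearest neighbours on the
    other side: CSLS = 2 * S - r_T (per row) - r_S (per column)."""
    r_t = np.sort(S, axis=1)[:, -k:].mean(axis=1)   # source word vs. target neighbourhood
    r_s = np.sort(S, axis=0)[-k:, :].mean(axis=0)   # target word vs. mapped-source neighbourhood
    return 2.0 * S - r_t[:, None] - r_s[None, :]
```

Hub target words, which are similar to many mapped source vectors, get a large `r_s` penalty, which is exactly what mitigates the hubness problem.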
Model selection criteria
Another important component of the model of Conneau et al. (2018) is the cosine-based model selection criterion that they proposed for selecting the best mapping matrix $W$ during adversarial training. More specifically, at the end of each training epoch, they use the current mapping to translate the ten thousand most frequent source words into target words and calculate the average cosine similarity between the source vectors and the vectors of the retrieved translations. This cosine-based criterion has been shown to correlate well with the quality of $W$ Conneau et al. (2018); Hartmann, Kementchedjhieva, and Søgaard (2019).
Bidirectional seed dictionary induction
This final step guarantees the bidirectionality of the induced dictionary. After the mapping matrix $W$ is learned from the adversarial training, the translations $y_t$ of the ten thousand most frequent source words $x_s$ are retrieved and then back-translated into the source language. Only the mutual translation pairs, those such that $x_s$ translates to $y_t$ and $y_t$ translates back to $x_s$, constitute the seed dictionary. The seed dictionary is then refined in a second step.
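The mutual-translation filter can be sketched as follows; for brevity we retrieve translations with plain nearest-neighbour cosine search instead of CSLS, which does not change the logic of the bidirectional filter (function name ours):

```python
import numpy as np

def bidirectional_dictionary(WX, Y):
    """Keep (s, t) pairs such that t is the nearest target neighbour of
    the mapped source word WX[s] AND s is the nearest mapped-source
    neighbour of Y[t]. Rows of WX and Y are assumed length-normalized."""
    S = WX @ Y.T
    fwd = S.argmax(axis=1)   # source index -> best target index
    bwd = S.argmax(axis=0)   # target index -> best source index
    return [(s, int(t)) for s, t in enumerate(fwd) if bwd[t] == s]
```

Pairs that survive this round trip are more likely to be genuine translations, which is why the filter improves the quality of the seed dictionary fed to the refinement step.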
Seed dictionary refinement
The refinement step is based on Procrustes Analysis Schönemann (1966). With the seed dictionary just learned from adversarial training, the mapping matrix $W$ can be updated using the objective in equation (1) and forced to be orthogonal by using singular value decomposition (SVD) Xing et al. (2015), as in (6).

$W^{\star} = U V^{\top}, \quad \text{with } U \Sigma V^{\top} = \mathrm{SVD}(Y X^{\top})$   (6)
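Equation (6) is the classical orthogonal Procrustes solution and can be implemented in two lines (our naming; X and Y are the d x n matrices of seed-dictionary embeddings):

```python
import numpy as np

def procrustes(X, Y):
    """Solve min_W ||W X - Y||_F subject to W orthogonal:
    W* = U V^T  where  U S V^T = SVD(Y X^T)."""
    U, _, Vt = np.linalg.svd(Y @ X.T)
    return U @ Vt
```

Because the constrained problem has this closed form, each refinement iteration is a single SVD of a d x d matrix, independent of the dictionary size.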
Later work combines Procrustes Analysis with stochastic dictionary induction Artetxe, Labaka, and Agirre (2018) and largely improves over the standard refinement Hartmann, Kementchedjhieva, and Søgaard (2019). More specifically, in order to avoid local optima, after each iteration some elements of the similarity matrix are randomly dropped, so that the similarity distributions of words change randomly and the seed dictionary for the next iteration varies.
Distribution matching with a multi-linear mapping
Table 1: Average accuracy (P@1, in %) over successful runs (avg.) and number of failed runs out of 10 (f.) when mapping to English, without refinement.

|          | et   |    | fa    |    | fi    |    | lv   |    | tr    |    | vi   |    | es    |    | it    |    |
|          | avg. | f. | avg.  | f. | avg.  | f. | avg. | f. | avg.  | f. | avg. | f. | avg.  | f. | avg.  | f. |
| MUSE     | 0    | 10 | 16.48 | 4  | 25.32 | 4  | 0    | 10 | 24.32 | 8  | 0    | 10 | 65.12 | 0  | 54.75 | 0  |
| VecMap   | 0    | 10 | 0     | 10 | 0     | 10 | 0    | 10 | 0     | 10 | 0    | 10 | 0     | 10 | 0     | 10 |
| MultiMap | 0    | 10 | 20.97 | 3  | 26.66 | 4  | 16.7 | 9  | 32.53 | 8  | 0    | 10 | 68.42 | 0  | 58.23 | 0  |
As discussed in the previous section, previous GAN-based systems start by learning a single linear mapping from the source embedding space to the target embedding space. In such models, a source word is trained against a target word sampled from the whole target distribution, and the resulting single linear mapping is applied to all the source words. We instead learn different mappings for different source subspaces. We propose a multi-discriminator GAN where a source word from one subspace is trained against a target word sampled from the aligned target subspace. We first describe the multi-discriminator GAN model and then explain the process we use to identify the aligned source-target subspaces.
Multi-discriminator adversarial learning
For each subspace $X_i$ of the source embedding space, we propose a multi-discriminator adversarial model to train the specific mapping for vectors that belong to this subspace. As the architecture in Figure 2 illustrates, the generator $W_i$ of the given source subspace takes a vector sampled from the subdistribution $X_i$ as input and maps it to the target language. Unlike in standard GANs, the mapped vector is fed into two discriminators:

- A subspace-specific discriminator $D_s^i$, which judges whether the input vector comes from the corresponding target subspace $Y_i$. We use vectors sampled from both the source and target subspaces to train $D_s^i$.

- A language discriminator $D_l^i$, which detects whether the input vector comes from the whole target distribution; its purpose is to avoid local optima over the specific subspace. We follow the work of Conneau et al. (2018) and only use the 75,000 most frequent source and target words to train $D_l^i$, to reduce the negative impact of infrequent words.
Both discriminators are two-layer perceptron classifiers. Except for the different sampling ranges, their loss functions are similar to equation (2):

$L_{D_l^i} = -\mathbb{E}_{y \sim Y}[\log D_l^i(y)] - \mathbb{E}_{x \sim X}[\log(1 - D_l^i(W_i x))]$   (7)

$L_{D_s^i} = -\mathbb{E}_{y \sim Y_i}[\log D_s^i(y)] - \mathbb{E}_{x \sim X_i}[\log(1 - D_s^i(W_i x))]$   (8)
In (7), $x$ and $y$ are sampled from the 75,000 most frequent source and target words (we use a separate language discriminator $D_l^i$ for each subspace $i$, even though their training samples all come from the same distributions; this leads to more stable training, presumably because these language discriminators are initialized with random differences), while in (8), $x$ and $y$ are sampled from the specific source subspace $X_i$ and its corresponding target subspace $Y_i$. Since the outputs of both discriminators are used for training the generator, the loss function of the subspace-specific generator can be written as in (9),

$L_{W_i} = -\lambda_i \, \mathbb{E}_{x \sim X_i}[\log D_l^i(W_i x)] - (1 - \lambda_i) \, \mathbb{E}_{x \sim X_i}[\log D_s^i(W_i x)]$   (9)

where $\lambda_i$ is a coefficient that we call the global confidence. We use the global confidence to balance the contributions of the two discriminators in updating the generator. In practice, we find that setting $\lambda_i$ to 0.5 for each subspace works well for the final result.
Additionally, we propose a metric to set $\lambda_i$ dynamically, based on the proportion of the eigenvalue divergence between the two subspaces relative to the eigenvalue divergence between the whole source and target distributions, as shown in (10).

$\lambda_i = \dfrac{\mathrm{EVD}(X_i, Y_i)}{\mathrm{EVD}(X_i, Y_i) + \mathrm{EVD}(X, Y)}$   (10)
The eigenvalue divergence between two embedding distributions $E_1$ and $E_2$ can be computed as shown in (11), where $\mu_j^{(1)}$ and $\mu_j^{(2)}$ represent the sorted eigenvalues of $E_1$ and $E_2$.

$\mathrm{EVD}(E_1, E_2) = \sum_j \left( \log \mu_j^{(1)} - \log \mu_j^{(2)} \right)^2$   (11)
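A sketch of the eigenvalue divergence in (11) and of the dynamic global confidence in (10). We take eigenvalues of the embedding covariance matrices and compare their sorted logarithms; this is our reading of Dubossarsky et al. (2020), so the exact normalization should be treated as an assumption, and the function names are ours.

```python
import numpy as np

def evd(E1, E2):
    """Eigenvalue divergence between two sets of embeddings (rows = words):
    sum of squared differences between the sorted log-eigenvalues of their
    covariance matrices, truncated to the common length (eq. 11)."""
    m1 = np.sort(np.linalg.eigvalsh(np.cov(E1.T)))[::-1]
    m2 = np.sort(np.linalg.eigvalsh(np.cov(E2.T)))[::-1]
    m = min(len(m1), len(m2))
    return float(np.sum((np.log(m1[:m]) - np.log(m2[:m])) ** 2))

def global_confidence(Xi, Yi, X, Y):
    """Eq. (10): weight the language discriminator by the proportion of the
    subspace-level divergence relative to subspace plus whole-space divergence."""
    sub, whole = evd(Xi, Yi), evd(X, Y)
    return sub / (sub + whole)
```

By construction the weight lies in (0, 1): the more divergent the subspace pair is relative to the whole spaces, the more the generator leans on the global language discriminator.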
All subspace-specific generators $W_i$ are initialized with the single linear mapping from source to target space that we also use for subspace alignment, as discussed below.
Parameter-free hierarchical clustering
The multi-discriminator GAN above assumes that we have an alignment between source subspaces and target subspaces. We first present the clustering method we use to find coherent subspaces, and then the method we use to produce aligned subspaces in the source and target distributions; both are important for the model's improved performance.
The first issue in clustering an embedding space is how to find a clustering that adapts to the space, without fixed parameters. To identify the word subspaces, we could run a traditional K-means clustering algorithm over the embedding space. However, the number of subspaces would then have to be defined in advance. Hierarchical clustering algorithms do not suffer from this shortcoming. Recent work proposes a parameter-free method called First Integer Neighbor Clustering Hierarchy (FINCH) Sarfraz, Sharma, and Stiefelhagen (2019), which we use in this paper.
Traditionally, clustering methods split a given space of vectors into clusters by calculating the distances between centroids and the other vectors. FINCH builds on the observation that the first neighbour of each vector is a sufficient statistic to find links in the space, so that computing the full distance matrix between all vectors is not needed Sarfraz, Sharma, and Stiefelhagen (2019). For a given vector space, one first computes an adjacency link matrix using the equation in (12),
$A(i, j) = \begin{cases} 1 & \text{if } j = \kappa_i^1 \ \text{or} \ \kappa_j^1 = i \ \text{or} \ \kappa_i^1 = \kappa_j^1 \\ 0 & \text{otherwise} \end{cases}$   (12)

where $i$ and $j$ denote the indices of vectors and $\kappa_i^1$ represents the index of the first neighbour of the vector with index $i$. The connected components can then be detected from the adjacency matrix $A$ by building a directed or undirected graph on it. No parameter needs to be set. When the clustering on the first level (the original data) is completed, the centroid of each cluster can be considered as a data vector for the next level, and a new level of clustering is computed using the same procedure. In theory, all the vectors will eventually be gathered into a single cluster. In practice, we find that using the clusters of the last level or the second-to-last level works well for our system. (In the code of Sarfraz, Sharma, and Stiefelhagen (2019), the last level means the level before grouping all the data vectors into a single cluster.)
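One level of FINCH can be written directly from (12): compute each vector's first neighbour, build the link matrix, and take connected components. This is a pure-numpy sketch with exact nearest-neighbour search, whereas the original implementation can also use approximate search; the function name is ours.

```python
import numpy as np

def finch_level(X):
    """One FINCH partition: link i and j when j is i's first neighbour,
    i is j's first neighbour, or they share the same first neighbour
    (eq. 12), then return connected-component labels."""
    D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(D, np.inf)
    nn = D.argmin(axis=1)                 # first (nearest) neighbour of each point
    n = len(X)
    A = np.zeros((n, n), dtype=bool)
    for i in range(n):
        A[i, nn[i]] = True                # j = kappa_i
        A[i, nn == nn[i]] = True          # kappa_i = kappa_j
    A |= A.T                              # covers kappa_j = i by symmetry
    # connected components by depth-first search over the link graph
    labels, current = -np.ones(n, dtype=int), 0
    for s in range(n):
        if labels[s] >= 0:
            continue
        stack = [s]
        while stack:
            v = stack.pop()
            if labels[v] >= 0:
                continue
            labels[v] = current
            stack.extend(np.where(A[v])[0].tolist())
        current += 1
    return labels
```

Replacing the O(n^2) distance matrix with an (approximate) nearest-neighbour index is what makes the method scale to full vocabularies.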
Table 2: Average accuracy (P@1, in %) over successful runs (avg.) and number of failed runs out of 10 (f.) when mapping to English, with refinement. For the supervised Procrustes baseline, the second column gives the size of the training dictionary.

|            | dict. | et    |    | fa    |    | fi    |    | lv    |    | tr    |    | vi    |    | es    |    | it    |    |
|            |       | avg.  | f. | avg.  | f. | avg.  | f. | avg.  | f. | avg.  | f. | avg.  | f. | avg.  | f. | avg.  | f. |
| Procrustes | 500   | 4.60  |    | 0     |    | 4.67  |    | 0     |    | 0     |    | 0     |    | 0     |    | 18.27 |    |
| Procrustes | 1000  | 27.47 |    | 21.24 |    | 33.67 |    | 20.27 |    | 24.22 |    | 0.40  |    | 69.80 |    | 64.60 |    |
| Procrustes | 5000  | 52.27 |    | 41.39 |    | 63.93 |    | 42.87 |    | 62.44 |    | 54.73 |    | 83.93 |    | 79.00 |    |
| MUSE       |       | 0     | 10 | 27.76 | 1  | 47.87 | 0  | 32.43 | 9  | 54.42 | 7  | 0     | 10 | 81.43 | 0  | 77.86 | 0  |
| VecMap     |       | 0     | 10 | 39.24 | 1  | 46.33 | 0  | 0     | 10 | 57.37 | 0  | 0     | 10 | 82.67 | 0  | 77.32 | 0  |
| MultiMap   |       | 28.06 | 9  | 39.09 | 0  | 49.59 | 0  | 34.22 | 7  | 58.97 | 7  | 16.7  | 9  | 83.45 | 0  | 78.46 | 0  |
Sample-level subspace alignment
If we want words from a specific source subspace to be trained against words from a matching target subspace, we need to align the two cross-language subspaces. The second problem we need to solve for our multi-adversarial method to work is how to discover this alignment. Although metrics such as the Gromov-Hausdorff distance (GH) Patra et al. (2019) and eigenvalue divergence (EVD) Dubossarsky et al. (2020) can be used to measure the similarity between two distributions and find the most similar target subspace for a given source subspace, matching two subdistributions directly may amplify any bias generated during the clustering.
To avoid this problem, we only run the clustering on the target side. For a given target embedding space $Y$, we denote its subspaces after clustering as $Y_1, \dots, Y_n$, where $n$ is the number of subspaces. To align source words to their matching target subspace, we first learn a single linear mapping from source to target space using the GAN-based method (without refinement) described previously. Then, for each source word, we use CSLS to retrieve its translation in the target language and assign the subspace index of this translation to the source word. In this way, the source embedding space is partitioned into as many subspaces as the target embedding space, denoted as $X_1, \dots, X_n$, and each pair $(X_i, Y_i)$ is an aligned subspace pair.
Although the single linear mapping from the source language to the target language is not good enough to produce accurate translations, it appears to be a good way to produce an initial alignment. A possible reason is that the clustering on the target language has already grouped similar words, so translations that in the end turn out to be incorrect usually still share their subspace index with the correct translations.
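The alignment procedure can be sketched as follows, with plain nearest-neighbour retrieval standing in for CSLS; `target_labels` is assumed to be the clustering of the target space and `W` the single linear mapping learned previously (function name ours):

```python
import numpy as np

def align_subspaces(X, Y, W, target_labels):
    """Partition the source space by the clustering of the target space:
    each source word inherits the subspace index of its retrieved
    translation under the global mapping W. Rows of X and Y are assumed
    length-normalized word vectors."""
    translations = (X @ W.T @ Y.T).argmax(axis=1)   # best target word per source word
    return target_labels[translations]
```

Because every source word receives exactly one target cluster index, the source space is partitioned into the same number of subspaces as the target space, as required by the multi-discriminator model.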
We evaluate the performance of the model presented here and the contributions of each of its components in the ablation tests described in the next section.
Table 3: Ablation tests (without refinement): average accuracy (P@1, in %) over successful runs (avg.) and number of failed runs out of 10 (f.) when mapping to English.

|                                    | et   |    | fa    |    | fi    |    | lv    |    | tr    |    | vi   |    | es    |    | it    |    |
|                                    | avg. | f. | avg.  | f. | avg.  | f. | avg.  | f. | avg.  | f. | avg. | f. | avg.  | f. | avg.  | f. |
| All components                     | 0    | 10 | 20.97 | 3  | 26.66 | 4  | 16.7  | 9  | 32.53 | 8  | 0    | 10 | 68.42 | 0  | 58.23 | 0  |
| without language discriminator     | 0    | 10 | 20.82 | 3  | 25.33 | 4  | 16.7  | 9  | 29.33 | 8  | 0    | 10 | 66.93 | 0  | 56.73 | 0  |
| without subspace discriminator     | 0    | 10 | 17.62 | 3  | 25.42 | 4  | 16.48 | 9  | 29.46 | 8  | 0    | 10 | 65.27 | 0  | 57.71 | 0  |
| without dynamic weight (fixed 0.5) | 0    | 10 | 20.33 | 3  | 26.33 | 4  | 16.7  | 9  | 31.26 | 8  | 0    | 10 | 68.04 | 0  | 58.27 | 0  |
| without global initialization      | 0    | 10 | 16.33 | 9  | 0     | 10 | 0     | 10 | 35.20 | 8  | 0    | 10 | 60.60 | 2  | 58.22 | 1  |
Experiments
In this section, we evaluate our proposal on the task of Bilingual Lexicon Induction (BLI). For each language pair, we retrieve the best translations of the source words in the test dictionary using CSLS, and we report the accuracy with precision at one (P@1). Since GAN-based methods of learning CLWE are often criticized for their instability during training, we report the average accuracy over 10 runs and consider a run failed when its accuracy is lower than 2%. We evaluate our system both with and without refinement; for refinement, we follow Conneau et al. (2018) and use Procrustes Analysis to refine the initial mapping. To understand the contribution of the different components, we also conduct ablation tests.
Monolingual word embeddings:
We use the pretrained fastText embedding models Bojanowski et al. (2017) for our experiments. These 300-dimensional embeddings are pretrained on Wikipedia dumps and publicly available (https://fasttext.cc/docs/en/pretrained-vectors.html). Following previous work, we use the 200k most frequent words of each monolingual embedding model. (The original pretrained Latvian fastText model consists of only 171k words.)
Test dataset:
We evaluate our system on the evaluation dataset provided by Conneau et al. (2018). This dataset contains high-quality dictionaries for more than 150 language pairs. For each language pair, it provides a training dictionary of 5,000 word pairs and a test dictionary of 1,500 word pairs. This dataset allows us to better understand the performance of our proposal across many different language pairs.
Language pairs:
As mentioned previously, because of the lack of isomorphism across languages, unsupervised methods usually fail to learn good CLWE for distant language pairs. This is why, in this paper, we mostly focus on distant language pairs. Following the work of Hartmann, Kementchedjhieva, and Søgaard (2019), we choose Estonian (et), Farsi (fa), Finnish (fi), Latvian (lv), Turkish (tr) and Vietnamese (vi), and evaluate our system by mapping the monolingual embedding models of these six languages to the embedding model of English. In order to better understand the performance of our system and to compare with close languages, we also investigate Spanish (es) and Italian (it), two languages more similar to English than the previous six.
Baselines:
The objective of our proposal is to improve the mapping ability of GANs by learning multiple linear mappings instead of only a single linear mapping. Therefore, we use the GAN-based system proposed by Conneau et al. (2018) (https://github.com/facebookresearch/MUSE), denoted MUSE, as our main unsupervised baseline. Since the unsupervised method proposed by Artetxe, Labaka, and Agirre (2018) (https://github.com/artetxem/vecmap) is considered a robust CLWE system, we also use it as our second unsupervised baseline and denote it VecMap. However, according to Hartmann, Kementchedjhieva, and Søgaard (2019), the advantage of VecMap mostly comes from its iterative refinement with stochastic dictionary induction. To have a fair comparison, we apply the standard refinement through Procrustes Analysis to VecMap. Finally, GAN-based systems are often compared to different supervised systems. We think that such comparisons are not appropriate for understanding the GANs' performance in inducing the seed dictionary; the supervised algorithm should be kept constant. Therefore, we build our own supervised baseline with the Procrustes Analysis that we use for refinement and feed it with dictionaries of different sizes. We shuffle the training dictionaries provided by Conneau et al. (2018) and split them into 500, 1,000 and 5,000 word pairs.
Multi-linear GANs vs single-linear GANs
Tables 1 and 2 show the results of our proposed multi-linear method in comparison to our unsupervised baselines, without and with refinement, respectively. From these two tables we can easily observe that the six distant language pairs selected here are very difficult for unsupervised systems. Even so, the numbers of failed runs show that our multi-linear GANs are somewhat more stable than single-linear GANs: they fail less often. If we only consider the successful runs, our multi-linear method performs better than the single-linear method on all language pairs. For some language pairs, such as Farsi-to-English with refinement, the improvement brought by our system is more than 10%.
Unsupervised seed dictionary vs pretrained dictionary
From the results shown in Table 2, we can see that when using dictionaries of 500 word pairs, our supervised baseline is not capable of building good CLWE. Even with 1,000 word pairs, its performance is still not better than that of our unsupervised method. Despite the instability of training, the performance of our multi-linear method is in some cases very close to that of the supervised baseline trained with 5,000 word pairs.
Ablation tests
We report the results of our ablation tests in Table 3. The last line shows that initializing each subspace-specific generator with the mapping learned for aligning subspaces is very important; otherwise, training becomes very unstable, even for similar language pairs. The second and third lines show, respectively, that both the discriminator on language and the discriminator on subspaces contribute to the mapping performance. However, the discriminator on subspaces appears to be more important in some cases, judging by the drop in performance when it is removed. The fourth line shows that the dynamic $\lambda_i$ calculated with the eigenvalue divergence is not very important: the drop in performance when it is replaced by a fixed value of 0.5 is very limited. In fact, inspection of the dynamic $\lambda_i$ shows that its value is usually close to 0.5 during training.
These ablation tests indicate that, with the exception of the dynamic $\lambda_i$, all the components of our novel multi-adversarial model contribute to bringing the overall system to very good unsupervised performance, in some cases comparable to supervised methods.
Conclusion
In this paper, we propose a multi-adversarial learning method for cross-lingual word embeddings. Our system learns different linear mappings for different source subspaces instead of a single mapping for the whole source space. The results of our experiments on bilingual lexicon induction, on both close languages and the difficult case of typologically distant languages, show that learning cross-lingual word embeddings with multiple mappings improves performance over a single mapping.
References
 AlvarezMelis and Jaakkola (2018) AlvarezMelis, D.; and Jaakkola, T. S. 2018. GromovWasserstein Alignment of Word Embedding Spaces. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, volume 18811890. Brussels, Belgium.
 Artete, Labaka, and Agirre (2018) Artete, M.; Labaka, G.; and Agirre, E. 2018. A robust selflearning method for fully unsupervised crosslingual mappings of word embeddings. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, volume Long Papers, 789–798. Melbourne, Australia.
 Artetxe, Labaka, and Agirre (2017) Artetxe, M.; Labaka, G.; and Agirre, E. 2017. Learning bilingual word embeddings with (almost) no bilingual data. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, 451–462. Vancouver, Canada.

Barone (2016)
Barone, A. V. M. 2016.
Towards CrossLingual Distributed Representations without Parallel Text Trained with Adversarial Autoencoders.
In Proceedings of the 1st Workshop on Representation Learning for NLP, 121–126. Berlin, Germany.  Bojanowski et al. (2017) Bojanowski, P.; Grave, E.; Joulin, A.; and Mikolov, T. 2017. Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics 5: 135–146. doi:10.1162/tacl_a_00051.

Cisse et al. (2017)
Cisse, M.; Bojanowski, P.; Grave, E.; Dauphin, Y.; and Usunier, N. 2017.
Parseval Networks: Improving Robustness to Adversarial Examples.
In
Proceedings of the 34th International Conference on Machine Learning
, volume 70, 854–863. Sydney, Australia.  Conneau et al. (2018) Conneau, A.; Lample, G.; Ranzato, M.; Denoyer, L.; and Jégou, H. 2018. Word Translation Without Parallel Data. In Proceedings of the 6th International Conference on Learning Representations, 1–14. Vancouver, Canada.
 Dinu, Lazaridou, and Baroni (2015) Dinu, G.; Lazaridou, A.; and Baroni, M. 2015. Improving ZeroShot Learning by Mitigating the Hubness Problem. In Proceedings of the 3rd International Conference on Learning Representations, volume Workshop Track, 1–10. Toulon, France.
 Dubossarsky et al. (2020) Dubossarsky, H.; Vulić, I.; Reichart, R.; and Korhonen, A. 2020. Lost in Embedding Space: Explaining CrossLingual Task Performance with Eigenvalue Divergence. arXiv 1–10.
 Glavaš et al. (2019) Glavaš, G.; Litschko, R.; Ruder, S.; and Vulić, I. 2019. How to (Properly) Evaluate Cross-Lingual Word Embeddings: On Strong Baselines, Comparative Analyses, and Some Misconceptions. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 710–721. Florence, Italy.
 Goodfellow et al. (2014) Goodfellow, I. J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative Adversarial Nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems, 2672–2680. Montréal, Canada.
 Guo et al. (2015) Guo, J.; Che, W.; Yarowsky, D.; Wang, H.; and Liu, T. 2015. Cross-lingual Dependency Parsing Based on Distributed Representations. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, 1234–1244. Beijing, China.
 Hartmann, Kementchedjhieva, and Søgaard (2019) Hartmann, M.; Kementchedjhieva, Y.; and Søgaard, A. 2019. Comparing Unsupervised Word Translation Methods Step by Step. In Proceedings of the 33rd Conference on Neural Information Processing Systems, 6033–6043. Vancouver, Canada.
 Hoshen and Wolf (2018) Hoshen, Y.; and Wolf, L. 2018. An Iterative Closest Point Method for Unsupervised Word Translation. arXiv preprint.
 Klementiev, Titov, and Bhattarai (2012) Klementiev, A.; Titov, I.; and Bhattarai, B. 2012. Inducing Crosslingual Distributed Representations of Words. In Proceedings of the 24th International Conference on Computational Linguistics, 1459–1473. Mumbai, India.

 Mikolov et al. (2013) Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J. 2013. Efficient Estimation of Word Representations in Vector Space. In Proceedings of the 1st International Conference on Learning Representations, 1–12. Scottsdale, Arizona, USA.
 Patra et al. (2019) Patra, B.; Moniz, J. R. A.; Garg, S.; Gormley, M. R.; and Neubig, G. 2019. Bilingual Lexicon Induction with Semi-supervision in Non-Isometric Embedding Spaces. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 184–193. Florence, Italy.

 Radovanović, Nanopoulos, and Ivanović (2010) Radovanović, M.; Nanopoulos, A.; and Ivanović, M. 2010. Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data. Journal of Machine Learning Research 11: 2487–2531.
 Ruder, Vulić, and Søgaard (2019) Ruder, S.; Vulić, I.; and Søgaard, A. 2019. A Survey of Cross-lingual Word Embedding Models. Journal of Artificial Intelligence Research 65: 569–631.
 Sarfraz, Sharma, and Stiefelhagen (2019) Sarfraz, M. S.; Sharma, V.; and Stiefelhagen, R. 2019. Efficient Parameter-free Clustering Using First Neighbor Relations. In Proceedings of the 2019 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 8934–8943. Long Beach, California, USA.
 Schönemann (1966) Schönemann, P. H. 1966. A Generalized Solution of the Orthogonal Procrustes Problem. Psychometrika 31: 1–10.
 Smith et al. (2017) Smith, S. L.; Turban, D. H. P.; Hamblin, S.; and Hammerla, N. Y. 2017. Offline Bilingual Word Vectors, Orthogonal Transformations and the Inverted Softmax. In Proceedings of the 5th International Conference on Learning Representations, 1–10. Toulon, France.
 Søgaard, Ruder, and Vulić (2018) Søgaard, A.; Ruder, S.; and Vulić, I. 2018. On the Limitations of Unsupervised Bilingual Dictionary Induction. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, volume Long Papers, 778–788. Melbourne, Australia.
 Vulić et al. (2019) Vulić, I.; Glavaš, G.; Reichart, R.; and Korhonen, A. 2019. Do We Really Need Fully Unsupervised Cross-Lingual Embeddings? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, 4407–4418. Hong Kong, China.
 Vulić, Ruder, and Søgaard (2020) Vulić, I.; Ruder, S.; and Søgaard, A. 2020. Are All Good Word Vector Spaces Isomorphic? arXiv preprint, 1–11.
 Wang, Henderson, and Merlo (2019) Wang, H.; Henderson, J.; and Merlo, P. 2019. Weakly-Supervised Concept-based Adversarial Learning for Cross-lingual Word Embeddings. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, 4419–4430. Hong Kong, China.
 Xing et al. (2015) Xing, C.; Wang, D.; Liu, C.; and Lin, Y. 2015. Normalized Word Embedding and Orthogonal Transform for Bilingual Word Translation. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics, 1006–1011. Denver, Colorado.
 Zhang et al. (2017) Zhang, M.; Liu, Y.; Luan, H.; and Sun, M. 2017. Adversarial Training for Unsupervised Bilingual Lexicon Induction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, 1959–1970. Vancouver, Canada.
 Zou et al. (2013) Zou, W. Y.; Socher, R.; Cer, D.; and Manning, C. D. 2013. Bilingual Word Embeddings for Phrase-Based Machine Translation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 1393–1398. Seattle, Washington, USA.