
Multi-Adversarial Learning for Cross-Lingual Word Embeddings

Generative adversarial networks (GANs) have succeeded in inducing cross-lingual word embeddings – maps of matching words across languages – without supervision. Despite these successes, GANs' performance for the difficult case of distant languages is still not satisfactory. These limitations have been explained by GANs' incorrect assumption that source and target embedding spaces are related by a single linear mapping and are approximately isomorphic. We assume instead that, especially across distant languages, the mapping is only piece-wise linear, and propose a multi-adversarial learning method. This novel method induces the seed cross-lingual dictionary through multiple mappings, each induced to fit the mapping for one subspace. Our experiments on unsupervised bilingual lexicon induction show that this method improves performance over previous single-mapping methods, especially for distant languages.


Introduction

Word embeddings, continuous vectorial representations of words, have become a fundamental initial step in many natural language processing (NLP) tasks for many languages. In recent years, their cross-lingual counterpart, cross-lingual word embeddings (CLWE) —maps of matching words across languages— have been shown to be useful in many important cross-lingual transfer and modeling tasks such as machine translation Zou et al. (2013); Conneau et al. (2018), cross-lingual document classification Klementiev, Titov, and Bhattarai (2012) and zero-shot dependency parsing Guo et al. (2015).

In these representations, matching words across different languages are represented by similar vectors. Following the observation of Mikolov et al. (2013) that the geometric positions of similar words in two embedding spaces of different languages appear to be related by a linear relation, the most common method aims to map between two pretrained monolingual embedding spaces by learning a single linear transformation matrix. Due to its simple design and competitive performance, this approach has become the mainstream way of learning CLWE Glavaš et al. (2019); Vulić et al. (2019); Ruder, Vulić, and Søgaard (2019).

Initially, the linear mapping was learned by minimizing the distances between the source and target words in a seed dictionary. Early work from Mikolov et al. (2013) uses a seed dictionary of five-thousand word pairs. Since then, the size of the seed dictionary has been gradually reduced, from several-thousand to fifty word pairs Smith et al. (2017), reaching a minimal version of only sharing numerals Artetxe, Labaka, and Agirre (2017).

More recent work on unsupervised learning has shown that mappings across embedding spaces can also be learned without any bilingual evidence Barone (2016); Zhang et al. (2017); Conneau et al. (2018); Hoshen and Wolf (2018); Alvarez-Melis and Jaakkola (2018); Artetxe, Labaka, and Agirre (2018). More concretely, these fully unsupervised methods usually consist of two main steps Hartmann, Kementchedjhieva, and Søgaard (2019): an unsupervised step which aims to induce the seed dictionary by matching the source and target distributions, and a pseudo-supervised refinement step based on this seed dictionary.

The system proposed by Conneau et al. (2018) can be considered the first successful unsupervised system for learning CLWE. They first use generative adversarial networks (GANs) to learn a single linear mapping to induce the seed dictionary, followed by the Procrustes Analysis Schönemann (1966) to refine the linear mapping based on the induced seed dictionary. While this GAN-based model has competitive or even better performance compared to supervised methods on typologically-similar language pairs, it often exhibits poor performance on typologically-distant language pairs, pairs of languages that differ drastically in word forms, morphology, word order and other properties that determine how similar the lexicon of a language is. More specifically, their initial linear mapping often fails to induce the seed dictionary for distant language pairs Vulić et al. (2019). Later work from Artetxe, Labaka, and Agirre (2018) has proposed an unsupervised self-learning framework to make unsupervised CLWE learning more robust. Their system uses similarity distribution matching to induce the seed dictionary and stochastic dictionary induction to refine the mapping iteratively. The final CLWE learned by their system performs better than the GAN-based system. However, their advantage appears to come from the iterative refinement with stochastic dictionary induction, according to Hartmann, Kementchedjhieva, and Søgaard (2019). If we consider only the performance of a model induced with distribution matching alone, GAN-based models perform much better. This brings us to our first conclusion, that a GAN-based model is preferable for seed dictionary induction.

Fully unsupervised mapping-based methods to learn CLWE rely on the strong assumption that monolingual word embedding spaces are isomorphic or near-isomorphic, but this assumption is not fulfilled in practice, especially for distant language pairs Søgaard, Ruder, and Vulić (2018). Supervised methods are also affected by lack of isomorphism, as their performance on distant language pairs is worse than on similar language pairs. Moreover, experiments by Vulić, Ruder, and Søgaard (2020) also demonstrate that the lack of isomorphism does not arise only because of the typological distance among languages, but it also depends on the quality of the monolingual embedding space. Actually, if we replace the seed dictionary learned by an unsupervised distribution matching method with a pretrained dictionary, keeping constant the refinement technique, the final system becomes more robust Vulić et al. (2019).

All these previous results indicate that learning a better seed dictionary is a crucial step to improve unsupervised cross-lingual word embedding induction and reduce the gap between unsupervised methods and supervised methods, and that GAN-based methods hold the most promise to achieve this goal. The results also indicate that a solution that can handle the full complexity of induction of cross-lingual word embeddings will show improvements in both close and distant languages.

In this paper, we focus on improving the initial step of distribution matching, using GANs Hartmann, Kementchedjhieva, and Søgaard (2019). Because the isomorphism assumption is not observed in reality, we argue that a successful GAN-based model must not learn only one single linear mapping for the entire distribution, but must be able to identify mapping subspaces and learn multiple mappings. We propose a multi-adversarial learning method which learns different linear maps for different subspaces of word embeddings.

Limitations of the single-linear assumption for embedding mapping

Figure 1: Translation accuracy from English to Chinese and to French for different English subspaces. We only include the top fifty-thousand most frequent English words in the pretrained fastText embeddings. The gold translations come from Google Translate.

If the assumption that similar words across source and target languages are related by a single linear relation Mikolov et al. (2013) holds exactly or even approximately, the distance between source and target embedding spaces should be evenly or at least nearly evenly minimized during the training of the initial mapping. More specifically, each source subspace should be mapped equally well or nearly equally well to its corresponding target space, so that the translation ability of the single linear mapping should be similar across different source subspaces.

To verify this expectation, we use the GAN-based system MUSE (https://github.com/facebookresearch/MUSE) of Conneau et al. (2018) to train two linear mappings (without refinement). One mapping relates two typologically distant languages, English and Chinese; the other maps the English space to the space of French, a typologically similar language. We use pretrained fastText embeddings (https://fasttext.cc/docs/en/pretrained-vectors.html).

We split the English space into ten subspaces by running K-means clustering. We evaluate the trained linear mappings by calculating the translation accuracy with precision at one (P@1) —how often the highest ranked translation is the correct one— for each subspace, using the translations from Google Translate as the gold dataset. To reduce the influence of infrequent words, we only consider the fifty-thousand most frequent source words.
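A minimal sketch of this per-subspace evaluation, assuming the mapped source vectors, the target vectors, and a gold dictionary are already loaded; the names (`src_emb`, `gold`, etc.) are illustrative, and plain nearest-neighbour retrieval is used for brevity:

```python
import numpy as np
from sklearn.cluster import KMeans

def per_subspace_p1(src_emb, tgt_emb, src_words, tgt_words, gold, n_clusters=10):
    """Split the (mapped) source space with K-means and report P@1 per subspace.

    src_emb: (n_src, d) source vectors already multiplied by the trained mapping W
    tgt_emb: (n_tgt, d) target vectors
    gold:    dict mapping a source word to its gold translation
    """
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(src_emb)

    # Normalise so dot products are cosine similarities.
    s = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    t = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)

    hits = np.zeros(n_clusters)
    totals = np.zeros(n_clusters)
    for i, w in enumerate(src_words):
        if w not in gold:
            continue
        nearest = tgt_words[np.argmax(t @ s[i])]   # nearest-neighbour translation
        totals[labels[i]] += 1
        hits[labels[i]] += (nearest == gold[w])
    return hits / np.maximum(totals, 1)            # P@1 for each subspace
```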

As we can see in Figure 1, the distribution of accuracies of different subspaces is not uniform or even nearly so. This is true for both language pairs, but particularly for the distant languages, where the general mapping does not work at all in some subspaces. This lack of uniformity in results corroborates the appropriateness of designing a model that learns different linear mappings for different subspaces instead of only learning a single linear mapping for the entire source space.

Figure 2: Architecture of our multi-discriminator model for one source subspace $X_i$. The generator of each subspace is initialized with the mapping trained for subspace alignment.

Distribution matching with a single-linear mapping

To learn different mappings for different source subspaces, we propose one multi-discriminator GAN per source subspace and encourage words from a specific source subspace to be trained against words from the corresponding subspace on the target side. While the basic intuition is simple, the GAN architecture that achieves the desired results comprises several components. In this section, we first introduce the basic single-mapping GAN architecture that we use as a comparative baseline. In the following section, we introduce our model.

Learning unsupervised CLWE with GANs

Let two monolingual word embedding spaces $X$ and $Y$ be given. Mapping $X$ to $Y$ means seeking a linear transformation matrix $W$, so that the projected vector $Wx_s$ of a source word $s$ is close to the vector $y_t$ of its translation $t$ in the target language. The basic idea underlying supervised methods is to use a seed dictionary $D$ of word pairs to learn the matrix $W$ by minimizing the distance in (1), where $X_D$ and $Y_D$ represent the embeddings of the source and target words in $D$. The trained matrix $W$ can then be used to map the source word embeddings to the target space.

$W^{\star} = \operatorname{argmin}_{W} \lVert W X_D - Y_D \rVert_F$    (1)
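As a reference point, here is a closed-form least-squares sketch of the unconstrained objective in (1), assuming the seed dictionary is given as two aligned embedding matrices; names are illustrative, and the orthogonality constraint discussed later is not applied:

```python
import numpy as np

def fit_linear_map(X_d, Y_d):
    """Solve min_W ||X_d W - Y_d||_F for an unconstrained linear map.

    X_d, Y_d: (n_pairs, d) embeddings of the source/target words of a seed dictionary.
    Row-vector convention: x @ W here corresponds to W x with column vectors in eq. (1).
    """
    W, *_ = np.linalg.lstsq(X_d, Y_d, rcond=None)
    return W
```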

In an unsupervised setting, the seed dictionary is not provided. Conneau et al. (2018) propose a two-step system where the seed dictionary is learned in an unsupervised fashion. In a first step, they use GANs to learn an initial linear transformation matrix $W$ and use it to induce a seed dictionary by finding the translations of the ten-thousand most frequent source words. In a second step, the seed dictionary just learned is used to refine the initial matrix $W$. In the following subsections, we summarize the GAN-based method of Conneau et al. (2018) for building CLWE. We use this model as our comparative baseline.

Distribution matching with GANs

A standard GAN model plays a min-max game between a generator and a discriminator (Goodfellow et al., 2014). The generator learns from the distribution of source data and tries to fool the discriminator by generating new samples which are similar to the target data.

When we adapt the basic GAN model to learning CLWE, the goal of the generator is to learn the linear mapping matrix $W$. The discriminator detects whether the input is from the distribution of target embeddings $Y$. Conneau et al. (2018) use the loss functions in (2) and (3) to update the discriminator and the generator, respectively, where $x_i$ and $y_j$ are sampled from the source and target distributions $X$ and $Y$, and $P_D(\mathrm{target}=1 \mid v)$ denotes the probability that the input vector $v$ came from the target distribution rather than from the generator applied to samples from the source distribution $X$.

$L_D = -\frac{1}{m}\sum_{j=1}^{m}\log P_D(\mathrm{target}=1 \mid y_j) - \frac{1}{n}\sum_{i=1}^{n}\log P_D(\mathrm{target}=0 \mid W x_i)$    (2)

$L_W = -\frac{1}{n}\sum_{i=1}^{n}\log P_D(\mathrm{target}=1 \mid W x_i) - \frac{1}{m}\sum_{j=1}^{m}\log P_D(\mathrm{target}=0 \mid y_j)$    (3)

The parameters of the generator and the discriminator are updated alternately using stochastic gradient descent. Simply using these loss functions to train the initial matrix $W$ is, however, not reliable, so additional components are needed to improve the robustness of training and the quality of the seed dictionary.
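A simplified PyTorch-style sketch of one alternating update with the losses in (2) and (3); the architecture, sampling and hyperparameters are illustrative, not the exact settings of the MUSE implementation:

```python
import torch
import torch.nn as nn

d = 300
W = nn.Linear(d, d, bias=False)                      # generator: the mapping matrix
D = nn.Sequential(nn.Linear(d, 2048), nn.LeakyReLU(0.2),
                  nn.Linear(2048, 1), nn.Sigmoid())  # discriminator: P(target=1 | v)
opt_W = torch.optim.SGD(W.parameters(), lr=0.1)
opt_D = torch.optim.SGD(D.parameters(), lr=0.1)
bce = nn.BCELoss()

def train_step(x_batch, y_batch):
    """x_batch: source vectors, y_batch: target vectors, both of shape (b, d)."""
    # Discriminator step (eq. 2): target vectors labelled 1, mapped source labelled 0.
    opt_D.zero_grad()
    pred = D(torch.cat([y_batch, W(x_batch).detach()]))
    gold = torch.cat([torch.ones(len(y_batch), 1), torch.zeros(len(x_batch), 1)])
    loss_D = bce(pred, gold)
    loss_D.backward(); opt_D.step()

    # Generator step (mapped-source term of eq. 3; the target term carries no
    # gradient with respect to W): make mapped source vectors look like targets.
    opt_W.zero_grad()
    loss_W = bce(D(W(x_batch)), torch.ones(len(x_batch), 1))
    loss_W.backward(); opt_W.step()
    return loss_D.item(), loss_W.item()
```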

Orthogonalization

Previous work shows that enforcing the mapping matrix $W$ to be orthogonal during training can improve performance Smith et al. (2017). In the system of Conneau et al. (2018), they follow the work of Cisse et al. (2017) and approximate setting $W$ to an orthogonal matrix, as in (4). The orthogonalization usually performs well when setting $\beta$ to 0.001 Conneau et al. (2018); Wang, Henderson, and Merlo (2019).

$W \leftarrow (1+\beta)\,W - \beta\,(W W^{\top})\,W$    (4)
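A one-line sketch of the update rule in (4), applied to the mapping matrix after each gradient step:

```python
def orthogonalize(W, beta=0.001):
    """Pull the mapping matrix W towards the orthogonal manifold, eq. (4):
    W <- (1 + beta) * W - beta * (W @ W.T) @ W, for a NumPy array W."""
    return (1 + beta) * W - beta * (W @ W.T) @ W
```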
Cross-Domain Similarity Local Scaling

The trained mapping matrix $W$ can be used for retrieving the translation of a given source word $s$ by searching for a target word $t$ whose embedding vector $y_t$ is close to $W x_s$. Conneau et al. (2018) showed that using cross-domain similarity local scaling (CSLS) to retrieve translations is more accurate than standard nearest-neighbour retrieval and can reduce the impact of the hubness problem Radovanović, Nanopoulos, and Ivanović (2010); Dinu, Lazaridou, and Baroni (2015). Instead of just considering the distance between $W x_s$ and $y_t$, CSLS also takes into account the neighbours of $W x_s$ in the target language, as in (5).

$\mathrm{CSLS}(W x_s, y_t) = 2\cos(W x_s, y_t) - r_T(W x_s) - r_S(y_t)$    (5)

In this equation, $r_T(W x_s)$ denotes the mean similarity between a mapped source vector $W x_s$ and its nearest neighbours in the target language, while $r_S(y_t)$ represents the mean similarity between $y_t$ and its nearest neighbours in the mapped source language. Due to its good performance in retrieving translations, CSLS has become a standard component for inducing good seed dictionaries.
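A vectorised sketch of CSLS scoring as in (5), assuming unit-normalised embeddings so dot products are cosine similarities; k = 10 neighbours is a common choice, and the names are illustrative:

```python
import numpy as np

def csls_scores(mapped_src, tgt, k=10):
    """CSLS(W x_s, y_t) = 2 cos(W x_s, y_t) - r_T(W x_s) - r_S(y_t), eq. (5).

    mapped_src: (n_s, d) unit-normalised mapped source vectors (W x_s)
    tgt:        (n_t, d) unit-normalised target vectors
    Returns an (n_s, n_t) score matrix; argmax over axis 1 gives translations.
    """
    sims = mapped_src @ tgt.T                              # cosine similarities
    r_T = np.mean(np.sort(sims, axis=1)[:, -k:], axis=1)   # mean sim. of each mapped source to its k target NNs
    r_S = np.mean(np.sort(sims, axis=0)[-k:, :], axis=0)   # mean sim. of each target to its k mapped-source NNs
    return 2 * sims - r_T[:, None] - r_S[None, :]
```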

Model selection criteria

Another important component of the model of Conneau et al. (2018) is the cosine-based model selection criterion that they propose for selecting the best mapping matrix $W$ during adversarial training. More specifically, at the end of each training epoch, they use the current mapping to translate the ten-thousand most frequent source words into target words and calculate the average cosine similarity between the mapped source vectors and the target vectors. This cosine-based criterion has been shown to correlate well with the quality of $W$ Conneau et al. (2018); Hartmann, Kementchedjhieva, and Søgaard (2019).
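A sketch of this unsupervised validation criterion, reusing the `csls_scores` helper from the CSLS sketch above; the 10,000-word cutoff follows the text, and the retrieval details are an assumption:

```python
import numpy as np

def selection_criterion(mapped_src, tgt, n_eval=10000, k=10):
    """Average cosine between the n_eval most frequent mapped source vectors and
    their retrieved translations; a higher value is taken to indicate a better W."""
    scores = csls_scores(mapped_src[:n_eval], tgt, k=k)
    best = np.argmax(scores, axis=1)                         # translation index of each source word
    cos = np.sum(mapped_src[:n_eval] * tgt[best], axis=1)    # cosine (unit-normalised vectors)
    return float(np.mean(cos))
```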

Bidirectional seed dictionary induction

This final step guarantees the bidirectionality of the induced dictionary. After the mapping matrix $W$ is learned from the adversarial training, the translations of the top ten-thousand source words are retrieved and then back-translated into the source language. The mutual translation pairs, i.e. pairs $(s, t)$ such that $s$ is also retrieved as the translation of $t$, constitute the seed dictionary. The seed dictionary is then refined in a second step.
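A sketch of this bidirectional filtering, again reusing the `csls_scores` helper from above and working with word indices rather than word strings for brevity:

```python
import numpy as np

def mutual_pairs(mapped_src, tgt, n_src=10000, k=10):
    """Keep pairs (s, t) where t is the CSLS translation of s and s is the
    CSLS back-translation of t; these pairs form the seed dictionary."""
    s2t = np.argmax(csls_scores(mapped_src[:n_src], tgt, k=k), axis=1)
    t2s = np.argmax(csls_scores(tgt, mapped_src[:n_src], k=k), axis=1)
    return [(s, t) for s, t in enumerate(s2t) if t2s[t] == s]
```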

Seed dictionary refinement

The refinement step is based on the Procrustes Analysis Schönemann (1966). With the seed dictionary just learned after the adversarial learning, the mapping matrix $W$ can be updated using the objective in equation (1) and forced to be orthogonal by using singular value decomposition (SVD) Xing et al. (2015), as in (6).

$W^{\star} = \operatorname{argmin}_{W \in O_d} \lVert W X_D - Y_D \rVert_F = U V^{\top}, \quad \text{with } U \Sigma V^{\top} = \mathrm{SVD}(Y_D X_D^{\top})$    (6)
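A sketch of the Procrustes update in (6) given a seed dictionary as aligned matrices; it uses the row-vector convention (X_d @ W ≈ Y_d), so the SVD is taken of X_d.T @ Y_d rather than of Y_D X_D^T as in the column-vector form of (6):

```python
import numpy as np

def procrustes(X_d, Y_d):
    """Orthogonal solution of min_W ||X_d W - Y_d||_F.

    X_d, Y_d: (n_pairs, d) source/target embeddings of the seed dictionary.
    Returns an orthogonal W (row-vector convention: X_d @ W ~ Y_d).
    """
    U, _, Vt = np.linalg.svd(X_d.T @ Y_d)
    return U @ Vt
```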

Later work combines the Procrustes Analysis with stochastic dictionary induction Artetxe, Labaka, and Agirre (2018) and largely improves over the standard refinement Hartmann, Kementchedjhieva, and Søgaard (2019). More specifically, in order to avoid local optima, after each iteration some elements of the similarity matrix are randomly dropped, so that the similarity distributions of words change randomly and the seed dictionary for the next iteration varies.

Distribution matching with a multi-linear mapping

To English

            et           fa           fi           lv           tr           vi           es           it
            avg.    f.   avg.    f.   avg.    f.   avg.    f.   avg.    f.   avg.    f.   avg.    f.   avg.    f.
MUSE        0       10   16.48   4    25.32   4    0       10   24.32   8    0       10   65.12   0    54.75   0
VecMap      0       10   0       10   0       10   0       10   0       10   0       10   0       10   0       10
MultiMap    0       10   20.97   3    26.66   4    16.7    9    32.53   8    0       10   68.42   0    58.23   0
Table 1: Bilingual lexicon induction results (without refinement) on the dataset of Conneau et al. (2018). We consider the accuracy below 2% as a failure (f.) and report the average accuracy with P@1 (avg.) over the successful runs. The translations are retrieved using CSLS. Languages: Estonian (et), Farsi (fa), Finnish (fi), Latvian (lv), Turkish (tr), Vietnamese (vi), Spanish (es) and Italian (it).

As discussed in the previous section, previous GAN-based systems start by learning a single linear mapping from the source embedding space to the target embedding space. In such models, a source word is trained against a target word sampled from the whole target distribution, and the resulting single linear mapping is applied to all the source words. We instead learn different mappings for different source subspaces. We propose a multi-discriminator GAN where a source word from one subspace is trained against a target word sampled from the aligned target subspace. We first describe the multi-discriminator GAN model and then explain the process we use to identify the aligned source-target subspaces.

Multi-discriminator adversarial learning

For each subspace of source embeddings, we propose a multi-discriminator adversarial model to train the specific mapping for vectors that belong to this subspace. As the architecture in Figure 2 illustrates, the generator of a given source subspace $X_i$ takes a vector sampled from this sub-distribution as input and maps it to the target language. Unlike in standard GANs, the mapped vector is fed into two discriminators:

  • A subspace-specific discriminator $D_{S_i}$, which judges whether the input vector comes from the corresponding target subspace $Y_i$. We use vectors sampled from both the source and target subspaces to train $D_{S_i}$.

  • To avoid local optima within the specific subspace, we add a discriminator on language $D_L$, which detects whether the input vector comes from the whole target distribution. We follow the work of Conneau et al. (2018) and only use the seventy-five-thousand most frequent source and target words to train $D_L$, to reduce the negative impact of infrequent words.

Both discriminators are two-layer perceptron classifiers. Except for the different sampling ranges, their loss functions are similar to equation (2):

$L_{D_L} = -\frac{1}{m}\sum_{j=1}^{m}\log P_{D_L}(\mathrm{target}=1 \mid y_j) - \frac{1}{n}\sum_{j=1}^{n}\log P_{D_L}(\mathrm{target}=0 \mid W_i x_j)$    (7)

$L_{D_{S_i}} = -\frac{1}{m_i}\sum_{j=1}^{m_i}\log P_{D_{S_i}}(\mathrm{target}=1 \mid y^{i}_j) - \frac{1}{n_i}\sum_{j=1}^{n_i}\log P_{D_{S_i}}(\mathrm{target}=0 \mid W_i x^{i}_j)$    (8)

where $x_j$ and $y_j$ are sampled from the 75 thousand most frequent source and target words (we use a different language discriminator for each subspace, even though their training samples all come from the same distributions; this leads to more stable training, presumably because initially these language discriminators are randomly different), and $x^{i}_j$ and $y^{i}_j$ are sampled from the specific source subspace $X_i$ and its corresponding target subspace $Y_i$. Since the outputs of both discriminators are used for training the generator, the loss function of the subspace-specific generator $W_i$ can be written as in (9),

$L_{W_i} = \alpha_i\, L^{D_L}_{W_i} + (1-\alpha_i)\, L^{D_{S_i}}_{W_i}$    (9)

where $L^{D_L}_{W_i}$ and $L^{D_{S_i}}_{W_i}$ are generator losses analogous to equation (3), computed against $D_L$ and $D_{S_i}$ respectively, and $\alpha_i$ is a coefficient that we call the global confidence. We use the global confidence to balance the contributions of the two discriminators in updating the generator. In practice, we find that setting $\alpha_i$ to 0.5 for each subspace works well for the final result.
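A PyTorch-style sketch of how the subspace-specific generator could be updated against both discriminators with the weighting in (9); `W_i`, `D_L`, `D_S` are assumed to be modules built as described above, and the argument names are illustrative:

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()

def generator_step(W_i, D_L, D_S, x_global, x_sub, opt_W, alpha=0.5):
    """One update of the subspace-specific generator W_i.

    x_global: source vectors sampled from the 75k most frequent words
    x_sub:    source vectors sampled from subspace X_i
    alpha:    global confidence weighting the language discriminator D_L
    """
    opt_W.zero_grad()
    # Fool the language discriminator with mapped frequent words (global term).
    loss_global = bce(D_L(W_i(x_global)), torch.ones(len(x_global), 1))
    # Fool the subspace discriminator with mapped subspace words (local term).
    loss_sub = bce(D_S(W_i(x_sub)), torch.ones(len(x_sub), 1))
    loss = alpha * loss_global + (1 - alpha) * loss_sub   # weighting as in eq. (9)
    loss.backward(); opt_W.step()
    return loss.item()
```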

Additionally, we propose a metric to set $\alpha_i$ dynamically, based on the proportion between the eigenvalue divergence of the two subspaces and the eigenvalue divergence of the whole source and target distributions, as shown in (10).

$\alpha_i = \frac{\mathrm{EVD}(X_i, Y_i)}{\mathrm{EVD}(X_i, Y_i) + \mathrm{EVD}(X, Y)}$    (10)

The eigenvalue divergence between two embedding distributions $X$ and $Y$ can be computed as shown in (11), where $e_k(X)$ and $e_k(Y)$ represent the sorted eigenvalues of $X$ and $Y$.

$\mathrm{EVD}(X, Y) = \sum_{k} \left( \log e_k(X) - \log e_k(Y) \right)^2$    (11)
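A sketch matching the form of (10) and (11); here the eigenvalues are taken from the covariance matrices of the (sub)spaces, which is an assumption, as the exact spectrum used in the cited EVD work may be computed differently:

```python
import numpy as np

def evd(X, Y):
    """Eigenvalue divergence in the form of (11): sum of squared differences of
    log eigenvalues, here computed from the covariance matrices of X and Y."""
    ex = np.sort(np.linalg.eigvalsh(np.cov(X, rowvar=False)))[::-1]
    ey = np.sort(np.linalg.eigvalsh(np.cov(Y, rowvar=False)))[::-1]
    d = min(len(ex), len(ey))
    ex, ey = np.maximum(ex[:d], 1e-12), np.maximum(ey[:d], 1e-12)  # numerical safety
    return float(np.sum((np.log(ex) - np.log(ey)) ** 2))

def global_confidence(X_i, Y_i, X, Y):
    """Dynamic alpha as the proportion in (10) between the subspace divergence
    and the whole-space divergence."""
    e_sub, e_all = evd(X_i, Y_i), evd(X, Y)
    return e_sub / (e_sub + e_all)
```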

All subspace-specific generators are initialized with the single linear mapping from source to target space that is trained for subspace alignment, discussed below.

Parameter-free hierarchical clustering

The above multi-discriminator GAN assumes that we have an alignment between source subspaces and target subspaces. We first present the clustering method we use to find coherent subspaces, and then the method we use to produce aligned subspaces in the source and target distributions; both are important for the model's improved performance.

The first issue in clustering an embedding space is how to find a clustering that adapts to the space, without fixed parameters. To identify the word subspaces, we could run a traditional K-means clustering algorithm over the embedding space. However, the number of subspaces would then have to be defined in advance. Hierarchical clustering algorithms do not suffer from this shortcoming. Recent work proposes a parameter-free method called First Integer Neighbor Clustering Hierarchy (FINCH) Sarfraz, Sharma, and Stiefelhagen (2019), which we use in this paper.

Traditionally, clustering methods split a given space of vectors into different clusters by calculating the distances between the centroids and the other vectors. FINCH builds on the observation that the first neighbour of each vector is a sufficient statistic to find links in the space, so that computing the full distance matrix between all the vectors is not needed Sarfraz, Sharma, and Stiefelhagen (2019). For a given vector space, one first computes an adjacency link matrix using the equation in (12),

$A(i,j) = \begin{cases} 1 & \text{if } j = \kappa^{1}_i \ \text{or} \ \kappa^{1}_j = i \ \text{or} \ \kappa^{1}_i = \kappa^{1}_j \\ 0 & \text{otherwise} \end{cases}$    (12)

where $i$ and $j$ denote the indices of vectors and $\kappa^{1}_i$ represents the index of the first neighbour of the vector with index $i$. The connected components can then be detected from the adjacency matrix by building a directed or undirected graph on $A$. No parameter needs to be set. When the clustering on the first level (the original data) is completed, the centroid of each cluster can be considered as a data vector for the next level, and a new level of clustering is computed using the same procedure. In theory, all the vectors will eventually be gathered into a single cluster. In practice, we find that using the clusters of the last level or the second-to-last level works well for our system. (In the code of Sarfraz, Sharma, and Stiefelhagen (2019), the last level means the level before grouping all the data vectors into a single cluster.)
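A sketch of one FINCH level following (12), using scikit-learn for the first-neighbour search and SciPy for the connected components; the official implementation of Sarfraz, Sharma, and Stiefelhagen (2019) differs in details:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from sklearn.neighbors import NearestNeighbors

def finch_level(vectors):
    """One FINCH level: link each vector to its first neighbour as in (12) and
    return the cluster label of every vector plus the cluster centroids."""
    n = len(vectors)
    # kappa[i] = index of the first (nearest) neighbour of vector i.
    nn = NearestNeighbors(n_neighbors=2).fit(vectors)
    kappa = nn.kneighbors(vectors, return_distance=False)[:, 1]

    # Directed edges i -> kappa_i; treating the graph as undirected also merges
    # the symmetric and shared-first-neighbour cases of (12).
    A = csr_matrix((np.ones(n), (np.arange(n), kappa)), shape=(n, n))
    n_clusters, labels = connected_components(A, directed=False)

    centroids = np.vstack([vectors[labels == c].mean(axis=0) for c in range(n_clusters)])
    return labels, centroids   # feed centroids back in to compute the next level
```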

To English

            Dico     et           fa           fi           lv           tr           vi           es           it
            size     avg.    f.   avg.    f.   avg.    f.   avg.    f.   avg.    f.   avg.    f.   avg.    f.   avg.    f.
Procrustes  500      4.60    -    0       -    4.67    -    0       -    0       -    0       -    0       -    18.27   -
Procrustes  1000     27.47   -    21.24   -    33.67   -    20.27   -    24.22   -    0.40    -    69.80   -    64.60   -
Procrustes  5000     52.27   -    41.39   -    63.93   -    42.87   -    62.44   -    54.73   -    83.93   -    79.00   -
MUSE        -        0       10   27.76   1    47.87   0    32.43   9    54.42   7    0       10   81.43   0    77.86   0
VecMap      -        0       10   39.24   1    46.33   0    0       10   57.37   0    0       10   82.67   0    77.32   0
MultiMap    -        28.06   9    39.09   0    49.59   0    34.22   7    58.97   7    16.7    9    83.45   0    78.46   0
Table 2: Bilingual lexicon induction results (with refinement) on the dataset of Conneau et al. (2018). For all the unsupervised systems, we use the Procrustes refinement Conneau et al. (2018). We consider an accuracy below 2% as a failure (f.) and report the average accuracy with P@1 (avg.) over the successful runs. For the supervised baseline, the dictionaries are shuffled subsets of the training dictionary proposed by Conneau et al. (2018). The translations are retrieved using CSLS. Languages: Estonian (et), Farsi (fa), Finnish (fi), Latvian (lv), Turkish (tr), Vietnamese (vi), Spanish (es) and Italian (it).

Sample-level subspace alignment

If we want to encourage words from a specific source subspace to be trained against words from a matching target subspace, we need to align the two cross-language subspaces. The second problem we need to solve for our multi-adversarial method to work is how to discover this alignment. Although metrics such as Gromov-Hausdorff distance (GH) Patra et al. (2019) and Eigenvalue Divergence (EVD) Dubossarsky et al. (2020) can be used to measure the similarity between two distributions and find the most similar target subspace for a given source subspace, matching between two sub-distributions may amplify any bias generated during the clustering.

To avoid this problem, we only run the clustering on the target side. For a given target embedding space $Y$, we denote its subspaces after clustering as $Y_1, \dots, Y_k$, where $k$ represents the number of subspaces. To align source words to their matching target subspace, we propose to first learn a single linear mapping from source to target space using the GAN-based method (without refinement) described previously. Then, for each source word, we use CSLS to retrieve its translation in the target language and assign the subspace index of this translation to the source word. In this way, the source embedding space is partitioned into as many subspaces as the target embedding space, denoted as $X_1, \dots, X_k$. Each $(X_i, Y_i)$ is then an aligned subspace pair.
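A sketch of this alignment step, reusing the `csls_scores` helper from the CSLS sketch and cluster labels such as those produced by the FINCH sketch; the names are illustrative:

```python
import numpy as np

def align_source_to_target_subspaces(mapped_src, tgt, tgt_labels, k=10):
    """Give each source word the subspace index of its CSLS translation.

    mapped_src: (n_s, d) source vectors mapped with the single initial GAN mapping
    tgt:        (n_t, d) target vectors
    tgt_labels: (n_t,) cluster labels of the target words (e.g. from FINCH)
    Returns src_labels, so that (X_i, Y_i) pairs words with the same label i.
    """
    translations = np.argmax(csls_scores(mapped_src, tgt, k=k), axis=1)
    return tgt_labels[translations]
```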

Although the single linear mapping from source language to target language is not good enough to get accurate translations, it appears to be a good method to produce an initial alignment. A possible reason for this result is that the clustering on the target language has already grouped similar words. Therefore, translations that in the end turn out to be incorrect usually have the same subspace index as the good translations.

We evaluate the performance of the model presented here and the contributions of each of its components in the ablation tests described in the next section.

To English

                          et           fa           fi           lv           tr           vi           es           it
                          avg.    f.   avg.    f.   avg.    f.   avg.    f.   avg.    f.   avg.    f.   avg.    f.   avg.    f.
All components            0       10   20.97   3    26.66   4    16.7    9    32.53   8    0       10   68.42   0    58.23   0
- $D_L$                   0       10   20.82   3    25.33   4    16.7    9    29.33   8    0       10   66.93   0    56.73   0
- $D_{S_i}$               0       10   17.62   3    25.42   4    16.48   9    29.46   8    0       10   65.27   0    57.71   0
- Dynamic $\alpha$        0       10   20.33   3    26.33   4    16.7    9    31.26   8    0       10   68.04   0    58.27   0
- Global initialization   0       10   16.33   9    0       10   0       10   35.20   8    0       10   60.60   2    58.22   1
Table 3: Results of the ablation test (without refinement) on the dataset of Conneau et al. (2018). We consider an accuracy below 2% as a failure (f.) and report the average accuracy with P@1 (avg.) over the successful runs. $D_L$ and $D_{S_i}$ denote the discriminator on language and the discriminator on subspaces. For the test without the dynamic $\alpha$, we use a fixed value of 0.5 instead. Global initialization refers to using the mapping learned for subspace alignment to initialize each subspace-specific generator. Languages: Estonian (et), Farsi (fa), Finnish (fi), Latvian (lv), Turkish (tr), Vietnamese (vi), Spanish (es) and Italian (it).

Experiments

In this section, we evaluate our proposal on the task of Bilingual Lexicon Induction (BLI). For each language pair, we retrieve the best translations of the source words in the test dictionary using CSLS, and we report the accuracy with precision at one (P@1). Since GAN-based methods of learning CLWE are often criticized for their instability during training, we report the average accuracy over 10 runs and consider a run failed when its accuracy is lower than 2%. We evaluate our system both with and without refinement; for refinement, we follow Conneau et al. (2018) and use the Procrustes Analysis to refine the initial mapping. To understand the contribution of different components, we also conduct ablation tests.
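A small sketch of this reporting protocol (average P@1 over the successful runs, counting runs under 2% as failures); the function name and inputs are illustrative:

```python
def summarise_runs(accuracies, fail_threshold=2.0):
    """accuracies: P@1 scores (in %) of the 10 runs for one language pair."""
    successes = [a for a in accuracies if a >= fail_threshold]
    n_fail = len(accuracies) - len(successes)
    avg = sum(successes) / len(successes) if successes else 0.0
    return avg, n_fail
```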

Monolingual word embeddings:

We use the pretrained fastText embedding models Bojanowski et al. (2017) for our experiments. These 300-dimensional embeddings are pretrained on Wikipedia dumps and publicly available (https://fasttext.cc/docs/en/pretrained-vectors.html). Following previous work, we use the 200k most frequent words of each monolingual embedding model. (The original pretrained Latvian fastText model only contains 171k words.)

Test dataset:

We evaluate our system on the evaluation dataset provided by Conneau et al. (2018). This dataset contains high quality dictionaries for more than 150 language pairs. For each language pair, it provides a training dictionary of 5000 words and a test dictionary of 1500 words. This dataset allows us to have a better understanding of the performance of our proposal on many different language pairs.

Language pairs:

As mentioned previously, because of lack of isomorphism across languages, unsupervised methods usually fail to learn good CLWE for distant language pairs. This is why, in this paper, we mostly focus on distant language pairs. Following the work of Hartmann, Kementchedjhieva, and Søgaard (2019), we choose Estonian (et), Farsi (fa), Finnish (fi), Latvian (lv), Turkish (tr), Vietnamese (vi) and evaluate our system by mapping the monolingual embedding models of these six languages to the embedding model of English. In order to better understand the performance of our system and to compare to close languages, we also investigate Spanish (es) and Italian (it), two languages more similar to English than the previous six.

Baselines:

The objective of our proposal is to improve the mapping ability of GANs by learning multiple linear mappings instead of only a single linear mapping. Therefore, we use the GAN-based system proposed by Conneau et al. (2018) (https://github.com/facebookresearch/MUSE), denoted MUSE, as our main unsupervised baseline. Since the unsupervised method proposed by Artetxe, Labaka, and Agirre (2018) (https://github.com/artetxem/vecmap) is considered a robust CLWE system, we also use it as our second unsupervised baseline and denote it VecMap. However, according to Hartmann, Kementchedjhieva, and Søgaard (2019), the advantage of VecMap mostly comes from its iterative refinement with stochastic dictionary induction. To have a fair comparison, we apply the standard refinement through Procrustes Analysis to VecMap. Finally, GAN-based systems are often compared to different supervised systems. We think that such a comparison is not appropriate for understanding the GANs' ability to induce the seed dictionary; the supervised algorithm should be kept constant. Therefore, we build our own supervised baseline with the Procrustes Analysis that we use for refinement and feed it with dictionaries of different sizes. We shuffle the training dictionaries proposed by Conneau et al. (2018) and split them into 500, 1000 and 5000 word pairs.

Multi-linear GANs vs single-linear GANs

Tables 1 and 2 show the results of our proposed multi-linear method in comparison to our unsupervised baselines, without and with refinement respectively. From these two tables we can easily observe that the six distant language pairs selected here are very difficult for unsupervised systems. Even so, the number of failed runs shows that our multi-linear GANs are somewhat more stable than single-linear GANs: they fail less often. If we only consider the successful runs, our multi-linear method performs much better than the single-linear method on all language pairs. For some language pairs, such as Farsi-to-English with refinement, the improvement brought by our system is more than 10%.

Unsupervised seed dictionary vs pretrained dictionary

From the results shown in Table 2, we can see that with dictionaries of 500 word pairs, our supervised baseline is not capable of building good CLWE. Even with 1000 word pairs, its performance is still not better than that of our unsupervised method. Despite the instability of training, the performance of our multi-linear method is in some cases very close to that of the supervised baseline trained with 5000 word pairs.

Ablation tests

We report the results of our ablation tests in Table 3. The last line shows that using the mapping learned for aligning subspaces to initialize each subspace-specific generator is very important; otherwise, the training becomes very unstable, even for similar language pairs. The second and third lines show, respectively, that both the discriminator on language $D_L$ and the discriminator on subspaces $D_{S_i}$ contribute to the mapping performance. However, the discriminator on subspaces appears to be more important in some cases, judging by the drop in performance when it is removed. The fourth line shows that the dynamic $\alpha$ calculated with the eigenvalue divergence is not very important: the drop in performance when it is replaced by a fixed value of 0.5 is very limited. In fact, inspection of the dynamic $\alpha$ shows that it is usually close to 0.5 during training.

These ablation tests indicate that, with the exception of the dynamic $\alpha$, all the components of our multi-adversarial model contribute to bringing the overall system to very good unsupervised performance, in some cases comparable to supervised methods.

Conclusion

In this paper, we propose a multi-adversarial learning method for cross-lingual word embeddings. Our system learns different linear mappings for different source subspaces instead of a single mapping for the whole source space. The results of our experiments on bilingual lexicon induction, on both close languages and the difficult case of typologically-distant languages, show that learning cross-lingual word embeddings with multiple mappings improves performance over a single mapping.

References

  • Alvarez-Melis and Jaakkola (2018) Alvarez-Melis, D.; and Jaakkola, T. S. 2018. Gromov-Wasserstein Alignment of Word Embedding Spaces. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 1881–1890. Brussels, Belgium.
  • Artetxe, Labaka, and Agirre (2018) Artetxe, M.; Labaka, G.; and Agirre, E. 2018. A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, volume Long Papers, 789–798. Melbourne, Australia.
  • Artetxe, Labaka, and Agirre (2017) Artetxe, M.; Labaka, G.; and Agirre, E. 2017. Learning bilingual word embeddings with (almost) no bilingual data. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, 451–462. Vancouver, Canada.
  • Barone (2016) Barone, A. V. M. 2016. Towards Cross-Lingual Distributed Representations without Parallel Text Trained with Adversarial Autoencoders. In Proceedings of the 1st Workshop on Representation Learning for NLP, 121–126. Berlin, Germany.
  • Bojanowski et al. (2017) Bojanowski, P.; Grave, E.; Joulin, A.; and Mikolov, T. 2017. Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics 5: 135–146. doi:10.1162/tacl_a_00051.
  • Cisse et al. (2017) Cisse, M.; Bojanowski, P.; Grave, E.; Dauphin, Y.; and Usunier, N. 2017. Parseval Networks: Improving Robustness to Adversarial Examples. In Proceedings of the 34th International Conference on Machine Learning, volume 70, 854–863. Sydney, Australia.
  • Conneau et al. (2018) Conneau, A.; Lample, G.; Ranzato, M.; Denoyer, L.; and Jégou, H. 2018. Word Translation Without Parallel Data. In Proceedings of the 6th International Conference on Learning Representations, 1–14. Vancouver, Canada.
  • Dinu, Lazaridou, and Baroni (2015) Dinu, G.; Lazaridou, A.; and Baroni, M. 2015. Improving Zero-Shot Learning by Mitigating the Hubness Problem. In Proceedings of the 3rd International Conference on Learning Representations, volume Workshop Track, 1–10. San Diego, California, USA.
  • Dubossarsky et al. (2020) Dubossarsky, H.; Vulić, I.; Reichart, R.; and Korhonen, A. 2020. Lost in Embedding Space: Explaining Cross-Lingual Task Performance with Eigenvalue Divergence. arXiv 1–10.
  • Glavaš et al. (2019) Glavaš, G.; Litschko, R.; Ruder, S.; and Vulić, I. 2019. How to (Properly) Evaluate Cross-Lingual Word Embeddings: On Strong Baselines, Comparative Analyses, and Some Misconceptions. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 710–721. Florence, Italy.
  • Goodfellow et al. (2014) Goodfellow, I. J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative Adversarial Nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems, 2672–2680. Montréal, Canada.
  • Guo et al. (2015) Guo, J.; Che, W.; Yarowsky, D.; Wang, H.; and Liu, T. 2015. Cross-lingual Dependency Parsing Based on Distributed Representations. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, 1234–1244. Beijing, China.
  • Hartmann, Kementchedjhieva, and Søgaard (2019) Hartmann, M.; Kementchedjhieva, Y.; and Søgaard, A. 2019. Comparing Unsupervised Word Translation Methods Step by Step. In Proceedings of the 33rd Conference on Neural Information Processing Systems, 6033–6043. Vancouver, Canada.
  • Hoshen and Wolf (2018) Hoshen, Y.; and Wolf, L. 2018. An Iterative Closest Point Method for Unsupervised Word Translation. arXiv.
  • Klementiev, Titov, and Bhattarai (2012) Klementiev, A.; Titov, I.; and Bhattarai, B. 2012. Inducing Crosslingual Distributed Representations of Words. In Proceedings of the 24th International Conference on Computational Linguistics, 1459–1473. Mumbai, India.
  • Mikolov et al. (2013) Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J. 2013. Efficient Estimation of Word Representations in Vector Space. In Proceedings of the 1st International Conference on Learning Representations, 1–12. Arizona, USA.
  • Patra et al. (2019) Patra, B.; Moniz, J. R. A.; Garg, S.; Gormley, M. R.; and Neubig, G. 2019. Bilingual Lexicon Induction with Semi-supervision in Non-Isometric Embedding Spaces. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 184–193. Florence, Italy.
  • Radovanović, Nanopoulos, and Ivanović (2010) Radovanović, M.; Nanopoulos, A.; and Ivanović, M. 2010. Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data. Journal of Machine Learning Research 11: 2487–2531.
  • Ruder, Vulić, and Søgaard (2019) Ruder, S.; Vulić, I.; and Søgaard, A. 2019. A Survey of Cross-lingual Word Embedding Models. Journal of Artificial Intelligence Research 65: 569–631.
  • Sarfraz, Sharma, and Stiefelhagen (2019) Sarfraz, M. S.; Sharma, V.; and Stiefelhagen, R. 2019. Efficient Parameter-free Clustering Using First Neighbor Relations. In Proceedings of the 2019 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 8934–8943. Long Beach, California, USA.
  • Schönemann (1966) Schönemann, P. H. 1966. A generalized solution of the orthogonal procrustes problem. Psychometrika 31: 1–10.
  • Smith et al. (2017) Smith, S. L.; Turban, D. H. P.; Hamblin, S.; and Hammerla, N. Y. 2017. Offline Bilingual Word Vectors, Orthogonal Transformations and the Inverted Softmax. In Proceedings of the 5th International Conference on Learning Representations, 1–10. Toulon, France.
  • Søgaard, Ruder, and Vulić (2018) Søgaard, A.; Ruder, S.; and Vulić, I. 2018. On the Limitations of Unsupervised Bilingual Dictionary Induction. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, volume Long Papers, 778–788. Melbourne, Australia.
  • Vulić et al. (2019) Vulić, I.; Glavaš, G.; Reichart, R.; and Korhonen, A. 2019. Do We Really Need Fully Unsupervised Cross-Lingual Embeddings? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, 4407–4418. Hong Kong, China.
  • Vulić, Ruder, and Søgaard (2020) Vulić, I.; Ruder, S.; and Søgaard, A. 2020. Are All Good Word Vector Spaces Isomorphic? arXiv 1–11.
  • Wang, Henderson, and Merlo (2019) Wang, H.; Henderson, J.; and Merlo, P. 2019. Weakly-Supervised Concept-based Adversarial Learning for Cross-lingual Word Embeddings. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, 4419–4430. Hong Kong, China.
  • Xing et al. (2015) Xing, C.; Wang, D.; Liu, C.; and Lin, Y. 2015. Normalized Word Embedding and Orthogonal Transform for Bilingual Word Translation. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics, 1006–1011. Denver, Colorado.
  • Zhang et al. (2017) Zhang, M.; Liu, Y.; Luan, H.; and Sun, M. 2017. Adversarial Training for Unsupervised Bilingual Lexicon Induction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, 1959–1970. Vancouver, Canada.
  • Zou et al. (2013) Zou, W. Y.; Socher, R.; Cer, D.; and Manning, C. D. 2013. Bilingual Word Embeddings for Phrase-Based Machine Translation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 1393–1398. Seattle, Washington, USA.