Revisiting Adversarial Autoencoder for Unsupervised Word Translation with Cycle Consistency and Improved Training

Tasnim Mohiuddin et al. · Nanyang Technological University · 04/04/2019

Adversarial training has shown impressive success in learning bilingual dictionaries without any parallel data by mapping monolingual embeddings to a shared space. However, recent work has shown superior performance for non-adversarial methods in more challenging language pairs. In this work, we revisit the adversarial autoencoder for unsupervised word translation and propose two novel extensions to it that yield more stable training and improved results. Our method includes regularization terms to enforce cycle consistency and input reconstruction, and puts the target encoders as an adversary against the corresponding discriminator. Extensive experimentation with European, non-European, and low-resource languages shows that our method is more robust and achieves better performance than recently proposed adversarial and non-adversarial approaches.


1 Introduction

Learning cross-lingual word embeddings has been shown to be an effective way to transfer knowledge from one language to another for many key linguistic tasks, including machine translation, named entity recognition, part-of-speech tagging, and parsing Ruder et al. (2017). While earlier efforts solved the associated word alignment problem using large parallel corpora Luong et al. (2015), broader applicability demands methods that relax this requirement, since acquiring a large corpus of parallel data is not feasible in most scenarios. Recent methods instead use embeddings learned from monolingual data and learn a linear mapping from one language to another, with the underlying assumption that the two embedding spaces exhibit similar geometric structures (i.e., are approximately isomorphic). This allows the model to learn effective cross-lingual representations without expensive supervision Artetxe et al. (2017).

Given monolingual word embeddings of two languages, Mikolov13 show that a linear mapping can be learned from a seed dictionary of 5,000 word pairs by minimizing the sum of squared Euclidean distances between the mapped vectors and the target vectors. Subsequent works Xing et al. (2015); Artetxe et al. (2016, 2017); Smith et al. (2017) propose to improve the model by normalizing the embeddings, imposing an orthogonality constraint on the mapper, and modifying the objective function. While these methods assume some supervision in the form of a seed dictionary, fully unsupervised methods have recently shown competitive results. Zhang17,Zhang-17-emnlp first reported encouraging results with adversarial training. conneau2018word improved this approach with post-mapping refinements, showing impressive results for several language pairs. Their learned mapping was then successfully used to train a fully unsupervised neural machine translation system Lample et al. (2018a, b).
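For concreteness, the supervised mapping described above can be sketched in a few lines of NumPy. The seed matrices below are random placeholders and the variable names are illustrative assumptions, not code from any of the cited works.

```python
import numpy as np

# Sketch of the supervised baseline described above: given a seed dictionary of
# aligned embeddings, learn a linear map W minimizing ||seed_src @ W - seed_tgt||^2.
rng = np.random.default_rng(0)
d, n_pairs = 300, 5000
seed_src = rng.standard_normal((n_pairs, d))   # placeholder source embeddings
seed_tgt = rng.standard_normal((n_pairs, d))   # placeholder target embeddings

W, *_ = np.linalg.lstsq(seed_src, seed_tgt, rcond=None)   # (d x d) linear mapping

def translate(x, tgt_embeddings):
    """Map a source vector with W and return the index of the nearest target word
    by cosine similarity."""
    mapped = x @ W
    sims = (tgt_embeddings @ mapped) / (
        np.linalg.norm(tgt_embeddings, axis=1) * np.linalg.norm(mapped) + 1e-8)
    return int(np.argmax(sims))
```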

Although successful, adversarial training has been criticized for being unstable and failing to converge, inspiring researchers to propose non-adversarial methods more recently Xu et al. (2018a); Hoshen and Wolf (2018); Alvarez-Melis and Jaakkola (2018); Artetxe et al. (2018b). In particular, Artetxe-2018-acl show that the adversarial methods of conneau2018word and Zhang17,Zhang-17-emnlp fail for many language pairs.

In this paper, we revisit adversarial training and propose a number of key improvements that yield more robust training and improved mappings. Our main idea is to learn the cross-lingual mapping in a projected latent space and add more constraints to guide the unsupervised mapping in this space. We accomplish this by proposing a novel adversarial autoencoder framework Makhzani et al. (2015), where adversarial mapping is done at the (latent) code space as opposed to the original embedding space (Figure 1). This gives the model the flexibility to automatically induce the required geometric structures in its latent code space that could potentially yield better mappings. Anders-18 recently find that the isomorphic assumption made by most existing methods does not hold in general even for two closely related languages like English and German. In their words “approaches based on this assumption have important limitations”. By mapping the latent vectors through adversarial training, our approach therefore departs from the isomorphic assumption.

In our adversarial training, not only the mapper but also the target encoder is trained to fool the discriminator. This forces the discriminator to improve its discrimination skills, which in turn pushes the mapper to generate indistinguishable translations. To guide the mapping, we include two additional constraints. Our first constraint enforces cycle consistency, so that code vectors, after being translated from one language to another and then back to their source space, remain close to the original vectors. The second constraint ensures reconstruction of the original input word embeddings from the back-translated codes. This grounding step forces the model to retain word semantics during the mapping process.

We conduct a series of experiments with six different language pairs (in both directions) comprising European, non-European, and low-resource languages from two different datasets. Our results show that our model is more robust and yields significant gains over conneau2018word for all translation tasks in all evaluation measures. Our method also gives a better initial mapping than other existing methods Artetxe et al. (2018b). We also perform an extensive ablation study to understand the contribution of the different components of our model. The study reveals that cycle consistency contributes the most, while adversarial training of the target encoder and post-cycle reconstruction also have a significant effect. We have released our source code at https://ntunlpsg.github.io/project/unsup-word-translation/

The remainder of this paper is organized as follows. After discussing related work in Section 2, we present our unsupervised word translation approach with adversarial autoencoder in Section 3. We describe our experimental setup in Section 4, and present our results with in-depth analysis in Section 5. Finally, we summarize our findings with possible future directions in Section 6.

2 Related Work

In recent years, a number of methods have been proposed to learn bilingual dictionaries from monolingual word embeddings (see Ruder et al. (2017) for a survey).

Many of these methods use an initial seed dictionary. Mikolov13 show that a linear transformation can be learned from a seed dictionary of 5,000 word pairs by minimizing the squared Euclidean distance. In their view, the key reason behind the good performance of their model is the similarity of geometric arrangements in the vector spaces of the embeddings of different languages. To translate a new source word, they map the corresponding word embedding to the target space using the learned mapping and find the nearest target word. They also found that a simple linear mapping works better than non-linear mappings with multi-layer neural networks.

XingWLL15 enforce the word vectors to be of unit length during the learning of the embeddings and modify the objective function for learning the mapping to maximize cosine similarity instead of Euclidean distance. To preserve length normalization after mapping, they enforce an orthogonality constraint on the mapper. Instead of learning a mapping from the source to the target embedding space, Faruqui14 use a technique based on Canonical Correlation Analysis (CCA) to project both source and target embeddings to a common low-dimensional space, where the correlation of the word pairs in the seed dictionary is maximized. artetxe2016emnlp show that the above methods are variants of the same core optimization objective and propose a closed-form solution for the mapper under the orthogonality constraint. SmithICLR17 find that this solution is closely related to the orthogonal Procrustes solution. In their follow-up work, artetxe2017acl obtain competitive results using a seed dictionary of only 25 word pairs. They propose a self-learning framework that performs two steps iteratively until convergence. In the first step, they use the dictionary (starting with the seed) to learn a linear mapping, which is then used in the second step to induce a new dictionary.

A more recent line of research attempts to eliminate the seed dictionary entirely and learn the mapping in a purely unsupervised way. This was first proposed by Valerio16, who initially used an adversarial network similar to conneau2018word, and found that the mapper (which is also the encoder) translates everything to a single embedding, commonly known as the mode collapse issue Goodfellow (2017). To preserve diversity in the mapping, he used a decoder to reconstruct the source embedding from the mapped embedding, extending the framework to an adversarial autoencoder. His preliminary qualitative analysis shows encouraging results, though not competitive with methods using bilingual seeds. He suspected issues with training and with the isomorphic assumption. In our work, we successfully address these issues with an improved model that also relaxes the isomorphic assumption. Our model uses two separate autoencoders, one for each language, which allows us to impose more constraints to guide the mapping. We also distinguish the role of an encoder from the role of a mapper: the encoder projects embeddings to latent code vectors, which are then translated by the mapper.

Zhang17 improved adversarial training with orthogonal parameterization and cycle consistency. To aid training, they incorporate additional techniques like noise injection, which works as a regularizer. For selecting the best model, they rely on sharp drops of the discriminator accuracy. In their follow-up work Zhang et al. (2017b), they minimize the Earth Mover's distance between the distribution of the transformed source embeddings and the distribution of the target embeddings. conneau2018word show impressive results with adversarial training and refinement with the Procrustes solution. Instead of using the adversarial loss, Xu2018 use the Sinkhorn distance and adopt cycle consistency inspired by CycleGAN Zhu et al. (2017). We also incorporate cycle consistency along with the adversarial loss. However, while all these methods learn the mapping in the original embedding space, our approach learns it in the latent code space, treating both the mapper and the target encoder as adversaries of the discriminator. In addition, we use a post-cycle reconstruction loss to guide the mapping.

A number of non-adversarial methods have also been proposed recently. Artetxe-2018-acl learn an initial dictionary by exploiting the structural similarity of the embeddings and use a robust self-learning algorithm to improve it iteratively. Hoshen-18 align the second moment of word distributions of the two languages using principal component analysis (PCA) and then refine the alignment iteratively using a variation of the Iterative Closest Point (ICP) method used in computer vision. david2018gromov cast the problem as an optimal transport problem and exploit the Gromov-Wasserstein distance which measures how similarities between pairs of words relate across languages.

3 Approach

Let $\mathcal{X} = \{x_1, \ldots, x_n\}$ and $\mathcal{Y} = \{y_1, \ldots, y_m\}$ be two sets consisting of $n$ and $m$ word embeddings of $d$ dimensions for a source and a target language, respectively. We assume that $\mathcal{X}$ and $\mathcal{Y}$ are trained independently from monolingual corpora. Our aim is to learn a mapping in an unsupervised way (i.e., no bilingual dictionary is given) such that for every $x_i \in \mathcal{X}$, the mapped vector corresponds to its translation in $\mathcal{Y}$. Our overall approach follows the same sequence of steps as conneau2018word:

  1. Induction of a seed dictionary through adversarial training.

  2. Iterative refinement of the initial mapping through the Procrustes solution.

  3. Nearest-neighbor search with Cross-domain Similarity Local Scaling (CSLS).

We propose a novel adversarial autoencoder model to learn the initial mapping for inducing a seed dictionary in step (i), and we adopt existing refinement methods for steps (ii) and (iii).

3.1 Adversarial Autoencoder for Initial Dictionary Induction

Our proposed model (Figure 1) has two autoencoders, one for each language. Each autoencoder comprises an encoder $E_x$ (resp. $E_y$) and a decoder $R_x$ (resp. $R_y$). The encoders transform an input word embedding $x$ (resp. $y$) into a latent code $z_x$ (resp. $z_y$), from which the decoders try to reconstruct the original input. We use a linear encoder and a squared-L2 reconstruction loss:

$z_x = E_x(x) = \theta_{E_x} x$ (1)

$\mathcal{L}_{\mathrm{auto}}(\theta_{E_x}, \theta_{R_x}) = \mathbb{E}_{x \sim \mathcal{X}}\big[\, \| x - R_x(E_x(x)) \|_2^2 \,\big]$ (2)

where $\theta_{E_x} \in \mathbb{R}^{c \times d}$ and $\theta_{R_x} \in \mathbb{R}^{d \times c}$ are the parameters of the encoder and the decoder for $d$-dimensional word embeddings and $c$-dimensional code vectors. (We also experimented with a non-linear encoder, but it did not work well.) The encoder, decoder, and reconstruction loss for the other autoencoder ($E_y$, $R_y$) are defined analogously.
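A minimal PyTorch sketch of one such autoencoder and its reconstruction loss is given below. Only the dimensions ($d$ = 300, $c$ = 350) come from the paper; the class and function names are our own illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class LinearAutoencoder(nn.Module):
    """Sketch of one autoencoder: a linear encoder E and a linear decoder R."""
    def __init__(self, emb_dim=300, code_dim=350):
        super().__init__()
        self.encoder = nn.Linear(emb_dim, code_dim, bias=False)   # z = theta_E x
        self.decoder = nn.Linear(code_dim, emb_dim, bias=False)

    def forward(self, x):
        z = self.encoder(x)
        return z, self.decoder(z)

def reconstruction_loss(model, x):
    """L_auto: squared L2 distance between x and its reconstruction (Eq. 2)."""
    _, x_hat = model(x)
    return ((x - x_hat) ** 2).sum(dim=1).mean()
```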

Let $q_x(z_x)$ and $q_y(z_y)$ be the encoding distributions of the two autoencoders. We use adversarial training to find a mapping between $q_x$ and $q_y$. This is in contrast with most existing methods (e.g., conneau2018word, artetxe2017acl) that directly map the distribution of the source word embeddings $\mathcal{X}$ to the distribution of the target embeddings $\mathcal{Y}$. As Anders-18 pointed out, isomorphism does not hold in general between the word embedding spaces of two languages. Mapping the latent codes gives our model more flexibility to induce the required semantic structures in its code space, which could potentially yield more accurate mappings.

Figure 1: Our proposed adversarial autoencoder framework for unsupervised word translation.

As shown in Figure 1, we include two linear mappings: $G$ projects code vectors (samples from $q_x$) into the target code space, and $F$ projects code vectors (samples from $q_y$) into the source code space. In addition, we have two language discriminators, $D_x$ and $D_y$. The discriminators are trained to discriminate between the mapped codes and the encoded codes, while the mappers and encoders are jointly trained to fool their respective discriminator. This results in a three-player game, where the discriminator tries to identify the origin of a code, and the mapper and the encoder act together to prevent it from succeeding by making the mapped vector and the encoded vector as similar as possible.

Discriminator Loss

Let $\theta_{D_x}$ and $\theta_{D_y}$ denote the parameters of the two discriminators, and let $\theta_G$ and $\theta_F$ be the mapping weight matrices. The loss for the source discriminator $D_x$ can be written as

$\mathcal{L}_{D_x}(\theta_{D_x} \mid \theta_F, \theta_{E_x}, \theta_{E_y}) = -\, \mathbb{E}_{x \sim \mathcal{X}}\big[ \log P_{D_x}(\mathrm{src}=1 \mid E_x(x)) \big] - \mathbb{E}_{y \sim \mathcal{Y}}\big[ \log P_{D_x}(\mathrm{src}=0 \mid F(E_y(y))) \big]$ (3)

where $P_{D_x}(\mathrm{src} \mid z)$ is the probability according to $D_x$ that a code $z$ is coming from the source encoder ($\mathrm{src}=1$) rather than from the target-to-source mapper $F$ ($\mathrm{src}=0$). The discriminator loss $\mathcal{L}_{D_y}$ is defined similarly for the target discriminator using the source-to-target mapper $G$ and the target encoder $E_y$.

Our discriminators have the same architecture as in conneau2018word: a feed-forward network with two hidden layers of size 2048 and Leaky-ReLU activations. We apply dropout with a rate of 0.1 on the input to the discriminators. Instead of using hard labels 1 and 0, we also apply a smoothing coefficient $s$ in the discriminator loss.
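The following sketch shows a discriminator of this shape together with a label-smoothed version of Eq. 3. The Leaky-ReLU slope and the default smoothing value are illustrative assumptions, as is the choice to detach the codes during the discriminator update.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Feed-forward discriminator as described above: two hidden layers of size
    2048 with Leaky-ReLU activations and dropout 0.1 on the input."""
    def __init__(self, code_dim=350, hidden=2048, dropout=0.1, slope=0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Dropout(dropout),
            nn.Linear(code_dim, hidden), nn.LeakyReLU(slope),
            nn.Linear(hidden, hidden), nn.LeakyReLU(slope),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, z):
        # probability that z comes from an encoder rather than from a mapper
        return self.net(z).squeeze(-1)

def discriminator_loss(disc, z_encoded, z_mapped, smoothing=0.1):
    """Label-smoothed version of Eq. 3: encoder codes are pushed toward label
    1 - s and mapped codes toward label s. Codes are detached so that only the
    discriminator receives gradients from this loss."""
    bce = nn.BCELoss()
    p_real = disc(z_encoded.detach())
    p_fake = disc(z_mapped.detach())
    return bce(p_real, torch.full_like(p_real, 1.0 - smoothing)) + \
           bce(p_fake, torch.full_like(p_fake, smoothing))
```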

Adversarial Loss

The mappers and encoders are trained jointly with the following adversarial loss to fool their respective discriminators:

$\mathcal{L}_{\mathrm{adv}}(\theta_F, \theta_{E_x} \mid \theta_{D_x}) = -\, \mathbb{E}_{y \sim \mathcal{Y}}\big[ \log P_{D_x}(\mathrm{src}=1 \mid F(E_y(y))) \big] - \mathbb{E}_{x \sim \mathcal{X}}\big[ \log P_{D_x}(\mathrm{src}=0 \mid E_x(x)) \big]$ (4)

The adversarial loss $\mathcal{L}_{\mathrm{adv}}(\theta_G, \theta_{E_y} \mid \theta_{D_y})$ for the mapper $G$ and the encoder $E_y$ is defined similarly. Note that we consider both the mapper and the target encoder as generators. This is in contrast to existing adversarial methods, which do not use any autoencoder on the target side. The mapper and the target encoder team up to fool the discriminator. This forces the discriminator to improve its skill, and vice versa for the generators, forcing them to produce indistinguishable codes through better mapping.
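A corresponding sketch of the generator-side loss of Eq. 4 follows, reusing the discriminator sketched above; the smoothing value is again an assumption.

```python
import torch
import torch.nn as nn

def adversarial_loss(disc, z_encoded, z_mapped, smoothing=0.1):
    """Sketch of Eq. 4 for one direction. The mapper tries to make mapped codes
    look like encoder codes (label close to 1), while the encoder on the target
    side of the mapping pushes its own codes toward the mapped distribution
    (label close to 0). No detach here: gradients reach both generators, and the
    optimizer decides which parameters are actually updated."""
    bce = nn.BCELoss()
    p_mapped = disc(z_mapped)
    p_encoded = disc(z_encoded)
    return bce(p_mapped, torch.full_like(p_mapped, 1.0 - smoothing)) + \
           bce(p_encoded, torch.full_like(p_encoded, smoothing))
```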

Cycle Consistency and Reconstruction

The adversarial method introduced above maps a “bag” of source embeddings to a “bag” of target embeddings, and in theory the mapper can match the target language distribution. However, matching at the bag level is often insufficient for learning the individual word-level mappings. In fact, there exist infinitely many possible mappings that can match the same target distribution. Thus, to learn better mappings, we need to enforce more constraints on our objective.

The first constraint we consider is cycle consistency, which ensures that a source code translated to the target language code space and then translated back to the original space remains unchanged, i.e., $F(G(z_x)) \approx z_x$. Formally, the cycle consistency loss in one direction is

$\mathcal{L}_{\mathrm{cyc}}(x \to y \to x) = \mathbb{E}_{x \sim \mathcal{X}}\big[\, \| E_x(x) - F(G(E_x(x))) \|_2^2 \,\big]$ (5)

The loss in the other direction, $\mathcal{L}_{\mathrm{cyc}}(y \to x \to y)$, is defined similarly. In addition to cycle consistency, we include another constraint to guide the mapping further. In particular, we ask the decoder of the respective autoencoder to reconstruct the original input from the back-translated code. We compute this post-cycle reconstruction loss for the source autoencoder as follows:

$\mathcal{L}_{\mathrm{rec}}(x \to y \to x) = \mathbb{E}_{x \sim \mathcal{X}}\big[\, \| x - R_x(F(G(E_x(x)))) \|_2^2 \,\big]$ (6)

The reconstruction loss at the target autoencoder is defined similarly. Apart from improving the mapping, both cycle consistency and reconstruction lead to more stable training in our experiments. Specifically, they help our training converge and avoid the mode collapse issue Goodfellow (2017). Since the model now has to translate the mapped code back to the source code and reconstruct the original word embedding, the generators cannot get away with mapping all source codes to a single target code.
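Both regularizers are straightforward to express. The sketch below assumes enc_x and dec_x are the source encoder and decoder, and map_xy, map_yx stand for the code-space mappers $G$ and $F$; all of these names are hypothetical.

```python
import torch

def cycle_consistency_loss(enc_x, map_xy, map_yx, x):
    """Sketch of Eq. 5 for the x -> y -> x direction: a source code mapped to
    the target code space and back should stay close to the original code."""
    z_x = enc_x(x)
    z_x_cycled = map_yx(map_xy(z_x))
    return ((z_x - z_x_cycled) ** 2).sum(dim=1).mean()

def post_cycle_reconstruction_loss(enc_x, dec_x, map_xy, map_yx, x):
    """Sketch of Eq. 6: the source decoder must rebuild the original word
    embedding from the back-translated code, grounding the mapping in the input."""
    z_x_cycled = map_yx(map_xy(enc_x(x)))
    x_hat = dec_x(z_x_cycled)
    return ((x - x_hat) ** 2).sum(dim=1).mean()
```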

Total Loss

The total loss for mapping a batch from the source to the target language is

$\mathcal{L}_{x \to y} = \mathcal{L}_{\mathrm{adv}}(\theta_G, \theta_{E_y} \mid \theta_{D_y}) + \lambda_1 \mathcal{L}_{\mathrm{cyc}}(x \to y \to x) + \lambda_2 \mathcal{L}_{\mathrm{rec}}(x \to y \to x)$ (7)

where $\lambda_1$ and $\lambda_2$ control the relative importance of the three loss components. The total loss $\mathcal{L}_{y \to x}$ for mapping in the opposite direction is defined similarly. The complete objective of our model is:

$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{x \to y} + \mathcal{L}_{y \to x}$ (8)

3.2 Training and Dictionary Construction

We present the training procedure of our model and the overall word translation process in Algorithm 1. We first pre-train the autoencoders separately on the monolingual embeddings (Step 1). This pre-training is required to induce word semantics (and relations) in the latent code space.

We start the adversarial training (Step 2) by updating the discriminators five times, each time with a random batch. Then we update the generators (the mapper and the target encoder) on the adversarial loss. The mappers then go through two more updates, one for cycle consistency and another for post-cycle reconstruction. The autoencoders (encoder-decoder) at this stage are updated only on the post-cycle reconstruction loss. We also apply the orthogonalization update of conneau2018word to the mappers, $W \leftarrow (1+\beta)W - \beta (W W^{\top}) W$.

Algorithm 1: Training procedure of our model and the overall word translation process.
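Since the pseudocode of Algorithm 1 is not reproduced here, the following condensed sketch illustrates one source-to-target training iteration under the update schedule described above. It reuses the loss helpers sketched in Section 3.1, assumes the mappers are bias-free nn.Linear modules, and the default value of beta and the optimizer grouping are illustrative assumptions rather than the authors' settings ($\lambda_1$ = 5 and $\lambda_2$ = 1 follow Section 4).

```python
import torch

def orthogonalize(mapper, beta=0.001):
    """Orthogonalization update of conneau2018word: W <- (1+beta)*W - beta*(W W^T)W.
    `mapper` is assumed to be a square, bias-free nn.Linear; beta is illustrative."""
    with torch.no_grad():
        W = mapper.weight
        W.copy_((1 + beta) * W - beta * (W @ W.t()) @ W)

def training_step(batch_x, batch_y, enc_x, dec_x, enc_y, G, F, D_y,
                  opt_dis, opt_adv, opt_map, opt_ae,
                  n_dis=5, lambda1=5.0, lambda2=1.0):
    """One source-to-target iteration following the schedule described above.
    Reuses discriminator_loss, adversarial_loss, cycle_consistency_loss and
    post_cycle_reconstruction_loss from the earlier sketches; each optimizer is
    assumed to hold the parameter group named in the comments."""
    # (i) several discriminator updates (the paper draws a fresh random batch each time)
    for _ in range(n_dis):
        d_loss = discriminator_loss(D_y, enc_y(batch_y), G(enc_x(batch_x)))
        opt_dis.zero_grad(); d_loss.backward(); opt_dis.step()

    # (ii) generator update: mapper G and target encoder enc_y fool D_y (Eq. 4)
    adv = adversarial_loss(D_y, enc_y(batch_y), G(enc_x(batch_x)))
    opt_adv.zero_grad(); adv.backward(); opt_adv.step()

    # (iii) mapper update on the cycle-consistency loss (Eq. 5)
    cyc = lambda1 * cycle_consistency_loss(enc_x, G, F, batch_x)
    opt_map.zero_grad(); cyc.backward(); opt_map.step()

    # (iv) mapper + autoencoder update on the post-cycle reconstruction loss (Eq. 6)
    rec = lambda2 * post_cycle_reconstruction_loss(enc_x, dec_x, G, F, batch_x)
    opt_ae.zero_grad(); rec.backward(); opt_ae.step()

    # (v) keep the code-space mappers close to orthogonal
    orthogonalize(G)
    orthogonalize(F)
```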

Our training setting is similar to conneau2018word, and we apply the same pre- and post-processing steps. We use stochastic gradient descent (SGD) with a batch size of 32, a learning rate of 0.1, and a decay of 0.98.

For selecting the best model, we use the unsupervised validation criterion proposed by conneau2018word, which correlates highly with the mapping quality. In this criterion, the most frequent source words along with their nearest neighbors in the target space are considered, and the average cosine similarity between these pseudo-translation pairs is used as the validation metric.
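A sketch of this criterion is shown below; the frequency cutoff k and the use of plain nearest neighbors instead of CSLS are simplifying assumptions.

```python
import torch

def unsupervised_validation(src_emb, tgt_emb, map_to_target, k=10000):
    """Sketch of the model-selection criterion: translate the k most frequent
    source words and average the cosine similarity to their nearest target
    neighbours. k and the plain nearest-neighbour search are assumptions."""
    src = src_emb[:k]                         # rows assumed sorted by frequency
    mapped = map_to_target(src)               # map source vectors into the target space
    mapped = mapped / mapped.norm(dim=1, keepdim=True)
    tgt = tgt_emb / tgt_emb.norm(dim=1, keepdim=True)
    sims = mapped @ tgt.t()                   # cosine similarity matrix
    return sims.max(dim=1).values.mean().item()
```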

The initial bilingual dictionary induced by adversarial training (or any other unsupervised method) is generally of lower quality than what could be achieved by a supervised method. conneau2018word and Artetxe-2018-acl propose fine-tuning methods to refine the initial mappings. Similar to conneau2018word, we fine-tune our initial mappings ($G$ and $F$) by iteratively solving the Procrustes problem and applying a dictionary induction step. Given the approximate alignment of words from the previous step, this method uses the singular value decomposition (SVD) of $Y X^{\top}$, where $X$ and $Y$ are the matrices of the aligned source and target vectors in the current dictionary, to find the optimal source-to-target mapping (and similarly the SVD of $X Y^{\top}$ for the target-to-source mapping). For generating the synthetic dictionary in each iteration, we only consider the translation pairs that are mutual nearest neighbors. In our fine-tuning, we run five iterations of this process. For finding the nearest neighbors, we use Cross-domain Similarity Local Scaling (CSLS), which works better in mitigating the hubness problem Conneau et al. (2018).
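The two ingredients of this refinement step, the closed-form Procrustes solution and CSLS scoring, can be sketched as follows; the neighborhood size k = 10 is an assumption, and the embeddings passed to csls_scores are assumed to be unit-normalized.

```python
import torch

def procrustes(src_dict_emb, tgt_dict_emb):
    """Closed-form Procrustes step used in the refinement: for aligned dictionary
    matrices X (source) and Y (target) stored as rows, the optimal orthogonal map
    is W = U Vh with U S Vh = SVD(Y^T X). Mapped vectors are then src @ W.T."""
    u, _, vh = torch.linalg.svd(tgt_dict_emb.t() @ src_dict_emb, full_matrices=False)
    return u @ vh

def csls_scores(mapped_src, tgt_emb, k=10):
    """Cross-domain Similarity Local Scaling: discount target words that are
    near neighbours of many mapped sources (hubs). Inputs are assumed to be
    unit-normalized so that dot products are cosine similarities."""
    sims = mapped_src @ tgt_emb.t()
    r_src = sims.topk(k, dim=1).values.mean(dim=1, keepdim=True)   # each source's avg sim to its k NNs
    r_tgt = sims.topk(k, dim=0).values.mean(dim=0, keepdim=True)   # each target's avg sim to its k NNs
    return 2 * sims - r_src - r_tgt
```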

4 Experimental Settings

Following standard practice, we evaluate our model on the word translation (a.k.a. bilingual lexicon induction) task, which measures the accuracy of the predicted dictionary against a gold-standard dictionary.

4.1 Datasets

We evaluate our model on two different datasets. The first one is from conneau2018word; it consists of 300-dimensional fastText monolingual embeddings Bojanowski et al. (2017) trained on Wikipedia and gold dictionaries for 110 language pairs (https://github.com/facebookresearch/MUSE). To show the generality of the different methods, we consider European, non-European, and low-resource languages. In particular, we evaluate English (En) from/to Spanish (Es), German (De), Italian (It), Arabic (Ar), Malay (Ms), and Hebrew (He).

We also evaluate on the more challenging dataset of Dinu-iclr-workshop15 and its subsequent extension by artetxe2018aaai, which we refer to as the Dinu-Artetxe dataset. From this dataset, we experiment on English from/to Italian and Spanish. The English and Italian embeddings were trained on the WacKy corpora using CBOW Mikolov et al. (2013b), while the Spanish embeddings were trained on WMT News Crawl. The CBOW vectors are also 300-dimensional.

4.2 Baselines

We compare our method with the unsupervised models of conneau2018word, Artetxe-2018-acl, david2018gromov, Xu2018, and Hoshen-18.

To evaluate how our unsupervised method compares with methods that rely on a bilingual seed dictionary, we follow conneau2018word and compute a supervised baseline that uses the Procrustes solution directly on a seed dictionary of 5,000 pairs to learn the mapping function and then uses CSLS for the nearest-neighbor search. We also compare with the supervised approaches of artetxe2017acl and artetxe2018aaai, which to our knowledge are the state-of-the-art supervised systems. For some of the baselines, we report results from their papers, while for the rest we report results obtained by running the publicly available code on our machine.

For training our model on European languages, the weight for cycle consistency ($\lambda_1$) in Eq. 7 was always set to 5, and the weight for post-cycle reconstruction ($\lambda_2$) was set to 1. For non-European languages, we use different values of $\lambda_1$ and $\lambda_2$ for different language pairs. (We did not tune these values much, rather used our initial observations; tuning them might yield even better results.) The dimension of the code vectors in our model was set to 350.
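For reference, the training-related values reported in this paper (Sections 3.2 and 4) can be collected in a single configuration sketch; the dictionary keys are our own naming, not the authors' configuration format.

```python
# Training-related values reported in the paper (Sections 3.2 and 4), collected
# here as a plain configuration sketch; the key names are illustrative assumptions.
CONFIG = {
    "embedding_dim": 300,          # fastText / CBOW input embeddings
    "code_dim": 350,               # latent code vectors
    "lambda_cycle": 5.0,           # lambda_1 in Eq. 7 (European languages)
    "lambda_reconstruction": 1.0,  # lambda_2 in Eq. 7 (European languages)
    "optimizer": "SGD",
    "batch_size": 32,
    "learning_rate": 0.1,
    "lr_decay": 0.98,
    "discriminator_hidden": 2048,
    "discriminator_dropout": 0.1,
    "refinement_iterations": 5,    # Procrustes refinement rounds (Section 3.2)
}
```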

5 Results

We present our results on European languages on the datasets of conneau2018word and Dinu-iclr-workshop15 in Tables 1 and 3, while the results on non-European languages are shown in Table 2. Through experiments, our goal is to assess:

  1. Does the unsupervised mapping method based on our proposed adversarial autoencoder model improve over the best existing adversarial method of conneau2018word in terms of mapping accuracy and convergence (Section 5.1)?

  2. How does our unsupervised mapping method compare with other unsupervised and supervised approaches (Section 5.2)?

  3. Which components of our adversarial autoencoder model contribute to the improvements (Section 5.3)?

5.1 Comparison with conneau2018word

Since our approach follows the same steps as conneau2018word, we first compare our proposed model with theirs on European (Table 1) as well as non-European and low-resource languages (Table 2) using their dataset. In the tables, we present the numbers they reported in their paper (conneau2018word (paper)) as well as the results we obtain by running their code on our machine (conneau2018word (code)). For a fair comparison with respect to the quality of the learned mappings (or the induced seed dictionary), here we only consider the results of our approach that use the refinement procedure of conneau2018word.

En→Es Es→En En→De De→En En→It It→En
Supervised (Procrustes-CSLS) 82.4 83.9 75.3 72.7 78.1 78.1
Unsupervised Baselines
Artetxe-2018-acl 82.2 84.4 74.9 74.1 78.8 79.5
alvarezmelis2018gromov 81.7 80.4 71.9 72.8 78.9 75.2
Ruochen-emnlp18 79.5 77.8 69.3 67.0 73.5 72.6
Hoshen-18 82.1 84.1 74.7 73.0 77.9 77.5
conneau2018word (paper) 81.7 83.3 74.0 72.2 - -
conneau2018word (code) 82.3 83.7 74.2 72.6 78.3 78.1
Our Unsupervised Approach
Adversarial autoencoder + conneau2018word Refinement 82.6 84.4 75.5 73.9 78.8 78.5
Adversarial autoencoder + Artetxe-2018-acl Refinement 82.7 84.7 75.4 74.3 79.0 79.6
Table 1: Word translation accuracy (P@1) on European languages on the dataset of conneau2018word using fastText embeddings (trained on Wikipedia). '-' indicates the authors did not report the number.

In Table 1, we see that our Adversarial autoencoder + conneau2018word Refinement outperforms conneau2018word in all six translation tasks involving European language pairs, yielding gains in the range of 0.3-1.3%. Our method is also superior to theirs for the non-European and low-resource language pairs in Table 2, where it yields larger gains, ranging from 1.8% to 4.3%. Note specifically that Malay (Ms) is a low-resource language, and the pretrained fastText embeddings contain vectors for only 155K Malay words. We found their model to be very fragile for En from/to Ms; it does not converge at all for Ms→En. We ran their code 10 times for Ms→En and it failed every time. In comparison, our method is more robust and converged in most of our runs.

En→Ar Ar→En En→Ms Ms→En En→He He→En
Supervised Baselines
artetxe2017acl 24.8 43.3 38.8 41.6 32.7 51.1
artetxe2018aaai 36.2 52.9 51.2 47.7 43.6 56.8
Supervised (Procrustes-CSLS) 34.5 49.7 47.3 46.6 39.2 54.1
Unsupervised Baselines
Hoshen-18 34.4 49.3 ** ** 36.5 52.3
Artetxe-2018-acl 36.1 48.7 54.0 55.4 43.8 57.5
conneau2018word (code) 29.3 47.6 46.2 ** 36.8 53.1
Our Unsupervised Approach
Adversarial autoencoder + conneau2018word Refinement 33.6 49.7 49.5 44.3 40.0 54.9
Adversarial autoencoder + Artetxe-2018-acl Refinement 36.3 52.6 54.1 51.7 44.0 57.1
Table 2: Word translation accuracy (P@1) on non-European and low-resource languages on the dataset of conneau2018word using fastText embeddings. ** indicates the model failed to converge.

If we compare our method with that of conneau2018word on the more challenging Dinu-Artetxe dataset in Table 3, we see that here too our method performs better in all four translation tasks involving European language pairs. On this dataset, our method also shows more robustness than theirs. For example, their method had difficulties converging for En from/to Es: for En→Es, it converged only 2 times out of 10 attempts, while for Es→En it did not converge a single time in 10 attempts. In comparison, our method was more robust, converging 4 times out of 10 attempts.

In Section 5.3, we compare our model with conneau2018word more rigorously by evaluating them with and without fine-tuning and measuring their performance on P@1, P@5, and P@10.

5.2 Comparison with Other Methods

In this section, we compare our model with other state-of-the-art methods that do not follow the same procedure as ours and conneau2018word's. For example, Artetxe-2018-acl do the initial mapping in the similarity space, then apply a different self-learning method to fine-tune the embeddings, and perform a final refinement with symmetric re-weighting. Instead of mapping from source to target, they map both source and target embeddings to a common space.

En→It It→En En→Es Es→En
Supervised Baselines
artetxe2017acl 39.7 33.8 32.4 27.2
artetxe2018aaai 45.3 38.5 37.2 29.6
Supervised (Procrustes-CSLS) 44.9 38.5 33.8 29.3
Unsupervised Baselines
Artetxe-2018-acl 47.9 42.3 37.5 31.2
conneau2018word (paper) 45.1 38.3 - -
conneau2018word (code) 44.9 38.7 34.7 **
Our Unsupervised Approach
Adversarial autoencoder + conneau2018word Refinement 45.3 39.4 35.2 29.9
Adversarial autoencoder + Artetxe-2018-acl Refinement 47.6 42.5 37.4 31.9
Table 3: Word translation accuracy (P@1) on the English-Italian and English-Spanish language pairs of the Dinu-Artetxe dataset Dinu et al. (2015); Artetxe et al. (2017). All methods use CBOW embeddings. ** indicates the model failed to converge; '-' indicates the authors did not report the number.

Let us first consider the results for the European language pairs on the dataset of conneau2018word in Table 1. Our Adversarial autoencoder + conneau2018word Refinement performs better than most of the other methods on this dataset, achieving the highest accuracy in 4 out of 6 translation tasks. For De→En, our result is very close to the best system of Artetxe-2018-acl, with only a 0.2% difference.

On the dataset of Dinu-iclr-workshop15 and artetxe2017acl in Table 3, our Adversarial autoencoder + conneau2018word Refinement performs better than the other methods except Artetxe-2018-acl; on average, our method lags behind theirs by about 2%. However, as mentioned, they follow different refinement and mapping methods. For the non-European and low-resource language pairs in Table 2, our Adversarial autoencoder + conneau2018word Refinement performs better than the others in one translation task, while the model of Artetxe-2018-acl performs better in the rest. One important thing to note here is that the other unsupervised models (apart from ours and Artetxe-2018-acl) fail to converge on one or more language pairs.

We notice that the method of Artetxe-2018-acl gives better results than the other baselines, and in some translation tasks it achieves the highest accuracy. To understand whether the improvements of their method come from a better initial mapping or from better post-processing, we conducted two additional experiments. In the first experiment, we use their method to induce the initial seed dictionary and then apply the iterative Procrustes solution (the same refinement procedure as conneau2018word). Table 4 shows the results. Surprisingly, on both datasets their initial mappings fail to produce any reasonable results. We therefore suspect that the main gain in Artetxe et al. (2018b) comes from their fine-tuning method, which they call robust self-learning. In the second experiment, we use the initial dictionary induced by our adversarial training and then apply their refinement procedure. Here, for most of the translation tasks, we achieve better results; see the model Adversarial autoencoder + Artetxe-2018-acl Refinement in Tables 1-3. These two experiments demonstrate that the quality of the initial dictionary induced by our model is far better than that of Artetxe-2018-acl.

En→It It→En En→Es Es→En
Dinu-Artetxe Dataset ** ** ** **
Conneau Dataset 1.2 1.6 4.7 5.1
Table 4: conneau2018word refinement applied to the initial mappings of Artetxe-2018-acl. ** indicates the model failed to converge.
En→Es Es→En En→De De→En En→It It→En
P@1 P@5 P@10 P@1 P@5 P@10 P@1 P@5 P@10 P@1 P@5 P@10 P@1 P@5 P@10 P@1 P@5 P@10
Without Fine-Tuning
Conneau-18 65.3 73.8 80.6 66.7 78.3 80.8 61.5 70.1 78.2 60.3 70.2 77.0 64.8 75.3 79.4 63.8 77.1 81.8
Our (full) 71.8 81.1 85.7 72.7 81.5 83.8 64.9 74.4 81.8 63.1 71.3 79.8 68.2 78.9 83.7 67.5 77.6 82.1
  - Enc. adv 70.5 79.7 83.5 71.3 80.4 83.3 63.7 73.5 79.3 62.6 70.5 79.0 67.6 77.3 82.7 66.2 78.3 82.5
  - - Recon 70.1 78.9 83.4 70.8 81.1 83.4 63.1 73.8 80.5 62.2 71.7 78.7 66.9 79.7 82.1 64.8 78.6 82.1
  - - - Cycle 66.8 76.5 82.1 67.2 79.9 82.7 61.4 69.7 77.8 60.1 69.8 76.5 65.3 75.1 78.9 64.4 77.6 81.7
With Fine-Tuning
Conneau-18 82.3 90.8 93.2 83.7 91.9 93.5 74.2 89.0 91.5 72.6 85.7 88.8 78.3 88.4 91.1 78.1 88.2 90.6
Our (full) 82.6 91.8 93.5 84.4 92.3 94.3 75.5 90.1 92.9 73.9 86.5 89.3 78.8 89.2 91.9 78.5 88.9 91.1
  - Enc. adv 82.5 91.6 93.5 84.3 92.1 94.3 75.4 89.7 92.7 73.5 86.3 89.2 78.4 89.0 91.8 78.1 88.7 91.0
  - - Recon 82.5 91.6 93.4 84.1 92.2 94.3 75.3 89.4 92.6 73.2 85.9 89.0 78.2 89.1 91.9 78.2 88.8 91.2
  - - - Cycle 82.4 91.0 93.1 83.6 92.2 94.0 74.3 89.7 92.6 72.7 86.1 89.1 77.8 89.2 91.8 77.4 88.3 90.8
Table 5: Ablation study of our adversarial autoencoder model on the dataset of conneau2018word.

5.3 Model Dissection

We further analyze our model by dissecting it and measuring the contribution of each novel component proposed in this work. We achieve this by incrementally removing components from the model and evaluating the resulting variants on different translation tasks. To better understand the contribution of each component, we evaluate each model by measuring its P@1, P@5, and P@10, both with and without fine-tuning. In the without fine-tuning setting, the models apply the CSLS nearest-neighbor search directly on the mappings learned from the adversarial training, i.e., no Procrustes-solution-based refinement is done after the adversarial training. This setup allows us to compare our model directly with the adversarial model of conneau2018word, putting the effect of fine-tuning aside.

Table 5 presents the ablation results for En-Es, En-De, and En-It in both directions. The first row (Conneau-18) presents the results of conneau2018word, which uses adversarial training to map the word embeddings. The next row shows the results of our full model. The subsequent rows incrementally detach one component from our model. For example, - Enc. adv denotes the variant of our model where the target encoder is not trained on the adversarial loss (the encoder term in Eq. 4); - - Recon excludes the post-cycle reconstruction loss from - Enc. adv, and - - - Cycle excludes the cycle consistency from - - Recon. Thus, - - - Cycle is a variant of our model that uses only the adversarial loss to learn the mapping. However, it is important to note that, in contrast to conneau2018word, our mapping is performed in the code space.

Comparing our full model with the model of conneau2018word in the without fine-tuning setting, we notice large improvements in all measures across all datasets: 5.1-7.3% in En→Es, 3-6% in Es→En, 3.4-4.3% in En→De, 1-3% in De→En, 3.4-4.3% in En→It, and 0.3-3.7% in It→En. These improvements demonstrate that our model finds a better mapping than conneau2018word. Among the three components, cycle consistency is the most influential across all languages. Training the target encoder adversarially also gives a significant boost, while the reconstruction has less impact. If we compare the results of - - - Cycle with Conneau-18, we see sizeable gains for En-Es in both directions. This shows the benefit of mapping at the code level.

Now let us turn our attention to the results with fine-tuning. Here also we see gains across all datasets for our model, although the gains are not as pronounced as before (about 1% on average). However, this is not surprising, as it has been shown that iterative fine-tuning with the Procrustes solution is a robust method that can recover many errors made in the initial mapping Conneau et al. (2018). Given a good enough initial mapping, the measures converge to nearly the same point even when the initial differences were comparatively more substantial; for example, notice that the scores are very similar for the P@5 and P@10 measures after fine-tuning.

6 Conclusions

We have proposed an adversarial autoencoder framework to learn the cross-lingual mapping of monolingual word embeddings of two languages in a completely unsupervised way. In contrast to the existing methods that directly map word embeddings, our method first learns to transform the embeddings into latent code vectors by pretraining an autoencoder. We apply adversarial training to map the distributions of the source and target code vectors. In our adversarial training, both the mapper and the target encoder are treated as generators that act jointly to fool the discriminator. To guide the mapping further, we include constraints for cycle consistency and post-cycle reconstruction.

Through extensive experimentation on six different language pairs comprising European, non-European, and low-resource languages from two different data sources, we demonstrate that our method outperforms the method of conneau2018word for all translation tasks in all measures (P@{1,5,10}) across all settings (with and without fine-tuning). Comparison with other existing methods also shows that our method learns a better mapping (not considering the fine-tuning). With an ablation study, we further demonstrated that cycle consistency is the most important component, followed by the adversarial training of the target encoder and the post-cycle reconstruction. In future work, we plan to incorporate knowledge from the similarity space into our adversarial framework.

Acknowledgments

The authors would like to thank the funding support from MOE Tier-1 (Grant M4011897.020).

References