Word translation can be challenging because often no single-word translation exists that encapsulates the meaning of the original word ahluwalia2019pristine. For example, the Brazilian Portuguese word cafuné means "to run your fingers through the hair of someone you love". Such a rich physical expression of human affection cannot be translated by a single English word, but it can be approximated by a combination of words that represent the act.

In machine translation, the vocabulary of a language is often represented by a word embedding, which is generally specific to that language ruder2019survey. A word embedding provides a metric space that represents semantics, where words with similar meanings are closer to each other. Consequently, word translation can be achieved by finding a mapping function between the word embeddings of two languages jansen2017word. This can be done either by learning a transformation mikolov2013exploiting, alvarez2018gromov, gaddy2016ten or by learning two embeddings that are aligned garg2019jointly, i.e. are part of the same space. This has been done in supervised mikolov2013exploiting, semi-supervised garg2019jointly, gaddy2016ten, and unsupervised alvarez2018gromov manners.

Although several methods for word translation have been proposed, most require large numbers of provided word pairs (supervised) to achieve good performance mikolov2013exploiting, gaddy2016ten, or (for unsupervised methods) may lack sufficient references to achieve a good-quality alignment alvarez2018gromov. In this work, we present a new method that learns two aligned embeddings, thus providing accurate and semantically meaningful word translation without the need for known (supervised) word pairs. Our method, termed WAM, is based on a localized Maximum Mean Discrepancy (MMD) loss that is added to a sentence translation task (e.g. a Transformer vaswani2017attention) and aims to align the embeddings by minimizing their distributional distance.
We show that our method outperforms both supervised methods (with provided word translation pairs) and unsupervised methods.
2 Related work
The problem of word alignment has been tackled with a variety of methods. Approaches for word alignment can generally be classified into methods based on statistical machine translation or neural machine translation. Here we briefly discuss these approaches.
2.1 Statistical machine translation
The geometry of the word embedding for a language is highly dependent on the distributed representation of words. Therefore, the geometries of word embeddings for two distinct languages should share a relative degree of similarity. Methods such as Gromov-Wasserstein Alignment alvarez2018gromov have exploited this assumption to achieve unsupervised word embedding alignment. alvarez2018gromov propose to align two embeddings via optimal transport based on the Gromov-Wasserstein distances between pairs of words, i.e. distances computed within each embedding rather than directly between words across embeddings. The solution to the optimal transport of Gromov-Wasserstein distances is an optimal coupling matrix between the embeddings which yields a minimal discrepancy cost. Despite the potential benefits of this approach, it has several major drawbacks, such as being computationally expensive for large vocabulary sizes and having generally poor performance. The poor performance (as we show in section 5) may stem from the fact that the optimal coupling, which aligns the distributions as a whole, may not correspond to the correct word alignment.
2.2 Neural Machine Translation
In machine translation, neural approaches have shown superior accuracy over purely statistical methods. In particular, architectures such as the Transformer vaswani2017attention, which make use of an attention mechanism, have shown far better translation performance due to their capability of attending to the context of words rather than translating literally vaswani2017attention. However, the probability distribution given by the attention mechanism does not necessarily allow for inference on word alignment between vocabularies koehn2017six, thus requiring the use of statistical approaches for word alignment alvarez2018gromov. To address this problem, garg2019jointly proposed to train a Transformer in a multi-task framework. Their multi-task loss function combines a negative log-likelihood loss of the Transformer (relative to the token prediction based on past tokens) with a conditional cross-entropy loss defined by the Kullback-Leibler (KL) divergence. In their approach, they used the attention probabilities from the penultimate attention-head layer as labels for their supervised algorithm, thus dispensing with the need for an annotated word alignment. The divergence between the attention probabilities of one arbitrarily chosen alignment head and the labeled alignment distribution is minimized as a KL-divergence optimization problem garg2019jointly.
3 Proposed method
Our method, termed Word Alignment through Maximum Mean Discrepancy (WAM), achieves word translation by using the Transformer model vaswani2017attention in combination with a localized Maximum Mean Discrepancy (MMD) loss. Briefly, while the embedding for each language is learned during the Transformer training for a sentence translation task, the MMD loss applies local constraints on paired sentences in order to learn a pair of embeddings that are locally, and as a result globally, aligned. The end result is an accurate word alignment between languages while maintaining sentence translation performance. Here we describe the details of our method.
3.1 Word and sentence alignment
The alignment between the words of two languages can be thought of as a general alignment problem. In this formulation, the assumption is that the same sentence expressed in two different languages is still linked by semantics and therefore contains a significant shared underlying structure wang2009general. Thus, using the notation from ruder2019survey, the word embedding matrix learned for a language with vocabulary $V$ and embedding dimension $d$ can be described as $X \in \mathbb{R}^{|V| \times d}$. Consequently, the $i$-th word out of $n_j$ words in the $j$-th sentence is defined as $x_i^{(j)}$. The alignment task consists of minimizing the distance between the words of a source sentence and a target sentence based on their meaning.
3.2 The Transformer

The Transformer is a model architecture based purely on an attention mechanism to compute the contextual relationship between source and target languages vaswani2017attention. This model has gained significant popularity due to its performance on neural machine translation. The architecture of the Transformer is composed of a stack of encoders and decoders. Each encoder is composed of a self-attention layer and a feed-forward neural network. The self-attention layer computes a weighted sum over all the embedded input words in the source sentence, thus providing information about the context. The output of the self-attention layer is fed to the feed-forward network. Outputs from the encoder are sent to the decoder, which also contains a self-attention layer and a feed-forward network, but with an encoder-decoder attention layer in between them. This attention layer receives the encoder outputs and is responsible for learning how to link word representations from the source language to representations in the target language.
Each word is represented in an embedding space of size $d_{\text{model}}$. This embedding consists of learning a weight matrix $W_E$, which maps each word to a vector $x \in \mathbb{R}^{d_{\text{model}}}$. In the self-attention layer, the input is used to compute the importance of other words relative to its own predicted output (query $Q$), its importance on the output of other words (key $K$), and the output as a weighted sum of all words (value $V$). These are learned linear transformations of the input $X$, wherein $Q = XW^Q$, $K = XW^K$, and $V = XW^V$. The attention for the $i$-th token relative to the other tokens is given by

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
This operation is computed in parallel by 8 attention heads. Their individual outputs are linearly combined to generate a single attention value relative to the $i$-th token.
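As a minimal sketch of this computation (using NumPy, with toy dimensions and randomly initialized projection matrices rather than the model's trained parameters), single-head scaled dot-product self-attention can be written as:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(X, W_q, W_k, W_v):
    """Single-head self-attention over a sentence X of shape (n, d_model)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v   # learned linear transformations
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # (n, n) token-to-token scores
    weights = softmax(scores, axis=-1)    # each row is a distribution over tokens
    return weights @ V                    # weighted sum of value vectors

# toy example: 4 tokens, d_model = 8, d_k = d_v = 4
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))
out = scaled_dot_product_attention(X, W_q, W_k, W_v)
print(out.shape)  # (4, 4)
```

In the full model, eight such heads run in parallel and their outputs are linearly combined.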
The loss function used for the Transformer is based on a KL-divergence loss of the smoothed labels with smoothing value $\epsilon_{ls}$, as done by vaswani2017attention. This replaces the conventional one-hot distribution with a distribution that has relative confidence $1 - \epsilon_{ls}$ on the correct label and smoothed mass over the rest of the vocabulary szegedy2015rethinking.
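The label-smoothing step can be sketched as follows (a toy illustration: the vocabulary size and the smoothing value eps=0.1 are assumptions for the example, and the smoothed mass is spread uniformly over the remaining tokens):

```python
import numpy as np

def smoothed_targets(true_idx, vocab_size, eps=0.1):
    """Replace a one-hot target with confidence 1 - eps on the true token
    and eps spread uniformly over the rest of the vocabulary."""
    q = np.full(vocab_size, eps / (vocab_size - 1))
    q[true_idx] = 1.0 - eps
    return q

def kl_divergence(q, p):
    """KL(q || p) between the smoothed target q and a model prediction p."""
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

q = smoothed_targets(true_idx=2, vocab_size=5, eps=0.1)
p = np.array([0.05, 0.05, 0.8, 0.05, 0.05])  # hypothetical model prediction
loss = kl_divergence(q, p)                   # small positive divergence
```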
3.3 Global Word Alignment via Local MMD
Previously proposed methods aim to perform embedding alignment by assuming that the underlying structure of the embedding is conserved across languages alvarez2018gromov, mikolov2013exploiting. Although this may be true locally, there is no guarantee that such an assumption holds for the entire embedding. On the other hand, a sentence represented in two different languages should convey the same meaning and, consequently, similar word usage within each language. Based on this assumption, we propose to achieve word alignment by computing the semantic similarity between pairs of sentences via a sentence-based distributional distance.

During the training process, the Transformer learns to embed the words of a language based on their context, thus generating an embedding for each language. The semantic similarity between words can therefore be computed in terms of their distances in the embedding. Since we cannot make inferences directly on word distances (as we do not make use of known word pairs), we compute the distance between the distributions of words in the sentences, using Maximum Mean Discrepancy (MMD) as a distance measure gretton2012kernel. We thus compute a localized MMD, as it involves the distribution of words within a sentence (as opposed to the distribution of words within the whole language vocabulary). This results in the alignment of the distributions of a pair of sentences between languages. We do not compute MMD globally, i.e. on the whole vocabulary, as this may give a distributional alignment that lacks the correct word alignment. However, since we minimize MMD on many sentence pairs, the result is an embedding that is also globally (correctly) aligned.
The MMD is defined in terms of particular function spaces that reveal a difference in distributions. Unlike a distributional distance such as the KL-divergence, the MMD can be formulated directly on the empirical expectation of the samples and dispenses with density estimation. This grants a dimension-independent property and preserves information about the original distributions. The empirical estimate of the MMD can be written as follows:

$$\mathrm{MMD}\big(S^{(j)}, T^{(j)}\big) = \left\| \frac{1}{n_j} \sum_{i=1}^{n_j} \phi\big(s_i^{(j)}\big) - \frac{1}{m_j} \sum_{i=1}^{m_j} \phi\big(t_i^{(j)}\big) \right\|_{\mathcal{H}}$$

where $s_i^{(j)}$ and $t_i^{(j)}$ denote the $i$-th word of the $j$-th sentence for the source ($s$) and target ($t$) languages, respectively, and $n_j$ and $m_j$ are the corresponding sentence lengths. The operation is done in a Hilbert space $\mathcal{H}$ with feature map $\phi$.
We rewrite the squared MMD in terms of a kernel $k(x, y) = \langle \phi(x), \phi(y) \rangle_{\mathcal{H}}$ as follows:

$$\mathrm{MMD}^2\big(S^{(j)}, T^{(j)}\big) = \frac{1}{n_j^2} \sum_{i, i'} k\big(s_i^{(j)}, s_{i'}^{(j)}\big) - \frac{2}{n_j m_j} \sum_{i, i'} k\big(s_i^{(j)}, t_{i'}^{(j)}\big) + \frac{1}{m_j^2} \sum_{i, i'} k\big(t_i^{(j)}, t_{i'}^{(j)}\big)$$
For our application, we choose a Gaussian RBF kernel. To overcome the challenge of finding a bandwidth $\sigma$ that is appropriate for the entirety of the dataset, we use a multi-scale kernel approach, defined as a sum of RBF kernels over a set of bandwidths $\Sigma$:

$$k(x, y) = \sum_{\sigma \in \Sigma} \exp\left(-\frac{\|x - y\|^2}{2\sigma^2}\right)$$
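A sketch of the localized, multi-scale MMD between the embedded words of a sentence pair might look like the following (NumPy; the bandwidth set and the toy embeddings are illustrative assumptions, not the values used in our experiments):

```python
import numpy as np

def multiscale_rbf(A, B, sigmas=(1.0, 2.0, 4.0, 8.0)):
    """Sum of Gaussian RBF kernels at several bandwidths (illustrative values)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    return sum(np.exp(-sq / (2.0 * s ** 2)) for s in sigmas)

def mmd2(X, Y, kernel=multiscale_rbf):
    """Biased empirical estimate of the squared MMD between word sets X and Y."""
    n, m = len(X), len(Y)
    return (kernel(X, X).sum() / n ** 2
            - 2.0 * kernel(X, Y).sum() / (n * m)
            + kernel(Y, Y).sum() / m ** 2)

rng = np.random.default_rng(0)
src = rng.normal(size=(6, 16))                             # embedded source sentence
aligned = src + rng.normal(scale=0.01, size=src.shape)     # nearly aligned target
shifted = src + 5.0                                        # badly aligned target
print(mmd2(src, aligned) < mmd2(src, shifted))  # True: small for similar word sets
```

The biased estimator used here is exactly zero for identical word sets and grows as the two distributions separate, which is the behavior the loss exploits.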
Therefore, we expect the MMD to be small when the distributions of the word sets are similar (i.e., sentences with similar semantic content) and large for disparate distributions. The global loss of one optimization step is computed as a weighted sum of the Transformer loss ($\mathcal{L}_T$) and the MMD loss ($\mathcal{L}_{\mathrm{MMD}}$) (Figure 1 and Algorithm 1). For our tests, we multiply the MMD loss by 10 in order to give it a magnitude similar to the Transformer loss. The global loss $\mathcal{L} = \mathcal{L}_T + 10\,\mathcal{L}_{\mathrm{MMD}}$ is minimized with respect to the Transformer parameters $\theta$, defined as all the learnable parameters of the Transformer (see sec. 3.2).
In the following sections, we show how our method contributes to word embedding alignment when compared to other approaches.
4 Experiments
4.1 Data and model setup
In our experiments, we used the French-English parallel corpus of the Europarl-v7 dataset koehn2005europarl for training. From the corpus, we selected the first 100k sentences to be used for development. We used the spaCy library for tokenization. We filtered out sentence pairs in which either sentence was longer than 80 tokens, as well as pairs with a source/target length ratio greater than 1.5. This dataset was split 90/10 into training and validation sets, respectively. For testing, we evaluated the performance on two French-English dictionaries: our manually built dictionary (1026 word pairs) and the MUSE dictionary conneau2017word (10872 word pairs). Our dictionary contained words commonly used in both languages, such as pronouns, nouns, prepositions and verbs. The MUSE dictionary offers similar pairs in addition to alternative translations for the same words. For example, the English word love has four entries in the MUSE dictionary as possible French translations: aime, amour, aimer and love.
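The length and ratio filters described above can be sketched as follows (with hypothetical pre-tokenized sentence pairs; the actual pipeline tokenized with spaCy, and treating the length ratio symmetrically is our reading of the rule):

```python
def keep_pair(src_tokens, tgt_tokens, max_len=80, max_ratio=1.5):
    """Filter rule from the text: drop pairs where either side exceeds 80
    tokens or the source/target length ratio exceeds 1.5."""
    ns, nt = len(src_tokens), len(tgt_tokens)
    if ns > max_len or nt > max_len:
        return False
    return max(ns, nt) / min(ns, nt) <= max_ratio  # assumes non-empty sentences

pairs = [
    (["je", "t'", "aime"], ["i", "love", "you"]),
    (["oui"], ["yes", "indeed", "certainly"]),  # length ratio 3.0 -> dropped
]
kept = [p for p in pairs if keep_pair(*p)]
print(len(kept))  # 1
```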
In our experiments, we used the base Transformer setting with embedding size 512, 6 encoder and decoder layers, 8 attention heads and sinusoidal positional embeddings. We used the Adam optimizer. We varied the learning rate over the course of training, increasing it linearly for the first 2000 training steps (warmup) and then decreasing it proportionally to the inverse square root of the step number. Training was done in batches of 2500 tokens. We trained our models on a machine with a P100 GPU.
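This schedule matches the form used by vaswani2017attention; a sketch follows (the d_model scaling factor is carried over from that work as an assumption):

```python
def lr_schedule(step, d_model=512, warmup=2000):
    """Linear warmup for the first `warmup` steps, then decay proportional to
    1/sqrt(step). The d_model ** -0.5 factor follows the original Transformer
    schedule (an assumption here, not stated in the text)."""
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

peak = lr_schedule(2000)  # the learning rate peaks at the end of warmup
```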
4.2 Word alignment evaluation
To measure the quality of the alignments, we calculated the coefficient of determination ($R^2$) between the embeddings for a set of known word translation pairs. Specifically, we computed this quantity per word pair and reported the average over all pairs. We performed this evaluation for word pairs from both our dictionary and the MUSE dictionary conneau2017word. A higher $R^2$ implies a better alignment, and $R^2 = 1$ is a perfect alignment. In addition, we quantified model performance using a metric presented in alvarez2018gromov: the accuracy with which the correct translation is found among the $k$-nearest target neighbors of a source word. We computed this accuracy for 1, 5 and 10 nearest neighbors. These four metrics ($R^2$ and the three nearest-neighbor accuracies) are shown in Table 1. For visualization of the embeddings, we reduced the dimensionality of the original embedding to a 2D space using UMAP mcinnes2018umap, as illustrated in Figure 2.
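A sketch of these two metrics follows (the per-pair $R^2$ shown here, treating the source vector as the prediction of the target vector, is one plausible reading of the metric; the toy embeddings are illustrative):

```python
import numpy as np

def r2_per_pair(x, y):
    """R^2 across embedding dimensions, treating the source vector x as the
    prediction of the target vector y (one plausible reading of the metric)."""
    ss_res = ((y - x) ** 2).sum()
    ss_tot = ((y - y.mean()) ** 2).sum()
    return 1.0 - ss_res / ss_tot

def knn_accuracy(src_emb, tgt_emb, pairs, k=1):
    """Fraction of pairs (i, j) for which target word j is among the k nearest
    target vectors (Euclidean distance) to source vector i."""
    hits = 0
    for i, j in pairs:
        d = np.linalg.norm(tgt_emb - src_emb[i], axis=1)
        if j in np.argsort(d)[:k]:
            hits += 1
    return hits / len(pairs)

# toy, nearly aligned embeddings: 20 word pairs in an 8-d space
rng = np.random.default_rng(0)
tgt = rng.normal(size=(20, 8))
src = tgt + rng.normal(scale=0.001, size=tgt.shape)
pairs = [(i, i) for i in range(20)]
print(knn_accuracy(src, tgt, pairs, k=1))  # 1.0 for this near-perfect alignment
```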
4.3 Comparing methods
We compare our method to three state-of-the-art word alignment approaches: the unsupervised Gromov-Wasserstein alignment alvarez2018gromov, the neural-based Joint Learning alignment garg2019jointly, and a supervised embedding alignment. The supervised alignment method is based on a Transformer model, as described in section 4.1, with a supervised word-pair loss calculated between the embeddings of a set of known "landmark" pairs. As landmarks, we used the first half of the dictionary while keeping the second half for evaluation. In each epoch, we sampled 50% of the given landmark pairs, computed their embeddings and measured their distances as L2 norms. In this case, the total loss of the step was given as the sum of the Transformer loss and the landmark L2-norm loss.
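The landmark term of this supervised baseline can be sketched as follows (function and variable names are hypothetical, and the Transformer loss term is omitted):

```python
import numpy as np

def landmark_loss(src_emb, tgt_emb, landmarks, rng, frac=0.5):
    """Supervised baseline loss term: sample 50% of the known landmark pairs
    each epoch and sum the L2 distances between their embeddings."""
    n = max(1, int(frac * len(landmarks)))
    idx = rng.choice(len(landmarks), size=n, replace=False)
    total = 0.0
    for i in idx:
        s, t = landmarks[i]
        total += np.linalg.norm(src_emb[s] - tgt_emb[t])
    return total

rng = np.random.default_rng(0)
src = rng.normal(size=(10, 4))          # toy source-language embedding
landmarks = [(i, i) for i in range(6)]  # known word-pair indices
print(landmark_loss(src, src.copy(), landmarks, rng))  # 0.0: embeddings coincide
```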
Semi-supervised methods, such as the Joint Learning alignment garg2019jointly and ours, were trained only on sentence pairs and not on word pairs. The supervised method was trained on both sentence pairs and word pairs. All methods were trained on sentence pairs from Europarl-v7. In addition, the Gromov-Wasserstein alignment method alvarez2018gromov does not itself compute embeddings, and thus requires a pre-computed embedding as input. We trained the Gromov-Wasserstein alignment method on the French and English embeddings generated by our WAM method for the Europarl-v7 dataset (illustrated in Figure 2A) and on its own reported training set, which consisted of word embeddings for French and English trained with FastText on the Wikipedia dataset bojanowski2017enriching.
The performances of the different methods were evaluated as described in section 4.2. The quantitative results are shown in Table 1. Figure 2 illustrates the alignment achieved by each method, wherein blue and green dots identify English and French words from the vocabulary (i.e., unique words among the 100k sentences of the Europarl-v7 dataset), respectively. The red and orange dots indicate French and English words, respectively, from our dictionary.
| Method | Train | Test | 1 NN | 5 NN | 10 NN | R² |
| --- | --- | --- | --- | --- | --- | --- |
| Gromov-Wasserstein alvarez2018gromov | Wikipedia bojanowski2017enriching | ours | 0.25 | 0.38 | 0.41 | 0.31 |
| Gromov-Wasserstein alvarez2018gromov | Europarl-v7 koehn2005europarl | ours | 0.00 | 0.00 | 0.00 | 0.00 |
| Joint Learning garg2019jointly | Europarl-v7 koehn2005europarl | ours | 0.02 | 0.03 | 0.04 | 0.14 |
| WAM (ours) | Europarl-v7 koehn2005europarl | ours | 0.55 | 0.64 | 0.66 | 0.65 |
| Gromov-Wasserstein alvarez2018gromov | Wikipedia bojanowski2017enriching | MUSE conneau2017word | 0.08 | 0.12 | 0.13 | 0.33 |
| Gromov-Wasserstein alvarez2018gromov | Europarl-v7 koehn2005europarl | MUSE conneau2017word | 0.00 | 0.00 | 0.00 | 0.00 |
| Joint Learning garg2019jointly | Europarl-v7 koehn2005europarl | MUSE conneau2017word | 0.01 | 0.03 | 0.03 | 0.20 |
| Supervised | Europarl-v7 koehn2005europarl | MUSE conneau2017word | 0.05 | 0.06 | 0.06 | 0.16 |
| WAM (ours) | Europarl-v7 koehn2005europarl | MUSE conneau2017word | 0.36 | 0.45 | 0.46 | 0.37 |
5 Results
First, we evaluated the performance of the supervised method, trained as described in section 4.3. The supervised method achieved low performance on all four metrics in both dictionaries (Table 1). This result may be expected, as correctly aligning word pairs from a training set does not guarantee generalization to unseen words. In this setup, the distance between the landmark word pairs appears to be quickly minimized (to zero distance) while the Transformer is still shaping the embeddings around the landmarks, which at this point act as anchors. The result is low performance on the unseen part of the dictionary (details about the split in section 4.3) while keeping a constantly high coefficient of determination ($R^2$) on the seen part of the dictionary (data not shown here). Figure 2B shows the resulting alignment between the embeddings with the supervised method for our dictionary (both seen and unseen words). The embeddings present a substantial amount of misalignment, as the quantification also indicates (Table 1). These results show that supervised word embedding alignment gives poor global alignment. Better performance may be achieved by using a larger set of landmarks, but this also makes the approach impractical, as the goal is to learn a generalizable alignment and not to memorize all known word translation pairs.
Next, we evaluated the performance of the Gromov-Wasserstein method alvarez2018gromov. This method presented poor overall performance when trained on our embeddings, which motivated us to also evaluate it on its own reported training set. We observed satisfactory performance in the latter setup for both dictionaries (Table 1). The alignment achieved for our dictionary is illustrated in Figure 2C. Based on these results, we conclude that the Gromov-Wasserstein method may achieve distributional alignment when trained on very large datasets, such as FastText embeddings trained on Wikipedia bojanowski2017enriching, but its word-to-word mapping is suboptimal, as indicated by the labeled landmarks in Figure 2C and by Table 1. In other words, the method (visually) appears to align the distributions, but may not actually produce the correct word alignment.
We then evaluated the performance of the Joint Learning method garg2019jointly. For a better comparison with our method, we trained their model with a vanilla Transformer, which is essentially the same architecture as the Transformer we use (see section 4.1). As shown in Table 1, this method showed a performance comparable to the supervised learning approach on the four reported metrics, suggesting that it is poorly suited to word-to-word translation tasks. The resulting embedding is shown in Figure 2D, where a large number of clusters can be seen. As described in their work, the Joint Learning alignment focuses on linking the words of a source sentence to a target sentence in order to provide the best word-to-word translation garg2019jointly. This implies that their method does not aim for vocabulary alignment as we do, although it can still be achieved by giving single words as inputs to their trained model.
Finally, we evaluated the performance of WAM, our proposed semi-supervised approach that does not require known word pairs. WAM showed superior performance on all four metrics for both dictionaries (Table 1), indicating good generalization capacity. The resulting embedding presented a (visually) nearly perfect alignment (Figure 2A) while preserving the semantic arrangement given by the Transformer, as illustrated by the zoomed-in detail of the combined embedding (Figure 3).
6 Conclusion
In this work, we presented WAM (Word Alignment through Maximum Mean Discrepancy), an end-to-end approach for word embedding alignment. We have shown that our method outperforms other supervised, unsupervised, and semi-supervised methods on several metrics that measure word alignment quality between languages. Our method is based on the Transformer model for sentence translation and uses a novel localized Maximum Mean Discrepancy (MMD) loss, which allows for a semi-supervised matching between word distributions while still learning the semantic embedding. Our method provides an accurate mapping between the embeddings of two languages that does not depend solely on the existence of a single target word. This feature allows the semantic inference of a source word by inspection of the immediate neighbors of the estimated target location. This is an important step towards bridging the semantic gap between languages, with great potential for cultural enrichment and better understanding between different cultures.
7 Broader Impact
Machine translation has deep societal impact, as it allows different cultures to communicate. Word translation is an important part of language translation and presents challenges when no single perfect translation for a word exists. As such, the inaccuracy of a translation model can cause problems in communication and result in misinterpretation. Many models fail to perform correct word translation due to the difficulty of finding a proper mapping between the domains of two languages. Our method achieves high-accuracy word translation by learning an aligned metric space for two languages. This allows the estimation of a target embedding coordinate even when no single target word is available, by providing a semantic description using several of the nearest words. This can potentially give a richer and more meaningful translation when neighboring words are taken into account. One caveat of our method, as with all machine translation models, is that its translations may be biased when trained on text from a specific domain; thus machine translations (including word translations) should always be viewed critically.