Log In Sign Up

When and Why are Pre-trained Word Embeddings Useful for Neural Machine Translation?

The performance of Neural Machine Translation (NMT) systems often suffers in low-resource scenarios where sufficiently large-scale parallel corpora cannot be obtained. Pre-trained word embeddings have proven to be invaluable for improving performance in natural language analysis tasks, which often suffer from paucity of data. However, their utility for NMT has not been extensively explored. In this work, we perform five sets of experiments that analyze when we can expect pre-trained word embeddings to help in NMT tasks. We show that such embeddings can be surprisingly effective in some cases -- providing gains of up to 20 BLEU points in the most favorable setting.


page 1

page 2

page 3

page 4


When and Why are Pre-trainedWord Embeddings Useful for Neural Machine Translation?

The performance of Neural Machine Translation (NMT) systems often suffer...

How Do Source-side Monolingual Word Embeddings Impact Neural Machine Translation?

Using pre-trained word embeddings as input layer is a common practice in...

ReWE: Regressing Word Embeddings for Regularization of Neural Machine Translation Systems

Regularization of neural machine translation is still a significant prob...

Debiasing Word Embeddings Improves Multimodal Machine Translation

In recent years, pretrained word embeddings have proved useful for multi...

Transformers without Tears: Improving the Normalization of Self-Attention

We evaluate three simple, normalization-centric changes to improve Trans...

1 Introduction

Pre-trained word embeddings have proven to be highly useful in neural network models for NLP tasks such as sequence tagging

Lample et al. (2016); Ma and Hovy (2016) and text classification Kim (2014). However, it is much less common to use such pre-training in NMT  Wu et al. (2016), largely because the large-scale training corpora used for tasks such as WMT222 tend to be several orders of magnitude larger than the annotated data available for other tasks, such as the Penn Treebank Marcus et al. (1993). However, for low-resource languages or domains, it is not necessarily the case that bilingual data is available in abundance, and therefore the effective use of monolingual data becomes a more desirable option.

Researchers have worked on a number of methods for using monolingual data in NMT systems Cheng et al. (2016); He et al. (2016); Ramachandran et al. (2016). Among these, pre-trained word embeddings have been used either in standard translation systems Neishi et al. (2017); Artetxe et al. (2017)

or as a method for learning translation lexicons in an entirely unsupervised manner

Conneau et al. (2017); Gangi and Federico (2017). Both methods show potential improvements in BLEU score when pre-training is properly integrated into the NMT system.

However, from these works, it is still not clear as to when we can expect pre-trained embeddings to be useful in NMT, or why they provide performance improvements. In this paper, we examine these questions more closely, conducting five sets of experiments to answer the following questions:

  • Is the behavior of pre-training affected by language families and other linguistic features of source and target languages? (§3)

  • Do pre-trained embeddings help more when the size of the training data is small? (§4)

  • How much does the similarity of the source and target languages affect the efficacy of using pre-trained embeddings? (§5)

  • Is it helpful to align the embedding spaces between the source and target languages? (§6)

  • Do pre-trained embeddings help more in multilingual systems as compared to bilingual systems? (§7)

2 Experimental Setup

In order to perform experiments in a controlled, multilingual setting, we created a parallel corpus from TED talks transcripts.333 Specifically, we prepare data between English (En) and three pairs of languages, where the two languages in the pair are similar, with one being relatively low-resourced compared to the other: Galician (Gl) and Portuguese (Pt), Azerbaijani (Az) and Turkish (Tr), and Belarusian (Be) and Russian (Ru). The languages in each pair are similar in vocabulary, grammar and sentence structure  Matthews (1997)

, which controls for language characteristics and also improves the possibility of transfer learning in multi-lingual models (in §

7). They also represent different language families – Gl/Pt are Romance; Az/Tr are Turkic; Be/Ru are Slavic – allowing for comparison across languages with different caracteristics. Tokenization was done using Moses tokenizer444 and hard punctuation symbols were used to identify sentence boundaries. Table 1 shows data sizes.

Dataset train dev test
Gl En
Pt En
Az En
Tr En
Be En
Ru En
Table 1: Number of sentences for each language pair.

For our experiments, we use a standard 1-layer encoder-decoder model with attention Bahdanau et al. (2014) with a beam size of implemented in xnmt555 Neubig et al. (2018). Training uses a batch size of and the Adam optimizer Kingma and Ba (2014) with an initial learning rate of , decaying the learning rate by when development loss decreases Denkowski and Neubig (2017). We evaluate the model’s performance using BLEU metric Papineni et al. (2002).

We use available pre-trained word embeddings Bojanowski et al. (2016) trained using fastText666 on Wikipedia777 for each language. These word embeddings Mikolov et al. (2017) incorporate character-level, phrase-level and positional information of words and are trained using CBOW algorithm Mikolov et al. (2013). The dimension of word embeddings is set to

. The embedding layer weights of our model are initialized using these pre-trained word vectors. In baseline models without pre-training, we use glorot2010understanding’s uniform initialization.

3 Q1: Efficacy of Pre-training

In our first set of experiments, we examine the efficacy of pre-trained word embeddings across the various languages in our corpus. In addition to providing additional experimental evidence supporting the findings of other recent work on using pre-trained embeddings in NMT Neishi et al. (2017); Artetxe et al. (2017); Gangi and Federico (2017), we also examine whether pre-training is useful across a wider variety of language pairs and if it is more useful on the source or target side of a translation pair.

Src std pre std pre
Trg std std pre pre
Gl En
Pt En
Az En
Tr En
Be En
Ru En
Table 2: Effect of pre-training on BLEU score over six languages. The systems use either random initialization (std) or pre-training (pre) on both the source and target sides.

The results in Table 2 clearly demonstrate that pre-training the word embeddings in the source and/or target languages helps to increase the BLEU scores to some degree. Comparing the second and third columns, we can see the increase is much more significant with pre-trained source language embeddings. This indicates that the majority of the gain from pre-trained word embeddings results from a better encoding of the source sentence.

The gains from pre-training in the higher-resource languages are consistent: 3 BLEU points for all three language pairs. In contrast, for the extremely low-resource languages, the gains are either quite small (Az and Be) or very large, as in Gl which achieves a gain of up to 11 BLEU points. This finding is interesting in that it indicates that word embeddings may be particularly useful to bootstrap models that are on the threshold of being able to produce reasonable translations, as is the case for Gl in our experiments.

4 Q2: Effect of Training Data Size

The previous experiment had interesting implications regarding available data size and effect of pre-training. Our next series of experiments examines this effect in a more controlled environment by down-sampling the training data for the higher-resource languages to 1/2, 1/4 and 1/8 of their original sizes.

Figure 1: BLEU and BLEU gain by data size.

From the BLEU scores in Figure 1, we can see that for all three languages the gain in BLEU score demonstrates a similar trend to that found in Gl in the previous section: the gain is highest when the baseline system is poor but not too poor, usually with a baseline BLEU score in the range of 3-4. This suggests that at least a moderately effective system is necessary before pre-training takes effect, but once there is enough data to capture the basic characteristics of the language, pre-training can be highly effective.

5 Q3: Effect of Language Similarity

The main intuitive hypothesis as to why pre-training works is that the embedding space becomes more consistent, with semantically similar words closer together. We can also make an additional hypothesis: if the two languages in the translation pair are more linguistically similar, the semantic neighborhoods will be more similar between the two languages (i.e. semantic distinctions or polysemy will likely manifest themselves in more similar ways across more similar languages). As a result, we may expect that the gain from pre-training of embeddings may be larger when the source and target languages are more similar. To examine this hypothesis, we selected Portuguese as the target language, which when following its language family tree from top to bottom, belongs to Indo-European, Romance, Western Romance, and West-Iberian families. We then selected one source language from each family above.888English was excluded because the TED talks were originally in English, which results in it having much higher BLEU scores than the other languages due to it being direct translation instead of pivoted through English like the others. To avoid the effects of training set size, all pairs were trained on 40,000 sentences.

Dataset Lang. Family std pre
Es Pt West-Iberian
Fr Pt Western Romance
It Pt Romance
Ru Pt Indo-European
He Pt No Common
Table 3: Effect of linguistic similarity and pre-training on BLEU. The language family in the second column is the most recent common ancestor of source and target language.

From Table 3, we can see that the BLEU scores of Es, Fr, and It do generally follow this hypothesis. As we move to very different languages, Ru and He see larger accuracy gains than their more similar counterparts Fr and It. This can be largely attributed to the observation from the previous section that systems with larger headroom to improve tend to see larger increases; Ru and He have very low baseline BLEU scores, so it makes sense that their increases would be larger.

6 Q4: Effect of Word Embedding Alignment

Until now, we have been using embeddings that have been trained independently in the source and target languages, and as a result there will not necessarily be a direct correspondence between the embedding spaces in both languages. However, we can postulate that having consistent embedding spaces across the two languages may be beneficial, as it would allow the NMT system to more easily learn correspondences between the source and target. To test this hypothesis, we adopted the approach proposed by  smith2017offline to learn orthogonal transformations that convert the word embeddings of multiple languages to a single space and used these aligned embeddings instead of independent ones.

Dataset unaligned aligned
Gl En
Pt En
Az En 2.1
Tr En 17.7
Be En
Ru En
Table 4: Correlation between word embedding alignment and BLEU score in bilingual translation task.

From Table 4, we can see that somewhat surprisingly, the alignment of word embeddings was not beneficial for training, with gains or losses essentially being insignificant across all languages. This, in a way, is good news, as it indicates that a priori alignment of embeddings may not be necessary in the context of NMT, since the NMT system can already learn a reasonable projection of word embeddings during its normal training process.

7 Q5: Effect of Multilinguality

Finally, it is of interest to consider pre-training in multilingual translation systems that share an encoder or decoder between multiple languages Johnson et al. (2016); Firat et al. (2016), which is another promising way to use additional data (this time from another language) as a way to improve NMT. Specifically, we train a model using our pairs of similar low-resource and higher-resource languages, and test on only the low-resource language. For those three pairs, the similarity of Gl/Pt is the highest while Be/Ru is the lowest.

Train Eval bi std pre align
Gl Pt Gl
Az Tr Az
Be Ru Be
Table 5: Effect of pre-training on multilingual translation into English. bi is a bilingual system trained on only the eval source language and all others are multi-lingual systems trained on two similar source languages.

We report the results in Table 5. When applying pre-trained embeddings, the gains in each translation pair are roughly in order of their similarity, with Gl/Pt showing the largest gains, and Be/Ru showing a small decrease. In addition, it is also interesting to note that as opposed to previous section, aligning the word embeddings helps to increase the BLEU scores for all three tasks. These increases are intuitive, as a single encoder is used for both of the source languages, and the encoder would have to learn a significantly more complicated transform of the input if the word embeddings for the languages were in a semantically separate space. Pre-training and alignment ensures that the word embeddings of the two source languages are put into similar vector spaces, allowing the model to learn in a similar fashion as it would if training on a single language.

Interestingly, Be En does not seem to benefit from pre-training in the multilingual scenario, which hypothesize is due to the fact that: 1) Belarusian and Russian are only partially mutually intelligible  Corbett and Comrie (2003), in another words, they are not as similar; 2) the Slavic languages have comparatively rich morphology, making sparsity in the trained embeddings a larger problem.

8 Analysis

8.1 Qualitative Analysis

source ( risos ) e é que chris é un grande avogado , pero non sabía case nada sobre lexislación de patentes e absolutamente nada sobre xenética .
reference ( laughter ) now chris is a really brilliant lawyer , but he knew almost nothing about patent law and certainly nothing about genetics .
bi:std ( laughter ) and i ’m not a little bit of a little bit of a little bit of and ( laughter ) and i ’m going to be able to be a lot of years .
multi:pre-align ( laughter ) and chris is a big lawyer , but i did n’t know almost anything about patent legislation and absolutely nothing about genetic .
Table 6: Example translations of Gl En.

Finally, we perform a qualitative analysis of the translations from Gl En, which showed one of the largest increases in quantitative numbers. As can be seen from Table 6, pre-training not only helps the model to capture rarer vocabulary but also generates sentences that are more grammatically well-formed. As highlighted in the table cells, the best system successfully translates a person’s name (“chris”) and two multi-word phrases (“big lawyer” and “patent legislation”), indicating the usefulness of pre-trained embeddings in providing a better representations of less frequent concepts when used with low-resource languages.

In contrast, the bilingual model without pre-trained embeddings substitutes these phrases for common ones (“i”), drops them entirely, or produces grammatically incorrect sentences. The incomprehension of core vocabulary causes deviation of the sentence semantics and thus increases the uncertainty in predicting next words, generating several phrasal loops which are typical in NMT systems.

More qualitative examples and analysis are presented in the supplementary materials.

8.2 Analysis of Frequently Generated -grams.

bi:std bi:pre
) so about
( laughter ) i people
) i or
laughter ) i these
) and with
they were because
have to like
a new could
to do , all
‘‘ and then two
(a) Pairwise comparison between two bilingual models
multi:std multi:pre+align
here on the
again , like
several should
you ’re going court
’ve judge
we ’ve testosterone
you ’re going to patents
people , patent
what are test
the room with
(b) Pairwise comparison between two multilingual models
Table 7:

Top 10 n-grams that one system did a better job of producing. The numbers in the figure, separated by a slash, indicate how many times each n-gram is generated by each of the two systems.

We additionally performed pairwise comparisons between the top 10 n-grams that each system (selected from the task Gl En) is better at generating, to further understand what kind of words pre-training is particularly helpful for.999Analysis was performed using from The results displayed in Table 7 demonstrate that pre-training helps both with words of low frequency in the training corpus, and even with function words such as prepositions. On the other hand, the improvements in systems without pre-trained embeddings were not very consistent, and largely focused on high-frequency words.

8.3 F-measure of Target Words

Figure 2: The f-measure of target words in bilingual translation task Pt En

Finally, we performed a comparison of the f-measure of target words, bucketed by frequency in the training corpus. As displayed in Figure 2, this shows that pre-training manages to improve the accuracy of translation for the entire vocabulary, but particularly for words that are of low frequency in the training corpus.

9 Conclusion

This paper examined the utility of considering pre-trained word embeddings in NMT from a number of angles. Our conclusions have practical effects on the recommendations for when and why pre-trained embeddings may be effective in NMT, particularly in low-resource scenarios: (1) there is a sweet-spot where word embeddings are most effective, where there is very little training data but not so little that the system cannot be trained at all, (2) pre-trained embeddings seem to be more effective for more similar translation pairs, (3) a priori alignment of embeddings may not be necessary in bilingual scenarios, but is helpful in multi-lingual training scenarios.


Parts of this work were sponsored by Defense Advanced Research Projects Agency Information Innovation Office (I2O). Program: Low Resource Languages for Emergent Incidents (LORELEI). Issued by DARPA/I2O under Contract No. HR0011-15-C-0114. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation here on.