A Survey of Methods to Leverage Monolingual Data in Low-resource Neural Machine Translation

10/01/2019 ∙ by Ilshat Gibadullin, et al. ∙ JSC Innopolis

Neural machine translation has become the state-of-the-art for language pairs with large parallel corpora. However, the quality of machine translation for low-resource languages leaves much to be desired. There are several approaches to mitigate this problem, such as transfer learning, semi-supervised and unsupervised learning techniques. In this paper, we review the existing methods, where the main idea is to exploit the power of monolingual data, which, compared to parallel, is usually easier to obtain and significantly greater in amount.




I Introduction

A lack of parallel data is a major problem for machine translation of many language pairs. This is usually the case when one or both languages in a pair have a small number of speakers or a low media presence. However, if some parallel data exists, there typically exist orders of magnitude more monolingual data, which, used in addition to the parallel data, can result in significant improvements in translation quality.

The success of neural networks in machine translation motivates the exploration of methods for applying monolingual data to them efficiently. Over the last few years, a lot of work has been done in this direction, and many new approaches have been suggested [9; 17; 5; 24; 7; 8; 21; 16; 19]. Anyone looking for a way to benefit from monolingual data has to go through all of these works in order to understand, compare, and choose among them. If these methods were grouped by similarity and better organized, navigating them would be substantially simpler, both for practical use and for research needs. There are surveys for other subfields of machine translation, e.g., a survey of domain adaptation techniques [6] or a study of post-editing methods [11], but to the best of our knowledge, there are no surveys of approaches which utilize monolingual data. In this paper, we attempt to fill this gap by reviewing, categorizing, and comparing the existing methods of exploiting monolingual data in neural machine translation (NMT) models. This work is meant to be a good starting point for a quick immersion into the subject.

I-A NMT Models

All methods covered here build on two fundamental NMT models. Both models consist of an encoder and a decoder, where the role of the encoder is to represent an input sequence as a context-dependent representation, while the purpose of the decoder is to generate the sequence in the target language based on the encoder output. The first model was introduced by Bahdanau et al. [3] and is based on recurrent neural network (RNN) layers for the encoder and decoder parts of the model, with an attention layer between them. Further in the paper, we refer to this model as RNN-based. The second model was presented by Vaswani et al. [22] and is called Transformer. They introduced a new technique called multi-head attention, and the encoder and decoder parts of the model are based on stacks of multi-head attention and feed-forward layers. Such an architecture removes the need for RNN layers, so training becomes significantly faster.

I-B Organization of Methods

We divided all existing methods into two broad categories: Architecture Independent and Architecture Dependent methods. Such division is made from the practical viewpoint: there exist several strong NMT models and development of new models or some modifications to existing ones continues; so, Architecture Independent methods of exploiting monolingual data may be used with any model to get translation quality improvements, viewing a model as a black box. On the contrary, Architecture Dependent methods require specific changes in architecture and may or may not be adapted to various other NMT models.

Architecture Independent methods include all approaches whose idea is to generate a pseudo-parallel corpus from a monolingual corpus, then mix the pseudo-parallel corpus with the true parallel corpus and make no distinction between them during training. This category also includes methods which use a separate pre-trained language model and merge it with a pre-trained translation model during inference.

Architecture Dependent methods focus on specific architectural features of NMT models and/or require additional changes in architecture. One type of method only freezes some parameters of the NMT model during training on the pseudo-parallel corpus. Other methods apply unsupervised pre-training of some parameters of the NMT model on a monolingual corpus, then initialize the model with these pre-trained parameters and train it further on the parallel corpus. More complex methods integrate language modeling and/or multi-task learning into the NMT model.

There is a recent trend towards using monolingual data in a fully unsupervised setting. As this approach is entirely different from the other methods we describe here (which use monolingual data only in addition to parallel data), we do not include it in our taxonomy and instead describe it separately.

The taxonomy of methods is presented in Figure 1. Below, we will provide further categorization with examples and results overview for each method.

Fig. 1: The taxonomy of methods exploiting monolingual data for NMT. The shadowed areas highlight the categories sharing the same core idea.

II Architecture Independent Methods

Architecture independent methods of exploiting monolingual data are transparent to NMT model architecture, so any existing model may be used as a base. We will consider methods from this category divided into two subcategories according to the way monolingual data is used:

  1. Methods which use additional pseudo-parallel corpus;

  2. Methods which merge NMT model with a separate language model.

II-A Additional Pseudo-parallel Corpus

The main idea of methods in this subcategory is to generate a pseudo-parallel (or synthetic) corpus from monolingual data, so that we have additional input and output for our NMT model. We then mix the pseudo-parallel and real parallel corpora and make no distinction between them during training. Either the input or the output part can be generated from existing target or source monolingual sentences of a language pair. The benefit of a pseudo-parallel corpus is that the NMT model better learns the structure of the target or source language, depending on the side of the monolingual data. The drawback is that low quality of the generated sentences or a domain mismatch may degrade the learned structure of the corresponding side, so in some cases the size of the pseudo-parallel corpus has to be limited. Below, we consider methods to generate a pseudo-parallel corpus.


Back-Translation

This approach to synthetic parallel corpus generation was proposed by Sennrich et al. [17] and works as follows: an additional reversed machine translation (MT) model is trained on the available parallel corpus in the target-to-source direction, i.e., with the target and source sides swapped with respect to the main translation model. This opposite direction is important because the target side thereby stays intact. Then, using this pre-trained reversed model, target monolingual sentences are translated into the source language. These sentence pairs form a new pseudo-parallel corpus. Finally, the true and synthetic parallel corpora are mixed, and the main model is trained on this combination.

This method was evaluated on the low-resource English-Turkish language pair, with an RNN-based NMT model as the baseline. 300K parallel sentence pairs and 3.2M synthetic parallel sentence pairs, back-translated from monolingual sentences, were used for training. The addition of synthetic training data led to an improvement of 2.7 BLEU on average. The authors also analyzed how the quality of the final NMT model depends on the quality of the back-translated sentences and found that improving the back-translated sentences by 6 BLEU gives a 0.6 BLEU improvement of the final NMT model. Experiments conducted by Stahlberg et al. [21] show that the performance of Back-Translation doesn’t degrade as long as the ratio of synthetic to real parallel data doesn’t exceed 8:1.
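The data flow of Back-Translation can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the `max_ratio` cap (chosen here to mirror the 8:1 observation), and the toy `toy_reverse` stand-in for a trained target-to-source model are all assumptions made for the example.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def back_translate(
    target_mono: List[str],
    reverse_translate: Callable[[str], str],
) -> List[Pair]:
    """Pair each genuine target sentence with a machine-generated source."""
    return [(reverse_translate(t), t) for t in target_mono]


def build_training_corpus(
    parallel: List[Pair],
    target_mono: List[str],
    reverse_translate: Callable[[str], str],
    max_ratio: float = 8.0,
) -> List[Pair]:
    """Mix real and synthetic pairs, capping synthetic:real at max_ratio."""
    limit = int(max_ratio * len(parallel))
    synthetic = back_translate(target_mono[:limit], reverse_translate)
    # No flag distinguishes real from synthetic pairs during training.
    return parallel + synthetic


def toy_reverse(sentence: str) -> str:
    """Hypothetical stand-in for a trained target-to-source MT model."""
    return "<src> " + sentence.lower()


corpus = build_training_corpus(
    parallel=[("merhaba", "hello")],
    target_mono=["Good morning", "Thank you"],
    reverse_translate=toy_reverse,
)
```

Note that only the source side of each synthetic pair is machine-generated; the target side, which the decoder learns to produce, is always genuine text.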

Round Trip Training

Unlike the previous method, this method doesn’t generate a pseudo-parallel corpus explicitly. Instead, it leverages the idea of auto-encoders to generate a pseudo-parallel sentence and immediately reconstruct the original from it.

Simply speaking, the main purpose of auto-encoders in deep learning is to learn the common input features. An auto-encoder has two parts, called the encoder and the decoder. The role of the encoder is to extract the common input features, and the role of the decoder is to reconstruct the input from the encoder’s output.

The method proposed by Cheng et al. [5] uses auto-encoders to exploit monolingual corpora. The idea is as follows. There are two NMT models, the first with source-to-target and the second with target-to-source translation direction. The source-to-target NMT model can be considered the encoder and the target-to-source NMT model the decoder of an auto-encoder network whose role is to reconstruct source sentences. The same auto-encoder may be constructed in the opposite direction, where the target-to-source model is the encoder and the source-to-target model is the decoder of an auto-encoder network whose role is to reconstruct target sentences. The overall training objective of the method is to maximize the likelihoods of the source-to-target and target-to-source models on the parallel corpus, together with the reconstruction likelihoods of the auto-encoders on the monolingual corpora. On each iteration, the models are trained on one mini-batch of the parallel corpus and two mini-batches of the target and source monolingual corpora.
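The combined objective described above can be sketched as follows. The notation is an assumption made for illustration: $D$ is the parallel corpus, $S$ and $T$ are the source and target monolingual corpora, $\theta_{s \to t}$ and $\theta_{t \to s}$ are the parameters of the two NMT models, and $\lambda_1, \lambda_2$ are hypothetical weights balancing the reconstruction terms.

$$
J(\theta_{s \to t}, \theta_{t \to s}) =
\sum_{(x, y) \in D} \log P(y \mid x; \theta_{s \to t})
+ \sum_{(x, y) \in D} \log P(x \mid y; \theta_{t \to s})
+ \lambda_1 \sum_{y \in T} \log P(y' = y \mid y; \theta_{t \to s}, \theta_{s \to t})
+ \lambda_2 \sum_{x \in S} \log P(x' = x \mid x; \theta_{s \to t}, \theta_{t \to s})
$$

Here the target-side reconstruction probability marginalizes over the intermediate source sentence, $P(y' \mid y) = \sum_{x} P(x \mid y; \theta_{t \to s}) \, P(y' \mid x; \theta_{s \to t})$, and the source-side term is defined symmetrically. The sum over all intermediate translations is intractable in general, so in practice it has to be approximated, for example with a small set of candidate translations.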

This method can be seen as an iterative extension to Back-Translation and may exploit not only target-side but also source-side monolingual data.

The authors evaluated the method on the Chinese-English language pair and used an RNN-based NMT model for all experiments. 2.56M parallel sentence pairs, 18.75M Chinese and 22.32M English monolingual sentences were used for training. The authors found that using both source and target monolingual data doesn’t give significant improvements over the baseline. Using the parallel and English monolingual corpora, the authors achieved +4.7 BLEU over the baseline in the Chinese→English direction. The same result was obtained in the English→Chinese direction with the parallel and Chinese monolingual corpora. The method also outperforms Back-Translation by +1.8 and +1.0 BLEU in the Chinese→English and English→Chinese directions, respectively. We think that this may happen because the translation quality improves with each new iteration, so the quality of the synthetic sentences also becomes higher, while in Back-Translation the quality of the synthetic parallel sentences is constant. Using source-side monolingual data also gives improvements over the baseline and Back-Translation, but smaller than those obtained with target-side monolingual data.

Copied Monolingual Data

This method explicitly generates pseudo-parallel corpora, but in contrast to Back-Translation, no additional translation model is used.

The method proposed by Currey et al. [7] suggests converting target-side monolingual data to synthetic parallel corpus by copying target-side sentences to the source side. To represent source and target words in the same vocabulary, they use byte-pair encoding [18]. Parallel corpus is mixed with synthetic parallel corpus, and no distinction is made during training.

The authors consider this method a multi-task system [13; 10], where one NMT model combines several translation directions, e.g., French→English, German→English, and English→German. This method combines, e.g., English→English and Turkish→English into one system for the purpose of improving Turkish-to-English quality.

The evaluation was performed on the English-Turkish language pair, and an RNN-based model was used for experiments. 207K parallel sentence pairs, 414K English and 414K Turkish monolingual sentences were used for training. The method gave +1.2 BLEU over the baseline. Increasing the ratio of copied to parallel text was found to improve the BLEU score: a 3:1 ratio gives +0.8 BLEU over a 1:1 ratio for the English-Turkish pair. However, additional experiments with higher ratios of copied to parallel text are required, because the model will most likely start to degrade at some point.

II-B Separate Language Model

In this subcategory, we describe methods which use an additional, separate target-side language model (LM). LMs are pre-trained on a target-side monolingual corpus to predict the probability distribution of the next word given the previous words, $p(y_t \mid y_{<t})$. The main idea of these methods is to merge the LM and the translation model (TM) in the prediction step. This means that the probability distribution of the next word given the previously predicted words and the input sentence, $p(y_t \mid y_{<t}, x)$, is calculated using the output logits (or probability distribution) of the TM and the output logits of the LM. The expected effect of the additional target-side LM is that it should help the TM generate grammatically correct sentences.

As the LM is independent of the TM, there are no restrictions on its architecture: it can be, e.g., an n-gram based feed-forward LM [4] or an RNN-based LM [14].

Shallow Fusion

Shallow Fusion [9] is a technique to merge a separately pre-trained LM and TM. At each time step (the step of predicting the next word), the TM and the LM propose scores for the possible next words based on the previously predicted words. Then, the scores of the TM are summed with the scores of the LM multiplied by a hyper-parameter $\beta$, which controls the influence of the LM and has to be tuned on validation data. The word with the highest score is selected as the next word in the sequence. More precisely, beam search is applied, where the top N most probable sequences are carried until the end of prediction and the one with the highest score is chosen as the result.

The evaluation is provided for the Turkish→English translation direction, and all experiments used an RNN-based TM and LM. 160K parallel sentence pairs and 4M English monolingual sentences were used for training. Unfortunately, no improvements were obtained using this method; it gives almost the same results as the baseline without an LM.
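The score combination at a single time step can be sketched as follows. This is a simplified illustration under stated assumptions: the vocabulary, the probabilities, and the names `shallow_fusion_scores` and `lm_weight` are made up for the example, and only one greedy step is shown rather than a full beam search.

```python
import math
from typing import Dict


def shallow_fusion_scores(
    tm_log_probs: Dict[str, float],
    lm_log_probs: Dict[str, float],
    lm_weight: float,
) -> Dict[str, float]:
    """Add the weighted LM log-probability to each TM log-probability."""
    return {
        word: tm_log_probs[word] + lm_weight * lm_log_probs[word]
        for word in tm_log_probs
    }


# Toy next-word distributions (hypothetical numbers).
tm = {"cat": math.log(0.5), "cats": math.log(0.4), "dog": math.log(0.1)}
lm = {"cat": math.log(0.2), "cats": math.log(0.7), "dog": math.log(0.1)}

scores = shallow_fusion_scores(tm, lm, lm_weight=1.0)
best = max(scores, key=scores.get)  # the LM shifts the choice to "cats"
```

Setting `lm_weight` to zero recovers the plain TM ranking, which is why the weight is tuned on held-out data rather than fixed in advance.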


PostNorm

This method builds on the merging idea of the previous method, with some modifications. The main difference is that the LM is trained first and then merged with the TM, after which the TM is trained with the LM’s parameters fixed. This technique is inspired by Cold Fusion [20].

This LM and TM merging technique is called PostNorm and was proposed by Stahlberg et al. [21]. First, the output of the projection layer of the TM is normalized by softmax to obtain a probability distribution. Then, it is multiplied component-wise by the probability distribution of the LM. To make the result a valid probability distribution, an additional normalization using the softmax function is applied. In contrast to Shallow Fusion, this method requires normalization after the summation of log probabilities (multiplication of probabilities) and training of the TM after merging with the LM.

The method was evaluated on the Turkish-English language pair, and an RNN-based TM and LM were used in the experiments. 207K parallel sentence pairs, 47M English and 3M Turkish monolingual sentences were used for training. The method gives +0.71 BLEU for the English→Turkish and +0.43 BLEU for the Turkish→English direction over baselines without an LM. For the higher-resourced Xhosa-English pair with 739K parallel sentences, it achieves +2.36 BLEU.
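The normalization steps described above can be sketched for a single time step as follows. This is an illustrative sketch only: the function names and the toy numbers are assumptions, and the training of the TM with the frozen LM is not shown.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def post_norm(tm_logits: List[float], lm_probs: List[float]) -> List[float]:
    """Normalize TM logits, multiply by LM probabilities, renormalize."""
    tm_probs = softmax(tm_logits)
    # Summing log-probabilities is the same as multiplying probabilities.
    log_product = [math.log(p) + math.log(q) for p, q in zip(tm_probs, lm_probs)]
    # Softmax over log-products yields the renormalized product distribution.
    return softmax(log_product)


# Toy three-word vocabulary (hypothetical numbers).
fused = post_norm(tm_logits=[2.0, 1.0, 0.0], lm_probs=[0.1, 0.8, 0.1])
```

In this toy case the TM alone prefers the first word, while the fused distribution prefers the second, because the LM assigns it most of the probability mass.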

III Architecture Dependent Methods

In contrast to architecture independent methods, which are applicable to any existing NMT model, architecture dependent methods are designed for a specific NMT model and may require significant modifications of its architecture. As in the previous category, we divide the methods into subcategories by their approach to using monolingual data:

  1. Methods which freeze some parameters during training on pseudo-parallel corpus;

  2. Methods which integrate language modeling in NMT model architecture;

  3. Methods which pre-train parts of NMT model with language models;

  4. Methods which use monolingual data for multi-tasking.

III-A Training with Parameters Freezing

The idea of methods in this subcategory is the same as in II-A, except that we distinguish between the pseudo-parallel and real parallel corpora and freeze some trainable parameters of the NMT model during training on the pseudo-parallel corpus. Parameter freezing is introduced to weaken the negative influence of generated sentences on the corresponding side of the NMT model. However, the negative effect of low-quality sentences can’t be overcome completely; thus, the size of the pseudo-parallel corpus still needs to be limited. Both methods from this subcategory use an RNN-based NMT model.


Forward-Translation

This method by Zhang and Zong [24] is similar to Back-Translation (II-A), except that here a source-side monolingual corpus is exploited and parameter freezing is used during training on the synthetic parallel corpus.

As in the Back-Translation method, any NMT or SMT model is trained as an additional TM to generate a synthetic corpus by translating the monolingual one, but from source to target (forward translation). Then, the parallel and synthetic data are combined to train the main translation model. To protect the decoder part of the model from the negative influence of the synthetic corpus, the authors freeze the decoder parameters during training on the synthetic corpus.

The method was evaluated on the Chinese→English translation direction, and an RNN-based NMT model was used for forward translation. Using 620K parallel sentences and 3.3M Chinese monolingual sentences, the method outperforms the baseline NMT model by 3.2 BLEU.

Dummy Input

This method was proposed by Sennrich et al. [17] together with Back-Translation. The idea of the method is the following: target-side monolingual sentences are paired with single-word dummy input to produce pseudo-parallel corpus, which is used for NMT model training. To protect encoder and attention parts of RNN-based NMT model from low-quality input, the parameters of encoder and attention layers are fixed during training on this pseudo-parallel corpus.

The authors argue that this method helps the model learn the target language structure better, but that it significantly harms the conditioning of the decoder on the encoder and attention layers if the proportion of monolingual data is too high. The advantage of this method compared to Back-Translation is the reduced time needed to fit the system, as there’s no need to train an additional NMT model or to run back-translation, both of which are quite time-consuming.

The Turkish→English translation direction was used for evaluation, with 300K parallel sentences and 177M English monolingual sentences. On average, there is an improvement of 0.6 BLEU over the baseline if the monolingual corpus with a dummy source side is added to the parallel data in a 1:1 ratio. Higher proportions of dummy-source data degrade the BLEU score.

III-B Integration of Language Modeling

In contrast to the methods described in II-B, where the language model is merged with the translation model in the prediction step, this subcategory covers methods which integrate language modeling into the NMT model architecture, so that the merging of LM and TM occurs at an earlier stage. The benefit of such integration is that the LM can be exploited more accurately, taking the architectural features of the TM into account. Both methods from this subcategory were developed for RNN-based NMT models.

Deep Fusion

This method was proposed by Gülcehre et al. [9] with Shallow Fusion technique and also uses separately pre-trained RNN-based LM and TM, but the difference is that merging of LM with TM occurs earlier.

In this method, called Deep Fusion, the hidden state of LM is concatenated with the decoder’s hidden state. Thus, the new hidden output will be added to the decoder’s input: LM’s state in addition to TM’s state, embedding of the previous word and the context vector. To balance the influence of LM on TM, the additional controller mechanism was proposed to control the magnitude of LM’s hidden state. Then, the model’s hidden output and controller mechanism parameters are fine-tuned on training data.

For the Turkish→English translation direction, with 160K parallel sentences and 4M English monolingual sentences, the method gives on average +1.19 BLEU over the baseline without an LM.
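The gating idea behind Deep Fusion can be sketched as follows. This is a simplified illustration rather than the published architecture: it assumes a single scalar gate computed from the LM state by a sigmoid, and the names `deep_fusion_state`, `gate_weights`, and `gate_bias` are hypothetical.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def deep_fusion_state(
    tm_state: List[float],
    lm_state: List[float],
    gate_weights: List[float],
    gate_bias: float,
) -> List[float]:
    """Concatenate the decoder state with the gated LM state."""
    # The controller maps the LM state to a scalar in (0, 1).
    gate = sigmoid(sum(w * h for w, h in zip(gate_weights, lm_state)) + gate_bias)
    # Scale the LM state by the gate, then append it to the TM state.
    return tm_state + [gate * h for h in lm_state]


# With zero weights and bias the gate is 0.5, so the LM state is halved.
fused = deep_fusion_state(
    tm_state=[0.5, -0.2],
    lm_state=[1.0, 2.0, -1.0],
    gate_weights=[0.0, 0.0, 0.0],
    gate_bias=0.0,
)
```

Because the gate is learned during fine-tuning, the model can decide per time step how much to rely on the LM, unlike the single fixed weight used in Shallow Fusion.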

Language Model with Multi-Task Learning

This method differs from the previous one in that it doesn’t use a pre-trained LM and TM, and it integrates the RNN-based language model [14] in a different way.

The method proposed by Domhan and Hieber [8] uses an additional source-independent RNN layer, which has the language modeling role. Source-independent means that the parameters of this layer will only be affected by target-side sequences. To compute its current hidden state, this RNN layer takes as an input its previous state and embedding of the previous target word. Computed current hidden state of this layer goes as an input to the decoder layer of RNN-based model, instead of embedding of the word, to compute the decoder’s hidden state. Here is the difference from Deep Fusion, where we just concatenate the hidden states of LM and TM’s decoder. To jointly train LM and TM, the second objective is specified for the output of the language modeling layer—to predict the next target word, conditioned only on target history information. The system is trained to minimize the loss of the model by updating source-dependent parameters only on parallel corpus and updating source-independent parameters on both monolingual and parallel corpora.
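The joint training described above can be sketched as a single loss. The notation is an assumption made for illustration, as is the unweighted sum of the two objectives: $D$ is the parallel corpus, $M$ is the target-side monolingual corpus, $\theta_{dep}$ denotes the source-dependent parameters, and $\theta_{ind}$ denotes the source-independent parameters of the language modeling layer.

$$
\mathcal{L}(\theta_{dep}, \theta_{ind}) =
- \sum_{(x, y) \in D} \Big[ \log p_{TM}(y \mid x; \theta_{dep}, \theta_{ind}) + \log p_{LM}(y; \theta_{ind}) \Big]
- \sum_{y \in M} \log p_{LM}(y; \theta_{ind})
$$

The last term depends only on $\theta_{ind}$, so gradients from monolingual sentences never reach the source-dependent parameters, while parallel sentences update both parameter groups.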

The evaluation is provided for the English→German and Chinese→English translation directions. 191K English-German and 242K Chinese-English parallel sentence pairs, 51M German and 51M English monolingual sentences were used for training. The method obtained +1.4 and +0.5 BLEU improvements for the English→German and Chinese→English directions, respectively, over the baseline without an LM. The authors also provided a comparison with the Back-Translation method [17], which outperforms their method by 3 BLEU on average. The difference may be explained by the fact that the gradients from monolingual data don’t change the source-dependent part of the model; in contrast, synthetic data always affects all model parameters. The authors think that training with synthetic source data may also act as a model regularizer.

III-C Pre-training with Language Models

Here we describe one more subcategory of methods which exploit the power of language modeling. Pre-training with language models is a technique in which parts of the NMT model are trained as LMs, and the parameters of these pre-trained parts are used to initialize the NMT model. We describe two methods which use the same idea but apply it to two different NMT models: the RNN-based model [3] and Transformer [22]. Both models are based on the encoder-decoder architecture, where the encoder may be considered a source-side LM and pre-trained on a source-side monolingual corpus, and the decoder may be considered a target-side LM and pre-trained on a target-side monolingual corpus. The benefit of pre-training with LMs is that both source-side and target-side monolingual corpora can be exploited efficiently.

Pre-training of RNN-based model

In this method [16], the source-side LM is the encoder part of the RNN-based model with an additional softmax output layer. The target-side LM is the decoder part of the RNN-based model without encoder input. Both LMs are pre-trained on the corresponding monolingual corpora. After pre-training, the embedding and first RNN layers of the encoder and decoder, plus the decoder’s softmax layer, are initialized with the pre-trained weights. Then, the model is fine-tuned using the parallel corpus. To ensure that the model doesn’t overfit on the labeled data and forget the language modeling task, the authors continued training on monolingual data, setting equal weights for the LM losses and the translation model loss. The authors also improved the model by adding a residual connection between the first RNN layer of the decoder and the softmax layer, because the intermediate decoder layers are randomly initialized and may thus break the target-side LM with random gradients.

Evaluated on the English→German translation direction using the WMT 14 dataset with 4 million parallel and monolingual sentences, the method outperformed Back-Translation (II-A) by 0.3 BLEU.

Pre-training of Transformer model

Similarly to the previous method, Skorokhodov et al. [19] train source-side and target-side LMs, then use all pre-trained LM parameters to initialize NMT model. Connections from the hidden state of encoder to logits are discarded and attention weights from encoder outputs to decoder are randomly initialized. In Transformer, a decoder layer stack consists of the following layers: self-attention layer, encoder-decoder attention layer (where encoder output comes), and feed-forward layer. The encoder-decoder attention layer is initialized randomly, so to protect the target-side LM from breaking by random gradients, an additional residual connection between self-attention layer and feed-forward layer was introduced.

This method was examined on an extremely small parallel corpus. The evaluation on the Russian→English translation direction with only 20K parallel sentences and 500K English sentences showed a +1.4 BLEU improvement over Transformer without LM initialization.

III-D Multi-tasking

In this subcategory, we will describe only one method, which uses monolingual data in a different way—neither translating it nor using LMs.

Input Sentences Reordering

Methods to use a source-side monolingual corpus were described in II-A, III-A, and III-C, but because of the simplicity of the models, the monolingual data may not have been used to its full potential. The idea of this method is to use a more complex model to leverage source-side monolingual data more efficiently. The method is based on a sentence reordering technique, which tries to reorder the words of a source-language sentence so as to approximate the word order of the target language. In all experiments, an RNN-based NMT model was used.

The method proposed by Zhang and Zong [24] works as follows: there is a single shared encoder and two different decoders. The first decoder is used for translation, the second for reordering, where the target is just a reordered source sentence. Target sentences for training the reordering decoder were obtained from source sentences by applying the reordering rules proposed by Wang et al. [23]. The objective of the whole model is to maximize the sum of the log probabilities of the translation and reordering models. Training is performed in iterations of 5 epochs: one epoch for training the reordering model and the others for training the translation model. The method applies multi-tasking because it learns translation and reordering at the same time.

The evaluation is provided for the Chinese→English direction with 620K parallel sentences and 6.5M Chinese monolingual sentences. The method gives +4.3 BLEU over the baseline, which is a considerable improvement.

IV Fully Unsupervised Learning

Unlike all the methods we described above, this group of approaches doesn’t treat monolingual data as an additional resource for improving NMT models trained on parallel data. Instead, models here are trained using only monolingual corpora. This is made possible due to linguistic similarities between source and target languages.

The first step in unsupervised learning is bilingual dictionary induction—having it, one can already build a simple word-by-word translation model and then improve it.

There are several approaches to building such a dictionary. Initially, a bilingual dictionary generation method was suggested by Mikolov et al. [15]. The authors found that word embeddings of two different languages can be mapped to a common space where words from different languages with the same meaning reside close to each other. If the embedding matrix of the first language is $X$ and the embedding matrix of the second language is $Z$, then one can find a linear transformation matrix $W$ such that $XW$ is in the same space as $Z$. In this common space, translations of words in one language can be found by searching for nearest neighbors among the words of the other language. The transformation matrix can be found using a small seed dictionary or even without one, as suggested by Artetxe et al. [2].
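The nearest-neighbor lookup in the common space can be sketched as follows. This is a toy illustration under stated assumptions: the mapping matrix is taken as already learned, the two-dimensional embeddings and the Spanish-English word pairs are invented for the example, and the names `apply_map`, `translate_word`, and `mapping` are hypothetical.

```python
import math
from typing import Dict, List

Vector = List[float]


def apply_map(vec: Vector, mapping: List[Vector]) -> Vector:
    """Multiply a row vector by the mapping matrix (vec @ mapping)."""
    return [
        sum(vec[i] * mapping[i][j] for i in range(len(vec)))
        for j in range(len(mapping[0]))
    ]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def translate_word(
    word: str,
    src_emb: Dict[str, Vector],
    tgt_emb: Dict[str, Vector],
    mapping: List[Vector],
) -> str:
    """Map a source embedding and return its nearest target-language word."""
    mapped = apply_map(src_emb[word], mapping)
    return max(tgt_emb, key=lambda t: cosine(mapped, tgt_emb[t]))


# Toy embeddings: the target space is the source space rotated by 90 degrees.
src_emb = {"gato": [1.0, 0.0], "perro": [0.0, 1.0]}
tgt_emb = {"cat": [0.0, 1.0], "dog": [-1.0, 0.0]}
mapping = [[0.0, 1.0], [-1.0, 0.0]]

translation = translate_word("gato", src_emb, tgt_emb, mapping)
```

Applying this lookup to every source word yields a word-by-word translation model, which is the starting point that the unsupervised methods then refine.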

Once a common embedding space is found, a simple word-by-word translation model can be built and iteratively improved using techniques such as language modeling based on denoising auto-encoders, iterative back-translation, etc. A detailed description of such methods ([1], [12]) is out of the scope of this survey.

The results obtained by unsupervised methods are impressive. Here we report the ones demonstrated by Lample et al. [12]. The method was evaluated on the English-French language pair using the News Crawl WMT14 monolingual corpora. The reported scores are 25.1 and 24.2 BLEU in the English→French and French→English directions, respectively, using the Transformer model. The authors performed a comparison with a model trained on parallel data, varying the number of parallel sentences, and found that their method obtains a higher BLEU score as long as the model trained on parallel data uses fewer than 300-400K parallel sentence pairs. For example, when 100K parallel sentence pairs are used, their method outperforms the supervised model by +2.6 BLEU.

| Model | < 100K | 100-300K | > 300K |
|---|---|---|---|
| Architecture independent | | | |
| Back-Translation | - | 2.7 | - |
| Round Trip Training | - | - | 4.7 |
| Copied Monolingual Data | - | 1.2 | - |
| Shallow Fusion | - | 0 | - |
| PostNorm | - | 0.57 | 2.36 |
| Architecture dependent | | | |
| Forward-Translation | - | - | 3.2 |
| Dummy Input | - | 0.6 | - |
| Deep Fusion | - | 1.19 | - |
| LM with Multi-Task learning | - | 0.95 | - |
| Pre-training of RNN-based model | - | - | - |
| Pre-training of Transformer | 1.4 | - | - |
| Input Sentences Reordering | - | - | 4.3 |
| Fully Unsupervised Learning | 17-2.6 | 2.6-0 | 0 |

TABLE I: Positive BLEU improvements of covered models depending on size of parallel corpus (number of sentence pairs) used for training.

V Comparison

The common issue with the reviewed models is that the majority of them are evaluated on different datasets, so it’s difficult to compare their effectiveness. However, some trends can be observed across the models, and we discuss them below. The results of all methods are collected in Table I.

The results obtained using Architecture Independent methods show that approaches which use an Additional Pseudo-parallel Corpus achieve significant improvements over the baseline NMT model. Especially high scores were obtained for the Round Trip Training method, but the evaluation was performed on 2.56M parallel sentences, so additional experiments with less parallel data are required to prove its effectiveness for low-resource MT. Back-Translation shows a good result of +2.7 BLEU using 300K parallel sentences. Methods exploiting a Separate Language Model, however, don’t obtain such high scores. Nevertheless, methods from these two subcategories may be applied together to get even better results.

Among Architecture Dependent methods, Input Sentences Reordering shows the exceptional result of +4.3 BLEU over the baseline, evaluated on 620K parallel sentences. Forward-Translation, a technique which uses an Additional Pseudo-parallel Corpus and parameter freezing, also gives a high score of +3.2 BLEU over the baseline on the same amount of parallel data. In contrast to Architecture Independent methods, Architecture Dependent methods of leveraging language models show better results when evaluated on small parallel corpora. Deep Fusion, a method to integrate a language model into the NMT architecture, gives +1.19 BLEU for a low-resource language pair with 160K parallel sentences. Another approach, where the Transformer model is pre-trained with language models, outperforms the baseline Transformer by +1.4 BLEU using only 20K parallel sentences.

VI Open Research Questions

Monolingual data can be characterized by its amount and domain. Stahlberg et al. [21] demonstrate that, for Back-Translation, increasing the amount of monolingual data improves translation quality only up to some point, after which quality starts to degrade. Zhang and Zong [24] also show that partial use of monolingual data is beneficial, but in their case only the most relevant part of it is used. Gülcehre et al. [9] further highlight the importance of the language domain.

Some questions which remain unanswered are: 1) How to determine the optimal proportion of monolingual data when it comes from the same or a different domain? 2) What are the effective domain adaptation techniques when using language models and when using raw monolingual data? 3) How to use monolingual data to its full potential, such that a model can benefit from out-of-domain data too?

It is not always possible to acquire monolingual data in the same domain as the parallel data, especially for low-resource language pairs; therefore, answering the above questions can be highly beneficial.
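To make the first question concrete, the sketch below shows one way the two knobs, relevance and proportion, could be exposed when preparing monolingual data. It is only an illustration under stated assumptions: vocabulary coverage against the parallel corpus is a hypothetical relevance score chosen for simplicity, not the criterion used in [24], and `ratio` is a free parameter whose optimal value is exactly what remains open. The function names are introduced here and do not come from the reviewed works.

```python
from typing import List, Set


def vocabulary_coverage(sentence: str, vocab: Set[str]) -> float:
    """Fraction of whitespace tokens in `sentence` that appear in `vocab`."""
    tokens = sentence.split()
    if not tokens:
        return 0.0
    return sum(tok in vocab for tok in tokens) / len(tokens)


def select_monolingual(
    monolingual: List[str],
    parallel_side: List[str],
    ratio: float,
) -> List[str]:
    """Keep the best-covered sentences, at most ratio * len(parallel_side)."""
    vocab = {tok for sent in parallel_side for tok in sent.split()}
    budget = int(ratio * len(parallel_side))
    # sorted() is stable, so sentences with equal scores keep their order.
    ranked = sorted(
        monolingual,
        key=lambda s: vocabulary_coverage(s, vocab),
        reverse=True,
    )
    return ranked[:budget]


parallel_side = ["the cat sleeps", "a dog runs"]
monolingual = ["the dog sleeps", "quantum flux capacitor", "a cat runs fast"]
selected = select_monolingual(monolingual, parallel_side, ratio=1.0)
```

With `ratio=1.0` the toy example keeps as many monolingual sentences as there are parallel ones and drops the out-of-domain sentence; sweeping `ratio` and swapping the scoring function is one simple way to probe the questions above empirically.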

VII Conclusion

In this paper, we reviewed various approaches to leveraging monolingual data for low-resource MT. We divided the methods into Architecture Dependent and Architecture Independent categories, a split intended to be helpful from a practical viewpoint. We further split the methods in each of these broad categories, describing similarities and differences in their core ideas and reporting the available evaluation results together with the amount of data used. Finally, we compared both categories by the improvements their methods achieve and shared our view on what needs to be explored.


  • [1] M. Artetxe, G. Labaka, E. Agirre, and K. Cho (2018) Unsupervised neural machine translation. arXiv:1710.11041v2. Cited by: §IV.
  • [2] M. Artetxe, G. Labaka, and E. Agirre (2018-07) A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 789–798. Cited by: §IV.
  • [3] D. Bahdanau, K. Cho, and Y. Bengio (2015) Neural machine translation by jointly learning to align and translate. Proceedings of the International Conference on Learning Representations (ICLR). Cited by: §I-A, §III-C.
  • [4] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin (2010) A neural probabilistic language model. Journal of machine learning research, pp. 1137–1155. Cited by: §II-B.
  • [5] Y. Cheng, W. Xu, Z. He, W. He, H. Wu, M. Sun, and Y. Liu (2016) Semi-supervised learning for neural machine translation. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pp. 1965–1974. Cited by: §I, §II-A.
  • [6] C. Chu and R. Wang (2018) A survey of domain adaptation for neural machine translation. arXiv preprint arXiv:1806.00258. Cited by: §I.
  • [7] A. Currey, A. V. M. Barone, and K. Heafield (2017) Copied monolingual data improves low-resource neural machine translation. Proceedings of the Conference on Machine Translation (WMT), Volume 1, pp. 148–156. Cited by: §I, §II-A.
  • [8] T. Domhan and F. Hieber (2017) Using target-side monolingual data for neural machine translation through multi-task learning. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 1500–1505. Cited by: §I, §III-B.
  • [9] C. Gülcehre, O. Firat, K. Xu, K. Cho, L. Barrault, H. Lin, F. Bougares, H. Schwenk, and Y. Bengio (2015) On using monolingual corpora in neural machine translation. arXiv:1503.03535v2. Cited by: §I, §II-B, §III-B, §VI.
  • [10] M. Johnson, M. Schuster, Q. V. Le, M. Krikun, Y. Wu, Z. Chen, N. Thorat, F. Viegas, M. Wattenberg, G. Corrado, M. Hughes, and J. Dean (2016) Google’s multilingual neural machine translation system: enabling zero-shot translation. arXiv preprint arXiv:1611.04558. Cited by: §II-A.
  • [11] M. Koponen (2016) Is machine translation post-editing worth the effort? A survey of research into post-editing and effort. The Journal of Specialised Translation 25, pp. 131–148. Cited by: §I.
  • [12] G. Lample, M. Ott, A. Conneau, L. Denoyer, and M. Ranzato (2018) Phrase-based and neural unsupervised machine translation. arXiv:1804.07755v2. Cited by: §IV.
  • [13] M. Luong, Q. V. Le, I. Sutskever, O. Vinyals, and L. Kaiser (2016) Multi-task sequence to sequence learning. 4th International Conference on Learning Representations. Cited by: §II-A.
  • [14] T. Mikolov, M. Karafiat, L. Burget, J. Cernocky, and S. Khudanpur (2010) Recurrent neural network based language model. Eleventh Annual Conference of the International Speech Communication Association. Cited by: §II-B, §III-B.
  • [15] T. Mikolov, Q. V. Le, and I. Sutskever (2013) Exploiting similarities among languages for machine translation. arXiv:1309.4168v1. Cited by: §IV.
  • [16] P. Ramachandran, P. J. Liu, and Q. V. Le (2018) Unsupervised pretraining for sequence to sequence learning. arXiv:1611.02683v2. Cited by: §I, §III-C.
  • [17] R. Sennrich, B. Haddow, and A. Birch (2016) Improving neural machine translation models with monolingual data. arXiv:1511.06709v4. Cited by: §I, §II-A, §III-A, §III-B.
  • [18] R. Sennrich, B. Haddow, and A. Birch (2016) Neural machine translation of rare words with subword units. Proceedings of the 54th Annual Meeting of the ACL, pp. 1715–1725. Cited by: §II-A.
  • [19] I. Skorokhodov, A. Rykachevskiy, D. Emelyanenko, S. Slotin, and A. Ponkratov (2018) Semi-supervised neural machine translation with language models. Proceedings of AMTA 2018 Workshop: LoResMT 2018, pp. 37–44. Cited by: §I, §III-C.
  • [20] A. Sriram, H. Jun, S. Satheesh, and A. Coates (2017) Training seq2seq models together with language models. arXiv:1708.06426. Cited by: §II-B.
  • [21] F. Stahlberg, J. Cross, and V. Stoyanov (2019) Simple fusion: return of the language model. arXiv:1809.00125v2. Cited by: §I, §II-A, §II-B, §VI.
  • [22] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. Computing Research Repository, abs/1706.03762. Cited by: §I-A, §III-C.
  • [23] C. Wang, M. Collins, and P. Koehn (2007) Chinese syntactic reordering for statistical machine translation. Proceedings of EMNLP. Cited by: §III-D.
  • [24] J. Zhang and C. Zong (2016) Exploiting source-side monolingual data in neural machine translation. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 1535–1545. Cited by: §I, §III-A, §III-D, §VI.