The Source-Target Domain Mismatch Problem in Machine Translation

09/28/2019 ∙ by Jiajun Shen, et al. ∙ Facebook, Carnegie Mellon University

While we live in an increasingly interconnected world, different places still exhibit strikingly different cultures, and many events we experience in our everyday life pertain only to the specific place we live in. As a result, people often talk about different things in different parts of the world. In this work we study the effect of local context in machine translation and postulate that, particularly in low resource settings, this causes the domains of the source and target language to greatly mismatch, as the two languages are often spoken in regions of the world that are farther apart, with more distinctive cultural traits and unrelated local events. We first propose a controlled setting to carefully analyze the source-target domain mismatch and its dependence on the amount of parallel and monolingual data. Second, we test both a model trained with back-translation and one trained with self-training. The latter leverages in-domain source monolingual data but uses potentially incorrect target references. We found that these two approaches are often complementary to each other. For instance, on a low-resource Nepali-English dataset the combined approach improves upon the baseline using just parallel data by 2.5 BLEU points, and by 0.6 BLEU points when compared to back-translation.




1 Introduction

The use of language greatly varies with the geographic location (Firth, 1935; Johnstone, 2010). Even within places where people speak the same language (Britain, 2013), there is a lot of lexical variability due to change of style and topic distribution, particularly when considering content posted on social media, blogs and news outlets. For instance, while a primary topic of discussion between British sport fans is cricket, American sport fans are more likely to discuss other sports like baseball (Leech and Fallon, 1992).

The effect of local context on the use of language is even more extreme when considering regions where different languages are spoken. Despite the increasingly interconnected world we live in, people in different places tend to talk about different things. There are several reasons for this, from cultural differences due to geographic separation and history, to the local nature of many events we experience in our everyday life; e.g., the traffic congestion in Taipei is not affected by a heavy snowfall in New York City.

This phenomenon not only has interesting socio-linguistic aspects but also has strong implications for machine translation (Bernardini and Zanettin, 2004). In particular, machine translation of low-resource language pairs aims at automatically translating content in two languages that are often spoken in very distant geographic locations by people with rather different cultures. In machine learning terms and at a very high level of abstraction, this is akin to the problem of aligning two very high dimensional and sparsely populated point clouds. The learning problem is difficult not only because very few correspondences are provided to the learner, but also because the distributions of points are rather different.

As of today and to the best of our knowledge, machine translation has been based on the often implicit assumption that content in the two languages is comparable. Sentences comprising the parallel dataset used for training are assumed to cover the same topic distribution, regardless of the originating language. Similarly, monolingual corpora are assumed to be comparable, i.e. to cover the same distribution of topics albeit in two different languages.

Unfortunately, this assumption does not hold for the vast majority of language pairs, which are low-resource, and for the vast majority of the content produced every day on the Internet by means of blogs, social platforms and media outlets.

In this work, we then introduce and formalize the concept of source-target domain mismatch (STDM) which accounts for intrinsic differences (besides the language) between source and target originating sentences of both parallel and monolingual datasets used to train machine translation systems. We surmise that STDM may impact negatively the effectiveness of back-translation (Sennrich et al., 2015), which is de facto the best known approach to leverage monolingual data in low resource settings. When STDM is considerable, back-translation is less effective because even if the backward model were perfect, the back-translated data is out-of-domain relative to the source domain from which we aim to translate.

In order to study the effects of STDM we introduce a controlled setting that any researcher can easily reproduce to finely tune the amount of domain mismatch. Using this synthetic setting, we study how the composition of the parallel data and the amount of monolingual and parallel data affect translation quality. Besides back-translation we also investigate self-training (Yarowsky, 1995) as an approach to better leverage in-domain monolingual data on the source side.

Our empirical validation demonstrates that back-translation is often complementary to self-training. The former works better when there is a lot of target side monolingual data and when STDM is not very strong. The latter works better when source side monolingual data is more abundant and when STDM is more prominent. Moreover, the combination of self-training and back-translation often yields an improvement over each baseline method. We finally demonstrate these approaches on low resource language pairs like Nepali-English and Myanmar-English, and report improvements between 1 and 4 BLEU points over the baseline approach using parallel data only.

2 Related Work

The observation that topic distributions and various kinds of lexical variabilities depend on the local context has been known and studied for a long time. For instance, Firth (1935) says “Most of the give-and-take of conversation in our everyday life is stereotyped and very narrowly conditioned by our particular type of culture”. In her seminal work, Johnstone (2010) analyzed the role of place in language, focussing on lexical variations within the same language, a subject further explored by Britain (2013). Some of these works were the basis for later studies that introduced computational models for how language changes with geographic location (Mei et al., 2006; Eisenstein et al., 2010).

Moving to cross-lingual analyses, there has been work at the intersection of linguistics and cognitive science (Pederson et al., 1998) showing how certain linguistic codings vary across languages, and how these affect how people form mental concepts. In machine translation, researchers have often made an explicit assumption on the use of comparable corpora (Fung and Yee, 1998; Munteanu et al., 2004; Irvine and Callison-Burch, 2013), i.e. corpora in the two languages that roughly cover the same set of topics. Unfortunately, monolingual corpora are seldom comparable in practice. Leech and Fallon (1992) analyze two comparable corpora, one in American English and the other in British English, and demonstrate differences that reflect the cultures of origin. Similarly, Bernardini and Zanettin (2004) observe that parallel datasets built for machine translation exhibit strong biases in the selection of the original documents, making the text collection not quite comparable.

The non-comparable nature of machine translation datasets is even more striking when considering low resource language pairs, for which differences in local context and cultures are more pronounced. Recent studies (Søgaard et al., 2018; Neubig and Hu, 2018) have warned that removing the assumption on comparable corpora strongly deteriorates performance of lexicon induction techniques which are at the foundation of machine translation.

To the best of our knowledge, no prior work has so far made explicit the intrinsic mismatch between source and target domain in machine translation, both when considering the portion of the parallel dataset originating in the source and target language, and when considering the source and target monolingual corpora. We believe that this is an important characteristic of machine translation tasks, particularly when the content is derived from blogs, social media platforms, and news outlets. We shall not attempt to make corpora comparable, because it would change the nature of the actual task!

Back-translation (Sennrich et al., 2015) has been the workhorse of modern neural MT, enabling very effective use of target side monolingual data. Back-translation is beneficial because it helps regularize the model and adapt it to new domains (Burlot and Yvon, 2018). However, the typical setting of current MT benchmarks, as popularized by recent WMT competitions (Bojar et al., 2019), is a mismatch between training and test sets, as opposed to a mismatch between source and target domains. In this setting, vast amounts of target monolingual data in the domain of the test set can be leveraged very effectively by back-translation. Unfortunately, back-translation is much less effective when dealing with STDM, as we will show in §5.1.1.

There has been some work attempting to make better use of source side monolingual data, as this is in-domain with the text we would like to translate at test time. Ueffing (2006) proposed to improve a statistical MT system using self-training (Yarowsky, 1995), a direction later pursued by Zhang and Zong (2016) for neural MT. In our work, we also consider this baseline approach with a few important differences (He et al., 2019): i) we train all parameters of the model as opposed to just the encoder parameters, ii) we apply noise to the input and iii) we use it in an iterative fashion as in the algorithm originally proposed by Yarowsky (1995). Furthermore, we show consistent improvements when combining self-training with back-translation.

3 The STDM Problem

Figure 1: Illustration of domain mismatch in machine translation. Each block corresponds to a dataset. Filled blocks represent data naturally occurring in a certain language. Empty blocks are human translations. Blocks in the top row are in the source language, blocks in the bottom row are in the target language. Blue blocks are in domain A, red blocks are in domain B. Left: in the traditional setting (here over-simplified), parallel data in both directions belong to the same domain, dubbed domain A, while the test set may be in another domain, domain B. For high-resource language pairs, we typically possess monolingual or small parallel datasets in domain B, enabling standard domain-adaptation approaches (e.g., back-translation). Right: in the source-target domain mismatch setting, typical of low resource language pairs, the parallel and monolingual data originating in the source language belong to domain A, while the parallel and monolingual data originating in the target language belong to domain B. At test time, we ask for a translation of a sentence originating in the source language and belonging to domain A. In this scenario, back-translation is less effective and self-training may be used to improve translation quality.

In this section we formalize the definition of Source-Target Domain Mismatch (STDM). This is an intrinsic property of the data which is independent of the particular machine translation system under consideration.

Machine translation systems are often trained on several datasets, which may contain either parallel or monolingual data. In this work, we assume access to only small quantities of parallel data, but relatively large quantities of monolingual data in the source and target languages. See Fig. 1 for a toy illustration. We denote by $M_S$ and $M_T$ the source and target monolingual data, respectively.

Furthermore, we assume that there are two distinct domains: the domain of the source language, $\mathcal{D}_S$, and the domain of the target language, $\mathcal{D}_T$. We make the natural assumption that text originating in the source language belongs to the source domain and that text originating in the target language belongs to the target domain.

We are interested both in translating from the source to the target language, and vice versa. Accordingly, we assume that a portion of the parallel data originates in the source language, while the remainder originates in the target language. We denote by $P_S$ the portion of the parallel dataset originating in the source language, and by $P_T$ the one originating in the target language.

Our assumptions on the existence of source and target domains are expressed by: $P_S, M_S \sim \mathcal{D}_S$, and $P_T, M_T \sim \mathcal{D}_T$. Thus even if we were to perfectly translate, i.e. if we could disregard the artifacts introduced by the translation process (Baker, 1993; Zhang and Toral, 2019; Toury, 2012), the distributional properties of source-originating text would be different from target-originating text. As mentioned in §2, even “comparable” corpora are affected by such domain mismatch to some extent.

The goal of a machine translation system trained on such data is to accurately translate text originating in the source language and belonging to the source domain, and vice versa for the reverse direction.

The questions we aim to answer with this study are:

  1. Is STDM hampering the performance of back-translation?

  2. Is self-training useful in the presence of STDM?

  3. Is target-domain data useful to improve translation of source-domain data?

  4. Is there a controlled setting that lets us assess how an algorithm works as a function of the degree of STDM?

4 Baseline Algorithms

In this section we review the basic learning algorithms we have considered. In this work, we use a state-of-the-art neural machine translation (NMT) system based on the transformer architecture (Vaswani et al., 2017) with subword vocabularies learned via byte-pair encoding (BPE; Sennrich et al., 2015). However, our analysis is not architecture-specific and we believe it extends to other systems as well.

The most basic model is trained on the parallel data using token-level cross-entropy loss with label smoothing (Szegedy et al., 2016) and dropout (Srivastava et al., 2014), as standard practice in the field. Next, we describe approaches that can leverage monolingual data as well.

4.1 Back-Translation (BT)

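Back-translation (BT) (Sennrich et al., 2015) is a popular and effective data augmentation technique that leverages target side monolingual data. The algorithm proceeds in three steps. First, a reverse machine translation system is trained from target to source using the provided parallel data $P$: $\overleftarrow{\theta} = \arg\max_{\theta} \mathbb{E}_{(x, y) \sim P} \log p(x \mid y; \theta)$. Then, the reverse model is used to translate the target monolingual data $M_T$: $\bar{x}_j = \arg\max_{z} p(z \mid y_j; \overleftarrow{\theta})$, for $y_j \in M_T$ and $j = 1, \dots, |M_T|$. The maximization is typically approximated by beam search. Finally, the forward model is trained over the concatenation of the original parallel and back-translated data: $\overrightarrow{\theta} = \arg\max_{\theta} \mathbb{E}_{(x, y) \sim \mathcal{A}} \log p(y \mid x; \theta)$ with $\mathcal{A} = P \cup Q$ and $Q = \{(\bar{x}_j, y_j)\}_{j}$. In practice, the parallel data is weighted more in the loss, with a weight selected via hyper-parameter search on the validation set.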

BT provides three main benefits in practice: (1) since parallel datasets are typically small, augmenting the training set with large quantities of BT data improves generalization; (2) when the target side monolingual data and test set are from the same domain, BT helps adapt to the domain of the test data; (3) BT improves the model’s fluency (Edunov et al., 2018, 2019).

In the context of STDM however, BT has a potential weakness. Even if the reverse model were to produce perfect translations, back-translated data belongs to the target domain, and it is therefore out-of-domain with the data we wish to translate, i.e., source sentences belonging to the source domain. We will verify this conjecture empirically in §5.1.1.
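The data flow of back-translation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `back_translate`, `toy_reverse_translate` and the tiny lexicon are hypothetical names invented here, and the lexicon lookup stands in for a trained reverse model decoded with beam search. The point it makes is that the target side of every synthetic pair is genuine text, while the source side is machine-generated and inherits the domain of the target monolingual data.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def back_translate(reverse_translate: Callable[[str], str],
                   target_mono: List[str]) -> List[Pair]:
    """Pair each genuine target sentence with a synthetic source sentence."""
    return [(reverse_translate(y), y) for y in target_mono]


# Hypothetical stand-in for a trained target-to-source (English-to-French) model.
TOY_LEXICON = {"the": "le", "cat": "chat", "sleeps": "dort"}


def toy_reverse_translate(sentence: str) -> str:
    # Word-by-word lookup; unknown words are copied through unchanged.
    return " ".join(TOY_LEXICON.get(word, word) for word in sentence.split())


parallel = [("le chat", "the cat")]                  # original parallel data
target_mono = ["the cat sleeps", "the dog sleeps"]   # target monolingual data
synthetic = back_translate(toy_reverse_translate, target_mono)
train_set = parallel + synthetic                     # data for the forward model
print(synthetic)
```

Note that the second synthetic source contains an untranslated word: errors of the reverse model land on the input side, so the forward model still learns to produce clean targets.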

4.2 Self-Training (ST)

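1 Data: Given a parallel dataset $P$ and a source monolingual dataset $M_S$;
2 Noise: Let $n(\cdot)$ be a function that adds noise to the input by dropping, swapping and blanking words;
3 Hyper-params: Let $N$ be the number of iterations and $K_n$ be the number of samples to add at each iteration;
4 Train a forward model: $\overrightarrow{\theta} = \arg\max_{\theta} \mathbb{E}_{(x, y) \sim P} \log p(y \mid x; \theta)$;
5 for $n$ in $[1 \dots N]$ do
       6 forward-translate data: $(\bar{y}_j, s_j) = \arg\max_{z} p(z \mid x_j; \overrightarrow{\theta})$, for $x_j \in M_S$ and $j = 1, \dots, |M_S|$, where $s_j$ is the score of the $j$-th example;
       7 Let $\mathcal{I}$ be the index set of the top-$K_n$ highest scoring examples according to $s$;
       8 re-train forward model: $\overrightarrow{\theta} = \arg\max_{\theta} \mathbb{E}_{(x, y) \sim \mathcal{A}} \log p(y \mid n(x); \theta)$ with $\mathcal{A} = P \cup Q$ and $Q = \{(x_j, \bar{y}_j)\}_{j \in \mathcal{I}}$.
end for
Algorithm 1 Self-Training learning algorithm.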

Self-Training (ST) (Yarowsky, 1995), shown in Alg. 1, is another method for data augmentation that instead leverages monolingual data on the source side. First, a baseline forward model is trained on the parallel data (line 4). Second, this initial model is applied to the source monolingual data (line 6). Finally, the forward model is re-trained from random initialization by augmenting the original parallel dataset with the forward-translated data. As with BT, the parallel dataset receives more weight in the loss.

A potential benefit of this approach is that the synthetic parallel data added to the original parallel dataset is in-domain, as it comes from the source monolingual data. This is a crucial advantage of ST over BT in the STDM setting. A potential shortcoming is that the model may reinforce its mistakes since synthetic targets are produced by the model itself.

We introduce two methods to mitigate this issue. First, we make the algorithm iterative and add only the examples for which the model was most confident (line 3, loop in line 5 and line 7 where we select sentences with largest average per-token log-probability). Second, we inject noise to the input sentences to further improve generalization (line 8).
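A minimal Python sketch of the iterative procedure may help make the two mitigations concrete. Everything here is an assumption made for illustration: `add_noise`, `self_train`, `toy_train`, the noise probabilities and the selection schedule are hypothetical, and the word-lookup "model" merely stands in for an NMT system that returns a translation together with per-token log-probabilities.

```python
import random
from typing import Callable, List, Tuple

Tokens = List[str]
Pair = Tuple[Tokens, Tokens]
Model = Callable[[Tokens], Tuple[Tokens, List[float]]]


def add_noise(tokens: Tokens, rng: random.Random, p_drop: float = 0.1,
              p_blank: float = 0.1, max_shift: float = 3.0) -> Tokens:
    """Drop words, blank words, and locally swap words (assumed noise settings)."""
    kept = [t for t in tokens if rng.random() >= p_drop] or list(tokens[:1])
    blanked = ["<blank>" if rng.random() < p_blank else t for t in kept]
    # Sorting by position plus a bounded random offset yields local swaps.
    keys = [i + rng.uniform(0.0, max_shift) for i in range(len(blanked))]
    return [t for _, t in sorted(zip(keys, blanked), key=lambda kt: kt[0])]


def self_train(train: Callable[[List[Pair]], Model], parallel: List[Pair],
               source_mono: List[Tokens], schedule: List[int],
               rng: random.Random) -> Model:
    model = train(parallel)                                   # Alg. 1, line 4
    for k in schedule:                                        # line 5
        scored = []
        for x in source_mono:                                 # line 6
            y_hat, logprobs = model(x)
            score = sum(logprobs) / max(len(logprobs), 1)     # avg. per-token log-prob
            scored.append((score, x, y_hat))
        scored.sort(key=lambda item: item[0], reverse=True)
        top = scored[:k]                                      # line 7
        synthetic = [(add_noise(x, rng), y) for _, x, y in top]
        model = train(parallel + synthetic)                   # line 8, from scratch
    return model


def toy_train(pairs: List[Pair]) -> Model:
    """Hypothetical trainer: memorises a position-aligned word lexicon."""
    lexicon = {}
    for src, tgt in pairs:
        for s, t in zip(src, tgt):
            lexicon.setdefault(s, t)

    def model(src: Tokens) -> Tuple[Tokens, List[float]]:
        out = [lexicon.get(s, "<unk>") for s in src]
        return out, [0.0 if s in lexicon else -5.0 for s in src]

    return model


final_model = self_train(toy_train, [(["le", "chat"], ["the", "cat"])],
                         [["le", "chat"], ["le", "chien"]],
                         schedule=[1, 2], rng=random.Random(0))
```

The growing schedule mirrors the idea of admitting only the most confident examples first; noise is applied to the source side only, so the synthetic targets are left as the model produced them.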

4.3 Combining BT and ST

BT and ST are clearly complementary to each other. The former has the advantage of always using correct targets but synthetic data is out-of-domain when there is STDM. The latter has the advantage of using in-domain source sentences but synthetic targets may be inaccurate. We therefore consider their combination as an additional baseline approach.

The combined learning algorithm proceeds in three steps. First, we train an initial forward and reverse model using the parallel dataset. Second, we back-translate target side monolingual data using the reverse model (see §4.1) and iteratively forward translate source side monolingual data using the forward model (see §4.2 and Alg. 1). We then retrain the forward model from random initialization using the union of the original parallel dataset, the synthetic back-translated data, and the synthetic forward translated data at the last iteration of the ST algorithm.
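One way to picture the final retraining step is as a weighted loss over the three data sources. The sketch below is an assumption rather than the exact objective used in the experiments: the function name `weighted_corpus_loss`, the normalisation by total weight, and the default weight are all hypothetical, and `example_loss` stands in for the per-sentence negative log-likelihood of a real model.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def weighted_corpus_loss(example_loss: Callable[[str, str], float],
                         parallel: List[Pair],
                         back_translated: List[Pair],
                         forward_translated: List[Pair],
                         parallel_weight: float = 2.0) -> float:
    """Weighted mean loss over genuine and synthetic pairs.

    Genuine parallel pairs get `parallel_weight` (an assumed value that would
    be tuned on validation data); synthetic pairs get weight 1.
    """
    groups = [(parallel, parallel_weight),
              (back_translated, 1.0),
              (forward_translated, 1.0)]
    total, norm = 0.0, 0.0
    for pairs, weight in groups:
        for src, tgt in pairs:
            total += weight * example_loss(src, tgt)
            norm += weight
    return total / norm


# Toy per-example loss: the target length (a stand-in for a real NLL).
toy_loss = lambda src, tgt: float(len(tgt))
loss = weighted_corpus_loss(toy_loss,
                            parallel=[("a", "xx")],
                            back_translated=[("b", "yyyy")],
                            forward_translated=[],
                            parallel_weight=3.0)
print(loss)
```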

5 Results

In this section, we first introduce a controlled setting to study STDM and report a detailed analysis of the influence of various factors, such as the extent to which target originating data is out-of-domain, and the effect of monolingual data size. We then report experiments on genuine low resource language pairs, namely English-Myanmar and English-Nepali, and conclude with an ablation study on ST.

5.1 Controlled Setting

Figure 2: Illustration of the controlled setting varying the amount of mismatch between the source and the target domain. The source domain is taken from EuroParl. The target domain is the mixture $\alpha \cdot$ EuroParl $+ (1 - \alpha) \cdot$ OpenSubtitles. By varying $\alpha \in [0, 1]$, we vary the amount of mismatch.

It is not obvious how to measure STDM. Particularly for low resource language pairs, there is often not enough data translated from the source domain to compute meaningful statistics. Even if we had sufficient parallel data, it would be difficult to factor out the effect of translationese from pure source-target domain mismatch.

Accordingly, we introduce a synthetic benchmark that enables us to finely control the domain of the target originating data, and therefore the amount of STDM. The key idea of this controlled setting is to consider as target originating data, which comprises half of the parallel training data and the target side monolingual data (see Fig. 1), a convex combination of training data from two sufficiently different domains. In this work we use EuroParl (Koehn, 2005) as our source originating data, while our target originating data contains a mix of data from EuroParl and OpenSubtitles (Lison and Tiedemann, 2016), see Fig. 2 for an illustration.

Specifically, we consider a French to English translation task with a parallel dataset composed of 10,000 sentences from EuroParl (which originate in French) and 10,000 sentences from the target domain (which originate in English). The source monolingual data consists of sentences from EuroParl (not overlapping with the parallel set), while the target monolingual data consists of sentences from the target domain. For $\alpha \in [0, 1]$, a fraction $\alpha$ of the target originating data is taken from EuroParl and the remaining $1 - \alpha$ from OpenSubtitles. The test set comprises sentences all in French and originating from the EuroParl source domain.

For instance, when $\alpha = 0$ the target domain is totally out-of-domain with respect to the source domain. The parallel dataset has equal proportions of EuroParl and OpenSubtitles sentences. The source monolingual dataset is all from EuroParl while the target monolingual dataset is all from OpenSubtitles. This is the most extreme case of STDM, as depicted in Fig. 1 right hand side.

When $\alpha = 1$, the target domain matches the source domain perfectly and all the data comes from EuroParl. For intermediate values of $\alpha$, the target domain only partially overlaps with the source domain. In other words, $\alpha$ lets us precisely control the amount of STDM.
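The construction of the target originating data can be sketched as follows. This is a hypothetical helper, not the authors' preprocessing code: the function name `mix_target_domain`, the argument `alpha` (the mixing fraction described above), the rounding rule, and the placeholder corpora are all assumptions made for illustration.

```python
import random
from typing import List


def mix_target_domain(in_domain: List[str], out_of_domain: List[str],
                      alpha: float, size: int, seed: int = 0) -> List[str]:
    """Sample `size` sentences: a fraction `alpha` from the in-domain corpus
    (e.g. EuroParl) and the remainder from the out-of-domain corpus
    (e.g. OpenSubtitles)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    rng = random.Random(seed)
    n_in = round(alpha * size)
    n_out = size - n_in
    # Sampling without replacement keeps the selected sentences distinct.
    mixed = rng.sample(in_domain, n_in) + rng.sample(out_of_domain, n_out)
    rng.shuffle(mixed)
    return mixed


# Placeholder corpora standing in for the real datasets.
europarl = [f"europarl sentence {i}" for i in range(100)]
subtitles = [f"subtitle line {i}" for i in range(100)]

mixed = mix_target_domain(europarl, subtitles, alpha=0.25, size=40)
n_europarl = sum(s.startswith("europarl") for s in mixed)
print(n_europarl, len(mixed) - n_europarl)
```

Setting `alpha=0.0` reproduces the extreme mismatch case, in which every target originating sentence comes from the out-of-domain corpus, and `alpha=1.0` reproduces the fully matched case.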

We perform hyper-parameter search for the model architectures and BPE size on the validation set. For all the experiments in this controlled setting, we use a 5-layer transformer architecture with 8M parameters when training on datasets with fewer than 300K parallel sentences, and a bigger 5-layer transformer architecture with a total of 110M parameters when training on bigger datasets. The BPE size is 5,000. We report SacreBLEU (Post, 2018) for both languages.

5.1.1 Varying Amount of Domain Mismatch

Figure 3: BLEU score in Fr-En as a function of the degree to which the target originating data is in-domain. $\alpha = 0$: fully out-of-domain. $\alpha = 1$: fully in-domain.

In our first experiment reported in Fig. 3, we benchmark our baseline approaches while varying $\alpha$ (see §5.1), which controls the overlap between source and target domain.

First, we observe that increasing $\alpha$ improves performance of all methods according to BLEU (Papineni et al., 2002). Second, there is a big gap between the baseline trained using the parallel data only, and methods which leverage monolingual data. Third, combining ST and BT works better than each individual method, showing that indeed these approaches are complementary. Finally, BT works better than ST but the gap reduces as the target domain becomes increasingly different from the source domain (small values of $\alpha$). In the extreme case of STDM for $\alpha = 0$, ST actually outperforms BT. In fact, we observe that the gain of BT over the baseline decreases as $\alpha$ decreases (notice that the amount of monolingual data and parallel data is always constant across all these experiments). Therefore, BT does suffer when there is strong STDM.

5.1.2 Varying the amount of monolingual data

Figure 4: BLEU in Fr-En as a function of the amount of monolingual data when there is extreme STDM ($\alpha = 0$).

We next explore how the quantity of monolingual data affects performance, and whether the relative gain of ST over BT when $\alpha = 0$ disappears as BT is provided with more monolingual data.

The experiment in Fig. 4 shows that a) the gain in terms of BLEU tapers off exponentially with the amount of data (notice the log-scale in the x-axis), b) for the same amount of monolingual data ST is always better than BT and by roughly the same amount, and c) BT would require about 3 times more target monolingual data (which is out-of-domain) to yield the performance of ST.

5.1.3 Varying the amount of in-domain data

Figure 5: BLEU in Fr-En for various learning algorithms comparing the case where we use only source originating in-domain data (blue bars) and when we also add out-of-domain target originating data (green bars), with $\alpha = 0$.

We now explore whether, in the presence of extreme STDM ($\alpha = 0$), it may be worth restricting the training data to only contain in-domain source originating sentences (with the notation introduced in §3, the target originating parallel data $P_T = \emptyset$ and the target monolingual data $M_T = \emptyset$). In Fig. 5, we compare the restricted and unrestricted settings for various combinations of parallel, BT and ST training. Across all settings, we find that it is better to include the out-of-domain data originating on the target side (green bars) as opposed to only the in-domain source originating data (blue bars). It appears that, particularly in the low resource settings considered here, neural models benefit from all available data even if this data is out-of-domain.

Figure 6: BLEU score as a function of the proportion $\beta$ of parallel data originating in the source domain. When $\beta = 0$ all parallel data originates from OpenSubtitles (out-of-domain); when $\beta = 1$ all parallel data originates from EuroParl (in-domain). The blue curves show BLEU in the forward direction (Fr-En translation of EuroParl data). The red curves show BLEU in the reverse direction (En-Fr translation of OpenSubtitles sentences). The three curves show BLEU for models trained using only parallel data, only synthetic back-translated data and the union of the two.

Next, we control for the quantity of parallel training data (fixed at 20,000 sentences) and explore whether there exists an optimal ratio of in-domain to out-of-domain parallel data in the presence of extreme STDM ($\alpha = 0$), keeping the target monolingual data unchanged and composed only of OpenSubtitles sentences. It is not obvious what the optimal ratio may be, a priori, particularly when applying back-translation, which could be made more effective by training the reverse model with some target domain parallel data.

Following our earlier synthetic setting, we introduce a hyperparameter $\beta \in [0, 1]$ which controls the ratio between source domain (in-domain) and target domain (out-of-domain) parallel data. We again consider EuroParl to be our source domain and OpenSubtitles to be our target domain, with a parallel dataset containing 20,000 sentences and 900,000 target domain monolingual sentences. When $\beta = 0$, all parallel data comes from OpenSubtitles, while when $\beta = 1$, all parallel data comes from EuroParl.

Fig. 6 shows that the best way to compose the parallel data is by taking all sentences from EuroParl ($\beta = 1$) when translating from French to English (blue curves). At high values of $\beta$, we observe a slight decrease in accuracy for models trained only on back-translated data (dotted line), confirming that BT loses its effectiveness when the reverse model is trained on out-of-domain data. However, this is compensated by the gains brought by the additional in-domain parallel sentences (dashed line). In the more natural setting in which the model is trained on both parallel and back-translated data (dash-dotted line), we see monotonic improvement in accuracy with $\beta$, with optimal accuracy reached at $\beta = 1$ (i.e., all parallel data is in-domain).

A similar trend is observed in the other direction (English to French, red lines). Therefore, if the goal is to maximize translation accuracy in both directions, an intermediate value of $\beta$ ($\beta = 0.5$) is more desirable. This is the setting we used previously in §5.1.1 and §5.1.2. Note that the performance of the English to French model trained on parallel data drops at small $\beta$, even though such settings contain more in-domain parallel data than larger values of $\beta$. This is because the OpenSubtitles dataset has shorter sentences on average, so the parallel data contains fewer tokens when $\beta$ decreases, which negatively affects model performance.

5.2 Low-Resource MT

With the findings from the controlled-setting experiments, we test our approaches on low-resource language pairs, namely English-Myanmar and English-Nepali. Myanmar and Nepali are spoken in regions with unique local context which is very distinct from English-speaking regions, making these two language pairs a good use case for studying the STDM setting in real life.

5.2.1 English-Myanmar

For English to Myanmar we use the parallel data provided in the WAT 2019 competition (Nakazawa et al., 2019), which consists of two datasets. The Asian Language Treebank (ALT) corpus (Thu et al., 2016; Ding et al., 2018, 2019) has 18,088 training sentences, 1,000 validation sentences and 1,018 test sentences from English originating news articles. The UCSY dataset contains 204,539 sentences from various domains, including news articles and textbooks. The test set is taken from the ALT dataset.

For English monolingual data, we use the 2018 Newscrawl dataset provided by WMT, where the domain of the corpus is also news. We apply the fastText classifier (Joulin et al., 2017) over the individual sentences to filter out non-English sentences. For Myanmar monolingual data, we use the language split Commoncrawl data from Buck et al. (2014), which includes texts in various domains crawled from the web. We use the myanmar-tools library to classify and convert all Zawgyi text to Unicode. We use 5M unique English sentences and 100k unique Myanmar sentences as our monolingual data.

To summarize, and comparing to our idealized setting of Fig. 1, this dataset has a small in-domain parallel dataset from English to Myanmar (ALT), an out-of-domain parallel dataset (UCSY), a small out-of-domain monolingual corpus in Myanmar and a large monolingual corpus in English which is in-domain with ALT. Therefore, a priori we would expect ST to be more useful than BT when translating from English to Myanmar.

The model architecture and BPE size are selected by hyper-parameter search on the ALT validation set. We use a 5-layer transformer architecture with 42M parameters for the model trained on parallel data only, and a 6-layer transformer architecture with 186M parameters for models trained with both parallel and monolingual data. The BPE size is 10,000. We report the system performance on the ALT test set following the same evaluation protocol as the WAT 2019 English-Myanmar subtask; see Chen et al. (2019) for more details on BLEU calculation.

In Table 1, we observe that back-translation barely outperforms the baseline model, by 0.6 BLEU points, while self-training improves by 2.5 BLEU points. This is because the source side monolingual data is in-domain with the test set, and we have more source side monolingual data than target side monolingual data. We also observe that combining self-training and back-translation outperforms each individual method only slightly.

Model      En→My
baseline   32.8
BT         33.4
ST         35.3
ST + BT    35.4
Table 1: BLEU scores for the English to Myanmar translation task.

5.2.2 English-Nepali

We collect an English-Nepali parallel dataset by selecting sentences from public posts in English and Nepali and translating these sentences into the other language. This dataset is composed of 40,000 sentences originating in Nepali and only 7,500 sentences originating in English. We also have 1.8M monolingual sentences in Nepali and 1.8M monolingual sentences in English, also collected from public posts. This dataset is remarkably similar to our idealized setting of Fig. 1 right hand side, except that the two portions of the parallel dataset are grossly uneven.

The model architecture and BPE size are selected by hyper-parameter search on the validation set. We use a 5-layer transformer architecture with 39M parameters when training on parallel data alone, and a 6-layer model with 131M parameters when training on the parallel dataset augmented with synthetic data. The BPE size is 5,000 and we report tokenized BLEU score on both languages.

We consider the translation task in both directions and we report the results in Table 2. First, since our parallel training data contains more in-domain Nepali originating sentences, the baseline models trained on the parallel data should be better at translating Nepali sentences than English sentences. As a result, in the Ne→En translation task, we observe that augmenting the original parallel dataset by forward-translating the Nepali monolingual dataset achieves a big gain compared to the baseline. In the reverse direction, however, self-training does not improve significantly. Second, despite the domain mismatch, back-translation for Ne→En still achieves much better performance than the baseline and the gain is almost on par with self-training. The En→Ne direction is a more favorable setting for back-translation, as the reverse model is more in-domain with the Nepali monolingual data, and it shows a 2.1 BLEU point gain over the baseline. Third, we find that in both directions combining ST and BT works better than each individual method, particularly when ST and BT have comparable performance, showing that these approaches are indeed complementary.

Model      Ne→En   En→Ne
baseline   20.4    10.1
BT         22.3    12.2
ST         22.1    10.5
ST + BT    22.9    12.3
Table 2: BLEU scores for English-Nepali comparing various ways of using monolingual datasets during training. Combining ST and BT works better than each individual method, particularly when ST and BT perform comparably.
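The gains discussed in this section follow directly from Table 2. The short sketch below recomputes them; the scores are copied from the table, while the dictionary layout and the helper name `gains_over_baseline` are only illustrative.

```python
# BLEU scores copied from Table 2, as (Ne->En, En->Ne).
table2 = {
    "baseline": (20.4, 10.1),
    "BT": (22.3, 12.2),
    "ST": (22.1, 10.5),
    "ST + BT": (22.9, 12.3),
}

def gains_over_baseline(scores, baseline="baseline"):
    """Return the BLEU difference of each method relative to the baseline."""
    base = scores[baseline]
    return {
        name: tuple(round(s - b, 1) for s, b in zip(values, base))
        for name, values in scores.items()
        if name != baseline
    }

gains = gains_over_baseline(table2)
for name, (ne_en, en_ne) in gains.items():
    print(f"{name:8s} Ne->En {ne_en:+.1f}   En->Ne {en_ne:+.1f}")
```

The differences are rounded to one decimal place to match the precision of the table.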

5.3 Ablation Study on ST

We conduct an ablation study to understand the effect of iterative training and of adding source-side noise on self-training. In particular, we consider a parallel dataset with 10,000 sentences from EuroParl and 10,000 sentences from OpenSubtitles. We also have 900,000 source monolingual sentences available for self-training. We perform four iterations of self-training, gradually increasing K, the number of highest-scoring examples we select for training in each iteration.
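The iterative procedure can be sketched as follows. This is a simplified outline under stated assumptions: `train` and `translate` are hypothetical callables standing in for a real NMT toolkit, `translate` is assumed to return a hypothesis together with a model score, and the model is assumed to be retrained on the parallel data plus the selected synthetic pairs at each iteration. The schedule of K values mirrors the one listed in Table 3.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]

def select_top_k(scored: List[Tuple[str, str, float]], k: int) -> List[Pair]:
    """Keep the k (source, hypothesis) pairs with the highest model score."""
    ranked = sorted(scored, key=lambda item: item[2], reverse=True)
    return [(src, hyp) for src, hyp, _ in ranked[:k]]

def iterative_self_training(
    parallel: List[Pair],
    monolingual_src: List[str],
    train: Callable,       # hypothetical: train(pairs) -> model
    translate: Callable,   # hypothetical: translate(model, src) -> (hyp, score)
    schedule: List[int],
):
    """Run one round of self-training per value of K in `schedule`."""
    model = train(parallel)
    for k in schedule:
        # Forward-translate all source monolingual sentences with the current model.
        scored = [(src, *translate(model, src)) for src in monolingual_src]
        # Keep only the K translations the model is most confident about.
        synthetic = select_top_k(scored, k)
        # Assumption: retrain on real plus selected synthetic pairs.
        model = train(parallel + synthetic)
    return model

# Values of K per iteration, as listed in Table 3.
SCHEDULE = [30_000, 100_000, 300_000, 900_000]
```

Starting with a small K restricts early iterations to the translations the model scores highest, and later iterations admit more of the monolingual data as the model improves.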

In Table 3, we observe that iterative self-training performs better than original self-training (30.2 vs. 28.9 BLEU), showing the advantage of first adding the training examples for which the model is most confident. Moreover, adding source-side noise to iterative self-training further improves the BLEU score by 0.6 points (30.8 vs. 30.2). Therefore, injecting source-side noise when doing iterative self-training is the setting yielding the best performance for ST.
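As an illustration of what source-side noise can look like, the sketch below perturbs a source sentence by randomly dropping words and locally shuffling the remainder. The exact noise used in the experiments is not specified in this section, so the noise types, the function name `add_source_noise`, and the default values of `drop_prob` and `max_shift` are all assumptions made for illustration.

```python
import random

def add_source_noise(sentence, drop_prob=0.1, max_shift=3, rng=None):
    """Perturb a whitespace-tokenized source sentence.

    Assumed noise (not confirmed by the text above):
      1. drop each token independently with probability `drop_prob`;
      2. locally shuffle the remaining tokens by sorting them on
         position + uniform noise in [0, max_shift].
    """
    rng = rng or random.Random()
    tokens = sentence.split()
    kept = [tok for tok in tokens if rng.random() >= drop_prob]
    if not kept and tokens:
        kept = [rng.choice(tokens)]  # never return an empty sentence
    keys = [i + rng.uniform(0, max_shift) for i in range(len(kept))]
    shuffled = [tok for _, tok in sorted(zip(keys, kept), key=lambda pair: pair[0])]
    return " ".join(shuffled)
```

In self-training, such noise would be applied only to the source side of the synthetic pairs, with the model-generated target left unchanged.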

Model BLEU
baseline 26.7
ST (non-iterative w/ 900K sentences) 28.9
ST (iteration 1 w/ top 30K sentences) 28.5
ST (iteration 2 w/ top 100K sentences) 29.6
ST (iteration 3 w/ top 300K sentences) 30.0
ST (iteration 4 w/ 900K sentences) 30.2
ST (iterative + source noise) 30.8
Table 3: Iterative self-training with source-side noise yields better BLEU score.

6 Final Remarks & Perspectives

In this work we introduced the problem of source-target domain mismatch in machine translation. As observed in prior work in the sociolinguistic literature (Leech and Fallon, 1992; Bernardini and Zanettin, 2004), this problem is inherent to the translation task and is even more prominent for low-resource language pairs, for which differences in local context are even more pronounced.

While the dominant approach to building machine translation corpora has been centered on making corpora comparable, we argue that using the natural distribution of the text data in each language is important if we are targeting translation of organic content produced by social platforms, blogs and even news outlets. In other words, the non-comparability of parallel and monolingual corpora is an important feature of this task: it should be made explicit and taken into account when designing machine translation models.

We introduced a simple controlled setting to study STDM and tested several baseline approaches. We found that ST can perform better than BT when the target monolingual data is scarce or out-of-domain relative to the source domain. In general, the two approaches are complementary and can be easily combined. Finally, we tested these approaches on truly low-resource language pairs, reporting encouraging improvements over the baseline methods.

Looking forward, there are several directions worth future investigation. First, there is a need for a better characterization of STDM, a better understanding of its causes and effects, and possibly a way of measuring its prominence on a given dataset while factoring out (or accounting for) the effect of translationese. Second, the approaches we introduced are merely baselines and they clearly underperform when there is severe STDM. Better algorithms leveraging source-side monolingual data are required to make strides in this setting. Finally, the community needs to build more benchmarks exhibiting these natural phenomena, which are particularly relevant for low-resource language pairs.

7 Acknowledgments

The authors would like to thank Marco Baroni, Silvia Bernardini, Randy Scansani, Alberto Barrón-Cedeño, Adriano Ferraresi, and Adina Williams for pointing to relevant references in the socio-linguistic literature and for general suggestions. They also wish to thank Sergey Edunov for various tips on training MT systems at scale.

References


  • M. Baker (1993) Corpus linguistics and translation studies: implications and applications. Text and technology: In honour of John Sinclair 233:250. Cited by: §3.
  • S. Bernardini and F. Zanettin (2004) When is a universal not a universal. In Translation universals: do they exist?, edited by Anna Mauranen and Pekka Kujamäki, John Benjamins, pp. 51–62. Cited by: §1, §2, §6.
  • O. Bojar, C. Federmann, M. Fishel, Y. Graham, B. Haddow, M. Huck, P. Koehn, and C. Monz (2019) Findings of the 2019 conference on machine translation (WMT19). In Proc. of WMT, Cited by: §2.
  • D. Britain (2013) Space, diffusion and mobility. Wiley. Book editors: J.K. Chambers and Natalie Schilling. Note: chapter 22. Cited by: §1, §2.
  • C. Buck, K. Heafield, and B. van Ooyen (2014) N-gram counts and language models from the Common Crawl. In Proceedings of the Language Resources and Evaluation Conference, Reykjavik, Iceland. Cited by: §5.2.1.
  • F. Burlot and F. Yvon (2018) Using monolingual data in neural machine translation: a systematic study. In Empirical Methods in Natural Language Processing, Cited by: §2.
  • P. Chen, J. Shen, M. Le, V. Chaudhary, A. El-Kishky, G. Wenzek, M. Ott, and M. Ranzato (2019) Facebook AI’s WAT19 Myanmar-English translation task submission. In Workshop on Asian Translation, Cited by: §5.2.1.
  • C. Ding, Hnin Thu Zar Aye, Win Pa Pa, Khin Thandar Nwet, Khin Mar Soe, M. Utiyama, and E. Sumita (2019) Towards Burmese (Myanmar) morphological analysis: syllable-based tokenization and part-of-speech tagging. ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP) 19 (1), pp. 5. Cited by: §5.2.1.
  • C. Ding, M. Utiyama, and E. Sumita (2018) NOVA: a feasible and flexible annotation system for joint tokenization and part-of-speech tagging. ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP) 18 (2), pp. 17. Cited by: §5.2.1.
  • S. Edunov, M. Ott, M. Auli, and D. Grangier (2018) Understanding back-translation at scale. In Conference of the Association for Computational Linguistics (ACL), Cited by: §4.1.
  • S. Edunov, M. Ott, M. Ranzato, and M. Auli (2019) On the evaluation of machine translation systems trained with back-translation. arXiv preprint arXiv:1908.05204. Cited by: §4.1.
  • J. Eisenstein, B. O’Connor, N. A. Smith, and E. P. Xing (2010) A latent variable model for geographic lexical variation. In Annual Meeting of the Association for Computational Linguistics, Cited by: §2.
  • J. R. Firth (1935) On sociological linguistics. Transactions of the Royal Society, pp. 67–69. Cited by: §1, §2.
  • P. Fung and L. Y. Yee (1998) An IR approach for translating new words from nonparallel, comparable texts. In The 17th International Conference on Computational Linguistics, Cited by: §2.
  • J. He, J. Gu, J. Shen, and M. Ranzato (2019) Revisiting self-training for neural sequence generation. arXiv. Cited by: §2.
  • A. Irvine and C. Callison-Burch (2013) Combining bilingual and comparable corpora for low resource machine translation. In Proceedings of the eighth workshop on statistical machine translation, pp. 262–270. Cited by: §2.
  • B. Johnstone (2010) Language and place. R. Mesthrie and W. Wolfram, editors, Cambridge Handbook of Sociolinguistics. Cambridge University Press. Cited by: §1, §2.
  • A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov (2017) Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pp. 427–431. Cited by: §5.2.1.
  • P. Koehn (2005) Europarl: a parallel corpus for statistical machine translation. MT Summit. Cited by: §5.1.
  • G. Leech and R. Fallon (1992) Computer corpora: what do they tell us about culture?. ICAME Journal Computers in English Linguistics (16). Cited by: §1, §2, §6.
  • P. Lison and J. Tiedemann (2016) OpenSubtitles2016: extracting large parallel corpora from movie and TV subtitles. In 10th International Conference on Language Resources and Evaluation (LREC), Cited by: §5.1.
  • Q. Mei, C. Liu, and H. Su (2006) A probabilistic approach to spatiotemporal theme pattern mining on weblogs. In WWW, Cited by: §2.
  • D.S. Munteanu, A. Fraser, and D. Marcu (2004) Improved machine translation performance via parallel sentence extraction from comparable corpora. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), Cited by: §2.
  • T. Nakazawa, C. Ding, R. Dabre, H. Mino, I. Goto, W. P. Pa, N. Doi, Y. Oda, A. Kunchukuttan, S. Parida, O. Bojar, and S. Kurohashi (2019) Overview of the 6th workshop on Asian translation. In Proceedings of the 6th Workshop on Asian Translation, Hong Kong. Cited by: §5.2.1.
  • G. Neubig and J. Hu (2018) Rapid adaptation of neural machine translation to new languages. In Conference on Empirical Methods in Natural Language Processing (EMNLP), Brussels, Belgium. Cited by: §2.
  • K. Papineni, S. Roukos, T. Ward, and W.J. Zhu (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Cited by: §5.1.1.
  • E. Pederson, E. Danziger, D. Wilkins, S. Levinson, S. Kita, and G. Senft (1998) Semantic typology and spatial conceptualization. Language 74 (3). Cited by: §2.
  • M. Post (2018) A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, Belgium, Brussels, pp. 186–191. Cited by: §5.1.
  • R. Sennrich, B. Haddow, and A. Birch (2015) Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pp. 86–96. Cited by: §1, §2, §4.1, §4.
  • A. Søgaard, S. Ruder, and I. Vulic (2018) On the limitations of unsupervised bilingual dictionary induction. In Conference of the Association for Computational Linguistics (ACL), Cited by: §2.
  • N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15. Cited by: §4.
  • C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016) Rethinking the inception architecture for computer vision. In IEEE conference on computer vision and pattern recognition, Cited by: §4.
  • Y. K. Thu, W. P. Pa, M. Utiyama, A. Finch, and E. Sumita (2016) Introducing the Asian Language Treebank (ALT). In LREC, Cited by: §5.2.1.
  • G. Toury (2012) Descriptive translation studies and beyond: revised edition. Cited by: §3.
  • N. Ueffing (2006) Using monolingual source-language data to improve MT performance. In IWSLT, Cited by: §2.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention Is All You Need. In Proc. of NIPS, Cited by: §4.
  • D. Yarowsky (1995) Unsupervised word sense disambiguation rivaling supervised methods. In Annual Meeting of the Association for Computational Linguistics, Cited by: §1, §2, §4.2.
  • J. Zhang and C. Zong (2016) Exploiting source-side monolingual data in neural machine translation. In Empirical Methods in Natural Language Processing, Cited by: §2.
  • M. Zhang and A. Toral (2019) The effect of translationese in machine translation test sets. arXiv abs/1906.08069. Cited by: §3.