Two New Evaluation Datasets for Low-Resource Machine Translation: Nepali-English and Sinhala-English

by Francisco Guzmán, et al.
Johns Hopkins University

The vast majority of language pairs in the world are low-resource because they have little, if any, parallel data available. Unfortunately, machine translation (MT) systems do not currently work well in this setting. Besides the technical challenges of learning with limited supervision, there is also another challenge: it is very difficult to evaluate methods trained on low resource language pairs because there are very few freely and publicly available benchmarks. In this work, we take sentences from Wikipedia pages and introduce new evaluation datasets in two very low resource language pairs, Nepali-English and Sinhala-English. These are languages with very different morphology and syntax, for which little out-of-domain parallel data is available and for which relatively large amounts of monolingual data are freely available. We describe our process to collect and cross-check the quality of translations, and we report baseline performance using several learning settings: fully supervised, weakly supervised, semi-supervised, and fully unsupervised. Our experiments demonstrate that current state-of-the-art methods perform rather poorly on this benchmark, posing a challenge to the research community working on low resource MT. Data and code to reproduce our experiments are available at





1 Introduction

Research in Machine Translation (MT) has seen significant advances in recent years thanks to improvements in modeling, and in particular neural models Sutskever et al. (2014); Bahdanau et al. (2015); Gehring et al. (2016); Vaswani et al. (2017), as well as the availability of large parallel corpora for training Tiedemann (2012); Smith et al. (2013); Bojar et al. (2017). Indeed, modern neural MT systems can achieve near human-level translation performance on language pairs for which sufficient parallel training resources exist (e.g., Chinese–English translation Hassan et al. (2018) and English–French translation Gehring et al. (2016); Ott et al. (2018a)).

Unfortunately, MT systems, and in particular neural models, perform rather poorly on low-resource language pairs, for which parallel training data is scarce Koehn and Knowles (2017). Improving performance on low resource language pairs could be very impactful if we consider that, altogether, these languages are spoken by a rather large fraction of the world population.

Technically, there are several challenges to solve in order to improve translation for low-resource languages. First, in face of the scarcity of clean parallel data, MT systems should be able to use any source of data available, namely monolingual resources and noisy comparable data. Second, we need reliable public evaluation benchmarks to track progress in translation quality. Building evaluation sets on low resource languages is very expensive because there are often very few fluent bilingual speakers in these languages. Moreover, it is difficult to check the quality of the human translations, because collecting multiple references is often not feasible and the topics of the documents in these low resource languages may require knowledge of the local culture.

In this work, we introduce new evaluation benchmarks on two very low-resource language pairs: Nepali–English and Sinhala–English. Sentences were randomly extracted from Wikipedia pages in each language and translated by professional translators. The data-sets we release to the community are composed of a tune set of 2,559 and 2,898 sentences, a development set of 2,835 and 2,766 sentences, and a test set of 2,924 and 2,905 sentences for Nepali–English and Sinhala–English respectively. The test set will be released after the WMT 2019 shared task on parallel corpus filtering.

We describe in §3 the methodology we used to collect the data as well as to check the quality of translations. The experiments reported in §4 demonstrate that these benchmarks are very challenging for current state-of-the-art methods, yielding very low BLEU scores Papineni et al. (2002) even when using all available parallel data as well as monolingual data or filtered Paracrawl data. This suggests that these languages and evaluation benchmarks can constitute a useful test-bed for developing and comparing MT systems for low resource language pairs.

2 Related Work

There is ample literature on low-resource MT. On the modeling side, one possibility is to design methods that make more effective use of monolingual data. This is a research avenue that has seen a recent surge of interest, starting with semi-supervised methods relying on back-translation Sennrich et al. (2015) and integration of a language model into the decoder Gulcehre et al. (2017); Stahlberg et al. (2018), all the way to fully unsupervised approaches Lample et al. (2018b); Artetxe et al. (2018), which use monolingual data both for learning good language models and for fantasizing parallel data. Another avenue of research has been to extend the traditional supervised learning setting to a weakly supervised one, whereby the original training set is augmented with parallel sentences mined from noisy comparable corpora like Paracrawl. In addition to the challenge of learning with limited supervision, low-resource language pairs often involve distant languages that do not share the same alphabet, or have very different morphology and syntax, which makes the learning problem harder on its own.

In terms of low-resource data-sets, DARPA programs like LORELEI Strassel and Tracey (2016) have collected translations on several low resource languages like English–Tagalog. Unfortunately, the data is only made available to the program’s participants. More recently, the Asian Language Treebank project Riza et al. (2016) has introduced parallel data-sets for several low-resource language pairs, but these are sampled from text originating in English and thus may not generalize to text sampled from low-resource languages.

In the past, there has been work on extracting high quality translations from crowd-sourcing using automatic methods Zaidan and Callison-Burch (2011), and on cleaning them through voting mechanisms Post et al. (2012). However, crowd-sourced translations are expected to be of lower quality than professional translations, making them more suitable for training systems than for building evaluation benchmarks. In contrast, here we explore the quality checks that had to be put in place to filter professional translations for low resource languages in order to build a high quality benchmark set. Moreover, we explicitly aim for larger test sets that are closer in volume to the sets used for WMT shared tasks.

In practice, there are very few publicly available data-sets for low resource language pairs, and researchers often simulate learning on low resource languages by using a high resource language pair, like English–French, and merely limiting how much labeled data they use for training Johnson et al. (2016); Lample et al. (2018a). While this practice enables easy comparison of different approaches, the practical implications of results obtained this way can be unclear. For instance, low resource languages are often distant, and their corresponding corpora are often not comparable, conditions that are far from the simulation with high resource European languages, as has recently been pointed out Neubig and Hu (2018).

3 Methodology & Resulting Data-Sets

In this section, we report the methodology used to select Wikipedia documents and to check the quality of human translations. We conclude with a detailed description of the resulting evaluation data-sets.

3.1 Low Resource Languages

For the construction of our benchmark sets, we chose to translate Nepali and Sinhala into and out of English. Both Nepali and Sinhala are Indo-Aryan languages, the former spoken by about million people if we consider only Nepal, the latter by about million people in Sri Lanka alone. Both languages are SOV (subject-object-verb); Nepali is similar to Hindi in its structure, while Sinhala is characterized by extensive omission of arguments in a sentence. Sinhala and Nepali have very little parallel data publicly available. For instance, most of the parallel corpora for Nepali–English originate from GNOME and Ubuntu handbooks and account for about 500K sentence pairs (see Table 3); Nepali also has about 4K sentences translated from the English Penn Treebank, which is valuable parallel data. For Sinhala–English, there are an additional 601K sentence pairs automatically aligned from OpenSubtitles Lison et al. (2018). Overall, the domains and quantity of the existing parallel data are very limited. However, both languages have a rather large amount of monolingual data publicly available Buck et al. (2014), making them perfect candidates for tracking performance on unsupervised and semi-supervised machine translation.

3.2 Document selection

To build the evaluation sets, we selected and professionally translated sentences originating from Wikipedia pages in English, Nepali and Sinhala (the {en,ne,si} editions) from a Wikipedia crawl of early May 2018.

To select sentences for translation, we first filtered Wikipedia by retaining only the top documents that contain the largest number of candidate sentences in each source language; we then manually filtered the documents to ensure their quality. To this end, we defined candidate sentences as: (i) being in the intended source language according to a language-id classifier Bojanowski et al. (2017), a necessary step because many sentences in foreign-language Wikipedias may be in English or other languages; and (ii) having between and characters. Moreover, we considered sentences and documents, when applicable, to be inadequate for translation when they contained large portions of untranslatable content such as lists of entities (for example, the Academy Awards page). For English, sentences have to start with an uppercase letter and end with a period. For Nepali and Sinhala, we ran a regular expression to exclude symbols such as bullet points, repeated dashes or periods, as well as ASCII characters. The document set, along with the categories of documents, is presented in the appendix (Table 7).
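The candidate-selection rules above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: `predict_language` is a crude stand-in for a fastText-style classifier, and the character bounds `MIN_CHARS`/`MAX_CHARS` are placeholder values, since the exact limits are not preserved in this copy.

```python
import re

# Hypothetical bounds -- the paper's exact character limits are not
# preserved in this copy, so these values are placeholders.
MIN_CHARS, MAX_CHARS = 50, 150

def predict_language(sentence):
    """Stand-in for a fastText-style language-id classifier
    (Bojanowski et al., 2017). Here: a crude heuristic that labels
    ASCII-only text as English, anything else as 'other'."""
    return "en" if sentence.isascii() else "other"

# Reject bullet points, repeated dashes/periods, and ASCII letters in
# Nepali/Sinhala text (which should be written in native scripts).
BAD_NE_SI = re.compile(r"•|--|\.\.|[A-Za-z]")

def is_candidate(sentence, lang):
    """Return True if the sentence qualifies for translation."""
    if not (MIN_CHARS <= len(sentence) <= MAX_CHARS):
        return False
    if lang == "en":
        if predict_language(sentence) != "en":
            return False
        # English sentences must start uppercase and end with a period.
        return sentence[0].isupper() and sentence.endswith(".")
    # Nepali / Sinhala: language id plus a symbol/ASCII blacklist.
    return predict_language(sentence) != "en" and not BAD_NE_SI.search(sentence)
```

In a real pipeline the per-document candidate counts produced by this predicate would then drive the top-document ranking described above.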

After the document selection process, we sampled 2,500 random sentences for each language. From English we translated into Nepali and Sinhala, while from Sinhala and Nepali we translated only into English. We requested each string to be translated twice, by different translators.

3.3 Quality checks

Translating domain-specialized content such as Wikipedia pages, from and to low-resource languages is challenging: the pool of available translators is limited, there is limited context available to each translator when translating one string at a time, and some of the sentences can contain code-switching (e.g. text about Buddhism in Nepali or Sinhala can contain Sanskrit or Pali words). As a result, we observed large variations in the level of translation quality coming from professional translators. To ensure good quality, we relied on both automatic and manual methods to detect high and low quality translations.

On a first round, we used automatic methods to filter out bad translations and send them for rework. Once the reworked translations were received, we sent all translations (original or reworked) that passed the automatic checks on to human quality checks. Only the high-quality translations that passed all checks were added to the final pool used to build the test sets. Note that some source sentences may have fewer than two (but at least one) translations after two rounds of rework. Below, we describe the automatic and manual quality checks that we applied to the data-sets.

Automatic Filtering.

To maximize the quality of the translation sets while minimizing the amount of manual checking required, we relied on automatic methods to ensure the translations followed these principles: (i) translations should be fluent Zaidan and Callison-Burch (2011); (ii) they should be sufficiently different from the source text; (iii) translations should be similar to each other, yet not equal; and (iv) translations should not be transliterations. To identify the vast majority of translation issues, we filtered by: (i) applying a count-based n-gram language model trained on Wikipedia monolingual data and removing translations with perplexity above 3000.0 (English translations only); (ii) removing translations whose sentence-level char-BLEU score between the two generated translations is below 15 (indicating disparate translations) or above 90 (indicating suspiciously similar translations); (iii) removing sentences that contain at least 33% transliterated words; (iv) removing translations in which at least 50% of words are copied from the source sentence; and (v) removing translations with more than a 50% out-of-vocabulary ratio or more than 5 out-of-vocabulary words in total (English translations only). For this, the vocabulary was computed on the monolingual English Wikipedia described in Table 3.
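A minimal sketch of this filtering logic follows, with the thresholds taken from the paragraph above. The similarity function is a crude character n-gram overlap standing in for sentence-level char-BLEU, and the transliteration and OOV checks are omitted; helper names are illustrative, not the authors'.

```python
def char_ngram_overlap(a, b, n=4):
    """Crude stand-in for sentence-level char-BLEU (0-100): percentage of
    character n-grams of `a` that also occur in `b`. The paper used
    char-BLEU; this simplification only illustrates the thresholding."""
    grams_a = {a[i:i + n] for i in range(max(len(a) - n + 1, 1))}
    grams_b = {b[i:i + n] for i in range(max(len(b) - n + 1, 1))}
    return 100.0 * len(grams_a & grams_b) / len(grams_a)

def copy_ratio(translation, source):
    """Fraction of translation words copied verbatim from the source."""
    src = set(source.split())
    words = translation.split()
    return sum(w in src for w in words) / max(len(words), 1)

def passes_automatic_checks(t1, t2, source, perplexity=None):
    """Apply filters (i), (ii) and (iv) from §3.3:
    - LM perplexity <= 3000 (English side only; `perplexity` is assumed
      to come from an n-gram LM trained on Wikipedia),
    - 15 <= similarity between the two translations <= 90,
    - fewer than 50% of words copied from the source."""
    if perplexity is not None and perplexity > 3000.0:
        return False
    sim = char_ngram_overlap(t1, t2)
    if sim < 15 or sim > 90:
        return False
    return copy_ratio(t1, source) < 0.5
```

Two genuinely different but related translations pass, while identical pairs or high-perplexity outputs are sent back for rework.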

Manual Filtering.

We performed manual checks on every translation to ensure end-to-end translation quality, and target-language fluency. For translation quality assessment, we followed a setup similar to direct assessment Graham et al. (2013). We asked three different raters to rate the sentences from 0–100 according to the perceived translation quality. In our guidelines, the 0–10 range represents a translation that is completely incorrect and inaccurate, the 70–90 range represents a translation that closely preserves the semantics of the source sentence, while the 90–100 range represents a perfect translation. To ensure rating consistency, we rejected any evaluation set in which the range of scores among the three reviewers was above 30 points, and requested a fourth rater to break ties, by replacing the most diverging translation rating with the new one. This is repeated until convergence is reached. For each translation, we took the average score over all raters and rejected translations whose scores were below 70. This filtering was done for all language pairs.
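The rating-aggregation protocol above can be sketched as follows; the function names are illustrative, and `break_tie` is a simplified single-step reading of the iterate-until-convergence procedure.

```python
def aggregate_ratings(scores, max_range=30, accept_threshold=70):
    """Aggregate 0-100 direct-assessment scores from three raters, per
    §3.3: if the raters disagree by more than `max_range` points, the
    ratings are inconsistent and a tie-breaking rater is needed;
    otherwise accept the translation when the average score reaches
    `accept_threshold`. Returns 'accept', 'reject' or 'needs_tiebreak'."""
    if max(scores) - min(scores) > max_range:
        return "needs_tiebreak"
    avg = sum(scores) / len(scores)
    return "accept" if avg >= accept_threshold else "reject"

def break_tie(scores, new_score):
    """Replace the rating that diverges most from the mean with the
    fourth rater's score (one step of the paper's repeated procedure)."""
    mean = sum(scores) / len(scores)
    worst = max(range(len(scores)), key=lambda i: abs(scores[i] - mean))
    scores = list(scores)
    scores[worst] = new_score
    return scores
```

For example, the inconsistent set [80, 40, 75] triggers a tie-break; replacing the diverging 40 with a fourth rating of 78 yields a consistent set averaging above 70, so the translation is accepted.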

To ensure that the translations were as fluent as possible, we also designed an Amazon Mechanical Turk (AMT) monolingual task to judge the fluency of English translations. Regardless of content preservation, translations that are not fluent in the target language should be disregarded. For this task, we asked five independent human annotators to rate the fluency of each English translation from 1 (bad) to 5 (excellent). We rejected translations into English with fluency scores below 3.

3.4 Resulting data-sets

We built three evaluation sets for each language pair using all of the data that passed our automatic and manual quality checks: dev (tune), devtest (validation) and test (test). The tune set is used for hyper-parameter tuning and model selection, the validation set is used to measure generalization during development, while the test set will be used as a blind set for a WMT shared task on data filtering; the test set will be made available after this competition is over.

To measure performance in both directions (e.g. Sinhala–English and English–Sinhala), we built test sets with mixed original and translationese Baroni and Bernardini (2005) content on the source side. To reduce the effect of the source language on the quality of the resulting translations in the sets used for tracking progress, direct and reverse translations were mixed at an approximate 50-50 ratio in the devtest and test sets. The dev set, on the other hand, was composed of the remainder of the available translations, which were not guaranteed to be balanced. Before selection, the sentences were grouped by document, so as to minimize the number of documents per evaluation set. For sentences with two satisfactory translations, the second translation was added as an additional evaluation instance. This yielded on average 1.7 translations per source sentence.
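The 50-50 mixing step can be sketched as below; this is an illustrative reconstruction (function name and inputs are hypothetical), and the document-level grouping described above is assumed to happen upstream.

```python
import random

def build_mixed_set(direct, reverse, size, seed=0):
    """Sketch of the devtest/test construction in §3.4: draw roughly
    half of the source sentences from each translation direction so
    that original text and translationese are balanced. `direct` and
    `reverse` are lists of sentence ids, grouped by document upstream."""
    rng = random.Random(seed)
    half = size // 2
    picked = rng.sample(direct, half) + rng.sample(reverse, size - half)
    rng.shuffle(picked)  # avoid ordering the set by direction
    return picked
```

The dev set would then simply collect whatever translations remain after the devtest and test draws.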

In Table 1 we present the statistics of the resulting sets. For Sinhala–English, the test set is composed of 850 sentences originally in English and 850 originally in Sinhala, with approximately 1.7 translations per sentence. This yielded 1,465 sentence pairs originally in English and 1,440 originally in Sinhala, for a total of 2,905 sentence pairs. Similarly, for Nepali–English, the test set is composed of 850 sentences originally in English and 850 originally in Nepali. This yielded 1,462 sentence pairs originally in English and 1,462 originally in Nepali, for a total of 2,924 sentence pairs. The composition of the rest of the sets can be found in Table 1.

orig lang        dev            devtest        test
                 uniq   tot     uniq   tot     uniq   tot
Nepali–English
  English        693    1,181   800    1,393   850    1,462
  Nepali         825    1,378   800    1,442   850    1,462
  total          1,518  2,559   1,600  2,835   1,700  2,924
Sinhala–English
  English        1,123  1,913   800    1,395   850    1,465
  Sinhala        565    985     800    1,371   850    1,440
  total          1,688  2,898   1,600  2,766   1,700  2,905
Table 1: Number of unique sentences (uniq) and total number of sentence pairs (tot) per evaluation set, grouped by original language.

In Table 2 we present the aggregate distribution of topics per sentence for the data-sets in Nepali–English and Sinhala–English. We can observe a diverse representation of topics, ranging from General (e.g., documents about tires, shoes and insurance) and History (e.g., documents about the history of radar, the Titanic, etc.) to Law and Sports. This richness of topics increases the difficulty of the set, as it requires domain-independent, generalizable approaches to improve quality. The full list of documents and topics is in the appendix (Table 7).

topic proportion (%)
ne–en si–en
General 18.3 24.1
History 6.5 15.1
Science 7.4 12.7
Religion 8.9 10.5
Social Sciences 10.2 6.9
Biology 6.3 9.1
Geography 10.6 4.6
Art/Culture 6.7 8.3
Sports 5.8 6.7
Politics 8.1 N/A
People 7.4 N/A
Law 3.9 2.0
Table 2: Distribution of the topics of the sentences in the dev, devtest and test sets according to the Wikipedia document they were sampled from.

4 Experiments

In this section, we first describe the data used for training the models, we then discuss the learning settings and models considered, and finally we report the results of these baseline models on the new evaluation benchmarks.

4.1 Training Data

Small amounts of parallel data are available for Sinhala–English and Nepali–English; statistics can be found in Table 3. This data comes from different sources. Open Subtitles and GNOME/KDE/Ubuntu come from the OPUS repository. Global Voices is an updated version (2018q4) of a data set originally created for the CASMACAT project. Bible translations are from the bible-corpus. The Paracrawl corpus comes from the Paracrawl project. The filtered version (Clean Paracrawl) was filtered with Zipporah Xu and Koehn (2017). We also contrast this filtered version with a randomly filtered version (Random Paracrawl) containing the same number of English tokens.

                        Sentences  Tokens
Nepali–English parallel data
  Bible                 62K        1.5M
  Global Voices         3K         75K
  Penn Tree Bank        4K         88K
  GNOME/KDE/Ubuntu      495K       2M
  Unfiltered Paracrawl  2.1M       35.5M
  Clean Paracrawl       249K       5M
  Random Paracrawl      290K       5M
Monolingual data (Nepali–English)
  Wikipedia (en)        67.8M      2.0B
  Common Crawl (ne)     3.6M       103.0M
  Wikipedia (ne)        92.3K      2.8M
Sinhala–English parallel data
  Open Subtitles        601K       3.6M
  GNOME/KDE/Ubuntu      46K        151K
  Paracrawl             3.4M       59M
  Clean Paracrawl       279K       5M
  Random Paracrawl      277K       5M
Monolingual data (Sinhala–English)
  Wikipedia (en)        67.8M      2.0B
  Common Crawl (si)     5.2M       110.3M
  Wikipedia (si)        155.9K     4.7M
Table 3: Parallel, comparable, and monolingual data used in the experiments of §4. Token counts for parallel and comparable corpora are reported over the English side. Comparable data from Paracrawl is used only in the weakly-supervised experiments since alignments are noisy.

4.2 Training Settings

We evaluate models in four training settings. First, we consider a fully supervised training setting using the parallel data listed in Table 3.

Second, we consider a semi-supervised setting whereby in addition to parallel data, we also leverage monolingual data on the target side. For this setting, we considered the standard back-translation training protocol as introduced in Sennrich et al. (2015)’s seminal work: we train a backward MT system, which we use to translate monolingual target sentences to the source language. Then, we merge the resulting pairs of noisy (back-translated) source sentences with the original target sentences and add them as additional parallel data for training the original source-to-target MT system. When monolingual data is available for both languages, we can train backward MT systems in both directions and repeat the back-translation process iteratively He et al. (2016); Lample et al. (2018a). We consider up to two back-translation iterations. At each iteration we generate back-translations using beam search, which has been shown to perform well in low-resource settings Edunov et al. (2018); we use a beam width of 5 and individually tune the length-penalty on the dev set.
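The back-translation protocol can be sketched as the following data flow. This is a simplified single-direction variant (the paper also iterates with monolingual data on both sides); `train` and `translate` are injected callables standing in for full MT training and beam-search decoding.

```python
def back_translation(parallel, mono_tgt, train, translate, iterations=2):
    """Sketch of the back-translation protocol of §4.2 (Sennrich et
    al., 2015). `parallel` is a list of (src, tgt) pairs, `mono_tgt`
    a list of monolingual target-side sentences. Returns the final
    forward (src -> tgt) model."""
    forward = train(parallel)
    for _ in range(iterations):
        # 1. Train a backward (tgt -> src) system on the flipped data.
        backward = train([(t, s) for s, t in parallel])
        # 2. Back-translate monolingual target text into noisy sources
        #    (the paper uses beam search with beam width 5 here).
        synthetic = [(translate(backward, t), t) for t in mono_tgt]
        # 3. Retrain the forward system on real + synthetic pairs.
        forward = train(parallel + synthetic)
    return forward
```

With stub callables one can verify the bookkeeping: the final training set contains both the original pairs and the synthetic (back-translated source, real target) pairs.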

Third, we consider a weakly supervised setting by using a baseline system to filter out Paracrawl data, in our case the Zipporah method Xu and Koehn (2017), in order to augment the original training set with a possibly larger but noisier set of parallel sentences.

Finally, we consider a fully unsupervised setting, whereby only monolingual data on both the source and target side are used to train the model Lample et al. (2018b).

Table 4: BLEU scores of various machine translation methods and learning settings (supervised, weakly supervised, semi-supervised, unsupervised) for English–Nepali and English–Sinhala in both directions on devtest (see §3). We report detokenized SacreBLEU Post (2018) for {Ne,Si}→En and tokenized BLEU for En→{Ne,Si}.

4.3 Models & Architectures

We consider both phrase-based statistical machine translation (PBSMT) and neural machine translation (NMT) systems in our experiments. The PBSMT systems use Moses Koehn et al. (2007) with state-of-the-art settings (5-gram language model, hierarchical lexicalized reordering model, operation sequence model), but no additional monolingual data to train the language model.

The NMT systems use the Transformer (Vaswani et al., 2017) implementation in the Fairseq toolkit; preliminary experiments showed these to perform better than LSTM-based NMT models. More specifically, in the supervised setting we use a Transformer architecture with 5 encoder and 5 decoder layers, where the number of attention heads, embedding dimension and inner-layer dimension are 2, 512 and 2048, respectively. In the semi-supervised setting, where we augment our small parallel training data with millions of back-translated sentence pairs, we use a larger Transformer architecture with 6 encoder and 6 decoder layers, where the number of attention heads, embedding dimension and inner-layer dimension are 8, 512 and 4096, respectively. We regularize our models with dropout, label smoothing and weight decay, with the corresponding hyper-parameters tuned independently for each language pair. Models are optimized with Adam (Kingma and Ba, 2015), and we use the same learning rate schedule as Ott et al. (2018b). We run experiments on between 4 and 8 Nvidia V100 GPUs with mini-batches of between 10K and 100K target tokens, following Ott et al. (2018b). Code to reproduce our results can be found at
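For reference, the two Transformer configurations above can be written out as plain dictionaries. The parameter-count helper below uses a deliberately rough cost model (attention projections plus feed-forward weights) and is our illustration, not something from the paper or Fairseq.

```python
# The two Transformer configurations described in §4.3 (key names are
# descriptive, not exact Fairseq flag names).
SUPERVISED = dict(encoder_layers=5, decoder_layers=5,
                  attention_heads=2, embed_dim=512, ffn_dim=2048)
SEMI_SUPERVISED = dict(encoder_layers=6, decoder_layers=6,
                       attention_heads=8, embed_dim=512, ffn_dim=4096)

def transformer_params(cfg, vocab=5000):
    """Very rough weight-count estimate for one encoder-decoder stack:
    each layer costs ~4*d^2 for attention projections plus 2*d*ffn for
    the feed-forward block; decoder layers add cross-attention (another
    ~4*d^2). Embeddings for a 5K joint BPE vocabulary are added on top."""
    d, f = cfg["embed_dim"], cfg["ffn_dim"]
    enc = cfg["encoder_layers"] * (4 * d * d + 2 * d * f)
    dec = cfg["decoder_layers"] * (8 * d * d + 2 * d * f)
    return enc + dec + 2 * vocab * d  # plus source/target embeddings
```

Even under this crude estimate, the semi-supervised architecture is roughly twice the size of the supervised one, consistent with it being trained on far more (back-translated) data.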

4.4 Preprocessing and Evaluation

We tokenize Nepali and Sinhala using the Indic NLP Library. For the PBSMT system, we tokenize English sentences using the Moses tokenization scripts. For NMT systems, we instead use a vocabulary of 5K symbols based on a joint source and target Byte-Pair Encoding (BPE; Sennrich et al., 2015) learned with the sentencepiece library over the parallel training data. We learn the joint BPE for each language pair over the raw English sentences and the tokenized Nepali or Sinhala sentences. We then remove training sentence pairs with more than 250 source or target BPE tokens.
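The final length-filtering step is simple enough to sketch directly; `encode` below is a stand-in for a trained sentencepiece BPE model's piece encoder (whitespace splitting is used only to keep the example self-contained).

```python
def filter_long_pairs(pairs, encode, max_tokens=250):
    """Drop training pairs whose source or target side exceeds
    `max_tokens` BPE symbols, as in §4.4. `encode` maps a sentence
    to its list of subword pieces."""
    return [(s, t) for s, t in pairs
            if len(encode(s)) <= max_tokens and len(encode(t)) <= max_tokens]
```

In the actual pipeline, `encode` would be the joint 5K-symbol BPE model applied after Indic NLP tokenization of the Nepali/Sinhala side.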

We report detokenized SacreBLEU Post (2018) when translating into English, and tokenized BLEU Papineni et al. (2002) when translating from English into Nepali or Sinhala.

Figure 1: Analysis of the Ne→En devtest set using the semi-supervised machine translation system. Left: sentence-level BLEU versus AMT fluency score of the English reference sentences; source sentences that received more fluent human translations are not easier for machines to translate. Right: average sentence-level BLEU against the Wikipedia document id from which the source sentence was extracted; sentences have roughly the same degree of difficulty across documents, since there is no extreme difference between the shortest and tallest bars. However, source sentences originating from Nepali Wikipedia (blue) are translated more poorly than those originating from English Wikipedia (red). Documents are sorted by BLEU for ease of reading.

4.5 Results

We run both PBSMT and NMT in the various learning configurations described in §4.2. There are several observations we can make from the results reported in Table 4.

First, these language pairs are very difficult: both the NMT and PBSMT supervised baselines achieve BLEU scores below 8. Second, not surprisingly, BLEU scores are higher when translating into English than into the more morphologically rich Nepali and Sinhala. Third, the biggest improvements are brought by the semi-supervised approach using back-translation, which nearly doubles BLEU for Nepali–English from 7.6 to 15.1 (+7.5 BLEU) and Sinhala–English from 7.2 to 15.1 (+7.9 BLEU), and increases BLEU for English–Nepali from 4.3 to 6.8 (+2.5 BLEU) and English–Sinhala from 1.2 to 6.5 (+5.3 BLEU). Notably, repeating back-translation for a second iteration brings further gains compared to the first iteration, suggesting that more iterations of back-translation or "online" back-translation Lample et al. (2018b) may be helpful. The weakly supervised baseline does not work as well as the semi-supervised one, yet it achieves almost 10 BLEU points in Sinhala–English. Finally, unsupervised NMT approaches appear ineffective on these language pairs, achieving BLEU scores close to 0. This failure is due to poor initialization: unsupervised lexicon induction techniques Conneau et al. (2018); Artetxe et al. (2017) did not work well on these morphologically rich languages, partly because the monolingual corpora used to train word embeddings are not comparable Neubig and Hu (2018) and do not share a sufficient number of overlapping strings.

We also investigate the effect of weak supervision in more detail in Table 5 for Nepali–English and Sinhala–English. The baseline corresponds to the Supervised NMT setting in Table 4; the data has been described in §4.1. For both language pairs, we can see that the filtering method applied to the Paracrawl corpus is critical. Filtering at random gives a BLEU score close to 0, while applying the Zipporah filtering method provides an improvement over using the unfiltered Paracrawl directly: +0.5 BLEU for Nepali–English and +6.0 BLEU for Sinhala–English. In the case of Sinhala–English, adding Paracrawl Clean to the initial parallel data improves performance by 2.7 BLEU. However, no improvement over the baseline was observed for Nepali–English with Paracrawl Clean. We surmise that the scarcity of parallel data available to train the Nepali–English Zipporah model limits the effectiveness of the cleaning technique.

Table 5: Weakly supervised experiments comparing Unfiltered Paracrawl, Paracrawl Random, Paracrawl Clean, and Parallel + Paracrawl Clean (BLEU, ne–en and si–en): adding noisy parallel data from filtered Paracrawl improves translation quality in some conditions. "Parallel" refers to the data described in Table 3.

4.6 Analysis

Table 6: In-domain vs. out-of-domain translation performance (BLEU) for supervised and semi-supervised NMT models. Semi-supervision yields modest relative gains in-domain (+20% and +7%) but very large relative gains out-of-domain (+210% and +542%). In-domain performance is measured on a held-out subset of 1,000 sentences from the Open Subtitles training data (see Table 3); out-of-domain performance is measured on devtest (see §3).

In this section, we provide an analysis of the Nepali to English devtest set using the semi-supervised machine translation system, see Figure 1. Findings on other language directions are similar.

First, we observe no correlation between fluency rating of human references and closeness of system hypotheses to such references, suggesting lack of such bias in the benchmark. Fluency of references does not correlate with ease of the translation task, at least at the current level of accuracy.

Second, we observe that source sentences receive rather similar translation quality across all document ids, with a difference of about 10 BLEU points between the easiest and the hardest document to translate. This suggests that the random sampling procedure used to construct the data-set was adequate and that no single Wikipedia document yields much harder sentences than the others.

However, if we split documents by their originating source (actual Nepali versus English translated into Nepali, a.k.a. translationese), we notice that genuine Nepali documents are harder to translate. The same is true also when performing the evaluation with the supervised MT system: translations of Nepali originating source sentences obtain 4.9 BLEU while translations of English originating sentences obtain 9.1 BLEU. This suggests that the existing parallel corpus is closer to English Wikipedia than Nepali Wikipedia, and that this bias is further reinforced when using English Wikipedia monolingual data during the back-translation process that generates data for the model trained in semi-supervised mode.

To better understand the effect of domain mismatch between the parallel data-set and the Wikipedia evaluation set, we restricted the Si–En training set to only the Open Subtitles portion of the parallel data, and held out one thousand sentences for "in-domain" evaluation of generalization performance. Table 6 shows that translation quality on in-domain data is between 10 and 16 BLEU points higher. This may be due both to domain mismatch and to the sensitivity of the BLEU metric to sentence length: there are on average 6 words per sentence in the Open Subtitles test set, compared to 16 words per sentence in the Wikipedia devtest set. However, when we train semi-supervised models on back-translated Wikipedia data, whose domain better matches the "out-of-domain" devtest set, we see much larger BLEU gains on the "out-of-domain" set than on the "in-domain" set, suggesting that domain mismatch is indeed a major problem.
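The length-sensitivity point can be made concrete with two standard quantities from the BLEU definition: the brevity penalty and the number of matchable n-grams a sentence contributes. This small illustration is ours, not the paper's analysis code.

```python
import math

def brevity_penalty(cand_len, ref_len):
    """BLEU's brevity penalty (Papineni et al., 2002):
    1 if the candidate is at least as long as the reference,
    exp(1 - r/c) otherwise."""
    return 1.0 if cand_len >= ref_len else math.exp(1 - ref_len / cand_len)

def ngram_count(sent_len, n=4):
    """Number of n-grams a sentence of `sent_len` words contributes."""
    return max(sent_len - n + 1, 0)

# A 6-word Open Subtitles sentence offers only 3 4-grams to match,
# versus 13 for a 16-word Wikipedia sentence, so every mismatch or
# missing word costs proportionally more on short sentences.
```

This is one reason BLEU comparisons across test sets with very different sentence lengths, such as Open Subtitles versus Wikipedia, should be read with care.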

5 Conclusions

One of the biggest challenges in MT today is learning to translate low-resource language pairs. Research in this area not only faces formidable technical challenges, from learning with limited supervision to dealing with very distant languages, but is also hindered by the lack of freely and publicly available evaluation benchmarks.

In this work, we introduce and freely release to the community two new benchmarks: Nepali–English and Sinhala–English. Nepali and Sinhala have very different syntax and morphology from English, and very little parallel data in these language pairs is publicly available. However, a fair amount of monolingual data and Paracrawl data exists in both languages, making these two language pairs ideal candidates for research on low-resource MT.

Our experiments show that current state-of-the-art approaches perform rather poorly on these new evaluation benchmarks, with semi-supervised neural methods outperforming all the other model variants and training settings we considered. We believe these benchmarks will help the research community working on low-resource MT make faster progress, by enabling free access to evaluation data for actual low-resource languages and by promoting fair comparison of methods.



Appendix A List of Wikipedia Documents

| Document (name or English gloss) | Topic |
| --- | --- |
| Astronomy | Science |
| History of radar | History |
| Shoe | General |
| Tire | General |
| Indian cuisine | Art/Culture |
| IPhone | General |
| Apollo program | History |
| Chess | General |
| Honey | General |
| Police | Law |
| Desert | Geography |
| Slavery | Social Sciences |
| Riddler | Art/Culture |
| Diving | Sports |
| Cat | Biology |
| Boxing | Sports |
| White wine | General |
| Creativity | Social Sciences |
| Capitalism | Social Sciences |
| Alaska | Geography |
| Museum | General |
| Lifeguard | General |
| Tennis | Sports |
| Writer | General |
| Anatomy | Science |
| Qoran | Religion |
| Dhammas | Religion |
| Vegetation | Science |
| Names of Colombo Students | History |
| Titanic | History |
| The Heart | Biology |
| The Ear | Biology |
| Theravada | Religion |
| Wu Zetian | History |
| Psychoanalysis | Science |
| Angulimala | Religion |
| Insurance | General |
| Leafart | Art/Culture |
| Communication Science | Science |
| Pharaoh Neferneferuaten | History |
| Nelson Mandela | People |
| Parliament of India | Politics |
| Kailali and Kanchanpur | Geography |
| Bhuwan Pokhari | Geography |
| COPD | Biology |
| KaalSarp Yoga | Religion |
| Research Methodology in Economics | Social Sciences |
| Essay | Social Sciences |
| Mutation | Science |
| Maoist Constituent Assembly | Politics |
| Patna | Geography |
| Federal rule system | Law |
| Newari Community | Art/Culture |
| Raka's Dynasty | History |
| Rice | Biology |
| Breastfeeding | Biology |
| Earthquake | Science |
| Motiram Bhatta | People |
| Novel Magazine | Art/Culture |
| Vladimir Putin | Politics |
| History of Nepali Literature | History |
| Income tax | Law |
| Ravi Prasjal | People |
| Yogchudamani Upanishads | Religion |
| Sedai | Religion |
Table 7: List of documents, identified by their document name or its English translation, with the corresponding topic. In the original paper, each document name links to the source Wikipedia page; a marker there denotes a page that had been removed or was no longer available at the time of this submission.

Appendix B Examples from devtest

Table 8: Examples of sentences from the En-Ne, Ne-En, En-Si, and Si-En devtest sets. System hypotheses (System) are generated with the semi-supervised model described in the main paper, using beam search decoding.