Augmenting Statistical Machine Translation with Subword Translation of Out-of-Vocabulary Words

08/16/2018 ∙ by Nelson F. Liu, et al. ∙ USC Information Sciences Institute, University of Washington

Most statistical machine translation systems cannot translate words that are unseen in the training data. However, humans can translate many classes of out-of-vocabulary (OOV) words (e.g., novel morphological variants, misspellings, and compounds) without context by using orthographic clues. Following this observation, we describe and evaluate several general methods for OOV translation that use only subword information. We pose the OOV translation problem as a standalone task and intrinsically evaluate our approaches on fourteen typologically diverse languages across varying resource levels. Adding OOV translators to a statistical machine translation system yields consistent BLEU gains (0.5 points on average, and up to 2.0) for all fourteen languages, especially in low-resource scenarios.




1 Introduction

Machine translation systems frequently must translate tokens unseen during training (known as out-of-vocabulary or OOV tokens). Neural machine translation (NMT) can mitigate this OOV problem by producing word representations on the fly from subword units (Sennrich et al., 2016; Luong and Manning, 2016). Despite this advantage, NMT performs poorly when there is little training data; Koehn and Knowles (2017) show that statistical machine translation (SMT) yields better translations in this low-resource setting. However, SMT systems struggle to handle OOV tokens.

Human translators are adept at translating OOV words, in part because they exploit subword orthographic clues. For example, they can translate novel morphological variants and compounds of known words by reasoning over constituent subword units. We use these same subword orthographic units to build broadly-applicable OOV translation systems while making only the loose typological assumption that the orthographic representation contains informative subword units.

Most prior work has focused on translating specific OOV classes; we pursue holistic solutions. Our work is similar in spirit to prior language-specific and general methods for handling multiple classes of OOVs in SMT (Habash, 2009; Gujral et al., 2016). We compare three approaches to building a subword OOV translation module: an edit distance approach, which matches OOV words to orthographically similar known translation pairs; a vector distance approach, which seeks a semantic match instead of an orthographic match; and a sequence-to-sequence (seq2seq) approach, which generates translations one character at a time. We do not use language-specific heuristics, enabling automatic construction of OOV modules for a wide variety of languages.

We evaluate our approaches on an intrinsic OOV translation task on a typologically diverse set of fourteen languages. We also embed OOV translators into a syntax-based machine translation (SBMT) system and assess its effects on overall system BLEU for the same fourteen languages. Our results show that using an OOV translator with the SBMT system consistently improves translation quality across all languages, especially in low resource scenarios; we see gains of 0.5 BLEU on average and up to 2.0 BLEU. We release code to train our OOV translators at

2 Dataset and Intrinsic Evaluation

To intrinsically evaluate OOV modules, we assess their ability to translate previously unseen foreign tokens into English. For fourteen language pairs, we obtain monolingual data, sentence-aligned parallel data, and bilingual lexicons. We word-align the parallel data (Och and Ney, 2003; Liang et al., 2006) and randomly remove 1000 one-count <foreign, English> word pairs that also do not exist in the lexicons. These word translation pairs are out-of-vocabulary with respect to the other resources, making them suitable for intrinsic evaluation. We divide these pairs into validation and test splits of 500 word pairs each, and build our OOV translators with the monolingual data, the lexicons, and translation tables extracted from the parallel text.
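The held-out construction described above can be sketched as follows. This is only an illustration under stated assumptions: alignment links are given as a list of (foreign, English) tuples, "one-count" is read as the foreign word appearing in exactly one alignment link, and `build_oov_splits` with its arguments is a hypothetical name rather than part of the released code.

```python
import random
from collections import Counter


def build_oov_splits(aligned_pairs, lexicon_words, n_heldout=1000, seed=0):
    """Hold out one-count word pairs and split them into validation and test.

    aligned_pairs: list of (foreign, english) tuples, one per alignment link.
    lexicon_words: set of foreign words covered by the bilingual lexicon.
    """
    foreign_counts = Counter(foreign for foreign, _ in aligned_pairs)
    # Assumption: a pair qualifies if its foreign word was aligned exactly once
    # and is absent from the lexicon, so removing it makes the word truly OOV.
    candidates = sorted(
        {pair for pair in aligned_pairs
         if foreign_counts[pair[0]] == 1 and pair[0] not in lexicon_words}
    )
    rng = random.Random(seed)
    heldout = rng.sample(candidates, min(n_heldout, len(candidates)))
    half = len(heldout) // 2
    validation, test = heldout[:half], heldout[half:]
    heldout_set = set(heldout)
    # Everything else stays available for building translation tables.
    remaining = [pair for pair in aligned_pairs if pair not in heldout_set]
    return validation, test, remaining
```

With `n_heldout=1000` this yields the two splits of 500 pairs each; the fixed seed only makes the toy sampling repeatable.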

To summarize, our dataset contains:

(1) Validation and test sets: Foreign OOV tokens and an English translation (all extracted from parallel text). While there are often multiple acceptable English translations for a foreign OOV, our dataset provides one. Our objective is to predict the English translation, given the foreign OOV.

(2) Lexicon (bilingual dictionary): Foreign tokens, their part of speech, and an English translation. A foreign token may have multiple entries.

(3) Monolingual data: A modest amount of running text in the foreign language (from, e.g., Wikipedia).

(4) Translation tables: Pairs of aligned foreign words and their English translations (token to token mapping) with alignment probabilities and absolute alignment counts, derived from the parallel text.

For dataset quality validation details, see Appendix A.


3 OOV Translation Methods

Edit Distance

To translate OOVs with edit distance (Levenshtein, 1966), we adapt the method of Gujral et al. (2016). We begin by retrieving the foreign word(s) in the bilingual lexicon or translation table with the lowest edit distance from the given OOV token. Our predicted translation is the English word that most frequently aligns to any of the selected in-vocabulary foreign words (ties are broken with the words' frequency in the Gigaword corpus).
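A minimal sketch of this retrieval step is given below. It assumes the lexicon and translation table have been merged into a dictionary mapping each foreign word to a counter of aligned English words; the function names are hypothetical, and the Gigaword tie-breaking is replaced by alphabetical order for brevity.

```python
from collections import Counter


def levenshtein(a, b):
    """Standard dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def translate_by_edit_distance(oov, alignments):
    """alignments: dict mapping foreign word -> Counter of English alignment counts."""
    distances = {foreign: levenshtein(oov, foreign) for foreign in alignments}
    best = min(distances.values())
    nearest = [foreign for foreign, d in distances.items() if d == best]
    # Pool alignment counts over all equally close in-vocabulary words.
    votes = Counter()
    for foreign in nearest:
        votes.update(alignments[foreign])
    # Assumption: ties fall back to alphabetical order, not Gigaword frequency.
    return max(sorted(votes), key=votes.get)
```

Because the output is always drawn from `alignments`, this sketch can only return English words already present in the lexicon or translation table.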

Vector Distance

To use vector distance for OOV translation, we calculate the cosine similarity between word vectors to select the in-vocabulary word with the closest word vector to the OOV.

Our predicted translation is again the English word that most frequently aligns to the selected in-vocabulary source token.

To obtain vectors for OOVs, we use FastText models (Bojanowski et al., 2017) trained on source language monolingual data. Since FastText computes vectors from subword units, it can produce representations for arbitrary strings; we thus use FastText vectors for both the input OOV tokens and the in-vocabulary words.
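The nearest-neighbor lookup can be illustrated with the sketch below. It does not use FastText: as a stand-in, each word is represented by a sparse bag of character n-grams (an assumption made only to keep the example self-contained), which shares the key property that any string, seen or unseen, receives a vector. All function names are hypothetical.

```python
import math
from collections import Counter


def ngram_vector(word, n_min=3, n_max=5):
    """Sparse bag of character n-grams with boundary markers.

    Stand-in for a learned subword embedding; not the FastText model itself.
    """
    padded = f"<{word}>"
    grams = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            grams[padded[i:i + n]] += 1
    return grams


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def translate_by_vector_distance(oov, alignments):
    """alignments: dict mapping foreign word -> Counter of English alignment counts."""
    oov_vec = ngram_vector(oov)
    # Select the in-vocabulary word whose vector is closest to the OOV's vector.
    nearest = max(sorted(alignments),
                  key=lambda foreign: cosine(oov_vec, ngram_vector(foreign)))
    counts = alignments[nearest]
    # Predict the English word most frequently aligned to that neighbor.
    return max(sorted(counts), key=counts.get)
```

With learned embeddings the neighbor is chosen by distributional as well as orthographic similarity; the n-gram stand-in here captures only the orthographic part.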

Prior work has used word vectors for handling OOVs (Zou et al., 2013; Zhang et al., 2014; Madhyastha and España Bonet, 2017, and more), but the majority learn a bilingual mapping between the source and target languages. Our method does not learn such a mapping, reducing our reliance on parallel data. Using subword vectors enables translation of OOVs unseen in the monolingual data.


Sequence-to-Sequence

The edit and vector distance methods are constrained to only output translations that occur in the bilingual dictionary or the translation table. Towards open-vocabulary OOV translation, we use character-level sequence-to-sequence (seq2seq) (Sutskever et al., 2014) models to generate English translations from source strings. This approach is similar in spirit to prior work on word-level NMT models that back off to character-level information for OOV tokens (Luong and Manning, 2016). To the best of our knowledge, this is the first use of seq2seq for translating OOVs in SMT.

We use an LSTM-based seq2seq model with attention, which is trained on source-target pairs extracted from the translation table and bilingual dictionary. We weight the examples in our training data, since certain translations are more common than others. Pairs extracted from the translation table are weighted by their absolute alignment frequency, and pairs from the bilingual dictionary are given a constant weight of 100. See Appendices B and C for training dataset sizes and implementation details.
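The weighting scheme can be sketched as below. The text does not say whether the weights scale the loss or drive example sampling; this sketch assumes loss scaling, and both function names are hypothetical.

```python
def build_weighted_examples(translation_table, dictionary, dict_weight=100):
    """Return (source_chars, target_chars, weight) triples for character-level training.

    translation_table: iterable of (foreign, english, alignment_count).
    dictionary: iterable of (foreign, english) entries from the bilingual lexicon.
    """
    examples = []
    for foreign, english, count in translation_table:
        # Translation-table pairs are weighted by absolute alignment frequency.
        examples.append((list(foreign), list(english), float(count)))
    for foreign, english in dictionary:
        # Dictionary pairs receive a constant weight.
        examples.append((list(foreign), list(english), float(dict_weight)))
    return examples


def weighted_loss(per_example_losses, weights):
    """Weight-normalized mean of per-example losses (assumed use of the weights)."""
    total = sum(weights)
    return sum(loss * w for loss, w in zip(per_example_losses, weights)) / total
```

Under this reading, a dictionary entry counts as much as a translation-table pair that was aligned 100 times.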

4 Experiments and Results

To intrinsically evaluate OOV module performance, we measure the proportion of predicted translations that exactly match target translations.
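The metric is simple enough to state in a few lines. The sketch below assumes predictions and references are parallel lists of strings with one reference per OOV; the function name is hypothetical.

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions that are string-identical to their reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    matches = sum(pred == ref for pred, ref in zip(predictions, references))
    return matches / len(references)


# Example using the seq2seq predictions and gold translations from Table 4.
gold = ["cyberviolence", "Kafkastan", "balkanised"]
seq2seq_predictions = ["cyberviolence", "Kafkastan", "balkanized"]
accuracy = exact_match_accuracy(seq2seq_predictions, gold)  # 2 of 3 match
```

Since each OOV has a single reference, an acceptable spelling variant such as "balkanized" for "balkanised" is scored as an error, so exact match is a conservative measure.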

In addition, we measure the effect of integrating OOV translation systems into an SBMT system. We incorporate our OOV translation systems into an end-to-end machine translation system by adding OOVs and their predicted translations as translation pairs (for syntax-based MT, part-of-speech-tagged translation pairs) with an indicator feature that is tuned with other standard features. These pairs compete with do-not-translate pairs (i.e. where the source and target are identical); feature weights and language model scores determine whether the system uses a translated OOV or chooses to not translate it.
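The competition between a translated OOV and its do-not-translate alternative can be pictured with the toy scorer below. It is a deliberately simplified sketch: a real decoder scores whole derivations with many features, whereas this example assumes a linear model over just two hypothetical features, an indicator for using the OOV translator and a language model score supplied by the caller.

```python
def choose_translation(oov, predicted, weights, lm_score):
    """Pick between the predicted translation and copying the OOV unchanged.

    weights: dict of feature weights (hypothetical feature names).
    lm_score: callable returning a language model log-score for a target word.
    """
    candidates = [
        # Pair proposed by the OOV translator, marked by the indicator feature.
        (predicted, {"oov_translated": 1.0, "lm": lm_score(predicted)}),
        # Do-not-translate pair: source and target are identical.
        (oov, {"oov_translated": 0.0, "lm": lm_score(oov)}),
    ]

    def model_score(features):
        return sum(weights[name] * value for name, value in features.items())

    return max(candidates, key=lambda cand: model_score(cand[1]))[0]
```

A negative tuned weight on the indicator makes the system more reluctant to trust the OOV translator, which matters for names that should simply be copied.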

4.1 Intrinsic OOV Translation Results

The accuracy of each OOV translation method on each of the fourteen languages is presented in Table 1. On average, the seq2seq models perform best, followed by the edit distance systems and then the vector distance systems. The exceptions are Bengali, where edit distance is most accurate, and Urdu, where vector distance is most accurate.

Language Edit Distance Vector Distance Seq2Seq
Amharic 22.8% 14.2% 27.0%
Arabic 20.0% 15.8% 29.4%
Bengali 23.5% 23.2% 20.5%
Farsi 27.2% 25.2% 35.6%
Hausa 23.0% 7.2% 24.8%
Hungarian 23.4% 19.6% 32.4%
Russian 20.0% 20.2% 33.4%
Somali 30.4% 18.4% 37.2%
Spanish 20.8% 16.8% 28.6%
Tamil 21.9% 21.4% 28.7%
Turkish 29.2% 28.8% 38.6%
Urdu 13.3% 17.4% 10.2%
Uzbek 22.6% 21.2% 36.4%
Yoruba 14.6% 11.0% 19.8%
Average 22.34% 19.31% 28.76%
Table 1: Intrinsic test set exact match accuracy for each of the translation methods. For all source languages, the target is English. Bold marks the best performing method for each pair.

4.2 Extrinsic SBMT Results

System avg Δ amh ara ben fas hau hun rus som spa tam tur urd uzb yor
SBMT 21.36 - 15.75 21.13 10.92 24.12 21.81 17.56 31.36 21.96 40.36 20.77 20.12 18.22 16.86 18.09
edit dist. 21.72 +0.36 15.76 22.81 11.31 25.22 22.20 18.39 31.45 22.63 40.94 21.98 17.29 18.86 17.13 18.06
vector dist. 21.61 +0.25 16.07 23.02 10.16 23.87 21.55 17.78 31.52 22.19 41.00 22.58 20.30 18.08 17.39 17.06
seq2seq 21.86 +0.50 15.84 23.17 10.92 24.44 21.85 17.74 32.04 22.63 40.73 22.35 20.49 18.49 17.07 18.30
BPE NMT 10.72 -10.71 6.85 8.92 3.74 15.08 15.63 6.51 8.61 13.02 20.31 10.67 6.93 12.80 8.91 12.12
Table 2: Test SBMT BLEU scores for each language pair and OOV translation method. Columns amh through yor are source language codes; avg is the mean over the fourteen languages and Δ is its difference from the SBMT baseline. The best OOV translation method for each pair is bolded. BLEU scores of BPE NMT trained on same data are also provided for comparison.

Table 2 illustrates the effects of our OOV module on SBMT BLEU across the fourteen languages. We compare against a baseline subword NMT system trained on the same data with source- and target-side byte pair encoding (BPE; Sennrich et al., 2016); for NMT baseline implementation details, see Appendix E. All MT systems are trained on between 262K and 11.9M words; see Appendix D for the amount of training data per language.

On average, adding the seq2seq OOV translator to SBMT produced the highest BLEU scores, followed by the edit and vector distance methods. Notably, the seq2seq OOV translator improves SBMT BLEU for all languages except Bengali, where it ties with the vanilla SBMT baseline. For each language, at least one of the OOV-augmented systems improves upon the SBMT baseline. This confirms that OOV translation from subword information has broad utility; adding an OOV translator to SBMT is an easy and consistent way to improve performance. We see average gains of 0.5 BLEU points, with a 2.0 BLEU improvement for Arabic. SBMT with or without OOV translation outperforms the BPE NMT models, supporting previous observations that SMT is superior in low resource scenarios (for reference, Koehn and Knowles (2017) report that NMT begins to outperform SMT for English-Spanish when trained on more than approximately 15 million words). This also further motivates OOV translation in SMT, since directly applying subword NMT is clearly impractical here.
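The headline numbers can be re-derived from the SBMT and seq2seq rows of Table 2. The short script below only repeats that arithmetic; the variable names are illustrative.

```python
languages = "amh ara ben fas hau hun rus som spa tam tur urd uzb yor".split()

# Per-language test BLEU copied from the SBMT and seq2seq rows of Table 2.
sbmt = [15.75, 21.13, 10.92, 24.12, 21.81, 17.56, 31.36,
        21.96, 40.36, 20.77, 20.12, 18.22, 16.86, 18.09]
seq2seq = [15.84, 23.17, 10.92, 24.44, 21.85, 17.74, 32.04,
           22.63, 40.73, 22.35, 20.49, 18.49, 17.07, 18.30]

# Per-language gain of SBMT + seq2seq OOV translation over plain SBMT.
gains = {lang: round(s - b, 2) for lang, b, s in zip(languages, sbmt, seq2seq)}

average_gain = sum(seq2seq) / len(seq2seq) - sum(sbmt) / len(sbmt)
best_language = max(gains, key=gains.get)
tied = [lang for lang, gain in gains.items() if gain == 0.0]

print(f"average gain: {average_gain:.2f} BLEU")
print(f"largest gain: {best_language} (+{gains[best_language]:.2f})")
print(f"ties with baseline: {tied}")
```

The average gain rounds to 0.50 BLEU, the largest gain is for Arabic, and Bengali is the only tie, matching the statements in the text.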

5 Discussion

Method Performance by OOV Type

To further study the ability of our OOV translation methods to handle various types of OOV tokens, we randomly sampled 100 examples from the development set used in the Spanish-English intrinsic OOV translation task and broadly categorized each OOV as a morphological variation of an in-vocabulary word, a misspelling, a transliteration, a compound word, or a proper noun that should be copied to the translation. The performance of each method for each category is presented in Table 3.

The seq2seq methods show the best performance on examples involving morphological variation, since they reason over subword units. This also explains the large BLEU gains when adding seq2seq OOV translation to Arabic SBMT, as the language is morphologically complex and many OOVs are morphological variants of known words. Reasoning over subword units also enables seq2seq translators to handle OOV tokens created by compounding, where the edit and vector distance methods struggle. As expected, the edit distance-based method outperforms the others on OOV words generated by misspellings, and it draws even with the seq2seq method on transliteration cases.

Many of the proper nouns are rare words, which the edit and vector distance methods handle poorly (0 and 1 of the 5 sampled cases correct, respectively). The seq2seq model performs better (3 of 5 correct).

OOV Category Occurrences in Sample Edit Distance Vector Distance Seq2Seq
Morphological Variation 57 14.0% 10.5% 24.6%
Misspelling 19 31.6% 26.3% 21.1%
Transliteration 14 21.4% 14.3% 21.4%
Compounding 5 0.0% 0.0% 40.0%
Proper Noun 5 0.0% 20.0% 60.0%
All 100 17.0% 14.0% 26.0%
Table 3: Exact-match accuracy of each OOV translation method, stratified by OOV type. The best OOV translation method for each category is bolded.

Seq2Seq Produces English Words Unseen During Training

Seq2seq models are able to compose units of meaning to produce novel target-side tokens unseen during training; we see this theoretical advantage in practice. In the examples in Table 4, the seq2seq-predicted translations did not occur in the target side of the training data, so the model must have combined subword units to produce them. The model learns to (a) combine previously seen subword units into novel English compounds, (b) transliterate sequences, and (c) inflect verbs for which it has seen a root form.

(a) Compounding
SPA source OOV ciberviolencia
ENG gold translation cyberviolence
edit distance prediction Roscomnadzor
seq2seq prediction cyberviolence

(b) Transliteration / Copying
SPA source OOV Kafkastán
ENG gold translation Kafkastan
edit distance prediction alternative form of kazajistán
seq2seq prediction Kafkastan

(c) Morphology
SPA source OOV balcanizada
ENG gold translation balkanised
edit distance prediction unbanked
seq2seq prediction balkanized
Table 4: seq2seq models recombine in-vocabulary tokens to output novel words unseen during training.

6 Related Work

Many strategies have been developed for OOV translation, especially in SMT. For example, Nießen and Ney (2000); Koehn and Knight (2003); Virpioja et al. (2007) translate OOV compounds and other morphologically complex words by splitting and translating the resultant segments. Al-Onaizan and Knight (2002); Habash (2008); Hermjakob et al. (2008); Durrani et al. (2014) explore transliteration for OOV named entities.

Many approaches also translate OOV tokens by expanding the translation lexicon with additional bilingual or monolingual resources (Rapp, 1995; Callison-Burch et al., 2006; Haghighi et al., 2008; Marton et al., 2009; Daumé III and Jagarlamudi, 2011; Razmara et al., 2013; Irvine and Callison-Burch, 2013; Mikolov et al., 2013; Saluja et al., 2014; Zhao et al., 2015, among others).

OOV translation has also been cast as a problem of decipherment (Ravi and Knight, 2011; Dou and Knight, 2012), and other approaches use information from cognates or related languages (Hajič et al., 2000; Kondrak et al., 2003; De Gispert and Marino, 2006; Durrani et al., 2010; Wang et al., 2012; Nakov and Ng, 2012; Dholakia and Sarkar, 2014; Tsvetkov and Dyer, 2015, among others).

7 Conclusion

We compare three generally-applicable strategies for translating out-of-vocabulary words, none of which rely on any language-specific resources or typological assumptions beyond the presence of subword units. Integrating these OOV translators into an SMT system consistently improves translation quality over a typologically diverse set of fourteen languages. We analyze method performance over a range of OOV types and also demonstrate that seq2seq OOV translators compose characters to generate novel target-side translations.


References

  • Al-Onaizan and Knight (2002) Yaser Al-Onaizan and Kevin Knight. 2002. Machine Transliteration of Names in Arabic Text. In Proc. of the ACL Workshop on Computational Approaches to Semitic Languages, pages 1–13.
  • Bojanowski et al. (2017) Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching Word Vectors with Subword Information. Transactions of the ACL, 5:135–146.
  • Callison-Burch et al. (2006) Chris Callison-Burch, Philipp Koehn, and Miles Osborne. 2006. Improved Statistical Machine Translation Using Paraphrases. In Proc. of NAACL, pages 17–24.
  • Daumé III and Jagarlamudi (2011) Hal Daumé III and Jagadeesh Jagarlamudi. 2011. Domain Adaptation for Machine Translation by Mining Unseen Words. In Proc. of NAACL, pages 407–412.
  • De Gispert and Marino (2006) Adrià De Gispert and Jose B Marino. 2006. Catalan-English Statistical Machine Translation without Parallel Corpus: Bridging through Spanish. In Proc. of LREC, pages 65–68.
  • Dholakia and Sarkar (2014) Rohit Dholakia and Anoop Sarkar. 2014. Pivot-based Triangulation for Low-Resource Languages. In Proc. of AMTA, pages 315–328.
  • Dou and Knight (2012) Qing Dou and Kevin Knight. 2012. Large Scale Decipherment for Out-of-Domain Machine Translation. In Proc. of EMNLP, pages 266–275.
  • Durrani et al. (2010) Nadir Durrani, Hassan Sajjad, Alexander Fraser, and Helmut Schmid. 2010. Hindi-to-Urdu Machine Translation through Transliteration. In Proc. of ACL, pages 465–474.
  • Durrani et al. (2014) Nadir Durrani, Hassan Sajjad, Hieu Hoang, and Philipp Koehn. 2014. Integrating an Unsupervised Transliteration Model into Statistical Machine Translation. In Proc. of ACL, pages 148–153.
  • Gujral et al. (2016) Biman Gujral, Huda Khayrallah, and Philipp Koehn. 2016. Translation of Unknown Words in Low Resource Languages. In Proc. of AMTA.
  • Habash (2008) Nizar Habash. 2008. Four Techniques for Online Handling of Out-of-Vocabulary Words in Arabic-English Statistical Machine Translation. In Proc. of ACL, pages 57–60.
  • Habash (2009) Nizar Habash. 2009. REMOOV: A Tool for Online Handling of Out-of-Vocabulary Words in Machine Translation. In Proc. of MEDAR.
  • Haghighi et al. (2008) Aria Haghighi, Percy Liang, Taylor Berg-Kirkpatrick, and Dan Klein. 2008. Learning Bilingual Lexicons from Monolingual Corpora. In Proc. of ACL, pages 771–779.
  • Hajič et al. (2000) Jan Hajič, Jan Hric, and Vladislav Kuboň. 2000. Machine Translation of Very Close Languages. In Proc. of ANLP, pages 7–12.
  • Hermjakob et al. (2008) Ulf Hermjakob, Kevin Knight, and Hal Daumé III. 2008. Name Translation in Statistical Machine Translation - Learning When to Transliterate. In Proc. of ACL, pages 389–397.
  • Irvine and Callison-Burch (2013) Ann Irvine and Chris Callison-Burch. 2013. Combining Bilingual and Comparable Corpora for Low Resource Machine Translation. In Proc. of the Eighth Workshop on Statistical Machine Translation, pages 262–270.
  • Kingma and Ba (2014) Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. ArXiv:1412.6980.
  • Klein et al. (2017) Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander M. Rush. 2017. OpenNMT: Open-Source Toolkit for Neural Machine Translation. In Proc. of ACL.
  • Koehn and Knight (2003) Philipp Koehn and Kevin Knight. 2003. Empirical Methods for Compound Splitting. In Proc. of EACL, pages 187–193.
  • Koehn and Knowles (2017) Philipp Koehn and Rebecca Knowles. 2017. Six Challenges for Neural Machine Translation. In Proc. of the First Workshop on Neural Machine Translation, pages 28–39.
  • Kondrak et al. (2003) Grzegorz Kondrak, Daniel Marcu, and Kevin Knight. 2003. Cognates Can Improve Statistical Translation Models. In Proc. of NAACL, pages 46–48.
  • Levenshtein (1966) V. I. Levenshtein. 1966. Binary Codes Capable of Correcting Deletions, Insertions and Reversals. Soviet Physics Doklady, 10:707.
  • Liang et al. (2006) Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by Agreement. In Proc. of NAACL, pages 104–111.
  • Luong and Manning (2016) Minh-Thang Luong and Christopher D. Manning. 2016. Achieving Open Vocabulary Neural Machine Translation with Hybrid Word-Character Models. In Proc. of ACL, pages 1054–1063.
  • Luong et al. (2015) Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective Approaches to Attention-based Neural Machine Translation. In Proc. of EMNLP, pages 1412–1421.
  • Madhyastha and España Bonet (2017) Pranava Swaroop Madhyastha and Cristina España Bonet. 2017. Learning Bilingual Projections of Embeddings for Vocabulary Expansion in Machine Translation. In Proc. of the 2nd Workshop on Representation Learning for NLP, pages 139–145.
  • Marton et al. (2009) Yuval Marton, Chris Callison-Burch, and Philip Resnik. 2009. Improved Statistical Machine Translation Using Monolingually-Derived Paraphrases. In Proc. of EMNLP, pages 381–390.
  • Mikolov et al. (2013) Tomas Mikolov, Quoc V. Le, and Ilya Sutskever. 2013. Exploiting Similarities among Languages for Machine Translation. ArXiv:1309.4168.
  • Nakov and Ng (2012) Preslav Nakov and Hwee Tou Ng. 2012. Improving Statistical Machine Translation for a Resource-poor Language Using Related Resource-Rich Languages. Journal of Artificial Intelligence Research, 44:179–222.
  • Nießen and Ney (2000) Sonja Nießen and Hermann Ney. 2000. Improving SMT Quality with Morpho-syntactic Analysis. In Proc. of COLING, pages 1081–1085.
  • Och and Ney (2003) Franz Josef Och and Hermann Ney. 2003. A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29:19–51.
  • Rapp (1995) Reinhard Rapp. 1995. Identifying Word Translations in Non-Parallel Texts. In Proc. of ACL, pages 320–322.
  • Ravi and Knight (2011) Sujith Ravi and Kevin Knight. 2011. Deciphering Foreign Language. In Proc. of ACL, pages 12–21.
  • Razmara et al. (2013) Majid Razmara, Maryam Siahbani, Reza Haffari, and Anoop Sarkar. 2013. Graph Propagation for Paraphrasing Out-of-Vocabulary Words in Statistical Machine Translation. In Proc. of ACL, pages 1105–1115.
  • Saluja et al. (2014) Avneesh Saluja, Hany Hassan, Kristina Toutanova, and Chris Quirk. 2014. Graph-based Semi-Supervised Learning of Translation Models from Monolingual Data. In Proc. of ACL, pages 676–686.
  • Sennrich et al. (2016) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural Machine Translation of Rare Words with Subword Units. In Proc. of ACL, pages 1715–1725.
  • Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to Sequence Learning with Neural Networks. In Proc. of NIPS, pages 3104–3112.
  • Tsvetkov and Dyer (2015) Yulia Tsvetkov and Chris Dyer. 2015. Lexicon Stratification for Translating Out-of-Vocabulary Words. In Proc. of ACL, pages 125–131.
  • Virpioja et al. (2007) Sami Virpioja, Jaakko J. Väyrynen, Mathias Creutz, and Markus Sadeniemi. 2007. Morphology-Aware Statistical Machine Translation Based on Morphs Induced in an Unsupervised Manner. In Proc. of the 11th Machine Translation Summit.
  • Wang et al. (2012) Pidong Wang, Preslav Nakov, and Hwee Tou Ng. 2012. Source Language Adaptation for Resource-Poor Machine Translation. In Proc. of EMNLP, pages 286–296.
  • Zhang et al. (2014) Jiajun Zhang, Shujie Liu, Mu Li, Ming Zhou, and Chengqing Zong. 2014. Bilingually-constrained Phrase Embeddings for Machine Translation. In Proc. of ACL.
  • Zhao et al. (2015) Kai Zhao, Hany Hassan, and Michael Auli. 2015. Learning Translation Models from Monolingual Continuous Representations. In Proc. of NAACL, pages 1527–1536.
  • Zou et al. (2013) Will Y. Zou, Richard Socher, Daniel Cer, and Christopher D. Manning. 2013. Bilingual Word Embeddings for Phrase-Based Machine Translation. In Proc. of EMNLP, pages 1393–1398.

Appendix A Validating Dataset Quality

To validate the quality of the automatically-constructed validation and test sets, we built an interface to enable native speakers to post-edit the generated translations. In this setup, speakers cannot provide their own translations for foreign words. Rather, they are shown a foreign sentence and its aligned English sentence, with the OOV and the translation respectively highlighted. They can edit the translation by modifying the highlighting on the English sentence. Speakers are allowed to highlight discontiguous spans. For example, the translation of the Spanish word comeré, as in “No comeré la comida.”, would be will … eat, as in “I will not eat the food”.

Volunteer native speakers validated the OOV datasets for 5 out of our 14 languages (Arabic, Bengali, Farsi, Russian, and Spanish). Many of the generated foreign OOVs and translations were not modified in the process, confirming their quality and the utility of the data-collection method.

Appendix B Number of Word Translation Pairs

Language Number of word translation pairs
Amharic 210.0K
Arabic 370.0K
Bengali 161.1K
Farsi 146.0K
Hausa 168.9K
Hungarian 938.3K
Russian 875.5K
Somali 179.5K
Spanish 944.1K
Tamil 54.6K
Turkish 349.9K
Urdu 123.9K
Uzbek 404.5K
Yoruba 233.8K
Table 5: Number of word translation pairs (used to train the seq2seq OOV translator) for each language.

Appendix C Seq2Seq OOV Translator Implementation Details

Our seq2seq models consist of 3-layer bidirectional LSTM networks with 1024 hidden units. After each LSTM layer except the last, we apply dropout of 0.3. Our character embeddings are 1024-dimensional. The model is trained with Adam (Kingma and Ba, 2014) with a constant learning rate of 0.0001 and a batch size of 128. The models are trained until sequence-level exact-match accuracy on the validation set shows no improvement for three epochs. We decode with a beam size of 1, and use global attention with the general scoring function and input feeding, as described in Luong et al. (2015).
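For reference, global attention with the general scoring function can be written as follows. The notation is taken from Luong et al. (2015) and is not defined elsewhere in this paper: $h_t$ is the decoder hidden state at step $t$, $\bar{h}_s$ is the encoder hidden state at source position $s$, and $W_a$, $W_c$ are learned matrices.

```latex
\begin{align}
\mathrm{score}(h_t, \bar{h}_s) &= h_t^{\top} W_a \bar{h}_s \\
a_t(s) &= \frac{\exp\big(\mathrm{score}(h_t, \bar{h}_s)\big)}
               {\sum_{s'} \exp\big(\mathrm{score}(h_t, \bar{h}_{s'})\big)} \\
c_t &= \sum_{s} a_t(s)\, \bar{h}_s \\
\tilde{h}_t &= \tanh\big(W_c\, [c_t ; h_t]\big)
\end{align}
```

With input feeding, the attentional vector $\tilde{h}_t$ is concatenated with the decoder input at step $t+1$, so that earlier alignment decisions inform later ones.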

After training each model to convergence, we use the checkpoint with the highest exact match accuracy on a held-out validation set. Checkpoints are saved every 10,000 parameter updates and at the end of each epoch.
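The stopping rule and checkpoint choice can be simulated as below. The sketch assumes one validation measurement per epoch (ignoring the intermediate checkpoints saved every 10,000 updates), and the function name is hypothetical.

```python
def train_with_early_stopping(validation_accuracies, patience=3):
    """Apply the stopping rule to a sequence of per-epoch validation accuracies.

    Returns (stop_epoch, best_epoch), both 1-indexed. Training stops once
    accuracy has not improved for `patience` consecutive epochs; the best
    epoch identifies the checkpoint that would be used for prediction.
    """
    best_accuracy = float("-inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, accuracy in enumerate(validation_accuracies, start=1):
        if accuracy > best_accuracy:
            best_accuracy, best_epoch = accuracy, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    # Ran out of epochs before the patience limit was reached.
    return len(validation_accuracies), best_epoch
```

For example, with accuracies `[0.10, 0.20, 0.25, 0.24, 0.25, 0.23, 0.30]` the rule stops after epoch 6 and selects the epoch-3 checkpoint; the later 0.30 is never reached.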

Appendix D Amount of MT Training Data For Each Language

Source Language Number of target tokens
Amharic 1.24M
Arabic 2.32M
Bengali 494.0K
Farsi 2.14M
Hausa 1.10M
Hungarian 5.20M
Russian 9.77M
Somali 1.40M
Spanish 11.90M
Tamil 262.5K
Turkish 2.23M
Urdu 527.1K
Uzbek 2.36M
Yoruba 1.11M
Table 6: Amount of training data (target-side tokens) used by SBMT and NMT systems for each language.

Appendix E BPE NMT Baseline Implementation Details

To train the BPE NMT models, we first apply byte pair encoding with 10K joins to the source and target data. We train a sequence-to-sequence model on the data with OpenNMT-py (Klein et al., 2017), at git commit hash 0ecec8b. The model is built and trained using the default hyperparameters: 2-layer LSTMs in both the encoder and decoder, with 500-dimensional embedding vectors and RNN hidden states, trained with SGD with an initial learning rate of 1.0. We edit the learning rate schedule from the default, training for 50 epochs and decaying after each epoch only when validation perplexity fails to decrease. A checkpoint is saved after each epoch, and we use the checkpoint with the best validation perplexity to make test set predictions.
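The modified schedule can be sketched as follows. Two details are assumptions rather than facts from the text: the decay factor of 0.5, and the reading that "fails to improve" means the validation perplexity did not drop relative to the previous epoch. The function name is hypothetical and the code does not call OpenNMT-py.

```python
def run_schedule(validation_perplexities, initial_lr=1.0, decay=0.5):
    """Simulate decay-on-plateau over per-epoch validation perplexities.

    Returns (learning_rates, best_epoch): the rate used in each epoch and the
    1-indexed epoch with the lowest validation perplexity.
    """
    lr = initial_lr
    previous_ppl = float("inf")
    best_ppl = float("inf")
    best_epoch = 0
    learning_rates = []
    for epoch, ppl in enumerate(validation_perplexities, start=1):
        learning_rates.append(lr)
        if ppl < best_ppl:
            best_ppl, best_epoch = ppl, epoch
        # Assumption: decay only when perplexity did not drop since last epoch.
        if ppl >= previous_ppl:
            lr *= decay
        previous_ppl = ppl
    return learning_rates, best_epoch
```

The returned best epoch corresponds to the checkpoint that would be used for test set predictions.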