Even though neural models have improved the state-of-the-art in machine translation considerably in recent years, they still underperform in specific conditions. One such condition is out-of-domain translation. Koehn and Knowles (2017) show that neural machine translation (NMT) systems perform poorly in such settings and that their poor performance cannot be explained solely by the fact that out-of-domain translation is difficult: non-neural, statistical machine translation (SMT) systems are clearly superior at this task. For this reason, Koehn and Knowles (2017) identify translation of out-of-domain text as a key challenge for NMT.
Catastrophic failure to translate out-of-domain text can be viewed as overfitting to the training domain, i.e. systems learn idiosyncrasies of the domain rather than more general features.
Our goal is to learn models that generalize well to unseen data distributions, including data from other domains. We will refer to this property of showing good generalization to unseen domains as domain robustness.
We consider domain robustness a desirable property of NLP systems, along with other types of robustness, such as robustness against adversarial examples (Goodfellow et al. 2015) or human input corruption such as typos (Belinkov and Bisk, 2018). While domain adaptation with small amounts of parallel or monolingual in-domain data has proven very effective for NMT (e.g. Luong and Manning, 2015; Sennrich et al., 2016a; Kobus et al., 2017), the target domain(s) may be unknown when a system is built, and there are language pairs for which training data is only available for limited domains.
Model architectures and training techniques have evolved since Koehn and Knowles (2017)’s study, and it is unclear to what extent this problem still persists. We therefore revisit the hypothesis that NMT systems exhibit low domain robustness. In preliminary experiments, we demonstrate that current models still fail at out-of-domain translation: BLEU scores drop drastically for test domains other than the training domain. The degradation in quality is less severe for SMT systems.
An analysis of our baseline systems reveals that hallucinated content occurs frequently in out-of-domain translations. Several authors present anecdotal evidence for NMT systems occasionally falling into a hallucination mode where translations are grammatically correct, but unrelated to the source sentence (Arthur et al., 2016; Koehn and Knowles, 2017; Nguyen and Chiang, 2018). See Figure 1 for an example. After assessing translations manually, we find that hallucination is more pronounced in out-of-domain translation. We therefore expect methods that alleviate the problem of hallucinated translations to indirectly improve domain robustness.
As a means to reduce hallucination, we experiment with several techniques and assess their effectiveness in improving domain robustness: reconstruction (Tu et al., 2017; Niu et al., 2019), subword regularization (Kudo, 2018), neural noisy channel models (Li and Jurafsky, 2016; Yee et al., 2019), and defensive distillation (Papernot et al., 2016), as well as combinations of these techniques.
The main contributions of this paper are:
we perform an analysis of current NMT systems that confirms that domain robustness is low, and that hallucination is a major problem,
we empirically compare strategies to improve domain robustness in NMT,
we provide code and data sets to serve as baselines in future work: https://github.com/ZurichNLP/domain-robustness
2 Data Sets
We report experiments on two different translation directions: German→English (DE→EN) and German→Romansh (DE→RM).
We use corpora from OPUS to define five domains: medical, IT, koran, law and subtitles. See Table 1 for an overview of sizes per domain. The domains are quite distant, and we therefore expect that systems trained on a single domain will have low domain robustness if tested on other domains.
For each domain, we select 2000 consecutive sentence pairs each for development and testing. Our test sets are different from Koehn and Knowles (2017), so results are not directly comparable.
In all experiments, the medical domain serves as the training domain, while the remaining four domains are used for testing.
To complement our DE→EN experiments, we also train systems for DE→RM. Romansh is a Romance, low-resource language, but it has some parallel resources thanks to its status as an official Swiss language. Our training data consists of the Allegra corpus provided by Scherrer and Cartoni (2012), which contains mostly law text, and an in-house collection of press releases from the Swiss canton of Grisons. As test domain (unseen during training), we use blog posts from Convivenza (see https://www.suedostschweiz.ch/blogs/convivenza). From both data sets we select 2000 consecutive sentence pairs as test sets.
3 State-of-the-art Models Exhibit Low Domain Robustness
In this section, we establish that current NMT systems exhibit low domain robustness. We do so by analyzing our baseline systems automatically and manually.
3.1 Experimental Setup for Baseline Models
We use Moses scripts to normalize punctuation and tokenize all data. We apply a truecasing model trained only on in-domain training data. Similarly, we apply BPE (Sennrich et al., 2016b) with 32k (DE→EN) or 16k (DE→RM) merge operations learned only from in-domain data. We train two baselines:
SMT Baseline A standard, phrase-based statistical model trained with Moses (Koehn et al., 2007), using mtrain (Läubli et al., 2018; https://github.com/ZurichNLP/mtrain) as frontend, with standard settings.
We always test on different test sets, one for each domain, including the training domain. We consistently use a beam size of 10 to translate test data. We report case-sensitive BLEU (Papineni et al., 2002) scores on detokenized text, computed with SacreBLEU (Post, 2018).
3.2 Analysis of Baseline Systems
Tables 3 and 4 show automatic evaluation results for all our baseline models. Neural models achieve good performance on the respective in-domain test sets (61.5 BLEU on medical for DE→EN; 52.5 BLEU on law for DE→RM), but on out-of-domain text, translation quality is clearly diminished, with an average BLEU of roughly 12 (DE→EN) and 17 (DE→RM). The following analysis will focus on our DE→EN baseline systems.
Unknown words constitute one possible reason for failing to translate out-of-domain texts. As shown in Table 2, the percentage of words that are not seen during training is much higher in all out-of-domain test sets. However, unknown words cannot be the only reason for low translation quality: the test sets with the lowest BLEU scores (koran and subtitles) actually have an out-of-vocabulary (OOV) rate similar to the IT test set, where BLEU scores are much higher across both baseline models.
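The OOV rate used here can be sketched with a few lines of Python; the toy German sentences are invented for illustration and the word-level computation is a simplification of the actual tokenized setup:

```python
def oov_rate(train_tokens, test_tokens):
    """Fraction of test tokens whose word type never occurs in the training data."""
    vocab = set(train_tokens)
    unseen = sum(1 for tok in test_tokens if tok not in vocab)
    return unseen / len(test_tokens)

# Toy example: "Server" and "Anfrage" are unseen at training time.
train = "der Patient erhielt die Dosis".split()
test = "der Server erhielt die Anfrage".split()
print(oov_rate(train, test))  # 0.4
```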
Additionally, our SMT baseline shows better generalization to some domains unseen at training time, while the average BLEU is comparable to the NMT baseline. In the IT domain, the outcome is most extreme: the SMT system beats the neural system by 4.3 BLEU. This demonstrates that the low domain robustness of NMT is not (only) a data problem, but also due to the model’s inductive biases.
Compared to results reported by Koehn and Knowles (2017), the general trend remains the same: NMT loses more translation quality than PBSMT when comparing in-domain and out-of-domain performance, although the gap in out-of-domain performance is substantially smaller (0.1 BLEU as compared to 3 BLEU for systems trained on medical data).
As a further control, we train an additional baseline system on all domains (the subtitles domain, with 23M sentences, was subsampled to 1M sentence pairs so as not to overwhelm the remaining domains, which total 3M sentences). We use it to test whether the data we have held out for out-of-domain testing is inherently more difficult to translate than the in-domain test set. The results in Table 5 show that this is not the case: BLEU ranges between 18.4 and 66, with an average out-of-domain BLEU of 37.5.
NMT models can be understood as language models of the target language, conditioned on a representation of source text. This means that NMT models have no explicit mechanism – as SMT models do – that enforces coverage of the source sentence, and if the representation of an out-of-domain source sentence is outside the training distribution, it can be seemingly ignored. This gives rise to a tendency to hallucinate translations, i.e. to produce translations that are fluent, but unrelated to the content of the source sentence.
We hypothesize that hallucination is more common in out-of-domain settings. A small manual evaluation performed by the main author confirms that this is indeed the case. We evaluate the fluency and adequacy of our baseline systems (we refer to them as NMT and SMT). In a blind setup, we annotated a random sample of 100 sentence pairs per domain. As controls, we mix in pairs consisting of (source, actual reference), treating the reference translation as an additional system.
Evaluation of adequacy The annotator is presented with a sentence pair and asked to judge whether the translation is adequate, partially adequate or inadequate. Thus, effectively the annotator is performing a 3-way categorization task.
Evaluation of fluency We use the same data as for the evaluation of adequacy, however, the annotator is shown only the translation, without the corresponding source sentence. The annotator is asked whether the given sentence is fluent, partially fluent or not fluent.
Figure 2 shows the results of the manual evaluation with respect to adequacy and fluency. Individual fluency values in the figure are computed as follows:
fluency = (f + 0.5 · p) / (f + p + n)

where f, p and n are the numbers of fluent, partially fluent and non-fluent translations, respectively. Adequacy values are computed in the same way. On the in-domain test set, both baselines achieve high adequacy and fluency, with the NMT baseline effectively matching the adequacy and fluency of the reference translations.
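One way to aggregate such three-way ratings into a single value is to count partial ratings at half weight (an illustrative choice, sketched here in Python):

```python
def rating_score(full, partial, none):
    """Collapse 3-way annotations (e.g. fluent / partially fluent / not
    fluent) into a single value in [0, 1]; partial ratings count half."""
    total = full + partial + none
    return (full + 0.5 * partial) / total

# 60 fluent, 30 partially fluent, 10 non-fluent out of 100 annotations:
print(rating_score(60, 30, 10))  # 0.75
```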
Regarding adequacy, the in-domain samples contain only a small number of translations with content unrelated to the source (1% to 2%). On out-of-domain data on the other hand, both baselines produce a high number of inadequate translations: 57% (SMT) and 84% (NMT). These results suggest that the extremely low BLEU scores on these two test sets (see Table 3) are in large part due to made up content in the translations.
Regarding fluency in out-of-domain settings, SMT and NMT baselines behave very differently: SMT translations are more adequate, while NMT translations are more fluent. This trend is most extreme in the koran domain, where only 2% of SMT translations are found to be fluent (compared to 36% for NMT).
Further analysis of both annotations shows that NMT translations found to be inadequate are not necessarily disfluent in out-of-domain settings. Table 6 shows that, on average, on out-of-domain data, 35% of NMT translations are both inadequate and fluent, while the same is only true for 4% of SMT translations. We refer to translations of this kind as hallucinations.
| System   | in-domain | average OOD |
| -------- | --------- | ----------- |
| NMT + SR | 1%        | 37%         |
| NMT + D  | 3%        | 33%         |
To summarize our analysis of baseline models, we find that the domain robustness of current NMT systems is still lacking and that inadequate, but fluent translations are a prominent issue. This motivates our choice of techniques to improve domain robustness.
4 Approaches to Improve Domain Robustness
We discuss approaches that can potentially remedy the problem of low domain robustness.
Among them are the reconstruction architecture and training objective, which directly addresses hallucination; subword regularization, for which good results have been reported in out-of-domain translation; defensive distillation, a method that has not yet been used in NMT to address either hallucination or domain robustness; and a neural noisy channel model.
4.1 Reconstruction

Reconstruction (Tu et al., 2017) is a change to the model architecture that addresses the problem of adequacy. The authors propose to extend encoder-decoder models with a reconstructor component that learns to reconstruct the source sentence from decoder states. The reconstructor has two uses: as a training objective, it forces the decoder representations to retain information that will be useful for reconstruction; during inference, it can provide scores that can be used to re-rank translation hypotheses.
However, we observed in initial experiments that reconstruction from hidden states can be too easy: the reconstruction loss on training batches diminishes very quickly, to the point of being insignificant. To prevent the model from simply reserving parts of the decoder hidden states to memorize the input sentence, we use reconstruction from actual translations instead of hidden states (Niu et al., 2019). Translations are produced with differentiable sampling via the Straight-Through Gumbel-Softmax (Jang et al., 2017), which still allows joint optimization of translation and reconstruction. While Niu et al. (2019) implement reconstruction for recurrent architectures, we apply the technique to Transformers.
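The Straight-Through Gumbel-Softmax can be illustrated with a small stand-alone sketch (pure Python, no autograd; in a real framework the forward pass would emit the hard sample while gradients flow through the soft distribution):

```python
import math
import random

def gumbel_softmax_st(logits, tau=1.0, rng=random):
    """Straight-Through Gumbel-Softmax sample over a vector of logits.
    Returns (hard, soft): a one-hot sample and the underlying tempered
    softmax it was discretized from."""
    # Perturb each logit with Gumbel(0, 1) noise: g = -log(-log(u)).
    noisy = [l - math.log(-math.log(rng.random())) for l in logits]
    # Tempered softmax over the noisy logits.
    exps = [math.exp(x / tau) for x in noisy]
    z = sum(exps)
    soft = [e / z for e in exps]
    # "Straight-through": discretize to the one-hot argmax.
    k = soft.index(max(soft))
    hard = [1.0 if i == k else 0.0 for i in range(len(soft))]
    return hard, soft

hard, soft = gumbel_softmax_st([2.0, 0.5, -1.0], tau=0.5)
```

Lower temperatures make the soft distribution peakier, so the soft sample approaches the hard one.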
In order not to introduce any additional parameters for reconstruction, as recommended in Niu et al. (2019), we train a multilingual, bi-directional system with shared parameters as a further baseline. This bi-directional system is used to initialize the fine-tuning of reconstruction models. We empirically test whether our original baseline and the multilingual system have comparable performance.
4.2 Subword Regularization
Subword regularization (Kudo, 2018) is a form of data augmentation that, instead of applying a fixed subword segmentation like BPE, probabilistically samples a new subword segmentation each time a sentence is seen during training (i.e. for each epoch). At test time, the model either uses the 1-best segmentation, or translates the k-best segmentations and selects the highest-probability translation.
Kudo (2018) reports large improvements on low-resource and out-of-domain settings. In particular, improvements on in-house patent, web, and query test sets were in the range of 2–10 BLEU. In this work, we apply and evaluate subword regularization on public datasets. We apply sampling at training time, and translate 1-best segmented sentences at test time.
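The sampling idea can be illustrated with a toy segmenter that draws a segmentation uniformly from all decompositions of a word into vocabulary units (SentencePiece instead samples from the n-best segmentations under a unigram language model, with a smoothing parameter; this sketch and its vocabulary are invented for illustration):

```python
import random

def sample_segmentation(word, vocab, rng=random):
    """Return one segmentation of `word` into units from `vocab`,
    chosen uniformly among all valid segmentations (a toy stand-in
    for unigram-LM sampling)."""
    segmentations = []

    def split(rest, acc):
        if not rest:
            segmentations.append(list(acc))
            return
        for end in range(1, len(rest) + 1):
            piece = rest[:end]
            if piece in vocab:
                split(rest[end:], acc + [piece])

    split(word, [])
    return rng.choice(segmentations)

vocab = {"unfold", "un", "fold", "f", "old", "u", "n", "o", "l", "d"}
print(sample_segmentation("unfold", vocab))  # e.g. ['un', 'fold']
```

Because the model sees many segmentations of the same sentence over training, it becomes less reliant on any single subword decomposition.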
4.3 Defensive Distillation
We hypothesize that defensive distillation can be used to improve domain robustness. Defensive distillation exploits knowledge distillation to fend off adversarial attacks.
Knowledge distillation is a technique to derive models from existing models, instead of training from scratch. The idea was introduced for simple image classification models by Ba and Caruana (2014) and Hinton et al. (2015). A first model (called the teacher) is trained in the usual fashion. Then, a second model (called the student) is trained using the predictions of the teacher model instead of the labels in the training data.
Typically, knowledge distillation is used to approach the performance of a complex teacher model (or ensemble of models) with a simpler student model. Another application is defensive distillation (Papernot et al., 2016; Carlini and Wagner, 2017; Papernot and McDaniel, 2017), where the student shares the network architecture with the teacher, with the purpose not being model compression, but improving the model’s generalization to samples outside of its training set, and specifically robustness against adversarial examples (Szegedy et al., 2013; Goodfellow et al., 2014).
Defensive distillation has been shown to be effective at improving robustness to adversarial examples in image recognition tasks such as CIFAR10 or ImageNet. In this work, we apply it to a different task, NMT, and test its effect on domain robustness, which Papernot et al. (2016) hint at but do not empirically test.
We follow Kim and Rush (2016) and, instead of training the student model with soft labels, use beam search to translate the entire training set with a teacher model, then train a student model on those automatic translations instead of the ground-truth translations.
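The sequence-level distillation recipe amounts to re-labelling the training sources with the teacher's output; a minimal sketch, with a dictionary standing in for a trained teacher model:

```python
def distill_training_set(source_sents, teacher_translate):
    """Sequence-level knowledge distillation (Kim and Rush, 2016):
    pair each training source with the teacher's translation instead
    of the reference, and train the student on these pairs."""
    return [(src, teacher_translate(src)) for src in source_sents]

# Toy stand-in for a teacher NMT model decoding with beam search:
toy_teacher = {"guten morgen": "good morning", "danke": "thanks"}.get
student_data = distill_training_set(["guten morgen", "danke"], toy_teacher)
print(student_data)  # [('guten morgen', 'good morning'), ('danke', 'thanks')]
```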
4.4 Neural Noisy Channel Reranking
Even though the methods we presented previously do lead to improved out-of-domain translation quality, the models still suffer from low adequacy. Also, our reconstruction models only perform reconstruction during training, and the reverse translation direction is not exploited, for instance by reranking translations (Tu et al., 2017).
We conjecture that this problem can be addressed with a neural noisy channel model (Li and Jurafsky, 2016). Standard NMT systems only model p(y|x), which can lead to a "failure mode that can occur in conditional models in which inputs are explained away by highly predictive output prefixes" (Yu et al., 2017). Noisy channel models propose to also model p(x|y) and p(y) to alleviate this effect.
In practical terms, noisy channel models can be implemented by modifying the core decoding algorithm, or simply as n-best list reranking. We adopt the latter approach, since n-best list reranking was shown by Yee et al. (2019) to have equal or better performance than more computationally costly methods that score partial hypotheses during beam search.
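As an illustration (not the authors' implementation), n-best list reranking with a noisy channel combination can be sketched as follows; the hypothesis tuples and the weights are invented for the example:

```python
def rerank(nbest, lam_rev=1.0, lam_lm=1.0):
    """Rerank an n-best list by a weighted combination of the direct
    model log p(y|x), the channel model log p(x|y) and a language
    model log p(y). Each hypothesis is a tuple
    (text, log_p_fwd, log_p_rev, log_p_lm)."""
    def score(h):
        _, fwd, rev, lm = h
        return fwd + lam_rev * rev + lam_lm * lm
    return max(nbest, key=score)[0]

nbest = [
    # A hallucination scores well forward and under the LM,
    # but the reverse (channel) model exposes it:
    ("fluent but unrelated", -1.0, -9.0, -1.0),
    ("adequate translation", -2.0, -2.0, -2.0),
]
print(rerank(nbest))  # 'adequate translation'
```

The channel term p(x|y) is what penalizes fluent outputs that cannot explain the source sentence.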
5 Experimental Setup for Proposed Methods
This section describes how we preprocessed data and trained the models described in Section 4. Unless stated otherwise, the data is preprocessed in the same way as for the baseline models (see Section 3.1).
Reconstruction Models We implement differentiable sampling reconstruction for Transformer models in Sockeye, and release the implementation (https://github.com/ZurichNLP/sockeye/tree/domain-robustness). We first train a multilingual Sockeye Transformer model using the approach of Johnson et al. (2017). We evaluate validation perplexity every 1000 updates for early stopping, with a patience of 10.
Then we continue training with reconstruction as an additional loss component. All hyperparameters remain the same, except for the new loss and a lower initial learning rate. For testing we select the model with the lowest validation perplexity. We use reconstruction exclusively during training; it is not used when translating with a trained model.
Subword Regularization Models We integrate subword regularization in Sockeye, using the Python library provided by Kudo (2018; https://github.com/google/sentencepiece). The training data is not segmented with BPE in this case. Instead, the training tool is given truecased data, and new segmentations are sampled before each training epoch. In our experiments, we set the smoothing parameter to 0.1 and use an n-best size of 64. For the validation and test data we use 1-best segmentation.
Defensive Distillation Models We use our baseline Transformer model as the teacher model. We translate the original training set with beam size 10. The student is trained on the translations of the teacher model, using the same hyperparameters and being initialized with the parameters of the teacher model.
Noisy Channel Reranking We decode with a beam size of 50, and store an n-best list of size 50 as well. For each hypothesis we produce the following scores: p(y|x) (the usual translation score), p(x|y) (the translation score in the reverse direction) and p(y) (a language model score in the target language).
In order to produce p(y) scores we train a Transformer language model with fairseq (Ott et al., 2019), with standard settings. We impose a large penalty on hypotheses that contain subwords not found in the target-side training data. p(y|x) and p(x|y) are computed with the same model, either the bi-directional or the reconstruction model.
The final hypothesis score for reranking is computed as a weighted multiplication of the three model scores:

score(y) = p(y|x) · p(x|y)^λ1 · p(y)^λ2
The best weights are found with a simple grid search on the in-domain development set. The best weight combination is then used to compute scores and perform reranking for the test data of all domains. Table 10 in Appendix B lists the optimal weights found for each model individually.
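A minimal sketch of the weight search, with an invented toy development set and a toy metric (in practice the metric would be BLEU on the development set, and the grid would cover the paper's actual weight range):

```python
import itertools

def grid_search_weights(dev_nbest, metric, grid=(0.5, 1.0, 1.5)):
    """Try every (lam_rev, lam_lm) pair on a dev set of n-best lists,
    rerank with log-linear combination fwd + lam_rev*rev + lam_lm*lm,
    and keep the pair whose outputs score best under `metric`."""
    best = None
    for lam_rev, lam_lm in itertools.product(grid, repeat=2):
        outputs = []
        for nbest in dev_nbest:
            hyp = max(nbest, key=lambda h: h[1] + lam_rev * h[2] + lam_lm * h[3])
            outputs.append(hyp[0])
        quality = metric(outputs)
        if best is None or quality > best[0]:
            best = (quality, (lam_rev, lam_lm))
    return best[1]

# One dev sentence with two hypotheses: (text, log p(y|x), log p(x|y), log p(y)).
dev_nbest = [[("fluent but wrong", -1.0, -9.0, -1.0),
              ("adequate", -2.0, -2.0, -2.0)]]
weights = grid_search_weights(
    dev_nbest, metric=lambda outs: sum(o == "adequate" for o in outs))
print(weights)  # (0.5, 0.5)
```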
6 Results

Table 7 shows the results of our automatic evaluation. Overall, the proposed methods do improve over the NMT baseline and are able to outperform the SMT baseline on out-of-domain data.
| System                       | DE→EN in-domain | DE→EN average OOD | DE→RM in-domain | DE→RM average OOD |
| ---------------------------- | --------------- | ----------------- | --------------- | ----------------- |
| (3) NMT + SR                 | 61.4            | 11.2              | 53.7            | 20.1              |
| (4) NMT + D                  | 61.1            | 13.1              | 52.5            | 19.3              |
| (7) Multilingual + SR        | 60.3            | 12.8              | 52.4            | 20.1              |
| (8) Reconstruction + SR      | 60.3            | 13.2              | 52.4            | 20.3              |
| (9) Multilingual + NC        | 62.7            | 11.8              | 53.1            | 21.4              |
| (10) Reconstruction + NC     | 62.8            | 13.0              | 53.3            | 21.6              |
| (11) Multilingual + SR + NC  | 60.7            | 12.3              | 53.1            | 21.4              |
| (12) Reconstruction + SR + NC| 60.8            | 13.1              | 52.4            | 20.7              |
6.1 Subword Regularization
The results for subword regularization are mixed (see Row 3 in Table 7). For DE→EN, in-domain translation quality is comparable to the NMT baseline, while the average out-of-domain BLEU falls short of the NMT baseline (-0.5 BLEU). However, in the low-resource condition (DE→RM), subword regularization improves both in-domain and out-of-domain translation (+1.2 BLEU in both cases).
This result is surprising given the larger gains reported by Kudo (2018). To validate our implementation of subword regularization, we reproduce an experiment from Kudo (2018) with English-Vietnamese data from IWSLT 15 (see Table 8). With subword regularization we observe an improvement of 0.8 BLEU, which is lower than the 2 BLEU improvement reported by Kudo (2018), but we also note that our baseline model is stronger.
In a manual evaluation of subword regularization models (see Figure 2) we find that they do not improve adequacy, but increase fluency by several percentage points in some domains.
If subword regularization is combined with multilingual or reconstruction models (see Rows 7 and 8 in Table 7), we observe no improvements on in-domain test sets, but gains for 3 out of 4 out-of-domain data points, indicating that subword regularization is in fact helpful for domain robustness.
6.2 Defensive Distillation
Defensive distillation also leads to improvements in BLEU on out-of-domain text (see Row 4 in Table 7). The average gain is +1.4 for DE→EN, and only +0.4 for DE→RM. In-domain translation is either comparable or slightly worse than the NMT baseline.
Defensive distillation was originally shown to guard against adversarial attacks, where inputs are only slightly different from training examples. Our results indicate that generalization to out-of-domain inputs – that are farther from the training data – is similarly improved.
After assessing translations by our distillation model manually (see Figure 2), we find that distillation does not appear to consistently improve out-of-domain adequacy.
6.3 Reconstruction

Since reconstruction models are fine-tuned from multilingual models, we report scores for those multilingual models as well. Row 5 of Table 7 shows that our multilingual models perform equally well or better than the NMT baseline.
As shown in Row 6 of Table 7, reconstruction outperforms the NMT baseline for both language pairs (+0.8 BLEU for DE→EN, +2.3 BLEU for DE→RM).
To analyze the improvements in more detail, we conduct the same manual annotation of adequacy and fluency as for the baselines (see Figure 2). We find that reconstruction curbs hallucination on out-of-domain data, reducing the percentage of inadequate translations by 5 percentage points on average. We also note that there is a tradeoff between adequacy and fluency: while reconstruction does improve out-of-domain adequacy, the improvement comes at the cost of lower fluency.
We observe that reconstruction models have a tendency to leave parts of an input sentence untranslated, i.e. parts of the sentence remain in the source language. This is exacerbated by the fact that our models are trained multilingually and bi-directionally. We consider two possible explanations:
This behaviour is a consequence of forcing the model to translate unfamiliar out-of-domain text.
During reconstruction training, the model is punished too harshly for not being able to reconstruct input sentences.
Since the multilingual model itself exhibits this copying behaviour, we do not consider this a problem specific to reconstruction training. (In preliminary experiments we identified 3-way tying of embeddings (Press and Wolf, 2017) in our models as a contributing cause, enabling the model to copy subwords that were never paired in the training data; 2-way tying, i.e. tying only the target embeddings and the output matrix, reduces the problem of untranslated content, and we report results with 2-way tying for DE→EN.) Still, to avoid exacerbating the copying problem, we consider not only reranking with the reverse model as proposed by Tu et al. (2017), but a noisy channel model that also includes a language model to balance adequacy and fluency.
6.4 Noisy Channel Reranking
We evaluate the performance of noisy channel reranking in four different settings: applied to multilingual or reconstruction systems, both with and without subword regularization. The results are shown in Rows 9 to 12 of Table 7.
DE→EN: Reranking a reconstruction model achieves a good in-domain BLEU (+1.3 over the baseline), and slightly improves out-of-domain translation on average (+0.5 BLEU over reconstruction).

DE→RM: In our low-resource setting, reranking with a noisy channel model improves the reconstruction model by +0.4 BLEU, producing the best result overall. The improvement on out-of-domain translation is much larger for the multilingual model (+1.8 over the multilingual model without reranking).

Combining reranked models (see Rows 11 and 12 in Table 7) with subword regularization does not lead to consistent improvements. Out-of-domain BLEU for DE→RM is slightly better compared to a subword regularization system without reranking (+0.4 BLEU); all other scores are comparable or worse.
We found the effectiveness of noisy channel reranking to be limited by the homogeneity of n-best lists, and consider that it could become more effective after increasing beam search diversity (Li and Jurafsky, 2016).
7 Conclusion

Current NMT systems exhibit low domain robustness, i.e. they underperform if they are tested on a domain that differs strongly from the training domain. This is especially problematic in settings where explicit domain adaptation is impossible because the target domain is unknown, or because we are in a low-resource setting where training data is only available for limited domains. Our manual analysis shows that hallucinated translations are a common problem for NMT in out-of-domain settings that partially explains the low domain robustness.
Based on this analysis, we compare several methods to mitigate hallucination: subword regularization, for which improved domain robustness has been reported, defensive distillation, reconstruction and reranking with a neural noisy channel model.
Our results show that several methods yield improved generalization to out-of-domain data, and we find that a combination of reconstruction and a noisy channel model for reranking is most effective. We achieve an improvement in average out-of-domain BLEU of 1.5 (DE→EN) and 2.7 (DE→RM), as well as a reduction in hallucinated translations according to manual analysis.
Still, in our manual evaluation NMT generally underperforms SMT in terms of adequacy on the tested out-of-domain datasets, and we encourage further research on domain robustness, which we consider an unsolved problem. For this purpose, we share data and code to serve as a baseline for future experiments: https://github.com/ZurichNLP/domain-robustness
- Arthur et al. (2016) Philip Arthur, Graham Neubig, and Satoshi Nakamura. 2016. Incorporating discrete translation lexicons into neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1557–1567. Association for Computational Linguistics.
- Ba and Caruana (2014) Jimmy Ba and Rich Caruana. 2014. Do deep nets really need to be deep? In Advances in neural information processing systems, pages 2654–2662.
- Belinkov and Bisk (2018) Yonatan Belinkov and Yonatan Bisk. 2018. Synthetic and natural noise both break neural machine translation. In Proceedings of the 6th International Conference on Learning Representations, Vancouver, Canada.
- Carlini and Wagner (2017) Nicholas Carlini and David Wagner. 2017. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pages 39–57. IEEE.
- Goodfellow et al. (2015) Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. 2015. Explaining and harnessing adversarial examples. In International Conference on Learning Representations.
- Goodfellow et al. (2014) Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. 2014. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.
- Hieber et al. (2018) Felix Hieber, Tobias Domhan, Michael Denkowski, David Vilar, Artem Sokolov, Ann Clifton, and Matt Post. 2018. The Sockeye Neural Machine Translation Toolkit at AMTA 2018. In Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (AMTA), volume 1: Research Papers, pages 200–207.
- Hinton et al. (2015) Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
- Jang et al. (2017) Eric Jang, Shixiang Gu, and Ben Poole. 2017. Categorical reparametrization with gumbel-softmax. In Proceedings International Conference on Learning Representations 2017. OpenReviews.net.
- Johnson et al. (2017) Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2017. Google’s multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351.
- Kim and Rush (2016) Yoon Kim and Alexander M. Rush. 2016. Sequence-level knowledge distillation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1317–1327, Austin, Texas. Association for Computational Linguistics.
- Kobus et al. (2017) Catherine Kobus, Josep Crego, and Jean Senellart. 2017. Domain control for neural machine translation. In Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017, pages 372–378. INCOMA Ltd.
- Koehn et al. (2007) Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. In Proceedings of the ACL-2007 Demo and Poster Sessions, pages 177–180, Prague, Czech Republic.
- Koehn and Knowles (2017) Philipp Koehn and Rebecca Knowles. 2017. Six Challenges for Neural Machine Translation. In Proceedings of the First Workshop on Neural Machine Translation, pages 28–39. Association for Computational Linguistics.
- Kudo (2018) Taku Kudo. 2018. Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 66–75. Association for Computational Linguistics.
- Läubli et al. (2018) Samuel Läubli, Mathias Müller, Beat Horat, and Martin Volk. 2018. mtrain: A convenience tool for machine translation.
- Li and Jurafsky (2016) Jiwei Li and Dan Jurafsky. 2016. Mutual information and diverse decoding improve neural machine translation. arXiv preprint arXiv:1601.00372.
- Lison and Tiedemann (2016) Pierre Lison and Jörg Tiedemann. 2016. OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016), pages 923–929. European Language Resources Association (ELRA).
- Luong and Manning (2015) Minh-Thang Luong and Christopher D. Manning. 2015. Stanford Neural Machine Translation Systems for Spoken Language Domains. In Proceedings of the International Workshop on Spoken Language Translation 2015, Da Nang, Vietnam.
- Nguyen and Chiang (2018) Toan Nguyen and David Chiang. 2018. Improving Lexical Choice in Neural Machine Translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 334–343. Association for Computational Linguistics.
- Niu et al. (2019) Xing Niu, Weijia Xu, and Marine Carpuat. 2019. Bi-directional differentiable input reconstruction for low-resource neural machine translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 442–448, Minneapolis, Minnesota. Association for Computational Linguistics.
- Ott et al. (2019) Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations.
- Papernot and McDaniel (2017) Nicolas Papernot and Patrick McDaniel. 2017. Extending defensive distillation. arXiv preprint arXiv:1705.05264.
- Papernot et al. (2016) Nicolas Papernot, Patrick McDaniel, Xi Wu, Somesh Jha, and Ananthram Swami. 2016. Distillation as a defense to adversarial perturbations against deep neural networks. In 2016 IEEE Symposium on Security and Privacy (SP), pages 582–597. IEEE.
- Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, PA.
- Post (2018) Matt Post. 2018. A Call for Clarity in Reporting BLEU Scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium. Association for Computational Linguistics.
- Press and Wolf (2017) Ofir Press and Lior Wolf. 2017. Using the Output Embedding to Improve Language Models. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL), Valencia, Spain.
- Scherrer and Cartoni (2012) Yves Scherrer and Bruno Cartoni. 2012. The trilingual ALLEGRA corpus: Presentation and possible use for lexicon induction. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC-2012), pages 2890–2896, Istanbul, Turkey. European Language Resources Association (ELRA).
- Sennrich et al. (2016a) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 86–96, Berlin, Germany. Association for Computational Linguistics.
- Sennrich et al. (2016b) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725. Association for Computational Linguistics.
- Szegedy et al. (2013) Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. 2013. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199.
- Tu et al. (2017) Zhaopeng Tu, Yang Liu, Lifeng Shang, Xiaohua Liu, and Hang Li. 2017. Neural Machine Translation with Reconstruction. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, pages 3097–3103. Association for the Advancement of Artificial Intelligence.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems 30, pages 5998–6008.
- Yee et al. (2019) Kyra Yee, Nathan Ng, Yann N Dauphin, and Michael Auli. 2019. Simple and effective noisy channel modeling for neural machine translation. arXiv preprint arXiv:1908.05731.
- Yu et al. (2017) Lei Yu, Phil Blunsom, Chris Dyer, Edward Grefenstette, and Tomás Kociský. 2017. The neural noisy channel. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings.
Appendix A Translation Examples
| | |
|---|---|
| Source | - die Produktion in der Türkei entspricht 1,3 % der chinesischen Produktion; |
| Target | - Turkey’s volume of production amounts to 1,3 % of Chinese production, |
| NMT Baseline | - the production in slkei is 1.3% of a Chinese hamster ovary (CHO) cell |
| Multilingual | - production in turkei is equivalent to 1.3% of Chinese Hamster production; |
| Reconstruction | - the production in thekei is equivalent to 1.3% of the Chinese production; |
| Reconstruction + NC | - production in the turkei equals 1.3% of the Chinese production; |
Appendix B Noisy Channel Reranking Grid Search Weights
| Model | w1 | w2 | w3 | w4 | w5 | w6 |
|---|---|---|---|---|---|---|
| (7) Multilingual + SR | 0.5 | 0.42 | 0.08 | 0.9 | 0.09 | 0.01 |
| (8) Reconstruction + SR | 0.5 | 0.46 | 0.04 | 0.9 | 0.09 | 0.01 |
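The weights above parameterize noisy channel reranking in the style of Yu et al. (2017), where each candidate translation is scored by a log-linear interpolation of the direct model, the channel (reverse) model, and a target-side language model. As a minimal sketch of that scoring scheme (the function and variable names here are illustrative, not taken from the paper's code, and we assume each triple of weights sums to 1 as the table values suggest):

```python
def noisy_channel_score(direct_lp, channel_lp, lm_lp, w_direct, w_channel, w_lm):
    """Weighted log-linear combination of log-probabilities.

    direct_lp:  log p(y|x) from the forward translation model
    channel_lp: log p(x|y) from the reverse (channel) model
    lm_lp:      log p(y) from a target-side language model
    """
    return w_direct * direct_lp + w_channel * channel_lp + w_lm * lm_lp


def rerank(candidates, weights):
    """Return the candidate translation with the highest combined score.

    candidates: list of (translation, direct_lp, channel_lp, lm_lp) tuples,
                e.g. an n-best list from beam search
    weights:    (w_direct, w_channel, w_lm) interpolation weights
    """
    return max(candidates,
               key=lambda c: noisy_channel_score(*c[1:], *weights))[0]
```

Under this formulation, a candidate with a mediocre direct-model score can still win if the channel and language models prefer it, which is how reranking can suppress hallucinated output that the forward model alone scores highly.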