Domain Robustness in Neural Machine Translation

11/08/2019 ∙ by Mathias Müller, et al. ∙ Universität Zürich

Translating text that diverges from the training domain is a key challenge for neural machine translation (NMT). Domain robustness, the generalization of models to unseen test domains, is low compared to statistical machine translation. In this paper, we investigate the performance of NMT on out-of-domain test sets, and ways to improve it. We observe that hallucination (translations that are fluent but unrelated to the source) is common in out-of-domain settings, and we empirically compare methods that improve adequacy (reconstruction), out-of-domain translation (subword regularization), or robustness against adversarial examples (defensive distillation), as well as noisy channel models. In experiments on German to English OPUS data, and German to Romansh, a low-resource scenario, we find that several methods improve domain robustness, with reconstruction standing out as a method that not only improves automatic scores, but also shows improvements in a manual assessment of adequacy, albeit at some loss in fluency. However, out-of-domain performance is still relatively low and domain robustness remains an open problem.




1 Introduction

Even though neural models have improved the state-of-the-art in machine translation considerably in recent years, they still underperform in specific conditions. One such condition is out-of-domain translation. Koehn and Knowles (2017) show that neural machine translation (NMT) systems perform poorly in such settings and that their poor performance cannot be explained solely by the fact that out-of-domain translation is difficult: non-neural, statistical machine translation (SMT) systems are clearly superior at this task. For this reason, Koehn and Knowles (2017) identify translation of out-of-domain text as a key challenge for NMT.

Catastrophic failure to translate out-of-domain text can be viewed as overfitting to the training domain, i.e. systems learn idiosyncrasies of the domain rather than more general features.

Our goal is to learn models that generalize well to unseen data distributions, including data from other domains. We will refer to this property of showing good generalization to unseen domains as domain robustness.

We consider domain robustness a desirable property of NLP systems, along with other types of robustness, such as robustness against adversarial examples (Goodfellow et al., 2015) or human input corruption such as typos (Belinkov and Bisk, 2018). While domain adaptation with small amounts of parallel or monolingual in-domain data has proven very effective for NMT (e.g. Luong and Manning, 2015; Sennrich et al., 2016a; Kobus et al., 2017), the target domain(s) may be unknown when a system is built, and there are language pairs for which training data is only available for limited domains.

Model architectures and training techniques have evolved since the study by Koehn and Knowles (2017), and it is unclear to what extent this problem still persists. We therefore revisit the hypothesis that NMT systems exhibit low domain robustness. In preliminary experiments, we demonstrate that current models still fail at out-of-domain translation: BLEU scores drop drastically for test domains other than the training domain. The degradation in quality is less severe for SMT systems.

  • Source: Aber geh subtil dabei vor.

  • Reference: But be subtle about it.

  • NMT: Pharmacokinetic parameters are not significantly affected in patients with renal impairment (see section 5.2).

Figure 1: Example illustrating how a German→English NMT system trained on medical text hallucinates the translation of an out-of-domain input sentence.

An analysis of our baseline systems reveals that hallucinated content occurs frequently in out-of-domain translations. Several authors present anecdotal evidence for NMT systems occasionally falling into a hallucination mode where translations are grammatically correct, but unrelated to the source sentence (Arthur et al., 2016; Koehn and Knowles, 2017; Nguyen and Chiang, 2018). See Figure 1 for an example. After assessing translations manually, we find that hallucination is more pronounced in out-of-domain translation. We therefore expect methods that alleviate the problem of hallucinated translations to indirectly improve domain robustness.

As a means to reduce hallucination, we experiment with several techniques and assess their effectiveness in improving domain robustness: reconstruction (Tu et al., 2017; Niu et al., 2019), subword regularization (Kudo, 2018), neural noisy channel models (Li and Jurafsky, 2016; Yee et al., 2019), and defensive distillation (Papernot et al., 2016), as well as combinations of these techniques.

The main contributions of this paper are:

  • we perform an analysis of current NMT systems that confirms that domain robustness is low, and that hallucination is a major problem,

  • we empirically compare strategies to improve domain robustness in NMT,

  • we provide code and data sets to serve as baselines in future work.

2 Data Sets

We report experiments on two different translation directions: German→English (DE→EN) and German→Romansh (DE→RM).

| language pair | domain | corpora | size |
|---|---|---|---|
| DE→EN | medical | EMEA | 1.1m |
| DE→EN | IT | GNOME, KDE, PHP, Ubuntu, OpenOffice | 380k |
| DE→EN | koran | Tanzil | 540k |
| DE→EN | law | JRC-Acquis | 720k |
| DE→EN | subtitles | OpenSubtitles2018 | 22.5m |
| DE→RM | law | Allegra, Press Releases | 100k |
| DE→RM | blogs | Convivenza | 20k |

Table 1: Data sets common to all of our experiments. Size indicates number of sentence pairs.

2.1 German→English

For all DE→EN experiments, we use the same corpora as Koehn and Knowles (2017), available from OPUS (Lison and Tiedemann, 2016).

We use corpora from OPUS to define five domains: medical, IT, koran, law and subtitles. See Table 1 for an overview of sizes per domain. The domains are quite distant, and we therefore expect that systems trained on a single domain will have low domain robustness if tested on other domains.

For each domain, we select 2000 consecutive sentence pairs each for development and testing. Our test sets differ from those of Koehn and Knowles (2017), so results are not directly comparable.

In all experiments, the medical domain serves as the training domain, while the remaining four domains are used for testing.

2.2 German→Romansh

To complement our DE→EN experiments, we also train systems for DE→RM. Romansh is a Romance language with few native speakers; it is low-resource, but has some parallel resources thanks to its status as an official Swiss language. Our training data consists of 100k sentence pairs (see Table 1), specifically the Allegra corpus provided by Scherrer and Cartoni (2012), which contains mostly law text, and an in-house collection of press releases from the Swiss canton of Grisons. As test domain (unseen during training), we use blog posts from Convivenza. From both data sets we randomly select 2000 consecutive sentence pairs as test sets.

3 State-of-the-art Models Exhibit Low Domain Robustness

In this section, we establish that current NMT systems exhibit low domain robustness. We do so by analyzing our baseline systems automatically and manually.

3.1 Experimental Setup for Baseline Models

We use Moses scripts to normalize punctuation and tokenize all data. We apply a truecasing model trained only on in-domain training data. Similarly, we apply BPE (Sennrich et al., 2016b) with 32k (DE→EN) or 16k (DE→RM) merge operations learned only from in-domain data. We train two baselines:

NMT Baseline: A standard Transformer base model trained with Sockeye (Vaswani et al., 2017; Hieber et al., 2018).

SMT Baseline: A standard, phrase-based statistical model trained with Moses (Koehn et al., 2007), using mtrain (Läubli et al., 2018) as frontend, with standard settings.

We always test on several test sets, one for each domain, including the training domain. We consistently use a beam size of 10 to translate test data. We report case-sensitive BLEU (Papineni et al., 2002) scores on detokenized text, computed with SacreBLEU (Post, 2018).

3.2 Analysis of Baseline Systems

Tables 3 and 4 show automatic evaluation results for all our baseline models. Neural models achieve good performance on the respective in-domain test sets (61.5 BLEU on medical for DE→EN; 52.5 BLEU on law for DE→RM), but on out-of-domain text, translation quality is clearly diminished, with an average BLEU of roughly 12 (DE→EN) and 17 (DE→RM). The following analysis will focus on our DE→EN baseline systems.

| test set | OOV rate |
|---|---|
| medical | 2.42% |
| IT | 20.09% |
| koran | 18.63% |
| law | 9.39% |
| subtitles | 18.16% |

Table 2: Out-of-vocabulary (OOV) rates of in-domain and out-of-domain test sets (DE→EN).

Unknown words constitute one possible reason for failing to translate out-of-domain texts. As shown in Table 2, the percentage of words that are not seen during training is much higher in all out-of-domain test sets. However, unknown words cannot be the only reason for low translation quality: the test sets with the lowest BLEU scores (koran and subtitles) actually have an out-of-vocabulary (OOV) rate similar to the IT test set, where BLEU scores are much higher across both baseline models.
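As a minimal sketch of how such an OOV rate can be computed, the following assumes token-level counting over whitespace-tokenized text; whether the figures in Table 2 count tokens or types is not stated here, and the function name `oov_rate` and the toy sentences are purely illustrative.

```python
def oov_rate(train_tokens, test_tokens):
    """Percentage of test tokens whose word form never occurs in the training data.

    Assumption: token-level counting (every occurrence counts), not type-level.
    """
    vocab = set(train_tokens)
    if not test_tokens:
        return 0.0
    unseen = sum(1 for tok in test_tokens if tok not in vocab)
    return 100.0 * unseen / len(test_tokens)


# Toy data standing in for an in-domain training set and an out-of-domain test set.
train = "the patient received the dose".split()
test = "the user received an update".split()
print(f"{oov_rate(train, test):.1f}%")
```

In this toy case three of the five test tokens ("user", "an", "update") are unseen, giving 60.0%.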

Additionally, our SMT baseline shows better generalization to some domains unseen at training time, while the average BLEU is comparable to the NMT baseline. In the IT domain, the outcome is most extreme: the SMT system beats the neural system by 4.3 BLEU. This demonstrates that the low domain robustness of NMT is not (only) a data problem, but also due to the model’s inductive biases.

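| test domain | SMT | NMT |
|---|---|---|
| medical | 58.4 | 61.5 |
| IT | 21.4 | 17.1 |
| koran | 1.4 | 1.1 |
| law | 19.8 | 25.3 |
| subtitles | 4.7 | 3.4 |
| average (out-of-domain) | 11.8 | 11.7 |

Table 3: BLEU scores of baseline DE→EN systems trained on medical data.

| test domain | SMT | NMT |
|---|---|---|
| law | 45.2 | 52.5 |
| blogs | 15.5 | 18.9 |

Table 4: BLEU scores of baseline DE→RM systems trained on law data.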

Compared to results reported by Koehn and Knowles (2017), the general trend remains the same: NMT loses more translation quality than PBSMT when comparing in-domain and out-of-domain performance, although the gap in out-of-domain performance is substantially smaller (0.1 BLEU as compared to 3 BLEU for systems trained on medical data).
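The averages and the gap quoted above can be re-derived from the per-domain scores in Table 3. The short check below is only a sketch: it takes the first score column as SMT and the second as NMT, which is consistent with the in-domain scores listed for both systems in Table 7.

```python
# Out-of-domain BLEU scores from Table 3 (DE→EN, systems trained on medical data).
# Assumption: first column = SMT, second column = NMT (matches Table 7).
smt = {"IT": 21.4, "koran": 1.4, "law": 19.8, "subtitles": 4.7}
nmt = {"IT": 17.1, "koran": 1.1, "law": 25.3, "subtitles": 3.4}
smt_in_domain, nmt_in_domain = 58.4, 61.5


def mean(values):
    values = list(values)
    return sum(values) / len(values)


smt_avg = mean(smt.values())
nmt_avg = mean(nmt.values())

print("average OOD BLEU:", round(smt_avg, 1), round(nmt_avg, 1))
print("OOD gap (SMT - NMT):", round(smt_avg - nmt_avg, 1))
print("drop from in-domain:", round(smt_in_domain - smt_avg, 1), round(nmt_in_domain - nmt_avg, 1))
```

The averages come out at 11.8 (SMT) and 11.7 (NMT), a gap of 0.1 BLEU, while the drop from in-domain to out-of-domain is larger for NMT (about 49.8 BLEU) than for SMT (about 46.6 BLEU).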

As a further control, we train an additional baseline system on all domains. The subtitles domain (23m sentences) was subsampled to 1m sentence pairs, so as not to overwhelm the remaining domains (3m sentences in total). We use this system to test whether the data we have held out for out-of-domain testing is inherently more difficult to translate than the in-domain test set. The results in Table 5 show that this is not the case. BLEU ranges between 18.4 and 66.0, with an average out-of-domain BLEU of 37.5.

| test domain | BLEU |
|---|---|
| medical | 60.1 |
| IT | 46.9 |
| koran | 18.4 |
| law | 66.0 |
| subtitles | 18.7 |
| average (out-of-domain) | 37.5 |

Table 5: BLEU scores of NMT baseline trained on a concatenation of all domains (DE→EN).

3.2.1 Hallucination

NMT models can be understood as language models of the target language, conditioned on a representation of source text. This means that NMT models have no explicit mechanism – as SMT models do – that enforces coverage of the source sentence, and if the representation of an out-of-domain source sentence is outside the training distribution, it can be seemingly ignored. This gives rise to a tendency to hallucinate translations, i.e. to produce translations that are fluent, but unrelated to the content of the source sentence.

We hypothesize that hallucination is more common in out-of-domain settings. A small manual evaluation performed by the main author confirms that this is indeed the case. We evaluate the fluency and adequacy of our baseline systems (we refer to them as NMT and SMT). In a blind setup, we annotated a random sample of 100 sentence pairs per domain. As controls, we mix in pairs consisting of (source, actual reference), treating the reference translation as an additional system.

Evaluation of adequacy: The annotator is presented with a sentence pair and asked to judge whether the translation is adequate, partially adequate or inadequate. Effectively, the annotator performs a 3-way categorization task.

Evaluation of fluency: We use the same data as for the evaluation of adequacy, but the annotator is shown only the translation, without the corresponding source sentence. The annotator is asked whether the given sentence is fluent, partially fluent or not fluent.

Figure 2 shows the results of the manual evaluation with respect to adequacy and fluency. Individual fluency values in the figure are computed from the numbers of fluent, partially fluent and non-fluent translations. Adequacy values are computed in the same way.

On the in-domain test set, both baselines achieve high adequacy and fluency, with the NMT baseline effectively matching the adequacy and fluency of the reference translations.

Regarding adequacy, the in-domain samples contain only a small number of translations with content unrelated to the source (1% to 2%). On out-of-domain data, on the other hand, both baselines produce a high number of inadequate translations: 57% (SMT) and 84% (NMT). These results suggest that the extremely low BLEU scores on these two test sets (see Table 3) are in large part due to made-up content in the translations.

Regarding fluency in out-of-domain settings, SMT and NMT baselines behave very differently: SMT translations are more adequate, while NMT translations are more fluent. This trend is most extreme in the koran domain, where only 2% of SMT translations are found to be fluent (compared to 36% for NMT).

Further analysis of both annotations shows that NMT translations found to be inadequate are not necessarily disfluent in out-of-domain settings. Table 6 shows that, on average, on out-of-domain data, 35% of NMT translations are both inadequate and fluent, while the same is only true for 4% of SMT translations. We refer to translations of this kind as hallucinations.

Figure 2: Manual evaluation of adequacy and fluency for DE→EN, with panels (a) in-domain and (b) out-of-domain. Legend: marker colors are different systems, marker types are different domains. SR=Subword Regularization, D=Distillation.
| system | in-domain | OOD average |
|---|---|---|
| Reference | 2% | 2% |
| NMT | 2% | 35% |
| SMT | 1% | 4% |
| NMT + SR | 1% | 37% |
| NMT + D | 3% | 33% |
| Reconstruction | 1% | 29% |

Table 6: Percentage of translations judged as both not adequate and either fluent or partially fluent in the manual evaluation.
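The criterion behind Table 6 can be sketched in a few lines. This is an illustration only: the label strings and the `hallucination_rate` helper are hypothetical, and the sketch assumes that "not adequate" refers to the "inadequate" label alone (excluding "partially adequate"), which the caption does not spell out.

```python
def hallucination_rate(annotations):
    """Percentage of translations that are inadequate yet fluent or partially fluent.

    `annotations` is a list of (adequacy_label, fluency_label) pairs.
    Assumption: only the label "inadequate" counts as "not adequate".
    """
    if not annotations:
        return 0.0
    hits = sum(
        1
        for adequacy, fluency in annotations
        if adequacy == "inadequate" and fluency in ("fluent", "partially fluent")
    )
    return 100.0 * hits / len(annotations)


# Toy annotations, not taken from the actual evaluation.
sample = [
    ("inadequate", "fluent"),            # counts as hallucination
    ("inadequate", "not fluent"),        # inadequate, but not fluent
    ("adequate", "fluent"),              # a good translation
    ("inadequate", "partially fluent"),  # counts as hallucination
]
print(hallucination_rate(sample))
```

On the four toy annotations, two satisfy the criterion, so the rate is 50.0.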

To summarize our analysis of baseline models, we find that the domain robustness of current NMT systems is still lacking and that inadequate, but fluent translations are a prominent issue. This motivates our choice of techniques to improve domain robustness.

4 Approaches to Improve Domain Robustness

We discuss approaches that can potentially remedy the problem of low domain robustness.

These are: reconstruction, an architecture and training objective that addresses the problem of hallucination; subword regularization, for which good results were reported in out-of-domain translation; defensive distillation, a method that has not yet been used in NMT to address either hallucination or domain robustness; and a neural noisy channel model.

4.1 Reconstruction

Reconstruction (Tu et al., 2017) is a change to the model architecture that addresses the problem of adequacy. The authors propose to extend encoder-decoder models with a reconstructor component that learns to reconstruct the source sentence from decoder states. The reconstructor has two uses: as a training objective, it forces the decoder representations to retain information that will be useful for reconstruction; during inference, it can provide scores that can be used to re-rank translation hypotheses.

However, we observed in initial experiments that reconstruction from hidden states can be too easy: the reconstruction loss on training batches diminishes very quickly, to the point of being insignificant. To prevent the model from simply reserving parts of the decoder hidden states to memorize the input sentence, we use reconstruction from actual translations instead of hidden states (Niu et al., 2019). Translations are produced with differentiable sampling via the Straight-Through Gumbel Softmax (Jang et al., 2017), which still allows joint optimization of translation and reconstruction. While Niu et al. (2019) implement reconstruction for recurrent architectures, we apply the technique to Transformers.
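To illustrate the sampling step, here is a minimal, framework-free sketch of the forward pass of Gumbel-Softmax sampling for a single output position. It is not the Sockeye implementation; the function name, the temperature default and the toy logits are assumptions. The straight-through part cannot be shown without automatic differentiation: in an autodiff framework, the discrete one-hot vector is used in the forward pass while gradients are taken with respect to the soft distribution.

```python
import math
import random


def gumbel_softmax_sample(logits, temperature=1.0, rng=random):
    """Draw one categorical sample; return (one_hot, soft).

    `soft` is the relaxed, differentiable distribution; `one_hot` is its argmax.
    Straight-through estimation would feed `one_hot` forward and backpropagate
    through `soft` (commonly written as one_hot + soft - stop_gradient(soft)).
    """
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise is -log(-log(U)) with U ~ Uniform(0, 1);
        # U is clamped away from 0 and 1 to keep both logarithms finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / temperature)

    # Numerically stable softmax over the perturbed logits.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    soft = [e / total for e in exps]

    winner = soft.index(max(soft))
    one_hot = [1.0 if i == winner else 0.0 for i in range(len(soft))]
    return one_hot, soft


# Toy logits over a 4-word vocabulary (hypothetical values).
one_hot, soft = gumbel_softmax_sample([2.0, 0.5, 0.1, -1.0], temperature=0.5,
                                      rng=random.Random(0))
print(one_hot, [round(p, 3) for p in soft])
```

Lower temperatures push the soft distribution closer to the one-hot sample, at the price of higher-variance gradients.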

In order not to introduce any additional parameters for reconstruction, as recommended in Niu et al. (2019), we train a multilingual, bi-directional system with shared parameters as a further baseline. This bi-directional system is used to initialize the fine-tuning of reconstruction models. We empirically test whether our original baseline and the multilingual system have comparable performance.

4.2 Subword Regularization

Subword regularization (Kudo, 2018) is a form of data augmentation that, instead of applying a fixed subword segmentation like BPE, probabilistically samples a new subword segmentation each time a sentence is seen during training (i.e. for each epoch). At test time, the model either uses the 1-best segmentation, or translates the k-best segmentations and selects the highest-probability translation.

Kudo (2018) reports large improvements on low-resource and out-of-domain settings. In particular, improvements on in-house patent, web, and query test sets were in the range of 2–10 BLEU. In this work, we apply and evaluate subword regularization on public datasets. We apply sampling at training time, and translate 1-best segmented sentences at test time.
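The difference between sampling at training time and 1-best segmentation at test time can be sketched with a toy unigram model. This is a simplified illustration, not the library used in our experiments: the vocabulary and its probabilities are invented, segmentations are enumerated exhaustively (feasible only for short words), and a segmentation is sampled from the n-best list with probability proportional to its unigram probability raised to a smoothing exponent `alpha`.

```python
import math
import random


def segmentations(word, vocab):
    """Enumerate every split of `word` into pieces that occur in `vocab`."""
    if not word:
        return [[]]
    results = []
    for end in range(1, len(word) + 1):
        piece = word[:end]
        if piece in vocab:
            for rest in segmentations(word[end:], vocab):
                results.append([piece] + rest)
    return results


def scored_segmentations(word, vocab):
    """Return (log_prob, segmentation) pairs, best first, under a unigram model."""
    scored = [(sum(math.log(vocab[p]) for p in seg), seg)
              for seg in segmentations(word, vocab)]
    if not scored:
        raise ValueError(f"no segmentation of {word!r} with this vocabulary")
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored


def best_segmentation(word, vocab):
    """1-best segmentation, as used for validation and test data."""
    return scored_segmentations(word, vocab)[0][1]


def sample_segmentation(word, vocab, alpha=0.1, nbest_size=64, rng=random):
    """Training-time sampling: P(seg) ** alpha, renormalized over the n-best list."""
    nbest = scored_segmentations(word, vocab)[:nbest_size]
    weights = [math.exp(alpha * log_prob) for log_prob, _ in nbest]
    return rng.choices([seg for _, seg in nbest], weights=weights, k=1)[0]


# Hypothetical unigram probabilities for a handful of subword pieces.
toy_vocab = {"un": 0.2, "related": 0.1, "rel": 0.1, "ated": 0.1,
             "u": 0.05, "n": 0.05, "unrelated": 0.01}
print(best_segmentation("unrelated", toy_vocab))
print(sample_segmentation("unrelated", toy_vocab, rng=random.Random(0)))
```

With a small `alpha` such as 0.1, the sampling distribution is close to uniform over the n-best list, so the model sees many different segmentations of the same word during training.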

4.3 Defensive Distillation

We hypothesize that defensive distillation can be used to improve domain robustness. Defensive distillation exploits knowledge distillation to fend off adversarial attacks.

Knowledge distillation is a technique to derive models from existing models, instead of training from scratch. The idea was introduced for simple image classification models by Ba and Caruana (2014) and Hinton et al. (2015). A first model (called the teacher) is trained in the usual fashion. Then, a second model (called the student) is trained using the predictions of the teacher model instead of the labels in the training data.

Typically, knowledge distillation is used to approach the performance of a complex teacher model (or ensemble of models) with a simpler student model. Another application is defensive distillation (Papernot et al., 2016; Carlini and Wagner, 2017; Papernot and McDaniel, 2017), where the student shares the network architecture with the teacher, with the purpose not being model compression, but improving the model’s generalization to samples outside of its training set, and specifically robustness against adversarial examples (Szegedy et al., 2013; Goodfellow et al., 2014).

Defensive distillation has been shown to be effective at improving robustness to adversarial examples in image recognition tasks such as CIFAR10 or ImageNet. In this work, we apply it to a different task, NMT, and test its effect on domain robustness, which Papernot et al. (2016) hint at but do not empirically test.

We follow Kim and Rush (2016) and, instead of training the student model with soft labels, use beam search to translate the entire training set with a teacher model, then train a student model on those automatic translations instead of the ground-truth translations.
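The data flow of this sequence-level variant can be sketched as follows. The helper names are hypothetical, and `toy_teacher` is merely a stand-in for beam-search decoding with a trained teacher model.

```python
def build_distillation_corpus(sources, teacher_translate):
    """Pair each source sentence with the teacher's translation.

    The student is then trained on these pairs instead of the
    (source, ground-truth reference) pairs of the original corpus.
    """
    return [(src, teacher_translate(src)) for src in sources]


def toy_teacher(src):
    """Hypothetical stand-in for a teacher model decoding with beam search."""
    lexicon = {"hallo": "hello", "welt": "world"}
    return " ".join(lexicon.get(tok, tok) for tok in src.split())


training_sources = ["hallo welt", "hallo"]
student_corpus = build_distillation_corpus(training_sources, toy_teacher)
print(student_corpus)
```

Only the target side of the training data changes; the source side and the size of the corpus stay the same.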

4.4 Neural Noisy Channel Reranking

Even though the methods we presented previously do lead to improved out-of-domain translation quality, the models still suffer from low adequacy. Also, our reconstruction models only perform reconstruction during training, and the reverse translation direction is not exploited, for instance by reranking translations (Tu et al., 2017).

We conjecture that this problem can be addressed with a neural noisy channel model (Li and Jurafsky, 2016). Standard NMT systems only model $p(y \mid x)$, the probability of a translation $y$ given a source sentence $x$, which can lead to a “failure mode that can occur in conditional models in which inputs are explained away by highly predictive output prefixes” (Yu et al., 2017, p. 1). Noisy channel models propose to also model $p(x \mid y)$ and $p(y)$ to alleviate this effect.

In practical terms, noisy channel models can be implemented by modifying the core decoding algorithm, or simply as n-best list reranking. We adopt the latter approach, since n-best list reranking was shown by Yee et al. (2019) to have equal or better performance than more computationally costly methods that score partial hypotheses during beam search.

5 Experimental Setup for Proposed Methods

This section describes how we preprocessed data and trained the models described in Section 4. Unless stated otherwise, the data is preprocessed in the same way as for the baseline models (see Section 3.1).

Reconstruction Models: We implement differentiable sampling reconstruction for Transformer models in Sockeye, and release the implementation. We first train a multilingual Sockeye Transformer model using the approach of Johnson et al. (2017). We evaluate validation perplexity every 1000 updates for early stopping, with a patience of 10.

Then we continue training with reconstruction as an additional loss component. All hyperparameters remain the same, except for the new loss and a lower initial learning rate. For testing we select the model with the lowest validation perplexity. We use reconstruction exclusively for training; it is not used for translation with a trained model.
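The stopping and model selection rule can be sketched as below. This is an assumption about the exact criterion rather than a description of Sockeye internals: training stops once the validation perplexity has failed to improve for `patience` consecutive checks, and the checkpoint with the lowest perplexity seen so far is kept.

```python
def select_checkpoint(perplexities, patience=10):
    """Return the index of the best checkpoint under patience-based early stopping.

    `perplexities[i]` is the validation perplexity at the i-th check
    (one check every 1000 updates in the setup described above).
    """
    best_index, best_value, stale = None, float("inf"), 0
    for index, value in enumerate(perplexities):
        if value < best_value:
            best_index, best_value, stale = index, value, 0
        else:
            stale += 1
            if stale >= patience:
                break  # no improvement for `patience` checks in a row
    return best_index


# Hypothetical validation curve; patience shortened to 3 to keep the example small.
curve = [9.0, 7.5, 7.0, 7.2, 7.1, 7.3, 6.0]
print(select_checkpoint(curve, patience=3))
print(select_checkpoint(curve, patience=10))
```

With a patience of 3, training stops before the late improvement at the last check and checkpoint 2 is selected; with a patience of 10, the full curve is seen and checkpoint 6 is selected.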

Subword Regularization Models: We integrate subword regularization in Sockeye, using the Python library provided by Kudo (2018). The training data is not segmented with BPE in this case. Instead, the training tool is given truecased data, and new segmentations are sampled before each training epoch. In our experiments, we use the following hyperparameters: we set the smoothing parameter to 0.1 and use an n-best size of 64. For the validation and test data we use 1-best segmentation.

Defensive Distillation Models: We use our baseline Transformer model as the teacher model. We translate the original training set with beam size 10. The student is trained on the translations of the teacher model, using the same hyperparameters and being initialized with the parameters of the teacher model.

Noisy Channel Reranking: We decode with a beam size of 50, and store an n-best list of 50 as well. For each hypothesis we produce the following scores: $p(y \mid x)$ (usual translation score), $p(x \mid y)$ (translation score in reverse direction) and $p(y)$ (language model score in target language).

In order to produce $p(y)$ scores we train a Transformer language model with fairseq (Ott et al., 2019), with standard settings. We impose a large penalty on hypotheses that contain subwords not found in the target side training data. $p(y \mid x)$ and $p(x \mid y)$ are computed with the same model, either the bi-directional or reconstruction model.

The final hypothesis score for reranking is computed as a weighted multiplication of these three scores.

The best weights are found with simple grid search on the in-domain development set. The best weight combination is then used to compute scores and perform reranking for the test data of all domains. Table 10 in Appendix B lists optimal weights found for each model individually.
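A minimal sketch of n-best reranking and the weight search follows. It rests on several assumptions: scores are combined in log space (a weighted sum of log-probabilities, which corresponds to a weighted product of probabilities), the dictionary keys `fwd`, `bwd` and `lm`, the grid values and the `quality` function are hypothetical, and a real setup would use BLEU on the development set as the quality measure.

```python
import itertools


def channel_score(hyp, weights):
    """Weighted sum of log p(y|x), log p(x|y) and log p(y) for one hypothesis."""
    w_fwd, w_bwd, w_lm = weights
    return w_fwd * hyp["fwd"] + w_bwd * hyp["bwd"] + w_lm * hyp["lm"]


def rerank(nbest, weights):
    """Return the highest-scoring hypothesis of one n-best list."""
    return max(nbest, key=lambda hyp: channel_score(hyp, weights))


def grid_search(dev_nbest_lists, quality, grid):
    """Try every weight triple from `grid`; keep the one with the best dev quality."""
    best_weights, best_quality = None, float("-inf")
    for weights in itertools.product(grid, repeat=3):
        outputs = [rerank(nbest, weights)["text"] for nbest in dev_nbest_lists]
        score = quality(outputs)
        if score > best_quality:
            best_weights, best_quality = weights, score
    return best_weights


# Toy n-best list with log-probabilities: the first hypothesis is preferred by the
# forward model, but the reverse model finds it hard to recover the source from it.
nbest = [
    {"text": "fluent but unrelated", "fwd": -1.0, "bwd": -9.0, "lm": -2.0},
    {"text": "adequate", "fwd": -1.5, "bwd": -2.0, "lm": -3.0},
]
print(rerank(nbest, (1.0, 0.0, 0.0))["text"])
print(rerank(nbest, (0.6, 0.32, 0.08))["text"])

# Hypothetical quality function: number of outputs equal to the desired translation.
found = grid_search([nbest], lambda outs: sum(o == "adequate" for o in outs),
                    grid=(0.0, 0.5, 1.0))
print(found)
```

With forward scores alone, the unrelated hypothesis wins; once the reverse direction receives weight, the adequate hypothesis is ranked first.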

6 Evaluation

Table 7 shows the results of our automatic evaluation. Overall, the proposed methods do improve over the NMT baseline and are able to outperform the SMT baseline on out-of-domain data.

| system | DE→EN in-domain | DE→EN average OOD | DE→RM in-domain | DE→RM average OOD |
|---|---|---|---|---|
| (1) SMT | 58.4 | 11.8 | 45.2 | 15.5 |
| (2) NMT | 61.5 | 11.7 | 52.5 | 18.9 |
| (3) NMT + SR | 61.4 | 11.2 | 53.7 | 20.1 |
| (4) NMT + D | 61.1 | 13.1 | 52.5 | 19.3 |
| (5) Multilingual | 61.4 | 11.7 | 52.8 | 19.6 |
| (6) Reconstruction | 61.5 | 12.5 | 53.4 | 21.2 |
| (7) Multilingual + SR | 60.3 | 12.8 | 52.4 | 20.1 |
| (8) Reconstruction + SR | 60.3 | 13.2 | 52.4 | 20.3 |
| (9) Multilingual + NC | 62.7 | 11.8 | 53.1 | 21.4 |
| (10) Reconstruction + NC | 62.8 | 13.0 | 53.3 | 21.6 |
| (11) Multilingual + SR + NC | 60.7 | 12.3 | 53.1 | 21.4 |
| (12) Reconstruction + SR + NC | 60.8 | 13.1 | 52.4 | 20.7 |

Table 7: BLEU scores (higher is better) of all systems on test data. SR=Subword Regularization, D=Distillation, NC=Noisy Channel Model, average OOD=average BLEU score over out-of-domain test sets.
| | Baseline (BPE) | Subword Regularization |
|---|---|---|
| Kudo (2018) | 25.6 | 27.7 |
| Our results | 28.3 | 29.1 |

Table 8: Reproducing results from Kudo (2018) on IWSLT 15 English-Vietnamese data.

6.1 Subword Regularization

The results for subword regularization are mixed (see Row 3 in Table 7). For DE→EN, in-domain translation quality is comparable to the NMT baseline, while the average out-of-domain BLEU falls short of the NMT baseline (-0.5 BLEU). However, in the low-resource condition (DE→RM), subword regularization improves both in-domain and out-of-domain translation (+1.2 in both cases).

This result is surprising given the larger gains reported by Kudo (2018). To validate our implementation of subword regularization, we reproduce an experiment from Kudo (2018) with English-Vietnamese data from IWSLT 15 (see Table 8). With subword regularization we observe an improvement of 0.8 BLEU, which is lower than the 2 BLEU improvement reported by Kudo (2018), but we also note that our baseline model is stronger.

In a manual evaluation of subword regularization models (see Figure 2) we find that they do not improve adequacy, but increase fluency by several percentage points in some domains.

If subword regularization is combined with multilingual or reconstruction models (see Rows 7 and 8 in Table 7), we observe no improvements on in-domain test sets, but gains for 3 out of 4 out-of-domain data points, indicating that subword regularization is in fact helpful for domain robustness.

6.2 Defensive Distillation

Defensive distillation also leads to improvements in BLEU on out-of-domain text (see Row 4 in Table 7). The average gain is +1.4 for DE→EN, and only +0.4 for DE→RM. In-domain translation quality is either comparable to or slightly worse than that of the NMT baseline.

Defensive distillation was originally shown to guard against adversarial attacks, where inputs are only slightly different from training examples. Our results indicate that generalization to out-of-domain inputs – that are farther from the training data – is similarly improved.

After assessing translations by our distillation model manually (see Figure 2), we find that distillation does not appear to consistently improve out-of-domain adequacy.

6.3 Reconstruction

Since reconstruction models are fine-tuned from multilingual models, we report scores for those multilingual models as well. Row 5 of Table 7 shows that our multilingual models perform equally well or better than the NMT baseline.

As shown in Row 6 of Table 7, reconstruction outperforms the NMT baseline on out-of-domain data for both language pairs (+0.8 BLEU for DE→EN, +2.3 BLEU for DE→RM), while in-domain scores are equal or slightly higher.

To analyze the improvements in more detail, we conduct the same manual annotation of adequacy and fluency as for the baselines (see Figure 2). We find that reconstruction helps contain hallucination on out-of-domain data, reducing the percentage of inadequate translations by 5 percentage points on average. We also note that there is a tradeoff between adequacy and fluency: while reconstruction does improve out-of-domain adequacy, the improvement comes at the cost of lower fluency.

We observe that reconstruction models have a tendency to leave parts of an input sentence untranslated, i.e. parts of the sentence remain in the source language. This is exacerbated by the fact that our models are trained multilingually and bi-directionally. We consider two possible explanations:

  1. This behaviour is a consequence of forcing the model to translate unfamiliar out-of-domain text.

  2. During reconstruction training, the model is punished too harshly for not being able to reconstruct input sentences.

Since the multilingual model itself exhibits this copying behaviour, we do not consider this a problem specific to reconstruction training. (In preliminary experiments we identify 3-way tying of embeddings (Press and Wolf, 2017) in our models as a contributing cause, enabling the model to copy subwords that were never paired in the training data. 2-way tying, i.e. tying only target embeddings and the output matrix, reduces the problem of untranslated content, and we report results for 2-way tying for DE→EN.) Still, to avoid exacerbating the copying problem, we consider not only reranking with the reverse model as proposed by Tu et al. (2017), but a noisy channel model that also includes a language model to balance adequacy and fluency.

6.4 Noisy Channel Reranking

We evaluate the performance of noisy channel reranking in four different settings: applied to multilingual or reconstruction systems, both with and without subword regularization. The results are shown in Rows 9 to 12 of Table 7.

DE→EN: Reranking a reconstruction model achieves a good in-domain BLEU (+1.3 over the baseline), and slightly improves out-of-domain translation on average (+0.5 BLEU over reconstruction).

DE→RM: In our low-resource setting, reranking with a noisy channel model improves the reconstruction model by +0.4 BLEU, producing the best result overall. The improvement on out-of-domain translation is much larger for the multilingual model (+1.8 over multilingual model without reranking).

Combining reranked models (see Rows 11 and 12 in Table 7) with subword regularization does not lead to consistent improvements. Out-of-domain BLEU for DE→RM is slightly better compared to a subword regularization system without reranking (+0.4 BLEU); all other scores are comparable or worse.

We found the effectiveness of noisy channel reranking to be limited by the homogeneity of n-best lists, and consider that it could become more effective after increasing beam search diversity (Li and Jurafsky, 2016).

7 Conclusions

Current NMT systems exhibit low domain robustness, i.e. they underperform if they are tested on a domain that differs strongly from the training domain. This is especially problematic in settings where explicit domain adaptation is impossible because the target domain is unknown, or because we are in a low-resource setting where training data is only available for limited domains. Our manual analysis shows that hallucinated translations are a common problem for NMT in out-of-domain settings that partially explains the low domain robustness.

Based on this analysis, we compare several methods to mitigate hallucination: subword regularization, for which improved domain robustness has been reported, defensive distillation, reconstruction and reranking with a neural noisy channel model.

Our results show that several methods yield improved generalization to out-of-domain data, and we find that a combination of reconstruction and a noisy channel model for reranking is most effective. We achieve an improvement in average out-of-domain BLEU of 1.5 (DE→EN) and 2.7 (DE→RM), as well as a reduction in hallucinated translations according to manual analysis.

Still, in our manual evaluation NMT generally underperforms SMT in terms of adequacy on the tested out-of-domain datasets, and we encourage further research on domain robustness, which we consider an unsolved problem. For this purpose, we share data and code to serve as a baseline for future experiments.


Appendix A Translation Examples

Source - die Produktion in der Türkei entspricht 1,3 % der chinesischen Produktion;
Target - Turkey’s volume of production amounts to 1,3 % of Chinese production,
NMT Baseline - the production in slkei is 1.3% of a Chinese hamster ovary (CHO) cell
Multilingual - production in turkei is equivalent to 1.3% of Chinese Hamster production;
Reconstruction - the production in thekei is equivalent to 1.3% of the Chinese production;
Reconstruction + NC - production in the turkei equals 1.3% of the Chinese production;
Table 9: Example translations for DE→EN. Hallucinated parts are set in bold.

Appendix B Noisy Channel Reranking Grid Search Weights

| system | DE→EN forward | DE→EN backward | DE→EN LM | DE→RM forward | DE→RM backward | DE→RM LM |
|---|---|---|---|---|---|---|
| (5) Multilingual | 0.7 | 0.26 | 0.04 | 0.9 | 0.09 | 0.01 |
| (6) Reconstruction | 0.6 | 0.32 | 0.08 | 0.9 | 0.09 | 0.01 |
| (7) Multilingual + SR | 0.5 | 0.42 | 0.08 | 0.9 | 0.09 | 0.01 |
| (8) Reconstruction + SR | 0.5 | 0.46 | 0.04 | 0.9 | 0.09 | 0.01 |

Table 10: Best weights for noisy channel reranking found with grid search on in-domain development set. Row numbers correspond to the ones in Table 7. forward=forward translation weight, backward=backward translation weight, LM=language model weight.