The Unreasonable Volatility of Neural Machine Translation Models

05/25/2020 ∙ by Marzieh Fadaee, et al. ∙ University of Amsterdam

Recent works have shown that Neural Machine Translation (NMT) models achieve impressive performance; however, questions about the behavior of these models remain unanswered. We investigate the unexpected volatility of NMT models where the input is semantically and syntactically correct. We discover that with trivial modifications of source sentences, we can identify cases where unexpected changes happen in the translation and, in the worst case, lead to mistranslations. This volatile behavior of translating extremely similar sentences in surprisingly different ways highlights the underlying generalization problem of current NMT models. We find that both RNN and Transformer models display volatile behavior in 26 variations, respectively.




1 Introduction

The performance of Neural Machine Translation (NMT) models has dramatically improved in recent years, and with sufficient and clean data these models outperform more traditional models. Challenges when sufficient data is not available include translations of rare words (W18-2712) and idiomatic phrases (L18-1148) as well as domain mismatches between training and testing (koehn2017six; W18-2709).

Source: Ich bin         erleichtert und         bescheiden.
NMT output
I am easier and modest.
sehr I am relieved and very modest.
sehr I am very much easier and modest.
sehr sehr I am very easy and very modest.
I am relieved and humble.
sehr sehr I am very relieved and very humble.
Table 1: Insertion of the German word sehr (English: very) in different positions in the source sentence results in substantially different translations. indicates the original sentence from WMT 2017.

Recently, several approaches have investigated how NMT models behave when encountering noisy input and how worst-case examples of noisy input can ‘break’ state-of-the-art NMT models (goodfellow6572explaining; D18-1050). DBLP:journals/corr/abs-1711-02173 show that character-level noise in the input leads to poor translation performance. halluc randomly insert words in different positions in the source sentence and observe that in some cases the translations are completely unrelated to the input. While it is to some extent expected that the performance of NMT models that are trained on predominantly clean data but tested on noisy data deteriorates, other changes are more unexpected.

In this paper, we explore unexpected and erroneous changes in the output of NMT models. Consider the simple example in Table 1 where the Transformer model (vaswani2017attention) is used to translate very similar sentences. Surprisingly, we observe that a single local change to the source sentence, namely inserting the German word sehr (English: very), causes an unrelated change in the translation. In principle, an NMT model that generates the translation of the word erleichtert (English: relieved) in one context should also be able to generalize and translate it correctly in a very similar context. Note that there are no infrequent words in the source sentence and that, after each modification, the input is still syntactically correct and semantically plausible. We call a model volatile if it displays inconsistent behaviour across similar input sentences during inference.

Modification Sentence variations
DEL Some 500 years after the Reformation, Rome [now\] has a Martin Luther Square.
SUBNUM I’m very pleased for it to have happened at Newmarket because this is where I landed [30\31] years ago.
INS I loved Amy and she is [\also] the only person who ever loved me.
SUBGEN [He\She] received considerable appreciation and praise for this.
Table 2: Examples of different variations from WMT. [x\y] indicates that x in the original sentence is replaced by y. An empty x or y denotes the empty string.

We investigate to what extent well-established NMT models are volatile during inference. Specifically, we locally modify sentence pairs in the test set and identify examples where a trivial modification in the source sentence causes an ‘unexpected change’ in the translation. These modifications are generated conservatively to avoid insertion of any noise or rare words in the data (Section 2.2). Our goal is not to fool NMT models, but instead identify common cases where the models exhibit unexpected behaviour and in the worst cases result in incorrect translations.

We observe that our modifications expose volatilities of both RNN and Transformer translation models in and of sentence variations, respectively. Our findings show how vulnerable current NMT models are to trivial linguistic variations, putting into question the generalizability of these models.

2 Sentence Variations

2.1 Is this another noisy text translation problem?

Noisy input text can cause mistranslations in most MT systems, and there has been growing research interest in studying the behaviour of MT systems when encountering noisy input (li-EtAl:2019:WMT1).

DBLP:journals/corr/abs-1711-02173 propose to swap or randomize letters in a word in the input sentence. For instance, they change the word noise in the source sentence into iones. halluc examine how the insertion of a random word in a random position in the source sentence leads to mistranslations. D18-1050 proposes a benchmark dataset for translation of noisy input sentences, consisting of noisy, user-generated comments on Reddit. The types of noisy input text they observe include spelling or typographical errors, word omission/insertion/repetition, and grammatical errors.

In these previous works, the focus is on the lack of robustness of MT systems when handling noisy input text. In these approaches, the input sentences are semantically or syntactically incorrect, which leads to mistranslations.

However, in this paper, our focus is on input text that does not contain any type of noise. We modify input sentences such that the results are still syntactically and semantically correct. We investigate how MT systems exhibit volatile behaviour when translating sentences that are extremely similar and differ in only one word, without any noise injection.

2.2 Variation generation

While there are various ways to automatically modify sentences, we are interested in simple semantic and syntactic modifications. These trivial linguistic variations should have almost no effect on the translation of the rest of the sentence.

Figure 1: Levenshtein distance and span of change between translations of sentence variations for RNN and Transformer. The majority of sentence variations falls into the category of minor changes between translations (blue area). However, a surprising number of cases have significant changes (red area). RNN exhibits a slightly more unstable pattern i.e., sentence variations with large edit differences and large spans of change.

We define a set of rules to slightly modify the source and target sentences in the test data and keep the sentences syntactically correct and semantically plausible.


DEL. A conservative approach to automatically modifying a sentence without breaking its grammaticality is to remove adverbs. We identify a list of the 50 most frequent adverbs in English and their translations in German. For every sentence pair in the WMT test sets that contains both a word and its translation from this list, we remove both words and create a new sentence pair.
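The adverb-removal rule can be illustrated with a minimal sketch. It is not the implementation used in the experiments: the lexicon excerpt `ADVERBS`, the function name `delete_adverb`, whitespace tokenization, and exact case-sensitive matching are all assumptions made for the example.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical excerpt of an English -> German adverb lexicon.
# The paper uses the 50 most frequent English adverbs; these three
# entries are placeholders chosen for illustration only.
ADVERBS: Dict[str, str] = {
    "now": "jetzt",
    "already": "bereits",
    "usually": "normalerweise",
}


def delete_adverb(
    src_de: List[str], tgt_en: List[str]
) -> Optional[Tuple[List[str], List[str]]]:
    """Remove one adverb and its translation from a tokenized sentence pair.

    Returns the modified (source, target) pair, or None if no adverb from
    the lexicon occurs on both sides.
    """
    for en, de in ADVERBS.items():
        if en in tgt_en and de in src_de:
            new_src = list(src_de)
            new_tgt = list(tgt_en)
            new_src.remove(de)  # removes the first occurrence only
            new_tgt.remove(en)
            return new_src, new_tgt
    return None


src = "Man hält bereits Ausschau nach Parkbank".split()
tgt = "You are already on the lookout for a park bench".split()
variation = delete_adverb(src, tgt)
```

Requiring the adverb on both sides keeps the source and the reference consistent after the modification, which is what allows the modified pair to be used as a new test item.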


SUBNUM. Another simple yet effective approach to safely modify sentences is to substitute numbers with other numbers. In this approach, we select every sentence pair from the test sets that contains a number and substitute the number in both source and target sentences with a nearby number. We choose a small range for the change so that the sentences remain semantically correct for the most part and only a few implausible sentences result.
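A minimal sketch of the number substitution follows. The offsets in `offsets` are illustrative assumptions, not the range used in the experiments, and the function name `substitute_number` and the restriction to numbers that occur exactly once on each side are likewise choices made for this example.

```python
import re
from typing import List, Sequence, Tuple

NUMBER = re.compile(r"\b\d+\b")


def substitute_number(
    src: str, tgt: str, offsets: Sequence[int] = (1, 2)
) -> List[Tuple[str, str]]:
    """Create sentence-pair variations by shifting a shared number.

    `offsets` is a hypothetical small range of shifts. A number is only
    modified if it occurs exactly once in both the source and the target,
    so that the substitution stays consistent on both sides.
    """
    variations = []
    for match in NUMBER.finditer(src):
        number = match.group()
        pattern = rf"\b{number}\b"
        if len(re.findall(pattern, src)) != 1:
            continue
        if len(re.findall(pattern, tgt)) != 1:
            continue
        for k in offsets:
            new_number = str(int(number) + k)
            variations.append(
                (re.sub(pattern, new_number, src), re.sub(pattern, new_number, tgt))
            )
    return variations


pairs = substitute_number(
    "mit 43 Senioren als Fahrgästen",
    "carrying 43 elderly people",
    offsets=(2,),
)
```

Because one sentence can be shifted by several offsets, this rule yields many variations per sentence, which is what makes it suitable for studying how a metric fluctuates across near-identical inputs.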


INS. Randomly inserting words in a sentence has a high chance of producing a syntactically incorrect sentence. To ensure that sentences remain grammatical and semantically plausible after modification, we define a bidirectional n-gram probability for inserting new words: a word is inserted in the middle of a phrase if its conditional probability given the surrounding words is greater than a predefined threshold. The probabilities are computed on the WMT data. This simple approach, instead of using a more complex language model, serves our purposes since we are interested in inserting very common words that are already captured by the n-grams in the training data.
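One possible reading of this insertion criterion is sketched below. The count-based estimate, the context size `k` (one word on each side by default), the threshold value, and the names `train_counts` and `insertion_positions` are assumptions made for illustration; the exact formulation used in the experiments may differ.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Context = Tuple[Tuple[str, ...], Tuple[str, ...]]


def train_counts(corpus: Iterable[List[str]], k: int = 1):
    """Count middle words together with k words of left and right context."""
    full: Counter = Counter()     # (left, word, right) -> count
    context: Counter = Counter()  # (left, right) around any middle word -> count
    for sent in corpus:
        for i in range(k, len(sent) - k):
            left = tuple(sent[i - k:i])
            right = tuple(sent[i + 1:i + 1 + k])
            full[(left, sent[i], right)] += 1
            context[(left, right)] += 1
    return full, context


def insertion_positions(
    sentence: List[str],
    word: str,
    full: Counter,
    context: Counter,
    k: int = 1,
    threshold: float = 0.5,  # hypothetical value
) -> List[int]:
    """Return indices i where `word` may be inserted before sentence[i].

    The word qualifies if its estimated probability given the k words on
    each side of the gap exceeds the threshold.
    """
    positions = []
    for i in range(k, len(sentence) - k + 1):
        left = tuple(sentence[i - k:i])
        right = tuple(sentence[i:i + k])
        total = context[(left, right)]
        if total and full[(left, word, right)] / total > threshold:
            positions.append(i)
    return positions


corpus = [
    ["she", "is", "also", "happy"],
    ["she", "is", "also", "tired"],
    ["he", "is", "also", "happy"],
    ["he", "is", "not", "happy"],
]
full, context = train_counts(corpus)
positions = insertion_positions(["she", "is", "happy"], "also", full, context)
```

In this toy corpus, "also" fills the gap between "is" and "happy" in two of three observed cases, so it passes a threshold of 0.5, while "not" (one of three) does not.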


SUBGEN. Finally, another local modification is to change the gender of the person in the sentence. The goal of this modification is to investigate the existence and severity of gender bias in our models. This is inspired by recent approaches that have shown that NMT models learn social stereotypes such as gender bias from training data (escude-font-costa-jussa-2019-equalizing; stanovsky-etal-2019-evaluating).

Note that in a minority of cases these procedures can lead to semantically incorrect sentences. For instance, by substituting numbers we can potentially generate sentences such as “She was born on October 34th”. While this can cause problems for a reasoning task, it barely affects the translation task, as long as the modifications are consistent on the source and target side.

Table 2 shows examples of generated variations. We emphasize that only modifications with local consequences have been selected and we intentionally ignore cases such as negation which can result in wider structural changes in the translation of the sentence.

Model         2016   2017   2018   2016   2017   2018
RNN           32.5   28.2   35.2   28.1   22.4   34.6
Transformer   36.2   32.1   40.1   33.4   27.9   39.8
Table 3: BLEU scores for different models on the WMT data for translation DE↔EN.
Source: Coes letztes Buch “Chop Suey” handelte von der chinesischen Küche in den USA, während Ziegelman in ihrem Buch “[97\101] Orchard” über das Leben in einem Wohnhaus an der Lower East Side aus der Lebensmittelperspektive erzählt.
Reference: Mr. Coe’s last book, “Chop Suey,” was about Chinese cuisine in America, while Ms. Ziegelman told the story of life in a Lower East Side tenement through food in her book “[97\101] Orchard.”
Translation (original): Coes’s last book, “Chop Suey,” was about Chinese cuisine in the US, while Ziegelman, in her book “97 Orchard” talks about living in a lower East Side.
Translation (modified): Coes last book “Chop Suey” was about Chinese cuisine in the United States, while Ziegelman writes in her book “101 Orchard” about living in a lower East Side.
Unexpected changes: [reordered] [paraphrased]
Quality change: No

Source: Man hält [bereits\] Ausschau nach Parkbank, Hund und Fußball spielenden Jungs und Mädels.
Reference: You are [already\] on the lookout for a park bench, a dog, and boys and girls playing football.
Translation (original): We are already looking for Parkbank, dog and football playing boys and girls.
Translation (modified): Look for Parkbank, dog and football playing boys and girls.
Unexpected changes: [word form] [add/remove]
Quality change: Yes

Source: Bei einem Unfall eines Reisebusses mit [43\45] Senioren als Fahrgästen sind am Donnerstag in Krummhörn (Landkreis Aurich) acht Menschen verletzt worden.
Reference: On Thursday, an accident involving a coach carrying [43\45] elderly people in Krummhörn (district of Aurich) led to eight people being injured.
Translation (original): In the event of an accident involving a coach with 43 senior citizens as passengers, eight people were injured on Thursday in Krummaudin (County Aurich).
Translation (modified): In the event of an accident involving a 45-year-old coach as a passenger, eight people were injured on Thursday in the district of Aurich.
Unexpected changes: [word form] [add/remove] [other]
Quality change: Yes

Source: Es ist ein anstrengendes Pensum, aber die Dorfmusiker helfen [normalerweise\], das Team motiviert zu halten.
Reference: It’s a backbreaking pace, but village musicians [usually\] help keep the team motivated.
Translation (original): It’s a demanding child, but the village musicians usually help keep the team motivated.
Translation (modified): It is a hard-to-use, but the village musician helps to keep the team motivated.
Unexpected changes: [word form] [other]
Quality change: Yes

Table 4: A random sample of sentences from the WMT test sets and our proposed variations shown with ‘unexpected change’ annotations (Unexpected changes). The cases where the unexpected change leads to a change in translation quality are marked under Quality change. [x\y] indicates that x in the original sentence is replaced by y. Source is the original and modified source sentence, Reference is the original and modified reference translation, Translation (original) is the translation of the original sentence, and Translation (modified) is the translation of the modified sentence. Differences in translations related to annotations in the original and the modified translations are in red and orange, respectively. Note that we are interested in unexpected changes and do not highlight the changes that are a direct consequence of the modifications.

We generate sentence variations by applying these modifications to all sentence pairs in WMT test sets 2013–2018 (bojar-EtAl:2018:WMT1). We use RNN and Transformer models to translate sentences and their variations.

2.3 Experimental setup

In the translation experiments, we use the standard EN-DE WMT-2017 training data (bojar-EtAl:2018:WMT1). We perform NMT experiments with two different architectures: RNN (luong:2015:EMNLP) and Transformer (vaswani2017attention). We preprocess the training data with Byte-Pair Encoding (BPE) using 32K merge operations (sennrich-haddow-birch:2016:P16-12). During inference, we use beam search with a beam size of 5. Table 3 shows the case-sensitive BLEU scores as calculated by multi-bleu.perl.


As the first NMT system, we use a 2-layer bidirectional attention-based LSTM model implemented in OpenNMT (2017opennmt) trained with an embedding size of 512, hidden dimension size of 1024, and batch size of 64 sentences. We use Adam (kingma2014adam) for optimization.


We also experiment with the Transformer model (vaswani2017attention) implemented in OpenNMT. We train a model with 6 layers, the hidden size is set to 512 and the filter size is set to 2048. The multi-head attention has 8 attention heads. We use Adam (kingma2014adam) for optimization. All parameters are set based on the suggestions in 2017opennmt to replicate the results of the original paper.

Figure 2: Categories of unexpected changes in the translation of sentence variations as provided by annotators. The percentages of sentence variations with minor and major edit differences, as defined in Section 3.1, are shown separately. The hatched pattern indicates the ratio of sentence variations for which the translation quality changes. Note that expected changes are not plotted here.

3 Evaluation of unexpected and erroneous changes

The modifications described above generate sentences that are extremely similar and hence are expected to have a very similar difficulty of translation. We evaluate the NMT models on how robust and consistent they are in translating these sentence variations rather than their absolute quality.

3.1 Deviations from Original Translations

The variations are designed to have a minimal effect on the meaning of the sentences. Hence, major changes in the translations of these variations can be an indication of volatility in the model. To assess whether the proposed sentence variations result in major changes in the translations, we measure the Levenshtein distance (levenshtein1966binary), i.e., the edit distance, between the translation of the original sentence and the translation of its variation. We also use the first and last positions of change in the translations, which represent the span of change.

Ideally, with our simple modifications, we expect a value of zero for the span of change and a value of at most 2 for the Levenshtein distance for a translation pair. This indicates that there is only one token difference between the translation of the original sentence and the modified sentence. We define two types of changes based on these measures: minor and major. We choose the threshold to distinguish between minor and major changes more conservatively to allow for more variations in the translations. The change in translations is empirically considered major if both metrics are greater than 10, and minor if both are less than 10. Note that edit distances and spans are based on BPE subword units.
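The two measures and the minor/major distinction can be sketched as follows. This is an illustration under stated assumptions rather than the evaluation code: the edit distance uses unit costs for insertion, deletion, and substitution, the span is taken as the distance between the first and last differing positions after stripping the common prefix and suffix, and the example operates on whitespace tokens instead of BPE units. The function names are hypothetical.

```python
from typing import Sequence


def levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Token-level edit distance with unit costs (an assumption)."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def span_of_change(a: Sequence[str], b: Sequence[str]) -> int:
    """Distance between the first and last positions where a and b differ."""
    limit = min(len(a), len(b))
    prefix = 0
    while prefix < limit and a[prefix] == b[prefix]:
        prefix += 1
    suffix = 0
    while suffix < limit - prefix and a[-1 - suffix] == b[-1 - suffix]:
        suffix += 1
    changed = max(len(a), len(b)) - prefix - suffix
    return max(0, changed - 1)


def classify_change(a: Sequence[str], b: Sequence[str], threshold: int = 10) -> str:
    """Label a translation pair as 'minor', 'major', or 'unclassified'."""
    dist = levenshtein(a, b)
    span = span_of_change(a, b)
    if dist > threshold and span > threshold:
        return "major"
    if dist < threshold and span < threshold:
        return "minor"
    return "unclassified"  # the two measures disagree or equal the threshold


original = "I am relieved and humble .".split()
modified = "I am very relieved and humble .".split()
label = classify_change(original, modified)
```

For the pair above, a single inserted token gives an edit distance of 1 and a span of 0, which is the ideal outcome for a trivial source modification.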

With two very similar source sentences, we expect the Levenshtein distance and span of change between translations of these sentences to be small. Figure 1 shows the results for the RNN and Transformer model. While the majority of sentence variations have minor changes, a substantial number of sentences, of RNN and of Transformer translations, result in translations with major differences. This is surprising and an indication of volatility since these trivial modifications, in principle, should only result in minor and local changes in the translations.

3.2 Oscillations of Variation in Translations

In this section, we look into various sentence-level metrics to further analyze the observed behaviour. In particular, we focus on the SUBNUM modification because with this modification we can generate numerous variations of the same sentence. Having a high number of variations for each sentence gives us the opportunity of observing oscillations of various string matching metrics.

We use sentence-level BLEU, METEOR (denkowski-lavie-2011-meteor), TER (Snover06astudy), and LengthRatio to quantify changes in the translations. LengthRatio represents the translation length over the reference length as a percentage. For a given source sentence, we define the oscillation range as the range of values that a sentence-level metric takes over the translations of the variations of that sentence.
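As a small worked example, LengthRatio and an oscillation range can be computed as below. The sketch assumes whitespace tokenization and takes the oscillation range to be the difference between the largest and smallest metric value across the variations of one sentence; both choices, as well as the function names, are assumptions for illustration.

```python
from typing import Sequence


def length_ratio(translation: str, reference: str) -> float:
    """Translation length over reference length, as a percentage."""
    return 100.0 * len(translation.split()) / len(reference.split())


def oscillation_range(scores: Sequence[float]) -> float:
    """Spread of a sentence-level metric across variations of one sentence."""
    return max(scores) - min(scores)


# Translations of three hypothetical variations of one source sentence,
# each scored against its own reference.
pairs = [
    ("She is 10 years old .", "She is 10 years old ."),
    ("She is 11 years .", "She is 11 years old ."),
    ("She is 12 .", "She is 12 years old ."),
]
ratios = [length_ratio(hyp, ref) for hyp, ref in pairs]
spread = oscillation_range(ratios)
```

A model that treats the three variations consistently would give a spread of zero; here the translations shrink from six to four tokens, so the LengthRatio oscillates by about 33 points.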

While sentence-level metrics are not reliable indicators of translation quality, they do capture fluctuations in translations. With the variations we introduce, in theory there should be no fluctuations in the translations. Table 5 and Figure 3 provide the results. We observe that even though these sentence variations differ by only one number, there are many cases where an insignificant change in the sentence results in unexpectedly large oscillations. Both RNN and Transformer exhibit this behaviour to a certain extent.

RNN 4.0 3.8 5.2 5.3
Transformer 3.8 3.3 4.2 3.4
Table 5: Mean oscillations for SUBNUM variations. In theory the variations should result in zero oscillations for every metric.
Figure 3: Oscillations of various sentence-level attributes for randomly sampled sentences from our test data and their SUBNUM variations. The data points are the mean values for all variations of each sentence, and the error bars indicate the range of oscillation of the metrics. The x-axis represents test sentence instances, sorted based on the corresponding metric. Ideally each data point should have zero oscillation.

3.3 The Effect of Volatility on Translation Quality

While edit distances and spans of change provide some indication of volatility, they do not capture all aspects of this unexpected behaviour. It is also not entirely clear what effect these unexpected changes have on translation quality. To further investigate this, we also perform manual evaluations.

In the first evaluation, we provide annotators with a pair of sentence variations and their corresponding translations and ask them to identify the differences between the two sentence pairs. In the second evaluation, we additionally provide the source sentences and reference translations, and ask the annotators to rank the sentence variations based on the translation quality similar to bojar-EtAl:2016:WMT1. In total the annotators evaluated 400 randomly selected sentence quadruplets.

The annotators identified and of changes in the variation translation as expected for the RNN and Transformer model, respectively. The main types of unexpected changes identified by the annotators are a change of word form, e.g., verb tense, reordering of phrases, paraphrasing parts of the sentence, and an ‘other’ category, e.g., preposition. A sentence pair can have multiple labels based on the types of changes. Table 4 provides examples from the test data.

Statistics for each category of unexpected change are shown in Figure 2. Our first observation is that, as expected, there are very few ‘unexpected changes’ when two variations lead to translations with minor differences. Interestingly, the vast majority of changes are due to paraphrasing and dropping of words. Comparing the RNN and Transformer models, we see that both display inconsistent translation behaviour. While Transformer has slightly fewer sentences with major changes, it has a higher number of sentence variations in the major category that result in a change in translation quality. From the annotators’ assessments, we find that in and of sentence variations, the modification results in a change in translation quality for the RNN and Transformer model, respectively.

3.4 Generalization and Compositionality

Because of their ability to generalize beyond their training data, deep learning models achieve exceptional performances in numerous tasks. The generalization ability allows MT systems to generate long sentences not seen before. Recently there has been some interest in understanding whether this performance depends on recognizing shallow patterns, or whether the networks are indeed capturing and generalizing linguistic rules.

In simple terms, compositionality is the ability to construct larger linguistic expressions by combining simpler parts. For instance, if a model understands the correct compositional rules to understand ‘John loves Mary’, it must also understand ‘Mary loves John’ (fodor2002compositionality). Investigating the compositional behaviour of neural networks in real-world natural language problems is a challenging task. Recently, several works have studied deep learning models’ understanding of compositionality in natural language by using synthetic and simplified languages (DBLP:journals/corr/abs-1902-07181; babyai_iclr19). DBLP:journals/corr/abs-1904-00157 shows that to a certain extent neural networks can be productive without being compositional.

Although we do not specifically look into the compositional potential of MT systems, we are inspired by compositionality in defining our modifications. We argue that the observed volatile behaviour of the MT systems in this paper is a side effect of current models not being compositional. If an MT system has a good ‘understanding’ of the underlying structures of the sentences ‘Mary is 10 years old’ and ‘Mary is 11 years old’, it must also translate them very similarly regardless of the accuracy of the translation. While current evaluation metrics capture the accuracy of the NMT models, these volatilities go unnoticed.

Current neural models are successful in generalizing without learning any explicit compositional rules; however, our findings signal that they still lack robustness. We highlight this lack of robustness and suspect that it is associated with these models’ lack of understanding of the compositional nature of language.

4 Conclusion

In this paper, we showed the unexpected volatility of NMT models by using a simple approach to modifying standard test sentences without introducing noise, i.e., by generating semantically and syntactically correct variations. We show that even with trivial linguistic modifications of source sentences we can effectively identify a surprising number of cases where the translations of extremely similar sentences are surprisingly different, see Figure 1. Our manual analyses show that both RNN and Transformer models exhibit volatile behaviour with changes in translation quality for and of sentence variations, respectively. This highlights the problem of generalizability of current NMT models and we hope that our insights will be useful for developing more robust NMT models.


We thank Arianna Bisazza for helpful discussions. This research was funded in part by the Netherlands Organization for Scientific Research (NWO) under project numbers 639.022.213 and 612.001.218. We also thank NVIDIA for their hardware support and the anonymous reviewers for their helpful comments.