Evaluating Robustness to Input Perturbations for Neural Machine Translation

by Xing Niu, et al.

Neural Machine Translation (NMT) models are sensitive to small perturbations in the input. Robustness to such perturbations is typically measured using translation quality metrics such as BLEU on the noisy input. This paper proposes additional metrics which measure the relative degradation and changes in translation when small perturbations are added to the input. We focus on a class of models employing subword regularization to address robustness and perform extensive evaluations of these models using the robustness measures proposed. Results show that our proposed metrics reveal a clear trend of improved robustness to perturbations when subword regularization methods are used.






1 Introduction

Recent work has pointed out the challenges in building robust neural network models

(Goodfellow et al., 2015; Papernot et al., 2016). For Neural Machine Translation (NMT) in particular, it has been shown that NMT models are brittle to small perturbations in the input, whether these perturbations are created synthetically or generated to mimic real data noise (Belinkov and Bisk, 2018). Consider the example in Table 1, where an NMT model generates a worse translation as a consequence of only one character changing in the input.

Improving robustness in NMT has received a lot of attention lately with data augmentation (Sperber et al., 2017; Belinkov and Bisk, 2018; Vaibhav et al., 2019; Liu et al., 2019; Karpukhin et al., 2019) and adversarial training methods (Cheng et al., 2018; Ebrahimi et al., 2018; Cheng et al., 2019; Michel et al., 2019) as some of the more popular approaches used to increase robustness in neural network models.

In this paper, we focus on one class of methods, subword regularization, which addresses NMT robustness without introducing any changes to the architectures or to the training regime, solely through dynamic segmentation of input into subwords (Kudo, 2018; Provilkov et al., 2019). We provide a comprehensive comparison of these methods on several language pairs and under different noise conditions on robustness-focused metrics.

| | |
|---|---|
| Original input | Se kyllä tuntuu sangen luultavalta. |
| Translation | It certainly seems very likely. |
| Perturbed input | Se kyllä tumtuu sangen luultavalta. |
| Translation | It will probably darken quite probably. |
| Reference | It certainly seems probable. |

Table 1: An example of NMT English translations for a Finnish input and its one-letter misspelled version.

Previous work has used translation quality measures such as BLEU on noisy input as an indicator of robustness. Absolute model performance on noisy input is important, and we believe this is an appropriate measure for noisy domain evaluation (Michel and Neubig, 2018; Berard et al., 2019; Li et al., 2019). However, it does not disentangle model quality from the relative degradation under added noise.

For this reason, we propose two additional measures for robustness which quantify the changes in translation when perturbations are added to the input. The first one measures relative changes in translation quality while the second one focuses on consistency in translation output irrespective of reference translations. Unlike the use of BLEU scores alone, the metrics introduced show clearer trends across all languages tested: NMT models are more robust to perturbations when subword regularization is employed. We also show that for the models used, changes in output strongly correlate with decreased quality and the consistency measure alone can be used as a robustness proxy in the absence of reference data.

2 Evaluation Metrics

Robustness is usually measured with respect to translation quality. Suppose an NMT model M translates input x to y and translates its perturbed version x′ to y′. The translation quality (TQ) on these datasets is measured against reference translations r: TQ(y, r) and TQ(y′, r). TQ can be implemented as any quality measurement metric, such as BLEU (Papineni et al., 2002) or 1 minus TER (Snover et al., 2006).

Previous work has used TQ on perturbed or noisy input as an indicator of robustness. However, we argue that assessing a model's performance relative to its performance on the original dataset is important as well, in order to capture the model's sensitivity to perturbations. Consider a hypothetical example with two models, M1 and M2, where M2 obtains a higher TQ score than M1 on the noisy data.

If the goal is only to translate noisy data, selecting M2 is preferable, since M2 outperforms M1 on that data. However, if M2 also suffers a larger drop relative to its own quality on the original data, this degradation reflects that M2 is in fact more sensitive to the perturbation than M1.

To this end, we use the ratio between TQ(M, D_δ) and TQ(M, D) to quantify an NMT model M's invariance to a specific dataset D and perturbation δ, and define it as robustness:

ROBUST(M | D, δ) = TQ(M, D_δ) / TQ(M, D)

where D_δ denotes the dataset D with perturbation δ applied. When evaluating M on the dataset D, ROBUST < 1 means the translation quality of M is degraded under perturbation δ, while ROBUST ≈ 1 indicates that M is robust to perturbation δ.

It is worth noting that: (1) ROBUST can be viewed as the normalized TQ difference, because ROBUST − 1 = (TQ(M, D_δ) − TQ(M, D)) / TQ(M, D). We opt for the ratio definition because it is on a percentage scale and is easier to interpret than the raw difference ΔTQ, since the latter needs to be interpreted in the context of the TQ score. (2) High robustness can only be expected under low levels of noise, as it is not realistic for a model to recover from extreme perturbations.
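As a minimal sketch (not code from the paper), the ROBUST ratio can be computed directly from any pair of corpus-level TQ scores, e.g. BLEU on the original and perturbed test sets:

```python
def robustness(tq_original: float, tq_perturbed: float) -> float:
    """ROBUST(M | D, delta) = TQ(M, D_delta) / TQ(M, D).

    Values below 1 indicate quality degradation under perturbation;
    values near 1 indicate robustness to it.
    """
    if tq_original <= 0:
        raise ValueError("TQ on the original data must be positive")
    return tq_perturbed / tq_original

# Hypothetical corpus-level BLEU scores, not taken from the paper's tables:
print(robustness(40.0, 30.0))  # 0.75 -> quality drops to 75% under noise
```

Any TQ metric on the same scale for both datasets (BLEU, 1 − TER, etc.) can be plugged into this ratio.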

Evaluation without References

Reference translations are not readily available in some cases, such as when evaluating on a new domain. Inspired by unsupervised consistency training (Xie et al., 2019), we test if translation consistency can be used to estimate robustness against noise perturbations. Specifically, a model M is consistent under a perturbation δ if the two translations, y = M(x) and y′ = M(x′), are similar to each other. Note that consistency is sufficient but not necessary for robustness: a good translation can be expressed in diverse ways, which leads to high robustness but low consistency.

We define consistency by

CONSIS(M | D, δ) = Sim(M(D), M(D_δ))

where Sim can be any symmetric measure of similarity. In this paper we opt for Sim(y, y′) to be the harmonic mean of TQ(y, y′) and TQ(y′, y), where TQ is BLEU between the two outputs.
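A small illustration of this definition follows. The `unigram_precision` stand-in is a deliberate simplification of BLEU (the paper uses BLEU between the two outputs), so only the harmonic-mean structure of the metric is faithfully mirrored here:

```python
from collections import Counter

def harmonic_mean(a: float, b: float) -> float:
    return 0.0 if a + b == 0 else 2 * a * b / (a + b)

def unigram_precision(hyp: str, ref: str) -> float:
    # Toy stand-in for BLEU: clipped unigram precision over whitespace tokens.
    hyp_counts, ref_counts = Counter(hyp.split()), Counter(ref.split())
    overlap = sum(min(c, ref_counts[w]) for w, c in hyp_counts.items())
    return overlap / max(1, sum(hyp_counts.values()))

def consistency(y: str, y_prime: str, tq=unigram_precision) -> float:
    # CONSIS: harmonic mean of TQ(y, y') and TQ(y', y); TQ is BLEU in the paper.
    return harmonic_mean(tq(y, y_prime), tq(y_prime, y))

print(consistency("it seems very likely", "it seems very likely"))  # 1.0
```

Because Sim is symmetrized via the harmonic mean, swapping y and y′ leaves the score unchanged, as required of a similarity measure.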

3 Experimental Set-Up

We run several experiments across language pairs of varying difficulty and across different training data conditions (i.e., with different training data sizes), and evaluate how different subword segmentation strategies perform across noisy domains and noise types.

Implementation Details

We build NMT models with the Transformer-base architecture (Vaswani et al., 2017) implemented in the Sockeye toolkit (Hieber et al., 2017). The target embeddings and the output layer's weight matrix are tied (Press and Wolf, 2017). Training is done on 2 GPUs with a batch size of 3,072 tokens, and we checkpoint the model every 4,000 updates. The learning rate is initialized to 0.0002 and reduced by 10% after 4 checkpoints without improvement of perplexity on the development set. Training stops after 10 checkpoints without improvement.

Tasks and Data

We train NMT models on eight translation directions and measure robustness and consistency for them. EN↔DE and EN↔FI models are trained with pre-processed WMT18 news data and tested with the latest news test sets (newstest2019).

Recently, two datasets were built from user-generated content: MTNT (Michel and Neubig, 2018) and 4SQ (Berard et al., 2019). They provide naturally occurring noisy inputs and translations for EN↔FR and EN↔JA, thus enabling automatic evaluation. EN↔JA baseline models are trained and also tested with the aggregated data provided by MTNT, i.e., KFTT+TED+JESC (KTJ). EN↔FR baseline models are trained with the aggregated data of Europarl-v7 (Koehn, 2005), NewsCommentary-v14 (Bojar et al., 2018), OpenSubtitles-v2018 (Lison and Tiedemann, 2016), and ParaCrawl-v5 (https://paracrawl.eu/), which simulates the UGC training corpus used in the 4SQ benchmarks; they are tested with the latest WMT news test sets supporting EN↔FR (newstest2014).

Following convention, we also evaluate models directly on the noisy MTNT (mtnt2019) and 4SQ test sets. We fine-tune baseline models with the corresponding MTNT/4SQ training data, inheriting all hyper-parameters except the checkpoint interval, which is reset to 100 updates. Table 2 shows itemized training data statistics after pre-processing.

 
| | Languages | # sentences | # EN tokens |
|---|---|---|---|
| BASE | EN↔DE | 29.3 M | 591 M |
| | EN↔FR | 22.2 M | 437 M |
| | EN↔FI | 2.9 M | 71 M |
| | EN↔JA | 3.9 M | 43 M |
| MTNT | EN→FR | 36.1 K | 1,011 K |
| | FR→EN | 19.2 K | 779 K |
| | EN→JA | 5.8 K | 338 K |
| | JA→EN | 6.5 K | 156 K |
| 4SQ | FR→EN | 12.1 K | 141 K |

Table 2: Statistics of various training data sets.


We investigate two frequently used types of perturbations and apply them to the WMT and KTJ test data. The first is synthetic misspelling: each word is misspelled with probability 0.1, and the strategy is randomly chosen from single-character deletion, insertion, and substitution (Karpukhin et al., 2019). The second perturbation is letter case changing: each sentence is modified with probability 0.5, and the strategy is randomly chosen from upper-casing all letters, lower-casing all letters, and title-casing all words (Berard et al., 2019). Character substitution uses neighbor letters on the QWERTY keyboard, so accented characters are not substituted. Japanese is "misspelled" at the character level with probability 0.1, supporting only deletion and repetition. Letter case changing does not apply to Japanese.
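The two perturbation schemes can be sketched as follows. This is an illustrative reimplementation, not the authors' script: character substitution here draws a uniform random letter rather than a QWERTY neighbor, and the Japanese-specific variant is omitted.

```python
import random
import string

def misspell_word(word: str, rng: random.Random) -> str:
    """Apply one random single-character edit: delete, insert, or substitute.
    (Simplification: substitution uses a uniform random lowercase letter
    instead of a QWERTY-neighbor letter.)"""
    if not word:
        return word
    i = rng.randrange(len(word))
    op = rng.choice(["delete", "insert", "substitute"])
    letter = rng.choice(string.ascii_lowercase)
    if op == "delete":
        return word[:i] + word[i + 1:]
    if op == "insert":
        return word[:i] + letter + word[i:]
    return word[:i] + letter + word[i + 1:]

def perturb_misspelling(sentence: str, p: float = 0.1, seed: int = 0) -> str:
    """Misspell each whitespace-delimited word with probability p."""
    rng = random.Random(seed)
    return " ".join(
        misspell_word(w, rng) if rng.random() < p else w
        for w in sentence.split()
    )

def perturb_case(sentence: str, p: float = 0.5, seed: int = 0) -> str:
    """With probability p, upper-case, lower-case, or title-case the sentence."""
    rng = random.Random(seed)
    if rng.random() >= p:
        return sentence
    return rng.choice([sentence.upper(), sentence.lower(), sentence.title()])
```

Fixing the random seed makes the perturbed test set reproducible across model comparisons, which matters when the same noisy input must be fed to every system being ranked.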

Since we change letter case in the test data, we always report case-insensitive BLEU with '13a' tokenization using sacreBLEU (Post, 2018). Japanese output is pre-segmented with KyTea (http://www.phontron.com/kytea/) before running sacreBLEU.

Model Variations

We focus on comparing different (stochastic) subword segmentation strategies: BPE (Sennrich et al., 2016), BPE-Dropout (Provilkov et al., 2019), and SentencePiece (Kudo, 2018). Subword regularization methods (i.e., BPE-Dropout and SentencePiece) generate various segmentations for the same word, so the resulting NMT model better learns the meaning of less frequent subwords and should be more robust to noise that yields unusual subword combinations, such as misspellings. We use them only in the offline training data pre-processing step, which requires no modification to the NMT model. (With SentencePiece, we sample one subword segmentation for each source sequence.)

4 Experimental Results

**EN→DE (newstest2019) / DE→EN (newstest2019)**

| | Model | BLEU (EN→DE) | ROBUST | CONSIS | BLEU (DE→EN) | ROBUST | CONSIS |
|---|---|---|---|---|---|---|---|
| original | BPE | 39.70±0.71 | – | – | 40.01±0.65 | – | – |
| | BPE-Dropout | 39.65±0.73 | – | – | 40.16±0.66 | – | – |
| | SentencePiece | 39.85±0.75 | – | – | 40.25±0.67 | – | – |
| + misspelling | BPE | 29.38±0.60 | 74.01±0.95 | 60.59±0.80 | 33.48±0.61 | 83.69±0.96 | 71.51±0.74 |
| | BPE-Dropout | 33.13±0.70 | 83.55±0.92 | 70.74±0.77 | 35.97±0.64 | 89.58±0.78 | 78.33±0.64 |
| | SentencePiece | 31.87±0.66 | 79.99±0.97 | 66.40±0.76 | 35.26±0.66 | 87.61±0.91 | 74.09±0.74 |
| + case-changing | BPE | 31.61±0.74 | 79.63±1.31 | 73.26±1.19 | 33.72±0.69 | 84.27±1.15 | 73.19±1.13 |
| | BPE-Dropout | 35.04±0.73 | 88.37±0.97 | 80.04±0.99 | 36.34±0.69 | 90.48±0.95 | 78.96±0.96 |
| | SentencePiece | 33.49±0.73 | 84.05±1.09 | 76.24±1.09 | 34.48±0.71 | 85.65±1.10 | 74.55±1.10 |

**EN→FR (newstest2014) / FR→EN (newstest2014)**

| | Model | BLEU (EN→FR) | ROBUST | CONSIS | BLEU (FR→EN) | ROBUST | CONSIS |
|---|---|---|---|---|---|---|---|
| original | BPE | 41.47±0.48 | – | – | 39.24±0.50 | – | – |
| | BPE-Dropout | 40.72±0.48 | – | – | 39.22±0.50 | – | – |
| | SentencePiece | 41.05±0.48 | – | – | 39.14±0.50 | – | – |
| + misspelling | BPE | 34.01±0.45 | 82.01±0.66 | 71.59±0.53 | 32.62±0.48 | 83.13±0.63 | 73.05±0.49 |
| | BPE-Dropout | 35.98±0.46 | 88.36±0.59 | 78.49±0.48 | 34.71±0.48 | 88.51±0.60 | 79.27±0.50 |
| | SentencePiece | 34.78±0.45 | 84.72±0.59 | 75.28±0.51 | 33.44±0.48 | 85.43±0.62 | 75.28±0.50 |
| + case-changing | BPE | 34.75±0.54 | 83.81±0.97 | 79.34±0.93 | 32.31±0.54 | 82.34±0.96 | 76.56±0.95 |
| | BPE-Dropout | 38.28±0.47 | 94.00±0.55 | 86.28±0.58 | 35.78±0.50 | 91.24±0.65 | 84.47±0.65 |
| | SentencePiece | 36.49±0.50 | 88.87±0.74 | 82.73±0.76 | 33.51±0.54 | 85.61±0.84 | 78.18±0.88 |

**EN→FI (newstest2019) / FI→EN (newstest2019)**

| | Model | BLEU (EN→FI) | ROBUST | CONSIS | BLEU (FI→EN) | ROBUST | CONSIS |
|---|---|---|---|---|---|---|---|
| original | BPE | 20.43±0.55 | – | – | 24.31±0.59 | – | – |
| | BPE-Dropout | 20.01±0.54 | – | – | 24.51±0.57 | – | – |
| | SentencePiece | 20.63±0.57 | – | – | 24.67±0.60 | – | – |
| + misspelling | BPE | 15.20±0.46 | 74.42±1.39 | 52.76±0.89 | 21.27±0.54 | 87.47±1.14 | 70.06±0.89 |
| | BPE-Dropout | 17.39±0.50 | 86.95±1.43 | 63.63±0.86 | 22.40±0.55 | 91.38±1.06 | 75.18±0.83 |
| | SentencePiece | 16.73±0.51 | 81.09±1.52 | 57.45±0.85 | 21.89±0.57 | 88.76±1.19 | 70.57±0.87 |
| + case-changing | BPE | 15.65±0.53 | 76.63±1.71 | 68.27±1.44 | 20.71±0.58 | 85.20±1.32 | 74.85±1.16 |
| | BPE-Dropout | 17.19±0.53 | 85.92±1.39 | 72.76±1.30 | 23.10±0.58 | 94.26±1.09 | 79.67±1.00 |
| | SentencePiece | 15.72±0.54 | 76.19±1.72 | 67.73±1.40 | 21.50±0.58 | 87.16±1.26 | 76.29±1.12 |

**EN→JA (KTJ) / JA→EN (KTJ)**

| | Model | BLEU (EN→JA) | ROBUST | CONSIS | BLEU (JA→EN) | ROBUST | CONSIS |
|---|---|---|---|---|---|---|---|
| original | BPE | 24.28±0.53 | – | – | 22.80±0.51 | – | – |
| | BPE-Dropout | 24.11±0.51 | – | – | 22.21±0.52 | – | – |
| | SentencePiece | 22.63±0.45 | – | – | 22.99±0.50 | – | – |
| + misspelling | BPE | 19.82±0.47 | 81.66±1.09 | 54.84±0.73 | 18.20±0.45 | 79.83±1.20 | 52.34±0.74 |
| | BPE-Dropout | 22.01±0.49 | 91.30±0.95 | 63.21±0.78 | 18.89±0.47 | 85.06±1.17 | 56.43±0.78 |
| | SentencePiece | 19.85±0.41 | 87.69±1.05 | 61.25±0.80 | 18.97±0.46 | 82.53±1.15 | 56.40±0.73 |
| + case-changing | BPE | 20.35±0.51 | 83.83±1.13 | 68.10±1.25 | – | – | – |
| | BPE-Dropout | 21.44±0.49 | 88.91±1.00 | 72.96±1.13 | – | – | – |
| | SentencePiece | 19.99±0.44 | 88.32±1.06 | 73.52±1.10 | – | – | – |

Table 3: BLEU, robustness (in percentage), and consistency scores of different subword segmentation methods on original and perturbed test sets. We report means and standard deviations computed with bootstrap resampling (Koehn, 2004). Subword regularization makes NMT models more robust to input perturbations. (Case changing does not apply to Japanese input, so JA→EN case-changing cells are empty.)

| | Model | MTNT EN→JA | MTNT JA→EN | MTNT EN→FR | MTNT FR→EN | 4SQ FR→EN |
|---|---|---|---|---|---|---|
| baseline | BPE | 10.75±0.49 | 9.68±0.59 | 34.15±0.93 | 45.84±0.89 | 30.96±0.85 |
| | BPE-Dropout | 10.76±0.47 | 9.26±0.64 | 33.39±0.95 | 45.84±0.90 | 31.28±0.84 |
| | SentencePiece | 10.52±0.51 | 9.52±0.68 | 33.75±0.91 | 45.94±0.92 | 31.44±0.85 |
| fine-tuning | BPE | 14.88±0.52 | 10.47±0.69 | 35.11±0.95 | 46.49±0.90 | 34.83±0.86 |
| | BPE-Dropout | 15.26±0.53 | 11.13±0.68 | 34.80±0.93 | 46.88±0.88 | 34.72±0.84 |
| | SentencePiece | 14.68±0.53 | 11.19±0.72 | 34.71±0.93 | 46.89±0.90 | 34.59±0.86 |

Table 4: BLEU scores of different subword segmentation methods on two datasets with natural noise (MTNT test sets are mtnt2019). Subword regularization methods do not achieve consistent improvements over BPE, either with or without fine-tuning.

As shown in Table 3, there is no clear winner among the three subword segmentation models based on BLEU scores on the original WMT or KTJ test sets. This observation differs from the results reported by Kudo (2018) and Provilkov et al. (2019). One major difference from previous work is the size of the training data, which is much larger in our experiments; subword regularization is presumably most beneficial in low-resource settings.

However, both our proposed metrics (i.e., robustness and consistency) show clear trends of models' robustness to input perturbations across all languages we tested: BPE-Dropout > SentencePiece > BPE. This suggests that although we did not observe a significant impact of subword regularization on generic translation quality, the robustness of the models is indeed improved drastically.

Unfortunately, it is unclear whether subword regularization helps translate real-world noisy input, as shown in Table 4. MTNT and 4SQ contain several natural noise types, such as grammar errors and emojis, with misspelling as the dominant noise type for English and French. The training data we use may already cover common natural misspellings, perhaps contributing to the failure of regularization methods to improve over BPE in this case.

Robustness Versus Consistency

Variation in output is not necessarily in itself a marker of reduced translation quality, but empirically, consistency and robustness nearly always produce the same model rankings in Table 3. We conduct a more comprehensive analysis of the correlation between them, collecting additional data points by varying the noise level of both perturbations, i.e., by sweeping the word misspelling probability and the sentence case-changing probability over several values.

Figure 1: Robustness (in percentage) and consistency are highly correlated within each language pair. Correlation coefficients are marked in the legend.

As illustrated in Figure 1, consistency strongly correlates with robustness within each language pair (the sample Pearson's r is high for every pair; the coefficients are marked in the figure legend). This suggests that for this class of models, low consistency signals a drop in translation quality, and the consistency score can be used as a robustness proxy when reference translations are unavailable.
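For readers reproducing this analysis, the sample Pearson correlation between paired robustness and consistency scores can be computed with the standard library alone. The data points below are hypothetical placeholders, not values from the paper:

```python
from statistics import mean

def pearson_r(xs, ys):
    """Sample Pearson correlation between paired observations."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical (robustness, consistency) pairs for one language pair,
# collected by sweeping the noise level of a perturbation:
robust = [0.74, 0.80, 0.84, 0.88, 0.92]
consis = [0.53, 0.60, 0.66, 0.71, 0.78]
print(pearson_r(robust, consis))  # prints the sample correlation
```

One pair of (robustness, consistency) scores is obtained per noise level, and the correlation is computed within each language pair rather than pooled across pairs.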

Robustness Versus Noise Level

In this paper, robustness is defined given a fixed perturbation function and its noise level. We observe consistent model rankings across language pairs, but does this still hold if we vary the noise level?

To test this, we plot the robustness data points from the last section against the noise level. Focusing on the misspelling perturbation for EN→DE models, Figure 2 shows that varying the word misspelling probability does not change the ranking of the models, and the gap in the robustness measurement only widens with larger amounts of noise. This observation applies to all perturbations and language pairs we investigated.

Figure 2: Varying the synthetic word misspelling probability for EN→DE models does not change the model ranking w.r.t. robustness (in percentage).

5 Conclusion

We proposed two additional measures for NMT robustness, applicable when both original and noisy inputs are available: robustness, the relative degradation in translation quality under perturbation, and consistency, which quantifies variation in translation output irrespective of reference translations. We also tested two popular subword regularization techniques and their effect on overall performance and robustness. Our robustness metrics reveal a clear trend: models trained with subword regularization are much more robust to input perturbations than those using standard BPE. Furthermore, we identify a strong correlation between robustness and consistency in these models, indicating that consistency can be used to estimate robustness on data sets or domains lacking reference translations.

6 Acknowledgements

We thank the anonymous reviewers for their comments and suggestions.


References

  • Y. Belinkov and Y. Bisk (2018) Synthetic and natural noise both break neural machine translation. In 6th International Conference on Learning Representations (ICLR 2018), Vancouver, BC, Canada.
  • A. Berard, I. Calapodescu, M. Dymetman, C. Roux, J. Meunier, and V. Nikoulina (2019) Machine translation of restaurant reviews: new corpus for domain adaptation and robustness. In Proceedings of the 3rd Workshop on Neural Generation and Translation, Hong Kong, pp. 168–176.
  • O. Bojar, C. Federmann, M. Fishel, Y. Graham, B. Haddow, P. Koehn, and C. Monz (2018) Findings of the 2018 conference on machine translation (WMT18). In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, Brussels, Belgium, pp. 272–303.
  • Y. Cheng, L. Jiang, and W. Macherey (2019) Robust neural machine translation with doubly adversarial inputs. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 4324–4333.
  • Y. Cheng, Z. Tu, F. Meng, J. Zhai, and Y. Liu (2018) Towards robust neural machine translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 1756–1766.
  • J. Ebrahimi, D. Lowd, and D. Dou (2018) On adversarial examples for character-level neural machine translation. In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, New Mexico, USA, pp. 653–663.
  • I. Goodfellow, J. Shlens, and C. Szegedy (2015) Explaining and harnessing adversarial examples. In International Conference on Learning Representations.
  • F. Hieber, T. Domhan, M. Denkowski, D. Vilar, A. Sokolov, A. Clifton, and M. Post (2017) Sockeye: a toolkit for neural machine translation. CoRR abs/1712.05690.
  • V. Karpukhin, O. Levy, J. Eisenstein, and M. Ghazvininejad (2019) Training on synthetic noise improves robustness to natural noise in machine translation. In Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019), Hong Kong, China, pp. 42–47.
  • P. Koehn (2004) Statistical significance tests for machine translation evaluation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, Barcelona, Spain, pp. 388–395.
  • P. Koehn (2005) Europarl: a parallel corpus for statistical machine translation. In Proceedings of the Tenth Machine Translation Summit, Vol. 5, pp. 79–86.
  • T. Kudo (2018) Subword regularization: improving neural network translation models with multiple subword candidates. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 66–75.
  • X. Li, P. Michel, A. Anastasopoulos, Y. Belinkov, N. Durrani, O. Firat, P. Koehn, G. Neubig, J. Pino, and H. Sajjad (2019) Findings of the first shared task on machine translation robustness. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), Florence, Italy, pp. 91–102.
  • P. Lison and J. Tiedemann (2016) OpenSubtitles2016: extracting large parallel corpora from movie and TV subtitles. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), Portorož, Slovenia, pp. 923–929.
  • H. Liu, M. Ma, L. Huang, H. Xiong, and Z. He (2019) Robust neural machine translation with joint textual and phonetic embedding. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 3044–3049.
  • P. Michel, X. Li, G. Neubig, and J. Pino (2019) On evaluation of adversarial perturbations for sequence-to-sequence models. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 3103–3114.
  • P. Michel and G. Neubig (2018) MTNT: a testbed for machine translation of noisy text. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 543–553.
  • N. Papernot, P. D. McDaniel, I. J. Goodfellow, S. Jha, Z. B. Celik, and A. Swami (2016) Practical black-box attacks against deep learning systems using adversarial examples. CoRR abs/1602.02697.
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, pp. 311–318.
  • M. Post (2018) A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, Brussels, Belgium, pp. 186–191.
  • O. Press and L. Wolf (2017) Using the output embedding to improve language models. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, Valencia, Spain, pp. 157–163.
  • I. Provilkov, D. Emelianenko, and E. Voita (2019) BPE-dropout: simple and effective subword regularization. CoRR abs/1910.13267.
  • R. Sennrich, B. Haddow, and A. Birch (2016) Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, pp. 1715–1725.
  • M. Snover, B. Dorr, R. Schwartz, L. Micciulla, and J. Makhoul (2006) A study of translation edit rate with targeted human annotation. In Proceedings of the Association for Machine Translation in the Americas, Vol. 200.
  • M. Sperber, J. Niehues, and A. Waibel (2017) Toward robust neural machine translation for noisy input sequences. In Proceedings of the 14th International Workshop on Spoken Language Translation, Tokyo, Japan, pp. 1715–1725.
  • V. Vaibhav, S. Singh, C. Stewart, and G. Neubig (2019) Improving robustness of machine translation with synthetic noise. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 1916–1920.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems 30, pp. 5998–6008.
  • Q. Xie, Z. Dai, E. H. Hovy, M. Luong, and Q. V. Le (2019) Unsupervised data augmentation for consistency training. CoRR abs/1904.12848.